
Basic Statistics

STATONE
SY 2016-2017
RM Alecha

•"It is not how much
you do, but how
much love you put
in the doing."
Introduction to Business Statistics

Objectives
• Identify the position of a data value in a
data set, using various measures of
position such as percentiles, deciles, and
quartiles
Measures of Position (or Location
or Relative Standing)
• Are used to locate the relative position of a data value in a data set
• Can be used to compare data values from different data sets
• Can be used to compare data values within the same data set
• Can be used to help determine outliers within a data set
• Include z-(standard) scores, percentiles, quartiles, and deciles
Other Measures of Location
• Percentiles are measures of central location that divide the group of data into 100 parts. There are 99 percentiles, because it takes 99 dividers to separate a group of data into 100 parts.
• The nth percentile is the value such that at least n percent of the data are below that value and at most (100-n) percent are above that value. Specifically, the 87th percentile is a value such that at least 87% of the data are below the value and no more than 13% are above the value.
Steps in Locating the Location of
Percentile
1. Organize the numbers into an ascending-order array.
2. Calculate the percentile location (i) by
$$i = \frac{p \cdot n}{100}$$
where p = the percentile of interest
i = percentile location
n = number in the data set
3. Determine the location by either:
• If i is a whole number, the Pth percentile is the average of the value at the ith location and the value at the (i+1)st location.
• If i is not a whole number, the Pth percentile value is located at the whole number part of (i + 1).
Percentiles
• Divide the data set into 100 (“per cent”) equal groups
• Used to compare an individual data value with the national “norm”
• Symbolized by P1, P2, …
• Percentile rank indicates the percentage of data values that fall below the specified value
To find the percentile rank for a given data value, x

$$\text{Percentile Rank} = \frac{(\text{number of data values below the given data point}) + 0.5}{\text{total number of values}} \times 100\%$$
Examples
American College Test (ACT) Scores attained by 25 members of a local high
school graduating class (Data is ranked)

14 16 17 17 17

18 19 19 19 19

20 20 20 21 21

21 23 23 24 25

25 25 28 28 31

1) Thad scored 22 on the ACT. What is his percentile rank?

2) Ansley scored 20 on the ACT. What is her percentile rank?
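
The percentile rank formula can be checked against the two questions above with a short Python sketch. The function name `percentile_rank` is illustrative only and is not part of the original slides; the data are the 25 ranked ACT scores from the example.

```python
# ACT scores from the example above (already ranked).
act_scores = [14, 16, 17, 17, 17, 18, 19, 19, 19, 19,
              20, 20, 20, 21, 21, 21, 23, 23, 24, 25,
              25, 25, 28, 28, 31]

def percentile_rank(data, x):
    """Percentile rank of x: (number of values below x + 0.5) / n * 100."""
    below = sum(1 for value in data if value < x)
    return (below + 0.5) / len(data) * 100

thad = percentile_rank(act_scores, 22)    # 16 scores lie below 22
ansley = percentile_rank(act_scores, 20)  # 10 scores lie below 20
print(round(thad), round(ansley))         # prints: 66 42
```

Under this formula Thad is at about the 66th percentile and Ansley at about the 42nd.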


Finding a Data Value Corresponding to a Given Percentile
• Step 1: Arrange data in order from lowest to highest
• Step 2: Substitute into the formula
$$c = \frac{n \cdot p}{100}$$
where n is total number of values and p is given percentile
• Step 3: Consider result from Step 2
  – If c is NOT a whole number, round up to the next whole number. Starting at the lowest value, count over to the number that corresponds to the rounded up value
  – If c is a whole number, use the value halfway between the cth and (c+1)st value when counting up from the lowest value
Examples
American College Test (ACT) Scores attained by 25 members of a local high
school graduating class (Data is ranked)

14 16 17 17 17

18 19 19 19 19

20 20 20 21 21

21 23 23 24 25

25 25 28 28 31

To be in the 90th percentile, what would you have to score on the ACT?

Find P85
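
A minimal Python sketch of the three-step rule (c = n·p/100, round up if c is not whole, otherwise average the cth and (c+1)st values), applied to the two questions above. The function name `percentile_value` is an assumption made for illustration, and the sketch only handles percentiles strictly between 0 and 100.

```python
import math

# ACT scores from the example above (already ranked).
act_scores = [14, 16, 17, 17, 17, 18, 19, 19, 19, 19,
              20, 20, 20, 21, 21, 21, 23, 23, 24, 25,
              25, 25, 28, 28, 31]

def percentile_value(sorted_data, p):
    """Data value at the p-th percentile, using the rule c = n * p / 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    n = len(sorted_data)
    c = n * p / 100
    if c.is_integer():
        c = int(c)
        # Whole number: halfway between the c-th and (c+1)-st values (1-based).
        return (sorted_data[c - 1] + sorted_data[c]) / 2
    # Not a whole number: round up and count over from the lowest value.
    return sorted_data[math.ceil(c) - 1]

p90 = percentile_value(act_scores, 90)  # c = 22.5, so use the 23rd value
p85 = percentile_value(act_scores, 85)  # c = 21.25, so use the 22nd value
print(p90, p85)                         # prints: 28 25
```

With this rule, a score of 28 is needed to be in the 90th percentile, and P85 = 25.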
Quartiles
• Same concept as percentiles, except the data
set is divided into four groups (quarters)
• Quartile rank indicates the percentage of
data values that fall below the specified rank
• Symbolized by Q1 , Q2 , Q3
• Equivalencies with Percentiles:
– Q1 = P25
– Q2 = P50 = Median of data set
– Q3 = P75
Minitab calculates these for you.
Q1 (First Quartile) separates the bottom 25% of sorted values from the top 75%.

Q2 (Second Quartile) same as the median; separates the bottom 50% of sorted values from the top 50%.

Q3 (Third Quartile) separates the bottom 75% of sorted values from the top 25%.
QUARTILES
divide ranked scores into four equal parts
$$IQR = Q_3 - Q_1$$
$$SIR = \frac{Q_3 - Q_1}{2}$$
$$\text{Midquartile} = \frac{Q_3 + Q_1}{2}$$
Deciles
• Same concept as percentiles, except
divides data set into 10 groups
• Symbolized by D1 , D2 , D3 , … D10
• Equivalencies with percentiles
– D1 = P10 D2 = P20 ……..
– D5 = P50 =Q2 =Median of Data Set
Outliers
• Outlier is an extremely high or an
extremely low data value when compared
with the rest of the data values
• A data set should be checked for
“outliers” since “outliers” can influence
the measures of central tendency and
variation (mean and standard deviation)
Identifying Outliers
• Interquartile Range (IQR)
– Q3 − Q1

• Identifying Outliers
– Is the data point between Q1 − 1.5·IQR and Q3 + 1.5·IQR? If not, it is considered an outlier.


Examples
American College Test (ACT) Scores attained by 25 members of a local high
school graduating class (Data is ranked)

14 16 17 17 17

18 19 19 19 19

20 20 20 21 21

21 23 23 24 25

25 25 28 28 31

1) Emily scored 11 on the ACT. Would her score be considered an outlier?
2) Danielle scored 38 on the ACT. Would her score be considered an outlier?
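
The outlier check for the ACT data can be sketched in Python as follows. This is only one possible reading: it assumes Q1 = P25 and Q3 = P75 are located with the same percentile rule given earlier in these slides (other software, Minitab included, may compute quartiles slightly differently), and the helper names are hypothetical.

```python
import math

# ACT scores from the example (already ranked).
act_scores = [14, 16, 17, 17, 17, 18, 19, 19, 19, 19,
              20, 20, 20, 21, 21, 21, 23, 23, 24, 25,
              25, 25, 28, 28, 31]

def percentile_value(sorted_data, p):
    """Percentile by the slide rule: c = n*p/100, round up or average."""
    c = len(sorted_data) * p / 100
    if c.is_integer():
        c = int(c)
        return (sorted_data[c - 1] + sorted_data[c]) / 2
    return sorted_data[math.ceil(c) - 1]

q1 = percentile_value(act_scores, 25)   # 7th value
q3 = percentile_value(act_scores, 75)   # 19th value
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

def is_outlier(x):
    """True if x lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    return x < lower_fence or x > upper_fence

print(q1, q3, iqr, lower_fence, upper_fence)  # prints: 19 24 5 11.5 31.5
print(is_outlier(11), is_outlier(38))         # prints: True True
```

Under these assumptions both Emily's 11 and Danielle's 38 fall outside the interval from 11.5 to 31.5, so both would be flagged as outliers.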
Why Do Outliers Occur?
• Data value may have resulted from a measurement or observational error
• Data value may have resulted from a recording error
• Data value may have been obtained from a subject that is not in the defined population
• Data value might be a legitimate value that occurred by chance (although the probability is extremely small)
Important Characteristics of Data

1. Center: A representative or average value that indicates where the middle of the data set is located

2. Variation: A measure of the amount that the values vary among themselves

3. Distribution: The nature or shape of the distribution of data (such as bell-shaped, uniform, or skewed)

4. Outliers: Sample values that lie very far away from the vast majority of other sample values

5. Time: Changing characteristics of the data over time

Copyright © 2004 Pearson Education, Inc.


Definitions
• In addition to frequency distributions, we use two other types of statistics to describe the distribution of a variable:

– Measures of central tendency are statistics that summarize a distribution of scores by reporting the most typical or representative value of the distribution.

– Measures of variability are statistics that indicate the amount of variety or heterogeneity in a distribution of scores.
SW318
Social Work Statistics
Slide 23
Measures of Variation

Created by Tom Wegleitner, Centreville, Virginia



Measures of variability describe the spread or the dispersion of a set of data.
Measures of variation reflect the average distance of each observation from the center of the distribution. These measures tell how homogeneous or how heterogeneous the observations (scores, elements or entries) are in a particular distribution or set of data.
STATISTICAL TOOLS IN
EVALUATION

DESCRIPTIVE VALUES

MEASURES OF VARIABILITY
MEASURES OF CENTRAL TENDENCY
• WHEN THE GRAPH OF THE SCORES IS A NORMAL
CURVE, THE MODE, MEDIAN, AND MEAN ARE EQUAL
• THE MEAN IS THE MOST COMMON MEASURE OF
CENTRAL TENDENCY
• WHEN THE SCORES ARE QUITE SKEWED OR THE
DATA IS ORDINAL LACKING A COMMON INTERVAL,
THE MEDIAN IS A BETTER MEASURE OF CENTRAL
TENDENCY
• THE MODE IS USED ONLY WHEN THE MEAN OR
MEDIAN CANNOT BE CALCULATED (E.G., NOMINAL
DATA) OR WHEN THE ONLY INFORMATION WANTED
IS THE MOST FREQUENT SCORE (E.G., MOST UNIFORM
SIZE OR INJURY SITE)
MEASURES OF VARIABILITY

• DESCRIBE THE SET OF SCORES IN TERMS OF THEIR SPREAD, OR HETEROGENEITY
• CONSIDER TWO GROUPS OF SCORES
GROUP 1 = 9, 5, 1; GROUP 2 = 5, 6, 4
• BOTH HAVE A MEAN AND MEDIAN OF 5 BUT GROUP 2 HAS MUCH MORE HOMOGENEOUS OR SIMILAR SCORES THAN GROUP 1
MEASURES OF VARIABILITY

• RANGE

• STANDARD DEVIATION

• VARIANCE
RANGE
• EASIEST MEASURE OF
VARIABILITY TO CALCULATE
• USED WHEN THE MEASURE OF
CENTRAL TENDENCY IS THE MODE
(NOMINAL DATA OR WHEN THE
MOST FREQUENT SCORE IS OF
INTEREST) OR MEDIAN (ORDINAL
DATA OR SKEWED DATA)
• SIMPLY THE DIFFERENCE
BETWEEN THE HIGHEST AND
LOWEST SCORES
WHAT IS THE RANGE IN THE SET OF
SCORES BELOW?
• SET OF SCORES:
7, 2, 7, 6, 5, 6, 2

RANGE = HIGHEST SCORE MINUS LOWEST SCORE = 7 - 2 = 5
STANDARD DEVIATION (S)
• MEASURE OF VARIABILITY USED WITH
THE MEAN (NORMALLY DISTRIBUTED
INTERVAL OR RATIO DATA)
• INDICATES THE AMOUNT THAT ALL
SCORES DIFFER OR DEVIATE FROM THE
MEAN
• THE MORE THE SCORES DIFFER FROM
THE MEAN, THE HIGHER THE
STANDARD DEVIATION (S)
• SUM OF THE DEVIATIONS OF SCORES
FROM THE MEAN IS ALWAYS 0
DEFINITIONAL FORMULA FOR
STANDARD DEVIATION
• FORMULA 2.2 SHOULD BE USED IF
THE GROUP TESTED IS VIEWED AS A
REPRESENTATIVE PART OF THE
POPULATION; CONSIDERED THEN A
SAMPLE
• STANDARD DEVIATION CALCULATED
ON THE SAMPLE IS USED AS AN
ESTIMATE OF THE POPULATION
STANDARD DEVIATION (E.G.,
CALCULATION OF THE STANDARD
DEVIATION OF THE PERCENT BODY
FAT OF COLLEGE RUNNERS THAT IS
USED AS AN ESTIMATION OF THE
STANDARD DEVIATION OF ALL
COLLEGE RUNNERS)
• X = SCORES
• BAR X = MEAN OF SCORES
• N = NUMBER OF SCORES
• MANY CALCULATORS AND MOST
COMPUTER PROGRAMS USE THIS
FORMULA
SAMPLE CALCULATION OF THE STANDARD
DEVIATION USING FORMULA 2.1 AND 2.2 AND
THE FOLLOWING TESTS SCORES: 7, 2, 7, 6, 5, 6, 2
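
The formulas themselves did not survive in this copy of the slides, so the following Python sketch rests on an assumption: that the definitional formulas divide the sum of squared deviations from the mean by N when the group is treated as the population, and by N - 1 when it is treated as a sample (as the text describes for Formula 2.2). The function name `std_dev_definitional` is illustrative.

```python
import math

scores = [7, 2, 7, 6, 5, 6, 2]

def std_dev_definitional(xs, sample=False):
    """Square root of the mean squared deviation from the mean.

    Divides by N for a population and by N - 1 for a sample.
    """
    n = len(xs)
    mean = sum(xs) / n                              # 35 / 7 = 5
    sum_sq_dev = sum((x - mean) ** 2 for x in xs)   # 4+9+4+1+0+1+9 = 28
    return math.sqrt(sum_sq_dev / (n - 1 if sample else n))

population_sd = std_dev_definitional(scores)             # sqrt(28 / 7)
sample_sd = std_dev_definitional(scores, sample=True)    # sqrt(28 / 6)
print(round(population_sd, 2), round(sample_sd, 2))      # prints: 2.0 2.16
```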
CALCULATIONAL FORMULA FOR
STANDARD DEVIATION
• FORMULA 2.3 SHOULD BE
USED IF THE GROUP TESTED
IS VIEWED AS THE GROUP OF
INTEREST; CONSIDERED
THEN THE POPULATION (E.G.,
CALCULATING STANDARD
DEVIATION OF THE 50-M
SWIM TIMES AT A SWIM
MEET )

• X = SCORES
• N = NUMBER OF SCORES
• FORMULA TYPICALLY USED
FOR HAND CALCULATION
CALCULATIONAL FORMULA FOR
STANDARD DEVIATION
• FORMULA 2.4 SHOULD BE USED IF
THE GROUP TESTED IS VIEWED AS A
REPRESENTATIVE PART OF THE
POPULATION; CONSIDERED THEN A
SAMPLE
• STANDARD DEVIATION CALCULATED
ON THE SAMPLE IS USED AS AN
ESTIMATE OF THE POPULATION
STANDARD DEVIATION (E.G.,
CALCULATION OF THE STANDARD
DEVIATION OF THE 40-YARD TIME OF
COLLEGE WIDE RECEIVERS THAT IS
USED AS AN ESTIMATION OF THE
STANDARD DEVIATION OF ALL
COLLEGE WIDE RECEIVERS)
• X = SCORES
• N = NUMBER OF SCORES
• FORMULA TYPICALLY USED FOR
HAND CALCULATION
SAMPLE CALCULATION OF THE STANDARD
DEVIATION USING FORMULA 2.3 AND 2.4 AND THE
FOLLOWING TESTS SCORES: 7, 2, 7, 6, 5, 6, 2
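
Formulas 2.3 and 2.4 are likewise missing from this copy. The sketch below assumes they are the usual raw-score (calculational) forms, which need only N, the sum of the scores, and the sum of the squared scores; the function name is hypothetical. The result is compared with Python's `statistics` module as a cross-check.

```python
import math
import statistics

scores = [7, 2, 7, 6, 5, 6, 2]

def std_dev_raw_score(xs, sample=False):
    """Standard deviation from N, sum(X) and sum(X^2) only."""
    n = len(xs)
    sum_x = sum(xs)                        # 35
    sum_x2 = sum(x * x for x in xs)        # 203
    sum_sq_dev = sum_x2 - sum_x ** 2 / n   # 203 - 1225/7 = 28
    return math.sqrt(sum_sq_dev / (n - 1 if sample else n))

population_sd = std_dev_raw_score(scores)
sample_sd = std_dev_raw_score(scores, sample=True)

# Cross-check against the standard library.
print(math.isclose(population_sd, statistics.pstdev(scores)))
print(math.isclose(sample_sd, statistics.stdev(scores)))
```

Both forms give the same values as the definitional approach; the raw-score version simply avoids computing each deviation separately, which is why it suits hand calculation.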
VARIANCE
• USEFUL STATISTIC IN CERTAIN
HIGH LEVEL STATISTICAL
PROCEDURES LIKE REGRESSION
ANALYSIS AND ANALYSIS OF
VARIANCE (ANOVA)
• CALCULATED BY SQUARING THE
STANDARD DEVIATION (S²)
• STANDARD DEVIATION = S = 4
• VARIANCE = S² = 4² = 16
The Standard Deviation
• The standard deviation measures the deviations
between the mean of the distribution and each of the
individual scores.

• The standard deviation is the preferred measure of variability for interval level variables, unless the distribution is badly skewed. For a badly skewed distribution, the range is a preferred measure of variability.

• The standard deviation is not an appropriate statistic for ordinal, nominal, or dichotomous variables.
Computing a Standard Deviation
When we calculated the mean for the credit card problem, we computed the deviations from the mean as a measure of error.

Using the mean of 2.4 as our guess:
2 – 2.4 = -0.4, (-0.4)² = 0.16
1 – 2.4 = -1.4, (-1.4)² = 1.96
3 – 2.4 = 0.6, (+0.6)² = 0.36
2 – 2.4 = -0.4, (-0.4)² = 0.16
4 – 2.4 = 1.6, (+1.6)² = 2.56
Total = 5.20

The total error is 5.20 units of error.

The standard deviation is computed by dividing the measure of total error by the number of cases and taking the square root of that result.

The standard deviation is the square root of (5.20 ÷ 5), or 1.02.
Interpreting the Standard Deviation
• The standard deviation does not have any inherent or intuitive meaning; it is a statistical measure of the variability of cases around the mean for an interval level variable.

• Standard deviation is commonly presented in terms of the proportion of cases that fall between the mean plus/minus 1, 2, or 3 standard deviation measures.

• The standard deviation is most useful in comparing variability among groups for interval level variables.
Picturing the Standard Deviation
• The larger the standard deviation, the more spread out the distribution of cases.

• The number of scores near the mean will be fewer.

• The range of scores will be larger, with more cases having larger deviations from the mean.

[Figure: four frequency histograms on a common horizontal axis from -35 to 45. Panels: Mean = 10.01, Standard Deviation = 2.23; Mean = 10.9, Standard Deviation = 4.57; Mean = 10.3, Standard Deviation = 8.01; Mean = 11.3, Standard Deviation = 14.51]
Variability: the Variance
• The variance is another measure of variability that is equal to the square of the standard deviation. The variance is the average of the squared deviations from the mean.

• In describing distributions, the standard deviation is the more commonly cited statistic. Variance is used primarily in inferential statistics such as the analysis of variance and correlation, which we will study later in this course.
The standard deviation of a set of
sample values is a measure of
variation of values about the
mean.
Sample Standard Deviation Formula (Ungrouped Data)

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

Sample Standard Deviation (Shortcut Formula)

$$s = \sqrt{\frac{n\sum x^2 - \left(\sum x\right)^2}{n(n - 1)}}$$

Sample Standard Deviation Formula (Grouped Data)

$$s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}}$$

Sample Standard Deviation Shortcut Formula (Grouped Data)

$$s = \sqrt{\frac{n\sum f x^2 - \left(\sum f x\right)^2}{n(n - 1)}}$$
The variance of a set of
values is a measure of
variation equal to the square
of the standard deviation.
Sample variance:
Square of the sample
standard deviation s
Variability: the Range
• The range is the highest score minus the lowest
score in a sorted distribution.

• The range is the preferred measure of variability for ordinal level variables, and for interval level variables that have a badly skewed distribution.

• The range can be computed for interval level variables, but is not an appropriate statistic for nominal or dichotomous variables.

The Range

Range = highest score - lowest score

• Range – A measure of variation in interval-ratio variables. It is the difference between the highest (maximum) and the lowest (minimum) scores in the distribution.
Inter-Quartile Range
• Inter-Quartile Range (IQR) – A measure of
variation for interval-ratio data. It
indicates the width of the middle 50
percent of the distribution and is defined
as the difference between the lower and
upper quartiles (Q1 and Q3.)
• IQR = Q3 – Q1
The difference between the Range
and IQR

[Figure: two distributions whose ranges are equal. In one, the values fall closely together; the other shows greater variability. The comparison illustrates the importance of the IQR.]
The Box Plot
• The Box Plot is a graphic device that visually presents the following elements: the range, the IQR, the median, the quartiles, the minimum (lowest value), and the maximum (highest value).

[Figure: box plot labeled, from top to bottom, Maximum, Q3, Median, Q1, Minimum, with the Range and the IQR marked]
Find the Mean and the
Standard Deviation
Computing a Range

The range is computed by sorting the scores in the distribution and subtracting the lowest score from the highest score.

Using the data from the credit card problem, we would sort the
five scores (2, 1, 2, 3, and 4) as shown below, and compute the
range by subtracting 1 from 4.

1 2 2 3 4
Range = 3.0

Interpreting the Range
• The range is usually described as the total spread in the
distribution.

• The range is based only on two scores in the distribution, the highest and the lowest, and it tells us nothing about the distribution of the majority of scores in between.

• The range is most useful when we are comparing groups and can describe one group as having a larger or smaller range, or spread, than the other groups.
Picturing the Range
• The median and range do not precisely define a distribution. The three charts below have the same median and range, but very different distributions of cases.

[Figure: three frequency histograms, each with Median = 8 and Range = 10, over values from 3.5 to 14.5]

• The presence of cases with either larger or smaller values will change the range for the distribution, adding additional bars to the chart.
• The impact on the overall shape of the distribution will be as varied as the three charts above. Even though they had the same statistical values for central tendency and dispersion, each had a very different pattern of scores in the distribution.
Variability: the Interquartile Range
• Since many variables contain one or more extremely
large or extremely small scores, the range may be
misleading.

• This problem is avoided with the interquartile range, which is the difference between the third quartile and the first quartile. The third quartile is the value below which 75% of the cases fall. The first quartile is the value below which 25% of the cases fall.

• While less subject to the influence of extreme cases than the range, the interquartile range still uses information for only two cases or values in the distribution.
Variability: Index of Qualitative
Variation (IQV)
• The IQV is a measure of variability for nominal level and
dichotomous variables. Since the mode is the preferred
measure of central tendency for a nominal variable, the
measure of dispersion for a nominal variable would indicate
the degree to which cases fall in the non-modal categories.

• The IQV is the preferred measure of variability for a nominal level variable, including dichotomous variables.

• The IQV can be computed for ordinal level variables and for
interval level variables that have been grouped in a
frequency distribution.

Computing an IQV
• If all of the cases in a distribution fall in one category, that category would be the modal category and there would be no dispersion. In this case, the IQV would be 0%.

• If all of the categories in a distribution have the same frequency, i.e. the distribution was multimodal, there would be maximum dispersion, and the IQV would be 100%.

• The IQV ranges from 0% to 100%, where low percentages represent low dispersion, and high percentages represent high dispersion.
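
The slides do not state the IQV formula. A commonly used form, assumed here, is IQV = K(100² − ΣPct²) / (100²(K − 1)), where K is the number of categories and Pct is the percentage of cases in each category. The Python sketch below uses that assumed formula (the function name `iqv` is illustrative); it reproduces the percentages shown on the later "Picturing the Index of Qualitative Variation" slides for splits such as 90/10 and 50/25/25.

```python
def iqv(counts):
    """Index of Qualitative Variation, as a percentage from 0 to 100.

    counts holds the number of cases in each category (zeros allowed).
    Assumed formula: K * (100^2 - sum(pct^2)) / (100^2 * (K - 1)).
    """
    k = len(counts)
    total = sum(counts)
    if k < 2 or total == 0:
        return 0.0  # dispersion is undefined or trivially zero
    sum_sq_pct = sum((100 * c / total) ** 2 for c in counts)
    return k * (100 ** 2 - sum_sq_pct) / (100 ** 2 * (k - 1)) * 100

print(round(iqv([100, 0]), 2))     # all cases in one category: 0.0
print(round(iqv([90, 10]), 2))     # 36.0
print(round(iqv([50, 50]), 2))     # evenly divided: 100.0
print(round(iqv([50, 25, 25]), 2)) # 93.75
```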
Interpreting the IQV
• Low values of the IQV (approaching 0%) indicate
low dispersion, or little difference from the modal
category.

• High values for the IQV (approaching 100%) represent high dispersion, or large numbers of cases not in the modal category.

• The IQV is most useful when comparing dispersion among groups on a nominal level variable.
Picturing the Index of Qualitative
Variation - 1
• The IQV increases as the number of cases in the modal category decreases.

• If a variable has two categories and the number of cases in each is 50, the IQV is 100%.

[Figure: five bar charts of percent by category for a two-category variable, with IQV = 36.00%, IQV = 64.00%, IQV = 84.00%, IQV = 96.00%, and IQV = 100.00%]
Picturing the Index of Qualitative
Variation - 2
• If the variable has three categories and the 50 cases not in the modal category are divided among the non-modal categories, the IQV decreases, i.e. there is less dispersion.

• As the division of cases among the non-modal categories becomes more evenly divided, the IQV increases, indicating greater dispersion.

[Figure: five bar charts of percent by category for a three-category variable, with IQV = 87.00%, IQV = 93.00%, IQV = 93.75%, IQV = 93.00%, and IQV = 87.00%]
Picturing the Index of Qualitative
Variation - 3
• On this slide, we keep the number of cases in the modal category the same for all four charts, but change the number of categories in the distributions.

• The IQV decreases as the number of categories in the distribution increases.

[Figure: four bar charts of percent by category: three categories, IQV = 99.00%; four categories, IQV = 96.00%; five categories, IQV = 93.75%; six categories, IQV = 92.16%]
Index of Qualitative Variation
• In summary, IQV is affected both by the division of cases between the modal and non-modal categories, and by the number of categories for the variable.

• The IQV is at its maximum value (100%) when the number of cases in the sample is evenly divided among the categories of the variable.

• Variables with more categories will have lower IQVs.

• A more even division of cases among categories will increase the value of the IQV.
Measures of Variability

• The Importance of Measuring Variability


• The Range
• IQR (Inter-Quartile Range)
• Variance
• Standard Deviation
• Considerations for choosing a measure of
variation
The Importance of
Measuring Variability
• Central tendency - Numbers that describe
what is typical or average (central) in a
distribution
• Measures of Variability - Numbers that
describe diversity or variability in the
distribution.
These two types of measures together help us to
sum up a distribution of scores without looking at
each and every score. Measures of central tendency
tell you about typical (or central) scores. Measures
of variation reveal how far from the typical or central
Notice that both distributions have the same
mean, yet they are shaped differently
Considerations for Choosing a
Measure of Variability

• For nominal variables, you can only use the IQV (Index of Qualitative Variation).
• For ordinal variables, you can calculate the IQV or the IQR (Inter-Quartile Range). The IQR, however, provides more information about the variable.
• For interval-ratio variables, you can use IQV, IQR, or variance/standard deviation. The standard deviation (also variance) provides the most information, since it uses all of the values in the distribution in its calculation.
MAD (Mean Absolute Deviation)
• Is the measure of the average of the absolute deviations from the mean of all observations in a given data set.
• For ungrouped data
$$MAD = \frac{\sum |x - \bar{x}|}{n}$$
where x – particular score/entry
$\bar{x}$ – mean
n – total frequency
• For grouped data
$$MAD = \frac{\sum f\,|x - \bar{x}|}{n}$$
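
A short Python sketch of both MAD formulas. The function names and the small data set are chosen for illustration only (the scores 7, 2, 7, 6, 5, 6, 2 are reused from the earlier standard deviation slides); the grouped version is fed the same scores as values with frequencies, so the two results should agree.

```python
def mad_ungrouped(xs):
    """Mean absolute deviation: sum(|x - mean|) / n."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(abs(x - mean) for x in xs) / n

def mad_grouped(values, freqs):
    """Grouped form: sum(f * |x - mean|) / n, where n = sum(f)."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * abs(x - mean) for x, f in zip(values, freqs)) / n

scores = [7, 2, 7, 6, 5, 6, 2]           # mean = 5
mad_a = mad_ungrouped(scores)            # (2+3+2+1+0+1+3) / 7 = 12/7
mad_b = mad_grouped([2, 5, 6, 7], [2, 1, 2, 2])  # same scores, tallied
print(round(mad_a, 3), round(mad_b, 3))  # prints: 1.714 1.714
```

For real grouped data, `values` would hold the class marks (class midpoints), which is an approximation of the individual scores.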
Application
1. The following are the response times in seconds
of a smoke alarm after the release of smoke
from a fixed source:
12 9 11 7 9 14 6 10

Find the range and the standard deviation.

Observe the dispersion of the data about the mean. Are the values scattered or bunched closely around the mean?
2. The following are the closing prices for two
stocks on 5 consecutive Fridays:
Stock A 15.3 15.5 16.4 15.45 14.9
Stock B 22 10.9 6.78 8.96 17.4

Calculate the standard deviation and compare the closing prices of the two stocks. Which of them would you prefer? Why?
Identify
1. The difference of the largest and smallest
values
2. The square of the standard deviation
3. The average of the sample data values
4. The Greek symbol used for population mean
5. To differentiate sample standard deviation
from population standard deviation, the
number of observations is subtracted by __
Write the formula for each of the following:
6. SIR for ungrouped data
7. Standard deviation for grouped data
8. Range for ungrouped data
9. Mean for ungrouped data
10. Quartile 3
11. Interquartile range (IQR)
12. midquartile
13. mean for grouped data
14. mode for grouped data
15. variance for ungrouped data
Answers
1. Range
2. Variance
3. Sample mean
4. mu ($\mu$)
5. 1 (one)
6. $SIR = \frac{Q_3 - Q_1}{2}$
7. $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
8. R = HS - LS
9. $\bar{x} = \frac{\sum x}{n}$
10. $Q_3 = \frac{3N}{4}$
11. $IQR = Q_3 - Q_1$
12. $\text{midquartile} = \frac{Q_3 + Q_1}{2}$
Quiz 1 (Measures of Variation)
1. For the measurements
2 5 9 10 15 19
Compute:
a) the Range
b) MAD
c) standard deviation
d) Q3 and Q1
e) SIR
f) midquartile

2.
weights (in kilos)    no. of Students
52-54                 2
55-57                 3
58-60                 4
61-63                 6
64-66                 5
67-69                 3
70-72                 2

Find standard deviation, range and skewness
Quiz
Use this data set:
10 20 30 40 50
Find a) standard deviation
b) range
c) Q3 and Q1
d) SIR
e) midquartile
f) MAD

For 108 randomly selected students, this exam score frequency distribution table was obtained:
Class Limits    Frequency
90-98           6
99-107          22
108-116         43
117-125         28
126-134         9
Find a) standard deviation
b) range
c) skewness
2.3. Measures of Dispersion (Variation):
The variation or dispersion in a set of values refers to how spread
out the values are from each other.

· The variation is small when the values are close together.
· There is no variation if the values are the same.

[Figure: pairs of distributions with the same center, one with smaller variation and one with larger variation]
Some measures of dispersion:
Range – Variance – Standard deviation
Coefficient of variation

Range:
Range is the difference between the largest (Max) and smallest (Min) values.
Range = Max − Min
Example:
Find the range for the sample values: 26, 25, 35, 27, 29, 29.

Solution:
Range = 35 − 25 = 10 (unit)

Note:
The range is a weak measure of variation since it takes into account only two of the values.
Variance:

The variance is a measure that uses the mean as a point of reference


· The variance is small when all values are close to the mean.
· The variance is large when all values are spread out from the mean

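Squared deviations from the mean:

[Figure: values $X_1, X_2, \ldots, X_n$ on a number line around the mean $\bar{x}$, with squared deviations $(X_1 - \bar{x})^2, (X_2 - \bar{x})^2, \ldots, (X_n - \bar{x})^2$]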


(1) Population variance:
Let $X_1, X_2, \ldots, X_N$ be the population values.
The population variance $\sigma^2$ is defined by:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{(X_1 - \mu)^2 + (X_2 - \mu)^2 + \cdots + (X_N - \mu)^2}{N} \quad (\text{unit})^2$$

where $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$ is the population mean.
Notes:
· $\sigma^2$ is a parameter because it is obtained from the population values (it is unknown in general).
· $\sigma^2 \ge 0$

(2) Sample Variance:

Let $x_1, x_2, \ldots, x_n$ be the sample values.
The sample variance $S^2$ is defined by:

$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} \quad (\text{unit})^2$$

where $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ is the sample mean.
Notes:
· $S^2$ is a statistic because it is obtained from the sample values (it is known).
· $S^2$ is used to approximate (estimate) $\sigma^2$.
· $S^2 \ge 0$

Example:
We want to compute the sample variance of the following sample
values: 10, 21, 33, 53, 54.
Solution:
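n = 5

$$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{10 + 21 + 33 + 53 + 54}{5} = \frac{171}{5} = 34.2 \ (\text{unit})$$

$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{5} (x_i - 34.2)^2}{5 - 1}$$

$$S^2 = \frac{(10 - 34.2)^2 + (21 - 34.2)^2 + (33 - 34.2)^2 + (53 - 34.2)^2 + (54 - 34.2)^2}{4} = \frac{1506.8}{4} = 376.7 \ (\text{unit})^2$$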
Another method:

x_i      x_i − x̄ = x_i − 34.2     (x_i − x̄)² = (x_i − 34.2)²
10       -24.2                     585.64
21       -13.2                     174.24
33       -1.2                      1.44
53       18.8                      353.44
54       19.8                      392.04
Σ = 171  Σ = 0                     Σ = 1506.8

$$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{171}{5} = 34.2, \qquad S^2 = \frac{1506.8}{4} = 376.7$$

Calculating Formula for S²:

$$S^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$

* Simple
* More accurate
Note:
To calculate S² we need:
· n = sample size
· $\sum x_i$ = the sum of the values
· $\sum x_i^2$ = the sum of the squared values
For the above example:

x_i      10    21    33     53     54     Σx_i = 171
x_i²     100   441   1089   2809   2916   Σx_i² = 7355

$$S^2 = \frac{7355 - 5(34.2)^2}{5 - 1} = \frac{1506.8}{4} = 376.7 \ (\text{unit})^2$$
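
Both routes to the sample variance can be checked numerically. The Python sketch below (function names are illustrative, not from the slides) applies the defining formula and the calculating formula to the sample 10, 21, 33, 53, 54.

```python
values = [10, 21, 33, 53, 54]

def sample_variance_definition(xs):
    """S^2 = sum((x_i - mean)^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """S^2 = (sum(x_i^2) - n * mean^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n                   # 171 / 5 = 34.2
    sum_sq = sum(x * x for x in xs)      # 7355
    return (sum_sq - n * mean ** 2) / (n - 1)

s2_def = sample_variance_definition(values)
s2_short = sample_variance_shortcut(values)
print(round(s2_def, 1), round(s2_short, 1))  # prints: 376.7 376.7
```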
Standard Deviation:
· The standard deviation is another measure of variation.
· It is the square root of the variance.
(1) Population standard deviation is: $\sigma = \sqrt{\sigma^2}$ (unit)
(2) Sample standard deviation is: $S = \sqrt{S^2}$ (unit)
Example:
For the previous example, the sample standard deviation is

$$S = \sqrt{S^2} = \sqrt{376.7} = 19.41 \ (\text{unit})$$

Coefficient of Variation (C.V.):

· The variance and the standard deviation are useful as measures of variation of the values of a single variable for a single population (or sample).
· If we want to compare the variation of two variables we cannot use the variance or the standard deviation because:
1. The variables might have different units.
2. The variables might have different means.
· We need a measure of the relative variation that will not depend on either the units or on how large the values are. This measure is the coefficient of variation (C.V.), which is defined by:

$$C.V. = \frac{S}{\bar{x}} \times 100\% \quad \text{(unit-free)}$$

                Mean          St.dev.    C.V.
1st data set    $\bar{x}_1$   $S_1$      $C.V_1 = \frac{S_1}{\bar{x}_1} \times 100\%$
2nd data set    $\bar{x}_2$   $S_2$      $C.V_2 = \frac{S_2}{\bar{x}_2} \times 100\%$

· The relative variability in the 1st data set is larger than the relative variability in the 2nd data set if C.V1 > C.V2 (and vice versa).
Example:
1st data set: $\bar{x}_1 = 66$ kg, $S_1 = 4.5$ kg
$$C.V_1 = \frac{4.5}{66} \times 100\% = 6.8\%$$
2nd data set: $\bar{x}_2 = 36$ kg, $S_2 = 4.5$ kg
$$C.V_2 = \frac{4.5}{36} \times 100\% = 12.5\%$$

Since C.V1 < C.V2, the relative variability in the 2nd data set is larger than the relative variability in the 1st data set.
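
The comparison can be reproduced with a few lines of Python. The helper name `coefficient_of_variation` is illustrative; it takes the mean and standard deviation as given in the example.

```python
def coefficient_of_variation(mean, std_dev):
    """C.V. = S / mean * 100 (a percentage, free of units)."""
    return std_dev / mean * 100

cv1 = coefficient_of_variation(66, 4.5)   # 1st data set
cv2 = coefficient_of_variation(36, 4.5)   # 2nd data set
print(round(cv1, 1), round(cv2, 1))       # prints: 6.8 12.5
print(cv1 < cv2)                          # prints: True
```

Although both data sets have the same standard deviation (4.5 kg), the second has the larger relative variability because its mean is smaller.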

Notes: (Some properties of $\bar{x}$, S, and S²)

Sample values are: $x_1, x_2, \ldots, x_n$
a and b are constants

Sample Data                         Sample mean      Sample st.dev.   Sample Variance
$x_1, x_2, \ldots, x_n$             $\bar{x}$        $S$              $S^2$
$ax_1, ax_2, \ldots, ax_n$          $a\bar{x}$       $|a|S$           $a^2S^2$
$x_1 + b, \ldots, x_n + b$          $\bar{x} + b$    $S$              $S^2$
$ax_1 + b, \ldots, ax_n + b$        $a\bar{x} + b$   $|a|S$           $a^2S^2$

Absolute value:

$$|a| = \begin{cases} a & \text{if } a \ge 0 \\ -a & \text{if } a < 0 \end{cases}$$

Example:

Sample               Sample mean   Sample St.dev.   Sample Variance   C.V.
1, 3, 5              3             2                4                 66.7%
(1) -2, -6, -10      -6            4                16                66.7%
(2) 11, 13, 15       13            2                4                 15.4%
(3) 8, 4, 0          4             4                16                100%

Data (1): $-2x_1, -2x_2, -2x_3$ (a = −2)
Data (2): $x_1 + 10, x_2 + 10, x_3 + 10$ (b = 10)
Data (3): $-2x_1 + 10, -2x_2 + 10, -2x_3 + 10$ (a = −2, b = 10)
Can C. V. exceed 100%?

Data: 10,1,1,0
Mean=3
Variance=22
STDEV=4.6904

C. V.=156.3%
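
The figures above can be verified with Python's standard `statistics` module, which uses the sample (n − 1) formulas for `variance` and `stdev`. This is a quick check rather than part of the original slides.

```python
import statistics

data = [10, 1, 1, 0]

mean = statistics.mean(data)          # 12 / 4 = 3
variance = statistics.variance(data)  # sample variance: 66 / 3 = 22
std_dev = statistics.stdev(data)      # sqrt(22), about 4.6904
cv = std_dev / mean * 100             # about 156.3, so C.V. exceeds 100%

print(mean, variance, round(std_dev, 4), round(cv, 1))
```

So the C.V. can exceed 100% whenever the standard deviation is larger than the mean.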
Measures of Skewness
Normal Distribution
-is a distribution with a bell-shaped
appearance. In a normal distribution, the
mean=median=mode
When the mean < median, the bulk of the
distribution is on the right. This implies that
the questions are generally easy (in case of
test) or that many students in the group are
bright.
When the mean > median, the bulk of the
distribution is on the left. This implies that
the questions are generally difficult (in case
of test) or that many students in the group
are not prepared for the test or not smart.
Skewness refers to the degree of
symmetry or asymmetry of a
distribution
It may be:
negatively skewed when
mean < median
positively skewed when:
mean > median
The extent of skewness can be obtained by getting the coefficient of skewness using the formula:

$$sk = \frac{3(\bar{x} - m)}{s}$$

where $\bar{x}$ = mean
m = median
s = standard deviation
SUMMARY
For normal distribution
sk =0
For skewed to the left
sk < 0
For skewed to the right
sk > 0
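
A minimal Python sketch of the coefficient of skewness sk = 3(mean − median)/s. The slides do not say whether s is the sample or the population standard deviation; the sample version (`statistics.stdev`) is assumed here, and the function name is illustrative.

```python
import statistics

def skewness_coefficient(xs):
    """sk = 3 * (mean - median) / s, with s the sample standard deviation."""
    mean = statistics.mean(xs)
    median = statistics.median(xs)
    s = statistics.stdev(xs)  # assumption: sample (n - 1) standard deviation
    return 3 * (mean - median) / s

right_skewed = [10, 1, 1, 0]     # mean 3 > median 1
symmetric = [1, 2, 3]            # mean 2 = median 2

sk_right = skewness_coefficient(right_skewed)
sk_sym = skewness_coefficient(symmetric)
print(sk_right > 0, sk_sym == 0)  # prints: True True
```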
KURTOSIS
KURTOSIS refers to the peakedness or
flatness of a distribution.
• Mesokurtic is a normal distribution
• Leptokurtic is more peaked than the
normal distribution
• Platykurtic is flatter than the normal
distribution
Kurtosis ( Ku )
For ungrouped data

$$ku = \frac{\sum (d - \bar{x})^4}{N s^4}$$

For grouped data

$$ku = \frac{\sum f\,(cm - \bar{x})^4}{N s^4}$$

where
ku = the kurtosis
d = the raw data
cm = the class mark
$\bar{x}$ = the mean
$s^4$ = the square of the variance
N = the sample size
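
A Python sketch of the ungrouped kurtosis formula. The slides do not specify which variance is squared to obtain s⁴; the population variance (dividing by N) is assumed here, and the function name and data are illustrative. With that convention, a normal distribution has ku = 3, so values above 3 suggest a leptokurtic shape and values below 3 a platykurtic one.

```python
def kurtosis_ungrouped(data):
    """ku = sum((d - mean)^4) / (N * s^4), with s^2 the population variance."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((d - mean) ** 2 for d in data) / n  # assumption: divide by N
    fourth_moment_sum = sum((d - mean) ** 4 for d in data)
    return fourth_moment_sum / (n * variance ** 2)

scores = [7, 2, 7, 6, 5, 6, 2]    # mean 5, population variance 4
ku = kurtosis_ungrouped(scores)   # 196 / (7 * 16)
print(ku)                         # prints: 1.75
```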
Exercises

•Range
•Standard deviation
•Coefficient of variation
•Skewness
1. The number of packs of cigarettes Mang Juan
sold during the last 12 days of Dec are as
follows:
10 15 5 21 7 25
90 14 18 20 10 12
Determine the following and interpret each
result: a. range
b. standard deviation
c. coefficient of variation
d. kurtosis
2. Consider this set of data:

9 56 30 3 70 2 40 51
23 15
Find the standard deviation,
variance,
skewness,
degree of skewness, and
coefficient of variation.
3. Find the standard deviation, variance, range, percentile
deviation, semi-interquartile range, and coefficient of
variation of the following weight distribution of 25 male
students:
Weight (in kilos)   No. of students
52-54               2
55-57               3
58-60               4
61-63               6
64-66               5
67-69               3
70-72               2
Historical Events (January 4)
• 1896 Utah is admitted as the 45th U.S. state
• 2004 Spirit, a NASA Mars Rover, lands
successfully on Mars
• Famous Birthdays:
• 1643 Sir Isaac Newton (Scientist)
• 1809 Louis Braille (Inventor of touch reading
system for blind)
Today in History
January 5
• 1920 The Boston Red Sox sell Babe Ruth to
the New York Yankees, in what later becomes known
as the Curse of the Bambino.
• 1933 Construction of the Golden Gate Bridge
begins in San Francisco Bay.
• Famous Birthday:
• 1914 George Reeves (Actor - Superman)
Number Trivia
• What is the largest digit in the binary system?
• What is the only number that equals twice the sum of its digits (digit means numerical symbol)?
• Conventionally, how many books are in the Bible's New Testament?
Answers:
1
18 (1+8=9; 2x9=18)
27
The grand aim of all science
is to cover the greatest
number of empirical facts by
logical deduction from the
smallest possible number of
hypotheses or axioms.

— Albert Einstein
Fundamental Counting
Principles
On several occasions, before making
an important decision, we resort to
determining and counting all the
possible number of alternative options
that we can choose from. Certainly, the
simplest way to do this is to list down or
enumerate all the possible options
manually and individually. To do this,
however, will require a lot of time and
effort.
Objectives:
• apply the fundamental counting principle
• compute permutations
• compute combinations
• distinguish permutations from combinations
Fundamental Counting
Principle
The Fundamental Counting Principle can be used
to determine the number of possible
outcomes when there are two or more
characteristics.
The Fundamental Counting Principle states that
if an event has m possible outcomes and
another independent event has n possible
outcomes, then there are m*n possible
outcomes for the two events together.
Fundamental Counting
Principle
Let's start with a simple example.

A student is to roll a die and flip a coin.
How many possible outcomes will there be?

1H 2H 3H 4H 5H 6H
1T 2T 3T 4T 5T 6T

6*2 = 12 outcomes
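The same list of outcomes can be generated rather than written by hand. A minimal sketch using Python's standard `itertools.product`; the variable names are illustrative.

```python
from itertools import product

die = [1, 2, 3, 4, 5, 6]
coin = ["H", "T"]

# Every pairing of one die face with one coin face.
outcomes = [f"{face}{side}" for face, side in product(die, coin)]

print(outcomes)
print(len(outcomes))  # equals len(die) * len(coin)
```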
Fundamental Counting
Principle
For a college interview, Robert has to choose
what to wear from the following: 4 slacks, 3
shirts, 2 pairs of shoes and 5 ties. How many possible
outfits does he have to choose from?

4*3*2*5 = 120 outfits


Principles of Counting
$n_1 \times n_2 \times n_3 \times \dots \times n_k$ ways
1. At a certain restaurant, a customer
has a choice of 2 salads, 4 main
dishes and 3 desserts. If every meal
is to consist of a salad, a main dish
and a dessert, how many meal choices
are there?
2x4x3 = 24 choices of meals
2. How many different committees
consisting of 1 Junior and 1 Senior
can be formed from 7 Juniors and 4
Seniors?
7x4 = 28 ways
3. A P5 coin and a P10 coin are tossed.
How many possible combinations
of upturned faces can occur when
they fall?
2x2 = 4
4. From the numbers 2,3,4,5,6,7 and 8,
three digit numbers are to be formed.
How many numbers are there if
repetition is allowed?
5. Consider the problem in no. 4, how
many three-digit numbers are there if
repetition is not allowed?
Permutations

A Permutation is an arrangement
of items in a particular order.

Notice, ORDER MATTERS!


To find the number of Permutations of
n items, we can use the Fundamental
Counting Principle or factorial notation.
Permutations
The number of ways to arrange
the letters ABC: ____ ____ ____

Number of choices for first blank? 3 ____ ____
Number of choices for second blank? 3 2 ____
Number of choices for third blank? 3 2 1

3*2*1 = 6, that is, 3! = 3*2*1 = 6
ABC ACB BAC BCA CAB CBA
Permutations

To find the number of Permutations of
n items chosen r at a time, you can use
the formula

${}_nP_r = \frac{n!}{(n-r)!}$ where $0 \le r \le n$.

${}_5P_3 = \frac{5!}{(5-3)!} = \frac{5!}{2!} = 5 \cdot 4 \cdot 3 = 60$
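The formula translates directly into code. A short sketch, with `permutations_count` as an illustrative name; `math.perm` from the standard library (Python 3.8 and later) computes the same quantity.

```python
import math

def permutations_count(n, r):
    """Ordered arrangements of r items chosen from n: n! / (n - r)!."""
    if not 0 <= r <= n:
        raise ValueError("requires 0 <= r <= n")
    return math.factorial(n) // math.factorial(n - r)

print(permutations_count(5, 3))  # 60
print(math.perm(5, 3))           # same value from the standard library
```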
Permutations
Practice:
A combination lock will open when the
right choice of three numbers (from 1
to 30, inclusive) is selected. How many
different lock combinations are possible
assuming no number is repeated?

Permutations
Practice:
${}_{30}P_3 = \frac{30!}{(30-3)!} = \frac{30!}{27!} = 30 \cdot 29 \cdot 28 = 24{,}360$
Permutations
Practice:
From a club of 24 members, a
President, Vice President, Secretary,
Treasurer and Historian are to be
elected. In how many ways can the
offices be filled?

Permutations
Practice:
${}_{24}P_5 = \frac{24!}{(24-5)!} = \frac{24!}{19!} = 24 \cdot 23 \cdot 22 \cdot 21 \cdot 20 = 5{,}100{,}480$
Permutations with
Repetitions
Permutations with
Repetitions

The number of permutations of “n” objects, “r” of
which are alike, “s” of which are alike, “t” of which
are alike, and so on, is given by the expression

$\frac{n!}{r! \cdot s! \cdot t! \cdots}$
Permutations with
Repetitions
Example 1: In how many ways can all of the
letters in the word SASKATOON be arranged?

Solution: If all 9 letters were different, we could
arrange them in 9! ways, but because there are 2
identical S’s, 2 identical A’s, and 2 identical O’s,
we can arrange the letters in:

$\frac{n!}{r! \cdot s! \cdot t! \cdots} = \frac{9!}{2! \cdot 2! \cdot 2!} = 45{,}360$

Therefore, there are 45,360 different ways
the letters can be arranged.
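The count can be reproduced for any word by dividing n! by the factorial of each letter's frequency. A minimal sketch; `distinct_arrangements` is an illustrative name, not part of any library.

```python
import math
from collections import Counter

def distinct_arrangements(word):
    """n! divided by the factorial of each repeated letter's count."""
    total = math.factorial(len(word))
    for count in Counter(word).values():
        total //= math.factorial(count)
    return total

print(distinct_arrangements("SASKATOON"))  # 45360
```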
Permutations with
Repetitions
Example 2: Along how many different routes can
one walk a total of 9 blocks by going 4 blocks
north and 5 blocks east?
Solution: If you record the letter of the direction in
which you walk, then one possible path would be
represented by the arrangement NNEEENENE. The
question then becomes one of determining the number of
arrangements of 9 letters, of which 4 are N’s and 5 are E’s.

$\frac{9!}{5! \cdot 4!} = 126$

Therefore, there are 126 different routes.
Circular and Ring
Permutations
Circular Permutations Principle
“n” different objects can be arranged in a
circle in (n – 1)! ways.
Ring Permutations Principle
“n” different objects can be arranged on a
circular ring in $\frac{(n-1)!}{2}$ ways.
Circular and Ring
Permutations
Example 1: In how many different ways can
12 football players be arranged in a circular
huddle?
Solution: Using the circular permutations principle
there are:
(12 – 1)! = 11! = 39 916 800 arrangements
If the quarterback is used as a point of reference,
then the other 11 players can be arranged in 11!
ways.
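The (n - 1)! rule can be checked by brute force for a small n: list every ordering, rotate each one so that a fixed reference object comes first, and count the distinct results. This is a sketch for illustration; the function names are hypothetical, and enumeration is only practical for small n.

```python
import math
from itertools import permutations

def circular_arrangements(n):
    """Closed form from the circular permutations principle."""
    return math.factorial(n - 1)

def count_by_enumeration(n):
    """Count seatings of n objects around a circle, treating rotations as equal."""
    seen = set()
    for p in permutations(range(n)):
        i = p.index(0)
        seen.add(p[i:] + p[:i])  # rotate so object 0 is the reference point
    return len(seen)

print(count_by_enumeration(5), circular_arrangements(5))
print(circular_arrangements(12))  # the football huddle
```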
Combinations
A Combination is an arrangement
of items in which order does not
matter.
ORDER DOES NOT MATTER!
Since the order does not matter in
combinations, there are fewer combinations
than permutations. The combinations are a
"subset" of the permutations.
Combinations
To find the number of Combinations of
n items chosen r at a time, you can use
the formula

${}_nC_r = \frac{n!}{r!\,(n-r)!}$ where $0 \le r \le n$.

${}_5C_3 = \frac{5!}{3!\,(5-3)!} = \frac{5!}{3!\,2!} = \frac{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{3 \cdot 2 \cdot 1 \cdot 2 \cdot 1} = \frac{5 \cdot 4}{2 \cdot 1} = \frac{20}{2} = 10$
Combinations
Practice:

To play a particular card game, each


player is dealt five cards from a
standard deck of 52 cards. How
many different hands are possible?

Combinations
Practice:
${}_{52}C_5 = \frac{52!}{5!\,(52-5)!} = \frac{52!}{5!\,47!} = \frac{52 \cdot 51 \cdot 50 \cdot 49 \cdot 48}{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1} = 2{,}598{,}960$
Combinations
Practice:

A student must answer 3 out of 5


essay questions on a test. In how
many different ways can the
student select the questions?

Combinations
Practice:
${}_5C_3 = \frac{5!}{3!\,(5-3)!} = \frac{5!}{3!\,2!} = \frac{5 \cdot 4}{2 \cdot 1} = 10$
Combinations
Practice:
A basketball team consists of two
centers, five forwards, and four
guards. In how many ways can the
coach select a starting line up of
one center, two forwards, and two
guards?
Combinations
Practice:

Center: ${}_2C_1 = \frac{2!}{1!\,1!} = 2$
Forwards: ${}_5C_2 = \frac{5!}{2!\,3!} = \frac{5 \cdot 4}{2 \cdot 1} = 10$
Guards: ${}_4C_2 = \frac{4!}{2!\,2!} = \frac{4 \cdot 3}{2 \cdot 1} = 6$

${}_2C_1 \cdot {}_5C_2 \cdot {}_4C_2$

Thus, the number of ways to select the
starting line up is 2*10*6 = 120.
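The lineup count combines the combinations formula with the fundamental counting principle. The sketch below computes it with `math.comb` (Python 3.8 and later) and confirms it by enumeration; the player labels are hypothetical placeholders.

```python
import math
from itertools import combinations, product

# Closed form: choose 1 of 2 centers, 2 of 5 forwards, 2 of 4 guards.
lineups = math.comb(2, 1) * math.comb(5, 2) * math.comb(4, 2)

# Brute-force check with hypothetical player labels.
centers = ["C1", "C2"]
forwards = ["F1", "F2", "F3", "F4", "F5"]
guards = ["G1", "G2", "G3", "G4"]
enumerated = sum(
    1
    for _ in product(
        combinations(centers, 1),
        combinations(forwards, 2),
        combinations(guards, 2),
    )
)

print(lineups, enumerated)
```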
Objectives/Assignment
• How to use the Fundamental Counting Principle
to find the number of ways two or more events
can occur.
• How to find the number of ways a group of
objects can be arranged in order.
• How to find the number of ways to choose
several objects from a group without regard to
order.
• How to use counting principles to find
probabilities
The Fundamental Counting
Principle
• If one event can occur in m ways and
a second event can occur in n ways,
the number of ways the two events
can occur in sequence is m ● n. This
rule can be extended for any number
of events occurring in sequence.
Example 1
• You are purchasing a new car. Using the
following manufacturers, car sizes and
colors, how many different ways can you
select one manufacturer, one car size and
one color?

Manufacturer: Ford, GM, Chrysler


Car size: small, medium
Color: white(W), red(R), black(B), green(G)
Solution
• There are three choices of manufacturer,
two choices of car sizes, and four colors.
So, the number of ways to select one
manufacturer, one car size and one color
is:

3 ● 2 ● 4 = 24 ways. A tree diagram can
help you see why there are 24 options.
Tree diagram for Car Selections
[Tree diagram: each manufacturer (Ford, GM, Chrysler) branches into Small and Medium, and each size branches into the four colors W, R, B, G, giving 24 end branches.]


Ex. 2 Using FCP

• The access code for a car’s security


system consists of four digits. Each
digit can be 0 through 9. How many
access codes are possible if:
1. each digit can be used only once
and not repeated?
2. each digit can be repeated?
Solution to 1
1. each digit can be used only once and not
repeated?
Because each digit can only be used once,
there are 10 choices for the first digit, 9
choices for the second, 8 choices left for the
3rd digit, and 7 for the fourth digit. Using
the fundamental counting principle, you
can conclude there are:

10●9●8●7 = 5040 possible access codes.


Solution to 2
2. Each digit can be repeated.

Because each digit can be repeated,
there are 10 choices for each of the
four digits. So there are:
10●10●10●10 = 10,000 possible
access codes.
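Both counts can be confirmed by enumerating the codes. A minimal sketch with the standard `itertools` module: `permutations` never reuses a digit, while `product` with `repeat=4` allows reuse.

```python
from itertools import permutations, product

digits = "0123456789"

# Four-digit codes where no digit is reused.
no_repeat = sum(1 for _ in permutations(digits, 4))

# Four-digit codes where digits may repeat.
with_repeat = sum(1 for _ in product(digits, repeat=4))

print(no_repeat, with_repeat)
```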
Permutations
• The expression n! is read as n
factorial and is defined as follows:

n! = n ●(n -1)●(n -2)●(n-3) ● ● ● 3 ● 2 ● 1

As a special case, 0! = 1
Study Tip
Here are several values of n!.
1! = 1
2! = 2 ● 1 = 2
3! = 3 ● 2 ● 1 = 6
4! = 4 ● 3 ● 2 ● 1 = 24
5! = 5 ● 4 ● 3 ● 2 ● 1 = 120

Notice that as n increases, n! becomes very large.


Take some time now to learn how to use the
factorial key on your calculator.
Example 3: Finding the number of
permutations of n objects
• The starting lineup for a baseball team
consists of nine players. How many
different batting orders are possible using
the starting lineup?
Solution: the number of permutations is 9!

9! = 9 ● 8 ● 7 ● 6 ● 5 ● 4 ● 3 ● 2 ● 1 = 362,880
Permutations of n objects taken r at a time

• Suppose you want to choose some of
the objects in a group and put them in
order. Such an ordering is called a
permutation of n objects taken r at a
time.

${}_nP_r = \frac{n!}{(n-r)!}$, where $r \le n$
Example 4: Finding n P r
Find the number of ways of forming
three-digit codes in which no digit is
repeated.

${}_nP_r = {}_{10}P_3 = \frac{10!}{(10-3)!} = \frac{10!}{7!} = \frac{10 \cdot 9 \cdot 8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1} = 720$
There are 720 possible three-digit codes that do not have
repeating digits.
Example 5: Finding n P r
Forty-three race cars started the 2007 Daytona 500. How many ways
can the cars finish first, second, and third?
Because there are 43 race cars and order is important, the number of
ways the cars can finish first, second, and third is:

${}_nP_r = {}_{43}P_3 = \frac{43!}{(43-3)!} = \frac{43!}{40!} = 43 \cdot 42 \cdot 41 = 74{,}046$
Ordering same objects
Suppose you want to order a group of n objects where some of
the objects are the same. For instance, consider a group of
letters consisting of four A’s, 2 B’s, and one C. How many
ways can you order such a group? Using the previous
formula, you might conclude the following:

${}_nP_r = {}_7P_7 = 7!$
However, because some of the objects are the
same, not all of these permutations are
distinguishable. How many distinguishable
permutations are possible? The answer can be
found using the formula on the next slide.
Distinguishable Permutations
$\frac{n!}{n_1! \cdot n_2! \cdot n_3! \cdots n_k!}$, where $n_1 + n_2 + n_3 + \dots + n_k = n$.

$\frac{7!}{4!\,2!\,1!} = \frac{7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{4!\,2!\,1!} = \frac{7 \cdot 6 \cdot 5}{2} = 105$
Example 6: Distinguishable
Permutations
• A building contractor is planning to develop a
subdivision. The subdivision consists of six
one-story houses, four two-story houses, and
two split-level houses. In how many
distinguishable ways can the houses be
arranged?

Solution: There are to be twelve houses in the


subdivision (6+4+2)
Example 6: Distinguishable Permutations
$\frac{12!}{6!\,4!\,2!} = \frac{12 \cdot 11 \cdot 10 \cdot 9 \cdot 8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{(6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1)(4 \cdot 3 \cdot 2 \cdot 1)(2 \cdot 1)} = \frac{12 \cdot 11 \cdot 10 \cdot 9 \cdot 8 \cdot 7}{(4 \cdot 3 \cdot 2 \cdot 1)(2 \cdot 1)} = \frac{665{,}280}{48} = 13{,}860$
Combinations
Suppose you want to buy three CD’s from a selection of
five CD’s. There are 10 ways to make your selections

ABC,ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE.

In each selection, order does NOT matter. (ABC is the


same set as BAC). The number of ways to choose r
objects from n objects without regard to order is called
the number of combinations of n objects taken r at a
time.
Combination of Objects taken r at a time
• A combination is a selection of r objects
from a group of n objects without regard
to order and is denoted by n C r. The
number of combinations of r objects
selected from a group of n objects is:

${}_nC_r = \frac{n!}{(n-r)!\,r!}$
Example 7: Finding the number of
combinations
A state’s department of
transportation plans to develop
a new section of interstate
highway and receives 16 bids
for the project. The state plans
to hire four of the bidding
companies. How many
different combinations of four
companies can be selected
from the 16 bidding
companies? Because order is
NOT important, there are:

${}_{16}C_4 = \frac{16!}{(16-4)!\,4!} = \frac{16!}{12!\,4!} = \frac{16 \cdot 15 \cdot 14 \cdot 13}{4 \cdot 3 \cdot 2 \cdot 1} = \frac{43{,}680}{24} = 1{,}820$

different combinations.
Applications – Example 8 Finding
Probabilities
A word consists of one M, four I’s, four S’s, and two P’s.
If the letters are randomly arranged in order, what is
the probability that the arrangement spells the word
Mississippi? Solution: There is one favorable outcome
and there are

$\frac{11!}{1!\,4!\,4!\,2!} = \frac{11 \cdot 10 \cdot 9 \cdot 8 \cdot 7 \cdot 6 \cdot 5}{4 \cdot 3 \cdot 2 \cdot 1 \cdot 2 \cdot 1} = \frac{1{,}663{,}200}{48} = 34{,}650$

distinguishable permutations of the word Mississippi.
Applications – Example 8 Finding
Probabilities
There are 34,650 distinguishable permutations of the
word Mississippi. So the probability that the
arrangement spells the word Mississippi is:

$P(\text{Mississippi}) = \frac{1}{34{,}650} \approx 0.000029$
Applications – Example 9 Finding
Probabilities
Find the probability of being dealt five diamonds from a
standard deck of playing cards. (In poker, this is a
diamond flush.)
SOLUTION: The number of possible ways of choosing 5
diamonds out of 13 is 13C5. The number of possible 5
card hands is 52C5. So the probability of being dealt 5
diamonds is:

$P(\text{diamond flush}) = \frac{{}_{13}C_5}{{}_{52}C_5} = \frac{1{,}287}{2{,}598{,}960} \approx 0.0005$
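The ratio can be computed exactly with the standard library. A short sketch; `fractions.Fraction` keeps the result as an exact ratio before it is converted to a decimal.

```python
import math
from fractions import Fraction

favorable = math.comb(13, 5)   # ways to choose 5 diamonds from 13
possible = math.comb(52, 5)    # ways to choose any 5 cards from 52

p_diamond_flush = Fraction(favorable, possible)

print(favorable, possible)
print(float(p_diamond_flush))
```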
Seatwork
1. How many 5 digit numbers can be
formed using the digits 0, 1, 2,3,….9
such that
a. Repetition is allowed
b. Repetition is not allowed
c. The first digit must not be 9 and
repetition is not allowed
2. A boy can buy a pair of shoes from 6
different stores and a bag from 5
different stores. If he buys a pair of
shoes and a bag from different stores,
how many sets of two stores will there
be?
3. In how many ways can a customer
order a sandwich and a drink if there
are 5 sandwiches and 4 drinks on the
menu?
4. To code its property inventories, a
company designed a card system by
which the first 2 characters are
numbers (0 to 9) and the next two
characters are letters of the English
alphabet. How many different coding
cards can be made?
Factorial Notation
• n factorial is the product of all
positive integers from 1 to n, and it is
written as:
n! = n(n-1)(n-2)….(3)(2)(1)
1. Evaluate 5!
5! = 5x4x3x2x1 = 120
2. 8!/4! = (8x7x6x5x4!)/4! = 8x7x6x5
= 1,680
Evaluate
1. 9𝑃6 + 4 2𝑃1
2. 4𝐶3 𝑥 5𝐶2
3. 6! + 3!
4. How many ways can 5 female
students and 4 male students be
seated on a long bench if the bench
can accommodate only 5 persons?
VOCABULARY
• INFER –to form an opinion from evidence, to
reach a conclusion
• RANDOM-without definite aim or direction,
rule or method
• EXPERIMENT-a series of tests, something that
you do to see how well or how badly it works
• OUTCOME-result, a consequence
• PRINCIPLE-basic truth or theory, a law, rule or
fact
Why Learn Probability?
• Nothing in life is certain. In everything we do, we gauge
the chances of successful outcomes, from business to
medicine to the weather
• A probability provides a quantitative description of the
chances or likelihoods associated with various outcomes
• It provides a bridge between descriptive and inferential
statistics

[Diagram: probability reasons from the population to the sample; statistics reasons from the sample back to the population.]
Principles of Counting
• Preliminary Concepts:
Random Experiment – an experiment that can
be used to generate information or data. Like
an ordinary experiment, this can also be
repeated.
- Rolling a die
- Tossing a coin
- Drawing a card from a well-shuffled deck of 52
cards
• Sample Space- a set of all possible outcomes
in a random experiment, denoted by S.
• Sample Point – an entry from the sample
space
• Event-is a collection of one or more outcomes
considered within a sample space, denoted by
E.
Example
• Random experiment – rolling a die
• Sample space : S = {1,2,3,4,5,6}
• Sample point = 1,2,3,…6
• Event (even numbers) : E = {2,4,6}
Classical Probability
The probability of any event E is

$P(E) = \frac{n(E)}{n(S)} = \frac{\text{number of outcomes in } E}{\text{total number of outcomes in the sample space}}$
Examples
1. A pair of dice is tossed. Find the probability
of getting
a. A total of 7 = 6/36 = 1/6
b. At most a total of 10
c. At least a total of 5
2. Find the probability of drawing an ace from a well
shuffled deck of cards.
3. What is the probability of passing a subject?
4. Two (2) dice are thrown. What is the
probability of each of the following events?
a. Both dice show the same number
b. Both dice show an odd number
c. The sum of the numbers is 13
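The worked answer to 1a can be checked by listing the sample space and counting, which is exactly what the classical formula P(E) = n(E)/n(S) describes. A minimal sketch; the helper `probability` is an illustrative name and takes any true/false test on an outcome.

```python
from fractions import Fraction
from itertools import product

# Sample space for a pair of dice: 36 ordered pairs.
sample_space = list(product(range(1, 7), repeat=2))

def probability(event):
    """Classical probability: favorable outcomes over total outcomes."""
    favorable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favorable), len(sample_space))

p_seven = probability(lambda dice: sum(dice) == 7)
print(p_seven)  # 1/6
```

The same helper can be reused for the remaining items by changing the test passed to it.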
Probability Rules
R1. The probability of any event E, is a number
(either a fraction or decimal) between and
including 0 and 1. This is denoted by

$0 \le P(E) \le 1$
Rule 1 states that probabilities cannot be
negative or greater than 1.
R2. If an event E cannot occur (i.e. the event
contains no members in the sample space), its
probability is 0.

R3. If an event E is certain, then the probability


of E is 1.

R4. The sum of all the probabilities of all


outcomes in the sample space is 1.
Probabilistic vs Statistical Reasoning
• Suppose I know exactly the proportions of car
makes in California. Then I can find the
probability that the first car I see in the street
is a Ford. This is probabilistic reasoning as I
know the population and predict the sample
• Now suppose that I do not know the
proportions of car makes in California, but
would like to estimate them. I observe a
random sample of cars in the street and then I
have an estimate of the proportions of the
population. This is statistical reasoning
What is Probability?
• In Chapter 2, we used graphs and numerical
measures to describe data sets which were
usually samples.
• We measured “how often” using

Relative frequency = f/n

• As n gets larger, the sample approaches the population,
and “how often” = relative frequency approaches the probability.
Basic Concepts
• An experiment is the process by which an
observation (or measurement) is obtained.
• An event is an outcome of an experiment,
usually denoted by a capital letter.
– The basic element to which probability is applied
– When an experiment is performed, a particular
event either happens, or it doesn’t!
Experiments and Events
• Experiment: Record an age
– A: person is 30 years old
– B: person is older than 65
• Experiment: Toss a die
– A: observe an odd number
– B: observe a number greater than 2
Basic Concepts
• Two events are mutually exclusive if, when
one event occurs, the other cannot, and vice
versa.
• Experiment: Toss a die
– A: observe an odd number
– B: observe a number greater than 2
– C: observe a 6
– D: observe a 3
A and B are not mutually exclusive; C and D are mutually exclusive.
What about B and C? B and D?
Basic Concepts
• An event that cannot be decomposed is called
a simple event.
• Denoted by E with a subscript.
• Each simple event will be assigned a
probability, measuring “how often” it occurs.
• The set of all simple events of an experiment
is called the sample space, S.
Example
• The die toss:
• Simple events: E1 = 1, E2 = 2, E3 = 3, E4 = 4, E5 = 5, E6 = 6
• Sample space: S = {E1, E2, E3, E4, E5, E6}
[Venn diagram: the sample space S containing the six points E1 through E6.]
Basic Concepts
• An event is a collection of one or more simple
events.
• The die toss:
– A: an odd number
– B: a number > 2

A = {E1, E3, E5}
B = {E3, E4, E5, E6}
[Venn diagram: S with region A containing E1, E3, E5 and region B containing E3, E4, E5, E6; E2 lies outside both.]
The Probability
of an Event
• The probability of an event A measures “how
often” A will occur. We write P(A).
• Suppose that an experiment is performed n
times. The relative frequency for an event A is

$\frac{\text{number of times } A \text{ occurs}}{n} = \frac{f}{n}$

• If we let n get infinitely large,

$P(A) = \lim_{n \to \infty} \frac{f}{n}$
The Probability
of an Event
• P(A) must be between 0 and 1.
– If event A can never occur, P(A) = 0. If event A
always occurs when the experiment is performed,
P(A) =1.
• The sum of the probabilities for all simple
events in S equals 1.

• The probability of an event A is found


by adding the probabilities of all the
simple events contained in A.
Finding Probabilities
• Probabilities can be found using
– Estimates from empirical studies
– Common sense estimates based on equally
likely events.

• Examples:
–Toss a fair coin. P(Head) = 1/2
– Suppose that 10% of the U.S. population has
red hair. Then for a person selected at random,
P(Red hair) = .10
Using Simple Events
• The probability of an event A is equal to the
sum of the probabilities of the simple events
contained in A
• If the simple events in an experiment are
equally likely, you can calculate

$P(A) = \frac{n_A}{N} = \frac{\text{number of simple events in } A}{\text{total number of simple events}}$
Example 1

Toss a fair coin twice. What is the probability of
observing at least one head?

1st Coin   2nd Coin   Ei   P(Ei)
H          H          HH   1/4
H          T          HT   1/4
T          H          TH   1/4
T          T          TT   1/4

P(at least 1 head) = P(E1) + P(E2) + P(E3)
= 1/4 + 1/4 + 1/4 = 3/4
Example 2
A bowl contains three M&Ms®, one red, one blue
and one green. A child selects two M&Ms at
random. What is the probability that at least one
is red?
1st M&M   2nd M&M   Ei   P(Ei)
Red       Blue      RB   1/6
Red       Green     RG   1/6
Blue      Red       BR   1/6
Blue      Green     BG   1/6
Green     Blue      GB   1/6
Green     Red       GR   1/6

P(at least 1 red) = P(RB) + P(BR) + P(RG) + P(GR)
= 4/6 = 2/3
Example 3
The sample space of throwing a pair of dice is
[Figure: the 36 equally likely outcomes (red die, green die), from (1,1) to (6,6).]
Example 3
Event              Simple events                            Probability
Dice add to 3      (1,2),(2,1)                              2/36
Dice add to 6      (1,5),(2,4),(3,3),(4,2),(5,1)            5/36
Red die shows 1    (1,1),(1,2),(1,3),(1,4),(1,5),(1,6)      6/36
Green die shows 1  (1,1),(2,1),(3,1),(4,1),(5,1),(6,1)      6/36
Counting Rules
• Sample space of throwing 3 dice has 216
entries, sample space of throwing 4 dice has
1296 entries, …
• At some point, we have to stop listing and
start thinking …
• We need some counting rules
The mn Rule
• If an experiment is performed in two stages,
with m ways to accomplish the first stage and
n ways to accomplish the second stage, then
there are mn ways to accomplish the
experiment.
• This rule is easily extended to k stages, with
the number of ways equal to
$n_1 n_2 n_3 \cdots n_k$
Examples
Example: Toss two coins. The total number of
simple events is: 2 × 2 = 4
Example: Toss three coins. The total number of
simple events is: 2 × 2 × 2 = 8
Example: Toss two dice. The total number of
simple events is: 6 × 6 = 36
Example: Toss three dice. The total number of
simple events is: 6 × 6 × 6 = 216
Example: Two M&Ms are drawn from a dish
containing two red and two blue candies. The total
number of simple events is:
4 × 3 = 12
Permutations
• The number of ways you can arrange
n distinct objects, taking them r at a time is

${}_nP_r = \frac{n!}{(n-r)!}$

where $n! = n(n-1)(n-2)\cdots(2)(1)$ and $0! = 1$.
Example: How many 3-digit lock combinations
can we make from the numbers 1, 2, 3, and 4?
The order of the choice is important!

${}_4P_3 = \frac{4!}{1!} = 4(3)(2) = 24$
Examples
Example: A lock consists of five parts and can
be assembled in any order. A quality control
engineer wants to test each order for
efficiency of assembly. How many orders are
there?
The order of the choice is important!

${}_5P_5 = \frac{5!}{0!} = 5(4)(3)(2)(1) = 120$
Combinations
• The number of distinct combinations of n
distinct objects that can be formed, taking
them r at a time is

${}_nC_r = \frac{n!}{r!\,(n-r)!}$

Example: Three members of a 5-person committee must
be chosen to form a subcommittee. How many different
subcommittees could be formed?
The order of the choice is not important!

${}_5C_3 = \frac{5!}{3!\,(5-3)!} = \frac{5(4)(3)(2)(1)}{3(2)(1)(2)(1)} = \frac{5(4)}{(2)(1)} = 10$
Example
• A box contains six M&Ms®, four red
and two green. A child selects two M&Ms at
random. What is the probability that exactly one
is red?
The order of the choice is not important!

${}_6C_2 = \frac{6!}{2!\,4!} = \frac{6(5)}{2(1)} = 15$ ways to choose 2 M&Ms.
${}_2C_1 = \frac{2!}{1!\,1!} = 2$ ways to choose 1 green M&M.
${}_4C_1 = \frac{4!}{1!\,3!} = 4$ ways to choose 1 red M&M.

4 × 2 = 8 ways to choose 1 red and 1 green M&M.
P(exactly one red) = 8/15
Example

A deck of cards consists of 52 cards, 13 "kinds"


each of four suits (spades, hearts, diamonds, and
clubs). The 13 kinds are Ace (A), 2, 3, 4, 5, 6, 7, 8,
9, 10, Jack (J), Queen (Q), King (K). In many poker
games, each player is dealt five cards from a well
shuffled deck.
There are ${}_{52}C_5 = \frac{52!}{5!\,(52-5)!} = \frac{52(51)(50)(49)(48)}{5(4)(3)(2)(1)} = 2{,}598{,}960$
possible hands.
Example
Four of a kind: 4 of the 5 cards are the same
“kind”. What is the probability of getting four
of a kind in a five card hand?
There are 13 possible choices for the kind of
which to have four, and 52-4=48 choices for
the fifth card. Once the kind has been
specified, the four are completely determined:
you need all four cards of that kind. Thus there
are 13×48=624 ways to get four of a kind.
The probability = 624/2,598,960 = .000240096
Example
One pair: two of the cards are of one kind, the
other three are of three different kinds.
What is the probability of getting one pair in a
five card hand?

There are 13 possible choices for the kind
of which to have a pair; given the choice,
there are ${}_4C_2 = 6$ possible choices of two
of the four cards of that kind.


Example
There are 12 kinds remaining from which to
select the other three cards in the hand. We
must insist that the kinds be different from
each other and from the kind of which we
have a pair, or we could end up with a second
pair, three or four of a kind, or a full house.
Example
There are ${}_{12}C_3 = 220$ ways to pick the kinds of
the remaining three cards. There are 4 choices
for the suit of each of those three cards, a total
of $4^3 = 64$ choices for the suits of all three.
Therefore the number of "one pair" hands is
13 × 6 × 220 × 64 = 1,098,240.
The probability = 1,098,240/2,598,960 = .422569
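Both poker counts follow the same pattern: multiply the number of choices at each step, then divide by the number of possible hands. A minimal sketch using `math.comb` (Python 3.8 and later); the variable names are illustrative.

```python
import math

total_hands = math.comb(52, 5)

# Four of a kind: choose the kind, then any one of the other 48 cards.
four_of_a_kind = 13 * 48

# One pair: kind of the pair, 2 of its 4 suits, 3 other kinds, a suit for each.
one_pair = 13 * math.comb(4, 2) * math.comb(12, 3) * 4 ** 3

p_four = four_of_a_kind / total_hands
p_pair = one_pair / total_hands

print(four_of_a_kind, one_pair)
print(p_four, p_pair)
```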
Event Relations
The beauty of using events, rather than simple events, is
that we can combine events to make other events using
logical operations: and, or and not.
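The union of two events, A and B, is the event that
either A or B or both occur when the experiment is
performed. We write $A \cup B$.
[Venn diagram: sample space S with overlapping regions A and B; the whole of both regions is shaded as $A \cup B$.]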
Event Relations
The intersection of two events, A and B, is
the event that both A and B occur when the
experiment is performed. We write $A \cap B$.
[Venn diagram: sample space S with overlapping regions A and B; the overlap is shaded as $A \cap B$.]

• If two events A and B are mutually
exclusive, then $P(A \cap B) = 0$.
Event Relations
The complement of an event A consists of
all outcomes of the experiment that do not
result in event A. We write $A^C$.
[Venn diagram: sample space S with region A; everything outside A is shaded as $A^C$.]
Example
Select a student from the classroom and
record his/her hair color and gender.
– A: student has brown hair
– B: student is female
– C: student is male

What is the relationship between events B and C?
Mutually exclusive; $B = C^C$

• $A^C$: Student does not have brown hair
• $B \cap C$: Student is both male and female = $\emptyset$
• $B \cup C$: Student is either male or female = all students = S
Calculating Probabilities for
Unions and Complements
• There are special rules that will allow you to
calculate probabilities for composite events.
• The Additive Rule for Unions:
• For any two events, A and B, the probability of
their union, $P(A \cup B)$, is

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Example: Additive Rule
Example: Suppose that there were 120
students in the classroom, and that they
could be classified as follows:

         Brown   Not Brown
Male     20      40
Female   30      30

A: brown hair, P(A) = 50/120
B: female, P(B) = 60/120
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
= 50/120 + 60/120 - 30/120
= 80/120 = 2/3
Check: $P(A \cup B)$ = (20 + 30 + 30)/120
Example: Two Dice

A: red die shows 1
B: green die shows 1

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
= 6/36 + 6/36 – 1/36
= 11/36
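The additive rule can be verified on the two-dice sample space by treating events as Python sets, where `|` is union and `&` is intersection. A minimal sketch; the names `S`, `prob`, `A`, and `B` are illustrative.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))  # outcomes as (red, green)

def prob(event):
    """Probability of an event given as a subset of S."""
    return Fraction(len(event), len(S))

A = {o for o in S if o[0] == 1}  # red die shows 1
B = {o for o in S if o[1] == 1}  # green die shows 1

direct = prob(A | B)                         # count the union directly
by_rule = prob(A) + prob(B) - prob(A & B)    # additive rule

print(direct, by_rule)
```

Subtracting P(A ∩ B) removes the outcome (1,1), which would otherwise be counted twice.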
A Special Case
When two events A and B are
mutually exclusive, $P(A \cap B) = 0$
and $P(A \cup B) = P(A) + P(B)$.

         Brown   Not Brown
Male     20      40
Female   30      30

A: male with brown hair, P(A) = 20/120
B: female with brown hair, P(B) = 30/120
A and B are mutually exclusive, so that
$P(A \cup B) = P(A) + P(B)$
= 20/120 + 30/120
= 50/120
Example: Two Dice

A: dice add to 3
B: dice add to 6

A and B are mutually exclusive, so that
$P(A \cup B) = P(A) + P(B)$
= 2/36 + 5/36
= 7/36
Calculating Probabilities
for Complements
[Venn diagram: sample space split into A and $A^C$.]
• We know that for any event A:
– $P(A \cap A^C) = 0$
• Since either A or $A^C$ must occur,
$P(A \cup A^C) = 1$
• so that $P(A \cup A^C) = P(A) + P(A^C) = 1$

$P(A^C) = 1 - P(A)$
Example
Select a student at random
from the classroom. Define:

         Brown   Not Brown
Male     20      40
Female   30      30

A: male, P(A) = 60/120
B: female, P(B) = ?

A and B are complementary, so that
P(B) = 1 - P(A)
= 1 - 60/120 = 60/120
Calculating Probabilities for
Intersections
In the previous example, we found $P(A \cap B)$
directly from the table. Sometimes this is
impractical or impossible. The rule for calculating
$P(A \cap B)$ depends on the idea of independent
and dependent events.
Two events, A and B, are said to be
independent if the occurrence or
nonoccurrence of one of the events does
not change the probability of the
occurrence of the other event.
Conditional Probabilities
The probability that A occurs, given
that event B has occurred, is called
the conditional probability of A
given B and is defined as

$P(A | B) = \frac{P(A \cap B)}{P(B)}$ if $P(B) \ne 0$

The vertical bar "|" is read as “given”.
Example 1
Toss a fair coin twice. Define
– A: head on second toss
– B: head on first toss

Simple events: HH, HT, TH, TT, each with probability 1/4

P(A|B) = ½
P(A|not B) = ½

P(A) does not change, whether B happens or not…
A and B are independent!
Example 2
A bowl contains five M&Ms®, two red and three
blue. Randomly select two candies, and define
– A: second candy is red.
– B: first candy is blue.

P(A|B) = P(2nd red|1st blue) = 2/4 = 1/2
P(A|not B) = P(2nd red|1st red) = 1/4

P(A) does change, depending on whether B happens or not…
A and B are dependent!
Example 3: Two Dice
Toss a pair of fair dice. Define
– A: red die shows 1
– B: green die shows 1

P(A|B) = P(A and B)/P(B)
= (1/36)/(1/6) = 1/6 = P(A)

P(A) does not change, whether B happens or not…
A and B are independent!
Example 3: Two Dice
Toss a pair of fair dice. Define
– A: add to 3
– B: add to 6

P(A|B) = P(A and B)/P(B)
= (0/36)/(5/36) = 0

P(A) does change when B happens.
A and B are dependent! In fact, when B happens,
A can’t.
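Both two-dice examples can be reproduced by counting outcomes: P(A|B) is the share of B's outcomes that also lie in A. The sketch below compares P(A|B) with P(A) to test for independence; the helper names `prob` and `conditional` are illustrative.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))  # outcomes as (red, green)

def prob(event):
    return Fraction(len(event), len(S))

def conditional(a, b):
    """P(a | b) for equally likely outcomes: |a and b| / |b|."""
    return Fraction(len(a & b), len(b))

red_one = {o for o in S if o[0] == 1}
green_one = {o for o in S if o[1] == 1}
sum_three = {o for o in S if sum(o) == 3}
sum_six = {o for o in S if sum(o) == 6}

# Independent when conditioning on B leaves P(A) unchanged.
dice_faces_independent = conditional(red_one, green_one) == prob(red_one)
sums_independent = conditional(sum_three, sum_six) == prob(sum_three)

print(conditional(red_one, green_one), prob(red_one), dice_faces_independent)
print(conditional(sum_three, sum_six), prob(sum_three), sums_independent)
```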
Defining Independence
• We can redefine independence in terms of
conditional probabilities:
Two events A and B are independent if and
only if
P(A|B) = P(A) or P(B|A) = P(B)
Otherwise, they are dependent.

• Once you’ve decided whether or not two


events are independent, you can use the
following rule to calculate their
intersection.
The Multiplicative Rule for
Intersections
• For any two events, A and B, the probability that
both A and B occur is
P(A ∩ B) = P(A) P(B given that A occurred)
= P(A)P(B|A)

• If the events A and B are independent, then
the probability that both A and B occur is
P(A ∩ B) = P(A) P(B)
Example 1
In a certain population, 10% of the people can be
classified as being high risk for a heart attack. Three
people are randomly selected from this population.
What is the probability that exactly one of the three
is high risk?
Define H: high risk N: not high risk
P(exactly one high risk) = P(HNN) + P(NHN) + P(NNH)
= P(H)P(N)P(N) + P(N)P(H)P(N) + P(N)P(N)P(H)
= (.1)(.9)(.9) + (.9)(.1)(.9) + (.9)(.9)(.1) = 3(.1)(.9)² = .243
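As a rough check of this calculation, the sketch below lists all eight H/N sequences for three people and adds up those with exactly one H. It assumes the three selections are independent, as the example does; the variable names are illustrative.

```python
from itertools import product

p_high = 0.10  # P(H) from the example

p_exactly_one = 0.0
for outcome in product("HN", repeat=3):      # HHH, HHN, ..., NNN
    if outcome.count("H") == 1:              # keep HNN, NHN, NNH
        p_outcome = 1.0
        for person in outcome:
            # Multiplicative Rule for independent events
            p_outcome *= p_high if person == "H" else 1 - p_high
        p_exactly_one += p_outcome

print(round(p_exactly_one, 3))
```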
Example 2
Suppose we have additional information in the
previous example. We know that only 49% of the
population are female. Also, of the female patients, 8%
are high risk. A single person is selected at random. What
is the probability that it is a high risk female?
Define H: high risk F: female
From the example, P(F) = .49 and P(H|F) = .08.
Use the Multiplicative Rule:
P(high risk female) = P(H ∩ F)
= P(F)P(H|F) = .49(.08) = .0392
The Law of Total Probability
Let S1 , S2 , S3 ,..., Sk be mutually exclusive and
exhaustive events (that is, one and only one
must happen). Then the probability of any event
A can be written as

P(A) = P(A ∩ S1) + P(A ∩ S2) + … + P(A ∩ Sk)
= P(S1)P(A|S1) + P(S2)P(A|S2) + … + P(Sk)P(A|Sk)
[Figure: Venn diagram of the sample space partitioned into S1, S2, …, Sk, with event A made up of the pieces A ∩ S1, A ∩ S2, …, A ∩ Sk.]
Bayes’ Rule
Let S1 , S2 , S3 ,..., Sk be mutually exclusive and
exhaustive events with prior probabilities P(S1),
P(S2),…,P(Sk). If an event A occurs, the posterior
probability of Si, given that A occurred, is
$$P(S_i \mid A) = \frac{P(S_i)\,P(A \mid S_i)}{\sum_{j=1}^{k} P(S_j)\,P(A \mid S_j)} \quad \text{for } i = 1, 2, \ldots, k$$
Proof
$$P(A \mid S_i) = \frac{P(A \cap S_i)}{P(S_i)} \;\Rightarrow\; P(A \cap S_i) = P(S_i)\,P(A \mid S_i)$$
$$P(S_i \mid A) = \frac{P(A \cap S_i)}{P(A)} = \frac{P(S_i)\,P(A \mid S_i)}{\sum_{j=1}^{k} P(S_j)\,P(A \mid S_j)}$$
Example
From a previous example, we know that 49% of the
population are female. Of the female patients, 8% are
high risk for heart attack, while 12% of the male patients
are high risk. A single person is selected at random and
found to be high risk. What is the probability that it is a
male? Define H: high risk F: female M: male

We know:
P(F) = .49
P(M) = .51
P(H|F) = .08
P(H|M) = .12

$$P(M \mid H) = \frac{P(M)\,P(H \mid M)}{P(M)\,P(H \mid M) + P(F)\,P(H \mid F)} = \frac{.51(.12)}{.51(.12) + .49(.08)} = .61$$
Example
Suppose a rare disease infects one out of
every 1000 people in a population. And
suppose that there is a good, but not perfect,
test for this disease: if a person has the
disease, the test comes back positive 99% of
the time. On the other hand, the test also
produces some false positives: 2% of
uninfected people also test positive. Now
someone has just tested positive. What are his
chances of having this disease?
Example
Define A: has the disease B: test positive
We know:
P(A) = .001    P(A^c) = .999
P(B|A) = .99    P(B|A^c) = .02

We want to know P(A|B) = ?

$$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(A)\,P(B \mid A) + P(A^c)\,P(B \mid A^c)} = \frac{.001 \times .99}{.001 \times .99 + .999 \times .02} = .0472$$
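Both Bayes' Rule examples follow the same pattern, so they can be sketched with one small function. This is only an illustration: the function name `posterior` and its argument layout are assumptions, and the denominator is the Law of Total Probability from the earlier slide.

```python
def posterior(priors, likelihoods, i):
    """Bayes' Rule: P(S_i | A) for mutually exclusive, exhaustive S_1..S_k.

    priors[j]      = P(S_j)
    likelihoods[j] = P(A | S_j)
    """
    # Law of Total Probability: P(A) = sum of P(S_j) P(A | S_j)
    p_a = sum(p * l for p, l in zip(priors, likelihoods))
    return priors[i] * likelihoods[i] / p_a

# Heart-attack example: S = (male, female), A = high risk
p_male_given_high = posterior([0.51, 0.49], [0.12, 0.08], 0)

# Rare-disease example: S = (disease, no disease), A = positive test
p_disease_given_positive = posterior([0.001, 0.999], [0.99, 0.02], 0)

print(round(p_male_given_high, 2))
print(round(p_disease_given_positive, 4))
```

The second result shows why a positive result on a good test can still leave the disease unlikely: the prior P(A) = .001 is very small compared with the false positive rate.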
Example
A survey of job satisfaction² of teachers was
taken, giving the following results:

                   Job Satisfaction
Level          Satisfied   Unsatisfied   Total
College            74          43         117
High School       224         171         395
Elementary        126         140         266
Total             424         354         778

² “Psychology of the Scientist: Work Related Attitudes of U.S. Scientists”
(Psychological Reports (1991): 443 – 450).
Example
If all the cells are divided by the total number
surveyed, 778, the resulting table is a table of
empirically derived probabilities.

                   Job Satisfaction
Level          Satisfied   Unsatisfied   Total
College          0.095       0.055       0.150
High School      0.288       0.220       0.508
Elementary       0.162       0.180       0.342
Total            0.545       0.455       1.000
Example
For convenience, let C stand for the event that
the teacher teaches college, S stand for the
teacher being satisfied, and so on. Let’s look at
some probabilities and what they mean.

P(C) = 0.150 is the proportion of teachers who are college teachers.

P(S) = 0.545 is the proportion of teachers who are satisfied with
their job.

P(C ∩ S) = 0.095 is the proportion of teachers who are college teachers
and who are satisfied with their job.
Example
$$P(C \mid S) = \frac{P(C \cap S)}{P(S)} = \frac{0.095}{0.545} \approx 0.175$$
is the proportion of teachers who are college teachers given they are satisfied.
Restated: this is the proportion of satisfied teachers who are college teachers.

$$P(S \mid C) = \frac{P(S \cap C)}{P(C)} = \frac{0.095}{0.150} \approx 0.632$$
is the proportion of teachers who are satisfied given they are college teachers.
Restated: this is the proportion of college teachers who are satisfied.

(The values 0.175 and 0.632 come from the raw counts, 74/424 and 74/117; dividing the rounded table entries gives 0.174 and 0.633.)
Example
Are C and S independent events?

P(C) = 0.150 and $P(C \mid S) = \frac{P(C \cap S)}{P(S)} = \frac{0.095}{0.545} \approx 0.175$

P(C|S) ≠ P(C), so C and S are dependent events.


Example
P(C ∪ S)?

P(C) = 0.150, P(S) = 0.545 and
P(C ∩ S) = 0.095, so
P(C ∪ S) = P(C) + P(S) - P(C ∩ S)
= 0.150 + 0.545 - 0.095
= 0.600
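The job satisfaction probabilities can also be recomputed straight from the survey counts, which avoids the small rounding differences in the table of proportions. This is a sketch only; the dictionary layout and variable names are illustrative.

```python
# Survey counts: level -> (satisfied, unsatisfied)
counts = {
    "College": (74, 43),
    "High School": (224, 171),
    "Elementary": (126, 140),
}

n = sum(s + u for s, u in counts.values())            # 778 teachers in total

p_c = sum(counts["College"]) / n                      # P(C)
p_s = sum(s for s, _ in counts.values()) / n          # P(S)
p_c_and_s = counts["College"][0] / n                  # P(C and S)

p_c_given_s = p_c_and_s / p_s                         # conditional probability
p_s_given_c = p_c_and_s / p_c
p_c_or_s = p_c + p_s - p_c_and_s                      # Additive Rule

# C and S are independent only if P(C | S) equals P(C).
independent = abs(p_c_given_s - p_c) < 1e-9

print(round(p_c_given_s, 3), round(p_s_given_c, 3), round(p_c_or_s, 3), independent)
```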
Example
Tom and Dick are going to take
a driver's test at the nearest DMV office. Tom
estimates that his chances to pass the test are
70% and Dick estimates his as 80%. Tom and
Dick take their tests independently.
Define D = {Dick passes the driving test}
T = {Tom passes the driving test}
T and D are independent.
P (T) = 0.7, P (D) = 0.8
Example
What is the probability that at most one of the
two friends will pass the test?

P(at most one person passes)
= P(D^c ∩ T^c) + P(D^c ∩ T) + P(D ∩ T^c)
= (1 - 0.8)(1 - 0.7) + (0.7)(1 - 0.8) + (0.8)(1 - 0.7)
= .44

Alternatively,
P(at most one person passes)
= 1 - P(both pass) = 1 - 0.8 x 0.7 = .44
Example
What is the probability that at least one of the
two friends will pass the test?

P(at least one person passes)
= P(D ∪ T)
= 0.8 + 0.7 - 0.8 x 0.7
= .94

Alternatively,
P(at least one person passes)
= 1 - P(neither passes) = 1 - (1 - 0.8) x (1 - 0.7) = .94
Example
Suppose we know that only one of the two
friends passed the test. What is the probability
that it was Dick?

P(D | exactly one person passed)
= P(D ∩ exactly one person passed) / P(exactly one person passed)
= P(D ∩ T^c) / (P(D ∩ T^c) + P(D^c ∩ T))
= 0.8 x (1 - 0.7) / (0.8 x (1 - 0.7) + (1 - 0.8) x 0.7)
= .63
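The three Tom and Dick questions can be reproduced in a few lines. This is a minimal sketch that relies on the stated independence of T and D; the variable names are illustrative.

```python
p_t, p_d = 0.7, 0.8          # P(T), P(D); T and D are independent

p_both = p_t * p_d                                   # P(D and T)
p_neither = (1 - p_t) * (1 - p_d)                    # P(D^c and T^c)
p_exactly_one = p_d * (1 - p_t) + (1 - p_d) * p_t    # only Dick, or only Tom

p_at_most_one = 1 - p_both
p_at_least_one = 1 - p_neither

# Conditional probability that Dick is the one who passed
p_dick_given_exactly_one = p_d * (1 - p_t) / p_exactly_one

print(round(p_at_most_one, 2), round(p_at_least_one, 2), round(p_dick_given_exactly_one, 2))
```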
Random Variables
• A quantitative variable x is a random variable if
the value that it assumes, corresponding to the
outcome of an experiment is a chance or
random event.
• Random variables can be discrete or continuous.

• Examples:
x = SAT score for a randomly selected student
x = number of people in a room at a randomly
selected time of day
x = number on the upper face of a randomly
tossed die
Probability Distributions for Discrete
Random Variables
The probability distribution for a discrete
random variable x resembles the relative
frequency distributions we constructed in
Chapter 2. It is a graph, table or formula that
gives the possible values of x and the
probability p(x) associated with each value.

We must have
$$0 \le p(x) \le 1 \quad \text{and} \quad \sum p(x) = 1$$
Example
Toss a fair coin three times and
define x = number of heads.

Outcome   Probability   x
HHH       1/8           3
HHT       1/8           2
HTH       1/8           2
THH       1/8           2
HTT       1/8           1
THT       1/8           1
TTH       1/8           1
TTT       1/8           0

P(x = 0) = 1/8
P(x = 1) = 3/8
P(x = 2) = 3/8
P(x = 3) = 1/8

x    p(x)
0    1/8
1    3/8
2    3/8
3    1/8

[Figure: probability histogram for x]
Example
Toss two dice and define
x = sum of two dice.

x p(x)
2 1/36
3 2/36
4 3/36
5 4/36
6 5/36
7 6/36
8 5/36
9 4/36
10 3/36
11 2/36
12 1/36
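The table above can be generated by counting outcomes. The sketch below assumes two fair dice with 36 equally likely outcomes; `pmf` is an illustrative name.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely outcomes give each sum.
sum_counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# p(x) = (number of outcomes with sum x) / 36
pmf = {x: Fraction(c, 36) for x, c in sorted(sum_counts.items())}

for x, p in pmf.items():
    print(x, f"{p.numerator * (36 // p.denominator)}/36")
```

The loop rescales each reduced fraction back to a denominator of 36 so the printout matches the table's style.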
Probability Distributions
Probability distributions can be used to describe
the population, just as we described samples in
Chapter 2.
– Shape: Symmetric, skewed, mound-shaped…
– Outliers: unusual or unlikely measurements
– Center and spread: mean and standard
deviation. A population mean is called μ and a
population standard deviation is called σ.
The Mean
and Standard Deviation
Let x be a discrete random variable with
probability distribution p(x). Then the mean,
variance and standard deviation of x are given
as

Mean: $\mu = \sum x\,p(x)$
Variance: $\sigma^2 = \sum (x - \mu)^2\,p(x)$
Standard deviation: $\sigma = \sqrt{\sigma^2}$
Example
Toss a fair coin 3 times and record x,
the number of heads.

x    p(x)   xp(x)   (x - μ)²p(x)
0    1/8    0       (-1.5)²(1/8)
1    3/8    3/8     (-0.5)²(3/8)
2    3/8    6/8     (0.5)²(3/8)
3    1/8    3/8     (1.5)²(1/8)

$$\mu = \sum x\,p(x) = \frac{12}{8} = 1.5$$
$$\sigma^2 = \sum (x - \mu)^2\,p(x) = .28125 + .09375 + .09375 + .28125 = .75$$
$$\sigma = \sqrt{.75} = .866$$
Example
The probability distribution for x, the
number of heads in tossing 3 fair
coins.
• Shape? Symmetric; mound-shaped
• Outliers? None
• Center? μ = 1.5
• Spread? σ = .866
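The mean, variance and standard deviation formulas translate directly into code. This is a small sketch for the three-coin distribution; the function name `summarize` is illustrative. Note that the square root of .75 is about .866.

```python
from math import sqrt

def summarize(pmf):
    """Return (mean, variance, standard deviation) of a discrete pmf {x: p(x)}."""
    mu = sum(x * p for x, p in pmf.items())                # mu = sum of x p(x)
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())   # sigma^2 = sum of (x - mu)^2 p(x)
    return mu, var, sqrt(var)

# x = number of heads in three tosses of a fair coin
heads = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}
mu, var, sd = summarize(heads)

print(mu, var, round(sd, 3))
```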


Key Concepts
I. Experiments and the Sample Space
1. Experiments, events, mutually exclusive events,
simple events
2. The sample space

II. Probabilities
1. Relative frequency definition of probability
2. Properties of probabilities
a. Each probability lies between 0 and 1.
b. Sum of all simple-event probabilities equals 1.
3. P(A), the sum of the probabilities for all simple events in A
Key Concepts
III. Counting Rules
1. mn Rule; extended mn Rule
2. Permutations: $P_r^n = \frac{n!}{(n-r)!}$
3. Combinations: $C_r^n = \frac{n!}{r!\,(n-r)!}$
IV. Event Relations
1. Unions and intersections
2. Events
a. Disjoint or mutually exclusive: P(A ∩ B) = 0
b. Complementary: P(A) = 1 - P(A^C)
Key Concepts
3. Conditional probability: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
4. Independent and dependent events
5. Additive Rule of Probability:
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

6. Multiplicative Rule of Probability:
$P(A \cap B) = P(A)\,P(B \mid A)$

7. Law of Total Probability


8. Bayes’ Rule
Key Concepts
V. Discrete Random Variables and Probability
Distributions
1. Random variables, discrete and continuous
2. Properties of probability distributions
$0 \le p(x) \le 1$ and $\sum p(x) = 1$
3. Mean or expected value of a discrete random
variable: $\mu = \sum x\,p(x)$
4. Variance and standard deviation of a discrete
random variable: $\sigma^2 = \sum (x - \mu)^2\,p(x)$ and $\sigma = \sqrt{\sigma^2}$
Exercises
1. In how many ways can a committee of 5
students be formed from 8 students?
2. In how many ways can we arrange the letters of the word
STATISTICS?
3. Eight different colored marbles are to be
arranged in a row. How many different
arrangements will there be?
4. 5C3
5. 9C9
6. 5P3
7. A freshman student must
take a natural science, a
social science, and PE. If
there are 3 NatSci, 5 SocSci,
and 4 PE courses, in how many
different ways can a student
select the courses?
8. How many positive 3-digit
numbers can be formed
from 3, 5, 7 and 9 if
a) repetitions are not
allowed
b) repetitions are allowed
9. If you are to answer any 5
questions out of 9, in how many
ways can you answer them?
10. How many distinct
arrangements can be
made with the letters of
“PROBABILITY”?
11. If 2 dice are thrown, find
a) P(sum is 6)
b) P(sum is 5)
12. Find P(spade) when a card is drawn from a deck of
52 playing cards.
13. Find P(four spades in
succession without replacement).
14. Two cards are drawn from 52
playing cards. Find the probability
that the cards are both aces if the
first card is:
a) replaced
b) not replaced
• Determining the probability of events can be
complicated. There are two major probability
rules : the addition rule and multiplication
rule. These rules provide the foundation
necessary for understanding the inference test
that follows.
The Addition Rule
• If A and B are any two events, then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Mutually Exclusive Events

• If A and B are mutually exclusive events, then
P(A ∪ B) = P(A) + P(B)
Addition Rule
for
Probability
This is Rita.
Are the statements TRUE or FALSE?

“and” means both must be true
“or” means one or the other (or both) are true

Rita is playing the violin and soccer. FALSE

Rita is playing the violin or soccer. TRUE


[Figure: two street maps showing Elm St. crossing Maple St.]

Which one is “Elm and Maple”? Which one is “Elm or Maple”?

Elm and Maple: This is called INTERSECTION. Like when two streets cross.
Elm or Maple: This is called UNION. Like when you put the North and the South together.
Next we will look at Venn Diagrams.
In a Venn Diagram the box represents
the entire sample space.
Members that fit Event A go in circle A.
Members that fit Event B go in circle B.

[Figure: two Venn diagrams of overlapping circles A and B]

Which is “A and B”? Which is “A or B”?

Event A and B: This is called INTERSECTION.
Event A or B: This is called UNION.
The Addition Rule for Probability

[Figure: Venn diagrams illustrating (A or B) = A + B − (A and B)]

P(A or B) = P(A) + P(B) - P(A and B)

Adding P(A) and P(B) counts the overlap of A and B twice. That is
one extra time, so we need to subtract off the extra time.
Example #1)
Given the following probabilities:
P(A)=0.8 P(B)=0.3 P(A and B)=0.2
Find the P(A or B).

This can be solved two ways.


1. Using Venn Diagrams
2. Using the formula
We will solve it both ways.
Example #1 (continued)
P(A)=0.8 P(B)=0.3 P(A and B)=0.2
Find the P(A or B).

Solution using Venn Diagrams:

A B In this example we
will fill up the
Venn Diagram
with probabilities.
Example #1 (continued)
P(A)=0.8 P(B)=0.3 P(A and B)=0.2
Find the P(A or B).

Solution using Venn Diagrams:


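First fill in where the events overlap. The probability that a student fits the event A and B is 0.2.
The probability that a student fits the event A is 0.8. That means the entire A circle must add up to 0.8.
The probability that a student fits the event B is 0.3. That means the entire B circle must add up to 0.3.
The box represents the entire sample space and must add up to 1.

[Venn diagram: A only = 0.6, A and B = 0.2, B only = 0.1, outside both circles = 0.1]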
Then find the probability of A or B.
I will start by shading A or B.
Then I will add up the probabilities in the shaded area.

[Venn diagram: A only = 0.6, A and B = 0.2, B only = 0.1, outside both circles = 0.1]

P(A or B) = 0.6 + 0.2 + 0.1
= 0.9 Answer
Example #1 (continued)
P(A)=0.8 P(B)=0.3 P(A and B)=0.2
Find the P(A or B).

Solution using the formula:

P(A or B) = P(A) + P(B) - P(A and B)

= 0.8 + 0.3 - 0.2

= 0.9 Answer
Example #2.)
There are 50 students. 18 are taking
English. 23 are taking Math. 10 are
taking English and Math.
If one is selected at random, find the
probability that the student is taking
English or Math.

E = taking English
M = taking Math
Example #2 (continued) There are 50 students.
18 are taking English. 23 are taking Math. 10
are taking English and Math.
If one is selected at random, find the probability
that the student is taking English or Math.

Solution using Venn Diagrams:

In this example
E M we will fill up the
Venn Diagram
with the number
of students.
Example #2 (continued) There are 50 students.
18 are taking English. 23 are taking Math. 10
are taking English and Math.
If one is selected at random, find the probability
that the student is taking English or Math.

Solution using Venn Diagrams:


First fill in where the events overlap. The number of students taking English and Math is 10.
The number of students taking English is 18. That means the number of students taking English must add up to 18.
The number of students taking Math is 23. That means the number of students taking Math must add up to 23.
The box represents the entire sample space and must add up to 50.

[Venn diagram: English only = 8, English and Math = 10, Math only = 13, neither = 19]
Then find the probability of English or Math.
I will start by shading E or M.
Then I will find the probability in the shaded area.

[Venn diagram: English only = 8, English and Math = 10, Math only = 13, neither = 19]

P(E or M) = (8 + 10 + 13)/50
= 0.62
Example #2 (continued) There are 50 students. 18
are taking English. 23 are taking Math. 10 are
taking English and Math.
If one is selected at random, find the probability
that the student is taking English or Math.

Solution using the formula:


P(E or M) = P(E) + P(M) - P(E and M)
= 18/50 + 23/50 - 10/50
= 0.62
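Both solution methods for Example #2 can be sketched side by side: filling in the Venn diagram regions, then applying the Addition Rule. The variable names below are illustrative.

```python
from fractions import Fraction

total, english, math_count, both = 50, 18, 23, 10

# Venn diagram regions
only_english = english - both
only_math = math_count - both
neither = total - (only_english + both + only_math)

# Method 1: add up the shaded regions of "E or M"
p_venn = Fraction(only_english + both + only_math, total)

# Method 2: Addition Rule, P(E or M) = P(E) + P(M) - P(E and M)
p_rule = Fraction(english, total) + Fraction(math_count, total) - Fraction(both, total)

print(only_english, both, only_math, neither)
print(float(p_venn), float(p_rule))
```

The same four inputs (total, two event counts, overlap) are all that Class Activity #1 provides, so the sketch could be reused there by changing the numbers.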
Class Activity #1)
There are 1580 people in an
amusement park. 570 of these
people ride the roller coaster. 700 of
these people ride the merry-go-round.
220 of these people ride the roller
coaster and merry-go-round.
If one person is selected at
random, find the probability that
that person rides the roller
coaster or the merry-go-round.
a.) Solve using Venn Diagrams.
b.) Solve using the formula for
the Addition Rule for Probability.
Example #3) Population of apples and pears.

Each member of this population can


be described in two ways.
1. Type of fruit
2. Whether it has a worm or not
We will make a table to organize this data.
Example #3) Population of apples and pears.

          no worm   worm   total
apple        5        3       8
pear         4        2       6
total        9        5      14 (grand total)
Ex. #3 (continued)

Experiment: One is selected at random.


Find the probability that . . .
a.) . . . it is a pear and has a worm.
b.) . . . it is a pear or has a worm.
Ex. #3 (continued)
Solution to #3a.)

P(pear and worm) = 2/14 ≈ 0.1429
Ex. #3 (continued)
Solution to #3b.)

P(pear or worm) = (4 + 2 + 3)/14 ≈ 0.6429
Ex. #3 (continued)
Alternate Solution to #3b.)
P(pear or worm) = P(pear) + P(worm) – P(pear and worm)
= 6/14 + 5/14 - 2/14
≈ 0.6429 Answer
Class Activity #2)

There are three modes of transportation – horse, bike, &
canoe. Each has a person or does not have a person.
1.) Make a table to represent this data.
2.) If one is selected at random find the following:
a.) P( horse or has a person)
b.) P( horse and has a person)
c.) P( bike or does not have a person)
Probability of Events
• According to the classical probability rule, the
probability of an event E is equal to the
number of outcomes occurring in event E
divided by the total number of outcomes in
the experiment.
• $P(E) = \frac{\text{Number of outcomes in } E}{\text{Total number of outcomes for the experiment}}$
Properties of Probability
a) The probability of a sample space S is 1,
i.e. P(S) = 1
b) The probability of a null set is 0, i.e. P(φ) = 0
c) The probability of an event E always lies in the
range from 0 to 1, i.e. 0 ≤ P(E) ≤ 1
d) The sum of the probabilities of all
simple events (or final outcomes) for an
experiment is always 1
1. Find the variance and the standard deviation of
the data set for Brand A paint sold in different
weeks:
10, 60, 50, 30, 40, 20
2. The mean for the number of pages of a sample of
women’s fitness magazine is 132, with a
variance of 23; the mean for the number of
advertisements of a sample of men’s magazine
is 182, with a variance of 62. Compute the
variations
3. A card is drawn from an ordinary deck. Find
these probabilities.
a) Of getting a jack
b) Of getting a 3 or a diamond
4. If a family has 3 children, find the probability
that all the children are boys.
5. When a single die is rolled, find the
probability of getting a 9.
6. In a sample of 50 people, 21 had type O
blood, 22 had type A blood, 5 had type B
blood, and 2 had type AB blood. Find the
probabilities:
a) A person has type O blood
b) A person has type A or type B blood
c) A person does not have type AB blood
7. Compute 5C4 and 3P2
Exercises 1
1. A probability experiment is conducted. Which of
these cannot be considered a probability of an
outcome?
a) 1/3 b) -0.59 c) 1 d) 0.80
e) 1.45 f) 112% g) 33%
2. If a die is rolled one time, find these
probabilities:
a) Of getting an even number
b) Of getting a number greater than 3 and an odd number
Complementary Events
The complement of an event E is the set of
outcomes in the sample space that are not
included in the outcomes of event E. The
complement of E is denoted by $\bar{E}$.

$$P(\bar{E}) = 1 - P(E)$$
Classical and Empirical Probabilities
• The difference between classical and empirical
probability is that classical probability
assumes that certain outcomes are equally
likely (such as the outcomes when a die is
rolled), while empirical probability relies on
actual experience to determine the likelihood
of outcomes.
Given a frequency distribution, the probability
of an event being in a given class is
$$P(E) = \frac{f}{n}$$
where f is the frequency for the class and n is
the total of the frequencies in the distribution.
Addition Rules for Probability
Two events are mutually exclusive events if they
cannot occur at the same time (i.e., they have
no outcomes in common).
When two events A and B are mutually
exclusive, the probability that A or B will occur
is P(A or B) = P(A) + P(B)
If A and B are not mutually exclusive, then
P(A or B) = P(A) + P(B) - P(A and B)
Exercises
1. Define mutually exclusive events, and give an
example of two events that are mutually
exclusive and two events that are not mutually
exclusive.
2. Determine whether these events are mutually
exclusive
a) Roll a die. Get an even number and get a
number less than 3.
b) Roll a die. Get a prime number (2,3,5) and get
an odd number.
Exercises
1. A card is drawn from a well-shuffled deck of
52 playing cards. What is the probability that
the card drawn is:
a) Diamond b) queen of hearts
2. A group of scientists consists of 7 chemists, 4
biologists, and 5 physicists. If a scientist is
randomly chosen, find the probability that
the scientist is
a) a physicist b) a chemist or a biologist
3. There are 600 male and 200 female
engineering students and 80 male and 320
female education students in a certain
university. If a student is chosen at random,
find the probability that the student is:
a) a female
b) a male engineering student
c) an education student
4. An urn contains 4 green marbles and 6 red
marbles. Let E be the event “ first marble is
red” and B be the event “second marble is
red” and the marbles are not replaced after
being drawn. Find the probability that both
marbles are RED.
5. Find the complement of each event
a) Rolling a die and getting a 5
b) Selecting a month that begins with a J
6. A sales representative who visits customers at home
finds she sells 0, 1, 2, 3, or 4 items according to the
following frequency distribution
items sold frequency
0 8
1 10
2 3
3 2
4 1
Find the probability that she sells
a) exactly 1 item
b) more than 2 items
c) at least 1 item
d) at most 3 items
7. Three fair coins are tossed. What is the
probability that
a) Three HEADS appear
b) Two HEADS and a TAIL appear
8. How many ways can you arrange 5 books in a
row?
Multiplication Rules and Conditional
Probability
Multiplication Rule 1
When two events are independent, the
probability of both occurring is
P(A and B) = P(A) P(B)
A coin is flipped and a die is rolled. Find the
probability of getting a HEAD on the coin and
a 4 on the die.
$$P(H \text{ and } 4) = P(H)\,P(4) = \frac{1}{2} \cdot \frac{1}{6} = \frac{1}{12}$$
Dependent Events
When the outcome or occurrence of the first
event affects the outcome or occurrence of the
second event in such a way that the probability is
changed, the events are said to be DEPENDENT
EVENTS.
Examples:
a) Drawing a card, not replacing it, then drawing a
second card
b) Having high grades and getting a scholarship
c) Being a lifeguard and getting a suntan
Conditional Probability
• The conditional probability of an event B in
relationship to an event A is the probability
that event B occurs after event A has already
occurred.
Multiplication Rule 2
When two events are dependent, the
probability of both occurring is
P(A and B) = P(A) P(B|A)
Independent and Dependent Events

Given two events, E1 and E2, if the


occurrence or non-occurrence of
E1 does not affect the probability
of occurrence of E2, then we say
that E1 and E2 are independent
events. Otherwise, E1 and E2 are
dependent events.
If we let E1E2 denote the event
that “both E1 and E2 will occur”,
then the probability P(E1E2) is
given by:
P(E1E2) = P(E1) x P(E2) if E1 and
E2 are independent events
Continuous
Probability
Distributions

© 2002 Thomson / South-Western Slide 6-312


Learning Objectives
• Understand concepts of the uniform distribution.
• Appreciate the importance of the normal
distribution.
• Recognize normal distribution problems, and know
how to solve them.
• Decide when to use the normal distribution to
approximate binomial distribution problems, and
know how to work them.
• Decide when to use the exponential distribution to
solve problems in business, and know how to work
them.

Uniform Distribution
The uniform distribution is a continuous distribution
in which the same height, f(x), is obtained over a
range of values.

$$f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{for all other values} \end{cases}$$

[Figure: rectangle of height 1/(b − a) over the interval from a to b; Area = 1]
Example: Uniform Distribution
of Lot Weights

$$f(x) = \begin{cases} \dfrac{1}{47-41} = \dfrac{1}{6} & \text{for } 41 \le x \le 47 \\ 0 & \text{for all other values} \end{cases}$$

[Figure: rectangle of height 1/6 over the interval from 41 to 47; Area = 1]
Example: Uniform Distribution,continued
Mean and Standard Deviation
Mean
$$\mu = \frac{a+b}{2} = \frac{41+47}{2} = \frac{88}{2} = 44$$

Standard Deviation
$$\sigma = \frac{b-a}{\sqrt{12}} = \frac{47-41}{\sqrt{12}} = \frac{6}{3.464} = 1.732$$
Example: Uniform Distribution
Probability, continued

$$P(x_1 \le X \le x_2) = \frac{x_2 - x_1}{b - a}$$
$$P(42 \le X \le 45) = \frac{45 - 42}{47 - 41} = \frac{1}{2}$$

[Figure: uniform density over 41 to 47 with the region from 42 to 45 shaded; Area = 0.5]
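The lot weight example can be reproduced with a few lines of code. This is a sketch only: `uniform_prob` is an illustrative helper, and it clips the requested interval to [a, b] because the density is zero outside that range.

```python
from math import sqrt

def uniform_prob(x1, x2, a, b):
    """P(x1 <= X <= x2) for X uniform on [a, b]."""
    lo, hi = max(x1, a), min(x2, b)      # the density is 0 outside [a, b]
    return max(hi - lo, 0) / (b - a)

a, b = 41, 47
mean = (a + b) / 2                # mu = (a + b) / 2
sd = (b - a) / sqrt(12)           # sigma = (b - a) / sqrt(12)
p_42_to_45 = uniform_prob(42, 45, a, b)

print(mean, round(sd, 3), p_42_to_45)
```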
FINALS
The Normal Distribution
• A widely known and much-used distribution
that fits the measurements of many human
characteristics and most machine-produced
items. Many other variables in business and
industry are normally distributed.
• The normal distribution and its associated
probabilities are an integral part of statistical
quality control.
Characteristics of the
Normal Distribution
• Continuous distribution
• Symmetrical distribution
• Asymptotic to the horizontal axis
• Unimodal
• A family of curves
• Total area under the curve sums to 1.
• Area to right of mean is 1/2.
• Area to left of mean is 1/2.

[Figure: bell curve centered at μ on the X axis, with area 1/2 on each side of the mean]
Probability Density Function
of the Normal Distribution

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

Where:
μ = mean of X
σ = standard deviation of X
π = 3.14159 . . .
e = 2.71828 . . .
Standardized Normal Distribution
• A normal distribution with
– a mean of zero, and
– a standard deviation of one
• Z Formula
– standardizes any normal distribution
$$Z = \frac{X - \mu}{\sigma}$$
• Z Score
– computed by the Z Formula
– the number of standard deviations which a value is away from the mean

[Figure: standard normal curve centered at 0 with σ = 1]
Z Table
Second Decimal Place in Z
Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.00 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.10 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.20 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.30 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517

0.90 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.00 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.10 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.20 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015

2.00 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817

3.00 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
3.40 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4998
3.50 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998

Table Lookup of a
Standard Normal Probability

$$P(0 \le Z \le 1) = 0.3413$$

Z      0.00     0.01     0.02
0.00   0.0000   0.0040   0.0080
0.10   0.0398   0.0438   0.0478
0.20   0.0793   0.0832   0.0871
1.00   0.3413   0.3438   0.3461
1.10   0.3643   0.3665   0.3686
1.20   0.3849   0.3869   0.3888

[Figure: standard normal curve over −3 to 3 with the area between 0 and 1 shaded]
Exercises
Consider a standard normal random variable
with μ = 0 and a standard deviation σ = 1. Find
the following probabilities:
1. 𝑃 𝑧 < 2
2. 𝑃(𝑧 > 1.16)
3. 𝑃 −2.32 < 𝑧 < 2.05
4. 𝑃(−1.6 < 𝑧 < 2)
5. 𝑃(2 < 𝑧 < 2.76)
• Calculate and sketch the area under the
standard normal curve to the:
6. 𝑙𝑒𝑓𝑡 𝑜𝑓 𝑧 = 1.23
7. 𝑟𝑖𝑔ℎ𝑡 𝑜𝑓 𝑧 = 2.3
8. 𝑏𝑒𝑡𝑤𝑒𝑒𝑛 𝑧 = 1.3 𝑎𝑛𝑑 2.2
9. 𝑙𝑒𝑓𝑡 𝑜𝑓 𝑧 = −2.1
10. 𝑟𝑖𝑔ℎ𝑡 𝑜𝑓 𝑧 = −1.1
Bacteria in Drinking Water
Suppose the numbers of a particular type of
bacteria in samples of 1 milliliter (ml) of
drinking water tend to be approximately
normally distributed, with a mean μ = 85 and a
standard deviation σ = 9.
What is the probability that a given 1 ml sample
will contain more than 100 bacteria?
Applying the Z Formula:
Example: Assume X is normally distributed with μ = 485 and σ = 105.

$$P(485 \le X \le 600) = P(0 \le Z \le 1.10) = .3643$$

For X = 485,
$$Z = \frac{X - \mu}{\sigma} = \frac{485 - 485}{105} = 0$$
For X = 600,
$$Z = \frac{X - \mu}{\sigma} = \frac{600 - 485}{105} = 1.10$$

Z      0.00     0.01     0.02
0.00   0.0000   0.0040   0.0080
0.10   0.0398   0.0438   0.0478
1.00   0.3413   0.3438   0.3461
1.10   0.3643   0.3665   0.3686
1.20   0.3849   0.3869   0.3888
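Instead of a table lookup, the normal probabilities can be computed from the error function in Python's standard `math` module. This is a minimal sketch; `normal_cdf` is an illustrative helper. The result for the μ = 485, σ = 105 example differs slightly from the table answer because the table method rounds Z = 115/105 to 1.10.

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma                   # Z Formula
    return 0.5 * (1 + erf(z / sqrt(2)))    # standard normal CDF via the error function

# Table lookup example: P(0 <= Z <= 1)
p_0_to_1 = normal_cdf(1) - normal_cdf(0)

# Z Formula example: P(485 <= X <= 600) with mu = 485, sigma = 105
p_485_to_600 = normal_cdf(600, 485, 105) - normal_cdf(485, 485, 105)

print(round(p_0_to_1, 4))
print(round(p_485_to_600, 4))
```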
Normal Approximation
of the Binomial Distribution
• The normal distribution can be used to
approximate binomial probabilities
• Procedure
– Convert binomial parameters to normal parameters
– Does the interval μ ± 3σ lie between 0 and n? If so,
continue; otherwise, do not use the normal
approximation.
– Correct for continuity
– Solve the normal distribution problem
Using the Normal Distribution to Work
Binomial Distribution Problems

• The normal distribution can be used to
approximate the probabilities in binomial
distribution problems that involve large
values of n.
• To work a binomial problem by the normal
distribution requires conversion of the n
and p of the binomial distribution to the µ
and σ of the normal distribution.
Normal Approximation of Binomial:
Parameter Conversion
• Conversion equations
$$\mu = n \cdot p \qquad \sigma = \sqrt{n \cdot p \cdot q}$$
• Conversion example:
Given that X has a binomial distribution, find
$P(X \ge 25 \mid n = 60 \text{ and } p = .30)$.
$$\mu = n \cdot p = (60)(.30) = 18$$
$$\sigma = \sqrt{n \cdot p \cdot q} = \sqrt{(60)(.30)(.70)} = 3.55$$
Normal Approximation of Binomial:
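Interval Check

$$\mu \pm 3\sigma = 18 \pm 3(3.55) = 18 \pm 10.65$$
$$\mu - 3\sigma = 7.35$$
$$\mu + 3\sigma = 28.65$$

Both endpoints lie between 0 and n = 60, so the normal approximation can be used.

[Figure: number line from 0 to 70 showing the interval from 7.35 to 28.65 inside 0 to n = 60]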
Normal Approximation of Binomial:
Correcting for Continuity

Values Being Determined    Correction
X >                        +.50
X ≥                        -.50
X <                        -.50
X ≤                        +.50
≤ X ≤                      -.50 and +.50
< X <                      +.50 and -.50

The binomial probability
$P(X \ge 25 \mid n = 60 \text{ and } p = .30)$
is approximated by the normal probability
$P(X \ge 24.5 \mid \mu = 18 \text{ and } \sigma = 3.55)$.
Normal Approximation of Binomial:
Graphs

[Figure: binomial distribution for n = 60, p = .30 with the approximating normal curve; horizontal axis X from 6 to 30, vertical axis probability from 0 to 0.12]
Normal Approximation of Binomial:
Computations

Exact binomial probabilities:
X       P(X)
25      0.0167
26      0.0096
27      0.0052
28      0.0026
29      0.0012
30      0.0005
31      0.0002
32      0.0001
33      0.0000
Total   0.0361

The normal approximation:
$$P(X \ge 24.5 \mid \mu = 18 \text{ and } \sigma = 3.55) = P\left(Z \ge \frac{24.5 - 18}{3.55}\right) = P(Z \ge 1.83)$$
$$= .5 - P(0 \le Z \le 1.83) = .5 - .4664 = .0336$$
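The comparison between the exact binomial probability and its normal approximation can be sketched as follows. The helper names are illustrative, and the normal CDF is computed from `math.erf` rather than a Z table, so the approximation may differ from .0336 in the fourth decimal place.

```python
from math import comb, erf, sqrt

def binom_pmf(k, n, p):
    """Exact binomial probability P(X = k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, p = 60, 0.30
mu = n * p                          # conversion: mu = n p
sigma = sqrt(n * p * (1 - p))       # conversion: sigma = sqrt(n p q)

# Interval check: mu +/- 3 sigma must lie between 0 and n
interval_ok = 0 <= mu - 3 * sigma and mu + 3 * sigma <= n

# Exact binomial tail P(X >= 25)
exact = sum(binom_pmf(k, n, p) for k in range(25, n + 1))

# Normal approximation with continuity correction: P(X >= 24.5)
approx = 1 - normal_cdf(24.5, mu, sigma)

print(interval_ok, round(sigma, 2), round(exact, 4), round(approx, 4))
```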
Exponential Distribution
• Continuous
• Family of distributions
• Skewed to the right
• X varies from 0 to infinity
• Apex is always at X = 0
• Steadily decreases as X gets larger
• Probability function
$$f(X) = \lambda e^{-\lambda X} \quad \text{for } X \ge 0,\ \lambda > 0$$
Graphs of Selected Exponential
Distributions
[Figure: exponential density curves for several values of λ; horizontal axis X from 0 to 8, vertical axis f(X) from 0.0 to 2.0]
Exponential Distribution Example:
Probability Computation
$$P(X \ge X_0) = e^{-\lambda X_0}$$
$$P(X \ge 2 \mid \lambda = 1.2) = e^{-(1.2)(2)} = .0907$$

[Figure: exponential density with λ = 1.2; horizontal axis X from 0 to 5, vertical axis from 0.0 to 1.2]
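The exponential tail probability is a one-line computation. The sketch below uses an illustrative helper name, `exp_tail`, and the λ = 1.2 value from the example.

```python
from math import exp

def exp_tail(x0, lam):
    """P(X >= x0) = e^(-lam * x0) for an exponential distribution with rate lam."""
    return exp(-lam * x0)

p_at_least_2 = exp_tail(2, 1.2)      # P(X >= 2 | lambda = 1.2)
p_less_than_2 = 1 - p_at_least_2     # complement

print(round(p_at_least_2, 4), round(p_less_than_2, 4))
```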
Theoretical Distribution
• Empirical distributions
– based on data
• Theoretical distribution
– based on mathematics
• derived from model or estimated from data
Normal Distribution
• Symmetrical, bell-shaped curve
• Also known as Gaussian distribution
• Point of inflection = 1 standard deviation
from mean
• Mathematical formula

f(X) = (1 / (σ√(2π))) e^(−(X − μ)² / (2σ²))
Key Areas under the Curve

• For normal
distributions
± 1 SD ~ 68%
± 2 SD ~ 95%
± 3 SD ~ 99.7%
Example: IQ, mean = 100, s = 15
Normal Probability Distributions
Standard Normal Distribution – N(0,1)
• We agree to use the
standard normal
distribution
• Bell shaped
• μ = 0
• σ = 1
• Note: not all bell
shaped distributions
are normal
distributions
Normal Probability Distribution
• Can take on an
infinite number of
possible values.
• The probability of
any one of those
values occurring is
essentially zero.
• Curve has area or
probability = 1
Normal Distribution
• The standard normal distribution will
allow us to make claims about the
probabilities of values related to our own
data
• How do we apply the standard normal
distribution to our data?
Z-score
If we know the population mean and
population standard deviation, for any
value of X we can compute a z-score by
subtracting the population mean and
dividing the result by the population
standard deviation

z = (X − μ) / σ
Important z-score info
• Z-score tells us how far above or below the
mean a value is in terms of standard
deviations
• It is a linear transformation of the original
scores
– Multiplication (or division) of and/or addition to
(or subtraction from) X by a constant
– Relationship of the observations to each other
remains the same
Z = (X − μ)/σ
then
X = σZ + μ
[equation of the general form Y = mX + c]
Probabilities and z scores: z tables
• Total area = 1
• Only have a probability from width
– For an infinite number of z scores each point
has a probability of 0 (for the single point)
• Typically negative values are not reported
– Symmetrical, therefore area below negative
value = Area above its positive value
• Always helps to draw a sketch!
Probabilities are depicted by areas under the curve

• Total area under the curve


is 1
• The area in red is equal to
p(z > 1)
• The area in blue is equal to
p(-1< z <0)
• Since the properties of the
normal distribution are
known, areas can be looked
up on tables or calculated
on computer.
Strategies for finding probabilities for the
standard normal random variable.

• Draw a picture of standard normal


distribution depicting the area of interest.
• Re-express the area in terms of shapes like
the one on top of the Standard Normal
Table
• Look up the areas using the table.
• Do the necessary addition and subtraction.
Suppose Z has standard normal distribution
Find p(0<Z<1.23)
Find p(-1.57<Z<0)
Find p(Z>.78)
Z is standard normal
Calculate p(-1.2<Z<.78)
Example
• Data come from distribution: μ = 10, σ = 3
• What proportion fall beyond X = 13?
• Z = (13 − 10)/3 = 1
• =1-NORMSDIST(1) or table → 0.1587
• 15.9% fall above 13
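
The same calculation can be done without a table. A minimal sketch, assuming Python 3.8 or later for `statistics.NormalDist`; variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, x = 10, 3, 13

# Convert the raw value to a z-score
z = (x - mu) / sigma

# Proportion beyond x is the upper tail: 1 minus the cumulative area up to z
upper_tail = 1 - NormalDist().cdf(z)

print(z, round(upper_tail, 4))
```

Note that the cumulative function gives the area below z, so the upper tail needs the subtraction from 1.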
Example: IQ
• A common example is IQ
• IQ scores are theoretically normally
distributed.
• Mean of 100
• Standard deviation of 15
IQ’s are normally distributed with mean 100 and standard
deviation 15. Find the probability that a randomly selected
person has an IQ between 100 and 115

P(100 ≤ X ≤ 115)
  = P(100 − 100 ≤ X − 100 ≤ 115 − 100)
  = P((100 − 100)/15 ≤ (X − 100)/15 ≤ (115 − 100)/15)
  = P(0 ≤ Z ≤ 1) = .3413
GRE scores are normally distributed with mean 500 and
standard deviation 100. Find the probability that a randomly selected
GRE score is greater than 620.

• We want to know the probability
of getting a score of 620 or beyond.

z = (620 − 500) / 100 = 1.2

• p(z > 1.2)
• Result: The probability of randomly
getting a score of 620 or higher is ~.12
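
Both the IQ and the GRE examples can be verified directly on the original scale, since the software does the standardizing internally. A brief sketch with illustrative variable names:

```python
from statistics import NormalDist

# IQ example: P(100 <= X <= 115) with mean 100 and standard deviation 15
iq = NormalDist(100, 15)
p_iq = iq.cdf(115) - iq.cdf(100)

# GRE example: P(X > 620) with mean 500 and standard deviation 100
gre = NormalDist(500, 100)
p_gre = 1 - gre.cdf(620)

print(round(p_iq, 4), round(p_gre, 4))
```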
Exercises
• Let z represent the standard normal variable.
Suppose a value of z is randomly selected. To
find each of the following probabilities, (1)
draw the standard normal curve and indicate
the area representing the probability, (2)
express the probability in terms of areas from
0 to appropriate values obtained, (3) calculate
the answer.
1. P(z< 1.41) 2. P(z<-1.72)
3. P(z>1.51) 4. P(z>-2.43)
5. P(-2.02 < z <1.74) 6. P(1.02 < z < 1.84)
7. Between -0.67 and 0
8. Less than 1.96
9. Within 1 standard deviation of the mean
10.Within 3 standard deviations of the mean
• Assume the standard normal distribution. Fill
in the blanks
1. P(z < _ ) =0.9772
2. P(z > _ ) = 0.5
3. P( z > _ ) = 0.9599
• Consider a normal population with mean 200
and standard deviation 25. Find the following:

1. The 90th percentile
2. The 30th percentile
3. The 70th percentile
4. The 45th percentile
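
Percentile problems like these run the table lookup in reverse: start from an area and find the value. One possible way to check hand answers is sketched below, assuming Python 3.8 or later; the dictionary name is illustrative.

```python
from statistics import NormalDist

population = NormalDist(200, 25)

# inv_cdf takes a cumulative area (0 to 1) and returns the matching value
percentiles = {k: population.inv_cdf(k / 100) for k in (90, 30, 70, 45)}

for k, value in percentiles.items():
    print(f"{k}th percentile: {value:.2f}")
```

Because the normal curve is symmetric, the 30th and 70th percentiles sit the same distance below and above the mean of 200.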
17 February 2016
1. Draw the normal curve, shade the
required area, then find:
a) The area from z = 0.43 to z = 1.62
b) The area to the left of z = −2.04
c) The area between z = −1.2 and z = +2.07
2. Given a mean of 160 and standard
deviation of 15, find the area to the right of
180
3. In an examination, the mean grade is 72,
and the standard deviation is 6
a) Find the probability that a particular
student will have a score higher than or
equal to 75
b) …. from 65 to 80
c) If there are 150 students who took the
exam, how many have scores between 65 and
80?
Work time...
• What is the area for scores less than z = -1.5?
• What is the area between z =1 and 1.5?
• What z score cuts off the highest 30% of the
distribution?
• What two z scores enclose the middle 50%
of the distribution?
• If 500 scores are normally distributed with
mean = 50 and SD = 10, and an investigator
throws out the 20 most extreme scores,
what are the highest and lowest scores that
are retained?
Standard Scores
• Z is not the only transformation of scores to be
used
• First convert whatever score you have to a z
score.
• New score = new s.d. × (z) + new mean
• Example: T scores have a mean of 50 and an s.d. of 10
– Then T = 10(z) + 50.
• Examples of standard scores: IQ, GRE, SAT
Standard Normal Distribution

• History
• The normal curve was developed mathematically in 1733 by DeMoivre
as an approximation to the binomial distribution. His paper was not
discovered until 1924 by Karl Pearson. Laplace used the normal curve
in 1783 to describe the distribution of errors. Subsequently, Gauss used
the normal curve to analyze astronomical data in 1809. The normal
curve is often called the Gaussian distribution. The term bell-shaped
curve is often used in everyday usage.
Example
1. On a final examination in Statistics, the mean
= 76 and its standard deviation = 10. Find:
a. the standard score of the student when
receiving the grade of 90

b. The grade corresponding to the z-score of


-1
2. Find the z value for each of the following x values for a
normal distribution with a mean = 30 and s = 5
a. X = 37 b. X = 19
c. X = 23 and d. x = 44

3. Find the following areas under a normal distribution


curve
a. Area between z = 2.0 and z =- 2.4
b. Area from z = 2.3 to z = 2.5
c. Area to the left of z = 2.1 and
d. to the right of z = -1.7
SPORTS
Today in history ( September 1)

1972 - Bobby Fischer (US) defeats Boris


Spassky (USSR) for world chess title

1973 - George Foreman KOs Jose "King"


Roman in 1 for heavyweight boxing title
Today in Philippine History,
• On September 1, 1909, Baguio, then a
municipality of Benguet province in Northern
Luzon, was declared a chartered city by
virtue of Act No. 1963.
• Then Governor General William Cameron
Forbes directed Justice George Malcolm, a
young lawyer in the American-led Philippine
government, to write the city's charter.
• The name of the city is derived from the word
"bagiw" in Ibaloi, the indigenous language of
the Benguet Region meaning "moss."
Areas under the Normal Curve
Units of measurement are converted into
standard units (standard scores or z-scores) by
means of the formula
z = (x – xm) / s
where z = standard score
xm = mean
s = standard deviation
x = given value of a particular variable
Exercises
1. Find the z value for a normal distribution
with mean = 30 and standard deviation = 5 if
a) x = 44 b) x = 23 c) x = 15

2. Find the areas under the normal curve with


mean = 20 and standard deviation = 4
a) Area between x = 15 and x = 26
b) Area from x = 20 to x = 23
c) Area to the left of x = 14
Calculate the following probabilities
using the Table under the Normal
Curve
1.P(z ≥ 1.5)
2.P(z ≤ -1.2)
3.P(1.6 ≤ z ≤ 2.5)
4.P(z ≤ -1.2 or z ≥ 1.7)
Assignment:
1. On a final examination in Statistics,
the mean = 76 and its standard
deviation = 10. Find:
a. the standard score of the student
when receiving the grade of 90
b. The grade corresponding to the
z-score of -1
2. Find the z value for each of the following x
values for a normal distribution with a mean
= 30 and s = 5
a. X = 37 b. X = 19
c. X = 23 and d. x = 44
3. Find the following areas under a normal
distribution curve
a. Area between z = 2.0 and z =- 2.4
b. Area from z = 2.3 to z = 2.5
c. Area to the left of z = 2.1 and
d. to the right of z = -1.7
Applications of the Normal Curve
• The actual amount of instant coffee that a
filling machine deposits into “6 ounce” jars
varies from jar to jar, and may be looked upon
as a random variable having normal
distribution with a standard deviation of 0.04
ounce. If only 2 percent of the jars are to
contain less than 6 ounces of coffee, what
must be the mean fill of these jars?
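Solution:

Given: σ = 0.04, x = 6.00, μ = ?

Area between the mean and x: 0.5 − 0.0200 = 0.4800

From the table, an area of 0.4800 corresponds to z = 2.05; because x lies below the mean, z = −2.05
Thus −2.05 = (6.00 − μ)/0.04, so μ = 6.00 + 2.05(0.04) ≈ 6.08 ounces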
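
The coffee-jar solution can be cross-checked by solving z = (x − μ)/σ for μ. A minimal sketch; variable names are illustrative, and the z value comes from the inverse normal function rather than the printed table, so it differs slightly from −2.05.

```python
from statistics import NormalDist

sigma = 0.04        # standard deviation of the fill, in ounces
x = 6.00            # labeled content
lower_tail = 0.02   # only 2 percent of jars may fall below x

# z-score that leaves 2 percent of the area to its left (about -2.05)
z = NormalDist().inv_cdf(lower_tail)

# Rearranging z = (x - mu) / sigma gives mu = x - z * sigma
mean_fill = x - z * sigma

print(round(z, 2), round(mean_fill, 2))
```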
Application
2. In a certain restaurant, the processing and
serving time of dishes to customer may be
looked upon as a random variable having the
normal distribution with a mean of 16.20
minutes and a standard deviation of 0.52
minutes. Find the probabilities that the time it
takes to prepare one of the dishes will be:
a) At least 17 minutes
b) Anywhere from 16 min to 18 min
3. Find the standard normal curve area that lies:
a) Between z = -1.66 and z = 0
b) To the right of z = - 0.27
c) Between z = 0 and z = 0.87
Exercises
Calculate the area under the normal curve
between these values:
1. z = 0 and z = 1.6
2. z = -1.4 and z = 1.4
3. z = -3 and z = 1.2
4. P (z<2.33)
5. P(z>2.9)
A normal random variable x has a mean of 10
and a standard deviation of 2. Find the probability of
each of the following:
6. x > 13.5
7. x < 8.2
8. 9.4 < x < 10.6
Sketch each region, then shade the area.
27 September 2016
It is TRUE.
It is FALSE, so reject it.
Good! ACCEPTED
It needs further revision.
Try again, check the
procedures.
• When and how do you accept
answer/opinion?
• How often do you perform
hypothesis testing? Why.
• What are some of the factors
you consider in terms of
hypothesis testing?
Hypothesis Testing
A Statistical Test of Hypothesis
consists of 5 parts:
a) The null hypothesis, Ho
b) The alternative hypothesis, Ha
c) The test statistic and its p-value
d) The rejection region
e) The conclusion
Definition:

The two competing hypotheses are
the alternative hypothesis, Ha,
generally the hypothesis that the
researcher wishes to support,
and the null hypothesis, Ho, a
contradiction of the alternative
hypothesis.
Example:
You wish to show that the average hourly wage
of construction workers in one city is different
from P 215, which is the national average.
Thus,
Ho : μ = 215
Ha : μ ≠ 215
which is a two-tailed test
A milling process currently produces an average
of 3% defectives. You are interested in
showing that a simple adjustment on a
machine will decrease p, the proportion of
defectives produced in the milling process.
Thus, Ho : p = 0.03
Ha : p < 0.03
which is a one-tailed test
• The p-value or observed significance level
of a statistical test is the smallest value of
alpha for which Ho can be rejected. It is
the actual risk of committing a Type I
error, if Ho is rejected based on the
observed value of the test statistic. The p-
value measures the strength of the
evidence against Ho.
If the p-value is less than a preassigned
significance level alpha, then Ho can be
rejected, and you can report that the results
are statistically significant at level alpha.
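
The decision rule can be illustrated with the hourly wage example. The sketch below assumes a large-sample z test; the sample mean, standard deviation, and sample size are hypothetical values chosen only for illustration and do not come from the slides.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 215                         # hypothesized mean (national average)
xbar, sigma, n = 219, 20, 100     # hypothetical sample results
alpha = 0.05                      # preassigned significance level

# Test statistic for a one-sample z test
z = (xbar - mu0) / (sigma / sqrt(n))

# Two-tailed p-value: area in both tails beyond |z|
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_h0 = p_value < alpha
print(round(z, 2), round(p_value, 4), reject_h0)
```

With these assumed numbers the p-value falls just under 0.05, so Ho would be rejected at the 5% level but not at the 1% level.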
Tests of Statistical Hypothesis
Goal of Hypothesis Testing-
to make a judgment about
the difference between the
sample statistics and a
hypothesized population
parameter
Use of Hypothesis Testing-
It enables the researcher to
generalize to a population from
relatively small samples. In many
instances, a researcher can only
rely on the information provided
by a part of the population.
Basic Definitions:
Statistical hypothesis is an
assumption or statement,
which may or may not be
true concerning one or
more populations.
Hypothesis Testing is the
process of making
inference or prediction on
a population based on the
result of the study on
samples
Null hypothesis is also known
as a no-difference or no-relationship
hypothesis. It
implies neutrality and
objectivity, which must be
present in any research
undertaking.
Alternative hypothesis is the opposite
of the null hypothesis.
Rejection of the hypothesis is to
conclude that the hypothesis is
false.
Acceptance of a hypothesis merely
implies that there is insufficient
statistical evidence to believe
otherwise.
Critical region is a set of values
of the test statistic that is
chosen before the
experiment to define the
conditions under which the
null hypothesis will be
rejected.
One-tailed test is used
when the critical region is
located at only one
extreme of distribution or
range of values for the
test statistics.
It is a directional test
with region of rejection
lying on either left or
right tail of the normal
curve.
Right directional test. The region
of rejection is on the right tail.
It is used when the alternative
hypothesis uses comparatives
such as >, higher than, superior
to, exceeds, better, etc.
Left directional test. The
region of rejection is on the
left tail. It is used when the
alternative hypothesis uses
comparatives such as <,
smaller than, lower than,
below, etc.
Two-tailed test is used
when the critical region is
located on both sides of
the distribution or range
of values for the test
statistic.
Significance level of a test is
the maximum value of the
probability of rejecting the
null hypothesis when in
fact it is true.
Statistic is a function of the
random sample, that is based
on the observations and is
used to make the decision in
favor of the null or
alternative hypothesis.
Type I error is when we
reject the null hypothesis
when it is true.

Type II error is when we accept


or fail to reject the null
hypothesis when the
alternative hypothesis is true.
Facts          Accept Ho           Reject Ho
Ho is true     Correct decision    Type I error
Ho is false    Type II error       Correct decision
Parameter is a
numerical characteristic of the
population, such as the population mean, population
standard deviation, or population
variance. It is usually unknown
and estimated only by a
corresponding statistic computed
from the sample data.
Population or
universe is a complete set
of all possible
observations, values,
elements or objects under
consideration.
Sample is a representative
part of the population.
Proportion is the ratio of
two given quantities, say, the
ratio of a sample to its total
population.
Exercises 27 September 2016
A. Classify as null or alternative
hypothesis
1. Short stories have no influence in shaping the sex-typed
attitudes of HS students.
2. Public school students performed better than
private school students in sports competition
3. Boys and girls performed equally in the recent
declamation contest
4. Public opinion has a positive relationship to
policy making at Tanauan City
5. Students’ activities affected their academic
performance in class
6. Media exposure strongly affects the lifestyle
of every individual
B. Form the null and alternative
hypotheses based on:
7. Computer games on academic performance
of elementary students (two-tailed test)
8. Pre-test in Language Ability to Group A and B
(right-tailed test)
9. Number of studying hrs of STAT students on
their examination scores (one-tailed test)
A. Determine whether the following
statements can be a null hypothesis
or alternative hypothesis. Indicate
Ho or Ha on the blanks provided for.

1. Classical music has a positive effect on the


memory ability of Gr IV students of a certain
elementary school.
2. The pre-test scores of the students belonging to
Grp A in the Language ability test do not
differ from those of the students belonging to Grp B.
3. Low scores in the mental ability test correspond
to low scores in the self-concept test.
4. The performance of pre-schoolers from
private schools in the memory ability test is
significantly different from that of those coming
from the public schools.
5. Sleep-deprived students have lower
performance in the mathematical learning
ability test than those with 8 hours of sleep.
6. Introducing colors to pictures has no effect
on memory retention among Gr I pupils.
7. The performance of the students exposed to
verbal motivation in a given learning ability
test does not differ from that of the students
exposed to nonverbal motivation.
8. High weekly television advertising cost
results in high weekly revenue.
9. Classroom arrangement has no effect on the
learning process of the students in the
classroom.
10.The mean body temperature of a male adult
is significantly higher than 98.7⁰F.
B. Construct a null and alternative
hypothesis on each of the given
statements below:
11.To determine the effect of extrinsic reward on
the participation of pre-elementary students of
Integrated School (two-sided test)
12.To determine the difference in the performance
of public and private secondary school students
in the national entrance exam (right directional,
one sided test)
13.To know whether the brand of cellular phone
used by college students has an effect on
developing one’s self-image. (two-sided test)
C. Determine if the hypothesis given is a
directional (one-sided test) or non-
directional hypothesis(two-sided test).
14. The performance of the students exposed to
verbal motivation in a given learning ability test
is higher than that of the students exposed to
nonverbal motivation.
15. The number of studying hours of a student
has a significant effect on the score he/she
obtains in the STAT 101 examination.
Inferential Statistics
Level of Significance-
For hypothesis testing, it is customary to use an
alpha of 5% or 1%. It means that we are
willing to commit an alpha error of 5% or 1%
as the case may be. It also implies that we are
95% or 99% confident in making correct
decisions.
Type I and Type II Errors

Decision            H0 is TRUE              H0 is FALSE
Reject Ho           Type I error (alpha)    Correct Decision
Do not reject Ho    Correct Decision        Type II error (beta)
30 September 2016
Statistical Tests
Data Analysis
Statistics - a powerful tool for analyzing data
1. Descriptive Statistics - provide an overview
of the attributes of a data set. These include
measurements of central tendency (frequency
histograms, mean, median, & mode) and
dispersion (range, variance & standard
deviation)

2. Inferential Statistics - provide measures of how


well your data support your hypothesis and if
your data are generalizable beyond what was
tested (significance tests)
Inferential Statistics
The Population: μ = 5.314
2 4 10 4 6 8 7 10 4 3 7 9 6 7 5 2 5 8 2 10
7 2 3 5 2 9 3 9 6 1 4 2 6 4 9 3 4 1 8 7
9 1 8 1 10 10 6 4 2 7 1 1 9 10 4 4 6 6 2 5
9 10 2 6 8 10 1 6 10 10 4 4 4 9 2 1 4 5 9 6
6 2 7 8 8 6 6 10 6 6 7 5 9 2 6 4 8 6 6 10
5 7 1 9 1 10 8 8 5 10 1 4 8 3 6 7 1 5 2 4
4 10 5 8 5 1 1 4 3 6 7 3 1 5 4 3 6 2 7 8
3 3 6 6 2 8 6 5 9 8 4 6 3 8 3 3 10 8 10 5
7 5 1 4 3 2 1 10 2 10 6 10 7 9 8 8 4 9 9 10
3 7 6 2 1 1 10 3 5 7 4 1 2 9 10 10 6 1 3 2
1 3 9 9 4 2 2 2 1 8 3 1 5 9 9 8 3 2 5 4
4 2 3 10 8 2 3 4 1 3 3 2 10 10 5 7 3 3 10 1
5 7 5 1 2 5 8 7 3 8 9 2 10 8 1 1 5 3 3 7
6 7 9 8 8 4 9 8 4 3 10 8 10 4 10 2 3 5 6 3
1 9 8 1 10 2 3 1 6 3 8 9 6 2 4 4 2 7 8 4
4 4 4 10 8 5 9 3 10 5 3 6 9 3 7 4 2 3 10 2
5 1 6 8 5 6 8 1 8 5 7 6 4 1 2 7 2 9 5 3
8 2 3 2 9 9 1 1 5 7 8 5 6 3 8 5 4 10 6 9
5 1 10 10 5 1 4 3 2 3 6 9 10 2 6 3 1 2 8 6
1 8 7 8 5 3 7 2 4 1 8 9 10 10 5 1 3 6 5 8
3 3 8 8 2 7 1 6 9 8 2 10 3 7 9 2 1 9 7 7
3 1 9 6 8 2 6 4 6 3 7 10 9 6 1 10 7 5 3 10
1 6 5 4 3 2 4 4 1 5 5 10 6 2 1 1 1 5 6 3
8 10 8 10 9 7 7 7 8 4 8 1 3 5 8 1 8 4 4 6
4 7 2 4 9 1 8 5 3 3 5 10 1 4 6 3 3 8 2 2

Population size = 500



The Sample: 7, 6, 4, 9, 8, 3, 2, 6, 1
mean = 5.111

The Sample: 1, 5, 8, 7, 4, 1, 6, 6
mean = 4.75
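
The point of the two samples is that each sample mean only estimates the population mean of 5.314, and different samples give different estimates. A small sketch that recomputes the two sample means shown above; variable names are illustrative.

```python
from statistics import mean

population_mean = 5.314   # stated on the slide for the population of 500 values

sample_1 = [7, 6, 4, 9, 8, 3, 2, 6, 1]
sample_2 = [1, 5, 8, 7, 4, 1, 6, 6]

m1 = mean(sample_1)
m2 = mean(sample_2)

# Each sample mean misses the population mean by a different amount
print(round(m1, 3), round(m1 - population_mean, 3))
print(round(m2, 3), round(m2 - population_mean, 3))
```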
Parametric or Non-parametric?
•Parametric tests are restricted to data that:
1) show a normal distribution
2) * are independent of one another
3) * are on the same continuous scale of measurement
•Non-parametric tests are used on data that:
1) show an other-than normal distribution
2) are dependent or conditional on one another
3) in general, do not have a continuous scale of
measurement

e.g., the length and weight of something –> parametric


vs.
did the bacteria grow or not grow –> non-parametric
The First Question
After examining your data, ask: does what you're testing
seem to be a question of relatedness or a question of
difference?

If relatedness (between your control and your experimental
samples or between your dependent and independent variable),
you will be using tests for correlation (positive or negative)
or regression.

If difference (your control differs from your experimental),


you will be testing for independence between distributions,
means or variances. Different tests will be employed if
your data show parametric or non-parametric properties.

See Flow Chart on page 50 of HBI.


Tests for Differences
• Between Means
- t-Test - P
- ANOVA - P
- Friedman Test
- Kruskal-Wallis Test
- Sign Test
- Rank Sum Test
• Between Distributions
- Chi-square for goodness of fit
- Chi-square for independence
• Between Variances
- F-Test – P
P – parametric tests
Differences Between Means
Asks whether samples come from populations with
different means
[Figure: two panels, Null Hypothesis and Alternative Hypothesis, each plotting Y for groups A, B and C]
There are different tests if you have 2 vs more than 2 samples
Differences Between Means – Parametric
Data
t-Tests compare the means of two parametric samples

E.g. Is there a difference in the mean height of men and


women?

HBI: t-Test
Excel: t-Test (paired and unpaired) – in Tools – Data
Analysis
A researcher compared the height of plants grown in high
and low light levels. Her results are shown below. Use a
T-test to determine whether there is a statistically
significant difference in the heights of the two groups

Low Light High Light


49 45
31 40
43 59
31 58
40 55
44 50
49 46
48 53
33 43
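
One way to work this exercise by hand is sketched below. It assumes an unpaired, equal-variance (pooled) t-test, which the slide does not specify, and it compares the statistic with the tabled two-tailed critical value for 16 degrees of freedom at alpha = 0.05. Variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

low = [49, 31, 43, 31, 40, 44, 49, 48, 33]
high = [45, 40, 59, 58, 55, 50, 46, 53, 43]

n1, n2 = len(low), len(high)
df = n1 + n2 - 2

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * variance(low) + (n2 - 1) * variance(high)) / df

# t statistic for the difference between the two sample means
t = (mean(high) - mean(low)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

t_crit = 2.120   # two-tailed critical value from a t table, alpha = 0.05, df = 16
significant = abs(t) > t_crit

print(df, round(t, 2), significant)
```

If the plants had been measured in matched pairs, a paired t-test would be the appropriate variant instead.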
Differences Between Means – Parametric
Data
ANOVA (Analysis of Variance) compares the means of
two or more parametric samples.

E.g. Is there a difference in the mean height of plants


grown under red, green and blue light?

HBI: ANOVA
Excel: ANOVA – check type under Tools – Data Analysis
A researcher fed pigs on four different foods. At the end
of a month feeding, he weighed the pigs. Use an ANOVA
test to determine if the different foods resulted in
differences in growth of the pigs.

weight of pigs fed different foods

food 1 food 2 food 3 food 4
60.8 68.7 102.6 87.9
57.0 67.7 102.1 84.2
65.0 74.0 100.2 83.1
58.6 66.3 96.5 85.7
61.7 69.8 90.3


Aplysia punctata – the sea hare
Aplysia parts
Differences Between Means – Non-
Parametric Data
The Sign Test compares the means of two “paired”, non-
parametric samples

E.g. Is there a difference in the gill withdrawal response of


Aplysia in night versus day? Each subject has been tested
once at night and once during the day –> paired data.

Subject   Night Response   Day Response
1         2                5
2         1                3
3         2                2

HBI: Sign Test
Excel: N/A
Differences Between Means – Non-
Parametric Data
The Friedman Test is like the Sign test, (compares the
means of “paired”, non-parametric samples) for more than
two samples.

E.g. Is there a difference in the gill withdrawal response of


Aplysia between morning, afternoon and evening? Each
subject has been tested once during each time period –>
paired data
Subject   Morning Response   Afternoon Response   Evening Response
1         4                  3                    2
2         5                  2                    1
3         3                  4                    3

HBI: Friedman Test
Excel: N/A
Differences Between Means – Non-
Parametric Data
The Rank Sum test compares the means of two non-
parametric samples

E.g. Is there a difference in the gill withdrawal response of


Aplysia in night versus day? Each subject has been tested
once, either during the night or during the day –> unpaired
data.
Night Day
Subject Response Response
1 5
2 1
3 2
4 3
5 4
6 1
7 5

HBI: Rank Sum
Excel: N/A
Differences Between Means – Non-
Parametric Data
The Kruskal-Wallis Test compares the means of more
than two non-parametric, non-paired samples

E.g. Is there a difference in the gill withdrawal response of
Aplysia between morning, afternoon and evening? Each subject has been tested
once, either during the morning, afternoon or evening –>
unpaired data.

Morning Afternoon Evening
Subject Response Response Response
1 4
2 5
3 4
4 3
5 2
6 3

HBI: Kruskal-Wallis Test
Excel: N/A
Differences Between Distributions
Chi square tests compare observed frequency
distributions, either to theoretical expectations or to other
observed frequency distributions.
Differences Between Distributions
E.g. The F2 generation of a cross between a round pea
and a wrinkled pea produced 72 round individuals and 20
wrinkled individuals. Does this differ from the expected 3:1
round:wrinkled ratio of a simple dominant trait?

[Figure: bar chart of frequency for Smooth and Wrinkled peas]
HBI: Chi-Square One Sample Test (goodness of fit)
Excel: Chitest – under Function Key – Statistical
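
The goodness-of-fit statistic for the pea example can be computed by hand. The sketch below uses the standard chi-square formula and the tabled critical value of 3.841 for 1 degree of freedom at alpha = 0.05; variable names are illustrative.

```python
observed = [72, 20]   # round, wrinkled
ratio = [3, 1]        # expected 3:1 ratio

total = sum(observed)
# Expected counts if the 3:1 ratio holds exactly
expected = [total * r / sum(ratio) for r in ratio]

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

chi2_crit = 3.841     # table value for df = 1, alpha = 0.05
differs = chi2 > chi2_crit

print(expected, round(chi2, 4), differs)
```

The statistic is well below the critical value, so these counts are consistent with the 3:1 ratio.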
Differences Between Distributions
E.g. 67 out of 100 seeds placed in plain water germinated
while 36 out of 100 seeds placed in “acid rain” water
germinated. Is there a difference in the germination rate?
[Figure: two panels, Null Hypothesis and Alternative Hypothesis, each showing germination proportion for Plain and Acid water]


HBI: Chi-Square Two or More Sample Test (independence)
Excel: Chitest – under Function key - Statistical
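
For the germination example, the counts form a 2 x 2 table and the expected count in each cell is (row total × column total) / grand total. A minimal sketch without a continuity correction (an assumption, since the slide does not say whether one is applied); variable names are illustrative.

```python
# Rows: plain water, acid rain water; columns: germinated, not germinated
table = [[67, 33],
         [36, 64]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed - expected) ** 2 / expected

chi2_crit = 3.841   # table value for df = 1, alpha = 0.05
differs = chi2 > chi2_crit

print(round(chi2, 2), differs)
```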
Correlation
Correlations look for relationships between two variables
which may not be functionally related. The variables may
be ordinal, interval, or ratio scale data. Remember,
correlation does not prove causation; thus there may not
be a cause and effect relationship between the variables.

E.g. Do species of birds with longer wings also have


longer necks?

HBI: Spearman’s Rank Correlation (NP)


Excel: Correlation (P)
Question – is there a relationship between students' aptitude
for mathematics and for biology?

Student   Math score   Math rank   Biol. score   Biology rank
1         57           3           83            7
2         45           1           37            1
3         72           7           41            2
4         78           8           84            8
5         53           2           56            3
6         63           5           85            9
7         86           9           77            6
8         98           10          87            10
9         59           4           70            5
10        71           6           59            4
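
With the ranks already assigned, Spearman's rank correlation follows from the rank differences. The sketch below uses the usual no-ties formula, r_s = 1 − 6Σd² / (n(n² − 1)); variable names are illustrative.

```python
math_rank = [3, 1, 7, 8, 2, 5, 9, 10, 4, 6]
biol_rank = [7, 1, 2, 8, 3, 9, 6, 10, 5, 4]

n = len(math_rank)

# Sum of squared differences between each student's two ranks
sum_d2 = sum((a - b) ** 2 for a, b in zip(math_rank, biol_rank))

r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d2, round(r_s, 4))
```

Whether this coefficient is significant still has to be judged against a critical value for n = 10 from a Spearman table.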
Regression
Regressions look for functional relationships between two
continuous variables. A regression assumes that a
change in X causes a change in Y.

E.g. Does an increase in light intensity cause an increase


in plant growth?

HBI: Regression Analysis (P)


Excel: Regression (P)
Correlation & Regression

Looks for relationships between two continuous variables

[Figure: two panels, Null Hypothesis and Alternative Hypothesis, each plotting Y against X]
Is there a relationship between wing length and
tail length in songbirds?
wing length cm tail length cm
10.4 7.4
10.8 7.6
11.1 7.9
10.2 7.2
10.3 7.4
10.2 7.1
10.7 7.4
10.5 7.2
10.8 7.8
11.2 7.7
10.6 7.8
11.4 8.3
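
For the songbird data, the parametric (Pearson) correlation coefficient can be computed from the deviations about each mean. A minimal sketch written out by hand so that it runs on any Python 3 version; variable names are illustrative.

```python
from math import sqrt
from statistics import mean

wing = [10.4, 10.8, 11.1, 10.2, 10.3, 10.2, 10.7, 10.5, 10.8, 11.2, 10.6, 11.4]
tail = [7.4, 7.6, 7.9, 7.2, 7.4, 7.1, 7.4, 7.2, 7.8, 7.7, 7.8, 8.3]

mx, my = mean(wing), mean(tail)

# Sums of squares and cross-products of deviations from the means
sxx = sum((x - mx) ** 2 for x in wing)
syy = sum((y - my) ** 2 for y in tail)
sxy = sum((x - mx) * (y - my) for x, y in zip(wing, tail))

r = sxy / sqrt(sxx * syy)
print(round(r, 3))
```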
Is there a relationship between age and systolic
blood pressure?
Age (yr) systolic blood pressure
mm hg
30 108
30 110
30 106
40 125
40 120
40 118
40 119
50 132
50 137
50 134
60 148
60 151
60 146
60 147
60 144
70 162
70 156
70 164
70 158
70 159
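
For the blood pressure data, a least-squares regression line of pressure on age can be fitted as follows. This is a sketch of simple linear regression using the standard slope and intercept formulas; variable names are illustrative.

```python
from statistics import mean

age = [30, 30, 30, 40, 40, 40, 40, 50, 50, 50,
       60, 60, 60, 60, 60, 70, 70, 70, 70, 70]
pressure = [108, 110, 106, 125, 120, 118, 119, 132, 137, 134,
            148, 151, 146, 147, 144, 162, 156, 164, 158, 159]

mx, my = mean(age), mean(pressure)

# Slope = sum of cross-products / sum of squares of X
sxy = sum((x - mx) * (y - my) for x, y in zip(age, pressure))
sxx = sum((x - mx) ** 2 for x in age)
slope = sxy / sxx

# The fitted line passes through the point (mean X, mean Y)
intercept = my - slope * mx

print(f"pressure = {intercept:.2f} + {slope:.3f} * age")
```

The slope estimates the change in systolic pressure (mm Hg) per additional year of age within the observed range of 30 to 70 years.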
Statistical Tests
Let’s Take it Step by Step...
 Identify topic
 Literature review
 Variables of interest
 Research hypothesis
 Design study
 Power analysis
 Write proposal
 Design data tools
 Committees
 Collect data
 Set up spreadsheet
 Enter data
 Statistical analysis
 Graphs
 Slides / poster
 Write paper / manuscript
Goals

 To understand why a particular statistical


test was used for your research project
 To interpret your results
 To understand, evaluate, and present your
results
Free Statistics Software
Mystat:
http://www.systat.com/MystatProducts.aspx

List of Free Statistics Software:


http://statpages.org/javasta2.html
Before choosing a statistical
test…
 Figure out the variable type
– Scales of measurement (qualitative or
quantitative)

 Figure out your goal


– Compare groups
– Measure relationship or association of
variables
Scales of Measurement

 Nominal, Ordinal: Qualitative
 Interval, Ratio: Quantitative
Nominal Scale (discrete)
 Simplest scale of measurement
 Variables which have no numerical value
 Variables which have categories
 Count number in each category, calculate
percentage
 Examples:
– Gender
– Race
– Marital status
– Whether or not tumor recurred
– Alive or dead
Ordinal Scale
 Variables are in categories, but with an
underlying order to their values
 Rank-order categories from highest to lowest
 Intervals may not be equal
 Count number in each category, calculate
percentage
 Examples:
– Cancer stages
– Apgar scores
– Pain ratings
– Likert scale
Interval Scale
 Quantitative data
 Can add & subtract values
 Cannot multiply & divide values
– No true zero point

 Example:
– Temperature on a Celsius scale
• 0° indicates the point at which water freezes, not an absence of
warmth
Ratio Scale (continuous)
 Quantitative data with true zero
– Can add, subtract, multiply & divide
 Examples:
– Age
– Body weight
– Blood pressure
– Length of hospital stay
– Operating room time
Scales of Measurement

 Nominal, Ordinal: lead to nonparametric statistics
 Interval, Ratio: lead to parametric statistics
Two Branches of Statistics

 Descriptive
– Frequencies & percents
– Measures of the middle
– Measures of variation

 Inferential
– Nonparametric statistics
– Parametric statistics
Descriptive Statistics

 First step in analyzing data

 Goal is to communicate results, without


generalizing beyond sample to a larger
group
Frequencies and Percents
 Number of times a specific value of an
observation occurs (counts)
 For each category, calculate percent of
sample
SMOKING

                        Frequency   Percent   Valid Percent   Cumulative Percent
Valid     smoker            26        20.5        24.8              24.8
          non-smoker        79        62.2        75.2             100.0
          Total            105        82.7       100.0
Missing   unknown           22        17.3
Total                      127       100.0
Measures of the Middle or
Central Tendency
 Mean
– Average score
• sum of all values, divided by number of values
– Most common measure, but easily influenced by
outliers

 Median
– 50th percentile score
• half above, half below
– Use when data are asymmetrical or skewed
Measures of Variation or Dispersion
 Standard deviation (SD)
– Square root of the sum of squared deviations of the
values from the mean divided by the number of
values

$$SD = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n}}$$

where $x_i$ is an individual value, $\bar{x}$ is the mean value, and $n$ is the
number of values (for a sample, most statistical software divides by $n - 1$ instead)

 Standard error (SE)


– Standard deviation divided by the square root of the
number of values
Measures of Variation or Dispersion

 Variance
– Square of the standard deviation

 Range
– Difference between the largest & smallest
value
nocigs_b

Cumulative
Frequency Percent Valid Percent Percent
Valid 1 2 1.6 7.7 7.7
2 1 .8 3.8 11.5
3 1 .8 3.8 15.4
5 3 2.4 11.5 26.9
6 1 .8 3.8 30.8
12 1 .8 3.8 34.6
13 1 .8 3.8 38.5
14 1 .8 3.8 42.3
15 2 1.6 7.7 50.0
17 1 .8 3.8 53.8
18 1 .8 3.8 57.7
19 2 1.6 7.7 65.4
20 2 1.6 7.7 73.1
22 1 .8 3.8 76.9
24 1 .8 3.8 80.8
30 1 .8 3.8 84.6
39 1 .8 3.8 88.5
40 1 .8 3.8 92.3
45 1 .8 3.8 96.2
100 1 .8 3.8 100.0
Total 26 20.5 100.0
Missing System 101 79.5
Total 127 100.0
Statistics

nocigs_b
N Valid 26
Missing 101
Mean 19.62
Std. Error of Mean 3.985
Median 16.00
Mode 5
Std. Deviation 20.320
Variance 412.886
Range 99
Minimum 1
Maximum 100
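The statistics above can be reproduced from the frequency table. This is a minimal sketch using only the Python standard library; the list `nocigs` is expanded by hand from the nocigs_b frequencies, and the variable names are illustrative. Note that `statistics.variance` and `statistics.stdev` divide by n - 1 (the sample formulas), which is what matches the reported values.

```python
import math
import statistics

# Cigarettes per day for the 26 smokers, expanded from the frequency table
# (for example, the value 5 occurs three times).
nocigs = [1, 1, 2, 3, 5, 5, 5, 6, 12, 13, 14, 15, 15, 17, 18,
          19, 19, 20, 20, 22, 24, 30, 39, 40, 45, 100]

n = len(nocigs)
mean = statistics.mean(nocigs)
median = statistics.median(nocigs)       # average of the 13th and 14th values
mode = statistics.mode(nocigs)           # most frequent value
variance = statistics.variance(nocigs)   # sample variance (divides by n - 1)
sd = statistics.stdev(nocigs)            # square root of the sample variance
se = sd / math.sqrt(n)                   # standard error of the mean
value_range = max(nocigs) - min(nocigs)

print(f"N={n} mean={mean:.2f} SE={se:.3f} median={median} mode={mode}")
print(f"SD={sd:.3f} variance={variance:.3f} range={value_range}")
```

The single value of 100 pulls the mean (19.62) well above the median (16), which is the outlier sensitivity mentioned under Measures of the Middle.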
Inferential Statistics
Sample Population
 Nonparametric tests
– Used for analyzing nominal & ordinal variables
– Make no assumptions about the distribution of the data

 Parametric tests
– Used for analyzing interval & ratio variables
– Makes assumptions about data
• Normal distribution
• Homogeneity of variance
• Independent observations
Which Test Do I Use?

 Step 1 Know the scale of measurement


 Step 2 Know your goal
– Is it to compare groups? How many groups
do I have?
– Is it to measure a relationship or association
between variables?
Key Inferential Statistics
 Chi-Square } Nonparametric; Association/Relationship
– Fisher’s exact test
 T-test } Parametric; Compare groups
– Unpaired
– Paired
 Analysis of Variance (ANOVA) } Parametric; Compare groups
 Pearson’s Correlation } Parametric; Association/Relationship
 Linear Regression } Parametric; Association/Relationship
Probability and p Values
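 p < 0.05
– If the groups truly do not differ, there is less than a 1 in 20 (5%)
chance of seeing a difference this large and wrongly saying the
groups are significantly different

 p < 0.01
– Less than a 1 in 100 (1%) chance of this error

 p < 0.001
– Less than a 1 in 1000 (0.1%) chance of this error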
Research Hypothesis

 Topic → research question

 Research question → hypothesis
– Null hypothesis (H0)
• Predicts no effect or difference
– Alternative hypothesis (H1)
• Predicts an effect or difference
Example
Topic: Cancer & Smoking
Research Question: Is there a
relationship between smoking &
cancer?
H0: Smokers are not more likely to
develop cancer compared to non-
smokers.
H1: Smokers are more likely to
develop cancer than are non-smokers.
Are These Categorical
Variables Associated?

SMOKING * SES Crosstabulation

                                          SES
SMOKING                          low      middle    high      Total
smoker       Count                 7        13         6        26
             % within SES        38.9%     20.3%     26.1%     24.8%
non-smoker   Count                11        51        17        79
             % within SES        61.1%     79.7%     73.9%     75.2%
Total        Count                18        64        23       105
             % within SES       100.0%    100.0%    100.0%    100.0%
Chi-Square (χ²)
 Most common nonparametric test
 Use to test for association between
categorical variables
 Use to test the difference between observed
& expected proportions
– The larger the chi-square value, the more the
numbers in the table differ from those we would
expect if there were no association
 Limitation
– Expected values must be equal to or larger than 5
Let’s Test For Association
Low SES 38.9%, Middle SES 20.3%, High SES 26.1%

Chi-Square Tests

                                Value     df    Asymp. Sig. (2-sided)
Pearson Chi-Square              2.630 a    2          .268
Likelihood Ratio                2.476      2          .290
Linear-by-Linear Association     .653      1          .419
N of Valid Cases                  105
a. 1 cells (16.7%) have expected count less than 5. The
minimum expected count is 4.46.
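The Pearson chi-square value can be checked by hand from the crosstabulation. The sketch below is illustrative, not the software's own routine: it computes expected counts as (row total × column total) / N and sums (observed − expected)² / expected. The p value line relies on the fact that, for df = 2 only, the chi-square upper-tail probability equals exp(−x/2); other degrees of freedom would need a statistics library.

```python
import math

# Observed counts from the SMOKING * SES crosstabulation.
# Rows: smoker, non-smoker. Columns: low, middle, high SES.
observed = [[7, 13, 6],
            [11, 51, 17]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under no association: row total * column total / N
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)
min_expected = min(min(row) for row in expected)

# Closed form valid for df = 2 only (assumption stated in the lead-in).
p_value = math.exp(-chi_square / 2) if df == 2 else None

print(f"chi-square={chi_square:.3f} df={df} p={p_value:.3f} "
      f"min expected={min_expected:.2f}")
```

With p above 0.05, the data do not show an association between smoking and SES; the minimum expected count below 5 is the limitation noted on the Chi-Square slide.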
Alternative to Chi-Square
 Fisher’s exact test
– Is based on exact probabilities
– Use when the expected count is < 5 in any cell
– Use with a 2 x 2 contingency table

R A Fisher 1890-1962
LUNG_CA * SMOKING Crosstabulation

                                         SMOKING
LUNG_CA                          smoker    non-smoker    Total
positive   Count                    3           1           4
           % within SMOKING       11.5%        1.3%        3.8%
negative   Count                   23          78         101
           % within SMOKING       88.5%       98.7%       96.2%
Total      Count                   26          79         105
           % within SMOKING      100.0%      100.0%      100.0%
Chi-Square Tests

                                Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                               (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square              5.633 b    1      .018
Continuity Correction a         3.179      1      .075
Likelihood Ratio                4.664      1      .031
Fisher's Exact Test                                             .046         .046
Linear-by-Linear Association    5.580      1      .018
N of Valid Cases                  105
a. Computed only for a 2x2 table
b. 2 cells (50.0%) have expected count less than 5. The minimum expected count is
.99.
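Fisher's exact probabilities come from the hypergeometric distribution with the table margins held fixed. The sketch below is an illustrative hand calculation using `math.comb`; the function and variable names are assumptions made for this example. The two-sided value uses the common convention of summing every table that is no more probable than the observed one.

```python
from math import comb

# 2 x 2 table from the LUNG_CA * SMOKING crosstabulation
#               smoker   non-smoker
# positive         3          1
# negative        23         78
a, b, c, d = 3, 1, 23, 78

positives = a + b      # row total: lung cancer positive
smokers = a + c        # column total: smokers
n = a + b + c + d

def table_probability(x):
    """Probability that x of the positive cases are smokers, margins fixed."""
    return comb(smokers, x) * comb(n - smokers, positives - x) / comb(n, positives)

lowest = max(0, positives - (n - smokers))
highest = min(positives, smokers)
probabilities = {x: table_probability(x) for x in range(lowest, highest + 1)}

p_observed = probabilities[a]
# One-sided: tables with at least as many smokers among the positive cases
p_one_sided = sum(p for x, p in probabilities.items() if x >= a)
# Two-sided: every table no more probable than the observed table
p_two_sided = sum(p for p in probabilities.values()
                  if p <= p_observed * (1 + 1e-9))

print(f"one-sided p={p_one_sided:.3f} two-sided p={p_two_sided:.3f}")
```

Here both values round to .046 because the only tables as extreme as the observed one lie in the same tail.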
Do These Groups Differ?

Group Statistics

       SMOKING       N     Mean       Std. Deviation   Std. Error Mean
BMI    smoker        26    25.1846    5.27209          1.03394
       non-smoker    79    26.2228    5.47664           .61617
Unpaired t-test
or Student’s t-test
William Gosset 1876-1937

 Frequently used statistical test


 Use when there are two independent
groups
Unpaired t-test or Student’s
t-test
 Test for a difference between groups
– Is the difference in sample means due to their
natural variability or to a real difference between
the groups in the population?
 Outcome (dependent variable) is interval or
ratio
 Assumptions of normality, homogeneity of
variance & independence of observations
Let’s Test For A Difference
Smokers’ BMI = 25.18 ± 5.27
Non-Smokers’ BMI = 26.22 ± 5.48

Independent Samples Test (BMI)

Levene's Test for Equality of Variances: F = .079, Sig. = .779

t-test for Equality of Means
                               t       df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% Confidence Interval of the Difference (Lower, Upper)
Equal variances assumed        -.846   103      .400              -1.0382           1.22719                 -3.47200, 1.39566
Equal variances not assumed    -.863   44.127   .393              -1.0382           1.20362                 -3.46371, 1.38737
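Both rows of the t-test output can be recomputed from the group statistics alone. The sketch below is illustrative (the function names `pooled_t` and `welch_t` are made up for this example): the first row uses the pooled variance with df = n1 + n2 − 2, the second uses separate variances with the Welch-Satterthwaite degrees of freedom.

```python
import math

def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Equal variances assumed: returns (t, df, standard error)."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, n1 + n2 - 2, se

def welch_t(n1, mean1, sd1, n2, mean2, sd2):
    """Equal variances not assumed: returns (t, df, standard error)."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return (mean1 - mean2) / se, df, se

# Group statistics for BMI: smokers first, then non-smokers
smokers = (26, 25.1846, 5.27209)
non_smokers = (79, 26.2228, 5.47664)

t_pooled, df_pooled, se_pooled = pooled_t(*smokers, *non_smokers)
t_welch, df_welch, se_welch = welch_t(*smokers, *non_smokers)

print(f"pooled: t={t_pooled:.3f} df={df_pooled} SE={se_pooled:.5f}")
print(f"Welch:  t={t_welch:.3f} df={df_welch:.3f} SE={se_welch:.5f}")
```

Levene's test is not significant (.779), so the equal-variances row is the one usually read; either way the difference in BMI is not significant.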
Do These Groups Differ?
Light smoker < 1 pack/day
Heavy smoker > 1 pack/day

Descriptives

BMI
                                                          95% Confidence Interval for Mean
                N     Mean      Std. Deviation  Std. Error  Lower Bound  Upper Bound  Minimum  Maximum
non-smoker      79    26.2228   5.47664          .61617     24.9961      27.4495      17.70    40.20
light smoker    17    26.1765   4.96154         1.20335     23.6255      28.7275      18.90    35.00
heavy smoker     9    23.3111   5.62015         1.87338     18.9911      27.6311      17.90    35.90
Total          105    25.9657   5.42028          .52896     24.9168      27.0147      17.70    40.20
Analysis of Variance (ANOVA)
or F-test
 Three or more independent groups
 Test for a difference between groups
– Is the difference in sample means due to their
natural variability or to a real difference between
the groups in the population?
 Outcome (dependent variable) is interval or
ratio
 Assumptions of normality, homogeneity of
variance & independence of observations
Let’s Test For A Difference
Non-Smokers’ BMI = 26.22 ± 5.48
Light Smokers’ BMI = 26.18 ± 4.96
Heavy Smokers’ BMI = 23.31 ± 5.62

ANOVA

BMI
Sum of
Squares df Mean Square F Sig.
Between Groups 69.398 2 34.699 1.185 .310
Within Groups 2986.058 102 29.275
Total 3055.457 104
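The ANOVA table can be rebuilt from the group sizes, means, and standard deviations in the Descriptives table. This is a sketch under that assumption (raw data are not shown on the slides); the dictionary layout and names are illustrative.

```python
# (n, mean, standard deviation) for each group, from the Descriptives table
groups = {
    "non-smoker":   (79, 26.2228, 5.47664),
    "light smoker": (17, 26.1765, 4.96154),
    "heavy smoker": (9, 23.3111, 5.62015),
}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * m for n, m, _ in groups.values()) / n_total

# Between groups: spread of the group means around the grand mean
ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups.values())
# Within groups: spread of the observations around their own group mean
ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

print(f"SSb={ss_between:.3f} SSw={ss_within:.3f} "
      f"df=({df_between}, {df_within}) F={f_ratio:.3f}")
```

Small differences in the last decimal place come from the rounding of the means and standard deviations in the printed table.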
Is there a relationship between the variables?

No_Cigs   BMI
1         30.1
1         18.9
2         22.8
3         22.6
5         24.2
5         26.2
5         33.3
6         19.1
12        35
13        23
14        22.2
15        28.7
15        28.6
17        24.3
18        30.9
19        22.5
19        32.6
20        19
20        26.7
22        18.8
24        23.4
30        23.2
39        25
40        35.9
45        17.9
100       19.9
Pearson’s Correlation

Karl Pearson 1857-1936


 Measures the degree of relationship between
two variables
 Assumptions:
– Variables are normally distributed
– Relationship is linear
– Both variables are measured on the interval or
ratio scale
– Variables are measured on the same subjects
Scatterplots: r ranges from -1.0 to +1.0

[Figure: three scatterplots showing perfect positive correlation, perfect negative correlation, and no correlation]
Let’s Test For A Relationship

Correlations

                                    NOCIGS_B    BMI
NOCIGS_B   Pearson Correlation         1        -.169
           Sig. (2-tailed)             .         .410
           N                          26          26
BMI        Pearson Correlation       -.169        1
           Sig. (2-tailed)            .410        .
           N                          26         105
[Figure: scatterplot of BMI (y-axis, 10 to 40) against NOCIGS_B (x-axis, 0 to 120)]
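The correlation between cigarettes per day and BMI can be recomputed from the 26 pairs listed earlier. A minimal sketch using the definitional formula (sum of cross-products of deviations over the square root of the product of the sums of squares); the list name `pairs` is illustrative and the values are transcribed from the No_Cigs / BMI table.

```python
import math

# (cigarettes per day, BMI) for the 26 smokers
pairs = [(1, 30.1), (1, 18.9), (2, 22.8), (3, 22.6), (5, 24.2), (5, 26.2),
         (5, 33.3), (6, 19.1), (12, 35), (13, 23), (14, 22.2), (15, 28.7),
         (15, 28.6), (17, 24.3), (18, 30.9), (19, 22.5), (19, 32.6),
         (20, 19), (20, 26.7), (22, 18.8), (24, 23.4), (30, 23.2),
         (39, 25), (40, 35.9), (45, 17.9), (100, 19.9)]

n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
s_xx = sum((x - mean_x) ** 2 for x, _ in pairs)
s_yy = sum((y - mean_y) ** 2 for _, y in pairs)

r = s_xy / math.sqrt(s_xx * s_yy)
print(f"n={n} mean BMI={mean_y:.4f} r={r:.3f}")
```

The weak negative r is driven largely by the single smoker at 100 cigarettes per day, which is worth checking on the scatterplot before interpreting the coefficient.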
Interpretation of Results
 The size of the p value does not
indicate the importance of the result
 Appropriate interpretation of statistical
test
– Group differences
– Association or relationship
– “Correlation does not imply causation”
Statistical Inference
Hypothesis Testing for Single Populations.
Population mean using Z statistic (σ known)

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
The Z table on Hypothesis Testing
Z Table (critical values)

Type          0.025    0.01     0.05
One-tailed    ±1.96    ±2.33    ±1.65
Two-tailed    ±2.24    ±2.58    ±1.96
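Critical z values like those in the table can be computed rather than looked up. This sketch uses `statistics.NormalDist` from the Python standard library; the helper name `critical_z` is made up for this example. A two-tailed test splits alpha between the two tails, so it uses the quantile at 1 − alpha/2.

```python
from statistics import NormalDist

def critical_z(alpha, tails):
    """Positive critical value of the standard normal for a 1- or 2-tailed test."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    return NormalDist().inv_cdf(1 - alpha / tails)

for alpha in (0.025, 0.01, 0.05):
    print(f"alpha={alpha}: one-tailed {critical_z(alpha, 1):.3f}, "
          f"two-tailed {critical_z(alpha, 2):.3f}")
```

To three decimals the one-tailed values are 1.960, 2.326 and 1.645, and the two-tailed values are 2.241, 2.576 and 1.960; for a left-tailed test the sign is negative.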
Population mean using t statistic (σ unknown,
n <= 30)

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}; \quad df = n - 1$$
Exercises
1. A bus company advertised a mean time of
150 min for a trip between two cities. A
consumer group had reason to believe that the
mean time was more than 150 minutes. A
sample of 40 trips showed a mean of 153 min
and a standard deviation of 7.5 min. Using a 5%
level of significance, is there sufficient
evidence to support the consumer group’s
contention? What type of error has possibly
been committed? Explain.
2. A plastic has a mean breaking strength of 27
and a standard deviation of 6 pounds per square
inch. A new process is developed and will
replace the old one, provided there is
substantial evidence that it improves the
strength of the product. A random sample of 40
pieces made with the new process gives a
sample mean of 30 pounds/sq inch. Assuming
that the variability is unchanged (σ = 6), is
there sufficient evidence to suggest that the
strength of the product has increased at the 1%
level of significance?
3. An ice cream company claimed
that its product contained 500
calories/pint (on the average). To test
this claim, 24 one-pint containers
were analyzed, giving a mean of
507 calories and a standard
deviation of 21 calories. Test the
claim at the 2% level of
significance.
4. A manufacturer claimed that the
company’s product would not require
repair for more than 18 months on the
average. A sample of 12 customers
who had purchased their product gave
a mean of 18.542 and standard
deviation of 1.177. At a 5% level of
significance, do the data support the
belief that the mean repair time is
more than 18 months?
5. A sample of 12 customers who had
purchased the product provided the
following information on how many months
elapsed before repair was needed on their
purchases:
16.5 17 17.5 18 18.5
18.5 18.5 19 19 19.5
20 20.5
Refer to the above problem
Generalization

• When do we use the z
test and the t test
concerning means in a
normal distribution?
Test for the mean of paired observations

$$t = \frac{\bar{d} - d_0}{s_d / \sqrt{n}}; \quad \bar{d} = \frac{\sum d_i}{n}; \quad df = n - 1$$

$$s_d = \sqrt{\frac{n \sum d_i^2 - \left(\sum d_i\right)^2}{n(n - 1)}}$$
• Independent samples are samples drawn
from entirely different populations
Examples:
a) Comparison of two groups
b) Performance of boys and girls
c) Length of life of brand A and brand B
* Groups randomly selected from two
entirely different populations
Test for the difference of means from
independent samples when the population
variances are unknown and the sample sizes are
more than 30.

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
Test for the difference of means from
independent samples when the population
variances are unknown and the sample sizes are
not more than 30.

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)} \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}; \quad df = (n_1 + n_2) - 2$$
Test about a single proportion

$$z = \frac{p - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}}; \quad p = \frac{x}{n}$$

where p = sample proportion
p0 = population proportion
Test for the difference between two proportions from
independent samples

$$z = \frac{(p_1 - p_2) - (\bar{p}_1 - \bar{p}_2)}{\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}}$$

where p (with a bar) = population proportion; q = 1 - p
Test about a single variance or standard
deviation

$$\chi^2 = \frac{(n - 1)s^2}{\sigma^2}$$
Inferential Statistics
Level of Significance-
For hypothesis testing, it is customary to use an
alpha of 5% or 1%. It means that we are
willing to commit an alpha error of 5% or 1%
as the case may be. It also implies that we are
95% or 99% confident in making correct
decisions.
(Hypothesis Test Concerning Means)
Solve showing the 5 step procedures.
1. A drug company alleges that the average time
for a cough syrup to take effect is 15 min, with
standard deviation of 3 min. In a random
sample of 49 patients, the average time was
18 min. Test the company’s allegation against
the alternative that the average time is not 15
min using 1% level of significance.
2. An operator of a large fleet of taxicabs is
trying to decide whether to purchase Brand A
or Brand B tires for its new models. To arrive
at a decision, an endurance experiment was
conducted using 10 cars for each brand. The
results are
Brand A: mean=35000 km, s=5000 km
Brand B: mean=38000 km, s=5500 km
Test the hypothesis at alpha=5%, that there is no
significant difference in the mean endurance
rate (kms) between the two brands.
3. A random sample of 100 deaths in a
certain area during the past year showed
an average life span of 71.8 years.
Assuming that the population standard
deviation is 8.9 years, does this seem to
indicate that the average life span today is
significantly greater than 70 years? Use
alpha at 5%.
A teacher wants to find out if the calculator-
based method of teaching Statistics is more
effective than the lecture method. Two classes
of approximately equal intelligence were
selected. From one class, she considered 15
students with whom she used the calculator
based method of teaching and from the other
class, she considered 14 students with whom
she used the lecture method. After several
sessions, a test was given with the following
results:
                       n     x̄      s
Calculator-based (1)   15    28.6   6.0
Lecture (2)            14    21.7   4.5
Can we say that the calculator-based
method of teaching is more effective
than the lecture method? Use 0.05
level of significance.
Ho: μ1 = μ2; Ha: μ1 > μ2
α = 0.05, one-tailed test, df = 27
t tab = 1.703
Reject Ho if tc ≥ 1.703
tc ≈ 3.48
Decision: Reject Ho
The calculator-based method of
teaching is more effective
than the lecture method.
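The computed t for this problem can be checked with the pooled-variance formula for small independent samples. A minimal sketch; the tabular value 1.703 is taken from the solution above rather than computed, and the variable names are illustrative. With these summary statistics the formula evaluates to roughly 3.48.

```python
import math

# Summary statistics from the problem
n1, mean1, s1 = 15, 28.6, 6.0   # calculator-based method
n2, mean2, s2 = 14, 21.7, 4.5   # lecture method

df = n1 + n2 - 2
pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
t_computed = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_tabular = 1.703               # one-tailed, alpha = 0.05, df = 27 (from the slide)
reject_h0 = t_computed >= t_tabular

print(f"df={df} t={t_computed:.2f} reject Ho: {reject_h0}")
```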
Sample Problem
A sample of 87 professional working women showed that
the average amount paid annually into a private pension
fund per person was P3352. The population standard
deviation is P1100. A sample of 76 professional working
men showed that the average amount paid annually into a
private pension fund per person was P5727 with a
population standard deviation of P1700. A women’s activist
group wants to prove that women do not pay as
much per year as men into private pension funds. If they
use alpha at .001 and these sample data, will they be able
to reject a null hypothesis that women annually pay the
same as or more than men into private pension funds?
Solution
              Women          Men
Mean          x̄1 = 3352      x̄2 = 5727
Sample size   n1 = 87        n2 = 76
Pop. SD       σ1 = 1100      σ2 = 1700
Ho : μ1 = μ2
Ha : μ1 < μ2
Alpha at .001, one-tailed (left)
Critical value: area = .5 - .001 = .499, so z = -3.08
Zc = -10.42
Reject Ho
Women paid less than men.
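The computed z in this solution follows from the two-sample z formula with known population standard deviations. The sketch below is illustrative; it also computes the left-tail critical value with `statistics.NormalDist`, which gives about -3.09, slightly different from a value read off a printed table.

```python
import math
from statistics import NormalDist

# Sample data from the problem
mean_women, n_women, sigma_women = 3352, 87, 1100
mean_men, n_men, sigma_men = 5727, 76, 1700

standard_error = math.sqrt(sigma_women ** 2 / n_women + sigma_men ** 2 / n_men)
z_computed = (mean_women - mean_men) / standard_error

alpha = 0.001
z_critical = NormalDist().inv_cdf(alpha)   # left-tailed critical value
reject_h0 = z_computed <= z_critical

print(f"z={z_computed:.2f} critical={z_critical:.2f} reject Ho: {reject_h0}")
```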
Exercises
1. Test the following hypotheses of the difference
on population means at alpha = .10 with
Ho : u1-u2 = 0
Ha : u1-u2 < 0
Sample 1 Sample 2
mean 51.2 53.2
pop sd 52 60
number 31 32
What is the p value for this problem?
2. According to a study several years ago by the
Personal Communications Industrial Association,
the average wireless phone user earns P62000 per
year. Suppose a researcher believes that the
average annual earnings of a wireless phone user
are lower now, and he sets up a study in an attempt
to prove his theory. He randomly samples 18
wireless phone users and finds out that the average
annual salary from this sample is P58974 with a
population standard deviation of P 7810. Use alpha
=0.01 to test the researcher’s theory. Assume
wages in this industry are normally distributed.
3. A survey of the morning beverage market shows
that the primary breakfast beverage for 17% of
Americans is milk. A milk producer in
Wisconsin, where milk is plentiful, believes the
figure is higher for Wisconsin. To test this idea, she
contacts a random sample of 550 Wisconsin
residents and asks which primary beverage they
consumed for breakfast that day. Suppose 115
replied that milk was the primary beverage. Using a
level of significance of 0.05, test the idea that the
milk figure is higher for Wisconsin.
4. Previous experience shows the variance of a
given process to be 14. Researchers are testing
to determine whether this value has changed.
They gather the following dozen measurements
of the process. Use this data and alpha=0.05 to
test the null hypothesis about the variance.
Assume the measurements are normally
distributed: 52, 44, 51, 58, 48, 49, 38, 49,
50, 42, 55, 51
5. Two processes in a manufacturing line are performed
manually: operation A and operation B. A random sample
of 50 different assemblies using operation A shows that
the sample average time per assembly is 8.05 minutes,
with a population standard deviation of 1.36 minutes. A
random sample of 38 different assemblies using
operation B shows that the sample average time per
assembly is 7.26 minutes, with a population standard
deviation of 1.06 minutes. For alpha =0.10, is there
enough evidence in these samples to declare that
operation A takes significantly longer to perform than
operation B?
15 March 2016
Recap
Activity 1. Response for every Question
This activity measures how well you have gained
knowledge on statistical inference. A piece of
rolled paper will be drawn from the box and you
have the opportunity to answer the question in
1 or 2 minutes. Then, it’s your turn to call
someone to draw another paper inside the box
until the questions have been answered.
Hypothesis Testing
Activity 2. Each group will be given a problem
involving statistical test. Then, a group
representative will report on the
conclusion/decision after consultation with
other members. Each group will be graded
based on this rubric points:
a. Team effort 5 b. Delivery 5
c. Consistency/accuracy 5 d. Follow the steps
in hypothesis testing 5
Exercises
1. A random sample of size 20 is taken,
resulting in a sample mean of 16.45
and a sample standard deviation of
3.59. Assume x is normally distributed
and use this information and alpha is
0.05 to test the following hypotheses:
Ho: μ = 16 H1: μ ≠ 16
2. A manufacturing firm has been averaging
18.2 orders per week for several years. However,
during a recession, orders appeared to slow.
Suppose the firm’s manufacturing manager
randomly samples 32 weeks and finds a sample
mean of 15.6 orders. The population standard
deviation is 2.3 orders. Test to determine
whether the average number of orders is down
by using alpha at .10.
3. A certain study showed that 79% of companies
offer employees flexible scheduling. Suppose a
researcher believes that in accounting firms this
figure is lower. The researcher randomly selects 415
accounting firms and through interviews
determines that 303 of these firms have flexible
scheduling. With a 1% level of significance, does the
test show enough evidence to conclude that a
significantly lower proportion of accounting firms
offer employees flexible scheduling?
4. With landfills quickly reaching their capacities,
recycling household trash has assumed increased
importance. A city legislator believes that a greater
proportion of residents in the city favor a
mandatory recycling bill. To show this, 982
southern residents were randomly sampled, and
678 were found to be supportive of the approval. A
random sample of 952 residents in the northern
region revealed 599 in favor of the bill. Formulate a
suitable set of hypotheses and test at the 1%
significance level.
Explore
• Deepen your understanding of hypothesis
procedures by examining the given
information and the things being asked. Think
of the appropriate test statistics and form the
conclusion based on the formulated null and
alternative hypotheses.
Values Integration
Correlation

Finding the relationship between two
quantitative variables without being
able to infer causal relationships

Correlation is a statistical technique
used to determine the degree to which
two variables are related
Scatter diagram
• Rectangular coordinate
• Two quantitative variables
• One variable is called independent (X) and
the second is called dependent (Y)
• Points are not joined
• No frequency table
[Figure: sample scatter diagram with Y on the vertical axis and X on the horizontal axis]
Example

Wt. (kg)     67   69   85   83   74   81   97   92   114  85
SBP (mmHg)   120  125  140  160  130  180  150  140  200  130

[Figure: Scatter diagram of weight and systolic blood pressure; x-axis Wt (kg), 60 to 120; y-axis SBP (mmHg), 80 to 220]
Scatter plots

The pattern of data is indicative of the type of
relationship between your two variables:
 positive relationship
 negative relationship
 no relationship
Positive relationship
[Figure: scatter plot of Height in CM (y-axis) against Age in Weeks (x-axis, 0 to 90)]
Negative relationship

[Figure: scatter plot of Reliability (y-axis) against Age of Car (x-axis)]
No relation
Group Activity

Think of pairs of variables that may have a
significant relationship. State how each
variable affects the other. You have
5 minutes before sharing in class. What kind of test
describes such a relationship?
Correlation Coefficient

Statistic showing the degree of relation
between two variables
Simple Correlation coefficient (r)

 It is also called Pearson's correlation
or product moment correlation
coefficient.
 It measures the nature and strength of the
association between two variables of
the quantitative type.
The sign of r denotes the nature of
association,
while the value of r denotes the
strength of association.
 If the sign is positive (+ve), the relation
is direct (an increase in one variable is
associated with an increase in the
other variable and a decrease in one
variable is associated with a
decrease in the other variable).

 If the sign is negative (-ve), the relation is
inverse or indirect (an increase in one variable is
associated with a decrease in the other).
 The value of r ranges between (-1) and (+1)
 The value of r denotes the strength of the
association as illustrated
by the following scale.

r:          -1        -0.75          -0.25     0     0.25          0.75        1
strength:       strong    intermediate    weak   weak   intermediate    strong

r = -1: perfect indirect correlation; r = 0: no relation; r = +1: perfect direct correlation

If r = 0, there is no association or
correlation between the two variables.

If 0 < |r| < 0.25: weak correlation.

If 0.25 ≤ |r| < 0.75: intermediate correlation.

If 0.75 ≤ |r| < 1: strong correlation.

If |r| = 1: perfect correlation.
How to compute the simple correlation
coefficient (r)

$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}$$
Example:
A sample of 6 children was selected, data about their
age in years and weight in kilograms was recorded as
shown in the following table . It is required to find the
correlation between age and weight.

serial Age Weight


No (years) (Kg)
1 7 12
2 6 8
3 8 12
4 5 10
5 6 11
6 9 13
These 2 variables are of the quantitative type. One
variable (age) is called the independent variable and
denoted (X), and the other (weight)
is called the dependent variable and denoted (Y).
To find the relation between age and
weight, compute the simple correlation coefficient
using the following formula:

$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}$$
Serial n.   Age (years) (x)   Weight (Kg) (y)   xy     x²     y²
1           7                 12                84     49     144
2           6                 8                 48     36     64
3           8                 12                96     64     144
4           5                 10                50     25     100
5           6                 11                66     36     121
6           9                 13                117    81     169
Total       ∑x = 41           ∑y = 66           ∑xy = 461   ∑x² = 291   ∑y² = 742

$$r = \frac{461 - \dfrac{41 \times 66}{6}}{\sqrt{\left(291 - \dfrac{(41)^2}{6}\right)\left(742 - \dfrac{(66)^2}{6}\right)}}$$

r = 0.759
strong direct correlation
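The worked example can be followed step by step in code. This sketch mirrors the computational formula: it builds the column totals of the table (∑x, ∑y, ∑xy, ∑x², ∑y²) and then substitutes them. Variable names are illustrative.

```python
import math

age = [7, 6, 8, 5, 6, 9]            # x, years
weight = [12, 8, 12, 10, 11, 13]    # y, kg

n = len(age)
sum_x, sum_y = sum(age), sum(weight)
sum_xy = sum(x * y for x, y in zip(age, weight))
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in weight)

numerator = sum_xy - sum_x * sum_y / n
denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = numerator / denominator

print(f"sums: x={sum_x} y={sum_y} xy={sum_xy} x2={sum_x2} y2={sum_y2}")
print(f"r={r:.4f}")
```

The result is about 0.7596, which the slide reports as 0.759; on the strength scale it falls just inside the strong range.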
EXAMPLE: Relationship between Anxiety and
Test Scores
Anxiety Test X2 Y2 XY
(X) score (Y)
10 2 100 4 20
8 3 64 9 24
2 9 4 81 18
1 7 1 49 7
5 6 25 36 30
6 5 36 25 30
∑X = 32 ∑Y = 32 ∑X2 = 230 ∑Y2 = 204 ∑XY=129
Calculating Correlation Coefficient

$$r = \frac{(6)(129) - (32)(32)}{\sqrt{\left[6(230) - 32^2\right]\left[6(204) - 32^2\right]}} = \frac{774 - 1024}{\sqrt{(356)(200)}} = -.94$$

r = -0.94

Indirect strong correlation


YOUR TURN !!! 

Think of variables that can be used for a
correlation test. Then put the data in a table
and compute the correlation.
(Good for 8 minutes)
SUMMARY
Summarize what you have learned in this lesson. 
Class Activities
1. The time required for a factory worker to install a
certain component in a video camcorder appears
to be related to the number of days of
experience with this procedure. Installation
times (in seconds) and the number of days of
experience appear below for a sample of 9
workers.
Experience (x)      8   6   10  8   1   3   2   5   7
Time required (y)   32  30  25  28  39  35  40  30  33

Calculate the correlation coefficient r and
interpret its value.
2. The following are the times after injury (in
weeks) and the scores on one subtest for eight
patients with similar medial nerve injuries.
Time X     3  2  5  6  2  4  10  5
Scores Y   6  8  5  3  7  8  3   4

Find the correlation coefficient r and
interpret the results.
Spearman Rank Correlation Coefficient
(rs)
It is a non-parametric measure of correlation.
This procedure makes use of the two sets of
ranks that may be assigned to the sample
values of x and Y.
Spearman Rank correlation coefficient could be
computed in the following cases:
Both variables are quantitative.
Both variables are qualitative ordinal.
One variable is quantitative and the other is
qualitative ordinal.
Procedure:
1. Rank the values of X from 1 to n where n
is the numbers of pairs of values of X and
Y in the sample.
2. Rank the values of Y from 1 to n.
3. Compute the value of di for each pair of
observation by subtracting the rank of Yi
from the rank of Xi
4. Square each di and compute ∑di², which
is the sum of the squared values.
5. Apply the following formula

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

The value of rs denotes the magnitude
and nature of association, giving the same
interpretation as simple r.
Example
In a study of the relationship between level of
education and income, the following data were
obtained. Find the relationship between them
and comment.
Sample   Level of education (X)   Income (Y)
A        Preparatory              25
B        Primary                  10
C        University               8
D        Secondary                10
E        Secondary                15
F        Illiterate               50
G        University               60
Answer:
Sample   (X)           (Y)   Rank X   Rank Y   di     di²
A        Preparatory   25    5        3        2      4
B        Primary       10    6        5.5      0.5    0.25
C        University    8     1.5      7        -5.5   30.25
D        Secondary     10    3.5      5.5      -2     4
E        Secondary     15    3.5      4        -0.5   0.25
F        Illiterate    50    7        2        5      25
G        University    60    1.5      1        0.5    0.25

∑di² = 64

$$r_s = 1 - \frac{6 \times 64}{7(48)} = -0.1$$

Comment:
There is an indirect weak correlation
between level of education and income.
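The ranking procedure can be sketched in code. The numeric coding of education levels below (illiterate = 0 up to university = 4) is an assumption introduced only so the levels can be ordered; the helper `average_ranks` is also illustrative. Rank 1 goes to the highest value, and tied values share the average of the ranks they occupy, as in the answer table.

```python
def average_ranks(values):
    """Rank from highest (1) to lowest; ties get the average of their ranks."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first_position = ordered.index(v) + 1
        tie_count = ordered.count(v)
        ranks.append(first_position + (tie_count - 1) / 2)
    return ranks

# Assumed ordinal coding of education level (not part of the original data)
level_code = {"illiterate": 0, "primary": 1, "preparatory": 2,
              "secondary": 3, "university": 4}

education = ["preparatory", "primary", "university", "secondary",
             "secondary", "illiterate", "university"]   # samples A to G
income = [25, 10, 8, 10, 15, 50, 60]

rank_x = average_ranks([level_code[e] for e in education])
rank_y = average_ranks(income)

n = len(income)
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(f"rank X={rank_x}")
print(f"rank Y={rank_y}")
print(f"sum of d squared={sum_d2} rs={rs:.3f}")
```

Unrounded, rs is about -0.14, which the slide reports as -0.1.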
exercise
F test (ANOVA)
It involves testing the equality of several
means SIMULTANEOUSLY. It is used to test
the significance of the difference between means
of 3 or more sets of data simultaneously. It
is a method of dividing the variation
observed in experimental data into
different parts, each part assignable to a
known source, cause or factor. It was
developed by Fisher, a famous statistician
from whom the term F-test came.
Simple analysis of variance is based on two
sources of variation:
1. Actual difference of the means due to
TREATMENT (SSb)
2. Chance or experimental ERROR (SSw)

Total Sum of Squares

$$TSS = \sum x^2 - \frac{\left(\sum x\right)^2}{N}$$

Sum of Squares Between Columns

$$SS_b = \frac{\sum (\text{sum of each column})^2}{\#\,\text{rows}} - \frac{\left(\sum x\right)^2}{N}$$

Sum of Squares Within Columns

$$SS_w = TSS - SS_b$$

Total degrees of freedom: $df_T = N - 1$
Between Columns degrees of freedom: $df_b = k - 1$
Within Columns degrees of freedom: $df_w = df_T - df_b$

Mean Sum of Squares Between Columns

$$MSS_b = \frac{SS_b}{df_b}$$

Mean Sum of Squares Within Columns

$$MSS_w = \frac{SS_w}{df_w}$$

$$F_c = \frac{MSS_b}{MSS_w}$$

Where N = the total number of observations
Fc = the computed value of F
Ft = the tabular value of F
k = the number of columns
df = degrees of freedom

Sources of Variation       Sum of Squares   df   Mean Sum of Squares   F values
Treatment / Between (b)                                                Fc
Error / Within (w)                                                     Ft
TOTAL
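The sums-of-squares formulas can be traced with a small example. The data below are hypothetical (three columns of three observations each, chosen so the arithmetic is easy to follow) and are not from the slides; the formula for SSb dividing by the number of rows assumes every column has the same number of observations.

```python
# Hypothetical data: each inner list is one column (treatment group)
columns = [
    [1, 2, 3],
    [2, 3, 4],
    [5, 6, 7],
]

k = len(columns)                       # number of columns
rows = len(columns[0])                 # observations per column (equal sizes)
all_values = [x for col in columns for x in col]
N = len(all_values)                    # total number of observations

correction = sum(all_values) ** 2 / N  # (sum of x)^2 / N
tss = sum(x ** 2 for x in all_values) - correction
ss_b = sum(sum(col) ** 2 for col in columns) / rows - correction
ss_w = tss - ss_b

df_t = N - 1
df_b = k - 1
df_w = df_t - df_b

mss_b = ss_b / df_b
mss_w = ss_w / df_w
f_computed = mss_b / mss_w

print(f"TSS={tss} SSb={ss_b} SSw={ss_w}")
print(f"df: total={df_t} between={df_b} within={df_w}")
print(f"MSSb={mss_b} MSSw={mss_w} Fc={f_computed}")
```

Fc is then compared with the tabular value Ft at (df_b, df_w) degrees of freedom; the null hypothesis of equal means is rejected when Fc is at least Ft.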
Test about two variances or standard deviations

$$F = \frac{s_1^2}{s_2^2} = \frac{\text{greater variance}}{\text{smaller variance}}; \quad df = (n_1 - 1,\; n_2 - 1)$$
Exercises
1. A certain clinical laboratory set up their
quality control standard for the reagents of
blood glucose test with an average of 110
mg/dL and a standard deviation of 5 mg/dL. In
a random sample of 36 runs of the test
procedure using the control reagents, the
results yielded an average of 113 mg/dL. Is
there enough evidence to conclude that the
reagents are out of control? Use 5%.
2. Test the hypothesis that the mean of
a particular electrophoresis
procedure is less than or equal to 2.5
hrs if a random sample of 16 shows a
mean procedure time of 2.6 hrs, with
a standard deviation of 0.24 hrs. Use
1%.
3. To find out whether a new drug will reduce the
spread of cancer, 9 mice which have all reached
an advanced stage of the disease are selected.
Five mice receive the treatment and four do not.
The survival times, in months from the time the
experiment commenced, are as follows:
Treatment      2.4  5.3  1.4  4.6  0.9
No treatment   1.9  0.8  2.8  3.7

At 5%, is there evidence to say that the new drug
is effective? Assume the two distributions to be
normal with equal variances.
4. It is believed that 40% of all adults in
Metro Manila are in favor of the death
penalty for heinous crimes. Do we have a
reason to believe that the proportion of
adults favoring death penalty for heinous
crimes is greater than 40% if in a random
sample of 150, 80 are in favor for death
penalty. Use alpha at 5%.
5. A course in nursing is taught to 15 students by
the traditional classroom procedure (lecture
type). A second group of 11 students was
given the same course by means of
programmed materials. At the end of the
semester the same examination was given to
each group. The 15 students meeting in the
classroom made an average grade of 86 with a
standard deviation of 3, while the 11 students
using programmed materials made an average
of 84 with standard deviation of 4. Test the
hypothesis that the two methods of learning
are equal using 0.10 level. Assume the
population to be approximately normal with
equal variances.
Form the null and alternative hypothesis,
then solve.
1. A weight reducing clinic claims that
after completion of its advanced course,
the weight of participants is decreased
by an average of 15%. A random sample
of 100 graduates of this course found an
average decrease of 14.14%. The sample
standard deviation was 3.61%. Test the
clinic’s claim at 5% level of significance.
2. It is believed that 40% of the residents
of San Pablo are in favor of the
restoration of the death penalty. Out of
500 residents surveyed, 200 are in
favor of the issue. Test the hypothesis
that the proportion of residents who
favor the restoration of the death
penalty is not significantly different
from 40%. Use a 1% level of
significance.
3. A survey was conducted to determine the
proportion of male and female respondents
who are in favor of artificial birth control
methods. 120 out of 400 males and 130 out
of 500 females are in favor of the issue. Is
there a significant difference in the
proportion of males and females who favor
the issue? Use a 5% level.
\[ z = \frac{p_1 - p_2}{\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}} \]
4. Records show that the performance of
male college students in Math, in terms of
final grade, is 5% higher than that of female
college students, with corresponding
standard deviations of 4.35% and 3.97%,
respectively. A Math instructor verifies this
information by examining the final grades.
There are 80 students, consisting of 44
females and 36 males. The following
statistics are shown:
         Ave. FG
Male     89.45%
Female   77.42%
Can the Math instructor say that the difference
between the average final grades obtained by
the males and females is higher than 5%? Use
1% level of significance.
6. A random sample of 225 nails in a
manufacturing company is gathered. The
engineer specified that the length
of a nail must be 8 cm, with a standard
deviation of 0.04 cm. The sample shows
that the average length of the nails
is 8.055 cm. Do the nails produced exceed
the engineer's specification at 1%
level of significance? Show the steps and
computation.
7. An educational researcher found that the
average entrance score of the incoming
freshmen in a university is 84. A random
sample of 24 students from a public school
was then selected and found that their
average score in the entrance examination
is 88 with a standard deviation of 16. Is
there any evidence to show that the sample
students from public school performed
better than the rest in the entrance
examination using 0.10 level of
significance?
8. At a certain college, it is estimated
that 25% of the students have cars
on campus. Does this seem to be a
valid estimate if in a random
sample of 90 college students, 28
are found to have cars? Use 0.05
level of significance.
9. It has been claimed that 30% of
students in a particular school
dislike mathematics. When a
survey was conducted, 153 of 600
students dislike math. Test if the
claim was too high at 0.05 level.
10. The average score in the entrance
examination in Mathematics in a certain
school is 80 with a standard deviation of
10. A random sample of 40 students was
taken from this year’s examinees and it was
found to have a mean score of 84. Is there a
significant difference between the known
mean and the sample mean? Test at 0.05
level. Does this indicate that this year’s
batch is better than the previous batches?
11. The principal wants to know which batch
of students performed better in English. He
took a random sample of 40 students in
last year’s batch and found a mean grade of 83
with a standard deviation of 7. Fifty
students from this year’s batch were
randomly selected and it was found to have
a mean grade of 86 with standard deviation
of 10. Does this indicate that last year’s
batch is poorer in English than this year’s
batch? Test at 0.01 level.
12.The average height of 2nd yr HS
students is 1.52 m. A random
sample of 26 students were taken
and were found to have a mean
height of 1.56m with a standard
deviation of 0.11m. Are the 26
students in the sample significantly
taller than the rest? Test at 0.05
level.
13. A study was conducted to determine the
number of customers which can be served
by 5 service crews in a leading food chain
during peak hours. A sample of 5 crews was
observed for one hour for 4 days, as
follows:
Crew   Day 1  Day 2  Day 3  Day 4
Mel    21     20     24     23
Mike   25     20     29     26
Mark   22     24     26     20
Mon    27     21     30     26
Matt   26     25     24     20
Use 1% level of significance to test whether
the differences among the mean numbers of
customers served by the 5 crews are not
significant.
14. A psychologist in a big company gave an IQ
test to 3 groups of 4 applicants each for a
managerial position. The results are given
below:
Single   Married   Widowed
90       110       115
120      105       105
125      100       110
100      95        130
At alpha = 0.05 level, test the claim that the
three population means are equal.
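The F statistic for a one-way ANOVA such as problem 14 can be sketched in Python as follows. This is only an illustration under the usual assumptions (independent groups, roughly normal populations, equal variances); the function name one_way_anova_f and the variable names are made up for this sketch and are not part of the lecture. The computed F is then compared with the tabular F for (k - 1, N - k) degrees of freedom.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-groups sum of squares: group size times squared distance
    # of each group mean from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: squared distance of each value
    # from its own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# IQ scores from problem 14
single = [90, 120, 125, 100]
married = [110, 105, 100, 95]
widowed = [115, 105, 110, 130]

f_value, df_b, df_w = one_way_anova_f([single, married, widowed])
print("F =", round(f_value, 3), "df =", (df_b, df_w))
```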
Problem Set 1
Solve and write all the steps in hypothesis testing:
1. A professor in a typing class found out that the
average performance of an expert typist is 85
words per minute. A random sample of 16
students took the typing test and an average
speed of 62 wpm was obtained with a standard
deviation of 8. Can we say that the sample
students’ performance is below the standard at
the 0.05 level? (t tab = 1.753, one-tailed)
2. In a time and motion study, it was found
out that a certain manual work can be
finished at an average time of 40 minutes
with a standard deviation of 8 minutes. A
group of 36 students is given a special
training and then found to average only 35
minutes. Can we conclude that the special
training can speed up the work using the
0.01 level? (z tab = 2.33, one-tailed)
3. Determine if there is significant difference
among the test scores obtained by the
group of 4 students from 5 different
sections. Test at 0.05 level of significance. (F
tab = 2.90)
A B C D E
89 80 97 88 89
75 87 78 92 90
95 91 89 82 94
85 95 79 77 75
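For problems 1 and 2 of Problem Set 1, the computed test statistic has the same form, (sample mean - claimed mean) / (standard deviation / sqrt(n)); it is read as t when n is small and the standard deviation comes from the sample, and as z otherwise. A minimal Python sketch, with the helper names one_sample_statistic and reject_left_tailed invented for illustration:

```python
import math


def one_sample_statistic(sample_mean, claimed_mean, sd, n):
    """Computed z or t for a test on a single mean."""
    return (sample_mean - claimed_mean) / (sd / math.sqrt(n))


def reject_left_tailed(computed, tabular):
    """Left-tailed decision rule: reject Ho when computed < -tabular."""
    return computed < -abs(tabular)


# Problem 1: typing speed, n = 16, t tab = 1.753 (one-tailed)
t_typing = one_sample_statistic(62, 85, 8, 16)
# Problem 2: special training, n = 36, z tab = 2.33 (one-tailed)
z_training = one_sample_statistic(35, 40, 8, 36)

print("t =", t_typing, "reject Ho:", reject_left_tailed(t_typing, 1.753))
print("z =", z_training, "reject Ho:", reject_left_tailed(z_training, 2.33))
```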
Linear Correlation
The Concept of Correlation
In some research problems, several variables or
characteristics of a population are studied
simultaneously to determine whether a
relationship exists and if so how close or how
significant the relationship might be.
Correlation is a statistical tool to measure the
association of two or more quantitative
variables.
The Scatterpoint Diagram
To estimate roughly if a relationship exists
between two variables, a scatterpoint diagram
is made. Draw a straight line intersecting as
many points as possible in the graph. If the
diagram suggests roughly the existence of a
linear relationship, then compute for the
coefficient of correlation r.
Scatter Diagram

A plot of paired (x,y) data with a horizontal x-axis and a vertical y-axis

Copyright © 2004 Pearson Education, Inc.


LINEAR CORRELATION
1. A random sample of the heights of 9
fathers and their respective sons was
taken to find out if height is genetically
inherited:
Father’s height in cm   Son’s height in cm
170                     158
141                     152
160                     165
158                     162
155                     150
180                     175
169                     170
166                     163
170                     143
Is there a significant correlation between the fathers’ heights and the sons’
heights? Support your answer by computation.
Pearson Product Moment
Correlation Coefficient
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} \]
x = observed data (independent variable)
y = observed data (dependent variable)
n = sample size
r = degree of relationship
Correlation Coefficient
Value of r                   Interpretation
± 1.00                       Perfect correlation
Between ± 0.91 to ± 0.99     Very high
Between ± 0.71 to ± 0.90     High
Between ± 0.51 to ± 0.70     Moderate
Between ± 0.31 to ± 0.50     Low
Between ± 0.01 to ± 0.30     Negligible correlation
0                            No correlation
Testing the Significance of r
\[ t = r\sqrt{\frac{n-2}{1-r^2}} \]
Coefficient of determination: r^2
This value shows the percentage of the
variation of the dependent variable y that
can be attributed to the variation of the
independent variable x. The rest of the
variation is due to chance.
Refer to Critical values of Pearson’s r
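The two quantities on this slide can be sketched in Python as below. The values r = 0.6 and n = 18 are hypothetical and chosen only to make the arithmetic easy to follow; the function names are likewise illustrative. The computed t is compared with the tabular t at n - 2 degrees of freedom.

```python
import math


def t_for_r(r, n):
    """Computed t for testing the significance of a correlation r."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


def coefficient_of_determination(r):
    """Share of the variation in y attributed to the variation in x."""
    return r ** 2


r, n = 0.6, 18  # hypothetical values, not taken from the exercises
t_value = t_for_r(r, n)
r_squared = coefficient_of_determination(r)
print("t =", round(t_value, 3), "with df =", n - 2)
print("r^2 =", round(r_squared, 2), "or", round(100 * r_squared), "percent")
```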
2. In a study on the predisposition of obese
persons to have cardiovascular diseases,
the relationship between the weights and
the total serum cholesterol levels of 10
individuals was measured.
Weight (kg)   Total serum cholesterol level (mg/dL)
70            180
75            185
80            200
110           200
100           195
105           220
120           225
125           210
130           210
a. Draw the scatter diagram
b. Compute the Pearson’s r
c. Compute the Coefficient of
Determination and interpret the result
d. Test the significance of the computed r.

http://www.statsoft.com/textbook/distribution-tables/?button=3
The Least Squares Linear Regression
Equation
The mathematical relationship between X and Y,
in this particular method, is expressed in the
linear equation y = a + bx, where a is the
regression constant or intercept and b is the
regression coefficient.
\[ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}, \qquad a = \bar{y} - b\bar{x} \]
Selected Families and Corresponding Monthly
Income in Thousand Pesos
Number of Members in a Family (x)   Monthly Income (y)
3                                   14.3
8                                   20.4
5                                   17.5
6                                   20.3
7                                   20.5
Estimate the monthly income when x = 10 and
x = 12.
Solution:
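One way to carry out the computation is sketched below in Python, using the sums-based formulas for b and a. The names least_squares, members and income are illustrative only. Note that x = 10 and x = 12 lie outside the observed range of family sizes (3 to 8), so the estimates are extrapolations and should be read with caution.

```python
def least_squares(xs, ys):
    """Return (a, b) of the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)  # a = y-bar minus b times x-bar
    return a, b


members = [3, 8, 5, 6, 7]               # number of family members (x)
income = [14.3, 20.4, 17.5, 20.3, 20.5]  # monthly income, thousand pesos (y)

a, b = least_squares(members, income)
print("y =", round(a, 3), "+", round(b, 3), "x")
for x in (10, 12):
    print("estimated income when x =", x, "is", round(a + b * x, 2))
```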
Regression Model
Assets of Bank A from 2000 to 2011 in million pesos
YEAR ASSETS YEAR ASSETS
2000 0.67 2006 0.81
2001 0.69 2007 0.85
2002 0.72 2008 0.92
2003 0.76 2009 0.97
2004 0.78 2010 0.95
2005 0.81 2011 0.95

a) Calculate the regression coefficient
b) Calculate the regression intercept
c) Find the least squares linear regression equation (LSLRE)
Correlation Analysis
• It is a method used to measure the strength
of relationship between two or more
variables
• Types of Correlation
a) Positive correlation exists when high scores
in one variable are associated with high
scores in the second variable, thus there is
a direct relationship that exists in positively
correlated variables.
b) Negative correlation exists when high scores
in one variable are associated with low
scores in the second variable.
c) Zero correlation exists when high scores in one
variable are associated with neither systematically
high nor systematically low scores in the other
variable.
Statistical Tests
• Pearson Product-Moment Correlation coefficient
(correlation between interval variables)
• Spearman Rank-Order Correlation Coefficient
(correlation between ordinal variables of 30
samples or less)
• Guttman’s Lambda (correlation between nominal
variables)
• Goodman and Kruskal’s Gamma (correlation
between ordinal variables of more than 30
samples)
• Point biserial (correlation between interval
and dichotomous nominal variables)
• Correlation ratio (correlation between interval
and any nominal variables)
Pearson r
\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} \]
A businessman would like to determine if there is a relationship between
the size of a store and the profit to be earned. Calculate the value of r,
interpret at .01 level. Is there a significant relationship between the
size and the profit?
Store   Store size (in sq m)   Profit (in thousand pesos)
A 35 20
B 22 15
C 27 17
D 16 9
E 28 16
F 12 7
G 40 22
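The computation of r for the store data can be sketched in Python with the sums-based Pearson formula. The function name pearson_r and the variable names are illustrative. Significance still has to be judged against the critical values of r (or the t test) at the .01 level.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


store_size = [35, 22, 27, 16, 28, 12, 40]  # sq m, stores A to G
profit = [20, 15, 17, 9, 16, 7, 22]         # thousand pesos

r = pearson_r(store_size, profit)
print("r =", round(r, 3))
```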
Spearman rho
\[ \rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)} \]
A panel of 5 men and 5 women are asked to rank 10
ideas for a new TV program on the basis of their
appeal to general audiences. The following are
the results:
Program Idea   Men’s Ranking   Women’s Ranking
1              6               6
2              4               10
3              8               8
4              7               2
5              2               7
6              1               1
7              3               5
8              5               9
9              9               4
10             10              3
Find the value of rho. Test at alpha = .05.
Is there an association or relationship between the
rankings of men and those of women?
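A short Python sketch of the Spearman computation for the TV program rankings follows. It assumes the data are already ranks with no ties, as in this example; the function name spearman_rho is illustrative. The result is then compared with the tabular value of rho for N = 10.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman rho for two untied rankings of the same N items."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))  # sum of D squared
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


men = [6, 4, 8, 7, 2, 1, 3, 5, 9, 10]
women = [6, 10, 8, 2, 7, 1, 5, 9, 4, 3]

rho = spearman_rho(men, women)
print("rho =", round(rho, 3))
```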
• Guttman’s Lambda
\[ \lambda = \frac{\sum FR - CT}{N - CT} \]
where FR = the biggest cell frequency in each
column
CT = the biggest row total
N = total frequency
Let us measure the degree of relationship of
individual’s religion and political party.
LAKAS LAMMP REPORMA TOTAL

Catholic 20 9 15 44
INC 5 18 4 27
Protestant 11 8 10 29
Total 36 35 29 100
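Using the definition given above (FR as the biggest cell in each column, CT as the biggest row total), lambda for the religion and political party table can be sketched in Python as follows. The function name guttman_lambda is illustrative, and the table is entered without its totals row and column.

```python
def guttman_lambda(table):
    """Lambda from a table of cell frequencies (rows x columns)."""
    column_maxima = [max(col) for col in zip(*table)]  # FR for each column
    row_totals = [sum(row) for row in table]
    n = sum(row_totals)   # N, total frequency
    ct = max(row_totals)  # CT, biggest row total
    return (sum(column_maxima) - ct) / (n - ct)


#             LAKAS  LAMMP  REPORMA
religion_by_party = [
    [20, 9, 15],   # Catholic
    [5, 18, 4],    # INC
    [11, 8, 10],   # Protestant
]

lam = guttman_lambda(religion_by_party)
print("lambda =", round(lam, 3))
```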
Academic     NSAT Scores               Total
Grades       Low   Average   High      Students
Above 90     13    18        14        45
80 to 90     25    31        20        76
Below 80     21    38        20        79
Total        59    87        54        200
Correlation Ratio
\[ E^2 = \frac{\sum N_i \bar{y}_i^2 - N\bar{y}^2}{\sum y_{ij}^2 - N\bar{y}^2} \]
where N_i = number of samples per category
\bar{y}_i = average obtained per category
N = total number of samples
\bar{y} = overall average
y_ij = individual item
Let us measure the degree of relationship
between the civil status and the annual
salary (in thousand pesos) of the given
samples:

Single 65 83 81 69 73 89 76 60

Married 70 67 90 84 78
Widow 89 64 78
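The correlation ratio for the civil status and salary data can be sketched in Python as below, following the formula with category means in the numerator and individual items in the denominator. The function name correlation_ratio_squared is illustrative.

```python
def correlation_ratio_squared(categories):
    """E squared for a list of samples, one list per category."""
    all_values = [y for group in categories for y in group]
    n = len(all_values)
    overall_mean = sum(all_values) / n
    # Numerator: sum of N_i times (category mean squared), minus N times
    # (overall mean squared).
    between = sum(len(g) * (sum(g) / len(g)) ** 2 for g in categories)
    between -= n * overall_mean ** 2
    # Denominator: sum of squared individual items, minus N times
    # (overall mean squared).
    total = sum(y ** 2 for y in all_values) - n * overall_mean ** 2
    return between / total


single = [65, 83, 81, 69, 73, 89, 76, 60]
married = [70, 67, 90, 84, 78]
widow = [89, 64, 78]

e_squared = correlation_ratio_squared([single, married, widow])
print("E^2 =", round(e_squared, 4))
```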
A. Construct the null and alternative
hypothesis on each of the given
statements:
1. To determine the influence of short stories
in shaping the sex typed attitudes of high
school students (two-tailed test)
2. To determine the difference in the
performance of public and private
secondary school students in the national
entrance examination ( one-tailed, right
directional test)
3. A team of researchers want to determine if
grades in college are related to success in
a chosen field. The most appropriate
statistical analysis for the problem is _
a. Correlation analysis
b. Regression analysis
c. Prediction
d. Measure of variability
4. Which correlation coefficient represents
the strongest relationship between two
variables?
a. 0.0 b.-0.80 c. 0.60 d. 0.73
5. When small values of X are associated with
small values of Y and large values of X are
associated with large values of Y, then the
relationship between X and Y is_
a. Negatively correlated
b. Positively correlated
c. No correlation
d. Undetermined
6. State when to use the following statistical
tests:
a. Spearman rho
b. Guttman’s lambda
c. ANOVA (one-way)
d. Z test (single mean)
e. Correlation ratio
f. Pearson r
B. Perform test on hypothesis using the suggested
steps in hypothesis testing:
7. Previous research showed that the average
height of female students of ABC University is
1.55 meters with a standard deviation of
0.14 meters. In order to verify this, the student-
researchers draw a random sample of 144
female students and it shows that the average
height is 1.48 m. Use alpha at 5% to test whether the
previous research study was valid. (z tab = ±
1.96, 2-tailed test)
8. The following are the sales and the profit
earned by XYZ Co. in million pesos for the
last 7 months:
Month Sales Profit
January 152 20
February 150 25
March 140 15
April 130 18
May 122 17
June 120 19
July 112 21

Convert the data to ranks and compute the value of rho to find
out if there is a relationship between sales and profit.
Test at 0.01 level (tabular value = 0.893)
9. The following are the ages of 7 employees in
a hospital and their corresponding efficiency
rating:
Employee Age Efficiency Rating
1 44 61
2 44 41
3 45 91
4 43 77
5 40 70
6 52 88
7 43 93

Is there a significant relationship between their ages and efficiency ratings?
Interpret the result based on the correlation coefficient (r) table.
10. A study was conducted to know if there
was any relationship between grades as a
high school senior and grades as a college
freshman. Grades are recorded as the
average during the last year of high school
and the average during the freshman year
of college. Calculate r. Is r significant at
alpha = 0.057?
HS 74 90 93 92 98 78 88 94 76
College 78 85 94 94 98 84 88 97
Chi-Square Distribution
• One-Way Sample (Test of Goodness of Fit)
\[ \chi^2 = \sum \frac{(O - E)^2}{E}, \qquad df = k - 1 \]
where O = observed data
E = expected data
E = np
n = total frequencies
p = hypothesized proportion for the category
k = number of categories
• Two-way classification (two or more samples)
\[ E = \frac{(\text{row total})(\text{column total})}{\text{grand total}} \]
df = (row - 1)(column - 1)
\[ \chi^2 = \sum \frac{(|O - E| - 0.5)^2}{E} \]
Seatwork
1. A researcher wishes to get the pulse of the
studentry about the new enrollment
scheme. The scheme is not so popular and
so the researcher put a 50-50 chance of
acceptance. He used a sample of 150
students and asked them to give their
preferences to the scheme as favorable or
not favorable. Alpha = 0.05, chi-square tab = 3.84
Category
Favorable   Not Favorable   Total
78          72              150
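The goodness-of-fit computation for this seatwork item can be sketched in Python as follows. The expected frequencies come from E = np with p = 0.5 for each category, as stated in the problem; the function name chi_square_gof is illustrative.

```python
def chi_square_gof(observed, proportions):
    """Chi-square statistic for a one-way goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in proportions]  # E = np
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [78, 72]       # favorable, not favorable
proportions = [0.5, 0.5]  # the 50-50 chance assumed by the researcher

chi_square = chi_square_gof(observed, proportions)
tabular = 3.84            # given in the problem for alpha = 0.05, df = 1
print("chi-square =", round(chi_square, 2))
print("reject Ho:", chi_square > tabular)
```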
2. Is sex related to alcohol consumption? Test
at alpha = 0.90, chi-square tab = 0.211
          Alcohol Consumption
          Heavy   Moderate   Light   Total
Male      11      18         21      50
Female    7       15         28      50
Total     18      33         49      100
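For the two-way table, expected frequencies are (row total x column total) / grand total. The Python sketch below computes the statistic both without and with the 0.5 continuity correction shown in the formula slide; which version to report is left to the instructor's convention. The function name chi_square_two_way and the flag name corrected are made up for this sketch.

```python
def chi_square_two_way(table, corrected=False):
    """Chi-square statistic and df for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            diff = abs(observed - expected)
            if corrected:
                diff -= 0.5  # continuity correction from the formula slide
            statistic += diff ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, df


#          Heavy  Moderate  Light
alcohol = [
    [11, 18, 21],  # Male
    [7, 15, 28],   # Female
]

plain, df = chi_square_two_way(alcohol)
with_correction, _ = chi_square_two_way(alcohol, corrected=True)
print("chi-square =", round(plain, 3), "df =", df)
print("with 0.5 correction =", round(with_correction, 3))
```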
Quiz
• Write the word being described in the given
statements and phrases.
1. This is a subset of a group of representatives of a
population.
2. A form of presenting data using tables for better
understanding and interpretation
3. This is the opposite of the null hypothesis.
4. This test is used when the number of sample
means is less than 30.
5. Data that permits us to describe how much more
or less one object possesses than another
6. It refers to how high, flat, or moderate a normal
curve is drawn based on the values of mean,
median and mode.
7. A form of presenting data information through
words or statements
8. This is done after the result of every statistical
test.
9. A method used in measuring the strength of
relationships between two or more variables
10. A Greek letter that stands for summation
11–13. With your answers from #1-10, write the
words that you are able to decode. (Think of the
first letter of each answer.)
• Read and understand the statements
and questions below. Write only the
letter.
14. The hypothesis which is hoped to be rejected is
A. null hypothesis B. alternative hypothesis
C. both A and B D. neither A nor B
15. If we reject Ho when it is true, we commit a
A. type I error B. type II error
C. type III error D. no error
• If you want to show that Method A of
teaching computer programming is
more effective than Method B, then
16. Ho should be stated as follows:
A. Ho: Method A is more effective than Method B
B. Ho: Method A is as effective as Method B
C. Ho: Method A is less effective than Method B
D. Ho: Method A is less effective or as effective as
Method B

17. Ha: should be stated as
follows:
A. Ha: Method A is as effective as Method B
B. Ha: Method A is more effective than Method B
C. Ha: Method A is less effective than Method B
D. Ha: Either B or C
18. The statistic used to test the significance
of difference between means when n is
large
A. Chi-square test B. t-test
C. F-test D. Z-test
Determine whether the following statements
can be a null hypothesis or alternative
hypothesis. Indicate Ho or Ha on the
blanks provided for.
19.Classical music has a positive effect on the
memory ability of Grade IV students of a certain
elementary school.
20. The pre-test scores of the students
belonging to Group A in the Language
ability test do not differ from
those of the students belonging to Group B.
21. Low scores in mental ability test
corresponds to low scores in the self-
concept test.
22.The performance of pre-schoolers from
private schools in the memory ability test is
significantly different from those coming
from the public schools.
23. Sleep deprived students have lower
performance in the mathematical learning
ability test than those with 8 hours sleep.
24. Introducing colors to pictures has no
effect on the memory retention of Grade I
pupils.
25. The performance of the students exposed
to verbal motivation in a given learning
ability test do not differ with that of the
students exposed to nonverbal motivation.
Exercises

Fill in the blanks


1. If zc is -2.50 and ztab is -2.58, the
correct decision is …
2. When Ho is rejected at 0.01 level, the
difference is considered to be ….
3. Given : Ho: u1= u2,
Ha: u1 > u2, this needs what
kind of test ?....
4. At alpha = 5% means
a) 95%confident that you made the right
decision
b) 5% confident
c) 99%confident
d) 1% confident
5. Test the hypothesis for :
a) In a certain city, the mean age of newly hired
nurses is 28.95 with a standard deviation of
16.31. A sample of 45 newly hired nurses this
year were taken and found to have a mean
age of 35.05. Does this indicate that the age
of newly hired nurses has increased ? Test at
5% level.
6. Represent a perfect negative correlation
using scatter diagram…
7. The value r = 0.48 means
a) Moderate relationship
b) High relationship
c) Moderately high
d) Negligible relationship
8. A store manager wants to estimate weekly
sales on the basis of Monday and Friday
sales. Which of the following statistical
analysis would help him?
a) Correlation analysis
b) Z-test
c) T-test
d) Regression analysis
9. Two supervisors were asked to evaluate
separately 10 employees based on their
performance. Did the evaluation of both
supervisors agree with each other? The
appropriate statistic is …
a) Pearson r
b) Spearman rank order correlation
c) Regression analysis
d) All of the above
10.A coefficient of correlation indicates both
the direction and strength of relationship
between two variables. Which of the
following indicates the strength of
relationship?
a) The sign of the coefficient
b) The absolute value of the coefficient
c) Both a and b
d) None of the above
For testing the difference between two proportions from
independent groups
\[ z = \frac{\frac{x_1}{n_1} - \frac{x_2}{n_2}}{\sqrt{p(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \]
where p = (x1 + x2)/(n1 + n2)
x1 = number of successes in the 1st group
x2 = number of successes in the 2nd group
n1 = number of cases in the 1st group
n2 = number of cases in the 2nd group
In a factory of baby dresses, one
production process yielded 28
defective pieces in a random sample
of 400 while another yielded 15
defective pieces in a random sample
size of 300. Is there a significant
difference between the proportions
of defective baby dresses? Test at
0.05 level
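The computed z for the baby dresses example can be sketched in Python using the pooled proportion p = (x1 + x2)/(n1 + n2). The function name two_proportion_z is illustrative. For a two-tailed test at the 0.05 level, the tabular value is ±1.96.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Computed z for the difference between two independent proportions."""
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / standard_error


z = two_proportion_z(28, 400, 15, 300)
print("z =", round(z, 3))
print("reject Ho at 0.05 (two-tailed):", abs(z) > 1.96)
```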
Assignment
1. A candy maker produces two brands of
candy, X and Y. It is found that 56 of 200
buyers prefer brand X and 29 of 150 buyers
prefer brand Y. Can we conclude that brand X
outsells brand Y? Use 0.05 level of significance.
2. Opinion on a certain issue in a college
community is believed to be split 80% for and
20% against. In a sample of 400, 83% answer
affirmatively. Does this result discredit the
organization hypothesis at 0.10 level?
3. In a study of cheating among college students, 144 or
41.4% of 348 students from homes of good-
socio-economic status were found to have
cheated on various tests. In the same study,
133 or 50.2% of 265 students from homes of
poor socio-economic status also cheated on
the same tests. Is there a true difference in
the incidence of cheating in these two groups
at 0.10 level of significance? (Z tab = ±1.645)
4. A study is made to determine if a cold climate
contributes more to absenteeism from school
during December, than a warmer climate. Two
groups of students are selected at random,
one group from Baguio and the other from
Metro Manila. Of the 300 students from
Baguio, 72 were absent at least 1 day during
December, and of the 400 students from
Metro Manila, 70 were absent 1 or more days.
Can we conclude that a colder climate, results
in a greater proportion of students being
absent from school at least 1 day during
December? Use a 0.01 level of significance.
(Z tab = ±2.33)
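• LARGE SAMPLE CONFIDENCE INTERVALS FOR
A POPULATION MEAN
Let X1, …, Xn be a large (n > 30) random sample
from a population with mean μ and
standard deviation σ, so that the sample mean
(x̄) is approximately normal. Then a level
100(1 - α)% confidence interval for μ is
\[ \bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}}, \qquad \text{where } \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]
When the value of σ is unknown, it can be
replaced with the sample standard deviation
s.
In particular,
\[ \bar{x} \pm \frac{s}{\sqrt{n}} \ \text{is a 68\% confidence interval,} \]
\[ \bar{x} \pm 1.645\frac{s}{\sqrt{n}} \ \text{is a 90\%,} \qquad \bar{x} \pm 1.96\frac{s}{\sqrt{n}} \ \text{is a 95\%,} \]
\[ \bar{x} \pm 2.58\frac{s}{\sqrt{n}} \ \text{is a 99\%,} \qquad \bar{x} \pm 3\frac{s}{\sqrt{n}} \ \text{is a 99.7\%} \]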
In a random sample of 100 batteries produced
by a certain method, the average lifetime was
150 hrs and the standard deviation was 25 hrs.
a) Find a 95% confidence interval for the mean
lifetime of batteries produced by this
method?
b) Find the 80% confidence interval
c) Find the 99% confidence bound for µ
Solutions
a) 150 ± (1.96)(25/10) = (145.1, 154.9)
b) Set 1 - α = .80
α = .20
α/2 = .10, so z = 1.28 (Z table)
150 ± (1.28)(2.5) = (146.8, 153.2)
• Subtract 0.10 from 0.5 to get the area 0.40 before looking at the Z table.
c) 150 ± 2.33 (2.5) =
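Upper and Lower Confidence Bound
for μ
• \[ \bar{x} + z_{\alpha}\,\sigma_{\bar{x}} \ \text{(upper)}, \qquad \bar{x} - z_{\alpha}\,\sigma_{\bar{x}} \ \text{(lower)}, \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]
\[ \bar{x} \pm 1.28\,\sigma_{\bar{x}} \ (90\%), \qquad \bar{x} \pm 1.645\,\sigma_{\bar{x}} \ (95\%), \qquad \bar{x} \pm 2.33\,\sigma_{\bar{x}} \ (99\%) \]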
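The battery example can be checked with a short Python sketch. It takes z from the standard normal distribution in Python's statistics module instead of a printed Z table, so the results may differ slightly from the rounded table values 1.96, 1.28 and 2.33. The function names are illustrative, and s is used in place of σ as the slide allows for large samples.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(mean, sd, n, level):
    """Two-sided large-sample confidence interval for a population mean."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # z at alpha/2
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


def mean_confidence_bounds(mean, sd, n, level):
    """One-sided lower and upper confidence bounds, each at the given level."""
    z = NormalDist().inv_cdf(level)  # z at alpha
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


# Battery example: n = 100, mean lifetime 150 hrs, s = 25 hrs
low95, high95 = mean_confidence_interval(150, 25, 100, 0.95)
low80, high80 = mean_confidence_interval(150, 25, 100, 0.80)
lower99, upper99 = mean_confidence_bounds(150, 25, 100, 0.99)

print("95% interval:", round(low95, 1), "to", round(high95, 1))
print("80% interval:", round(low80, 1), "to", round(high80, 1))
print("99% lower bound:", round(lower99, 1), " 99% upper bound:", round(upper99, 1))
```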
