STATONE
SY 2016-2017
RM Alecha
1
•"It is not how much
you do, but how
much love you put
in the doing."
Introduction to Business Statistics
RM Alecha
Objectives
• Identify the position of a data value in a
data set, using various measures of
position such as percentiles, deciles, and
quartiles
Measures of Position (or Location
or Relative Standing)
Are used to locate the relative position of
a data value in a data set
Can be used to compare data values from
different data sets
Can be used to compare data values
within the same data set
Can be used to help determine outliers
within a data set
Includes z-(standard) score, percentiles,
quartiles, and deciles
Other Measures of Location
• Percentiles are measures of location that divide a group of data into 100 parts. There are 99 percentiles, because it takes 99 dividers to separate a group of data into 100 parts.
• The nth percentile is the value such that at least n percent of the data are below that value and at most (100 - n) percent are above that value. Specifically, the 87th percentile is a value such that at least 87% of the data are below the value and no more than 13% are above the value.
Steps in Determining the Location of a Percentile
1. Organize the numbers into an ascending-order array.
2. Calculate the percentile location (i) by
i = pn/100
where p = the percentile of interest
i = percentile location
n = number in the data set
3. Determine the location by either:
• If i is a whole number, the Pth percentile is the average of the value at the ith location and the value at the (i+1)st location.
• If i is not a whole number, the Pth percentile value is located at the whole number part of (i + 1).
Percentiles
Divides the data set in 100 (“per cent”)
equal groups
Used to compare an individual data value
with the national “norm”
Symbolized by P1, P2 ,…..
Percentile rank indicates the percentage
of data values that fall below the
specified rank
To find the percentile rank for a given data value, x:
$\text{Percentile Rank} = \frac{(\text{number of data values below the given data point}) + 0.5}{\text{total number of values}} \times 100\%$
Examples
American College Test (ACT) Scores attained by 25 members of a local high
school graduating class (Data is ranked)
14 16 17 17 17
18 19 19 19 19
20 20 20 21 21
21 23 23 24 25
25 25 28 28 31
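The percentile rank formula can be sketched in Python on the ACT scores above. The function name `percentile_rank` and the choice of the score 21 are illustrative assumptions, not part of the slides.

```python
def percentile_rank(sorted_values, x):
    """Percentile rank of x: (number of values below x + 0.5) / n * 100."""
    below = sum(1 for v in sorted_values if v < x)
    return (below + 0.5) * 100 / len(sorted_values)

# ACT scores of the 25 graduating-class members (already ranked)
act_scores = [14, 16, 17, 17, 17, 18, 19, 19, 19, 19, 20, 20, 20,
              21, 21, 21, 23, 23, 24, 25, 25, 25, 28, 28, 31]

# Illustrative choice: 13 scores lie below 21
rank_of_21 = percentile_rank(act_scores, 21)
print(rank_of_21)
```

Here 13 scores lie below 21, so the rank is (13 + 0.5)/25 × 100%, which places a score of 21 at the 54th percentile.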
$c = \frac{n \cdot p}{100}$
where n is the total number of values and p is the given percentile
Step 3: Consider result from Step 2
If c is NOT a whole number, round up to the next whole number.
Starting at the lowest value, count over to the number that
corresponds to the rounded up value
If c is a whole number, use the value halfway between the cth and
(c+1)st value when counting up from the lowest value
14 16 17 17 17
18 19 19 19 19
20 20 20 21 21
21 23 23 24 25
25 25 28 28 31
To be in the 90th percentile, what would you have to score on the ACT?
Find P85
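The locating steps can be sketched as follows: a minimal Python version, assuming p is a whole number from 1 to 99 and the data are already sorted. The function name `percentile_value` is hypothetical.

```python
def percentile_value(sorted_values, p):
    """Value of the p-th percentile using the location c = n*p/100."""
    n = len(sorted_values)
    whole, remainder = divmod(p * n, 100)  # c = whole + remainder/100
    if remainder == 0:
        # c is a whole number: halfway between the c-th and (c+1)st values
        return (sorted_values[whole - 1] + sorted_values[whole]) / 2
    # c is not a whole number: round up and count to that position
    return sorted_values[whole]

act_scores = [14, 16, 17, 17, 17, 18, 19, 19, 19, 19, 20, 20, 20,
              21, 21, 21, 23, 23, 24, 25, 25, 25, 28, 28, 31]

p90 = percentile_value(act_scores, 90)  # c = 22.5, so the 23rd value
p85 = percentile_value(act_scores, 85)  # c = 21.25, so the 22nd value
p20 = percentile_value(act_scores, 20)  # c = 5, so average of 5th and 6th
print(p90, p85, p20)
```

With these data, a score of 28 is needed to be in the 90th percentile, and P85 is 25.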
Quartiles
• Same concept as percentiles, except the data
set is divided into four groups (quarters)
• Quartile rank indicates the percentage of
data values that fall below the specified rank
• Symbolized by Q1 , Q2 , Q3
• Equivalencies with Percentiles:
– Q1 = P25
– Q2 = P50 = Median of data set
– Q3 = P75
Minitab calculates these for you.
Q1 (First Quartile) separates the bottom
25% of sorted values from the top 75%.
• Identifying Outliers
– Is the data point between
14 16 17 17 17
18 19 19 19 19
20 20 20 21 21
21 23 23 24 25
25 25 28 28 31
Examples
Why Do Outliers Occur?
• Data value may have resulted from a measurement or observational error
• Data value may have resulted from a recording error
• Data value may have been obtained from a subject that is not in the defined population
• Data value might be a legitimate value that occurred by chance (although the probability is extremely small)
Important Characteristics of Data
4. Outliers: Sample values that lie very far away from the vast majority of
other sample values
DESCRIPTIVE VALUES
MEASURES OF VARIABILITY
MEASURES OF CENTRAL TENDENCY
• WHEN THE GRAPH OF THE SCORES IS A NORMAL
CURVE, THE MODE, MEDIAN, AND MEAN ARE EQUAL
• THE MEAN IS THE MOST COMMON MEASURE OF
CENTRAL TENDENCY
• WHEN THE SCORES ARE QUITE SKEWED OR THE
DATA IS ORDINAL LACKING A COMMON INTERVAL,
THE MEDIAN IS A BETTER MEASURE OF CENTRAL
TENDENCY
• THE MODE IS USED ONLY WHEN THE MEAN OR
MEDIAN CANNOT BE CALCULATED (E.G., NOMINAL
DATA) OR WHEN THE ONLY INFORMATION WANTED
IS THE MOST FREQUENT SCORE (E.G., MOST UNIFORM
SIZE OR INJURY SITE)
MEASURES OF VARIABILITY
• RANGE
• STANDARD DEVIATION
• VARIANCE
RANGE
• EASIEST MEASURE OF
VARIABILITY TO CALCULATE
• USED WHEN THE MEASURE OF
CENTRAL TENDENCY IS THE MODE
(NOMINAL DATA OR WHEN THE
MOST FREQUENT SCORE IS OF
INTEREST) OR MEDIAN (ORDINAL
DATA OR SKEWED DATA)
• SIMPLY THE DIFFERENCE
BETWEEN THE HIGHEST AND
LOWEST SCORES
WHAT IS THE RANGE IN THE SET OF
SCORES BELOW?
• SET OF SCORES:
7, 2, 7, 6, 5, 6, 2
• X = SCORES
• N = NUMBER OF SCORES
• FORMULA TYPICALLY USED
FOR HAND CALCULATION
CALCULATIONAL FORMULA FOR
STANDARD DEVIATION
• FORMULA 2.4 SHOULD BE USED IF
THE GROUP TESTED IS VIEWED AS A
REPRESENTATIVE PART OF THE
POPULATION; CONSIDERED THEN A
SAMPLE
• STANDARD DEVIATION CALCULATED
ON THE SAMPLE IS USED AS AN
ESTIMATE OF THE POPULATION
STANDARD DEVIATION (E.G.,
CALCULATION OF THE STANDARD
DEVIATION OF THE 40-YARD TIME OF
COLLEGE WIDE RECEIVERS THAT IS
USED AS AN ESTIMATION OF THE
STANDARD DEVIATION OF ALL
COLLEGE WIDE RECEIVERS)
• X = SCORES
• N = NUMBER OF SCORES
• FORMULA TYPICALLY USED FOR
HAND CALCULATION
SAMPLE CALCULATION OF THE STANDARD
DEVIATION USING FORMULA 2.3 AND 2.4 AND THE
FOLLOWING TESTS SCORES: 7, 2, 7, 6, 5, 6, 2
VARIANCE
• USEFUL STATISTIC IN CERTAIN
HIGH LEVEL STATISTICAL
PROCEDURES LIKE REGRESSION
ANALYSIS AND ANALYSIS OF
VARIANCE (ANOVA)
• CALCULATED BY SQUARING THE
STANDARD DEVIATION ($S^2$)
• STANDARD DEVIATION = S = 4
• VARIANCE = $S^2$ = $4^2$ = 16
The Standard Deviation
• The standard deviation measures the deviations
between the mean of the distribution and each of the
individual scores.
[Figure: frequency histograms of the distribution of cases, contrasting smaller and larger deviations from the mean]
SW318
Social Work Statistics
Slide 42
Variability: the Variance
• The variance is another measure of variability
that is equal to the square of the standard
deviation. The variance is the average of the
squared deviations from the mean.
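Sample Standard Deviation
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Sample Standard Deviation (Shortcut Formula)
$s = \sqrt{\frac{n\sum x^2 - \left(\sum x\right)^2}{n(n - 1)}}$
Sample Standard Deviation Formula (Grouped Data)
$s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}}$
Sample Standard Deviation Shortcut Formula (Grouped Data)
$s = \sqrt{\frac{n\sum f x^2 - \left(\sum f x\right)^2}{n(n - 1)}}$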
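As a rough check that the definitional and shortcut formulas for the sample standard deviation agree, here is a minimal Python sketch using the test scores 7, 2, 7, 6, 5, 6, 2 from the earlier slide. The function names are illustrative assumptions.

```python
from math import sqrt

def sd_definition(xs):
    """s = sqrt( sum((x - mean)^2) / (n - 1) )."""
    n = len(xs)
    mean = sum(xs) / n
    return sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def sd_shortcut(xs):
    """s = sqrt( (n*sum(x^2) - (sum x)^2) / (n*(n - 1)) )."""
    n = len(xs)
    return sqrt((n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1)))

scores = [7, 2, 7, 6, 5, 6, 2]
s_def = sd_definition(scores)
s_short = sd_shortcut(scores)
print(round(s_def, 4), round(s_short, 4))
```

Both routes give s ≈ 2.16 for these scores; the shortcut form avoids computing each deviation from the mean.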
The variance of a set of
values is a measure of
variation equal to the square
of the standard deviation.
Sample variance:
Square of the sample
standard deviation s
Variability: the Range
• The range is the highest score minus the lowest
score in a sorted distribution.
The Range
[Figure: two data sets. In one, the values fall closely together; the other shows greater variability. Yet the ranges are equal!]
Importance of the IQR
The Box Plot
• The Box Plot is a graphic device that visually
presents the following elements: the range, the IQR,
the median, the quartiles, the minimum (lowest
value), and the maximum (highest value).
[Figure: box plot labeled, from top to bottom, Maximum, Q3, Q1, Minimum]
Find the Mean and the
Standard Deviation
Computing a Range
Using the data from the credit card problem, we would sort the
five scores (2, 1, 2, 3, and 4) as shown below, and compute the
range by subtracting 1 from 4.
1 2 2 3 4
Range = 3.0
Interpreting the Range
• The range is usually described as the total spread in the
distribution.
[Figure: three frequency histograms drawn on the same horizontal scale, 3.5 to 14.5]
• The IQV can be computed for ordinal level variables and for
interval level variables that have been grouped in a
frequency distribution.
Computing an IQV
If all of the cases in a distribution fall in one
category, that category would be the modal
category and there would be no dispersion. In
this case, the IQV would be 0%.
[Figure: percent bar charts for a variable with two categories; surviving caption fragment: "... the number of cases in each is ... 100%."]
Picturing the Index of Qualitative
Variation - 2
If the variable has three categories and the 50 cases not in the modal
category are divided among the non-modal categories, the IQV
decreases, i.e. there is less dispersion.
[Figure: percent bar charts for three-category variables, with IQV = 87.00% and IQV = 93.00%; surviving caption fragment: "As the division of ... modal categories ... increases, indicating greater dispersion."]
Picturing the Index of Qualitative
Variation - 3
[Figure: percent bar charts for variables with three to six categories, with IQV = 99.00% and IQV = 96.00%; surviving caption fragment: "On this slide, we keep the number of cases in the ... the number of categories in the ... as the number of categories in the distribution increases."]
Index of Qualitative Variation
• In summary, IQV is affected both by the division of
cases between the modal and non-modal categories,
and by the number of categories for the variable.
$MAD = \frac{\sum f\,|x - \bar{x}|}{n}$
Application
1. The following are the response times in seconds
of a smoke alarm after the release of smoke
from a fixed source:
12 9 11 7 9 14 6 10
s Q3 Q1
7. n 1 12. midquartile
2
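A minimal Python sketch of the range and the mean absolute deviation (MAD) for the smoke-alarm response times above. For ungrouped data each frequency f is taken to be 1, which is an assumption about how the MAD formula applies here; the function names are hypothetical.

```python
def data_range(xs):
    """Highest value minus lowest value."""
    return max(xs) - min(xs)

def mean_absolute_deviation(xs):
    """Average absolute distance of each value from the mean (f = 1)."""
    mean = sum(xs) / len(xs)
    return sum(abs(x - mean) for x in xs) / len(xs)

response_times = [12, 9, 11, 7, 9, 14, 6, 10]  # seconds

r = data_range(response_times)
mad = mean_absolute_deviation(response_times)
print(r, mad)
```

For these eight times the mean is 9.75 seconds, the range is 8 seconds, and the MAD is 2 seconds.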
Quiz 1 (Measures of Variation)
1. For the measurements
2 5 9 10 15 19
Compute:
a) the Range
b) MAD
c) standard deviation
d) Q3 and Q1
e) SIR
f) midquartile
2.
weights (in kilos)    no. of Students
52-54    2
55-57    3
58-60    4
61-63    6
64-66    5
67-69    3
70-72    2
[Figure: distributions with the same center, one with smaller variation and one with larger variation]
Some measures of dispersion:
Range – Variance – Standard deviation
Coefficient of variation
Range:
Range is the difference between the largest (Max) and smallest (Min)
values.
Range = Max - Min
Example:
Find the range for the sample values: 26, 25, 35, 27, 29, 29.
Solution:
Range = 35 - 25 = 10 (unit)
Note:
The range is a poor measure of variation because it takes only two of
the values into account.
Variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
where $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$ is the population mean.
Notes:
· $\sigma^2$ is a parameter because it is obtained from the population
values (it is unknown in general).
· $\sigma^2 \ge 0$
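$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$ (unit)$^2$
where $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ is the sample mean.
Notes:
· $S^2$ is a statistic because it is obtained from the sample values (it
is known).
· $S^2$ is used to approximate (estimate) $\sigma^2$.
· $S^2 \ge 0$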
Example:
We want to compute the sample variance of the following sample
values: 10, 21, 33, 53, 54.
Solution:
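n = 5
$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{10 + 21 + 33 + 53 + 54}{5} = \frac{171}{5} = 34.2$ (unit)
$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{5} (x_i - 34.2)^2}{5 - 1}$
$S^2 = \frac{(10 - 34.2)^2 + (21 - 34.2)^2 + (33 - 34.2)^2 + (53 - 34.2)^2 + (54 - 34.2)^2}{4} = \frac{1506.8}{4} = 376.7$ (unit)$^2$
Another method:
x_i    x_i - 34.2    (x_i - 34.2)^2
10     -24.2         585.64
21     -13.2         174.24
33     -1.2          1.44
53     18.8          353.44
54     19.8          392.04
$\sum x_i = 171$,  $\sum (x_i - \bar{x}) = 0$,  $\sum (x_i - \bar{x})^2 = 1506.8$
$\bar{x} = \frac{171}{5} = 34.2$,  $S^2 = \frac{1506.8}{4} = 376.7$
Equivalently, with $\sum x_i = 171$ and $\sum x_i^2 = 7355$:
$S^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1} = \frac{7355 - 5(34.2)^2}{5 - 1} = \frac{1506.8}{4} = 376.7$ (unit)$^2$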
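The worked example can be cross-checked with a short Python sketch. It computes the sample variance from the definition, from the sum-of-squares form, and with the standard library's `statistics.variance`; the variable names are illustrative.

```python
import statistics

values = [10, 21, 33, 53, 54]
n = len(values)
mean = sum(values) / n  # 171 / 5

# Definition: squared deviations from the mean, divided by n - 1
s2_definition = sum((x - mean) ** 2 for x in values) / (n - 1)

# Sum-of-squares form: (sum of x^2 - n * mean^2) / (n - 1)
s2_sum_of_squares = (sum(x * x for x in values) - n * mean ** 2) / (n - 1)

# Library check (statistics.variance is the sample variance)
s2_library = statistics.variance(values)

print(round(s2_definition, 1), round(s2_sum_of_squares, 1), round(s2_library, 1))
```

All three agree with the hand calculation, up to floating-point rounding.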
Standard Deviation:
· The standard deviation is another measure of variation.
· It is the square root of the variance.
(1) Population standard deviation is: $\sigma = \sqrt{\sigma^2}$ (unit)
(2) Sample standard deviation is: $S = \sqrt{S^2}$ (unit)
Example:
For the previous example, the sample standard deviation is $S = \sqrt{S^2} = \sqrt{376.7} = 19.41$ (unit).
· The relative variability in the 1st data set is larger than the relative
variability in the 2nd data set if C.V1> C.V2 (and vice versa).
Example:
1st data set: $\bar{x}_1 = 66$ kg, $S_1 = 4.5$ kg
$C.V_1 = \frac{4.5}{66} \times 100\% = 6.8\%$
2nd data set: $\bar{x}_2 = 36$ kg, $S_2 = 4.5$ kg
$C.V_2 = \frac{4.5}{36} \times 100\% = 12.5\%$
Since $C.V_1 < C.V_2$, the relative variability in the 2nd data set is larger than the relative
variability in the 1st data set.
Absolute value:
$|a| = a$ if $a \ge 0$
$|a| = -a$ if $a < 0$
Example:
(2) (b = 10)
x1 10 , x 2 10 , x 3 10
(3) (a = 2, b = 10)
2x1 10 , 2x 2 10 , 2x 3 10
Can C. V. exceed 100%?
Data: 10,1,1,0
Mean=3
Variance=22
STDEV=4.6904
C. V.=156.3%
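A minimal Python sketch of the coefficient of variation, assuming C.V. = (S / mean) × 100% with S the sample standard deviation, as in the examples above. The function names are hypothetical.

```python
import statistics

def cv_from_summary(mean, sd):
    """Coefficient of variation in percent from a mean and standard deviation."""
    return sd / mean * 100

def cv_from_data(xs):
    """Coefficient of variation in percent, using the sample standard deviation."""
    return cv_from_summary(statistics.mean(xs), statistics.stdev(xs))

cv1 = cv_from_summary(66, 4.5)         # 1st data set
cv2 = cv_from_summary(36, 4.5)         # 2nd data set
cv_big = cv_from_data([10, 1, 1, 0])   # the "exceeds 100%" data

print(round(cv1, 1), round(cv2, 1), round(cv_big, 1))
```

The last data set shows that the C.V. can exceed 100% whenever the standard deviation is larger than the mean.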
Measures of Skewness
Normal Distribution
-is a distribution with a bell-shaped
appearance. In a normal distribution, the
mean=median=mode
When the mean < median, the bulk of the
distribution is on the right. This implies that
the questions are generally easy (in case of
test) or that many students in the group are
bright.
When the mean > median, the bulk of the
distribution is on the left. This implies that
the questions are generally difficult (in case
of test) or that many students in the group
are not prepared for the test or not smart.
Skewness refers to the degree of
symmetry or asymmetry of a
distribution
It may be:
negatively skewed when
mean < median
positively skewed when:
mean > median
The extent of skewness can be
obtained by getting the coefficient of
skewness using the formula:
$sk = \frac{3(\bar{x} - m)}{s}$
where $\bar{x}$ = mean
m = median
s = standard deviation
SUMMARY
For normal distribution
sk =0
For skewed to the left
sk < 0
For skewed to the right
sk > 0
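The coefficient of skewness can be sketched in Python as below. The data set 10, 1, 1, 0 is reused from the coefficient-of-variation slide purely for illustration, and s is assumed to be the sample standard deviation.

```python
import statistics

def skewness_coefficient(xs):
    """sk = 3 * (mean - median) / s, with s the sample standard deviation."""
    mean = statistics.mean(xs)
    median = statistics.median(xs)
    return 3 * (mean - median) / statistics.stdev(xs)

data = [10, 1, 1, 0]  # mean 3, median 1
sk = skewness_coefficient(data)
print(round(sk, 3))
```

Because the mean exceeds the median here, sk is positive and the data are skewed to the right.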
KURTOSIS
KURTOSIS refers to the peakedness or
flatness of a distribution.
• Mesokurtic is a normal distribution
• Leptokurtic is more peaked than the
normal distribution
• Platykurtic is flatter than the normal
distribution
Kurtosis ( Ku )
For ungrouped data
$ku = \frac{\sum (d - \bar{x})^4}{N s^4}$
For grouped data
$ku = \frac{\sum f\,(cm - \bar{x})^4}{N s^4}$
where
ku = the kurtosis
d = the raw data
cm = the class mark
$\bar{x}$ = the mean
$s^4$ = the square of the variance
N = the sample size
Exercises
•Range
•Standard deviation
•Coefficient of variation
•Skewness
1. The number of packs of cigarettes Mang Juan
sold during the last 12 days of Dec are as
follows:
10 15 5 21 7 25
90 14 18 20 10 12
Determine the following and interpret each
result: a. range
b. standard deviation
c. coefficient of variation
d. kurtosis
1. Consider this set of data:
9 56 30 3 70 2 40 51
23 15
Find the stand. dev.
variance
skewness
degree of skewness
coefficient of variation
1. Find the standard deviation, variance, range, percentile
deviation, semi-interquartile range and coefficient of
variation of the following weights distribution of 25 male
students
weights (in kilos) no. of Students
52-54 2
55-57 3
58-60 4
61-63 6
64-66 5
67-69 3
70-72 2
Historical Events (January 4)
• 1896 Utah is admitted as the 45th U.S. state
• 2004 Spirit, a NASA Mars Rover, lands
successfully on Mars
• Famous Birthdays:
• 1643 Sir Isaac Newton (Scientist)
• 1809 Louis Braille (Inventor of touch reading
system for blind)
Today in History
January 5
• 1920 The Boston Red Sox sell Babe Ruth to
the New York Yankees in what is later known
as the Bambino Curse.
• 1933 Construction of the Golden Gate Bridge
begins in San Francisco Bay.
• Famous Birthday:
• 1914 George Reeves (Actor - Superman)
Number Trivia
• What is the larger number of the binary system?
— Albert Einstein
Fundamental Counting
Principles
On several occasions, before making
an important decision, we resort to
determining and counting all the
possible number of alternative options
that we can choose from. Certainly, the
simplest way to do this is to list down or
enumerate all the possible options
manually and individually. To do this,
however, will require a lot of time and
effort.
Objectives:
compute permutations
compute combinations
1H 2H 3H 4H 5H 6H
1T 2T 3T 4T 5T 6T
6*2 = 12 outcomes
Fundamental Counting
Principle
For a college interview, Robert has to choose
what to wear from the following: 4 slacks, 3
shirts, 2 shoes and 5 ties. How many possible
outfits does he have to choose from?
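The Fundamental Counting Principle multiplies the number of choices at each step. A minimal Python sketch for the outfit question, with a brute-force enumeration as a cross-check (the dictionary layout is an illustrative assumption):

```python
from itertools import product
from math import prod

# Number of options for each item of clothing
choices = {"slacks": 4, "shirts": 3, "shoes": 2, "ties": 5}

# Counting principle: multiply the number of options at each step
outfits = prod(choices.values())

# Cross-check by listing every possible outfit
enumerated = sum(1 for _ in product(*(range(k) for k in choices.values())))

print(outfits, enumerated)
```

Both counts give 4 × 3 × 2 × 5 = 120 possible outfits.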
A Permutation is an arrangement
of items in a particular order.
3*2*1 = 6 3! = 3*2*1 = 6
ABC ACB BAC BCA CAB CBA
Permutations
$_5P_3 = \frac{5!}{(5-3)!} = \frac{5!}{2!} = 5 \cdot 4 \cdot 3 = 60$
Permutations
Practice:
A combination lock will open when the
right choice of three numbers (from 1
to 30, inclusive) is selected. How many
different lock combinations are possible
assuming no number is repeated?
Answer Now
Permutations
Practice:
$_{30}P_3 = \frac{30!}{(30-3)!} = \frac{30!}{27!} = 30 \cdot 29 \cdot 28 = 24360$
Permutations
Practice:
From a club of 24 members, a
President, Vice President, Secretary,
Treasurer and Historian are to be
elected. In how many ways can the
offices be filled?
Answer Now
Permutations
Practice:
$_{24}P_5 = \frac{24!}{(24-5)!} = \frac{24!}{19!} = 24 \cdot 23 \cdot 22 \cdot 21 \cdot 20 = 5{,}100{,}480$
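The permutation answers above can be reproduced with a short Python sketch. The helper `n_p_r` is a hypothetical name; `math.perm` (Python 3.8+) is used as a cross-check.

```python
from math import factorial, perm

def n_p_r(n, r):
    """Number of ordered arrangements of r items chosen from n: n! / (n - r)!."""
    return factorial(n) // factorial(n - r)

five_p_three = n_p_r(5, 3)
lock_codes = n_p_r(30, 3)    # three different numbers from 1 to 30
officer_slates = n_p_r(24, 5)  # five distinct offices from 24 members

print(five_p_three, lock_codes, officer_slates)
```

Order matters in each case, which is why permutations rather than combinations apply.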
Permutations with
Repetitions
$\frac{n!}{r!\,s!\,t!\,\cdots}$
Permutations with
Repetitions
Example 1: In how many ways can all of the
letters in the word SASKATOON be arranged?
Answer Now
Combinations
Practice:
$_{52}C_5 = \frac{52!}{5!\,(52-5)!} = \frac{52!}{5!\,47!} = \frac{52 \cdot 51 \cdot 50 \cdot 49 \cdot 48}{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1} = 2{,}598{,}960$
Combinations
Practice:
Answer Now
Combinations
Practice:
$_5C_3 = \frac{5!}{3!\,(5-3)!} = \frac{5!}{3!\,2!} = \frac{5 \cdot 4}{2 \cdot 1} = 10$
Combinations
Practice:
A basketball team consists of two
centers, five forwards, and four
guards. In how many ways can the
coach select a starting line up of
one center, two forwards, and two
guards?
Answer Now
Combinations
Practice:
2 C1 * 5 C 2 * 4 C 2
As a special case, 0! = 1
Study Tip
Here are several values of n!.
1! = 1
2! = 2 ● 1 = 2
3! = 3 ● 2 ● 1 = 6
4! = 4 ● 3 ● 2 ● 1 = 24
5! = 5 ● 4 ● 3 ● 2 ● 1 = 120
9! = 9 ● 8 ● 7 ● 6 ● 5 ● 4 ● 3 ● 2 ● 1 = 362,880
Permutations of n objects taken r at a time
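$_nP_r = {}_{10}P_3 = \frac{10!}{(10-3)!} = \frac{10!}{7!} = \frac{10 \cdot 9 \cdot 8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1} = 720$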
There are 720 possible three-digit codes that do not have
repeating digits.
Example 5: Finding n P r
Forty-three race cars started the 2007 Daytona 500. How many ways
can the cars finish first, second, and third?
Because there are 43 race cars and order is important, the number of
ways the cars can finish first, second, and third is:
$_nP_r = {}_{43}P_3 = \frac{43!}{(43-3)!} = \frac{43!}{40!} = 43 \cdot 42 \cdot 41 = 74{,}046$
Ordering same objects
Suppose you want to order a group of n objects where some of
the objects are the same. For instance, consider a group of
letters consisting of four A’s, 2 B’s, and one C. How many
ways can you order such a group? Using the previous
formula, you might conclude the following:
$_nP_r = {}_7P_7 = 7!$
However, because some of the objects are the
same, not all of these permutations are
distinguishable. How many distinguishable
permutations are possible? The answer can be
found using the formula on the next slide.
Distinguishable Permutations
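$\frac{n!}{n_1!\,n_2!\,n_3! \cdots n_k!}$, where $n_1 + n_2 + n_3 + \cdots + n_k = n$.
$\frac{7!}{4!\,2!\,1!} = \frac{7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{4!\,2!\,1!} = \frac{7 \cdot 6 \cdot 5}{2} = 105$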
Example 6: Distinguishable
Permutations
• A building contractor is planning to develop a
subdivision. The subdivision consists of six
one-story houses, four two-story houses, and
two split-level houses. In how many
distinguishable ways can the houses be
arranged?
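The distinguishable-permutations formula can be sketched in Python by counting how often each object repeats. The function name and the string encodings of the letters and houses are illustrative assumptions.

```python
from collections import Counter
from math import factorial

def distinguishable_permutations(items):
    """n! divided by the factorial of each group's size."""
    total = factorial(len(items))
    for count in Counter(items).values():
        total //= factorial(count)
    return total

# Four A's, two B's, and one C
letters = distinguishable_permutations("AAAABBC")

# Six one-story (1), four two-story (2), and two split-level (S) houses
houses = distinguishable_permutations("1" * 6 + "2" * 4 + "S" * 2)

print(letters, houses)
```

The letter count matches the 105 computed above; for the subdivision the same formula gives 12!/(6! 4! 2!) = 13,860 arrangements.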
$_nC_r = \frac{n!}{(n - r)!\,r!}$
Example 7: Finding the number of
combinations
A state’s department of transportation plans to develop a new section of interstate highway and receives 16 bids for the project. The state plans to hire four of the bidding companies. How many different combinations of four companies can be selected from the 16 bidding companies? Because order is NOT important, there are:
$_{16}C_4 = \frac{16!}{(16-4)!\,4!} = \frac{16!}{12!\,4!} = \frac{16 \cdot 15 \cdot 14 \cdot 13}{4 \cdot 3 \cdot 2 \cdot 1} = \frac{43680}{24} = 1820$
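The combination counts in these slides can be checked with a short Python sketch. The helper `n_c_r` is a hypothetical name; `math.comb` (Python 3.8+) serves as a cross-check.

```python
from math import comb, factorial

def n_c_r(n, r):
    """Number of unordered selections of r items from n: n! / ((n - r)! r!)."""
    return factorial(n) // (factorial(n - r) * factorial(r))

bid_selections = n_c_r(16, 4)  # four companies out of 16 bidders
poker_hands = n_c_r(52, 5)     # five-card hands from a 52-card deck
subsets_5_3 = n_c_r(5, 3)

print(bid_selections, poker_hands, subsets_5_3)
```

Because order is not important, each group of four companies is counted once rather than 4! times.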
Applications – Example 8 Finding
Probabilities
A word consists of one M, four I’s, four S’s, and two P’s.
If the letters are randomly arranged in order, what is
the probability that the arrangement spells the word
Mississippi? Solution. There is one favorable outcome
and there are
$\frac{11!}{1!\,4!\,4!\,2!} = \frac{11 \cdot 10 \cdot 9 \cdot 8 \cdot 7 \cdot 6 \cdot 5}{4 \cdot 3 \cdot 2 \cdot 1 \cdot 2 \cdot 1} = \frac{1663200}{48} = 34650$
distinguishable permutations of the letters.
Applications – Example 8 Finding
Probabilities
There are 34,650 distinguishable permutations of the
word Mississippi. So the probability that the
arrangement spells the word Mississippi is:
$P(\text{Mississippi}) = \frac{1}{34{,}650} \approx .000029$
Applications – Example 9 Finding
Probabilities
Find the probability of being dealt five diamonds from a
standard deck of playing cards. (In poker, this is a
diamond flush.)
SOLUTION: The number of ways of choosing 5
diamonds out of 13 is 13C5. The number of possible 5
card hands is 52C5. So the probability of being dealt 5
diamonds is:
$P(\text{Diamond Flush}) = \frac{_{13}C_5}{_{52}C_5} = \frac{1287}{2{,}598{,}960} \approx .0005$
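A minimal Python sketch of the diamond-flush probability, using exact fractions so that the rounding to .0005 is visible. The variable names are illustrative.

```python
from fractions import Fraction
from math import comb

favorable = comb(13, 5)   # ways to choose 5 of the 13 diamonds
possible = comb(52, 5)    # all 5-card hands

p_diamond_flush = Fraction(favorable, possible)
print(favorable, possible, float(p_diamond_flush))
```

The exact value is about 0.000495, which rounds to the .0005 shown on the slide.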
Seatwork
1. How many 5 digit numbers can be
formed using the digits 0, 1, 2,3,….9
such that
a. Repetition is allowed
b. Repetition is not allowed
c. The first digit must not be 9 and
repetition is not allowed
2. A boy can buy a pair of shoes from 6
different stores and a bag from 5
different stores. If he buys a pair of
shoes and a bag from different stores,
how many sets of two stores will there
be?
3. In how many ways can a customer
order a sandwich and a drink if there
are 5 sandwiches and 4 drinks on a
meal?
4. To code its property inventories, a
company designed a card system by
which the first 2 characters are
numbers (0 to 9) and the next two
characters are letters of the English
alphabet. How many different coding
cards can be made?
Factorial Notation
• n factorial is the product of all
positive integers from 1 to n and it is
written as:
n! = n(n-1)(n-2)…(3)(2)(1)
1. Evaluate 5!
5! = 5x4x3x2x1 = 120
2. 8!/4! = (8x7x6x5x4!)/4! = 8x7x6x5
= 1,680
Evaluate
1. 9𝑃6 + 4 2𝑃1
2. 4𝐶3 𝑥 5𝐶2
3. 6! + 3!
4. How many ways can 5 female
students and 4 male students be
seated on a long bench if the bench
can accommodate only 5 persons?
VOCABULARY
• INFER –to form an opinion from evidence, to
reach a conclusion
• RANDOM-without definite aim or direction,
rule or method
• EXPERIMENT-a series of tests, something that
you do to see how well or how badly it works
• OUTCOME-result, a consequence
• PRINCIPLE-basic truth or theory, a law, rule or
fact
Why Learn Probability?
• Nothing in life is certain. In everything we do, we gauge
the chances of successful outcomes, from business to
medicine to the weather
• A probability provides a quantitative description of the
chances or likelihoods associated with various outcomes
• It provides a bridge between descriptive and inferential
statistics
Probability
Population Sample
Statistics
Principles of Counting
• Preliminary Concepts:
Random Experiment – an experiment that can
be used to generate information or data. Like
an ordinary experiment, this can also be
repeated.
- Rolling a die
- Tossing a coin
- Drawing a card from a well-shuffled deck of 52
cards
• Sample Space- a set of all possible outcomes
in a random experiment, denoted by S.
• Sample Point – an entry from the sample
space
• Event-is a collection of one or more outcomes
considered within a sample space, denoted by
E.
Example
• Random experiment – rolling a die
• Sample space : S = {1,2,3,4,5,6}
• Sample point = 1,2,3,…6
• Event (even numbers) : E = {2,4,6}
Classical Probability
The probability of any event E is
Number of outcomes in E
----------------------------------------
Total number of outcomes in the sample space
$P(E) = \frac{n(E)}{n(S)}$
Examples
1. A pair of dice is tossed. Find the probability
of getting
a. A total of 7 = 6/36 = 1/6
b. At most a total of 10
c. At least a total of 5
2. Find the probability of drawing an ace from a
well-shuffled deck of cards.
3. What is the probability of passing a subject?
4. Two (2) dice are thrown. What is the
probability of the following events?
a. That both dice show the same number
b. That both dice show an odd number
c. That the sum of the numbers is 13
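Classical probability, P(E) = n(E)/n(S), can be sketched in Python by listing the sample space for a pair of dice. The helper `probability` and the use of a predicate function to describe an event are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Sample space S: all 36 ordered outcomes of two dice
sample_space = list(product(range(1, 7), repeat=2))

def probability(event):
    """n(E) / n(S), where event is a true/false test applied to each outcome."""
    favorable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favorable), len(sample_space))

p_total_7 = probability(lambda dice: sum(dice) == 7)
print(len(sample_space), p_total_7)
```

This reproduces example 1a: six of the 36 outcomes total 7, so the probability is 6/36 = 1/6. The other parts can be checked by changing the test passed to `probability`.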
Probability Rules
R1. The probability of any event E, is a number
(either a fraction or decimal) between and
including 0 and 1. This is denoted by
$0 \le P(E) \le 1$
Rule 1 states that probabilities cannot be
negative or greater than 1.
R2. If an event E cannot occur (i.e. the event
contains no members in the sample space), its
probability is 0.
• Examples:
–Toss a fair coin. P(Head) = 1/2
– Suppose that 10% of the U.S. population has
red hair. Then for a person selected at random,
P(Red hair) = .10
Using Simple Events
• The probability of an event A is equal to the
sum of the probabilities of the simple events
contained in A
• If the simple events in an experiment are
equally likely, you can calculate
The number of permutations of n objects taken r at a time is $P_r^n = \frac{n!}{(n - r)!}$,
where $n! = n(n-1)(n-2)\cdots(2)(1)$ and $0! = 1$.
Example: How many 3-digit lock combinations
can we make from the numbers 1, 2, 3, and 4?
The order of the choice is important!
$P_3^4 = \frac{4!}{1!} = 4(3)(2) = 24$
Examples
Example: A lock consists of five parts and can
be assembled in any order. A quality control
engineer wants to test each order for
efficiency of assembly. How many orders are
there?
The order of the choice is important!
$P_5^5 = \frac{5!}{0!} = 5(4)(3)(2)(1) = 120$
Combinations
• The number of distinct combinations of n
distinct objects that can be formed, taking
them r at a time is
$C_r^n = \frac{n!}{r!\,(n - r)!}$
Example: Three members of a 5-person committee must
be chosen to form a subcommittee. How many different
subcommittees could be formed?
The order of the choice is not important!
$C_3^5 = \frac{5!}{3!\,(5-3)!} = \frac{5(4)(3)(2)1}{3(2)(1)(2)1} = \frac{5(4)}{(2)1} = 10$
Example
• A box contains six M&Ms®, four red
and two green. A child selects two M&Ms at
random. What is the probability that exactly one
is red?
The order of the choice is not important!
$C_2^6 = \frac{6!}{2!\,4!} = \frac{6(5)}{2(1)} = 15$ ways to choose 2 M&Ms.
$C_1^2 = \frac{2!}{1!\,1!} = 2$ ways to choose 1 green M&M.
$C_1^4 = \frac{4!}{1!\,3!} = 4$ ways to choose 1 red M&M.
4 × 2 = 8 ways to choose 1 red and 1 green M&M.
P(exactly one red) = 8/15
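The M&M calculation can be sketched in Python with `math.comb`, and cross-checked by listing every pair of candies. The labels "R" and "G" are an illustrative encoding.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# Counting approach: (ways to pick 1 red) * (ways to pick 1 green) / (ways to pick any 2)
p_counting = Fraction(comb(4, 1) * comb(2, 1), comb(6, 2))

# Brute-force check: enumerate all unordered pairs from 4 red and 2 green candies
box = ["R"] * 4 + ["G"] * 2
pairs = list(combinations(range(len(box)), 2))
one_red = [p for p in pairs if sum(box[i] == "R" for i in p) == 1]
p_enumerated = Fraction(len(one_red), len(pairs))

print(p_counting, p_enumerated)
```

Both approaches give 8/15.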
Example
A B A B
Event Relations
The intersection of two events, A and B, is
the event that both A and B occur when the
experiment is performed. We write A ∩ B.
S
A B A B
S
AC
A
Example
Select a student from the classroom and
record his/her hair color and gender.
– A: student has brown hair
– B: student is female
– C: student is male
Mutually exclusive; $B = C^C$
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Example: Additive Rule
Example: Suppose that there were 120
students in the classroom, and that they
could be classified as follows:
          Brown    Not Brown
Male      20       40
Female    30       30
A: brown hair, P(A) = 50/120
B: female, P(B) = 60/120
P(A∪B) = P(A) + P(B) – P(A∩B)
= 50/120 + 60/120 - 30/120
= 80/120 = 2/3
Check: P(A∪B) = (20 + 30 + 30)/120
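The Additive Rule can be verified on the classroom table with a short Python sketch. The dictionary layout and variable names are illustrative assumptions.

```python
from fractions import Fraction

# Counts from the classroom table: (gender, hair) -> number of students
counts = {
    ("male", "brown"): 20, ("male", "not brown"): 40,
    ("female", "brown"): 30, ("female", "not brown"): 30,
}
total = sum(counts.values())

p_a = Fraction(sum(c for (g, h), c in counts.items() if h == "brown"), total)   # brown hair
p_b = Fraction(sum(c for (g, h), c in counts.items() if g == "female"), total)  # female
p_a_and_b = Fraction(counts[("female", "brown")], total)

# Additive Rule
p_union_rule = p_a + p_b - p_a_and_b

# Direct count of students who have brown hair or are female
p_union_direct = Fraction(
    sum(c for (g, h), c in counts.items() if h == "brown" or g == "female"), total
)

print(p_union_rule, p_union_direct)
```

Subtracting P(A∩B) prevents the 30 brown-haired females from being counted twice.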
Example: Two Dice
A: dice add to 3
B: dice add to 6
$P(A^C) = 1 - P(A)$
Example
Select a student at random
from the classroom. Define:
A: male, P(A) = 60/120
B: female, P(B) = ?
          Brown    Not Brown
Male      20       40
Female    30       30
“given”
Example 1
Toss a fair coin twice. Define
– A: head on second toss
– B: head on first toss
Outcomes: HH 1/4, HT 1/4, TH 1/4, TT 1/4
P(A|B) = ½
P(A|not B) = ½
P(A) does not change, whether B happens or not…
A and B are independent!
Example 2
A bowl contains five M&Ms®, two red and three
blue. Randomly select two candies, and define
– A: second candy is red.
– B: first candy is blue.
A Sk
A
A S1 Sk
S2….
We know:
P(F) = .49
P(M) = .51
P(H|F) = .08
P(H|M) = .12
$P(M \mid H) = \frac{P(M)\,P(H \mid M)}{P(M)\,P(H \mid M) + P(F)\,P(H \mid F)} = \frac{.51(.12)}{.51(.12) + .49(.08)} = .61$
Example
Suppose a rare disease infects one out of
every 1000 people in a population. And
suppose that there is a good, but not perfect,
test for this disease: if a person has the
disease, the test comes back positive 99% of
the time. On the other hand, the test also
produces some false positives: 2% of
uninfected people also test positive. Suppose
someone has just tested positive. What are his
chances of having this disease?
Example
Define A: has the disease B: test positive
We know:
P(A) = .001 P(Ac) =.999
P(B|A) = .99 P(B|Ac) =.02
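Using the quantities defined above, Bayes' Rule can be sketched in Python as follows. The function name `posterior` and its parameter names are hypothetical; the same function is also applied to the P(M|H) example as a check.

```python
def posterior(prior, p_b_given_a, p_b_given_not_a):
    """P(A|B) = P(A)P(B|A) / (P(A)P(B|A) + P(A^c)P(B|A^c))."""
    numerator = prior * p_b_given_a
    return numerator / (numerator + (1 - prior) * p_b_given_not_a)

# A: has the disease, B: tests positive
p_disease_given_positive = posterior(0.001, 0.99, 0.02)

# Check against the earlier example: P(M) = .51, P(H|M) = .12, P(H|F) = .08
p_m_given_h = posterior(0.51, 0.12, 0.08)

print(round(p_disease_given_positive, 4), round(p_m_given_h, 2))
```

Under these numbers the chance of actually having the disease after a positive test is only about 4.7%, because the disease is rare and false positives among the uninfected are far more numerous than true positives.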
Job Satisfaction
Level          Satisfied    Unsatisfied    Total
College        0.095        0.055          0.150
High School    0.288        0.220          0.508
Elementary     0.162        0.180          0.342
Total          0.545        0.455          1.000
P(S | C) is the proportion of teachers who are satisfied
given they are college teachers. Restated:
This is the proportion of college teachers that are satisfied.
$P(S \mid C) = \frac{P(S \cap C)}{P(C)} = \frac{P(C \cap S)}{P(C)} = \frac{0.095}{0.150} \approx 0.633$
$P(C) = 0.150$ and $P(C \mid S) = \frac{P(C \cap S)}{P(S)} = \frac{0.095}{0.545} \approx 0.174$
P(CS)?
= P(D ∪ T)
= 0.8 + 0.7 - 0.8 x 0.7
= .94
P(At least one person pass)
= 1-P(neither passes) = 1- (1-0.8) x (1-0.7) = .94
Example
Suppose we know that only one of the two
friends passed the test. What is the probability
that it was Dick?
• Examples:
x = SAT score for a randomly selected student
x = number of people in a room at a randomly
selected time of day
x = number on the upper face of a randomly
tossed die
Probability Distributions for Discrete
Random Variables
The probability distribution for a discrete
random variable x resembles the relative
frequency distributions we constructed in
Chapter 2. It is a graph, table or formula that
gives the possible values of x and the
probability p(x) associated with each value.
We must have
$0 \le p(x) \le 1$ and $\sum p(x) = 1$
Example
Toss a fair coin three times and
define x = number of heads.
Outcome    Probability    x
HHH        1/8            3
HHT        1/8            2
HTH        1/8            2
THH        1/8            2
HTT        1/8            1
THT        1/8            1
TTH        1/8            1
TTT        1/8            0
P(x = 0) = 1/8
P(x = 1) = 3/8
P(x = 2) = 3/8
P(x = 3) = 1/8
x    p(x)
0    1/8
1    3/8
2    3/8
3    1/8
[Figure: Probability Histogram for x]
Example
Toss two dice and define
x = sum of two dice.
x p(x)
2 1/36
3 2/36
4 3/36
5 4/36
6 5/36
7 6/36
8 5/36
9 4/36
10 3/36
11 2/36
12 1/36
Probability Distributions
Probability distributions can be used to describe
the population, just as we described samples in
Chapter 2.
– Shape: Symmetric, skewed, mound-shaped…
– Outliers: unusual or unlikely measurements
– Center and spread: mean and standard
deviation. A population mean is called μ and a
population standard deviation is called σ.
The Mean
and Standard Deviation
Let x be a discrete random variable with
probability distribution p(x). Then the mean,
variance and standard deviation of x are given
as
Mean: $\mu = \sum x\,p(x)$
Variance: $\sigma^2 = \sum (x - \mu)^2 p(x)$
Standard deviation: $\sigma = \sqrt{\sigma^2}$
Example
Toss a fair coin 3 times and record x,
the number of heads.
x    p(x)    xp(x)    (x-μ)²p(x)
0    1/8     0        (-1.5)²(1/8)
1    3/8     3/8      (-0.5)²(3/8)
2    3/8     6/8      (0.5)²(3/8)
3    1/8     3/8      (1.5)²(1/8)
$\mu = \sum x\,p(x) = \frac{12}{8} = 1.5$
$\sigma^2 = \sum (x - \mu)^2 p(x) = .75$, so $\sigma = \sqrt{.75} = .866$
• Outliers? None
• Center? μ = 1.5
• Spread? σ = .866
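The mean, variance, and standard deviation of the number of heads in three tosses can be checked with a minimal Python sketch. Exact fractions are used for the probabilities; the variable names are illustrative.

```python
from fractions import Fraction
from math import sqrt

# Probability distribution of x = number of heads in 3 tosses of a fair coin
p = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

mu = sum(x * px for x, px in p.items())                    # mean
sigma2 = sum((x - mu) ** 2 * px for x, px in p.items())    # variance
sigma = sqrt(sigma2)                                       # standard deviation

print(mu, sigma2, round(sigma, 3))
```

The probabilities sum to 1, the mean is 3/2, the variance is 3/4, and the standard deviation is about 0.866.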
Key Concepts
I. Experiments and the Sample Space
1. Experiments, events, mutually exclusive events,
simple events
2. The sample space
II. Probabilities
1. Relative frequency definition of probability
2. Properties of probabilities
a. Each probability lies between 0 and 1.
b. Sum of all simple-event probabilities equals 1.
3. P(A), the sum of the probabilities for all simple events in A
Key Concepts
III. Counting Rules
1. mn Rule; extended mn Rule
2. Permutations: $P_r^n = \frac{n!}{(n - r)!}$
3. Combinations: $C_r^n = \frac{n!}{r!\,(n - r)!}$
IV. Event Relations
1. Unions and intersections
2. Events
a. Disjoint or mutually exclusive: $P(A \cap B) = 0$
b. Complementary: $P(A) = 1 - P(A^C)$
Key Concepts
3. Conditional probability: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
4. Independent and dependent events
5. Additive Rule of Probability:
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
“and” means both must be true
“or” means one or the other (or both) are true
[Figure: street-intersection illustration (Elm St. and Maple St.) and Venn diagrams of events A and B]
A B In this example we
will fill up the
Venn Diagram
with probabilities.
Example #1 (continued)
P(A)=0.8 P(B)=0.3 P(A and B)=0.2
Find the P(A or B).
0.1
Then I will add up
the probabilities in
the shaded area.
P(A or B) = 0.6 + 0.2 + 0.1
= 0.9 Answer
Example #1 (continued)
P(A)=0.8 P(B)=0.3 P(A and B)=0.2
Find the P(A or B).
= 0.9 Answer
Example #2.)
There are 50 students. 18 are taking
English. 23 are taking Math. 10 are
taking English and Math.
If one is selected at random, find the
probability that the student is taking
English or Math.
E = taking English
M = taking Math
Example #2 (continued)
In this example we will fill up the Venn diagram with the number of students:
10 take both English and Math, 18 - 10 = 8 take English only,
23 - 10 = 13 take Math only, and 50 - 31 = 19 take neither.
P(E or M) = (8 + 10 + 13)/50 = 31/50
= 0.62
Class Activity #1)
There are 1580 people in an
amusement park. 570 of these
people ride the roller coaster. 700 of
these people ride the merry-go-round.
220 of these people ride the roller
coaster and merry-go-round.
If one person is selected at
random, find the probability that
that person rides the roller
coaster or the merry-go-round.
a.) Solve using Venn Diagrams.
b.) Solve using the formula for
the Addition Rule for Probability.
Example #3) Population of apples and pears.
          no worm   worm   total
apple        5        3      8
pear         4        2      6
total        9        5     14 (grand total)
Solutions to #3a.) and #3b.) are read from the table above.
Solution to #3b.)
Counting each fruit that is a pear or has a worm once: (4 + 2 + 3)/14 = 9/14 = 0.6429
Alternate Solution to #3b.)
P(pear or worm) = P(pear) + P(worm) - P(pear and worm)
= 6/14 + 5/14 - 2/14 = 9/14
= 0.6429 Answer
Class Activity #2)
$P(\bar{E}) = 1 - P(E)$
Classical and Empirical Probabilities
• The difference between classical and empirical
probability is that classical probability
assumes that certain outcomes are equally
likely (such as the outcomes when a die is
rolled), while empirical probability relies on
actual experience to determine the likelihood
of outcomes.
Given a frequency distribution, the probability
of an event being in a given class is
$P(E) = \dfrac{f}{n}$
where f is the frequency for the class and n is
the total of the frequencies in the distribution.
Addition Rules for Probability
Two events are mutually exclusive events if they
cannot occur at the same time (i.e., they have
no outcomes in common).
When two events A and B are mutually
exclusive, the probability that A or B will occur
is $P(A \text{ or } B) = P(A) + P(B)$
If A and B are not mutually exclusive, then
$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
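As a quick check of the addition rule, the sketch below applies it to Example #1 and Example #2. The helper name `prob_union` is an assumption made for this illustration; setting the overlap to 0 gives the mutually exclusive case.

```python
from fractions import Fraction

def prob_union(p_a, p_b, p_a_and_b=0):
    """P(A or B) by the addition rule; leave p_a_and_b at 0 for mutually exclusive events."""
    return p_a + p_b - p_a_and_b

# Example #1: P(A) = 0.8, P(B) = 0.3, P(A and B) = 0.2
ex1 = prob_union(Fraction(8, 10), Fraction(3, 10), Fraction(2, 10))
# Example #2: 50 students, 18 take English, 23 take Math, 10 take both
ex2 = prob_union(Fraction(18, 50), Fraction(23, 50), Fraction(10, 50))
print(float(ex1), float(ex2))
```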
Exercises
1. Define mutually exclusive events, and give an
example of two events that are mutually
exclusive and two events that are not mutually
exclusive.
2. Determine whether these events are mutually
exclusive
a) Roll a die. Get an even number and get a
number less than 3.
b) Roll a die. Get a prime number (2,3,5) and get
an odd number.
Exercises
1. A card is drawn from a well-shuffled deck of
52 playing cards. What is the probability that
the card drawn is:
a) a diamond b) the queen of hearts
2. A group of scientists consists of 7 chemists, 4
biologists, and 5 physicists. If a scientist is
randomly chosen, find the probability that
the scientist is
a) a physicist b) a chemist or a biologist
3. There are 600 male and 200 female
engineering students and 80 male and 320
female education students in a certain
university. Find the probability that a
randomly chosen student is:
a) a female
b) a male engineering student
c) an education student
4. An urn contains 4 green marbles and 6 red
marbles. Let E be the event “first marble is
red” and B be the event “second marble is
red”, where the marbles are not replaced after
being drawn. Find the probability that both
marbles are RED.
5. Find the complement of each event
a) Rolling a die and getting a 5
b) Selecting a month that begins with a J
6. A sales representative who visits customers at home
finds she sells 0, 1, 2, 3, or 4 items according to the
following frequency distribution
items sold frequency
0 8
1 10
2 3
3 2
4 1
Find the probability that she sells
a) exactly 1 item
b) more than 2 items
c) at least 1 item
d) at most 3 items
7. Three fair coins are tossed. What is the
probability that
a) Three HEADS appear
b) Two HEADS and a TAIL appear
8. How many ways can you arrange 5 books in a
row?
Multiplication Rules and Conditional
Probability
Multiplication Rule 1
When two events are independent, the
probability of both occurring is
$P(A \text{ and } B) = P(A) \cdot P(B)$
A coin is flipped and a die is rolled. Find the
probability of getting a HEAD on the coin and
a 4 on the die.
$P(\text{head and } 4) = \dfrac{1}{2} \cdot \dfrac{1}{6} = \dfrac{1}{12}$
Dependent Events
When the outcome or occurrence of the first
event affects the outcome or occurrence of the
second event in such a way that the probability is
changed, the events are said to be DEPENDENT
EVENTS.
Examples:
a) Drawing a card, not replacing it, then drawing a
second card
b) having high grades and getting scholarship
c) being a lifeguard and getting a suntan
Conditional Probability
• The conditional probability of an event B in
relation to an event A is the probability
that event B occurs after event A has already
occurred.
Multiplication Rule 2
When two events are dependent, the
probability of both occurring is
$P(A \text{ and } B) = P(A) \cdot P(B \mid A)$
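The two multiplication rules can be compared side by side in a small sketch. The coin-and-die case is the one from the slides; the two-aces case is a hypothetical illustration of drawing without replacement and is not one of the exercises.

```python
from fractions import Fraction

# Rule 1 (independent events): a HEAD on the coin and a 4 on the die
p_head_and_four = Fraction(1, 2) * Fraction(1, 6)

# Rule 2 (dependent events): two aces drawn from 52 cards without replacement
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)  # one ace and one card are already gone
p_two_aces = p_first_ace * p_second_ace_given_first

print(p_head_and_four, p_two_aces)
```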
Independent and Dependent Events
$f(x) = \dfrac{1}{47 - 41} = \dfrac{1}{6}$ for $41 \le x \le 47$; $f(x) = 0$ for all other values
Area = 1
$P(x_1 \le X \le x_2) = \dfrac{x_2 - x_1}{b - a}$
$P(42 \le X \le 45) = \dfrac{45 - 42}{47 - 41} = \dfrac{1}{2}$
Area = 0.5
$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$
Where:
$\mu$ = mean of X
$\sigma$ = standard deviation of X
$\pi$ = 3.14159 . . .
e = 2.71828 . . .
Z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.00  0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.10  0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.20  0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
0.30  0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.90  0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.00  0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.10  0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
1.20  0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
2.00  0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
3.00  0.4987  0.4987  0.4987  0.4988  0.4988  0.4989  0.4989  0.4989  0.4990  0.4990
3.40  0.4997  0.4997  0.4997  0.4997  0.4997  0.4997  0.4997  0.4997  0.4997  0.4998
3.50  0.4998  0.4998  0.4998  0.4998  0.4998  0.4998  0.4998  0.4998  0.4998  0.4998
$P(0 \le Z \le 1) = 0.3413$
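The entries of the Z table are areas between 0 and z under the standard normal curve. A minimal sketch, assuming only Python's standard `math.erf`, reproduces a few of them; the function name `area_zero_to_z` is made up for this illustration.

```python
from math import erf, sqrt

def area_zero_to_z(z):
    """Area under the standard normal curve between 0 and z."""
    return 0.5 * erf(z / sqrt(2))

# Print the first-column entries for a few rows of the table
for z in (0.10, 1.00, 1.20, 2.00):
    print(f"{z:.2f}  {area_zero_to_z(z):.4f}")
```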
$\mu \pm 3\sigma = 18 \pm 3(3.55) = 18 \pm 10.65$
$\mu - 3\sigma = 7.35$
$\mu + 3\sigma = 28.65$
[Figure: probability plots with horizontal axes from 0 to 70 and from 6 to 30]
$P(X \ge X_0) = e^{-\lambda X_0}$
$P(X \ge 2 \mid \lambda = 1.2) = e^{-(1.2)(2)} = .0907$
[Figure: curve plotted for X from 0 to 5, vertical axis from 0.0 to 1.0]
$f(X) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X - \mu)^2}{2\sigma^2}}$
Key Areas under the Curve
• For normal
distributions
± 1 SD ~ 68%
± 2 SD ~ 95%
± 3 SD ~ 99.7%
Example IQ mean = 100 s = 15
Normal Probability Distributions
Standard Normal Distribution – N(0,1)
• We agree to use the
standard normal
distribution
• Bell shaped
• $\mu = 0$
• $\sigma = 1$
• Note: not all bell
shaped distributions
are normal
distributions
Normal Probability Distribution
• Can take on an
infinite number of
possible values.
• The probability of
any one of those
values occurring is
essentially zero.
• Curve has area or
probability = 1
Normal Distribution
• The standard normal distribution will
allow us to make claims about the
probabilities of values related to our own
data
• How do we apply the standard normal
distribution to our data?
Z-score
If we know the population mean and
population standard deviation, for any
value of X we can compute a z-score by
subtracting the population mean and
dividing the result by the population
standard deviation
$z = \dfrac{X - \mu}{\sigma}$
Important z-score info
• Z-score tells us how far above or below the
mean a value is in terms of standard
deviations
• It is a linear transformation of the original
scores
– Multiplication (or division) of and/or addition to
(or subtraction from) X by a constant
– Relationship of the observations to each other
remains the same
$Z = (X - \mu)/\sigma$
then
$X = \sigma Z + \mu$
[equation of the general form Y = mX+c]
Probabilities and z scores: z tables
• Total area = 1
• Only have a probability from width
– For an infinite number of z scores each point
has a probability of 0 (for the single point)
• Typically negative values are not reported
– Symmetrical, therefore area below negative
value = Area above its positive value
• Always helps to draw a sketch!
Probabilities are depicted by areas under the curve
$P(100 \le X \le 115)$
$= P(100 - 100 \le X - 100 \le 115 - 100)$
$= P\left(\dfrac{100 - 100}{15} \le \dfrac{X - 100}{15} \le \dfrac{115 - 100}{15}\right)$
$= P(0 \le Z \le 1) = .3413$
Say we have GRE scores that are normally distributed with mean 500 and
standard deviation 100. Find the probability that a randomly selected
GRE score is greater than 620.
$z = \dfrac{620 - 500}{100} = 1.2$
• p(z > 1.2) = 0.5 - 0.3849 = 0.1151
• Result: The probability of randomly
getting a score greater than 620 is ~.12
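Both worked examples (IQ between 100 and 115, GRE above 620) can be reproduced without a printed table. This is only a sketch using `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)
gre = NormalDist(mu=500, sigma=100)

z = (620 - gre.mean) / gre.stdev      # z-score of a GRE score of 620
p_iq = iq.cdf(115) - iq.cdf(100)      # P(100 <= X <= 115)
p_gre = 1 - gre.cdf(620)              # P(X > 620) = P(Z > 1.2)
print(z, round(p_iq, 4), round(p_gre, 4))
```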
Exercises
• Let z represent the standard normal variable.
Suppose a value of z is randomly selected. To
find each of the following probabilities, (1)
draw the standard normal curve and indicate
the area representing the probability, (2)
express the probability in terms of areas from
0 to appropriate values obtained, (3) calculate
the answer.
1. P(z< 1.41) 2. P(z<-1.72)
3. P(z>1.51) 4. P(z>-2.43)
5. P(-2.02 < z < 1.74) 6. P(1.02 < z < 1.84)
7. Between -0.67 and 0
8. Less than 1.96
9. Within 1 standard deviation of the mean
10.Within 3 standard deviations of the mean
• Assume the standard normal distribution. Fill
in the blanks
1. P(z < _ ) =0.9772
2. P(z > _ ) = 0.5
3. P( z > _ ) = 0.9599
• Consider a normal population with mean 200
and standard deviation 25. Find the following:
• History
• The normal curve was developed mathematically in 1733 by DeMoivre
as an approximation to the binomial distribution. His paper was not
discovered until 1924 by Karl Pearson. Laplace used the normal curve
in 1783 to describe the distribution of errors. Subsequently, Gauss used
the normal curve to analyze astronomical data in 1809. The normal
curve is often called the Gaussian distribution. The term bell-shaped
curve is often used in everyday usage.
Example
1. On a final examination in Statistics, the mean
= 76 and its standard deviation = 10. Find:
a. the standard score of the student when
receiving the grade of 90
Areas under the Normal Curve
Units of measurement are converted into
standard units (standard scores, or z-scores) by
means of the formula
$z = \dfrac{x - x_m}{s}$
where z = standard score
xm = mean
s = standard deviation
x = given value of a particular variable
Exercises
1. Find the z value for a normal distribution
with mean = 30 and standard deviation = 5 if
a) x = 44 b) x = 23 c) x = 15
Ho : 215
Ha : 215
A milling process currently produces an average
of 3% defectives. You are interested in
showing that a simple adjustment on a
machine will decrease p, the proportion of
defectives produced in the milling process.
Thus, Ho : p = 0.03
Ha : p < 0.03
which is a one-tailed test
• The p-value or observed significance level
of a statistical test is the smallest value of
alpha for which Ho can be rejected. It is
the actual risk of committing a Type I
error, if Ho is rejected based on the
observed value of the test statistic. The p-
value measures the strength of the
evidence against Ho.
If the p-value is less than a preassigned
significance level alpha, then Ho can be
rejected, and you can report that the results
are statistically significant at level alpha.
Tests of Statistical Hypothesis
Goal of Hypothesis Testing-
to make a judgment about
the difference between the
sample statistics and a
hypothesized population
parameter
Use of Hypothesis Testing-
It enables the researcher to
generalize population from
relatively small samples. In many
instances, a researcher can only
rely on the information provided
by a part of the population.
Basic Definitions:
Statistical hypothesis is an
assumption or statement,
which may or may not be
true concerning one or
more populations.
Hypothesis Testing is the
process of making
inference or prediction on
a population based on the
result of the study on
samples
Null hypothesis is also known
as a no difference
relationship hypothesis. It
implies neutrality and
objectivity, which must be
present in any research
undertaking.
Alternative hypothesis is the opposite
of the null hypothesis.
Rejection of the hypothesis is to
conclude that the hypothesis is
false.
Acceptance of a hypothesis merely
implies that there is not sufficient
statistical evidence to believe
otherwise.
Critical region is a set of values
of the test statistic that is
chosen before the
experiment to define the
conditions under which the
null hypothesis will be
rejected.
One-tailed test is used
when the critical region is
located at only one
extreme of distribution or
range of values for the
test statistics.
It is a directional test
with region of rejection
lying on either left or
right tail of the normal
curve.
Right directional test. The region
of rejection is on the right tail.
It is used when the alternative
hypothesis uses comparatives
such as >, higher than, superior
to, exceeds, better, etc.
Left directional test. The
region of rejection is on the
left tail. It is used when the
alternative hypothesis uses
comparatives such as <,
smaller than, lower than,
below, etc.
Two-tailed test is used
when the critical region is
located on both sides of
the distribution or range
of values for the test
statistic.
Significance level of a test is
the maximum value of the
probability of rejecting the
null hypothesis when in
fact it is true.
Statistic is a function of the
random sample, that is based
on the observations and is
used to make the decision in
favor of the null or
alternative hypothesis.
Type I error is when we
reject the null hypothesis
when it is true.
Decision            H0 is TRUE              H0 is FALSE
Reject Ho           Type I error (alpha)    Correct decision
Do not reject Ho    Correct decision        Type II error (beta)
30 September 2016
Statistical Tests
Data Analysis
Statistics - a powerful tool for analyzing data
1. Descriptive Statistics - provide an overview
of the attributes of a data set. These include
measurements of central tendency (frequency
histograms, mean, median, & mode) and
dispersion (range, variance & standard
deviation)
The Sample: 7, 6, 4, 9, 8, 3, 2, 6, 1
mean = 5.111
The Population: $\mu$ = 5.314
2 4 10 4 6 8 7 10 4 3 7 9 6 7 5 2 5 8 2 10
7 2 3 5 2 9 3 9 6 1 4 2 6 4 9 3 4 1 8 7
9 1 8 1 10 10 6 4 2 7 1 1 9 10 4 4 6 6 2 5
9 10 2 6 8 10 1 6 10 10 4 4 4 9 2 1 4 5 9 6
6 2 7 8 8 6 6 10 6 6 7 5 9 2 6 4 8 6 6 10
5 7 1 9 1 10 8 8 5 10 1 4 8 3 6 7 1 5 2 4
4 10 5 8 5 1 1 4 3 6 7 3 1 5 4 3 6 2 7 8
3 3 6 6 2 8 6 5 9 8 4 6 3 8 3 3 10 8 10 5
7 5 1 4 3 2 1 10 2 10 6 10 7 9 8 8 4 9 9 10
3 7 6 2 1 1 10 3 5 7 4 1 2 9 10 10 6 1 3 2
1 3 9 9 4 2 2 2 1 8 3 1 5 9 9 8 3 2 5 4
4 2 3 10 8 2 3 4 1 3 3 2 10 10 5 7 3 3 10 1
5 7 5 1 2 5 8 7 3 8 9 2 10 8 1 1 5 3 3 7
6 7 9 8 8 4 9 8 4 3 10 8 10 4 10 2 3 5 6 3
1 9 8 1 10 2 3 1 6 3 8 9 6 2 4 4 2 7 8 4
4 4 4 10 8 5 9 3 10 5 3 6 9 3 7 4 2 3 10 2
5 1 6 8 5 6 8 1 8 5 7 6 4 1 2 7 2 9 5 3
8 2 3 2 9 9 1 1 5 7 8 5 6 3 8 5 4 10 6 9
5 1 10 10 5 1 4 3 2 3 6 9 10 2 6 3 1 2 8 6
1 8 7 8 5 3 7 2 4 1 8 9 10 10 5 1 3 6 5 8
3 3 8 8 2 7 1 6 9 8 2 10 3 7 9 2 1 9 7 7
3 1 9 6 8 2 6 4 6 3 7 10 9 6 1 10 7 5 3 10
1 6 5 4 3 2 4 4 1 5 5 10 6 2 1 1 1 5 6 3
8 10 8 10 9 7 7 7 8 4 8 1 3 5 8 1 8 4 4 6
4 7 2 4 9 1 8 5 3 3 5 10 1 4 6 3 3 8 2 2
The Sample: 1, 5, 8, 7, 4, 1, 6, 6
mean = 4.75
Parametric or Non-parametric?
•Parametric tests are restricted to data that:
1) show a normal distribution
2) * are independent of one another
3) * are on the same continuous scale of measurement
•Non-parametric tests are used on data that:
1) show an other-than normal distribution
2) are dependent or conditional on one another
3) in general, do not have a continuous scale of
measurement
[Figure: two plots of Y for groups A, B and C]
There are different tests if you have 2 vs more than 2 samples
Differences Between Means – Parametric
Data
t-Tests compare the means of two parametric samples
HBI: t-Test
Excel: t-Test (paired and unpaired) – in Tools – Data
Analysis
A researcher compared the height of plants grown in high
and low light levels. Her results are shown below. Use a
T-test to determine whether there is a statistically
significant difference in the heights of the two groups
HBI: ANOVA
Excel: ANOVA – check type under Tools – Data Analysis
A researcher fed pigs on four different foods. At the end
of a month feeding, he weighed the pigs. Use an ANOVA
test to determine if the different foods resulted in
differences in growth of the pigs.
HBI: Sign Test
Excel: N/A
Subject   Night Response   Day Response
1         2                5
2         1                3
3         2                2
Differences Between Means – Non-
Parametric Data
The Friedman Test is like the Sign test, (compares the
means of “paired”, non-parametric samples) for more than
two samples.
[Figure: frequency chart for Smooth and Wrinkled categories]
HBI: Chi-Square One Sample Test (goodness of fit)
Excel: Chitest – under Function Key – Statistical
Differences Between Distributions
E.g. 67 out of 100 seeds placed in plain water germinated
while 36 out of 100 seeds placed in “acid rain” water
germinated. Is there a difference in the germination rate?
[Figure: germination proportion under the Null Hypothesis and under the Alternative Hypothesis]
Pair   Value 1   Rank   Value 2   Rank
1      57        3      83        7
2      45        1      37        1
3      72        7      41        2
4      78        8      84        8
5      53        2      56        3
6      63        5      85        9
7      86        9      77        6
8      98        10     87        10
9      59        4      70        5
10     71        6      59        4
Regression
Regressions look for functional relationships between two
continuous variables. A regression assumes that a
change in X causes a change in Y.
[Figure: two scatter plots of Y against X]
Is there a relationship between wing length and
tail length in songbirds?
wing length cm tail length cm
10.4 7.4
10.8 7.6
11.1 7.9
10.2 7.2
10.3 7.4
10.2 7.1
10.7 7.4
10.5 7.2
10.8 7.8
11.2 7.7
10.6 7.8
11.4 8.3
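One way to explore the wing-length question is an ordinary least-squares line through the twelve pairs. The sketch below is an illustration only; the function name `least_squares` is an assumption, and no significance test is carried out.

```python
wing = [10.4, 10.8, 11.1, 10.2, 10.3, 10.2, 10.7, 10.5, 10.8, 11.2, 10.6, 11.4]
tail = [7.4, 7.6, 7.9, 7.2, 7.4, 7.1, 7.4, 7.2, 7.8, 7.7, 7.8, 8.3]

def least_squares(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = least_squares(wing, tail)
print(f"tail = {intercept:.3f} + {slope:.3f} * wing")
```

A positive slope means longer wings go with longer tails in this sample.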
Is there a relationship between age and systolic
blood pressure?
Age (yr) systolic blood pressure (mm Hg)
30 108
30 110
30 106
40 125
40 120
40 118
40 119
50 132
50 137
50 134
60 148
60 151
60 146
60 147
60 144
70 162
70 156
70 164
70 158
70 159
Statistical Tests
Let’s Take it Step by Step...
Identify topic
Literature review
Variables of interest
Research hypothesis
Design study
Power analysis
Write proposal
Design data tools
Committees
Collect data
Set up spreadsheet
Enter data
Statistical analysis
Graphs
Slides / poster
Write paper / manuscript
Goals
Nominal, Ordinal: Qualitative
Interval, Ratio: Quantitative
Nominal Scale (discrete)
Simplest scale of measurement
Variables which have no numerical value
Variables which have categories
Count number in each category, calculate
percentage
Examples:
– Gender
– Race
– Marital status
– Whether or not tumor recurred
– Alive or dead
Ordinal Scale
Variables are in categories, but with an
underlying order to their values
Rank-order categories from highest to lowest
Intervals may not be equal
Count number in each category, calculate
percentage
Examples:
– Cancer stages
– Apgar scores
– Pain ratings
– Likert scale
Interval Scale
Quantitative data
Can add & subtract values
Cannot multiply & divide values
– No true zero point
Example:
– Temperature on a Celsius scale
• 0° indicates the point at which water freezes, not an absence of
warmth
Ratio Scale (continuous)
Quantitative data with true zero
– Can add, subtract, multiply & divide
Examples:
– Age
– Body weight
– Blood pressure
– Length of hospital stay
– Operating room time
Scales of Measurement
Nominal, Ordinal: lead to nonparametric statistics
Interval, Ratio: lead to parametric statistics
Two Branches of Statistics
Descriptive
– Frequencies & percents
– Measures of the middle
– Measures of variation
Inferential
– Nonparametric statistics
– Parametric statistics
Descriptive Statistics
                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid    smoker           26        20.5        24.8              24.8
         non-smoker       79        62.2        75.2             100.0
         Total           105        82.7       100.0
Missing  unknown          22        17.3
Total                    127       100.0
Measures of the Middle or
Central Tendency
Mean
– Average score
• sum of all values, divided by number of values
– Most common measure, but easily influenced by
outliers
Median
– 50th percentile score
• half above, half below
– Use when data are asymmetrical or skewed
Measures of Variation or Dispersion
Standard deviation (SD)
– Square root of the sum of squared deviations of the
values from the mean divided by the number of
values
Variance
– Square of the standard deviation
Range
– Difference between the largest & smallest
value
nocigs_b
Cumulative
Frequency Percent Valid Percent Percent
Valid 1 2 1.6 7.7 7.7
2 1 .8 3.8 11.5
3 1 .8 3.8 15.4
5 3 2.4 11.5 26.9
6 1 .8 3.8 30.8
12 1 .8 3.8 34.6
13 1 .8 3.8 38.5
14 1 .8 3.8 42.3
15 2 1.6 7.7 50.0
17 1 .8 3.8 53.8
18 1 .8 3.8 57.7
19 2 1.6 7.7 65.4
20 2 1.6 7.7 73.1
22 1 .8 3.8 76.9
24 1 .8 3.8 80.8
30 1 .8 3.8 84.6
39 1 .8 3.8 88.5
40 1 .8 3.8 92.3
45 1 .8 3.8 96.2
100 1 .8 3.8 100.0
Total 26 20.5 100.0
Missing System 101 79.5
Total 127 100.0
Statistics
nocigs_b
N Valid 26
Missing 101
Mean 19.62
Std. Error of Mean 3.985
Median 16.00
Mode 5
Std. Deviation 20.320
Variance 412.886
Range 99
Minimum 1
Maximum 100
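The figures in the Statistics output can be reproduced from the nocigs_b frequency table. The sketch below uses Python's standard `statistics` module; note that `stdev` and `variance` divide by n - 1 (the sample formulas), which is what matches the values reported here.

```python
import statistics
from math import sqrt

# value -> frequency, copied from the nocigs_b frequency table (26 valid cases)
freq = {1: 2, 2: 1, 3: 1, 5: 3, 6: 1, 12: 1, 13: 1, 14: 1, 15: 2, 17: 1,
        18: 1, 19: 2, 20: 2, 22: 1, 24: 1, 30: 1, 39: 1, 40: 1, 45: 1, 100: 1}
data = [value for value, count in freq.items() for _ in range(count)]

sd = statistics.stdev(data)
summary = {
    "N": len(data),
    "Mean": round(statistics.mean(data), 2),
    "Std. Error of Mean": round(sd / sqrt(len(data)), 3),
    "Median": statistics.median(data),
    "Mode": statistics.mode(data),
    "Std. Deviation": round(sd, 3),
    "Variance": round(statistics.variance(data), 3),
    "Range": max(data) - min(data),
}
for name, value in summary.items():
    print(f"{name}: {value}")
```

The mean (19.62) sits well above the median (16) because of the single value of 100, which illustrates why the median is preferred for skewed data.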
Inferential Statistics
Sample Population
Nonparametric tests
– Used for analyzing nominal & ordinal variables
– Makes no assumptions about data
Parametric tests
– Used for analyzing interval & ratio variables
– Makes assumptions about data
• Normal distribution
• Homogeneity of variance
• Independent observations
Which Test Do I Use?
p < 0.01
– 1 in 100 or 1% chance of error
p < 0.001
– 1 in 1000 or .1% chance of error
Research Hypothesis
                                       SES
                               low      middle   high     Total
SMOKING  smoker      Count          7       13        6       26
                     % within SES   38.9%   20.3%    26.1%    24.8%
         non-smoker  Count          11      51       17       79
                     % within SES   61.1%   79.7%    73.9%    75.2%
Total                Count          18      64       23       105
                     % within SES   100.0%  100.0%   100.0%   100.0%
Chi-Square
Most common nonparametric test
Use to test for association between
categorical variables
Use to test the difference between observed
& expected proportions
– The larger the chi-square value, the more the
numbers in the table differ from those we would
expect if there were no association
Limitation
– Expected values must be equal to or larger than 5
Let’s Test For Association
Low SES 38.9%, Middle SES 20.3%, High SES 26.1%
Chi-Square Tests
                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             2.630 a   2    .268
Likelihood Ratio               2.476     2    .290
Linear-by-Linear Association    .653     1    .419
N of Valid Cases               105
a. 1 cells (16.7%) have expected count less than 5. The
minimum expected count is 4.46.
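The Pearson chi-square value can be recomputed by hand from the SMOKING by SES counts. The following is a sketch, not the software's own routine; the function name `chi_square` is made up, and the p-value shortcut exp(-x/2) is valid only because df = 2 here.

```python
from math import exp

observed = [[7, 13, 6],      # smokers: low, middle, high SES
            [11, 51, 17]]    # non-smokers: low, middle, high SES

def chi_square(table):
    """Pearson chi-square statistic, degrees of freedom and expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected

stat, df, expected = chi_square(observed)
p_value = exp(-stat / 2)     # upper-tail probability, correct for df = 2 only
min_expected = min(min(row) for row in expected)
print(round(stat, 3), df, round(p_value, 3), round(min_expected, 2))
```

Since the p-value is well above 0.05, the data give no evidence of an association between smoking and SES.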
Alternative to Chi-Square
Fisher’s exact test
– Is based on exact probabilities
– Use when expected count <5 cases in
each cell and
– Use with 2 x 2 contingency table
R A Fisher 1890-1962
LUNG_CA * SMOKING Crosstabulation
                                          SMOKING
                                   smoker   non-smoker   Total
LUNG_CA  positive  Count              3         1           4
                   % within SMOKING   11.5%     1.3%        3.8%
         negative  Count              23        78          101
                   % within SMOKING   88.5%     98.7%       96.2%
Total              Count              26        79          105
                   % within SMOKING   100.0%    100.0%      100.0%
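Because this 2 x 2 table has very small counts in the positive row, Fisher's exact test applies. The sketch below computes a two-sided p-value directly from the hypergeometric distribution by summing the probabilities of all tables no more likely than the observed one, which is one common two-sided convention. The function name is an assumption for this illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    lowest, highest = max(0, row1 + col1 - n), min(row1, col1)
    p_observed = prob(a)
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_observed * (1 + 1e-9))

# positive: 3 smokers, 1 non-smoker; negative: 23 smokers, 78 non-smokers
p = fisher_exact_two_sided(3, 1, 23, 78)
print(round(p, 4))
```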
Chi-Square Tests
Group Statistics
      SMOKING       N    Mean      Std. Deviation   Std. Error Mean
BMI   smoker        26   25.1846   5.27209          1.03394
      non-smoker    79   26.2228   5.47664          .61617
Unpaired t-test
or Student’s t-test
William Gosset 1876-1937
Descriptives
BMI
               N     Mean      Std. Deviation   Std. Error   95% CI Lower Bound   95% CI Upper Bound   Minimum   Maximum
non-smoker     79    26.2228   5.47664          .61617       24.9961              27.4495              17.70     40.20
light smoker   17    26.1765   4.96154          1.20335      23.6255              28.7275              18.90     35.00
heavy smoker   9     23.3111   5.62015          1.87338      18.9911              27.6311              17.90     35.90
Total          105   25.9657   5.42028          .52896       24.9168              27.0147              17.70     40.20
Analysis of Variance (ANOVA)
or F-test
Three or more independent groups
Test for a difference between groups
– Is the difference in sample means due to their
natural variability or to a real difference between
the groups in the population?
Outcome (dependent variable) is interval or
ratio
Assumptions of normality, homogeneity of
variance & independence of observations
Let’s Test For A Difference
Non-Smokers’ BMI = 26.22 ± 5.48
Light Smokers’ BMI = 26.18 ± 4.96
Heavy Smokers’ BMI = 23.31 ± 5.62
ANOVA
BMI
Sum of
Squares df Mean Square F Sig.
Between Groups 69.398 2 34.699 1.185 .310
Within Groups 2986.058 102 29.275
Total 3055.457 104
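The ANOVA table can be rebuilt from the group sizes, means and standard deviations in the Descriptives output. This is a minimal sketch under that assumption (summary statistics only, no raw data); the variable names are illustrative.

```python
# (n, mean, standard deviation) for each group, from the Descriptives table
groups = [
    (79, 26.2228, 5.47664),   # non-smoker
    (17, 26.1765, 4.96154),   # light smoker
    (9, 23.3111, 5.62015),    # heavy smoker
]

n_total = sum(n for n, _, _ in groups)
grand_mean = sum(n * m for n, m, _ in groups) / n_total

ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
ss_within = sum((n - 1) * s ** 2 for n, _, s in groups)
df_between = len(groups) - 1
df_within = n_total - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(round(ss_between, 3), round(ss_within, 3), round(f_stat, 3))
```

With Sig. = .310 the differences among the three BMI means are not statistically significant.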
Is there a relationship between the variables?
No_Cigs   BMI
1         30.1
1         18.9
2         22.8
3         22.6
5         24.2
5         26.2
5         33.3
6         19.1
12        35
13        23
14        22.2
15        28.7
15        28.6
17        24.3
18        30.9
19        22.5
19        32.6
20        19
20        26.7
22        18.8
24        23.4
30        23.2
39        25
40        35.9
45        17.9
100       19.9
Pearson’s Correlation
Correlations
                                 NOCIGS_B   BMI
NOCIGS_B  Pearson Correlation    1          -.169
          Sig. (2-tailed)        .          .410
          N                      26         26
BMI       Pearson Correlation    -.169      1
          Sig. (2-tailed)        .410       .
          N                      26         105
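The correlation coefficient can be recomputed from the 26 (No_Cigs, BMI) pairs listed earlier. The pairing used below is reconstructed from the slide listing, so treat it as an assumption; the function name `pearson_r` is also illustrative.

```python
from math import sqrt

no_cigs = [1, 1, 2, 3, 5, 5, 5, 6, 12, 13, 14, 15, 15, 17, 18, 19, 19,
           20, 20, 22, 24, 30, 39, 40, 45, 100]
bmi = [30.1, 18.9, 22.8, 22.6, 24.2, 26.2, 33.3, 19.1, 35, 23, 22.2, 28.7,
       28.6, 24.3, 30.9, 22.5, 32.6, 19, 26.7, 18.8, 23.4, 23.2, 25, 35.9,
       17.9, 19.9]

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = pearson_r(no_cigs, bmi)
print(round(r, 3))
```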
[Figure: scatter plot of BMI (vertical axis) against NOCIGS_B (horizontal axis)]
Interpretation of Results
The size of the p value does not
indicate the importance of the result
Appropriate interpretation of statistical
test
– Group differences
– Association or relationship
– “Correlation does not imply causation”
Statistical Inference
Hypothesis Testing for Single Populations.
Population mean using Z statistic ($\sigma$ known)
$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
The Z table on Hypothesis Testing
Z Table
$t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$; $df = n - 1$
Exercises
1. A bus company advertised a mean time of
150 min for a trip between two cities. A
consumer group had reason to believe that the
mean time was more than 150 minutes. A
sample of 40 trips showed a mean of 153 min
and a standard deviation of 7.5 min. Using 5 %
level of significance, is there a sufficient
evidence to support the consumer group’s
contention? What type of error has possibly
been committed? Explain.
2. A plastic has a mean breaking strength of 27
and a standard deviation of 6 pounds per square
inch. A new process is developed and will
replace the old one, provided there is
substantial evidence that it improves the
strength of the product. A random sample of 40
pieces made with the new process gives a
sample mean of 30 pounds/sq inch. Assuming
that the variability is unchanged (𝜎 = 6), is
there a sufficient evidence to suggest that the
strength of the product has increased at the 1%
level of significance?
3. An ice cream company claimed
that its product contained 500
calories/pint (on the average). To
test this claim, 24 one-pint containers
were analyzed, giving a mean of
507 calories and a standard
deviation of 21 calories. Test the
claim at the 2% level of
significance.
4. A manufacturer claimed that the
company’s product would not require
repair for more than 18 months on the
average. A sample of 12 customers
who had purchased the product gave
a mean of 18.542 and standard
deviation of 1.177. At a 5% level of
significance, do the data support the
belief that the mean time before repair is
more than 18 months?
5. A sample of 12 customers who had
purchased the product provided the
following information on how many months
elapsed before repair was needed on their
purchases:
16.5 17 17.5 18 18.5
18.5 18.5 19 19 19.5
20 20.5
Refer to the above problem
Generalization
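$t = \dfrac{\bar{d} - d_0}{s_d / \sqrt{n}}$; $\bar{d} = \dfrac{\sum d_i}{n}$; $df = n - 1$
$s_d = \sqrt{\dfrac{n \sum d_i^2 - \left(\sum d_i\right)^2}{n(n - 1)}}$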
• Independent samples are samples drawn
from entirely different populations
Examples:
a) Comparison of two groups
b) Performance of boys and girls
c) Length of life of brand A and brand B
* Groups randomly selected from two
entirely different populations
Test for the difference of means from
independent samples when the population
variances are unknown and the sample sizes are
more than 30.
$z = \dfrac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
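Test for the difference of means from
independent samples when the population
variances are unknown and the sample sizes are
not more than 30.
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)} \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$; $df = (n_1 + n_2) - 2$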
Test about a single proportion
$z = \dfrac{p - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}}$; $p = \dfrac{x}{n}$
where p = sample proportion
p0 = population proportion; q0 = 1 - p0
Test for the difference of proportions from
independent samples
$z = \dfrac{(p_1 - p_2) - (\bar{p}_1 - \bar{p}_2)}{\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}}$
where p (with a bar) = population proportion; q = 1 - p
Test about a single variance or standard
deviation
$\chi^2 = \dfrac{(n - 1)s^2}{\sigma^2}$
Inferential Statistics
Level of Significance-
For hypothesis testing, it is customary to use an
alpha of 5% or 1%. It means that we are
willing to commit an alpha error of 5% or 1%
as the case may be. It also implies that we are
95% or 99% confident in making correct
decisions.
(Hypothesis Test Concerning Means)
Solve showing the 5 step procedures.
1. A drug company alleges that the average time
for a cough syrup to take effect is 15 min, with
standard deviation of 3 min. In a random
sample of 49 patients, the average time was
18 min. Test the company’s allegation against
the alternative that the average time is not 15
min using 1% level of significance.
2. An operator of a large fleet of taxicabs is
trying to decide whether to purchase Brand A
or Brand B tires for its new models. To arrive
at a decision, an endurance experiment was
conducted using 10 cars for each brand. The
results are
Brand A: mean=35000 km, s=5000 km
Brand B: mean=38000 km, s=5500 km
Test the hypothesis at alpha=5%, that there is no
significant difference in the mean endurance
rate (kms) between the two brands.
3. A random sample of 100 deaths in a
certain area during the past year showed
an average life span of 71.8 years.
Assuming that the population standard
deviation is 8.9 years, does this seem to
indicate the average life span today is
significantly greater than 70 years? Use
alpha at 5%.
A teacher wants to find out if the calculator-
based method of teaching Statistics is more
effective than the lecture method. Two classes
of approximately equal intelligence were
selected. From one class, she considered 15
students with whom she used the calculator
based method of teaching and from the other
class, she considered 14 students with whom
she used the lecture method. After several
sessions, a test was given with the following
results:
Can we say that the calculator-based
method of teaching is more effective
than the lecture method? Use 0.05
level of significance.
                       n     mean    s
Calculator-based (1)   15    28.6    6.0
Lecture (2)            14    21.7    4.5
df = 27
t = 3.48
Ho: μ1 = μ2; Ha: μ1 > μ2
α = 0.05, one-tailed test,
t tab = 1.703
Reject Ho if tc ≥ 1.703
Reject Ho
The calculator-based method of
teaching is more effective
than the lecture method.
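The computed t for this problem follows from the pooled-variance formula for independent samples. The sketch below is an illustration with a made-up function name `pooled_t`; it gives a value of about 3.48, and the critical value 1.703 is taken from the slide rather than computed.

```python
from math import sqrt

def pooled_t(n1, mean1, s1, n2, mean2, s2):
    """Pooled-variance t statistic and degrees of freedom for two independent samples."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, df

t_calc, df = pooled_t(15, 28.6, 6.0, 14, 21.7, 4.5)
t_tab = 1.703                 # one-tailed critical value, alpha = 0.05, df = 27 (from the slide)
reject_h0 = t_calc >= t_tab
print(round(t_calc, 2), df, reject_h0)
```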
Sample Problem
A sample of 87 professional working women showed that
the average amount paid annually into a private pension
fund per person was P3352. The population standard
deviation is P1100. A sample of 76 professional working
men showed that the average amount paid annually into a
private pension fund per person was P5727, with a
population standard deviation of P1700. A women’s activist
group wants to prove that women do not pay as much
per year as men into private pension funds. If they
use alpha at .001 and these sample data, will they be able
to reject a null hypothesis that women annually pay the
same as or more than men into private pension funds?
Solution
Women Men
X1 = 3352 X2 = 5727
n1 = 87 n2 = 76
σ1 = 1100 σ2 = 1700
Ho : μ1 = μ2
Ha : μ1 < μ2
Alpha at .001
.5 - .001 = .499, z = -3.08
Zc = (3352 - 5727) / sqrt(1100^2/87 + 1700^2/76) = -10.42
Reject Ho, since -10.42 < -3.08
Women paid less than what was paid by men.
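The computed z for the pension problem can be verified with a short sketch. The function name `two_sample_z` is an assumption; the critical value is computed with `statistics.NormalDist` and comes out near -3.09, while a table lookup of .499 gives the -3.08 used above. The decision is the same either way.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, sigma1, n1, mean2, sigma2, n2, d0=0.0):
    """z statistic for the difference of two means with known population sigmas."""
    std_error = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    return ((mean1 - mean2) - d0) / std_error

z_calc = two_sample_z(3352, 1100, 87, 5727, 1700, 76)   # women vs. men
z_crit = NormalDist().inv_cdf(0.001)                    # left-tail critical value, alpha = .001
reject_h0 = z_calc <= z_crit
print(round(z_calc, 2), round(z_crit, 2), reject_h0)
```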
Exercises
1. Test the following hypotheses about the difference
in population means at alpha = .10 with
Ho : μ1 - μ2 = 0
Ha : μ1 - μ2 < 0
          Sample 1   Sample 2
mean      51.2       53.2
pop sd    52         60
number    31         32
What is the p value for this problem?
2. According to a study several years ago by the
Personal Communications Industrial Association,
the average wireless phone user earns P62000 per
year. Suppose a researcher believes that the
average annual earnings of a wireless phone user
are lower now, and he sets up a study in an attempt
to prove his theory. He randomly samples 18
wireless phone users and finds out that the average
annual salary from this sample is P58974 with a
population standard deviation of P 7810. Use alpha
=0.01 to test the researcher’s theory. Assume
wages in this industry are normally distributed.
3. A survey of the morning beverage market shows
that the primary breakfast beverage for 17% of
Americans is milk. A milk producer in
Wisconsin, where milk is plentiful, believes the
figure is higher for Wisconsin. To test this idea, she
contacts a random sample of 550 Wisconsin
residents and asks which primary beverage they
consumed for breakfast that day. Suppose 115
replied that milk was the primary beverage. Using a
level of significance of 0.05, test the idea that the
milk figure is higher for Wisconsin.
4. Previous experience shows the variance of a
given process to be 14. Researchers are testing
to determine whether this value has changed.
They gather the following dozen measurements
of the process. Use this data and alpha=0.05 to
test the null hypothesis about the variance.
Assume the measurements are normally
distributed: 52, 44, 51, 58, 48, 49, 38, 49,
50, 42, 55, 51
5. Two processes in a manufacturing line are performed
manually: operation A and operation B. A random sample
of 50 different assemblies using operation A shows that
the sample average time per assembly is 8.05 minutes,
with a population standard deviation of 1.36 minutes. A
random sample of 38 different assemblies using
operation B shows that the sample average time per
assembly is 7.26 minutes, with a population standard
deviation of 1.06 minutes. For alpha =0.10, is there
enough evidence in these samples to declare that
operation A takes significantly longer to perform than
operation B?
15 March 2016
Recap
Activity 1. Response for every Question
This activity measures how well you have gained
knowledge on statistical inference. A piece of
rolled paper will be drawn from the box and you
have the opportunity to answer the question in
1 or 2 minutes. Then, it’s your turn to call
someone to draw another paper inside the box
until the questions have been answered.
Hypothesis Testing
Activity 2. Each group will be given a problem
involving a statistical test. Then, a group
representative will report on the
conclusion/decision after consultation with
other members. Each group will be graded
based on these rubric points:
a. Team effort 5
b. Delivery 5
c. Consistency/accuracy 5
d. Follow the steps in hypothesis testing 5
Exercises
1. A random sample of size 20 is taken,
resulting in a sample mean of 16.45
and a sample standard deviation of
3.59. Assume x is normally distributed
and use this information and alpha =
0.05 to test the following hypotheses:
Ho: u = 16 H1: u ≠ 16
2. A manufacturing firm has been averaging
18.2 orders per week for several years. However,
during a recession, orders appeared to slow.
Suppose the firm’s manufacturing manager
randomly samples 32 weeks and finds a sample
mean of 15.6 orders. The population standard
deviation is 2.3 orders. Test to determine
whether the average number of orders is down
by using alpha at .10.
3. A certain study showed that 79% of companies
offer employees flexible scheduling. Suppose a
researcher believes that in accounting firms this
figure is lower. The researcher randomly selects 415
accounting firms and through interviews
determines that 303 of these firms have flexible
scheduling. With a 1% level of significance, does the
test show enough evidence to conclude that a
significantly lower proportion of accounting firms
offer employees flexible scheduling?
4. With landfills quickly reaching their capacities,
recycling household trash has assumed increased
importance. A city legislator believes that a greater
proportion of residents in the southern region of the city
than in the northern region favor a mandatory recycling
bill. To show this, 982
southern residents were randomly sampled, and
678 were found to be supportive of the bill. A
random sample of 952 residents in the northern
region revealed 599 in favor of the bill. Formulate a
suitable set of hypotheses and test at the 1%
significance level.
Explore
• Deepen your understanding of hypothesis
procedures by examining the given
information and the things being asked. Think
of the appropriate test statistics and form the
conclusion based on the formulated null and
alternative hypotheses.
Values Integration
Statistical Inference
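Hypothesis Testing for Single Populations.
Population mean using Z statistic (σ known)
$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$
Population mean using t statistic (σ unknown,
n <= 30)
$$ t = \frac{\bar{x} - \mu}{s / \sqrt{n}}; \quad df = n - 1 $$
Test for the mean of paired observations
$$ t = \frac{\bar{d} - d_0}{s_d / \sqrt{n}}; \quad df = n - 1 $$
$$ \bar{d} = \frac{\sum d_i}{n}, \qquad s_d = \sqrt{\frac{n \sum d_i^2 - \left(\sum d_i\right)^2}{n(n - 1)}} $$
Test for the difference of means from
independent samples when the population
variances are unknown and the samples are
more than 30.
$$ z = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$
Test for the difference of means from
independent samples when the population
variances are unknown and the samples are
not more than 30.
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)} \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}; \quad df = (n_1 + n_2) - 2 $$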
Test about a single proportion
$$ z = \frac{p - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}}; \quad p = \frac{x}{n} $$
where p = sample proportion
p0 = population proportion
Test for the difference of proportions from
independent samples
$$ z = \frac{(p_1 - p_2) - (\bar{p}_1 - \bar{p}_2)}{\sqrt{\dfrac{\bar{p}_1 \bar{q}_1}{n_1} + \dfrac{\bar{p}_2 \bar{q}_2}{n_2}}} $$
where p (with a bar) = population proportion; q = 1-p
Correlation
Wt. (kg)     67   69   85   83   74   81   97   92   114  85
SBP (mmHg)   120  125  140  160  130  180  150  140  200  130
[Scatter plot: SBP (mmHg) on the vertical axis against Wt (kg) on the horizontal axis]
Types of relationship: negative relationship, no relationship, positive relationship
[Figure: positive relationship, Height in cm against Age in Weeks]
[Figure: negative relationship, Reliability against Age of Car]
[Figure: no relation]
Group Activity
If r = 1 (or r = -1), the correlation is perfect.
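How to compute the simple correlation
coefficient (r)
$$ r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}} $$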
Example:
A sample of 6 children was selected, data about their
age in years and weight in kilograms was recorded as
shown in the following table . It is required to find the
correlation between age and weight.
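$$ r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}} $$
Serial no.   Age in years (x)   Weight in kg (y)   xy          x²          y²
1            7                  12                 84          49          144
2            6                  8                  48          36          64
3            8                  12                 96          64          144
4            5                  10                 50          25          100
5            6                  11                 66          36          121
6            9                  13                 117         81          169
Total        ∑x = 41            ∑y = 66            ∑xy = 461   ∑x² = 291   ∑y² = 742
$$ r = \frac{461 - \dfrac{(41)(66)}{6}}{\sqrt{\left(291 - \dfrac{(41)^2}{6}\right)\left(742 - \dfrac{(66)^2}{6}\right)}} $$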
r = 0.759
strong direct correlation
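The age and weight example can be reproduced with a few lines of code. This is a minimal sketch of the computational formula for r given above; the function name `pearson_r` is illustrative, and the data are the six children from the table.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r using the sums-based formula from the slide."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x**2 / n) * (sum_y2 - sum_y**2 / n))
    return numerator / denominator

age = [7, 6, 8, 5, 6, 9]          # x, in years
weight = [12, 8, 12, 10, 11, 13]  # y, in kg
r = pearson_r(age, weight)
print(f"r = {r:.3f}")
```

The same function applies to any pair of equal-length lists, so the anxiety and test score example can be checked the same way.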
EXAMPLE: Relationship between Anxiety and
Test Scores
Anxiety (X)   Test score (Y)   X²          Y²          XY
10            2                100         4           20
8             3                64          9           24
2             9                4           81          18
1             7                1           49          7
5             6                25          36          30
6             5                36          25          30
∑X = 32       ∑Y = 32          ∑X² = 230   ∑Y² = 204   ∑XY = 129
Calculating Correlation Coefficient
r = - 0.94
Scores Y 6 8 5 3 7 8 3 4
$$ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$
∑ di² = 64
$$ r_s = 1 - \frac{6(64)}{7(48)} \approx -0.1 $$
Comment:
There is an indirect weak correlation
between level of education and income.
exercise
F test (ANOVA)
It involves testing the equality of several
means SIMULTANEOUSLY. It is used to test
the significance of the difference between the means
of 3 or more sets of data simultaneously. It
is a method of dividing the variation
observed in experimental data into
different parts, each part assignable to a
known source, cause or factor. It was
developed by Fisher, the statistician
after whom the F-test is named.
Simple analysis of variance is based on two
sources of variation:
1. Actual difference of the means due to
TREATMENT (SSb)
2. Chance or experimental ERROR (SSw)
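Total Sum of Squares
$$ TSS = \sum x^2 - \frac{\left(\sum x\right)^2}{N} $$
Sum of Squares Between Columns
$$ SS_b = \frac{1}{\#\,\text{rows}} \sum \left(\text{sum of each column}\right)^2 - \frac{\left(\sum x\right)^2}{N} $$
Sum of Squares Within Columns
$$ SS_w = TSS - SS_b $$
Total degrees of freedom: $df_T = N - 1$
Between Columns degrees of freedom: $df_b = k - 1$
Within Columns degrees of freedom: $df_w = df_T - df_b$
Mean Sum of Squares Between Columns
$$ MSS_b = \frac{SS_b}{df_b} $$
Mean Sum of Squares Within Columns
$$ MSS_w = \frac{SS_w}{df_w} $$
Computed F
$$ F_c = \frac{MSS_b}{MSS_w} $$
Where N = the total number of observations
Fc = the computed value of F
Ft = the tabular value of F
k = the number of columns
df = degrees of freedom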
Mel 21 20 24 23
Mike 25 20 29 26
Mark 22 24 26 20
Mon 27 21 30 26
Matt 26 25 24 20
Use the 1% level of significance to test whether
the differences among the mean numbers of
customers served by the 5 crews are
significant.
14. A psychologist in a big company gave an IQ
test to 3 groups of 4 applicants each for a
managerial position. The results are given
below:
Single   Married   Widowed
90       110       115
120      105       105
125      100       110
100      95        130
At the alpha = 0.05 level, test the claim that the
three population means are equal.
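The sums of squares for a problem like this one can be organized in code. This is a minimal sketch, assuming equal-sized groups arranged as columns and the usual ratio F = MSSb / MSSw; the function name `one_way_anova` and the dictionary keys are illustrative. The tabular F value still has to be looked up separately.

```python
def one_way_anova(columns):
    """One-way ANOVA for equal-sized groups given as a list of columns."""
    rows = len(columns[0])                 # observations per column (assumed equal)
    k = len(columns)                       # number of columns (groups)
    values = [x for col in columns for x in col]
    n_total = len(values)
    correction = sum(values) ** 2 / n_total          # (sum of x)^2 / N
    tss = sum(x * x for x in values) - correction    # total sum of squares
    ssb = sum(sum(col) ** 2 for col in columns) / rows - correction
    ssw = tss - ssb
    df_b = k - 1
    df_w = (n_total - 1) - df_b
    f_c = (ssb / df_b) / (ssw / df_w)
    return {"TSS": tss, "SSb": ssb, "SSw": ssw, "dfb": df_b, "dfw": df_w, "Fc": f_c}

single = [90, 120, 125, 100]
married = [110, 105, 100, 95]
widowed = [115, 105, 110, 130]
result = one_way_anova([single, married, widowed])
for name, value in result.items():
    print(name, round(value, 3))
```

Ho is rejected when the computed Fc is at least the tabular F for (dfb, dfw) degrees of freedom at the chosen alpha.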
Problem Set 1
Solve and write all the steps in hypothesis testing:
1. A professor in a typing class found out that the
average performance of an expert typist is 85
words per minute. A random sample of 16
students took the typing test and an average
speed of 62 wpm was obtained with a standard
deviation of 8. Can we say that the sample
students' performance is below the standard at
the 0.05 level? (t tab = 1.753, one-tailed)
2. In a time and motion study, it was found
out that a certain manual work can be
finished at an average time of 40 minutes
with a standard deviation of 8 minutes. A
group of 36 students is given a special
training and then found to average only 35
minutes. Can we conclude that the special
training can speed up the work using the
0.01 level? (z tab = 2.33, one-tailed)
3. Determine if there is significant difference
among the test scores obtained by the
group of 4 students from 5 different
sections. Test at 0.05 level of significance. (F
tab = 2.90)
A B C D E
89 80 97 88 89
75 87 78 92 90
95 91 89 82 94
85 95 79 77 75
Linear Correlation
The Concept of Correlation
In some research problems, several variables or
characteristics of a population are studied
simultaneously to determine whether a
relationship exists and if so how close or how
significant the relationship might be.
Correlation is a statistical tool to measure the
association of two or more quantitative
variables.
The Scatterpoint Diagram
To estimate roughly whether a relationship exists
between two variables, a scatterpoint diagram
is made. Draw a straight line intersecting as
many points as possible in the graph. If the
diagram roughly suggests the existence of a
linear relationship, then compute the
coefficient of correlation r.
Scatter Diagram
A plot of paired (x,y) data with a horizontal x-axis and a vertical y-axis
http://www.statsoft.com/textbook/distribution-tables/?button=3
The Least Squares Linear Regression
Equation
The mathematical relationship between X and Y,
in this particular method, is expressed in the
linear equation y = a + bx, where a is the
regression constant or intercept and b is the
regression coefficient.
$$ b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}, \qquad a = \bar{y} - b\bar{x} $$
Selected Families and Corresponding Monthly
Income in Thousand pesos
Number of Members in a Family Monthly Income
3 14.3
8 20.4
5 17.5
6 20.3
7 20.5
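The regression constants for the family data above can be computed directly from the formulas for b and a. This is a minimal sketch; the function name `least_squares` is illustrative, with x taken as the number of family members and y as monthly income in thousand pesos.

```python
def least_squares(xs, ys):
    """Intercept a and slope b of the least squares line y = a + bx."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
    a = sum_y / n - b * (sum_x / n)   # a = mean of y minus b times mean of x
    return a, b

members = [3, 8, 5, 6, 7]
income = [14.3, 20.4, 17.5, 20.3, 20.5]
a, b = least_squares(members, income)
print(f"y = {a:.3f} + {b:.3f}x")
```

The slope b estimates how much monthly income changes, in thousand pesos, for each additional family member in this sample.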
A plot of paired (x,y) data with a horizontal x-axis and a vertical y-axis
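$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}} $$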
A businessman would like to determine if there is a relationship between
the size of a store and the profit to be earned. Calculate the value of r,
interpret at .01 level. Is there a significant relationship between the
size and the profit?
Store   Store size (in sq m)   Profit (in thousand pesos)
A 35 20
B 22 15
C 27 17
D 16 9
E 28 16
F 12 7
G 40 22
Spearman rho
$$ \rho = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} $$
A panel of 5 men and 5 women are asked to rank 10
ideas for a new TV program on the basis of their
appeal to general audiences. The following are
the results:
Program Idea   Men's Ranking   Women's Ranking
1              6               6
2              4               10
3              8               8
4              7               2
5              2               7
6              1               1
7              3               5
8              5               9
9              9               4
10             10              3
Find the value of r. Test at alpha = .05.
Is there an association or relationship between the
rankings of men and those of women?
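The Spearman rho formula can be applied to the two sets of rankings as follows. This is a minimal sketch that assumes the inputs are already ranks with no ties; the function name `spearman_rho` is illustrative.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman rho from two lists of untied ranks."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n**2 - 1))

men = [6, 4, 8, 7, 2, 1, 3, 5, 9, 10]
women = [6, 10, 8, 2, 7, 1, 5, 9, 4, 3]
rho = spearman_rho(men, women)
print(f"rho = {rho:.3f}")
```

The computed value is then compared in absolute value with the tabular value for N = 10 at the chosen alpha.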
• Guttman’s Lambda
$$ \lambda = \frac{\sum FR - CT}{N - CT} $$
where FR = the biggest cell frequency in each
column
CT = the biggest row total
N = total frequency
Let us measure the degree of relationship of
individual’s religion and political party.
LAKAS LAMMP REPORMA TOTAL
Catholic 20 9 15 44
INC 5 18 4 27
Protestant 11 8 10 29
Total 36 35 29 100
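The lambda formula can be applied to the religion and political party table as follows. This is a minimal sketch that follows the definitions of FR, CT, and N given above; the function name `guttman_lambda` is illustrative, and the table is entered without its Total row and column.

```python
def guttman_lambda(table):
    """Guttman's lambda from a table of cell frequencies (rows x columns)."""
    n = sum(sum(row) for row in table)              # N: total frequency
    fr_sum = sum(max(col) for col in zip(*table))   # sum of biggest cell per column
    ct = max(sum(row) for row in table)             # CT: biggest row total
    return (fr_sum - ct) / (n - ct)

#              LAKAS  LAMMP  REPORMA
party_table = [
    [20, 9, 15],   # Catholic
    [5, 18, 4],    # INC
    [11, 8, 10],   # Protestant
]
lam = guttman_lambda(party_table)
print(f"lambda = {lam:.3f}")
```

Here the biggest cell frequencies per column are 20, 18, and 15, and the biggest row total is 44, so lambda = (53 - 44) / (100 - 44).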
Academic Grades   NSAT Scores               Total Students
                  Low    Average    High
Above 90          13     18         14      45
80 to 90          25     31         20      76
Below 80          21     38         20      79
Total             59     87         54      200
Correlation Ratio
$$ E^2 = \frac{\sum N_i \bar{y}_i^2 - N \bar{y}^2}{\sum y_{ij}^2 - N \bar{y}^2} $$
Single 65 83 81 69 73 89 76 60
Married 70 67 90 84 78
Widow 89 64 78
Exercises
where p = (x1+x2)/(n1+n2)
x1= number of successes in the 1st grp
x2=number of successes in the 2nd grp
n1= number of cases in the 1st grp
n2= number of cases in the 2nd grp
In a factory of baby dresses, one
production process yielded 28
defective pieces in a random sample
of 400, while another yielded 15
defective pieces in a random sample
of 300. Is there a significant
difference between the proportions
of defective baby dresses? Test at the
0.05 level.
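The baby dress problem can be worked with the pooled proportion p defined above. This is a minimal sketch, assuming the intended statistic is the standard pooled two-proportion z, that is, (p1 - p2) divided by the square root of pq(1/n1 + 1/n2); the slide shows only the definition of p, so the exact form is an assumption. The function name `two_proportion_z` is illustrative.

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for the difference of two sample proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)   # pooled proportion, as defined on the slide
    q = 1 - p
    return (p1 - p2) / sqrt(p * q * (1 / n1 + 1 / n2))

z = two_proportion_z(28, 400, 15, 300)
z_tab = 1.96  # two-tailed critical value at the 0.05 level
print(f"z = {z:.2f}, reject Ho: {abs(z) >= z_tab}")
```

Since the question asks only whether the proportions differ, the test is two-tailed and the absolute value of z is compared with the tabular value.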
Assignment
1. A candy maker produces two brands of
candy, X and Y. It is found that 56 of 200
buyers prefer brand X and 29 of 150 buyers
prefer brand Y. Can we conclude that brand X
outsells brand Y? Use the 0.05 level of significance.
2. Opinion on a certain issue in a college
community is believed to be split 80% for and
20% against. In a sample of 400, 83% answer
affirmatively. Does this result discredit the
organization's hypothesis at the 0.10 level?
3. In a study of cheating among college students, 144 or
41.4% of 348 students from homes of good
socio-economic status were found to have
cheated on various tests. In the same study,
133 or 50.2% of 265 students from homes of
poor socio-economic status also cheated on
the same tests. Is there a true difference in
the incidence of cheating in these two groups
at the 0.10 level of significance? (Z tab = ±1.645)
4. A study is made to determine if a cold climate
contributes more to absenteeism from school
during December than a warmer climate. Two
groups of students are selected at random,
one group from Baguio and the other from
Metro Manila. Of the 300 students from
Baguio, 72 were absent at least 1 day during
December, and of the 400 students from
Metro Manila, 70 were absent 1 or more days.
Can we conclude that a colder climate results
in a greater proportion of students being
absent from school at least 1 day during
December? Use a 0.01 level of significance.
(Z tab = ±2.33)
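• LARGE SAMPLE CONFIDENCE INTERVALS FOR
A POPULATION MEAN
Let X1…Xn be a large (n>30) random sample
from a population with mean μ and
standard deviation σ. A level 100(1 - α)%
confidence interval for μ is
$$ \bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}}, \quad \text{where } \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} $$
When the value of σ is unknown, it can be
replaced with the sample standard deviation
s.
In particular,
$\bar{x} \pm \frac{s}{\sqrt{n}}$ is a 68% confidence interval,
$\bar{x} \pm 1.645\,\frac{s}{\sqrt{n}}$ is a 90% confidence interval,
$\bar{x} \pm 1.96\,\frac{s}{\sqrt{n}}$ is a 95% confidence interval,
$\bar{x} \pm 2.58\,\frac{s}{\sqrt{n}}$ is a 99% confidence interval,
$\bar{x} \pm 3\,\frac{s}{\sqrt{n}}$ is a 99.7% confidence interval.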
In a random sample of 100 batteries produced
by a certain method, the average lifetime was
150 hrs and the standard deviation was 25 hrs.
a) Find a 95% confidence interval for the mean
lifetime of batteries produced by this
method?
b) Find the 80% confidence interval
c) Find the 99% confidence bound for µ
Solutions
a) 150 ± (1.96)(25/10) = (145.1, 154.9)
b) Set 1-α = .80
α = .20
α/2 = .10, so z = 1.28 (Z table)
150 ± (1.28)(2.5) = (146.8, 153.2)
• Subtract 0.10 from 0.5 (giving 0.40) before looking at the Z table.
c) 150 ± 2.33 (2.5) =
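The two-sided intervals in parts a) and b) can be reproduced without a printed Z table. This is a minimal sketch that obtains the z value from Python's standard library `statistics.NormalDist` rather than from the table, so the multipliers are slightly more precise than 1.96 and 1.28; the function name `mean_confidence_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(mean, sd, n, level):
    """Two-sided large-sample confidence interval for a population mean."""
    alpha = 1 - level
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z value with alpha/2 in each tail
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin

# Battery example: n = 100, sample mean 150 hrs, standard deviation 25 hrs
low95, high95 = mean_confidence_interval(150, 25, 100, 0.95)
low80, high80 = mean_confidence_interval(150, 25, 100, 0.80)
print(f"95%: ({low95:.1f}, {high95:.1f})")
print(f"80%: ({low80:.1f}, {high80:.1f})")
```

A one-sided confidence bound, as in part c), places the whole of alpha in a single tail, which is why its multiplier (2.33 for 99%) is smaller than the two-sided 2.58.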
Upper and Lower Confidence Bound
for μ
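• Upper bound: $\bar{x} + z_{\alpha}\,\sigma_{\bar{x}}$, where $\sigma_{\bar{x}} = \frac{s}{\sqrt{n}}$
• Lower bound: $\bar{x} - z_{\alpha}\,\sigma_{\bar{x}}$
90%: $\bar{x} \pm 1.28\,\sigma_{\bar{x}}$
95%: $\bar{x} \pm 1.645\,\sigma_{\bar{x}}$
99%: $\bar{x} \pm 2.33\,\sigma_{\bar{x}}$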
QUIZ
A. Construct the null and alternative
hypothesis on each of the given
statements:
1. To determine the influence of short stories
in shaping the sex-typed attitudes of high
school students (two-tailed test)
2. To determine the difference in the
performance of public and private
secondary school students in the national
entrance examination ( one-tailed, right
directional test)
3. A team of researchers want to determine if
grades in college are related to success in
a chosen field. The most appropriate
statistical analysis for the problem is _
a. Correlation analysis
b. Regression analysis
c. Prediction
d. Measure of variability
4. Which correlation coefficient represents
the strongest relationship between two
variables?
a. 0.0 b.-0.80 c. 0.60 d. 0.73
5. When small values of X are associated with
small values of Y and large values of X are
associated with large values of Y, then the
relationship between X and Y is_
a. Negatively correlated
b. Positively correlated
c. No correlation
d. Undetermined
6. State when to use the following statistical
tests:
a. Spearman rho
b. Guttman’s lambda
c. ANOVA (one-way)
d. Z test (single mean)
e. Correlation ratio
f. Pearson r
B. Perform test on hypothesis using the suggested
steps in hypothesis testing:
7. Previous research showed that the average
height of female students of ABC University is
1.55 meters with a standard deviation of
0.14 meters. In order to verify this, the student-researchers
draw a random sample of 144
female students, which shows an average
height of 1.48 m. Use alpha = 5% to test whether the
previous research finding is valid. (z tab = ±1.96,
2-tailed test)
8. The following are the sales and the profit
earned by XYZ Co. in million pesos for the
last 7 months:
Month Sales Profit
January 152 20
February 150 25
March 140 15
April 130 18
May 122 17
June 120 19
July 112 21
Convert the data to ranks and compute the value of ρ (Spearman rho) to find
out if there is a relationship between sales and profit.
Test at the 0.01 level (tabular value = 0.893)
9. The following are the ages of 7 employees
in a hospital and their corresponding
efficiency rating:
Employee   Age   Efficiency Rating
1          44    61
2          44    41
3          45    91
4          43    77
5          40    70
6          52    88
7          43    93
Is there a significant relationship between their ages and efficiency
rating? Interpret the result based on the correlation coefficient (r)
table.
10. A study was conducted to know if there
was any relationship between grades as a
high school senior and grades as a college
freshman. Grades are recorded as the
average during the last year of high school
and the average during the freshman year
of college. Calculate r. Is r significant at
alpha = 0.057?
HS 74 90 93 92 98 78 88 94 76
College 78 85 94 94 98 84 88 97
Chi-Square Distribution
• One-Way Sample (Test of Goodness of Fit)
$$ \chi^2 = \sum \frac{(O - E)^2}{E}; \quad df = k - 1 $$
where O = observed data
E = expected data
E = np
n = total frequencies
p = hypothesized proportion
• Two-way classification (two or more samples)
df = (row-1)(column-1)
$$ \chi^2 = \sum \frac{\left(|O - E| - 0.5\right)^2}{E} $$
Seatwork
1. A researcher wishes to get the pulse of the
studentry about the new enrollment
scheme. The scheme is not so popular and
so the researcher put a 50-50 chance of
acceptance. He used a sample of 150
students and asked them to give their
preferences to the scheme as favorable or
not favorable. Alpha = 0.05, χ² tab = 3.84
Category:   Favorable   Not Favorable   Total
            78          72              150
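The goodness-of-fit statistic for this seatwork item can be sketched in a few lines. This is a minimal illustration of the one-way formula with E = np, using the 50-50 split stated in the problem as the hypothesized proportions; the function name `chi_square_gof` is illustrative.

```python
def chi_square_gof(observed, proportions):
    """Chi-square goodness-of-fit statistic and its degrees of freedom."""
    n = sum(observed)
    expected = [n * p for p in proportions]   # E = np for each category
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1            # df = k - 1

chi2, df = chi_square_gof([78, 72], [0.5, 0.5])
chi2_tab = 3.84  # tabular value given in the problem
print(f"chi-square = {chi2:.2f}, df = {df}, reject Ho: {chi2 >= chi2_tab}")
```

With 150 students and a 50-50 split, each expected frequency is 75, so each category contributes (3)² / 75 to the statistic.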
2. Is sex related to alcohol consumption? Test
at alpha = 0.90, χ² tab = 0.211
Alcohol Consumption
Male 11 18 21 50
Female 7 15 28 50
Total 18 33 49 100
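The two-way statistic for a table like this one can be sketched as follows. This is a minimal illustration in which each expected frequency is taken as (row total × column total) / grand total, which is the usual rule and is assumed here because the slide does not state it. The `yates` flag applies the 0.5 correction shown in the two-way formula above; the function name `chi_square_independence` is illustrative.

```python
def chi_square_independence(table, yates=False):
    """Chi-square statistic and df for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                diff -= 0.5   # continuity correction from the slide formula
            chi2 += diff**2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df

alcohol = [
    [11, 18, 21],  # Male
    [7, 15, 28],   # Female
]
chi2_plain, df = chi_square_independence(alcohol)
chi2_corrected, _ = chi_square_independence(alcohol, yates=True)
print(f"df = {df}, uncorrected = {chi2_plain:.3f}, corrected = {chi2_corrected:.3f}")
```

Both versions are shown because the correction noticeably lowers the statistic; which one applies depends on the convention followed in class.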
Quiz
• Write the word being described in the given
statements and phrases.
1. This is a subset of a group of representatives of a
population.
2. A form of presenting data using tables for better
understanding and interpretation
3. This is the opposite of the null hypothesis.
4. This test is used when the number of sample
means is less than 30.
5. Data that permits us to describe how much more
or less one object possesses than another
6. It refers to how high, flat, or moderate a normal
curve is drawn based on the values of mean,
median and mode.
7. A form of presenting data information through
words or statements
8. This is done after the result of every statistical
test.
9. A method used in measuring the strength of
relationships between two or more variables
10. A Greek letter that stands for summation
11-13. With your answers from #1-10, write the
words that you are able to decode. (Think of the
first letter of every answer.)
• Read and understand the statements
and questions below. Write only the
letter.
14. The hypothesis which is hoped to be rejected is
A. null hypothesis B. alternative hypothesis
C. both A and B D. neither A nor B
15. If we reject Ho when it is true, we commit a
A. type I error B. type II error
C. type III error D. no error
• If you want to show that Method A of
teaching computer programming is
more effective than Method B, then
16. Ho should be stated as follows:
A. Ho: Method A is more effective than Method B
B. Ho: Method A is as effective as Method B
C. Ho: Method A is less effective than Method B
D. Ho: Method A is less effective or as effective as
Method B
17. Ha should be stated as follows:
A. Ha: Method A is as effective as Method B
B. Ha: Method A is more effective than Method B
C. Ha: Method A is less effective than Method B
D. Ha: Either B or C
18. The statistic used to test the significance
of difference between means when n is
large
A. Chi-square test B. t-test
C. F-test D. Z-test
Determine whether the following statements
can be a null hypothesis or alternative
hypothesis. Indicate Ho or Ha on the
blanks provided for.
19.Classical music has a positive effect on the
memory ability of Grade IV students of a certain
elementary school.
20. The pre-test scores of the students
belonging to Group A in the Language
ability test do not differ from
those of the students belonging to Group B.
21. Low scores in the mental ability test
correspond to low scores in the self-concept
test.
22.The performance of pre-schoolers from
private schools in the memory ability test is
significantly different from those coming
from the public schools.
23. Sleep deprived students have lower
performance in the mathematical learning
ability test than those with 8 hours sleep.
24. Introducing colors to pictures has no
effect on the memory retention of Grade I
pupils.
25. The performance of the students exposed
to verbal motivation in a given learning
ability test does not differ from that of the
students exposed to nonverbal motivation.
Exercises