A typical problem in Probability Theory is of the following form: a sample space and underlying
probability distribution are specified, and we are asked to compute the probability of a given
event (or events).
Example: Experiment of rolling a loaded die, for which the chance of landing on an odd-
numbered outcome is twice that of landing on an even-numbered outcome
a. Define the sample space.
b. Define an appropriate probability space.
c. Let E be event of getting either a 3 or a 4. Find P(E).
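One way to work this out: take S = {1, 2, 3, 4, 5, 6} and assign P({i}) = 2/9 to each odd outcome and P({i}) = 1/9 to each even outcome, so that the probabilities sum to 3(2/9) + 3(1/9) = 1. Then P(E) = P({3}) + P({4}) = 2/9 + 1/9 = 1/3.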
In a typical problem of Statistical Inference, it is not a single underlying probability distribution which
is specified, but rather a class of probability distributions, any of which may possibly be the one that
governs the chance experiment, whose outcome we shall observe. We know that the underlying
probability distribution is a member of this class, but we do not know which one it is. The objective
is thus to determine a “good” way of guessing, on the basis of the observed outcome of the
experiment, which of the underlying probability distributions is the one that actually governs the
experiment.
Example: Consider the experiment of rolling a die about which we know nothing.
a. Define the random variable X as the number of dots, i.e., face value of the die. What
are the possible values of X?
b. What would be a possible (probability) distribution of X?
c. Suppose the die is rolled 5 times. On the ith roll, let Xi be the number of dots. What
would be the (joint) distribution of X = (X1, X2, …, X5)′?
We now consider the specification of a statistical problem. Suppose that there is an experiment whose
outcome can be observed by the statistician. This outcome is described by a random variable X (or
random vector X), which takes on values in the space S. The distribution function of X, say FX, (or in
the case of a random vector, the distribution of X is FX ) is unknown to the statistician, but it is known
that FX belongs to a specified class of distribution functions, the class Ω. The collection of possible
actions that the statistician can take, or the collection of possible statements that can be made, at the
end of the experiment, is called the decision space, denoted as D. At the conclusion of the experiment,
the statistician actually chooses only one action (or makes only one statement) from the possible
choices in D.
In summary, therefore, any statistical problem can be specified by defining each of the components of
the triplet (S, Ω, D).
Example
Suppose we are given a coin, about which we know nothing. We are allowed to perform 10
independent flips of the coin, on each of which the probability of getting a head is p. We do not
know the value of p, but we know that p lies in the interval [0,1].
In this example, we can let 𝑋 = (𝑋1, 𝑋2, …, 𝑋10)′, with each 𝑋𝑖 defined to be “1” or “0”
according to whether the ith flip is a head or a tail. Then, 𝑺 consists of all the 2¹⁰ = 1,024 possible values of the
vector 𝑋.
vector 𝑋. The class 𝛀 consists of all possible probability mass functions of 𝑋 for which the 𝑋𝑖 𝑠 are
independently and identically distributed Bernoulli random variables with probability of success 𝑝,
i.e., iid 𝐵𝑒(𝑝). Thus, for a specific value of the vector 𝑋, say 𝑥 = (𝑥1 , 𝑥2 , … , 𝑥10 )′,
𝛀 = { 𝑝𝑋 ∶ 𝑝𝑋 (𝑥) = 𝑝∑ 𝑥𝑖 (1 − 𝑝)10−∑ 𝑥𝑖 , 0 ≤ 𝑝 ≤ 1 } .
𝑫 = { 𝑝̂ ∶ 0 ≤ 𝑝̂ ≤ 1} .
This type of statistical problem is referred to as Point Estimation. If, on the other hand, we do not
merely want a guess as to the value of 𝑝, but rather a statement of an interval of values which is thought
to enclose the true value of 𝑝, then we can define 𝑫 as
𝑫 = { (𝑝𝐿 , 𝑝𝑈 ) ∶ 0 ≤ 𝑝𝐿 ≤ 𝑝𝑈 ≤ 1 } .
Suppose that we are not required to come up with a numerical guess as to the value of 𝑝, but only to
decide whether or not the coin is fair. In this case, 𝑫 can be defined as
𝑫 = { “the coin is fair”, “the coin is not fair” } .
Note that 𝑫 can be viewed as the collection of possible answers to a question (e.g., “What do you guess
𝑝 to be?” or “Within what interval do you guess 𝑝 to lie?” or “Is the coin fair?”) asked of the statistician.
The real problem in Statistical Inference lies in choosing the best “guessing method.” Note also
that there are infinitely many ways of arriving at a guess of the value of 𝑝 or arriving at a decision
whether the coin is fair or not. Which of these ways of forming a guess from the experimental data
should we actually employ? This is the real problem of inference.
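To make this concrete, here is a minimal Python sketch of three different "guessing methods" applied to the same 10 flips; the true p, the shrinkage rule, and the 0.2 cutoff are all illustrative assumptions, not part of these notes:

    import random

    random.seed(1)
    p_true = 0.7                                   # unknown to the statistician in practice
    flips = [1 if random.random() < p_true else 0 for _ in range(10)]

    n, s = len(flips), sum(flips)

    # Three of the infinitely many possible guessing methods for p:
    p_hat_prop = s / n                             # sample proportion
    p_hat_shrunk = (s + 1) / (n + 2)               # a shrinkage-type guess pulled toward 1/2
    decision = "fair" if abs(s / n - 0.5) < 0.2 else "not fair"   # an ad hoc fairness rule

    print(p_hat_prop, p_hat_shrunk, decision)

Each rule is a legitimate way of mapping the observed outcome to an element of 𝑫; choosing among such rules is precisely the problem of inference described above.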
When statisticians discuss statistical problems, they naturally classify them in certain ways. In our
case, we shall classify statistical problems on the basis of the structure of 𝛀 and 𝑫.
Example:
In a parametric treatment of a problem, the distributions 𝐹𝑋 in Ω are assumed to share a
common functional form indexed by a finite number of unknown parameters, e.g.,
Ω = { N(μ, σ²) : μ ∈ R, σ² > 0 }. In contrast, a nonparametric treatment of the same problem
would entail only slight assumptions regarding the distributions 𝐹𝑋 in Ω, e.g., that Ω consists
of absolutely continuous distributions, not necessarily having common parameters.
On the basis of the structure of 𝑫, statistical problems may be classified as:
a. Point Estimation
b. Interval (or Region) Estimation
c. Hypothesis Testing
d. Ranking / Multiple Decision Problems
e. Regression / Experimental Designs Problems
Letters (a) – (c) were discussed in the example of the previous section. Region Estimation involves
estimating a vector of parameters 𝜃 = (𝜃1, 𝜃2, …, 𝜃𝑘)′. For example, we might be interested in
estimating the mean 𝜇 and the variance 𝜎² of a distribution simultaneously. In this case, our estimate
will be of the form
𝑫 = { (𝑢, 𝑣) : 𝑢1 ≤ 𝑢 ≤ 𝑢2 ; 𝑣1 ≤ 𝑣 ≤ 𝑣2 } ,
a rectangular region in the (𝜇, 𝜎²) plane.
Multiple Decision Problems are decision problems where there are a finite (more than 2) number
of possible decisions. Note that hypothesis testing is just a special case of this problem with the
number of possible decisions equal to two. Ranking Problems are those for which a decision is a
statement as to the complete ordering of certain objects or things, as in “Method A is best, B is next,
and C is the worst.”
A Regression Problem investigates the (linear) relationship between one variable
(the dependent variable) and a set of other variables (the independent variables). For this type of problem,
the objective is to predict the value of the dependent variable based on the observed values of the
independent variables. Whereas Regression Analysis investigates the linear relationships between
variables, an Experimental Designs Problem looks into the causal
relationship between a dependent variable and several independent variables. Such problems aim
to determine whether changes in the independent variables cause some effect on the dependent
variable.
Topics that do not fall into any of the classifications just mentioned but are of practical importance
to us are listed below. Some of the more important topics usually discussed are
a. Sampling Methods
b. Cost Considerations
c. Randomization
d. Asymptotic Theory
In Sampling Methods, the focus oftentimes falls on the so-called fixed-sample-size and sequential
procedures. The former entails fixing the sample size even before any data are collected, while the
latter is characterized by taking the observations sequentially, hence, the sample size is not fixed in
advance. Cost Considerations frequently become a deciding factor in the choice between the two
sampling procedures.
Some mathematically oriented problems in statistics involve the use of Randomization. Loosely,
this is the process of incorporating some element of chance into the manner in which the experiment
is being performed, so as to minimize the possibility of having biases. Asymptotic Theory is the
class of results and theories that apply for cases using very large samples. The word asymptotic is
usually used to describe a method, a result, a theorem, or a definition associated with very large
samples.
Defn: The totality of elements which are under discussion, and about which information is desired,
will be called the target population.
Remarks:
1. The target population must be capable of being well defined. It may be real or hypothetical.
2. The object of any investigation is to find out something about a given target population.
3. It is generally impossible or impractical to examine the entire population, but, on the basis of
examining a part of it, inferences regarding the entire target population can be made.
Consider 𝑋~𝐹𝑋 . You wish to make inferences about FX on the basis of n independent observations of
𝑋, say 𝑋1 , 𝑋2 , … , 𝑋𝑛 . The problem now is how to select a part of the population, i.e., how do we obtain
a sample? In answering this question, the following consideration should be taken into account: If the
sample is selected in a certain way, we can make probabilistic statements about the population.
Defn: Given a probability space and a positive integer n, a collection of n independent random
variables 𝑋1 , 𝑋2 , … , 𝑋𝑛 , all having common distribution 𝐹𝑋 is called a random sample from a
population (with distribution) 𝐹𝑋 .
Note: We assume that each physical element of the population has some numerical value associated
with it and that the distribution of these values is given by the distribution function 𝐹𝑋 .
Remarks:
1. A random sample (r.s.) can be viewed as a random vector 𝑋 = (𝑋1 , 𝑋2 , … , 𝑋𝑛 )′ defined on the
n-dimensional real space Rn. Further, it can also be interpreted as the outcome of a series of n
independent trials of an experiment performed under identical conditions.
2. Inference works under the assumption that the sample (data) reflects the truth about the
population. To ferret out this truth, inference employs the so-called process of inductive
argumentation.
Example: If a given coin is biased (loaded) in favor of heads, we would expect to observe heads
more frequently than tails in repeated tosses of the coin. Out of 20 tosses of the coin,
14 were heads and only 6 were tails. This is thus taken as evidence that the coin may
not be fair.
Since we cannot make absolutely certain generalizations, uncertainty will always be present in
all inductive inferences we make. This is why statistical inference is based on laws of
probability.
3. The distribution 𝐹𝑋 is usually called the sampled population, the collection of all elements from
which the sample is actually selected. (In certain cases, 𝐹𝑋 may be replaced with the
corresponding PMF or PDF.)
4. Implicitly, sampling without replacement from a finite population is ruled out in the above
definition.
5. Since the r.s. 𝑋 = (𝑋1, 𝑋2, …, 𝑋𝑛)′ consists of independent and identically distributed (iid)
random variables, the distribution of the r.s. X, which is simply the joint distribution of
𝑋1, 𝑋2, …, 𝑋𝑛, is thus
F(x1, x2, …, xn) = ∏ᵢ₌₁ⁿ FX(xᵢ) .
Examples:
1. In studying the “reliability” of light bulbs, the lifetime X (in hours) of a given light bulb is taken
to be a r.v. with density function
fX(x) = λ e^(−λx) I(0,∞)(x) , λ > 0.
A collection of n light bulbs is put to a “reliability test” and their lifetimes are recorded. Then,
𝑋 = (𝑋1, 𝑋2, …, 𝑋𝑛)′ can be considered a r.s. (from an exponential population with parameter
λ). What is the PDF of the r.s. X? What are S and 𝛀 for this statistical problem?
2. A r.s. 𝑋 = (𝑋1, 𝑋2, …, 𝑋𝑛)′ from a Bernoulli population with parameter 0 ≤ p ≤ 1 will have a PMF
of the form
pX(x1, x2, …, xn) = p^(∑xᵢ) (1 − p)^(n − ∑xᵢ) , xᵢ ∈ {0, 1}.
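For instance, for Example 1 (with the rate parameterization assumed above), the PDF of the r.s. is the product of the marginal densities,
f(x1, x2, …, xn) = ∏ᵢ₌₁ⁿ λ e^(−λxᵢ) = λⁿ e^(−λ ∑xᵢ) , each xᵢ > 0,
so that S = (0, ∞)ⁿ and 𝛀 = { λⁿ e^(−λ ∑xᵢ) : λ > 0 }.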
2.2.1 Statistics
Defn: Let X = (X1, X2, …, Xn)’ be an observable random vector. Any observable function of X, say
T(X), which is itself a r.v. (or random vector), is called a statistic. The standard deviation of a
statistic is called its standard error.
Remarks:
1. A statistic is always
a. a function of observable random variables;
b. itself a r.v.; and
c. free of any unknown parameters.
2. By “observable”, we mean that the value of the statistic T(X) can be computed directly from
the values of the r.v.’s in the r.s.
Examples:
For the given random samples, which of the given functions are statistics?
1. Let X be a r.s. (of size 1) from N(μ, σ²), where μ and σ² are both unknown.
a. X − μ      d. X² + 3
b. X/σ        e. X² + log X²
c. X
2. Let X1, X2, …, X10 be a r.s. of size 10 from a distribution FX.
a. ∑ᵢ₌₁⁵ Xᵢ      d. ∑ᵢ₌₁¹⁰ Xᵢ
3. Let X = (X1, X2, …, Xn)′ be a r.s. from N(μ, σ²), where μ and σ² are both unknown.
a. Sn = ∑ᵢ₌₁ⁿ Xᵢ                     d. S² = ∑ᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1)
b. X̄ + (n − 1)Sn                    e. Z = (X1 − μ)/σ
c. T1(X) = ∑ᵢ₌₁ⁿ (Xᵢ − X̄)² / n      f. T2(X) = (n − 1)S²/σ²
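As a concrete check on the "observable" requirement, the following minimal Python sketch (the sample size and parameter values are illustrative) computes quantities that are statistics and quantities that are not:

    import random

    random.seed(0)
    n = 20
    mu, sigma = 5.0, 2.0                                # unknown to the statistician in practice
    x = [random.gauss(mu, sigma) for _ in range(n)]

    # Statistics: computable from the observed sample alone
    s_n = sum(x)                                        # sample sum
    xbar = s_n / n                                      # sample mean
    s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)    # sample variance

    # NOT statistics: these involve the unknown parameters mu and sigma
    z = (x[0] - mu) / sigma
    t2 = (n - 1) * s2 / sigma ** 2

    print(xbar, s2)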
Let X = (X1, X2, …, Xn)′ be a r.s. from FX. Some common statistics are:
1. Sample Sum : Sn = ∑ᵢ₌₁ⁿ Xᵢ
2. Sample Mean : X̄ = ∑ᵢ₌₁ⁿ Xᵢ / n = Sn/n
3. Sample Variance : S² = ∑ᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1)
4. rth Sample Moment : M′r = ∑ᵢ₌₁ⁿ Xᵢʳ / n , r = 1, 2, …
5. rth Sample Central Moment : Mr = ∑ᵢ₌₁ⁿ (Xᵢ − X̄)ʳ / n , r = 1, 2, …
Theorem: Let X = (X1, X2, …, Xn)′ be a r.s. from FX. Then,
a. E[M′r] = E[Xʳ] , if E[Xʳ] exists; and,
b. Var[M′r] = { E[X²ʳ] − (E[Xʳ])² } / n , if E[X²ʳ] exists.
Corollary: Let X = (X1, X2, …, Xn)′ be a r.s. from FX, with mean μ and variance σ². Then,
a. E(X̄) = E(X) = μ ; and,
b. Var(X̄) = σ²/n .
Theorem: Let X = (X1, X2, …, Xn)′ be a r.s. from FX, with mean μ and variance σ². Then,
a. E[S²] = σ² ; and,
b. Var[S²] = (1/n) { μ₄ − [(n − 3)/(n − 1)] σ⁴ } ,
where μ₄ = E[(X − μ)⁴] is the fourth central moment, provided it exists.
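The unbiasedness result E[S²] = σ² in part (a) can be checked empirically; here is a minimal Python simulation sketch (the parameter values and number of replications are illustrative) that averages S² over many simulated samples:

    import random
    import statistics

    random.seed(42)
    mu, sigma, n, reps = 0.0, 2.0, 10, 20000

    s2_values = []
    for _ in range(reps):
        x = [random.gauss(mu, sigma) for _ in range(n)]
        s2_values.append(statistics.variance(x))   # sample variance, divides by n - 1

    # The average of S^2 over many samples should be close to sigma^2 = 4
    print(sum(s2_values) / reps)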
Remarks:
1. A statistic, being a r.v., has thus its own probability distribution, which is called the sampling
distribution.
2. The sampling distribution of a statistic is affected by the sample size n, the population size N
(for finite cases), and the way X was observed (i.e., the manner in which the r.s. was selected).
Example: Let X = (X1, X2, …, Xn)′ be a r.s. from N(μ, σ²), where μ ∈ R and σ² > 0.
a. The sampling distribution of the sample sum is Sn ~ N(nμ, nσ²).
b. The sampling distribution of the sample mean is X̄ ~ N(μ, σ²/n).
Defn: The family of density (or mass) functions { f(x; θ) : θ ∈ Θ }, with parameter θ, is said to be
reproductive with respect to the parameter θ if, and only if, whenever X1 and X2 are
independent r.v.’s with densities f(x; θ1) and f(x; θ2) in the family, their sum X1 + X2 has
density f(x; θ1 + θ2), which is again a member of the family.
Remark: The above definition also applies for more than 2 independent r.v.’s.
Examples:
1. Xi ~ Bi(mi, p), i = 1, 2, …, n, independent ⇒ Sn ~ Bi(∑ᵢ₌₁ⁿ mi , p)
2. Xi ~ N(μi, σi²), i = 1, 2, …, n, independent ⇒ ∑ᵢ₌₁ⁿ aiXi ~ N(∑ᵢ₌₁ⁿ aiμi , ∑ᵢ₌₁ⁿ ai²σi²)
3. Xi ~ Exp(λ), i = 1, 2, …, n, independent ⇒ Sn ~ Ga(n, λ)
Defn: A continuous r.v. X is said to have a chi-square distribution with k degrees of freedom
(d.f.) if, and only if, the PDF of X is given by
fX(x) = [1 / (2^(k/2) Γ(k/2))] x^(k/2 − 1) e^(−x/2) I(0,∞)(x) , k ∈ Z⁺.
Notation : X ~ χ²ₖ
Mean : E(X) = k
Variance : Var(X) = 2k
MGF : mX(t) = (1 − 2t)^(−k/2) , t < ½
FIGURE 1.1. Graph of the chi-square distribution with varying degrees of freedom (k = 2, 5, 10, 15).
Remarks
1. The degrees of freedom (d.f.) of the chi-square distribution completely specify the
distribution of a chi-square r.v.
2. A chi-square r.v. with k d.f. is equivalent to a Gamma r.v. with parameters r = k/2 and
λ = ½, i.e., χ²ₖ ≡ Ga(r = k/2, λ = ½).
Theorem: If the r.v.’s X1, X2, …, Xk are normally and independently distributed with means
μᵢ and variances σᵢ², i = 1, 2, …, k, respectively, then
U = ∑ᵢ₌₁ᵏ [(Xᵢ − μᵢ)/σᵢ]² ~ χ²ₖ .
Corollary: If X1, X2, …, Xn is a r.s. from N(μ, σ²), then
U = ∑ᵢ₌₁ⁿ [(Xᵢ − μ)/σ]² ~ χ²ₙ .
Remarks:
1. The theorem states that the sum of the squares of independent standard normal
random variables is a chi-square random variable, with d.f. equal to the number of
r.v.’s (number of terms) in the sum.
2. If X ~ N(μ, σ²), then [(X − μ)/σ]² ~ χ²₁ .
3. The chi-square family of densities is reproductive with respect to the degrees of freedom:
if X1, X2, …, Xn are independent with Xᵢ ~ χ²ₖᵢ, then Sn ~ χ² with ∑ᵢ₌₁ⁿ kᵢ d.f.
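The following minimal Python simulation sketch (k = 5 and the replication count are illustrative) checks the theorem numerically: the simulated sums of squares should have mean ≈ k and variance ≈ 2k:

    import random

    random.seed(7)
    k, reps = 5, 50000

    # Sum of squares of k independent N(0,1) draws ~ chi-square with k d.f.
    u = [sum(random.gauss(0, 1) ** 2 for _ in range(k)) for _ in range(reps)]

    mean = sum(u) / reps                                   # should be close to k = 5
    var = sum((ui - mean) ** 2 for ui in u) / (reps - 1)   # should be close to 2k = 10
    print(mean, var)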
Illustration:
Four r.s.’s, each of size 100, from N(0,1) are obtained using PHStat for MS Excel. The
histograms of the normal random samples, and the histograms for the squares of the values
are shown below.
[FIGURE: Four pairs of panels showing the histograms of the normal samples n1–n4 (left) and of their squared values sq1–sq4 (right).]
Theorem: Let X = (X1, X2, …, Xn)′ be a r.s. from N(μ, σ²), where μ ∈ R, σ² > 0, and n ≥ 2.
Then,
a. X̄ and S² are independent; and,
b. (n − 1)S²/σ² ~ χ²ₙ₋₁ .
Defn: A continuous r.v. X is said to have an F distribution with m and n degrees of freedom
if, and only if, the PDF of X is given by
fX(x) = { Γ[(m + n)/2] / [Γ(m/2) Γ(n/2)] } (m/n)^(m/2) x^(m/2 − 1) [1 + (m/n)x]^(−(m + n)/2) I(0,∞)(x) , m, n ∈ Z⁺.
Notation : X ~ Fm,n
Mean : E(X) = n/(n − 2) , n > 2
Variance : Var(X) = [2n²(m + n − 2)] / [m(n − 2)²(n − 4)] , n > 4
MGF : does not exist (DNE)
Remarks:
1. The numerator (m) and denominator (n) degrees of freedom completely specify the
distribution.
Theorem: If U ~ χ²ₘ and V ~ χ²ₙ are independent, then
X = (U/m) / (V/n) ~ Fm,n .
Remark: The theorem states that the ratio of two independent chi-square r.v.’s, each divided
by its respective d.f., is an F-distributed r.v., with numerator d.f. equal to the d.f. of the
chi-square r.v. in the numerator and denominator d.f. equal to the d.f. of the chi-
square r.v. in the denominator.
Corollary: If X1, X2, …, Xm is a r.s. from N(μX, σ²) and Y1, Y2, …, Yn is another independent
r.s. from N(μY, σ²), then
SX² / SY² ~ Fm−1,n−1 ,
where SX² = ∑ᵢ₌₁ᵐ (Xᵢ − X̄)² / (m − 1) and SY² = ∑ᵢ₌₁ⁿ (Yᵢ − Ȳ)² / (n − 1) .
Illustration: Using 4 of the r.s.’s of size 100 each from N(0,1) in the earlier illustration, the
histograms of sq1/sq2 and sq3/sq4 are shown below.
[FIGURE: Two panels showing the histograms of the two ratios, labeled f1_2 and f5_4.]
Defn: A continuous r.v. X is said to have a (Student’s) t distribution with k degrees of freedom
if, and only if, the PDF of X is given by
fX(x) = { Γ[(k + 1)/2] / [√(kπ) Γ(k/2)] } (1 + x²/k)^(−(k + 1)/2) I(−∞,∞)(x) , k ∈ Z⁺.
Notation : X ~ tₖ
Mean : E(X) = 0 , k > 1
Variance : Var(X) = k/(k − 2) , k > 2
MGF : does not exist (DNE)
FIGURE 1.3. Graph of the t-distribution with varying degrees of freedom (k = 2, 5) and the standard normal
distribution.
Remarks:
1. For large d.f. k, the t-distribution approaches the standard normal distribution.
Theorem: If Z ~ N(0,1) and U ~ χ²ₖ are independent, then
T = Z / √(U/k) ~ tₖ .
Remark: The theorem states that the ratio of a standard normal random variable to the
square root of an independent chi-square random variable divided by its degrees of
freedom is a t-distributed random variable, with d.f. equal to the d.f. of the chi-
square random variable in the denominator.
Corollary: If X1, X2, …, Xn is a r.s. from N(μ, σ²), with sample mean X̄ and sample standard
deviation S, then
T = (X̄ − μ) / (S/√n) ~ tₙ₋₁ .
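A minimal Python simulation sketch of the corollary (the parameter values are illustrative; 2.262 is the 97.5th percentile of the t distribution with 9 d.f.): about 5% of the simulated T values should exceed the critical value in absolute value:

    import math
    import random

    random.seed(3)
    mu, sigma, n, reps = 10.0, 4.0, 10, 20000

    hits = 0
    for _ in range(reps):
        x = [random.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(x) / n
        s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
        t = (xbar - mu) / (s / math.sqrt(n))
        if abs(t) > 2.262:        # t critical value, 9 d.f., two-sided 5%
            hits += 1

    print(hits / reps)            # should be close to 0.05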
[FIGURE: Two panels showing the histograms of n1_n2 and n5_rootsq4 from the simulated N(0,1) samples.]
Some Important Results:
Let X1, X2, …, Xn1 be a r.s. from N(μ1, σ1²) and Y1, Y2, …, Yn2 be another independent r.s. from
N(μ2, σ2²). Then,
1. (X̄ − μ1) / (S1/√n1) ~ tn1−1 and (Ȳ − μ2) / (S2/√n2) ~ tn2−1 ,
where S1² = ∑ᵢ₌₁ⁿ¹ (Xᵢ − X̄)² / (n1 − 1) and S2² = ∑ᵢ₌₁ⁿ² (Yᵢ − Ȳ)² / (n2 − 1) .
2. [(X̄ − Ȳ) − (μ1 − μ2)] / √(σ1²/n1 + σ2²/n2) ~ N(0, 1)
3. [(X̄ − Ȳ) − (μ1 − μ2)] / √[Sp²(1/n1 + 1/n2)] ~ tn1+n2−2 , assuming σ1² = σ2² ,
where Sp² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2) (pooled variance)
4. (S1²/σ1²) / (S2²/σ2²) ~ Fn1−1,n2−1
5. S1²/S2² ~ Fn1−1,n2−1 , assuming σ1² = σ2²
Defn: Let X1, X2, …, Xn be a r.s. from FX. Let X(1) ≤ X(2) ≤ … ≤ X(n) be the Xi’s arranged in
increasing order. Then, X(1), X(2), …, X(n) are called the order statistics corresponding
to the r.s. X1, X2, …, Xn, and X(r) is called the rth order statistic.
Remarks
1. In general, the order statistics (o.s.) are not independent, unless FX is a distribution that is
degenerate at some constant c.
2. The first and the last order statistics, X(1) and X(n), are called the sample minimum and
sample maximum, respectively.
Theorem: Let X(1), X(2), …, X(n) represent the o.s. of a r.s. from the distribution FX. For r
= 1, 2, …, n, the CDF of X(r) is given by
Fr(y) = ∑ⱼ₌ᵣⁿ C(n, j) [FX(y)]ʲ [1 − FX(y)]ⁿ⁻ʲ .
Corollary: The CDF of the sample minimum X(1) and maximum X(n) are, respectively,
F1(y) = 1 − [1 − FX(y)]ⁿ and Fn(y) = [FX(y)]ⁿ .
Example: Suppose 20 identical light bulbs operate independently in a system. The system
stops when one light bulb expires. For i = 1, 2, …, 20, let Xi represent the lifetime
(in days) of the ith bulb, with each Xi ~ Exp(λ). Find the CDF of X(4). What is the
probability that the system will still be working after 150 days?
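A worked sketch, assuming the rate parameterization fX(x) = λe^(−λx) used earlier: the order-statistic CDF formula gives
F4(y) = ∑ⱼ₌₄²⁰ C(20, j) (1 − e^(−λy))ʲ (e^(−λy))²⁰⁻ʲ , y > 0,
and, since the system runs only while all 20 bulbs do, P(system still working after 150 days) = P(X(1) > 150) = [e^(−150λ)]²⁰ = e^(−3000λ).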
Theorem: If FX is absolutely continuous with PDF fX, then for r = 1, 2, …, n, the PDF of
X(r), denoted fr, is given by
fr(y) = r C(n, r) [FX(y)]ʳ⁻¹ fX(y) [1 − FX(y)]ⁿ⁻ʳ .
Corollary: If FX is absolutely continuous with PDF fX, then the PDFs of
the sample minimum X(1) and maximum X(n) are, respectively,
f1(y) = n [1 − FX(y)]ⁿ⁻¹ fX(y) and fn(y) = n [FX(y)]ⁿ⁻¹ fX(y) .
Theorem: Let X(1), X(2), …, X(n) represent the o.s. of a r.s. from the distribution FX. For r, s
= 1, 2, …, n, and r < s, if FX has PDF fX, then the joint PDF of X(r) and X(s),
denoted by fr,s, is given by
fr,s(x, y) = { n! / [(r − 1)! (s − r − 1)! (n − s)!] } [FX(x)]ʳ⁻¹ fX(x) [FX(y) − FX(x)]ˢ⁻ʳ⁻¹ fX(y) [1 − FX(y)]ⁿ⁻ˢ , for x < y.
Some statistics defined in terms of the order statistics are:
1. Sample Range : R = X(n) − X(1)
2. Sample Median : X̃ = X((n+1)/2) , if n is odd;
X̃ = [X(n/2) + X(n/2 + 1)] / 2 , if n is even.
The distribution of R can be derived (using transformation) from the joint PDF f1,n .
Examples:
1. Suppose we take a r.s. of size n from Bi(m, p). Find the CDF and the PMF of X(r).
2. Let X1, X2, …, Xn be a r.s. from U(0, θ), n ≥ 2. Find the mean and the variance of the r.v.
[(n + 1)/n] X(n) .
Asymptotic theory deals with results that arise for sample size n approaching infinity, or for very
large n. The following asymptotic results will be useful in obtaining approximate (or asymptotic)
sampling distributions of certain statistics.
Theorem: (Chebyshev’s Inequality) Let X be a r.v. with mean μ and finite variance σ² < ∞.
Then, ∀ε > 0,
P(|X − μ| ≥ ε) ≤ σ²/ε² .
Corollary: P(|X − μ| ≥ kσ) ≤ 1/k² , or, equivalently,
P(|X − μ| < kσ) ≥ 1 − 1/k² , or, P(μ − kσ < X < μ + kσ) ≥ 1 − 1/k² .
Special Results
1. For k = 1, P(μ − σ < X < μ + σ) ≥ 0, a trivial bound.
2. For k = 2, P(μ − 2σ < X < μ + 2σ) ≥ 3/4.
3. For k = 3, P(μ − 3σ < X < μ + 3σ) ≥ 8/9.
1. Let X ~ N(0,1). Find the probabilities that X is within 1, 2, and 3 standard deviations from the
mean μ.
2. Let X ~ Bi(n = 10, p = 0.9). Find the probabilities that X is within 1, 2, and 3 standard
deviations from the mean μ.
3. Let X ~ Po(λ = 9). Find the probabilities that X is within 1, 2, and 3 standard deviations from
the mean μ.
4. Let X ~ t(6). Find the probabilities that X is within 1, 2, and 3 standard deviations from the
mean μ.
Theorem: (Weak Law of Large Numbers, WLLN) Let X1, X2, …, Xn be a r.s. from the PDF fX,
with mean μ and finite variance σ² < ∞. Let ε and δ be 2 arbitrary numbers such that
ε > 0 and 0 < δ < 1. Then, there is an integer n such that, for every sample of at least n
observations,
P(|X̄ − μ| < ε) ≥ 1 − δ .
Remarks
1. Equivalently, lim(n→∞) P(|X̄ − μ| ≥ ε) = 0 ∀ε > 0, and we write X̄ →ᴾ μ (X̄ converges in
probability to μ).
2. Explanation of the WLLN: The probability that X̄ will deviate from the true population mean μ
by more than some arbitrarily small nonzero value ε can be made arbitrarily small, by choosing
n sufficiently large. Because of this, the sample mean can be used to estimate μ reliably.
3. If X1, X2, …, Xn is a r.s. from the PDF fX, with mean μ and variance σ² < ∞, we can determine
n ∈ Z⁺ so that the probability that X̄ will differ from μ by less than an arbitrarily small amount
ε can be made as close to 1 as possible. Thus, X̄ can be used to estimate μ with a high degree
of accuracy:
P(|X̄ − μ| < ε) → 1 as n → ∞ , ∀ε > 0 .
4. If n is sufficiently large, |X̄ − μ| is likely to be small, but this does not imply that |X̄ − μ| is
small for all large n.
5. The result does not imply that P(|X̄ − μ| < ε) = 1. It only means that it can be very likely that X̄
is close to μ.
Example: Consider a distribution with unknown mean μ and variance σ² = 1. How large a
sample should be taken so that a probability of at least 0.95 is attained that the sample
mean X̄ will not deviate from the population mean μ by more than 0.4 units?
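A worked sketch using Chebyshev’s Inequality: we need P(|X̄ − μ| < 0.4) ≥ 0.95, i.e., P(|X̄ − μ| ≥ 0.4) ≤ 0.05. Since Var(X̄) = σ²/n = 1/n, Chebyshev’s Inequality gives P(|X̄ − μ| ≥ 0.4) ≤ (1/n)/(0.4)² = 1/(0.16n), and 1/(0.16n) ≤ 0.05 requires n ≥ 125.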
Theorem: (Central Limit Theorem, CLT) Let X1, X2, …, Xn be a r.s. from the PDF fX, with mean
μ and finite variance σ². Let X̄ be the sample mean of the r.s. and define the r.v. Zn as
Zn = (X̄ − E[X̄]) / √Var(X̄) = (X̄ − μ) / (σ/√n) .
Then, Zn →ᵈ N(0,1) as n → ∞.
Remarks
1. Consequently, for large n, X̄ is approximately distributed as N(μ, σ²/n), and Sn is
approximately distributed as N(nμ, nσ²).
2. The CLT result holds for all r.s.’s, regardless of the form of the parent PMF/PDF, for as long
as this distribution has finite variance.
3. Importance of the CLT: In making inferences about population parameter(s), we need the
distribution of certain statistics, e.g., the sample mean X̄. Finding the sampling distributions of
statistics is often mathematically easier if samples are taken from the normal distribution.
However, if the r.s. is not taken from the normal distribution, finding the sampling distribution
of X̄ can become very difficult. The CLT states that, for as long as (1) the parent PMF/PDF of
the r.s. has finite variance, and (2) the sample size is large, the approximate distribution of the
sample mean is a normal distribution.
Examples
1. Consider a distribution with unknown mean μ and variance σ² = 1. How large a sample should
be taken so that a probability of (exactly) 0.95 is attained that the sample mean X̄ will not
deviate from the population mean μ by more than 0.4 units?
2. An electrical firm manufactures light bulbs that have an average length of life equal to 800 hours
and a standard deviation of 40 hours. Find the probability that a random sample of 16 bulbs will
have an average life of less than 775 hours.
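Worked sketches via the CLT: For Example 1, P(|X̄ − μ| < 0.4) = 0.95 requires 0.4/(1/√n) = z₀.₉₇₅ = 1.96, so n = (1.96/0.4)² ≈ 24.01, or about 24 observations, far fewer than the 125 required by Chebyshev’s Inequality. For Example 2, P(X̄ < 775) = Φ[(775 − 800)/(40/√16)] = Φ(−2.5) ≈ 0.0062.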
Remark: The De Moivre–Laplace Theorem uses a normal distribution to approximate the
probabilities under a binomial distribution. However, the approximation is appropriate
only for binomial distributions with (1) very large values of n, and (2) values of p that
are not very close to 0 or 1. When the value of p is very close to 0 or 1, and when the
value of n is very large, the following corollary, which uses the normal distribution to
approximate the Poisson distribution, will be more appropriate.
Examples
1. Toss a pair of dice 600 times. Find the probability that there will be between 90 and 110 tosses
(exclusive) resulting in a total of “7” on the pair of dice.
2. The probability that a patient recovers from a rare blood disease is 0.6. If 100 people are known
to have contracted the disease, what is the probability that less than half of them will survive?
3. A multiple-choice quiz has 200 questions, each with 4 possible answers, only 1 of which is the
correct answer. What is the probability that sheer guesswork yields from 25 to 30 correct
answers for 80 of the 200 problems about which the student has no knowledge?
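A worked sketch for Example 1: the number of tosses resulting in a “7” is Bi(n = 600, p = 1/6), with mean np = 100 and standard deviation √(np(1 − p)) ≈ 9.13. With the continuity correction, P(90 < X < 110) = P(91 ≤ X ≤ 109) ≈ Φ[(109.5 − 100)/9.13] − Φ[(90.5 − 100)/9.13] = Φ(1.04) − Φ(−1.04) ≈ 0.70.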
Corollary: (Normal Approximation to the Poisson Distribution) If X1, X2, …, Xn is a r.s. from Po(λ),
with λ small, the sample sum Sn = ∑ᵢ₌₁ⁿ Xᵢ is approximately (or asymptotically)
distributed as N(nλ, nλ) as n → ∞.
Examples
1. Suppose that, on average, 1 person in every 1000 is alcoholic. Find the probability that a random
sample of 8000 people will yield fewer than 7 alcoholics.
2. The probability that a person dies from a respiratory infection is 0.002. Find the probability that
fewer than 5 of the next 2000 so infected will die.
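A worked sketch for Example 1: the number of alcoholics in the sample is approximately Poisson with mean 8000(0.001) = 8, so by the corollary it is approximately N(8, 8). With the continuity correction, P(X < 7) = P(X ≤ 6) ≈ Φ[(6.5 − 8)/√8] = Φ(−0.53) ≈ 0.298.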