Chapter 5
Statistical Decision Problems
Decision problems are called statistical when data on the state of nature are available, data that one hopes contain information that can be used to make a better decision. The availability of data generally provides some illumination, so that in selecting an action one is not completely in the dark concerning the state of nature. In practice, however, one still risks taking a bad action if the information contained in the data is not used intelligently.
Decision rule
A function
d : S_X → A
is called a nonrandomized decision rule.
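In code, a nonrandomized decision rule is just a lookup table from sample points to actions. A minimal sketch in Python (the labels x1, x2, a1, a2 are illustrative placeholders, not fixed by the text):

```python
# A nonrandomized decision rule: a plain mapping from sample points to actions.
rule = {"x1": "a1", "x2": "a2"}

def decide(d, x):
    """Apply decision rule d to the observed value x."""
    return d[x]

print(decide(rule, "x2"))  # -> "a2"
```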
Example 5.2.1
Suppose (Θ, A, L) is a decision problem with A = {a1, a2}, and suppose X is an observable random variable such that S_X = {x1, x2}. Then there are four distinct nonrandomized decision rules, as in the following table:

        d1   d2   d3   d4
x1      a2   a2   a1   a1
x2      a2   a1   a2   a1
Example 5.2.2
Suppose (Θ, A, L) is a decision problem with A = {a1, a2}, and suppose X is an observable random variable such that S_X = {x1, x2, x3}. Then there are eight distinct nonrandomized decision rules, as in the following table:

        d1   d2   d3   d4   d5   d6   d7   d8
x1      a2   a2   a2   a1   a2   a1   a1   a1
x2      a2   a2   a1   a2   a1   a2   a1   a1
x3      a2   a1   a2   a2   a1   a1   a2   a1
Risk Function
The risk function of the decision rule d, when the state of nature is θ, is defined to be

R(θ, d) = E[L(θ, d(X))],

where the expectation is taken with respect to the distribution f(· ; θ) of X.
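As a sketch, the risk is just the expected loss incurred when the rule is followed. The function below assumes the loss and the density are supplied as dictionaries keyed by (state, action) and (state, x); these container conventions are assumptions of the sketch:

```python
def risk(theta, d, loss, density, sample_space):
    """R(theta, d) = sum over x of L(theta, d(x)) * f(x | theta)."""
    return sum(loss[(theta, d[x])] * density[(theta, x)] for x in sample_space)
```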
For the decision problem of Example 5.3.1, the risks of the four nonrandomized decision rules work out as follows:

            d1   d2   d3   d4
R(θ1, d)    4    2.4  1.6   0
R(θ2, d)    0    3.2  0.8   4

The decision rules can also be represented graphically by means of their risk points.
Remark
Decision rules d1 and d4, which ignore the data, give risk points exactly the same as the corresponding loss points of the no-data problem.
The straight line joining the risk points of d1 and d4 consists of the loss points (regarded as risks) of the randomized mixtures of the actions a2 and a1.
An intelligent use of the data, as in decision rule d3, can improve the expected losses.
Using the data foolishly, as in decision rule d2, can make the expected losses worse.
How many distinct nonrandomized decision rules are there in a statistical decision problem with n available actions, when the observed random variable has m possible values?
Answer: n^m, since each of the m possible values may be assigned any one of the n actions independently.
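The count n^m can be checked by brute force: enumerate one action choice per sample value. A sketch using itertools:

```python
from itertools import product

def all_rules(actions, sample_space):
    """Every nonrandomized decision rule: one action assigned to each sample point."""
    return [dict(zip(sample_space, choice))
            for choice in product(actions, repeat=len(sample_space))]

rules = all_rules(["a1", "a2"], ["x1", "x2", "x3"])
print(len(rules))  # 2**3 = 8, matching Example 5.2.2
```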
Example 5.3.2
Consider a decision problem in which the loss matrix is
given by
        a1   a2
θ1       0    1
θ2       6    5
Suppose that the observed random variable X takes three
possible values with density function given in the following
table
            x1    x2    x3
f(x; θ1)    0.6   0.3   0.1
f(x; θ2)    0.1   0.4   0.5
The nonrandomized decision rules are tabulated as follows:
        d1   d2   d3   d4   d5   d6   d7   d8
x1      a2   a2   a2   a1   a2   a1   a1   a1
x2      a2   a2   a1   a2   a1   a2   a1   a1
x3      a2   a1   a2   a2   a1   a1   a2   a1
Example 5.3.3
Consider the decision problem stated in Example 5.3.2 with
loss table given by
        a1   a2
θ1       0    1
θ2       6    5
The risk of a decision rule, say d2 = (a2, a2, a1) (that is, d2(x1) = a2, d2(x2) = a2, d2(x3) = a1), would be

R(θ1, d2) = 1(0.6) + 1(0.3) + 0(0.1) = 0.9
R(θ2, d2) = 5(0.1) + 5(0.4) + 6(0.5) = 5.5,

or, grouping the terms by action,

R(θ1, d2) = 0.9 L(θ1, a2) + 0.1 L(θ1, a1)
R(θ2, d2) = 0.5 L(θ2, a2) + 0.5 L(θ2, a1).

For each fixed state of nature θ, the risk is thus a convex combination of the losses of the pure actions a1 and a2, with weights P_θ(d2(X) = a2) and P_θ(d2(X) = a1); note that the weights depend on θ.
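The arithmetic above can be verified directly; a sketch with the loss and density tables of Example 5.3.2 hard-coded (the state labels t1, t2 stand in for θ1, θ2):

```python
loss = {("t1", "a1"): 0, ("t1", "a2"): 1,
        ("t2", "a1"): 6, ("t2", "a2"): 5}
density = {("t1", "x1"): 0.6, ("t1", "x2"): 0.3, ("t1", "x3"): 0.1,
           ("t2", "x1"): 0.1, ("t2", "x2"): 0.4, ("t2", "x3"): 0.5}
d2 = {"x1": "a2", "x2": "a2", "x3": "a1"}

for theta in ("t1", "t2"):
    r = sum(loss[(theta, d2[x])] * density[(theta, x)] for x in d2)
    w_a2 = sum(density[(theta, x)] for x in d2 if d2[x] == "a2")
    print(theta, r, w_a2)  # risks 0.9 and 5.5; weights on a2 are 0.9 and 0.5
```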
Example 5.4.1
Consider the decision problem stated in Example 5.3.1. The
risk set and the risk points of the nonrandomized decision
rules are reproduced here for easy reference.
            d1   d2   d3   d4
R(θ1, d)    4    2.4  1.6   0
R(θ2, d)    0    3.2  0.8   4
Example 5.4.2
The risk points of the decision rules considered in Example 5.4.1, together with each rule's worst-case risk, are as follows:

                 d1   d2   d3   d4
R(θ1, d)         4    2.4  1.6   0
R(θ2, d)         0    3.2  0.8   4
max_θ R(θ, d)    4    3.2  1.6   4
So d3 is the minimax decision rule.
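The minimax choice can be mechanized: compute each rule's worst-case risk, then take the rule with the smallest one. A sketch using the risk table above:

```python
# Risk points from Example 5.4.2: rule -> (R(theta1, d), R(theta2, d)).
risks = {"d1": (4, 0), "d2": (2.4, 3.2), "d3": (1.6, 0.8), "d4": (0, 4)}

worst = {d: max(r) for d, r in risks.items()}  # worst-case risk of each rule
print(worst)                      # {'d1': 4, 'd2': 3.2, 'd3': 1.6, 'd4': 4}
print(min(worst, key=worst.get))  # 'd3', the minimax rule
```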
Note that there are two ways to introduce the idea of regret. It is either
(a) applying it to the initial loss function, before the data are brought in, or
(b) applying it to the risk function.
Example 5.4.3
Reconsider the statistical decision problem in Example 5.3.2.
The risk functions of the nonrandomized decision rules were
tabulated as follows:
            d1   d2   d3   d4   d5   d6   d7   d8
R(θ1, d)   1.0  0.9  0.7  0.4  0.6  0.3  0.1  0.0
R(θ2, d)   5.0  5.5  5.4  5.1  5.9  5.6  5.5  6.0
Consider the randomized decision rule p̃ that selects d4 with probability p and d7 with probability 1 − p:

        d1   d2   d3   d4   d5   d6   d7    d8
p̃ =     0    0    0    p    0    0   1−p    0

or, simply denoted,

p̃ = (0, 0, 0, p, 0, 0, 1−p, 0),

where p is chosen so that the expected regrets under the two states are equal:

E[Lr(θ1, p̃)] = E[Lr(θ2, p̃)].
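Solving for p makes the step explicit. Taking regret row-wise from the risk table (subtracting the smallest risk in each row: 0.0 for θ1, attained by d8, and 5.0 for θ2, attained by d1), d4 has regrets (0.4, 0.1) and d7 has regrets (0.1, 0.5), so the equalizer condition reads

$$0.4\,p + 0.1\,(1-p) \;=\; 0.1\,p + 0.5\,(1-p) \quad\Longrightarrow\quad 0.7\,p = 0.4 \quad\Longrightarrow\quad p = \tfrac{4}{7}.$$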
Bayes Principle
Another scheme for ordering decision rules is to assign prior probabilities π(θ) to the various states of nature and to determine the average risk over these states.
Posterior Distribution
The Bayes approach to selecting an optimal decision rule involves the assumption that the state of nature is a random variable Θ with probability function π(θ). The probability function of the observed random variable X is then regarded as the conditional distribution given that Θ = θ, written P(X = x | Θ = θ) or simply f(x | θ). We shall denote the conditional distribution of Θ given that X = x by π(θ | x). Note that

π(θ | x) P(X = x) = P(X = x | Θ = θ) π(θ).    (*)
In fact, both sides of the above equation represent the joint probability of X and Θ. We call the conditional distribution π(θ | x) the posterior distribution of Θ given that X = x. Since π(θ | x) is considered as a function of θ while x is held fixed, π(θ | x) is proportional to the product P(X = x | Θ = θ) π(θ), and we write

π(θ | x) ∝ P(X = x | Θ = θ) π(θ).    (**)
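A minimal sketch of update (**) in code: multiply the prior by the likelihood and renormalize. The dictionary representation of the prior and the (x, theta) argument order of the likelihood are conventions of this sketch:

```python
def posterior(prior, likelihood, x):
    """pi(theta | x) is proportional to f(x | theta) * pi(theta)."""
    unnorm = {theta: likelihood(x, theta) * p for theta, p in prior.items()}
    marginal = sum(unnorm.values())  # P(X = x), the normalizing constant
    return {theta: u / marginal for theta, u in unnorm.items()}
```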
Example 5.4.4
Suppose that the conditional probabilities of X given Θ = θ are

P(X = x1 | Θ = θ1) = 1/4 = 1 − P(X = x2 | Θ = θ1)
P(X = x1 | Θ = θ2) = 2/3 = 1 − P(X = x2 | Θ = θ2),

and suppose the prior probability function of Θ is given by

P(Θ = θ1) = π(θ1) = w
P(Θ = θ2) = π(θ2) = 1 − w,    0 ≤ w ≤ 1.

For X = x1,

π(θ1 | x1) = (w/4) / (w/4 + 2(1 − w)/3) = 3w / (8 − 5w),
π(θ2 | x1) = 1 − π(θ1 | x1) = (8 − 8w) / (8 − 5w).
For X = x2,

π(θ1 | x2) = (3w/4) / (3w/4 + (1 − w)/3) = 9w / (4 + 5w),
π(θ2 | x2) = 4(1 − w) / (4 + 5w).
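As a quick numeric check of the x1 formula, one can plug this example's numbers into the posterior sketch above (w = 0.5 is an arbitrary test value; t1, t2 again stand in for θ1, θ2):

```python
lik = {("x1", "t1"): 1/4, ("x2", "t1"): 3/4,
       ("x1", "t2"): 2/3, ("x2", "t2"): 1/3}
w = 0.5
post = posterior({"t1": w, "t2": 1 - w}, lambda x, th: lik[(x, th)], "x1")
print(post["t1"], 3 * w / (8 - 5 * w))  # both 0.2727...
```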
Successive Observations
If an observation can alter prior odds to posterior odds, it would seem that a further observation, applied to the first posterior distribution as though it were a prior, should result in yet a new posterior distribution:

π2(θ | x2) ∝ f2(x2 | θ) π1(θ | x1)
          ∝ f2(x2 | θ) f1(x1 | θ) π(θ).

Since f1(x1 | θ) f2(x2 | θ) is the joint density function of X1 and X2 given θ, this shows that π2(θ | x2) is the posterior density of θ given the observed vector (x1, x2).
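A sketch showing that two successive updates match a single update with the product likelihood (conditional independence of the observations given θ is assumed, as in the display above):

```python
def update(prior, lik, x):
    """One Bayes step: prior -> posterior, given observation x."""
    unnorm = {th: lik(x, th) * p for th, p in prior.items()}
    z = sum(unnorm.values())
    return {th: u / z for th, u in unnorm.items()}

# Illustrative two-state likelihood; the numbers are assumptions of this sketch.
table = {("t1", 1): 1/4, ("t1", 0): 3/4, ("t2", 1): 2/3, ("t2", 0): 1/3}
lik = lambda x, th: table[(th, x)]

prior = {"t1": 0.5, "t2": 0.5}
two_steps = update(update(prior, lik, 1), lik, 0)  # observe x1 = 1, then x2 = 0
one_step = update(prior, lambda _, th: lik(1, th) * lik(0, th), None)
print(two_steps)
print(one_step)  # same posterior: sequential updating equals the joint update
```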
The Bayes risk of a decision rule d with respect to the prior π is

r(π, d) = E[R(Θ, d)] = Σ_θ R(θ, d) π(θ).
Now

r(π, d) = Σ_θ Σ_x L(θ, d(x)) f(x | θ) π(θ)
        = Σ_x [ Σ_θ L(θ, d(x)) π(θ | x) ] f(x),    (*)

where f(x) represents the marginal density function of X. It follows from (*) that a Bayes decision rule d0 can be constructed pointwise: for each observed value x,

Σ_θ L(θ, d0(x)) π(θ | x) ≤ Σ_θ L(θ, d(x)) π(θ | x)    for all d ∈ D.
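A sketch of this pointwise construction, assuming the posterior π(· | x) has already been computed for each x and is passed in as a dict of dicts (all container conventions here are assumptions of the sketch):

```python
def bayes_rule(actions, states, loss, post):
    """For each observed x, pick the action minimizing the posterior
    expected loss: sum over theta of L(theta, a) * pi(theta | x)."""
    return {x: min(actions,
                   key=lambda a: sum(loss[(th, a)] * pi[th] for th in states))
            for x, pi in post.items()}
```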
Example 5.4.5
Consider a decision problem with loss matrix

        a1   a2
θ1       0    8
θ2       4    0

Suppose that the statistician can observe a random variable X with the following conditional distributions:

P(X = 0 | Θ = θ1) = 3/4,    P(X = 0 | Θ = θ2) = 1/3
P(X = 1 | Θ = θ1) = 1/4,    P(X = 1 | Θ = θ2) = 2/3.

It is required to construct a Bayes decision rule against the following prior distribution of Θ:
π : P(Θ = θ1) = w,    P(Θ = θ2) = 1 − w,    0 ≤ w ≤ 1.
For x = 0,

π(θ1 | 0) ∝ P(X = 0 | Θ = θ1) P(Θ = θ1) = 3w/4
π(θ2 | 0) ∝ P(X = 0 | Θ = θ2) P(Θ = θ2) = (1 − w)/3.

This implies that

π(θ1 | 0) = (3w/4) / (3w/4 + (1 − w)/3) = 9w / (9w + 4(1 − w))
π(θ2 | 0) = 4(1 − w) / (9w + 4(1 − w)).

The posterior expected losses of the two actions are

Σ_θ L(θ, a1) π(θ | 0) = 0·π(θ1 | 0) + 4·π(θ2 | 0) = 16(1 − w) / (9w + 4(1 − w))
Σ_θ L(θ, a2) π(θ | 0) = 8·π(θ1 | 0) + 0·π(θ2 | 0) = 72w / (9w + 4(1 − w)).

Therefore d(0) = a1 ⟺ 16(1 − w) ≤ 72w ⟺ w ≥ 2/11.
Similarly, for x = 1,

π(θ1 | 1) ∝ P(X = 1 | Θ = θ1) P(Θ = θ1) = w/4
π(θ2 | 1) ∝ P(X = 1 | Θ = θ2) P(Θ = θ2) = 2(1 − w)/3,

so that

π(θ1 | 1) = (w/4) / (w/4 + 2(1 − w)/3) = 3w / (3w + 8(1 − w))
π(θ2 | 1) = 8(1 − w) / (3w + 8(1 − w)).

The posterior expected losses are 32(1 − w)/(3w + 8(1 − w)) for a1 and 24w/(3w + 8(1 − w)) for a2.

Therefore d(1) = a1 ⟺ 32(1 − w) ≤ 24w ⟺ w ≥ 4/7.
Conclusion: The Bayes decision rule is

          d1,   0 ≤ w ≤ 2/11
d0(x) =   d3,   2/11 ≤ w ≤ 4/7
          d4,   4/7 ≤ w ≤ 1,

where d1 always takes a2, d3 takes a1 at x = 0 and a2 at x = 1, and d4 always takes a1.
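The three regimes can be reproduced by scanning w. A sketch with this example's numbers hard-coded (t1, t2 stand in for θ1, θ2; the normalizing constant cancels when comparing the two actions, so it is skipped):

```python
loss = {("t1", "a1"): 0, ("t1", "a2"): 8, ("t2", "a1"): 4, ("t2", "a2"): 0}
lik = {(0, "t1"): 3/4, (1, "t1"): 1/4, (0, "t2"): 1/3, (1, "t2"): 2/3}

def bayes_action(x, w):
    post = {"t1": lik[(x, "t1")] * w, "t2": lik[(x, "t2")] * (1 - w)}
    return min(("a1", "a2"),
               key=lambda a: sum(loss[(th, a)] * post[th] for th in post))

for w in (0.1, 0.3, 0.7):  # below 2/11, between 2/11 and 4/7, above 4/7
    print(w, bayes_action(0, w), bayes_action(1, w))
# 0.1 -> (a2, a2) = d1;  0.3 -> (a1, a2) = d3;  0.7 -> (a1, a1) = d4
```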
            d1   d2   d3   d4
R(θ1, d)    8    6    2    0
R(θ2, d)    0   8/3  4/3   4
Example 5.4.6
In the above example (Example 5.4.5), the (randomized) decision rule with constant risk is a mixture of d3 and d4. The slope of the line joining the risk points of d3 and d4 is

m = (4 − 4/3) / (0 − 2) = −4/3.

Hence the prior vector π = (w, 1 − w) perpendicular to the vector joining the risk points of d3 and d4 satisfies

(1 − w)/w = 3/4,  or  w = 4/7.

This implies that the randomized decision rule d* with constant risk is Bayes against the prior distribution

π : P(Θ = θ1) = 4/7  and  P(Θ = θ2) = 3/7.
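As a quick check, at w = 4/7 the rules d3 and d4 have equal Bayes risk, which is exactly why every mixture of them, including the constant-risk rule d*, is Bayes against this prior:

$$r(\pi, d_3) = \tfrac{4}{7}(2) + \tfrac{3}{7}\left(\tfrac{4}{3}\right) = \tfrac{12}{7}, \qquad r(\pi, d_4) = \tfrac{4}{7}(0) + \tfrac{3}{7}(4) = \tfrac{12}{7}.$$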
5.5 Sufficiency
It is common practice for statisticians, when confronted with a mass of data, to compute some simple measure from the data and then base statistical procedures on this simpler quantity. Computing such a simpler measure is called reducing the data; the measures themselves are called statistics.
A question arises naturally: how much reduction of the data is possible without losing information regarding the state of nature?
Sufficient Statistic
A statistic T = t(X̃) is said to be sufficient for a family of density functions { f(· | θ) : θ ∈ Θ } if π(θ | x̃1) = π(θ | x̃2) for any prior distribution of Θ and any data x̃1 and x̃2 of the same size from the family with t(x̃1) = t(x̃2).

Factorization Theorem
Suppose that f(x̃ | θ) represents the joint density function of the observed random vector X̃. The statistic T = t(X̃) is sufficient for θ if and only if

f(x̃ | θ) = g(t(x̃); θ) h(x̃),

where g depends on x̃ only through t(x̃) and h does not depend on θ.

Equivalently, the statistic T = t(X̃) is sufficient for the family of density functions { f(· ; θ) : θ ∈ Θ } if the conditional distribution of X̃ given T = t does not depend on θ.
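For instance, for a Bernoulli sample x̃ = (x1, ..., xn) the factorization is immediate, with h ≡ 1 (this anticipates Example 5.5.1):

$$f(\tilde{x} \mid \theta) \;=\; \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} \;=\; \underbrace{\theta^{\,t}(1-\theta)^{\,n-t}}_{g(t(\tilde{x});\,\theta)} \cdot \underbrace{1}_{h(\tilde{x})}, \qquad t(\tilde{x}) = \sum_{i=1}^{n} x_i,$$

so T = X1 + ... + Xn is sufficient for θ.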
Example 5.5.1
Consider a decision problem (Θ, A, L) in which A = {a1, a2}. Let X1 and X2 be a random sample of size 2 from a Bernoulli distribution with parameter θ. Then T = X1 + X2 is sufficient for θ. Moreover, T is a binomial random variable with parameters (2, θ).

Consider the randomized decision rule δ based on T given by

         a1,   t = 0
δ(t) =
         a2,   t = 2

and, at t = 1, by tossing a fair coin:

          a1,   if a head occurs
δ(1) =
          a2,   if a tail occurs.

Its risk function is

R(θ, δ) = E[L(θ, δ(T))]
        = L(θ, a1) P(T = 0) + E[L(θ, δ(1))] P(T = 1) + L(θ, a2) P(T = 2)
        = L(θ, a1)(1 − θ)² + (1/2)[L(θ, a1) + L(θ, a2)]·2θ(1 − θ) + L(θ, a2)θ²
        = L(θ, a1){(1 − θ)² + θ(1 − θ)} + L(θ, a2){θ(1 − θ) + θ²}
        = L(θ, a1){P(X̃ = (0,0)) + P(X̃ = (0,1))} + L(θ, a2){P(X̃ = (1,1)) + P(X̃ = (1,0))}
        = R(θ, d),

where d is the nonrandomized decision rule based on the full sample that takes a1 on (0,0) and (0,1) and a2 on (1,0) and (1,1). Thus the randomized rule based on the sufficient statistic T achieves exactly the same risk function as d.
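The identity can be checked numerically. A sketch with an arbitrary θ and illustrative loss values (both chosen only to exercise the algebra, not taken from the text):

```python
theta = 0.3
L = {"a1": 1.0, "a2": 2.5}  # illustrative losses L(theta, a) at this fixed theta

# Risk of the randomized rule based on T = X1 + X2 (fair coin at T = 1).
p_t = {0: (1 - theta) ** 2, 1: 2 * theta * (1 - theta), 2: theta ** 2}
r_delta = (L["a1"] * p_t[0]
           + 0.5 * (L["a1"] + L["a2"]) * p_t[1]
           + L["a2"] * p_t[2])

# Risk of the nonrandomized rule d on the full sample:
# a1 on (0,0) and (0,1), a2 on (1,0) and (1,1).
def p(x1, x2):
    return (theta if x1 else 1 - theta) * (theta if x2 else 1 - theta)

r_d = L["a1"] * (p(0, 0) + p(0, 1)) + L["a2"] * (p(1, 0) + p(1, 1))
print(r_delta, r_d)  # equal: the rule based on T loses nothing
```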