0 Bewertungen0% fanden dieses Dokument nützlich (0 Abstimmungen)

6 Ansichten67 Seiten© © All Rights Reserved

PDF, TXT oder online auf Scribd lesen

© All Rights Reserved

Als PDF, TXT **herunterladen** oder online auf Scribd lesen

0 Bewertungen0% fanden dieses Dokument nützlich (0 Abstimmungen)

6 Ansichten67 Seiten© All Rights Reserved

Als PDF, TXT **herunterladen** oder online auf Scribd lesen

Sie sind auf Seite 1von 67

Reading:

Ch. 3 in Kay-II.

see http://www.ece.rice.edu/~dhj/courses/elec531/notes5.pdf.

Ch. 10 in Wasserman.

Introduction to Detection Theory (cont.)

measurements. Statistical tools enable systematic solutions

and optimal design.

Communications,

Biomedicine, etc.

Example: Radar Detection. We wish to decide on the

presence or absence of a target.

Introduction to Detection Theory

We assume a parametric measurement model p(x | ) [or

p(x ; ), which is the notation that we sometimes use in the

classical setting].

In point estimation theory, we estimated the parameter

given the data x.

Suppose now that we choose 0 and 1 that form a partition

of the parameter space :

0 1 = , 0 1 = .

(i.e. make the appropriate decision):

H0 : 0, null hypothesis

H1 : 1, alternative hypothesis.

they are composite.

(0, ).

The Decision Rule

1, decide H1,

(x) =

0, decide H0.

into two regions:

Z

PFA = E x | [(X) | ] = p(x | ) dx for in 0

X1

Z

PM = E x | [1 (X) | ] = 1 p(x | ) dx

X1

Z

= p(x | ) dx for in 1.

X0

Z

PD = 1PM = E x | [(X) | ] = p(x | ) dx for in 1.

X1

Note: PFA and PD/PM are generally functions of the parameter

(where 0 when computing PFA and 1 when

computing PD/PM).

terminology:

Bayesian Decision-theoretic Detection Theory

loss: Z

(action | x) = L(, action) p( | x) d

that we introduced in handout # 4 when we discussed Bayesian

decision theory. Let us now apply this theory to our easy

example discussed here: hypothesis testing, where our action

space consists of only two choices. We first assign a loss table:

x X1 L(1|1) = 0 L(1|0)

x X0 L(0|1) L(0|0) = 0

L(declared | true):

typically set to zero in real life. Here, we adopt zero losses

for correct decisions.

Now, our posterior expected loss takes two values:

Z

0(x) = L(0 | 1) p( | x) d

1

Z

+ L(0 | 0) p( | x) d

0

| {z }

0

Z

= L(0 | 1) p( | x) d

1

Z

L(0 | 1) is constant

= L(0 | 1) p( | x) d

| 1 {z }

P [1 | x]

and, similarly,

Z

1(x) = L(1 | 0) p( | x) d

0

Z

L(1 | 0) is constant

= L(1 | 0) p( | x) d .

| 0 {z }

P [0 | x]

the posterior expected loss; this rule corresponds to choosing

the data-space partitioning as follows:

X1 = {x : 1(x) 0(x)}

or

P [1 | x]

zZ }| {

( p( | x) d )

L(1 | 0)

X1 = x : Z 1 (1)

L(0 | 1)

p( | x) d

| 0 {z }

P [0 | x]

or, equivalently, upon applying the Bayes rule:

( R )

1

p(x | ) () d L(1 | 0)

X1 = x : R . (2)

p(x | ) () d L(0 | 1)

0

decision rule true state 1 0

x X1 L(1|1) = 0 L(1|0) = 1

x X0 L(0|1) = 1 L(0|0) = 0

a posteriori (MAP) rule:

n P [ 1 | x] o

X1 = x : 1 (3)

P [ 0 | x]

( R )

1

p(x | ) () d

X1 = x : R 1 . (4)

0

p(x | ) () d

Simple hypotheses. Let us specialize (1) to the case of simple

hypotheses (0 = {0}, 1 = {1}):

( )

p(1 | x) L(1 | 0)

X1 = x : . (5)

p(0 | x) L(0 | 1)

| {z }

posterior-odds ratio

( )

p(x | 1) 0 L(1 | 0)

X1 = x : (6)

p(x | ) 1 L(0 | 1)

| {z 0}

likelihood ratio

where

0 = (0), 1 = (1) = 1 0

describe the prior probability mass function (pmf) of the binary

random variable (recall that {0, 1}). Hence, for binary

simple hypotheses, the prior pmf of is the Bernoulli pmf.

Preposterior (Bayes) Risk

Z Z

E x, [loss] = L(1 | 0) p(x | )() d dx

X1 0

Z Z

+ L(0 | 1) p(x | )() d dx.

X0 1

risk?

Z Z

L(1 | 0) p(x | )() d dx

X1 0

Z Z

+ L(0 | 1) p(x | )() d dx

X0 1

Z Z

= L(1 | 0) p(x | )() d dx

X1 0

Z Z

L(0 | 1) p(x | )() d dx

X1 1

Z Z

+ L(0 | 1) p(x | )() d dx

X0 1

Z Z

+ L(0 | 1) p(x | )() d dx

X1 1

= const

| {z }

not dependent on (x)

Z n Z

+ L(1 | 0) p(x | )() d

X1 0

Z o

L(0 | 1) p(x | )() d dx

1

n Z Z o

X1 : L(1 | 0) p(x | )() dL(0 | 1) p(x | )() < 0

0 1

which, as expected, is the same as (2), since

minimizing the preposterior risk for every x

0-1 loss: For the 0-1 loss, i.e. L(1|0) = L(0|1) = 1, the

preposterior (Bayes) risk for rule (x) is

Z Z

E x, [loss] = p(x | )() d dx

X1 0

Z Z

+ p(x | )() d dx (7)

X0 1

performed over the joint probability density or mass function

(pdf/pmf) or the data x and parameters .

Bayesian Decision-theoretic Detection for

Simple Hypotheses

p(x | 1) H1 0 L(1|0)

(x) = (8)

| {z } p(x | 0) 1 L(0|1)

likelihood ratio

see also Ch. 3.7 in Kay-II. (Recall that (x) is the sufficient

statistic for the detection problem, see p. 37 in handout # 1.)

Equivalently,

H1

log (x) = log[p(x | 1)] log[p(x | 0)] log 0.

familiar 0-1 loss case where L(1|0) = L(0|1) = 1, we know

that the preposterior (Bayes) risk is equal to the average error

probability, see (7). This average error probability greatly

simplifies in the simple hypothesis testing case:

Z

av. error probability = L(1 | 0) p(x | 0)0 dx

X1

| {z }

1

Z

+ L(0 | 1) p(x | 1)1 dx

X0

| {z }

1

Z Z

= 0 p(x | 0) dx +1 p(x | 1) dx

| X1 {z } | X0 {z }

PFA PM

pdf/pmf of the data x and parameters , and

In this case, our Bayes decision rule simplifies to the MAP rule

(as expected, see (5) and Ch. 3.6 in Kay-II):

p(1 | x) H1

1 (9)

p(0 | x)

| {z }

posterior-odds ratio

p(x | 1) H1

0

. (10)

p(x | 0) 1

| {z }

likelihood ratio

EE 527, Detection and Estimation Theory, # 5 15

which is the same as

(4), upon substituting the Bernoulli pmf as the prior pmf for

and

Bayesian Decision-theoretic Detection Theory:

Handling Nuisance Parameters

parameters (, say) out!

H0 : 0 versus

H1 : 1

Z

p | x( | x) = p, | x(, | x) d

and, therefore,

p( | x)

zZ }| {

R

p, | x(, | x) d d H1

1 L(1 | 0)

Z (11)

R L(0 | 1)

0

p, | x(, | x) d d

| {z }

p( | x)

or, equivalently, upon applying the Bayes rule:

R R

1

px | ,(x | , ) ,(, ) d d H1 L(1 | 0)

R R . (12)

0

px | ,(x | , ) ,(, ) d d L(0 | 1)

,(, ) = () () (13)

p(x | )

zZ }| {

R

() px | ,(x | , ) () d d H1

1 L(1 | 0)

Z . (14)

R L(0 | 1)

0

() px | ,(x | , ) () d d

| {z }

p(x | )

us specialize (11) to the simple hypotheses (0 = {0}, 1 =

{1}): R

p, | x(1, | x) d H1 L(1 | 0)

R . (15)

p, | x(0, | x) d L(0 | 1)

Now, if and are independent a priori, i.e. (13) holds, then

we can rewrite (14) [or (15) using the Bayes rule]:

same as (6)

R z }| {

px | ,(x | 1, ) () d p(x | 1) H1 0 L(1 | 0)

R = (16)

px | ,(x | 0, ) () d p(x | 0) 1 L(0 | 1)

| {z }

integrated likelihood ratio

where

0 = (0), 1 = (1) = 1 0.

Chernoff Bound on Average Error Probability

for Simple Hypotheses

Z Z

av. error probability = p(x | )() d dx

X1 0

Z Z

+ p(x | )() d dx

X0 1

n Z Z o

X1? = x : p(x | ) () d p(x | ) () d < 0 .

0 1

simple closed-form expression for the minimum average error

probability, but we can bound it as follows:

Z Z

min av. error probability = p(x | )() d dx

X1? 0

Z Z

+ p(x | )() d dx

X0? 1

using

the def.

of X1?

Z nZ Z o

= min p(x | )() d, p(x | )() d) dx

X

|{z} 0 1

data space

4 4

= q0 (x) = q1 (x)

Z h zZ }| { i h zZ

}| {i

1

p(x | )() d p(x | )() d dx

X 0 1

Z

= [q0(x)] [q1(x)]1 dx

X

probability. Here, we have used the fact that

When

x1

x2

x=

..

xN

with

distributed (i.i.d.) given and

then

N

Y

q0(x) = p(x | 0) 0 = 0 p(xn | 0)

n=1

N

Y

q1(x) = p(x | 1) 1 = 1 p(xn | 1)

n=1

yielding

Chernoff bound for N conditionally i.i.d. measurements (given ) and simple hyp.

Z h Y N i h Y N i1

= 0 p(xn | 0) 1 p(xn | 1) dx

n=1 n=1

N nZ

Y o

= 0 11 [p(xn | 0)] [p(xn | 1)]1 dxn

n=1

nZ oN

= 0 11 [p(x1 | 0)] [p(x1 | 1)]1 dx1

or, in other words,

1

log(min av. error probability) log(0 11)

N

Z

+ log [p(x1 | 0)] [p(x1 | 1)]1 dx1, [0, 1].

when evaluating average error probabilities), we can say that,

as N ,

for N cond. i.i.d. measurements (given ) and simple hypotheses

n Z o

f (N ) exp N min log [p(x1 | 0)] [p(x1 | 1)]1

[0,1]

| {z }

Chernoff information for a single observation

exponential term:

log f (N )

lim = 0.

N N

the above expression quantifies the asymptotic behavior of the

minimum average error probability.

We now give a useful result, taken from

K. Fukunaga, Introduction to Statistical Pattern Recognition,

2nd ed., San Diego, CA: Academic Press, 1990

N (2, 2). Then

Z

[p1(x)] [p2(x)]1 dx = exp[g()]

where

(1 )

g() = (2 1)T [ 1 + (1 ) 2]1(2 1)

2

h | + (1 ) | i

1 2

+ 12 log .

|1| |2|1

Probabilities of False Alarm (P FA) and

Detection (P D) for Simple Hypotheses

PDF under H0

0.2 PDF under H1

0.18

0.16

Probability Density Function

0.14

PD

0.12

0.1

0.08

0.06

P

FA

0.04

0.02

0

0 5 10 15 20 25

Test Statistic

(x)

z }| {

PFA = P [test {z > } | = 0]

| statistic

XX1

| statistic

XX0

Comments:

probabilities shrink towards zero.

(ii) As the region X1 grows (i.e. & 0), both of these

probabilities grow towards unity.

PFA and PD; in most cases, as R1 grows, PD grows more

rapidly than PFA (i.e. we better be right more often than we

are wrong).

(iv) However, the perfect case where our rule is always right

and never wrong (PD = 1 and PFA = 0) cannot occur when

the conditional pdfs/pmfs p(x | 0) and p(x | 1) overlap.

also allow for the false-alarm probability PFA to increase.

This behavior

represents the fundamental tradeoff in hypothesis testing

and detection theory and

motivates us to introduce a (classical) approach to testing

simple hypotheses, pioneered by Neyman and Pearson (to

be discussed next).

Neyman-Pearson Test for Simple Hypotheses

Setup:

H0 : = 0 versus

H1 : = 1.

PD = P [X X1 ; = 0]

constraint

PFA = P [X X1 ; = 0] = 0 .

testing composite hypotheses is much more complicated. The

Bayesian version of testing composite hypotheses is trivial (as

is everything else Bayesian, at least conceptually) and we have

already seen it.

maximize

L = PD + (PFA 0)

Z hZ i

0

= p(x ; 1) dx + p(x; 0) dx

X1 X1

Z

= [p(x ; 1) p(x ; 0)] dx 0.

X1

To maximize L, set

n p(x ; 1) o

X1 = {x : p(x ; 1)p(x ; 0) > 0} = x : > .

p(x ; 0)

p(x ; 1)

(x) = .

p(x ; 0)

Z

p(x ; 0) dx = PFA = 0 .

X1

If we increase , PFA and PD go down. Similarly, if we decrease

, PFA and PD go up. Hence, to maximize PD, choose so

that PFA is as big as possible under the constraint.

Two useful ways for determining the threshold that

achieves a specified false-alarm rate:

Z

p(x ; 0) dx = PFA =

x : (x)>

or,

Z

p;0 (l ; 0) dl = .

T (x) = monotonic function((x)).

continuous function of . Some (not insightful) technical

adjustments are needed if this is not the case.

A way of handling nuisance parameters: We can utilize the

integrated (marginal) likelihood ratio (16) under the Neyman-

Pearson setup as well.

Chernoff-Stein Lemma for Bounding the Miss

Probability in Neyman-Pearson Tests of Simple

Hypotheses

D(p k q) from one pmf (p) to another (p):

X pk

D(p k q) = pk log .

qk

k

The complete proof of this lemma for the discrete (pmf) case

is given in

of Information Theory. Second ed., New York: Wiley, 2006.

Setup for the Chernoff-Stein Lemma

decision threshold to achieve a fixed PFA. Let us study the

asymptotic PM = 1 PD as the number of observations N

gets large.

decision threshold (, say) vary with N , i.e.

= N (PFA)

PM = PM() = PM N (PFA) .

Chernoff-Stein Lemma

1

lim lim log PM = D p(Xn | 0) k p(Xn | 1)

PFA 0 N N | {z }

K-L distance for a single observation

h p(Xn | 1) i

D p(Xn | 0) k p(Xn | 1) = E p(xn | 0) log

p(Xn | 0)

discrete (pmf) case

h p(x | ) i

n 1

X

= p(xn | 0) log

x

p(xn | 0)

n

conditionally i.i.d. given .

Equivalently, we can state that

h i

N +

PM f (N ) exp N D p(Xn | 0) k p(Xn | 1)

function compared with the exponential term (when PFA 0

and N ).

Detection for Simple Hypotheses: Example

(AWGN), Example 3.2 in Kay-II.

Consider

H1 : x[n] = A + w[n], n = 1, 2, . . . , N

where

2, i.e.

w[n] N (0, 2).

versus signal plus noise. It is similar to the on-off keying

scheme in communications, which gives us an idea to

rephrase it so that it fits our formulation on p. 4 (for which

we have developed all the theory so far). Here is such an

alternative formulation: consider a family of pdfs

N

1 h 1 X 2

i

p(x ; a) = p exp (x[n] a) (17)

(2 2)N 2 2 n=1

H1 : a=A (on).

p(x ; a = A)

(x) =

p(x ; a = 0)

N

exp[ 21 2 n=1(x[n] A)2]

2 N/2

P

1/(2 )

= 1

PN .

2

1/(2 ) N/2 2

exp( 22 n=1 x[n] )

our likelihood-ratio test to comparing

N

1 X 4

T (x) = x = x[n]

N n=1

(x).] If T (x) > accept H1 (i.e. reject H0), otherwise accept

H0 (well, not exactly, we will talk more about this decision on

p. 59).

the Bayes decision rule, is a function of 0 and 1. For

the Neyman-Pearson test, is chosen to achieve (control) a

desired PFA.

(corresponding to minimizing the average error

probability):

N N H1

1 X 1 X

0

log (x) = 2 (x[n] A)2 + (x[n])2

log

2 n=1 2 2 n=1 1

N H1

1 X

0

(x[n] x[n] +A) (x[n] + x[n] A) log

2 2 n=1 | {z } 1

0

N

X

H1

0

2 A x[n] A2 N 2 2 log

n=1

1

N

X AN H1 2

0

x[n] log since A > 0

n=1

2 A 1

H1 2 A

and, finally, x log(0/1) +

N

| A {z 2}

which, for the practically most interesting case of equiprobable

hypotheses

0 = 1 = 21 (18)

simplifies to H1 A

x

2

|{z}

known as the maximum-likelihood test (i.e. the Bayes decision

rule for 0-1 loss and a priori equiprobable hypotheses is defined

as the maximum-likelihood test). This maximum-likelihood

detector does not require the knowledge of the noise variance

2 to declare its decision. However, the knowledge of 2 is

key to assessing the detection performance. Interestingly, these

observations will carry over to a few maximum-likelihood tests

that we will derive in the future.

probability. First, note that X | a = 0 N (0, 2/N ) and

X | a = A N (A, 2/N ). Then

1 A A

min av. error prob. = 2 P [X > | a = 0] + 12 P [X < | a = A]

| 2

{z } | 2

{z }

PFA PM

1

h X A/2 i

= 2 P p >p ;a=0

2

/N 2

/N

| {z }

standard

normal

random var.

h X A A/2 A i

+ 21 P p < p ;a=A

2

/N 2

/N

| {z }

standard

normal

random var.

r r r

N A2 N A2 N A2

1

= 2 Q 2

+ 21 2

= Q 21

| {z4 } | {z 4 } 2

standard standard

normal complementary cdf normal cdf

since

(x) = Q(x).

The minimum probability of error decreases monotonically with

N A2/ 2, which is known as the deflection coefficient. In this

A2

case, the Chernoff information is equal to 8 2 (see Lemma 1):

nZ o

min log [N (0, 2/N )] [N (A, 2/N )]1 dx

[0,1] | {z } | {z }

p0 (x) p1 (x)

h (1 ) A2 i A2

= min 2 =

[0,1] 2 8 2

A2

N +

min av. error probability f (N )exp N 2 (19)

8

where f (N ) varies slowly with N when N is large, compared

with the exponential term:

log f (N )

lim = 0.

N N

Indeed, in our example,

r

N A2 1 N A2

N +

Q 2

q exp 2

4 NA 2

2 8

4 2

| {z }

f (N )

where we have used the asymptotic formula

1

x+

Q(x) exp(x2/2) (20)

x 2

Known DC Level in AWGN:

Neyman-Pearson Approach

known DC level in AWGN. Recall that our likelihood-ratio

test compares

N 1

4 1 X

T (x) = x = x[n]

N n=0

with a threshold .

PFA = P [T (X) > ; a = 0] = Q p

2/N

and we obtain the decision threshold as follows:

r

2

= Q1(PFA).

N

A

PD = P (T (X) > | a = A) = Q p

2/N

s !

1 A2

= Q Q (PFA)

2/N

v !

= Q Q1(PFA) u N A2/ 2

u

.

t | {z }

4

= SNR = d2

PD depends only on the deflection coefficient:

d = 2 =

var[T (X | a = 0)]

ratio (SNR).

Receiver Operating Characteristics (ROC)

1

PD = Q Q (PFA) d .

Comments:

better by flipping a coin.

Typical Ways of Depicting the Detection

Performance Under the Neyman-Pearson Setup

examine two relationships:

operating characteristics (ROC).

Asymptotic (as N and P FA & 0)

P D and P M for a Known DC Level in AWGN

We apply the Chernoff-Stein lemma, for which we need to

compute the following K-L distance:

D p(Xn | a = 0) k p(Xn | a = A)

n h p(X | a = A) io

n

= E p(xn | a=0) log

p(Xn | a = 0)

where

p(xn | a = 0) = N (0, 2)

h p(x | a = A) i

n (xn A)2 x2n 1 2

log = 2

+ 2

= 2

(2 A xn A ).

p(xn | a = 0) 2 2 2

Therefore,

A2

D p(Xn | 0) k p(Xn | 1) =

2 2

and the Chernoff-Stein lemma predicts the following behavior

of the detection probability as N and PFA 0:

A2

PD 1 f (N ) exp N

| {z 2 2 }

PM

where f (N ) is a slowly-varying function of N compared with

the exponential term. In this case, the exact expression for PM

(PD) is available and consistent with the Chernoff-Stein lemma:

p

1

PM = 1 Q Q (PFA) N A2/ 2

p

1

= Q N A2/ 2 Q (PFA)

N + 1

p

2 2 1

[ N A / Q (PFA)] 2

n p o

exp [ N A2/ 2 Q1(PFA)]2/2

1

= p

2 2 1

[ N A / Q (PFA)] 2

n

exp N A2/(2 2) [Q1(PFA)]2/2

p o

1

+Q (PFA) N A2/ 2

1

= p

2 2 1

[ N A / Q (PFA)] 2

n p o

exp [Q1(PFA)]2/2 + Q1(PFA) N A2/ 2

| {z }

as predicted by the Chernoff-Stein lemma

is small and N is large, the first two (green) terms in the

above expression make a slowly-varying function f (N ) of N .

Note that the exponential term in (21) does not depend on

the false-alarm probability PFA (or, equivalently, on the choice

of the decision threshold). The Chernoff-Stein lemma asserts

that this is not a coincidence.

A2/(2 2)

of the exponential decay of the minimum average error

probability:

A2/(8 2).

Decentralized Neyman-Pearson Detection for

Simple Hypotheses

sensors (nodes) follow the same marginal probabilistic model:

Hi : xn p(xn | i)

observation xn and sends it to the headquarters (fusion center),

which collects all the local decisions and makes the final global

decision about H0 or H1. This structure is clearly suboptimal

it is easy to construct a better decision strategy in which

each node sends its (quantized, in practice) likelihood ratio to

the fusion center, rather than the decision only. However, such

a strategy would have a higher communication (energy) cost.

Suppose that each node n makes a local decision dn

{0, 1}, n = 1, 2, . . . , N and transmits it to the fusion center.

Then, the fusion center makes the global decision based on

the likelihood ratio formed from the dns. The simplest

fusion scheme is based on the assumption that the dns are

conditionally independent given (which may not always be

reasonable, but leads to an easy solution). We can now write

p(dn | 1) = PDd,n

n

(1 PD,n)1dn

| {z }

Bernoulli pmf

Similarly,

dn 1dn

p(dn | 0) = PFA ,n (1 PFA,n )

| {z }

Bernoulli pmf

probability. Now,

N

h p(d | ) i

n 1

X

log (d) = log

n=1

p(dn | 0

N h P dn (1 P )1dn i H1

D,n D,n

X

= log dn log .

1d

PFA,n (1 PFA,n) n

n=1

have identical performance:

N

X

u1 = dn

n=1

P 1 P H1

D D

log (d) = u1 log + (N u1) log log

PFA 1 PFA

or

h P (1 P ) i H1 1 P

D FA FA

u1 log log + N log . (22)

PFA (1 PD) 1 PD

Clearly, each nodes local decision dn is meaningful only if

PD > PFA, which implies

PD (1 PFA)

>1

PFA (1 PD)

rule (22) further simplifies to

H1

u1 0 .

is easy: the random variable U1 is binomial given (i.e.

conditional on the hypothesis) and, therefore,

N

P [U1 = u1] = pu1 (1 p)N u1

u1

global false-alarm probability is

N

0

X N u1

PFA,global = P [U1 > | 0] = PFA (1PFA)N u1 .

0

u1

u1 =d e

therefore, we should choose our PFA,global specification from

the available discrete choices. But, if none of the candidate

choices is acceptable, this means that our current system does

not satisfy the requirements and a remedial action is needed,

e.g. increasing the quantity (N ) or improving the quality of

sensors (through changing PD and PFA), or both.

Testing Multiple Hypotheses

Suppose now that we choose 0, 1, . . . , M 1 that form a

partition of the parameter space :

0 1 M 1 = , i j = i 6= j.

which hypothesis is true:

H0 : 0 versus

H1 : 1 versus

.. versus

HM 1 : M 1

design a decision rule : X (0, 1, . . . , M 1):

0, decide H0,

1, decide H1,

(x) = ..

M 1, decide HM 1

into M regions:

We specify the loss function using L(i | m), where, typically,

the losses due to correct decisions are set to zero:

L(i | i) = 0, i = 0, 1, . . . , M 1.

posterior expected loss takes M values:

M

X 1 Z

m(x) = L(m | i) p( | x) d

i=0 i

M

X 1 Z

= L(m | i) p( | x) d, m = 0, 1, . . . , M 1.

i=0 i

data-space partitioning:

?

Xm = {x : m(x) = min l(x)}, m = 0, 1, . . . , M 1

0lM 1

n M

X 1 Z o

?

Xm = x : m = arg min L(l | i) p(x | ) () d

0lM 1 i

i=0

n M

X 1 Z o

= x : m = arg min L(l | i) p(x | ) () d .

0lM 1 i

i=0

The preposterior (Bayes) risk for rule (x) is

M

X 1 M

X 1 Z Z

E x, [loss] = L(m | i) p(x | )() d dx

m=0 i=0 Xm i

M

X 1 Z M

X 1 Z

= L(m | i) p(x | )() d dx.

m=0 Xm i=0 i

| {z }

hm ()

hM

X 1 Z i hM

X 1 Z i

hm(x) dx hm(x) dx 0

m=0 Xm m=0 Xm ?

preposterior (Bayes) risk.

Special Case: L(i | i) = 0 and L(m | i) = 1 for i 6= m (0-1

loss), implying that m(x) can be written as

M

X 1 Z

m(x) = p( | x) d

i=0, i6=m i

Z

= const

| {z } p( | x) d

m

not a function of m

and

n Z o

?

Xm = x : m = arg max p( | x) d

0lM 1 l

n o

= x : m = arg max P [ l | x] (23)

0lM 1

hypotheses (m = {m}, m = 0, 1, . . . , M 1):

n o

?

Xm = x : m = arg max p(l | x)

0lM 1

or, equivalently,

n o

?

Xm = x : m = arg max [l p(x | l)] , m = 0, 1, . . . , M 1

0lM 1

where

0 = (0), . . . , M 1 = (M 1)

define the prior pmf of the M -ary discrete random variable

(recall that {0, 1, . . . , M 1}). If i, i = 0, 1, . . . , M 1

are all equal:

1

0 = 1 = = M 1 =

M

n h1 io

?

Xm = x : m = arg max p(x | l)

0lM 1 M

n o

= x : m = arg max p(x | l) (24)

0lM 1 | {z }

likelihood

inspecting (24) and noting that the computation of the optimal

?

decision region Xm requires the maximization of the likelihood

p(x | ) with respect to the parameter {0, 1, . . . , M 1}.

Summary: Bayesian Decision Approach versus

Neyman-Pearson Approach

for applications where the null hypothesis can be formulated

as absence of signal or perhaps, absence of statistical

difference between two data sets (treatment versus placebo,

say).

treated very differently from the alternative. (If the null

hypothesis is true, we wish to control the false-alarm rate,

which is different from our desire to maximize the probability

of detection when the alternative is true). Consequently, our

decisions should also be treated differently. If the likelihood

ratio is large enough, we decide to accept H1 (or reject

H0). However, if the likelihood ratio is not large enough,

we decide not to reject H0 because, in this case, it may be

that either

(i) H0 is true or

(ii) H0 is false but the test has low detection probability

(power) (e.g. because the signal level is small compared

with noise or we collected too small number of

observations).

The Bayesian decision framework is suitable for

communications applications as it can easily handle multiple

hypotheses (unlike the Neyman-Pearson framework).

0-1 loss: In communications applications, we typically

select a 0-1 loss, implying that all hypotheses are treated

equally (i.e. we could change the roles of null and

alternative hypotheses without any problems). Therefore,

interpretations of our decisions are also straightforward.

Furthermore, in this case, the Bayes decision rule is

also optimal in terms of minimizing the average error

probability, which is one of the most popular performance

criteria in communications.

P Values

Instead, we could vary PFA and examine how our report would

change.

0

be accepted for PFA > PFA. Therefore, there exists a smallest

PFA at which H1 is accepted. This motivates the introduction

of the p value.

hypotheses), here is a definition of a size of a hypothesis

test.

is defined as follows:

0

or equal to . Therefore, a level- test is guaranteed to have

a false-alarm probability less than or equal to .

Definition 2. Consider a Neyman-Pearson-type setup where

our test

(25)

achieves a specified size , meaning that,

= max P [x X1, | ] (composite hypotheses)

0

FA P

z }| {

= P [x X1, | = 0] (simple hypotheses).

with decision regions (25). Then, the p value for this test is

the smallest level at which we can declare H1:

H1. For example, p values less than 0.01 are considered very

strong evidence supporting H1.

There are a lot of warnings (and misconceptions) regarding p

values. Here are the most important ones.

Warning: A large p value is not strong evidence in favor of

H0; a large p value can occur for two reasons:

(i) H0 is true or

(power).

P [H0 | data] = P [ 0 | x]

probability that H0 is true.

0

In words, Theorem 1 states that

data realization X is observed yielding a value of the test

statistic T (X) that is greater than or equal to what has

actually been observed (i.e. T (x)).

to be repeated many times. This is what Bayesians criticize by

saying that data that have never been observed are used for

inference.

distribution, then, under H0 : = 0, the p value has a

uniform(0, 1) distribution. Therefore, if we declare H1 (reject

H0) when the p value is less than or equal to , the probability

of false alarm is .

draw from an uniform(0, 1) distribution. If H1 is true and if we

repeat the experiment many times, the random p values will

concentrate closer to zero.

Multiple Testing

e.g.

bioinformatics and

sensor networks.

node in a sensor network, say). This is different from testing

multiple hypotheses that we considered on pp. 5458, where

we performed a single test of multiple hypotheses. For a

sensor-network related discussion on multiple testing, see

discovery in an uncertain networked world, IEEE Signal

Processing Magazine, vol. 23, pp. 107118, Jul. 2006.

PFA = . For example, in a sensor-network setup, each node

conducts a test based on its local data.

chance of at least one falsely alarmed node is much higher,

since there are many nodes. Here, we discuss two ways to deal

with this problem.

Consider M hypothesis tests:

and denote by p1, p2, . . . , pM the p values for these tests. Then,

the Bonferroni method does the following:

pi < .

M

n

h[ i n

X

P Ai P [Ai].

i=1 i=1

# 2 in EE 420x notes.

probability of any individual false alarm is less than or equal to

.

alarm and by Ai the event that the ith node is falsely alarmed.

Then

M n M

n[ o X X

P {A} = P Ai P {Ai} = = .

i=1 i=1 i=1

M

independent and consider the smallest of the M p values.

Under H0i, pi are uniform(0, 1). Then

P {min{P1, P2, . . . , PM } > x} assuming H = (1 x)M

0i

i = 1, 2, . . . , M

min{p1, p2, . . . , pm} as

will be close to m min{p1, p2, . . . , pM }.

False Discovery Rate: Sometimes it is reasonable to control

false discovery rate (FDR), which we introduce below.

Suppose that we accept H1i for all i for which

pi < threshold

and let m0 + m1 = m be

number of true H0i hypotheses + number of true H1i

hypotheses = total number of hypotheses (nodes, say).

H0 true U V m0

H1 true T S m1

total mR R m

V /R, R > 0,

FDP =

0, R = 0

define

FDR = E [FDP].

The Benjamini-Hochberg (BH) Method

(i) Denote the ordered p values by p(1) < p(2) < p(m).

(ii) Define

i

li = and R = max{i : p(i) < li}

Cm m

where Cm is defined P

to be 1 if the p values are

m

independent and Cm = i=1(1/i) otherwise.

Hochberg) If the above BH method is applied, then, regardless

of how many null hypotheses are true and regardless of the

distribution of the p values when the null hypothesis is false,

m0

FDR = E [FDP] .

m