Contents

Foreword by David Hendry  xi
Preface  xv
Acknowledgements
List of symbols and abbreviations

Part I  Introduction

1  Econometric modelling, a preliminary view
1.1  Econometrics - a brief historical overview
1.2  Econometric modelling - a sketch of a methodology  15
1.3  Looking ahead  22

2  Descriptive study of data  23
2.1  Histograms and their numerical characteristics  23
2.2  Frequency curves  27
2.3  Looking ahead  29

Part II  Probability theory

3  Probability  33
3.1  The notion of probability  34
3.2  The axiomatic approach  37
3.3  Conditional probability  43

4  Random variables and probability distributions  47
4.1  The concept of a random variable  48
4.2  The distribution and density functions  55
4.3  The notion of a probability model  60
4.4  Some univariate distributions  62
4.5  Numerical characteristics of random variables  68

5  Random vectors and their distributions  78
5.1  Joint distribution and density functions  78
5.2  Some bivariate distributions  83
5.3  Marginal distributions  85
5.4  Conditional distributions  89

6  Functions of random variables  96
6.1  Functions of one random variable  96
6.2*  Functions of several random variables  99
6.3  Functions of normally distributed random variables, a summary  108
Looking ahead  109
Appendix 6.1 - The normal and related distributions  110

7  The general notion of expectation  116
7.1  Expectation of a function of random variables  116
7.2  Conditional expectation  121
7.3  Looking ahead  127

8*  Stochastic processes  130
8.1  The concept of a stochastic process  131
8.2  Restricting the time-heterogeneity of a stochastic process  137
8.3  Restricting the memory of a stochastic process  140
8.4  Some special stochastic processes  144
8.5  Summary  162

9  Limit theorems  165
9.1  The early limit theorems  165
9.2  The law of large numbers  168
9.3  The central limit theorem  173
9.4  Limit theorems for stochastic processes  178

10*  Introduction to asymptotic theory  183
10.1  Introduction  183
10.2  Modes of convergence  185
10.3  Convergence of moments  192
10.4  The 'big O' and 'little o' notation  194
10.5  Extending the limit theorems  198
10.6  Error bounds and asymptotic expansions  202

Part III  Statistical inference

11  The nature of statistical inference  213
11.1  Introduction  213
11.2  The sampling model  215
11.3  The frequency approach  219
11.4  An overview of statistical inference  221
11.5  Statistics and their distributions  223
Appendix 11.1 - The empirical distribution function  228

12  Estimation I - properties of estimators  231
12.1  Finite sample properties  232
12.2  Asymptotic properties  244
12.3  Predictors and their properties  247

13  Estimation II - methods  252
13.1  The method of least-squares  253
13.2  The method of moments  256
13.3  The maximum likelihood method  257

14  Hypothesis testing and confidence regions  285
14.1  Testing, definitions and concepts  285
14.2  Optimal tests  290
14.3  Constructing optimal tests  296
14.4  The likelihood ratio test procedure  299
14.5  Confidence estimation  303
14.6  Prediction  306

15*  The multivariate normal distribution  312
15.1  Multivariate distributions  312
15.2  The multivariate normal distribution  315
15.3  Quadratic forms related to the normal distribution  319
15.4  Estimation  320
15.5  Hypothesis testing and confidence regions  323

16*  Asymptotic test procedures  326
16.1  Asymptotic properties  326
16.2  The likelihood ratio and related test procedures  328

Part IV  The linear regression and related statistical models

17  Statistical models in econometrics  339
17.1  Simple statistical models  339
17.2  Economic data and the sampling model  342
17.3  Economic data and the probability model  346
17.4  The statistical generating mechanism  349
17.5  Looking ahead  352
Appendix 17.1 - Data  355

18  The Gauss linear model  357
18.1  Specification  357
18.2  Estimation  359
18.3  Hypothesis testing and confidence intervals  363
18.4  Experimental design  366
18.5  Looking ahead  367

19  The linear regression model I - specification, estimation and testing  369
19.1  Introduction  369
19.2  Specification  370
19.3  Discussion of the assumptions  375
19.4  Estimation  378
19.5  Specification testing  392
19.6  Prediction  402
19.7  The residuals  405
19.8  Summary and conclusion  408
Appendix 19.1 - A note on measurement systems  409

20  The linear regression model II - departures from the assumptions underlying the statistical GM  412
20.1  The stochastic linear regression model  413
20.2  The statistical parameters of interest  418
20.3  Weak exogeneity  422
20.4  Restrictions on the statistical parameters of interest  432
20.5  Collinearity  434
20.6  'Near' collinearity

21  The linear regression model III - departures from the assumptions underlying the probability model  443
21.1  Misspecification testing and auxiliary regressions  443
21.2  Normality  447
21.3  Linearity  457
21.4  Homoskedasticity  463
21.5  Parameter time invariance  472
21.6  Parameter structural change  481
Appendix 21.1 - Variance stabilising transformations

22  The linear regression model IV - departures from the sampling model assumption  493
22.1  Implications of a non-random sample  494
22.2  Tackling temporal dependence  503
22.3  Testing the independent sample assumption  511
22.4  Looking back  521
Appendix 22.1 - Deriving the conditional expectation  523

23  The dynamic linear regression model  526
23.1  Specification  527
23.2  Estimation  533
23.3  Misspecification testing  539
23.4  Specification testing  548
23.5  Prediction  562
23.6  Looking back  567

24  The multivariate linear regression model  571
24.1  Introduction  571
24.2  Specification and estimation  574
24.3  A priori information  579
24.4  The Zellner and Malinvaud formulations  585
24.5  Specification testing  589
24.6  Misspecification testing  596
24.7  Prediction  599
24.8  The multivariate dynamic linear regression (MDLR) model  602
Appendix 24.1 - The Wishart distribution
Appendix 24.2 - Kronecker products and matrix differentiation  603

25  The simultaneous equations model  608
25.1  Introduction  608
25.2  The multivariate linear regression and simultaneous equations models  610
25.3  Identification using linear homogeneous restrictions  614
25.4  Specification  619
25.5  Maximum likelihood estimation  621
25.6  Least-squares estimation  626
25.7  Instrumental variables  637
25.8  Misspecification testing  644
25.9  Specification testing  649
25.10  Prediction  654

26  Epilogue: towards a methodology of econometric modelling  659
26.1  A methodologist's critical eye  659
26.2  Econometric modelling, formalising a methodology
26.3  Conclusion

References  673
Index  689

* Starred chapters and/or sections are typically more difficult and might be avoided at first reading.

PART I

Introduction

CHAPTER 1

Econometric modelling, a preliminary view

1.1 Econometrics - a brief historical overview

It is customary to begin a textbook by defining its subject matter. In this case this brings us immediately up against the problem of defining 'econometrics'. Such a definition, however, raises some very difficult methodological issues which cannot be discussed at this stage. The epilogue might be a better place to give a proper definition. For the purposes of the discussion which follows it suffices to use a working definition which provides only broad guide-posts of its intended scope:

Econometrics is concerned with the systematic study of economic phenomena using observed data.

This definition is much broader than certain textbook definitions narrowing the subject matter of econometrics to the 'measurement' of theoretical relationships as suggested by economic theory. It is argued in the epilogue that the latter definition of econometrics constitutes a relic of an outdated methodology, that of logical positivism (see Caldwell (1982)). The methodological position underlying the definition given above is largely hidden behind the word 'systematic'. The term 'systematic' is used to describe the use of observed data in the framework of statistical inference, as yet undefined, where economic theory plays an important role as well. The use of observed data is what distinguishes econometrics from other forms of studying economic phenomena.

Econometrics, defined as the study of the economy using observed data, can be traced as far back as 1676, predating economics as a separate discipline by a century. Sir William Petty could be credited with the first 'systematic' attempt to study economic phenomena using data in his Political Arithmetik. Systematic in this case is used relative to the state of the art in statistics and economics of the time.

Petty (1676) used the pioneering results in descriptive statistics developed by his friend John Graunt and certain rudimentary forms of economic theorising to produce the first 'systematic' attempt in studying economic phenomena using data. Petty might also be credited as the first to submit to a most serious temptation in econometric modelling. According to Hull, one of his main biographers and collector of his works:

Petty sometimes appears to be seeking figures that will support a conclusion he has already reached; Graunt uses his numerical data as a basis for conclusions, declining to go beyond them.
(See Hull (1899), p. xxv.)

Econometrics, since Petty's time, has developed alongside statistics and economic theory, borrowing and lending to both subjects. In order to understand the development of econometrics we need to relate it to developments in these subjects.

Graunt and Petty initiated three important developments in statistics:
(i) the systematic collection of (numerical) data;
(ii) the mathematical theory of probability related to life-tables; and
(iii) the development of what we nowadays call descriptive statistics (see Chapter 2) into a coherent set of techniques for analysing numerical data.

It was rather unfortunate that the last two lines of thought developed largely independently of each other for the next two centuries. Their slow convergence during the second half of the nineteenth and early twentieth centuries, in the hands of Galton, Edgeworth, Pearson and Yule, inter alia, culminated in the Fisher paradigm, which was to dominate statistical theory to this day.

The development of the calculus of probability emanating from Graunt's work began with Halley (1656-1742) and continued with De Moivre (1667-1754), Daniel Bernoulli (1700-82), Bayes (1702-61), Lagrange (1736-1813), Laplace (1749-1827), Legendre (1752-1833) and Gauss (1777-1855), inter alia. In the hands of De Moivre the main line of the calculus of probability emanating from Jacob Bernoulli (1654-1705) was joined up with Halley's life tables to begin a remarkable development of probability theory (see Hacking (1975), Maistrov (1974)).

The most important of these developments can be summarised under the following headings:
(i) manipulation of probabilities (addition, multiplication);

(ii) families of distribution functions (normal, binomial, Poisson, exponential);
(iii) law of error, least-squares, least-absolute errors;
(iv) limit theorems (law of large numbers, central limit theorem);
(v) life-table probabilities and annuities;
(vi) higher order approximations;
(vii) probability generating functions.
Some of these topics will be considered in some detail in Part II because they form the foundation of statistical inference.

The tradition in Political Arithmetik originated by Petty was continued by Gregory King (1656-1714). Davenant might be credited with publishing the first 'empirical' demand schedule (see Stigler (1954)), drawing freely from King's unpublished work. For this reason his empirical demand for wheat schedule is credited to King and it has become known as 'King's law'. Using King's data on the change in the price (pₜ) associated with a given change in quantity (qₜ), Yule (1915) derived the empirical equation explicitly as

pₜ = −2.33qₜ + 0.05qₜ² − 0.0017qₜ³.    (1.1)

Apart from this demand schedule, King and Davenant extended the line of thought related to population and death rates in various directions, thus establishing a tradition in Political Arithmetik, 'the art of reasoning by figures upon things relating to government'. Political Arithmetik was to stagnate for almost a century without any major developments in the descriptive study of data apart from grouping and the calculation of tendencies. From the economic theory viewpoint, Political Arithmetik played an important role in classical economics, where numerical data on money stock, prices, public finance, wages, exports and imports were extensively used as important tools in their various controversies. The best example of the tradition established by Graunt and Petty is provided by Malthus' 'Essay on the Principles of Population'. In the bullionist and currency-banking schools controversies numerical data played an important role (see Schumpeter (1954)). During the same period the calculation of index numbers made its first appearance.

With the establishment of the statistical society in 1834 began a more coordinated activity in most European countries for more reliable and complete data. During the period 1850-90 a sequence of statistical congresses established a common tradition for collecting and publishing data on many economic and social variables, making very rapid progress on this front. In relation to statistical techniques, however, the progress was much slower. Measures of central tendency (arithmetic mean, median, mode, geometric mean) and graphical techniques were developed. Measures of dispersion (standard deviation, interquartile range), correlation and relative frequencies made their appearance towards the end of this period. A leading figure of this period, who can be considered one of the few people to straddle all three lines of development emanating from Graunt and Petty, was the Belgian statistician Quetelet (1796-1874).

From the econometric viewpoint the most notable example of empirical modelling was Engel's study of family budgets. The most important of his conclusions was:

The poorer a family, the greater the proportion of its total expenditure that must be devoted to the provision of food.
(Quoted by Stigler (1954).)

This has become known as Engel's law and the relationship between consumption and income as Engel's curve (see Deaton and Muellbauer (1980)).

The most important contribution during this period (early nineteenth century) from the modelling viewpoint was what we nowadays call the Gauss linear model. In modern notation this model takes the following general form:

y = Xβ + u,    (1.2)

where y is a T × 1 vector of observations linearly related to the unknown k × 1 vector β via a known fixed T × k matrix X (rank(X) = k, T > k) but subject to (observation) error u. This formulation was used to model a situation such that:

it is suspected that for settings x₁, x₂, ..., xₖ there is a value y related by a linear relation:

y = β₁x₁ + β₂x₂ + ⋯ + βₖxₖ,    (1.3)

where β is unknown. A number T of observations on y can be made, corresponding to T different sets of (x₁, ..., xₖ), i.e. we obtain a data set (yₜ, xₜ₁, xₜ₂, ..., xₜₖ), t = 1, 2, ..., T, but the readings on yₜ are subject to error.
(See Heyde and Seneta (1977).)

The problem as seen at the time was one of interpolation (approximation), that is, to 'approximate' the value of β. The solution proposed came in the form of the least-squares approximation of β, based on the minimisation of

u'u = (y − Xβ)'(y − Xβ),

which leads to

β̂ = (X'X)⁻¹X'y    (1.4)

(see Seal (1967)). The problem, as well as the solution, had nothing to do with probability theory as such. The probabilistic arguments entered the problem as an afterthought in the attempts of Gauss and Laplace to justify the method of least-squares. If the error terms uₜ, t = 1, 2, ..., T, are assumed to be independent and identically distributed (IID) according to the normal distribution, i.e.

uₜ ~ N(0, σ²),   t = 1, 2, ..., T,    (1.5)

then β̂ in (4) can be justified as 'the optimal solution' from a probabilistic viewpoint (see Heyde and Seneta (1977), Seal (1967), Maistrov (1974)).
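The closed form in (1.4) is easy to verify numerically. The following minimal sketch (Python with numpy, added here for illustration and not part of the original text; the sample size and all parameter values are invented) simulates the model (1.2) with IID normal errors as in (1.5) and computes the least-squares estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 50, 3                          # illustrative sample size and regressors

X = rng.normal(size=(T, k))           # known fixed T x k matrix, rank k
beta = np.array([1.0, -0.5, 2.0])     # 'unknown' coefficients of (1.2)
u = rng.normal(scale=0.3, size=T)     # IID N(0, sigma^2) errors, as in (1.5)
y = X @ beta + u                      # the Gauss linear model y = X beta + u

# Least-squares estimate (1.4): solve (X'X) beta_hat = X'y rather than
# forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                       # close to beta for moderate T
```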
The Gauss linear model was later given a very different interpretation in the context of probability theory by Galton, Pearson and Yule, which gave rise to what is nowadays called the linear regression model. The model given in (2) is now interpreted as based wholly on probabilistic arguments: yₜ and Xₜ are assumed to be jointly normally distributed random variables, and β'xₜ is viewed as the conditional expectation of yₜ given that Xₜ = xₜ (Xₜ takes the value xₜ), for t = 1, 2, ..., T, i.e.

E(yₜ | Xₜ = xₜ) = β'xₜ,   t = 1, 2, ..., T,    (1.6)

with the error term uₜ defined by uₜ = yₜ − E(yₜ | Xₜ = xₜ) (see Chapter 19 for further details). The linear regression model

yₜ = E(yₜ | Xₜ = xₜ) + uₜ,   t = 1, 2, ..., T,    (1.7)
can be written in matrix form as in (2), and the two models become indistinguishable in terms of notation. From the modelling viewpoint, however, the two models are very different. The Gauss linear model describes a 'law-like' relationship where the xₜᵢs are known constants. On the other hand, the linear regression model refers to a 'predictive-like' relationship where yₜ is related to the observed values of the random vector Xₜ (for further discussion see Chapter 19). This important difference went largely unnoticed by Galton, Pearson and the early twentieth-century applied econometricians. Galton in particular used the linear regression model to establish 'law-like' causal relationships in support of his theories of heredity in the then newly established discipline of eugenics.

The Gauss linear model was initially developed by astronomers in their attempt to determine 'law-like' relationships for planetary orbits, using a large number of observations with less than totally accurate instruments. The nature of their problem was such as to enable them to assume that their theories could account for all the information in the data apart from a white-noise (see Chapter 8) error term uₜ. The situation being modelled resembles an 'experimental design' situation because of the relative constancy of the phenomena in question, with nature playing the role of the experimenter. Later, Fisher extended the applicability of the Gauss linear model to 'experimental-like' phenomena using the idea of randomisation (see Fisher (1958)). Similarly, the linear regression model, firmly based on the idea of conditional expectation, was later extended by Pearson to the case of stochastic regressors (see Seal (1967)).
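A minimal simulation (again an added Python sketch with invented coefficients, not from the original text) illustrates the regression interpretation in (1.6): for jointly normal variables the conditional expectation of yₜ given Xₜ = xₜ is exactly linear in xₜ.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Jointly normal pair: y = 2 + 1.5 x + noise (invented coefficients).
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)

# Estimate E(y | X = x0) by averaging y in a narrow window around x0;
# under joint normality it equals the linear function of x0 in (1.6).
for x0 in (-1.0, 0.0, 1.0):
    window = np.abs(x - x0) < 0.05
    print(x0, round(y[window].mean(), 3), 2.0 + 1.5 * x0)
```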
In the context of the Gauss linear and linear regression models the convergence of descriptive statistics and the calculus of probability became a reality, with Galton (1822-1911), Edgeworth (1845-1926), Pearson (1857-1936) and Yule (1871-1951) being the main protagonists. In the hands of Fisher (1890-1962) the convergence was completed and a new modelling paradigm was proposed. One of the most important contributing factors to these developments in the early twentieth century was the availability of more complete and reliable data towards the end of the nineteenth century. Another important development contributing to the convergence of the descriptive study of data and the calculus of probability came in the form of Pearson's family of frequency curves, which provided the basis for the transition from histograms to probability density functions (see Chapter 2). Moreover, the various concepts and techniques developed in descriptive statistics were to be reinterpreted and provide the basis for the probability theory framework. The frequency curves as used in descriptive statistics provide convenient 'models' for the observed data at hand. On the other hand, probability density functions were postulated as 'models' of the population giving rise to the data, with the latter viewed as a representative sample from the former. The change from the descriptive statistics to the probability theory approach in statistical modelling went almost unnoticed until the mid-1930s, when the latter approach, formalised by Fisher, dominated the scene.

During the period 1890-1920 the distinction between the population, from which the observed data constitute a sample, and the sample itself was blurred by the applied statisticians. This was mainly because the paradigm tacitly used, as formulated by Pearson, was firmly rooted in the descriptive statistics tradition, where the modelling proceeds from the observed data in hand to the frequency (probability) model and no distinction between the population and the sample is needed. In a sense the population consists of the data in hand. In the context of the Fisher paradigm, however, a probability model is postulated as a generalised description of the actual data generation process (DGP), or the population, and the observed data are viewed as a realisation of a sample from the process. The transition from the Pearson to the Fisher paradigm was rather slow and went largely unnoticed even by the protagonists. In the exchanges between Fisher and Pearson about the superiority of maximum likelihood estimation over the method of moments on efficiency grounds, Pearson

never pointed out that his method of moments was developed for a different statistical paradigm, where the probability model is not postulated a priori (see Chapter 13). The distinction between the population and the sample was initially raised during the last decade of the nineteenth century and the early twentieth century in relation to higher order approximations of the central limit theorem (CLT) results emanating from Bernoulli, De Moivre and Laplace. These limit theorems were sharpened considerably by the Russian school (Chebyshev (1821-94), Liapounov (1857-1922), Markov (1856-1922), Kolmogorov (1903- )) (see Maistrov (1974)) and extensively used during this period. Edgeworth and Charlier, among others, proposed asymptotic expansions which could be used to improve the approximation offered by the CLT for a given sample size T (see Cramer (1972)). The development of a formal distribution theory based on a fixed sample size T, however, began with Gosset's (Student's) t and Fisher's F distributions (see Kendall and Stuart (1969)). These results provided the basis of modern statistical theory based on the Fisher paradigm. The transition from the Pearson to the Fisher paradigm became apparent in the 1930s, when the theory of estimation and testing as we know it today was formulated. It was also the time when probability theory itself was given its axiomatic foundations by Kolmogorov (1933) and firmly established as part of mathematics proper. By the late 1930s probability theory as well as statistical inference as we know them today were firmly established.
The Gauss linear and linear regression models were appropriate for modelling essentially static phenomena. Yule (1926) discussed the difficulties raised when time series data are used in the context of the linear regression model and gave an insightful discussion of 'non-sense regressions' (see Hendry and Morgan (1986)). In an attempt to circumvent these problems, Yule (1927) proposed the linear autoregressive model (AR(m)), where the xₜᵢs are replaced by the lagged yₜs, i.e.

yₜ = a₁yₜ₋₁ + a₂yₜ₋₂ + ⋯ + aₘyₜ₋ₘ + uₜ.    (1.8)

An alternative model for time-series data was suggested by Slutsky (1927) in his discussion of the dangers in 'smoothing' such data using weighted averaging. He showed that weighted averaging of a white-noise process uₜ can produce a data series with periodicities. Hence, somebody looking for cyclic behaviour can be easily fooled when the data series have been smoothed. His discussion gave rise to the other important family of time-series models, subsequently called the moving average model (MA(p)):

yₜ = uₜ + b₁uₜ₋₁ + ⋯ + bₚuₜ₋ₚ.    (1.9)
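The contrast between (1.8) and (1.9), and Slutsky's warning about smoothing, can be reproduced in a few lines of simulation (an added Python sketch; the lag orders and coefficients are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200
u = rng.normal(size=T)                   # white-noise shocks

# AR(2), a special case of (1.8): y_t = a1 y_{t-1} + a2 y_{t-2} + u_t.
a1, a2 = 0.6, -0.3                       # invented stable coefficients
y_ar = np.zeros(T)
for t in range(2, T):
    y_ar[t] = a1 * y_ar[t - 1] + a2 * y_ar[t - 2] + u[t]

# MA smoothing of the same noise, a special case of (1.9) with equal weights.
y_ma = np.convolve(u, np.ones(5) / 5.0, mode="valid")

# Slutsky's point: the smoothed series is strongly serially correlated and
# looks 'cyclic' even though the underlying shocks are independent.
print(np.corrcoef(y_ma[:-1], y_ma[1:])[0, 1])   # about 0.8
print(np.corrcoef(u[:-1], u[1:])[0, 1])         # near zero
```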

Wold (1938) provided the foundations for time series modelling by relating the above models to the mathematical theory of probability established by Kolmogorov (1933). These developments in time series modelling were to have only a marginal effect on mainstream econometric modelling until the mid-70s, when a slow but sure convergence of the two methodologies began. One of the main aims of the present book is to complete this convergence in the context of a reformulated methodology.

With the above developments in probability theory and statistical inference in mind, let us consider the history of econometric modelling in the early twentieth century. The marginalist revolution of the 1870s, with Walras and Jevons the protagonists, began to take root, and with it came a change of attitude towards mathematical and statistical techniques and their role in studying the economy. In classical economics observed data were used mainly to 'establish' tendencies in support of theoretical arguments or as 'facts' to be explained. The mathematisation of economic theory brought about by the marginalist revolution contributed towards a purposeful attempt to quantify theoretical relationships using observed data. The theoretical relationships formulated in terms of equations, such as demand and supply functions, seemed to offer themselves for quantification using the newly established techniques of correlation and regression.
The early literature in econometric modelling concentrated mostly on two general areas, business cycles and demand curves (see Stigler (1954)). This can be explained by the availability of data and the influence of the marginalist revolution. The statistical analysis of business cycles took the form of applying correlation as a tool to separate long-term secular movements, periodic movements and short-run oscillations (see Hooker (1905), Moore (1914), inter alia). The empirical studies in demand theory concentrated mostly on estimating demand curves using the Gauss linear model disguised as regression analysis. The estimation of such curves was treated as 'curve fitting', with any probabilistic arguments being coincidental. Numerous studies of empirical demand schedules, mostly of agricultural products, were published during the period 1910-30 (see Stigler (1954), Morgan (1982), Hendry and Morgan (1986)), seeking to establish an empirical foundation for the 'law of demand'. These studies purported to estimate demand schedules of the simple form

qₜᴰ = α₀ + α₁pₜ,    (1.10)
where qₜᴰ refers to quantities demanded at time t (intentions on behalf of economic agents to buy a certain quantity of a commodity) corresponding to a range of hypothetical prices pₜ. By adopting the Gauss linear model these studies tried to 'approximate' (10) by fitting the 'best' line through the scatter diagram of (qₜ, pₜ), t = 1, 2, ..., T, where qₜ usually referred to quantities transacted (or produced) and pₜ the corresponding prices at time t. That is, they would estimate

qₜ = b₀ + b₁pₜ,   t = 1, 2, ..., T,    (1.11)
using least-squares or some other interpolation method and interpret the estimated coefficients b̂₀ and b̂₁ as estimates of the theoretical parameters α₀ and α₁ respectively, if the signs and values were consistent with the 'law of demand'. This simplistic modelling approach, however, ran into difficulties immediately. Moore (1914) estimated (11) using data on pig-iron (raw steel) production and price and 'discovered' (or so he thought) a positively sloping demand schedule (b̂₁ > 0). This result attracted considerable criticism from the applied econometricians of the time, such as Lehfeldt and Wright (see Stigler (1962)), and raised the most important issue in econometric modelling: the connection between the estimated equations using observed data and the theoretical relationships postulated by economic theory. Lehfeldt (1915), commenting on Moore's estimated equation, argued that it was not a demand but a supply curve. Several applied econometricians argued that Moore's estimated equation was a mixture of demand and supply. Others, taking a more extreme view, raised the issue of whether estimated equations represent statistical artifacts or genuine empirical demand or supply curves. It might surprise the reader to learn that the same issue remains largely unresolved to this day. Several 'solutions' have been suggested since then but no satisfactory answer has emerged.
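The following simulation (an added Python sketch with invented demand and supply parameters, not from the book) reproduces the difficulty Moore ran into: least-squares applied to equilibrium observations (qₜ, pₜ) recovers the demand slope, the supply slope, or neither, depending entirely on which curve is shifting.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 500

# Structural model (invented numbers): demand q = 10 - 1.0 p + e_d,
# supply q = 2 + 0.5 p + e_s; we observe only the equilibrium points.
e_d = rng.normal(scale=0.2, size=T)     # small demand shifts
e_s = rng.normal(scale=2.0, size=T)     # large supply shifts
p = (10 - 2 + e_d - e_s) / (0.5 + 1.0)  # equate demand and supply for price
q = 2 + 0.5 * p + e_s                   # equilibrium quantity

# Least-squares slope of q on p, as in (1.11).
c = np.cov(p, q)
print(c[0, 1] / c[0, 0])  # about -1.0: the demand slope, because supply does
                          # the shifting; swap the error scales and the fitted
                          # 'demand curve' slopes upward, as in Moore's case
```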
During the next two decades (1910-30) the applied econometricians struggled with the problem and proposed several ingenious ways to 'resolve' some of the problems raised by the estimated versus theoretical relationships issue. Their attempts were mainly directed towards specifying more 'realistic' theoretical models and attempting to rid the observed data of 'irrelevant information'. For example, the scenario that the demand and supply curves were simultaneously shifting, allowing us to observe only their intersection points, received considerable attention (see Working (1927), inter alia). The time dimension of time-series data proved particularly difficult to 'solve', given that the theoretical model was commonly static. Hence 'detrending' the data was a popular way to 'purify' the observed data in order to bring them closer to the theoretical concepts they purported to measure. As argued below, the estimated versus theoretical issue raises numerous problems which, given the state of the art as far as statistical inference was concerned, could not have been resolved in any satisfactory way (see Morgan (1982)). In modern terminology these problems can be summarised under the following headings:
(i) theoretical variables versus observed data;
(ii) statistical model specification;
(iii) misspecification testing;
(iv) statistical specification testing, reparametrisation, identification;
(v) empirical versus theoretical models.
By the late 1920s there was a deeply felt need for a more organised effort to face the problems raised by the early applied econometricians such as Moore, Mitchell, Schultz, Clark, Working, Wallace, Wright, inter alia. This led to the creation of the Econometric Society in 1930. Frisch, Tinbergen and Fisher (Irving) initiated the establishment of 'an international society for the advancement of economic theory in its relation to statistics and mathematics'. The decade immediately after the creation of the Econometric Society can be characterised as the period during which the foundations of modern econometrics were laid, mainly by posing some important and insightful questions.

An important attempt to resolve some of the problems raised by the estimated-theoretical distinction was made by Frisch (1928, 1934). Arguing from the Gauss linear model viewpoint, Frisch suggested the so-called errors-in-variables formulation, where the theoretical relationships, defined in terms of the theoretical variables ξₜ ≡ (ξ₁ₜ, ξ₂ₜ, ..., ξₖₜ)', are specified by the system of k linear equations

A'ξₜ = 0,    (1.12)

and the observed data yₜ ≡ (y₁ₜ, y₂ₜ, ..., yₖₜ)' are related to ξₜ via

yₜ = ξₜ + εₜ,    (1.13)

where εₜ are errors of measurement. This formulation emphasises the distinction between theoretical variables and observed data, with the measurement equations (13) relating the two. The problem as seen by Frisch was one of approximation (interpolation) in the context of linear algebra, in the same way as the Gauss linear model was viewed. Frisch, however, with his confluence analysis offered no proper solution to the problem. A complete solution to the simplest case was only recently provided, 50 years later, by Kalman (1982).
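The practical bite of (1.12)-(1.13) can be seen in a two-variable special case (an added Python sketch; the true slope and error scales are invented): least-squares applied to the observed, error-contaminated data systematically understates the slope of the exact theoretical relation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Exact theoretical relation between latent variables: xi2 = 2.0 * xi1,
# i.e. A'xi_t = 0 with A = (2, -1)', a two-variable case of (1.12).
xi1 = rng.normal(size=n)
xi2 = 2.0 * xi1

# Observed data = latent variable + measurement error, as in (1.13).
y1 = xi1 + rng.normal(scale=0.5, size=n)
y2 = xi2 + rng.normal(scale=0.5, size=n)

# Least-squares slope of y2 on y1 is attenuated towards zero:
# 2.0 * var(xi1) / (var(xi1) + 0.25) = 2.0 / 1.25 = 1.6, not 2.0.
c = np.cov(y1, y2)
print(c[0, 1] / c[0, 0])
```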
It is fair to say that although Frisch understood the problems raised by the empirical-theoretical distinction, as the quotation below testifies, his formulation of the problem turned out to be rather unsuccessful in this respect. Commenting on Tinbergen's 'A statistical test of business cycle theories', Frisch argued that:

The question of what connection there is between relations we work with in theory and those we get by fitting curves to actual statistical data is a very delicate one. Tinbergen in his work hardly mentions it. He more or less takes it for granted that the relations he has found are in their nature the same as those of the theory. This is, in my opinion, unsatisfactory. In a work of this sort, the connection between statistical and theoretical relations must be thoroughly understood and the nature of the information which the statistical relations furnish - although they are not identical with the theoretical relations - should be clearly brought out.
(See Frisch (1938), pp. 2-3.)

As mentioned above, by the late 1930s the Fisher paradigm of statistical inference was formulated into a coherent body of knowledge with a firm foundation in probability theory. The first important attempt to introduce this paradigm into econometrics was made by Koopmans (1937). He proposed a resetting of Frisch's errors-in-variables formulation in the context of the Fisher paradigm and related the least-squares method to that of maximum likelihood, arguing that the latter paradigm provides us with additional insight as to the nature of the problem posed and its 'solution' (estimation). Seven years later Haavelmo (a student of Frisch) published his celebrated monograph on 'The probability approach in econometrics' (see Haavelmo (1944)), where he argued that the probability approach (the Fisher paradigm) was the most promising approach to econometric modelling (see Morgan (1984)). His argument in a nutshell was that if statistical inference (estimation, testing and prediction) is to be used systematically we need to accept the framework in the context of which these results become available. This entails formulating theoretical propositions in the context of a well-defined statistical model. In the same monograph Haavelmo exemplified a methodological awareness far ahead of his time. In relation to the above discussion of the appropriateness of the Gauss linear model in modelling economic phenomena, he distinguished between observed data resulting from:

(1) experiments that we should like to make to see if certain real economic phenomena - when artificially isolated from 'other influences' - would verify certain hypotheses, and

(2) the stream of experiments that Nature is steadily turning out from her own enormous laboratory, and which we merely watch as passive observers.

He went on to argue:

In the first case we can make the agreement or disagreement between theory and facts depend upon two things: the facts we choose to consider, as well as our theory about them. In the second case we can only try to adjust our theories to reality as it appears before us. And what is the meaning of a design of experiments in this case? It is this: We try to choose a theory and a design of experiments to go with it, in such a way that the resulting data would be those which we get by passive observation of reality. And to the extent that we succeed in doing so, we become master of reality - by passive agreement.

Now if we examine current economic theories, we see that a great many of them, in particular the more profound ones, require experiments of the first type mentioned above. On the other hand, the kind of economic data that we actually have belong mostly to the second type.
(See Haavelmo (1944).)

Unfortunately for econometrics, Haavelmo's views on the methodology of econometric modelling had much less influence than his formulation of a statistical model thought to be tailor-made for econometrics: the so-called simultaneous equations model.

In an attempt to capture the interdependence of economic relationships, Haavelmo (1943) proposed an alternative to Frisch's errors-in-variables formulation where no distinction between theoretical variables and observed data is made. The simultaneous equations formulation was specified by the system

Γ'yₜ + Δ'xₜ + εₜ = 0,    (1.14)

where yₜ refers to the variables whose behaviour this system purports to explain (endogenous) and xₜ to the explanatory (extraneous) variables whose behaviour lies outside the intended scope of the theory underlying (14), and εₜ is the error term (see Chapter 25). The statistical analysis of (14) provided the agenda for a group of distinguished statisticians and econometricians assembled in Chicago in 1945. This group, known as the Cowles Foundation Group, introduced the newly developed techniques of estimation (maximum likelihood) and testing into econometrics via the simultaneous equations model. Their results, published in two monographs (see Koopmans (1950) and Hood and Koopmans (1953)), were to provide the main research agenda in econometrics for the next 25 years.

It is important to note that despite Haavelmo's stated intentions in his discussion of the methodology of econometric modelling (see Haavelmo (1944)), the simultaneous equations model was later viewed in the Gauss linear model tradition, where the theory is assumed to account for all the information in the data apart from some non-systematic (white-noise) errors. Indeed, the research in econometric theory for the next 25-30 years was dominated by the Gauss linear model and its misspecification analysis and the simultaneous equations model and its identification and estimation. The initial optimism about the potential of the simultaneous equations model and its appropriateness for econometric modelling was not fulfilled. The problems related to the issue of estimated versus theoretical relationships mentioned above were largely ignored because of this initial optimism. By the late 1970s the experience with large macroeconometric models based on the simultaneous equations formulation called into question the whole approach to econometric modelling (see Sims (1980), Malinvaud (1982), inter alia).
The inability of large macroeconometric models to compete on prediction grounds with Box-Jenkins ARIMA models, which have no economic theory content (see Cooper (1972)), renewed the interest of econometricians in the issue of static theory versus dynamic time-series data raised in the 1920s and 30s. Granger and Newbold (1974) questioned the conventional econometric approach of paying little attention to the time series features of economic data, the result of specifying statistical models using only the information provided by economic theory. By the late 1970s it was clear that the simultaneous equations model, although very useful, was not a panacea for all econometric modelling problems. The whole econometric methodology needed a reconsideration in view of the experience of the three decades since the Cowles Foundation.

The purpose of the next section is to consider an outline of a particular approach to econometric modelling which takes account of some of the problems raised above. It is only an outline because in order to formulate the methodology in any detail we need to use concepts and results which are developed in the rest of the book. In particular, an important feature of the proposed methodology is the recasting of statistical models of interest in econometrics in the Fisherian mould, where the probabilistic assumptions are made directly in terms of the observable random variables giving rise to the observed data and not some unobservable error term. The concepts and ideas involved in this recasting are developed in Parts II-IV. Hence, a more detailed discussion of the proposed methodology is given in the epilogue.

1.2 Econometric modelling - a sketch of a methodology

In order to motivate the methodology of econometric modelling adopted below, let us consider a simplistic view of a commonly propounded methodology as given in Fig. 1.1 (for similar diagrams see Intriligator (1978), Koutsoyiannis (1977), inter alia).

Fig. 1.1. The 'textbook' approach to econometric modelling: theory → theoretical model → (with data) econometric model → estimation, testing via statistical inference → prediction (forecasting), policy evaluation.

In order to explain the procedure represented by the diagram, let us consider an extensively researched theoretical relationship, the transactions demand for money. There is a proliferation of theories related to the demand for money which are beyond the scope of the present discussion (for a survey see Fisher (1978)). For our purposes it suffices to consider a simple theory where the transactions demand for money depends on income, the price level and the interest rate, i.e.

Mᴰ = f(Y, P, I).    (1.15)

Most theories of the demand for money can be accommodated in some variation of (15) by attributing different interpretations to Y. The theoretical model is a mathematical formulation of a theory. In the present case we expressed the theory directly in the functional form (15) in an attempt to keep the discussion to a minimum. Let the theoretical model be an explicit functional form for (15), say

Mᴰ = A·Y^(α₁)·P^(α₂)·I^(α₃),    (1.16)

or, in log-linear form,

ln Mᴰ = α₀ + α₁ ln Y + α₂ ln P + α₃ ln I,    (1.17)

with α₀ ≡ ln A being a constant.

The next step in the methodological scheme represented by Fig. 1.1 is to transform the theoretical model (17) into an econometric model. This is commonly achieved in an interrelated sequence of steps which is rarely explicitly stated. Firstly, certain data series, assumed to represent measurements of the theoretical variables involved, are chosen. Secondly, the theoretical variables are assumed to coincide with the variables giving rise to the observed data chosen. This enables us to respecify (17) in terms of these observable variables, say Mₜ, Yₜ, Pₜ and Iₜ:

ln Mₜ = α₀ + α₁ ln Yₜ + α₂ ln Pₜ + α₃ ln Iₜ.    (1.18)

The last step is to turn (18) into an econometric (statistical) model by attaching an error term uₜ, which is commonly assumed to be a normally distributed random variable representing 'the effects of the excluded variables'. Adding this error term onto (18) yields

mₜ = α₀ + α₁yₜ + α₂pₜ + α₃iₜ + uₜ,   t = 1, 2, ..., T,    (1.20)

where small letters represent the logarithms of the corresponding capital letters. Equation (20) is now viewed as a Gauss linear model with the estimation, testing and prediction techniques related to this at our disposal to analyse 'the transactions demand for money'. The next stage is to estimate (20) using the statistical results related to the Gauss linear model and test the postulated assumptions for the error term. If any of the assumptions are invalid we correct by respecifying the error term, and then we proceed to test the a priori restrictions suggested by the theory, such as α₁ ≈ 1, α₂ ≈ 1, −1 < α₃ < 0, using the statistical techniques related to the linear model. When we satisfy ourselves that the theory is 'correct' we can proceed to use the estimated equation for prediction or/and policy evaluation.
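To fix ideas, the following sketch (Python, added here for illustration; the series are simulated with invented coefficients, not the UK data analysed later in the book) generates artificial data satisfying (1.20) and recovers the coefficients by least squares, which is essentially the first step of the textbook procedure.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 120   # e.g. 30 years of quarterly observations

# Simulated logged series (illustrative dynamics, not real data).
y = np.cumsum(rng.normal(0.005, 0.01, T))      # log real income
p = np.cumsum(rng.normal(0.010, 0.01, T))      # log price level
i = np.log(0.05 + 0.01 * rng.normal(size=T))   # log interest rate

# The econometric model (1.20) with invented coefficients.
u = rng.normal(scale=0.02, size=T)
m = 0.5 + 1.0 * y + 1.0 * p - 0.3 * i + u

# Least-squares estimation of (alpha_0, ..., alpha_3).
X = np.column_stack([np.ones(T), y, p, i])
alpha_hat, *_ = np.linalg.lstsq(X, m, rcond=None)
print(alpha_hat)   # approximately (0.5, 1.0, 1.0, -0.3)
```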
In practice the above methodological procedure turns out to be much more difficult to apply, and applied econometricians find themselves having to use 'illegitimate' procedures in order to get something worth publishing. Such procedures include estimating dozens of equations like (20) with various combinations of possibly relevant variables, as well as including a few lagged variables among the explanatory variables. This is most graphically described in the introduction of Leamer (1978):

I began thinking about these problems when I was a graduate student in economics at the University of Michigan, 1966-1970. At that time there was a very active group building an econometric model of the United States. As it happens, the econometric modelling was done in the basement of the building and the econometric theory courses were taught on the top floor (the third). I was perplexed by the fact that the same language was used in both places. Even more amazing was the transmogrification of particular individuals who wantonly sinned in the basement and metamorphosed into the highest of high priests as they ascended to the third floor.

In the same book Leamer went on to attempt a systematisation of these 'illegitimate' procedures using Bayesian techniques. The approach proposed below will show that some of these 'illegitimate' procedures are indeed appropriate ways to 'tackle' certain problems in the context of an alternative methodology (see Chapter 22).

In an attempt to motivate the underlying logic of the methodology to be sketched below, let us return to Fig. 1.1 in order to consider some of the possible 'weak links' in the textbook methodology.

The first possible weakness of the textbook methodology is that the
starting point of econometric modelling is some theory. This arises because the intended scope of econometrics is narrowly defined as the 'measurement of theoretical relationships'. Such a definition was rejected at the outset of the present book as narrow and misleading. Theories are developed not for the sake of theorising but in order to understand some observable phenomenon of interest. Hence, defining the intended scope of econometrics as providing numbers for our own constructions, and ignoring the original aim of explaining phenomena of interest, restricts its scope considerably by attaching 'blinkers' to the modeller. In a nutshell, it presupposes that the only 'legitimate information' contained in the observed data chosen is what the theory allows. This presents the modeller with insurmountable difficulties at the statistical model specification stage when the data do not fit the 'straightjacket' chosen for them without their nature being taken into consideration. The problem becomes more apparent when the theoretical model is turned into a statistical (econometric) model by attaching a white-noise error term to a reinterpreted equation in terms of observable variables. It is naive to suggest that the statistical model should be the same whatever the observed data chosen. In order to see this, let us consider the demand schedule at time t referred to in Section 1.1:

qₜᴰ = α₀ + α₁pₜ.    (1.21)

If the data refer to intentions (qᵢₜᴰ, pᵢₜ), i = 1, 2, ..., n, which correspond to the hypothetical range of prices pᵢₜ, i = 1, 2, ..., n, then the most appropriate statistical model in the context of which (21) can be analysed is indeed the Gauss linear model. This is because the observed data were generated under conditions which resemble an experimental situation: the hypothetical prices pᵢₜ, i = 1, 2, ..., n, were called out and the economic agents considered their intentions to buy at time t. This suggests that α₀ and α₁ can be estimated using

qᵢₜᴰ = α₀ + α₁pᵢₜ + uᵢₜ,   i = 1, 2, ..., n.    (1.22)

In Haavelmo's categorisation, (qᵢₜᴰ, pᵢₜ), i = 1, 2, ..., n, constitute observed data of type one: experimental-like situations (see Section 1.1). On the other hand, if the observed data come in the form of time series (qₜ, pₜ), t = 1, 2, ..., T, where qₜ refers to quantities transacted and pₜ the corresponding prices at time t, then the data are of type two: generated by nature. In this case the Gauss linear model seems wholly inappropriate unless there exists additional information ensuring that

qₜ = qₜᴰ   for all t.    (1.23)

Such a condition is highly unlikely to hold, given that in practice other
factors, such as supply-side, historical and institutional ones, will influence the determination of qₜ and pₜ. It is highly likely that the data (qₜ, pₜ), t = 1, 2, ..., T, when used in the context of the Gauss linear model

qₜ = β₀ + β₁pₜ + uₜ,   t = 1, 2, ..., T,

will give rise to very misleading estimates for the theoretical parameters of interest α₀ and α₁ (see Chapter 19 for the demand for money). This is because the GM represented by the Gauss linear model bears little, if any, resemblance to the actual DGP which gave rise to the observed data (qₜ, pₜ), t = 1, 2, ..., T. In order to account for this, some alternative statistical model should be specified in this case (see Part IV for several such models). Moreover, in this case the theoretical model (21) might not be estimable. A moment's reflection suggests that without any additional information the estimable form of the model is likely to be an adjustment process (price or/and quantity). If the observed data have a distinct time dimension, this should be taken into consideration in deciding the estimable form of the model as well as in specifying the statistical model in the context of which the latter will be analysed. The estimable form of the model is directly related to the observable phenomenon of interest which gave rise to the data (the actual DGP). More often than not the intended scope of the theory in question is not the demand schedule itself but the explanation of changes in prices and quantities of interest. In such a case a demand or/and a supply schedule are used as a means to explain price and quantity changes, not as the intended scope of the theory.

In the context of the textbook methodology, distinguishing between the theoretical and estimable models in view of the observed data seems totally unnecessary for three interrelated reasons:
(i) the observed data are treated as an afterthought;
(ii) the actual DGP has no role to play; and
(iii) theoretical variables are assumed to coincide (one-to-one) with the observed data chosen.
As in the case of (21) above, the theoretical variables do not correspond directly to a particular observed data series unless we generate the data ourselves by artificially isolating the economic phenomenon of interest from other influences (see the Haavelmo (1944) quotation in Section 1.1). We only have to think of theoretical variables such as aggregate demand for money, income, price level and interest rates, and dozens of available data series become possible candidates for measuring these variables. Commonly, none of these data series measures what the theoretical variables refer to. Proceeding to assume that what is estimable coincides with the theoretical model, and that the statistical model differs from these by a white-noise error term regardless of the observed data chosen, can only lead to misleading conclusions.

The question which naturally arises at this stage is whether we can tackle some of the problems raised above in the context of an alternative methodological framework. In view of the apparent limitations of the textbook methodology, any alternative framework should be flexible enough to allow the modeller to ask some of the questions raised above, even though readily available answers might not always be forthcoming. With this in mind, such a methodological framework should attribute an important role to the actual DGP in order to widen the intended scope of econometric modelling. Indeed, the estimable model should be interpreted as an approximation to the actual DGP. This brings the nature of the observed data to the centre of the scene, with the statistical model being defined directly in terms of the random variables giving rise to the data and not the error term. The statistical model should be specified as a generalised description of the mechanism giving rise to the data, in view of the estimable model, because the latter is going to be analysed in its context.

A sketch of such a methodological framework is given in Fig. 1.2. An important feature of this framework is that it can include the textbook methodology as a special case under certain conditions. When the actual DGP is 'designed' to resemble the conditions assumed by the theory in question (Haavelmo type one observed data), then the theoretical and estimable models could coincide and the statistical model could differ from these by a white-noise error. In general, however, we need to distinguish between them, even though the estimable model might not be readily available in some cases, such as the case of the transactions demand for money (see Chapter 23).

In order to be able to turn the above skeleton of a methodology into a fully fleshed framework, we need to formulate some of the concepts involved in more detail and discuss its implementation at length. Hence, a more detailed discussion of this methodology is considered in the epilogue, where the various components shown in Fig. 1.2 are properly defined and their role explained. In the meantime the following working definitions will suffice for the discussion which follows:

Theory: a conceptual construct providing an idealised description of the phenomena within its intended scope which will enable us to seek explanations and predictions related to the actual DGP.

Fig. 1.2. An alternative approach to econometric modelling: the theory gives rise to a theoretical model and, in view of the actual data generating process and the observed data, to an estimable model; the observed data also give rise to a statistical model; estimation, misspecification testing, reparametrisation and model selection (enclosed in a dotted rectangle in the original diagram) lead to the empirical econometric model, which is used for prediction and policy evaluation.

Estimable model: a particular form of the theoretical model which is potentially estimable in view of the actual DGP and the observed data chosen.

Statistical model: a probabilistic formulation purporting to provide a generalised description of the actual DGP with a view to analysing the estimable model in its context.

Empirical econometric model: a reformulation (reparametrisation/restriction) of a well-defined estimated statistical model in view of the estimable model, which can be used for description, explanation or/and prediction.

1.3 Looking ahead

As the title of the book exemplifies, its main aim is the statistical foundations of econometric modelling. In relation to Fig. 1.2, the book concentrates mainly on the part within the dotted rectangle. The specification of a statistical model in terms of the variables giving rise to the observed data, as well as the related statistical inference results, will be the subject matter of Parts II and III. In Part IV various statistical models of interest in econometric modelling and the related statistical inference results will be considered in some detail. Special attention will be given to the procedure from the specification of the statistical model to the 'design' of the empirical econometric model. The transactions demand for money example considered above will be used throughout Part IV in an attempt to illustrate the 'dangers' awaiting the unaware in the context of the textbook methodology, as well as to compare this methodology with the alternative formalised in the present book.

Parts II and III form an integral part of econometric modelling and should not be viewed as providing a summary of the concepts and definitions to be used in Part IV. A sound background in probability theory and statistical inference is crucial for the implementation of the approach adopted in the present book. This is mainly because the modeller is required to specify the 'appropriate' statistical model taking into consideration the nature of the data in hand as well as the estimable model. This entails making decisions about characteristics of the random variables which gave rise to the observed data chosen, such as normality, independence, stationarity, mixing, before any estimation is even attempted. This is one of the most crucial decisions in the context of econometric modelling, because an inappropriate choice of the statistical model renders the related statistical inference conclusions invalid. Hence, the reader is advised to view Parts II and III as an integral part of econometric modelling and not as reference appendices. In Part IV the reader is encouraged to view econometric modelling as a thinking person's activity and not as a sequence of technique recipes. Chapter 2 provides a very brief introduction to the Pearson paradigm in an attempt to motivate the Fisher paradigm which is the subject matter of Parts II and III.


CHAPTER 2

Descriptive study of data

2.1 Histograms and their numerical characteristics

By descriptive study of data we refer to the summarisation and exposition (tabulation, grouping, graphical representation) of observed data, as well as the derivation of numerical characteristics such as measures of location, dispersion and shape. Although the descriptive study of data is an important facet of modelling with real data in itself, in the present study it is mainly used to motivate the need for probability theory and statistical inference proper.

In order to make the discussion more specific, let us consider the after-tax personal income data of 23 000 households for 1979-80 in the UK. These data in raw form constitute 23 000 numbers between £1000 and £50 000. This presents us with a formidable task in attempting to understand how income is distributed among the 23 000 households represented in the data. The purpose of descriptive statistics is to help us make some sense of such data. A natural way to proceed is to summarise the data by allocating the numbers into classes (intervals). The number of intervals is chosen a priori and depends on the degree of summarisation needed. In the present case the income data are allocated into 15 intervals, as shown in Table 2.1 below (see National Income and Expenditure (1983)). The first column of the table shows the income intervals, the second column shows the number of incomes falling into each interval, and the third column the relative frequency for each interval. The relative frequency is calculated by dividing the number of observations in each interval by the total number of observations. Summarising the data in Table 2.1 enables us to get some idea of how income is distributed among the various classes. If we plot the relative frequencies in a bar graph we get what is known as the histogram,
Descriptive study of data

0. 16
0. 15
0. 14
0.13
U'0. 12
11
. E 0.
# 0.10

,/ 0.09
1 0.08
g o o7

0.06

c.c5

0.04
0.03
0.02
0,01

10

11

12 13 14

15

l ncome
Fig. 2.1. The histogram and frequency polygon of the personal

income

data.

shown in Fig. 2. 1. The pictorial representation of the relative frequencies


gives us a more vivid impression of the distribution of income. Looking at
the histogram we can see that most households earn less than E4500 and in
some sense we can separate them into two larger groups: those earning
between f 1000 and f 4500 and those above f4500. The first impression is

and their numerical characteristies

Histograms

that the distribution


rather smilar.

of income inside these two larger groups appears to be

For further information on the distribution of income we could caiculate


location,
charaeteristcs
describing the hstogram's
varous numerical
dispersion and shape. Such measures can be caleulated directly in terms of
the raw data. Howeqrer, in the present case it is more convenient for
expositional purposes to use the grouped data. The main reason for this is
to introduce variotls concepts which will be reinterpreted in the context of
probability theory in Part ll.
The n'lpt/l' as a measure of location takes the form
15

()j

.- =

..u

=1

-.
'...

...

'y

,1

where $: and zi refer to the relative frequency and the midpoint of interval I'.
' Tbe rnott? as a measure of location refers to the value of income that occurs
most frequentl) in the data set. ln the present case the mode belongs to the
first interval f 1.0- 1.5. Another measure of location is the mtalan referring to
when incomes al'e arranged in an
the value of ineome in thc luiddle
ascenling (01' descending) order according to the size of income. The best
cIftnulatit'v
qvaph which
way to calculate the median is to plot the
such
consrenient
answering
questions
-Ho5v
is more
for
as
many
observations fall below a particular value of income?' (see Fig. 2.2). From
the cumulative frequency graph we can see that the median belongs to the
interval f 3.0-3.5. Comparing the three measures of location we can see that
.//gt/l/.eznc)'

$'

1.0
0.9
$

0.8
' o7
g
.

# o.6

+-

0.5

';

'

'B 0.4
E

c=

0,3
1

0.2
0. 1

l
1

Fig. 2.2. The cumulative


data.

8 9
lncome

10

1 1 12

13

14

15

histogram and ogive of the personal income

Descriptive study of data

mode < median


histogram.

<

confirming

mean,

the

obvious

asymmetry

ol' the

Another important feature of the histogram is the dispersion of the


relative frequencies around a measure of central tendency. The most
defined by
frequently used measure of dispersion is the l'al-iance
15

'2

(zf pli

4.85,

which is a measure of dispersion around the mean;


standard deriation.
We can extend the concept of the variance to

15

mk
=

)
=

(zf
-

z'lksi

3 4

is known as the

defining what are known as hiqber central rrlf?rntanrs. These higher moments
can be used to get a better idea of the shape of the histogram. For example,
the standardised form of the third and fourth moments defined by
SK

?n a
X. an d
-

n
--7.4
u?

(2

.4)

known as the skewness and kurtosis tro//'icfcaurk. measure the asymmetry


and peakedness of the histogram, respectively. In the case of a symmetric
and the less peaked the histogram thegreater value of
histogram, SK
K. For the income data
=0

SK

1.43

and

7.33,

which confirms the asymmetry of the histogram (skewed to the right). The
above numerical characteristics referring to the location, dispersion and
shape were calculated for the data set as a whole. lt was argued above,
however, that it may be preferable to separate the data into two larger
groups and study those separately. Let us consider the groups f 1.0-4.5 and
f4.0-20.0 separately. The numerical characteristics for the two groups are
and

2.5,

(721=

0.996,

SKL

0,252,

6.18,

:22

3.8 14,

SKz

2.55,

Kz

11.93,

respectively.

Looking at these measures we can see that although the two subsets of the
income data seemed qualitatively rather similar they actually differ
substantially. The second group has much bigger dispersion, skewness and
kurtosis coefficients.
Returning to the numerical characteristics of the data set as a whole we

2.2

Frequency curves

can see that these seem to represent an uneasy compromise between the
above two subsets. This confirms our first intuitive reaction based on the
histogram that it might be more appropriate to study the two larger groups
separately.
Another form of graphical representation for time-series data is the time
'C The temporal pattern of an economic time series
grkpll (zf l). l 1 2e
is important not only in the context of descriptive statistics but also plays an
important role in econometric modelling in the context of statistical
inference proper', see Part lV.
=

2.2

Frequency curves

Although the histogram can be a very useful way to summarise and study
observed data it is not a very convenient descriptor of data. This is because
()mof intervals) are
j (m being the number
m 1 parameters /1, /a,
describe
it.
analytically
histogram
is a
Moreover,
needed to
the
of
form
the
cumbersome step function
-

( )
,.c

)''')(Sf
i -.-

--

( f
'

1-

J (g,:.r
i
i
,:.r

L'i

where gf z'i 1 ) represents


indicator function
.

j.

))
,

the Eth half-closed


for
for

interval and 1(

'

is the

Ilc,zi 1 )
k!llz'f,zi 1 ).

zg

+.

Hence, the histogram is not an ideal descriptor especially in relation to the


modelling facet of observed data.
The first step towards a more convenient descriptor of observed data is
Jmtvfr/tpnwhich is a modified histogram. This is
the so-called p-equency
midpoints of the step function, as shown in Fig.
the
obtained by joining up
continuous
function.
2. 1. to get a
An analogous graph for the cumulative frequency graph is known as the
ogive (seeFig. 2.2). These two graphs can be interpreted as the histograms
obtained by increasing the number of intervals. In summarising
the data in
the form of a histogram some information is lost. The greater the number of
intervals the smaller the information lost. This suggests that increasing the
number of intervals we might get more realistic descriptors for our data.
lntuition suggests that if we keep on increasing the number of intervals to
infinity we sllould get a much smoother frequency curve. Moreover, with a
smooth frequency curve we should be able to describe it in some functional
form with fewer than m - l parameters. For example, if we were to describe

Descriptive study of data

be able to
the two subsets of the data separately we cotlld conceivably
version
smoothed
of
the
frequency
in
polynomial
polygons
form
a
express a
reasoning
This
line
of
1ed
statisticians
the
in
with one or two parameters.
second part of the nineteenth century to suggest various such families of
frequency curves with various shapes for describing observed data,
The Pearson

familyof frequencytwrptas'

ln his attempt to derive a general family of frequency curves to describe


observed data, Karl Pearson in the late 189()s suggested a family based on
the differential equation

dtlo /(z))

z+

(1

bo + b 1 :: + b 2::U,

d ):

which satisfies the condition that the curve touches the z-axis at T)(.c) 0 and
has an optimum at z= -a, that is, the curve has one mode. Clearly, the
solution of the above equation depends on the roots of the denominator. By
imposing different conditions on these roots and choosing different values
for a, bv, ?1 and bz we can generate numerous frequency curves such as
=

(2.8)

(iii)

4(J)

.,4

aty,':--lt'*

11

J-shaped.

(2. 10)

In the case of the income data above we can see that the J-shaped (iii)
frequency curve seems to be our best choice. As can be seen it has only one
parameter (1 and it is clearly a much more convenient descriptor ('if
equal to the
appropriate) of the income data than the histogram. For
lowest income value this is known as the Pareto frequency curve. Looking
at Fig. 2. 1 we can see that for incomes greater than f 4.5 the Pareto
frequency curve seems a very reasonable descriptor.
An important property of the Pearson family of frequency cursres is that
the parameters a, bv, l?I and bs are completely determined from knowledge
of the first four moments. This implies that any frequency curq'e can be fitted
to the data using these moments (see Kendall and Sttlart ( 1969)). At this
point, instead of considering how such frequency curves can be fitted to
observed data we are going to leave the story unfinished to be taken up in
Parts ll1 and IV in order to look ahead to probability theory and statistical
inference proper.
,4a

2.3
2.3

luooking ahead

Looking ahead

The most important drawback of descriptive statistics is that the study of


the observed data enables us to draw certain conclusions which relate tpnlr
to the data in hand. The temptation in analysing the above income data is to
attempt to make generalisations beyond the data in hand, in particular
about the distribution of ineome in the UK. This- however, is not possible in
the descriptive statistics framework. ln order to be able to generalise
model' the distribution of income in the
beyond the data in hand weneed
UK and not just
the observed data in hand. Such a general
'model- is provided by probability theory to be considered in Part Il. lt
turns out that the model provided by probability theory owes a 1ot to the
earlier developed descriptive statistics. ln partieular, most of the concepts
which form the basis of the probability model were motivated by the
descriptive statistics concepts eonsidered above. The concepts of measures
of location, dispersion and shape, as well as the frequency curve, were
transplanted into probability theory with renewed interpretations. The
frequencl curve when reinterpreted becomes a density function purporting
real world phenomena. ln particular the Pearson
to model observable
family of frequency curves can be reinterpreted
as a family of density
in
functions. As for the various measures- they will now be reinterpreted
terms of the density function.
Equipped with the probability model to be developed in Part 11 we can
go on to analyse observed data (now interpreted as generated by some
assumed probability model) in the context of statistical inference proper',
the subject matter of Part 111.ln such a context we can generalise beyond
the observed data in hand. Probability theory and statistical inference will
enable us to construct and analyse statistical models of particular interest in
econometrics', the subject matter of Part lV.
ln Chapter 2 we consider the axiomatic approach to probability which
forms the foundation for the discussion in Part ll. Chapter 3 introduces the
concept of a random variable and related notions; arguably the most widely
used concept in the present book. ln Chapters 4--10 we develop the
mathematical framework in the context of which the probability model
could be analysed as a prelude to Part 111.
:to

tdescribe'

Additional references

PART

11

Probability theory

Probability

'Why do we need probability theory in analysing observed dataf?' ln the


in the previous chapter it was
descriptive study of data considered
emphasised that the results cannot be generalised outside the observed data
under consideration. Any question relating to the population from which
the observed data were drawn cannot be answered within the descriptive
statistics framework. ln order to be able to do that we need the theoretical
framework offered by probability theory. ln effect probability theory
molt?! which provides the logical foundation of
develops a matbematical
statistical inference procedures for analysing observed data.
ln developing a mathematical model we must first identify the important
features, relations and entities in the real world phenomena and then devise
the concepts and choose the assumptions with which to project a
generalised description of these phenomena', an idealised picture of these
of its
phenomena. The model as a consistent mathematical system has a
own' and can be analysed and studied without direct reference to real world
phenomena. Moreover, by definition a model should not bejudged as
because we have no means of making suchjudgments
(seeChapter
or
approximation
model
only
A
bejudged
to the
26).
as a
or
can
with
grips
explain
if
enables
it
the
'reality' it purports to
us to come to
phenomena in question. That is, whether in studying the model's behaviour
the patterns and results revealed can help us identify and understand the
real phenomena within the theory's intended scope.
The main aim of the present chapter is to construct a theoretical model
for probability theory. ln Section 3. 1 we consider the notion of probability
itself as a prelude to the axiomatisation of the concept in Section 3.2. The
probability model developed comes in the form of a probability space (S,
P( )).ln Section 3.3 this is extended to a conditional probability space.
'life

'true'

'false',

'good'

'better'

,%

'

Probability

3.1

The notion of probability

The theory of probability had its origins in gambling and games of chance
in the mid-seventeenth eentury and its early history is associated with the
names of Huygens, Pascal, Fermat and Bernoulli. This early development
of probability was rather sporadic and without any rigorous mathematical
foundations. The first attempts at some mathematical rigour and a more
sophisticated analytical apparatus than just combinatorial reasoning, are
credited to Laplace, De Moivre, Gauss and Poisson (see Maistrov (1974)).
Laplace proposed what is known today as the classical definition of

probability:
Dehnition 1
If a random experiment can rtrsu/r in N mutuall-v exclusive and
equally likely outcomes and if NA oj' rtasp outcomes result in lr
then te probability of A is desned !?.p
occurrence oj' te event
a4,

N
J'(.4) c=..-..J-.
N
To illustrate the definition let us consider the random experiment of tossing
The set of al1
a fair coin twice and observing the face which shows up.
equally likely outcomes is
S

)(SF), (FS), (SS), (TF)l,

Let the event

-4

'observing

be

With

4.

at least one head (S)', then

.4 )(.J.fF). TH), (J.ff.f)).


=

Applying the classical definition in the above


Since Nz 3, P(A)
straightforward
but in general it can be a tedious exercise
example is rather
Moreover, there are a number of
1968/.
in combinatorics (see Feller (
this
definition of probability, which render it
serious shortcomings to
foundation
for probability theory. The obvious
totally inadequate as a
limitations of the classical approach are:
it is applicable to situations where there is only a jlnite number of
(i)
possible outcomes; and
likely' condition renders the definition circular.
the
(ii)
Some important random experiments, even in gambling games (inresponse
to which the classical approach was developed) give rise to a set of infinite
until it turns up
outcomes. For example, the game played by tossing a coin
possible
outcomes S (4S), (TS),
heads gives rise to the infinite set of
could flip a coin
somebody
(FTf1), (TFTS),
it is conceivable that
likely' is
indefinitely without ever turning up heads! The idea of
=t.

Sequally

.);

iequally

3.1

The notion

of probability

synonymous with equally probable', thus probability is defined using the


idea of probability! Moreover, the definition is applicable to situations
symmetry exists, which raises not only the
where an apparent
question of circularity but also how this definition can be applied to the case
of a biased coin or to consider the probability that next year's rate of
likely' outcomes
inflation in the UK will be 10oz,z7Where are the
and which ones result in the occurrence of the event? These objections were
well known even by the founders of this approach and since the 1850s
several attempts have been made to resolve the problems related to the
'equally likely' presupposition and extend the area of applicability of
probability theory.
The most intluential of the approaches suggested in an attempt to tackle
the problems posed by the classical approach are the so-called frequency
approach had its
and subjective approaches to probability. Therwltpncy
origins in the writings of Poisson but it was not until the late 1920s that Von
Mises put forward a systematic account of the approach. The basic
argument of the frequency approach is that probability does not have to be
restricted to situations of apparent symmetry (equallylikely) since the
notion of probability should be interpreted as stemming from the
observable stability of empirical frequencies.
For example, in the case of a
(S) is not because there are
fair coin we say that the probability of
two equally likely outcomes but because repeated series of large numbers of
trials demonstrate that the empirical frequency of occurrence of
'converges' to the limit as the number of trials goes to infinity. lf we denote
by n.4 the number of occurrences of an event zt in n trials, then if
'objective'

iequally

-4

..4

1im na
ex

11

PA
,

t1:fl

PA.
Fig. 3. 1 illustrates this notion for the case of #=
we say that #(z4)=
in a typical example of 100 trials. As can be seen, although there are some
twild fluctuations' of the relative frequency for a small number of trials, as
these increase the relative frequency tends to
(convergearound ).
Despite the fact that the frequency approach seems to be an improvement
over the classical approach, giving objective status to the notion of
probability by rendering it a property of real world phenomena, there are
as n goes to
some obvious objections to it. tWhat is meant by
infinity'''?' l-low can we generate infinite sequences of trials'?' 'What happens
to phenomena where repeated trials are not possible'?'
The subjecttve approach to probability renders the notion of probability
a subjective status by regarding it as degrees of belief' on behalf of
individuals assessing the uncertainty of a particular situation. The
tsettle'

ilimit

Probability

36
1

.0

0.9
0.8
0.7

..c.
(ru)

0.0

0. 5
0.4
0.3
0.2
0.1
10

20

Fig. 3. 1. Observed
tossings.

30
relative

l
50

40

I
60

70

l
80

frequency of an experiment

1
90

I
100

with 100 coin

protagonists of this approach are interalia Ramsey ( 1926), de Finetti ( 1937),


Savage ( 1954), Keynes ( 192 1) and Jeffreys ( 196 1),.see Barnett ( 1973) and
Leamer (1978)on the differences between the frequency and subjective
approaches as well as the differences among the subjectivists.
Recent statistical controversies are mainly due to the attitudes adopted
towards the frequency and subjective definitions of probability. Although
these controversies are well beyond the material covere in this book, it is
advisable to remember that the two approaches lead to alternative methods
of statistical inference. The frequentists will conduct the discussion around
what happens
average', and attempt to develop
the long-run' or
objective', procedures which perform well according to these criteria. On
the other hand, a subjectivist will be concerned with the question of revising
prior beliefs in the light of the available information in the form of the
observed data, and thus devise methods and techniques to answer such
questions (see Barnett ( 1973)). Although the question of the meaning of
probability was high on the agenda of probabilists from the mid-nineteenth
century, this did not get in the way of impressive developments in the
subject. ln particular the systematic development of mathematical
techniques related to what we nowadays call limit theorems (see Chapter 9).
These developments
the work of the Russian School
were mainly
tin

-on

(Chebyshev, Markov. Liapounov and Bernstein). By the 1920s there was a


wealth of such results and probabilith' began to grow knto a systematic body
of knowledge. Although various people attempted
of
a systematisation
probability it was the work of the Rtlssian mathematician
Kolmogorov
which proved to be the cornerstone for a systematic approach to

The axiomatic

approach

managed to relate the concept of


probability theory. Kolmogorov
that
of
integration
probability to
theory and exploited to the
a measure in
of functions on the one
theory
and
the
analogies
between
theory
the
set
full
and
variable
other.
the
of
random
ln a monumental
concept
hand
on the
a
monograph in 1933 he proposed an axiomatisation of probability theory
establishing it once and for a1l as part of mathematics proper. There is no
doubt that this monograph proved to be the watershed for the later
development of probability theory growing enormously in importance and
applicability. Probability theory today plays a very important role in many
disciplines including physics, chemistry, biology, sociology and economics.

3.2

The axiomatic

approach

The axiomatic approach to probability proceeds from a set of axioms


(accepted without questioning as obvious), which are based on many
centuries of human experience, and the subsequent development is built
deductively using formal logical arguments, like any other part of
mathematics such as geometry or linear algebra. ln mathematics an
axiomatic system is required to be complete, non-redundant and consistent.
By complete we mean that the set of axioms postulated should enable us to
prove every other theorem in the theory in question using the axioms and
refers to the
mathematical logic. The notion of non-redundancy
impossibility of deriving any axiom of the system from the other axioms.
Consistency refers to the non-contradictory
nature of the axioms.
A probability model is by construction intended to be a description of a
chance mechanism giving rise to observed data. The starting point of
such a model is provided by the concept of a vandom t?xpt?rrntrnr describing
a simplistic and idealised process giving rise to observed data.
Djinition

wllfcll satishes
,4 random experiment, denoted /?y 4 is an f?xpt?rrrlt?;r
conditions:
the Ji?lltpwng
alI possible distinct f.?lkrct?nlty.s are knfpwn a priori;
(f-l)
il1 /ny particular trial rt? ourtrol'tlt/ is not known a priori; and
(/?)
it trfkn be repeated unde. Eftpnlftrtnl conditions.
(c)
Although at first sight this might seem as very unrealistic, even as a model of
a chance mechanism, it will be shown in the following chapters that it can be
extended to provide the basis for much more realistic probability and
statistical models.
The axiomatic approach to probability theory can be viewed as a
In an attempt to
formalisation of the concept of a random yxpcrzlcnr.

Probability

formalise condition (a) all possible distinct outcomes are known a priori,
possible distinct
Kolmogorov
devised the set S which includes
outcomes' and has to be postulated before the experiment is performed.
tall

Dejlnition J
The samplespace,

denoted by S, is dejlned to be the set

outcomes (?J te
elementary events.

The elements

'.

experiment

t#'

t#*a//

possible

are

called

Example
Consider the random experiment Ji' of tossing a fair coin twice and
observing the faces turning up. The sample space of & is

l(f1T), (TS), (HH4' (TT)l,

with (ST), (TS), (SS), (TT) being the elementary events belonging to S.
The second ingredient of ($* to be formulated relates to (b)and in particular
to the various forms events can take. A moment's reflection suggests that
there is no particular reason why we should be interested in elementary
outcomes only. For example, in the coin experiment we might be interested
least one S',
at most one H' and these are not
in such events as z4l particular
in
elementary events;
.,42

'at

,41 t(z;T),

TH4, HH)

-4c l(SF),

CTH), (FT)l

and

are combinations of elementary events. A1lsuch outcomes are called e,ents


associated with the sample space S and they are defined by combining'
elementary events. Understanding the concept of an event is crucial for the
discussion which follows. lntuitively an event is any proposition associated
with which may occur or not at each trial. We say that event
occurs
when any one of the elementary events it comprises occurs. Thus, when a
trial is made only one elementary event is observed but a large number of
events may have occurred. For example, if the elementary event (ST)
and
have occurred as well,
occurs in a particular trial,
Given that S is a set with members the elementary events this takes us
immediately into the realm of set theory and events can be formally defined
('t..J5
- union,
to be subsets of S formed by set theoretic operations
complementation) on the elementary events (seeBinmore
intersection,
z41

'

.,4l

.,12

'c7'

$-'

3.2

The axiomatic

approacb

( 1980:. For example,

Two special events are S itself, called the sul-e plllrll and the impossible event
Z defined to contain no elements of S, i.e. .Z yf j; the latter is defined for
=

completeness.
A third ingredient of &' associated with (b) which Kolmogorov had to
formalise was the idea of uncertainty related to the outcome of any
particular trial of Ji. This he formalised in the notion of probabilities
attributed to the various events associated with $ such as #(,4j), #(.,4c),
expressing the likelihood' of occurrence of these events. Although
attributing probabilities to the elementary events presents no particular
mathematical problems, doing the same for events in general is not as
and
straightforward, The difficulty arises because if
are events
z1c, etc.,
k.p
S z41, zlc S ch
are also events because
of
and
implies the occurrence or
the occurrence or non-occurrcnce
not of these events. This implies that for the attribution of probabilities to
make sense we have to impose some mathematical structure on the set of all
which reflects the fact that whichever way we combine these
events, say
events, the end result is always an event. The temptation at this stage is to
define .F to be the set of a1l subsets of S, called the pwt!r ser; surely this
covers all possibilities! ln the above example the power set of S takes the
form
.g-f:

-4:

.,4a,

z4l

.4a,

z4:

x4c,

zzlc

-41

-42

.,4l

r%

,.F

lS,
t(.HT)), )(Tf1)), )(1S)), )(TT)), )(F11), (ST)),
(4'.Ff1),(1/f1)),
(TT')), .t(1-1'F),(HHII',
)(f1T'), (TT)),
(TT)), .t(fT), (TH), (ff'f1)l,.t(f.fT'), (Tf.f), (TT)),
((ff'.f),
tlfff/'l,(TT), (Tff)l, (CHH), (FT'), (HT)l).
.?,

t('f'f:l),

lt can be easily checked that whichever way


end up with events in .LEFor example,

tlflffl,('FT))

ch

(fT')l
(('f'f1),

(T1-1)) k.p t(T1-1),(1-7T)l(CHHS,

we combine any events

in ,F we

(3 6,:/;
tl'fffl.tTfll,

(ffT)l e z:/k etc.

lt turns out that in most cases where the power set does not lead to any
inconsistencies in attributing probabilities we dene the set of events .F to
be the power set of S. But when S is intinite or uncountable (it has as many

40

Probability

elements as there are real numbers) or we are interested in some but not a11
ft
possible events, inconsistencies can arise. For example, if S
)
zlf
S and #(z4) a > 0, #f,
such that
1
ch Aj ,Z (#.j), isjzzz1, 2,
Then #4.)
where .P(.,4)refers to the probability assigned to the event
z' 1 P(.,4) )J,z 1 a > 1 (seebelow), which is an absurd probability- being
p')
greater than one; similar inconsistencies arise when S is uncountable. Apart
from these inconsistencies sometimes we are not interested in a1Ithe subsets
of S. Hence, we need to define ,LF independently of the power set by
structure which ensures that no
endowing it with a mathematical
inconsistencies arise. This is achieved by requiring that .LF has a special
mathematical structure, it is a c-field related to S.
.,41

z42,

U,i)-

zztf

-4.

Dnron

lf :
is called a J-field
Let ..F be a set q subsets t#' S.
lnt/l/complementation:
(f)
e: r/ltpnWG .F - closure
zzlfl
zzlf
1* 1 2,
clllsure
then ( I-%
6: ,F
g
()
1
,.t+-

.4

.:)

.kj

utlioll.
ctprfflrlh/t.?

Note that

t.'/??J
ullder

(i) and (ii)taken together imply the following:


S e .J; because

(iii)
(iv)
(v )

.z'1-=

.,4

k.p

S;

,F (from(iii) V= .(J (E

and
,5.)
.p1.
1 2
t h en ((-) I i ) G .:F
These suggest that a c-field is a set of subsets of S which is closed under
complementation- and countable unions and intersections. That is, any of
these operations on the elements of will give rise to an element of lt can
be checked that the power set of S is indeed a c-field, and so is the set
.f3'

.#-),.

(EE
.k9j

-4

..9

-#'

but the set C


What we can

)HH), (T/f), (FT)),

(.t(ST)),

.?A)

Z,

.),

t(fT),( TH)( is not because ZIC, S'#C, t(ST), (FS) )#C.


t
do, however, in the latter case is to start from C and construct
j

the minimal cn#e'/J generated by its elements. This can be achieved by


extending C to include all the events generated by set theoretic operations
(unions, intersections, complementations) on the elements of C. Tlaen the
minimal c-field generated by C is
Z, ((J-1F), (FS)), t(SS), (FT)))
c(C).
and we denote it by
This way of constructing
a c-field can be very useful in cases where the
events of interest are fewer than the onesgiven by the power set in the case of
each H or F
a finite S. For example. if we are interested in events w'ith one of
c-field
and
to be the power set,
can do as
there is no point in defining the
...kL.
well with fewer events to attribute probabilites to. The usefulness of this
method of constructing c-fields is much greater in the case where S is either
in such cases this method is indispensable. Let us
infinite or uncountable;
.%

.6

ts',

3.2

The axiomatic

eonsider an example
such a c-field.

approach

where S is uncountable

and discuss the construction

of

Example
Let S be the real line R

be
J

t6BxL

x c 2)

tx:

:c)

x<

<

where Bx

'.r

the set of events of interest

) and

c: z .%x

lj

x)

'.'.f
-

This is an educated ehoice, whieh will prove to be very useful in the sequel.
How can we construct a c-field on E2?The definition of a c-field suggests

that if we start from the events Bx, x 6: R then extend this set to include Xx
andtake countable unions of Bx and X'xwe should be able to define a c-field
on (R, c(J) -- the mfnmll c-,/it!?J lenerated b t'Ile t?rt?nrs Bx, x iE Q. By
definition Bx G c(.f). lf we take complements of Sx: X'x z.. e R, z > x
(x, :7- ) e c( J4 Taking countable unions of Bx : UJ- 1 ( :f- x (1/))j
( :y- x) s c(./). These imply that c(.f) is indeed a c-field. ln order to see how
large a collection c(J') is we can show that events of the form (x, ), gx,
also belong to c(J), using set theoretie operations
as
(x, z) for x < c, and
follows'.
'

.7

.:ys),

.cc

tx)

(x, (y.)

gx, :y:. ) (
=

(x, z)
fx )

'ctp
-

'L<) ,

xj c c(./).
.Y) g

tr(J),

gc'.1x. ) e:c(J),

.x(l

>;.

(')

11 =

ct., ,

1.

x, x

1
-

/1

e:c(J).

This shows not only that t#J) is a c-field but it includes almost every
conceivable subset (event) of R, that is, it coincides with the c-field
The
generated by :7r?.1. set of subsets of R, which wedenote by i.e. tr(J)
c-field will play a very important role in the sequel; we call it the Borel
#c/J on R.
.?4

.?d,

..'#

solved the teehnical problem of possible inconsistencies


in
attributing probabilities to events by postulating the existence of a c-field
'..F associated with the sample space S, Kolmogorov went on to formalise
the concept of probability itself.
Having

Probability

42

Dqflnition

.5

Probability is dhned as a set function on ,W'satisfying thefollowing

axioms:

6 +j.
Axiom 1: #(.g1)): 0 for
1,.and
Axiom 2: PS)
ZI5-)
,5;
ft
Axiom 3:
IL-1
1 #(,4f)
(' ylf ) : is
that
sequence of muttally exclusive events in
called
countable
additivity).
zlf ra Aj=
for i #.j)
-4

'l7crl'

#IU

,4fl

,),/-

.g

In other words, probability is defined to be a set function with 'F as its


domain and the closed real line interval g0,1J as its range, so that

#(

'

):,/-

I0, 1q.

-+

The first two axioms seem rather self-evident and are satisfied by both the
classical as well as frequency definitions of probability. Hence, in some
sense,the axiomatic definition of probability
the deficiencies of
of probability
the other definitions by making the interpretation
dispensable for the mathematical model to be built. The third axiom is less
obvious, stating that the probability of the union of unrelated events must
be equal to the addition of their separate probabilities. For example, since
((SF)l rn
Z,
tovercomes'

(IHHII
=

#(.r(JfF))

kl

(f'f)))

#(l(ST'))) +#(((ffS)))

+
- .t .)
=

interpretation'
result. To
Again this coincides with the
summarise the argument so far, Kolmogorov formalised the conditions (a)
and (b)of the random experiment ($ in the form of the trinity (k%, P ))
comprising the set of a1l outcomes S v-the sample space,a c-field c'Fof events
related to S and a probability function #( assigning probabilities to events
For the coin example, if we choose .F t)(SF)), ((TH), (HH), (F7)),
in
Z, 5') to be the c-field of interest, P( ) is dened by
dfrequency

..%

'

.)

si'

'

Psl-

1,

.13(.43):=:0,

#(t(z.fT)))=.t,

Because of its importance the trinity (S,

,%

P(((TH),
P

'

)) is given

HH), (TT')))=.1.
a name.

Dejlnition 6

S endowed witb
-4 sample
satisfying
axioms 1-3 is
function
.sp/cre

.F and a probability
a c-jeld
called a probability space.

As far as condition (c)of & is concerned, yet to be fonnalised, it will prove


of paramount importance in the context of the limit theorems in Chapter 9,
as well as in Part 111.

3.3 Conditional

probability

43

Having defined the basic axioms of the theory we can now proceed to
derive more properties for the probability set function using these axioms
and mathematical logic. Although such properties will not be used directly
what we called a probability model, they will be used
in constructing
indirectly. For this reason some of these properties will be listed here for
references without any proofs:
P1)
(#J)
P34
(P4)

(#.5)

#(W) 1 J'(..4),
=

.y1

E:

q.t'

f'4Z) 0.
f.J' ! c X2, P(X1) G P(X2), 1, X2 6 .X'
P(.,1l) + Ptz4al Pt,4j ro
8.,1.1%.)
t#*
1 is a monotone sequence
1.Jjz4,,)J,.:
=

z4

./l

v4c)

#(lim,,....

..42).

events in

.#-

then

a4,,)

limpi...wP(.4,,).
A monotone sequence of events in ,F can be either increasing (expanding)
z4p,
z4l
z1l
c yla c
c
c A,
uo
or
or decreasing (contracting), i.e.
- l
z4,,,
z4,,
z4p,
:u)
respectively. For an increasing sequence lim,,-.
.4,, j
1
z4,,.
P5 is known as the
and for a decreasing sequence lim,,-,
1
continuity propert of the set function #( ) and plays an important role in
probability theory. In particular it ensures that the distribution function
(see Section 4.2) satisfies certain required conditions see also Section 8.4 on
martingales.
.

'

'

'

'

0,*,

.,1,,

'

U,*,.

'

Conditional probability
One important extension of the above formalisation of the random
experiment t.$' in the form of the probability space (.$, #( )) is in the
probabilities.
direction of conditional
So far we have considered
probabilities of events on the assumption that no information is available
relating to the outcome of a particular trial. Sometimes, however,
additional information is available in the form of the known occurrence of
z4.
For example, in the case of tossing a fair coin twice we might
some event
know that in the first trial it was heads. What difference does this
information make to the original triple (S, P ))? Firstly, knowing that
the first trial was a head, the set of all possible outcomes now becomes
,t)6

'

,%

'

SA

)(.f/T'), (SS)),

sincetTW), (TT) are no longer possible. Secondly, the c-field taken to be the
power set now becomes
.F,

(S.a

((f.fT)), ((f.fS))).

,g,

Thirdly the probability

#.:(,,4)=1,

set function becomes

.P,4(,3)=0,

#x(l(ST)))=,

#x(t(SS)))=-i'.

Probability

Thus, knowing that the event


least one H' has occurred (in the first
trial) transformed the original probability space (,,
P )) to the
conditional probability space (5'u,.FA #4( )).The question that naturally
arises is to what extent we can derive the above conditional probabilities
without having to transform the original probability space. The following
formula provides us with a way to calculate the conditional probability.
*at

.4

.%

'

'

,4)

/',14z11)

>

z4)

#(z4I

z'4,4,ra
#4z1)

In order to illustrate this formula let


then since #(-4j)
#(.,4) J, .P(z41fo
=t,

/ll

zz

#a(X1)

P(XI

j,l 4

I
-4)

t(ST)) and
#( t(1-JT')))

.,4

t(/fT), (HH4)

=.t,

.4)

=j=-,

as above.
Note that .P4-4)>0 for the conditional
Using the above rule of conditional

PlAl

.?121'--P(,41

f--

,4cl

'

probabilities to be delined.
probability we can deduce that

Pfz4cl

z11) #4d1),
= .P(.,42
.

(3.8)
for

.,41,

zzlc

e'

(3.9)

,?>

This is called the mullp/fccllfon rule. Moreover, when knowing that


occurred does not change the original probability of z4c, i.e.

-42

has

uIX.,4l
l
-42)-

we say that

ztj

Independence

and

#tz'1ll.
zlc

are independent.

is very different from mutual t?xc/l/sllt?ne'ss in the sense that


but
Ptz4j
..41rn
# .!X.41)and vice versa can both arise.
Z
lndependence is a probabilistic statement which ensures that the
occurrence of one event does not influence the occurrence (or nonoccurrence) of the other event. On the other hand, mutual exclusiveness is a
statement which refers to the events (sets) themselves not the associated
probabilities. Two events are said to be mutually exclusive when they
cannot occur together (see exercise 4).
The careful reader would have noticed that the axiomatic approach to
probability does not provide us with ways to calculate probabilities for
individual events unlike the dassical or frequency approaches. What it
provides us with are relationships between the probabilities of certain
events when the events themselves are related in some way. This is a feature
of the axiomatic approach which allows us to construct a probability modcl
without knowing the numerical values of the probabilities but still lets us
deduce them from empirical evidence.
.42

-42)

Conditional probability

45

Impovtant concepts

random experiment;
classical, frequency and subjective definitions of probability;
sample space, eiementary events's
c-field. minimal c-field generated by eventss Borel field;
probability set function, probability space (S, P( ));
conditional probability, independent events, mutually exclusive
events.
e?6

'

Questions
Why do we need probability theory in analysing observed data?
What is the role of a mathematical model in attempting to explain real
P henomena'?
Compare and contrast the classieal and frequency definitions of
probability. How do they differ from the axiomatic definition'?
Explain how the axiomatic approach formalises the concept of a
random experiment 4 to that of a probability space (S,
)).
Why do we need the coneept of a c-field in the axiomatisation
of
probability? Explain the concept intuitively.
Explain the concept of the minimal c-field generated by some events
using the half-closed intervals ( uo, xj, x (E R on the real line as an
t%p

example.

Explain intuitively the continuity property of the probability set


function #( ).
Discuss the concept of conditional probability and show that #( 1
for some e: is a proper probability set funetion.
'

'

.,1

z4)

.t#'

Exerdses
Consider the random experiment of throwing a dice and you stand to
lose money if the number of dots is odd. Derive a c-field which will
enable you to consider your interests probabilistically. Explain your
choice.
Consider the random experiment of tossing two indistinguishable fair
coins and observing the faces turning up.
Derive the sample space S, the c-field of the power set L.F'and
(i)
define the probability set function P( ).
Derive the c-field generated by the events )SS) and (T'l).
lf you stand to lose a pound every time a coin turns up
what is the c-field of interest'?
'

'heads'

Probability

46

Consider the effect on S, P ))when knowing that event


$at least one F' has occurred and dene the new conditional
#a( )).Confirm that for the event
probability space (,$x,,.#r4,
tails,
- two
.6

.4

'

-41

'

P,4(z41)
=

8.4

c5

z11)
.

17(X)

Consider the events IHH). and (FT) and show whether they
are mutually exclusive or and independent.
Consider the random experiment of tossing a coin until it turns up
theads'. Define the sample space and discuss the question of detining a
c-field associated with it.
Consider the random experiment of selecting a card at random from an
ordinary deck of 52 cards.
Find the probability of
(i)
.41 - the card is an ace;

and

.,42- the card is a diamond.


Knowing that the card is a diamond show how the original
(S,
#( )) changes and calculate the probability of
,d

'

yla - the card is the ace of diamonds.


Find 174..11
ro

z4cl

derived in (ii).

and compare

Define two events which are:


mutually exclusive and
(a)
mutually
exclusive but
(b)
mutually
exclusive
not
(c)
(d)
not mutually exclusive

it with the probability

of

..43

independent;
not independent;

but independent; and


and not independent.

Additional references
Barnett
(1976).

Giri ( 1974); Mood, Graybill


(1973);

and Boes ( 1974); Pfeiffer ( 1978/ Rohatgi

CHAPTER

Random variables and probability

distributions

In the previous chapter the axiomatic approach provided us with a


mathematical model based on the triplet (S, P( )) which we called a
probability space, comprising a sample space S, an event space + (c-field)
and a probability set function P( ). The mathematical model was not
developed much further than stating eertain properties of P ) and
introducing the idea of conditional probability. This is because the model
based on (S, Pq ))does not provide us with a flexible enough framework.
,#t

'

-f6

'

The main purpose of this section

is to change this probability

space

by

mapping it into a much more flexible one using the concept of a random
varable.
The basic idea underlying the construction of S, #( ))was to set up a
framework for studying probabilities of events as a prelude to analysing
problems involving uncertainty. The probability space was proposed as a
formalisation of the concept of a random experiment & One facet of tf'
which can help us suggest a more flexible probability space is the fact that
when the experiment is performed the outcome is often considered in
relation to somc quantisable attribute; i.e. an attribute which can be
represented by numbers. Real world outcomes are more often than not
expressed in numbers. lt turns out that assigning numbers to qualitative
outcomes makes possible a much more flexible formulation of probability
theory. This suggests that if we could find a consistent way to assign
numbers to outcomes we might be able to change (,S, #( ))to something
more easily handled. The concept of a random variable is designed to do
just that without changing the underlying probabilistic structure of
(S, % P( )).
,#7k

'

,%

'

'

Random variables

and probability

distributions

4
LHHi

j(J?s))
1(Fr)l

(/?F)

1(8r) (88) (r/?)l


14F8) (/?r), (rrlt
(rr)#
148/-/),
jjsr), (rsjj
,

(r8)
(rr)

Fig. 4. 1. The relationship


probability set function.

4.1

I 1 1 1 1 I

0 0.2 0.4 0.6 0.8 1.O

between

sample

space,

c-field

and

The concept of a random uriable

Fig. 4. 1 illustrates the mathematical model (,$', #( )) for the coin-tossing


example discussed in Chapter 3 with the c-tield of interest being .F= (S, Z,
)(TT)),
l(HH,(TT))l(TH),(11T')),
't(1.T),(TH),(H.H)l,
((1.f1.f)),
)(SF),(T'S),(FF))).
The probability set function #(') is defined on .F and
1j, i.e. #(.) assigns probabilities to the events
takes values in the interval
in ,F. As can be seen. various combinations of the elementary events in S
define the c-field .F (ensure that it is a c-fieldl) and the probability set
function #(.) assigns probabilities to the elements of .F.
The main problem with the mathematical model (S,
#( )) is that the
general nature of S and .F being defined as arbitrary sets makes the
of #( ) N'ery difficult; its domain being a c-field
mathematical manipulation
of arbitrary sets. For example, in order to define #( ) we will often have to
derive all the elements of .F and tabulate it (a daunting task for large or
infinite
to say nothing about the differentiation or integration of such a
r9;

'

r0,

,@

'

'

./-s),

set function.

Let us consider the possibility of defining a function Ar( ) which maps S


directly into the real Iine R, that is,
'

A'(

): S

-+

Rx,

assigning a real number xl to eaeh sl in S by xl


xl g R. sl e S. For
example, in the coin-tossing experiment we could define the function A'
the number of heads'. This maps all the elements of S onto the set Rx
.t0, 1, 2) see Fig. 4.2.
-tsll,

The question arises as to whether

every function from S to R will provide

4.1

The concept of a random

Fig. 4.2. The random


example.

variable

variable

Ar-number of

49

Sheads'

in the coin-tossing

us with a consistent way of attaching numbers to elementary events',


consistent in the sense of preserving the event structure of the probability space (S, t'f #( )). The answer, unsurprisingly, is certainly not.
This is because, although
is a function defined on S, probabilities are
and
in
qF
the issue we have to face is how to dene the
assigned to events
values taken by X for the different elements of S in a way which preserves the
ln order to illustrate this let us return to the earlier
event structure of
To
value
of X, equal to 0, 1 and 2 there correspond some
each
example.
of
i.e.
S,
subset
'

.?/f

'r))

-...

t(T'
(z?'z')),
tt-rffl,
,

1
2 .:(1-1.r.f)),
-+

-.+

and we denote it by
-Y-

i(0)

t(TT)),

A--

14

1)= t(TS), (f1T)),

X-

1(2)::=,

)(Sf1)),

'(

used by abuse of mathematical


using the inverse mapping
) (sinverse
language). What we require from A' - 1( ) (or .Y) is to provide us with a
correspondence between Rx and S which reflects the event structure of
that is, it preserves unions, intersections and complements. ln other words,
x'

.#1

Random variables

and probability distributions

i
for each subset N of Rx the inverse image X - (N) must be an event
'
in ,F. Looking at X as defined above we can see that X - (0)G,?A,
k..p 1
X - 1 (2)c
X - 1(
X - 1 (/t.0 J1 t..p f( 2 ))g
X - 1( 1) G
))(E
1((
1) t.p
g
A' that is, -Y( ) does indeed preserve the event structure of
Ry defined by F(tff1F) )=
,X On the other hand, the function Y'( ): S
YI..fLHH)) 1, y(.t TH j ) F( TT))
0 does nOt preserve the event structure
140)
1(
of .Lt- since F # F - 1) ( ,i/J This prompts us to define a random
variable A' to be any such function satisfying this event prpst?rnfng condition
in relation to some c-field defined on Rx; for generality we always take the
.%

to)

cz

.t2))

.t#',

'

.ft

.%

-+

'

.#t

Borel field

,?d

.%

on R.

Dhnition

A random variable X is a p-,(?l valued function


S to R wllfc
B G 4 on E, tbe set
satjles the c'(?nlll't?n that jr ptktr/? Borel
X - 1(/)
in
s.' .Y(.$)e:B, s g s') is an
.#(?rl1

.$t?r

.?8

'gt?rlr

Three important features of this definition are worth emphasising.


A random variable is always defined relative to some specific c-

(i)

field
R is a
ln deciding whether some function F( ) : S
(ii)
variable we proceed from the elements of the Borel field
of the c-field rF and not the other way around.
variable'.
(iii)
A random variable is neither
nor
Let us consider these important features in some more detail in
of the concept of a random
enhance our understanding
undoubtedly the most important concept in the present book.
,%'

-+

'

..,d

trandom'

random

to those

'a

order to
variable)

The question is X( ) : S -+ R a random variable?' does not make any


sense unless some c-field ..F is also specified. ln the case of the function Xnumber of heads, in the coin-tossing example we see that it is a random
variable relative to the c-field
as defined in Fig. 4. 1. On the other hand, F,
variable relative to ..R This, however, does
random
above,
is
nt?r
defined
a
as
preclude
from
variable with respect to some other crandom
F
being
not
a
'y
(S,Z,)(S1'1),(f1F)),
((Ff.f),(TT)) ) lntuition
field ,.Fy; for instance
valued
real
function
.Y(
S
R we should be able to
that
for
suggests
any
):
variable.
that
random
ln the previous
such
Ar
define a c-field
S
is
on
a
section we considered the c-field generated by some set of events C.
Similarly, we can generate tr-fields by functions A-( ): S -+ R which turn
Indeed
above is the nlfnfrnf?/ o'zfleld
Ar( ) into a random variable.
generated ?v F, denoted by c(F). The way to generate such a minimal c-field
14
is to start from the set of events of the inverse mapping F - ), i.e.
1(0)
1
J-field
.t(.fF), (HH);
by
and generate a
FJ' - ( 1) and t(TS), (FT))
taking unions, intersections and complements. ln the same way we can see
'

,%

-+

rh

'

.%

'

'

The concept of a random variable

4.1

that the minimal c-field generated by .Y - the number of heads, c(Ar)


coincides with the c-field .LFof Fig. 4.2,. verify this assertion. ln general,
however, the c-field .F associated with S on which a random variable X is
defined does not necessarily coincide with c(Ar). Consider the function
X :(

): S

'

(R

-+

Xl( klSfflll

A' !('t(Tf1)))= X1('t(1T)l)

Xl('t(TT)))

1,

140)

'(

'

(4.2)
)(TF)) iE

1)= t(ffJ1), (Tf1), (ffT)) iE ,,F (see Fig. 4.2), X(


tj ) s (E x, is a random variable on with respect to the c1
s
x(ST), (TS))) #
indeed
field ,.'F c(i-). But c(X:) (S, Z, (41-1f1),

since
i(

z%-1

.%

,#,

tot,
.:

.%

c(A-1) cu ..'F

c(aY).

The above example is a special case of an important general result where


A',, are random variables on the same probability space
-Y:, Xz,
(S,
P( )) and we define the new random variables
.

.%

'

Fz

X : + A'a + X s

(4.5)

c4Y;,)form an increasing sequence of c-fields in ln the above


i.e. c(Yj),
example we can see that if we define a new random variable ,Yc( ): S R by
,.'k

'

-Yztttf1'flll=

1,

A-2()(f1T)))= Xa()(Tf1)))

-+

A-2('t(TT)))=0,

+ X1 (seeTable 4. 1) is also a random variable relative to c(X);


then X
X is defined as the number of J'ls (see Table 4.1).
=

x'j

Note that

z1

is defined as

:at

least one H' and Xz as

Etwo

Jls'.

generated by random variables will prove


very useful in the discussion of conditional expectation and martingales (see
Chapters 7 and 8). The concept of a c-field generated by a random variable
enables us to concentrate on particular aspects of an experiment without
having to consider everything associated with the experiment at the same
time. Hence, when we choose to define a r.v. and the associated c-field we
make an implicit choice about the features of the random experiment we are
The above concept of c-fields

interested in.
il-low do we decide that some function .X( ): S R is a random variable
relative to a given c-field ,i.F?9From the above discussion of the concept of a
.

-+

and probability distributions

Random uriables

random variable it seems that if we want to decide whether a function X is a


random variable with respect to .F we have to consider the Borel field on
R or at least the Borel field
on Jx'. a daunting task, lt turns out, however.
that this is not necessary. From the discussion of the c-field c(J) generated
by the set J
x e: J@
) where Bx ( 'Lt, x(l we know that .t4 c(J) and if
.Y( ) is such that
.4

./dx

.t#.:

'

X '' ((
-

v.

xj

ft

.Y(
-)

.'

(-

xj

py.- s

,!;

e:S

j!

.kT'

for a11 ( -

then

A-- (B4

(s

-Y(s)e: B, s (F S 'j c .F

v-

x)Js

.??,

for a1l B g .A

ln other words, when we want to establish that A- is a random variable or


define #xt ) we have to look no further than the half-closed intervals
(x(Iand the c-field c(..f)ft they generate, whatever the range Rx. Let us
g(
yt xj, s 6E
to
use the shorthand notation .YtylGxlj. instead of f(s:
number
the
above
in
the
of
of Hs. with
argument
A' - the
consider
case
in Fig. 4.2.
respect to
'

':y-

'(x)

.t7-

1 x< 2
:;

,(3
,

f#

( TH (T T) )
-

(H

'r)-

0 .A.J' < 1

(HH)

(4.9)

1 % )',

and thus F -- (( - v- ).q) # .F for ). 0, y 1, i.e. F ( ) is not a random


however, f(s : F(s) ,6Ll.j1 G .k) fo r
variable with respect to With respect to
all )' e: R and thus it is a random variable.
The tenn random variable is rather unfortunate
because as can be seen
h?t?/- (1 vtll-l'tlble'', it i s a real
from the above definition A- is tleib t?lvalued function and the notion of probability does not enter its definition.
=

'

.../'),,

..

wrandotnn

Probability
an attempt

enters the picture after the random variable has been defined in
model induced by X.
to complete the malhematical

Tbe concept of a random variable

Table 4. 1

A' relative to .- maps S into a subset of the real line,


on 2 plays now the role of .k ln order to complete the
Common
assign probabilities to the elements B of
the assignment of probabilities to the events B (E @ must
the probabilities assigned to the corresponding events
need to define a set function #xl ):
E0,11 such that

variable

A random

and the Borel field


model we need to
sense suggests that
be consistent with
in Formally, we
.ut

.?4

.?4

.?#

-->

'

for all B G
in the case illustrated

For example,
p

).

()
x ( J

Px l (h0 l)

4.

1.4

p x ( tf j J) j

-1-

Px , ( 111)

2'
=

3-4
.

(4.10)

,.#.

in Table 4. 1
p x (ft 2,))

-l4

Px ( )0 l k.p
j

p x ( j()'j

h11)

k.p (
t

1 Px
,

j J).j

J- ?
4

etc.,

( ft0 ) rn t 1)) 0.
=

The question which arises is whether, in order to define the set function
#x( ), we need to consider al1 the elements of the Borel field 4. The answer is
that we do not need to do that because, as argued above, any such element
of can be expressed in terms of the semi-closed intervals ( :s, .xq. This
we can
implies that by choosing such semi-closed intervals
define #xt ) with the minimum of effort. For example, Px( ) fOr x, as defined
in Table 4. 1, can be defined as follows:
'

..#

Sintelligently',

'

'

As we can see, the semi-closed intervals were chosen to divide the real line at
the points corresponding to the values taken by X. This way of defining the
semi-closed intervals is clearly non-unique but it wll prove very convenient
in the next section.
The discerning reader will have noted that since we introduced the
concept of a random variable A'( ) on (S, .k P( )) we have in effect
'

'

Random variables

54

and probability

distributions

developed an alternative but equivalent probability space (R, Px ))


induced by X. The event and probability structure of (S, #4 )) is
#xt ))and the latter has a
preserved in the induced probability space (R,
much
to handle' mathematical structure; we traded S, a set of
arbitrary elements, for R, the real line, ,F' a c-field of subsets of S with 2..d, the
Borel field on the real line- and #( ) a set function defined on arbitrary sets
with #x( ),a set function on semi-closed intervals of the real line. ln order to
P( )) to
illustrate the transition from the probability
space (S,
(Rx,
Pxt )) let us return to Fig. 4. 1 and consider the probability space
of heads, defined above. As can
induced by the random variable z-number
variable
Fig.
4.3,
the
random
A'( ) maps S into k0,1, 2).
be seen from
1q, ( 'Js, 21 we can
intervals
semi-closed
the
Choosing
( vs, %, ( which
of
Borel
#xl
R
field on
forms the domain
generate a
).The concept of
a random variable enables us to assign numbers to arbitrary elements of a
as
set (S) and we choose to assign semi-closed intervals to events in
induced by X. By defining #xt ) over these semi-closed intervals we
complete the procedure of assigning probabilities which is consistent with
the one used in Fig. 4. 1. The important advantage of the latter procedure is
Px )) is a
that the mathematical structure of the probability space (R.
lot more flexible as a framework for developing a probability model. The
purpose of what follows in this part of the book is to develop such a tlexible
mathematical framework. lt must be stressed, however, that the original
probability space (S, #( )) has a role to play in the new mathematical
framework both as a reference point and as the basis of the probability
model we propose to build. Any new concept to be introduced has to be
related to (S, P( )) to ensure that it makes sense in its context.
..%

'

.%

'

.?d,

'

Ceasier

'

'

.%

'

..%

'

'

':yo,

'

.t7-

'

.@,

'

-i/')

'

-%

'

1(8H), ( rrll

1(/./s)) j(rrlk

CHHL

(TH3
t8f)
Fr)

s 1(rJ?)

(/./r))
(/.fr)
(
1
r/.?),(r7')t
1(8r) (THb (/-/8)1
,

0
!

.-

o o.2 o.4 0.6.-0.8


I

.))

to (Rx,?#,#x(

'

.0

0.2 O.4 0.6 0,8 1.O

Rx
Fig. 4.3. The change from (.S,,#-,17(

)) induced

by X.

The distribution and density functions

4.2

The distribution and density functions

4.2

ln the previous section the introduction of the concept of a random variable


(r.v.), X, enabled us to trade the probability space (,$', #( )) for
.gi

'

(R,
#xt )) which has a much more convenient mathematical structure.
The latter probability space, however, is not as yet simple enough because
Px( ) is still a set function albeit on real line intervals. ln order to simplify it
we need to transform it into a point function (a function from a point to a
point) with which we are so familiar.
The first step in transforming #xl ) into a point function comes in the
form of the result discussed in the previous section, that #xt ) need only be
defined on semi-closed intervals ( - cc,
x g R, because the Borel field 4
viewed
c-field
the
minimal
generated
by such intervals. With this
as
can be
view
of the fact that a1l such
proceed
in
mind
in
to argue that
we can
starting
intervals have a common
(
) we could conceivably define
point
function
a
.%

'

'

'

'

.x(1,

'point-

E:c,

F( ): ER--+
.

g0,11,

which is, seemingly, only a function of x. In effect, however, this function will
do exactly the same job as Px ). Heuristically, this is achieved by defining
F( ) as a point function by
'

for all x c R,
and assigning the value zero to F( - :y.). Moreover, given that as increases
the interval it implicitly represents becomes bigger we need to ensure that
F(x) is a non-decreasing
function with one being its maximum value (i.e.
.x

F(x1) % F(x2) if xl :t xc and limx- . F(.x) 1). For mathematical


also require F( ) to be continuous from the right.

reasons we

'

Dlflnition 2
Let ZYbe a ,-.!?. tljlned
I0. 11 dehned tv
F(x) #x((
=

(.,

(pll

x1)

'wy-

#(

.k',i

)).The ptpnr

Pr(X G x),

is cf;l//t?J tbe distribution function (1)F)


///t?.'fng pl-operties:
(f)

F(x) is rltpn-f/fs?c-tv/.sfng''

( )

F( -

i1*

:y.-

) li rn.x
=

-+

.,

F( x ) 0
=

.jr

t#'

./ntrlforl

all x

A- and

F(
6

R
ustkrs/'s

)..R

-+

(4.14)
the

56

Random variables

and proability

distributions

It can be shown (seeChung ( 1974)) that this defines a unique point function
for every set function #x( ).
The great advantage of F( ) over #( ) and #xt ) is that the former is a
point function and can be represented in the form of an algebraic formula)
the kind of functions we are so familiar with in elementary mathematics.
'

'

'

'

This will provide us with a very convenient way of attributing probabilities


to events.
Fig. 4.4 represents the graph of the DF of the r.&'.X in the coin-tossing
example discussed in the previous section, illustrating its properties in the
of Hs.
case of a discrete 1-.1,.
--number

Drlrft'?n

z4random
of l/?t? set

variable

is called discrete if its rc/nfyer Izx is stplot? subset


0 k 1 + 2,
Z
integers
(?J
).
-

ln this book we shall restrict ourselves to only two types of


variables, namely, discrete and (absolutely)continuous.

random

Dehnition 4
X is called (absolutelyj continuous i.f its
F(x) is continuous .Jtpr
alI x iE R and there t.?--?'xr.
real
tbe
that
Iine
.J( ) on

-4 random lwr/?/g
(Iistribution
(1non-neqative
,lrlcrft'?n

-s?.,/c/?

.//ntrrft?l?

'

F(x)

(l) d r/,

4.2

The distribution and density functions

It must be stressed that for A- to be continuous is not enough for the


distribution function F(x) to be continuous. The above definition postulates
that F(x) must also be derivable by integrating some non-negative function
/'(x). So far the examples used to illustrate the various concepts referred to
discrete random variables. From now on, however, emphasis will be placed
almost exclusively on continuous random variables. The reason for this is
that continuous random variables (r.v,-s)are stlsceptible to a more flexible
and this helps in the
mathematical treatment than discrete r.N
construction of probability models and facilitates the mathematical and
statistical analysis.
r.N we introdtlced
the function
In defining the concept of a contnuous
/'(x) which is directly related to F(x).
.'s

Dehni rtpn 5

Y e: r'.'i
't'

F(x)

()

ld A

i.

t't'?? t 1*?.7
ut?l?.s

(4.20)

vx G Er,!l
- (Iis'l-ee

./'(u),

.v

(probability)density function ( pt/./') q/' X.


A,y,
.1-4.
example,
1) $-sand
(see Fig. 4. 5). I n

saitl t() /?t?tll(?

ln the coin-tossing
for a discrete with those of a continuous r.v.
order to compare F(x) and
where
consder
the
Alet us
takes values n the interval lk, ?q and a1l
case
attributed
of
the
values z are
same probability', we express this by saying
unljrnll
is
Adistributed in the interval (k, l and we write Athat
t7tl, !?).The DF of Ar takes the form
,/-(0)

./'(

./'(2)

,/'(x)

'v

X < (1

(see Fig. 4.6). The corresponding

pdf of X is given b)'

elsewhere.

Random uriables

and probability

distributions

Comparing Figs. 4.4 and 4.5 with 4.6 and 4.7 we can see that in the case of
random variable the DF is a step function and the density
discrete
a
function attributes probabilities at discrete points. On the other hand, for a
continuous r.v. the density function cannot be interpreted as attributing
probabilities because, by definition, if X is a continuous r.v. P(X .x) 0 for
a1l x 6 R. This can be seen from the detinition of /'(x) at every continuity
=

4.2

59

The distribution and density functions

Fig. 4.7. Thedensity

function of a uniformly distributed

random

variable.

(4.23)
i.e. .J( ): R
.

-+

r0,

v-,l.

(4.24)

Although we can use the distribution function F(x) as the fundamental


concept of our probability model we prefer to sacrifice some generality and
adopt the density function ,J(x)instead, because what we lose in generality
we gain in simplicity and added intuition. lt enhances intuition to view
density functions as distributing probability mass over the range of .Y. The
density function satisfies the following properties:

(4.25)
(4.26)
(4.27)
(iv)

.J(x)=.uF(x),

at every point where the DF is continuous.

(4.28)

Random variables

60

and probability

distributions

Properties (ii)and (iii)can be translated for discrete r.v.'s by substituting


t) for j- dx' It must be noted that a continuous r.v. is not one with a
continuous DF F( ). Conlfnufrv refers to the condition that also requires
the existence of a non-negative function /'( ) such that
'

'

'

(4.29)
ln cases where the distribution function F(x) is continuous but no
integrating function .J(x)existss i.e. (d/dx)F(x) 0 for some x e: J2,then F(x)
is sqid to be a sinqular f/f.$r?'l'?l./lf(?n. Singular distributions are beyond the
scope of this book (see Chung ( 1974)).
=

4.3

The notion of a probability model

Let us summarise the discussion so far in order to put it in perspective. The


axiomatic approach to probability formalising the concept of a random
#( )), where S
experiment J' proposed the probability space (.,
of
all
possible
of
,F
the
is
the
set
events and #( )
set
outcomes,
rcpresents
assigns probabilities to events in The uncertainty relating to the outcome
is formalised in P( ). The concept of a
of a particular performance of
random variable A-enabled us to map S into the real line Ii and construct an
equivalent probability space induced by X. (R, Px( )),which has a much
teasier to handle' mathematical structure, being defined on the real line.
Although #xt ) is simpler than #( ) it is still a set funiion albeit on the
Borel field
Using the idea of c-fields generated by particular sets of
#xt ) on semi-closed intervals of the fonn ( vs, xl and
defined
we
events
managed to define the point function F( ), the three being related by
.:/t

'

'

.k

'

'

.@,

'

'

'

,.?d.

'

'

Pfs: -Y(y)6

(-

:y.,

x(1,

us

S)

#x( -

.,.'y,

xl

(4.30)

F(.x).

The distribution function F(x) was simplified even further by introducing


jXvia F(x)
du. This introduced further
the density function
w
is definable in closed
flexibility into the probability model because
algebraic form. This enables us to transform the original uncertainty related
to J to uncertainty related to unknown parameters 0 of /'('); in order to
emphasise this we write the pdf as
$. We are now in a position to define
./?'ld'/J' ()f t/pnsl'r
model
probabilit
parametric
in
form
the
of
J.'
jilnctions
a
our
which we denote by
./'4lk)

,/'(x)

./'(x)

./'(x;

'

(l)

).J(.x',04, 0 G O )
.

* represents a set of density functions indexed by the unknown parameter


(.) (usuallya multiple of
0 which are assumed to belong to a parameterspace
the real line). In order to illustrate these concepts let us consider an example

The notion of a probability

model

Fig. 4.8. The density function of a Parcto distributed random


of the parameter.

variable

for

differentvalues

family of density functions, the Pareto distribution:

of a parametric
(Lp=

ftx.,(?)

x
-9.
x

t?+ l
,

x > 0, ()(E (.)

xll - a known number O r1+- the positive real line. For each value in 0.
/'(.x;p)represents a different density (hencethe term parametric family) as can
be seen from Fig. 4.8.
When such a probability model is postulated it is intended as a
description of the chance mechanism generating the observed data. For
example, the model in Fig. 4.8 is commonly postulated in modelling
personal incomes exceeding a certain level x(). lf we compare the above
graph with the histogram of personal income data in Chapter 2 for incomes over E4500 we can see that postulating a Pareto probability
density seems to be a reasonable model. In practice there are numerous
such parametric families of densities we can choose from, some of which will
be considered in the next section, The choice of one such family, when
modelling a particular real phenomenon, is usually determined by previous
experience in modelling similar phenomena or by a preliminary study of the
data.
When a particular parametric family of densities * is chosen, as the
appropriate probability model for modelling a real phenomenon, we are in
effect assuming that the observed data available were generated by the
'chance mechanism' described by one of those densities in *. The original
uncertainty relating to the outcome of a particular trial of the experiment
=

62

Random variables

and probability

distributions

has now been transformed into the uncertainty relating to the choice of one
0 in 6), say 0*, which determines uniquely the one density, that is, tx,'p*),
which gave rise to the observed data. The task of determining 0* or testing
some hypothesis about 0* using the observed data lies with statistical
inference in Part 111. ln the meantime, however, we need to formulate a
mathematical framework in the context of which the probability model (l)
can be analysed and extended. This involves not only considering a number
of different parametric families of densities, appropriate for modelling
different real phenomena but also developing a mathematical apparatus
which enables us to describe, compare, analyse and extend such models.
The reader should keep this in mind when reading the following chapters to
enable him her not to lose sight of the woods for the trees. The woods
comprise the above formulation of the probability model and its various
generalisations and extensions, the trees are the various concepts and
techniques which enable us to describe and analyse the probability model in
its various formulations.
4.4

uniYariate

Some

distributionst

ln the previous section we discussed how the concept of a random variable


(r.v.) X defined on the probability space (S, P( )) enabled us to construct
a general probability model in the form of a parametric family of densities
(31).This is intended to be an appropriate mathematical model purporting
of real phenomena in a stochastic
to provide a good approximation
(probabilistic) environment. ln practice we need a menu of densities to
describe different real phenomena and the purpose of this section is to
consider a sample of such densities and briefly consider their applicability
to such phenomena. For a complete menu and a thorough discussion see
Johnson and Kotz ( 1969), (1970),(1972).
.%

'

(1)

Discrete distributions

(i) Bernoulli distribution


experiment J's where there are only two possible
and
for convenience, that is, S
tfailure').
vafiable
X by
lf we define on S the random
1) p
1, A-tfailure) 0 and postulate the probabilities Przr
0) 1 p we can deduce that the density function of X takes the

Consider a random
outcomes, we call
(Esuccess',

Asuccess)
and Pr(X

'success'

kfailure'

'!- The term probability distribution is used to denote a set of probabilities


complete system (a c-field) of events.

on a

4.4 Some

/'(x', p)

pXl

distributions

univariate

1 p)
-

1 -

fo r x

'N
'

0, 1

otherwise.
and the probability

ln practice p is unknown

takes the form

model

(4.34)
Such a probability model might be appropriate in modelling the sex of a
newborn baby, boy or girl, or whether the next president of the USA will be
a Democrat or a Republican.

(ii) Tlle binomial distributiov


The binomial distribution is unquestionably the most important discrete
distribution. lt represents a direct extension of the Bernoulli distribution in
the sense that the random experiment Js is repeated n times and we define
in rl trials. If we
the random variable F to be the number of
al1 21 trials the density
of
probability
is
in
that
the
the
same
assume
of Y takes the form
tsuccesses-

isuccess'

otherwise.
and

denote this by writing J'

we

'v

#(n, p):

&

'
'w

Sdistributed

reads

as'.

Note that

n
.'

n!
-(n y) ! .p!

/(!

'

(/

1)

'

(/( 2)

2 1.
.

The relationship between the Bernoulli and binomial distributions is of


considerable interest. lf the Bernoulli r.v. at the th trial is denoted by Xi, i
+ X,,;
+ ,Ya +
1, 2,
n, then Y is the summation of the Arfs,i.e. L
This
is
used
emphasise
subscript
because
Y'
dependence
in is
its
the
to
on n.
which implies that
the Arfstake the value 1for a success' and 0 for a
))- j Xi represents the number of successes' in n trials. The interest in this
relationship arises because the 'is generate a sequence of increasing c-fields
cu c(L); c(Ff) represents the c-field generated
of the fonn c(Fj) cz c('L)c
by the r.f. Y).This is the property, known as martinqale condition (seeSection
8.4),that underlies a remarkable theorem known as the De Moivre-luaplace
=

tfailure',

'

'

'

'

-1

'

'

64

Random variables and probability distributions

cvntral Iimit rtrtpl-?rrl.De Moivre and Laplace, back in the eighteenth


n) for a
century, realised that in order to calculate the probabilities
large n the formula given above was rather impractical. ln their attempt to
find an easier way to calculate such probabilities they derived a very
important approximation to the formula by showing that, for large n,
./'()';

where
j.
,g

-u-

.jy

&

x'7.t,7/.,(
-

-- . - .

1 J?)q

rx

reads approximately

equal.

on the RHS of the equality was much easier. This


the density function of the most celebrated of a1l
has a bell-shaped symmetric curve, the nortnal. Fig. 4.9
of a binomial density for a variety of values for 11 and p.
As we can see, as n increases the density function becomes more and more
bell-shape like- especially when the value of # is around 1y. This result gave
rise to one of the most important and elegant chapters in probability
theory, the so-called limit theorems to be considered in Chapter 9.

Using the formula


formula represents
distributions which
represents the graph

(2)

Continuous distributions
(i)

-'b

e ntp/vzlt//

tlist'lnibut

k't?/'?

The normal distribtltion is by far the most important distribution in both


probability theory and statistical inference. As seen above, De Moivre and
Laplace regarded the distribution only as a convenient approximation
to
the binomial distribution. By the beginning of the nineteenth century,
however, the work of Laplace, Legendre and Gauss on the theory of
placed the normal distribution at the centre of probability theory. lt was
found to be the most appropriate distribution for modelling a large number
situations in astronomy, physics and eugkmics. Moreoverof experimental
Markov, Lyapounov and
the work of the Russian School (ChebysheN,
'errors'

limit theorems, relating to the behaviour of certain


standardised sums of random variables, ensured a central role for the
Kolmogorov)

on

normal distribution.

A random variable

zYis normally

distributed if its probability

function is given by
j'q

x y
.

''

(y

1
ex j)
cx (.27r)
-

-s .-.
...

1
-

2c

2
(x - g j
.

density

4.4

Some univariate distributions

0.8

0.8

0.7

0. 7

65

0,6

0.6
0.6

n
R

< 04

5
0'0S

0.5

n
V

< 04

0.3

0.3

0.2

0,2

0.1

0.1
0 12 3 4 5 6

0.5

0 1 2 34 56

0.6

0.6

0.5

0.5
0.4

n
#'

< 03

1()

0.05

0.4
< 03

0.2

0.2

0. 1

0. 1

lo
0. 5

0 1 2 3 4 5 6 7 8 9 10

0.6

0.5

0. 5

0.4

0 12 3 4 56 7
0.6

n
/7

< 03

n
P

ac

0.05

0.4
< 03

0.2

0.2

0. 1

0. 1
0 12 34 5 6 7
X

pc
0'5

jI

0 1 2 3 4 5 6 7 8 9 10 12 14
1 1 13 15

Fig. 4.9. The density function of a binomially distributed


variable for different values of the parameters n and p.

random

this by Ar N(p, c2). The parameters p and c2 will be studied in more detail
when we consider mathematical expectation. At this stage we will treat
them as the parameters determining the location and flatness of the density.
For a fixed c2 the normal density for three different values of p is given in
Fig. 4. 10.
'v

and probability distributions

Random variables

-4.0

Jz

Jl

4.0

0.40
0.30

:i o,2o
<

0.10
0.00

-8

-6

-7

-4

-5

-3

-2

-1

Fig. 4.10. The density function of a normally distributed random variable


with c2 1 and different values for the mean p.
=

1.00
0.90
0.80
0.70
0,60

Sil(Lso
<

0.40
0.30
0.20
0, 10
0.00

J=

2.5

-8

-7

-6

-4

-5

-3

-2

-1

Fig. 4.1 1. The density function of a normally distributed random variable


and different values for the variance.
with mean p
=0

Fig. 4.1 1 represents the graph of the normal density for p 0 and three
alternative values of c; as can be seen, the greater the value of c the flatter
the graph of the density. As far as the shape of the normal distribution and
density functions are concerned we note the following characteristics:
=

The normal density is symmetric about p, i.e.

fp
=

k)

Pry

1 expt /(2,
2c
cx/(2a)
x x/t+ k) Pry
-k

,v

.ytjj

(4.39)

.k),

(4.40)

G.. <yj,

and for the DF,


F(

x)

1 - F(x + 2p).

(4.4 1)

4.4 Some

univariate distributions

67

The density function attains its maximum

d./-txl ftx)
dx =

'

2(x

=0

2cz

'

at x

=p,

-p)

=
'

and

ftuj
-'

'

(2zr)
(4.42)

(iii)

The density function has two points of inflection at x

=p

+ c,

Thus, c not only measures the flatness of the graph of the pdf but it
determines the distance of the points of inflection from p. Fig. 4.12
represents the graphs of the normal DF and pdf in an attempt to
give the reader some idea about the concentration
of these
functions in terms of the standard deviation parameter c around
the mean p.

10
.

W)

0.84 r----0.50

I
I

---u 1 c.:6

-c

f (x)
Shaded area

I
I
I

f
.-

+ (z

2c

1
(J,s)

0.9545

o 65
tV(27r1

0.1353
(V(2F1

-2g

p
Fig. 4. 12. lmportant

functions.

y-o

Jz + c

g + 2c

features of the normal

distribution and density

Random variables and probability distributions

68

The density function of the random variable Z

1
/.(y -t;jaalexp g
.
x.'

..--j.:r2

c is

(-Y-p)

---

which does not depend on the unknown parameters p, c. This is called the
standard normal distribution, which we write in the form Z N(0, 1). This
shows that any normal nv. can be transformed to a standard normal nv.
when p and c are known. The implication of this is that using the tables for
and F(x) using the transformation Z
J(z)and F(z) we can deduce
(.Y -p) c. For example, if we want to calculate P6X G 1.5) for A' N( 1, 4)
/:-4z) F(0.25) 0.5987, that is, F(x)
we form J (x 1) 2 0.25
x

./'(x)

F(1.5)

uc>

0.5987.

(ii) Expon

(.w tial

.?'l'll'/-J'

t?/' (listributions

Some of the most important distributions in probability theory, including


the Bernoulli, binomial and normal distributions, belong to the exponential
family of distributions. The exponential family is of considerable interest in
statistical inference because several results in estimation and testing (see
Chapters 11- l4) depend crucially on the assumption that the underlying
probability model is defined in terms of density functions belonging to this
family; scc Barndorff-Nielscn ( 1978).
4.5

characteristics

Numerical

of random

Yariables

ln modelling real phenomena using probability models of the form *=


p), 0 g (.)). we need to be able to postulate such models having only a
f
general quantitative description of the random variable in question at our
Such information comes in the form of certain numerical
disposal a
characteristics of random variables such as the mean, the variance. the
skewness and kurtosis coefficients and higher moments. lndeed, sometimes
such numerical characteristics actually determine the type of probability
density in *. Moreover, the analysis of density functions is usually
undertaken in terms of these numerical characteristics.
./'(x;

'priori.

(1)

Mathematial

expectation

(r.v.) on (S, P )) with F(x) and .J(x)its


distribution function (DF) and (probability) density function (pdf)
vafiable

Let Ar be a random

respectively. The
F(.Y)

mean
.ylxl

n%

'

of A- denoted by f)A-) is defined by


dx - for a continuous

r.v.

Characteristics

4.5

69

of random vayiables

and
F(A-)

xf./'t-Yf

fOr

a discrete r.v.,

when the integral and sum exist.


is over all possible values of X.
The integral in the definition of mathematical expectation for a continuous
random variable can be interpreted as an improper Riemann integral. lf a
unifying approach to both discrete and continuous r.v.'s is required the
integral (see Clarke ( 1975))
concept of an improper Riemann-stieltjes

Note that the summation

(4.47)
used. We sacrifice a certain generality by not going directly to the
Lebesque integral which is tailor-made for probability theory. This is done,
however, to moderate the mathematical difficulty of the book.
The mean can be interpreted as the centre q/' gravitv of the unit mass as
distributed by the density function. lf we denote the mass located at a
from the origin by m(xf) then the centre of gravity is
distance x, i 1, 2,

can be

located

at
-vfnyt-vj)

Zf-

.-

(4.48)

?Fl(.Y)

1.
If we identify nt-vsl with ptxf) then f)A-)= jf xptxf), given that
f Jx'f)=
provides a measure of location (orcentral
In this sense the mean of the r.v.
tendency) for the density function of X.
'

If A- is a Bernoulli distributed

r.v. (X

'v

b 1, p)) then

A'

/'(x) (1 - p)
lf X

'v

distributed

/?), i.e. Ar is a uniformly

U((I,

'(A-)

,x?

.Vtxldx

j
x

b-

(1

dx

1
=

-.

then

r.v.,

x2

---

2 b

a +. /)

70

and probability distributions

Random uriables

lf X

Np, c2), i.e. X is a normally distributed r.v., then

'v

F(Ar)

(2,c)
+. p )
e .-ya (ja
(2zr)

.x;

'

--

cc

1 x u
2
c
-

exo

(Jz

for

e-izz

(jz

(r

:c

dx,

(2z:)

-.jc2 (j Z

(27:)

= 0 + p 1 p,
since the first term is an odd function, i.e. h - x)= - /1(x).Thus, the
P arameter p for X Np, c2) represents its mean.
.

'w

ln the above examples the mean of the nv. A- existed. The condition which
guarantees the existence of '(Ar) is that
X

dx <
Ixl.(x)

cc.

< w.
)(gIxfl/txf)

or

vo

(4.49)

One example where the mean does not exist is the case of a Cauchy
distributed r.v. with a pdf given by
flxs

1
zr(1 + x 2 )

R.

.X 6

In order to show this let us consider the above condition:

x,
-X

dx
Ixl/txl

lxl1 + x

=-

z: - x

=- olimzc 2
zr
-+

Ctl

1
--

zr

x,

oxc

dx

by synlnxetry

dx
c

1
x
dx=- lim logell
: a .
1+ x a

+J2)

-+

That is, '(Ar) does not exist for the Cauchy distribution.

Some properties of the expectation


(E1)
(.E2)

c, (' c is a constant.
ftzArj + bxz) tkEtxjl + lxEl-fzl for Jn-p
'(c)

lwt? r.,.'s

ArI and

of random

4.5 Cllaracteristics

variables

Xz whose means exist and a, b are real constants.


For example, (J Xi b 1, p), i 1, 2,
n, i.e. Xi
represents the Bernoulli r.r. of the ith trial, then for F;,
S(Fl, P)
f (Y;,)
1 E (Xf)
np.
1 Xfl
1 p
That is, the mean of a binomiallv distributed r.r. equals the
number of trials multiplied b)' the probability' oj'
and
Properties E1
E2 desne F(.) as a linear transformation.
Prx > ;.E(X)) % 1/2 jr a positive r.t,. X and ;.> 0,' this is
('3)
the so-called Markov inmuality.
Although the mean of a r.v. X, .E(Ar) is by far the most widely used
measure of location two other measures are sometimes useful. The first is
themode defined to be the value of X for which the density function achieves
its maximum. The second is the median, xm of X defined to be the value of X
=

'w

(Z)1-

::::>'

'v'

Z)'=

Z7=

bsuccess'.

such that

It is obvious that if the density function of A- is sq?mmetric then


'(.Y) x..

(4.51)

If it is both symmetric and unimodal

mean

median

(i.e.it has

only one mode) then

=mode,

assuming that the mean exists. On the other hand, if the pdf is not unimodal
this result is not necessarily valid, as Fig. 4. 13 exemplifies. ln contrast to the
mean the median always exists, in particular for the Cauchy distribution
x,n
=0.

I
I

l
l

I
I
I

I
I
I
1

I
j
1
1
I
I
I

'G

mode

Fig. 4.13. A symmetric

mode.

mean

median

mode

density whose mean and median differ from the

(2)

and probability distributions

variables

Random

The varlance

When a measure of location for a nv. is available, it is often required to get


widely the values of .,Y are spread around the location
an idea as to how
is,
measure, that a measure of dispersion tor spread). Related to the mean as
variance and
a measure of location is the dispersion measure called the
defined by
'

Vart#l

SIIA- f'tA-ljz

(x - f(-))2/(x)

F(.Y))2/'(xf)

= il (xf-

(4.52)

dx - continuous

(4.53)

- discrete.

t#' inertia of the mass


can be interpreted as the moment
through the mean.
axis
distribution with respect to the perpendicular

The variance

is referred to as

Note: the square root of the variance

deviation.

standard

Exkrnp/t?s
(i)

Let A' ??(1,p); it


x,

was shown above that F(Xl


-p2)(

VarlA-l

(0

Sf

Var(A-)

X-

p, thus

-p)2p=p(

a+

+ (1

-p)

-p).

-f.'/)2

(?
dx
=

-a

12

(verify).

An equality which turns out to be convenient for deriving Var(-Y) is given by


E'(A-2) (E'(.Y)12,
Vart-l
=

where
v2./'(x)dx.

for z

-..-

Characteristics of random

4.5

Yariables

.fr

( PrI)

fkny c'tpnsltknr ('.


Vartt?l 0
2
(1 constant.
Varll X) a VartXl,
#-((xY E(xY)l> k) :$ gVar(X)q,/k2 Chebyshev's inmuality
the
fr /( > 0. This ntx?-fllflr lives (1 relation
and
JC#nt?l
probabilitv
q.f
by
dispersion as
variance
te
such
tA- E(A-)1p: k, pl-onpng in //t'?cr
bound
an
upper
for
=

lS2)

.jr

(
( Iz'3)

-lt?lwtlt?n

probabilities.
(3)

Higher moments

Continuing

the analogy with the various concepts from mechanics we


define the moments of inertia from x 0 to be the so-called rlw' moments:
=

the l-th raw moment, if it exists, with p'a 1 and yj


usually denoted by Jt. Similarly, the rth moment around
central nlt?rnt?nl, is defined (ifit exists) by
pvEEE/I..Y -

plr

(-x-- p)r./-(x)dx,

2, 3,

F(A-); the mean is


x p, called the rth

c2. These higher


;tz a E'(- - p)2 is the variance, usually denoted by
moments are sometimes useful in providing us with further information
relating to the distribution and density functions of r.v.'s. In particular, the
in the form:
3rd and 4th central moments, when standardised
X3

14

:3

--

and
lt *

-:

(T

are referred to as measures of skpwncss and kljrtosis and provide us with


measures of asymmetry and flatness of peak, respectively. Deriving the raw
moments first is usually easier and then the central moments can be derived
via (see Kendall and Stuart ( 1969::
r

pr- j=1 (

l
.

1/

p'ipr-j.
,

(4.58)

An important tool in the derivation of the raw moments is the characteristic

74

and probability distributions

Random variables

function defined by
J

Eteilx)

kx-

eirxdz-txl,

Using the power series form of

/x(l)

v''- 1.

(4.59)

eA

we can express it in the form

., (jrlr

p'r.
F. --r!

1+

(4.60)

This implies that we can derive y'r via


t

Fr

dr/

x
(jjr

(r)
=

A function related to
Chapter 10) is

loge/xtr)

()

/xtrl of

1+

(4.6j)

interest in asymptotic theory

particular

x) (ir)r&r
r 1 r!

)2

(see

(4.62)

where

sr, r= 1, 2,

are called the cumulants.

Example
Let A- Np, c2), the characteristic

function takes the form

'w

-r2c2),

1d/xlll
'i- dl

(M=expirp

=
o 1-

,-

1 d24 (r)
7 dr ,

-.ir

exptirp

()

a
= p +

Gz =

pa

J,

)(ip

(r a

rc c)

p,

()

,-

Similarly we can show that ps


pzb 3/4, ps
/z6 15c6, etc.
c2, x,=0,
Kendall
and
Stuart
rb
3, aa 0, a,yc::z3', see
( 1969).
&72
=0,

=0,

l1

=p,

the various numerical characteristics of r.v.'s as related


under certain
to their distribution, it is natural to ask whether
1,
2,
Jz'r,
knowing
the moments
we can determine the
circumstances
r
DF F(x). The answer is that F(x) is uniquely determined by its moments /t;,
if and only if
r= 1, 2,
Having considered

l (/t'c,)-r
-

(4.63)

vs.

This is known as Carleman's

condition.

4.5

Important
Random

of random variables

Characteristics
concepts

the probability

variable,

by a r.v., a c-field

space induced

generated by a r.v., an increasing sequence of c-fields, the minimal Borel


field generated by half-closed intervals ( r xl x e R, distribution
function,density function,discrete and continuous r.v.'s, probability model,
parametlic family of densities, unknown parameters, normal distribution,
expectation and variance of a r.v., skewness and kurtosis, higher raw and
central moments, characteristic function, cumulants.
-

Questions
Since we can build the whole of probability theory on (S, % P ))why
do we need to introduce the concept of a random variable?
Define the concept of a r.v. and explain the role of the Borel field
generated by the half-closed intervals ( vz, x(l in deciding whether a
function .Xt ): S R is a r.v.
'Although any function Art ); S
Ii can be defined to be a nv. relative
c-field
valuable information if we do not
stand
lose
to
we
to some
define the nv. with care'. Discuss.
Explain the relationship between #( ), #xt ) and Fx( ).
Discuss the relationship between Fx( ) and .J( ) for both discrete and
continllous r.v.'s. What properties do density functions satisfy?
Explain the idea of a probability model * ).J(.x;
0j, 0 (E 6)) and
discuss its relationship with the idea of a random experiment $ as well
as the real phenomenon to be modelled.
Give an example of a real phenomenon for which each of the
following distributions might be appropriate:
Bernoulli;
(i)
binomial',
(ii)
normal.
(iii)
'

-+

'

--+

,.'h

'

'

'

'

'

Explain your choice.


Explain why we need the concepts of mean, variance and higher
moments in the context of modelling real phenomena using the
probability model *
0), 0 6 O).
B'hat features of the density function do the following numerical
characteristics purport to measure?
=

ttx;

mean, median, mode, variance, skewness, kurtosis.


ixplain the difference between these and the concepts with the same
names in the context of the descriptive study of data.
iixplain Markov's and Chebyshev's inequalities.

76

Random variables and probability

distributions

12. Compare the properties of the mean with those of the variance.
do the moments characterise the
13. Under what circumstances
distribution function of a r.v.?
Exercises
b, c, J) and 84/) #(h)
Consider a random experiment with S (tp,
#(C)
P(Is l'.
Derive the c-eld of the power set
(i)
l/J, ?)
say ..W).
Derive the minimal c-field generated by
(ii)
S:
Consider the following function defined as

=t,

=-),

,?/.'

-tfzl

-Y(c) -Y(J)

0,

.(?)

J'(/))

F((.')

1,

7 405 926,t

J'(J)

2.

Show that .Y and F are both nv.'s relative to


but F is not.
is a r.v. relative to
Show that
Find the minimal c-field generated by F,
.?>:

(iii)
(iv)
(v)
(vi)
(vii)

,t7j

,#(.

1) k.p g1, 2:.


Find #y(.t0) ), #y((0. 1q), #y,((- cfs 11), Py(I)0,
Derive the distribution and density functions F(y),
,

.(y)

and

plot them.
Calculate E(F), Var(1'), az(F) and a4(J'l.
(viii)
The distlibution function o the exponential distl-ibution is
F(x)

exp

'

and plot its graph.


Derive the density function
Derive the characteristic function /(1)
'Derive /J(Ar), VartxYl, aa(,Y) and a4(Ar).
.(x)

(i)
(ii)
(iii)

'(ei'x).

Note:

'# If the reader is wondering about the significance of thfs number it fs the number of
demons inhabiting the earth as calculated by German physician Weirus in the
sixteenth centtlry (see Jaslrow (1962:.

Characteristics

4.5

of random

variables

lndicate which of the following functions


functions and explain your answer:
(i)
/'(x) kx2, 0 < x < 2,'

represent

proper

density

2(1 -x)2,
x > 1,'
341
x y 1,e
.J(x)
< x < 2,'
1),
+
0
.(x3
(iv)
/'(x)
3,
J(.Y)
iE R.
x
(v)
-lx
Prove that Vart-Yl E(X2) - gE(xY)q2.
lf for the nv. X, E(-Y) 2, F(xY2) 4, find F(3X + 4), Var(4X).
Let Ar N(0. 1). Use the tables to calculate the probabilities
F(2.5);
(i)
F(0. 15),(ii)
1 - F(2.0).
(iii)
and compare them with the boundsfrom Chebyshev's inequality. What
is the percentage of error in the three cases?
,f(x)

(ii)

(iii)

''E).

Additional references
Bickel and Doksum ( 1977)., Chung ( 1974)., Cramer ( 1946)., Dudewicz ( 1976),. Giri
t 1974)) Mood, Graybill and Boes (1974); Pfeiffer ( 1978); Rohatgi ( 1976),

C H AP T E R 5

Random vectors and their distributions

The probability model formulated in the previous chapter was in the form
of a parametric family of densities associated with a random variable (r.v.)
0), 0 s O). ln practice, however, there are many observable
X: *
phenomena where the outcome comes in the form of several quantitative
attributes. For example, data on personal income might be related to
number of children, social class, type of occupation, age class, etc. ln order
to be able to model such real phenomena we need to extend the above
framework for a single r.v. to one for multidimensional r.v.'s or random
vectors, that is,
=

t/tx',

(A'1 Xz,

X',)'.

where each Xf, i 1, 2,


n measures a particular quantifiable attribute of
experiment's
random
(J)
outcomes.
the
For expositional
purposes we shall restrict attention to the twodimensional (bivariate)case, which is quite adequate for a proper
understanding of the concepts involved, giving only scanty references to the
n-dimensional random vector case (just for notational purposes). ln the
next section we consider the concept of a random vector and its joint
distribution and density functions in direct analogy to the random variable
case. ln Sections 5.2 and 5.3 we consider two very important forms of the
joint density function, the marginal and conditional densities respectively.
These forms of the joint density function will play a very important role in
Part 1V.
=

Joint distribution and density functions

Consider the random experiment


78

of tossing a fair coin twice. The sample

Joint distribution and density functions

5.1

space takes the form S t(ST), CTH), CHH), (TT)). Define the function
Both
and -Y2( ) to be the number of
.lt ) to be the number of
of these functions map S into the real line (!4in the form
=

Stails'.

'heads'

'

'

(.X'1( ),X2( ))1)(FS)l


'

'

(-''i( ),.X'2( ))1)(.fT)l


'

'

(1,1), i.e. (.X'1(Tf1), X2(TS))= (1,1),


(1,1),

),.Yc( ))((f.f.f)l (2,0),

(A-1(

'

'

(.Y1(

'

),

-'2(
-

))l(TT)l

(.0,2).

R2 is a twoThis is shown in Fig. 5. 1. The function (xY:( ), A-2(. )): S


dimensional vector function which assigns to each element s of S, the pair of
ordered numbers (x1,x2) where xj xYj(s), xc Xc(s). As in the onedimensional case, for the vector function to define a random vector it has to
satisfy certain conditions which ensure that the probabilistic and event
structure of (S, P( )) is preserved. ln direct analogy with the single
s ariable case we say that the mapping
-+

'

..%

X(

'

)H (X1( ), X2( )): S


'

-+

222

vector if for each event in the Borel


#a), the event defined by

definesa random
.22.say B H (:j,
X- 1(B)

belongsto

'

(s: xYI(s)

e #1, -Y2(.s)e #c, s e

tield product

.@

,)

.@

(5.2)

..%

S
Xa

(88)
(HP

( F8)
(rr)

Fig. 5.1. A bivariate random vector.

xl

80

Random vectors and their distributions

Extending

can be profitably seen as being the c-field


generated by half-closed intervals of the fonn ( :yz
to the case of the
direct product
x
we can show that the random vector X( ) satisfying
the result that

,#

,xj

.@

,?d

X-

((

w xq)

.F for a1l x

/2

implies X - 1(B) (E .F for all B G 41.


This allows us to define a random vector
Dhnition

A random

Vtor

Ar(

'

)..is

a vector

as follows:

.jnction

cc <

X a(.s) G x 2 s s
,

s) e

..@

Note. ( :r xj (( x. xlqs ( LJ -xc(I)


represents an infinite rectangle (see
Fig. 5.2). The random vector (as in the case of a single random variable)
induces a probability space (R2,
J7xl where
are Borel subsets on
the plane and #x( ) a probability set function defined over events in
in a
way which preserves the probability structure of the original probability
,

,82,

')),

r@1

.#2,

'

Joint distribution an4 density functions

5.1

This is achieved

by attributing

i@l

to each B i!

the

(5.6)
This enables us to reduce

joint

#x(

tlistribution

tcunultlrfrt?l
Djlnition

'

./ncrftpn.

a point function F(x1, xa), we call the

) to

fat?lX (Ar1 Ar2) be


/rlcrt?n dehned py
EEE

F(

):

/2

--.

a random

ptvltpr

dehned on (S, .@I


P

'

)). The

g0,1q,

stch that
EE

#?-(X .Gx)

is said to be the joint distribution function of X.


ln the coin-tossing example above, the random vector

X(

) takes the
alue (1, 1), (2,0), (0,2) with probabilities .l,
ln order to
derive thejoint distribution function (DF) we have to define al1 the events of
the form )s: Xltsl Gxl, -Yc$) Gxc, s c 5') for all (xj, x2) (E (22
.

1.4and .1.4.
respectively.

x:

<0,

0 Gx l

xc <0
<

2, 0 :; x2

.xl 1 2,
Ntlrt, a

The

degree of arbitrariness

in choosing the intinite

joint DF of Xj and X1 is given by


0, xl
12
,

F(x1 xa)
,

<

0,

xa <

0t x1 < 2 0 Gx2 < 2


,

-1.4.x 1 > 2 0
,

1,

t;

xl

y 2,

xcy

xz< 2

2.

<

2, x 2 y 2

x2 >

rectangles

<2

:1;h 2, oGxc
0 Gx j

<

2.

':t;,
-

xq.

and their distributions

Random vectors

82

of (X1, X2)

jnction

Table 5. 1. Joint Jcnsly

From the definition of the joint DF we can deduce that F(x1 x2) is a
monotone non-decreasing function in each variable separately, and
s

lim F(x1
X

-/

-#

1
X 2 -#

x2)

1im F(x1, x2)

-*

lim F(x1
X

x2)

(5.8)

=0.,

1.

m.
'X

As in the case of one r.v., we concentrate exclusively on discretc and


continuous joint DF only; singular distributions are not considered.
Dehnition 3
and Xz is called a discrete distribution
DF 4?./--Y'1
function.J(.) such that

Tejoint

4' rlrtr

exists a density

(5.10)

J(%1,.Y2) > 0, (x1 x2) (E (22


,

and it takes tbe value zero


injlnite ptpjfTl in lc plane

.Jtxl.x2)

Pt-lxk

everywhere

except at a

jinite t)r

countablv

wjr
=

x1, Xz

(5.11)

x2).

In the coin-tossing example the density function in


array
form is represented in Table 5. 1. Fig. 5.3 represents the graph of the joint
xz) via
density function of X > (.Y1-A'2). Thejoint DF is obtained from
the relation
a rectangular
,/'(xj

.''

F(x1, x2)

./'txlf,

.'. j i < .' :

A i<
,z

xci).

Dehnition 4
Thejoint DF t?/' A-1 and A-cis called absolutely') continuous if r/-lthl't?
exists a non-negative function .J(x1x2) such that
,

5.2

Some

bivariate distributions
f (x xc)
z

A'

-ej

Xc

1
2
Xl

function of Table 5. 1.
Fig. 5.3. The bivariate density

-va).

if j'

at

) is continuous

(xj,

Some bi>'ariate distributions


(1)

Bivariate normal distvibution


--i.

2)
(.--.&-1

f'lx:xa; 0) aagjo.a-

0- (pj pc,
,

cf, czc,p) c R2 x pi2+


x

(p,1q.

Random vectors and their distributions

f (x

xa)

,-

l
l
l
l
l

Xz
xe

Nw

Nx
N

.-..

'-

w.

..-

>

w.

(0 0)

'--

N
>.

X1

normal density.

Fig. 5.4. The density function of a standard

lt is interesting to note that the expression inside the square brackets


expressed in the form of
X1

#1

--

X1

- 2p

#1
c1 -

p2

X2 -

(r 2

X2

- p2
o'z

2
m

ca

defines a sequence of ellipses of points with equal probability which can be


x2) represented in Fig. 5.4.
viewed as map-like contours of the graph of
.f(xl,

0 (k
=

(3)

a l a a)
,

Bikariate binomial dltrl'bution

/'(x 1 x c)
.
,

/1
... - ! J?1'P)2,
a -.j
1 .v 2

-Y1

+ xc

p 1 + I.'z

1
.

(5.20)

The extension of the concept of a random variable A- to that of a random


(.YI
.Y,,) enables us to generalise the probability model
X1,
vector X
=

85

5.3 Marginal distributions


*

.f

/ (x ; 0) 0 6 O )
'

t.

family of

to that of a parametric
*

,/'(

xa

joint density functions

x,,; 0) 0 6 (9
,

(5.22)

generalisation since in most applied disciplines,


the real phenomena to be modelled are usually
including econometrics,
multidimensional in the sense that there is more than one quantifiable
feature to be considered.

This is a very important

5.3

Marginal distributions

Let X EB (xY1 Xal be a bivariate random vector defined on (,. ...Fi#( ))with a
joint distribution function F(x1, xa). The question which naturally arises is
whether we could separate A-l and Xz and consider them as individual
random variables. The answer to this question leads us to the concept of a
marginal J.srrf/pfklfon. The marginal distribution funetions of A-1and Xz are
defined by
'

F1 (x1)

and
F c(x 2 )

lim F(x1 xc)


,

-+

J-

l im F(x : x 2)
,

. 1

''*

ik

Having separated .,Y1and Xz we need to see whether they can be considered


defining a random
as single r.v.'s defined on the same probability space. In
that
condition
the
vector we imposed

$ts : A-1 (s) < x l

2(#

:%

x2

)G

(5.25)

.:''

The definition of the marginal distribution function we used the event


s : A-1 (s)

:$

2.(
x 1 A- s) < vs
,

lj

(5 6)

.2

t
which we know belongs to .K This event, however, can be written as the
intersection of two sets of the form

u hich implies that


.ts:

.,1

(-$)% xl X 2(s)
,

<

'zt-'

ts: A- (s)
I

:;

xl

(5.28)

and it is the condition needed for A'1 to be a r.v.


u hich indeed belongs to
probability
function Fjlxjl,' the same is true for X1. In order to see
with a
..k-

WK

86

and their distributions

Random vectors

this, consider the joint distribution function

F1(x1)

lim Ftxl

X2

since 1im,,-+

.(e-'')

-+

x2)

e-ttVl,
-

0. Similarly,

F 2 (x2) 1 - e-tz
=

x2

6F R +

Note that F1(xj) and F2(x2) are proper distribution functions.


Given that the probability model has been defined in terms of the joint
density functions, it is important to consider the above operation of
marginalisation in terms of these density functions. The marginal densitq'
functions of -Y1 and Xz are defined by

(5.30)

and

that is, the marginal density of Xii= 1, 2) is derived by integrating out


Xji #.j) from thejoint density. ln the discrete case this amounts to summing
out with respect to the other variable:
X

h (x1)

/t-vl x2).
-

Example
Consider the working

population

of the UK classified by income and age as

follows:
Income:

2000-4000, f 4000-8000, f 8000- 12 000, f 12 000-20 000, f 20 00050 000, over f 50 000.

Age: young, middle-aged. senior.


Define the random variables A'l-income class, taking values 1-6, and Xzage class, taking values 1-3. Let the joint density be (Table 5.2):

5.3

distributions

Marginal

87

Table 5.2. Joint densi )., q/' (.zYj Xz)


,

.?

-..;

-U-

...y.

Xz

---..

.--.

:LL

-l-

.y.

l-

1
2
3
4
5
6

Jc(xa)

0.250
0.075

0.020
0.250

0.040

0.075

0.020
0.010
0.005

0.030
0.015
0.010

0.400

0.400

.y,

ay

-7

.3

.....t.

0.5

0.020
0. 1(y)
0.035

1
I

/.1(.xj )

0.275
0.345
0.215

0.020

0.085
0.045

0.020

0.035

0.200

1.000

The marginal density function of -Y: is shown in the column representing


r()w totals and it refers to the probabilities that a randomly selected person
will belong to the various income classes. The marginal density of Xz is the
column totals and it refers to the probabilities
that a
row representing
randomly selected person will belong to the various age classes. That is, the
marginal distribution of A'I(Ara) incorporates no information relating to
Xat-Yk ). Moreover, it is quite obvious that knowing the joint density
function of Arl and Xz we can derive their marginal density functions; the
is
reverse, however, is not true in general. Knowledge of txj)and .Jc(x2)
xc) only when
enough to derive
./'(x1,

../'txlx2)

'-h

(-x,)

(5.32)

(x2),

'jz'

Independence in
that ,Yl and Xz are independent
terms of the distribution functions takes the same form

in which case we say


F(x1,

.X2)

l-.t.'.s.

F1(x1) F2(x2).
'

ln the case of the income-age

example it is clear that

'..f2(x2),

/'(x1,x2) #.Jllxll
0.250 #

(0.275)(0.4),

and hence, Arl and Xz are not independent r.v.'s, i.e. income and age are
related in some probabilistic sense; it is more probable to be middle-aged
and rich than young and rich!
In the continuous r.v.'s example we can easily verify that
Fl(xl)
Ltnd

'

F2(x2)

(1

e-0xt)( 1
-

e-0X2)
=

thus .Y1 and X2 are indeed independent.

F(x

1,

x2),

and their distributions

Random vectors

and
in the context of the probability
Note that two events,
said
P( )) are
(S,
to be independent (seeSection 3.3) if
.42,

,41

space

,.1

'

#(v4: ro

.4cl

#(-4j) #(,4a).
.

It must be stressed that marginal density functions are proper density


functions satisfying all the properties of such functions. ln the income-age
7: 0 and
0, .Jc(x2)
j .J1(x1f) 1 and
example it can be seen that
.txlly

Zi

.J2(X2)

1.

Because of its importance in what follows let us consider the marginal


density functions in the case of the bivariate normal density:

(5.36)

(5.37)

being a proper conditional

since the integral equals one, the integrand


density function (see Section 5.4 below).
Simlarly, we can show that

../-2
( ) '-y.,-y.
N ( a )o.a
.X2

eX p

1
-

//x

2 -

u :1

(5.39 )

'-

(72

Hence, the marginal density functions ofjointly normal r.v.'s are unvarate

normal.
provides us with ways to
ln eonclusion we observe that marqinalisation
when
model
is
model
such
defined in terms of joint
simplify a probability
variables.
random
In
unwanted
out'
by
density functions
any
be
of
Xk
interest
X2,
the
density
of
nv.'s
marginal
the
can
general,
'taking

-'.'l

distributions

Conditional

5.4

our investigation

ln the ineome-age example if age is not relevant in


simplify the probability by marginalising out

we can

-a.

Conditional distributions

5.4

ln the previous section we considered the question of simplifying


probability models of the form (22)by marginalising out some subset of the
away' the information
X,,. This amounts to
r.v.'s A' 1 -Ya,
r.v.'s;
irrelevant.
integrated out as being
In this section we
related to the
('ollditiollis)
with
simplifying
respect to some
of
*
by
question
consider the
of
r.v.'s.
the
subset
ln the context of the probability space (.i. ./7 #( )) the conditional
is defined by (see Section 3.3):
probability of event z'll given event
bthrowing

'

zzla

-4....1.

#(
''jz?d
:k'
!'kF'

1
-4

..1

P(

..4

z)

ro

#(d2)

,.4

-.4

a e:./'

(1:*
''''

By choosing X j fts: X 1 (s) < .'7l ) we could use the above formula to derive
an analogous definition in terms of distribution functions. that is
=

where

f .Gxl
( .''A(::!1.
) P$X
.
't'

..

'
..

..'::1.

,,.'.,4c),

there are two related forms we are


As far as event zulc is concerned
zlc
it zk-z
jl
where
is a specific value taken
interested
in,
particularly
c-field
generated
and
by
X 2. ln the case where
a),
the
ct
i.e.
by A'a.
1
arising
in
ctz'a
particular
problems
the definition of the
.4a
), there are no
.f-a

.-a

,4

conditional distribution function

P ( t s : A 1 (s ) .Gx : j...'ro c ( a ))
7
-

F .!k

j .' rr1

lk

2)

f'(c('c))

since ctA'al e although it is not particularly clear what form Fv, ct,ya) will
however, it is immcdiately
X2(s)
take. In the case where
a)
a
xa)
when
is
Xz
since
#(s:
obvious that
a continuous 1'.v., there will
.t

./1

z4a(s)

'(s:

.f

=0

90

and their distributions

Random vectors

be problems. When Xz is a discrete nv. there are no problems arising and we


can define the conditional density jnction in direct analogy to the
conditional probability formula'.
/'(x1

c)

.f

Pr(X

Pr(X

x j 'Xz

x 1 X:
,

Prlx,

.f

2)

.f2)
=

j'lx 1
./2(.z)

.2)

.f-

2)
.

The upper tilda is used to emphasise the fact that it refers to just one value

taken by Xz.

Example
Let us consider the income-age example of the previous section in order to
derive the conditional density for the discrete r.v. case. Assume that we want
6 (incomeclass of over
to derive the conditional density of Xz given
f 50 000). This conditional density takes the form:
-',t-1

flxz

.f1)

0.005
0.035

=0.143

for Xz

2 (middle aged)

1 for Xz

3 (senior).

0.0 10
= 0.035 0.286
=

0.020
= 0.035

=0.57

1 (young)

for Xz

This example shows that conditioning is very different from marginalising


because in the latter case all the information related to the r.v.'s integrated
out is lost but in the former case some of that information in the form of the
value taken by the conditioning variable is included in the conditional

density.
variables

In the case of continuous random


above procedure because
Prukrj

xj Xz

Prx,

.kL)

7cl

it does not make sense to use the

0
(j

apparatus needed to bypass this problem is rather


formidable but we can get the gist of defining the conditional distribution in
the continuous case by using the following heuristic argument. Let us define
the two events to be

The mathematical

-41

.t

s: X

(.s),I x 1 ) i! .F

(5.46)

5.4

Conditional distributions

distribution of .Y1 given Xz

This enables us to define the conditional


0. That is,
taking the limit as

kz by

-+

lim
= 0<h-0

tg,
..fa)

xl

(5.49)

dK.

J2(.f2)

we can define the conditional density of Arl

Using this heuristic argument


given Xz xc to be
=

xl

Rx:.

Examples
Consider the bivariate logistic distribution
F(x1 x2)
,

(X

FX
l

+e-x2)-

(1

+e-At

l (1 + C - )Xl

1+ e - x e 1+
j

2( 1 +

e-X2)

e-X2

A:2

e-x'

92

and their distributions

Random vectors

The distribution function of the bivariate exponential


takes the form

Xx c (X 2 )
-

Hence

1-

=
Let j'lx 1 x a)
.
,

xl

>fC1 >0,

C-

X2
'

f?xzl - *1 exp

1 + $x 1 )(1 +
()4

2)t2
;.lk + 1)(r.,1
1 (1

xc >az

distribution

>0.

-1-

l $4/./

2x 1

+ a 1x a -

k- x ( 1 + 0x a) l
21

(1 1 (1

c )-

(2

'

density functions
There are two things to note in relation to conditional
above
examples'.
brought out by the
the conditional density is a proper density function- i.e.
(a)

both properties can be verified in the above examples.


lf we vary Xz. that is, allow Xz to take all its values in Rxz, we get
different conditional density for each Xz .x2 of the form

/x' vct-xl /x2),

xlgR.y,

(5.54)

xzeRxa

reflection suggests that knowledge of al1 these


xa), a
densities
is equivalent to knowledge of
conditional
equality
brought
general
relationship
out by the
A moment's

./'(xj

(x1 x2)
-

e' R2.

5.4 Conditional distributions


model tp .t.J'(x
x,:; 0), 0 e (.)) because it offers us a general way to
I xc,
decompose the joint density function. It can be seen as a generalisation of
holding when A-I and Xz are
xa)
the equality
(x1)
independent, considered in the previous section; (55)-(56)being valid for
any joint density function. lndeed, we can use the condition which makes
the two equalities coincide as an alternative definition of independence, i.e.
.;t-: and Xz are independent if
=

'./'2(xa),

=./)

./'(xj

/)

x24-Y1

,'''xz)

(.Y1

=l

(5.57)

),

This definition of independence


can be viewed as saying that the
information relating to X2 is irrelevant in attributing probabilities to XI.
Looking back at the way (x1) was derived from the bivariate normal
density we can see that the expression inside the integrl in (37)was the
conditional density of X; given Arj xl. It can be verified directly that in
this case
=

/'(xl
.
.

.x2)

-), (
x

exp

j
azrr.):

-j

x,(.x'2,
( /'1(x1))(-/x'
c

The marginal and conditional

1 vi -

'

.:1)).

distributions

t'/2
cj

2) -

( 1 - /?
cax,

(2z:)
--

(5.59)
in this case are denoted by

(5.60)
(5.6 1)
(5.62)

C a Se :

(5.64)

Random vectors

and their distributions

A sequence of r.v.'s -Yl, Xc,


any x G R,
Fj(x)

F2(x)

'

'

'

Ar,,is said to be identically distributed if,for


.65)

(f

F,,(x).

The concept of conditioninq enables us to simplify the probability model


in the form of a parametric family of multivariate density function (1)
x,,; p), 0 i5 e) in two related ways:
t/'txl xa,
(i)
to decompose the joint density function into a product of
conditional densities which can make the manipulation much
easier; and
information in some
in the case where the stochastic (probabilistic)
of the r.v.'s is not of interest we can use the conditional density with
respect to some observed values for these r.v.'s. For instance in the
case of the income-age example if we were to consider the question
of poverty in relation to age we would concentrate
on
a:(X2/'X1
exclusively.
1)
2
.
=

lmportant concepts
Px( ));
random vector, the induced probability space (Rn,
the joint distribution and density functions',
marginal distribution and density functions, marginal normal
density;
independent r.v.'s; identically distributed r.v.'s;
conditional distribution and density functions, conditional normal
density.
.@'',

'

Questions
Why do we need to extend the concept of a random variable to that of a
random vector defined on (S, P( ))?
Explain how we can extend the definition of a r.v. to that of a random
vector stressing the role of half-closed intervals ( vz xq.
Explain the relationships between #( ), #xl ) and Fx x2( ).
!
Compare the concepts of marginalisation and conditloning.
Define the concepts of marginal and conditional density functions and
discuss their properties.
Define the concept of independence among r.v.'s via both marginal as
well as conditional density functions.
Define the concept of identically distributed r.v.'s.
.%

'

'

'

'

6.

Exercises

Consider the random experiment of tossing a fair coin three times.

5.4 Conditional distributions


Derive the sample space S.
Define the random variables A-1- the number of
Xz
lnumber of
number of
Derive the c-fields generated by
c(-Y1), Xz, c(A-a) and
cl-Yl, A-2).
Define the joint distribution and density functions F(x1, xc),
x2).
/'(xj x2) and plot
Derive the marginal density function /'j(x1)and ,J2(x2).
Derive the conditional densities
theads',

'tails'l.

eheads'

(iii)

-4-1

,/'(x1

/'l(x1.7l),

3)
.J'1(x1

2),

./(x2

./'clxc

0),

For the income-age example discussed above


l),
Plot the graph of .J(x: xc), .J'1(.x
(i)
://2),
c,/'3)
(ii)
Derive /'ltx/ 1), .J:(x
fzqx and compare their graphs
with that of Jttx ).
be bivariate normal
3. Let the joint distribution of Arj and
,/'24x2).

.:-2

X1
Xz

p1
(i)
(ii)
(iii)

t7'il

p1

'w N

PG 1 (r 2
-

cl

pclcc

/:2

Derive
Derive

(x1) and L (x2).


jx' xa(.Y1/'.X2), fOr x2
j

0, 1, 2.

Under what circumstances are X3 and X1 independent'?


Let the joint density function of A'j and X1 be
/'(x 1 x2)
,

Derive
Derive f.
1, 2, 10.

x l expt - xj( 1 + xzlj'

and

./-(x1)

v-tx,

'x

,)

fctxz).
for x

x 1 > 0,

1, 2, 10, and

x2

>

0.

for
(xc,/'x1)
/'xz.a.j

x:

XG4itional derences
Bickel and Doksum (1977); Chung (1974)',Clarke (1975);Cramer (.1946/ Dudewicz
1974); Pfeiffer ( 19781.,Rohatgi (1976),
(1976); Giri ( 1974); Mood, Graybill and Boes (

6.1

of one random

Functions

variable

CHAPT ER 6

Functions

of random

variables
Fig. 6. 1. A Borel function of a random

One of the most important problems in probability theory and statistical


A',,) when
inference is to derive the distribution of a function I1lZYl A-a,
This
known.
randoln
X,, ) is
vecto r X (A-l
the distribtltion of the
problem is important for at least two reasons'.
it is often the case that in modelling observable phenomena we are
(i)
primarily interested in functions of random variables; and
inference the quantities ot- primary interest are
in statistical
colnmonly functions of random variables.
It is no exaggeration
to say that the whole of statistical inference is based on
of various l'unctions of r.v.'s. l n the fi rst
ou r abil ity to deri ve the distribution
consider
the
distribution of functions of a single
subsection we art! going to
and then consider the case of functions of random
vecto rs.
r.v.
.

6.1

Functions of one random

considered a function from S to R and the above ensures that the composite
function (X): S R is indeed a random variable, i.e. the set
/l(Ar)(s)e:.B,)s,.i/-for any Bkt?d (seeFig. 6. 1). Let us denote the r.v. hX) by
Ft then F induces a probability set function #yt ) such that PyBh)= #x(#x)
JN.,4),in order to preserve the probability structure of the original
(S,
P )).Note that the reason we need ( ) to be a Borel function is to
preserve the event structure of
Having ensured that the function ( ) of the r.v. A' is itself a r.v. F hzrj
we want to derive the distribution of F when the distribution of X is known.
Let us consider the discrete case first. When A' is a discrete r.v. the F hzr)
is again a discrete r.v. and a1l we need to do is to give the set of values of F
and the corresponding probabilities. Consider the coin-tossing example
where X is the r.v. defined by A' (number of H - number of F), then since
.4

-.+

'

ts:
=

'%

'

,..d.

'

variable

P
Let A' be a r.v. on the probability space (,,
Suppose
-.+
function
S.
real
valued
R, i.e. -Yis a
on
S
function with at most a
11 is a continuous
(liscontinuities. More formally we need /1( ) to be
..#7t

'

)). By definition, A-( ):


'

'

that hl

countable

): R

)HT TH, HH, FF), XIHT)

XITH)

=0,

X(SS)

2, A'IFFI -2 and
=

the probability

function is
A- x
- 2 0 2
.,t J

Rs where
number
of
--+

a B()reI function.

'

22 4 with the same


&'&''
I.et F Xl, then F takes values ( 2)2 4, (0)2
probabilities as X but since 4 occurs twice we add the probabilities, i.e.
=

Dpnit 1011l

variable.

=0,

#r(F

y)

1.

-t2

'

ll1 gencral. the distribution


1-3(
)')

!1t- I't. t I1t- i1)

'$

'

P(

,s

t. 1' st.

1'-(s)
lI

.1

t'

<

function of J' is defined as


).)

t ik) I l /?

#( s :
1

(.)

- ((
(s) es1-1
l 1 t-t--t

11#.

'.'

t 17e

cs. )j
,

)),

tl 11it.l t.l e

I1tlt')ll
v''
'$''

I.'l I,t- t il/llh 4) f ra

y ba

r ia Illi.s

I 11 t llc ct str w' I'le1't,., is lt c() 11t i11tlt) tl s I'. tle l'i i I1g t 11etl ist l'l 1)11t I i 1 11 ( ) 1'1'
/7( ) is not as si l'npltt as thtl discrcte eastl because. rstly. 1'' is ntlt al ways a
continutus r.v. as well and, secondly, the solution to the problem depends
crucially on the nature of h ). A sufflcient condition for F to be a
continuous nv. as well is given by the following lemma.
.'

'

where hX) is dlrentiable


Let X be a continuous r.r. and F= h
< 0 forall x. Then
> 0 or (dll(x)q/(dx)
for alI x (5 Rx and gd/ltxlq/tdx)
the densit), functionof F is given by
=.fxl

(.p)

>

1(.p))

h - '(-p)

for a <

dy

.p

<

(6.2)

b,

where I stands for the absolute value and a and b


smallest and biggest value y can take, respectively.
Example 1
-p)/c,

(aY

1/(2Uy)

(y)-./)(x71')(j-try1

-j
i.e. F

ln cases where the conditions of Lemma 1 are not satisfied we need to


derive the distribution from the relationship
Fy(y)

P#(x)

> y)

Pr(X

E; h -

1((

(6.3)

xt x1)).
,

Example 2
Np, c2) and F= Xl (seeFig. 6.2). Since Edtxll/tdx) 2.x we can see
increasing for x > 0 and monotonically
that /l(x) is monotonically
<0
and
Lemma
2 is not satisfied. However, for y>0
decreasing for x
Let X

expt

(aa)

-1 (y- 1-l (.y)

normal distribution.

N(0, 1) the standard

'v

1+.t( )

Ed-

i(y))/

-/.p)4yt7y1

distributed.

(see Fig. 6.2). In this form we can apply the above lemma with
for x > 0 and x < 0 separately to get
(d.')
=

which implies that Edtxlq/tdx) 1/c>


and F
Let X N(p,
1(y)q/(dy)
1(y)
>
o.y +y and Ed
0 for a1l x c R since tr 0 by definition; /lc. Thus since
tr2)

'w

Fig. 6.2. Thc function F= ..Y2where X is normally

to the

refer

-jy)

expt

for y>o

((a) expt .-jyjjy-.i

-y)

That is, fytl,'lis the so-called gamma density, where F( is the gamma
fnction (F(n) jJ' t??1e-r d$. A gamma r.v.1 denoted by F.v G(r, p) has a
density of the form .J(y) g.p/F(r)q(p#)'- expt p)1, > 0. The above
1.2) and is known
distribution; an
distribution is G(1y,
as the chi-square
important distribution in statistical inference; see Appendix 6.1.
.)

,y

'v

Fy(y)

#r@(x)

= #r(

:%

y)

#r(x

6F

1(

),

xj)
-

Functions of several random variables

As in the case of a single r.v. for a Borel function h ): R'1 R and a random
xY,,), (X) is a random variable. Let us consider
vector X (.11,Xz,
used
functions of random variables concentrating on the
certain commonly
variables
case for convenience of exposition.
two
'

Fx(
Vy)
Uy% A' %Qyl Fx(Uy)
=

6.2*

Fig. 6.3. The function

(l )

.Y1 + Xz.

The distribution of X 1 + Ar2

By definition

the distribution

function of F

..Yj +

.:-.,.,

(sey Fig. 6.3) is


Example 4
U( - 1, 1), 1 1, 2 (uniformlydistributed).
t 7sing Fig. 6.3 we can show that
I

.et

'j

'v

and define F

A-j + X1.

ln particular,

if .,Y1and .Y2 are independent,

Iy'(..v)
=

by
X1

symmetry.
-

A' 2

./'1

(.p
-

x.z)./2'(x2)dx2

Using an analogous

argument

then
,1

(xj )./a*
(-'

-vl
-

) dx 1

we can show that for F

(2hl
-#/1

:6:

zc

1L
.

This density function is shown below (seeFig. 6.5) and ascan

be seen it is not

1():!

6.2*

of random variables

Functions

Functions

of several random variables

Fig. 6.6. The function F= Xb Xz for F


Fig. 6.4. The density function of F
uniformly distributed.

X3 + Xz where

and

A-I

l03

.::0

and F

>0.

A'c are

1$k t-s t he lbrl'rl


./'txl

F y (.p)

fv (y)

x2) dx1 dxc

- ccl

0.5
.;''r'

)
:

i
-1

-p

Fig. 6.5. The density function of F=


uniformly distributed.

.11

+ Xz + X5 where

.
).

Xi, i

1, 2, 3,
X

are

only continuous but also differentiable everywhere. The shape of the curve
is very much like the normal density. This is a general result which states
uniformly distributed independent r.v.'s,
that for Xi U( 1, 1), i 1, 2,
which
is closer to a normal distribution the
L Z)'-l Xi has a distribution
particular
of
value
case of the central limit theorem (see
greater the
n; a
Chapter 9).
'v

II1 tilt. t'tstt

:.

The distrihution

(.

A' 1 and Xz are independent

'

this becomes

I., tl'l;I'
t j

(.1,.v2)
'.fL(x2)dxc,
I.vzI./'l

'q

of A-1 A'2

where

.E.

(2)

xa) dxc,
jxajytyxa,

y,j
,y,.,(

1 1- t
.1! t

.$ (t l)e

(6.8)

htllnatical manipulations

lllat

are not ilnportantl)

.N(t ). l ) :1 11 tl A c y 4tl ) chi-sq u a re wi t h 11 deg rees of freedorn, .' 1


t. 1l t j? l I ) k lt- I I 1t It- l l I 'lt-t'il t- 1
lt ) lt n d let u s d e ri ve its
j
.
x (
j t l I 1 l l 1 1 l t I 1 I l l t ( Ik l 1 s l ! l't I I ) t' l i( ) I ( .3#' l I1t- (.It. ! 1( I 11il : t t (. 1'
l ) is i t't r'l
4
N (
.

x.

'

.1

'')t-

''

.'

.'

'1

..!

.'

'(

', 1'1sitl

t- 1'

t u-(

-s
',

l'.

'$
.

.'

:1 1,1(.1

.'

(.1

Ct1'1

It'!t

'i

'

'

'

'$

.1

'k

'1

'k

.....'''1

.j?.

--

..:

'$'

Functions of random

l() l

Ix. (Z)
.
Since

.J'(x1,::)

variables

n/ 2
Zn

ag(u,a)jjj-jus)

=/'1(x:)

nz

eXP -

it takes values only for z > 0, which implies that

'.gz),

'
.x

.-

--

Jr G j G 2

This is the density of Sfut/pnr's

j)

()

u-) z V

S( G
2

'

'y.

:-

- 2-

y +.

()' 1

j-

V1

($ .- u

d ly

z2(nj)and

'v

()

l-t/fsll-fhuffon.

'

/y(
. Aj

xz2(na)

Xz

be two independent

.-.

xz/nz)

'c

,?,

n
-!.

The distribution

v2

n1

/ 1 -na
,

n2

exp
F

/'y (

.r.)

n 1 + nz
2

--

j-

n j

n2

..2

y,

y))

nl

dxa
nj /2

-n

),'c(,,j

ajyaj

2
j +
1.4,1

m,

j'n

h (x g ) dx .c
.

.v

tj1( la
1+

xa

:=

and define

(X1 /n1) nz Xl

. --

r.v.'S

1+

u 1
'

112

y?

)/21

,, ...?

d xa

W Uere bl =

Example 6
Let Arl

G2

V 1 CT2
.--

xp

'

of F

mint A' 1 A-c)


,

'

v
%

--

2
2

u!
..y -V ---j
G1

G2

106

Functions

of random

6.2*

uriables

of several random variables

Functions

'rhese assumptions

enable

to deduce that

us

(6. 12)
Example 9

.ct

Xi

N(0, 1), i

'v

1'1
=

1, 2 be two independent

ll1(.. 1 X2)
,

X1 + X2,

nv.'s and

1'2 /12(X1. X2)

A' 1

Ar2

sillce
X

l''
1 0 1( 1
-

.172)

'1.F2

j+

'

.y,

.'j!

Fig. 6.7. The function F

min tXj

X2).

J= det

consider them together. Let (#1, X2,


joint probability density function
transformation'.
.(x1,

Ar,,) be a random vector with a


x,,) and define the one-to-one
x2,
.

1
1+ )'c

t lllskilnplies that

+ ),c

I.'''l

/'(y1
.

-p2)

2,:

exp

-j

jy 3....1
(yy-).4)

(y 1 .J.'2 ) + y f
(1 + yc) c
2

...

1 ( 1 -I
1 J.c)2
.J,f
-j
exp
+
( j .y y,a)
(
1
27:
.rc)

(6. 10)
l l1t. Illltin

drawback of this approach

is well demonstrated by the above

'N.$l1:I'le. The method


provides us with a way to derive the joint density
1(,11
of
the Ffs and not the marginal density functions. These can be
t
1$l,1k
1It.I 15 t'tl
by integrating out the other variables. For instance,

j
.. E

.I1l

111

t 11cltbove

example

these take the form

Assume:

(i)
(ii)
(iii)

hi( ) and gi ) are continuous;


the partial derivatives Jxf/pyf, i,j 1, 2,
continuous', and
the Jacobian of the inverse transformation
'

'

bn exist and

are

(. atlchy density.
'f!
.

,:@

108

Functions

6.3

Functions of normally distributed random

of random variables

Looking ahead

6.4
variables,

109

a summary

The above examples on functions of random variables show clearly that


A',,) when
deriving the distribution of hz-,
x,,) is known is
not an easy exercise. Indeed this is one of the most difficult problems in
probability theory as argued below. Some of the above results, although
involved (as far as mathematical manipulations
are concerned), have been
included because they play a very important role in statist'al
/ld/-pnct?.
Because of their importance
generalisations
of these results will be
summarised below for reference purposes.
,(x1,

Lgnlrntt 6 l
.

1/' Xi

(Yt'1

Nqyi c/).

x.

z'b'ij

Letnlna 6

1 2,

N(Yt,'-pi, Y'f'-

11 tgr'(?

c/)

normal.

independen r

?-.t?.*.!f

then

.2

//' X,. .N(0, 1),i 1 2,


z2(rl) chi-square with
=

x.

indepvndent
n Jtx' rees q/' jeedom.
-.l?.5y

g/-g

tbe

()-)'.1 .Y,2.)

6.2*
f--pnklrlt?

1/' Xi

()-'1
i 1

N(/tf c/), i
x
g2(j7.

'v

1 2,

2/.c2)
i
i

1'-'l'i/i fl

..

'G 2
i

Fig. 6.8. The normal

independent r'.!7.'-$ then


chi-square wr non-centralit
p

n are

) ..-non-central

'

p(.l /-(1111(.4l Q9l*


,

I-cmma 6.5

N(0, 1), ..Y1 x. 2(n) arv A' : .,Yz j?7(/t?/?(.?nJ?n


t ?-.t).'s then
z
with
??
f
Student's
'A j l)x,''
X
Jt.yf://-t.afz?-s
)1
t
2,
n
(
(rl).q.Jjeedoln.
-

'

distributions.

Ll,nltntl 6.-3

1.1%X 1

and related

x.

../

x,

Lt.zrrlnkt/6

.-3

N(p, c2), Xz
1/' X
1(?; J)
.,:-1,,/'(x'/(X2,/:)1
t.??J
p/'c.
ptlratnet
j

among the distributions referred to in these lemmas are


in Fig. 6.8. For a summary of these distributions see Appendix 6. 1
1,4.11 lt'r tt more extensive discussion see the excellent book by Johnson
II,lt $ K t ' t z ( I970).
I llt.

'v

'v

'v

I'cl:ttionships

lt'ki

L'1'It

,y$

c2g2(n),

.Yj Xz indepgndgnt ,-.j?.'y thgn


non-central
t with ntr?n-trt.?nl?-./// )'
,

Lemma 6.4
?-.!,:.*s
A' 1 ztln1 )s X1 z2(nc), A-j Xz independent
thetl
(xYl n j )/(A-2/n2) F(n l na) - Fisher's F with n j and na degrees t?/'

lf

'v

'v

'v

jeedom.

Lenma 6.4*

.k.
.t *4
'

II,

lllx

.*,
II1

?
q 1)
.(.)

I
:

,1

,1

'

'lk's

y (.t

ahead

.tltpking

w'e considered

of functions of random
AIt Iltltlgh thtl
are in general rather
l 1t is kt ve 1'),' iInp() rtant facet of probability theo ry for two reasons:
I ( t 'l't t.I1 k'cctl I's i1) prltct icl?t hat the probability model is not defined
I 11 t t. l I 11 s $ ) l. l llt! ( ) l'i
g i11:1l I'. N' s btl t i 11 sorne fu nctions of these.
%l ; t 1I s l I k': 1 I I I 1 t't-I't- I 1k't.- isk c I'tlt. i t lIy tlttpentlell
t o 11 t he d ist ribution of
l 1 1 l 1 t I l , s . l l ; 1! 1k 1( ', I 1) y : t !' 1kl l7 It- s I s ! iI1 1 t t ( I's t 11(.l t t.ls t s t l.t t is t ic s .trtt
.

k'llitlltel'

the distribution

manipulations

l'nathematical

1dtI

( '

.,

Functions

of random

Appendix 6.1

variables
f(B)

A-,,)and the distribution


functions of r.v.'s of the form h(X3, Xz,
of such functions is the basis of any inference related to the
unknown parameters 0.
From the above discussion it is obvious that determining the distribution
Ar,,)is by no means a trivial exercise. lt turns out that more
of /1(.-1, X1,
often than not we cannot determine the distribution exactly. Because of the
importance of the problem, however, we are forced to develop
approximations', the subject matter of Chapter 10.
lt is no exaggeration to say that most of the results derived in the context
of the various statistical models in econometrics, discussed in Part IV,
in Section 6.3 above.
depend crucially
on the results summarised
Estimation, testing and prediction in the context of these models is based on
the results related to functions of normally distributed random variables
and the normal, Student's t, Fisher's F and chi-square distributions are
used extensively in Part IV.
.

f (/; 1)

Appendix 6.1 The normal


Univariate

normal

.- Ar

#lxl

.. .

Gx/

;.- .....-

/-(x) --,V-,2 exp


c ( z)
'
x
F(.Y)

y,

exp

(27:)

Var(.Y)

1 u- p
- -2
(5'

1 xy
- 2 .--..c
skvwnvss

s)

p'

Chi-square distribution
- F

'v

z2(n)

d r/,'

Reproductit'e

prtpptrrly

2
,

as

0,

kurtosis

a4

distribution

chi-square

()

7.
,

n)

---.;.,-

Vn

Higher moments:

r t?7 el

2-

cc

Zch

tN''2)

exp

z2(n,.(i)

- 15
'v

y!fH.''

.j
2 ), + J)1.
t - -.-41

2) - 1

)J)(,?'/'

is

Some prtppt?rres

F( F)

n+

(),)kuk +. j.)

(2k)! 1- k +

>0

Cumulants

chi-square.

.E'(.p) n (thedegrees q/' freedomb,Vart)l) 2/.


I ilt! density function is illustrated for several values of n in Fig. 6.9.

Non-central
c2,

Fig. 6.9. The density functions of a central and non-central

distributions

and related

/. (z ; 6

N(p, c2)

'v

.Y

(y; 6 )

Var(F)

>

12
,

2(n + 2).

E:

'E

'

.)
''m.
:

is that the
rtant difference with the centra j c jaj-square
ftl llct itln is shifted to the right and the variance increases.

Ilctlt.t-. l Ile iIllpo

deltsl l

11t

-;'''(

?t/

11(

+1

l 1'(:

/)?'f

)'

Functions

of random

variables
f tx)

z
(w; 4 ) Nxw

-..V'''''- f (x)

z-

tzvj

exp

x2
-

nz
nz -2

.E'(LJ)

nz > 2,

I l1ccentral and non-central

>

t's /-J-ifr?-//7l.flk'(?l71.1z' r(N)

F-distribution

/'( &; n 1

'v

F-distribution

l ig. 6. 11 for purposes of comparison.


Non-central

e-

n 2,'

y ..
1+

.. U

2k) - 1

N1

nz

j-'

f-distribution

F(n1 rl2,' J),


,

>

+2:)
n 1 -i''t
-..n.2

n 1 + n z + 2/
-....
2

yn, +nz

n 1 4.
These moments show that for large n the

'v

density functions are shown in

ku J(n,+

Gt'

+2k)

nz

n2

......
Fl 1 +

gk

... . ,

>2s

is very close to the

normal (see Fig. 6. 10).

t (t.?)

Non-central

t-distrib. u/ ion - Hz' r(n; J), J

'

w; n

>

N .,

'''

ir

/'(

(jj
,

:;2!

.--z

(.ti

C .....(

iEl

-.-::?2

.-

)1

- .-y
wzliV'
Y /n r(n/2)(n +
--

f (tz; n , n z )

) , a-j

'

)'.

);'.

tit

y'gs(.l-+k -1--ljtjt
k=0

2w2

b-y
.

jk/l

n + u(r,

ws

f (fz; n j n a ; )
,

'.
)il)1/.,
''''
::.

1201-ft,lp-f/n
g

()
:'

'

kr( 1.12)

---

'i;

St l-ff/cri

+rl2
2#lJ(n1
Vart (p)
n1(n2
2) c (n2 -4)
-2)

&

u s ()

Functions

of random variables

Important

Appendix 6.1

tl'tlaccr/.:

Additional

Borel functions, distribution of a Borel function of a r.v., normal and related


distributions, Student's t, chi-square, Fisher's F and Cauchy distributions.

'larke

(
( l t?78);

Questions
Why should

be interested

we

in Borel functions

of r.v.'s and their

distributions?
:A Borel function is nothing more than a r.v. relative to the Borel field
on the real line.' Discuss.
Explain intuitively why a Borel function of a r.v. is a r.v. itself.
Explain the relationships between the normal, chi-square, Student's !,
Fisher's F and Cauchy distributions.
What is the difference between central and non-central
chi-square and
F-distributions'?
.td

Exercises
Let zYkbe a nv. with density function

A-1

(.X

*2

Derive the density functions


(i)

X
X
A'

(ii)
(iii)

of

2*

A' 17

e'Yi

l0 + 2%-21
Let the density function of the r.v. X be /'(x) e
distribution of F logc X.
Let the joint density function of X3 and Xz be
=

-X,

x > 0. Find the

),t

Derive the distribution of


F X21 + X2'25
(i)
F m in (A-j .Ya )
(ii)
( .'(() 1) kl t.-I-i t..'yt I1t-tl is t 1'i13t I t i(
1. t

j
*)

.'

.e

.x.

'$.'

',

1) (

.'

1'1'

references

(1975)., Cramer ( 1946),, Giri (1974),. Mood, Graybill


Rao ( 1973),' Rohatgi

( 1976).

and Boes

( 1974)*,Pfeiffer

The general notion

of expectation

expectation

In Chapter 4 we considered the notion of mathematical


context of the simple probability model

(7.1)

0j, 0 e:O

./'(x;

single random
as a useful characteristic of density functions of a
model
probability
generalised
the
to
Since then we

/ (x 1

x 2s

variable.

J'

'

..

in the

x , , ; 0 ) 0 iE ().
,

and put forward a framework in the context of whichjoint density functions


luarginalisation,
conditioning
and functions
can be analysed. This included
section
is
variables
of
this
The
to consider the
of random
(r.v.'s).
purpose
general
framework.
of
this more
notion of expectation in the context
Expectation

7.1

of a function of random

A'ariables

ln the one-dimensional case of a single r.v. we considered many numerical


characteristics of density functions, namely. E'(-Y), S(,Yr), .E'I-Y A(A-))2,
which contain summary information concerning
'(.Y - JJ(x))r, r= 2, 3.
the nature of the distribution of zY.lt is important to note that each of these
characteristics is the expectation of some function of X, that is, F(/;(xY));
/)4zY) X, /?(')
Xr, etc. This provides us with the natural way to proceed in
the present section, having diseussed the idea of a Borel functioll ( ):
For
(r which preserves the event structure of the Borel field
R''
where
n 2.
simplicity of exposition we consider the case
their joint
Let (.Y1 z'c) be a bivariate random vector with
density function and let h ): 22 -+ R be a Borel function. Define F
(A'1 Xcl and consider its expectation. This can be dened in two
-

'

;..d.

--+

.vc)

.(.x1

'

7.1

of a function of random

Expectation

variables

equivalent ways:

:X.

(ii) E(ll('

:,

A-z))

/1(.x1 x2)./'(x1

dxj dxa.

xz)

and it is usually
The choice between (i) and (iij is a matter of convenience
of
of Y.
diffkulty
deriving
by
the
degree
in
the
distribution
determined

Let Xi
Using

'v

N(0, 1), i

X( + X(.

(ii),
E(X 21 + X 2)
2

1
/',(J.')
2,)-.(
.
=

----

(x( +

=(v2j

the other hand, Y


freedom, that is,

2aj

2/

exp

jx ( + x 2aj) dx ( dx c

z2(2.j- chi-square

+ -Y2a)
'.w

exp

cc

on

and

r.v.'s and /1(Xl X a)

1. 2 be two independent

with two degrees of

t 1-v)
-

we know that

E(Y)

)')
-.'Jy',(

-v

equals the number of degrees of freedom (see Appendix 6.1).


Before we consider particular forms of pltAr(. .Ya) let us consider some
properties of E(ll(A-: .Xa)).
,

Properties

of expectation
fkE(lll(A-l Xa)) + ?E(/12(X1, A'2));
avd :( ), b a( ) are Borel
are
122 rtp R. .1n particular

EIJ/k1(X1. X2) + bhzx

Iinearity,
u/ntl'l ons

u'lltpl-t? a
.//4)m

1,

'

'

11

aiz-i
1* .=

t'onsrlnr.s

/1

E
x

Ar2)1

/nt

Z aikjz-i).
=

(7.5)

I/' X 1 and X a are independent

-.p.'s,

.ll-

tvyt?ry

Bovel jnction h

(')

The general notion of expectation

and Jl2( ),'R

R,

-.+

'

f(/'l1(A-1)2(A'c))

Eht

(.,1))

/'l2(..Y2)),

'

(7.6)

given that the above expectations exist.


This is botb a nessary
as well as suflkient condition for independence.
particular
of
interest is when /lj(xY1) XL and /lc(aY2) Xz,
One
case
=

.E'(A-1X2) F(.X1) &.zY2).


=

This is in some sense Iinear independence


independence. Moreover, given that
CovtA'l

(7.7)

'

which is much weaker than

.Y2) F(Ar:A-c) - f)-Y1) .E't-tcl

(7.8)

'

(see below), linear independence is equivalent to uncorrelatedness since it


implies that Covt-rj A'a) 0. A special case of linear independence of
particular interest in what follows is when
=

=0,

S(.Y1A'2)

and we say that A-l and Xz are orthogonal, writing X3 -1-Xz. A case between
F(.Y1), that
independence and uncorrelatedness is that in which E(k3/Xz)
is, the conditional and unconditional expectations of .11 coincide. The
analogous case for orthogonality is
=

E(XL,/X1)

when .E'(A-:) 0.

=0,

(7.10)

This will prove to be a useful property in the context of limit theorems in


Chapter 9, where the condition
EzY',/xn

.',,

.Y'l) 0
- c,
plays an important role. For reasons that will become apparent
sequel we call this property martingale orthogonality'.
1,

Forms of hX

A-cl ofparticular

in the

interest

lllA'l -Y2) A-tA-22, 1,k > 0,


,

then

p;k
=

'(.YtA%)

X)

xtxt-fxlxz)

dxl dx2

are calledytpfnl mw moments of order I + k; this is a direct generalisation


the concept for one random variable. For
(-Y1,.-2)=(*-1

-&A-1)/(-Y2 -'(A-2))k,

of

of a function of random variables

Expectation

7.1

E'(-Y1)

EEEE

/t1

are called joint central moments of order I + k. Two especially interesting


joint central moments are the variance and covariance;
() Covariance.. l k
=

Covtxl,

(7.14)

-/-t2)).

-Y2)- .E'((A'1-p1)(A-c

provides a measure of the linear


lrlwt?t?nfwt? random variables. With a direct multiplication
above formula becomes :

relationship

The covariance

Covt-l

Ar2)

E(A-1) F(A-a).
'(Ar1.Y2)
independent then using E2 wtl deduce tat
=

I/' zYl and Xz are

'

=0.

to note that the

tlt?rv important

(7.15)
(7.16)

Covlx'j, Ar2)
1t is

the

is not true.

conYerse

(iij Variance l 2, k 0
=

-p1)2,

'Varl-Yl)
For a linear

E(A'1

EEE

functionjf

var jl aiuri

aixi the tlariance

=)

is of the

form

a,? Var(A-i).
i

Using the variance wc could dejlne the standardised


formof a
covariance kntpwnas rt? correlation coefficient and dehned by

Corrt-'l, X2)=

Covtx'l A'c)
,

gvartA'jl

'

VartAralq

(7.20)

The general notion of exptation

Properties
(C/)
(CJ)

(4)7.)

of CorrtA'l

-Y2)

Corr(A'1, aY2) 0 for A-j and .Y2 independent r.p.'.s.


1 % Corrt.r 1 A'al % 1.
Corrtfj
A'2) 1 for
a + bX1, a, b being real constants.
=

-:-1

Example 2
Let
pcgljczj),
(*'x1a)
xllyLljafc,
-

Covtfj, X1)
X

(x1

-/.tc)T(x1,xc) dx1 dx2

(-Y1

-/z1)(xa

'X

cjcc

=,

-/t1)
cl

xz (2r)

(2a)

exp

exp

-j

-p2)-.

(1

.%1

(1

dx1

xa -/.ta

-p

-Jt1
cl

o.,

.-ptxlts-/jtjztj
(txa

-0.,a,v.

Hence, Corrtztj A-a)


It should be noted that p
in the case of
normality does imply independence, as can be easily verified directly that
p 0 o.J'(x1 x2)
'X(xz).
=p.

=0

=/(x1)

Nlta that correlation measures the linear dependence between r.v.'s but if
we consider only the first two moments that is the only dependence we can
analyse.

7.2 Conditional

expectation

Conditional expectation

7.2

The concept of conditional expectation plays a very important role both in


x,,; 0), 0 (E 0) to time
extending the probability model *= ).(x1, x2,
and
in
the specification of
variables
process)
random
dependent
(stochastic
models
Part
in
lV.
Iinear
of Arj for Xz x2 was
ln Section 5.4 the conditional distribution function
exists)
limit
the
defined to be (if
.

Fx: Aatxl

xc)

).

1im #r(.Y1 Gxj/xa - h %Xz %xa +

0<h-0

(7.21)

(7.22)
Note that

(7.23)
'fx,

X.
'

xzlx

2)

.f

qc
-

h (x1)
h (x1)

)
(x2?7x1
x, (xz/ x 1 ) d x I

x.
./-,-2

-.

Bayes'

formula,

the denominator being equal to .Ja(x2).


propertiesL
The conditional distribution function satisfies the following
in .Yl, i.e.
(CDI)
For a jlxed x2, Fxj x2(x1/x2) s a distribution function
1j.
Fxj
); IR (0,
,t.a(

'

-,

For a jlxed xl Fxj,.xatxl/'xzl is a


talues
of x1, x2
For 'ny

(CDJ)
(CD3)

./'x,,xa(x1/x2)

xgf/ xa,
/xc(.X1 '.Y2) > 0,
xl

For a
fxL

.fnction

of x2.

is a proper density
.
fxj xatxl, xc) dx1 1.
.rfrlcrft?rl

EER

and

J'''-

wl

The conditional expectation of A-j given that Xz takes a particular value


.vc (A'a xc) is defined by
=

The general notion of expectation

ln general for any Borel function hl ) whose expectation


'

Properties

of the conditional

exists

expectation

Let X, Arl and Xz be random variables ()n (s%, P )), tben:


x)
'(tz1:(A-1)+ a1h(Xz)jX
x) + azkhzrzlj
J1 Eh(X3),''X
X x), tJ1 az constants.
(CE2)
lf Arj > X1, '(.Y1/-Y x) )v:ELXI/X x).
(CE3)
E(l1(A-j -Ya) Xz x2) E(l1(X1 xa),,''X2 xa).
ICfVI f((A'1) X1 x2) f(ll(-Yt)) if A-1 al)(I are independent.
outside tbe square
(CF5)
J)/l(-Y1))
'gS((-1)
Xz xaq, tbe pxptvltzrr?rl
brackets being wfr respect t() X1.
The conditional expectation EIXL/XZ
x2) is a non-stochastic function of
-+
f(.Y1
Rx,
The
graph
i.e.
@.
):
(x2, EIXL x2)) is called the regression
x2,
.%

'

-:-2

'

(7&rPf?.

Example 3

ln the case of the bivariate normal distribution,

The graph of this linear

reqression

fncrftan

is shown in Fig. 7.1.

Conditional expectation

f (xlexa)

e'

.-

l
I

.-

I
I

X1
1
1
l

Xz

>

7z
.z.

.f

''-o

Fig. 7. 1. The regression

;v

2w

7e

eJ

curve of a bivariate normal density.

As in the case of ordinary expectation,


moments:

we can define higher conditional

Rflw conditional moments:

(7.28)
Central conditional moments:
A'EIA-I

Of particular
skedasticity'

-EX,/Xz

=x2))'/-Y2

interest is the conditional

Vart-Yl/-'rc

x2)

'E(Arj
Exljjxz

(7.29)

x21,

variance,

EIXL/XZ

sometimes

x2))2/Ara

called

x
2.

xc) g.E'(Arj Xz xalq


=
In the above example of the bivariate normal distribution we can show that
VartArl/Arc x2) c21(1
In the case where the conditional variance is
free of the conditioning variables it is said to be homoskedastic; otherwise it
is called heteroskedastic.
-p2).

The general notion of exptation

Example 4
Let

?6
This

is the distribution
distribution

function of Gumbel's

bivariate

E0,1(l
.

exponential

since

'(X:/X2

.X2)

r! ( 1 + pxa +
r+
(1 + pxc)

Var(.Y1,.'A-a xc)
=

and
-2p2

+pxa)2

(l +p

--

r0)

,f.--.--

( 1 + pxc)

For a bivariate exponential distribution the regression


and the conditional variance is heteroskedastic.

curve is non-linear

Example 5
The

joint density of a bivariate Pareto distribution takes the form

'./'(x1xc).
,

ltkcx,

ptp-h 1)(4?1t7c)+

',x.a

-:71Ja)-''+''

xl
The marginal

> fzj >0,

x2 >az

>0.

density function of X1 is
2

(.X

PJt?
'-tt'-b 11
,
2x 2

Var(.;Y'1/,Y2 xa)
=

t?l (p+ 1) xj
c
az 0 0 + 1)

ln the case of the Pareto distribution the


conditionalvariance is heteroskedastic.

regression curve is linear but the

7.2

Conditional

expectation

Example 6

joint density of a bivariate Iogistic distribution is

The

.Jtxj xa)
,

E(X3,,'Xz

x2)

'Xz

xa)

Vart-

2E1 +

zll -

expt

xc)),

- (x1+
1 - logt 1 + e-'2) - non-linear in x2

=.1yzrz
-

e-rt + e-

- homoskedastic

xa) is a non-stochastic function of x2 and for a


particular
.Q)is interpreted as the average value of .Y1
given Xz k1. The temptation at this stage is to extend this concept by
considering the conditional expectation
'(-Y1 Xz

As argued above
value

ECXL,/XI

.fa,

(7.31)
The problem, however, is that we will be very hard pressed to explain its
meaning. What does it mean to say
average value of A-j given the
--+
variable
E'? The only meaning we can attach to such a
-Y2( ): S
random
conditional expectation is
'the

StA- 1 c(-Ya)),

where c(A'c) represents the c-field generated by Xz, since what we condition
on must be an event. Now, however, E'(-Yj c('c)) is no longer a nonstochastic function, being evaluated at the r.v. Xz. lndeed, .E'tArl/ct.Ycll is a
random tlariable with respect to the c-field c(A-a) c: .F which satisfies certain
properties. The discerning reader would have noticed that we have already
used this generalised concept of conditional expectation in stating CE5,
'(/l(A'1)/
where we took the expected value of the conditional expectation
'c xc). What we had in mind there was '(llj(.Y1) c(-2)) since otherwise
the conditional expectation is a constant. The way we defined '(A-j/c(-Y2))
as a direct extension of E(X !yA-a x2), one would hope that the similarity
between the two concepts wlll not end there. lt turns out, not surprisingly,
that f'taj /c(A'2)) satisfies certain properties analogous to CEI-CES:
=

LSCE/) '(J1(A-1) + t72g(A-2)/c(A-)) fk11)ll(-Y!) t7'(xY))


+ t72S(#(-Y2) t4aY)).
$SCE2) 1./*-Y1y Xz. J).Yl c(-)) > E(A-a c4x)).
(SCS-?) SEll(A'1) #(zY2)1 FEg(-Y2).E'((A-l) c(X2))1.
(SCE4) &(zY1),/c(xYa)) JJ(&(-Y1))#' zk-k and A-a are independent.
'((aYI)
g(xYc)f)(zYI),
(SC'5)
g(-Y2) c(-2))
c(A-c)).
This implies that the two conditional expectation concepts are directly
related but one is non-stochastic and the other is a random variable.
=

'

The general notion

of expectation

Example 7

ln the case of the bivariate normal distribution considered above,


'(-Yl/c(A'a))

-i-p --1 (A'a

=pl

-p2),

G2

and it is a
Xz

random

variable. Note that Vartx''l 'c(A'2))

(homoskedastic).

ln relation to the conditional variance


the variance exists,
Vart.Yll

we can show

c2j( 1

-p2)

is free of

that if '(A-f) <

(x),

i.e.

Var(F(.Y1/c(A'2))) + '(Var(A-1,,'c(.Yc))),

that is, the varianee of X : can be decomposed into the variance of the
conditional expectation of ..Y1 plus the expectation of the conditional
variance. This implies that
Var(-Y1) > Vart'tA-l

c(Ara)).

(7.34)

Note that some books write E(X3/Xz) when they


EI-Yl X 2 x2) or F(-Y1,/c(X2)), thus causing confusion.

mean

either

The concept of conditional expectation can be extended further to


an
arbitrary c-field .f.2 where .1 is some sub-c-field of
not necessarily
generated by a random variable. This is possible because all elements of fzl
to which the conditional expectation of a r.v. X
are events with
relative to ...F can be defined. If we were to interpret the conditional
expectation EtXl Xz x2) as
-Y1 to a constant we can think of
X to a random variable.
EX/9) as
Because of the generality of EX,'''L we are going to summarise its
properties (whichindude CE1-5 and SCE1-5 as special cases) for reference
purposes:
Let A' and i' be two r.v.'s defined on (.$, P( )) and 91.g
(c-C.E1)
1/' c- is a ctlnslclnl, then E(c V') c.
wCE2)
1.f X % F, then E(X f/) % JJIF V).
ouCE3)
1j' a, b Jp'c constanss, E(aX + :F V) aEIX,/f?-) + bE(Yj,@
).
.%

'respect

'smoothing'

tsmoothing'

.%

,%

'

(c-cE4)

lE(x

(c-Cf.5)
(c-()7f3)

If
F(A').
tZ, then EX,
E(X/,F)
X.
flgftA- f/)) F(A-).
i 1, 2), FIJJIA#' V'1 9 2 (h
Ex/g' 1 ).

(c-CE7)
Lo'-CE8)

.f.0)1

z:

E(lA'l,,'f).
,$)

o'.s

.#t))

.%

'.f/)/.f/.

lj

2)
ELEIXI.@LII.LJ

7.3

Looking ahead

Let
be a r.p. relative to 94 and '(J-Yj)< tx;,, '()ArFj) <
SI-YF r2l A-JJIF f/').
If A- is independent of f/ ten fxY f?) S(Ar).
.;t'

(c-C'E9)

:t;

then

(wCE10)

These properties hold when the various expectations used are defined and
are always relative to some equivalence class. For this reason all the above
be qualified
statements related to conditional expectations must (formally)
surely' ('a.s.)(see Chapter 10). For example,
with the statement
C-CEI formally should read:
c' is a constant, i.e. X c a.s. then flxY V')
Ealmost

'lf

a S
.

=c

expectation will prove invaluable in the study


The concept of conditional
considered
stochastic
in Chapter 8 because it provides us with a
of
processes
natural way to formulate dependence.

7.3

Looking ahead

The concept of conditional expectation provides us with a very useful way


to exploit auxiliary information or manipulate information (stochasticor
otherwise) related to r.v.'s in the context of the probability model

tf

j'l.zj x 2
,

x,:,' p),

0 g (.))

(7.36)

Moreover,

as argued in the next section, the conuept of conditional


expectation provides us with the most natural and direct link between
sequences of independent r.v.'s discussed above and sequences of dependent
r.v.'s of stochastic processes which enable us to extend (l) to include more
realistic Jvnlrnfc phenomena. Stochastic processes are the subject of
Chapter 8.

Important

concepts

expectation of a Borel function of a r.v.-'


linearity of the expectation operator;
independence in terms of expectation, linear independence
(uncorrelatednessl;
orthogonality, martingale orthogonality',
joint raw and central moments, covariance, correlation;
conditional distribution and density functions- Bayes' formula;
F4.Y1 f.?).,
conditional expectations E(X3,/'Xz x2). '(X1
regression curve, raw and central conditional
moments,
skedasticity, homoskedasticity, heteroskedasticity.
,/c(-c)),

The general notion of exptation

Questions

3.
4.
5.
6.
7.

Explain the concept of expectation for a Borel function of a nv.


Explain the concepts of independence and uncorrelatedness in the
context of expectation of Borel functions of r.v.'s.
Compare the concepts of independence and martingale orthogonality.
What does the regression curve represent'?
What does the correlation coefficient measure?
What does skedasticity measure?
Compare the concepts of Etxj X1 x2), f2(A'l c(-Ya)) and fl.Yl 9. ),
where c +
=

.f/

Explain intuitively the equality


Vart-Yl)

VartE't.Yl -Y2))+ E'(Var(.Y 1 /Xz)).

Exercise.
For exercise 3 of Chapter 6 show that f'(A-1.Yc) F(A-1).E'(-Y2)but Ar1
and Xz are not independent.
For exercise 3 of Chapter 6,
derive J)-Y: /'Xz xa). Vartx't-l Xz x2), for x2 1, 2,
(i)
find Covt'j -Y2)-CorrtxYl -Ya).
(ii)
Let Xj and Xz be distributed as bivariate normal r.v.'s such that
=

A' 1

cf

Jzl

N
X2 ,v

pc1 c2
G1a

PtF1tr2

/22

pl

6'2a

=4,

(lalculate
E(X

X2

E(Ar2

-:-1

VartAfj/.Yz

4. Determine the
flxb

X2),

x1),

.x1

1, 2, 6,

=0,

1, 2,

Vart-rc/fj

x2),

x1).

of c for which

value
.X2)

X2

.X1(.X1

A72),

is a proper joint density function and derive:


EX 1 X 2 .X2);
(i)
Vart-'j Xz xc),'
(ii)
.E'IA'/-YZ
Xz x2);
(iii)
Przk-k + Xz % 0.5),.
(iv)
Corrt-l, .:-2).
(v)
Show tat f'(.Y1) f'tf'tatj/A-cll.
=

2,

pz
P

=0.2.

4,

7.3 Looking

ahead

Additional references
Bickel and Doksum
Whittle ( 1970).

( 1977); Chung (1974)) Clarke ( 1975/ Giri (1974); Rao (1973);

CHAPTER

8*

Stochastic processes

ln Chapter 3 it was argued that the main aim of probability theory is to


provide us with mathematical models, appropriately called probability
models, which can be used as idealised descriptions of observable
phenomena. The axiomatic approach to probability was viewed as based
on a particular mathematical formulation of the idea of a random
expeliment in the form of the probability space (.$,i@1 P )).The concept of a
random variable introduced in Chapter 4 enabled us to introduce an
isomorphic probability space (Rx,2,#x(')) which has a much richer (and
easier) mathematical structure to help us build and analyse probability
models. From the modelling viewpoint the concept of a random variable is
particularly useful because most observable phenomena come in the form
of quantifiable features amenable to numerical representation.
A particularly important aspect of real observable phenomena,which
the
random variable concept cannot accommodate, is their time dimension;the
concept is essentially static. A number of the economic phenomena for
which we need to formulate probability models come in the form of
dynamic processes for which we have discrete sequence of observations in
time. Observed data referring to economic variables such as inflation,
national income, money stock, represent examples where the time
might be very important as argued in Chapters 17
dependency (dimension)
and 23 of Part IV. The problem we have to face is to extend the simple
probability model,
.

* (.f(x;04, 0 e' t-)),


-

to one which enables us to model dynamic pbenomena. We have already


moved in this direction by proposing the random vector probability model

8.1
(I)

The concept of a stochastic


).f

txl xa,

13 1

process

(8.2)

x,,; 0), 0 6 (.)).

The way we viewed this model so far has been as representing different
characteristics of the phenomenon in question in the form of the jointly
X,,. lf we reinterpret this model as representing the
distributed r.v.'s Arl,
same characteristic but at successive points in time then this can be viewed
as a dynamic probability model. With this as a starting point let us consider
the dynamic probability model in the context of (S, % P )).
.

'

The concept of a stochastic process

8.1

The natural way to make the concept of a random variable dynamic is to


extend its domain by attaching a date to the elements of the sample

space S.
Dejnition

probability space and -F an index set of real


R. The
A-( ) by .X'( ): S x T
numbers and dejlne the function
ordered sequence of random variables )A-( r), t e: T) is called a

Let

(.,

,./

#4

'

)) be a

'

'

'

-+

'

stochastic (random)process.
This definition suggests that for a stochastic process (xY( r), l c T), for
each r c T,
r) represents a random variable on S. On the other hand,for
each s G S, A'(s, ) represents a function of t which we call a realisation of the
process. -Y(s, r) for given s and t is just a real number.
'

x(

'

Example 1

Consider the stochastic process (-Y( r), r G T) defined by


.

Xs, r)

F(# costztslr + u(s)),


-z:,zr),

U(
where y(.) and z(.) are two jointly distributed r.v.'s and tI(')
independent of F( ) and Z( ). For a fixed r, say t 1, -Y(# F(# cos(Z(# +
u(#), being a function of r.v.'s, it is itself a r.v. For a fixed y, F(# y, Z(# z,
n(,s) u are just three numbers and there is nothing stochastic about the
function z(r) y costzr + ?# being a simple cosine function of l (see Fig.
8.1(fz)).
This example shows that for each t G T we have a different nv. and for
each s 6 S we have a different realisation of the process. ln practice we
observe one realisation of the process and we need to postulate a dynamic
probability model for which the observed realisation isconsidered to beone
of a family of possible realisations. The original uncertainty of the outcome
of an experiment is reduced to the uncertainty of the choice of one of these
,v

'

Stochastic
X

processes

(t)
3
2
1
0

10 2030

40

50 60

0. 10

0.05

-0.05

-0, 10

1964

1967

1970

1973

1976

1979

1982

T ime

(b)

possible realisations. Thus, there is nothing


about a realisation of a
process which can be smooth and regular as in the above example (seeFig.
8. 1(:1))or have wild fluctuations like most econometric data series (seeFig.
8. 1(4)).
The main elements of a stochastic process :A'( r), r e: T) are:
(i)
its range space (sometimes
called the state space), usually R,'
(ii)
the index set -1J)usually one of R. ii
1, 0, 1, 2,
zc ). Z
t Z + k.f0 1 2
d
;
a
n
)
the dependence structure of the r.v.'s
t g -1J'.
'random'

'

zj ,

r0,
..(1)-

The concept of a stochastic

8.l

process

ln what follows a stochastic process will be denoted by t-Y(r), l (E Tj. (s is


dropped) and its various interpretations as a random varia t7le, a rea lisation
orjust a number should be obvious from the context used. The index set T
used will always be either T
+ 1, EfE2,
) or T )0, 1, 2, ), thus
concentrating exclusively on discree slotl/lfgsrtr processes (for contlnuous
stochastic processes see Priestley ( 198 1)).
The dependence structure of kfA-(1),t (E T) in direct analogy with the case
of a random vector, should be determined by the joint distribution of the
T is commonly an infinite set,
process. The question arises, however,
do we need an infinite dimensional distribution to define the structure of the
process?' This question was tackled by Kolmogorov ( 1933) who showed
that when the stochastic process satisfies certain regularity conditions the
ln particular, if we define the
joint
answer is definitely
< r,,) of T by
distribution of the process for the subset (r1< tz < r3,
A'(r,,)%x,,) then, if the stochastic
P-l.-trll
F(-Y(11),
x1,
satisfies
conditions:
A'(r),
the
-t'-)
t
e'
process:
Ar(ly,))
A'(r,,)) Ftxtryl ),
(i)
symmetrq F(xY(f1 ), .X(12),
is any permutation of the indices 1, 2,
where.jl..jz,
n (i.e.
reshuffling the ordering of the index does not change the
distributionl;
.X(l,, 1))
A-(r,,)) F(Ar(lj),
compatibilitq lirflxa.-.. F(.Y(rj )s
(i.e. the dimensionality of the joint distribution can be reduced by
marginalisationl',
there exists a probability space (S, P )) and a stochastic process tA'(l),
t e T) defined on it whose finite dimensional distribution is the distribution
F(.(rj),
A-(l,,)).as defined above. That is, the probabilistic structure of
the stochastic process t-Y(!), l e:T) is completely specified by the joint
-Y(r,,))for all values of n (a positive integer) and any
distribution F(A-(21),
This is a remarkable result because it enables us
of
T.
r,,)
subset (r1 r:!,
the
stochastic
to
process without having to detine an infinite
dimensional distribution. ln particular we can concentrate on the joint
distribution of a finite collection of elements and thus extend the
mathematical apparatus built for random vectors to analyse stochastic
processes.
Given that, for a specific 1, .Y(r) is a random variable, we can denote its
distribution and density functions by F(A-(l)) and /'(A'(l)) respectively.
Moreover, the mean, variance and higher moments of -Y(r)(as a r.v.) can be
defined as in Section 4.6 by:

t0,

tsince

itentative'

Kno'.

-(f,,))

'tti

'tl.jal,

.jn

..

.%

Adescribe'

(8.3)

'(-Y(r))

=p(l),

'E(.X(r)-/.t(r))21

tl(r),

'(-Y(r)r) /t,,(r),
=

(8.4)
(8.5)

Stochastic processes
As we can see, these numerical characteristics
of X(1) are in general
1, given that at each t 6 T, Ar( 1) has a different distribution
F(Ar(r)).
The compatibility condition (ii)enablcs us to extend the distribution
function to any number of elements in T, say r1, lc,
1,,. That is, F(.Y(rj),
A'(rz),
X(r,,))denotes thejoint distribution of the same random variables
A'(r) at different points in T. The question which naturally arises at this stage
is how is this joint distribution different from the joint distribution of the
random vector X !E (#1, Xz,
Xnj' where X3 Xz,
Xt, are different
random variables'?' The answer is not very different. The only real difference
stems from the fact that the index set T is now a cardinal set, the difference
between ti and tj is now crucial, and it is not simply a labelling device as in
the case of Fz'j
Xn). This suggests that the mathematical
Xz,
developed
in
Chapters
5-7 for random vectors can be easily
apparatus
extended to the case of a stochastic process. Forexpositional purposes let us
consider the joint distribution of the stochastic process )A'(l), t c T) for
t t 1 , t :,
The joint distribution is defined by

functions of

'

=z

F(Ar(ll), aY(r2)) #r(A'(r1) G x1, -Y(r2)G x2),

(8.6)

The marginal and conditional distributions for xY(r1)and


are defined
in exactly the same way as in the case of a two-dimensional random vector
(see Chapter 5). The various moments related to this joint distribution,
however, take on a different meaning due to the importance of the
cardinality of the index set T. ln particular the linear dependence measure
.:-(12)

t11,l2)
is now called the

'rl,

p(l t

(lf

-p(l2))q,

r1,r2GT.

(8.7)

function. ln standardised form

autocovariance

-------2-..5--.,

r2)

FE(.X'(f1)-p(f1))(A'(ra)

rl tz G T
,

1)t'(12))'

is called the autocorrelation


function. Similarly, the autoproduct moment is
defined by m(r1, l2) .E'(-Y(l1)A-(r2)).These numerical characteristics of the
stochastic process )A7(r),t s T) play an important role in the analysis of the
process and its application to the modelling of real observable phenomena.
We say that (.:71),
t is T) is an uncorrelated process if r(lj lc) 0 for any
r1, 12 6 T, 11 # l2. When rn(11 l2) 0 for any l1, t2 c T, rl # ra the process is
said to be orllltlgontll.
=

Example 2

One of the most important examples of a stochastic process is the normal

(or

8.1

The concept of a stochastic

process

Gaussian) process. The stochastic process tAr(l), t (E T) is said to be normal if


-Y(1,:))EEEEX,,(t)'
for any finite subset of T, say lj r2.,
r,,, (-Y(r1),.:-(12)),
has a multivariate normal distribution, i.e.
.

f (.2(11),
.

A''(l,,))

(det V )-1 expl

-ifX,,(t) #(t)) v,, 1 (x


-

,,

- (2zr)

,,/2

,,

(t)-

,(t))),

1, 2,
n is an n x n autocovariance matrix and
g(12),
/t(?,,))'
is
a
n x 1 vector of means. ln view of the
#t) (/t(rl),
deduce
condition
the marginal distribution of each
we
can
compatibility
.Y(rf), which is also normal,

where V,, EEEEt?trf,


ry)1,i,j
=

(8. 10)
As in the case of a normal random variable, the distribution of a normal
stochastic process is characterised by the first two moments /t(r) and n(l) but
now they are both functions of t.
The concepts introduced above for the stochastic process .Y(1),t e:T) can
be extended directly to a k x 1 vector stochastic process )X(l), t e:-F) where
X(l) = (.Y1(l), .Yc(r),
.Xk(l))'. Each component of X(l) defines a stochastic
tArftrls
(E T), i
1, 2,
k. This introduces a new dimension to the
process
r
concept of a random vector because at each 1, say r1, X(l1) is a k x 1 random
X(l,,)')' defines a random n x k
vector and for tzuz11 r2,
r,,,?' EEE(X(r1)?,
matrix. The joint distribution of .'.Tis defined by
.

with the marginal distributions F(X(lg)) #r(X(rf) xf) being k-dimensional


distribution functions. Most of the numerical characteristics introduced
above can be extended to the vector stochastic process )X(l), r s T) by a
simple change in notation, say J)X(l)) p(r), 'g(X(l) - p(l))(X(l) p(l))?1
V(l), t c T, but we also need to introduce new concepts to describe the
relationship between .Yj(l) and Xjz) where i #.j and 1, z c T. Hence, we
detine the cross-covariance and cross-correlation functions by
=

cf./ll,

'r)

:$

.E'E(-Y/?)-/.ti(l))(.'.j('r) -Jt/z))(1

(8.13)

Stochastic processes
These concepts
for i
;e tween the stochastl'c processes Arf(r),
linear
dependence
the
measure
)
moment
r g Tl and .Y/(z), z e: Tl Similarly, we define the
Note that 1.41, z)
function by l?1fyl/, z)
)-j(l), .Yj(z)) ntr, z) when i
z)Et.?(r)l.T)()
r(r,
z) p(l)p(z)
' Using the notation introduced in Chapter
1.n(1,
6 (seealso Chapter 15) we can denote the distribution of a normal random
by
matrix
N()te that

cijt, z)

r7(r,z) and

1')

rijt,

r41, r)

=j.

'ross-product

=j.

nk'

X(l1)
X(r2)
X(r,,)

XN

Jl(r1)

V(r1 )C(f1 rc),

p(r2)

C(r2-rl )V(r2)

'

1,,)
(r(?.1
,

llltnl

t7(r,, t I )

V(r,,)

(8. 14)

where V(rj) and C(lj, ly) are l x k matrices of autocorrelations and crosscorrelations, respectively. The formula of the distribution of needs special
notation which is rather complicated to introduce at this stage.
,'

(EX

In defining the above concepts we (implicitly)assumed that the various


moments used are well defined (bounded)for all t G -F, which is not generally
true. When the moments of LzY(r),t e: T) are bounded for al1 l G T up to order

/, i.e.

for

all '

7J,

(8.15)

we say that the process is q/' order 1. In defining the above concepts we
assumed implicitly that the stochastic processes involved are at least of
order 2.
The definition of a stochastic process given above is much too general to
enable us to obtain a manageable (operational) probability model for
modelling dynamic phenomena. ln order to see this let us consider the
question of constructing a probability model using the normal process. The
natural way to proceed is to define the parametric family of densities
/'(.Y(f) p,) which is now indexed not by 0 alone but r as well- i.e.

t./'(X(1); pr), %e' (% r


.

EET),

lf ./'(Ar(r);p,) is the normal density %EB (/t41),lz'(r,r)) and (% EE E2x R+


The fact that the unknown parameters of the stochastic process tA'(r), l (E T)
change with r (such parameters are sometimes called incidentalj presents us
with a difficult problem. The problem is that in the case where we only have
a single realisation of the process (the usual case in econometrics) we will

8.2

Restricting

time-heterogeneity

have to deduce the values of p(1) and P'(r, r) with the help of a single
observation! This arises because, as argued above for each r, -Y(s. 1) is a
random variable with its own distribution.
The main purpose of the next three sections is to consider various special
forms of stochastic processes where we can construct probability models
which are manageable in the context of statistical inference. Such
manageability is achieved by imposing certain restrictions which enable us
to reduce the number of unknown parameters involved in order to be able
to deduce their values from a single realisation. These restrictions come in
two forms:
of the process', and
restrictions on the time-heteroleneitq'
restrictions on the memory of the process.

ln Section 8.2 the concept of stationarity inducing considerable timehomogeneity to a stochastic process is considered. Section 8.3 considers
various concepts which restrict the memory of a stochastic process in
different ways. These restrictions will play an important role in Chapters
22 and 23. The purpose of Section 8.4 is to consider briefly a number of
important stochastic processes which are used extensively in Part lV. These
include martingales, martingale differences, innovation processes. Markov
processes, Brownian motion process, white-noise, autoregressive (AR) and
moving average (MA) processes as well as ARMA and ARIMA processes.
8.2

Restricting the time-heterogeneity of a stochastic process

process :A'(1), t c T) the distribution function


the parameters % characterising it being
F(X(tj; 0t) depends
well.
stochastic
That
l
is, a
functions of as
process is time-heterogeneous in
raises
issues in modelling real
difficult
general. This, however,
very
only
usually
have
phenomena because
one observation for each 1.
we
will
pf On the basis Of a single
practice
Hence, in
have to
we
observation, which is impossible. For this reason we are going to consider
an important class of stationary processes which exhibit considerable timehomogeneity and can be used to model phenomena approaching their
undergoing
but continuously
equilibrium steady-state,
stochastic
stationary
class
of
processes.
fluctuations. This is the

For an

arbitrary

stochastic

on t with

'estimate'

'random'

Djinition

,4 stochastic process .Y(r), t EET) is said lo be (strictly)stationary


1,,) C!JT and some
for Jn-' subset (rj !2,
ft

(J

'r,

(8.17)

Stochastic processes
That is, the distribution function of the process remains unchanged when
shifted in time by an arbitrary value z. ln terms of the marginal distributions
t c T stationarity implies that
F(A'(M),
F(-Y(r))

FX(t +z)),

(8.18)

F(-Y(f,,)). That is, stationarity implies


and hence F(A'(!1)) F(zY(t2))
-Y(r,,)are (individually)
that A'(!j),
identically distributed (ID); a perfect
time-homogeneity. As far as thejoint distribution is concerned, stationarity
implies that it does not depend on the date of the first time index l1.
This concept of stationarity, although very useful in the context of
probability theory, is very difficult to verify in practice because it is defined
in terms of the distlibution function. For this reason the concept of Ithorder stationarity, defined in terms of the first l moments, is commonly
preferred.
=

'

'

'

Dehnition 3

,4 stochastic process 1tAr(r),t (E T) is said to be lth-order stationary 4J'


1,,) of T and Jny z, F(aY(l1,
for any subset (r1,tz,
is of
equal
corresponding
the
itsjoint
moments are
order land
to
moments
.:71,,))

t?.JF(Ar(l1 +z),

.:-(11
.

t.Y(12)ldz,
E'EtArtrjl )d1
.

=
ln order to understand
(1)

First-order

(2)

/,, l,.see Priestley


::t

(8.19)

(1981).

this definition let us take 1= 1 and I 2.


=

stationarity'
<
for al1 t c T and
'(I.Xr)I)

=p,

Second-order

(x;

stationarity'

(.-(8, l c T) is said to be second order stationary if '(I-Y(l)12)<


and
(/j

1, Iz

A-(!',,))l,,(I

tA-(r), t c T) is said to be first order stationary if


1, f'(AXr))E(X(t +z))
constant free of r.

for /1

k.f

FgeftArtrl + z) )..'1 wtf


A'(r,:+ z))Z(1

where/1 + lz +

+ z)), i.e.

0)

'ExY(r)j

.E'EXII

+ z)j

:x;

forall t e T

=pl,

(/1 2, lz 0)
=

constant free of

#.2

Restricting

time-heterogeneity

(/j

1,lz

(iii) '(.tA-(r1)).tA'rtral)j

and

Taking z

1)

z)) .t.-Y(/'a+

= figta'&-trj+
rl we can deduce that

.EE.Lz''(0)).:.'(r2
- r1))I
-

lllfa

z))(I
.

Irz rl 1.

a function of

...-.11),

These suggest that second-order stationarity for -Y(!) implies that its mean
and variance (Var(Ar(l)) p
pI) are constant and free of r and its
l2)
f;g)A-(0)))A'(l2
autocovariance (r(11
,
r1))(J pl) depends on the
stationarity, which is also
and
Second-order
not
interval jl2 rj j;
r2.
rj or
called wuk or wide-sense stationarit-v, is by far the most important form of
stationarity in modelling real phenomena. This is partly due to the fact that
in the case of a normal stationary process second-order stationarity is
given that the first two moments
equivalent to strict stationarity
normal
distribution.
the
characterise
ln order to see how stationarity can help us define operational
probability models for modelling dynamic phenomena let us consider the
implications of assuming stationarity for the normal stochastic process
',A-(r),t c T) and its parameters 0t. Given that '(-Y(r)) p and Var(-Y(l)) c2
for all l G T and tyrj, f2) s(lr1 rcl)for any rl rc c T we can deduce that for
1,,) of T the joint distribution of the process is
the subset (tI l2,
=

characterised by the parameters

(n+

1) x 1 vector.

(8.20)

This is to be contrasted with the non-stationary case where the parameter


N
n), a (n+ n2) x 1 vector. A sizeable
ector is 0= (p(rf),ntrf, ly), i,jzuz1, 2,
reduction in the number of the unknown parameters. lt is important,
however, to note that even in the case of stationarity the number of
1,,) although the
parameters increases with the size of the subset (lj l2,
This
does
(E
time-homogeneity
T.
is
because
depend
t
on
parameters do not
Ar(rj)
and
The
of
between
dependence
restrict
the process.
the
not
r2l but the
4rc) is restricted only to be a function of the distance
.
r'unction itself is not restricted in any way. For example h ) can take forms
.

'memory'

'

Il1-

xueh as:

-?a)2.

(a)ll(lr1 -rc1)-(r1
(b)/,(lr1 /.c1) exp.t - Ir1 1c1).
-

(8.21)
(8.22)

Ic case (a) the dependence between .Y(rj) and xY(lc) increases as the gap
een rl and tz increases and in case (b) the dependence decreases as
-hi?ru

140

Stochastic processes

Il: !cIincreases. ln terms of the


indeed
from

'memory'

of the process these two cases are


stationarity
viewpoint they are identical
different
but
the
very
(second-order stationary process autocovariance functions). ln the next
section we are going to consider
restrictions in an obvious
the problem of the parameters increasing with the size of
attempt to
the subset (r: 12,
r,,)of T.
Before we consider memory restrictions, however, it is important to
stochastic process as the
comment on the notion of a non-stationary
absence of time-homogeneity. Stationarity, in time-series analysis, plays a
similar role to linearity in mathematics', every function which is not linear is
said to be non-linear. A non-stationary stochastic process in the present
context is said to be a process which exhibits time-heterogeneity. ln terms of
actual observed realisations, the assumption of stationarity is considered
appropriate for the underlying stochastic process, when a z-period (z> 1)
wfnltpw,wide enough to include the width of the realisation, placed directly
over the time graph of the realisation and sliced over it along the time axis,
shows
same picture' in its frame; no systematic variation in the picture
(see Fig. 8. 144). Non-stationarity will be an appropriate assumption for the
underlying stochastic process when the picture shown by the window as
sliced along the time axis changes
such as the presence of a
variance.
change
monotonic
An
trend or a
important form of nonin the
is
so-called
non-stationarity
stationarity
which is
the
homogeneous
described as local time dependence of the mean of th,process only (see
ARIMAI/?,q) formulation below).
-

kmemory'

'solve'

'the

'systematically',

8.3

Restricting the memory of a stochastic process

In the case of a typical economic time series, viewed as a particular


realisation of a stochastic process .tA-(1),r 6 T) one would expect that the
dependence between Ar(r1)and Ar(r2)would tend to weaken as the distance
refers to the GNP in the UK at time l
(r2 - l1) increases. For example, if
would
between
.Y(r1) and .:X12)
dependence
that
expect
one
to be much
greater when lj 1984 and tz 1985 than when 11 1952 and tz 1985.
Formally, this dependence can be described in terms of the joint
distribution F(aY(t1), .X(la),
X(!,,)) as follows:
-lrl

Dehnition 4
A slochflslfc process .fA-(1),t e T) dejlned on the probability space
#( ))is said to be asymptotically independent lf forany subset

(.,

,%

'

14 1

8.3

gOeS

!t? zero

J.$ T

izfw).

-+

Let us consider the intuition underlying this definition of asymptotic


and B in
independence. Any two events
are said to be independent
0. Using
when #4z4 ra S) #(yl) PBj or equivalently P(.4 ch Bj - #(z4)#(:)
this notion of independence in terms of the distribution function of two
random variables (r.v.'s)Arl and Xz we can view tF(-Yj .Y2) - F(x1)F(Ara)l
as a measure of dependence between the two r.v.'s. In the above definition of
asymptotic independence j(z) provides an upper bound for such a measure
) the
of dependence in the case of a stochastic process. lf j(z) 0 as z
+z))
+z),
subsets
Ar(l,,))
and
(A'(j),
A-(r,,
become
two
(-:-411
independent.
A particular case of asymptotic independence is that of m-dependence
and -Y(l2) are
which restricts j(T) to be zero for all T > n;. That is,
would
for
practice
>
ln
independent
expect to be able to find a
we
rrl.
)lj
enough'
able
approximate
'large
m so as to be
to
any asymptotically
m-dependent
This
is equivalent to
independent process by an
process.
small
able
>
assuming that j(z) for z m are so
to equate them to zero.
as to be
An alternative way to express the weakening of the dependence between
.Y(l1) and A'(l2) as !lj ra( increases is in tenms of the autocorrelation
function which is a measure of linear dependence (see Chapter 7).
,4

,t7-

-+

-+

-(r1)

lcJ

Dejlnition J

.4 stochastic process (A'(l), r (E T) is said to be asymptotically


uncorrelated #' tere exists t; sequence of constants kfp(z), z y 1)
dehned by
d1, t +z)

(t'(l)r(f+z))''

G P(T),

./r

aIl r 6 V,

sucb tat

0 %p(z) <

As wecan see, the sequence of constants


z 7. 1) defines an upper bound
r(!, r +z). Moreover, given
autocorrelation
coefficients
the
of
sequence
or
-f 1 +'
-+
and
<
ptz)
p(z)
for > 0, a sufficient
0 as z w, is a necessary
'.hat
z
v'-,
underlying the
White
1984)),
intuition
<
/?(z)
condition for
the
I
x (see
(
obvious.
above definition is

tplz),

-.+

Stochastic processes

ln the case of a normal stochastic process the notions of asymptotic


coincide because the dependence
independence and uncorrelatedness
l1,
and
-Y(lj)
A'(rc)
for
between
tz e:T is completely determined by the
any
autocorrelation function rtlj r2). This will play a very important role in
Part IV (see Chapters 22 and 23) where the notion of a stationary,
asymptotically independent normal process is used extensively. At this
stage it is important to note that the above concepts of asymptotic
which restrict the memory of a
independence and uncorrelatedness
stochastic process are not defined in terms of a stationary stochastic process
but a general time-heterogeneous process. This is the reason why #(z)and
ptz) for z > 1 define only upper bounds for the two measures of dependence
given that when equality is used in their definition they will depend on (11,
Svell
1,,)
as
as z.
r2,
general
formulation of asymptotic independence can be achieved
A more
using the concept of a c-field generated by a random vector (seeChapters 4
where (.Y(r),
and 7). Let
denote the c-field generated by ..Y(1),
T)
is a stochastic process. A measure of the dependence among the
tc
elements of the stochastic process can be defined in terms of the events
(E
and
B c Mt
by
,

,@j

.:-(1)

.,4

.@y

a(z) suplf'l-gt
=

c B4 - #(.4)/'4:)1,

(8.25)

Dejlnition 6

,-1stocbastic process
(z) 0 as z --.

kf

( c T)
.Y(M,

is said to be

(strongly)mixing

1)/-

bz'.

-+

of the asymptotic
As we can see, this is a direct generalisation
independence concept which is defined in terms of particular events and B
related to the definition of thejoint distribution function. In the case where
)A'(r), l G T) is an independent process a(z)
for z > 1. Another interesting
special case defined above of a mixing process is the m-dependent process
where a(z) 0 for z > m. In this sense an independent process is a zerodependent process. The usefulness of the concept of an m-dependent
process stems from the fact that commonly in practice any asymptotically
independent (ormixing) process can be approximated by such a process for
ilarge enough' ?'n.
A stronger form of mixing, sometimes called unlfrm mixing, can be
defined in terms of the following measure of dependence:
,4

=0

tp(T)

sup

(8,4/,)-#(-4)t,

PB)

>0.

(8.26)

Restricting

#.3

143

memory

Dehnition

,4 stochastic process (Ar(r), t c T) is said to be uniformly mixing


:f
0 as

(P(T)

'r

'--.

(f

-+

Looking at the two delinitions of mixing we can see that a(z) and @(z)define
absolute and relative measures of temporal dependence, respectively. The
formeris based on the definition of dependence between two events and B
separated by z periods using the absolute measure
,4

E#(z4ro #) 8-4)
-

'

#(#)) 10

and the latter the relative measure


EPIA/.B)

#(X)) > 0.

stochastic
stationary
processes
ln the context of second-order
asymptotic uncorrelatedness can be defined more intuitively in terms of the
temporal covariance as follows:
Xt
CovtAXll,

+z))

t?(z)

-+

A weaker form of such memory restriction is the so-called ergodicity


property. Ergodicity can be viewed as a condition which ensures that the
p(z)
by averaging over
memory of the process as measured by
time'
kweakens

Dehnition 8

-4second-order

stationary

1 r
lim -T Zl p(z)
zv-vx

process (-Y(l), r e T) is said to be ergodic

=0.

(8.28)

If wecompare (28)with (25)wecan deduce that in the case of a second-order


stationary process strong mixing implies ergodicity. The weaker form of
temporal dependence in (28),however, is achieved at the expense of a very
In modelling we need
restrictive form of time-homogeneity (stationarity).
and
off
between them (see
there is often a trade
both type of restrictions
Domowitz and White (1982:.
Memory restrictions enable us to model the temporal dependence of a
stochastic process using a finite set of parameters in the form of temporal
moments or some parametric process (see Section 4). This is necessary
in order to enable us to construct operational probability models for
modelling dynamic phenomena. The same time-heterogeneity and memory
restrictions enable us to derive asymptotic results which are crucial for

Stochastic processes

statistical inference purposes. For example one of the most attractive


features of mixing processes is that any Borel function of them is also
mixing. This implies that the limit theorems for mixing processes (see
Section 9.4) can be used to derive asymptotic results for estimators and test
statistics which are functions of the process. The intuition underlying these
results is that because of stationarity the restriction on the memory enables
us to argue that the observed realisation of the process is typical (in a certain
sense) of the underlying stochastic process and thus the time averages
of the corresponding
probability
constitutee reliable estimates
expectations.

8.4

Some spial

stochastic processes

The purpose of this section is to consider briefly several special stochastic


processes which play an important role in econometric modelling (seePart
IV). Thcse stochastic processes will be divided into parametric and nonparametric processes. The non-parametric
processes are defined in terms of
their joint distribution functions or the first few joint moments. On the
other hand, parametric processes are defined in terms of a generating
mechanism which is commonly a functional form based on a nonparametric process.

(1)

Non-parametric

processes

The concept of conditional expectation discussed in Chapter 7 provides us


with an ideal link between the theory of random variables discussed in
Chapters 4-7 and that of stochastic processes, the subject matter of the
present chapter. This is because the notion of conditional expectation
enables us to formalise the temporal dependence in a stochastic process
(.:-(1),t e:T) in terms of the conditional expectation of the process at time t,
.:-(1)('the present') given (-(l 1), A'(l -2,
(ithe past). One important
application of conditional expectation in such a context is in connection
with a stochastic process which forms a martingale.
.)

(1)

Martingales
Dehnition 9
(lfned
#( )) and
Let )A'(r), t G T) be a stocastic
on (S,
process
Jt
oj'
oujlelds
G
T
increasing
9$.
l e: T)
sequence
r
(f ,,
) an
tc
the
conditions:
satisfyinq
followinq
,.%

'

.@

,#7t

8.4 Some

special stochastic

Ar(l) is a random

()
(ff)
(fi)

processes
lmriable

zR1x(r)l) (.c. its

(r.t?.)relative rtl

mean is bounde
X(l 1), for aII t 6 T.
'(-Y(l) 9. t - j)
Then (A'(l), t 6 T) is said rtp be a martingale
).f4, t 6 T) and wtl wrflp tAr(l), 4, t e T).
< c/s

.f/?.

for aII r e: T.

for all t c T,. and

wirll

respect

to

Several aspects of this definition need commenting on. Firstly, a


martingale is a relative concept; a stochastic process relative to an
increasing sequence of c-elds. That is, c-fields such that 91 c cLhcu L'it,
and each -Y(l) is a r.v. relative to %, t G T. A natural choice for
c Vt cu
c-fields
X(1)), t 6 T. Secondly, the
1),
will be gtt c(Ar(l), Xt
such
value
bounded
all
of
xY(l)
expected
for
must be
t c T. This, however, implies
because Z'(X(l))=
stochastic
has
constant
that the
mean
process
jq
all
c-CE7
of conditional
E(.Y(l
1))
'gF(xY(t))/.@
T
for t c by property
1.expectations (see Section 7.2). Thirdly, (iii)implies that
.

E(xY(!+ z) i? t-

1)

Xt

1) for all l (E T and z y 0.

(8.29)

That is, the best predictor of xY(r+ z), given the information 9't-. 1, is A'(l 1)
for any z >0.
Intuitively, a martingale can be viewed as a fair game'. Defining .Y(1)to
be the money held by a gambler after the rth trial in a casino game (say,
black jack) and Vk to be the history' of the game up to time 1, then the
because the gambler
condition (iii)above suggests that the game is
before trial t expects to have the same amount of money at the end of the
trial as the amount held before the bet was placed. It will take a very foolish
gambler to play a game for which
-

%fair'

(iiil'

Eqxltj/.@t

1)
.GX(l 1).
defines what is called a supermartingale

(8.30)

This last condition


(tsuper'for the
casino?).
The importance of martingales stems from the fact that they are general
enough to include most forms of stochastic processes of interest in
econometric modelling as special cases, and restrictive enough so as to
theorems' (seeChapter 9) needed for their statistical
allow the various
analysis to go through, thus making probability models based on
operational. In order to appreciate their generality let
martingales
consider
examples of martingales.
extreme
two
us
tlimit

ilargely'

Example 3
Let tZ(r), l c T) be a sequence of independent r.v.'s such that F(Z(r))

0 for

146

Stochastic processes

al1 t e: T. If we define .:-(1) by


f

z(k),

-Y41)
=

k=1

then )X(r),

(8.3 1)

h, r e T) is a

martingale, with 9.t c(Z(M,


Z(l - 1),
c(A'(r), X(t 1),
xY(1:. This is because conditions (i) and
automatically satisfied and we can verify that
=

z( 1))

(ii) are
(8.32)

using the properties c-CE9 and 10 in Section 7.2.

Example 4

t 6 T) be an arbitrary' stocbastic process whose


tz(!),
F(lz(r)j)
<
for all t e:T. If we define ..Y(1)by
that

only restriction is

Let

gz

A'(r)

EZ(k)

1)1,

Elzkl/.i

(8.33)

k=1

.Y(1:, then
c(Z(k), Zk
1),
Z( 1)) c(A-(k), .Y(k 1),
tA'(r), ?,, t G T) is a martingale. Note that condition (iii)can be verified using
the property c-CE8 (seeSection 7.2).

where 9k

The above two extreme examples illustrate the flexibility of martingales


very well. As we can see, the main difference between them is that in example
3, aY(r)was defined as a linear function of independent r.v.'s and in example 4
as a linear function of dependent nv.'s centred at their conditional means,

i.e.

F(r)

.E'(Z(l)/Mt

.:-(/)

==

.-.

(8.34)

1),

lt can be easily verified that )F(l), r e: T defines


martingale dfference process relative to tt because
'(F(l),

In the case where


F(F(l)F(k))

:)

=0,

for all

'(.E'(F(r)F(/())

= E gF(k)'(F(f)/f/,
That is, (F(l), t e: T) is an orthogonal

rG T

q
-

is known as a

(8.35)

f G T.

< x for all


.E'(IZ(f)I2)
=

what

we can

deduce that for r > k

yj

:)j

=0

sequence as well

(8.36)
(seeChapter 7).

special stochastic

8.4 Some

147

processes

Dehnition 10

.4 stochastic process tF(l), t (E 7-) is said to be a martingale diffrence


c
c
process relative to the increasing sequence of c-.l/l.s
.f4

j.j
Q.y (uu
F41) is a nn, relative to
(f)
'(lF(r)!) < Jz'; and
(if)
f)F(r) . , - 1) 0, t c T.
(fff)
.

(uu

%;

Dehnition 11

-4 stochastic process )F41), t (EFT' ) is said to be an innovation process


c(-(l), Xt
1),
#- it is a martinqale dierence wfr/l respect to
.f4

.
(i)
(f)

xY(0),where
'(1

-:-(1)

1'(r)I<

2)

'(y(r)F(z))

Js;
=

.,

0, t >

() F(.j), and

z,

r, z c T.

These special processes related to martingales will play a very important


role in the statistical models of interest in econometrics in Part 1V.
Returning to the main difference between the two examples above we can
elargely'
equivalent to that
see the independence assumption in a sequence is
of orthogonality in the context of martingales. It will be shown in Chapter 9
that as far as the various limit theorem results are concerned this Trude
tlaw
of large numbers'
equivalence' carries over in that context as well. The
limit theorem' results for sequences of independent nv.'s
and the
can be extended directly to orthogonal martingale difference processes.
Martingales are particularly important in specifying statistical models of
interest in econometrics (seePart 1V) because they enable us to decompose
any stochastic process (Ar(1),t G T) whose mean is bounded for all t G T into
two orthogonal components, p(r) and u(r), called the systematic and nonsystematic components, respectively, such that
icentral

.X'(1)

=/-t(r)

141),

(8.37)

and u41) xY(1) '(A'(r) a't


Some c-field 9t
where /t41) E(X(t)/.%
defining our sample information set. The non-systematic component u(l)
defines a martingale difference and thus all the limit theorem results needed
for the statistical analysis of such a specification are readily available.
ln view of the discussion in the last two sections on time-homogeneity
and memory restrictions, the question which arises naturally is to what
extent martingales assume any of these restrictions. As shown above,
martingales are first-order stationary bcause
1)

F(aY(l)) .E'4-Y(1 1)) p,


=

for a11 t 6 T.

1), fOr

(8.38)

Stochastic

processes

Moreover, their conditional memory is restricted enough to allow us to


define a martingale difference sequence with any martingale. That is, if
tA-(r). t e: T) is a martingale, then we can define the process
F(r)

xY(r)- X(t - 1),

F(0)

(8.39)

A-(0),

which is a martingale difference and A'(!) j () FU).ln the case where the
martingale is also a second-order process then F(f), r e: T) is also an
innovation process. In Chapter 9 it will be shown that an innovation
process behaves asymptotically like an independent sequence of random
variables; the most extreme form of memory restriction.
=

(2)

Markov processes

class of stochastic processes is that of Markov


These
processes are based on the so-called Markov property that
processes.
is independent of the
tthe future' of the process, given the
Another

important

'past'.

'present',

Djlnition

12

-4 stocbastic pl-otr'ss t-Y(r), r e: T) is said rtpbe a s'Iarko: process


/??-ever #, Borel yncrtpn (.Y(r))e:
(tbe future')such that
,.;d;'

#'

<
'j/2(A''(r))I
:yz

A((A'(l)),

.@%

c(A-(r): a
w/-ltlrp,Mb
J
=

ln particulara in the case


suggest that
f(-Y(r + z)/,.df-

.,)

.)

F44-(1)) /'c(A'(r

< t<

where

f-'tr

/?).(Note: ,#L

,g:

1)),
is past' p/lf.s

(8.40)
ipresent'.)

/l(A-(r)) X(t + z), the Markov property


=

+ z)/tr(A'(!)),

(8.41)

An alternative but equivalent way to express the Markov property is to


1
define the events B c e: t 1 and state that
.4'-

.4

.4cE

LxL:,

#(X

r-h B Jdff)=

#(X

Jdf)

'

#(S .Xj).

lt is important to note that the Markov property is not a direct restriction on


the memory of the process. lt is a restriction on the conditional melnory of
the future of the process the present provides
the process. For
all the relevant information.
A natural extension of the Markov process is to allow the relevant
information to be m periods into the past.
tpredicting'

8.4 Some

spial

149

stochastic processes

Dejlnition 13
<

Marko:

prtptrtlsy.taY(l),t ( Tl is said to be vth-order

.,4stocbastic

f(IX(r)l)

('

and

:y-

.2?d--.1,)

.E4..Y41)

.E'(.X'(r)c(.Y(r

1),

(8.43)

Xt - n'l))).

ln terms of this definition a Markov process is a first-order Markov. The


rnth-order Markov property suggests that for predicting X(l) only the
irecent past' of the process is relevant.
ln practice a Markov process is commonly supplemented with direct
restrictions on its memory and time-heterogeneity. In particular, if an rnthorder Markov process is also assumed to be normal, stationary and
asymptotically independent we can deduce that
'(A'(r)

.?f--,1,)

- 2) +

a1-Y(l - 1) + xzzrt

'

'

'

(8.44)

+ amhjt - rn),

'

a.)
0 lie inside the
and the roots of the polynomial km
unit circle (seeAR(m) below). This special stochastic process will play a very
important role in Part lV.
xk;.m -

(3)

'

'

'

Brownian motion process

A particular form of a Markov process with a long history in physics is the


so-called Brownian motion (or Wiener) process.
Dqjlnition 14

.4 stochastic process .X/1), r iE T) is called a Brownian motion


#( )) (J
process, dfned on (,
(J)
0,
.Y41) for r 0 the process
at 0,' a con,entionj;
independent
A'(r)
stationar-b'
increments, i-e.
is
t'
wf
(la)
process
for 0 < 11 G tz % G r,,, tbe increments (-Y(lj) - Xti j)),
f

,%

-srlrf.s

'

1, 2,

'

n, are independent

.E'(.Y(?,.)
- Arlrf-

)) 0,
=

ti 1 )
1zz41,,
-

czlrj

- ti

1),.

the increments Xtrfl Artrf 1 ), i 1, 2,


distributed. This implies rt7l tbe densitv
=

/'(.v 1

t1

-v,,,'

r'.r.'ss such tbat

n, are normally

Jnction is

tn)
-.i

---=-

CXn
'-

/(27:)lj

x2
1

2c221

(;

,,

jw

-.

)
(2zr)

ti -

Stochastic processes
(.Xf

exp

1)2

xi

2c2(/.f /.j-

'

j)

ln the case where (7.2 1 the process is called standard Brownian motion. lt
is not very difficult to see that the standard Brownian motion process is
both a martingale as well as a Markov process. That is, (.Y'(1),l G T) is a
.Y(z), z %r. Moreover,
martingale with respect to 4t- since '(xY(l)/,?Fsince E(X(t)/@%
f'(A'(r)/c(A)z)))it is also a Markov process. Note also
that Eg(-Y(l) Xzjllj@z.
(1 z), t % z.
=

.)

(x)

.)

.1

Dejlnition 15

,4 stochastic

(f)
()

process
'(l/(r))

JtM4,r g T ) is said

0,'

J;(l/(r)lg(z))
-

if t
(f t+
=

to be a white-noise process

#'

z
z

r.v.'s).
(uncorrelated

Hence, a white-noise process is both time-homogeneous, in view of the fact


that it is a second-order stationary process, and has no memory. In the case
where (?.(r),
t e: T) is also assumed to be normal the process is also strictly

stationary.

Despite its simplicity (or because of it) the concept of a white-noise


process plays a very important role in the context of parametric time-series
models to be considered next, as a basic building block.

(11)

Parametric

stochastic processes

The main difference between the type of stochastic processes considered so


far and the ones to be considered in this sub-section is that the latter are
defined in terms of a generating mechanism', they are in some sense
stochastic processes.
'derived'

(4)

Autoregressivenjirst

order

(z4#(1))

The AR( 1) process is by far the most widely used stochastic process in
econometric modelling. An adequate understanding of this process will
provide the necessary groundwork for more general parametric models
such as AR(m), MA(m) or ARMA(p,t?) considered next.
Dejlnition 16

.4 stochastic

process

(x(r),!GT)

s said to be autoregressive

of

8.4

Some spial

stochastic

processes

order one (,4R( 1)) lf it satisjles the stochastic difference equation,


.-41)=xX(t

1) + u(1),

wllcrp u is a con,stfznf flnt

l1(f)

(8.45)
is a white-noise

process.

The main

difference between this definition and the non-parametric


definitons given above is that the processes )-Y(r), r c T) are now defined
indirectly via the generating mechanism GM (45).This suggests that the
properties of this process depend crucially on the properties of u(l) and the
structure of the difference equation. The properties of /(r) have already been
discussed and thus we need to consider how the structure of the GM as a
non-homogeneous difference equation (seeMiller (1968)) determines the
properties of .t-Y(l), t s T), T )0,1, 2,
Viewing (45) as a stochastic non-homogeneous
first-order difference
equation we can express it (by repeated substitution) in the form
.).

/-1

.:-41) a'Ar(0) +
=

)( xiult
=

f).

(8.46)

This expresses X(t) as a linear function of the white-noise


process )u(l),
Using this form we can deduce certain properties of the
r s T) and
AR( 1) such as time-homogeneity or and memory. ln particular, from (46)
we can deduce that
..(0).

F(xY(f))

fE'(-Y(0))

(8.47)

and
+z))
'(A-(t)aY(l
f- 1

a'?r(0) A-

==
.

)
=

liut

--

a'+1?f(0) ?-

j
i 0

xiut A-z

--

i4

f-1

r+ z - l

')a2+'1xY(0)2)+ E
i

)
=

l+r-1

autr

j)
i

j0

aijtr +z

jj

(8.48)
This shows clearly that if no restrictions are placed on .Y(0) or/and a the
AR(1) process represented by (45)is neither stationary nor asymptotically
independent. Let us consider restricting
and a.
.1(0) in (46) represents the initial condition for the solution of the
stochastic difference equation. This seems to provide a
solution for
the difference equation (seeLuenberger (1969)) and plays an important role
in the determination of the dependence structure of stochastic difference
.40)

bunique'

Stochastic processes

equations. If
form

that .Y(0)

we assume

for simplicity,

=0

(47)and (48)take the

(8.50)

As we can see, assuming .140)


does not make the stochastic process
T)
generated
(..'.t-41),
by (45),stationary or asymptotically independent.
tG
as
The latter is achieved by supplementing the above initial condition by the
< 1 we can
coefficient restriction lal< 1. Assuming that .Y(0) 0 and Jal
deduce that

=0

J)-Y(r).Y(r + z))

clul

(:t

2f

-+

(8.j1)

and thus )A-(l), l e: T) is indeed asymptotically independent (but not


stationary). For (a1
> 1 the process is neither asymptotically independent
stationary.
nor
For the stochastic process (xY(!), t e: T), as generated by (45),to be
stationary we need to change the index set T to a double infinite set T* (0,
That is, assume that the process X(t) stretches back to the
u!r1, + 2,
infinite remote past (a convenient fictionl). The stochastic process (.,:71),
t 6 T*) with GM (45)can be expressed in the form:
=

.).

f(AX0)) 0,
=

'(A-(l).Y(r+ z))

E
i

liult

i)

j Vult + z
j 0
=

=
Hence, for

c2

< 1 the stochastic


1(z1
and asymptotically

stationary

j0

-jj

C&

aiaf +1

a2

c2a1
i

(8.54)

process )A'(r), r c T*) is both second-order


independent with autocovariance function

?-?(z)
=

(1

(8.55)

a',

az, and the process is not even


On the other hand, for k: 1, (:/7.0 a2)
f(tAr(r)t2)
order
since
second
is not bounded.

)(

-+

stocllastic processes

Some spial

8.4

The main purpose of the above discussion has been to bring out the
importance of seemingly innocuous assumptions such as the initial
conditions (Ar(0) 0, Ar(- F)-+O as T-+ :yz ), the choice of the index set (T or
T*) and the parameter restrictions ()al 1), as well as the role of these
assumptions in determining the properties of the stochastic process. As seen
above, the choice of the index set and the initial conditon play a very
important role in determining the stationarity of the process. In particular,
for the stochastic process as defined by the GM (45)to be stationary it is
necessary to use T* as the index set. This, however, although theoretically
convenient, is an unrealistic presupposition which we will prefer to avoid if
possible. Moreover, the initial condition .Y(0) 0 or .Y( T) 0 as T :yz
is not as innocuous as it seems at first sight. -Y(0) 0 determines the mean of
the process in no uncertain terms by attributing to the origin a very special
status. The condition A'( T) 0 as T :x- ensures that A'Ir)can be
< 1.
expressed in the form (52)even without the condition
modelling
IV)
it is interesting
For the purposes of econometric
(seePart
1,
2,
in relation to
to consider the case where the index set is T )0,
above,
in
this case kfxYtfl,
stationarity and asymptotic independence. As seen
restriction
#40) 0,
under
second-order
T)
stationary even
the
is not
l iE
asymptotically
< 1. Under the same conditions, however, the process is
independent. The non-stationarity stems from the fact the autocorrelation
function t?(f, t +z) depends on t because for z 0,
=

-..

--+

-+

-+

Ial
.)

Ial

/1

+
,t

1')

J2aT
=

- tz

lt

(8.56)

This dependence on 1, however, decreases as t increases. This led Priestley


(198 1) to introduce the concept of asymptotic stationarity which enables us
to approximate the autocorrelation function by

t,tr,t +z)

::x:

c2(g

-#

(8.57)

At this stage, it is interesting to consider the question of whether, instead


it by
of postulating (45)as a generating mechanism, we can actually
imposing certain restrictions on the structure of an arbitrary stochastic
stochastic' process )A'fr), t c T) is: (i)normal,
process. lf we assume that the
(ii) Markov and (iii)stationary then
lderive'

f)Ar(l)/f/

t- j

fJ(A'(r)/c(X(l

= aA'tr

1/)

(8.58)
(8.59)

1),

.-40)). The first equality stems from


c(A'(r - 1), Arlf - 2),
where
1
the Markov property, the linearity of the conditional mean from the
.%

Stochastic processes

normality and the time invariance of a from stationality. In order to ensure


< 1 we need to assume that the process is also (iv)asymptoticall),
that IaI
independent. Defining the process u(r) by
u(r) A-(r) Elxtlj.,Lit't =

),

N(0)

-Y40),

(8.60)

we can construct the GM


A'(r)

yX(t - 1)+ u(r),

(8.61)

where lk(1) is now a martingale difference orthogonal


process relative to
This is because by construction Elutl/t
0
and
for t > k,
j
- )
1q)
0
c-CE8
of
7.2.
by
Moreover,
Section
the process
.E'(FElg(r)lt(/()/f4
t(u(l), ! G 5-) ean be viewed as a wllilp-noisc process or an innosation process
relative to the information set ? because
c,#t.

'(u(r)l(k))

E'(u(r))

'(.E'(lf(l) f. t

'(l,/(f)2)FtF(I#l)2/.(#()

- 1

f(-Y/)

0-,

(8.62)

j))

-a/.E'(-YJ- trx2(1 -af)


1)

c2,

(8.63)

say. This way of defining a white-noise process is very illuminating because


it brings out the role of the information set relative to which the white-noise
process is not predictable. ln this case the process (l(1),t t5 T). as defincd in
That is, ?.(r) contains
(60) is white-noise (or non-predictable) relative to
no systematic' information relative to tt. This, however, does not preclude
the possibility of being able to predict u(r) using some other information set
.@)= Lh with respect to which l1(f) is a random variable. To summarise the
above argument, if :A'(r), t e:T) is assumed to be a normal, Markov,
stationary and asymptotically independent process, then the GM (61)
follows by
The question which naturally arises at this point is
that (61)seems
identical to (45) what happens to the difficulties related to the time
dependence of the autocorrelation function as shown in (50)75
The answer is
arises
stationarity
of
given
that the difficulty never
because
the
we can
derive its autocorrelation function by
...'

,.

tdesign'.

kgiven

'(r)

tl(T)

EX(t)Xt

= E @.Y(r

z))

1) + !(r.))aY(r z) )
-

az

1),

since '(l/(r).Y(r z)) 0. This implies that


-

t(z)

(f/0).

(8.65)

8.4 Some

spial

stochastic

processes

Moreover,

d0)

f'(A-(r)Ar(l)) Exxt

- 1).:-(8) + /)tI(1)-Y(1))
=ar(1) + c2 a2tj0) + c2.

Hence,
G

d0)

(F

and

(1 a)

tz

'

'

(1

-aZ)

(8.67)

af.

tdesigned'
This shows clearly that in the case where the GM is
no need to
change to asymptotic stationarity arises as in the case where (45) is
postulated as a GM. What is more, the role of the various assumptions
becomes much easier to understand when the probabilistic assumptions are
made directly in tenns of the stochastic process tAr(r), t c: T) and not the

white-noise process.
(5)

mth-order

Autoregressive,

(z1#(-))

The above discussion of the AR(1) process generalises directly to the AR(/#
process where mb 1.
Dejlnition 1
of order
-4stochastic process t-Y(l), t G T) is said to be autoregressiYe
lf
satisjles
the
stochastic
(v4R(m))
it
equation
dterence
m
Ar(r) alArtl
=

1) + xzxt

- 2) +

'

'

+ xmzrt

1.n)

+ u(l),

(8.68)

where aj, a2,


xmare constant and u(r) is a white-noise process.
For the discussion of (68)viewed as a generating mechanism (GM) it is
convenient to express it in the 1ag operator notation
.

a(f)x(l)

(8.69)

/.k48,

-xzLl
-am1r)
with 1tX(t) Xt
k),k> 1. The
whereatLl 1
T*
of
equation
1,
for
2,
the
difference
+
:I:
can be
solution
(24)
(0,
writtenas
=(

.-ajf-

.)

.71)

g(r) + x -

'(f-)l/(r).

(8.70)

g(1) c12l +c22@ +


+ c,,,2tmis the so-called general solution which is
2m) of the
expressed as a linear combination of the roots (21,22,
polynomial
'

txm

ajx'l -

.
-

aml

(8.71)

(assumed to be distinct for convenience) with the constants cj, c2,

cm

Stocbastic processes

being functions of the initial conditions


particular solution is the second component

.:-40),

-Y(1),
xtn1 1). The
of (70)and takes the form:
-

(8.72)
where
7 t -F- 1 7 ()
(7

7,,,+

a j ),,,,

7,. vz

7,,,+z

'

'

'

'

myfl

'

(8 3)
.7

m7,=

0, 1, 2,

ln the simple case m 1 considered above a(fa) (1 - xL), qt)


The restriction
(21 a), j= a/,.j 0, 1, 2,
IaI< 1 in the AR( 1) case is now
extended to al1 the roots of (71), i.e.
=

I2f
I<

1,

1, 2,

A-(0)aJ

(8.74)

n1.

That is, the roots of the polynomial (71) are said to lie within the unit circle.
Under the restrictions (74) the general solution goes to zero, i.e.

(/(1) 0
->

and the solution of the difference equation (68) can be Fritten as


X

.Y4/.)
=

-.j).
)- o?.#l
j 0
=

This form can be used to detennine the first two moments


stochastic proccss )A'(r), l (E T*). In particular
&.Y(r))

0,

'(Ar(r)A'(l+ z))

E
j

l ),yl(l

-jj

pfutl +
i

of the

':
-

X'

= Z(,
j

'jb-vz

'2.

tUz'

This is bounded when


)?y?)
< ts a condition which holds when the
()
of
polynomial
1)
within
the
the unit circle (seeMiller (1968/.ln a
roots
(7 lie
conditions
the
sense
(74)ensure that the stochastic process tA-(r), r (E T*) as
generated by (68) is both second-order stationary and asymptotically
independent because the condition (.l- (, -yJ?) <
implies that t?('r) 0 as
,

'w.

-->

%;

-+

As in the case of an AR( 1) process when the index set T

)0,1, 2,

.)

is

Some spial

8.4

stochastic processes

used instead of T* )0,EE 1, +2,


we run into problems with the secondT),
EE
of
which
stationarity
order
t
we can partly overcome using the
(.:-41),
will not be pursued any further
asymptotic
stationarity.
This
of
concept
AR(
argued
in
1)
when
the
the parametric model is not
because, as
case,
making
the necessary assumptions
GM
but
by
postulated as a
T)
tA-(l),
of
such
problems
arise. ln particular, if we
in
directly terms
t e:
no
mth-order
normal,
Markov, (iii)stationary
the
process is (i)
(ii)
assume that
independent
asymptotically
then
and (iv)
.)

idesigned'

E'(.Y(l) c(.X(l - 1),

.Y(0))) F(A-(r) c(Ar(l - 1),


=

.Xtr

n'l)))

= i j 1 afzYtr-).

(8.78)

The first equality stems from the mth-order Markov property. The linearity
of the conditional mean is due to the normality, the time invariance of the
afs is due to the stationarity and asymptotic independence implies that the
roots of the polynomial (71) 1ie inside the unit circle. lf we define the
X(0)), t e T, and the
increasing sequence of c-fields 9L tr(.Y(r), A-t 1),
u(r),
T)
by
t
c
process
=

utrl-xtr)- Ex/'u

k-

(8.79)

1)-

fk, l 6 T) is a martingale difference, orthogonal (an


we can deduce that )?,f(r),
the AR(m) GM as in (68)
innovation) process. This enables us to
from rst principles with u(1) being a white-noise process relative to the
information set %.
The autocovariance function can be defined directly, as in the
AR(1) case by multiplying (68)with X(t -z) and taking expectations to yield
Sdesign'

tdesigned'

F(-Y(f)A'(r -z))
>

aj '(Ar(r- 1)Ar(l-z)) +

'

'

'

+ E(l(l)A-(t -z)),

(8.80)

(8.81)
satisfy the same difference
Hence, we can see that the autocovariances
equation as the process itself. Similarly, the autocorrelation function takes
the form

(8.82)
The system of equations for z 1, 2,
m are known as Flg/t?-lzl//k!r
which
play
important
role
in
the estimation of the coefticients
equations
an
The
relationship
198
1)).
between the
a. (see Priestley (
tz1 aa,
=

Stochastic processes

autocorrelations and the asymptotic independence of )A'(r), t c T) is shown


most clearly by the relationship
z

=0,

1, 2,

(8.83)

viewed as a general solution of the difference equation (82).Under the


1, 2,
< 1, i
restrictions
m (impliedby asymptotic independence) we
can deduce that

I2fI

r(z)
(6)

-+

(8.84)
(4f-4)processes

Moving average
Dejlnition 18

The stochastic process taY(r), t (5 T) is said to be a moving average


process of order k (M,4(k)) 4- it can be expressed in the form

(8.85)
where :1, bz,
bk are constants and
t e:T) is a white-noise
the
white-noise
used
to build the process
process is
process. Ttzr is,
(aY(?), t c T), beinq a linear combination (#' the Iast k ?,k(r ils.
Given that f .Y(f), t c T) is a linear combination of uncorrelated random
variables we can deduce that
.

tlt(r),

(8.86)
0 tKz %;k
z> k

(8.87)

o/zGk
(8.88)
(!?() 1). These results show that, firstly, a MA(k) process is second-order
stationary irrespective of the values taken by /?1 bz,
bk, and, secondly,
after k
its autocovariance and autocorrelation functions have a
periods. That is, a MA(k) process is both second-order stationary and kcorrelated (r(z) 0, z > k).
ln the simple case of a MA(1), (85)takes the form
=

tcut-off'

..(1)

Ik(1)

/71u(t

1),

(8.89)

with
t7(0)
=

r( 1)

(1+ hf)c2,
1)1
+hf)

(1

p(1)

r(z)

0,

#1c2,

r(z)

0,

(8.90)
(8.9 1)

8.4 Some

special stochastic

processes

As we can see, a MA(k) process is severely restrictive in relation to timeheterogeneity and memory. It turns out, however, that any second-order
independent
stationary, asymptotically
process )-Y(r), t e: T) can be
as a MA(:ys ), i.e.

Kexpressed'

x(r)

)( bjut

-j),

j=

bjb)< (yo and


t (E T). is an innovation process. This result
where (N.)'o
constitutes a form of the celebrated l4b/J decomposition theorem (see
Priestley ( 198 1)) which provided the theoretical foundation for MA(k) and
ARMNP, q) processes to be considered next. The MA(aa) in (92) can be
constructed from tirst principles by restricting the time-heterogeneity and
the memory of the process. If we assume that ).Y(r), l G T) is (i)second-order
stationary, and (ii)asymptotically uncorrelated, then we can define the
innovation process )u(r),
r e:T) by
.tu(!),

l(r)

A'(r)

EX(t)(%-

where fh c(A-(r), A'(1 1),


c(u(l), u(l
to deduce that
7.2)
=

1),

.62,

(8.93)

.:70)). Asymptotic independence enables us


u(0)) and thus by c-CE6 (seeStion
- 1),
,

A-(r) S(aY(r)/f4).

(8.94)

tl4l),

Given that
t G T) is an innovation process (martingaledifference,
orthogonal process), it can be viewed as an orthogonal basis for %. This
'(-Y(r)
enables us to deduce that
can be expressed as a linear
combination of the l41
i.e.
-4)

-.j)s,

.E'(-Y(1).@f
)

bjut -j4,

(8.95)

j=O

from which (92)follows directly. ln a sense the process :u(1), t G T). provides
blocks' for any second-order stationary process. This can be
the
seen as a direct extension of the result that any element of a linear space can
be expressed in terms of an orthogonal basis uniquely, to the case of an
infinite dimensional linear space, a Hilbert space (see Kreyszig ( 1978/.
The MA(k) process can be viewed as a special case of (92) where the
uncorrelatedness
assumption of asymptotic
is restricted
to kcorrelatedness. ln such a case )Ar(r), l G -F) can be expressed as a linear
u(? -k).
function of the last k orthogonal elements l(l), l41 - 1),
ibuilding

(7)

Autoregressive

moving avevage processes

As shown above, any second-order

stationary, asymptotically

uncorrelated

Stochastic processes

process can be expressed in a MA(az) form


%)

.Y(r)

bjut

.-jj,

j=

where ()5t () %)< c/s and lg(r), r G T) is an innovation process. Such a


representation, however, is of very little value in practice in view of its nonoperational nature. The ARMAIT,q) formulation provides a parsimonious,
operational fonn for (96).

Dejlnition 19

,4 stochastic process .tvY(l), t e T) is said to be an autoregressive


moving average process of order p, q (,4RMz4(p, q)) 4- it can be
expressed frl the tlrm'
.Y(f) + aI-Y(l

1) +

'xvxl.t
'

'

p)

l/(r)

htutr

1) +

+ bqult q)
-

1p, /71 bz,


where a1, ac,
white-noise process.
.

bq,are

constants

and

.ju(l),

'

'

'

(8.97)

t g T) is a

In order to motivate the ARMAIP, qj formulation as an operational form


of the MA(cc)) representation (96)let us express the latter in terms of the
1 + bj L + bzl? +
infinite polynomial )*(f-)
=

.:-(1) 47*(fa)u,.

(8.98)

Under certain mild regularity conditions b*L) can be approximated


ratio of two finite polynomials (seeDhrymes ( 197 1:,
b*L)

bq(L)
=

xp(L)

(1+ lljfz + bzl? +


(1 + a jfa + ac1a2+

bq)
,

+ uplup)

For large enough p and q, b*(L) can be approximated


accuracy. Substituting (99)back into (98)we get

xpl-z-

k(1a)l/(r),

by the

(8.99)
to any degree of

(8.100)

which is an ARMAIP, q) model. This is an operational form which is widely


used in time-series modelling to provide a parsimonious approximation to
second-order stationary processes. Time-series modelling based on
ARMAIP, q) formulation was popularised by Box and Jenkins (1976).
The ARMAIP, q) formulation (97)can be viewed as an extension of the
part of the
ARIrn)representation in so far as the non-homogeneous
difference equation includes additional terms. This, however, makes no
difference to the mathematical properties of (97)as a stochastic difference

8.4 Some

special stochastic processes

equation. ln particular, the asymptotic independence of the process


depends only on the restriction that the roots of the polynomial
x (2) (2#+
=

a12#-

'

'

'

+ aP)
.

(8.101)

lie inside the unit circle. No restrictions are needed on the coefficients or the
roots of bql. Such restrictions are needed in the case where an AR((x))
lie inside
formulation of (97)is required. Assuming that the roots of bpli
the unit circle enables us to express the ARMAIP, q) in the form
=0

(8.102)
This form, however, can be operational only when it can be approximated
enough' m. The conditions on a/2)
by an AR(m) representation for
stability
commonly
known
and those on k(2) as
conditions
as
are
invertibility conditions (seeBox and Jenkins ( 1976)).
The popularity of ARMAIP, q) formulations in time-series modelling
stems partly from the fact that the formulation can be extended to a
stochastic processes; the so-called
particular type of non-stationary
homogeneous non-stationarity This is the case where only the mean is time
dependent (the variance and covariance are time invariant) and the time
change is local. ln such a case the stochastic process tZ(l), t G T) exhibiting
such behaviour can be transformed into a stationary process by
differencing, i.e. define
'large

-Y(f)

( 1 - f-)dz(l),

(8.103)

where J is some integer. For J=0, A'(l)=z(l); for J


first difference and, for J 2,

ls Ar(r) Z(r) -Z(r


=

1);

-:-(1) Z(l) - 2Z(l


=

1)+Zt

-2).

(8.104)

Once the process is transformed into a stationary one the ARMAIP, q)


fonnulation is used to model *-(8. ln terms of the original model, however,
the formulation is
ap(f,)(1 - L)dZ(t)

$(f)l(l),

(8.105)

which is called an ARIMA (p, J, /)-, autoregressive.


integrated moving
average, of order p, J, q (see Box and Jenkins ( 1976)).
ln the context of econometric modelling the ARIMA formulation is of
limited value because it is commonly preferable to model non-stationarity
as part of the statistical model specification rather than transform the data
at the outset.

Stochastic processes

8.5

Summarr

The purpose of this chapter has been to extend the concept of a random
variable (r.v.) in order to enable us to model dynamic processes. The
extension came in the form of a stochastic process (Ar(r), t 6 T) where .:71) is
dened on S x T notjust on S as in the case of a r.v.; the index set T provides
the time dimension needed. The concept of a stochastic process enables us
x,,; 0), 0 6 0)
to extend the notion of the probability model *=
discussed so far to one with a distinct time dimension

t/xl,

%4,#! 6 Or' t 6 T1.


* l'./'(x(l);

(8.106)

This, however, presents us with an obvious problem. The fact that the
unknown parameter vector 0t, indexing the parametric family of densities,
their values from
depends on t will make our task of
(commonly) a single sample realisation impossible.
In order to make the theory build upon the concept of a stochastic
process manageable we need to impose certain restrictions on the process
itself. The notions of asymptotic independence and stationarity are
employed with this purpose in mind. Asymptotic independence, by
restricting the memory of the stochastic process, enables us to approximate
such processes with parametric ones which reduces the number of
unknown parameters to a finite set 0. Similarly, stationarity by imposing
time-homogeneity on the stochastic process enables us to use timeindependent parameters to model a dynamic process in a
is to reduce the
equilibrium'. The effect of both sets of restrictions
model
106)
probability
to
(
testimating'

tstatistical

*=

(.J(x(l);0), 0 G (1),t e:T).

(8.107)

This form of a probability model is extensively used in Part IV as an


important building block of statistical models of particular interest in
econometrics.
Impovtant concepts

Stochastic process, realisation of a process discrete stochastic processes,


distribution of a stochastic process, symmetry and compatibility
autoproduct,
autocorrelation,
restrictions, autocovariance,
crossfunctions, normal stochastic process,
covariance and cross-correlation
vector stochastic process, lth-order process, time-homogeneity and
memory of a process, stlict stationarity, second-order stationarity, nonasymptotically independent
stationarity, homogeneous non-stationarity,
m-dependent
and uncorrelated
process, strong mixing,
processes,

8.5 Summary

163

Markov
ergodicity,
asymptotic independence, nth-order
process,
martingale, martingale difference, innovation process, Markov property,
Brownian

motion process, white-noise, parametric and non-parametric


processes, AR( 1), AR(m), initial conditions, stability and invertibility
conditions, MA(m), ARMAIP, (3), ARIMAIP, J, q).

Questions
What is the reason for extending the concept of a random valiable to
that of a stochastic process?
Define the concept of a stochastic process and explain its main

4.
5.
6.

components.
'Ar(.s,r) can be interpreted as a random valiable, a non-stochastic
function (realisation)
as well as a single numben' Discuss.
Wild fluctuations of a realisation of a process have nothing to do with
its randomness.' Discuss.
How do we specify the structure of a stochastic process?
Compare the joint distribution of a set of n normally distributed
independent r.v.'s with that of a stochastic process (.:-(0,
t E: T) for (r1,
;,,) in terms of the unknown
parameters involved.
l2,
Let (Ar(r), l c T) be a stationary normal process. Define its joint
< l,, and explain the effect on the unknown
distribution for r < t,
parameters involved by assuming (i)m-dependence or (ii)mth-order
.

8.

9.
10.

14.

Markovness.
(xY(r), l e: T) is a normal stationary process then:
asymptotic independence and uncorrelatedness', as well as
(i)
strict and second-order stationarity, coincide.'
(ii)
Explain.
Discuss and compare the notions of an m-dependent and an mthorder Markov process.
Explain how restrictions on the time-heterogeneity and memory of a
stochastic process can help us construct operational probability
models for dynamic phenomena.
restriction
notions
of asymptotic
Compare the memory
Fnth-order
independence, asymptotic uncorrelatedness,
Markovness, mixing and ergodicity.
Explain the notion of homogeneous non-stationarity and its relation
to .4RlMA(p, J, q) formulations.
Explain the difference between a parametric AR(1) stochastic process
tdesigned'
non-parametric
AR(1) model.
and a
Define the notion of a martingale and explain its attractiveness for
tlf

modelling dynamic phenomena.

Stochastic

processes

15. Compare and contrast the concepts of a white-noise and an


innovation process.
The AR( 1) process is a Markov process but not a martingale unless
we sacrifice asymptotic independence.' Discuss.
'T'he AR( 1) process defined over T tfo, 1, 2,
is not a second-order
stationary process even if < l.' Discuss.
.)

tAny

19.

second-order

lal

stationary
and asymptotically
uncorrelated
stochastic process can be expressed in MA(cfs) form.' Explain.
Explain the role of the initial conditions in the context of an AR(1)

PI-OCCSS.

20.

Explain the role of the stability conditions

in the context of an AR(m)

PI-OCCSS.

22.

b'T'heARMAIP, q) formulation provides a parsimonious representation for second-order stationary stochastic processes.' Explain.
Discuss the likely usefulness of ARIMAIP, q) formulations in
econometric modelling.
Additional references

Anderson (197 1); Chung ( 1974); Doob ( 1953.4;Feller ( 1970),' Fuller ( 1976); Gnedenko
( 1969); Granger and Newbold (1977); Granger and Watson ( 1984),' Hannan ( 1970);
Lamperti ( 1977); Nerlove et (11. ( 1979); Rosenblatt ( 1974); Whittle ( 1970); Yaglom
( 1962).

C H A PT E R 9

Limit theorems

9.1

The early Iimit theorems

The term

tlimit

theorems' refers to several theorems in probability theory


under the generic names,
of large numbers' (LLN) and
limit
theorem' (CLT). These limit theorems constitute one of the most important
and elegant chapters of probability theory and play a crucial role in
statistical inference. The origins of these theorems go back to the
seventeenth-century result proved by James Bernoulli.
tlaw

Scentral

Bernoulli's theorem
Let S', be the number q occurrences t#' an cpt?nr in n independent
trials t? (1 random experiment rsi and p #4z1) is the probabilitq' 0)'
iv each of rf? trials. F!n jr t/ny s > 0
occurrence
.,4

.f

.4

lim Pr
->

,1

(c

S',

-p

--Fl

< z

1,

i.e. tbe Iimit t#' the probabilitq' q/' the event $((S,,,/n) J?)l<t)
approaches one as the number #' trials (/f?(!.$to iljjlnity.
Shortly after the publication of Bernoulli's result De Moivre and Laplace
in their attempt to provide an easier way to calculate binomial probabilities
proved that when g(S,, n) -p(l is multiplied by a factor equal to the inverse of
its standard error the resulting quantity has a distribution which
approaches the normal as n
i.e.
-

--+

.:x;.,

(9.2)

166

Limit theorems

These two results gave rise to a voluminous literature related to the


various ramifications and extensions of the Bernoulli and De MoivreLaplace theorems known today as
LLN and
CLT respectively.
The purpose of this chapter is to consider some of the extensions of the
Bernoulli and De Moivre-Laplace results. In the discussion which follows
emphasis is placed on the intuitive understanding of the conclusions as well
as the crucial assumptions underlying the various limit theorems. The
discussion is semi-historical in a conscious attempt to motivate the various
extensions and the weakening of the underlying assumptions giving rise to
the results.
The main conditions underlying the Bernoulli and De Moivre-l-aplace
results are the following:
(LTI)
variables
Sn Z)'=
1 Xi, that is, & dejlned as the sum of n random
(r.n.'#.
LT2)
Xi 1, (' occurs, and Xi 0, otherwise, i 1, 2,
n, i.e. the Xis
is a binomially distributed r.t?.
are Bernoulli r.v.'s and hence
LT3)
A-j Xz,
-Y,, are independent r.17.'y.
/'(.X1) =(.X2)
=(.X,,),
(faF#)
i.e. A'l, X2,
xY,, (lre identically
wl/l
Prlxi
1)= p, Prlzri
distributed
1 -p for i 1,2,
n.
Est,/nl
p, i.e. wc consider the event of the dterence between a r.r.
value.
and its expted
The main difference between the Bernoulli and De Moivre-l-aplace
theorems lies in their notion of convergence, the former referring to the
convergence of the probability associated with the sequence of events
IE(&/n) < s and the latter to the convergence of the probability
associated with a very specific sequence of events, that is, events of the form
(z %z) which define the distribution function F(z). ln order to discriminate
between them we call the fonner convergence in probability' and the latter
convergence in distribution'.
ithe'

'the'

-4

.,,

'

'

'

=0)

-pq1

Dehnition 1

z'tsequence of nv.'s F;,, n > 1) is said to converge in probability to


r.p. (t?rconstant) F (/' for every ; > 0
lim #r(1F;,
11->

cf)

FI<

I'Iz denote this wflll Y;,

-+

:)

1.

(9.3)

Desnition 2
tF,,(y),
,4 sequence of r.v.'s (Y;,, n > 1) wr distribution functions
n > 1) is said to conYerge in distribution to a r.p. F with distribution

9.1 The

early limit theorems

function F(y)

4J'

1im F,,(y)
11

--#

F(y)

C:l

at aII points oj' continuity, of F(y); denoted by F;,

--+

FL

lt should be emphasised that neither of the above types of convergence


tells us anything about any convergence of the sequence ('F;,) to F in the
sense used in mathematical analysis, such as for each s > 0 and s sS, there
exists an N N(c, # such that
=

IL(s)
-

<
F(s)I

for n > N.

(9.5)

Both convergence types refer only to convergence of probabilities or


functions associated with probabilities. On the other hand, the definition of
a r.v. has nothing to do with probabilities and the above convergence of F;,
to F on S is convergence of real valued functions defined on S. The type of
stochastic convergence which comes closer to the above mathematical
sure' convergence.
convergence is known as
'almost

Dehnition 3
.,4sequence of r.v.'s (Y;,,n > 1) con,erges to F
almost surely or with probability one) ,)'
lim Y;, F

Pr

lt

--#

or, equivalently',

lim Pr

u -/

(x)

1,' denoted b)' F;,

243

(f for any

U IYk

(f?r.1-,.or
-+

F,

a constantj

(9.6)

) 2>0

Fl

<

m 7) n

1.

This is a much stronger mode of convergence than either convergence in


probability or convergence in distribution. For a more extensive discussion
of these modes of convergence and their interrelationships see Chapter 10.
The limit theorems associated with convergence almost surely are
appropriately called
law of large numbers' (SLLN). The term is used
1awof large numbers' (WLLN)
to emphasise the distinction with the
associated with convergence in probability.
In the next section the 1aw of large numbers is used as an example of the
developments the various limit theorems have undergone since Bernoulli.
For this reason the discussion is intentionally rather long in an attempt to
motivate a deeper understanding of the crucial assumptions giving rise to
all the limit theorems considered in the sequel.
Sstrong

'weak

lwimittheorems
The law of large numbers

(1)

The weak

of lavge numbers

/Jw.,

( WL f.,'vl

Early in the nineteenth


century Poisson
asserting identical distributions for A'j
result to go through.
,

that the condition LT4


Xn was not necessary for the

realised
.

Poisson's theorem
Let .tfxY,,,n y 1) be a sequence
1) pi and Pr(Xi 0)
Prlzri
> 0,
;
=

I-S-

-+

,1

-.11.
- ...
N
Fl

ljm pr
x

(?Jindependent
1 - pi i

1, 2,

St?rnt?l/// r.!r.'y witb


n, then, for t7/nl'
.

''

jj
i
=

pi

< ;

(9.s)

The important breakthrough


in relation to the WLLN was made by
Chebyshev who realised that not only LT4 but LT2 was unnecessary for the
Ar,,were Bernoulli r.v.'s was
result to follow. That is, the fact that Xj,
not contributing to the result in any essential way. What was crucially
important was the fact that we considered the summation of n r.v.'s to form
Sn
3Xi and comparing it with its mean.
.

7-

(9.9)

Pl't?q/:. Since the A'ls are independent

since lim
u

--+

.x

(,
ll8

0, lim Pr
,,

--+

w.

. i
11

11
.=

Z
=

1
Xi - 11

!1

Z Jtf >
=

0,

9.2

The Iaw of Iarge numbers

1imPr

-/1

169

j( Xi -- 1 )2pi
j

<)

1,

Markov, a student of Chebyshev's, noticed in the proof of Chebyshev's


A-,, are independent played only a
theorem the fact that the .YI Xz.
minor role in enabling us to deduce that Var(&) ( 1 ,72) :)1-j c/. The
above proof goes through provided that ( 1,.'r?2)Var(&)
0 as n ry-.. Since
.

-+

-+

11

Var(5',,)

Vart.Yfl +

we

need

Chapter

j #j)

Covtxl..Yjli

(9.10)

ol-der f#' magnitude (see


to assume that Vartzf Xf) is Of Smaller
10) than nl for the result to follow. Hence LT3 is not a crucial

condition.

Markov's theorem

1im Pr
,,

-+

az.

-.
'

11

)
=

Xi - 11

11

E(Xij
i

< ;

1.

Khinchin. a student of Markov's, realised that, in the case of independent


and identically distlibuted (11D) r.v.'s, Markov's condition was not a
nessary condition. In fact in the llD case no restriction on the nature of the
valiances is needed.

& % N --/

'Jw

Limit theorems

(2)

Iaw of large numbers

Te strong

(SL Nm

The first result relating to the almost sure convergence of


Bernoulli distributed nv.'s case was proved by Borel in 1909.

& for the

Borel's theorem
Let fA-,,) be a sequence 0). 11D Bernoulli r.v.'.%wjr Prlxi
Prlxi
0) 1 p for aIl i, ten
=

1) p and
=

S ''

lim

Pr

-+x

13

1
.

ln other words, the event defined by


s e:S), has
n
(k,,(y)q
probability one; S being the semple space. An equivalent way to express this
is
.ty:1im,,-,

=p,

Sm

lim Pr max
?1 -+

-p

I'?I

mp n

x:.

=0.

This brings out the relationship


between the SLLN and the WLLN since
simultaneous
refers
realisation of the inequalities and
former
the
to the

S
-.$1

Sm

%max

-p

&

(9.17)

-p
.

??1

mn

as

t-0*=

%-.+'

This implies that


Kolmogorov, by replacing the Markov condition
11

lim

l1 --/ ::Ll

'Y

)
-

VartA4l

=0

(9.18)

for the WLLN in the case of independent r.v.'s, with the stronger condition
Cf

)
=1

Varta&k) < cs,

(9.19)

proved the first SLLN for a general sequence of independent


Kolmogoro''s

r.v.'s.

teorem

Let (a,,,n y 1) be a sequence of independent r.v.'s such that E(Xi)


then #- they satisfy lt?
and Var(.Y) exist for aIl i 1, 2,
condition (19) wc can deduce that
=

Pr

lim n n rl
-+

?1

) g#f-f(A-j)()
=

=0
=

1.

9.2

The Iaw of large numbers

This SLLN is analogous

to Chebyshev's WLLN and in the same way we


inequality.
The inequality used in this context is
an
Kolmoqorov's inequalit-v If Arj, xYa,
.Y,,are independent r.v.'s, such that
c/ < :Ef,, i 1, 2,
Vartlf)
rl, then for any ) >0

can prove it

using

Pr
Kolmogorov

,u

'

k < ,,

jsk-'(sk)Iy

max

11

1g

c <

)) c/.
1

went on to prove that in the case where 1t-Y,,,n 7: 1) is a


r.v.'s such that F(A'f) <
then

sequence of llD

:* Varl'x k )
)

:2

k=1

%'
=

k=1

-k

dx
x2.J(x)

< ct;. ,

which implies that for such a sequence the existence of expectation is a


necessary as well as sufficient condition for the SLLN.
Having argued that some of the conditions of the Bernoulli theorem did not contribute (in any essential way) to the result, the question that arises naturally is, 'what are the important elements giving rise to the "law of large numbers" (SLLN, WLLN)?' The Markov condition (18) for the WLLN, and Kolmogorov's condition (19) for the SLLN, hold the key to the answer of this question. It is clear from these two conditions that the most important ingredient is the restriction on the variance of the partial sums S_n; that is, we need Var(S_n) to increase at most as quickly as n. More formally, we need Var(S_n) to be at most of order n, and we write Var(S_n) = O(n). In order to see this let us consider some of the cases discussed above. In the IID case, if Var(X_i) = σ² for all i, then Var(S_n) = nσ² = O(n). In the case of independent r.v.'s with

    Var(X_i) = σ_i² < ∞, i = 1, 2, ...,   (9.22)

the Markov condition can be written as Var(S_n) = o(n²), where 'o' reads 'of smaller order than'; this achieves the same effect, since Var(S_n) = O(n) implies Var(S_n) = o(n²) (see Chapter 10). The Kolmogorov condition is a more restrictive form of the Markov condition, requiring the variance of the partial sums to be uniformly of at most of order n. This being the case, it becomes obvious that the conditions LT3 and LT4, assuming independent and identically distributed r.v.'s, are not fundamental ingredients. Indeed, if we drop the identically distributed condition altogether and weaken independence to martingale orthogonality, the above limit theorems go through with minor modifications. We say that a sequence of r.v.'s {X_n, n ≥ 1} is martingale orthogonal if E(X_n/σ(X_{n−1}, ..., X₁)) = 0, n > 1. It should come as no surprise to learn that both important tools in proving the WLLN and SLLN, the Chebyshev and Kolmogorov inequalities, hold true for orthogonal r.v.'s. This enables us to prove the WLLN and SLLN under much weaker conditions than the ones discussed above. The most useful of these results are the ones related to martingales, because they can be seen as direct extensions of the 'independent' case and the results are general enough to cover most types of dependencies we are interested in.

(3) The law of large numbers for martingales

Let {S_n, n ∈ ℕ} be a martingale such that E(S_n) = 0 for all n, and define Y_n = S_n − S_{n−1}, n ≥ 1 (S₀ = 0). As discussed in Section 8.4, if S_n defines a martingale with respect to F_n, then by construction Y_n defines an orthogonal process, and thus, assuming a bounded variance for Y_n, the above limit theorems can go through with minor modifications.

WLLN for martingales

Let {X_n, n ≥ 1} be a sequence of r.v.'s with respect to the increasing sequence of σ-fields {F_n, n ≥ 1} such that E(|X₁|) < ∞ and Pr(|X_n| > x) ≤ c Pr(|X| > x) for x > 0 and n ≥ 1, c a constant (i.e. all the X_i's are bounded by some r.v. X). Let

    Y_i = X_i − E(X_i/F_{i−1});

an equivalent way to state the WLLN is then

    (1/n) Σ_{i=1}^n [X_i − E(X_i/F_{i−1})] →^P 0.   (9.24)

This result shows clearly how the assumption of stationarity of {X_n, n ≥ 1} and E(X_n/F_{n−1}) = 0, n ≥ 1 (see Chapter 8), can strengthen the WLLN result to that of the SLLN.
The above discussion suggests that the most important ingredients of the Bernoulli theorem are that:

(i) we consider the probabilistic behaviour of centred r.v.'s of the form Z_n = S_n − E(S_n) = Σ_{i=1}^n [X_i − E(X_i)];
(ii) Var(Z_n) = O(n); and
(iii) for Y_n = X_n − E(X_n), the sequence {Y_n, F_n, n ≥ 1} is a martingale difference, i.e. E(Y_n/σ(Y_{n−1}, ..., Y₁)) = 0, n > 1.

This suggests that martingales provide a very convenient framework for these limit theorems, because by definition they are r.v.'s with respect to an increasing sequence of σ-fields, and under some general conditions they converge to some r.v. as n → ∞. The latter is of great importance when convergence to a non-degenerate r.v. is needed. Moreover, for any martingale sequence {X_n, F_n, n ≥ 1} the martingale differences sequence {Y_n, n ≥ 1} defines an orthogonal sequence of r.v.'s, which can help us ensure (ii) above.

Remark: the SLLN is sometimes credited as providing a mathematical foundation for the frequency approach to probability. This is, however, erroneous, because the definition is rendered circular given that we need a notion of probability to define the SLLN in the first place.

9.3 The central limit theorem

As with the WLLN and SLLN, it was realised that LT2 was not contributing in any essential way to the De Moivre–Laplace theorem, and the literature considered sequences of r.v.'s with restrictions on the first few moments. Let {X_n, n ≥ 1} be a sequence of r.v.'s and S_n = Σ_{i=1}^n X_i; the CLT considers the limiting behaviour of

    Y_n = [S_n − E(S_n)] / √[Var(S_n)],

which is a normalised version of S_n − E(S_n), the subject matter of the WLLN and SLLN.

Lindeberg–Levy theorem

Let {X_n, n ≥ 1} be a sequence of IID r.v.'s whose mean and variance exist; then

    lim_{n→∞} F_n(y) = lim_{n→∞} Pr(Y_n ≤ y) = ∫_{−∞}^{y} (1/√(2π)) exp(−½u²) du.   (9.28)
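The convergence in (9.28) is easy to observe numerically. The sketch below (Python, with NumPy and SciPy assumed available; the exponential distribution and the sample sizes are illustrative choices only) compares the empirical distribution of Y_n with the standard normal:

```python
# Sketch of the Lindeberg-Levy CLT: the standardised sum of IID
# exponential r.v.'s (mu = 1, sigma = 1) is compared with N(0, 1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, reps = 50, 20_000
s_n = rng.exponential(scale=1.0, size=(reps, n)).sum(axis=1)
y_n = (s_n - n) / np.sqrt(n)          # [S_n - E(S_n)] / sqrt(Var(S_n))
for y in (-1.0, 0.0, 1.0):
    print(f"Pr(Y_n <= {y:+.0f}): empirical {np.mean(y_n <= y):.4f}, "
          f"Phi = {norm.cdf(y):.4f}")
```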

Liapunov's theorem

Let {X_n, n ≥ 1} be a sequence of independent r.v.'s with E(X_i) = μ_i, Var(X_i) = σ_i² < ∞ and E(|X_i|^{2+δ}) < ∞ for some δ > 0, i = 1, 2, ...; then if

    lim_{n→∞} (1/c_n^{2+δ}) Σ_{i=1}^n E(|X_i − μ_i|^{2+δ}) = 0, where c_n² = Σ_{i=1}^n σ_i²,   (9.30)

Y_n →^D N(0, 1).

Liapunov's theorem is rather restrictive, because it requires the existence of moments higher than the second. A more satisfactory result, providing both necessary and sufficient conditions, is the next theorem; Lindeberg in 1923 established the 'if' part and Feller in 1935 the 'only if' part.

Lindeberg–Feller theorem

Let {X_n, n ≥ 1} be a sequence of independent r.v.'s with distribution functions {F_n(x), n ≥ 1} such that

    (i) E(X_i) = μ_i; (ii) Var(X_i) = σ_i² < ∞.   (9.31)

Then the relations

    lim_{n→∞} max_{1≤i≤n} (σ_i²/c_n²) = 0, where c_n² = Σ_{i=1}^n σ_i², and Y_n →^D N(0, 1)   (9.34)

hold if and only if

    lim_{n→∞} (1/c_n²) Σ_{i=1}^n ∫_{|x−μ_i|≥εc_n} (x − μ_i)² dF_i(x) = 0 for all ε > 0.   (9.35)

The necessary and sufficient condition is known as the Lindeberg condition, and provides an intuitive insight into 'what really gives rise to the result'. Given that

    (1/c_n²) Σ_{i=1}^n ∫_{|x−μ_i|≥εc_n} (x − μ_i)² dF_i(x) ≥ ε² max_{1≤i≤n} Pr( |X_i − μ_i| ≥ εc_n ),

this shows that the heart of the CLT is the condition that no one r.v. dominates the sequence of sums; that is, each (X_i − μ_i)/c_n is small relative to the sum [S_n − E(S_n)]/c_n as n increases. The Liapunov condition can be deduced from the Lindeberg condition, and thus it achieves the same effect. Hence the CLT refers to the distributional behaviour of the summation of an increasing number of r.v.'s which individually do not exert any significant effect on the behaviour of the sum. An analogy can be drawn from economic theory, where under the assumptions of perfect competition (no individual agent dominates the aggregate) we can prove the existence of a general equilibrium. A more pertinent analogy can be drawn between the CLT and the theory of gas in physics. A particular viewpoint in physics considers a gas as consisting of an enormous number of individual particles in continuous but chaotic motion. One can say nothing about the behaviour of individual particles in relation to their position or velocity, but we can determine (at least probabilistically) the behaviour of a large group of them.
Fig. 9.1 illustrates the CLT in the case where X₁, X₂, ..., X_n are IID r.v.'s, uniformly distributed, i.e. X_i ~ U(−1, 1), i = 1, 2, ..., n, and f(y) represents the density function of Y_n = X₁ + X₂ + ... + X_n.

[Fig. 9.1. Illustrating the CLT using the density function of Y_n = Σ_{i=1}^n X_i, where the X_i's are uniformly distributed, for n = 1, 2, 3.]
Returning to the Bernoulli and De Moivre–Laplace theorems, we can see that the important ingredients were the same in both cases, and both families of limit theorems refer to the behaviour of the sum of a sequence of r.v.'s in a different probabilistic sense. The WLLN referred to

    (S_n/n) →^P p,

the SLLN to

    Pr( lim_{n→∞} S_n/n = p ) = 1,

and the CLT to

    [S_n − np] / √[np(1−p)] →^D Z ~ N(0, 1).
Let us consider the relationship between these limit theorems in the particular case of the binomial distribution. From the CLT we know that

    Pr( a < [S_n − np]/√[np(1−p)] ≤ b ) ≅ ∫_a^b (1/√(2π)) exp(−½u²) du   (9.38)

('≅' reads 'approximately equal'). In order to see how good the approximation is, let us take Pr(6 ≤ S_n ≤ 8), n = 10, p = ½. From the binomial tables we get Σ_{k=6}^{8} (10 choose k)(0.5)^k(0.5)^{10−k} = 0.3662. The normal approximation to this probability takes the form

    Pr(6 ≤ S_n ≤ 8) ≅ Φ( (8 + ½ − np)/√[np(1−p)] ) − Φ( (6 − ½ − np)/√[np(1−p)] ) = Φ(2.21) − Φ(0.316),

where Φ(·) refers to the normal cumulative distribution function. It must be noted that Pr(S_n ≤ b) is approximated by Φ( (b + ½ − np)/√[np(1−p)] ) rather than Φ( (b − np)/√[np(1−p)] ) in order to improve the approximation by bridging the discontinuity between integers. From the normal tables we get Φ(2.21) − Φ(0.316) = 0.9866 − 0.6239 = 0.3627, which is a very good approximation of the exact binomial probability for n as small as 10.
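The arithmetic of the example can be checked directly; a minimal sketch (SciPy assumed available) follows:

```python
# Reproducing the example: exact binomial probability Pr(6 <= S_10 <= 8)
# with p = 0.5 versus the continuity-corrected normal approximation.
import math
from scipy.stats import binom, norm

n, p = 10, 0.5
exact = sum(binom.pmf(k, n, p) for k in range(6, 9))               # = 0.3662
mu, sd = n * p, math.sqrt(n * p * (1 - p))
approx = norm.cdf((8 + 0.5 - mu) / sd) - norm.cdf((6 - 0.5 - mu) / sd)
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")  # ~0.362
```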

Using the above results we can deduce that

    Pr( |S_n/n − p| < ε ) ≅ ∫_{−ε√[n/(p(1−p))]}^{ε√[n/(p(1−p))]} (1/√(2π)) exp(−½u²) du,

so that, for example,

    Pr( |S₂₀/20 − p| < ε ) ≅ 0.944 and Pr( |S₅₀₀/500 − p| < ε ) ≅ 0.965

for the values of ε and p considered.

From the above example of the normal distribution providing an approximation to the binomial distribution, we saw that for n as small as 10 it was a very good approximation. This is, however, by no means the rule. In this case it arose because p = ½. For values of p near zero or one the approximation is much worse for the same n. In general the accuracy of the approximation depends on n as well as the unknown parameter θ. This presents us with the problem of assessing how good the approximation is for a particular value of n and a range for θ. This problem will be considered further in Chapter 10, where the additional question of improving the approximation will also be considered.

Although the CLT refers to the convergence in distribution of the standardised sum (S_n − m_n)/c_n, where m_n = E(S_n) and c_n = √[Var(S_n)], it is commonplace in practice to refer to S_n being asymptotically normally distributed with mean m_n and variance c_n², and to denote this by

    S_n ~ N(m_n, c_n²).

Strictly speaking, such a statement is incorrect, but it can be justified in the following sense: for large n, probabilities of the form Pr(S_n ≤ b) can be approximated by Φ((b − m_n)/c_n), since the approximation error Pr(S_n ≤ b) − Φ((b − m_n)/c_n) goes to zero uniformly on ℝ as n → ∞. (See Ash (1972).)


The CLT can be extended to a sequence of random vectors {X_n, n ≥ 1}, where X_n is a k × 1 vector.

Lindeberg–Feller CLT

Let {X_n, n ≥ 1} be a sequence of k × 1 independent random vectors with E(X_i) = μ_i and Cov(X_i) = Σ_i, i = 1, 2, ..., and distribution functions {F_n, n ≥ 1} such that:

(i) lim_{n→∞} (1/n) Σ_{i=1}^n Σ_i = Σ ≠ 0;   (9.40)
(ii) a Lindeberg-type condition holds for each ε > 0;   (9.41)

then

    (1/√n) Σ_{i=1}^n (X_i − μ_i) →^D N(0, Σ).   (9.42)

In practice this result is demonstrated by showing that for any fixed c ∈ ℝ^k, c ≠ 0, the scalar sequence {c'X_n, n ≥ 1} satisfies the conditions of the univariate Lindeberg–Feller theorem. Since

    c'Σ_n c → c'Σc ≠ 0,

it follows that

    (1/√n) Σ_{i=1}^n c'(X_i − μ_i) →^D N(0, c'Σc) for all c ≠ 0.   (9.44)

Then, using the Cramer–Wold theorem (see Chapter 10), we can deduce the CLT result stated above.
As in the case of the other limit theorems (WLLN, SLLN), the CLT can be easily extended to the martingale case.

CLT for martingales

Let {S_n, F_n, n ≥ 1} be a martingale and define the martingale differences X_n = S_n − S_{n−1}, with E(X_n/F_{n−1}) = 0, n ≥ 1; that is, {X_n, n ≥ 1} is a sequence of r.v.'s such that:

(i) (1/c_n²) Σ_{i=1}^n E(X_i²/F_{i−1}) →^P 1;   (9.45)
(ii) a conditional form of the Lindeberg condition holds,   (9.46)

where σ_i² = E(X_i²) and c_n² = Σ_{i=1}^n σ_i²; then

    (1/c_n) Σ_{i=1}^n X_i →^D N(0, 1).   (9.47)

This theorem is a direct extension of the Lindeberg–Feller theorem. It is important to note that (i) and (ii) ensure that the summations involved are of smaller order of magnitude than c_n².
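A hedged numerical illustration: in the sketch below (NumPy assumed; the particular dependence scheme is a constructed example, not taken from the text) X_n = ε_n·sign(X_{n−1}) is a martingale difference sequence with E(X_n/F_{n−1}) = 0 and E(X_n²/F_{n−1}) = 1 even though the X_n's are dependent, and the standardised sum still behaves like N(0, 1):

```python
# Sketch of the CLT for martingale differences: X_n = e_n * sign(X_{n-1}),
# e_n IID N(0,1). The sequence is dependent, yet E(X_n | past) = 0 and
# E(X_n^2 | past) = 1, so S_n / sqrt(n) is approximately N(0, 1).
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 2_000
z = np.empty(reps)
for r in range(reps):
    e = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = e[0]
    for i in range(1, n):
        x[i] = e[i] * np.sign(x[i - 1])
    z[r] = x.sum() / np.sqrt(n)
print("mean, variance of S_n/sqrt(n):", round(z.mean(), 3), round(z.var(), 3))
```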

9.4* Limit theorems for stochastic processes

The purpose of this section is to consider briefly various extensions of the limit theorems discussed above to some interesting cases where {X_n, n ≥ 1} is a stochastic process satisfying certain restrictions (see Chapter 8). The first step towards generalising the limit theorems to dependent r.v.'s has already been considered above, for the case where {X_n, n ≥ 1} is a martingale relative to the increasing sequence of σ-fields {F_n, n ≥ 1}.

Another interesting form of a stochastic process is the m-dependent process (see Chapter 8). For an m-dependent zero mean stochastic process {X_n, n ≥ 1} with finite third moments, E(|X_n|³) ≤ K for all n ≥ 1 and some constant K, it can be shown that

    (1/√n) Σ_{i=1}^n X_i →^D Z ~ N(0, σ̄²), if σ̄² = lim_{n→∞} (1/n) Σ_{i=1}^n v_i² < ∞,   (9.48)

where

    v_i² = Var(X_{i+m}) + 2 Σ_{j=0}^{m−1} Cov(X_{i+j}, X_{i+m}).

The importance of martingales and m-dependent processes lies with the fact that stationary-ergodic and mixing stochastic processes behave asymptotically like martingale differences and m-dependent processes, respectively. Hence the limit theorems for martingales and m-dependent processes can be extended to stationary and mixing processes when certain restrictions related to their homogeneity and memory are imposed.
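A numerical sketch of (9.48) for an m-dependent case (NumPy assumed; the MA(1) scheme and its coefficient are illustrative choices): with X_t = ε_t + 0.5ε_{t−1}, a 1-dependent process, the variance of n^{−½}S_n approaches the 'long-run' variance Var(X_t) + 2Cov(X_t, X_{t+1}) = 1.25 + 1 = 2.25 rather than Var(X_t):

```python
# Sketch of the m-dependent CLT: X_t = e_t + 0.5 e_{t-1} is 1-dependent
# with zero mean; (1/sqrt(n)) S_n has limiting variance
# Var(X_t) + 2 Cov(X_t, X_{t+1}) = 1.25 + 2(0.5) = 2.25.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 2_000, 2_000
e = rng.standard_normal((reps, n + 1))
x = e[:, 1:] + 0.5 * e[:, :-1]
z = x.sum(axis=1) / np.sqrt(n)
print("empirical variance of S_n/sqrt(n):", round(z.var(), 3), "(theory: 2.25)")
```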

SLLN for mixing processes

Let {X_n, n ≥ 1} be a mixing process with

    (a) φ(m) = O(m^{−λ}), or (b) α(m) = O(m^{−λ}), λ > r/(r − 1),   (9.51)

such that |X_n| ≤ Z_n and E(|Z_n|^{r+δ}) ≤ K < ∞, n ≥ 1, for r ≥ 1 and δ > 0 (i.e. X_n is dominated by Z_n, n ≥ 1); then

    (1/n) Σ_{i=1}^n [X_i − E(X_i)] →^{a.s.} 0   (9.52)

(see White and Domowitz (1984)).

The value of r in the above SLLN determines the highest moment assumed to exist, and the same value restricts the memory of {X_n, n ≥ 1}. For an independent process r = 1 and φ(m) dies out exponentially.
CLT for mixing processes

Let {X_n, n ≥ 1} be a mixing process satisfying the restrictions imposed for the SLLN to hold. In addition let us assume that:

(i) E(X_n) = 0;   (9.53)
(ii) E(|X_n|^{2r}) < ∞;   (9.54)
(iii) for S_n(τ) = (1/√n) Σ_{i=τ+1}^{τ+n} X_i, there exists σ̄², finite and non-zero, such that E[(S_n(τ))²] − σ̄² → 0 for all τ as n → ∞;

then

    (1/σ̄) S_n(0) →^D N(0, 1)   (9.55)

(see White and Domowitz (1984)).

The importance of the above limit theorems for mixing processes becomes more apparent in view of the following result: if {X_n, n ≥ 1} is a mixing process (φ(m) or α(m)), then any Borel function Y_n = g_n(X_n, ..., X_{n−k}) is also mixing. Moreover, if X_n is O(m^{−λ}), then Y_n is also O(m^{−λ}), λ > 0. This result is of considerable interest in statistical inference, where asymptotic results for functions of stochastic processes are at a premium. For stationary stochastic processes which are also ergodic several limit theorems can be proved.

SLLN for stationary, ergodic processes

Let {X_n, n ≥ 1} be a stationary and ergodic process such that E(|X_n|) < ∞, n ≥ 1; then

    (1/n) Σ_{i=1}^n X_i →^{a.s.} E(X_n)   (9.56)

(see Stout (1974)).

A corresponding CLT for stationary, ergodic processes holds under additional moment restrictions of the form

    E(|X_n|^{2+δ}) < ∞, δ > 0;   (9.58)

then

    (1/(σ̄√n)) Σ_{i=1}^n X_i →^D N(0, 1)   (9.59)

(see Hall and Heyde (1980)).

Note that mixing implies ergodicity (see Chapter 8).

9.5 Summary
The limit theorems discussed above provide us with useful information relating to the probabilistic behaviour of a particular aggregate function, the sum, of an increasing sequence of r.v.'s, as the number of r.v.'s goes to infinity, when certain conditions are imposed on the individual r.v.'s to ensure that 'no one dominates the sum'. The WLLN refers to the convergence in probability of S_n/n, i.e.

    (S_n/n) − E(S_n/n) →^P 0.

The SLLN strengthens the convergence to 'almost surely', i.e.

    (S_n/n) − E(S_n/n) →^{a.s.} 0.

The CLT, on the other hand, provides us with information relating to the rate of convergence. This information comes in the form of the 'appropriate' factor by which to premultiply S_n − E(S_n) so that it converges to a non-degenerate r.v. (the convergence in the WLLN and SLLN is to a degenerate r.v.: a constant). This factor comes in the form of the standard deviation of S_n, i.e.

    [Var(S_n)]^{−½} [S_n − E(S_n)] = Z_n →^D N(0, 1).   (9.62)

Important concepts

Convergence in probability, convergence almost surely, convergence in distribution, weak law of large numbers, strong law of large numbers, central limit theorem, Lindeberg condition.

Questions

1. Explain the statement lim_{n→∞} Pr(|S_n/n − p| ≤ ε) = 1 and contrast it with Pr(lim_{n→∞} S_n/n = p) = 1.
2. Discuss the underlying assumptions of the Bernoulli WLLN in relation to their contribution to the result.
3. Explain the difference between Chebyshev's and Markov's WLLN.
4. Whose behaviour do the WLLN and SLLN refer to?
5. Explain intuitively why a sequence of martingale differences with finite variances obeys the WLLN and SLLN.
6. Explain the Lindeberg–Feller CLT in relation to the assumptions and conclusions.
7*. In the CLT, why is the limit distribution a normal and not some other distribution?
8. 'All limit theorems impose conditions on the individual r.v.'s of a sequence in order to ensure that no one dominates the behaviour of the aggregate, and this leads to their conclusions.' Discuss.
9. Compare the Lindeberg–Feller CLT with the one for martingales.
10. Compare Markov's condition for the WLLN with Kolmogorov's condition for the SLLN.
11. Compare the conclusions of the WLLN, SLLN and CLT.
Exercises

1. Determine the value of a for which the sequence of r.v.'s {X_n, n ≥ 1} satisfies the SLLN.

3*. Show that for the sequence of independent r.v.'s {X_n, n ≥ 1} with

    Pr(X_n = ±n^a) = 1/(2n^{2a}), Pr(X_n = 0) = 1 − n^{−2a},

the Lindeberg condition holds iff a < ½.

Additional references

Chung (1974); Cramer (1946); Feller (1968); Giri (1974); Gnedenko (1969); Loeve (1963); Pfeiffer (1978); Rao (1973); Renyi (1970); Rohatgi (1976).

CHAPTER 10*

Introduction to asymptotic theory

10.1 Introduction

At the heart of statistical inference lies the problem of deriving the distribution of some Borel function h(·) of the random vector X ≡ (X₁, ..., X_n)', i.e. of determining

    F_n(y) = Pr( h(X₁, ..., X_n) ≤ y ),

from the distribution of X. In Chapter 6 we saw that this is by no means a trivial problem, even for the simplest functions h(·). Indeed, the results in this area are almost exclusively related to simple functions of normally distributed r.v.'s, most of which have been derived in Chapter 6. For more complicated functions, even in the case of normality, very few results are available. Given, however, that statistical inference depends crucially on being able to determine the distribution of such functions h(X), we need to tackle the problem somehow. Intuition suggests that the limit theorems discussed in Chapter 9, when extended, might enable us to derive approximate solutions to the distribution problem.

The limit theorems considered in Chapter 9 tell us that under certain conditions, which ensure that no one r.v. in a sequence {X_n, n ≥ 1} dominates the behaviour of the sum Σ_{i=1}^n X_i, we can deduce that:

    (1/n) Σ_{i=1}^n X_i − (1/n) Σ_{i=1}^n E(X_i) →^P 0 (WLLN);

    (1/n) Σ_{i=1}^n X_i − (1/n) Σ_{i=1}^n E(X_i) →^{a.s.} 0 (SLLN);

    [S_n − E(S_n)]/√[Var(S_n)] →^D N(0, 1) (CLT).

In order to be able to extend these results to arbitrary Borel functions h(X), not just Σ_{i=1}^n X_i, we firstly need to extend the various modes of convergence (convergence in probability, almost sure convergence, convergence in distribution) to apply to any sequence of r.v.'s {X_n, n ≥ 1}.

The various modes of convergence related to the above limit theorems are considered in Section 10.2. The main purpose of this section is to relate the various mathematical notions of convergence to the probabilistic modes needed in asymptotic theory. One important mode of convergence not encountered in the context of the limit theorems is 'convergence in the rth mean', which refers to convergence of moments. Section 10.3 discusses various concepts related to the convergence of moments, such as asymptotic moments, limits of moments and probability limits, in an attempt to distinguish between these concepts, often confused in asymptotic theory. In Chapter 9 it was stressed that an important ingredient which underlies the conditions giving rise to the various limit theorems is the notion of the 'order of magnitude'. For example, the Markov condition needed for the WLLN,

    lim_{n→∞} (1/n²) Σ_{k=1}^n Var(X_k) = 0,   (10.5)

is a restriction on the order of magnitude of Var(S_n) of the form

    Var(S_n) = o(n²).   (10.6)

In Section 10.4 the 'big O, little o' notation is considered in some detail as a prelude to Sections 10.5 and 10.6. The purpose of Section 10.5 is to consider the question of extending the limit theorems of Chapter 9 from Σ_{i=1}^n X_i to more general functions of (X₁, X₂, ..., X_n), such as Σ_{i=1}^n X_i^r, r ≥ 1. This is indeed the main aim of asymptotic theory.

Asymptotic theory results are resorted to by necessity, when finite sample results are not available in a usable form. This is because asymptotic results provide only approximations. 'How good the approximations are' is commonly unknown, because we could answer such a question only when the finite sample result is available. But in such a case the asymptotic result is not needed! There are, however, various 'rough' error bounds which can shed some light on the magnitude of the approximation error. Moreover, it is often possible to 'improve' upon the asymptotic results using what we call asymptotic expansions, such as the Edgeworth expansion. The purpose of Section 10.6 is to introduce the reader to this important literature on error bounds and asymptotic expansions. The discussion is only introductory and much more intuitive than formal, in an attempt to demystify this literature, which plays an important role in econometrics. For a more complete and formal discussion see Phillips (1980), Rothenberg (1984), inter alia.

10.2 Modes of convergence

The notions of 'convergence' and 'limit' play a very important role in probability theory, not only because of the limit theorems discussed in Chapter 9 but also because they underlie some of the most fundamental concepts, such as probability and distribution functions, density functions, mean, variance, as well as higher moments. This was not made explicit in Chapters 3–7 because of the mathematical subtleties involved.

In order to understand the various modes of convergence in probability theory, let us begin by reminding ourselves of the notion of convergence in mathematical analysis. A sequence {a_n, n ∈ ℕ} is defined to be a function from the natural numbers ℕ = {1, 2, 3, ...} to the real line ℝ.

Definition 1

A sequence {a_n, n ∈ ℕ} is said to converge to a limit a if for every arbitrarily small number ε > 0 there corresponds a number N(ε) such that the inequality |a_n − a| < ε holds for all terms a_n of the sequence with n > N(ε); we denote this by lim_{n→∞} a_n = a.

Example 1

lim_{n→∞} (1/n) = 0; lim_{n→∞} (log_b n)/n = 0, b > 0, b ≠ 1; lim_{n→∞} (1 + 1/n)^n = e ≅ 2.71828; lim_{n→∞} (n² + n + 6)/(3n² − 2n + 2) = 1/3; lim_{n→∞} [√(n + a) − √n] = 0 for any a > 0.

This notion of convergence can be extended directly to any function h(·) whose domain is not necessarily ℕ but any subset of ℝ, i.e. h(x): D → ℝ, D ⊂ ℝ. The way this is done is to allow the variable x to define a sequence of numbers {x_n, n ∈ ℕ} converging to some limit x₀, and to consider the sequence {h(x_n), n ∈ ℕ} as x_n → x₀; we denote this by lim_{x→x₀} h(x) = l.

Definition 2

A function h(x) is said to converge to a limit l as x → x₀ if for every ε > 0, however small, there exists a number δ(ε) > 0 such that |h(x) − l| < ε for all x satisfying 0 < |x − x₀| < δ(ε).

Example 2

For h(x) = eˣ, lim_{x→−∞} h(x) = 0, and for the polynomial function h(x) = a₀xⁿ + a₁xⁿ⁻¹ + ... + a_{n−1}x + a_n, lim_{x→0} h(x) = a_n.

Note that the condition 0 < |x − x₀| < δ(ε) excludes the point x = x₀ in the above definition, and thus for h(x) = (x² − 9)/(x − 3), lim_{x→3} h(x) = 6, even though h(x) is not defined at x = 3. Further:

    for h(x) = aˣ, a > 0: lim_{x→0} h(x) = 1;
    for h(x) = (1 + x)^{1/x}: lim_{x→0} h(x) = e;
    for h(x) = (1 + ax)^{1/x}: lim_{x→0} h(x) = e^a;
    for h(x) = [log_e(1 + x)]/x: lim_{x→0} h(x) = 1.

Using the notion of convergence of a function to a limit we can define the continuity of a function at a point.

Definition 3

A function h(x), defined over some interval D ⊂ ℝ, x₀ ∈ D, is said to be continuous at the point x₀ if for each ε > 0 there exists a δ(ε) > 0 such that |h(x) − h(x₀)| < ε for every x satisfying the restriction |x − x₀| < δ(ε). We denote this by lim_{x→x₀} h(x) = h(x₀). A function is said to be continuous if it is continuous at every point of its domain, D.

Example 3

The functions considered in Example 2 are continuous at every point of their respective domains (verify!).

In the case of a general function we can define the sequence {h_n(x), n ∈ ℕ} and consider its behaviour for each x ∈ A, where A is a subset of D.

Definition 4

A sequence of functions {h_n(x), n ∈ ℕ} with common domain D is said to converge pointwise to h(x) on A ⊂ D if for each ε > 0 there exists an N(ε, x) such that if n > N(ε, x), then |h_n(x) − h(x)| < ε holds for all x ∈ A.

Example 4

h_n(x) = Σ_{k=0}^{n} x^k/k!, lim_{n→∞} h_n(x) = eˣ for all x ∈ ℝ.

In the case where N(ε, x) does not depend on x (only on ε), {h_n(x), n ∈ ℕ} is said to converge uniformly on A. The importance of uniform convergence stems from the fact that if each h_n(x) in the sequence is continuous and converges uniformly to h(x) on D, then the limit h(x) is also continuous. That is, if

    lim_{x→x₀} h_n(x) = h_n(x₀) for x₀ ∈ D, and lim_{n→∞} h_n(x) = h(x) for all x ∈ D, uniformly,   (10.8)

then

    lim_{x→x₀} h(x) = h(x₀).   (10.9)

With the above notions of continuity and limit in mathematical analysis in mind, let us consider the question of convergence in the context of the probability spaces (S, F, P(·)) and (ℝ, B, P_X(·)). Given that a random variable X(·) is a function from S to ℝ, we can define pointwise and uniform convergence on S for the sequence {X_n(s), n ∈ ℕ} by

    |X_n(s) − X(s)| < ε for n > N(ε, s), s ∈ S,   (10.10)

and

    |X_n(s) − X(s)| < ε for n > N(ε),   (10.11)

respectively. These notions of convergence are of little interest, because the probabilistic structure of {X_n(s), n ∈ ℕ} is ignored. Although the probability set functions P(·) and P_X(·) do not come into the definition of a random variable, they play a crucial role in its behaviour. If we take this probabilistic structure into consideration, both of the above forms of convergence are much too strong, because they imply that for n > N

    |X_n(s) − X(s)| < ε whatever the outcome s ∈ S.   (10.12)

The form of probabilistic convergence closest to this is almost sure convergence, which allows for convergence of X_n(s) to X(s) for all s except for some s-set A ⊂ S for which P(A) = 0; A is said to be a set of probability zero. The term 'almost sure' is used to emphasise the convergence on S − A, not the whole of S.

Definition 5

A sequence of r.v.'s {X_n(s), n ∈ ℕ} is said to converge almost surely (a.s.) to a r.v. X(s), denoted by X_n →^{a.s.} X, if

    Pr( s: lim_{n→∞} X_n(s) = X(s) ) = 1.   (10.13)

An equivalent way of defining almost sure convergence is by

    lim_{n→∞} Pr( s: |X_m(s) − X(s)| < ε, for all m ≥ n ) = 1   (10.14)

(see Chung (1974)). The mode of convergence associated with the strong law of large numbers (SLLN) is almost sure convergence.

Another mode of convergence, not considered in relation to the limit theorems of Chapter 9, is that of convergence in rth mean.

Definition 6

Let {X_n(s), n ∈ ℕ} be a sequence of r.v.'s such that E(|X_n|^r) < ∞ for all n ∈ ℕ and E(|X|^r) < ∞, r > 0; then the sequence converges to X in rth mean, denoted by X_n →^r X, if

    lim_{n→∞} E(|X_n − X|^r) = 0.   (10.15)

Of particular interest in what follows is convergence in mean (r = 1) and in mean square (r = 2).

A weaker mode of convergence, related to the weak law of large numbers (WLLN), is that of convergence in probability.

Definition 7

A sequence of r.v.'s {X_n(s), n ∈ ℕ} is said to converge in probability to X, denoted by X_n →^P X, if

    lim_{n→∞} Pr( s: |X_n(s) − X(s)| < ε ) = 1.   (10.16)

The relationship between convergence almost surely and convergence in probability can be deduced by comparing (16) with (14).

The mode of convergence related to the central limit theorem (CLT) is that of convergence in distribution.

Definition 8

A sequence of r.v.'s {X_n(s), n ∈ ℕ} with distribution functions {F_n(x), n ∈ ℕ} is said to converge in distribution to X(s), with distribution function F(x), denoted by X_n →^D X, if

    lim_{n→∞} F_n(x) = F(x)   (10.17)

at every continuity point x of F(x).

This is nothing more than the pointwise convergence of a sequence of functions considered above. In the case where the convergence is also uniform, F(x) is continuous, and vice versa. It is important, however, to note that F(x) in (17) might not be a proper distribution function (see Chapter 4).

Without any further restrictions on the sequence of r.v.'s {X_n(s), n ∈ ℕ}, the above four modes of convergence are related as shown in Fig. 10.1. As we can see, convergence in distribution is the weakest mode of convergence, being implied by all three other modes. Moreover, almost sure and rth mean convergence are not directly related, but they both imply convergence in probability. In order to be able to relate almost sure and rth mean convergence we need to impose some more restrictions on the sequence {X_n(s), n ∈ ℕ}, such as the existence of moments up to order r.

[Fig. 10.1. The relationships between the four modes of convergence: almost sure convergence and rth mean convergence each imply convergence in probability, which in turn implies convergence in distribution.]

The implication (a.s. implies P) stems from the fact that (14) is a stronger form of convergence than (16), since it holds for all m ≥ n. The implication (r implies P) is based on the inequality

    Pr( |X_n − X| ≥ ε ) ≤ E(|X_n − X|^r)/ε^r.   (10.18)

The implication (P implies D) is rather obvious in the case where F(x) is a proper distribution function, because for every ε > 0, δ > 0 there exists N so that for all n > N, Pr(|X_n − X| > ε) < δ, and thus

    F_n(x − ε) − δ ≤ F(x) ≤ F_n(x + ε) + δ,   (10.19)

implying that, as ε → 0 and δ → 0, X_n →^D X. The reverse implication, X_n →^D X implies X_n →^P X, holds only in the case where X is a constant.

In order to be able to establish the result X_n →^{a.s.} X we need to ensure that the convergence in probability is 'sufficiently fast'. If

    Σ_{n=1}^{∞} Pr( |X_n − X| > ε ) < ∞ for every ε > 0,   (10.20)

then X_n →^{a.s.} X. Moreover, a similar condition on the convergence in rth mean implies convergence almost surely. In particular, if

    Σ_{n=1}^{∞} E(|X_n − X|^r) < ∞,   (10.21)

then X_n →^{a.s.} X.

In order to go from convergence in probability or almost sure convergence to rth mean convergence we need to ensure that the sequence of r.v.'s {X_n, n ∈ ℕ} is bounded and that the moments up to order r exist. In particular, if X_n →^P X and such boundedness conditions hold, then X_n →^r X (see Serfling (1980)).

An important property of rth mean convergence is that if X_n →^r X for some r ≥ 1, then X_n →^l X for 0 < l < r. For example, mean square convergence implies mean convergence. This is related to the result that

    if E(|X_n|^r) < ∞, then E(|X_n|^l) < ∞ for 0 < l < r.   (10.22)

That is, if the rth moment exists (is bounded), then all the moments of order less than r also exist. This is the reason why, when we assume that Var(X_n) < ∞, we do not need to add that E(X_n) < ∞, given that it is always implied.

In applying asymptotic theory we often need to extend the above convergence results to transformed sequences of random vectors {g(X_n), n ∈ ℕ}. The above convergence results are said to hold for a random vector sequence {X_n, n ∈ ℕ} if they hold for each component X_{in}, i = 1, 2, ..., k, of X_n.

Lemma 10.1

Let {X_n, n ∈ ℕ} be a random vector sequence and g(·): ℝ^k → ℝ^m a continuous function at X; then:

(i) X_n →^{a.s.} X implies g(X_n) →^{a.s.} g(X);   (10.23)
(ii) X_n →^P X implies g(X_n) →^P g(X);   (10.24)
(iii) X_n →^D X implies g(X_n) →^D g(X).   (10.25)

These results also hold in the case where g(·) is a Borel function (see Chapter 6), when certain conditions are placed on the set of discontinuities of g(·); see Mann and Wald (1943). Borel functions have a distinct advantage over continuous functions in the present context, because the limits of such functions are commonly Borel functions themselves without requiring uniform convergence. Continuous functions are Borel functions but not vice versa. In order to get some idea about the generality of Borel functions, note that if h and g are Borel functions then the following are also Borel functions:

    (i) ah + bg, a, b ∈ ℝ; (ii) hg; (iii) max(h, g); (iv) min(h, g); (v) |h|.   (10.26)

Of particular interest in asymptotic theory are the following results.

Lemma 10.3

Let {X_n, Y_n, n ∈ ℕ} be a sequence of pairs of random k × 1 vectors.

(i) If (X_n − Y_n) →^P 0 and X_n →^D X, then Y_n →^D X.
(ii) If X_n →^D X and Y_n →^P C (a constant), then

    (X_n + Y_n) →^D X + C and Y_n'X_n →^D C'X.
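The gap between convergence in distribution and convergence in probability noted above can also be seen numerically. In the following sketch (NumPy assumed; the construction is a standard counterexample, not taken from the text) X_n = −Z has exactly the same N(0, 1) distribution as X = Z for every n, so X_n →^D X trivially, yet |X_n − X| = 2|Z| is not small with high probability:

```python
# Sketch: convergence in distribution does not imply convergence in
# probability. With Z ~ N(0,1), X_n = -Z and X = Z are identically
# distributed (so X_n -> X in distribution), but |X_n - X| = 2|Z|.
import numpy as np

rng = np.random.default_rng(4)
z = rng.standard_normal(200_000)
x_n, x = -z, z
print("Pr(|X_n - X| < 0.1) ~", np.mean(np.abs(x_n - x) < 0.1))  # ~0.04, not near 1
```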

10.3 Convergence of moments

Consider the sequence of r.v.'s {X_n, n ≥ 1} such that

    X_n →^D X (i.e. lim_{n→∞} F_n(x) = F(x)),

where F_n(x) and F(x) refer to the cumulative distribution functions of X_n and X respectively. We define the moments of X_n (when they exist) by

    E(X_n^r) = ∫_{−∞}^{∞} x^r dF_n(x), r = 1, 2, ....   (10.28)

This definition of raw moments involves the Riemann–Stieltjes integral, which is a direct generalisation of the ordinary Riemann integral. In the case where F_n(x) is a monotonically non-decreasing function of x and has a continuous derivative (as in the case of a continuous r.v.), dF_n(x) is equivalent to the differential f_n(x) dx, f_n(x) = dF_n(x)/dx being the corresponding density function.

The limit of the rth moment of X_n is defined by

    lim_{n→∞} E(X_n^r),   (10.29)

and it refers to the ordinary mathematical limit of the sequence {E(X_n^r), n ≥ 1}. This limit is by no means equivalent to the asymptotic moments of X_n, defined by

    E(X^r) ≡ ∫_{−∞}^{∞} x^r dF(x).   (10.30)

As we can see from (30), the asymptotic moments of X_n are defined in terms of its asymptotic distribution F(x) and not its finite sample distribution F_n(x). In view of the fact that F_n(x) might have moments up to order m and F(x) might not (or vice versa), there is no reason why E(X_n^r), lim_{n→∞} E(X_n^r) and E(X^r) will be equal for all r ≤ m and all n. Indeed, we can show that the limits inferior of E(|X_n|^r) for r ≥ 1 provide upper bounds for the corresponding asymptotic moments. In particular:

Lemma 10.4

If X_n →^D X, then:

(i) lim inf_{n→∞} E(|X_n|) ≥ E(|X|);
(ii) lim inf_{n→∞} Var(X_n) ≥ Var(X)

(see Chung (1974)).

Under certain conditions, however, these concepts are equal.

Lemma 10.5

If X_n →^D X and the sequence {X_n^r, n ≥ 1} is uniformly integrable, i.e.

    lim_{c→∞} sup_{n} ∫_{|x|^r ≥ c} |X_n^r| dP = 0,

then E(|X|^r) < ∞ and lim_{n→∞} E(X_n^r) = E(X^r).

Lemmas 10.6–10.7

If X_n →^P X and {X_n^r, n ≥ 1} is uniformly integrable, then

    lim_{n→∞} E(X_n^r) = E(X^r).

Lemma 10.8

If X_n →^{a.s.} X and lim inf_{n→∞} E(|X_n|^r) = E(|X|^r) < ∞, then

    lim_{n→∞} E(X_n^r) = E(X^r).

(For these lemmas see Serfling (1980).) Looking at these results we can see that the important condition for the equality of the limit of the rth moment and the rth asymptotic moment is the uniform integrability of {X_n^r, n ≥ 1}, which allows us to interchange limits with expectations.

Beyond the distinction between moments, limits of moments and asymptotic moments, we sometimes encounter the concept of approximate moments. Consider the Taylor series expansion of g(m_r), m_r = (1/n) Σ_{i=1}^n X_i^r:

    g(m_r) = g(μ_r') + g^{(1)}(μ_r')(m_r − μ_r') + ½ g^{(2)}(μ_r')(m_r − μ_r')² + ....   (10.31)

This expansion is often used to derive approximate moments for g(m_r). Under certain regularity conditions (see Sargan (1974)),

    E[g(m_r)] ≅ g(μ_r') + ½ g^{(2)}(μ_r') E(m_r − μ_r')²;   (10.32)

    Var[g(m_r)] ≅ [g^{(1)}(μ_r')]² Var(m_r);   (10.33)

    E{[g(m_r) − g(μ_r')]³} ≅ [g^{(1)}(μ_r')]³ E(m_r − μ_r')³ + 3[g^{(1)}(μ_r')]² g^{(2)}(μ_r') [E(m_r − μ_r')²]²,   (10.34)

where '≅' reads 'approximately equal'. These moments are viewed as moments of a statistic purporting to approximate g(m_r), and under certain conditions they can be treated as approximations to the moments of g(m_r) (see Sargan (1974)). Such approximations must be distinguished from E(X_n^r) as well as lim_{n→∞} E(X_n^r). The approximate moments derived above can be very useful in choosing the functions g(·) so as to make the asymptotic results more accurate, in the context of variance stabilising transformations and asymptotic expansions (see Rothenberg (1984)). In deriving the asymptotic distribution of g(m_r) only the first two moments are utilised, and one can improve upon the normal approximation by utilising the above approximate higher moments in the context of asymptotic expansions. A brief introduction to asymptotic expansions is given in Section 10.6.

10.4 The 'big O' and 'little o' notation

As argued above, the essence of asymptotic theory is approximation: approximation of Borel functions, random variables, distribution functions, means, variances and higher moments (see Section 10.5). A particularly useful notion in the context of any approximation theory is that of the accuracy, or order of magnitude, of the approximations. In mathematical analysis the order of magnitude of the various quantities involved in an approximation is 'kept track of' by the use of the 'big O, little o' notation. It turns out that this notation can be extended to probabilistic approximations with minor modifications. The purpose of this section is to review the O, o notation and consider its extension to asymptotic theory.

Let {a_n, b_n, n ∈ ℕ} be a double sequence of real numbers.

Definition 9

The sequence {a_n, n ∈ ℕ} is said to be 'at most of order b_n', denoted by a_n = O(b_n), if

    lim_{n→∞} |a_n/b_n| ≤ K, for some constant K > 0.   (10.35)

Definition 10

The sequence {a_n, n ∈ ℕ} is said to be 'of smaller order than b_n', denoted by a_n = o(b_n), if

    lim_{n→∞} (a_n/b_n) = 0.   (10.36)

Example 5

2n² − 1 = O(n²); 5n² + n³ = O(n³); log_e n = o(n^a), a > 0; exp(−n) = o(n^{−a}), a > 0; 6n² + 3n = O(n²) = o(n³).

A very important implication stemming from these examples is that if a_n = O(n^a), then a_n = o(n^{a+δ}), δ > 0.

The O, o notation satisfies the following properties:

P1. If a_n = O(b_n) and c_n = O(d_n), then: a_n c_n = O(b_n d_n); |a_n|^s = O(b_n^s), s > 0; and a_n + c_n = O(max(b_n, d_n)). The same results hold for small 'o' in place of 'big O' above.

P2. If a_n = O(b_n) and c_n = o(d_n), then a_n c_n = o(b_n d_n).

The O, o notation can be extended to general real valued functions h(·) and g(·) with common domain D ⊂ ℝ. We say that h(x) = O(g(x)) as x → x₀ if, for a constant K > 0,

    lim_{x→x₀} |h(x)/g(x)| ≤ K, x ∈ (D − x₀).

Moreover, we say that h(x) = o(g(x)) if

    lim_{x→x₀} h(x)/g(x) = 0, x ∈ (D − x₀).

Example 6

For h(x) = eˣ − 1 we have |h(x)| ≤ c|x| for x ∈ [−1, 1], so h(x) = O(x); and for h(x) = cos x, h(x) = 1 + o(x) as x → 0.

In the case where h(x) − g(x) = O(l(x)) we write h(x) = g(x) + O(l(x)), and for h(x) − g(x) = o(l(x)) we write h(x) = g(x) + o(l(x)). This notation is particularly useful in the case of the Taylor expansion, where we can show that if h(x) is differentiable of order n (i.e. the derivatives (d^j/dx^j)h(x), j = 1, 2, ..., n, exist for some positive integer n) at x = x₀, then

    h(x₀ + t) = h(x₀) + h^{(1)}(x₀)t + ... + [h^{(n)}(x₀)/n!]tⁿ + o(tⁿ) as t → 0.

The O, o notation considered above can be extended to the case of stochastic convergence, convergence almost surely and in probability.

Definition 11

Let {X_n, n ∈ ℕ} be a sequence of r.v.'s and {c_n, n ∈ ℕ} a sequence of positive real numbers. We say that:

(i) X_n is at most of order c_n in probability, denoted by X_n = O_p(c_n), if there exists a non-stochastic sequence {a_n, n ∈ ℕ} such that (X_n/c_n) − a_n →^P 0;
(ii) X_n is of order smaller than c_n in probability, denoted by X_n = o_p(c_n), if (X_n/c_n) →^P 0.

In the same way we can define X_n = O_{a.s.}(c_n) and X_n = o_{a.s.}(c_n). It turns out that the properties P1 and P2 for O, o can be extended directly to O_p, o_p and O_{a.s.}, o_{a.s.} with minor modifications (see White (1984)). Moreover, order of magnitude results related to non-stochastic sequences can be transformed into stochastic order of magnitude results using the following theorem, due to Mann and Wald (1943).

Mann–Wald theorem

Let {X_n, n ∈ ℕ} be a sequence of k-dimensional random vectors such that

    X_{jn} = O_p(c_{jn}), j = 1, 2, ..., k,

and {g_n(·), n ∈ ℕ} a sequence of Borel functions g_n: ℝ^k → ℝ. If g_n(a_n) = O(b_n) for every non-stochastic sequence {a_n, n ∈ ℕ} of k-dimensional vectors such that a_{jn} = O(c_{jn}), j = 1, 2, ..., k, then

    g_n(X_n) = O_p(b_n)

(see Fuller (1976)). This theorem can be used to translate non-stochastic results related to the Taylor expansion into stochastic ones.

Useful results relating to order in probability:

(O1) If Var(X_n) = c_n² < ∞, then X_n = O_p(c_n), i.e. 'as big as the standard deviation'.
(O2) If X_n = O_p(1/√n), then X_n = o_p(1).
(O3) If X_n →^D X, then X_n = O_p(1).
(O4) If X_n = X + o_p(1), then X_n →^P X.
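These order statements can be checked by simulation. The sketch below (NumPy assumed; the exponential distribution and the sample sizes are illustrative only) shows that X̄_n − μ shrinks like n^{−½}: multiplied by √n, its spread stabilises, consistent with X̄_n − μ = O_p(1/√n) and hence o_p(1):

```python
# Sketch of (O1)-(O2): for IID exponential(1) data (mu = 1) the deviation
# Xbar_n - mu is O_p(1/sqrt(n)); rescaled by sqrt(n) its spread is stable.
import numpy as np

rng = np.random.default_rng(5)
for n in (100, 10_000, 100_000):
    dev = rng.exponential(1.0, size=(100, n)).mean(axis=1) - 1.0
    print(f"n = {n:>6}: sd(Xbar - mu) = {dev.std():.5f}, "
          f"sd(sqrt(n)(Xbar - mu)) = {(np.sqrt(n) * dev).std():.3f}")
```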

10.5 Extending the limit theorems

As mentioned above, in statistical inference most of the Borel functions h(X₁, X₂, ..., X_n) whose asymptotic behaviour is of interest are functions of Σ_{i=1}^n X_i^r, r ≥ 1. The limit theorems discussed in Chapter 9 refer exclusively to the asymptotic behaviour of the case r = 1. What we need to do now is to extend this to any r ≥ 1, as a prelude to a discussion of the asymptotic behaviour of a function of such quantities, and then extend the results to any Borel function h(X₁, X₂, ..., X_n), when possible.

For expositional purposes let us consider the case where {X_n, n ≥ 1} is a sequence of IID r.v.'s on (S, F, P(·)). The quantities Σ_{i=1}^n X_i^r, r ≥ 1, are directly related to the raw moments μ_r' ≡ E(X^r), r ≥ 1, in the sense that if we were to select an 'estimator' of μ_r' we would select

    m_r = (1/n) Σ_{i=1}^n X_i^r   (10.38)

as a natural choice. Given that the raw moments {μ_r', r ≥ 1} play an important role in the manipulation and description of the distribution function F(x), their estimator sequence {m_r, r ≥ 1} is expected to play an important role in statistical inference. For reasons which will become apparent in the next chapter the m_r's are called sample raw moments. Let us consider m_r = (1/n) Σ_{i=1}^n X_i^r as a Borel function of the IID r.v.'s X₁, ..., X_n, and thus a r.v. itself. Taking its expectation,

    E(m_r) = (1/n) Σ_{i=1}^n E(X_i^r) — because of the linearity of E(·) —
           = (1/n) Σ_{i=1}^n μ_r' = μ_r' — because of the IID assumption.   (10.39)

Let Y_i = X_i^r, i = 1, 2, ...; then m_r = S_n/n, where S_n = Σ_{i=1}^n Y_i; that is, S_n is the sum of the IID r.v.'s Y₁, Y₂, ..., Y_n. From the WLLN and SLLN we can deduce that

(i) m_r →^P μ_r';   (10.40)
(ii) m_r →^{a.s.} μ_r'.   (10.41)

Using the Lindeberg–Levy CLT, we can deduce that if Var(Y_i) = σ_Y² < ∞, then

(iii) √n (m_r − μ_r') →^D N(0, σ_Y²),   (10.42)

where Var(Y_i) = E(X^r − μ_r')² = (μ_{2r}' − μ_r'²), assuming that μ_{2r}' < ∞, r ≥ 1. Moreover, given that

    E(m_r m_k) = (1/n²)[ Σ_{i=1}^n E(X_i^{r+k}) + Σ_{i≠j} E(X_i^r X_j^k) ] = (1/n) μ_{r+k}' + [(n−1)/n] μ_r'μ_k',   (10.43)

and since Cov(m_r, m_k) = E(m_r m_k) − μ_r'μ_k', we can deduce that

    Cov(m_r, m_k) = (1/n)(μ_{r+k}' − μ_r'μ_k').   (10.44)

Hence

(iv) √n (m − μ') →^D N(0, Σ), where m ≡ (m₁, m₂, ..., m_r)', μ' ≡ (μ₁', μ₂', ..., μ_r')', Σ = [σ_ij], σ_ij = μ_{i+j}' − μ_i'μ_j', i, j = 1, 2, ..., r.   (10.45)

The results (i)–(iv) can be seen as extensions of the limit theorems in Chapter 9 from the sum Σ_{i=1}^n X_i to the general sum Σ_{i=1}^n X_i^r, r ≥ 1. Having derived the asymptotic results (i)–(iv) for the sample raw moments, we can now extend these results to any continuous function of them using Lemma 10.1 above.
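Result (10.44) lends itself to a quick numerical check. In the sketch below (NumPy assumed; uniform data are chosen purely because their raw moments μ_j' = 1/(j + 1) are simple), the empirical covariance of m₁ and m₂ across replications is compared with (μ₃' − μ₁'μ₂')/n:

```python
# Sketch checking (10.44): Cov(m_r, m_k) = (mu'_{r+k} - mu'_r mu'_k)/n
# for IID U(0, 1) data, where mu'_j = 1/(j + 1).
import numpy as np

rng = np.random.default_rng(6)
n, reps = 500, 20_000
x = rng.uniform(size=(reps, n))
m1, m2 = x.mean(axis=1), (x**2).mean(axis=1)
empirical = np.cov(m1, m2)[0, 1]
theory = (1/4 - (1/2) * (1/3)) / n         # (mu'_3 - mu'_1 mu'_2)/n
print(f"empirical {empirical:.3e} vs theoretical {theory:.3e}")
```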

Example 7

If g(·) = log_e(·), then for m_r > 0, μ_r' > 0:

(i) log_e(m_r) →^P log_e(μ_r');
(ii) log_e(m_r) →^{a.s.} log_e(μ_r');
(iii) by Lemma 10.1(iii), continuous functions of the standardised quantity √n(m_r − μ_r')/√(μ_{2r}' − μ_r'²) converge in distribution to the same functions of its N(0, 1) limit.

The last result, however, is not particularly useful, because we usually need to derive the asymptotic distribution of g(m_r) itself and not that of a function of √n(m_r − μ_r')/√(μ_{2r}' − μ_r'²). Let us derive the asymptotic distribution of g(m_r), taking the opportunity to use some of the concepts and results propounded above.

From O1 above we know that m_r = μ_r' + O_p(1/√n), and hence from the Mann–Wald theorem we can deduce that if g(·) has continuous derivatives of order k, then

    g(m_r) = g(μ_r') + g^{(1)}(μ_r')(m_r − μ_r') + ... + O_p(n^{−k/2}).   (10.46)

Assuming that k = 2, we can deduce that

    g(m_r) − g(μ_r') = g^{(1)}(μ_r')(m_r − μ_r') + O_p(n^{−1}),

and, by O2,

    √n [g(m_r) − g(μ_r')] = √n (m_r − μ_r') g^{(1)}(μ_r') + o_p(1).   (10.47)

Let

    V_n = √n [g(m_r) − g(μ_r')], U_n = √n (m_r − μ_r');

then

    V_n = aU_n + o_p(1), a = g^{(1)}(μ_r').   (10.48)

From Lemma 10.3 we may conclude that if U_n →^D U then V_n →^D aU, and thus

(v) √n [g(m_r) − g(μ_r')] →^D N(0, (μ_{2r}' − μ_r'²)[g^{(1)}(μ_r')]²).   (10.49)

For example, if g(m_r) = log_e m_r, then

    √n (log_e m_r − log_e μ_r') →^D N(0, (μ_{2r}' − μ_r'²)/μ_r'²).   (10.50)
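Result (10.50) can be illustrated by simulation. In the sketch below (NumPy assumed; the exponential distribution is an illustrative choice), r = 1, μ₁' = 1 and μ₂' = 2, so that (μ₂' − μ₁'²)/μ₁'² = 1 and √n(log m₁ − log μ₁') should be approximately N(0, 1):

```python
# Sketch of the delta-method result (10.50) with g = log_e and r = 1:
# for IID exponential(1) data, mu'_1 = 1 and mu'_2 = 2, so
# sqrt(n)(log m_1 - log mu'_1) ~ N(0, (mu'_2 - mu'_1^2)/mu'_1^2) = N(0, 1).
import numpy as np

rng = np.random.default_rng(7)
n, reps = 2_000, 5_000
m1 = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
v = np.sqrt(n) * (np.log(m1) - np.log(1.0))
print("mean:", round(v.mean(), 3), "variance:", round(v.var(), 3))  # ~0 and ~1
```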

This result can be generalised to the case (iv) directly. If we define the vector

    G(m) ≡ (g₁(m), g₂(m), ..., g_k(m))',   (10.51)

then

    √n [G(m) − G(μ')] →^D N(0, DΣD'),   (10.52)

where

    D = [d_ij], d_ij = ∂g_i(m)/∂m_j evaluated at m = μ', i = 1, 2, ..., k, j = 1, 2, ..., r.

The above procedure is known as the δ-method (see Rothenberg (1984)).

The above results can also be generalised directly to an IID sequence of random vectors {X_n, n ≥ 1} without any crucial changes. In order to see how useful these results are, we can easily show that one of the most important Borel functions of interest in Part IV, the estimator

    s² = (1/n) Σ_{i=1}^n (X_i − X̄_n)², where X̄_n = (1/n) Σ_{i=1}^n X_i,

can be written in the form

    s² = z₂ − z₁², z₁ = (1/n) Σ_{i=1}^n X_i, z₂ = (1/n) Σ_{i=1}^n X_i²,   (10.53)

that is, as a continuous function of the first two sample raw moments.

Let us summarise the argument so far. In our attempt to derive asymptotic results for arbitrary Borel functions h(X₁, X₂, ..., X_n) we extended the limit theorems for Σ_{i=1}^n X_i to the more general sums Σ_{i=1}^n X_i^r, r ≥ 1, and then these results were extended to continuous functions of these general sums. The results so far suggest the following way to extend them even further. Consider the IID sequence of k-dimensional random vectors {X_n, n ≥ 1} and define the Borel functions l_i(·): ℝ^k → ℝ, i = 1, 2, ..., r (analogous to the Σ_{i=1}^n X_i^r, r ≥ 1, above), and Z_n ≡ (l₁(X_n), l₂(X_n), ..., l_r(X_n))' (the z_i's above), with E(Z_n) = μ. Let h(·): ℝ^r → ℝ be a Borel function (the g(·) above) and define h(Z̄_n), where Z̄_n = (1/n) Σ_{i=1}^n Z_i. The above result, in terms of h(·) being continuously differentiable at Z̄_n = μ, takes the form

    √n [h(Z̄_n) − h(μ)] →^D N(0, DVD'),   (10.54)

where

    V = Cov(Z₁), D = [∂h(Z)/∂Z_i] evaluated at μ, i = 1, 2, ..., r

(see Bhattacharya (1977) and references therein).

10.6 Error bounds and asymptotic expansions

The CLT tells us that under certain conditions the distribution of the standardised sum Y_n = [S_n − E(S_n)]/√[Var(S_n)] tends to that of a standard normal r.v., i.e.

    lim_{n→∞} F_n(y) = Φ(y) = ∫_{−∞}^{y} (1/√(2π)) exp(−½u²) du.   (10.55)

It provides us with no information as to how large the approximation error |F_n(y) − Φ(y)| is for a particular value of n. The first formal bound on the approximation error was provided by Liapunov in 1901, proving that for an IID sequence of r.v.'s X₁, ..., X_n with E(X_i) = μ, Var(X_i) = σ² and E(|X_i − μ|³) = ν₃, all finite for i = 1, 2, ...,

    sup_{y∈ℝ} |F_n(y) − Φ(y)| ≤ C (ν₃/σ³) (log n/√n),   (10.56)

where C is a positive constant. This result was later sharpened by Cramer, but the best result was provided by Berry (1941) and Esseen (1945).

Berry–Esseen theorem

Under the same conditions as the Liapunov result, we can deduce that

    sup_y |F_n(y) − Φ(y)| ≤ C (ν₃/σ³) (1/√n).   (10.57)

That is, the factor log n was unnecessary. As we can see, this is a sharpening of the Lindeberg–Levy CLT (see Chapter 9), in so far as the latter states that sup_y |F_n(y) − Φ(y)| → 0 as n → ∞ without specifying the rate of convergence. The Berry–Esseen theorem, using higher moments, provides us with the additional information that the rate of convergence is O(n^{−½}). Various authors improved the upper bound of the error by reducing the constant C. The best bound for C so far was provided by Beeck (1972): 0.4097 ≤ C < 0.7975.
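The O(n^{−½}) rate in (10.57) can be observed directly for Bernoulli sums, where F_n is computable exactly. The sketch below (NumPy and SciPy assumed; the evaluation over jump points, with left limits, is my own device for approximating the supremum) illustrates this:

```python
# Sketch of the Berry-Esseen rate for standardised sums of IID
# Bernoulli(0.5) r.v.'s: sup |F_n - Phi| shrinks roughly like 1/sqrt(n).
import numpy as np
from scipy.stats import binom, norm

p = 0.5
for n in (10, 100, 1_000):
    k = np.arange(n + 1)
    y = (k - n * p) / np.sqrt(n * p * (1 - p))      # standardised jump points
    F_right = binom.cdf(k, n, p)                    # F_n at the jumps
    F_left = np.concatenate(([0.0], F_right[:-1]))  # left limits at the jumps
    phi = norm.cdf(y)
    sup_err = np.max(np.maximum(np.abs(F_right - phi), np.abs(F_left - phi)))
    print(f"n = {n:>5}: sup|F_n - Phi| = {sup_err:.4f}, 1/sqrt(n) = {n**-0.5:.4f}")
```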

For an independent but not identically distributed sequence of r.v.'s X₁, ..., X_n with E(X_i) = μ_i, Var(X_i) = σ_i² and E(|X_i − μ_i|³) = ν_{3i} < ∞, i = 1, 2, ..., the Berry–Esseen result takes the form

    sup_y |F_n(y) − Φ(y)| ≤ C [Σ_{i=1}^n ν_{3i}] / [Σ_{i=1}^n σ_i²]^{3/2}.   (10.58)

Although these results provide us with a simple and usually accurate absolute magnitude of the approximation error, they become trivially true (i.e. uninformative) when y is allowed to vary with n. For example, if y → −∞ as n → ∞, then F_n(y) and Φ(y) both tend to zero (see Phillips (1980)). In order to reflect the dependence of the upper bound on y as well, the Berry–Esseen result can be extended to

    |F_n(y) − Φ(y)| ≤ C (ν₃/σ³) / [√n (1 + |y|³)] for all y ∈ ℝ   (10.59)

(see Ibragimov and Linnik (1971)). An important disadvantage of these results is that they provide us with no information as to how we can improve the asymptotic approximations provided by the CLT, or how we can choose between asymptotically equivalent results. This line of reasoning leads us naturally to the idea of asymptotic expansions related to the approximation error.
The idea underlying some asymptotic expansions of the approximation error is that we can think of the asymptotic distribution as the first term in an expansion of the form

    F_n(y) = Φ(y) + Σ_{i=1}^r A_i(y)/n^{i/2} + R_{r,n}(y),   (10.60)

where the terms of the expansion are polynomials in y in powers of n^{−½}, and the remainder is R_{r,n}(y) = o(n^{−r/2}). This is the so-called Edgeworth expansion, widely used in econometrics. Let us consider how such asymptotic expansions can be derived in the present context.

Let f_n(x) be the density function of an appropriately normalised Borel function of the IID sequence of r.v.'s X₁, X₂, ..., X_n. f_n(x) is assumed to belong to a particular family of functions defined over the interval (−∞, ∞) and to satisfy the condition

    ∫_{−∞}^{∞} |f(x)|² dx < ∞,   (10.61)

known as square integrability. The space of all such functions is denoted by L²(−∞, ∞) and constitutes a Hilbert space (see Kreyszig (1978)). One of the most important properties of L²(−∞, ∞) is that any element in this space can be represented, or sufficiently accurately approximated, by the use of the orthogonal polynomials

    H_k(x) = (−1)^k [1/φ(x)] [d^k φ(x)/dx^k], k = 0, 1, 2, ...,   (10.62)

known as Hermite polynomials of degree k, where φ(x) = [1/√(2π)] exp(−½x²) is the density function of a standard normal r.v. This is a natural extension of the concept of an orthogonal basis spanning a linear space. The orthogonality of these polynomials comes in the form of

    ∫_{−∞}^{∞} H_k(x) H_m(x) φ(x) dx = k! if k = m, and = 0 otherwise.   (10.63)

The first five of these polynomials are

    H₀ = 1, H₁ = x, H₂ = x² − 1, H₃ = x³ − 3x, H₄ = x⁴ − 6x² + 3.   (10.64)
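These polynomials are easy to generate and check numerically via the three-term recurrence H_{k+1}(x) = xH_k(x) − kH_{k−1}(x), which follows from (10.62). The sketch below (NumPy assumed; the Monte Carlo check is an illustrative device of mine) verifies the orthogonality relation (10.63):

```python
# Sketch: generate Hermite polynomials by the recurrence
# H_{k+1}(x) = x H_k(x) - k H_{k-1}(x) and check E[H_k(Z) H_m(Z)] = k!
# for k = m (0 otherwise) with Z ~ N(0, 1), i.e. relation (10.63).
import math
import numpy as np

def hermite(k, x):
    h_prev, h = np.ones_like(x), x      # H_0 = 1, H_1 = x
    if k == 0:
        return h_prev
    for j in range(1, k):
        h_prev, h = h, x * h - j * h_prev
    return h

rng = np.random.default_rng(8)
z = rng.standard_normal(1_000_000)
for k, m in [(2, 2), (3, 3), (2, 3)]:
    mc = np.mean(hermite(k, z) * hermite(m, z))
    exact = math.factorial(k) if k == m else 0
    print(f"E[H_{k}(Z) H_{m}(Z)] ~ {mc:6.3f}   (exact: {exact})")
```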

The density function f_n(x) is assumed to be an element of L²(−∞, ∞) and thus can be represented in the form

    f_n(x) = φ(x) Σ_{k=0}^{∞} b_k H_k(x).   (10.65)

Although in principle we can approximate f_n(x) sufficiently accurately by choosing a finite value for the summation, in practice we prefer to approximate the ratio [f_n(x)/φ(x)], in view of the fact that it is usually smoother than f_n(x) and thus easier to approximate sufficiently accurately by a low degree polynomial. This ratio can be approximated by

    f_n(x)/φ(x) ≅ Σ_{k=0}^{r} b_k H_k(x),   (10.66)

where

    b_k = (1/k!) ∫_{−∞}^{∞} f_n(x) H_k(x) dx.   (10.67)

These coefficients are chosen so as to minimise the error

    ∫_{−∞}^{∞} [ f_n(x)/φ(x) − Σ_{k=0}^{r} b_k H_k(x) ]² φ(x) dx.   (10.68)

Hence the density function f_n(x) can be approximated by f*_n(x), where

    f*_n(x) = φ(x) Σ_{k=0}^{r} b_k H_k(x).   (10.69)

As we can see, the coefficients b_k, k = 0, 1, 2, ..., as defined above are directly related to the moments of f_n(x). For instance,

    b₀ = 1, b₁ = μ₁', b₂ = ½(μ₂' − 1), b₃ = (1/6)(μ₃' − 3μ₁'), etc.

This implies that if we know the moments of f_n(x) up to some order r we can use f*_n(x) to approximate it. In view of the relationship between the derivatives of φ(x) and the Hermite polynomials, f*_n(x) can be expressed in the form

    f*_n(x) = Σ_{k=0}^{r} b_k (−1)^k φ^{(k)}(x),   (10.70)

where

    φ^{(k)}(x) = d^k φ(x)/dx^k, k = 1, 2, ...;   (10.71)

that is, the approximation is a linear combination of φ(x) and its derivatives up to order r; see Cramer (1946). This form of the approximation is known as the Gram–Charlier series A approximation. In terms of the cumulative distribution function F_n(x) the approximation is

    F*_n(x) = ∫_{−∞}^{x} f*_n(u) du = Σ_{k=0}^{r} b_k (−1)^k Φ^{(k)}(x),   (10.72)

where Φ^{(k)}(x) denotes the kth derivative of the standard normal distribution function.

The question which is of primary interest about these approximations is not so much whether the above series converges to F_n(x) or f_n(x) as r → ∞, but whether for a small r < ∞ the above series will provide a good approximation to these functions or not. A particular way to proceed, which enables us to choose the order of the approximation error, is to break down the terms b_k H_k(x) into components and then reassemble them according to their order of magnitude, say in powers of n^{−½}. That is, express f*_n(x) in the form

    f*_n(x) = φ(x) [ 1 + A₁(x)/n^{½} + A₂(x)/n + ... + A_r(x)/n^{r/2} ],   (10.73)

where A_k(x) is a polynomial in x. In order to be able to choose the order of the approximation error (remainder) R_{r,n}(x), defined by R_{r,n}(x) = f_n(x) − f*_n(x), we need to ensure that the above series constitutes a proper asymptotic expansion. An asymptotic expansion is defined to be a series which has the property that, when truncated at some finite number r, the remainder has the same order of magnitude as the first neglected term. In the present case the remainder must be of order

    R_{r,n}(x) = O(n^{−(r+1)/2}).   (10.74)

This is ensured in the case of an expansion

    f(x) = Σ_{k=0}^{r} a_k φ_k(x) + R_r(x),   (10.75)

when the sequence {φ_k(x), k ≥ 0} is an asymptotic sequence as x → x₀, i.e.

    φ_{k+1}(x) = o(φ_k(x)),   (10.76)

because then R_r(x) = O(φ_{r+1}(x)). Under certain restrictions (see Feller (1970)) the expansion

    f_n(x) = φ(x) [ 1 + A₁(x)/n^{½} + A₂(x)/n + ... + A_m(x)/n^{m/2} ] + R_{m,n}(x)   (10.77)

constitutes a proper asymptotic expansion, with R_{m,n}(x) = o(n^{−m/2}). This expansion is known as the Edgeworth expansion and has been widely applied in the econometric literature (see Sargan (1976), Phillips (1977), Rothenberg (1984), inter alia).

In order to illustrate some of the concepts introduced above, let us consider the standardised Borel function

    Z_n = [1/(σ√n)] Σ_{i=1}^n (X_i − μ)   (10.78)

of the IID sequence X₁, ..., X_n with E(X_i) = μ, Var(X_i) = σ² and E[(X_i − μ)^k] = μ_k, k = 3, 4, .... The first problem we have to face, in order to be able to derive the coefficients c_k, k = 0, 1, 2, ..., is to evaluate the moments of Z_n, E(Z_n^r). Apart from the first two moments, E(Z_n) = 0 and E(Z_n²) = 1, the higher moments involve some messy manipulations (see Cramer (1946)). Using these moments, given by Cramer, we can evaluate the first few coefficients:

    c₀ = 1, c₁ = c₂ = 0, c₃ = κ₃/(3!√n), c₄ = κ₄/(4!n), c₅ = κ₅/(5!n^{3/2}), c₆ = (1/6!)[(κ₆/n²) + (10κ₃²/n)],
where κ_i, i = 3, 4, 5, 6, refer to the cumulants of the standardised r.v. (X_i − μ)/σ (see Chapter 4); in particular κ₃ = μ₃/σ³ and κ₄ = (μ₄/σ⁴) − 3 (see Cramer (1946)). The Gram–Charlier approximation of f_n(x) for r = 6 takes the form

    f*_n(x) = φ(x) [ 1 + (κ₃/(3!√n)) H₃(x) + (κ₄/(4!n)) H₄(x) + (κ₅/(5!n^{3/2})) H₅(x) + (1/6!)((κ₆/n²) + (10κ₃²/n)) H₆(x) ].   (10.79)

Collecting the terms with the same order of magnitude, we can construct the Edgeworth expansion

    f_n(x) = φ(x) − (1/√n)(κ₃/3!) φ^{(3)}(x) + (1/n)[ (κ₄/4!) φ^{(4)}(x) + (10κ₃²/6!) φ^{(6)}(x) ] + R₂,ₙ(x),   (10.80)

where R₂,ₙ(x) = O(n^{−3/2}). In terms of the Hermite polynomials and the central moments, this takes the form

    f_n(x) = φ(x) { 1 + (1/√n)(μ₃/(3!σ³)) H₃(x) + (1/n)[ ((μ₄/σ⁴ − 3)/4!) H₄(x) + (10μ₃²/(6!σ⁶)) H₆(x) ] } + R₂,ₙ(x).   (10.81)
It is important to note that the presence of cumulants in (79) and (80) is no coincidence. It is due to the fact that for the sum of standardised IID r.v.'s the rth cumulant of Z_n is of order

    κ_r(Z_n) = O(n^{1−(r/2)}).

From these expansions we can see that in the case where the distribution of the X_i's is symmetric (μ₃/σ³ = 0) the CLT approximation is of order 1/n, and when in addition the kurtosis α₄ = (μ₄/σ⁴) = 3 the approximation is even better, of order 1/n^{3/2}, and not of order 1/√n as the CLT suggests. In a sense we can interpret the CLT as providing us with the first term in the above Edgeworth expansion, and if we want to improve it we should include higher-order terms.

The Gram–Charlier and Edgeworth expansions for arbitrary Borel functions of X₁, ..., X_n can be derived similarly when the moments needed to determine the required coefficients are available (see Bhattacharya (1977)). When these moments are not easily available, the approximate moments considered in Section 10.3 can be used instead. It must be emphasised that Edgeworth expansions are not restricted to Borel functions h(X) with asymptotic normal distributions. Certain Borel functions of considerable interest in econometrics (see Chapter 16) have asymptotic chi-square distributions, for which Edgeworth approximations can be developed (see Rothenberg (1984)). The only difference from the derivation of the normal Edgeworth expansion is that the appropriate space is L²(0, ∞) and the corresponding orthogonal polynomials are the so-called Laguerre polynomials:

    L_k(x) = eˣ (d^k/dx^k)(x^k e^{−x}), k = 1, 2, ....   (10.82)

Error bounds and asymptotic expansions of the type discussed above are
of considerable interest in econometrics where we often have to resort to
asymptotic theory (see Part 1V).
Important
Convergence

concepts

almost

surely, convergence

in rth mean, convergence

in

probability, convergence in distribution. convergence


on S, convergence
uniformly on S, big Op, small op, order of magnitude, Mann-Wald
theorem, Taylor series expansions, sample raw moments, J-method, limits
approximate
moments,
of moments, asymptotic
moments, uniform
theorem,
integrability, Berry-Esseen
integrable
functions,
square
polynomial
approximation,
Hermite polyorthogonal polynomials,
series A, asymptotic
nomials. Gram-charlier
sequence, asymptotic
expansion, Edgeworth expansion.

Questions
Why do we need to bother with asymptotic theory and run the risk to
use inaccurate approximations?
Compare and contrast the two modes of convergence, convergence in
probability and almost sure convergence. Explain intuitively why
P

.s.

a
-+

-+
,

Compare the concepts of convergence in probability and convergence


D
p
=
in distribution and give an intuitive explanation to the result
-.+

-..

Explain the Cramer-Wald


device of proving convergence
in
distribution for a vector sequence tX,,, ?? y 1)
Explain the Op and op notation and compare it with the 0, o
notation of mathematical analysis.
Discuss the Mann-Wald theorem on how to derive stochastic order
of magnitude results from non-stochastic ones.
Explain the Taylor series expansion for a Borel function /l(X) of a
j.
sequence of r.v.'s X H (' j
.

'2,

-,,

10.6
8.

Error bounds and asymptotic

expansions

209

Explain how the results of the limit theorems for )'.j Xi can be
extended to the sample raw moments
)'-k A'r, rb 1. How can the
latter be extended to arbitrary continuous functions of them'?
Discuss the
for deriving the asymptotic distribution of f7(/?)
((/( ) continuous) from that of rnr.
Compare and contrast the following concepts:
?'th-order
(i)
moments',
(ii)
limits of rth-order moments;
asymptotic rth-order moments; and
(iii)
approximate moments.
(iv)
-method

'

Explain intuitively the concept of uniform integrability.


Discuss the Berry-Esseen theorem and its role in deriving upper
bounds for the approximation error in the CLT.
Explain the role of Edgeworth expansions in asymptotic theory.
Discuss the derivation of an orthogonal polynomial approximation
to an appropriately standardised Borel function of a sequence of r.v.'s
t-Y,,sn y 11) in the context of the space 1w24
%, ).
Explain the difference between Gram-charlier
and Edgeworth
-yt

expansions.
What order of approximation does the CLT provide in the case where
the skewness and kurtosis coefficients of )'=l Xi are zero and three
respectively'? Explain.
Discuss the question of how Edgeworth approximations can help us
discriminate between asymptotically equivalent Borel functions of a
sequence of r.v.'s X I1 n > l 1
.(f

Exercises
Determine the order of magnitude

sequences:
n3 + 6n2 + 2
-

6n3+

n rr::rc .1 2

1 .-

1.,1 1 2
=:::

Determine

(bigO and

small

0)

of the following

,.
?

the order of magnitude

of the following series:

to asymptotic

Introduction

thtxory

For the 11D sequence ).Y,,, n k: 1) where A-,, N(0, 1) we know p,,
()'- 1 Xl) z2(n).From the CLT we know that h,,(X) (p,,
'v

-n)/

'w

Ev/'(2n)) N(0,
'.w

(Z

order 1/n. Hint:


Additional

1). Derive the Edgeworth


a,,

(2x/2)/x/n,a4,,

approximation

of

(X) of

12/n.)

references

Bhattacharya and Rao (1976); Bishop


Wallace (1958).

et al.

(1975);

Billingsley

(1968);Rao (1984);

PART

lII

Statistical inference

C H A P T E R 11

The nature of statistical

11.1

inference

Introduction

ln the discussion of descriptive statistics in Part l it was argued that in order


to be able to go beyond the mere summarisation and description of the
it was important
observed data under consideration
to develop a
mathematical model purporting to provide a generalised description of the
data generating process (DGP). Motivated by the various results on
frequency curves, a probability model in the form of the parametric family
of density functions *=
0), 0 ( (4) and its various ramifications was
formulated in Part ll.providing such a mathematical model. Along with the
formulation of the probability model * various concepts and results were
discussed in order to enable us to extend and analyse the model, preparing
the way for statistical inference to be considered in the sequel. Before we go
on to consider that, however, it is important to understand the difference
between the descriptive study of data and statistical inference. As suggested
above, the concept of a density function in terms of which the probability
model is defined was motivated by the concept of a frequency curve. lt is
obvious that any density function
0) can be used as a frequency curve
by reinterpreting it as a non-stoehastic
function of the observed data. This
precludes any suggestions that the main difference between the descriptive
study of data and statistical inference proper lies with the use of density
functions in describing the observed data. 'What is the main difference
thent?'
In descriptive statistics the aim is to summarise and describe the data
under consideration and frequency curves provide us with a convenient
way to do that. The choice of a l-requency curve is entirely based on the data
in hand. On the other hand, in statistical inference a probability model (l) is

rtx,'

./'(x;

The nature

of statistical

inference

postulated a priori as a generalised description of the underlying DGP


giving rise to the observed data (not the observed data themselves). Indeed,
there is nothing stochastic about a set of members making up the data. The
stochastic element is introduced into the framework in the form of
uncertainty relating to the underlying DGP and the observed data are
viewed as one of the many possible realisations. ln descriptive statistics we
start with the observed data and seek a frequency curve which describes
these data as closely as possible. In statistical inference we postulate a
probability model * a priori, which purports to describe either the DGP
giving rise to the data or the population which the observed data came
from. These constitute fundamental departures from descriptive statistics
allowing us to make generalisations beyond the numbers in hand. This
being the case the analysis of observed data in statistical inference proper
will take a very different form as compared with descriptive statistics briefly
considered in Part 1. In order to see this 1et us return to the income data
and
discussed in Chapter 2. There we considered the summarisation
description of personal income data on 23 000 households using descriptors
like the mean, median, mode, variance, the histogram and the frequency
curve. These enabled us to get some idea about the distribution of incomes
among these households. The discussion ended with us speculating about
the possibility of finding an appropriate frequency curve which depends on
few parameters enabling us to describe the data and analyse them in a much
more convenient way. ln Section 4.3 we suggested that the parametric
family of density functions of the Pareto distribution

could provide a reasonable probability model for incomes over f4500. As


can be seen, there is only one unknown parameter 0 which once specified
flx,' 0) is completely determined. In the context of statistical inference we
postulate * a priori as a stochastic model not of the data in hand but of the
distribution of income of the population from which the observed data
constitute one realisation, i.e. the UK households. Clearly, there is nothing
cr/rrc in the context of descriptive
wrong with using .J(x;0) as a frequency
statistics by returning to the histogram of these data and after plotting
.J(x;p) for various values of 0, say 0= 1, 1.5, 2, choose the one which comes
closer to the frequency polygon. For the sake of the argument let us assume
that the curve chosen is 0= 1.5, i.e.

11.2

The sampling model

This provides us with a very convenient descriptor of these data as can be


easily seen when compared with the cumbersome histogram function

(see Chapter 2). But it is no more than a convenient descriptor of the data in
hand. For example, we cannot make any statements about the distribution
of personal income in the UK on the basis of the frequency curve .J*(x).ln
order to do that we need to consider the problem in the context of statistical
inference proper. By postulating * above as a probability model for the
distribution of income in the UK and interpreting the observed data as a
sample from the population under study we could go on to consider
questions about the unknown parameter 0 as well as further observations
from the probability model, see Section 11.4 below.
ln Section 11.2 the important concept of a sampling model is introduced
(p=
).(x; @,0 e:O),
as a way to link the probability model postulated. say
model
E!E
sampling
available.
The
x,,)'
obstrved
data
the
x (x1,
to
statistical
needed
ingredient
define
second
important
to
a
provides the
statistical inference.
model; the starting point of any
ln Section 11.3, armed with the concept of a statistical model, we go on to
discuss a particular approach to statistical inference, known as the
frequency approach. The frequency approach is briefly contrasted with
another important approach to statistical inference, the Bayesian.
A brief overview of statistical inference is considered in Section 11.4 as a
prelude to the discussion of the next three chapters. The most important
concept in statistical inference is the concept of a statistic which is discussed
in Section 11.5. This concept and its distribution provide the cornerstone
for estimation, testing and prediction.
.

iparametric'

11.2

The sampling model

As argued above, the probability model * ).J(x;0), 0 G 0) constitutes a


very important component of statistical inference. Another important
element in the same context is what we call a samplinq model, which
provides the link between the probability model and the observed data. lt is
designed to model the relationship between them and refers to the way the
observed data can be viewed in relation to *. In order to be able to
formulate sampling models we need to detine formally the concept of a
=

sample in statistical inference.

nature of statistical

Te

inference

D ehni tion 1

-4 sample is J#n?J
.

A',,)

.
jnction
,

to

??

(?./a st.ar random

variables

(-Y1 -Ya
(?-.p.-s)
(lensit
,

densit
(nctions tr'o!'nc!'Jpwith the
0o4 t?y postulated b)' the probabilitl' model.

'krl/tpst;r

./'(x,'

*rp-rft.a'

'

J'

Note that the term sample has a very precise meaning in this context and it
is not the meaning attributed in everyday language. In particular the term
does not refer to any observed data as the everyday use of the term might
suggest.
The significance

of the concept becomes apparent

when we learn that the

observed data in this context are considered to be one of the many


possible realisations of the sample. ln this interpretation lies the inductive
argument of statistical inference which enables us to extend the results
based on the observed data in hand to the underlying mechanism giving rise
to them. Hence the observed data in this context are no longerjust a set of
numbers we want to make some sense of, they represent a particular
outcome of an experiment; the experiment as defined by the sampling
model postulated to complement the probability model * f(./'(x; p),
0 g (4)
=

Given that a sample

distribution

which

Djlnition

is a set of r.v.'s related to * it must


call
the distribution of the sample.
we

have a

The distribution of the sample X iE!E(jtp/'!rdistl-ibution t?/' tbe ?-.r.'.$ .zY,


,

/-(x

-v,,

; 04 N

A',,)' is /t?.#??t?J
to lat?the
zY,,delloted b

1,

.'

/'(x ) p).

The distribution of the sample incorporates both forms of relevant


information. the probability as well as sample information. It must come as
p) plays a very important role in statistical
no surprise to learn that
04 depends crucially on the nature of the
inference. The form of
sampling model as well as *. The simplest but most widely used form of a
sampling model is the one based on the idea of a random experiment (f (see
Chapter 3) and is called a random sample.
./'(x;

./'(x;

De-hn rt'pn 3
zY,,)is called a random
.,4 set t?. random variables (.Yj A'a.
0) (/ the
A' j zY2,
X,, are independent (7/7J
sample jom
identicall), distributed (11D4. //7 this ctk-s'r tbe distribution of te
-

.f(x;

-.t'.'s

11.2

The sampling model

the
sample rt//t-e'-s

/rn'l

vqualit). Jlf? t() independence and f/? (.7s()('()n(l

tbe
//?f?l-.v.'s

.#?-sr

identically distributed.

are

rf?th (.2

l'ha t

.?c/

One of the important ingredients of a random experiment t.') is that the


experiment can be repeated under identical conditions. This enables us to
construct a random sample by repeating the experiment n times. Such a
procedure of constructing a random sample might suggest that this is
feasible only when experimentation is possible. Although there is some
truth in this presupposition, the concept of a random sample is also used in
cases where the experiment can be repetted tlnder identical conditions, if
ln order to see this let tIs consider the personal income
only conceptually.
example where (I) represents a Pareto family of density functions, KWhat is a
random sample in this case?' lf we can ensure that every household in the
UK has the same chance of being selected in one performance of a
conceptual experiment then we can interpret the ?? households selected as
X,,) and their incomes (the
representing a random sample (. 1 X1,
sample.
being
realisation
of
the
ln general we denote the
data)
observed
as
a
x,,l', where x
Xt,)' and its realisation by x EEE(x1
salnple by X EH ( l
k./',' usually
values
obsert'ation
yptkt't?
assumed
take
i.e.
in
the
is
to
xG
R''
, (?'
A less restrictive form of a sampling model is what we call an independent
sample- where the identically distributed condition in the random sample is
,

'

.4,'

relaxed.

#ni l???

..4set' q/' r.t'.-s (X 1


/'( i ; p) 1 1 2
.
1, lt1
l'nr/t?pta/?tt-i'n
f(l ?'l

-:'

2/,11..$

A',,) is said
n

l-espet-

Jo

l i ?.rt?/
.

'k'.

bv (7?7 independent sarnple


?'/' 11
t t? l.. ?...*s
1
.'.'.t''

case r/lt?distribution

(?j'

l'tl'n

.7

.,

lt

(1p-t?

sanlple rtk/t?.'.k
the
r/'7t.?

( 11.4)
belong to the same
the density functions I-l.i ; %). i 1
family but their numerical characteristics (moments, etc.) may differ.
If we relax the independence assumption as well we have what we can call
non-random
sample.
a
U sually

,2

,??

The nature

of statistical

.1) tlili J'1,-.l

-''

inference

said tp be a non-random sample from


s .Y,,)is
( I
r.1?.'s
('/'
t'be
.-Yt,are non-llD. ln this case the
X
1
x,,,' 0)
J(x1
t?/'
the
distribution t#' the sample possible is
onlv lcl'/rn/?t/./lf?r/
tt/' ?'.?..-y

4 se:

( 1 1.5)
given xo wllprtl .J'(xj
conditional distribution of Xi

'x:

,x

,2,

i 1
f/illt??'l Xl, X2,
-

; 0i4'

,n,

represent

zTi
-

the

1.

A non-random sample is clearly the most general of the sampling models


considered above and includes the independent and random samples as
special cases given that

Xn are independent r.v.'s. Its generality, however, renders the


whcn A-l,
unless certain restrictions
non-operational
are imposed on the
concept
restrictions
have been
heterogeneity and dependence among the A-js. Such
often
restrictions
extensively discussed in Sections 8.2-3. ln Part IV the
used are stationarity and asymptotic independence.
ln the context of statistical inference we need to postulate both a
probability as well as a sampling model and thus we define a statistical
model as comprising both.
.

Dhnition

,4 statistical model is dejlned as comprising


(p
(i)
a probability model
)./'(x;p), 0 e: (6)),.and
>
model
Ar,,)'.
samplinq
X (-1 Xz,
(f)
a
The concept of a statistical model provides the starting point of all forms of
statistical inference to be considered n the sequel. To be more precise, the
concept of a statistical model forms the basis of what is known as
parametric inference. There is also a branch of statistical inference known
where no * is assumed a priori (see Gibbons
as non-parametric inference
( 1971)). Non-parametric statistical inference is beyond the scope of this
book.
lt must be emphasised at the outset that the two important components
of a statistical model, the probability and sampling models, are clearly
interrelated. For example, we cannot postulate the probability model *=
=

11.3

The frequency approach

This is because if the r.v.'s


the probability model must be defined in
Jt-ftxl,
x,,; p), 0e O). Moreover,
terms of theirjoint distribution, i.e. (I)
sample we need
independent
distributed
the
of
identically
not
but
in
case
an
sample, i.e.*=
functions
each
individual
specify
in
the
for
density
the
nv.
to
n).
of this
The
important
implication
1,
0
6
2,
.fttxk; 0),
most
0, k
when
postulated
sampling
model
is
found
the
relationship is that
to be
respecified
model
probability
be
it
has
the
inappropriate
as
to
means that
well. Several examples of this are encountered in Chapters 2 1 to 23.

f J(x; 0), 0 G (4) if the sample X is non-random.

A'I,

-Y,,are not independent

The frmuency approach

ln developing the concept of a probability model in Part 11it was argued


that no interpretation of probability was needed. The whole structure was
built upon the axiomatic approach which defined probability as a set
function #( ): .F
E0,1q satisfying various axioms and devoid of any
interpretations (see Section 3.2). In statistical inference, however, the
interpretation of the notion of probability is indispensable. The discerning
reader would have noted that in the above introductory discussion we have
already adopted a particular attitude towards the meaning of probability.
ln interpreting the observed data as one of many possible realisations of the
DGP as represented by the probability model we have committed ourselves
towards the frequency interpretation of probability. This is because we
implicitly assumed that if we were to repeat the experiment under identical
conditions indefinitely (i.e. with the number of observations going to
infinity) we would be able to reconstruct the probability model *. In the
case of the income example discussed above, this amounts to assuming that
if we were to observe everybody's income and plot the relative frequency
curve for incomes over f4500 we would get a Pareto density function. This
suggests that the frequency approach to statistical inference can be viewed
as a natural extension of the descriptive study of data with the introduction
of the concept of a probability model. ln practice we never have an infinity
of observations in order to recover the probability model completely and
hence caution should be exercised in interpreting the results of the
frequency-approach-based statistical methods which we consider in the
sequel. These results depend crucially on the probability model which we
interpret as referring to a situation where we keep on repeating the
experiment to infinity. This suggests that the results should be interpreted
the long run' or
as holding under the same circumstances, i.e.
average'. Adopting such an interpretation implies that we should propose
results' according to
statistical procedures which give rise to
'

-+

'in

koptimum

ton

The nature

of statistical

Probability

4)

inference

model
G (i)j

1 (x ; $ ) 0

Distribution

/(x

1,

xz,

of the sample
xn ; 0 )
.

Sampling model
(.X1,Xa
Xn )

EE

Observed data

EE

(x,, xz,

xp)

The frequentist approach

Fig.

to statistical

inference.

criteria related to this


intemretation.
Hence, it is important to
keep this in mind when reading the following chapters on criteria for
optimal estimators, tests and perdictors.
The various approaches to statistical inference based on alternative
intemretations of the notion of probability differ mainly in relation to what
constittites rele,ant information for statistical inference anj lltpw,it should be
tp///prtptzc/k
processed. ln the case of the h-equency
(sometimes called the
classical approach) the relevant information comes in the form of a
probability model (1) fytx,' @,0e 0. ) and a sampling model X N (-Y1 X1,
Xtt)', providing the link between * and the observed data x E/ (x1,x2,
.
x,,)'. The observed data are in effect interpreted
as a realisation of the
.
sampling model, i.e. X x. This relevant information is then processed via
the distribution of the sample txj, xa,
x,,; 0j (see Fig. 11.1).
The subjective' interpretation of probability, on the other hand, leads to
a different approach to statistical inference. This is commonly known as the
Bayesian Jpproflc
because the discussion is based on revising prior beliefs
about the unknown parameters 0 in the light of the observed data using
Bayes' formula. The prior information about 0 comes in the form of a
probability distribution .J(p);that is, 0 is assumed to be a random variable.
The revision to the prior .J(#)comes in the form of the posterior distribution
Slong-run'

.J'(px)

via Bayes' formula:


.,flx

./(p/x)
.
=

0f (t?)

':x

/(x)

..f(x/p)./

(p),

/'(x 0) being the distribution of the sample and

./'tx)

being constant

for

11.4

An overview of statistical

inference

X x. For more details and an excellent discussion of the frequency and


Bayesian approaches to statistical inference see Barnett ( 1973). ln what
follows we concentrate exclusively on the frequency approach.
=

11.4

An o&'erview of statistical

inference

As defined above the simplest form of a statistical model comprises:


(i)
a probability model *= tf/'(x;04, 0 (E 0).., and
Ar,,)' - a random sample.
(ii)
a sampling model X (A-1
Using this simple statistical model, let us attempt a brief overview of
statistical inference before we consider the various topics individually in
order to keep the discussion which follows in perspective. The statistical
model in conjunction with the observed data enable us to consider the
following questions:
Are the observed data consistent with the postulatcd statistical
( 1)
=

model?

-2,

misspecljlcation)

Assuming that the statistical model postulated is consistent with


the observed data, what can we infer about the unknown
parameters 0 e:0. ?
Can we decrease the uncertainty about 0 by reducing the
(a)
parameter space from (.) to O() where (6):is a subset of (9:/
(conjldence estimation)
about 0 by choosing a
Can we decrease the uncertainty
(b)
particular value in (6), say 1, as providing the most
representative value of 0 point estimationj
Can we consider the question that 0 belongs to some
subset (i)() of 6)? qhypotbesistestin
Assuming that a particular representative value l of 0 has been
chosen what can we infer about further observations from the
DGP as described by the postulated statistical model? prediction)
The above questions describe the main areas of statistical inference.
Comparing these questions with the ones we could ask in the context of
descriptive statistics we can easily appreciate the role of probability theory
in statistical inference.
The second question posed above (thelirst question is considered in the
appendix below) assumes that the statistical model postulated is
and
considers various forms of inference relating to the unknown parameters 0.
ivalid'

Point psrfmt/rt)n
(t?- just estimatio: refers to our attempt to give a
numelical value to 0. This entails constlucting a mapping h(') : 1.-.+* (see
Fig. 11.2). We call function h(X) an estimator of 0 and its value h(x) an

The nature

of statistical

inference

Fig. 11.2. Point estimation.

g( #
-

Oo

Fig. 11.3. lnterval

estimation.

estimate of 0. Chapters 12 and 13 on point estimation deal with the issues of


estimators, respectively.
defining and constructing
'optimal'

Consdence estimation: refers to the construction of a numerical region for 0,


in the form of a subset (1)0of 0. (s Fig. 11.3). Again, contidence estimation
gt ):
0.
comes in the form of a multivalued function (one-to-many)
.%.

'

-.+

Hypothesis testinq, on the other hand, relates to some a priori statement


about 0 of the form Ho : o). e, (.la cu (6), against some opposite statement
Hko@4v or, equivalently, #GO1 0. 1 r-ht.l4l Z and Oj kp(.)()
0. In a
situation like this we need to devise a rule which tells us when to accept Ho
tinvalid'
HL
in view of the observed data. Using the
or reject
as
as
postulated partition of (.) into (6)0 and 01 we need, in some sense, to
construct a mapping t?(') :.4---+(1) whose inverse image induces the partition
=

'valid'

q - 14O0) Cll - acceptance


140.
1 ) C1
q- rejection

region,

where Ca t..p Cj

-??''

region,

(seeFig. 11.4).

11.5 Statistics

and their distributions

223

'

Co

(%

f'l

Fig. 11.4. Hypothesis

testing.

The decision to accept Stj as a valid hypothesis about 0 or reject


Ho as an invalid hypothesis about #, in view of the observed data, will be
based on whether the observed data X belongs to the acceptance or
rejection regions respectively, i.e. X G Crll or X (E C71(seeChapter 14).
Hypothesis testing can also be used to consider the question of the
appropriateness of the probability model postulated. Apart from the direct
test based on the empirical cumulative distribution function (seeAppendix
theorems. For
11.1) we can use indirect tests based on characterisation
example, if a particular parametric family is characterised by the form of its
first three moments, then we can construct a test based on these. For several
characterisation results related to the normal distribution see Mathai and
Pederzoli (1975. Similarly, hypothesis testing can be used to assess the
appropriateness of the sampling model as well (see Chapter 22).
As far as question 3 is concerned we need to construct a mapping 1( ):
which will provide us with further values of X not belonging to the
O
sample X, for a given value of 0.
'

...,t'

-+

11.5

Statistics and their distributions

As can be seen from the bird's-eye view of statistical inference considered in


the previous section. the problem is essentially one of constructing some
mappinq of the form:

q( ):
'

.?'
-+

or its inverse, which satislies certain criteria (restrictions)depending on the


nature of the problem. Because of their importance in what follows such
mappings will be given a very special name, we call them (sample)
statistics'.

The nature

of statistical

Dellni

T't??

A statistic is

to /?t?any #0/-t?/ jnction (.s?t?Chapter 6)

y('?jt/

q( ) : 4

--+

Estimators,

'

'

Note that t/(

inference

) does not depend on any unknown parameters.

'

confidence intervals, rejection

regions and predictors

are al1

statistics which are directly related to the distribution of the sample.


-statistics' are themselves random variables (r,v.'s)being Borel functions of
r.v.'s or random vectors and they have their own distributions. The
discussion of criteria for optimum
is largely in terms of their
distribtltions.
bstatistcs'

Two important examples


of statistics
what
occasions
in
follows
are:
numerous

Y,,

called the samplv

-j,'

sz

s.

. ..

ueun,

j
=

on

11

)12

which we will encounter

11

-.-

n- 1i

y (x

-.

vkj,

i -

2 ,.

called the

sflnlplf;z

ptlrftlnc't?.

(1l

l0)

On the other hand, the functions


/ 1(

-7!k'

pi

&

.)k'

...--..j:f

cccn ....
!.

(!17-)
(
.

and
1

pl

y
n .- 1 i 1

/a(.,y)

--

(xj

-/t)2

-,.

unless c2 and ;i are known. respectively.


The concept of a statistic can be generalised to a vector-valued
of the form

are not statistics

q(') : .i'

--+

(.)

R'l,

function

( 11.13)

As with any random variable, any discussion relating to the nature of


(?(X)must be in terms of its distribution. Hence- it must come as no surprise
to Iearn that statistical inference to a considerable extent depends critically
on our ability to determine the distribution of a statistic t?(X) from that of
X T.E.Ee
(-Yl
A-,,)',and determining such distributions is one of the most
difficult problems in probability theory as Chapter 6 clearly exemplified. ln
that chapter we discussed various ways to derive the distribution function
of F ?(X)
-2,

F( )')

.P#t?(X) : )')

( 11.14)

and their distributions

11.5 Statistics

when the distribution of X is known and several results have been derived.
The reader can now appreciate the reason he she had to put up with some
rather involved examples. All the results derived in that chapter will form
the backbone of the discussion that follows. The discerning reader must
have noted that most of these results are related to simple functions t?(X)
Ar,,. lt turns out that most of
of normally distributed r.v.'s, A'l, Xz,
the results in this area are related to this simple case. Because of the
importance of the normal distribution, however, these results can take us a
inference avenue'. Let us restate some of these
long way down
results in terms of the statistics Y,, and s'l, for reference purposes.
.

-statistical

Example 1

Consider the following statistical model:

.Jtx;0)

(I)

X > (A' j Xz,


,

'

(2zr)

c
,

1 x- u
exp 2

0 EEE(p, c2)

A-,,)' is a random sample from

,(x;

R x R+ ;

#).

For the statistics

x=

1
N

11

y xj

11

=1

and

(.; J.t)
.x'J.-..'7.-,
x (() jj.
?

( 11.16)

( 11.18)

(iv)
Note that
n
jl
i

Arl
.

''$
-

%''

u
r'z

X -y
l'

.$2

+ (n - j ) Jj

(n )

( 11.19)

The nature of statistical

226

inference

and

Covf
x.

y,2)

.;f'

?1

0.

n( Xpj p )
-

.j

.r(u

(s2/c2)
,'
yn
('rpl/z2)
x

(11.20)

Sn

1, m

( 11.2 1)

1),

wherez,l is the corresponding

sample variance of a random sample (Z1, Zc,


z2)
s,l,
zm2are independent.
and
from Npz,
. z.)
results
follow from Lemmas 6.1-6.4 of Section 6.3 where the
Al1these
chi-square,
normal,
Student's t and F distributions are related.
Using the distribution of g(X), when known, as in the above cases, we can
consider questions relating to the nature of this statistic such as whether it
tbad'
estimator or test
provides a
(to be defined in the sequel) or a
statistic. Once this is decided we can go on to make probabilistic statements
about 0, the
parameter of *, which is what statistical inference is
largely about. The question which naturally arises at this point is: SWhat
happens if we cannot determine the distribution of the statistic t?(X)?'
Obviously, without a distribution for q(X) no statistical inference is possible
and thus it is imperative to solve' the problem of the distribution somehow. In such cases asymptotic theory developed in Chapters 9 and 10
best' solutions in the form of
comes to our rescue by offering us
of
t?(X).
In Chapter 6 we discussed
distribution
the
approximations to
various results related to the asymptotic behaviour of the statistic T,, such
.

Ggood'

Strue'

isecond

as:
a S.
.

--

X 11

T,

(iii)

p,

->

-+

and

p;

(Y,,-p)

w'

z x(0,1);

-,

( 11.22)

,x.

irrespective of the original distribution of the Xis. Given only that


c2 <
Vartfj)
ln Chapter 10 these results were
x; note that '(Y,,)
general
extended to more
functions (X). ln particular to continuous
functions of the sample rJw moments
'(-f)

=p,

=p.

mr

ni

19

Z aYr
,

(11.23)

11.5 Statistics

and their distributions

227

ln relation to mr it was shown that in the case of a random sample

a.$.

lllv

-+

r?lr

--.

/.

pv,
P

('ii)

;i';r

ftrl (1

( 11.24)

(iii)

with

'tr?1rl

p;,

avl y.
=

(p;)2
,

(11.25)
assuming that pr < :x,
It turns out that in practice the statistics :(X) of interest are often
functions of these sample moments. Examples of such continuous functions
of the sample raw moments are the sample central moments being dened by
.

(11.26)

r > 1.

These provide us with a direct extension of the sample variance and they
represent the sample equivalents to the central moments
X

pr

(.X

(11.27)

d.X.
-FI'.'JIXI

the help of asymptotic theory we could generalise the above


asymptotic results related to n1,, rb 1, to those of F;, <(X) where q ) is a
Borel function. For example, we could show that under the same
conditions
With

(i)

'

a S.
.

br --. Jtr;
P

#r

#r

-#

V'nt,

,'

-pr)

Gr+

where

e'*l
r

/42,+ 2 '-ph

assuming that p2r <

'L<L:
;

.x((),

-2(r

(11.28)

1),.

+ 1)/trpr+ a +

see exercise 1.

(r+

Lllyvlyz,

( 1 1.29)

228

inference

The nature of statistical

Asymptotic results related to L q(X) can be used when the distribution of


Y;,is not available (or very difticult to use). Although there are many ways to
obtain asymptotic results in particularcases it is often natural to proceed by
following the pattern suggested by the limit theorems in Chapter 9:
=

Step 1
Under certain conditions F;, :(X) can be shown to converge in probability
to some function of (p) of 0, i.e.
=

(f?), or

Y;,

-+

Construct two sequences


)z; j (p)

Let F.(y*)

11

,,

?1

,. /1

(?,

h(04.

Y;:

( l 1.30)

c,,(@, n k: 1) such
(11,,4p),
o

y- +

.s.

-+

...+

denote the asymptotic

that
( 11 3 1)

N(0 1).
,

distribution of F,*, then


,

for

rfyc n

F,,(.F) :> Fw,(.1'*),

and F.(y*) can be used as the basis of any inference relatipg to F,, t?(X). A
question which naturally comes to mind is how large n should be to justify
the use of these results. Commonly no answer is available because the
answer would involve the derivation of F,,(y) whose unavailability was the
very reason we had to resort to asymptotic theory. ln certain cases higherorder apprbximations based on asymptotic expansions can throw some
light on this question (see Chapter 10). In general, caution should be
exercised when asymptotic results are used for relatively small values of n,
say n < 100?
=

Appendix 11.1 - The empirical distribution function


The first question posed in Section 11.4 relates to the validity of the
probability and sampling models postulated. One way to consider the
validity of the probability model postulated is via the empirical distribution
/nclt?n F)(x) defined by
F,*,(x)

1
-

t1

(number of

xjs G x),

Alternatively, if we define the random variable (r.v.) Zi to be

Appendix 11.1

229

1 if xi (E ( vt0 otherwise,
-

Zj

xq

)'-

distribution postulated in * is
thell F,*,(x) ( 1/n)
1 Zi. If the original
F(x), a reasonable thing to do is to compare it with F,*,(x). For example,
=

consider the distance

o,,

=max

F(x)l,
l#,*,(x)
-

is the
D,, as defined is a mapping of the form D,,( ): -+ g0,1(1where
observation space. Given that Zi has a Bernoulli distribution F,*,(x) being
Z,, is binomially distributed, i.e.
the sum of Zj Zz,
.9-

,?'

'

Pr F,t(x)

gF(x) k(1
E1

F(x)

,,

)I, k

0, 1,

n,

F(x) and Var(F,t(x))


where f'(Fl(x))
( 1/n)F(x)(1 F(x)). Using the
central limit theorem (see Section 9.3) we can show that
=

-F(x))..
x,'',?(/7(x)
tFtxlg 1

z x((),1).
,w.

F(x)q )

Using this asymptotic

result

it can be shown that

v'n

.D,,--+

.p

where

J?

F(y)

1- 2

k-1

1
( 1/ -- expt - 22
-

p2)

This asymptotic distribution of x//nl)n can be used to test the validity


section21.2.

of *;

see

Important

concepts

Sample, the distribution of the sample, sampling model, random sample,


independent sample, non-random sample, observation space, statistical
model, empirical distribution function, point estimation, confidence
estimation, hypothesis testing, a statistic, sample mean, sample variance,
sample raw moments, sample central moments, the distribution of a
statistic, the asymptotic distribution of a statistic.

Questions
Discuss the difference between descriptive statistics and statistical

inferenee.

Te nature of statistical inference

230

4.

5.
6.

Contrast .Jtx;0) as a descriptor of observed data with ftx; p) as a


member of a parametric family of density functions.
Explain the concept of a sampling model and discuss its relationship
to the probability model and the observed data.
Compare the sampling models:
random sample;
(i)
(ii)
independent sample;
non-random sample;
(iii)
and explain the form of the distribution of the sample in each case.
Explain the concept of the empirical distribution function.
tEstimation and hypothesis testing is largely a matter of constructing
mappings of the form g ): I
(.).9Discuss.
Explain why a statistic is a random variable.
1) (seeAppendix
Ensure that you understand the results (11.15V(11.2
6.1).
Being able to derive the distribution of statistics of interest is largely
what statistical inference is alI about.' Discuss.
Discuss the concept of a statistical model.
'

9.
10.

--+

Exercises

1.* Using the results (22)-(29)show that for a random sample X from a
distribution whose first four moments exist,
nxn

-/z)

ntsaz -

XN

0
0

c2

ps

y5

y4

-cr*

Additional references
Barnett

(1973);

Bickel and Doksum

(1977);Cramer ( 1946);

Dudewicz

(1976).

CHAPTER

12

Estimation

l - properties of estimators

Estimation in what follows refers to point estimation unless indicated


otherwise. Let (S, P( ))be the probability space of reference with X a r.v.
defined on this space. The following statistical model is postulated:
'%

(.J(x;P), 0 6 O), O L

(i)

(ii)

X H(Ar1, Xz,

R;

X,,)' is a random

sample from

.J(x;0).

in the context of this statistical model takes the fol'm of


constructing a mapping 4.)..I
0- where 3- is the observation space and
Borel
composite
h ) is a
function. The
function (a statistic) $- h(X): S O
called
and
value
is
its
an estimator
(x),x c I an estimate. It is important to
distinguish between the two because the former is a random variable (r.v.)
and the latter is a real number.
Estimation

-+

--+

'

Example J

p)2), ps n, and x be a random sample


Let ylx;p)- El/x/tznnexpt
R'' and the following functions define estimators
from .J(x;p). Then
of p:
.--ifx

.133-,,z:

1(

(i)
(ii)

/1

Ni

p-2
P-3

-Yf,'

--

ki

)''1 X i'

..)'

''''''-

231

1 2,

fr

12
,

'

n>

>

1;

Properties

p-z

p-5

(X 1 + X );
11

1g

?1

N i

jj

2
i ',

zr

,1

p-6
$

1
-

of estimators

K f

iA-i
=

1g

'

'

)
n+ 1i j

Xf

lt is obvious that we can construct intinitely many such estimators.


estimators is not so obvious. From the above
However, constructing
examples it is clear that we need some criteria to choose between these
estimators. ln other words, we need to fonnalise what we mean by a
estimator. Moreover, it will be of considerable help if we could devise
such good estimators; a question
general methods of constructing
considered in the next chapter.
kgood'

igood'

Finite sample properties

12.1

ln order to be able to set up criteria for choosing between estimators we


need to understand the role of an estimator first. An estimator is
representative
constructed with the sole aim of providing us with the
value of 0 in the parameter space 0- based on the available information in
the form of the statistical model. Given that the estimator V= (X) is a r.v.
(being a Borel function of a random vector X) any formalisation of what we
representative value' must be in terms of the distribution of
mean by a
(? say ftb. This is because any statement about
near f/is to the true p'
probabilistic
only
be
one.
a
can
of 0 to satisfy is that
estimator
The obvious property to require a
Jt#l is centred around p.
tmost

>

'most

'how

'good'

Dejlnition 1

# 0-f 0 is said

.,1,7estimator
t.'.;jtyyd

&p be an tmbiased

estimator

of 0

#-

E'lJ)

() dp.

........

.
.

That is, the distribution


parameter to be estimated.

t#'

( has mean equal to the unknown

Finite sample properties

12.1

but equivalent, way to define fit( is

Note that an alternative,

( 12.2)
(iistribution
0) =.J(x1 xa,
where
x,,; ?) is the
(?Jtbe sample, X.
without having to derive either of the
Sometimes we can derive
abovedistributions by just using the properties of E ) (see Chapter 4).
For example, in the case of the estimators suggested in Section 12.1, using
and the properties of the normal distribution we can deduce
independence
N(p,
l
(1/n), this is because ('Ikis a linear function of normally
that
distributedr.v.'s (see Chapter 6.3), and
.(x;

'(#)

'

'v

J;(;1)

E N

11

)( xi
=

11

-/

)(

1;(A',.)

N i

N (?

11

p=--

=
''-=

(see Fig. 12. 1). The second equality due to independence and the property
.E'(c) c if c is a constant and the third equality because of the identically
distributed assumption. Similarly for the variance of #1
=

jj

flt)1 P)2
-

11

-j.
K

(A-g- ?)2

11

.-j
f

Eqxt. - 0jl

Ka i

11

1
=

(12.4)

::=:

1 2,

05 N(p, 1), 04. N


'v

'v

1-1.-

27 2
,

nz

-? nz 2 (n
5x

np

)
,

Properties

of estimators

n + 1 0, (rl+ 1)(2n+ 1)
6n
2

-v.

',y

'v

n 0,

n+ 1

(n+ 1)c

Hence, the estimators #1,#2,#aare indeed unbiased but #4, 4) and #, are
h,
biased. We define bias to be B(0) f(#) p and thus ,46)
E(2 n)/n?0,
Rt/5 ) n2( 1 + p2)- p, #())) g(n 1j/2qp, #($j=
0(qn+ 1). As can be seen
from the above discussion, it is often possible to derive the mean of an
estimator # without having to derive its distribution. It must be
remembered, however, that unbiasedness is a property based on the
distribution of #.This distribution is often called sampling distribution of 1
in order to distinguish it from any other distribution of functions of r.v,'s.
Although unbiasedness seems at first sight to be a highly desirable
property it turns out to be a rather severe restriction in some cases and in
most situations there are too many unbiascd estimators for this property to
be used as the sole criterion for judging estimators. The question which
naturally arises is,
can we choose among unbiased estimators?'.
Returning to the above example, we can see that the unbiased estimators
#2,05 have the same mean but they do not have the same variances.
j
Given that the variance is a measure of dispersion, intuition suggests that
the estimator with the smallest variance is in a sense better because its
around p. This argument leads to the
distribution is more
second property, that of relative efficiency.
=

ihow

iconcentrated'

Dehnition 2

zl.rlunbiased estimator #1of 0 is said to be relatiYely


than some other unbiased estimator
'a

more efficient

vartj) < vartpc) or

and

1
-

n
1

vart'/al,z=<

is relatively more efficient than either

< -

k
1

I/) vartpa)< 1.

'1

ln the above definition


since
Vart'j

vartl )

=-----%-

efftlj

Var(2),

1, 2,

h or h

1,

varta),

i.e. h is relatively more efficient than t')a(see Fig. 12.2).


In the case of biased estimators relative efficiency
can be defined in terms
of the mean square error (MSE) which takes the form
E1-

0)1

=var()+

E#()q2,

(12.6)

12.1

Finite sample properties

/'(j)

f (z)
)
/ (J:.,
'!

that is, an estimator


Or

'(* -

$2 <

MSE(*)

<

#* ij
E(

'i

relatively more efficient than

if

p)2

MSE(t.

As can be seen, this definition includes the definition in the case of unbiased
estimators as a special case. Moreover, the definition in terms of the MSE
enables us to compare an unbiased with a biased estimator in terms of
efficiency. For example,
MSE($)

<

MSE(l),

estimator than (35despite the fact


and intuition suggests that (Lis a
Fig.
12.3 illustrates the case. ln
unbiased
and
0y
is
not;
that #a is
circumstanceslike this it seems a bit unreasonable to insist on
unbiasedness.Caution should be exercised, however, when different
distributionsare involved as in the case of ).
'better'

Let us consider the concept of MSE in some detail. As defined above, the
MSE of depends not only on but the value of ()in O chosen as well. That
is, for some pa e:O

MSEII,t?o)=E(4- %)2
=

figt-Ft')l

= Vart)

+ E)

-p())j2

g#(, p())12,

(12.7)
pe) zittil 0v is the bias of #

thecross-product term being zero. B$,


to the value po.Using the MSE as a criterion foroptimal estimators
relative
would like to have
an estimator $ with the smallest MSE for a1l ?in O
we
=

Properties

of estimators

f'(J,)

(Ja)
'C

r'

0
Fig. 12.3. Comparing

the sampling distribution of

h and h.

(uniformly in O). That is,


(?) MSEI#,p)
MSEIV.
:

for all p e:(6),

where Fdenotesany other estimator of p. For any two estimators #and o


0 if MSEI, ?) MSEIC
p), () (E 0- with strict inequality lplding for some
0 G 0, is said to be inadmissible. For example, l above is inadmissible
because
:$

MSEIVI
,

04<

MSEIVa,?), for

a1l 0 e O if n > 1.

ln view of this we can see that 4 and s are inadmissible because in MSE
terms are dominated by ,. The question which naturally arises is: can we
find an estimator which dominates every other in MSE terms'?' A moment's
reflection suggests that this is impossible because the MSE criterion
depends on the value of ? chosen. ln order to see this 1et us choose a
particular value of p in 0, say 0v, ani define the estimator
0* 00
=

for al1 x e:

,?,'

Then MSE(?*, 0v)


and any uniformly best estimator would have to
MSE(t?*,
?)
0 for a1l 0 g (9, since 0o was arbitrarily chosen. That is,
satisfy
whatever
value! Who needs criteria in such a
perfectly
its
estimate 0
case? Hence, the fact that there are no uniformly best estimators in MSE
terms is due to the nature of the problem itself.
Using the concept of relative efficiency we can compare different
=0

'true'

12.1

Finite sample properties

estimators we happen to consider. This, however, is not very satisfactory


since there might be much better estimators in terms of MSE for which we
know nothing about. ln order to be able to avoid choosing the better of two
Such a
inefficient estimators we need some absolute lntzlsu/'t! t?/' tllciency.
which
provided
takes
form
Cramer-Rao
the
is
by
!(?u':7b
ound
the
measure

(7R(?)

---

d#(p)

1+

Iog ytx; ?)
t?p

.J(x;p) is the distribution

of

sample

the

and #(t?) the bias. lt can be shown

0* of p

that for any estimator

MSE(p*, 04y CR(@

under the following regularity conditions on m:


The
?)> 0 ) does ntpr depend t?n 0.
)x :
For each 0 g (i) the derivatives I)t?log (x, ?)(I ), i
fo r a 11x g
(CR3) 0 < f'Rglt)0)lof /'(x; ?)()2 < r.y. jr aII p (Eq(.),

(CR1)
(CRJ)

./'(x;

.,4

-t?r

,'q(()

./

1. 2, 3, exist

.k.'

ln the case of unbiased

the inequality takes the form

estimators

lO g (x ; P)
Pp
,/'

Var(p*) )v E

- 1

the inverse of this lower bound is called Fisher's


by 1,,(P).
Dejlnition

l/rrrlfkrfon and is denoted

.3

zln unbiased

t?/'

estimator

Vart#l

1Og
'

0 is said to
2

J'(X,' P)

<

(fully)efficient #'

?t7

1,,(p)-

wpn
/--/itrfpnr

That is, an unbiased estimator is


Cramer-Rao drpwprbound.

its

1
.

tariance

equals tbe

ln the example considered above the distribution of the sample is


./'(xi;

p1

11

.f(x;p) il-1

p)

-,'/.2

= (2a)

exp t
17 -.--),.'-ja)
(

exp

--

1k

--l-(x

v
11

a fj
=

(xj- ?)2

'

p)2)

of estimators

Properties

and

d loy. .J(x;0)
d0

An alternative
equality

ojtd

F (x2

=n

way to derive the Cramer-Rao


log

..sjd...2

.J(x; )42(j
dp

by independence.

-p)

'--'j

lower bound is to use the

log tx;0)
dp2

(j
,

which holds true under CR 1-CR3 and


density function.
p) is the
and hence the equality
ln the above example Ed21og .Jtx;(?)j/(dp=)
holds.
So, for this example, CR(@ 1/n and, as can bc seen from above, the only
estimator which achieves the bound is #! that is, Var(#1) CR(p); hence #1
is a fully efficient estimator. The properties of unbiasedness, relative
efficiency and full efficiency enabled us to reduce the number of the
originally suggested estimators considerably. Moreover, by narrowing the
class of estimators considered, to the class of unbiased estimators, we
uniformly best estimator',
succeeded in
the problem of
discussed above in relation to the MSE criterion, This, however, is not very
sumrising given that by assuming unbiasedness
we exclude the bias term
which is largely responsible for the problem.
Sometimes the class of estimators considered is narrowed even furtherby
requiring unbiased as well as linear estimators. That is, estimators which are
linear functions of the r.&'.'s of the sample. For example, in the case of
Within the
example 1 above,
h, l, 6, h and ;c are linear estimators.
411
of
linear
unbiased
estimators
and
show
that
class
has minimum
we can
variance. ln order to show that let us take a general linear estimator
:true'

./'(.xr'

-n

'solving'

'no

':

11

p-=c + jl
i

aixi,

( 12.10)

which includes the above linear estimators as special cases, and determine
7, which ensure that
the values for c, ai, i 1, 2,
is best Iinear unbiased
(BLUE)
of
Firstly,
for
be
unbiased
J'to
estimator
y.
we must have
1. Secondly, since Vartp)
which implies that c 0 and
j ai
)'c2
a? we must choose the ais so as to minimise
j)'- j J/c2
a,? as
j
satisfy
unbiasedness).
1 (for
well as
Setting up the Lagrangian for
1 ai
=

'(p)=0

N)'.
:)'-

Z)'=

12.1

Finite sample properties

239

this problem we have


?1

min : ltaa2)

tzi

11

ail

F, ai
=

Summing over f,
l1

.
i

'1

1=

jyj

,;g

?1

lf =

1e

jj
ai

12

.,

for c 0 and ai 1/n, i 1, 2,


n,
(1 n) j)'- 1 Xi, which is identical to
#1. Hence 4'1is BLUE (minimumvariance among the class of linear and
unbiased estimators of p). This result will be of considerable interest in
theorem.
Chapter 2 1, in relation to the so-called Gauss-Markov
The properties of unbiasedness and efficiency can be generalised directly
where 0 H(p1,
0m). is said to be an unbiased
to the multiparametercase
of
estimator
0 if
=

.E'(#) 0,
=

i.e. E(#f)

0i

1, 2,

?'l.

ln the case of full efficiency wecan show that the Cramer-Rao


of 0 takes the form
an unbiased estimator
-j1(0

Yj(0

10g XX;

covt)

lOg

pp

Or

> Inoli-i
varlpf)

',

t?0
1, 2,

.f(x;

1,,(:=E

Yjt

./x;

I0B /tX;-..?6

0
-

pp

l log ftx; 0)
(0 Jp

(12.12)

(rrl being the number of parameters), where


diagonal element of the inverse of the matrix

z;j(0 I0B

j'jy:

04

inequality for

1
I,,(p)J represents the th

)'j
( 12.13)

called the sample information matrix; the second equality holding under the
restrictions CRI-CR3. ln order to illustrate these, consider the following
example:

Properties

In example

1 discussed above we deduced that


j

/ =
'good'

is a
sample

of estimators

11

'l j

y xi

(12.14)

estimator of y, and intuition suggests that since i is in effect the


corresponding to /t the sample variance

moment

(f2

11

y (xy-p2

tl

( 12.15)

estimator of c2. In order to check our intuition let us


dl satisfies any of the above discussed properties.

should be a
examine whether

'good'

( 12.16)
Since

E(Xi -p)2

Ep -/t)2

c2,

and

from independence, we can deduce that


,1

(-Yj
i

1,
-/@)2

1*=

(7.2
1

+--

1)

-2

=(n

-K

1)c2.

( 12.17)

This, however, implies that E(dl)


g(n- 1)/'??jcr2+ c2, that is, dl is a biased
c2.
Moreover, it is clear that the estimator
estimator of
=

$.2
'
n

11

t.yj
1 f )-q
1

j2

( 12.18)

is unbiased. From Chapter 6.3 we also know that


(rl 1)

S2
'v

z (rl
-

1)

( 12. 19)

12.1

Finite sample properties

and thus
Var(s2)

c*

1)a

(n
-

2(n

1)

2c*
=

( 12.20)

since the variance of a chi-square r.v. equals twice its degrees of freedom. Let
Y,, and
are efficient estimators:
us consider the question whether
.s2

( 12.2 1)
o

1og ftx; 0)

t?log .J(x;0)
Jp

27:

-jlog

-j

(72

log

-#..j i )
=

-p)2,

( 12.22)

(-Yf
1

p log tx;0)
t?/z

f?

log

tx;04

pc2

:2 1og tx,'p)
(0

t'?p

' J2 1og f (x;0)

t2

log

Jc2

tx,'0)

(?y

log

tx;0)

i)y (?(r2
2

log

tx;p)

pc4.

( 12.24)

0
n
2c 4

and

gI,,(p)j -

N
=

Properties

of estimators

This clearly shows that although T,, achieves the Cramer-Rao lower bound
sl does not. lt turns out, however, that no other unbiased estimator exists
which is relatively more efficient than s2; although there are more efficient
biased cstimators such as

d21

1r

n+ 1 i

?1

)
-

(Xi

Y,,)2.

( 12.26)

Efficiency can be seen as a property indicating that the estimator


all the information contained in the statistical model. An important concept
related to the information of a statistical model is the concept of a sujhcient
statistic. This concept was introduced by Fisher ( 1922) as a way to reduce
the sampling infonnation by discarding only the information of no
relevance to any inference about 0. In other words, a statistic T(X)is said to
be sufficient for 0 if it makes no difference whether we use X or T(X) in
inference concerning 0. Obviously in such a case we would prefer to work
with r(X) instead of X, the former being of lower dimensionality.
'utilises'

Dejlnition 4
R'n, n > rn, is called sufficient for 0
,4 statistic T( ):
t/' the
conditional distribution
z(x) z) is independent 0j% 0, i.e. 0 does
not appear in /(x,,''r(x) T) and the domain of f ) does not involve 0.
In example 1 above intuition suggests that z(X)
j)1-1 Xi must be a
sufficient statistic for 0 since in constructing a
good' estimator of p, 1,
we only needed to know the sum of the sample and not the sample itself.
That is, as far as inference about 0 is concerned knowing all the numbers
(A'j Xz,
A-,,)orjust
1 Xi makes no difference. Verifying this directly
by deriving f(x T(x) T) and showing that it is independent of 0 can be a
very difficult exercise. One indirect way of verifying sufficiency is provided
by the following lemma.
.4-

-+

'

.f(x

'

tvery

Z)'-

Fisher-Neyman
The statistic
jctorisation

letnma
factorisation

T(X) is
of the

sujlcient
abrp?

/'(x; 0) =.J(T(x);

p)

for 0 #' and


(x),

only

#' there

exists

(12.27)

wllcrc .J(T(x); p) fs (he Jtrnsfty ynctfono/' z(X) anl depends t)n 0 and
Jl(X),some function(?JX independent (?J0.
Even this result, however, is of no great help because we have to have the
statistic T(X) as well as its distribution to begin with. The following method
suggested by Lehmannand Scheffe( 1950) provides us with a very convenient
way to derive minimal sufhcient statistics. A sufficient statistic T(X)is said to

12.1

243

Finite sample properties

be minimal if the sample X cannot be reduced beyond T(X) without losing


i?sufficiency. They suggested choosing an arbitrary value xa in and form
the ratio

J(x; P)
'/'(x();0$

x G ..J, 0 (E 0.

( 12.28)

g (x x () ; p)

and the values of xtl which make gtx,


minimal sufficient statistics.
ln example 2 above

xtl;

0)independent of 0 are the required

( 12.29)
Xl?) is a minimai sufficient
This clearly shows that T(X)
1
1 A'j, ZJ'=
statistic since for these values of xa g(x, x(); 04 1. Hence, we can conclude
y2)
being simple functions of z(X) are sufficient statistics. lt is
that (X,,,
Xt? separately as
1
important to note that we cannot take
l Xi or
minimal sufficient statistics; they are jointly sufficient for 0 EB (/t,c2).
In contrast to unbiasedness and efficiency, sufficiency is a property of
statistics in general, notjust estimators, and it is inextricably bound up with
the nature of *. For some parametric family of density functions such as the
exponential family of distributions sufficient statistics exist, for other
families they might not. lntuition suggests that, since efficiency is related to
full utilisation of the information in the statistical model, and sufficiencycan
be seen as a maximal reduction of such information without losing any
relevant information as far as inference about 0 is concerned, there must be
relationship along the
a direct relationship between the two properties. A
should
needed
look no further
efficient
estimator
when
is
we
lines that
an
following
provided
lemma.
statistics,
by
sufficient
the
is
than the
=

(Z)'-

Z)'-

)-

Rao and Blackwell Iemma


Let T(X) be (1 sufjlcient

statistic

)r

0 and t(X) be an estimator of 0,

then
&h(X)

04l %F(t(X) - p)2, 0 g (1),

T),

conditional

of t(X)

expectation
i.e. the
wt'?rc h(X) F(t(X),/T(X)
given z(X) z.
From the above discussion of the properties of unbiasedness, relative and
full efficiency and sufficiency we can see that these properties are directly

Properties

of estimators

related to the distribution of the estimator (1 of t?.As argued repeatedly.


deriving the distribution of Borel functions of r.v.'s such as )'=/?(X)is a very
difficult exercise and very few results are available in the literature. These
results are mainly related to simple functions of normally distributed r.v.'s
(see Section 6.3). For the cases where no such results are available (which
is the rule rather than the exception) we have to resort to asymptotic results.
This implies that we need to extend the above list of criteria for
estimators to include asymptotic properties of estimators. These asymptotic
properties will refer to the behaviour of # as n
r ln order to emphasise
the distinction between these asymptotic properties and the properties
sample (or small sample) properties.
considered so far we call the latter
The finite sample properties are related directly to the distribution of #,,,say
/'((,). On the other hand, the asymptotic properties are related to the
asymptotic distribution of t)-,.
tgood'

-+

.jlnite

12.2

Asymptotic properties

A natural property to require estimators to have is that as n --+


(i.e.as the
sample size increases) the probability of being close to the true value 0
should increase as well. We formalise this idea using the concept of
convergence in probability associated with the weak law of large numbers
(WLLN) (see Section 9.2).
'z.

'

Dehnition 5

4n

0- /?(X) is said to be consistent jr 0

estimaor

lt

lim

11-*

LX

p!< c)
P r( p-.
11

/'
( 12.3 1)

w't? u,,?-/lt?t', -+ p.
t./,7(./

This is in effect an extension of the WLLN for the sample mean T,, to some
Borel function (X). lt is important to note that consistency does not refer to
t approaching
The
0 in the sense of mathematical
converence.
with
pl
refers
probability
associated
<
the event
to the
c as
convergence
derived from the distribution of
as n --+ x.. Moreover, consistency is a
very minimal property (although a very important one) since if In is a
7 405 926, n if #r(t,, - ?t>
consistent estimator of () then so is
n +
which
implies that for a small n the difference
t?l
7405926/:) 1,n, n > 1,
might be enormous, but the probability of this occurring decreasing to zero

Ip,,
-

%*
=

Iu

aS

-+

7f

Fig. 12.4 illustrates the concept in the case where

,,

has a well-behaved

Asymptotic

12.2

properties

symmetric distribution for n! < na < ns < rl4.< ks. This diagram seems to
suggest that if the sampling distribution .J(),,)becomes less and less
dispersed as n %) and eventually collapses at the point () (i.e. becomes
degenerate), then
is a consistent estimator of (). The following lemma
formalises this argument.
--+

,,

t hen

(i)

11

--+

lt is important, however, to note that these are only stfjl-lcient conditions for
consistency (nor necessaryl; that is, consistency is not equivalent to the
above conditions, since for consistency Varttdi,ylneed not even exist. The
above lemma, however, enables us to prove consistency in many cases of
interest in practice. lf we return to example 1 above we can see that
P

-+

246

Properties

of estimators

by Chebyshev's inequality and lim,,-, .E1 -( 1/n;2)j 1. Alternatively, using


Lemma 12.1 we can see that both conditions are satisfied. Similarly, we can
=

showthat
P

#c p, ;a +.. p ('+.+'reads

'does

not converge

-+

t) +.+ 0, s ++ p
4
;
showthat dl
--+

in probability

0,

+.+

p. Moreover, for 42 and sl of example 2 we can

+.+
p

c2 and sl

to'),

(7.2

--+

A stronger form of consistency associated with almost sure convergence


very desirable asymptotic property for estimators.

is a

Dejlnition 6

t is said

zzlnestimator

to be a strongly consistent estimator

1im #,, 0

Pr

?1 --#

of 0 (J'

2:)

.%.

and is

tlenoted

by

-+

0.

The strong consistency of ),,in example 1 is verified directly by the SLLN


and that of sl from the fact that it is a continuous function of the sample
X? (see Chapter 10). Consistency and
moments Y,, and r/?c ( 1 n)
1
consistency
extensions
of the weak 1aw and strong law
can be seen as
strong
statistic
respectively.
of large numbers for
1 Xi to the general
Extending the central limit theorem to n leads to the property of
asymptotic normality.

)'-

(,,

Z)'-

Dehnition 7
'J,is ,s(:fJto be asymptotically normal
zln estimator
'tj
exist sttch that
1.
tf U;,(p) n y 1 )p,, > 1)
,

(1-lw()

sequences

-p

()p))-(
()1;-7,

,,

,,

-+

( 12.32)

.N'(0-1),

normality presents several logical


This way of defining asymptotic
problems deriving from the non-uniqueness of the sequences P;,(p)) and
)p,,), n y 1. A more useful definition can be used in the case where the order
of magnitude (seeSection 10.4) of P;,(p)is known. ln most cases of interest in
practice such as the case of a random sample, P;,($ is of order 1/n, denoted
by P;,(@ 04 1/rl). In such a case asymptotic normality can be written in the
following form:

',-1 j
( - p ) x(()prjpjj
-$,?'
,,

where

'

i
'v

,,

reads asylnptotically

(12.33)
distributed as' and U(p) > 0 represents the

12.3

and their properties

Predictors

247

asymptotic variance. ln relation to this form of asymptotic


consider two further asymptotic properties.
D#nron

normality

we

zln estimator (, wflll Vart'(l


unbiased ('/'
x/ n(P,, P) 0

0( 1 rl) is .$JJ to be asymptotically

(12.34)

-->

This is automatically
satisfied in the case of an asymptotically normal
estimator
for Vartt?,yl J,)(p) and f)#,,) p,,. Thus, asymptotic normality
can be written in the form
=

,,

v'''rl($

,,

p)

p-(p)).

(12.35)

lt must be emphasised that asymptotic unbiasedness is a stronger condition


than 1im,,-, F((,) p; the former specifying the rate of convergence.
ln relation to the variance of the asymptotic normal distribution we can
define thc concept of asymptotic efficiency.
=

Dlinition

tt is said rtpbe asymptotically

An asynptoticall y' normal lsrrrlfkrt?r


I,''(p) 1,..(?) - 1 w/-llrt?
efficient
,/'

1
.l.,(p) lim -1,11,,(p)
cx7.
l
=

(12.36)

--.

i.e. the asymptotic pr.lrff.lncl achieves the lrrlr


ltpwgr-bound lyct?Rothenberq (1973:.

0j%

the Cramer-Rao

At this stage it is important to distinguish between three different forms of


the information matrix. The sample information matrix 1,,(p) (see(13)),the
single observation one 1(p) with txf;P) in (13), i.e.
E

J 1og f (xf,'p)
d0

and the asymptotic


12.3

information matrix 1.(?) in (36).

Predictors and their properties

Consider the simple statistical model:


(i)
Probability model: *
p)
i-e.
.N((),1).
Samplinq model: X N (xYj A''a,
=

'v

ttx;
,

1 x/(27:) expt

-.(x
-

@2,0 c R)

.Y,,)' is a randorn salnple.

Properties

248

of estimators

Hence the distribution of the sample is


!1

flx,

1-l
..ftx-'

p)

,x,,;
.

(12.37)

p).

Prediction of the value of X beyond the sample observations, say A',,+ 1,


refers to the construction of a Borel function 1( ) from the parameter space
O to the observation space i?t.

/(

): 0- --+

'

.kL

(12 38)
.

If 0 is known we can use the assumption that Xn + 1 x, N(0, 1) to make


probabilistic statements about Xn + j Otherwise we need to estimate 0 first
and then use it to construct 1( ln the present example we know from
Sections 12.1 and 12.2 above that
.

').

p11

=-

&f

( 12.39)

=kr

estimator of 0. lntuition suggests that a


is a
might be to use /(#,,) #,,,that is,

tgood'

igood'

predictor of X,, +. 1

?1 +

(,.

(12.40)

The random

variable
.1
/(,,) is called the predictor of X,, + 1 and its
prediction
value.
Note that the main difference between
value the
testimating'
and
prediction
estimation
is that in the latter case what we are
(A-,,+j) is a random variable itself not a constant parameter p.
/(t) we
In order to consider the optimal properties of a predictor
+ j
define the prediction error to be
.-,,

,,

1, +

.Y,,+

-',,

(12.41)

1.

Given that both -Y,,+ 1 and X,, 1 are random variables en + I is also a random
variable and has its own distribution. Using the expectation operator with
respect to the distribution of c,,+ j we can define the following properties'.
Unbiasedness. The predictor .)Q,,1 of X,, + l is said to be unbiased if
(1)
+

f (G,+

1)

0.

(12.42)

Minimum MSE. The predictor

mean square error if


E(el

kn

+ 1

-k

H
zl + 1

E(X

?1

of X,+

)2KEIX

11 +

is said to be minimum
1

-X

n +

)2

for any other predictor Y,:+1 of X,, 1.


Another property of predictors commonly used in practice is linearity.
Linear. The predictor kn + 1 of X', + 1 is said to be linear if 1( ) is a
(3)
linear function of the sample.
+

Predictors

12.3

and their properties

ln the case of the example considered above


e

11 .1

'v

N 0, 1 +-

249
we can

deduce that
(12.44)

function of normally distributed r.v.'s, :,,+ 1


+ 1 is a linear
1
Xi.
Hence,
n)
X,,+ 1 is both linear and unbiased. Moreover,
( j)?-1
using the same procedure as in Section 13. 1 for linear least-squares
estimators, we can show that X,,+1 is also minimum MSE among the class
of Iinear unbiased predictors.
The above properties of predictors are directly related to the same
properties for estimators discussed in Section 12.1. This is not surprising,
however, given that a predictor can be viewed as an estimator' of a random
variable which does not belong to the sample.
given that c,,
Xt, .1

Important

concepts

estimate, unbiased estimator, bias, relative efficiency, mean


full efficiency, Cramer-Rao lower bound, information matrix,
error,
square
properties,
sufficient statistic, finite sample properties, asymptotic
asymptotic
normality,
asymptotic
consistency, strong consistency,
unbiasedness, asymptotic efticiency, BLUE.
Estimator,

Questions
Define the concept of an estimator as a mapping and contrast it with
the concept of an estimate.
Define the finite sample properties of unbiasedness, relative and full
efficiency, sufliciency and explain their meaning.
kunderlying every expectation operator J) ) there is an implicit
distribution.' Explain.
Explain the Cramer-Rao
lower bound and the concept of the
information matrix.
method of constructing
minimal
Explain the Lehmann-scheffe
sufficient statistics.
Contrast unbiasedness and efliciency with sufficiency.
Explain the difference between small sample and asymptotic
'

4.

properties.
8.
9.

Define and compare consistency and strong consistency.


Discuss the concept of asymptotic normality and its relationship
the order of magnitude of Var((,).

to

of estimators

Properties

10.

Explain

the

of asymptotic

concept

efficiency

in

relation

to

asymptotically normal estimators. %What happens when the


asymptotic distribution is not normal?'
Explain intuitively why x' n(t?,, 0j 0 as n
%. is a stronger
condition than lim,,-.
p.
12. Explain the relationships between 1,,(p), l(0j and 1w(f?),
--+

.'(p,,)

--.

Exercises
Let X EEE(A'1 A-c,
X,,)' be a random
consider the following estimators of (?:
.

.'

;a

-s

j
=

jXj

12

2,aA-,, #.k

J-a.- +
1

1,

sample from N(p, 1) and

---j-

n- 1

'

jl ix i h ;, + l-a
i I
,

rl

Derive the distribution of these estimators.


Using these distributions consider the question whether these
estimators satisfy the properties of unbiasedness, full efficiency
and consistency.
(iii)
Choose the relatively most efficient estimator.
Consider the following estimator defined by

1
jrx1- t?--,,

)
nl

cuc -.

...

(1'

and

Prt,,

n)

??

cuc -

n+

l
11-f- J

and show that:


(i)
(, as defined above has a proper sampling distribution;
is a biased estimator of zero;
(L?,
(ii)
lim,,-, Vart#,,l does not exist; and
(iii)
(iv)
#,,is a consistent estimator of zero.
Let X H (.Y1 X.z,
zY,,)' be a random sample from N(0, c2) and
consider
..

022

?1

Xll

tl i

as an estimator of c2.
Derive the sampling distribution of *2 and show that it is an
(i)
unbiased, consistent and fully efficient estimator of c2.

12.3

Predictors

and their properties

Compare it with l of example 2 above and explain intuitively


why the differences occur.
Derive the asymptotic distribution of l.
(iii)
.Y,,)' be a random sample from the exponential
4. Let X EEE(Ar1,
distribution with density
(ii)

Construct a minimal sufficient statistic for 0 using the LehmannScheffe method.

Additional references
Bickel and Doksum ( 1977)) Cox and Hinkley ( 1974); Kendall and Stuart
Lloyd (1984)) Rao ( 1973)., Rohatgi (1976); Silvey (1975); Zacks (197 1).

(1973)',

Estimation 11 - methods

The purpose of this chapter is to consider various methods for constructing


igood' estimators for the unknown parameters 0. The methods to be
discussed are the least-squares method, the method # moments and the
maximum likelibood method. These three methods played an important role
in the development of statistical inference from the early nineteenth century
to the present day. The historical background is central to the discussion of
these methods because they were developed in response to the particular
demands of the day and in the context of different statistical frameworks. lf
we consider these methods in the context of the present-day framework of a
statistical model as developed above we lose most of the early pioneers'
insight and the resulting anachronism can lead to misunderstanding. The
statistical model
method developed in relation to the contemporary
framework i'sthe maximum likelihood method attributed to Fisher (1922).
The other two methods will be considered briefly in relation to their
historical context in an attempt to delineate their role in contemporary
statistical inference and in particular their relation to the method of
maximum likelihood.
The method of maximum likelihood will play a very important role in the
discussion and analysis of the statistical models considered in Part IV; a
of this method will be of paramount importance.
sound understanding
After the discussion of the concepts of the likelihood function, maximum
likelihood estimator (MLE) and score function we go on to discuss the
properties of MLE's. The properties of MLE'S
are divided into finitesample and asymptotic properties and discussed in the case of a random as
well as a non-random sample. The latter case will be used extensively in
Part IV. The actual derivation of MLE'S and their asymptotic distributions

The method of least-squares

13.1

throughout as a prelude to the discussion of estimation

is emphasised
Part lV.

in

The method of Ieast-squares

13.1

The method of least-squares was tirst introduced by Legendre in 1805 and


Gauss in 1809 in the context of astronomical measurements. The problem as
posed at the time was one of approximating a set of noisy observations vf,
n,
n. with some known functions (pft?l 0z,
i 1, 2,
k), i 1,
which depended on the unknown parameters p= ()j
tk)', m < r7.
h?, minimising
?j i 1, 2,
Legendre argued that in the case of qiol
=

1)

) ( J,j - )1)2, with respect to 71


i

( 1,/n)jgyf #,,, the sample mean, which was generally


givesrise to
.J',,). On the
consideredto be the most representative value of (y1 ya,
squared
result
he went on to suggest minimising the
errors
basisof this
=

11

l04

1*crr

(-pf- (/f(p))2,' the least-squares,

in the general case. Assuming differentiability of #j.(#), i 1, 2,


gt?!(p)(l(0 0 gives rise to the so-called normal equations of the form
=

n,

?1

2)

())

g),-

gf(#)1 '

'.

liob

ppk

0,

1 2,
,

n'l.

ln this form the least-squares method has nothing to do with the statistical
model framework developed above, it is merely an interpolation method in
approximation theory.
Gauss, on the other hand, proposed a probabilistic set-up by reversing
the Legendre argument about the mean. Crudely, his argument was that if
X EE (A-l Xz,
X,,)' is a random sample from some density function .J(x)
value for all such A-s, then the
is
the most representative
and the mean
normal,
density function must be
i.e.
,

.k,,

f'tvxl
''
=

'

----

'

exo
'

'*(2z:)

cx

Using this argument he

'$.'ggio)
:=

-h cf

cf lbrl, c2),
'v

1
x2
- 2c c

to pose the problem in the form

went on

i zzu l 2,
,

1, 2,

n.

u,

Methods

normal independent, justifying the normality


grounds
of being made up of a large number of
the
assumption on
cancelling
otherout.) In this form the problem can
each
independent factors
viewed
of
estimation
the
in
context of the statistical model:
be
as one
Not(e: Nl(

'

'reads'

.f

()'f;#)

cx

y,(aal'exp

(-vf qil04jl
-

assumption

the probabilistic

by transferring
r.v., and

1
,o.2

from

0GO

to yi, the observable

;j

consider

3' (
==

.r

.3..'..2

1,

.J.?,,)/

rl. Gauss went on to


to be an independent sample from tm;$, i 1, 2,
derive what we have called the distribution of the sample
=

lg

,1

2
/'(y,. 0) (27:c2)- n.'' exp - --j
'
2c
=

)( (yf
-

gi(0))l

interpreting it as a function of 0, and suggested that maximisation of fty,' 0)


with respect to 0 gives rise to the same estimator of 0 as minimising the
squared errors
11

Yl1 El'-f?f(p)12.
=

As we will see below, the above maximisation can be seen as a forerunner of


the maximum likelihood method.
Since these early contributions the method of least-squares has been
extended in various directions, both as an interpolation method as well as a
statistical model with the probabilistic structure attached to the error term.
The model mostly discussed in this literature is the linear model where gjoj
is linear, i.e.
m

yi

(lkxki+ t:i

and the normality


Etc

1, 2,

k=1

replaced by the assumptions

assumption

) 0, Vartl)

2
,

i,j

1, 2,

7.

( 13. 10)
ln some of the present-day literature this model is considered as an
extension of the Gauss formulation by weakening the normality
assumption. For further discussion of the Gauss linear model see Chapter
18.

of least-squares

The method

13.1

For simplicity of exposition let us consider the case where m


model becomes

,f 0 1 x 1 i .1.t?i
::=

12

:=

1 and the

1.1
.

method suggests minimising

The least-squares

(13.12)

( 13. 14)

Note that in the case where xl

sample

t71.

of

estimator

is the least-squares

1, i

1, 2,

n,

0-1

(1/''n)j)'=1

i.e. the

.pf,

mean,

Given that the xjfs are in genera l assume d to be known constants 0-1 is a
.J',, of the form
linear function of the r.v.'s )'I
,

19

'-'

01

t. l ). f
.

..

where

Hence,
11

??

F(;1 )

)'- c'fuE'tyfl
-

)'-1 cfxl

i03

( 13.16)

t?l
,

i.e. t?:
- is an unbiase d es timator of p1 Moreover, since (p-1 - ()1 )
.

lt can be shown that, under the above


(j)'- j x2jy)+ 0, the least-squares estimator

'l

Y)' I x.c f,
r

assumptions
relating to ;f if
of ?l has the smallest variance

Methods

within the class of linear and unbiased estimators (Gauss-Markov


see Section 2 1.2).

theorem,

The method of moments

13.2

From the discussion in the previous section it is clear that the least-squares
method is not a general method of estimation because it presupposes the
existence of approximating functionsg/p), 1 1,2,
n, which play the role
of the mean in the context of a probability model. In the context of a
probability model *, however,unknown parameters of interest are not only
associated with the mean but also with the higher moments. This prompted
Pearson in 1894 to suggest the method t?/' moments as a general estimation
method. The idea underlying the method can be summarised as follows:
A-,,)'is a random sample from .J(x;0),
Let us assume that X (11,X1,
Rk.
The
of
moments
0c
r > 1, are by definition
mw
.J(x,'0), p'v>
of
unknown
since
functions
the
parameters,
=

'(.xr),

:'

.X'.'/-(.X;

/t'r(#)
=

P)dx,

(13.18)

ln order to apply the method we need to express the unknown parameters 0


in the form

0i

gipk p-,,

=-

1. 2,

Jtk'). 1

''''''-

(13.19)

k,

where the gis are continuous functions. The method of moments based on
the substitution idea, proposes estimating oiusing
li

gjtn'l 1 n'l2,

n'lkl,

1. 2,

(13.20)

k,

represent
;)'-1, 2,.Yr, r > 1,The

where r#l,= ( 1 n)
estimators of Vi,i
the fact that if pj
Jt h;.
.

-+

the sample rlw' moments, as the


justification of the method is based on
p'k are one-to-one functions of 0 then since

k.

(13

p?r
,

.2

1)

it follows that
a S.
.

3332'-'i -+ t, i
9

(see Chapter

'v

'

...

'i..

',

( 13.22)

10).

Exanvle
Let Xi

(I.:2

2,'

Np, c2), i

1, 2,

?,

then pj

p, y

c2 +/t2

and

l';j

d2

Iikelihood method

The maximum

13.3

1 12

(D1

ma -

1
-

tz

)X ;

(Y/j

11 j

?1

)-')(A'
--

i -

Xtlt).

Formally, the above result can be stated as follows: Let the functions
k, have continuous partial derivatives up to order I on 0/z;(p),i 1, 2,
Jacobian
of the transformation
the
and
=

J(
--.-P

dett?(p
If the equations

.j

p;(@

/t q)

pk)

( l 3.23)

+ () fo r 0 e,o.

p1g,

1, 2,

/, have a unique solution (, EEE(t


p
xL. then J
0 (i.e.i is a
one, as n
,

h)' with probability approaching


consistent estimator of 0).
Although the method of moments usually yields (strongly) consistent
estimators they are in general inefficient. This was taken up by Fisher in
several papers in the 1920: and 30s arguing in favour of the maximum
efficient estimators
likelihood method
for producing
(at least
and
Fisher
The
about
asymptotically).
the
controversy between Pearson
relative merits of their respective methods of estimation ended in the mid1930s with Fisher the winner and the absolute dominance since then of the
maximum likelihood method.
The basic reason for the inefficiency of the estimators based on the
method of moments is not hard to find. lt is due to the fact that the method
does not use any information relating to the probability model * apart
from the assumption that raw moments of order I exist. lt is important,
however, to remember that this method was proposed by Pearson in the
late nineteenth century when no such probability model was postulated a
priori. The problem of statistical inference at the time was seen as one
A-,,)' and estimating .J(x; 04 without
starting from a sample X (-Yj
assuming a priori some form for /'( ).This point is commonly missed when
comparisons between the various methods are made; it was unfortunately
missed even by Pearson himself in his exchanges with Fisher. lt is no
surprise then to discover that a method developed in the context of an
alternative framework when applied to present-day set-up is found
wanting.
-+

--+

'

13.3

The maximum

likelihood method

The maximum likelihood method of estimation

was formulated by Fisher

Methods

in a series of papers in the 1920s and 30s and extended by various authors
such as Cramer, Rao and Wald. In the current statistical literature the
method of maximum likelihood is by far the most widely used method of
estimation and plays a very important role in hypothesis testing.
The likelihoodfunction

Consider the statistical model:


(i)

(ii)

(fx;

#), 0 6 O );

(xY'1X1,
,

Xtt)' a sample from

tx,'04,

where X takes values in .I= R'', the observation space. The distribution of
the sample D(x1, xc,
x,,; 04 describes how the density changes as X takes
different values in I for a given 0 6 0. In deriving the likelihood function we
reason as follows:
.

since D(x; 0) incorporates al1 the information in the statistical model it


makes a 1ot of intuitive sense to reverse the argument in deriving D(x; 0)
and consider the question which value of 0 e:(4 is mostly supported by a
given sample realisation X x?
=

Example 1

Consider the case where X is a random sample from a Bernoulli distlibution


with density function

.Jx; 0)

pX(
=

?)1-'Y,

and for simplicity assume that the


0.8. Suppose the sample realisation

(13.24)

0, 1,

0 can only be either 0= 0.2 or 0=

'true'

was

what can we say about the two possible values of 03 Intuition suggests that
since the average of the observed realisation is 0.7 it is more reasonable to
assume that x is a sample realisation from flx; 0.8) rather than tx,'0.2).
That is, the value of 0 under which x would have had the highest
ilikelihood' of arising must be intuitively our best choice of 0. Using this
intuitive argument the likelihood function is defined by
L0; x)

k(x)D(x; p),

0 (E (1),

where k(x) > 0 is a function of x only (not p). In particular


1a( ; x): O
'

-+

g0, ).
Evz

(13.26)

Iikelihood method

The maximum

13.3

259

Even though the probability remains attached to X and not 0 in detining


fp; x) it is interpreted as if it is reflected inferentially on 0; reflecting the
of a given X x arising for different values of 0 in 0. In order to
consider
this,
the following example.
see

Slikelihood'

Example 2
Let X EEE(aYj

the randomness
Dlxl

Xn)' be a random sample from N(0, @;0= c2. Because of


of the sample its distribution takes the form
,

x,,;

p)

,,

1-l

exp

x?'
-

(2ap)

f=1

11

-''l

= (2ap)

2p
XI

1-I

exp

J!

- 2p

and the likelihood function is


ZIP' X)
,

S/

V(X)(27:P)
-

x/

p1

2
j

J'JCXP
.j

j-j

If we were to reduce the dimensionality of x to one (to enable us to draw


pictures) the derivation of fz4:; x) is shown in Fig. 13.1 for two sample
realisations X xl and X x2. Fig. 13.14/) shows a family of D(x; 0)for 0=
0.5, 1,2, 3,4 and for two given values of X,xj and x2 the likelihood functions
1atp;x1) and Ltp; x2) are reflected in Fig. 13.1()). As can be seen from these
diagrams, different sample realisations provides different likelihoods' for
0 c 0. In the derivation of these likelihoods the constant k(x) was chosen
arbitrarily to be equal to one. The presence of k(x)in the definition of .L(p',x)
=

0.5
0 1
() 2

(x; 0 )

l (0 ; x)

L (0 ; x : )

0
4

L (p; xz)

o (1
X1

(2 3

5 6 7 x

Xz

(b)
Fig. 13.1. Deriving the likelihood function from the distribution of the

sample.

Methods

implies that the likelihood function is non-unique;


any monotonic
of
particular:
ln
transformation
it represents the same information.
and
the

sl'()re

( 13.27)

/ifl'lt'rff.?n,

incorporate the same information as 1-(p,'x) itself; if we have any one of the
functions L(0; x), 1og 1-tp; x), s(p; x) we can derive the others.
In example 2 above

and

Hence.

( Iog fatp,' x)
-- .-..(0

1
n
+
j
20 2 p

d log L0'-'- x)
(.

(1l

?'

-.

dp

--

y x,?
.

1 ''
().
log
- x-;j
j
uz:
0) i
n

X) +

(y*

and

As will be seen in the sequel, although the proportionality


factor is
function,
it plays no
indispensable in the actual detinition of the likelihood
what
with
1og L(0; x)
follows working
role in the derivation of estimators. ln
and g Iog L(0; x)(I(?0 will prove more convenient than using 1-(p; x) itself;
see Figs. 13.2-13.4.

(2)

The maximum

Iikelihood estimatov

ML f)

Given that the likelihood function represents the support given to the
various 0 (E (.) given X x, it is natural to define the maximum likelihood
O such that
estimator of 0 to be a Borel function :
=

.%

--+

L4#; x)

max 1-tp; x),


0 e (.)

and there may be one, none or many such MLE's.

(13.29)

13.3

The maximum

Iikelihood method

L (9 ; x)

iI
(

Fig.

,
13.2. A likelihood

;
function.

Fig. 13.3. The log-likelihood function.

Methods

262

Note that

log L,

for a1l 0* e:(9.

x) > log fz(p*; x),

(13.30)

In the case where fz(p; x) is dlfferentiable the MLE can be derived as a


solution of the equations

P log f-(p; x)
EEEs(#; x)
00

=0,

(13.3 1)

referred to as the likelihood equations.


ln example 2,
0 log .L(p; x)
=
f?0

''

Ar/ 0,
-j+ 20 a i )g
1
=

lg

p= -n

11

Zx

2
f

(for a maximum).
Example 3
Let X N

(11,

.(x; 0) 0x - t?=

Ar,,)'be a random sample from a Pareto distribution with


1 %x % r.'.c p 6 IJ?+
,

11

L(0; x)

/(x)

7) k(x)/'(x1

./xf;

and
=

d log L0;
dp

x)

''

n
=.-p

=>

F. log xf

r1

?t

= n N log xf
i

=0

--. 1

x,,) - P-

13.3 The

Iikelihood method

maximum

Before the readerjumps to the erroneous conclusion that deriving the MLE
is a matter of a simple differentiation let us consider some examples where
the derivation is not as straightforward.

Example 4
Let Z

(Z1, Za,
N

i-e.

Zf

Z,,) where

sample from

(A-d,F;.) be a random

1 p

0
0

exp
1-pc)-2,
-k (
l'(x,,;p)-f

log 1-(p; x, y) c =

1
j

n
log 27: -jlog
(1
1

-p

?1

- 2(1 -

-cpxy+l,zl

(x2

-pc)

p2)

-$'/),

)- (x,?-

2/?xfyi +

.1

11

d log L
dp '

,?(
-

-2)

+ )'/)
)(()txk?
(1+p c)
i
.y

2(1 - p 2 )

/' - /'

2)
(1 - p c

(1 p c)z
-

/1

n/(1

-/2)

/2)

(1+

,,

Z.

xfyf

11

x/ +
i

()

11
-

i -Pi

.''

))
=

.vl

0.

This is a cubic equation in p and hence there are three possible values for the
MLE and additional search is needed to locate the maximum value using
MLE'S
was a
numerical methods. The use of numerical methods in deriving
major drawback for the method when it was first suggested by Fisher.
Nowadays, however, this presents no difficulties. For a discussion of several
numerical methods as used in econometrics see Harvey (1981).
Example

.5

Xn)' be a random sample from


Let X H(Ar1, Xz,
The
function is
likelihood
0.
x%
.

fatp,'x)

0-

''

if 0

:$

xf

% 0, i

1 2,
,

1 0 where 0 %

tx',t?)
=

n.

x)1 dp= 0 to derive the MLE is out of the question since


Using Edfwtp,'
Ltp; x) is not continuous at the maximum (see Fig. 13.5). A moment's
xY,,).
maxtArl, .Ya,
reflection suggests that the MLE of ? is
'=

Methods
1 (0 ; x)

--

-.

Fig. 13.5. The likelihood

function of example 5.

Example 6
Let X (-Y1 Xa,
A',,)' be a random
The
().
likelihood
7:
function is
x
=

sample from

/'(x;0)

e-tx-t') where

Again gdLtt?,' x)q d? 0 is inappropriate for dcriving the MLE. Common


sense suggests 1w4?;x) is maximised by choosing 0 as large as possible such
that 1-lp; x) > 0. Since 0 is bounded below by the xfs, V=mint-j Xz,
',,) represents the MLE of 0.
Looking at examples
and 6 we can see that the problem of the
derivation of the MLE arose because tbe range of the Xis depended on the
unknown parameter 0. lt turns out that in such cases there are not only
problems with deriving the MLE but also the estimators derived do not in
general satisfy al1 the properties MLE'S enjoy (seebelow). For example, V=
maxlzYj
is not asymptotically normal. Such cases are excluded by
the assumption CR1 of Chapter 12.
So far the examples considered refer to the case where 0 is a scalar. In
econometrics, however, p is commonly a k x 1 vector, a case which presents
certain additional difficulties. For differentiable likelihood functions the
MLE of 0- ((?j,pa,
pkl' is derived by solving the system of equations
=

.5

',,)

Iog L

=0

likelihood method

The maximum

13.3

265

Or

?1og L
. lp,

(), i

a,zz

c=,

12
,

ensuring that the Hessian matrix

is negative

?loe L
t?pt?#'

? log L

H(

lpf t?oi.-j

o-t.

i,j

;tt

1, 2,

i,j

definite, i.e. for any zeRk, z'H()z

z: 0.

<0,

Example
Let X EEE(Ar1 X2,
,

f*; p,
'

J2)

1 x
- -2 .

exp

(2zr)

tr

tr2) G

0-(p,

x c R,

sample from N(y, c2), i.e.

Ar,,)' be a random

2
,

tr

J! x (!Rs
.

The likelihood function takes the form

and
log L
P log L
t?p -

n log 27: n 1og c


c
2
2

1
=

y (xj-p)

/1

2t7.a i-l

--

wthe Mt,E is y-

?'

--

plogL
t?c2 = -

'1

o'

--

+
2c a

2c

,y

11i

) xi
=

-/:)2=0

Z (xf

f-1

T,,,

(xj-p)

2
,

( 13.33)

266

Methods

Since

and

-/

H(#)=

z'H(l)z<0

and

for any zg R2.

- 24

This example will be of considerable interest in Part IV where most


estimation problems will be a direct extension of this case.
Despite the intuitive appeal of the method of maximum likelihood its
ultimatejustification as a general estimation method must be based on the
optimum properties of the resulting MLE's.

(3)

Finite sample properties

Let us discuss the finite sample properties of MLE'S


in the context of the
statistical
model:
simple
probability model, *= ).(x; @,0 G (9),.
(i)
sampling model, X N (.Y1
A',,)', is a random sample from
(ii)
.

./'(x;03.
One of the most attractive properties of

MLE'S

is invariance.

Invariance

#be a MLE

of p. lf g ): O --+ R is a Borel function of 0 then a MLE of


exists
and
is given by g(#). This means that if the MLE of 0 is available
go)
thenforfunctionsk, spch as 0k,et?5 log p, its MLE is derived by substituting in
log are the MLE'S of these functions. In the case of
P laceof p, i.e.
example 2 above the MLE of 0 was

Let

'

?1

p-= n
i

--.

)(
=

log Xi

11

13.3

The maximum

of

The invariance property

4=j

is

likelihood method

=-

enables us to deduce that the MLE of

MLE'S

Z 1ogXi.
=

ln relation to invariance it is important to note that in general

(13.34)

'(:())##()).

p2 it is well known that


For example, if qoj
# (E(t)2 in general. This
MLE'S
contributes to the fact that the
are not in general unbiased
estimators. For instance, in example 7 above the MLE of c2, d2
(1 n) =1 (Ari T)2 is a biased estimator since (nc-2)c2 g2(u 1) (see
Section 11.5) and hence f(d'2) g(n 1) njc2 # c2. Thus, in general,
unbiased and MLE'S do not coincide. ln one particular case, when
unbiasedness is accompanied by full efficiency, however, the two coincide.
'(2)

))'

'w

full-eflciency

Unbiasedness,

ln the case where satisfies the regularity conditions CRI-CR3 and is an


unbiased estimator of 0 whose variance achieves the Cramer-Rao lower
bound, then the likelihood equation has a unique solution equal to 1.This
suggests that any unbiased fully efficient estimator 1 can be derived as a
solution of the likelihood equation (a comforting thoughtl). ln example 7
Nt, o'ljn) since
above the MLE of p was /,, Y,,which implies that
is an
is a linear function of independent r.v.'s. Hence, S(,,)
p and
unbiased estimator. Moreover, given that
(1)

'w

,,

,,

I,,(p)H E

-p2 1ogz-(p;x)
0 ?g

'JY

,,

and gI,,(p)(1-

N
=

0
we can see that Var(,,) achieves the Cramer-Rao lower bound. On the
other hand, the MLE of c2, :,2, (1 n) )'=: (A'f X,,)2as discussed above, is
not an unbiased estimator.
The property mostly emphasised by Fisher in support of the method of
maximum likelihood was the property of sufficiency.
=

268

Methods
Sujhciencj'

lf z(X) is a sufficient statistic for p and a unique MLE


0 exists then #is a
function of z(X). ln the case of a non-unique MLE, a MLE
be found
which is a function of T(X).lt is important to note that this does not say that
MLE'S
any MLE is a function of z(X); in the case of non-uniqueness some
1 Xi,
are not functions of z(X). lt was shown in Chapter 12 that T(X)
j)'- j Xi?) are jointly minimal sufficient statistics for 0-(p, c2) in the case
Xn)' is a random sample from Np, c2). In example 7
where X HE (11,
MLE'S
of p and c2 were
above the
'of

'can

=()-'

lj

y-

11

,1

X'i

d,2,
=

jj

11

..

)'')(A'

y-

Xt,)2

which are clearly functions of z(X).


An important implication for ML estimation when a sufficient statistic
exists is that the asymptotic covariance of ,, (seebelow) can be consistently
estimated by the Hessian evaluated at 0= #,,.That is,
(?1

1
-n .

log L
(0

t?p

!!,

P
-.+

I .,

(p)

( 13.35)

(see Dhrymes ( 1970)).


(4)
Although

Asymptotic properties
MLE'S

(IID casej

enjoy several optimum

finite sample properties,

as seen

above, their asymptotic properties provide the main justification for the
almost universal appeal of the method of maximum likelihood. As argued
below, under certain regularity conditions, MLE'S can be shown to be
consistent, asymptotically normal and asymptotically efficient.
Let us begin the discussion of asymptotic properties enjoyed by MLE'S
by considering the simplest possible case where the statistical model is as
follows:
probability model, *
(i)
1 p), p (E O), 0 being a scalar;
EEE
sampling
model,
X tA-j
Xn)' is a random sample from flx; 0).
(ii)
of
Although this case is little interest in Part 1V, a brief discussion of it will
help us understand the non-random sample case considered in the sequel.
conditions
The regularity
needed to prove the above-mentioned
MLE'S
asymptotic properties for
can take various forms (see Cramer
( 1946), Wald ( 1949), Norden ( 1972-73), Weiss and Wolfowitz (1974),
Serfling ( 1980), inter alia). For our purposes it suffices to supplement the
regularity conditions of Chaptcr 12, CR 1-CR3, with the following
condition:
./'(x.'

13.3

The maximum

likelibood method

(CR4)

wllcre the

/1a(x)are inteqrable

/11(x)an
functions

12

tptpt?r (

vt zt
,

),i.e.

tlntl
X

P)dx <
lls(x).f(.'.t7;

w/lprr K does ntpr depend on p.


The conditions CRI-CR4 are only sufficient conditions forconsistency and
asymptotic normality. In order to get some idea about how restrictive these
conditions are it is instructive to consider various examples which do not

satisfy them.

The examples 4 and 5 considered above are excluded by the condition


CR1 because the range of the random variables
X,, depends on the
Moreover,
of
example
6,
if p
in the case
then
parameter 0.
-l,

=0

J3 1og f
Jc 6

3x2

+ --jc
c

--.

as c2

--+

'

i.e. the third derivative is not bounded in the open interval 0..:: (72 < :;c. ;
condition CR4 is not satised (seeNorden ( 1973/.
Under the regularity conditions CR 1-CR4 we can prove (seeSerfling
x)1 (0= 0 admits a sequence
1980))
(
that the likelihood equation y log 1a4@,
such
solutions
1)
that:
n>
of

t,,,

Consistency:
a S.
.

0tt 0o (strong consistencyl


-+

which implies that

tt

P
-+

pll

(weakconsistency)

(see Chapter 10).


Asymptotic normality:

p() N(0,
x,/ntqtl'-,
-

.!'(p)-

').

270

Methods

enlcienc

-,45b'mptobic

p.'

1
1(p) lil'n - t,(p)
Fl
co
n
=

-.

i.e. the asymptotic


Rao lower bound.

variance

of

@,equals

the limit of the Cramer-

Although aformal proof of these results isbeyond the scopeof thisbook it is


important to consider an informal heuristic argument on how such results
come about.
If we were to take a Taylor expansion of gt?log ft; x)?/(0 at #= 0v we
would get

p 1ogL(h) p log L(0o) + :2 log L(%)


(% P0)+ 0p(#,
pp2
pp
pp =
-

(13.36)

where Op(n) refers to a11 tenns of order n (see Chapter 10). The above
expansion is based on CR2 and CR4. ln view of the fact the log L0)
j)'- 1 log
0) and the .Jtxi;0), i 1, 2,
n, can be interpreted as
functions of 1ID (independentand identically distributed r.v.'s) we can
express the above Taylor expansion in the form
=

.(xi;

Using the strong law of large numbers (SLLN) for llD


,4,,

1
-

ni

''

t? log

ftxf,'p())a.s.
-+

P0

=-

p Iog

.(x;

r.v.'s

p())

t?0

(seeSection 9.2)
(13.38)

These in turn imply that

(J
.

11

a. S

a. S.
-

0o) -+ 0

To show asymptotic
expansion by ljw/'n

or

-+

,,

0v.

(13.40)

normality we divide all the terms of the Taylor


(not 1/n) to get

plog ytxf;p,)
+ Op(vi).
%j 1n )
#,,Un(#,,
i
1
V
oo
n

=--,u

(13.41)

13.3 The

maximum

likelihood method

Using the central limit theorem (CLT) for llD nv.'s (seeSection 9.3) wc can
show that
1
n

txf;)())

p log

Given that Bn

a S.
.
-.+

X(0, T(P)).

'--

PP

(13.42)

1(p) we can deduce that

x/nlh 0v)

').

Nl, 10)-

w,

(13.43)

As far as asymptotic efficiency is concerned we can see that the asymptotic


variance of
a consistent and asymptotically normal (CAN) estimator, is
1
equal to 1(@ where
,,

1604 E
=

lx,'otj

(?log

(?2

(?0

1og

tx,'0vj

t?pc

(13.44)

the information for a single observation. We know, however, that in the lID
(3462
equals n
case the sample information matrix 1?,(p) 1;g0 log L(0vj
times 1(04,given that each observation contributes equally to the sample,i.e.
nl(0) l',f0j. This implies that the limit of the Cramer-Rao lower bound is
1(p) bccause
=

1
lirn - 1,,(p)

1
=-

1,:(p) l(0j.
=

It must be stressed that this is only true for the llD case. ln more general
cases care should be exercised in assessing the order of magnitude of 1,,4?)

(see Chapter 19).


Returning to the example 2 above we can see that
n

1,:(J1)=

and

agj

1(c)

1
=

2c

which implies that


/

,2

x/ ntd,,
I1-1example

c()2

4
X(0, 2c().

'.w

3, 1

?,

n po2and

(po)
=

v/'nq

,,

0())

x.

N(0, p()2).In example

although /,rcannot be derived explicitly, this does not stop us from deriving
its asymptotic distribution. This takes the form
-p2)2

&nb',
-pq

.t%

(1
0.
1+ # a.

Methods
The above asymptotic

where 0 EEEE(()1 t?c


,

J
(the

a .s.

properties can be extended directly to the case


k 1: 1 to read :

?k)',

and

-,

--+

,,

dropped for notational

zero subscript
''nk

(ii)

p)

-.

',

N(0, I(p) -

'v

conveniencel;

1)
,

(13.46)

where

( log

l(po) E
-

polh

.(x;

.(x;

oo

:2 1og

'j'j'

.?.lo..y

0v)

))

pp

).

0o)

.(x;

vo pg

( 13.47)

That is,

xz'noin

0i)

'

aN,

El(p)!-f),

1, 2,

1
I(p)J being the

k,

( 13.48)

fth diagonal element of I(p) - '


Asymptotic efficiency in the multiparameter
case is in terms of the
I(p)- 1. For any
asymptotic covariance of which takes the form Covti)
.

other CAN estimator

cov(#,,)
-

#,,,the

matrix difference

I(p) - 1 > 0

(13.49)

is ntpn-ntx/rpt? dflnite. In the case of example 7 above the asymptotic


takes the form
distribution of l >(,,,
',2)/

lJ

#l(J,,

-/t)

Y,

/n(2

'dddj,...

1.lr

-c2)

.s.

c2

2c4

'

This shows that asymptotically *,2, ( 1,,7n) )'=j Xi Y,,)achieves the lower
bound even though for a fixed ?? it does not (see Chapter 12).
=

(5)*

Asymptotic properties

non-llD

twNe)

The cases of particular interest in Part IV are the ones of an independent


Xtf)' is a set of
and a non-random sample, that is, (i) X,,H(zY:
independent (ID) (but not identically distributed) r.v.'s, and (li)X,, is a set of
non-llD r.v.'s. The asymptotic properties of MLE'S
for the 11D case
considered above can be extended to the cases of interest without much
difficulty. The way this is achieved is to reduce the more complicated cases
to something for which the limit theorems considered in Chapter 9 can be
,

applied.

The extension of the above asymptotic

results to the independent (ID)

The maximum

13.3

likelihood method

case presents no particular difficulties apart from the fact that the result
11(#) 1,:(P) no longer holds. This in turn implies that the asymptotic
variance of
can no longer be 14$. This, however, is not such a difficult
problem because it can always be replaced by the assumption that 1,,(?) is of
order rl. That is,
=

,,

lim -N 1,,(p)
,1
--.

<

(x)

( 13.50)

.:x)

denoted by I,,q0) O(n) (seeChapter 10). Given that


=

1',04
=

z log

?'

j'txf; p)

'

Ppc

= 1

wecan see that the assumption t,(@=O@)ensures that as n (yz, 1,,(?)


at the same speed, i.e. information continues to accrue as n increases.
The non-random sample case is more difficult to tackle but again all we
need to do is to reduce it to a case where the various limit theorems can be
applied. ln our discussion of these limit theorems it was argued that
independence is not needed for the various results and that martingale
orthogonality suffices for most purposes. This suggests that if we were to
reduce ztp, and S,, above to martingale orthogonal processes then the results
will follow.
The natural way to orthogonalise the non-random sample X,, is to start
from ArI, define Xj Ar1 '(Ar1),and then construct
-+

kz

x-?

X2

A-1

-+

F(Xa/c(xY1)),

Ex

c(A' 1

l1

'

A-?, --' 1.)).

c(aY1, Xz,
Xi- 1), i 2, 3,
n, denote the c-field generated
1
z%i
r.v.'s
A-1,
Xi
By
construction
includes only the new
the
by the
1.
and
satisfies
properties;
Xi
the following
infonuation in
Let

.%

(i)
(ii)

.E'(Xi f.?.
iEzk

Rj

f- j

.@,

2 3

<

ij
,

(13 52)

n;

2 3
,

(13 53)
.

That is, the Xis define a martingale difference (seeSection 8.4). Hence by
conditioning on past infonuation at each i we can reduce the non-random

Methods
,Xn)' to a martingale orthogonal sample X,,EB (X1,
asymptotically can be treated in a similar way as a random
sample. For this we need to impose certain time-homogeneity and memor)
restrictions on t-Yu, n > 1) so as to ensure for example that

sample Xu EB (A-1
,

T,,) which

n - ESn)

(7.2

-+

as n

(13.54)

:c

-+

where Sn
n r 1) behaves asymptotically as the
7=1 Xi.ln such a case
of
martingale
with
each
variance c2. This in turn enables
differences
sum n
various
the
limit
theorems (see
Chapter 9) to derive a number of
us to use
L
asymptotic results needed. Heuristlcally, this enables us to treat the
parameters oiin the decomposition

t5',,,

as being equal, i.e. deduce that


%

-/

( 13.56)

This in turn allows us to define the likelihood function to be


glven

.X().

( 13.57)

For simplicity 1et us consider the case where 0 is scalar.

d
loc L..(04
d0

' '

'

''

Y - log fxif/xk
i
1 d0
,

=-

xf

1,-

0)

( 13.58)

is the random quantity we are particularly interested in. Observe that the
terms in the summation can be written as
d
dp

log wJtxf/.Yl,

A'f

1;

P)

j-j Elog

Li(0)

Li 1(P)(1HE zf(@,
-

'-log

( 13.59)
which implies that
d
d
Lf(?) 1ogLi
log fo,(p) F
d0
d
0 glog
1
ir
and assuming the expected values exist,
d
E --- 1og flxi x 1
xi 1 04/V' j
dp
''

''

1(p)j

EEE

j 1 zftp)
-

,'

z-.tply,z-lj-st(jdpIog
Iog

-,'(.yjd

.r-,--,(?)yz-:)-e.

(13.61)

13.3

Iikelihood method

The maximum

1)
and hence kf(d dp) 1og Lnoj,
These imply that Eziolf/gi
0, i 1, 2,
cftpls
martingale
and
1
is
the
are martinqale dlfferences
,A,,n > ) a zero mean
variable
Defining
the
random
8.4).
(see Section
(r.v.)
=

( 13.62)
we observe that it corresponds to the Fisher information matrix defined
above. Moreover, under conditions similar to CR 1-CR3 above,
( 13.63)
and 1,,(?) can be defined alternatively
1

,1

P)

,1(

dgf ( pj
Vf o
'

'

as

,,.'

,,

'

( 13.64)

.'''

j.j

,''

Under certain regularity conditions relating the first three derivatives of


it can be shown that
1og1.,,(p), n 1, 2,
=

11

I7,,(p)(1
- F,1
'

0,

.:'f(p)

--+

(13.65)

provided 1,,(p)
as
property for MLE's.
':x)

-+

This enables us to deduce the following

,LJo.

-->

consistency
The likelihood equation

0, i.e.

lim #rtl,,

!1 --

Moreover,

)'-j

- pp< c)

'zl

zf()=0

has a root

,,

which is consistent for

( 13.66)

0.

under slightly stronger

restrictions

( 13.67)
a.S.

provided /,,(p)
i.e.
consistent,

'vz

-+

a. S.
.
,,

--.

as n

-+

:t.

and hence a MLE

,1

is also strongly

p. The most interesting of the sufficient conditions

for

Methods

the consistency of

is the condition

MLE'S

that
( 13.68 )

(or a.s.). This can be interpreted intuitively as saying that fn/brmafon


0 increases with the sample size.

tilhlu!

As argued above consistency is only a minimal property. A more useful

property is that of asymptotic normality. ln order to get asymptotic


normality we need to find a normalisation sequence
n y 1) such that
(7)X0
in
be
probability
transformed into
can
convergence
%
convergence in distribution of the form

tc',,,

t!,,(tt p)

-+

N(0, 1).

( 13.69)

The natural choice for c,, in this case is c',,

E.l(p)jl(
11

In

l/?tz

p)

11

which
E/,,(p)(I'l

ensures

x(0, 1).

11D case
,1

Fl
i=1

(jj

11

ctp)

J=1

log

dp

Jjxi; 0)

and

,,
f,,'(p)i Fl

,,
'(z,?(p)

?i

lf we denote the fq/rmllforl


d log

1(p) E
=

fxi

d().

) )i
1
-

(j

-'.u

jog .--,-7.--yfx

dp

.,

o)

; ?)

.-

random

sample case

11

n1$04-

and hence the asymptotic

normality

p) N(0,
(,?.f(p)/(t',
-

mtt rl-fx ./i?l-ontt ollstrrtwrforl by 1(P),i.e.

and assume CR l-CR3 then, in the

1',(0)i )-1 .f(p)

1)

takes the form

that

13.3

The maximum

likelihood method

( 13.76)
of n. lt is obvious that in this

since 0 < l0) < :y- from CR3 and independent


case the condition
1,,(?) n1(@
=

uo

-+

s automatically satisfed.
ln the general non-random
asymptotic normality result

like an analogous

N(0, Iz'(p)),

c,,(t - t?)
-

sample case we would

(13.78)

here c refers to the order of magnitude of


,,

For
g1,,(?)1l.

example, in the llD

Case,

l (0)p
=

11

n1(@

0(n)

and

nll < k < vz since


i.e.limn- E1,,(p)
N
example
7 above.
2 in
(;u, 4,,)
.

'

c',,

/''n
-bh--

0 < 1(p) < .z. To illustrate


.

this consider

!1

P)

1?11

lim
-. z

11

ln cases where the sampling model is either an independent or a nonran dom sample the order of magnitude of 1,,(?) can be any power of U.For
most cases of interest, however, it suftices to concentrate on cases where

1,,(:) O P (n)
=

information matrix I.(p) to be

and define the asymptotic


l

''t8)
-

-+

11

I .(#),

( 13.80)

1.

I.(p) - The notation above is used to emphasise the fact that


I,,(p) might be stochastic given that the conditional expectation is relative to
a c-field (see Section 7.2).

with V(p)

X. b'tlptotic
Under

certain

llt?rFrl/lf

!).'

regularity conditions
any consistent
),, is asymptotically normal, i.e.

solution

of

likelihood equation

(.f((?)/(t- l1 .

p)

x(0,1).

(13.81)

Methods

The case of particular interest in econometrics is when 0 is a k x 1 vector of


unknown parameters. ln such a case we need to normalise each individual
parameter separately in general. With this in mind let us define the
following normalising matrix:
D,,(p)

logL

diag

.
'

?01

p log L
(''?
pk

P log L
where .
(oi Under certain regularity conditions
D ? (p)( 11

where
D 11- 1(p)A11(1)D 11-

( 13.83)

1(p)

c(p)

-+

p2 jog L
'oi t'?pj

we can show that

x(0,c(p)-

'v

Cr

A,,(p)

( 13.82)

cxs

l),

-p)

and

--+

i,)'

Asymptobic efhcienc

1,

k.

'

As shown above, the asymptotic distribution of the CAN estimator


is
F(p))
where
N(0
P'(pl
achieves
asymptotic
n(
p
'
the
Cramer-Rao
)
x,'lower bound fw.(p)- 1, that is,
,,

--

,,

'v

P-(p) 1.0) =

(13.84)

This last equality defines the asymptotic

(6)

efficiency of the MLE

,,.

Summary of asymptotic properties

For reference purposes let us summarise the asymptotic properties of


MLE of 0o, in the case where I,,(p) Op(n):
=

Consistency'. For some root

a .S
-+

0o

lim

Pr

?1 --*

0-?1 0()
-+

2t2.

J,, of the likelihood

0v

1im #r(1 0-tt 0()


-

!1

--*

::J;

<

1 ;

4 1
=

equation

279

Iikelihood method

The maximum

Asvmptotic normality. For a consistent

(2)

n(

0v) x N(0,

Asymptotic c//ccncy.
V(p)
Note that asymptotic

1.(p) -

of 0o

v(p)).
l,, of

For a CAN estimator

0v

normality

x''nt

MLE

unbiasedness,

also implies asymptotic

i.e.

,,

Example 8
Let X,, EEE(.Y1, X1,

sample generated by the AR(1)

A',,)' be a non-random

time-series model:
Xi

tzztf

+
- 1

Uf,

where b'i V(0, c2); a normal white-noise process (see Chapter 8). The
distribution of the sample D(X,:; p) is multivariate normal of the form:
'v

X,,

'v

N(0, c2V,,(a)),

where

This is because

k 0, 1 2
=

Ial

< 1 which ensures stationarity


(seeChapter 8).
in view of the restriction
As argued above the only decomposition possible in the case of a nonrandomsample is
11

D(x,:; 0j

=/(x();

0$
i

1-lTtxi/xl,
.

xi -

-,

04

where .J(xo;0) refers to the marginal distribution of the initial conditions.


ln practice there are several ways the initial conditions can be treated'.
(i)
assume Xo is a known constant, say Xo 0 (i.e.a degenerate r.v.);
=

280

Metllods

(ii)

assume

(iii)

?,

this ensures stationarity

y 1)

of

,.

'

-a)2)

that Xv ,vX(0,c2/'(1

.t#,,,

A-,, which defines


as a circular
Anderson
1971/.
stochastic process (see
(
For expositional
purposes let us adopt (i) to define the likelihood function
assume

that Xv

as

t?log L
pc2 =
Hence, the

1
2c4

2c2

?'

jq1

of a and c2 are

MLE'S

y,2
=

t72log L
2

t?a

lo L
472
g
(.7a 7c2 =

:2 log L
. c4

i-1
i

2c4

:2 log L

oyc
p2loe
<'

t'?a:t7.2

11

)'(1 fzkri-

)2,

Xi .j

'

1
''

.1
n i

.ya.a

Taking expectations
E

Xi -aA'f-1)2=0.

jj xixi
=

c6

) xi

-a
i

1
-

-1

...,..

''

)g(xi - xxi
=

j)2.

..

with respect to D(X,,; 04 it follows that


-

1
c,

''

yyj

Exi-

1 (n 1J
c2 (1 a2)

'

'

note the role of the initial conditions.

1
-

(n

a,
-

1)a
'

'

(7.2

''

jya

-a2)

(1

0;

ac

1 aa
-

1,

likelihood method

The maximum

13.3

02loq L

E-

n o ncl
+
2c*
c

<'

.-

Jc*

In order to establish consistency

of

281
0

n
.-u;
2c

'IG

note that the score function:

ti,, we

martingale
defines a zero-mean
(,%,,n > 1). Using
martingales and Markov processes it follows that
42* 11,

-J--1

)(

zr

,?-

(r;.

--+

1 xc

for

;)

l2'

li

--..

and

-+

the WLLN

respectively. Hence

p
&,, x.
-+

Using a similar argument we can show that :,2,--+ c2. For asymptotic
normality we need to express &,,in the form
&

jj

'

n(

.-

,,

deducing that

a)

=;:

x.''rl(4u

/1

--

a)

'w

2
f- l

j
N

,,

X(0, (1 a2)) and x'''n(62


-

c2)

'.w

N(0, 2c4)

> 1
(see Anderson ( 197 1)). It is interesting to note that in the case where la!
the order of magnitude of 1u(a) is no longer n, in which case we need
a different normalising sequence for asymptotic 1)lnormality. In particular for
the sequence
it follows that
n y 1) where c,, ()'- 1 Xh

tc,,,

(.,,(.4,,a)
-

xN(0,c2)

(see Anderson (1959)).It is also important to stress that the asymptotic


distribution of 4,, depends crucially on the distribution of the white-noise
> 1 if t-p's
is not normally
error process )U,,, n y 1). ln the case where
asymptotic
normality
above
result
does not follow.
distributed the

1a1

Methods
Important

concepts

normal equations, method of moments, the likelihood


Least-squares,
function, the log-likelihood function, the score function, maximum
likelihood estimator, likelihood equations, invariance, sample information
matrix, single observation information matrix, asymptotic information
matrix.

Questions
Explain the least-squares estimation method.
Explain the logic underlying
the method of moments.
Why is it that the method of moments usually leads to inefficient

4.
5.

7.
8.

estimators?
Define the likelihood function and explain its relationship
with the
distribution of the sample.
Discuss the relationship between the likelihood, log likelihood and
score functions.
Define the concept of a MLE and explain the common-sense logic
underlying the definition.
State and explain the small sample properties of MLE's.
sample X can be transformed into a
Explain how a non-random
martingale orthogonal sample X.
Explain why
log Ln(()), 9,,, n > l ) defines a zero mean
#l, define a martingale
difference.
martingale and the zf(p), i 1, 2,
State and explain the asymptotic properties of MLE's.
Discuss the relationship
and
between the order of magnitude
asymptotic normality of MLE's.
If we were to interpret the likelihood function as a density function
what does the MLE correspond to?

ttd/dp)
=

10.
11.

Exerdses

Consider the linear model


y'j pj x 1 j +
=

;i ,

li

'v

N-1(, c2)

1,

rl.

0= (p1,c2). Define the MLE-S of 0 and compare its properties


those of the least-squares estimators.
Show that

with

13.3

likelihood method

The maximum

283

where #1is the least-squares estimator of exercise 1, and compare it


with the Cramer-Rao lower bound.
X,,)' be a random sample from the Poisson
Let X EEE(Arl Xz,
distribution with density function
,

PX

JlX

P)

'

:h
%-'

x!

Derive the likelihood, the log-likelihood and the score


functions.
Derive the MLE of 0 and its asymptotic distribution.
(ii)
Let X H(xY'1,
sample from the Bernoulli
X,,)' be a random
distribution with a density function

(i)

4.

.Jlx;p) pxt1 - p)1- x

0, 1.

Derive the MLE and state its properties.


Let X EEE(-Y1,
Xn4' be a random sample from N(p, 1). Show that
.

d log

E .
Let X

(A-1
,

d2 1og J-,t./:;x)

(jjta

tr2)

xc

exo
-

(2z:)

j.

sample from the log normal


1

x
lo2 2cu:
m
-

--'

2
,

Derive the MLE'S of m and c2. (Hint: use invariance.)


Derive the method of moments estimators for m and c2.
(A-1
Ar,,)' be a random sample from the exponential
,

distribution

(X',

8.

A-,,)' be a random
density function

with

tx,' ?'n,

Let X =

cjy

distribution

(i)
(ii)

f-tJt; x)

with

density function

P) P e - 0x
=

Derive the MLE of 0 and show that it is both consistent and


asymptotically normal. What is the MLE of 1/??
Let X EEE(A'1, Xz,
X,,)' be an independent sample from Nlpi, c/).
SILE'S
c2,
cj?
of (c2,gj,
i 1, 2,
For
n, derive the
(i)
/t,,) and their asymptotic distribution. (Hint: check 1.(.$.)
MLE'S
of @,c2j,
For pi
i 1, 2,
n, derive the
cJ)
and their asymptotic distribution.
.Y,,)' be a random sample from the Weibull
Let X -(.:-1,
distribution with a density function
.

=p,

expt

.Jtx;p) ocf =

pxc), x > 0,

284

Methods

where c fs a known constant. Derve the MLE of 0 and fts asymptotfc


distributions.
10*. Let X EEE(X1, X2,
X,,)' where
.

Xf

H(X')
X,i

be a random sample from the bivariate normal distribution

# 1i

A-2i

c2

/?c l c 2

pc 1 c2

c22

..v../V' 0

i zzz 1, 2,

n.

Derive the MLE


of 0m (cf, cl, p) and its asymptotic
distribution.
Of
Assuming that cf c2a 1 derive the MLE
p and its
asymptotic distribution.
Compare the asymptotic variance of /$ and / and explain
intuitively why they differ.
For p
and p 1 derive the MLE'S
of trf and cq and
variances.
their
asymptotic
compare
=

=0

Additional references

Bickel and Doksum (1977/ Cox and Hinkley ( 1974); Kendall and Stuart ( 1973);
(1984); Rao (1973);Rohatgi ( 197$,. Silvey ( 197,5); Zacks (197 1).

Lloyd

CHAPTER

14

Hypothesis testing and confidence regions

The current framework of hypothesis testing is largely due to the work of


Neyman and Pearson in the late 1920s, early 30s, complementing Fisher's
work on estimation. As in estimation, we begin by postulating a statistical
model but instead of seeking an estimatorof pin Owe consider the question
whether 0 65 (.)0c O or 0 (E (41 (.) - Oo is mostly supported by the observed
data. The discussion which follows will proceed in a similar way, though
less systematically and formally, to the discussion of estimation. This is due
to the complexity of the topic which arises mainly because one is asked to
assimilate too many concepts too quickly just to be able to deline the
problem properly. This difficulty, however, is inherent in testing, if any
of the topic ls to be attempted, and thus unavoidable.
proper understanding
effort
made
is
Every
to ensure that the formal definitions are supplemented
explanations
and examples. ln Sections 14.1 and 14.2 the
with intuitive
needed
tests are
concepts
to define a test and some criteria for
discussed using a simple example. ln Section 14.3 the question of
constructing
tests is considered. Section 14.4 relates hypothesis
testing to confidence estimation, bringing out the duality between the two
areas. ln Section 14.5 the related topic of prediction is considered.
=

'good'

'good'

14.1

Testing, definitions and concepts

Let A- be a random variable (r.v.) defined on the probability


(S,
P( )) and consider the statistical model associated with Xi

space

.%

'

tl'
(i)
),J(.X', ?). ()6 0)
X (Ar1 zY2,
-Y,,)/is a random sample, from .J(x',@.
(ii)
The problem of hypothesis testing is one of deciding whether or not some
=

286

Hypotllesis testing and confidence regions

conjecture about 0 of the form p belongs to some subset (i)o of (.) is


x,,)'. We call such a conjecture the null
supported by the data x (.xI x2,
hypothesis and denote it by Ho 0 c Olljwhere ifthe sample realisation x e: l-a
we accept Ho, if x (E C1 wc reject it. The mapping which enables us to define
IJI(see Fig. 11.4).
Ca and C1 we call a test statistic z(X): ?'
ln order to illustrate the concepts introduced so far let us consider the
following example. Let X be the random variable representing the marks
achieved by students in an econometric theory paper and let the statistical
model be:
=

-.+

(ii)

n 40 is a
=

.(Ar; 0)

(.11,Xz,

8
.

exp

(2zr)

Ho1 0= 60

(i.e.A'

f/1 : p# 60

(i.e.A'

against

1 A- - p
j
,

peg

>

E0,100j ;

.Y,,)',

sample from

random

'v

.(x;

$.

N(60, 64:,

The hypothesis to be tested is

(,.)0
=

Nlp, 64/, y # 60),

t60)
(91

g0,100)

(60)
.

estimator of 0, say T,,


Common sense suggests that if some
sample
realisation
Xi,
for the
(1/n) j7- 1
60 then we
x takes a value
will be inclined to accept Hv. Let us formalise this argument:
'good'

earound'

The acceptance
Co

C'1

and

reqion

takes the form 60 -c GY,,< 60 + c, s >0, or

1X,, 601 G

l)

tx:IX, 60I>

l)

ft

x:

is tbe

rejection

region.

The next question is,showdo we choose ;?' If ; is too small we run the risk of
rejecting Hv wpn l is true; we call this tbpe I error. On the other hand, if c is
too large we run the risk of acceping Ho wJlt?nl is false;we call this type 11
error. Formally, if x 6 C1 (rejectHo) and p e:(6)0(Ho is true) - type I error; if
x 6: Ctl (acceptHv) and 0 6 (91 Ho is false) - type 11errortsee Table 14.1). The

Table 14.1

Ho true
Hj false

Ho accepted

Hv rejected

correct

type I error
correct

type 11 error

14.1

Testing, delinitions and concepts

hypothesis to be tested is formally stated as follows:


Sa: 0 e:0c),

Oa

0.

Against the null hypothesis Ho


the form:

(14.1)
we postulate

Sj which takes

the alternati,e

H j : 0#0..o

or, equivalently,
H j : 0 e:0- 1

BE

0- - (-)()

(14 3)

lt is important to note at the outset that Ho and H3 are in effect hypotheses


about the distribution of the sample .Jtx; ?), i.e.
H

(X;

P),

H 1:

/(X;

P), P E:0- 1

A hypothesis Ho or Sj is called simple if knowing 0 e:(.)0 or 0 6 01 specifies


/(x; ?) completely, otherwise it is called a composite hypothesis. That is, if
J(x; ?), 0 (E (.): or /'(x;p), 0 (E(.): contain only one density function we say
that Ho or Sl are simple hypotheses, respectively; otherwise they are said to
be composite.
ln testing a null hypothesis Hv against an alternative Hj the issue is to
decide whether the sample realisation x tsupports'
Ho or Sl. In the former
case we say that Hv is accepted, in the latter Ho is rejected. ln order to be
able to make such a decision we need to formulate a mapping which relates
(')0 to some subset of the observation space
say C(), we call an acceptance
('o rn Cc
region, and its complement C71 (Co t..p C1
we call the
rejection reqion (see Fig. 11.4). Obviously, in any particular situation we
cannot say for certain in which of the four boxes in Table 14.1 we are in; at
best we can only make a probabilistic statement relating to this. Moreover,
small' we run a higher risk of committing a type l
if we were to choose s
of
committing
than
a type 11 error and vice versa. That is, there is a
error
trade ojf' between the probability of type l error, i.e.
.../,'

.%

.g)

btoo

#K(x

e:C1; p e O0)

and the probability

a,

j of type 11 error, i.e.

Prtx c Ca; 0 (E O 1)

j.

( 14.6)

Ideally we would like x j= 0 for al1 0 G O which is not possible for a fixed n.
Moreover, we cannot control both simultaneously because of the trade-off
between them. tl-low
do we proceed, then'?' ln order to help us decide let us
consider the close analogy between this problem and the dilemma facing
the jury in a trial of a criminal offence.
=

288
The

Hypothesis

testing and confidence regions

jury in a criminal offence trial are instructed to choose between:


Hv the accused is not guilty; and
the accused is guilty;

HL :

with their decision based on the evidence presented in the court. This
evidence in hypothesis testing comes in the form of * and X. The jury are
instructed to accept Ho unless they have been convinced beyond any
reasonable doubt otherwise. This requirement is designed to protect an
innocent person from being convicted and it corresponds to choosing a
small value for a, the probability of convicting the accused when innocent.
By adopting such a strategy, however, they are running the risk of letting a
off the hook'. This corresponds to being prepared to
number of
relatively
high value of j, the probability of not convicting the
accept a
accused when guilty, in order to protect an innocent person from
conviction. This is based on the moral argument that it is preferable to let
off a number of guilty people rather than to sentence an innocent person.
However, we can never be sure that an innocent person has not been sent to
prison and the strategy is designed to keep the probability of this happening
very low. A similar strategy is also adopted in hypothesis testing where a
small value of a is chosen and for a given a. p is minimised. Formally, this
amounts to choosing a* such that
'crooks

and

Prtx e:C1; 0 G Oe)


Prlx

a(@ %a*

C().,0 c O1) jt)l

for 0 e:O()

is minimised for pcl')j

by choosing C1 or Ctl appropriately.


In the case of the above example if we were to choose

#r(IT,,

-601

> ;,'

(14.7)

p= 60)

0.05.

a, say a*

(14.8)
=0.05,

then

(14.9)

This represents a probabilistic statement with c being the only unknown.


l-low do we determine c, then?' Being a probabilistic statement it must be
based on some distribution. The only random variable involved in the
statement is X,, and hence it has to be its sampling distribution. For the
above probabilistic statement to have any operational meaning to enable
us to determine :, the distribution of Y,,must be known. In the present case

we know that

T,

'v

c2

(9.2

N p,
-u

where

which implies that for 0 60


=

ztx)-

.S,,- 60
jocyj

)
-

64
=

4:

(i.e.when

.x(()
,

!),

1.6,

( 14.10)

Hv is true)

( 14.11)

14.1

Testing, definitions and concepts

and thus the distribution of z( ) is known completely (no unknown


parameters). When this is the case this distribution can be used in
conjunction with the above probabilistic statement to determine E:. In order
to do this we need to relate jXp,-601 to z(X) (a statistic) for which the
distribution is known. The obvious way to do this is to standardise the
which is equal to
former, i.e. consider 1Y,,
This suggests
pz(X)I.
changing the above probabilistic statement to the equivalent statement
.

-601/1.265

IY,,
-

Pr

60I> c,;

1.265

p 60
=

0.05 where

cx

c
.

1.265

Given that the distribution of the statistic z(X) is symmetric and we want to
determine cx such that ##1z(X)1 > cz) 0.05 we should choose the value of c'a
from the tables of N(0, 1) which leaves a*/2 0.025 probability on either side
of the distribution as shown in Fig. 14.1. The value of ca given from the
N(0, 1) tables is cx= 1.96. This in turn implies that the rejection region for
the test is
=

( 14.13)
C1

tx:1X,,- 60l > 2.48 ).

(14.14)

That is, for sample realisations


x which give rise to T,, falling outside the
interval (57.52,62.48) we reject Hv.
Let us summarise the argument so far in order to keep the discussion in
perspective. We set out to construct a test for Hv : p= 60 against S1 : 0+ 60
and intuition suggested the rejection region (1X,,
k' 4. In order to
determine : we had to
-

60I

(i)
(ii)

choose an a; and then


define the rejection region in terms of some statistic z(X).
The Iatter is neeessary
to cnable us to determine t; via some known
distribution. This is the distribution of the test statistic z(X) under Ho (i.e.
when Hv is true).
(z)

::

-ca

r'

Fig. 14.1. The rejection region

ca

(14.13).

Hypothesis testing and confidence regions

Iz(X)I

Given that C1 (x:


the question
> 1.96) defines a test with a
What
probbility
of
naturally
arises
need
is:
do we
which
the
type 11error j
for'?'The answeris that we need # to decide whether the test defined in terms
of f71 is a
test. As we mentioned at the outset, the way we
or a
problem
of the trade-off between a and j was to
decided to
the
small
and
value
for a
define C1 so as to minimise j. At this stage we
choose a
whether
do not know
the test defined above is a good' test or not. Let us
setting
consider
up the apparatus to enable us to consider the question of
optimality.
=

=0.05,

tbad'

'good'

tsolve'

14.2

Optimal tests

Since the acceptance and rejection regions constitute a partition of the


observation space
i.e. Co k.p C1 I and Co ra C1 Z, it implies that
Prx E:Co) 1
c C1) for all 0 e O1. Hence, minimisation of Prtx 6 Co)
for all 0 E:01 is equivalent to maximising Prtx e:C1) for a1l 0 c O1.
.%

-##x

Dejlnition 1
Theprobability of rejecting So w/lcn falseat some point t?1601,
Prx s C1; 0 pl) is called the power of the test at 0 0j
=

i.e.

Note that
G C1;

Prx

0 p1) 1 - #r(x
=

C0; 0=

t?1)
=

j(p1).

In the case of the example above we can define the power of the test at some
pj eO1, say )l 54, to be PrE(lT)-601) 1.265 > 1.96; p= 54q. 'How do we
calculate this probability'?' The temptation is to suggest using the same
distribution as above, i.e. z(X) N (X,, 60)/ 1.265 N(0, 1).This is, however,
wrong because 0 is no longer equal to 60,.we assumed that p= 54 and thus
(-Y,, 54)/1.265 N(0, 1). This implies that
=

'v

'v

z(X)

N(
,v

(54-60) jj
j.a65

for

.54.

Using this we can define the power of the test at 0= 54 to be


Ar,, 60
-

Pr

j.a6j

y 1.96,. 0=
+ Pr

541=

Pr

(T,,- 54)
j.r6j

(X,, 54)
(54-60)
> 1.96
-

1.265

1.265

1.96
=

(54-60)
j.z6j

0.9973.

Hence, the power of the test defined by Cj above is indeed very high for
p= 54. ln order to be able to decide on how good such a test is, however, we

14.2 Optimal tests


need to calculate the power for a1l 0 c (9: Following the same procedure the
power of the test defined by C1 for 0= 56, 58, 60, 62, 64, 66 is as follows:
#r(Iz(X)1> 1.96; 0= 56)
.

=0.8849,

Pr(lz(X)1> 1.96; 0= 58)

=0.3520,

Pr(Iz(X)1> 1.96,. p= 60)

=0.05,

> 1.96; p= 62)


#r(1z(X)I

=0.3520,

> 1.96; p= 64)


#r(Iz(X)I
> 1.96; p= 66)
#r(Iz(X)I

=0.8849,

=0.9973.

As we can see, the power of the test increases as we go further away from
p= 60 Hjl and the power at 0 60 equals the probability of type l error.
This prompts us to define the power function as follows:
=

Dejlnition 2

,#(p) #r(x G C1), 0 E: 0- is called the power function of the test


dejlned by the rejection region C1.
=

Dehnition 3
is dejlned to be tbe size

.#(@

a
the test.
=

maxpoo,

(t)rte

signcance

leve of

ln the case where Hv is simple, say 0= 0o,then a #ool. These definitions


enable us to define a criterion for best' test of a given size x to be the one (if
it exists) whose power function c?(p), 0 6 Oj is maximum at every 0.
=

ta

Dejlnition 4

..4lcsl of So : 0 6 0-o against S1 : 0 e:01 as desned by some rejection


reqion C1 is said to be uniformly most powedul (UM#) test of size a
.:#(@

(f)

max

(ff)

@0) y

wcrp

0 (2 (90

kt.#*(?)

a,.

J?*(p)

is the

for all 0 6 01,.


of any
power function

other test of size a.

As we saw above, in order to be able to determine the power functlon we


need to know the distribution of the test statistic z(X) (interms of which C1
is defined) under Sl (.c. wen Ho is false). The concept of a UMP test
provides us with the criterion needed to choose between tests for the
same Ho.
Let us consider the question of optimality for the size 0.05 test derived

Hypothesis testing and confidence regions


f (z)
C(

;)
.

.,

(x:z(X) > 1

(14.16)

.645)

1.645

f (z)
C(

4-

=(x:z(X) G 1

'Z

-1

(14.17)

.645

.645)

/ (z)
C(

-0.03

0 0.03

Fig. 14.2. The rejection

regions

above with rejection


(71

tx

=(x:1

z(X)I< 0.038) (14.1 8)

(14. 16), ( 14. 17) and ( 14. 18).

region

Iz(X)1> 1.96).

( 14.19)

To that end we shall compare the power of this test with the power of the
size 0.05 tests (Fig. 14.2), defined by the rejection regions. All the rejection
regions define size 0.05 tests for HvL t? 60 against H3 : p# 60. ln order to
tgood'
and
discriminate between
tests we have to calculate
their power functions and compare them. The diagram of the power
* +
functions
((?),.#
(p) is illustrated in Fig. 14.3.
Looking at the diagram we can see that only one thing is clear
+ +
C)
defines a very bad test, its power function being Jomnlrp by the
other tests. Comparing the other three tests we can see that C( is more
+
< a for 0.< 60. Cib is more
powerful than the other two for 0>. 60 but
+ >
powerful than the other two for 0 < 60 but t
(p)< a for p > 60, but none of
the tests is more powerful over the whole range. That there is no UMP test
of size 0.05 for HvL 0= 60 against Hj : ()+ 60. As will be seen in the sequel, no
UMP tests exist in most situations of interest in practice. The procedure
adopted in such cases is to reduce the class of all tests to some subclass by
imposing some more criteria and consider the question of UMP tests within
=

kbad',

,#(p),

.#-'((?),

.#

'better'

'cut';

.#*(@

293

Optimal tests
@+

'++ (p)

1.00

>.

x.

p'

N
N

/
h

.t.#++''
0.05

ttpy

X ?

(#)

,(# (0 )

60

0
Fig. 14..3. The pow'er functions

..#(f?),

.#*(0j,

.#-t

''-(?),

.#-F

> ''(#).

the subclass. One of the most important restrictions used in this context is
the criterion of unbiasedness.
Dejlnition J

-4 test of Ho1 0 e:(')() alainst


..@

max
t?e O f.,

(0)t

0 g 0- 1 is

said t() be unbiased

#)J

(14.20)

,(#(p).

max

() G (''1j

ln other words, a test is unbiased if it rejects Hv more often when it is false


than when it is true; a minimal but sensible requirement. Another fonn
these added restrictions can take which reduces the problem to one where
UMP do exist is related to the probability model *. These include
exponential
jmilv.
restrictions such as that (l) belongs to the one-parameter
+
ln the case of the above example we can see that the test defined by C)
is biased and C1 is now UMP within the class of unbiased tests. This is
because C1 and C3* are biased for 0 < 60 and 0>. 60 respectively. lt is
obvious, however, that for
*

() 60

HvL

against
H3 : 0 > 60
H

()< 60,

+
the tests defined by C1 and C* are UMP, respectively. That is, for the one+.
lt is
sided alternatives there exist UMP tests given by C1 and C)
of
H3
above
H3
and
in
the
parameter
that
the
space
important to note
case
implicitly assumed is different. ln the case of Hk the parameter space
implicitly assumed is O E60,10% and in the case of H3, O g0,6%. This is
needed in order to ensurc that (6): and 0- I constitute a partition of 0=

Hypothesis testing and confidence regions

Collecting al1 the above concepts together we say that a test has been
dejlned when the following components have been specified:
(FJ)
a test statistic z(X).
(T2)
the sjcc of the test a.
(FJ)
the distribution of z(X) under Hv and Sj
(F4)
the rejection region C1 (t?r,equivalently, C0).
Let us illustrate this using the marks example above. The test statistic is
.

n(.Y,, p) (.Y,, 60)


;
-

r(X)

( 14.2 1)

1.27

we call it a statistic because c is known and 0 is known under Ho and S1. If


wechoose the size a 0.05 the fact that z(X) N(0, 1)underse enables us to
define the rejection region C1 (x:Iz(X)I
> ca) where ca is determined from
Pr(Iz(X)l > cz; 0= 60)
to be 1.96,from the standard normal tables, i.e.if
4tz) denotes the density function of N(0, 1) then
=

'v

=0.05

Ca

/tz) dz= 1
- C

(14.22)

-a.

In order to derive the power function we need the distribution of z(X) under
Sl. Under Hb we know that

w/nuk'np1) ,vN(0,
z*(x)=
-

1),

(j.4 2,3)
.

for any pl G (91 and hence we can relate z(X) with z*(X) by

(p1
-po)

z(x)=z*(x) +

( 14.24)

to deduce that

xtx/'n (p1
-

z(x)

jj

p())

( 14.25)

under H3. This enables us to define the power function as


yca)
p(pj)=#r(x: Iz(x)I
-p())

=Pr

+Pr

z*(x) %

-c,

z*(x) > c,

. (p1
V'n

-vG

(pj

0o)
,

p: s e1.

( 14.26)

Using the power function this test was shown to be UMP unbiased.
The most important component in detining a test is the test statistic for

14.2 Optimal tests


which we need to know its distribution under both Hv and HL. Hence,
constructing an optimal test is largely a matter of being able to tind a
statistic z(X) which should have the following properties:
z(X) depends on X via a
estimator of p; and
(i)
(ii)
the distribution of z(X) under both Ho and Sl does not depend on
any unknown parameters. We call such a statistic a pivot.
lt is no exaggeration to say that hypothesis testing is based on our ability to
construct such pivots. When X is a random sample from N(.p, c2) pivots are
readily avail/ble in the form of
Sgood'

.&4#,,-/tj.x((),

j),

1)

(n
-

S2
'x,z

Gc

(n-

1),

(14.27)
but in general these pivots are very hard to come by.
The tirst pivot was used above to construct tests for p when c2 is known
(both one-sided and two-sided tests). The second pivot can be used to set up
similar tests for y when c2 is unknown. For example, testing Ho2 p
against H3 : p #/t() the rejection region can be defined by
=pv

C1

)x:lzj(X)1 y cz)

where z1(X)

fxcy

x/n

(Y,,-Jt)
,

(14.28)

and cacan be determined by:


flt) d! 1 tz,' .J(r)being the density of the
with n 1 degrees of freedom. For S0: p po
Student's l-distribution
against Sl : p < po the rejection region takes the form
=

W:

C1

zj(X) y c.)

tx:

with

(x =

dr,
.J'(1)

detennining ca.
The pivot
z2(X)

(n- 1)s2
z2(n
Gz
'v

1)

c2. For example, in the


case of a
can be used to test hypotheses about
c()2against Sj : c2 < cj the
c2)
c2
sample
N(g,
Hv
from
testing
y
random
rejectionfor an optimal test takes the form

C7:

wherec.

)x:z2(X) %c.),

is detenuined via
Cz

dzztn 1)
-

a.

(14.31)

296

Hypothesis testing and confidence

regions

Constructing optimal tests


ln constructing the tests considered so far we used ad hoc intuitive
arguments which led us to a pivot. As with estimation, it would be helpful if
there were general methods for constructing optimal tests. It turns out that
the availability of a method forconstructing optimal tests depends crucially
on the nature of the hypotheses (Hv and H3) or, and the probability model
postulated. As far as the nature of Ho and HL is concerned existence and
optimality depend crucially on whether these hypotheses are simple or
composite. As mentioned in Section 14.2 a hypothesis Hv or lk is called
simple if 0-() or (6)1contain just one point respectively. In the case of the
'marks' example above, (6)0 ft60) and 0- I (E0,60) k..p (60, 10% i.e. Hv is
)
simple and H, is composite since it contains more than one point. Care
should be exercised when 0 is a vector of unknown parameters because in
such a case 0-() or 0- 1 must contain single vectors as well in order to be
simple. For example, in the case of sampling from N(p, tr2) and c2 js not
c2j, c2 g R
krlfpwn,Hv3 p pv is nOt a simple h b'potbesis Since Oa
=

'f(tpa,

.)

(1)

Slmple

null and slmple alternative

The theory concerning two simple hypotheses


1920: by Neyman and Pearson. Let
(lJ

$, t?e O )

./'(x',

was fully developed in the

be the probability model and X (.Xj, Xz,


Xn)' be the sampling model
and consider the simple null and simple alternative Hv: 0= ?ll and HL :
f?(), 71) i.e. there are only two possible distributions for *, that
?= f?j O
iss
?1). Given the available data x we want to choose
0o) and
between the two distributions. The following theorem provides us with
sufficient conditions for the existence of a UMP test for this, the simplest of
the cases in testing.
=

.f

.(x;

./'(x'.

( 14.32)

( 14.33)

optimal tests

14.3 Constructing

297

ln this simple case


(x
1

/#(t?)
=

for 0 %
for 0 ()1
=

1)

(14.34)

The Neyman-pearson
theorem suggests that it is intuitively sensible to
base the acceptance or rejection of Hv on the relative values of the
distributions of the sample evaluated at 0= 0o and 0= ?l i.e. reject Hv if the
ratio .Jlx; t?o)/'(x; f?I) is relatively small. This amounts to rejecting Hv when
it a higher
lt is very
the evidence in the form of x favour H3 rgiving
that
solve
Neyman-pearson
theorem
not
note
does
the
the
important to
relating
problem
the
ratio
completely
because
of
the
p())
problem
tx,'
J(x; ?1) to a pivotal quantity (test statistic) remains. Consider the case
where X N(t?, c2), c2 known, and we want to test HvL 0= 0v against f11:
theorem we know that the
p= ?j (p()< ?:). From the Neyman-pearson
ratio
region
of
in
delined
the
rejection
terms
-

tsupport'.

'x-

())
a exists for some a.
can provide us with a UMP test if #rtx c Cj; 0=
The ratio as it stands is not a proper pivot as we defined it. We know,
however, that any monotonic transformation of the ratio generates the
same family of rejection regions. Thus we can define
=

z(X)

w''n (X,1- 04

r;-

gn(o1 .p())g

log /(x, p(),p1)


t:'

+.N-

a
in terms of
Cl

which
=

we can define the rejection region

)x: T(X) y
6: C1 ;

'

?o) a
=

( 14.37)
x if

( 14.38)

exists.

Remark: in the case of a discrete random variable

#rtx

6: C 1 ;

(14.36)

as

c)).

C1 defines a UMP test of size


P#x

0 + 0o
n --!- p

(74)) a,
=

x might not exist since the random variable takes discrete

values.

298

Hypothesis testing and confidence regions

For example if a

,t?qpl)
=

zl

=0.05,

cl

-N

(x)> c)

1.645 and the power of the test is

/ n(p1 p2 )
-

(14.39)

p,

where
under Hj

(14.40)

ln this case we can control j if we can increase the

1 j=
-

#r(z1(x) ,Ac)*)

c'l + c)*

sample

size since

Vn(() - o ).
G

(14.41)

()

For the hypothesis So: p= 0o against Hf : 0= ?: when 0


statistic takes the form

<

0v the test

yz'

(# s p)
z(X) v----s-..n
=

which gives
C1

(2)

rise
=

kf :
x

to the

rejection

region

z(X) % ca*)

Composite

( 14.42)

null and composite alternative

one parameter

cas

For the hypothesis


Ho : 0 > 0o

against
H j : 0 < 0v

being the other extreme of two simple hypotheses, no such results as the
Neyman-pearson theorem exist and it comes as no surprise that no UMP
tests exist in general. The only result of some interest in this case is that if we
restrict the probability model to require the density functions to have
monotone Iikelihood ratio in the test statistic z(X) then UMP tests do exist.
This result is of limited value, however, since it does not provide us with a
method to derive z(X).

14.4

The Iikelihood ratio test procedure

299

Simple Hv against composite H 1


ln the case where we want to test HoL 0= ?a against Sl : 0>. ?o (or 0< ?())
uniformly most powedul (UMP) tests do not exist in general. ln some
particular cases, however, such UMP tests do exist and the NeymanPearson theorem can help us derive them. lf the UMP test for the simple
Ho 0= 0v against the simple Hj : 0= ?1does n()r depend on pj then the same
ln the example
> 0v (or 0 < t?()).
test is UMP for the one-sided alternative
discussed above the tests defined by
t71

)x:z(X) > c))

(14.43)

C1

)x:z(X) %c))

(14.44)

HoL
0= 0v against .S1 : 0 > 0v and Ho1
are also UMP for the hypotheses
0= 0v against Sj : 0 < po, respectively. This is indeed confirmed by the
example above.
diagram of the power function derived for the
Another result in the simple class of hypotheses is available in the case
exponential
where sampling is from a one-parameter
famil),of densities
(normal, binomial, Poisson, etc.). ln such cases UMP tests do exist for onesided alternatives.
'marks'

Two-sided alternatives
For testing S(): 0= 0o against H3 : 0+ ?tlno UMP tests exist in general. This
is rather unfortunate since most tests in practice are of this type. One
interesting result in this case is that if wc restrict the probability model to the
one-parameter exponential familvand narrow down the class of tests by
imposing unbiasedness, then we know that UMP tests do exist. The test
defined by the rejection region

C'1

)x:1z(X)!p: ca)

(14.45)

example) is indeed UMP unbiased; the one-sided alternative


(see
biased
tests being
over the whole of 0.
'marks'

14.4

The Iikelihood ratio test procedure

The discussion so far suggests that no UMP tests exist for a wide variety of
cases which are important in practice. However, the likelihood ratio test
procedure yields very satisfactory tests for a great number of cases where
none of the above methods is applicable. lt is particularly valuable in the
case where both hypotheses are composite and 0 is a vector of parameters.
This procedure not only has a lot of intuitive appeal but also frequently
leadg to UMP tests or UMP unbiased tests (whensuch exist).

300

Hypothesis

testing and confidence regions

Consider
Ho 0 c(.)()

against
H 1 : 0 e:O j

Let the likelihood function be Ltp; x), then the likelihood ratio is defined by

max Atp', x)
2(x) dceo

L0; x)

x)

.1.z(#;

max
,s e)

( 14.46)

L0; x)

measures the highest


x renders to 0 G (6)0and the
value
maximum
of
denominator measures the
the Iikelihood function (see
Fig. 14.4). By definition 2(x) can never exceed unity and the smaller it is the
less Hv is
by the data. This suggests that the rejection region
based on 2(x) must be of the form

'support'

The numerator

'supported'

Cj

)x:2(x) G k)

( 14.47)

and the size being defined by


max @0)

oBo

a.

(x and k as well as the power function can only be dened


L (0 ; x )

L ()

1(#a)

--

---

I
1
I
I

I
l

I
I
I

1
I
I

I
I
I
1
l

I
I
I

I
l
I

J
Fig. 14.4. The likelihood ratio test.

when the

14.4

The likelihood ratio test procedure

301

distribution of 2(x) under both Ho and Hj is known. This is usually the


exception rather than the rule. The exceptions arise when * is a normal
family of densities and X is a random sample in which case 2(x) is often a
monotone function of some of the pivots we encountered above. Let us
illustrate the procedure and the difficulties arising by considering several
examples.
Example 1

Xn)' be a random sample


be the probability model and X tA-j, Xz,
from /'(x;p, c2), Ho : p yv against f11 : p # pv.
=

1k

(2zrc2)

-,:./2

L(p; x)

11

s-c j

exp

?1

.... l , t

)( (xf--pn)2
i 1

(xj-p)2

I=1

pr

2(x)
=

,,

j (xi- Y,,)2
f=1

At first sight it might seem an impossible task to detennine the distribution


of 2(x). Note, however, that

which implies that

n(-'.t'

2(x)

,,

1+

,1

)7 (l -i
=

1,'

14J1

sq

,v,

1+

j;j;i (!
.jj

--

l1/

;!

r(n 1) under Ho,

) under H 1

Since 2(x) is a monotone


the form

C 1 jfx :

-'p,,)

-/t())
here1&r=x7nI((#,,
I'I'',v l(n

-/1...2

-/-to)

V:1(Jt1
,

-/.10)

Jzl G 01

decreasing function of I''Uthe rejection

> cz )

region

takes

Hypothesis testing and confidence regions

and a, ca and @ol can be derived from the distribution of W(


Example 2
In the context of the statistical
H

against
and

(72

of example 1 consider

model

c2
o

H 1 : c 2 + c 0,2

0- K J@x R

under Hv

z (n -

17
'v

1; J ) un d e r H 1

no'j2
n
tr

tr

2:

(E;())

with kj and Lz defined by


z

dzztn

1)

kl

1 -a,

18.5, kz 29.3.
n - 1 30, /j
e.g. if a
Hence, the rejection region is (71
n kj or p y kc). Using the analogy
between this and the various tests of y we encountered so far we can
postulate that in the case of the one-sided hypotheses:

=0.1,

.ftx:

c2 > cj,

H 0 : c2

t;

Hj

c2 H :
0,
1

c2 < c(j
(7.2

>

(7.2

0,

the rejection region is C1


i>1

fx:

py

jf

x : 17%kj );

/(a).

The question arising at this stage is: tWhat use is the likelihood ratio
test procedure if the distribution of 2(X)is only known when a well-known
pivot exists already'?' The answer is that it is reassuring to know that the
procedure in these cases leads to certain well-known pivots because the
likelihood ratio test procedure is of considerable importance when no such
pivots exist. Under certain conditions we can derive the asymptotic
distribution of 2(X). We can show that under certain conditions
Ho

2 Iog

(x) z2(,.)
-

(14.48)

14.5 Confidence estimation


H ()

('

'w

'

distributed under Hv'),

iasymptotically

reads

being the number of

parameters tested. This will be pursued further in Section 16.2.


Confidence estimation
In point estimation whcn an estimator of p is constructed we usually think
ofit notjust as a point but as a point surrounded by some region of possible
standard error of
This can be
error,i.e. 1+ e, where e is related to the
condence
of
interval for 0
viewedas a crude form a
.

;<

0 :A1+ n);

(14.49)

crude because there is no guarantee that such an interval will include 0.


Indeed, we can show that the probability the 0 does not belong to this
interval is actually non-negative. In order to formalise this argument we
need to attach probabilities to such intervals. ln general, interval estimation
refers to constructing random intervals of the form

(14.50)

(z(X)G 0 %f(X)),

together with an associated probability for such a statement being valid.


r(X) and f(X) are two statistics referred to as the lower and upper
respectively; they are in effect stochastic bounds on 0. The associated
probability will take the form
tbound'

#r(#X) G p %f(X))

(14.51)

-a,

where the probabilistic statement is based on the distribution of J.(X) and


f(X). The main problem is to construct such statistics for which the
distribution does not depend on the unknown parameterts) 0. This,
however, is the same problem as in hypothesis testing. ln that context we
'solved' the problem by seeking what we called pivots and intuition suggests
that the same quantities might be of use in the present context. lt turns out
that not only this is indeed the case but the similarity between interval
estimation and hypothesis testing does not end here. Any size a test about 0
with 1
can be transformed directly to an interval estimator of 0
consdence level.
-a

Dehnition 6
The interlml (#X), f(X)) is called a (1
jr alI 0 e:O
#r(z(X) G 0 %f(X))> 1

-.

$ confidence interval for 0 #(14.52)

( 1 -$ is called the probability of coverage of the interval and the statement

Hypothesis

testing and confidence regions

suggests that in the long-run (in repeated experiments) the random interval
(z(X), f(X)) will include the ttrue'
but unknown 0. For any particular
sure' whether (z(X), f(X))
realisation x, however, we do not know
includes or not the true' 0,' we are only ( 1 - a) confident that it oes. The
duality between hypothesis testing and confidence intervals can be seen in
example discussed above. For the null hypothesis
the
ifor

Kmarks'

H () : 7 0()
=

0() 6: 0-

against
H 1 : p # 0v,
a size x test based on the acceptance

we constructed
C()(?o)

x : ()v- cz

c G Xn G 0v +

--c

(,'a

v'n

region

--rx n

( 14.53)

with cx delined by
fr:;4

$(z) dz= 1
-

T'

-a,

( 14.54)

N(0, 1).

C(,, 0= p(,) 1 x and hence by a simple


manipulation of Co we can define the (1-a) confidence interval

This

implies

that Prx

C(X)

is

()o: X,, c!a c % 0v %

--s
,n

x-,
x<

,,

+ cx

tr

-sx,. /

Proo e:C7) 1 - a.

( 14.55)
( 14.56)

ln general, any acceptance region for a size x test can be transformed into a
( 1-a) confidence interval for 0 by changing ()-a, a function of x 6,7/,. to C, a
function of 0o s0.
One-sided tests correspond to one-sided confidence intervals of the form
#r(z(X) G 0j > 1 -a

(14.57)

Pro Gf(A-)) y 1

(14.58)

-a,

In general when O R''', m 1. 1, the family of subsets C(X) of O where C(X)


For example,
depends on X but not 0 is called a random lzgt??.
=

C(X)

)#:z(X)

:$

0 %f(X))

()-(X)

)#:z(X) % p).

The problem of confidence estimation is one of constructing


region C(X) such that, for a given a e:(0, 1),

#rtx : 0 G C(X)/p)> 1 a,
-

for a1l 0 G (4.

( 14.59)
a random

(14.60)

14.5 Confidence estimation

305

It is interesting to note that C(X) could be interpreted as

C)X)

tp: r(X)

0i fk(.X), i

t:i

:$

1, 2,

in which case if (r(X), i(X)) represent


intervals

??l

( 14.6 1)

independent

(1

(zf)

confidence

(1

-a)=

J-l(1

-j).

The duality between hypothesis testing and confidence estimation does


not end at the construction stage. The various properties of tests have
corresponding counterparts in eonfidence estimation.
Djlnition

.4 hlmil)' ( 1 a) Ie,el conjdence rtxitpn.s C(X) is said to be


uniformly most accurate (UNIA) among ( 1 a) Ievel
regions C*(X) /'
0j'

x'tpny'zf/f?n'ta

#r(x: 0 6: C(X) #) G#r(x: 0 E: C*(X)

'#)

for al1 0 6 0.
( 14.63)

This clearly shows that when power is reinterpreted as accuracy it provides


us with the basic optimality criterion in confidence estimation. lt turns out
(not surprisingly) that bkbp tests lead to UMA confidence regions. This is

because
0 t5 C7(X) (9 if and only if x 6

rt)tpl

( 14.64)

,?'',

where C()(p) represents the acceptance region of Ho 0


confidence region C(X) can be formed by

0o. In effect the

f'tx) )p():x e t)-o(p()))s

( 14.65)

and the acceptance region C()($ by


t)-(#())
=

hence

tx:0v e:C(x)),

#?'(x: x c 7()(p())0= p()) Prtx: 0v c f7(x) 0= 0vj > l


=

a.

( 14.67)

This duality between C04*0) and C(X) is illustrated below for the above
example assuming that n l to enable us to draw the graph given in Fig.
14.5.
Continuing with this duality it comes as no surprise to learn that
unbiased tests give rise to unbiased confidence regions and vice versa.
=

306

Fig. 14.5. The duality between hypothesis testing and interval estimation.

e:C(x)/#J 1 - a for #1 0z i 0.
(14.68)
ln general, a
test will give rise to a good conlidence region and vice
Lehmann
(1959)).
versa (see

#r(x:

0L

:$

igood'

14.6

Prediction

In the context of a statistical model as defined by the probability and


sampling model components, prediction refers to the construction of an
toptimal' Borel function 1( ) (seeChapter 6) of the form:
'

'

): 0-

-+

k??l

(14

.69)

guess' for the value of a random variable


which purports to provide a
zY,, 1 which does not belong to the postulated sample. If we denote the
A-,,)' and its distribution with D(X,,; 04then we
sample with X,, E (#1,
need to construct l0) which for a good estimator l,, of 0, X,, 1 /(i) is a
Sgood

Prediction

14.6

307

'good' predictor of .Y,,+ 1. Given that 1, ll(X,,) we can define /( ) as a


1
lnction of the sample directly, i.e. /(,,) !(X,,). Properties of optimal
predictors were discussed in Section 12.3, but no methods of constructing
such predictors were mentioned. The purpose of this section is to consider
this problem briefly.
The problem of constructing optimal predictors refers to the question of
-how do we choose the function 1( ) so as the resulting predictor to satisfy
certain desirable properties?' To be able to answer this question we need to
specify what the desirable properties are. The single most widely used
criterion for a good predictor is minimum mean square error (MSE). This
criterion suggests choosing /( ) in such a way so as to minimise
'

'

'

X,, +

l(X,,))2,

(14.70)

is defined in tenus of the joint distribution of Xn j and Xn, say,


X,,;
1,
#). lt turns out that the solution of this minimisation problem
is theoretically extremely simple. This is because (70)can be expressed in the
where
DX,,

'(')

form

'(X,: +

l(X,,))2

'().Y,,+

Exn

7(-Y,,
+ 1 c(X,,)) + .tF(-Y,,+1/c(X,,))

1-

+ 2'('t.Y,,

c(X,,)))2 + ECECXn+

flX?,

+ 1.

1 -

Ezrn

/(X,,)))2

c(X,,))

l(X,,))2

l/c(X'')) .t'(A-,, + 1/c(X,,) l(X,,))).


-

(14.7 1)

Using the properties CE5 and SCE5 of Section 7.2 we can show that the last
term is equal to zero. Hence, (71) is minimised when
l(X,,)

f;(..,,

( 14.72)

1/c(X,,)).

When this is the case the second term is also equal to zero. That is, the form
1(.Y,,)which minimises (70) is
of the predictor
j
.-,,.

.i'

11 +

Exn

1/t4X,,)),

(14.73)

where c(X,,) is the c-field generated by X,, (see Chapter 4).


As argued in Sections 5.4 and 7.2, the functional form of EXt, .I/c(X,,))
depends entirely on the form of the joint distribution Dx,, + 1, X,,; #).For
example, in the case where D4Ar,,+ 1, X,,; #) is multisariate normal then the
conditional expectation is linear, i.e.
E (X,,

WlX,,)) #'X,,

(14.74)

(see Chapter 15). ln practice, when D(',, + 1 X,,; #) is not known linear
predictors are commonly used as approximations
to the particular
,

308

Hypotllesis testing and confidence regions

functional form of
'(A-,,+ 1/c(X,,)) =g(X,,).

ln such cases the joint distribution is implicitly assumed to be closel)


approximated by a normal distribution.
The prediction value of Xn + j will take the form

.4
l

E(Xn + 1/X,,

x,,),

where x,, refers to the observed realisation of the sample X,,. The intuition
for the value xY,, 1 must be the
underlying (76)is that the best
values, in view of the past realisations
of
a11
possible
of
its
average
iguess'

#(XC=

It

then

Is

xn)

(seeFig.

14.6).

important to note that in the case where .Y,,

'(A-,,+ 1/W(X,,)) E(Xn +


=

and X,, are independen:

1).

(14.77)

expectation
with the marginal
That is, the conditional
coincides
expectation of aY,, : (seeChapters 6 and 7). This is the reason why in the
case of the random sample X,, where Xi N((). 1), i 1, 2,
n, if A-,, 1 is
also assumed to have the same distribution, its best predictor (in MSE
sense) is its mean, i.e. kn 1 (1//rl) )'-1 Xi (seeSection 12.3). lt goes without
saying that for prediction purposes we prefer to have non-random sampling
models because the past history of the stochastic proess t.Y,,, n )y:1) will be
of considerable value in such a case.
Prediction regions for A-,, take the same form as confidence regions for
..

'v

.1

1
I
I
I

I //
I /

51

.''

ly

Il sk-x x
ll NX
ll N
x

'x

Il

ll
I

1
I
I
I

n n

14.6

Prediction

309

0 and the same analysis as in Section 14.5 goes through with minor
interpretation changes.
Important

concepts

Null and alternative hypotheses, acceptance region, rejection region, test


statistic, type l error, type 11 error, power of a test, the power function, size
of a test, uniformly most powerful test, unbiased test, simple hypothesis,
lemma, likelihood ratio test,
composite hypothesis, Neyman-pearson
level, uniformly most accurate
pivots, confidence region, condence
confidence regions, unbiased confidence region, optimal predictor,

minimum mean square error.

Questions
Explain the relationship between Hv and H3 and the distribution of
the sample.
and rejection
Describe the relationship
between the acceptance
regions and (-)0 and O1.
Define the concepts of a test statistic, type l and type 11 errors and
probabilities of type l and 11 errors.
Explain intuitively why we cannot control both probabilities of type l
and type 11 errors. How do we
this problem in hypothesis
testing'?
Define and explain the concepts of the power of a test and the power
function of a test.
Explain the concept of the size of a test.
Define and explain the concept of a UMP test.
State the components needed to define a test.
Explain why we need to know the distribution of the test statistic
under both the null and the alternative hypotheses.
Define the concept of a pivot and explain its role in hypothesis testing.
Explain the concepts of one-sided and two-sided tests.
Explain the circumstances under which UMP tests exist.
Explain the Neyman-pearson
theorem and the likelihood ratio test
procedure as ways of constructing optimal tests.
Explain intuitively the meaning of the statement
isolve'

10.
11.
12.
13.
14.

#r(z(X) G 0 Gf(X))

1 - a.

Define the concept of a ( 1 -a) confidence region for 0.


Explain the relationship between C()(?()),the acceptance region for
HoL
0= 0o against Sj : o 0v and the confidence interval C(X) for 0o.

Hypothesis

310

testing and confidence regions

Define and explain the concept of a (1

confidence

uniformly

-a)

most accurate

region.

Exevcises
For the
example of Section 14.2 construct a size 0.05 test for Hv
HL :
against
60
0 < 60. Is it unbiased? Using this, construct a 0.95
0
confidence
interval for 0.
level
significance
c2)
the following hypotheses:
and
consider
Let X N(p,

'marks'

'v

(i)

ff0: /t G/zo,

(ii)

S(): c2 >

(iij)

H () : p

(iv)

H,.. /.t

(7.24),

y ()

c2 >

f11 : p Apo,

0, po - known;

c2 < c2(),

HL :

c2()

H 1 : p # p ()

c2

-pv-

> czt), H,

c2

- known;

:#:

2:,.

c2 < czo.

y +pv,

State whether the above null and alternative hypotheses are simple or
composite and explain your answer.
X,,)' be a random sample from N(0, 1) where 0 e:OLet X H(-Y1,
.tft?1,p2). Construct a size tz test for
.

against S1 : 0 0z.
=

Using this, construct


for p.

(1

tz)

significance

level conlidence

interval

Xn)' be a random sample from a Bernoulli distribution


with a density function
Let X EH (71,

fx; 0)

Construct a size

p'I(

041 -

'Y

0, 1.

test for HvL 0 %0v against S1 : 0>. 0o. (Snl.'


is binomially distributed.)
Let X EEE(A-j,
A-,,)' be a random sample from N(p, c2)
(i)
Show that the test defined by the rejection region
.

cl

(x

x:

(j)'=1 Xi)

(.S,,-yJ

Un

>k

sl

11

1j

y xi

-.:,,)2

.j

defines a UMP unbiased test for Ho'. p G/t() against Sl : p wpo.


Derive a UMP unbiased test for Hv.. p y/to against Sl : p <po.
(ii)
BE
F.)? be two random samples
Let X tA-j
Xn)' and Y (F1
from Ni, c2j) and Np, c2a) respectively. Show that for the hypotheses'.
,

(i)

H o c 12 c c2 H j : c lj > c

(ii)

.:2

:$

,.

-.

()

(7

2)

y: c

22
,

:c

2j.

<

2,

c ;

Additional references

(iii)

H () : c 21 c 2,
=

H j : c 2j # c 2,

the rejection regions are:


Cl

c1

x y : z (x y) y; k 1 )

.f(

xf

tx

y : z(x y) %k

.z

cl (x,y : ks %ztx,
where
respectively,
=

y)

,::

/,j.)

11

Z (A-i

-f,,)2/n

ztx,y)

,,,

Z (1'i=

1V?nl2

define UMP unbiased tests (seeLehmann ( 1959), pp. 169-70).


What is the distribution of ztx, y)?
Construct a size a test for HvL cf cq
the likelihood ratio test procedure.
=

c2, H3 : cf + cl using

Additional references
Bickel and Doksum ( 1977); Cox and Hinkley ( 1974); Kendall
Lehmann (1959); Rao ( 1973); Rohatgi ( 1976)) Silvey ( 1975).

and Stuart

(1973);

CHAPTER

15*

The multivariate

15.1

normal distribution

distributions

Multivariate

The multivariate
normal distribution
is by far the most important
distribution in statistical inference for a variety of reasons including the fact
that some of the statistics based on sampling from such a distribution have
tractable distributions themselves. lt forms the backbone of Part IV on
and thus a closer study of this
statistical models in econometrics
distribution will greatly simplify the discussion that follows. Before we
consider the multivariate normal distribution, however, let us introduce
some notation and various simple results related to random vectors and
their distributions in gcneral.
Let X EB (A-1 Xz,
Xn)' be an n x 1 random vector defined on the
probability space (S, P )).The mean vector '(X) is defined by
,

.%'

'

F(A-1)
f(X)=

F(A'2)

r
1

Kp,

an

x 1 vector

s(-v,,)
and the covariance

matrix Cov(X) by

covt.vlx,,l I
,

covtxcAVar(A',,)

/1

) 1l
j MY,
J

.j

15.1

distribution

Multivariate

definite matrix, i.e. E'


where E is an n x n symmetric non-negative
Iljth
element of E is
x'Ea> 0 for any x G R''. The
ojj. f;(-tf
=

jlxj.

/t

pj),

i,j

1, 2,

E and

( 15.3)

u.

In relation to a'Ya y 0 we can show that if there exists an x e: R'', a#0 such
that Var(a'X)= a'Ea 0 then Pr@'X c) 1 where c' s a constant (only
holding
constants have zero variance), i.e. there is a linear relationship
Xn with probability one.
among the nv.'s .Y1
=

Lemma 15.1

Ltzhnrtltk15.2

AS(X) + b

(f)

F(Z)

(/)

Cov(Z)

.F)rZ

and ct/r/rfkrlc'tz E

//' X bas mean y

AX + b

Ap + b,'

SEIAX + b -(Ap + b))(AX + b (Ap + b))1


AFIX
p)(X
p)'A'
AYA'.
=
-

Let X and Y be n x 1 and m x 1 random


'(Y) py, rllt?rl

wr

pt?cltpry

F(X)

/lx,

Cov(X, Y)

I)tcovtA-fFy.lfyj FEIX - px)(Y

/ly)'(l.

Correlation

So far correlation has been defined for random variables (r.v.'s)only and the
question arises whether it can be generalised to random vectors. Let
F(X) 0
X into
=

(without any loss of generality), Cov(X)

X EEE
Define Z

A' 1
X2

X2:

(n- 1) x 1 Y=

E. X n x 1, and partition

11

5'1 ,

J2 1

Y2 2

x'X2 and 1et us consider the correlation

Corrtzl A-j )

between A-: and Z

5'12

---t-1

.---u-.

c21(x E22*2
,

This is maximised for the value of x which minimises


S(xY1

x'Xc)2

(see Chapter 7), which is

12-21Jz1 and

we

define the multiple correlation

The multivariate

normal distribution

coefhcient to be
1
2N22
- J21

(J1

c'l 1

c o rrtx

z/xa),

ln the case where


X

X1

X1: k x 1, X: (n k) x 1, E
-

X.a

we could define the r.v.'s


coefficient is
Corrtzl

ZL

'1Xj and Z:

E 1c

E21

E2 2

x'1 E 1 2 x 2

za)

(1El 1l)'2X222)
---u.
,

(y 1 x 12 x 22
-

E 11

'aXa whose correlation

From the above inequality it follows that for x2

C1 c

1x

21 y 1

)j

(1, x 1 1 a 1 ).#

>

Ec1Y211

corrtzj za).

(15.10)

E1-/El ztz-zs Ecj has at most k non-zero eigenvalues which measure


association between XI and Xa and are called canonical correlations.
Let
X1
X

where Xa : (n 2) x 1

X1

the

X3

c11
E

Another

between
form the

J13

ca1

c22

Jc3

J31

J32

Y33

form of correlation of interest in this context is the correlation


and X1 given that the effect of Xa is taken away. For this we

-:-1

r.v.'s

F1

X3

b'lxa

and

X2

b'2Xa

Corrt'l

F2)

is maximised by bl Ea-alcaj and b2 Ya-alo'a:as seen above. Hence We


define the partial correlation coejhcient between X3 and Xz given Xa to be
=

pl 2.3

Ecl1

cl

- a5z
3 E 33
c12
1
l
--y-3E33 /31J
- /32)1
1c22 (r23E3:5
-tr1

b Corrth

Fc).

15.2

normal distribution

The multivariate

15.2

normal distribution

The multiuriate

normal density function discussed above was of the form

The univariate

( 15.12)
X,,)' when the xY,.sare l1D
The density function of X EEE(-Y1, X2,
normally distributed r.v.'s was shown to be of the form
.

- n/ 2jg2)

gaj

11/ 2

1k

exp

11

p) l

(xy
-

20-a f

Similarly, the density function of X when the .Ys are only independent,
Xi Nqpi, cJ),izuu1, 2,
n, takes the form

i.e.

'v

,,/2

= (27:)

(c:, c2,

2
c,,)

j.
-

exp

11

2i

)-(
1

.Xf

1!

pi

.....

(1j

o'i

j4)

Comparing the above three density functions we can discern a pattern


developing which is very suggestive for the density function of an arbitrary
normal vector X with f(X) p and Cov(X) E, which takes the form
=

E)-l expt

-?'/2(det

tx; p, E) (2z:)
=

-.(x

1(x

/z)'E -

Jl)),

(15.15)

and we write X N(p, E). lf the Arfsare 1lD r.v.'s E c2I,, and (detX) (c2)';.
On the other hand, if the A'zsare independent but not identically distributed
=

'.w

11

diagtcz1.

ln the case of n
p=
o

p1
pz

(det E)

c2)
l1

(det N)

and

(cJ.
) (c
=

2:
,

ct;

)
.

2
E

t7'l2

21 2a(

c c

c 12

tF12

1 p 2)
-

c2

G2

>

Pc 1c 2
2
G2

p/1tF2

for

1< p

<

where p

12
,

(71J2

3 16

Te multivariate

normal distribution

and

(see Chapter 6). The standard bivariate density function can be obtained by
defining the new r.v.'s

(1)

Properties

N(p, E) then Y (AX + b) NlAp +b, AEA') for A: m x n


and b: m x 1 constant nktkrl'l't't/s. p.(7. i)' Y C'X,c # 0, Y x Nlcp, c2E).
This property shows that if X is normally distributed then any linear
jllnction of X is also normally distributed.
(N2)
Let Xtx Ntpf, Xr), r 1, 2,
F be independently' distributed
?n)' arbitrarq'
constant matrices Aj, t 1, 2,
random rpcltary, tben

(N1)

Let X

'v

'v

./r

j A,X,
l

x.

jj
=

A,s,
f

jjl (A,tA,')

The converse also holds. If the Xrs are llD then pt

and

(-'' $x,)
'

x(,z, y' x)

p, Yf

E, t

1, 2,

The multivariate

15.2

normal distribution

Let X N(p, E) then the -Yfs are independent if and only f.Jaij 0,
c,,,,). ln general, zero
n, i.e. E diagtc 1 1
i #.js i,j 1, 2,
implq'
independence
but in tbe case ofnormality'
covariance does not
=

'v

are equivalent.

r//? lwo
Ij' X

Np, Y) then the marginal distribution

'w

0j'

an), k x 1 subset Xj

whvre
X EEE

X1
,

X2

p,

pz

E11

E 1z

.-

.a

=2

L 22

q/r

N(p1 E1 1). This jllows jom property' N1


Similarly, Xa N(p1, E2a).
k x n), b
These can be verified directly using
Xj

'v

(Ik: 0),

=0.

'v

J(X

1;

(X;

P1)

and

P)dx 2

/'(x2;p2)

/'(x;0

dx1

although the manipulations involved are rather too cumbersome.


k 1, this property implies that each component of X Np,t)
normally distributed; the converse, however, is not true.
'v

(N5)

F()r

1t?

partition of X considered
Xl
(?J given X2 takes the

same

distribution

This follows from property


Ik

AX

0
=

in N4 the conditional

(X2 - p2),
+ E1 2E2-21

NpL

E1 1

E1 2E2-21
E21).

N 1 for

- E 1 2E zh
I,, -k

is also

./?nl

(Xl/X2),

Taking

and

X 1 - E 1 aE c-21X z

Xa

(15.19)

b 0
=

Cov(AX)

Ej 1

E 1 21

2-21

:22 1

0
.

Ec c

(15.20)
From this we can deduce that if E1 2 0 then X1 and Xa are independent
since(X1 Xc) x. N(/z1 E1 1). Moreover, for any E1 c, (Xj -E1cE2-21Xc) and X2
given that their covariance is zero. Similarly, (X2,/Xj)
are independent
X(#2 + E21E1-/ (X1 /11), E22 E21E1-11E1 2). ln the case n 2
=

'w

X3./Xz)
,

(7'

'v

N p.tl+ p - !. (-2 -p2), cfl 1 - p 2 )


Gz

( 15.2 1)

These results can be verified using the formula

/'(x1 x2; /)

'.

Jtx; p)

.I'(xc-,0z)

(15.22)

The multivariate

normal distribution

function

N5 suggests that the regression


'IX 1 Xc

x2)

pL

E1 cE2-altxa pz)
-

is linear in

xc

(15.23)

and the skedasticity jnction


Cov(X1/Xa

x2)

E1 yz'tft l cE2-21E2j

is

freeof

These are very important properties of the multivariate


and will play a crucial role in Part lV.

(2)
Without

partition

Multle

X1

X2

let X

c11

,.w

N(0, E), X: n x 1 and define the

c12
,

..

=22

J21

where X2: (n 1) x 1 and .,Y1: 1 x 1. The squared


coefficient takes the form
R2

(3)

normal distribution

correlation

any loss of generality

Xv

x2.

vartx

,'x

Vart.-

multiple

a 1 cta-clcc,

( 15.25)

c:

correlation

Partial corvelation

further into

Let Xa be partitioned
Xz

X z EB

Xa

X s : (n - 2 ) x 1

A' 2 : 1 x 1

with
J22

Ec2

J23

E33

G5z

The partial correlation

tz1 1

J1 3

c21

az

J31

E33

between Xj and X1 given X3 takes the form

covtxl xcy/'xal
,

/?1 2

'

--.y

EVar(A-1/Xa)(1IEVar(.Yc,Xalj
=

Ec11
-

c1 (!Ea-31a aa
o'13E:'-3'/..51 lEcc2 - c'cata-a'

c 12 -

,a2(1-:

'

(15.26)

319

15.3 Quadraticforms
with

*'

X ,''X3
XN
1
A' ,'X 3

3E 3-31X a

J1

3-31

X :9

J 2 3E

c 21

3E a-alTr :$1

J1

o.z 3E

J 1 3 E 3- 3 J 3 2
- 1

(T 1 2 -

-I J 3 2 o'z,

/23E33 c32

( 15.27)

15.3

Quadraticforms

(Q1)

Let X

to the normal distribution

related

Np, E), w/llrl E> p, X : n x 1, then

,v

p)'t

(j)

(X

(f)

XE - IX

(X

g2(n;

'v

z2(n)-

-p)

)-

chi-square;

chi-square;

non-central

These rtlsl/r,s depend crucially on E beinq a


p'Ewt?rt?
E > 0 tbere exists a ntpn-sfngur
ptlsflt'l dejlnite matrix because
matrix H, E HH' Z H- '(X pj N(0, 1,,), i.e. the Zis are
independent and (X p)'E - 1(X pj j)'- 1 Z/. Similarly
().
For r/yl MLE q y,
1p.

./lr

z=.

'v

./),-

T'

=>

Let X

(Q2)

1
N p, - E
F

'w

1
'(/-p)

N(p, 1,,), then


(X

(t)

X,

-p)'E--

T
z..-g

p)'A(X

z2(n).

for A a
p)

h'om N2

yyn-lrnt?r/-ft.'

(A'

A) matrix

zzttrA),'

'.w

'wzzttr

A; ), j= p'Ap.
X/AX
(ff)
A). Note tr A re-krs to the
if and only if A is idempotent (.c.
7jj).
of
A
A
(tr j)'=1
rmc.l
matrix, l/l?n
Let X Np, E), Y > 0 and A is a
-*.2

(Q3)

.s.rnlm'r?-c'

x,

(X - #)'A(X

(f)

X'AX

if and

(Q#)
X

()

zzttrA);

-p)

'v

zzttrA', J), J

x.

only if AE is ftcrnptprenr

Le

'v

Nlp, E),

E > 0,

and

p'Ap,.

(f.t?.AYA

(xz
X1

X EEE

A).
p1
,

yg

E 1l

xaj xaa

for X1 : k x 1 the dlerence


1(x
E(x- p)'E -

p)

(X1

p1)'E1-11(Xl
-

JIjlll

X1 2

z2(n k).
-

The multivariate

normal distribution

N(p, E), then

'v

matrices.
AEB 0.

(./1

X'AX

pr

and

(./2

and B sjnlmetric and jleplpt?rtv7r


X'BX ar? independent if and onlq' 4'/

(Q6)

Let X Np, 1,,), then


A a syrn#npr/-t.' and idempotent matrix antl
B a k x n l'nr/rrjx,
then X'AX and BX are independent #' BA 0.
Let X N(p, 1,,) and Z xN(0,1.) tben
A and B symmetr.
./ip,-

(Q7)

./,-

'v

matrices
flt?rn/ptpr???l

X'AX
-

tr A

Z BZ
g

F( t !- A t r B

.j

(j)

y' A p

tr 11
15.4

Estimation

Let X H (Xj X2,


Xw)'be a random sample from N(p, E), i.e. Xf N(p, Y),
1,
'lL
2,
X
being
l
a T x n matrix. The likelihood function takes the
,

'w

form

= k(X)(27:)

ztdet

'-nT

E)-T'

1'

exp

)
=

(Xt X'E- 1(.Y - p)


t
-

( 15.28)
1ogL(0; X)

c-

r
n T lo2 27: T logtdet E) 1
(Xf-p)'EF
--s
2
2
2t j
--,-

=-'-

1(X,

-p).

( 15.29)

Since
(Xf -/z)'E- 1(Xr
-p)-

/'

)(Xr -&r)E-

1(X,

-&w)

T(X, g)'X - 1(Xw p) tr E - IA + T'(X.wp)/E - 1(X.w#


=

Xw=

log L(0; X)=

c*

T
-

--

1
T

X,,

logtdet E) -

j( (X, -&w)(x,

-&w)',

( j 5.3())

tr E- IA
( 15.3 1)

15.4 Estimation
j
log1-(p,'X)
wx- (: v . JI )
.
(p
t?1og f.te, X) F E 1 A 0 = i
=Y
2
t?E- 1
.()

--

l 1og fatp,' X)
Lp t?/l'

'rx

.;.:y

(Xl -Xw)(Xl

-Xw), (15.33)

f?log L4:; X)
. PE

(jj.ya)

..

F
=

Y(I - : )E.
( 15.34)

Hence,

Xwand i are the

of p and E respectively.

MLE'S

Properties
Looking at Xwand i we can see that they correspond directly to the MLE'S
in the univariate case. It turns out that the analogy between the univariate
and multivariate cases extends to the properties of Xwand i.
ln order to discuss the small sample properties of Xwand i we need their
distributions. Since Xwis a linear function of normally distributed random
vectors it is itself normally distributed

Xw'vN /z, E
T

The distribution of i is a direct generalisation


of the chi-square
distribution, the so-called Wishart distribution with F- 1 degrees of
freedom (seeAppendix 24.1), i.e.
Tk

'v

I#,(E, F-

1)

( 15.36)

From these we can deduce that .E'(Xw)p unbiased estimator of p and


'(f) g(T'- 1)/FjE
- biased estimator of E. S (1/(T- 1)j jt (Xt Xw)
(Xf-Xw)' is an unbiased estimator of E.
X / and i are independent and jointly sufjlcient for (!z,E).
=

(2)

Useful distributions

Using the distribution (T- 1)S I4'(E, T- 1) the following results relating
to the sample correlations can be derived (see Muirhead ( 1982)):
'v

Simple correlation

ri.i
=

sij S Lsij?i-j,i,)
siisjj
,

1, 2,

.,

n.

The multivariate

normal distribution

If

For M

Lrij?ijwhen

diagtcl

1,

c.),

2 logtdettsl/

,.v.

.;

zztntn-

1:.

(15.38)

Multiple correlation
Rmw

ls
sl zS 2-z a j 'i

( 15.39)

S11

Under
R

k;

w-nj

=0,

h;-

.stu

.j,

w.u).

1-.!'i/

( 15.40)

In particular,
1

n'(.l'il.)=
r

2(T-n)(n

Vart y #)=

(F c

1)

(15.41)

1)4F- 1)

The distribution of 11.2when R # 0 is rather complicated


commonly use its asymptotic distribution:

v'..'-

1)(w -.R2)

x(0,4R2(1 -R2)2)

,v,

0..::Rl

and instead we

<

1.

(15.42)

On the other hand, under


R

=0,

1).)

('r-

'w

z2(n
-

1).

to R2 is the quantity

A closely related sample equivalent

#2F

'

X 1 X 2 (XIX 2

)--1X'2 x 1 l

(x1x1)
t

(15.43)

(15.44)

x 1 X c: T x k
The sampling distribution of 112was derived by Fisher (1928)
but it is far
variance
complicated
of
and
1ts
be
direct
interest.
too
to
mean, however,
are
of some interest

x1 : T

testing and consdence

Hypothesis

15.5

4R2(1

Var(#2)

R2)2

+ O(T-

323

regions

2)

(15.46)

(see Muirhead (1982)).On the 0( ) notation (seeChapter 10). Hence, the


mean of 112 increases as k increases and for R.2
'

=0

'(#2)

T- 1

+ O(T

).

(15.47)

Partial correlation
1

s 1 2 s 1 3 S 3- 3 s 3 2
!
s31)-(s22
:'S33
(s1l -s:
-

P12

Under

pt 2.3

15.5

.-scass:s

( 15.48)

'

s32)-j

0.

(15.49)

Hypothesis testing and confidence regions

Hypothesis testing in the context of the multivariate nonnal distribution


will form the backbone of testing in Part IV where the normal distribution
plays a very important role.
For expositional purposes let us consider an example of testing and
confidence estimation in the context of the statistical model of Section 15.4.
Consider the null hypothesis HoL p 0 against 1-11: p# 0 when E is
unknown. Using the likelihood ratio test procedure with
=

L0. x)

maX

Ti)-'lT'-n-

:)-'F'2(det

c*tdet

c*tdettt +

t) expt

-1.Fn),

f-(p; x)
max
0 e (6,0

Fi/F-n-

zldet
Xw&'w)-r

(15.50)

p e (.)

1)

expt -JTn),

( 15.5 1)
we get

2(x)

det(TE)

dettt

+&w&r,)

vfz =

1+&J

.r.,z

--

j:

j ..).ggzjg-

(15.52)
where

z12=

'r&ys-l&

s= T-1 t

(15.53)

324

The multiuriate

normal distribution

is the so-called Hotelling's statistic which can form the basis of the test,
being a monotone function of 2(X). Indeed,
T- n
H1
n(F- 1)

(see Muirhead
based on the
t-j

(1982:.Using

rejection

x:

1)

n(T-

the Hotelling

H 2 > cu

rl). For

Tp E

( 15.54)

statistic we can define a test

(15.55)

p= po against H3 : ## pv the

HvL

test statistic

T'(Xw JIp'S - 1(Xz


-

(15.56)

-,:).

region for the latter

Using the acceptance


Ca

fcze

H2

region

dF(n, Twhere a
takesthe form
=

F(n, T - n; J),

'v

x:

Important

-.

rl(T- 1)

we can define a ( 1
C(X)

H= %;ca

( 15.57)

conjldence region for p of the form

-a)

Jl: W(Xw p)'S - 1(Xw p)


-

:$

(T-

1)n

(F-n)

ca

concepts

Mean and covariance of a random vector, multiple correlation, canonical


normal density function,
correlation', partial correlation, the multivariate
marginal and conditional normal density functions, Wishart distribution.

Questions
Explain the various correlation measures in the context of random
vectors and compare the general formulae with the ones associated
normal
distribution.
Comment
with the multivariate
on the

similarities.
Discuss the relationship

between normality and linearity.


Discussthe marginal and conditional distributions of a subvectorxj
of
normally
random
vector
X
distributed
(X'1 X?2)'.
a
chiUnder what circumstances is the quadratic form (X #'A(X
=

-p)

square distributed?

15.5

Hypothesis

testing and confidence regions

325

State the conditions under which the quadratic forms X'AX and X'BX
will be independent.
6. State the conditions under which the ratio of the quadratic forms X'AX
and Z'BZ will be F-distributed.
Under which circumstances will the quadratic form X'AX and BX be
independent?
8. Discuss the properties of the MLE'S X,,and i of p and E respectively in
the context of the statistical model defined by the random sample X ESE
X,,)' from Np, E).
(X1 X2,
,

Additional references

Anderson

Mardia
(1984),.

et al. (1979); Morrison

(1976),. Seber (1984).

Asymptotic test procedures

As discussed in Chapter 14, the main problem in hypothesis testing is to


construct a test statistic z(X) whose distribution we know under both the
null hypothesis Hv and the alternative HL and it does not depend on the
unknown parameterts) 0. The first part of the problem, that of constructing
z(X), can be handled relatively easily using the various methods discussed
above (Neyman-pearson lemma, likelihood ratio) when certain conditions
are satisfied. The second part of the problem, that of determining the
distribution of z(X) under both Ho and H3 is much more difficult to solve'
and often we have to resort to asymptotic theory. This amounts to deriving
the asymptotic distribution of z(X) and using that to determine the rejection
region C1 (or C() and the associated probabilities. For agiven sample size n,
however, these will be as accurate as the asymptotic distlibution of z,,(X)is
an accurate approximation of its finite sample distribution. Moreover,since
the distribution of zp(X) fora given n is not known (otherwise
we use that) we
do not know how good' the approximation is. This suggests that when
asymptotic results are used we should be aware of their limitations and the
inaccuracies they can lead to (see Chapter 10).
,

16.1

Asymptotic properties

Consider the test defined by the rejection region


c:

> c,, )
tx lz,,(X)I
:

and whose power function is


,?4,(p) Prx s
=

G),

0 iE (9.

Since the distribution of Tn(X)is not known we cannot determine cn or


326

16.1 Asymptotic

properties

J#(p). lf the asymptotic distribution of z,,(X)is available,however,


we can use
that instead to define cu from some fixed a and the asymptotic power
jnction

zr(p) #r(x

G C)''),

0 6 0.

(16.3)

ln this sense we can think of (z,,(X), n > 1) as a sequence of test statistics


defining a sequence of rejection regions (C,1,n > 1) with power functions
).#(p), n y: 1, 0 G(6)) and we can choose c,, accordingly to ensure that the
sequence of tests have the same size if
'1

max

,#,(p)

0 c 6o

for all n > 1.

( 16.4)

Note that limn-.. X(#)=


In this context the various criteria for tests
discussed above must be reformulated in terms of the asymptotic power
function z:(:); see Bickel and Doksum (1977).
.;:(:).

Dehnition 1
The sequence of tests for Ho.' 0 6 (i)o against 1'f1: 0 01 dejlned by
c
(C'1 n > 1) is said to be consistent of size f/
,

c(#)
max
e
e

and

(z

etl

7:(p) 1, 0 6 01
=

(16.6)

As in the case of estimation, consistency is a reasonable property but only a


minimal property. In order to be able to make comparisons between
various tests we need better approximations to the power than 1. With this
in mind we define asymptotic unbiasedness.
Dehnition 2

.4 sequence of tests as dehned above is said to be asymptotically


unbiased of size a #'
max

z:4:)

e (.:0
'l0)

tx<

<

1,

0 6 O1.

( 16.8)

Desnition 3

.,4sequence of tests as desned above is said to be uniformly most

Asymptotic

test procedures

power (UM#) of

size a

7(0)

max
('a

(/
( 16.9)

ee

and

z0) >

for any

zr*(p),

0 (E (.l,

( 16.10)

size tz test with asymptotic

ptpwt!r jnction zr*(p).


often
interested in Iocal alternatives of the form
tests we are

ln asymptotic

b#0

( 16.11)

in order to assess the power of the test around the null. When
I

(p) opn) then


-

11

v'',,(4-po)

x(b,I.(p)-

1)

for l the MLE of 0. ln this case we consider only local power and a test with
greatest local power is called locally' uniformly most powerful.
The Iikelihood ratio and related test procedures

16.2

ln this section three general test procedures which give Iise to


asymptotically optimal tests will be considered', the Iikelihood ratio, Wrtzlf!
and Lagrange multiplier test procedures. All three test procedures can be
interpreted as utilising the information incorporated in the 1oglikelihood
function in different but asymptotically equivalent ways.
For expositional purposes the test procedures will be considered in the
context of the simplest statistical model where
*= (.(x; p), 0 (E 0) is the probability model; and
(i)
X EB (11, Xz,
Ar,,)' is a random sample.
(ii)
The results can be easily generalised to the non-random
sample case where
I,,(p) O/n) as explained in Chapter 13 above in the context of maximum
likelihood estimation. For most results the generalisation amounts to
substituting I(p) for I.(p) and reinterpreting
the results.
.

(1)

Simple

null

hypothesis

Let the null hypothesis

be Ho: 0= 0o, 0 c (.) EEER''' against Sl

0+ 0v.

The likelihood ratio test


The likelihood ratio test statistic discussed in Section 14.4 takes the form

2(x)
=

x)
1-(p();
L(0., x)
max
e e til

f/pe', x)
L(1,. x)

( 16.13)

16.2

Likelihood

ratio

329

and related test procedures

where I is the MLE of 0. ln cases where 2(x) or some monotonic function of


it have a tractable distribution there is no need for asymptotic theory.
Commonly, however, this is not the case and asymptotic theory is called for.
The IOIJ test
Under certain regularity conditions which include CRI-CR3
13) log L(0; x) can be expanded in a Taylor series at p= l
Lo;
1og

p Iog z-(p;x)

l-t; x) +(4-p)

xl-log

+.:1- 0)'
whereJp* p) < J- p! and
- 10). Since
(seeChapter
p
f-(p; x)
'ik log
#--J

00

p2log 1-(p*; x)
pp 0
,

p-j

(seeChapter

:-j

p) +

0,(1),

( 16.14)

og 1) refers to asymptotically

negligible terms

( 16.15)

=0,

conditions

beingthe tirst order


1 p2
n 0 Pp

log.L(;

for the MLE, and

x) XI(p),

the above expansion can be simplified


log L0,. x) 1og .L(1;x) +.l.n(;
=

(16.16)

(seeSerfling (1980))to:
p)'l(p)(I p) + o/ 1).
-

(16.17)

This implies that, since


2glog L(1; x) - log l..t?(); x)(l,
(16.18)
- 2 log 2(x)
-2 1og2(x) n(4 - pf))'I(#)(l - 0o) +o1).
(16.19)
it is known that under certain
For the asymptotic properties of MLE'S
regularity conditions
=

x/nt

0o)

tz

x(0,I(p)-

Using this we can deduce


Z-R

-2

log

1).

(16.20)

(seeproperty Q1, Chapter

(x):>,n(I-pp'I(p)(4

being a quadratic form in asymptotically

Ho
-p(,)

15) that

z2(,,l),

normal random variables

(16.21)
(r.v.'s).

330

test procedures

Asymptotic

Wald (1943),using the above approximation of


log 2(x), proposed an
alternative test statistic by replacing I(p) with 141):
-2

Ho

Wz'=n(l-po)I()(J-po)
given that

(iii)

I(J)

xz2(m),

(16.22)

14$. This is the so-called W/-J/Jstatistic.

-+

The Lagrange multiplier test

Rao (1947)using the asymptotic distribution of the score function


of that ot l), i.e.

log
f7tr
V n -u

q(#)

=---s

L(0; x)

proposed the emcient score


LM

=-

N(0, I(p))

1q(po)

q(#()'l(#() -

Hv

( 16.23)

multiplier)

(or Lagrange

(instead

test statistic

zztrj,

which is again a quadratic

form in asymptotically

normally

distributed

I-.V. S.

For all three test statistics (LR, l4zi.LA.flthe rejection

)x : !(x)> cz)

(71

region

takes the form

(16.25)

where /(x) stands forall three test statistics and the critical value ca is defined
by dz2(r# a, x being the size of the test. Under local alternatives with a
Pittman type drift of the folnn:

Jc

(t

/.fl : 0,, 0o +
=

(16.26)

all three test statistics are asymptotically


H3

l(x4

z2(m;J), J

'v

distributed as:

b'Itp()b,

( 16.27)

since

x'',;(J
-

0v)

vt

#,,) + b

xtb,1(:0) - 1)

(16.28)

and
qtpo)
v7n

-x/ntl-ppltp()l

+0p(1)

xtbI(po), I(po)).

(16.29)

16.2 Likelihood

ratio and related test procedures

Fig. 16.1. The LR, W' and LM tests compared.

Hence, the power function for all three test statistics takes the form
X

itol

dzztm.. J),

(16.30)

C?

and thus, LR, W' and LM are asymptotically equivalent in the sense that
they have the same asymptotic properties.
Fig. 16.1, due to Pagan (1982),shows
the relationship between LR, ##rand
LM in the case of a scalar 0.
z,M=2

area

Lo

q(po),

p
(- p())2

g,,=2

=q(po)2

area Atl

q),

LR= 2 area

A C

qo) dp.

=2

( 16.3 1)

(16,32)

( 16.33)

po

Note that al1 three test statistics can be interpreted as functions of the score
function.

332

Asymptotic

test procedures

(2)

Composite

null

hypothesis

Consider the case where the Ho is composite, i.e.


aga in st

H o : 0 i! 0. o

(p:R(#)

0. ()

Rr

0. i Rm

as well as practical to parametrise

It is both convenient

(1)0

H 1 : 0 (F0. j

(.)0 in the form

0, 0 c 0)

(16.34)

represents r non-linear equations, i.e. R(#) (R1(#), Rz0),


Rr(#))'. In most situations in practice the parametrised form arises
.
naturally in the fonn of restrictions such as R1(p) 0305+ 0t, Rz(0)
log ?l p2, Rz(0) p2j+ 0z 1, Rzj0) 0L
etc. lf we define 0 to be the
maximum likelihood estimator (MLE) of 0, i.e. J is the solution of
(t?log L0; xlj/Jp= 0, then from
=0

where R(#)
.

-2pc,

x'V(l

0)

.N(0, 14:) -

'w

1),

(16.35)

and
1

t?

log L(0; x)

n'&

N(0, I(p)),

'v

we can deduce that


Ho

xV(R(

R),I(p) - 1R:),

-R(p)) xN(,

since R(#) can be approximated


R(p)

R(4) +Rp(p

where

by

) +0/1),

R(p)

Rp=

(i)

at 0=

( 16.37)

(0

The W'/Z/Jtest procedure

lf the null hypothesis Ho is true we expect the MLE 1,without imposing the
restrictions, to be close to satisfying the restrictions, i.e.if Hv is true, R(l) 0.
This implies that a natural measure for any departure from Hv should be
p1
R(

-011

(16.39)

If this is
different from zero it will be a good indication that Hv
different from
is false. The problem is to formalise the concept
zero'. The obvious way to proceed is to construct a pivot based on 1jR(l)1jin
order to enable us to turn this statement into a precise probabilistic
statement.
Ssignificantly'

tsignificantly

16.2

Likelihood ratio and related test procedures

333

ln constructing such a vot there are two basic problems to overdepends on the units of measurement and
come. The first is that jjR(#)Jj
the second is that absolute values are not easy to manipulate. A quantity
both problems is the quadratic form
which
'solves'

- 'R(;),

R('Ev(R()1

(16.40)

where V(R()) represents the covariance of R(l). Determining V(R()) can be


a very difficult task since we often know very little about the distribution of
Asymptotically, however, we know the distribution of R() and
.
v(R(l))

hencewe

'

(t6.41)

R;I(p) - R:,

can deduce that

IRPI iR()

nR('gRI(#)-

Ho

Wald's suggestion
estimator, i.e.

z2(r).

(16.42)

'X

to replacing

amounts

1'V'=nR()'gRI(l)-

'w

Ho

'RI;II-1R()

'v

V(R()

with

a consistent

z2(r).

Note that the Wald procedure can be used in conjunction


asymptotically normal estimator 0* (not just MLE's) since if

w''n0

#)

'.-

N (0 E : )
,

W'= nR(#*)'gRipRj1

with any

(16 44)

Ho

- 'R(#*)

'w

:'

(ii)

( 16.43)

z2(r).

(16.45)

Laqranqe multiplier test procedure

In contrast to the Wald test procedure the Lagrange multiplier procedure is


based solely on the restricted MLE of 0, say #. Although the Lagrange
multiplier test statistic can take various equivalent forms we consider only
two such forms in what follows. Estimation of 0 subject to the restrictions
R(p)= 0 is based on the optimisation of the Lagrangian function

,.l? 1og L(#., x)


=

R(p)p,

(16.46)

where p: r x 1 vector of multipliers. The restricted MLE of 0 is defined to be


the solution of the system of equations:

log1z(:

x)- Rp=0,

(16.47)

Asymptotic

test procedures

R(#) 0.

(16.48 )

In the case of the Wald procedure we began our search for an asymptotic
pivot using R(l) which should be close to zero when Hv is true. ln the
bydelinition and thus it cannot be used. But.
present case, however, R(#)
=0

the score function evaluated at 0=

although in the Wald procedure


zero, i.e.
91og L(f. x)

is

(16.49)

= 0,

this is not the case for (t7log L(; xlj/'Pp and we can use it to construct an
asymptotic pivot. Equivalently, the Lagrange multipliers pfe can be used
instead. The intuition underlying the use of g(#) is that these multipliers can
be interpreted as shadow prices for the constraints and should register a11
d epar tures from Ho; if #is closed to'd #'#)is small and vice versa. Hence, a
reasonable thing to do is to consider the quantity
Using the same
argument as in the Wald procedure for jR(J)
we set up the quadratic
form

1-)

-0(

-01.

1#(#).

/z(l1'EV(/#l))I
-

(16.50)

Using the fact that

1t? log L(,.


t?p
-n

x)

N(0,

'v

l(p)),

(16.5 1)

we can deduce that


1

(p()

p(04)

N0, gR;l(p) - j R?q

'v

''''''''''''''tiklih',------,---'
-

-)--

( 16.52)

).

Hence,
1

g(#)'(R;l(p)- 'RJIZI-I

-11

Ho
-

z2(r).

( 16.53)

The Lagrange multiplier test statistic takes the form


Z-M

1
=

Hv

#(#)'gR;-l(#)- 1RJg(#)
-

z2(r),

(16.54)

or, equivalently,
LM

'

log L(. x)

which is the efjlcient score form.

lo -

p
'0

log L(. x)

(16.55)

16.2 Likelihood

ratio test statistic takes the form

The likelihood

Ho

LR

2(1ogLk,' x)

1og Ll; x))

IVO:

LM

:>:

n(l

'v

(:

#-)'1(#)(1
-

z2(r).

( 16.56)

we can show that

Using the Taylor series expansions

LR :>:

335

ratio and related test procedures

#).

Thus, although a1l three test statistics are based on three different
asymptotic pivots, as n a:i the test statistics become equivalent. Al1three
tests share the same asymptotic properties; they are a11consistent as well
as asymptotically locally' UMP against local alternatives of the form
considered above. In the absence of any information relating to higherorder approximations of the distribution of these test statistics under both
Ho and SI the choice between them is based on computational
convenience. The Wald test statistic is constructed in terms of J the
unrestricted MLE of 0, the Lagrange multiplier in tcrms of # the restricted
MLE of 0 and the likelihood ratio in terms of both.
ln order to be able to discriminate between the above three tests we need
to derive higher-order approximations such as Edgeworth approximations
(see Chapter 10). Rothenberg (1984)gives an excellent discussion of various
ways to derive such higher-order approximations.
Of particular interest in practice is the case where 0- (p1 0z) and f'fo:
0L
0 against ffj : 03# o, pl : r x 1 with pc: (rn- r) x 1left unrestricted. In
this case the three test statistics take the form
-+

.f.-(4,-

LR

(16.58)

x)),

x) -log

L,

-2(log

W'= rl(11 -001)/2-11 -I1a(l)I.,2%1(I2)(;)(()(lj ...-.#0:),


1(1)

LM

1g(#) EI11(-)
,

--

-I1c(#)I22

ljyj-),

(yj21 tj.)g
-

where

-(11, 2), #= o, #2),


This is because for R(p)

0L

1(*)

I(p)

11c(*)
11
121(#) 1224#)

R;

pb

(16.59)
(16.60)

x)
t7log z-(p,'
(o, .-4.

(1,:0)

and hence
R;I(#)(#) - 1R:

1(:)
E11

I 1 a(p)I2-21(p)Ia1(#)q-

For further discussion of the above asymptotic


survey by Engle (1984).

test procedures

(16.6 1)
see the

Asymptotic

test procedures

Important

concepts

power function, consistent test, asymptotically unbiased test.


asymptotically uniformly most powerful test, locally uniformly most
powerful test, Wald test statistic, Lagrange multiplier (efficientscore) test
Asymptotic

statistic.

Questions
Why do we need asymptotic theory in hypothesis testing'?
Explain the concept of an asymptotic power function and use it to
define consistency, asymptotic unbiasedness and UMP in testing.
What do we mean by a size x test in this context?
Compare the LR, 1#' and LM tests in the case of a simple null
hypothesis (draw diagrams if it helps).
Explain the common-sense logic underlying the LR, 14' and LM test
procedures in the case of a composite null hypothesis.
6. Discuss the similarities and differences between the LR, Hzrand LM
test procedures.
Explain the circumstances
under which you would use these
asymptotic test procedures in preference to the test procedures
discussed in Chapter 14.
8. Explain the derivation of the 145and LM test statistics in the case of
HvL p1
03 against HL : 0L 0. p! being a subset of parameter vector
0= (p1,0z), considered above.
9. Verify the form of the Wald and Lagrange multiplier test statistics for
0z) using the partitioned matrix
Ho #1 0 against Sl : pl # 0, pHtpj
inversion rule
:#

- 1

A 11

M12

A2l

A2c

(A1 1 A1 CA2-3IAC 1) Az-l A21(Al l - A1 cA2-2'A21) -

A1-/Aj 24A22 A2 j A1-/ A1 c) 1


(Ac2 A21 A1-/ A1c) -

1
.

Additional references
Aitchison

and Silvey ( 1958); Buse

Silvey (1959).

( 1982),. Engle (1984);

Moran

(1970);Rao (1973);

PART

IV

The Iinear regression and related statistical

models

17

CHAPTER

Statistical models in econometrics

17.1

Simple statistical models

The main purpose of Parts 11and IIl has been to formulate and discuss the
concept of a statistical model which will form the backbone of the
discussion in Part 1V. A statistical model has been defined as made up of
two related components:
(i)
a probability model, *= .tD(y; 04, 0 6: (.))t specifying a parametric
family of densities indexed by 0', and
)'w)' defining a sample from
a sampling model, y (y':,
D()?; p()),for some true' 0v in (4.
The probability model provides the framework in the context of which the
stochastic environment of the real phenomenon being studied can be
defined and the sampling model describes the relationship between the
probability model and the observable data. By postulating a statistical
model we transform the uncertainty relating to the mechanism giving rise to
the observed data to uncertainty relating to some unknown parameterts) 0
whose estimation determines the stochastic mechanism D(y', 0).
An example of such a statistical model in econometrics is provided by the
modelling of the distribution of personal income. ln studying the
distribution of personal income higher than a lower limit y() the following
statistical model is often postulated:
.y,2,

D()/ yo ; 0)

y2,
y EEF(-9?1,

P '* 1

--

Vo
,

J.7

yw)?is a random

sample from Dly/yo; 0).

!' The notation in Part IV will be somewhat different from the one used in Parts 11and
111.This change in notation has been made to conform with the established
econometric notation.
339

Statistical models in econometrics

Note:

if 0 >2.
For y a random

sample the likelihood

L0; y)

>'0

function is

p+ l

V0

wp
0w),()
(y 1

).'f

.J,,,2
,

)?w)-(p

+ 1)
,

log L4t?;y)

T log 0 + T? log yo -((? + 1)

d log L T
dp =-0

=>

iis the maximum


(d2 log L) dp2
Chapter 13):
=

+ T

log )'()-

log

T
t

log yr

l=1

log yf

0,

,
-

.J.'()

likelihood estimator (MLE) of the parameterp. Since


F/p2, the asymptotic distribution Of P takes the form
(see

./T((4- 0)

x(0-p2).

Although in general the finite sample distribution is not frequently


available, in this particular case we can derive D() analytically. It takes the
form
D)

pw-lww-l
=

.j.

F( F

1).J,,

exp

y.p
-

--

i.e.

awp
--)--

'v

z2(2F)

(see Appendix 6.1). This distribution of can be used to consider the finite
sample properties of as well as test hypotheses or set up condence
intervals for the unknown parameter 0. For instance, in view of the fact that
'l

E)

T
=

we can deduce that is a biased estimator of p.


It is of interest in this particular case to assess the accuracy' of the
asymptotic distribution of for a small 'T:(W= 8), by noting that

vart

w2p2
(T- 2)a (F- 3)

17.1 Simple statistical

34 1

models

(see Johnson and Kotz ( 1970:. Using the data on income distribution
5000 (reproducedbelow) to estimate 0,
Chapter 2), for

(see

.vy

6000

5000

we get

#1

log

T'
l

..

8000

7000

12000

15 000

20 ()(X)

P()

as the ML estimate.
Using the invariance property of MLE's (seeSecton 13.3) we can deduce
that
'tarl)

'()

2. 13,

0.9 1.

As we can see, for a small sample (T= 8) the estimate of the mean and the
variance are considerably larger than the ones given by the asymptotic

distribution:

'll=
a

1.6,

42

vartl=F

=0.32.

On the other hand, for a much larger sample, say T= 100,

'()

1.63,

Varl) 0.028,
=

with

as compared

4
-

1.6,

kart

=0.026.

These results exemplify the danger of using asymptotic results for small
samples and should be viewed as a warning against uncritical use of
asymptotie theory. For a more general discussion of asymptotic theory and
how to improve upon the asymptotic results see Chapter 10.
The statistical inference results derived above in relation to the income
of the
distribution example depend crucially on the appropriateness
should
represent a
statistical model postulated. That is, the statistical model
good approximation of the real phenomenon to be explained in a way
which takes account the nature of the available data. For example, if the
data were collected using stratified sampling then the random sample

assumption is inappropriate

(seeSection 17.2 belowj. When any of the

Statistical models in

onometrics

assumptions underlying the statistical model are invalid the above


estimation results are unwarranted.
In the next three sections it is argued that for the purposes of econometric
modelling we need to extend the simple statistical model based on a random
sample, illustrated above, in certain specific directions as required by the
particular features of econometric modelling. In Section 17.2 we consider
the nature of economic data commonly available and discuss its
implications for the form of the sampling model. lt is argued that for most
forms of economic data the random sample assumption is inappropriate,
Section 17.3 considers the question of constructing probability models if the
identically distributed assumption does not hold. The concept of a
statistical generating mechanism (GM) is introduced in Section 17.4 in
order to supplement the probability and sampling models. This additional
certain specific features of
component enables us to accommodate
econometric modelling. ln Section 17.5 the main statistical models of
interest in econometrics are summarised as a prelude to the discussion
which follows.
17.2

Fxonomic data and the sampling model

Economic data are usually non-experimental in nature and come in one of


three forms:
(i)
time series, measuring a particular variable at successive points in
time (annual, quarterly, monthly or weekly);
cross-section, measuring a particular variable at a given point in
time over different units (persons, households, firms, industries,
countries, etc.);
panel data, which refer to cross-section data over time.
(iii)
Economic data such as M1 money stock (M), real consumers' expenditure
(F) and its implicit deflator (P), interest rate on 7 days' deposit account (f),
over time, are examples of time-series data (seeAppendix, Table 17.2). The
income data used in Chapter 2 are cross-section data on 23 000 households
in the UK for 1979-80. Using the same 23 000 households of the crosssection observed over time we could generate panel data on income. In
practice, panel data are rather rare in econometrics because of the
difficulties involved in gathering such data. For a thorough discussion of
econometric modelling using panel data see Chamberlain ( 1984).
The econometric modeller is rarely involved directly witll the data
collection and refinement and often has to use published data knowing very
little about their origins. This lack of knowledge can have serious
repercussions on the modelling process and lead to misleading conclusions.
Ignorance related to how the data were collected can lead to an erroneous

17.2 Economic data

and the sampling model

343

choice of an appropriate sampling model. Moreover, if the choice of the


data is based only on the name they carry and not on intimate knowledge
about what exactly they are measuring, it can lead to an inappropriate
choice of the statistical GM (see Section 17.4, below) and some misleading
conclusions about the relationship between the estimated econometric
model and thc theoretical model as suggested by economic theory (see
Chapter 1). Let us consider the relationship between the nature of the data
and the sampling model in some more detail.
In Chapter 11 we discussed three basic forms of a sampling model:
random sample
(i)
- a set of independent and identically distributed
(ilD) random variables (r.v.'s),'
independent sample
a set of independent but not identically
and
r.v.'s;
distributed
non-random
sample - a set of non-llD r.v.'s.
(iii)
For cross-section data selected by the simple random sampling method
(where every unit in the target population has the same probability of being
selected), the sampling model of a random sample seems the most
appropriate choice. On the other hand, for cross-section data selected by
the stratified sampling method (the target population divided into a
number of groups (strada)with every unit in each group having the same
probability of being selected), the identically distributed assumption seems
rather inappropriate. The fact that the groups are chosen a priori in some
systematic way renders
the identically
distributed assumption
inappropriate. For such cross-section data the sampling model of an
independent sample seems more appropriate.
The independence
assumption can be justified if sampling within and between groups is

(ii)

random.
For time-series data the sampling models of a random or an independent
sample seem rather unrealistic on a priori grounds, leaving the non-random
sample as the most likely sampling model to postulate at the outset. For the
time-series data plotted against time in Fig. 17. 1@)-(J)the assumption that
they represent realisations of stochastic processes (seeChapter 8) seems
more realistic than their being realisations of llD r.v.'s. The plotted series
exhibit considerable time dependence. This is confirmed in Chapter 23
where these series are used to estimate a money adjustment equation. ln
Chapters 19-22 the sampling model of an independent sample is
intentionally maintained for the example which involves these data series
and several misleading conclusions are noted throughout.
In order to be able to take cxplicitly into consideration the nature of the
observed data chosen in the context of onometric modelling, the statistical
models of particular interest in econometrics will be specified in terms of the
observable r.v.'s giving rise to the data rather than the error term, the usual

Statistical models in econometrics

35000

25000

f
X
15000

5000

1963

1966

1969

1972
Time

1975

1978

1982

1966

1969

1972
Time

1975

1978

1982

consumerf

expenditure.

(a)

18o
'!' 160
J.
14000

12000

1963

Fig. 17,1(5). Money stock ftmillion).

(/?)Real

approach in econometrics textbooks (see Theil ( 1971), Maddala ( 1977),


Judge et aI. ( 1982) inter alia). The approach adopted in the present book is
to extend the statistical models considered so far in Part 1II in order to
modelling. In
accommodate certain specific features of econometric
particular a third component, called a statistical generating mechanism
(GM) will be added to the probability and sampling models in order to
enable us to summarise the information involved in a way which provides

17.2

Economic data and the sampling model

240

200

160
@
Q.

120

80

40
1963

1966

1969

1972
Time

1975

1966

1969

1972

1975

1978

1982

1978

1982

15

12

9
w*

()
1963

'

T i me

Fig. 17. 1(c). lmplicit

price detlator. (J) lnterest rate on 7 days' deposit

account.

'an adequate' approximation to the actual DGP giving rise to the observed
will be considered
data (see Chapter 1). This additional eomponent
section
17.4
ln
next
the nature of the
the
below.
extensively in Section
modelling
will
econometric
required
models
be discussed in
in
probability
model.
sampling
of
the
view of the above discussion

346

Statistical models in econometrics


Economic data and the probability model

In Chapter 1 it was argued that the specification of statistical models should


take account not only of the theoretical a priori information available but
the nature of the observed data chosen as well. This is because the
specification of statistical models proposed in the present book is based on
the observable random variable giving rise to the observed data and not by
attaching a white-noise error term to the theoretical model. This strategy
such as
implies that the modeller should consider assumptions
independence, stationarity, mixing (see Chapter 8) in relation to the
observed data at the outset.
As argued in Section 17.2, the sampling model of a random sample seems
rather unrealistic for most situations in econometric modelling in view of
the economic data usually available. Because of the interrelationship
between the sampling and the probability model we need to extend the
simple probability model (I) tD( 1,,. 0), 0 c (4) associated with a random
sample to ones related to independent and non-random samples.
An independent (but non-idcntically distributed) sample y /5 (y'1, ywl'
raises questions of time-heterogeneity in the context of the corresponding
probability model. This is because in general every element y, of y has its
own distribution with different parameters Dt),f; pf). The parameters ot
which depend on t are called incidental parameters. A probability model
related to y takes the general form
=

*= (D(y'2; 0t),

ot6 0,

t CETl.,

(17.1)

.)

where T ) 1, 2,
is an index set.
A non-random sample y raises questions not only of time-heterogeneity
but of time-dependenceas well. In this case we need thejoint distribution of y
in order to define an appropriate probability model of the general form
=

/t.D().'1

.p2,

A'w;#r),

ove' 0,

T1

(1, 2,

F)

T)

ln both of the above cases the observed data can be viewed as realisations
of the stochastic process (yf,t e: T) and for modelling purposes we need to
restrict its generality using assumptions such as normality, stationarity and
asymptotic independence or/and supplement the sample and theoretical
information available. In order to illustrate these let us consider the
simplest case of an independent sample and one incidental parameter:
(i)

D(y2,. 0 t )

exp

(2zr)

1
-- 2

-pt

2
,

)'vl' is an independent sample from D(yI;


'T;respectively.

y EEE()71,ya,
.

347

model

Data and the probability

17.3

0t), t

1, 2,

The probability model postulates a nonnal density with mean pt (an


incidental paramete and variance c2. The sampling model allows each y; to
have a different mean but the same variance and to be independent of the
The distribution of the sample for the above statistical model
other
ns.
and 0 n 1, yz,
D(y: p) where y H (.v1,yz,
pv, c2) is
.ywl'

Dlyt', g,, c2)


oty,04 f17
1
-

(c2)-F/2(2a)-F/2

exp

1
2c a f j

-jtf)2

(.yf

gw),
As we can see, there are T+ 1 unknown parameters, #= (c2,jtj, yz,
sufficient
provide
with
which
observations
only
estimated
and
T
us
to be
warning that there will be problems. This is indeed confirmed by the
maximum likelihood (ML) method. The log likelihood is
.

log L(p', y)

log L
t'q/z, =

,log L
f,c2 =

const

--j-

1
(
2c a

T
-

2c z

log c2
2)(.J4

2c

2c a

pt)

1
-

-/tl)2,

0,

(17.4)

(y:

,-

1, 2,

(17.5)

-6
.

-/zf)2=0.

Z(.pt

( 17.6)

These first-order conditions imply that lt yt,


Before we rush into pronouncing these as MLE'S
the second-order conditions for a maximum.
=

:2 1og L
aA

1= 1, 2,

'C and 82

=0.

it is important to look at
1

T
;

= 2c 4.

Z(A'-/4)

'

-;

c=

which are unbounded and hence lt and :2 are not MLE's; see Section 13.3.
This suggests that there is not enough information in the statistical model
(i)-(ii) above to estimate the statistical parameters 0= (Jz1,pz,
pw', c2).
An obvious way to supplement this information is in the form of panel
T: In the case where N
N, t= 1, 2,
data for yt, say y, i 1, 2,
realisations of y, are available at each r, 0 could be estimated by
.

1.

pt N
=

--

xv

Z A',
-

r= 1,2,

Statistical models in econometrics

and

:2

1
T

j jy (yy ptll.
-

1 i

of 0.
lt can be verified that these are indeed the MLE'S
An alternative way to supplement the information of the statistical model
(ijvii) is to reduce the dimensionality of the statistical parameter space 0.
This can be achieved by imposing restrictions on 0 or modelling p by
relating it to other observable random variables (r.v.'s)via conditioning (see
Chapter 7). Note that non-stochastic variables are viewed as degenerate
r.v.'s. The latter procedure enables us to accommodate
theoretical
information within a probability model by relating such information to the
statistical parameters 0. ln particular, such information is related to the
meap (marginalor conditional) of r.v.'s involved and sometimes to the
variance. Theoretical information is rarely related to higher-order
moments (seeChapter 4).
The modelling of statistical parameters via conditioning leads naturally
to an additional component to supplement the probability and sampling
models. This additional component we call a statistical generating
mechanism (GM) for reasons which will become apparent in the discussion
which follows. At this stage it suffices to say that the statistical GM is
postulated as a crude approximation to the actual DGP which gave rise to
the observed data in question, taking account of the nature 4 such data as
well as theoretical a priori information.
In the case of the statistical model (i)-(ii) above we could solve' the
inadequate information problem by relating pt to a vector of observable
1, 2,
variables xlj, xcf,
'C say, linearly, to postulate
xkf l
.

pt b'xf,

( 17.9)

bkl', k < Ti is a vector of unknown parameters. By


where b > (4j, b2,
postulating this relationship
we reduce the parameter
space from
(.) > RFX R + and increasing with T to (i)cp> R x (Ji and independent of
The statistical GM in this case takes the general form
.

'.r

yt

=b'x

+ ut,

where pt b'xf and ut


are called systematic and non-systematic
of
respectively.
construction
By
components
),'t,
-b'x,

Eytut)

.y,

0 and

Eutj

0,

Elutl)

c2,

Eutus)

0,
f #s.

where E( ) is defined relative to Dtyf,' 0), the marginal distribution of rf


Equation ( 10) represents a situation where the choice of the values xlf, xal,
'

17.4

The statistical

generating

mechanism

349

xkt determines the systematic part of yr with the unmodelled part ut


.
being a white-noise process (seeChapter 8). This is the statistical GM of the
.

Gauss linear model (see Chapter 18). The above statistical GM will be
extended in the next section in order to deline some of the most widely used
statistical models in econometrics.

17.4

The statistical

generating mhanism

The concept of a statistical GM is postulated to supplement the probability


sampling models and represents a crude approximation
to the actual
DGP which gave rise to the available data. It represents a summarisation of
the sample information in a way which enables us to accommodate any a
priori information related to the actual DGP as suggested by economic
theory (see Chapter 1).
Let (y, l (E T) be a stochastic process defined on (., i P(
(seeChapter
8). The statistical GM is defined by

and

.))

3'f pt +
=

where

ut

(17. 11)

#f f(#t/V),

(17. 12)

L being some c-field. This defines the statistical process generating y, with
pf being the postulated systematic mechanism giving rise to the observed
and ut the non-systematic
data on
part of yr defined by ut y'f
Defining ut this way ensures that it is orthogonal
to the systematic
component /t,; denoted by Jtf-l-l/? (see Chapter 7). The orthogonality
condition is needed for the logical consistency of the statistical GM in view
of the fact that ut represents the part of yf left unexplained by the choice of pt.
The terms systematic, non-systematic and orthogonality are formalised in
terms of the underlying probability and sampling models defining the
statistical model.
lt must be emphasised at the outset that the terms systematic and nonsystematic are relative to the information set as defined by the underlying
probability and sampling models as well as to any a priori information
related to the statistical parameters of interest 0. This information is
incorporated in the definition of the systematic component and the
remaining part of yt we call non-systematic or error. Hence, the nature of ut
depends crucially on how pf is defined and incorporates the unmodelled part
of #,1.This definition of the error term differs significantly from the usual use
of the term in econometrics as either errors-in-equation
or errors of
The
of
the
in
the
book
concept
present
measurement.
use
comes much
tnoise'
used in engineering and control literatures (see
closer to the term
.vf

-pt.

350

Statistical models in econometrics

Klman

(1982:.Our

aim in postulating a statistical GM is to minimise thc


non-systematic component ut by making the most of the systematic
information in defining the systematic component pt. For more discussion
on the error term see Hendry (1983).
Let
t c T) be a k x 1 vector stochastic process defined on (S, .@I
P ))
which represents the observable random variables involved. Let y, be the
random variable whose behaviour is of interest, where

tzt,
Zt

'

.l
Xt

For a conditioning

information set

the systematic component of yr can be defined by


pt

E (#l/V't )

(17 13)

where % is some sub-c-tield of


The non-systematic
unmodelled
of
given
the
part
represents
y,
pt, i.e.
.@

l,

.p,
-

component

.E(.h/f4),

These two components

ut

(17.14)

give rise to the general statistical GM

(17.16)
( 17. 17)
using the properties of conditional expectation (see Chapter 7). It is
important to note at this stage that the expectation operator E( ) in (16)and
(17) is defined relative to the probability distribution of the underlying
probability model. By changing Lh (andthe related probability model) we
can define some of the most important statistical models of interest in
econometrics. Let us consider some of these special cases.
Assuming that .tZf, l q T) is a normal 11D stochastic process and
(a)
choosing % (Xt xf ),a degenerate c-field, (15)takes the special
form
'

yf

p'xt+ ut,

t 6 T,

(17.18)

where the underlying probability model is based on D(yf,/X,; 04.


This dcfines the linear regression model (seeChapter 19).
l (E T) is a normal IlD stochastic process and
Assuming that
.tzt,

choosing

.t

.f-

t
=

c(Xf), (15)becomes

#'X +

l&,

(17.19)

generating mechanism

The statistical

17.4

with D(Z,; #) being the distribution dening the probability


model. ( 19) represents the statistical GM of the stocbastic Iinear
reqression model (seeChapter 20).
Assuming that (Zt, t g T) is a normal stationary lth-order Markov
c-field to be % c(yt0- j,
process and choosing the appropriate
XO,
Xy (X, f, i 0, 1,2,
x,0), y)'- : (y, i 1,2,
(15)
takes the form
=

.),

.),

-f,

+
A', #k)x,
=

(d)

Z (.af1', + /ixf
-i

-f)

(17.20)

+ uf,

where the underlying probability model is based on Dytjyh 1 ,Xy;


004.This defines the statistical GM of the JpntzrnfcIinear regression
model (seeChapter 23).
Assuming that (Zr, t G T) is a normal llD stochastic process and yt
is an m x 1 subvector of Z, the c-field t c(Xj xt) reduces (15)to
.9).

(17.21)

yl B'xf + u2,
=

with Dtyf X,; p*) the distribution defining the underlying


probability model. This is the statistical GM of the rrltllritwritzlt?
linear regression model (seeChapter 24).
An important feature of any statistical GM is the set of parameters
defining it. These parameters are called the statistical parameters of interest.
For instance, in the case of (18)and (19)the statistical parameters of interest
cl
c11
cj aEallaj These are functions of the
are 0- (#,c2), #= Xz-zbaz,
assumed to be
D(Zf;#)
of
parameters
=

z,

(xX,)

)
(t'..;
((0:)

s&'a2a))

( 17.22)

(see Chapter 15).


In practice the statistical parameters of interest 0 might not coincide with
the theoretical parameters ofinterest, say (. ln such a case we need to relate
the two sets of parameters in such a way that the latter are uniquely
determined by the former. That is, there exists a mapping
(

(17.23)

H($,

which detine ( uniquely This situation for example arises in the case of the
yrrlulrtlnptlusequations model where the statistical parameters of interest are
the parameters defining (21) but the theoretical parameters are different (see
Chapter 25). In such a case the statistical GM is reparametrised/restricted
in an attempt to define it in tenns of the theoretical parameters of interest.
restricted statistical GM is said to be an econometric
The reparametrised
model.
.

Statistical models in econometrics


It must be stressed that the statistical GM postulated depends crucially
on the information set chosen at the outset and it is well defined within such
a context. When the information set is changed the statistical GM should be
respecified to take account of the change. This implies that in econometric
modelling we have to decide on the information set within which the
specification of the statistical model will take place. This is one of the
reasons why the statistical model is defined directly in terms of the random
variables giving rise to the available observed data chosen and not in terms
of the error term. The relevant information underlying the specification of
the statistical GM comes in three forms:
(i)
theoretical information;
sample information; and
(ii)
(iii)
measurement information.
ln terms of Fig. 1.2 the theoretical information relates to the choice of the
observed data series (andhence of Zf) and the form of the estimable model.
The sample information relates to the probabilistic structure of
l 6: T)
of
and
and the measurement information to the measurement system Zt
any
exact relationships
among the observed data chosen (see Chapter 26 for
further discussion). Any theoretical information which can be tested as
restrictions on p is not imposed a priori in order to be able to test it. An
important implication of this is that the statistical GM is not restricted a
priori to coincide with either the theoretical or estimable model apart from
a white-noise term at the outset. Moreover, before any theoretical meaning
is attached to the statistical GM we need to ensure that the latter is first
assumptions
defining the
the underlying
well-dqjlned statistically;
statistical model are indeed valid for the data chosen. Testing the
underlying assumptions is the task of misspectjlcation testing (seeChapters
20-22). When these assumptions are tested and their validity established we
in order to derive a
can proceed with the reparametrisationjrestl-iction
theoretically meaningful GM, the empirical econometric model (seeFig.
.ftzf,

1.2).

17.5

Looking ahead

As a prelude to the extensive discussion of the linear regression model and


related statistical models of interest in econometrics let us summarise these
in Table 17.1.
ln the chapters which follow the statistical analysis (specification,
misspecification, estimation and testing) of the above statistical models will
be considered in some detail. ln Chapter 18 the linear model is briefly
considered in its simplest form k 2) in an attempt to motivate the linear
regression model considered extensively in Chapters 18-22. The main
=

(:)x

(:)x

rt

1I! Y

'

w-

*=

II

''7

'

cp>>-

.-.

-M

'

.-.

*'

Z'-u
A

Il Q

x:

41

ga

ea

=6

>'

...

w.

d
:

;=

-* G
o ew
.=. (7
' + :
m

.0

(:)x

+ u.
.-k o
+ <

V
A

AN

a-

#=

>/

lD

tp

r
..

hw

t
'-d

Ux1!1

7:-

1*
+ rm..-x
8..
cw W
Il Il

-.

,.>

rq

k)

<

o
,.
R

C1

<

<

<

11 I1

11
>'

;wx

:>.

>h=

.
.

'-

1I1
=

'U
o

-6

r
o
.a
m
CJ
V-

x
d
=
=
:: o
1)
.a
=
-

*'
tv
1)
=

.
.

.q;!

ce

.=

2
l
=

x
o =c:

w.

x
1)

Q
=

ct

r w.
>.

lz

D
C)

;;

*
.

D
=

.-.

$7

*v:

'i;

Y
O

; o
.

'-

=u
'

a
m

v.

xo

w.

iz

'z

.z

z
: O

354

Statistical models in econometrics

reason for the extensive discussion of the linear regression model is that this
statistical model forms the backbone of Part lV. ln Chapter 19 the
estimation, specication testing and prediction in the context of the linear
regression model are discussed. Departures from thc assumptions
(misspecification) underlying the linear regression model are discussed in
Chapters 20-22. Chapter 23 considers the dynamic linear regression model
which is by far the most widely used statistical model in econometric
modelling. This statistical model is viewed as a natural extension of the
linear regression model in the case where the non-random sample is the
appropriate sampling model. In Chapter 24 the multivariate linear
regression model is discussed as a direct cxtension of the linear regression
model. The simultaneous equation model viewed as a reparametrisation of
the multivariate linear regression model is discussed in Chapter 25. In
Chapter 26 the methodological discussion sketched in Chapter 1 is
considered more extensively.
Important concepts
Time-selies,

cross-section and panel data, simple random sampling,


sampling,
stratied
incidental parameters, statistical
generating
mechanism, systematic and non-systematic
statistical
components,
of
of
interest,
theoretical parameters
interest, reparametrisaparameters
tion/restriction.

Questions
Explain why for most forms of economic data the notion of a random

4.

sample is inappropriate.
Explain the concept of a statistical GM and its role in the statistical
model specication.
Explain the concepts
the systematic and non-systematic
components.
Discuss the type of information relevant for the specification of a
statistical GM.

Appendix 17.1

Appendix 17.1
adjusted data on rnt?rll),' stock M1 (M),
real consumers' expenditure (F), its implicit price dejlator P) and interest
rate on 7 days' deposit account (1)for the period 19631-19821v. source:
Economic Trends, ptrlnurl/ Supplemenb, J9#J, CSO)

Table 17.2.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43

Quarterlyseasonally

6740.0
6870.0
6990.0
7210.3

12 086.0
12 446.0
12 575.0
12 618.0
12 691.0

0.402 53
0.403 26
0.405 8 1
0.408 15
0.412 58
0.416 36
0.42 1 27
0.426 98
0.432 98
0.437 8 1
0.442 38
0.446 37
0,449 94
0.454 97
0.459 72
0.465 36
0.465 33
0.467 36
0.470 42
0.474 35
0.477 82
0.489 52
0.497 14
0.50 1 10
0.51 103
0.516 94
0.520 83
0.527 46
0.534 99
0.544 82
0.552 99
0.565 33
0.576 53
0.592 99
0.603 48
0.610 49
0.6 15 75
0,624 45
0.641 8 1
0.657 58

0.202 OOE-OI
0.200 E--0I
0.200 (XIE--OI
0.200 E-0I
0.237 OOE-OI
0.300 OOE-OI
0.300 E-0I
0,390 E-0I
0.500 E-0 1
0.470 E-0 1
0,400 OOE-OI
0.400 E-OI
0.400 E-0 1
0.400 E-0 1
0.486 E-0 1
0.500 E-0 l
0.45500E-01
0.368 E-0I
0.350 E-0I
0.489 OOE-OI
1
0.594 OOE-.O
0.550 E-.0 1
0.544 OOE-OI
0.500 E-.01
0.535 E--0I
0.600 E-0 1
0.600 E--0I
0.600 OOE-OI
0.585 E-0I
0.508 E-OI
0.500 YE--PI
1
0.500 OOE-.O
0.500 E-0 1
0.400 E-0I
0,367 XE--OI
0.325 E-0I
0.250 E-.01
0.250 E-0I
0.470 E-0 1
0.544 XE-OI
0.7 18 OOE-OI
0.703 OOE-O1
0.827 E-0I

7280.0
7330.0
7440.0
7450.0
7490.0

7570.0
7620.0
7610.0
7910.0
7830.0
7740.0

7600.0
7780.0

7880.0
8160.0

8250.0
82 10.0
8340.0
8530.0
8640.0
8490.0
83 10.0
8380.0
8660.0
8640.0
8920.0
9020.0

9420.0
9820.0
9900.0
10210.0
10 310.0
1 13.0
11 740.0
12 050.0
12 370.0
12 440.0
13 200.0
12 960.0

12787.0
12 847.0
12 949.0
12 959.0

12960,0
13 095.0
13 117.0
13 304.0
13 458.0
13 258.0
13 164.0
13 311.0
13 527.0
13 726.0
13 82 1.0
14 290.0
13 69 1.0
13 962.0
14 083.0
13 960.0

13988.0
14 089.0
14 276.0
14 2 17.0
14 359.0
14 597.0
14 64 1.0
14 603.0
14 867.0

15071.0
15 183.0

15 503,0
15 766.0
15 930.0
16 07 1.0
16 724.0
16 525.0
16 566.0

0.665 15

0.677 76
0.695 34

continued

356

Statistical models in econometrics

Table 17.2. continued

44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80

13 020.0
12 850.0
13 230.0
13 550.0
14 460.0
14 850.0
15 250.0
16 770.0
17 150.0
17 880.0
18 430.0
19 050.0
19 0.0
19 440.0
20 430.0
2 1 970.0
23 170.0
24 280.0
24 950.0
25 920.0
26 920.0

27 520.0
28 030.0

28 840.0
29 360.0

29 260.0
29 880.0
29 660.0
30 550.0
3 1 8 10.0
32 870.0

33 210.0
33 760.0
36 720.0
37 590.0
38 140.0
40 220.0

16 5 17.0
16 2 11.0
16 169.0
16 288.0
16 38 1.O
16 342.0
16 358.0
16 0 15.0
15 937.0
16 105.0
16 163.0
16 199.0
16 240.0
15 980.0
16 020.0
16 153.0
16 364.0
16 840.0
16 884.0
17 249.0
17 254.0
17 396.0
18 3 15.0
17 8 16.0
18 072.0
18 120.0
17 729.0
17 83 1.0
l 7 870.0
18 040.0
17 926.0
17 934.0
17 97 1.0
17 927.0
17 998.0
18 242.0
18 543.0
-

0.72 1 44
0.752 76
0.790 53

0.950

0.824 84
0.867 22
0.9 19 47

0.982 88
1.0297
1.0703
1 1027
1.1322
1.1679
1.2237
1.2770
1.32 16
1.3523
1.3766
1
1.4363
1.46 19
1.49 10
1.5357
1.5796
1
1.7419
l.8 109
1.8823
1.9246
1 17
2.0 154

0.642 OOE-OI
E--0I
E-.0 1
0.592 OOE-.O1
0.693 OOE-OI

0.700
0.588

0.107 30
0.565
E-0 1
0.428 OOE-.O1
0.377 E-.0 1
0.332 E-.O1
0.306 OOE-0 1
0.5 l 8 OOE-.O1
0.675
E-0I
0.857 OOE.-01
0. 103 70
0.993
E-0 1
0. 1 15 00
0.13 1 40
0. l50
0. 150 ()0
0. 140 50
0. 13 1 (X)
0.109 40
0.9E-0
1
0.943 O0E-O1
0.133
0.114 20
0. 100 10
0.829 E-0I
0.624
E-.0 1

,4059

.6808

.97

2.0867
2. 1343
2. 1838
2.2 177

2.2673
2.29 19

2.3076
-U

Additional references

Granger (1982);Griliches (1985); Richard (1980).

E-0 1

0.950 OOE--OI
0.950 E-0I
0.950 E--0I
0.950
E-0 1
0.846 E-.0 1
0.625 E-.0 1

'

CHAPTER

18

The Gauss linear model

18.1

Specifkation

In the context of the Gauss linear model the only random variable involved
is the variable whose behaviour is of interest. Denoting this random
variable by yt we assume that the stochastic process ftyt, t G T) is a normal,
independent process with F(y,) pf and a time-homogeneous variance c2
for t (E T (-T being some index set, not necessarily time),
=

37f Nlpt,
'v

tr2),

defined on the probability space (S,


#( )).
In terms of the general statistical GM (17.15)
the relevant conditioning
which implies that
set is the trivial c-field o (.,
,.1

'

.3)

.@

pt

Eyt/fqz'zu E(#,).

That is, the statistical GM is


#t E(J',) +

(tB.2)

ut,

with gt assumed to be related to a set of k non-stochastic (or controlled)


variables xlt, xcf,
xkt, in the form of the linear function
.

J'tt=

) ix i =

b'x t

is an obvious notation. Defining the non-systematic


Llt =

357

J't

.E(#t).

component

by

(18.4)

358

The Gauss Iinear model

the statistical GM
.J'!
,

b'x t +

(2)takes the

particular

form
(18.5)

u t,

The underlying probability is naturally delined in terms of the marginal


distribution of y,, say, D(y,; 0), where 0 > (b,c21) are the statistical
parameters of interest, being the parameters in terms of which the statistical
GM (5) is defined. The probability model is defined by
D(yf; 0)

*=

j--r-jV( )

exp

1
-

(y? b'x,)2
-

IG

0 e' Ild x

R+ tGT
,

(18.6)

tyf,

In view of the assumption of independence of


t s T) the sampling
model, providing the link between the observed data and the statistical GM,
is defined as follows:

(-p1,
ymb
.p2,

yvl'

is an independent sample from Dtyr; p), tzzu 1, 2,


T; respectively. It could
not be a random sample in view of the fact that each yt has a different mean.
By construction the systematic and non-systematic components satisfy
the following properties:
.

(i)
(ii)

.E'(l,)
Eptutj
Elutus)

uE'ty,- .E'(yf1) 0,'


=

=/t,'(l,)

=0-,

G' 2,
=

0,

l + s,

t, s c T.

Properties (i)and (iii)show that )l/t,t e: T) is a normal white-noise process


and (ii)establishes the orthogonality of the two components. It is important
to note that the distribution in terms of which the above expectation
is defined is none other than D(yt; %), the distribution
operator E
underlying the probability model with 0o the true' value of 0.
The Gauss linear model is specified by the statistical GM (5), the
probability model (6)and the sampling model defined above. Looking at
this statistical model we can see that it purports to model an experimentallike' situation where the xfts are either fixed or controlled by the
experimenter and the chosen values determine the systematic component of
y, via (3).This renders this statistical model of limited applicability in
econometrics where controlled experimentation is rather rare. At the outset
the modeller adopting the Gauss linear model discriminates between yf and
.)

18.2 Estimation

359

the xfrs on probabilistic grounds by assuming yt is a random variable and


the xfrs non-stochastic or controlled variables. ln econometric modelling,
however, apart from a time trend variable. say x, t, t G T, and dummy
variables taking the value zero or one by design, it is very difficult to think of
non-stochastic or controlled variables.
The Gauss linear model is of interest in econometrics mainly because it
enhances our understanding of the linear regression model (seeChapter 19)
when the two are compared. The two models seem to be almost identical
notation-wise, thus causing some confusion; but a closer comparison
reveals important differences rendering the two models applicable to very
different situations. This will be pursued further in the next chapter.
=

18.2

Estimation

For expositional purposes 1et us consider the simplest case where there are
only two non-stochastic variables (k
and the statistical GM of the
Gauss linear model takes the simple form
=2)

A't d?1+
=

d72.X1t

(18.7)

14,

The reason for choosing this simple case is to utilise the similarity of the
mathematical manipulations
between the Gauss linear and linear
regression models in order to enhance the reader's understanding of the
matrix notation used in the context of the latter (seeChapter 19). The first
variable in (7)takes the value one for all t, commonly called the constant (or
intercept).
ln view of the probability model (6)and the sampling model assumption
of independence we can deduce that the distribution of the sample (see
Chapter 11) takes the form
F

D(1'1, #2,

J'w; 04

17D(A',;04.

(18.8)

Hence, the likelihood function, ignoring the constant of proportionality,


can be defined by

360

The Gauss Iinear model

(see Section 13.3). The log likelihood takes the form:


T

log L0; y)=

log 2r--j-

--j-

The first-order conditions


estimators (MLE's) are:

t?log L

1og c a

for the derivation of the maximum likelihood

(
p/?l = 2c

Z (A',

-2)

-'1

(18.11)

72-Y1) =0,

(18.12)

Solving (1 1H13)simultaneously
JI

/;2

MLE'S

J'c.f,

)- -

we get the

Z (.J.,,-

( 18.14)
.9

.:(.x',

( 18. 15)

x32

Z(x1(

where
42

and
Jj J2x, represents the
where gfEE:
of
the
error term uf. Taking
testimator'

,,

( 18.16)

estimated residuals',
second derivatives

.p,

=-

the natural

( 18 17)
.

( log L
p/?lpha = p21ogL

t?c2

J)2

1
1

p2log L
jgx,, kb
1 pcz
,

1
#.

$ xtut.
t

1
-

''-'tri

&''

(18.18)

18.2 Estimation
For p-(/?:, hc, c2), the sample information matrix Iw(p) and its inverse are

Zxt
c2

xf

Z xl

lwtpl-

c2 )

x2t

''

Z(x t

r
-' 1

Il w($1

(18.19)

0.2

-F

-t-Y3

- c

-y

-c2 jj-0 x
T

j-2.x

)((x,-'.)

) (xf

---------L?iq---L--)3
--')i)7----t)

.92

---------

if , (x,-.f)2#0,i.e. theremust
is positivedefinitetlwtpl>t)
be at least two distinct values for xt. This condition also ensures the
existence of J'cas defined by ( 15).

xorcthat/rtp)

of 0

Properties

1,

2,

Asymptotic properties

is a MLE enables us to conclude


limw- a
information (matrix) defined by 1.(:)
:
12-13)
then
definite (seeChapters

The fact that

J t-0, i.e. Iis a consistent


-+

/'T4

E,(4)

p)

that if the asymptotic


()(l F)lw(#)) is positive

estimator of 0 (if
1

N(0, gI.(p)1 -

'

0, i.e. asymptotically

i.e.

tr:

.xt

-.

I is asymptotically

unbiased

()I,x,(p)q-

1,

i.e.

as T-+ w);

normal;

(the asymptotic mean of ;

is 0);

varal

,z-

is asymptotically

efficient.

362

The Gauss linear model

1.(.:) is positive definite if det(1.(p)) > 0,' this is the case if

lim
v- x

--

1
T

)((x/
-

x3 c

( 18.2 1)

qxx > 0.

Finite sample properties

(2)

being a MLE we can deduce that:


is a function of the set of minimal sufficient statistics
F

#y) Z yl, Z.)4, Z xtn


l=1

f=l

t=1

(18.22)

is invariant with respect to Borel functions,i.e. ifh(p):O (6)then


the MLE of h4#)is h(J);see Section 13.3.
ln order to consider any other small (finite)sample properties of J we need
to derive its distribution. Because the mathematical manipulations are
rather involved in the present case no such manipulations are attempted. It
turns out that these manipulations are much easier in matrix notation and
will be done in the next chapter for the linear regression model which when
reinterpreted applies to the present case unaltered.

(vi)

-->

Jj
Jc

'v N

bj
b,

var(J,) Covt/;l J2)


covtrl Jc) vartrc)
,

( 18.23)

where

vartra)

Z(x:-

.92

This result follows from the fact that J'l )5where


42
jyt
and linear functions of normally distributed
;.t= gtx, xx/gzf (xt
random variables and thus themselves normally distributed (seeSection
6.3). The distribution of 42 takes the form
-f,

-x')2j,

T.l
2

.vz2(F-2),

Zf

.f),

(18.24)

18.3

Hypothesis

363

testing and confidence intervals

where z2(T-2) stands for the chi-square distribution with T- 2 degrees of


freedom. (2)follows from the fact that F2 c2 /7ol(f/c)2 involves F- 2
independent squared standard normally distributed random variables.
(viii) From (vii)it follows that E(J1) /?1,f)/J2) /?c, i.e. j and z are
unbiased estimators of ?1 and bz respectively. On the other hand,
since the mean of a chi-square random variable equals its degrees
of freedom (seeAppendix 6.1)
=

Tl

F-2

/.2

(T-2)
S442)=-T'

::::>

i.e. 82 is a biased estimator of


(1 (T- 2)(1 f l is unbiased and
(T- 2).:2
2

(; 1,

(x)

'

.$2

but

the estimator

,vz2(T'-2),

) are independent

This can be verified

c2.

(72c/:c2

of

by considering

(18.25)
.s2

(or 42).
the covariance

between them.

(20) we can see that (/1, ;a) achieve the


Cramer-Rao lower bound and hence we can deduce that they are
fully efficient. Given that :2 isbiased the Cramer-Rao given by (20)
is not applicable, but for sl we know that

Comparing

Var
o

(23) with

(T- 2).s2

az

o.

Var(y2)=

2c*
T- 2

2(T-2)
2c*

the Cramer-Rao
T -

bound.

Thus, although sl does not achieve the Cramer-llao lower bound,


no other unbiased estimator of c2 achieves this bound.

18.3

Hypothesis testing and confidence intervals

In setting up tests and conlidence intervals the distribution of and any


pivotal quantities thereof are of paramount importance. Consider the null
hypothesis
Hv1
Intuition

deviation

)1

l-fj against

H3 : hl #

-1,

being a constant.

suggests that the distance $J1 511, scaled by its standard


(toavoid any units of measurement problems), might provide the

364

The Gauss Iinear model

basis for a

test statistic. Given that

igood'

( 18.26)

this is not a pivotal quantity unless c2 is known. Otherwise we must find an


alternative pivotal quantity. Taking (25)and (26)together and using the
independence between J'1and sl we can set up the pivotal quantity

( 18.27)

IJ
1 1(r-2).
--,-$---1-.-N,'g'%ar(/71)(I
-b-

( 18

.28)

The rejection region for a size x test is

cj

$J -l; j

(' (k'

.----t---s--.y
=

y..

ca

Jgvart/?llj

where 1

dr(F-2).

-a

=
-

(18.29)

L'a

Using the duality between hypothesis testing and confidence intervals (see
level confidence interval for bk
Section 14.5) we can construct an (1
based on the acceptance region,
-a)

C()('1)

y:

1;1

'1

(?1
-

% cu

#r

Evartrll.'.l

cz %

'1)

v'g j- ar( j 1)j

%ca

( 18.30)

T'Zt.Y'-O2
l

.v/

jj (xj-p2
t

(18 1)
.3

x,

18.3 Hypothesis
Similarly, for H(): Iz
Clts-cl

y:

y: a

1$-2
the rejection

z#

1r2- -l > cz
'
x' gkartrclq

region of a size a test

:%

(T'-2)s2

:$

#r(Co)

is

( 18.32)

lh against

against Sj : c2 # l. The pivotal quantity


up the acceptance region

Consider Hv c2
useddirectly to set

tr'tl

365

testing and confidence intervals

(18.34)

1 a,

(25)can be

such that
( 18.35)
A ( 1 -J)

level confidence interval is

c(y)

c2:

(F-2)s2

ca

ts

:6:

(T-2)s2

( 18.36)

Remark: One-sided tests can be easily constructed by modifying the above


two-sided results; see Chapter 14.
Consider the question of constructing a (1 -a) level confidence interval
for

pt= :1

bzxt.

A natural estimator
Vart/)

of pt is

Vart;j +

41+ lzxt,

with Et)

pt, and

;2xt)

( 18.37)

366

The Gauss Iinear model

These results imply that the distribution of

(Jf
-

(i)

pt)

gvartrn

'v

Jf is

normal and

x(() 1).,

(js.a8)

(18.39)
construct

jsl

(x,
-

.y

Z(x,-.k)2
l

( 18.40)
This confidence interval can be extended to t> T in order to provide us with
a prediction conlidence interval for yvyt, ly 1

( 18.4 1)

(see Chapters 12 and 14 on prediction).


In concluding this section it is important to note that the hypothesis
testing and confidence interval results derived above, as well as the
estimation results of Section 18.2, are crucially dependent on the validity of
the assumptions underlying the Gauss linear model. lf any of these
assumptions are in fact invalid the above results are unwarranted to a
greater or lesser degree (seeChapters 20-22 for misspecscation analysis in
the context of the linear regression model).

18.4

Experimental

design

ln Section 18.2 above we have seen that the

MLE'S

J1and

h of

bL

and bz

Looking ahead

18.5

respectively are distributed as bivariate normal as shown in (23).


The fact that the xfs are often controlled variables enables us to consider
the question of
the statistical GM (5) so as to ensure that it
satisfies certain desirable properties such as robustness and parsimony.
These can be achieved by choosing the xfs and their values appropriately.
Lookin4at their variances and covariances we can see that we could make
41 and b2 more
by choosing the values of xf in a certain way.
Firstly, if
0 then
idesigning'

'accurate'

.k

;2) 0

Covlrl,

(18.42)

and and ;2 are now independent. This implies that if we were to make a
changeof origin in x, we could ensure that J1 and z are independent.
Secondly, the variances of J1and /2 are minimised when f xtl (given
isas large as possible. This can be easily achieved by choosing the value of xl
and as large as possible. For
to be on either side of zero (to achieve
example,we could choose the xfs so that
';1

=0)

.'.1=0)

X2

X T+ l

X F

XF

K>

( 18.43)

11

(T even) and n is as large as possible; see Kendall and Stuart (1968).


Another important feature of the Gauss linear model is that repeated
observations on y can be generated for some specified values of the xts by
repeating the experiment represented by the statistical GM (7).

18.5

Looking ahead

From the econometric viewpoint the linear control knob model can be seen
to have two questionable features. Firstly, the fact that the xfls are assumed
to be non-stochastic reduces the applicability of the model. Secondly, the
independent sample assumption can be called into question for most
economic data series. In other disciplines where experimentation is possible the Gauss linear model is a very important statistical model. The
purpose of the next chapter is to develop a similar statistical model where
th first questionable feature is substituted by a more realistic formulation
of the systematic component. The variables involved are all assumed to be
random variables at the outset.
Important concepts
Non-stochastic or controlled
repeated observations.

variables, residuals, experimental

design,

368

The Gauss Iinear model

Questions
Explain the statistical GM of the Gauss linear model.
Derive the MLE'S
of b and c2 in the case of the general Gauss linear
model where
b'xr + ut, t= 1, 2,
7; x, being a k x 1 vector of nonstochastic variables, and state their asymptotic properties.
Explain under what circumstances the MLE'S f 1 and 42of bj and bz
respectively are independent. Can we design the values of the non.p,

4.

stochastic variables so as to yet independence'?


Explain why the statistic l/l,/Vartrll is distributed as r(T'-2) and use it
to set up a test for
Ho: bj

against

HL :

bf #U,

as well as a confidence interval for ht.


Verify that and 42 are independent.
Additional references
Chow ( 1983); Dhrymes ( 1978); Johnston ( 1984); Judge e (11. (1982);Kmenta
Koutsoyiannis
(1975) Maddala ( 1977);Pindyck and Rubinfeld (1981).

(197 1);

CHAPTER

19

The linear regression model 1 - specification,


estimation and testing

19.1

Introduction

The linear regression model forms the backbone of most other statistical
of the
models of particular interest in econometrics. A sound understanding
regression
prediction
and
estimation,
the
linear
in
testing
specification,
model holds the key to a better understanding of the other statistical models

discussed in the present book.

In relation to the Gauss linear model discussed in Chapter 18, apart from
and the mathematical
similarity in the notation
some apparent
statistical
involved
analysis,
the linear regression
in the
manipulations
situation
model
from the one envisaged
model pumorts to
a very different
particular
model
could be considered to
the Gauss linear
by the former. In
models of the
estimable
analysing
statistical
model
for
be the appropriate

form

Mt

Jli,

=aa+

(19.2)

where Mt refers to money and Qit,i 1, 2, 3 to quarterly dummy variables,


in view of the non-stochastic nature of the xfts involved. On the other hand,
estimable models such as
=

Akrxbpxzlx

(19.3)

referring to a demand for money function (M - money, F- income, P -price


level, 1 - interest rate), could not be analysed in the context of the Gauss
369

Specilkation, estimation and testing

linear model. This is because it is rather arbitrary to discriminate on


probabilistic grounds between the variable giving rise to the observed data
chosen for M and those for F, # and 1. For estimabl'e models such as (3)the
linear regression model as sketched in Chapter 17 seems more appropriate,
especially if the observed data chosen do not exhibit time dependence. This
willbecome clearerin thepresent chapter after the specification of thelinear
regression model in Section 19.2. The money demand function (3)is used to
illustrate the various concepts and results introduced throughout this
chapter.

Spaification

19.2

Let tZ?, t c T) be a vector stochastic process on the probability space (S,


#4.) where Zt (yf,X;)' represents the vector of random valiables giving rise
to the observed data chosen, with yr being the variable whose behaviour we
are aiming to explain. The stochastic process (Zf,lGT) is assumed to be
nornml, independent and identically' distributed (N11D) with '(Zt) m and
Covtzf E. i.e.
.28

j (( j( ) jj

(x-'',

.''' yx

t;,1

sl1,2,

in an obvious notation (seeChapter 15). lt is interesting to note at this stage


that these assumptions seem rather restrictive for most economic data in
general and time-series in particular.
On the basis of the assumption that .f(Z?,r s T) is a Nl1D vector stochastic
Zw; #)in
process we can proceed to reduce thejoint distlibution D(Z1,
order to define the statistical GM of the linear regression model using the
general form
.

yt

Eyt/xt

xf)

y, E(y,,/Xf
14,,.,u,
=

is the
x,)

systematic

the

pt
p0

and

E (A',/'X,

my

/:

X,)=

component,

non-systematic

(seeChapter 17). In view of the normality


where

(19.5)

where
and

of (Zl, t G T) we deduce that

Jo + fxt (linearin

2E2-21m
x
,

component

x,),

(19.6)

az 1
# Ec-21
=

Var(u,/X,=x,) =Var(y'/X,=x,)=c2

(homoskedastic), (19.7)

19.2 Specification
where c2 cj 1 - o.j atc-al ,yj (seeChapter 15). The time invariance of the
parameters jo, ( and tz = stems from the identically distributed (lD)
assumption related to .tZr, t G T). It is important, however, to note that the
ID assumption
provides only a sufficient condition for the time invariance
of the statistical parameters.
ln order to simplify the notation let us assume the m 0 without any loss
of generality given that we can easily transform the original variables in
and (Xt
This implies that ?v, the
mean derivation form (y,
coefficient of the constant, is zero and the systematic component becomes
=

-n1y)

-mx).

#'Xl.
(19.8)
l-tt E (-Fr?''XtN)
ln practice, however, unless the observed data are in mean deviation form
the constant should never be dropped because the estimates derived
otherwise are not estimates of the regression coeflicients #= Ec-21J2, but of
'(XlX;)- ''(Xf'.Ff); SCe Appendix 19.1 on the role of the constant.
#*
The statistical GM of the linear regression model takes the particular
form
y', p'xt+ Ikt,
(19.9)
=

with 0- (#,c2) being the statistical parameters t#' interest; the parameters in
terms of which the statistical GM is defined. By construction the systematic
and non-systematic components of (9)satisfy the following properties:
E(ut%t

Xf)

(ii)

f;tutls/x,

xf)

(iii)

Eptut/xt

Xt)

E(1'r
-

'(-Pt,?/Xf Xf)),'''Xf Xt1


=

ptkutjxt

X?)

=0,

tuf,

The first two properties define


t 6 T) to be a white-noise process and (iii)
establishes the orthogonality of the two components. It is important to note
,,'XZ
that the above expectation operator
xr) is defined in terms of
D()//Xf', 0), which is the distribution underlying the probability model for
(9).However, the above properties hold for F( ) defined in terms of Dzt; #)
as well, given that:
'(

'

'

(i)'

'(u,)

'tutlks)
=

F(f'(?.k,/X,= xl))
'hftftagus/''xt

=0.,

xt))

G 2,
=

0,

Specification, estimation and testing

and
(iiil'

JJ(/t,uf)

Elkytut

'Xf

xj))

=0,

r, s iE T

(see Section 7.2 on conditional

expectation).
The conditional distribution D()',/X,', 04is related to thejoint distribution
Dn, X; #) via the decomposition
D(#,, Xf',

) D(1',//X?; l ) D(X,;

( 19. 10)

a)

'

(see Chapter 5). Given that in defining the probability model of the linear
regression model as based on 1)(),,,/'Xf;p) we choose to ignore D(X,; #2)for
the estimation of the statistical parameters of interest p. For this to be
possible we need to ensure that X, is wt?kk/), exogenous with respect to 0 for
the sample period r= 1, 2,
r (see Section 19.3. below).
For the statistical parameters of interest 0 EEE(#, c2) to be well defined we
need to ensure that Ec2 is non-singular, in view of the formulae #= E2-21o.c1,
c2 cj j - o.j aE2-a1,a1 at least for the sample period r 1, 2,
T This
requires that the sample equivalent of 122,( 1/F)(X'X) where X EEE(xj
i.e.
xwl' is indeed non-singular,
.

,x2,

rank(X'X)

ranktx)

k,

Xf being a k x 1 vector.
As argued in Chapter 17, the statistical parameters of interest do not
necessarily coincide with the theoretical parameters of interest t. We need,
however, to ensure that ( is uniquely defined in terms of 0 for ( to be
identnable. ln constructing empirical econometric models we proceed from
a well-defined estimated statistical GM (seeChapter 22) to reparametrise it
in terms of the theoretical parameters of interest. Any restrictions induced
by the repa'rametrisation, however, should be tested for their validity. For
this reason no a priori restrictions are imposed on p at the outset to make
such restrictions testable at a later stage.
As argued above, tbe probabilitv tnodel underlying (9)is defined in terms
of D(y'r?'Xr;p) and takes the form

(1)
=

D(yr/Xt; 0)

exp

(21)

1
-

2c

j.

(yf p'xtll
-

0 s IJd x IJ@
+ t e: T
,

(19.12)

in view of the independence of (Z,, r 6 T) the sampling model


sequentially
takes the form of an independent sample, y > (y1
drawn from D(y,,/X,', p), l 1, 2,
T respectively.
Having defined all three components of the linear regression model 1et us
Moreover,

,y'w)',

19.2 Specification
together and specify the statistical

collect al1 the assumptions


properly.

Statistical GM, yf

specilication

model:

The linear regrenion


=

#'xf+

model

u,, l G T

g3j
g41
g5q

'(yr,/Xt=x,)
S(y,,/'Xt xf) - the systematic component; ut yr
non-systematic component.
the
- N(#, J2)e
0
p Na-clcaj c2 o.j j - cj ara-al ,aj are the statistical
Cov(X,, y,),
parameters of interest. Note: Eac Cov(X,), ,2I
Var(yf).)
cl j
T
Xf iS weakly exogenous with respect to 0. r 1, 2,
No a priori information on 0.
xwl/,' T x I data matrix, (T > /).
Ranklx)
k, X (x1, xa,

(11)

Probability

E1(1
17(1

p,

model

Dl.,'r,''Xf;0$

exp

'-QVx'
( )
.

(-J.'f-

2c 1

#'x,)2

D(#f Xl ; #) is normal;
ti)
Ft-r X, xf) p'xt linear in xr;
(ii)
- homoskedastic
xr) (7.2
Vart))r/x,
(free of xt),'
(iii)
p is time invariant.
.

(111)

Sampling model

y' represents an independent sample sequentially


'J:
drawn from 1)(y,,,/X,.,p), t 1, 2,
speeification is that the
above
An important point to note about the
making
of
D(y,g'Xf;0)
no assumptions
model is specified directly in terms
model there is
FOr
regression
the specification of the linear
about D(Zf; #).
T).
The problem,
no need to make any assumptions related to 1Zf, r 6
however, is that the additional generality gained by going directly to
D( #'l/''Xf;0) is more apparent than real. Despite the fact that the assumption
that tZr, t g -IJ-)is a NllD process is only sufficient (not necessary) for g6)
to g81above, it considerably enhances our understanding of econometric
modelling in the context of the linear regression model. This is, firstly,
because it is commonly easier in practice to judge the appropriateness of

g8j

y EB (-:1
,

Specification, estimation and testing

probabilistic assumptions related to Zf rather than (yr,/X, x,); and,


secondly, in the context of misspecification analysis possible sources for the
departures from the underlying assumptions are of paramount importance.
Such sources can commonly be traced to departures from the assumptions
postulated for (Zr, r (E T) (see Chapters 2 1-22).
Before we discuss the above assumptions underlying the linear regression
it is of some interest to compare the above specification with the standard
textbook approach where the probabilistic assumptions are made in terms
of the error term.
=

Standard

' X#
=

textbook

spcct/krprft?n of tbe linear regression

model

+u.

X(0, c21w);
no a priori information on (#,c2);
rank (X) k.
Assumption (1) implies the orthogonality F(Xf'?-/f,,At x,) 0, t,z,z1, 2,
'T;and assumptions (6j to g8j the probability and the sampling models
respectively. This is because (y,?'X)
is a linear functionof u and thus normally
distributed (see Chapter 15), i.e.

(1)
(2)
(3)

(u,/X)

'v

(y?'X) N(X#, c2Iw).


'v

(19.13)

As we can see, the sampling model assumption of independence is


behind the form of the conditional covariance c2J. Because of this the
independence assumption and its implications are not clearly recognised in
certain cases when the linear regression model is used in econometric
modelling. As argued in Chapter 17, the sampling model of an independent
sample is usually inappropriate when the observed data come in the form of
aggregate economic time series. Assumptions (2)and (3)are identical to (4j
and g5) above. The assumptions related to the parameters of interest
onp, c2) and the weak exogeneity of X with respect to 0 (g2qand g3(J
above) are not made in the context of the standard textbook specification.
These assumptions relpted to the parametrisation of the statistical GM play
a very important role in the context of the methodology proposed in
Chapter 1 (seealso Chapter 26). Several concepts such as weak exogeneity
(see Section 19.3, below) and collinearity (seeSections 20.5-6) are only
detinable with respect to a given parametrisation. Moreover, the statistical
going
GM is turned into an econometric model by reparametrisation,
from the statistical to the theoretical parameters of interest.
The most important difference between the specification g1()-g8q
and
( 1)-(3), however, is the role attributed to the error term. In the context of the
thidden'

19.3

Discussion of the assumptions

latter the probabilistic and sampling model assumptions are made in terms
of the error term not in terms of the observable random variables involved
as in g11-r81.This difference has important implications in the context of
misspecification testing (testingthe underlying assumptions) and action
thereof. The error term in the context of a statistical model as specified in
the present book is by construction white-noise relative to a given
information set % 7 +.

19.3

Discussion of the assumptions

r1q

The systematic

and non-systematic

components

in Chapter 17 (see also Chapter 26) the specification of a


T; i.e.
statistical model is based on the joint distribution of Zf, l 1, 2,

As argued

D(Z1 Z2,
,

(19. 14)

Zw; #)H D(Z; #)

which includes the relevant sample and measurement information.


The specification of the linear regression model can be viewed as directly
using the assumptions
of
related to ( 14) and derived by
normality and llD. The independence assumption enables us to reduce
1)(Z; #) into the product of the marginal distributions D(Zf; kt),r= 1, 2,
Ti i.e.
treduction'

)-

D(z;

17D(Z,; #,)
=

( 19. 15)

The identical distribution enables us to deduce that #f # fOr t 1.2, ,Z


The next step in the reduction is the following decomposition of D(Z(; #):
=

D(Zf;

#)

(19. 16)

#1) D(X,; /2).

D(#t,/X,;

'

The normality assumption with E> 0 and unrestlicted enable us to deduce


the weak exogeneity of Xf relative to 0.
tXf xt) depends
The choice of the relevant information set
crucially on the NIID assumptions', if these assumptions are invalid the
will in general be inappropriate. Given this choice of , the
choice of
non-systematic components are defined by:
and
systematic
.)t

.1-

.1,

pt

E ()4,/Xt

Xt),

lh

Under the NIID assumptions


l.t? j'xg, u:
=

y?

l?f-

EtAt

yt and ut take the particulgr

q'xt.

(19.17)

Xr).

forms:

(19.18)

Spification,

estimation

and testing

Again, if the NIID assumptions

pt*p1
(seeChapters

(21

and

are invalid then

EplulAt

( 19. 19)

xt) #z

2 1-22).

Te parameters

of

interest

As discussed in Chapter 17, the parameters in terms of which the statistical


GM is defined constitute by defnition the statistical parameters of interest
parametrisation
of the unknown
and they represent
a particular
of
underlying
probability
model.
ln
the case of the linear
the
parameters
model
interest
of
the
regression
parameters
come in the form of p>(j, c2)
c2
Ec-21o'c1,
-,1cE2-c1la1.
above
the
As argued
cj1
where pn
parametrisation 0 depends not only on D(Z; #) but also on the assumptions
of NIID. Any changes in Zf or and the NIID assumptions will in general
change the parametrisation.
=

(31

Exogeneity

ln the linear regression model we begin with Dtyf, Xf;


where
concentrate exclusively on Dtyt/xf
,./1)

D(.Fl, Xr/#)

=D(-pt,/X,;#1) D(Xr; /2),


'

#) and then

we

(19.20)

which implies that we choose to ignore the marginal distribution D(Xf;/2). ln


order to be able to do that, this distribution must contain no information
relevant for the estimation of the parameters of interest, 0n (p,o.l),i.e. the
stochastic structure of X/ must be irrelevant for any inference on 0.
Formalising this intuitive idea we say that: Xf is weakl exogenous over the
with # H(#1,
sample period for 0 if there-exists a reparametrisation
#2)
such that:
(i)
p is a function of #j (0 h(/1));
(ii)
/1 and 2 are variation free ((#: #a)e:T1 x Ta).
Variation free means that for any specific value #a in T2, /1 call take any
other value in Tj and vice versa. For more details on exogeneity see Engle,
Hendry and Richard (1983).When the above conditions are not satisfied
the marginal distribution of Xf cannot be ignored because it contains
relevant information for any inference on 0.
=

Discussion of the assumptions

19.3

g4(l

N0 a priori information

on 0 >

(#, c2)

is made at the outset in order to avoid imposing invalid


testable
on p. At this stage the only relevant interpretation of 0
statistical
parameters, directly related to #j in D( I'f/''Xr;/1). As such no
is as
Such information is
a priori information seems likely to be available for 0.
of
interest
related
theoretical
parameters
commonly
(. Before 0 is
to the
statistical
underlying
need
the
that
to ensure
used to define (, however, we
observed
of
in
data
misspecification)
the
well
is
terms
defined (no
model
chosen.

This assumption

restrictions

The observed data matrix X is ofill

IS

For the data matrix X H!

ranktx)

(x1
,

x2,

vank

xz)', T x /(, we need to assume that

k, k < 'J:

The need for this assumption is not at al1 obvious at this stage except
perhaps as a sample equivalent to the assumption

rankttzz)

k,

needed to enable us to define the parameters of interest 0. This is because


ranktx) rank(X'X), and
=

1
k . jg xrxt'

1
=

y.

(X'X)

can be seen as the sample moment equivalent to Ecc.

L61

linearity

Normality

homoskedasticity

of normality of D( J',, X,; #) plays an important role in the


specification as well as statistical analysis of the linear regression model. As
far as specification is concerned, normality of D(),,, Xr; #) implies
D(.r, X/; p) is normal (see Chapter 15),.
(i)
x,) #/x,,a linear function of the observed value xr of Xr',
f)y,/'Xl
(ii)
is free of xt, i.e.
Varly, Xj x,) c2, the conditional variance
(iii)
homoskedastic.
Moreover, (i)-(iii)come very close to implying that Dt.rt, Xf; #) is normal as
well (see Chapter 24.2).
The assumption

Spification,
Parameter

estimation

and testing

time-invariance

As far as the parameter invariance assumption is concerned we can see


that it stems from the time invariance of the parameters of the distribution
Dlyt, Xf; ); that is, from the identically distributed (ID) component of the
normal 11D assumption related to Z,.

g8)

Independent sample

The assumption that y is an independent sample from D(y,/Xf; 0), t= 1, 2,


T; is one of the most crucial assumptions underlying the linear
.
regression model. ln econometrics this assumption should be looked at
very closely because most economic time series have a distinct time
which cannot be modelled exclusively in terms of
dimension (dependency)
exogenous random variables X,. ln such cases the non-random sample
assumption (seeChapter 23) might be more appropriate.
.

19.4

Estimation

(1)

Maximum likelihood estimators

Let us consider the estimation of the linear regression model as specified by


the assumptions I(1)-g81discussed above. Using the assumptions (61 to
(8j we can deduce that the likelihood function for the model takes the form
L(p, c2; y, A')

1'

=k(y)

lgJ cv
-

exp

(2zr)

-#'x,)2

2c .z(yt

( 19.23)

(3)

( 19.25)

19.4 Estimation
F

1
ucz- 'r t ) 1 (y,- j'x' )2ec2

T'i j(1

tl,

in an obvious notation,

( 19.26)

c2, respectively.
lf
are the maximum likelihood estimators (MLE's) of # and
staiisticl
2,
T; in the
GM, p, #'x,+ u, r 1,
we were to write the
matrix notation form
=

(19.27)

y= Xj+u,
where y >

F x 1, the

(y1
,

xsl', F x k, and u H
T x 1, X EEE(x!
suggestive
form
take the more

(uI

.vv4',

MLE'S

/ (x'x) - t x'y

14z.)',

and for

BE

The information

y -XJf,

42

( 19.28)

'

matrix Iw(#) is defined by

Iw(p)-s((-Pl0BL)(P1Og)')-s(-t?21OBL),

(-)0

Pp (0'

t?p

where the last equality holds under the assumption that D()'t X,; 0)
probability model. ln the above case
represents the
'true'

:2 log L
. P/ 0/

1
=

c2

p2log L T
t?c4 = 2c4

y xfx;

ca

,-1

JX

:2 log L

(x'x), t7'j

cl

= -

tr

4.
f

)
=

xf ut

1
-

2Es

ut1

( 19.29)

Hence
X'X

c2(x'X)-

Ir($

and

gIw(p)q-

0
2c4

( 19.30)
lt is very important to remember that the expectation operator above is
defined relative to the probability model Dty'f,/Xr;0).
ln order to get some idea as to what the above matrix notation formulae

380

Specilication, estimation and testing

look like let us consider these formulae for the simple model:
''f-/?cxf +',/t,
l,f ib

t- 1, 2,

( 19.3 1)

X1
X2

X'X

x, Z

x,
xl

),
tpyyl
2

.vf

X'

'

EB

xf)',
t

- p-2

.p

).;

Compare these formulae with those of Chapter 18.


One very important feature of the MLE p-above is that it preserves the
original orthogonality
between the systematic and non-systematic
components

(19.32)

y p + u,

p .u
between the estimated systematic and non-systematic
=

components

in the

form
y=

/+,

-L

( 19.33)

19.4 Estimation
b NXj-

-X/

respectively. This is because

!h

and

= Pxy

(1-

( 19.34)

Pxly,

where Px X(X'X) - 1X' is a symmetric (P' P ), idempotent


matrix (i.e.it represents an orthogonal projection) and
=

E(;')

(Px2 Px)
=

'(Pxyy'(l - Px))

= f7tpxyu'll

Px)),

Px)c2,
= Px(I -

since

(1 Pxly
-

since S(yu')

(1 Pxlu

c2Iw

since Px(I - Px) 0.


= 0,
ln other words, the systematic and non-systematic components were
estimated in such a way so as to preserve the original orthogonality.
Geometrically Px and (1 Px) represent orthogonal projectors onto the
subspace spanned by the columns of X, say .,//(X), and into its orthogonal
complement //(X)-, respectively. The systematic component was estimated
by projecting y onto ,.,//(X) and the non-systematic
by
component
projecting y into //(X)i, i.e.
=

y Pxy
=

(1 Pxly.

(19.35)

Moreover, this orthogonality, which is equivalent to independence in this


(92
since is independent of
context, is p assed over to the MLE'S j and
y'(I Pxly, the residual sums of squares, because Px(I - Px)
(seeQ6,
Chapter 15). Given that #= X/ and 42 ( 1/T)' we can deduce that / and
42 are independent', see (E2) of Section 7.1.
Another feature of the MLE'S
/ and :2 worth noting is the suggestive
similarity between these estimators and the parameters j, c2:
'

=0

/
J

2>

N 2-ai /2
tr 11

'

l-

X'X
=

X' y

.--.

( 19.36)

12

E 22
- a 21,

)-(y,.'X)(Xw'X)-'(X'y).

oz-l;'

(19.37)

Looking at these formulae we can see that the MLE'S


of j and c2 can be
derived by substituting the sample moment equivalents to the population

moments:
Na a :

(X'X),

c'zl

Using the orthogonality

X'y,

cj

1:

1
-.

y'y.

of the estimated components

( 19.38)

and

we could

Specification,

estimation

and testing

decompose the variation in y as measured by y'y into


'''h.'d

-?

?
yr=##+=#X
m

-?

X#+.

( 19.39)

Using this decomposition we could define the sample equivalent


multiple correlation coefhcient (seeChapter 15) to be

to the

''-'

#2

'

X(X X)
/

Xy
,

.'

1-

'

p;

(19.40)

y
yy
This represents the ratio of the variation
by / over the total
variation and can be used as a measure t#' goodness C!JJlt for the linear
regression model. A similar measure of fit can be constructed using the
decomposition of y around its mean jF, that is

'

'explained'

(y'y

T)*)

(/'/

Tgl) +

/,

(19.41)

denoted as
TSS
ltotal)

ESS

RSS

(19.42)

( residual)

(explained)

where SS stands for sums of squares. The multiple correlation coefficient in


this case takes the form
a

R- 2

pa,#

'

(y y

yy
F

=-

'zk

Rss
-

T SS

( 19.43)

Note that R2 was used in Chapter 15 to denote the population multiple


correlation coefficient but in the econometrics literature R2 is also used to
2
and #2.
denote
of t', #2 and kl, have variously
Both of the above measures of
sample
multiple
correlation coefficient in the
defined
been
to be the
when reading different
Iiterature.
should
exercised
econometric
Caution
be
and
properties.
textbooks because #2
For example,
have different
0 < #2 < 1, but no such restriction exists for 2 unless one of the regressors
in X, is the constant term. On the role of the constant term see Appendix
19.1.
One serious objection to the use of *2 as a goodness-of-fit measure is the
fact that as the number k of regressors increases, kl increases as well
irrespective of whether the regressors are relevant or not. For this reason a
goodness-of-fit measure is defined by
bgoodness

.k2

Ecorrected'

( 19.44)

19.4 Estimation

383

is the division of the statistics involved


corresponding degrees of freedom; see Theil ( 197 1).

The

(2)

correction

An empirical

by their

example

In order to illustrate some of the concepts and results introduced so far let
us consider estimating a transactions demand for money. Using the
simplest form of a demand function we can postulate the theoretical model:
MD=

(19.45)

(X P, 1),

MD

is the transactions demand for money, F is income, P is the price


where
level and 1 is the short-run interest rate referring to the opportunity cost of
holding transactions money. Assuming a multiplicative form for ( ) the
demand function takes the form
'

A,jnw Ayapxzix,

or

ln

MD

( 19.46)
( 19.47)

atj + aj ln F + aa ln P + as ln 1,

where ln stands for loge and

(xo

ln

.4.

For expositional purposes let us adopt the commonly accepted approach


modelling (seeChapter 1) in an attempt to highlight some of
econometric
to
the problems associated with it. If we were to ignore the discussion on
econometric modelling in Chapter 1 and proceed by using the usual
'textbook' approach the next step is to transform the theoretical model to
an econometric model by adding an error term, i.e. the econometric model

is

( 19.48)
where mt ln Mt, yr In F), pf ln Pt, it ln It and t1( N/(0, c2). Choosing
some observed data series corresponding to the theoretical variables, M, F,
=

.v

P and 1, say:

Vf -

M 1 money stock;
real consumers' expenditure;
J Pt - implicit price deflator of #;,'
f, interest rate on 7 days' deposit account (seeChapter 17 and its
appendix for these data series),
respectively, the above equation can be transformed into the linear
regression statistical GM:
D-t, =

/.0+ /1h + /2/, + Ist

Estimation of this equation

(19.49)

&f.

for the period

1963-1982f:

(T= 80) using

Spilication,

estimation

and testing

quarterly seasonally adjusted (for convenience) data yields


2.896
0.690

/-

0.865
-

.$2

=0.00

TSS

0.055
155

24.954,

That is, the estimated

kl

'

ESS

42

0,9953,
24.836,

equation

l?-i 2.896 +0.690.ff


f

=0.995

RSS

0.1 18.

takes the form

+0.865/:$

-0.0554 +

t.

( 19.50)

The danger at this point is to get carried away and start discussing the
plausibility of the sign and size of the estimated
(?).For example,
telasticities'
have both a
we might be tempted to argue that the estimated
tcorrect' sign and the size assumed on a priori grounds. Moreover, the
tgoodness of fit' measures show that we explain 99.5t% of the variation.
Taken together these results
that (50)is a good empirical model
for the transactions demand for money. This, however. will be rather
premature in view of the fact that before any discussion of a priori economic
estimated statistical
theory information we need to have a well-dhned
model which at least summarises the sample information adequately. Well
defined in the present context refers to ensuring that the assumptions
underlying the statistical model adopted are valid. This is because any
formal testing of a priori restrictions could only be based on the underlying
assumptions which when invalid render the testing procedures incorrect.
Looking at the above estimated equation in view of the discussion of
econometric modelling in Chapter l several objections might be raised:
The observed data chosen do not correspond one-to-one to the
(i)
theoretical variables and thus the estimable model might be
different from the theoretical model (see Chapter 23).
The sampling model of an independent sample seems questionable
in view of the time paths of the observed data (see Fig. 17.1).
The
high /2 (and #2) is due to the fact that the data series for Mt and
(iii)
Pt have a very similar time trend (seeFig. 17.1(t# and (L')). If we Iook
at the time path of the actual ( #'f) and fitted (.f,,)values we notice
(explains) largely the trend and very little else (see
that fr
Fig. 19.1). An obvious way to get some idea of the trend's
contribution in .*2 is to subtract pt from both sides of the money
equation in an attempt to
the dependent variable.
'elasticities'

kindicate'

ttracks'

bdetrend-

19.4

Estimation
actual
z fitted

10,6
'**

0.4

A
-'

10.2

10.0

z'e

'sz 9.8

z.

gs'

'-

--

9. 4

,-

9.2

.e

>'

.A

9.0
8.8

1963

1972
Time

1969

1966

1982

1978

1975

f', from ( 19.50).

Fig. 19. 1. Actual .J4 ln Mt and fitted


=

9.9

9. 8
/

k/ N
N.

actual

.../

9.7

x N.
xN
/
w

d
'-

<.

z v z'h-

z.s
h

sv'''.'X

/ !

.x

/
l /
l l
'hu

9.6

1963

.h.

1966

1972
Time

1969

Fig. 19.2. Actual

.pz

1975

lj
1

N/ u
/

-..v

/
.e-'

1978

f itted

1982

ln (M P)t and fitted ft from ( 19.5 1).

detrended dependent
ln Fig. 19.2 the actual and fitted values of the
the
The new regression
emphasise
point.
variable (r?1, are shown to
equation yielded
elargely'

-pt)

tmf-p1)

2.896 +0.690.:,

.k2 0.468
=

-0.135pt

42 0.447
=

sl

-0.055t

+
=0.00155.

Iif,

(19.51)

386

Spification,

estimation

and testing

Looking at this estimated equation we can see that the coefficients of the
constant, yf and it, are identical in value to the previous estimated equation.
The estimated coefficient of pt is, as expected, one' minus the original
estimate and the is identical for both estimated equations. These suggest
that the two estimated equations are identical as far as the estimated
coefficients are concerned. This is a special case of a more general result
related to arbitrary linear combinations of the xzs subtracted from both
sides of the statistical GM. In order to see this let us subtract y'xf from both
sides of the statistical GM:
.s2

or

-T'Xf

(#'-T')Xf

),t

(19.52)

+ ut

j*zx + ut,

in an obvious notation. It is easy to see that the non-systematic component


as well as c2 remain unchanged. Moreover, in view of the equality
*

(19.53)

where
*

*
=y -X#

-+

'-'*

j =(X X)
'

X y#
'

=#-y,

-'

we can deduce that


sl

*'*

F- k

On the other hand,


*2

'

=
-

kl is not

invariant

'

t' +,y+-

(19.54)

'

2.

,z'y-+2

to this transformation because

(19.55)

.*2

of the
As we can see, the
dependent variable equation is less
than half of the original. This confirms the suggestion that the trend in pt
contributes significantly to the high value of the original kl. It is important
to note at this stage that trending data series can be a problem when the
asymptotic properties of the MLE'S are used uncritically (seesub-section (4)
below).
'detrended'

(/ J2j jinite sample


In order to decide whether the MLE l is a good' estimator of 0 we need to
consider its properties. The tinite sample properties (seeChapters 12 and
13) will be considered tirst and then the asymptotic properties.
(3)

Pvoperties of the MLE

l being a MLE satisfies certain properties by definition:


For a Borel function hf ) the MLE of h(0)is (4).For example, the
(1)
MLE of logtj'j) is loglj/j)
'

19.4 Estimation
(2)

387

If a minimal sufficient statistic

T(y)

of it.
Using the Lehmann-scheffe
theorem
the values of yo for which the ratio

exists, then

(seeChapter

oty,'x; (2,,:,.c)-w,c

12) we can deduce that

(y-x#)'(y-x#))

expt-aota

p)
Dyo X; 0' =

l must be a function

.y

.rg1

w,,a

taagal-

exp

.xj),(y()

ty()

(19.56)

xpjq

is independent of 0, are ygyll y'y and X'y: X'y. Hence, the minimal
lz
T2(y))
sufficient statistic is z(y) H (z1(y),
(y'y,X'y) and J (X'X) - 2 (y),
1za(y))
T(y).
42 (1/F)(z1(y) -zk(y)(X'X)are indeed functions of
ln order to discuss any other properties of the MLE of 0 we need to
derivethe sampling distribution of 4.Given that / and :2 are independent
we can consider them separately.
=

The distribution of

/=(x'x)-

lx'y

/
(19.57)

Ly,

''*

!X) - 1

where L H(X
X is a k x T matrix of known constants. That j s,
linear functionof the nonnally distributed random vector y. Hence
,

p- NILX/, c2LL')

from N1, Chapter 15,

'v

or

p-'wNp c2(X'X)-

j is a

')

(19.58)

From the sampling distribution (58)wecan deduce the following properties


for p
(3(i)) / is an unbiased estimator of # since Elp-) #, i.e. the sampling
distributionof /1has mean equal to p.
- is a fully efflcient estimator of # since Cov(/) c2(X'X)- 1 i e
Covtj) achieves the Cramer-Rao lower bound; see (30)above.
=

The distribution of 42

42
where Mx

=-

F
I

Tdl

c2

(y-Xj)'(y

-Xj)

Px. From

(Q2)of Chapter

'wzzttr Mx),

=-

'

=-

u'Mxu,
.

(19.59)

15 we can deduce that

(19.60)

Spaitkation,

estimation

and testing

where tr Mx refers to the trace of Mx (tr A


tr Mx

)J'- 1 aii, A: n x

tr I - tr X(X'X) - 1X' (since trtxi + B)

= F- tr(X'X) - 1(X'X)

(since tr(AB)

n),

tr A + tr B)

tr(BA))

'-k.

=
Hence, we can deduce that
T.l

g2(w- k).

( 19.6 1)

lntuitively we can explain this result as saying that (u'Mxu)/c2 represents


the summation of the squares of T- L independent standard normal
components.
Using (61) we can deduce that
Fd2

T- k

cz

and

Var

F#2
c

2(T- k)

(see Appendix 6.1). These results imply that


T- k

f)ti2)=

-T

c2 # c2

'

That is:

82 is a biased estimator of c2; and


82 is not a jll), efhcient estimator of c2.
However, 3(ii) implies that for

(3(ii))
(4(ii))

sl

(T- k)

'

( 19.62)

S2

(Ty z
'v

/()

(19.63)

and E(Sl) cl, Vartyzj (2c4)/(F- k) > (2c4)/F - Cramer-Rao bound.


That is, is an unbiased estimator of c2, although it does not quite achieve
the Cramer-Rao lower bound given by the information matrix (30)above.
lt turns out, however, that no other unbiased estimator of c2 achieves that
bound and among such estimators
has minimum variance. ln statistical
inference relating to the linear regression model
is preferred to :2 as an
estimator of c2.
The sampling distributions of the estimators p- and
involve the
=

.s2

.:2

.$2

.s2

389

19.4 Estimation

c2. ln practice the covariance of is needed to


/
unknownparameters # and
of the estimates. From the above analysis it is known
assessthe
that
1
'accuracy'

Cov(/)=c2(x'x)-

(19.64)

which involves the unknown parameter c2. The obvious


such a case is to use the estimated eovariance
ovt#-)=.$2(X'X)-

way

to proceed in

( 19.65)

1
.

The diagonal elements of ov()) refer to the estimated variances of the


eoefficient estimates and they are usually reported in standard deviation
form underneath the coefficient estimates. ln the case of the above example
results are usually reported in the form
mt 2.896 +0.690./,, +0.865p,
( 1.034) (0.105) (0.020)

(19.66)

-0.055,

5.2

72

=0.9953,

f,

(0.013) (0.039)

F= 80.

1og L= 147.412,

:=0.0393,

=0.9951,

Note that having made the distinction between theoretical variables and
observed data the upper tildas denoting observed data have been dropped
for notational convenience and R2 is used instead of 112in order to comply
with the traditional econometric notation.

(4)

Propevties of the MLE

(/

EEE

'2)

asymptotic

is the fact that under certain regularity


An obvious advantage of MLE'S
conditions they satisfy a number of desirable asymptotic properties (see
Chapter 13).
P

0v

Consistency

-.+

0)

Looking at the infonuation matrix

(i)

(30)we can deduce that:

42 is a consistent estimator of

lim

#K(l'2

c2j <

:)

c2, i.e.

1,

T-+ x

since MSE(#2)

-+

0 as T-+ cti.' and


-

lim(X'X) F-+

EEE

limw- #r(l/
y

lim

F-,

-j1

< )) =

1, i.e.

xfx;

1
=

0,

J is a consistent

estimator

of

b.

390

Specification, estimation and testing

Note that the above restriction is equivalent to assuming that c'(X'X')c


) for any non-zero vector c (seeAnderson and Taylor (1979/.
The above condition is needed because it ensures that

Covt#)

(2)

ln order

-+

(x''z'(I,.

Asymptotic normality

I to be

of

(seeChapter

as T-+ ct.l

asymptotically

12).
-1))

-x(O,I.(p)

-p)

-+

normal we need to ensure that

1im 1 Iw(p)
1.(p) F-+
-gx.
=

Given that J.(c2)

exists and is non-singular.

(i)

x'/wtdz

-c2)

x(0,2c4).

Moreover, if limw- .(X'X/T)

.v''r(/

and non-singular

then

1).

,v

(19.68)

Qx is bounded

x(0,c2Q

-#)

1/2c4 we can deduce that

(19.69)

and 42 we can deduce


F ro m the asymptotic normal distribution of
asymptotic unbiasedness as well as asymptotic efficiency (seeChapter 13).

(3)*

(&

Strong consistency

a .S

0)

-,+

a S.
.

-2 iS a Strongly

consistent

lim

Pr

T-+ a.

+2

estimator

(J2

of c2

-+

2)

t)-

ie
.

( 19.70)

1.

Let us prove this result for sl and then extend it to d2,


(y2

c2)

u'

P
u
F k

c2

where w H(w1,
that

wc,

ww-k), w

T- k

F (w/

k tr 1

c2),

H'u, H being an orthogonal

(19.71)
matrix, such

F-k

1-1'41 Px)l1
-

diagt 1, 1,

1, 0,

0).

Note that w .vN(0, c2Iw-k) because H is orthogonal. Since F(w/


2c4< (x) we can apply Kolmogorov's SLLN
and S(w/
-c2)

=0

-c2)2

(see

19.4 Estimation

391

Chapter 9) to deduce that


j.

a.s.

-F E(w/
,
Using the fact that
-c2)

('2

c2)

-+

jg1 (w/

c2) +-

s2.

....+

w-k

=-

a,s.

.s2

0,

( 19.72)

(7.2

T
a. S

and the last term goes to zero as T-+ cc, 42 c2.


(ii)
j is a strongly consistent estimator of j (#--+

X Y'
'

m.

p) if

-+

< C, i
Ixjgl
x'x

T
m

(#

f)

1, 2,

Z
t

z'

k, t

(X'X)- 1X'u

jg

x f x/t T

1 2,

for all T

=0

- 1

is non-singular

and F(xftlr)2
Since f2txttltl
SLLN to deduce that
XIX/ r

-T' Z sut

xJc2<

C-constant; and

(19.73)
1

) xfjkf.

(19.74)

we can apply Kolmogorov's

w,

a.s.
->

0.

( 19.75)

Ixftxpl

Note that (1) implies that


< C* for i
C* being a constant.
It is important to note that the assumption

1, 2,

k, r, s

1, 2,

1
xrx'r =Qx < :z) and non-singular,
lim
F
t
r-, x
needed for the asymptotic normality of / is a rather restrictive assumption
because it excludes regressors such as xf, t, t 1, 2,
T) since
-

x,?,
=*6

T(T + 1)(2T + 1) O(F3)

(19.76)

Zl

(see Chapter 10)sand limw- x E(1/F)


x/J (x).
The problem arises because the order of magnitude of
x,?jis higher
than O(T) and hence it goes to infinity much quicker than T The obvious
way out is to change the factor UT in x/'Tl1)so as to achieve the same
rate of convergence. For example, in the above case of the regressor xft t
UF3 in order
to ensure that limw- .E(1 T3) xjq < ). The
we need to use
question which naturally arises is how can we generalise this particular
result? One of the most important results in relation to orders of magnitude
=

:1

Zf

Spification,

estimation

and testing

big as its
is that every random variable with bounded variance is
c/ < tz, then Zi Op(cf)
standard deviation-, i.e. if Varlzf)
(seeChapter
10).Using this result we can weaken the above asymptotic normality result
(69) to the following:
tas

Lemma
Ft??-the linear regression
Aw

model as specied

Z xjx;

(F)

t,li i

Qw

l=1

D ?- AwD /.

as y'..+ w

...+

in Section 19.2 above let

information increases
2
--+0,
i
(li(r)

Xfw..h 1

12
,

wl

T);

tnt? individual observation dominates the summationl;


lim

F-+ z

Q, Q <
=

vz

and non-singular,

tben
D T (/-#)

Anderson
txstde

19.5

Spification

x(0,c2Q-

')

(1971:.
testing

testinq refers to tests based on the assumption of correct


That
specification.
is, tests within the framework specified by the statistical
model in question. On the other hand, misspecscation testing refers to
testing outside this specified framework (seeMizon (1977:.
Logically, misspecification
tests precede specification tests because,
unless we ensure that the assumptions underlying the statistical model in
question are valid (misspecificationtests), specification tests (basedon the
validity of these assumptions) can be very misleading. For expositional
purposes the estimated money equation of Section 19.4 will be used to
illustrate some of the specification tests discussed below. It must be

Specihcation

19.5 Specification testing

393

emphasised, however, that the results of these tests should not be taken
seriously in view of the fact that various misspecifications are suspected
(indeed, confirmed in Chapters 20-.22). ln practice, misspecification tests
are used first to ensure that the estimated equation represents a well-defined
estimated statistical GM and then we go on to apply specification tests.
This is because specification tests are based on the assumption of correct
specification'.
hypothesis-testing
Within the Neyman-pearson
framework a test is
defined when the following components (see Chapter 14) are specified:
(i)
the test statistic z(y);
(ii)
the size (x of the test;
(iii)
the distribution of z(y) under Ho;
the rejection (or acceptance) region;
(iv)
(v)
the distribution of z(y) under HL.
(1)

Tests velating to c2

As argued in Chapter 14,the problem of setting up


tests for unknown
largely
exercise
appropriate
pivot related to
finding
in
an
an
parameters is
c2
unknown
parameterts)
question.
of
ln
the
in
the case
the likeliest
quantity
candidate must be the

'good'

(T-/)

S2

'v

(19.77)

/().

(Ty z

Let us consider the null hypothesis Hv: c2 /2() (cg-known) against the
alternative hypothesis Sj : c2 > c()2. Common-sense suggests that if the
estimated c2 is much bigger than c we will be inclined to reject Hv in
favour of H3, i.e. for sl > c where c is some constant considered to be
enough', we reject Hv. ln order to determine c we have to relate this to a
probabilistic statement which involves the above pivot and decide on the
size of the test a. That is, define the rejection region to be
=

kbig

C1

s
y: (T- k) z > cz

( 19.78)

where cx is determined by the distribution of


2

z(y)=(Fi.t).

k)

under Ho,

( 19.79)

394

Spification,

estimation

and testing
X

dz2(F-k)

a.

C2

This implies that


ln the case of the money example 1et us assume c/
155, ca 85.94 for a 0.05 and the rejection region takes the
since sl
form
(F- klsl
> 85.94
tl'j
y:
=0.001.

=0.00

-:0

and hence Ho is rejected.


In order to decide whether this is an
test or not we need to
consider its power function for which we need the distribution of z(y) under
Sl. In this case we know that
toptimum'

( 19.80)
(

'v

reads

:distributed

zty)
=

under S1'), and thus

c/

zoty)

'''z

H,
'w

o.

..a' z

(T-k),

(19.81)

because an affine function of a chi-square distributed random valiable is


also chi-square distributed (seeAppendix 6.1). Hence, the power function
takes the form

.#4c2) Jar

z(y) > cz

2
ti'o
2
; c > coc
Gz

dz 2 (F- k).

=
(?a(o.j,/'o.2)

(19.82)

The above test can be shown to be uniformly most powerful (UMP); see
Chapter 14. Using the same procedure we could construct tests for:
Hv.. c2

c2()against
=

or

Ho'. c2

cj against

c/*
x
0

: c2 <

ty.'z(y) < ca*),

C*j
(ii)

ffj

Sl

c2()(one-sided)
with
a

c2 # cj

(y:z(y) ga

cl

dz2(F-

>

)),

GO

dz2(F-k)
b

( 19.83)

(two-sided)With

or z(y)

dz2(F-k)=

k)

-X

(19.84)

395

19.5 Specilication testing

The test defined by C.'tis also UMP but the two-sided test defined by C1* is
UMP unbiased. Al1 these tests can be derived via the likelihood ratio test

procedure.

Let us consider another test related to c2 which will be used extensively in


Section 21.6. The sample period is divided into two sub-periods, say,

and

t e: T 1

(1 2

.t

t s T2

T) + 1,

T1)

F),

whereT- Tk Tc, and the parameters of interest 0- p, c2) are


be different. That is, the statistical GM is postulated to be
=

yt pkxt+

1&,

and

#r2Xl

#t

+ ut,

Var(J',/X,

x,)

Vartyr/x,

x,)

allowed to

cf

(19.85)

c2z for t (E T2..

( 19.86)

An important hypothesis in this context is


21
H () t:r c
'.

J2

against

H 1:

2
(7'1

a
(7'2

>

co

where cll is a known constant (usuallyc 1).


lntuition suggests that an obvious way to proceed in order to construct a
test for this hypothesis is to estimate the statistical GM for the two subperiods separately and use
=

Tk

:2
s2=
1 T1 k l : 1 f
-

and

to define the statistic z(y)

(-6
c?
-

st?

'vZ

c (T;

sllslz. Given that

-/f),

from (77),and sf is independent of


assumption),we can deduce that
((Tc
(Tk

sl/o-l/lj
k)s#/(cl/(Tc

-/4)

( 19.87)

1, 2,
yl

(due to the

sampling

model

k))1

(19.88)

396

Spification,

estimation

and testing

Hence,
T4$

s 21
c
c 0 sz

Ho

#(TI k' V /f1.

'v

'

This can be used to define a test based on the rejection region C1


ty: r(y) > cy ) where the critical value ca is determined via
dF(A I,
T2
a, a being the size of the test chosen a priori. It turns out that this
defines a UMP unbiased test (seeLehmann ( 1959/. Of particular interest is
thecase where ctl 1, i.e. HvL cf c2.z.Note that the alternative Hj : c2:/ca2 < 1
ca2/c2j > 1, i.e. have the
can be easily accommodated by defining it as H3 :
greater of the two variances on the numerator.
=

Jt*',

-/)

(2)

Tests relating

to

The first question usually asked in relation to the coefficient parameters


whether they are statistically significant'. Statistical significance
formalised in the form of the null hypothesis:
Hv :

pi

pi# 0 for

against
j

# is
is

0
some i

1, 2,

k.

Common sense suggests that a natural way to proceed in order to construct


a test for these hypotheses is to consider how farfrom zero / is. The problem
with this, however, is that the estimate of # depends crucially on the units of
measurements used for yt and X,. The obvious way to avoid this problem is
to divide the pisby their standard deviation. Since pxNlp, c2(X'X)- 1), the
standard deviation of Jf is

)q x/'gcztx'xlJ
x'tvartp-

1q,

1
where (X/XIJ refers to the fth diagonal element of (X'X) - 1 Hence we can
deduce that a likely pivot for the above hypotheses might be
.

?'-/'i

(c2(X'X)J1q

xo,

1).

(1q.89)

The problem with this suggestion, however, is that this is not a pivot given
that c2 is unknown. The natural way to
this problem is to substitute
its estimator sl in such a way so as to end up with a quantity for which we
know the distribution. This is achieved by dividing the above quantity with
the square root of
ksolve'

((w-,)s2
,,''(wc-2/()),

'
.

'

19.5

Specification

testing

( 19.90)

which is a

very convenient

T(y)

pivot. For Ho above

jf

(F- /f).

'v

s E()()()f; 1
?

Using this we can define the rejection

('1

region

> ca
),
(z(y)j

.ity:

where ca is determined from the


dr(F- k)

tables for a given size

a. That

is,

-a.

The decision on
optimal' the above test is can only be considered using
its power function. For this we need the distribution of zly) under H3, say
/f /$9,X # 0. Given that
%how

- jy
Ib
l(T-k),
1
Esa (X X)J l
I?
+
1y)
'r(y)c
i
E,s(x,xl.y

':()(y)
='

( 19.9 1)

'ro(r)

a non-central

with non-centrality

j=

'tT'- k;

( 19.92)

),

parameter

1q

tz g(X'X)J

(see Appendix 6. 1). This test can be shown to be UMP unbiased.


In the case of the money example above for J1t: jf 0, f 1, 2, 3, 4,
=

/1

g(X'X)1

I
:.
s ()( X):'3 (1
=

--

= 2.8,

42 9
.s
.

g(X'X)2-1

/'?.

6.5,

.-4.1.

E(X X).t1
,

.-

0.05 two-sided test thecritical value is cx 1.993,given that we


76 degrees of freedom. Assuming that the underlying
assumptions of the linear regression model are valid (asit happens they are
notl) we can proceed to argue that al1 the above hypotheses of significance
For a size
have T- I

398

Specification, estimation and testing

(the coefficients are zero) are rejected. That is, the coefficients are indeed
significantly different from zero. lt must be emphasised that the above ttests on each coefficient are separate tests and should not be confused with
the joint test: HvL Fj /72 p5 p.k 0, which will be developed next.
The null hypothesis considered above provides an example of linear
hypotheses, i.e. hypotheses specified in the form of linear functions of #.
lnstead of considering the various forms such linear hypotheses can take we
will consider constructing a test for a general formulation.
Restrictions among the parameters j, such as:
=

(i)
+ /5
135

/72+

/73+ /74+ ks

can be accommodated
R#= r,

1',

1,'

within the linear formulation

ranktR)

(19.93)

where R and r are m x k (k> m) and m x 1 known matrices. For example, in


the case of (iii),
R-(0

0 0
1 0

r
-

()
'j

( 19.94)

This suggests that linear hypotheses related to


special cases of the null hypothesis
Ho1 Rp= r

against the alternative

Hj

# can be
:

considered

as

R## r.

ln Chapterzo a test forthis hypothesis will bederived via thelikelihood ratio


test procedure. In what follows, however, the same test will be derived using
the common-sense approach which served us so well in deriving optimal
tests for c2 and #f.
The problem we face is to construct a test in order to decide whether #
satisfies the restrictions R#= r or not. Since p is unknown, the next best
thing to do is use p-(knowingthat it is a
estimator of #) and check
whether the discrepancies
'good'

Rb- r
$$
-

$r

(19.95)

to zero' or not. Since when deriving p-no such restrictions were


are
taken into consideration, if Ho is true, p should come very close to satisfying
these restrictions. The question now is
close is close?'. The answer to
this question can only be given in relation to the distribution of some test
'close

thow

19.5 Specification testing

399

statistic related to (95).We could not use this as a test statistic for two
reasons:
it depends crucially on the units of measurement used for yt and Xf;
(i)
and
(ii)
the absolute value feature of (95) makes it very awkward to
manipulate.
The units of measurement problem in such a context is usually solved by
dividing the quantity by its standard deviation as we did in (89)above.
The absolute value difficulty is commonly avoided by squaring the
quantity in question. If we apply these in the case of (95)the end result will
be the quadratic form

(RJ
-

r)' EVar(R/

r)q - 'IR#-

r)

(19.96)

which is a direct matrix generalisation of (89).Now, the problem is to


determine the form of VartR/
Since R/ - r is a linear function of a
normally distributed random vector, j,
-r).

(Rj-r)

,v

N(Rj

c R(X X)

-r,

(from N1 in Chapter 15). Hence,


-

(R #

r)

'

(1

(96)becomes

IR'q - iIR#-

Ec2R(X'X)-

(19.98)

r).

This being a quadratic form in normally distributed random variables, it


must be distributed as a chi-square. Using Q1of Chapter 15 we can deduce
that

-.

(R/

'-'

-r)

2
1
Ec R(X ) lR, 1- (R # -r)
,x

,v

z (1,n,),

(19.99)

chi-square
with m
i.e. (16) is distrlbuted as a non-central
(R(X'X') - 1R')) degrees of freedom and non-centrality parameter
=

(Rj- r)'ER(X'X)--IR'J- t(Rj- r)


(F a

(rank

( 19.100)

(see Appendix 6.1). Looking at (99)we can see that it is not a test statistic as
yet because it involves the unknown parameter a1. Intuition suggests that if
we were to substitute sl in the place at c2 we might get a test statistic. The
problem with this is that we end up with
(R/

r)'(R(X'X) S

r)
IR/IIIRJ
-

( 19. 101)

for which we do not know the distribution. An equivalent way to proceed


which ensures thatwe end up with a pivot (atest statistic whose distribution

400

Specilication,

estimation

and testing

is known) is the following. Since,


S2

(T- k)
c

'w

(T'- k),

(19.102)

if we could show that this is independent of (99)we could take their ratio
(divided by the respective degrees of freedom) to end up with an Fdistributed test statistic', see 95 and ()7 of Chapter 15. ln order to prove
independence we need to express both quantities in quadratic forms which
involve the same normally distributed random vector. From ( 102) we know
that

sl

(F- k)

u'(I

P 'Y lu

After some manipulation

u'Q u +

P .Y

'

X(X'X) - 1X'.

we can express

( 19.103)

(99) in the form

r)'gR(X'X) - R'j - 1(R# r)


1

(Rj-

( 19. 104)

where

X(X'X) - 1R/(R(X'X) - 1R/j - 1R(X'X)- 1X'

In view of Qx(I

Px)

0. (99)and (102)are independcnt. This implies that

u'Qu+ (R# r)'(R(X'X) - 1R'j


-

(R#

-r)

T(y)

u (1-Px)u
,

'v

'---

F(m, T- k;

(T'-k)c2

A more convenient form for

1 (RJ r)'gR(X'X) -

(19.105)

is

z(y)

z(y)

).

IR'II- IIRJ r)
'

( 19.106)

which apart from the factor ( l m) is identical to ( 101), the quantity derived
by our intuitive argument. lt is important to note that under Hv, R#= r and
J
i.e.
=0,

Ho

z(y)

F(m, T-/().

( 19.107)

Using this we can define the rejection

t)7j
=

ty:z(y) > cz)

where a

region to be
dF(m, F- k).

t' St

The power of this test depends crucially on the non-centrality

(19.108)
parameter

19.5 Specilication

40 1

testing

E(T- k)(m+ )
as given in (100).ln view of the fact that
n1(T- k 2)q (seeAppendix 6. 1) we can see that the larger is the greater the
power of the test (ensure that you understand why). The non-centrality
Rj
(a very desirable
parameter is larger the greater the distance 2/
i
variance
smaller
conditional
The power also
and
the
feature)
the
c
1'n,
p2
of
order
Fln
k.
depends on the degrees freedom vl
to show this
well-known
relationship
the
between the F and beta
explicitly 1et us use
enables
which
distributions
us to deduce that the statistic
'(z(y))=

.-.r

z*(y) E''1'r(y)1

is distributed as non-central
terms of z*(y) is

beta

E)J.'1'r(y)

k#(#) #r(z*(y)
=

- (j

'2)
I

,i

1.t'

''-'/

)-( /!
=

J?al

(19.109)

(seeSection 21.5). The

power function in

cz*)

>

..2)

.v.

E(v1

c:

1+

U( 1*

-.

vb

lt-tyvl + I

l pz

2 )- 1

jjg +

,4.zva.l

( 19.110)
(see Johnson and Kotz (1970/. From (l 10) we can see that the power of the
test, ceteris paribus, increases as F-k increases and m decreases. lt can be
shown that the F test of size a is UMP unbiased and invariant to
transformations of the form:
*
(ii)

y*

=cy

where

y + pv,

pv e: (40.

(for further details see Seber (1980/.


of the F-test is that it provides ajoint test
One important
for the m hypotheses Rj= r. This implies that when the F-test leads us to
reject Hv any one or any combination of these m separate hypotheses might
be responsible for the rejection and the F-test throws no light on the matter.
In order to be able to decide on this matter we need to consider
simultaneous hypothesis testing; see Savin (1984).
As argued in Chapter 14, there exists a duality relationship between
hypothesis testing and confidence regions which enables us to transform an
optimal test to an optimal confidence region and vice versa. For example,
the acceptance region of the F-test with R h; k x 1, r= h'#0, #0known,
'disadvantage'

1h'/ b',01
-

C0(/0)-

y:

<cu
k---WAF7XYL-X/E
( ) 1
.

(19.112)

402

Specification, estimation and testing

can be transformed into a (1


h,/ -s.v''p'(x'x)
c(y)
-

) confidence interval

t#:

1h(1),
<b'p< h'/+casv7gI1'(x'x) -

lhjca

(19.113)
Note that the above result is based on the fact that if

then

V''v F( 1, T'- k)

Qvxr(F-

k).

(19.114)

A special case of the linear restrictions R#= r which is of interest in


econometric modelling is when the null hypothesis is
Ho3 #(1) 0 against Sj
=

#(1) + 0,

where /91)represents a1l the coefficients apart from the coefficient of the
constant. ln this case R
1 and r=0. Applying this test to the money
equation estimated in Section 19.4 we get
=1k-

z(y)
=

tjjo.(ml
248.362

546 a

5353.904,

0.05.

This suggests that the null hypothesis is strongly rejected. Caution,


however, should be exercised in interpreting this result in view of the
discussion of possible misspecification in Section 19.4. Moreover, being a
joint test its value can be easily inflated' by any one of the coefficients. In the
present case the coefficient of pt is largely responsible for the high value of
the test statistic. If real money stock is used, thus detrending Mt by dividing
it with Pt which has a very similar trend (seeFig. 17. 1(a) and 17.1(c)), the test
statistic for the signilicance of the coefficients takes the value 22.27% a great
deal smaller than the one above. This is clearly exemplified in Fig. 19.2
where the goodness of fit looks rather poor.

19.6

Prediction

The objective so far has been to estimate or construct


related to the parameters of the statistical GM

tests for hypotheses

p'xt+ I/f,

(19.115)

using the observed data for the observation period r= 1, 2,


T The
question which naturally arises is to what extent we can use (115) together
with the estimated parameters of interest in order to predict values of y
beyond the observation period, say
.

.'.'.'.=

12
,

Prediction

19.6

403

From Section 12.3 we know that the best predictor for yw+i,1= 1,2,
is its
conditional expectation given the relevant information set. ln the present
case this infonnation set comes in the fonu of w./ )Xw+/= xw.;). This
suggests that in order to be able to predict beyond the sample period we
need to ensure that 9. w+/, I >0, is available. Assuming that Xws, xr..; is
available for some lb 1 and knowing that Eluvytlxv-qt= xw-r/) 0 and
.

.9.

El-vvst Xw-,

pv-vl
a natural

x,.+,)

(19.116)

#'x,.+,,

predictor for pv +/ must be

lkl t #-xw.,.

(19. 117)

In order to assess how good this predictor is we need to compare it with the
value of y, y,vsk. The prediction error is defined as

actual

-y,.-,-w-,-uw+?+(,-#-w)'x,

and

(19.118)

-.,

r +l

evs:

N(0, c241 +x'w+/(X'X)- 1xw+/),

(19.119)

since ev..l is a linear function of normally distributed random variables, and


the two quantities involved are independent. The optimal properties of #
predictor and evwk has the smallest variance among
make pv..lan
linear predictors (see Section 12.3 and Harvey (198 1:.
ln order to construct a confidence interval for ),' a pivot is needed. The
obvious quantity
'optimal'

+/

c gl +

ev-vl

x')...,(X'X) -

lxw-j-/ll

( 19. 120)

x x((),

1)
tr2.

is not a pivot because it involves the unknown parameter


reduce it to a pivot, however, by dividing it by x/ E(T-k).$
since sl is independent of ev-vi, to obtain

We could
((Fklc 1,
1

( 19. 121)
see Section 6.3. (Using
....

Prt#xwvg-casv

(121)we can set


E1+xw+/(X X)
,;'

....

up the prediction interval


gj

xw.h/qGyw+/<

+x'w,.,(x'x)-1x,.../q)+c ,.u,''E1

,'h

# xw..,
,

-a,

(19.122)

404
where

Specification,
cx

estimation

and testing

is determined from the r tables for a given

via

As in the case of specification testing, prediction is based on the


assumption that the estimated equation represents a well-defined estimated
statistical GM; the underlying assumptions are valid. lf this is not the case
then using the estimated equation to predict $'vp: can be very misleading.
Prediction, however, can be used for misspecification testing pumoses if
additional observations beyond the sample period are available. It seems
obvious that if it is assumed to represent a well-defined statistical GM and
the sample observations t 1,2,
Xare used for the estimation of p, then
the predictions based on pv.l= #-'xwo/,
1= 1, 2,
n1, when compared with
should
give
idea
about
the validity of the
1,
2.
I
m,
us some
y'.:.pt,
specification'
assumption.
correct
Let us re-estimate the money equation estimated in Section 19.4 for the
sub-period 1963-1980/ and use the rest of the observed data to get some
idea about the predictive ability of the estimated equation. Estimation for
the period 1963f-1980ff yielded
=

mt 3.029 +0.678)?,
( 1.116) (0.113)

+0.863p,

R2
RSS

72

=0.993,

-0.049f,

(0.024)

=0.993,

0. 10667,

li,,

(0,014)

(19.123)

(0.040)

=0.0402,

1og L

127.703.

Using this estimated equation to predict for the period l980ffollowing prediction errors resulted:
-0.0317,

t?j

-0.0279,

ez

14,

-0.03

t?s

0.0276,

t?a =

-0.0193,

e
p1 ()

ey

-0.02 17,

=0.0457,

1982/1,,the

ezb

-0.0243,

f?g

0.0408,

=0.0497.

As can be seen, the estimated equation untlerpredicts for the first six periods
and overpredicts for the rest. This clearly indicates that the estimated
equation leaves a lot to be desired on prediction grounds and re-enforces
the initial claim that some misspecification is indeed present.
Several measures
of predictive ability have been suggested in the
econometric literature. The most commonly used statistics are:
j.

MSE=-

m t

T' + m

.;

el
l

(meansquare

errorl

(19. 124)

19.7

The residuals

MAE

N1

y. +. l

405

If4I(mean absolute
1 ;.
-m t jj
+.

j
-D1

sm

jg
r

)t

et1

w.

U=

,,2

( 19. 125)

errorl;

.)

--

-if

7.+..

&

2
t.*t

m f= w-j-I

(Theil's inequality coefficient)

(19.126)

(see Pindyck and Rubinfeld (198 1) for a more extensive discussion). For the
above example MSE= 0.001 12, MAE 0.03201, U 0.835. The relatively
high value of U indicates a weakness in the predictive ability of the
estimated equation.
The above form of prediction is sometimes called ex-post prediction
because the actual observations for ),f and X are available for the prediction
period. Ex-ante prediction, on the other hand, refers to prediction where
this is not the case and the values of X, for the post sample period are
kguessed' (in some way). ln ex-ante prediction ensuring that the underlying
assumptions of the statistical GM in question are valid is of paramount
importance. As with specification testing, ex-ante prediction should be
preceded by misspecification testing which is discussed in Chapters 20-22.
Having accepted the assumptions underlying the linear regression model as
valid we can proceed. to
xw t by iw-, and use
=

kguessestimate'

JF

p-T k T + /

12
'

'

ln such a case the prediction error defined by


as the predictor of
decomposed
be
into three sources of errors:
can
)'v-l
-J/.
w-?
+ (x7.., #w+.,)',w,
uv +(#
#w)'xw+g
( 19. 128)
,y.+/.

,/

?.-s,

one additional to
19.7

.,

(?vyt

(see (118)).

The residuals

The residuals for the linear regression model are defined by

l,i, / y!
EB

#'xt

y -X#

1, 2,

in matrix

notation.

(19 129)
( 19.130)

From the definition of the residuals we can deduce that they should play a
role in the misspecification
analysis (testing the
very important
assumptions of the linear regression model) because any test related to

406

Specification, estimation and testing

ty,,/Xr= xt,l6T) can only be tested via f. For example, if we were to test any
one of the assumptions relating to the probability model
(.VI,/X,

Xf)

X(#'xf,

'v

t7'2)

(19.131)

we cannot use y'j because ( 131) refers to its conditional distribution


marginal) and we cannot use

#'x,)

(y, -

x.

N(0, c2)

(notthe
(19. 132)

because # is unknown. The natural thing to do is to use (y?-/x,), i.e. the


T: lt must be stressed, however, that in the same
residuals Iif, t 1, 2,
of its own' (it stands for y, #'xf), Stands for
way as ut does not have a
and
random variable
interpreted
should
not be
as an
1,' - /'xf
observable
of
form
but as the
ty,'Xf xr) in mean deviation form.
The distribution of > y -X#= (1 Pxly (1 Pxlu takes the form
=

tlife

bautonomous'

c2(1

(19. 133)
- Px)),
'v .N(0,
where Px is the idempotent matrix discussed in Section 19.4 above. Given,
however, that rank (1 Px) trtl Px) T-k we can deduce that the
distribution of is a singular multivariate nonnal. Hence, the distributions
of and u can coincide only asymptotically if
=

v(X(X'X) - 1X')

-+

(19.134)

where v(A) maxf,yst7jjt,A Laiji,j.This condition plays an important role


in relation to the asymptotic results related to y2. Without the condition
(134) the asymptotic distribution of s2 will depend on the fourth central
moment of its finite sample distribution (seeSection 21.2). What is more,
1
the condition limw-+w(X'X)
0 or equivalently
=

v(X'X) -

'

-+0

(19.135)

as

verified for xf,


2'.
v
Looking at ( 133) we can see that the finite sampling distribution of is
inextricably bound up with the observed values of X, and thus any finite
sample test based on will be bound up with the particular X matrix in
hand. This, together with the singularity of (l33l,promptus to ask whether we
could transform in such a way so as to sidestep both problems. One way
we could do that is to find a Fx (T-/() matrix H such that

does not imply ( 134) as can be

H'(I

Px)H

A,

where A takes the form


('F-k

01,

(19.136)

The residuals

19.7

407

being the matrix of the eigenvalues of I Px. This enables us to detine the
transformed residuals, known as BLUS residuals, to be
-

' H'
=

N(0, c2, I.r-k),

'v

(19.137)

i being a (T- k) x 1 vector (seeTheil (1971/. These residuals can be used in


misspecification tests nstead of but their interpretation becomes rather
difficult. This is because fl is a linear combination of all lils and cannot be
related to the observation date r. Another form of transformed residuals
which emphasises the time relationship with the observation date is the
recursive residuals.
The recursive residuals are defined be
>'

y,l

..

bt- 1
recursive

t-

1, 2,

( 19. 138)

,.

where
is the

for t

j,t -

r k+ 1

1x t ,

(X, lX,
-

1)

X,

Ieast-squares
E&

(x j

xz

x,

1 yt

'

!'

( 19.139)

1,

estimator
'

j0-

of

# with

1 EB

(y j y z
,

yt

'

-.

j.

This estimator of j uses information up to t 1 only and it can be of


considerable value in the context of theoretical models which involve
expectations in various forms.
-

#, N(0,c2( 1 + x;(X?0-'jXr0-j)
'v

1xj)),

and for
P=

'

g1 + x;(X,0-

tl*
;* EEE(t?*
k+
k+ 1
$

2,

'

#+ x((),c2I.,-k).
x

'

1x,0-

tl*)/
F

)-

(19. 140)

'

x f (j

>

(19.141)

residuals have certain distinct advantages


in
over
related
of
misspecification tests
the yfs (seeSection
to the time dependency
2 1.5 and Harvey (1981:.
The residuals play a very important role in misspecitication testing, as
shown in Chapters 20-22, because they provide us with a way to test the
relating to the process t-vf,/Xr, t e: T). lt is
underlying assumptions
interesting at this stage to have a look at the time path of the residuals for
the money equation estimated in Section 19.4, shown in Fig. 19.3. The
residuals exemplify a certain tendency to increase once they start increasing
and to decrease once they start decreasing. This indicates the presence of'
strong positive correlation in successive residuls (orserial correlation) and
The

recursive

Speciation,

408

estimation

and testing

0.10

0.06

-0.05

-0. 10

1964

1967

1970

1973
T ime

1976

1979

1982

Fig. 19.3. The residuals from ( 19.66).

therefore the sampling model assumption of independence seems rather


suspect (invalid).The time path of the residuals indicates the presence of
systematic temporal information which was
by the postulated
systematic component (seeChapter 22).
tignored'

19.8

Summary and conclusion

The linear regression model is undoubtedly


the most widely used statistical
model in econometric modelling. Moreover, the model provides the
foundation for various extensions which give rise to several statistical
models of interest in econometric modelling. The main purpose of this
chapter has been to discuss the underlying assumptions in order to enable
the econometric modeller to decide upon its appropriateness
for the
particular case in question as well as derive statistical inference results
related to the model. These included estimation, specification testing and
prediction. The statistical inference results are derived based on the
presupposition that the assumptions underlying the linear regression
model are valid. When this is not the case these results are not only

inappropriate, they can be very

misleading.

The money equation estimated in Section 19.4 was chosen to highlight


the importance of choosing the most appropriate statistical model by
taking into consideration not just the theoretical infonuation but the
information related to the observed data chosen. The latter form of
information should be taken into consideration
in postulating the

Appendix 19.1

409

probability and sampling models as well as the statistical GM. The


estimation and testing results related to the money equation might
of a well-specified transactions
encourage premature pronouncements
demand for money. Such conclusions, however- are not warranted in view
discussed brefly above. Before any
of several indications of misspecification
estimation, specification testing or prediction results can be considered
appropriate we need to test the underlying assumptions on the validity of
which they are based. This is the task of misspecification testing eonsidered
in the next three chapters.
Once the underlying assumptions g1q-g8(1
are tested and their validity
established with the data chosen, the estimated statstical GM is said to
estimated statistical model. This. however, does not
constitute (1 wt?//-#npf
necessarily coincide with the empirical econometric model because the
statistical and theoretical parameters of interest are commonly different.
estimated statistical model s transformed
The well-defined
into an
reparametrising
econometric
model
in
empirical
by
it
terms of the
theoretical parameters of interest.

Appendix 19.1 A note on measurement


-

systems

A measurement system refers to the range of the variables involved and the
associated mathematical structure. This system is normally selected so that
the mathematical structure of the range retlects the structural properties of
systems such as
the phenomena being measured. Different measurement
nominal, ordinal, interval and ratio are used in praetice. Let us consider
them one by one very briefly.
possible are
Nominal: ln a nominal system the only relationships
(i)
whether the quantities involved belong to a group or not without
any ordering among the groups.
Onlinal: In this system we add to the nominal system the ordering
of the various groups (e.g. social class, ordinal utility).
lnerval: ln this system we add to the ordinal system the
(iii)
interpoint distances (e.g. measures of
possibility of comparing
of the values
that
Note
temperature).
any linear transformation
Fahrenheit
and
scales).
is
legitimate
(Celsius
taken
Ratio: ln the ratio system we add to the interval system a natural
origin for the values. ln such a systen the ratio of two values is a

meaningful

value.

In the case of the linear regression model caution should be exercised when
using different measurement systems for the variables involved. In general,
most economic variables are of the ratio type and in such a case the

410

Specification, estimation and testing

constant term among the regressors is of paramount importance since it


estimates a linear function of the means of the variables involved. For this
reason tbe ctpnsrfznr term sbould alwaq's be included in a regression among
ratio scaled variables because it represents the origin of the estimated
empirical relationship.
Important

concepts

White-noise error term, exogeneity, parametrisation,


residuals,
R2, specification tests, F-test, recursive residuals, BLUS residuals,
least-squares, measurement system.

2,

.#2 #2

recursive

Questions
Compare the linear Gauss linear and the linear regression models
(statistical GM, probability and sampling models).
Explain the concept of exogeneity in the context of the linear
regression model.
Discuss the differences and similarities between
'(.?f

pt BE

X!

x?), p)

EEE

'(.rr,'''c(Xt)).

Explain the role of conditioning


of the
in the parametrisation
statistical model.
Explain
the relationship
between normality,
linearity and
homoskedasticity.
6. Discuss the orthogonality of the estimated systematic and nonsystematic components and their relationship to the goodness-of-fit
measure #2.
2
and #2
7 Compare the goodness-of-fit measures /12
MLE'S
sample
properties
of
the
l EEE(j
8 State the finite
IH(/
MLE'S
c2)'.
asymptotic
properties
of
9. State the
the
and
model
consistency
and
+
discuss the
10. Consider the
ut
y jr
of
(Note.'
normality
1.
1,
asymptotic
1 t
a =J, a
J for
l.cT( F+ 1),limw- w (j7- 1 1/r2) 7:2/6.)
Explain why we need to assume that limv- .(X'X) --1 0 for the
consistency of / and no such assumption is needed for the consistency
of 42

4.

.l)'.

'

,)

7-

Exercises
Derive the

MLE'S

of 0 EE (j, c2)'.

Additional references

41 1

Derive the information matrix Iw(p) and discuss the question of full
>
efficiency of j and c
Show that
az

(i)

Z'(#)

(ii)

Px2 Px

(iii)

Px'

(iv)

(1 Px)X 0.

0,'

(Px

X(X'X) - 1X');

Px;

Derive the distribution of p-and c-2


Check whether the variables .X'lt 2', 0 < k < 1 and Xzt
=

t satisfy the

conditions:
lim
v-.x

-.

1
T

(X'X)

lim(X'X)-

Qx<

LJo

non-singular;

=0.

F-y x

Compare these two conditions.


HvL c2
cj against /11 : c2 # c2().
a test for
Explain how the following linear restrictions can be accommodated
within the general form Rj= r.
(i)
J$1 /72*,

Construct

(iii)

/71+

/72+ /33

jl

fz /3

1,'

=0.

Show that
(R/

r)'(R(X'X)- 1R')-

IIRJ- r)
(R#

=u'Qu +

r)'gR(X'X)- IR') - 1(R#

-r),

where Q X(X'X)- 1R'gR(X'X)- IR'j - 1R(X'X)- 1X' Show that


() 2 Q
Derive the distribution of pv+: fxv. 1 and use it to construct a
prediction interval for yv.v,.
Compare the prediction errors with the recursive residuals.
Derive the distribution of and explein how can be transformed to
N(0, c2Iw-k).
define the BLUj residual vector
=

'

'v

Additional references

Goldberger (1968)) Judge et al. (1982);Madansky (1976); Maddala


Dhrymes (1978);
1977);Malinvaud
( 1970);Schmidt (197$; Theil (1983).
(

C H AP T E R 20

The linear regression model 11 departures


from the assumptions underlying the
statistical GM

ln the previous chapter we discussed the specification


of the linear
regression model as well as its statistical analysis based on the underlying
eight standard assumptions. In the next three chapters several departures
and their implications will be discussed. The discussion differs
from E1q-g8j
somewhat from the usual textbook discussion (see Judge et aI. ( 1982))
because of the differences in emphasis in the specification of the model.
$
In Section 20. 1 the implications of having '(yl,/c(Xr)) instead of
'(rf, Xf xl) as the systematic component are discussed. Such a change
gives rise to the stochastie regressors model which as a statistical model
shares some features with the linear regression model, but the statistical
inference related to the statistical parameters of interest p is somewhat
different. The statistical parameters of interest and their role in the context
of the statistical GM is the subject of Section 20.2. ln this section the socalled omitted variables bias problem is reinterpreted
as a parameters of
interest issue. In Section 20.3 the assumption of exogeneity is briefly
considered. The cases where a priori exact linear and non-linear restrictions
on 0 exist are discussed in Section 20.4. Estimation as well as testing when
such information is available are considered. Section 20.5 considers the
concept of the rank deficiency of X known as collinearity and its
implications. The potentially more serious problem of
collinearity' is
the subject of Section 20.6. Both problems of collinearity and near
collinearity are interpreted as insufficient data information for the analysis
of the parameters of interest. It is crucially important to emphasise at the
outset that the discussion of the various departures from the assumptions
underlying the statistical GM which follows assumes that the probability
=

Snear

4 12

The stochastic Iinear regression model

20.1

and sampling models remain valid and unchanged. This assumption is


needed because when the probability and/or the sampling model change
the whole statistical model requires respecifying.
20.1

The stochastic linear regression

model

the statistical GM is that the systematic

The first assumption underlying


component is defined as

X).
(20.1)
pt G f(-Pr/X/
An alternative but related form of conditioning is the one with respect to the
c-field generated by Xf, which defines the systematic component to be
=

pl

'(A 6'(Xf)).

The similarities and differences between ( 1)and (2)were discussed in Section


7.2. ln this section we will consider the meaning and intuition underlying (2)
variable in Xr, which is
as compared with (1).Let xYjt be the first random
assumed to be defined on the probability space (S, #( )). c(A-1t)
represents the c-field generated by .Y1,, i.e. the minimal c-field with respect
to which -Yjt is a random variable. By construction c(A-jr) z / The c-field
-Yk,) is defined to be
generated by Xt HE (zY1f,Xzt,
.%

'

c(X,)

U c(A-.f) zzn
=

Let y, be also defined on (S,,W,P4.)). The


J)y,/c(X,)): (,, c(X,)) --+ (R,
is defined via

conditional

expectation

,)#)

with respect to c(Xf).


This shows that F(yf/c(Xf)) is a random t'ariable
Intuitively, conditioning )?fOn c(Xt) amounts to considering the part of the
random variable yf associated with all the events generated by Xf.
Conditioning on X, x, can be seen as a special case of this where only the
event Xl x! is considered. Because of this relationship it should come as no
surprise to learn that in the case where D( Xr,' p) is jointly normal the
conditional expectations take the form
=

',

pt f(.r, 'X, x,)


lz)EEEuE'l-r,'c(Xr))
EEEE

(seeChapter
J', =

15). Using

,'X,

+ u,,

J1

212-21

(20.5)

xr,

cl 2E2-2'X?

(6) we can define the

(20.6)
statistical

GM:

(20.7)

414

from assumptions

Departures

statistical

GM

where the parameters


of interest are 0- p, c2), p- Yc-allaj (7.2
2E2-a1J21.
The random vectors (X1, X2,
Xw)' > i'k- are assumed
cj 1 J1
rank
satisfy
condition,
the
rankt)
k
to
for any observed value of 1. The
defined
by
term
error
=

f()?,/oX,)),
ut
satisfiesthe following properties:
M

(20.8)

.)4

'lF(&,/c(Xf)))

f'tF(/t?,/,/c(Xt)))

.E'(ul)
Eutplq
Elutus)

=0,

(20.9)
=0,

.E')S(u,l</c(Xt)))

(20. 10)
c2

0,

s,

(20. 11)

tzp s,

The statistical GM (7) represents


a situation where the systematic
component of y, is defined as the part of yf associated with the events c(X!)
and the observed value of Xt does not contain a1l the relevant information.
That is, the stochastic structure of Xt is of interest as far as it is related to
yt.
This should be contrasted with the statistical GM of the Gauss linear and
linear regression models.
Given that Xfin the statistical GM is a random vector, intuition suggests
that the probability model underlying (7)should comc in the form of thejoint
distribution 1)()'r, X,; #). We need, however. a form of this distribution
which involves the parameters of interest directly. Such a form is readily
available using the equality
D()'f, Xf;

#)

Dq' 'Xt;

#:)

'

D(X,;

a),

(20.12)

with 0 (#,c2) being a parametrisation


of #j. This suggests that the
probability model underlying (7) should take the form
EEE

(20.13)

where

(20.14)
D(Xr;

(det E 2 2 )kyc
(2zr)

#c)

(see Chapter 15) and


The random

#c are

vectors

X1,

expt --l.x,Eca
,

xt )

(20. 15)

said to be nuisance parameters (notof interest).


Xw become part of the sampling model
.

which is defined as follows:


(Zl Zc,
,

Zw) is a random sample

fromD(Zt,.#), t

1, 2,

T;

The stochastic linear regression model

20.1

respectiyely,

where as usual

(xA'r,,)

z,

EEE

lf we collect a1l the above components together


stochastic linear regression model as follows:
The statistical

GM: yf

j'Xr + ut,

l (E T

'c(X,)) and lk, )'f - Et))r 'c(Xf)).


pt
0 HE (#, c2), # E z1 az j and (72 cj j - Jj ztz-t o.z j
parameters of interest.
X is assumed to be weakly exogenous wrt 0, for t 1, 2,
No a priori information on 0.
Xw)?, ranktt')
For :t(' EEE(X1, Xa,
k for al1 observable
>
T
of
k.

E1q
r21

the

we can specify

'tl

g31
E4q
E51

the

'TJ

values

.?)

The probability

model

1
2c

cv ( )
(detxccl-feXp --l(X;E2-2'Xr))
k/z
t
x (2zr)
E6(1 (i)

0 c(Rk x R +)

D(y/X,; p) is normal;

#'x, - linear in x,',


vartyf' x )) c2 - homoskedastic-,
0- (j, c ) are time-invariant.

(ii)
(iii)

;t),,, c(x,))
c(
-

(Z1, Zc,

j!

model

Ihe sampling

g8(I

respectively.

Zw) is a random sample from D(Zt;

k), t

1, 2,

T)

The assumption related to the weak exogeneity of xr with respect to 0 for


T shows clearly that the concept is related only to inference on 0
t= 1,2,
and not to the statistical inference based on the estimator of 0. As shown
below, the distribution of the MLE of 0 depends crucially on the marginal
distribution of X2. Hence, for prediction purposes this marginal
distribution has a role to play although for efficient estimation and testing
on 0 it is not needed. This shows clearly that the weak exogenty concept is
about statistical parameters of interest and not so much about
.

.,

distributions.

Departures

The probability

-.v4'
'k'

1 .J'a.
,

from assumptions

statistical

GM

and sampling models taken together imply that for y K


Eiii (X
and
X w)' t he l ikelihood function is
1 X c,
.?'

f-(0 ; y / )

D( #'j Xf; 0)
tg-I

'

'

D (X f ;

:=

# a)
.

The log likelihood takes the form

in ( 17) can be treated as a constant as far as the


respect to the parameters of interest 0 is concerned.
Because of the apparent similarity between ( 17) and the log likelihood
function of the linear regression model (seeSection 19.4), it should come as
with respect to 0 yields
no surprise to learn that maximisation
The last component

differentiation

with

-.

)(x'),

j x x't

E,

(.?'',,z.)-

.,t''y

''

(20.18)

and
1

(i*2

Ta

V(

p*xtj a EEI -T

.:
-

(20.19)

'''b

in an obvious notation. The same results can be derived by defining the


likelihood function in terms of D( T'f, X,,' #) and, after estimating cl j, c1c,

thes: estimators and the invariance property of MLE'S


to
c2
- Yc-21
EEE
estimators
and
the
corresponding
for
j
cl
ccj
eonstruct
p
MLE'S
of jand 0-2
tzl2E2-21caj. Looking at ( 18)and ( lglwecan see that these
differ from the corresponding estimators for the linear regression model #=
1
in so far as the latter include the observed value xr
(X'X) - X'y. 42 ( 1'Fl'
of
random
the
X,, as above. This difference, however. implies
vector
instead
4*2
and
longer
are no
a linear and a quadratic function of y and thus
that p-.
results
relation
distributional
in
to and 42 cannot be extended to j*
the
*
and
J.*2
That
is,
are no longer normally and chi-square
and
respectively.
ln
distributed,
fact, the distributions of these estimators are
analytically
tractable at present. As can be seen from ( 18) and (19) they
not
complicated
functions of normally distributed random variables.
are very
which
naturally
arises is whether we can derive any
The question
without
knowing their distributions. Using the
properties of p-*and
SCE3)
properties SCE 1-SCE5 (especially
on conditional expectations with
c-field
Section
7.2)
we can deduce the following'.
(see
respect to some
Ecc- using

/*

'*2

/*

1
1
( 1.'.1''4- k''lb:t'p + uj j-h (,/''./') - #'''u
=

(20.20)

The stochastic Iinear regression model

20.1

# + 'I).4'';'')- ',f.''E'(u
= #,

'(#*)
if

'g(;??''t?)

E(u/c(J'))

1,'71

< x?,

That is,

=0.

COvll

**

c(t?'))(l

'E.E'(#*/c(.(F))1

)O

Ep

(20.21)
GM

since by the construction of the statistical


#* is an unbiased estimator of (.
R*

A*

A*

>*

-#)

-#)(#

ELEC'

/
=

(7),

-#)(/

/) /c(?)1
'

1,'?''A'(uu' c(,'.F))t'(;?''t?') -

= FI)-''.??'')
o'lE.J-'.'jt -

11

(20.22)

if E-/''.I) - exists, since ftuu//ct.'F' )) czlw. Using the similarity between


the log likelihood function (17)and that of the linear regression model we
can deduce that the information matrix in the present case should be of the
form
1

E(I'-%
I1(p)

(20.23)

2c

estimator of (.
j* is an
Using the same properties for the conditional expectation
of c2
can show that for the MLE

mcient

This shows that

operator

we

'*2

.E'(#*2) Sg.E'(#*2 c(.?'))j


=

1
= T Fgfu/Mxu
1
= -,jr Sgftr
1
=

''

Mx)
c2,

This implies that although

defined by

is unbiased.

1
=

'.

czsttr

*/c(kY-))(j

where Mx
1

c(.?)(1

Mxuu'/c(.?))1

cuftr

T-k

1 Fgf't*
w

=-

'rtr

Iw - ''tl'.Il
.

.-

.,

Ma.E'(uu'/c(.'?/))1
1

Iw tr(tf'.'') - ,')F/,jF))
-

for al1 observable

values

of ,?J

d'*2 is a biased estimator of

c2

(20.24)

the estimator

418

from assumptions - statistical GM

Departures

Using the Lehmann-scheffe theorem (seeChapter 12) we can show that


>
(y'y,1'il, I'y) is a minimal sujhcient statistic and, as can be seen
from (18)and (19),both estimators are functions of this statistic.
Although we were able to derive certain finite properties of the MLE'S
J*
without having their distribution, no testing or confidence regions
and
are possible without it. For this reason we usually resort to asymptotic
theory. Under the assumption
Tty,

,+)

'*2

lim
F-+

E(.I'Ij

'

Qxx<

cfz

and non-singular,

(20.26)

we deduce that

x'''z'(/*

.N0,c2Qx4),

(20.27)

,v

N(0, 2c*).

(20.28)

-#)

UT(+2

c2)

These asymptotic distributions can be used to test hypotheses and set up


condence regions when T is large.
The above discussion of the stochastic linear regression model
as a
separate statistical model will be of considerable value in the discussion of
the dynamic linear regression model in Chapter 23. In that chapter it is
argued that the dynamic linear regression model can be profitably viewed
as a hybrid of the linear and stochastic linear regression models.

20.2

The statistical

parameters

of interest

The statistical parameters which define the statistical GM are said to be the
statistical parameters of interest. In the case of the linear regression model
these are #= E2-)lc1 and c2 cl 1 cj aE2-zllaj. Estimation of these
statistical parameters provides us with an estimated data generating
mechanism assumed to have given rise to the observed data in question.
The notion of the statistical parameters of interest is of paramount
importance because the whole statistical analysis
around these
assumptions
defining the linear
parameters.
cursory look at
E1j-E8q
regression model reveals that al1 the assumptions are directly or indirectly
related to the statistical parameters of interest 0. Assumption E1jdefines the
systematic and non-systematic component in terms of 0. The assumption of
weak exogeneity (3q is dened relative to 0. Any a priori information is
introduced into the statistical model via 0. Assumption g5qreferring to the
rank of X is indirectly related to 0 because the condition
=

trevolves'

rank(X'X)

(20.29)

20.2 The

statistical

parameters

of interest

419

is the sample equivalent to the condition


rankttzz)

k,

(20.30)

required to ensure that E2c is invertible and thus the statistical parameters
of interest 0 can be defined. Note that for T > k, ranklx)
rank(X'X).
Assumptions E6qto (8(Iare directly related io 0 in vi
ofthe
fact that they
w
defined
in
of
a1l
D(y,,/Xt;
0).
terms
are
The statistical parameters of interest 0 do not necessarily coincide with
the theoretical parameters of interest, say (. The two sets of parameters,
however, should be related in such a way as to ensure that ( is uniquely
defined in terms of 0. Only then the theoretical parameters of interest can be
given statistical meaning. In such a case
t is said to be identnable (see
Chapter 25). Empirical econometlic models represent reparametrised
statistical GM's in terms of (. Their statistical meaning is derived from 0
and their theoretical meaning through (. As it stands, the statistical GM,
=

y'f
=

#'xl+

lzf,

t e: T,

(20.31)

might or might not have any theoretical meaning


mapping
' G(t 0)

depending on the

(20.32)

=0,

relating the two sets of parameters. lt does, however, have statistical


meaning irrespective of the mapping (32). Moreover, the statistical
parameters of interest 0 are not restricted unduly at the outset in order to
enable the modeller to test any such testable restrictions. That is, the
statistical GM is not restricted to coincide with any theoretical model at the
outset. Before any such restrictions are imposed we need to ensure that the
estimated statistical GM is well defined statistically; the underlying
assumptions g1!-E8qare valid for the data in hand.
The statistical parametrisation
0 depends crucially on the choice of Zt
and its underlying probabilistic structure as summarised in D(Z; #). Any
changes in Zt or/and D(Z; #) changes 0 as well as the statistical model in
question. Hence, caution should be exercised in postulating arguments
which depend on different parametrisations, especially when the
parametrisations involved are not directly comparable. In order to
illustrate this let us consider the so-called omitted variables bias problem.
The textbook discussion of the omitted variables bias argument can be
summarised as follows:
The true specification is
y X#+ W? +
=

:,

E,V

N(0, c,Z1z),

'

(20.33)

420

from assumptions - statistical GM

Departures

but instead
y X#
=

+ u,

N(0, c2Jw)

'w

(20.34)

was estimated by ordinary least-squares (OLS) (seeChapter 2 1), the OLS


estimators being
j= (X X)
42

Xy

(20.35)

'
'

'r - k

y -X#.

(20.36)

ln view of the fact that a comparison


u

between

(33) and (34) reveals that

WT +E,

(20.37)

we can deduce that


E(u)

Wy 10

(20.38)

and thus
A'(/)

'x'wy#o;

-#=(x'x)-

(20.39)

and
(20.40)
1

'-'

l .-XIX ?X) - X That is, # and ti 2 suffer from omitted variables bias
and y= 0, respectively; see Maddala (1977), Johnston ( 1984),
unless W'X
Schmidt ( 1976), inter alia.
Mx

=0

viewpoint,
From the textbook specification approach
where the
statistical model is derived by attaching an error tenn to thc theoretical
model, it is impossible to question the validity of the above argument. On
the other hand, looking at it from the specification viewpoint proposed in

Chapter 19 we can see a number of serious weaknesses in the argument.


The most obvious weakness of the argument is that it depends on two
statistical models with different parametrisations. ln particular # in (33)and
(34) is very different. lf we denote the coefficient of X in (33)by j= Ea-21J21,
the same coefficient in X takes the form:
lJ2
lo N2-=

Ec- lEc

3EL' ,3

1,

where Ec.a (Ycc - EaaYL1Es2), Eza Cov(Wr), Eaa Cov(Xf, Wf), JaI
CovtWt, yf) (seeChapter 15). Moreover, the probability models underlying
(33) and (34)are Dt.pf Xf; 0) and 1)(y,,,'Xr,Wr,' 0o) respectively. Once this is
realised we can see that the omitted variables bias (39)should be written as
=

20.3 Weak
E
)'/-1f

exogeneity

=(X X)

-bo

1J'

t#)

(20.41)

X Wy#0,

since
E

(u) WT #0,

(20.42)

j',..''A'U/
.

where E)..

) refers to the expectation operator defined in terms of


D( pt/Xr, Wf;0v). Looking at (41)we can see that the omitted variables bias
arises when we try to estimate pv in
f,.(

'

y =X#() + Wy+ c

(20.43)

by estimating # in (34)wheres by construction,


the context of the same statistical model,
E (/)

## po.On the other hand, in

-#-

(20.44)

since
E

(u) 0

(20.45)

vt'.kr

and no omitted variables problem arises. A similar argument can be made


for ti2. From this viewpoint the question of estimating the statistical
parameters of interest 0v H (#a, )!, c,2) by estimating 0 H (j, c2) never arises
since the two parameter sets 0o and 0 depend on different sample
ct.yt, X,, Wf t 1, 2,
T) and ..F c(y,, Xt, t 1, 2:
information,
F) respectively. This, however, does not imply that the omitted variables
argument is useless, quite the opposite. ln cases where the sample
the argument can be very useful in
information is the same (.#$
deriving misspecification tests (see Chapters 2 1 and 22). For further
dscussion of this issue see Spanos ( 1985b).
The above argument illustrates the dangers of not specifying explicitly
the underlying probability model and the statistical parameters of interest.
By changing
probability
the underlying
distribution and the
parametrisation the results on bas disappear. The two parametrisations
are only comparable when they are both derivable from the joint
distribution, D(Zj
,Zw;#) using alternative
arguments.
.#-()

.t./-)

treduction'

20.3

Weak exogeneity

When the random vector Xr is assumed to be weakly exogenous in the


context of the linear regression model it amounts to postulating that the
stochastic structure of Xf, as specified by its marginal distribution D(Xf; #c),
is not relevant as far as the statistical inference on the parameters of interest

422

Departures from assumptions

statistical

GM

o-p, c2) is concerned. That is, although at the outset we postulate Dt-pt, Xt;
#) as far as the parameters of interest are concerned, D(yj/Xt; #1) suffices;
note that
D(#t, Xf; #) D(.F/Xl;
=

#1) D(Xf; #2)

(20.46)

'

is true for any joint distribution (seeChapter 5). If we want to test the
exogeneity assumption we need to specify 1)(X,; #2) and consider it in
relation to D(yt/X2; #1) (see Wu (1973),Engle (1984),.
inter alia). These
exogeneity tests usually test certain implications of the exogeneity
assumption and this can present various problems. The implications of
exogeneity tested depend crucially on the other assumptions of the model as
well as the appropriate specification of the statistical GM giving rise to xr,
T; see Engle et al. (1983).
t= 1, 2,
Exogeneity in this context will be treated as a non-directly testable
assumption and no exogeneity tests will be considered. It will be argued in
Chapter 2 1 that exogeneity assumptions can be tested indirectly by testing
the assumptions g6q-E8j.The argument in a nutshell is that when
inappropriate marginalisation and conditioning are used in defining the
parameters of interest the assumptions g6J-E8(I
are unlikely to be valid (see
Engle et al. (1983),Richard (1982:.For example
the weak
a way to
exogeneity assumption indirectly is to test for departures from the
nonuality of Dtyf, Xf; #) using the implied normality of D(yl/Xf; 0) and
homoskedasticity of Vartyt/xr
xr). For instance, in the case where D(y,, Xf;
#) is multivariate Student's 1, the parameters /1 and #c above are no longer
variation free (seeSection 21.4). Testing for departures from normality in
the directions implied by D(y,, X,; #) being multivariate t can be viewed as
an indirect test for the variation free assumption underlying weak
exogeneity.
.

ttest'

20.4

Restrictions

on the statistial

parameters

of interest 0

The statistical inference results on the linear regression model derived in


Chapter 19 are based on the assumption that no a priori information on 0 BE
(#, c2) is available. Such a priori infonuation, when available, can take
various forms such as linear, non-linear, exact, inexact or stochastic. In this
section only exact a priori infonuation on # and its implications will be
considered; a priori information on (7.2 is rather scarce.

(1)

Lineav a 'l'ftll'f restrictions

on

Let us assume that a priori information in the form of m linear rcstrictions


R/= r

(20.47)

20.4 Restrictions

of interest

on parameters

is also available at the outset, where R and r are m x k and m x 1 known


Such restrictions imply that the parameter space
matrices, ranktlt)
values
in no longer Rk but some subset of it as determined by
where # takes
relevant for the statisrestrictions
represent information
(47). These
and
regression
model
of
the linear
tical analysis
can be taken into
consideration.
In the estimation of 0 these restrictions can be taken into consideration
by extending the concept of the log likelihood function to include such
restrictions. This is achieved by defining the Lagrangian function to be
=?'n.

l0, p; y, X)

const

T
--j.

1og c a

2c z ty -

.,.

ns'g

.A#l

v
n.
y - .,h.P? # (R/
,

r),

(20.48)
where p represents an m x 1 vector of Laqranqe multipliers. Optimisation of
(48) with respect to #, c2 and p gives rise to the first-order conditions'.
01 1
(X'y -X?X#) - R'p
p = cl

pl
t?cz

2c a

(31
= -(R#

1 .(y
2c
=

(49)with

(20.49)

-X#)'(y -X/)=0,

(20.50)
(20.5 1)

0.

-r)

ep

Premultiplying

=0,

R(X'X)-

solving for p we get


/

(tomake the second term invertible) and

gc2R(X'X)- 1R'j - 1(Rj- r),

(20.52)

where J= (X'X)- 1X'y is the unconstrained


that the constrained MLE of # is
..-.

...- j-(X
#=

and

From

R(

- r

(50)we

X)

R ERIX X)

1
=- F

#. This in turn implies

(Rj - r)

(20.54)

0.

can deduce that the constrained

:2 Y
-

MLE of

-x#

-x#

(y
'

MLE of c2 is

(20.55)

(y
y -Xj.

424

Departures

from assumptions - statistical GM

Properties

of

1-0(J #
5

2)

Using these formulae we can derive the distributions of the constrained


of # c2 and #. Jand# being linear functions of we can deduce that
J

MLE'S

(X'X) - 1 R' gR(X'X) - 1 R/) - 1(RjEc2R(X'X)-1R!j - 1 (R#-r)

(#

r)) Cj 1

Cj z

C.z, Caa

'

(20.56)
c 11

c2E(x'X) - 1 -(x'X)

1
1
- R'gR(X'X) - 1R'j - R(X'X) -

C12 (X'X)- 1 R'gc2R(X'X) - 1 R'() - 1 > Covt/,


=

C22 gc2R(X'X)- IR'II- '


=

Using

(i)

lj

Covtj),

#) C'a1,
=

Cov(#).

EEE

(56)we

can deduce that


When R#= r, Elp-) p and
0, i.e. #- and # are unbiased
estimators of # and 0, respectively.
# are fully efficient estimators of p and p since their variances
achieve the Cramer-ltao lower bounds as can be verified directly
using the extended information matrix:
'(#)

Jand

x'x
Iw(#, p, c2)

R'

()

(20.57)

2c

;y

(see exercises 1 and 2).


- Covt#-lj G0 i e. the covariance of the constrained
EC0v(#)
#- is always less than or equal to the covariance
unconstrained MLE j-2 irrespective of whether R#= r holds
- MSE(#)j 10 where MSE stands for mean
but EMSEI#)
12).
Chapter
(see
error
The constrained MLE of c2 can be written in the form
-

:2

42 +

1
F

Given that
T'42

--s

(R# -r)?gR(X'X) -

li
= -X(X'X)-

z2(w-/()

MLE
of the
or not;
square

RgR(X'X) -

IR'II- 1(R# -r),

(20.58)

IR'II- 1(R/

(20.59)

r).

20.4 Restrictions

of interest

on parameters

425

and
(Rj-

r)'

IR'I--1 (Rj-

ERIX'X) .7

r)

J),
z2(m;

'w

(20.60)

where
1
1
(R# -..j'gR(X'X) - R') - (R#

=
we ean deduce that
7-:2
- a
c

k,.

(20.61)

z2(T +

'v

r)

(20.62)

),

using the reproductive property of the chi-square (seeAppendix 6.1) and the
independence of the two components in (59).This implies
F(d2) + c2.

But for

(J().63)

.92

gl (F+

f'(#2)

-/()j?g,

c2 wjl.w

R#= r, since J

0.

The F-test revisited

ln Section 19.5 above we derived the F-test based on the test statistic
'r(y)

(R/

i
1
(R/
- r)' (R(X'X) - R'j --

/l1 l

z--'

r)

(20.64)

--

for the null hypothesis


Ho: R#= r

against

Hj

R## r,

must be close to
using the intuitive argument that when Hv is valid jlR#zero. We can derive the same test using various other intuitive arguments
similar to this in relation to quantities like j1J-/)tand
being close to
of the Fvalid
when
question
A
formal
derivation
is
5).
Ho
(see
more
zero
test can be based on the likelihood ratio test procedure (see Chapter 14).
The abovc null and alternative hypotheses in the language of Chapter 14
can be written as
.-rij

tj/j!

Ho1 pgeo,

H,

where
0G

(j,

(F2),

(.)

pc(6)1

c2):

.t(#,

jG

-60,

(F/, c2

(i)tl (#, c2j: # G iRk Rj=


=

.tf

R o ).s
r,

(7.2

(j

R+

jl
.

The likelihood ratio takes the form

#y)

max L(0; y)
ps eo -----.
max L(0, y)
pct.k

L (#.
,

.-v--

w(,;.ya)- r?y,.-w.a
# (a j --jwn--.
sj.-- o
J.

ttp,. y) (2a)
.

(tz )

t?

(?c

-.

r,,a

ti a

(20.65)

from assumptions - statistical GM

Departures

The problem we have to face at this stage is to determine the distribution of


2(y) or some monotonic function of it. Using (58)we can write 2(y) in the
form

2(y)
/'

Looking

(R/

1 ?-

==

'R'II-1(R/-r)

-r)'gR(X'x)-

-r/2

(20.66)

(T- k)sa
--

(66) we can see that it is directly related to (64) whose

at

distribution we know. Hence, 2(y) can be transformed into the F-test using
the monotonic transformation
z(y)

-27z
(2(y) . -

T- k

1)

(20.67)

This transformation provides us with an alternative way to calculate the


value of the test statistic zty) using the estimates of the restricted and
unrestricted MLE'S of c2. An even simpler operational form of z(y) can be
specified using the equality (58).From this equality we can deduce that
'''h

(R#

r) r (RIX X)

?,

(1
-

''''

(R# - r)

??

/'
-

(20.68)

(see exercise 4). This implies that z(y) can be written in the form

'r(y)
z(y)

uu-uu

T-k

(20.69)

'

UU

RRSS -URSS
URSS

T-k
.

(20.70)

where RRSS and URSS stand for restricted and unrestricted residual sums
of squares, respectively. This is a more convenient form because most
computer packages report the RSS and instead of going through the
calculation needed for (64)we estimate the regression equation with and
without the restrictions and use the RSS in the two cases to calculate z(y) as
in (70).

Example
Let us return to the money equation

mt 2.896 + 0.690.:,
( 1.034) (0.105)
=

R2

log L

=0.995,

42

estimated

0.865p,

-0.055,

in Chapter 19:
l'ir,

(0.020) (0.013) (0.04)


=0.995,

147.4, RSS

=0.117

.:=0.0393,

52, T= 80.

(20.7 1)

20.4 Restrictions

of interest

on parameters

427

that this is a well-defined estimated statistical GM (a very


questionable assumption) we can proceed to consider specification tests
related to a priori restrictions on the parameters of interest. One set of such
theory
a priori restrictions which is interesting from the economic
viewpoint is
Assuming

ja # 1.

against

and

lnterpreting j2 and ja as income and price elasticities, respectively, we can


view Hv as a unit elasticlty hypothesis.
ln order to use the form of the F-test (linear restrictions) as specified in
(70) we need to re-estimate (71) imposing the restrictions. This estimation
yielded
lmt

-p1

R2

yt)

-0.529

0.629,

(20.72)

-0.2 19f1+
(0.055) (0.019) (0.087)
,

#2

=0.624,

=0.0866,

log 1-= 83.18, RSS 0.58552,

T= 80,

Given that RRSS 0.585 52 and URSS= 0. l 17 52


=

we can

deduce that
(20.73)

(20.74)
Hence we can conclude that Hv is strongly rejected. lt must be stressed,
however, that this is a specification test and is based on the presupposition
that all the assumptions underlying the linear regression model are valid.
From the limited analysis of this estimated equation in Chapter 19 there are
clear signs such as its predictive ability and the residual's time pattern that
some of the underlying assumptions might be invalid. In such a case the
above conclusion based on the F-test might be very misleading.
The above form of the F-test will play a very important role in the context
of misspecification testing to be considered in Chapters 2 1-23.

(2)

Exact non-linear restrictions

on

ji

Having considered the estimation and testing of the linear regression model
when a priori information in the form of exact linear restrictions on j we
turn to exact non-linear restrictions.

428

from assumptions - statistical GM

Departures

Consider the case where a priori information comes in the form of m nonlinear restrictions (e.g.pj #c/ja, jl
-#c2):
=

f(#) 0,

1, 2,

m,

or, in matrix form:


H(#)

=0.

(20.75)

ln order to ensure independence between the m restrictions we assume that


?H(#)

rank

(p

(20.76)

rl.

As in the case of the linear restrictions, 1etus consider first the question of
constructing a test for the null hypothesis
Ho1 H(#)

0 against

Hj

H(#) + 0.

(20.77)

Using the same intuitive argument as the one which served us so well in
constructing the F-test (seeSection 19.5) we expect that when Ho is valid,
H(j) :%0 The problem then becomes one of constructing
a test statistic

based on the distance


f!H(#1

0 !)
.

Following the same arguments in relation


the absolute value we might transform
H(#) ECov(H(#))j

to the units of measurement and


(78) into

H(#).

(20.79)

Unlike the case of the linear restrictions,


however, we do not know the
distribution of (79)and thus it is of no use as a test statistic. This is because
x'Np c2(X'X) - 1) the distribution of H(j) is no
a lthough we know that
longer normal, being a non-linear function of j. Although, for some nonlinear functions
we might be able to derive their distribution, it is of
little value because we need general results which can be used for any nonlinear restrictions. The construction of the F-test suggests that if we could
linearise H(/) such a general test could be derived along similar lines as the
F-test. Linearisation of H(/) can be achieved by taking its first-order
Taylor's expansion at #, i.e.

f(J)

H(/)

H(#) +

pH(#)

p#

(/-#)

tasymptotically

+op(1),

(20.80)

negligible terms' (seeChapter 10).,


where op(1) stands for
(80) provides us with a linearised version of H(j) at the expense of the
approximation. The fact that we chose to ignore a1l the higher terms as

Restrictions

20.4

of interest

on parameters

429

asymptotically negligible implies that any result based on (80)can only be


justified asymptotically. Hence, any tests based on (80) can only be
asymptotic. What we could not get in finite sample theory (linearity),
we get
by going asymptotic. Given that

v''r(/ - ,)

c2Qx-')

N(0,

(20.81)

we can deduce that


x''T(H(

-H(#))

N 0, c

H(,)

z op -

Qx-

pH(#)

'

(20.82)

op

(see N 1 in Chapter 15). This result implies that if we substitute the


asymptotic covariance (Cova(H(j))) in (79)we could get an asymptotic test
because
'z'I4(/)'Ecov(H(#)()

- 1H(/)

where

j),
z2(m,.

'v

(20.83)

(:

c2j(''JjC>)ox-

covt/lQ

1(t''Jj>)'j

- H(#)'Ecov(/)q

-1H(,).

(20.84)

(83) cannot be used as a test statistic because # and


P

As it stands

unknown and Qx is not available. Given, however, that s2 --+


limw-+.,(X'X) T= Qx we can deduce that
t?H(/)

.'b'

where

P#

(?H(j)

t?j

X'X - 1 ?H(/)
F
Jj

'

P
--+

covtj)

(20.85)

,,

t?j

#-j

the test statistic

H(/)'js2(0Hgj(j(x,x)-

u,(y)-

ltt?l-lgjjj-

'u(j)
,...,

zztm;),
(20.86)

which can be used to define an a size asymptotic test based on the


region
=

are

c2 and

?H(#)
=

This enables us to construct

C1

c2

1y:1'Pr) >

Gl,

rejection

(20.87)

430

Departures

from assumptions

statistical

GM

where

dzztn1l
=

,x.

This is because
H
BStyl

z2(r#l).

(20.88)

This test is known as the WW/J test whose general form was discussed in
Chapter 16. ln the same chapter two other asymptotic test procedures
which give rise to asymptotically equivalent tests were also discussed. These
are the Lagrange multiplier and the likelihood ratio asymptotic test
procedures.
The Lagrange multiplier test procedure is based on the constrained MLE
of
J # instead of the unconstrained MLE J. That is, #-is derived from the
optimisation of the Lagrangian function

1(0,#;

=log

y)
?/

t? log

pt?/

t?H(#)

#-(#- d2) is the

--

/.t(#)

lp

lH(#)

P#

= -H4#)

(p
=>

L0; y) -#H(#)

()

(20.90)

(#; y)

t7#

plp

0,

log L

(20.89)

H(#

0,

(20.91)

MLE of onp, c2). In order to understand


what is involved in (89)-491),it is advisable to compare these with (48),(49)
and (52)above for the linear restrictions case. ln direct analogy to the linear
.

'

constrained

restrictions case the distance

r)/z(#
-0)(

can be used as a
intuitiveargument
1
T

(20.92)
measure of whether Ho is true or not. Using the same
as for !hH(/)
j above we can construct the quantity
-0

#(#)'ECOv(#(#))q

p(#)

(20.93)

as the basis of a possible test statistic. lt can be shown that

-v,'v
pt)

-,,(#))

--

x((,, c2g(f'Hp,t#')Qz'(-'----/?.'4p#

)'j-')

(ae.94)

Restrictions

20.4

(see Chapter

used

C1

431

16). This suggests that the statistic

f-Mtyl

can be

of interest

on parameters

#/1/

pH(J)(X X)

(72

p#

)y : Lsiy)

> ca

p(

(j)
c
z (n1,
.

'v

(20.95)

x size test with rejection

to construct an
=

H(J)
'

-.

region

(20.96)

and power function


izthp) PrLM(y)

dzztnl,'

> ral =

Cz

).

(20.97)

ln Chapter 16 it was shown that the Lagrange multiplier test can take an
alternative formulation based on the score jnction gr log L)?((0. ln view
of the relationship between the score function and the Lagrange multipliers
given in (91) we can deduce that we can construct a test for Hv against H3
based on the distance

logL)

- 0

Following the same argument as in the case of the construction


LMyj we can suggest that the quantity

( log Lj'
(0

Cov

1o2 L(#)

t? log

Ltpl

t?p

7loz Lj
'-'

(20.99)

'

t?p

should form the basis of another


1

of lVtyl and

reasonable

test statistic. Given that

x N(0, Iw.(p))

(20.100)

(see Chapters 13 and 16) we can deduce the following test statistic:
Esy)

--

1
'r

p log L)'
t0

Iw)

log L(#)

Hu

(0

z (m).

(20. 10 1)

This test statistic can be simplified further by noting that Ho involves only a
subset of 0 t'ustj) and 1.(p) is block diagonal. Using the results of Chapter

16 we can deduce that


Esyl-ji?

log 1-(1, 0)j'

'y

lrjt

g:2(x,x)-

10g .L(1, 0)j.

(20. 102)

Departures

from assumptions - statistical GM

The test statistic Esy) constitutes what is sometimes called the ejjlcient
score form of the Lagrange multiplier test.
The Iikelihood ratio test is based on the test statistic (seeChapter 16):
J-aR(y)

2(log L(, y) - log .I-(', y)) x.


1

z2(m; ),

(20.103)

with a rejection region


C1

.fy

f-R(y)

>

ca ).

(20.104)

Using the first-order Taylor's expansion


f-R(y)

T'(#-

cx

This approximation

4)'I,c0)(Jis very

we can approximate

LRy) as

(20.105)

suggestive

because

it shows

clearly

an

important feature shared by a1l four test statistics WJtyl, LMy), ESy)
and .LR(y). A11four are based on the intuitive argument that some distance
l H(/)1) g(/311 E7log L($-,0)l/?,)r and 14
#-- respectively, is
to
zero' when Ho is valid.
Another important feature shared by a11 four test statistics is that
their asymptotic distribution (z2(m)under Hv) depends crucially on the
asymptotic nqrmality of a ertain quantity inyolved in defining these
/
dijtancesg x?' F(# - #), ( 1,/x.' Tqlp) - #(#)), ( 1 x7 F) EPlog .L(#, 0)j/J$ and
v''F(#- 0) respectively. All thre tests, the Wald ( 14,'),Lagrange multiplier
LLM) and likelihood ratio (faR) are asymptotically equivalent, in the sense
that they have the same asymptotic power characteristics. On practical
grounds the only difference between the three test statistics is purely
computational, l4'tyl is based only on the unconstrained MLE p-LM(y) is
based on the constrained MLE J and LR(y) on both. For a given example
size T; however, the three test statistics can lead to difference decisions as far
as rejection of Hv is concerned. For example, if we were to apply the above
procedures to the case where H(#) R# r we could show that l4z'(y)

11

/)1

41

F'

'

tclose

20.5

(see exercises 5 and 6).

f.R(y) y LMy)

Collinearity

As argued in Section 19.3 above, assumption


ranktx)
is directly

E2-21o'al c2
,

k,

related
=

cj

ranktx)

j
=

g5qstating that

to the statistical parameters of interest 0


vja Ya:. This is because
gj aEa-.z1gaj)
rank(X'X)

(20.106)
H(#,

c2)

(#

(20.107)

20.5 Collinearity
and (106) represents the sample equivalent to
ranktrzc)

k.

(20.108)
(72

When (108)is invalid and Ea2 is singular # and


cannot even be defined.
verified
directly and thus we need to
Condition ( 108), however, cannot be
relyon ( 106) which ensures that the estimators

/=(x'x)-

'x'y

42

and

1 y'(I -X(X'X)

=-

ly

(20.109)

of # and c2 can be defined. ln the case where X'X is singular j and 42 cannot
be derived.
The problem we face is that the singularity of (X'X) does not necessarily
imply the singularity of Ec2. This is because the singularity of (X'X) might be
a problem with the observed data in hand and not a population problem.
For example, in the case where T< k rank(X?X) < k, irrespective of E22
because of the inadequacy of the observed data information. The only clear
conclusion to be drawn from the failure of the condition ( 106) is that the
of the statistical
the estimation
sample injrmation in X is inadequate
t#' interest
c2.
The source of the problem is rather more
j and
parameters
difficult to establish (sometimesimpossible).
In econometric modelling the problem of collinearity is rather rare and
when it occurs the reason is commonly because the modeller has ignored
relevant measurement information (see Chapter 26) related to the data
chosen. For example, in the case where an accounting identitq' holds among
some of the xrs.
It is important to note that the problem of collinearity is defined relative
to a given parametrisation. The presence of collinearity among the columns
of X, however, does not preclude the possibility of estimating another
of the
statistical
GM.
parametrisation restriction
One such
parametrisation is provided by a particular linear combination of the
columns of X based on the eigenvalues and eigenvectors of (X'X).
.jr

Let
P(x?X)P'

diagti-l, 2c,

2,,,,0, 0,

(20.110)

0)

and P'P= PP' 1, where P represents a k x k orthogonal matrix whose


columns are the eigenvectors of (X'X) and ;.1, ;.a,
m its non-zero
eigenvalues (seeHouseholder (1974/. lf we define the new observed data
matrix to be X* XP and #* P'# the associated coefficient parameters we
could reparametrise
the statistical GM into
=

(20.111)
The new data matrix X* can be viewed as referring to the values of the

from assumptions - statistical GM

Departures

artificial random variable X1= X,'pf, i 1, 2,


k. The columns of X*
Xpf are known as principal components of X and in view of
defined by X
=

(X'X)pf

fOr

(see 110), where pj, i

1, 2,

X* HE (X*1:X1),

#*

Decomposing

(iii) for
t

1, 2,

m + 1,

X1

0.

as y

(20.112)

k, are the columns of P, we can deduce that

P'# conformably

in the form

#*

=(a',

/)'

(20.114)

X*a + u,
,:2

being the new parameters


where ranktxl)
rl, with and
data specsc. These parameters can be estimated via
=

- =(X/'Xt)-1X1'y

Moreover, in

view of

#= P#*
any linear

1-2

=-

which are now

y'(I -X1(Xt'X1)-1X1)y.

(20.115)

relationship

P1 + P2)'

combination

c'b=c'PIa

and

the

we can rewrite

c'# of

+c'Pcy

# is

estimable if c'Pa

0 since

=c'P1a,

(20.117)

Using the principal components as the new columns of X, however, does


not constitute a solution to the original collinearity problem because the
estimators ( 115) refer to a new parametrisation. This shows clearly how the
collinearity problem is relative to a given parametrisation and not just a
problem of data matrices. The same is also true for a potentially more
serious problem, that of near collinearity', to be considered in the next

section.

20.6

Near' collinearity

lf we define collinearity as the situation where the rank of (X'X) is less than
collinearity refers to the situation where (X'X) is
k,
singular or
ill-conditioned as is known in numerical analysis. The effect of this near
singularity is that the solution of the system
dnear'

tnearly'

(X'X)j

X'y

(20.118)

to derive j is highly sensitive to small changes in (X'X) and X'y. That is,
small changes in (X'X) or X y can lead to big changes in q. As with

20.6

tNear' collinearity

collinearity, this is a problem of insujhcient data information relative to a


given parametrisation. ln the present context the information in X is not
quite adequate for the estimation of the statistical parameters of interest /
and c2. This might be due to insufficient sample information or to the
choice of the variables involved (Ecc is nearly singular). For example, in
cases where there is not enough variability in some of the observed data
series the sample information is inadequate for the task of determining j
collinearity is a problem to be tackled. On the
and c2. In such cases
other hand, when the problem is inherent in the choice of Xt (i.e. a
population problem) then near collinearity is not really a problem to be
tackled. In practice, however, there is no way to distinguish between these
collinearity because we do not know Ec2, unless we are
two sources of
in a Monte Carlo experimental situation (seeHendry (1984:.This suggests
collinearity relative to a given
that any assessment of whether
parametrisation is a problem to be tackled will depend on assumptions
values of the statistical parameters of interest.
about the
collinearity
Some of the commonly used criteria for detecting
suggested by the textbook econometric literature are motivated solely by its
of J as measured by
effect on the
:near'

'near'

knear'

ttrue'

'near'

taccuracy'

Cov(#)

c (X X)

(20.119)

Such criteria include:


Simple correlations
These refer to the transformation of the (X'X) matrix into a correlation
matrix by standardising the regressors using

(xfrv
(

kit

(-YI'!

.ff)

-ff

1)2

1, 2,

k.

(20.120)

The standardisation is used in order to eliminate the units of measurement


problem. High simple correlations are sometimes interpreted as indicators
of near collinearity.

(b)

yluxlrv

regressions

436

from assumptions - statistical GM

Departures

and a high value of the multiple correlation coefficient from this regression,
Rl, is used as a criterion for tnear'
collinearity. This is motivated by the
following form of the covariance of h :

covt/ h )

c2(x'x)j-;,1

c2g( 1

Rjzlxkx

(20.122)

else assumed fixed) leads


(see Theil (1971)). A high value for RL(everything
to a high value for Cov (/ )9 viewed as the uncertainty related to j h. By the
same token, however,a small value for xkxh, interpreted as the variability of
the th regressor, will have the same effect. N()te that R; refers to

RJ

xkX -,(X'-p,X

-,)

1X'-y:x,

(c)

(20.123)

xhxh

Condition numbers

Using the spectral decomposition of (X'X) in ( 110) we can express ( 119) in


the form
Cov(#)
-

c (X'X)-

c2(PA - 1P/)

k
0.2

;::1

.x..z

y
1

9-6j
2,.

(20.124)

and thus
Var(#f)
-

.g

pij

'E ;j
f=1

1, 2,

(20. 125)

(see Silvey (1969:. This suggests that the variance of each estimated
coefficient /f depends on a1l the eigenvalues of (X'X). The presence of a
relatively small eigenvalue ki will
these variances. For this
reason we look at the condition numbers
tdominate'

&j(X'X)

2max
i
ki
,

1, 2,

k,

(20. 126)

refers to the largest


for large values indicating near' collinearity, where zmax
Belsley
number is large
large
condition
eigenvalue (see
et aI. (1980:.How
a
enough to indicate the presence of near collinearity is an open question in
view of the fact that the eigenvalues 21
kk are not invariant to scale
changes. For further discussion see Belsley (1984).
Several other criteria for detecting
collinearity have been suggested
in the econometric literature (seeJudge et aI. ( 1985) for a survey) but all
these criteria, together with (a)-(c) above, suffer from two major
weaknesses'.
(i)
they do not seem to take account of the fact that near' collinearity
should be assessed relative to a given parametrisation; and
,

%near'

20.6

tNear' collinearity

none of these criteria is invariant to linear transformations of the


data (changesof origin and scale).
Ideally, we would like the matrix X'X to be diagonal, reflecting the
orthogonality of the regressors, because in such a case the statistical GM
takes a form where the effect of each regressor can be assessed separately
ckk) and
given that E22 diagtccc, caa,
=

(20.127)
tJ'l i

/'11 l'ny - Fl t;cii

Covtx'flrf),

rrlx f (Xft),
=

1/

plx,

2, 3,

(20.128)

k.

ln such a case
1x'
- (x'x)
i bi
=

>

Vartj)

-1

(xJx)

(20.129)

with the estimator as well as its variance being effected only by the th
regressor. This represents a very robust estimated statistical model because
changes in the behaviour of one regressor over time will affect no other
coefficient estimator but its own. Moreover, by increasing the number of
regressors the model remains unchanged. This situation can be reached by
design in the case of the Gauss linear model (seeChapter 18). Although such
an option is not directly available in the context of the linear regression
model because we do not control the values of Xl, we can achieve a similar
effect via reparametrisation. Given that the statistical parameters of interest
0 rarely coincide with the theoretical parameters of interest ( we can tackle
the problem at the stage of reparametrising the estimated statistical GM in
order to derive an empirical econometric model. It is important to note that
statistical GM is postulated only as a crude approximation of the actual
DGP purporting to summarise the sample information in a particular way,
as suggested by the estimable model. Issues of efficient estimation utilising
a1l a priori information do not arise at this stage. Hence, the presence of
tnear' collinearity in the context of the statistical model should be viewed as
providing us with important information relating to the adequacy of the
sample information for the estimation of 0. This information will help us to
model by
construct a much more robust empirical econometric
estimated
statistical
GM
which
provide
in
reparametrising the
us with
ways
without,
which
close
orthogonal
transformed regressors
to being
are
however, sacrificing its theoretical meaning. This is possible because there is
no unique way to reparametrise a statistical GM into an empirical
econometric model and one of the objectives in constructing the latter is to

438

from assumptions - statistical GM

Departures

ensure that the sample information is sufficient for the accurate


determination of its parameters. It is important to emphasise at this stage
empirical econometric models are not, as some econometric and
that
time-series textbooks would have us believe, given to us from outside by
'sophisticated' theories or by observed data regularities, but constructed by
econometric modellers using their ingenuity and craftsmanship.
ln view of the above discussion it seems preferable to put more emphasis
robust
empirical econometric
models with
in constructing
orthogonal regressors without, however, sacrificing either their statistical
or economic meaning. Hence, instead of worrying how to detect
collinearity (whichcannot be defined precisely anyway) by some battery of
suspect criteria, it is preferable to turn the question on its head and consider
the problem as one of constructing empirical econometric models with
'nearly' orthogonal regressors among its other desirable properties. To that
end we need some criterion which assesses the contribution of each
regressor separately and is invariant to linear transformations of the
observed data. That is, a criterion whose value remains unchanged when
igood'

'nearly'

tnear'

and

h is mapped into

.b't

X, is mapped into X)

tk1 ),t + c1

(20.130)

zlcxf

(20.131)

+ c2,

where f?j # 0 and


is a (k 1) x k 1) non-singular
We know that under the normality assumption
,42

Z,

'v

Zt

EB

matrix.

(20.132)

N(m, E),

where
l'f
Xt

ml

m2

(r1 1

J1 2

J1 1

X22

(20.133)

the statistic
(2,S),

=-

)(
-

Zf

and

)( (z,

l=1

-2)(z,

-2)'

(20.134)

is a sufficient statistic (seeChapter 15). Under the data transformation


and (131) the sufficient statistic is transformed as
2

-.+

A2

+c

(20.135)

ASA',

(20.136)

and
S

-->

where
A

(130)

j
(/4)1
) (cr
x0
2

al

20.6 ENear' collinearity


are k x k and k x 1 matrices. The corresponding
parameters are:
m
E

transformations

Ammc,

-+

(20.137)

AEA'.

--+

on the

What we are seeking is a criterion which remains unchanged under this


group of transformations. The obvious candidates as likely measures for
assessing t h e separate contr ibutions of the regressors involved are the
multiple and partial correlation comcients
(see Chapter 15). The sample
partial correlation coefcient of y, with one of the regressors, say Arlr, given
This represents a measure of the
the rest Xaf, is given in equation (15.48).
when the effect of all the other variables have been
correlation of y'fand
tpartialled out'. On the other hand, if we want the incremental contribution
of Xjt to the regression of y, on Xt we need to take the difference between the
sample multiple correlation coefficient, between yyand Xf and that between
yt and X 1r (X, with xYj, excluded), denoted by kl and #2-1, i.e. use
uhrbt

(2

kl 1 )

(20.138)

2).

It is not very surprising

(seeequation (15.39)for
measuresare directly related

(#2

2- j)

yz :$( 1

that both of these

via
-

2-

(20.139)

p-12.a denotes the sample


coefficient of y, and Xqt given the rest Xsf
Let
kl
S) and /(c,a=g1(Z, S).

(seeTheil (1971)),where

partial

correlation

=g(2,

We can verify directly that

(/Z, S) g(A2
=

g1(2, S) =gj(A2

+c, ASA')
+c, ASA').

(20.140)
(20.141)

That is, the sample multiple and partial correlation coefficients are
invariant to the group of linear transformations on the data. lndeed it can
be shown that 112 is a maximal invariant to this group of transformations
That is, any invariant statistic is a function of 2.
(see Muirhead (1982)).
The multiple and partial correlation coefficients together with the simple
correlation coefficients can be used as guides' to the construction of
empirical models with nearly orthogonal
regressors. The multiple
used
overall
picture of the relative
particular
correlation in
to get an
can be
statistical
of
various
GM and the
contribution
the
regressors (in both the

440

Departures

from assumptions

empirical econometric

statistical

GM

model) using the incremental contributions

1, 2,

k - 1,

(20.142)
(20. 143)

which Theil ( 1971) called the multicollinearity effect. ln the present context
such an interpretation should be viewed as coincidental to the main aim of
constructing robust empirical econometric models. It is important to
remember that the computation of the multiple correlation coefficient
differs from one computer package to another and it is rarely the one used
above.
To conclude, note that the various so-called
to the problem of
such
adding
collinearity
dropping
as
or
near
regressors or supplementing
the model with a priori information are simply ways to introduce
alternative reparametrisations,
not solutions to thc original problem.
isolutions'

Important

concepts

Stochastic linear regression model, statistical versus theoretical parameters


of interest, omitted variables bias, reparametrisation,
constrained and
MLE's,
priori
and
non-linear
unconstrained
restrictions, restricted
linear
a
and unrestricted residual sum of sqtlares, collinearity, near' collinearity,
orthogonal regressors,
incremental contributions,
partial correlation
coefficient, invariant to linear transformations.

Questions
Compare and contrast

(i)

F().'f X,

(ii)

'''''

1
j= (X X) X y, p+

(iii)

42

xE7lyfc(X,));

xf),

''-'

',

t*2

1
=--

(.)??')
.?

.1

'

.4

.t
.

y,

*?*,

Compare and contrast the statistical GM's, the probability and


sampling models for the' linear regression and stochastic linear
regression statistical models.
model be
tl-et the
ebtrue''

yt

j'x t + y,w ! + f;f

20.6

Near' collinearity

d the one used be v #'xt + u t lt can be shown that for j'


(X X
)- X y
E(p) p, i.e. # suffers from omitted variables bias; and
(i)
741/,) y'wt.'
(ii)
an

'

'

u#

Discuss.
Explain informally how you would go about constructing
an
exogeneity test. Discuss the difficulties associated with such a test.
MLE'S
of p and c2:
Compare the constrained and unconstrained

)= j-(x'X)- 1 R'
j=(x'x)-1x'y,

42

1 (y

g.2

-x#

r
-.

1
T

(y

(y-x#,
-

x jj (y- X#)
,

Explain how you would


Hv1 Rj= r

(R(X/X)- IR') - 1(Rj-r).

a test for

go about constructing

against

S1 : Rp+ r

based on the intuitive argument that when Hv is true is close to zero.


kWhy don't we use the distance 11R#- r) Compare the resulting test
with the F-test.
Explain intuitively the derivation of the Wald test for
'?'

.f

Hv H(j)

'Why don't we use

0 against

rH(/)

Hj : H(j) # 0.

1 instead of

IH(/) i as the basis of the

! I

argument for the derivation?'


What do we mean by the statement that the Wald, Lagrange
multiplier and likelihood ratio test procedures give rise to three
asymptoticallv equitalent lasrs?
When do we prefer an asymptotic to a finite sample test?
Explain the role of the assumption that rank (X) k in the context of
the linear regression model.
and their
and
Discuss the concepts of
implications as far as the MLE'S # and 42 are concerned.
in the
How do we interpret the fact that (X'X) is
context of the linear regression model? How do we tackle the
problem?
=

tcollinearity-

Snear-collinearity'

'ill-conditioned-

442

Departures from assumptions - statistical GM


Exercises
Using the first-order conditions (49/-(51)of Section 20.4 derive the
information matrix (57).
Using the partitioned inverse formula,
A
B'

- 1

E=D

A-

A-FE- 1F'

1F'
- EIB

-B'A-

EF

c2)j 1 and
derive g1w(#,
compare its
p,
and Cc2 in (56).
Verify the distribution of
-

--FE-

A- IB

various

elements with C1 j C1 2
,

($-

in(56).
vCr ify the equality

cuu'
-'
(R/ -r)'gR(X'X) - IR') - 1(Rj
For the null hypothesis Hv: Rp= r against ff1 : R## r use the Wald,
Lagrange multiplier and likelihood ratio test procedures to derive the
following test statistics:

?U(y)

==

-r)

( 1
i-

y- k

FNT(Y)' L A4tY)

==

.-

LRy)

T log

(7-

--CIVKGCXI
) j. suzty)
,

mzty)

1 + T-i

respectively, where z(y) is the test statistic of the F-test.


Using IzFtyl, LMy) and .LR(y) from exercise 5 show that
PU(y);: 2L1(y) )h pA4ly).
Note that logtl + z)> z/(1 + z), cylogtl
Savin (1982).)

+ z), zb 0)

(see Evans

Additional references

Aitchison and Silvey (1958),. Judge et

al.

( 1982),. Leamer (1983).

and

The linear regression model IlI - departures


from the assumptions underlying the
probability model
The purpose of this chapter is to consider various forms of departures from
the assumptions of the probability model;
D(J'f/Xt; 04 is normal,
E61 (i)
E(1', X, xt) j'xf: linear in xf,
(ii)
Vartyf/x,
x,) c z homoskedastic,
(iii)
=

g7l
0- (j1c2)are time-invariant.
ln each of the Sections 2-5 the above assumptions will be relaxed one at a
time, retaining the others, and the following interrelated questions will be
discussed :
what are the implications of the departures considered?
(a)
(b)
how do we detect such departures'?, and
(c)
how do we proceed if departures are detected?
It is important to note at the outset that the following discussion which
considers individual assumptions being relaxed separately limits the scope
of misspecification analysis because it is rather rate to encounter such
conditions in practice. More often than not various assumptions are invalid
simultaneously. This is considered in more detail in Section 1. Section 6
discusses the problem of structural change which constitutes a particularly
important form of departure from E7).

21.1

Misspification

testing and auxiliary regressions

Misspecication
testing refers to the testing of the assumptions underlying
statistical
model.
In its context the null hypothesis is uniquely defined as
a
the assumptionts) in question being valid. The alternative takes a particular
fonn of departure from the null which is invariably non-unique. This is

443

444

Departures

from assumptions

probability

model

because departures from a given assumption can take numerous forms with
the specified alternative being only one such form. Moreover, most
misspecification tests are based on the questionable presupposition that the
of the model are valid. This is because joint
other assumptions
misspecification testing is considerably more involved. For these reasons
the choice in a misspecification test is between rejecting and not rejecting
the null; accepting the alternative should be excluded at this stage.
An important implication for the question on how to proceed if the null is
rejected is that before any action is taken the restllts of the other
misspecification tests should also be considered. lt is often the case that a
particular form of departure from one assumption might also affect other
assumptions. For example when the assumption of sample independence
(8))is invalid the other misspecification
tests are influenced (see Chapterzz).
In general the way to proceed when any of the assumptions 1(6(1-1(8)
are
invalid is first to narrow down the source of the departures by relating them
back to the NIID assumption of LZ,, f (E Ir and then respecify the model
of the
taking into account the departure from NllD. The respecification
of the reduction from D(Z1 Za,
Zw,' #)
model involves a reconsideration
to D(y,/Xf; p) so as to account for the departures from the assumptions
involved. As argued in Chapters 19-20 this reduction coming in the form of:
,

D(Z j

Z 7. ; #)

Dt Zrs' #)
J'''.j
.....

tg'ID( yf 'X?

) D(Xr
'

,'

#a)

involves the independence and the identically distributed assumptions in


( 1). The noimality assumption plays an important role in defining the
parametrisation of interest 0- (#, c2) as well as the weak exogeneity
condition. Once the source of the detected departure is related to one or
takes the form of an
more of the NIID assumptions the respecification
form
of
reduction.
This
is
alternative
illustrated most vividly in Chapter 22
assumption
discussed.
lt
is invalid not
where
turns out that when (8(1
(8(1is
results
19
invalid
but
the
other
misspecification
only the
in Chapter
tests
are
inappropriate as well. For this reason it is advisable in practice
are
first and then proceed with the other assumptions if
to test assumption (8(1
(8(lis not rejected.-f'he sequence of misspecification
tests considered in what
follows is chosen only for expositional purposes.
Slargely'

With the above discussion in mind 1et us consider the question of general
procedures for the derivation of misspecification tests. In cases where the
alternative in a misspecification test is given a specific parametric form the
various procedures encountered in specification testing (F-type tests, Wald,

21.1

Misspecification

testing

445

and likelihood ratio) can be easily adapted to apply in

Lagrange mtlltiplier

the present context. ln addition to these procedures several specific


misspecification test procedures have been proposed in the literature (see
White ( 1982), Bierens ( 1982), intel. alia. Of particular interest in the present
variables'
argument which
book are the procedures based on the
lead to auxiliary regressions (see Ramsey ( 1969), ( 1974), Pagan and Hall
( 1983). Pagan ( 1984), inter /!i/). This particular procedure is given a
prominent rolkl in what follows because it is easy to implement in practice
of most
other
interpretation
and it provides a common-sense
misspecification tests.
variables' argument was criticised in Section 20.2 because it
The
-omitted

-omitted

statistical GM's.
was based on the comparison of two
This was because the information sets underlying the latter were different. lt
was argued, however, that the argument could be reformulated by
postulating the same sample information sets. ln particular if both
Zw; #) by using
parametrisations can be derived from D(Z1 Zc,
statistical
redtlction
GM's
then
the
arguments
alternative
two
can be made
comparable.
Sttchastic
Let ftzf l G 1 ) bo a kector
process defined on the probability
which
includes the stochastic variables of interest. In
#( ))
space (S,
'non-comparable'

..t

'

that for a given fz.r( .F

17 it was argued

Chapter

.y t

E(. .

'

,/

t/'.) +
,

uj

defines a general sta' tistical GM


.'k)),
#f = .E'( f /

I/t

'

).'!- E .J',

satisfying some desirable


orthogonality condition:
Espttlts

with

(2 1.4)

t)

properties

by construction

including the

=0,

lt is important to note, however, that (3)-(4)as defined above arejust empty


boxes'. These are filled when .tZ,, l e: T) is given a specific probabilistic
structure such as NllD. In the latter case (3)-44) take the specific forms:
$,r

#'xr +

/tt* j'x
=

(2 1.6)

tt r ,
LI*
f

rk'r lX,

x,

#'x

f,

(2 1.7)

information set being

with the conditioning


=

'

)
.

When any of the assumptions in NllD are invalid, however, the various
properties of pt and lIt no longer hold for pl and ul. ln particular the

446

Departures

from assumptions - probability model

orthogonality condition
Eplutt)

(5)is invalid.

The non-orthogonality

#0,

(21.9)

can be used to derive various misspecification tests. If we specify the


alternative in a parametric fonu which includes the null as a spial case (9)
could be used to derive misspecification tests based on certain auxiliary
regressions.
ln order to illustrate this procedure let us consider two important
parametric forms which can provide the basis of several misspecification
tests:
'iplli

g*(x,)-f=1
(b)

(21.0)
k

#(x,)=t1 +
j'

'-V

J,bixit+ )( )cijxitxjt
1

jp 1

+
l )wil lkdi-lxirx./fxl

i-

'

'

'

'

/yj

The polynomial g*(xt) is related to RESET type tests (see Ramsey


(1969)
and g(x,) is known as the Kolmogorov-Gabor polynomial (scelvakhnenko
(1984/. Both of these polynomials can be used to specify a general
parametric form for the alternative systematie component:

ltf

#'0Xf

''o%*

(21.12)

where z) represents known functions of the variables


gives rise to the alternative statistical GM
rr

bLxt+ i''vzl + cf

Zf

1,

Z1, Xf. This

(21. 13)

which includes (6) as a special case under

ffe: yll 0,
=

with

A direct comparison
-

li (
1

between

regression
#)'X,
+ SZ)
ut (#o
whoseoperational form
=

HL

),e # 0.

(13)

(21.14)
and

(6) gives rise to the auxiliary

(21. 15)

+ cf,

/)?x+ ylzlp + c,

(21.16)

can be used to test ( 14) directly. The most obvious test is the F-type test
discussed in Sections 19.5 and 20.3. The F-test will take the general form
FF(y)

( 1

RRSS - URSS F-k*

vgss

sy

(21.17)

447

21.2 Normality

where RRSS and URSS refer to the residuals sum of squares from (6)and
( 16) (or ( 13/, respectively; k* being the number of parameters in (13) and m
the number of restrictions.
This procedure could be easily extended to the highercentral moments of
) ?t /'Xt s

Elut/xt

(21.18)

r y 2.

x,),

For further discussion see Spanos

21.2

(1985b).

Normality

As argued above, the assumptions underlying the probability model are all
interrelated and they stem from the fact that Dt-pf,Xf; /) is assumed to be
multivariate normal. When D(),,, Xf,' ) is assumed to be some Other
multivariate distribution the regression function takes a more general fonn
(not necessarily linear),

ElhAt

X!)

11(*,X!),

and the skedasticity function is not necessarily free of xf,


Vartyf/'x,

x,)

=g(k,

x,).

(21.20)

Several examples of regression and skedasticity functions in the bivariate


case were considered in Chapter 7. In this section, however, we are going to
consider relaxing the assumption of normality only, keeping linearity and
homoskedasticity. In particular we will consider the consequences of

assuming
(J'f,/Xf

X)

'v

D(FX,,

(T2),

where D( ) is an unknown distribution, and discuss the problem of testing


whether D ) is in fact normal or not.
.

'

(1)

Consequences of

non-normality

Let us consider the effect of the non-normality assumption in (21)on the


specification, estimation and testing in the context of the linear regression
model discussed in Chapter 19.
As far as specification (see Section 19.2) is concerned only marginal
changes are needed. After removing assumption
g6(l(i)the other
assumptions can be reinterpreted in terms of Dtj/xf, c2). This suggests that
relaxing normality but retaining linearity and homoskedasticity might not
constitute a major break from the linear regression framework.

448

from assumptions

Departures

--

probability

model

The first casualty of (2 1) as far as estimation


(see Section 19.4) is
concerned is the method of maximum likelihood itself which cannot be used
unless the form of D ) is known. We could, however, use the least-squares
method of estimation brietly discussed in Section 13. 1, where the form of the
not needed.
underlying distribution is
is
alternative
method
of estimation which is historically
Least-squares
an
older
the
maximum
likelihood
much
than
or the method of moments. The
least-squares method estimates the unknown parameters 0 by minimising
the squares of the distance between the observable random variables yf,
r e: T, and ht(0) (a function of p purporting to approximate the mechanism
giving rise to the observed values p,), weighted by a precision factor 1 rcf
which is assumed known, i.e.
'

'apparently'

hto)
minV ),,,lk't
-

(2 1.22)

psf.l r

It is interesting to note that this method was tirst suggested by Gauss in 1794
as an alternative to maximising what we, nowadays, call the log-likelihood
function under the normality assumption (seeSection 13. 1for more details).
In an attempt to motivate the Ieast-squares method he argued that:

the most probable value of the desired parameters will be that in which the
sum of the squares of differences between the actually observed and
computed values multiplied by numbers that measure the degree of
precision.is a minimum
.

This clearly shows a direct relationship


between the normality assumption
method
estimation.
the
least-squares
of
lt can be argued, however, that
and
applied
least-squares
method
the
to estimation problems without
can be
normality.
In relation to such an argument Pearson (1920)
assuming

warned that:

methods are theoretically


we can only assert that the lcast-squares
obey the normal law,
accurate on the assumption that our observations
Hence in disregarding normal distributions
and claiming great
.
the
generality
by merely using the principle of least-squares
generalisation
gained
merely
been
the
has
at
apparent
expense of
theoretical validity
.

Despite this forceful argument let us consider the estimation of the linear
regression model without assuming normality, but retaining linearity and
homoskedasticity as in (21).
The least-squares method suggests minimising

6$)

w (), #,x,)2
---j.
'

=1

(21.23)

21.2 Normality

V9

or, equivalently:
F

f(#)
=

el

0# = -

Z (A't=

#'x,)2 (y-X#)'(y

2X

(y-X#)

(2 1.24)

-X#),

(2 1.25)

0.

that rankl)
Solving the system of normal equations (25)(assuming
get the ordinary least-squares (OLS) estimator of #
b (X'X) - 1X'y.

of c2 is

1 /(b)

k) we

(21.26)

The OLS estimator

T- k

T- k

(y-

Xb) , (y Xb).

(2 1.27)

Let us consider the properties of the OLS estimators b and


fact that the form of DFxt,c2) is not known.
Hnite sample properties

of b and

in view of the

.92

Although b is identical to p-(the MLE of #)the similarity does not extend to


the properties unless Dtyf/xf; 0) is normal.
Since b Ly, the OLS estimator is linear in y.
(a)
Using the properties of the expectation operator E ) we can deduce:
.E'(b) .E'tb+ Lu) j+ L.E'(u) #, i.e. b is an unbiased estimator of #.
(b)
czLL'
c2(X'X) - 1.
.E'(b
(c)
- #)(b- j)' F(Luu'L')
Given that we have the mean and variance of b but not its distribution,what
other properties can we deduce?
Clearly, we cannot say anything about sujh'ciency or full efflciency
without knowing D(.p,,/Xf;0) but hopefully we could discuss relative
efhciency within the class of estimators satisfying (a) and (b).The GaussMarkov theorem provides us with such a result.
=

'

Gausr-Markov

theorem

Under the assumption (21), b, the OLS estimator of #, has minimum variance
among the class of linear and unbiased estimators (fora proof see Judge et
aI. ( 1982/.
.1

is concerned, we can show that


As far as
.E'(.2) c2, i.e. l is an
estimator Of c2,
(d)
using only the properties of the expectation operator relative to DFxt,c2).
ln order to test any hypotheses or set up confidence intervals for
'unbiased

450

from assumptions - probability model

Departures

0=p, c2) we need the distribution of the OLS estimators b and J2. Thus,
unless we specify the form of Dxt, c2), no test or/and confidence interval
statistics can be derived. The question which naturally arises is to what
theory' can at least provide us with large sample results.
extent
'asymptotic

Asymptotic distribution of b and


Lemma 21

.1

(21),

Unft?r assumption

x/zb p) x(0,czo.y-

lim

.2

X'X

u-

-+

1)

(2 1.28)

(2 1.29)

Qx

is flniteand non-singular.
Lemma 21

.2

(21)we

Under

x''w(J2

ean deduce that


-

c2)

x(0,

-.

tfl-y
-

where puyrefers to the fourthcentral


to be
(seeSchmidt 1976)).
./znffr

1)c4),

lzlomcnr

of 1)(.J,/Xf;

(2 1.30)
04assumed

Note that in the case where Dlyf, Xj',0j is normal

P-43
4

N/

G'

7w(.2-c2)

Lemma 21
Under

x(0,2c4).

(2 1.3 1)

.3

(21)
P

b p
-+

and

(!/limw-.(X'X)=0)

(2 1.32)
(2 1.33)

a2
S

-+

2
.

From the above lemmas we can see that although the asymptotic
distribution of b coincides with the asymptotic distribution of the MLE this
is not the case with ?l. The asymptotic distribution of b does not depend on

Normality

21.2

4f1
l

1)(yt,/X,; 04but that of does via p4. The question which naturally arises is
to what extent the various results related to tests about 0- (#,c2) (see
Section 19.5) are at least asymptotically justifiable.Let us consider the Ftest for Ho R#= r against HL : R## r. From lemma 2 1.1 we can deduce that
1R') - 1) which implies that
under Hv : V'F(Rb - r) N(0,c2(RQx'v

(Rb

r)

ERIX/X)- 1 R'(l G

Ho

(Rb

r)

'w

Using this result in conjunction

zwty) (Rb

(21.35)

with lemma 21.3 we can deduce that

gR(X'X) - IR'II- 1
(Rb
r)' .
c

z2(n'l).

m.

r)

1
x.

z2(r#

under Hv, and thus the F-test is robust with respect to the non-normality
assumption (21) above. Although the asymptotic distribution of zwty) is chisquare, in practice the F-distribution provides a better approximation for a
small T (seeSection 19.5)).This is particularly true when D(#'xt, c2) has
heavy tails. The significance r-test being a special case of the F-test,
c=-,,--,.--2,

zty)

i;
sv E(xk-')j
,

N(0, 1) under Ho:

/,.-0

(2 1.37)

is also asymptotically justifiable and robust relative to the non-normality


assumption (21) above.
Because of lemma 2 1.2 intuition suggests that the testing results in
relation to c2 will not be robust relative to the non-normality assumption.
Given that the asymptotic distribution of depends on pzbor a4= p4/c4 the
kurtosis coefficient, any departures from normality (wherea4 3) will
seriously affect the results based on the normality assumption. In particular
the size a and power of these tests can be very different from the ones based
on the postulated value of a. This can seriously affect al1 tests which depend
l
such as some heteroskedasticity and structural
on the distribution of
change tests (seeSections 2 1.4-2 1.6 below). ln order to get non-normality
robust tests in such cases we need to modify them to take account of /.24.
.92

(2)

Testing

fovdepartures from normality

Tests for normality

can be divided into parametric and non-parametric


whether
depending
the alternative is given a parametric form or
tests
on
not.

452

Departures

(a)

Non-parametric

from assumptions - probability model


tests

The Kolmoqorotwsmirnov

test

tlf/xf,

Based on the assumption that


r e: T) is an IID process we can use the
results of Appendix 11.1 to construct test with rejection region

C1

(y:Q'T 1> ca)

(21.38)

where y refers to the Kolmogorov-smirnov


residuals. Typical values of cx are:
.01

.05

test statistic in terms of the

.01

a
(21.39)
ca 1.23 1.36 1.67 .
For a most illuminating discussion of this and similar tests see Durbin
(1973).
The Shapiro-Wilk

test

This test is based on the ratio of two different estimators of the variance c2.
2

1V=
r

where

li41)

Z
=

tilrrtl'r-r.f.

li(t))

1
'

'

/Z l
I
=

(21.40)

are the ordered residuals,

G lo

if T is even

=-

% lcj G
F

1)

T- 1

if T is odd,

and atv is a weight coefficient tabulated by Shapiro and Wilk


samplg sizes 2 < T 50. The rejection region takes the form:
C1
where
(b)

ty:1z:'< c.)

cu are

(1965)for
(21.41)

tabulated in the above paper.

Parametric

tests

The skewness-kurtosis

test

The most widely used parametric test for normality is the skewnesskurtosis. The parametric alternative in this test comes in the form of the
Pearson family of densities.
The Pearson family of distributions is based on the differential equation

d ln
dz

-/)

.(z)

(z

c +cjz-hczzz

(21.42)

21.2 Normality

453

where solution for different values of @,c(), cj


generates a large number
of interesting distributions such as the gamma, beta and Student's t. lt can
be shown that knowledge of c2, aa and a4 can be used to determine the
distribution of Z within the Pearson family. In particular:
,

cl

(a4+

(..2)

3)(a3/c-

(21.43)

(414 33)c2/J,
12as ca (2a4 3a3 6) J, J= (1014.
(see Kendall and Stuart (1969:.These parameters
co

(21.44)
18)

(21.45)

can be easily estimated


using (13 and :14 and then used to give us some idea about the nature of the
departure from non-normality. Such information will be of considerable
interest in tackling non-normality (see subsection (3/. ln the case of
normality cj c2 0 aa
3. Departures from normality within the
Pearson family of particular interest are the following cases:
(a)
cc
cl # 0. This gives rise to qamma-type distributions with the
chi-square an important member of this class of distributions. For
',

=0,a4

u't>

=0,

(b)

'v

g2(r1),

aa

23

(21.46)

cl 0, trtl > 0, (r2 > 0. An important member of this class of


(x4.
3+
distributions is the Student's !. For Z 1(rl), aa
%m 4), m > 4).
which are
< (r2. This gives rise to beta-type distributions
cl
directly related to the chi-square and F-distributions. In particular
i 1, 2, and Z, ,Zz are independent, then
if Zi z2(n1f),
=

=0,

'v

..::0

'.w

)
( ) (--11
jZ,)

z,

m--s

(2 1.47)

where Bmk/l, rnc/2) denotes the beta distribution with parameters


/11 2 and mzjl.
As argued above normality within the Pearson family is characterised by
aa

(/t3/c3)

and

=0

=(p4/c*)

a4

3.

(21.48)
'

lt is interesting to note that (48)also characterises normality


short' (lirstfour moments) Gram-charlier expansion:
-..aa(z3

gz) E1
=

3z) +

Wta,j.3)(z*-6z2
-

+ 3)q(I)(z)

within the

(21.49)

(see Section 10.6).


Bera and Jarque (1982)using the Pearson family as the parametric
alternative derived the following skewness-kurtosis
test as a Lagrange

454

from assumptions

Departures

model

probability

multiplier test:
T1(y)

F
=

6
where

ti=3+

,- g()
-

((i4 3)2
24

'w

s.--

--)((-i

tljl-i

z2(2)

(2 1.50)

,i/)'j

,z.--

,.-gt-,)
,z.-j

H'

,.'',,?)2j.

(2 1.5 1)

The rejection region is defined by


X

C1

zlty)
1)y:

dz2(2) a.

>cl,

Ca

A less formal derivation of the test can be based on the asymptotic


distributions of tia and 44:

v'''r 43

Ho

(21.54)

x(0,6)

x,7r(4.-3)
-

With

N,

24).

(2 1.55)

(13

and 44.being asymptotically independent (seeKendall and Stuart


(1969))we can add the squares of their standal-dised forms to derive (50).,
see
Section 6.3.
Let us consider the skewness-kurtosis
test for the money equation

mt 2.896 +0.690y, +0.865/5 -0.0552 + f,


( 1.034) (0.105) (0.020) (0.013) (0.039)
=

R2

=0.995
,

F= 80,

(ij=

42

=0.995,

((i4.- 3)2

0.005,

1og L

=0.0393,

=0.

(21.56)

147.4,

145.

Thus, zj(y)
and since ca 5.99 for a
we can deduce that under
the assumption that the other assumptions underlying the linear regression
and a4 3 is not rejected for
model are valid the null hypothesis HvL aa
=0.55

=0.5

=0

0.05.

There are several things to note about the above skewness-kurtosis


test.
Firstly, it is an asymptotic test and caution should be exercised when the
sample size T is small.' For higher-order approximations of the finite sample
distribution of &s and 4 see Pearson, D'Agostino and Bowman ( 1977),
Bowman and Shenton (1975), inter alia. Secondly, the test is sensitive to
outliers' (tunusually
large' deviations). This can be both a blessing and a

21.2

Normality

455

hindrance. The first reaction of a practitioner whose residuals fail this


normality test is to look for such outliers. When the apparent nonnormality can be explained by the presence of these outliers the problem
can be solved when the presence of the outliers can itself be explained.
Otherwise, alternative fonns of tackling non-normality
need to be
considered as discussed below. Thirdly, in the case where the standard error
of the regression is relatively large (becausevery little of the variation in yt
is actually explained), it can dominate the test statistic z1(y). lt will be
suggested in Chapter 23 that the acceptance of normality in the case of the
money equation above is largely due to this. Fourthly, rejection of
normality using the skewness-kurtosis test gives us no information as to the
nature of the departures from normality unless it is due to the presence of
'

Outliers.

A natural way to extend the skewness-kurtosis


test is to include
cumulants of order higher than four which are zero under normality (see
Appendix 6.1).

(3)

Tackling non-normality

When the normality assumption is invalid there are two possible


ways to
proceed. One is to postulate a more appropriate distribution for D(yl/Xf; 0)
and respecify the linear regression model accordingly. This option is rarely
considered, however, because most of the results in this context are
developed under the normality assumption. For this reason the second way
to proceed, based on nonnalising transformations, is by far the most
commonly used way to tackle non-normality. This approach amounts to
applying a transformation to y, or and Xl so as to induce normality.
Because
of the relationship
between normality,
linearity and
homoskedasticity these transformations commonly induce linearity and
homoskedasticity as well.
One of the most interesting family of transformations in this context is
the Box-cox (1964)transformation. For an arbitrary random variable Z

the Box-cox transformation takes the form


0 %% 1.

(2 1.57)

Of particular interest are the three cases:


(i)

j=

(ii)

=0.5,

1, Z*
Z*

Z(Z/

reciprocal',

(2 1.58)

- square root;

(21.59)

Departures

= 0,

from assumptions - probability model

Z*

logg Z - logarithmic

ote'. 1imZ*

log.

(2 1.60)

z).

J-0

The first two cases are not commonly used in econometric modelling
because of the diliculties involved in interpreting Z* in the context of an
model.
empirical econometric
Often, however. the square-root
transformation might be convenient as a homoskedasticity inducing
transformation. This is because certain economic time-series exhibit
variances which change with its trending mean (%),i.e. Var(Z,) pl,c2, tczz1,
T: In such cases the square-root transformation can be used as a
2,
variance-stabilising one (seeAppendix 2 1.1) since Var(Z)) r!, c2.
The logarithmic
tranAformation
is of considerable
interest in
econometric modelling for a variety of reasons. Firstly, for a random
variable Zf Whose distribution is closer to the 1og normal, gamma or chisquare (i.e.positively skewed), the distribution of loge Zt is approximately
nonual (seeJohnson and Kotz ( 1970:. The loge transformation induces
'near symmetry' to the original skewed distribution and allows Zt to take
negative values even though Z could not. For economic data which take
only positive values this can be a useful transformation to achieve near
normality. Secondly, the loge transformation can be used as a variancestabilising transformation in the case where the heteroskedasticity takes the
form
=

Varty /X t
t

xf)

c,2

(/t,/c2

!,

1, 2,

'fJ

(21.61)

xt) c2, t= 1, 2,
T Thirdly, the log
yt, Vartyl/xf
transformation can be used to define useful economic concepts such as
elasticities and growth rates. For example, in the case of the money
equation considered above the variables are a11in logarithmic form and the
estimated coefficients can be interpreted as lasticities (assumingthat the
estimated equation constitutes a well-defined statistical model; a doubtful
assumption). Moreover, the growth rate of Zt defined by Z* (zf -Zt- 1)/
Zt 1 can be approximated
by A loge Zt H loge Zt
Z,- j because
Alogezt logt 1+Zt*) ::x Zt*
In practice the Box-cox transformation can be used with unspecified
and let the data determine its value (seeZarembka (1974/.For the money
equation the original variables Mf, 'F;,Pt and It were used in the Box-cox
transformed equation:
For y)

xloge

-log

(Ms-

1)-/,1

+fzLY,-

1)+/,z(#2j-

1)+,.42-,

1)+u,

(21.6a)

457

21.3 tnearity
and allowed the data to determine the
chosen was J=0.530 and
JI

J2

J3

(0.223)

(0.119)

of

The estimated j

J4=

-0.000

=0.005,

=0.865,

0.252,

value

(0.0001)

value

07.
(0.00002)

transformation
is
this mean that the original logarithmic
inappropriate'?' The answer is, not necessarily. This is because the estimated
value of depends on the estimated equation being a well-delined statistical
GM (nomisspecification). ln the money equation example there is enough
evidence to suggest that various forms of misspecification are indeed
present (seealso Sections 21.3-7 and Chapter 22).
The alternative way to tackle non-linearity by postulating a more
appropriate form for the distriution of Zf remains largely unexplored.
Most of the results in this direction are limited to multivariate distributions
closely related to the normal such as the elliptical family of distributions (see
Section 21.3 below). On the question of robust estimation see Amemiya
(1985).

tDoes

21.3

Linearity

As argued above, the assumption


f(J't/Xf

xt)

(2 1.63)

#'Xt,

can be viewed as a consequence of the assumption that


Z, N(0, E), l 6E T (Zf is a normal lID sequence of r.v.'s). The form of (63)is
pr//x)
x)) can be nonnot as restrictive as it seems at first sight because .E'(
linear in x) but linear in xf llx1) wherg /(') is a well-behaved
transformation such as xt log xl and xf (x))2.Moreover, tenns such as

where

#=Xa-21J2j

'v

2
ca + c 1 r + ca l ..j.

(,tl (a,
.s.9:

costt')

.j. (ym (m

sintt')

)r+ )r),
-,,.

(2 1.64)

pumorting to model a time trend and seasonal effects respectively, can be


easily accommodated as part of the constant. This can be justified in the
context of the above analysis by extending Lt N(0, E), r c T, to Zt
Nlmf, X), t 6 T, being an independent sequence of random vectors where the
65
mean is a function of time and the covariance matrix is the same for all t T.
The sequence of random vectors tZ,, t (E T) in this case constitutes a nonstationary sequence (seeSection 21.5 below). The non-linearities of interest
in this section are the ones which cannot be accommodated into a linear
conditional mean after transformation.
'w

,v

458

from assumptions - probability model

Departures

lt is important to note that postulating (63),without assuming normality


of Dty'f, Xr; #), we limit the class of symmetric distributions in which
Dyt, X:,' #) could belong to that of elliptical distributions, denoted by
Eljp, E) (seeKelker ( 1970/. These distributions provide an extension of the
multivariate normal distribution which preserve its bell-like shape and
symmetry. Assuming that

s,4(0(,)
s/lacal)
(x-p.,)
)
(gc)

(2 1.65)

implies that

.E'(y,,'X! xf)
-

,'1

ztz-lxt

(2 1.66)

and
Vart)'f/xf

X2)

(/(Xt.)(t7'1

5'!

1 -

2X2-21J21).

(21.67)

This shows that the assumption of linearity is not as sensitive to some


departures from normality as the homoskedasticity assumption. Indeed,
homoskedasticity of the conditional variance characterises the normal
distribution within the class of elliptical distributions (seeChmielewski
( 198 1:.

(1)

Implications

of non-linearity

Let us consider the implications of non-linearity for the results of Chapter


19 related to the estimation, testing and prediction in the context of the
linear regression model. ln particular,
are the implications of
assuming that D(A;#) is not normal and
'what

'()'f/'Xt

Xf)

/l(N),

(21.68)

where (xf) p'xt'?'


ln Chapter 19 the statistical GM for the linear regression model was
defined to be
:#;

)'f

#'xf+

thinking that /t)


Epituttf/Ak x,)
however, is
=

'tyr/xf

=x,)

With

-p1

=j'xf

and ul
x,) c2. The
=yt

Elutlt

0 and

yr /)(xt)+
=

(21.69)

ut,

ttrue'

E (!t,/Xt=xf)
statistical GM,
=0,

(21.70)

cf,

and ct
where pt
'(yl,/Xl= xf)
f()4/X? xf). Comparing (69)
and (70) we can see that the error term in the former is no longer
/I(x2)
white noise but ut
+8, Hg(xf)
+ st. Moreover,
=/?(xf)

=.pr

.--p'xt

=y',

.-p'xt

21.3 t-inearity
Eutxt

x.,)

JJ(utlt/x t

ln

view of

459

# 0 and

Eptutq

=:(xf),

2
xf ) gtxf ) + c/.
=

these properties of

deduce that for

ut we can

g(x,..))',
e - (f?(x1),g(x2),
#) #+ (X X) X e' ,,
.

(2 1.72)

and

J)s2)

(7.2

+ e'

Mx

w- k

e+ c

2
,

Mx

l .-XIX X)
,

(2 1.73)

,
,

because y Xpv + e + E not y Xj+ u. Moreover, / and


are also
inconsistent estimators of # and c2 unless the approximation
error e
satisfies (1/T)X'e
0 and (1/T)e'Mxe
0 as T-+ CX; respectively. That is
2
non-linear and the non-linearity decreases with 'C#
unless /1(x,) is not
and s2 are inconsistent estimators of j and c2.
As we can see, the consequences of non-linearity are quite serious as far as
t h e proper ties of j and sl are concerned being biased and inconsistent
estimators of j and c2, in general. What is more, the testing and prediction
results derived in Chapter 19 are generally invalid in the case of nonlinearity. In view of this the question arises as to what is it we are estimating
by s and # in (70)7
Given that plt (Jl(xt)- p'xt)+ st we can think of p-as an estimator of #*
where #* is the parameter which minimises the mean square error of ut, i.e.
.s2

--+

-+

'too'

'h

#*

min c2(#)

where c2(j)

Elutlj.

(21.74)

Ecztjlj/)j= ( -2)A'g(/?(xf) - j'xylxg'q (assumingthat we


operator).
Hence, #*
differentite
inside the expectation
can
1
This is because

::::c0

Y2-21/2,, say. Moreover, sl can be viewed as the


xtx/t)natural estimator of c2(#*). That is j and
are the natural estimators
*'xt
unknown
function /l(xr)
the
of a least-squares approximation
to
respectively.
What is more,
and the least-squares approximation-error
'(/1(xt)x;)

,$2

we can show

(2)

that

Testing

a.S.

#-+#* and

.s2

a,S.
-+

c2(#*)

(see White ( 1980)).

for non-linearity

ln view of the serious implications of non-linearity for the results of Chapter


19 it is important to be able to test for departures from the linearity
assumption. ln particular we need to construct tests for
f10: '(yr/Xl x2) j'xf
=

460

from assumptions - probability model

Departures

against

J.f1 : El)'tjxt

xf)

=/l(x2).

(21.76)

This, however, raises the question of postulating a particular functional


form for txf)which is not available unless we are prepared to assume a
Alternatively,
particular form for D(Zf;#).
use the
we could
and systematic
parametrisation related to the Kolmogorov-Gabor
component polynomials introduced in Section 21.2.
Using, say, a third-order Kolmogorov-Gabor polynomial (.&'G(3)) we
can postulate the alternative statistical GM:

)'f
where

#Xt

Szf

kzt includes the

+ 713/32+

second-order

injzzz2, 3,
and

(21.77)

:t

terms

k,

(21.78)

#at the third-order terms


1, inj, 1--2, 3,

xitx-itxtt,>./>

(2 1.79)

k.

Note that x1, is assumed to be the constant.


Assuming that F is large enough to enable us to estimate
linearity in the form of:
HoL

y2

0 and

HL :

yz + 0

or

(77)we can test

ya # 0

using the usual F-type test (seeSection 2 1.1). An asymptotically


test can be based on the R2 of the auxiliary regression:
t

= (#0-

#)'X

+ T'2z, +

'kst

equivalent

(21.80)

Ef

using the Lagrange multiplier test statistic


RRSS URSS
RRSS

ffe

LM(y)

FR2

q being the number of


Cj

ty

restrictions

LM(y)

>

ca )

;l(q)

(21.8 1)

(see Engle ( 1984)). Its rejection region is


dz 24g)

a.

Cu

For small T the F-type test is preferable in practice because of the degrees of
freedom adjustment; see Section 19.5.
Using the polynomial in pt we can postulate the alternative GM of the
form:

yt
where pt

p'sxt+

czptl + cspt3 +

xt. A direct comparison

+ cmp? +

p?

between

(21.82)

(75)and (82)gives rise to

21.3

Linearity

461

RESET type test (seeRamsey (1974)) for linearity based on Ho : cw= c3


l'n. Again this can be tested using the F-type
cm 0, Sl : ci #0, i 2,
'
LM
test or the
test both based on th'e auxiliary regression :
=

'

e'

b. #) xt +

Z cipt +
zx

's

14,

pt

x,.

(2 1.83)

Let us apply these tests to the money equation estimated in Section 19.4.
The F-test based on (77)with terms up to third order (but excluding )
yielded :
because of collinearity with
.)

477
0.1 17 520
().:4j4,7-/
-0.045

FF(F)

(-j-167

11.72.

Given that ca 2.02 the null hypothesis of linearity is strongly rejected.


Similarly, the RESET type test based on (82)with m=4 (excluding;J
because of collinearity with Jf) yielded:
=

FT(y)

(:-1-

0.1 17520 -0.060 28 74


().:6: as

35. l3.

Again, with cz 3. 12 linearity is strongly rjected.


lt is important to note that although the RESET type test is based on a
more restrictive form of the alternative (compare(77)with (82))it might be
the only test available in the case where the degrees of freedom are at a
premium (see Chapter 23).
=

(3)

Tackling non-linearity

As argued in Section 21.1 the results of the various misspecification tests


should be considered simultaneously because the assumptions are closely
interrelated. For example in the case of the estimated money equation it is
highly likely that the linearity assumption was rejected because the
independent sample assumption g8qis invalid. ln cases, however, where the
source of the departure is indeed the normality assumption (leadingto nonlinearity) we need to consider the question of how to proceed by relaxing the
normality of )Z! 1 l 6 T). One way to proceed from this is to postulate a
general distlibution Dtyf, Xf; #) and derive the specific form of the
conditional expectation
,

ClA,'''Xf

Xr)

(xf).

(21.84)

Choosing the form of Dtl?f, Xt: #) will determine both the form of the
conditional expectation as well as the conditional variance (seeChapter 7).
An alternative way to proceed is to use some normalising transformation

462

from assumptions - probability model

Departures

on the original variables y, and X? so as to ensure that the transformed


variables y) and X) are indeed jointly normal and hence
E)'1%1

=x))=#*'x

(21.85)

and
Vart-vtl/rx)

x))

c2.

(21.86)

The transformations considered in Section 2 1.2 in relation to normality are


also directly related to the problem of non-linearity. The Box-cox
transformation can be used with different values of for each random
variable involved to linearise highly non-linear functional forms. ln such a
case the transformed nv.'s take the general form
X tt>

zkri
it

12

(2 1.87)

f
(see Box and Tidwell

(1962:.

In practice non-linear regression models are used in conjunction with the


inter alia).
normality of the conditional distribution (seeJudge et aI. (1985),
The question which naturally arises is,
reconcile
the noncan we
linearity of the conditional expectation and the normality of D(y?/Xf; p)?' As
x,) is a direct
mentioned in Section 19.2, the linearity of pt Es Eytj'xt
consequence of the normality of thejoint distribution Dyt, X,; #). One way
the non-linearity of '()!f/Xr xf) and the normality of D(y,/Xt; 0j can be
reconciled is to argue that the conditional distribution is normal in the
transformed variables X) /)(X), i.e. Dlttf/Xl x)) linear in xtl but non'how

linear in xf, i.e.

E-vbt/xt

xr)

gls, T).

(2 1.88)

Moreover, the parameters of interest are not the linear regression


H(y,
c2+). It must be emphasised that
parameters 0-(p, c2) but 4
nonlinearity in the present context refers to both non-linearity in parameters (y)
and variables (X,).
Non-linear regression models based on the statistical GM:
.Vf

#(N, $ +

(21.89)

l/f

can be estimated by least-squares

based on the minimisation

of

(2 1.90)

463

21.4 Homoskedasticity

This minimisation will give rise to certain non-linear normal equations


which can be solved numerically (seeHarvey ( 198 1), Judge et al. (1985),
Malinvaud ( 1970), inter alia) to provide least-squares estimators for
c2
),: m x 1. * can then be estimated by

sl

1 Z (-p,-/x,, f))2.
w- k

(21.91)

Statistical analysis of these parameters of interest is based on asymptotic


theory (seeAmemiya ( 1983) for an excellent discussion of some of these
results).

21.4

Homoskedasticity

that Vartyf,/Xl xt) c2 is free of x, is a consequence of the


assumption that Dtyt, Xr; #) is multivariate normal. As argued above, the
assumption of homoskedasticity is inextricably related to the assumption of
normality and we cannot retain one and reject the other uncritically.
x2)
above, homoskedasticity of Vartyt/xf
lndeed, as mentioned
class.
elliptical
For
within
normal
distribution
the
characterises the
argument's sake, let us assume that the probability model is in fact based on
D(#'xf, c,2) where D( ) is some unknown distribution and oj1 Jl(xt).
The assumption

(1)

of keteroskedasicity

Implications

A far as the estimators


s

/ and

.:2

are concerned

p i.e. p-is an unbiased


Covt
(x'x)- '(X'ox)(X'X)-

(i)

Ep

estimator

f)

diagtcf,

c2a,

of

(21.92)

c;/)

c2A,

lf limw-+.((1,'JIX'DX) is bounded and non-singular


P

/-

-+

p i e. b-is a
,

estimator of

consistent

#.

where

we can show that

then

#.

(X'X)-- 1X'y retains some desirable


results suggest that
and
unbiasedness
consistency, although it might be
properties such as
with
the so-called generalised Ieastinefficient. p- is usually compared
of
#, #,derived by minimising
sylt?rt?s (GLS) estimator
These

/(l)=(y-X#)'n-

'(y

-X,),

(21.93)

464

Departures from assumptiolks - probability model

P(#)
P# =0

#=(x f) x)
,

lx/o

- ly

-((),)(--'-,)')-'
)):(---'-,)('t).

Given that

Covt#l =(X'f1 - 1X) -

and

(2 1.94)

(21.95)

Cov(/) y Cov(#)

(21.96)

(see Dhrymes (1978:, p- is said to be relatively inefficient. It must be


emphasised, however, that this efficiency comparison is based on the
presupposition that A is known a priori and thus the above efficiency
comparison is largely irrelevant. lt should surprise nobody to
that supplementing the statistical model with additional information we
can get a more efficient estimator. Moreover, when A is known there is no
need for GLS because we can transform the original variables in order to
return to a homoskedastic conditional variance of the form
kdiscover'

Vartyle/''xle

x))

c2,

t= 1,

(21.97)

This can be achieved by transforming y and X into

y* Hy
=

and

X*

HX

In tenns of the transformed

where

H'H =A -

1.

(21.98)

the statistical GM takes the form

variables

y* X*#+ u*

(21.99)

and the linear regression assumptions are valid for )'1 and X). Indeed, it can
be verified that

#- (x+'x*)=

1x*'

Hence, the GLS estimator


known a priori.

is

(x'A - 1X) - 1x,A - 1 y


rather

unnecessary

(J 1.104))

in the case where A is

The question which naturally arises at this stage is, twhat


happens when
fl is unknown'?' The conventional wisdom has been that since fl involves T
unknown incidental parameters and increases with the sample size it is
clearly out of the question to estimate T+k parameters from T
observations. Moreover, although p- (X'X)- 1X'y is both unbiased and
of Covt#-)
estimator
consistent s2(X'X)- 1 is an inconsistent
(X'X) - IX'DXIX'X) - 1 and the difference
=

c2(X'X) - 1 -(X'X)- 1(X'f1X)(X'X)- 1

(21.101)

21.4 Homoskedasticity

465

can be positive or negative. Hence, no inference on #, based on p-is possible


since for a consistent estimator of Cov(/) we need to know f (or estimate it
the
consistently). So, the only way to proceed is to model V so as to
incidental parameters problem.
Although there is an element of truth in the above viewpoint White (1980)
ointed
out that for consistent inference based on j we do not need to
P
estimate fl by itself but (X'fX), and the two problems are not equivalent.
The natural estimator c/ is l (yt-J'xf)2,
which is clearly unsatisfactory
because it is based on only one observation and no further information
accrues by increasing the sample size. On the other hand, there is a perfectly
acceptable estimator for (X'f1X) coming in the form of
5

tsolve'

Wr

F,1

li/xrxt',

(21.102)

for which information accrues as T-+ x. White (1980) showed that under
certain regularity restrictions

j
and

--+

(21 103)

(X'f1X).

(21.104)

it .S.

Ww
-+

The most important implication of this is that consistent inference, such as


the F-test, is asymptotically justifiable, although the loss in efficiency
should be kept in mind. In particular a test for heteroskedasticity could be
based on the difference
(X'X)- 1X'DX'(X'X) -

c2(X'X) - 1

(21.105)

Before we consider this test it is important to summarise the argument so


far.
Under the assumption that the probability model is based on the
c(.) is
distribution Dxt, c/), although no estimator of fl diagtczj,
possible, #= (X X) X y is both unbiased and consistent (undercertain
conditions) and a consistent estimator of Cov(/) is available in the form of
WT. This enables us to use j for hypothesis testing related to #. The
c/ will be taken
argument of
up after weconsider the question of
from
testing for departures
homoskedasticity.
=

'modelling'

(2)

Testing departures

from homoskedasticity

White ( 1980), after proposing a consistent estimator for X'DX, went on to

466

Departures

from assumptions - probability model

use the difference (equivalentto ( 105::


(X'f1X) -c2(X'X)

(2 1.106)

to construct a test for departures from homoskedasticity.


expressed in the folnn

(106)can be

('(l?r2) czlxyxj',

(21.107)

;=1

and a test for heteroskedasticity could be based on the statistic

1 jr (2
t
wt 1

dzjxtxt',

(2 1.108)

the natural estimator of (107).Given that (108) is symmetric we can express


the jklk 1) different elements in the form
-

-'.,

where

-45/

(21.109)

2,,
/p,rl', klt xitxjt,
, (/1,,
k l 12
i /' 1 2
.

Note the similarity

between

gklk

n'l

1)
.

#f above and the second-order term of the


(11). Using (109),White (1980)went on to

Kolmogorov-Gabor polynomial
suggest the test statistic

zy)

where
F

jl1

tl

zl'f

bw-1

1
--

)
-

tl

,llt

(2 1.110)

I,.--

ll
l

-42)2(,

-#w) kt -#w)',

1
#v-y f Z1 #,.

(2 1. 111)

Under the assumptions of homoskedasticity


can be based on the rejeciion region

cl

Jt

y: zy)

> ca),

zy)

,vz2(m)

and a size a test

(2 1.112)

Because of the difficulty in deriving the test statistic (109)White went on to


suggest an asymptotically equivalent test based on the R2 of the auxiliary

Homoskedasticity

21.4

regression equation
l

acj + a j

/ 1t +

Under the assumption


TR2

.v

ac/

zt +

(2 1.113)

+ am/mt.

of homoskedasticity,

z2(n1),

(21.114)

and TR2 could replace zy) in (112) to define an asymptotically equivalent


test. lt is important to note that the constant in the original regression
should not be involved in defining the /ftS but the auxiliary regression
should have a constant added.
Example
For the money equation estimated above the estimated auxiliary equation
of the form

co + y'#! +

lll

yielded R2 0.190, FF(y) 2.8 and FR2 15.2. ln view of the fact that
the null hypothesis of
./6.73) 2.73 and z2(6) 12.6 for a
homoskedasticity is rejected by both tests.
The most important feature of the above White heteroskedasticity test
that
is
no particular form of heteroskedasticpy is postulated.
subsection
ln
(3)below, however, it is demonstrated that the White test is an
exact test in the case where D(Z,; #) is assumed to be multivariate t. ln this
case the conditional mean is pt #'xt but the variance takes the form:
=

=0.05

kapparently'

c/

c2

(21.115)

f
+x'Qxf.

variables'
xf) + lll We Can
Using the
argument for ul Eluijxt
This suggests
derive the above auxiliary regression (seeSpanos (1985b)).
that although the test is likely to have positive power for various forms of
heteroskedasticity it will have highest power for alternatives in the
multivariate t direction. That is, multivariate distributions for D(Zr; #)
which are symmetric but have heavier tails than the normal.
ln practice it is advisable to use the White test in conjunction with other
tests based on particular fonns of heteroskedasticity. ln particular, tests
which allow first and higher-order terms to enter the auxiliary regression,
such as the Breusch-pagan test (see(128)below).
lmportant examples of heteroskedasticity considered in the econometric
literature (seeJudge et al. (1985),Harvey (1981)) are:
tomitted

Cr2

c2'x);

(21.116)

>'

468

Departures

lii)

Gl

(iii)

c2l

from assumptions

probability

model

c2('x*)2.
t

exp('x*)!
.

!'

x2) and x) is an m x 1 vector which incltldes known


where V Vartyf/xf
of
and
transformations
its first element is the constant term. It must be
xr
econometric
noted that in the
literature these forms of heteroskedasticity
are expressed in tenus of w, which might include observations from
weakly exogenous variables not included in the statistical GM. This form of
heteroskedasticity is excluded in the present context because, as argued in
Chapter 17, the specification of a statistical model is based on all the
observable random variables comprising the sample information. lt seems
very arbitrary to exclude a subset of such variables from the definition of the
Elytlxt
xr) and include them only in the
systematic component
conditional variance. ln such a case it seems logical to respecify the
systematic component as well in order to take this information into
conditioning
consideration. Inappropriate
in delining the systematic
component can lead to heteroskedastic errors if the ignored information
affects the conditional variance. A very important example of this case is
when the sampling model assumption of independence is inappropriate, a
non-random sample is the appropriate assumption. ln this case the
systematic component should be defined in such a way so as to take the
variables
temporal dependence among the random
involved into
consideration (seeChapter 22 for an extensive discussion). lf, however, the
systematic component is defined as /tt NJJ(y!,7X,=xf) then this will lead to
autocorrelated and heteroskedastic residuals because important temporal
information was left out from yt. A similar problem arises in the case where
stochastic processes (see Chapter 8) with
y, and Xt are non-stationary
distinct time trends. These problems raise the same issues as in the case of
non-linearity being detected by heteroskedasticity misspecification tests
discussed in the previous section.
Let us consider constructing misspecification tests for the particular
forms of heteroskedasticity (il---tiii).
It can be easily verified that (i)-(iii)are
special cases of the general form
=

tother'

oj2

/;(a'x)),

(2 1.119)

for which we will consider a Lagrange multiplier misspecification test.


Breusch and Pagan ( 1979) argued that the homoskedasticity assumption is
equivalent to the hypothesis
Hv1 a2

aa

'

ap,=0,

given that the first element of xl is the constant and

(aj)

c2.

The 1og

21.4 Homoskedasticity
likelihood function

logL,

see discussion above) is

(retainingnormality,
x)

,'

469

const

1.

jl log c,2
t 1

-j

-.j.

'-'

j c;t 1

2(.p

#'xf)2,(21.120)

where V hlx'xt). Under Ho, c/ c2 and the Lagrange multiplier


statistic based on the score takes the general form
=

uM-L.t-j

oeIogz-tyl,

log,,(),I(p)-1

test

(2 1.121)

where #refersto the constrained MLE of 0- (j, ). Given that only a subset
of the parameters 0 is constrained the above form reduces to

p log .L(0, #) (122-lc1l1l21)/k

'

LM

log fa(0, &)

nx

(21.122)

(see Chapter 16). In the above case the score and the information matrix
evaluated under Hv take the forms
(2 1. 123)
(2 1. 124)

andlc1

42
LM test statistic is
where

=(),

LM

=(1

j7- 1

F)

tl

)(

)(

xlw,

=j

Hence, the

H (;

)(

xlw,

klm

'v

''.:t

42) 1j. Given that


wherew, g(li,2
=

xlxl'

R1

is the MLE of c2 unders.

- 1),

(21.125)

R2 in the linear regression model is

y'X(X'X)- 1X'y - T#2

(21.126)

',y- wy.z

(see Chapter 19), the LM test statistic expressed in the form


/

1--bv4*

'

xlxl'

xl'w,

'

xlwf
'

(21.127)

w'?2

is asymptotically

equivalent

to TR2 from the auxiliary regression


(2 1.128)

Departures

from assumptions

probability

model

Ho

that is, T'R2

'v

z2(m
-

1) (see Breusch and Pagan

(1979),Harvey (1981)1.

lf we apply this test to the estimated money equation with x,l > (x,, c?.
#af) (See (78) and (79)) xlr, x excluded because of collinearity) the
auxiliary regression
:2
--2
k'

yielded R2

/ 1X

+ 7'2 zt + T'3

3t

L't

FF(y)
1) 19.675 and
Given that TR2
F(11, 68) 1.94, the null hypothesis of homoskedasticity is relected by both
test statistics.
=0.250,

=20,z2(1

=2.055.

(3)

Tackling heteroskedasticity

When the assumption of homoskedasticity is rejected using some


misspecification test the question which arises is,
do we proceed'?' The
first thing we should do when residual heteroskedasticity is detected is to
diagnose the likeliest source giving rise to it and respecify the statistical
model in view of the diagnosis.
In the case where heteroskedasticity is accompanied by non-normality
or and non-linearity the obvious way to proceed is to seek an appropriate
nonnalising, variance-stabilising
transformation. The inverse and loge
above
transformations discussed
can be used in such a case after the form of
heteroscedasticity has been diagnosed. This is similarto the GLS procedure
where A is known and the initial variables transformed to
khow

y* Hy,
=

X*

for

HX

H'H

A - 1.

ln the case of the estimated money equation considered in Section 2 1.3


above the normality assumption was not rejected but the linearity and
homoskedasticity assumptions were both rejected. ln view of the time paths
of the observed data involved (seeFig. 17.1) and the residuals (seeFig. 19.3)
it seems that the Iikeliest source of non-linearity and heteroskedasticity
which led to dynamic
might be the inappropriate conditioning
non-linearity,
misspecification (see Chapter 22). This
heteroskedasticity can be tackled by respecifying the statistical model.
An alternative to the normalising, variance-stabilising transformation is
postulate
to
a non-normal distribution forDtyr, Xt; 0j and proceed to derive
f'tyt/'Xf x?) and Vart)'t/xf xf) which hopefully provide a more
appropriate statistical for the actual DGP being modelled. The results in
this direction, however, are very limitedjpossibly because of the complexity
of the approach. ln order to illustrate these difficulties 1et us consider the
kapparent'

21.4

Homoskedasticity

case where D(yf,Xf

471

t with n degrees of freedom, denoted by

is multivariate

*,0)

y,

Xf zw&

cj

o.1 c

az3

Y22

(21.13 1)

lt turns out that the conditional mean is identical to the case of normality
(largelybecause of the similarity of the shape with the normal) but the
conditionalvariance is heteroskedastic, i.e.
ft.Ft/''Xf Xf) J1 2Y2-11X/
(21.132)
and
=

(see Zellner ( 197 1)). As we can see, the conditional mean is identical to the
one under normality but the conditional variance is heteroskedastic. ln
particular the conditional variance is a quadratic function of the observed
values of Xl. ln cases where linearity is a valid assumption and some form of
heteroskedasticity is present the multivariate l-assumption
seems an
obvious choice. Moreover, testing for heteroskedasticity based on
H0:

0-2
f

(7.2

tzzzz1, 2,

T; Q being a k x k matrix, will lead


against H3 : c,2 (x;Qxf)+ c2, t 1, 2,
White
identical
the
directly to a test
test.
to
The main problem associated with a multivariate
r-based linear
regression model is that in view of (133)the weak exogeneity assumption of
X, with respect to 0n (p,tr2) no longer holds. This is because the parameters
/1 and /2 in the decomposition
=

D(.)Xr;

#)

D(),/X,;

#1) D(Xt;
'

(21.134)

a)

are no longer variation free (seeChapter 19 and Engle et al. ( 1983) for more
details) because /1 = (Jj cE2-21 cl 1 J1 2Ec-21 o.a1, Ea-a1)and /2 EEE(Eac) and
the constant in the conditional variance depends on the dimensionality of
Xl. This shows that /1 and /2 are no longer variation free.
The linear regression model based on a multivariate sdistribution but
with homoskedastic conditional variance of the form
-

Vart?&/xl

xt)

1.,()c
(J,'0

2)

was discussed by Zellner (197$.He showed that in this case


indeed the MLE'S of J and c2 as in the case of normality.

/ and

:2 are

from assumptions

Departures

- probability

21.5

Parameter

time invariance

(1)

Parameter

time dependence

An important

)'f

assumption

#'xt+

underlying

the linear

model

regression

statistical GM

(21.136)

ut,

is that the parameters of interest 0 HE (#,c2) are time nlllnknr, where #>
1
tj.latc-alcaj
The time invariance of these parameters
1222
- J 21 and c2 cj j
is a consequence of the identically distributed component of the assumption
=

At

'v

N(0, Y),

i.e.

tzr,r e:T)

is Nl1D.

(21.137)

This assumption,

however, seems rather unrealistic for most economic


time-series data. An obvious generalisation is to retain the independence
assumption but relax the identicallq' distributed restriction. That is, assume
that f(Zj,t (E T) is an independent stochastic process (seeChapter 8). This
introduces some time-heterogeneity in the process by allowing its
parameters to be different at each point in time, i.e.
Z,

N(m(!), E(r)),

(21.138)

where IZr, t e: T) represents a vector stochastic process.


A cursory

look at Fig. 17.1 representing

the time path

of several

confinus that the


economic time series for the period 1963-1982n
assumption (137) is rather unrealistic. The time paths exhibit very distinct
time trends which could conceivably be modelled by linear or exponential
type trends such as:

(i)

mt

mt eXP )
=

(iii)

n2t

(21.139)

ao + al 1,'
ll

tl + a1

+ aj 11',.

(1 -

e-'

(21.140)

(21.14 1)

1),

The extension to a general stochastic process where time dependence is also


allowed will be considered in Chapters 22 and 23. For the purposes of this
chapter independence will be assumed throughout.
ln the specification of the linear regression model we argued that (137)is
equivalent to Zf N(m, E) because we could always define Zf in mean
deviation or add a constant term to the statistical GM; the ccnstant is
defined by jj n1l 0.: 2Yc1m2 (seeChapter 19). ln the case where (138)is
valid, however, using mean deviation is not possible because the mean
varies with 1. Assuming that
'w

(xA'f,)
-

;)2at(r/)))
(.)'j'(t,')t
((m'''
1at('/))

(21.142)

time inuriance

21.5 Parameter

& vt
-

Xt

.'

Vart

where

T'l

x t)

,/x

#'x*
t
l

xt )

c,2

1', l-kt,,/(1.,,),/1,
',.zltJ'),
#(l)f Ycct/)=

r&zcl
=

1(r)

take the form

mean and variance

we can deduce that the conditional

,.1

/.',11(r)

x)

2(l)Y22(l) -

,1

2(r)Ec.a(r)-

',...,1t.?'4-

t1, x;)'
'cc1(r).

convenience the star


Several comments are in order. Firstly, for notational
written
will
conditional
the
and
x)
be dropped
as #'fxf.Secondly,
mean
in
stochastic
non-stationary
independent
142)
under
defines a
the sequence Zf
(
restrictions
on the time
process (see Chapter 8). Without further
(E
'T-jt
of
interest
p, (#;, cfzl cannot be
the parameters
heterogeneity of kZf, t
sample
size
ILThis gives us a fair
with
the
estimated because they increase
warning that testing for departures from parameter time invariance will not
be easy. Thirdly, ( 142) is only a sufficient eondition for ( 143) and ( 144), it is
not necessary. We could conceive of parametrisations of ( 142) whieh could
lead to time invariant j and c2, Fourthlys it is important to distinguish
between time invariance and homoskedasticity of Vart r,,,'Xf xr), at least at
the theoretical level. Homoskedasticity
as a property of the conditional
variance refers to the state where it is free of the conditioning variables (see
Chapter 7). ln the context of the linear regression model homoskedasticity
of Vart-l'r//'xf xf) Stems from the normality of Zr. On the other hand, time
invariance refers to the time-homogeneity of Vaq '!,'X, xr) and follows
from the assumption that f(zt, l 65 7-) is an identically distributed process. ln
principle, heteroskedasticity and time dependence need to be distinguished
because they arise by relaxing different assumptions relating to the
stochastic process ftzt, t e: T). In practice, however, it will not be easy to
test for
discriminate between the two on the basis of a misspecification
both
and
heteroskedasticity
dependence
be
time
Moreover,
can
either.
Section
multivariate
r-distribution
where
is
in
(see
(142) a
present as the case
2 1.4 above). Finally, the form of Jlf above suggests that, in the case of
economic time series exemplifying a very distinct trend, even if the variance
is constant over time, the coefficient of the constant term will be time
dependent, in general. ln cases where the non-stationarity is homogeneous
by
(restricted to a local trend, see Section 8.4) and can be
time-invaliant.
differeneing,its main effect will be on jj, leaving j(:),
This might exptain why in regressions with time series data the coefficient of
the constant seems highly volatile althougt the other coefficients appear to
be relatively constant.
=

Geliminated'

Klargely'

474

Departures

(2)

Testing

from assumptiolts - probability model

for parameter

time dependence

Assuming that (Zr, t e: Ty is a non-stationary, independent normal process,


and detining the systematic and non-systematic
components by
pt

and

E()'t ,.'Xj x)

the implied

#;Xf

yt - S()'f/'Xf

(21.145)

XJ),

GM takes the form

statistical

yt

tIt

(21.146)

&t,

with pf
V)being the statistical parameters of interest. lf we compare
(146) with (136)we can see that the null hypothesis for parameter time
invariance for a sample of size T is
Klpt,

Hv :

#j #2
=

against
Sl

'

'

#w #

ct2 # c2

pt# #

c2j c22

and

for any r

1, 2,

cw2
=

(7.2

Given that the number of parameters to be estimated is Tk + 1) + F and we


only have T observations it is obvious that 03
pware not estimable. It is
and
ahead
ignore
instructive, however, to
this
to attempt estimation of
go
these parameters by maximum likelihood.
Differentiation of the 1og likelihood function:
,

logfa=const

jg log c,2

1
-

-1

2(

cf-

'

-#r'xf)2

pr

(2 1.147)

,-,

yields the following first-order conditions:


t?log L
tn(bt =

(y, #,x,)xf
,

()7lo g L

0,

2 ..-Y
(, G't

ac:r

2c4t

u?

()

(2 1.148)

ctl

These equations cannot be solved for pt and


because ranktxg)
ranktxr, xt') 1, which suggests that xtx; cannot be inverted; no MLE of 0t
exists. Knowing the source' of the problem, however, might give us ideas on
how we might solve' it. In this case intuition suggests that a possible
(Xk0'xk0),where Xk0EEEE(x1,x2,
x:).
invertible form of xfx; is
1 x,x;)
That is, use the observations r= 1,2,
k, in order to get ranktxko) k and
invert it to estimate pkvia
=

(Z,k-

(x0'X0)
k
#

txk'yk

#,

lyk,

In turn the residuals

k + 1,

T; the corresponding

pts

(21.150)

.T:

(y't-#;xf), t k + 1,
=

(2 1.149)

1, k + 2,

(X, X, ) Xt yt

(xk0)
-

H
yk0
(y1, yk). Moreover, for 1,:::ck +
couldconceivably be estimated by
.

'Ccan be derived which,

21.5 Parameter time inuriance


however, cannot be used to estimate cf2 because the estimator implied by
the above first-order conditions is
(2 1.151)
This is clearly unsatisfactory
given that we only have one observation for
each ct1 and the fs are not even independent. An alternative form of
residuals which are at least independent are the recusri'e residuals (see
Section 19.7). These constitute the one-step-ahead prediction errors

fh
=

(A'f-

#; 1x,)
-

+ x;(#,

l/,

#, 1),
-

k + 1,

and they can be used to update the recursive estimators of


observation becomes available using the relationship

/? k- 1 +(X,=

1X!-

1)

x! d,
t

r= k + 1,

(21.152)

'f)
.

each new

#fsas

(21.153)

7-:

(see exercises 5 and 6) where


+ x'(X0'
(21154)
t/
t -- 1 )! (1 t t -- 1 x0
As we can see from (153), the new information at time r comes in the form of
f'f and #f is estimated by updating p-f-1
p-t - 1 in (152)yields

lx

==

'

'

substituting
f'!

lg,

+ x'

pt

(X,0-'j x,0-j)

- 1

f- 1
i

jg1

xfx;

pi

x;(x,0-' 1 x,0-j)

t- 1

'

xylj
1

(2 1.155)
c22,
(see exercise 7). Hence, under So, Et)
0, El)
This implies that the standardised recursive residuals
=

wl

6
.-

Jf

k+ 1

tzzzk + 1,

(21 156)

suggest themselves as the natural candidates on which a test for Ho might


wrl',
be devised. lndeed, for w BE (wk 1, wk c,
+

Hv

'v

N(0, c2Iw-k)

(21.157)

N(J, C),

(2 1.158)

Hl

where

,w.

J EB (k +
'j
=
t df

!, s
c Ectsq,
1
x/f #t (Xt0-'I Xf0- ) ' F x ix'iqi
i 1
1

k+

a,

w),

f-

k + 1,

T)

476

from assumptions .- probability model

Departures
(7.2

--j
t!?, -(It
...............=

0'

+ x'(X
t
t

Xt -

- j

)-'

i
)-)x x'c2
i

1.
.

-.-

0
X
(X0'
r- l t - 1j-

t 1' +
=

X l'(X,0-'1 X,0- j.

t/ (/

)-

<?

t s

xt

T2. (2 1 160 )
.

' -'

)-')x

'ic-,?
,.x

-:.

(2 1.161)
7- (see exercise 8).
for t < y, l k + 1lf we separate Hv into
=

Hlvq)
'

for a11 t

pt #,

..

H$1t.
0

(7.2 =
t

(7.2

for al1 r

12

1 2.

'J)
'/-i

we can see that


H)$

w
but

N((), c),

(2 1.162)

HL2$

.N(J, c2Iy.-j),

'v

(2 1.163)

This shows that coefficient time dependcnce only affects the mean of w and
variance time dependence affects ils coq'ariance. The implication from these
results is that we could construct a test for //t)1) given that f:lttlz'holds against
St/ '.' pt+ p for any t 1, 2,
T based on the chi-square distribution. ln
view of ( 163) we can deduce that
=

k.l:)

(j

w,

I'

'

r=k

f=k+1

'

cl

(,

(21 164)

:1

This result implies that testing for Ho '. given Hlvlbis valid, is equivalent to
testing for F(wr) 0 against F(wt) # 0. Before we can use ( 164) as the basis of
a test statistic we need to estimate c2. A natural estimator for c2 is
=

2
.sw

where

1
. -..----.. 1F k
f

j''n (u,
k

,.

1/(F- /$:))J
E)
jJ--'k

-#.

fi

(y) ('F- .) -.Sw,


=

(2 1 165)
.

This enables us to construct


Tl

).gj2

=u

u.',. kNrt?t,that

the test statistic

? + 0 when

HL') is

not

valid.

Ho
x

t (w
-

.
-

1)
.

(2 1.166)

21

time invariance

Parameter

.5

#t'a

(71

7' t?a
z 1 (y)1

.:y :

).
-

1- a

d rl T

(.

(21 167)

&- 1)

and ( 167) is
(see Harvey ( l98 1)). Under Hv the above test based on ( 166) HLI'
does not
UMP unbiased (see Lehmann ( 1959)). On the other hand, when
(7.2
E'(sw2,)
and this can reduce the power of the test significantly (see
>
1101(1
Dufour ( 1982)).
1
conditional on HLlb
being valid was suggested
Another test related to HL2
by Brown, Durbin and Evans ( 1975). The CL.rsL'AzJ-test is based on the test
statistic
l

1$.f
;

J4(

..

k + 1,

(21.168)

71,

(1/(T-/)q
j7- j ttl. They showed that under Hv the distribution
by N(0, l - /) ( W'; being an approximate
of W( can be approximated
Brownian motion). This 1ed to the rejection region
.s2

where

(I-T

.L,
y : !Hz;(>

(.z

'F- k)'l+ 2t/(l

/(

()v =

I%'
)(T' - /..)-' 5,

with a depending on the size x of the test. For tx 0.0 1, 0.05, 0. 10, a 1.143,
0.948, 0.850, respectively. The underlying intuition of this test is that if Hv is
invalid there will be some systematic changes in the j,s which will give rise
to a disproportionate number of wfs having the same sign. Hopefully, these
T
will be detected via their cumulative effects 1V;,t Ik'+ 1,
Brown vt aI. ( 1975) suggested a second test based on the test statistic
=

(2 1.170)

has a beta distribution


Under Hv : l !'. k now'n as the Cl-zrit-zrA/sT-statistic,
ly
v-sT
k
i
t
i
h
t
)
t
r
t
e
(
(
)
s
w pa ra n-le e
,

/fo

lz,--

(21 17 1)

B6--#T' r), -ty(1- /)).


-

between the beta and F-distributions


ln view of the relationship
Johnson and Kotz ( 1970)) we can deduce that

j,zr

1-5
f-- k )
.
1
'.J( -- r) -- 11
;
(t

'

H0
x

F ((F - ;) (t - li))
,

(see

(2 1.172)

478

from assumptions - probability model

Departures

This enables us to use the F-test rejection region whose tabulated values are
more commonly available.
Looking at the three test statistics (166),(168)and (170)we can see that
one way to construct a test for Hvlt is to compare the prediction error
with the average over the previous periods, i.e. use the test
squared (wr2)

statistics

wr2

z2(y0)
t

t- 1

(1- k) i

)-

k + 1,

-k

(173)is that the denominator

The intuition underlying


natural estimator of hl-

can be viewed as the


with the new information at

which is compared

time t. Note that


Wl2

Hv

G2

'v

z (1)

(2 1.173)

TJ

wf2

t- 1

Hv

w/
y ij 1

and

(21.174)

z (1-k),

and the two random variables are independent. These imply that under Hv,
z2(y/) F(1,

zty)')

t(t - k), l k + 1,
TJ (21.175)
lt must be noted that z(y/) provides a test statistic for #t pt j assuming
- For an
that c,2 cJ.j; see Section 2 1.6 for some additional comments.
overall test of Hv we should use a multiple comparison procedure based on
the Bonferroni and related inequalities (see Savin ( 1984:.
One important point related to a11 the above tests based on the
standardised recursive residuals is that the implicit null hypothesis is not
quite Hv but a closely related hypothesis. lf we return to equation (156)we
can see that
x

-k)

or

'v

'(u?r) 0
=

which is

if x'tlqt

not the same as

bt )
-

(2 1.176)

0,

pt.-pt- 1) 0.
=

In practice, the above tests should be used in conjunction with the time
of the recursive estimators / ft i 1 2
aths
k and the standardised
/7
TJ lf we ignore the first few values of
recursive residuals wf, t k + 1,
these series their time paths can give us a lot of information relating to the
time invariance of the parameters of interest.
ln the case of the estimated money equation discussed above, the time
paths of p,t, #2r,jat, #4 are shown in Fig. 21.1(J)--(J)for the period t 20,
80. As we can see, the graphs suggest the presence of some time
.
dependence in the estimated coefficients re-enforcing the variance time
dependence detected in Section 21.4 using the heteroskedasticity tests.
,

21.5

(3)

479

time inuriance

Parameter

Tackling parameter

time dependence

When parameter time invariance is rejected by some misspecification test


do we proceed'?' The answer to
the question which naturally arises is,
likely
the
this question depends crucially on
source of time dependence. If
behaviour
of
the
the agents behind the actual
time dependence is due to
model
DGP we should try to
this behaviour in such a way so as to take this
additional information. Random coejjl'cient models (seePagan (1980/ or
state space models (see Anderson and Moore ( 1979)) can be used in such
cases. lf, however, time dependence is due to inappropriate conditioning or
stochastic process then the way to proceed is to
Z, is a non-stationary
respecify the systematic component or transform the original time series in
order to induce stationarity.
ln the case of the estimated money equation considered above it is highly
likely that the coefficient time dependency exemplified is due to the nonstationarity of the data involved as their time paths (see Fig. 17.1)
indicate. One way to tackle the problem in this case is to transform the
stochastic processes involved so as to induce some stationarity. For
example, if Mf, l > 1) Shows an exponential-like time path, transforming it
to .ftA1n(M #)f, r y 1) can reduce it to a near stationary process. The time
path of A lntM #lf ln(M/P)f ln(M #)r- I as shown in Fig. 2 1.2, suggests
that this transformation has induced near stationarity to the original time
series. lt is interesting to note that if A ln(M,/#)r is stationary then
ihow

a-,-(Ms),-,n(wM),-cIn('wM),.,

+In(,ws),.-

(2 1.177)

is also stationary (seeFig. 21.3),*any linear combination of a stationary


process is stationary. Caution, however, should be exercised in using
for
stationarity inducing transformation, because ollerd4ferencing,
example, can increase the variance of the process unnecessarily (seeTintner
( 1940)). ln the present example this is indeed the case given that the variance
of A2 1n(A.1P4t is more than twice the variance of A ln(M/#)f,
Var A ln

-P

0.000 574,

M
Var A z ln P

=0.001

354.

(2 1.178)

ln econometric modelling differencing to achieve near stationarity


should not be used at the expense of theoretical interpretation. lt is often
using appropriate explanatory
the non-stationarity
possible to
variables.
Note that it is the stationarity of (yj/xf,
t G T) which is important, not
that of (Z,, l (5 T).
'model'

480

Departurcs

from assumptiow

- probability

model

(b)
Fig. 2 1.1(fz). The time path of the recursive estimate of
p3t- the coefcient
time path of the recursive estimate of pzt the

of the constant. () The


coefcient of
',,

In view of our initial discussion related to the possible inappropriateness


of the sampling model assumption of independence (see Chapter 19) we can
argue that the above detected parameter time dependence might also be due
to invalid conditioning (see Chapters 22-23 for a detailed discussion). ln

21.6

Parameter

structural

481

change

0.100
0.075
0.050
0.025
it
0
-0.025
-0.050
-0.075
-0.1

1968

1970

1972

1974

1976

1978

1980

1982

1976

1978

1980

1982

Time

(c)
4
2
0
-2

Jl!
-4
-6
-8
1
-. O1968

1970

1972

1974

Time

(d)

Fig. 21. l(c). The time path of the recursive estimate of pt - the coeftkient
of the recursive estimate of #4, the coefficient of (.

of p,. (J) The time path

this case the way to tackle time dependence is to respecify pt in order to take
account of the additional information present.
21.6

Parameter

structural

change

structural change is interpreted as a special case of time


dependence, considered in Section 2 1.5, where some a priori information

Parameter

482

Departures

from assumptions - probability model

0.075
0.050
0.025

kc
S
<

-0.025
-0.050
-0.075

1964

1967

1970

1973

1976

1979

1982

Time

Fig. 2 1.2. The time path of

ln (M/#)l.

0.10

0.05

f
*
g

-0.05

-0.10

1964

1967

1970

1973

1976

1979

1982

T ime
Fig. 2 1.3. The time path of A2 ln (M/P)f.

related to the point of change is available. For example, in the case of the
money equation estimated in Chapter 19 and discussed in the previous
sections we know that some change in monetary policy has occurred in
1971 which might have induced a structural change.
The statistical GM for the case where only one structural change has
occurred at t T'1(T1 > k), takes the general form
=

21.6
y
..

?l

structural

Parameter

p'jxr +

uj t

pjxt+

lkct

change

483
(2 1.179)

(2 1.180)

12
Tk) T 2
T1 + 1
T) .)w it h p1 EEEE(#1 tr 2j) an d
pcEEE(#2,c2a) being the underlying parameters of interest respectively. For
T) ,Tk + 1,
'Jl the
the case where the sample period is 1::::c1,2,
of
sample
form
the
takes
the
distribution

whe re T 1

.ft

y1 X 1 N
'2/X 2 x
whereTz =

X 1 #1

21
c I wj

X 2 #2

0
a
c-21

(21.181)

wa

T- T'l The hypothesis of interest in this case is


.

S0:

and

/1 /2
=

which can be separated into


.J'1C1''
0

c2
Sf2)'
0
1

#1 pzt
=

'

'

c2.2,

IS

S(1)
0

fa

.J1(21)
0
'

The alternative f-1l is specified as in (181).These hypotheses are very similar


to the ones we considered in the previous section and it should come as no
surprise to learn that constructing optimal tests for the present case raises
similar problems. Moreover, the test statistics enjoy more than just a
passing resemblance with some of the statistics derived in Section 2 1.5. The
main advantage in the present case is that the point of structural change T1
is assumed known a priori. This, however, raises the issuetof whether F2 > k
(p2is estimable) or Tz < k (p2is not estimable). Let us consider the two cases

separately.

Case 1 (L

>

k)

Chow ( 1960) proposed the F-type


RSSV

T2(')

.-RSS'I

(seeChapter

,-RSSZ

F-2k

RSSL

19) test statistic

RSSZ

'

(21.182)

where RSSV, RSSL and RSSZ refer to the residual sums of squares for the
Tk) and subperiod 2( 1:::=Fl + 1,
whole sample, subperiod 1 (1 1, 2,
F) respectively.
The test statistic (182)can be used to construct a UMP
That islxitcan
invariant test (seeLehmann (1959))
for HL'S against .J/1 fo HLl$.
used
against
for
be
Sj with ct c2a.
to construct an $op timal' test
pb j2
The test statistic is distributed as follows:
=

r2(y) F(), F-2k)

under

'v

'rcty) F(/(, T-2/;


'v

under

HL3',
Sl
HLl$,
f'''h

(2 1.183)
(21.184)

484

Departures

from assumptions - probability model

where

E(X'kX1)-1 +(XrcXc)- 1(I-1

-(#1

-#c)'

.z

HL''can be used

The distribution of zaty) under


rejection region is
C1

ly:T2(y)

(#1-#2).

to define a size a test whose

Gl,

>

(21.185)

(21 186)
.

This test raises the same issues as the time invariance tests considered
in Section 21.6 where c21 o'2a had to be explicitly tor implicitly)
assumed in constructing a coefficient time invariant test. There is no
HLI'
is valid when Hvl might not be. A test
reason to believe, however, that
for Hvlb against S1 can be based on the test statistic comparing the two
estimated variances (seeChapter 19):
=

.s1
1'3(b p' EEE
1

RSSZ

T,

T3(b

F(T2

'v

T2 k

RSSL

- k, T1 k)

under

zaty) F(Tc k, T1 k; )
'v

(2 1.87)

HLlI,

under Hj

(2 1.188)
(2 1.189)

(c2a/c2j)
is the non-centrality parameter which ortunately' does
where
not depend on #1or #2.This turns out to be critical because the test for Hlvt)
against Sl defined by the rejection region,
=

Cl

ty:z:'(y)>

ca),

(2 1.190)

is independent of the test defined by zaty) (seePhillips and Mccabe (1983/.


This implies that a test for Hv against Sj can be implemented by testing for
HLltt'irstusing zz(y) and, if accepted, testing for HL''using zaly) in that
Seqklence.

Let us apply the above tests to the money equation considered in the
previous sections. As mentioned above, a structural change is suspected to
have occurred in 1971 because of important changes in monetary policy.
Estimation of the money equation for the period 1963f-1971f: yielded
-0.793

mt

+ 1.050yf +0.305pt

(2.055) (0.208)
R2

1og L

=0.968,

90.08,

RSSL

t,

(0.103) (0.014) (0.015)

42 0.964,
=

-0.007t
,$1

=0.0155,

0.006 72,

(21.191)

21.6

structural

Parameter

of the same equation

Estimation

mt

6.397 + 1.641y,
(1.875) (0.191)

R2

#2

=0.994,

log1.=95.37,
Testing for

zaty)
-

yielded

-0.0761

+0.784/7,

(21.192)

f,

(0.022) (0.014) (0.035)


sz
=0.052

=0.0346,

84,

F=48.

H3 the test statistic zaty) is

(u)

(0.05284) 28
,7a,)

(:.(06

485

for the period 1972-1982/:

=0.993,

RSSZ

HLltagainst

change

(2 1.193)

5.004.

HLI'
For a size x
is rejected. At this stage there is not
c,= 1.81 and
given that HLI'
against Sl r'h HL1t
much point in proceeding with testing HL't
has been rejected. For illustration purposes, however, 1et us consider the
test regardless.
The residual sum of squares for the whole period is RSS 0.1 17 52 (see
against S1 f''h HLI/ the test statistic is
Chapter 19). Hence, testing for HL3'
=0.05,

'2tS-

52)
72)-(0.052
-m.(06
(0.117
72)+(0.052 84)

(0.006

:4)

72
1-Y117.516. (21.194)
HLZ,

is also
rejected.
strongly
It is important to note that in the case of the test for Hv the size is not a
i.e. for the above example the overall size is 0.0975. This is
but 1
becauseHo was tested as two independent hypotheses in a multiple
hypothesistesting context (seeSavin (1984/.
Given that ca=2.5 for a size

a=0.05

test we can deduce that

-a)2,

-(1

Case 2 (T2 < k)


ln this case 0z is not estimable since ranktxqxa) % Ta < k. This is very similar
to the time invariance test where F2 1. If we were to express the test
statistic
=

W22

z(y,0)
-

t k
-

t
i

(2 1.195)

jg

w/

.k

(see (173/ in terms of the residual sum of squares with


test statistic emerges:

l G T2

the following

486

Departures from assumptions - probability model

w/
This is based on the equalities RSSt RSSt- 1 + w/ and
RSSt- 1
(see exercise 9). The test statistic CH, known as Cllt?w test (seeChow
( 1960:, as in the case of z(y,0) in Section 2 1.5, can be used to construct a
UMP invariant test not of Ho against J.f1but of Ht against Hj fo HLI' where
S): Xa(#1 #2)0 0. This is not surprising given that pz is not estimable
and thus we need to rely on (y2-Xa/ 1 ) uc - X2(/ 1 - #2)(adirect analogue
of Section 2 1.5) in order to construct a test for
to the recursive residuals
statistic
T
he
Cs-test
is
distributed as follows:
HL).

)r--2

czf-Ftw
CH

2:

'F1

IILIb
under 11%
0 ro

-k)

F( F2, T1 - k; J)

'w

where

under H 1

r''h

HLlt,

(21.198)
(2 1.199)

Given that parameter time dependence can be viewed as an extension of


structural change at every observation t ?L+ 1,
'Cit is not surprising
that the Chow test statistic is directly related to the CUSUMSQstatistic
discussed in Section 2 1.5 above (seeMcAleer and Fisher (1982:.
This test can also be used for Hv against Hj ra HLII but caution should be
exercised because although the size is correct the test can be inconsistent
when S1 is valid but HL3'
is not (seeRea (1978), Mizon and Richard (198$).
Moreover, for S1 against H3 the above test does not have the correct size
against alternatives when ct # cJ, given that the distribution under f.f1
when HLIS is invalid is not F(Ta, F1 - k). Similarly for testing Hv against Sj
the Chow test has the correct size but 1ow power for alternatives when Ht
and HLI, are valid but HL'is not, given that the implicit alternative is in fact
H1 rn HLl3.
These comments can also be made for the test based on z(yf0) in
Section 2 1.5.
As in case 1 (T2 < k), we need to test HLI3 before we can safely apply the
Chow test. Given that yl cannot be estimated, intuition suggests using the
prediction sum of squares
=

F
f

Z
rj

'-'/

'-'

(.:2, l1x2!)

=(y2

-Xa,1)

''h

(y2-X2,1)

(21.200)

Appendix 21.1

487

instead. This gives rise to the test statistic


T2b

(y2 - X 2 / 1 )'(y 2 - X 2 / 1 )
s'f

W heru

st

RSS 1
T1 - k

(a1.a()j)

lt is not very difficult to see, however, that the numerator and denominator
of this statistic are not independent and hence the ratio is not F-distributed.
Asymptotically,
T4(y)

however,
z2(T2)

'v

s(

c2 under

--+

HLI)

and thus

HL2t.

under

Using this we can construct an


based on the rejection region

(21.202)
size a test for

aspnptotic

HLlagainst

1-1l

'

ly:z.tyl

t7l
=

> c'al,

dz2(T'2).

(21.203)

Ca

Let us apply the above tests to the money equation discussed above.
Estimation of this equation for the period 1963-1980
yielded:
mt 2.685 + 0.7 13yr + 0.852,/ 0.0521,+ f,
(1.055) (0.107) (0.022) (0.014) (0.039)
=

(21.204)

R2

Iog L

#2

=0.994,

138.52,

=0.994,

RSSL

.s1

=0.109

0.0392,

23,

Testing for HLlt


against
6.484, ca= 11.07, a 0.05. Testing for Ht
: z4(y)
: CH
1.078, ca 2.34, a 0.05. These results imply that
against HLOHLI'
both hypotheses cannot be rejected. This is not very surprising given that
the post-sample period used for the tests was rather small (Tc 5). In cases
where Tc islarger we expect the two hypotheses to be rejected on the basis of
the results using CH. In the present case this was not attempted because
when Ta > k, CH is no longer the best' test to use; T/y) is the optimal test.
SI

Appendix 21.1

uriance

stabilising transformations

Consider the case where Vartyf/xf


xt) o'lgpt) and q ) is a known
function. Our aim is to find a transformation h ) such that
c2. Let us assume that we can take the first-order Taylor
x,)
Varty)/x,
of
expansion
(y,) at p,:
=

'

'

(.pf) hpt)
':>:

+ (A',

-/t,)/l'(/t,),

/?'(pr)being the first derivative of /l(yf)evaluated at yt

=pt.

Then, we can use

488

from assumptio

Departures

- probability

model

1972
Time

1975

15

12

9
@
6

1963

1966

1969

1978

1982

-2.0

-2.5

-3.0

-3.5

-4.0

1963

1969

1966

1978

1975

1972

1982

Time

(b)
Fig. 2 1,4(J).

Time graph of lt.

(#)Time

in order to approximate

this approximation
Varltytl/xf

xf) cxvargll/lt/tfl

graph of ln 1t.

the

(yt-/tt)'(p,))/X,

= vargx/x, xJ('(/t,))2.
=

This implies that when we choose hl

'(p,)

==

then Vartyl/x,

El(/t,)1
=

x,) :>: c2.

'

) such that

of

variance
=

x;

(.pf)by

Appendix 21.1

489

The variance stabilising transformation can be used in a general case


where the variance of a random variable depends on some unwanted
parameters #r. ln the case where Vartyl Xt x,) yto.l the transformation
(y,) loge yl is called for because
=

'

11(/.tt )

pt.

ln Fig. 2 1.4(:/) the time path of 7 days' interest rate is shown which exhibits a
variance increasing with the Ievel Jtt. Its log. transformation is shown in Fig.
2 1.4()).
Important

ctlactzpt.k

Auxiliary regression misspecification test procedure, Kolmogorov-Gabor


elliptical
theorem, OLS estimators,
P olynomial, Gauss-Markov
distributions, non-linear conditional expectations, normalising transformations, GLS estimators, variance stabilising transformations, time
dependent parameters, recursive estimator, structural change.

Questions
Explain
linearity
between normality,
the relationship
homoskedasticity.
variables'
the
Explain the intuition underlying
misspecification test.
State the finite sample properties of the OLS estimator
Komitted

and

type

b (X'X) - 1X'y,
=

based on the assumptions that (yf Xt xf) D(#'x,, c2) and y is an


T;where the form of
independent sample from D(y,/Xf,' #),l 1,2,
1)( ) is not known.
Explain the statement: b'T'heOLS estimator b (X'X)- 'X'y of # has
minimum variance among the class of linear and unbiased
estimators.'
theorem shows that b is a fully efficient
t'l-he Gauss-Markov
estimator.' Discuss.
for the linear
t'T'he normality assumption is largely unnecessary
regression model because without it we can use the OLS estimators b
and of # and c2, respectively. Moreover, all the hypothesis testing
(72
derived using the MLE*S j and c2 are
results about j and
asymptotically vald- anyway.' Discuss.
=

'

.1

490

10.

from assumptions - probability model

Departures

Explain the difference in the asymptotic distributions of sl and J2, the


MLE and OLS estimators of c2.
Explain the intuition underlying the skewness-kurtosis
test for
departures from normality.
How do we proceed when the normality assumption is rejected in the
linear regression model?
Discuss the implications of non-linearity in the context of the linear

regression model.
How can we test for non-linearity'?

'When linearity is rejected by some misspecification test we should


adopt a non-linear specification h(0, xf) instead of p'xt and retaining
the assumption of normality for D( pf/''X,;#)) proceed to derive MLE'S
for p and c2.' Discuss.
Discuss the implications of heteroskedasticity in the context of the
linear regression model.
A'T'hecomparison between F and / is largely irrelevant since the
derivation of #-is based on the assumption that fl is known up to a
scalar multiple.' Discuss.
Explain the intuition underlying the derivation of the GLS estimator

p-of #.

Discuss the following matrix inequalities:


(X'f1 - 1X)-- ' -(X'X) - 1X/f1X(X'X) - 1 Gll
c2(X'x) - 1 -(X'x)
#'

lx'flxtx/xj

- l

y().

'*h

(.Although

#= (X X)

X y is both unbiased and consistent estimator


of #, no consistent inference about # is possible since no consistent
estimator of fl is possible unless c/ is modelled.' Discuss.
18. Explain the intuition underlying the White heteroskedasticity test.
19. How is the White heteroskedasticity related to non-linearity as well?
20. t'T'he way to proceed in the case where homoskedasticity is rejected by
some misspecification test is to model cf2 by relating it to xf or some
other exogenous variable z, and retaining the linearity and normality
(7.2
of #,
and the unknown
assumptions proceed to derive MLE'S
c,2.'
postulated
model
of
in
the
Discuss.
parameters
Discuss the linear regression model based on the assumption that
D(yf,Xr; ) is multivariate t. Explain why Xr cannot be weakly
exogenous for 0 > (#,c2) in this model.
Explain how we can reconcile the heteroskedasticity of Vartyt/x, =xJ
with the normality of D k,f,/X?;
$.
What do we mean by time invariance of 0u p, c2)'?
of
Explain how non-stationarity
?.

Appendix 21.1
.J?r

Zf HE

Xv

can lead to time dependence parameters of interest.


Explain the following form of the recursive estimator of
.,.

26.

.w

#r bt 1 +
-

(),

() .j

(X, X, )

xrt.'4

n.f

xt),
t- 1

k + 1,

#,:
.

IL

How can we test for parameter time-invariance in the context of the


linear regression model?
Explain the intuition underlying
the CUSUM test.
b'T'heimplicit null in testing for coefficient time invariance using the
recursive residuals is not Hlb : pt jf 1 but S! : xtpt
jf 1) 0.'
Discuss.
How do we tackle the problem of time dependence of p?
How do we test for coefficient structural stability (constancy)in the
case where T2 > k? How is this different from the case Tz < k'?
:In the case where Fc < k the implicit null for coefficient constancy is
not Hv': (#I - ja) 0 but Ht X2(#j ja) 0.' Discuss.
=

Exercises

and
when r, N(0, c2) and compare them with the same
quantitites when nt D(0,c,2) and the form of D4.) is
unknown.
Show that b=(X'X)- 1X/y has minimum variance among the class of
unbiased estimators of the form
'v

b* (L + Cly,
=

(X'X) - 1X'

is a k x T arbitrary matrix.
Show that under the assumption

Ep +

and

(y,/Xt

xl) 'vD(/7(x,),c2),

JJ(y2)# c2,

Show that under the assumption


Cov(#) (X X)
=

Derive the GLS estimator

(),'t'X, xr) D(#'xf,


=

X DXIX X)
of

cf2)

assumption of exercise 3.
c2A is essentially the same as

p under the

Show that knowing f) or A where f)


far as estimation of j is concerned.

492

Departures

from assumptions

probability

Using the formula B- 1 A 1


c 1/4.1 + #'A - #, show that
=

'

model

C?A- 1a#'A -' 1 for B

--

A h-ap'

where

'x

0) -

(x t

1 :=

'

(x t -

?-

)-

(X0' X0

x'('X0' X0

--G-1-.C---.-t)- x..1.-.4-.L- 1 t - 1 j v, 1 (x, - 1) -k x,- -1-h x, ((X,

Show that
-

#?

#, l
-

0
(X0' j X
...q- j )
, -

+ -

1+

-;-

x/xf

'*1

#y.syx,)
j'
'
L -''
1)
1Xr- x,
x-q.(y..j,

1
8. Show that (X,0-') x,0..
j )jj ) xyx;
9. Verify the expressions for Vartw'r) and Covtwrws) of Section 2 1.5.
10. Show that wf2 RSS: RSSt 1 t 7: /f + 1 Where RSS:
.-

Zl
-

(.f

#-'Xf) 2
t

Additional references
Bickel and Doksum ( 198 l ); Gourieroux
Ramsey ( 1974),. White and MacDonald

t?r

aI. ( 1984),. Hinkley ( 1975),. Lau


( 1974).

( 1980),. Zarembka

( 1985),.

22

CHAPTER

The linear regression model IV - departures


from the sampling model assumption

One of the most crucial asstlmptions underlying the linear regression model
)'w)' constitutes an
is the sampling model assumption that y EEE( )'a.
seqtlentially
T)
X,;
1, 2,
from
draw'n
salnple
D(
t
t?),
independent
the
assumption
enables
function
This
likelihood
to
respectively.
us to define
be
'l

.,

-IP; y)

c(y)

l-lD(),/X,;

p).

(22. 1)

f=1

lntuitively. this assumption amounts to postulating that the ordering of the


observations in yl and Xt plays no role in the statistical analysis of the
model. That is, in the case where the data on )'2 and Xt are punched
observation by observation a reshuffling of the cards will change noneof the
results in Chapter 19. This is a very restrictive assumption for most
economic time-series data where some temporal dependence between
successive values seems apparent. As argued in Chapter 17, for most
model
sampling
economic time series the non-random
seems more
appropriate.
ln Section 22.1 we consider the implications of a non-independent
sample for the statistical results derived in Chapter 19. lt is argued that
these implications depend crucially on how non-independence is modelled
and two alternative modelling strategies are discussed. These strategies give
and autocovrelation
rise to two alternative approaches, the respecillcation
testing and ways to tackle the
tppprotzc/lt?sto the misspecification
dependence in the sample. ln the context of the autocorrelation approach
the dependence is interpreted as due to error temporal correlation. On the
other hand, in the context of the misspecification approach the error term's
493

Departures

from the sampling model assumption

rolt?as the non-systematic


component of the statistical GM is retained and
samples
dependence
is modelled from first principles in terms of
the
in the
random
variables
observable
involved. Sections 22.2 and 22.3 consider
the
proceed
with
sample and the
various ways to
a non-random
sample
independent
assumptions
misspecification testing for the
22.4
of
misspecification
analysis in
In
respectively.
Section
the discussion
perspective.
Chapters 20-22 is put into

22.1

Implications

of a non-random

(1)

Defining the concept

t#*

sample

a non-random

sample

lt is no exaggeration to say that the sampling model assumption of


independence is by far the most crucial assumption underlying the linear
regression model. As shown below, when this assumption is invalid no
estimation, testing or prediction result derived in Chapter 19 is valid in
general. ln order to understand the circumstances under which the
independence assumption might be inappropriate
in practice, it is
instructive to return to the linear regression model and consider the
Zr,. #) to Dlt't 'X,,' 04, t 1, 2,
reduction from D(Z1, Zc.
F in order to
understand the role of the assumption in the reduction process and its
relation to the NIID assumption for Z?. r g Tj Zr N (yr,X;)'.
As argued in Chapter 19, the linear regression model could be based
directly on the conditional distribution D vry'X,,'#j) and no need to define
D(Zf; #) arises. This was not the approach adopted for a very good reason.
ln practice it is much easier to judge the appropriateness of assumptions
related to Dzt., #), on the basis of the observed data, rather than
assumptions related to D( pf/Xl; 1). What is more, the nature of the latter
distribution is largely determined by that of the former.
In Chapter 19 we assumed that (Z!, l 6 T is a normal, independent and
identically distributed (NllD) stochastic process (seeChapter 8). On the
basis of this assumption we were able to build the linear regression model
ln particular the probability and sampling
defined by assumptions E1j-I)8(I.
viewed
assumptions
model
can be
as consequences of )Zt, t c T) being
NIID. The normalitl' of D(.yr/Xf; /1), the Iinearit of f'ly'r/xr xr) and the
homoskedasticity of Vartyr//''xf x,) stem directly from the normality of Zt.
.

The time invariance of 0- (p,c2) and the independent sample assumptions


stem from the identically distributed and independent component Of N11D.
Note that in the present context we distinguish between homoskedasticity
and time invariance of Vart)'r/xt x,) (seeSection 21.5). This suggests that
=

22.1

Implications

of a non-random

sample

495

sample assumption
the obvious way to make the independent
t G -1i-)is a (dependent) stochastic
inappropriate is to assume that
process. ln particular (becausewe do not want to lose the convenience of
normality) we assume that

tzt,

Zf

'v

8), Cov(Z?Z,)

N(m(r), E(l,

E(l,

t, s e: T.

#,

(22.2)

to the money equation estimated in Chapter 19, a cursory look


the
realisation
of Z,, t 1, 2,
F (see Fig. 17.1(4g)-(J)),would convince
at
above
assumption
that
the
seems much more appropriate than the 1ID
us
realisations
of the process exhibit a very
The
for
such
data.
assumption
systematically
trend
changes
time
distinct
(the mean
over time) and for at
well.
variance
the
change
of
least two
them
seems to
as
The question which naturally arises at this stage is to what extent the
assumptions underlying the linear regression model will be affected by
relaxing the I1D assumption for Z,. r G7T1, One obvious change will come
sample y H ( )'1 .'2.
)'w)'.What is not so
in the form of a non-independent
replace
which
B
i11
f)( )'; X:.. #:). Table 22.1
obvious is the distribution
reduction
of the linear regression
summarises the ilnportant steps in the
is a N1lD stochastic
model from the initial assumption that :Z,, i!
where
:Z,, l e: 5) is a non-llD
process and contrasts these with the case
minor
of the linear
difference between the
process. The only
regression model as summarised in this table and that of Chapter 19 is that
the mean of Zf is intentionally given a non-zero value because the mean
plays an important role in the case of a non-llD stochastic process.
As we can see from Table 22.1, the first important difference between the
Zw is complicated
1lD and non-llD case is that distribution of Z1
considerably in the latter case with the presence of the temporal covariances
and the fact that all the parameters are changing with r. The presence of the
temporal covariance implies that the decomposition of the joint
Zr in terms of the marginal distributions is no longer
distribution of Z1,
valid. The dependence among the Z,s implies that the only decomposition
possible is the sequential conditioning decomposition (seeChapter 6) where
the conditioning is relative to the past history of the process denoted by
This in turn
Z0l-1 Ntzj
Zr-- 1)., with t representing the
zc,
implies that the role played by D(Zr; #) in the lID case will now be taken
over by D(Zf Zr0- 1 ; #*(r)).In particular the corresponding decomposition
into the conditional (D4y, 'Xf; /1)) and marginal (D(Xr; /2))distribution w111

lf we return

tconstruction'

tpresent'.

now be

where the past histor-b'of the process comes into both distributions because
it contains relevant information for yf as well as X,. Although the algebra in

))
z, ss

22

XN

Zw

mt l )

E( I l ), E( 1 2)

m(2)

E(2, l )-E42, 2)

'

,x.

t'

@21

)!'

q m
=

tz!

c2

#'xr )
aE2-a1ca I mxl
+

#s
ez (('()p

rz2)

x,)

lp Tt F)

A.-(('(,(t) +. #'tltrjx,
+

definition of the parameters


0

Xl
t.r('--J'
. 1 ),

'k,,,

c(Y,O

(iii). Vartl',.

time independcnt

(iv)e1

sample
)?wl' is an independent
y LE t),l
sequentially drawn from .I)(.', X,,' kj ). t l 2,
'f; respectively
,

I.'r/.fz0,

(ii) E

model

'p

'

Sampling

E42,

E( F. l )

mt T)

'r)

22

cj j - .1 araa- trcj
J.'t//X,
E(
(ii)
xt ) - linear in xt
Vartl',
Xt
x,) - homoskedastic
(iii)

(iv)

1J(1

(x''''
) 1(''1))
N((.'''''

(.1,, 'X / ) Nt'v

model

'

t ...

Probab ility

'

).

xl

XO
t

xI)

(t'()(/),polt),#f(l)-

#'f(/)x,J
.

see Appendix

r,rg4/)) (for the

1)

homoskedastic

involved

txJ(r),

.- 1

Iinear in Zlt'-

)'y.)/ is a non-random
y Es (yj
D( y'p,'/Zt0j Xl,. 9(l)),
t 1. 2.
.

- 1

0)

)-(Iaf(/).J.,

(free of

Zl' -

t - 1, c4(l)) - time dependent


samf'le sequentially drawn from
'T)respcctively
1-2,

22.1

Implications

of a non-random

sample

497

the non-llD case is rather involved, the underlying argument is the same as
in the llD case. The role of D()'f/''Xr; 1) is taken over by D()'f Zf0- j Xf ; k$(t))
and the probability and sampling models in the non-llD case need to be
defined in terms of the latter conditional distribution. A closer look at this
distribution, however, reveals the following:
,

f-1

't.pf,/'J(Y/- j ), Xr0

Vart

)f

X/
't7'(Yr0-1)
,

x))l

c()(1)+

#(l)xI

F (af(l)yti + #(r)xt J;
-

r=1

xt0)

(22.4)

c(jttl ;

where
X f0
the conditional

Firstly, although

(x 1

EEE

mean

x 2,

x ).
t

is linear in the conditioning

variables, the number of variables increases with r. Secondly, even though


the conditional variance is homoskedastic (free of the conditioning
variables) it is time dependent along with other parameters of the
distribution. This renders the conditional distribution Dt-pt Zt0- 1, Xt; #y(l)),
as defined above, non-operational as a basis of an alternative spification to
be contrasted with the linear regression model. It is in a sense much too
general to be operational in practice. Theoretically, however, we can define
the new sampling model as:
),w)' represents a non-random
y
drawn, from D()'r Z,0- l X,', 41:(1)),
t 1, 2,
-(.v1,

sample sequentially
'Tkrespectively.
.

ln view of this it is clear that what we need to do is to restrict the generality


assumptions related to )Z,, l iE T) in order to render
of the underlying
ln particular, we need to impose certain
D( 'Z,0- j Xr,.
t(r))operational.
the
of
time-heterogeneity
and dependence of the
form
the
restrictions on
ftzf,
T).
(E
This
is the purpose of the next two substochastic process
r
sub-section
the dependence of ftzf, t g Tjt is restricted to
sections. In the next
independence
and
the complete time-heterogeneity (nonasj'mptotic
identically distributed) is restricted to stationarity (seeChapter 8) in order
to get an operational model. ln sub-section (3) the time-dependence
assumption is restricted even further in an attempt to retain the systematic
component of the linear regression model and relegate the dependence to
the error term.
,r

Departures from the sampling model assumption

(2)

The respec6cation

approach

As seen above, when


l G T) is assumed to be normal, and non-llD, the
conditional distribution we are interested in, D()'tf'Zh j, Xj,' #y(J)), has a
mean whose number of terms increases with r, and its parameters #1'(r)are
time dependent (see(4)-46/. The question which naturally arises at this
stage is to what extent we need to restrict the non-llD assumption in order
the incidental Jptzl'tkrrlplt?p-s
problem. ln order to understand
to
the
role of the dependence and non-identically distributed assumptions let us
eonsider restricting the latter first.
Let us restrict the non-identically distributed assumption
to that of
stationarity, i.e. assume that )Zr, l (E T) is a normal, stationary stochastic
process (without any restrictions on the form of dependence). Stationarity
(see Section 8.2) implies that

tzr,

tjolve'

f(Zr)

m,

Cov(ZtZ,)

Z1

Z2

E(Ir
Yotl
Y1Y:

xN

.$1).

(22.7)

Yz. 1
Er-?

Y1

Zw

Ew-l

tiuvttlt -s(),

Y1Eo

1 2,
,

T - 1.

(22.8)

That is, the Zfs have identical means and variances and their temporal
covariances depend only on the absolute value of the distance between
them. This reduces the above sample
g'F x ( + 1)(1x E.'Fx (k+ 1)(1
covariance matrix to a block Toeplitz matrix (see Akaike ( 1974)). This
restricts the original covariance matrix considerably by inducing symmetry
and reducing the number of
k + 1) x (k+ 1) matrices making up
these covariances from 7-2 to F- 1. A closerlook at statienarity reveals that
it is a direct extension of the identically distributed assumption to the case of
a dependent sequence of random variables. ln terms of observed data the
sample realisation of a stationary process cxemplifies no systematic
changes either in mean or variance and any z-period section of the
realisation should look like any other r-period section. That is, if we slide a
z-period
along the time axis the picture'
over the realisation
should not differ systematically. Examples of such realisations are given in
Figs. 21.2 and 21.3.
Assuming that ftzf, t (y T) is a normal stationary process implies that as far
tdifferent'

-wnf/tpw'

22.1

Implications

of a non-random

(ii)

Var(),,''c(Y?0- j ),Xf0 x,0)

(iii)

0* EEEE(c(), v

pi a i

499

c()2;

12

sample

(22.10)
t - 1 c()2).

(22.11)

As we can see, stationarity enables us to


the parameter timeincidental
problem
but
dependence
the
parameters problem remains
unresolved
the
number
of
largely
because
terms (and unknown parameters)
1. Hence, the time-homogeneity
conditional
with
in the
mean (9)increases
introduced by stationarity is not sufficient for an operational model. We
also need to restrict the form of dependence of (Z,, r g T) ln particular we
of the process by imposing restrictions such
need to restrict the
ergodicity,
mixing or asymptotic independence. ln the present case the
as
independence.
most convenient memory restriction is that of (ls
This restricts the conditional memory of the normal process so as to enable
us to approximate the conditional mean of the process using an mth-ovder
Markov process (see Chapter 8). A stochastic vector process .CtZf,l 6 T). is
said to be mth-order Markov if
'solve'

bmemory'

'lnprtalfc-

Ftzr Z,0- 1)

;(zr,,'c(zr-

z,

- a,

z,

Assuming that .f(Zf t e: T) is:


normal;
(i)
stationary; and
(ii)
asymptotically independent
(iii)
enables us to deduce that for large enough
conditional mean takes the form

(22.12)

-,,,)),

(hopefully m < F) the

This form provides us with an operational


form for the systematic
?,n, c02)is both time
component for t > ?1. Now 0* (c'0,je, jf, xi, i 1, 2,
invariant and its dimensionality remains fixed as F increases. lndeed,
Dty', Zf0- j Xr,.
rrl, a
t) yields a mean linear in xt, y, -i, xf -f, i 1, 2,
homoskedastic variance and /1 is time invariant. Hence, defining the nonsystematic component by
=

u;b

.pt
-

E(yt/''c(Yy0-I )X0

x$0),

(22.14)

from the sampling model assumption

Departures

w'e can define a new statistical GM based on D()'t,

ZJ0-

Xf

t) to

be

in an obvious notation. Notv that c'a has been dropped for notational
convenience (implicitlyincluded in pvq.
of stationarity and
lt must be noted at this stage that the assumption
asymptotic independence for IZ, t e: T) are not the least l'estrictive
assumptions for the results which follow. For example asymptotic
independence can be weakened to ergodicity (see Section 8.3) without
affecting any of the asymptotic results which follow. Moreover, by
strengthening' the memory restriction to that of tp-mixing
some timeheterogeneity might be allowed without affecting the asymptotic results (see

White (1984)).
It is also important to note that the maximum lag m in ( 15) does not
represent the maximum memory lag of the process tyf/'Z)'- 1 Xf, t e: T) a.S in
the case of an m-dependence (see Section 8.3). Although there is a duality
with an mth-order Markov process, in the
result relating an rn-dependent
of
is
considerably
longer than m (seeChapter 8).
latter
the
the
memory
case
This is one of the reasons why the AR representation
is preferred in the
is
determined
The
maximum
lag
by the solution of
present context.
memory
polynomial:
1ag
the
,

a(.f-)
=

af.f-f

(22.17)

0.

ln view of the statistical GM ( l5) let us consider the implications of the


non-random sample (i)-(iii)for the estimation, testing and prediction
results in Chapter 19. Starting with estimation we can deduce that for
#=(X X)

and

sl

(22.18)

Xy

1 T'-/l'

=(

(22.19)

the following results hold:


i
()
(ii)

A'(/)

#() + (X'X) - 1X'.&Z* 7)# #' j is a biased


# #; # is an inconsisten: estimator of #;
AP
++

estimator

of

#'
,

''h

MSEI#)HE c 2 (X X)
t

A(s2)=c2 +y';(z*'M

(X X)

z*ly#c2.

.X

IXIX X) 1 # c 2 (X X) 1
js a biased estimator Of c2'

....y

X '(Z SZ
.:2

+?

Implications

22.1

(v)
(vi)

s2

+%

c2;

-$2

of a non-random

sample

is an inconsistent estimator

j01
Of

c2;

p MSE(j)'

$2(X'X) -

1 +.+

where Mx Iw-X(X'X) - 1X'. These results show clearly that the


sample assumption are very serious for the
implications of the non-random
y2)
as estimators of 0 E (#, c2). Morcover, (i)(vi)
appropr iateness of l N (/
taken together imply that none of the testing or prediction results derived in
Chapter 19, under the assumption of an independent sample, are valid. In
particular the l-statistics
for the significance of the coefficients in the
estimated money equation are invalid together with the tests for the
coefficients of v, and pt being equal to one as well as the prediction intervals.
At first sight the argument underlying the derivation of (i)-(vi)seems to be
variables problem' criticised in Section 20.2 as
identical to the
being rather uninteresting in that context. A direct comparison, however,
between ( 15) and
=

bomitted

y, j'x,
=

(22.20)

+ u,.

reveals that both statistical GM*s are special cases of the general statistical

GM

p,

1), Xf0 xr0


.E'()'j,/W(Yr0)+

under alternative
constitute

on .tZt, t c T) ln this sense


from the same joint distribution

assumptions

'reductions'

D(Zl

Zw; #)

them directly comparable;


conditioning set
f'j

(3)

.t

c(Y,0-j ). x,0 xy

The autocorrelation

(20) and ( 15)


(22.22)

which makes

(22.21)

gt

'
)

they are based on the same

(22.23)

approach

The main difference between the respecification approach considered


approach
above and the autocorrelation
lies with the systematic
the
respecification
In
approach
we need to respecify the
component.
systematic component in order to take account of (model) the temporal
systematic information from the sample. ln the autocorrelation approach
remains the same and hence the temporal
the systematic component
systematic information will be relegated to the error term which will no
longer be non-systematic. The term autocorrelation
stems from the fact that
dependence in this context is intereted as due to the correlation among

from the sampling model assumption

Departures

the error terms. This is contrary to the logic of the approach of statistical
model specification propounded in the present book (seeChapter 17). The
approach, however, is important for various
reasons. Firstly, the
approach is very illuminating for both
comparison with the respecification
approach
approaches. Secondly, the autocorrelation
dominates the
textbook econometric literature and as a consequence it provides the basis
for most misspecification tests of the independent sample assumption.
The systematic component for the autocorrelation
approach is exactly
the same as the one under the independent sample assumption. That is,
assuming a certain fol'm of temporal dependence (see (41)).
j)
yt > .E'(y, V',

#'x,.

(22.24)

This implies that the temporal dependence in the sample will be left in the

error term:

(22.25)
;f yt - E (.)/.67)).
.%.as defined in (23).In view of this the error term will satisfy the following
properties:
=

(i)

Elst/.@-,)

E (cJs/.f? r)

(22.26)
c2(p,
t,(r,

,s),

These assumptions in terms of the observable


the rather misleading notation:
(y,''X) N(Xp, c2Vw),
'v

r.v.'s are often expressed

Vw>0.

in

(22.28)

The question which naturally arises at this stage is.


are the
implications of this formulation for the results related to the linear
19 under the independence
regression model derived in Chapter
assumption'?' As far as the estimation results related to j (X'X) - 1X'y and
.$2 g1/(F- klj/,
y - Xj are concerned we can show that:
bwhat

J is an

unbiased estimator

(i)'

Ep

(ii)'

(iiil'

Covt#-) c2(X'X) - IX'V F X(X'X) -' 1.

P
-+

# if limr..o.(X'VwX'''F<
=

)s2)

(iv)'

2X

,5,2

and non-singular;

tr( 1 Px)Vw + c2;


-

is an inconsistent estimator

C+c2(X'X) - IX'V z'x(x'Xjsl


and
bare not independent.

S 2(X/X) -

#;

'

gc2/(T'-/()j

(7.2

y.

of

Of

c2.
,

22.2 Tackling temporal dependence

503

ln view of (iiil', (v)',(vi)' and (viil' we can conclude that the testing results
derived in Chapter 19 are also invalid.
The important difference between the results (i)-(vi) and (i)'-(vi)'is that /
estimator in the latter case. This is not surprising,
is not such a
however- given that we retained the systematic component of the linear
regression model. On the other hand, the results based on Covty/-'= X)
property of / in the present
G 21 are inappropriate. The only undesirable
v
the
relative
said
its
is
inefliciency
be
proper MLE of j when Vw
to
to
context
is assumed kntpwn. That is, p-is said to be an inefficient estimator relative to
the GLS estinlator
kbad'

'''

j
j
j
#= (X Vw X) . X Vw y
y

(22.29)

(see Judge et aI. (1985:. A very similar situation was encountered in the case
of heterosk-edasticity and the same comment applies here as well. This
efficiency comparison is largely irrelevant. ln order to be able to make
(2)
with
justifiable efficiency comparisons NN': shotlld be able to compare (/.
well
is
information
l
known,
however,
the
set.
t
based
same
estimators
on
no uonsistent estimator of the
that in the case where 5'w i s tlnknown
exi
information
and
the
of
interest
st
mat rix could not be used to
parameters
efficiency
bound.
full
lower
define a
22.2

Tackling temporal dependence

sample
question: 'How do we proceed when the independent
of
considered
the
this
will
testing
invalid?'
before
is
be
assumption
approach
autocorrelation
the
the
assumption because in
two are
inextricably bound up and the testing becomes easier to understand when
the above question is considered first. This is because most misspecification
approach consider
tests of sample independence in the autocorrelation
which we
assumption
independence
particular forms of departure from the
approach,
This
however, will be
will discuss in the present section.
above,
mentioned
respecification
approach because, as
considered after the
respecification
the
the former is a special case of the latter. Moreover,
approach provides a most illuminating framework in the context of which
the autocorrelation approach ean be thoroughly discussed.
The

(1)

The

rtzArptlclr/ztw/t?zl

approach

ln Section 22. l we considered the question of respecifying


of the linear regression model in view of the dependence
model. lt was argued that in the case where lZ:. (E 'T' is
independent process
stationary and asymptotically

the components
in the sampling
assumed to be a
the systematic

Departures from the sampling model assumption

component takes the form

(22.30)
The non-systcmatic
uf

'(F(Y/- j ),Xf0 x/),

).,t - S(.J.'?

This suggests that


to the sequence 9 t

f'lf Vf -

1)

is defined by

component

,f(

t > m.

u,, t > rn)

defines a martinqale derence process relative


J(Z?() Xr+ I), l > m, Of c-fields (see Chapter 8). That is,

0,

(22.32)

Moreover,
2

Gf),

JJ(url,/x)

(22.33)

0,

These properties will play a very important


i.e. it is an innovation p?-?c'(?.$.$.
role in the statistical analysis of the implied statistical GM:

'f p'vxt+

afl'r

1*=

+
i

(22.34)

&t,
- f+

ixt

If we compare (34)with the statistical GM of the linear regression model


we can See that the error term, say f;r in
.

)'f

Fxf+

(22.35)

lr,

noise relative to the information set (c(Yt0- Xf0 xf0)


is
given that t)t is largely predictable from this information set (see Granger
(1980:. Moreover, in view of the behaviour of ;r the recursive estimator
no longer white

#, #! 1 +
=

j),

(Xf Xl )

xtll'f -

#, 1 xr)
-

(22.36)

(seeChapter 2 1) might exemplify parameter time dependence given that


(yj // 1x,), t > k, is no longer a mean innovation process but varies
systematically with t. Hence, detecting parameter dependence using the
recursive estimator j l > k, should be interpreted with caution when
invalid conditioning might be the cause of the erratic behaviour of
-

15

f(#f

#f 1),
-

(22.37)

The probability
distribution underlying (34) comes in the form of
D()'/'Zy- :, X.,',#), which is related to the original sequential decomposition
T

D(Z*; #)

L1D(Z,/Zf -

f=1

1,

Z1;

#),

(22.38)

22.2

Tackling temporal dependence

D(Zf /Zf0- j ;

#)

Dt.y'f(L,t-k

xf 41) otxf//'Zf0- ; #a).


,.

505

(22.39)

The parameters of interest 0 EEE(jf, xi + 1, i 0, ls


n'l, c2) are functions of
41 and Xt remains weakly exogenous with respect to 0 because #: and /2
are variation free (seeChapter 19). Hence, the estimation of 0 can be based
on D(yf/'Zf0-j, xf',#j) only. For prediction purposes. however, the fact that
D(Xl Z,0- j ; #a) involves Yf0-j cannot be ignored because of the feedback
between these and Xf. Hence, for prediction purposes in order to be able to
ZtOconcentrate exclusively on D4y1
1, Xf,. #1) we need to assume that
=

D(Xf/'Z!0- 1 ;

#c)

D(Xf Xf0.j

,'

(22.40)

#a),

i.e. Yt0- j does not Granqer cause Xf (See Engle et aI. ( 1983:. When weak
exogeneity is supplemented with Granger non-causality we say that Xt is
stronql), exogenous with respect to 0.
The above changes to the linear regression model due to the non-random
sample taken together amount to specifying a new statistical model which
(1
?-t?g?-e'ssf(??7 nl()(leI. Becatlse of its inportance
in
we call the J'nfilhnftr lineal.
specification.
estimation,
testing and prediction
econometric modelling the
in the context of the dynamic linear regression model will not be considered
here but in a separate chapter (see Chapter 23).
(2)

The autocorrelation

approach

As argued above in the case of the autocorrelation approach the stochastic


t 6 T) is restricted even further than just stationary and
process
asymptotic independent. ln order to ensure (24) we need to restrict the
identical,
temporal dependence among the components of Zr to be
.tzf.

'largely'

i-e.
rovtzffzsj)

.....=

c-/l't

--

yj

), i.j

=::

1 2,
,

(22.41)

1%'
.+ 1,

(see Spanos ( 1985/7) for further details).


The question of restricting the time-heterogeneity and memory of the
process arises in the context of the autocorrelation approach as restrictions
on c(l) and n(l, s) (see (27)).lndeed, looking at (28)we can see that c2(r) has
already been restricted (implicitly),c2(t) c2, t (E T.
Assuming that
ty 1) is a stationary process Vw becomes a Toeplitz
matrix fsee Durbin ( 1960)) of the form Ltts trjr-l, r, s 1, 2,
T and
although the number of unknown parameters is reduced the incidental
parameters problem is not entirely solved unless some restrictions on the
'memory' of the process, such as ergodicity or mixing, are imposed. Hence,
the same sort of restrictions as in the respecification approach are needed
here as well. In practice, however, this problem is solved by postulating a
=

tcf,

506

from the sampling model assumption

Departures

generating mechanism forcf which ensures both stationarity as well as some


form of asymptotic independence (see Chapter 8). This mechanism (or
model) is postulated to complement the statistical GM of the Iinear
regression model under the independence assumption in order to take
account of the non-independence. The most commonly used models for F;,
are the AR(m), MA(m) and ARMAIJ?. qj models discussed in Chapter 8. By
far the most widely used autocorrelated error mechanism is the AR( 1)
model where st pBt - 1 + ut and ut is (1 w/lfftontpyt? process. Taking this as a
typical example, the statistical GM under the non-random
sample
assumption in the context of the autocorrelation approaeh becomes
=

c,

.1

#'x +
l

pBt

- 1

(22.42)

;r,

N,,

0< p

<

(22.43)

The effect of postulating (43)in order to supplement


(42)is to reduce the
number of unknown parameters of Yv tojust one, /?, which is time invariant
as well. The temporal covariance matrix :'7. takes the form
/?
r-2

1)

(22.44)

(see Chapter 8). On the assumption that (43) represents an appropriate


model for the dependency in the sample we can proceed to estimate the
4,7.2
J(;/).
parameters of interest 0 (#, p, c2) where
The likelihood function based on thejoint distribtltion under (42)--(43)
is
''

L(0,. ).)
,

(2z:c )

1%2

(t.je t y,rv j; ;))

1ogA(0,' y) con st - - Ioc


2
=

0-2

-.

tj

+ . 1og ( 1 - /?
2

2)

(22.46)

22.2

Tackling temporal dependence

1
t?log L
X'Vw(p) (7p =-rj
c

(' 1og L
t?c2 -

T
=

x+

2c=

1
2c

L
p
p lo-#-.-+
1
p'/? ( - p2)

4.

c,v z.(/?) ..

pc1

(22.47)

0,

1;=

(7.2

(22.48)

(j

1,

+-.j
c

:=

ya (;, - vst-

)f;, j
-

0,

(22.49)

where c HE y -X#. Unfortunately, the first-order conditions ( log Lj, ?p= 0


cannot be solved explicitly for the MLE of 0 because they are not linear in
p. To derive # we need to use some numerical optimisation procedure (see
Beach and Melinnon
( 1978), Harvey (198 1), inter alia). For a more
extensive discussion of the estimation and testing in the context of the
autocorrelation approach, see Judge et aI. ( 1985).

(3)

compared

The two approaches

- the common

factor restrictions

As mentioned in Section 22.1, the autocorrelation approach is a special case


of the respecification approach. ln order to show this 1et us consider the
example of the statistical GM in the autocorrelation approach when the
error mechanism is postulated to be an AR(rn) process

lpl

v, #'xf+
=

pftl't

f=1

-.

j'xr

i -

) +.

ld,

(22.51)

lf we compare this with the statistical GM (34) in the respecification


approaeh we can see that the two are closely related. lndeedst34) is identical
to (51) under
Ho

ppi

pi,

?' =

l 2,
,

m.

(22.52)

These are called ctprtlmt?njctor restrictions (see Sargan (1964),Hendry


and Mizon ( 1978). Sargan ( 1980)). This implies that in this case the model
suggested by the autocorrelation approach is a special case of (34)with the
which arises at this
common factors imposed a priori. The natural question
result
or is only true for this particular
stage is whether this is a general
example. In Spanos (1985/)it was shown that in the case of a temporal
covarianee matrix Vw based on a stationary ergodic process the hybrid
statistical GM in the context of the autocorrelation approach takes the

508

from the sampling model assumption

Departures

general form
(22 53)
.

Hence, the autocorrelation


approach can be viewed as a special case of the
respecification approach with the common factors restrictions imposed a

priori.

Having established this relationship


between the two approaches it is
factors
interesting to consider the question: tWhy do the common
The answer
restrictions arise naturally in the autocorrelation approach?'
lies with the way the two approaches take account of the temporal
dependence in the formulation of the non-random sample assumption.
In the respecification
approach
based on the statistical model
specification procedure proposed in Chapter 17 any systematic information
in the statistical model should be incol-porated directly into the definition of
the systematic component. The error term has to be non-systematic relative
the
to the information set of the statistical model and represents
tunmodelled' part of y'r. Hence, the systematic component needs to be
redefined when temporal systematic information is present in the sampling
approach attributes the
model. On the other hand, the autocorrelation
dependence in the sample to the covariance of the error terms retaining the
same systematic component. The common factors arise because by
modelling the error term the implicit conditioning is of the form
#n

z);r,,'c(E0
( .

where Ef0-j

(:,

EB

'tcr

1,

))
;f

r
=

a,

(22.54)

ait:t -f-

c(Et0-j ))+

;()),

The implied statistical GYI is

(22.55)

ut

(22.56)
which is identical to (53), Hence, the common factors are the result of
tmodelling' the dependence in the sample in terms of the error term and not
in terms of the observable random variables directly. The result of both
approaches is a more general statistical GM which
the dependence
of the
in the sample in two different but related ways. The restrictiveness
autocorrelation approach can be seen by relating the common factor
restrictions (52)to the parameters of the original AR(lz7)representation of
Gmodels'

dependence

Tackling temporal

22.2

tZ,, r G T) based on the following sequential conditional

(Zr

At

m
) kN ))

f.'k1

1()

2(1')

a1
A 2 a(,)

'.w

a 2 1 (j)

I.? i
-

xf
-

distribution:
-1

f7.)1 1
,

(22.57)

'jcc

f-tu j

As shown in the appendix- the parameters of the statistical GM


relatedto the above parameters via the following equations:

/30

Dz-zif5zl

ui

)(1 1 (i) + e?

and

ta1

2()

b'i
=

a c 1 (ij ) fo r i

12

Hence the common factor restrictions hold


al a()

a'21()

and

(22.58)

+ t?3l 2612-21 A22(f)

2-a1

.,2f1

(4) are

A2a(f)

ln.

when

J1 1 (f)I

for a11 i

1.

n1.

(22.59)
hold when Granger non-causality
That is. the common factor restrictions
holds among all Zfts and an identical form of temporal self-dependence
1985tJ)).
These are very
rn. t > m (see Spanos (
exists for all Zits. f= 1,
principle
factor
priori.
ln
the
impose
restrictions
common
unrealistic
a
to
of
the
the
by
context
in
tested
testing
indirectly
be
(59)
restrictions can
representation.
general AR(m)
A direct test for these restrictions can be formulated as a specification test
approach. In order to illustrate how the
in the context of the respecification
restrictions
can be tested 1et us return to the money
eommon factor
estimated
in Chapter 19 and consider the case where m 1. The
equation
statistical GM of the money equation for the respecification and
autocorrelation approaches is
.

illl <

Under
S

X
() :

1::

-.

a
'

,::

Iz

Xg
-

--

/.3

(:4,
'

(Z4

..

.-

/.4

(22.61)

from the

Departures

model assumption

sampling

wtt can see that the two sides of (62)have the common jctor ( 1 al L) which
can be eliminated by dividing both sides by the common factor. This will
give rise to (61).
The null hypothesis Hv is tested against
-

H 1: a 1 +

,-3
a

tz....j

s/74

or

/72

# -

a:

.:1

/./3

Although the Wald test procedure (see Chapter 16) is theoretically much
more attractive, given that estimation under 1j is considerably easier (see
Mizon ( 1977), Sargan ( 1980) on the Wald tests), in our example the
likelihood ratio test is more convenient because most computer packages
provide the log likelihood. The rejection region based on the asymptotic
likelihood ratio test statistic (seeChapter 16).

-2 logc-;-ty)

2(loge

-(#,

y)

# refer to the MLE'S

where and
the form

y))

log. L,

z2(k

x.

(22.63)

1),

Sl and Ho, respectively,

of 0 under

takes

X.
.f(

C1

y:

of

Estimation

mt

2 logg 2(y) y

0.766 + 0.793,n,

+ 0.160p! (0.220)

log L
Estimation

of

)
,

dz

+ 0,038)!r + 0.240.:!

(0. 169)

(0.182)

0.04 1f, + 0.006ft l


12)

1 -

(0.0
/12

=0.999,

=0,999,

1).

(22.64)

=0.018

+ 0.023p/

(0.208)
(22.65)

li,,

(0.013)

(0.018)
1,

=209.25,

(61) for the

same period yielded

mt 4. 196 + 0.56 1rf + 0.884/4 - 0.040,


( 1.53) (0.158) (0.037) (0.0l3)
=

R2

24k

C2

(60) for the period 1963//.-1982f,, yielded


(0,582) (0.060)

R2

(.',

42

=0.998,

=0.998,

0.8 19t

(0.064)

t,

(0.022)

log L= 187.73,

.$=0.0223,

(22.66)
and three
43.04. Given that c, 7.8 15 for (z
of
is
strongly
rejected.
Ho
freedom,
degrees
As mentioned above, the validity of the result of a common factor test
Hence,

-2

1og.2(y)

=0.05

Testing the independent sample assumption

22.3

511

of the statistical GM postulated for the


depends on the appropriateness
well-defined
model
of
estimated statistical model. The
general
a
as part
of
extensively
discussed
in the next chapter.
this
is
ensuring
question

22.3

Testing the independent sample assumption

(1)

The respecijication

approach

In Section 22. 1 it was argued that the statistical results related to the linear
sample
regression model (see Chapter 19) under the independent
assumptions are invalidated by the non-independence of the sample. For
importance
to be able to test for
this reason it is of paramount

independence.

As argued in Section 22.2, the statistical GM takes different forms when


the sampling model is independent or asymptotically independent, that is,

'f

#'x?+

uf

1 2,
,

(22.67)

'F

(22.68)
respectively. This implies that a test of independence
based on the significance of the parameters x EEE(aj,
#' )' i.e.
.

p.j

can be constructed
a.)', j* H(#'j,
#2,
.

and

pz 0
=

/l* 0.
'

In view of the linearity of the restrictions


an F-test type proeedure (see
Chapters 19 and 20j should provide us with a reasonable test. The
discussion of testing linear hypotheses in Chapter 20 suggests the statistic:
r(y)

RRSS - URSS
URSS
--

--

T'- ktnl + 1)
mk

---

--

.j-

(22.69)

nl/zt-rl has an asymptotic chi-square distribution (z2(mk))under Hv. ln


small samples, however, it might be preferable to use the F distribution
approximation (F(mk, F- Llm+ 1/) even though it is only asymptotically

Departures

from the sampling model assumption

justifiable. This is because the statistic z*( #') nykztl'l increases with the
number of regressors; a feature which is particularly problematical in the
present case because of the choice of n1. On the other hand, the test statistic
(69) does not necessarily increase when m increases. ln practice we use the
z(y) > t'alj where ca is defined by
test based on the rejection regict) C1
=

.ry:

L'L?

(22.70)

dFtl?1k, T- /(v? + 1))

f'J

as if' it were a finite sample tcst.


Let us consider this test for the money equation estimated in Chapter 19
with m 4 using the estimation period 1964- 1982/:,,
=

mt 2.763 + 0.705.4 + 0.8621,, - 0.0533.:+ ,,


( 1.10) (0.112) (0.022) (0.014) (0.040)
=

.R2

log L

=0.995,

138.425,

0.995,

RSS

mt 0.706 +0.589-,
(0.815) (0.132)
=

0.040 22.

0. 1165,

-0.018rn
-0.046mf 3 +0.2 14-1
-z
(0.152)
(0.166)
(0.129)

+ 0. 19lyt + 0.5 18 v, :
(0.199) (0.261)
-0.060pr
(0.348)

+0.606p-

(0.670)

-0.047f, + 0.0 17ff


(0.0 14) (0.020)

0.253.9,1 c
-

(0.255)

0. 116.:,

(0.260)

-0.38 1pf 2 +0.558p,-

(0.642)
-

(0.022)

+ 0.006/2

(0.021)

0.022.:/ 4.
- a-

(0.223)

-0.479pf

(0.630)

-0.025?'r

-4

(0.348)

-0.0 18)

(0.014)

log L

#2

=0.999,

,j.

(22.72)

+ l,
(0.0 18)
R2

-4

2 10.033,

=0.999,

Rk5'=0.017

=0.0

17 78,

697.

19.25. Given that cx 1.8 12 for


The test statistic (69)takes the
of
56)
l6,
and
0.05
deduce that Hv is strongly
freedom
degrees
(
can
we
a
rejected. That is, the independence assumption is invalid. This confirms our
initial reaction to the time path of the residuals
,
y - x t (see Fig. 19.3)
that some time dependency was apparently present.
An asymptotically equivalent test to the F-test considered above which
corresponds to the Lagrange multiplier (seeChapters 16 and 20) test can be
based on the statistic
value z(y)

Hv

LMy)=

TRI

z2(mk),

(22.73)

22.3

Testing the independent sample assumption

where the R2 comes from the auxiliary

regression

m + 1,

'.E (22.74)

In the case of the money equation the R2 for this auxiliary regression is
64.45. Given that cx= 26.926 for x
0.848 which implies that LMy)
and 16 degrees of freedom, hv is again strongly rejected.
lt is interesting to note that faMtyl can be expressed in the form :
=0.05

LMqy)

TR2

RRSS

URSS

(22.75)

RRSS

(see exercise 4). This form of the test statistic suggests that the test suffers
from the same problem as the one based on the statistic z*( T'), given that 5.2
increases with the number of regressors (see Section 19.4).
(2)

approach

The autocorrelation

of the
As argued in Section 22.3 above, tackling the non-independence
approach
before testing the
sample in the context of the autocorrelation
appropriateness Of the implied common factors is not thecorrect strategy to
adopt. In testing the common factors, implied by adopting an error
autocorrelation formulation, however, we need to refer back to the
useful is a test of
respecification approach. Hence, the question arises,
the independence assumption in the context of the autocorrelation
approach given that the test is based on an assumption which is likely to be
erroneous'?' In order to answer this question it is instructive to compare the
statistical GM's of the two approaches:
'how

J.',-

#'oxf+ j
i

afy,
1

,.

+
i

)
=

'ixt

+ u,,

r>

(22.76)

n1,

and
p/

p'xt+

t'tl-lrt

.;t,

?(f-)u,,

where J(L) and /74L)are pth- and /th-order polynomials in L. That is, the
postulated model for the error term is an ARMAIP, q) time series
formulation (see Section 8.4).
The error term )f interpreted in the context of (76) takes the form
G

:-(#0
-

bs'xt+

Z Eaiy'f-i+ #ixf-(l + u,-

(22.78)

=1

That is, cf is a linear function of the normal, stationary and asymptotically


t > rn) as
independent process Zf -.f, i 1, 2,
m ).This suggests that
.j

t;f
,

Departures

from the sampling model assumption

defined in (78)is itself a normal, stationary and asymptotically independent


process. Such a proccss, however, can always be approximated to any
degree of approximation by an ARMAIP, q) stationary process with
enough' p and q (seeHannan ( 1970), Rosanov ( 1967), inter alia). ln view of
this we can see that testing for departures from the independent sample
assumption in the context of the autocorrelation
approach
is not
unreasonable. What could be very misleading is to interpret the rejection of
assumption
the independence
of the error
as an
autocorrelation model postulated by the misspecification test. Such an
interpretation is totally unwarranted.
An interesting feature of the comparison
between (76)and (77)is that of
the role of the coefficients of xf, pvand #, are equal only when the common
factor restrictions are valid. In such a case estimation of # in yt #'xf+
should yield the same estimate as the estimate of # in
klarge

'endorsement'

)', j'x,
=

)r,

faly1-laf

h(1.)u,.

.5!

(22.79)

Hence, a crude

of the implicitly imposed


test' of the appropriateness
restrictions
might
the two estimates of j in
factor
be
to
compare
common
the context of the autocorrelation approach. Such a comparison might be
quite useful in cases where one is reading somebody else's published work
and there is no possibility of testing the common factor restrictions directly.
In view of the above discussion we can conclude that tests of the sample
independence assumption in the context of the autocorrelation approach do
have a role to play in misspecification testing related to the linear regression
model, in so far as they indicate that the residuals Jl, t 1,
F do not
constitute a realisation of a white-noise process. For this reason we are
going to consider some of these tests in the light of the above discussion. ln
particular the emphasis will be placed on the non-parametric aspects of
these tests. That is, the particular form of the error autocorrelation (AR(1),
MA( 1),ARMAIP, t?)) will be less crucial in the discussion. ln a certain sense
based tests will be viewed in the context of the
these autocorrelation
auxiliary regression
=

(22.80)
which constitutes a special case of (74).
The particular aspect of the process
t G T). we are interested in is its
temporal structure. The null hypothesis of interest in the present context is
that
t g T) is a white-noise process (uncorrelated over time) and the
alternative is that it is a dependent stochastic process. The natural way to

tif,

.f(:,,

Testing the independent sample assumption

22.3

proceed in order to construct tests for such a hypothesis is to choose a


measure' of temporal association among the cts and devise a procedure to
detennine whether the estimated temporal associations are significant or
not.
The most obvious measure of temporal dependence is the correlation
defined by
between cf and Ct 1, What We Call lth-order autocorrelation
+

jej czz -

Covtkk

+I)

;
gvartk)vartq..1---6)
j

In the case where (ct,t e: T) is also assumed to be stationary


Vartrt +!)) this becomes
e
1

Covtk;,

(Var(cf)

+/)

var(:,)

--

(22.82)

'

(22.83)
Intuition suggests that a test fof thesapple independence assumption in the
present context should eonsider whether the values of for some I 1, 2,
rn, say, are significantly different from zero. ln the next subsection we
.
1 and then generalise the results to != m > 1.
consider tests based on 1:::z:
,

Tbe Durbin-ktson

tes

(/ 1)
=

used (and misused) misspecification


test for the
approach is
autocorrelation
the
of
the
assumption
in
independence
context
GM
The
postulated
statistical
so-called
Durbin-Watson
for the
test.
the
purposes of this test is

The

most

.J.'r=

widely

#'x,+

'ir,

t:t =

pst - I +

(22.84)

l,,

The null hypothesis of interest is Ho1 p


(i.e.t?t u!, white noise) against
Durbin
the alternative J.fl : p #0. Building on the work of Anderson (1948),
and Watson (1950, 1951) proposed the test statistic,
=0

(22.85)

from the sampling model assumption

Departures

where
0

-1

-1

0
2

-1

(22.86)

:
2

- 1
1

between A1 and %'v is given by

The relationship

where C

=diagl

1. 0,
'

V,.(p) -

:>:

0, 1). Durbin and Watson

( 1 -p)2I

used

the approximation

(22,88)

+ /?A1

which enabled them to use Anderson's result based on a known temporal


1
Vp)Iw. The
covariance matrix. As we can see from (88), when /?
region
for
hypothesis
the
null
rejection
=0,

HoL

0,

against

H3 : p + 0

takes the form


C1

.j

: z 1 ()') %c'a

.p

).
,

(22 9)
.8

where L'z refers to the critical value for a size a test, determined by the
distribution of the test statistic (85)under Hv. This distribution, however, is
inextricably bound up with the observed data matrix X given that b= MxcMx= Iw X(X'X) - 1X' (see Chapter 19) and
-

r 1 (y)

:'M .h(.A1Mx:

E'Mx:

(22.90)

This implies that the Durbin-Watson


test is data spt?c'tc and the
distribution of z1(y) needs to be evaluated for each X. ln order to make the
evaluation of this distribution easier to handle Durbin and Watson, using
the fact that Mx and MxA1Mx commute (i.e. My(MxA1 Mx)
(MxA1 Mx)Mx) and Mx is an idempotent matrix, suggested using an
orthogonal matrix H which diagonalises both quadratic forms in (90)
simultaneously, i.e.
=

F-k

''C1

(y)

r-k
i

.'ftl?

.=

) 1 t,?
=

(22.91)

Testing the independent sample assumption

22.3

t: H'c
=

N(0, c2Jw).

'v

(22.92)

Hence the value cx can be evaluated by


statement based on (91) for ca:

##z 1 (y)f
for a given size

c.)

Pr
i

.=

',.

the following probabilistic

'solving'

t7'.)(/

G0

x.

( 1970), however, suggested that in practice there was no need to


evaluate cx. Instead we could evaluate
Hannan

p(y)-Pr

F .-

1*uzur

1(

(vj-r1(y))tk? /9

(22.94)

and if this probability exceeds x we accept Ho' otherwise we reject Hv in


favour of HL* : p > 0. For H( : p
as the alternative 4 -z1(y) should be used
in place of r1(y). The value 4 stems from the fact that
-::0

'r1 (y) 2( 1
'v

lI

(22.95)

is the first-order residual correlation

coefficient and takes values between

and l hence 0 Gz1(y) < 4.


- 1
ln the case of the estimated money equation discussed
Watson test statistic takes the value

above

the Durbin-

z 1 (y) 0.376.

(22.96)

For this value p(y ) ().()()()and hence H () : /) 0


for any
size a test. The question which arises is,
do we interpret the rejection of
Ho in this case/' 'Can we use it as an indication that the appropriate
statistical GM is not k., j'x, + ut but (84)?5The answer is, certainly not. ln
no circumstances should we interpret the rejection of Hv against some
specific form of departure from independence as a confirmation of the
validity of the alternative in the context of the autocorrelation approach.
This is because in a misspecification testing framework rejection of Ho
should never be interpreted as equivalent to acceptance of H3', see Davidson
and McKinnon ( 1985).
ln the case where the evaluation of (93) is not possible, Durbin and
Watson ( 1950. 1951) using the eigenvalues of A1 proposed a bounds test
which does not depend on the particular observed data matrix X, That is,
=

1*.$stl.onJly

thow

l'ttjt?rlp/

Departures

from the sampling model assumption

they derived Ju and Ju such that for any X,


Ju %f 1(y) %Ju,

(22.97)

where dv and Ju are independent of X. This 1ed them to propose


bounds test for
against

the

(22.98)
and

L-'o

ry:z1(y) > Ju ).

(22.99)

ln the case where dv Gz1(y) %dv the test is inconclusive (seeMaddala (1977)
for a detailed discussion of the inconclusive region). For the case S(): p 0
should be used in (99).
against Sj : p <0 the test statistic 4
view
In
of the discussion at the beginning of this section the DurbinWatson (DW) test as a test of the independent sample assumption is useful
coefficient rl
in so far as it is based on the first-order autocorrelation
Because of the relationship (95)it is reasonable to assume that the test will
have adequate power against other forms of first-order dependence such as
MA(1) (seeKing (1983)
for an excellent survey of the DW and related tests).
Hence, in practice the test should be used not as a test related to an AR(1)
only but as a general first-order dependence test.
error autocorrelation
Moreover, the DW-test is likely to have power against higher-order
dependence in so far as the lirst order autocorrelation coefficient captures'
part of this temporal dependence.
=

-z1(y)

Higher-order

tests

of the linear regression model is easier to handle


under the independent sample assumption rather than when supplemented
with an autocorrelation
error model, it should come as no surprise to
discover that the Lagrange multiplier test procedure (see Chapter 16)
asymptotic
provides the most convenient method for constructing
misspecification tests for sample independence.
In order to illustrate the derivation of misspecification tests for higherorder dependencies among the y,s let us consider the LM test procedure for
the simple AR(1) case which generalises directly to an AR(rn) as well as a
moving-average (MA(rn)). Postulating an AR(1) form of dependency among
the yfs is equivalent to changing the statistical GM for the linear regression
model to
Given that estimation

#'Xf

+ Pt-

-P#'Xf

- 1

&r,

(22.100)

Testing the independent sample assumption

22.3

against
is Hv : p
as can be easily verified from (84).The null hypothesis
easier than its
under
much
is
of
Hv
estimation
100)
:
the
Because
(
0.
+
l'fl p
estimation under H3 the f-M-test procedure is eomputationally preferable
to both the Wald and the likelihood ratio test procedures.
The efficient score form of the faAf-test statistic is
=0

( log

z-A,1(y)
=

Lt,' y)
-

pp

I y.( #)-

log

-.-.

--

z-(#.,
)))
-.--

(')0

ln the case where Hv involves only a subset of 0, say Hv t?l 03


(f?1,0zj then the statistic takes the form
=

p log Lol #a) (l1l


(7,p,
'

tA.ftyl

lj nlc2 j j.z1) .. j
-

log Lo

where 0 E:E

z)

oo1
(22.102)

(see Chapter 1$.


ln the present case the 1og likelihood function was given in equation
and for ?j p and 0z BE (#, c2) it can be shown that:

(46)

2)

( 1 - /?
0

lv(04

(X'Vw-1 X)

(22.103)

(22.104)

(22.105)
and
LMy)

(22.106)

TI z2(1)
--

(see exercise 5). Hence, the Lagrange

multiplier

test is defined by the

from the sampling model assumptim

Departures

rejection

region

C'1

.ty:

LMy)

ca )

>

wherex

dz2( 1).

(22.107)

C'(y

The form of the test statistic in ( 106) makes a lot of intuitive sense given that
the first-order residual correlation coefficient is the best measure of firstorder dependence among the residuals. Thus it should come as no surprise
to learn that for zth-order temporal dependence, say
;f

/?

=:

-1-tl r

t - c

the f-M-test statistic takes the form


LMy)

Tl

(22. 108)

72rlc'r

v'

va

lflf
.+.

:jr''

'''
,..''

.y,

,.'.

,,.'

z
.''

2
;j

12
,

(22. 109)

m< F

(see Godfrey ( 1978), Breusch and Pagan ( 1980)).


This result generalises directly to higher-order autoregressive
and
moving-average error models. lndeed both forms of error models give rise
to the same AM-test statistic. That is, for the statistical GM, ),'f p'xt+ lf,
supplemented with either an AR(m) error process:
t 1, 2.
=

'f)

(22. 110)
or a MA(-) process:
(ii)

(22.111)

the J-M-test statistic for


for all i

1, 2,

for any i

1. 2,

takes the same form


LMy)

z' j

with

rejection

Cj

?J
1

z2(m),

(22. 112)

region

.ty

Lsly) y ca )

(22.113)

(seeGodfrey ( 1978), Godfrey and Wickens ( 1982/. The intuition underlying


this result is that because of the approximations involved in the derivation

22.4

Looking back

of the faM-test statistic the two underlying


distinguished asymptotically.

error processes

equivalent test, sometimes called the modified


test, can be based on the .R2 of the auxiliary regression (see (80)):
An asymptotically

be

cannot

LM-

(22. 114)
IIv

'rR2

with the

z2(/n)

rejection

(22. 1lf )

region
X

f
t.5.
'.

TR2 p: c

::t

'
.f ,

dz2(m).

(22. 116)

This is a test for thejoint significance of 7:


The auxiliary regression for the estimated

yielded'.
FR2

similar to ( 112) above.


money equation with ?n 6
-,.,u.

','a-

(22.117)

72(0.7 149)= 51.47.

Given that ('a 12.592 for a 0.05 the null hypothesis 11(): 7 0 is strongly
rejected in favour of Sl : + 0. Hence, the estimated money equation has
failed every single misspecification
test for independence showing clearly
that the independent sample assumption was grossly inappropriate. The
above discussion also suggests that the departures from linearity,
homoskedasticity and parameter time invariance detected in Chapter 2 1
might be related to the dependence in the sample as well.
=

'y

22.4

Looking back

In Chapters 20-22 we disctlssed the implications of certain departures from


the assumptions underlying the linear regression model, how we can test for
these assumptions
as w'ell as how we should proceed when these
assumptions are inN alid. ln relation to the implications of these departures
we have seen that the statistical results (estimation testing and prediction)
related to the linear regression model (seeChapter 19) are seriously affected,
with some departures, such as a non-random sample, invalidating these
results altogether. This makes the testing for these departures particularly
crucial for econometric modelling. This is because the first stage in the
statistical analysis of an econometric model is the specification and
estimation of a well-defined (no misspecifications) statistical model. Unless
we start with a well-defined estimated statistical GM any deductions such
as specification testing of a priori restrictions- economic interpretation of

Departures from the sampling model assumption

estimated parameters, prediction and policy analysis, will be misleading, if


not outright unwarranted, given that these conclusions will be based on
erroneous statistical foundations.
When some misspecification is detected the general way to proceed is to
respeclf'ythe statistical model in view of the departures from the underlying
of al1 the
assumptionts). This sometimes involves the reconsideration
underlying
statistical
model
such
assumptions
the
as the case of the
sample
assumption.
independent
One important disadvantage of the misspecification testing discussed in
Chapters 20-22 is the fact that most of the departures were considered in
isolation. A more appropriate procedure will be to derive joint misspecitication tests. For such tests the auxiliary regressions test procedure
discussed in Chapter 2 1 seems to provide the most practical way forward.
In particular it is intuitively obvious that the best way to generate a
misspecification test is to turn it into a specification test. That is, extend the
linear regression model in the directions of possible departures from the
underlying assumptions in a way which defines a new statistical model
which contains the linear regression model as a special case; under the null
that thc underlying assumptions of the latter are valid (seeSpanos (1985c)).
Once we have ensured that th underlying assumptions g1j-g8)of the
linear regression model are valid for the data series chosen we call the
estimated statistical GM with the underlying assumptions a well-dejned
estimated statistical model. This provides the starting point for
reparametrisation restriction and model selection in an attempt to
construct an empirical econometric model (seeFig. 1.2). Using statistical
procedures based on a well-defined estimated statistical model we can
proceed to reparametrise the statistical GM in terms of the theoretical
parameters of interest t: which are directly related to the statistical
parameters 0 via a system of equations of the form
'good'

G(p, () 0.
=

(22.118)

The reparametrisation
in terms of t: is possible only when the system of
equations provides a unique solution for ( in terms of o

H(p)

(22.119)

(explicitly or implicitly). ln such a case ( is said to be identlsed. As we can


see, the theoretical parameters of interest derive their statistical meaning
from their relationship to 0. In the case where there are fewer tfsthan 0is the
extra restrictions implied by ( 118) can (andshould) be tested before being
imposed.
lt is important to note that there is a multitude of possible
reparametrisations in terms of theoretical parameters of interest giving rise

Apandix

22.1

to a number of possible empirical econometric models. This raises the


question of model selection where various statistical as well as theoryoriented criteria can be used such as theorb' consistency,
parsimony,
encompassings robustness, and nearly orthoqonal explanatory variables (see
Hendry and Richard (1983:, in order to select the best' empirical
restriction of the estimated
econometric model. This reparametrisation
statistical GM, however, should not be achieved at the expense of its
statistical properties. The empirical econometric model needs to be a welldefined estimated statistical model itself in order to be used for prediction
and policy analysis.
Appendix 22.1
Assuming that

deriYing

zl

mt 1)

Z2

1%

E( 1, 1).

E( 1, F)

m(2)

Y(2. 1).

E(2, F)

mt T)

E( T,' 1)

E(T) T)

'v

Zw

the conditional exptation

we can deduce that

For

z, Af(r)

1.
(x''''
)
(m/vt''
,
mtr)

x(r)

' a 1 : (i. r)-a j 2(f,

a2 ! (. )-A22(,

1)

1)

f#rl

o.l1 1 (l)

)21
l -

(#, Z,0-1

Xf)

cotr) +

jk,(llx,+

c)1

2(r)

(l) f12a(/)

i + ,;(r)xr
)- gaftll),',
i
=

j1,

c()2(r)

Departures

from the sampling model assumption

c()(l) p1y,(!)
=

#7(r)

fza21

(l)
#;(f)

71 1

Important

al

f.51

2(r)f12-2'(r)mx(r)
'

(r)f122(r)
-

? 1 ct rlD

(i, r) -

2(f,

r)

2--21(

rlaz 1(, r),

f.,tyl z(r)f12-21(r)A22(,

r).

concepts

Error autocorrelation,

stationary

process, ergodicity,

mixing, an inno-

vation process, a martingale difference process, strong exogeneity,


Durbin-Watson
test, well-defined estimated
common factor restrictions,
statistical GM, reparametrisation? restriction, model selection.

Questions
the interpretation
autocorrelation approach

of non-independence in the context of the


that of the respecification approach.
Compare and contrast the implications of the respecification and
autocorrelation approaches for the properties of the estimators
HvL
(X'X) - 1X'y, sl g1/(F-klj',
y - Xj and the F-test for
Rp r against Hj : R## r.
Explain the role of stationarity, in the context of the respecification
approach, for the form of the statistical GM (34).
'T'he concept of a stationary
stochastic process constitutes
a
of
of
generalisation
the concept
identically distributed random
variables to stochastic processes.' Discuss.
of the
lnappropriate conditioning due to the non-independence
sample leads to time dependency of the parameters of interest #EEE
(#, c2).' Explain.
Compare the ways to tackle the problem of a non-random sample in
and autocorrelation
approaches.
the context of the respecification
Explain why we do not need strong exogeneity for the estimation of
the parameters of interest in (34).
Explain how the common factor restrictions arise in the context of the
autocorrelation approach.
Discuss the merits of testing the independent sample assumption in
the context of the respecification approach as compared with that of
the autocorrelation approach.
'Rejecting the null hypothesis in a misspecification test for error
independence in the context of the autocorrelation approach should
not be interpreted as acceptance of the alternative.' Discuss.

Compare

with

Appendix 22.1

Explain the exact Durbin-Watson


test.
i'T'he Durbin-Watson
test as a test of the independence assumption in
the eontext of the autocorrelation approach is useful in so far as it is
directly related to ?-1 Discuss.
error
Discuss the Lagrange multiplier test for higher-order
autocorrelation (AR(m) and MA(n1))and explain intuitively why the
test is identical for both models.
underlying the linear
14. State briefly the changes in the assumptions
non-independence
of the
regression model brought about by the
.-

sample.

Explain the role of some asymptotic independence restriction on the


non-random
memory of )Zr, t (E T) in defining the statistical GM for a

sample.

Exercises
sample assumption for p-and
Verify the implications of a non-random
Of
.$2as given by (i)-(vi)and (i)'--(vi)' Section 22.1 in the context of the
respecification and autocorrelation approaches, respectively.
Show that I) 1ogL(@,y)q, (p 0 where log L(0; y) given in equation (49)
=

is

non-linear

in p.

Derive the Wald test for the common factor restriction in the case
where I 2 and m 1.
Verify the formula FR2 LIRRSS URSS) 'RRSS? T given in equation
(75).,see Engle ( 1984).
Derive the f-ktf-test statistic for Ho /? 0 against H 1 : /? # 0 in the case
where t?t /?;, 1 + lI, as given in equation ( 106/.
=

Additional references

The dynamic linear regression

model

of the linear regression


ln Chapter 22 it was argued that the respecification
model induced by the inappropriateness of the independent sample
assumption led to' a new statistical model which we called the dynamic
linear regression (DLR) model. The purpose of the present chapter is to
consider the statistical analysis (specication, estimation, testing and
prediction) of the DLR model.
The dependence in the sample raises the issue of introducing the concept
of dependent random variables or stocastl'c processes. For this reason the
reader is advised to refer back to Chapter 8 where the idea of a stochastic
process and related concepts are discussed in some detail before proceeding
further with the discussion which follows.
The linear regression model can be viewed as a statistical model derived
by reduction from the joint distribution D(Z1
Zw; #), where
Z, < (yt,Xf')', and )Zt,l6T) is assumed to be a normal, independent and
identically distributed (N11D) vector stochastic process. For the purposes
of the present chapter we need to extend this to a more general stochastic
process in order to take the dependence, which constitutes additional
systematic information, into consideration.
ln Chapter 2 1 the identically distributed component was relaxed leading
to time varying parameters. The main aim of the present chapter is to relax
the independence component but retain the identically distributed
assumption in the form of stationarity.
In Section 23. 1 the DLR is specified assuming that tZ,, t c T) is a
stationary, asymptotically independent normal process. Some of the
approach
arguments discussed in Chapter 22 in terms of the respecification
will be considered more formally. Section 23.2 considers the estimation of
the DLR using approximate MLE's. Sections 23.3 and 23.4 discuss
.

526

23.1 Specification
misspecification and specification testing in the context of the DLR model,
respectively. Section 23.5 considers the problem of prediction. In Section
23.6 the empirical econometric model constructed in Section 23.4 is used to
the misspecification results derived in Chapters 19-22.
Sexplain'

23.1

Spification

In defining the linear regression model )Zt, t e: T) was assumed to be an


independent and identically distributed (llD) multivariate normal process.
ln dening the dynamic linear regression (DLR) model (Zf, t e T) is
assumed to be a stationary, asymptotically independent normal process.
That is, the assumption of identically distributed has been extended to
that of stationarity and the independence assumption to that of asymptotic
independence (seeChapter 8). ln particular, (Z/, l e T) is assumed to have
the following autoregressive representation:

zf

l A()z,-f

(23.1)

+ El,

where

c(z1L1))

El,t

YlA(f)Zl

i
Z/-

(Zf

1()

A(i)

Ef

and

(23.2)

-,

Zt

t7l
a21()
Z?

2,

2()
al
A22(f)

S(Z,,'c(Z,O-

Z1),

)),

(23.3)

with tEl,c(Zr0- j),r > rtll defining a vector martingale difference process,
which is also an innovation process (see Chapter 8), such that
(Er/''Z,0-1)

N(0,

n), n >

0,

It is important to note that stationarity is not a necessary condition for


the existence of the autoregressive representation (1).The existence of (1)
hn, and f being time invariant are the essential
with A(f), i 1, 2,
conditions for the argument which follows and both are possible without
the stationarity of )Zf, l 6 T). For homogeneous non-stationary stochastic
process (1)also exists but differencing is needed to achieve stationarity (see
Box and Jenkins (1976/. Homogeneous non-stationary processes are
=

The dynamic linear regression

model

characterised by realisations which, apart from the apparent existence of a


local trend, the time path seems similar in the various parts of the
realisation. ln such cases the differencing transformation
Ai

-z-)i,

where Ljzt

=(

zrt-i,

1, 2,

can be used to transform the original series to a stationary one (see


Chapter 8).
The distribution underlying ( 1) is D(Z, Zr0-j ; ) arising from the
sequential decomposition of the joint distribution of (Z1, Z2,
Zw):
.

Z.', #) refers to the joint distribution of the first m observation


D(Z1
interpreted as the initial conditions. Asymptotic independence enables us to
argue that asymptotically the effect of the initial conditions is negligible.
One important implication of this is that we can ignore the first m
observations and treat r m + 1,
T as being the sample for statistical
will be adopted for
inference purposes. ln what follows this
expositional purposes because proper' treatment of the initial conditions
can complicate the argument without adding to our understanding,
The statistical generating mechanism (GM) of the dynamic linear
regression model is based on the conditional distribution D( pf//xl, Zt0- 1,. #j)
related to D(Zl ?'Z,0-j ; ) via
,

-solution'

) D ).,,,'X,; z,0- (#')04x,

D(Z, ,''Z,0-, ;

,.

,/z,1.

,.

,j.

The systematic component is defined by

',''-

0
r- 1

( .k,
.

f -

((I

#;

(a12()

11

( )+

t:

!,

..

a,
1

2f1

)-j. ), X,0

a-cl

:.,.)1 zfkz-l

!'))

a a 1(

((x, .'x,...j

pv (1.c-c1 a j

.'.x: ),

.o

Ac.c(i'))- i

l 2.
,

(see Spanos ( 1985z)). The non-systvmatic component is defined by


1 ),
af )'f - E (.3./7-?6'
c(Zp0-1 xf xt), and satisfies the following properties-.
1
=

where

-..:);)

(23.6)

23.1 Specication
'(ut ) E $4E

ut

(F T24

t6

.- 1

)) 0
=

fturl.ksl E )f utus/qth'-

1.)

Go

c? 1 1 - o 12 fl 22

(23.9)

2
tl'o
0

21

(23. 10)

These two properties show that ut is a martingale difference process relative


with bounded variance, i.e. an innovation process (see Section 8).
to
is also orthogonal to the
Moreover, the non-systematic
component
systematic component, i.e.
'%

'ft

ET34

Eptut

'tprlf)

1))

-t/6-

0.

The properties ETI-ET3 can be verified directly using the properties of the
conditional expectation discussed in Section 7.2. ln view of the equality

o'lut ut
,

that
deduce
can
we
ET4)

u 1)

(23.12)

-@(',

(23.13)

E'(ur,.'c(Uf0-1 )) 0,
=

/)- lJf0- j (lf - 1 t1t - j2


i.e. u, is not predictablefrom
:=

N1

its

past.

tpwn

This property extends the notion of a white-noise process encountered so


far (see Granger ( 1980:.
As argued in Chapter 17, the parameters of interest are the parameters in
terms of which the statistical GM is defined unless stated otherwise. These
parameters should be defined more precisely as statistical parameters of
interest with the theoretical parameters of interest being functions of the
former. ln the present context the statistical parameters of interest are 0*
H(#l) where 0* H(j(), #1
am, c()2).
pm,al,
D(Zf/'Z,0,.
that
The normality of
#j and #a are variation free
) implies
weakly
exogenous with respect to 0. This suggests that 0* can
and thus Xr is
be estimated efficiently without any reference to the marginal distribution
D(X,,,/Zf0j
#z). The presence of Y,0-I in this marginal distribution,
however, raises questions in the context of prediction because of the
feedback from the lagged yrs. ln order to be able to treat the xfs as given
when predicting .I.f we need to ensure that no such feedback exists. For this
purpose we need to assume that
=

',

D(Xf,,''Zr0-j ;

a)

'L

(23.14)

(seeEngle et aI. ( 1983) for a

more detailed

.l)(X,,?'Xr0j ;

i.e. J't does not Gl'fgl?fyt?'


cause X

2),

m + 1,

The dynamic Iinear regression

model

discussion). Weak exogeneity of X, with respect to 0* when supplemented


with Granger non-causality, as defined in (14), is called stronl exogeneitq'.
Note that in the present context Granger non-causality is equivalent to
aclt)

0, i

1, 2,

(23.15)

(1),which suggests that the assumption is testable.


In the case of the linear regression model it was argued that, although the
joint distribution D(Zr; #) was used to motivate the statistical model, its
specification can be based exclusively on the conditional distribution
D(yr/Xf; #j). The same applies to the specification of the dynamic linear
regression model which can be based exclusively on Dyt/zt. I Xf; /1). ln
need to be imposed on the
such a case, however, certain restrictions
parameters of the statistical generating mechanism (GM):
in

IN

).,, #'oxf+
=

a-vf - +

Z #;x!-

(23.l6)

l/f,

ln particular we need to assume that the parameters


the restriction that a1l the roots of the polynomial
m -

am

a2,

a.) satisfy

) afz
m

z -

(aI

m -

1ieinside the unit circle, i.e. l-ft< 1, i l 2,


m (seeDhrymes (1978)). This
restriction is necessary to ensure that .fty;, t e: T) as generated by (16) is
indeed an asymptotically independent stationary stochastic process. ln the
case where m 1 the restriction is < 1 which ensures that
=

1
1

a2f

Covtyfllf

+,)

-+

:z

ta4

(7'211

as z --+

Jz

(see Chapter 8). lt is important to note that in the case where )Zt, t G T) is
assumed to be both stationary and asymptotically independent the above
restriction on the roots of the polynomial in ( 17) is satisfied automatically.
convenience let us rewrite the statistical GM ( 16) in the
For notational
concise
form
more
),f p*'X1
=

where

+ u,,

For the sample period


matrix fonn:

y =X*#*

+ u,

r > m,

m + 1,

'fj
.

( 18) can be written in the following

(23.19)

23.1 Specification
where y: (T- rn) x 1, X* : (T- m) x rrltk + 1). Note that xr is k x 1 because it
1) 1 vectors',
rrl, are (/
includes the constant as well but xf - f, i 1, 2,
- x
this convention is adopted to simplify the notation. Looking at ( 18) and (19)
the discerning reader will have noticed a purposeful attempt to use notation
which relates the dynamic linear regression model to the linear and
stochastic linear regression models. lndeed, the statistical GM in (18)
and (19)is a hybrid of the statistical GM's of these models. The part
is directly related to the stochastic linear regression model in
Z)''-1 fyf -f conditioning
view of the
on the c-field c(Y,0- ) ) and the rest of the systematic
extension of that of the linear regression model.
being
direct
component
a
=

will prove very important in the statistical analysis of the


This relationship
of
dynamic linear regression model discussed in what
the
parameters

follows.
ln direct analogy to the linear and stochastic linear regression models we
need to assume that X* as defined above is of full rank, i.e. ranktx*)
1)'.
ml + 1) for all the observable values of Y?- j H ()..,, + 1
)'vThe probability model underlying ( 16)comes in the form of the product of
the sequentially conditional normal distribution Dtyf Zy- 1 Xf; 1), t > F/1.
T the distributon of the sample is
For the sample period t m + 1,
=

.p.,

(23.20)
sample from
The sampling model is speeified to be a non-random
)'vl' can be viewed as a nonD*(y; /1). Equivalently, y EEE(yu+ 1, )u + c,
T)
random sample sequentially drawn from D( bj/zt.- l X,; 41),r m + 1,
respectively. As argued above, asymptotlcally. the effect of the initial
conditions summarised by
.

DIZ 1 ; #)

17D .'? Z)'-

Xr

.=

',

#)

can be ignored: a strategq adopted in Section 23.2 for expositional


pumoses. The interested reader should consult Priestley (198 1) for a
readable discussion of hovs the initial conditions can be treated. Further
discussion of these conditions is given in Section 23.3 below.

The dynamic Iinear regression

0)

vt

'oxt
=

+
i

F
=

spec+cation

GM

The statistical

model

x i )'r

+
f

Y1 p'ixf - i + ut
=

(23.22)

The dynamic Iinear regression model

(23.23)

Ll (1
The

(statistical)parameters of interest are


0* EEEE(a 1

(x.2 ,

a,,,,

#I

v,

#mc()2);
,

see Appendix 22. 1 for the form of the mapping 0* H(#1).


Xp is strongly exogenous with respect to #*.
>(a1,
The parameters
aml' satisfy the restriction
that all
polynomial
of
the roots
the
=

g3(I
g4(I

.12,

E5q
(11)

jj

m - i

ai

are less than one in absolute value.


ranktx*)

m(k + 1)for a1l observable

The probability

E61

m-

)wm

T > m (k+ 1).

model

'z)L 1, X,; p*)

D()'t

values of Y7-.,

1
exp
J'''''l''''''--o
x ( )

1c / (
'

pr -

#*'X))2

D()',?'Z,0- j X,; p*) is normal;


7( ',/'c(Y,0- j),
xr0 x,0) .'x;b .- Iinear in x),'
(ii)
Vartyf/ctYto- j), X,0 xfo) c()2 homoskedastic
(iii)
X) );
0* is time invariant.

(i)

(111)

The sampling

E81

y EEE()',,,+

(free of

model

asymptotically
is a stationary,
)',n+ 2,
independentsample sequentially drawn from D()4 Zr0- ,
'F respectively.
t m + 1, m + 2,
-pw)'

,xl,.p*),

N()te that the above specification is based on D(y, Zl0- j Xr; p*j, t m + 1,
T) directly and not on D(Z,,''Zf0- j ; #). This is the reason why we need
.
assumption (4(1in order to ensure the asymptotic independence of
,

J(
t

J't /Z0t -

1 !,

Xl ) t >
',

n1l.f

23.2
23.2

Estimation

Estimation

ln view of the probability and sampling


likelihood function is defined by

model

assumptions

(6qto (8()the
(23.25)

and
log 1-(p*,'y) const
=

(T- rn)
-

1og c/

1 -X*#*)'(y
- 2/.0 (y

-X*#*),

(23.27)

:20

1
-

/
-

(F - m)

(23.28)

li
y - X*J*The estimators /* and 42o are said to be appvoximate
where
estimators (MLE's) of #* and c()2, respectively, bause
Iikeliood
nmximum
conditions
initial
have been ignored. The formulae for these estimators
the
similarity
between the dynamic, linear and stochastic linear
the
bring out
the similarity does not end with the formulae.
models.
Moreover,
regression
for the dynamic linear regression model can
statistical
GM
the
Given that
viewed
the
other
of
two models we can deduce that in direct
be
as a hybrid
linear
regression model the finite sample
the
stochastic
analogy to
(vl
of
and
distributions
are likely to be largely intractable. One important
J*
dynamic
and stochastic linear regression models,
difference between the
however, is that in the former case, although the orthogonality between pt
and l1t (pf Ik?) holds in the decomposition
(23.29)
t', /t, + l?r. for each t > m,
it does not extend to the sample period formulation
=

(23.30)

y p+u
=

for z < 0

(23.31)

The dynamic Iinear regression model

One important
Ep-'b

implication

#*) Sg(X*/X*)- 1X*'uj

:#:

/* is a biased estimator

of this is that

0,

of

#*,i.e.
(23.32)

This, however, is only a problem for a small T because asymptotically #*>


()2)
enjoys certain attractive properties under fairly general conditions,
including asymptotic unbiasedness.

(J*,

Asymptotic properties

of

#y

Using the analogy between the dynamic and stochastic linear regression
models we can argue that the asymptotic properties of 11depend crucially
on the order of magnitude (seeChapter 10) of the information matrix Iv0*)
dcfined by
'(X*'X*)
Ir(#*)

cp

(23.33)

This can be verified directly from (27)and (28).lf E(X*'X*) is of order Op(T)
then Iw(p*) Op(F) and thus the asymptotic information matrix I.,(p *)
limw.-.((1/F)Iw(p*)j <
Moreover, if GwEB S(X*?X*/T) is also nontsufciently
large', then the asymptotic properties of
singular for all T
as
of
MLE
be
Let us consider the argument in more detail.
0*
deduced.
can
a
If we define the information set
(c(Y)L1), X,0 x/)) we can deduce
1
that the non-systematic component as defined above takes the form
=

,'.'.f,

,6'-

nr
=

)'? Ef.t,ih-

1),

.f(/.r,

(23.34)
hrj)

and the sequence


represents a martingale difference, i.e.
r>
j)
0, t > m (seeSection 8.4). Using the limit theorems related to
'(l,
martingales (seeSection 9.3) we can then proceed to derive the asymptotic
properties of #y.
,#(,

.%-

In view of the following relationship:

(,
*

tf we

ian

j*)

(X*'X*

'j-

ttat

ensue

yj,x-* x
'

hy

.0

a4a .-ss

then

we can use

( ),
X*'tl

the strong

$7%A%

taw of targenumbes fo

mattqtates

to stow

Estimation

23.2

(-X*

'u

-j
a'S.
--,

(23.37)

().

Hence,
a %.
.

'*+

...#

p.

The convergence in (37) stems from the fact that .tfX)l?f,


martingale difference given that
,h'

E(Xt3ut/.%

1)

Xtt.qElut/eh
-

:)

0, i

1, 2,

m(k + 1).

defines a

(23.38)

This suggests that the main assumption underlying the strong consisteney
or p-*is the order of magnitude of F(X*'X*) and the non-singularity of
F(X*'X*/F) for T > m (k+ 1). Given that X) EEE(.J.7, 1 yf 2,
.J7f-?n, N, xf 1
conditions
F(X*'X*)
satisfies
if the crossthese
we can ensure that
.
products involved satisfy the restrictions. As far as the cross-products which
ipvolve the y,-fS are concerned the restrictions are ensured by assumption
afz'n - i) 0. For the xts we need to assume
(4J on the roots of (2n j7--11
,

,xt-.)

that:

<
c, i
lxffl

1, 2,
k, t (EFT, C being a constant;
.E1,/(T-z)q
limw7--Jx,x;-,= Qzexists for z > 1 and in particular
non-singular.
also
Qo is

(i)
(ii)

a S
. .

These assumptions ensure that '(X*'X*) ()( F), i.e. Gw 0( 1)and Gw G


where G s non-singular. These in turn imply that Ir(#*) 04 F) and
1.4#*) 0( 1) which ensures that not only (37) holds but
--+

a S.
.

ri'g
-+

(23.39)

c.

by multiplying
(#1 0w) with x? T (the order of its standard
normality
deviation) asymptotic
can be deduced:
Moreover,

/''
' T( #+ #+)
-

'

1
x(() j (p*)- )

(23.40)

That is,
...'

'r(:

o
(

' Tdo

-, v )
-

j!

c())

x
J

x(0,cpG - ),

(23.41)

4.
-N(0,2co).

(23.42)

The dynamic linear regression model

The

lwc(7/tlconsistency

of

#w*
as

an estimator

of 0*, i.e.

#*
F

--+

(23.43)

p*

follows from the strong consistency (seeChapter 10). Moreover, asymptotic


d/ictnktry follows from (40).
If we compare the above asymptotic properties of #) with those of .s
(/ 42) in the linear regression model we can see that their asymptotic
properties are almost identical. This suggests that the statistical testing and
prediction results derived in the context of the linear regression model
which are based on the asymptotic properties of kvare likely to apply to the
present context of the DLR model. Moreover, any finite sample based
result, which is also justifiable on asymptotic grounds, is likely to apply to
the present case with minor modifications. The purpose of the next two
sections is to consider this in more detail.
=

Example
The tests for independence applied to the money equation in Chapter 22
showed that the assumption is invalid. Moreover, the erratic behaviour of
estimators
and the rejection
of the linearity and
the recursive
assumptions
Chapter
confirmed
1
homoskedasticity
in
2
the invalidity of
values
conditioning
observed
of
only.
Xf
ln such a case
the
on the current
natural
proceed
appropriate
statistical
model
the
is to respecify the
way to
for the modelling of the money equation so as to take into consideration the
time dependence in the sample.
ln view of the discussion of the assumption of stationarity as well as
economic theoretical reasons the dependent variable chosen for the
postulated statistical GM is m;b ln (,hzJ,/'#,)(seeFig. 19.2). The value of the
maximum lag postulated is m 4, mainly because previous studies
restriction adequate to
demonstrated the optimum 1ag for
characterise similar economic time series (see Hendry (1980/. The
postulated statistical GM is of the form
=

'memory'

ln:

()v+

'ximJ-

(//1

i+

-)';

+ c'l Q1,+trzQ2,+ caoaf

pzipf - i + (lsiit -

f)

(23.44)

+ ut.

where Qit,i 1, 2, 3, are three dummy variables which refer to important


changes:
monetary policy changes which 1ed to short-run
=

tunusual-

(Q1/-197?fr)The

introduction

of competition and crpf/l'r control.

23.2

Estimation

Table 23, 1. E.%tinltltetl srtkrf-srcltJl

(Qar- 19

',75

i t..)

GM

l-le sl./spt?n.tft??'lq.Jthe

th t? banks

to

('/'?(?nl'lt?/

t-tp/-stpr (1;1(1

r/?e

?7t,u'

r/7t.,Btlnk

llltliial.l

t?/'

Enlland t/s/t?t/
J/t?/-st-?k//
vjol

t/)4lp)'

Ioans.
I-he in rroc/nclfo'l f?/- A.11 as t/ tnonetar )' t(1?-gt.:1
The estimattd coeftcients for the period 1964f 1982f1, are shown in Table
23. 1. Estislatlon of (44)with htnl as the dependent variable changed only
the goodltlss-of-fit measure as given by R.2 and #2 to R2 0,796 and /12
0.7 11.The change measures the loss of goodness of fit due to the presence of
()Lt
a trend (compare Fig. 19.2 with 2 1.2). The parameters p EEEF(jt). jl (lzi
c2()j
in terms of whicl the statistical G M is
ui ,. I i 0- 1, 2, 3. 4- c I t'a- t. a.
defined are the statistical and not the (econonic ) theo retical pa rameters of
interest. ln order to be able to determ intt the latter (using specification
statistical G M is well
testing), we need to ensure first that the klstilrlatel
assumptiens
tlnderlying
is.
11
That
that
the
defined.
the statistical
g) 1))8-J
is the task of
model are indeed : al id. Testing for these assun-iptions
nisspecificttion tcsti ng in the context of the dynamic linear regression
naodel consid t?red i1-1 t ht2 next sectio n
Looking at the time graph of the actual (y,)and fitted ( f'r)(see Fig. 23. 1)
pr quite closely
values of the dependent variable we can see that t
for the estimation period. This is partly confirmed by the value of the
variance of the regression which is nearly one-tenth of that of the money
equation estimated in Chapter 19,*see Fig. 23.2 showing the two sets of
residuals. This reduction is even more impressive given that it has been
achieved using the same sample information set. In terms of the 5.2 we can
see that this also confirms the improvement in the goodness of fit being

((? t-l 9*J1.)

ttracks'

The dynamic Iinear regression model


k'
$'
!'

0.075
0.050

actual'
l

0.025

l
l
l
l

l
j

$
v

I
t/

..1

p
/l
l
l

t
l

lw

f'-

$,

ZZ

0 025

f ktlsd

1/

l
k

-0.050
-0.075

1964

1967

1970

1973
Ti m e

Fig. 23. 1. The time graph of actual


estimated rcgression in Table 23.1

Fig. 23.2. Comparing the residuals


and (23.44)
(see Table 23.1).

1976

T.,

1979

1982

A ln (uYJ,,/#),and fitted fr from the

from the estimated

regressions

(19.66)

more than double the original one with ml as the dependent variable and y,,
Only
the regressors (see Chapter 19).
pf and it
Before we proceed with the misspecification
testing of the above
estimated statistical GM it is important to comment on the presence of the

23.3

Misspecification

testing

539

three dummy variables. Their presence brings out the importance of the
background in econometric modelling.
sample period economic
Unless the modeller is aware of this background the modelling can very
easily go astray. ln the above case, leaving the three dummies out, the
normality assumption will certainly be affected. lt is also important to
remember that such dummy variables are only employed when the
modeller believes that certain events had only a temporary effect on the
relationship without any lasting changes in the relationship. ln the case of
longer-term effects we need to model them, not just pick them up using
dummies. Moreover, the number of such dummies should be restricted to
be relatively small. A liberal use of dummy variables can certainly achieve
wonders in terms of goodness of fit but very little else. lndeed, a dummy
variable for each observation which yields a perfect fit but no explanation' of
any kind. In a certain sense the coefficients of dummy variables represent a
measure of our ignorance'.
thistory'

23.3

Mi%pecification

testing

As argued above, specification testing is based on the assumption of correct


specncation, that is, the assumptions underlying the statistical model in
question are valid. This is because departures from these assumptions can
invalidate the testing procedures. For this reason it is important to test for
the validity of these assumptions before we can proceed to determine an
empirical econometric model on the sound basis of a well-defined estimated
statistical GM.

(1)

Assumption underlying

(2lf

the statistical

Assumption I(1Jrefers to the definition of the systematic and non-systematic


components of the statistical GM. The most important restriction in
defining the systematic component as

pt
=

f(.A'/ CIY/0-1), Xf0 xr0l


=

p'vx+

) taf.y,,+ #;x,

-f)

-f

(23.45)

is the choice of the maximum /tktym. As argued in what follows an


inappropriate choice of m can have some serious consequences for the
estimation results derived in Section 23.2 as well as the specification testing
results to be considered in Section 23.4 below.
(i)
large'
m chosen
lf m is chosen larger than its optimum but unknown
'too

value

m* then

Snear

The dynamic Iinear regression model

collinearity' (insufficientdata information) or even exact collinearity will


'creep' into the statistical GM (see Section 20.6). This is because as m is
increased the same observed data are
to provide further and further
of
about
number
unknown parameters. The
information
an increasing
insufficient
of
data
information
implications
discussed in the context of the
regression
problem
be
applied
linear
to the DLR model with some
can
reinterpretation due to the presence of lagged yts in X).
'asked'

m cbosen

small'

'too

small' then the omitted lagged Zrs will form part of the
lf m is chosen
unmodelled part of )'l and the error term
'too

(23.46)
is

no

longer non-systematic
.%

kfutvytnt

relative to the information set

(c(Yt0- j ) X;0 x;0)


=

(23.47)

> rnj. will no longer be a martingale diflrence, having vel'y


serious consequences for the properties of #* discussed in Section 23.2. In
particular the consistency and asymptotic normality of #* are no longer
valid, as can be verified using the results of Section 22.1,. see also Section 23.2
above. Because of these implications it is important to be able to test for
statistical GM is given by
m < m*. Given that the

That is.

'true'

*1 +

#'oxf+ j

)),

the error term

((zf-'J

j'fxr j)
-

ul can

(23.48)

+ u,,

be written in the form


(23.49)

This implies that m < m* can be tested using the null hypothesis
H o : ap,* 0

a nd

where

x.*K

+. 1
(.1,,,

pm*0
=

a.)'

aga in s t

#.*EEEE(jm.#j

H 1:
,

<

#0

#.*+ 0

#,,,.)
.

The obvious test for such a hypothesis will be analogous to the F-type test
for independence suggested in Chapter 22. Alternatively, we could use an
asymptotically equivalent test based on FR2 from the auxiliary regression

;b
-(,()

-#'x +
r

(af).'r-f+ #;x,-f)+cf,

(23.50)

Misspecification

23.3
where

testing

refers to the residuals from the estimation

of

f) + ut..
+
v p'vxt+ Yltaiy,- i #'fxf-

The rejection region is defined by

(71

ty

FR2

>

c.)

dzzunl*

n1)/t-1

(23.52)

L'y

(see Chapter 22 for more details).


The F-type test for the money equation
n1* 6 yielded

estimated

in Section 23.3

with

-0.009

Fr(y)

0.010 49
0.009 65

43

65

0.467.

(23.53)

2. 14 for a 0.05. Ho is strongly aecepted and the value of


seems appropriate.
m
As argued in Chapter 22. the various tests for residual autocorrelation
for
can be used in the context c)f the respecification approach as tests
of
independence assumption especially when the degrees freedom are at a
premium. ln the present context the same misspecification tests can be used
with certain modifications as indirect tests o m < m*. As far as the
asymptotic Lagrange multiplier (.LM) tests for AR(p) or MA(p), where
pb 1, based on the auxiliary regressions are concerned, can be applied in
the present context without any modifications. The reason is that the
presence of the lagged yts in the systematic component of the statistical GM
(51) makes no difference asymptotically. On the other hand, the Durbin-

Given that

t'a

=4

W'r.st?n (DW) test is not applicable in the present context because the test
depends crucially on the non-stochastic nature of the matrix X. Durbin
of the DLR
1 + ur) in the context
( 1970) proposed a test fpr AR(1) (.u)
statistic
so-called
by
defined
ll-test
model based on the
1
F- m
N(0, 1)
(23.54)
b ( 1 - JD I4z-(y))
1-(T-rn) Vaitafl=pu1-

x.

<

under Ho: p

0. Its rejection region takes the form

'X

cl

.fty

1n1
>

czy'.
,

1a

4() d-

(23.55)

Ca

where 44 ) is the density function of a standard normal distribution. lt is


interesting to note that Durbin's ll-test can be rationalised as a Lagrange
multiplier test. The Lagrange multiplier test statistic for a first-order
'

The dynamic linear regression model

dependency (AR(1) or MA( 1)) takes the form


T(y)

1-*

(23.56)

l1I1 w*vartti'l)
Ho

and z(y)

(198 1:,

(see Harvey

z2(1). Noting that

'v

(23.57)

-#Dw),

=k1

1%1

we can see that the Durbin ll-test can be interpreted as a Lagrange


multiplier (faM) test based on the first-order temporal correlation
coefficient
and
As in the case of the linear regression model the above Durbin's
the f,M() (/1 1) test can be viewed as tests of sijnificance in the context of
the auxiliary regression
-test

?%=J'x)

l pflir-f + nr,

+
f

(23.58)

The obvious way to test


Ho : pl

pc

'

'

'

=0,

=pI

for any i

S1 : pi #0,

1,2,

,1
.

(23.59)
is to use the F-test approximation which includes the degrees of freedom
chi-square
form. The
correction term instead of its asymptotic
autocorrelation error tests will be particularly useful in cases where the Ftest based on (48) cannot be applied because the degrees of freedom are at a
premium.
ln the case of the estimated money equation in Table 23.1 the above test
yielded:
statistics for a
=0.05

1.19,

LM

(2): FF(y)

LM

(3):

LM

(4): FT(y)

FF(y)=(

cx

1.96,

0.0 10487 -0.0 10 339


-

()s()j();gj)

0.010500 -0.010 291


().()j() ajlj

1)
1(-j-1-0.318,
..:-1-0.23
1

0.010 466

-0.010

().()j() aj5

49

255

47

45

-0.350,

3.18,

(23.60)

c.=2.80,

(23.61)

cx

1, c.=2.5

/.

(23.62)

As we can see from the above results in all cases the null hypothesis is
(53)above.

accepted confirming

23.3

Misspification

543

testing

The above tests can be viewed as indirect ways to test the assumption
postulating the adequacy of the maximum lag rn. The question which
naturally arises is whether m can be determined directly by the data. ln the
statistical time-series literature this question has been considered
extensively and various formal procedures have been suggested such as
Akaike's AlC and BIC or Parzen's CAT criteria (see Priestley (198 1) for a
readable summary of these procedures). In econometric practice, however,
it might be preferable to postulate m on a priori grounds and then use the
above indirect tests for its adequacy.
specifies the statistical parameters of interest as being 0* HE
Assumption g2(I
(J1
xm, #(),#1
#,,,,tr()2). These parameters provide us with an
opportunity to consider two issues we only discussed in passing. The first
issue is related to the distinction made in Chapter 17 between the statistical
and (economic)theoretical parameters of interest. ln the present context 0*
as detined above has very little, if any, economic intemretation. Hence, 0*
represents the statistical parameters of interest. These parameters enable us
to specify a well-defined statistical model which can provide the basis of the
tdesign' for an empirical econometric model. As argued in Chapter 17, the
estimated statistical GM could be viewed as a sufficient statistic for the
theoretieal parameters of interest. The statistical parameters of interest
with the
parametrisation
provide only a statistically
(sufficient)
theoretieal parameters of interest being defined as functions of the former.
This is because a theoretical parameter is well defined (statistically) only
statistical
parameter. The
when it is directly related to a well-defined
determination of the theoretical parameters of interest will be considered in
Section 23.4 on specification testing. The second related issue is concerned
collinearity. ln Section 20.6 it was argued that
with the presence of
and
relatiy e to a giN en parametrisation
defined
collinearity
is
enear'
collinearity
likely
it
that
is
In
the
context
(or
present
information set.
insufficient data information) might be a problem relative to the
parametrisation based on 0*. The problem, however, can be easily overcome
a
in determining the theoretical parameters of interest so as to
Both issues
theoretical parametrisation.
parsimonious as well as
will be considered in Section 23.6 below in relation to the statistical GM
estimated in Section 23.2.
postulates the strong exogeneity of Xt with respect to the
Assumption (.3(J
7 As far as the weak exogeneity component
parameters 0* for t m + 1
concerned
it will be treated as a non-testable
of this assumption is
of
the
linear regression model (seeSection
presupposition as in the context
20.3). The Granger non-causality component, however, is testable in the
context of the general autoregressive representation
,

tadequate'

'near'

-near'

Sdesign'

'robust'

The dynamic linear regression

model

#l1

At

A(j)Z,

-f

+ E,

(23.63)

'l'.r

does not 'Granger cause' Xr if x21(f) 0


(see Appendix 22. 1).In particulars
2.
1,
This
i
for
suggests that a test of Granger non-causality can be
m.
based on Ho a21(f) 0 for all i 1,2,
m against H3 : aa1(f) # 0 for any i
l?y.
where
2,
For
the
1,
1,
Granger ( 1969) suggested the Wald test
k
case
statistic
.y'l

(23.64)
where URSS and RRSS refer to the residuals sums of squares of the
/'n, respectively. The
regressions with and without the .-js, i 1. 2,
rejection region takes the form
.,

C1

f
tX

'

'

14S

> (' z>

(23.65)

The Wald test statistic can be viewed as an F-type test statistic and thus a
natural way to generalise it to the case where k > 1is to use an F-type test of
significance in the context of multivariate linear regression (seeChapter 24).
For a comprehensive survey of Granger non-causality
tests see Geweke
( 1984).
Assumption g4qrefers to the restrictions
on x needed to ensure that
l )'r, t s T ) as generated by the statistical GM:

(23.66)
is an asymptotically
form
alfal.pf
=

where

independent stochastic process. If we rewrite

w, + n1,

(66)in' the
(23.67)

and treat it as a diflrence equation, then its solution (assumingthat (fz)


has ??7 distinct roots) takes the general form

=0

(23.68)

Misspification

23.3

testing

g(r),called the complementary function, is the solution of the homogeneous


cu)' are constants
difference equation at1.alyr 0 and c EEE(t'1 cc,
via
determined by the initial conditions yj
)u
=

.v2,

171
-

-1-c 2 +

'

'

'

+ t?2/.

( 1 /.l
.

m
''

+ c m?vm
.

i>.

''

''

'

(23.69)

independence
In order to ensure the asymptotic
(stationarity) of
T)
g
need
this
l
decay
: )'t,
component to
to zero as t
we
w in order for
Priestley
198
conditions
initial
T)
(E
l
the
.ty,,
( 1/. For this to
to
(see
polynomial
the
of
which
;.z,
the
2,,,),
roots
be the case (21
are
-+

iforget

.,

(23.70)
should satisfy the restrictions

12f < l

(23 1)
.7

l 2,

1.=

rn.

,'';.,.

Note that the roots of a(1-) are ( 1


When the restrictions
hold

and

), i

1, 2,

lim gt)
l

-+

m.

.72)

(23

0.

As argued in Section 23. 1 in the case where f#Z,, r G Tlj, is assumed to be a


stationary, asymptotically independent process at the outset, the time
invariance of the statistical parameters of interest and the restrictions
?77. are automatically
satisfied.
1, 2,
< 1, i
ln order to get some idea as to what happens in the case where the
< 1 1
1 2.
restrictions
m- are not satisfied let us consider the
where
ll
and
1 (,:z
1), i.e.
l
;.k
simplest case
1

Il

I2f

(23.73)
Cov(y,),,

z)
-

c2()r,

(23.74)
.f

These suggest that both the mean and covariance of $.yf, l c -1J-)increase with
.
t and thus as l --+ C/S they become unbounded. Thus yf, r (E T), as generated
by (73),is not only non-stationary (withits mean and covariance varying
remains constant.
with r) but also its

'memory'

Tlle dynamic linear regression model


> 1 again
Ia1I

In the case where

y,, t c T) is non-stationary

-at

Z'(.Ff)

1
1-

Wt

(: j

and

CovtA',yf +,)

co2

since

lt

x1

a'k

(; (.)yj)

a1

and thus .E'(yt) tx) and Covtyfyrsr) ct;. as t -+ ). Moreover, the


tmemory' of the process increases as the gap z -.+ c/a !
< 1 is not
In the simplest case where m 1 when the restriction
satisfied we run into problems which invalidate some of the results of
Section 23.2. In particular the asymptotic properties of the approximate
MLE'S
of 0* need to be modified (seeFuller (1976)for a more detailed
discussion). For the general case where m > 1 the situation is even more
complicated and most of the questions related to the asymptotic properties
of #* are as yet unresolved (seeRao (1984:.
x't, x; 1,
Assumption ES, relating to the rank of X* E (yf 1,
yf
ln the case
x;-m), has already been discussed in relation to assumption E1q.
where for the observed sample y the rank condition falls, i.e.
-+

-+

1211

-.,

ranktx*) n < mk + 1),


(23.76)
a unique #*does not exist. If this is due to the fact that the postulated m is
=

much larger than the optimum maximum 1ag m* then the way to proceed is
to reduce rrl. 1f, however, the problem is due to the rank of the submatrix
xwl' then we need to reparametrise (seeChapter 20). In
X Htxj,
x2,
either case the problem is relatively easy to detect. What is more difficult to
collinearity which might be particularly relevant in the
detect is
As argued above, however, the problem is relative to a
context.
present
and thus can be tackled alongside
the
given parametrisation
reparametrisation of (48) in our attempt to design' an empirical
econometric model based on the estimated form of (48)(seeSection 20.6).
.

%near'

(2)

Assumptions underlying

the probability

model

The assumptions underlying the probability model constitute a hybrid of


those of the linear and stochastic linear regression models considered in
Chapters 19 and 20 respectively. The only new feature in the present context
is the presence of the initial conditions coming in the form of the following

distribution:

D(Z1 ; /)

1-1Dyt//=

1,

X,;

4),

(23.77)

where 4 =H(#). For expositional purposes we chose to ignore these initial


conditions in Sections 23.1 and 23.2. This enabled us to ignore the problem

23.3 Misspecification testing


of having to estimate the statistical GM
t-1

#f

#lxr+

l-1

Z
=

ai-vr-f +

Z #;xt-f+ ut
=

(23.78)

for the period l 1, 2,


rn. The easiest way we can take the initial
conditions into consideration is to assume that 4 coincides with 0*. If this is
adopted the approximate MLE'S #*will be modied in so far as the various
summations involved will no longer be of the form jr- P! + 1 uniformly but
v +fxl-fyt,
i 1, 2,
,- z
m. That 1s, start summing
L-r 1 +iXIX)-f and
where
observations become available for each individual
from the point
component.
As far as the assumptions of normality, linearity and homoskedasticity
are concerned the same comments made in Chapter 21 in the context of the
linear regression model apply here with minor modifications. In particular
the results based on asymptotic theory arguments carry over to the present
case. The implications of non-normality, as defined in the context of the
linear regression model (seeChapter 21), can be extended directly to the
DLR model unchanged. The OLS estimators of j* and co2have more or less
the same asymptotic properties and any testing based on their asymptotic
distributions remains valid. In relation to misspecication
testing for
departures from nonuality the asymptotic test based on the skewness and
kurtosis coefficients remains valid without any changes. Let us apply this
test to the money equation estimated in Section 23.3. The skewnesskurtosis test statistic is z(y) 2.347 which implies that for a 0.05 the null
hypothesis of normality is not rejected since ca 5.99.
Testing linearity in the present context presents us with additional
problems in so far as the test based on the Kolmogorov-Gabor polynomial
(see (21.11) and (21.77))will not be operational. The RESET type test,
however, based on (21.10) and (2 1.83) is likely to be operational in the
tl
t3
present context. Applying this test with j$l and
were excluded because
of collinearity) in the auxiliary regression
=

pr J'x
=

+ y.4 + t',

(23.79)

which does not reject Ho.


yielded: FT(y) 2.867, ca 4.03 for a
Testing homoskedasticity in the present context using the KolmogorovGabor parametrisation or the White test (seeNicholls and Pagan (1983))
presents us with the same lack of degrees of freedom' problem. On the other
hand, it might be interesting to use only the cross-products of the xfts Only
as in the linear regression case. Such a test can be used to refute the
conjecture made in Chapter 21 about the likely source of the detected
heteroskedasticity. It was conjectured that invalid conditioning (ignoring
=

=0.05,

548

The dynamic linear regression model

the dependence in the sample) was the main reason for rejecting the
homoskedasticity assumption based on the results of the Whitc test. An
obvious way to refute such a conjecture is to use the same regressors /1,.
/6f in an auxiliary regression where tIt refers to the residuals of the
.
dynamic money equation estimated in Section 23.3 above. This auxiliao
regression yielded:
.

TR2

5. 11,

FF(y)

0.830.

(23.80)

The values of both test statistics for the significance of the coefficients of the
kits reject the alternative (heteroskedasticity)most strongly; their critical
values being ca 12.6 and t'a 2.2 respectively for
0.05. The time path of
the residuals shown in Fig. 23.345) exemplifies no obvious systematic
,x

variation.
ln the present context heteroskedasticity takes the general form
Vart r,/'c(Y0,,-j),

x,0
=

x,t') /7(v,0.) x,0).


=

This suggests that an obvious way to


the problem of applying the
White test' is to use the lagged residuals as proxies for Z,0- j That is, use
tl..v
r2-1 tl. a,
as proxy regressors in the auxiliary regression:
'solve

li2
t

t'tj

(.':

tl.

+ cztl.- a +

+ cgth

gt

(J,3.8.'2)

and test the significance of c'1,


cv. This is the so-called ARCH
(autoregressive conditional heteroskedasticity) test and was suggested by
Engle (1982).ln the case of the above money equation the ARCH test
statistic for p
takes the value F'Ijyj 0.074, with critical value cz 2.5 1
for x 0.05. Hence, no heteroskedasticity is detected, confirming the
White test given above.
ln Chapter 22 it was argued that the main reason for the detected
parameter time dependence related to the money equation was the invalid
conditioning. That is, the fact that the temporal dependence in the sample
Having conditioned
was not taken into consideration.
on a11 past
information to definc the dynamie linear regression model the question
arises as to what extent the newly defined parameters are time invariant. ln
Fig. 23.3())-(J) we can see that the time paths of the estimated recursive
coefficients of J'l, pr and it exemplify remarkable time invariance for the last
40 periods. lt might be instructive to compare them with Fig. 2 1.1(h)-(J) of
the static money equation.
.

=4

23.4

tshort'

Spification

testing

Specification testing refers to testing of hypotheses related to the statistical

23.4 Specification

testing

parameters of interest assuming that the assumptions underlying the


statistieal model in question are valid. In the context of the dynamic linear
testing is particularly important
regression (DLR) model specification
stands
it
statistical
GM
has very little, if any, economic
because the
as
GM when tested for any
estimated
statistical
The
interpretation.
assumptions is rejected can
underlying
of
and
the
misspecifications
none
of the sample
convenient
providing
summarisation
only be interpreted as
a
0. 10

0.05

0.00

-0.05

-0.10

1964

1967

1970

1973

1976

1979

1982

T i me

(b)

Time

from (23.44).(h)--(J) The time paths of the


Fig. 23.34(/).Thc residuals
of
estimates
recursive
Jya,,/s, and jzkg,the coefficients of ln yr. ln J'f and
ln lt respectively from (23.44).

The dynamic Iinear regression

model

1.25
,00

0.75
lat
0.50
0.25

-0.25

1973

1976

1979

1982

1979

1982

Time

(c)
0.100
0.075
0.050
0.025

l4t

0
-0.025
-0.050
-0.075
-0.1

1973

1976
Time

(d)

Fig. 23.3 continued

information. The (economic)


theoretical parameters of interest are assumed
of
simple
functions
the statistical parameters 0*. These theoretical
to be
will
be determined using specification testing.
parameters
well-defined
The
estimated statistical GM provides the firm foundation
on which the empirical econometric model will be constructed. There are
two important considerations which must be taken into account in going
from the estimated statistical GM to the empirical econometric model.

23.4 Specification testing

55l

Firstly, any theoretical restrictions needed to determine the theoretical


from the statistical parameters of interest must be tested before being
imposed. Secondly, when these restrictions are imposed we should ensure
that none of the statistical properties defining the original statistical model
has been invalidated. That is, we should ensure that the empirical
econometric model constructed is as well specified (statistically)
as the
original statistical GM on which it was based.
An important class of restrictions motivated by economic theory
considerations are the exact linear restrictions related to j*:
Ho: R#*

against

Sj

(23.83)

R#* # r,

and q x 1 known matrices with ranktRl=

where R and r are q x k*


q, k*
mtk + 1). Using the analogy with the linear regression model the F-type test
statistic suggests itself:
(R/% -r)'ER(X'X)-

FT*(y)=

1R/1 -

qdl
RRSS - URSS
URSS

1(RJ*r)
-

T-k*
q

(23.84)

(see Chapter 20). The problem, however, is that since the distribution of #-*
is no longer normal we cannot deduce the distribution of FT*(y) as
Fq, T- k*) under Ho. On the other hand, we could use the asymptotic
distribution of #*, i.e.
v'T(/*

N(0,c2G-1)

-,*)

(23.85)

in order to deduce that


So

qyl-sy) z2(t?).
-

Using this result we couldjustify the use of the F-type test statistic (84)as an
approximation of the chi-square test based on (85).lndeed, Kiviet (1981)
has shown that in practice the F-type approximate test might be preferable
in the present context because of the presence of a large number of
regressors.
In order to illustrate the wide applicability of the above F-type
specification test 1et us consider the simplest form of the DLR model's
statistical GM where k 1 and m 1:
=

A'f
=

JOXI

Jlxf 1 +J1X-

+'

lfr.

Despite its simplicity (86) incorporates a large proportion

(23.86)
of empirical

The dynamic Iinear regression model

econometric models in the applied econometric literature. In particular,


following Hendry and Richard ( 1983), we can consider at least nine special
cases of (86) where certain restrictions among the coefficients po,jl and aj
are imposed:
Case 1. Static
't

k,f

fvxt+

ttt.

u j yl

+ ut

.,

()1

regression

..j

.I1

0)..

(23.87)

(23.88)

(23.89)
Case 4.
(23.90)
Case 5.

#oxf+ ()1 xr - 1 +

.r

Partial adjustment

(vx;+

#'f
=

x 1 )'f

Case 8. Dead-start'
'r

).

Jl1 -Y

==

+ ut

f -

-i- t: j )'f

0)..
(23.92)

+ x1
/.71

)(x, 1 - )'r - 1 ) +
(/.() 0).'

al

model
j.

/A7oJt.?/
(/.()+

poAx, + ( 1 -

n-ltpt/c/ (jj

Error--correction
A)',

t1t

ur.

1)..

(23 9 3)
.

-h 1/f

(23.94)

For the above eight cases the restrictions imposed are a1l linear restrictions
which can be tested using the test statistic (84) in conjunction with the
rejection region
C1

.t

y : F r(y)>

(',

where ca is determined by
x

dkq, F- k*).

=
Q'y

family of restrictions considered extensively in Chapter 22


the
factor restrictions. ln relation to (86)the common factor
common
are
restriction is :1 jo + jl 0, which gives rise to the special case of an
autoregressive error model.
An important

23.4 Specilkation testing


Case 9. Autoreqressive

yf jax, +
=

t;,,

t/-rt?r

c,

553

model

al;,

(a1/0+

/JI

0)..

(23.95)

l,.

For further discussion on a1l nine cases see Hendry and Richard ( 1983), and
Hendry, Pagan and Sargan (1984).
ln practice, the construction ('design')of an empirical econometric model
taxes the ingenuity and craftsmanship of the applied econometrician more
than any other part of econometric modelling. There are no rules or
reduce any well-defined
established procedures which automatically
empirical
estimated
statistical
GM
specified)
to a
('correctly'
economic
mainly
model.
both
theory
This
is
because
econometric
as well as
role
choice
sample
data
play
in
the
properties
of
the
the
('design')of the
a
statistical
GM for a
this,
order
ln
let us return to the
latter.
to illustrate
estimated
23.3
this
equation
estimated
ln
in Section 23.2.
Section
money
and
misspecifications
possible
equation was tested for any
none of the
natural
question
rejected.
The
to ask at
underlying assumptions tested was
constitutes
this
estimated
statistical
GM
that
this stage is
a wellspecify
statistical
model,
how do we proceed to
(choose) an
defined
empirical econometric model?'
As it stands, the money equation estimated in Section 23.2 does not have
any direct economic interpretation. The estimated parameters can only be
viewed as well-defined statistical parameters. ln order to be able to proceed
of an empirical econometric model we need to consider the
with the
question of the estimable form of the theoretical model, in view of the
observed data used to estimate the statistical GM (see Chapter 1). In the
case of the money equation estimated in Section 23.2 we need to decide
whether th theoretical model of a transactions demand for money
considered in Chapter 19 could coincide with the estimable model. Demand
in the context of a theoretical model is a theoretical concept which refers to
to a range of hypothetical
the intentions of economic agents corresponding
values for the variables affecting their intentions. On the other hand, the
observed data chosen refer to actual money stock M 1and there is no reason
why the two should coincide for all time periods. Moreover, the other
variables used in the context of the theoretical model are again theoretical
constructs and should not be uncritically assumed to coincide with the
observed data chosen. In view of these comments one should be careful in
isearching' for a demand for money function. ln particular the assumption
that the theory accotlnts for al1 the information in the data apart from a
white-noise term is highly questionable in the present case.
In the case of the estimated money equation the special case of a static
demand for money equation can be easily tested by testing for the
significance of the coefficients of all the lagged variables using the F-type
'proper'

'assuming

'design'

554

The dynamic Iinear regression model

0 and #f 0,
test considered above. That is, test the null hypothesis S0:
: #0 or
i 1, 2, 3, 4, against
#f# 0, i 1, 2, 3, 4. The test statistic for this
hypothesis is
=

.#.fj

F'Ijy)

0.11650

53
53 1(-fI1-

-0.010

(23.96)

28.072.

().()j() 53

c.= 1.76 for a 0.05 we can deduce that Hv is strongly rejected.


A moment's reflection suggests that this is hardly surprising given that what

Given that

the observed data referto are not intentions orhypothetical range of values,
but realisations. That is, what we observe is in effect the actual adjustment
process for money stock M 1 and not the original intentions. Hence, without
any further information the estimable form of the model could only be a
money adjustment equation which can be dominated by the demand,
supply or even institutional factors. The latter question can only be decided
by the data in conjunction with further a priori information.
Having decided that the estimable model is likely to be an adjustment
process rather than a demand function we could proceed with the design' of
thc empirical econometric model. Using previous studies rclated to
adjustment equations, without actually calling them as such (seeDavidson
Hendry and Ungern-sternberg (1981), Hendl'y
et al. (1978),Hendry (1980),
and
Richard
(1983), Hendry
(1983/,the following empirical econometric
chosen:
model was

In(->-M),

y.''j

-0.134

In(->-M),.f)

(j1

-0.474

(0.130)
(0.02)
(Intyl

-e.196

l-1

(0.022)

-0.80 1 ln Pt -0.059 ln ff

j) (

-0.025

f=1

(0.145)

(0.007)

=0.758,

log 1=223.318,

42 0.725,
=

1)f ln It

(0.008)

+ lif,
-0.045Qjt+0.059Qct+0.053Q31
(0.014)
(0.014) (0.015)

R2

(23.97)

=0.0137,

R&S=0.012 47,

F= 76.

A1l the restrictions imposed on the estimated statistical GM in order to


reduce it to the above empirical econometric model are linear and the
relevant F-type test statistic is

23.4 Specification testing


FT(y)

53

0.0 12 476 -0.010 530


0.0 10 530

(23.98)

=0.753.

test we can deduce that these


Given that ca 1.88 for a size a
restrictions are not rejected. This test, however, does not suffice by itself to
establish the validity of (97)as a well-defined statistical model as well. For
this we need to ensure that the misspecification test results of the original
estimated statistical GM are maintained by (97).
against m 6. Using the F-type test
Cboice of m - testing for m
(i)
1 342 the value of the test
379 and URSS
with RRSS
statistic is
=0.05

=4

=0.01

=0.012

1 342 56
0.012 379
:
().()1 1 y4a
-0.01

FT(y)

with
c.= 2.1 1, x

1-

0.640

0.05,

the null hypothesis m 4 is not rejected. The Lagrange multiplier error


autocorrelation test for I 2 in its F-type form yielded:
=

177 62

-0.012

FT(y)

0.012 379
0.012 177

(23.99)

0.514

with ca 3. l5, q 0.05.


Misspectjlcation test for normality - skewness-kurtosis
(ii)
the test statistic is
43 0.3433 and :4. 3
=

test. With

-0.2788

z(y)

F
=-

4a +-

and with c.= 5.99 for


rejected.

0.0f the assumption

(b)

0.685,

.t3,

tl

RESET test with


FT(y)

rpsrs Jr Iinearity

Misspecscaion

1.18,

cx

2.76.

1.75.

ARCH test:
FF(y)=

0.53 1, cz=2.51.

These tests indicate no misspecification

of normality

homoskedasticity

ca

and

.*

White test:
FF(y)

(23.100)

(44. 3)2 1.801


-

24

at a = 0.05.

is not

The dynamic linear regression

model

Structural t.-/:fgnf?tr tests (seeSection 2 1.6). It might be interesting to


test the hypothesis that the changes picked up' by the dummy
variables are indeed temporary changes without any lasting effects
as assumed and not important structural changes. The first
dummy variable was defined by
=0

D1

for t # 36, f

Using T1 37 the F-type test for


=

F T) (y)

0.0 16 00
0.0 14 549

1l
.

HLlt:clj

5, 6,

80.

c2a yielded

(23.103)

implies that HLl)


is not rejected. Given
which for x 0.05, ca 1
this result we can proceed to test /'t)': #j pz. The F-type test
statistic is
.85,

FF2(y)

(11-0.0517,

0.013 749 -0.0


-

()s

j?

13 678 3 60
?- g'yjj

(23.104)

)
is strongly accepted. Using
given that c?, 2.254 for x 0.05, HL3
the same procedure for Tz 51.
=

FFI

0.016 278

(y) 0.01 356


=

c'a= 1.83,

and

FF2(y)

1.22,

=0.05

(23.105)

0.0 13 749 -0.0 12 867 60


-j0.0 12 867
-.

.---

cu 2.254,

#1 #2 and

cf

Hence, Hv

0.685,

0.05.

o'q is accepted

(23.106)
at

1-

-a)2

(1

0.0975.
ln Chapter 2 1 we used the distinction between structural change and
parameter invariance with the former referring to the case where the point
of change is known a priori. Let us eonsider time invariance in relation to
(97) as well.
A very important property for empirical econometric models when
needed for prediction or policy analysis is the time invariance of the
estimated coefficients. For this reason it is necessary to consider the time
invariance of the model in order to check whether the original time
invariance exemplified by the estimated coefficients of the statistical GM
the empirical econometric model we try
has been lost or not. In
to capture the invariant features of the observable phenomena in order for
'designing'

23.4 Specification

testing

the model to have a certain value for prediction and policy analysis
purposes. Hence, if the model has been designed at the expense of the time
invariance the estimated statistical GM will be of very little value.

The recursive estimates of Lj7- 1 Atn:f ..j -pt -j)1, (Fl, - 1 - pt 1 - yt 1),
1)J'jf
-y), AR, it and
in Fig. 23.4(4J)-(./')
1 (
..j are shown
() AJ'l
respectively for the period 1969/-1982/:. Apart from some initial volatility
due to insufficient sample information these estimates show remarkable
time constancy. The estimated theoretical parameters of interest have
indeed preserved the time invariance exemplified by statistical parameters
of interest in Section 23.3.
lt is important to note that the above misspecification tests are not
iproper' tests in the same sense as in the context of the statistical GM. They
checks' in order to ensure that the
should be interpreted as
determination of the empirical econometric model was not achieved at the
expense of the correct specification assumption. This is because in
the empirical econometric model from the estimated statistical
GM we need to maintain the statistical properties which ensure that the end
meaningful
estimated equation but a
product is not just an economically
well-defined statistical model as well. Having satisfied ourselves that (97)is
indeed well defined statistically we can proceed with its economic meaning
as a money adjustment equation.
The main terms of the estimated adjustment equation are:
(i)
(1y
j7, j A ln(M P4t -f) - the average annual rate of growth of real
money stock;
(ln(M/#)t- 1 -ln F) 1) -- the error-correction
(ii)
term (see Hendry
( 1980)).,
j)
Vz-ot)
(1.3
the average annual rate of growth of real
1 A ln Fp
consumers' expenditure'.
(iv)
A ln Pt - inflation rate:
ln
lt - interest rate (7 days- deposit accountl;
(v)
(vi)
jt) j ( - 1)iln 1:- annual polynomial lag for interest rate.
Interpreting (97)as a money adjustment equation we can see that both the
rate of interest and consumers' expenditure play a important role in the
determination of tht? changes in real money stock. As far as inflation is
concerned we can see that the restriction for its coefficient to be equal to
minus one against being less than minus one (one-sidedtest) is not rejected
1.67 and the test statistic is
at a 0.05 given that (',

l7-

(ZJ?-

Sdiagnostic

Sdesigning'

z(y)

-.-

- 0.80 1 + 1.00

tj.j4j

-0.

137.

(23. 107)

This implies that in effect the inflation rate term cancels out from both sides

558

The dynamic linear

regression

model

a 1t

(a)

T ime

Fig. 23.4(44-(/1, The time paths of the recursive


coelicients of (j7- j A ln (M #lf -j), (ln (M #F)f 1),
and (7. j ( 1)J ln lt--j) respectively (from(97:.
-

estimates of the
1 A ln 1-j), ln lt

(1F

with short-run behaviour being largely determined in nominal terms; not


an unreasonable proposition. The long-run represented by the static
solution (assumingthat yt y'f-f,N N-f,i 1,2,
) Of the adjustment
is
equation
=

M
PF

Aj -

0.301(

j .j. gj -

4..087
'

(23. 108)

23.4 Specification testing

559

Fig. 23.4. collrulktuat

This suggests that the long-run


where is a constant (see Hendry (1980)).
behaviour differs from the short-run in so far as the former is related to real
money stock.
The question which arises at this stage is,
is the estimated
adjustment equation related to the theoretical model of a transactions
demand for money?' From (97) we can see that the adjustment equation is
dominated by the demand side in view of the signs of
I A ln Ft- f, ln It
.,4

ihow

Z)-,

The dynamic Iinear regression model

560
0

ast

Time

Time

Fig. 23.4. continued

and jt) 1 ( 1)i ln It f. This suggests that if we were to assume that the
supply side is perfectly elastic then the equilibrium state, where
inherent
tendency to change' exists, can be related directly to ( 108). Hence, in view of
the perfect elasticity of supply ( 108) can be interpreted as a transactions
demand for money equation. For a more extensive discussion of adjustment
processes, equilibrium and demand, supply functions, see Hendry and
Spanos (1980).
-

'no

23.4 Specification
Re-estimation

of

testing

561

(97) with Apt excluded from both sides yielded the

following more parsimonious

empirical model:

(23.109)
.R2ccz(j.'y:9,

s 0.01384=

1og L= 222.25,
The above estimated coefficients can be interpreted as estimates of the
theoretical parameters of interest defining the money adjustment equation.
These parameters are simple functions of the statistical parameters of
interest defining the statistical GM. An interesting issue in the context of
this distinction between theoretical and statistical parameters of interest is
collinearity or/nd short data (collectively
related to the problem of
called insufficient data information) raised in Chapter 20.
ln view of the large number of estimated parameters involved in the
money statistical GM one might be forgiven for suspecting that insufficient
problems
might
data information
be affecting the statistical
parametrisation estimated in Section 23.3, One of the aims of econometric
modelling, however, is to design' l'tlbusl estimated coefficients which are
directly related to the theoretical parameters of interest. For this reason it
will be interesting to consider the correlation matrix of the estimated
coefficients as a rough guide to such robustness, see Table 23.2.
The correlations among the coefficients of the regressors are relatively
small; none is greater than 0.8 1 with only one greater than 0.68. These
correlations suggest that most of the esttmated (theoretical)parameters of
design'.
interest are nearly orthogonal; an important criterion of a
The first column of Table 23.2 shows the partial correlation coefticients (see
Section 20.6) between Am and the regressors with the numbers in
parentheses underneath referring to the simple correlation coefficients.
coefficients show that every
The values of the partial correlation
contributes
substantially
the
explanation
of Arnfwith the errorto
regressor
correction term and the interest rate playing a particularly important role.
above (see
Another important feature of the empirical model
bnear'

tgood

tdesigned'

(??1t1 - Pt
/'

YkxAnA7K

'he

562

1 -

.1L-

llel:

0.4 13
( - 0.053)
- 0.754
( - 0.4 18)

-3 j=)

mWt

leresm

0.422
(0.056)

A y,j..j
.

-0.70 l
( -0.080)

j'

-0.077

Table 23.3. Restricted coefhcient

p;)@-j
pt..j
pt-j
it ..j

0.4 13

27

4.=2

0.354

0
O
0
-0.025

0.025

on

(97)
.

--.

j=j
0. 196
0.80 1

-0.80 1
-0.059

based

estimates
-

j=Q

0.093

j=?
0
13
- 0.4
0
0.025

j=4
0.158

0
0

-0.025

( 109/ is that the restricted coefficient estimates do not differ significantly


from the unrestricted estimates as can be seen from Table 23.3.
Further evidence of the constancy of the estimated coefficients is given in
Fig. 23.5(fl)-4.) where the 4o-observation
estimates are plotted.
These estimates. based on a fixed sample size of 40 observations, run
through the whole sample, that is, jl based on observations 1...-40,#2 on
2..41, ja on 3-42, etc.
iwjnftpw'

23.5

Prediction

An important question to consider in the context of the dynamic linear


regression (DLR) model is to predict the value of yw+1 given the sample
T As argued in Chapter 13, the best
information for the period t 1, 2,
predictor (in mean square error (MSE) sense) is given by the conditional
expectation of yw+1 given the relevant information set. For simplicity let
=

23.5

Prediction

563

(a)

(b)
Fig. 23.5(a)-(:'). The time paths of 40 observation
the coefficients of ( 109) apart from the constant.

window estimates of

564

The dynamic linear regression model

Time

(c)
0

Fig. 23.5. coatinued

565

Prediction

23.5

ffgt/

Fig. 2.3.5. cottil;

us assume that
0* EEE(al aa,
predictor of
.

xowsI EB (x1 xa,


.

',f..1

tv

HF

is given and the parameters


ln such a case the best
known.
are

x:.,.I)'

pm,
am. j(), j1,
is given by its systematic component,
.

)
=

0y
I

i.e.

B1

N1

xw,
c2a)

0w
0w)
E'(4,w+ l ,/c(Y X

Moreover,

'

Xi l y

I-

Va

- t)

the predietion

error

F;X

(23.110)

is

(23. 111)
(23. 112)
and
/f( l

-.

c(Y01 ) X?.
1

x0T 4.

0.

Similarly the best predictor of ),wv! given a similar information set is


m

?N

)((
pr-hz-alpwsl

afy,.+a-..+

z=
=

#;xw+?-f.

In general we can construct predictors for I periods ahead by substituting

566

The dynamic Iinear regression

model

and
p1

p1

pv-l- l
i

cf/zwx./-i

#;xw-,-,
j(2
0

/-rn+

1, ,n+2,

(23.115)

The above predictors were derived on the assumption that 0* was known
a priori 't-a grossly unrealistic assumption. In the case where 0* is not known
we use 0*, the approximate MLE and the predictor of )'v+l takes the form
/-1

p- v + l

j
i

m
.:i'

i
1

pr

l- i

+
i

(:i.

p v+

j ..

l-

j p-'i x v

1*=

g..

l 2 3,
=

m
(23. 116)
.

and
m

F-

'r

jl
=

(i'

yv+

t- i +

)
=

p-k'J
.

m+ 1

(23.117)

ln the case where xw.hp,/ 1,2,


is not known it has to be
and substituted in the above formulae.
Returning to ( 1 10),we can see that the prediction error variance
by (111), i.e.

tguessestimated'

Ellyw-h 1 - p..+ 1)2 c(Y?),

x7-

xow..ll

czt).

is given

(23.118)

Similarly,
(23. l 19)

and

I 2 3,
(23.120)
=:

a,?c()2+
j
= ij 1
i
=

r?T +

cj
1

l m+ 1
=

(23 12 1)
.

formulae for the case where #* is used instead of 0*,


when the latter is unknown, are rather complicated. ln practice, however,
since #*is a consistent estimatorof 0* the formulae ( 118)-(12 1) are also used
(but only asymptotically valid) in the case where the estimated coefficients
are used.
For hypothesis testing and interval estimation related to prediction
The corresponding

Looking back

567

is given by

distribution

errors the asymptotic

N 0, )( a/coz
(.pw+,-w-,)
-

('

1- 1, 2,

(23.122)

r,,,

where xl 1. Note that the predictors


pv..lare correlated
w+ 1, w+2,
and anyjoint testing or confidence estimation needs to take the covariance
into consideration as well.
=

23.6

Looking back

In Section 23.4 we constructed an empirical econometric model of a money


adjustment equation based on a well-defined statistical GM estimated in
Section 23.2. lt will be useful at this stage to return to the money equation
estimated in the context of the linear regression model (seeChapter 19) and
derived in
the various statistical inference
attempt to
19-22. For convenience 1et us reproduce both equations
Chapters
estimated for the same sample period 1964/- 1982/:r:
'results'

'explain'

hmt

124 -0.485

-0.

(0.018) (0.130)

Aln1r-j

pt-j)

J=1

(0.022)
(23. 123)
(0.007)

(0.0 14)

kl

.R2

=0.709,

1ogL= 222.25,

=0.675,

s=0.013

RSS=0.012

832,

84,
F= 76.

Note that small letters are used to denote the natural logs of the variables
represented by the capital letters.

nl, 2.763 +0.705yf


( 1,100) (0.112)
=

Rz

0.995,

log L= 138.425,

+0.862pf

-0.053t

(0.022) (0.014) (0.040)

Rl 0.995,
=

RSS

=0.

s=0.040

(23.124)

22,

1165,

Looking at these estimated equations we can see how grossly misspecified


(124)is as a linear regression statistical GM. The inappropriate

equation

568

The dynamic linear regression model

assumption of an independent sample invalidates a1l the statistical


inference results based on ( 124) derived in Chapter 19. Moreover,
interpretipg the estimated coefficients as elasticities and using these in any
way is again unwarranted given that these are not well-defined statistical
parameters and any arguments based on the estimated coefficients can be
very misleading.
ln terms of goodness of fit the estimated variance of ( 123) is almost a tenth
of that of (124).
From the statistical viewpoint the main difference between (123) and
( 124) comes in the form of the unmodelled part of yf. The residuals for the
two estimated equations behave very differently. In Fig. 23.6, t and J, are
plotted over time and as we can see they differ greatly. Firstly, the standard
deviation of t is three times smaller than that of J,. Secondly. the whitenoise assumption seems much more appropriate for ut rather than :f. From
Chapters 19-22 we know that f exhibit not only time dependency but
heteroskedasticity as well. Hence, in no circumstances should ( 124) be
interpreted as an empirical econometric model.
The main problem associated with ( 124) is that of invalid conditioning.
That is, in defining the systematic component we should have conditioned
(X, xr) but the past as well. This invalid
not only on the
conditioning induced time dependency in Jt, V as well as j. Looking at
( 123) we can see how this problem can arise. Using the terminology
bpresent'

Time

Fig. 23.6. Comparing the residuals from

(123)and (124).

569
23.6 Looking back
model
empirical
econometric
and
Richard
1982)
the
by
Hendry
introduced
(
l 124). Encompassing refers to the ability of a particular
( 123)
estimated statistical model to explain the results of rival models (seeMizon
( 1984) for an extensive discussion). The comparison of ( 123) and (124) above
ln this case, howevers a
was in effect a simple exercise in encompassing.
formal discussion of encompassing seems rather unnecessary given that
( 124) could not be seriously entertained as an empirical econometric
model given that it has failed almost all the misspecifieation tests applied.
Although the statistical fotlndations of ( 123) seem reasonably
sound, its
theoretical underpinnings are less obvious. This, however. it not surprising
given that economic theory is rather embarrassingly slent on estimable
adjustment equations. At this early stage it might be advisable to rely more
heavily on data-based specifications which might provide the basis for more
realistic theoretical formulations of estimable models. Muellbauer ( 1986)
provides an excellent example of how' economic theoretical arguments can
and interpret successful data-based empirical
be used to
econometric models. This seems a most promising way forward with
each other;
economic theory and data-based specifications complementing
1985).
Nickell
and
Engle
1985).
Pagan
1985).
Granger
(
(
(
see
The above discussion exemplifies the dangers of using an inappropriate
statistical model in economuttric modelling. At the outset it must have been
obvious that the sampling model assumption of independence associated
with the linear regression model was highly suspect for the kind of data
chosen. Despite that, we adopted an inappropriate statistical model in
order to illustrate the importance of the decision to adopt one in preference
of other statistical models. Throughout the discussion in Chapters 17-23
every attempt has been made to persuade the reader that the nature and
statistical properties of the observed data chosen have an important role to
play in econometric modelling. These should be taken into consideration
when the decision to adopt a particular statistical model in preference to the
others is made. In a certain sense the choice of the statistical model to be
used for the particular case of cconometric modelling under consideration
is one of the most important dccisions to be taken by the modeller. Once an
inappropriate choice is made quite a few misleading conclusions can be
drawn unless the modellel' is knowledgeable enough to put the estimated
statistical GM through a battery of misspecification tests before embarking
on specification tcstilg and prediction. lf, however, the modeller follows the
mtxim that
theory accounts for all the
naive methodological
information in the data f irrespecti: e of the choice of the data) apart from a
theoretical information is real information-,
white-noise error term' or
then the misspecifieation testing seems only of secondary importance and
misleading conclusions arc more than likely.
wencompasses-

trationalise'

'the

'only

The dynamic Iinear regression model


Important

concepts

Autoregressive representation, homogeneous non-stationary


processes,
initial conditionsrasymptotic
independence, Granger non-causality, strong
exogeneity, unit circle restrictions, approximate MLE, Durbin's
test,
statistical versus theoretical parameters of interest, error correction term,
long-run solution, partial correlation eoefficients, encompassing.

Questions
Explain the role of the stationarity and asymptotic independence of the
stochastic proccss .tZ!, l (E T) in the context of the specification of the
DLR model.
Is the stationarity
of )Z,, t c T) necessary for the autoregressive
representation of the process?
Define the concept of strong exogeneity and explain its role in the
context of the DLR model.
statistical GM of the DLR model is a hybrid of those for the linear
4.
and stochastic linear regression models.' Explain.
Discuss the difference between the exact and approximate MLE'S of 0*.
:Do they have the same asymptotic properties?' Explain your answer.
State the asymptotic properties of the approximate MLE #* of p*.
l-low do we test whether the maximum lag postulated in the
specification of the statistical GM is too large'?'
tl-low do we interpret residual autocorrelation in the context of an
estimated statistical GM in the DLR model'?'
iWhy is the Durbin-Watson test inappropriate in testing for
an AR(1)
error term in the context of the DLR model'?'
-f'he

Additional references

Anderson (1959))Crowder (1980);


Durbin ( 1960); Harvey (198 1); Granger and Weiss
and Wald (1943:; Priestley ( 1981).

(1983);Mann

linear regression model

The multivariate

Introduction

24.1

The multivariate linear regression model is a direct extension of the linear


regression model to the case where the dependent variable is an m x 1
random vector yr. That is, the statistical GM takes the form

B'x ! + u t

where yf : m x 1, B : k x rrl, xt: k x 1, uf : m x 1. The system


system of m linear regression equations'.
yff

fixt+

uit,

1, 2,

(1) is effectively

n1, l 6 T,

(24.2)

with B (#1,#c,
pm).
ln direct analogy with the m 1 case (seeChapter 19) the multivariate
linear regression model will be derived from first principles based on the
joint distribution of the observable random variables involved, D(Zf; #)
+ k) x 1. Assuming that Z, is an I1D normally
where Zt BE (y;,X;)', (?'?
distributed vector, i.e.
=

(xY',)
-

)
((0())()1a
2a))
Yxal

wecan proceed to define the


by:
pt E(y,/X,
=

and

tl,

xl)

systematic and non-systematic

B'x,,

yt - C(y!/Xt x,),

for all t s --,

components

Ec-21E2:

(24.5)

'

The multivariate

(i)

'(u,)

Stulu'sl

Ftpruf'l

where D
19.2).

E1 j

model

regression

ul and pt satisfy the following properties'.

by construction,

Moreover,

(iii)

linear

'EZ'(u,/X! x,)(l
=

ErEtuu's Xf

'ES(pu;,/Xf
E21
El 2Ea-21

Xf)1

x,)(1

0,'

fl
0
E

'X, x,)()
rprluf',
=

(compare these

with

0,

the results in Section

The similarity between the m 1 case and the general case allows us to
consider several loose ends left in Chapter 19. The first is the use of thejoint
distribution D(Zr; #) in defining the model instead of concentrating
exclusively on Dtyf X!; /1). The loss of generality in postulating the form of
the joint distribution is more than compensated for by the additional
insight provided. In practice it is often easier to judge' the plausibility of
assumptions relating to the nature of D(Z,; #) rather than D(yf/Xr; j).
analysis the relationship
Moreover, in misspecification
between the
assumptions underlying the model and those underlying the random vector
of the nature of the
process (Z,, t e: T) enhances our understanding
possible departures. An interesting example of this is the relationship of the
assumption that pZ,, t e: T) is a
normal (N);
( 1)
independent (J); and
(2)
identically distributed (1D) process; and
(3)
D(y!,/X,; 1) is normal;
!61
(i)
Eyt/'xt
x,) is linear in x,;
(ii)
(iii)
Covtyf/'x, xr) is homoskedastic (free of xr),'
p>(B, f1) are time-invariant;
g7q
g8(I (.J,'l,,'Xl,t (E T) is an independent process.
=

The relationship

between these components

is shown diagrammatically

below:

(.f) I8I
--+

The question which naturally arises is whether (i)-(iii)imply (N) or not. The
following lemma shows that if (i)-(iii)are supplemented by the assumption

24.1
that X,

Introduction

implication

N(0.Eac), detltac) + 0, the reverse

holds.

Lemma 24.1
Zf

kV(0,E) jr

'v

only

(/'

N(0, Ya2), dettEaa) + 0,'

(,*)

(1'1')

E (yly'xf

'v

Xt

Covtyf/Xf

4' and

t e: T

Xt;
) X 1 2Y2-J
=

Xl

X1 1

X j 2X2--/ X 2 1

(24.6)

XB + U,

where Y: F x n?. X: T x k. B: k x n?. U : T x 131. The system in f 1) can be


lth
row of (6). The fth row taking the fo rm
as the
yi Xjf
=

+ uf

1, 2,

viewed

tn

I'th regression in
(2).ln order to define
represents al1 T observations on the
#I)
the
need
special notation of
conditional
D(Y
X;
distribution
the
we
Appendix
2).
Using
notation
this
the matrix
Kronecker products (see
written
in
form
the
distribution can be

(Y

where f

X)

.3'=

NIXB, f ()) Iw),

,v

the covariance of

(:& Ir represents
y

vec (Y )

(24.8)

'2

'F?'nx 1
.

y,,l
The vectoring operator vect ) transforms a matrix into a column vector by
stacking the columns of the matrix one beneath the other. Using the
vectoring operator we can express (6) in the form
'

vectYj

(Im () X) vectll) + vec(U)

y* Xyjv
=

in an obvious notation.

+ uv

(24.10)

The multivariate

linear regression model

The multivariate linear regression


(MLR) model is of considerable
interest in econometrics because of its direct relationship
with the
simultaneous equations formulation to be considered in Chapter 25. ln
particular, the latter formulation can be viewed as a reparametrisation
of
the MLR model where the statisticalparameters of interest 0- (B, f) do not
coincide with the theoretical parameters of interest (. lnstead, the two sets of
parameters are related by some system of implicit equations of the form:

f 0,I)
These

0,

equations

1, 2,

be

can

(24.11)

p.

interpreted

providing

as

an

alternative

parametrisation for the statistical GM in terms of the theoretical


parameters of interest. ln view of this relationship
between the two
statistical models a sound understanding of the MLR model will pave the
way for the simultaneous equations formulation in Chapter 25.

24.2

Spification

and estimation

ln direct analogy to the linear regression model (n


linear regression model is specified as follows:
(1)

Statistical GM: y,
y, : m x 1

x, :

B'xr + ut,

B: k x m

kx 1

x,)

B'x,,

1) the multivariate

lsT

The systematic and non-systematic

;tt JJty, 'Xr

components
u,

are:

y, - .E'(y,/Xt x,),
=

and by construction
ut) E EE(u,,''Xf xrll
=

0,

'(pt'uf) E Ef(#Ju,/'X, x?)(1 0,


=

(4J

(5q

The statistical parameters of interest are 0- (B, D) where B


1
1
E 22
E 11 E 12 E 22
- Y21, f
- E 21'
X, is assumed to be weakly exogenous with respect to 0.
No a priori information on 0.
Ranktx)
xw)': T x k, for F > k.
k, X E (x:, xz,
=

g31

24.2 Specification
(11)

Probability

alxl estimation

model
--i

tll

D(yt/'Xl; #)=

f1)

(det

CXP)

mjz

(2zr)

'--llyf

B Xf) D .y
,

(yt- s,xt jj,

R'nkx frt

tGT

D(y,//Xt,' #) - normal;
(i)
JJty; X, x,) B'xr - linear in x,;
(ii)
(iii)
Covlyf/xr= N) f - homoskedastic (free of xp);
0 is time invariant.
=

(111)

Mmpling

model

ywl' is an independent sample sequentially drawn


from D(y,,/Xr;0j, t= 1, 2,
'C and T> m + k.
The above specilkation is almost identical with that of m 1 considered
in Chapter 19. The discussion of the assumptions in the same chapter
above with only minor modilications due to m> 1. The
applies to E1(1-E81
only real change brought about by m > 1 is the increase in the number of
statistical parameters of interest being mk +-)m(n1 + 1).It should come as no
surprise to learn that the similarities between the two statistical models
extend to estimation, testing and prediction.
From assumptions
(6) to E8qwe can deduce that the likelihood
function takes the form

g81

Y=

(y1,y2,

v)=c(Y)
z-(p;

1-1o(y,,/x,;04
=

and the log likelihood is


T

logtdet f1) - s
2
2

F (y, B'x,)/f1- 1(y,

B'x,)

(24.12)

1(Y XB)'(Y XB)q


= const -gT logtdet D) + tr fl (see exercise 1). The first-order conditions are

(24.13)

log L= const

t?log L
PB

= (X Y -X XBID
,

p log L T fl
l
pfl- .= Y

0.

-J(Y -XB) (Y -XB)


,

'-'r

(24.14)
=0.

(24.15)

't C?''denotes the space of all real positive definite symmetric matrices of rank m.

The multivariate

Iinear

regression

model

These first-order conditions lead to the following MLE's:


= (X'X) IX'Y
-

(24. 16)

!!

where U Y - Xb. For ( to be positive dcfinite we need to assume that T y


n?+ /:. (see Dykstra (1970)). lt is interesting to note that ( 16) amounts to
estimating each regression equation separately by
=

p-i (x'x)
=

lx'y

1, 2,

'

(24.18)

/3.7.

Moreover, the residuals from these separate regressions j yf X/ i can be


used to derive f' via &ij ( 1y'Flf/iy, ivj 1,2,
As in the case of j- in the linear regression modd, the MLE preserves the
original orthogonality
between the systematie and non-systematie
and t y:
components. That is, for t
=

y,

+r,

1, 2.

z'x,

z'xf

,rn.

(24. 19)

and t ..1- r. This orthojonality


can be used to define a goodness-of-fit
'
2
1 ( ) (y y) to
measure by extending R
,

I -((;''l'))(Y'v)

.,

(Y'Y - l')l'))(Y'Y) -

'.

(24.20)

The matrix G varies between the identity matrix when l-J 0 and zero when
Y U (no explanation). In order to reduce this matrix goodness-of-fit
measure to a scalar we can usc the trace or the determinant
=

(24.2 1)
(see Hooper ( 1959)).
ln terms of the eigenvalues (21 ;.1,
goodness of fit take the form
.

t/1

N7

)-q;.i
l',1 i .
--

2.) of G the above measures of

tn

and

(lz

- 1

(24.22)

.j.

The orthoyonality extends directly to 1 X: and U and can be used to


show that B and fl are independen random matriees. ln the present context
this amotlnts to
=

covt () f1)=0,

(24.23)

where E( ) is relative to D(Y X; 0j.


'

24.2 Specification

and estimation

Finite sample properties

t#'

fl

and

and f' are MLE'S


we can deduce that they enjoy the
of
such
estimators
(see Chapter 13) and they are
invarianct?
stfjh'
trtdnl
if they exist. Using the Lehmanstabistics,
militnal
of
the
functions
result
12)
Chapter
we can see that the ratio
Scheffe
(see
Irrom the fact that

prtppgl-ll'

o(Y X; p)
D(f l .X; p)

=exp)
-

1
.--YTYII
tr f1- EY'Y

-(Y -Yo)'XB -

(,

B'x'(Y -Y())(l)
of 0 if Y'Y =Y'Y() and Y'X

is independent
z(Y)

(24.24)

YLX. This implies that

(z1(Y)-Tc(Y)), where z1(Y) Y'Y, zc(Y)


=

Y'X

defines the set of minimal sufficient statistics and


:

(X'X) -

lz

2 (Y)',

(24.26)
ln order to discuss the other properties
distributions.
Since
2 B + (X'X) - IX'U

and f let us derive their

of

L=(X'X)- 1X'

=B + LU,

we can deduce that

(24.27)

').

,..-N(B, f (x)(X'X) This is because


is a linear function of 5'

where

(24.28)

(Y 'X) NIXB'D L I).


x

Y(I Mx)Y', its distribution is the matrix equivalent to the


Given that T
chi-square, known as the Wishart distribtltion with F- k degrees of
freedom and scale matrix fl and written as
=

x,

(24.29)

Wz,,,(f1,r- /()

(see Appendix 1). ln the case where


T'fz'v c2z2(T-k),

1 T

E(Tf1)= c2(F-

ti'li and

/&j.

properties of the
The Wishart distribution enjoys most of the attractive
normal
distribution (seeAppendix 1#.In direct analogy to (30),
multivariate
=(T-k);n,
A'I)('.r'f)

(24.31)

The multivariate

linear regression model

and thus ( gl (F-k)qU'Uis an unbiased estimator of f1. In view of


(25/-(31) we can summarise the finite sample properties of the MLE'S (: and
f of B and fl respectively:
* and f are insariant (withrespect to Borel functions of the form
(1)
+ 1:).
y( ): O Rr (1%r Gtrnk + J?'ntpl
B and fl are jnctions of the minimal sufficient statistics T1(Y) Y'Y
and z2(Y) Y'X.
2 is an unbiased estimator of B (i.e..E'1)
B) but f'i is a biased
(1/(F- k)(lI'.)'U
estimator of f; l
being unbiased.
: is a fullyeihcient estimator of B in view of the fact that Cov(2)
fl ()!)(X'X)-- 1 and the information matrix of 0 EEE(B, f1) takes the
form
=

'

-+

lr(#)

(5)

f1-1 ()) X'X

(24.32)
j

- j

- (Eldll )
(d1!

(see Rothenberg (1973:.


and f are independent; in view of the orthogonality
Asymptotic properties of

in

(19).

and f-1

Arguing again by analogy to the m= 1 case we can derive the asymptotic


B and fl of B and f1, respectively.
properties of the MLE'S

(i)

( Q B, f X f1)

Consistency:

B) N(0, !'1 (&(X'X)w-1) we can deduce that if


In view of the result (
1
limw- .(X'X)w0 then Covt)
0 and thus 2 is a consistent estimator of
and
B (see Chapters 12 and 19). Similarly, given that limw- x
ew

-+

'4(1)

limw-w Covtfll

0, Lh % n.

Note that the following statements


1

lim (X'X)w-

2min(X'X)w-.+
1

,.'.'c

2 (x'x)wmaX

tr(X?X)w-1

-+

are equivalent:

0,'

F-* x

(d)

=f1

-+

0',

!' Note that the expectation


model Dlyt X,; 0j.

operator

'(
.

) is relative to the

underlying

probability

579

information

24.3 A priori

1
where 2min(X'X)r and 2max(X'X)w-refer to the smallest and largest
eigenvalue of (X'X)w and its inverse respectively', see Amemiya (1985).

Strong consistencyz

lim(XlX).w-x

(*

an d

=0

'

a..S.
-.+

B)

maX

(X'X)
F
.

<

2min(XX)w
,

a.. S.

some arbitrary constant

for
(iii)

c, then

-->

B; see Anderson and Taylor(1979).

Asymptotic normality

From the theory of maximum likelihood estimation we know that under


V'F(;- 0)
relatively mild conditions (seeChapter 13) the MLE ; of 0
'.w

1).
N(0,I.(p)
For this result to apply, however, we need the boundedness of
zxclimr...
js(1/T)Ir(#) as well as its non-singularity. ln the present case
I.(p)
the asymptotlc information matrix is bounded and non-singular (fullrank)
Under this condition we can
if limw- w(X'X) F= Qx< %. and non-singular.
-

deduce that

T't -B)

1)

(1

v7'r(fz-n) x(0,2(n
-

(see Rothenberg

(24.33)

x(0,n () Qx-

(24.34)

() n))

(1973/.

Note that if t(X'X)w,T > k) is a sequence of k x k positive definite matrices


r.'.'c
as
such that (X'X)r- ) - (X'X)r is positive semi-definite and c'(X'X)zc
1
then
w(X'X)wlimw0.
T-+ x for every c # 0
-+

ln view of (iii) we can deduce that


(iv)
unbiased and jf'ciell.

24.3

2 and (1 are both

asymptotically

A priori information

One particularly important departure from the assumptions underlying the


model is the introduetion of a priori
multivariate linear regression
such additional information is available
When
0.
restrictions related to
applies
and the results on estimation derived in
assumption g4)no longer
modified.
The importance of a priori information in
Section 24.2 need to be

580

Iinear regression model

The multivariate

the present context arises partly because it allows us to derive tests which
testing and partly because this
can be usefully employed in misspecification
will provide the link between the multivariate linear regression model and
the simultaneous equations model to be considered in Chapter 25.
(1)

Lineav restrictions

kelated' to X,

The first form of restrictions


D1B + Cj

to be considered

is

(24.35)

0,

: p x m are matrices of known


where D1 : p x k (p< /(), ranktll)
p and
special
particularly
important
A
case of (35) is when
constants.
=

D1

(0,Ik

),
2

CI

B1

(24.36)

That is, a subset of the coefficients in B is


and (35)takes the form Ba
restrictions
The
about
thing
is that they are not the same
these
note
to
zero.
fonn
the
as
=0.

(24.37)

R#= r,

discussed in the context of the m 1 case (seeChapter 20). This is because


the D1 matrix affects a1l the columns of B and thus the same restrictions,
apart from the constants in C1 are imposed on al1 ?z? linear regression
equations. The form of restrictions comparable to (37) in the present
context is
=

Rj+

(24.38)

r,

: n?1, x 1, R : p x mk, r: p x 1. This form of


where #,y vec(B) (#'j #'2, #'m)'
linear restrictions is mor general than (35) as well as
=

BF1 + A1

(24.39)

0.

All three forms, (35),(38)and (39),will be discussed in this section because


they are interesting for different reasons.
When the restrictions (35)are intel-preted in the context of the statistical
GM

y, B'x,
=

(24.40)

+ ur,

1, 2,
k.
we can see that they are directly related to the regressors Xit,
The easiest way to take (35)into consideration in the estimation of pH(B, f)
the system (35)for B and substitute the
is to
into (40).ln
order to do that we define two arbitrary matrices D1' :4/f -p) x k, ranktDl)
=

'solve'

'solution'

A priori information

24.3
I

-p,

and Cf

x rn, and reformulate

(k -p)

(35) into

DB+C=0

(24.41)

D=(D1,D1),

C=

The fact that ranktl)


1

B= - D -- C

where G
yields

(GI
EEE

Y*

G1')

GICI

D-

(41) for B to yield

(24.42)

GICI',

+
1.

tc?

us to solve

k enables

/'C1

Substituting

this into

(40)for r 1, 2,
=

X*C1'+ U,

where Y* EEEY .-XGICI


underlying probability

(G/'X'XG

Hence, from (42) the constrained

GjCj),

G 1'X'(Y - XG1 C1)

(24.44)

MLE of B is
1G1''X'X(

-G1Cj),

+ L(

1')

- G1 Cj ).

+ G/(G/'X'XGt)-

P(

P= I

(24.46)

lu.

(24.45)

1. =G1'(G/'X'XG1')- IGI'X'X

where
where

-GIC1)

Given that 1u2 L, P2 P and LP=0 (i.e.they are orthogonal


we can deduce that P takes the form
=

1G1'X'X'(

and X* XGt. The fact that the form of the


is unchanged implies that the MLE of C1 is

= (GI'X'XGI)
-

model

=GjC1

(24,43)

1
C1 (X*'X*) - X*'Y*'

2,,,zGICI

(X'X) - 1 D' 1 ED 1 (X'X) -- D' 1(1- D 1

projections)

(24.47)

(see exercise 2). This implies that

E= -(x'x)since D1G j

1l,

D' 1 LD 1 (x'X)-

I- 1(D 1

Ip. Moreo: er. the constrained

f1= f f
'r-.,

=(1

v.

(fj

c1),

(24.48)

MLE of fl is

)'(x'x)(:-).

(24.49)

of B and f we can see that they are


Looking at the constrained MLE'S
direct extensions of the results in the case of m 1 in Chapter 20.
Another important speeial case of (35)is the case where al1 the coefficients
apart from the constant terms, say j.1 are zero. This can be expressed in the
=

582
form

Tbe multivariate

linear regression model

(35)with
D1

(0,lk - 1),

B=

(j.1,B(1)),

and Hj takes the form B(:)

(2)

kelated' to yt

Linear restrictions

0.

The second form of restrictions

BFl

+ A1

to be considered is

=0,

(24.50)

where
k x q are known matrices with ranktr'l)
q.
m x q (qGn1) and
The restrictions in (50) represent
linear between-equations restrictions
F1 :

A1:

because the ith row of B represents the fth coelicient on a11 equations.
lnterpreted in the context of (35)these restrictions are directly related to the
yffs.This implies that if we follow the procedure used for the restrictions in
(38) we have to be much more careful because the form of the underlying
probability model might be affected. Richard (1979) shows how this
procedurecan give rise to the restricted MLE'S of B and f1. For expositional
purposes we will adopt the Lagrange multiplier procedure. The Lagrangian
function is
!(B,f), M)

T
=

logtdet f1)

--j-

--)

tr f

.j

(Y

-tr(A'(BF1 +A1)q,

XB)

(v

.xs)

(24.51)

where A is a matrix of Lagrange multipliers.


01

(X Y -X XBID
,

PB =

el T f) -#Y
1
po - = Y
Pl
-(BFj
PA =

B)

asj

.()
,

-XB)'(Y -xB)=0,

+ A1)=0

(see Appendix 2). From


(X'X)(

(52)we

(24.53)
(24.54)

can deduce that

AF'1f1.

(24.55)

Premultiplying by A1 and solving for A yields


A

(X'X)(F1

which in view of
A

-BF1)(F'jQF1)-

1,

(24.56)

(54)becomes

(X'X)(rj

A1)(F'1f1F1)- 1.

(24.57)

A priori information

24.3

583

This implies that the constrained

1
f=- T f; fT=f
,

+-

1
T

(:-

of B and fl are

1F'1f1,

(24.58)

)(x x)(:-)

(24.59)

+ A:)(l7fFj)-

(F1

MLE'S

(see Richard ( 1979/. lf we compare (58)with (48)we can see that the main
difference is that ( enters the MLE estimator of B in view of the fact that the
restrictions (50)affect the form of the probability model. It is interesting to
note that if we premultiply (58)by Fj it yields (54).The above formulae,ts8),
(59), will be of considerable value in Chapter 25.
(3)

Linear

frela/ed'

restvictions

to both y, and X,

A natural way to proceed is to combine the linear restrictions


in the form of
DIBF

(38)and (50)
(24.60)

C= 0,

where D1 : p x k, F1 : ?,n x q, C: p x q, are known


ranktrl)
q. Using the Lagrangian function

matrices with rank(D1)=

p,

I(B,f1,A)

--j-

logtdet f1)

-trgA'(D1BF1

we can show that the restricted

j
tr f - (Y -XB) (Y -XB)

--i'

(24.61)

C)q,

MLE'S

are

'D' 1 ED1(x'X)- 1D'.1 - 1(D1r'1

-(x'x)-

c)(r'IfF1)-

'rlf,

(24.62)

1
f= (1+ F (B-E) (x x)(E- ).
,

An alternative

derive

w'ay to

(24.63)

(62)and (63)is to

consider

(24.64)

D1B* + C= 0

for the transformed specification


(24.65)

Y* =XB* + E.
B* BF1 and E= UFj.
where Y* =YFI,
The linear restrictions in (60) in vector form can be written as
=

vec(D1BFj
or

C) (F': () D1) vectB)

(F'1 @)D1)/+

r,

+ vectc)

(24.66)
(24.67)

The multivariate

584

Iinear

regression

model

vec(C). This suggests that an obvious way to


where j+ vec B and r
(F'I
this
is
substitute
generalise
L D1) with a p x km matrix R to
to
restrictions
the
in
the
form
formulate
=

R#v

(24.68)

r,

where ranktR) p p < /(rn). The restrictions


in (68) represent the most
general form of linear restrictions in view of the fact that #+ enables us to
and
reach' each coefficient of B directly and impose within-equation
between-equations restrictions separately.
In the case where only within-equation linear restrictions are available R
is block-diagonal, i.e.
=

R1

Rc

'

and

rc

r..

ranktRf)

restrictions

(24.69)

R.j

0
R : pi x k,
Exclusion

r1

pi

?-f:

pf x 1.

are a special case of within-equation

restrictions

where Rf has a unit sub-matrix, of dimension equal to the number of


excluded variables, and zeros everywhere else.
Across-equations
linear restrictions
in the off
can be accommodated
?,l,
of R with Rfyreferring
block-diagonal submatrices Rfy,i,j 1, 2,
to the restrictions between equations i and j.
of B and f) under
Let us consider thederivation of the constrained MLE'S
the linear restrictions (68).The most convenient form of the statistical GM
F is not
for the sample period t 1, 2,
zpj

XB + U,

5'+=X+#+

(24.70)
formulation

vectorised

(24.71)

+u+,

X+

j + (#'1#'2
=

and

as in the case of (35) and (39),but its

where

pm'
)': lnk

x 1

u+

(1p,L X) : Tm x mk,
(u'I
,

u'2
,

u;

)': bn

x 1

Iw): Tm x Fn.
f1+ (f1 Lx?l
=

The Lagrangian

function is defined using the

Ip., fl+, 2)

T
=

--y

--jyv

logtdet f1,y)

2.'(Rj+ - r),

vector

2: p x 1 of multipliers'.

-X+#+)'f1+ -

(y,,-X+#+)
(24.72)

24.4

The Zellner and Malinvaud

t?!
-

b.

X'+f1+-1(y+

t?I
- 2 (R#+
=

r)

X+j+) - R'2

7.0. = - 'i- o.

- 1

+o.

formulations
=

585

0,

(24.73)

1
(y.- x . p. )jy. .xs#s),os- (24.,74)

0.

(24.75)

Looking at the above first-order conditions (73)-(75)we can see that they
equations which cannot be solved
constitute a system of non-linear
explicitly unless fl is assumed to be known. In the latter case (73)and (75)
imply that

J #.*

k
and

F.

(X'+n+- IX * )- IR'ERIX' * ( *- IX * )- 1 R*l - 1(Rj

*,(24.76)

gR(X'+n+- X+) - R'1 - 1(R#+ - r).

(24.77)

(X'.f1+-lx

(24.78)

'

)- 1X'* j1-*

ly
*

lf we compare these formulae with those in the m 1 case (seeChapter 20)


we can sec that thc only diffcrencc (when fl is known) is the presence of f1..
This is because in the m > 1 case the restrictions Rj+
r affect the
restricting
probability
model
the
econometric
literature
ln
by
underlying
y,.
the estimator (78) is known as the generalised Ieast-squares (GLS)
estimator.
In practice fl is unknown and thus in order to solve' the conditions
(73)-475)we need to resort to iterative numerical optimisation (seeHarvey
( 198 1), Quandt ( 1983)- inter alia).
The purpose of the next section is to consider two special cases of (68)
where the restrictions
can be substituted directly into a reformulated
statistical GM. These are the cases of exclusion and across-equations linear
homogeneous restrietions. ln these two cases the constrained MLE of j.
takes a form similar to ( 78).
=

24.4

The Zellner and

formulations
slalinvaud

ln econometric modelling t&5o special cases of the general linear


Rj+

restrictions

(24.79)

useful. These are the exclusion and across-equations linear


restrictions.
In order to illustrate these let us consider the
homogeneous

are particularly

The multivariate

Iinear regression model

two-equation case

(.p1!j-(#11
)?cf #l2

.X1 f

#a1
pz5

(i)

Exclusion restrictions..

(ii)

Across-equation

x.:,

#11

(24.80)

c,

.X3t
=0,

tl/jyl
,j,

+.

=0,.

pz5

#21 #j 2.

linear homogeneous restrictions:

lt turns out that in these two cases the restrictions can be accommodated
directly into a reformulation of the statistical GM and no constrained
optimisation is necessary. The purpose of this section is to discuss the
estimation of #+ under these two forms of restrictions and derive explicit
formulae which will prove useful in Chapter 25.
Let us consider the exclusion restrictions first. The vectorised form of
Y =XB + U,

(24.81)

as defined in the previous sections, takes the explicit form


yl

y2

.'
,

#1
#2

y. =X+#+

u1
u2

(24.82)

#.

um

(24.83)

+ uv

in an obvious notation. Exclusion restrictions can be accommodated


directly into (82) by allowing the regressor matrix X to be different for
different regression equations yf X#f + u, i 1, 2,
m, and redefining
reformulate
accordingly.
That is,
the #s
(82)into
=

X1

yl
y2

0
=

)'+ X1#:

'.

X2

0
,

0
+ u1,

Xpl

#*1
2
#*
#*
m

u1
u2

+
'

(24.84)

um

(24.85)
where Xf refers to the regressor data matrix for the h equation and p? the
corresponding coefficients vector. ln the case of the example in (80)with the
=

24.4
restrictions jl

The Zellner and Malinvaud


1

587

j23 0, (84)takes the form

0,

formulatios

xh)(',))+(--))'

Where

(ry))-()

(24.86)

xs), X2 EEE(x1,x2), jl (j21ja1)?and pz (j12jac)'.


The formulation (84)is known as the seeminqly unrelated reqression
equations (SURE), a term coined by Zellner (1962), because the m linear
regression equations in (84)seem to be unrelated at first sight but this turns
out to be false. When different restrictions are placed on different equations
the original statistical GM is affected and the various equations become
interrelated. In particular the covariance matrix fl enters the estimator of
#l. As shown in the previous section, in the case where fl is known the MLE

of

Xj

H(x2,

#: takes the form

#:= (X:'(D-

()) Iw)X:) - 1X:'(f1-

(:&Iw)y+.

(24.87)

Otherwise, the MLE is derived using some iterative numerical procedure.


For this case Zellner ( 1962) suggested the two-step least-squares estimator

/1=(X1(Z- 1 @Iw)X:)- 'X:'(Z (&Iw)y+,


where f1=(1/F)fJ'I'.), f; =Y -X.
lt is not very difficult to

(24.88)

see that this


estimator can be viewed as an approximation to the MLE defined in the
previous section by the first-order conditions (73/-(75) where only two
iterations were performed. One to derive ( and then substitute into (87).
Zellner went on to show that if

o *- 1

lim X!'
w--w

X:

and non-singular,
taking the form

Q+<

'.yt

distribution of

the asymptotic

V'y j#-+ jv )
-

'

Q*-

y (()
'

'

(24.89)
(87)and (88) coincide,
(24.90)

lt is interesting to note that in the cases:


(a)

X1

Xc

'

X,u X:

and
f)= diagl?:
&* jv
=

ac.

(x+xv ?

.&+

yv

mv,)

(24.91)

(see Schmidt (1976)).


Another important special case of the linear restrictions in (79)is the case

The multivariate

Iinear

model

regression

of across-equation linear homogeneous restrictions


example (80).Such restrictions can be accommodated
(82) directly by redefining the regressor matrix as

X)

x'1!

x' 2?

such as #2j #12in


into the formulation
=

(24.92)

x'mt

(where x, refers to the regressors included in the Rh equation) and the


coefficient vector #+,so as to include only the independent coefficients. The
form
yr X#1

(24.93)

+ uf

is said to be the Malinvaud form (see Malinvaud (1970/.For the above


example the restriction /y1
2
#21 can be accommodated into (80) by
defining X) and #: as
=

p .zj

Y 1t

X 1t

-Y2t

X=+

) x+?n-

1x*

(24.94)

&*in the case where f is known is

MLE of

F
=

p5j

#2c

The constrained
-+
#+

#*
*

and

jg1 x*'nr

(24.95)

l'

Given that fl is usually unknown, the MLE of p; as defined in the previous


section by (73)-(75)can be approximated by the GLS estimator based on
the iterative formula
F

>+

jj x t

j')'G

- 1

jl x t

() i- 1 y!

j g
/
(; 4 jj 6)
r
where l refers to the number of iterations which is either chosen a priori or
determined by some convergence criterion such as

Ft

=:

lj) +1

4: /

-....-

==

-j))

< ;

for some

l >0.

In the case where /= 2 the estimator


F

#-:=

.1

)()x)'f1=

'x)

- 1

wheref1=(1,/r)'l7,

(J =Y -X.

e.g,

=0.00

defined by

1.

(96)coincides

(24.97)
with

1'

jl x)'f1=

ly

r.

(24.98)

24.5 Specification testing


24.5

589

testing

Spification

ln the context of the linear regression model the F-type test proved to be by
far the most useful test in both specification as well as misspecification
analysis', see Chapters 19-22. The question which naturally arises is
whether the F-type test can be extended to the multivariate linear
regression model. The main purpose of this section is to derive an extended
F-type test which serves the same purpose as the F-test in Chapters 19-22.
From Section 24.2 we know that for the MLE'S
fkand D

(i)

1)*,

N(B, fl @ (X'X)-

'w

(24.99)

and
T(1

(ii)

(24.100)

I'F;tD, T'- k).

'v

Using these results we can deduce that, in the case where we consider one
regression from the system, say the fth,

y
the

MLE'S

of

bi

(24. 10 1)

X/f V uf'

jf and

(X X)

are

ii

X yf

o'bii-,--

i-

y -X#i

(24.102)

Moreover, using the properties of the multivariate normal (seeChapter


and Wishart (seeAppendix 1) distributions we can deduce that

Nbi, t.t)if(X'X)- : )

#iN

and

oh

T - .
t?f 1

'v

z (y,.j).

15)

(a4.j()?)

These results ensure that, in the case of linear restrictions related to jf of the
form Hv Rzjf rj against Hk : Rsjf + rf where Rf and rf are pi x k and pi x 1
known matrices and ranklR/) ;)i, the F-test based on the test statistic
=

- ri ) R..j..(X X )- 1 R j.(1- 1 (R i p-i ....j.r ) T - k


(Rd#.,.j
;...
-.ltjf
pi

'

'

'rtyy

--

'

is applicable without anj' changes. In particular,


individual coefficients based on the test statistic

'rlyi) ---j,,-s
-.-L-jx..
i (x x )
T
--

'/-'-..

'

(a4.j()4)

tests of significance for

(24.105)

,) 1

(a special case of ( 104)) is applicable to the present context without any


modifications', see Chapters 19 and 20.

590

linear regression model

The multivariate

Let us now consider the derivation of a test for the null hypothesis:
So: DB

against

=0

Sl

DB

C# 0

(24.106)

where D and C are p x k and p x m known matrices,


particularly important special case of (106)is when
D

(0,Ikc): k2 X k, B

B1

and

B2

i.e. HvL B2 0 against S1 : B2 #0. The constrained


Ho take the form

MLE'S

(x'X)- 1D'gD(x'x)

and
f

=fl

+-

1
T

( -)'(X'X)(2

ID'II- 1(D

0: kz x

ranktD)

=p.

,n,

of B and fl under

(24.107)

c)

(24.108)

-),

=(X'X) - IX/Y, f'i =(1/F)U'(J,


U =Y -X are the unconstrained
of B and fl (see Section 24.3 above). Using the same intuitive
argument as in the m 1 case (seeChapter 20) a test for Hv could be based

where

MLE'S

on the distance
hSD:

-CS(

(24.109)

The closer this distance is to zero the more the support for Ho. lf we
normalise this distance by defining the matrix quadratic form
f1-1(D

-c)'gD(X'X)

lD'j - 1(D -C),

the similarity between (110)and the F-test statistic


Moreover, in view of the equality
(T'fl +(D

fT'f;=

stemming from ( 108),

=417'17
-

-c)'ED(x'x)-

(110) can be

(24.110)
(104)is a1l too apparent.

1D'j - 1(D

-c)

(24.111)

written in the form

f.J'I'))(I')'fJ)-1,

(24.112)

where U =Y -X2. This form constitutes a direct extension of the F-test


statistic to thegeneral m > 1 case. Continuing the analogy, we can show that

U'U

Gmtf1, r- k),

(24.113)
where U'U U'MXU, Mx I -X(X'X) - 1X'. Moreover, in view of (112)the
chidistribution of 0/0 0'U is a direct extension of the non-central
'v

square distribution, the non-central Wishart, denoted by


(0'0
U'U) l4z;(f1, p; A), F y m + k,
-

,v

(24.114)

SpecifKation

24.5
where

A= f1- IIDB

(0/17
where
Mo

ID'I - IIDB -C)

C)'ED(X'X)-

is the non-centrality

testing

(24.115)

This is because

parameter.

0/U)= U'MOU

X(X'X)- 1D?(D(X'X)- ID'J- 1D(X'X)- 1X'

(24.116)

and Mo is a symmetric idempotent matrix (M/ Mo, MoMo Mp) with


ranktMp) ranktD) p. Given that Mp and Mx are orthogonal, i.e.
=

MoMx

(24.117)

0,

we can also deduce that U'MXU and U'MPU are independently distributed
(see Chapter 15).The analogy between the F-test statistic,
FF(y)

'll

S0

T-k
P

'-/

=
,

,.w

F(p, T- k),

(24.118)

in the case k::zr1, and C as defined in (112), is established. The problem,


however, is that G is a random m x m matrix, not a random variable as in
the case of (118).The obvious way to reduce a matrix to a scalar is to use a
matrix real-valued function such as the determinant or the trace of a matrix:
zj(Y)=det((fJ'fJ
zz(Y)=trE(I7'I7

17,1-.))(f.J,fJ)-1q,

I-J'I'.))(I')'(J)-

1q.

(24. 119)
(24.120)

ln order to construct tests for Ho against H3 using the rejection regions


C1

J'tY :

z i (Y)>

(.'x

12

1.=

(24. 12 1)

we need the distribution of the test statistics r1(Y) and za(Y). These
distributions can be derived from thejoint distribution of the eigenvalues of
21, 2c,
kt where l mintm. p). because
, say,
.

(24. 122)
The distribution of 2 > (21 k.
;.!)' was derived by Constantine (1963)
and James (1964)in terms of a zonal polynomial expansion. Constantine
(1966) went on to derive the distribution of z2(y) in terms of generalised
Laguerre polynomials which is rather complicated to be used directly. For
this reason several approximations in terms of the non-central chi-square
distribution have been suggested in the statistical literature; see Muirhead
(1982) for an excellent discussion. Based on such approximations several
.

592

The multivariate

linear regression model

tables relating to the upper percentage points of


(T- kh2(')

z1(y)
=

have been constructed


asymptotic result,

(24.12.3

(seeDavis ( 1970)). Forlarge

T-k

we can also use tht'

Ho

':ltyl zzlmp),
-

in order to derive c. in ( 121). The test statistic is known as the Lawlel',


Hotelling statistic. Similar results can be derived for the determinental ratio
test statistic

z?(y)
-

(F- k)'r1(y).

(24.125)

statistics z1(y) and z2(y) can

The test
be interpreted as arising from the
Wald test procedure discussed in Chapter 16. Using the other two test
procedures, the likelihood ratio and Lagrange multiplier procedures, we
can construct alternative tests for Hv against Sj. The Iikelihood ratio test
procedure gives rise to the test statistic
L( Y)
L(4., Y)

.I-.RIYI
ln terms

dettf'i)

2'F

'

of the eigenvalues of

1,R(Y)=

C this test

1j.

(24.127)

region

is defined by

tY: f,R(Y) Gca).,

(24.126)

statistic takes the form

and thus its rejection


C1

detg(')(U'fJ)

1-11+ ki
=

dettf'i)

(24.128)

cx being determined by the distribution of .LS(Y) under Hv. This


distribution can be expressed as the product of p independent beta
distributed random variables (see Johnson and Kotz ( 1970), Anderson
( 1984), inter alia). For large T we might also use the asymptotic result
Ho

- v.
where F*

log LR(y)

z2(,np),

(24. 129)

gF-k
+ 1)j (;)y?.n)', see Davis (1979)for tables of
upper percentage points ca.
The Lagrange multiplier test statistic based on the function

/(p,2)

-.(?'n

F
=

-p

-j-

dettfl)

trtfl - (Y XBIIY XB) ) trtA


,

(ss

.c))

(,4. jyt)l

24.5 Specification

testing

can be expressed in the form:

z-M(v)=tr(G),

(24.131)

where

(7 (f;'I7 - fJ'U)(fJ'U)- 1
(24.132)
This test statistic is known in the statistical literature as Pillai's trace test
statistic because it was suggested by Pillai (1955), but not as a Lagrange
multiplier test statistic. In terms of the eigenvalues 21, 2a,
2,,, this
statistic takes the form
=

z,Af(v)

'i

-1V;.i

(24.133)

'

The distribution of f>M(Y) was obtained by Pillai and Jayachandran (1970)


but this is also rather complicated to be used directly and several
approximations have been suggested in the literature; see Pillai ( 1976),
( 1977), for references. For large F- k the critical value c-a for a rejection
region identical to ( 12 1) can be based on the asymptotic result
Hv

(T'- k)f-M(Y)

z2(mp).

test statistic known as I4'ro's ratio test statistic

A similar

other matrix scalar function of (7, the determinant


z3(Y)

dettG).

(24.134)
is defined as the

(24.135)

ln terms of the eigenvalues of (-1 this test statistic is


': 3(Y)

'wi

(24.136)

1+ ;.i

It is interesting to note that (7 as defined above is directly related to


multivariate goodness of tit measure G as defined in Section 24.2 above.
Note that
G

fv'v

f-r'(;)('''Y)

In order to see the relationship


HvL

B2

0.

let us consider the special case where

H 1 : Bc # 0.

and
XI

(24.137)

B1 + XcBa + U.

(24.138)
(24.139)
where :1

residuals
the restricted
by U =Y -Xj21
IXIY
view
(X'jX1)C as the multivariate multiple correlation
we can
Defining

The multivariate

594

linear regression model

matrix of the auxiliary multiple regression

U =XIBI

+ XcBc + V.

(24.140)

All the tests based on the test statistics mentioned so far are unbiased but
no uniformly most powerful test exists for Hv against Sl; see Giri ( 1977),
Hart and Money (1976)for power comparisons.
A particularly important special case of Hv : DB C 0 against S1 :
DB C # 0 is the case where the sample peliod is divided into two sub71) and T2 (74+ 1,
F), where F- rl F2
samples, say, T1
2,
and '.TkF2 > k. If we allow the conditional means for the two sub-periods to
be different, i.e. for
-

=41,

+ U1

(24.141)

XaBc + U2,

(24.142)

=XIBI
r e: T1 : Y1

t G T2:

Y2

but the conditional

variances to be the same, i.e.

E (Yf/.?; Xf) f
=

(24.143)

(x)1.6,

then the hypothesis:so: B1 Bc against Sl : B1 # B2 can be accommodated


into the above formulation with
=

and

=(l,

-Ik),

j, c

B1

B=

B2

('t))-()

x0-)(Bs))+(7)).

()

(24. 144)

This is a direct extension of the F-test for structural change considered in


Chapter 21. ln the same chapter it was argued that we need to test the
equality of the conditional variances before we apply the coefficient

constancy test.
A natural way to proceed in order to test the hypothesis Ho f11 f12
against Sl : f11
is to generalise the F-test derived in Chapter 21. That
scalar
function of the
is, use a
=

c#f12

'ratio':

=f1

f- 1

(24.145)

>

where
fl- i

1
Fl

U'U
f,
i

such as

d,
(ii)

dettflafll-

1),-

dz trtflcf-il- 1).
=

1 2.
,

596

The

24.6

Misspilication

multiYariate

linear regression model


testing

Misspecification testing in the context of the multivariate linear regression


model is of considerable interest in econometric modelling because of its
relationship to the simultaneous
equations model to be discussed in
Chapter 25. As argued above, the latter model is a reparametrisation of the
former and the reparametrisation
can at best be as well defined
(statistically) as the statistical parameters 0n (B, D). ln practice, before any
questions related to the theoretical parameters of interest ( can be asked the
misspecification testing for 0 must be successfully completed.
As far as assumptions E1()-(8(l
are concerned the discussion in Chapters
19 to 22 is directly applicable with minor modications.
Let us consider the
probability and sampling model assumptions in the present case where
l'n> l
.

Normality. The assumption that D(y_t/X_t; θ) is multivariate normal can be tested using a multivariate extension of the skewness-kurtosis test. The skewness and kurtosis coefficients for a random vector u_t with mean zero and covariance matrix Ω are defined by

α3,m = E[(u_t'Ω^{-1}u_s)³] and α4,m = E[(u_t'Ω^{-1}u_t)²],

respectively. In the case where u_t ~ N(0, Ω),

α3,m = 0 and α4,m = m(m + 2).  (24.157)

These coefficients can be estimated by

α̂3,m = (1/T²) Σ_{t=1}^T Σ_{s=1}^T ĝ_ts³  (24.158)

and

α̂4,m = (1/T) Σ_{t=1}^T ĝ_tt², where ĝ_ts = û_t'Ω̂^{-1}û_s, t, s = 1, 2, ..., T.  (24.159)

Asymptotically, we can show that under H0

τ3 = (T/6)α̂3,m ~ χ²(p), p = m(m + 1)(m + 2)/6,  (24.160)

and

τ4 = T(α̂4,m − m(m + 2))²/(8m(m + 2)) ~ χ²(1).  (24.161)

Note that in the case where m = 1 these reduce to

(T/6)α̂3² ~ χ²(1) and T(α̂4 − 3)²/24 ~ χ²(1).  (24.162)

Using (160) and (161) we can define separate tests for the hypotheses

H0(1): α3,m = 0 against H1(1): α3,m ≠ 0

and

H0(2): α4,m = m(m + 2) against H1(2): α4,m ≠ m(m + 2),

respectively. When D(y_t/X_t; θ) is normal then H0(1) ∩ H0(2) is valid.

As in the case where m = 1, the above tests based on the residuals' skewness and kurtosis coefficients are rather sensitive to outliers and should be used with caution. For further discussion of tests for multivariate normality see Mardia (1980), Small (1980) and Seber (1984), inter alia.
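A minimal Python sketch of the skewness-kurtosis statistics (24.158)-(24.161), assuming a T × m matrix U of residuals (the function and variable names are our illustration):

import numpy as np
from scipy.stats import chi2

def mardia_statistics(U):
    T, m = U.shape
    Om = U.T @ U / T                      # Omega-hat
    G = U @ np.linalg.inv(Om) @ U.T       # g_ts = u_t' Om^{-1} u_s
    a3 = (G ** 3).sum() / T ** 2          # skewness estimator, (24.158)
    a4 = (np.diag(G) ** 2).mean()         # kurtosis estimator, (24.159)
    tau3 = T * a3 / 6                     # (24.160)
    tau4 = T * (a4 - m * (m + 2)) ** 2 / (8 * m * (m + 2))  # (24.161)
    p = m * (m + 1) * (m + 2) // 6
    return tau3, chi2.sf(tau3, p), tau4, chi2.sf(tau4, 1)

rng = np.random.default_rng(1)
print(mardia_statistics(rng.standard_normal((200, 3))))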
Linearity. A test for the linearity of the conditional mean can be based on the auxiliary regression

û_t = (B0 − B̂)'x_t + Γ'ψ_t + ε_t, t = 1, 2, ..., T,  (24.163)

where ψ_t ≡ (ψ_1t, ..., ψ_pt)' are the higher-order terms related to the Kolmogorov-Gabor or RESET type polynomials (see (21.10) and (21.11)). The hypothesis to be tested in the present case takes the form

H0: Γ = 0 against H1: Γ ≠ 0.  (24.164)

This hypothesis can be tested using any one of the tests τ_i(Y), i = 1, 2, 3, LR(Y) or LM(Y) discussed in Section 24.4.

Homoskedasticity. A direct extension of the White test for departures from homoskedasticity is based on the following multivariate linear auxiliary regression:

φ̂_t = c0 + C'ψ_t + ε_t,  (24.165)

where φ̂_t ≡ (û_it·û_jt, i ≥ j, i, j = 1, 2, ..., m)', a ½m(m + 1) × 1 vector of residual cross-products, and ψ_t ≡ (x_it·x_jt, i ≥ j, i, j = 1, 2, ..., k)'.  (24.166)

Testing for homoskedasticity can be based on

H0: C = 0 against H1: C ≠ 0,  (24.167)

a linear set of restrictions which can be tested using the tests discussed in Section 24.4 above. The main difference with the m = 1 case is that we have the cross-products of the residuals in addition to the cross-products of the regressors.
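The following Python sketch (our own illustration; the TR²-type statistic per cross-product equation is one operational version, the formal versions being the tests of Section 24.4) builds the auxiliary regression (24.165) from the cross-products of residuals and regressors:

import numpy as np

def white_auxiliary(U, X):
    T, m = U.shape
    k = X.shape[1]
    iu, ju = np.tril_indices(m)           # residual cross-products, i >= j
    ix, jx = np.tril_indices(k)           # regressor cross-products, i >= j
    Phi = U[:, iu] * U[:, ju]             # T x m(m+1)/2
    Psi = np.column_stack([np.ones(T), X[:, ix] * X[:, jx]])
    C, *_ = np.linalg.lstsq(Psi, Phi, rcond=None)
    E = Phi - Psi @ C
    R2 = 1 - E.var(axis=0) / Phi.var(axis=0)
    return T * R2                         # one TR^2 statistic per equation

rng = np.random.default_rng(2)
print(white_auxiliary(rng.standard_normal((150, 2)),
                      rng.standard_normal((150, 2))))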

Time invariance and structural change. The discussion of departures from the time invariance of θ ≡ (β, σ²) in the m = 1 case was based on the behaviour of the recursive estimators of θ, say θ̂_t, t = k + 1, ..., T. This discussion can be generalised directly to the multivariate case where θ ≡ (B, Ω) without any great difficulty. The same applies to the discussion of structural change, whose tests have been considered briefly in Section 24.4.

Independence. Using the analogy with the m = 1 case we can argue that when {Z_t, t ∈ T}, Z_t ≡ (y_t', X_t')', is assumed to be a normal, stationary and l-th order Markov process (see Chapter 8), the statistical GM takes the form

y_t = B0'x_t + Σ_{i=1}^l (A_i'y_{t−i} + B_i'x_{t−i}) + ε_t,  (24.168)

where {ε_t, t ∈ T} is a vector innovation process. If we compare (168) with the statistical GM under the independence assumption,

y_t = B'x_t + u_t,  (24.169)

we can see that the independence assumption can be tested using the auxiliary multivariate regression

û_t = (B0 − B̂)'x_t + Σ_{i=1}^l (A_i'y_{t−i} + B_i'x_{t−i}) + ε_t.  (24.170)

In particular, the hypothesis of interest takes the form

H0: A_i = 0 and B_i = 0 for all i = 1, 2, ..., l, against
H1: A_i ≠ 0 or B_i ≠ 0 for any i = 1, 2, ..., l.

This hypothesis can be tested using the multivariate F-type tests discussed in Section 24.4.

The independence test which corresponds to the autocorrelation approach (see Chapter 22) could be based on the auxiliary multivariate regression

û_t = D0'x_t + C1û_{t−1} + ··· + C_l û_{t−l} + v_t.  (24.171)

That is, test H0: C1 = C2 = ··· = C_l = 0 against H1: C_i ≠ 0 for any i = 1, 2, ..., l. This can also be tested using the tests developed in Section 24.4 for linear restrictions.

Testing for departures from the independence assumption is particularly important in econometric modelling with time-series data. When the assumption is inappropriate, a respecification of the multivariate linear regression model gives rise to the multivariate dynamic linear regression model which is very briefly considered in Section 24.8.
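A Python sketch of the autocorrelation-based auxiliary regression (24.171) with l lags of the residuals (the function is our illustration; the formal tests are those of Section 24.4):

import numpy as np

def independence_auxiliary(U, X, l=1):
    T = U.shape[0]
    Z = [np.ones((T - l, 1)), X[l:]]
    for i in range(1, l + 1):
        Z.append(U[l - i:T - i])          # the C_i u_{t-i} terms
    Z = np.column_stack(Z)
    Y = U[l:]
    C, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    E = Y - Z @ C
    R2 = 1 - E.var(axis=0) / Y.var(axis=0)
    return (T - l) * R2                   # TR^2-type statistic per equation

rng = np.random.default_rng(3)
print(independence_auxiliary(rng.standard_normal((120, 2)),
                             rng.standard_normal((120, 2)), l=2))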

24.7 Prediction

In view of the assumption that

y_t = B'x_t + u_t, t ∈ T,  (24.172)

the best predictor of y_{T+l}, given that the observations t = 1, 2, ..., T were used to estimate B and Ω, can only be its conditional expectation

ŷ_{T+l} = B̂'x_{T+l}, l = 1, 2, ...,  (24.173)

where x_{T+l} represents the observed value of the random vector X_t at T + l. The prediction error is

e_{T+l} ≡ y_{T+l} − ŷ_{T+l} = (B − B̂)'x_{T+l} + u_{T+l}.  (24.174)

Given that e_{T+l} is a linear function of normally distributed r.v.'s,

e_{T+l} ~ N(0, Ω(1 + x_{T+l}'(X'X)^{-1}x_{T+l}))  (24.175)

(see exercise 7), which is a direct generalisation of the prediction error distribution in the case of the linear regression model. Since Ω is unknown, its unbiased estimator

S = (1/(T − k))Û'Û  (24.176)

is used to construct the prediction test statistic

H = (y_{T+l} − ŷ_{T+l})'S_F^{-1}(y_{T+l} − ŷ_{T+l}),  (24.177)

where

S_F = S(1 + x_{T+l}'(X'X)^{-1}x_{T+l}).  (24.178)

Hotelling (1931) showed that

H* = [(T − k − m + 1)/((T − k)m)]H ~ F(m, T − k − m + 1),  (24.179)

and this can be used to test hypotheses about the predictions or construct prediction regions.
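A Python sketch of the prediction test (24.177)-(24.179) (function and data names are our illustration):

import numpy as np
from scipy.stats import f as f_dist

def prediction_test(Y, X, x_new, y_new):
    T, k = X.shape
    m = Y.shape[1]
    B = np.linalg.solve(X.T @ X, X.T @ Y)          # B-hat
    S = (Y - X @ B).T @ (Y - X @ B) / (T - k)      # unbiased estimator, (24.176)
    c = 1 + x_new @ np.linalg.solve(X.T @ X, x_new)
    e = y_new - B.T @ x_new                        # prediction error
    H = e @ np.linalg.solve(S * c, e)              # (24.177)-(24.178)
    Hstar = (T - k - m + 1) / ((T - k) * m) * H    # (24.179)
    return Hstar, f_dist.sf(Hstar, m, T - k - m + 1)

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3)); B0 = rng.standard_normal((3, 2))
Y = X @ B0 + rng.standard_normal((100, 2))
x_new = rng.standard_normal(3)
y_new = B0.T @ x_new + rng.standard_normal(2)
print(prediction_test(Y, X, x_new, y_new))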
24.8 The multivariate dynamic linear regression (MDLR) model

In direct analogy to the m = 1 case the MDLR model is specified as follows:

(I) Statistical GM

y_t = Σ_{i=1}^l A_i'y_{t−i} + B0'x_t + Σ_{i=1}^l B_i'x_{t−i} + u_t, t ∈ T.  (24.180)

[1] μ_t = E(y_t/σ(Y_{t−1}^0), X_t^0 = x_t^0) and u_t = y_t − E(y_t/σ(Y_{t−1}^0), X_t^0 = x_t^0).
[2] θ* ≡ (A1, ..., A_l, B0, B1, ..., B_l, Ω0) are the statistical parameters of interest.
[3] X_t is strongly exogenous with respect to θ*.
[4] The roots of the matrix polynomial det(λ^l I_m − Σ_{i=1}^l A_i λ^{l−i}) = 0 lie within the unit circle.
[5] Rank(X*) = k*, k* = ml + k(l + 1), where X* ≡ (Y_{−1}, ..., Y_{−l}, X, X_{−1}, ..., X_{−l}): T × k*.

(II) Probability model

Φ = { D(y_t/Z_{t−1}^0; ψ*) = (2π)^{−m/2}(det Ω0)^{−1/2} exp{−½(y_t − B*'X_t*)'Ω0^{-1}(y_t − B*'X_t*)}, ψ* ∈ Ψ }, t ∈ T.  (24.181)

[6] (i) D(y_t/Z_{t−1}^0; ψ*) is normal;
(ii) E(y_t/σ(Y_{t−1}^0), X_t^0 = x_t^0) = B*'X_t*, linear in X_t*;
(iii) Cov(y_t/σ(Y_{t−1}^0), X_t^0 = x_t^0) = Ω0, homoskedastic;
[7] θ* is time invariant.

(III) Sampling model

[8] Y ≡ (y_1, ..., y_T)' is a non-random sample sequentially drawn from D(y_t/Z_{t−1}^0; ψ*), t = 1, 2, ..., T, respectively.

Here B*' ≡ (A1', A2', ..., A_l', B0', B1', ..., B_l') and X_t* ≡ (y_{t−1}', ..., y_{t−l}', x_t', x_{t−1}', ..., x_{t−l}')'.

The estimation, misspecification and specification testing in the context of this statistical model follows closely that of the multivariate linear regression model considered in Sections 24.2-24.5 above. The modifications to these results needed to apply to the MDLR model are analogous to the ones considered in the context of the m = 1 case (see Chapter 23). In particular, the approximate MLE's of θ*,

B̂* = (X*'X*)^{-1}X*'Y  (24.182)

and

Ω̂0 = (1/T)Û*'Û*, Û* = Y − X*B̂*,  (24.183)

behave asymptotically like B̂ and Ω̂ (see Anderson and Taylor (1979)).


Moreover, the multivariate F-type tests considered in Section 24.4 are asymptotically justifiable in the context of the MDLR model.

For policy analysis and prediction purposes it is helpful to reformulate the statistical GM (180) in order to express it in the first-order autoregression form

y_t* = A1*'y_{t−1}* + B1*'Z_t* + u_t*,  (24.184)

where y_t* ≡ (y_t', y_{t−1}', ..., y_{t−l+1}')', Z_t* ≡ (x_t', x_{t−1}', ..., x_{t−l}')', u_t* ≡ (u_t', 0', ..., 0')' and

A1*' = ( A1'  A2'  ...  A_{l−1}'  A_l' )
       ( I_m  0    ...  0         0   )
       ( ...                          )
       ( 0    0    ...  I_m       0   ),

B1*' = ( B0'  B1'  ...  B_l' )
       ( 0    0    ...  0    )
       ( ...                 )
       ( 0    0    ...  0    ).  (24.185)

Solving (184) recursively yields

y_t* = (A1*')^t y_0* + Σ_{τ=0}^{t−1} (A1*')^τ B1*'Z_{t−τ}* + Σ_{τ=0}^{t−1} (A1*')^τ u_{t−τ}*.  (24.186)

This is known in the econometric literature as the final form, with

M0 = B1*',  (24.187)

known as the impact multiplier, and

M_τ = (A1*')^τ B1*', τ = 1, 2, ...,  (24.188)

the interim multipliers of delay τ, respectively. The solution in (186) also provides us with the so-called long-run multiplier matrix defined by

L = Σ_{τ=0}^∞ B1A1^τ = B1(I − A1)^{-1}.  (24.189)

The elements l_ij of this matrix refer to the total expected response of the i-th endogenous variable to a sustained unit change in the j-th exogenous variable, holding the other exogenous variables constant (see Schmidt (1973), (1979) for a discussion of the statistical analysis of these multipliers).
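For the first-order case (l = 1) the multipliers are easily computed; the following Python sketch (our illustration, with A1 and B1 in the orientation of the statistical GM y_t = A1'y_{t−1} + B1'x_t + u_t, so the long-run matrix appears as (I − A1')^{-1}B1') returns the impact, interim and long-run multipliers of (24.187)-(24.189):

import numpy as np

def multipliers(A1, B1, tau_max=4):
    m = A1.shape[0]
    M = [B1.T]                                   # impact multiplier M_0
    for _ in range(tau_max):
        M.append(A1.T @ M[-1])                   # interim M_tau, (24.188)
    L = np.linalg.solve(np.eye(m) - A1.T, B1.T)  # long-run multiplier, (24.189)
    return M, L

# stable example: eigenvalues of A1 inside the unit circle
A1 = np.array([[0.5, 0.1], [0.0, 0.3]])
B1 = np.array([[1.0, 0.0], [0.5, 1.0]])
M, L = multipliers(A1, B1)
print(M[0]); print(L)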
Returning to the question of prediction, we can see that the natural predictor for y_{T+1}, given y_1, ..., y_T and x_1, ..., x_{T+1}, is

ŷ*_{T+1} = Â1*'y_T* + B̂1*'Z*_{T+1}.  (24.190)

In order to predict y_{T+2} we need to know x_{T+1}, x_{T+2} as well as y_{T+1}. Assuming that x_{T+1} and x_{T+2} are available, we can use the predictor of y_{T+1}, ŷ*_{T+1}, in order to get ŷ*_{T+2}:

ŷ*_{T+2} = Â1*'ŷ*_{T+1} + B̂1*'Z*_{T+2} = (Â1*')²y_T* + Â1*'B̂1*'Z*_{T+1} + B̂1*'Z*_{T+2}.  (24.191)

Hence,

ŷ*_{T+l} = (Â1*')^l y_T* + Σ_{j=1}^l (Â1*')^{l−j} B̂1*'Z*_{T+j}, l = 1, 2, ...,  (24.192)

will provide predictions for future values of y_t, assuming that the values taken by the regressors are available. For the asymptotic covariance matrix of the prediction error (y_{T+l} − ŷ_{T+l}) see Schmidt (1974).

Prediction in the context of the multivariate (dynamic) linear regression model is particularly important in econometric modelling because of its relationship with the simultaneous equations model discussed in the next chapter. The above discussion of prediction carries over to the simultaneous equations model with minor modifications, and the concepts of impact, interim and long-run multipliers are useful in simulations and policy analysis.
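A Python sketch of the recursion (24.192) in the first-order case, assuming the future regressor values are available (names are our illustration):

import numpy as np

def predict_path(A1, B1, y_T, x_future):
    preds, y = [], y_T
    for x in x_future:             # x_{T+1}, x_{T+2}, ...
        y = A1.T @ y + B1.T @ x    # y-hat_{T+j} = A1'y-hat_{T+j-1} + B1'x_{T+j}
        preds.append(y)
    return np.array(preds)

A1 = np.array([[0.5, 0.1], [0.0, 0.3]])
B1 = np.array([[1.0, 0.0], [0.5, 1.0]])
print(predict_path(A1, B1, np.array([0.2, -0.1]),
                   np.array([[1.0, 0.0]] * 3)))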

Appendix 24.1 - The Wishart distribution

Let {Z_t, t ∈ T} be a sequence of n × 1 independent random vectors such that Z_t ~ N(0, Ω), t ∈ T; then for T ≥ n, S = Σ_{t=1}^T Z_tZ_t' ~ W_n(Ω, T), where the density function of S is

D(S; θ) = c (det S)^{(T−n−1)/2} (det Ω)^{−T/2} exp{−½ tr(Ω^{-1}S)},

where

c^{-1} = 2^{Tn/2} π^{n(n−1)/4} Π_{i=1}^n Γ((T + 1 − i)/2),

Γ(·) being the gamma function (see Press (1972)).

Properties of the Wishart distribution

W1: If S1, ..., S_k are independent n × n random matrices and S_i ~ W_n(Ω, T_i), i = 1, 2, ..., k, then

Σ_{i=1}^k S_i ~ W_n(Ω, T), where T = Σ_{i=1}^k T_i.

W2: If S ~ W_n(Ω, T) and M is a k × n matrix of rank k, then

MSM' ~ W_k(MΩM', T)

(see Muirhead (1982), Press (1972), inter alia). These results enable us to deduce that if S ~ W_n(Ω, T) and S and Ω are conformally partitioned as

S = ( S11  S12 ),   Ω = ( Ω11  Ω12 ),
    ( S21  S22 )        ( Ω21  Ω22 )

where S11: n1 × n1, S22: n2 × n2, n = n1 + n2, then:

(a) S_ii ~ W_{n_i}(Ω_ii, T), i = 1, 2;
(b) S11 and S22 are independent if Ω12 = 0;
(c) (S11 − S12S22^{-1}S21) ~ W_{n1}(Ω11 − Ω12Ω22^{-1}Ω21, T − n2) and is independent of S12 and S22;
(d) (S12/S22) ~ N(Ω12Ω22^{-1}S22, (Ω11 − Ω12Ω22^{-1}Ω21) ⊗ S22).
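The definition can be checked numerically; the following Python sketch (our illustration, using scipy's Wishart implementation) compares the Monte Carlo mean of W_n(Ω, T) draws with the theoretical mean TΩ:

import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(5)
Omega = np.array([[2.0, 0.5], [0.5, 1.0]])
T, n = 50, 2
Z = rng.multivariate_normal(np.zeros(n), Omega, size=T)
S = Z.T @ Z                      # one draw of S = sum_t Z_t Z_t'
draws = wishart.rvs(df=T, scale=Omega, size=2000, random_state=rng)
print(S)
print(draws.mean(axis=0))        # should be close to T * Omega
print(T * Omega)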

Appendix 24.2 - Kronecker products and matrix differentiation

The Kronecker product between two arbitrary matrices A: m × n and B: p × q is defined by

A ⊗ B = ( a11B  a12B  ...  a1nB )
        ( a21B  a22B  ...  a2nB )
        ( ...                   )
        ( am1B  am2B  ...  amnB ),

an mp × nq matrix. Let A, B, C and D be arbitrary matrices; then:

(i) A ⊗ (αB) = α(A ⊗ B), α being a scalar;
(ii) (A + B) ⊗ C = A ⊗ C + B ⊗ C, A and B being of the same order;
(iii) A ⊗ (B + C) = A ⊗ B + A ⊗ C, B and C being of the same order;
(iv) A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C;
(v) (A ⊗ B)' = A' ⊗ B';
(vi) (A ⊗ B)(C ⊗ D) = AC ⊗ BD;
(vii) (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}, A and B being square non-singular matrices;
(viii) vec(ACB) = (B' ⊗ A) vec(C);
(ix) tr(A ⊗ B) = (tr A)(tr B), A and B being square matrices;
(x) det(A ⊗ B) = (det A)^m (det B)^n, A and B being n × n and m × m matrices;
(xi) vec(A + B) = vec(A) + vec(B), A and B being of the same order;
(xii) tr(AB) = (vec(B'))'(vec(A)).
=

Usejl

derlvatives

' logtdet A)
1
.4A - )
-- dA

,.

t?trtAB
h

'-.

4y-7
1E3;

( tr(A'B)
PB

'

'

= A;

t7tr(X'AXB)
= AXB
px
tr(.?')

(v)

(.7.4

n.h -

t?vec(AXB)
.
( vec X

Important

.s,

+ A'XB''

'

'

'

s x.

concepts

Wishart distribution, trace correlation- coefficient of alienation, iterative


numerical optimisation, SURE and Malinvaud formulations, estimated
GLS estimator, exclusion restrictions, linear homogeneous restrictions,
final form, impact, interim and long-run multipliers.

Questions

1. Compare the linear regression and multivariate linear regression statistical models.
2. Explain how linearity and homoskedasticity are related to the normality of Z_t.
3. Compare the formulations Y = XB + U and y* = X*β* + u*.
4. 'How do you explain the fact that, although y_1t, ..., y_mt are correlated, the MLE estimator of B, given by B̂ = (X'X)^{-1}X'Y, does not involve Ω?' Derive the distribution of B̂.
5. Explain why the assumption T ≥ m + k is needed for the existence of the MLE Ω̂ of Ω. Discuss the distribution of Ω̂.
6. Discuss the relationship between the goodness-of-fit measures d1 = (1/m) tr Ĝ and d2 = det(Ĝ), where Ĝ = I − (Y'Y)^{-1}Û'Û; det(Ĝ) is known as Hotelling's coefficient of alienation.
7. State the distributions of the MLE's B̂ and Ω̂ and discuss the properties which can be deduced from their distributions.
8. Discuss intuitively how the conditions ... imply that ...
9. Give two examples for each of the following forms of restrictions:
(i) D1B + C1 = 0;
(ii) BΓ1 + A1 = 0.
Discuss the differences between them.
10. Explain how the linear restrictions formulation Rβ* = r generalises (i) and (ii) in question 9.
11. Verify the following equality:

Ω̃ = Ω̂ + (1/T)(B̃ − B̂)'X'X(B̃ − B̂), where B̃ = B̂ − (B̂Γ1 + A1)(Γ1'Ω̂Γ1)^{-1}Γ1'Ω̂.

12. What is the reason for the interest in the Zellner and Malinvaud formulations?
13. Explain how the GLS-type estimators of β* for the Zellner and Malinvaud formulations can be derived using some numerical optimisation iterative formula.
14. Discuss the question of constructing a test for H0: DB − C = 0 against H1: DB − C ≠ 0.
15. Compare the following test statistics defined in Section 24.5: τ1(Y), τ2(Y), LR(Y), LM(Y), τ3(Y).
16. Discuss the question of testing for departures from normality and compare it with the same test for the m = 1 case.
17. Explain how you would go about testing for linearity, homoskedasticity and independence in the context of the multivariate linear regression model.
18. 'Misspecification testing in the context of the multivariate dynamic linear regression model is related to that of the non-dynamic model in the same way as the dynamic linear regression is related to the linear regression model.' Discuss.
19. Explain the concepts of impact, interim and long-run multipliers and discuss their usefulness.

Exercises

1. Verify the following:
(i) vec(B) ≠ vec(B');
(ii) Cov(vec(Y')) = I_T ⊗ Ω;
(iii) Σ_{t=1}^T (y_t − B'x_t)'Ω^{-1}(y_t − B'x_t) = tr[Ω^{-1}(Y − XB)'(Y − XB)].
2. Using the relationships L² = L, P² = P and LP = 0, show that for L = G1(G1'X'XG1)^{-1}G1'X'X, P takes the form

P = (X'X)^{-1}D1'(D1(X'X)^{-1}D1')^{-1}D1,

where (G1, G1*) ≡ D^{-1}, D ≡ (D1', D1*')' (see Section 24.3).
3. Verify the formulae (58), (59) and (62), (63).
4. Consider the following system of equations:

y_1t = β11x_1t + β21x_2t + β31x_3t + β41x_4t + u_1t,
y_2t = β12x_1t + β22x_2t + β32x_3t + β42x_4t + u_2t.

Discuss the estimation of this system in the following three cases:
(i) no a priori restrictions;
(ii) β31 = 0, β42 = 0;
(iii) β21 = β22.
5. Derive the F-type test for H0: DB − C = 0 against H1: DB − C ≠ 0.
6. 'In testing for departures from the assumption of independence we can use either of the following auxiliary equations:

û_t = (B0 − B̂)'x_t + Σ_{i=1}^l (A_i'y_{t−i} + B_i'x_{t−i}) + v_t,
û_t = (B0 − B̂)'x_t + Σ_{i=1}^l C_iû_{t−i} + v_t,

because X'Û = 0 and thus both cases should give the same answer.' Discuss.
7. Construct a (1 − α) prediction region for y_{T+l}.

Additional references

Anderson (1984); Kendall and Stuart (1968); Mardia et al. (1979); Morrison (1976); Srivastava and Khatri (1979).

The simultaneous equations model

25.1 Introduction

The simultaneous equations model was first proposed by Haavelmo (1943) as a way to provide a statistical framework in the context of which a theoretical model which comprises a system of simultaneous interdependent equations can be analysed. His suggestion provided the basis of a research programme undertaken by the Cowles Foundation during the late 1940s and early 50s. Their results, published in two monographs (Koopmans (1950), Hood and Koopmans (1953)), dominated the econometric research agenda for the next three decades.

In order to motivate the simultaneous equations formulation let us consider the following theoretical model:

m_t = α11 + α21i_t + α31y_t + α41p_t,  (25.1)
i_t = α12 + α22m_t + α32p_t + α42g_t,  (25.2)

where m_t, i_t, p_t, y_t, g_t refer to (the logs of) the theoretical variables money, interest rate, price level, income and government budget deficit, respectively. For expositional purposes let us assume that there exist observed data series which correspond one-to-one to these theoretical variables. That is, (1)-(2) is also an estimable model (see Chapter 1). The question which naturally arises is to what extent the estimable model (1)-(2) can be statistically analysed in the context of the multivariate linear regression model discussed in Chapter 24. A moment's reflection suggests that the presence of the so-called endogenous variables i_t and m_t on the RHS of (1) and (2), respectively, raises new problems. The alternative formulation:

m_t = β11 + β21y_t + β31p_t + β41g_t + u_1t,  (25.3)
i_t = β12 + β22y_t + β32p_t + β42g_t + u_2t,  (25.4)

can be analysed in the context of the multivariate linear regression model because the β_ijs, i = 1, 2, 3, 4, j = 1, 2, are directly related to the statistical parameters B of the multivariate linear regression model (see Chapter 24).

This suggests that if we could find a way to relate the two parametrisations in (1)-(2) and (3)-(4), we could interpret the theoretical parameters α_ijs as a reparametrisation of the β_ijs. In what follows it is argued that this is possible as long as the α_ijs can be uniquely defined in terms of the β_ijs. The way this is achieved is by 'reparametrising' (3)-(4) first into the formulation:

m_t = a11 + a21i_t + a31y_t + a41p_t + a51g_t + ε_1t,  (25.5)
i_t = a12 + a22m_t + a32y_t + a42p_t + a52g_t + ε_2t,  (25.6)

and then deriving the α_ijs by imposing restrictions on the a_ijs such as a51 = 0, a32 = 0. In view of this it is important to emphasise at the outset that the simultaneous equations formulation should be interpreted as a theoretical parametrisation of particular interest in econometric modelling because it 'models' the co-determination of behaviour, and not as a statistical model.

In Section 25.2 the relationship between the multivariate linear regression and the simultaneous equations formulation is explicitly derived in an attempt to introduce the problem of reparametrisation and overparametrisation. The latter problem raises the issue of identification, which is considered in Section 25.3. The specification of the simultaneous equations model as an extension of the multivariate linear regression model, where the statistical parameters of interest do not coincide with the theoretical (structural) parameters of interest, is discussed in Section 25.4. The estimation of the theoretical parameters of interest by the method of maximum likelihood is considered in Section 25.5. Section 25.6 considers two least-squares estimators in an attempt to enhance our understanding of the problem of simultaneity and its implications. These estimators are related to the instrumental variables method in Section 25.7. In Section 25.8 we consider misspecification testing at three different but interrelated levels. Section 25.9 discusses the issues of specification testing and model selection. In Section 25.10 the problem of prediction is briefly discussed.

It is important to note at the outset that even though the dynamic simultaneous equations model is not explicitly considered, the results which follow can be extended to the more general case in the same way as in the context of the multivariate linear regression model (see Chapter 24). In particular, if we interpret X_t as including all the predetermined variables, i.e.

X_t ≡ (y_{t−1}, ..., y_{t−l}, x_t', x_{t−1}, ..., x_{t−l})',

the following results on estimation, misspecification and specification testing, as well as prediction, go through asymptotically (see Hendry and Richard (1983)).

25.2 The multivariate linear regression and simultaneous equations models

The multivariate linear regression model discussed in Chapter 24 was based on the statistical GM:

y_t = B'x_t + u_t,  (25.7)

with θ ≡ (B, Ω), B = Σ22^{-1}Σ21, Ω = Σ11 − Σ12Σ22^{-1}Σ21, the statistical parameters of interest. As argued in Section 25.1, for certain estimable models in econometric modelling the theoretical parameters of interest do not coincide with θ and we need to reparametrise (7) so as to accommodate such models.

In order to motivate the reparametrisation needed to accommodate estimable models such as (1)-(2), let us separate y_1t from the other endogenous variables y_t^(1) ≡ (y_2t, ..., y_mt)' and decompose B and Ω conformably in an obvious notation:

B = (β1, B(1)),   Ω = ( ω11  ω12' ).  (25.8)
                      ( ω21  Ω22  )

A natural way to get the endogenous variables y_t^(1) into the systematic component, purporting to explain y_1t, is to condition on the σ-field generated by y_t^(1), say σ(y_t^(1)), i.e. for

μ_1t = E(y_1t/σ(y_t^(1)), X_t = x_t),  (25.9)

μ_1t = γ0'y_t^(1) + α1'x_t,  (25.10)

where γ0 = Ω22^{-1}ω21 and α1 = β1 − B(1)Ω22^{-1}ω21. The systematic component defined by (10) can be used to construct the statistical GM

y_1t = γ0'y_t^(1) + α1'x_t + ε_1t,  (25.11)

where

(i) ε_1t = y_1t − E(y_1t/σ(y_t^(1)), X_t = x_t), E(ε_1t) = E[E(ε_1t/σ(y_t^(1)), X_t = x_t)] = 0;
(ii) E(μ_1tε_1t) = E[E(μ_1tε_1t/σ(y_t^(1)), X_t = x_t)] = 0;
(iii) v11 ≡ E(ε_1t²) = ω11 − ω12'Ω22^{-1}ω21

(see Chapter 7 for the properties of the conditional expectations needed to prove (i)-(iii)).

Looking at (11) we can see that such a statistical GM can easily accommodate any estimable equation of the form

m_t = a11 + a21i_t + a31y_t + a41p_t  (25.12)

separately. The thing to note about (11) is that its parameters, which we call structural parameters,

ζ1 ≡ (γ0, α1, v11),  (25.13)

v11 = ω11 − ω12'Ω22^{-1}ω21,  (25.14)

are simple functions of B and Ω. That is, they constitute an alternative parametrisation of θ in D(y_t/X_t; θ), based on the decomposition

D(y_t/X_t; θ) = D(y_1t/y_t^(1), X_t; ψ1)·D(y_t^(1)/X_t; ψ(1)).  (25.15)

Moreover, the normality assumption ensures that y_t^(1) is indeed weakly exogenous with respect to the parameters (α1, γ0, v11). The statistical GM (11) is a hybrid of the linear and stochastic linear regression models. These comments suggest that the so-called simultaneity problem has nothing to do with the presence of stochastic variables among the regressors.

The decomposition (15) holds for any one endogenous variable y_it, giving rise to

D(y_t/X_t; θ) = D(y_it/y_t^(i), X_t; ψ_i)·D(y_t^(i)/X_t; ψ(i))  (25.16)

and a statistical GM:

y_it = γ0i'y_t^(i) + α_i'x_t + ε_it.  (25.17)

The problem, however, is that no decomposition of D(y_t/X_t; θ) exists which can sustain all m equations i = 1, 2, ..., m in (17). That is, the system

Γ'y_t + A'x_t + ε_t = 0, ε_t ≡ (ε_1t, ε_2t, ..., ε_mt)',  (25.18)

(the ith column Γ_i of Γ is essentially γ_i^0 with −1 added as its ith element) is not a well-defined statistical GM because it constitutes an over-reparametrisation of (7). For one equation, say the first, the reparametrisation

ψ1 ≡ (γ0, α1, v11), ψ(1) ≡ (B(1), Ω22)

is well defined, but the cartesian product ψ1 × ψ2 × ··· × ψ_m is not.

A particular case where the cartesian product ×_{i=1}^m ψ_i is a proper reparametrisation of θ is when there exists a natural ordering of the y_its such that y_jt depends only on the y_its up to i = j − 1, i.e.

E(y_jt/σ(y_t^(j)), X_t = x_t) = E(y_jt/σ(y_it, i = 1, 2, ..., j − 1), X_t = x_t), j = 1, 2, ..., m.  (25.19)

In this case the distribution D(y_t/X_t; θ) can be decomposed in the form

D(y_t/X_t; θ) = Π_{i=1}^m D(y_it/y_1t, ..., y_{i−1,t}, X_t; ψ_i).  (25.20)

This decomposition gives rise to a lower triangular matrix Γ and the system (18) is then a well-defined statistical GM known as a recursive system (see below).

In the non-recursive case the parametrisation given in (18) is defined in terms of n ≡ m(m − 1) + mk + ½m(m + 1) unknown parameters η ≡ (Γ, A, V), V ≡ E(ε_tε_t'), where there are only mk + ½m(m + 1) well-defined statistical parameters in θ; a shortfall of m(m − 1) parameters. Given that η constitutes a reparametrisation of θ there can only be mk + ½m(m + 1) well-defined parameters in η. In order to see the relationship between θ and η let us premultiply (18) by (Γ')^{-1} (assumed to be non-singular):

y_t = −(Γ')^{-1}A'x_t + (Γ')^{-1}ε_t.  (25.21)

If we compare (21) with (7) we deduce that θ and η are related via

BΓ + A = 0, V = Γ'ΩΓ.  (25.22)

The first thing to note about this system of equations is that they are not a priori restrictions of the form considered in Chapter 24, where Γ and A are assumed known. The system of equations in (22) 'defines' the parameters η in terms of θ. As it stands, however, the system allows an infinity of solutions for η given θ.

The parameters θ and η will be referred to as statistical and structural parameters, respectively. The system of equations (22) enables us to determine only a subset η1 of η (η ≡ (η1, η2)) for any given set of well-defined statistical parameters θ. That is, (22) can be solved for mk + ½m(m + 1) structural parameters η1 in the form

η1 = G(θ, η2).  (25.23)

Hence, we need additional information to determine η2 elsewhere. Note that η2 is an m(m − 1) × 1 vector of structural parameters. Without any additional information the structural parameters η are said to be not identified. The problem of identifying the structural parameters of interest using additional information will be considered formally in the next section. In the meantime it is sufficient to note that, if we supplement the system (22) with m(m − 1) additional independent restrictions, then a unique solution (implicit or explicit),

ζ = G*(θ),  (25.24)

exists for the structural parameters of interest ζ ≡ L(η).

There are several things worth summarising in the above discussion. Firstly, given that the structural formulation (18) is a reparametrisation of the statistical GM for the multivariate linear regression model, the structural parameters of interest ζ (when identified) are no more well defined than the statistical parameters θ. This suggests that before any question about ζ can be asked we need to ensure that θ is well defined (no misspecification test has shown any departures from the assumptions underlying the multivariate linear regression model). Secondly, (18) does not constitute a well-defined statistical GM unless Γ is a triangular matrix; the system of equations is recursive. This is because although each equation separately is well defined, the system as a whole is not, because of overparametrisation. In the case of a recursive system Γ is lower triangular and V is diagonal with

v_ii = det(Ω_i)/det(Ω_{i−1}), i = 1, 2, ..., m,  (25.25)

Ω_i being the ith leading principal submatrix of Ω. This implies that ×_{i=1}^m ψ_i constitutes a proper reparametrisation of θ with mk + ½m(m + 1) structural parameters in (Γ, A, V). Using the notation Y^0_{i−1,t} ≡ (y_1t, y_2t, ..., y_{i−1,t})', we can express the system in the recursive form

y_it = γ_i'Y^0_{i−1,t} + δ_i'x_t + ε_it, i = 1, 2, ..., m.  (25.26)

We can estimate the structural parameters (γ_i, δ_i, v_ii) by

( γ̂_i )   ( Y^0_{i−1}'Y^0_{i−1}  Y^0_{i−1}'X )^{-1} ( Y^0_{i−1}'y_i )
( δ̂_i ) = ( X'Y^0_{i−1}         X'X         )       ( X'y_i         )  (25.27)

and

v̂_ii = (1/T)(y_i − Y^0_{i−1}γ̂_i − Xδ̂_i)'(y_i − Y^0_{i−1}γ̂_i − Xδ̂_i), i = 1, 2, ..., m.  (25.28)

It can be verified that these are indeed the MLE's of (γ_i, δ_i, v_ii).
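A Python sketch of the sequential estimation (25.27)-(25.28) for a recursive system (our illustration; each equation is estimated by OLS on the preceding endogenous variables and X):

import numpy as np

def recursive_mle(Y, X):
    T, m = Y.shape
    out = []
    for i in range(m):
        Z = np.column_stack([Y[:, :i], X])        # (Y^0_{i-1}, X)
        c, *_ = np.linalg.lstsq(Z, Y[:, i], rcond=None)
        e = Y[:, i] - Z @ c
        out.append((c[:i], c[i:], e @ e / T))     # gamma_i, delta_i, v_ii
    return out

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 2))
Y = np.empty((200, 3))
Y[:, 0] = X @ np.array([1.0, 0.5]) + rng.standard_normal(200)
Y[:, 1] = 0.4 * Y[:, 0] + X[:, 1] + rng.standard_normal(200)
Y[:, 2] = 0.3 * Y[:, 1] - 0.2 * Y[:, 0] + rng.standard_normal(200)
print([g for g, d, v in recursive_mle(Y, X)])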

25.3 Identification using linear homogeneous restrictions

As argued in the previous section, the reparametrisation of the statistical GM,

y_t = B'x_t + u_t,  (25.29)

in the form of the structural system,

Γ'y_t + A'x_t + ε_t = 0,  (25.30)

is not well defined because there are only mk + ½m(m + 1) statistical parameters θ defining (29) and m(m − 1) + mk + ½m(m + 1) structural parameters η defining (30). The two sets of parameters θ and η are related via

BΓ + A = 0, V = Γ'ΩΓ.  (25.31)

No unique solution for η exists, corresponding to any set of well-defined statistical parameters θ, unless the system of equations (31) is supplemented with some additional a priori restrictions.

In order to simplify the discussion of the identification problem, let us assume that the system V = Γ'ΩΓ determines V for given Γ and Ω and concentrate on the determination of Γ and A. The system of equations BΓ + A = 0, written in the form

HΔ = 0, where H ≡ (B, I_k) and Δ ≡ (Γ', A')',  (25.32)

is linear in Δ for given B. In Kronecker product notation (see Appendix 24.2), (32) can be expressed as

(I_m ⊗ H) vec(Δ) = 0,  (25.33)

where vec(Δ) ≡ (δ_1', δ_2', ..., δ_m')' stacks the columns of Δ into an m(m + k) × 1 vector. The system of equations (33) cannot be solved uniquely for vec(Δ) because rank(I_m ⊗ H) = mk < m(m + k − 1) for m > 1. For a unique solution we need to supplement the system with a priori restrictions such as the linear homogeneous restrictions

Φ vec(Δ) = 0,  (25.34)

ensuring that rank(Φ) ≥ m(m − 1). If this is the case then the system

( I_m ⊗ H )
(    Φ    ) vec(Δ) = 0  (25.35)

has a unique solution for vec(Δ).

Definition 1
The structural parameters Δ are said to be identified if and only if

rank[(I_m ⊗ H*); Φ] = m(m + k − 1).  (25.36)

Using the result that

rank[(I_m ⊗ H*); Φ] = mk + rank(ΦΔ*)  (25.37)

(see Schmidt (1976)), we can state condition (36) in the form

rank(ΦΔ*) = m(m − 1).  (25.38)

Note that Δ* denotes the restricted structural coefficient parameters. Hence, the structural formulation (30) is said to be identified if and only if we can supplement it with at least m(m − 1) additional restrictions. More general restrictions, as well as covariance restrictions, are beyond the scope of the present book (see Hsiao (1983), inter alia).

The identification problem in econometrics is usually tackled not in terms of the system (30) as a whole but equation by equation, using a particular form of linear homogeneous restrictions, the so-called exclusion (or zero-one) restrictions. In order to motivate the problem, let us consider the two-equation estimable model introduced in Section 25.1 above. The unrestricted structural form (30) (compare with (5)-(6)) of this model is

m_t = a21i_t + δ11 + δ21y_t + δ31p_t + δ41g_t + ε_1t,  (25.39)
i_t = a12m_t + δ12 + δ22y_t + δ32p_t + δ42g_t + ε_2t.  (25.40)

As can be seen, the two equations are indistinguishable, given that they differ only by a normalisation condition. The overparametrisation arises because the statistical GM underlying (39) and (40) is in effect

m_t = β11 + β21y_t + β31p_t + β41g_t + u_1t,  (25.41)
i_t = β12 + β22y_t + β32p_t + β42g_t + u_2t,  (25.42)

and the two parametrisations B and (Γ, A) are related via

( β11  β12 )
( β21  β22 ) ( 1    −a12 )     ( δ11  δ12 )
( β31  β32 ) ( −a21  1   ) = − ( δ21  δ22 )
( β41  β42 )                   ( δ31  δ32 )
                               ( δ41  δ42 ),  (25.43)

which cannot be solved for Γ and A uniquely, given that there are only eight β_ijs and ten a_ijs and δ_ijs.

A natural way to 'resolve' the identification problem is to impose exclusion restrictions on (39) and (40) such as δ41 = 0, δ22 = 0. These restrictions enable us to distinguish between the money and interest rate equations. Equivalently, the restrictions enable us to get a unique solution for (a21, δ11, δ21, δ31) and (a12, δ12, δ32, δ42) given (β_ij, i = 1, 2, 3, 4, j = 1, 2).

The exclusion restrictions on the ith structural equation can be expressed in the form

Φ_iδ_i = 0,  (25.44)

where Φ_i is a 'selector' matrix of zeros and ones and δ_i is the ith column of Δ. In the above example the selector matrices are

Φ_1 = (0, 0, 0, 0, 0, 1), Φ_2 = (0, 0, 0, 1, 0, 0).  (25.45)

The identification condition (36) for each equation separately takes the form

rank[H*; Φ_i] = m + k − 1, i = 1, 2, ..., m,  (25.46)

or, equivalently, the form known as the rank condition,

rank(Φ_iΔ*) = m − 1, i = 1, 2, ..., m.  (25.47)

In the special case of exclusion restrictions the order condition can be made even easier to apply. Let us consider the first equation of the system

BΓ + A = 0  (25.48)

and impose (m − m1) + (k − k1) exclusion restrictions; omit (m − m1) endogenous and (k − k1) exogenous variables from the first equation. In this case we do not need to define Φ_1 explicitly and consider (46), because we can substitute the restrictions directly into (48). Re-arranging the variables so as to have the excluded ones last, the restricted structural parameters Γ_1* and A_1* are

Γ_1* = (1, −γ_1', 0')', where γ_1: (m1 − 1) × 1,  (25.49)

A_1* = (−δ_1', 0')', where δ_1: k1 × 1.  (25.50)

Partitioning B conformably as

B = ( β11  B12 ), β11: k1 × 1, B12: k1 × (m1 − 1), β21: (k − k1) × 1, B22: (k − k1) × (m1 − 1)
    ( β21  B22 )

(the remaining columns of B, corresponding to the excluded endogenous variables, do not enter), the first column of (48) yields

−β11 + B12γ_1 + δ_1 = 0,  (25.51)
−β21 + B22γ_1 = 0.  (25.52)

Determining δ_1 from (51) presents no problems if γ_1 is uniquely determined in (52). For this to be the case we need the condition that

rank(B22) = m1 − 1.  (25.53)

In view of the result that rank(B22) ≤ min(k − k1, m1 − 1), we can deduce that a necessary condition for identification under exclusion restrictions is that

(k − k1) ≥ m1 − 1.  (25.54)

That is, the number of excluded exogenous variables is greater than or equal to the number of included endogenous variables minus one. This condition in terms of the selector matrix Φ_i takes the form

rank(Φ_i) ≥ m − 1.  (25.55)

This is known as the order condition for identification, which is necessary but not sufficient. This can be easily seen in the example (39), (40) above when the exclusion restrictions are δ41 = δ42 = 0. The selector matrices for the two equations are Φ_1 = (0, 0, 0, 0, 0, 1), Φ_2 = (0, 0, 0, 0, 0, 1). Clearly rank(Φ_1) = rank(Φ_2) = 1 and thus the order condition is satisfied, but the rank condition (47) fails because

rank(Φ_1Δ*) = rank(0, 0) = 0.  (25.56)

This is because when g_t is excluded from both equations the restriction is 'phoney'. Such a situation arises when:
(i) all equations satisfy the same restriction; and
(ii) some other equation satisfies all the restrictions of the ith equation
(see the example below).

Let us introduce some nomenclature related to the identification of particular equations in the system (30).

Definition 2
A particular equation of the structural form (30), say the ith, is said to be:
(i) under-identified if

rank(Φ_iΔ*) < m − 1;  (25.57)

(ii) exactly identified if

rank(Φ_iΔ*) = m − 1, rank(Φ_i) = m − 1;  (25.58)

(iii) over-identified if

rank(Φ_iΔ*) = m − 1, rank(Φ_i) > m − 1.  (25.59)

The system (30) is said to be identified if every equation is identified.

Example
Consider the following structural form with the exclusion restrictions imposed:

y_1t = γ12y_2t + ε_1t,  (25.60)
y_2t = γ21y_1t + δ12x_1t + ε_2t,  (25.61)
y_3t = γ13y_1t + δ13x_1t + δ23x_2t + ε_3t.  (25.62)

The first equation is overidentified since rank(Φ_1Δ*) = 2 but rank(Φ_1) = 3. The second equation is underidentified because rank(Φ_2Δ*) = 1, even though rank(Φ_2) = 2 (the order condition holds). This is because the first equation satisfies all the restrictions of the second equation, rendering these restrictions 'phoney'. The third equation is underidentified as well, because rank(Φ_3Δ*) = 1 < 2.

It is important to note that when certain equations of the structural form (30) are overidentified, then not only are the structural parameters of interest uniquely defined by (31) but the statistical parameters θ are themselves restricted. That is, overidentifying restrictions imply that θ ∈ Θ1, where Θ1 is a subset of the parameter space Θ. An important implication of this is that the identifying restrictions cannot be tested but the overidentifying ones can (see Section 25.9).

The above discussion of the identification problem depends crucially on the assumption that the statistical parameters of interest θ ≡ (B, Ω) are well defined, i.e. the assumptions [1]-[8] underlying the multivariate linear regression model are valid. However, in the case where some assumption is invalid and the parametrisation changes, we need to reconsider the identification of the system. For example, in the case where the independence assumption is invalid and the statistical GM takes the form

y_t = Σ_{i=1}^l A_i'y_{t−i} + B0'x_t + Σ_{i=1}^l B_i'x_{t−i} + u_t,  (25.63)

the identification problem has to be related to the statistical parameters θ* ≡ (A1, ..., A_l, B0, B1, ..., B_l, Ω0), not θ. Hence, the identification of the system is not just a matter of a priori information only.
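The two conditions are mechanical to check; a Python sketch (our illustration) for the order condition (25.54) and the rank condition (25.47):

import numpy as np

def order_condition(m1, k1, k):
    # excluded exogenous >= included endogenous minus one, (25.54)
    return (k - k1) >= (m1 - 1)

def rank_condition(Phi, Delta_star, m):
    # rank(Phi @ Delta*) must equal m - 1, (25.47)
    return np.linalg.matrix_rank(Phi @ Delta_star) == m - 1

# the money equation (39) with delta_41 = 0: m1 = 2 included endogenous
# (m_t, i_t), k = 4 exogenous (constant, y, p, g), k1 = 3 included
print(order_condition(m1=2, k1=3, k=4))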

25.4 Specification

The simultaneous equations formulation as a statistical model is viewed as a reparametrisation of the multivariate linear regression model, where the theoretical (structural) parameters of interest ζ do not coincide with the statistical parameters of interest θ. In particular, the statistical model is specified as follows:

(I) Statistical GM

y_t = B'x_t + u_t, t ∈ T.  (25.64)

[1] μ_t = E(y_t/X_t = x_t) and u_t = y_t − E(y_t/X_t = x_t) are the systematic and non-systematic components, respectively.
[2] (i) θ ≡ (B, Ω) are the statistical parameters of interest, where B = Σ22^{-1}Σ21, Ω = Σ11 − Σ12Σ22^{-1}Σ21; θ ∈ Θ ≡ R^{mk} × C, C being the set of positive definite symmetric m × m matrices.
(ii) ζ ≡ L(Γ, A, V) are the theoretical parameters of interest, where Γ = H1(θ), A = H2(θ), V = H3(θ).
[3] X_t is assumed to be weakly exogenous with respect to θ (and ζ).
[4] The theoretical parameters of interest ζ = H(θ), ζ ∈ Ξ, are identified.
[5] rank(X) = k, where X ≡ (x_1, x_2, ..., x_T)': T × k, for T > k.

(II) Probability model

Φ = { D(y_t/X_t; θ) = (det Ω)^{-1/2}(2π)^{-m/2} exp{−½(y_t − B'x_t)'Ω^{-1}(y_t − B'x_t)}, θ ∈ R^{mk} × C }, t ∈ T.  (25.65)

[6] (i) D(y_t/X_t; θ) is normal;
(ii) E(y_t/X_t = x_t) = B'x_t, linear in x_t;
(iii) Cov(y_t/X_t = x_t) = Ω, homoskedastic (free of x_t);
[7] θ ≡ (B, Ω) is time invariant.

(III) Sampling model

[8] Y ≡ (y_1, ..., y_T)' is an independent sample sequentially drawn from D(y_t/X_t; θ), t = 1, 2, ..., T, respectively.

The above specification suggests most clearly that before we can proceed to consider the theoretical parameters of interest ζ we need to ensure that the statistical parameters of interest θ are well defined. That is, the misspecification testing discussed in Chapter 24 precedes any discussion of either identification or statistical inference related to ζ. Testing departures from multivariate normality, linearity, homoskedasticity, time invariance and independence become of paramount importance in econometric modelling in the context of simultaneous equations.

25.5 Maximum likelihood estimation

In view of the simultaneous equations model specification considered in the previous section and the discussion of the identification problem in Section 25.3, we can deduce that in the case where the theoretical (structural) parameters of interest ζ are just-identified, the mapping H(·): R^{mk} × C → Ξ, where ζ = H(θ), is one-to-one and onto. This implies that the reparametrisation H(·) is invertible, θ = H^{-1}(ζ), and an obvious estimator of ζ is the indirect maximum likelihood estimator (IMLE).

Definition 3
In the case where ζ = H(θ), the (indirect) maximum likelihood estimator ζ̂ of ζ is defined by

ζ̂ = H(θ̂),  (25.66)

where θ̂ ≡ (B̂, Ω̂), B̂ = (X'X)^{-1}X'Y, Ω̂ = (1/T)Û'Û, Û = Y − XB̂.  (25.67)

This estimator is based on the invariance property of MLE's.

A more explicit formula for the IMLE is given in the case of one equation, say the first, when just-identification is achieved by linear homogeneous restrictions. It is defined as the solution of

ĤΔ̂_1 = 0, Ĥ = (B̂, I_k).  (25.68)

In the simpler case of exclusion restrictions the system (68) is

−β̂11 + B̂12γ̂_1 + δ̂_1 = 0,  (25.69)

−β̂21 + B̂22γ̂_1 = 0,  (25.70)

'solved' by

γ̂_1 = B̂22^{-1}β̂21, δ̂_1 = β̂11 − B̂12B̂22^{-1}β̂21,  (25.71)

given that B̂22 is a square (m1 − 1) × (m1 − 1) non-singular matrix.

The IMLE can be viewed as an unconstrained MLE of the theoretical parameters of interest which might provide the 'benchmark' for testing any overidentifying restrictions. As argued above, the identifying restrictions are not testable, because they provide the reparametrisation from θ to ζ, but the overidentifying restrictions are testable, because they imply restrictions on the statistical parameters of interest.

In order to simplify the derivation of general MLE's of ζ, let us utilise the formulae of constrained MLE's of θ derived in the context of linear a priori restrictions in Chapter 24.

The constrained MLE's of B and Ω in the context of the multivariate linear regression model, subject to the linear a priori restrictions

HΔ ≡ BΓ + A = 0,  (25.72)

where (Γ, A) are known, i.e. corresponding to the structural system

Γ'y_t + A'x_t + ε_t = 0,  (25.73)

take the following form:

B̃ = B̂ − (B̂Γ + A)(Γ'Ω̂Γ)^{-1}Γ'Ω̂,  (25.74)

Ω̃ = Ω̂ + (1/T)(B̃ − B̂)'(X'X)(B̃ − B̂)  (25.75)

(see Chapter 24). In the context of the simultaneous equations formulation, these formulae can be reinterpreted as determining the theoretical parameters of interest ζ ≡ L(Γ, A, V), given B̂ and Ω̂. That is, determine ζ via

B̃(ζ) = B̂ − (B̂Γ(ζ) + A(ζ))(Γ(ζ)'Ω̂Γ(ζ))^{-1}Γ(ζ)'Ω̂  (25.76)

and

V(ζ) = (1/T)Δ(ζ)'Z'ZΔ(ζ)  (25.77)

(see exercise 2), where Δ(ζ) ≡ (Γ(ζ)', A(ζ)')' and Z ≡ (Y, X) (see Hendry and Richard (1983)).

Substituting (76) and (77) into the log likelihood function of the multivariate linear regression model, log L(θ; Y), we get the 'concentrated' likelihood function log L(ζ; Y) defined by

log L(ζ; Y) = const − (T/2) log det[(Γ'Y'M_xYΓ)^{-1}(Δ'Z'ZΔ)],  (25.78)

where M_x = I − X(X'X)^{-1}X'. This function can be viewed as the objective function for estimating ζ using θ̂ ≡ (B̂, Ω̂), as opposed to the direct estimation of ζ by solving (76) and (77) for ζ. The log likelihood function (78) has the advantage that it provides us with a natural objective function to 'solve' (76) and (77). In the case of general restrictions ζ = H(θ) the 'solution' is rather prohibitive, and thus we consider the case where the restrictions are exclusion restrictions.

When the identification of ζ is achieved by exclusion restrictions, the constrained structural parameter matrices Γ(ζ) and A(ζ) are linear in ζ. This implies that the first-order conditions

∂ log L(ζ; Y)/∂ζ_i = 0, i = 1, 2, ..., p,  (25.79)-(25.80)

can be derived explicitly. Using the relation

Ṽ = (1/T)Δ̃'Z'ZΔ̃,  (25.81)

(80) takes the form

tr{Ṽ^{-1}(∂Δ/∂ζ_i)'(1/T)Z'ZΔ̃ − [(1/T)Γ̃'Y'M_xYΓ̃]^{-1}(∂Γ/∂ζ_i)'(1/T)Y'M_xYΓ̃} = 0, i = 1, 2, ..., p.  (25.82)

The system of equations (82) could be used to derive the MLE's of ζ and hence Γ̃, Ã and Ṽ. An asymptotically equivalent form can be derived in view of the fact that

Ṽ − (1/T)Γ̃'Y'M_xYΓ̃ → 0.  (25.83)

Using this, (82) can be expressed in the asymptotically equivalent form

tr{Ṽ^{-1}(∂Δ/∂ζ_i)'(1/T)Z'XĤΔ̃} = 0, i = 1, 2, ..., p, where Ĥ ≡ (B̂ : I_k).  (25.84)

For the details of the derivation see Hendry (1976), Hendry and Richard (1983). The reformulated system (84) is particularly intuitive because it can be interpreted as providing an estimator of ζ in terms of the sufficient statistics B̂ and Ω̂, of the form

ζ̂ = F(B̂, Ω̂).  (25.85)

B̂ and Ω̂ are sufficient statistics for the MLR statistical model because, as argued in Chapter 24, they are functions of the minimal sufficient statistics

τ(Y) ≡ (Y'Y, Y'X).  (25.86)

Moreover, the orthogonality between the systematic and non-systematic components, preserved with their estimated counterparts by the MLE's (discussed in Chapters 19 and 23), holds asymptotically for (84). In order to see this, consider the systematic and non-systematic components related to the system of equations (73), defined by

μ_t* = E(y_t/X_t = x_t) = B'x_t  (25.87)

and

ε_t = Δ'Z_t, Z_t ≡ (y_t', x_t')'.  (25.88)

It can be shown (see Hendry (1976)) that

(1/T)Δ̂'Z'X → 0,  (25.89)

a condition which suggests that the MLE can be interpreted as an IV estimator (see Section 25.7).

Unfortunately, explicit solutions for ζ of the form given in (85) are not possible, because (84) is non-linear in ζ and it has to be solved by some numerical optimisation procedure (see Hendry (1976)). As emphasised by Hendry, the numerical optimisation rule for deriving the MLE's ζ̂ of ζ from (84) must be distinguished from the alternative approximate solutions of the system

tr{V^{-1}(∂Δ/∂ζ_i)'Z'XHΔ} = 0, i = 1, 2, ..., p,  (25.90)

where

V = Δ'Z'ZΔ  (25.91)

and

H = (Π : I_k), Π = B̂ − (B̂Γ + A)(Γ'Ω̂Γ)^{-1}Γ'Ω̂.  (25.92)

When V and Π are given, (90) is linear in Δ and an explicit solution can be derived. On the other hand, when Δ is given, V and Π can be derived easily. This suggests that (90)-(92) can be used to generate estimators of Δ for different estimators of V and Π. Indeed, Hendry (1976) showed that most of the conventional estimators in the econometric literature, such as three-stage least-squares (3SLS) and two-stage least-squares (2SLS), can be easily generated by (90)-(92). Hendry referred to (90) as the estimator generating equation (EGE). The usefulness of the EGE is that it unifies and summarises a huge literature in econometrics by showing that:

(a) Every solution is an approximation to the maximum likelihood estimator, obtained by variations in the 'initial values' selected for V and Π and in the number of steps taken through the cycle of (90)-(92), not including iterating.
(b) All known econometric estimators for linear dynamic systems can be obtained in this way, which provides a systematic approach to estimation theory in an area where a very large number of methods have been proposed.

Equation (90) classifies methods immediately into distinct groups of varying asymptotic efficiency as follows:
(i) as efficient asymptotically as the maximum likelihood if (V, Π) are estimated consistently;
(ii) consistent for Δ if any convergent estimator is used for (V, Π);
(iii) asymptotically as efficient as can be obtained for a single equation out of a system if V = I but Π is consistently estimated

(see Hendry and Richard (1983)).

Definition 4
In the case where Γ is m × m and non-singular, the MLE of ζ, called the full information maximum likelihood (FIML) estimator, is the solution of the system (84) for ζ̂. It is interesting to note that in this case the system of equations (76) determining ζ̂ takes the simple form

B̃Γ(ζ̂) + A(ζ̂) = 0.  (25.93)

When the theoretical parameters of interest ζ are just identified, then the statistical parameters are not constrained, i.e. B̃ = B̂ and Ω̃ = Ω̂, and thus (93) reduces to the indirect maximum likelihood estimator

B̂Γ(ζ̂) + A(ζ̂) = 0.  (25.94)

The score function for the general case is defined as

q(ζ) = ∂ log L(ζ; Y)/∂ζ,  (25.95)

with log L as defined in (78). The information matrix can be defined by

I(ζ) = E[q(ζ)q(ζ)']  (25.96)

(see Chapter 13), where E(·) is with respect to the underlying probability model distribution D(y_t/X_t; θ). Using the asymptotic information matrix

I_∞(ζ) = lim_{T→∞} (1/T)I(ζ),  (25.97)

the asymptotic distribution of the MLE takes the form

√T(ζ̂ − ζ) ~ N(0, I_∞(ζ)^{-1}).  (25.98)

A more explicit form of I_∞(ζ) will be given in the next section, in the case where only exclusion restrictions exist, for the 3SLS (three-stage least-squares) estimator.

One important feature of the above derivation of (84) and (90)-(92) is that it does not depend on Γ being m × m and non-singular. These results hold true with Γ being m × q (q ≤ m). In this case the structural formulation is said to be incomplete (see Richard (1984)). For Γ m × m1 (m1 < m), (84) gives rise to a direct sub-system generalisation of the limited information maximum likelihood (LIML) estimator (with m1 = 1) (see Richard (1979)).

25.6 Least-squares estimation

As argued in the previous section, most of the conventional estimators of the structural parameters ζ can be profitably viewed as particular approximations of the FIML estimator. Instead of deriving such estimators using the estimator generating equation (EGE) discussed above (see Hendry (1976) for a comprehensive discussion), we will consider the derivation of two of the most widely used (and discussed) least-squares estimators, the two-stage (2SLS) and three-stage least-squares (3SLS) estimators, in order to illustrate some of the issues raised in the previous sections; in particular, the role of weak exogeneity and a priori restrictions in the reparametrisation.

(1) Two-stage least-squares (2SLS)

2SLS is by far the most widely used method of estimation for the structural parameters of interest in a particular equation. For expositional purposes let us consider the first structural equation of the system,

y_1t = γ_1'y_t^1 + δ_1'x_1t + ε_1t,  (25.99)

and the corresponding unrestricted statistical GM,

y_1t = γ0'y_t^(1) + α1'x_t + ε_1t,  (25.100)

based on the following decomposition of the probability model:

D(y_t/X_t; θ) = D(y_1t/y_t^(1), X_t; ψ1)·D(y_t^(1)/X_t; ψ(1)),  (25.101)

with systematic component

μ_1t = E(y_1t/σ(y_t^(1)), X_t = x_t)  (25.102)

and non-systematic component

ε_1t = y_1t − E(y_1t/σ(y_t^(1)), X_t = x_t).  (25.103)

This suggests that, because of the normality of D(y_t/X_t; θ), ψ1 and ψ(1) are variation free and thus y_t^(1) is weakly exogenous with respect to ψ1 (see Chapter 20). If no restrictions are placed on α1* ≡ (γ0', α1')', its MLE is

α̂1* = (Z(1)'Z(1))^{-1}Z(1)'y_1,  (25.104)

where Z(1) ≡ (Y^(1), X). This estimator has the same properties as the estimator of β* in the stochastic linear regression model, given that (100) is a hybrid of the linear and stochastic linear regression models. Moreover, the variance v11 can be estimated by

v̂11 = (1/T)(y_1 − Z(1)α̂1*)'(y_1 − Z(1)α̂1*).  (25.105)

The optimality of the estimators (104) and (105) arises because of the orthogonality between the systematic and non-systematic components:

E(μ_1tε_1t) = E[E(μ_1tε_1t/σ(y_t^(1)))] = 0,  (25.106)

since E(ε_1t/σ(y_t^(1))) = 0. Note that the expectation operator in E(μ_1tε_1t) is defined relative to D(y_t/X_t; θ), the distribution underlying the probability model.

The above scenario, however, changes drastically with the imposition of a priori restrictions on ψ1. In order to see this, let us consider the simple case of exclusion restrictions. Let us rearrange the vectors y_t and x_t so as to get the included endogenous (y_t^1, m1 − 1 of them) and included exogenous (x_1t, k1 of them) variables first, i.e. y_t ≡ (y_1t, y_t^1', y_(1)t')', x_t ≡ (x_1t', x_(1)t')'. In terms of the decomposition in (101) this rearrangement suggests that

D(y_t/X_t; θ) = D(y_1t/y_t^1, y_(1)t, X_t; ψ1*)·D(y_t^1, y_(1)t/X_t; ψ(1)*).  (25.107)

In terms of this decomposition the statistical GM (100) becomes

y_1t = γ_1'y_t^1 + γ_(1)'y_(1)t + δ_1'x_1t + δ_(1)'x_(1)t + ε_1t,  (25.108)

where γ_1: (m1 − 1) × 1, γ_(1): (m − m1) × 1, δ_1: k1 × 1, δ_(1): (k − k1) × 1, and the exclusion restrictions take the form

γ_(1) = 0, δ_(1) = 0,  (25.109)

reducing (108) to

y_1t = γ_1'y_t^1 + δ_1'x_1t + ε_1t.  (25.110)

Let us consider the implications of these restrictions for the rest of the system via

BΓ_1* + A_1* = 0.  (25.111)

Given that the restricted coefficient vectors are

Γ_1* = (1, −γ_1', 0')', A_1* = (−δ_1', 0')',  (25.112)

and B is partitioned conformably as

B = ( β11  B12  B13 ),  (25.113)
    ( β21  B22  B23 )

these equations take the form

B12γ_1 + δ_1 = β11,  (25.114)

B22γ_1 = β21.  (25.115)

These equations relate ψ1 and ψ(1). In the case where the equation (110) is just identified (k − k1 = m1 − 1), B22 is an (m1 − 1) × (m1 − 1) square matrix with rank m1 − 1. Hence, (114) and (115) can be solved uniquely for γ_1 and δ_1:

γ_1 = B22^{-1}β21, δ_1 = β11 − B12B22^{-1}β21.  (25.116)

Looking at (116) we can see that in this case ψ1 and ψ(1) are still variation free. Moreover, no structural parameter estimator is needed, because these can be estimated via the indirect MLE's:

γ̂_1 = B̂22^{-1}β̂21, δ̂_1 = β̂11 − B̂12B̂22^{-1}β̂21.  (25.117)

However, when (k − k1) > m1 − 1, the first equation is overidentified; the equations (114)-(115) impose restrictions on B(1) and thus the variation free condition between ψ1 and ψ(1) is invalidated. In an attempt to enhance our understanding of this condition, let us return to the decomposition in (107) and define the structural parameters involved. Partitioning Ω conformably with (y_1t, y_t^1, y_(1)t) as

Ω = ( ω11  ω12'  ω13' )
    ( ω21  Ω22   Ω23  )
    ( ω31  Ω32   Ω33  ),  (25.118)-(25.120)

the conditional distribution D(y_1t/y_t^1, y_(1)t, X_t; ψ1*) defines the coefficient of y_t^1 as

γ_1 = (Ω22 − Ω23Ω33^{-1}Ω32)^{-1}(ω21 − Ω23Ω33^{-1}ω31),  (25.121)

together with the corresponding conditional variance.  (25.122)

In the just identified case

(Ω22 − Ω23Ω33^{-1}Ω32)^{-1}(ω21 − Ω23Ω33^{-1}ω31) = B22^{-1}β21,  (25.123)

and γ_1 as defined by (121) and (115) coincide. On the other hand, in the overidentified case (121) and (115) define γ_1 differently, with (115) invalidating the variation free condition.

An alternative way to view the above argument is in terms of the systematic and non-systematic components. If we define μ_1t in (110) by

μ_1t = E(y_1t/σ(y_t^1), X_1t = x_1t),  (25.124)

then

ε_1t = y_1t − E(y_1t/σ(y_t^1), X_1t = x_1t)  (25.125)

and

E(μ_1tε_1t) = E[E(μ_1tε_1t/σ(y_t^1), X_1t = x_1t)] = 0  (25.126)

(see Spanos (1985)). This suggests that the natural way to proceed in order to estimate (γ_1, δ_1, v11) is to find a way to impose the restrictions (114)-(115) and then construct an estimator which preserves the orthogonality between μ_1t and ε_1t. The LIML estimator briefly considered in the previous section is indeed such an estimator (see Theil (1971)). Another estimator based on the same argument is the so-called two-stage least-squares (2SLS) estimator. Let us consider its derivation in some detail.

The orthogonality in (126) suggests that

y_1t = γ_1'y_t^1 + δ_1'x_1t + ε_1t  (25.127)

is equivalent to

y_1t = β11'x_1t + β21'x_(1)t + u_1t,  (25.128)

subject to (114) and (115). In order to see this, let us substitute

y_t^1 = B12'x_1t + B22'x_(1)t + u_t^1  (25.129)

into (127):

y_1t = γ_1'(B12'x_1t + B22'x_(1)t + u_t^1) + δ_1'x_1t + ε_1t
     = (γ_1'B12' + δ_1')x_1t + γ_1'B22'x_(1)t + γ_1'u_t^1 + ε_1t.  (25.130)

Given that u_1t = γ_1'u_t^1 + ε_1t, i.e. ε_1t = u_1t − γ_1'u_t^1,  (25.131)

(130) becomes

y_1t = (B12γ_1 + δ_1)'x_1t + (B22γ_1)'x_(1)t + u_1t.  (25.132)

The LIML estimator can be viewed as a constrained MLE of β11 and β21 in (128), subject to the restrictions (114)-(115) as shown in (132) (see Theil (1971)). Similarly, the 2SLS estimator of α_1 ≡ (γ_1', δ_1')' can be interpreted as achieving the same effect by a two-stage procedure. The method is based on a re-arranged formulation of (130) for t = 1, 2, ..., T:

y_1 = (X1B12 + X(1)B22)γ_1 + X1δ_1 + u*,  (25.133)

in an attempt to impose (114) and (115) in stage one by estimating (X1B12 + X(1)B22) separately, using (129). Once the restrictions are imposed, the next step is to construct an estimator of α_1 which preserves the orthogonality between the systematic and non-systematic component of (133). More explicitly:

Stage one: using the formulation provided by (133), estimate

XB.2 ≡ (X1B12 + X(1)B22)  (25.134)

in the context of (129), or

Y^1 = XB.2 + U^1,  (25.135)

using

B̂.2 = (X'X)^{-1}X'Y^1.  (25.136)

Stage two: using Ŷ^1 = XB̂.2, (130) can be transformed into

y_1 = Ŷ^1γ_1 + X1δ_1 + u*, or y_1 = Ẑ_1α_1 + u*,  (25.137)-(25.138)

where Ẑ_1 ≡ (Ŷ^1 : X1). Estimate α_1 by

α̂_1(2SLS) = (Ẑ_1'Ẑ_1)^{-1}Ẑ_1'y_1,  (25.139)

i.e.

α̂_1(2SLS) = ( Ŷ^1'Ŷ^1  Ŷ^1'X1 )^{-1} ( Ŷ^1'y_1 )
             ( X1'Ŷ^1   X1'X1  )       ( X1'y_1  ).  (25.140)

In the context of the model (1)-(2), estimating the parameters of (1) by 2SLS amounts to estimating the statistical form in equation (4),

i_t = β12 + β22y_t + β32p_t + β42g_t + u_2t,

by OLS and substituting the fitted values î_t into the structural form of (1) to get

m_t = α11 + α21î_t + α31y_t + α41p_t + u_1t*.

Applying OLS to this equation yields the 2SLS estimator of α11, α21, α31 and α41.
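A Python sketch of the two stages (25.136)-(25.140) (our illustration with artificial data; y1 is the left-hand variable, Y1 the included endogenous regressors, X1 the included exogenous and X the full set of exogenous variables):

import numpy as np

def tsls(y1, Y1, X1, X):
    B2 = np.linalg.solve(X.T @ X, X.T @ Y1)    # stage one: (25.136)
    Z = np.column_stack([X @ B2, X1])          # Z-hat_1 = (Y-hat^1 : X1)
    return np.linalg.solve(Z.T @ Z, Z.T @ y1)  # stage two: (25.139)

rng = np.random.default_rng(8)
T = 500
X = np.column_stack([np.ones(T), rng.standard_normal((T, 3))])
u = rng.standard_normal((T, 2))
Y1 = X @ rng.standard_normal((4, 1)) + u[:, [1]]   # one RHS endogenous variable
y1 = 0.8 * Y1[:, 0] + X[:, :2] @ np.array([1.0, -0.5]) + u[:, 0] + 0.6 * u[:, 1]
print(tsls(y1, Y1, X[:, :2], X))                   # approx (0.8, 1.0, -0.5)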
Given that (133) preserves the orthogonality between the systematic and non-systematic component by redefining the latter, the 2SLS method uses the equivalence between (127) and (130) and, via an operational form of the latter, estimates α_1 consistently. Consistency stems from the asymptotic orthogonality

(1/T)E(Ẑ_1'u*) → 0 as T → ∞.  (25.141)

For a given T, however, E(Ẑ_1'u*) ≠ 0 and thus α̂_1 is a biased estimator, i.e.

E(α̂_1) ≠ α_1.  (25.142)

The 2SLS estimator can be compared directly with the LIML estimator

α̃_1(LIML) = ( Y^1'Y^1 − λ̂*V̂^1'V̂^1  Y^1'X1 )^{-1} ( Y^1'y_1 − λ̂*V̂^1'v̂_1 )
             ( X1'Y^1               X1'X1  )       ( X1'y_1              ),  (25.143)

where V̂^1 = M_xY^1 and v̂_1 = M_xy_1 are the reduced-form residuals. In view of the fact that

Ŷ^1'X1 = Y^1'X1, Ŷ^1'Ŷ^1 = Ŷ^1'Y^1,  (25.144)

it differs from the 2SLS estimator in so far as λ̂* is not given the value one but

λ̂* = min over γ_1 of [(y_1 − Y^1γ_1)'M_1(y_1 − Y^1γ_1)]/[(y_1 − Y^1γ_1)'M_x(y_1 − Y^1γ_1)],  (25.145)

where M_1 = I − X1(X1'X1)^{-1}X1' and M_x = I − X(X'X)^{-1}X' (see Schmidt (1976)). If we substitute k for λ̂* in (143), where k is a scalar (stochastic or non-stochastic), (143) defines the so-called k-class estimator of α_1, which includes both the 2SLS and LIML estimators. For the 2SLS estimator, k = 1.

Looking at (140) we can see that the 2SLS estimator exists if the rank of

( Ŷ^1'Ŷ^1  Ŷ^1'X1 )
( X1'Ŷ^1   X1'X1  )

is m1 + k1 − 1. Given that rank(X1'X1) = k1 by assumption [5], this rank condition is satisfied if rank(Ŷ^1'Ŷ^1) = m1 − 1. The latter condition holds when rank(B22) ≥ m1 − 1, i.e. the order condition for the first equation is satisfied. It must be noted that in the case where this equation is exactly identified, i.e.

rank(Φ_1) = rank(Φ_1Δ*) = m − 1,  (25.146)

the 2SLS estimator is equivalent to solving

B̂Γ̂_1* + Â_1* = 0, where B̂ = (X'X)^{-1}X'Y,  (25.147)-(25.148)

for α̂_1. The solution of (147) for α̂_1 is the indirect MLE.

In order to discuss the properties of the 2SLS estimator, we need to derive its distribution. From (140) we can deduce that

γ̂(2SLS) = [Y^1'(P_x − P_{x1})Y^1]^{-1}Y^1'(P_x − P_{x1})y_1,  (25.149)

δ̂(2SLS) = (X1'X1)^{-1}X1'(y_1 − Y^1γ̂(2SLS)),  (25.150)

where P_x = X(X'X)^{-1}X', P_{x1} = X1(X1'X1)^{-1}X1'. These results show that the distribution of δ̂(2SLS) can be derived directly from that of γ̂(2SLS). Concentrating on the latter estimator, we can express it in the form

γ̂(2SLS) = W22^{-1}W21,  (25.151)

where

W ≡ ( W11  W12 ) = Y^0_1'(P_x − P_{x1})Y^0_1, Y^0_1 ≡ (y_1 : Y^1)  (25.152)
    ( W21  W22 )

(see Mariano (1982)). In view of the fact that

(y^0_1t/X_t = x_t) ~ N(B01'x_t, Ω01),  (25.153)

where B01 ≡ (β_1 : B.2), we can see that the distribution of W = Y^0_1'QY^0_1, where Q ≡ P_x − P_{x1} is an idempotent matrix, must be a matrix extension of the non-central chi-square (see Appendix 24.1). That is,

W ~ W_{m1}(Ω01, T*; M0),  (25.154)

where W_{m1}(·) stands for the non-central Wishart distribution with T* degrees of freedom, scale matrix Ω01 and non-centrality (or means-sigma) matrix

M0 = Ω01^{-1}[E(Y^0_1)'Q E(Y^0_1)].  (25.155)

The 2SLS estimator of γ_1 as given in (151) can be viewed as a rational function of W. The k-class and LIML estimators of γ_1 can be expressed similarly as

γ̂(k) = (W22 + (1 − k)S22)^{-1}(W21 + (1 − k)S21),  (25.156)

γ̂(LIML) = (W22 − λ̃*S22)^{-1}(W21 − λ̃*S21),  (25.157)

where

S ≡ ( S11  S12 ) = (Y^0_1 − XB̂01)'(Y^0_1 − XB̂01)  (25.158)
    ( S21  S22 )

(see Mariano (1982)). These results suggest that the same methods for the derivation of the distribution of γ̂(2SLS) can be used to derive those of γ̂(k) and γ̂(LIML). In Chapter 6, several ways to derive the distribution of Borel functions of random vectors were discussed, and the general impression was that such derivations can be very difficult exercises in multivariate calculus. In the case of γ̂ above, the derivations are further complicated by the fact that the distribution function of the non-central Wishart is a highly complicated function based on infinite series (see Muirhead (1982)). The complexity of the distribution increases rapidly with rank(M0). This is basically the reason why the econometric literature on the distribution of these estimators concentrated mainly on the case where rank(M0) = 1 (see Basmann (1974) for an excellent survey). In the general case of rank(M0) > 1 (see Phillips (1980)) the distribution of γ̂ is so complicated as to be almost non-operational. This encouraged the development of asymptotic results and related expansions such as the Edgeworth expansions (see Chapter 10).

Before we consider these results, it is important to summarise some of the most interesting aspects of the finite sample research related to the above estimators; in particular the existence of moments of the various estimators discussed above.

(i) γ̂(2SLS) has moments up to order (k2 − m1 + 1), where k2 = k − k1, i.e. the 2SLS estimator has moments up to the degree of overidentification. Thus, for a just-identified equation, where k2 = m1 − 1, no moments exist; see Kinal (1980).
(ii) γ̂(k) has moments up to order T − (k1 + m1 − 1) if 0 ≤ k < 1 and k is non-stochastic; see Kinal (1980).
(iii) γ̂(LIML) has no moments (see Sargan (1970)).

See Mariano (1982) for an excellent survey of these results. It must be emphasised that the non-existence of moments should not be interpreted as making the relevant estimators inferior to ones whose moments exist. It implies, however, that comparisons of estimators based on criteria such as mean square error cannot be made, since they presuppose the existence of some moments. Moreover, in Monte Carlo studies (see Hendry (1984)) the non-existence of moments provides useful information.

The asymptotic distribution of the 2SLS estimator takes the form
-

F( :*1 x *1)zs
I-s

X(0

x.

*1

?-? 1 D

(25.159)

where
D 11

B/ 2 Qx B
.

B' 2 Q1

The asymptotic

x'x 1

lim
-,

Q1 j

--

Q11

X'X

.X

X'1 X 1
--.T

lim
w..- y

E(ss1 f 1;'y1 t ).

of *1 can be estimated

covariance

*
*
Cov(41)=
f11

I7*
11 a

(25. 160)

1im
Qx F-+

Q1

Q, B.c

ir'1 i' 1

i''1 X 1

X1Y1

X1X:

by

(25. 161)

where
,1

v 1f - x 1
1

(see Schmidt ( 1976)). lt is important to note that D1 1 is


ranktBac) n'?1 1, l.e. the first equation ls. identi.fled.
=

non-singular

if

25.6

Least-squares

Under the condition

estimation

that

v'' F(k - 1)

--+

0.

(25.162)

the k-class estimators are asymptotically equiv alent to the ZSLS. ln


particular the LIML and ZSLS are asymptotically eqtli: alent.
A brief ntroduction
in
to Edgeworth expansions was considered
Chapter 10. Sargan ( 1976) considered the question of applicability of such
expansions to econometric estimators. He proposed an Edgeworth
expansion for the difference
(

t?/P,

W),

(25.163)

where p is a normally distributed random vector. and w a random vector,


independent of p. The conditions placed on Lawlp. B ) are general enough to be
applicable to many of the conventional estimators mentioned above; see
also Phillips ( 1977). These conditions are related to the smoothness and
invertibility of trwt ) and the boundedness of the moments of x F w of all
orders. ln the case of the ZSLS it can be shown that
.

fl

lllp, /.1, B.2).

(25.164)

where
P 55E

X'u 1 X'U 1
T

and the Sargan conditionj


expansion of order Op(T-I)
(f l -T1)= F () + F -

apply

(25.165)

(seePhillips ( 1980a)).

The asymptotic

takes the general form


+ F-

+ Op(F-),

where the components of Fo, F-.L, F


O P (T-) O P (T - 1) respectively.

(25.166)

include the terms of order Op(1),

(2)

Three-stage least squares

(3S+5)

The 3SLS estimator is a system GLS estimator based on the SURE


formulation considered in Chapter 24. As argued there, this formulation
exclusion restrictions directly into the
enables us to accommodate
formulation itself.
The results related to the estimation of the MLR model when a priori
information is available (see Section 24.4) suggest that in estimating (110)
separated from the system there must be some loss of efciency. ln the case
of exclusion restrictions the reformulation of the MLR model into the
SURE form enabled us to incorporate such a priori information. lntuition

636

The simultaneous equations model

suggests that the same formulation should lead us to a more efficient system
estimator than the ZSLS estimator.
Expressing all m equations of (99) in the same form as equation one n
( 110) the SURE formulation in the present context gives rise to the systenl
y1

Z1

ya

()

za

x1

c*1

12

P ':2

'

ym

where

Zm

7f

('' f : xf)

Ai

Jf

12

x FX

(25.167)

'

:*

.P1

'

n''l
,

or, more compactly,


+):

'+ Z++
=

is an obvious
the system

notation.

y+ =

(25.168)
The ZSLS estimator amounts

a,y + c:.

(25.169)

where 2+ is Zv with if instead of Yj, i.e.


estimator for the system as a whole is
ti* zsus

to applying OLS to

(i'f'.X). That is, the ZSLS

EEE

(2'+2+)- 1 k'* y,y

(25.170)

In view of the fact that


(c:,/X)

'v

N(0, V () 1w),

intuition suggests that the generalised-least-squares


(GLS) estimator used
in the context of the MLR model should be more appropriate. This
estimator takes the form (see Section 24,4):
ti* ysus

(2.'+($''- l () Iw)2'v) - 12' * ($' '' 1 (>()Iw)y+,

(25.172)

where Vis estimated from the ZSLS residuals applied to all the equations of
the system via
l''t'ij

''?A
2)
I
--' 2'-'A

'

l,

'
..1

12
,

(25. 173)

For obvious

reasons this estimator of a,j is called the three-stage leastestimator, first suygested by Zellner and Theil ( 1962).
(3SLS)
squares
l
It is important to note that for(Z'+(V
- (@)Iw)2 * ) in ( 189) to be invertible
k,v must be of full column rank which requires that each 2f (ij. Xf), i 1,
=

Instrumental

2,

variables

comprising
rn, i.e. a11 equations
when the system is identified

the system must be identihed.

Moreover.

1 (;'
(# l
-T * -

(g)

Iy );

-.a

v3st.s

P
*

....+

(;j.

1
) N ((j D - )
x

j.74)
,

(25.175)

(see Schmidt ( 1976)). Sargan (1964/) showed that when only exclusion
restrictions are present and the system is identified, 3SLS and FIML are
asymptotically equivalent, i.e. D 1.((),. see Section 25.3.
the result related to the
ln relation to the finite sample distribution of Vysus
moments of the ZSLS applies to this ease as well. That is, moments exist up
1he
nonto the order of overidentification (see Sargan ( 1978)). Moreover.
the L1M L estimator extends to the FIML
existence of moments
estima-tor (see Sargan ( 1970)). As argued above. howeNrer. existence or nonexistence of moments should not be considered as a criterion for choosing
between estimators in general.
=

.for

25.7

Instrumental

variables

variables (lV) was initially proposed by


The method of instrumental
Reiersol ( 1941) in the context of the errors-in-variables model (seeJudge et
the
aI. ( 1982)) as an alternative estimation method which
inconsistency problem associated with the OLS estimator. This was then
systematised and extended by Durbin ( 1954) to a more general linear
statistical model where the explanatory variables are correlated with the
error term. He also conjectured the direct relationship between the IV
estimator and the LIML estimator in the context of the simultaneous
equations model. The current approach to the IV method was formalised
by Sargan ( 19583. who considered the asymptotic efficiency as well as the
consistency of IV estimators. Let us summarise the current
approach in order to motivate the formalisation proposed in the sequel.
Consider the statistical formulation
Ksolved'

Ktextbook'

pf I'X
=

where X,: p
that

(25.176)

+ Tr.

x 1 vector

'(X;c,) # 0.

of

fpossibly stochastic) explanatory

variables

such

(25.177)

The simultaneous

equations model

If we estimate x using the usual orthogonal

projection

(OLS) estimator

:=(X'X)- 1X'y
we can see that this estimator is in general biased and inconsistent since
i x + (X'X) - 1X/c and S)(X'X)- 1X'E) # 0
=

and the bias does


The

IV method

introducing

not go

a new vector

'(Z; G)

zero as T'-+ :f)

to

'solves'

the bias (and inconsistency)


Z : m x 1, t 1, 2,

of instruments

((Z/ c,/F)

Eaa

((Z'Z,'F)

,$

problem b)
F such that:

0)

-->

Cov(Z,)

Eaa)

-+

(iii)

Covtzf ,Xf) =X32

((Z'X,/F)

Eaa)

-+

where Eaa > 0. ln the case where the number


number of unknown parameters in a, i.e. m
form:
(9IV =(Z'X)- 1Z'y.
=p,

of instruments equals the


the IV estimator takes the

Assuming that Ea2 is bounded and non-singular

we can deduce that

P
'(Iv)

and

91:.
-+

if in addition to

Moreover,

y'

(iv)

(Z v#x/ T)

X(0,

'w

'

(i)-(iii)we

also assume that

azXaa)

then
/'F(4Iv
N/

a)

'.w

- 1 E3aEaa
- 1
N(0, c 2 N.zs
)

ln the case where m wp (we have more instruments


(1958) proposed the so-called generalised instrumental

than
variable

fs)

Sargan

estimator

(GIVE)

&)j=(X'PaX) - IX'P y

(25.179)

where Pz Z(Z'Z) - 1Z'. The GIVE estimator was derived by Sa-gan as an


instruments' are chosen as linear
extension of mv where the
functions of Z, i.e. Z* ZH, H being a m x J?matrix and H is chosen so as to
minimise the asymptotic covariance matrix
=

'optimal

Covt:lv)
2

c2(HYca) - 1(H'ZaaH)(N'aaH')- 1

Instrumental

25.7

variables

639

(Z*'X) - Z*'y
1
1
Given that for /(H) (HEaa) - (H'EasH)(E'asH') ' /(H) /(AH) for any
matrix
minimising
we can choose H by
m x m non-singular
Of

:lt

(H'Ea aH)

subject to HE2a

1.

The optimal H takes the form:


H

E 5-1 E a a(N aa E -l E aa )- l

Using its sample


Z*

the

equivalent
1

P :: XIX'P X):5

toptimal'

set of instruments

is

and thus

,'

jjrgjjf
jjj.f
....'

.
..

lj,

''

(# x.
N(0, c (EazEa........a Ea a) ... )
The GIVE estimator as derived above is not just consistent but also
asymptotically efficient. Because of this it should come as no surprise to
learn that a number of well-known estimators in econometrics including
OLS, GLS, ZSLS, 3SLS. LIML and FIML can be viewed as GIVE
estimators (see Bowden and Turkington ( 1984), furt/l' alia).
Norp that in the case where m /?, g)i). I& as defined in (178).
The main problem in implementing the IV method in practice is the
choice of the instruments Z,. The above argument begins with the
presupposition that such a set of instruments exists. As for the conditions
(i)-(iv), they are of rather limited value in practice because the basic
ln order to
orthogonality condition (i), as it stands, is non-operational.
make it operational we need to specify explicitly the distribution in terms of
which E ) in ( 176) and (i)-(iv) are defined. lf we look at the argument
leading to the GIVE estimator ( 179) we can see that the whole argument
around an (implicit)distribution which esomehow'
involves y't, Xf
and Z,. ln order to see this let us assume that the underlying distribution is
Dt-pr,Xp; #) (assumed to be normal for simplicity). lf we interpret the
systematic and non-systematic components of ( 176) as in Chapters 17 and
20, i.e.
.

F()

'

Krevolves'

and

c-al
x E o'c :
=

'(Xf';r) 0.
=

(E a c

by

t?t =

Covtxf )

yf - E ()'r,6'Xf))

o'c 1

Co v(X,,

(25.180)

-pf)),

Construction

and
9,:u:(X'X) ' 'X'
fully efficient.

is th0 MLE of x and it is unbiased, consistent

and

640

The simultaneous

equations model

This suggests immediately


that D ).,, X?; ) is not the distribution
underlying the above IV argument and thus ( 180) is not the (implicit)
interpretation of a'X, and st.
Another question which raises the issue of the underlying distribution is
related to the possible conflict between condition (i) and

(v)

Cov(Z,,.pr)

/3 1

# 0.

A glance at the IV estimators ( 178) and ( 179) confirm that the latter
condition is needed in order to ensure the existence of these estimators, in
addition to (i)-(iii).This, however, requires Zf to be correlated with pt but
do we
uncorrelated with a random variable cf. directly related to yr.
resolve this apparent paradox?'
ln addition to the above raised questions the discerning reader might be
estimator of x in
wondering how the GIVE estimator ( 179) can be a
176)
when
apparently
the
has
latter
(
parameter
to do' with Zf. As
emphasised throughout Chapters 19-24 there is nothing coincidental
estimator. For
about the form of the parameter to be estimated and its
1X'y
(X'X)estimator
of
is a
example the fact that &
x Ea-alokj is no
that
the
natural
given
sample
analogues
of
and
J21 are X'X
E22
accident,
respectively.
analogy
in
of
and X'y
Using the same
i),. we can see
the case
estimator of must be:
that the parameter this is a
Sl-low

'good'

'nothing

'best'

'good'

'good'

* (E2 aE 5-1 Ea c )- 1 Ea aEa'-a1


as j
=

(25. 18 1)

How is this parameter implicitly assumed to be the parameter of interest in


( 176)?
The above raised questions indicate that the issues of the (implicit)
condition ( 177) and
distribution underlying ( 176), the non-orthogonality
the (implicit)parametrisation are inextricably bound up. The fact that (181)
involves the moments of a11random variables (.k,f,
that the
Xf, Zf) Suggests
(implicit) distribution is related to their joint distribution. Moreover, the
apparent conllict between conditions (i) and (v) can only be resolved if Z,
together they suggest
refer to conditioning variables. Putting these
that the underlying distribution should be:
'clues'

D( J..,, Xr/'Zr;0j.

(25. 182)

Let us consider how this choice of the implicit distribution could help us
resolve the above raised issues.
Firstly, assuming that the parametrisation
of interest in ( 176) is I*, as
defined in ( 18l), we can see how the non-orthogonality
(177) arises. lf we
ignore ( 182) and instead treat f'( ) in ( 177) as defined in terms of D()'t, Xt; #)
.

then

'(X;k)

'(X;(yf- x'Xt))

(c21-

E22t:* )# 0

25.7

Ilstrumental

variables

unless * Ea'-21o'c1 This leaves us with having to explain how the


parametrisation of x* in ( 181) arises. Using ( 182) as the underlying
distribution we can deduce that
=

E(X;y,./c(Z,))

E( f'gX,'cl c(Zt)1 c(X,))

E'IXCfR,c(X,)1/'c(Z,))

*'Xf)/'c(Zf)1/'c(X/))
= F)X;JJg(yf
Z/Ya-.tEa aa*l c(Xt) )
f Ygi as 1
t f
= E$X'LZ
-

= E 23 E-3 3

f(Xf'G 'c(Zl))

65 1

-E

2.3

t- 3 3IE 32 x*

in ( 177)
for
as defined in (181). ln other words, the non-orthogonality
expectation
of
the
x*
in
1
82)
but
in
defined
terms
terms
(.
was
arose because
Xl.,
D(
p,,
of
#).
The above discussion suggests that the instruments Zf are random
variables jointly distributed with the pt and X, but treated (implicitly)as
specification ( 176),
ln defining the statistical
conditioning variables.
explicitly.
include
That
natural statistical
is,
the
Z,
however, we did not
*

GM

yl a'()Xt+ SZ,

(25.184)

+ ut

Ec- l(Jl 2 J1 3ELI Ea2)' and 7c, EL1o.5 1 - Egs-alEtpza()where


E2 :! (Eac - E2aE:!:5Esa) is not the parametrisation f?/' interest. lnstead Zf is
marginalised out from the systematic component F(y,/c(Xf, Zr)). In the
present case this amounts to imposing the restriction yo 0 or equivalently

where

atl

(25.185)

X2370 0
=

since m y p. Given the relationship


(185) implies that
'
aEa-sI
al
atl (Ea aE s- Es a )- Ec
=

between ya and atl in ( 184), however,


o's I

x*

(25.186)

and ( 184) reduces to ( l 76# with x as defined in ( 18 1). This relationship


between ( 184) and ( 176) enhances our understanding of the IV argument
conditioning variables
considerably. lt shows that IV are indeed
whose separate effect is of no interest in a particular formulation. lf we were
would be the usual orthogonal
to estimate alg and y() in ( 184) the MLE'S
projection estimator:
tomitted'

ll

(X'M,X) - X'5'1Z y

(25.187)

642

The simultaneous

equations model

and

ftl (Z'MxZ) - IX'M y


=

(Z'Z)
- 1Z'y -(Z'Z)
.

- 1(Z'X)()

Pz= Z(Z'Z)- lZ', Mz=I - Pz. The estimator ('10 of (), however, is not an
estimator of the parameters of interest a*. On the other hand, when Zf in
(184) is dropped without taking into account its indirect effect (throughthe
parametrisation) the estimator #= (X'X) - 'X'y is inappropriate. This is
because (i introduces a non-existent orthogonality between the estimated
'systematic' and
components. That is, for #= X:= Pxy and
-X4=
Mxy
sample equivalent to ( 176) induced by
that
the
wecan
see
=y
knon-systematic'

is

pxy + Mxy

(25.188)

given that PxMx 0, Pv X(X'X) - IX', Mx I Px. ln view of


where /
.
177),
however,
(
no such orthogonality between yt and st exists within the
framework defined by D(o, Xr; #) and thus x is inappropriate. From ( 183)
we can see that in order to achieve an orthogonality between pt and st we
need to take account of the conditioning variables Z,, The way this was
achieved in ( 184) was inappropriate not because we introduced a nonexistent orthogonality
(the orthogonality was valid) but because we
orthogonality
achieved the
at the expense of the parametrisation of interest.
The question which naturally arises at this stage is how we can introduce an
of
orthogonality between p, and ;, without losing the parametrisation
interest.
=

Intuition

suggests that when /t, a'X, and ), yf - J'X: are nOt


orthogonal it must be the case that t includes information which trightfully'
belongs to pt. In the present context what is systematic and non-systematic
information is determined by ( 182), the underlying distribution. This being
the case the obvious way to proceed is to redefine both components so as to
ensure that they are indeed systematic and non-systematic relative to
D(y,, X, Zt,' 0j. This can be achieved by the new components :
=

ptt .E'(/t,?'c(Zr)) and


=

t:1

zt

(p) -p,)

the systematic information y1 -/tf) is Subtracted


the new error term c). The redesigned form of ( 176) is
Notejow

5't E (Jtf'rqZ,))
=

it

f(/lf,/fT(Zr))()+ c,

(25.189)
from cf to define

(25.190)

0.
with Eptttl)
(190) is different from ( 184) in so far as the original parametrisation has
been retained
It is interesting to note that the above
can be motivated
=

're-design'

25.7

Instrumental

variables

643

directly in terms of condition (i). Given that


F(Z;;,)

'('gZ'f)f/J(Zf)1)

(see Chapter 7)

which in turn is

if

(25.19 1)

for

valid

Ept c(Zf>

(i)holds

deduce that condition

we can

E (lt J(Zl))

E (ZIEI;, c(Z,)))

(25.192)

c(Zr))

'(.)?,

Vchoice'

The conditions ( 19 1) and (192) are equivalent to ( 189) and thus the
of Z, so as to ensure condition (i) is equivalent to the above re-design' of
( 176).
ln order to understand how the IV estimator (179) is indeed the most
appropriate estimator of x* in the context of ( 176) 1et us derive the sample
equivalent of (190).Using the well-known analogy between conditional
expectations and orthogonal projections we can deduce the sample
F (T > m + p):
equivalents to (189) and (190) for t 1, 2,
=

p * P z xa*
=

and

c*

PzXa* + Mzxa

1) +

M z X()

(25.193)

+ c.

Pz and Mz I Pz are the orthogonal projections onto the space spanned


by the columns of Z and its orthogonal complement, respectively. ln view of
the orthogonality between Pz and Mz (PzMz 0) the usual orthogonal
(OLS) estimators of * and tl coincide with (179) and (187),respectively.
This derivation of the IV estimator provides us with additional insight as
tsolves'
the original estimation problem. ln a sense the
to how the method
method
method
IV
is a two-staqe
(seePagan (1986))where the jlrst staqe
the original formulation so as to achieve orthogonality between
the systematic and non-systematic components, without changing the
parametrisation, and the second stage the parameters of interest are
estimated using the usual orthogonal projection (OLS) estimator.
The re-designed formulation is particularly illuminating in so far as the
relationship between ( 176) and (184)is concemed. lf we rearrange (193)in
=

Sre-designs'

the form:

y Xa*
=

+ M,X(a() - a*) + c

(a()- x*) the hypothesis


against
f11 : J#0

we can see that for J


J.f():J=0

(25.194)

can be easily tested in the context of (194) when T > m + p. Ho is often


viewed as referring to the admissibility of the instruments in Zf. From (184)-

644

The simultaneous equatios

model

(186) we can see that Hv is an indirect test of the hypothesis that the
coefficient of the conditioning variables Zt is zero, i.e.
covt)?t,ZXt)

This brings out the connection between


conditioning variables
variables
25.8
and instrumental
below).
(seeSection
In the context of the simultaneous equations model a1l the estimators
discussed (ZSLS, LIML, 3SLS and FIML) can be viewed as IV estimators;
see the recent monograph by Bowden and Turkington (1984).In order to
see this let us consider the case of the ZSLS (see Section 25.6 above).
The first restricted structural equation for the sample period tzzu1, 2,
T takes the form:
'omitted'

yl

Zll

+ zl3

(25.195)

where Z1 >(Y1,X1) and aj =(y'1.J'!)'. Given that the underlying


distribution is Dlyttixt; 0) where y, HE (yl,, y'1,, y'(j)r)' and Xf EEEE(X?1
t, X'(1),)' we
can see that X, is the set of instruments and ( 195) includes some of those
instruments as genuine conditioning variables. Expressing (19f) in the form
(193) yields:
yl
since PxX1
i

PXYITI

X1. From this we can deduce that the IV estilnator

J'1

XIJI

Y' P x Y 1

Y'1 P x X 1

XIPXY 1

X'IX 1

,1

Iv

+ MxY17 +

:1'

(25.196)
Y'1 P xyl

X'1yl

of al is:

'

.,)

(aj. j oy)

which coincides with the 2SLS estimator ( 140).

25.8

Misspecification testing

For theoretical as well as practical reasons misspecification testing in the


context of the simultaneous equations model will be considered at three
different but interrelated levels:

the statistical system level, in terms of the statistical GM,


yf B'xt
=

(ii)

M-ut,

(25.198)

the unrestricted structural equation, in terms of the statistical GM,

25.8

Misspecification

testing

the restricted structural


parameters of interest,
=

(1)

equation,

7'1y1, -F-ti'1X1? +

1'1,

The statistical

system

645

in terms of the theoretical

'l/f

(25.200)

level

lf we look at the specification of the simultaneous equations model given in


Section 25.4 we can see that a sufficient condition for the theoretical
parameters of interest ( to be well defined is that the statistical parameters 0
are well defined. This suggests that the natural way to proceed with
misspecification testing is to test assumptions (6(1to (8qin the context of
the multivariate linear regression model (seeChapter 24). In the case where
the system defined by the structural form with the restrictions imposed is
just identified, the mapping H(.) : O-/E, (= H(p), is one-to-one and onto,
implying that the inverse image of H( ) is (.) itself. Hence, ( is a simple
reparametrisation and as well defined as 0. ln the case where the system is
whose inverse image
overidentified then the mapping H( ) is many-to-one
0. () is a subset of (4. That is, in the case of overidentification H( ) imposes
restrictions on 0. which are, however, testable in the context of (198)(see
Section 25.91.
The above argument suggests that before any questions related to the
theoretical parameters of interest t are considered we need to ensure that
the statistical model in terms of the statistical parameters p is well defined.
This is achieved by testing for departures from the underlying assumptions
using l/?L2 procedures discussed in C'lr/rt/l- J4. lf this misspecification testing
reveals that the statistical assumptions underlying ( 198) are valid we can
proceed to consider the identification, estimation as well as specification
testing in the context of the structural form reparametrisation.
Otherwise
we need to respecify the underlying statistical model. For example, the
assumption of independence is likely to prove inappropriate in econometric
modelling with time-series data. A respecification of the underlying
statistical model U i1lgive rise to the multivariate dynamic linear regression
model whose statistical GM is
.

(25.201)
(see Chapter 24). This suggests that if the identification problem was
in terms of 0 in f 198) we need to reconsider it in terms of 0* in (199).Given
that (198) and ( 199) coincide under
isolved'

Hv : Bf

0,

-'.f

0-

1.=

1, 2,

/,

(25.202)

646

The simultaneous equations model

restrictions.
In the
we need to account for these implicitly imposed
restrictions
context of (198)the restrictions in (202)are viewed as
which fail the rank condition (seeSection 25.3) because a11 equations satisfy
these restrictions. When Ho is tested and rejected, however, we need to
reconsider our estimable model (seeChapters 1 and 26) in view of the fact
that the original model allowed for no lags in the variables involved. When
a situation like this arises in practice the modelling is commonly done
equation by equation and not in terms of the system as a whole because of
the relative uncertainty about which variables enter which equation. For
this reason it is important to consider misspecification testing in terms of
individual equations as well.
itestable'

tphoney'

(2)

The unrestricted

single structural

equation level

(194)is of no interest because no structural


identified
in
is
its
context. From the statistical viewpoint,
parameter
however, (194)is of some interest because it can be viewed as a bridge'
testing
between (193)and (195)which can be used for misspecification
stochastic
and
of
Treating
the
it
linear
linear
hybrid
as a
purposes.
regression models (see Chapters 19-20) (194)presents us with no new
testing is concerned.
The testing
problems as far as misspecification
procedures proposed in Chapters 2 1 and 22 can be easily adapted for the
present context.
From the theoretical viewpoint

The restricted

single structural

Ievel

equation

Having tested for misspecification in the context of (198)and accepted the


underlying assumptions as valid we have a well-defined estimated statistical
GM. From the theoretical viewpoint, however, (198)makes very little sense
(if any). For that we need to reparametrise it (by imposing a priori
restrictions) in terms of the theoretical parameters of interest. The question
will be discussed in Section
of how we achieve such a reparametrisation
25.9. At this stage we need to assume that this has been achieved by
J(1) 0 (enoughin number to
imposing the exclusion restrictions >1)
identify the first equation) in order to get (200).
expressed in
From the misspecification viewpoint the reparametrisation
(200)is of interest in so far as this has not been achieved at the expense of the
statistical properties of the estimated GM (198) we started with. That is, we
need to check that (200)is not just a theoretically meaningful but also
statistically well-detined relationship. The discerning reader would have
noticed that the same problem arose in the context of the linear and
dynamic linear regression models (see Chapter 23).
=0,

25.8

Misspecilkation

647

testing

ln Sections 25.6 and 25.7 it was argued that (200)can be viewed in various
alternative ways which are useful for different purposes. The two
intemretations we are interested in here are:
(25.203)
(25.204)
Given that
E (. . k'1 f

'o-l

y1

.'

X1t

:),

)?'jy j t + J'1x 1 r

x j t)

aE-tylf,''X, x!)
-

Bz1cxl, + B'lcxtl

*j

),'ju j t +

?.fj t

)?,

we can see that we can go from (204)to (203)by subtracting y'jyjt from both
sides of (204).For estimation ptlrposes (204)is preferable because of the
orthogonality between the systematic and non-systematic components (see
Section 25.6). For misspecification testing- howey'er,it is more convenient to
use (203).
Tbe

oI' D(

??t?,-n'la/!'r.

Xl,,' (1) can be tested using a direct extension


test discussed in Chapter 2 1 based on the residuals

'1,..y1,,

of the skewness-kurtosis
J: t

.yl

, -

f;

j..y

, -

t,

l 2
,

(25.205)

whereflv,tsvrefer to

some IV estimator of ),j and J! respectively such as the


estimators.
LIML
ZSLS or
The Iinearitl' of the conditional expectation
c(.ylf), X1t

f( 1':,

xl!)

?'Z1,,

(25.206)

x'Ifl'
and 1'H (/1,J'1) can be tested using the F-type test
where Z1t EEE(y'1p,
discussed in the context of the linear regression model (see Chapter 2 1).
That is, generate the higher-order tenns #r,using Z1f or powers of ft),and test
the null hypothesis

He: d

regression

in the auxiliary
)',1
/

H 1: d# 0

a01 +

Z1(at'

tlzd

:0
1

$) ) + Td

(25.207)
+

&'

(25.208)

(see Chapter 2 1). The augmented equation (207) or (208) should be


estimated using an IV estimator, say ZSLS. Using the asymptotic normality
of &jj.the F-type test procedure discussed in Chapter 21 can be justified
asymptotically in the present context.

The simultaneous

The homoskedasticitv
Var()'1 t/o-ly

equations model

of the conditional

), X j r

xl

r)

variance

(25.209)

r: 1

can be tested by extending the White and related tests discussed in Chapter
2 1 to the present context. If we assume that no heteroskedasticity is present
in the simultaneous equation model as a whole we can test (209)against
t/c(y1,),

Vartyl

using Hv cl

X 1,

xlr)

/'l(z 1 ,),

(25.210)

0 against H3 : cl # 0 in the auxiliary

regression

Lqtl cll + c'1 kt + q

(25.211)

(see Kelejian ( 1982)). This can be tested using the F-type test or its
asymptotic variants (e.g. T'R2) discussed in Chapter 21. ln the case where
the assumption related to the presence of heteroskedasticiLy
in the rest of the
appropriate,
need
modify
F-type
is
because the
the
tests
not
system
to
we
asymptotic distribution of > ((),1) is not
Ft

-c)

N(0, c

x
J

Q/ )

(25.212)

but
j.'

,. T( -c)

x.
J

N(0, 2c #.(Q-

Q/ and Qz: refer to the probability


1#) ( #,)
' and (), r 1 k lf k'lt ) k'1, (''1r, x'lr)
of
matrix coefficients of the auxiliary regression:
where

'

f(Z1?l1',)

-..

L'#) +

%1

(25.213)

L )),
2LQcj

limits of
respectively

(,Fj

#)#)'q,

and L is the

(25.214)

and Hall ( 1983), White ( 1982a)).


An important assumption which should be tested in the present context
is whether by imposing the exclusive restrictions on ( 198) the parameter time
invariance no longer holds. The diagnostic graphs related to the recursive
windows discussed
estimates of the coefficients as well as the z-observation
in Chapter 2 1 apply to the present case in terms of some IV estimator of a1.
Moreover, the F-type tests related to time-dependence and structural
change can be applied to equation (203)using some IV estimator for 1'.
Time-dependence of the parameters is particularly important in the present
context because this formulation is commonly used for prediction and
policy evaluation purposes. ln cases where no misspecification tests have
been applied to the statistical GM ( 198), testing for parameter time
dependence is of paramount importance because heteroskedasticity in its
context might appear as time dependence in the context of (204).This is
because the parameters in (204)are defined in terms of B and f2 and if

(seePagan

25.9 Specification testing


Covtyf Xf

xr)

649
(25.215)

ftxt),

will be a function of xt via flxfl.


The sampling model assumption of independence can be tested by
applying the F-type test to test the significance of the coefficients of the
lagged Zj, in

the vector of coefficients

(25.2 16)
0r
l

*1 t

(01 4*1 r )'Z 1 t + )-' a'jzjf

(25.217)

+ vl,

The only modilication of the F-type test discussed in Chapter 22 needed is


to estimate the parameters of (216)or (217) by some IV estimator because of
the presence of simultaneity. The F-type autocorrelation test is based on the
auxiliary equation
l

*1t (01

-&*

I l''

)'z 1 t + Y c'0
f

1t - i

+v

(25.218)

t5

(see Godfrey (1976),Breusch and Godfrey (198 1)).


25.9

Spaification testing

ln practice, specification testing in the simultaneous equations model is


almost exclusively considered in the context of the single identilied
structural equation
.J.'lf T'1ylt
=

whereylt: (n11(1)

+J'I

X1t

where .J,'1, ylf


of both xIf and
=

against Sl

yl #

( 1949:. Under Hv the GM (219) takes the form

(25.220)

lj f,

yj Under
i,'1
,.

x(1)r given

y1f B'12X 1 t +
=

> n'lj - 1 (foridentification).

1) x 1, x1f : kl x 1, (k

(Anderson and Rubin


'lf

(25.219)
-kj)

Testing Ho : l'j

'l

l11,

X1,

H3 however, y1,
,

that

B'22X(1)t

+ ulf

ylr - y'ylf and a function

(25.221)

Hence, Hv can be tested indirectly by testing that the coefficient of X(:) is


zero in the regression equation

yf

X1l1' + X(l)/(1)

+V1

(25.222)

650

The simultaneous

equations model

This can be tested using the F-type test statistic


.)'?'(Mx, Mxly?
Mxyl

T-k

FT(y1)=

'f

kz

where Mx l -X(X'X) - 1X' and M.y


region being
=

H
'--

F- k),

F(2,

(25.223)

I -X1 (X)X1) - 1Xk the rejection


,

Lk

t)-l

.ftyl:

FT(yt)

>

ca),

df-lkc, F

-k).

(25.224)

t7t

A special case of Ho above of particular interest is yl


i.e.the endogenous
variables are insignificant in (219).
This test can be readily extended to include some of the coefficients in Jj
as long as al1 the coefficients in yj are specified under Ho. The difficulty with
including only a subset of yl in Hv is that we cannot apply the usual linear
regression estimator under Ho, we need to apply an IV estimator such as
ZSLS or LIML. The most convenient way to extend Hv to a more general
set of linear restrictions on a/ EEE(y'j,J?1)' is to use the asymptotic distribution
of a consistent and asymptotically normal estimator.
=0,

(2)

Testing S(): R1'

r against S1 : Ra/ # r

where R and r are p x (rn1+ kj - 1) and p x 1 known matrices


ranktR) p. Using the asymptotic normality of S*sss
(0r &ljuu),

with

x'/T*lsus

*)

.N'(0,t?1 1 D 1-11)

t'

(25.225)

(see Section 25.6), we can use the same intuitive argument as in the case of
the F-test (seeChapter 20) to suggest that a natural choice for a test statistic
in the present context is
(R&lst.s r)'gR(2'121) - 1 R'(J- ltR&lsus

-r)

FT'(y1)

where

f1:1

21 B (#1: ij). Asymptotically


pFrtyll

In practice

(25.226)

we can argue that

&o

z2(.

it is preferable

approximation

(25.227)
to use the F-type form

(226)based

on the

(25.228)

25.9 Specification

651

testing

The above results suggest that asymptotically the specification testing


results discussed in Chapter 19 in the context of the linear regression model
individual (r-tests)and
can be extended to the present case. ln particular,
Two special cases of
asymptotically
significance
of
justified.
joint tests
are
which
using
be
restrictions
tested
the above F-type
can
linear homogeneous

test are:
(i)
(ii)
which
(3)

the overidentifying restrictions;


the exogeneity restrictions;
we consider next.
Testing the overidentifying

and

restrlctions

As argued in Section 25.3 the identification restrictions in the case of


exclusion restrictions come in the form of the system (51)-(52). For
identification we need (k kj) > (?'n1 1). ln the case of overidentification
the overidentifying restrictions q= k
-zn1 + 1 can be tested via (51(52). Given that (51) is not directly involved in the overidentification the
latter conditions can be tested using (52),i.e.
-

-k:

H3 : B22r1 #21#0.
1.o: 82271 #21
enables us to test Ho is the statistical
which
A convenient formulation
sub-model:
=0,

A'lt

/'11 xl t + b',j.x(1), +

1./1t

+ Br2cxll), + u1,

y1, B'1ax,,
-

Multiplying the second equation by y'2and subtracting from the first we can
define the unrestricted structural formof y1, as :

(#11 -B12T1)'x1f

7'1y1l+

-P1, =

(/21
-

B2271)/x(jjt +

:1,.

(25.229)

This form presents itself as the natural formulation in the context of which
Hv can be tested. If we compare (229)with restricted structural form(219)
we can derive the auxiliary regression:

tt
or its

t#11

operational

Jt,

J?'x1

The obvious
test statistic

B12r1 -J1)'x1, +

form:
:

J!l'lxllj, +

way to test Hv
II

LM

rR2

(#c1-B2c),1)'x(l), +c1,

x
7

g2((y).

!?f.

is to use the R2 from

(25.230)
(230)to define the LM
(25.231)

652

The simultaneous

An asymptotically
for small F is:

equivalent form suggested by Bassman

RRSS

FF

model

equatio>

T-k

URSS

fo
'v

URSS

where
RRSS=(y1
URSS

ylflv

appx

FM1,T-k)

-Y1fIv

-X1JIv)'(y1

-Y1fIv)'(Mx

=(y1

(1960)as better
(25.232)

Xl#lv)

Mx,)(yl -Y1iIv).

The main problems with (231)and (232)are:


when Hv is rejected the tests provide no information as to the likely
source of the problem, and
(ii)
for a large k their power is likely to be 1ow (seeSection 19.5).
In an attempt to alleviate the problem of low power and get some idea
about the likely source of any rejection, the modeller could choose q
instruments at a time. That is, for X(1) BE (X2; X3) where X2: F x q we could
,T:
consider the following augmented form of (219) for 1,,=,1,2,

(i)

=YI)4

XIJI+X2J1

r1

:1

J1

against

=0

H3 : J1

(25.233)

where X1 : T x kj and X2: T x q. Using


exclusion restrictions takes the form
HvL

(233)the test for the overidentifying

'0.

The F-type test statistic for this hypothesis is


RRSSV

FXp'(r1)

URSS;V

T- k

where
and
URSS;..

(y1

(25.234)

URSS;V

Y1f hp,

X1J*1,pXc#*c,p.)'
-

*1
p )
(y 1 - Y 1 f , p X 1IL'p. - X 2J*w

and IV refers to some instrumental variables estimator such as 2SLS or


LIML of the parameter in question. Asymptotically, we can argue by
analogy to the linear regression case that
Ho

FT',wlyl)

z2(>.

ln practice, however, it might


approximation:

be preferable

use the F-type

FT;Jy1)

aPPX

Flq, T-k),

(25.235)

653

25.9 Specifkation testing

with the usual rejection region. For a readable survey of the finite sample
results related to the test statistics (226)and (234)see Phillips (1983).
The above tests can be generalised directly to the system as a whole using
either the 3SLS formulation in conjunction with the F-type tests or the
FIML formulation using the likelihood ratio test procedure.

for exogeneity
specification of (219) above,
Testing

(4)

yjt was treated as a random vector of


by defining the systematic component as the
endogenous
and the
expectation
of
conditional
yjt on the c-field generated by y1f (c(y1f))
XIf
yI,
and
of
asymmetric
xlt).
This
treatment
observed value of X1, (X1;
distinction.
was used as the main implication of the endogenous-exogenous
of
refers
possibility
the
Testing for exogeneity in
present context
to the
This
amounts to
treating yjt or some subset of it in the same way as Xjy.
testing whether the stochastic component of yj, contributes significantly to
the equation (219).
of this
A convenient formulation for testing the appropriateness
of
suggested
equation
context
the
in
asymmetric treatment of yjf is the
(194)
form:
the
194)
context
takes
IV estimation. In the present
(
In the

variables

y1 =Y171

XIJI

+ MxY1(y1

(25.236)

r1)+ :1.

ln this formulation we can see that y1t and x1, in (219)are treated similarly
when ),t yl 0. Hence, an obvious way to parametrise the endogeneity' of
yj, is in the form of:
0, Sl : (7? 71) #0.
Hl :(T1-T1)
(25.237)
=

This can be tested using the F-type test statistic:


FT=

RRss -

usss

'Rss

T-2(rn1
,,n1

1)-I

1
Hu
-

Ft'''l

1, T'-2@1

1'+k1)

(25.238)

where RRSS and URSS refer to the RSS from (219) and (236)respectively,
both estimated by OLS (seeWu (1973),Hausman (1978), inter ali.
The above test can be easily extended to the case where only the
exogeneity of a subset of the variables in y1f is tested by re-arranging (236)

yl

(see Hausman

PxYt7? +Y171

''FXIJI

MxY1(y -y!)

+ MxYtyf

+Et

(25.239)
and Taylor

(1981) for a similar test).

The simultaneous

equations model

As a conclusion to this section it is important to emphasise that a1l the


above specilication tests will be very sensitive to any departures from
assumptions (1j-g8q(see Section 25.4). If any of these assumptions are
invalid the conclusions of the above tests will be at best misleading. ln
particular the above
test is likely to be inappropriate in cases
where the independent sample assumption is invalid', see Section 25.7.
texogeneity'

25.10

Prediction

lf we assume that the observed value xw+, of the exogenous random vector
Xw./ is available, the obvious way to predict yw../ is to use the statistical
GM,
yt B'xf + u,,

t G T,

(25.240)

estimator of the unknown


in conjunction with a
parameters 0(B, f1). The problem, however, is that (240) can also be expressed
(parametrised) in terms of the theoretical parameters of interest (, i.e.
tgood'

yt B(()'xt
=

(25.241)

+ u?,

and the MLE of 0 and ( coincide only in the case where the system of
structural equations (withthe identifying restrictions imposed),
F*'yt + A*'xr + st*

0,

(25.242)

identsed. Otherwise, if the system is overidentified the FIML (01'


is
3SLS) estimator of ( has a smaller variance than
This discussion suggests that in the case where the system isjust identified
we apply the prediction methods discussed in the context of the
multivariate linear regression model (see Chapter 24). ln the case of
overidentification, however, we derive the restricted estimator of the
statistical parameters B via
.jlsr

(25.243)
.
- trt(-)where A(j) and F(j) refer to some asymptotically efficient estimator f of the
structural parameters (, such as FIML or 3SLS. lt can be shown that in the
'

B(-

overidentified case:

x,7''z-(B(j)-B(t))

x//'rt
with (Fo

-B)

,v

N(e, Fa),

x(e,F0),

Fa) being positive semi-definite.

(25.244)
(25.245)

25.10 Prediction

655

In the case where ( is estimated by a single equation estimation


such as LIML or ZSLS then
'''r(B(j

.v'

B(())

r )-

N(0, F2),

method

(25.246)

and both (F() - Fz) and (F2 F:5) are positive semi-definite (see Dhrymes
( 1978)). lt is not very difficult to see that this asymptotic efficiency in
estimation carries over to prediction as well given that the prediction error
defined by
-

(w+/ - yr+/)

(2

B)'Xw+!

is a function of the difference


lmportant

(25.247)

+uw+/
-

B) and

n.

concepts

of interest,
Reparametrisation,
theoretical (structural) parameters
equations,
overparametrisation,
recursive
of
simultaneity,
system
restrictions,
restrictions,
exclusion
identification, linear homogeneous
order and rank conditions for identification. underidentification, just and
overidentification, indirect MLE, estimator generating equation, full
informatio' n maximum likelihood (FIML), limited information maximum
likelihood (LIML), two-stage least squares (2SLS), k-class estimator, noncentral Wishart distribution, instrumental variables, three-stage least-

squares (3SLS).

Questions

3.

Compare the multivariate linear regression and the simultaneous


equations models.
Discuss the relationship
between statistical and structural parameters
and
well
structural
theoretical parameters of interest.
as
as
Show how the identification problem arises because of the
overparametrisation induced by the structural formulation.
'In the context of the statistical GM
T. 1 ,

r:

'

t' )
yt + Ajx, +
/

ij,

of Section 25.2 no endogeneity problem arises and the usual


orthogonal estimator (OLS) is the appropriate estimator to use.'
5.

Explain.
of the variation
'Endogeneity arises because of the violation
free
exogeneity
induced by the identifying restrictions.'
condition of
Discuss.

656

The simultaneous equatiom model

Explain what we mean by a recursive system of simultaneous


equations and discuss the estimation of the parameters of interest.
refers to the uniqueness of the reparametrisation
from
the statistical to the theoretical parameters of interest.' Discuss.
Explain the order and rank conditions of identification in the case of
linear homogeneous restrictions.
Discuss the two cases where the ordercondition
might be satisfied but
the rank condition fails.
10. Before we could even discuss the identification problem we need to
have a well-defined statistical GM.' Discuss.
Explain the concept of an indirect MLE of the theoretical parameters
of interest. Under what circumstances is the IMLE fully efficient?
Explain the intuition underlying the concept of an estimator
generating equations (EGE).
How do restrictions of the form BF + A 0 differ in the context of the
simultaneous equations and multivariate linear regression model?
structural parameters estimators such as ZSLS, 1V, LIML, k14.
class, 3SLS are numerical approximations of the FIML estimator.'
Discuss (see Hendry (1976:.
Explain the derivation of the 2SLS estimator of 1 in
Sldentification

t'l-he

yl =Z

+:* 1,

both as constrained least-squares as well as an instrumental variables


estimator.
'Fhe 2SLS estimator
can be viewed
as being a numerical
LIML
estimaton'
Discuss.
approximation to the
'lf the rank condition for identification is not satisfied then 2SLS does
not exist.' Discuss.
S'T'he3SLS estimator is derived by using the Zellnerformulation
in the
context of the simultaneous equation model.' Explain.
'Misspecification testing in the context of the simultaneous equation
model coincides with that in the context of the multivariate linear
regression model.' Discuss.
Consider the question of misspecification testing in the context of a
single structural equation before and after the identifying restrictions
have been imposed.
Explain how you could construct a test for the overidentifying
restrictions in the context of a single structural equation.
Discuss the question of testing for the exogeneity of a subset of the
endogenous variables included in the first equation.

25.10 Prediction

657

'Using the derived statistical parameter


.-

..<

..x

B(t)

will provide

A(()F(t)

efficient predictors

more

estimators

defined via

for yw-hj.'Discuss.

Exerclses

Consider the two equation estimable model:


mt

a 1 1 + a 1 zit + a 1 5pt +

it

(1)
(2)

J 1 4y1,.

a21 + xzzmt + acapf + uzqgt.

Express equations

(1) and (2)in the

B'x f +u

forms

'

f'

r*' t't + A*'x t +

t1

=0.

F* and A* in terms of (B, f1) and show


that the theoretical parameters of interest are uniquely
Derive the parameters

defined.
Verify that in the case of a recursive simultaneous
1 IFft= y0?y9

1t

+J'xi

ilf

1, 2,

the orthogonal estimators (where a,9 HE (),,9',J;)',Zf EH (yf 1


49 (z;zf)- l zkyi
l

and

model

n'l,

equations

X)):

1
ii T
=

--.

if

ii

yi

Zf&,9, i

1 2,
,

,9

MLE'S
of
and nf.
are indeed the
equations
structural
Consider the following
imposed'.
restrictions

+ 7 4 1 .V41+

- 1.'1 l + 7 2 1 )' 2f

+
l

4.1' 1

-h

-/'

24.-.21

.F4t

-1-

1 1 X 1t

lcxl,

l 3x1,

1 4.,X1 t

the exclusion
3 1 .X 3/

l1

;3,,-

2cx2t

+ aaxa,
+

t;

(524..X2/

fl4t.

Discuss the identifiability of the above equations under the


following conditions:
No additional restrictions;
(a)
0 J z2 0 ;
(b)
24
(c)
7 32 0, T41 0.
Explain how you would estimate the first equation by ZSLS.
=

(ii)

-/4.3.:4.1

with

The simultaneous

658

muationsmodel

Show that in the case of ajust identified equation the ZSLS and IMLE
estimators coincide.
Show that l-((-)'fF(t-)
( 1/F).d(()'Z'ZA((-) in Section 25.5.
Derive the 2SLS estimator of al' in yl Z1al + 1)1' and explain why
where * (y-.irj sus -X1J2sus)is an inconsistent
!7tl (1/F)t'/
estimator of tll'j
Compare your answer in 6 with the derivation of the 2SLS estimator as
an instrumental variables estimator.
t'T'he GM yl Pxzlf
+ ct* (see Section 25.6) can be viewed as a
equivalent
sample
for the sample
to y3t y'1'(y1,/c(X,)) + J':xlf + k:T*
T' Explain.
period t 1, 2,
Derive a test for the exogeneity of a subset y2, of y1f in the context of the
single equation
=

)'1, r'1yl t +J'1 x1,


=

+ tt

Additional references
Anderson

( 1982);

Hausman

( 1983);

Judge et al. ( 1982).

Epilogue: towards a methodology


econometric modelling

26.1

A methodologist's

of

critical eye

The purpose of this chapter is to formalise the methodology sketched in


Chapter 1 using the concepts and procedures developed and discussed in
the intervening chapters. The task of writing this chapter has become
considerably easier since the publication of Caldwell ( 1982) which provides
a lucid introduction to the philosophy of science for economists and
establishes the required terminology. Indeed, the chapter can be seen as a
response to Caldwell's challenge in his discussion of possible alternative
approaches to economic methodology:
One approach which to my knowledge has been completely ignored is

.
and philosophy
with
the integration of economic methodology
econometrics. Methodologists have generally skirted the issue of
methodological foundations of econometric theory, and the few
econometricianswho have addressed philosophical issues have seldom
gone beyond gratuitous references to such figures as Feigl or Carnap
.

(See ibid, p. 216.)

ln order to avoid long digressions into the philosophy of science the


discussion which follows assumes that the reader has some basic knowledge
of philosophy of science at the level covered in the first five chapters of
Caldwell ( 1982) or Chalmers (1982).
Let us begin the discussion by considering the textbook econometric
methodology criticised in Chapter 1 from a philosophy of science
perspective. Any attempt to justify the procedure given in Fig. 1.1 reveals a
deeply rooted influence from the logical positivist tradition of the late 1920s
early 30s. Firstly, the preoccupation of logical positivism with criteria of
cognitive significance, and in particular their verifiability criterion, is clearly
659

Epilogue

660

discernible in defining the intended scope of econometrics as


measurement of theoretical relationships'. A theory or a theoretical concept
was considered meaningful in so far as it can be verified by observational
evidence. Secondly, the trcatment of the observed data as not directly
related to the specification of the statistical model was based on the logical
positivists' view that observed data represent
facts' and any
with the facts was rendered meaningless.
theory which does not
Unless the theoretical model, now re-interpreted
in terms of the
observational language,differs from the statistical model only by some nonsystematic effects (white-noise
error) the theory had no meaning. Moreover.
there was no need to bring the actual DGP into the picture because the
theory could only have cognitive significance if it constitutes a description
of the former. Thirdly, emphasis in the textbook methodology tradition is
cliterion of cognitive significance) and
placed on testing theories (testability
choosing between theories on empirical grounds. Despite such an
emphasis, however, to my knowledge no economic theory was ever
abandoned because it was rejected by some empirical econometric test, nor
was a clear-cut decision between competing theories made in lieu of the
evidence of such a test.
From the philosophy of science viewpoint the textbook econometric
methodology largely ignored the later reformulations of logical positivism
in the shape of logical empiricism in the late 1950s early 60s (seeCaldwell
(1982/. For example, the distinction between theoretical and observational
concepts was never an issue in econometric modelling. adhering to the
logical positivist view that the two should coincide for the former to have
any meaning. Moreover, the later developments in philosophy of science
challenging every aspect of logical positivism and in particular the
verication principle and objedivity, as well as the incorrigibility of
observed data (see Chalmers (1982)), are yet to reach the textbook
econometric methodology. The structure of theories in economics has been
influenced by the axiomatic hypothetico-deductive formulation of logical
empiricism but this served to complicate the implementation of the
textbook econometric methodology even further. Research workers found
themselves having to use not only
statistical procedures but
also
theory construction procedures. That is, they would begin
but
with a very
theoretical model and after estimating a
variety of similar equations and a series of
statistical
techniques end up with the
and
revise
sense)
return
the
some
to
(in
theoretical model in order to rationalise the
empirical equation.
Commonly, research workers find themselves having to include lagged
variables in their estimated equations (becauseof the nature of the observed
data used), even though the original theoretical model could not account
'the

iobjective

'comply'

'illegitimate'

'illegitimate'

'vague'

'estimable'

iillegitimate'

ibest'

'best'

26.2

Formalising

661

a methodology

for them. Often, research workers felt compelled, and referees encouraged
them, not to report their modelling strategy but to devise a new theoretical
hypothetico-deductive model using some form of dynamic optimisation
method and pretend that that was their theoretical model all along (see
Ward ( 1972/. As argued in Chapter 1, research workers are driven to use
procedures because of the
the textbook
methodology forces them to wear. Moreover, most of these
procedures become the natural way to proceed in the context of the
alternative methodology sketched in Chapter 1. In order to see this let us
consider a brief formalisation of this methodology in the context of the
philosophy of science.
'straightjacket'

Sillegitimate'

'illegitimate'

26.2

Econometric

modelling, formalising a methodology

Having criticised the textbook econometric methodology for being deeply


rooted in an outdated philosophy of science the task of fonnalising an
alternative methodology would have been easier if the new methodology
could be founded on the most recent accepted view in philosophy of science.
Unfortunately
(or fortunately) no such generally accepted view has
since
emerged
the dethronement of the positivist (logicalpositivism as well
empiricism)
philosophy (seeCaldwell (1982),Chalmers (1982),
as logical
alia).
The discussion which follows, purporting to bring
inter
Suppe (1977),
various
threads of methodological arguments in the book,
together the
related
cannot be
to any one school of thought in the current philosophy of
science discussions aspiring to become the new orthodoxy. At best it could
be seen as the product of a cross-fertilisation between some of the current
views on the stnlcture, status and function of theories and the particular
features of economic theory and the associated observed data. For
expositional purposes the discussion will be related to Fig. 1.2 briefly
discussed in Chapter 1.
The concept of the actual data generation process (DGP) is used to
designate the phenomenon of interest which a theory purports to explain.
The concept is used in order to emphasise the intended scope of the theory
as well as the source of the observable data. Defined this way, the concept of
the actual DGP might be a real observable phenomenon
or an
experimental-like situation depending on the intended scope of the theory.
For example, in the case of the demand schedule discussed in Chapter 1, if
between a
the intended scope of the theory is to determine a relationship
hypothetical range of prices and the corresponding intentions to buy by a
group of economic agents, at some particular point in time, the actual DGP
might be used to designate the experimental-like situation where such data
could be generated. On the other hand, if the intended scope of the theory is

662

Epilogue

to explain observed quantity and/or price changes over time, then the actual DGP should refer to the actual market process giving rise to the observed data. It should be noted that the intended scope of the theory is used to determine the choice of the observable data to be used. This will be taken up in the discussion of the theory-dependence of observation.
A theory is defined as a conceptual construct which purports to provide an idealised description of the phenomena within its intended scope (the actual DGP). A theory is not intended as a carbon copy of 'reality' providing an exact description of the observable phenomena in its intended scope. Economic 'reality' is much too complicated for such an exact copy to be comprehensible and thus useful in explaining the phenomena in question. A theory provides only an idealised projected image of reality in terms of certain abstracted features of the phenomena in its intended scope. These abstracted features, referred to as concepts, provide the means by which such generalised (idealised) descriptions are possible. To some extent the theory assumes that the phenomena within its intended scope can be 'adequately explained' in terms of the proposed idealised replicas viewed as isolated systems built in terms of the devised concepts.
In the context of the centuries-old dichotomy of instrumentalism versus realism the above view of theory borrows elements from both. It is instrumentalist in so far as it assumes that a theory does not purport to provide an exact picture of reality. Moreover, concepts are not viewed as describing entities of the real world which exist independently of any theory. It is also realist in two senses. Firstly, it is realist in so far as under the circumstances assumed by the theory (as an isolated system) its 'validity' can be ascertained. There is something realistic about a demand schedule in so far as it can be established or not under the circumstances assumed by the theory of demand. Secondly, it is realist because its main aim is to provide an 'adequate' explanation of the phenomena in its intended scope. As such, the adequacy of a theory can be evaluated by how successful it is in coming to grips with the reality it purports to explain. Theories are viewed as providing 'approximations' to reality and they are judged by the extent to which such 'approximations' enhance our understanding of the phenomena in question. We cannot, however, appraise theories by the extent to which they provide exact pictures of reality, simply because we have no access to the world independently of our theories in a way that would enable us to assess the adequacy of those descriptions. (See Chalmers (1982), p. 163.)

This stems from the view that observation itself presupposes some theory providing the terms of reference. Hence, in the context of the adopted methodology of econometric modelling, theories are treated as providing approximations to the observable phenomena without being exact copies of reality; a realist position without the logical empiricists' correspondence theory of truth (see Suppe (1977)).
In view of the realist elements of a theory propounded above, the logical positivists' contention that the only function of theories is description is rejected. Theories are constructed so as to enable us to consider a variety of questions related to the phenomena within their intended scope, including description, explanation and prediction.
The distinction between theoretical and observational concepts associated with logical empiricism as well as naive instrumentalism (see Caldwell (1982)) is not adhered to in the context of the proposed methodology. This is because observation itself is theory-laden. We do not just observe, we observe within a reference framework, however rudimentary. In constructing a theory we devise the concepts and choose the assumptions in terms of which an idealised picture of the phenomena within its intended scope is projected. This amounts to devising a language and accepting a system of picturing and conceiving the structure of the phenomena in question. Such a system specifies the 'important' features of the phenomena to be observed. The main problem in econometric modelling, however, is that what a theory suggests as important features to be observed and the available observed data can differ substantially. Commonly, what is observed (data collected) was determined by some outdated theory and what was deemed possible at the time given the external constraints on data collection. Although there is an on-going revision process on what aspects of the observable phenomena data are collected, there is always a sizeable time-lag between current theories and the available observed data. For this reason, in the context of the proposed methodology, we distinguish between the concepts of the theory and the observed data available. It should be stressed that this is not the theoretical-observational concepts distinction of logical positivism in some disguise. It is forced upon econometric modelling because of the gap between what we need to observe (as suggested by the theory in question) and what is available in terms of observed data series. This distinction becomes even more important when the theory is contrasted with the actual DGP giving rise to the available data.

Observed data in econometric modelling are rarely the result of experiments on some isolated system as projected by a theory. They constitute a sample taken from an on-going real DGP with all its variability and 'irrelevant' features (as far as the theory in question is concerned). These, together with the sampling impurities and observational errors, suggest that published data are far from being 'objective facts' against which


theories are to be appraised, striking at the very foundation of logical positivism. Clearly the econometrician can do very little to improve the quality of the published data in the short run apart from suggesting better ways of collecting and processing data. On the other hand, bridging the gap between the isolated system projected by a theory and the actual DGP giving rise to the observed data chosen is the econometrician's responsibility. Hence, in view of this and the multitude of observed data series which can be chosen to correspond to the concepts of a theory, a distinction is suggested between a theoretical and an estimable model. A theoretical model is simply a mathematical formulation of a theory. This differs from the underlying theory in two important respects. Firstly, there might be more than one possible mathematical formulation of a theory. Secondly, in the context of a theory the initial conditions and the simplifying assumptions are explicitly recognised; in the context of the model these simplifying assumptions govern its characterisation of the phenomena of interest. This is to be contrasted with an estimable model whose form depends crucially on the nature of the observed data series chosen (see Chapter 1). To some extent the estimable model refers to a mathematical formulation of the observational implications of a theory, in view of the observed data and the underlying actual DGP. In order to determine the form of the estimable model the econometrician might be required to use auxiliary hypotheses in an attempt to bridge the gap between the theory and the actual DGP. It is, however, important to emphasise that an estimable model is defined in terms of the concepts of the theory and not the observed data chosen. In practice, the form of estimable models might not be readily available. This, however, is no reason to abandon the distinction and return to the textbook methodology assumption that the theory coincides with the actual DGP apart from a white-noise error. The estimable form of a theory depends crucially on the nature of the available observed data and the gap between the theory and the actual DGP, given that its aim is to bridge this gap. The estimable model plays an important role in the proposed methodology because it provides the link between a theory and a statistical model.
Having rejected the presupposition that the actual DGP, the theory and the statistical model coincide apart from a white-noise error, the task of specifying a statistical model for the problem in hand is no longer a trivial matter. The statistical model can no longer be specified by attaching a white-noise process, assumed to follow a certain distribution, to the theoretical model. The nature of the observed data has a crucial role to play because of their relationship with the estimable form of the theory. Because of this it is important to specify the statistical model, in the context of which the estimable model will be analysed, taking account of the nature of the
observed data in question.


The procedure from the observed data to the statistical model has been discussed in various places throughout the book but it has not been formalised in any systematic way. Because of its central role in the proposed methodology, however, it is imperative to provide such a formalisation. The first step from the observed data to the statistical model is made by assuming that:

the observed data constitute a realisation Z ≡ (Z1, Z2, ..., ZT)' of a sequence of random vectors (stochastic process) {Zt, t ∈ T}.

This assumption provides the necessary link between the actual DGP and probability theory. It enables us to postulate a probabilistic structure for {Zt, t ∈ T} in the form of its joint distribution function D(Z; θ). This will provide the basis for the statistical model specification.
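Recalling the sequential conditioning used repeatedly in Part IV, one way to see what such a probabilistic structure involves is the decomposition (a sketch, with θt denoting the parameters of the t-th conditional distribution):

    D(Z1, Z2, ..., ZT; θ) = D(Z1; θ1) · ∏(t=2 to T) D(Zt | Zt−1, ..., Z1; θt).

This holds for any such joint distribution; it is the restrictions on the time-heterogeneity and the memory of {Zt, t ∈ T} discussed in Chapter 8 which reduce the right-hand side to a manageable parametrisation.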
The specification of a statistical model is based on three sources of information:
(i) theory information;
(ii) measurement information; and
(iii) sample information.
The theory information comes in the form of the estimable model. Its role in the initial specification of the model is related to the choice of the observable variables underlying the model as well as the general form of the statistical GM.

The measurement information is related to the quantification and the measurement system properties of the variables involved. These include the units of measurement, the measurement system (nominal, ordinal, interval, ratio; see Appendix 19.1), as well as exact relationships among the observed data series such as accounting identities. Such information is useful for the specification of the statistical model because it enables us to postulate a sensible statistical GM for the problem in question. For example, if the statistical GM yt = β'xt + ut allows yt to take values outside its range we need to reconsider it. Also, if an accounting identity holds either among some of the xit's or among yt and some of the xit's the statistical GM is useless. For further discussion on accounting identities and their role in econometric modelling see Spanos (1982a).
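By way of illustration, the following sketch (in Python; the data, the variable names and the reliance on the numpy and statsmodels libraries are assumptions of the example, not part of the discussion above) checks a linear statistical GM against the measurement information that yt is a share confined to the range (0, 1):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)

    # Hypothetical data: y_t is a share, so its admissible range is (0, 1).
    T = 100
    x = rng.normal(size=T)
    y = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x + rng.normal(scale=0.5, size=T))))

    # Linear statistical GM: y_t = b0 + b1*x_t + u_t.
    X = sm.add_constant(x)
    fit = sm.OLS(y, X).fit()

    # Measurement check: do the fitted values respect the range of y_t?
    outside = int(((fit.fittedvalues <= 0) | (fit.fittedvalues >= 1)).sum())
    print("fitted values outside (0, 1):", outside)

    # If any fall outside, a GM for the logit of y_t respects the range by construction.
    logit_fit = sm.OLS(np.log(y / (1.0 - y)), X).fit()

If the linear GM yields fitted values outside (0, 1), the measurement information is telling us to reconsider it, for example by modelling the logit of yt instead.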
The sample information comes in the form of the observable random variables involved and their structure. It is helpful to divide this information into three mutually exclusive sets related to {Zt, t ∈ T}:
(a) past: Z1, Z2, ..., Zt−1;
(b) present: Zt;
(c) future: Zt+1, ..., ZT.
Such information is useful in relation to important concepts underlying the specification of a statistical model such as exogeneity, Granger-causality,
structural invariance (see Hendry and Richard (1983)).


The specification of a statistical model is indirectly related to the distribution of the stochastic process {Zt, t ∈ T} because in practice it is easier to evaluate the appropriateness of probabilistic assumptions about the marginal distributions of the Zt's instead of any conditional distributions directly related to the postulated statistical GM. According to Fisher the problem of statistical model specification:

is not arbitrary, but requires an understanding of the way in which the data are supposed to, or did in fact, originate.

(See Fisher (1958), p. 8.)


For this reason the statistical model is directly related to the actual DGP, being defined in terms of the observed data and not some arbitrary white-noise error term. From the modelling viewpoint it is convenient to specify a statistical model in terms of three interrelated components:
(i) statistical GM;
(ii) probability model; and
(iii) sampling model.
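For the linear regression model of Part IV, for instance, the three components take the familiar form (a summary sketch in the book's notation):

    (i) statistical GM: yt = β'xt + ut, t = 1, 2, ..., T;
    (ii) probability model: D(yt | Xt = xt; θ) is N(β'xt, σ²), θ ≡ (β, σ²);
    (iii) sampling model: y ≡ (y1, y2, ..., yT)' is an independent sample sequentially drawn from D(yt | Xt = xt; θ), t = 1, 2, ..., T.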
The statistical GM defines a probabilistic mechanism purporting to provide a generalised approximation to the actual DGP. It is a generalised approximation in the sense that it incorporates no theoretical information apart from the choice of the observed data and the general form of the estimable model. That is, it is 'designed' to provide the framework in the context of which any possibly testable theoretical information of interest could be tested. In particular, any theoretical information not needed in determining the distribution of {Zt, t ∈ T} for the sample period t = 1, 2, ..., T might be testable in the context of the postulated statistical model. On the other hand, the other two sets of information, the measurement and sample information, play a vital role in determining D(Z; θ), Z ≡ (Z1, Z2, ..., ZT), and should be taken into consideration at the outset. The theoretical information which is taken into account in designing a statistical GM is the form of the estimable model. This is because the estimable model constitutes a reformulation of the theoretical model so as to provide an idealised approximation to the actual DGP in view of the available data, and the statistical GM postulates a probabilistic mechanism which could conceivably have given rise to the observed data. The latter, however, does not necessarily coincide with the former apart from some white-noise error.

The statistical GM is a stochastic mechanism whose particular form depends on the nature of D(Z; θ). In particular the statistical GM is defined in terms of a distribution derivable from D(Z1, Z2, ..., ZT; θ) by marginalisation and conditioning (see Chapters 5 and 6) in such a way as to enable us to analyse (statistically) the estimable model in its context. That

is, the estimable model 'translated' in terms of the observable variables involved should be 'nestable' (a special case) within the postulated statistical GM.
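To illustrate the marginalisation and conditioning involved, take the linear regression case where Zt ≡ (yt, Xt')' is assumed to be normally distributed (see Chapters 15 and 19). The joint distribution factorises as

    D(Zt; ψ) = D(yt | Xt; θ1) · D(Xt; θ2),

and the statistical GM is defined in terms of the conditional component via

    yt = E(yt | Xt = xt) + ut = β'xt + ut, with β = Σ22⁻¹σ21 and σ² = σ11 − σ21'Σ22⁻¹σ21

(ignoring the constant term for simplicity), D(Xt; θ2) being marginalised out when Xt is weakly exogenous with respect to the parameters of interest.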
The probability distribution underlying a statistical GM defines the probability model component of the statistical model in question. For the completeness of the argument and the additional insights available at the misspecification stage the distribution defining the probability model is related to D(Z; θ). The apparent generality we lose by ignoring D(Z; θ) and going directly to the distribution of the probability model is illusory. The probability model in the context of statistical inference plays the role of the 'anchor distribution' in terms of which every other distribution or distribution-related quantity is definable. In Part IV we discussed many instances where, if the modeller does not adhere to the same probability model, the argument can be led astray. Such instances include the omitted variables argument, the stochastic linear regression model as well as the problem of simultaneity. In the last case the argument about simultaneity bias is based totally on defining the expectation operator E(·) in terms of D(yt | Xt; θ). This bias can be made to disappear if the underlying distribution is changed (see Chapter 25).
As argued in Chapter 17, the sampling model, which constitutes the third component of a statistical model, provides the link between the probability model and the observed data in hand. Its role might not seem so vital at the specification stage because of its close relationship with the other two components. At the misspecification stage, however, it can play a crucial role and its close relationship with the other two components can be utilised to determine the way to proceed. Given that the statistical model is defined directly in terms of the observable variables giving rise to the observed data in hand, and not some unobservable error term, the sampling model has a very crucial role to play in econometric modelling. For example, in the case of the linear regression model very few eyebrows will be raised if a modeller assumes that the error is independent. On the other hand, if time-series data are used and the modeller postulates that yt (conditional on Xt = xt) is independent for t = 1, 2, ..., T, a sizeable number of econometricians will advise caution; even though the two assumptions are equivalent in this context. Making probabilistic assumptions about observable random variables which gave rise to the data in hand seems to provide a better perspective for judging the appropriateness of such assumptions.
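In the notation of the linear regression model the two assumptions are indeed one and the same, since ut ≡ yt − E(yt | Xt = xt):

    {ut, t = 1, ..., T} are NI(0, σ²)  if and only if  {(yt | Xt = xt), t = 1, ..., T} are NI(β'xt, σ²);

the difference lies only in whether the assumption is stated in terms of an unobservable error term or the observable random variables themselves.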
Once a statistical model is postulated we can proceed to estimate the parameters in terms of which the statistical GM is defined, what we called the statistical parameters of interest. These parameters do not necessarily coincide with the theoretical parameters of interest in terms of which the estimable model is defined. Before we are in a position to relate the two sets
of parameters we need to ensure that the estimated statistical model is well defined. Given that statistical arguments will be used to 'define' as well as test any hypotheses related to the theoretical parameters of interest, we need to ensure that their foundation (the estimated statistical model) is well defined in the sense that the underlying assumptions are valid. The estimated statistical model might be viewed as a convenient summarisation of the sample information in the Fisher paradigm, analogous to the histogram in the Pearson paradigm. As such, the estimated statistical model is well defined if the underlying assumptions (defining it) are valid. Otherwise, any statistical arguments based on invalid assumptions will be at best misleading. Misspecification testing refers to the testing of the (testable) assumptions underlying the statistical model in question. The estimated statistical parameters of interest acquire a meaning only after the misspecification testing has been completed without any assumptions being rejected. This is to ensure that the estimated parameters are indeed estimates of the intended statistical parameters of interest. It would be difficult to overestimate the importance of misspecification testing in the context of econometric modelling. A misspecified estimated statistical model is essentially useless in this context.
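As a concrete (if minimal) sketch, the fragment below runs a small battery of misspecification tests on an estimated linear statistical GM; it assumes Python with numpy and statsmodels, uses simulated data, and takes the skewness-kurtosis (Jarque-Bera), White and Durbin-Watson tests as stand-ins for the fuller battery developed in Part IV:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_white
    from statsmodels.stats.stattools import durbin_watson, jarque_bera

    rng = np.random.default_rng(1)

    # Hypothetical data for an estimated linear statistical GM.
    T = 200
    x = rng.normal(size=(T, 2))
    y = 1.0 + x @ np.array([0.8, -0.3]) + rng.normal(size=T)

    X = sm.add_constant(x)
    fit = sm.OLS(y, X).fit()
    u = fit.resid  # the residuals on which the misspecification tests operate

    jb_stat, jb_pval, _, _ = jarque_bera(u)  # normality (skewness-kurtosis type test)
    w_stat, w_pval, _, _ = het_white(u, X)   # homoskedasticity (White's test)
    dw = durbin_watson(u)                    # independence (for time-ordered data)

    print("normality p =", round(float(jb_pval), 3))
    print("homoskedasticity p =", round(float(w_pval), 3))
    print("Durbin-Watson =", round(float(dw), 2))

Only if none of the underlying assumptions is rejected can the estimated coefficients be interpreted as estimates of the intended statistical parameters of interest.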
The concept of an estimated well-defined statistical model plays a vital role in the context of the proposed methodology of econometric modelling. No valid statistical inference argument can be made unless it is based on such an estimated statistical model. For this reason the statistical GM is not 'constrained' to coincide with the estimable model. For example, there is no point in adhering to one lag in the variables involved in view of the estimable model if the temporal structure of these variables requires more than one in order to yield a well-defined statistical model. The modeller should feel free to allow for features necessitated by the nature of the observed data chosen, even though the theory (estimable model) does not account for them. More often than not the theoretical information relating to the estimable model is vague because of the isolated system characteristics of economic theories. An important implication of this is the apparent non-rejection of theories on the basis of statistical tests. Such theories can be accepted or rejected only if the 'isolated system' conditions assumed by the theory are made to hold under an experimental-like situation where the theoretical model and the actual DGP can be ensured to coincide apart from a white-noise error term. When this is not the case, rejecting a theory whose estimable model can take a number of different forms will be very difficult.
In the case where misspecification testing leads to the rejection of one or more of the assumptions underlying the statistical model we proceed by respecifying the model so as to take account of the apparently invalid assumption. What we do not do is to 'engage in local surgery' by 'drafting' the alternative hypothesis of a misspecification test into an otherwise unchanged statistical model, such as postulating an AR(1) error because the Durbin-Watson test rejects the independence assumption, in the context of the linear regression model; see Hendry (1983) for a similar viewpoint. Once a well-defined estimated statistical model is reached we can proceed to determine (construct) the empirical econometric model.
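The AR(1)-error example illustrates what is at stake. As discussed under common factor restrictions in Part IV, the 'patched' model

    yt = β'xt + ut, ut = ρut−1 + εt,

is algebraically equivalent to

    yt = ρyt−1 + β'xt − ρβ'xt−1 + εt,

that is, a special case of the dynamic linear regression yt = α1yt−1 + β0'xt + β1'xt−1 + εt under the common factor restriction β1 = −α1β0. Respecification proceeds by postulating the unrestricted dynamic formulation and testing the restriction, rather than imposing it by 'local surgery'.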
Starting from a well-defined statistical model we can proceed to test any theoretical restrictions which can be related to the statistical formulation. The specification of the empirical econometric model can be viewed as a reparametrisation/restriction of the estimated statistical GM in view of the estimable model, so as to be expressed in terms of the theoretical parameters of interest. Any reparametrisation which imposes restrictions on the statistical parameters can be tested within a formal hypothesis-testing framework and accepted or rejected. The aim is to construct an approximation to the actual DGP in terms of the theoretical parameters of interest as suggested by the estimable form of the theory: an empirical econometric model. This should be done not at the expense of the statistical properties enjoyed by the estimated statistical GM, because the empirical econometric model needs to be itself a well-defined statistical model for prediction and policy evaluation purposes. Hence, when the empirical econometric model is constructed by reparametrising/restricting a well-defined statistical GM we need to 'check' that it satisfies the underlying statistical assumptions we started with. This does not constitute proper statistical testing but informal diagnostic 'checking'. It is important, however, in order to be able to use formal statistical arguments in relation to prediction and/or policy evaluation in the context of the empirical econometric model.
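A minimal sketch of such a test of a reparametrisation/restriction (again in Python with statsmodels and pandas; the data, the variable names and the particular restriction are hypothetical):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)

    # Hypothetical data for an estimated statistical GM with two regressors.
    T = 150
    df = pd.DataFrame({"x1": rng.normal(size=T), "x2": rng.normal(size=T)})
    df["y"] = 0.5 + 0.7 * df["x1"] + 0.7 * df["x2"] + rng.normal(size=T)

    # Unrestricted (well-defined) statistical GM.
    fit = smf.ols("y ~ x1 + x2", data=df).fit()

    # Suppose the estimable model implies equal coefficients: a restriction on
    # the statistical parameters to be tested, not imposed.
    print(fit.f_test("x1 = x2"))

Only if the restriction is not rejected does the reparametrised model, expressed in terms of the theoretical parameters of interest, inherit the statistical credentials of the estimated statistical GM.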
Although we need to ensure that the theoretical parameters of interest ξ are uniquely defined in terms of the statistical parameters of interest θ, there is nothing unique about ξ. Numerous theoretical parametrisations are possible for any well-defined set of statistical parameters. In practice, we need to choose one of such possible reparametrisations/restrictions of the estimated statistical GM. Several criteria for model selection (see Fig. 1.2) have been proposed in the econometric literature depending on the potential uses of the empirical econometric model in question. The most important of such criteria are:
(i) theory consistency;
(ii) goodness of fit;
(iii) predictive ability;
(iv) robustness (including nearly orthogonal explanatory variables);
(v) encompassing;
(vi) parsimony
(see Hendry and Richard (1982), (1983), Hendry (1983), Hendry and Wallis (1984)).
In cases where the observed data can be relied upon as accurate measurements of the underlying variables, there is something realistic about an empirical econometric model in so far as it can be a 'good' or a 'bad' approximation to the actual DGP. In such cases the instrumentalist interpretation of such an empirical econometric model is not very useful because it will stop the modeller seeking to improve the approximation. The realistic interpretation, however, should not be viewed as implying the existence of the ultimate 'truth' econometric modelling is aiming at; we have no criteria outside our theoretical perspective to establish ultimate 'truth'. This should not stop us from seeking better empirical econometric models which provide us with additional insight in our attempt to explain observable economic phenomena of interest. In particular, an empirical econometric model which explains why and how other empirical studies have reached the conclusions they did is invaluable. In such a case we say that the empirical econometric model encompasses others purporting to explain the same observable phenomenon; on the subject of encompassing see Hendry and Richard (1982), (1983), and Mizon (1984). Hendry and Richard (1982), in their attempt to formalise the concept of a well-defined empirical model, include the encompassing of all rival models as one of the important conditions for what they call 'a tentatively adequate conditional data characterisation' (TACD); for further discussion on the similarities and differences between the two approaches see Spanos (1985).
The empirical econometric model is to some extent as close as we can get to an actual DGP within the framework of an underlying theory and the available observed data chosen. As argued above, its form takes account of all three main sources of information: theory, measurement and sample. As such, the above-discussed methodology differs from both extreme approaches to econometric modelling where only theory or data are used for the specification of empirical models. The first extreme approach requires the statistical model to coincide with the theoretical model apart from a white-noise error term. The second approach ignores the theoretical model altogether and uses the structure of the observed data chosen as the only source of information for the determination of the empirical model; see Box and Jenkins (1976) and Spanos (1985).

This concludes the formalisation of the alternative methodology sketched in Chapter 1. A main feature of the new methodology is the broadening of the intended scope of econometrics. Econometric modelling is viewed not as the estimation of theoretical relationships nor as a procedure for establishing the 'trueness' of economic theories, but as an endeavour to understand observable economic phenomena of interest using observed data in conjunction with some underlying theory in the context of a statistical framework.
26.3 Conclusion

The methodology formulated in Section 26.2 can be viewed as an attempt towards a coherent approach to econometric modelling where economic theory as well as the structure of the observed data have a role to play. In a certain sense the adopted methodology can be seen as an attempt to formalise certain procedures which are currently practised by an increasing number of applied econometricians. The most important features of the adopted methodology are:
(i) the role of the actual DGP;
(ii) the distinction between a theoretical and an estimable model;
(iii) the role of the observed data in the statistical model specification;
(iv) the notion of a well-defined estimated statistical model; and
(v) the distinction between an estimated statistical GM and an empirical econometric model.
As mentioned in the Preface, the methodology formalised in the present book can be viewed as having evolved out of the LSE tradition in econometric modelling (see Gilbert (1985)) that owes a lot to the work of Denis Sargan and David Hendry. The main features of the LSE tradition in time series econometric modelling, including its emphasis on specification and misspecification testing, maximum likelihood and instrumental variables estimation, asymptotic approximations, dynamic specifications, common factor restrictions and error-correction parametrisations, are integrated within a (hopefully) coherent framework in order to formalise an approach to econometric modelling.

Another related methodology at odds with the textbook methodology was proposed by Sims (1980). The Sims methodology, in terms of Fig. 1.2, essentially ignores the left-hand side of the diagram and concentrates almost exclusively on modelling the observed data. For a discussion of the relationships between the methodology adopted in the present book and these alternative methodologies see Spanos (1985).

Finally, it is important to warn the reader about the nature of any methodological discussion. On this I can do no better than quote from Caldwell (1982):
methodology is a frustrating and rewarding area in which to work. Just as there is no best way to listen to a Tchaikovsky symphony, or to write a book, or to raise a child, there is no best way to investigate social reality. Yet methodology has a role to play in all of this. By showing that science is not the objective, rigorous, intellectual endeavour it was once thought to be, and by demonstrating that this need not lead to anarchy, that critical discourse still has a place, the hope is held out that a true picture of the strengths and limitations of scientific practice will emerge. And with luck, this insight may lead to a better, and certainly more honest, science.

(See ibid., p. 252.)

Additional references

Blaug (1980); Boland (1982); Caldwell (1984); Hendry and Wallis (1984).

Index

acceptance region, 286-7
  and confidence region, 304-5
actual DGP, 20, 661
adjusted R², 382
admissible estimator, 236
almost sure convergence, 188
alternative hypothesis, 287
approximate MLE's, 533-6
a priori restrictions, 377
  exclusion (zero-one), 616-19
  linear, 396-401, 422-7
  linear homogeneous, 615-19
  non-linear, 427-32
AR(1) process, 150-5
  error, 506-7
  estimation, 279-81
  stability condition, 152-4
AR(m) process, 155-8, 506-7
ARCH test, 550
ARIMA(p, d, q) process, 156
  stability conditions, 161
ARMA(p, q) process, 159-61
asymptotic expansions, 203-8
asymptotic independence, 140-1, 501
asymptotic moments, 192
asymptotic power function, 327
asymptotic properties of estimators, 244-7
  consistency, 244
  efficiency, 247
  normality, 246-7
  unbiasedness, 246
asymptotic properties of tests, 326-8
  consistency, 327
  locally UMP, 328
  UMP, 327-8
  unbiasedness, 327
asymptotic stationarity, 153
asymptotic test procedures, 328-35
asymptotic uncorrelatedness, 141
autocorrelation, 134
autocorrelation errors, 501-3, 505-11
  tests for, 513-21
autocovariance, 134
autoproduct moment, 134
auxiliary regressions, 446-7, 460-1, 467, 470
Basmann test, 652
Bayes' formula, 121
Bayesian approach, 220
Bernoulli distribution, 62-3, 166
Bernoulli's theorem, 165
best linear unbiased estimator (BLUE), 239, 255-6, 450-1
best linear unbiased scalar (BLUS) residuals, 407
beta distribution, 401, 479
bias, 235
binomial distribution, 63-4, 166
bivariate distributions, 79-93
  binomial, 84
  exponential, 92, 124
  logistic, 91, 125
  normal, 83, 88, 93, 120, 122
  Pareto, 84, 124
Borel field, 41, 52
Borel function, 95
Box-Cox transformation, 455-7
Box-Jenkins approach (see ARMA, ARIMA)
Breusch-Pagan test, 469-70
Brownian motion process, 149-50

CAN estimators, 271
canonical correlations, 314
Carleman's condition, 74
Cauchy distribution, 70-1, 105
causality (see Granger non-causality)
central limit theorem
  De Moivre-Laplace, 64, 165
  Liapounov, 174
  Lindeberg-Feller, 174
  Lindeberg-Levy, 173
characteristic function, 73-4
Chebyshev's inequality, 73
chi-square distribution, 98-9, 108, 111
  non-central, 108, 111
Chow test, 487-8
collinearity, exact, 432-4
  'near', 434-40
common factor restrictions, 507-11
condition numbers, 436
conditional distributions, 89-94
  exponential, 92
  logistic, 91
  normal, 93
  Pareto, 92
conditional expectation, 121-7
  properties, 122, 125, 126-7
  wrt a σ-field, 125-7
  wrt an observed value, 121-5
conditional moments, 122-5
  mean, 122-3
  variance, 122-3
conditional probability, 43-4
confidence region, 303-6
confluence analysis, 12
consistency, weak, 244-6
  strong, 246
constant, in linear regression, 370-1, 410
constrained MLE's, 423-4
continuous r.v.'s, 56
convergence, mathematical, 185-8
  of a function, 185-6
  of a sequence, 185
  pointwise, 187
  uniform, 187
convergence, modes of, 188-92
  almost sure, 188, 167
  in distribution, 189, 167
  in probability, 189, 166
  in rth mean, 188
convergence of moments, 192-4
correlation coefficient, 119
covariance, 119
  matrix, 312-13
Cramer-Rao, lower bound, 237
  regularity conditions, 237
Cramer-Wold lemma, 191
cross-correlation, 135
cross-covariance, 135
cross-section data, 342-3
cumulants, 74
cumulative distribution function (see distribution function)
cumulative frequency, 25
CUSUM test, 477
CUSUMSQ test, 477
data, economic, 342-6
  and the probability model, 346-9
degrees of freedom, 108, 111-13
δ-method, 201
demand function, 10-11
de Moivre-Laplace CLT, 64, 165
density function, definition, 57
  conditional, 90
  joint, 82
  marginal, 86
  properties, 59
diagnostic checking, 557
difference equation, 155-6, 543-5
differencing, 161, 479-81, 528
differentiation of vectors and matrices, 603-4
distribution function, 55-60
  conditional, 89-92
  joint, 78-85
  marginal, 85-7
distribution of the sample, 216
disturbance (see error term, non-systematic component)
dummy variables, 369, 536-7
Durbin's h test, 541-2
Durbin-Watson test, 515-18
dynamic linear regression model, 526-70
dynamic multipliers, 601-2
econometrics, definition of, 3, 676
Edgeworth expansion, 206-7
efficiency, relative, 234-5
  full, 237-8
efficient score test (see Lagrange multiplier test)
eigenvalue, 433, 436
eigenvector, 433, 436
elliptical family of distributions, 458
empirical distribution function, 228-9
empirical econometric model, 21, 23, 670
encompassing, 568, 670
endogeneity (non-exogeneity), 629
endogenous variables, 608
Engel's law, 6
equilibrium, long-run, 558-9
ergodicity, 143, 500

error, autocorrelation, 505-7
error bounds, Berry-Esseen, 202-3
error-correction model, 554
errors-in-variables, 12
error term, 349-50, 374-5 (see also non-systematic component)
estimable model, 23, 668
estimate, 231
estimation, methods, 252-84
estimation, properties of estimators, 231-49
estimators,
  CAN, 271
  FIML, 625
  GLS, 463, 503, 587-8
  IV, 637-44
  k-class, 632
  LIML, 629, 631, 633
  OLS, 449-52
  3SLS, 639-40
  2SLS, 635-7
estimator generating equation, 624-6
events, 38
  elementary, 38
  impossible, 39
  mutually exclusive, 44
  sure, 39
exclusion restrictions (see a priori restrictions)
exogeneity, weak, 273, 376-7, 421-2
  strong, 505, 629-30
  tests for, 653
expectation, 68-9
  conditional, 121-7
  properties of, 70-1, 116-20
experimental design, 366-7
exponential distribution, 76, 92, 124
exponential family of distributions, 68, 299
F distribution, central, 104, 108, 113, 319
  non-central, 113, 319, 320, 324
F test, 398-402, 425-7, 553
  power of, 401
F-type misspecification test, 446
  homoskedasticity, 467, 547, 555
  linearity, 460-1, 547-8, 555
  parameter time-invariance, 477
  sample independence, 511, 541
  structural invariance, 482-6, 556
FIML, 625
final form, 601
finite sample properties, 232-44
  efficiency, 234-8
  linearity, 238
  sufficiency, 242-4
  unbiasedness, 232-4
Fisher-Neyman factorisation, 242
Fisher's F distribution (see F distribution)
Fisher paradigm, 7-9
forecasting (see prediction)
frequency curves, 27
frequency polygon, 24
functions of r.v.'s, distribution of, 96-107
  addition, 100-2
  min, 105-6
  quotient, 102-5
Gamma function, 99
Gaussian distribution (see normal distribution)
Gauss linear model, 6-8, 348, 353, 357-68
Gauss-Markov theorem, 239, 449
generalised F-test, 590-3
GIVE PC (computer package), xvii
GIVE (estimator), 638
GLS, 463-6, 503, 587
goodness of fit (see R²)
Gram-Charlier series A, 205
Granger non-causality, 505, 509, 529
  test for, 544
Hermite polynomials, 204
heteroskedasticity, 463-71
  versus parameter time-dependence, 473
histogram, 23
homogeneous non-stationary process, 161, 479-81, 527-8
homoskedasticity, 126, 378, 463-71
  misspecification tests for, 464-71, 547, 648-9
hypothesis testing, 285-303
  alternative, 287
  composite, 287
  null, 286-7
  simple, 287
idempotent matrix, 319, 381, 411
identically distributed r.v.'s, 94, 216-17
identification, 614-19
  exact, 618
  order condition, 615
  overidentification, 618
  rank condition, 617-18
  underidentification, 618
impact multipliers, 601-2
incidental parameters, 136, 346-7, 499
independence, 44, 87, 93
independent r.v.'s, 87
  linear, 118
  necessary and sufficient condition, 117-18
  versus orthogonality, 118

  versus uncorrelatedness, 118
independent sample, 217, 378
  misspecification tests for, 511-21
indirect MLE, 621-2, 627
information matrix, 239
  asymptotic, 247
  sample, 239
  single observation, 247
information matrix test procedure, 467
initial conditions, 151-3, 156, 528, 531
innovation process, 147
instrumental variables, 637-44
  estimator, 638
instrumentalism, 665
integral, Riemann-Stieltjes, 69
integrability, 193
  square, 203
  uniform, 193
interim multipliers, 601
intercept (see constant)
interval estimation (see confidence region)
invariance, linear transformations, 438-9
invariance of MLE's, 266-7
inverse matrix, partitioned, 442
invertibility conditions, 161
iterative estimation procedure, 588
Jacobian transformation, 106, 257
joint central moments, 119
joint density function, 81-2
  continuous, 82
  discrete, 82
k-class estimator, 632
Khinchin's WLLN, 169
King's law, 5
Kolmogorov-Gabor polynomial, 446
Kolmogorov's axiomatic approach, 37-43
Kolmogorov's inequality, 171
Kolmogorov's stochastic process conditions, 133
Kolmogorov's SLLN, 170-1
Kolmogorov's WLLN, 169
Kolmogorov-Smirnov test, 229, 453
Kronecker product, 573, 603-4
kurtosis, 73, 452
lag operator, 155, 161, 509
Lagrange multiplier test procedure, 330, 333-4, 430-2
  in misspecification testing, 446, 453, 460, 466, 468-9, 519-21
Laguerre polynomials, 208
law of large numbers (see WLLN, SLLN)
leading indicator model, 554
least squares method, 6-7, 253-6, 448-50
least squares estimators, 448
  GLS, 463, 503, 587-8
  OLS, 448-9, 638
Lehmann-Scheffe theorem, 242-3, 387, 577
level of significance of a test (see size of a test)
Liapunov's CLT, 174
likelihood function, 258-60
likelihood ratio test procedure, 299-303, 328-9, 335, 425-6, 432
limit of a function, 186
limit of a sequence, 185
limit of moments, 192
limit, probability (see convergence in probability)
LIML estimator, 629, 631, 633
Lindeberg condition, 174-5
Lindeberg-Feller CLT, 174, 177
Lindeberg-Levy CLT, 173-4
linear regression model, 369-410
linear restrictions (see a priori restrictions)
linearity, 370, 378, 457-63
  and normality, 316
  inducing transformations, 462-3
  misspecification tests for, 459-61, 597, 648
locally UMP test, 335
logical empiricism, 662-3
logical positivism, 3, 662-3
logistic distribution, 91, 125
log-likelihood, 258-60
log-normal distribution, 283, 457
long-run equilibrium solution, 558-9
long-run multipliers, 602
lower bound (see Cramer-Rao lower bound)
MA(p) stochastic process, 158
Malinvaud formulation, 588
Mann-Wald theorem, 197
marginal distributions, 85-9
  normal distribution, 88, 317
Markov inequality, 71
Markov process, 148-9
Martingale difference process, 147, 273-5
Martingale orthogonality, 118, 171
Martingale process, 145
  CLT for, 178
  SLLN for, 172
  WLLN for, 172
maximum likelihood, method, 257-81
mean, 25, 70-1
mean square error (MSE), 234-6, 249
measurement equations, 12
measurement information, 352, 665

measurement systems, 409-11
m-dependent process, 141
median, 25, 71
methodology, 15-21, 659-72
misspecification testing, 21, 221, 392
mixing, strong, 142, 179
  uniform, 143, 179
MLE, 260
  constrained, 423
  properties, 266-82
mode, 25, 71
model selection, 523, 669-70
moments, approximate, 194
  asymptotic, 192
  central, 73, 119
  limit of, 192
  raw, 73, 118
moments, method of, 256-7
Monte Carlo, 435
mth-order Markov process, 149
multicollinearity (see collinearity)
multiplication rule of probability, 44
multiple correlation coefficient, 313-14, 318, 322-3, 382, 439
multivariate linear regression model, 571-607
  in relation to the SEM, 610-14
multivariate normal distribution, 312-24
multivariate t distribution, 471
Neyman-Pearson theorem, 296
non-centrality parameter, 108, 111-13
  in the F-test, 399, 401
nonlinear model, 461-3
non-linear restrictions (see a priori restrictions)
non-parametric inference, 218
non-parametric processes, 146-52
non-parametric tests, 453
non-random sample, 218, 343, 494-7
non-stochastic variables, 357
non-systematic component, 350, 370, 374, 376
normal distribution, 64-6
  bivariate, 83-4, 88
  mean of, 70
  multivariate, 312-24
  standard, 68
  variance, 71
normality, 447-57
  misspecification tests for, 451-5
normalising transformations, 455-7
normal (Gaussian) stochastic process, 135-6
  time homogeneity restrictions on, 139
nuisance parameters, 414
null hypothesis, 286
observation space, 217
ogive, 27
omitted variables bias argument, 419-21
  and auxiliary regressions, 446, 458-61, 468, 471, 502, 515, 523
  reformulated, 445-7
OLS, 448-51
O, o notation, 195-6
Op, op notation, 196-9
order condition (see identification)
order of magnitude, 171, 174, 179, 194-8
orthogonal projection, 381, 411, 642
orthogonality, 118
  between systematic and non-systematic components, 350, 358, 371, 381
overdifferencing, 479
overidentifying restrictions, 618
  test for, 651-3
overparametrisation, 612-13
panel data, 342
parameter space, 60
parameter structural change, 481-7
parameter time-invariance, 378, 472-81
parametric family of densities (see probability model)
parametric processes, 146
Pareto distribution, 61, 339-41
partial adjustment model, 552
partial correlation coefficient, 314, 318, 323, 439-40
Pearson family of densities, 28, 452-3
Pearson paradigm, 7-8
Pillai's trace test, 593
pivots, 295
Political Arithmetik, 4-5
power function, 291
power of a test, 290
power set, 39
predetermined variables, 610
prediction, 221, 247-9, 306-9
  in the linear regression model, 402-5
  in the multivariate linear regression model, 599, 601-2, 654-5
principal components, 434
probability, definition,
  axiomatic, 43
  classical, 34
  frequency, 34
  subjective, 35
probability limit (see convergence)
probability model, 60-1, 214

probability set function, 42-3


probability space, 42
quadratic forms related to the normal distribution, 319-20
quotient of two r.v.'s, 102-5
R² (see multiple correlation coefficient)
random experiment, 37
random matrix, 135
random sample, 216-17
random variable, 48-76
  continuous, 56
  definition, 50
  discrete, 56
  functions of, 97, 99-110
  minimal σ-field generated by, 50
random vector, 78-93
rank condition (see identification)
Rao-Blackwell lemma, 243
realism, 663
recursive estimator, 407, 474-8
recursive system, 612-14
regression curve, 122-4
regressors, order of magnitude, 391-2
rejection region, 286
reparametrisation/restriction, 21, 352
RESET type tests, 446, 460-1, 555, 597
residuals, 405-8
  BLUS, 407
  recursive, 407, 474-8
residual sum of squares (RSS), 428
respecification approach, 498-502, 505-9
restrictions (see a priori restrictions)
Russian school, 36, 64
sample information, 352, 667
sample moments, 227
sample space, 38
sampling model, 215-19
score function, 260
second order stochastic process, 138-9
selection matrices, 619
sequential conditioning, 273-4, 495
serial correlation (see autocorrelation)
Shapiro-Wilk test, 452
σ-field, definition, 40
  generated by a r.v., 50-1
  generated by a set, 40
  increasing sequence of, 51
simple random sampling, 343
simultaneous equation model, 608-58
singular normal distribution, 406
size of a test, 291
skedasticity, 123-4
skewness, 26, 73
skewness-kurtosis test, 452-5
SLLN, 170-2
small sample properties (see finite sample properties)
specification testing, 392
spectral decomposition (see eigenvalues)
standard deviation, 72
stationarity, strict, 137-8
  lth order, 138-9
statistic, 224
statistical GM, 344, 349-52
statistical model, 218, 339
  well-defined estimated, 352, 409, 522, 668
statistical parameters of interest, 351, 371, 376, 419-22, 575, 666-7
  versus theoretical parameters of interest, 351, 376, 667-9
stochastic linear regression model, 413-18
stochastic process, 131-7
  realisation of, 131
stratified sampling, 343
structural form, unrestricted, 614
  restricted, 616
structural parameters (see theoretical parameters of interest)
Student's t distribution, 104, 108
  multivariate, 471
sufficiency, 242
  minimal, 243
supermartingale, 145
SURE formulation, 585-8, 636
systematic component, 349-51, 370-1, 375-6
t distribution (see Student's t distribution)
t test, 364, 392, 396-7
test, definition, 294
test statistic, 289
testing, 221, 285-303
Theil's inequality coefficient, 405
theoretical model, 21, 667
theoretical parameters of interest, 349, 351, 553, 569, 613-14, 620, 669
theory, 20, 662-6
3SLS estimator, 635-7
time-homogeneity restrictions, 137-9
time series data, 130, 342
Toeplitz matrix, 495, 505
total sum of squares, 382
2SLS estimator, 626-35
type I error, 286
type II error, 286
UMA region, 305
UMP test, 291

unbiased confidence region, 306
unbiased estimator, 232
unbiased test, 293
uncorrelated r.v.'s, 118
underidentification, 621
uniform convergence (see convergence)
uniform distribution, 57-8
unimodal distribution, 71
univariate distributions, 62-8
variance, 72, 119
  properties, 73
variance-covariance matrix, 312
variance ratio tests, 395-6
variance stabilising transformations, 487-8
variation free condition, 377, 422
  violation of, 629
vectoring, 573, 604
vector stochastic process, 135
Wald test procedure, 329, 332-3, 429-30
weak convergence (see convergence in probability)
weakly stationary process (see second order stationarity)
Weibull distribution, 105
white-noise process, 150, 151
White test for homoskedasticity, 465-7
Wilk's ratio test, 593
window, τ-period, 540, 562
Wishart distribution, 321, 577, 602-3
WLLN, 168-9
Wold decomposition, 159
Yule-Walker equations, 157
Zellner formulation (see SURE)
zero-one restrictions (see exclusion restrictions)

Symbols and abbreviations

(1) Symbols
N(μ, σ²) - normal distribution with mean μ and variance σ²
ℰ - random experiment
ℱ - σ-field
ℬ - Borel field
∪ - union
∩ - intersection
− - complementation
⊂ - subset of
∈ - belongs to
∉ - does not belong to
∅ - empty set
σ(A) - minimal σ-field generated by A
∀ - for all
U(α, β) - uniform distribution between α and β
B(n, p) - binomial distribution with parameters n and p
Φ - probability model
χ²(n) - chi-square distribution with n degrees of freedom
F(n₁, n₂) - F distribution with n₁ and n₂ degrees of freedom
→ᴾ - convergence in probability
→ᴰ - convergence in distribution
→ᵃ·ˢ· - almost sure convergence
→ʳ - convergence in rth mean
R - the real line (−∞, ∞)
Rᵏ - Euclidean k-dimensional space
R₊ - positive real line [0, ∞)
I_T(θ) - sample information matrix
I(θ) - single observation information matrix
I_∞(θ) - asymptotic information matrix
~ - distributed as
~ᵃ - asymptotically distributed as
~ᴴ⁰ - distributed under H₀
E(·) - expected value
Ẽ(·) - asymptotic expected value
Var(·) - variance

(2) Abbreviations
DGP - data generating process
CLT - central limit theorem
DF - distribution function
pdf - probability density function
WLLN - weak law of large numbers
SLLN - strong law of large numbers
URSS - unrestricted residual sum of squares
RRSS - restricted residual sum of squares
DLR - dynamic linear regression
MLR - multivariate linear regression
MDLR - multivariate dynamic linear regression
MSE - mean square error
UMP - uniformly most powerful
IID - independent and identically distributed
MLE - maximum likelihood estimator
OLS - ordinary least-squares
GLS - generalised least-squares
AR - autoregressive
MA - moving average
ARMA - autoregressive, moving average
ARIMA - autoregressive, integrated, moving average
ARCH - autoregressive, conditional heteroskedasticity
LR - likelihood ratio test
LM - Lagrange multiplier
GM - generating mechanism
r.v. - random variable
IV - instrumental variable
GIVE - generalised instrumental variables estimator
2SLS - two stage least-squares
LIML - limited information maximum likelihood
FIML - full information maximum likelihood
3SLS - three stage least-squares
wrt - with respect to
IMLE - indirect maximum likelihood estimator
CAN - consistent, asymptotically normal
NIID - normal IID
UMA - uniformly most accurate
MAE - mean absolute error
RESET - regression specification error test
BLUE - best linear unbiased estimator
BLUS - best linear unbiased scalar (residuals)
