Chapter 1
PROBABILITY DISTRIBUTION
AND STATISTICS
1.1. PROBABILITY
Joint probability, marginal probability, and conditional probability are important
basic tools in financial valuation and regression analyses. These concepts and
their usefulness in financial data analyses will become clearer at the end of the
chapter. To motivate the idea of a joint probability distribution, let us begin by
looking at a time series plot or graph of two financial economic variables over
time: Xt and Yt , for example, S&P 500 Index aggregate price-to-earnings ratio
Xt , and S&P 500 Index return rate Yt . The values or numbers that variables Xt
and Yt will take are uncertain before they happen, i.e. before time t. At time t,
both economic variables take realised values or numbers xt and yt . xt and yt are
said to be realised jointly or simultaneously at the same time t. Thus, we can
describe their values as a joint pair (xt , yt ). If their order is preserved, it is called
an ordered pair. Note that the subscript t represents the time index.
The P/E or price-to-earnings ratio of a stock or a portfolio is a financial ratio
showing the price paid for the stock relative to the annual net income or profit per
share earned by the firm for the year. The reciprocal of the P/E ratio is called the
earnings yield. The earnings yield or E/P reflects the risky annual accounting rate
of return, R, on the stock. This is easily shown by the relationship $E = $P × R.
In other words, P/E = 1/R.
FINANCIAL VALUATION AND ECONOMETRICS
© World Scientific Publishing Co. Pte. Ltd.
http://www.worldscibooks.com/economics/7782.html
Figure 1.1. S&P 500 Index Portfolio Return Rate and Price-Earnings Ratio 1872–2009 (Data from Prof Shiller, Yale University). [Time-series plot: return rate on a −60% to 60% vertical axis and aggregate P/E ratio, plotted over years 1870–2010.]
In Fig. 1.1, it seems that low returns corresponded to, or lagged, high P/E ratios,
especially at the beginnings of the years 1929–1930, 1999–2002, and 2008–2009.
Conversely, high returns followed relatively low P/E ratios at the beginnings of
the years 1949–1954, 1975–1982, and 2006–2007. We shall explore the issue of
the predictability of stock returns in more detail in Chap. 8.
The idea that random variables move together over time, or display some
form of association, is called statistical correlation. Correlation is defined, or
has interpretative meaning, only when there exists a joint probability distribution
describing the random variables.
In Fig. 1.2, we plot the U.S. national aggregate consumption versus national
disposable income in US$ billion. Disposable income is defined as Personal
Income less personal taxes. Personal Income is National Income less corporate
taxes and corporate-retained earnings. In turn, National Income is Gross Domes-
tic Product (GDP) less depreciation and indirect business taxes such as sales tax.
GDP is essentially the total dollar output or gross income of the country. If
we include repatriations from citizens working abroad, then it becomes Gross
National Product (GNP).
In Fig. 1.2, it appears that consumption increases in disposable income. The
relationship is approximately linear. This is intuitive as on a per capita basis, we
Figure 1.2. U.S. Annual National Aggregate Consumption versus Disposable Income 1999–2009 (Data from Federal Reserve Board of U.S., in $ billion). [Scatter plot: consumption on the vertical axis, $7,200 to $9,600; disposable income on the horizontal axis, $7,000 to $11,000.]
would expect that for each person, when his or her disposable income rises, he
or she would consume more. In life-cycle models of financial economics theory,
some types of individual preferences could lead to consumption as an increasing
function of individual wealth that consists of inheritance as well as fresh income.
Sometimes, the analysis of income also breaks it down into a permanent part and
a transitory part. More on these topics can be found in economics articles on
life-cycle models and hypotheses.
In Fig. 1.3, we evaluate the annual year-to-year change in consumption and
disposable income and plot them on an X–Y graph. The point P1 refers to the
bivariate values (x1 , y1 ), where x1 is change in disposable income and y1 is
change in consumption in 2000. P2 refers to the bivariate values (x2 , y2 ), where
x2 is change in disposable income and y2 is change in consumption in 2001, and
so on. Subscripts to x and y indicate time. It may be construed as the end of a
time period and the beginning of the next time period. In this case, subscript 1
refers to time t1 , end of year 2000.
[Figure 1.3: scatter plot of change in consumption (vertical axis, −$100 to $400) against change in disposable income (horizontal axis, $40 to $360), with points labelled P1 to P10.]
The pattern in Fig. 1.3 reveals that disposable income change dropped from
t = 1 to t = 2, then rose back at t = 3. After that, there was a sharp drop at t = 4
before a wild swing back up at t = 5, and so on. The changes seem to be cyclical.
A cyclical but decreasing trend can be seen in consumption. However, what is
more interesting is that consumption and disposable income visibly increased and
decreased together. Thus, if we construe consumption as the purchases of goods
and services, then the plot displays the positive income effect on such effective
demand. Theoretically, each Xt and each Yt for every time t is a random variable.
A random variable is a variable that takes on different values each with a
given probability. It is a variable with an associated probability distribution. For
the above scatter plot, since Xt and Yt occur jointly together in (Xt , Yt ), the pair is
a bivariate random variable, and thus has a joint bivariate probability distribution.
There are two generic classes of probability distributions: discrete probability
distributions, where the random variable takes on only a countable set of possible
values, and continuous probability distributions, where the random variable takes
values on a continuum, such as an interval of the real line.
Table 1.1. Discrete Bivariate Joint Probability of Two Stock Return Rates. [The 6 × 6 table of joint probabilities P(X_{t+1} = a_j, Y_{t+1} = b_k), j, k = 1, ..., 6, is not reproduced here.]
The marginal probability of Y_{t+1} = b_3 is:
P_Y(Y_{t+1} = b_3) = \sum_{j=1}^{6} P(X_{t+1} = a_j, Y_{t+1} = b_3).
This is obviously the sum of numbers in the row involving b3 and is equal to
0.175. The marginal probability of Xt+1 = a2 is given by:
P_X(X_{t+1} = a_2) = \sum_{k=1}^{6} P(X_{t+1} = a_2, Y_{t+1} = b_k) = 0.2.
Thus, given the joint probability distribution, the marginal probability distribu-
tion of any one of the joint random variables can be found.
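As a computational sketch, the marginal distributions can be obtained by summing the joint table across rows and down columns. The 3 × 3 joint table below is hypothetical (Table 1.1 itself is a 6 × 6 grid, not reproduced here):

```python
# Marginal probabilities from a discrete joint distribution.
# Rows index values b_k of Y_{t+1}; columns index values a_j of X_{t+1}.
# The numbers are illustrative only.
joint = [
    [0.10, 0.05, 0.05],   # Y = b1
    [0.05, 0.20, 0.15],   # Y = b2
    [0.05, 0.15, 0.20],   # Y = b3
]

# Marginal of Y: sum across each row (sum over j).
p_Y = [sum(row) for row in joint]

# Marginal of X: sum down each column (sum over k).
p_X = [sum(row[j] for row in joint) for j in range(len(joint[0]))]

# A proper joint distribution sums to 1, and so does each marginal.
assert abs(sum(p_Y) - 1.0) < 1e-9
print([round(p, 3) for p in p_Y])  # [0.2, 0.4, 0.4]
print([round(p, 3) for p in p_X])  # [0.2, 0.4, 0.4]
```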
What is \sum_{j=1}^{6} \sum_{k=1}^{6} P(X_{t+1} = a_j, Y_{t+1} = b_k)? Employing the concept of
marginal probability that we just learned,
\sum_{j=1}^{6} \sum_{k=1}^{6} P(X_{t+1} = a_j, Y_{t+1} = b_k) = \sum_{j=1}^{6} P_X(X_{t+1} = a_j) = 1.
In the bivariate probability case, we know that future risk or uncertainty is char-
acterised by one and only one of the 36 pairs of values (aj , bk ) that will occur.
Suppose the event has occurred, and we know only that it is event {Xt+1 = a2 }
that occurred, but without knowing which of the events b1 , b2 , b3 , b4 , b5 , or b6 had
occurred in simultaneity. An interesting question is to ask what is the probability
that {Yt+1 = b3 } had occurred, given that we know {Xt+1 = a2 } occurred. This
is called a conditional probability and is denoted by P(Yt+1 = b3 |Xt+1 = a2 ).
The symbol “|” represents “given” or “conditional on”.
From Table 1.1, we focus on the column where it is given that {X_{t+1} = a_2}
occurred. This column is shown in Table 1.2. The highlighted 0.025 is the joint
probability of (a_2, b_3).
Table 1.2. The Column of Joint Probabilities Given X_{t+1} = a_2.

b_k     P(X_{t+1} = a_2, Y_{t+1} = b_k)
b_1     0.03
b_2     0.02
b_3     0.025
b_4     0.03
b_5     0.06
b_6     0.035
Intuitively, the higher (lower) this number, the higher (lower) is the condi-
tional probability that b_3 in fact had occurred simultaneously. Given that a_2 had
occurred, we are finding the conditional probability distribution given {X_{t+1} = a_2},
which is in itself a proper probability distribution and thus must have probabilities
that add to 1. The conditional probability must then be the relative size of 0.025
to the other joint probabilities in that column.
We recall Bayes’ rule on event sets, that:
P(A | B) = \frac{P(A \cap B)}{P(B)},
where A and B are events or event sets in a universe. We can think of the outcome
{Xt+1 = a2 } as event B and the outcome {Yt+1 = b3 } as event A. Events can be
more general, as occurrences {Xt+1 = aj }, {Yt+1 = bk }, {Xt+1 = aj , Yt+1 = bk }
are all events or event sets. More exactly,
P(b_3 | a_2) = \frac{P(a_2, b_3)}{P_X(a_2)} = \frac{0.025}{0.2} = 0.125.
In general,
P(Y_{t+1} = b_k | X_{t+1} = a_j) = \frac{P(X_{t+1} = a_j, Y_{t+1} = b_k)}{P_X(X_{t+1} = a_j)}
= \frac{P(X_{t+1} = a_j, Y_{t+1} = b_k)}{\sum_{k'=1}^{6} P(X_{t+1} = a_j, Y_{t+1} = b_{k'})}.
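A quick computational check, using the a_2 column of joint probabilities from Table 1.2: dividing each joint probability by the marginal P_X(a_2) yields the conditional distribution, and P(b_3 | a_2) = 0.125 as above.

```python
# Conditional probability P(Y = b_k | X = a_2) from the joint column in Table 1.2.
joint_a2 = [0.03, 0.02, 0.025, 0.03, 0.06, 0.035]  # P(a_2, b_k), k = 1..6

p_a2 = sum(joint_a2)                   # marginal P_X(a_2) = 0.2
cond = [p / p_a2 for p in joint_a2]    # P(b_k | a_2)

print(round(p_a2, 3))     # 0.2
print(round(cond[2], 3))  # P(b3 | a2) = 0.125

# A conditional distribution is a proper distribution: it sums to 1.
assert abs(sum(cond) - 1.0) < 1e-9
```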
When we move from discrete probability distributions, where event sets consist
of discrete elements, to continuous probability distributions, where event sets are
continuous, such as intervals on the real line, we have to deal with continuous
functions.
The continuous joint probability density function (pdf) of bivariate
(Xt+1 , Yt+1 ) is represented by a continuous function f(x, y) where Xt+1 = x and
Yt+1 = y, and x, y are usually numbers on the real line R. Note that we simplify
the notations of the realised values by dropping their time subscripts here. For a
continuous probability distribution, the events are described not as point values
e.g. x = 3, y = 4, but rather as intervals, e.g. event A = {(x, y): − 2 < y < 3}
and event B = {(x, y): 0 < x < 9.5}. Then,
P(A, B) = P(0 < x < 9.5, -2 < y < 3) = \int_{0}^{9.5} \int_{-2}^{3} f(x, y)\, dy\, dx.
The “support” for a random variable such as Xt+1 is the range of x. For joint
normal densities, the ranges are usually (−∞, ∞). Thus, Yt+1 also has the same
support. It is usually harmless to use (−∞, ∞) as supports even if the range is
finite [a, b], since the probabilities of null events (−∞, a) and (b, ∞) are zeros.
However, when more advanced mathematics is involved, it is typically better
to be precise. In addition, notice that probability is essentially an integral of a
function, whether continuous or discrete, and is area under the pdf curve.
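A minimal numerical sketch of this idea: approximating P(0 < x < 9.5, −2 < y < 3) by a midpoint-rule double integral. The joint density used is hypothetical (X and Y taken as independent standard normals), since the chapter does not specify f(x, y) here.

```python
import math

def phi(t):
    # Standard normal density.
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

def f(x, y):
    # Hypothetical joint pdf: X and Y independent standard normals.
    return phi(x) * phi(y)

def prob(x_lo, x_hi, y_lo, y_hi, n=400):
    # Midpoint-rule approximation of the double integral of f over a rectangle.
    dx = (x_hi - x_lo) / n
    dy = (y_hi - y_lo) / n
    total = 0.0
    for i in range(n):
        x = x_lo + (i + 0.5) * dx
        for j in range(n):
            y = y_lo + (j + 0.5) * dy
            total += f(x, y) * dx * dy
    return total

p = prob(0.0, 9.5, -2.0, 3.0)
print(round(p, 3))  # ~0.488, i.e. [Phi(9.5) - Phi(0)] * [Phi(3) - Phi(-2)]
```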
The marginal probability density functions of X_{t+1} and Y_{t+1} are given by:
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy
and
f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\, dx.
Notice that while f(x, y) is a function containing both x and y, fY (y) is a
function containing only y since x is integrated out. Likewise, fX (x) is a function
that contains only x.
The conditional probability density functions are:
f(x | y) = \frac{f(x, y)}{f_Y(y)} \quad \text{and} \quad f(y | x) = \frac{f(x, y)}{f_X(x)}.
1.2. EXPECTATIONS
The expected value of random variable X_{t+1} is given by
E(X_{t+1}) = \sum_{j=1}^{6} a_j P_X(a_j) = \mu_X for the discrete distribution in Table 1.1, and
E(X_{t+1}) = \int_{-\infty}^{\infty} x f_X(x)\, dx = \mu_X for a continuous pdf.
Likewise, the conditional expected value is
E(X_{t+1} | b_4) = \sum_{j=1}^{6} a_j P(a_j | b_4) for the discrete distribution in Table 1.1, and
E(X_{t+1} | Y_{t+1} = y) = \int_{-\infty}^{\infty} x f(x | y)\, dx for a continuous pdf.
Notice that for the continuous pdf, the conditional expected value given y is a
function containing only y. This means that one can further evaluate more specific
conditional expectations based on given sets of y values e.g. {y: − 2 < y < 3}.
Then, E(Xt+1 | −2 < y < 3) is found via:
E(X_{t+1} | -2 < y < 3) = \int_{-\infty}^{\infty} x f(x | -2 < y < 3)\, dx
= \int_{-\infty}^{\infty} x \frac{f(x, -2 < y < 3)}{\int_{-2}^{3} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy}\, dx
= \frac{\int_{-\infty}^{\infty} x \int_{-2}^{3} f(x, y)\, dy\, dx}{\int_{-2}^{3} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy}
= \frac{\int_{-2}^{3} \int_{-\infty}^{\infty} x f(x, y)\, dx\, dy}{\int_{-2}^{3} f_Y(y)\, dy}.
The interchange of integrals in the last step in the above equation uses the Fubini
Theorem assuming some mild regularity conditions satisfied by the functions.
The variance of a continuous random variable Xt+1 is given by
var(X_{t+1}) = \sigma_X^2 = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\, dx.
Variance measures the degree of movement or the variability of the random vari-
able itself. The standard deviation (s.d.) of a random variable Xt+1 is the square
root of the variance. Standard deviation is sometimes referred to as volatility and
sometimes as “risk” in the finance literature.
The covariance between two continuous random variables Xt+1 and Yt+1 is
given by:
cov(X_{t+1}, Y_{t+1}) = \sigma_{XY} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y) f(x, y)\, dx\, dy.
The correlation coefficient between them is:
corr(X_{t+1}, Y_{t+1}) = \rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}.
One advantage of using the correlation coefficient rather than the covariance is
that the correlation coefficient is not denominated in the value units of X or Y,
but is a dimensionless ratio.
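A small sketch of computing covariance and correlation from data; the ten (x, y) pairs below are hypothetical annual changes in income and consumption, not the actual figures behind Fig. 1.3.

```python
import math

# Population-style sample moments (divide by n). Data are hypothetical
# annual changes in income (x) and consumption (y), in $ billion.
x = [280, 190, 260, 90, 350, 310, 300, 240, 150, 60]
y = [330, 180, 190, 230, 310, 320, 280, 250, -30, -60]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
rho = cov / (sx * sy)

print(round(cov, 1))  # covariance, in units of $billion squared
print(round(rho, 3))  # correlation: dimensionless, between -1 and +1
```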
It is important to understand that the correlation measures association but
not causality. In Fig. 1.3, the changes in consumption and income are clearly
strongly positively correlated. If one concludes that increasing consumption will
increase income, the resulting policy action could be disastrous. Even if one
simply concludes (based on some understanding of macroeconomic theory or by
intuition) that increased income causes increased consumption, it may still be
premature, as there are many other possibilities and qualifications. For example,
some other variable, such as general education level, could lead to increases
in both income and consumption.
Or, suppose we think of Y_{t+1} as GDP and X_{t+1} as population. Both increase
with time due to various economic and geo-political reasons. But it would be
disastrous for policy to conclude that increasing population leads to, or causes,
an increase in GDP; such a conclusion would have to assume fairly constant
employment and output per capita.
In summary, for two random variables X and Y:
E(X) = \mu_X, \quad E(Y) = \mu_Y,
var(X) = E[(X - \mu_X)^2] = E(X^2) - \mu_X^2,
var(Y) = E[(Y - \mu_Y)^2] = E(Y^2) - \mu_Y^2,
cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - \mu_X \mu_Y.
For the variance of a sum of N random variables X_1, X_2, \ldots, X_N:
var\left(\sum_{i=1}^{N} X_i\right) = E\left\{\sum_{i=1}^{N} \sum_{j=1}^{N} [X_i - E(X_i)][X_j - E(X_j)]\right\}
= \sum_{i=1}^{N} \sum_{j=1}^{N} E\{[X_i - E(X_i)][X_j - E(X_j)]\}
= \sum_{i=1}^{N} \sum_{j=1}^{N} cov(X_i, X_j).
In particular,
var(X + Y ) = cov(X + Y, X + Y )
= cov(X, X) + cov(X, Y ) + cov(Y, X) + cov(Y, Y )
= var(X) + var(Y ) + 2 cov(X, Y ).
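The identity var(X + Y) = var(X) + var(Y) + 2 cov(X, Y) can be checked exactly on sample moments (dividing by n throughout, so the identity holds to machine precision); the five data pairs are arbitrary.

```python
# Verify var(X + Y) = var(X) + var(Y) + 2 cov(X, Y) on sample moments.
x = [1.0, 2.0, 4.0, 3.0, 5.0]
y = [2.0, 1.0, 3.0, 5.0, 4.0]
n = len(x)

def mean(v):
    return sum(v) / n

def var(v):
    m = mean(v)
    return sum((t - m) ** 2 for t in v) / n

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n

s = [a + b for a, b in zip(x, y)]       # the summed variable X + Y
lhs = var(s)
rhs = var(x) + var(y) + 2 * cov(x, y)

assert abs(lhs - rhs) < 1e-12           # identity holds exactly
print(lhs, rhs)
```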
1.3. DISTRIBUTIONS
Continuous probability distributions are commonly employed in regression anal-
yses. The commonest probability distribution is the normal (Gaussian) distribu-
tion. The pdf of a normally distributed random variable X is given by
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \quad \text{for } -\infty < x < \infty,
where the mean of x is µ and the s.d. of x is σ. µ and σ are given constants.
E(X) = \int_{-\infty}^{+\infty} x f(x)\, dx = \mu
\quad \text{and} \quad
var(X) = \int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\, dx = \sigma^2.
We can write the distribution of X as X \sim N(\mu, \sigma^2), in which the arguments
indicate the mean and variance of the normal random variable. Suppose we
define a corresponding random variable
Z \equiv \frac{X - \mu}{\sigma}, \quad \text{or} \quad X = \mu + \sigma Z,
where the symbol "\equiv" means "is defined as". The second equality is interpreted
not just as equivalence in distribution, but as saying that whenever Z takes value z,
X takes value \mu + \sigma z.
[Figure: pdf of Z under N(0, 1), showing a 5% left-tail probability at a = −1.645.]
Several values of Z under N(0, 1) are commonly encountered, viz. 1.282, 1.645,
1.960, 2.330, and 2.576. These are collected in Table 1.3.

Table 1.3. Commonly Encountered Percentile Points of Z under N(0, 1).

Percentile    90%      95%      97.5%    99%      99.5%
z-value       1.282    1.645    1.960    2.330    2.576
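These percentile points can be reproduced with Python's standard library (`statistics.NormalDist`, available from Python 3.8); note that 2.330 in the text is a coarser rounding of the 99th percentile, 2.326.

```python
from statistics import NormalDist

# Recover the commonly encountered z-values as percentile points of N(0, 1).
z = NormalDist(0, 1)
for p in (0.90, 0.95, 0.975, 0.99, 0.995):
    print(p, f"{z.inv_cdf(p):.3f}")
# 0.90 -> 1.282, 0.95 -> 1.645, 0.975 -> 1.960, 0.99 -> 2.326, 0.995 -> 2.576
```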
The joint pdf of a bivariate normal pair (X, Y) with correlation coefficient \rho is:
f(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} e^{-\frac{1}{2}q},   (1.1)
where
q = \frac{1}{1-\rho^2}\left[\left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right) + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2\right].
When \rho = 0, X and Y are stochastically independent, and the joint pdf factorises as:
f(X, Y) = f_X(X) f_Y(Y).
One implication of the above is that for any function h(\cdot) of X and any function
g(\cdot) of Y, their expectation can be found as:
E[h(X) g(Y)] = E[h(X)] E[g(Y)].
Figure 1.5. Example of a Pdf with Negative Skewness and Large Kurtosis. [Figure: a density f(x) with a long left tail; its mean \mu lies to the left of 0.]
The conditional pdf of X given Y = y is f(x | y) = f(x, y)/f_Y(y), which works out to:
f(x | y) = \frac{1}{\sqrt{2\pi\sigma_X^2(1-\rho^2)}} e^{-\frac{1}{2(1-\rho^2)\sigma_X^2}\left[(x-\mu_X) - \rho\frac{\sigma_X}{\sigma_Y}(y-\mu_Y)\right]^2}
= \frac{1}{\sqrt{2\pi\sigma_{X|Y}^2}} e^{-\frac{1}{2\sigma_{X|Y}^2}(x-\mu_{X|Y})^2},
where \sigma_{X|Y}^2 = (1-\rho^2)\sigma_X^2 is the variance of X conditional on Y = y, and
\mu_{X|Y} = \mu_X + \rho\frac{\sigma_X}{\sigma_Y}(y-\mu_Y) is the mean of X conditional on Y = y.
There are some common continuous probability distributions that are related
to the normal distribution. If random variable X \sim N(\mu, \sigma^2), then random
variable V = \left(\frac{X-\mu}{\sigma}\right)^2 \sim \chi_1^2 is a chi-square distribution with 1 degree of
freedom. If X_1, X_2, X_3, \ldots, X_n are n random variables each independently
drawn from the same population distribution N(\mu, \sigma^2), or think of \{X_i\}_{i=1,\ldots,n}
as a random sample of size n, then \sum_{i=1}^{n} \left(\frac{X_i-\mu}{\sigma}\right)^2 \sim \chi_n^2 is a chi-square
distribution with n degrees of freedom.
If X \sim N(0, 1) and V \sim \chi_r^2, and both X and V are stochastically indepen-
dent, then \frac{X}{\sqrt{V/r}} \sim t_r is a Student-t distribution with r degrees of freedom. If
U \sim \chi_{r_1}^2, V \sim \chi_{r_2}^2, and both U and V are stochastically independent, then
\frac{U/r_1}{V/r_2} \sim F_{r_1, r_2}
is an F-distribution with degrees of freedom r_1 and r_2. If random variable
X \sim N(\mu, \sigma^2) and Y = \exp(X), or X = \ln(Y), then Y is a random variable
with a lognormal distribution.
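These relationships can be illustrated by simulation; the sketch below builds chi-square, Student-t, and lognormal draws from standard normal draws and checks a few known moments. The seed, sample sizes, and tolerances are arbitrary Monte Carlo choices.

```python
import math
import random

random.seed(12345)
N = 20000

def chi2(r):
    # Chi-square with r degrees of freedom: sum of r squared N(0,1) draws.
    return sum(random.gauss(0, 1) ** 2 for _ in range(r))

r = 5

# E[chi-square with r d.f.] = r.
mean_chi2 = sum(chi2(r) for _ in range(N)) / N
assert abs(mean_chi2 - r) < 0.2

# Student-t with r d.f.: Z / sqrt(V / r); symmetric about 0.
mean_t = sum(random.gauss(0, 1) / math.sqrt(chi2(r) / r) for _ in range(N)) / N
assert abs(mean_t) < 0.05

# Lognormal: Y = exp(X) with X ~ N(0, 1); E[Y] = exp(1/2).
mean_ln = sum(math.exp(random.gauss(0, 1)) for _ in range(N)) / N
assert abs(mean_ln - math.exp(0.5)) < 0.1

print(mean_chi2, mean_t, mean_ln)
```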
Suppose we draw a random sample of size n from N(\mu, \sigma^2) and form the sample
mean, denoted by
\bar{X}_n = \frac{1}{n} \sum_{k=1}^{n} X_k,
where each X_k above is clearly a random variable from N(\mu, \sigma^2) itself. \bar{X}_n is a
random variable, and its probability distribution is called the sampling distribution
of the mean or, perhaps more clearly, the distribution of the sample mean.
What is the exact probability distribution of X̄n ?
E(\bar{X}_n) = E\left(\frac{1}{n}\sum_{k=1}^{n} X_k\right) = \frac{1}{n}\sum_{k=1}^{n} E(X_k) = \frac{1}{n}\sum_{k=1}^{n} \mu = \mu.
var(\bar{X}_n) = var\left(\frac{1}{n}\sum_{k=1}^{n} X_k\right) = \frac{1}{n^2}\sum_{k=1}^{n} var(X_k) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.
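A simulation sketch of the two results E(X̄_n) = µ and var(X̄_n) = σ²/n; the population parameters, sample size, and replication count are arbitrary choices.

```python
import random

# Monte Carlo check that E(Xbar_n) = mu and var(Xbar_n) = sigma^2 / n.
# Population N(mu = 2, sigma = 3); sample size n = 25; 5000 replications.
random.seed(7)
mu, sigma, n, reps = 2.0, 3.0, 25, 5000

means = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(sample) / n)

grand_mean = sum(means) / reps
var_means = sum((m - grand_mean) ** 2 for m in means) / reps

print(round(grand_mean, 2))  # close to mu = 2
print(round(var_means, 3))   # close to sigma^2 / n = 9 / 25 = 0.36
```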
The statistic \frac{\sqrt{n}(\bar{X}_n - \mu)}{s}, where s is the sample standard deviation, is
distributed as Student-t with (n-1) degrees of freedom and zero mean. Denote
the random variable with t-distribution and n - 1 degrees of freedom as t_{n-1}. Then,
\frac{\sqrt{n}(\bar{X}_n - \mu)}{s} \sim t_{n-1}.
Suppose we find (−a, +a), a > 0, such that Prob(−a ≤ tn−1 ≤ +a) = 95%.
Since t_{n-1} is symmetrically distributed about zero, then Prob(-a \le t_{n-1}) = 97.5\%
and Prob(t_{n-1} \le +a) = 97.5\%.
[Figure: pdf of t_{n-1} showing the central 95% interval between -a and +a.]
Figure 1.7. [Pdf f(X) of t_{n-1} with critical points at -2 and +2; the power region is shaded.]
Thus, power = 1 − P(Type II error). Or, power equals the shaded area in Fig. 1.7.
Clearly, this power is a function of the alternative parameter value \mu \ne 1, so we
may determine a power function over values \mu \ne 1.
Reducing the significance level also reduces power, and vice versa. In statis-
tics, it is customary to design a test so that its power at every plausible alternative
value \mu \ne 1 in H_A equals or exceeds that of any other test with equal significance
level. If such a test is found, it is called a uniformly most powerful test.
We have seen the workings of a two-tailed test. Sometimes, we embark
instead on a one-tailed test such as H_0: \mu = 1, H_A: \mu > 1, in which we
theoretically rule out the possibility of \mu < 1, i.e. P(\mu < 1) = 0. In this case, it
makes sense to limit the critical region to only the right side, for when \mu > 1,
t_{n-1} will tend to be larger. Thus, at the one-tail 5% significance level, the
critical region under H_0 is \{t_{n-1} > 1.671\} for n = 61, where 1.671 = t_{60, 95\%} is
the 95th percentile of the t distribution with n - 1 = 60 degrees of freedom.
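A sketch of this one-tailed decision rule with the critical value t_{60, 95%} = 1.671 from the text; the sample statistics x̄ and s below are hypothetical.

```python
import math

# One-tailed test H0: mu = 1 vs HA: mu > 1 at the 5% level with n = 61.
# Critical value t_{60, 95%} = 1.671. Sample mean and s.d. are hypothetical.
n, xbar, s = 61, 1.12, 0.48
mu0, t_crit = 1.0, 1.671

t_stat = math.sqrt(n) * (xbar - mu0) / s
reject = t_stat > t_crit   # reject H0 if the statistic falls in the critical region

print(round(t_stat, 3), reject)  # 1.953 True
```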
Time series are the most prevalent in empirical studies in finance. They are
data indexed by time. Each data point is a realisation of a random variable at a
particular point in time. The data occur as a series over time. A sample of such
data is typically a collection of the realised data over time such as the history of
ABC stock’s prices on a daily basis from 1970 January 2 till 2002 December 31.
Cross-sectional data are also common in finance. An example is the reported
annual net profit of all companies listed on an exchange for a specific year. If
we collect the cross sections for each year over a 20-year period, then we have
a pooled time series cross section of companies over 20 years. Panel data are
less used in finance. They are data collected by tracking specific individuals or
subjects over time and across subjects.
The nature of data also differs according to the following categories.
(a) Quantitative,
(b) Ordinal e.g. very good, good, average, and poor, and
(c) Nominal/categorical e.g. married/not married, college graduate/non-
graduate.
Quantitative data such as return rates, prices, volume of trades, etc. have the
least limitations and therefore the greatest use in finance. These data provide not
only ordinal rankings or comparisons of magnitudes, but also exact degrees of
comparisons. There are some limitations and therefore special considerations to
the use of the other categories of data. In the treatment of ordinal and nominal
data, we may have to use specific tools such as dummy variables in regression.
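A minimal sketch of dummy-variable encoding for such data; the rating labels and the choice of base category are illustrative.

```python
# Encoding ordinal/nominal data as dummy (0/1) variables for regression.
ratings = ["very good", "good", "average", "poor", "good"]

# One dummy per category, dropping one ("poor") as the base case to avoid
# perfect collinearity with the regression intercept.
levels = ["very good", "good", "average"]
dummies = [[1 if r == lvl else 0 for lvl in levels] for r in ratings]

for r, d in zip(ratings, dummies):
    print(r, d)
# "poor" maps to [0, 0, 0]: it is absorbed into the intercept.
```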
The joint probability distribution of (U_1, U_2, U_3) below assigns probability 0.125 to each of the eight outcomes:

U_1              -1     -1     -1     -1      1      1      1      1
U_2              -2     -2      2      2     -2     -2      2      2
U_3              -3      3     -3      3     -3      3     -3      3
P(U_1,U_2,U_3)  0.125  0.125  0.125  0.125  0.125  0.125  0.125  0.125
Probability X Y Z
0.5 +1 −1 0
0.5 −1 0 +1