Chapter 1
PROBABILITY DISTRIBUTION
AND STATISTICS
1.1. PROBABILITY
Joint probability, marginal probability, and conditional probability are important
basic tools in financial valuation and regression analyses. These concepts and
their usefulness in financial data analyses will become clearer at the end of the
chapter. To motivate the idea of a joint probability distribution, let us begin by
looking at a time series plot or graph of two financial economic variables over
time: Xt and Yt , for example, S&P 500 Index aggregate price-to-earnings ratio
Xt , and S&P 500 Index return rate Yt . The values or numbers that variables Xt
and Yt will take are uncertain before they happen, i.e. before time t. At time t,
both economic variables take realised values or numbers xt and yt . xt and yt are
said to be realised jointly or simultaneously at the same time t. Thus, we can
describe their values as a joint pair (xt , yt ). If their order is preserved, it is called
an ordered pair. Note that the subscript t represents the time index.
The P/E or price-to-earnings ratio of a stock or a portfolio is a financial ratio
showing the price paid for the stock relative to the annual net income or profit per
share earned by the firm for the year. The reciprocal of the P/E ratio is called the
earnings yield. The earnings yield or E/P reflects the risky annual accounting rate
of return, R, on the stock. This is easily shown by the relationship $E = $P × R.
In other words, P/E = 1/R.
FINANCIAL VALUATION AND ECONOMETRICS
© World Scientific Publishing Co. Pte. Ltd.
http://www.worldscibooks.com/economics/7782.html
Figure 1.1. S&P 500 Index Portfolio Return Rate and Price-Earnings Ratio 1872–2009 (Data from Prof Shiller, Yale University). [Time-series plot: return rate on a −60% to 60% vertical axis and aggregate P/E ratio, plotted over years 1870–2010.]
In Fig. 1.1, it seems that low returns corresponded to, or lagged, high P/E ratios,
especially at the beginnings of the years 1929–1930, 1999–2002, and 2008–2009.
Conversely, high returns followed relatively low P/E ratios at the beginnings of
the years 1949–1954, 1975–1982, and 2006–2007. We shall explore the issue of
the predictability of stock returns in more detail in Chap. 8.
The idea that random variables move together over time, or display some
form of association, is called statistical correlation. Correlation is defined, or
has interpretative meaning, only when there exists a joint probability distribution
describing the random variables.
In Fig. 1.2, we plot the U.S. national aggregate consumption versus national
disposable income in US$ billion. Disposable income is defined as Personal
Income less personal taxes. Personal Income is National Income less corporate
taxes and corporate-retained earnings. In turn, National Income is Gross Domes-
tic Product (GDP) less depreciation and indirect business taxes such as sales tax.
GDP is essentially the total dollar output or gross income of the country. If
we include repatriations from citizens working abroad, then it becomes Gross
National Product (GNP).
In Fig. 1.2, it appears that consumption increases in disposable income. The
relationship is approximately linear. This is intuitive as on a per capita basis, we
Figure 1.2. U.S. Annual National Aggregate Consumption versus Disposable Income 1999–2009 (Data from Federal Reserve Board of U.S., in $ billion). [Scatter plot: consumption on the vertical axis, $7,200 to $9,600; disposable income on the horizontal axis, $7,000 to $11,000.]
would expect that for each person, when his or her disposable income rises, he
or she would consume more. In life-cycle models of financial economics theory,
some types of individual preferences could lead to consumption as an increasing
function of individual wealth that consists of inheritance as well as fresh income.
Sometimes, the analysis of income also breaks it down into a permanent part and
a transitory part. More on these topics can be found in economics articles on
life-cycle models and hypotheses.
In Fig. 1.3, we evaluate the annual year-to-year change in consumption and
disposable income and plot them on an X–Y graph. The point P1 refers to the
bivariate values (x1 , y1 ), where x1 is change in disposable income and y1 is
change in consumption in 2000. P2 refers to the bivariate values (x2 , y2 ), where
x2 is change in disposable income and y2 is change in consumption in 2001, and
so on. Subscripts to x and y indicate time. It may be construed as the end of a
time period and the beginning of the next time period. In this case, subscript 1
refers to time t1 , end of year 2000.
[Figure 1.3: scatter plot of change in consumption (vertical axis, −$100 to $400) against change in disposable income (horizontal axis, $40 to $360), with points labelled P1 to P10.]
The pattern in Fig. 1.3 reveals that disposable income change dropped from
t = 1 to t = 2, then rose back at t = 3. After that, there was a sharp drop at t = 4
before a wild swing back up at t = 5, and so on. The changes seem to be cyclical.
A cyclical but decreasing trend can be seen in consumption. However, what is
more interesting is that consumption and disposable income visibly increased and
decreased together. Thus, if we construe consumption as the purchases of goods
and services, then the plot displays the positive income effect on such effective
demand. Theoretically, each Xt and each Yt for every time t is a random variable.
A random variable is a variable that takes on different values each with a
given probability. It is a variable with an associated probability distribution. For
the above scatter plot, since Xt and Yt occur jointly together in (Xt , Yt ), the pair is
a bivariate random variable, and thus has a joint bivariate probability distribution.
There are two generic classes of probability distributions: discrete probability
distributions, where the random variable takes on only a countable set of possible
values, and continuous probability distributions, where the random variable takes
values on a continuum, such as an interval of the real line.
Table 1.1. Discrete Bivariate Joint Probability of Two Stock Return Rates. [The 6 × 6 table of joint probabilities P(X_{t+1} = a_j, Y_{t+1} = b_k), j, k = 1, ..., 6, is not reproduced here.]
The marginal probability of Y_{t+1} = b_3 is:
P_Y(Y_{t+1} = b_3) = \sum_{j=1}^{6} P(X_{t+1} = a_j, Y_{t+1} = b_3).
This is obviously the sum of numbers in the row involving b3 and is equal to
0.175. The marginal probability of Xt+1 = a2 is given by:
P_X(X_{t+1} = a_2) = \sum_{k=1}^{6} P(X_{t+1} = a_2, Y_{t+1} = b_k) = 0.2.
Thus, given the joint probability distribution, the marginal probability distribu-
tion of any one of the joint random variables can be found.
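As a computational sketch, the marginal distributions can be obtained by summing the joint table across rows and down columns. The 3 × 3 joint table below is hypothetical (Table 1.1 itself is a 6 × 6 grid, not reproduced here):

```python
# Marginal probabilities from a discrete joint distribution.
# Rows index values b_k of Y_{t+1}; columns index values a_j of X_{t+1}.
# The numbers are illustrative only.
joint = [
    [0.10, 0.05, 0.05],   # Y = b1
    [0.05, 0.20, 0.15],   # Y = b2
    [0.05, 0.15, 0.20],   # Y = b3
]

# Marginal of Y: sum across each row (sum over j).
p_Y = [sum(row) for row in joint]

# Marginal of X: sum down each column (sum over k).
p_X = [sum(row[j] for row in joint) for j in range(len(joint[0]))]

# A proper joint distribution sums to 1, and so does each marginal.
assert abs(sum(p_Y) - 1.0) < 1e-9
print([round(p, 3) for p in p_Y])  # [0.2, 0.4, 0.4]
print([round(p, 3) for p in p_X])  # [0.2, 0.4, 0.4]
```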
What is \sum_{j=1}^{6} \sum_{k=1}^{6} P(X_{t+1} = a_j, Y_{t+1} = b_k)? Employing the concept of
marginal probability that we just learned,
\sum_{j=1}^{6} \sum_{k=1}^{6} P(X_{t+1} = a_j, Y_{t+1} = b_k) = \sum_{j=1}^{6} P_X(X_{t+1} = a_j) = 1.
In the bivariate probability case, we know that future risk or uncertainty is char-
acterised by one and only one of the 36 pairs of values (aj , bk ) that will occur.
Suppose the event has occurred, and we know only that it is event {Xt+1 = a2 }
that occurred, but without knowing which of the events b1 , b2 , b3 , b4 , b5 , or b6 had
occurred in simultaneity. An interesting question is to ask what is the probability
that {Yt+1 = b3 } had occurred, given that we know {Xt+1 = a2 } occurred. This
is called a conditional probability and is denoted by P(Yt+1 = b3 |Xt+1 = a2 ).
The symbol “|” represents “given” or “conditional on”.
From Table 1.1, we focus on the column where it is given that {X_{t+1} = a_2}
occurred. This column is shown in Table 1.2. The highlighted 0.025 is the joint
probability of (a_2, b_3).
Table 1.2. The Column of Joint Probabilities Given X_{t+1} = a_2.

b_k     P(X_{t+1} = a_2, Y_{t+1} = b_k)
b_1     0.03
b_2     0.02
b_3     0.025
b_4     0.03
b_5     0.06
b_6     0.035
Intuitively, the higher (lower) this number, the higher (lower) is the condi-
tional probability that b_3 in fact had occurred simultaneously. Given that a_2 had
occurred, we are finding the conditional probability distribution given {X_{t+1} = a_2},
which is in itself a proper probability distribution and thus must have probabilities
that add to 1. The conditional probability must then be the relative size of 0.025
to the other joint probabilities in that column.
We recall Bayes’ rule on event sets, that:
P(A | B) = \frac{P(A \cap B)}{P(B)},
where A and B are events or event sets in a universe. We can think of the outcome
{Xt+1 = a2 } as event B and the outcome {Yt+1 = b3 } as event A. Events can be
more general, as occurrences {Xt+1 = aj }, {Yt+1 = bk }, {Xt+1 = aj , Yt+1 = bk }
are all events or event sets. More exactly,
P(b_3 | a_2) = \frac{P(a_2, b_3)}{P_X(a_2)} = \frac{0.025}{0.2} = 0.125.
In general,
P(Y_{t+1} = b_k | X_{t+1} = a_j) = \frac{P(X_{t+1} = a_j, Y_{t+1} = b_k)}{P_X(X_{t+1} = a_j)}
= \frac{P(X_{t+1} = a_j, Y_{t+1} = b_k)}{\sum_{k'=1}^{6} P(X_{t+1} = a_j, Y_{t+1} = b_{k'})}.
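A quick computational check, using the a_2 column of joint probabilities from Table 1.2: dividing each joint probability by the marginal P_X(a_2) yields the conditional distribution, and P(b_3 | a_2) = 0.125 as above.

```python
# Conditional probability P(Y = b_k | X = a_2) from the joint column in Table 1.2.
joint_a2 = [0.03, 0.02, 0.025, 0.03, 0.06, 0.035]  # P(a_2, b_k), k = 1..6

p_a2 = sum(joint_a2)                   # marginal P_X(a_2) = 0.2
cond = [p / p_a2 for p in joint_a2]    # P(b_k | a_2)

print(round(p_a2, 3))     # 0.2
print(round(cond[2], 3))  # P(b3 | a2) = 0.125

# A conditional distribution is a proper distribution: it sums to 1.
assert abs(sum(cond) - 1.0) < 1e-9
```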
When we move from discrete probability distributions, where event sets consist
of discrete elements, to continuous probability distributions, where event sets are
continuous, such as intervals on the real line, we have to deal with continuous
functions.
The continuous joint probability density function (pdf) of bivariate
(Xt+1 , Yt+1 ) is represented by a continuous function f(x, y) where Xt+1 = x and
Yt+1 = y, and x, y are usually numbers on the real line R. Note that we simplify
the notations of the realised values by dropping their time subscripts here. For a
continuous probability distribution, the events are described not as point values
e.g. x = 3, y = 4, but rather as intervals, e.g. event A = {(x, y): − 2 < y < 3}
and event B = {(x, y): 0 < x < 9.5}. Then,
P(A, B) = P(0 < x < 9.5, -2 < y < 3) = \int_{0}^{9.5} \int_{-2}^{3} f(x, y)\, dy\, dx.
The “support” for a random variable such as Xt+1 is the range of x. For joint
normal densities, the ranges are usually (−∞, ∞). Thus, Yt+1 also has the same
support. It is usually harmless to use (−∞, ∞) as supports even if the range is
finite [a, b], since the probabilities of null events (−∞, a) and (b, ∞) are zeros.
However, when more advanced mathematics is involved, it is typically better
to be precise. In addition, notice that probability is essentially an integral of a
function, whether continuous or discrete, and is area under the pdf curve.
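A minimal numerical sketch of this idea: approximating P(0 < x < 9.5, −2 < y < 3) by a midpoint-rule double integral. The joint density used is hypothetical (X and Y taken as independent standard normals), since the chapter does not specify f(x, y) here.

```python
import math

def phi(t):
    # Standard normal density.
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

def f(x, y):
    # Hypothetical joint pdf: X and Y independent standard normals.
    return phi(x) * phi(y)

def prob(x_lo, x_hi, y_lo, y_hi, n=400):
    # Midpoint-rule approximation of the double integral of f over a rectangle.
    dx = (x_hi - x_lo) / n
    dy = (y_hi - y_lo) / n
    total = 0.0
    for i in range(n):
        x = x_lo + (i + 0.5) * dx
        for j in range(n):
            y = y_lo + (j + 0.5) * dy
            total += f(x, y) * dx * dy
    return total

p = prob(0.0, 9.5, -2.0, 3.0)
print(round(p, 3))  # ~0.488, i.e. [Phi(9.5) - Phi(0)] * [Phi(3) - Phi(-2)]
```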
The marginal probability density functions of X_{t+1} and Y_{t+1} are given by:
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy
and
f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\, dx.
Notice that while f(x, y) is a function containing both x and y, fY (y) is a
function containing only y since x is integrated out. Likewise, fX (x) is a function
that contains only x.
The conditional probability density functions are:
f(x | y) = \frac{f(x, y)}{f_Y(y)} \quad \text{and} \quad f(y | x) = \frac{f(x, y)}{f_X(x)}.
1.2. EXPECTATIONS
The expected value of random variable X_{t+1} is given by
E(X_{t+1}) = \sum_{j=1}^{6} a_j P_X(a_j) = \mu_X for the discrete distribution in Table 1.1, and
E(X_{t+1}) = \int_{-\infty}^{\infty} x f_X(x)\, dx = \mu_X for a continuous pdf.
Likewise, the conditional expected value is
E(X_{t+1} | b_4) = \sum_{j=1}^{6} a_j P(a_j | b_4) for the discrete distribution in Table 1.1, and
E(X_{t+1} | Y_{t+1} = y) = \int_{-\infty}^{\infty} x f(x | y)\, dx for a continuous pdf.
Notice that for the continuous pdf, the conditional expected value given y is a
function containing only y. This means that one can further evaluate more specific
conditional expectations based on given sets of y values e.g. {y: − 2 < y < 3}.
Then, E(Xt+1 | −2 < y < 3) is found via:
E(X_{t+1} | -2 < y < 3) = \int_{-\infty}^{\infty} x f(x | -2 < y < 3)\, dx
= \int_{-\infty}^{\infty} x \frac{f(x, -2 < y < 3)}{\int_{-2}^{3} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy}\, dx
= \frac{\int_{-\infty}^{\infty} x \int_{-2}^{3} f(x, y)\, dy\, dx}{\int_{-2}^{3} \int_{-\infty}^{\infty} f(x, y)\, dx\, dy}
= \frac{\int_{-2}^{3} \int_{-\infty}^{\infty} x f(x, y)\, dx\, dy}{\int_{-2}^{3} f_Y(y)\, dy}.
The interchange of integrals in the last step in the above equation uses the Fubini
Theorem assuming some mild regularity conditions satisfied by the functions.
The variance of a continuous random variable Xt+1 is given by
var(X_{t+1}) = \sigma_X^2 = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\, dx.
Variance measures the degree of movement or the variability of the random vari-
able itself. The standard deviation (s.d.) of a random variable Xt+1 is the square
root of the variance. Standard deviation is sometimes referred to as volatility and
sometimes as “risk” in the finance literature.
The covariance between two continuous random variables Xt+1 and Yt+1 is
given by:
cov(X_{t+1}, Y_{t+1}) = \sigma_{XY} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y) f(x, y)\, dx\, dy.
The correlation coefficient between them is:
corr(X_{t+1}, Y_{t+1}) = \rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}.
One advantage of using the correlation coefficient rather than the covariance is
that the correlation coefficient is not denominated in the value units of X or Y,
but is a dimensionless ratio.
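A small sketch of computing covariance and correlation from data; the ten (x, y) pairs below are hypothetical annual changes in income and consumption, not the actual figures behind Fig. 1.3.

```python
import math

# Population-style sample moments (divide by n). Data are hypothetical
# annual changes in income (x) and consumption (y), in $ billion.
x = [280, 190, 260, 90, 350, 310, 300, 240, 150, 60]
y = [330, 180, 190, 230, 310, 320, 280, 250, -30, -60]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
rho = cov / (sx * sy)

print(round(cov, 1))  # covariance, in units of $billion squared
print(round(rho, 3))  # correlation: dimensionless, between -1 and +1
```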
It is important to understand that the correlation measures association but
not causality. In Fig. 1.3, the changes in consumption and income are clearly
strongly positively correlated. If one concludes that increasing consumption will
increase income, the resulting policy action could be disastrous. Even if one
simply concludes (based on some understanding of macroeconomic theory or by
intuition) that increased income causes increased consumption, it may still be
premature, as there are many other possibilities and qualifications. For example,
some other variable, such as general education level, could lead to increases
in both income and consumption.
Or, suppose we think of Y_{t+1} as GDP and X_{t+1} as population. Both increase
with time due to various economic and geo-political reasons. But it would be
disastrous for policy to conclude that increasing population leads to, or causes,
an increase in GDP; such a conclusion would have to assume fairly constant
employment and output per capita.
In summary, for two random variables X and Y:
E(X) = \mu_X, \quad E(Y) = \mu_Y,
var(X) = E[(X - \mu_X)^2] = E(X^2) - \mu_X^2,
var(Y) = E[(Y - \mu_Y)^2] = E(Y^2) - \mu_Y^2,
cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - \mu_X \mu_Y.
For the variance of a sum of N random variables X_1, X_2, \ldots, X_N:
var\left(\sum_{i=1}^{N} X_i\right) = E\left\{\sum_{i=1}^{N} \sum_{j=1}^{N} [X_i - E(X_i)][X_j - E(X_j)]\right\}
= \sum_{i=1}^{N} \sum_{j=1}^{N} E\{[X_i - E(X_i)][X_j - E(X_j)]\}
= \sum_{i=1}^{N} \sum_{j=1}^{N} cov(X_i, X_j).
In particular,
var(X + Y ) = cov(X + Y, X + Y )
= cov(X, X) + cov(X, Y ) + cov(Y, X) + cov(Y, Y )
= var(X) + var(Y ) + 2 cov(X, Y ).
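The identity var(X + Y) = var(X) + var(Y) + 2 cov(X, Y) can be checked exactly on sample moments (dividing by n throughout, so the identity holds to machine precision); the five data pairs are arbitrary.

```python
# Verify var(X + Y) = var(X) + var(Y) + 2 cov(X, Y) on sample moments.
x = [1.0, 2.0, 4.0, 3.0, 5.0]
y = [2.0, 1.0, 3.0, 5.0, 4.0]
n = len(x)

def mean(v):
    return sum(v) / n

def var(v):
    m = mean(v)
    return sum((t - m) ** 2 for t in v) / n

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n

s = [a + b for a, b in zip(x, y)]       # the summed variable X + Y
lhs = var(s)
rhs = var(x) + var(y) + 2 * cov(x, y)

assert abs(lhs - rhs) < 1e-12           # identity holds exactly
print(lhs, rhs)
```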
1.3. DISTRIBUTIONS
Continuous probability distributions are commonly employed in regression anal-
yses. The commonest probability distribution is the normal (Gaussian) distribu-
tion. The pdf of a normally distributed random variable X is given by
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \quad \text{for } -\infty < x < \infty,
where the mean of x is µ and the s.d. of x is σ. µ and σ are given constants.
E(X) = \int_{-\infty}^{+\infty} x f(x)\, dx = \mu
\quad \text{and} \quad
var(X) = \int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\, dx = \sigma^2.
We can write the distribution of X as X \sim N(\mu, \sigma^2), in which the arguments
indicate the mean and variance of the normal random variable. Suppose we
define a corresponding random variable
Z \equiv \frac{X - \mu}{\sigma}, \quad \text{or} \quad X = \mu + \sigma Z,
where the symbol "\equiv" means "is defined as". The second equality is interpreted
not just as equivalence in distribution, but as saying that whenever Z takes value z,
X takes value \mu + \sigma z.
[Figure: pdf of Z under N(0, 1), showing a 5% left-tail probability at a = −1.645.]
Several values of Z under N(0, 1) are commonly encountered, viz. 1.282, 1.645,
1.960, 2.330, and 2.576. These are collected in Table 1.3.

Table 1.3. Commonly Encountered Percentile Points of Z under N(0, 1).

Percentile    90%      95%      97.5%    99%      99.5%
z-value       1.282    1.645    1.960    2.330    2.576
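These percentile points can be reproduced with Python's standard library (`statistics.NormalDist`, available from Python 3.8); note that 2.330 in the text is a coarser rounding of the 99th percentile, 2.326.

```python
from statistics import NormalDist

# Recover the commonly encountered z-values as percentile points of N(0, 1).
z = NormalDist(0, 1)
for p in (0.90, 0.95, 0.975, 0.99, 0.995):
    print(p, f"{z.inv_cdf(p):.3f}")
# 0.90 -> 1.282, 0.95 -> 1.645, 0.975 -> 1.960, 0.99 -> 2.326, 0.995 -> 2.576
```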
The joint pdf of a bivariate normal pair (X, Y) with correlation coefficient \rho is:
f(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} e^{-\frac{1}{2}q},   (1.1)
where
q = \frac{1}{1-\rho^2}\left[\left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right) + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2\right].
When \rho = 0, X and Y are stochastically independent, and the joint pdf factorises as:
f(X, Y) = f_X(X) f_Y(Y).
One implication of the above is that for any function h(\cdot) of X and any function
g(\cdot) of Y, their expectation can be found as:
E[h(X) g(Y)] = E[h(X)] E[g(Y)].
Figure 1.5. Example of a Pdf with Negative Skewness and Large Kurtosis. [Figure: a density f(x) with a long left tail; its mean \mu lies to the left of 0.]
The conditional pdf of X given Y = y is f(x | y) = f(x, y)/f_Y(y), which works out to:
f(x | y) = \frac{1}{\sqrt{2\pi\sigma_X^2(1-\rho^2)}} e^{-\frac{1}{2(1-\rho^2)\sigma_X^2}\left[(x-\mu_X) - \rho\frac{\sigma_X}{\sigma_Y}(y-\mu_Y)\right]^2}
= \frac{1}{\sqrt{2\pi\sigma_{X|Y}^2}} e^{-\frac{1}{2\sigma_{X|Y}^2}(x-\mu_{X|Y})^2},
where \sigma_{X|Y}^2 = (1-\rho^2)\sigma_X^2 is the variance of X conditional on Y = y, and
\mu_{X|Y} = \mu_X + \rho\frac{\sigma_X}{\sigma_Y}(y-\mu_Y) is the mean of X conditional on Y = y.
There are some common continuous probability distributions that are related
to the normal distribution. If random variable X \sim N(\mu, \sigma^2), then random
variable V = \left(\frac{X-\mu}{\sigma}\right)^2 \sim \chi_1^2 is a chi-square distribution with 1 degree of
freedom. If X_1, X_2, X_3, \ldots, X_n are n random variables each independently
drawn from the same population distribution N(\mu, \sigma^2), or think of \{X_i\}_{i=1,\ldots,n}
as a random sample of size n, then \sum_{i=1}^{n} \left(\frac{X_i-\mu}{\sigma}\right)^2 \sim \chi_n^2 is a chi-square
distribution with n degrees of freedom.
If X \sim N(0, 1) and V \sim \chi_r^2, and both X and V are stochastically indepen-
dent, then \frac{X}{\sqrt{V/r}} \sim t_r is a Student-t distribution with r degrees of freedom. If
U \sim \chi_{r_1}^2, V \sim \chi_{r_2}^2, and both U and V are stochastically independent, then
\frac{U/r_1}{V/r_2} \sim F_{r_1, r_2}
is an F-distribution with degrees of freedom r_1 and r_2. If random variable
X \sim N(\mu, \sigma^2) and Y = \exp(X), or X = \ln(Y), then Y is a random variable
with a lognormal distribution.
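These relationships can be illustrated by simulation; the sketch below builds chi-square, Student-t, and lognormal draws from standard normal draws and checks a few known moments. The seed, sample sizes, and tolerances are arbitrary Monte Carlo choices.

```python
import math
import random

random.seed(12345)
N = 20000

def chi2(r):
    # Chi-square with r degrees of freedom: sum of r squared N(0,1) draws.
    return sum(random.gauss(0, 1) ** 2 for _ in range(r))

r = 5

# E[chi-square with r d.f.] = r.
mean_chi2 = sum(chi2(r) for _ in range(N)) / N
assert abs(mean_chi2 - r) < 0.2

# Student-t with r d.f.: Z / sqrt(V / r); symmetric about 0.
mean_t = sum(random.gauss(0, 1) / math.sqrt(chi2(r) / r) for _ in range(N)) / N
assert abs(mean_t) < 0.05

# Lognormal: Y = exp(X) with X ~ N(0, 1); E[Y] = exp(1/2).
mean_ln = sum(math.exp(random.gauss(0, 1)) for _ in range(N)) / N
assert abs(mean_ln - math.exp(0.5)) < 0.1

print(mean_chi2, mean_t, mean_ln)
```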
Suppose we draw a random sample of size n from N(\mu, \sigma^2) and form the sample
mean, denoted by
\bar{X}_n = \frac{1}{n} \sum_{k=1}^{n} X_k,
where each X_k above is clearly a random variable from N(\mu, \sigma^2) itself. \bar{X}_n is a
random variable, and its probability distribution is called the sampling distribution
of the mean or, perhaps more clearly, the distribution of the sample mean.
What is the exact probability distribution of X̄n ?
E(\bar{X}_n) = E\left(\frac{1}{n}\sum_{k=1}^{n} X_k\right) = \frac{1}{n}\sum_{k=1}^{n} E(X_k) = \frac{1}{n}\sum_{k=1}^{n} \mu = \mu.
var(\bar{X}_n) = var\left(\frac{1}{n}\sum_{k=1}^{n} X_k\right) = \frac{1}{n^2}\sum_{k=1}^{n} var(X_k) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.
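A simulation sketch of the two results E(X̄_n) = µ and var(X̄_n) = σ²/n; the population parameters, sample size, and replication count are arbitrary choices.

```python
import random

# Monte Carlo check that E(Xbar_n) = mu and var(Xbar_n) = sigma^2 / n.
# Population N(mu = 2, sigma = 3); sample size n = 25; 5000 replications.
random.seed(7)
mu, sigma, n, reps = 2.0, 3.0, 25, 5000

means = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(sample) / n)

grand_mean = sum(means) / reps
var_means = sum((m - grand_mean) ** 2 for m in means) / reps

print(round(grand_mean, 2))  # close to mu = 2
print(round(var_means, 3))   # close to sigma^2 / n = 9 / 25 = 0.36
```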
The statistic \frac{\sqrt{n}(\bar{X}_n - \mu)}{s}, where s is the sample standard deviation, is
distributed as Student-t with (n-1) degrees of freedom and zero mean. Denote
the random variable with t-distribution and n - 1 degrees of freedom as t_{n-1}. Then,
\frac{\sqrt{n}(\bar{X}_n - \mu)}{s} \sim t_{n-1}.
Suppose we find (−a, +a), a > 0, such that Prob(−a ≤ tn−1 ≤ +a) = 95%.
Since t_{n-1} is symmetrically distributed about zero, then Prob(-a \le t_{n-1}) = 97.5\%
and Prob(t_{n-1} \le +a) = 97.5\%.
[Figure: pdf of t_{n-1} showing the central 95% interval between -a and +a.]
Figure 1.7. [Pdf f(X) of t_{n-1} with critical points at -2 and +2; the power region is shaded.]
Thus, power = 1 − P(Type II error). Or, power equals the shaded area in Fig. 1.7.
Clearly, this power is a function of the alternative parameter value \mu \ne 1, so we
may determine a power function over values \mu \ne 1.
Reducing the significance level also reduces power, and vice versa. In statis-
tics, it is customary to design a test so that its power at every plausible alternative
value \mu \ne 1 in H_A equals or exceeds that of any other test with equal significance
level. If such a test is found, it is called a uniformly most powerful test.
We have seen the workings of a two-tailed test. Sometimes, we embark
instead on a one-tailed test such as H_0: \mu = 1, H_A: \mu > 1, in which we
theoretically rule out the possibility of \mu < 1, i.e. P(\mu < 1) = 0. In this case, it
makes sense to limit the critical region to only the right side, for when \mu > 1,
t_{n-1} will tend to be larger. Thus, at the one-tail 5% significance level, the
critical region under H_0 is \{t_{n-1} > 1.671\} for n = 61, where 1.671 = t_{60, 95\%} is
the 95th percentile of the t distribution with n - 1 = 60 degrees of freedom.
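A sketch of this one-tailed decision rule with the critical value t_{60, 95%} = 1.671 from the text; the sample statistics x̄ and s below are hypothetical.

```python
import math

# One-tailed test H0: mu = 1 vs HA: mu > 1 at the 5% level with n = 61.
# Critical value t_{60, 95%} = 1.671. Sample mean and s.d. are hypothetical.
n, xbar, s = 61, 1.12, 0.48
mu0, t_crit = 1.0, 1.671

t_stat = math.sqrt(n) * (xbar - mu0) / s
reject = t_stat > t_crit   # reject H0 if the statistic falls in the critical region

print(round(t_stat, 3), reject)  # 1.953 True
```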
Time series are the most prevalent in empirical studies in finance. They are
data indexed by time. Each data point is a realisation of a random variable at a
particular point in time. The data occur as a series over time. A sample of such
data is typically a collection of the realised data over time such as the history of
ABC stock’s prices on a daily basis from 1970 January 2 till 2002 December 31.
Cross-sectional data are also common in finance. An example is the reported
annual net profit of all companies listed on an exchange for a specific year. If
we collect the cross sections for each year over a 20-year period, then we have
a pooled time series cross section of companies over 20 years. Panel data are
less used in finance. They are data collected by tracking specific individuals or
subjects over time and across subjects.
The nature of data also differs according to the following categories.
(a) Quantitative,
(b) Ordinal e.g. very good, good, average, and poor, and
(c) Nominal/categorical e.g. married/not married, college graduate/non-
graduate.
Quantitative data such as return rates, prices, volume of trades, etc. have the
least limitations and therefore the greatest use in finance. These data provide not
only ordinal rankings or comparisons of magnitudes, but also exact degrees of
comparisons. There are some limitations and therefore special considerations to
the use of the other categories of data. In the treatment of ordinal and nominal
data, we may have to use specific tools such as dummy variables in regression.
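A minimal sketch of dummy-variable encoding for such data; the rating labels and the choice of base category are illustrative.

```python
# Encoding ordinal/nominal data as dummy (0/1) variables for regression.
ratings = ["very good", "good", "average", "poor", "good"]

# One dummy per category, dropping one ("poor") as the base case to avoid
# perfect collinearity with the regression intercept.
levels = ["very good", "good", "average"]
dummies = [[1 if r == lvl else 0 for lvl in levels] for r in ratings]

for r, d in zip(ratings, dummies):
    print(r, d)
# "poor" maps to [0, 0, 0]: it is absorbed into the intercept.
```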
The joint probability distribution of (U_1, U_2, U_3) below assigns probability 0.125 to each of the eight outcomes:

U_1              -1     -1     -1     -1      1      1      1      1
U_2              -2     -2      2      2     -2     -2      2      2
U_3              -3      3     -3      3     -3      3     -3      3
P(U_1,U_2,U_3)  0.125  0.125  0.125  0.125  0.125  0.125  0.125  0.125
Probability X Y Z
0.5 +1 −1 0
0.5 −1 0 +1