
Contents

Articles
Bernoulli distribution 1
Beta distribution 3
Beta function 31
Beta-binomial distribution 35
Binomial coefficient 41
Binomial distribution 59
Cauchy distribution 66
Cauchy–Schwarz inequality 74
Characteristic function (probability theory) 81
Chernoff bound 89
Chi-squared distribution 95
Computational complexity of mathematical operations 102
Conjugate prior 106
Continuous mapping theorem 112
Convergence of random variables 114
Convergent series 123
Copula (probability theory) 127
Coupon collector's problem 134
Degrees of freedom (statistics) 137
Determinant 143
Dirichlet distribution 161
Effect size 169
Erlang distribution 179
Expectation–maximization algorithm 182
Exponential distribution 191
F-distribution 201
F-test 204
Fisher information 209
Fisher's exact test 217
Gamma distribution 221
Gamma function 232
Geometric distribution 246
Hypergeometric distribution 250
Hölder's inequality 257
Inverse Gaussian distribution 264
Inverse-gamma distribution 269
Iteratively reweighted least squares 271
Kendall tau rank correlation coefficient 273
Kolmogorov–Smirnov test 277
Kronecker's lemma 280
Kullback–Leibler divergence 281
Laplace distribution 291
Laplace's equation 295
Laplace's method 301
Likelihood-ratio test 307
List of integrals of exponential functions 311
List of integrals of Gaussian functions 313
List of integrals of hyperbolic functions 315
List of integrals of logarithmic functions 317
Lists of integrals 319
Local regression 327
Log-normal distribution 331
Logrank test 339
Lévy distribution 341
Mann–Whitney U 344
Matrix calculus 350
Maximum likelihood 368
McNemar's test 379
Multicollinearity 382
Multivariate normal distribution 387
n-sphere 397
Negative binomial distribution 405
Noncentral chi-squared distribution 414
Noncentral F-distribution 419
Noncentral t-distribution 421
Norm (mathematics) 425
Normal distribution 432
Order statistic 460
Ordinary differential equation 465
Partial differential equation 475
Pearson's chi-squared test 488
Perron–Frobenius theorem 494
Poisson distribution 506
Poisson process 515
Proportional hazards models 519
Random permutation statistics 523
Rank (linear algebra) 535
Resampling (statistics) 541
Schur complement 548
Sign test 550
Singular value decomposition 551
Stein's method 566
Stirling's approximation 572
Student's t-distribution 577
Summation by parts 590
Taylor series 592
Uniform distribution (continuous) 603
Uniform distribution (discrete) 609
Weibull distribution 612
Wilcoxon signed-rank test 618
Wishart distribution 621

References
Article Sources and Contributors 626
Image Sources, Licenses and Contributors 634

Article Licenses
License 638

Bernoulli distribution
Bernoulli

Parameters    0 ≤ p ≤ 1, q = 1 − p
Support       k ∈ {0, 1}
PMF           q for k = 0; p for k = 1
CDF           0 for k < 0; q for 0 ≤ k < 1; 1 for k ≥ 1
Mean          p
Median        0 if p < 1/2; 1/2 if p = 1/2; 1 if p > 1/2
Mode          0 if p < 1/2; 0 and 1 if p = 1/2; 1 if p > 1/2
Variance      p(1 − p) = pq
Skewness      (1 − 2p)/√(pq)
Ex. kurtosis  (1 − 6pq)/(pq)
Entropy       −q ln q − p ln p
MGF           q + p e^t
CF            q + p e^(it)
PGF           q + pz
In probability theory and statistics, the Bernoulli distribution (a binomial distribution with a single trial), named after Swiss
scientist Jacob Bernoulli, is a discrete probability distribution which takes value 1 with success probability p and
value 0 with failure probability q = 1 − p. So if X is a random variable with this distribution, we have:

    Pr(X = 1) = p = 1 − Pr(X = 0) = 1 − q.

A classical example of a Bernoulli experiment is a single toss of a coin. The coin might come up heads with
probability p and tails with probability 1-p. The experiment is called fair if p=0.5, indicating the origin of the
terminology in betting (the bet is fair if both possible outcomes have the same probability).
The probability mass function f of this distribution is

    f(k; p) = p        if k = 1
    f(k; p) = 1 − p    if k = 0.

This can also be expressed as

    f(k; p) = p^k (1 − p)^(1−k)    for k ∈ {0, 1}.

The expected value of a Bernoulli random variable X is E[X] = p, and its variance is Var[X] = p(1 − p) = pq.
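As a quick illustration (not part of the original article), these formulas can be written directly in Python; the parameter value p = 0.3 is an arbitrary choice:

```python
def bernoulli_pmf(k, p):
    """f(k; p) = p^k * (1 - p)^(1 - k) for k in {0, 1}."""
    assert k in (0, 1)
    return p**k * (1 - p)**(1 - k)

def bernoulli_mean(p):
    return p            # E[X] = 0*(1-p) + 1*p = p

def bernoulli_var(p):
    return p * (1 - p)  # E[X^2] - E[X]^2 = p - p^2

p = 0.3
print(bernoulli_pmf(1, p), bernoulli_pmf(0, p))  # 0.3 and 0.7
print(bernoulli_mean(p), bernoulli_var(p))
```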



The above can be derived from the Bernoulli distribution as a special case of the binomial distribution.[1]
The excess kurtosis goes to infinity for high and low values of p, but for p = 1/2 the Bernoulli distribution has a lower
excess kurtosis than any other probability distribution, namely −2.
The Bernoulli distribution is a member of the exponential family.
The maximum likelihood estimator of p based on a random sample is the sample mean.
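The claim that the sample mean is the maximum likelihood estimator can be illustrated with a small simulation; this sketch (an illustration, not from the article) assumes an arbitrary true value p = 0.3:

```python
import random

random.seed(42)
p_true = 0.3
n = 100_000
# draw n Bernoulli(p_true) samples
sample = [1 if random.random() < p_true else 0 for _ in range(n)]
p_hat = sum(sample) / n   # the sample mean is the MLE of p
print(p_hat)              # close to 0.3
```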

Related distributions
• If X₁, …, Xₙ are independent, identically distributed (i.i.d.) random variables, all Bernoulli distributed with
success probability p, then their sum Y = X₁ + ⋯ + Xₙ is distributed Binomial(n, p) (binomial distribution). The Bernoulli
distribution is simply Binomial(1, p).
• The categorical distribution is the generalization of the Bernoulli distribution for variables with any constant
number of discrete values.
• The Beta distribution is the conjugate prior of the Bernoulli distribution.
• The geometric distribution is the number of Bernoulli trials needed to get one success.
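The first relationship (a sum of i.i.d. Bernoulli variables is binomial) can be checked empirically; in this sketch (not from the article) the values n = 5 and p = 0.4 are arbitrary:

```python
import math, random

random.seed(0)
n, p, trials = 5, 0.4, 200_000
counts = [0] * (n + 1)
for _ in range(trials):
    y = sum(1 for _ in range(n) if random.random() < p)  # sum of n Bernoulli(p) draws
    counts[y] += 1

# compare empirical frequencies with the Binomial(n, p) PMF
max_err = max(
    abs(counts[k] / trials - math.comb(n, k) * p**k * (1 - p)**(n - k))
    for k in range(n + 1)
)
print(max_err)  # small: the sum behaves like Binomial(n, p)
```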

Notes
[1] McCullagh and Nelder (1989), Section 4.2.2.

References
• McCullagh, Peter; Nelder, John (1989). Generalized Linear Models, Second Edition. Boca Raton: Chapman and
Hall/CRC. ISBN 0-412-31760-5.
• Johnson, N.L., Kotz, S., Kemp A. (1993) Univariate Discrete Distributions (2nd Edition). Wiley. ISBN
0-471-54897-9

External links
• Hazewinkel, Michiel, ed. (2001), "Binomial distribution" (http://www.encyclopediaofmath.org/index.
php?title=p/b016420), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
• Weisstein, Eric W., " Bernoulli Distribution (http://mathworld.wolfram.com/BernoulliDistribution.html)"
from MathWorld.

Beta distribution
Beta

(Figure: Probability density function)
(Figure: Cumulative distribution function)

Parameters    α > 0 shape (real)
              β > 0 shape (real)
Support       x ∈ [0, 1]
PDF           x^(α−1) (1 − x)^(β−1) / B(α, β)
CDF           I_x(α, β) (the regularized incomplete beta function)
Mean          E[X] = α/(α + β)
              E[ln X] = ψ(α) − ψ(α + β) (see digamma function)
Median        no general closed form, see text
Mode          (α − 1)/(α + β − 2) for α > 1, β > 1
Variance      αβ/((α + β)²(α + β + 1))
              var[ln X] = ψ₁(α) − ψ₁(α + β) (see polygamma function)
Skewness      2(β − α)√(α + β + 1) / ((α + β + 2)√(αβ))
Ex. kurtosis  6[(α − β)²(α + β + 1) − αβ(α + β + 2)] / (αβ(α + β + 2)(α + β + 3))
Entropy       see text
MGF           1 + Σ_{k=1}^∞ (∏_{r=0}^{k−1} (α + r)/(α + β + r)) t^k/k!
CF            ₁F₁(α; α + β; it) (see Confluent hypergeometric function)

In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined
on the interval [0, 1] parametrized by two positive shape parameters, denoted by α and β. The beta distribution has
been applied to model the behavior of random variables limited to intervals of finite length. It has been used in
population genetics for a statistical description of the allele frequencies in the components of a sub-divided
population. It has also been used extensively in PERT, critical path method (CPM) and other project management /
control systems to describe the statistical distributions of the time to completion and the cost of a task. It has also
been applied in acoustic analysis to assess damage to gears, as the kurtosis of the beta distribution has been reported
as a good indicator of the condition of gears[1]. It has also been used to model sunshine data for application to solar
renewable energy utilization[2]. It has also been used for parametrizing variability of soil properties at the regional
level for crop yield estimation, modeling crop response over the area of the association[3]. It has also been used to
determine well-log shale parameters, to describe the proportions of the mineralogical components existing in a
certain stratigraphic interval[4]. It is used extensively in Bayesian inference, since beta distributions provide a family
of conjugate prior distributions for binomial and geometric distributions. For example, the beta distribution can be
used in Bayesian analysis to describe initial knowledge concerning probability of success such as the probability that
a space vehicle will successfully complete a specified mission. The beta distribution is a suitable model for the
random behavior of percentages. It can be suited to the statistical modelling of proportions in applications where
values of proportions equal to 0 or 1 do not occur. One theoretical case where the beta distribution arises is as the
distribution of the ratio formed by one random variable having a Gamma distribution divided by the sum of it and
another independent random variable also having a Gamma distribution with the same scale parameter (but possibly
different shape parameter).
The usual formulation of the beta distribution is also known as the beta distribution of the first kind, whereas beta
distribution of the second kind is an alternative name for the beta prime distribution.

Characterization

Probability density function


The probability density function of the beta distribution, for 0 ≤ x ≤ 1, and shape parameters α > 0 and β > 0, is:

    f(x; α, β) = x^(α−1) (1 − x)^(β−1) / B(α, β)
               = (Γ(α + β) / (Γ(α) Γ(β))) x^(α−1) (1 − x)^(β−1)

where Γ(z) is the gamma function. The beta function, B, appears as a normalization constant to ensure that the total
probability integrates to unity.
This definition includes both ends x = 0 and x = 1, which is consistent with definitions for other continuous
distributions supported on a bounded interval which are special cases of the beta distribution, for example the arcsine
distribution, and consistent with several authors, such as N.L. Johnson and S. Kotz[5][6][7][8]. Note, however, that
several other authors, including W. Feller[9][10][11], choose to exclude the ends x = 0 and x = 1 (such that the two
ends are not actually part of the density function) and consider instead 0 < x < 1.
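As a sanity check on the definition, the density can be integrated numerically; this Python sketch (an illustration, not from the article; the parameters α = 2.5, β = 1.5 are arbitrary) uses the gamma-function form of B(α, β):

```python
import math

def beta_pdf(x, a, b):
    """x^(a-1) (1-x)^(b-1) / B(a, b), with B(a, b) = Γ(a)Γ(b)/Γ(a+b)."""
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

# midpoint-rule check that the density integrates to one
a, b, m = 2.5, 1.5, 200_000
total = sum(beta_pdf((i + 0.5) / m, a, b) for i in range(m)) / m
print(total)  # close to 1.0
```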

Several authors, including N.L.Johnson and S.Kotz[5], use the nomenclature p instead of α and q instead of β for the
shape parameters of the beta distribution, reminiscent of the nomenclature traditionally used for the parameters of the
Bernoulli distribution, because the beta distribution approaches the Bernoulli distribution in the limit as both shape
parameters α and β approach the value of zero.
In the following, that a random variable X is Beta-distributed with parameters α and β will be denoted by:

    X ~ Beta(α, β)
Cumulative distribution function


The cumulative distribution function is

    F(x; α, β) = B(x; α, β) / B(α, β) = I_x(α, β)

where B(x; α, β) is the incomplete beta function and I_x(α, β) is the regularized incomplete beta function.

Properties

Mode
The mode of a Beta distributed random variable X with both parameters α and β greater than one is:[5]

    mode = (α − 1) / (α + β − 2).

When both parameters are less than one (α < 1 and β < 1), this is the anti-mode: the lowest point of the probability
density curve[7]. Letting α = β in the above expression one obtains mode = 1/2, showing that for α = β
the mode (in the case α > 1 and β > 1), or the anti-mode (in the case α < 1 and β < 1), is at the center of the
distribution: it is symmetric in those cases. See the "Shapes" section in this article for a full list of mode cases, for
arbitrary values of α and β. For several of these cases, the maximum value of the density function occurs at one or
both ends. In some cases the (maximum) value of the density function occurring at the end is finite, for example in
the case of α = 2, β = 1 (or α = 1, β = 2), the right-triangle distribution, while in several other cases there is a singularity at
the end, where the value of the density function approaches infinity, for example in the case α = β = 1/2,
the arcsine distribution. The choice whether to include the ends x = 0 and x = 1 as part of the density
function, whether a singularity can be considered a mode, and whether cases with two maxima are to be
considered bimodal, is responsible for some authors considering these maximum values at the ends of the density
curve to be modes[12] or not[10].
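The mode formula can be confirmed by brute-force maximization of the density; in this illustrative sketch (not from the article) the parameters α = 2, β = 5 are arbitrary:

```python
import math

def beta_pdf(x, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

a, b = 2.0, 5.0
mode_formula = (a - 1) / (a + b - 2)     # (α − 1)/(α + β − 2) = 0.2 here
m = 100_000
# grid search for the maximizer of the density on (0, 1)
mode_grid = max((i / m for i in range(1, m)), key=lambda x: beta_pdf(x, a, b))
print(mode_formula, mode_grid)
```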
Beta distribution 6

Median
The median of the beta distribution is the unique real number x for which the regularized incomplete beta function
I_x(α, β) = 1/2. There is no general closed-form expression for the median of the beta distribution for arbitrary
values of α and β. Closed-form expressions for particular values of the parameters α and β follow:
• For symmetric cases α = β, median = 1/2.
• For α = 1 and β > 0, median = 1 − 2^(−1/β) (this case is the mirror-image of the power function [0,1] distribution)
• For α > 0 and β = 1, median = 2^(−1/α) (this case is the power function [0,1] distribution[10])
• For α = 3 and β = 2, median ≈ 0.6142, the real [0,1] solution to the quartic equation 1 − 8x³ + 6x⁴ = 0
• For α = 2 and β = 3, median ≈ 0.3857, the mirror image of the previous case

(Figure: Mode for Beta distribution for 1 ≤ α ≤ 5 and 1 ≤ β ≤ 5)

A reasonable approximation of the value of the median of the beta distribution, for both α and β greater or
equal to one, is given by the formula[13]

    median ≈ (α − 1/3) / (α + β − 2/3)   for α, β ≥ 1.

For α ≥ 1 and β ≥ 1, the relative error (the absolute error divided by the median) in this approximation is less
than 4%, and for both α ≥ 2 and β ≥ 2 it is less than 1%. The absolute error divided by the difference between the
mean and the mode is similarly small.

(Figure: Median for Beta distribution for 0 ≤ α ≤ 5 and 0 ≤ β ≤ 5)
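The quoted error bounds can be spot-checked numerically; this sketch (an illustration, not from the article) compares the approximation (α − 1/3)/(α + β − 2/3) with a median obtained by bisection on a numerically integrated CDF, for the arbitrary choice α = 2, β = 3:

```python
import math

def beta_pdf(x, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

def beta_cdf(x, a, b, m=20_000):
    # midpoint-rule approximation of the regularized incomplete beta function
    return sum(beta_pdf((i + 0.5) * x / m, a, b) for i in range(m)) * x / m

def numeric_median(a, b):
    lo, hi = 0.0, 1.0
    for _ in range(40):                  # bisection on the CDF
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

a, b = 2.0, 3.0
approx = (a - 1/3) / (a + b - 2/3)       # median approximation
exact = numeric_median(a, b)
rel_err = abs(approx - exact) / exact
print(approx, exact, rel_err)            # relative error well under 1% here
```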



Mean
The expected value (mean) μ of a Beta distribution random variable X with parameters α and β is:[5]

    μ = E[X] = α / (α + β).

Letting α = β in the above expression one obtains μ = 1/2, showing that for α = β the mean is at the center of
the distribution: it is symmetric. Also, the following limits can be obtained from the above expression:

    lim_{β→0} μ = 1,   lim_{α→∞} μ = 1,   lim_{α→0} μ = 0,   lim_{β→∞} μ = 0.

Therefore, for β → 0, or for α → ∞, the mean is located at the right end, x = 1. For these limit ratios, the beta
distribution becomes a 1-point degenerate distribution with a Dirac delta function spike at the right end, x = 1, with
probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at
the right end, x = 1.

Similarly, for α → 0, or for β → ∞, the mean is located at the left end, x = 0. The beta distribution becomes a
1-point degenerate distribution with a Dirac delta function spike at the left end, x = 0, with probability 1, and zero
probability everywhere else. There is 100% probability (absolute certainty) concentrated at the left end, x = 0.

Variance
The variance (the second moment centered around the mean) of a Beta distribution random variable X with
parameters α and β is:[5]

    var(X) = αβ / ((α + β)² (α + β + 1)).

(Figure: Mean for Beta distribution for 0 ≤ α ≤ 5 and 0 ≤ β ≤ 5)

Letting α = β in the above expression one obtains

    var(X) = 1 / (4(2β + 1)),

showing that for α = β the variance decreases monotonically as α = β increases. Setting α = β = 0 in this
expression, one finds the maximum variance var(X) = 1/4,[5] which only occurs approaching the limit, at α = β = 0.
The beta distribution may also be parametrized in terms of its mean μ (0 < μ < 1) and sample size ν = α + β (ν > 0)
(see section below titled "Mean and sample size"):

    α = μν,   β = (1 − μ)ν.

Using this parametrization, one can express the variance in terms of the mean μ and the sample size ν as follows:

    var(X) = μ(1 − μ) / (1 + ν).

Since ν > 0, it must follow that

    var(X) < μ(1 − μ).

For a symmetric distribution, the mean is at the middle of the distribution, μ = 1/2, and therefore:

    var(X) = 1 / (4(1 + ν)).

Also, the following limits (with only the noted variable approaching the limit) can be obtained from the above
expressions:

    lim_{ν→0} var(X) = μ(1 − μ),   lim_{ν→∞} var(X) = 0.
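The equivalence of the (α, β) and (μ, ν) expressions for the variance, var = μ(1 − μ)/(1 + ν) with α = μν and β = (1 − μ)ν, can be verified directly; the parameter pairs below are arbitrary (a sketch, not from the article):

```python
def var_from_alpha_beta(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

def var_from_mean_nu(mu, nu):
    return mu * (1 - mu) / (1 + nu)   # var = μ(1 − μ)/(1 + ν)

pairs = [(0.5, 0.5), (2.0, 3.0), (5.0, 1.0)]
max_diff = max(
    abs(var_from_alpha_beta(a, b) - var_from_mean_nu(a / (a + b), a + b))
    for a, b in pairs
)
print(max_diff)  # essentially zero: the two parametrizations agree
```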

Skewness
The skewness (the third moment centered around the mean, normalized by the 3/2 power of the variance) of
the beta distribution is[5]

    γ₁ = 2(β − α)√(α + β + 1) / ((α + β + 2)√(αβ)).

(Figure: Skewness for Beta Distribution as a function of variance and mean)

Letting α = β in the above expression one obtains γ₁ = 0, showing once again that for α = β the distribution
is symmetric and hence the skewness is zero. Positive skew (right-tailed) for α < β, negative skew (left-tailed) for α
> β.
Using the parametrization in terms of mean μ and sample size ν = α + β:

    α = μν,   β = (1 − μ)ν,

one can express the skewness in terms of the mean μ and the sample size ν as follows:

    γ₁ = 2(1 − 2μ)√(1 + ν) / ((2 + ν)√(μ(1 − μ))).

The skewness can also be expressed just in terms of the variance var and the mean μ as follows:

    γ₁ = 2(1 − 2μ)√var / (μ(1 − μ) + var).

The accompanying plot of skewness as a function of variance and mean shows that maximum variance (1/4) is
coupled with zero skewness and the symmetry condition (μ = 1/2), and that maximum skewness (positive or negative
infinity) occurs when the mean is located at one end or the other, so that the "mass" of the probability
distribution is concentrated at the ends (minimum variance).
For the symmetric case (α = β), skewness = 0 over the whole range, and the following limits apply:

    lim_{α=β→0} γ₁ = lim_{α=β→∞} γ₁ = 0.

For the unsymmetric cases (α ≠ β) the following limits (with only the noted variable approaching the limit) can be
obtained from the above expressions:

    lim_{α→0} γ₁ = +∞ (finite β),   lim_{β→0} γ₁ = −∞ (finite α),
    lim_{α→∞} γ₁ = −2/√β (finite β),   lim_{β→∞} γ₁ = 2/√α (finite α).
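The skewness formula can be cross-checked against numerically computed central moments; this sketch (not from the article) uses the arbitrary parameters α = 2, β = 5 and a midpoint-rule integration:

```python
import math

def beta_pdf(x, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

def skewness_formula(a, b):
    return 2 * (b - a) * math.sqrt(a + b + 1) / ((a + b + 2) * math.sqrt(a * b))

a, b = 2.0, 5.0
m = 200_000
xs = [(i + 0.5) / m for i in range(m)]
ps = [beta_pdf(x, a, b) for x in xs]            # density at midpoints
mean = sum(x * p for x, p in zip(xs, ps)) / m
mu2 = sum((x - mean)**2 * p for x, p in zip(xs, ps)) / m   # variance
mu3 = sum((x - mean)**3 * p for x, p in zip(xs, ps)) / m   # third central moment
skew_numeric = mu3 / mu2**1.5
print(skewness_formula(a, b), skew_numeric)
```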

Kurtosis
The beta distribution has been applied in acoustic analysis to assess damage to gears, as the kurtosis of the beta
distribution has been reported to be a good indicator of the condition of a gear[1]. Kurtosis has also been used to
distinguish the seismic signal generated by a person's footsteps from other signals. As persons or other
targets moving on the ground generate continuous signals in the form of seismic waves, one can separate different
targets based on the seismic waves they generate. Kurtosis is sensitive to impulsive signals, so it is much more
sensitive to the signal generated by human footsteps than to other signals generated by vehicles, winds, noise, etc.[14]

(Figure: Excess Kurtosis for Beta Distribution as a function of variance and mean)

Unfortunately, the notation for kurtosis has not been standardized. Kenney and Keeping[15] use the symbol γ₂ for
the excess kurtosis, but Abramowitz and Stegun[16] use different terminology. To prevent confusion[17] between
kurtosis (the fourth moment centered around the mean, normalized by the square of the variance) and excess kurtosis,
when using symbols, they will be spelled out as follows[11][10]:

    excess kurtosis = kurtosis − 3
                    = 6[(α − β)²(α + β + 1) − αβ(α + β + 2)] / (αβ(α + β + 2)(α + β + 3)).

Letting α = β in the above expression one obtains

    excess kurtosis = −6 / (3 + 2α).

Therefore for symmetric beta distributions, the excess kurtosis is negative, increasing from a minimum value of −2
at the limit as {α = β} → 0, and approaching a maximum value of zero as {α = β} → ∞. The value of −2 is the
minimum value of excess kurtosis that any distribution (not just beta distributions, but any distribution of any
possible kind) can ever achieve. This minimum value is reached when all the probability density is entirely
concentrated at each end x = 0 and x = 1, with nothing in between: a 2-point Bernoulli distribution with equal
probability 1/2 at each end (a coin toss: see section below "Kurtosis bounded by the square of the skewness" for
further discussion). The description of kurtosis as a measure of the "peakedness" (or "heavy tails") of the probability
distribution is strictly applicable to unimodal distributions (for example the normal distribution). However, for more
general distributions, like the beta distribution, a more general description of kurtosis is that it is a measure of the
proportion of the mass density near the mean. The higher the proportion of mass density near the mean, the higher
the kurtosis, while the higher the mass density away from the mean, the lower the kurtosis. For α ≠ β, skewed beta
distributions, the excess kurtosis can reach unlimited positive values (particularly for α → 0 for finite β, or for β → 0
for finite α) because all the mass density is concentrated at the mean when the mean coincides with one of the ends.
Minimum kurtosis takes place when the mass density is concentrated equally at each end (and therefore the mean is
at the center), and there is no probability mass density in between the ends.
Using the parametrization in terms of mean μ and sample size ν = α + β:

    α = μν,   β = (1 − μ)ν,

one can express the excess kurtosis in terms of the mean μ and the sample size ν as follows:

    excess kurtosis = (6 / (ν + 3)) [ (1 − 2μ)²(ν + 1) / (μ(1 − μ)(ν + 2)) − 1 ].

The excess kurtosis can also be expressed in terms of just the following two parameters: the variance var, and the
sample size ν as follows:

    excess kurtosis = 6 [1 − var(5ν + 6)] / (var(ν + 2)(ν + 3)),

and, in terms of the variance var and the mean μ as follows:

    excess kurtosis = 6 var [1 − var − 5μ(1 − μ)] / ((μ(1 − μ) + var)(μ(1 − μ) + 2 var)).
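The equivalence of the excess-kurtosis parametrizations, using var = μ(1 − μ)/(1 + ν) with ν = α + β, can be verified numerically; this sketch (not from the article) also confirms the uniform-distribution value −6/5:

```python
def ek_alpha_beta(a, b):
    # excess kurtosis in terms of α and β
    return (6 * ((a - b)**2 * (a + b + 1) - a * b * (a + b + 2))
            / (a * b * (a + b + 2) * (a + b + 3)))

def ek_var_nu(var, nu):
    # the same quantity in terms of the variance and ν = α + β
    return 6 * (1 - var * (5 * nu + 6)) / (var * (nu + 2) * (nu + 3))

max_diff = 0.0
for a, b in [(1.0, 1.0), (2.0, 5.0), (0.5, 0.5)]:   # arbitrary parameter pairs
    nu = a + b
    var = a * b / (nu**2 * (nu + 1))
    max_diff = max(max_diff, abs(ek_alpha_beta(a, b) - ek_var_nu(var, nu)))
print(ek_alpha_beta(1.0, 1.0), max_diff)  # -1.2 for the uniform case; diff ~0
```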



The plot of excess kurtosis as a function of the variance and the mean shows that the minimum value of the excess
kurtosis (- 2, which is the minimum possible value for excess kurtosis for any distribution) is intimately coupled with
the maximum value of variance (1/4) and the symmetry condition: the mean occurring at the midpoint (μ = 1/2). This
occurs for the symmetric case of α = β = 0, with zero skewness. At the limit, this is the 2 point Bernoulli distribution
with equal probability 1/2 at each Dirac delta function end x = 0 and x = 1 and zero probability everywhere else. (A
coin toss: one face of the coin being x = 0 and the other face being x = 1.) Variance is maximum because the
distribution is bimodal with nothing in between the two modes (spikes) at each end. Excess kurtosis is minimum: the
probability density "mass" is zero at the mean and it is concentrated at the two peaks at each end. Excess kurtosis
reaches the minimum possible value (for any distribution) when the probability density function has two spikes at
each end: it is bi-"peaky" with nothing in between them.
On the other hand, the plot shows that for extreme skewed cases, where the mean is located near one or the other end
(μ = 0 or μ = 1), the variance is close to zero, and the excess kurtosis rapidly approaches infinity when the mean of
the distribution approaches either end.
Alternatively, the excess kurtosis can also be expressed in terms of just the following two parameters: the square of
the skewness, and the sample size ν as follows:

    excess kurtosis = (6 / (3 + ν)) ((2 + ν)/4 · skewness² − 1).

From this last expression, one can obtain the same limits published practically a century ago by Karl Pearson in his
paper[18], for the beta distribution (see section below titled "Kurtosis bounded by the square of the skewness").
Setting α + β = ν = 0 in the above expression, one obtains Pearson's lower boundary (values for the skewness and
excess kurtosis below the boundary (excess kurtosis + 2 − skewness² = 0) cannot occur for any distribution, and
hence Karl Pearson appropriately called the region below this boundary the "impossible region"). The limit of α + β =
ν → ∞ determines Pearson's upper boundary:

    lim_{ν→0} excess kurtosis = skewness² − 2,
    lim_{ν→∞} excess kurtosis = (3/2) skewness²,

therefore:

    skewness² − 2 < excess kurtosis < (3/2) skewness².

Values of ν = α + β such that ν ranges from zero to infinity, 0 < ν < ∞, span the whole region of the beta distribution
in the plane of excess kurtosis versus squared skewness.
For the symmetric case (α = β), the following limits apply:

    lim_{α=β→0} excess kurtosis = −2,   lim_{α=β→∞} excess kurtosis = 0.

For the unsymmetric cases (α ≠ β) the following limits (with only the noted variable approaching the limit) can be
obtained from the above expressions:

    lim_{α→0} excess kurtosis = +∞ (finite β),   lim_{β→0} excess kurtosis = +∞ (finite α),
    lim_{α→∞} excess kurtosis = 6/β (finite β),   lim_{β→∞} excess kurtosis = 6/α (finite α).
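These bounds can be spot-checked by evaluating the closed-form expressions for the squared skewness and the excess kurtosis; the parameter pairs in this sketch (not from the article) are arbitrary:

```python
def skew2(a, b):
    # squared skewness of Beta(a, b)
    return 4 * (b - a)**2 * (a + b + 1) / ((a + b + 2)**2 * a * b)

def excess_kurtosis(a, b):
    return (6 * ((a - b)**2 * (a + b + 1) - a * b * (a + b + 2))
            / (a * b * (a + b + 2) * (a + b + 3)))

# skewness^2 - 2 < excess kurtosis < (3/2) skewness^2 for every beta distribution
inside = all(
    skew2(a, b) - 2 < excess_kurtosis(a, b) < 1.5 * skew2(a, b)
    for a, b in [(0.01, 0.01), (0.5, 2.0), (3.0, 3.0), (0.1, 100.0)]
)
print(inside)  # True
```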

Characteristic function
The characteristic function is the Fourier transform of the probability density function. The characteristic
function of the beta distribution is Kummer's confluent hypergeometric function (of the first kind)[5][16][19]:

    φ_X(t) = E[e^(itX)] = ₁F₁(α; α + β; it) = Σ_{k=0}^∞ ((α)_k / (α + β)_k) (it)^k / k!

(Figure: Re(characteristic function) symmetric case α = β ranging from 25 to 0)
(Figure: Re(characteristic function) symmetric case α = β ranging from 0 to 25)
(Figure: Re(characteristic function) β = α + 1/2; α ranging from 25 to 0)
(Figure: Re(characteristic function) α = β + 1/2; β ranging from 25 to 0)
(Figure: Re(characteristic function) α = β + 1/2; β ranging from 0 to 25)

where

    (x)_k = x(x + 1)(x + 2) ⋯ (x + k − 1)

is the rising factorial, also called the "Pochhammer symbol". The value of the characteristic function for t = 0 is one:

    φ_X(0) = ₁F₁(α; α + β; 0) = 1.

Also, the real and imaginary parts of the characteristic function enjoy the following symmetries with respect to the
origin of variable t:

    Re[₁F₁(α; α + β; it)] = Re[₁F₁(α; α + β; −it)]
    Im[₁F₁(α; α + β; it)] = −Im[₁F₁(α; α + β; −it)]

The symmetric case α = β simplifies the characteristic function of the beta distribution to a Bessel function, since in
that special case the confluent hypergeometric function (of the first kind) reduces to a Bessel function
(the modified Bessel function of the first kind I_(α − 1/2)) using Kummer's second transformation as follows:

    ₁F₁(α; 2α; it) = e^(it/2) ₀F₁(; α + 1/2; −t²/16)
                   = e^(it/2) (it/4)^(1/2 − α) Γ(α + 1/2) I_(α − 1/2)(it/2).

In the accompanying plots, the real part (Re) of the characteristic function of the beta distribution is displayed for
symmetric (α = β) and skewed (α ≠ β) cases.

Moment generating function
It also follows[5][10] that the moment generating function is

    M_X(t) = E[e^(tX)] = 1 + Σ_{k=1}^∞ (∏_{r=0}^{k−1} (α + r)/(α + β + r)) t^k/k!

Higher moments
Using the moment generating function, the k-th raw moment is given by[5] the factor

    ∏_{r=0}^{k−1} (α + r)/(α + β + r)

multiplying the (exponential series) term t^k/k! in the series of the moment generating function:

    E[X^k] = (α)_k / (α + β)_k = ∏_{r=0}^{k−1} (α + r)/(α + β + r)

where (x)_k is a Pochhammer symbol representing rising factorial. It can also be written in a recursive form as

    E[X^k] = ((α + k − 1)/(α + β + k − 1)) E[X^(k−1)].
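The recursion E[X^k] = E[X^(k−1)] (α + k − 1)/(α + β + k − 1) can be used to recover the mean and variance; this sketch (not from the article) uses the arbitrary parameters α = 2, β = 3:

```python
def raw_moments(a, b, kmax):
    # E[X^k] = E[X^(k-1)] * (a + k - 1)/(a + b + k - 1), with E[X^0] = 1
    ms = [1.0]
    for k in range(1, kmax + 1):
        ms.append(ms[-1] * (a + k - 1) / (a + b + k - 1))
    return ms

a, b = 2.0, 3.0
ms = raw_moments(a, b, 2)
mean = ms[1]              # alpha/(alpha + beta) = 0.4
var = ms[2] - ms[1]**2    # alpha*beta/((alpha+beta)^2 (alpha+beta+1)) = 0.04
print(mean, var)
```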

Moments of transformed random variables

One can also show the following expectations for a transformed random variable[5]

    E[1 − X] = β / (α + β),   E[X(1 − X)] = αβ / ((α + β)(α + β + 1)).

The following transformation by inversion of the random variable, X/(1 − X), gives the expected value of the inverted
beta distribution or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI):[5]

    E[X/(1 − X)] = α / (β − 1)   for β > 1,
    E[(1 − X)/X] = β / (α − 1)   for α > 1.

Expected values for logarithmic transformations (that may be useful for maximum likelihood estimates, for
example):

    E[ln X] = ψ(α) − ψ(α + β),   E[ln(1 − X)] = ψ(β) − ψ(α + β).

Higher order logarithmic moments can be expressed in terms of higher order poly-gamma functions:

    E[(ln X)²] = (ψ(α) − ψ(α + β))² + ψ₁(α) − ψ₁(α + β),
    E[(ln(1 − X))²] = (ψ(β) − ψ(α + β))² + ψ₁(β) − ψ₁(α + β),

therefore the variance of a logarithmic variable is:

    var[ln X] = ψ₁(α) − ψ₁(α + β),

also

    var[ln(1 − X)] = ψ₁(β) − ψ₁(α + β),

and therefore the covariance of ln X and ln(1 − X) is:

    cov[ln X, ln(1 − X)] = −ψ₁(α + β).

These identities can be derived by using the representation of a Beta distribution as a proportion of two Gamma
distributions and differentiating through the integral.
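The logarithmic expectation E[ln X] = ψ(α) − ψ(α + β) can be checked numerically; this sketch (not from the article) evaluates the digamma function at integer arguments via harmonic numbers, ψ(n) = H₍n−1₎ − γ, for the arbitrary choice α = 2, β = 3:

```python
import math

EULER_GAMMA = 0.5772156649015329

def psi_int(n):
    # digamma at a positive integer: psi(n) = H_{n-1} - gamma
    return sum(1 / r for r in range(1, n)) - EULER_GAMMA

def beta_pdf(x, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

a, b = 2, 3
formula = psi_int(a) - psi_int(a + b)    # E[ln X] = psi(2) - psi(5) = -13/12
m = 400_000
numeric = sum(math.log((i + 0.5) / m) * beta_pdf((i + 0.5) / m, a, b)
              for i in range(m)) / m     # midpoint-rule integral of ln(x) f(x)
print(formula, numeric)
```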

Quantities of information (entropy)


Given a beta distributed random variable, X ~ Beta(α, β), the differential entropy of X is[20] (measured in nats) the
expected value of the negative of the logarithm of the probability density function:

    h(X) = E[−ln f(x; α, β)]
         = ln B(α, β) − (α − 1)ψ(α) − (β − 1)ψ(β) + (α + β − 2)ψ(α + β)

where f(x; α, β) is the probability density function of the beta distribution:

    f(x; α, β) = x^(α−1) (1 − x)^(β−1) / B(α, β).

The digamma function ψ appears in the formula for the differential entropy as a consequence of Euler's integral
formula for the harmonic numbers which follows from the integral:

    ∫₀¹ (1 − x^(n−1)) / (1 − x) dx = ψ(n) + γ

where γ is the Euler–Mascheroni constant.


The differential entropy of the beta distribution is negative for all values of α and β greater than zero, except at
α = β = 1 (for which values the beta distribution is the same as the uniform distribution), where the differential
entropy reaches its maximum value of zero. It is to be expected that the maximum entropy should take place when
the beta distribution becomes equal to the uniform distribution, since uncertainty is maximal when all possible events
are equiprobable.
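The stated entropy behavior can be checked with the standard closed-form expression h(X) = ln B(α, β) − (α − 1)ψ(α) − (β − 1)ψ(β) + (α + β − 2)ψ(α + β); this sketch (not from the article) restricts itself to integer parameters so that ψ can be computed from harmonic numbers:

```python
import math

EULER_GAMMA = 0.5772156649015329

def psi_int(n):
    # digamma at a positive integer: psi(n) = H_{n-1} - gamma
    return sum(1 / r for r in range(1, n)) - EULER_GAMMA

def beta_entropy(a, b):
    # h = ln B(a,b) - (a-1)psi(a) - (b-1)psi(b) + (a+b-2)psi(a+b)
    lnB = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (lnB - (a - 1) * psi_int(a) - (b - 1) * psi_int(b)
            + (a + b - 2) * psi_int(a + b))

print(beta_entropy(1, 1))   # uniform case: the maximum value, 0
print(beta_entropy(2, 2))   # negative, as for every non-uniform beta
```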
For α or β approaching zero, the differential entropy approaches its minimum value of negative infinity. For (either
or both) α or β approaching zero, there is a maximum amount of order: all the probability density is concentrated at
the ends, and there is zero probability density at points located between the ends. Similarly for (either or both) α or β
approaching infinity, the differential entropy approaches its minimum value of negative infinity, and a maximum
amount of order. If either α or β approaches infinity (and the other is finite) all the probability density is concentrated
at an end, and the probability density is zero everywhere else. If both shape parameters are equal (the symmetric
case), α = β, and they approach infinity simultaneously, the probability density becomes a spike (Dirac delta
function) concentrated at the middle x = 1/2, and hence there is 100% probability at the middle x = 1/2 and zero
probability everywhere else.

The (continuous case) differential entropy was introduced by Shannon in his original paper (where he named it the
"entropy of a continuous distribution"), as the concluding part[21] of the same paper where he defined the discrete
entropy. It is known since then that the differential entropy may differ from the infinitesimal limit of the discrete
entropy by an infinite offset, therefore the differential entropy can be negative (as it is for the beta distribution).
What really matters is the relative value of entropy.
Given two beta distributed random variables, X ~ Beta(α, β) and Y ~ Beta(α', β'), the cross entropy is (measured in
nats)

    H(X, Y) = ln B(α', β') − (α' − 1)ψ(α) − (β' − 1)ψ(β) + (α' + β' − 2)ψ(α + β).

It follows that the relative entropy, or Kullback–Leibler divergence, between these two beta distributions is
(measured in nats)

    D_KL(X ‖ Y) = H(X, Y) − h(X)
                = ln(B(α', β') / B(α, β)) + (α − α')ψ(α) + (β − β')ψ(β) + (α' − α + β' − β)ψ(α + β).

The relative entropy, or Kullback–Leibler divergence, is always non-negative.
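The non-negativity (and the asymmetry) of the divergence can be illustrated with the standard closed-form expression D_KL = ln(B(α', β')/B(α, β)) + (α − α')ψ(α) + (β − β')ψ(β) + (α' − α + β' − β)ψ(α + β), evaluated here at arbitrary integer parameters (a sketch, not from the article):

```python
import math

EULER_GAMMA = 0.5772156649015329

def psi_int(n):
    return sum(1 / r for r in range(1, n)) - EULER_GAMMA

def lnB(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def kl_beta(a, b, a2, b2):
    # D_KL(Beta(a, b) || Beta(a2, b2)); integer parameters only in this sketch
    return (lnB(a2, b2) - lnB(a, b)
            + (a - a2) * psi_int(a)
            + (b - b2) * psi_int(b)
            + (a2 - a + b2 - b) * psi_int(a + b))

print(kl_beta(2, 3, 2, 3))                        # a distribution against itself: 0
print(kl_beta(2, 3, 2, 2), kl_beta(2, 2, 2, 3))   # both positive, not equal
```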

Relationships between statistical measures

Mean, mode and median relationship


If 1 < α < β then mode ≤ median ≤ mean.[13] Expressing the mode (only for α > 1 and β > 1) and the mean in terms
of α and β:

    (α − 1)/(α + β − 2) ≤ median ≤ α/(α + β).

If 1 < β < α then the order of the inequalities is reversed. For α > 1 and β > 1 the absolute distance between the
mean and the median is less than 5% of the distance between the maximum and minimum values of x. On the other
hand, the absolute distance between the mean and the mode can reach 50% of the distance between the maximum
and minimum values of x, for the (pathological) case of α ≈ 1 and β ≈ 1 (for which values the beta distribution
approaches the uniform distribution and the differential entropy approaches its maximum value, and hence maximum
"disorder").
For example, for α = 1.0001 and β = 1.00000001:
• mode = 0.9999; PDF(mode) = 1.00010
• mean = 0.500025; PDF(mean) = 1.00003
• median = 0.500035; PDF(median) = 1.00003
• mean − mode = −0.499875
• mean − median = −9.65538×10⁻⁶
(where PDF stands for the value of the probability density function)

Kurtosis bounded by the square of the skewness


As remarked by Feller[9], in the Pearson system the beta probability density appears as type I (any difference
between the beta distribution and Pearson's type I distribution is only superficial and it makes no difference for the
following discussion regarding the relationship between kurtosis and skewness). Karl Pearson showed, in Plate 1 of
his paper[18] published in 1916, a graph with the kurtosis as the vertical axis (ordinate) and the square of the
skewness as the horizontal axis (abscissa), in which a number of distributions were displayed[22]. The region
occupied by the beta distribution is bounded by the following two lines in the (skewness², kurtosis) plane, or the
(skewness², excess kurtosis) plane:

    skewness² + 1 < kurtosis < (3/2) skewness² + 3

or, equivalently,

    skewness² − 2 < excess kurtosis < (3/2) skewness²

At a time when there were no powerful digital computers, Karl Pearson accurately computed further
boundaries[18][8], for example, separating the "U-shaped" from the "J-shaped" distributions. The lower boundary line
(excess kurtosis + 2 − skewness² = 0) is produced by "U-shaped" beta distributions with values of shape parameters α
and β close to zero. The upper boundary line (excess kurtosis − (3/2) skewness² = 0) is produced by extremely
skewed distributions with very large values of one of the parameters and very small values of the other parameter.
An example of a beta distribution near the upper boundary (excess kurtosis − (3/2) skewness² = 0) is given by α = 0.1,
β = 1000, for which the ratio (excess kurtosis)/(skewness²) = 1.49835 approaches the upper limit of 1.5 from below.
An example of a beta distribution near the lower boundary (excess kurtosis + 2 − skewness² = 0) is given by α =
0.0001, β = 0.1, for which values the expression (excess kurtosis + 2)/(skewness²) = 1.01621 approaches the lower limit
of 1 from above. In the infinitesimal limit for both α and β approaching zero symmetrically, the excess kurtosis
reaches its minimum value at −2. This minimum value occurs at the point at which the lower boundary line intersects
the vertical axis (ordinate). (Note, however, that in Pearson's original chart, the ordinate is kurtosis, instead of excess
kurtosis, and that it increases downwards rather than upwards.)
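These two boundary examples can be checked numerically; the sketch below (not part of the article) plugs the quoted parameter values into the skewness and excess-kurtosis formulas given earlier:

```python
def skew2(a, b):
    # squared skewness of Beta(a, b)
    return 4 * (b - a)**2 * (a + b + 1) / ((a + b + 2)**2 * a * b)

def excess_kurtosis(a, b):
    return (6 * ((a - b)**2 * (a + b + 1) - a * b * (a + b + 2))
            / (a * b * (a + b + 2) * (a + b + 3)))

r_upper = excess_kurtosis(0.1, 1000.0) / skew2(0.1, 1000.0)
r_lower = (excess_kurtosis(0.0001, 0.1) + 2) / skew2(0.0001, 0.1)
print(r_upper)  # just below 3/2, near the upper boundary
print(r_lower)  # just above 1, near the lower boundary
```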
Values for the skewness and excess kurtosis below the lower boundary (excess kurtosis + 2 − skewness² = 0) cannot
occur for any distribution, and hence Karl Pearson appropriately called the region below this boundary the
"impossible region." The boundary for this "impossible region" is determined by (symmetric or skewed) bimodal
"U"-shaped distributions for which parameters α and β approach zero and hence all the probability density is
concentrated at each end: x = 0 and x = 1 with practically nothing in between them. Since for α ≈ β ≈ 0 the
probability density is concentrated at the two ends x = 0 and x = 1, this "impossible boundary" is determined by a
2-point distribution: the probability can only take 2 values (Bernoulli distribution), one value with probability p and
the other with probability q = 1 − p. For cases approaching this limit boundary with symmetry α = β, skewness ≈ 0,
excess kurtosis ≈ −2 (this is the lowest excess kurtosis possible for any distribution), and the probabilities are p ≈ q ≈
1/2. For cases approaching this limit boundary with skewness, excess kurtosis ≈ −2 + skewness², and the probability
density is concentrated more at one end than the other end (with practically nothing in between), with probabilities
q = 1 − p at the left end x = 0 and p at the right end x = 1.

Symmetry
All statements are conditional on α > 0 and β > 0.
• Probability density function reflection symmetry

    f(x; α, β) = f(1 − x; β, α)

• Cumulative distribution function reflection symmetry plus unitary translation

    F(x; α, β) = 1 − F(1 − x; β, α)

• Mode reflection symmetry plus unitary translation

    mode(Beta(α, β)) = 1 − mode(Beta(β, α)), if Beta(β, α) ≠ Beta(1, 1)

• Median reflection symmetry plus unitary translation

    median(Beta(α, β)) = 1 − median(Beta(β, α))

• Mean reflection symmetry plus unitary translation

    μ(Beta(α, β)) = 1 − μ(Beta(β, α))

• Variance symmetry

    var(Beta(α, β)) = var(Beta(β, α))

• Skewness skew-symmetry

    skewness(Beta(α, β)) = −skewness(Beta(β, α))

• Excess kurtosis symmetry

    excess kurtosis(Beta(α, β)) = excess kurtosis(Beta(β, α))

• Characteristic function symmetry of Real part (with respect to the origin of variable "t")

    Re[₁F₁(α; α + β; it)] = Re[₁F₁(α; α + β; −it)]

• Characteristic function skew-symmetry of Imaginary part (with respect to the origin of variable "t")

    Im[₁F₁(α; α + β; it)] = −Im[₁F₁(α; α + β; −it)]

• Differential entropy symmetry

    h(Beta(α, β)) = h(Beta(β, α))


Shapes
The beta density function can take a wide variety of different shapes depending on the values of the two parameters
α and β:
Symmetric (α = β)
• the density function is symmetric about 1/2 (blue & teal plots).
• α = β < 1 is U-shaped (blue plot).[5]
  • α = β → 0 is a 2-point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end, x = 0 and x = 1, and zero probability everywhere else. A coin toss: one face of the coin being x = 0 and the other face being x = 1.
    • The excess kurtosis approaches its limiting minimum value of −2; a lower value than this is impossible for any distribution to reach.
    • The differential entropy approaches a minimum value of −∞.
  • α = β = 1/2 is the arcsine distribution.
• α = β = 1 is the uniform [0,1] distribution.
  • The (negative anywhere else) differential entropy reaches its maximum value of zero.
• α = β > 1 is symmetric unimodal.[5]
  • α = β = 3/2 is a semi-elliptic [0,1] distribution, see: Wigner semicircle distribution.
  • α = β = 2 is the parabolic [0,1] distribution.
  • α = β > 2 is bell-shaped, with inflection points located to either side of the mode.
  • α = β → ∞ is a 1-point degenerate distribution with a Dirac delta function spike at the midpoint x = 1/2 with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the single point x = 1/2.
    • The differential entropy approaches a minimum value of −∞.
Skewed (α ≠ β)
• the density function is skewed. An interchange of parameter values yields the mirror image (the reverse) of the initial curve.
• α < 1, β < 1 is skewed U-shaped. Positive skew for α < β, negative skew for α > β.
• α > 1, β > 1 is skewed unimodal (magenta & cyan plots). Positive skew for α < β, negative skew for α > β.
• α < 1, β ≥ 1 is reverse J-shaped with a right tail, positively skewed, strictly decreasing, strictly convex.
  • α < 1, β = 1 (maximum variance occurs for α = Φ = (√5 − 1)/2 ≈ 0.618, the golden ratio conjugate)
• α = 1, β > 1 is positively skewed, strictly decreasing (red plot), a reversed (mirror-image) power function [0,1] distribution.
  • α = 1, 1 < β < 2 is strictly concave.
  • α = 1, β = 2 is a straight line with slope −2, the right-triangular distribution with its right angle at the left end, at x = 0.
  • α = 1, β > 2 is reverse J-shaped with a right tail, strictly convex.
• α ≥ 1, β < 1 is J-shaped with a left tail, negatively skewed, strictly increasing, strictly convex.
  • α = 1, β < 1 (maximum variance occurs for β = Φ = (√5 − 1)/2 ≈ 0.618, the golden ratio conjugate)
• α > 1, β = 1 is negatively skewed, strictly increasing (green plot), the power function [0,1] distribution.[10]
  • 1 < α < 2, β = 1 is strictly concave.
  • α = 2, β = 1 is a straight line with slope +2, the right-triangular distribution with its right angle at the right end, at x = 1.
  • α > 2, β = 1 is J-shaped with a left tail, strictly convex.


Parameter estimation

Method of moments
Using the method of moments, with the first two moments (sample mean and sample variance), let

  \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i

be the sample mean and

  \bar{v} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2

be the sample variance. The method-of-moments estimates of the parameters are

  \hat{\alpha} = \bar{x}\left(\frac{\bar{x}(1-\bar{x})}{\bar{v}} - 1\right), conditional on \bar{v} < \bar{x}(1-\bar{x})
  \hat{\beta} = (1-\bar{x})\left(\frac{\bar{x}(1-\bar{x})}{\bar{v}} - 1\right), conditional on \bar{v} < \bar{x}(1-\bar{x})

When the distribution is required over a known interval other than [0, 1], say [a, c], then replace \bar{x} with
(\bar{x} - a)/(c - a) and \bar{v} with \bar{v}/(c - a)^2 in the above equations (see the "Alternative parametrizations,
four parameters" section below).[23]

Maximum likelihood
As is also the case for maximum likelihood estimates of the gamma distribution, the maximum likelihood
estimates for the beta distribution do not have a closed-form solution for arbitrary values of the shape parameters. If
X_1, …, X_N are independent random variables each having a beta distribution, the following system of coupled
maximum likelihood estimate equations (for the average log-likelihoods) needs to be inverted to obtain the
(unknown) shape parameter estimates in terms of the (known) averages of logarithms of the samples:[5]

  \psi(\hat{\alpha}) - \psi(\hat{\alpha} + \hat{\beta}) = \frac{1}{N}\sum_{i=1}^{N} \ln X_i
  \psi(\hat{\beta}) - \psi(\hat{\alpha} + \hat{\beta}) = \frac{1}{N}\sum_{i=1}^{N} \ln (1 - X_i)

where ψ is the digamma function. These coupled equations containing digamma functions of the shape parameter
estimates must be solved by numerical methods, as done, for example, by Beckman et al.[24]

Gnanadesikan et al. give numerical solutions for a few cases.[25] N. L. Johnson and S. Kotz[5] suggest that for "not too
small" shape parameter estimates, the logarithmic approximation to the digamma function, ψ(x) ≈ ln(x − 1/2),
may be used to obtain initial values for an iterative solution, since the equations resulting
from this approximation can be solved exactly:

  \hat{\alpha} \approx \frac{1}{2} + \frac{\hat{G}_X}{2\left(1 - \hat{G}_X - \hat{G}_{1-X}\right)}, \qquad \hat{\beta} \approx \frac{1}{2} + \frac{\hat{G}_{1-X}}{2\left(1 - \hat{G}_X - \hat{G}_{1-X}\right)}

where \hat{G}_X = \exp\left(\frac{1}{N}\sum_{i=1}^{N}\ln X_i\right) and \hat{G}_{1-X} = \exp\left(\frac{1}{N}\sum_{i=1}^{N}\ln (1 - X_i)\right) are the sample geometric means of X and of 1 − X.

More readily, and perhaps more accurately, the estimates provided by the method of moments can instead be used as
initial values for an iterative solution of the maximum likelihood coupled equations in terms of the digamma
functions.
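A minimal sketch of the Johnson and Kotz closed-form initial values, written out in terms of the sample geometric means (the data are illustrative; in practice these values would seed a numerical solver for the exact digamma equations):

```python
import math

# Johnson & Kotz initial values for the beta MLE: replacing the digamma
# psi(x) by ln(x - 1/2) makes the likelihood equations solvable exactly.
# These are starting values for an iterative solver, not final estimates.
def beta_mle_init(xs):
    n = len(xs)
    gx = math.exp(sum(math.log(x) for x in xs) / n)        # geometric mean of x
    g1x = math.exp(sum(math.log(1 - x) for x in xs) / n)   # geometric mean of 1 - x
    denom = 2.0 * (1.0 - gx - g1x)
    return 0.5 + gx / denom, 0.5 + g1x / denom             # (alpha0, beta0)

a0, b0 = beta_mle_init([0.2, 0.4, 0.5, 0.6, 0.7, 0.3])
```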
When the distribution is required over a known interval other than [0, 1], say [a, c], then replace ln X_i in the
first equation with ln((X_i − a)/(c − a)), and replace ln(1 − X_i) in the second equation with ln((c − X_i)/(c − a))
(see the "Alternative parametrizations, four parameters" section below).


If one of the shape parameters is known, the problem is considerably simplified. The following logit transformation can
be used to solve for the unknown shape parameter (for skewed cases such that α̂ ≠ β̂; otherwise, if symmetric,
both (equal) parameters are known when one is known):

  \mathrm{E}\!\left[\ln \frac{X}{1-X}\right] = \psi(\alpha) - \psi(\beta)

If, for example, \hat{\beta} is known, the unknown parameter \hat{\alpha} is provided, exactly, by the inverse[26] of the digamma
function:

  \hat{\alpha} = \psi^{-1}\!\left(\psi(\hat{\beta}) + \frac{1}{N}\sum_{i=1}^{N}\ln \frac{X_i}{1-X_i}\right)

Note that ln(X/(1 − X)) is the logarithm of the transformation by inversion of the random variable X/(1 − X) that transforms a
beta distribution into the inverted beta distribution or beta prime distribution (also known as the beta distribution of the
second kind, or Pearson's Type VI).

In particular, if one of the shape parameters has a value of unity, for example β = 1 (the power function
distribution with bounded support [0,1]), using the recurrence relation ψ(α + 1) = ψ(α) + 1/α, the maximum
likelihood estimator for the unknown parameter is, exactly:[5]

  \hat{\alpha} = -\frac{N}{\sum_{i=1}^{N} \ln X_i}

Generating beta-distributed random variates


If X and Y are independent, with X ~ Gamma(α, θ) and Y ~ Gamma(β, θ), then X/(X + Y) ~ Beta(α, β), so one
algorithm for generating beta variates is to generate X/(X + Y), where X is a gamma variate with parameters (α, 1)
and Y is an independent gamma variate with parameters (β, 1).[27]
Also, the kth order statistic of n uniformly distributed variates is Beta(k, n + 1 − k), so an alternative if α
and β are small integers is to generate α + β − 1 uniform variates and choose the βth largest (equivalently,
the αth smallest).[28]
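Both recipes can be sketched with the standard library alone; the parameter values below are illustrative:

```python
import random

# Two ways to draw Beta(alpha, beta) variates with the stdlib only:
# (1) X/(X + Y) with X ~ Gamma(alpha, 1) and Y ~ Gamma(beta, 1);
# (2) for integer alpha, beta: the alpha-th smallest of alpha+beta-1 uniforms.
def beta_via_gamma(alpha, beta, rng=random):
    x = rng.gammavariate(alpha, 1.0)
    y = rng.gammavariate(beta, 1.0)
    return x / (x + y)

def beta_via_order_stat(alpha, beta, rng=random):
    n = alpha + beta - 1                      # number of uniforms to draw
    u = sorted(rng.random() for _ in range(n))
    return u[alpha - 1]                       # alpha-th smallest ~ Beta(alpha, beta)

random.seed(0)
gamma_draws = [beta_via_gamma(2.0, 3.0) for _ in range(2000)]   # Beta(2, 3), mean 2/5
order_draws = [beta_via_order_stat(2, 2) for _ in range(2000)]  # Beta(2, 2), mean 1/2
```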

Related distributions

Transformations
• If X ~ Beta(α, β), then 1 − X ~ Beta(β, α): mirror-image symmetry
• If X ~ Beta(α, β), then X/(1 − X) ~ β′(α, β), the beta prime distribution, also called the "beta
distribution of the second kind".
• If X ~ Beta(n/2, m/2), then mX/(n(1 − X)) ~ F(n, m) (assuming n > 0 and m > 0): the Fisher–Snedecor F
distribution
• If then
where PERT denotes a distribution used in PERT
[29] [30]
analysis, and m=most likely value . Traditionally in PERT analysis.
• If then Kumaraswamy distribution with parameters
• If then Kumaraswamy distribution with parameters
• If then

Special and limiting cases


• Beta(1, 1) is the standard uniform distribution.
• If α = β = 3/2 and the support is rescaled to [−R, R] (four-parameter case), then the result is the Wigner semicircle distribution.
• Beta(1/2, 1/2) is the Jeffreys prior for a proportion and is equivalent to the arcsine distribution.
• If X ~ Beta(1, β), then βX converges to the exponential distribution with rate 1 as β → ∞.
• If X ~ Beta(k, β), then βX converges to the gamma distribution with shape k and scale 1 as β → ∞.

Derived from other distributions


• The kth order statistic of a sample of size n from the uniform distribution is a beta random variable,
U_(k) ~ Beta(k, n + 1 − k).[28]
• If X ~ Gamma(α, θ) and Y ~ Gamma(β, θ) are independent, then X/(X + Y) ~ Beta(α, β).
• If X ~ χ²(2α) and Y ~ χ²(2β) are independent, then X/(X + Y) ~ Beta(α, β).
• If U ~ U(0, 1) and α > 0, then U^(1/α) ~ Beta(α, 1). The power function distribution.

Combination with other distributions


• and then for all x > 0.

Compounding with other distributions


• If p ~ Beta(α, β) and X ~ Bin(n, p), then X ~ beta-binomial distribution
• If p ~ Beta(α, β) and X ~ NB(r, p), then X ~ beta negative binomial distribution

Generalisations
• The Dirichlet distribution is a multivariate generalization of the beta distribution. Univariate marginals of the
Dirichlet distribution have a beta distribution.
• The beta distribution is equivalent to the values that make the Pearson type I distribution a proper probability
distribution.
• The noncentral beta distribution is another generalization.

Applications

Order statistics
The beta distribution has an important application in the theory of order statistics. A basic result is that the
distribution of the kth smallest of a sample of size n from a continuous uniform distribution has a beta distribution.[28]
This result is summarized as:

  U_{(k)} \sim \mathrm{Beta}(k,\, n + 1 - k)

From this, and application of the theory related to the probability integral transform, the distribution of any
individual order statistic from any continuous distribution can be derived.[28]

Rule of succession
A classic application of the beta distribution is the rule of succession, introduced in the 18th century by Pierre-Simon
Laplace in the course of treating the sunrise problem. It states that, given s successes in n conditionally independent
Bernoulli trials with probability p, p should be estimated as (s + 1)/(n + 2). This estimate may be regarded as the
expected value of the posterior distribution over p, namely Beta(s + 1, n − s + 1), which is given by Bayes' rule if one
assumes a uniform prior over p (i.e., Beta(1, 1)) and then observes that p generated s successes in n trials.
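The rule can be sketched in a few lines; the 9-out-of-9 example below is illustrative:

```python
from fractions import Fraction

# Laplace's rule of succession: with s successes in n trials and a uniform
# Beta(1, 1) prior, the posterior is Beta(s + 1, n - s + 1), whose mean is
# (s + 1)/(n + 2).
def rule_of_succession(s, n):
    alpha_post, beta_post = s + 1, n - s + 1
    return Fraction(alpha_post, alpha_post + beta_post)   # posterior mean of p

p_next = rule_of_succession(s=9, n=9)   # 9 successes in 9 trials -> 10/11
```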

Bayesian inference
Beta distributions are used extensively in Bayesian inference, since beta distributions provide a family of conjugate
prior distributions for binomial (including Bernoulli) and geometric distributions. The Beta(0,0) distribution is an
improper prior and sometimes used to represent ignorance of parameter values.
The domain of the beta distribution can be viewed as a probability, and in fact the beta distribution is often used to
describe the distribution of an unknown probability value — typically, as the prior distribution over a probability
parameter, such as the probability of success in a binomial distribution or Bernoulli distribution. In fact, the beta
distribution is the conjugate prior of the binomial distribution and Bernoulli distribution.
The beta distribution is the special case of the Dirichlet distribution with only two parameters, and the beta is
conjugate to the binomial and Bernoulli distributions in exactly the same way as the Dirichlet distribution is
conjugate to the multinomial distribution and categorical distribution.
In Bayesian inference, the beta distribution can be derived as the posterior probability of the parameter p of a
binomial distribution after observing α − 1 successes (with probability p of success) and β − 1 failures (with
probability 1 − p of failure). Another way to express this is that placing a prior distribution of Beta(α,β) on the
parameter p of a binomial distribution is equivalent to adding α pseudo-observations of "success" and β
pseudo-observations of "failure" to the actual number of successes and failures observed, then estimating the
parameter p by the proportion of successes over both real and pseudo-observations. If α and β are greater than 0,
this has the effect of smoothing the distribution of the parameters by ensuring that some positive probability
mass is assigned to all parameters even when no actual observations corresponding to those parameters are observed.
Values of α and β less than 1 favor sparsity, i.e. distributions where the parameter p is close to either 0 or 1. In effect,
α and β, when operating together, function as a concentration parameter; see that article for more details.
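The pseudo-observation reading of the conjugate update can be sketched as follows; the prior and the counts are illustrative:

```python
# Conjugate beta-binomial updating: a Beta(alpha, beta) prior on p plus
# s observed successes and f observed failures gives a Beta(alpha + s,
# beta + f) posterior, i.e. pseudo-counts added to real counts.
def update_beta_prior(alpha, beta, successes, failures):
    return alpha + successes, beta + failures

def posterior_mean(alpha, beta):
    # proportion of successes over real and pseudo-observations combined
    return alpha / (alpha + beta)

a, b = update_beta_prior(2.0, 2.0, successes=7, failures=3)
```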

Subjective logic
In standard logic, propositions are considered to be either true or false. In contradistinction, subjective logic assumes
that humans cannot determine with absolute certainty whether a proposition about the real world is absolutely true or
false. In subjective logic the a posteriori probability estimates of binary events can be represented by beta
distributions.[31]

Wavelet analysis
A wavelet is a wave-like oscillation with an amplitude that starts out at zero, increases, and then decreases back to
zero. It can typically be visualized as a "brief oscillation" that promptly decays. Wavelets can be used to extract
information from many different kinds of data, including – but certainly not limited to – audio signals and images.
Thus, wavelets are purposefully crafted to have specific properties that make them useful for signal processing.
Wavelets are localized in both time and frequency whereas the standard Fourier transform is only localized in
frequency. Therefore, standard Fourier Transforms are only applicable to stationary processes, while wavelets are
applicable to non-stationary processes. Continuous wavelets can be constructed based on the beta distribution. Beta
wavelets[32] can be viewed as a soft variety of Haar wavelets whose shape is fine-tuned by two shape parameters α
and β.

Project Management: task cost and schedule modeling


The beta distribution can be used to model events which are constrained to take place within an interval defined by a
minimum and maximum value. For this reason, the beta distribution — along with the triangular distribution — is
used extensively in PERT, critical path method (CPM), Joint Cost Schedule Modeling (JCSM) and other project
management / control systems to describe the time to completion and the cost of a task. In project management,
shorthand computations are widely used to estimate the mean and standard deviation of the beta distribution:[30]

  \mu(X) \approx \frac{a + 4b + c}{6}, \qquad \sigma(X) \approx \frac{c - a}{6}

where a is the minimum, c is the maximum, and b is the most likely value (the mode for α > 1 and β > 1).

The above estimate for the mean, μ(X) ≈ (a + 4b + c)/6, is known as the PERT three-point estimation and it is
exact for either of the following values of β (for arbitrary α within these ranges):

  β = α for α > 1 (symmetric case), with standard deviation \sigma(X) = \frac{c - a}{2\sqrt{2\alpha + 1}}, skewness = 0, and
  excess kurtosis = \frac{-6}{2\alpha + 3}

or

  β = 6 − α for 5 > α > 1 (skewed case), with standard deviation \sigma(X) = \frac{(c - a)\sqrt{\alpha(6 - \alpha)}}{6\sqrt{7}},

  skewness = \frac{(6 - 2\alpha)\sqrt{7}}{4\sqrt{\alpha(6 - \alpha)}}, and excess kurtosis = \frac{7(6 - 2\alpha)^2 - 8\alpha(6 - \alpha)}{12\,\alpha(6 - \alpha)}

The above estimate for the standard deviation, σ(X) ≈ (c − a)/6, is exact for either of the following values of α and
β:

  α = β = 4 (symmetric), with skewness = 0, and excess kurtosis = −6/11

or

  α = 3 − √2 and β = 3 + √2 (right-tailed, positive skew), with skewness = 1/√2, and excess kurtosis = 0

or

  α = 3 + √2 and β = 3 − √2 (left-tailed, negative skew), with skewness = −1/√2, and excess kurtosis = 0

Otherwise, these can be poor approximations for beta distributions with other values of α and β. For example,
particular parameter choices have been shown to exhibit average errors of 40% in the
mean and 549% in the variance.[33][34][35]
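The shorthand PERT computations can be sketched as follows; the task values are illustrative:

```python
# PERT three-point estimates for a task modeled by a beta distribution on
# [a, c] with most-likely value (mode) b: mean ~ (a + 4b + c)/6 and
# standard deviation ~ (c - a)/6.
def pert_estimates(a, b, c):
    mean = (a + 4 * b + c) / 6
    stdev = (c - a) / 6
    return mean, stdev

# Illustrative task: optimistic 2 days, most likely 5 days, pessimistic 14 days.
task_mean, task_stdev = pert_estimates(a=2.0, b=5.0, c=14.0)
```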

Alternative parametrizations

Two parameters

Mean and sample size


The beta distribution may also be reparameterized in terms of its mean μ (0 < μ < 1) and "sample size" ν = α + β (ν >
0). This is useful in Bayesian parameter estimation if one wants to place an unbiased (uniform) prior over the mean.
For example, one may administer a test to a number of individuals. If it is assumed that each person's score (0 ≤ θ ≤
1) is drawn from a population-level beta distribution, then an important statistic is the mean of this population-level
distribution. The mean and sample size parameters are related to the shape parameters α and β via[36]

  \alpha = \mu\nu, \qquad \beta = (1 - \mu)\nu

Under this parametrization, one can place a uniform prior over the mean, and a vague prior (such as an exponential
or gamma distribution) over the positive reals for the sample size.
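The conversion between the two parametrizations can be sketched as a round trip; the values are illustrative:

```python
# Mean / sample-size parametrization of the beta distribution:
# mu = alpha/(alpha + beta), nu = alpha + beta, hence
# alpha = mu * nu and beta = (1 - mu) * nu.
def shape_from_mean_size(mu, nu):
    return mu * nu, (1 - mu) * nu

def mean_size_from_shape(alpha, beta):
    return alpha / (alpha + beta), alpha + beta

alpha, beta = shape_from_mean_size(mu=0.25, nu=8.0)   # Beta(2, 6)
```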

Mean (allele frequency) and (Wright's) genetic distance between two populations
The Balding–Nichols model is a two-parameter parametrization of the beta distribution used in population genetics.
It is a statistical description of the allele frequencies in the components of a sub-divided population, with shape parameters

  \alpha = \frac{1 - F}{F}\,\mu, \qquad \beta = \frac{1 - F}{F}\,(1 - \mu)

where μ is the mean allele frequency and F is (Wright's) genetic distance between the two populations.
See the articles Balding–Nichols model, F-statistics, fixation index and coefficient of relationship, for further
information.

Mean and variance


Solving the system of (coupled) equations given in the above sections as the equations for the mean and the variance
of the beta distribution in terms of the original parameters α and β, one can express the α and β parameters in terms
of the mean (μ) and the variance (var):

  \alpha = \mu\left(\frac{\mu(1-\mu)}{\mathrm{var}} - 1\right), \qquad \beta = (1-\mu)\left(\frac{\mu(1-\mu)}{\mathrm{var}} - 1\right), \qquad \text{valid for } \mathrm{var} < \mu(1-\mu)

This parametrization of the beta distribution may lead to a more intuitive understanding than the one based on the
original parameters α and β, for example, by expressing the mode, skewness, excess kurtosis and differential
entropy in terms of the mean and the variance.

Four parameters
A beta distribution with the two shape parameters α and β is supported on the range [0,1]. It is possible to alter the
location and scale of the distribution by introducing two further parameters representing the minimum, a, and
maximum, c, values of the distribution,[5] by a linear transformation substituting the non-dimensional variable x in
terms of the new variable y (with support [a,c]) and the parameters a and c:

  x = \frac{y - a}{c - a}

The probability density function of the four-parameter beta distribution is then given by

  f(y;\, \alpha, \beta, a, c) = \frac{(y-a)^{\alpha-1}\,(c-y)^{\beta-1}}{(c-a)^{\alpha+\beta-1}\,\mathrm{B}(\alpha,\beta)}

The mean, mode and variance of the four-parameter beta distribution are:

  \text{mean} = a + (c - a)\,\frac{\alpha}{\alpha+\beta}
  \text{mode} = a + (c - a)\,\frac{\alpha - 1}{\alpha+\beta-2} \quad \text{for } \alpha, \beta > 1
  \text{variance} = (c - a)^2\,\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}

Since the skewness and excess kurtosis are non-dimensional quantities (as moments normalized by the standard
deviation), they are independent of the parameters a and c, and therefore equal to the expressions given above in
terms of X (with support [0,1]).
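The linear change of variables can be sketched by rescaling standard beta draws; the parameters are illustrative:

```python
import random

# Four-parameter beta: y = a + (c - a) * x linearly maps a standard
# Beta(alpha, beta) variate x from [0, 1] onto [a, c].  The mean scales to
# a + (c - a) * alpha/(alpha + beta); the variance picks up (c - a)**2.
def four_param_beta(alpha, beta, a, c, rng=random):
    x = rng.betavariate(alpha, beta)
    return a + (c - a) * x

random.seed(1)
# Beta(2, 3) on [10, 20]: mean is 10 + 10 * (2/5) = 14.
ys = [four_param_beta(2.0, 3.0, a=10.0, c=20.0) for _ in range(4000)]
```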

History
The first systematic, modern discussion of the beta distribution is probably due to Karl Pearson FRS[37] (27 March
1857 – 27 April 1936[38]), an influential English mathematician who has been credited with establishing the
discipline of mathematical statistics.[39] In Pearson's papers[18] the beta distribution is couched as a solution of a
differential equation: Pearson's Type I distribution. The beta distribution is essentially identical to Pearson's Type I
distribution for the parameter values for which Pearson's differential equation solution becomes a proper statistical
distribution (with area under the probability density equal to 1). In fact, in several English books and journal articles
in the few decades prior to World War II, it was common to refer to the beta distribution as Pearson's Type I
distribution. (Figure: Karl Pearson analyzed the beta distribution as the solution "Type I" of the Pearson
distributions.) According to David and Edwards's comprehensive treatise on the history of statistics,[40] the first
modern treatment of the beta distribution[41] using the beta designation that has become standard is due to Corrado
Gini (May 23, 1884 – March 13, 1965), an Italian statistician, demographer and sociologist, who developed the Gini
coefficient.

References
[1] Oguamanam, D.C.D.; Martin, H.R. , Huissoon, J.P. (1995). "On the application of the beta distribution to gear damage analysis". Applied
Acoustics 45 (3): 247–261. doi:10.1016/0003-682X(95)00001-P.
[2] Sulaiman, M.Yusof; W.M Hlaing Oo, Mahdi Abd Wahab, Azmi Zakaria (December 1999). "Application of beta distribution model to
Malaysian sunshine data". Renewable Energy 18 (4): 573–579. doi:10.1016/S0960-1481(99)00002-6.
[3] Haskett, Jonathan D.; Yakov A. Pachepsky, Basil Acock (1995). "Use of the beta distribution for parameterizing variability of soil properties
at the regional level for crop yield estimation". Agricultural Systems 48 (1): 73–86. doi:10.1016/0308-521X(95)93646-U.
[4] Gullco, Robert S.; Malcolm Anderson (December 2009). "Use of the Beta Distribution To Determine Well-Log Shale Parameters". SPE
Reservoir Evaluation & Engineering 12 (6): 929-942. doi:10.2118/106746-PA.
[5] Johnson, Norman L.; Kotz, Samuel; Balakrishnan, N. (1995). "Chapter 21:Beta Distributions". Continuous Univariate Distributions Vol. 2
(2nd ed.). Wiley. ISBN 978-0-471-58494-0.
[6] Keeping, E. S. (2010). Introduction to Statistical Inference. Dover Publications. pp. 462 pages. ISBN 978-0486685021.
[7] Wadsworth, George P. and Joseph Bryan (1960). Introduction to Probability and Random Variables. McGraw-Hill. pp. 101.
[8] Hahn, Gerald J. and S. Shapiro (1994). Statistical Models in Engineering (Wiley Classics Library). Wiley-Interscience. pp. 376.
ISBN 978-0471040651.
[9] Feller, William (1971). An Introduction to Probability Theory and Its Applications, Vol. 2. Wiley. pp. 669. ISBN 978-0471257097.
[10] Gupta (Editor), Arjun K. (2004). Handbook of Beta Distribution and Its Applications. CRC Press. pp. 42. ISBN 978-0824753962.
[11] Panik, Michael J (2005). Advanced Statistics from an Elementary Point of View. Academic Press. pp. 273. ISBN 978-0120884940.
[12] Rose, Colin, and Murray D. Smith (2002). Mathematical Statistics with MATHEMATICA. Springer. pp. 496 pages. ISBN 978-0387952345.
[13] Kerman J (2011) "A closed-form approximation for the median of the beta distribution". arXiv:1111.0433v1
[14] Liang, Zhiqiang; Jianming Wei, Junyu Zhao, Haitao Liu, Baoqing Li, Jie Shen and Chunlei Zheng (27 August 2008). "The Statistical
Meaning of Kurtosis and Its New Application to Identification of Persons Based on Seismic Signals". Sensors 8: 5106–5119.
doi:10.3390/s8085106.

[15] Kenney, J. F., and E. S. Keeping (1951). Mathematics of Statistics Part Two, 2nd edition. D. Van Nostrand Company Inc. pp. 429.
[16] Abramowitz, Milton and Irene A. Stegun (1965). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables.
Dover. 1046 pages. ISBN 978-0486612720.
[17] Weisstein, Eric W. "Kurtosis" (http://mathworld.wolfram.com/Kurtosis.html). MathWorld, a Wolfram Web Resource. Retrieved 13
August 2012.
[18] Pearson, Karl (1916). "Mathematical contributions to the theory of evolution, XIX: Second supplement to a memoir on skew variation".
Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 216
(538–548): 429–457. Bibcode 1916RSPTA.216..429P. doi:10.1098/rsta.1916.0009. JSTOR 91092.
[19] Gradshteyn, I. S. , and I. M. Ryzhik (2000). Table of Integrals, Series, and Products, 6th edition. Academic Press. pp. 1163.
ISBN 978-0122947575.
[20] A. C. G. Verdugo Lazo and P. N. Rathie. "On the entropy of continuous probability distributions," IEEE Trans. Inf. Theory, IT-24:120–122,
1978.
[21] Shannon, Claude E., "A Mathematical Theory of Communication," Bell System Technical Journal, 27 (4):623–656,1948. PDF (http:/ /
www. alcatel-lucent. com/ bstj/ vol27-1948/ articles/ bstj27-4-623. pdf)
[22] Pearson, Egon S. (July 1969). "Some historical reflections traced through the development of the use of frequency curves" (http:/ / www.
smu. edu/ Dedman/ Academics/ Departments/ Statistics/ Research/ TechnicalReports). THEMIS Statistical Analysis Research Program,
Technical Report 38 Office of Naval Research, Contract N000014-68-A-0515 (Project NR 042-260): 23. .
[23] Engineering Statistics Handbook (http:/ / www. itl. nist. gov/ div898/ handbook/ eda/ section3/ eda366h. htm)
[24] Beckman, R. J.; G. L. Tietjen (1978). "Maximum likelihood estimation for the beta distribution". Journal of Statistical Computation and
Simulation 7 (3-4): 253-258. doi:10.1080/00949657808810232.
[25] Gnanadesikan, R.,Pinkham and Hughes (1967). "Maximum likelihood estimation of the parameters of the beta distribution from smallest
order statistics". Technometrics 9: 607-620.
[26] Fackler, Paul. "Inverse Digamma Function (Matlab)" (http://hips.seas.harvard.edu/content/inverse-digamma-function-matlab). Harvard
University School of Engineering and Applied Sciences. Retrieved 18 August 2012.
[27] van der Waerden, B. L., "Mathematical Statistics", Springer, ISBN 978-3-540-04507-6.
[28] David, H. A., Nagaraja, H. N. (2003) Order Statistics (3rd Edition). Wiley, New Jersey pp 458. ISBN 0-471-38926-9
[29] Herrerías-Velasco, José Manuel and Herrerías-Pleguezuelo, Rafael and René van Dorp, Johan. (2011). Revisiting the PERT mean and
Variance. European Journal of Operational Research (210), p. 448–451.
[30] Malcolm, D. G.; Roseboom, C. E., Clark, C. E. and Fazar, W., (1959). "Application of a technique for research and development program
evaluation". Operations Research 7: 646–649.
[31] A. Jøsang. A Logic for Uncertain Probabilities. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 9(3),
pp.279-311, June 2001. PDF (http:/ / www. unik. no/ people/ josang/ papers/ Jos2001-IJUFKS. pdf)
[32] H.M. de Oliveira and G.A.A. Araújo,. Compactly Supported One-cyclic Wavelets Derived from Beta Distributions. Journal of
Communication and Information Systems. vol.20, n.3, pp.27-33, 2005.
[33] Keefer, Donald L. and Verdini, William A. (1993). Better Estimation of PERT Activity Time Parameters. Management Science 39(9), p.
1086–1091.
[34] Keefer, Donald L. and Bodily, Samuel E. (1983). Three-point Approximations for Continuous Random variables. Management Science
29(5), p. 595–609.
[35] DRMI Newsletter, Issue 12, April 8, 2005 (http:/ / www. nps. edu/ drmi/ docs/ 1apr05-newsletter. pdf)
[36] Kruschke, J. (2011). Doing Bayesian data analysis: A tutorial with R and BUGS. Academic Press / Elsevier ISBN 978-0123814852 (p. 83)
[37] Yule, G. U.; Filon, L. N. G. (1936). "Karl Pearson. 1857-1936". Obituary Notices of Fellows of the Royal Society 2 (5): 72.
doi:10.1098/rsbm.1936.0007. JSTOR 769130.
[38] "Library and Archive catalogue" (http:/ / www2. royalsociety. org/ DServe/ dserve. exe?dsqIni=Dserve. ini& dsqApp=Archive&
dsqCmd=Show. tcl& dsqDb=Persons& dsqPos=0& dsqSearch=((text)=' Pearson: Karl (1857 - 1936) '))). Sackler Digital Archive. Royal
Society. . Retrieved 2011-07-01.
[39] "Karl Pearson sesquicentenary conference" (http:/ / www. economics. soton. ac. uk/ staff/ aldrich/ KP150. htm). Royal Statistical Society.
2007-03-03. . Retrieved 2008-07-25.
[40] David, H. A. and A.W.F. Edwards (2001). Annotated Readings in the History of Statistics. Springer; 1 edition. pp. 252.
ISBN 978-0387988443.
[41] Gini, Corrado (1911). Studi Economico-Giuridici della Università de Cagliari Anno III (reproduced in Metron 15, 133,171, 1949): 5-41.

External links
• Weisstein, Eric W., " Beta Distribution (http://mathworld.wolfram.com/BetaDistribution.html)" from
MathWorld.
• "Beta Distribution" (http://demonstrations.wolfram.com/BetaDistribution/) by Fiona Maclachlan, the Wolfram
Demonstrations Project, 2007.
• Beta Distribution – Overview and Example (http://www.xycoon.com/beta.htm), xycoon.com
• Beta Distribution (http://www.brighton-webs.co.uk/distributions/beta.htm), brighton-webs.co.uk

Beta function
In mathematics, the beta function, also called the Euler integral of the first kind, is a special function defined by

  \mathrm{B}(x, y) = \int_0^1 t^{x-1}(1-t)^{y-1}\,dt

for Re(x) > 0, Re(y) > 0.
The beta function was studied by Euler and Legendre and was given its name by Jacques Binet; its symbol Β is a
Greek capital beta rather than the similar Latin capital B.

Properties
The beta function is symmetric, meaning that

  \mathrm{B}(x, y) = \mathrm{B}(y, x)[1]

When x and y are positive integers, it follows from the definition of the gamma function that:

  \mathrm{B}(x, y) = \frac{(x-1)!\,(y-1)!}{(x+y-1)!}

It has many other forms, including:

  \mathrm{B}(x, y) = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x+y)}[1]

  \mathrm{B}(x, y) = 2\int_0^{\pi/2} (\sin\theta)^{2x-1}(\cos\theta)^{2y-1}\,d\theta, \qquad \mathrm{Re}(x) > 0,\ \mathrm{Re}(y) > 0[2]

  \mathrm{B}(x, y) = \int_0^\infty \frac{t^{x-1}}{(1+t)^{x+y}}\,dt, \qquad \mathrm{Re}(x) > 0,\ \mathrm{Re}(y) > 0[2]

  \mathrm{B}(x, y)\cdot\left(t \mapsto t_+^{x+y-1}\right) = \left(t \mapsto t_+^{x-1}\right) * \left(t \mapsto t_+^{y-1}\right), \qquad x, y \ge 1

where t_+^x is a truncated power function and the star denotes convolution. The second identity, with x = y = 1/2,
shows in particular that Γ(1/2) = √π. Some of these identities, e.g. the trigonometric formula, can be applied to
deriving the volume of an n-ball in Cartesian coordinates.
Euler's integral for the beta function may be converted into an integral over the Pochhammer contour C as

  \left(1 - e^{2\pi i\alpha}\right)\left(1 - e^{2\pi i\beta}\right)\mathrm{B}(\alpha, \beta) = \int_C t^{\alpha-1}(1-t)^{\beta-1}\,dt
This Pochhammer contour integral converges for all values of α and β and so gives the analytic continuation of the
beta function.
Just as the gamma function for integers describes factorials, the beta function can define a binomial coefficient after
adjusting indices:

  \binom{n}{k} = \frac{1}{(n+1)\,\mathrm{B}(n-k+1,\, k+1)}

Moreover, for integer n, Β can be integrated to give a closed form, an interpolation function for continuous values of
k:

  \binom{n}{k} = (-1)^n\, n!\, \frac{\sin(\pi k)}{\pi \prod_{i=0}^{n}(k-i)}
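Both the gamma-function form of the beta function and its link to binomial coefficients can be checked numerically with the standard library; a sketch:

```python
import math

# B(x, y) = Gamma(x) * Gamma(y) / Gamma(x + y), and the binomial-coefficient
# relation C(n, k) = 1 / ((n + 1) * B(n - k + 1, k + 1)).
def beta(x, y):
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)

def binom_via_beta(n, k):
    return 1.0 / ((n + 1) * beta(n - k + 1, k + 1))

c = binom_via_beta(10, 3)   # should match C(10, 3) = 120
```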

The beta function was the first known scattering amplitude in string theory, first conjectured by Gabriele Veneziano.
It also occurs in the theory of the preferential attachment process, a type of stochastic urn process.

Relationship between gamma function and beta function


To derive the integral representation of the beta function, write the product of two factorials as

  \Gamma(x)\,\Gamma(y) = \int_0^\infty u^{x-1}e^{-u}\,du \int_0^\infty v^{y-1}e^{-v}\,dv

Changing variables by putting u = zt, v = z(1 − t) shows that this is

  \Gamma(x)\,\Gamma(y) = \int_0^\infty e^{-z} z^{x+y-1}\,dz \int_0^1 t^{x-1}(1-t)^{y-1}\,dt = \Gamma(x+y)\,\mathrm{B}(x, y)

Hence

  \mathrm{B}(x, y) = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x+y)}

The stated identity may be seen as a particular case of the identity for the integral of a convolution. Taking
f(u) = e^{-u}u^{x-1}\,1_{u\ge 0} and g(u) = e^{-u}u^{y-1}\,1_{u\ge 0}, one has:

  \Gamma(x)\,\Gamma(y) = \left(\int f\right)\left(\int g\right) = \int (f*g) = \Gamma(x+y)\,\mathrm{B}(x, y)

Derivatives
We have

  \frac{\partial}{\partial x}\,\mathrm{B}(x, y) = \mathrm{B}(x, y)\left(\psi(x) - \psi(x+y)\right)

where ψ(x) is the digamma function.

Integrals
The Nörlund–Rice integral is a contour integral involving the beta function.

Approximation
Stirling's approximation gives the asymptotic formula

  \mathrm{B}(x, y) \sim \sqrt{2\pi}\,\frac{x^{x-1/2}\,y^{y-1/2}}{(x+y)^{x+y-1/2}}

for large x and large y. If on the other hand x is large and y is fixed, then

  \mathrm{B}(x, y) \sim \Gamma(y)\,x^{-y}
Incomplete beta function


The incomplete beta function, a generalization of the beta function, is defined as

  \mathrm{B}(x;\, a, b) = \int_0^x t^{a-1}(1-t)^{b-1}\,dt

For x = 1, the incomplete beta function coincides with the complete beta function. The relationship between the two
functions is like that between the gamma function and its generalization the incomplete gamma function.
The regularized incomplete beta function (or regularized beta function for short) is defined in terms of the
incomplete beta function and the complete beta function:

  I_x(a, b) = \frac{\mathrm{B}(x;\, a, b)}{\mathrm{B}(a, b)}

Working out the integral (one can use integration by parts) for integer values of a and b, one finds:

  I_x(a, b) = \sum_{j=a}^{a+b-1} \binom{a+b-1}{j} x^j (1-x)^{a+b-1-j}

The regularized incomplete beta function is the cumulative distribution function of the beta distribution, and is
related to the cumulative distribution function of a random variable X from a binomial distribution, where the
"probability of success" is p and the sample size is n:

  F(k;\, n, p) = \Pr(X \le k) = I_{1-p}(n-k,\, k+1)
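For integer a and b the regularized incomplete beta function reduces to a finite binomial sum, which also yields the binomial CDF directly; a sketch:

```python
import math

# For integer a, b >= 1, the regularized incomplete beta function is
# I_x(a, b) = sum_{j=a}^{a+b-1} C(a+b-1, j) x**j (1-x)**(a+b-1-j),
# and the binomial CDF follows as P(X <= k) = I_{1-p}(n - k, k + 1).
def reg_inc_beta(x, a, b):
    n = a + b - 1
    return sum(math.comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

def binom_cdf(k, n, p):
    return reg_inc_beta(1 - p, n - k, k + 1)
```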

Properties

  I_0(a, b) = 0
  I_1(a, b) = 1
  I_x(a, b) = 1 - I_{1-x}(b, a)

Calculation
Even if unavailable directly, the complete and incomplete Beta function values can be calculated using functions
commonly included in spreadsheet or Computer algebra systems. With Excel as an example, using the GammaLn
and (cumulative) Beta distribution functions, we have:
Complete Beta Value = Exp(GammaLn(a) + GammaLn(b) - GammaLn(a + b))
and,
Incomplete Beta Value = BetaDist(x, a, b) * Exp(GammaLn(a) + GammaLn(b) - GammaLn(a + b)).
These result from rearranging the formulae for the beta distribution and for the incomplete and complete beta
functions, which can also be computed from log-gamma values as above.

Similarly, in MATLAB and GNU Octave, betainc (incomplete beta function) computes the regularized incomplete
beta function (which is, in fact, the cumulative beta distribution), and so, to get the actual incomplete beta function,
one must multiply the result of betainc by the value returned by the corresponding beta function.
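The same log-gamma trick carries over to Python's standard library; a sketch, with illustrative arguments:

```python
import math

# The spreadsheet recipe in log-gamma form: the complete beta value is
# exp(lgamma(a) + lgamma(b) - lgamma(a + b)).  Working in logs avoids
# overflow: math.gamma(200) alone would overflow a double.
def complete_beta(a, b):
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

small = complete_beta(2.0, 3.0)     # B(2, 3) = 1!*2!/4! = 1/12
big = complete_beta(200.0, 200.0)   # tiny but representable value
```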

References
[1] Davis (1972) 6.2.2 p.258
[2] Davis (1972) 6.2.1 p.258

• Askey, R. A.; Roy, R. (2010), "Beta function" (http://dlmf.nist.gov/5.12), in Olver, Frank W. J.; Lozier,
Daniel M.; Boisvert, Ronald F. et al., NIST Handbook of Mathematical Functions, Cambridge University Press,
ISBN 978-0521192255, MR2723248
• Zelen, M.; Severo, N. C. (1972), "26. Probability functions", in Abramowitz, Milton; Stegun, Irene A., Handbook
of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, New York: Dover Publications,
pp. 925-995, ISBN 978-0-486-61272-0
• Davis, Philip J. (1972), "6. Gamma function and related functions" (http://www.math.sfu.ca/~cbm/aands/
page_258.htm), in Abramowitz, Milton; Stegun, Irene A., Handbook of Mathematical Functions with Formulas,
Graphs, and Mathematical Tables, New York: Dover Publications, ISBN 978-0-486-61272-0
• Paris, R. B. (2010), "Incomplete beta functions" (http://dlmf.nist.gov/8.17), in Olver, Frank W. J.; Lozier,
Daniel M.; Boisvert, Ronald F. et al., NIST Handbook of Mathematical Functions, Cambridge University Press,
ISBN 978-0521192255, MR2723248
• Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007), "Section 6.1 Gamma Function, Beta Function,
Factorials" (http://apps.nrbook.com/empanel/index.html?pg=256), Numerical Recipes: The Art of Scientific
Computing (3rd ed.), New York: Cambridge University Press, ISBN 978-0-521-88068-8

External links
• Evaluation of beta function using Laplace transform (http://planetmath.org/?op=getobj&amp;from=objects&
amp;id=6206), PlanetMath.org.
• Arbitrarily accurate values can be obtained from:
• The Wolfram Functions Site (http://functions.wolfram.com): Evaluate Beta Regularized Incomplete beta
(http://functions.wolfram.com/webMathematica/FunctionEvaluation.jsp?name=BetaRegularized)
• danielsoper.com: Incomplete Beta Function Calculator (http://www.danielsoper.com/statcalc/calc36.aspx),
Regularized Incomplete Beta Function Calculator (http://www.danielsoper.com/statcalc/calc37.aspx)

Beta-binomial distribution
[Figures: probability mass function and cumulative distribution function]

Parameters: n ∈ N0 — number of trials; α > 0 (real); β > 0 (real)
Support: k ∈ { 0, …, n }
PMF: C(n, k) B(k + α, n − k + β) / B(α, β)
CDF: involves the generalized hypergeometric function 3F2, via the factor
 3F2(1, α + k + 1, −n + k + 1; k + 2, −β − n + k + 2; 1)
Mean: nα / (α + β)
Variance: nαβ(α + β + n) / [(α + β)²(α + β + 1)]
Skewness: [(α + β + 2n)(β − α) / (α + β + 2)] √[(1 + α + β) / (nαβ(n + α + β))]
Ex. kurtosis: see text
MGF: 2F1(−n, α; α + β; 1 − e^t)
CF: 2F1(−n, α; α + β; 1 − e^(it))
In probability theory and statistics, the beta-binomial distribution is a family of discrete probability distributions on
a finite support of non-negative integers arising when the probability of success in each of a fixed or known number
of Bernoulli trials is either unknown or random. It is frequently used in Bayesian statistics, empirical Bayes methods
and classical statistics as an overdispersed binomial distribution.

It reduces to the Bernoulli distribution as a special case when n = 1. For α = β = 1, it is the discrete uniform
distribution from 0 to n. It also approximates the binomial distribution arbitrarily well for large α and β. The
beta-binomial is a one-dimensional version of the Dirichlet-multinomial distribution, as the binomial and beta
distributions are special cases of the multinomial and Dirichlet distributions, respectively.

Motivation and derivation

Beta-binomial distribution as a compound distribution


The Beta distribution is a conjugate distribution of the binomial distribution. This fact leads to an analytically
tractable compound distribution where one can think of the parameter in the binomial distribution as being
randomly drawn from a beta distribution. Namely, if

 X ~ Bin(n, p)

is the binomial distribution where p is a random variable with a beta distribution

 p ~ Beta(α, β),  π(p | α, β) = p^(α−1) (1 − p)^(β−1) / B(α, β)

then the compound distribution is given by

 f(k | n, α, β) = ∫0..1 C(n, k) p^k (1 − p)^(n−k) π(p | α, β) dp = C(n, k) B(k + α, n − k + β) / B(α, β)

Using the properties of the beta function, this can alternatively be written

 f(k | n, α, β) = [Γ(n + 1) / (Γ(k + 1) Γ(n − k + 1))] · [Γ(k + α) Γ(n − k + β) / Γ(n + α + β)] · [Γ(α + β) / (Γ(α) Γ(β))]
It is within this context that the beta-binomial distribution appears often in Bayesian statistics: the beta-binomial is
the predictive distribution of a binomial random variable with a beta distribution prior on the success probability.
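The compound pmf above is easy to evaluate numerically via the log-gamma function; a minimal sketch in Python (function names are illustrative):

```python
from math import lgamma, exp, comb

def log_beta(a, b):
    # ln B(a, b) = ln Γ(a) + ln Γ(b) − ln Γ(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    # P(k) = C(n, k) · B(k + α, n − k + β) / B(α, β)
    return comb(n, k) * exp(log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))

# The pmf sums to 1; with α = β = 1 it is the discrete uniform distribution on {0, …, n}
probs = [beta_binomial_pmf(k, 10, 2.0, 3.0) for k in range(11)]
```

Working in log space avoids overflow of the gamma function for large parameter values.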

Beta-binomial as an urn model


The beta-binomial distribution can also be motivated via an urn model for positive integer values of α and β.
Specifically, imagine an urn containing α red balls and β black balls, where random draws are made. If a red ball is
observed, then two red balls are returned to the urn. Likewise, if a black ball is drawn, it is replaced and another
black ball is added to the urn. If this is repeated n times, then the probability of observing k red balls follows a
beta-binomial distribution with parameters n,α and β.
Note that if the random draws are with simple replacement (no balls over and above the observed ball are added to
the urn), then the distribution follows a binomial distribution and if the random draws are made without replacement,
the distribution follows a hypergeometric distribution.
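The urn scheme described above can be simulated directly; a Monte Carlo sketch in Python (parameter values and function names are illustrative) compares the empirical frequencies with the compound pmf C(n, k) B(k + α, n − k + β)/B(α, β):

```python
import random
from math import lgamma, exp, comb

def beta_binomial_pmf(k, n, a, b):
    # C(n, k) · B(k + a, n − k + b) / B(a, b), via log-gamma for stability
    lb = lambda x, y: lgamma(x) + lgamma(y) - lgamma(x + y)
    return comb(n, k) * exp(lb(k + a, n - k + b) - lb(a, b))

def polya_draws(n, alpha, beta, rng):
    # Urn starts with alpha red and beta black balls; each drawn ball is
    # returned together with one extra ball of the same colour.
    red, black = alpha, beta
    k = 0
    for _ in range(n):
        if rng.random() < red / (red + black):
            red += 1
            k += 1
        else:
            black += 1
    return k

rng = random.Random(42)
n, alpha, beta, trials = 5, 2, 3, 200_000
counts = [0] * (n + 1)
for _ in range(trials):
    counts[polya_draws(n, alpha, beta, rng)] += 1
empirical = [c / trials for c in counts]
```

With 200,000 repetitions the empirical frequencies agree with the beta-binomial pmf to well within Monte Carlo error.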

Moments and properties


The first three raw moments are

and the kurtosis is

Letting π = α/(α + β), we note, suggestively, that the mean can be written as

 μ = nα/(α + β) = nπ

and the variance as

 σ² = nπ(1 − π)[1 + (n − 1)ρ]

where ρ = 1/(α + β + 1) is the pairwise correlation between the n Bernoulli draws and is called the over-dispersion
parameter.

Point estimates

Method of moments
The method of moments estimates can be obtained by noting the first and second raw moments of the beta-binomial,
namely

 μ1 = nα/(α + β)
 μ2 = nα[n(1 + α) + β] / [(α + β)(1 + α + β)]

and setting these raw moments equal to the first and second raw sample moments m1 and m2

 μ1 = m1,  μ2 = m2

and solving for α and β we get

 α̂ = (n·m1 − m2) / [n(m2/m1 − m1 − 1) + m1]
 β̂ = (n − m1)(n − m2/m1) / [n(m2/m1 − m1 − 1) + m1]
Note that these estimates can be nonsensically negative, which is evidence that the data are either undispersed or
underdispersed relative to the binomial distribution. In that case, the binomial distribution and the hypergeometric
distribution are alternative candidates, respectively.
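A sketch of these method-of-moments estimates in Python, applied to the Saxony family data from the example in this article (the function name is illustrative):

```python
def beta_binomial_mom(ks, n):
    """Method-of-moments estimates (alpha, beta) for beta-binomial data.

    ks -- observed success counts, each out of n trials.
    """
    N = len(ks)
    m1 = sum(ks) / N                  # first raw sample moment
    m2 = sum(k * k for k in ks) / N   # second raw sample moment
    denom = n * (m2 / m1 - m1 - 1) + m1
    alpha = (n * m1 - m2) / denom
    beta = (n - m1) * (n - m2 / m1) / denom
    return alpha, beta

# Saxony data: males among the first 12 children of 6115 families
families = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
ks = [k for k, f in enumerate(families) for _ in range(f)]
a_hat, b_hat = beta_binomial_mom(ks, 12)
```

On this data the estimates come out to approximately α̂ ≈ 34.135 and β̂ ≈ 31.609.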

Maximum likelihood estimation


While closed-form maximum likelihood estimates are impractical, the pdf consists of common functions (gamma
and/or beta functions), so the estimates can easily be found via direct numerical optimization. Maximum likelihood
estimates from empirical data can be computed using general methods for fitting multinomial Pólya distributions,
described in (Minka 2003). The R package VGAM, through the function vglm, facilitates the fitting of glm-type
models with responses distributed according to the beta-binomial distribution via maximum likelihood. Note also
that there is no requirement that n be fixed throughout the observations.

Example
The following data give the number of male children among the first 12 children in 6115 families of size 13, taken
from hospital records in 19th-century Saxony (Sokal and Rohlf, p. 59, from Lindsey). The 13th child is ignored to
mitigate the effect of families non-randomly stopping when a desired gender is reached.

Males 0 1 2 3 4 5 6 7 8 9 10 11 12

Families 3 24 104 286 670 1033 1343 1112 829 478 181 45 7

We note the first two sample moments are

 m1 = 6.23
 m2 = 42.31
 n = 12

and therefore the method of moments estimates are

 α̂ = 34.1350
 β̂ = 31.6085
The maximum likelihood estimates can be found numerically

and the maximized log-likelihood is

from which we find the AIC

The AIC for the competing binomial model is AIC = 25070.34 and thus we see that the beta-binomial model
provides a superior fit to the data i.e. there is evidence for overdispersion. Trivers and Willard posit a theoretical
justification for heterogeneity in gender-proneness among families (i.e. overdispersion).
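This comparison can be reproduced directly; a sketch in Python that evaluates both log-likelihoods on the data, using the method-of-moments estimates α ≈ 34.135, β ≈ 31.609 as a stand-in for the maximum likelihood estimates (which are very close):

```python
from math import lgamma, log, comb

families = [3, 24, 104, 286, 670, 1033, 1343, 1112, 829, 478, 181, 45, 7]
n = 12
N = sum(families)                                              # 6115 families
p_hat = sum(k * f for k, f in enumerate(families)) / (n * N)   # binomial MLE

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def ll_binomial(p):
    return sum(f * (log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p))
               for k, f in enumerate(families))

def ll_beta_binomial(a, b):
    return sum(f * (log(comb(n, k)) + log_beta(k + a, n - k + b) - log_beta(a, b))
               for k, f in enumerate(families))

aic_binom = 2 * 1 - 2 * ll_binomial(p_hat)               # one free parameter
aic_bb = 2 * 2 - 2 * ll_beta_binomial(34.135, 31.6085)   # two free parameters
```

The beta-binomial AIC comes out well below the binomial AIC, reflecting the overdispersion in the data.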
The superior fit is evident especially among the tails

Males 0 1 2 3 4 5 6 7 8 9 10 11 12

Observed Families 3 24 104 286 670 1033 1343 1112 829 478 181 45 7

Predicted (Beta-Binomial) 2.3 22.6 104.8 310.9 655.7 1036.2 1257.9 1182.1 853.6 461.9 177.9 43.8 5.2

Predicted (Binomial p = 0.519215) 0.9 12.1 71.8 258.5 628.1 1085.2 1367.3 1265.6 854.2 410.0 132.8 26.1 2.3

Further Bayesian considerations


It is convenient to reparameterize the distributions so that the expected mean of the prior is a single parameter: Let

 π(θ | μ, M) = Beta(Mμ, M(1 − μ))

where

 μ = α/(α + β)  (the prior mean)  and  M = α + β  (the prior "sample size")

so that

 π(θ | μ, M) = [Γ(M) / (Γ(Mμ) Γ(M(1 − μ)))] θ^(Mμ−1) (1 − θ)^(M(1−μ)−1)

The posterior distribution ρ(θ | k) is also a beta distribution:

 ρ(θ | k) ∝ ℓ(k | θ) π(θ | μ, M) = Beta(Mμ + k, M(1 − μ) + n − k)

And

 E(θ | k) = (Mμ + k) / (M + n)

while the marginal distribution m(k | μ, M) is given by

 m(k | μ, M) = ∫0..1 ℓ(k | θ) π(θ | μ, M) dθ = C(n, k) B(Mμ + k, M(1 − μ) + n − k) / B(Mμ, M(1 − μ))
Because the marginal is a complex, non-linear function of Gamma and Digamma functions, it is quite difficult to
obtain a marginal maximum likelihood estimate (MMLE) for the mean and variance. Instead, we use the method of
iterated expectations to find the expected value of the marginal moments.
Let us write our model as a two-stage compound sampling model. Let ki be the number of successes out of ni trials for
event i:

 ki ~ Bin(ni, θi),  θi ~ Beta(μ, M),  i.i.d.

We can find iterated moment estimates for the mean and variance using the moments for the distributions in the
two-stage model:

 E(ki/ni) = E[E(ki/ni | θi)] = E(θi) = μ

 var(ki/ni) = E[var(ki/ni | θi)] + var[E(ki/ni | θi)] = E[θi(1 − θi)/ni] + var(θi) = μ(1 − μ)(M + ni) / [ni(M + 1)]

(Here we have used the law of total expectation and the law of total variance.)
We want point estimates for and . The estimated mean is calculated from the sample

The estimate of the hyperparameter M is obtained using the moment estimates for the variance of the two-stage
model:

Solving:

where

Since we now have parameter point estimates, and , for the underlying distribution, we would like to find a
point estimate for the probability of success for event i. This is the weighted average of the event
estimate and . Given our point estimates for the prior, we may now plug in these values to find a point
estimate for the posterior

Shrinkage factors
We may write the posterior estimate as a weighted average:

 θ̂i = B̂ μ̂ + (1 − B̂)(ki/ni)

where B̂ = M̂/(M̂ + ni) is called the shrinkage factor.
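A small numerical sketch of this shrinkage, with hypothetical values for the hyperparameter estimates μ̂ and M̂ (all numbers are illustrative):

```python
# Hypothetical hyperparameter point estimates, for illustration only
mu_hat, M_hat = 0.52, 60.0

def posterior_estimate(k_i, n_i):
    # Posterior mean of θ_i: a weighted average of the prior mean mu_hat
    # and the raw event rate k_i/n_i, with shrinkage factor B.
    B = M_hat / (M_hat + n_i)
    return B * mu_hat + (1 - B) * (k_i / n_i)

# An event with few trials is shrunk strongly toward mu_hat;
# an event with many trials stays close to its raw rate.
small = posterior_estimate(9, 10)      # raw rate 0.9, heavily shrunk
large = posterior_estimate(900, 1000)  # raw rate 0.9, barely shrunk
```

The weighted average is algebraically the same as the posterior mean (M̂μ̂ + ki)/(M̂ + ni).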



Related distributions
• Beta-Bin(n, 1, 1) ~ U(0, n), where U(0, n) is the discrete uniform distribution on {0, …, n}.

References
• Minka, Thomas P. (2003). Estimating a Dirichlet distribution [1]. Microsoft Technical Report.

External links
• Using the Beta-binomial distribution to assess performance of a biometric identification device [2]
• Fastfit [3] contains Matlab code for fitting Beta-Binomial distributions (in the form of two-dimensional Pólya
distributions) to data.

References
[1] http://research.microsoft.com/~minka/papers/dirichlet/
[2] http://it.stlawu.edu/~msch/biometrics/papers.htm
[3] http://research.microsoft.com/~minka/software/fastfit/

Binomial coefficient
In mathematics, binomial coefficients are a family of positive
integers that occur as coefficients in the binomial theorem. They
are indexed by two nonnegative integers; the binomial coefficient
indexed by n and k is usually written as n over k in parentheses, here rendered C(n, k). It is the coefficient of
the x^k term in the polynomial expansion of the binomial power
(1 + x)^n. Under suitable circumstances the value of the coefficient
is given by the expression n!/(k!(n − k)!). Arranging binomial
coefficients into rows for successive values of n, and in which k
ranges from 0 to n, gives a triangular array called Pascal's triangle.
The binomial coefficients can be arranged to form Pascal's triangle.
This family of numbers also arises in many other areas than
algebra, notably in combinatorics. For any set containing n
elements, the number of distinct k-element subsets of it that can be formed (the k-combinations of its elements) is
given by the binomial coefficient C(n, k). Therefore C(n, k) is often read as "n choose k". The properties of binomial
coefficients have led to extending the meaning of the symbol beyond the basic case where n and k are
nonnegative integers with k ≤ n; such expressions are then still called binomial coefficients.
The notation was introduced by Andreas von Ettingshausen in 1826,[1] although the numbers were already
known centuries before that (see Pascal's triangle). The earliest known detailed discussion of binomial coefficients is
in a tenth-century commentary, due to Halayudha, on an ancient Hindu classic, Pingala's chandaḥśāstra. In about
1150, the Hindu mathematician Bhaskaracharya gave a very clear exposition of binomial coefficients in his book
Lilavati.[2]
Alternative notations include C(n, k), nCk, nCk, Ckn, Cnk,[3] in all of which the C stands for combinations or choices.
Binomial coefficient 42

Definition and interpretations


For natural numbers (taken to include 0) n and k, the binomial coefficient C(n, k) can be defined as the coefficient of
the monomial X^k in the expansion of (1 + X)^n. The same coefficient also occurs (if k ≤ n) in the binomial formula

 (x + y)^n = Σk=0..n C(n, k) x^k y^(n−k)

(valid for any elements x, y of a commutative ring), which explains the name "binomial coefficient".
Another occurrence of this number is in combinatorics, where it gives the number of ways, disregarding order, that k
objects can be chosen from among n objects; more formally, the number of k-element subsets (or k-combinations) of
an n-element set. This number can be seen as equal to the one of the first definition, independently of any of the
formulas below to compute it: if in each of the n factors of the power (1 + X)n one temporarily labels the term X with
an index i (running from 1 to n), then each subset of k indices gives after expansion a contribution Xk, and the
coefficient of that monomial in the result will be the number of such subsets. This shows in particular that C(n, k) is a
natural number for any natural numbers n and k. There are many other combinatorial interpretations of binomial
coefficients (counting problems for which the answer is given by a binomial coefficient expression), for instance the
number of words formed of n bits (digits 0 or 1) whose sum is k is given by C(n, k), while the number of ways to write
k = a1 + a2 + ⋯ + an where every ai is a nonnegative integer is given by C(n + k − 1, n − 1). Most of these
interpretations are easily seen to be equivalent to counting k-combinations.

Computing the value of binomial coefficients


Several methods exist to compute the value of C(n, k) without actually expanding a binomial power or counting
k-combinations.

Recursive formula
One has a recursive formula for binomial coefficients

 C(n, k) = C(n − 1, k − 1) + C(n − 1, k)  for all integers n, k ≥ 1,

with initial values

 C(n, 0) = 1 for all n ≥ 0 and C(0, k) = 0 for all k ≥ 1.
The formula follows either from tracing the contributions to Xk in (1 + X)n−1(1 + X), or by counting k-combinations
of {1, 2, ..., n} that contain n and that do not contain n separately. It follows easily that C(n, k) = 0 when k > n, and
C(n, n) = 1 for all n, so the recursion can stop when reaching such cases. This recursive formula then allows the
construction of Pascal's triangle.
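The recurrence and its boundary cases translate directly into a memoized recursion; a short sketch in Python:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def binom(n, k):
    # Boundary cases: C(n, 0) = 1, and C(n, k) = 0 when k > n
    if k == 0:
        return 1
    if k > n:
        return 0
    # Pascal's recurrence: C(n, k) = C(n-1, k-1) + C(n-1, k)
    return binom(n - 1, k - 1) + binom(n - 1, k)

row5 = [binom(5, k) for k in range(6)]   # one row of Pascal's triangle
```

The memoization turns the exponential recursion tree into the same O(n·k) work as filling in Pascal's triangle row by row.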

Multiplicative formula
A more efficient method to compute individual binomial coefficients is given by the formula

 C(n, k) = n(n − 1)(n − 2)⋯(n − k + 1) / [k(k − 1)(k − 2)⋯1]

where the numerator is expressed as a falling factorial power. This formula is easiest to
understand for the combinatorial interpretation of binomial coefficients. The numerator gives the number of ways to
select a sequence of k distinct objects, retaining the order of selection, from a set of n objects. The denominator
counts the number of distinct sequences that define the same k-combination when order is disregarded.

Factorial formula
Finally there is a formula using factorials that is easy to remember:

 C(n, k) = n! / [k! (n − k)!]
where n! denotes the factorial of n. This formula follows from the multiplicative formula above by multiplying
numerator and denominator by (n − k)!; as a consequence it involves many factors common to numerator and
denominator. It is less practical for explicit computation unless common factors are first canceled (in particular since
factorial values grow very rapidly). The formula does exhibit a symmetry that is less evident from the multiplicative
formula (though it is from the definitions):

 C(n, k) = C(n, n − k)    (1)

Generalization and connection to the binomial series


The multiplicative formula allows the definition of binomial coefficients to be extended[4] by replacing n by an
arbitrary number α (negative, real, complex) or even an element of any commutative ring in which all positive
integers are invertible:

 C(α, k) = α(α − 1)(α − 2)⋯(α − k + 1) / k!
With this definition one has a generalization of the binomial formula (with one of the variables set to 1), which
justifies still calling these binomial coefficients:

 (1 + X)^α = Σk≥0 C(α, k) X^k    (2)

This formula is valid for all complex numbers α and X with |X| < 1. It can also be interpreted as an identity of formal
power series in X, where it actually can serve as definition of arbitrary powers of series with constant coefficient
equal to 1; the point is that with this definition all identities hold that one expects for exponentiation, notably

If α is a nonnegative integer n, then all terms with k > n are zero, and the infinite series becomes a finite sum, thereby
recovering the binomial formula. However for other values of α, including negative integers and rational numbers,
the series is really infinite.

Pascal's triangle
Pascal's rule is the important recurrence relation

1000th row of Pascal's triangle, arranged vertically, with grey-scale representations of decimal digits of the
coefficients, right-aligned. The left boundary of the image corresponds roughly to the graph of the logarithm of the
binomial coefficients, and illustrates that they form a log-concave sequence.

 C(n, k) + C(n, k + 1) = C(n + 1, k + 1)    (3)

which can be used to prove by mathematical induction that C(n, k) is a natural number for all n and k (equivalent to the
statement that k! divides the product of k consecutive integers), a fact that is not immediately obvious from formula
(1).
Pascal's rule also gives rise to Pascal's triangle:

0: 1

1: 1 1

2: 1 2 1

3: 1 3 3 1

4: 1 4 6 4 1

5: 1 5 10 10 5 1

6: 1 6 15 20 15 6 1

7: 1 7 21 35 35 21 7 1

8: 1 8 28 56 70 56 28 8 1

Row number n contains the numbers C(n, k) for k = 0, …, n. It is constructed by starting with ones at the outside and
then always adding two adjacent numbers and writing the sum directly underneath. This method allows the quick
calculation of binomial coefficients without the need for fractions or multiplications. For instance, by looking at row
number 5 of the triangle, one can quickly read off that
(x + y)5 = 1 x5 + 5 x4y + 10 x3y2 + 10 x2y3 + 5 x y4 + 1 y5.
The differences between elements on other diagonals are the elements in the previous diagonal, as a consequence of
the recurrence relation (3) above.

Combinatorics and statistics


Binomial coefficients are of importance in combinatorics, because they provide ready formulas for certain frequent
counting problems:

• There are C(n, k) ways to choose k elements from a set of n elements. See Combination.
• There are C(n + k − 1, k) ways to choose k elements from a set of n if repetitions are allowed. See Multiset.
• There are C(n + k, k) strings containing k ones and n zeros.
• There are C(n + 1, k) strings consisting of k ones and n zeros such that no two ones are adjacent.[5]
• The Catalan numbers are C(2n, n)/(n + 1).
• The binomial distribution in statistics is C(n, k) p^k (1 − p)^(n−k).

• The formula for a Bézier curve.

Binomial coefficients as polynomials


For any nonnegative integer k, the expression C(t, k) can be simplified and defined as a polynomial divided by k!:

 C(t, k) = t(t − 1)(t − 2)⋯(t − k + 1) / k!

This presents a polynomial in t with rational coefficients.


As such, it can be evaluated at any real or complex number t to define binomial coefficients with such first
arguments. These "generalized binomial coefficients" appear in Newton's generalized binomial theorem.

For each k, the polynomial can be characterized as the unique degree k polynomial p(t) satisfying p(0) = p(1) =
... = p(k − 1) = 0 and p(k) = 1.
Its coefficients are expressible in terms of Stirling numbers of the first kind, by definition of the latter:

The derivative of C(t, k) can be calculated by logarithmic differentiation:

 d/dt C(t, k) = C(t, k) · Σi=0..k−1 1/(t − i)
Binomial coefficients as a basis for the space of polynomials


Over any field containing Q, each polynomial p(t) of degree at most d is uniquely expressible as a linear
combination Σk=0..d ak C(t, k). The coefficient ak is the kth difference of the sequence p(0), p(1), …, p(k).
Explicitly,[6]

 ak = Σi=0..k (−1)^(k−i) C(k, i) p(i)    (3.5)

Integer-valued polynomials
Each polynomial is integer-valued: it takes integer values at integer inputs. (One way to prove this is by
induction on k, using Pascal's identity.) Therefore any integer linear combination of binomial coefficient polynomials
is integer-valued too. Conversely, (3.5) shows that any integer-valued polynomial is an integer linear combination of
these binomial coefficient polynomials. More generally, for any subring R of a characteristic 0 field K, a polynomial
in K[t] takes values in R at all integers if and only if it is an R-linear combination of binomial coefficient
polynomials.
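The coefficients in this binomial-coefficient basis are forward differences and can be computed directly from formula (3.5); a sketch in Python (helper names are illustrative):

```python
from math import comb, factorial

def binomial_basis_coeffs(p, d):
    # a_k = k-th forward difference of p at 0:
    #   a_k = sum_{i=0..k} (-1)^(k-i) * C(k, i) * p(i)
    return [sum((-1) ** (k - i) * comb(k, i) * p(i) for i in range(k + 1))
            for k in range(d + 1)]

def binom_poly(t, k):
    # C(t, k) = t(t-1)...(t-k+1)/k!, evaluated at an integer t
    num = 1
    for i in range(k):
        num *= (t - i)
    return num // factorial(k)   # always exact for integer t

p = lambda t: 3 * t * (3 * t + 1) // 2      # an integer-valued polynomial
coeffs = binomial_basis_coeffs(p, 2)        # basis coefficients a_0, a_1, a_2
```

Summing a_k · C(t, k) with these coefficients reproduces p(t) at every integer t, illustrating that integer-valued polynomials are exactly the integer combinations of the C(t, k).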

Example
The integer-valued polynomial 3t(3t + 1)/2 can be rewritten as

 3t(3t + 1)/2 = 9 C(t, 2) + 6 C(t, 1)
Identities involving binomial coefficients


The factorial formula facilitates relating nearby binomial coefficients. For instance, if k is a positive integer and n is
arbitrary, then

 k C(n, k) = n C(n − 1, k − 1)    (4)

and, with a little more work,

Moreover, the following may be useful:



Series involving binomial coefficients


The formula

 Σk=0..n C(n, k) = 2^n    (5)

is obtained from (2) using x = 1. This is equivalent to saying that the elements in one row of Pascal's triangle always
add up to two raised to an integer power. A combinatorial interpretation of this fact involving double counting is
given by counting subsets of size 0, size 1, size 2, and so on up to size n of a set S of n elements. Since we count the
number of subsets of size i for 0 ≤ i ≤ n, this sum must be equal to the number of subsets of S, which is known to be
2n. That is, Equation 5 is a statement that the power set for a finite set with n elements has size 2n. More explicitly,
consider a bit string with n digits. This bit string can be used to represent 2n numbers. Now consider all of the bit
strings with no ones in them. There is just one, or rather n choose 0. Next consider the number of bit strings with just
a single one in them. There are n, or rather n choose 1. Continuing this way we can see that the equation above holds.
The formulas

 Σk=0..n k C(n, k) = n 2^(n−1)    (6a)

and

 Σk=0..n k(k − 1) C(n, k) = n(n − 1) 2^(n−2)    (6b)

follow from (2) after differentiating with respect to x (twice in the latter) and then substituting x = 1.
The Chu–Vandermonde identity, which holds for any complex values m and n and any non-negative integer k, is

 Σj=0..k C(m, j) C(n − m, k − j) = C(n, k)    (7a)

and can be found by examination of the coefficient of x^k in the expansion of (1 + x)^m (1 + x)^(n−m) = (1 + x)^n using
equation (2). When m = 1, equation (7a) reduces to equation (3).
A similar looking formula, which applies for any integers j, k, and n satisfying 0 ≤ j ≤ k ≤ n, is

(7b)

and can be found by examination of the coefficient of in the expansion of

using When j = k, equation (7b) gives

From expansion (7a) using n = 2m, k = m, and (1), one finds

 Σj=0..m C(m, j)² = C(2m, m)    (8)

Let F(n) denote the n-th Fibonacci number. We obtain a formula about the diagonals of Pascal's triangle

 Σk=0..⌊n/2⌋ C(n − k, k) = F(n + 1)    (9)

This can be proved by induction using (3) or by Zeckendorf's representation (Just note that the lhs gives the number
of subsets of {F(2),...,F(n)} without consecutive members, which also form all the numbers below F(n+1)). A
combinatorial proof is given below.
Also using (3) and induction, one can show that

 Σj=k..n C(j, k) = C(n + 1, k + 1)    (10)

Although there is no closed formula for

 Σj=0..k C(n, j)

(unless one resorts to Hypergeometric functions), one can again use (3) and induction to show that for k = 0, ..., n−1

 Σj=0..k (−1)^j C(n, j) = (−1)^k C(n − 1, k)    (11)

as well as

 Σj=0..n (−1)^j C(n, j) = 0    (12)

[except in the trivial case where n = 0, where the result is 1 instead] which is itself a special case of the result from
the theory of finite differences that for any polynomial P(x) of degree less than n,[7]

 Σj=0..n (−1)^j C(n, j) P(j) = 0    (13a)

Differentiating (2) k times and setting x = −1 yields this for P(x) = x(x − 1)⋯(x − k + 1), when 0 ≤ k < n,
and the general case follows by taking linear combinations of these.
When P(x) is of degree less than or equal to n,

 Σj=0..n (−1)^j C(n, j) P(j) = (−1)^n n! an    (13b)

where an is the coefficient of degree n in P(x).


More generally for (13b),

 Σj=0..n (−1)^j C(n, j) P(m + dj) = (−1)^n d^n n! an    (13c)

where m and d are complex numbers. This follows immediately applying (13b) to the polynomial Q(x):=P(m + dx)
instead of P(x), and observing that Q(x) has still degree less than or equal to n, and that its coefficient of degree n is
dnan.
The infinite series

 Σn=k..∞ 1/C(n, k) = k/(k − 1)    (14)
is convergent for k ≥ 2. This formula is used in the analysis of the German tank problem. It is equivalent to the
formula for the finite sum

which is proved for M>m by induction on M.


Using (8) one can derive

(15)

and

(16)

Series multisection gives the following identity for the sum of binomial coefficients taken with a step s and offset t
as a closed-form sum of s terms:

Identities with combinatorial proofs


Many identities involving binomial coefficients can be proved by combinatorial means. For example, the following
identity for nonnegative integers q ≤ n (which reduces to (6a) when q = 1):

 Σk=q..n C(n, k) C(k, q) = 2^(n−q) C(n, q)    (16b)

can be given a double counting proof as follows. The left side counts the number of ways of selecting a subset of [n]
= {1, 2, …, n} with at least q elements, and marking q elements among those selected. The right side counts the
same parameter, because there are ways of choosing a set of q marks and they occur in all subsets that
additionally contain some subset of the remaining elements, of which there are
In Pascal's rule

 C(n − 1, k − 1) + C(n − 1, k) = C(n, k)

both sides count the number of k-element subsets of [n], with the right hand side first grouping them into those that
contain element n and those that do not.
The identity (8) also has a combinatorial proof. The identity reads

 Σk=0..n C(n, k)² = C(2n, n)

Suppose you have 2n empty squares arranged in a row and you want to mark (select) n of them. There are C(2n, n)
ways to do this. On the other hand, you may select your n squares by selecting k squares from among the first n and
n − k squares from the remaining n squares; any k from 0 to n will work. This gives

Now apply (4) to get the result.


The identity (9),

has the following combinatorial proof. The number denotes the number of paths in a two-dimensional lattice
from to using steps and . This is easy to see: there are steps in total and
one may choose the steps. Now, replace each step by a step; note that there are exactly .
Then one arrives at point using steps and . Doing this for all between and gives

all paths from to using steps and . Clearly, there are exactly such paths.
Sum of coefficients row

The number of k-combinations for all k, Σ0≤k≤n C(n, k) = 2^n, is the sum of the nth row (counting from 0) of the
binomial coefficients. These combinations are enumerated by the 1 digits of the set of base 2 numbers counting from
0 to 2^n − 1, where each digit position is an item from the set of n.

Dixon's identity
Dixon's identity is

 Σk=−a..a (−1)^k C(2a, a + k)³ = (3a)! / (a!)³
or, more generally,

where a, b, and c are non-negative integers.

Continuous identities
Certain trigonometric integrals have values expressible in terms of binomial coefficients:
For and

(17)

(18)

(19)

These can be proved by using Euler's formula to convert trigonometric functions to complex exponentials, expanding
using the binomial theorem, and integrating term by term.

Generating functions

Ordinary generating functions

For a fixed n, the ordinary generating function of the sequence C(n, 0), C(n, 1), C(n, 2), … is:

 Σk C(n, k) x^k = (1 + x)^n

For a fixed k, the ordinary generating function of the sequence C(0, k), C(1, k), C(2, k), … is:

 Σn C(n, k) y^n = y^k / (1 − y)^(k+1)

The bivariate generating function of the binomial coefficients is:

 Σn,k C(n, k) x^k y^n = 1 / (1 − y − xy)

Another bivariate generating function of the binomial coefficients, which is symmetric, is:

 Σn,k C(n + k, k) x^k y^n = 1 / (1 − x − y)

Exponential generating function


The exponential bivariate generating function of the binomial coefficients is:

 Σn,k C(n + k, k) x^k y^n / (n + k)! = e^(x+y)
Divisibility properties
In 1852, Kummer proved that if m and n are nonnegative integers and p is a prime number, then the largest power of
p dividing C(m + n, m) equals p^c, where c is the number of carries when m and n are added in base p. Equivalently, the
exponent of a prime p in C(n, k) equals the number of nonnegative integers j such that the fractional part of k/p^j is
greater than the fractional part of n/p^j. It can be deduced from this that C(n, k) is divisible by n/gcd(n, k).
A somewhat surprising result by David Singmaster (1974) is that any integer divides almost all binomial
coefficients. More precisely, fix an integer d and let f(N) denote the number of binomial coefficients C(n, k) with
n < N such that d divides C(n, k). Then

 lim N→∞ f(N) / [N(N + 1)/2] = 1
Since the number of binomial coefficients with n < N is N(N+1) / 2, this implies that the density of binomial
coefficients divisible by d goes to 1.
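Kummer's carry criterion is easy to verify computationally; a sketch in Python that compares the carry count with the exact prime valuation of C(m + n, m) (function names are illustrative):

```python
from math import comb

def carries(m, n, p):
    # Number of carries when adding m and n in base p
    count, carry = 0, 0
    while m > 0 or n > 0 or carry:
        carry = 1 if (m % p) + (n % p) + carry >= p else 0
        count += carry
        m //= p
        n //= p
    return count

def valuation(x, p):
    # Exponent of p in the prime factorization of x
    v = 0
    while x % p == 0:
        x //= p
        v += 1
    return v
```

For example, C(10, 4) = 210 = 2·3·5·7, and adding 4 and 6 produces exactly one carry in each of the bases 2, 3, 5, and 7.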
Another fact: An integer n ≥ 2 is prime if and only if all the intermediate binomial coefficients

 C(n, 1), C(n, 2), …, C(n, n − 1)

are divisible by n.
Proof: When p is prime, p divides

 C(p, k) = p(p − 1)⋯(p − k + 1) / k!

for all 0 < k < p

because it is a natural number and the numerator has a prime factor p but the denominator does not have a prime
factor p.
When n is composite, let p be the smallest prime factor of n and let k = n/p. Then 0 < p < n and

 C(n, p) = n(n − 1)(n − 2)⋯(n − p + 1)/p! = k(n − 1)(n − 2)⋯(n − p + 1)/(p − 1)! ≢ 0 (mod n)
otherwise the numerator k(n−1)(n−2)×...×(n−p+1) has to be divisible by n = k×p, this can only be the case when
(n−1)(n−2)×...×(n−p+1) is divisible by p. But n is divisible by p, so p does not divide n−1, n−2, ..., n−p+1 and
because p is prime, we know that p does not divide (n−1)(n−2)×...×(n−p+1) and so the numerator cannot be divisible
by n.

Bounds and asymptotic formulas


The following bounds for C(n, k) hold:

 (n/k)^k ≤ C(n, k) ≤ n^k/k! ≤ (en/k)^k
Stirling's approximation yields the bounds:

 4^n/√(4n) ≤ C(2n, n) ≤ 4^n/√(πn)
and, in general, for m ≥ 2 and n ≥ 1,

and the approximation

as

The infinite product formula (cf. Gamma function, alternative definition)

yields the asymptotic formulas

as .
This asymptotic behaviour is contained in the approximation

as well. (Here is the k-th harmonic number and is the Euler–Mascheroni constant).
The sum of binomial coefficients can be bounded by a term exponential in n and the binary entropy of the largest
k/n that occurs. More precisely, for n ≥ 1 and 0 < k ≤ n/2, it holds

 Σi=0..k C(n, i) ≤ 2^(n·H(k/n))

where H(ε) = −ε log2 ε − (1 − ε) log2(1 − ε) is the binary entropy of ε.[8]


A simple and rough upper bound for the sum of binomial coefficients is given by the formula below (not difficult to
prove)
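The entropy bound above can be checked numerically over a range of parameters; a sketch in Python, where binary_entropy is the function H (all names are illustrative):

```python
from math import comb, log2

def binary_entropy(x):
    # H(x) = -x*log2(x) - (1-x)*log2(1-x), with H(0) = H(1) = 0
    if x in (0.0, 1.0):
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)

def partial_sum(n, k):
    # sum_{i=0..k} C(n, i)
    return sum(comb(n, i) for i in range(k + 1))

# Check sum_{i<=k} C(n, i) <= 2^(n*H(k/n)) for 1 <= k <= n/2
ok = all(partial_sum(n, k) <= 2 ** (n * binary_entropy(k / n))
         for n in range(1, 40) for k in range(1, n // 2 + 1))
```

This inequality is the standard tool for bounding the volume of Hamming balls in coding theory.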

Generalizations

Generalization to multinomials
Binomial coefficients can be generalized to multinomial coefficients. They are defined to be the number:

 (n; k1, k2, …, kr) = n! / (k1! k2! ⋯ kr!)

where

 k1 + k2 + ⋯ + kr = n
While the binomial coefficients represent the coefficients of (x+y)n, the multinomial coefficients represent the
coefficients of the polynomial

See multinomial theorem. The case r = 2 gives binomial coefficients:

The combinatorial interpretation of multinomial coefficients is distribution of n distinguishable elements over r


(distinguishable) containers, each containing exactly ki elements, where i is the index of the container.
Multinomial coefficients have many properties similar to these of binomial coefficients, for example the recurrence
relation:

and symmetry:

where is a permutation of (1,2,...,r).

Generalization to negative integers

If n is a negative integer −m (with m a positive integer), the binomial coefficient extends to all nonnegative integers
k via

 C(−m, k) = (−1)^k C(m + k − 1, k)

In the special case m = 1, this reduces to

 C(−1, k) = (−1)^k

Taylor series
Using Stirling numbers of the first kind the series expansion around any arbitrarily chosen point is

Binomial coefficient with n=1/2


The definition of the binomial coefficients can be extended to the case where n is real and k is an integer.
In particular, the following identity holds for any non-negative integer k:

 C(1/2, k) = C(2k, k) · (−1)^(k+1) / [4^k (2k − 1)]
This shows up when expanding into a power series using the Newton binomial series :

Identity for the product of binomial coefficients


One can express the product of binomial coefficients as a linear combination of binomial coefficients:

where the connection coefficients are multinomial coefficients. In terms of labelled combinatorial objects, the
connection coefficients represent the number of ways to assign m+n-k labels to a pair of labelled combinatorial
objects—of weight m and n respectively—that have had their first k labels identified, or glued together to get a new
labelled combinatorial object of weight m+n-k. (That is, to separate the labels into three portions to apply to the
glued part, the unglued part of the first object, and the unglued part of the second object.) In this regard, binomial
coefficients are to exponential generating series what falling factorials are to ordinary generating series.

Partial Fraction Decomposition


The partial fraction decomposition of the inverse is given by

and

Newton's binomial series


Newton's binomial series, named after Sir Isaac Newton, is one of the simplest Newton series:

 (1 + z)^α = Σn=0..∞ C(α, n) z^n
The identity can be obtained by showing that both sides satisfy the differential equation (1+z) f'(z) = α f(z).

The radius of convergence of this series is 1. An alternative expression is

 1/(1 − z)^(α+1) = Σn=0..∞ C(n + α, n) z^n

where the identity

 C(n + α, n) = (−1)^n C(−α − 1, n)

is applied.

Two real or complex valued arguments


The binomial coefficient is generalized to two real or complex valued arguments using the gamma function or beta
function via

 C(x, y) = Γ(x + 1) / [Γ(y + 1) Γ(x − y + 1)] = 1 / [(x + 1) B(y + 1, x − y + 1)]
This definition inherits the following additional properties from the gamma function:

moreover,

The resulting function has been little-studied, apparently first being graphed in (Fowler 1996). Notably, many
binomial identities fail: C(n, m) = C(n, n − m) holds, but C(−n, m) ≠ C(−n, −n − m) for n positive (so −n negative). The behavior is

quite complex, and markedly different in various octants (that is, with respect to the x and y axes and the line
), with the behavior for negative x having singularities at negative integer values and a checkerboard of
positive and negative regions:
• in the octant it is a smoothly interpolated form of the usual binomial, with a ridge ("Pascal's ridge").
• in the octant and in the quadrant the function is close to zero.
• in the quadrant the function is alternatingly very large positive and negative on the
parallelograms with vertices
• in the octant the behavior is again alternatingly very large positive and negative, but on a square
grid.
• in the octant it is close to zero, except for near the singularities.

Generalization to q-series
The binomial coefficient has a q-analog generalization known as the Gaussian binomial coefficient.

Generalization to infinite cardinals


The definition of the binomial coefficient can be generalized to infinite cardinals by defining:

where A is some set with cardinality . One can show that the generalized binomial coefficient is well-defined, in

the sense that no matter what set we choose to represent the cardinal number , will remain the same. For

finite cardinals, this definition coincides with the standard definition of the binomial coefficient.

Assuming the Axiom of Choice, one can show that C(α, α) = 2^α for any infinite cardinal α.

Binomial coefficient in programming languages

The notation is convenient in handwriting but inconvenient for typewriters and computer terminals. Many

programming languages do not offer a standard subroutine for computing the binomial coefficient, but for example
the J programming language uses the exclamation mark: k ! n .
Naive implementations of the factorial formula, such as the following snippet in Python:

def binomialCoefficient(n, k):
    from math import factorial
    return factorial(n) // (factorial(k) * factorial(n - k))

are very slow and are useless for calculating factorials of very high numbers (in languages such as C or Java they suffer
from overflow errors because of this reason). A direct implementation of the multiplicative formula works well:

def binomialCoefficient(n, k):
    if k < 0 or k > n:
        return 0
    if k > n - k:  # take advantage of symmetry
        k = n - k
    c = 1
    for i in range(k):
        c = c * (n - (k - (i + 1)))
        c = c // (i + 1)
    return c

(Notice that range(k) produces the integers from 0 to k − 1 and, as a consequence, we need to use i + 1 in the above
function.) The example mentioned above can also be written in functional style. The following Scheme example uses the recursive
definition

C(n, k + 1) = C(n, k) · (n − k) / (k + 1).

Rational arithmetic can be easily avoided using integer division, since at every step the product C(n, k) · (n − k) is exactly divisible by k + 1:

C(n, k + 1) = (C(n, k) · (n − k)) ÷ (k + 1).

The following implementation uses all these ideas:

(define (binomial n k)
;; Helper function to compute C(n,k) via forward recursion
(define (binomial-iter n k i prev)
(if (>= i k)
prev
(binomial-iter n k (+ i 1) (/ (* (- n i) prev) (+ i 1)))))
;; Use symmetry property C(n,k)=C(n, n-k)
(if (< k (- n k))
(binomial-iter n k 0 1)
(binomial-iter n (- n k) 0 1)))

Another way to compute the binomial coefficient when using large numbers is to recognize that

C(n, k) = exp( lnΓ(n + 1) − lnΓ(k + 1) − lnΓ(n − k + 1) ),

where lnΓ denotes the natural logarithm of the gamma function. It is a special function that is easily
computed and is standard in some programming languages, such as log_gamma in Maxima, LogGamma in
Mathematica, or gammaln in MATLAB. Roundoff error may cause the returned value to not be an integer.
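A sketch of this approach in Python, where the standard math.lgamma plays the role of lnΓ (the function name here is ours, not standard):

```python
from math import exp, lgamma

def binomial_lgamma(n, k):
    """Approximate C(n, k) via the log-gamma function; the result is a
    float and may need rounding to recover the exact integer."""
    if k < 0 or k > n:
        return 0.0
    return exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1))

# For moderate inputs the rounded value matches the exact coefficient:
print(round(binomial_lgamma(52, 5)))  # 2598960 ways to choose a 5-card hand
```

For very large n the float result loses its low-order digits, which is the roundoff issue mentioned above.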

Notes
[1] Higham (1998)
[2] Lilavati Section 6, Chapter 4 (see Knuth (1997)).
[3] Shilov (1977)
[4] See (Graham, Knuth & Patashnik 1994), which also defines C(n, k) = 0 for k < 0. Alternative generalizations, such as to two real or
complex valued arguments using the Gamma function, assign nonzero values to C(n, k) for k < 0, but this causes most binomial coefficient
identities to fail, and thus is not widely used. One such choice of nonzero values leads to the aesthetically pleasing
"Pascal windmill" in Hilton, Holton and Pedersen, Mathematical reflections: in a room with many mirrors, Springer, 1997, but causes even
Pascal's identity to fail (at the origin).
[5] Muir, Thomas (1902). "Note on Selected Combinations" (http://books.google.com/books/reader?id=EN8vAAAAIAAJ&output=reader&pg=GBS.PA102). Proceedings of the Royal Society of Edinburgh.
[6] This can be seen as a discrete analog of Taylor's theorem. It is closely related to Newton's polynomial. Alternating sums of this form may be
expressed as the Nörlund–Rice integral.
[7] Ruiz, Sebastian (1996). "An Algebraic Identity Leading to Wilson's Theorem" (http://www.jstor.org/stable/3618534). The Mathematical
Gazette 80 (489): 579–582.
[8] see e.g. Flum & Grohe (2006, p. 427)

References
• Benjamin, Arthur T.; Quinn, Jennifer (2003). Proofs that Really Count: The Art of Combinatorial Proof (https://
www.maa.org/EbusPPRO/Bookstore/ProductDetail/tabid/170/Default.aspx?ProductId=675), Mathematical
Association of America.
• Bryant, Victor (1993). Aspects of combinatorics. Cambridge University Press. ISBN 0-521-41974-3.
• Flum, Jörg; Grohe, Martin (2006). Parameterized Complexity Theory (http://www.springer.com/east/home/
generic/search/results?SGWID=5-40109-22-141358322-0). Springer. ISBN 978-3-540-29952-3.
• Fowler, David (January 1996). "The Binomial Coefficient Function". The American Mathematical Monthly
(Mathematical Association of America) 103 (1): 1–17. doi:10.2307/2975209. JSTOR 2975209
• Graham, Ronald L.; Knuth, Donald E.; Patashnik, Oren (1994). Concrete Mathematics (Second ed.).
Addison-Wesley. pp. 153–256. ISBN 0-201-55802-5.
• Higham, Nicholas J. (1998). Handbook of writing for the mathematical sciences. SIAM. p. 25.
ISBN 0-89871-420-6.
• Knuth, Donald E. (1997). The Art of Computer Programming, Volume 1: Fundamental Algorithms (Third ed.).
Addison-Wesley. pp. 52–74. ISBN 0-201-89683-4.
• Singmaster, David (1974). "Notes on binomial coefficients. III. Any integer divides almost all binomial
coefficients". J. London Math. Soc. (2) 8 (3): 555–560. doi:10.1112/jlms/s2-8.3.555.
• Shilov, G. E. (1977). Linear algebra. Dover Publications. ISBN 978-0-486-63518-7.

External links
• Calculation of Binomial Coefficient (http://www.stud.feec.vutbr.cz/~xvapen02/vypocty/komb.
php?language=english)
This article incorporates material from the following PlanetMath articles, which are licensed under the Creative
Commons Attribution/Share-Alike License: Binomial Coefficient, Bounds for binomial coefficients, Proof that C(n,k)
is an integer, Generalized binomial coefficients.
Binomial distribution 59

Binomial distribution
Probability mass function

Cumulative distribution function

Notation B(n, p)
Parameters n ∈ N0 — number of trials
p ∈ [0,1] — success probability in each trial
Support k ∈ { 0, …, n } — number of successes
PMF C(n, k) p^k (1 − p)^(n−k)
CDF I_{1−p}(n − k, k + 1)
Mean np
Median ⌊np⌋ or ⌈np⌉
Mode ⌊(n + 1)p⌋ or ⌊(n + 1)p⌋ − 1
Variance np(1 − p)
Skewness (1 − 2p) / √(np(1 − p))
Ex. kurtosis (1 − 6p(1 − p)) / (np(1 − p))
Entropy (1/2) ln(2πe np(1 − p)) + O(1/n)
MGF (1 − p + p e^t)^n
CF (1 − p + p e^{it})^n
PGF (1 − p + p z)^n

In probability theory and statistics, the


binomial distribution is the discrete
probability distribution of the number
of successes in a sequence of n
independent yes/no experiments, each
of which yields success with
probability p. Such a success/failure
experiment is also called a Bernoulli
experiment or Bernoulli trial; when n =
1, the binomial distribution is a
Bernoulli distribution. The binomial
distribution is the basis for the popular
binomial test of statistical significance.
Figure: binomial distributions for three parameter choices (blue, green and red).

The binomial distribution is frequently used to model the number of successes
in a sample of size n drawn with
replacement from a population of size
N. If the sampling is carried out
without replacement, the draws are not
independent and so the resulting
distribution is a hypergeometric
distribution, not a binomial one.
However, for N much larger than n, the
binomial distribution is a good
approximation, and widely used.

Specification

Probability mass function


Figure: the probability that a ball in a Galton box with 8 layers (n = 8) ends up in the central bin (k = 4) is C(8, 4)/2^8 = 70/256.

In general, if the random variable K follows the binomial distribution with parameters n and p, we write
K ~ B(n, p). The probability of getting exactly k successes in n trials is given by the probability mass function:

f(k; n, p) = Pr(K = k) = C(n, k) p^k (1 − p)^(n−k)

for k = 0, 1, 2, ..., n, where

C(n, k) = n! / (k! (n − k)!)

is the binomial coefficient (hence the name of the distribution) "n choose k", also denoted C(n, k) or nCk. The
formula can be understood as follows: we want k successes (probability p^k) and n − k failures (probability (1 − p)^(n−k)). However, the k
successes can occur anywhere among the n trials, and there are C(n, k) different ways of distributing k successes in a
sequence of n trials.
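A minimal sketch of the probability mass function, checking the Galton-box probability mentioned in the figure caption (the helper name is ours):

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# A ball falling through a Galton box with 8 rows of pins goes left or
# right with probability 1/2 at each row, so its final bin is
# B(8, 0.5)-distributed; the central bin has probability 70/256.
print(binomial_pmf(4, 8, 0.5))  # 0.2734375 == 70/256

# The PMF sums to 1 over the support {0, ..., n}:
assert abs(sum(binomial_pmf(k, 8, 0.5) for k in range(9)) - 1) < 1e-12
```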

In creating reference tables for binomial distribution probability, usually the table is filled in up to n/2 values. This is
because for k > n/2, the probability can be calculated by its complement as

f(k, n, p) = f(n − k, n, 1 − p).

Looking at the expression f(k, n, p) as a function of k, there is a k value that maximizes it. This k value can be found
by calculating

f(k + 1, n, p) / f(k, n, p) = (n − k) p / ((k + 1)(1 − p))

and comparing it to 1. There is always an integer M that satisfies

(n + 1)p − 1 ≤ M < (n + 1)p.

f(k, n, p) is monotone increasing for k < M and monotone decreasing for k > M, with the exception of the case where
(n + 1)p is an integer. In this case, there are two values for which f is maximal: (n + 1)p and (n + 1)p − 1. M is the
most probable (most likely) outcome of the Bernoulli trials and is called the mode. Note that the probability of it
occurring can be fairly small.
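The ratio comparison above translates directly into a short mode-finding sketch (the function name is ours):

```python
from math import floor

def binomial_mode(n, p):
    """Walk k upward while the ratio f(k+1)/f(k) = (n-k)p / ((k+1)(1-p))
    exceeds 1; the last k reached is a mode of B(n, p).
    (When (n + 1)p is an integer there are two modes; the strict
    inequality makes this return the smaller one, (n + 1)p - 1.)"""
    k = 0
    while k < n and (n - k) * p > (k + 1) * (1 - p):
        k += 1
    return k

# Agrees with the closed form floor((n + 1) p) when (n + 1) p is not an integer:
assert binomial_mode(10, 0.3) == floor(11 * 0.3) == 3
```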

Cumulative distribution function


The cumulative distribution function can be expressed as:

F(k; n, p) = Pr(X ≤ k) = Σ_{i=0}^{⌊k⌋} C(n, i) p^i (1 − p)^(n−i),

where ⌊k⌋ is the "floor" under k, i.e. the greatest integer less than or equal to k.
It can also be represented in terms of the regularized incomplete beta function, as follows:

F(k; n, p) = I_{1−p}(n − k, k + 1) = (n − k) C(n, k) ∫_0^{1−p} t^(n−k−1) (1 − t)^k dt.

For k ≤ np, upper bounds for the lower tail of the distribution function can be derived. In particular, Hoeffding's
inequality yields the bound

F(k; n, p) ≤ exp(−2n (p − k/n)²),

and Chernoff's inequality can be used to derive the bound

F(k; n, p) ≤ exp(−n D(k/n ‖ p)),

where D(a ‖ p) = a ln(a/p) + (1 − a) ln((1 − a)/(1 − p)) is the relative entropy between an a-coin and a p-coin.
Moreover, these bounds are reasonably tight when p = 1/2, since the following expression holds for all k ≥ 3n/8:[1]

F(k; n, 1/2) ≥ (1/15) exp(−16n (1/2 − k/n)²).

Mean and variance


If X ~ B(n, p) (that is, X is a binomially distributed random variable), then the expected value of X is

E[X] = np,

and the variance is

Var(X) = np(1 − p).

Mode and median


Usually the mode of a binomial B(n, p) distribution is equal to ⌊(n + 1)p⌋, where ⌊·⌋ is the floor function.
However when (n + 1)p is an integer and p is neither 0 nor 1, then the distribution has two modes: (n + 1)p and
(n + 1)p − 1. When p is equal to 0 or 1, the mode will be 0 and n correspondingly. These cases can be summarized as
follows:

mode = ⌊(n + 1)p⌋ if (n + 1)p is 0 or a noninteger;
mode = (n + 1)p and (n + 1)p − 1 if (n + 1)p ∈ {1, …, n};
mode = n if (n + 1)p = n + 1.

In general, there is no single formula to find the median for a binomial distribution, and it may even be non-unique.
However several special results have been established:
• If np is an integer, then the mean, median, and mode coincide and equal np.[2][3]
• Any median m must lie within the interval ⌊np⌋ ≤ m ≤ ⌈np⌉.[4]
• A median m cannot lie too far away from the mean: |m − np| ≤ min{ ln 2, max{p, 1 − p} }.[5]
• The median is unique and equal to m = round(np) in cases when either p ≤ 1 − ln 2 or p ≥ ln 2 or
|m − np| ≤ min{p, 1 − p} (except for the case when p = ½ and n is odd).[4][5]
• When p = 1/2 and n is odd, any number m in the interval ½(n − 1) ≤ m ≤ ½(n + 1) is a median of the binomial
distribution. If p = 1/2 and n is even, then m = n/2 is the unique median.

Covariance between two binomials


If two binomially distributed random variables X and Y are observed together, estimating their covariance can be
useful. Using the definition of covariance, in the case n = 1 (thus being Bernoulli trials) we have

Cov(X, Y) = E[XY] − μX μY.

The first term is non-zero only when both X and Y are one, and μX and μY are equal to the two probabilities. Defining
pB as the probability of both happening at the same time, this gives

Cov(X, Y) = pB − pX pY,

and for n such trials again due to independence

Cov(X, Y) = n (pB − pX pY).

If X and Y are the same variable, this reduces to the variance formula given above.

Relationship to other distributions

Sums of binomials
If X ~ B(n, p) and Y ~ B(m, p) are independent binomial variables with the same probability p, then X + Y is again a
binomial variable; its distribution is

X + Y ~ B(n + m, p).

Conditional binomials
If X ~ B(n, p) and, conditional on X, Y ~ B(X, q), then Y is a simple binomial variable with distribution

Y ~ B(n, pq).
Bernoulli distribution
The Bernoulli distribution is a special case of the binomial distribution, where n = 1. Symbolically, X ~ B(1, p) has
the same meaning as X ~ Bern(p). Conversely, any binomial distribution, B(n, p), is the sum of n independent
Bernoulli trials, Bern(p), each with the same probability p.

Poisson binomial distribution


The binomial distribution is a special case of the Poisson binomial distribution, which is a sum of n independent
non-identical Bernoulli trials Bern(pi). If X has the Poisson binomial distribution with p1 = … = pn =p then
X ~ B(n, p).

Normal approximation
If n is large enough, then the skew of the distribution is not too great. In this case a reasonable approximation to
B(n, p) is given by the normal distribution

N(np, np(1 − p)),

and this basic approximation can be improved in a simple way by using a suitable continuity correction.
The basic approximation generally improves as n increases (at least 20) and is better when p is not near to
0 or 1.[6]

Figure: binomial PDF and normal approximation for n = 6 and p = 0.5.

Various rules of thumb may be used to decide whether n is large enough, and p is far enough from the
extremes of zero or one:
• One rule is that both np and n(1 − p) must be greater than 5. However, the specific number varies
from source to source, and depends on how good an approximation one wants; some sources give 10,
which gives virtually the same results as the following rule for large n until n is very large (e.g. np = 11, n = 7752).

• A second rule[6] is that for n > 5 the normal approximation is adequate if

|(1/√n)(√((1 − p)/p) − √(p/(1 − p)))| < 0.3.
• Another commonly used rule holds that the normal approximation is appropriate only if everything within 3
standard deviations of its mean is within the range of possible values, that is if

np ± 3√(np(1 − p)) ∈ (0, n).

The following is an example of applying a continuity correction. Suppose one wishes to calculate Pr(X ≤ 8) for a
binomial random variable X. If Y has a distribution given by the normal approximation, then Pr(X ≤ 8) is
approximated by Pr(Y ≤ 8.5). The addition of 0.5 is the continuity correction; the uncorrected normal approximation
gives considerably less accurate results.
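The text leaves n and p unspecified; as an illustration, take X ~ B(20, 0.5) (a hypothetical choice). A sketch comparing the exact probability with the corrected and uncorrected normal approximations, using math.erf for the normal CDF:

```python
from math import comb, erf, sqrt

def binom_cdf(k, n, p):
    """Exact Pr(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_cdf(x, mu, sigma):
    """Normal CDF via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, p = 20, 0.5                      # hypothetical parameters for illustration
mu, sigma = n * p, sqrt(n * p * (1 - p))

exact       = binom_cdf(8, n, p)            # Pr(X <= 8)
corrected   = normal_cdf(8.5, mu, sigma)    # Pr(Y <= 8.5), with correction
uncorrected = normal_cdf(8.0, mu, sigma)    # Pr(Y <= 8),   without

# The continuity-corrected approximation is far closer to the exact value.
assert abs(corrected - exact) < abs(uncorrected - exact)
```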
This approximation, known as de Moivre–Laplace theorem, is a huge time-saver when undertaking calculations by
hand (exact calculations with large n are very onerous); historically, it was the first use of the normal distribution,
introduced in Abraham de Moivre's book The Doctrine of Chances in 1738. Nowadays, it can be seen as a
consequence of the central limit theorem since B(n, p) is a sum of n independent, identically distributed Bernoulli
variables with parameter p. This fact is the basis of a hypothesis test, a "proportion z-test," for the value of p using
x/n, the sample proportion and estimator of p, in a common test statistic.[7]
For example, suppose one randomly samples n people out of a large population and asks them whether they agree
with a certain statement. The proportion of people who agree will of course depend on the sample. If groups of n
people were sampled repeatedly and truly randomly, the proportions would follow an approximate normal
distribution with mean equal to the true proportion p of agreement in the population and with standard deviation
σ = (p(1 − p)/n)1/2. Large sample sizes n are good because the standard deviation, as a proportion of the expected
value, gets smaller, which allows a more precise estimate of the unknown parameter p.

Poisson approximation
The binomial distribution converges towards the Poisson distribution as the number of trials goes to infinity while
the product np remains fixed. Therefore the Poisson distribution with parameter λ = np can be used as an
approximation to B(n, p) of the binomial distribution if n is sufficiently large and p is sufficiently small. According
to two rules of thumb, this approximation is good if n ≥ 20 and p ≤ 0.05, or if n ≥ 100 and np ≤ 10.[8]
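A sketch of the quality of this approximation at n = 100, p = 0.02 (so λ = 2, within the rules of thumb above):

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

n, p = 100, 0.02
lam = n * p  # 2.0

# The largest pointwise disagreement over the whole support is small:
max_err = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam))
              for k in range(n + 1))
print(max_err < 0.01)  # True
```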

Limiting distributions
• Poisson limit theorem: As n approaches ∞ and p approaches 0 while np remains fixed at λ > 0 or at least np
approaches λ > 0, then the Binomial(n, p) distribution approaches the Poisson distribution with expected value λ.
• de Moivre–Laplace theorem: As n approaches ∞ while p remains fixed, the distribution of

(X − np) / √(np(1 − p))

approaches the normal distribution with expected value 0 and variance 1. This result is sometimes loosely
stated by saying that the distribution of X is asymptotically normal with expected value np and
variance np(1 − p). This result is a specific case of the central limit theorem.

Confidence intervals
Even for quite large values of n, the actual distribution of the mean is significantly nonnormal.[9] Because of this
problem several methods to estimate confidence intervals have been proposed.
Let n1 be the number of successes out of n, the total number of trials, and let

p̂ = n1 / n

be the proportion of successes. Let z denote zα/2, the 100(1 − α/2)th percentile of the standard normal distribution.
• Wald method

p̂ ± z √( p̂ (1 − p̂) / n ).

A continuity correction of 0.5/n may be added.

• Agresti-Coull method[10]

p̃ ± z √( p̃ (1 − p̃) / (n + z²) ).

Here the estimate of p is modified to

p̃ = (n1 + z²/2) / (n + z²).

• ArcSine method[11]

sin²( arcsin(√p̂) ± z / (2√n) ).

• Wilson (score) method[12]

( p̂ + z²/(2n) ± z √( p̂ (1 − p̂)/n + z²/(4n²) ) ) / ( 1 + z²/n ).

The exact (Clopper-Pearson) method is the most conservative.[9] The Wald method, although commonly
recommended in text books, is the most biased.
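As a sketch (not a library implementation), the Wald and Wilson score intervals can be computed directly from their standard formulas; z = 1.96 gives an approximate 95% interval:

```python
from math import sqrt

def wald_interval(n1, n, z=1.96):
    """Wald interval: p_hat +/- z * sqrt(p_hat (1 - p_hat) / n)."""
    p = n1 / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(n1, n, z=1.96):
    """Wilson score interval; its midpoint is shrunk toward 1/2."""
    p = n1 / n
    center = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

lo_w, hi_w = wald_interval(8, 10)    # (~0.552, ~1.048): escapes [0, 1]
lo_s, hi_s = wilson_interval(8, 10)  # stays inside [0, 1]
```

The small-sample case n1 = 8, n = 10 illustrates the bias remark above: the Wald interval can extend past 1, while the Wilson interval cannot.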

Generating binomial random variates


Methods for random number generation where the marginal distribution is a binomial distribution are
well-established.[13][14]

References
[1] Matoušek, J, Vondrak, J: The Probabilistic Method (lecture notes) (http://kam.mff.cuni.cz/~matousek/prob-ln.ps.gz).
[2] Neumann, P. (1966). "Über den Median der Binomial- and Poissonverteilung" (in German). Wissenschaftliche Zeitschrift der Technischen
Universität Dresden 19: 29–33.
[3] Lord, Nick. (July 2010). "Binomial averages when the mean is an integer", The Mathematical Gazette 94, 331-332.
[4] Kaas, R.; Buhrman, J.M. (1980). "Mean, Median and Mode in Binomial Distributions". Statistica Neerlandica 34 (1): 13–18.
doi:10.1111/j.1467-9574.1980.tb00681.x.
[5] Hamza, K. (1995). "The smallest uniform upper bound on the distance between the mean and the median of the binomial and Poisson
distributions". Statistics & Probability Letters 23: 21–25. doi:10.1016/0167-7152(94)00090-U.
[6] Box, Hunter and Hunter (1978). Statistics for experimenters. Wiley. p. 130.
[7] NIST/SEMATECH, "7.2.4. Does the proportion of defectives meet requirements?" (http://www.itl.nist.gov/div898/handbook/prc/section2/prc24.htm), e-Handbook of Statistical Methods.
[8] NIST/SEMATECH, "6.3.3.1. Counts Control Charts" (http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc331.htm), e-Handbook of Statistical Methods.
[9] Brown LD, Cai T. and DasGupta A (2001). Interval estimation for a binomial proportion (with discussion). Statist Sci 16: 101–133
[10] Agresti A, Coull BA (1998) "Approximate is better than 'exact' for interval estimation of binomial proportions". The American Statistician
52:119–126
[11] Pires MA () Confidence intervals for a binomial proportion: comparison of methods and software evaluation.
[12] Wilson EB (1927) "Probable inference, the law of succession, and statistical inference". Journal of the American Statistical Association 22:
209–212
[13] Devroye, Luc (1986) Non-Uniform Random Variate Generation, New York: Springer-Verlag. (See especially Chapter X, Discrete
Univariate Distributions (http://cg.scs.carleton.ca/~luc/chapter_ten.pdf))
[14] Kachitvichyanukul, V.; Schmeiser, B. W. (1988). "Binomial random variate generation". Communications of the ACM 31 (2): 216–222.
doi:10.1145/42372.42381.
Cauchy distribution 66

Cauchy distribution
Cauchy

Probability density function

The purple curve is the standard Cauchy distribution

Cumulative distribution function

Parameters x0 — location (real)
γ > 0 — scale (real)
Support x ∈ (−∞, +∞)
PDF 1 / (πγ [1 + ((x − x0)/γ)²])
CDF (1/π) arctan((x − x0)/γ) + 1/2
Mean undefined
Median x0
Mode x0
Variance undefined
Skewness undefined
Ex. kurtosis undefined
Entropy ln(4πγ)
MGF does not exist
CF exp(x0 i t − γ |t|)

The Cauchy distribution, named after Augustin Cauchy, is a continuous probability distribution. It is also known,
especially among physicists, as the Lorentz distribution (after Hendrik Lorentz), Cauchy–Lorentz distribution,
Lorentz(ian) function, or Breit–Wigner distribution.

The Cauchy distribution is often used in statistics as the canonical example of a "pathological" distribution. Its mean
does not exist and its variance is infinite. The Cauchy distribution does not have finite moments of order greater than
or equal to one; only fractional absolute moments exist.[1] The Cauchy distribution has no moment generating
function.
Its importance in physics is the result of its being the solution to the differential equation describing forced
resonance.[2] In mathematics, it is closely related to the Poisson kernel, which is the fundamental solution for the
Laplace equation in the upper half-plane. In spectroscopy, it is the description of the shape of spectral lines which are
subject to homogeneous broadening in which all atoms interact in the same way with the frequency range contained
in the line shape. Many mechanisms cause homogeneous broadening, most notably collision broadening, and
Chantler–Alda radiation.[3] In its standard form, it is the maximum entropy probability distribution for a random
variate X for which E[ln(1 + X²)] = ln 4.[4]

Characterisation

Probability density function


The Cauchy distribution has the probability density function

f(x; x0, γ) = 1 / (πγ [1 + ((x − x0)/γ)²]) = (1/π) [ γ / ((x − x0)² + γ²) ],

where x0 is the location parameter, specifying the location of the peak of the distribution, and γ is the scale parameter
which specifies the half-width at half-maximum (HWHM). γ is also equal to half the interquartile range and is
sometimes called the probable error. Augustin-Louis Cauchy exploited such a density function in 1827 with an
infinitesimal scale parameter, defining what would now be called a Dirac delta function.
The amplitude of the above Lorentzian function is given by

1 / (πγ).

The special case when x0 = 0 and γ = 1 is called the standard Cauchy distribution with the probability density
function

f(x; 0, 1) = 1 / (π (1 + x²)).

In physics, a three-parameter Lorentzian function is often used:

f(x; x0, γ, I) = I [ γ² / ((x − x0)² + γ²) ],

where I is the height of the peak.



Cumulative distribution function


The cumulative distribution function is:

F(x; x0, γ) = (1/π) arctan((x − x0)/γ) + 1/2,

and the quantile function (inverse cdf) of the Cauchy distribution is

Q(p; x0, γ) = x0 + γ tan(π(p − 1/2)).

It follows that the first and third quartiles are x0 − γ and x0 + γ, and hence the interquartile range is 2γ.
The derivative of the quantile function, the quantile density function, for the Cauchy distribution is:

Q′(p; γ) = γπ sec²(π(p − 1/2)) = γπ / sin²(πp).

The differential entropy of a distribution can be defined in terms of its quantile density,[5] specifically

h(X) = ∫₀¹ ln Q′(p) dp, which for the Cauchy distribution evaluates to ln(4πγ).

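A minimal sketch of the density, distribution function, and quantile function in Python (function names are ours):

```python
from math import atan, pi, tan

def cauchy_pdf(x, x0=0.0, gamma=1.0):
    """Density 1 / (pi * gamma * (1 + ((x - x0)/gamma)^2))."""
    return 1.0 / (pi * gamma * (1.0 + ((x - x0) / gamma) ** 2))

def cauchy_cdf(x, x0=0.0, gamma=1.0):
    """CDF (1/pi) * arctan((x - x0)/gamma) + 1/2."""
    return atan((x - x0) / gamma) / pi + 0.5

def cauchy_quantile(p, x0=0.0, gamma=1.0):
    """Inverse CDF x0 + gamma * tan(pi * (p - 1/2))."""
    return x0 + gamma * tan(pi * (p - 0.5))

# The quartiles are x0 - gamma and x0 + gamma, so the IQR is 2*gamma:
assert abs(cauchy_quantile(0.25, 2.0, 3.0) - (2.0 - 3.0)) < 1e-9
assert abs(cauchy_quantile(0.75, 2.0, 3.0) - (2.0 + 3.0)) < 1e-9
# cdf and quantile are inverses of each other:
assert abs(cauchy_cdf(cauchy_quantile(0.3)) - 0.3) < 1e-9
```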
Properties
The Cauchy distribution is an example of a distribution which has no mean, variance or higher moments defined. Its
mode and median are well defined and are both equal to x0.
When U and V are two independent normally distributed random variables with expected value 0 and variance 1,
then the ratio U/V has the standard Cauchy distribution.
If X1, …, Xn are independent and identically distributed random variables, each with a standard Cauchy distribution, then the
sample mean X̄ = (X1 + ⋯ + Xn)/n has the same standard Cauchy distribution (the sample median, which is not
affected by extreme values, can be used as a measure of central tendency). To see that this is true, compute the
characteristic function of the sample mean:

φ_X̄(t) = E[e^{i X̄ t}] = (φ_X(t/n))^n = (e^{−|t|/n})^n = e^{−|t|},

where X̄ is the sample mean. This example serves to show that the hypothesis of finite variance in the central limit
theorem cannot be dropped. It is also an example of a more generalized version of the central limit theorem that is
characteristic of all stable distributions, of which the Cauchy distribution is a special case.
The Cauchy distribution is an infinitely divisible probability distribution. It is also a strictly stable distribution.[6]
The standard Cauchy distribution coincides with the Student's t-distribution with one degree of freedom.
Like all stable distributions, the location-scale family to which the Cauchy distribution belongs is closed under linear
transformations with real coefficients. In addition, the Cauchy distribution is the only univariate distribution which is
closed under linear fractional transformations with real coefficients.[7] In this connection, see also McCullagh's
parametrization of the Cauchy distributions.

Characteristic function
Let X denote a Cauchy distributed random variable. The characteristic function of the Cauchy distribution is given
by

φ_X(t) = E[e^{iXt}] = exp(i x0 t − γ |t|),

which is just the Fourier transform of the probability density. The original probability density may be expressed in
terms of the characteristic function, essentially by using the inverse Fourier transform:

f(x; x0, γ) = (1/2π) ∫_{−∞}^{+∞} φ_X(t) e^{−ixt} dt.

Observe that the characteristic function is not differentiable at the origin: this corresponds to the fact that the Cauchy
distribution does not have an expected value.

Explanation of undefined moments

Mean
If a probability distribution has a density function f(x), then the mean is

(1)   ∫_{−∞}^{+∞} x f(x) dx.

The question is now whether this is the same thing as

(2)   ∫_0^{+∞} x f(x) dx − ∫_{−∞}^0 |x| f(x) dx.

If at most one of the two terms in (2) is infinite, then (1) is the same as (2). But in the case of the Cauchy
distribution, both the positive and negative terms of (2) are infinite. This means (2) is undefined. Moreover, if (1) is
construed as a Lebesgue integral, then (1) is also undefined, because (1) is then defined simply as the difference (2)
between positive and negative parts.
However, if (1) is construed as an improper integral rather than a Lebesgue integral, then (2) is undefined, and (1) is
not necessarily well-defined. We may take (1) to mean

lim_{a→∞} ∫_{−a}^{a} x f(x) dx,

and this is its Cauchy principal value, which is zero, but we could also take (1) to mean, for example,

lim_{a→∞} ∫_{−2a}^{a} x f(x) dx,

which is not zero, as can be seen easily by computing the integral.
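For the standard Cauchy density, one such asymmetric choice of limits (lower limit −2a, upper limit a) evaluates in closed form to a nonzero constant:

```latex
\lim_{a\to\infty}\int_{-2a}^{a} \frac{x}{\pi(1+x^2)}\,dx
  = \lim_{a\to\infty}\left[\frac{\ln(1+x^2)}{2\pi}\right]_{-2a}^{a}
  = \lim_{a\to\infty}\frac{\ln(1+a^2)-\ln(1+4a^2)}{2\pi}
  = \frac{\ln\tfrac{1}{4}}{2\pi}
  = -\frac{\ln 2}{\pi}.
```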


Because the integrand is bounded and is not Lebesgue integrable, it is not even Henstock–Kurzweil integrable.
Various results in probability theory about expected values, such as the strong law of large numbers, will not work in
such cases.

Higher moments
The Cauchy distribution does not have finite moments of any order. Some of the higher raw moments do exist and
have a value of infinity, for example the raw second moment:

E[X²] = ∫_{−∞}^{+∞} x² / (π(1 + x²)) dx = (1/π) ∫_{−∞}^{+∞} (1 − 1/(1 + x²)) dx = ∞.
Higher even-powered raw moments will also evaluate to infinity. Odd-powered raw moments, however, do not exist
at all (i.e. are undefined), which is distinctly different from existing with the value of infinity. (Consider 1/0, which
is defined with the value of infinity, vs. 0/0, which is undefined.) The odd-powered raw moments are undefined
because their values are essentially equivalent to ∞ − ∞, since the two halves of the integral both diverge and
have opposite signs. The first raw moment is the mean, which, being odd, does not exist. (See also the discussion
above about this.) This in turn means that all of the central moments and standardized moments do not exist (are
undefined), since they are all based on the mean. The variance — which is the second central moment — is likewise
non-existent (despite the fact that the raw second moment exists with the value infinity).
The results for higher moments follow from Hölder's inequality, which implies that higher moments (or halves of
moments) diverge if lower ones do.

Estimation of parameters
Because the mean and variance of the Cauchy distribution are not defined, attempts to estimate these parameters will
not be successful. For example, if n samples are taken from a Cauchy distribution, one may calculate the sample
mean as:

Although the sample values xi will be concentrated about the central value x0, the sample mean will become
increasingly variable as more samples are taken, because of the increased likelihood of encountering sample points
with a large absolute value. In fact, the distribution of the sample mean will be equal to the distribution of the
samples themselves; i.e., the sample mean of a large sample is no better (or worse) an estimator of x0 than any single
observation from the sample. Similarly, calculating the sample variance will result in values that grow larger as more
samples are taken.
Therefore, more robust means of estimating the central value x0 and the scaling parameter γ are needed. One simple
method is to take the median value of the sample as an estimator of x0 and half the sample interquartile range as an
estimator of γ. Other, more precise and robust methods have been developed.[8][9] For example, the truncated mean
of the middle 24% of the sample order statistics produces an estimate for x0 that is more efficient than using either
the sample median or the full sample mean.[10][11] However, because of the fat tails of the Cauchy distribution, the
efficiency of the estimator decreases if more than 24% of the sample is used.[10][11]
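This behavior is easy to see in simulation. The sketch below (the sample size and seed are arbitrary choices) draws standard Cauchy variates by the inverse-CDF transform and recovers x0 ≈ 0 from the sample median and γ ≈ 1 from half the interquartile range:

```python
import math
import random
import statistics

random.seed(12345)
# Inverse-CDF sampling: if U ~ Uniform(0, 1), then
# tan(pi * (U - 1/2)) has the standard Cauchy distribution.
sample = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(10001)]

x0_hat = statistics.median(sample)          # robust estimate of x0 = 0
q1, _, q3 = statistics.quantiles(sample, n=4)
gamma_hat = (q3 - q1) / 2                   # half-IQR estimate of gamma = 1

# Both robust estimates land near the true parameter values ...
assert abs(x0_hat) < 0.15
assert abs(gamma_hat - 1.0) < 0.15

# ... while the sample mean stays Cauchy(0, 1)-distributed: it is
# dominated by occasional huge observations and does not converge.
sample_mean = statistics.fmean(sample)
```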
Maximum likelihood can also be used to estimate the parameters x0 and γ. However, this tends to be complicated by
the fact that this requires finding the roots of a high degree polynomial, and there can be multiple roots that represent
local maxima.[12] Also, while the maximum likelihood estimator is asymptotically efficient, it is relatively inefficient
for small samples.[13] The log-likelihood function for the Cauchy distribution for sample size n is:

ℓ(x0, γ) = −n ln(γπ) − Σᵢ ln(1 + ((xi − x0)/γ)²).

Maximizing the log likelihood function with respect to x0 and γ produces the following system of equations:

Σᵢ (xi − x0) / ((xi − x0)² + γ²) = 0,
Σᵢ γ² / ((xi − x0)² + γ²) = n/2.

Note that Σᵢ γ² / ((xi − x0)² + γ²) is a monotone function in γ and that the solution γ must satisfy

min |xi − x0| ≤ γ ≤ max |xi − x0|.

Solving just for x0 requires solving a polynomial of degree 2n − 1,[12] and solving just for γ requires solving a
polynomial of degree n (first for γ², then x0). Therefore, whether solving for one parameter or for both parameters
simultaneously, a numerical solution on a computer is typically required.
The benefit of maximum likelihood estimation is asymptotic efficiency; estimating x0 using the sample median is
only about 81% as asymptotically efficient as estimating x0 by maximum likelihood.[11][14] The truncated sample
mean using the middle 24% order statistics is about 88% as asymptotically efficient an estimator of x0 as the
maximum likelihood estimate.[11] When Newton's method is used to find the solution for the maximum likelihood
estimate, the middle 24% order statistics can be used as an initial solution for x0.

Circular Cauchy distribution


If X is Cauchy distributed with median μ and scale parameter γ, then the complex variable

has unit modulus and is distributed on the unit circle with density:

with respect to the angular variable , where

and expresses the two parameters of the associated linear Cauchy distribution for x as a complex number:

The distribution is called the circular Cauchy distribution (also the complex Cauchy distribution)[15][16]
with parameter . The circular Cauchy distribution is related to the wrapped Cauchy distribution. If is
a wrapped Cauchy distribution with the parameter representing the parameters of the corresponding
"unwrapped" Cauchy distribution in the variable y where , then

See also McCullagh's parametrization of the Cauchy distributions and Poisson kernel for related concepts.
The circular Cauchy distribution expressed in complex form has finite moments of all orders

for integer . For , the transformation

is holomorphic on the unit disk, and the transformed variable is distributed as complex Cauchy with
parameter .
Given a sample of size n > 2, the maximum-likelihood equation

can be solved by a simple fixed-point iteration:

starting with The sequence of likelihood values is non-decreasing, and the solution is unique for samples
containing at least three distinct values.[17]
The maximum-likelihood estimate for the median ( ) and scale parameter ( ) of a real Cauchy sample is
obtained by the inverse transformation:

For n ≤ 4, closed-form expressions are known for .[12] The density of the maximum-likelihood estimator at t in the
unit disk is necessarily of the form:

where

Formulae for and are available.[18]

Multivariate Cauchy distribution


A random vector X = (X1, ..., Xk)′ is said to have the multivariate Cauchy distribution if every linear combination of
its components Y = a1X1 + ... + akXk has a Cauchy distribution. That is, for any constant vector a ∈ Rk, the random
variable Y = a′X should have a univariate Cauchy distribution.[19] The characteristic function of a multivariate
Cauchy distribution is given by:

where x0(t) and γ(t) are real functions with x0(t) a homogeneous function of degree one and γ(t) a positive
homogeneous function of degree one.[19] More formally:[19]
and for all t.
An example of a bivariate Cauchy distribution can be given by:[20]

Note that in this example, even though there is no analogue to a covariance matrix, x and y are not statistically
independent.[20]
Analogously to the univariate density, the multidimensional Cauchy density also relates to the multivariate Student
distribution. They are equivalent when the degrees of freedom parameter is equal to one. The density of a k
dimension Student distribution with one degree of freedom becomes:

Properties and details for this density can be obtained by taking it as a particular case of the multivariate Student
density.

Transformation properties
• If X ~ Cauchy(x0, γ), then X + b ~ Cauchy(x0 + b, γ).
• If X ~ Cauchy(x0, γ), then aX ~ Cauchy(a x0, |a| γ).
• If X ~ Cauchy(x0, γ0) and Y ~ Cauchy(x1, γ1) are independent, then X + Y ~ Cauchy(x0 + x1, γ0 + γ1).
• If X ~ Cauchy(0, γ), then 1/X ~ Cauchy(0, 1/γ).
• McCullagh's parametrization of the Cauchy distributions: Expressing a Cauchy distribution in terms of one
complex parameter , define X ~ Cauchy to mean X ~ Cauchy . If X ~ Cauchy
then:

~ Cauchy where a,b,c and d are real numbers.

• Using the same convention as above, if X ~ Cauchy then:

~ CCauchy

where "CCauchy" is the circular Cauchy distribution.

Related distributions
• Student's t distribution
• Non-standardized Student's t distribution
• If and then
• If then
• If then
• The Cauchy distribution is a limiting case of a Pearson distribution of type 4
• The Cauchy distribution is a special case of a Pearson distribution of type 7.[1]
• The Cauchy distribution is a stable distribution: if X ~ Stable , then X ~Cauchy(μ, γ).
• The Cauchy distribution is a singular limit of a Hyperbolic distribution
• The wrapped Cauchy distribution, taking values on a circle, is derived from the Cauchy distribution by wrapping
it around the circle.

Relativistic Breit–Wigner distribution


In nuclear and particle physics, the energy profile of a resonance is described by the relativistic Breit–Wigner
distribution, while the Cauchy distribution is the (non-relativistic) Breit–Wigner distribution.

References
[1] N. L. Johnson, S. Kotz, and N. Balakrishnan (1994). Continuous Univariate Distributions, Volume 1. New York: Wiley, Chapter 16.
[2] http://webphysics.davidson.edu/Projects/AnAntonelli/node5.html Note that the intensity, which follows the Cauchy distribution, is the square of the amplitude.
[3] E. Hecht (1987). Optics (2nd ed.). Addison-Wesley. p. 603.
[4] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http://www.econ.yorku.ca/cesg/papers/berapark.pdf). Journal of Econometrics (Elsevier): 219–230. Retrieved 2011-06-02.
[5] Vasicek, Oldrich (1976). "A Test for Normality Based on Sample Entropy". Journal of the Royal Statistical Society, Series B (Methodological) 38 (1): 54–59.
[6] S. Kotz et al. (2006). Encyclopedia of Statistical Sciences (2nd ed.). John Wiley & Sons. p. 778. ISBN 978-0-471-15044-2.
[7] F. B. Knight (1976). "A characterization of the Cauchy type". Proceedings of the American Mathematical Society 55: 130–135.
[8] Cane, Gwenda J. (1974). "Linear Estimation of Parameters of the Cauchy Distribution Based on Sample Quantiles". Journal of the American Statistical Association 69 (345): 243–245. JSTOR 2285535.
[9] Zhang, Jin (2010). "A Highly Efficient L-estimator for the Location Parameter of the Cauchy Distribution" (http://www.springerlink.com/content/3p1430175v4806jq). Computational Statistics 25 (1): 97–105.
[10] Rothenberg, Thomas J.; Fisher, Franklin M.; Tilanus, C. B. (1966). "A note on estimation from a Cauchy sample". Journal of the American Statistical Association 59 (306): 460–463.
[11] Bloch, Daniel (1966). "A note on the estimation of the location parameters of the Cauchy distribution". Journal of the American Statistical Association 61 (316): 852–855. JSTOR 2282794.
[12] Ferguson, Thomas S. (1978). "Maximum Likelihood Estimates of the Parameters of the Cauchy Distribution for Samples of Size 3 and 4". Journal of the American Statistical Association 73 (361): 211. JSTOR 2286549.
[13] Cohen Freue, Gabriella V. (2007). "The Pitman estimator of the Cauchy location parameter" (http://faculty.ksu.edu.sa/69424/USEPAP/Coushy dist.pdf). Journal of Statistical Planning and Inference 137: 1901.
[14] Barnett, V. D. (1966). "Order Statistics Estimators of the Location of the Cauchy Distribution". Journal of the American Statistical Association 61 (316): 1205. JSTOR 2283210.
[15] McCullagh, P. (1992). "Conditional inference and Cauchy models" (http://biomet.oxfordjournals.org/cgi/content/abstract/79/2/247), Biometrika 79: 247–259. PDF (http://www.stat.uchicago.edu/~pmcc/pubs/paper18.pdf) from McCullagh's homepage.
[16] K. V. Mardia (1972). Statistics of Directional Data. Academic Press.
[17] J. Copas (1975). "On the unimodality of the likelihood function for the Cauchy distribution". Biometrika 62: 701–704.
[18] P. McCullagh (1996). "Möbius transformation and Cauchy parameter estimation". Annals of Statistics 24: 786–808. JSTOR 2242674.
[19] Ferguson, Thomas S. (1962). "A Representation of the Symmetric Bivariate Cauchy Distribution". Journal of the American Statistical Association: 1256. JSTOR 2237984.
[20] Molenberghs, Geert; Lesaffre, Emmanuel (1997). "Non-linear Integral Equations to Approximate Bivariate Densities with Given Marginals and Dependence Function" (http://www3.stat.sinica.edu.tw/statistica/oldpdf/A7n310.pdf). Statistica Sinica 7: 713–738.

External links
• Earliest Uses: The entry on Cauchy distribution has some historical information. (http://jeff560.tripod.com/c.
html)
• Weisstein, Eric W., " Cauchy Distribution (http://mathworld.wolfram.com/CauchyDistribution.html)" from
MathWorld.
• GNU Scientific Library – Reference Manual (http://www.gnu.org/software/gsl/manual/gsl-ref.
html#SEC294)

Cauchy–Schwarz inequality
In mathematics, the Cauchy–Schwarz inequality (also known as the Bunyakovsky inequality, the Schwarz
inequality, or the Cauchy–Bunyakovsky–Schwarz inequality, or Cauchy–Bunyakovsky inequality), is a useful
inequality encountered in many different settings, such as linear algebra, analysis, probability theory, and other
areas. It is considered to be one of the most important inequalities in all of mathematics.[1] It has a number of
generalizations, among them Hölder's inequality.
The inequality for sums was published by Augustin-Louis Cauchy (1821), while the corresponding inequality for
integrals was first stated by Viktor Bunyakovsky (1859) and rediscovered by Hermann Amandus Schwarz (1888).

Statement of the inequality


The Cauchy–Schwarz inequality states that for all vectors x and y of an inner product space it is true that

|⟨x, y⟩|² ≤ ⟨x, x⟩ · ⟨y, y⟩,

where ⟨·, ·⟩ is the inner product. Equivalently, by taking the square root of both sides, and referring to the norms of the vectors, the inequality is written as

|⟨x, y⟩| ≤ ‖x‖ ‖y‖.

Moreover, the two sides are equal if and only if x and y are linearly dependent (or, in a geometrical sense, they are parallel or one of the vectors is equal to zero).
If x₁, …, xₙ and y₁, …, yₙ are any complex numbers, the inner product is the standard inner product ⟨x, y⟩ = Σᵢ xᵢȳᵢ, and the bar notation is used for complex conjugation, then the inequality may be restated more explicitly as

|x₁ȳ₁ + ⋯ + xₙȳₙ|² ≤ (|x₁|² + ⋯ + |xₙ|²)(|y₁|² + ⋯ + |yₙ|²).
When viewed in this way the numbers x1, ..., xn, and y1, ..., yn are the components of x and y with respect to an
orthonormal basis of V.
Even more compactly written:
Cauchy–Schwarz inequality 75

Equality holds if and only if x and y are linearly dependent, that is, one is a scalar multiple of the other (which
includes the case when one or both are zero).
The finite-dimensional case of this inequality for real vectors was proved by Cauchy in 1821, and in 1859 Cauchy's
student Bunyakovsky noted that by taking limits one can obtain an integral form of Cauchy's inequality. The general
result for an inner product space was obtained by Schwarz in the year 1888.

Proof
Let u, v be arbitrary vectors in a vector space V over F with an inner product, where F is the field of real or complex
numbers. We prove the inequality

|⟨u, v⟩|² ≤ ⟨u, u⟩ · ⟨v, v⟩,
and the fact that equality holds only when u and v are linearly dependent (the fact that conversely one has equality if
u and v are linearly dependent is immediate from the properties of the inner product).
This inequality is trivial, and in fact an equality, in the case v = 0, and in this case u and v are also linearly dependent,
regardless of u. The theorem being thus proved for this case, we henceforth assume that v is nonzero. Let

z = u − (⟨u, v⟩ / ⟨v, v⟩) v.

Then, by linearity of the inner product in its first argument, one has

⟨z, v⟩ = ⟨u, v⟩ − (⟨u, v⟩ / ⟨v, v⟩) ⟨v, v⟩ = 0,

i.e., z is a vector orthogonal to the vector v (indeed, z is the projection of u onto the hyperplane orthogonal to v). We can thus apply the Pythagorean theorem to

u = (⟨u, v⟩ / ⟨v, v⟩) v + z,

which gives

‖u‖² = |⟨u, v⟩|² / ‖v‖² + ‖z‖² ≥ |⟨u, v⟩|² / ‖v‖²,

and, after multiplication by ‖v‖², the Cauchy–Schwarz inequality. Moreover, if the relation '≥' in the above expression is actually an equality, then ‖z‖² = 0 and hence z = 0; the definition of z then establishes a relation of linear dependence between u and v. This establishes the theorem.

Notable special cases

Rn
In Euclidean space Rn with the standard inner product, the Cauchy–Schwarz inequality is

(Σᵢ xᵢyᵢ)² ≤ (Σᵢ xᵢ²)(Σᵢ yᵢ²).

To prove this form of the inequality, consider the following quadratic polynomial in z:

p(z) = Σᵢ (xᵢz + yᵢ)² = (Σᵢ xᵢ²) z² + 2(Σᵢ xᵢyᵢ) z + Σᵢ yᵢ².

Since it is nonnegative, it has at most one real root in z, whence its discriminant is less than or equal to zero, that is,

4(Σᵢ xᵢyᵢ)² − 4(Σᵢ xᵢ²)(Σᵢ yᵢ²) ≤ 0,

which yields the Cauchy–Schwarz inequality.
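The discriminant argument is easy to verify numerically; the sketch below (random vectors and tolerances are illustrative choices) checks the inequality and the equality case for proportional vectors:

```python
import random

random.seed(1)
for _ in range(1000):
    n = random.randint(1, 6)
    x = [random.uniform(-5, 5) for _ in range(n)]
    y = [random.uniform(-5, 5) for _ in range(n)]
    lhs = sum(a * b for a, b in zip(x, y)) ** 2           # (sum x_i y_i)^2
    rhs = sum(a * a for a in x) * sum(b * b for b in y)   # (sum x_i^2)(sum y_i^2)
    # the discriminant of p(z) = sum (x_i z + y_i)^2 is 4*(lhs - rhs) <= 0
    assert lhs <= rhs + 1e-9

# equality holds exactly when y is a scalar multiple of x
x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
lhs = sum(a * b for a, b in zip(x, y)) ** 2
rhs = sum(a * a for a in x) * sum(b * b for b in y)
assert abs(lhs - rhs) < 1e-9
```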


An equivalent proof for Rn starts with the summation below:

Σᵢ Σⱼ (xᵢyⱼ − xⱼyᵢ)² ≥ 0.

Expanding the brackets we have:

Σᵢ Σⱼ (xᵢyⱼ − xⱼyᵢ)² = Σᵢ Σⱼ (xᵢ²yⱼ² − 2xᵢyᵢxⱼyⱼ + xⱼ²yᵢ²);

collecting together identical terms (albeit with different summation indices) we find:

Σᵢ Σⱼ (xᵢyⱼ − xⱼyᵢ)² = 2(Σᵢ xᵢ²)(Σᵢ yᵢ²) − 2(Σᵢ xᵢyᵢ)².

Because the left-hand side of the equation is a sum of the squares of real numbers, it is greater than or equal to zero; thus:

(Σᵢ xᵢyᵢ)² ≤ (Σᵢ xᵢ²)(Σᵢ yᵢ²).

This form is usually used when solving school math problems.


Yet another approach when n ≥ 2 (n = 1 is trivial) is to consider the plane containing x and y. More precisely,
recoordinatize Rn with any orthonormal basis whose first two vectors span a subspace containing x and y. In this
basis only the first two coordinates of x and y are nonzero, and the inequality reduces to the algebra of the dot product in the plane, which is related to the angle between two vectors, from which we obtain the inequality:

|x · y| = ‖x‖ ‖y‖ |cos θ| ≤ ‖x‖ ‖y‖.
When n = 3 the Cauchy–Schwarz inequality can also be deduced from Lagrange's identity, which takes the form

(Σᵢ xᵢ²)(Σᵢ yᵢ²) = (Σᵢ xᵢyᵢ)² + Σ₁≤i<j≤3 (xᵢyⱼ − xⱼyᵢ)²,

from which the Cauchy–Schwarz inequality readily follows.


Another proof of the general case for n can be given by the technique used to prove the inequality of arithmetic and geometric means.

L2
For the inner product space of square-integrable complex-valued functions, one has

|∫ ƒ(x) g(x)* dx|² ≤ ∫ |ƒ(x)|² dx · ∫ |g(x)|² dx.
A generalization of this is the Hölder inequality.

Use
The triangle inequality for the inner product is often shown as a consequence of the Cauchy–Schwarz inequality, as follows: given vectors x and y,

‖x + y‖² = ⟨x + y, x + y⟩ = ‖x‖² + ⟨x, y⟩ + ⟨y, x⟩ + ‖y‖² ≤ ‖x‖² + 2|⟨x, y⟩| + ‖y‖² ≤ ‖x‖² + 2‖x‖‖y‖ + ‖y‖² = (‖x‖ + ‖y‖)².

Taking square roots gives the triangle inequality.


The Cauchy–Schwarz inequality allows one to extend the notion of "angle between two vectors" to any real inner product space, by defining:

cos θ = ⟨x, y⟩ / (‖x‖ ‖y‖).
The Cauchy–Schwarz inequality proves that this definition is sensible, by showing that the right hand side lies in the
interval [−1, 1], and justifies the notion that (real) Hilbert spaces are simply generalizations of the Euclidean space.
It can also be used to define an angle in complex inner product spaces, by taking the absolute value of the right hand
side, as is done when extracting a metric from quantum fidelity.
The Cauchy–Schwarz inequality is used to prove that the inner product is a continuous function with respect to the topology induced by the inner product itself.
The Cauchy–Schwarz inequality is usually used to show Bessel's inequality.

Probability theory
For the multivariate case,

For the univariate case, Cov(X, Y)² ≤ Var(X) · Var(Y). Indeed, for random variables X and Y, the expectation of their product is an inner product. That is,

⟨X, Y⟩ := E(XY),

and so, by the Cauchy–Schwarz inequality,

|E(XY)|² ≤ E(X²) · E(Y²).

Moreover, if μ = E(X) and ν = E(Y), then

Cov(X, Y)² = |E((X − μ)(Y − ν))|² = |⟨X − μ, Y − ν⟩|² ≤ ⟨X − μ, X − μ⟩ · ⟨Y − ν, Y − ν⟩ = Var(X) · Var(Y),

where Var denotes variance and Cov denotes covariance.
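Sample variance and covariance form an inner product on the centred observations, so the same bound holds exactly for any data set; the simulated data below is purely illustrative:

```python
import random

def var(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)

def cov(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

random.seed(2)
X = [random.gauss(0, 1) for _ in range(10000)]
Y = [0.5 * x + random.gauss(0, 1) for x in X]   # Y is correlated with X

# Cauchy-Schwarz: Cov(X, Y)^2 <= Var(X) Var(Y), hence |correlation| <= 1
assert cov(X, Y) ** 2 <= var(X) * var(Y)
corr = cov(X, Y) / (var(X) * var(Y)) ** 0.5
```

This is the inequality behind the familiar fact that a correlation coefficient always lies in [−1, 1].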

Generalizations
Various generalizations of the Cauchy–Schwarz inequality exist in the context of operator theory, e.g. for
operator-convex functions, and operator algebras, where the domain and/or range of φ are replaced by a C*-algebra
or W*-algebra.
This section lists a few of such inequalities from the operator algebra setting, to give a flavor of results of this type.

Positive functionals on C*- and W*-algebras


One can discuss inner products as positive functionals. Given a Hilbert space L2(m), m being a finite measure, the inner product < · , · > gives rise to a positive functional φ by

φ(g) = ⟨g, 1⟩.
Since < ƒ, ƒ > ≥ 0, φ(f*f) ≥ 0 for all ƒ in L2(m), where ƒ* is pointwise conjugate of ƒ. So φ is positive. Conversely
every positive functional φ gives a corresponding inner product < ƒ, g >φ = φ(g*ƒ). In this language, the Cauchy–Schwarz inequality becomes

|φ(g*ƒ)|² ≤ φ(ƒ*ƒ) φ(g*g),

which extends verbatim to positive functionals on C*-algebras.


We now give an operator theoretic proof for the Cauchy–Schwarz inequality which passes to the C*-algebra setting.
One can see from the proof that the Cauchy–Schwarz inequality is a consequence of the positivity and anti-symmetry
inner-product axioms.
Consider the positive matrix

M = ( ƒ*ƒ  ƒ*g ; g*ƒ  g*g ).

Since φ is a positive linear map whose range, the complex numbers C, is a commutative C*-algebra, φ is completely positive. Therefore

( φ(ƒ*ƒ)  φ(ƒ*g) ; φ(g*ƒ)  φ(g*g) )

is a positive 2 × 2 scalar matrix, which implies it has nonnegative determinant:

φ(ƒ*ƒ) φ(g*g) − |φ(g*ƒ)|² ≥ 0.
This is precisely the Cauchy–Schwarz inequality. If ƒ and g are elements of a C*-algebra, f* and g* denote their
respective adjoints.
We can also deduce from above that every positive linear functional is bounded, corresponding to the fact that the
inner product is jointly continuous.

Positive maps
Positive functionals are special cases of positive maps. A linear map Φ between C*-algebras is said to be a positive
map if a ≥ 0 implies Φ(a) ≥ 0. It is natural to ask whether inequalities of Schwarz-type exist for positive maps. In
this more general setting, usually additional assumptions are needed to obtain such results.

Kadison–Schwarz inequality
The following theorem is named after Richard Kadison.
Theorem. If Φ is a unital positive map, then for every normal element a in its domain, we have Φ(a*a) ≥ Φ(a*)Φ(a)
and Φ(a*a) ≥ Φ(a)Φ(a*).
This extends the fact φ(a*a) · 1 ≥ φ(a)*φ(a) = |φ(a)|², when φ is a linear functional.
The case when a is self-adjoint, i.e. a = a*, is sometimes known as Kadison's inequality.

2-positive maps
When Φ is 2-positive, a stronger assumption than merely positive, one has something that looks very similar to the
original Cauchy–Schwarz inequality:
Theorem (Modified Schwarz inequality for 2-positive maps)[2] For a 2-positive map Φ between C*-algebras, for all
a, b in its domain,
1.
2.
A simple argument for (2) is as follows. Consider the positive matrix

By 2-positivity of Φ,

is positive. The desired inequality then follows from the properties of positive 2 × 2 (operator) matrices.

Part (1) is analogous. One can replace the matrix by

Physics
The general formulation of the Heisenberg uncertainty principle is derived using the Cauchy–Schwarz inequality in
the Hilbert space of quantum observables.

Notes
[1] The Cauchy–Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities, Ch. 1 (http://www-stat.wharton.upenn.edu/~steele/Publications/Books/CSMC/CSMC_index.html), by J. Michael Steele.
[2] Paulsen (2002), Completely Bounded Maps and Operator Algebras (http://books.google.com/books?id=VtSFHDABxMIC&pg=PA40), ISBN 9780521816694, page 40.

References
• Bityutskov, V.I. (2001), "Bunyakovskii inequality" (http://www.encyclopediaofmath.org/index.php?title=b/
b017770), in Hazewinkel, Michiel, Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
• Bouniakowsky, V. (1859), "Sur quelques inegalités concernant les intégrales aux différences finies" (http://
www-stat.wharton.upenn.edu/~steele/Publications/Books/CSMC/bunyakovsky.pdf) (PDF), Mem. Acad. Sci.
St. Petersbourg 7 (1): 9
• Cauchy, A. (1821), Oeuvres 2, III, p. 373
• Dragomir, S. S. (2003), "A survey on Cauchy–Bunyakovsky–Schwarz type discrete inequalities" (http://jipam.
vu.edu.au/article.php?sid=301), JIPAM. J. Inequal. Pure Appl. Math. 4 (3): 142 pp
• Kadison, R.V. (1952), "A generalized Schwarz inequality and algebraic invariants for operator algebras", Ann.
Math. 56 (3): 494–503, doi:10.2307/1969657, JSTOR 1969657.
• Lohwater, Arthur (1982), Introduction to Inequalities (http://www.mediafire.com/?1mw1tkgozzu), Online
e-book in PDF format
• Paulsen, V. (2003), Completely Bounded Maps and Operator Algebras, Cambridge University Press.
• Schwarz, H. A. (1888), "Über ein Flächen kleinsten Flächeninhalts betreffendes Problem der Variationsrechnung"
(http://www-stat.wharton.upenn.edu/~steele/Publications/Books/CSMC/Schwarz.pdf) (PDF), Acta
Societatis scientiarum Fennicae XV: 318
• Solomentsev, E.D. (2001), "Cauchy inequality" (http://www.encyclopediaofmath.org/index.php?title=C/
c020880), in Hazewinkel, Michiel, Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
• Steele, J.M. (2004), The Cauchy–Schwarz Master Class (http://www-stat.wharton.upenn.edu/~steele/
Publications/Books/CSMC/CSMC_index.html), Cambridge University Press, ISBN 0-521-54677-X

External links
• Earliest Uses: The entry on the Cauchy–Schwarz inequality has some historical information. (http://jeff560.
tripod.com/c.html)
• Example of application of Cauchy–Schwarz inequality to determine Linearly Independent Vectors (http://
people.revoledu.com/kardi/tutorial/LinearAlgebra/LinearlyIndependent.html#LinearlyIndependentVectors)
Tutorial and Interactive program.
Characteristic function (probability theory) 81

Characteristic function (probability theory)


In probability theory and statistics, the characteristic function of
any real-valued random variable completely defines its probability
distribution. If a random variable admits a probability density
function, then the characteristic function is the Fourier transform
of the probability density function. Thus it provides the basis of an
alternative route to analytical results compared with working
directly with probability density functions or cumulative
distribution functions. There are particularly simple results for the
characteristic functions of distributions defined by the weighted sums of random variables.
In addition to univariate distributions, characteristic functions can be defined for vector- or matrix-valued random variables, and can even be extended to more generic cases.
[Figure: The characteristic function of a uniform U(–1,1) random variable. This function is real-valued because it corresponds to a random variable that is symmetric around the origin; however, in the general case characteristic functions may be complex-valued.]


The characteristic function always exists when treated as a function of a real-valued argument, unlike the
moment-generating function. There are relations between the behavior of the characteristic function of a distribution
and properties of the distribution, such as the existence of moments and the existence of a density function.

Introduction
The characteristic function provides an alternative way for describing a random variable. Similarly to the cumulative distribution function

F_X(x) = E[1{X ≤ x}]

(where 1{X ≤ x} is the indicator function — it is equal to 1 when X ≤ x, and zero otherwise), which completely determines the behavior and properties of the probability distribution of the random variable X, the characteristic function

φ_X(t) = E[e^{itX}]

also completely determines the behavior and properties of the probability distribution of the random variable X. The two
approaches are equivalent in the sense that knowing one of the functions it is always possible to find the other, yet
they both provide different insight for understanding the features of the random variable. However, in particular
cases, there can be differences in whether these functions can be represented as expressions involving simple
standard functions.
If a random variable admits a density function, then the characteristic function is its dual, in the sense that each of
them is a Fourier transform of the other. If a random variable has a moment-generating function, then the domain of
the characteristic function can be extended to the complex plane, and
[1]

Note however that the characteristic function of a distribution always exists, even when the probability density
function or moment-generating function do not.
The characteristic function approach is particularly useful in analysis of linear combinations of independent random
variables: a classical proof of the Central Limit Theorem uses characteristic functions and Lévy's continuity theorem.
Another important application is to the theory of the decomposability of random variables.

Definition
For a scalar random variable X the characteristic function is defined as the expected value of e^{itX}, where i is the imaginary unit, and t ∈ R is the argument of the characteristic function:

φ_X(t) = E[e^{itX}] = ∫_R e^{itx} dF_X(x)  ( = ∫_R e^{itx} ƒ_X(x) dx ).

Here FX is the cumulative distribution function of X, and the integral is of the Riemann–Stieltjes kind. If random variable X has a probability density function ƒX, then the characteristic function is its Fourier transform,[2] and the last formula in parentheses is valid.
It should be noted though, that this convention for the constants appearing in the definition of the characteristic
function differs from the usual convention for the Fourier transform.[3] For example some authors[4] define
φX(t) = Ee−2πitX, which is essentially a change of parameter. Other notation may be encountered in the literature:
as the characteristic function for a probability measure p, or as the characteristic function corresponding to a
density ƒ.
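As a sketch of the definition, the characteristic function can be estimated from simulated draws and compared with a known closed form; for the uniform U(−1, 1) distribution, φ(t) = sin(t)/t (the sample size and tolerance below are illustrative):

```python
import cmath
import math
import random

def empirical_cf(xs, t):
    """Monte Carlo estimate of phi_X(t) = E[exp(i t X)]."""
    return sum(cmath.exp(1j * t * x) for x in xs) / len(xs)

random.seed(3)
xs = [random.uniform(-1, 1) for _ in range(100000)]   # draws from U(-1, 1)
for t in (0.5, 1.0, 2.0):
    exact = math.sin(t) / t        # phi(t) of U(-1, 1)
    est = empirical_cf(xs, t)
    assert abs(est - exact) < 0.02 # est is nearly real: U(-1, 1) is symmetric
```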

Generalizations
The notion of characteristic functions generalizes to multivariate random variables and more complicated random
elements. The argument of the characteristic function will always belong to the continuous dual of the space where
random variable X takes values. For common cases such definitions are listed below:
• If X is a k-dimensional random vector, then for t ∈ Rk

φ_X(t) = E[exp(i tᵀX)].

• If X is a k×p-dimensional random matrix, then for t ∈ Rk×p

φ_X(t) = E[exp(i tr(tᵀX))].

• If X is a complex random variable, then for t ∈ C [5]

φ_X(t) = E[exp(i Re(z̄X))], with z̄ the conjugate of t.

• If X is a k-dimensional complex random vector, then for t ∈ Ck [6]

φ_X(t) = E[exp(i Re(t*X))].

• If X(s) is a stochastic process, then for all functions t(s) such that the integral ∫Rt(s)X(s)ds converges for almost all realizations of X [7]

φ_X(t) = E[exp(i ∫_R t(s)X(s) ds)].

Here tᵀ denotes matrix transpose, tr(·) — the matrix trace operator, Re(·) is the real part of a complex number, z̄ denotes complex conjugate, and * is conjugate transpose (that is z* = z̄ᵀ).

Examples

Distribution: Characteristic function φ(t)

Degenerate δa: e^{ita}
Bernoulli Bern(p): 1 − p + pe^{it}
Binomial B(n, p): (1 − p + pe^{it})^n
Negative binomial NB(r, p): ( p / (1 − (1 − p)e^{it}) )^r  (counting failures before the r-th success)
Poisson Pois(λ): exp(λ(e^{it} − 1))
Uniform U(a, b): (e^{itb} − e^{ita}) / (it(b − a))
Laplace L(μ, b): e^{itμ} / (1 + b²t²)
Normal N(μ, σ2): e^{itμ − σ²t²/2}
Chi-squared χ2k: (1 − 2it)^{−k/2}
Cauchy C(μ, θ): e^{itμ − θ|t|}
Gamma Γ(k, θ): (1 − itθ)^{−k}
Exponential Exp(λ): (1 − itλ^{−1})^{−1}
Multivariate normal N(μ, Σ): exp(i tᵀμ − ½ tᵀΣt)
Multivariate Cauchy MultiCauchy(μ, Σ): exp(i tᵀμ − √(tᵀΣt)) [8]
Oberhettinger (1973) provides extensive tables of characteristic functions.

Properties
• The characteristic function of a real-valued random variable always exists, since it is an integral of a bounded
continuous function over a space whose measure is finite.
• A characteristic function is uniformly continuous on the entire space
• It is non-vanishing in a region around zero: φ(0) = 1.
• It is bounded: | φ(t) | ≤ 1.
• It is Hermitian: φ(−t) = φ(t)‾ (complex conjugation). In particular, the characteristic function of a symmetric (around the origin) random variable is real-valued and even.
• There is a bijection between distribution functions and characteristic functions. That is, two random variables X1, X2 have the same probability distribution if and only if φ_{X1} = φ_{X2}.
• If a random variable X has moments up to k-th order, then the characteristic function φX is k times continuously differentiable on the entire real line. In this case

E[Xᵏ] = (−i)ᵏ φ_X⁽ᵏ⁾(0).
• If a characteristic function φX has a k-th derivative at zero, then the random variable X has all moments up to k if k
is even, but only up to k – 1 if k is odd.[9]

• If X1, …, Xn are independent random variables, and a1, …, an are some constants, then the characteristic function of the linear combination of the Xi's is

φ_{a1X1+⋯+anXn}(t) = φ_{X1}(a1t) ⋯ φ_{Xn}(ant).

One specific case is the sum of two independent random variables X1 and X2, in which case one has φ_{X1+X2}(t) = φ_{X1}(t) · φ_{X2}(t).
• The tail behavior of the characteristic function determines the smoothness of the corresponding density function.
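The moment property above can be checked numerically, for instance with central finite differences applied to the characteristic function φ(t) = 1/(1 − it) of the exponential distribution with rate 1 (a sketch; the step size is an illustrative choice):

```python
# Moments from derivatives at zero: E[X^n] = (-i)^n * phi^{(n)}(0),
# illustrated for Exp(1), whose characteristic function is phi(t) = 1/(1 - it).
def phi(t):
    return 1 / (1 - 1j * t)

h = 1e-5
d1 = (phi(h) - phi(-h)) / (2 * h)             # ~ phi'(0)  = i * E[X]
d2 = (phi(h) - 2 * phi(0) + phi(-h)) / h**2   # ~ phi''(0) = i^2 * E[X^2]

mean = (d1 / 1j).real              # E[X]   = 1 for Exp(1)
second_moment = (d2 / 1j**2).real  # E[X^2] = 2 for Exp(1)
```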

Continuity
The bijection stated above between probability distributions and characteristic functions is continuous. That is,
whenever a sequence of distribution functions { Fj(x) } converges (weakly) to some distribution F(x), the
corresponding sequence of characteristic functions { φj(t) } will also converge, and the limit φ(t) will correspond to
the characteristic function of law F. More formally, this is stated as
Lévy’s continuity theorem: A sequence { Xj } of n-variate random variables converges in distribution to
random variable X if and only if the sequence { φXj } converges pointwise to a function φ which is continuous
at the origin. Then φ is the characteristic function of X.[10]
This theorem is frequently used to prove the law of large numbers, and the central limit theorem.

Inversion formulas
Since there is a one-to-one correspondence between cumulative distribution functions and characteristic functions, it
is always possible to find one of these functions if we know the other one. The formula in definition of characteristic
function allows us to compute φ when we know the distribution function F (or density ƒ). If, on the other hand, we
know the characteristic function φ and want to find the corresponding distribution function, then one of the
following inversion theorems can be used.
Theorem. If the characteristic function φX is integrable, then FX is absolutely continuous, and therefore X has a probability density function given by

ƒ_X(x) = F_X′(x) = (1/2π) ∫_R e^{−itx} φ_X(t) dt,   when X is scalar;

in the multivariate case the pdf is understood as the Radon–Nikodym derivative of the distribution μX with respect to the Lebesgue measure λ:

ƒ_X(x) = dμ_X/dλ (x) = (1/(2π)ⁿ) ∫_{Rⁿ} e^{−i(t·x)} φ_X(t) dt.
Theorem (Lévy).[11] If φX is the characteristic function of distribution function FX, and two points a < b are such that {x | a < x < b} is a continuity set of μX (in the univariate case this condition is equivalent to continuity of FX at points a and b), then

F_X(b) − F_X(a) = (1/2π) lim_{T→∞} ∫_{−T}^{T} ((e^{−ita} − e^{−itb}) / (it)) φ_X(t) dt,   if X is scalar;

a corresponding multidimensional limit formula holds if X is a vector random variable.

Theorem. If a is (possibly) an atom of X (in the univariate case this means a point of discontinuity of FX), then

P(X = a) = lim_{T→∞} (1/2T) ∫_{−T}^{T} e^{−ita} φ_X(t) dt,   when X is a scalar random variable;

a corresponding multidimensional limit formula holds when X is a vector random variable.
Theorem (Gil-Pelaez).[12] For a univariate random variable X, if x is a continuity point of FX then

F_X(x) = 1/2 − (1/π) ∫_0^∞ Im[e^{−itx} φ_X(t)] / t dt.

The integral may not be Lebesgue-integrable; for example, when X is the discrete random variable that is always 0, it becomes the Dirichlet integral.
Inversion formulas for multivariate distributions are available.[13]
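The Gil-Pelaez inversion formula can be evaluated by direct numerical quadrature; a sketch for the standard normal distribution, whose characteristic function is e^{−t²/2} (the truncation point and step count are arbitrary numerical choices):

```python
import cmath
import math

def gil_pelaez_cdf(x, phi, t_max=12.0, steps=24000):
    """F(x) = 1/2 - (1/pi) * integral_0^inf Im[exp(-itx) phi(t)] / t dt,
    truncated at t_max and evaluated with the midpoint rule (avoids t = 0)."""
    dt = t_max / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        total += (cmath.exp(-1j * t * x) * phi(t)).imag / t
    return 0.5 - total * dt / math.pi

phi_normal = lambda t: math.exp(-t * t / 2)   # standard normal N(0, 1)
F0 = gil_pelaez_cdf(0.0, phi_normal)   # exact value: 0.5
F1 = gil_pelaez_cdf(1.0, phi_normal)   # exact value: 0.8413...
```

Truncating the integral is harmless here because the normal characteristic function decays faster than any polynomial.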

Criteria for characteristic functions


It is well known that any non-decreasing càdlàg function F with limits F(−∞) = 0, F(+∞) = 1 corresponds to a
cumulative distribution function of some random variable.
There is also interest in finding similar simple criteria for when a given function φ could be the characteristic
function of some random variable. The central result here is Bochner’s theorem, although its usefulness is limited
because the main condition of the theorem, non-negative definiteness, is very hard to verify. Other theorems also
exist, such as Khinchine’s, Mathias’s, or Cramér’s, although their application is just as difficult. Pólya’s theorem, on
the other hand, provides a very simple convexity condition which is sufficient but not necessary. Characteristic
functions which satisfy this condition are called Pólya-type.[14]
• Bochner’s theorem. An arbitrary function is the characteristic function of some random variable if and
only if φ is positive definite, continuous at the origin, and if φ(0) = 1.
• Khinchine’s criterion. An absolutely continuous complex-valued function φ equal to 1 at the origin is a
characteristic function if and only if it admits the representation

• Mathias’ theorem. A real, even, continuous, absolutely integrable function φ equal to 1 at the origin is a
characteristic function if and only if

for n = 0,1,2,…, and all p > 0. Here H2n denotes the Hermite polynomial of degree 2n.
Pólya’s theorem. If φ is a real-valued continuous
function which satisfies the conditions
1. φ(0) = 1,
2. φ is even,
3. φ is convex for t>0,
4. φ(∞) = 0,
then φ(t) is the characteristic function of an
absolutely continuous symmetric distribution.
• A convex linear combination Σₙ aₙφₙ (with aₙ ≥ 0 and Σₙ aₙ = 1) of a finite or a countable number of characteristic functions is also a characteristic function.
[Figure: Pólya's theorem can be used to construct an example of two random variables whose characteristic functions coincide over a finite interval but are different elsewhere.]
• The product of a finite number of characteristic functions is also a characteristic function. The same
holds for an infinite product provided that it converges to a function continuous at the origin.
• If φ is a characteristic function and α is a real number, then φ‾ (the complex conjugate), Re[φ], |φ|², and φ(αt) are also characteristic functions.

Uses
Because of the continuity theorem, characteristic functions are used in the most frequently seen proof of the central
limit theorem. The main trick involved in making calculations with a characteristic function is recognizing the
function as the characteristic function of a particular distribution.

Basic manipulations of distributions


Characteristic functions are particularly useful for dealing with linear functions of independent random variables. For
example, if X1, X2, ..., Xn is a sequence of independent (and not necessarily identically distributed) random variables, and

S_n = Σᵢ aᵢXᵢ,

where the ai are constants, then the characteristic function for Sn is given by

φ_{S_n}(t) = φ_{X1}(a1t) φ_{X2}(a2t) ⋯ φ_{Xn}(ant).

In particular, φX+Y(t) = φX(t)φY(t). To see this, write out the definition of characteristic function:

φ_{X+Y}(t) = E[e^{it(X+Y)}] = E[e^{itX} e^{itY}] = E[e^{itX}] E[e^{itY}] = φ_X(t) φ_Y(t).

Observe that the independence of X and Y is required to establish the equality of the third and fourth expressions. Another special case of interest is when ai = 1/n and then Sn is the sample mean. In this case, writing X̄ for the mean,

φ_X̄(t) = ( φ_X(t/n) )ⁿ.
Moments
Characteristic functions can also be used to find moments of a random variable. Provided that the nth moment exists, the characteristic function can be differentiated n times and

E[Xⁿ] = i^{−n} φ_X⁽ⁿ⁾(0).
For example, suppose X has a standard Cauchy distribution. Then φ_X(t) = e^{−|t|}. This is not differentiable at t = 0, showing that the Cauchy distribution has no expectation. Also, the characteristic function of the sample mean X̄ of n independent observations is φ_X̄(t) = (e^{−|t|/n})ⁿ = e^{−|t|}, using the result from the previous section. This is the characteristic function of the standard Cauchy distribution: thus, the sample mean has the same distribution as the population itself.
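This stability of the Cauchy sample mean is easy to see in simulation; the sketch below (sample sizes and seed chosen arbitrarily) compares the interquartile range of single draws with that of sample means:

```python
import math
import random

random.seed(4)

def std_cauchy():
    # inverse-CDF draw from the standard Cauchy distribution
    return math.tan(math.pi * (random.random() - 0.5))

def iqr(v):
    s = sorted(v)
    return s[3 * len(s) // 4] - s[len(s) // 4]

singles = [std_cauchy() for _ in range(20000)]
means = [sum(std_cauchy() for _ in range(50)) / 50 for _ in range(20000)]
# Both interquartile ranges hover around 2, the IQR of the standard Cauchy:
# averaging 50 draws does not concentrate the distribution at all.
```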
The logarithm of a characteristic function is a cumulant generating function, which is useful for finding cumulants;
note that some instead define the cumulant generating function as the logarithm of the moment-generating function,
and call the logarithm of the characteristic function the second cumulant generating function.

Data analysis
Characteristic functions can be used as part of procedures for fitting probability distributions to samples of data.
Cases where this provides a practicable option compared to other possibilities include fitting the stable distribution
since closed form expressions for the density are not available which makes implementation of maximum likelihood
estimation difficult. Estimation procedures are available which match the theoretical characteristic function to the
empirical characteristic function, calculated from the data. Paulson et al. (1975) and Heathcote (1977) provide some
theoretical background for such an estimation procedure. In addition, Yu (2004) describes applications of empirical
characteristic functions to fit time series models where likelihood procedures are impractical.

Example
The gamma distribution with scale parameter θ and shape parameter k has the characteristic function

φ(t) = (1 − θit)^{−k}.

Now suppose that we have

X ~ Γ(k₁, θ)   and   Y ~ Γ(k₂, θ),

with X and Y independent from each other, and we wish to know what the distribution of X + Y is. The characteristic functions are

φ_X(t) = (1 − θit)^{−k₁},   φ_Y(t) = (1 − θit)^{−k₂},

which by independence and the basic properties of characteristic functions leads to

φ_{X+Y}(t) = φ_X(t) φ_Y(t) = (1 − θit)^{−k₁} (1 − θit)^{−k₂} = (1 − θit)^{−(k₁+k₂)}.

This is the characteristic function of the gamma distribution with scale parameter θ and shape parameter k₁ + k₂, and we therefore conclude

X + Y ~ Γ(k₁ + k₂, θ).

The result can be expanded to n independent gamma-distributed random variables with the same scale parameter, and we get: if Xᵢ ~ Γ(kᵢ, θ) independently for i = 1, …, n, then Σᵢ Xᵢ ~ Γ(Σᵢ kᵢ, θ).

Entire characteristic functions


As defined above, the argument of the characteristic function is treated as a real number: however, certain aspects of
the theory of characteristic functions are advanced by extending the definition into the complex plane by analytical
continuation, in cases where this is possible.[15]

Related concepts
Related concepts include the moment-generating function and the probability-generating function. The characteristic
function exists for all probability distributions; this is not the case for the moment-generating function.
The characteristic function is closely related to the Fourier transform: the characteristic function of a probability
density function p(x) is the complex conjugate of the continuous Fourier transform of p(x) (according to the usual
convention; see continuous Fourier transform – other conventions):

φX(t) = E[eitX] = ∫ eitx p(x) dx = P(−t),

where P(t) denotes the continuous Fourier transform of the probability density function p(x), and P(−t) equals the
complex conjugate of P(t) since p is real-valued. Likewise, p(x) may be recovered from φX(t) through the inverse
Fourier transform:

p(x) = (1/2π) ∫ e−itx φX(t) dt.
Indeed, even when the random variable does not have a density, the characteristic function may be seen as the
Fourier transform of the measure corresponding to the random variable.

Notes
[1] Lukacs (1970) p. 196
[2] Billingsley (1995)
[3] Pinsky (2002)
[4] Bochner (1955)
[5] Andersen et al. (1995, Definition 1.10)
[6] Andersen et al. (1995, Definition 1.20)
[7] Sobczyk (2001, p. 20)
[8] Kotz et al. p. 37 using 1 as the number of degrees of freedom to recover the Cauchy distribution
[9] Lukacs (1970), Corollary 1 to Theorem 2.3.1
[10] Cuppens (1975, Theorem 2.6.9)
[11] Named after the French mathematician Paul Pierre Lévy
[12] Wendel, J.G. (1961)
[13] Shephard (1991a,b)
[14] Lukacs (1970), p.84
[15] Lukacs (1970, Chapter 7)

References
• Andersen, H.H., M. Højbjerre, D. Sørensen, P.S. Eriksen (1995). Linear and graphical models for the
multivariate complex normal distribution. Lecture notes in statistics 101. New York: Springer-Verlag.
ISBN 0-387-94521-0.
• Billingsley, Patrick (1995). Probability and measure (3rd ed.). John Wiley & Sons. ISBN 0-471-00710-2.
• Bisgaard, T. M.; Z. Sasvári (2000). Characteristic functions and moment sequences. Nova Science.
• Bochner, Salomon (1955). Harmonic analysis and the theory of probability. University of California Press.
• Cuppens, R. (1975). Decomposition of multivariate probabilities. Academic Press.
• Heathcote, C.R. (1977). "The integrated squared error estimation of parameters". Biometrika 64 (2): 255–264.
doi:10.1093/biomet/64.2.255.
• Lukacs, E. (1970). Characteristic functions. London: Griffin.
• Kotz, Samuel; Nadarajah, Saralees (2004). Multivariate T Distributions and Their Applications. Cambridge
University Press.
• Oberhettinger, Fritz (1973). Fourier Transforms of Distributions and their Inverses: A Collection of Tables.
Academic Press.
• Paulson, A.S.; E.W. Holcomb, R.A. Leitch (1975). "The estimation of the parameters of the stable laws".
Biometrika 62 (1): 163–170. doi:10.1093/biomet/62.1.163.
• Pinsky, Mark (2002). Introduction to Fourier analysis and wavelets. Brooks/Cole. ISBN 0-534-37660-6.
• Sobczyk, Kazimierz (2001). Stochastic differential equations. Kluwer Academic Publishers.
ISBN 978-1-4020-0345-5.
• Wendel, J.G. (1961). "The non-absolute convergence of Gil-Pelaez' inversion integral". The Annals of
Mathematical Statistics 32 (1): 338–339. doi:10.1214/aoms/1177705164.
• Yu, J. (2004). "Empirical characteristic function estimation and its applications". Econometric Reviews 23 (2):
93–123. doi:10.1081/ETC-120039605.
• Shephard, N. G. (1991a) From characteristic function to distribution function: A simple framework for the theory.
Econometric Theory, 7, 519–529.
• Shephard, N. G. (1991b) Numerical integration rules for multivariate inversions. J. Statist. Comput. Simul., 39,
37–46.
Chernoff bound 89

Chernoff bound
In probability theory, the Chernoff bound, named after Herman Chernoff, gives exponentially decreasing bounds on
tail distributions of sums of independent random variables. It is sharper than tail bounds based on the first or second
moment, such as Markov's inequality or Chebyshev's inequality, which yield only power-law bounds on tail decay.
However, it requires that the variates be independent - a condition that neither the Markov nor the Chebyshev
inequality requires.
It is related to the (historically earliest) Bernstein inequalities, and to Hoeffding's inequality.

Definition
Let X1, ..., Xn be independent Bernoulli random variables, each equal to 1 with probability p > 1/2. Then the
probability of simultaneous occurrence of more than n/2 of the events {Xk = 1} has an exact value S, where

S = Σi=⌊n/2⌋+1…n (n choose i) p^i (1 − p)^(n−i).

The Chernoff bound shows that S has the following lower bound:

S ≥ 1 − e^(−2n(p − 1/2)²).
This result admits various generalizations as outlined below. One can encounter many flavours of Chernoff bounds:
the original additive form (which gives a bound on the absolute error) or the more practical multiplicative form
(which bounds the error relative to the mean).

A motivating example
The simplest case of Chernoff bounds is used to bound the success
probability of majority agreement for n independent, equally likely
events.
A simple motivating example is to consider a biased coin. One side
(say, Heads), is more likely to come up than the other, but you don't
know which and would like to find out. The obvious solution is to flip
it many times and then choose the side that comes up the most. But
how many times do you have to flip it to be confident that you've
chosen correctly?
In our example, let Xi denote the event that the ith coin flip comes up Heads; suppose that we want to ensure we
choose the wrong side with at most a small probability ε. Then, rearranging the above, we must have:

n ≥ (1/(p − 1/2)²) ln(1/√ε).
If the coin is noticeably biased, say coming up on one side 60% of the time (p = 0.6), then we can guess that side with
95% (ε = 0.05) accuracy after 150 flips. If it is 90% biased, then a mere 10 flips suffices. If the coin
is only biased a tiny amount, as most real coins are, the number of necessary flips becomes much larger.
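The flip counts quoted here follow directly from the bound n ≥ ln(1/√ε)/(p − 1/2)²; a quick sketch:

```python
import math

def flips_needed(p, eps):
    # Chernoff-style bound: n >= ln(1/sqrt(eps)) / (p - 1/2)^2 flips suffice
    # to pick the majority side wrongly with probability at most eps.
    return math.ceil(math.log(1 / math.sqrt(eps)) / (p - 0.5) ** 2)

print(flips_needed(0.6, 0.05))  # 150
print(flips_needed(0.9, 0.05))  # 10
```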
More practically, the Chernoff bound is used in randomized algorithms (or in computational devices such as
quantum computers) to determine a bound on the number of runs necessary to determine a value by majority
agreement, up to a specified probability. For example, suppose an algorithm (or machine) A computes the correct
value of a function f with probability p > 1/2. If we choose n satisfying the inequality above, the probability that a
majority exists and is equal to the correct value is at least 1 − ε, which for small enough ε is quite reliable. If p is a
constant, ε diminishes exponentially with growing n, which is what makes algorithms in the complexity class BPP
efficient.
Notice that if p is very close to 1/2, the necessary n can become very large. For example, if p = 1/2 + 1/2^m, as it
might be in some PP algorithms, the result is that n is bounded below by an exponential function in m:

n ≥ 2^(2m) ln(1/√ε).
The first step in the proof of Chernoff bounds


The Chernoff bound for a random variable X, which is the sum of n independent random variables X1, ..., Xn,
is obtained by applying Markov's inequality to e^(tX) for some well-chosen value of t. This method was first applied
by Sergei Bernstein to prove the related Bernstein inequalities.
From Markov's inequality and using independence we can derive the following useful inequality:
For any t > 0,

Pr(X ≥ a) = Pr(e^(tX) ≥ e^(ta)) ≤ e^(−ta) E[e^(tX)].   (+)

In particular, optimizing over t and using independence we obtain

Pr(X ≥ a) ≤ min over t > 0 of e^(−ta) Πi E[e^(tXi)].

Similarly,

Pr(X ≤ a) = Pr(e^(−tX) ≥ e^(−ta)),

and so,

Pr(X ≤ a) ≤ min over t > 0 of e^(ta) Πi E[e^(−tXi)].
Precise statements and proofs

Theorem for additive form (absolute error)


The following theorem is due to Wassily Hoeffding and hence is called the Chernoff–Hoeffding theorem.
Assume the random variables X1, ..., Xm are i.i.d., taking values in {0, 1}. Let p = E[Xi] and ε > 0. Then

Pr((1/m) Σi Xi ≥ p + ε) ≤ e^(−D(p+ε ‖ p) m)

and

Pr((1/m) Σi Xi ≤ p − ε) ≤ e^(−D(p−ε ‖ p) m),

where

D(x ‖ y) = x ln(x/y) + (1 − x) ln((1 − x)/(1 − y))

is the Kullback–Leibler divergence between Bernoulli distributed random variables with parameters x and y
respectively. If p ≥ 1/2, then

Pr(X > mp + x) ≤ e^(−x²/(2mp(1−p))).

Proof
The proof starts from the general inequality (+) above. Taking a = mq in (+), we obtain:

Now, knowing that , , we have

Therefore we can easily compute the infimum, using calculus and some logarithms. Thus,

Setting the last equation to zero and solving, we have

so that .

Thus, .

As , we see that , so our bound is satisfied on . Having solved for , we can plug back
into the equations above to find that

We now have our desired result, that


To complete the proof for the symmetric case, we simply define the random variable , apply the
same proof, and plug it into our bound.

Simpler bounds
A simpler bound follows by relaxing the theorem using D(p + ε ‖ p) ≥ 2ε², which follows from the convexity
of D(p + ε ‖ p) and the fact that (d²/dε²) D(p + ε ‖ p) = 1/((p + ε)(1 − p − ε)) ≥ 4. This results in a special case
of Hoeffding's inequality. Sometimes, the bound D((1 + x)p ‖ p) ≥ x²p/4 for −1/2 ≤ x ≤ 1/2, which is
stronger for p < 1/8, is also used.

Theorem for multiplicative form of Chernoff bound (relative error)


Let X1, ..., Xn be independent random variables taking on values 0 or 1. Further, assume
that Pr(Xi = 1) = pi. Then, if we let X = Σi Xi and μ = E[X] be the expectation of X, for any δ > 0

Pr(X > (1 + δ)μ) < (e^δ / (1 + δ)^(1+δ))^μ.
Proof
According to (+),

The third line above follows because takes the value with probability and the value with probability
. This is identical to the calculation above in the proof of the Theorem for additive form (absolute error).
Rewriting pe^t + (1 − p) as p(e^t − 1) + 1 and recalling that 1 + x ≤ e^x (with strict inequality if x > 0),
we set t = ln(1 + δ). The same result can be obtained by directly replacing a in the equation for the Chernoff
bound with (1 + δ)μ.[1]
Thus,

If we simply set t = ln(1 + δ), so that t > 0 for δ > 0, we can substitute and find

Pr(X > (1 + δ)μ) < (e^δ / (1 + δ)^(1+δ))^μ.

This proves the result desired. A similar proof strategy can be used to show that

Pr(X < (1 − δ)μ) < (e^(−δ) / (1 − δ)^(1−δ))^μ.
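The multiplicative bound can be compared numerically against the exact binomial tail; a quick sketch (the parameter values are arbitrary):

```python
import math

n, p, delta = 100, 0.3, 0.5
mu = n * p

# Exact tail Pr(X > (1+delta)*mu) for X ~ Binomial(n, p).
thresh = (1 + delta) * mu
exact = sum(math.comb(n, i) * p**i * (1 - p) ** (n - i)
            for i in range(n + 1) if i > thresh)

# Multiplicative Chernoff bound: (e^delta / (1+delta)^(1+delta))^mu.
bound = (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

print(exact <= bound)  # True: the bound dominates the exact tail probability
```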
Better Chernoff bounds for some special cases


We can obtain stronger bounds using simpler proof techniques for some special cases of symmetric random
variables.
Let X1, ..., Xn be independent random variables.

(a) Pr(Xi = 1) = Pr(Xi = −1) = 1/2.

Then, for X = Σi Xi and any a > 0,

Pr(X ≥ a) ≤ e^(−a²/2n),

and therefore also

Pr(|X| ≥ a) ≤ 2e^(−a²/2n).

(b) Pr(Xi = 1) = Pr(Xi = 0) = 1/2.

Then, for X = Σi Xi and any a > 0,

Pr(X ≥ n/2 + a) ≤ e^(−2a²/n),   Pr(X ≤ n/2 − a) ≤ e^(−2a²/n).
Applications of Chernoff bound


Chernoff bounds have very useful applications in set balancing and packet routing in sparse networks.
The set balancing problem arises while designing statistical experiments. Typically while designing a statistical
experiment, given the features of each participant in the experiment, we need to know how to divide the participants
into 2 disjoint groups such that each feature is roughly as balanced as possible between the two groups. Refer to this
book section [2] for more info on the problem.
Chernoff bounds are also used to obtain tight bounds for permutation routing problems which reduce network
congestion while routing packets in sparse networks. Refer to this book section [3] for a thorough treatment of the
problem.

Matrix Chernoff bound


Rudolf Ahlswede and Andreas Winter introduced (Ahlswede & Winter 2003) a Chernoff bound for matrix-valued
random variables.
If M is distributed according to some distribution over d × d matrices with zero mean, and if M1, ..., Mt
are independent copies of M, then for any ε > 0,

Pr(‖(1/t) Σi Mi‖₂ > ε) ≤ d exp(−C ε² t / γ²),

where ‖M‖₂ ≤ γ holds almost surely and C is an absolute constant.


Notice that the number of samples in the inequality depends logarithmically on d. In general, unfortunately, such a
dependency is inevitable: take for example a diagonal random sign matrix of dimension d. The operator norm of
the sum of t independent samples is precisely the maximum deviation among d independent random walks of length t.
In order to achieve a fixed bound on the maximum deviation with constant probability, t must grow
logarithmically with d in this scenario.[4]
The following theorem can be obtained by assuming M has low rank, in order to avoid the dependency on the
dimensions.

Theorem without the dependency on the dimensions


Let 0 < ε < 1 and let M be a random symmetric real matrix with ‖E[M]‖₂ ≤ 1 and ‖M‖₂ ≤ γ almost surely.
Assume that each element on the support of M has rank at most r. Set

t = Ω(γ log(γ/ε²) / ε²).

If r ≤ t holds almost surely, then

Pr(‖(1/t) Σi Mi − E[M]‖₂ > ε) ≤ 1/poly(t),

where M1, ..., Mt are i.i.d. copies of M.

References
[1] Refer to the proof above
[2] http:/ / books. google. com/ books?id=0bAYl6d7hvkC& printsec=frontcover& source=gbs_summary_r& cad=0#PPA71,M1
[3] http:/ / books. google. com/ books?id=0bAYl6d7hvkC& printsec=frontcover& source=gbs_summary_r& cad=0#PPA72,M1
[4] * Magen, A.; Zouzias, A. (2011). "Low Rank Matrix-Valued Chernoff Bounds and Approximate Matrix Multiplication".
arXiv:1005.2724 [cs.DM].

• Chernoff, H. (1952). "A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the sum of
Observations". Annals of Mathematical Statistics 23 (4): 493–507. doi:10.1214/aoms/1177729330.
JSTOR 2236576. MR57518. Zbl 0048.11804.
• Hoeffding, W. (1963). "Probability Inequalities for Sums of Bounded Random Variables". Journal of the
American Statistical Association 58 (301): 13–30. doi:10.2307/2282952. JSTOR 2282952.
• Chernoff, H. (1981). "A Note on an Inequality Involving the Normal Distribution". Annals of Probability 9 (3):
533. doi:10.1214/aop/1176994428. JSTOR 2243541. MR614640. Zbl 0457.60014.
• Hagerup, T. (1990). "A guided tour of Chernoff bounds". Information Processing Letters 33 (6): 305.
doi:10.1016/0020-0190(90)90214-I.
• Ahlswede, R.; Winter, A. (2003). "Strong Converse for Identification via Quantum Channels". IEEE Transactions
on Information Theory 48 (3): 569–579. arXiv:quant-ph/0012127.
• Mitzenmacher, M.; Upfal, E. (2005). Probability and Computing: Randomized Algorithms and Probabilistic
Analysis (http://books.google.com/books?id=0bAYl6d7hvkC). ISBN 978-0-521-83540-4.
• Nielsen, F. (2011). "Chernoff information of exponential families". arXiv:1102.2684 [cs.IT].
Chi-squared distribution 95

Chi-squared distribution
Probability density function

Cumulative distribution function

Notation χ²(k) or χ²k
Parameters k ∈ N (known as "degrees of freedom")
Support x ∈ [0, +∞)
PDF x^(k/2 − 1) e^(−x/2) / (2^(k/2) Γ(k/2))
CDF P(k/2, x/2) = γ(k/2, x/2) / Γ(k/2)
Mean k
Median ≈ k(1 − 2/(9k))³
Mode max{ k − 2, 0 }
Variance 2k
Skewness √(8/k)
Ex. kurtosis 12 / k
Entropy k/2 + ln(2Γ(k/2)) + (1 − k/2)ψ(k/2)
MGF (1 − 2 t)−k/2   for t < ½
CF (1 − 2 i t)−k/2   [1]

In probability theory and statistics, the chi-squared distribution (also chi-square or χ²-distribution) with k degrees
of freedom is the distribution of a sum of the squares of k independent standard normal random variables. It is one of
the most widely used probability distributions in inferential statistics, e.g., in hypothesis testing or in construction of
confidence intervals.[2][3][4][5] When there is a need to contrast it with the noncentral chi-squared distribution, this
distribution is sometimes called the central chi-squared distribution.

The chi-squared distribution is used in the common chi-squared tests for goodness of fit of an observed distribution
to a theoretical one, the independence of two criteria of classification of qualitative data, and in confidence interval
estimation for a population standard deviation of a normal distribution from a sample standard deviation. Many other
statistical tests also use this distribution, like Friedman's analysis of variance by ranks.
The chi-squared distribution is a special case of the gamma distribution.

Definition
If Z1, ..., Zk are independent, standard normal random variables, then the sum of their squares,

is distributed according to the chi-squared distribution with k degrees of freedom. This is usually denoted as

The chi-squared distribution has one parameter: k — a positive integer that specifies the number of degrees of
freedom (i.e. the number of Zi’s)

Characteristics
Further properties of the chi-squared distribution can be found in the box at the upper right corner of this article.

Probability density function


The probability density function (pdf) of the chi-squared distribution is

f(x; k) = x^(k/2 − 1) e^(−x/2) / (2^(k/2) Γ(k/2))   for x ≥ 0 (and 0 otherwise),

where Γ(k/2) denotes the Gamma function, which has closed-form values for integer k.
For derivations of the pdf in the cases of one and two degrees of freedom, see Proofs related to chi-squared
distribution.

Cumulative distribution function


Its cumulative distribution function is:

F(x; k) = γ(k/2, x/2) / Γ(k/2) = P(k/2, x/2),

where γ(s, z) is the lower incomplete Gamma function and P(s, z) is the regularized Gamma function.
In the special case of k = 2 this function has the simple form

F(x; 2) = 1 − e^(−x/2).

For the cases when 0 < z < 1 (which include all of the cases when this CDF is less than half), the following Chernoff
upper bound may be obtained:[6]

F(zk; k) ≤ (z e^(1−z))^(k/2).

The tail bound for the cases when z > 1 follows similarly:

1 − F(zk; k) ≤ (z e^(1−z))^(k/2).
Tables of this cumulative distribution function are widely available and the function is included in many
spreadsheets and all statistical packages. For another approximation for the CDF modeled after the cube of a
Gaussian, see under Noncentral chi-squared distribution.
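The regularized lower incomplete gamma function P(k/2, x/2) is straightforward to evaluate with its power series; a minimal sketch (the accuracy threshold is an illustrative choice):

```python
import math

def chi2_cdf(x, k):
    # Regularized lower incomplete gamma P(s, z) via its power series,
    # with s = k/2 and z = x/2:  P(s, z) = z^s e^{-z} * sum_{j>=0} z^j / Gamma(s + j + 1).
    s, z = k / 2.0, x / 2.0
    if z <= 0:
        return 0.0
    term = 1.0 / math.gamma(s + 1)
    total = term
    j = 1
    while True:
        term *= z / (s + j)   # ratio of consecutive series terms
        total += term
        if term < 1e-15 * total:
            break
        j += 1
    return total * math.exp(-z + s * math.log(z))

# For k = 2 the closed form is 1 - e^{-x/2}; the series should agree.
for x in (1.0, 3.0, 6.0):
    print(round(chi2_cdf(x, 2), 6), round(1 - math.exp(-x / 2), 6))
```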

Additivity
It follows from the definition of the chi-squared distribution that the sum of independent chi-squared variables is also
chi-squared distributed. Specifically, if {Xi}i=1n are independent chi-squared variables with {ki}i=1n degrees of
freedom, respectively, then Y = X1 + ⋯ + Xn is chi-squared distributed with k1 + ⋯ + kn degrees of freedom.

Information entropy
The information entropy is given by

where ψ(x) is the Digamma function.


The chi-squared distribution is the maximum entropy probability distribution for a random variate X for which
E[X] = k and E[ln X] = ψ(k/2) + ln 2 are fixed.[7]

Noncentral moments
The moments about zero of a chi-squared distribution with k degrees of freedom are given by[8][9]

E[X^m] = k(k + 2)(k + 4)⋯(k + 2m − 2) = 2^m Γ(m + k/2) / Γ(k/2).
Cumulants
The cumulants are readily obtained by a (formal) power series expansion of the logarithm of the characteristic
function:

κn = 2^(n−1) (n − 1)! k.
Asymptotic properties
By the central limit theorem, because the chi-squared distribution is the sum of k independent random variables with
finite mean and variance, it converges to a normal distribution for large k. For many practical purposes, for k > 50
the distribution is sufficiently close to a normal distribution for the difference to be ignored.[10] Specifically, if
X ~ χ²(k), then as k tends to infinity, the distribution of (X − k)/√(2k) tends to a standard normal distribution.
However, convergence is slow, as the skewness is √(8/k) and the excess kurtosis is 12/k. Other functions of the
chi-squared distribution converge more rapidly to a normal distribution. Some examples are:
• If X ~ χ²(k) then √(2X) is approximately normally distributed with mean √(2k − 1) and unit variance (result
credited to R. A. Fisher).
• If X ~ χ²(k) then (X/k)^(1/3) is approximately normally distributed with mean 1 − 2/(9k) and variance 2/(9k).[11]
This is known as the Wilson–Hilferty transformation.

Relation to other distributions


• As k → ∞, (χ²(k) − k)/√(2k) → N(0, 1) in distribution (normal distribution)
• χ²(k) ~ χ′²k(0) (noncentral chi-squared distribution with non-centrality parameter λ = 0)
• If X ~ N(μ, σ²) then ((X − μ)/σ)² has the chi-squared distribution with one degree of freedom
• As a special case, if X ~ N(0, 1) then X² has the chi-squared distribution with one degree of freedom
• ‖(N1, ..., Nk)‖² ~ χ²(k) (the squared norm of k standard normally distributed variables is a chi-squared
distribution with k degrees of freedom)
• If X ~ χ²(ν) and c > 0, then cX ~ Γ(ν/2, 2c) (gamma distribution)
• If X ~ χ²(k) then √X ~ χk (chi distribution)
• If X ~ Rayleigh(1) (Rayleigh distribution) then X² ~ χ²(2)
• If X ~ Maxwell(1) (Maxwell distribution) then X² ~ χ²(3)
• If X ~ χ²(ν) then 1/X ~ Inv-χ²(ν) (inverse-chi-squared distribution)
• The chi-squared distribution is a special case of the type 3 Pearson distribution
• If X ~ χ²(k1) and Y ~ χ²(k2) are independent, then X/(X + Y) ~ Beta(k1/2, k2/2) (beta distribution)
• If X ~ U(0, 1) (uniform distribution) then −2 ln(X) ~ χ²(2)
• The chi-squared distribution is a transformation of the Laplace distribution: if X ~ Laplace(μ, β) then
2|X − μ|/β ~ χ²(2)
• The chi-squared distribution is a transformation of the Pareto distribution
• Student's t-distribution is a transformation of the chi-squared distribution
• Student's t-distribution can be obtained from the chi-squared distribution and the normal distribution
• The noncentral beta distribution can be obtained as a transformation of the chi-squared distribution and the
noncentral chi-squared distribution
• The noncentral t-distribution can be obtained from the normal distribution and the chi-squared distribution
A chi-squared variable with k degrees of freedom is defined as the sum of the squares of k independent standard
normal random variables.
If Y is a k-dimensional Gaussian random vector with mean vector μ and rank k covariance matrix C, then
X = (Y−μ)TC−1(Y−μ) is chi-squared distributed with k degrees of freedom.
The sum of squares of statistically independent unit-variance Gaussian variables which do not have mean zero yields
a generalization of the chi-squared distribution called the noncentral chi-squared distribution.
If Y is a vector of k i.i.d. standard normal random variables and A is a k×k idempotent matrix with rank k−n then the
quadratic form YTAY is chi-squared distributed with k−n degrees of freedom.
The chi-squared distribution is also naturally related to other distributions arising from the Gaussian. In particular,
• Y is F-distributed, Y ~ F(k1, k2), if Y = (X1/k1)/(X2/k2), where X1 ~ χ²(k1) and X2 ~ χ²(k2) are statistically
independent.
• If X is chi-squared distributed, then √X is chi distributed.
• If X1 ~ χ²(k1) and X2 ~ χ²(k2) are statistically independent, then X1 + X2 ~ χ²(k1 + k2). If X1 and X2 are not
independent, then X1 + X2 is not necessarily chi-squared distributed.

Generalizations
The chi-squared distribution is obtained as the sum of the squares of k independent, zero-mean, unit-variance
Gaussian random variables. Generalizations of this distribution can be obtained by summing the squares of other
types of Gaussian random variables. Several such distributions are described below.

Chi-squared distributions

Noncentral chi-squared distribution


The noncentral chi-squared distribution is obtained from the sum of the squares of independent Gaussian random
variables having unit variance and nonzero means.

Generalized chi-squared distribution


The generalized chi-squared distribution is obtained from the quadratic form z′Az where z is a zero-mean Gaussian
vector having an arbitrary covariance matrix, and A is an arbitrary matrix.

Gamma, exponential, and related distributions


The chi-squared distribution X ~ χ²(k) is a special case of the gamma distribution, in that X ~ Γ(k/2, 1/2) (using the
shape parameterization of the gamma distribution) where k is an integer.
Because the exponential distribution is also a special case of the Gamma distribution, we also have that if X ~ χ²(2),
then X ~ Exp(1/2) is an exponential distribution.
The Erlang distribution is also a special case of the Gamma distribution and thus we also have that if X ~ χ²(k) with
even k, then X is Erlang distributed with shape parameter k/2 and scale parameter 1/2.

Applications
The chi-squared distribution has numerous applications in inferential statistics, for instance in chi-squared tests and
in estimating variances. It enters the problem of estimating the mean of a normally distributed population and the
problem of estimating the slope of a regression line via its role in Student’s t-distribution. It enters all analysis of
variance problems via its role in the F-distribution, which is the distribution of the ratio of two independent
chi-squared random variables, each divided by their respective degrees of freedom.
Following are some of the most common situations in which the chi-squared distribution arises from a
Gaussian-distributed sample.

• if X1, ..., Xn are i.i.d. N(μ, σ²) random variables, then Σi (Xi − X̄)² / σ² ~ χ²(n − 1), where X̄ = (1/n) Σi Xi.

• The box below shows probability distributions with name starting with chi for some statistics based on Xi ∼
Normal(μi, σ2i), i = 1, ⋯, k, independent random variables:

Name Statistic

chi-squared distribution Σi ((Xi − μi)/σi)²

noncentral chi-squared distribution Σi (Xi/σi)²

chi distribution √( Σi ((Xi − μi)/σi)² )

noncentral chi distribution √( Σi (Xi/σi)² )

Table of χ2 value vs p-value


The p-value is the probability of observing a test statistic at least as extreme in a chi-squared distribution.
Accordingly, since the cumulative distribution function (CDF) for the appropriate degrees of freedom (df) gives the
probability of having obtained a value less extreme than this point, subtracting the CDF value from 1 gives the
p-value. The table below gives a number of p-values matching selected χ² values for the first 10 degrees of freedom.
A p-value of 0.05 or less is usually regarded as statistically significant, i.e. the observed deviation from the null
hypothesis is significant.

χ² values by degrees of freedom (df); each column corresponds to the p-value listed in the bottom row:[12]
1 0.004 0.02 0.06 0.15 0.46 1.07 1.64 2.71 3.84 6.64 10.83

2 0.10 0.21 0.45 0.71 1.39 2.41 3.22 4.60 5.99 9.21 13.82

3 0.35 0.58 1.01 1.42 2.37 3.66 4.64 6.25 7.82 11.34 16.27

4 0.71 1.06 1.65 2.20 3.36 4.88 5.99 7.78 9.49 13.28 18.47

5 1.14 1.61 2.34 3.00 4.35 6.06 7.29 9.24 11.07 15.09 20.52

6 1.63 2.20 3.07 3.83 5.35 7.23 8.56 10.64 12.59 16.81 22.46

7 2.17 2.83 3.82 4.67 6.35 8.38 9.80 12.02 14.07 18.48 24.32

8 2.73 3.49 4.59 5.53 7.34 9.52 11.03 13.36 15.51 20.09 26.12

9 3.32 4.17 5.38 6.39 8.34 10.66 12.24 14.68 16.92 21.67 27.88

10 3.94 4.86 6.18 7.27 9.34 11.78 13.44 15.99 18.31 23.21 29.59

P value (Probability) 0.95 0.90 0.80 0.70 0.50 0.30 0.20 0.10 0.05 0.01 0.001

Nonsignificant Significant
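For even df the survival function has a closed form, 1 − F(x; k) = e^(−x/2) Σj<k/2 (x/2)^j / j!, which reproduces rows of the table; a quick sketch:

```python
import math

def chi2_pvalue_even_df(x, k):
    # For even k, the chi-squared survival function has the closed form
    # P(X >= x) = exp(-x/2) * sum_{j=0}^{k/2 - 1} (x/2)^j / j!.
    assert k % 2 == 0
    z = x / 2.0
    return math.exp(-z) * sum(z**j / math.factorial(j) for j in range(k // 2))

# Row df = 2 of the table: chi-squared value 5.99 corresponds to p = 0.05,
# and 13.82 corresponds to p = 0.001.
print(round(chi2_pvalue_even_df(5.99, 2), 3))
print(round(chi2_pvalue_even_df(13.82, 2), 3))
```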

History
This distribution was first described by the German statistician Friedrich Robert Helmert.

References
[1] M.A. Sanders. "Characteristic function of the central chi-squared distribution" (http:/ / www. planetmathematics. com/ CentralChiDistr. pdf).
. Retrieved 2009-03-06.
[2] Abramowitz, Milton; Stegun, Irene A., eds. (1965), "Chapter 26" (http:/ / www. math. sfu. ca/ ~cbm/ aands/ page_940. htm), Handbook of
Mathematical Functions with Formulas, Graphs, and Mathematical Tables, New York: Dover, pp. 940, ISBN 978-0486612720, MR0167642,
.
[3] NIST (2006). Engineering Statistics Handbook - Chi-Squared Distribution (http:/ / www. itl. nist. gov/ div898/ handbook/ eda/ section3/
eda3666. htm)
[4] Johnson, N.L.; S. Kotz, N. Balakrishnan (1994). Continuous Univariate Distributions (Second Ed., Vol. 1, Chapter 18). John Wiley and
Sons. ISBN 0-471-58495-9.
[5] Mood, Alexander; Franklin A. Graybill, Duane C. Boes (1974). Introduction to the Theory of Statistics (Third Edition, p. 241-246).
McGraw-Hill. ISBN 0-07-042864-6.
[6] Dasgupta, Sanjoy D. A.; Gupta, Anupam K. (2002). "An Elementary Proof of a Theorem of Johnson and Lindenstrauss" (http:/ / cseweb.
ucsd. edu/ ~dasgupta/ papers/ jl. pdf). Random Structures and Algorithms 22: 60-65. . Retrieved 2012-05-01.
[7] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http:/ / www. wise. xmu. edu.
cn/ Master/ Download/ . . \. . \UploadFiles\paper-masterdownload\2009519932327055475115776. pdf). Journal of Econometrics (Elsevier):
219–230. . Retrieved 2011-06-02.
[8] Chi-squared distribution (http:/ / mathworld. wolfram. com/ Chi-SquaredDistribution. html), from MathWorld, retrieved Feb. 11, 2009
[9] M. K. Simon, Probability Distributions Involving Gaussian Random Variables, New York: Springer, 2002, eq. (2.35), ISBN
978-0-387-34657-1
[10] Box, Hunter and Hunter. Statistics for experimenters. Wiley. p. 46.
[11] Wilson, E.B.; Hilferty, M.M. (1931) "The distribution of chi-squared". Proceedings of the National Academy of Sciences, Washington, 17,
684–688.
[12] Chi-Squared Test (http:/ / www2. lv. psu. edu/ jxm57/ irp/ chisquar. html) Table B.2. Dr. Jacqueline S. McLaughlin at The Pennsylvania
State University. In turn citing: R.A. Fisher and F. Yates, Statistical Tables for Biological Agricultural and Medical Research, 6th ed., Table
IV

External links
• Earliest Uses of Some of the Words of Mathematics: entry on Chi squared has a brief history (http://jeff560.
tripod.com/c.html)
• Course notes on Chi-Squared Goodness of Fit Testing (http://www.stat.yale.edu/Courses/1997-98/101/chigf.
htm) from Yale University Stats 101 class.
• Mathematica demonstration showing the chi-squared sampling distribution of various statistics, e.g. Σx², for a
normal population (http://demonstrations.wolfram.com/StatisticsAssociatedWithNormalSamples/)
• Simple algorithm for approximating cdf and inverse cdf for the chi-squared distribution with a pocket calculator
(http://www.jstor.org/stable/2348373)
Computational complexity of mathematical operations 102

Computational complexity of mathematical operations
The following tables list the running time of various algorithms for common mathematical operations.
Here, complexity refers to the time complexity of performing computations on a multitape Turing machine.[1] See
big O notation for an explanation of the notation used.
Note: Due to the variety of multiplication algorithms, M(n) below stands in for the complexity of the chosen
multiplication algorithm.

Arithmetic functions
Operation Input Output Algorithm Complexity

Addition Two n-digit numbers One n+1-digit Schoolbook addition with carry Θ(n)
number

Subtraction Two n-digit numbers One n+1-digit Schoolbook subtraction with borrow Θ(n)
number

Multiplication Two n-digit numbers One 2n-digit Schoolbook long multiplication O(n2)
number
Karatsuba algorithm O(n1.585)

3-way Toom–Cook multiplication O(n1.465)

k-way Toom–Cook multiplication O(nlog (2k − 1)/log k)

Mixed-level Toom–Cook (Knuth 4.3.3-T)[2] O(n 2^(√(2 log n)) log n)

Schönhage–Strassen algorithm O(n log n log log n)

Fürer's algorithm[3] O(n log n 2^(log* n))

Division Two n-digit numbers One n-digit number Schoolbook long division O(n2)

Newton's method M(n)

Square root One n-digit number One n-digit number Newton's method M(n)

Modular Two n-digit numbers and a k-bit One n-digit number Repeated multiplication and reduction O(2kM(n))
exponentiation exponent
Exponentiation by squaring O(k M(n))

Exponentiation with Montgomery O(k M(n))


reduction

Schnorr and Stumpf[4] conjectured that no fastest algorithm for multiplication exists.
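As an illustration of the table's multiplication entries, Karatsuba's O(n^1.585) scheme replaces one n-digit product by three products of roughly n/2 digits; a minimal sketch over Python integers (the base-case cutoff is an arbitrary choice):

```python
def karatsuba(x, y):
    # Multiply nonnegative integers with three recursive half-size products:
    # x*y = hi*B^2 + mid*B + lo, where x = b1*B + a1, y = b2*B + a2, B = 2^m,
    # hi = b1*b2, lo = a1*a2, mid = (b1+a1)*(b2+a2) - hi - lo.
    if x < 16 or y < 16:
        return x * y
    m = max(x.bit_length(), y.bit_length()) // 2
    b1, a1 = x >> m, x & ((1 << m) - 1)
    b2, a2 = y >> m, y & ((1 << m) - 1)
    hi = karatsuba(b1, b2)
    lo = karatsuba(a1, a2)
    mid = karatsuba(b1 + a1, b2 + a2) - hi - lo
    return (hi << (2 * m)) + (mid << m) + lo

print(karatsuba(123456789, 987654321) == 123456789 * 987654321)  # True
```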

Algebraic functions
Operation Input Output Algorithm Complexity

Polynomial evaluation One polynomial of degree n with fixed-size One fixed-size number Direct evaluation Θ(n)
polynomial coefficients
Horner's method Θ(n)

Polynomial gcd (over Two polynomials of degree n with fixed-size One polynomial of degree Euclidean algorithm O(n2)
Z[x] or F[x]) polynomial coefficients at most n
Fast Euclidean algorithm[5] O(n (log n)² log log n)
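The Θ(n) evaluation entries above rely on Horner's rule, which rewrites the polynomial as nested multiply-adds; a minimal sketch:

```python
def horner(coeffs, x):
    # Evaluate a0 + a1*x + ... + an*x^n with n multiplications and n additions
    # by nesting: (...((an*x + a(n-1))*x + ...)*x + a0).
    result = 0
    for a in reversed(coeffs):
        result = result * x + a
    return result

# p(x) = 2 + 3x + x^3 at x = 2: 2 + 6 + 8 = 16.
print(horner([2, 3, 0, 1], 2))  # 16
```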

Special functions
Many of the methods in this section are given in Borwein & Borwein.[6]

Elementary functions
The elementary functions are constructed by composing arithmetic operations, the exponential function (exp), the
natural logarithm (log), trigonometric functions (sin, cos), and their inverses. The complexity of an elementary
function is equivalent to that of its inverse, since all elementary functions are analytic and hence invertible by means
of Newton's method. In particular, if either exp or log can be computed with some complexity, then that complexity
is attainable for all other elementary functions.
Below, the size n refers to the number of digits of precision at which the function is to be evaluated.

Algorithm Applicability Complexity

Taylor series; repeated argument reduction (e.g. exp(2x) = [exp(x)]2) and direct summation exp, log, sin, cos O(n1/2 M(n))

Taylor series; FFT-based acceleration exp, log, sin, cos O(n1/3 (log n)2 M(n))

Taylor series; binary splitting + bit burst method[7] exp, log, sin, cos O((log n)² M(n))

Arithmetic-geometric mean iteration log O(log n M(n))

It is not known whether O(log n M(n)) is the optimal complexity for elementary functions. The best known lower
bound is the trivial bound Ω(M(n)).

Non-elementary functions

Function Input Algorithm Complexity

Gamma function n-digit number Series approximation of the incomplete gamma function O(n1/2 (log n)2 M(n))

Fixed rational number Hypergeometric series O((log n)2 M(n))

m/24, m an integer Arithmetic-geometric mean iteration O(log n M(n))

Hypergeometric function pFq n-digit number (As described in Borwein & Borwein) O(n1/2 (log n)2 M(n))

Fixed rational number Hypergeometric series O((log n)2 M(n))



Mathematical constants
This table gives the complexity of computing approximations to the given constants to n correct digits.

Constant | Algorithm | Complexity
Golden ratio, φ | Newton's method | O(M(n))
Square root of 2, √2 | Newton's method | O(M(n))
Euler's number, e | Binary splitting of the Taylor series for the exponential function | O(log n M(n))
Euler's number, e | Newton inversion of the natural logarithm | O(log n M(n))
Pi, π | Binary splitting of the arctan series in Machin's formula | O((log n)^2 M(n))
Pi, π | Salamin–Brent algorithm | O(log n M(n))
Euler's constant, γ | Sweeney's method (approximation in terms of the exponential integral) | O((log n)^2 M(n))
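The binary-splitting entry for e can be sketched as follows. The helper name `e_digits` and the crude term-count bound are our own; a production implementation would tighten the error analysis.

```python
def e_digits(n_digits):
    """First n_digits decimal digits of e via binary splitting of sum 1/k!."""
    def split(a, b):
        # Returns (p, q) with p/q == sum over k in (a, b] of 1/((a+1)(a+2)...k).
        if b - a == 1:
            return 1, b
        m = (a + b) // 2
        p1, q1 = split(a, m)
        p2, q2 = split(m, b)
        return p1 * q2 + p2, q1 * q2

    # Take enough terms that the truncation error is below the last digit.
    terms, f = 1, 1
    while f < 10 ** (n_digits + 1):
        terms += 1
        f *= terms
    p, q = split(0, terms)
    # e = 1 + p/q; scale so the result is an n_digits-digit integer.
    return (q + p) * 10 ** (n_digits - 1) // q
```

Binary splitting keeps the balanced subproducts small until the final few multiplications, which is where the O(log n M(n)) bound comes from.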

Number theory
Algorithms for number theoretical calculations are studied in computational number theory.

Operation Input Output Algorithm Complexity

Greatest common Two n-digit One number with at most n Euclidean algorithm O(n2)
divisor numbers digits
Binary GCD algorithm O(n2)

[8]
Left/Right k-ary Binary GCD algorithm O(n2 / log n)

[9] O(log n M(n))


Stehlé-Zimmermann algorithm

O(log n M(n))
Schönhage controlled Euclidean descent
[10]
algorithm

Jacobi symbol Two n-digit 0, -1, or 1


O(log n M(n))
numbers Schönhage controlled Euclidean descent
[11]
algorithm

[12] O(log n M(n))


Stehlé-Zimmermann algorithm

Factorial A fixed-size One O(m log m)-digit Bottom-up multiplication O(m2 log m)
number m number
Binary splitting O(log m M(m log m))

Exponentiation of the prime factors of m O(log log m M(m log


[13]
m)),
[1]
O(M(m log m))
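The binary GCD entry above replaces division with shifts and subtractions; a minimal sketch:

```python
def binary_gcd(a, b):
    """Binary GCD: only shifts and subtractions, no division; O(n^2) bit operations."""
    if a == 0:
        return b
    if b == 0:
        return a
    shift = 0
    while (a | b) & 1 == 0:        # factor out common powers of two
        a >>= 1
        b >>= 1
        shift += 1
    while a & 1 == 0:              # make a odd
        a >>= 1
    while b:
        while b & 1 == 0:
            b >>= 1
        if a > b:
            a, b = b, a
        b -= a                     # both odd here, so b - a is even
    return a << shift
```

On hardware where shifts are cheaper than division, this constant-factor win matters even though the asymptotic bound matches the Euclidean algorithm.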

Matrix algebra
The following complexity figures assume that arithmetic with individual elements has complexity O(1), as is the
case with fixed-precision floating-point arithmetic.

Operation | Input | Output | Algorithm | Complexity
Matrix multiplication | Two n×n matrices | One n×n matrix | Schoolbook matrix multiplication | O(n^3)
 | | | Strassen algorithm | O(n^2.807)
 | | | Coppersmith–Winograd algorithm | O(n^2.376)
 | | | Williams algorithm [14] | O(n^2.373)
Matrix multiplication | One n×m matrix and one m×p matrix | One n×p matrix | Schoolbook matrix multiplication | O(nmp)
Matrix inversion* | One n×n matrix | One n×n matrix | Gauss–Jordan elimination | O(n^3)
 | | | Strassen algorithm | O(n^2.807)
 | | | Coppersmith–Winograd algorithm | O(n^2.376)
Determinant | One n×n matrix | One number | Laplace expansion | O(n!)
 | | | LU decomposition | O(n^3)
 | | | Bareiss algorithm | O(n^3)
 | | | Fast matrix multiplication | O(n^2.376)
Back substitution | Triangular matrix | n solutions | Back substitution [15] | O(n^2)
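The Strassen entry trades one of the eight block products for extra additions. A compact sketch for sizes that are powers of two (`strassen` is our own name; real implementations fall back to schoolbook multiplication below a size cutoff):

```python
def strassen(A, B):
    """Strassen multiplication of n x n matrices (n a power of two):
    7 recursive block products instead of 8 give the O(n^2.807) bound."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M):
        return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
                [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])
    def add(X, Y): return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def sub(X, Y): return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A)
    B11, B12, B21, B22 = quad(B)
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```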

In 2005, Henry Cohn, Robert Kleinberg, Balázs Szegedy and Christopher Umans showed that either of two different
conjectures would imply that the exponent of matrix multiplication is 2.[16] It has also been conjectured that no
fastest algorithm for matrix multiplication exists, in light of the nearly 20 successive improvements leading to the
Williams algorithm.
* Because an n×n matrix can be inverted blockwise, where the inversion requires the inversion of two half-sized
matrices and six multiplications between half-sized matrices, and since matrix multiplication has a lower bound of
Ω(n^2 log n) operations[17], it can be shown that a divide-and-conquer algorithm that uses blockwise inversion to
invert a matrix runs with the same time complexity as the matrix multiplication algorithm that is used internally.

References
[1] A. Schönhage, A.F.W. Grotefeld, E. Vetter: Fast Algorithms—A Multitape Turing Machine Implementation, BI Wissenschafts-Verlag,
Mannheim, 1994
[2] D. Knuth. The Art of Computer Programming, Volume 2. Third Edition, Addison-Wesley 1997.
[3] Martin Fürer. Faster Integer Multiplication (http://www.cse.psu.edu/~furer/Papers/mult.pdf). Proceedings of the 39th Annual ACM
Symposium on Theory of Computing, San Diego, California, USA, June 11–13, 2007, pp. 55–67.
[4] C. P. Schnorr and G. Stumpf. A characterization of complexity sequences. Zeitschrift für Mathematische Logik und Grundlagen der
Mathematik 21(1):47–56, 1975.
[5] http://planetmath.org/encyclopedia/HalfGCDAlgorithm.html
[6] J. Borwein & P. Borwein. Pi and the AGM: A Study in Analytic Number Theory and Computational Complexity. John Wiley 1987.
[7] David and Gregory Chudnovsky. Approximations and complex multiplication according to Ramanujan. Ramanujan revisited, Academic
Press, 1988, pp 375–472.
[8] J. Sorenson. (1994). "Two Fast GCD Algorithms". Journal of Algorithms 16 (1): 110–144. doi:10.1006/jagm.1994.1006.
[9] R. Crandall & C. Pomerance. Prime Numbers - A Computational Perspective. Second Edition, Springer 2005.
[10] Möller N (2008). "On Schönhage's algorithm and subquadratic integer gcd computation" (http://www.lysator.liu.se/~nisse/archive/sgcd.pdf).
Mathematics of Computation 77 (261): 589–607. doi:10.1090/S0025-5718-07-02017-0.
[11] Bernstein D J. "Faster Algorithms to Find Non-squares Modulo Worst-case Integers" (http://cr.yp.to/papers/nonsquare.ps).
[12] Richard P. Brent; Paul Zimmermann (2010). "An O(M(n) log n) algorithm for the Jacobi symbol". arXiv:1004.2091.

[13] P. Borwein. "On the complexity of calculating factorials". Journal of Algorithms 6, 376-380 (1985)
[14] Virginia Vassilevska Williams, Breaking the Coppersmith-Winograd barrier (http://www.cs.berkeley.edu/~virgi/matrixmult.pdf), 2011
preprint.
[15] J. B. Fraleigh and R. A. Beauregard, "Linear Algebra," Addison-Wesley Publishing Company, 1987, p 95.
[16] Henry Cohn, Robert Kleinberg, Balazs Szegedy, and Chris Umans. Group-theoretic Algorithms for Matrix Multiplication.
arXiv:math.GR/0511460. Proceedings of the 46th Annual Symposium on Foundations of Computer Science, 23–25 October 2005, Pittsburgh,
PA, IEEE Computer Society, pp. 379–388.
[17] Ran Raz. On the complexity of matrix product. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing. ACM
Press, 2002. doi:10.1145/509907.509932.

Conjugate prior
In Bayesian probability theory, if the posterior distributions p(θ|x) are in the same family as the prior probability
distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate
prior for the likelihood. For example, the Gaussian family is conjugate to itself (or self-conjugate) with respect to a
Gaussian likelihood function: if the likelihood function is Gaussian, choosing a Gaussian prior over the mean will
ensure that the posterior distribution is also Gaussian. This means that the Gaussian distribution is a conjugate prior
for the likelihood which is also Gaussian. The concept, as well as the term "conjugate prior", were introduced by
Howard Raiffa and Robert Schlaifer in their work on Bayesian decision theory.[1] A similar concept had been
discovered independently by George Alfred Barnard.[2]
Consider the general problem of inferring a distribution for a parameter θ given some datum or data x. From Bayes'
theorem, the posterior distribution is equal to the product of the likelihood function p(x|θ) and prior p(θ),
normalized (divided) by the probability of the data p(x):

    p(θ|x) = p(x|θ) p(θ) / p(x),   where   p(x) = ∫ p(x|θ′) p(θ′) dθ′.

Let the likelihood function be considered fixed; the likelihood function is usually well-determined from a statement
of the data-generating process. It is clear that different choices of the prior distribution p(θ) may make the integral
more or less difficult to calculate, and the product p(x|θ) × p(θ) may take one algebraic form or another. For certain
choices of the prior, the posterior has the same algebraic form as the prior (generally with different parameter
values). Such a choice is a conjugate prior.
A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior: otherwise a difficult
numerical integration may be necessary. Further, conjugate priors may give intuition, by more transparently showing
how a likelihood function updates a distribution.
All members of the exponential family have conjugate priors. See Gelman et al.[3] for a catalog.

Example
The form of the conjugate prior can generally be determined by inspection of the probability density or probability
mass function of a distribution. For example, consider a random variable which is a Bernoulli trial with unknown
probability of success q in [0,1]. The probability mass function has the form

    p(x|q) = q^x (1 − q)^(1 − x),   x ∈ {0, 1}.

Expressed as a function of q, this has the form

    q^a (1 − q)^b

for some constants a and b. Generally, this functional form will have an additional multiplicative factor (the
normalizing constant) ensuring that the function is a probability distribution, i.e. the integral over the entire range is
1. This factor will often be a function of a and b, but never of q.
In fact, the usual conjugate prior is the beta distribution with parameters α and β:

    p(q) = q^(α − 1) (1 − q)^(β − 1) / B(α, β),

where α and β are chosen to reflect any existing belief or information (α = 1 and β = 1 would give a uniform
distribution) and B(α, β) is the Beta function acting as a normalising constant.
In this context, α and β are called hyperparameters (parameters of the prior), to distinguish them from parameters
of the underlying model (here q). It is a typical characteristic of conjugate priors that the dimensionality of the
hyperparameters is one greater than that of the parameters of the original distribution. If all parameters are scalar
values, then this means that there will be one more hyperparameter than parameter; but this also applies to
vector-valued and matrix-valued parameters. (See the general article on the exponential family, and consider also the
Wishart distribution, conjugate prior of the covariance matrix of a multivariate normal distribution, for an example
where a large dimensionality is involved.)
If we then sample this random variable and get s successes and f failures, we have

    p(q|s, f) = q^(s + α − 1) (1 − q)^(f + β − 1) / B(s + α, f + β),

which is another beta distribution, Beta(α + s, β + f), with a simple change to the (hyper)parameters. This posterior distribution could
then be used as the prior for more samples, with the hyperparameters simply adding each extra piece of information
as it comes.
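The update just derived is a one-liner in code. This sketch (function names our own) tracks only the hyperparameters, since the normalizing constant takes care of itself:

```python
def beta_posterior(alpha, beta, data):
    """Beta(alpha, beta) prior + Bernoulli observations (0/1 values) ->
    Beta(alpha + successes, beta + failures) posterior."""
    s = sum(data)
    f = len(data) - s
    return alpha + s, beta + f

def beta_mean(a, b):
    """Posterior mean point estimate of q."""
    return a / (a + b)
```

Starting from a uniform Beta(1, 1) prior, observing [1, 1, 0, 1] gives a Beta(4, 2) posterior with mean 2/3, and feeding the data in two batches gives the same result, illustrating the sequential-updating property just described.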

Pseudo-observations
It is often useful to think of the hyperparameters of a conjugate prior distribution as corresponding to having
observed a certain number of pseudo-observations with properties specified by the parameters. For example, the
values α and β of a beta distribution can be thought of as corresponding to α − 1 successes and β − 1 failures
if the posterior mode is used to choose an optimal parameter setting, or α successes and β failures if the posterior
mean is used to choose an optimal parameter setting. In general, for nearly all conjugate prior distributions, the
hyperparameters can be interpreted in terms of pseudo-observations. This can help both in providing an intuition
behind the often messy update equations and in choosing reasonable hyperparameters for a prior.

Interpretations

Analogy with eigenfunctions


Conjugate priors are analogous to eigenfunctions in operator theory, in that they are distributions on which the
"conditioning operator" acts in a well-understood way, thinking of the process of changing from the prior to the
posterior as an operator.
In both eigenfunctions and conjugate priors, there is a finite dimensional space which is preserved by the operator:
the output is of the same form (in the same space) as the input. This greatly simplifies the analysis, as it otherwise
considers an infinite dimensional space (space of all functions, space of all distributions).
However, the processes are only analogous, not identical: conditioning is not linear, as the space of distributions is
not closed under linear combination, only convex combination, and the posterior is only of the same form as the
prior, not a scalar multiple.
Just as one can easily analyze how a linear combination of eigenfunctions evolves under application of an operator
(because, with respect to these functions, the operator is diagonalized), one can easily analyze how a convex
combination of conjugate priors evolves under conditioning; this is called using a hyperprior, and corresponds to
using a mixture density of conjugate priors, rather than a single conjugate prior.

Dynamical system
One can think of conditioning on conjugate priors as defining a kind of (discrete time) dynamical system: from a
given set of hyperparameters, incoming data updates these hyperparameters, so one can see the change in
hyperparameters as a kind of "time evolution" of the system, corresponding to "learning". Starting at different points
yields different flows over time. This is again analogous with the dynamical system defined by a linear operator, but
note that since different samples lead to different inference, this is not simply dependent on time, but rather on data
over time. For related approaches, see Recursive Bayesian estimation and Data assimilation.
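The "time evolution" view can be illustrated with a Gamma-Poisson model (a standard conjugate pair; names here are our own): updating one observation at a time traces a path that lands exactly where a single batch update jumps.

```python
def gamma_poisson_update(a, b, counts):
    """Gamma(a, b) prior on a Poisson rate (b = rate parameter):
    posterior is Gamma(a + sum(counts), b + len(counts))."""
    return a + sum(counts), b + len(counts)

# Hyperparameter "time evolution": feed observations one at a time.
a, b = 2.0, 1.0
for c in [3, 0, 2, 4]:
    a, b = gamma_poisson_update(a, b, [c])
```

After the loop, (a, b) equals the result of a single update on the full batch [3, 0, 2, 4], so the "flow" depends only on the accumulated data, not on the order of arrival.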

Table of conjugate distributions


Let n denote the number of observations.
If the likelihood function belongs to the exponential family, then a conjugate prior exists, often also in the
exponential family; see Exponential family: Conjugate distributions.

Discrete likelihood distributions

Likelihood | Model parameters | Conjugate prior | Posterior hyperparameters [5] | Interpretation of hyperparameters | Posterior predictive
Bernoulli | p (probability) | Beta(α, β) | α + Σx_i, β + n − Σx_i | α successes, β failures [4] |
Binomial | p (probability) | Beta(α, β) | α + Σx_i, β + ΣN_i − Σx_i | α successes, β failures [4] | beta-binomial
Negative binomial with known failure number r | p (probability) | Beta(α, β) | α + Σx_i, β + rn | α total successes, β failures (i.e. β/r experiments, assuming r stays fixed) [4] |
Poisson | λ (rate) | Gamma(k, θ) | k + Σx_i, θ/(nθ + 1) | k total occurrences in 1/θ intervals | negative binomial
Poisson | λ (rate) | Gamma(α, β) [6] | α + Σx_i, β + n | α total occurrences in β intervals | negative binomial
Categorical | p (probability vector), k (number of categories, i.e. size of p) | Dirichlet(α) | α + (c_1, …, c_k), where c_i is the number of observations in category i | α_i occurrences of category i [4] |
Multinomial | p (probability vector), k (number of categories, i.e. size of p) | Dirichlet(α) | α + Σx_i | α_i occurrences of category i [4] | Dirichlet-multinomial
Hypergeometric with known total population size N | M (number of target members) | Beta-binomial [7] | (see [7]) | α successes, β failures [4] |
Geometric | p_0 (probability) | Beta(α, β) | α + n, β + Σx_i | α experiments, β total failures [4] |

Continuous likelihood distributions


Note: In all cases below, the data is assumed to consist of n points (which will be random vectors in the
multivariate cases).

Likelihood | Model parameters | Conjugate prior | Interpretation of hyperparameters [8]
Normal with known variance σ^2 | μ (mean) | Normal | mean was estimated from observations with total precision (sum of all individual precisions) and with sample mean [9]
Normal with known precision τ | μ (mean) | Normal | mean was estimated from observations with total precision (sum of all individual precisions) and with sample mean [9]
Normal with known mean μ | σ^2 (variance) | Inverse gamma [10] | variance was estimated from observations with sample variance (i.e. with sum of squared deviations from the known mean) [9]
Normal with known mean μ | σ^2 (variance) | Scaled inverse chi-squared | variance was estimated from observations with sample variance [9]
Normal with known mean μ | τ (precision) | Gamma [6] | precision was estimated from observations with sample variance (i.e. with sum of squared deviations from the known mean) [9]
Normal, assuming exchangeability | μ and σ^2 | Normal-inverse gamma | mean was estimated from observations with sample mean; variance was estimated from observations with the same sample mean and sum of squared deviations [9]
Normal, assuming exchangeability | μ and τ | Normal-gamma | mean was estimated from observations with sample mean, and precision was estimated from observations with the same sample mean and sum of squared deviations [9]
Multivariate normal with known covariance matrix Σ | μ (mean vector) | Multivariate normal | mean was estimated from observations with total precision (sum of all individual precisions) and with sample mean [9]
Multivariate normal with known precision matrix Λ | μ (mean vector) | Multivariate normal | mean was estimated from observations with total precision (sum of all individual precisions) and with sample mean [9]
Multivariate normal with known mean μ | Σ (covariance matrix) | Inverse-Wishart | variance was estimated from observations with sum of squared deviations [9]
Multivariate normal with known mean μ | Λ (precision matrix) | Wishart | precision was estimated from observations with sum of squared deviations [9]
Multivariate normal | μ (mean vector) and Σ (covariance matrix) | Normal-inverse-Wishart | mean was estimated from observations with sample mean; variance was estimated from observations with the same sample mean and sum of squared deviations
Multivariate normal | μ (mean vector) and Λ (precision matrix) | Normal-Wishart | mean was estimated from observations with sample mean; variance was estimated from observations with the same sample mean and sum of squared deviations
Uniform on [0, θ] | θ | Pareto | observations with maximum value
Pareto with known minimum x_m | k (shape) | Gamma | observations with sum of the order of magnitude of each observation (i.e. the logarithm of the ratio of each observation to the minimum x_m)
Weibull with known shape β | θ (scale) | Inverse gamma [7] | observations with sum of the β'th power of each observation
Weibull | β (shape) | (see [7]) | observations with sum of the log of each observation and sum of the β'th power of each observation (not quite right)
Log-normal with known precision τ | μ (mean) | Normal [7] | "mean" was estimated from observations with total precision (sum of all individual precisions) and with sample mean
Log-normal with known mean μ | τ (precision) | Gamma [7] [6] | precision was estimated from observations with sample variance (i.e. with sum of squared log deviations, that is, deviations between the logs of the data points and the "mean")
Exponential | λ (rate) | Gamma [6] | observations that sum to a given total
Gamma with known shape α | β (rate) | Gamma | observations with a given sum
Inverse gamma with known shape α | β (inverse scale) | Gamma | observations with a given sum
Gamma with known rate β | α (shape) | (see [7]) | observations with a given product
Gamma | α (shape) and β (inverse scale) | (see [7]) | α was estimated from observations with a given product; β was estimated from observations with a given sum
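For a normal likelihood with known precision and a normal prior on the mean, the standard update is a precision-weighted average. A sketch, with `normal_mean_update` as an illustrative name:

```python
def normal_mean_update(mu0, tau0, tau, data):
    """Normal likelihood with known precision tau; Normal prior on the mean
    with prior mean mu0 and prior precision tau0. Posterior precision adds
    n*tau, and the posterior mean is a precision-weighted average."""
    n = len(data)
    tau_n = tau0 + n * tau
    mu_n = (tau0 * mu0 + tau * sum(data)) / tau_n
    return mu_n, tau_n
```

The prior precision tau0 behaves exactly like tau0/tau pseudo-observations centered at mu0, matching the interpretation column above.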

Notes
[1] Howard Raiffa and Robert Schlaifer. Applied Statistical Decision Theory. Division of Research, Graduate School of Business Administration,
Harvard University, 1961.
[2] Jeff Miller et al. Earliest Known Uses of Some of the Words of Mathematics (http:/ / jeff560. tripod. com/ mathword. html), "conjugate prior
distributions" (http:/ / jeff560. tripod. com/ c. html). Electronic document, revision of November 13, 2005, retrieved December 2, 2005.
[3] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis, 2nd edition. CRC Press, 2003. ISBN
1-58488-388-X.
[4] The exact interpretation of the parameters of a beta distribution in terms of number of successes and failures depends on what function is used
to extract a point estimate from the distribution. The mode of a beta distribution is (α − 1)/(α + β − 2), which corresponds to α − 1 successes
and β − 1 failures; but the mean is α/(α + β), which corresponds to α successes and β failures. The use of α − 1 and β − 1 has
the advantage that a uniform Beta(1, 1) prior corresponds to 0 successes and 0 failures, but the use of α and β is somewhat more
convenient mathematically and also corresponds well with the fact that Bayesians generally prefer to use the posterior mean rather than the
posterior mode as a point estimate. The same issues apply to the Dirichlet distribution.
[5] This is the posterior predictive distribution of a new data point given the observed data points, with the parameters marginalized out.
Variables with primes indicate the posterior values of the parameters.
[6] β is rate or inverse scale. In the (k, θ) parameterization of the gamma distribution, θ = 1/β and k = α.
[7] Fink, D. (1997). "A Compendium of Conjugate Priors" (In progress report: Extension and enhancement of methods for setting data quality
objectives). DOE contract 95‑831. CiteSeerX: 10.1.1.157.5540 (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.157.5540).
[8] This is the posterior predictive distribution of a new data point given the observed data points, with the parameters marginalized out.
Variables with primes indicate the posterior values of the parameters. N and t refer to the normal distribution and Student's t-distribution,
respectively, or to the multivariate normal distribution and multivariate t-distribution in the multivariate cases.
[9] Murphy, Kevin P. (2007). "Conjugate Bayesian analysis of the Gaussian distribution." (http://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf)
[10] In terms of the inverse gamma, β is a scale parameter.


[11] The posterior predictive is a compound gamma distribution; here it is a generalized beta prime distribution.

References

External links
• Step-by-step calculation of normal distribution posterior hyperparameters (http://www.eisber.net/StatWiki/
index.php/Mathematische_Statistik_-_Übung_Ergänzungsaufgabe_2_Beispiel_2)

Continuous mapping theorem


In probability theory, the continuous mapping theorem states that continuous functions are limit-preserving even if
their arguments are sequences of random variables. A continuous function, in Heine’s definition, is such a function
that maps convergent sequences into convergent sequences: if xn → x then g(xn) → g(x). The continuous mapping
theorem states that this will also be true if we replace the deterministic sequence {xn} with a sequence of random
variables {Xn}, and replace the standard notion of convergence of real numbers “→” with one of the types of
convergence of random variables.
This theorem was first proved by (Mann & Wald 1943), and it is therefore sometimes called the Mann–Wald
theorem.[1]

Statement
Let {Xn}, X be random elements defined on a metric space S. Suppose a function g: S→S′ (where S′ is another metric
space) has the set of discontinuity points Dg such that Pr[X ∈ Dg] = 0. Then[2][3][4]

1. if Xn →d X then g(Xn) →d g(X);
2. if Xn →p X then g(Xn) →p g(X);
3. if Xn →a.s. X then g(Xn) →a.s. g(X);
where →d, →p and →a.s. denote convergence in distribution, convergence in probability, and almost sure convergence, respectively.
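Statement 2 (convergence in probability) can be illustrated empirically. In this sketch (construction and names our own), X_n = X + U_n/n with uniform noise U_n, so X_n → X in probability, and the theorem says g(X_n) → g(X) in probability as well:

```python
import random

def exceed_rate(g, n, eps=0.05, trials=2000):
    """Empirical estimate of Pr(|g(X_n) - g(X)| > eps) for X_n = X + U/n,
    U ~ Uniform(-1, 1) and X standard normal."""
    count = 0
    for _ in range(trials):
        x = random.gauss(0.0, 1.0)
        xn = x + random.uniform(-1.0, 1.0) / n
        if abs(g(xn) - g(x)) > eps:
            count += 1
    return count / trials

random.seed(0)
early = exceed_rate(lambda t: t * t, n=2)    # noisy stage: rate well above zero
late = exceed_rate(lambda t: t * t, n=500)   # rate has collapsed toward zero
```

As n grows the exceedance rate drops to zero even though g(t) = t^2 is unbounded, since the theorem only needs Pr[X ∈ Dg] = 0.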

Proof
This proof has been adapted from (van der Vaart 1998, Theorem 2.3).

Spaces S and S′ are equipped with certain metrics. For simplicity we will denote both of these metrics using the |x−y|
notation, even though the metrics may be arbitrary and not necessarily Euclidean.

Convergence in distribution
We will need a particular statement from the portmanteau theorem: that convergence in distribution Xn →d X is
equivalent to

    limsup(n→∞) Pr(Xn ∈ F) ≤ Pr(X ∈ F)   for every closed set F.

Fix an arbitrary closed set F⊂S′. Denote by g−1(F) the pre-image of F under the mapping g: the set of all points x∈S
such that g(x)∈F. Consider a sequence {xk} such that g(xk)∈F and xk→x. Then this sequence lies in g−1(F), and its
limit point x belongs to the closure of this set, g−1(F) (by definition of the closure). The point x may be either:
• a continuity point of g, in which case g(xk)→g(x), and hence g(x)∈F because F is a closed set, and therefore in
this case x belongs to the pre-image of F, or
• a discontinuity point of g, so that x∈Dg.
Thus the following relationship holds:

    closure(g−1(F)) ⊆ g−1(F) ∪ Dg.

Consider the event {g(Xn) ∈ F}. The probability of this event can be estimated as

    Pr(g(Xn) ∈ F) = Pr(Xn ∈ g−1(F)) ≤ Pr(Xn ∈ closure(g−1(F))),

and by the portmanteau theorem the limsup of the last expression is less than or equal to Pr(X ∈ closure(g−1(F))). Using the
formula we derived in the previous paragraph, this can be written as

    Pr(X ∈ closure(g−1(F))) ≤ Pr(X ∈ g−1(F)) + Pr(X ∈ Dg) = Pr(g(X) ∈ F).

On plugging this back into the original expression, it can be seen that

    limsup(n→∞) Pr(g(Xn) ∈ F) ≤ Pr(g(X) ∈ F) for every closed set F,

which, by the portmanteau theorem, implies that g(Xn) converges to g(X) in distribution.

Convergence in probability
Fix an arbitrary ε > 0. Then for any δ > 0 consider the set Bδ defined as

    Bδ = { x ∈ S \ Dg : there exists y ∈ S with |x − y| < δ and |g(x) − g(y)| > ε }.

This is the set of continuity points x of the function g(·) for which it is possible to find, within the δ-neighborhood of
x, a point which maps outside the ε-neighborhood of g(x). By definition of continuity, this set shrinks as δ goes to
zero, so that limδ→0 Bδ = ∅.
Now suppose that |g(X) − g(Xn)| > ε. This implies that at least one of the following is true: either |X−Xn| ≥ δ, or X ∈ Dg,
or X ∈ Bδ. In terms of probabilities this can be written as

    Pr(|g(Xn) − g(X)| > ε) ≤ Pr(|Xn − X| ≥ δ) + Pr(X ∈ Bδ) + Pr(X ∈ Dg).

On the right-hand side, the first term converges to zero as n → ∞ for any fixed δ, by the definition of convergence in
probability of the sequence {Xn}. The second term converges to zero as δ → 0, since the set Bδ shrinks to an empty
set. And the last term is identically equal to zero by assumption of the theorem. Therefore the conclusion is that

    lim(n→∞) Pr(|g(Xn) − g(X)| > ε) = 0,

which means that g(Xn) converges to g(X) in probability.

Convergence almost surely


By definition of the continuity of the function g(·),

    Xn(ω) → X(ω)   implies   g(Xn(ω)) → g(X(ω))

at each point X(ω) where g(·) is continuous. Therefore

    Pr( lim(n→∞) g(Xn) = g(X) ) ≥ Pr( lim(n→∞) Xn = X, X ∉ Dg ) = 1,

because Xn → X almost surely and Pr(X ∈ Dg) = 0. By definition, we conclude that g(Xn) converges to g(X) almost surely.



References

Literature
• Amemiya, Takeshi (1985). Advanced Econometrics. Cambridge, MA: Harvard University Press.
ISBN 0-674-00560-0. LCCN HB139.A54 1985.
• Billingsley, Patrick (1969). Convergence of Probability Measures. John Wiley & Sons. ISBN 0-471-07242-7.
• Billingsley, Patrick (1999). Convergence of Probability Measures (2nd ed.). John Wiley & Sons.
ISBN 0-471-19745-9.
• Mann, H.B.; Wald, A. (1943). "On stochastic limit and order relationships". The Annals of Mathematical
Statistics 14 (3): 217–226. doi:10.1214/aoms/1177731415. JSTOR 2235800.
• Van der Vaart, A. W. (1998). Asymptotic statistics. New York: Cambridge University Press.
ISBN 978-0-521-49603-2. LCCN QA276 .V22 1998.

Notes
[1] Amemiya 1985, p. 88
[2] Van der Vaart 1998, Theorem 2.3, page 7
[3] Billingsley 1969, p. 31, Corollary 1
[4] Billingsley 1999, p. 21, Theorem 2.7

Convergence of random variables


In probability theory, there exist several different notions of convergence of random variables. The convergence of
sequences of random variables to some limit random variable is an important concept in probability theory, and its
applications to statistics and stochastic processes. The same concepts are known in more general mathematics as
stochastic convergence and they formalize the idea that a sequence of essentially random or unpredictable events
can sometimes be expected to settle down into a behaviour that is essentially unchanging when items far enough into
the sequence are studied. The different possible notions of convergence relate to how such a behaviour can be
characterised: two readily understood behaviours are that the sequence eventually takes a constant value, and that
values in the sequence continue to change but can be described by an unchanging probability distribution.

Background
"Stochastic convergence" formalizes the idea that a sequence of essentially random or unpredictable events can
sometimes be expected to settle into a pattern. The pattern may for instance be
• Convergence in the classical sense to a fixed value, perhaps itself coming from a random event
• An increasing similarity of outcomes to what a purely deterministic function would produce
• An increasing preference towards a certain outcome
• An increasing "aversion" against straying far away from a certain outcome
Some less obvious, more theoretical patterns could be
• That the probability distribution describing the next outcome may grow increasingly similar to a certain
distribution
• That the series formed by calculating the expected value of the outcome's distance from a particular value may
converge to 0
• That the variance of the random variable describing the next event grows smaller and smaller.
These other types of patterns that may arise are reflected in the different types of stochastic convergence that have
been studied.

While the above discussion has related to the convergence of a single series to a limiting value, the notion of the
convergence of two series towards each other is also important, but this is easily handled by studying the sequence
defined as either the difference or the ratio of the two series.
For example, if the average of n uncorrelated random variables Yi, i = 1, ..., n, all having the same finite mean μ and
variance, is given by

    Xn = (1/n) (Y1 + … + Yn),

then as n tends to infinity, Xn converges in probability (see below) to the common mean, μ, of the random variables
Yi. This result is known as the weak law of large numbers. Other forms of convergence are important in other useful
theorems, including the central limit theorem.
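The weak law can be checked empirically. A small Monte Carlo sketch with U(0,1) draws (function name illustrative):

```python
import random

def miss_rate(n, eps=0.1, trials=1000):
    """Fraction of runs where the mean of n Uniform(0,1) draws misses
    mu = 0.5 by more than eps; the weak law drives this to zero as n grows."""
    misses = 0
    for _ in range(trials):
        xbar = sum(random.random() for _ in range(n)) / n
        if abs(xbar - 0.5) > eps:
            misses += 1
    return misses / trials
```

With n = 10 a noticeable fraction of sample means land outside (0.4, 0.6), while with n = 1000 essentially none do.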
Throughout the following, we assume that (Xn) is a sequence of random variables, and X is a random variable, and
all of them are defined on the same probability space .

Convergence in distribution

Examples of convergence in distribution


Dice factory

Suppose a new dice factory has just been built. The first few dice come out quite biased, due to imperfections in the production process. The
outcome from tossing any of them will follow a distribution markedly different from the desired uniform distribution.
As the factory is improved, the dice become less and less loaded, and the outcomes from tossing a newly produced dice will follow the uniform
distribution more and more closely.

Tossing coins

Let Xn be the fraction of heads after tossing up an unbiased coin n times. Then X1 has the Bernoulli distribution with expected value μ = 0.5 and
variance σ² = 0.25. The subsequent random variables X2, X3, … will all be distributed binomially.
As n grows larger, this distribution will gradually start to take shape more and more similar to the bell curve of the normal distribution. If we shift
and rescale Xn’s appropriately, then Zn = √n (Xn − μ)/σ will be converging in distribution to the standard normal, the result that follows from the
celebrated central limit theorem.

Graphic example

Suppose { Xi } is an iid sequence of uniform U(−1,1) random variables. Let Zn = (X1 + … + Xn)/√n be their (normalized) sums. Then according to the
central limit theorem, the distribution of Zn approaches the normal N(0, ⅓) distribution. This convergence is shown in the picture: as n grows larger,
the shape of the pdf function gets closer and closer to the Gaussian curve.

With this mode of convergence, we increasingly expect to see the next outcome in a sequence of random
experiments becoming better and better modeled by a given probability distribution.

Convergence in distribution is the weakest form of convergence, since it is implied by all other types of convergence
mentioned in this article. However convergence in distribution is very frequently used in practice; most often it
arises from application of the central limit theorem.

Definition
A sequence {X1, X2, …} of random variables is said to converge in distribution, or converge weakly, or converge
in law to a random variable X if

    Fn(x) → F(x) as n → ∞

for every number x ∈ R at which F is continuous. Here Fn and F are the cumulative distribution functions of random
variables Xn and X correspondingly.
The requirement that only the continuity points of F should be considered is essential. For example if Xn are
distributed uniformly on intervals [0, 1⁄n], then this sequence converges in distribution to a degenerate random
variable X = 0. Indeed, Fn(x) = 0 for all n when x ≤ 0, and Fn(x) = 1 for all x ≥ 1⁄n when n > 0. However, for this
limiting random variable F(0) = 1, even though Fn(0) = 0 for all n. Thus the convergence of cdfs fails at the point x =
0 where F is discontinuous.
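The CDFs in this example are simple enough to write down directly; a sketch:

```python
def F_n(x, n):
    """CDF of X_n ~ Uniform[0, 1/n]."""
    if x < 0:
        return 0.0
    return min(1.0, n * x)

def F(x):
    """CDF of the degenerate limit X = 0."""
    return 1.0 if x >= 0 else 0.0
```

F_n(x, n) converges to F(x) at every x except the jump point x = 0, where F_n(0, n) = 0 for all n while F(0) = 1, which is why the definition excludes discontinuity points of F.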
Convergence in distribution may be denoted as

    Xn →d X,   Xn ⇒ X,   or   L(Xn) → L(X),

where L(X) is the law (probability distribution) of X. For example if X is standard normal we can write Xn →d N(0, 1).
For random vectors {X1, X2, …} ⊂ R^k the convergence in distribution is defined similarly. We say that this sequence
converges in distribution to a random k-vector X if

    Pr(Xn ∈ A) → Pr(X ∈ A)

for every A ⊂ R^k which is a continuity set of X.


The definition of convergence in distribution may be extended from random vectors to more complex random
elements in arbitrary metric spaces, and even to the “random variables” which are not measurable — a situation
which occurs for example in the study of empirical processes. This is the “weak convergence of laws without laws
being defined” — except asymptotically.[1]
In this case the term weak convergence is preferable (see weak convergence of measures), and we say that a
sequence of random elements {Xn} converges weakly to X (denoted as Xn ⇒ X) if

    E*h(Xn) → Eh(X)

for all continuous bounded functions h(·).[2] Here E* denotes the outer expectation, that is the expectation of a
“smallest measurable function g that dominates h(Xn)”.

Properties
• Since F(a) = Pr(X ≤ a), the convergence in distribution means that the probability for Xn to be in a given range is
approximately equal to the probability that the value of X is in that range, provided n is sufficiently large.
• In general, convergence in distribution does not imply that the sequence of corresponding probability density
functions will also converge. As an example one may consider random variables with densities
ƒn(x) = (1 − cos(2πnx))1{x∈(0,1)}. These random variables converge in distribution to a uniform U(0, 1), whereas
their densities do not converge at all.[3]
• Portmanteau lemma provides several equivalent definitions of convergence in distribution. Although these
definitions are less intuitive, they are used to prove a number of statistical theorems. The lemma states that {Xn}
converges in distribution to X if and only if any of the following statements are true:

• Eƒ(Xn) → Eƒ(X) for all bounded, continuous functions ƒ;
• Eƒ(Xn) → Eƒ(X) for all bounded, Lipschitz functions ƒ;
• limsup{ Eƒ(Xn) } ≤ Eƒ(X) for every upper semi-continuous function ƒ bounded from above;
• liminf{ Eƒ(Xn) } ≥ Eƒ(X) for every lower semi-continuous function ƒ bounded from below;
• limsup{ Pr(Xn ∈ C) } ≤ Pr(X ∈ C) for all closed sets C;
• liminf{ Pr(Xn ∈ U) } ≥ Pr(X ∈ U) for all open sets U;
• lim{ Pr(Xn ∈ A) } = Pr(X ∈ A) for all continuity sets A of random variable X.
• The continuous mapping theorem states that for a continuous function g(·), if the sequence {Xn} converges in
distribution to X, then {g(Xn)} converges in distribution to g(X).
• Lévy’s continuity theorem: the sequence {Xn} converges in distribution to X if and only if the sequence of
corresponding characteristic functions {φn} converges pointwise to the characteristic function φ of X.
• Convergence in distribution is metrizable by the Lévy–Prokhorov metric.
• A natural link to convergence in distribution is the Skorokhod's representation theorem.
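Several of these statements can be checked numerically for the density counterexample above. The sketch below (an illustrative test function and grid size, chosen only for this example) approximates Eƒ(Xn) under the density ƒn(x) = (1 − cos(2πnx))1{x∈(0,1)} by a midpoint sum, and shows that the expectations of a bounded continuous ƒ converge to the uniform case while the densities stay far apart:

```python
import numpy as np

# Midpoint grid on (0, 1) for numerical integration
x = (np.arange(200_000) + 0.5) / 200_000
f = np.sin(3 * x)                       # a bounded, continuous test function

def E_f(n):
    # E f(X_n) under the density f_n(x) = 1 - cos(2*pi*n*x) on (0, 1)
    return float(np.mean(f * (1 - np.cos(2 * np.pi * n * x))))

E_limit = float(np.mean(f))             # E f(X) for the uniform limit U(0, 1)
err = abs(E_f(50) - E_limit)            # small: expectations converge (portmanteau)
dens_gap = float(np.max(np.abs(np.cos(2 * np.pi * 50 * x))))  # density gap stays ~1
```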

Convergence in probability

Examples of convergence in probability


Height of a person

Consider the following experiment. First, pick a random person in the street. Let X be his/her height, which is ex ante a random variable. Then ask
other people to estimate this height by eye. Let Xn be the average of the first n responses. Then (provided there is no systematic error) by
the law of large numbers, the sequence Xn will converge in probability to the random variable X.

Archer

Suppose a person takes a bow and starts shooting arrows at a target. Let Xn be his score on the n-th shot. Initially he will be very likely to score zeros,
but as time goes by and his archery skill increases, he will become more and more likely to hit the bullseye and score 10 points. After years of
practice, the probability that he scores anything but 10 becomes smaller and smaller. Thus, the sequence Xn converges in probability
to X = 10.
Note, however, that Xn does not converge almost surely. No matter how professional the archer becomes, there will always be a small probability of
making an error. Thus the sequence {Xn} never becomes stationary: there will always be non-perfect scores in it, even if they become
increasingly less frequent.

The basic idea behind this type of convergence is that the probability of an “unusual” outcome becomes smaller and
smaller as the sequence progresses.
The concept of convergence in probability is used very often in statistics. For example, an estimator is called
consistent if it converges in probability to the quantity being estimated. Convergence in probability is also the type of
convergence established by the weak law of large numbers.

Definition
A sequence {Xn} of random variables converges in probability towards X if for all ε > 0

$\lim_{n\to\infty} \Pr\big(|X_n - X| \ge \varepsilon\big) = 0.$
Formally, pick any ε > 0 and any δ > 0. Let Pn be the probability that Xn is outside the ball of radius ε centered at X.
Then for Xn to converge in probability to X there should exist a number Nδ such that for all n ≥ Nδ the probability Pn
is less than δ.
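The ε–δ mechanics can be illustrated by simulation. In the sketch below (illustrative choices: uniform summands, a fixed ε, Monte Carlo estimates of Pn), the probability that the sample mean falls outside the ε-ball around its limit shrinks as n grows, as the weak law of large numbers predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1

def p_outside(n, trials=2000):
    # Estimate P_n = P(|X_n - X| >= eps), where X_n is the mean of n U(0,1)
    # draws and the limit X is the constant 1/2
    means = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
    return float(np.mean(np.abs(means - 0.5) >= eps))

p_small, p_large = p_outside(10), p_outside(1000)   # P_n decreases towards 0
```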
Convergence in probability is denoted by adding the letter p over an arrow indicating convergence, or using the
“plim” probability limit operator:

$X_n \xrightarrow{p} X, \qquad \underset{n\to\infty}{\operatorname{plim}}\, X_n = X.$

For random elements {Xn} on a separable metric space (S, d), convergence in probability is defined similarly by[4]

$\Pr\big(d(X_n, X) \ge \varepsilon\big) \to 0 \quad \text{for all } \varepsilon > 0.$
Properties
• Convergence in probability implies convergence in distribution.[proof]
• Convergence in probability does not imply almost sure convergence.[proof]
• In the opposite direction, convergence in distribution implies convergence in probability only when the limiting
random variable X is a constant.[proof]
• The continuous mapping theorem states that for every continuous function g(·), if $X_n \xrightarrow{p} X$, then also $g(X_n) \xrightarrow{p} g(X)$.
• Convergence in probability defines a topology on the space of random variables over a fixed probability space.
This topology is metrizable by the Ky Fan metric:[5]

$d(X, Y) = \inf\big\{\varepsilon > 0 : \Pr\big(|X - Y| > \varepsilon\big) \le \varepsilon\big\}.$
Almost sure convergence

Examples of almost sure convergence


Example 1

Consider an animal of some short-lived species. We record the amount of food that this animal consumes per day. This sequence of numbers will be
unpredictable, but we may be quite certain that one day the number will become zero, and will stay zero forever after.

Example 2

Consider a man who tosses seven coins every morning. Each afternoon, he donates one pound to a charity for each head that appeared. The first
time the result is all tails, however, he will stop permanently.
Let X1, X2, … be the daily amounts the charity receives from him.
We may be almost sure that one day this amount will be zero, and stay zero forever after that.
However, when we consider any finite number of days, there is a nonzero probability the terminating condition will not occur.

This is the type of stochastic convergence that is most similar to pointwise convergence known from elementary real
analysis.

Definition
To say that the sequence Xn converges almost surely or almost everywhere or with probability 1 or strongly
towards X means that

$\Pr\Big(\lim_{n\to\infty} X_n = X\Big) = 1.$

This means that the values of Xn approach the value of X, in the sense (see almost surely) that events for which Xn
does not converge to X have probability 0. Using the probability space $(\Omega, \mathcal{F}, \Pr)$ and the concept of the random
variable as a function from Ω to R, this is equivalent to the statement

$\Pr\Big(\omega \in \Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\Big) = 1.$

Another, equivalent, way of defining almost sure convergence is as follows:

$\Pr\Big(\liminf_{n\to\infty}\,\big\{\omega \in \Omega : |X_n(\omega) - X(\omega)| < \varepsilon\big\}\Big) = 1 \quad \text{for all } \varepsilon > 0.$

Almost sure convergence is often denoted by adding the letters a.s. over an arrow indicating convergence:

$X_n \xrightarrow{\text{a.s.}} X.$
For generic random elements {Xn} on a metric space (S, d), convergence almost surely is defined similarly:

$\Pr\Big(\omega \in \Omega : d\big(X_n(\omega), X(\omega)\big) \underset{n\to\infty}{\longrightarrow} 0\Big) = 1.$
Properties
• Almost sure convergence implies convergence in probability, and hence implies convergence in distribution. It is
the notion of convergence used in the strong law of large numbers.
• The concept of almost sure convergence does not come from a topology on the space of random variables. This
means there is no topology on the space of random variables such that the almost surely convergent sequences are
exactly the converging sequences with respect to that topology. In particular, there is no metric of almost sure
convergence.
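A minimal pathwise sketch of almost sure convergence (with the illustrative sequence Xn = X + Zn/n, which converges to X along almost every sample path because Zn/n → 0):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
X = rng.normal()                        # the limiting random variable (one draw)
Z = rng.normal(size=N)
path = X + Z / np.arange(1, N + 1)      # X_n = X + Z_n / n along one sample path

# Along (almost) every sample path the tail supremum of |X_n - X| shrinks to 0,
# which is exactly pointwise convergence outside a null set:
tail_sup = [float(np.max(np.abs(path[n:] - X))) for n in (10, 100, 1000)]
```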

Sure convergence
To say that the sequence of random variables (Xn) defined over the same probability space (i.e., a random process)
converges surely or everywhere or pointwise towards X means

$\lim_{n\to\infty} X_n(\omega) = X(\omega) \quad \text{for all } \omega \in \Omega,$

where Ω is the sample space of the underlying probability space over which the random variables are defined.
This is the notion of pointwise convergence of a sequence of functions extended to a sequence of random variables. (Note
that random variables themselves are functions.)

Sure convergence of a random variable implies all the other kinds of convergence stated above, but there is no
payoff in probability theory in using sure convergence compared to using almost sure convergence. The difference
between the two only exists on sets with probability zero. This is why the concept of sure convergence of random
variables is very rarely used.

Convergence in mean
We say that the sequence Xn converges in the r-th mean (or in the Lr-norm) towards X, for some r ≥ 1, if the r-th
absolute moments of Xn and X exist, and

$\lim_{n\to\infty} \operatorname{E}\big(|X_n - X|^r\big) = 0,$

where the operator E denotes the expected value. Convergence in r-th mean tells us that the expectation of the r-th
power of the difference between Xn and X converges to zero.
This type of convergence is often denoted by adding the letter Lr over an arrow indicating convergence:

$X_n \xrightarrow{L^r} X.$
The most important cases of convergence in r-th mean are:


• When Xn converges in r-th mean to X for r = 1, we say that Xn converges in mean to X.
• When Xn converges in r-th mean to X for r = 2, we say that Xn converges in mean square to X. This is also
sometimes referred to as convergence in the mean, and is sometimes denoted[6]

$\underset{n\to\infty}{\operatorname{l.i.m.}}\; X_n = X.$
Convergence in the r-th mean, for r > 0, implies convergence in probability (by Markov's inequality), while if r > s ≥
1, convergence in r-th mean implies convergence in s-th mean. Hence, convergence in mean square implies
convergence in mean.

Convergence in rth-order mean

Examples of convergence in rth-order mean.


Basic example: A newly built factory produces cans of beer. The owners want each can to contain exactly a certain amount.
Knowing the details of the current production process, engineers may compute the expected error in a newly produced can.
They are continuously improving the production process, so as time goes by, the expected error in a newly produced can tends to
zero.
This example illustrates convergence in first-order mean.

This is a rather "technical" mode of convergence. We essentially compute a sequence of real numbers, one number
for each random variable, and check if this sequence is convergent in the ordinary sense.
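This “compute one real number per random variable and check ordinary convergence” view can be sketched by simulation (illustrative choices: Xn the mean of n Exp(1) draws, r = 2, Monte Carlo estimates of the moments):

```python
import numpy as np

rng = np.random.default_rng(2)

def rth_moment_of_error(n, r=2, trials=4000):
    # Estimate E|X_n - mu|^r where X_n is the mean of n Exp(1) draws (mu = 1)
    means = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
    return float(np.mean(np.abs(means - 1.0) ** r))

# One real number per random variable; the sequence tends to 0 (here like 1/n)
seq = [rth_moment_of_error(n) for n in (1, 10, 100, 1000)]
```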

Formal definition
If $\operatorname{E}\big(|X_n - a|^r\big) \to 0$ for some real number a, then {Xn} converges in rth-order mean to a.

Commonly used notation: $X_n \xrightarrow{L^r} a.$
Properties
The chain of implications between the various notions of convergence are noted in their respective sections. They
are, using the arrow notation:

$X_n \xrightarrow{L^r} X \;\Rightarrow\; X_n \xrightarrow{p} X \;\Rightarrow\; X_n \xrightarrow{d} X, \qquad X_n \xrightarrow{\text{a.s.}} X \;\Rightarrow\; X_n \xrightarrow{p} X.$
These properties, together with a number of other special cases, are summarized in the following list:
• Almost sure convergence implies convergence in probability:[7][proof]

$X_n \xrightarrow{\text{a.s.}} X \;\Rightarrow\; X_n \xrightarrow{p} X.$

• Convergence in probability implies there exists a sub-sequence $(n_k)$ which almost surely converges:[8]

$X_n \xrightarrow{p} X \;\Rightarrow\; X_{n_k} \xrightarrow{\text{a.s.}} X.$

• Convergence in probability implies convergence in distribution:[7][proof]

$X_n \xrightarrow{p} X \;\Rightarrow\; X_n \xrightarrow{d} X.$

• Convergence in r-th order mean implies convergence in probability:

$X_n \xrightarrow{L^r} X \;\Rightarrow\; X_n \xrightarrow{p} X.$

• Convergence in r-th order mean implies convergence in lower order mean, assuming that both orders are at least one:

$X_n \xrightarrow{L^r} X \;\Rightarrow\; X_n \xrightarrow{L^s} X,$

provided r ≥ s ≥ 1.
• If Xn converges in distribution to a constant c, then Xn converges in probability to c:[7][proof]

$X_n \xrightarrow{d} c \;\Rightarrow\; X_n \xrightarrow{p} c,$

provided c is a constant.
• If Xn converges in distribution to X and the difference between Xn and Yn converges in probability to zero, then Yn
also converges in distribution to X:[7][proof]

$X_n \xrightarrow{d} X,\quad |X_n - Y_n| \xrightarrow{p} 0 \;\Rightarrow\; Y_n \xrightarrow{d} X.$

• If Xn converges in distribution to X and Yn converges in distribution to a constant c, then the joint vector (Xn, Yn)
converges in distribution to (X, c):[7][proof]

$X_n \xrightarrow{d} X,\quad Y_n \xrightarrow{d} c \;\Rightarrow\; (X_n, Y_n) \xrightarrow{d} (X, c),$

provided c is a constant.
Note that the condition that Yn converges to a constant is important: if it were to converge to a random variable Y,
then we wouldn’t be able to conclude that (Xn, Yn) converges to (X, Y).
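A simulation sketch of this property (a Slutsky-type conclusion; the distributions, the sample sizes, and the Kolmogorov–Smirnov check are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, trials = 2000, 5000

# X_n -> N(0, 1) in distribution (CLT applied to standardized uniform means)
u = rng.uniform(0.0, 1.0, size=(trials, n))
Xn = (u.mean(axis=1) - 0.5) * np.sqrt(12 * n)
# Y_n -> 2 in probability
Yn = 2 + rng.normal(size=trials) / n

# Joint convergence gives X_n + Y_n -> N(2, 1) in distribution,
# which the Kolmogorov-Smirnov test should not reject:
pvalue = float(stats.kstest(Xn + Yn, "norm", args=(2, 1)).pvalue)
```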

• If Xn converges in probability to X and Yn converges in probability to Y, then the joint vector (Xn, Yn) converges in
probability to (X, Y):[7][proof]

$X_n \xrightarrow{p} X,\quad Y_n \xrightarrow{p} Y \;\Rightarrow\; (X_n, Y_n) \xrightarrow{p} (X, Y).$
• If Xn converges in probability to X, and if P(|Xn| ≤ b) = 1 for all n and some b, then Xn converges in rth mean to X
for all r ≥ 1. In other words, if Xn converges in probability to X and all random variables Xn are almost surely
bounded above and below, then Xn converges to X also in any rth mean.
• Almost sure representation. Usually, convergence in distribution does not imply convergence almost surely.
However for a given sequence {Xn} which converges in distribution to X0 it is always possible to find a new
probability space (Ω, F, P) and random variables {Yn, n = 0,1,…} defined on it such that Yn is equal in
distribution to Xn for each n ≥ 0, and Yn converges to Y0 almost surely.[9]
• If for all ε > 0,

$\sum_{n=1}^{\infty} \Pr\big(|X_n - X| > \varepsilon\big) < \infty,$
then we say that Xn converges almost completely, or almost in probability towards X. When Xn converges
almost completely towards X then it also converges almost surely to X. In other words, if Xn converges in
probability to X sufficiently quickly (i.e. the above sequence of tail probabilities is summable for all ε > 0),
then Xn also converges almost surely to X. This is a direct implication from the Borel-Cantelli lemma.
• If Sn is a sum of n real independent random variables:

$S_n = X_1 + X_2 + \cdots + X_n,$
then Sn converges almost surely if and only if Sn converges in probability.


• The dominated convergence theorem gives sufficient conditions for almost sure convergence to imply
L1-convergence:

$X_n \xrightarrow{\text{a.s.}} X,\quad |X_n| \le Y,\quad \operatorname{E}(Y) < \infty \;\Rightarrow\; X_n \xrightarrow{L^1} X.$
• A necessary and sufficient condition for L1 convergence is that $X_n \xrightarrow{p} X$ and the sequence (Xn) is uniformly
integrable.
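The almost-complete-convergence criterion above (summable tail probabilities imply almost sure convergence via the Borel–Cantelli lemma) can be illustrated by simulation; the Bernoulli(1/n²) sequence here is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100_000
n = np.arange(1, N + 1)

# X_n = 1 with probability 1/n^2, else 0.  The tail probabilities are summable
# (sum 1/n^2 = pi^2/6 < infinity), so X_n -> 0 almost completely, hence a.s.
Xn = (rng.uniform(size=N) < 1.0 / n**2).astype(int)
total_exceptions = int(Xn.sum())   # finite, and small, on almost every path
```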

Notes
[1] Bickel et al. 1998, A.8, page 475
[2] van der Vaart & Wellner 1996, p. 4
[3] Romano & Siegel 1985, Example 5.26
[4] Dudley 2002, Chapter 9.2, page 287
[5] Dudley 2002, p. 289
[6] Porat, B. (1994). Digital Processing of Random Signals: Theory & Methods. Prentice Hall. pp. 19. ISBN 0-13-063751-3.
[7] van der Vaart 1998, Theorem 2.7
[8] Gut, Allan (2005). Probability: A Graduate Course. Springer. Theorem 3.4. ISBN 0-387-22833-0.
[9] van der Vaart 1998, Th.2.19

References
• Bickel, Peter J.; Klaassen, Chris A.J.; Ritov, Ya’acov; Wellner, Jon A. (1998). Efficient and adaptive estimation
for semiparametric models. New York: Springer-Verlag. ISBN 0-387-98473-9. LCCN QA276.8.E374.
• Billingsley, Patrick (1986). Probability and Measure. Wiley Series in Probability and Mathematical Statistics
(2nd ed.). Wiley.
• Billingsley, Patrick (1999). Convergence of probability measures (2nd ed.). John Wiley & Sons. pp. 1–28.
ISBN 0-471-19745-9.
• Dudley, R.M. (2002). Real analysis and probability. Cambridge, UK: Cambridge University Press.
ISBN 0-521-80972-X.
• Grimmett, G.R.; Stirzaker, D.R. (1992). Probability and random processes (2nd ed.). Clarendon Press, Oxford.
pp. 271–285. ISBN 0-19-853665-8.
• Jacobsen, M. (1992). Videregående Sandsynlighedsregning (Advanced Probability Theory) (3rd ed.). HCØ-tryk,
Copenhagen. pp. 18–20. ISBN 87-91180-71-6.
• Ledoux, Michel; Talagrand, Michel (1991). Probability in Banach spaces. Berlin: Springer-Verlag. pp. xii+480.
ISBN 3-540-52013-9. MR1102015.
• Romano, Joseph P.; Siegel, Andrew F. (1985). Counterexamples in probability and statistics. Great Britain:
Chapman & Hall. ISBN 0-412-98901-8. LCCN QA273.R58 1985.
• van der Vaart, Aad W.; Wellner, Jon A. (1996). Weak convergence and empirical processes. New York:
Springer-Verlag. ISBN 0-387-94640-3. LCCN QA274.V33 1996.
• van der Vaart, Aad W. (1998). Asymptotic statistics. New York: Cambridge University Press.
ISBN 978-0-521-49603-2. LCCN QA276.V22 1998.
• Williams, D. (1991). Probability with Martingales. Cambridge University Press. ISBN 0-521-40605-6.
• Wong, E.; Hájek, B. (1985). Stochastic Processes in Engineering Systems. New York: Springer–Verlag.
This article incorporates material from the Citizendium article "Stochastic convergence", which is licensed under
the Creative Commons Attribution-ShareAlike 3.0 Unported License but not under the GFDL.

Convergent series
In mathematics, a series is the sum of the terms of a sequence of numbers.
Given a sequence $\{a_1, a_2, a_3, \ldots\}$, the nth partial sum $S_n$ is the sum of the first n terms of the sequence, that
is,

$S_n = \sum_{k=1}^{n} a_k.$

A series is convergent if the sequence of its partial sums $\{S_1, S_2, S_3, \ldots\}$ converges. In more formal language,
a series converges if there exists a limit $\ell$ such that for any arbitrarily small positive number $\varepsilon$, there is a
large integer $N$ such that for all $n \ge N$,

$|S_n - \ell| \le \varepsilon.$
A series that is not convergent is said to be divergent.

Examples of convergent and divergent series


• The reciprocals of the positive integers produce a divergent series (harmonic series):

$1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \frac{1}{5} + \cdots \to \infty.$

• Alternating the signs of the reciprocals of positive integers produces a convergent series:

$1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \frac{1}{5} - \cdots = \ln 2.$

• Alternating the signs of the reciprocals of positive odd integers produces a convergent series (the Leibniz formula
for pi):

$1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots = \frac{\pi}{4}.$

• The reciprocals of prime numbers produce a divergent series (so the set of primes is "large"):

$\frac{1}{2} + \frac{1}{3} + \frac{1}{5} + \frac{1}{7} + \frac{1}{11} + \cdots \to \infty.$

• The reciprocals of triangular numbers produce a convergent series:

$1 + \frac{1}{3} + \frac{1}{6} + \frac{1}{10} + \frac{1}{15} + \cdots = 2.$

• The reciprocals of factorials produce a convergent series (see e):

$\frac{1}{0!} + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \cdots = e.$

• The reciprocals of square numbers produce a convergent series (the Basel problem):

$1 + \frac{1}{4} + \frac{1}{9} + \frac{1}{16} + \frac{1}{25} + \cdots = \frac{\pi^2}{6}.$

• The reciprocals of powers of 2 produce a convergent series (so the set of powers of 2 is "small"):

$1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \cdots = 2.$

• Alternating the signs of the reciprocals of powers of 2 also produces a convergent series:

$1 - \frac{1}{2} + \frac{1}{4} - \frac{1}{8} + \frac{1}{16} - \cdots = \frac{2}{3}.$

• The reciprocals of Fibonacci numbers produce a convergent series (see ψ, the reciprocal Fibonacci constant):

$\frac{1}{1} + \frac{1}{1} + \frac{1}{2} + \frac{1}{3} + \frac{1}{5} + \frac{1}{8} + \cdots = \psi.$
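A few of the sums above can be checked numerically; the truncation points below are illustrative:

```python
import math

N = 200_000
# Harmonic series: partial sums grow like ln N, so the series diverges
harmonic = sum(1.0 / n for n in range(1, N + 1))
# Reciprocals of squares: partial sums approach pi^2 / 6 (Basel problem)
basel = sum(1.0 / n**2 for n in range(1, N + 1))
# Reciprocals of factorials: partial sums approach e very quickly
factorials = sum(1.0 / math.factorial(n) for n in range(20))
```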


Convergence tests
There are a number of methods of determining whether a series converges or diverges.
Comparison test. The terms of the sequence $\{a_n\}$ are compared to those of another sequence $\{b_n\}$. If, for all n,
$0 \le a_n \le b_n$, and $\sum_{n=1}^{\infty} b_n$ converges, then so does $\sum_{n=1}^{\infty} a_n$.
However, if, for all n, $0 \le b_n \le a_n$, and $\sum_{n=1}^{\infty} b_n$ diverges, then so does $\sum_{n=1}^{\infty} a_n$.

[Figure: If the blue series $\sum b_n$ can be proven to converge, then the smaller series $\sum a_n$ must converge. By
contraposition, if the red series $\sum a_n$ is proven to diverge, then $\sum b_n$ must also diverge.]

Ratio test. Assume that for all n, $a_n > 0$. Suppose that there exists $r$ such that

$\lim_{n\to\infty} \frac{a_{n+1}}{a_n} = r.$

If r < 1, then the series converges. If r > 1, then the
series diverges. If r = 1, the ratio test is inconclusive, and the series may converge or diverge.
Root test or nth root test. Suppose that the terms of the sequence in question are non-negative. Define r as follows:

$r = \limsup_{n\to\infty} \sqrt[n]{|a_n|},$

where "lim sup" denotes the limit superior (possibly ∞; if the limit exists it is the same value).
If r < 1, then the series converges. If r > 1, then the series diverges. If r = 1, the root test is inconclusive, and the
series may converge or diverge.
The ratio test and the root test are both based on comparison with a geometric series, and as such they work in
similar situations. In fact, if the ratio test works (meaning that the limit exists and is not equal to 1) then so does the
root test; the converse, however, is not true. The root test is therefore more generally applicable, but as a practical
matter the limit is often difficult to compute for commonly seen types of series.
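The two tests can be sketched numerically by evaluating the ratio and the n-th root at a single large index, an illustrative stand-in for the limit:

```python
def ratio_estimate(a, n=500):
    # a_{n+1} / a_n at large n approximates r in the ratio test
    return a(n + 1) / a(n)

def root_estimate(a, n=500):
    # a_n ** (1/n) at large n approximates r in the root test
    return a(n) ** (1.0 / n)

geometric = lambda n: 0.5 ** n   # converges: both estimates are near 1/2
harmonic = lambda n: 1.0 / n     # both estimates are near 1: tests inconclusive

r_ratio, r_root = ratio_estimate(geometric), root_estimate(geometric)
h_ratio, h_root = ratio_estimate(harmonic), root_estimate(harmonic)
```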
Integral test. The series can be compared to an integral to establish convergence or divergence. Let $f(n) = a_n$ be
a positive and monotone decreasing function. If

$\int_{1}^{\infty} f(x)\, dx = \lim_{t\to\infty} \int_{1}^{t} f(x)\, dx < \infty,$

then the series converges. But if the integral diverges, then the series does so as well.

Limit comparison test. If $\{a_n\}, \{b_n\} > 0$, and the limit $\lim_{n\to\infty} \frac{a_n}{b_n}$ exists and is not zero, then

$\sum_{n=1}^{\infty} a_n$ converges if and only if $\sum_{n=1}^{\infty} b_n$ converges.

Alternating series test. Also known as the Leibniz criterion, the alternating series test states that for an alternating
series of the form $\sum_{n=1}^{\infty} (-1)^n a_n$, if $\{a_n\}$ is monotone decreasing, and has a limit of 0 at infinity, then the series
converges.

Cauchy condensation test. If $\{a_n\}$ is a positive monotone decreasing sequence, then $\sum_{n=1}^{\infty} a_n$ converges if and only if

$\sum_{k=1}^{\infty} 2^k a_{2^k}$ converges.

Dirichlet's test
Abel's test
Raabe's test

Conditional and absolute convergence


For any sequence $\{a_n\}$, $a_n \le |a_n|$ for all n. Therefore,

$\sum_{n=1}^{\infty} a_n \le \sum_{n=1}^{\infty} |a_n|.$

This means that if $\sum_{n=1}^{\infty} |a_n|$ converges, then $\sum_{n=1}^{\infty} a_n$ also converges (but not vice-versa).

If the series $\sum_{n=1}^{\infty} |a_n|$ converges, then the series $\sum_{n=1}^{\infty} a_n$ is absolutely
convergent. An absolutely convergent sequence is one in which the length of the line created by joining together all of the increments to
the partial sum is finitely long. The power series of the exponential function is absolutely convergent everywhere.

[Figure: Illustration of the absolute convergence of the power series of Exp[z] around 0 evaluated at z = Exp[i⁄3]. The
length of the line is finite.]

If the series $\sum_{n=1}^{\infty} a_n$ converges but the series $\sum_{n=1}^{\infty} |a_n|$ diverges, then
the series is conditionally convergent. The path formed by connecting the partial sums of a conditionally convergent series is
infinitely long. The power series of the logarithm is conditionally convergent.
The Riemann series theorem states that if a series converges
conditionally, it is possible to rearrange the terms of the series in such
a way that the series converges to any value, or even diverges.
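The theorem can be illustrated with a greedy rearrangement of the alternating harmonic series, steered towards an arbitrary target (here 1, whereas the standard ordering sums to ln 2); the term budget is an illustrative choice:

```python
import math

def rearranged_sum(target, terms=100_000):
    # Greedy rearrangement of the conditionally convergent series
    # 1 - 1/2 + 1/3 - ...: take positive terms while below `target`,
    # negative terms while above; the rearranged series tends to `target`.
    pos, neg, s = 1, 2, 0.0
    for _ in range(terms):
        if s <= target:
            s += 1.0 / pos
            pos += 2
        else:
            s -= 1.0 / neg
            neg += 2
    return s

approx = rearranged_sum(1.0)
standard = sum((-1) ** (k + 1) / k for k in range(1, 100_001))  # usual order: ln 2
```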

[Figure: Illustration of the conditional convergence of the power series of log(z+1) around 0 evaluated at z =
exp((π−1⁄3)i). The length of the line is infinite.]

Uniform convergence
Main article: uniform convergence.
Let $\{f_1, f_2, \ldots\}$ be a sequence of functions. The series $\sum_{n=1}^{\infty} f_n$ is said to converge uniformly to f if the sequence $\{s_n\}$ of
partial sums defined by

$s_n(x) = \sum_{k=1}^{n} f_k(x)$

converges uniformly to f.
There is an analogue of the comparison test for infinite series of functions called the Weierstrass M-test.

Cauchy convergence criterion


The Cauchy convergence criterion states that a series

$\sum_{n=1}^{\infty} a_n$

converges if and only if the sequence of partial sums is a Cauchy sequence. This means that for every $\varepsilon > 0$ there
is a positive integer $N$ such that for all $n \ge m \ge N$ we have

$\left| \sum_{k=m}^{n} a_k \right| < \varepsilon,$

which is equivalent to

$\lim_{n\to\infty}\, \sup_{p \ge 0} \left| \sum_{k=n}^{n+p} a_k \right| = 0.$
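A numerical sketch of the criterion, comparing the block sums |Sn − Sm| for a convergent and a divergent series (the indices are illustrative):

```python
def block_sum(a, m, n):
    # |a_{m+1} + ... + a_n| = |S_n - S_m|, the quantity in the Cauchy criterion
    return abs(sum(a(k) for k in range(m + 1, n + 1)))

# For the convergent series sum 1/k^2 the Cauchy blocks shrink with m...
gap_convergent = block_sum(lambda k: 1.0 / k**2, 10_000, 20_000)
# ...while for the harmonic series S_{2m} - S_m exceeds 1/2 for every m:
gap_harmonic = block_sum(lambda k: 1.0 / k, 10_000, 20_000)
```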
References
• Rudin, Walter (1976). Principles of Mathematical Analysis. McGrawHill.
• Spivak, Michael (1994). Calculus (3rd ed.). Houston, Texas: Publish or Perish, Inc. ISBN 0-914098-89-6.

External links
• Weisstein, Eric (2005). Riemann Series Theorem [1]. Retrieved May 16, 2005.

References
[1] http:/ / mathworld. wolfram. com/ RiemannSeriesTheorem. html

Copula (probability theory)


In probability theory and statistics, a copula is a kind of distribution function. Copulas are used to describe the
dependence between random variables. They are named for their resemblance to grammatical copulas in linguistics.
The cumulative distribution function of a random vector can be written in terms of marginal distribution functions
and a copula. The marginal distribution functions describe the marginal distribution of each component of the
random vector and the copula describes the dependence structure between the components.
Copulas are popular in statistical applications as they allow one to easily model and estimate the distribution of
random vectors by estimating marginals and copula separately. There are many parametric copula families available,
which usually have parameters that control the strength of dependence. Some popular parametric copula models are
outlined below.

The basic idea


Consider a random vector $(X_1, X_2, \ldots, X_d)$. Suppose its margins are continuous, i.e. the marginal CDFs
$F_i(x) = \Pr(X_i \le x)$ are continuous functions. By applying the probability integral transform to each component,
the random vector

$(U_1, U_2, \ldots, U_d) = \big(F_1(X_1), F_2(X_2), \ldots, F_d(X_d)\big)$

has uniform margins.

The copula of $(X_1, X_2, \ldots, X_d)$ is defined as the joint cumulative distribution function of $(U_1, U_2, \ldots, U_d)$:

$C(u_1, u_2, \ldots, u_d) = \Pr(U_1 \le u_1, U_2 \le u_2, \ldots, U_d \le u_d).$
The copula C contains all information on the dependence structure between the components of $(X_1, X_2, \ldots, X_d)$,
whereas the marginal cumulative distribution functions $F_i$ contain all information on the marginal distributions.
The importance of the above is that the reverse of these steps can be used to generate pseudo-random samples from
general classes of multivariate probability distributions. That is, given a procedure to generate a sample $(U_1, U_2, \ldots, U_d)$
from the copula distribution, the required sample can be constructed as

$(X_1, X_2, \ldots, X_d) = \big(F_1^{-1}(U_1), F_2^{-1}(U_2), \ldots, F_d^{-1}(U_d)\big).$

The inverses $F_i^{-1}$ are unproblematic as the $F_i$ were assumed to be continuous. The above formula for the copula
function can be rewritten to correspond to this as:

$C(u_1, u_2, \ldots, u_d) = \Pr\big(X_1 \le F_1^{-1}(u_1), X_2 \le F_2^{-1}(u_2), \ldots, X_d \le F_d^{-1}(u_d)\big).$
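These two steps (probability integral transform, then inverse-CDF re-mapping) can be sketched with SciPy; the Gaussian dependence, the value ρ = 0.7, and the target margins are illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, rho = 50_000, 0.7

# A dependent Gaussian pair (X1, X2) with correlation rho
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)

# Probability integral transform: U_i = F_i(X_i) has uniform margins,
# so (U1, U2) is a sample from the (Gaussian) copula of (X1, X2)
u = stats.norm.cdf(z)

# Reverse step: re-map the copula sample to arbitrary margins,
# e.g. Exp(1) and U(0, 1)
x1 = stats.expon.ppf(u[:, 0])
x2 = u[:, 1]
corr = float(np.corrcoef(x1, x2)[0, 1])  # dependence survives the change of margins
```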
Definition
In probabilistic terms, $C : [0,1]^d \to [0,1]$ is a d-dimensional copula if C is a joint cumulative distribution
function of a d-dimensional random vector on the unit cube $[0,1]^d$ with uniform marginals.[1]
In analytic terms, $C : [0,1]^d \to [0,1]$ is a d-dimensional copula if
• $C(u_1, \ldots, u_{i-1}, 0, u_{i+1}, \ldots, u_d) = 0$, i.e. the copula is zero if one of the arguments is zero,
• $C(1, \ldots, 1, u, 1, \ldots, 1) = u$, i.e. the copula is equal to u if one argument is u and all others 1,
• C is d-increasing, i.e., for each hyperrectangle $B = \prod_{i=1}^{d} [x_i, y_i] \subseteq [0,1]^d$ the C-volume of B is
non-negative:

$V_C(B) = \sum_{\mathbf{z} \in \times_{i=1}^{d} \{x_i, y_i\}} (-1)^{N(\mathbf{z})}\, C(\mathbf{z}) \ge 0,$

where $N(\mathbf{z}) = \#\{k : z_k = x_k\}$.

For instance, in the bivariate case, $C : [0,1]^2 \to [0,1]$ is a bivariate copula if $C(0, u) = C(u, 0) = 0$,
$C(1, u) = C(u, 1) = u$, and $C(u_2, v_2) - C(u_2, v_1) - C(u_1, v_2) + C(u_1, v_1) \ge 0$ for all
$0 \le u_1 \le u_2 \le 1$ and $0 \le v_1 \le v_2 \le 1$.

Sklar's theorem
Sklar's theorem[2] provides the theoretical foundation for the
application of copulas. Sklar's theorem states that a multivariate
cumulative distribution function

$H(x_1, \ldots, x_d) = \Pr(X_1 \le x_1, \ldots, X_d \le x_d)$

of a random vector $(X_1, X_2, \ldots, X_d)$ with margins $F_i(x) = \Pr(X_i \le x)$ can be written as

$H(x_1, \ldots, x_d) = C\big(F_1(x_1), \ldots, F_d(x_d)\big),$

where C is a copula.
The theorem also states that, given $H$, the copula is unique on
$\operatorname{Ran}(F_1) \times \cdots \times \operatorname{Ran}(F_d)$, which is the cartesian product of the
ranges of the marginal cdf's. This implies that the copula is unique if
the margins are continuous.
The converse is also true: given a copula $C$ and margins $F_i(x)$, then $C\big(F_1(x_1), \ldots, F_d(x_d)\big)$ defines a
d-dimensional cumulative distribution function.

[Figure: Density and contour plot of a bivariate Gaussian distribution.]
[Figure: Density and contour plot of two normal marginals joined with a Gumbel copula.]

Fréchet–Hoeffding copula bounds


The Fréchet–Hoeffding theorem (after Maurice René Fréchet and
Wassily Hoeffding[3]) states that for any copula $C : [0,1]^d \to [0,1]$
and any $(u_1, \ldots, u_d) \in [0,1]^d$ the following bounds hold:

$W(u_1, \ldots, u_d) \le C(u_1, \ldots, u_d) \le M(u_1, \ldots, u_d).$

The function W is called the lower Fréchet–Hoeffding bound and is defined as

$W(u_1, \ldots, u_d) = \max\Big\{1 - d + \sum_{i=1}^{d} u_i,\; 0\Big\}.$

The function M is called the upper Fréchet–Hoeffding bound and is defined as

$M(u_1, \ldots, u_d) = \min\{u_1, \ldots, u_d\}.$

The upper bound is sharp: M is always a copula, and it corresponds to comonotone random variables.

The lower bound is point-wise sharp, in the sense that for fixed u, there is a copula $\tilde{C}$ such that $\tilde{C}(u) = W(u)$.
However, W is a copula only in two dimensions, in which case it corresponds to countermonotonic random variables.
In two dimensions, i.e. the bivariate case, the Fréchet–Hoeffding theorem states

$\max\{u + v - 1,\; 0\} \le C(u, v) \le \min\{u, v\}.$

[Figure: Graphs of the bivariate Fréchet–Hoeffding copula limits and of the independence copula (in the middle).]

Families of copulas

Gaussian copula
The Gaussian copula is a distribution over the unit cube $[0,1]^d$. It is
constructed from a multivariate normal distribution over $\mathbb{R}^d$ by using
the probability integral transform.
For a given correlation matrix $R \in [-1,1]^{d\times d}$, the Gaussian copula with
parameter matrix R can be written as

$C_R^{\text{Gauss}}(u) = \Phi_R\big(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_d)\big),$

where $\Phi^{-1}$ is the inverse cumulative distribution function of a
standard normal and $\Phi_R$ is the joint cumulative distribution function
of a multivariate normal distribution with mean vector zero and covariance matrix equal to the correlation matrix R.[4]
The density can be written as

$c_R^{\text{Gauss}}(u) = \frac{1}{\sqrt{\det R}} \exp\!\left(-\frac{1}{2} \begin{pmatrix} \Phi^{-1}(u_1) \\ \vdots \\ \Phi^{-1}(u_d) \end{pmatrix}^{\!\top} \big(R^{-1} - I\big) \begin{pmatrix} \Phi^{-1}(u_1) \\ \vdots \\ \Phi^{-1}(u_d) \end{pmatrix} \right),$

where $I$ is the identity matrix.

[Figure: Cumulative and density distribution of the Gaussian copula with ρ = 0.4.]
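The density formula can be cross-checked against the defining ratio φR(x)/∏φ(xi) at a test point; the point and the value ρ = 0.4 are illustrative:

```python
import numpy as np
from scipy import stats

rho = 0.4
R = np.array([[1.0, rho], [rho, 1.0]])

def gauss_copula_density(u, R):
    # c_R(u) = det(R)^(-1/2) * exp(-0.5 * x^T (R^{-1} - I) x), with x = Phi^{-1}(u)
    x = stats.norm.ppf(u)
    quad = x @ (np.linalg.inv(R) - np.eye(len(u))) @ x
    return float(np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(R)))

# Cross-check against the ratio of the joint normal density to the
# product of the standard normal marginal densities
u = np.array([0.3, 0.8])
x = stats.norm.ppf(u)
ratio = float(stats.multivariate_normal([0.0, 0.0], R).pdf(x)
              / np.prod(stats.norm.pdf(x)))
val = gauss_copula_density(u, R)
```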

Archimedean copulas
Archimedean copulas are an associative class of copulas. Most common Archimedean copulas admit an explicit
formula for C, something not possible for instance for the Gaussian copula. In practice, Archimedean copulas are
popular because they allow one to model dependence in arbitrarily high dimensions with only one parameter governing
the strength of dependence.
A copula C is called Archimedean if it admits the representation

$C(u_1, \ldots, u_d;\, \theta) = \psi^{[-1]}\big(\psi(u_1; \theta) + \cdots + \psi(u_d; \theta);\; \theta\big),$

where $\psi$ is the so-called generator and $\psi^{[-1]}$ is its inverse.

The above formula yields a copula if and only if $\psi^{[-1]}$ is d-monotone on $[0, \infty)$.[5] That is, if the kth derivatives of
$\psi^{[-1]}$ satisfy

$(-1)^k\, \frac{d^k}{dt^k}\, \psi^{[-1]}(t) \ge 0$

for all $t \ge 0$ and $k = 0, 1, \ldots, d-2$, and $(-1)^{d-2}\, \frac{d^{d-2}}{dt^{d-2}}\, \psi^{[-1]}(t)$ is nonincreasing and convex.

The generators in the following table are the most popular ones. All of them are completely monotone, i.e.
d-monotone for all $d \in \mathbb{N}$.

Table with the most important generators[6]

  name               generator ψ(t; θ)                                 generator inverse ψ^{[-1]}(t; θ)                               parameter θ
  Ali-Mikhail-Haq    $\log\!\frac{1-\theta(1-t)}{t}$                   $\frac{1-\theta}{e^{t}-\theta}$                                $\theta \in [-1, 1)$
  Clayton[7]         $\frac{1}{\theta}\big(t^{-\theta}-1\big)$         $(1+\theta t)^{-1/\theta}$                                     $\theta \in (0, \infty)$
  Frank              $-\log\!\frac{e^{-\theta t}-1}{e^{-\theta}-1}$    $-\frac{1}{\theta}\log\!\big(1+e^{-t}(e^{-\theta}-1)\big)$     $\theta \in \mathbb{R} \setminus \{0\}$
  Gumbel             $(-\log t)^{\theta}$                              $e^{-t^{1/\theta}}$                                            $\theta \in [1, \infty)$
  Independence       $-\log t$                                         $e^{-t}$                                                       —
  Joe                $-\log\!\big(1-(1-t)^{\theta}\big)$               $1-\big(1-e^{-t}\big)^{1/\theta}$                              $\theta \in [1, \infty)$
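The Archimedean construction can be sketched with the Clayton generator from the table; the value θ = 2 and the evaluation points are illustrative, and the composition is compared with Clayton's closed form C(u, v) = (u^(−θ) + v^(−θ) − 1)^(−1/θ):

```python
theta = 2.0

def psi(t):
    # Clayton generator from the table
    return (t ** -theta - 1.0) / theta

def psi_inv(s):
    # ...and its inverse
    return (1.0 + theta * s) ** (-1.0 / theta)

def clayton(u, v):
    # Archimedean construction: C(u, v) = psi_inv(psi(u) + psi(v))
    return psi_inv(psi(u) + psi(v))

def clayton_closed_form(u, v):
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

check = abs(clayton(0.3, 0.7) - clayton_closed_form(0.3, 0.7))
edge = clayton(1.0, 0.42)   # copula boundary property: C(1, u) = u
```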

Empirical copulas
When studying multivariate data, one might want to investigate the underlying copula. Suppose we have
observations

$(X_1^i, X_2^i, \ldots, X_d^i), \quad i = 1, \ldots, n,$

from a random vector $(X_1, X_2, \ldots, X_d)$ with continuous margins. The corresponding "true" copula observations
would be

$(U_1^i, U_2^i, \ldots, U_d^i) = \big(F_1(X_1^i), F_2(X_2^i), \ldots, F_d(X_d^i)\big), \quad i = 1, \ldots, n.$

However, the marginal distribution functions $F_k$ are usually not known. Therefore, one can construct pseudo copula
observations by using the empirical distribution functions

$F_k^{n}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\big(X_k^i \le x\big)$

instead. Then, the pseudo copula observations are defined as

$(\tilde{U}_1^i, \tilde{U}_2^i, \ldots, \tilde{U}_d^i) = \big(F_1^{n}(X_1^i), F_2^{n}(X_2^i), \ldots, F_d^{n}(X_d^i)\big), \quad i = 1, \ldots, n.$

The corresponding empirical copula is then defined as

$C^{n}(u_1, \ldots, u_d) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\big(\tilde{U}_1^i \le u_1, \ldots, \tilde{U}_d^i \le u_d\big).$

The components of the pseudo copula samples can also be written as $\tilde{U}_k^i = R_k^i / n$, where $R_k^i$ is the rank of the
observation $X_k^i$:

$R_k^i = \sum_{j=1}^{n} \mathbf{1}\big(X_k^j \le X_k^i\big).$

Therefore, the empirical copula can be seen as the empirical distribution of the rank-transformed data.
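The rank-based construction can be sketched with SciPy's rankdata; the data-generating process is an illustrative choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 2000

# Dependent data with unknown, non-uniform margins
x = rng.normal(size=n)
y = np.exp(x + 0.5 * rng.normal(size=n))

# Pseudo copula observations: rank-transform each margin to (0, 1]
u = stats.rankdata(x) / n
v = stats.rankdata(y) / n

def empirical_copula(a, b):
    # C^n(a, b) = fraction of pseudo-observations in [0, a] x [0, b]
    return float(np.mean((u <= a) & (v <= b)))

val = empirical_copula(0.5, 0.5)   # > 1/4 here, reflecting positive dependence
```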

Monte Carlo integration for copula models


In statistical applications, many problems can be formulated in the following way. One is interested in the
expectation of a response function $H : \mathbb{R}^d \to \mathbb{R}$ applied to some random vector $(X_1, \ldots, X_d)$.[8] If we denote
the cdf of this random vector with $F$, the quantity of interest can thus be written as

$\operatorname{E}\big[H(X_1, \ldots, X_d)\big] = \int_{\mathbb{R}^d} H(x_1, \ldots, x_d)\; dF(x_1, \ldots, x_d).$

If F is given by a copula model, i.e.,

$F(x_1, \ldots, x_d) = C\big(F_1(x_1), \ldots, F_d(x_d)\big),$

this expectation can be rewritten as

$\operatorname{E}\big[H(X_1, \ldots, X_d)\big] = \int_{[0,1]^d} H\big(F_1^{-1}(u_1), \ldots, F_d^{-1}(u_d)\big)\; dC(u_1, \ldots, u_d).$

In case the copula C is absolutely continuous, i.e. C has a density c, this equation can be written as

$\operatorname{E}\big[H(X_1, \ldots, X_d)\big] = \int_{[0,1]^d} H\big(F_1^{-1}(u_1), \ldots, F_d^{-1}(u_d)\big)\, c(u_1, \ldots, u_d)\; du_1 \cdots du_d.$

If copula and margins are known (or if they have been estimated), this expectation can be approximated through the
following Monte Carlo algorithm:
1. Draw a sample $(U_1^k, \ldots, U_d^k),\; k = 1, \ldots, n,$ of size n from the copula C.
2. By applying the inverse marginal cdf's, produce a sample of $(X_1, \ldots, X_d)$ by setting

$(X_1^k, \ldots, X_d^k) = \big(F_1^{-1}(U_1^k), \ldots, F_d^{-1}(U_d^k)\big), \quad k = 1, \ldots, n.$

3. Approximate $\operatorname{E}\big[H(X_1, \ldots, X_d)\big]$ by its empirical value:

$\operatorname{E}\big[H(X_1, \ldots, X_d)\big] \approx \frac{1}{n} \sum_{k=1}^{n} H\big(X_1^k, \ldots, X_d^k\big).$
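The three steps translate directly into code; the Gaussian copula, the exponential margins, and the response H(x1, x2) = max(x1, x2) are illustrative choices. The quantity E[X1 + X2] serves as a sanity check, since it does not depend on the copula:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, rho = 100_000, 0.6
R = [[1.0, rho], [rho, 1.0]]

# Step 1: draw a sample from the copula C (here: a Gaussian copula)
z = rng.multivariate_normal([0.0, 0.0], R, size=n)
u = stats.norm.cdf(z)

# Step 2: apply the inverse marginal cdf's (Exp(1) and Exp(2) margins, say)
x1 = stats.expon.ppf(u[:, 0], scale=1.0)
x2 = stats.expon.ppf(u[:, 1], scale=2.0)

# Step 3: approximate E[H(X1, X2)] by its empirical value
estimate = float(np.maximum(x1, x2).mean())
sanity = float((x1 + x2).mean())   # E[X1 + X2] = 1 + 2, whatever the copula
```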
Applications

Quantitative finance
The applications of copulas in quantitative finance are numerous, both in the real-world probability of risk/portfolio
management and in the risk-neutral probability of derivatives pricing.
In risk/portfolio management, copulas are used to perform stress-tests and robustness checks: panic copulas are
glued with market estimates of the marginal distributions to analyze the effects of panic regimes on the portfolio
profit and loss distribution. Panic copulas are created by Monte Carlo simulation, mixed with a re-weighting of the
probability of each scenario.[9]
As far as derivatives pricing is concerned, dependence modelling with copula functions is widely used in
applications of financial risk assessment and actuarial analysis – for example in the pricing of collateralized debt
obligations (CDOs).[10] Some believe the methodology of applying the Gaussian copula to credit derivatives to be
one of the reasons behind the global financial crisis of 2008–2009.[11][12] Despite this perception, there are
documented attempts of the financial industry, occurring before the crisis, to address the limitations of the Gaussian
copula and of copula functions more generally, specifically the lack of dependence dynamics and the poor
representation of extreme events.[13] There have been attempts to propose models rectifying some of the copula
limitations.[13][14][15]
While the application of copulas in credit has gone through popularity as well as misfortune during the global
financial crisis of 2008–2009,[16] it is arguably an industry standard model for pricing CDOs. Less arguably, copulas
have also been applied to other asset classes as a flexible tool in analyzing multi-asset derivative products. The first
such application outside credit was to use a copula to construct an implied basket volatility surface,[17] taking into
account the volatility smile of basket components. Copulas have since gained popularity in pricing and risk
management [18] of options on multi-assets in the presence of volatility smile/skew, in equity, foreign exchange and
fixed income derivative business. Some typical example applications of copulas are listed below:
• Analyzing and pricing volatility smile/skew of exotic baskets, e.g. best/worst of;
• Analyzing and pricing volatility smile/skew of less liquid FX cross, which is effectively a basket: C = S1/S2 or C
= S1*S2;
• Analyzing and pricing spread options, in particular in fixed income constant maturity swap spread options.

Civil engineering
Recently, copula functions have been successfully applied to the database formulation for the reliability analysis of
highway bridges, and to various multivariate simulation studies in civil[19], mechanical and offshore engineering.

Medicine
Copula functions have been successfully applied to the analysis of spike counts in neuroscience.[20]

Weather research
Copulas have been extensively used in climate and weather related research.[21]

References
[1] Nelsen, Roger B. (1999), An Introduction to Copulas, New York: Springer, ISBN 0-387-98623-5
[2] Sklar, A. (1959), "Fonctions de répartition à n dimensions et leurs marges", Publ. Inst. Statist. Univ. Paris 8: 229–231
[3] J J O'Connor and E F Robertson (March 2011). "Biography of Wassily Hoeffding" (http://www-history.mcs.st-andrews.ac.uk/Biographies/Hoeffding.html). School of Mathematics and Statistics, University of St Andrews, Scotland. Retrieved 8 November 2011.
[4] Arbenz, Philipp (2011). "Bayesian Copulae Distributions, with Application to Operational Risk Management – Some Comments". Methodology and Computing in Applied Probability, forthcoming. doi:10.1007/s11009-011-9224-0.
[5] McNeil, A. J.; Nešlehová, J. (2009). "Multivariate Archimedean copulas, d-monotone functions and 1-norm symmetric distributions". Annals of Statistics 37 (5b): 3059–3097. doi:10.1214/07-AOS556.
[6] Hofert, Jan Marius (2010). Sampling Nested Archimedean Copulas with Applications to CDO Pricing (http://vts.uni-ulm.de/docs/2010/7242/vts_7242_10223.pdf). Dissertation, University of Ulm.
[7] Clayton, David G. (1978). "A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence". Biometrika 65 (1): 141–151. JSTOR 2335289.
[8] Alexander J. McNeil, Rudiger Frey and Paul Embrechts (2005). Quantitative Risk Management: Concepts, Techniques, and Tools. Princeton Series in Finance
[9] Meucci, Attilio (2011), "A New Breed of Copulas for Risk and Portfolio Management" (http://symmys.com/node/335), Risk 24 (9): 122–126
[10] Meneguzzo, David; Vecchiato, Walter (Nov 2003), "Copula sensitivity in collateralized debt obligations and basket default swaps", Journal of Futures Markets 24 (1): 37–70, doi:10.1002/fut.10110
[11] "Recipe for Disaster: The Formula That Killed Wall Street" (http://www.wired.com/techbiz/it/magazine/17-03/wp_quant?currentPage=all), Wired, 2/23/2009
[12] MacKenzie, Donald (2008), "End-of-the-World Trade" (http://www.lrb.co.uk/v30/n09/mack01_.html), London Review of Books, 2008-05-08, retrieved 2009-07-27
[13] Lipton, Alexander; Rennie, Andrew. Credit Correlation: Life After Copulas. World Scientific. ISBN 978-981-270-949-3.
[14] Donnelly, C.; Embrechts, P. (2010). "The devil is in the tails: actuarial mathematics and the subprime mortgage crisis". ASTIN Bulletin 40 (1): 1–33
[15] Brigo, D.; Pallavicini, A.; Torresetti, R. (2010). Credit Models and the Crisis: A Journey into CDOs, Copulas, Correlations and Dynamic Models. Wiley and Sons
[16] Jones, Sam (April 24, 2009), "The formula that felled Wall St" (http://www.ft.com/cms/s/2/912d85e8-2d75-11de-9eba-00144feabdc0.html), Financial Times
[17] Qu, Dong (2001). "Basket Implied Volatility Surface". Derivatives Week (4 June)
[18] Qu, Dong (2005). "Pricing Basket Options With Skew". Wilmott Magazine (July)

[19] Thompson, David; Kilgore, Roger (2011), "Estimating Joint Flow Probabilities at Stream Confluences using Copulas" (http://trb.metapress.com/content/m3146tg612k80771/), Transportation Research Record 2262: 200–206, retrieved 2012-02-21
[20] Onken, A.; Grünewälder, S.; Munk, M. H.; Obermayer, K. (2009), Aertsen, Ad, ed., "Analyzing Short-Term Noise Dependencies of Spike-Counts in Macaque Prefrontal Cortex Using Copulas and the Flashlight Transformation" (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000577), PLoS Computational Biology 5 (11): e1000577, doi:10.1371/journal.pcbi.1000577, PMC 2776173, PMID 19956759
[21] Schölzel, C.; Friederichs, P. (2008), "Multivariate non-normally distributed random variables in climate research – introduction to the copula approach", Nonlinear Processes in Geophysics 15: 761–772, doi:10.5194/npg-15-761-2008

Further reading
• The standard reference for an introduction to copulas. Covers all fundamental aspects, summarizes the most
popular copula classes, and provides proofs for the important theorems related to copulas
Roger B. Nelsen (1999), "An Introduction to Copulas", Springer. ISBN 978-0-387-98623-4
• A book covering current topics in mathematical research on copulas:
Piotr Jaworski, Fabrizio Durante, Wolfgang Karl Härdle, Tomasz Rychlik (Editors): (2010): "Copula
Theory and Its Applications" Lecture Notes in Statistics, Springer. ISBN 978-3-642-12464-8
• A paper covering the historic development of copula theory, by the person associated with the "invention" of
copulas, Abe Sklar.
Abe Sklar (1997): "Random variables, distribution functions, and copulas – a personal look backward
and forward" in Rüschendorf, L., Schweizer, B. and Taylor, M. (eds) Distributions With Fixed
Marginals & Related Topics (Lecture Notes – Monograph Series Number 28). ISBN
978-0-940600-40-9
• The standard reference for multivariate models and copula theory in the context of financial and insurance models
Alexander J. McNeil, Rudiger Frey and Paul Embrechts (2005) "Quantitative Risk Management:
Concepts, Techniques, and Tools", Princeton Series in Finance. ISBN 978-0-691-12255-7

External links
• Copula Wiki: community portal for researchers with interest in copulas (http://sites.google.com/site/
copulawiki/)
• A collection of Copula simulation and estimation codes (http://www.mathfinance.cn/tags/copula)
• Thorsten Schmidt (2006): "Coping with Copulas" (http://www.math.uni-leipzig.de/~tschmidt/
TSchmidt_Copulas.pdf)
• Copulas & Correlation using Excel Simulation Articles (http://www.crystalballservices.com/Resources/
ConsultantsCornerBlog/tagid/21/Correlation.aspx)

Coupon collector's problem


In probability theory, the coupon collector's problem describes the "collect all coupons and win" contests. It asks
the following question: suppose that there are n different coupons, each equally likely to be drawn, and coupons are
being collected with replacement. What is the probability that more than t sample trials are needed to collect all n coupons?
An alternative statement is: Given n coupons, how many coupons do you expect you need to draw with replacement
before having drawn each coupon at least once? The mathematical analysis of the problem reveals that the expected
number of trials needed grows as $\Theta(n \log(n))$. For example, when n = 50 it takes about 225[1] trials to collect all
50 coupons.

Understanding the problem


The key to solving the problem is understanding that it takes very little time to collect the first few coupons. On the
other hand, it takes a long time to collect the last few coupons. In fact, for 50 coupons, it takes on average 50 trials to
collect the very last coupon after the other 49 coupons have been collected. This is why the expected time to collect
all coupons is much longer than 50. The idea now is to split the total time into 50 intervals where the expected time
can be calculated.

Solution

Calculating the expectation


Let T be the time to collect all n coupons, and let ti be the time to collect the i-th coupon after i − 1 coupons have
been collected. Think of T and ti as random variables. Observe that the probability of collecting a new coupon given
i − 1 coupons is pi = (n − (i − 1))/n. Therefore, ti has geometric distribution with expectation 1/pi. By the linearity of
expectations we have:

$E(T) = E(t_1) + E(t_2) + \cdots + E(t_n) = \frac{1}{p_1} + \frac{1}{p_2} + \cdots + \frac{1}{p_n} = \frac{n}{n} + \frac{n}{n-1} + \cdots + \frac{n}{1} = n \cdot H_n.$

Here Hn is the n-th harmonic number. Using the asymptotics of the harmonic numbers, we obtain:

$E(T) = n \cdot H_n = n \ln n + \gamma n + \frac{1}{2} + O(1/n),$

where $\gamma \approx 0.5772156649$ is the Euler–Mascheroni constant.
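The exact expectation n·Hn and its asymptotic form are easy to verify numerically; a minimal sketch (the function names are ours, not from the article):

```python
import math

def expected_draws(n):
    """Exact expectation: E(T) = n * H_n."""
    return n * sum(1.0 / i for i in range(1, n + 1))

def asymptotic_draws(n):
    """Asymptotic form: n ln n + gamma * n + 1/2."""
    gamma = 0.5772156649015329  # Euler-Mascheroni constant
    return n * math.log(n) + gamma * n + 0.5

print(expected_draws(50))    # about 224.96
print(asymptotic_draws(50))  # about 224.96
```

For n = 50 both agree to two decimal places, matching the figure of "about 225 trials" quoted above.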


Now one can use the Markov inequality to bound the desired probability:

$P(T \geq c \, n H_n) \leq \frac{1}{c}.$

Calculating the variance


Using the independence of random variables ti, we obtain:

$\operatorname{Var}(T) = \operatorname{Var}(t_1) + \cdots + \operatorname{Var}(t_n) \leq \frac{n^2}{n^2} + \frac{n^2}{(n-1)^2} + \cdots + \frac{n^2}{1^2} \leq n^2 \sum_{i=1}^{\infty} \frac{1}{i^2} = \frac{\pi^2}{6} n^2,$

where the last equality uses a value of the Riemann zeta function (see Basel problem).
Now one can use the Chebyshev inequality to bound the desired probability:

$P\left(\left|T - n H_n\right| \geq c\, n\right) \leq \frac{\pi^2}{6 c^2}.$

Tail estimates
A different upper bound can be derived from the following observation. Let ${Z}_i^r$ denote the event that the $i$-th
coupon was not picked in the first $r$ trials. Then:

$P\left[{Z}_i^r\right] = \left(1 - \frac{1}{n}\right)^r \leq e^{-r/n}.$

Thus, for $r = \beta n \log n$, we have $P\left[{Z}_i^r\right] \leq e^{-(\beta n \log n)/n} = n^{-\beta}$, and hence

$P\left[T > \beta n \log n\right] \leq P\left[\bigcup_i {Z}_i^r\right] \leq n \cdot P\left[{Z}_1^r\right] \leq n^{1-\beta}.$
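The union bound can be checked by simulation; a small Monte Carlo sketch (the parameter choices n = 20, β = 1.5 are arbitrary, and the function names are ours):

```python
import math
import random

def draws_to_complete(n, rng):
    """Draw coupons uniformly with replacement until all n have appeared."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

n, beta = 20, 1.5
r = beta * n * math.log(n)   # threshold beta * n * log n
bound = n ** (1 - beta)      # tail bound: P(T > r) <= n^(1 - beta)

rng = random.Random(0)
trials = 4000
exceed = sum(draws_to_complete(n, rng) > r for _ in range(trials)) / trials
print(exceed, "<=", bound)
```

The empirical exceedance frequency should fall below the bound n^(1−β), though (as with any union bound) the bound is not tight.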

Connection to probability generating functions


Another combinatorial technique can also be used to resolve the problem: see Coupon collector's problem
(generating function approach).

Extensions and generalizations


• Paul Erdős and Alfréd Rényi proved the limit theorem for the distribution of T: as $n \to \infty$,

$P\left[T < n \log n + cn\right] \to e^{-e^{-c}}.$

This result is a further extension of previous bounds.

• Donald J. Newman and Lawrence Shepp found a generalization of the coupon collector's problem when k copies
of each coupon needs to be collected. Let Tk be the first time k copies of each coupon are collected. They showed
that the expectation in this case satisfies:

$E(T_k) = n \log n + (k-1)\, n \log\log n + O(n).$

Here k is fixed. When k = 1 we get the earlier formula for the expectation.
• A common generalization, also due to Erdős and Rényi: as $n \to \infty$,

$P\left[T_k < n \log n + (k-1)\, n \log\log n + cn\right] \to e^{-e^{-c}/(k-1)!}.$

Notes
[1] Here is the calculation: 50 ln(50) ≈ 195.60 and 50γ ≈ 28.86, so 50 ln(50) + 50γ ≈ 224.46, while E(50) = 50·(1 + 1/2 + 1/3 + … + 1/50) ≈ 224.96, the
expected number of trials to collect all 50 coupons.

References
• Blom, Gunnar; Holst, Lars; Sandell, Dennis (1994), "7.5 Coupon collecting I, 7.6 Coupon collecting II, and 15.4
Coupon collecting III" (http://books.google.com/books?id=KCsSWFMq2u0C&pg=PA85), Problems and
Snapshots from the World of Probability, New York: Springer-Verlag, pp. 85–87, 191, ISBN 0-387-94161-4,
MR1265713.
• Dawkins, Brian (1991), "Siobhan's problem: the coupon collector revisited", The American Statistician 45 (1):
76–82, JSTOR 2685247.
• Erdős, Paul; Rényi, Alfréd (1961), "On a classical problem of probability theory" (http://www.renyi.hu/
~p_erdos/1961-09.pdf), Magyar Tudományos Akadémia Matematikai Kutató Intézetének Közleményei 6:
215–220, MR0150807.
• Newman, Donald J.; Shepp, Lawrence (1960), "The double dixie cup problem", American Mathematical Monthly
67: 58–61, doi:10.2307/2308930, MR0120672
• Flajolet, Philippe; Gardy, Danièle; Thimonier, Loÿs (1992), "Birthday paradox, coupon collectors, caching
algorithms and self-organizing search" (http://algo.inria.fr/flajolet/Publications/alloc2.ps.gz), Discrete
Applied Mathematics 39 (3): 207–229, doi:10.1016/0166-218X(92)90177-C, MR1189469.
• Isaac, Richard (1995), "8.4 The coupon collector's problem solved" (http://books.google.com/
books?id=a_2vsIx4FQMC&pg=PA80), The Pleasures of Probability, Undergraduate Texts in Mathematics, New
York: Springer-Verlag, pp. 80–82, ISBN 0-387-94415-X, MR1329545.
• Motwani, Rajeev; Raghavan, Prabhakar (1995), "3.6. The Coupon Collector's Problem" (http://books.google.
com/books?id=QKVY4mDivBEC&pg=PA57), Randomized algorithms, Cambridge: Cambridge University
Press, pp. 57–63, MR1344451.

External links
• " Coupon Collector Problem (http://demonstrations.wolfram.com/CouponCollectorProblem/)" by Ed Pegg, Jr.,
the Wolfram Demonstrations Project. Mathematica package.
• Coupon Collector Problem (http://www-stat.stanford.edu/~susan/surprise/Collector.html), a simple Java
applet.
• How Many Singles, Doubles, Triples, Etc., Should The Coupon Collector Expect? (http://www.math.rutgers.
edu/~zeilberg/mamarim/mamarimhtml/coupon.html), a short note by Doron Zeilberger.

Degrees of freedom (statistics)


In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are
free to vary.[1]
Estimates of statistical parameters can be based upon different amounts of information or data. The number of
independent pieces of information that go into the estimate of a parameter is called the degrees of freedom (df). In
general, the degrees of freedom of an estimate of a parameter is equal to the number of independent scores that go
into the estimate minus the number of parameters used as intermediate steps in the estimation of the parameter itself
(which, in sample variance, is one, since the sample mean is the only intermediate step).[2]
Mathematically, degrees of freedom is the dimension of the domain of a random vector, or essentially the number of
'free' components: how many components need to be known before the vector is fully determined.
The term is most often used in the context of linear models (linear regression, analysis of variance), where certain
random vectors are constrained to lie in linear subspaces, and the number of degrees of freedom is the dimension of
the subspace. The degrees-of-freedom are also commonly associated with the squared lengths (or "Sum of Squares")
of such vectors, and the parameters of chi-squared and other distributions that arise in associated statistical testing
problems.
While introductory texts may introduce degrees of freedom as distribution parameters or through hypothesis testing,
it is the underlying geometry that defines degrees of freedom, and is critical to a proper understanding of the concept.
Walker (1940)[3] has stated this succinctly:
For the person who is unfamiliar with N-dimensional geometry or who knows the contributions to modern
sampling theory only from secondhand sources such as textbooks, this concept often seems almost mystical,
with no practical meaning.

Notation
In equations, the typical symbol for degrees of freedom is $\nu$ (lowercase Greek letter nu). In text and tables, the
abbreviation "d.f." is commonly used. R.A. Fisher used n to symbolize degrees of freedom (writing n′ for sample
size) but modern usage typically reserves n for sample size.

Residuals
A common way to think of degrees of freedom is as the number of independent pieces of information available to
estimate another piece of information. More concretely, the number of degrees of freedom is the number of
independent observations in a sample of data that are available to estimate a parameter of the population from which
that sample is drawn. For example, if we have two observations, when calculating the mean we have two
independent observations; however, when calculating the variance, we have only one independent observation, since
the two observations are equally distant from the mean.
In fitting statistical models to data, the vectors of residuals are constrained to lie in a space of smaller dimension than
the number of components in the vector. That smaller dimension is the number of degrees of freedom for error.

Linear regression
Perhaps the simplest example is this. Suppose

$X_1, \dots, X_n$

are random variables each with expected value μ, and let

$\overline{X}_n = \frac{X_1 + \cdots + X_n}{n}$

be the "sample mean." Then the quantities

$X_i - \overline{X}_n$

are residuals that may be considered estimates of the errors Xi − μ. The sum of the residuals (unlike the sum of the
errors) is necessarily 0. If one knows the values of any n − 1 of the residuals, one can thus find the last one. That
means they are constrained to lie in a space of dimension n − 1. One says that "there are n − 1 degrees of freedom
for residual."
An only slightly less simple example is that of least squares estimation of a and b in the model

$Y_i = a + b x_i + \varepsilon_i, \qquad i = 1, \dots, n,$

where εi and hence Yi are random. Let $\widehat{a}$ and $\widehat{b}$ be the least-squares estimates of a and b. Then the residuals

$\widehat{e}_i = y_i - (\widehat{a} + \widehat{b} x_i)$

are constrained to lie within the space defined by the two equations

$\widehat{e}_1 + \cdots + \widehat{e}_n = 0, \qquad x_1 \widehat{e}_1 + \cdots + x_n \widehat{e}_n = 0.$

One says that there are n − 2 degrees of freedom for error.
The capital Y is used in specifying the model, and lower-case y in the definition of the residuals. That is because the
former are hypothesized random variables and the latter are data.
We can generalise this to multiple regression involving p parameters and covariates (e.g. p − 1 predictors and one
mean), in which case the cost in degrees of freedom of the fit is p.
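The two linear constraints on the least-squares residuals can be verified directly; a minimal sketch with made-up data (variable names are ours):

```python
import random

# Simple linear regression y = a + b x + noise; after fitting, the residuals
# satisfy two linear constraints, leaving n - 2 degrees of freedom for error.
random.seed(1)
n = 12
x = [float(i) for i in range(n)]
y = [2.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

xbar, ybar = sum(x) / n, sum(y) / n
b_hat = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
a_hat = ybar - b_hat * xbar

e = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]  # residuals

print(sum(e))                                # ~0: first constraint
print(sum(xi * ei for xi, ei in zip(x, e)))  # ~0: second constraint
```

Both sums vanish up to rounding, regardless of the noise draw: the residual vector always lies in an (n − 2)-dimensional subspace.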

Degrees of freedom of a random vector


Geometrically, the degrees of freedom can be interpreted as the dimension of certain vector subspaces. As a starting
point, suppose that we have a sample of n independent normally distributed observations,

$X_1, \dots, X_n.$

This can be represented as an n-dimensional random vector:

$\begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix}.$

Since this random vector can lie anywhere in n-dimensional space, it has n degrees of freedom.
Now, let $\overline{X}$ be the sample mean. The random vector can be decomposed as the sum of the sample mean plus a
vector of residuals:

$\begin{pmatrix} X_1 \\ \vdots \\ X_n \end{pmatrix} = \overline{X} \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} + \begin{pmatrix} X_1 - \overline{X} \\ \vdots \\ X_n - \overline{X} \end{pmatrix}.$

The first vector on the right-hand side is constrained to be a multiple of the vector of 1's, and the only free quantity is
$\overline{X}$. It therefore has 1 degree of freedom.

The second vector is constrained by the relation $\sum_{i=1}^{n} (X_i - \overline{X}) = 0$. The first n − 1 components of this vector can
be anything. However, once you know the first n − 1 components, the constraint tells you the value of the nth
component. Therefore, this vector has n − 1 degrees of freedom.
Mathematically, the first vector is the orthogonal, or least-squares, projection of the data vector onto the subspace
spanned by the vector of 1's. The 1 degree of freedom is the dimension of this subspace. The second residual vector
is the least-squares projection onto the (n − 1)-dimensional orthogonal complement of this subspace, and has n − 1
degrees of freedom.
In statistical testing applications, often one isn't directly interested in the component vectors, but rather in their
squared lengths. In the example above, the residual sum-of-squares is

$SSR = \sum_{i=1}^{n} (X_i - \overline{X})^2.$

If the data points are normally distributed with mean 0 and variance $\sigma^2$, then the residual sum of squares has a
scaled chi-squared distribution (scaled by the factor $\sigma^2$), with n − 1 degrees of freedom. The degrees-of-freedom,
here a parameter of the distribution, can still be interpreted as the dimension of an underlying vector subspace.
Likewise, the one-sample t-test statistic,

$\frac{\sqrt{n}\,(\overline{X} - \mu_0)}{\sqrt{SSR/(n-1)}}$

follows a Student's t distribution with n − 1 degrees of freedom when the hypothesized mean $\mu_0$ is correct. Again,
the degrees-of-freedom arises from the residual vector in the denominator.
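The decomposition into a mean component and a residual component can be illustrated numerically; a short sketch (the data values are arbitrary):

```python
# Decompose a data vector into its mean component (1 df) and residual
# component (n - 1 df), and check that the two parts are orthogonal.
n = 8
X = [2.1, -0.3, 1.7, 0.4, -1.2, 0.9, 2.5, -0.8]

xbar = sum(X) / n
mean_part = [xbar] * n              # multiple of the all-ones vector
resid_part = [x - xbar for x in X]  # sums to zero: one linear constraint

dot = sum(a * b for a, b in zip(mean_part, resid_part))
print(dot)              # ~0: the two subspaces are orthogonal
print(sum(resid_part))  # ~0: the constraint that removes one df
```

Because the components are orthogonal, the squared lengths add (Pythagoras), which is exactly why sums of squares decompose in the testing applications below.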

Degrees of freedom in linear models


The demonstration of the t and chi-squared distributions for one-sample problems above is the simplest example
where degrees-of-freedom arise. However, similar geometry and vector decompositions underlie much of the theory
of linear models, including linear regression and analysis of variance. An explicit example based on comparison of
three means is presented here; the geometry of linear models is discussed in more complete detail by Christensen
(2002).[4]
Suppose independent observations are made for three populations, $X_1, \ldots, X_n$, $Y_1, \ldots, Y_n$ and $Z_1, \ldots, Z_n$.
The restriction to three groups and equal sample sizes simplifies notation, but the ideas are easily generalized.
The observations can be decomposed as

$X_i = \overline{M} + (\overline{X} - \overline{M}) + (X_i - \overline{X})$
$Y_i = \overline{M} + (\overline{Y} - \overline{M}) + (Y_i - \overline{Y})$
$Z_i = \overline{M} + (\overline{Z} - \overline{M}) + (Z_i - \overline{Z})$

where $\overline{X}, \overline{Y}, \overline{Z}$ are the means of the individual samples, and $\overline{M} = (\overline{X} + \overline{Y} + \overline{Z})/3$ is the mean of all 3n
observations. In vector notation this decomposition can be written as

$\begin{pmatrix} X_1 \\ \vdots \\ X_n \\ Y_1 \\ \vdots \\ Y_n \\ Z_1 \\ \vdots \\ Z_n \end{pmatrix} = \overline{M} \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} + \begin{pmatrix} \overline{X} - \overline{M} \\ \vdots \\ \overline{X} - \overline{M} \\ \overline{Y} - \overline{M} \\ \vdots \\ \overline{Y} - \overline{M} \\ \overline{Z} - \overline{M} \\ \vdots \\ \overline{Z} - \overline{M} \end{pmatrix} + \begin{pmatrix} X_1 - \overline{X} \\ \vdots \\ X_n - \overline{X} \\ Y_1 - \overline{Y} \\ \vdots \\ Y_n - \overline{Y} \\ Z_1 - \overline{Z} \\ \vdots \\ Z_n - \overline{Z} \end{pmatrix}.$
The observation vector, on the left-hand side, has 3n degrees of freedom. On the right-hand side, the first vector has
one degree of freedom (or dimension) for the overall mean. The second vector depends on three random variables,
$\overline{X} - \overline{M}$, $\overline{Y} - \overline{M}$ and $\overline{Z} - \overline{M}$. However, these must sum to 0 and so are constrained; the vector therefore
must lie in a 2-dimensional subspace, and has 2 degrees of freedom. The remaining 3n − 3 degrees of freedom are in
the residual vector (made up of n − 1 degrees of freedom within each of the populations).

Sum of squares and degrees of freedom


In statistical testing problems, one usually isn't interested in the component vectors themselves, but rather in their
squared lengths, or Sum of Squares. The degrees of freedom associated with a sum-of-squares is the
degrees-of-freedom of the corresponding component vectors.
The three-population example above is an example of one-way Analysis of Variance. The model, or treatment,
sum-of-squares is the squared length of the second vector,

$SST = n(\overline{X} - \overline{M})^2 + n(\overline{Y} - \overline{M})^2 + n(\overline{Z} - \overline{M})^2,$

with 2 degrees of freedom. The residual, or error, sum-of-squares is

$SSE = \sum_{i=1}^{n} (X_i - \overline{X})^2 + \sum_{i=1}^{n} (Y_i - \overline{Y})^2 + \sum_{i=1}^{n} (Z_i - \overline{Z})^2,$
with 3(n-1) degrees of freedom. Of course, introductory books on ANOVA usually state formulae without showing
the vectors, but it is this underlying geometry that gives rise to SS formulae, and shows how to unambiguously
determine the degrees of freedom in any given situation.
Under the null hypothesis of no difference between population means (and assuming that standard ANOVA
regularity assumptions are satisfied) the sums of squares have scaled chi-squared distributions, with the
corresponding degrees of freedom. The F-test statistic is the ratio, after scaling by the degrees of freedom. If there is
no difference between population means this ratio follows an F distribution with 2 and 3n − 3 degrees of freedom.
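The decomposition of the sums of squares, and the resulting F-ratio with 2 and 3n − 3 degrees of freedom, can be sketched as follows (simulated data; variable names are ours):

```python
import random

# One-way ANOVA with three groups of size n: the total sum of squares about
# the grand mean splits into a treatment part (2 df) and an error part (3(n-1) df).
random.seed(2)
n = 10
groups = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(3)]  # equal true means

means = [sum(g) / n for g in groups]
grand = sum(sum(g) for g in groups) / (3 * n)

ss_treatment = n * sum((m - grand) ** 2 for m in means)
ss_error = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
ss_total = sum((x - grand) ** 2 for g in groups for x in g)

df_t, df_e = 2, 3 * (n - 1)
F = (ss_treatment / df_t) / (ss_error / df_e)
print(ss_total, ss_treatment + ss_error)  # the decomposition is exact
print(F)  # compared against an F distribution with (2, 27) df under the null
```

The identity ss_total = ss_treatment + ss_error holds exactly (up to rounding) because the underlying component vectors are orthogonal.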
In some complicated settings, such as unbalanced split-plot designs, the sums-of-squares no longer have scaled
chi-squared distributions. Comparison of sum-of-squares with degrees-of-freedom is no longer meaningful, and
software may report certain fractional 'degrees of freedom' in these cases. Such numbers have no genuine
degrees-of-freedom interpretation, but are simply providing an approximate chi-squared distribution for the
corresponding sum-of-squares. The details of such approximations are beyond the scope of this page.

Degrees of freedom parameters in probability distributions


Several commonly encountered statistical distributions (Student's t, Chi-Squared, F) have parameters that are
commonly referred to as degrees of freedom. This terminology simply reflects that in many applications where these
distributions occur, the parameter corresponds to the degrees of freedom of an underlying random vector, as in the
preceding ANOVA example. Another simple example is: if $X_i;\ i = 1, \ldots, n$ are independent normal $(\mu, \sigma^2)$
random variables, the statistic

$\frac{\sum_{i=1}^{n} (X_i - \overline{X})^2}{\sigma^2}$

follows a chi-squared distribution with n−1 degrees of freedom. Here, the degrees of freedom arises from the
residual sum-of-squares in the numerator, and in turn the n−1 degrees of freedom of the underlying residual vector
$\{X_i - \overline{X}\}$.
In the application of these distributions to linear models, the degrees of freedom parameters can take only integer
values. The underlying families of distributions allow fractional values for the degrees-of-freedom parameters, which
can arise in more sophisticated uses. One set of examples is problems where chi-squared approximations based on
effective degrees of freedom are used. In other applications, such as modelling heavy-tailed data, a t or F distribution
may be used as an empirical model. In these cases, there is no particular degrees of freedom interpretation to the
distribution parameters, even though the terminology may continue to be used.

Effective degrees of freedom


Many regression methods, including ridge regression, linear smoothers and smoothing splines are not based on
ordinary least squares projections, but rather on regularized (generalized and/or penalized) least-squares, and so
degrees of freedom defined in terms of dimensionality is generally not useful for these procedures. However, these
procedures are still linear in the observations, and the fitted values of the regression can be expressed in the form

$\hat{y} = H y,$

where $\hat{y}$ is the vector of fitted values at each of the original covariate values from the fitted model, y is the original
vector of responses, and H is the hat matrix or, more generally, smoother matrix.
For statistical inference, sums-of-squares can still be formed: the model sum-of-squares is $\hat{y}'\hat{y} = y'H'Hy$; the residual
sum-of-squares is $y'(I-H)'(I-H)y$. However, because H does not correspond to an ordinary least-squares fit (i.e.
is not an orthogonal projection), these sums-of-squares no longer have (scaled, non-central) chi-squared
distributions, and dimensionally defined degrees-of-freedom are not useful. The distribution is a generalized
chi-squared distribution, and the theory associated with this distribution[5] provides an alternative route to the
answers provided by an effective degrees of freedom.
The effective degrees of freedom of the fit can be defined in various ways to implement goodness-of-fit tests,
cross-validation and other inferential procedures. Here one can distinguish between regression effective degrees of
freedom and residual effective degrees of freedom.
Regression effective degrees of freedom.
Regarding the former, appropriate definitions can include the trace of the hat matrix,[6] tr(H), the trace of the
quadratic form of the hat matrix, tr(H'H), the form tr(2H − HH'), or the Satterthwaite approximation,
tr(H'H)²/tr(H'HH'H). In the case of linear regression, the hat matrix H is X(X 'X)−1X ', and all these definitions reduce
to the usual degrees of freedom. Notice that

$\operatorname{tr}(H) = \sum_{i} h_{ii} = \sum_{i} \frac{\partial \hat{y}_i}{\partial y_i},$

i.e., the regression (not residual) degrees of freedom in linear models are "the sum of the sensitivities of the fitted
values with respect to the observed response values".[7]
Residual effective degrees of freedom.
There are corresponding definitions of residual effective degrees-of-freedom (redf), with H replaced by I − H. For
example, if the goal is to estimate error variance, the redf would be defined as tr((I − H)'(I − H)), and the unbiased
estimate is (with $r = y - \hat{y}$),

$\hat{\sigma}^2 = \frac{r'r}{\operatorname{tr}\left((I-H)'(I-H)\right)},$

or:[8][9][10]

$\hat{\sigma}^2 = \frac{r'r}{n - 2\operatorname{tr}(H) + \operatorname{tr}(HH')} \approx \frac{r'r}{n - 1.25\operatorname{tr}(H) + 0.5}.$
The last approximation above[9] reduces the computational cost from O(n²) to only O(n). In general the numerator
would be the objective function being minimized; e.g., if the hat matrix includes an observation covariance matrix,
Σ, then $r'r$ becomes $r'\Sigma^{-1}r$.
General.
Note that unlike in the original case, non-integer degrees of freedom are allowed, though the value must usually still
be constrained between 0 and n.
Consider, as an example, the k-nearest neighbour smoother, which is the average of the k nearest measured values to
the given point. Then, at each of the n measured points, the weight of the original value on the linear combination
that makes up the predicted value is just 1/k. Thus, the trace of the hat matrix is n/k. Thus the smooth costs n/k
effective degrees of freedom.
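The k-nearest-neighbour example can be verified by building the hat matrix explicitly; a minimal sketch (assuming evenly spaced measurement points):

```python
# Hat matrix of a k-nearest-neighbour smoother on n evenly spaced points:
# each fitted value averages the k nearest measurements, so every row has
# k entries equal to 1/k, and each diagonal entry is 1/k (a point is always
# among its own nearest neighbours).
n, k = 12, 3
x = list(range(n))

H = []
for i in range(n):
    nearest = sorted(range(n), key=lambda j: (abs(x[j] - x[i]), j))[:k]
    H.append([1.0 / k if j in nearest else 0.0 for j in range(n)])

trace = sum(H[i][i] for i in range(n))
print(trace)  # n/k = 4.0
```

The trace, and hence the effective degrees of freedom of the smooth, is n/k, as claimed above.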
As another example, consider the existence of nearly duplicated observations. Naive application of the classical formula,
n − p, would lead to over-estimation of the residual degrees of freedom, as if each observation were independent.
More realistically, though, the hat matrix H = X(X ' Σ−1 X)−1X ' Σ−1 would involve an observation covariance matrix
Σ indicating the non-zero correlation among observations. The more general formulation of effective degrees of
freedom would result in a more realistic estimate for, e.g., the error variance σ².
Similar concepts are the equivalent degrees of freedom in non-parametric regression,[11] the degree of freedom of
signal in atmospheric studies,[12][13] and the non-integer degree of freedom in geodesy.[14][15]

References
[1] "Degrees of Freedom" (http://www.animatedsoftware.com/statglos/sgdegree.htm). Glossary of Statistical Terms. Animated Software. Retrieved 2008-08-21.
[2] Lane, David M. "Degrees of Freedom" (http://davidmlane.com/hyperstat/A42408.html). HyperStat Online. Statistics Solutions. Retrieved 2008-08-21.
[3] Walker, H. M. (April 1940). "Degrees of Freedom". Journal of Educational Psychology 31 (4): 253–269. doi:10.1037/h0054588.
[4] Christensen, Ronald (2002). Plane Answers to Complex Questions: The Theory of Linear Models (Third ed.). New York: Springer. ISBN 0-387-95361-2.
[5] Jones, D. A. (1983). "Statistical analysis of empirical models fitted by optimisation". Biometrika 70 (1): 67–88
[6] Trevor Hastie, Robert Tibshirani, Jerome H. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., 746 p. ISBN 978-0-387-84857-0, doi:10.1007/978-0-387-84858-7 (http://books.google.com/books?id=tVIjmNS3Ob8C&pg=PA154) (eq. (5.16))
[7] Ye, J. (1998). "On Measuring and Correcting the Effects of Data Mining and Model Selection". Journal of the American Statistical Association 93 (441): 120–131. JSTOR 2669609 (eq. (7))
[8] Clive Loader (1999). Local Regression and Likelihood. ISBN 978-0-387-98775-0, doi:10.1007/b98858 (http://books.google.com/books?id=D7GgBAfL4ngC&pg=PA28) (eq. (2.18), p. 30)
[9] Trevor Hastie, Robert Tibshirani (1990). Generalized Additive Models. CRC Press (http://books.google.com/books?id=qa29r1Ze1coC&pg=PA54) (p. 54 and eq. (B.1), p. 305)
[10] Simon N. Wood (2006). Generalized Additive Models: An Introduction with R. CRC Press (http://books.google.com/books?id=hr17lZC-3jQC&pg=PA172) (eq. (4.14), p. 172)
[11] Peter J. Green, B. W. Silverman (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. CRC Press (http://books.google.com/books?id=-AIVXozvpLUC&pg=PA37) (eq. (3.15), p. 37)
[12] Clive D. Rodgers (2000). Inverse Methods for Atmospheric Sounding: Theory and Practice. World Scientific (eq. (2.56), p. 31)
[13] Adrian Doicu, Thomas Trautmann, Franz Schreier (2010). Numerical Regularization for Atmospheric Inverse Problems. Springer (eq. (4.26), p. 114)

[14] D. Dong, T. A. Herring and R. W. King (1997). "Estimating regional deformation from a combination of space and terrestrial geodetic data". Journal of Geodesy 72 (4): 200–214, doi:10.1007/s001900050161 (eq. (27), p. 205)
[15] H. Theil (1963). "On the Use of Incomplete Prior Information in Regression Analysis". Journal of the American Statistical Association 58 (302): 401–414. JSTOR 2283275 (eq. (5.19)–(5.20))

• Eisenhauer, J.G. (2008) "Degrees of Freedom". Teaching Statistics, 30(3), 75–78

External links
• Walker, HW (1940) "Degrees of Freedom" Journal of Educational Psychology 31(4) 253-269. Transcription by C
Olsen with errata (http://courses.ncssm.edu/math/Stat_Inst/PDFS/DFWalker.pdf)
• Good, IJ (1973). "What Are Degrees of Freedom?". The American Statistician (The American Statistician, Vol.
27, No. 5) 27 (5): 227–228. doi:10.2307/3087407. JSTOR 3087407.
• Yu, Chong-ho (1997) Illustrating degrees of freedom in terms of sample size and dimensionality (http://www.
creative-wisdom.com/computer/sas/df.html)
• Dallal, GE. (2003) Degrees of Freedom (http://www.tufts.edu/~gdallal/dof.htm)

Determinant
In linear algebra, the determinant is a value associated with a square matrix. It can be computed from the entries of
the matrix by a specific arithmetic expression, while other ways to determine its value exist as well. The determinant
provides important information when the matrix is that of the coefficients of a system of linear equations, or when it
corresponds to a linear transformation of a vector space: in the first case the system has a unique solution if and only
if the determinant is nonzero, while in the second case that same condition means that the transformation has an
inverse operation. A geometric interpretation can be given to the value of the determinant of a square matrix with
real entries: the absolute value of the determinant gives the scale factor by which area or volume is multiplied under
the associated linear transformation, while its sign indicates whether the transformation preserves orientation. Thus a
2 × 2 matrix with determinant −2, when applied to a region of the plane with finite area, will transform that region
into one with twice the area, while reversing its orientation.
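The area-scaling interpretation can be illustrated numerically; a short sketch using the shoelace formula for signed area (the specific matrix and triangle are our choice):

```python
def signed_area(p, q, r):
    """Signed area of triangle pqr (positive when p, q, r run counter-clockwise)."""
    return 0.5 * ((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1]))

def apply(m, v):
    """Apply the 2x2 matrix m to the vector v."""
    return (m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1])

A = [[1.0, 2.0], [0.0, -2.0]]               # det(A) = 1*(-2) - 2*0 = -2
tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # signed area +0.5
img = [apply(A, p) for p in tri]

print(signed_area(*tri))  # 0.5
print(signed_area(*img))  # -1.0: area doubled, orientation reversed
```

The signed area of the image is det(A) times the signed area of the original region, which is exactly the geometric statement above.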
Determinants occur throughout mathematics. The use of determinants in calculus includes the Jacobian determinant
in the substitution rule for integrals of functions of several variables. They are used to define the characteristic
polynomial of a matrix that is an essential tool in eigenvalue problems in linear algebra. In some cases they are used
just as a compact notation for expressions that would otherwise be unwieldy to write down.
The determinant of a matrix A is denoted det(A), det A, or |A|.[1] In the case where the matrix entries are written out
in full, the determinant is denoted by surrounding the matrix entries by vertical bars instead of the brackets or
parentheses of the matrix. For instance, the determinant of the matrix

$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$

is written

$\begin{vmatrix} a & b \\ c & d \end{vmatrix}$

and has the value

$ad - bc.$
Although most often used for matrices whose entries are real or complex numbers, the definition of the determinant
only involves addition, subtraction and multiplication, and so it can be defined for square matrices with entries taken
from any commutative ring. Thus for instance the determinant of a matrix with integer coefficients will be an
integer, and the matrix has an inverse with integer coefficients if and only if this determinant is 1 or −1 (these being
the only invertible elements of the integers). For square matrices with entries in a non-commutative ring, for instance
the quaternions, there is no unique definition for the determinant, and no definition that has all the usual properties of
determinants over commutative rings.
Determinant 144

Definition
There are various ways to define the determinant of a square matrix A, i.e. one with the same number of rows and
columns. Perhaps the most natural way is expressed in terms of the columns of the matrix. If we write an n-by-n
matrix in terms of its column vectors

A = [a_1, a_2, \ldots, a_n],

where the a_j are vectors of size n, then the determinant of A is defined so that

\det(a_1, \ldots, b a_j + c v, \ldots, a_n) = b \det(A) + c \det(a_1, \ldots, v, \ldots, a_n),
\det(a_1, \ldots, a_j, a_{j+1}, \ldots, a_n) = -\det(a_1, \ldots, a_{j+1}, a_j, \ldots, a_n),
\det(I) = 1,
where b and c are scalars, v is any vector of size n and I is the identity matrix of size n. These properties state that the
determinant is an alternating multilinear function of the columns, and they suffice to uniquely calculate the
determinant of any square matrix. Provided the underlying scalars form a field (more generally, a commutative ring
with unity), the definition below shows that such a function exists, and it can be shown to be unique.[2]
Equivalently, the determinant can be expressed as a sum of products of entries of the matrix where each product has
n terms and the coefficient of each product is -1 or 1 or 0 according to a given rule: it is a polynomial expression of
the matrix entries. This expression grows rapidly with the size of the matrix (an n-by-n matrix contributes n! terms),
so it will first be given explicitly for the case of 2-by-2 matrices and 3-by-3 matrices, followed by the rule for
arbitrary size matrices, which subsumes these two cases.
Assume A is a square matrix with n rows and n columns, so that it can be written as

The entries can be numbers or expressions (as happens when the determinant is used to define a characteristic
polynomial); the definition of the determinant depends only on the fact that they can be added and multiplied
together in a commutative manner.
The determinant of A is denoted as det(A), or it can be denoted directly in terms of the matrix entries by writing
enclosing bars instead of brackets:

2-by-2 matrices
The determinant of a 2×2 matrix is defined by

\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc.
If the matrix entries are real numbers, the matrix A can be used to represent two linear mappings: one that maps the standard basis vectors to the rows of A, and one that maps them to the columns of A. In either case, the images of the basis vectors form a parallelogram that represents the image of the unit square under the mapping. The parallelogram defined by the rows of the above matrix is the one with vertices at (0,0), (a,b), (a + c, b + d), and (c,d), as shown in the accompanying diagram. The absolute value of ad − bc is the area of the parallelogram, and thus represents the scale factor by which areas are transformed by A. (The parallelogram formed by the columns of A is in general a different parallelogram, but since the determinant is symmetric with respect to rows and columns, the area will be the same.)

The area of the parallelogram is the absolute value of the determinant of the matrix formed by the vectors representing the parallelogram's sides.

The absolute value of the determinant together with the sign becomes the oriented area of the parallelogram. The
oriented area is the same as the usual area, except that it is negative when the angle from the first to the second
vector defining the parallelogram turns in a clockwise direction (which is opposite to the direction one would get for
the identity matrix).
Thus the determinant gives the scaling factor and the orientation induced by the mapping represented by A. When the
determinant is equal to one, the linear mapping defined by the matrix is equi-areal and orientation-preserving.
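As a concrete check of the 2×2 formula and its geometric reading, here is a small sketch (the function name and the matrix [[3, 1], [1, 2]] are illustrative, not from the text):

```python
def det2(a, b, c, d):
    """Determinant of the 2x2 matrix [[a, b], [c, d]]."""
    return a * d - b * c

# The unit square is mapped by [[3, 1], [1, 2]] to a parallelogram of
# area |det| = 5; a positive determinant means orientation is preserved.
d = det2(3, 1, 1, 2)
print(abs(d), d > 0)  # 5 True

# Swapping the two columns flips the sign, i.e. reverses orientation:
print(det2(1, 3, 2, 1))  # 1*1 - 3*2 = -5
```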

3-by-3 matrices
The determinant of a 3×3 matrix is defined by

\det\begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix} = aei + bfg + cdh - ceg - bdi - afh.

The volume of this parallelepiped is the absolute value of the determinant of the
matrix formed by the rows r1, r2, and r3.

The rule of Sarrus is a mnemonic for this formula: the sum of


the products of three diagonal north-west to south-east lines
of matrix elements, minus the sum of the products of three
diagonal south-west to north-east lines of elements, when the
copies of the first two columns of the matrix are written
beside it as in the illustration.

The determinant of a 3×3 matrix can be calculated by its diagonals.

For example, the determinant of the matrix in the accompanying illustration is calculated using this rule.
This scheme for calculating the determinant of a 3×3 matrix does not carry over into higher dimensions. However, an extension of the Sarrus rule to 4×4 matrices has recently been developed; it requires lining up 14 columns next to each other. Note that since the determinant of a 4×4 matrix includes 24 terms, at least 12 columns are needed to use this method.[3]
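The rule of Sarrus translates directly into code. A minimal sketch (the function name and example matrix are illustrative) sums the three north-west to south-east diagonal products and subtracts the three south-west to north-east ones:

```python
def sarrus(m):
    """Rule of Sarrus for a 3x3 matrix: the three NW-SE diagonal
    products minus the three SW-NE diagonal products."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return (a * e * i + b * f * g + c * d * h) \
         - (c * e * g + a * f * h + b * d * i)

m = [[2, 0, 1],
     [1, 3, 0],
     [0, 1, 4]]
print(sarrus(m))  # 2*3*4 + 0 + 1*1*1 - 0 - 0 - 0 = 25
```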

n-by-n matrices
The determinant of a matrix of arbitrary size can be defined by the Leibniz formula or the Laplace formula.
The Leibniz formula for the determinant of an n-by-n matrix A is

\det(A) = \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \prod_{i=1}^{n} a_{i,\sigma_i}.
Here the sum is computed over all permutations σ of the set {1, 2, ..., n}. A permutation is a function that reorders
this set of integers. The value in the i-th position after the reordering σ is denoted σi. For example, for n = 3, the
original sequence 1, 2, 3 might be reordered to σ = [2, 3, 1], with σ1 = 2, σ2 = 3, and σ3 = 1. The set of all such
permutations (also known as the symmetric group on n elements) is denoted Sn. For each permutation σ, sgn(σ)
denotes the signature of σ; it is +1 for even σ and −1 for odd σ. Evenness or oddness can be defined as follows: the
permutation is even (odd) if the new sequence can be obtained by an even number (odd, respectively) of switches of
numbers. For example, starting from [1, 2, 3] (and starting with the convention that the signature sgn([1,2,3]) = +1)
and switching the positions of 2 and 3 yields [1, 3, 2], with sgn([1,3,2]) = –1. Switching once more yields [3, 1, 2],
with sgn([3,1,2]) = +1 again. Finally, after a total of three switches (an odd number), the resulting permutation is [3,
2, 1], with sgn([3,2,1]) = –1. Therefore [3, 2, 1] is an odd permutation. Similarly, the permutation [2, 3, 1] is even:
[1, 2, 3] → [2, 1, 3] → [2, 3, 1], with an even number of switches.
A permutation cannot be simultaneously even and odd, but sometimes it is convenient to accept non-permutations:
sequences with repeated or skipped numbers, like [1, 2, 1]. In that case, the signature of any non-permutation is zero:
sgn([1,2,1]) = 0.

In any of the summands, the term

\prod_{i=1}^{n} a_{i,\sigma_i}

is notation for the product of the entries at positions (i, σi), where i ranges from 1 to n:

a_{1,\sigma_1} \cdot a_{2,\sigma_2} \cdots a_{n,\sigma_n}.

For example, the determinant of a 3 by 3 matrix A (n = 3) is

\det(A) = a_{1,1}a_{2,2}a_{3,3} - a_{1,1}a_{2,3}a_{3,2} - a_{1,2}a_{2,1}a_{3,3} + a_{1,2}a_{2,3}a_{3,1} + a_{1,3}a_{2,1}a_{3,2} - a_{1,3}a_{2,2}a_{3,1}.
This agrees with the rule of Sarrus given in the previous section.
The formal extension to arbitrary dimensions was made by Tullio Levi-Civita using a pseudo-tensor symbol (see Levi-Civita symbol).
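The Leibniz formula can be implemented literally for small matrices, summing over all n! permutations with their signatures. A minimal sketch (names and example values are illustrative):

```python
from itertools import permutations

def signature(sigma):
    """sgn(sigma): +1 if the permutation has an even number of
    inversions, -1 if odd."""
    inversions = sum(1 for i in range(len(sigma))
                     for j in range(i + 1, len(sigma))
                     if sigma[i] > sigma[j])
    return -1 if inversions % 2 else 1

def det_leibniz(a):
    """det(A) = sum over sigma in S_n of sgn(sigma) * prod_i a[i][sigma[i]]
    (0-based indices here, versus 1-based in the text)."""
    n = len(a)
    total = 0
    for sigma in permutations(range(n)):
        product = 1
        for i in range(n):
            product *= a[i][sigma[i]]
        total += signature(sigma) * product
    return total

# 3! = 6 permutations contribute for a 3x3 matrix:
print(det_leibniz([[2, 0, 1], [1, 3, 0], [0, 1, 4]]))  # 25
print(signature((2, 0, 1)))  # the text's permutation [3, 1, 2] is even: +1
```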

Levi-Civita symbol
The determinant for an n-by-n matrix can be expressed in terms of the totally antisymmetric Levi-Civita symbol as follows:

\det(A) = \varepsilon_{i_1 i_2 \cdots i_n} a_{1,i_1} a_{2,i_2} \cdots a_{n,i_n},

with implicit summation over the repeated indices i_1, ..., i_n.
Properties of the determinant


The determinant has many properties. Some basic properties of determinants are:
1. \det(I_n) = 1, where I_n is the n×n identity matrix.
2. \det(A^T) = \det(A).
3. \det(A^{-1}) = 1/\det(A).
4. For conformable square matrices A and B, \det(AB) = \det(A)\det(B).
5. \det(cA) = c^n \det(A) for an n×n matrix A and a scalar c.
6. If A is a triangular matrix, i.e. a_{i,j} = 0 whenever i > j or, alternatively, whenever i < j, then its determinant equals the product of the diagonal entries:

\det(A) = a_{1,1} a_{2,2} \cdots a_{n,n}.
This can be deduced from some of the properties below, but it follows most easily directly from the Leibniz formula
(or from the Laplace expansion), in which the identity permutation is the only one that gives a non-zero contribution.
A number of additional properties relate to the effects on the determinant of changing particular rows or columns:

7. Viewing an n×n matrix as being composed of n columns, the determinant is an n-linear function. This means that if one column of a matrix A is written as a sum v + w of two column vectors, and all other columns are left unchanged, then the determinant of A is the sum of the determinants of the matrices obtained from A by replacing the column by v and by w respectively (and a similar relation holds when writing a column as a scalar multiple of a column vector).
8. This n-linear function is an alternating form. This means that whenever two columns of a matrix are identical, or more generally some column can be expressed as a linear combination of the other columns (i.e. the columns of the matrix form a linearly dependent set), its determinant is 0.
Properties 1, 7 and 8 — which all follow from the Leibniz formula — completely characterize the determinant; in
other words the determinant is the unique function from n×n matrices to scalars that is n-linear alternating in the
columns, and takes the value 1 for the identity matrix (this characterization holds even if scalars are taken in any
given commutative ring). To see this it suffices to expand the determinant by multi-linearity in the columns into a
(huge) linear combination of determinants of matrices in which each column is a standard basis vector. These
determinants are either 0 (by property 8) or else ±1 (by properties 1 and 11 below), so the linear combination gives
the expression above in terms of the Levi-Civita symbol. While less technical in appearance, this characterization
cannot entirely replace the Leibniz formula in defining the determinant, since without it the existence of an
appropriate function is not clear. For matrices over non-commutative rings, properties 7 and 8 are incompatible for n
≥ 2,[4] so there is no good definition of the determinant in this setting.
Property 2 above implies that properties for columns have their counterparts in terms of rows:

9. Viewing an n×n matrix as being composed of n rows, the determinant is an n-linear function.
10. This n-linear function is an alternating form: whenever two rows of a matrix are identical, its determinant is 0.
11. Interchanging two columns of a matrix multiplies its determinant by −1. This follows from properties 7 and 8 (it is a general property of multilinear alternating maps). Iterating gives that more generally a permutation of the columns multiplies the determinant by the sign of the permutation. Similarly a permutation of the rows multiplies the determinant by the sign of the permutation.
12. Adding a scalar multiple of one column to another column does not change the value of the determinant. This is a consequence of properties 7 and 8: by property 7 the determinant changes by a multiple of the determinant of a matrix with two equal columns, which determinant is 0 by property 8. Similarly, adding a scalar multiple of one row to another row leaves the determinant unchanged.
These properties can be used to facilitate the computation of determinants by simplifying the matrix to the point
where the determinant can be determined immediately. Specifically, for matrices with coefficients in a field,
properties 11 and 12 can be used to transform any matrix into a triangular matrix, whose determinant is given by
property 6; this is essentially the method of Gaussian elimination.

For example, the determinant of

A = \begin{pmatrix} -2 & 2 & -3 \\ -1 & 1 & 3 \\ 2 & 0 & -1 \end{pmatrix}

can be computed using the following matrices:

B = \begin{pmatrix} -2 & 2 & -3 \\ 0 & 0 & 4.5 \\ 2 & 0 & -1 \end{pmatrix}, \quad
C = \begin{pmatrix} -2 & 2 & -3 \\ 0 & 0 & 4.5 \\ 0 & 2 & -4 \end{pmatrix}, \quad
D = \begin{pmatrix} -2 & 2 & -3 \\ 0 & 2 & -4 \\ 0 & 0 & 4.5 \end{pmatrix}.
Here, B is obtained from A by adding −1/2 × the first row to the second, so that det(A) = det(B). C is obtained from B
by adding the first to the third row, so that det(C) = det(B). Finally, D is obtained from C by exchanging the second
and third row, so that det(D) = −det(C). The determinant of the (upper) triangular matrix D is the product of its
entries on the main diagonal: (−2) · 2 · 4.5 = −18. Therefore det(A) = −det(D) = +18.
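The elimination procedure just described is easy to mechanize. The sketch below (illustrative names; exact rational arithmetic via Python's fractions) tracks the sign flips caused by row swaps and multiplies the diagonal of the resulting triangular matrix; the example matrix is one whose elimination matches the stated row operations and the result det(A) = 18:

```python
from fractions import Fraction

def det_gauss(matrix):
    """Determinant by Gaussian elimination: adding a multiple of one row
    to another (property 12) leaves the determinant unchanged, each row
    swap (property 11) flips its sign, and the determinant of the final
    triangular matrix is the product of its diagonal (property 6).
    Fraction arithmetic keeps the computation exact."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)              # no pivot: singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                      # a swap multiplies det by -1
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
        det *= a[col][col]
    return det

# Elimination of this matrix produces the diagonal (-2, 2, 4.5) after
# one row swap, giving det = -((-2) * 2 * 4.5) = 18:
print(det_gauss([[-2, 2, -3], [-1, 1, 3], [2, 0, -1]]))  # 18
```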

Multiplicativity and matrix groups


The determinant of a matrix product of square matrices equals the product of their determinants:

\det(AB) = \det(A)\det(B).
Thus the determinant is a multiplicative map. This property is a consequence of the characterization given above of
the determinant as the unique n-linear alternating function of the columns with value 1 on the identity matrix, since
the function Mn(K) → K that maps M ↦ det(AM) can easily be seen to be n-linear and alternating in the columns of
M, and takes the value det(A) at the identity. The formula can be generalized to (square) products of rectangular
matrices, giving the Cauchy-Binet formula, which also provides an independent proof of the multiplicative property.
The determinant det(A) of a matrix A is non-zero if and only if A is invertible or, yet another equivalent statement, if
its rank equals the size of the matrix. If so, the determinant of the inverse matrix is given by

\det(A^{-1}) = \frac{1}{\det(A)}.
In particular, products and inverses of matrices with determinant one still have this property. Thus, the set of such
matrices (of fixed size n) form a group known as the special linear group. More generally, the word "special"
indicates the subgroup of another matrix group of matrices of determinant one. Examples include the special
orthogonal group (which if n is 2 or 3 consists of all rotation matrices), and the special unitary group.
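The multiplicative property det(AB) = det(A)·det(B) can be spot-checked on small integer matrices. A sketch with illustrative values (the determinant here is the Leibniz sum):

```python
from itertools import permutations

def det(a):
    """Leibniz-formula determinant, exact for integer matrices."""
    n = len(a)
    total = 0
    for p in permutations(range(n)):
        inv = sum(p[i] > p[j] for i in range(n) for j in range(i + 1, n))
        term = 1
        for i in range(n):
            term *= a[i][p[i]]
        total += (-1) ** inv * term
    return total

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[1, 2], [3, 4]]   # det(A) = -2
B = [[0, 1], [5, 6]]   # det(B) = -5
print(det(matmul(A, B)), det(A) * det(B))  # 10 10
```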

Laplace's formula and the adjugate matrix


Laplace's formula expresses the determinant of a matrix in terms of its minors. The minor M_{i,j} is defined to be the determinant of the (n−1)×(n−1) matrix that results from A by removing the i-th row and the j-th column. The expression (−1)^{i+j} M_{i,j} is known as a cofactor. The determinant of A is given by

\det(A) = \sum_{j=1}^{n} (-1)^{i+j} a_{i,j} M_{i,j}.
Calculating det(A) by means of that formula is referred to as expanding the determinant along a row or column. For the example 3-by-3 matrix, Laplace expansion along the second column (j = 2, the sum runs over i) yields the sum of the three cofactor terms (−1)^{i+2} a_{i,2} M_{i,2}.
However, Laplace expansion is efficient for small matrices only.


The adjugate matrix adj(A) is the transpose of the matrix consisting of the cofactors, i.e.,

\operatorname{adj}(A)_{i,j} = (-1)^{i+j} M_{j,i}.
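Laplace expansion and the adjugate can be sketched recursively (illustrative names; practical only for small n, as noted above):

```python
def minor(a, i, j):
    """Submatrix of a with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]

def det_laplace(a):
    """Laplace expansion along the first row:
    det(A) = sum_j (-1)^j * a[0][j] * det(minor(A, 0, j))."""
    if len(a) == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j] * det_laplace(minor(a, 0, j))
               for j in range(len(a)))

def adjugate(a):
    """Transpose of the cofactor matrix:
    adj(A)[i][j] = (-1)^(i+j) * M_{j,i}."""
    n = len(a)
    return [[(-1) ** (i + j) * det_laplace(minor(a, j, i)) for j in range(n)]
            for i in range(n)]

print(det_laplace([[2, 0, 1], [1, 3, 0], [0, 1, 4]]))  # 25
print(adjugate([[1, 2], [3, 4]]))                      # [[4, -2], [-3, 1]]
```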

Sylvester's determinant theorem


Sylvester's determinant theorem states that for A, an m-by-n matrix, and B, an n-by-m matrix (so that A and B have dimensions allowing them to be multiplied in either order):

\det(I_m + AB) = \det(I_n + BA),

where I_m and I_n are the m-by-m and n-by-n identity matrices, respectively.
From this general result several consequences follow.
(a) For the case of column vector c and row vector r, each with m components, the formula allows quick calculation of the determinant of a matrix that differs from the identity matrix by a matrix of rank 1:

\det(I_m + cr) = 1 + rc.[5]

(b) More generally, for any invertible m-by-m matrix X,

\det(X + AB) = \det(X)\det(I_n + BX^{-1}A).

(c) For a column and row vector as above, \det(X + cr) = \det(X)(1 + rX^{-1}c).
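Sylvester's theorem is easy to verify on a small example (illustrative 2×3 and 3×2 matrices; the determinant is the Leibniz sum):

```python
from itertools import permutations

def det(a):
    """Leibniz-formula determinant, exact for integer matrices."""
    n = len(a)
    total = 0
    for p in permutations(range(n)):
        inv = sum(p[i] > p[j] for i in range(n) for j in range(i + 1, n))
        term = 1
        for i in range(n):
            term *= a[i][p[i]]
        total += (-1) ** inv * term
    return total

def matmul(a, b):
    """Product of an (r x k) and a (k x c) matrix as nested lists."""
    r, k, c = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(c)]
            for i in range(r)]

def plus_identity(m):
    """I + M for a square matrix M."""
    return [[m[i][j] + (i == j) for j in range(len(m))] for i in range(len(m))]

# A is 2x3 and B is 3x2, so AB is 2x2 while BA is 3x3; nevertheless
# det(I_2 + AB) = det(I_3 + BA).
A = [[1, 2, 0], [0, 1, 3]]
B = [[1, 0], [2, 1], [0, 1]]
lhs = det(plus_identity(matmul(A, B)))
rhs = det(plus_identity(matmul(B, A)))
print(lhs, rhs)  # 26 26
```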

Properties of the determinant in relation to other notions

Relation to eigenvalues and trace


Determinants can be used to find the eigenvalues of the matrix A: they are the solutions of the characteristic equation

\det(A - \lambda I) = 0,
where I is the identity matrix of the same dimension as A. Conversely, det(A) is the product of the eigenvalues of A,
counted with their algebraic multiplicities. The product of all non-zero eigenvalues is referred to as
pseudo-determinant.
A Hermitian matrix is positive definite if all its eigenvalues are positive. Sylvester's criterion asserts that this is equivalent to the determinants of the submatrices

A_k := (a_{i,j})_{1 \le i, j \le k}

being positive, for all k between 1 and n.


The trace tr(A) is by definition the sum of the diagonal entries of A and also equals the sum of the eigenvalues. Thus, for complex matrices A,

\det(\exp(A)) = \exp(\operatorname{tr}(A))

or, for real matrices A,

\operatorname{tr}(A) = \log(\det(\exp(A))).

Here exp(A) denotes the matrix exponential of A, because every eigenvalue λ of A corresponds to the eigenvalue exp(λ) of exp(A). In particular, given any logarithm of A, that is, any matrix L satisfying

\exp(L) = A,

the determinant of A is given by

\det(A) = \exp(\operatorname{tr}(L)).

For example, for n = 2 and n = 3, respectively,

\det(A) = \tfrac{1}{2}\left((\operatorname{tr}A)^2 - \operatorname{tr}(A^2)\right),
\det(A) = \tfrac{1}{6}\left((\operatorname{tr}A)^3 - 3\operatorname{tr}A \operatorname{tr}(A^2) + 2\operatorname{tr}(A^3)\right).


These formulae are closely related to Newton's identities.


A generalization of the above identities can be obtained from the following Taylor series expansion of the
determinant:

where I is the identity matrix.

Cramer's rule
For a matrix equation

Ax = b,

the solution is given by Cramer's rule:

x_i = \frac{\det(A_i)}{\det(A)}, \quad i = 1, 2, \ldots, n,

where A_i is the matrix formed by replacing the i-th column of A by the column vector b. This follows immediately by column expansion of the determinant, i.e.

\det(A_i) = \det(a_1, \ldots, b, \ldots, a_n) = \sum_{j=1}^{n} x_j \det(a_1, \ldots, a_j, \ldots, a_n) = x_i \det(A),

where the vectors a_j are the columns of A. The rule is also implied by the identity

A \operatorname{adj}(A) = \operatorname{adj}(A) A = \det(A) I_n.
It has recently been shown that Cramer's rule can be implemented in O(n^3) time,[6] which is comparable to more common methods of solving systems of linear equations, such as LU, QR, or singular value decomposition.
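Cramer's rule as stated can be sketched for a 3×3 system with exact rational arithmetic (illustrative names and values):

```python
from fractions import Fraction

def det3(m):
    """Determinant of a 3x3 matrix, written out explicitly."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def cramer(A, b):
    """Solve the 3x3 system Ax = b via x_i = det(A_i) / det(A),
    where A_i replaces the i-th column of A by b."""
    d = Fraction(det3(A))
    solution = []
    for i in range(3):
        A_i = [row[:i] + [entry] + row[i + 1:] for row, entry in zip(A, b)]
        solution.append(Fraction(det3(A_i)) / d)
    return solution

A = [[2, 1, 1], [1, 3, 2], [1, 0, 0]]
b = [4, 5, 6]
x = cramer(A, b)
print(x)  # solution is (6, 15, -23)
```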

Block matrices
Suppose A, B, C, and D are n×n-, n×m-, m×n-, and m×m-matrices, respectively. Then

\det\begin{pmatrix} A & 0 \\ C & D \end{pmatrix} = \det(A)\det(D) = \det\begin{pmatrix} A & B \\ 0 & D \end{pmatrix}.

This can be seen from the Leibniz formula or by induction on n. When A is invertible, employing the following identity

\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \begin{pmatrix} A & 0 \\ C & I_m \end{pmatrix} \begin{pmatrix} I_n & A^{-1}B \\ 0 & D - CA^{-1}B \end{pmatrix}

leads to

\det\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det(A)\det(D - CA^{-1}B).

When D is invertible, a similar identity with \det(D) factored out can be derived analogously,[7] that is,

\det\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det(D)\det(A - BD^{-1}C).

When the blocks are square matrices of the same order further formulas hold. For example, if C and D commute (i.e., CD = DC), then the following formula comparable to the determinant of a 2-by-2 matrix holds:[8]

\det\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det(AD - BC).
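The block-triangular case det [[A, B], [0, D]] = det(A)·det(D) can be checked directly (illustrative 2×2 blocks; the determinant is the Leibniz sum):

```python
from itertools import permutations

def det(a):
    """Leibniz-formula determinant, exact for integer matrices."""
    n = len(a)
    total = 0
    for p in permutations(range(n)):
        inv = sum(p[i] > p[j] for i in range(n) for j in range(i + 1, n))
        term = 1
        for i in range(n):
            term *= a[i][p[i]]
        total += (-1) ** inv * term
    return total

def block(A, B, C, D):
    """Assemble the block matrix [[A, B], [C, D]] from four n x n blocks."""
    n = len(A)
    top = [A[i] + B[i] for i in range(n)]
    bottom = [C[i] + D[i] for i in range(n)]
    return top + bottom

A = [[1, 2], [0, 1]]
B = [[3, 0], [1, 1]]
D = [[2, 1], [1, 1]]
Z = [[0, 0], [0, 0]]

# Upper block-triangular: det [[A, B], [0, D]] = det(A) * det(D)
print(det(block(A, B, Z, D)), det(A) * det(D))  # 1 1
```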

Derivative
By definition, e.g., using the Leibniz formula, the determinant of real (or analogously for complex) square matrices is a polynomial function from R^{n×n} to R. As such it is everywhere differentiable. Its derivative can be expressed using Jacobi's formula:

\frac{d}{dt}\det(A(t)) = \operatorname{tr}\left(\operatorname{adj}(A(t)) \frac{dA(t)}{dt}\right),

where adj(A) denotes the adjugate of A. In particular, if A is invertible, we have

\frac{d}{dt}\det(A(t)) = \det(A(t)) \operatorname{tr}\left(A(t)^{-1} \frac{dA(t)}{dt}\right).

Expressed in terms of the entries of A, these are

\frac{\partial \det(A)}{\partial A_{ij}} = \operatorname{adj}(A)_{ji} = \det(A)(A^{-1})_{ji}.

Yet another equivalent formulation is

\det(A + \varepsilon X) - \det(A) = \operatorname{tr}(\operatorname{adj}(A) X)\varepsilon + O(\varepsilon^2),

using big O notation. The special case where A = I, the identity matrix, yields

\det(I + \varepsilon X) = 1 + \operatorname{tr}(X)\varepsilon + O(\varepsilon^2).

This identity is used in describing the tangent space of certain matrix Lie groups.

If the matrix A is written as A = [a  b  c] where a, b, c are column vectors of length 3, then the gradient over one of the three vectors may be written as the cross product of the other two:

\nabla_a \det(A) = b \times c, \quad \nabla_b \det(A) = c \times a, \quad \nabla_c \det(A) = a \times b.
Abstract algebraic aspects

Determinant of an endomorphism
The above identities concerning determinants of products and inverses of matrices imply that similar matrices have the same determinant: two matrices A and B are similar if there exists an invertible matrix X such that A = X^{−1}BX. Indeed, repeatedly applying the above identities yields

\det(A) = \det(X)^{-1} \det(B) \det(X) = \det(B).
The determinant is therefore also called a similarity invariant. The determinant of a linear transformation

for some finite dimensional vector space V is defined to be the determinant of the matrix describing it, with respect
to an arbitrary choice of basis in V. By the similarity invariance, this determinant is independent of the choice of the
basis for V and therefore only depends on the endomorphism T.

Exterior algebra
The determinant can also be characterized as the unique function

D : M_n(K) \to K

from the set of all n-by-n matrices with entries in a field K to this field satisfying the following three properties: first, D is an n-linear function: considering all but one column of A fixed, the determinant is linear in the remaining column, that is

D(v_1, \ldots, a v_i + b w, \ldots, v_n) = a D(v_1, \ldots, v_i, \ldots, v_n) + b D(v_1, \ldots, w, \ldots, v_n)

for any column vectors v_1, ..., v_n, and w and any scalars (elements of K) a and b. Second, D is an alternating function: for any matrix A with two identical columns, D(A) = 0. Finally, D(I_n) = 1, where I_n is the identity matrix.
This fact also implies that every other n-linear alternating function F : M_n(K) → K satisfies

F(M) = F(I) D(M).

The last part in fact follows from the preceding statement: one easily sees that if F is nonzero it satisfies F(I) ≠ 0, and the function that associates F(M)/F(I) to M satisfies all conditions of the theorem. The importance of stating this part is mainly that it remains valid[9] if K is any commutative ring rather than a field, in which case the given argument does not apply.
The determinant of a linear transformation A : V → V of an n-dimensional vector space V can be formulated in a coordinate-free manner by considering the n-th exterior power Λ^n V of V. A induces a linear map

\Lambda^n A : \Lambda^n V \to \Lambda^n V, \quad v_1 \wedge v_2 \wedge \cdots \wedge v_n \mapsto A v_1 \wedge A v_2 \wedge \cdots \wedge A v_n.

As Λ^n V is one-dimensional, the map Λ^n A is given by multiplying with some scalar. This scalar coincides with the determinant of A, that is to say

(\Lambda^n A)(v_1 \wedge \cdots \wedge v_n) = \det(A) \cdot v_1 \wedge \cdots \wedge v_n.
This definition agrees with the more concrete coordinate-dependent definition. This follows from the
characterization of the determinant given above. For example, switching two columns changes the parity of the
determinant; likewise, permuting the vectors in the exterior product v1 ∧ v2 ∧ ... ∧ vn to v2 ∧ v1 ∧ v3 ∧ ... ∧ vn, say,
also alters the parity.
For this reason, the highest non-zero exterior power Λn(V) is sometimes also called the determinant of V and
similarly for more involved objects such as vector bundles or chain complexes of vector spaces. Minors of a matrix
can also be cast in this setting, by considering lower alternating forms ΛkV with k < n.

Square matrices over commutative rings and abstract properties


The determinant of a matrix can be defined, for example using the Leibniz formula, for matrices with entries in any
commutative ring. Briefly, a ring is a structure where addition, subtraction, and multiplication are defined. The
commutativity requirement means that the product does not depend on the order of the two factors, i.e.,

r \cdot s = s \cdot r

is supposed to hold for all elements r and s of the ring. For example, the integers form a commutative ring.
Many of the above statements and notions carry over mutatis mutandis to determinants of these more general
matrices: the determinant is multiplicative in this more general situation, and Cramer's rule also holds. A square
matrix over a commutative ring R is invertible if and only if its determinant is a unit in R, that is, an element having a
(multiplicative) inverse. (If R is a field, this latter condition is equivalent to the determinant being nonzero, thus
giving back the above characterization.) For example, a matrix A with entries in Z, the integers, is invertible (in the
sense that the inverse matrix has again integer entries) if the determinant is +1 or −1. Such a matrix is called
unimodular.

The determinant defines a mapping

\det : \mathrm{GL}_n(R) \to R^{\times}
between the group of invertible n×n matrices with entries in R and the multiplicative group of units in R. Since it
respects the multiplication in both groups, this map is a group homomorphism. Secondly, given a ring
homomorphism f: R → S, there is a map GLn(R) → GLn(S) given by replacing all entries in R by their images under
f. The determinant respects these maps, i.e., given a matrix A = (a_{i,j}) with entries in R, the identity

f(\det((a_{i,j}))) = \det((f(a_{i,j})))
holds. For example, the determinant of the complex conjugate of a complex matrix (which is also the determinant of
its conjugate transpose) is the complex conjugate of its determinant, and for integer matrices: the reduction
modulo m of the determinant of such a matrix is equal to the determinant of the matrix reduced modulo m (the latter
determinant being computed using modular arithmetic). In the more high-brow parlance of category theory, the
determinant is a natural transformation between the two functors GLn and (⋅)×.[10] Adding yet another layer of
abstraction, this is captured by saying that the determinant is a morphism of algebraic groups, from the general linear group to the multiplicative group,

\det : \mathrm{GL}_n \to \mathbb{G}_m.
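The compatibility f(det(A)) = det(f(A)) can be illustrated with reduction modulo m, the ring homomorphism from Z to Z/mZ (values below are illustrative):

```python
from itertools import permutations

def det(a):
    """Leibniz-formula determinant, exact integer arithmetic."""
    n = len(a)
    total = 0
    for p in permutations(range(n)):
        inv = sum(p[i] > p[j] for i in range(n) for j in range(i + 1, n))
        term = 1
        for i in range(n):
            term *= a[i][p[i]]
        total += (-1) ** inv * term
    return total

A = [[7, 3], [4, 9]]   # det(A) = 63 - 12 = 51
m = 5
A_mod = [[x % m for x in row] for row in A]   # entries reduced modulo 5

# Reducing the matrix first and then taking the determinant agrees with
# taking the determinant over Z and reducing the result:
print(det(A) % m, det(A_mod) % m)  # 1 1
```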

Generalizations and related notions

Infinite matrices
For matrices with an infinite number of rows and columns, the above definitions of the determinant do not carry over
directly. For example, in Leibniz' formula, an infinite sum (all of whose terms are infinite products) would have to be
calculated. Functional analysis provides different extensions of the determinant for such infinite-dimensional
situations, which however only work for particular kinds of operators.
The Fredholm determinant defines the determinant for operators known as trace class operators by an appropriate generalization of the formula

\det(I + A) = \exp(\operatorname{tr}(\log(I + A))).
Another infinite-dimensional notion of determinant is the functional determinant.

Notions of determinant over non-commutative rings


For square matrices with entries in a non-commutative ring, there are various difficulties in defining determinants in
a manner analogous to that for commutative rings. A meaning can be given to the Leibniz formula provided the
order for the product is specified, and similarly for other ways to define the determinant, but non-commutativity then
leads to the loss of many fundamental properties of the determinant, for instance the multiplicative property or the
fact that the determinant is unchanged under transposition of the matrix. Over non-commutative rings, there is no
reasonable notion of a multilinear form (if a bilinear form exists with a regular element of R as value on some pair of
arguments, it can be used to show that all elements of R commute). Nevertheless various notions of
non-commutative determinant have been formulated, which preserve some of the properties of determinants, notably
quasideterminants and the Dieudonné determinant. For certain specific classes of matrices with non-commutative elements, the determinant can be defined and linear algebra theorems very similar to their commutative analogs can be proved. Examples include quantum groups and the q-determinant; the Capelli matrix and Capelli determinant; super-matrices and the Berezinian; and Manin matrices, the class of matrices closest to matrices with commutative elements.

Further variants
Determinants of matrices in superrings (that is, Z/2-graded rings) are known as Berezinians or superdeterminants.[11]
The permanent of a matrix is defined as the determinant, except that the factors sgn(σ) occurring in Leibniz' rule are
omitted. The immanant generalizes both by introducing a character of the symmetric group Sn in Leibniz' rule.

Calculation
Determinants are mainly used as a theoretical tool. They are rarely calculated explicitly in numerical linear algebra,
where for applications like checking invertibility and finding eigenvalues the determinant has largely been
supplanted by other techniques.[12] Nonetheless, explicitly calculating determinants is required in some situations,
and different methods are available to do so.
Naive methods of implementing an algorithm to compute the determinant include using Leibniz' formula or
Laplace's formula. Both these approaches are extremely inefficient for large matrices, though, since the number of
required operations grows very quickly: it is of order n! (n factorial) for an n×n matrix M. For example, Leibniz' formula requires calculating n! products. Therefore, more involved techniques have been developed for calculating
determinants.

Decomposition methods
Given a matrix A, some methods compute its determinant by writing A as a product of matrices whose determinants
can be more easily computed. Such techniques are referred to as decomposition methods. Examples include the LU
decomposition, the QR decomposition or the Cholesky decomposition (for positive definite matrices). These methods are of order O(n^3), which is a significant improvement over O(n!).
The LU decomposition expresses A in terms of a lower triangular matrix L, an upper triangular matrix U and a permutation matrix P:

A = PLU.

The determinants of L and U can be quickly calculated, since they are the products of the respective diagonal entries. The determinant of P is just the sign of the corresponding permutation (+1 for an even number of transpositions and −1 for an odd number). The determinant of A is then

\det(A) = \det(P) \cdot \det(L) \cdot \det(U).

Moreover, the decomposition can be chosen such that L is a unitriangular matrix and therefore has determinant 1, in which case the formula further simplifies to

\det(A) = \det(P) \cdot \det(U).

Further methods
If the determinant of A and the inverse of A have already been computed, the matrix determinant lemma allows one to quickly calculate the determinant of A + uv^T, where u and v are column vectors.
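The matrix determinant lemma, det(A + uv^T) = (1 + v^T A^{-1} u) det(A), can be checked against direct computation on a 2×2 example (illustrative values; exact arithmetic via fractions):

```python
from fractions import Fraction

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    """Inverse of a 2x2 matrix via its adjugate divided by the determinant."""
    d = Fraction(det2(m))
    return [[m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d, m[0][0] / d]]

A = [[4, 1], [2, 3]]   # det(A) = 10
u = [1, 2]
v = [3, 1]

# Direct computation of det(A + u v^T):
A_rank1 = [[A[i][j] + u[i] * v[j] for j in range(2)] for i in range(2)]
direct = det2(A_rank1)

# Matrix determinant lemma: det(A + u v^T) = (1 + v^T A^{-1} u) det(A)
A_inv = inv2(A)
vAu = sum(v[i] * A_inv[i][j] * u[j] for i in range(2) for j in range(2))
lemma = (1 + vAu) * det2(A)
print(direct, lemma)  # 19 19
```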
Since the definition of the determinant does not need divisions, a question arises: do fast algorithms exist that do not
need divisions? This is especially interesting for matrices over rings. Indeed, algorithms with run time proportional to n^4 exist. An algorithm of Mahajan and Vinay, and Berkowitz,[13] is based on closed ordered walks ("clows" for short). It
computes more products than the determinant definition requires, but some of these products cancel and the sum of
these products can be computed more efficiently. The final algorithm looks very much like an iterated product of
triangular matrices.
If two matrices of order n can be multiplied in time M(n), where M(n) ≥ n^a for some a > 2, then the determinant can be computed in time O(M(n)).[14] This means, for example, that an O(n^2.376) algorithm exists based on the Coppersmith–Winograd algorithm.

Algorithms can also be assessed according to their bit complexity, i.e., how many bits of accuracy are needed to store intermediate values occurring in the computation. For example, the Gaussian elimination (or LU decomposition) method is of order O(n^3), but the bit length of intermediate values can become exponentially long.[15] The Bareiss algorithm, on the other hand, is an exact-division method based on Sylvester's identity; it is also of order n^3, but its bit complexity is roughly the bit size of the original entries in the matrix times n.[16]

History
Historically, determinants were considered without reference to matrices: originally, a determinant was defined as a
property of a system of linear equations. The determinant "determines" whether the system has a unique solution
(which occurs precisely if the determinant is non-zero). In this sense, determinants were first used in the Chinese mathematics textbook The Nine Chapters on the Mathematical Art (九章算術, by Chinese scholars, around the 3rd century BC). In Europe, two-by-two determinants were considered by Cardano at the end of the 16th century and larger ones by Leibniz.[17][18][19][20]
In Europe, Cramer (1750) added to the theory, treating the subject in relation to sets of equations. The recurrence law
was first announced by Bézout (1764).
It was Vandermonde (1771) who first recognized determinants as independent functions.[17] Laplace (1772) [21][22]
gave the general method of expanding a determinant in terms of its complementary minors: Vandermonde had
already given a special case. Immediately following, Lagrange (1773) treated determinants of the second and third
order. Lagrange was the first to apply determinants to questions of elimination theory; he proved many special cases
of general identities.
Gauss (1801) made the next advance. Like Lagrange, he made much use of determinants in the theory of numbers.
He introduced the word determinant (Laplace had used resultant), though not in the present signification, but rather
as applied to the discriminant of a quantic. Gauss also arrived at the notion of reciprocal (inverse) determinants, and
came very near the multiplication theorem.
The next contributor of importance is Binet (1811, 1812), who formally stated the theorem relating to the product of
two matrices of m columns and n rows, which for the special case of m = n reduces to the multiplication theorem. On
the same day (November 30, 1812) that Binet presented his paper to the Academy, Cauchy also presented one on the
subject. (See Cauchy-Binet formula.) In this he used the word determinant in its present sense,[23][24] summarized
and simplified what was then known on the subject, improved the notation, and gave the multiplication theorem with
a proof more satisfactory than Binet's.[17][25] With him begins the theory in its generality.
The next important figure was Jacobi[18] (from 1827). He early used the functional determinant which Sylvester later
called the Jacobian, and in his memoirs in Crelle for 1841 he specially treats this subject, as well as the class of
alternating functions which Sylvester has called alternants. About the time of Jacobi's last memoirs, Sylvester (1839)
and Cayley began their work.[26][27]
The study of special forms of determinants has been the natural result of the completion of the general theory.
Axisymmetric determinants have been studied by Lebesgue, Hesse, and Sylvester; persymmetric determinants by
Sylvester and Hankel; circulants by Catalan, Spottiswoode, Glaisher, and Scott; skew determinants and Pfaffians, in
connection with the theory of orthogonal transformation, by Cayley; continuants by Sylvester; Wronskians (so called
by Muir) by Christoffel and Frobenius; compound determinants by Sylvester, Reiss, and Picquet; Jacobians and
Hessians by Sylvester; and symmetric gauche determinants by Trudi. Of the text-books on the subject
Spottiswoode's was the first. In America, Hanus (1886), Weld (1893), and Muir/Metzler (1933) published treatises.
Determinant 157

Applications

Linear independence
As mentioned above, the determinant of a matrix (with real or complex entries, say) is zero if and only if the column
vectors of the matrix are linearly dependent. Thus, determinants can be used to characterize linearly dependent
vectors. For example, given two vectors v1, v2 in R3, a third vector v3 lies in the plane spanned by the former two
vectors exactly if the determinant of the 3-by-3 matrix consisting of the three vectors is zero. The same idea is also
used in the theory of differential equations: given n functions f1(x), ..., fn(x) (supposed to be n−1 times
differentiable), the Wronskian is defined to be

    W(f1, ..., fn)(x) = det[ fj^(i−1)(x) ] (1 ≤ i, j ≤ n),

the determinant of the n-by-n matrix whose (i, j) entry is the (i − 1)-th derivative of fj evaluated at x.
It is non-zero (for some x) in a specified interval if and only if the given functions and all their derivatives up to
order n−1 are linearly independent. If it can be shown that the Wronskian is zero everywhere on an interval then, in
the case of analytic functions, this implies the given functions are linearly dependent. See the Wronskian and linear
independence.
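As a small illustration of the dependence criterion above, the sketch below tests whether a third vector lies in the plane spanned by two others by expanding a 3-by-3 determinant directly. The vectors are hypothetical examples, not taken from the text.

```python
# Three vectors in R^3 are linearly dependent exactly when the determinant
# of the 3-by-3 matrix holding them as columns is zero.

def det3(m):
    # Cofactor expansion along the first row.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def columns(a, b, c):
    # Build the matrix whose columns are the three vectors.
    return [[a[i], b[i], c[i]] for i in range(3)]

v1, v2 = [1, 0, 0], [0, 1, 0]
v3 = [2, 3, 0]   # lies in the plane spanned by v1 and v2
v4 = [2, 3, 1]   # does not

print(det3(columns(v1, v2, v3)))  # 0 -> dependent (v3 in the plane)
print(det3(columns(v1, v2, v4)))  # 1 -> independent
```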

Orientation of a basis
The determinant can be thought of as assigning a number to every sequence of n vectors in Rn, by using the square matrix
whose columns are the given vectors. For instance, an orthogonal matrix with entries in Rn represents an
orthonormal basis in Euclidean space. The determinant of such a matrix determines whether the orientation of the
basis is consistent with or opposite to the orientation of the standard basis. Namely, if the determinant is +1, the basis
has the same orientation. If it is −1, the basis has the opposite orientation.
More generally, if the determinant of A is positive, A represents an orientation-preserving linear transformation (if A
is an orthogonal 2×2 or 3×3 matrix, this is a rotation), while if it is negative, A switches the orientation of the basis.
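A minimal sketch of this sign test, using a 2×2 rotation and a reflection as hypothetical examples:

```python
# The sign of the determinant tells whether a linear map preserves
# or reverses orientation.
import math

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

theta = math.pi / 3
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]   # det = +1: preserves orientation
reflection = [[1, 0],
              [0, -1]]                             # det = -1: reverses orientation

print(round(det2(rotation), 12))   # 1.0
print(det2(reflection))            # -1
```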

Volume and Jacobian determinant


As pointed out above, the absolute value of the determinant of a matrix of real vectors is equal to the volume of the
parallelepiped spanned by those vectors. As a consequence, if f: Rn → Rn is the linear map represented by the matrix
A, and S is any measurable subset of Rn, then the volume of f(S) is given by |det(A)| times the volume of S. More
generally, if the linear map f: Rn → Rm is represented by the m-by-n matrix A, then the n-dimensional volume of f(S)
is given by:

    volume(f(S)) = √(det(AᵀA)) · volume(S).

By calculating the volume of the tetrahedron bounded by four points, determinants can be used to identify skew lines. The
volume of any tetrahedron, given its vertices a, b, c, and d, is (1/6)·|det(a − b, b − c, c − d)|, or any other
combination of pairs of vertices that would form a spanning tree over the vertices.
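The tetrahedron formula can be checked on the standard unit tetrahedron, whose volume is known to be 1/6; the code below is a small sketch using hypothetical vertex coordinates:

```python
# Volume of a tetrahedron via (1/6)|det(a - b, b - c, c - d)|.

def det3(m):
    # Cofactor expansion along the first row.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def sub(p, q):
    return [p[i] - q[i] for i in range(3)]

# Standard unit tetrahedron: volume 1/6.
a, b, c, d = (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)
edges = [sub(a, b), sub(b, c), sub(c, d)]
m = [[edges[j][i] for j in range(3)] for i in range(3)]  # vectors as columns
volume = abs(det3(m)) / 6
print(volume)  # 0.16666666666666666
```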
For a general differentiable function, much of the above carries over by considering the Jacobian matrix of f. For

    f: Rn → Rn,

the Jacobian is the n-by-n matrix whose entries are given by

    (J f)(i, j) = ∂fi/∂xj.
Its determinant, the Jacobian determinant, appears in the higher-dimensional version of integration by substitution:
for suitable functions f and an open subset U of Rn (the domain of f), the integral over f(U) of some other function φ:
Rn → Rm is given by

    ∫_{f(U)} φ(v) dv = ∫_U φ(f(u)) |det(J f)(u)| du.
The Jacobian also occurs in the inverse function theorem.

Vandermonde determinant (alternant)


Third order

In general, the nth-order Vandermonde determinant is[28]

    ∏_{1 ≤ i < j ≤ n} (xj − xi),
where the right-hand side is the continued product of all the differences that can be formed from the n(n-1)/2 pairs of
numbers taken from x1, x2, ..., xn, with the order of the differences taken in the reversed order of the suffixes that are
involved.
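The identity can be checked numerically. The sketch below, with hypothetical sample points, compares a brute-force cofactor-expansion determinant of the Vandermonde matrix against the product of differences:

```python
# Vandermonde determinant vs. the product formula prod_{i<j} (x_j - x_i).

def det(m):
    # Laplace (cofactor) expansion along the first row; fine for small n.
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total

xs = [2, 3, 5, 7]
n = len(xs)
V = [[x ** i for x in xs] for i in range(n)]   # row i holds the i-th powers

product = 1
for j in range(n):
    for i in range(j):
        product *= xs[j] - xs[i]

print(det(V), product)  # 240 240
```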

Circulants
Second order

Third order

where ω and ω² are the complex cube roots of 1. In general, the nth-order circulant determinant is[28]

    ∏_{j=0}^{n−1} (x1 + x2 ωj + x3 ωj² + ... + xn ωj^(n−1)),
where ωj is an nth root of 1.
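As a numerical sanity check of this formula, the following sketch compares a direct determinant of a hypothetical 3×3 circulant with the product over the roots of unity:

```python
# Circulant determinant = product of (x1 + x2 w^j + ... + xn w^{(n-1)j})
# over the n-th roots of unity w^j.
import cmath

def det(m):
    # Laplace (cofactor) expansion along the first row.
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total

c = [1, 2, 3]                     # first row of the circulant
n = len(c)
C = [[c[(j - i) % n] for j in range(n)] for i in range(n)]  # cyclic row shifts

product = 1
for j in range(n):
    w = cmath.exp(2j * cmath.pi * j / n)          # an n-th root of unity
    product *= sum(c[k] * w ** k for k in range(n))

print(det(C))               # 18
print(round(product.real))  # 18
```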



Notes
[1] Poole, David (2006), Linear Algebra: A Modern Introduction, Thomson Brooks/Cole, p. 262, ISBN 0-534-99845-3
[2] Serge Lang, Linear Algebra, 2nd Edition, Addison-Wesley, 1971, pp 173, 191.
[3] Ramazi, P., Shoeiby, B. and Abbasian, T. (2012) The extension of Sarrus' Rule for finding the determinant of a 4x4 matrix. The American
Mathematical Monthly. April V.
[4] In a non-commutative setting left-linearity (compatibility with left-multiplication by scalars) should be distinguished from right-linearity.
Assuming linearity in the columns is taken to be left-linearity, one would have, for non-commuting scalars a, b:

a contradiction. There is no useful notion of multi-linear functions over a non-commutative ring.


[5] Proofs can be found in http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/proof003.html
[6] Ken Habgood, Itamar Arel, A condensation-based application of Cramer's rule for solving large-scale linear systems, Journal of Discrete
Algorithms, 10 (2012), pp. 98-109. Available online 1 July 2011, ISSN 1570-8667, 10.1016/j.jda.2011.06.007.
[7] These identities were taken from http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/proof003.html
[8] Proofs are given in J.R. Silvester, Math. Gazette, 10 (2000), pp. 460-467, available at http://www.mth.kcl.ac.uk/~jrs/gazette/blocks.pdf
[9] Roger Godement, Cours d'Algèbre, seconde édition, Hermann (1966), §23, Théorème 5, p. 303
[10] Mac Lane, Saunders (1998), Categories for the Working Mathematician, Graduate Texts in Mathematics 5 ((2nd ed.) ed.), Springer-Verlag,
ISBN 0-387-98403-8
[11] Varadarajan, V. S (2004), Supersymmetry for mathematicians: An introduction (http://books.google.com/?id=sZ1-G4hQgIIC&pg=PA116&dq=Berezinian#v=onepage&q=Berezinian&f=false), ISBN 978-0-8218-3574-6.
[12] L. N. Trefethen and D. Bau, Numerical Linear Algebra (SIAM, 1997). e.g. in Lecture 1: "... we mention that the determinant, though a
convenient notion theoretically, rarely finds a useful role in numerical algorithms."
[13] http://page.inf.fu-berlin.de/~rote/Papers/pdf/Division-free+algorithms.pdf
[14] J.R. Bunch and J.E. Hopcroft, Triangular factorization and inversion by fast matrix multiplication, Mathematics of Computation, 28 (1974)
231–236.
[15] Fang, Xin Gui; Havas, George (1997). "On the worst-case complexity of integer Gaussian elimination" (http://perso.ens-lyon.fr/gilles.villard/BIBLIOGRAPHIE/PDF/ft_gateway.cfm.pdf). Proceedings of the 1997 international symposium on Symbolic and algebraic
computation. ISSAC '97. Kihei, Maui, Hawaii, United States: ACM. pp. 28–31. doi:10.1145/258726.258740. ISBN 0-89791-875-4.
[16] Bareiss, Erwin (1968), "Sylvester's Identity and Multistep Integer-Preserving Gaussian Elimination" (http://www.ams.org/journals/mcom/1968-22-103/S0025-5718-1968-0226829-0/S0025-5718-1968-0226829-0.pdf), Mathematics of Computation 22 (102): 565–578.
[17] Campbell, H: "Linear Algebra With Applications", pages 111-112. Appleton Century Crofts, 1971
[18] Eves, H: "An Introduction to the History of Mathematics", pages 405, 493–494, Saunders College Publishing, 1990.
[19] A Brief History of Linear Algebra and Matrix Theory: http://darkwing.uoregon.edu/~vitulli/441.sp04/LinAlgHistory.html
[20] Cajori, F. A History of Mathematics p. 80 (http://books.google.com/books?id=bBoPAAAAIAAJ&pg=PA80#v=onepage&f=false)
[21] Expansion of determinants in terms of minors: Laplace, Pierre-Simon (de) "Researches sur le calcul intégral et sur le systéme du monde,"
Histoire de l'Académie Royale des Sciences (Paris), seconde partie, pages 267-376 (1772).
[22] Muir, Sir Thomas, The Theory of Determinants in the historical Order of Development [London, England: Macmillan and Co., Ltd., 1906].
[23] The first use of the word "determinant" in the modern sense appeared in: Cauchy, Augustin-Louis “Memoire sur les fonctions qui ne peuvent
obtenir que deux valeurs égales et des signes contraires par suite des transpositions operées entre les variables qu'elles renferment," which was
first read at the Institute de France in Paris on November 30, 1812, and which was subsequently published in the Journal de l'Ecole
Polytechnique, Cahier 17, Tome 10, pages 29-112 (1815).
[24] Origins of mathematical terms: http://jeff560.tripod.com/d.html
[25] History of matrices and determinants: http://www-history.mcs.st-and.ac.uk/history/HistTopics/Matrices_and_determinants.html
[26] The first use of vertical lines to denote a determinant appeared in: Cayley, Arthur "On a theorem in the geometry of position," Cambridge
Mathematical Journal, vol. 2, pages 267-271 (1841).
[27] History of matrix notation: http://jeff560.tripod.com/matrices.html
[28] Gradshteyn, I. S., I. M. Ryzhik: "Table of Integrals, Series, and Products", 14.31, Elsevier, 2007.

References
• Axler, Sheldon Jay (1997), Linear Algebra Done Right (2nd ed.), Springer-Verlag, ISBN 0-387-98259-0
• de Boor, Carl (1990), "An empty exercise" (http://ftp.cs.wisc.edu/Approx/empty.pdf), ACM SIGNUM
Newsletter 25 (2): 3–7, doi:10.1145/122272.122273.
• Lay, David C. (August 22, 2005), Linear Algebra and Its Applications (3rd ed.), Addison Wesley,
ISBN 978-0-321-28713-7
• Meyer, Carl D. (February 15, 2001), Matrix Analysis and Applied Linear Algebra (http://www.matrixanalysis.
com/DownloadChapters.html), Society for Industrial and Applied Mathematics (SIAM),
ISBN 978-0-89871-454-8
• Poole, David (2006), Linear Algebra: A Modern Introduction (2nd ed.), Brooks/Cole, ISBN 0-534-99845-3
• Anton, Howard (2005), Elementary Linear Algebra (Applications Version) (9th ed.), Wiley International
• Leon, Steven J. (2006), Linear Algebra With Applications (7th ed.), Pearson Prentice Hall

External links
• Hazewinkel, Michiel, ed. (2001), "Determinant" (http://www.encyclopediaofmath.org/index.
php?title=Determinant&oldid=12692), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
• Weisstein, Eric W., " Determinant (http://mathworld.wolfram.com/Determinant.html)" from MathWorld.
• O'Connor, John J.; Robertson, Edmund F., "Matrices and determinants" (http://www-history.mcs.st-andrews.
ac.uk/HistTopics/Matrices_and_determinants.html), MacTutor History of Mathematics archive, University of
St Andrews.
• WebApp to calculate determinants and descriptively solve systems of linear equations (http://sole.ooz.ie/en)
• Determinant Interactive Program and Tutorial (http://people.revoledu.com/kardi/tutorial/LinearAlgebra/
MatrixDeterminant.html)
• Online Matrix Calculator (http://matri-tri-ca.narod.ru/en.index.html)
• Linear algebra: determinants. (http://www.umat.feec.vutbr.cz/~novakm/determinanty/en/) Compute
determinants of matrices up to order 6 using Laplace expansion you choose.
• Matrices and Linear Algebra on the Earliest Uses Pages (http://www.economics.soton.ac.uk/staff/aldrich/
matrices.htm)
• Determinants explained in an easy fashion in the 4th chapter as a part of a Linear Algebra course. (http://algebra.
math.ust.hk/course/content.shtml)
• Instructional Video on taking the determinant of an nxn matrix (Khan Academy) (http://khanexercises.appspot.
com/video?v=H9BWRYJNIv4)
• Online matrix calculator (determinant, track, inverse, adjoint, transpose) (http://www.stud.feec.vutbr.cz/
~xvapen02/vypocty/matreg.php?language=english) Compute determinant of matrix up to order 8
Dirichlet distribution 161

Dirichlet distribution
Dirichlet

Probability density function

Parameters: K ≥ 2, number of categories (integer); α1, ..., αK > 0, concentration parameters
Support: x1, ..., xK, where each xi ∈ (0, 1) and x1 + ... + xK = 1
PDF: f(x; α) = (1/B(α)) ∏_{i=1}^{K} xi^(αi − 1), where B(α) = (∏_{i=1}^{K} Γ(αi)) / Γ(α0) and α0 = ∑_{i=1}^{K} αi
Mean: E[Xi] = αi/α0; E[ln Xi] = ψ(αi) − ψ(α0) (see digamma function)
Mode: xi = (αi − 1)/(α0 − K), for αi > 1
Variance: Var[Xi] = αi(α0 − αi)/(α0²(α0 + 1)); Cov[Xi, Xj] = −αi αj/(α0²(α0 + 1)) for i ≠ j
Entropy: see text

In probability and statistics, the Dirichlet distribution (after Johann Peter Gustav Lejeune Dirichlet), often denoted
Dir(α), is a family of continuous multivariate probability distributions parametrized by a vector α of positive
reals. It is the multivariate generalization of the beta distribution.[1] Dirichlet distributions are very often used as
prior distributions in Bayesian statistics, and in fact the Dirichlet distribution is the conjugate prior of the categorical
distribution and multinomial distribution. That is, its probability density function returns the belief that the
probabilities of K rival events are xi given that each event has been observed αi − 1 times.
The infinite-dimensional generalization of the Dirichlet distribution is the Dirichlet process.

Probability density function


The Dirichlet distribution of order K ≥ 2 with parameters α1, ..., αK > 0 has a probability density function with
respect to Lebesgue measure on the Euclidean space RK−1 given by

    f(x1, ..., xK; α1, ..., αK) = (1/B(α)) ∏_{i=1}^{K} xi^(αi − 1)

for all x1, ..., xK–1 > 0 satisfying x1 + ... + xK–1 < 1, and where xK = 1 – x1 – ... – xK–1. The density is zero outside this
open (K − 1)-dimensional simplex.
The normalizing constant is the multinomial beta function, which can be expressed in terms of the gamma function:

    B(α) = (∏_{i=1}^{K} Γ(αi)) / Γ(∑_{i=1}^{K} αi),   α = (α1, ..., αK).
Support
The support of the Dirichlet distribution is the set of K-dimensional vectors x whose entries are real numbers in
the interval (0, 1); furthermore, x1 + ... + xK = 1, i.e. the sum of the coordinates is 1. These can be viewed as the
probabilities of a K-way categorical event. Another way to express this is that the domain of the Dirichlet
distribution is itself a probability distribution, specifically a K-dimensional discrete distribution. Note that the
technical term for the set of points in the support of a K-dimensional Dirichlet distribution is the open standard
(K − 1)-simplex, which is a generalization of a triangle, embedded in the next-higher dimension. For example, with
K = 3, the support looks like an equilateral triangle embedded in a downward-angle fashion in three-dimensional
space, with vertices at (1, 0, 0), (0, 1, 0) and (0, 0, 1), i.e. touching each of the coordinate axes at a point 1 unit
away from the origin.

Special cases
A very common special case is the symmetric Dirichlet distribution, where all of the elements making up the
parameter vector have the same value. Symmetric Dirichlet distributions are often used when a Dirichlet prior is
called for, since there typically is no prior knowledge favoring one component over another. Since all elements of
the parameter vector have the same value, the distribution alternatively can be parametrized by a single scalar value
α, called the concentration parameter. The density function then simplifies to

    f(x1, ..., xK; α) = (Γ(Kα) / Γ(α)^K) ∏_{i=1}^{K} xi^(α − 1).

When α = 1, the symmetric Dirichlet distribution is equivalent to a uniform
distribution over the open standard (K − 1)-simplex, i.e. it is uniform over all points in its support. Values of the
concentration parameter above 1 prefer variates that are dense, evenly distributed distributions, i.e. all probabilities
returned are similar to each other. Values of the concentration parameter below 1 prefer sparse distributions, i.e.
most of the probabilities returned will be close to 0, and the vast majority of the mass will be concentrated in a few
of the probabilities.
More generally, the parameter vector is sometimes written as the product of a (scalar) concentration parameter
and a (vector) base measure n, where n lies within the (K − 1)-simplex (i.e.: its coordinates sum
to one). The concentration parameter in this case is larger by a factor of K than the concentration parameter for a
symmetric Dirichlet distribution described above. This construction ties in with the concept of a base measure when
discussing Dirichlet processes and is often used in the topic modelling literature.
If we define the concentration parameter as the sum of the Dirichlet parameters for each dimension, then the
Dirichlet distribution is uniform when the concentration parameter is K, the dimension of the distribution.

Properties

Moments
Let X = (X1, ..., XK) ~ Dir(α1, ..., αK), meaning that the first K − 1 components have the above density and

    XK = 1 − X1 − ... − XK−1.

Define α0 = ∑_{i=1}^{K} αi. Then[2][3]

    E[Xi] = αi/α0,   Var[Xi] = αi(α0 − αi) / (α0²(α0 + 1)).

Furthermore, if i ≠ j,

    Cov[Xi, Xj] = −αi αj / (α0²(α0 + 1)).

(Note that the covariance matrix so defined is singular.)

Mode
The mode of the distribution is the vector (x1, ..., xK) with

    xi = (αi − 1) / (α0 − K),   αi > 1.
Marginal distributions
The marginal distributions are beta distributions:[4]

    Xi ~ Beta(αi, α0 − αi).
Conjugate to categorical/multinomial
The Dirichlet distribution is the conjugate prior distribution of the categorical distribution (a generic discrete
probability distribution with a given number of possible outcomes) and multinomial distribution (the distribution
over observed counts of each possible category in a set of categorically distributed observations). This means that if
a data point has either a categorical or multinomial distribution, and the prior distribution of the data point's
parameter (the vector of probabilities that generates the data point) is distributed as a Dirichlet, then the posterior
distribution of the parameter is also a Dirichlet. Intuitively, in such a case, starting from what we know about the
parameter prior to observing the data point, we then can update our knowledge based on the data point and end up
with a new distribution of the same form as the old one. This means that we can successively update our knowledge
of a parameter by incorporating new observations one at a time, without running into mathematical difficulties.
Formally, this can be expressed as follows. Given a model

    p | α ~ Dir(α)
    x1, ..., xN | p ~ Cat(p)   (independent and identically distributed)

then the following holds:

    p | x1, ..., xN, α ~ Dir(α + c),

where c = (c1, ..., cK) is the vector of observed counts of each category.
This relationship is used in Bayesian statistics to estimate the underlying parameter p of a categorical distribution
given a collection of N samples. Intuitively, we can view the hyperprior vector α as pseudocounts, i.e. as
representing the number of observations in each category that we have already seen. Then we simply add in the
counts for all the new observations (the vector c) in order to derive the posterior distribution.
In Bayesian mixture models and other hierarchical Bayesian models with mixture components, Dirichlet
distributions are commonly used as the prior distributions for the categorical variables appearing in the models. See
the section on applications below for more information.
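A minimal sketch of the pseudocount update described above, with hypothetical prior and count values (not taken from the text):

```python
# Dirichlet-categorical conjugacy: posterior alpha = prior alpha + counts.
prior_alpha = [2.0, 3.0, 5.0]    # Dirichlet prior over 3 categories (hypothetical)
counts = [10, 0, 4]              # observed counts c for each category (hypothetical)

posterior_alpha = [a + c for a, c in zip(prior_alpha, counts)]
print(posterior_alpha)           # [12.0, 3.0, 9.0]

# Posterior mean of the category probabilities.
total = sum(posterior_alpha)
p_mean = [a / total for a in posterior_alpha]
print(p_mean)                    # [0.5, 0.125, 0.375]
```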

Relation to Dirichlet-multinomial distribution


In a model where a Dirichlet prior distribution is placed over a set of categorical-valued observations, the marginal
joint distribution of the observations (i.e. the joint distribution of the observations, with the prior parameter
marginalized out) is a Dirichlet-multinomial distribution. This distribution plays an important role in hierarchical
Bayesian models, because when doing inference over such models using methods such as Gibbs sampling or
variational Bayes, Dirichlet prior distributions are often marginalized out. See the article on this distribution for more
details.

Entropy
If X is a Dir(α) random variable, then the exponential family differential identities can be used to get an analytic
expression for the expectation of ln Xi and its associated covariance matrix:

    E[ln Xi] = ψ(αi) − ψ(α0)

and

    Cov[ln Xi, ln Xj] = ψ′(αi) δij − ψ′(α0),

where ψ is the digamma function, ψ′ is the trigamma function, and δij is the Kronecker delta. The formula for
E[ln Xi] yields the following formula for the information entropy of X:

    H(X) = ln B(α) + (α0 − K) ψ(α0) − ∑_{j=1}^{K} (αj − 1) ψ(αj).
Aggregation
If X = (X1, ..., XK) ~ Dir(α1, ..., αK) then, if the random variables with subscripts i and j are dropped
from the vector and replaced by their sum,

    X′ = (X1, ..., Xi + Xj, ..., XK) ~ Dir(α1, ..., αi + αj, ..., αK).

This aggregation property may be used to derive the marginal distribution of Xi mentioned above.

Neutrality
If X = (X1, ..., XK) ~ Dir(α), then the vector X is said to be neutral[5] in the sense that X1 is
independent of[6]

    (X2/(1 − X1), X3/(1 − X1), ..., XK/(1 − X1)),

and similarly for removing any of X2, ..., XK−1. Observe that any permutation of X is also neutral (a property
not possessed by samples drawn from a generalized Dirichlet distribution[7]).

Related distributions
If, for i ∈ {1, ..., K},

    Yi ~ Gamma(αi, θ), independently,

then[8]

    V = ∑_{i=1}^{K} Yi ~ Gamma(∑_{i=1}^{K} αi, θ)

and

    X = (X1, ..., XK) = (Y1/V, ..., YK/V) ~ Dir(α1, ..., αK).

Although the Xi are not independent from one another, they can be seen to be generated from a set of K
independent gamma random variables (see [9] for a proof). Unfortunately, since the sum V is lost in forming X, it is
not possible to recover the original gamma random variables from these values alone. Nevertheless, because
independent random variables are simpler to work with, this reparametrization can still be useful for proofs about
properties of the Dirichlet distribution.

Applications
Dirichlet distributions are most commonly used as the prior distribution of categorical variables or multinomial
variables in Bayesian mixture models and other hierarchical Bayesian models. (Note that in many fields, such as in
natural language processing, categorical variables are often imprecisely called "multinomial variables". Such a usage
is liable to cause confusion, just as if Bernoulli distributions and binomial distributions were commonly conflated.)
Inference over hierarchical Bayesian models is often done using Gibbs sampling, and in such a case, instances of the
Dirichlet distribution are typically marginalized out of the model by integrating out the Dirichlet random variable.
This causes the various categorical variables drawn from the same Dirichlet random variable to become correlated,
and the joint distribution over them assumes a Dirichlet-multinomial distribution, conditioned on the
hyperparameters of the Dirichlet distribution (the concentration parameters). One of the reasons for doing this is that
Gibbs sampling of the Dirichlet-multinomial distribution is extremely easy; see that article for more information.

Random number generation

Gamma distribution
A fast method to sample a random vector x = (x1, ..., xK) from the K-dimensional Dirichlet distribution with
parameters (α1, ..., αK) follows immediately from this connection. First, draw K independent random samples
y1, ..., yK from gamma distributions each with density

    yi^(αi − 1) e^(−yi) / Γ(αi),

and then set

    xi = yi / ∑_{j=1}^{K} yj.
Below is example Python code to draw the sample (the placeholder parameters a1, ..., ak must be replaced by concrete positive numbers):

import random

params = [a1, a2, ..., ak]
sample = [random.gammavariate(a, 1) for a in params]
sample = [v / sum(sample) for v in sample]

Marginal beta distributions


A less efficient algorithm[10] relies on the univariate marginal and conditional distributions being beta and proceeds
as follows. Simulate x1 from a Beta(α1, ∑_{i=2}^{K} αi) distribution. Then simulate x2, ..., xK−1 in order, as follows.

For j = 2, ..., K − 1, simulate φj from a Beta(αj, ∑_{i=j+1}^{K} αi) distribution, and let xj = (1 − ∑_{i=1}^{j−1} xi) φj.

Finally, set xK = 1 − ∑_{i=1}^{K−1} xi.

Below is example Python code to draw the sample (again with placeholder parameters):

import random

params = [a1, a2, ..., ak]
xs = [random.betavariate(params[0], sum(params[1:]))]
for j in range(1, len(params) - 1):
    phi = random.betavariate(params[j], sum(params[j + 1:]))
    xs.append((1 - sum(xs)) * phi)
xs.append(1 - sum(xs))

Intuitive interpretations of the parameters

The concentration parameter


Dirichlet distributions are very often used as prior distributions in Bayesian inference. The simplest and perhaps
most common type of Dirichlet prior is the symmetric Dirichlet distribution, where all parameters are equal. This
corresponds to the case where you have no prior information to favor one component over any other. As described
above, the single value α to which all parameters are set is called the concentration parameter. If the sample space of
the Dirichlet distribution is interpreted as a discrete probability distribution, then intuitively the concentration
parameter can be thought of as determining how "concentrated" the probability mass of a sample from a Dirichlet
distribution is likely to be. With a value much less than 1, the mass will be highly concentrated in a few components,
and all the rest will have almost no mass. With a value much greater than 1, the mass will be dispersed almost
equally among all the components. See the article on the concentration parameter for further discussion.

String cutting
One example use of the Dirichlet distribution is if one wanted to cut strings (each of initial length 1.0) into K pieces
with different lengths, where each piece had a designated average length, but allowing some variation in the relative
sizes of the pieces. The αi/α0 values specify the mean lengths of the cut pieces of string resulting from the
distribution. The variance around this mean varies inversely with α0.

Pólya's urn
Consider an urn containing balls of K different colors. Initially, the urn contains α1 balls of color 1, α2 balls of color
2, and so on. Now perform N draws from the urn, where after each draw, the ball is placed back into the urn with an
additional ball of the same color. In the limit as N approaches infinity, the proportions of different colored balls in
the urn will be distributed as Dir(α1,...,αK).[11]
For a formal proof, note that the proportions of the different colored balls form a bounded [0,1]K-valued martingale,
hence by the martingale convergence theorem, these proportions converge almost surely and in mean to a limiting
random vector. To see that this limiting vector has the above Dirichlet distribution, check that all mixed moments
agree.
Note that each draw from the urn modifies the probability of drawing a ball of any one color from the urn in the
future. This modification diminishes with the number of draws, since the relative effect of adding a new ball to the
urn diminishes as the urn accumulates increasing numbers of balls. This "diminishing returns" effect can also help
explain how small α values yield Dirichlet distributions with most of the probability mass concentrated around a
single point on the simplex.
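The urn scheme above can be simulated directly. The helper below is a hypothetical sketch (seeded for reproducibility); after many draws the proportions in the urn approximate a single draw from Dir(α):

```python
# Pólya urn simulation: draw a ball with probability proportional to the
# current counts, then return it together with one extra ball of its color.
import random

def polya_urn(alpha, n_draws, seed=0):
    rng = random.Random(seed)
    counts = list(alpha)                 # initial balls of each color
    for _ in range(n_draws):
        i = rng.choices(range(len(counts)), weights=counts)[0]
        counts[i] += 1                   # replace the ball plus one more of its color
    total = sum(counts)
    return [c / total for c in counts]

proportions = polya_urn([1, 1, 1], 5000)
print(proportions)   # one approximate draw from Dir(1, 1, 1)
```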

References
[1] S. Kotz, N. Balakrishnan, and N. L. Johnson (2000). Continuous Multivariate Distributions. Volume 1: Models and Applications. New York:
Wiley. ISBN 0-471-18387-3. (Chapter 49: Dirichlet and Inverted Dirichlet Distributions)
[2] Eq. (49.9) on page 488 of Kotz, Balakrishnan & Johnson (2000). Continuous Multivariate Distributions. Volume 1: Models and Applications.
New York: Wiley. (http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471183873.html)
[3] Balakrishnan, V. B. (2005). "Chapter 27. Dirichlet Distribution". A Primer on Statistical Distributions. Hoboken, NJ: John Wiley & Sons, Inc.
p. 274. ISBN 978-0-471-42798-8.
[4] Ferguson, Thomas S. (1973). "A Bayesian analysis of some nonparametric problems". The Annals of Statistics 1 (2): 209–230.
doi:10.1214/aos/1176342360.
[5] Connor, Robert J.; Mosimann, James E (1969). "Concepts of Independence for Proportions with a Generalization of the Dirichlet
Distribution". Journal of the American statistical association (American Statistical Association) 64 (325): 194–206. doi:10.2307/2283728.
JSTOR 2283728.
[6] Bela A. Frigyik, Amol Kapila, and Maya R. Gupta (2010). "Introduction to the Dirichlet Distribution and Related Processes" (http://ee.washington.edu/research/guptalab/publications/UWEETR-2010-0006.pdf) (Technical Report UWEETR-2010-006). University of
Washington Department of Electrical Engineering. Retrieved May 2012.
[7] See Kotz, Balakrishnan & Johnson (2000), Section 8.5, "Connor and Mosimann's Generalization", pp. 519–521.
[8] Devroye, Luc (1986). Non-Uniform Random Variate Generation (http://luc.devroye.org/chapter_nine.pdf). pp. 402.
[9] Devroye, Luc (1986). Non-Uniform Random Variate Generation (http://luc.devroye.org/rnbookindex.html). pp. 594. (Chapter 11.)
[10] A. Gelman and J. B. Carlin and H. S. Stern and D. B. Rubin (2003). Bayesian Data Analysis (2nd ed.). pp. 582. ISBN 1-58488-388-X.
[11] Blackwell, David; MacQueen, James B. (1973). "Ferguson distributions via Polya urn schemes". Ann. Stat. 1 (2): 353–355.
doi:10.1214/aos/1176342372.

External links
• Dirichlet Distribution (http://www.cis.hut.fi/ahonkela/dippa/node95.html)
• How to estimate the parameters of the Dirichlet distribution using expectation-maximization (EM) (http://www.
ee.washington.edu/research/guptalab/publications/EMbookChenGupta2010.pdf)
• Luc Devroye. "Non-Uniform Random Variate Generation" (http://luc.devroye.org/rnbookindex.html).
Retrieved May 2012.
• Dirichlet Random Measures, Method of Construction via Compound Poisson Random Variables, and
Exchangeability Properties of the resulting Gamma Distribution (http://www.cs.princeton.edu/courses/
archive/fall07/cos597C/scribe/20071130.pdf)
Effect size 169

Effect size
In statistics, an effect size is a measure of the strength of the relationship between two variables in a statistical
population, or a sample-based estimate of that quantity. An effect size calculated from data is a descriptive statistic
that conveys the estimated magnitude of a relationship without making any statement about whether the apparent
relationship in the data reflects a true relationship in the population. In that way, effect sizes complement inferential
statistics such as p-values. Among other uses, effect size measures play an important role in meta-analysis studies
that summarize findings from a specific area of research, and in statistical power analyses.
The concept of effect size appears already in everyday language. For example, a weight loss program may boast that
it leads to an average weight loss of 30 pounds. In this case, 30 pounds is an indicator of the claimed effect size.
Another example is that a tutoring program may claim that it raises school performance by one letter grade. This
grade increase is the claimed effect size of the program. These are both examples of "absolute effect sizes", meaning
that they convey the average difference between two groups without any discussion of the variability within the
groups. For example, if the weight loss program results in an average loss of 30 pounds, it is possible that every
participant loses exactly 30 pounds, or half the participants lose 60 pounds and half lose no weight at all.
Reporting effect sizes is considered good practice when presenting empirical research findings in many fields.[1][2]
The reporting of effect sizes facilitates the interpretation of the substantive, as opposed to the statistical, significance
of a research result.[3] Effect sizes are particularly prominent in social and medical research. Relative and absolute
measures of effect size convey different information, and can be used complementarily. A prominent task force in
the psychology research community expressed the following recommendation:
Always present effect sizes for primary outcomes...If the units of measurement are meaningful on a practical
level (e.g., number of cigarettes smoked per day), then we usually prefer an unstandardized measure
(regression coefficient or mean difference) to a standardized measure (r or d).
— L. Wilkinson and APA Task Force on Statistical Inference (1999, p. 599)

Overview

Population and sample effect sizes


The term effect size can refer to a statistic calculated from a sample of data, or to a parameter of a hypothetical
statistical population. Conventions for distinguishing sample from population effect sizes follow standard statistical
practices — one common approach is to use Greek letters like ρ to denote population parameters and Latin letters
like r to denote the corresponding statistic; alternatively, a "hat" can be placed over the population parameter to
denote the statistic, e.g. with ρ̂ being the estimate of the parameter ρ.
As in any statistical setting, effect sizes are estimated with error, and may be biased unless the effect size estimator
that is used is appropriate for the manner in which the data were sampled and the manner in which the measurements
were made. An example of this is publication bias, which occurs when scientists only report results when the
estimated effect sizes are large or are statistically significant. As a result, if many researchers are carrying out studies
under low statistical power, the reported results are biased to be stronger than true effects, if any.[4] Another
example where effect sizes may be distorted is a multiple-trial experiment in which the effect size calculation is
based on the averaged or aggregated response across the trials.[5]

Relationship to test statistics


Sample-based effect sizes are distinguished from test statistics used in hypothesis testing, in that they estimate the
strength of an apparent relationship, rather than assigning a significance level reflecting whether the relationship
could be due to chance. The effect size does not determine the significance level, or vice-versa. Given a sufficiently
large sample size, a statistical comparison will always show a significant difference unless the population effect size
is exactly zero. For example, a sample Pearson correlation coefficient of 0.1 is strongly statistically significant if the
sample size is 1000. Reporting only the significant p-value from this analysis could be misleading if a correlation of
0.1 is too small to be of interest in a particular application.

Standardized and unstandardized effect sizes


The term effect size can refer to standardized measures of effect (such as r, Cohen's d, or the odds ratio), or to an
unstandardized measure (e.g., the raw difference between group means or an unstandardized regression coefficient).
Standardized effect size measures are typically used when the metrics of variables being studied do not have intrinsic
meaning (e.g., a score on a personality test on an arbitrary scale), when results from multiple studies are being
combined when some or all of the studies use different scales, or when it is desired to convey the size of an effect
relative to the variability in the population. In meta-analysis, standardized effect sizes are used as a common measure
that can be calculated for different studies and then combined into an overall summary.

Types

Effect Sizes Based on "Variance Explained"


These effect sizes estimate the amount of the variance within an experiment that is "explained" or "accounted for" by
the experiment's model.

Pearson r (correlation)
Pearson's correlation, often denoted r and introduced by Karl Pearson, is widely used as an effect size when paired
quantitative data are available; for instance if one were studying the relationship between birth weight and longevity.
The correlation coefficient can also be used when the data are binary. Pearson's r can vary in magnitude from −1 to
1, with −1 indicating a perfect negative linear relation, 1 indicating a perfect positive linear relation, and 0 indicating
no linear relation between two variables. Cohen gives the following guidelines for the social sciences:[6][7]

Effect size r

Small 0.10

Medium 0.30

Large 0.50

Coefficient of determination
A related effect size is r², the coefficient of determination (also referred to as "r-squared"), calculated as the square of
the Pearson correlation r. In the case of paired data, this is a measure of the proportion of variance shared by the two
variables, and varies from 0 to 1. For example, with an r of 0.21 the coefficient of determination is 0.0441, meaning
that 4.4% of the variance of either variable is shared with the other variable. Being a square, r² is always non-negative, so it does not convey the direction of the correlation between the two variables.
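As a concrete illustration, r and r² can be computed directly (a pure-Python sketch; the paired data below are invented for this example):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical paired data (e.g., birth weight vs. longevity, arbitrary units)
x = [2.9, 3.1, 3.4, 2.7, 3.8, 3.0, 3.6]
y = [70, 72, 78, 66, 82, 71, 80]

r = pearson_r(x, y)
r2 = r ** 2   # coefficient of determination: proportion of shared variance
```

Note that r² discards the sign of r, which is why it cannot convey the direction of the relationship.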

Cohen's ƒ2
Cohen's ƒ² is one of several effect size measures used in the context of an F-test for ANOVA or multiple regression.
Note that it estimates the effect size for the sample rather than for the population, and is biased (it overestimates the
effect size for ANOVA).
The ƒ² effect size measure for multiple regression is defined as:

    ƒ² = R² / (1 − R²)

where R² is the squared multiple correlation.


The ƒ² effect size measure for hierarchical multiple regression is defined as:

    ƒ² = (R²_AB − R²_A) / (1 − R²_AB)

where R²_A is the variance accounted for by a set of one or more independent variables A, and R²_AB is the
combined variance accounted for by A and another set of one or more independent variables B.
By convention, ƒ² effect sizes of 0.02, 0.15, and 0.35 are termed small, medium, and large, respectively.[6]
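A small sketch of the ƒ² computation from R² (pure Python with invented data; for a single predictor the squared multiple correlation R² reduces to r², though the measure applies to multiple regression generally):

```python
# Made-up predictor/response data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

# For a single predictor, the squared multiple correlation R^2 equals r^2
r_squared = sxy ** 2 / (sxx * syy)

# Cohen's f^2 = R^2 / (1 - R^2)
f_squared = r_squared / (1.0 - r_squared)
```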

Cohen's ƒ² can also be found for factorial analysis of variance (ANOVA, aka the F-test) working backwards, using

    ƒ² = SS_between / SS_within

In a balanced design (equivalent sample sizes across groups) of ANOVA, the corresponding population parameter of ƒ² is

    ƒ² = ( Σ_{j=1}^{K} (μ_j − μ)² / K ) / σ²

wherein μ_j denotes the population mean within the j-th of the total K groups, μ the grand mean, and σ the common
population standard deviation within each group. Here SS denotes a sum of squares in ANOVA. A less biased estimator
for ANOVA is based on omega squared, which estimates the corresponding proportion for the population.

ω²
A less biased estimator of the variance explained in the population is omega squared[8][9][10]

    ω̂² = (SS_between − df_between · MS_within) / (SS_total + MS_within)

This form of the formula is limited to between-subjects analyses with equal sample sizes in all cells.[10] Since it is
less biased, ω² is preferable to Cohen's ƒ²; however, it can be more inconvenient to calculate for complex analyses. A
generalized form of the estimator has been published for between-subjects and within-subjects analysis, repeated
measure, mixed design, and randomized block design experiments.[11] In addition, methods to calculate partial
Omega2 for individual factors and combined factors in designs with up to three independent variables have been
published.[11]
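A between-subjects ω² computation can be sketched from one-way ANOVA sums of squares (a pure-Python sketch; the three groups of scores are invented, and the formula used is the common between-subjects form ω̂² = (SS_between − df_between·MS_within) / (SS_total + MS_within)):

```python
# Three equal-sized groups of made-up scores
groups = [
    [4.1, 4.8, 5.2, 4.5, 5.0],
    [5.9, 6.3, 6.8, 6.1, 6.5],
    [5.0, 5.4, 5.8, 5.2, 5.6],
]

all_scores = [x for g in groups for x in g]
n_total = len(all_scores)
grand_mean = sum(all_scores) / n_total

ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
ss_between = sum(len(g) * ((sum(g) / len(g)) - grand_mean) ** 2 for g in groups)
ss_within = ss_total - ss_between

df_between = len(groups) - 1
df_within = n_total - len(groups)
ms_within = ss_within / df_within

# Omega squared: a less biased estimate of population variance explained
omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)
```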

Effect sizes based on means or distances between/among means


A (population) effect size θ based on means usually considers the standardized mean difference between two
populations[12]:78

    θ = (μ1 − μ2) / σ

where μ1 is the mean for one population, μ2 is the mean for the other population, and σ is a standard deviation based
on either or both populations.
In the practical setting the population values are typically not known and must be estimated from sample statistics.
The several versions of effect sizes based on means differ with respect to which statistics are used.
This form for the effect size resembles the computation for a t-test statistic, with the critical difference that the t-test
statistic includes a factor of √n. This means that for a given effect size, the significance level increases with the
sample size. Unlike the t-test statistic, the effect size aims to estimate a population parameter, so is not affected by
the sample size.

Cohen's d
Cohen's d is defined as the difference between two means divided by a standard deviation for the data

    d = (x̄1 − x̄2) / s
Cohen's d is frequently used in estimating sample sizes. A lower Cohen's d indicates a necessity of larger sample
sizes, and vice versa, as can subsequently be determined together with the additional parameters of desired
significance level and statistical power.[13]
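This use of Cohen's d in sample-size planning can be sketched with the large-sample normal approximation n ≈ 2·((z_{1−α/2} + z_{power})/d)² per group. This is only an approximation (exact planning uses the noncentral t distribution), and the function name is our own:

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    rounded up. Exact planning would use the noncentral t distribution.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

n_small = n_per_group(0.2)    # a "small" effect demands a much larger sample
n_medium = n_per_group(0.5)
n_large = n_per_group(0.8)
```

As expected, a lower Cohen's d forces a larger sample size for the same significance level and power.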
What precisely the standard deviation s is was not originally made explicit by Jacob Cohen, because he defined it
(using the symbol "σ") as "the standard deviation of either population (since they are assumed equal)".[6]:20 Other
authors make the computation of the standard deviation more explicit with the following definition of a pooled
standard deviation[14]:14 for two independent samples:

    s = √[ ((n1 − 1)·s1² + (n2 − 1)·s2²) / (n1 + n2) ]

This definition of "Cohen's d" is termed the maximum likelihood estimator by Hedges and Olkin,[12] and it is related
to Hedges' g (see below) by a scaling factor[12]:82

    d = g · √[ (n1 + n2) / (n1 + n2 − 2) ]
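The difference between the two pooled standard deviations, and hence between Cohen's d and Hedges' g, can be illustrated numerically (a pure-Python sketch; the two samples are invented):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def sum_sq_dev(xs):
    """Sum of squared deviations, i.e. (n - 1) times the sample variance."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

# Two made-up independent samples
g1 = [23.1, 25.4, 22.8, 26.0, 24.3, 25.1]
g2 = [20.2, 21.7, 19.8, 22.4, 20.9, 21.3]

n1, n2 = len(g1), len(g2)
diff = mean(g1) - mean(g2)

# Cohen's d with the "maximum likelihood" pooled SD (divisor n1 + n2)
s_mle = math.sqrt((sum_sq_dev(g1) + sum_sq_dev(g2)) / (n1 + n2))
d = diff / s_mle

# Hedges' g with divisor n1 + n2 - 2
s_star = math.sqrt((sum_sq_dev(g1) + sum_sq_dev(g2)) / (n1 + n2 - 2))
g = diff / s_star

# Approximate small-sample bias correction for g
g_corrected = (1 - 3 / (4 * (n1 + n2) - 9)) * g
```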
Glass's Δ
In 1976 Gene V. Glass proposed an estimator of the effect size that uses only the standard deviation of the second
group[12]:78

    Δ = (x̄1 − x̄2) / s2
The second group may be regarded as a control group, and Glass argued that if several treatments were compared to
the control group it would be better to use just the standard deviation computed from the control group, so that effect
sizes would not differ under equal means and different variances.
Under an assumption of equal population variances a pooled estimate for σ is more precise.

Hedges' g
Hedges' g, suggested by Larry Hedges in 1981,[15] is like the other measures based on a standardized difference[12]:79

    g = (x̄1 − x̄2) / s*

but its pooled standard deviation s* is computed slightly differently from Cohen's d:

    s* = √[ ((n1 − 1)·s1² + (n2 − 1)·s2²) / (n1 + n2 − 2) ]

As an estimator for the population effect size θ it is biased. However, this bias can be approximately corrected for
through multiplication by a factor

    g* = J(n1 + n2 − 2) · g ≈ (1 − 3 / (4(n1 + n2) − 9)) · g

Hedges and Olkin refer to this unbiased estimator g* as d,[12] but it is not the same as Cohen's d. The exact form of
the correction factor J(a) involves the gamma function[12]:104

    J(a) = Γ(a/2) / ( √(a/2) · Γ((a − 1)/2) )
Distribution of effect sizes based on means

Provided that the data is Gaussian distributed, a scaled Hedges' g, √(n1 n2 / (n1 + n2)) · g, follows a noncentral
t-distribution with noncentrality parameter √(n1 n2 / (n1 + n2)) · θ and (n1 + n2 − 2) degrees of freedom.

Likewise, the scaled Glass' Δ, √(n1 n2 / (n1 + n2)) · Δ, is distributed with (n2 − 1) degrees of freedom.


From the distribution it is possible to compute the expectation and variance of the effect sizes.
In some cases large sample approximations for the variance are used. One suggestion for the variance of Hedges'
unbiased estimator is[12]:86

    σ̂²(g*) = (n1 + n2) / (n1 · n2) + (g*)² / (2 (n1 + n2))
φ, Cramér's φ, or Cramér's V

Phi (φ): φ = √(χ² / N)        Cramér's phi (φc): φc = √( χ² / (N (k − 1)) )
The best measure of association for the chi-squared test is phi (or Cramér's phi or V). Phi is related to the
point-biserial correlation coefficient and Cohen's d and estimates the extent of the relationship between two variables
(2 x 2).[16] Cramér's Phi may be used with variables having more than two levels.
Phi can be computed by finding the square root of the chi-squared statistic divided by the sample size: φ = √(χ²/N).
Similarly, Cramér's phi is computed by taking the square root of the chi-squared statistic divided by the product of
the sample size and one less than the minimum dimension: φc = √(χ² / (N(k − 1))), where k is the smaller of the
number of rows r or columns c.
φc is the intercorrelation of the two discrete variables[17] and may be computed for any value of r or c. However, as
chi-squared values tend to increase with the number of cells, the greater the difference between r and c, the more
likely φc will tend to 1 without strong evidence of a meaningful correlation.
Cramér's phi may also be applied to 'goodness of fit' chi-squared models (i.e. those where c=1). In this case it
functions as a measure of tendency towards a single outcome (i.e. out of k outcomes).
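Both measures are straightforward to compute from a contingency table (a pure-Python sketch; the tables and function names are our own):

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def phi(table):
    chi2, n = chi_squared(table)
    return math.sqrt(chi2 / n)

def cramers_v(table):
    chi2, n = chi_squared(table)
    k = min(len(table), len(table[0]))     # smaller of rows, columns
    return math.sqrt(chi2 / (n * (k - 1)))

# Perfectly associated 2x2 table: phi and V both reach 1
perfect = [[10, 0], [0, 10]]
# A weaker, made-up association
weak = [[12, 8], [9, 11]]
```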

Odds ratio
The odds ratio (OR) is another useful effect size. It is appropriate when both variables are binary. For example,
consider a study on spelling. In a control group, two students pass the class for every one who fails, so the odds of
passing are two to one (or more briefly 2/1 = 2). In the treatment group, six students pass for every one who fails, so
the odds of passing are six to one (or 6/1 = 6). The effect size can be computed by noting that the odds of passing in
the treatment group are three times higher than in the control group (because 6 divided by 2 is 3). Therefore, the odds
ratio is 3. However, odds ratio statistics are on a different scale to Cohen's d. So, this '3' is not comparable to a
Cohen's d of 3.

Relative risk
The relative risk (RR), also called risk ratio, is simply the risk (probability) of an event relative to some independent
variable. This measure of effect size differs from the odds ratio in that it compares probabilities instead of odds, but
asymptotically approaches the latter for small probabilities. Using the example above, the probabilities for those in
the control group and treatment group passing is 2/3 (or 0.67) and 6/7 (or 0.86), respectively. The effect size can be
computed the same as above, but using the probabilities instead. Therefore, the relative risk is 1.28. Since rather
large probabilities of passing were used, there is a large difference between relative risk and odds ratio. Had failure
(a smaller probability) been used as the event (rather than passing), the difference between the two measures of
effect size would not be so great.
While both measures are useful, they have different statistical uses. In medical research, the odds ratio is commonly
used for case-control studies, as odds, but not probabilities, are usually estimated.[18] Relative risk is commonly used
in randomized controlled trials and cohort studies.[19] When the incidence of outcomes are rare in the study
population (generally interpreted to mean less than 10%), the odds ratio is considered a good estimate of the risk
ratio. However, as outcomes become more common, the odds ratio and risk ratio diverge, with the odds ratio
overestimating or underestimating the risk ratio when the estimates are greater than or less than 1, respectively.
When estimates of the incidence of outcomes are available, methods exist to convert odds ratios to risk ratios.[20][21]
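The passing example above can be worked through in a few lines (the raw counts are invented so that the odds match the 2:1 and 6:1 ratios in the text):

```python
# Counts chosen to match the spelling example: control odds of passing 2:1,
# treatment odds 6:1 (the absolute counts themselves are hypothetical)
control_pass, control_fail = 40, 20      # odds 40/20 = 2
treat_pass, treat_fail = 60, 10          # odds 60/10 = 6

odds_ratio = (treat_pass / treat_fail) / (control_pass / control_fail)   # 3

p_control = control_pass / (control_pass + control_fail)   # 2/3
p_treat = treat_pass / (treat_pass + treat_fail)           # 6/7
relative_risk = p_treat / p_control                        # about 1.29

# With the rarer event (failing) as the outcome, the two measures are closer
rr_fail = (treat_fail / 70) / (control_fail / 60)
or_fail = (treat_fail / treat_pass) / (control_fail / control_pass)
```

This also illustrates the scale difference: the odds ratio of 3 coexists with a relative risk of only about 1.29, because the probabilities involved are large.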

Confidence intervals by means of noncentrality parameters


Confidence intervals of standardized effect sizes, especially Cohen's d and ƒ², rely on the calculation of confidence
intervals for noncentrality parameters (ncp). A common approach to constructing the confidence interval of the ncp is
to find the critical ncp values that place the observed statistic at the tail quantiles α/2 and (1 − α/2). SAS and the
R package MBESS provide functions to find critical values of the ncp.

T test for mean difference of single group or two related groups


In the case of a single group, M (μ) denotes the sample (population) mean of the group, and SD (σ) denotes the sample
(population) standard deviation. N is the sample size of the group. The t test is used for the hypothesis on the
difference between the mean and a baseline μbaseline. Usually, μbaseline is zero, though this is not necessary. In the
case of two related groups, the single group is constructed from the differences within each pair of samples, while
SD (σ) denotes the sample (population) standard deviation of the differences rather than that within the original two
groups.

    t := √N · (M − μbaseline) / SD,    ncp = √N · (μ − μbaseline) / σ

and Cohen's

    d := (M − μbaseline) / SD

is the point estimate of (μ − μbaseline) / σ. So,

    d̃ = t / √N
T test for mean difference between two independent groups


n1 and n2 are the sample sizes within the respective groups.

    t := (M1 − M2) / ( SD_within · √(1/n1 + 1/n2) ),    ncp = (μ1 − μ2) / ( σ_within · √(1/n1 + 1/n2) )

wherein

    SD_within := √( SS_within / df_within ) = √[ ((n1 − 1)·SD1² + (n2 − 1)·SD2²) / (n1 + n2 − 2) ]

and Cohen's

    d := (M1 − M2) / SD_within

is the point estimate of (μ1 − μ2) / σ_within. So,

    d̃ = t · √(1/n1 + 1/n2)
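As a numeric sanity check, Cohen's d for two independent groups can be recovered from the t statistic as t·√(1/n1 + 1/n2) (a pure-Python sketch with invented samples):

```python
import math

# Hypothetical measurements for two independent groups
a = [5.1, 6.0, 5.5, 6.3, 5.8, 5.4, 6.1]
b = [4.2, 4.9, 4.6, 5.0, 4.4, 4.8, 4.7]
n1, n2 = len(a), len(b)

m1, m2 = sum(a) / n1, sum(b) / n2
ss1 = sum((x - m1) ** 2 for x in a)
ss2 = sum((x - m2) ** 2 for x in b)
sd_within = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))   # pooled SD, df divisor

t = (m1 - m2) / (sd_within * math.sqrt(1 / n1 + 1 / n2))
d = (m1 - m2) / sd_within

# d recovered from t and the group sizes
d_from_t = t * math.sqrt(1 / n1 + 1 / n2)
```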

One-way ANOVA test for mean difference across multiple independent groups
The one-way ANOVA test applies the noncentral F distribution, while with a given (known) population standard
deviation σ the same test question applies the noncentral chi-squared distribution.

For each j-th sample within the i-th of the K groups, X_i,j, denote the mean of the i-th group by μ_i and the grand
mean by μ. The noncentrality parameters of F and of χ² both equate to

    ncp = Σ_i n_i · (μ_i − μ)² / σ²

In the case of K independent groups of the same size n, the total sample size is N := n·K.

The t-test for a pair of independent groups is a special case of one-way ANOVA. Note that the noncentrality parameter
ncp_F of F is not comparable to the noncentrality parameter ncp_t of the corresponding t. Actually, ncp_F = ncp_t²,
and F = t² in that case.

"Small", "medium", "large"


Some fields using effect sizes apply words such as "small", "medium" and "large" to the size of the effect. Whether
an effect size should be interpreted as small, medium, or large depends on its substantive context and its operational
definition. Cohen's conventional criteria of small, medium, and large[6] are near ubiquitous across many fields. Power
analysis or sample size planning requires an assumed population parameter of effect sizes. Many researchers adopt
Cohen's standards as default alternative hypotheses. Russell Lenth criticized them as "T-shirt effect sizes":[22]
This is an elaborate way to arrive at the same sample size that has been used in past social science
studies of large, medium, and small size (respectively). The method uses a standardized effect size as the
goal. Think about it: for a "medium" effect size, you'll choose the same n regardless of the accuracy or
reliability of your instrument, or the narrowness or diversity of your subjects. Clearly, important
considerations are being ignored here. "Medium" is definitely not the message!
For Cohen's d an effect size of 0.2 to 0.3 might be a "small" effect, around 0.5 a "medium" effect and 0.8 to infinity,
a "large" effect.[6]:25 (But note that d can be larger than one.)
Cohen's text[6] anticipates Lenth's concerns:
"The terms 'small,' 'medium,' and 'large' are relative, not only to each other, but to the area of behavioral
science or even more particularly to the specific content and research method being employed in any
given investigation....In the face of this relativity, there is a certain risk inherent in offering conventional
operational definitions for these terms for use in power analysis in as diverse a field of inquiry as
behavioral science. This risk is nevertheless accepted in the belief that more is to be gained than lost by
supplying a common conventional frame of reference which is recommended for use only when no
better basis for estimating the ES index is available." (p. 25)
In an ideal world, researchers would interpret the substantive significance of their results by grounding them in a
meaningful context or by quantifying their contribution to knowledge. Where this is problematic, Cohen's effect size
criteria may serve as a last resort.[3]

References
[1] Wilkinson, Leland; APA Task Force on Statistical Inference (1999). "Statistical methods in psychology journals: Guidelines and
explanations". American Psychologist 54 (8): 594–604. doi:10.1037/0003-066X.54.8.594.
[2] Nakagawa, Shinichi; Cuthill, Innes C (2007). "Effect size, confidence interval and statistical significance: a practical guide for biologists".
Biological Reviews Cambridge Philosophical Society 82 (4): 591–605. doi:10.1111/j.1469-185X.2007.00027.x. PMID 17944619.
[3] Ellis, Paul D. (2010). The Essential Guide to Effect Sizes: An Introduction to Statistical Power, Meta-Analysis and the Interpretation of
Research Results. United Kingdom: Cambridge University Press.
[4] Brand A, Bradley MT, Best LA, Stoica G (2008). "Accuracy of effect size estimates from published psychological research" (http://mtbradley.com/brandbradelybeststoicapdf.pdf). Perceptual and Motor Skills 106 (2): 645–649. doi:10.2466/PMS.106.2.645-649.
PMID 18556917.
[5] Brand A, Bradley MT, Best LA, Stoica G (2011). "Multiple trials may yield exaggerated effect size estimates" (http://www.ipsychexpts.com/brand_et_al_(2011).pdf). The Journal of General Psychology 138 (1): 1–11. doi:10.1080/00221309.2010.520360.
[6] Jacob Cohen (1988). Statistical Power Analysis for the Behavioral Sciences (second ed.). Lawrence Erlbaum Associates.
[7] Cohen, J (1992). "A power primer". Psychological Bulletin 112 (1): 155–159. doi:10.1037/0033-2909.112.1.155. PMID 19565683.
[8] Bortz, 1999, p. 269f.;
[9] Bühner & Ziegler (2009, p. 413f)
[10] Tabachnick & Fidell (2007, p. 55)
[11] Olejnik, S. & Algina, J. (2003). "Generalized Eta and Omega Squared Statistics: Measures of Effect Size for Some Common Research Designs". Psychological Methods 8 (4): 434–447. http://cps.nova.edu/marker/olejnik2003.pdf

[12] Larry V. Hedges & Ingram Olkin (1985). Statistical Methods for Meta-Analysis. Orlando: Academic Press. ISBN 0-12-336380-2.
[13] Chapter 13 (http://davidakenny.net/statbook/chapter_13.pdf), page 215, in: Kenny, David A. (1987). Statistics for the social and behavioral sciences. Boston: Little, Brown. ISBN 0-316-48915-8.
[14] Joachim Hartung, Guido Knapp & Bimal K. Sinha (2008). Statistical Meta-Analysis with Application. Hoboken, New Jersey: Wiley.
[15] Larry V. Hedges (1981). "Distribution theory for Glass's estimator of effect size and related estimators". Journal of Educational Statistics 6
(2): 107–128. doi:10.3102/10769986006002107.
[16] Aaron, B., Kromrey, J. D., & Ferron, J. M. (1998, November). Equating r-based and d-based effect-size indices: Problems with a commonly recommended formula. (http://www.eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?_nfpb=true&_&ERICExtSearch_SearchValue_0=ED433353&ERICExtSearch_SearchType_0=no&accno=ED433353) Paper presented at the annual meeting of the Florida Educational Research Association, Orlando, FL. (ERIC Document Reproduction Service No. ED433353)
[17] Sheskin, David J. (1997). Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Fl: CRC Press.
[18] Deeks J (1998). "When can odds ratios mislead? : Odds ratios should be used only in case-control studies and logistic regression analyses".
BMJ 317 (7166): 1155–6. PMC 1114127. PMID 9784470.
[19] Medical University of South Carolina. Odds ratio versus relative risk (http://www.musc.edu/dc/icrebm/oddsratio.html). Accessed on: September 8, 2005.
[20] Zhang, J.; Yu, K. (1998). "What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes". JAMA:
the Journal of the American Medical Association 280 (19): 1690–1691. doi:10.1001/jama.280.19.1690. PMID 9832001.
[21] Greenland, S. (2004). "Model-based Estimation of Relative Risks and Other Epidemiologic Measures in Studies of Common Outcomes and
in Case-Control Studies". American Journal of Epidemiology 160 (4): 301–305. doi:10.1093/aje/kwh221. PMID 15286014.
[22] Russell V. Lenth. "Java applets for power and sample size" (http://www.stat.uiowa.edu/~rlenth/Power/). Division of Mathematical Sciences, College of Liberal Arts, The University of Iowa. Retrieved 2008-10-08.

Further reading
• Aaron, B., Kromrey, J. D., & Ferron, J. M. (1998, November). Equating r-based and d-based effect-size indices:
Problems with a commonly recommended formula. Paper presented at the annual meeting of the Florida
Educational Research Association, Orlando, FL. (ERIC Document Reproduction Service No. ED433353) (http://www.eric.ed.gov/ERICWebPortal/contentdelivery/servlet/ERICServlet?accno=ED433353)
• Bonett, D.G. (2008). Confidence intervals for standardized linear contrasts of means, Psychological Methods, 13,
99-109.
• Bonett, D.G. (2009). Estimating standardized linear contrasts of means with desired precision. Psychological Methods, 14, 1-5.
• Cumming, G. and Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals
that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 530–572.
• Kelley, K. (2007). Confidence intervals for standardized effect sizes: Theory, application, and implementation.
Journal of Statistical Software, 20(8), 1-24. (http://www.jstatsoft.org/v20/i08/paper)
• Lipsey, M.W., & Wilson, D.B. (2001). Practical meta-analysis. Sage: Thousand Oaks, CA.

External links
Software
• compute.es: Compute Effect Sizes (http://cran.r-project.org/web/packages/compute.es/index.html) (R
package)
• MIX 2.0 (http://www.meta-analysis-made-easy.com) Software for professional meta-analysis in Excel. Many
effect sizes available.
• Effect Size Calculators (http://myweb.polyu.edu.hk/~mspaul/calculator/calculator.html) Calculate d and r
from a variety of statistics.
• Free Effect Size Generator (http://www.clintools.com/victims/resources/software/effectsize/effect_size_generator.html) - PC & Mac Software
• MBESS (http://cran.r-project.org/web/packages/MBESS/index.html) - One of R's packages providing confidence intervals of effect sizes based on noncentrality parameters
• Free GPower Software (http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/) - PC & Mac Software

• Free Effect Size Calculator for Multiple Regression (http://www.danielsoper.com/statcalc/calc05.aspx) - Web Based
• Free Effect Size Calculator for Hierarchical Multiple Regression (http://www.danielsoper.com/statcalc/calc13.aspx) - Web Based
• Copylefted Online Calculator for Noncentral t, Chisquare, and F Distributions (http://www.traba.org/wikitraba/index.php/Online_Calculator_for_Noncentrality_Distributions) - Collaborative Wiki Page Powered by R
• ES-Calc: a free add-on for Effect Size Calculation in ViSta 'The Visual Statistics System' (http://www.mdp.edu.ar/psicologia/vista/vista.htm). Computes Cohen's d, Glass's Delta, Hedges' g, CLES, Non-Parametric Cliff's Delta, d-to-r Conversion, etc.
Further Explanations
• Effect Size (ES) (http://www.uccs.edu/~faculty/lbecker/es.htm)
• EffectSizeFAQ.com (http://effectsizefaq.com/)
• Measuring Effect Size (http://davidmlane.com/hyperstat/effect_size.html)
• Effect size for two independent groups (http://web.uccs.edu/lbecker/Psy590/es.htm#II.independent)
• Effect size for two dependent groups (http://web.uccs.edu/lbecker/Psy590/es.htm#III.Effect size measures
for two dependent)
• Computing and Interpreting Effect size Measures with ViSta (http://www.tqmp.org/doc/vol5-1/p25-34.pdf)
Erlang distribution 179

Erlang distribution
Erlang

[Plots of the probability density function and the cumulative distribution function omitted.]

Parameters: shape k ∈ {1, 2, …} (integer); rate λ > 0 (real)
  alt.: scale μ = 1/λ > 0 (real)
Support: x ∈ [0, ∞)
PDF: λ^k x^(k−1) e^(−λx) / (k − 1)!
CDF: γ(k, λx) / (k − 1)! = 1 − Σ_{n=0}^{k−1} e^(−λx) (λx)^n / n!
Mean: k/λ
Median: no simple closed form
Mode: (k − 1)/λ for k ≥ 1
Variance: k/λ²
Skewness: 2/√k
Ex. kurtosis: 6/k
Entropy: (1 − k)ψ(k) + ln(Γ(k)/λ) + k
MGF: (1 − t/λ)^(−k) for t < λ
CF: (1 − it/λ)^(−k)

The Erlang distribution is a continuous probability distribution with wide applicability primarily due to its relation
to the exponential and Gamma distributions. The Erlang distribution was developed by A. K. Erlang to examine the
number of telephone calls which might be made at the same time to the operators of the switching stations. This
work on telephone traffic engineering has been expanded to consider waiting times in queueing systems in general.
The distribution is now used in the fields of stochastic processes and of biomathematics.

Overview
The distribution is a continuous distribution, which has a positive value for all real numbers greater than zero, and is
given by two parameters: the shape k, which is a positive integer, and the rate λ, which is a positive real number. The
distribution is sometimes defined using the inverse of the rate parameter, the scale μ = 1/λ. It is the distribution of
the sum of k independent exponential variables, each with mean 1/λ.
When the shape parameter equals 1, the distribution simplifies to the exponential distribution. The Erlang
distribution is a special case of the Gamma distribution where the shape parameter is an integer. In the Gamma
distribution, this parameter is not restricted to the integers.

Characterization

Probability density function


The probability density function of the Erlang distribution is

    f(x; k, λ) = λ^k x^(k−1) e^(−λx) / (k − 1)!    for x, λ ≥ 0.

The parameter k is called the shape parameter and the parameter λ is called the rate parameter. An alternative, but
equivalent, parametrization uses the scale parameter μ, which is the reciprocal of the rate parameter (i.e., μ = 1/λ):

    f(x; k, μ) = x^(k−1) e^(−x/μ) / ( μ^k (k − 1)! )    for x, μ ≥ 0.

When the scale parameter μ equals 2, the distribution simplifies to the chi-squared distribution with 2k degrees of
freedom. It can therefore be regarded as a generalized chi-squared distribution.
Because of the factorial function in the denominator, the Erlang distribution is only defined when the parameter k is
a positive integer. In fact, this distribution is sometimes called the Erlang-k distribution (e.g., an Erlang-2
distribution is an Erlang distribution with k = 2). The Gamma distribution generalizes the Erlang by allowing k to be
any positive real number, using the gamma function instead of the factorial function.

Cumulative distribution function (CDF)


The cumulative distribution function of the Erlang distribution is

    F(x; k, λ) = γ(k, λx) / (k − 1)!

where γ(·, ·) is the lower incomplete gamma function. The CDF may also be expressed as

    F(x; k, λ) = 1 − Σ_{n=0}^{k−1} (1/n!) e^(−λx) (λx)^n
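The PDF and the series form of the CDF can be checked against each other numerically (a pure-Python sketch; the parameter values are arbitrary):

```python
import math

def erlang_pdf(x, k, lam):
    """Erlang PDF: lam^k * x^(k-1) * exp(-lam*x) / (k-1)!  for x >= 0."""
    if x < 0:
        return 0.0
    return lam ** k * x ** (k - 1) * math.exp(-lam * x) / math.factorial(k - 1)

def erlang_cdf(x, k, lam):
    """Series form of the CDF: 1 - sum_{n=0}^{k-1} exp(-lam*x) (lam*x)^n / n!"""
    if x <= 0:
        return 0.0
    return 1.0 - sum(math.exp(-lam * x) * (lam * x) ** n / math.factorial(n)
                     for n in range(k))

k, lam = 3, 2.0

# Numerically integrating the PDF should reproduce the series-form CDF
x_hi, steps = 4.0, 40000
h = x_hi / steps
integral = sum(erlang_pdf(i * h, k, lam) * h for i in range(steps))
```

With k = 1 the CDF reduces to that of the exponential distribution, 1 − e^(−λx), as the text notes.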

Occurrence

Waiting times
Events that occur independently with some average rate are modeled with a Poisson process. The waiting times
between k occurrences of the event are Erlang distributed. (The related question of the number of events in a given
amount of time is described by the Poisson distribution.)
The Erlang distribution, which measures the time between incoming calls, can be used in conjunction with the
expected duration of incoming calls to produce information about the traffic load measured in Erlang units. This can
be used to determine the probability of packet loss or delay, according to various assumptions made about whether
blocked calls are aborted (Erlang B formula) or queued until served (Erlang C formula). The Erlang-B and C
formulae are still in everyday use for traffic modeling for applications such as the design of call centers.
A. K. Erlang worked extensively in traffic modeling. There are thus two other Erlang formulae, both used in modeling
traffic:
Erlang B distribution: this is the easier of the two, and can be used, for example, in a call centre to calculate the
number of trunks one needs to carry a certain amount of phone traffic with a certain "target service".
Erlang C distribution: this formula is much more difficult and is often used, for example, to calculate how long
callers will have to wait before being connected to a human in a call centre or similar situation.
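The waiting-time connection can be illustrated by simulation: summing k exponential inter-arrival gaps of a Poisson process yields Erlang-distributed waits (a pure-Python sketch; the rate and shape values are arbitrary):

```python
import random

random.seed(42)

k, lam = 3, 2.0      # wait for every 3rd event of a rate-2 Poisson process
trials = 20000

# Each waiting time is a sum of k independent Exponential(lam) gaps
waits = [sum(random.expovariate(lam) for _ in range(k)) for _ in range(trials)]

emp_mean = sum(waits) / trials
emp_var = sum((w - emp_mean) ** 2 for w in waits) / (trials - 1)

# Erlang(k, lam) theory predicts mean k/lam = 1.5 and variance k/lam^2 = 0.75
```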

Stochastic processes
The Erlang distribution is the distribution of the sum of k independent identically distributed random variables,
each having an exponential distribution. The long-run rate at which events occur is the reciprocal of the expectation
of the sum, that is λ/k. The (age specific event) rate of the Erlang distribution is, for k > 1, monotonic in x,
increasing from zero at x = 0 to λ as x tends to infinity.[1]

Related distributions
• If X ~ Exponential(λ) then X ~ Erlang(1, λ) (exponential distribution)
• Erlang(k, λ) ≈ N(k/λ, k/λ²) for large k (normal distribution)
• If X ~ Erlang(k1, λ) and Y ~ Erlang(k2, λ) are independent, then X + Y ~ Erlang(k1 + k2, λ)
• If X ~ Erlang(k, 1/2) then X ~ χ²(2k) (chi-squared distribution)
• The Erlang distribution is a special case of the type 3 Pearson distribution
• If X ~ Gamma(k, θ) with integer shape k (gamma distribution) then X ~ Erlang(k, 1/θ)
• If X ~ Erlang(k, λ) and c > 0 then cX ~ Erlang(k, λ/c)

Notes
[1] Cox, D.R. (1967) Renewal Theory, p20, Methuen.

References
• Ian Angus "An Introduction to Erlang B and Erlang C" (http://www.tarrani.net/linda/ErlangBandC.pdf),
Telemanagement #187 (PDF Document - Has terms and formulae plus short biography)

External links
• Erlang Distribution (http://www.xycoon.com/erlang.htm)
• Resource Dimensioning Using Erlang-B and Erlang-C (http://www.eventhelix.com/RealtimeMantra/CongestionControl/resource_dimensioning_erlang_b_c.htm)

• Erlang-C (http://www.kooltoolz.com/Erlang-C.htm)

Expectation–maximization algorithm
In statistics, an expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood
or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on
unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a
function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a
maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step.
These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.

History
The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and
Donald Rubin.[1] They pointed out that the method had been "proposed many times in special circumstances" by
earlier authors. In particular, a very detailed treatment of the EM method for exponential families was published by
Rolf Sundberg in his thesis and several papers[2][3][4] following his collaboration with Per Martin-Löf and Anders
Martin-Löf.[5][6][7][8][9][10][11] The Dempster-Laird-Rubin paper in 1977 generalized the method and sketched a
convergence analysis for a wider class of problems. Regardless of earlier inventions, the innovative
Dempster-Laird-Rubin paper in the Journal of the Royal Statistical Society received an enthusiastic discussion at the
Royal Statistical Society meeting with Sundberg calling the paper "brilliant". The Dempster-Laird-Rubin paper
established the EM method as an important tool of statistical analysis.
The convergence analysis of the Dempster-Laird-Rubin paper was flawed and a correct convergence analysis was
published by C. F. Jeff Wu in 1983. Wu's proof established the EM method's convergence outside of the exponential
family, as claimed by Dempster-Laird-Rubin.[12]

Introduction
The EM algorithm is used to find the maximum likelihood parameters of a statistical model in cases where the
equations cannot be solved directly. Typically these models involve latent variables in addition to unknown
parameters and known data observations. That is, either there are missing values among the data, or the model can be
formulated more simply by assuming the existence of additional unobserved data points. (For example, a mixture
model can be described more simply by assuming that each observed data point has a corresponding unobserved data
point, or latent variable, specifying the mixture component that each data point belongs to.)
Finding a maximum likelihood solution requires taking the derivatives of the likelihood function with respect to all
the unknown values — i.e. both the parameters and the latent variables — and simultaneously solving the resulting
equations. In statistical models with latent variables, this usually is not possible. Instead, the result is typically a set
of interlocking equations in which the solution to the parameters requires the values of the latent variables and
vice-versa, but substituting one set of equations into the other produces an unsolvable equation.
The EM algorithm proceeds from the observation that the following is a way to solve these two sets of equations
numerically. One can simply pick arbitrary values for one of the two sets of unknowns, use them to estimate the
second set, then use these new values to find a better estimate of the first set, and then keep alternating between the
two until the resulting values both converge to fixed points. It's not obvious that this will work at all, but in fact it can
be proven that in this particular context it does, and that the value is a local maximum of the likelihood function. In
general there may be multiple maxima, and no guarantee that the global maximum will be found. Some likelihoods
also have singularities in them, i.e. nonsensical maxima. For example, one of the "solutions" that may be found by
EM in a mixture model involves setting one of the components to have zero variance and the mean parameter for the
Expectationmaximization algorithm 183

same component to be equal to one of the data points.

Description
Given a statistical model consisting of a set $\mathbf{X}$ of observed data, a set of unobserved latent data or missing values $\mathbf{Z}$, and a vector of unknown parameters $\boldsymbol\theta$, along with a likelihood function $L(\boldsymbol\theta; \mathbf{X}, \mathbf{Z}) = p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta)$, the maximum likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the observed data

$$L(\boldsymbol\theta; \mathbf{X}) = p(\mathbf{X} \mid \boldsymbol\theta) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta).$$

However, this quantity is often intractable.


The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying the following two steps:
Expectation step (E step): Calculate the expected value of the log likelihood function, with respect to the conditional distribution of $\mathbf{Z}$ given $\mathbf{X}$ under the current estimate of the parameters $\boldsymbol\theta^{(t)}$:

$$Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}) = \mathrm{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta^{(t)}}\!\left[\log L(\boldsymbol\theta; \mathbf{X}, \mathbf{Z})\right]$$

Maximization step (M step): Find the parameter that maximizes this quantity:

$$\boldsymbol\theta^{(t+1)} = \arg\max_{\boldsymbol\theta}\, Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)})$$
Note that in typical models to which EM is applied:


1. The observed data points may be discrete (taking values in a finite or countably infinite set) or continuous
(taking values in an uncountably infinite set). There may in fact be a vector of observations associated with each
data point.
2. The missing values (aka latent variables) are discrete, drawn from a fixed number of values, and there is one
latent variable per observed data point.
3. The parameters are continuous, and are of two kinds: Parameters that are associated with all data points, and
parameters associated with a particular value of a latent variable (i.e. associated with all data points whose
corresponding latent variable has a particular value).
However, it is possible to apply EM to other sorts of models.
The motivation is as follows. If we know the value of the parameters , we can usually find the value of the latent
variables by maximizing the log-likelihood over all possible values of , either simply by iterating over or
through an algorithm such as the Viterbi algorithm for hidden Markov models. Conversely, if we know the value of
the latent variables , we can find an estimate of the parameters fairly easily, typically by simply grouping the
observed data points according to the value of the associated latent variable and averaging the values, or some
function of the values, of the points in each group. This suggests an iterative algorithm, in the case where both
and are unknown:
1. First, initialize the parameters to some random values.
2. Compute the best value for given these parameter values.
3. Then, use the just-computed values of to compute a better estimate for the parameters . Parameters
associated with a particular value of will use only those data points whose associated latent variable has that
value.
4. Iterate steps 2 and 3 until convergence.
The algorithm as just described monotonically approaches a local minimum of the cost function, and is commonly
called hard EM. The k-means algorithm is an example of this class of algorithms.
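The alternation in steps 1–4 can be sketched in a few lines. The following is an illustrative toy implementation of hard EM in the k-means style, not code from the source: scalar data, squared-distance assignment, and per-group means as the parameter update (all names are ours).

```python
import random

def hard_em(points, k, iters=100, seed=0):
    """Hard EM (k-means style): alternate assigning each point to its best
    latent value (step 2) and re-estimating parameters as per-group
    means (step 3) until a fixed point is reached (step 4)."""
    rng = random.Random(seed)
    means = rng.sample(points, k)          # step 1: random initialization
    for _ in range(iters):
        # Step 2: best latent value for each point given current parameters.
        assign = [min(range(k), key=lambda j: (p - means[j]) ** 2)
                  for p in points]
        # Step 3: re-estimate each parameter from only its own group.
        new_means = []
        for j in range(k):
            group = [p for p, a in zip(points, assign) if a == j]
            new_means.append(sum(group) / len(group) if group else means[j])
        if new_means == means:             # step 4: stop at a fixed point
            break
        means = new_means
    return sorted(means)

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
print(hard_em(data, 2))  # two cluster means, near 1.0 and 5.0
```

As the text notes, the result depends on the initialization and is only a local optimum of the corresponding cost function.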
However, we can do somewhat better by, rather than making a hard choice for given the current parameter values
and averaging only over the set of data points associated with a particular value of , instead determining the
probability of each possible value of for each data point, and then using the probabilities associated with a particular
value of to compute a weighted average over the entire set of data points. The resulting algorithm is commonly
called soft EM, and is the type of algorithm normally associated with EM. The counts used to compute these
weighted averages are called soft counts (as opposed to the hard counts used in a hard-EM-type algorithm such as
k-means). The probabilities computed for are posterior probabilities and are what is computed in the E step. The soft
counts used to compute new parameter values are what is computed in the M step.

Properties
Speaking of an expectation (E) step is a bit of a misnomer. What is calculated in the first step are the fixed,
data-dependent parameters of the function Q. Once the parameters of Q are known, it is fully determined and is
maximized in the second (M) step of an EM algorithm.
Although an EM iteration does increase the observed data (i.e. marginal) likelihood function, there is no guarantee
that the sequence converges to a maximum likelihood estimator. For multimodal distributions, this means that an EM
algorithm may converge to a local maximum of the observed data likelihood function, depending on starting values.
There are a variety of heuristic or metaheuristic approaches for escaping a local maximum such as random restart
(starting with several different random initial estimates θ(t)), or applying simulated annealing methods.
EM is particularly useful when the likelihood is an exponential family: the E step becomes the sum of expectations
of sufficient statistics, and the M step involves maximizing a linear function. In such a case, it is usually possible to
derive closed form updates for each step, using the Sundberg formula (published by Rolf Sundberg using
unpublished results of Per Martin-Löf and Anders Martin-Löf).[3][4][7][8][9][10][11]
The EM method was modified to compute maximum a posteriori (MAP) estimates for Bayesian inference in the
original paper by Dempster, Laird, and Rubin.
There are other methods for finding maximum likelihood estimates, such as gradient descent, conjugate gradient or
variations of the Gauss–Newton method. Unlike EM, such methods typically require the evaluation of first and/or
second derivatives of the likelihood function.

Proof of correctness
Expectation-maximization works to improve $Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)})$ rather than directly improving $\log p(\mathbf{X} \mid \boldsymbol\theta)$. Here we show that improvements to the former imply improvements to the latter.[13]

For any $\mathbf{Z}$ with non-zero probability $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta)$, we can write

$$\log p(\mathbf{X} \mid \boldsymbol\theta) = \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta) - \log p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta).$$

We take the expectation over values of $\mathbf{Z}$ by multiplying both sides by $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta^{(t)})$ and summing (or integrating) over $\mathbf{Z}$. The left-hand side is the expectation of a constant, so we get:

$$\log p(\mathbf{X} \mid \boldsymbol\theta) = \mathrm{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta^{(t)}}\!\left[\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta)\right] - \mathrm{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta^{(t)}}\!\left[\log p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta)\right] = Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}) + H(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}),$$

where $H(\boldsymbol\theta \mid \boldsymbol\theta^{(t)})$ is defined by the negated sum it is replacing. This last equation holds for any value of $\boldsymbol\theta$ including $\boldsymbol\theta = \boldsymbol\theta^{(t)}$,

$$\log p(\mathbf{X} \mid \boldsymbol\theta^{(t)}) = Q(\boldsymbol\theta^{(t)} \mid \boldsymbol\theta^{(t)}) + H(\boldsymbol\theta^{(t)} \mid \boldsymbol\theta^{(t)}),$$

and subtracting this last equation from the previous equation gives

$$\log p(\mathbf{X} \mid \boldsymbol\theta) - \log p(\mathbf{X} \mid \boldsymbol\theta^{(t)}) = Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}) - Q(\boldsymbol\theta^{(t)} \mid \boldsymbol\theta^{(t)}) + H(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}) - H(\boldsymbol\theta^{(t)} \mid \boldsymbol\theta^{(t)}).$$

However, Gibbs' inequality tells us that $H(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}) \ge H(\boldsymbol\theta^{(t)} \mid \boldsymbol\theta^{(t)})$, so we can conclude that

$$\log p(\mathbf{X} \mid \boldsymbol\theta) - \log p(\mathbf{X} \mid \boldsymbol\theta^{(t)}) \ge Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}) - Q(\boldsymbol\theta^{(t)} \mid \boldsymbol\theta^{(t)}).$$

In words, choosing $\boldsymbol\theta$ to improve $Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)})$ beyond $Q(\boldsymbol\theta^{(t)} \mid \boldsymbol\theta^{(t)})$ will improve $\log p(\mathbf{X} \mid \boldsymbol\theta)$ beyond $\log p(\mathbf{X} \mid \boldsymbol\theta^{(t)})$ at least as much.

Alternative description
Under some circumstances, it is convenient to view the EM algorithm as two alternating maximization steps.[14][15]
Consider the function:

$$F(q, \theta) = \mathrm{E}_q\!\left[\log L(\theta; x, Z)\right] + H(q) = -D_{\mathrm{KL}}\big(q \,\big\|\, p_{Z \mid X}(\cdot \mid x; \theta)\big) + \log L(\theta; x),$$

where q is an arbitrary probability distribution over the unobserved data z, pZ|X(· |x;θ) is the conditional distribution of the unobserved data given the observed data x, H is the entropy and DKL is the Kullback–Leibler divergence.
Then the steps in the EM algorithm may be viewed as:
Expectation step: Choose q to maximize F:

$$q^{(t)} = \arg\max_q F(q, \theta^{(t)})$$

Maximization step: Choose θ to maximize F:

$$\theta^{(t+1)} = \arg\max_\theta F(q^{(t)}, \theta)$$

Applications
EM is frequently used for data clustering in machine learning and computer vision. In natural language processing,
two prominent instances of the algorithm are the Baum-Welch algorithm (also known as forward-backward) and the
inside-outside algorithm for unsupervised induction of probabilistic context-free grammars.
In psychometrics, EM is almost indispensable for estimating item parameters and latent abilities of item response
theory models.
With its ability to handle missing data and observe unidentified variables, EM is becoming a useful tool for pricing and managing the risk of a portfolio.
The EM algorithm (and its faster variant Ordered subset expectation maximization) is also widely used in medical
image reconstruction, especially in positron emission tomography and single photon emission computed
tomography. See below for other faster variants of EM.

Filtering and Smoothing EM Algorithms


A Kalman filter is typically used for on-line state estimation and a minimum-variance smoother may be employed
for off-line or batch state estimation. However, these minimum-variance solutions require estimates of the
state-space model parameters. EM algorithms can be used for solving joint state and parameter estimation problems.
Filtering and smoothing EM algorithms arise by repeating the following two-step procedure.

E-step: Operate a Kalman filter or a minimum-variance smoother designed with current parameter estimates to obtain updated state estimates.

M-step: Use the filtered or smoothed state estimates within maximum-likelihood calculations to obtain updated parameter estimates.

Suppose that a Kalman filter or minimum-variance smoother operates on noisy measurements of a single-input-single-output system. An updated measurement noise variance estimate can be calculated from

$$\widehat{\sigma}^2_v = \frac{1}{N} \sum_{k=1}^N (z_k - \widehat{z}_k)^2,$$

where $\widehat{z}_k$ are scalar output estimates calculated by a filter or a smoother from N scalar measurements $z_k$.
Similarly, for a first-order auto-regressive process, an updated process noise variance estimate can be calculated by

$$\widehat{\sigma}^2_w = \frac{1}{N} \sum_{k=1}^N (\widehat{x}_{k+1} - \widehat{F}\widehat{x}_k)^2,$$

where $\widehat{x}_k$ and $\widehat{x}_{k+1}$ are scalar state estimates calculated by a filter or a smoother. The updated model coefficient estimate is obtained via

$$\widehat{F} = \frac{\sum_{k=1}^N \widehat{x}_{k+1}\widehat{x}_k}{\sum_{k=1}^N \widehat{x}_k^2}.$$
The convergence of the above parameter estimates is studied in [16] [17].

Variants
A number of methods have been proposed to accelerate the sometimes slow convergence of the EM algorithm, such
as those utilising conjugate gradient and modified Newton–Raphson techniques.[18] Additionally EM can be utilised
with constrained estimation techniques.
Expectation conditional maximization (ECM) replaces each M step with a sequence of conditional maximization
(CM) steps in which each parameter θi is maximized individually, conditionally on the other parameters remaining
fixed.[19]
This idea is further extended in generalized expectation maximization (GEM) algorithm, in which one only seeks
an increase in the objective function F for both the E step and M step under the alternative description.[14]
It is also possible to consider the EM algorithm as a subclass of the MM (Majorize/Minimize or Minorize/Maximize,
depending on context) algorithm,[20] and therefore use any machinery developed in the more general case.

Relation to variational Bayes methods


EM is a partially non-Bayesian, maximum likelihood method. Its final result gives a probability distribution over the
latent variables (in the Bayesian style) together with a point estimate for θ (either a maximum likelihood estimate or
a posterior mode). We may want a fully Bayesian version of this, giving a probability distribution over θ as well as
the latent variables. In fact the Bayesian approach to inference is simply to treat θ as another latent variable. In this
paradigm, the distinction between the E and M steps disappears. If we use the factorized Q approximation as
described above (variational Bayes), we may iterate over each latent variable (now including θ) and optimize them
one at a time. There are now k steps per iteration, where k is the number of latent variables. For graphical models this
is easy to do as each variable's new Q depends only on its Markov blanket, so local message passing can be used for
efficient inference.

α-EM algorithm
The Q-function used in the EM algorithm is based on the log likelihood. Therefore, it is regarded as the log-EM
algorithm. The use of the log likelihood can be generalized to that of the α-log likelihood ratio. Then, the α-log
likelihood ratio of the observed data can be exactly expressed as equality by using the Q-function of the α-log
likelihood ratio and the α-divergence. Obtaining this Q-function is a generalized E step. Its maximization is a
generalized M step. This pair is called the α-EM algorithm,[21] which contains the log-EM algorithm as its subclass.
Thus, the α-EM algorithm by Yasuo Matsuyama is an exact generalization of the log-EM algorithm. No computation
of gradient or Hessian matrix is needed. The α-EM shows faster convergence than the log-EM algorithm by
choosing an appropriate α. The α-EM algorithm leads to a faster version of the hidden Markov model estimation algorithm, the α-HMM.[22]

Geometric interpretation
In information geometry, the E step and the M step are interpreted as projections under dual affine connections,
called the e-connection and the m-connection; the Kullback–Leibler divergence can also be understood in these
terms.

Examples

Gaussian mixture
Let x = (x1,x2,…,xn) be a sample of n independent observations from a mixture of two multivariate normal distributions of dimension d, and let z = (z1,z2,…,zn) be the latent variables that determine the component from which the observation originates.[15]

$$X_i \mid (Z_i = 1) \sim \mathcal{N}_d(\boldsymbol\mu_1, \Sigma_1) \quad\text{and}\quad X_i \mid (Z_i = 2) \sim \mathcal{N}_d(\boldsymbol\mu_2, \Sigma_2),$$

where

$$P(Z_i = 1) = \tau_1 \quad\text{and}\quad P(Z_i = 2) = \tau_2 = 1 - \tau_1.$$
The aim is to estimate the unknown parameters representing the "mixing" value between the Gaussians and the means and covariances of each:

$$\theta = \big(\boldsymbol\tau, \boldsymbol\mu_1, \boldsymbol\mu_2, \Sigma_1, \Sigma_2\big),$$

where the likelihood function is:

$$L(\theta; \mathbf{x}, \mathbf{z}) = P(\mathbf{x}, \mathbf{z} \mid \theta) = \prod_{i=1}^n \sum_{j=1}^2 \mathbb{I}(z_i = j)\, \tau_j\, f(\mathbf{x}_i; \boldsymbol\mu_j, \Sigma_j),$$

where $\mathbb{I}$ is an indicator function and f is the probability density function of a multivariate normal. This may be rewritten in exponential family form:

$$L(\theta; \mathbf{x}, \mathbf{z}) = \exp\left\{\sum_{i=1}^n \sum_{j=1}^2 \mathbb{I}(z_i = j)\left[\log\tau_j - \tfrac{1}{2}\log|\Sigma_j| - \tfrac{1}{2}(\mathbf{x}_i - \boldsymbol\mu_j)^\top \Sigma_j^{-1}(\mathbf{x}_i - \boldsymbol\mu_j) - \tfrac{d}{2}\log(2\pi)\right]\right\}.$$

(Figure: an animation demonstrating the EM algorithm fitting a two-component Gaussian mixture model to the Old Faithful dataset, stepping from a random initialization to convergence.)

To see the last equality, note that for each i all indicators $\mathbb{I}(z_i = j)$ are equal to zero, except for one which is equal to one. The inner sum thus reduces to a single term.

E step
Given our current estimate of the parameters θ(t), the conditional distribution of the Zi is determined by Bayes' theorem to be the proportional height of the normal density weighted by τ:

$$T_{j,i}^{(t)} := P(Z_i = j \mid X_i = \mathbf{x}_i; \theta^{(t)}) = \frac{\tau_j^{(t)}\, f(\mathbf{x}_i; \boldsymbol\mu_j^{(t)}, \Sigma_j^{(t)})}{\tau_1^{(t)}\, f(\mathbf{x}_i; \boldsymbol\mu_1^{(t)}, \Sigma_1^{(t)}) + \tau_2^{(t)}\, f(\mathbf{x}_i; \boldsymbol\mu_2^{(t)}, \Sigma_2^{(t)})}.$$

Thus, the E step results in the function:

$$Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^n \sum_{j=1}^2 T_{j,i}^{(t)} \left[\log\tau_j - \tfrac{1}{2}\log|\Sigma_j| - \tfrac{1}{2}(\mathbf{x}_i - \boldsymbol\mu_j)^\top \Sigma_j^{-1}(\mathbf{x}_i - \boldsymbol\mu_j) - \tfrac{d}{2}\log(2\pi)\right].$$
M step
The quadratic form of Q(θ|θ(t)) means that determining the maximising values of θ is relatively straightforward.
Firstly note that τ, (μ1,Σ1) and (μ2,Σ2) may be all maximised independently of each other since they all appear in
separate linear terms.
Firstly, consider τ, which has the constraint τ1 + τ2 = 1:

$$\boldsymbol\tau^{(t+1)} = \arg\max_{\boldsymbol\tau} \sum_{i=1}^n \sum_{j=1}^2 T_{j,i}^{(t)} \log \tau_j.$$

This has the same form as the MLE for the binomial distribution, so:

$$\tau_j^{(t+1)} = \frac{1}{n}\sum_{i=1}^n T_{j,i}^{(t)}.$$

For the next estimates of (μ1,Σ1):

$$(\boldsymbol\mu_1^{(t+1)}, \Sigma_1^{(t+1)}) = \arg\max_{\boldsymbol\mu_1, \Sigma_1} \sum_{i=1}^n T_{1,i}^{(t)}\left[-\tfrac{1}{2}\log|\Sigma_1| - \tfrac{1}{2}(\mathbf{x}_i - \boldsymbol\mu_1)^\top \Sigma_1^{-1}(\mathbf{x}_i - \boldsymbol\mu_1)\right].$$

This has the same form as a weighted MLE for a normal distribution, so

$$\boldsymbol\mu_1^{(t+1)} = \frac{\sum_{i=1}^n T_{1,i}^{(t)}\, \mathbf{x}_i}{\sum_{i=1}^n T_{1,i}^{(t)}} \quad\text{and}\quad \Sigma_1^{(t+1)} = \frac{\sum_{i=1}^n T_{1,i}^{(t)}\, (\mathbf{x}_i - \boldsymbol\mu_1^{(t+1)})(\mathbf{x}_i - \boldsymbol\mu_1^{(t+1)})^\top}{\sum_{i=1}^n T_{1,i}^{(t)}},$$

and, by symmetry, the same expressions with $T_{2,i}^{(t)}$ give $\boldsymbol\mu_2^{(t+1)}$ and $\Sigma_2^{(t+1)}$.
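For concreteness, the E and M updates above can be sketched in one dimension (d = 1, so each Σj reduces to a scalar variance). This is an illustrative toy implementation, not code from the source; the initialization, iteration count, and variance floor are our own choices.

```python
import math

def normal_pdf(x, mu, var):
    """Univariate normal density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, iters=200):
    """Soft EM for a two-component 1-D Gaussian mixture."""
    # Crude initialization from the data range (an arbitrary choice).
    tau, mu, var = [0.5, 0.5], [min(xs), max(xs)], [1.0, 1.0]
    for _ in range(iters):
        # E step: posterior membership probabilities T[j][i] (soft counts).
        T = [[0.0] * len(xs), [0.0] * len(xs)]
        for i, x in enumerate(xs):
            w = [tau[j] * normal_pdf(x, mu[j], var[j]) for j in (0, 1)]
            s = w[0] + w[1]
            T[0][i], T[1][i] = w[0] / s, w[1] / s
        # M step: weighted MLEs for tau_j, mu_j, var_j.
        for j in (0, 1):
            n_j = sum(T[j])
            tau[j] = n_j / len(xs)
            mu[j] = sum(t * x for t, x in zip(T[j], xs)) / n_j
            # Floor the variance to sidestep the zero-variance singularity
            # mentioned in the introduction.
            var[j] = max(sum(t * (x - mu[j]) ** 2
                             for t, x in zip(T[j], xs)) / n_j, 1e-6)
    return tau, mu, var

xs = [-2.2, -2.0, -1.8, 1.8, 2.0, 2.2]
tau, mu, var = em_gmm(xs)
print(tau, mu, var)  # component means near -2.0 and 2.0
```

The variance floor is one of the heuristics alluded to earlier for avoiding the degenerate zero-variance "solutions" of mixture likelihoods.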

Truncated and censored regression


The EM algorithm has been implemented in the case where there is an underlying linear regression model explaining
the variation of some quantity, but where the values actually observed are censored or truncated versions of those
represented in the model.[23] Special cases of this model include censored or truncated observations from a single
normal distribution.[23]

Further reading
• Robert Hogg, Joseph McKean and Allen Craig. Introduction to Mathematical Statistics. pp. 359–364. Upper
Saddle River, NJ: Pearson Prentice Hall, 2005.
• The on-line textbook: Information Theory, Inference, and Learning Algorithms [24], by David J.C. MacKay
includes simple examples of the EM algorithm such as clustering using the soft k-means algorithm, and
emphasizes the variational view of the EM algorithm, as described in Chapter 33.7 of version 7.2 (fourth edition).
• Theory and Use of the EM Method [25] by M. R. Gupta and Y. Chen is a well-written short book on EM,
including detailed derivation of EM for GMMs, HMMs, and Dirichlet.
• Bilmes, Jeff. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian
Mixture and Hidden Markov Models. CiteSeerX: 10.1.1.28.613 [26], includes a simplified derivation of the EM
equations for Gaussian Mixtures and Gaussian Mixture Hidden Markov Models.
• Variational Algorithms for Approximate Bayesian Inference [27], by M. J. Beal includes comparisons of EM to
Variational Bayesian EM and derivations of several models including Variational Bayesian HMMs (chapters [28]).
• Dellaert, Frank. The Expectation Maximization Algorithm. CiteSeerX: 10.1.1.9.9735 [29], gives an easier
explanation of EM algorithm in terms of lowerbound maximization.
• The Expectation Maximization Algorithm: A short tutorial [30], A self contained derivation of the EM Algorithm
by Sean Borman.
• The EM Algorithm [31], by Xiaojin Zhu.
• EM algorithm and variants: an informal tutorial [32] by Alexis Roche. A concise and very clear description of EM
and many interesting variants.
• Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. ISBN 0-387-31073-8.
• Einicke, G.A. (2012). Smoothing, Filtering and Prediction: Estimating the Past, Present and Future [33]. Rijeka,
Croatia: Intech. ISBN 978-953-307-752-9.

References
[1] Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal
Statistical Society. Series B (Methodological) 39 (1): 1–38. JSTOR 2984875. MR0501537.
[2] Sundberg, Rolf (1974). "Maximum likelihood theory for incomplete data from an exponential family". Scandinavian Journal of Statistics 1
(2): 49–58. JSTOR 4615553. MR381110.
[3] Rolf Sundberg. 1971. Maximum likelihood theory and applications for distributions generated when observing a function of an exponential
family variable. Dissertation, Institute for Mathematical Statistics, Stockholm University.
[4] Sundberg, Rolf (1976). "An iterative method for solution of the likelihood equations for incomplete data from exponential families".
Communications in Statistics – Simulation and Computation 5 (1): 55–64. doi:10.1080/03610917608812007. MR443190.
[5] See the acknowledgement by Dempster, Laird and Rubin on pages 3, 5 and 11.
[6] G. Kulldorff. 1961. Contributions to the theory of estimation from grouped and partially grouped samples. Almqvist & Wiksell.
[7] Anders Martin-Löf. 1963. "Utvärdering av livslängder i subnanosekundsområdet" ("Evaluation of sub-nanosecond lifetimes"). ("Sundberg
formula")
[8] Per Martin-Löf. 1966. Statistics from the point of view of statistical mechanics. Lecture notes, Mathematical Institute, Aarhus University.
("Sundberg formula" credited to Anders Martin-Löf).
[9] Per Martin-Löf. 1970. Statistika Modeller (Statistical Models): Anteckningar från seminarier läsåret 1969–1970 (Notes from seminars in the
academic year 1969-1970), with the assistance of Rolf Sundberg. Stockholm University. ("Sundberg formula")
[10] Martin-Löf, P. The notion of redundancy and its use as a quantitative measure of the deviation between a statistical hypothesis and a set of
observational data. With a discussion by F. Abildgård, A. P. Dempster, D. Basu, D. R. Cox, A. W. F. Edwards, D. A. Sprott, G. A. Barnard, O.
Barndorff-Nielsen, J. D. Kalbfleisch and G. Rasch and a reply by the author. Proceedings of Conference on Foundational Questions in
Statistical Inference (Aarhus, 1973), pp. 1–42. Memoirs, No. 1, Dept. Theoret. Statist., Inst. Math., Univ. Aarhus, Aarhus, 1974.
[11] Martin-Löf, Per The notion of redundancy and its use as a quantitative measure of the discrepancy between a statistical hypothesis and a set
of observational data. Scand. J. Statist. 1 (1974), no. 1, 3–18.
[12] Wu, C. F. Jeff (Mar. 1983). "On the Convergence Properties of the EM Algorithm". Annals of Statistics 11 (1): 95–103.
doi:10.1214/aos/1176346060. JSTOR 2240463. MR684867.
[13] Little, Roderick J.A.; Rubin, Donald B. (1987). Statistical Analysis with Missing Data. Wiley Series in Probability and Mathematical
Statistics. New York: John Wiley & Sons. pp. 134–136. ISBN 0-471-80254-9.
[14] Neal, Radford; Hinton, Geoffrey (1999). Michael I. Jordan. ed. "A view of the EM algorithm that justifies incremental, sparse, and other
variants" (ftp://ftp.cs.toronto.edu/pub/radford/emk.pdf). Learning in Graphical Models (Cambridge, MA: MIT Press): 355–368.
ISBN 0-262-60032-3. Retrieved 2009-03-22.
[15] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2001). "8.5 The EM algorithm". The Elements of Statistical Learning. New York:
Springer. pp. 236–243. ISBN 0-387-95284-5.
[16] Einicke, G.A.; Malos, J.T.; Reid, D.C.; Hainsworth, D.W. (January 2009). "Riccati Equation and EM Algorithm Convergence for Inertial
Navigation Alignment". IEEE Trans. Signal Processing 57 (1): 370–375
[17] Einicke, G.A.; Falco, G.; Malos, J.T. (May 2010). "EM Algorithm State Matrix Estimation for Navigation". IEEE Signal Processing Letters
17 (5): 437–440
[18] Jamshidian, Mortaza; Jennrich, Robert I. (1997). "Acceleration of the EM Algorithm by using Quasi-Newton Methods". Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 59 (2): 569–587. doi:10.1111/1467-9868.00083. MR1452026.
[19] Meng, Xiao-Li; Rubin, Donald B. (1993). "Maximum likelihood estimation via the ECM algorithm: A general framework". Biometrika 80
(2): 267–278. doi:10.1093/biomet/80.2.267. MR1243503.
[20] Hunter DR and Lange K (2004), A Tutorial on MM Algorithms (http://www.stat.psu.edu/~dhunter/papers/mmtutorial.pdf), The
American Statistician, 58: 30–37
[21] Matsuyama, Yasuo (2003). "The α-EM algorithm: Surrogate likelihood maximization using α-logarithmic information measures". IEEE
Transactions on Information Theory 49 (3): 692–706.
[22] Matsuyama, Yasuo (2011). "Hidden Markov model estimation based on alpha-EM algorithm: Discrete and continuous alpha-HMMs".
International Joint Conference on Neural Networks: 808–816.
[23] Wolynetz, M.S. (1979) "Maximum Likelihood estimation in a Linear model from Confined and Censored Normal Data". Journal of the
Royal Statistical Society (Series C), 28(2), 195–206
[24] http://www.inference.phy.cam.ac.uk/mackay/itila/
[25] http://ee.washington.edu/research/guptalab/publications/EMbookChenGupta2010.pdf
[26] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.613
[27] http://www.cse.buffalo.edu/faculty/mbeal/papers/beal03.pdf
[28] http://www.cse.buffalo.edu/faculty/mbeal/thesis/index.html
[29] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.9.9735
[30] http://www.seanborman.com/publications/EM_algorithm.pdf
[31] http://pages.cs.wisc.edu/~jerryzhu/cs838/EM.pdf
[32] http://arxiv.org/pdf/1105.1476.pdf
[33] http://www.intechopen.com/books/smoothing-filtering-and-prediction-estimating-the-past-present-and-future

External links
• Various 1D, 2D and 3D demonstrations of EM together with Mixture Modeling (http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture) are provided as part of the paired SOCR activities and applets. These applets and activities show empirically the properties of the EM algorithm for parameter estimation in diverse settings.
• Class hierarchy in C++ (GPL) including Gaussian Mixtures (https://github.com/l-/CommonDataAnalysis)
Exponential distribution 191

Exponential distribution
Exponential

(Plots: probability density function; cumulative distribution function)

Parameters    λ > 0 (rate, or inverse scale)
Support       x ∈ [0, ∞)
PDF           λe−λx
CDF           1 − e−λx
Mean          λ−1
Median        λ−1 ln 2
Mode          0
Variance      λ−2
Skewness      2
Ex. kurtosis  6
Entropy       1 − ln(λ)
MGF           λ/(λ − t), for t < λ
CF            λ/(λ − it)

In probability theory and statistics, the exponential distribution (a.k.a. negative exponential distribution) is a
family of continuous probability distributions. It describes the time between events in a Poisson process, i.e. a
process in which events occur continuously and independently at a constant average rate. It is the continuous
analogue of the geometric distribution.
Note that the exponential distribution is not the same as the class of exponential families of distributions, which is a
large class of probability distributions that includes the exponential distribution as one of its members, but also
includes the normal distribution, binomial distribution, gamma distribution, Poisson, and many others.

Characterization

Probability density function


The probability density function (pdf) of an exponential distribution is

$$f(x; \lambda) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0, \\ 0, & x < 0. \end{cases}$$

Alternatively, this can be defined using the Heaviside step function, H(x):

$$f(x; \lambda) = \lambda e^{-\lambda x} H(x).$$
Here λ > 0 is the parameter of the distribution, often called the rate parameter. The distribution is supported on the
interval [0, ∞). If a random variable X has this distribution, we write X ~ Exp(λ).
The exponential distribution exhibits infinite divisibility.

Cumulative distribution function


The cumulative distribution function is given by

$$F(x; \lambda) = \begin{cases} 1 - e^{-\lambda x}, & x \ge 0, \\ 0, & x < 0. \end{cases}$$

Alternatively, this can be defined using the Heaviside step function, H(x):

$$F(x; \lambda) = \big(1 - e^{-\lambda x}\big)\, H(x).$$

Alternative parameterization
A commonly used alternative parameterization is to define the probability density function (pdf) of an exponential distribution as

$$f(x; \beta) = \frac{1}{\beta}\, e^{-x/\beta}\, H(x),$$
where β > 0 is a scale parameter of the distribution and is the reciprocal of the rate parameter, λ, defined above. In
this specification, β is a survival parameter in the sense that if a random variable X is the duration of time that a
given biological or mechanical system manages to survive and X ~ Exponential(β) then E[X] = β. That is to say, the
expected duration of survival of the system is β units of time. The parameterisation involving the "rate" parameter
arises in the context of events arriving at a rate λ, when the time between events (which might be modelled using an
exponential distribution) has a mean of β = λ−1.
The alternative specification is sometimes more convenient than the one given above, and some authors will use it as
a standard definition. This alternative specification is not used here. Unfortunately this gives rise to a notational
ambiguity. In general, the reader must check which of these two specifications is being used if an author writes
"X ~ Exponential(λ)", since either the notation in the previous (using λ) or the notation in this section (here, using β
to avoid confusion) could be intended.

Properties

Mean, variance, moments and median


The mean or expected value of an exponentially distributed random variable X with rate parameter λ is given by

$$\mathrm{E}[X] = \frac{1}{\lambda}.$$

In light of the examples given above, this makes sense: if you receive phone calls at an average rate of 2 per hour, then you can expect to wait half an hour for every call.

The variance of X is given by

$$\mathrm{Var}[X] = \frac{1}{\lambda^2}.$$

The moments of X, for n = 1, 2, ..., are given by

$$\mathrm{E}[X^n] = \frac{n!}{\lambda^n}.$$

(Figure caption: the mean is the probability mass centre, that is, the first moment.)

The median of X is given by

$$\mathrm{m}[X] = \frac{\ln 2}{\lambda} < \mathrm{E}[X],$$

where ln refers to the natural logarithm. Thus the absolute difference between the mean and median is

$$\big|\mathrm{E}[X] - \mathrm{m}[X]\big| = \frac{1 - \ln 2}{\lambda} < \frac{1}{\lambda} = \sigma[X],$$

in accordance with the median-mean inequality.

Memorylessness
An important property of the exponential distribution is that it is memoryless. This means that if a random variable T is exponentially distributed, its conditional probability obeys

$$\Pr(T > s + t \mid T > s) = \Pr(T > t) \qquad \text{for all } s, t \ge 0.$$

(Figure caption: the median is the preimage F−1(1/2).)

This says that the conditional probability that we need to wait, for example, more than another 10 seconds before the first arrival, given that the first arrival has not yet happened after 30 seconds, is equal to the initial probability that we need to wait more than 10 seconds for the first arrival. So, if we waited for 30 seconds and the first arrival didn't happen (T > 30), the probability that we'll need to wait another 10 seconds for the first arrival (T > 30 + 10) is the same as the initial probability that we need to wait more than 10 seconds for the first arrival (T > 10). The fact that Pr(T > 40 | T > 30) = Pr(T > 10) does not mean that the events T > 40 and T > 30 are independent.

To summarize: "memorylessness" of the probability distribution of the waiting time T until the first arrival means

$$\Pr(T > 40 \mid T > 30) = \Pr(T > 10).$$

It does not mean

$$\Pr(T > 40 \mid T > 30) = \Pr(T > 40).$$

(That would be independence. These two events are not independent.)


The exponential distributions and the geometric distributions are the only memoryless probability distributions.
The exponential distribution is consequently also necessarily the only continuous probability distribution that has a constant failure rate.
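The memorylessness identity can be checked numerically from the survival function Pr(T > t) = e^{−λt}; a small illustrative sketch (the rate value is arbitrary):

```python
import math

def survival(t, lam):
    """Pr(T > t) for T ~ Exponential(lam)."""
    return math.exp(-lam * t)

lam = 0.1
# Pr(T > 30 + 10 | T > 30) = Pr(T > 40) / Pr(T > 30) ...
cond = survival(40, lam) / survival(30, lam)
# ... equals the unconditional Pr(T > 10), here e^{-1}.
print(cond, survival(10, lam))
```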

Quantiles
The quantile function (inverse cumulative distribution function) for Exponential(λ) is

$$F^{-1}(p; \lambda) = \frac{-\ln(1 - p)}{\lambda}, \qquad 0 \le p < 1.$$

The quartiles are therefore:

first quartile: ln(4/3)/λ
median: ln(2)/λ
third quartile: ln(4)/λ

Kullback–Leibler divergence
The directed Kullback–Leibler divergence between Exp(λ0) ('true' distribution) and Exp(λ) ('approximating' distribution) is given by

$$\Delta(\lambda_0 \parallel \lambda) = \log\frac{\lambda_0}{\lambda} + \frac{\lambda}{\lambda_0} - 1.$$

Maximum entropy distribution


Among all continuous probability distributions with support [0,∞) and mean μ, the exponential distribution with λ = 1/μ has the largest entropy. Alternatively, it is the maximum entropy probability distribution for a random variate X for which E[X] is fixed and greater than zero.[1]

Distribution of the minimum of exponential random variables


Let X1, ..., Xn be independent exponentially distributed random variables with rate parameters λ1, ..., λn. Then

$$\min\{X_1, \dots, X_n\}$$

is also exponentially distributed, with parameter

$$\lambda = \lambda_1 + \cdots + \lambda_n.$$

This can be seen by considering the complementary cumulative distribution function:

$$\Pr\big(\min\{X_1, \dots, X_n\} > x\big) = \prod_{i=1}^n \Pr(X_i > x) = \prod_{i=1}^n e^{-\lambda_i x} = \exp\!\Big(-\Big(\sum_{i=1}^n \lambda_i\Big) x\Big).$$

The index of the variable which achieves the minimum is distributed according to the law

$$\Pr\big(X_k = \min\{X_1, \dots, X_n\}\big) = \frac{\lambda_k}{\lambda_1 + \cdots + \lambda_n}.$$

Note that

$$\max\{X_1, \dots, X_n\}$$

is not exponentially distributed.
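Both closed-form results can be exercised by simulation; an illustrative sketch comparing the empirical minimum against the Exp(λ1 + ⋯ + λn) prediction (the rates and trial count are arbitrary):

```python
import random

rng = random.Random(42)
rates = [0.5, 1.0, 1.5]          # lambda_1, ..., lambda_n
total = sum(rates)               # rate of the minimum: 3.0

n_trials = 200_000
mins = []
argmin_counts = [0] * len(rates)
for _ in range(n_trials):
    draws = [rng.expovariate(lam) for lam in rates]
    m = min(draws)
    mins.append(m)
    argmin_counts[draws.index(m)] += 1

# E[min] should be 1/(lambda_1 + ... + lambda_n) = 1/3.
print(sum(mins) / n_trials)
# Pr(X_k achieves the minimum) should be lambda_k/total = 1/6, 1/3, 1/2.
print([c / n_trials for c in argmin_counts])
```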



Parameter estimation
Suppose a given variable is exponentially distributed and the rate parameter λ is to be estimated.

Maximum likelihood
The likelihood function for λ, given an independent and identically distributed sample x = (x1, ..., xn) drawn from the variable, is

$$L(\lambda) = \prod_{i=1}^n \lambda \exp(-\lambda x_i) = \lambda^n \exp(-\lambda n \overline{x}),$$

where

$$\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i$$

is the sample mean.

The derivative of the likelihood function's logarithm is

$$\frac{\mathrm{d}}{\mathrm{d}\lambda} \ln L(\lambda) = \frac{n}{\lambda} - n\overline{x}.$$

Consequently the maximum likelihood estimate for the rate parameter is

$$\widehat{\lambda} = \frac{1}{\overline{x}}.$$

While this estimate is the most likely reconstruction of the true parameter λ, it is only an estimate, and as such, one can imagine that the more data points are available the better the estimate will be. It so happens that one can compute an exact confidence interval – that is, a confidence interval that is valid for any number of samples, not just large ones. The 100(1 − α)% exact confidence interval for this estimate is given by[2]

$$\frac{2n}{\widehat{\lambda}\, \chi^2_{\alpha/2,\, 2n}} < \frac{1}{\lambda} < \frac{2n}{\widehat{\lambda}\, \chi^2_{1-\alpha/2,\, 2n}},$$

which is also equal to:

$$\frac{\widehat{\lambda}\, \chi^2_{1-\alpha/2,\, 2n}}{2n} < \lambda < \frac{\widehat{\lambda}\, \chi^2_{\alpha/2,\, 2n}}{2n},$$

where $\widehat{\lambda}$ is the MLE estimate, λ is the true value of the parameter, and $\chi^2_{p,\nu}$ is the 100(1 − p) percentile of the chi-squared distribution with ν degrees of freedom.

Bayesian inference
The conjugate prior for the exponential distribution is the gamma distribution (of which the exponential distribution is a special case). The following parameterization of the gamma pdf is useful:

$$\mathrm{Gamma}(\lambda; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda^{\alpha-1}\, e^{-\lambda\beta}.$$

The posterior distribution p can then be expressed in terms of the likelihood function defined above and a gamma prior:

$$p(\lambda \mid \mathbf{x}) \propto L(\lambda)\, \mathrm{Gamma}(\lambda; \alpha, \beta) = \lambda^n e^{-\lambda n\overline{x}} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda^{\alpha-1}\, e^{-\lambda\beta} \propto \lambda^{(\alpha+n)-1}\, e^{-\lambda(\beta + n\overline{x})}.$$

Now the posterior density p has been specified up to a missing normalizing constant. Since it has the form of a gamma pdf, this can easily be filled in, and one obtains

$$p(\lambda \mid \mathbf{x}) = \mathrm{Gamma}(\lambda; \alpha + n, \beta + n\overline{x}).$$

Here the parameter α can be interpreted as the number of prior observations, and β as the sum of the prior observations.
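The conjugate update described above amounts to simple parameter arithmetic: a Gamma(α, β) prior becomes a Gamma(α + n, β + Σxi) posterior. A minimal sketch (the function name and the illustrative numbers are ours):

```python
def exponential_gamma_posterior(alpha, beta, xs):
    """Gamma(alpha, beta) prior on lambda plus exponential observations xs
    yields a Gamma(alpha + n, beta + sum(xs)) posterior."""
    return alpha + len(xs), beta + sum(xs)

# A prior worth 2 pseudo-observations summing to 1.0, then 3 observed waits.
a, b = exponential_gamma_posterior(2.0, 1.0, [0.3, 0.7, 0.5])
print(a, b)             # 5.0 2.5
posterior_mean = a / b  # posterior mean of lambda: 2.0
```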

Confidence interval
A simple and rapid method to calculate an approximate confidence interval for the estimation of λ is based on the application of the central limit theorem.[3] This method provides a good approximation of the confidence interval limits for samples containing at least 15–20 elements. Denoting by N the sample size, the upper and lower limits of the 95% confidence interval are given by:

$$\lambda_{\mathrm{low}} = \widehat{\lambda}\left(1 - \frac{1.96}{\sqrt{N}}\right), \qquad \lambda_{\mathrm{upp}} = \widehat{\lambda}\left(1 + \frac{1.96}{\sqrt{N}}\right).$$

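The MLE and this CLT-based interval can be sketched together; an illustrative example on a hypothetical simulated sample (the true rate, seed, and sample size are arbitrary):

```python
import math
import random

rng = random.Random(7)
true_lam = 2.5
xs = [rng.expovariate(true_lam) for _ in range(10_000)]

# Maximum likelihood estimate: lambda_hat = 1 / sample mean.
lam_hat = len(xs) / sum(xs)

# Approximate 95% CI from the central limit theorem (needs N >= 15-20).
N = len(xs)
low = lam_hat * (1 - 1.96 / math.sqrt(N))
upp = lam_hat * (1 + 1.96 / math.sqrt(N))
print(lam_hat, (low, upp))  # lam_hat close to 2.5
```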
Generating exponential variates


A conceptually very simple method for generating exponential variates is based on inverse transform sampling: given a random variate U drawn from the uniform distribution on the unit interval (0, 1), the variate

$$T = F^{-1}(U)$$

has an exponential distribution, where F −1 is the quantile function, defined by

$$F^{-1}(p) = \frac{-\ln(1 - p)}{\lambda}.$$

Moreover, if U is uniform on (0, 1), then so is 1 − U. This means one can generate exponential variates as follows:

$$T = \frac{-\ln U}{\lambda}.$$
Other methods for generating exponential variates are discussed by Knuth[4] and Devroye.[5]
The ziggurat algorithm is a fast method for generating exponential variates.
A fast method for generating a set of ready-ordered exponential variates without using a sorting routine is also
available.[5]
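The inverse-transform recipe above can be sketched in a few lines (Python's standard library already offers the same thing as random.expovariate; the function name here is ours):

```python
import random
from math import log

def exp_variate(lam, rng=random):
    """One exponential variate by inverse transform sampling: T = -ln(U)/lam,
    where U is uniform on (0, 1)."""
    return -log(rng.random()) / lam
```

Averaging many draws should approach the distribution's mean 1/λ, which gives a quick sanity check of the sampler.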

Related distributions
• Exponential distribution is closed under scaling by a positive factor: if X ∼ Exponential(λ) and k > 0, then kX ∼ Exponential(λ/k)

• If and then
• If then
• The Benktander Weibull distribution reduces to a truncated exponential distribution
• If then (Benktander Weibull distribution)
• The exponential distribution is a limit of a scaled beta distribution:

• If X₁, …, X_k ∼ Exponential(λ) are independent, then X₁ + ⋯ + X_k ∼ Erlang(k, λ) (Erlang distribution)

• If then (Generalized extreme value distribution)


• If then (gamma distribution)
• If and then
(Laplace distribution)
• If and then

• If then

• If then (logistic distribution)


• If and then (logistic
distribution)
• If then (Pareto distribution)
• If then
• Exponential distribution is a special case of type 3 Pearson distribution

• If then (power law)

• If then (Rayleigh distribution)

• If then (Weibull distribution)

• If then (Weibull distribution)

• If (Uniform distribution (continuous)) then


• If (Poisson distribution) where then
(geometric distribution)
• If and then (K-distribution)
• The Hoyt distribution can be obtained from Exponential distribution and Arcsine distribution

• If and then
• If and then
• If , then : see skew-logistic distribution.
• Y ∼ Gumbel(μ, β), i.e. Y has a Gumbel distribution, if Y = μ − βlog(Xλ) and X ∼ Exponential(λ).
• X ∼ χ²₂, i.e. X has a chi-squared distribution with 2 degrees of freedom, if X = 2λY and Y ∼ Exponential(λ).

• Let X ∼ Exponential(λX) and Y ∼ Exponential(λY) be independent. Then Z = λX·X / (λY·Y) has probability density
function f_Z(z) = 1/(1 + z)². This can be used to obtain a confidence interval for λX/λY.

Other related distributions:


• Hyper-exponential distribution – the distribution whose density is a weighted sum of exponential densities.
• Hypoexponential distribution – the distribution of a general sum of exponential random variables.
• exGaussian distribution – the sum of an exponential distribution and a normal distribution.

Applications

Occurrence of events
The exponential distribution occurs naturally when describing the lengths of the inter-arrival times in a homogeneous
Poisson process.
The exponential distribution may be viewed as a continuous counterpart of the geometric distribution, which
describes the number of Bernoulli trials necessary for a discrete process to change state. In contrast, the exponential
distribution describes the time for a continuous process to change state.
In real-world scenarios, the assumption of a constant rate (or probability per unit time) is rarely satisfied. For
example, the rate of incoming phone calls differs according to the time of day. But if we focus on a time interval
during which the rate is roughly constant, such as from 2 to 4 p.m. during work days, the exponential distribution can
be used as a good approximate model for the time until the next phone call arrives. Similar caveats apply to the
following examples which yield approximately exponentially distributed variables:
• The time until a radioactive particle decays, or the time between clicks of a geiger counter
• The time it takes before your next telephone call
• The time until default (on payment to company debt holders) in reduced form credit risk modeling
Exponential variables can also be used to model situations where certain events occur with a constant probability per
unit length, such as the distance between mutations on a DNA strand, or between roadkills on a given road.
In queuing theory, the service times of agents in a system (e.g. how long it takes for a bank teller etc. to serve a
customer) are often modeled as exponentially distributed variables. (The number of customer arrivals per unit time, for
instance, is typically modeled by the Poisson distribution in most management science textbooks.) The length of a
process that can be thought of as a sequence of several independent tasks is better modeled by a variable following
the Erlang distribution (which is the distribution of the sum of several independent exponentially distributed
variables).
Reliability theory and reliability engineering also
make extensive use of the exponential distribution.
Because of the memoryless property of this
distribution, it is well-suited to model the constant
hazard rate portion of the bathtub curve used in
reliability theory. It is also very convenient because it
is so easy to add failure rates in a reliability model.
The exponential distribution is however not
appropriate to model the overall lifetime of
organisms or technical devices, because the "failure
rates" here are not constant: more failures occur for
very young and for very old systems.
[Figure: Fitted cumulative exponential distribution to annual maximum 1-day rainfalls using CumFreq.[6]]

In physics, if you observe a gas at a fixed temperature and pressure in a uniform gravitational field, the heights of
the various molecules also follow an approximate exponential distribution. This is a consequence of the entropy
property mentioned below.
In hydrology, the exponential distribution is used to analyze extreme values of such variables as monthly and annual
maximum values of daily rainfall and river discharge volumes.[7]
The blue picture illustrates an example of fitting the exponential distribution to ranked annual maximum
one-day rainfalls, showing also the 90% confidence belt based on the binomial distribution. The rainfall data
are represented by plotting positions as part of the cumulative frequency analysis.

Prediction
Having observed a sample of n data points from an unknown exponential distribution a common task is to use these
samples to make predictions about future data from the same source. A common predictive distribution over future
samples is the so-called plug-in distribution, formed by plugging a suitable estimate for the rate parameter λ into the
exponential density function. A common choice of estimate is the one provided by the principle of maximum
likelihood, and using this yields the predictive density over a future sample x_{n+1}, conditioned on the observed
samples x = (x₁, ..., x_n), given by

    p_ML(x_{n+1} | x) = λ̂ e^(−λ̂ x_{n+1}),   where λ̂ = n / (x₁ + ⋯ + x_n)

The Bayesian approach provides a predictive distribution which takes into account the uncertainty of the estimated
parameter, although this may depend crucially on the choice of prior.
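The maximum-likelihood plug-in rule described above is straightforward to sketch (the helper name is ours):

```python
from math import exp

def plugin_density(x_future, data):
    """Maximum-likelihood plug-in predictive density for a future observation:
    plug lambda_hat = n / sum(data) into the exponential pdf."""
    lam_hat = len(data) / sum(data)
    return lam_hat * exp(-lam_hat * x_future)
```

With observed data [1, 2, 3] the rate estimate is 0.5, so the predictive density at x = 2 is 0.5·e⁻¹ ≈ 0.184.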
A predictive distribution free of the issues of choosing priors that arise under the subjective Bayesian approach is

    p(x_{n+1} | x) = n (n x̄)ⁿ / (n x̄ + x_{n+1})^(n+1)

which can be considered as (1) a frequentist confidence distribution, obtained from the distribution of a pivotal
quantity;[8] (2) a profile predictive likelihood, obtained by eliminating the parameter λ from the joint
likelihood of x_{n+1} and λ by maximization;[9] (3) an objective Bayesian predictive posterior distribution, obtained
using the non-informative Jeffreys prior 1/λ; and (4) the Conditional Normalized Maximum Likelihood (CNML)
predictive distribution, from information theoretic considerations.[10]
The accuracy of a predictive distribution may be measured using the distance or divergence between the true
exponential distribution with rate parameter, λ0, and the predictive distribution based on the sample x. The
Kullback–Leibler divergence is a commonly used, parameterisation-free measure of the difference between two
distributions. Letting Δ(λ0||p) denote the Kullback–Leibler divergence between an exponential with rate parameter λ0
and a predictive distribution p it can be shown that

where the expectation is taken with respect to the exponential distribution with rate parameter λ0 ∈ (0, ∞), and ψ( · )
is the digamma function. It is clear that the CNML predictive distribution is strictly superior to the maximum
likelihood plug-in distribution in terms of average Kullback–Leibler divergence for all sample sizes n > 0.

References
[1] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http://www.wise.xmu.edu.cn/Master/Download/..\..\UploadFiles\paper-masterdownload\2009519932327055475115776.pdf). Journal of Econometrics (Elsevier): 219–230. Retrieved 2011-06-02.
[2] Ross, Sheldon M. (2009). Introduction to Probability and Statistics for Engineers and Scientists (http://books.google.com/books?id=mXP_UEiUo9wC&pg=PA267) (4th ed.). Associated Press. p. 267. ISBN 978-0-12-370483-2.
[3] Guerriero, V. et al. (2010). "Quantifying uncertainties in multi-scale studies of fractured reservoir analogues: Implemented statistical analysis of scan line data from carbonate rocks" (PDF). Journal of Structural Geology (Elsevier). doi:10.1016/j.jsg.2009.04.016.
[4] Donald E. Knuth (1998). The Art of Computer Programming, volume 2: Seminumerical Algorithms, 3rd edn. Boston: Addison–Wesley. ISBN 0-201-89684-2. See section 3.4.1, p. 133.
[5] Luc Devroye (1986). Non-Uniform Random Variate Generation (http://luc.devroye.org/rnbookindex.html). New York: Springer-Verlag. ISBN 0-387-96305-7. See chapter IX (http://luc.devroye.org/chapter_nine.pdf), section 2, pp. 392–401.
[6] "Cumfreq, a free computer program for cumulative frequency analysis" (http://www.waterlog.info/cumfreq.htm).
[7] Ritzema, H.P. (ed.) (1994). Frequency and Regression Analysis (http://www.waterlog.info/pdf/freqtxt.pdf). Chapter 6 in: Drainage Principles and Applications, Publication 16, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands. pp. 175–224. ISBN 90-70754-33-9.

[8] Lawless, J.F., Fredette, M.,"Frequentist predictions intervals and predictive distributions", Biometrika (2005), Vol 92, Issue 3, pp 529–542.
[9] Bjornstad, J.F., "Predictive Likelihood: A Review", Statist. Sci. Volume 5, Number 2 (1990), 242–254.
[10] D. F. Schmidt and E. Makalic, " Universal Models for the Exponential Distribution (http:/ / www. emakalic. org/ blog/ wp-content/ uploads/
2010/ 04/ SchmidtMakalic09b. pdf)", IEEE Transactions on Information Theory, Volume 55, Number 7, pp. 3087–3090, 2009
doi:10.1109/TIT.2009.2018331

External links
• Online calculator of Exponential Distribution (http://www.stud.feec.vutbr.cz/~xvapen02/vypocty/ex.php?language=english)

F-distribution
Fisher–Snedecor

Parameters: d₁ > 0, d₂ > 0 deg. of freedom
Support: x ∈ [0, +∞)
PDF: √( (d₁x)^(d₁) d₂^(d₂) / (d₁x + d₂)^(d₁+d₂) ) / ( x B(d₁/2, d₂/2) )
CDF: I_{d₁x/(d₁x+d₂)}(d₁/2, d₂/2)
Mean: d₂/(d₂ − 2) for d₂ > 2
Mode: ( (d₁ − 2)/d₁ ) · ( d₂/(d₂ + 2) ) for d₁ > 2
Variance: 2d₂²(d₁ + d₂ − 2) / ( d₁ (d₂ − 2)² (d₂ − 4) ) for d₂ > 4
Skewness: ( (2d₁ + d₂ − 2) √(8(d₂ − 4)) ) / ( (d₂ − 6) √(d₁(d₁ + d₂ − 2)) ) for d₂ > 6
Ex. kurtosis: see text
MGF: does not exist; raw moments defined in text and in [1][2]
CF: see text

In probability theory and statistics, the F-distribution is a continuous probability distribution.[1][2][3][4] It is also
known as Snedecor's F distribution or the Fisher-Snedecor distribution (after R.A. Fisher and George W.
Snedecor). The F-distribution arises frequently as the null distribution of a test statistic, most notably in the analysis
of variance; see F-test.

Definition
If a random variable X has an F-distribution with parameters d₁ and d₂, we write X ∼ F(d₁, d₂). Then the
probability density function for X is given by

    f(x; d₁, d₂) = √( (d₁x)^(d₁) d₂^(d₂) / (d₁x + d₂)^(d₁+d₂) ) / ( x B(d₁/2, d₂/2) )

for real x ≥ 0. Here B is the beta function. In many applications, the parameters d₁ and d₂ are positive integers,
but the distribution is well-defined for positive real values of these parameters.
The cumulative distribution function is

    F(x; d₁, d₂) = I_{d₁x/(d₁x+d₂)}(d₁/2, d₂/2)

where I is the regularized incomplete beta function.
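As a sanity check, the density formula can be evaluated numerically; the sketch below (function names are ours) verifies that it integrates to approximately 1 for d₁ = 5, d₂ = 10, using the gamma function to compute the beta function:

```python
from math import gamma, sqrt

def beta_fn(a, b):
    """Beta function expressed via gamma functions."""
    return gamma(a) * gamma(b) / gamma(a + b)

def f_pdf(x, d1, d2):
    """Density of the F-distribution, per the formula in the text."""
    num = (d1 * x) ** d1 * d2 ** d2 / (d1 * x + d2) ** (d1 + d2)
    return sqrt(num) / (x * beta_fn(d1 / 2, d2 / 2))

# crude Riemann sum over (0, 100]: should come out close to 1
total = sum(f_pdf(0.001 + i * 0.001, 5, 10) * 0.001 for i in range(100000))
```

The mass beyond x = 100 is negligible for these degrees of freedom, so the truncated sum is a reasonable check.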


The expectation, variance, and other details about the F(d₁, d₂) distribution are given in the sidebox; the excess
kurtosis exists for d₂ > 8.
The k-th moment of an F(d₁, d₂) distribution exists and is finite only when 2k < d₂, and it is equal to[5]

    E[X^k] = ( d₂/d₁ )^k · Γ(d₁/2 + k) Γ(d₂/2 − k) / ( Γ(d₁/2) Γ(d₂/2) )
The F-distribution is a particular parametrization of the beta prime distribution, which is also called the beta
distribution of the second kind.
The characteristic function is listed incorrectly in many standard references (e.g., [2]). The correct expression[6] is
written in terms of the confluent hypergeometric function of the second kind.



Characterization
A random variate of the F-distribution with parameters d₁ and d₂ arises as the ratio of two appropriately scaled
chi-squared variates:

    X = ( U₁/d₁ ) / ( U₂/d₂ )

where
• U1 and U2 have chi-squared distributions with d1 and d2 degrees of freedom respectively, and
• U1 and U2 are independent.
In instances where the F-distribution is used, for instance in the analysis of variance, independence of U1 and U2
might be demonstrated by applying Cochran's theorem.
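This characterization also suggests a direct way to simulate F variates, given a Gamma sampler (a chi-squared variate with k degrees of freedom is Gamma(k/2, scale 2)); a sketch, with function names of our choosing:

```python
import random

def chi2_variate(df, rng):
    """Chi-squared(df) draw: chi-squared with df degrees of freedom is Gamma(df/2, scale=2)."""
    return rng.gammavariate(df / 2, 2)

def f_variate(d1, d2, rng):
    """F(d1, d2) draw as a ratio of scaled independent chi-squared variates."""
    return (chi2_variate(d1, rng) / d1) / (chi2_variate(d2, rng) / d2)

rng = random.Random(42)
samples = [f_variate(5, 10, rng) for _ in range(100000)]
# the sample mean should be close to d2 / (d2 - 2) = 10 / 8 = 1.25
```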

Generalization
A generalization of the (central) F-distribution is the noncentral F-distribution.

Related distributions and properties

• If X ∼ χ²(d₁) and Y ∼ χ²(d₂), and X and Y are independent, then (X/d₁)/(Y/d₂) ∼ F(d₁, d₂)

• If X ∼ Beta(d₁/2, d₂/2) (Beta distribution) then d₂X / (d₁(1 − X)) ∼ F(d₁, d₂)

• Equivalently, if X ∼ F(d₁, d₂), then d₁X/(d₁X + d₂) ∼ Beta(d₁/2, d₂/2).

• If then has the chi-squared distribution


• is equivalent to the scaled Hotelling's T-squared distribution

• If X ∼ F(d₁, d₂) then X⁻¹ ∼ F(d₂, d₁).

• If X ∼ t(n) (Student's t-distribution) then X² ∼ F(1, n).


• If (Student's t-distribution) then .
• F-distribution is a special case of type 6 Pearson distribution
• If X and Y are independent, with and (Laplace distribution) then

• If then (Fisher's z-distribution)


• The noncentral F-distribution simplifies to the F-distribution if the noncentrality parameter is 0
• The doubly noncentral F-distribution simplifies to the F-distribution if both noncentrality parameters are 0
• If x is the p quantile for X ∼ F(d₁, d₂) and y is the 1 − p quantile for Y ∼ F(d₂, d₁),
then x = 1/y.

References
[1] Johnson, Norman Lloyd; Kotz, Samuel; Balakrishnan, N. (1995). Continuous Univariate Distributions, Volume 2 (Second Edition, Section 27). Wiley. ISBN 0-471-58494-0.
[2] Abramowitz, Milton; Stegun, Irene A., eds. (1965), "Chapter 26" (http://www.math.sfu.ca/~cbm/aands/page_946.htm), Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, New York: Dover, p. 946, ISBN 978-0486612720, MR0167642.
[3] NIST (2006). Engineering Statistics Handbook – F Distribution (http://www.itl.nist.gov/div898/handbook/eda/section3/eda3665.htm)
[4] Mood, Alexander; Graybill, Franklin A.; Boes, Duane C. (1974). Introduction to the Theory of Statistics (Third Edition, pp. 246–249). McGraw-Hill. ISBN 0-07-042864-6.
[5] Taboga, Marco. "The F distribution" (http://www.statlect.com/F_distribution.htm).
[6] Phillips, P. C. B. (1982). "The true characteristic function of the F distribution". Biometrika, 69: 261–264. JSTOR 2335882.

External links
• Table of critical values of the F-distribution (http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm)
• Earliest Uses of Some of the Words of Mathematics: entry on F-distribution contains a brief history (http://jeff560.tripod.com/f.html)

F-test
An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis. It is most
often used when comparing statistical models that have been fit to a data set, in order to identify the model that best
fits the population from which the data were sampled. Exact F-tests mainly arise when the models have been fit to
the data using least squares. The name was coined by George W. Snedecor, in honour of Sir Ronald A. Fisher. Fisher
initially developed the statistic as the variance ratio in the 1920s.[1]

Common examples of F-tests


Examples of F-tests include:
• The hypothesis that the means of several normally distributed populations, all having the same standard deviation,
are equal. This is perhaps the best-known F-test, and plays an important role in the analysis of variance
(ANOVA).
• The hypothesis that a proposed regression model fits the data well. See Lack-of-fit sum of squares.
• The hypothesis that a data set in a regression analysis follows the simpler of two proposed linear models that are
nested within each other.
• Scheffé's method for multiple comparisons adjustment in linear models.

F-test of the equality of two variances


This F-test, which tests the null hypothesis that two normal populations have the same variance by taking the ratio of
the two sample variances, is extremely sensitive to non-normality.[2][3] In the analysis of variance (ANOVA), alternative
tests include Levene's test, Bartlett's test, and the Brown–Forsythe test. However, when any of these tests are conducted to
test the underlying assumption of homoscedasticity (i.e. homogeneity of variance), as a preliminary step to testing
for mean effects, there is an increase in the experiment-wise Type I error rate.[4]
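For reference, the statistic of this variance-ratio test is simply the ratio of the two sample variances; a minimal sketch (the function name is ours), which under normality and the null hypothesis follows F(n₁ − 1, n₂ − 1):

```python
from statistics import variance

def variance_ratio(xs, ys):
    """F statistic for equality of two variances: ratio of the sample variances."""
    return variance(xs) / variance(ys)
```

Since the second sample below is the first scaled by 2, its variance is 4 times larger, giving a ratio of 0.25.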

Formula and calculation


Most F-tests arise by considering a decomposition of the variability in a collection of data in terms of sums of
squares. The test statistic in an F-test is the ratio of two scaled sums of squares reflecting different sources of
variability. These sums of squares are constructed so that the statistic tends to be greater when the null hypothesis is
not true. In order for the statistic to follow the F-distribution under the null hypothesis, the sums of squares should be
statistically independent, and each should follow a scaled chi-squared distribution. The latter condition is guaranteed
if the data values are independent and normally distributed with a common variance.

Multiple-comparison ANOVA problems


The F-test in one-way analysis of variance is used to assess whether the expected values of a quantitative variable
within several pre-defined groups differ from each other. For example, suppose that a medical trial compares four
treatments. The ANOVA F-test can be used to assess whether any of the treatments is on average superior, or
inferior, to the others versus the null hypothesis that all four treatments yield the same mean response. This is an
example of an "omnibus" test, meaning that a single test is performed to detect any of several possible differences.
Alternatively, we could carry out pairwise tests among the treatments (for instance, in the medical trial example with
four treatments we could carry out six tests among pairs of treatments). The advantage of the ANOVA F-test is that
we do not need to pre-specify which treatments are to be compared, and we do not need to adjust for making
multiple comparisons. The disadvantage of the ANOVA F-test is that if we reject the null hypothesis, we do not
know which treatments can be said to be significantly different from the others — if the F-test is performed at level
α we cannot state that the treatment pair with the greatest mean difference is significantly different at level α.
The formula for the one-way ANOVA F-test statistic is

    F = (explained variance) / (unexplained variance)

or

    F = (between-group variability) / (within-group variability)

The "explained variance", or "between-group variability", is

    Σᵢ nᵢ (Ȳᵢ − Ȳ)² / (K − 1)

where Ȳᵢ denotes the sample mean in the ith group, nᵢ is the number of observations in the ith group, and Ȳ denotes
the overall mean of the data.
The "unexplained variance", or "within-group variability", is

    Σᵢⱼ (Yᵢⱼ − Ȳᵢ)² / (N − K)

where Yᵢⱼ is the jth observation in the ith out of K groups and N is the overall sample size. This F-statistic follows the
F-distribution with K − 1 and N − K degrees of freedom under the null hypothesis. The statistic will be large if the
between-group variability is large relative to the within-group variability, which is unlikely to happen if the
population means of the groups all have the same value.
Note that when there are only two groups for the one-way ANOVA F-test, F = t², where t is the Student's t statistic.

Regression problems
Consider two models, 1 and 2, where model 1 is 'nested' within model 2. Model 1 is the Restricted model, and Model
2 is the Unrestricted one. That is, model 1 has p1 parameters, and model 2 has p2 parameters, where p2 > p1, and for
any choice of parameters in model 1, the same regression curve can be achieved by some choice of the parameters of
model 2. (We use the convention that any constant parameter in a model is included when counting the parameters.
For instance, the simple linear model y = mx + b has p = 2 under this convention.) The model with more parameters
will always be able to fit the data at least as well as the model with fewer parameters. Thus typically model 2 will
give a better (i.e. lower error) fit to the data than model 1. But one often wants to determine whether model 2 gives a
significantly better fit to the data. One approach to this problem is to use an F test.
If there are n data points to estimate the parameters of both models from, then one can calculate the F statistic,
given by

    F = ( (RSS₁ − RSS₂) / (p₂ − p₁) ) / ( RSS₂ / (n − p₂) )

where RSSi is the residual sum of squares of model i. If your regression model has been calculated with weights,
then replace RSSi with χ2, the weighted sum of squared residuals. Under the null hypothesis that model 2 does not
provide a significantly better fit than model 1, F will have an F distribution, with (p2 − p1, n − p2) degrees of
freedom. The null hypothesis is rejected if the F calculated from the data is greater than the critical value of the
F-distribution for some desired false-rejection probability (e.g. 0.05). The F-test is a Wald test.
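A sketch of this computation (the function name is ours):

```python
def nested_f(rss1, rss2, p1, p2, n):
    """F statistic comparing a restricted model 1 (p1 parameters, residual sum
    of squares rss1) against a nested unrestricted model 2 (p2 parameters,
    rss2), both fit to n data points. Under H0 it follows F(p2 - p1, n - p2)."""
    return ((rss1 - rss2) / (p2 - p1)) / (rss2 / (n - p2))
```

For example, with RSS₁ = 100, RSS₂ = 80, p₁ = 2, p₂ = 3 and n = 23, the statistic is (20/1)/(80/20) = 5, to be compared against F(1, 20).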

One-way ANOVA example


Consider an experiment to study the effect of three different levels of a factor on a response (e.g. three levels of a
fertilizer on plant growth). If we had 6 observations for each level, we could write the outcome of the experiment in
a table like this, where a1, a2, and a3 are the three levels of the factor being studied.

a1 a2 a3

6 8 13

8 12 9

4 9 11

5 11 8

3 6 7

4 8 12

The null hypothesis, denoted H0, for the overall F-test for this experiment would be that all three levels of the factor
produce the same response, on average. To calculate the F-ratio:
Step 1: Calculate the mean within each group:

    Ȳ₁ = (6 + 8 + 4 + 5 + 3 + 4)/6 = 5,   Ȳ₂ = (8 + 12 + 9 + 11 + 6 + 8)/6 = 9,   Ȳ₃ = (13 + 9 + 11 + 8 + 7 + 12)/6 = 10

Step 2: Calculate the overall mean:

    Ȳ = (Ȳ₁ + Ȳ₂ + Ȳ₃)/a = (5 + 9 + 10)/3 = 8


where a is the number of groups.


Step 3: Calculate the "between-group" sum of squares:

    S_B = n·((5 − 8)² + (9 − 8)² + (10 − 8)²) = 6·(9 + 1 + 4) = 84

where n is the number of data values per group.

The between-group degrees of freedom is one less than the number of groups:

    f_b = 3 − 1 = 2

so the between-group mean square value is

    MS_B = S_B / f_b = 84 / 2 = 42

Step 4: Calculate the "within-group" sum of squares. Begin by centering the data in each group

a1 a2 a3

6 − 5 = 1 8 − 9 = -1 13 − 10 = 3

8 − 5 = 3 12 − 9 = 3 9 − 10 = -1

4 − 5 = -1 9 − 9 = 0 11 − 10 = 1

5 − 5 = 0 11 − 9 = 2 8 − 10 = -2

3 − 5 = -2 6 − 9 = -3 7 − 10 = -3

4 − 5 = -1 8 − 9 = -1 12 − 10 = 2

The within-group sum of squares is the sum of squares of all 18 values in this table:

    S_W = 16 + 24 + 28 = 68

The within-group degrees of freedom is

    f_W = a(n − 1) = 3·(6 − 1) = 15

Thus the within-group mean square value is

    MS_W = S_W / f_W = 68 / 15 ≈ 4.5

Step 5: The F-ratio is

    F = MS_B / MS_W = 42 / 4.5 ≈ 9.3

The critical value is the number that the test statistic must exceed to reject the test. In this case,
Fcrit(2, 15) = 3.68 at α = 0.05. Since F = 9.3 > 3.68, the results are significant at the 5% significance level.
One would reject the null hypothesis, concluding that there is strong evidence that the expected values in the
three groups differ. The p-value for this test is 0.002.
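The whole calculation can be reproduced with a short script on the table's data (variable and function names are ours):

```python
groups = [
    [6, 8, 4, 5, 3, 4],     # a1
    [8, 12, 9, 11, 6, 8],   # a2
    [13, 9, 11, 8, 7, 12],  # a3
]

def one_way_f(groups):
    """One-way ANOVA F-ratio: between-group mean square over within-group
    mean square, as in the worked example."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))
```

Running this on the table above reproduces F ≈ 9.3 (more precisely 42/(68/15) ≈ 9.26).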

After performing the F-test, it is common to carry out some "post-hoc" analysis of the group means. In this case, the
first two group means differ by 4 units, the first and third group means differ by 5 units, and the second and third
group means differ by only 1 unit. The standard error of each of these differences is √(2 × 4.5 / 6) ≈ 1.2.

Thus the first group is strongly different from the other groups, as the mean difference is more than 3 times the standard
error, so we can be highly confident that the population mean of the first group differs from the population means of
the other groups. However, there is no evidence that the second and third groups have different population means
from each other, as their mean difference of one unit is comparable to the standard error.
Note F(x, y) denotes an F-distribution with x degrees of freedom in the numerator and y degrees of freedom in the
denominator.

ANOVA's robustness with respect to Type I errors for departures from


population normality
The oneway ANOVA can be generalized to the factorial and multivariate layouts, as well as to the analysis of
covariance. None of these F-tests, however, are robust when there are severe violations of the assumption that each
population follows the normal distribution, particularly for small alpha levels and unbalanced layouts.[5]
Furthermore, if the underlying assumption of homoscedasticity is violated, the Type I error properties degenerate
much more severely.[6] For nonparametric alternatives in the factorial layout, see Sawilowsky.[7] For more
discussion see ANOVA on ranks.

References
[1] Lomax, Richard G. (2007) Statistical Concepts: A Second Course, p. 10, ISBN 0-8058-5850-4
[2] Box, G.E.P. (1953). "Non-Normality and Tests on Variances". Biometrika 40 (3/4): 318–335. JSTOR 2333350.
[3] Markowski, Carol A; Markowski, Edward P. (1990). "Conditions for the Effectiveness of a Preliminary Test of Variance". The American
Statistician 44 (4): 322–326. doi:10.2307/2684360. JSTOR 2684360.
[4] Sawilowsky, S. (2002). "Fermat, Schubert, Einstein, and Behrens–Fisher: The Probable Difference Between Two Means When σ₁² ≠ σ₂²".
Journal of Modern Applied Statistical Methods, 1(2), 461–472.
[5] Blair, R. C. (1981). "A reaction to 'Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and
covariance.'" Review of Educational Research, 51, 499-507.
[6] Randolf, E. A., & Barcikowski, R. S. (1989, November). "Type I error rate when real study values are used as population parameters in a
Monte Carlo study". Paper presented at the 11th annual meeting of the Mid-Western Educational Research Association, Chicago.
[7] Sawilowsky, S. (1990). Nonparametric tests of interaction in experimental design. Review of Educational Research, 25(20-59).

External links
• Testing utility of model – F-test (http://www.public.iastate.edu/~alicia/stat328/Multiple regression - F test.pdf)
• F-test (http://rkb.home.cern.ch/rkb/AN16pp/node81.html)
• Table of F-test critical values (http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm)
• FTEST in Microsoft Excel which is different (http://office.microsoft.com/en-gb/excel-help/ftest-HP005209098.aspx)

Fisher information
In mathematical statistics and information theory, the Fisher information (sometimes simply called information[1])
can be defined as the variance of the score, or as the expected value of the observed information. In Bayesian
statistics, the asymptotic distribution of the posterior mode depends on the Fisher information and not on the prior
(according to the Bernstein–von Mises theorem, which was anticipated by Laplace for exponential families).[2] The
role of the Fisher information in the asymptotic theory of maximum-likelihood estimation was emphasized by the
statistician R.A. Fisher (following some initial results by F. Y. Edgeworth). The Fisher information is also used in
the calculation of the Jeffreys prior, which is used in Bayesian statistics.
The Fisher-information matrix is used to calculate the covariance matrices associated with maximum-likelihood
estimates. It can also be used in the formulation of test statistics, such as the Wald test.

History
The Fisher information was discussed by several early statisticians, notably F. Y. Edgeworth.[3] For example,
Savage[4] says: "In it [Fisher information], he [Fisher] was to some extent anticipated (Edgeworth 1908–9 esp. 502,
507–8, 662, 677–8, 82–5 and references he [Edgeworth] cites including Pearson and Filon 1898 [. . .])." There are a
number of early historical sources[5] and a number of reviews of this early work.[6][7][8]

Definition
The Fisher information is a way of measuring the amount of information that an observable random variable X
carries about an unknown parameter θ upon which the probability of X depends. The probability function for X,
which is also the likelihood function for θ, is a function ƒ(X; θ); it is the probability mass (or probability density) of
the random variable X conditional on the value of θ. The partial derivative with respect to θ of the natural logarithm
of the likelihood function is called the score. Under certain regularity conditions, it can be shown that the first
moment of the score is 0. The second moment is called the Fisher information:

    I(θ) = E[ ( ∂/∂θ ln f(X; θ) )² | θ ]

where, for any given value of θ, the expression E[… | θ] denotes the conditional expectation over values for X with
respect to the probability function ƒ(x; θ) given θ. Note that I(θ) ≥ 0. A random variable carrying high
Fisher information implies that the absolute value of the score is often high. The Fisher information is not a function
of a particular observation, as the random variable X has been averaged out.
Since the expectation of the score is zero, the Fisher information is also the variance of the score.
If log ƒ(x; θ) is twice differentiable with respect to θ, and under certain regularity conditions, then the Fisher
information may also be written as[9]

    I(θ) = −E[ ∂²/∂θ² ln f(X; θ) | θ ]

Thus, the Fisher information is the negative of the expectation of the second derivative with respect to θ of the
natural logarithm of f. Information may be seen to be a measure of the "curvature" of the support curve near the
maximum likelihood estimate of θ. A "blunt" support curve (one with a shallow maximum) would have a low
negative expected second derivative, and thus low information; while a sharp one would have a high negative
expected second derivative and thus high information.
Information is additive, in that the information yielded by two independent experiments is the sum of the information
from each experiment separately:

    I_{X,Y}(θ) = I_X(θ) + I_Y(θ)

This result follows from the elementary fact that if random variables are independent, the variance of their sum is the
sum of their variances. Hence the information in a random sample of size n is n times that in a sample of size 1 (if
observations are independent).
The information provided by a sufficient statistic is the same as that of the sample X. This may be seen by using
Neyman's factorization criterion for a sufficient statistic. If T(X) is sufficient for θ, then

    f(X; θ) = g(T(X); θ) h(X)

for some functions g and h. See sufficient statistic for a more detailed explanation. The equality of information then
follows from the following fact:

    ∂/∂θ ln f(X; θ) = ∂/∂θ ln g(T(X); θ)

which follows from the definition of Fisher information, and the independence of h(X) from θ. More generally, if T
= t(X) is a statistic, then

    I_T(θ) ≤ I_X(θ)

with equality if and only if T is a sufficient statistic.

Informal derivation of the Cramér–Rao bound


The Cramér–Rao bound states that the inverse of the Fisher information is a lower bound on the variance of any
unbiased estimator of θ. Van Trees (1968) and Frieden (2004) provide the following method of deriving the
Cramér–Rao bound, a result which describes use of the Fisher information, informally:

Consider an unbiased estimator θ̂(X). Mathematically, we write

    E[ θ̂(X) − θ | θ ] = ∫ ( θ̂(x) − θ ) f(x; θ) dx = 0,   for all θ

The likelihood function ƒ(X; θ) describes the probability that we observe a given sample x given a known value of θ.
If ƒ is sharply peaked with respect to changes in θ, it is easy to intuit the "correct" value of θ given the data, and
hence the data contains a lot of information about the parameter. If the likelihood ƒ is flat and spread-out, then it
would take many, many samples of X to estimate the actual "true" value of θ. Therefore, we would intuit that the
data contain much less information about the parameter.
Now, we differentiate the unbiasedness condition above to get

    ∂/∂θ ∫ ( θ̂(x) − θ ) f(x; θ) dx = ∫ ( θ̂ − θ ) (∂f/∂θ) dx − ∫ f dx = 0

We now make use of two facts. The first is that the likelihood ƒ is just the probability of the data given the parameter.
Since it is a probability, it must be normalized, implying that

    ∫ f(x; θ) dx = 1

Second, we know from basic calculus that

    ∂f/∂θ = f · (∂ ln f / ∂θ)

Using these two facts in the above lets us write

    ∫ ( θ̂ − θ ) f · (∂ ln f / ∂θ) dx = 1

Factoring the integrand gives

    ∫ ( ( θ̂ − θ ) √f ) ( √f · (∂ ln f / ∂θ) ) dx = 1


If we square the equation, the Cauchy–Schwarz inequality lets us write

    [ ∫ ( θ̂ − θ )² f dx ] · [ ∫ ( ∂ ln f / ∂θ )² f dx ] ≥ 1

The right-most factor is defined to be the Fisher Information

    I(θ) = ∫ ( ∂ ln f / ∂θ )² f dx

The left-most factor is the expected mean-squared error of the estimator θ̂, since

    E[ ( θ̂(X) − θ )² | θ ] = ∫ ( θ̂ − θ )² f dx

Notice that the inequality tells us that, fundamentally,

    Var( θ̂ ) ≥ 1 / I(θ)

In other words, the precision to which we can estimate θ is fundamentally limited by the Fisher Information of the
likelihood function.

Single-parameter Bernoulli experiment


A Bernoulli trial is a random variable with two possible outcomes, "success" and "failure", with "success" having a
probability of θ. The outcome can be thought of as determined by a coin toss, with the probability of obtaining a
"head" being θ and the probability of obtaining a "tail" being 1 − θ.
The Fisher information contained in n independent Bernoulli trials may be calculated as follows. In the following, A
represents the number of successes, B the number of failures, and n = A + B is the total number of trials.

\mathcal{I}(\theta) = -\mathrm{E}\left[ \frac{\partial^2}{\partial\theta^2} \ln f(X;\theta) \right]   (1)
  = -\mathrm{E}\left[ \frac{\partial^2}{\partial\theta^2} \ln\left( \theta^A (1-\theta)^B \binom{n}{A} \right) \right]   (2)
  = -\mathrm{E}\left[ \frac{\partial^2}{\partial\theta^2} \left( A \ln\theta + B \ln(1-\theta) \right) \right]   (3)
  = -\mathrm{E}\left[ \frac{\partial}{\partial\theta} \left( \frac{A}{\theta} - \frac{B}{1-\theta} \right) \right]   (4)
  = \mathrm{E}\left[ \frac{A}{\theta^2} + \frac{B}{(1-\theta)^2} \right]   (5)
  = \frac{n\theta}{\theta^2} + \frac{n(1-\theta)}{(1-\theta)^2}   (6)
  = \frac{n}{\theta(1-\theta)}   (7)

(1) defines Fisher information. (2) invokes the fact that the information in a sufficient statistic is the same as that of
the sample itself. (3) expands the natural logarithm term and drops a constant. (4) and (5) differentiate with respect to
θ. (6) replaces A and B with their expectations. (7) is algebra.
The end result, namely,

\mathcal{I}(\theta) = \frac{n}{\theta(1-\theta)},

is the reciprocal of the variance of the mean number of successes in n Bernoulli trials, as expected (see last sentence
of the preceding section).
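As a quick check on this result (a small Python sketch, not part of the article), the expectation in step (5) can be computed exactly by summing over the binomial distribution of A and compared with the closed form n/(θ(1 − θ)):

```python
from math import comb

# Verify I(theta) = n / (theta (1 - theta)) for n Bernoulli trials by computing
# E[A/theta^2 + B/(1-theta)^2] exactly over the binomial pmf of A (B = n - A).
theta, n = 0.3, 12
info = 0.0
for a in range(n + 1):
    pmf = comb(n, a) * theta**a * (1 - theta)**(n - a)
    info += pmf * (a / theta**2 + (n - a) / (1 - theta)**2)

closed_form = n / (theta * (1 - theta))
print(info, closed_form)  # the two agree exactly (up to rounding)
```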

Matrix form
When there are N parameters, so that θ is an N×1 vector, the Fisher information takes
the form of an N×N matrix, the Fisher information matrix (FIM), with typical element

\left[ \mathcal{I}(\theta) \right]_{i,j} = \mathrm{E}\left[ \frac{\partial}{\partial\theta_i} \ln f(X;\theta) \; \frac{\partial}{\partial\theta_j} \ln f(X;\theta) \right].
The FIM is an N×N positive semidefinite symmetric matrix, defining a Riemannian metric on the N-dimensional
parameter space, thus connecting Fisher information to differential geometry. In that context, this metric is known as
the Fisher information metric, and the topic is called information geometry.
Under certain regularity conditions, the Fisher information matrix may also be written as

\left[ \mathcal{I}(\theta) \right]_{i,j} = -\mathrm{E}\left[ \frac{\partial^2}{\partial\theta_i \, \partial\theta_j} \ln f(X;\theta) \right].
The metric is interesting in several ways; it can be derived as the Hessian of the relative entropy; it can be
understood as a metric induced from the Euclidean metric, after appropriate change of variable; in its
complex-valued form, it is the Fubini-Study metric.

Orthogonal parameters
We say that two parameters θi and θj are orthogonal if the element of the ith row and jth column of the Fisher
information matrix is zero. Orthogonal parameters are easy to deal with in the sense that their maximum likelihood
estimates are independent and can be calculated separately. When dealing with research problems, it is very common
for the researcher to invest some time searching for an orthogonal parametrization of the densities involved in the
problem.

Multivariate normal distribution


The FIM for an N-variate multivariate normal distribution has a special form. Let μ(θ) = (μ1(θ), ..., μN(θ))^T
and let Σ(θ) be the covariance matrix. Then the typical element \mathcal{I}_{m,n},
0 ≤ m, n < N, of the FIM for X ∼ N(μ(θ), Σ(θ)) is

\mathcal{I}_{m,n} = \frac{\partial \mu^{\mathrm{T}}}{\partial\theta_m} \Sigma^{-1} \frac{\partial \mu}{\partial\theta_n} + \frac{1}{2} \mathrm{tr}\left( \Sigma^{-1} \frac{\partial \Sigma}{\partial\theta_m} \Sigma^{-1} \frac{\partial \Sigma}{\partial\theta_n} \right),

where (..)^T denotes the transpose of a vector, tr(..) denotes the trace of a square matrix, and

\frac{\partial \mu}{\partial\theta_m} = \left( \frac{\partial \mu_1}{\partial\theta_m}, \frac{\partial \mu_2}{\partial\theta_m}, \ldots, \frac{\partial \mu_N}{\partial\theta_m} \right)^{\mathrm{T}}, \qquad \frac{\partial \Sigma}{\partial\theta_m} = \left( \frac{\partial \Sigma_{i,j}}{\partial\theta_m} \right)_{i,j}.
Note that a special, but very common, case is the one where Σ(θ) = Σ, a constant. Then

\mathcal{I}_{m,n} = \frac{\partial \mu^{\mathrm{T}}}{\partial\theta_m} \Sigma^{-1} \frac{\partial \mu}{\partial\theta_n}.

In this case the Fisher information matrix may be identified with the coefficient matrix of the normal equations of
least squares estimation theory.
Another special case occurs when the mean and covariance depend on two different parameter vectors, say, β and θ. This is
especially popular in the analysis of spatial data, which often uses a linear model with correlated residuals. We have

\mathcal{I}(\beta, \theta) = \mathrm{diag}\left( \mathcal{I}(\beta), \mathcal{I}(\theta) \right),

where

\left[ \mathcal{I}(\beta) \right]_{m,n} = \frac{\partial \mu^{\mathrm{T}}}{\partial\beta_m} \Sigma^{-1} \frac{\partial \mu}{\partial\beta_n}, \qquad \left[ \mathcal{I}(\theta) \right]_{m,n} = \frac{1}{2} \mathrm{tr}\left( \Sigma^{-1} \frac{\partial \Sigma}{\partial\theta_m} \Sigma^{-1} \frac{\partial \Sigma}{\partial\theta_n} \right).

The proof of this special case is given in the literature.[10] Using the same technique, it is not difficult to
prove the original result.

Properties

Reparametrization
The Fisher information depends on the parametrization of the problem. If θ and η are two scalar parametrizations of
an estimation problem, and θ is a continuously differentiable function of η, then

\mathcal{I}_\eta(\eta) = \mathcal{I}_\theta(\theta(\eta)) \left( \frac{d\theta}{d\eta} \right)^2,

where \mathcal{I}_\eta and \mathcal{I}_\theta are the Fisher information measures of η and θ, respectively.[11]


In the vector case, suppose θ and η are k-vectors which parametrize an estimation problem, and suppose that θ is
a continuously differentiable function of η. Then[12]

\mathcal{I}_\eta(\eta) = J^{\mathrm{T}} \, \mathcal{I}_\theta(\theta(\eta)) \, J,

where the (i, j)th element of the k × k Jacobian matrix J is defined by

J_{ij} = \frac{\partial \theta_i}{\partial \eta_j},

and where J^T is the matrix transpose of J.
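The scalar rule can be verified on a concrete example. The following Python sketch (not part of the article) reparametrizes a single Bernoulli trial by its log-odds η, where θ = 1/(1 + e^(−η)); in the θ parametrization I(θ) = 1/(θ(1 − θ)), and the chain-rule factor is dθ/dη = θ(1 − θ):

```python
import math

# Check I_eta = I_theta * (d theta / d eta)^2 for one Bernoulli trial,
# with eta the log-odds parameter: theta = 1 / (1 + exp(-eta)).
eta = 0.7
theta = 1.0 / (1.0 + math.exp(-eta))
info_theta = 1.0 / (theta * (1 - theta))   # Fisher info in theta
dtheta_deta = theta * (1 - theta)          # chain-rule factor
info_eta = info_theta * dtheta_deta ** 2   # reparametrization rule

# Direct computation in the eta parametrization:
# ln f = x*eta - ln(1 + e^eta), so -d^2/deta^2 ln f = theta(1-theta),
# which is constant in x and hence equals its expectation.
info_eta_direct = theta * (1 - theta)
print(info_eta, info_eta_direct)
```

Both routes give I(η) = θ(1 − θ), as the rule predicts.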


In information geometry, this is seen as a change of coordinates on a Riemannian manifold, and the intrinsic
properties of curvature are unchanged under different parametrization. In general, the Fisher information matrix
provides a Riemannian metric (more precisely, the Fisher-Rao metric) for the manifold of thermodynamic states, and
can be used as an information-geometric complexity measure for a classification of phase transitions, e.g., the scalar
curvature of the thermodynamic metric tensor diverges at (and only at) a phase transition point.[13]
In the thermodynamic context, the Fisher information matrix is directly related to the rate of change in the
corresponding order parameters.[14] In particular, such relations identify second-order phase transitions via
divergences of individual elements of the Fisher information matrix.

Applications

Optimal design of experiments


Fisher information is widely used in optimal experimental design. Because of the reciprocity of estimator-variance
and Fisher information, minimizing the variance corresponds to maximizing the information.
When the linear (or linearized) statistical model has several parameters, the mean of the parameter-estimator is a
vector and its variance is a matrix. The inverse matrix of the variance-matrix is called the "information matrix".
Because the variance of the estimator of a parameter vector is a matrix, the problem of "minimizing the variance" is
complicated. Using statistical theory, statisticians compress the information-matrix using real-valued summary
statistics; being real-valued functions, these "information criteria" can be maximized.
Traditionally, statisticians have evaluated estimators and designs by considering some summary statistic of the
covariance matrix (of a mean-unbiased estimator), usually with positive real values (like the determinant or matrix
trace). Working with positive real-numbers brings several advantages: If the estimator of a single parameter has a
positive variance, then the variance and the Fisher information are both positive real numbers; hence they are
members of the convex cone of nonnegative real numbers (whose nonzero members have reciprocals in this same
cone). For several parameters, the covariance-matrices and information-matrices are elements of the convex cone of
nonnegative-definite symmetric matrices in a partially ordered vector space, under the Loewner (Löwner) order. This
cone is closed under matrix-matrix addition, under matrix-inversion, and under the multiplication of positive
real-numbers and matrices. An exposition of matrix theory and the Loewner-order appears in Pukelsheim[15].
The traditional optimality-criteria are the information-matrix's invariants; algebraically, the traditional
optimality-criteria are functionals of the eigenvalues of the (Fisher) information matrix: see optimal design.

Jeffreys prior in Bayesian statistics


In Bayesian statistics, the Fisher information is used to calculate the Jeffreys prior, which is a standard,
non-informative prior for continuous distribution parameters.[16]

Relation to KL-divergence
The Fisher information matrix is the Hessian matrix (second derivative) of the Kullback–Leibler divergence of the
distribution p(x; θ) from the true distribution p(x; θ0).[17] Here, θ0 is the true value of the parameter and
derivatives are taken with respect to θ.

Distinction from the Hessian of the entropy


In certain cases, the Fisher information matrix is the negative of the Hessian of the Shannon entropy. The cases
where this explicitly holds are given below. A distribution's Shannon entropy

H = -\int f(X;\theta) \, \ln f(X;\theta) \, dX

has as the negative of the (i, j) entry of its Hessian

-\frac{\partial^2 H}{\partial\theta_i \, \partial\theta_j} = \int \ln f \, \frac{\partial^2 f}{\partial\theta_i \, \partial\theta_j} \, dX + \int \frac{1}{f} \, \frac{\partial f}{\partial\theta_i} \, \frac{\partial f}{\partial\theta_j} \, dX,

using the normalization \int f \, dX = 1. In contrast, the (i, j) entry of the Fisher information matrix is

\mathcal{I}_{i,j} = \int \frac{1}{f} \, \frac{\partial f}{\partial\theta_i} \, \frac{\partial f}{\partial\theta_j} \, dX.

The difference between the negative Hessian and the Fisher information is therefore

\int \ln f \, \frac{\partial^2 f}{\partial\theta_i \, \partial\theta_j} \, dX.

This extra term goes away if one considers the Hessian of the relative entropy instead of the Shannon
entropy; the relative entropy can be thought of as incorporating the Bayesian prior into the calculation.

Equality
In particular, the Fisher information matrix will be the same as the negative of the Hessian of the entropy in
situations where \frac{\partial^2 f}{\partial\theta_i \, \partial\theta_j} is zero for all i, j, X, and θ. For instance, a two-dimensional example that makes the
two equal is

f(X; \theta_1, \theta_2) = \theta_1 g_1(X) + \theta_2 g_2(X) + (1 - \theta_1 - \theta_2) \, g_3(X),

where g1(X), g2(X), and g3(X) are probability distributions.

Inequality
A one-dimensional example where the Fisher information differs from the negative Hessian is

f(X; \theta) = \frac{1}{\sqrt{2\pi}} \, e^{-(X-\theta)^2/2}.

In this case, the entropy H is independent of the distribution mean θ, so the second
derivative of the entropy with respect to θ is zero. However, for the Fisher information, we have

\mathcal{I}(\theta) = 1.
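Both claims in this example can be confirmed numerically. The following Python sketch (not part of the article) integrates the entropy and the Fisher information of a unit-variance normal with mean θ on a wide grid; the entropy comes out identical for different values of θ, while the Fisher information comes out as 1:

```python
import math

# For f(x; theta) = N(theta, 1): entropy is theta-independent, Fisher info = 1.
def normal_pdf(x, theta):
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)

def entropy(theta, lo=-12.0, hi=12.0, steps=4000):
    # midpoint-rule integral of -f ln f
    h, dx = 0.0, (hi - lo) / steps
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        p = normal_pdf(x, theta)
        h -= p * math.log(p) * dx
    return h

def fisher(theta, lo=-12.0, hi=12.0, steps=4000):
    # midpoint-rule integral of f * (d/dtheta ln f)^2, score = x - theta
    total, dx = 0.0, (hi - lo) / steps
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        p = normal_pdf(x, theta)
        total += p * (x - theta) ** 2 * dx
    return total

print(entropy(0.0), entropy(1.5))  # equal: entropy does not depend on theta
print(fisher(0.0))                 # approximately 1
```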
Notes
[1] Lehmann and Casella, p. 115
[2] Lucien Le Cam (1986) Asymptotic Methods in Statistical Decision Theory: Pages 336 and 618–621 (von Mises and Bernstein).
[3] Savage (1976)
[4] Savage(1976), page 156
[5] Edgeworth (Sept. 1908, Dec. 1908)
[6] Pratt(1976)
[7] Stigler (1978,1986,1999)
[8] Hald (1998,1999)
[9] Lehmann and Casella, eq. (2.5.16).
[10] Maximum likelihood estimation of models for residual covariance in spatial regression, K. V. Mardia and R. J. Marshall, Biometrika (1984),
71, 1, pp. 135-46
[11] Lehmann and Casella, eq. (2.5.11).
[12] Lehmann and Casella, eq. (2.6.16)
[13] W. Janke, D. A. Johnston, and R. Kenna, Physica A 336, 181 (2004).
[14] M. Prokopenko, J. T. Lizier, O. Obst, and X. R. Wang, Relating Fisher information to order parameters, Physical Review E, 84, 041116,
2011.
[15] Friedrich Pukelsheim, Optimal Design of Experiments, 1993
[16] Bayesian Theory, José M. Bernardo and Adrian F. M. Smith, John Wiley & Sons, 1994
[17] Christian Gourieroux. Statistics and Econometric Models. 1995. p. 88

References
• Edgeworth, F. Y. (Sep. 1908). "On the Probable Errors of Frequency-Constants". Journal of the Royal Statistical
Society 71 (3): 499–512. doi:10.2307/2339293. JSTOR 2339293.
• Edgeworth, F. Y. (Dec. 1908). "On the Probable Errors of Frequency-Constants". Journal of the Royal Statistical
Society 71 (4): 651–678. doi:10.2307/2339378. JSTOR 2339378.
• Frieden, B. Roy (2004) Science from Fisher Information: A Unification. Cambridge Univ. Press. ISBN
0-521-00911-1.
• Hald, A. (May 1999). "On the History of Maximum Likelihood in Relation to Inverse Probability and Least
Squares". Statistical Science 14 (2): 214–222. JSTOR 2676741.
• Hald, A. (1998). A History of Mathematical Statistics from 1750 to 1930. New York: Wiley.
ISBN 0-471-17912-4.
• Lehmann, E. L.; Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. ISBN 0-387-98502-6.
• Le Cam, Lucien (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag.
ISBN 0-387-96307-3.
• Pratt, John W. (May 1976). "F. Y. Edgeworth and R. A. Fisher on the Efficiency of Maximum Likelihood
Estimation". The Annals of Statistics 4 (3): 501–514. doi:10.1214/aos/1176343457. JSTOR 2958222.
• Leonard J. Savage (May 1976). "On Rereading R. A. Fisher". The Annals of Statistics 4 (3): 441–500.
doi:10.1214/aos/1176343456. JSTOR 2958221.
• Schervish, Mark J. (1995). "Section 2.3.1". Theory of Statistics. New York: Springer. ISBN 0-387-94546-6.
• Stephen Stigler (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Harvard
University Press. ISBN 0-674-40340-1.
• Stephen M. Stigler (1978). "Francis Ysidro Edgeworth, Statistician". Journal of the Royal Statistical Society.
Series A (General) 141 (3): 287–322. doi:10.2307/2344804. JSTOR 2344804.
• Stephen Stigler (1999). Statistics on the Table: The History of Statistical Concepts and Methods. Harvard
University Press. ISBN 0-674-83601-4.
• Van Trees, H. L. (1968). Detection, Estimation, and Modulation Theory, Part I. New York: Wiley.
ISBN 0-471-09517-6.

External links
• Fisher4Cast: a Matlab, GUI-based Fisher information tool (http://www.mathworks.com/matlabcentral/
fileexchange/loadFile.do?objectId=20008&objectType=File) for research and teaching, primarily aimed at
cosmological forecasting applications.
• FandPLimitTool (http://www4.utsouthwestern.edu/wardlab/fandplimittool.asp) a GUI-based software to
calculate the Fisher information and CRLB with application to single-molecule microscopy.
Fisher's exact test 217

Fisher's exact test


Fisher's exact test[1][2][3] is a statistical significance test used in the analysis of contingency tables. Although in
practice it is employed when sample sizes are small, it is valid for all sample sizes. It is named after its inventor, R.
A. Fisher, and is one of a class of exact tests, so called because the significance of the deviation from a null
hypothesis can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the
sample size grows to infinity, as with many statistical tests. Fisher is said to have devised the test following a
comment from Dr Muriel Bristol, who claimed to be able to detect whether the tea or the milk was added first to her
cup; see lady tasting tea.

Purpose and scope


The test is useful for categorical data that result from classifying objects in two different ways; it is used to examine
the significance of the association (contingency) between the two kinds of classification. So in Fisher's original
example, one criterion of classification could be whether milk or tea was put in the cup first; the other could be
whether Dr Bristol thinks that the milk or tea was put in first. We want to know whether these two classifications are
associated – that is, whether Dr Bristol really can tell whether milk or tea was poured in first. Most uses of the Fisher
test involve, like this example, a 2 × 2 contingency table. The p-value from the test is computed as if the margins of
the table are fixed, i.e. as if, in the tea-tasting example, Dr Bristol knows the number of cups with each treatment
(milk or tea first) and will therefore provide guesses with the correct number in each category. As pointed out by
Fisher, this leads under a null hypothesis of independence to a hypergeometric distribution of the numbers in the
cells of the table.
With large samples, a chi-squared test can be used in this situation. However, the significance value it provides is
only an approximation, because the sampling distribution of the test statistic that is calculated is only approximately
equal to the theoretical chi-squared distribution. The approximation is inadequate when sample sizes are small, or the
data are very unequally distributed among the cells of the table, resulting in the cell counts predicted on the null
hypothesis (the "expected values") being low. The usual rule of thumb for deciding whether the chi-squared
approximation is good enough is that the chi-squared test is not suitable when the expected values in any of the cells
of a contingency table are below 5, or below 10 when there is only one degree of freedom (this rule is now known to
be overly conservative[4]). In fact, for small, sparse, or unbalanced data, the exact and asymptotic p-values can be
quite different and may lead to opposite conclusions concerning the hypothesis of interest.[5][6] In contrast the Fisher
test is, as its name states, exact as long as the experimental procedure keeps the row and column totals fixed, and it
can therefore be used regardless of the sample characteristics. It becomes difficult to calculate with large samples or
well-balanced tables, but fortunately these are exactly the conditions where the chi-squared test is appropriate.
For hand calculations, the test is only feasible in the case of a 2 × 2 contingency table. However the principle of the
test can be extended to the general case of an m × n table,[7][8] and some statistical packages provide a calculation
(sometimes using a Monte Carlo method to obtain an approximation) for the more general case.[9]

Example
For example, a sample of teenagers might be divided into male and female on the one hand, and those that are and
are not currently dieting on the other. We hypothesize, for example, that the proportion of dieting individuals is
higher among the women than among the men, and we want to test whether any difference of proportions that we
observe is significant. The data might look like this:

Men Women Row total

Dieting 1 9 10

Non-dieting 11 3 14

Col. total 12 12 24

These data would not be suitable for analysis by a chi-squared test, because the expected values in the table are all
below 10, and in a 2 × 2 contingency table, the number of degrees of freedom is always 1.
The question we ask about these data is: knowing that 10 of these 24 teenagers are dieters, and that 12 of the 24 are
female, what is the probability that these 10 dieters would be so unevenly distributed between the women and the
men? If we were to choose 10 of the teenagers at random, what is the probability that 9 of them would be among the
12 women, and only 1 from among the 12 men?
Before we proceed with the Fisher test, we first introduce some notation. We represent the cells by the letters a, b, c
and d, call the totals across rows and columns marginal totals, and represent the grand total by n. So the table now
looks like this:

Men Women Total

Dieting a b a+b

Non-dieting c d c+d

Totals a+c b+d a + b + c + d (=n)

Fisher showed that the probability of obtaining any such set of values was given by the hypergeometric distribution:

p = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)! \, (c+d)! \, (a+c)! \, (b+d)!}{a! \, b! \, c! \, d! \, n!},

where \binom{n}{k} is the binomial coefficient and the symbol ! indicates the factorial operator.
This formula gives the exact probability of observing this particular arrangement of the data, assuming the given
marginal totals, on the null hypothesis that men and women are equally likely to be dieters. To put it another way, if
we assume that the probability that a man is a dieter is p, the probability that a woman is a dieter is p, and we assume
that both men and women enter our sample independently of whether or not they are dieters, then this
hypergeometric formula gives the conditional probability of observing the values a, b, c, d in the four cells,
conditionally on the observed marginals. This remains true even if men enter our sample with different probabilities
than women. The requirement is merely that the two classification characteristics - gender, and dieter (or not) - are
not associated. For example, suppose we knew probabilities P,Q,p,q with P+Q=p+q=1 such that (male dieter, male
non-dieter, female dieter, female non-dieter) had respective probabilities (Pp,Pq,Qp,Qq) for each individual
encountered under our sampling procedure. Then still, were we to calculate the distribution of cell entries conditional
given marginals, we would obtain the above formula in which neither p nor P occurs. Thus, we can calculate the
exact probability of any arrangement of the 24 teenagers into the four cells of the table, but Fisher showed that to
generate a significance level, we need consider only the cases where the marginal totals are the same as in the
observed table, and among those, only the cases where the arrangement is as extreme as the observed arrangement,

or more so. (Barnard's test relaxes this constraint on one set of the marginal totals.) In the example, there are 11 such
cases. Of these only one is more extreme in the same direction as our data; it looks like this:

Men Women Total

Dieting 0 10 10

Non-dieting 12 2 14

Totals 12 12 24

In order to calculate the significance of the observed data, i.e. the total probability of observing data as extreme or
more extreme if the null hypothesis is true, we have to calculate the values of p for both these tables, and add them
together. This gives a one-tailed test; for a two-tailed test we must also consider tables that are equally extreme but in
the opposite direction. Unfortunately, classification of the tables according to whether or not they are 'as extreme' is
problematic. The approach used by R's "fisher.test" function computes the p-value by summing the
probabilities for all tables with probabilities less than or equal to that of the observed table. For tables with small
counts, the 2-sided p-value can differ substantially from twice the 1-sided value, unlike the case with test statistics
that have a symmetric sampling distribution.
As noted above, most modern statistical packages will calculate the significance of Fisher tests, in some cases even
where the chi-squared approximation would also be acceptable. The actual computations as performed by statistical
software packages will as a rule differ from those described above, because numerical difficulties may result from
the large values taken by the factorials. A simple, somewhat better computational approach relies on a gamma
function or log-gamma function, but methods for accurate computation of hypergeometric and binomial probabilities
remain an active research area.
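The one-sided computation for the dieting example above can be sketched directly from the hypergeometric formula. This Python sketch (not part of the article) conditions on the observed margins (row totals 10 and 14, column total 12) and sums the observed table (a = 1 dieting men) with the single more extreme table (a = 0):

```python
from math import comb

# Hypergeometric probability of a table with 'a' in the top-left cell,
# given row totals row1, row2 and first-column total col1:
# p = C(a+b, a) C(c+d, c) / C(n, a+c)
def hypergeom_prob(a, row1, row2, col1):
    n = row1 + row2
    c = col1 - a          # bottom-left cell
    return comb(row1, a) * comb(row2, c) / comb(n, col1)

# Observed table: a=1, b=9, c=11, d=3; the only more extreme table has a=0.
p_one_sided = hypergeom_prob(1, 10, 14, 12) + hypergeom_prob(0, 10, 14, 12)
print(round(p_one_sided, 5))  # approximately 0.00138
```

The result, about 0.00138, is well below conventional significance levels, so the one-sided test rejects the null hypothesis of no association.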

Controversies
Despite the fact that Fisher's test gives exact p-values, some authors have argued that it is conservative, i.e. that its
actual rejection rate is below the nominal significance level.[10][11][12] The apparent contradiction stems from the
combination of a discrete statistic with fixed significance levels.[13][14] To be more precise, consider the following
proposal for a significance test at the 5%-level: reject the null hypothesis for each table to which Fisher's test assigns
a p-value equal to or smaller than 5%. Because the set of all tables is discrete, there may not be a table for which
equality is achieved. If αe is the largest p-value smaller than 5% which can actually occur for some table, then the
proposed test effectively tests at the αe-level. For small sample sizes, αe might be significantly lower than
5%.[10][11][12] While this effect occurs for any discrete statistic (not just in contingency tables, or for Fisher's test), it
has been argued that the problem is compounded by the fact that Fisher's test conditions on the marginals.[15] To
avoid the problem, many authors discourage the use of fixed significance levels when dealing with discrete
problems.[13][14]
Another early discussion revolved around the necessity to condition on the marginals.[16][17] Fisher's test gives exact
p-values both for fixed and for random marginals. Other tests, most prominently Barnard's, require random
marginals. Some authors[13][14][17] (including, later, Barnard himself[13]) have criticized Barnard's test based on this
property. They argue that the marginal totals are an (almost[14]) ancillary statistic, containing (almost) no
information about the tested property.

Alternatives
An alternative exact test, Barnard's exact test, has been developed and proponents of it suggest that this method is
more powerful, particularly in 2 × 2 tables. Another alternative is to use maximum likelihood estimates to calculate a
p-value from the exact binomial or multinomial distributions and accept or reject based on the p-value.

References
[1] Fisher, R. A. (1922). "On the interpretation of χ2 from contingency tables, and the calculation of P". Journal of the Royal Statistical Society
85 (1): 87–94. doi:10.2307/2340521. JSTOR 2340521.
[2] Fisher, R.A. (1954). Statistical Methods for Research Workers. Oliver and Boyd. ISBN 0-05-002170-2.
[3] Agresti, Alan (1992). "A Survey of Exact Inference for Contingency Tables". Statistical Science 7 (1): 131–153. doi:10.1214/ss/1177011454.
JSTOR 2246001.
[4] Larntz, Kinley (1978). "Small-sample comparisons of exact levels for chi-squared goodness-of-fit statistics". Journal of the American
Statistical Association 73 (362): 253–263. doi:10.2307/2286650. JSTOR 2286650.
[5] Mehta, Cyrus R; Patel, Nitin R; Tsiatis, Anastasios A (1984). "Exact significance testing to establish treatment equivalence with ordered
categorical data". Biometrics 40 (3): 819–825. doi:10.2307/2530927. JSTOR 2530927. PMID 6518249.
[6] Mehta, C. R. 1995. SPSS 6.1 Exact test for Windows. Englewood Cliffs, NJ: Prentice Hall.
[7] Mehta C.R., Patel N.R. (1983). "A Network Algorithm for Performing Fisher's Exact Test in r Xc Contingency Tables". Journal of the
American Statistical Association 78 (382): 427–434. doi:10.2307/2288652.
[8] mathworld.wolfram.com (http://mathworld.wolfram.com/FishersExactTest.html) Page giving the formula for the general form of Fisher's
exact test for m × n contingency tables
[9] Cyrus R. Mehta and Nitin R. Patel (1986). "ALGORITHM 643: FEXACT: a FORTRAN subroutine for Fisher's exact test on unordered r×c
contingency tables". ACM Trans. Math. Softw. 12 (2): 154–161. doi:10.1145/6497.214326.
[10] Liddell, Douglas (1976). "Practical tests of 2x2 contingency tables". The Statistician 25 (4): 295–304. doi:10.2307/2988087.
JSTOR 2988087.
[11] Berkson, Joseph (1978). "In dispraise of the exact test". Journal of Statistical Planning and Inference 2: 27–42.
[12] D'Agostino, R. B., Chase, W., and Belanger, A. (1988). "The Appropriateness of Some Common Procedures for Testing Equality of Two
Independent Binomial Proportions". The American Statistician 42 (3): 198–202. doi:10.2307/2685002. JSTOR 2685002.
[13] Yates, F. (1984). "Tests of Significance for 2 x 2 Contingency Tables (with discussion)". Journal of the Royal Statistical Society, Ser. A 147
(3): 426–463. doi:10.2307/2981577. JSTOR 2981577.
[14] Roderick J. A. Little (1989). "Testing the Equality of Two Independent Binomial Proportions". The American Statistician 43 (4): 283–288.
doi:10.2307/2685390. JSTOR 2685390.
[15] Cyrus R. Mehta and Pralay Senchaudhuri, "Conditional versus Unconditional Exact Tests for Comparing Two Binomials" (4 September
2003). Retrieved 20 November 2009 from http://www.cytel.com/Papers/twobinomials.pdf
[16] Barnard, G.A (1945). "A New Test for 2×2 Tables". Nature 156 (3954): 177. doi:10.1038/156177a0.
[17] Fisher (1945). "A New Test for 2 × 2 Tables". Nature 156: 388. doi:10.1038/156388a0.; Barnard, G.A (1945). "A New Test for 2×2 Tables".
Nature 156: 783. doi:10.1038/156783b0.

Web Resources
Calculate Fishers Exact Test Online (http://www.langsrud.com/stat/fisher.htm)
Gamma distribution 221

Gamma distribution
Gamma

[Plots: probability density function and cumulative distribution function of the gamma distribution.]

Parameters: k > 0 shape, θ > 0 scale | α > 0 shape, β > 0 rate
Support: x ∈ (0, ∞)
Probability density function (pdf): \frac{1}{\Gamma(k)\theta^k} x^{k-1} e^{-x/\theta} | \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}
Cumulative distribution function (cdf): \frac{\gamma(k, x/\theta)}{\Gamma(k)} | \frac{\gamma(\alpha, \beta x)}{\Gamma(\alpha)}
Mean: kθ | α/β
Median: No simple closed form
Mode: (k − 1)θ for k ≥ 1 | (α − 1)/β for α ≥ 1
Variance: kθ² | α/β²
Skewness: 2/√k | 2/√α
Excess kurtosis: 6/k | 6/α
Entropy: k + ln θ + ln Γ(k) + (1 − k)ψ(k)
Moment-generating function (mgf): (1 − θt)^{−k} for t < 1/θ | (1 − t/β)^{−α} for t < β
Characteristic function: (1 − θit)^{−k} | (1 − it/β)^{−α}

In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability
distributions. There are two different parameterizations in common use:
1. With a shape parameter k and a scale parameter θ.
2. With a shape parameter α = k and an inverse scale parameter β = 1⁄θ, called a rate parameter.
The parameterization with k and θ appears to be more common in econometrics and certain other applied fields,
where e.g. the gamma distribution is frequently used to model waiting times. For instance, in life testing, the waiting
time until death is a random variable that is frequently modeled with a gamma distribution.[1]
The parameterization with α and β is more common in Bayesian statistics, where the gamma distribution is used as a
conjugate prior distribution for various types of inverse scale (aka rate) parameters, such as the λ of an exponential
distribution or a Poisson distribution — or for that matter, the β of the gamma distribution itself. (The closely related
inverse gamma distribution is used as a conjugate prior for scale parameters, such as the variance of a normal
distribution.)
If k is an integer, then the distribution represents an Erlang distribution; i.e., the sum of k independent exponentially
distributed random variables, each of which has a mean of θ (which is equivalent to a rate parameter of 1/θ).
Equivalently, if α is an integer, then the distribution again represents an Erlang distribution, i.e. the sum of α
independent exponentially distributed random variables, each of which has a mean of 1/β (which is equivalent to a
rate parameter of β).
The gamma distribution is the maximum entropy probability distribution for a random variable X for which
E[X] = kθ = α/β is fixed and greater than zero, and E[ln(X)] = ψ(k) + ln(θ) = ψ(α) − ln(β) is fixed (ψ is the
digamma function).[2]

Characterization using shape k and scale θ


A random variable X that is gamma-distributed with shape k and scale θ is denoted

X \sim \Gamma(k, \theta) \equiv \mathrm{Gamma}(k, \theta).
Probability density function



The probability density function of the gamma distribution can be expressed in
terms of the gamma function, parameterized in terms of a shape
parameter k and scale parameter θ. Both
k and θ will be positive values.

The equation defining the probability
density function of a gamma-distributed
random variable x is

f(x; k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\theta^k \, \Gamma(k)} \qquad \text{for } x > 0 \text{ and } k, \theta > 0.

[Figure: the gamma pdf for a range of values of k and x, with θ set to 1, 2, 3, 4, 5 and 6.]

(This parameterization is used in the infobox and the plots.)

Cumulative distribution function


The cumulative distribution function is the regularized gamma function:

F(x; k, \theta) = \int_0^x f(u; k, \theta) \, du = \frac{\gamma(k, x/\theta)}{\Gamma(k)},

where γ(k, x/θ) is the lower incomplete gamma function.

It can also be expressed as follows, if k is a positive integer (i.e., the distribution is an Erlang distribution):[6]

F(x; k, \theta) = 1 - \sum_{i=0}^{k-1} \frac{1}{i!} \left( \frac{x}{\theta} \right)^i e^{-x/\theta}.
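The two expressions for the cdf can be cross-checked numerically. The following Python sketch (not part of the article) evaluates the regularized lower incomplete gamma function by its standard power series and compares it with the Erlang closed form for an integer shape parameter:

```python
import math

# Regularized lower incomplete gamma: gamma(k, z)/Gamma(k) via the series
#   z^k e^{-z} * sum_{n>=0} z^n / Gamma(k + n + 1)
def reg_lower_gamma(k, z, terms=200):
    total, term = 0.0, 1.0 / math.gamma(k + 1)   # term_0 = 1/Gamma(k+1)
    for n in range(terms):
        total += term
        term *= z / (k + n + 1)                  # term_{n+1}/term_n = z/(k+n+1)
    return (z ** k) * math.exp(-z) * total

# Erlang closed form for integer k: 1 - sum_{i=0}^{k-1} (x/theta)^i e^{-x/theta} / i!
def erlang_cdf(k, x, theta):
    z = x / theta
    return 1.0 - sum(z ** i * math.exp(-z) / math.factorial(i) for i in range(k))

k, theta, x = 3, 2.0, 5.0
print(reg_lower_gamma(k, x / theta), erlang_cdf(k, x, theta))  # both ~0.4562
```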
Characterization using shape α and rate β


Alternatively, the gamma distribution can be parameterized in terms of a shape parameter α = k and an inverse scale
parameter β = 1⁄θ, called a rate parameter:

g(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x} \qquad \text{for } x > 0.

If α is a positive integer, then

\Gamma(\alpha) = (\alpha - 1)!.

A random variable X that is gamma-distributed with shape α and rate β is denoted

X \sim \Gamma(\alpha, \beta) \equiv \mathrm{Gamma}(\alpha, \beta).


Both parametrizations are common because either can be more convenient depending on the situation.

Cumulative distribution function


The cumulative distribution function is the regularized gamma function:

F(x; \alpha, \beta) = \int_0^x g(u; \alpha, \beta) \, du = \frac{\gamma(\alpha, \beta x)}{\Gamma(\alpha)},

where γ(α, βx) is the lower incomplete gamma function.

It can also be expressed as follows, if α is a positive integer (i.e., the distribution is an Erlang distribution):[6]

F(x; \alpha, \beta) = 1 - \sum_{i=0}^{\alpha-1} \frac{(\beta x)^i}{i!} e^{-\beta x}.

Properties

Skewness
The skewness depends only on the shape parameter (α); it equals 2/\sqrt{\alpha}. The gamma distribution approaches a normal distribution when α is large
(approximately when α > 10).

Median calculation
Unlike the mode and the mean, which have readily calculable formulas based on the parameters, the median does not
have an easy closed form equation. The median for this distribution is defined as the constant x0 such that

\int_0^{x_0} f(x; k, \theta) \, dx = \frac{1}{2}.

The difficulty of this calculation depends on the k parameter, and it is best carried out by computer, since the
calculations can quickly grow unwieldy.
For the Γ( n + 1, 1 ) distribution the median ( ν ) is known[7] to lie between

n + \frac{2}{3} < \nu < n + \log 2.
These bounds have since been improved.[8]
A method of estimating the median for any gamma distribution has been derived based on the ratio μ /( μ − ν ), which
to a very good approximation is a linear function of α when α ≥ 1.[9] The median estimated by this method is
approximately

\nu \approx \mu \, \frac{3\alpha - 0.8}{3\alpha + 0.2},

where μ is the mean.



Summation
If Xi has a Γ(ki, θ) distribution for i = 1, 2, ..., N (i.e., all distributions have the same scale parameter θ), then

\sum_{i=1}^N X_i \sim \Gamma\left( \sum_{i=1}^N k_i, \; \theta \right),

provided all Xi are independent.


The gamma distribution exhibits infinite divisibility.
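The summation property can be illustrated by simulation. This Python sketch (not part of the article) draws independent Γ(k1, θ) and Γ(k2, θ) variates, sums them, and checks the first two moments against those of Γ(k1 + k2, θ), namely mean kθ and variance kθ²:

```python
import random

# Simulate X1 + X2 with X1 ~ Gamma(k1, theta), X2 ~ Gamma(k2, theta) independent;
# the sum should be Gamma(k1 + k2, theta), checked here through its moments.
random.seed(1)
k1, k2, theta, n = 2.0, 3.5, 1.5, 200000
sums = [random.gammavariate(k1, theta) + random.gammavariate(k2, theta)
        for _ in range(n)]
mean = sum(sums) / n
var = sum((s - mean) ** 2 for s in sums) / n

expected_mean = (k1 + k2) * theta        # k * theta = 8.25
expected_var = (k1 + k2) * theta ** 2    # k * theta^2 = 12.375
print(mean, expected_mean)
print(var, expected_var)
```

Note that `random.gammavariate` takes the (shape, scale) parameterization, matching the k, θ convention above.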

Scaling
If

X \sim \Gamma(k, \theta),

then for any c > 0,

cX \sim \Gamma(k, c\theta).

Hence the use of the term "scale parameter" to describe θ.

Equivalently, if

X \sim \Gamma(\alpha, \beta),

then for any c > 0,

cX \sim \Gamma(\alpha, \beta / c).

Hence the use of the term "inverse scale parameter" to describe β.

Exponential family
The Gamma distribution is a two-parameter exponential family with natural parameters k − 1 and −1⁄θ (equivalently,
α − 1 and −β), and natural statistics X and ln(X).
If the shape parameter α is held fixed, the resulting one-parameter family of distributions is a natural exponential
family.

Logarithmic expectation
One can show that

\mathrm{E}[\ln X] = \psi(\alpha) - \ln \beta,

or equivalently,

\mathrm{E}[\ln X] = \psi(k) + \ln \theta,

where ψ(α) or ψ(k) is the digamma function.

This can be derived using the exponential family formula for the moment generating function of the sufficient
statistic, because one of the sufficient statistics of the gamma distribution is ln(x).

Information entropy
The information entropy can be derived as

  H(X) = α − ln β + ln Γ(α) + (1 − α) ψ(α).

In the k,θ parameterization, the information entropy is given by

  H(X) = k + ln θ + ln Γ(k) + (1 − k) ψ(k).
Kullback–Leibler divergence
The Kullback–Leibler divergence (KL-divergence), as with the information entropy and various other theoretical
properties, is more commonly seen using the α,β parameterization because of its uses in Bayesian and other
theoretical statistics frameworks.

The KL-divergence of Γ(αp, βp) ("true" distribution) from Γ(αq, βq) ("approximating" distribution) is given by[10]

  DKL(αp, βp; αq, βq) = (αp − αq) ψ(αp) − ln Γ(αp) + ln Γ(αq) + αq (ln βp − ln βq) + αp (βq − βp)/βp.

[Figure: Illustration of the Kullback–Leibler (KL) divergence for two Gamma PDFs, here with β = β0 + 1 and shapes
set to 1, 2, 3, 4, 5 and 6. The typical asymmetry for the KL divergence is clearly visible.]

Written using the k,θ parameterization, the KL-divergence of Γ(kp, θp) from Γ(kq, θq) is given by

  DKL(kp, θp; kq, θq) = (kp − kq) ψ(kp) − ln Γ(kp) + ln Γ(kq) + kq (ln θq − ln θp) + kp (θp − θq)/θq.
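The divergence can be evaluated numerically; a sketch in the shape-rate parameterization (the digamma helper is an implementation choice of this sketch):

```python
import math

def digamma(x):
    """psi(x) via recurrence and asymptotic series (helper for this sketch)."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))

def kl_gamma(ap, bp, aq, bq):
    """KL-divergence of Gamma(ap, bp) ("true") from Gamma(aq, bq)
    ("approximating"), shape-rate parameterization."""
    return ((ap - aq) * digamma(ap)
            - math.lgamma(ap) + math.lgamma(aq)
            + aq * (math.log(bp) - math.log(bq))
            + ap * (bq - bp) / bp)
```

With ap = aq = 1 this reduces to the familiar exponential-distribution divergence ln(bp/bq) + bq/bp − 1, and the asymmetry mentioned above is easy to observe by swapping the arguments.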

Laplace transform
The Laplace transform of the gamma PDF is

  F(s) = (1 + θs)^(−k) = β^α / (s + β)^α.

Parameter estimation

Maximum likelihood estimation


The likelihood function for N iid observations (x1, ..., xN) is

  L(k, θ) = ∏(i=1..N) f(xi; k, θ),
from which we calculate the log-likelihood function

  ℓ(k, θ) = (k − 1) Σi ln xi − Σi xi/θ − Nk ln θ − N ln Γ(k).
Finding the maximum with respect to θ by taking the derivative and setting it equal to zero yields the maximum
likelihood estimator of the θ parameter:

  θ̂ = (1/(kN)) Σi xi.
Substituting this into the log-likelihood function gives

  ℓ(k) = (k − 1) Σi ln xi − kN − kN ln( Σj xj / (kN) ) − N ln Γ(k).
Finding the maximum with respect to k by taking the derivative and setting it equal to zero yields

  ln k − ψ(k) = ln( (1/N) Σi xi ) − (1/N) Σi ln xi,

where

  ψ(k) = Γ′(k)/Γ(k)

is the digamma function.


There is no closed-form solution for k. The function is numerically very well behaved, so if a numerical solution is
desired, it can be found using, for example, Newton's method. An initial value of k can be found either using the
method of moments, or using the approximation

  ln k − ψ(k) ≈ (1/(2k)) (1 + 1/(6k + 1)).
If we let

  s = ln( (1/N) Σi xi ) − (1/N) Σi ln xi,

then k is approximately

  k ≈ (3 − s + √((s − 3)² + 24s)) / (12s),

which is within 1.5% of the correct value. An explicit form for the Newton–Raphson update of this initial guess is
given by Choi and Wette (1969) as the following expression:

  k ← k − (ln k − ψ(k) − s) / (1/k − ψ′(k)),

where ψ′(k) denotes the trigamma function (the derivative of the digamma function).
The digamma and trigamma functions can be difficult to calculate with high precision. However, approximations
known to be good to several significant figures can be computed using the following approximation formulae:

  ψ(k) ≈ ln k − 1/(2k) − 1/(12k²)

and

  ψ′(k) ≈ 1/k + 1/(2k²) + 1/(6k³).
For details, see Choi and Wette (1969).
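The whole estimation recipe (closed-form initial guess, then Newton updates) can be sketched in Python. The digamma and trigamma helpers below use asymptotic series with a recurrence, an implementation choice of this sketch rather than the exact Choi-Wette formulae; the data-generating demo at the end is illustrative only:

```python
import math, random

def digamma(x):
    """psi(x): recurrence psi(x) = psi(x + 1) - 1/x, then asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))

def trigamma(x):
    """psi'(x): recurrence psi'(x) = psi'(x + 1) + 1/x**2, then asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return acc + inv * (1 + inv * (0.5 + inv * (1/6 - inv2 * (1/30 - inv2 / 42))))

def fit_gamma_mle(xs, iters=20):
    """MLE of (shape k, scale theta): initial guess for k, then Newton's
    method on  ln k - psi(k) = s."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.log(mean) - sum(math.log(x) for x in xs) / n
    k = (3.0 - s + math.sqrt((s - 3.0) ** 2 + 24.0 * s)) / (12.0 * s)
    for _ in range(iters):
        k -= (math.log(k) - digamma(k) - s) / (1.0 / k - trigamma(k))
    return k, mean / k

# Demo on synthetic data (seed and parameters chosen arbitrarily):
rng = random.Random(7)
data = [rng.gammavariate(2.0, 3.0) for _ in range(20000)]
k_hat, theta_hat = fit_gamma_mle(data)
```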

Bayesian minimum mean-squared error


With known k and unknown θ, the posterior PDF for θ (using the standard scale-invariant prior 1/θ for θ) is

  P(θ | k, x1, ..., xN) ∝ θ^(−Nk−1) exp(−y/θ).

Denoting

  y ≡ x1 + x2 + ... + xN,

integration over θ can be carried out using a change of variables, revealing that 1/θ is gamma-distributed with
parameters α = Nk, β = y.

The moments can be computed by taking the ratio of the mth moment to the zeroth moment:

  E[θ^m] = Γ(Nk − m) / Γ(Nk) · y^m,

which shows that the mean ± standard deviation estimate of the posterior distribution for θ is

  y/(Nk − 1) ± y/((Nk − 1)√(Nk − 2)).
Generating gamma-distributed random variables


Given the scaling property above, it is enough to generate gamma variables with θ = 1, as we can later convert to any
value of β with simple division.
Using the fact that a Γ(1, 1) distribution is the same as an Exp(1) distribution, and noting the method of generating
exponential variables, we conclude that if U is uniformly distributed on (0, 1], then −ln U is distributed Γ(1, 1).
Now, using the "α-addition" property of gamma distribution, we expand this result:

  −Σ(i=1..n) ln Ui ~ Γ(n, 1),

where the Ui are all uniformly distributed on (0, 1] and independent. All that is left now is to generate a variable
distributed as Γ(δ, 1) for 0 < δ < 1 and apply the "α-addition" property once more. This is the most difficult part.
Random generation of gamma variates is discussed in detail by Devroye,[11] noting that none are uniformly fast for
all shape parameters. For small values of the shape parameter, the algorithms are often not valid.[12] For arbitrary

values of the shape parameter, one can apply the Ahrens and Dieter[13] modified acceptance-rejection method
Algorithm GD (shape k ≥ 1), or transformation method[14] when 0 < k < 1. Also see Cheng and Feast Algorithm
GKM 3[15] or Marsaglia's squeeze method.[16]
The following is a version of the Ahrens-Dieter acceptance-rejection method:[13]
1. Let m be 1.
2. Generate V(3m−2), V(3m−1) and V(3m) as independent uniformly distributed on (0, 1] variables.
3. If V(3m−2) ≤ v0, where v0 = e/(e + δ), then go to step 4, else go to step 5.
4. Let ξm = V(3m−1)^(1/δ) and ηm = V(3m) ξm^(δ−1). Go to step 6.
5. Let ξm = 1 − ln V(3m−1) and ηm = V(3m) e^(−ξm).
6. If ηm > ξm^(δ−1) e^(−ξm), then increment m and go to step 2.
7. Assume ξ = ξm to be the realization of Γ(δ, 1).
A summary of this is

  θ ( ξ − Σ(i=1..⌊k⌋) ln Ui ) ~ Γ(k, θ),

where
• ⌊k⌋ is the integral part of k,
• ξ has been generated using the algorithm above with δ = {k} (the fractional part of k),
• the Ui and V(l) are distributed as explained above and are all independent.
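The steps above translate directly into code; a Python sketch (the 1 − random() shift, seed, and sanity check are choices of this sketch):

```python
import math, random

def gamma_variate(k, theta, rng):
    """Gamma(k, theta) sample: Ahrens-Dieter rejection (steps 1-7 above) for
    the fractional part of k, plus -ln of uniforms for the integer part."""
    n = int(k)
    delta = k - n
    xi = 0.0
    if delta > 0.0:
        v0 = math.e / (math.e + delta)
        while True:
            v1 = rng.random()
            v2 = 1.0 - rng.random()   # uniform on (0, 1], safe for log/powers
            v3 = rng.random()
            if v1 <= v0:
                xi = v2 ** (1.0 / delta)
                eta = v3 * xi ** (delta - 1.0)
            else:
                xi = 1.0 - math.log(v2)
                eta = v3 * math.exp(-xi)
            if eta <= xi ** (delta - 1.0) * math.exp(-xi):
                break                 # accept xi ~ Gamma(delta, 1)
    # Integer part: sum of floor(k) independent Exp(1) variables
    expo = -sum(math.log(1.0 - rng.random()) for _ in range(n))
    return theta * (xi + expo)

# Seeded sanity check: the sample mean should be near k * theta = 4.05
rng = random.Random(42)
samples = [gamma_variate(2.7, 1.5, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```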

Related distributions

Special cases
• If X ~ Γ(1, 1/λ), then X has an exponential distribution with rate parameter λ.
• If X ~ Γ(ν/2, 2), then X is identical to χ2(ν), the chi-squared distribution with ν degrees of freedom.
Conversely, if Q ~ χ2(ν) and c is a positive constant, then cQ ~ Γ(ν/2, 2c).
• If k is an integer, the gamma distribution is an Erlang distribution and is the probability distribution of the
waiting time until the k-th "arrival" in a one-dimensional Poisson process with intensity 1/θ. If X ~ Γ(k, θ) with
integer k and Y ~ Pois(x/θ), then P(X > x) = P(Y < k).
• If X has a Maxwell-Boltzmann distribution with parameter a, then X² ~ Γ(3/2, 2a²).
• If X ~ Γ(k, θ), then a positive power of X follows a generalized gamma distribution, with parameters determined
by the exponent together with k and θ.
• A logarithmic transformation of a skew-logistic random variable is gamma-distributed with shape 1, i.e. follows an
exponential distribution: see skew-logistic distribution.

Conjugate prior
In Bayesian inference, the gamma distribution is the conjugate prior to many likelihood distributions: the Poisson,
exponential, normal (with known mean), Pareto, gamma with known shape parameter, inverse gamma with known
shape parameter, and Gompertz with known scale parameter.
The Gamma distribution's conjugate prior is:[17]

where Z is the normalizing constant, which has no closed-form solution. The posterior distribution can be found by
updating the parameters as follows,

where n is the number of observations and xi is the ith observation.

Compound gamma
If the shape parameter of the gamma distribution is known, but the inverse-scale parameter is unknown, then a
gamma distribution for the inverse-scale forms a conjugate prior. The compound distribution, which results from
integrating out the inverse-scale, has a closed-form solution, known as the compound gamma distribution.[18]

Others
• If X has a Γ(k, θ) distribution, then 1/X has an inverse-gamma distribution with parameters k and θ−1.
• If X and Y are independently distributed Γ(α, θ) and Γ(β, θ) respectively, then X / (X + Y) has a beta distribution
with parameters α and β.
• If Xi are independently distributed Γ(αi, 1) respectively, then the vector (X1 / S, ..., Xn / S), where S = X1 + ... + Xn,
follows a Dirichlet distribution with parameters α1, …, αn.
• For large k the gamma distribution converges to a Gaussian distribution with mean μ = kθ and variance σ² = kθ².
• The Gamma distribution is the conjugate prior for the precision of the normal distribution with known mean.
• The Wishart distribution is a multivariate generalization of the gamma distribution (samples are positive-definite
matrices rather than positive real numbers).
• The Gamma distribution is a special case of the generalized gamma distribution, the generalized integer gamma
distribution, and the generalized inverse Gaussian distribution
• Among the discrete distributions, the negative binomial distribution is sometimes considered the discrete
analogue of the Gamma distribution

Applications
The gamma distribution has been used to model the size of insurance claims and rainfalls.[19] This means that
aggregate insurance claims and the amount of rainfall accumulated in a reservoir are modelled by a gamma process.
The gamma distribution is also used to model errors in multi-level Poisson regression models, because the
combination of the Poisson distribution and a gamma distribution is a negative binomial distribution.
In neuroscience, the gamma distribution is often used to describe the distribution of inter-spike intervals.[20]
Although in practice the gamma distribution often provides a good fit, there is no underlying biophysical motivation
for using it.
In bacterial gene expression, the copy number of a constitutively expressed protein often follows the gamma
distribution, where the scale and shape parameter are, respectively, the mean number of bursts per cell cycle and the
mean number of protein molecules produced by a single mRNA during its lifetime.[21]
The gamma distribution is widely used as a conjugate prior in Bayesian statistics. It is the conjugate prior for the
precision (i.e. inverse of the variance) of a normal distribution. It is also the conjugate prior for the exponential
distribution.

Notes
[1] See Hogg and Craig (1978, Remark 3.3.1) for an explicit motivation.
[2] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http://www.wise.xmu.edu.cn/Master/Download/..\..\UploadFiles\paper-masterdownload\2009519932327055475115776.pdf). Journal of Econometrics (Elsevier): 219–230. Retrieved 2011-06-02.
[3] http://commons.wikimedia.org/wiki/File:Gamma-PDF-3D-by-k.png
[4] http://commons.wikimedia.org/wiki/File:Gamma-PDF-3D-by-Theta.png
[5] http://commons.wikimedia.org/wiki/File:Gamma-PDF-3D-by-x.png
[6] Papoulis, Pillai. Probability, Random Variables, and Stochastic Processes, Fourth Edition.
[7] Chen, J.; Rubin, H. (1986). "Bounds for the difference between median and mean of Gamma and Poisson distributions". Statist. Probab. Lett. 4: 281–283.
[8] Choi, K. P. (1994). "On the medians of Gamma distributions and an equation of Ramanujan". Proc. Amer. Math. Soc. 121 (1): 245–251.
[9] Banneheka, B. M. S. G.; Ekanayake, G. E. M. U. P. D. (2009). "A new point estimator for the median of Gamma distribution". Viyodaya J. Science 14: 95–103.
[10] W. D. Penny, KL-Divergences of Normal, Gamma, Dirichlet, and Wishart densities.
[11] Luc Devroye (1986). Non-Uniform Random Variate Generation (http://luc.devroye.org/rnbookindex.html). New York: Springer-Verlag. See Chapter 9, Section 3, pages 401–428.
[12] Devroye (1986), p. 406.
[13] Ahrens, J. H.; Dieter, U. (1982). "Generating gamma variates by a modified rejection technique". Communications of the ACM 25: 47–54. Algorithm GD, p. 53.
[14] Ahrens, J. H.; Dieter, U. (1974). "Computer methods for sampling from gamma, beta, Poisson and binomial distributions". Computing 12: 223–246. CiteSeerX: 10.1.1.93.3828 (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.93.3828).
[15] Cheng, R. C. H.; Feast, G. M. (1979). "Some simple gamma variate generators". Appl. Stat. 28: 290–295.
[16] Marsaglia, G. (1977). "The squeeze method for generating gamma variates". Comput. Math. Appl. 3: 321–325.
[17] Fink, D. (1995). A Compendium of Conjugate Priors (http://www.stat.columbia.edu/~cook/movabletype/mlm/CONJINTRnew+TEX.pdf). In progress report: Extension and enhancement of methods for setting data quality objectives. (DOE contract 95-831).
[18] Dubey, Satya D. (December 1970). "Compound gamma, beta and F distributions" (http://www.springerlink.com/content/u750hg4630387205/). Metrika 16: 27–31. doi:10.1007/BF02613934.
[19] Aksoy, H. (2000). "Use of Gamma Distribution in Hydrological Analysis" (http://journals.tubitak.gov.tr/engineering/issues/muh-00-24-6/muh-24-6-7-9909-13.pdf). Turk. J. Engin. Environ. Sci. 24: 419–428.
[20] Robson, J. G.; Troy, J. B. (1987). "Nature of the maintained discharge of Q, X, and Y retinal ganglion cells of the cat". J. Opt. Soc. Am. A 4: 2301–2307.
[21] Friedman, N.; Cai, L.; Xie, X. S. (2006). "Linking stochastic dynamics to population distribution: An analytical framework of gene expression". Phys. Rev. Lett. 97: 168302.

References
• R. V. Hogg and A. T. Craig (1978). Introduction to Mathematical Statistics, 4th edition. New York: Macmillan.
(See Section 3.3.)
• S. C. Choi and R. Wette (1969). "Maximum Likelihood Estimation of the Parameters of the Gamma Distribution
and Their Bias". Technometrics 11 (4): 683–690.

External links
• Weisstein, Eric W., " Gamma distribution (http://mathworld.wolfram.com/GammaDistribution.html)" from
MathWorld.
• Engineering Statistics Handbook (http://www.itl.nist.gov/div898/handbook/eda/section3/eda366b.htm)
Gamma function 232

Gamma function
In mathematics, the gamma function (represented by the capital Greek letter Γ) is an extension of the factorial
function, with its argument shifted down by 1, to real and complex numbers. That is, if n is a positive integer:

  Γ(n) = (n − 1)!

The gamma function is defined for all complex numbers except the non-positive integers. For complex numbers
with a positive real part, it is defined via an improper integral that converges:

  Γ(z) = ∫0^∞ t^(z−1) e^(−t) dt.

[Figure: The gamma function along part of the real axis.]

This integral function is extended by analytic continuation to all complex numbers except the non-positive integers
(where the function has simple poles), yielding the meromorphic function we call the gamma function.
The gamma function is a component in various probability-distribution functions, and as such it is applicable in the
fields of probability and statistics, as well as combinatorics.

Motivation
The gamma function can be seen as a solution to the following interpolation problem:

  "Find a smooth curve that connects the points (x, y) given by y = (x − 1)! at the positive integer values for x."

A plot of the first few factorials makes clear that such a curve can be drawn, but it would be preferable to have a
formula that precisely describes the curve, in which the number of operations does not depend on the size of n. The
simple formula for the factorial, n! = 1 × 2 × … × n, cannot be used directly for fractional values of n since it is only
valid when n is a natural number (i.e., a positive integer). There are, relatively speaking, no such simple solutions
for factorials; any combination of sums, products, powers, exponential functions, or logarithms with a fixed number
of terms will not suffice to express n!. Stirling's approximation is asymptotically equal to the factorial function for
large values of n. It is possible to find a general formula for factorials using tools such as integrals and limits from
calculus. A good solution to this is the gamma function.

[Figure: It is easy graphically to interpolate the factorial function to non-integer values, but is there a formula that
describes the resulting curve?]

There are infinitely many continuous extensions of the factorial to non-integers: infinitely many curves can be drawn
through any set of isolated points. The gamma function is the most useful solution in practice, being analytic (except
at the non-positive integers), and it can be characterized in several ways. However, it is not the only analytic function
which extends the factorial, as adding to it any analytic function which is zero on the positive integers will give
another function with that property.
A more restrictive property than satisfying the above interpolation is to satisfy the recurrence relation defining a
slightly translated version of the factorial function,

  f(1) = 1,  f(x + 1) = x f(x),
for x equal to any positive real number. The Bohr–Mollerup theorem proves that these properties, together with the
assumption that f be logarithmically convex (aka: "superconvex"[1]), uniquely determine f for positive, real inputs.
From there, the gamma function can be extended to all real and complex values (except the negative integers and
zero) by using the unique analytic continuation of f.

Definition

Main definition
The notation Γ(z) is due to Legendre. If the real part of the complex number z is positive (Re(z) > 0), then the
integral

  Γ(z) = ∫0^∞ t^(z−1) e^(−t) dt    (1)

converges absolutely, and is known as the Euler integral of the second kind (the Euler integral of the first kind
defines the beta function). Using integration by parts, we see that the gamma function satisfies the functional
equation:

  Γ(z + 1) = z Γ(z).

Combining this with Γ(1) = 1, we get:

  Γ(n + 1) = n!

for all positive integers n.

[Figure: The extended version of the gamma function in the complex plane.]

The identity Γ(z) = Γ(z+1)/z can be used (or, yielding the same result, analytic continuation can be used) to extend
the integral formulation for Γ(z) to a meromorphic function defined for all complex numbers z, except z = −n for
integers n ≥ 0, where the function has simple poles with residue (−1)^n/n!.
It is this extended version that is commonly referred to as the gamma function.
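The functional equation and Γ(n + 1) = n! can be verified directly with the standard library's math.gamma; a quick sketch:

```python
import math

# Gamma(n) = (n - 1)! for positive integers, from Gamma(z + 1) = z Gamma(z)
# together with Gamma(1) = 1.
for n in range(1, 12):
    assert math.isclose(math.gamma(n), math.factorial(n - 1))

# The identity Gamma(z) = Gamma(z + 1) / z also holds at non-integer points:
z = 0.75
assert math.isclose(math.gamma(z), math.gamma(z + 1.0) / z)
```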

Alternative definitions
The following infinite product definitions for the gamma function, due to Euler and Weierstrass respectively, are
valid for all complex numbers z, except the non-positive integers:

  Γ(z) = lim(n→∞) n! n^z / (z(z + 1)⋯(z + n))

  Γ(z) = (e^(−γz)/z) ∏(n=1..∞) (1 + z/n)^(−1) e^(z/n),

where γ is the Euler–Mascheroni constant. It is straightforward to show that the Euler definition
satisfies the functional equation (1) above.

A somewhat curious parametrization of the gamma function is given in terms of generalized Laguerre polynomials,

which converges for Re(z) < 1/2.

The gamma function in the complex plane


The behavior of Γ(z) for an increasing positive variable is simple: it grows quickly, faster than an exponential
function. Asymptotically as z → ∞, the magnitude of the gamma function is given by Stirling's formula

  Γ(z) ~ √(2π) z^(z − 1/2) e^(−z),

where the symbol ~ means that the quotient of both sides converges to 1.
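Stirling's formula is easy to check numerically; a brief sketch:

```python
import math

def stirling(z):
    """Leading-order Stirling approximation to Gamma(z)."""
    return math.sqrt(2.0 * math.pi / z) * (z / math.e) ** z

# The ratio Gamma(z) / stirling(z) approaches 1 as z grows:
r10 = math.gamma(10.0) / stirling(10.0)
r100 = math.gamma(100.0) / stirling(100.0)
```

The relative error shrinks roughly like 1/(12z), consistent with the first correction term of the full asymptotic series.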
The behavior for nonpositive z is more intricate. Euler's integral does not converge for z ≤ 0, but the function it
defines in the positive complex half-plane has a unique analytic continuation to the negative half-plane. One way to
find that analytic continuation is to use Euler's integral for positive arguments and extend the domain to negative
numbers by repeated application of the recurrence formula,

  Γ(z) = Γ(z + n + 1) / (z(z + 1)⋯(z + n)),

choosing n such that z + n is positive. The product in the denominator is zero when z equals any of the integers
0, −1, −2,... . Thus, the gamma function must be undefined at those points; it is a meromorphic function with simple
poles at the nonpositive integers. The residues of the function at those points are:

  Res(Γ, −n) = (−1)^n / n!.
The following image shows the graph of the gamma function along the real line:

The gamma function is nonzero everywhere along the real line, although it comes arbitrarily close to zero as
z → −∞. There is in fact no complex number z for which Γ(z) = 0, and hence the reciprocal gamma function 1/Γ(z)
is an entire function, with zeros at z = 0, −1, −2,.... We see that the gamma function has a local minimum at
z ≈ 1.46163, where it attains the value Γ(z) ≈ 0.885603. The gamma function must alternate sign between the poles
because the product in the forward recurrence contains an odd number of negative factors if the number of poles
between z and z + n is odd, and an even number if the number of poles is even.

[Figures: plots of the gamma function in the complex plane: absolute value, real part, and imaginary part.]

Properties

General
Other important functional equations for the gamma function are Euler's reflection formula

  Γ(1 − z) Γ(z) = π / sin(πz)

and the duplication formula

  Γ(z) Γ(z + 1/2) = 2^(1 − 2z) √π Γ(2z).

The duplication formula is a special case of the multiplication theorem

  Γ(z) Γ(z + 1/m) Γ(z + 2/m) ⋯ Γ(z + (m − 1)/m) = (2π)^((m − 1)/2) m^(1/2 − mz) Γ(mz).

A simple but useful property, which can be seen from the limit definition, is:

  Γ(z*) = (Γ(z))*,  where * denotes complex conjugation.

Perhaps the best-known value of the gamma function at a non-integer argument is

  Γ(1/2) = √π,

which can be found by setting z = 1/2 in the reflection or duplication formulas, by using the relation to the beta
function given below with x = y = 1/2, or simply by making the substitution u = √t in the integral definition of the
gamma function, resulting in a Gaussian integral. In general, for non-negative integer values of n we have:

  Γ(1/2 + n) = ((2n − 1)!! / 2^n) √π,

where n!! denotes the double factorial and, when n = 0, (−1)!! = 1. See Particular values of the gamma function for
calculated values.
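The half-integer formula can be checked against the standard library's math.gamma; a short sketch (helper names are choices of this sketch):

```python
import math

def double_factorial(n):
    """n!! for odd n >= -1, with (-1)!! = 1."""
    return math.prod(range(n, 0, -2)) if n > 0 else 1

def gamma_half(n):
    """Gamma(1/2 + n) = ((2n - 1)!! / 2**n) * sqrt(pi) for n = 0, 1, 2, ..."""
    return double_factorial(2 * n - 1) * math.sqrt(math.pi) / 2 ** n
```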
It might be tempting to generalize the result that Γ(1/2) = √π by looking for a formula for other individual
values Γ(r) where r is rational. However, these numbers are not known to be expressible by themselves in terms
of elementary functions. It has been proved that Γ(n + r) is a transcendental number and algebraically
independent of π for any integer n and each of the fractions r = 1/6, 1/4, 1/3, 2/3, 3/4, and 5/6.[2] In general, when
computing values of the gamma function, we must settle for numerical approximations.
Another useful limit for asymptotic approximations is:

  lim(n→∞) Γ(n + α) / (Γ(n) n^α) = 1.

The derivatives of the gamma function are described in terms of the polygamma function. For example:

  Γ′(z) = Γ(z) ψ(z),

where ψ is the digamma function. For positive integer m the derivative of the gamma function can be calculated as
follows (here γ is the Euler–Mascheroni constant):

  Γ′(m + 1) = m! ( −γ + Σ(k=1..m) 1/k ).

The n-th derivative of the gamma function is:

  d^n/dx^n Γ(x) = ∫0^∞ t^(x−1) e^(−t) (ln t)^n dt.[3]

The gamma function has simple poles at z = −n = 0, −1, −2, −3, ... . The residue there is

  Res(Γ, −n) = (−1)^n / n!.

Moreover, the gamma function has the following Laurent expansion in 0,

  Γ(z) = 1/z − γ + (γ²/2 + π²/12) z + O(z²),

valid for |z| < 1. In particular

  Γ(z) ≈ 1/z − γ  for small |z|.
The Bohr–Mollerup theorem states that among all functions extending the factorial functions to the positive real
numbers, only the gamma function is log-convex, that is, its natural logarithm is convex.

Pi function
An alternative notation which was originally introduced by Gauss and which was sometimes used is the Pi function,
which in terms of the gamma function is

  Π(z) = Γ(z + 1) = z Γ(z),

so that

  Π(n) = n!

for every non-negative integer n.

Using the Pi function the reflection formula takes on the form

  Π(z) Π(−z) = πz / sin(πz) = 1 / sinc(z),

where sinc is the normalized sinc function, while the multiplication theorem takes on the form

  Π(z/m) Π((z − 1)/m) ⋯ Π((z − m + 1)/m) = (2π)^((m − 1)/2) m^(−z − 1/2) Π(z).

We also sometimes find

  π(z) = 1 / Π(z),

which is an entire function, defined for every complex number. That π(z) is entire entails it has no poles, so Γ(z) has
no zeros.

Relation to other functions


• In the first integral above, which defines the gamma function, the limits of integration are fixed. The upper and
lower incomplete gamma functions are the functions obtained by allowing the lower or upper (respectively) limit
of integration to vary.
• The gamma function is related to the beta function by the formula

  B(x, y) = Γ(x) Γ(y) / Γ(x + y).
• The derivative of the logarithm of the gamma function is called the digamma function; higher derivatives are the
polygamma functions.
• The analog of the gamma function over a finite field or a finite ring is the Gaussian sums, a type of exponential
sum.
• The reciprocal gamma function is an entire function and has been studied as a specific topic.
• The gamma function also shows up in an important relation with the Riemann zeta function, ζ(z), and also in the
following elegant formula:

  ζ(z) Γ(z) = ∫0^∞ t^(z−1) / (e^t − 1) dt,

which is valid only for Re(z) > 1.

The logarithm of the gamma function satisfies the following formula due to Lerch:

  ln Γ(x) = ζ_H′(0, x) − ζ′(0),

where ζ_H is the Hurwitz zeta function, ζ is the Riemann zeta function and the prime (′) denotes
differentiation in the first variable.
• The gamma function is intimately related to the stretched exponential function. For instance, the moments of that
function are

  ⟨τ^n⟩ ≡ ∫0^∞ t^(n−1) e^(−(t/τ)^β) dt = (τ^n / β) Γ(n/β).
Particular values
Some particular values of the gamma function are:

  Γ(−3/2) = (4/3)√π ≈ 2.363
  Γ(−1/2) = −2√π ≈ −3.545
  Γ(1/2) = √π ≈ 1.772
  Γ(1) = 0! = 1
  Γ(3/2) = (1/2)√π ≈ 0.886
  Γ(2) = 1! = 1
  Γ(5/2) = (3/4)√π ≈ 1.329
  Γ(3) = 2! = 2

Approximations
Complex values of the gamma function can be computed numerically with arbitrary precision using Stirling's
approximation or the Lanczos approximation.
The gamma function can be computed to fixed precision for Re(z) ∈ [1, 2] by applying integration by parts to Euler's
integral. For any positive number x the gamma function can be written

  Γ(z) = x^z e^(−x) Σ(n=0..∞) x^n / (z(z + 1)⋯(z + n)) + ∫x^∞ t^(z−1) e^(−t) dt.
When Re(z) ∈ [1, 2] and x ≥ 1, the absolute value of the last integral is smaller than (x + 1) e−x. By choosing x large
enough, this last expression can be made smaller than 2−N for any desired value N. Thus, the gamma function can be
evaluated to N bits of precision with the above series.
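A sketch of this evaluation scheme (the split point x is chosen from the tail bound (x + 1)e^(−x); the series stopping tolerance is an assumption of this sketch):

```python
import math

def gamma_series(z, bits=53):
    """Gamma(z) for z in [1, 2]: split Euler's integral at x, sum the series
    x**z * e**-x * sum_n x**n / (z (z+1) ... (z+n)), and drop the tail,
    which is bounded by (x + 1) * e**-x."""
    x = 1.0
    while (x + 1.0) * math.exp(-x) > 2.0 ** -bits:
        x += 1.0                      # grow x until the tail is negligible
    term = 1.0 / z
    total = term
    n = 0
    while term > 1e-18 * total:       # positive terms, no cancellation
        n += 1
        term *= x / (z + n)
        total += term
    return x ** z * math.exp(-x) * total
```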
The only fast algorithm for calculation of the Euler gamma function for any algebraic argument (including rational)
was constructed by E.A. Karatsuba,[4][5] see for details "Fast Algorithms and the FEE Method".[6]
For arguments that are integer multiples of 1/24 the gamma function can also be evaluated quickly using
arithmetic-geometric mean iterations (see particular values of the gamma function).
Because the Gamma and factorial functions grow so rapidly for moderately large arguments, many computing
environments include a function that returns the natural logarithm of the gamma function (often given the name
lngamma in programming environments or gammaln in spreadsheets); this grows much more slowly, and for
combinatorial calculations allows adding and subtracting logs instead of multiplying and dividing very large values.
The digamma function, which is the derivative of this function, is also commonly seen. In the context of technical
and physical applications, e.g. with wave propagation, the functional equation

  Γ(z + 1) = z Γ(z)

is often used since it allows one to determine function values in one strip of width 1 in z from the neighbouring strip.
In particular, starting with a good approximation for a z with large real part one may go step by step down to the
desired z. Following an indication of Carl Friedrich Gauss, Rocktaeschel (1922) proposed for lngamma an
approximation for large Re(z):

  ln Γ(z) ≈ (z − 1/2) ln z − z + (1/2) ln(2π).

This can be used to accurately approximate ln Γ(z) for z with a smaller Re(z) via (P. E. Böhmer, 1939)

  ln Γ(z − m) = ln Γ(z) − Σ(k=1..m) ln(z − k).
A more accurate approximation can be obtained by using more terms from the asymptotic expansions of
and , which are based on Stirling's approximation.
An asymptotic approximation of the gamma function is

  Γ(z) ≈ √(2π) z^(z − 1/2) e^(−z) (1 + 1/(12z)).

Applications
Opening a random page in an advanced table of formulas, one may be as likely to spot the gamma function as a
trigonometric function. One author describes the gamma function as "Arguably, the most common special function,
or the least 'special' of them. The other transcendental functions listed below are called 'special' because you could
conceivably avoid some of them by staying away from many specialized mathematical topics. On the other hand, the
gamma function is most difficult to avoid."[7]

Integration problems
The gamma function finds application in such diverse areas as quantum physics, astrophysics and fluid dynamics.[8]
The gamma distribution, which is formulated in terms of the gamma function, is used in statistics to model a wide
range of processes; for example, the time between occurrences of earthquakes.[9]
The primary reason for the gamma function's usefulness in such contexts is the prevalence of expressions of the type
f(t) e^(−g(t)), which describe processes that decay exponentially in time or space. Integrals of such expressions can
occasionally be solved in terms of the gamma function when no elementary solution exists. For example, if f is a
power function and g is a linear function, a simple change of variables gives the evaluation

  ∫0^∞ t^b e^(−at) dt = Γ(b + 1) / a^(b+1).
The fact that the integration is performed along the entire positive real line might signify that the gamma function
describes the cumulation of a time-dependent process that continues indefinitely, or the value might be the total of a
distribution in an infinite space.
It is of course frequently useful to take limits of integration other than 0 and ∞ to describe the cumulation of a
finite process, in which case the ordinary gamma function is no longer a solution; the solution is then called an
incomplete gamma function. (The ordinary gamma function, obtained by integrating across the entire positive real
line, is sometimes called the complete gamma function for contrast.)
An important category of exponentially decaying functions is that of Gaussian functions e^(−x²) and integrals
thereof, such as the error function. There are many interrelations between these functions and the gamma function;
notably, the square root of π we obtained by evaluating Γ(1/2) is the "same" as that found in the normalizing
factor of the error function and the normal distribution.
The integrals we have discussed so far involve transcendental functions, but the gamma function also arises from
integrals of purely algebraic functions. In particular, the arc lengths of ellipses and of the lemniscate, which are
curves defined by algebraic equations, are given by elliptic integrals that in special cases can be evaluated in terms of
the gamma function. The gamma function can also be used to calculate the "volume" and "area" of n-dimensional
hyperspheres.
Another important special case is that of the beta function

  B(x, y) = ∫0^1 t^(x−1) (1 − t)^(y−1) dt = Γ(x) Γ(y) / Γ(x + y).

Calculating products
The gamma function's ability to generalize factorial products immediately leads to applications in many areas of
mathematics; in combinatorics, and by extension in areas such as probability theory and the calculation of power
series. Many expressions involving products of successive integers can be written as some combination of factorials,
the most important example perhaps being that of the binomial coefficient

  C(n, k) = n! / (k! (n − k)!).
The example of binomial coefficients motivates why the properties of the gamma function when extended to
negative numbers are natural. A binomial coefficient gives the number of ways to choose k elements from a set of n
elements; if k > n, there are of course no ways. If k > n, (n − k)! is the factorial of a negative integer and
hence infinite if we use the gamma function definition of factorials; dividing by infinity gives the expected value
of 0.
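In numerical work the same replacement is routinely made with the log-gamma function to avoid overflow; a sketch (valid when all three gamma arguments are positive, so the sign question for negative arguments does not arise):

```python
import math

def binom(x, y):
    """Generalized binomial coefficient Gamma(x+1) / (Gamma(y+1) Gamma(x-y+1)),
    computed in log space; valid when x + 1, y + 1 and x - y + 1 are positive."""
    return math.exp(math.lgamma(x + 1) - math.lgamma(y + 1) - math.lgamma(x - y + 1))
```

For integer arguments this agrees with the ordinary binomial coefficient, and it extends smoothly to non-integer arguments such as binom(2.5, 1.0) = 2.5.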
We can replace the factorial by a gamma function to extend any such formula to the complex numbers. Generally,
this works for any product wherein each factor is a rational function of the index variable, by factoring the rational
function into linear expressions. If P and Q are monic polynomials of degree ρ and σ with respective roots
p1, …, pρ and q1, …, qσ, we have

  ∏(i=1..n) P(i)/Q(i) = ( ∏(j=1..ρ) Γ(n + 1 − pj)/Γ(1 − pj) ) · ( ∏(k=1..σ) Γ(1 − qk)/Γ(n + 1 − qk) ).

If we have a way to calculate the gamma function numerically, it is a breeze to calculate numerical values of such
products. The number of gamma functions in the right-hand side depends only on the degree of the polynomials, so it
does not matter whether n equals 5 or a very large number. Moreover, due to the poles of the gamma function, the
equation also holds (in the sense of taking limits) when the left-hand product contains zeros or poles.
By taking limits, certain rational products with infinitely many factors can be evaluated in terms of the gamma
function as well. Due to the Weierstrass factorization theorem, analytic functions can be written as infinite products,
and these can sometimes be represented as finite products or quotients of the gamma function. We have already seen
one striking example: the reflection formula essentially represents the sine function as the product of two gamma
functions. Starting from this formula, the exponential function as well as all the trigonometric and hyperbolic
functions can be expressed in terms of the gamma function.
More functions yet, including the hypergeometric function and special cases thereof, can be represented by means of
complex contour integrals of products and quotients of the gamma function, called Mellin-Barnes integrals.

Analytic number theory


An elegant and deep application of the gamma function is in the study of the Riemann zeta function. A fundamental
property of the Riemann zeta function is its functional equation:

  Γ(z/2) ζ(z) π^(−z/2) = Γ((1 − z)/2) ζ(1 − z) π^(−(1 − z)/2).

Among other things, this provides an explicit form for the analytic continuation of the zeta function to a
meromorphic function in the complex plane and leads to an immediate proof that the zeta function has infinitely
many so-called "trivial" zeros on the real line. Borwein et al. call this formula "one of the most beautiful findings in
mathematics".[10] Another champion for that title might be

  ζ(z) Γ(z) = ∫0^∞ t^(z−1) / (e^t − 1) dt.
Both formulas were derived by Bernhard Riemann in his seminal 1859 paper "Über die Anzahl der Primzahlen unter
einer gegebenen Grösse" ("On the Number of Prime Numbers less than a Given Quantity"), one of the milestones in

the development of analytic number theory — the branch of mathematics that studies prime numbers using the tools
of mathematical analysis. Factorial numbers, considered as discrete objects, are an important concept in classical
number theory because they contain many prime factors, but Riemann found a use for their continuous extension that
arguably turned out to be even more important.

History
The gamma function has caught the interest of some of the most prominent mathematicians of all time. Its history,
notably documented by Philip J. Davis in an article that won him the 1963 Chauvenet Prize, reflects many of the
major developments within mathematics since the 18th century. In the words of Davis, "each generation has found
something of interest to say about the gamma function. Perhaps the next generation will also."[11]

18th century: Euler and Stirling


The problem of extending the factorial to non-integer arguments was apparently first considered by Daniel Bernoulli
and Christian Goldbach in the 1720s, and was solved at the end of the same decade by Leonhard Euler. Euler gave
two different definitions: the first was not his integral but an infinite product,

  n! = ∏(k=1..∞) (1 + 1/k)^n / (1 + n/k),

of which he informed Goldbach in a letter dated October 13, 1729.

[Figure: Daniel Bernoulli's letter to Goldbach, 1729-10-06.]

He wrote to Goldbach again on January 8, 1730, to announce his discovery of the integral representation

  n! = ∫0^1 (−ln s)^n ds,
which is valid for n > 0. By the change of variables t = −ln s, this becomes the familiar Euler integral. Euler
published his results in the paper "De progressionibus transcendentibus seu quarum termini generales algebraice dari
nequeunt" ("On transcendental progressions, that is, those whose general terms cannot be given algebraically"),
submitted to the St. Petersburg Academy on November 28, 1729.[12] Euler further discovered some of the gamma
function's important functional properties, including the reflection formula.
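Euler's reflection formula, Γ(z)Γ(1 − z) = π/sin(πz), can be spot-checked numerically for real, non-integer arguments; the sketch below is an illustrative check using Python's standard library, not a proof:

```python
import math

def reflection_gap(z):
    """Absolute difference between Gamma(z)*Gamma(1 - z) and pi/sin(pi*z)."""
    lhs = math.gamma(z) * math.gamma(1.0 - z)
    rhs = math.pi / math.sin(math.pi * z)
    return abs(lhs - rhs)

# The gap is at the level of floating-point rounding error.
for z in (0.1, 0.25, 0.5, 0.75):
    assert reflection_gap(z) < 1e-9
```

At z = 1/2 this reduces to the familiar Γ(1/2)² = π.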
James Stirling, a contemporary of Euler, also attempted to find a continuous expression for the factorial and came up
with what is now known as Stirling's formula. Although Stirling's formula gives a good estimate of n!, also for
non-integers, it does not provide the exact value. Extensions of his formula that correct the error were given by
Stirling himself and by Jacques Philippe Marie Binet.
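Stirling's basic approximation n! ≈ √(2πn)(n/e)^n can be compared with the exact factorial to see the error it leaves uncorrected; a small sketch:

```python
import math

def stirling(n):
    """Stirling's formula without correction terms."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

def rel_error(n):
    """Relative error of the approximation; roughly 1/(12n)."""
    return abs(stirling(n) - math.factorial(n)) / math.factorial(n)

# The estimate improves as n grows but never becomes exact.
assert rel_error(5) > rel_error(10) > rel_error(20)
assert rel_error(20) < 0.005
```

The corrected series mentioned above adds terms like 1/(12n) inside a factor (1 + 1/(12n) + ...), shrinking this error further.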

19th century: Gauss, Weierstrass and Legendre


Carl Friedrich Gauss rewrote Euler's product as

Γ(z) = lim_{m→∞} m^z m! / (z(z + 1)(z + 2) ⋯ (z + m))

and used this formula to discover new properties of the gamma function. Although Euler was a pioneer in the
theory of complex variables, he does not appear to have considered the factorial of a complex number, as Gauss
first did.[13] Gauss also proved the multiplication theorem of the gamma function and investigated the
connection between the gamma function and elliptic integrals.

Karl Weierstrass further established the role of the gamma function in complex analysis, starting from yet another
product representation,

Γ(z) = (e^(−γz)/z) ∏_{k=1}^∞ (1 + z/k)^(−1) e^(z/k),

where γ is the Euler–Mascheroni constant. Weierstrass originally wrote his product as one for 1/Γ, in which
case it is taken over the function's zeros rather than its poles. Inspired by this result, he proved what is known as
the Weierstrass factorization theorem—that any entire function can be written as a product over its zeros in the
complex plane; a generalization of the fundamental theorem of algebra.

The first page of Euler's paper
The name gamma function and the symbol Γ were introduced by Adrien-Marie Legendre around 1811; Legendre
also rewrote Euler's integral definition in its modern form. Although the symbol is an upper-case Greek gamma,
there is no accepted standard for whether the function name should be written "gamma function" or "Gamma
function" (some authors simply write "Γ-function"). The alternative "Pi function" notation Π(z) = z! due to
Gauss is sometimes encountered in older literature, but Legendre's notation is dominant in modern works.
It is justified to ask why we distinguish between the "ordinary factorial" and the gamma function by using distinct
symbols, and particularly why the gamma function should be normalized to Γ(n + 1) = n! instead of simply
using "Γ(n) = n!". Consider that the notation for exponents, x^n, has been generalized from integers to complex
numbers without any change. Legendre's motivation for the normalization does not appear to be known, and has
been criticized as cumbersome by some (the 20th-century mathematician Cornelius Lanczos, for example, called it
"void of any rationality" and would instead use z!).[14] Legendre's normalization does simplify a few formulas, but
complicates most others. From a modern point of view, the Legendre normalization of the gamma function is the
integral of the additive character e^(−x) against the multiplicative character x^z with respect to the Haar measure
dx/x on the Lie group of positive reals. Thus this normalization makes it clearer that the gamma function is a
continuous analogue of a Gauss sum.

19th-20th centuries: characterizing the gamma function


It is somewhat problematic that a large number of definitions have been given for the gamma function. Although
they describe the same function, it is not entirely straightforward to prove the equivalence. Stirling never proved that
his extended formula corresponds exactly to Euler's gamma function; a proof was first given by Charles Hermite in
1900.[15] Instead of finding a specialized proof for each formula, it would be desirable to have a general method of
identifying the gamma function.
One way to prove equivalence would be to find a differential equation that characterizes the gamma function. Most special
functions in applied mathematics arise as solutions to differential equations whose solutions are unique. However,
the gamma function does not appear to satisfy any simple differential equation. Otto Hölder proved in 1887 that the
gamma function at least does not satisfy any algebraic differential equation by showing that a solution to such an
equation could not satisfy the gamma function's recurrence formula. This result is known as Hölder's theorem.
A definite and generally applicable characterization of the gamma function was not given until 1922. Harald Bohr
and Johannes Mollerup then proved what is known as the Bohr–Mollerup theorem: that the gamma function is the
unique solution to the factorial recurrence relation that is positive and logarithmically convex for positive z and
whose value at 1 is 1 (a function is logarithmically convex if its logarithm is convex).
The Bohr–Mollerup theorem is useful because it is relatively easy to prove logarithmic convexity for any of the
different formulas used to define the gamma function. Taking things further, instead of defining the gamma function
by any particular formula, we can choose the conditions of the Bohr–Mollerup theorem as the definition, and then
pick any formula we like that satisfies the conditions as a starting point for studying the gamma function. This
approach was used by the Bourbaki group.
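The conditions of the Bohr–Mollerup theorem are easy to probe numerically: the value at 1, the factorial recurrence, and the nonnegativity of second differences of log Γ on the positive reals. This is a sanity check of the characterization, not a proof of convexity:

```python
import math

def second_difference(x, h=1e-3):
    """Discrete second difference of log(Gamma) at x > h; >= 0 where log-convex."""
    return math.lgamma(x - h) - 2 * math.lgamma(x) + math.lgamma(x + h)

# Gamma(1) = 1 and the factorial recurrence Gamma(z + 1) = z * Gamma(z).
assert math.isclose(math.gamma(1.0), 1.0)
assert math.isclose(math.gamma(4.5), 3.5 * math.gamma(3.5))
# Log-convexity on a sample of positive arguments.
assert all(second_difference(x) >= 0 for x in (0.5, 1.0, 1.5, 2.0, 5.0, 10.0))
```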

Reference tables and software


Although the gamma function can be calculated virtually as easily as any mathematically simpler function with a
modern computer—even with a programmable pocket calculator—this was of course not always the case. Until the
mid-20th century, mathematicians relied on hand-made tables; in the case of the gamma function, notably a table
computed by Gauss in 1813 and one computed by Legendre in 1825.
Tables of complex values of the gamma
function, as well as hand-drawn graphs,
were given in Tables of Higher Functions
by Jahnke and Emde, first published in
Germany in 1909. According to Michael
Berry, "the publication in J&E of a
three-dimensional graph showing the poles
of the gamma function in the complex plane
acquired an almost iconic status."[16]

There was in fact little practical need for anything but real values of the gamma function until the 1930s, when
applications for the complex gamma function were discovered in theoretical physics. As electronic computers became
available for the production of tables in the 1950s, several extensive tables for the complex gamma function were
published to meet the demand, including a table accurate to 12 decimal places from the U.S. National Bureau of
Standards.[11]

A hand-drawn graph of the absolute value of the complex gamma function, from Tables of Higher Functions by
Jahnke and Emde.

Abramowitz and Stegun became the standard reference for this and many other special functions after its publication
in 1964.

Double-precision floating-point implementations of the gamma function and its logarithm are now available in most
scientific computing software and special functions libraries, for example Matlab, GNU Octave, and the GNU
Scientific Library. The gamma function was also added to the C standard library (math.h, as the C99 functions tgamma and lgamma). Arbitrary-precision
implementations are available in most computer algebra systems, such as Mathematica and Maple. PARI/GP, MPFR
and MPFUN contain free arbitrary-precision implementations.
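In Python, for instance, the double-precision implementation is math.gamma, with math.lgamma for the logarithm (useful where Γ itself would overflow a double); a brief illustration:

```python
import math

# Gamma extends the factorial: Gamma(n) = (n - 1)! for positive integers.
assert math.isclose(math.gamma(5.0), 24.0)            # 4!
# A classic non-integer value: Gamma(1/2) = sqrt(pi).
assert math.isclose(math.gamma(0.5), math.sqrt(math.pi))
# lgamma handles arguments where gamma would overflow: log(1000!).
log_fact_1000 = math.lgamma(1001.0)
assert math.isclose(log_fact_1000, sum(math.log(k) for k in range(1, 1001)))
```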

Notes
[1] Kingman, J.F.C. (1961). "A convexity property of positive matrices". Quart. J. Math. Oxford (2) 12, 283–284.
[2] Waldschmidt, M. (2006). "Transcendence of Periods: The State of the Art (http://www.math.jussieu.fr/~miw/articles/pdf/TranscendencePeriods.pdf)". Pure and Applied Mathematics Quarterly, Volume 2, Number 2, 435–463 (PDF copy published by the author)
[3] This can be derived by differentiating the integral form of the gamma function with respect to x, and using the technique of differentiation under the integral sign.
[4] E.A. Karatsuba, Fast evaluation of transcendental functions. Probl. Inf. Transm. Vol. 27, No. 4, pp. 339–360 (1991).
[5] E.A. Karatsuba, On a new method for fast evaluation of transcendental functions. Russ. Math. Surv. Vol. 46, No. 2, pp. 246–247 (1991).
[6] E.A. Karatsuba, "Fast Algorithms and the FEE Method (http://www.ccas.ru/personal/karatsuba/algen.htm)".
[7] Michon, G. P. "Trigonometry and Basic Functions (http://home.att.net/~numericana/answer/functions.htm)". Numericana. Retrieved May 5, 2007.
[8] Chaudry, M. A. & Zubair, S. M. (2001). On A Class of Incomplete Gamma Functions with Applications. p. 37
[9] Rice, J. A. (1995). Mathematical Statistics and Data Analysis (Second Edition). pp. 52–53
[10] Borwein, J., Bailey, D. H. & Girgensohn, R. (2003). Experimentation in Mathematics. A. K. Peters. p. 133. ISBN 1-56881-136-5.
[11] Davis, P. J. (1959). "Leonhard Euler's Integral: A Historical Profile of the Gamma Function", The American Mathematical Monthly, Vol. 66, No. 10 (Dec., 1959), pp. 849–869 (http://mathdl.maa.org/mathDL/22/?pa=content&sa=viewDocument&nodeId=3104)
[12] Euler's paper was published in Commentarii academiae scientiarum Petropolitanae 5, 1738, 36–57. See E19 – De progressionibus transcendentibus seu quarum termini generales algebraice dari nequeunt (http://math.dartmouth.edu/~euler/pages/E019.html), from The Euler Archive, which includes a scanned copy of the original article. An English translation (http://home.sandiego.edu/~langton/eg.pdf) by S. Langton is also available.
[13] Remmert, R., Kay, L. D. (translator) (2006). Classical Topics in Complex Function Theory. Springer. ISBN 0-387-98221-3.
[14] Lanczos, C. (1964). "A precision approximation of the gamma function". J. SIAM Numer. Anal. Ser. B, Vol. 1.
[15] Knuth, D. E. (1997). The Art of Computer Programming, volume 1 (Fundamental Algorithms). Addison-Wesley.
[16] Berry, M. "Why are special functions special? (http://scitation.aip.org/journals/doc/PHTOAD-ft/vol_54/iss_4/11_1.shtml?bypassSSO=1)". Physics Today, April 2001

References
• Milton Abramowitz and Irene A. Stegun, eds. Handbook of Mathematical Functions with Formulas, Graphs, and
Mathematical Tables. New York: Dover, 1972. (See Chapter 6) (http://www.math.sfu.ca/~cbm/aands/
page_253.htm)
• G. E. Andrews, R. Askey, R. Roy, Special Functions, Cambridge University Press, 2001. ISBN
978-0-521-78988-2. Chapter one, covering the gamma and beta functions, is highly readable and definitive.
• Emil Artin, "The Gamma Function", in Rosen, Michael (ed.) Exposition by Emil Artin: a selection; History of
Mathematics 30. Providence, RI: American Mathematical Society (2006).
• Askey, R. A.; Roy, R. (2010), "Gamma function" (http://dlmf.nist.gov/5), in Olver, Frank W. J.; Lozier, Daniel
M.; Boisvert, Ronald F. et al., NIST Handbook of Mathematical Functions, Cambridge University Press,
ISBN 978-0521192255, MR2723248
• P. E. Böhmer, "Differenzengleichungen und bestimmte Integrale", Köhler Verlag, Leipzig, 1939.
• Philip J. Davis, "Leonhard Euler's Integral: A Historical Profile of the Gamma Function," American Mathematical
Monthly 66, 849-869 (1959)
• Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007), "Section 6.1. Gamma Function" (http://apps.
nrbook.com/empanel/index.html?pg=256), Numerical Recipes: The Art of Scientific Computing (3rd ed.), New
York: Cambridge University Press, ISBN 978-0-521-88068-8
• O. R. Rocktaeschel, "Methoden zur Berechnung der Gammafunktion für komplexes Argument", University of
Dresden, Dresden, 1922.

• Nico M. Temme, "Special Functions: An Introduction to the Classical Functions of Mathematical Physics", John
Wiley & Sons, New York, 1996. ISBN 0-471-11313-1.
• E. T. Whittaker and G. N. Watson, A Course of Modern Analysis. Cambridge University Press (1927; reprinted
1996) ISBN 978-0521588072

External links
• Pascal Sebah and Xavier Gourdon. Introduction to the Gamma Function. In PostScript (http://numbers.
computation.free.fr/Constants/Miscellaneous/gammaFunction.ps) and HTML (http://numbers.computation.
free.fr/Constants/Miscellaneous/gammaFunction.html) formats.
• C++ reference for std::tgamma (http://en.cppreference.com/w/cpp/numeric/math/tgamma)
• Examples of problems involving the gamma function can be found at Exampleproblems.com (http://www.
exampleproblems.com/wiki/index.php?title=Special_Functions).
• Hazewinkel, Michiel, ed. (2001), "Gamma function" (http://www.encyclopediaofmath.org/index.php?title=p/
g043310), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
• Wolfram gamma function evaluator (arbitrary precision) (http://functions.wolfram.com/webMathematica/
FunctionEvaluation.jsp?name=Gamma)
• Gamma (http://functions.wolfram.com/GammaBetaErf/Gamma/) at the Wolfram Functions Site
• Volume of n-Spheres and the Gamma Function (http://www.mathpages.com/home/kmath163/kmath163.htm)
at MathPages
• Weisstein, Eric W., " Gamma Function (http://mathworld.wolfram.com/GammaFunction.html)" from
MathWorld.
• "Elementary Proofs and Derivations" (http://www.docstoc.com/docs/3507375/
500-Integrals-of-Elementary-and-Special-Functions,)
• "Selected Transformations, Identities, and Special Values for the Gamma Function" (http://www.docstoc.com/
docs/5836783/Selected-Transformations-Identities--and-Special-Values--for-the-Gamma-Function,)
• This article incorporates material from the Citizendium article "Gamma function", which is licensed under the
Creative Commons Attribution-ShareAlike 3.0 Unported License but not under the GFDL.
Geometric distribution 246

Geometric distribution
In probability theory and statistics, the geometric distribution is either of two discrete probability distributions:
• The probability distribution of the number X of Bernoulli trials needed to get one success, supported on the
set { 1, 2, 3, ... }
• The probability distribution of the number Y = X − 1 of failures before the first success, supported on the
set { 0, 1, 2, 3, ... }
Which of these one calls "the" geometric distribution is a matter of convention and convenience.

Geometric

Probability mass function

Cumulative distribution function

Parameters: 0 < p ≤ 1 success probability (real); 0 < p ≤ 1 success probability (real)
Support: k ∈ { 1, 2, 3, ... }; k ∈ { 0, 1, 2, ... }
Probability mass function (pmf): (1 − p)^(k−1) p; (1 − p)^k p
Cumulative distribution function (cdf): 1 − (1 − p)^k; 1 − (1 − p)^(k+1)
Mean: 1/p; (1 − p)/p
Median: ⌈−1/log_2(1 − p)⌉ (not unique if −1/log_2(1 − p) is an integer); ⌈−1/log_2(1 − p)⌉ − 1 (not unique if −1/log_2(1 − p) is an integer)
Mode: 1; 0
Variance: (1 − p)/p^2; (1 − p)/p^2
Skewness: (2 − p)/√(1 − p); (2 − p)/√(1 − p)
Excess kurtosis: 6 + p^2/(1 − p); 6 + p^2/(1 − p)
Entropy: [−(1 − p) log_2(1 − p) − p log_2 p]/p; [−(1 − p) log_2(1 − p) − p log_2 p]/p
Moment-generating function (mgf): p e^t/(1 − (1 − p) e^t), for t < −ln(1 − p); p/(1 − (1 − p) e^t), for t < −ln(1 − p)
Characteristic function: p e^(it)/(1 − (1 − p) e^(it)); p/(1 − (1 − p) e^(it))

(In each row, the first expression is for the distribution of X on { 1, 2, 3, ... } and the second for Y on { 0, 1, 2, ... }.)
These two different geometric distributions should not be confused with each other. Often, the name shifted
geometric distribution is adopted for the former one (distribution of the number X); however, to avoid ambiguity, it
is considered wise to indicate which is intended, by mentioning the range explicitly.
The geometric distribution gives the probability that the first success requires k independent trials, each with success
probability p. If the probability of success on each trial is p, then the probability that the kth trial is
the first success is

Pr(X = k) = (1 − p)^(k−1) p

for k = 1, 2, 3, ....
The above form of the geometric distribution is used for modeling the number of trials until the first success. By
contrast, the following form is used for modeling the number of failures until the first success:

Pr(Y = k) = (1 − p)^k p

for k = 0, 1, 2, 3, ....
In either case, the sequence of probabilities is a geometric sequence.
For example, suppose an ordinary die is thrown repeatedly until the first time a "1" appears. The probability
distribution of the number of times it is thrown is supported on the infinite set { 1, 2, 3, ... } and is a geometric
distribution with p = 1/6.
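The die example can be sketched directly from the pmf Pr(X = k) = (1 − p)^(k−1) p; the helper name below is illustrative:

```python
p = 1.0 / 6.0  # probability of rolling a "1" on a fair die

def pmf(k, p):
    """Pr(X = k): the first success occurs on trial k (k = 1, 2, 3, ...)."""
    return (1.0 - p) ** (k - 1) * p

# Probability the first "1" appears on the third throw: (5/6)^2 * (1/6).
assert abs(pmf(3, p) - (5.0 / 6.0) ** 2 / 6.0) < 1e-15
# The probabilities form a geometric sequence summing to 1.
total = sum(pmf(k, p) for k in range(1, 500))
assert abs(total - 1.0) < 1e-12
```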

Moments and cumulants


The expected value of a geometrically distributed random variable X is 1/p and the variance is (1 − p)/p^2:

E(X) = 1/p,    var(X) = (1 − p)/p^2.

Similarly, the expected value of the geometrically distributed random variable Y is (1 − p)/p, and its variance is
(1 − p)/p^2:

E(Y) = (1 − p)/p,    var(Y) = (1 − p)/p^2.

Let μ = (1 − p)/p be the expected value of Y. Then the cumulants κ_n of the probability distribution of Y satisfy the
recursion

κ_{n+1} = μ(μ + 1) dκ_n/dμ.
Outline of proof: That the expected value is (1 − p)/p can be shown in the following way. Let Y be as above. Then

E(Y) = Σ_{k=0}^∞ k p (1 − p)^k
     = p (1 − p) Σ_{k=0}^∞ k (1 − p)^(k−1)
     = p (1 − p) [ −(d/dp) Σ_{k=0}^∞ (1 − p)^k ]
     = p (1 − p) [ −(d/dp) (1/p) ]
     = p (1 − p) / p^2 = (1 − p)/p.

(The interchange of summation and differentiation is justified by the fact that convergent power series converge
uniformly on compact subsets of the set of points where they converge.)
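The mean and variance can also be checked numerically by truncating the defining series; with p well away from 0 the neglected tail is negligible (a sketch, not a proof):

```python
p = 0.3
K = 2000  # truncation point; the omitted tail has total mass (1 - p)**K

# Moments of Y, the number of failures before the first success.
mean_y = sum(k * p * (1 - p) ** k for k in range(K))
second_moment = sum(k * k * p * (1 - p) ** k for k in range(K))
var_y = second_moment - mean_y ** 2

assert abs(mean_y - (1 - p) / p) < 1e-9       # E(Y) = (1 - p)/p
assert abs(var_y - (1 - p) / p ** 2) < 1e-9   # var(Y) = (1 - p)/p^2
```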

Parameter estimation
For both variants of the geometric distribution, the parameter p can be estimated by equating the expected value with
the sample mean. This is the method of moments, which in this case happens to yield maximum likelihood estimates
of p.
Specifically, for the first variant let k = k1, ..., kn be a sample where ki ≥ 1 for i = 1, ..., n. Then p can be estimated as

p̂ = n / (k_1 + k_2 + ... + k_n).


In Bayesian inference, the Beta distribution is the conjugate prior distribution for the parameter p. If this parameter is
given a Beta(α, β) prior, then the posterior distribution is

Beta(α + n, β + Σ_{i=1}^n (k_i − 1)).


The posterior mean E[p] approaches the maximum likelihood estimate as α and β approach zero.
In the alternative case, let k1, ..., kn be a sample where ki ≥ 0 for i = 1, ..., n. Then p can be estimated as

p̂ = n / (n + k_1 + k_2 + ... + k_n).


The posterior distribution of p given a Beta(α, β) prior is

Beta(α + n, β + Σ_{i=1}^n k_i).


Again the posterior mean E[p] approaches the maximum likelihood estimate as α and β approach zero.
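Both estimators can be sketched in a few lines; the sample below is hypothetical, and the posterior parameters follow the conjugate Beta update described above:

```python
# Hypothetical sample of trial counts for the first variant (each k_i >= 1).
k = [2, 1, 4, 1, 3, 2, 1, 5]
n = len(k)

# Method-of-moments / maximum likelihood estimate: p_hat = n / sum(k_i).
p_hat = n / sum(k)
assert abs(p_hat - 8 / 19) < 1e-15

# Conjugate Bayesian update from a Beta(alpha, beta) prior:
# the posterior is Beta(alpha + n, beta + sum(k_i - 1)) for this variant.
alpha, beta = 1.0, 1.0  # uniform prior
post_alpha = alpha + n
post_beta = beta + sum(ki - 1 for ki in k)
posterior_mean = post_alpha / (post_alpha + post_beta)
# With a vanishing prior (alpha, beta -> 0) this tends to the MLE.
assert abs(posterior_mean - 9 / 21) < 1e-15
```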
Other properties
• The probability-generating functions of X and Y are, respectively,

G_X(s) = p s / (1 − (1 − p) s),    G_Y(s) = p / (1 − (1 − p) s),    for |s| < (1 − p)^(−1).


• Like its continuous analogue (the exponential distribution), the geometric distribution is memoryless. That means
that if you intend to repeat an experiment until the first success, then, given that the first success has not yet
occurred, the conditional probability distribution of the number of additional trials does not depend on how many
failures have been observed. The die one throws or the coin one tosses does not have a "memory" of these
failures. The geometric distribution is in fact the only memoryless discrete distribution.
• Among all discrete probability distributions supported on {1, 2, 3, ... } with given expected value μ, the geometric
distribution X with parameter p = 1/μ is the one with the largest entropy.
• The geometric distribution of the number Y of failures before the first success is infinitely divisible, i.e., for any
positive integer n, there exist independent identically distributed random variables Y1, ..., Yn whose sum has the
same distribution that Y has. These will not be geometrically distributed unless n = 1; they follow a negative
binomial distribution.
• The decimal digits of the geometrically distributed random variable Y are a sequence of independent (and not
identically distributed) random variables. For example, the hundreds digit D has this probability distribution:

where q = 1 − p, and similarly for the other digits, and, more generally, similarly for numeral systems with
other bases than 10. When the base is 2, this shows that a geometrically distributed random variable can be
written as a sum of independent random variables whose probability distributions are indecomposable.
• Golomb coding is the optimal prefix code for the geometric discrete distribution.
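Among these properties, memorylessness follows directly from the tail function P(X > k) = (1 − p)^k, and can be checked numerically:

```python
p = 0.25

def tail(k, p):
    """P(X > k): more than k trials are needed before the first success."""
    return (1.0 - p) ** k

# P(X > s + t | X > s) = P(X > t): past failures do not matter.
for s in range(1, 6):
    for t in range(1, 6):
        conditional = tail(s + t, p) / tail(s, p)
        assert abs(conditional - tail(t, p)) < 1e-12
```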

Related distributions
• The geometric distribution Y is a special case of the negative binomial distribution, with r = 1. More generally, if
Y1, ..., Yr are independent geometrically distributed variables with parameter p, then the sum

Z = Y_1 + Y_2 + ... + Y_r


follows a negative binomial distribution with parameters r and 1-p.[1]


• If Y1, ..., Yr are independent geometrically distributed variables (with possibly different success parameters pm),
then their minimum

Y = min{ Y_1, ..., Y_r }


is also geometrically distributed, with parameter

p = 1 − (1 − p_1)(1 − p_2) ⋯ (1 − p_r).


• Suppose 0 < r < 1, and for k = 1, 2, 3, ... the random variable Xk has a Poisson distribution with expected value
r k/k. Then

Y = Σ_{k=1}^∞ k X_k


has a geometric distribution taking values in the set {0, 1, 2, ...}, with expected value r/(1 − r).
• The exponential distribution is the continuous analogue of the geometric distribution. If X is an exponentially
distributed random variable with parameter λ, then ⌊X⌋, where ⌊·⌋ is the floor (or greatest integer) function, is a
geometrically distributed random variable with parameter p = 1 − e^(−λ) (thus λ = −ln(1 − p)[2]) and taking
values in the set {0, 1, 2, ...}. This can be used to generate geometrically distributed pseudorandom numbers by
first generating exponentially distributed pseudorandom numbers from a uniform pseudorandom number
generator: then ⌊ln(U)/ln(1 − p)⌋ is geometrically distributed with parameter p, if U is uniformly distributed
in [0, 1].
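This inversion gives a standard way to draw geometric pseudorandom numbers from uniform ones; a sketch with a fixed seed (the sample-mean check is statistical, so the tolerance is deliberately loose):

```python
import math
import random

def geometric(p, rng):
    """Failures before the first success, via floor(ln(U)/ln(1 - p))."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return math.floor(math.log(u) / math.log(1.0 - p))

rng = random.Random(12345)  # fixed seed for reproducibility
p = 0.4
samples = [geometric(p, rng) for _ in range(200_000)]
sample_mean = sum(samples) / len(samples)

# The theoretical mean of this variant is (1 - p)/p = 1.5.
assert abs(sample_mean - (1 - p) / p) < 0.03
```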

References
[1] Pitman, Jim. Probability (1993 edition). Springer Publishers. p. 372.
[2] http://www.wolframalpha.com/input/?i=inverse+p+%3D+1+-+e^-l

External links
• Geometric distribution (http://planetmath.org/?op=getobj&from=objects&id=3456), PlanetMath.org.
• Geometric distribution (http://mathworld.wolfram.com/GeometricDistribution.html) on MathWorld.
• Online geometric distribution calculator (http://www.solvemymath.com/online_math_calculator/statistics/
discrete_distributions/geometric/index.php)

Hypergeometric distribution
Hypergeometric

Parameters: N ∈ {0, 1, 2, ...} (population size); m ∈ {0, 1, ..., N} (number of success states); n ∈ {0, 1, ..., N} (number of draws)
Support: k ∈ { max(0, n + m − N), ..., min(n, m) }
PMF: C(m, k) C(N − m, n − k) / C(N, n)
CDF
Mean: n m / N
Mode: ⌊(n + 1)(m + 1) / (N + 2)⌋
Variance: n (m/N)(1 − m/N)(N − n)/(N − 1)
Skewness: (N − 2m)(N − 1)^(1/2)(N − 2n) / ([n m (N − m)(N − n)]^(1/2)(N − 2))
Ex. kurtosis
MGF
CF

In probability theory and statistics, the hypergeometric distribution is a discrete probability distribution that
describes the probability of k successes in n draws from a finite population of size N containing m successes,
without replacement. (Compare the binomial distribution, which describes the probability of k successes in n draws
with replacement.)
Hypergeometric distribution 251

Definition
A random variable X follows the hypergeometric distribution if its probability mass function is given by:[1]

P(X = k) = C(m, k) C(N − m, n − k) / C(N, n)

where
• N is the population size
• m is the number of success states in the population
• n is the number of draws
• k is the number of successes
• C(a, b) is a binomial coefficient

It is positive when max(0, n + m − N) ≤ k ≤ min(m, n).

Combinatorial identities
As one would expect intuitively, the probabilities sum up to 1:

Σ_k C(m, k) C(N − m, n − k) / C(N, n) = 1.

This is essentially Vandermonde's identity from combinatorics.

Also note that the following identity holds:

C(m, k) C(N − m, n − k) / C(N, n) = C(n, k) C(N − n, m − k) / C(N, m).

This follows clearly from the symmetry of the problem, but it can also be shown easily by expressing the binomial
coefficients in terms of factorials, and rearranging the latter.
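Both identities can be confirmed with exact integer arithmetic via math.comb; a quick check for one choice of N, m, n:

```python
from math import comb

N, m, n = 50, 5, 10

# Vandermonde's identity: the numerators sum to C(N, n), so the pmf sums to 1.
assert sum(comb(m, k) * comb(N - m, n - k) for k in range(n + 1)) == comb(N, n)

# The swap identity, cross-multiplied to stay in exact integers:
# C(m,k) C(N-m,n-k) / C(N,n) = C(n,k) C(N-n,m-k) / C(N,m).
for k in range(min(m, n) + 1):
    assert comb(m, k) * comb(N - m, n - k) * comb(N, m) == \
           comb(n, k) * comb(N - n, m - k) * comb(N, n)
```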

Application and example


The classical application of the hypergeometric distribution is sampling without replacement. Think of an urn with
two types of marbles, black ones and white ones. Define drawing a white marble as a success and drawing a black
marble as a failure (analogous to the binomial distribution). If the variable N describes the number of all marbles in
the urn (see contingency table below) and m describes the number of white marbles, then N − m corresponds to the
number of black marbles. In this example X is the random variable whose outcome is k, the number of white
marbles actually drawn in the experiment. This situation is illustrated by the following contingency table:

drawn not drawn total

white marbles k m−k m

black marbles n−k N+k−n−m N−m

total n N−n N

Now, assume (for example) that there are 5 white and 45 black marbles in the urn. Standing next to the urn, you
close your eyes and draw 10 marbles without replacement. What is the probability that exactly 4 of the 10 are white?
Note that although we are looking at success/failure, the data are not accurately modeled by the binomial
distribution, because the probability of success on each trial is not the same, as the size of the remaining population
changes as we remove each marble.
This problem is summarized by the following contingency table:

drawn not drawn total

white marbles k=4 m−k=1 m=5

black marbles n − k = 6 N + k − n − m = 39 N − m = 45

total n = 10 N − n = 40 N = 50

The probability of drawing exactly k white marbles can be calculated by the formula

P(X = k) = C(m, k) C(N − m, n − k) / C(N, n).

Hence, in this example calculate

P(X = 4) = C(5, 4) C(45, 6) / C(50, 10) ≈ 0.003964.

Intuitively we would expect it to be even more unlikely for all 5 marbles to be white:

P(X = 5) = C(5, 5) C(45, 5) / C(50, 10) ≈ 0.000119.

As expected, the probability of drawing 5 white marbles is roughly 33 times less likely than that of drawing 4.
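The same numbers fall out of a direct computation with exact binomial coefficients (the helper name is illustrative):

```python
from math import comb

def hypergeom_pmf(k, N, m, n):
    """P(X = k): k white marbles in n draws, without replacement."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

# 5 white and 45 black marbles, 10 drawn without replacement.
p4 = hypergeom_pmf(4, N=50, m=5, n=10)
p5 = hypergeom_pmf(5, N=50, m=5, n=10)

assert abs(p4 - 0.003964) < 1e-4
assert abs(p5 - 0.000119) < 1e-5
assert abs(p4 / p5 - 100 / 3) < 1e-9  # drawing 5 is about 33 times less likely
```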

Application to Texas Hold'em Poker


In Hold'em poker, players make the best hand they can by combining the two cards in their hand with the 5 cards
(community cards) eventually turned up on the table. The deck has 52 cards, 13 of each suit. For this example
assume a player has 2 clubs in the hand and there are 3 cards showing on the table, 2 of which are also clubs. The
player would like to know the probability of one of the next 2 cards to be shown being a club to complete his flush.
There are 4 clubs showing, so there are 9 still unseen. There are 5 cards showing (2 in the hand and 3 on the table),
so there are 52 − 5 = 47 still unseen.
The probability that one of the next two cards turned is a club can be calculated using hypergeometric with k=1, n=2,
m=9 and N=47.
The probability that both of the next two cards turned are clubs can be calculated using hypergeometric with k=2,
n=2, m=9 and N=47.
The probability that neither of the next two cards turned are clubs can be calculated using hypergeometric with k=0,
n=2, m=9 and N=47.
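These three probabilities can be computed with the same pmf and, since the cases are exhaustive, must sum to 1 (helper name illustrative):

```python
from math import comb

def hypergeom_pmf(k, N, m, n):
    """P(X = k) successes in n draws from N items containing m successes."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

# 9 unseen clubs among 47 unseen cards, with 2 cards still to come.
probs = {k: hypergeom_pmf(k, N=47, m=9, n=2) for k in (0, 1, 2)}

assert abs(sum(probs.values()) - 1.0) < 1e-12
assert abs(probs[0] - 703 / 1081) < 1e-12  # no club: C(38,2)/C(47,2)
assert abs(probs[2] - 36 / 1081) < 1e-12   # two clubs: C(9,2)/C(47,2)
# Completing the flush means at least one club among the next two cards.
flush = probs[1] + probs[2]
assert abs(flush - 378 / 1081) < 1e-12
```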

Symmetries
Swapping the roles of black and white marbles:

P(X = k; N, m, n) = P(X = n − k; N, N − m, n)

Swapping the roles of drawn and not drawn marbles:

P(X = k; N, m, n) = P(X = m − k; N, m, N − n)

Swapping the roles of white and drawn marbles:

P(X = k; N, m, n) = P(X = k; N, n, m)
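All three marble-swapping symmetries can be checked numerically for any valid parameter choice; a quick illustration:

```python
from math import comb, isclose

def pmf(k, N, m, n):
    """Hypergeometric pmf: k white among n draws from N marbles, m white."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

N, m, n = 20, 7, 5
for k in range(min(m, n) + 1):
    # Swapping the roles of black and white marbles.
    assert isclose(pmf(k, N, m, n), pmf(n - k, N, N - m, n))
    # Swapping the roles of drawn and not drawn marbles.
    assert isclose(pmf(k, N, m, n), pmf(m - k, N, m, N - n))
    # Swapping the roles of white and drawn marbles.
    assert isclose(pmf(k, N, m, n), pmf(k, N, n, m))
```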

Symmetry application
The metaphor of defective and drawn objects depicts an application of the hypergeometric distribution in which the
interchange symmetry between n and m is not of foremost concern. Here is an alternative metaphor which brings this
symmetry into sharper focus, as there are also applications where it serves no purpose to distinguish n from m.
Suppose you have a set of N children who have been identified with an unusual bone marrow antigen. The doctor
wishes to conduct a heredity study to determine the inheritance pattern of this antigen. For the purposes of this study,
the doctor wishes to draw tissue from the bone marrow from the biological mother and biological father of each
child. This is an uncomfortable procedure, and not all the mothers and fathers will agree to participate. Of the
mothers, m participate and N-m decline. Of the fathers, n participate and N-n decline.
We assume here that the decisions made by the mothers are independent of the decisions made by the fathers. Under
this assumption, the doctor, who is given n and m, wishes to estimate k, the number of children where both parents
have agreed to participate. The hypergeometric distribution can be used to determine this distribution over k. It's not
straightforward why the doctor would know n and m, but not k. Perhaps n and m are dictated by the experimental
design, while the experimenter is left blind to the true value of k.
It is important to recognize that for given N, n and m a single degree of freedom partitions N into four
sub-populations:
1. Children where both parents participate
2. Children where only the mother participates
3. Children where only the father participates and
4. Children where neither parent participates.
Knowing any one of these four values determines the other three by simple arithmetic relations. For this reason, each
of these quadrants is governed by an equivalent hypergeometric distribution. The mean, mode, and values of k
contained within the support differ from one quadrant to another, but the size of the support, the variance, and other
higher-order statistics do not.
For the purpose of this study, it might make no difference to the doctor whether the mother participates or the father
participates. If this happens to be true, the doctor will view the result as a three-way partition: children where both
parents participate, children where one parent participates, children where neither parent participates. Under this
view, the last remaining distinction between n and m has been eliminated. The distribution where one parent
participates is the sum of the distributions where either parent alone participates.

Symmetry and sampling


To express how the symmetry of the clinical metaphor degenerates to the asymmetry of the sampling language used
in the drawn/defective metaphor, we will restate the clinical metaphor in the abstract language of decks and cards.
We begin with a dealer who holds two prepared decks of N cards. The decks are labelled left and right. The left deck
was prepared to hold n red cards, and N-n black cards; the right deck was prepared to hold m red cards, and N-m
black cards.
These two decks are dealt out face down to form N hands. Each hand contains one card from the left deck and one
card from the right deck. If we determine the number of hands that contain two red cards, by symmetry relations we
will necessarily also know the hypergeometric distributions governing the other three quadrants: hand counts for
red/black, black/red, and black/black. How many cards must be turned over to learn the total number of red/red
hands? Which cards do we need to turn over to accomplish this? These are questions about possible sampling
methods.
One approach is to begin by turning over the left card of each hand. For each hand showing a red card on the left, we
then also turn over the right card in that hand. For any hand showing a black card on the left, we do not need to
reveal the right card, as we already know this hand does not count toward the total of red/red hands. Our treatment of
the left and right decks no longer appears symmetric: one deck was fully revealed while the other deck was partially
revealed. However, we could just as easily have begun by revealing all cards dealt from the right deck, and partially
revealed cards from the left deck.
In fact, the sampling procedure need not prioritize one deck over the other in the first place. Instead, we could flip a
coin for each hand, turning over the left card on heads, and the right card on tails, leaving each hand with one card
exposed. For every hand with a red card exposed, we reveal the companion card. This will suffice to allow us to
count the red/red hands, even though under this sampling procedure neither the left nor right deck is fully revealed.
By another symmetry, we could also have elected to determine the number of black/black hands rather than the
number of red/red hands, and discovered the same distributions by that method.
The symmetries of the hypergeometric distribution provide many options in how to conduct the sampling procedure
to isolate the degree of freedom governed by the hypergeometric distribution. Even if the sampling procedure
appears to treat the left deck differently from the right deck, or governs choices by red cards rather than black cards,
it is important to recognize that the end result is essentially the same.

Relationship to Fisher's exact test


The test (see above) based on the hypergeometric distribution (hypergeometric test) is identical to the corresponding
one-tailed version of Fisher's exact test.[2] Reciprocally, the p-value of a two-sided Fisher's exact test can be
calculated as the sum of two appropriate hypergeometric tests (for more information see [3]).

Order of draws
The probability of drawing any sequence of white and black marbles (the hypergeometric distribution) depends only
on the number of white and black marbles, not on the order in which they appear; i.e., it is an exchangeable
distribution. As a result, the probability of drawing a white marble in the i-th draw is

P(Wi) = m / N,

where m is the number of white marbles and N is the total number of marbles.
This can be shown by induction. First, it is certainly true for the first draw that

P(W1) = m / N.

Also, we can show that P(Wi) = m / N for every i by conditioning on the colour of the previous draw:

P(Wi) = P(Wi | Wi−1) P(Wi−1) + P(Wi | Bi−1) P(Bi−1)
      = ((m − 1)/(N − 1)) (m/N) + (m/(N − 1)) ((N − m)/N) = m / N,

which makes it true for every draw.


A simpler proof than the one above is the following:
By symmetry each of the N marbles has the same chance to be drawn in the i-th draw. In addition, according to the
sum rule, the chance of drawing a white marble in the i-th draw can be calculated by summing the chances of each
individual white marble being drawn in the i-th draw. These two observations imply that if, for example, the number of
white marbles at the outset is 3 times the number of black marbles, then the chance of a white marble being
drawn in the i-th draw is 3 times as big as that of a black marble being drawn in the i-th draw. In the general case we have
m white marbles and N − m black marbles at the outset, so

P(Wi) / P(Bi) = m / (N − m).

Since in the i-th draw either a white or a black marble needs to be drawn, we also know that

P(Wi) + P(Bi) = 1.

Combining these two equations immediately yields

P(Wi) = m / N.
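The exchangeability result can be checked empirically. The following Java sketch (class, method, and parameter names are our own, not from the article) estimates the probability of a white marble on the i-th draw by simulating draws without replacement and compares it with m/N:

```java
import java.util.Random;

public class DrawOrderCheck {
    // Estimate P(white on draw i) with m white marbles among N, drawing
    // without replacement, over many simulated trials.
    static double estimate(int N, int m, int i, int trials, long seed) {
        Random rng = new Random(seed);
        int hits = 0;
        for (int t = 0; t < trials; t++) {
            int whites = m, total = N;
            boolean white = false;
            for (int d = 1; d <= i; d++) {            // draw marbles one by one
                white = rng.nextInt(total) < whites;  // each remaining marble equally likely
                if (white) whites--;
                total--;
            }
            if (white) hits++;                        // colour seen on the i-th draw
        }
        return (double) hits / trials;
    }

    public static void main(String[] args) {
        int N = 10, m = 4;
        for (int i = 1; i <= 5; i++)
            System.out.printf("draw %d: %.3f (theory %.3f)%n",
                    i, estimate(N, m, i, 200000, 42L), (double) m / N);
    }
}
```

The estimate stays near m/N = 0.4 regardless of which draw i is inspected.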

Related distributions
Let X ~ Hypergeometric(m, N, n) and p = m/N.
• If n = 1 then X has a Bernoulli distribution with parameter p.
• Let Y have a binomial distribution with parameters n and p; this models the number of successes in the
analogous sampling problem with replacement. If N and m are large compared to n, and p is not close to 0 or
1, then X and Y have similar distributions, i.e., P(X ≤ k) ≈ P(Y ≤ k).
• If n is large, N and m are large compared to n, and p is not close to 0 or 1, then

P(X ≤ k) ≈ Φ((k − np) / √(np(1 − p))),

where Φ is the standard normal distribution function.
• If the probabilities of drawing a white or black marble are not equal (e.g. because their sizes differ) then X has
a noncentral hypergeometric distribution.
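The binomial approximation above can be illustrated numerically. The following Java sketch (helper names are our own) computes the hypergeometric and binomial pmfs directly and reports the largest pointwise difference when N and m are much larger than n:

```java
public class HypergeomVsBinomial {
    // Binomial coefficient as a double, via the multiplicative formula.
    static double binom(int n, int k) {
        double r = 1.0;
        for (int i = 1; i <= k; i++) r = r * (n - k + i) / i;
        return r;
    }
    // Hypergeometric pmf: C(m,k) C(N-m,n-k) / C(N,n).
    static double hyper(int N, int m, int n, int k) {
        return binom(m, k) * binom(N - m, n - k) / binom(N, n);
    }
    // Binomial pmf with success probability p.
    static double binomialPmf(int n, double p, int k) {
        return binom(n, k) * Math.pow(p, k) * Math.pow(1 - p, n - k);
    }
    // Largest pointwise difference between the two pmfs over k = 0..n.
    static double maxDiff(int N, int m, int n) {
        double p = (double) m / N, d = 0;
        for (int k = 0; k <= n; k++)
            d = Math.max(d, Math.abs(hyper(N, m, n, k) - binomialPmf(n, p, k)));
        return d;
    }
    public static void main(String[] args) {
        // N and m large compared to n: the approximation is very close.
        System.out.printf("max |hyper - binom| = %.6f%n", maxDiff(10000, 3000, 10));
    }
}
```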

Multivariate hypergeometric distribution

Multivariate Hypergeometric Distribution

Parameters c ∈ N (number of colors), m = (m1, …, mc) (number of marbles of each color), n (number of marbles drawn)

Support { k ∈ Z+^c : Σi ki = n, ki ≤ mi }

PMF P(K1 = k1, …, Kc = kc) = Πi C(mi, ki) / C(N, n)

Mean E(Ki) = n mi / N

Variance Var(Ki) = n (mi/N)(1 − mi/N)(N − n)/(N − 1); Cov(Ki, Kj) = −n(N − n) mi mj / (N²(N − 1)) for i ≠ j

The model of an urn with black and white marbles can be extended to the case where there are more than two colors
of marbles. If there are mi marbles of color i in the urn and you take n marbles at random without replacement, then
the number of marbles of each color in the sample (k1,k2,...,kc) has the multivariate hypergeometric distribution. This
has the same relationship to the multinomial distribution that the hypergeometric distribution has to the binomial
distribution—the multinomial distribution is the "with-replacement" distribution and the multivariate hypergeometric
is the "without-replacement" distribution.
The properties of this distribution are given in the adjacent table, where c is the number of different colors and
is the total number of marbles.

Example
Suppose there are 5 black, 10 white, and 15 red marbles in an urn. You reach in and randomly select six marbles
without replacement. What is the probability that you pick exactly two of each color?

P(2 black, 2 white, 2 red) = [C(5,2) C(10,2) C(15,2)] / C(30,6) = (10 · 45 · 105) / 593775 ≈ 0.0796
Note: When picking the six marbles without replacement, the expected number of black marbles is 6*(5/30) = 1, the
expected number of white marbles is 6*(10/30) = 2, and the expected number of red marbles is 6*(15/30) = 3.

References
[1] Rice, John A. (2007). Mathematical Statistics and Data Analysis (3rd ed.). Duxbury Press. p. 42.
[2] Rivals, I.; Personnaz, L.; Taing, L.; Potier, M.-C. (2007). "Enrichment or depletion of a GO category within a class of genes: which test?".
Bioinformatics 23: 401–407.
[3] K. Preacher and N. Briggs. "Calculation for Fisher's Exact Test: An interactive calculation tool for Fisher's exact probability test for 2 x 2
tables (interactive page)" (http://quantpsy.org/fisher/fisher.htm).

External links
• GNU implementation for C/C++ (http://www.gnu.org/software/gsl/manual/html_node/
The-Hypergeometric-Distribution.html)
• The Hypergeometric Distribution (http://demonstrations.wolfram.com/TheHypergeometricDistribution/) and
Binomial Approximation to a Hypergeometric Random Variable (http://demonstrations.wolfram.com/
BinomialApproximationToAHypergeometricRandomVariable/) by Chris Boucher, Wolfram Demonstrations
Project.
• Weisstein, Eric W., " Hypergeometric Distribution (http://mathworld.wolfram.com/
HypergeometricDistribution.html)" from MathWorld.
• Hypergeometric tail inequalities: ending the insanity (http://ansuz.sooke.bc.ca/professional/hypergeometric.
pdf) by Matthew Skala.
• Survey Analysis Tool (http://www.i-marvin.si) using discrete hypergeometric distribution based on A.
Berkopec, HyperQuick algorithm for discrete hypergeometric distribution, Journal of Discrete Algorithms,
Elsevier, 2006 (http://dx.doi.org/10.1016/j.jda.2006.01.001).

Hölder's inequality
In mathematical analysis Hölder's inequality, named after Otto Hölder, is a fundamental inequality between
integrals and an indispensable tool for the study of Lp spaces.
Let (S, Σ, μ) be a measure space and let 1 ≤ p, q ≤ ∞ with 1/p + 1/q = 1. Then, for all measurable real- or
complex-valued functions f and g on S,

||fg ||1 ≤ ||f ||p ||g ||q.
The numbers p and q above are said to be Hölder conjugates of each other. The special case p = q = 2 gives a form
of the Cauchy–Schwarz inequality.
Hölder's inequality holds even if ||fg ||1 is infinite, the right-hand side also being infinite in that case. In particular, if f
is in Lp(μ) and g is in Lq(μ), then fg is in L1(μ).
For 1 < p, q < ∞ and f ∈ Lp(μ) and g ∈ Lq(μ), Hölder's inequality becomes an equality if and only if |f |p and |g |q are
linearly dependent in L1(μ), meaning that there exist real numbers α, β ≥ 0, not both of them zero, such that α |f |p = β
|g |q μ-almost everywhere.
Hölder's inequality is used to prove the Minkowski inequality, which is the triangle inequality in the space Lp(μ), and
also to establish that Lq(μ) is the dual space of Lp(μ) for 1 ≤ p < ∞.
Hölder's inequality was first found by L. J. Rogers (1888), and discovered independently by Hölder (1889).

Remarks

Conventions
The brief statement of Hölder's inequality uses some conventions.
• In the definition of Hölder conjugates, 1/∞ means zero.
• If 1 ≤ p, q < ∞, then ||f ||p and ||g ||q stand for the (possibly infinite) expressions

||f ||p = (∫S |f |^p dμ)^(1/p)    and    ||g ||q = (∫S |g |^q dμ)^(1/q).

• If p = ∞, then ||f ||∞ stands for the essential supremum of |f |, similarly for ||g ||∞.
• The notation ||f ||p with 1 ≤ p ≤ ∞ is a slight abuse, because in general it is only a norm of f if ||f ||p is finite and f is
considered as equivalence class of μ-almost everywhere equal functions. If f ∈ Lp(μ) and g ∈ Lq(μ), then the
notation is adequate.
• On the right-hand side of Hölder's inequality, 0 times ∞ as well as ∞ times 0 means 0. Multiplying a > 0 with ∞
gives ∞.

Estimates for integrable products


As above, let f and g denote measurable real- or complex-valued functions defined on S. If ||fg ||1 is finite, then the
products of f with g and with its complex conjugate function, respectively, are μ-integrable, the estimate

|∫S f ḡ dμ| ≤ ∫S |f ḡ| dμ = ||fg ||1

and the similar one for fg hold, and Hölder's inequality can be applied to the right-hand side. In particular, if f and g
are in the Hilbert space L2(μ), then Hölder's inequality for p = q = 2 implies

|⟨f, g⟩| ≤ ||f ||2 ||g ||2,

where the angle brackets refer to the inner product of L2(μ). This is also called the Cauchy–Schwarz inequality, but
its statement requires that ||f ||2 and ||g ||2 are finite to make sure that the inner product of f and g is well defined.
We may recover the original inequality (for the case p=2) by using the functions |f| and |g| in place of f and g.

Generalization for probability measures


If (S, Σ, μ) is a probability space, then 1 ≤ p, q ≤ ∞ just need to satisfy 1/p + 1/q ≤ 1, rather than being Hölder
conjugates. A combination of Hölder's inequality and Jensen's inequality implies that

||fg ||1 ≤ ||f ||p ||g ||q

for all measurable real- or complex-valued functions f and g on S.

Notable special cases


For the following cases assume that p and q are in the open interval (1, ∞) with 1/p + 1/q = 1.

Counting measure
In the case of n-dimensional Euclidean space, when the set S is {1, …, n} with the counting measure, we have

Σ(k=1..n) |xk yk| ≤ (Σ(k=1..n) |xk|^p)^(1/p) (Σ(k=1..n) |yk|^q)^(1/q)    for all x, y in R^n or C^n.

If S = N with the counting measure, then we get Hölder's inequality for sequence spaces:

Σ(k=1..∞) |xk yk| ≤ (Σ(k=1..∞) |xk|^p)^(1/p) (Σ(k=1..∞) |yk|^q)^(1/q).
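The sequence-space inequality is easy to check numerically. The following Java sketch (class name and test vectors are our own) compares both sides for the conjugate pair p = 3, q = 3/2:

```java
public class HolderCheck {
    // p-norm of a vector: (sum |v_k|^p)^(1/p).
    static double norm(double[] v, double p) {
        double s = 0;
        for (double x : v) s += Math.pow(Math.abs(x), p);
        return Math.pow(s, 1.0 / p);
    }
    // Left-hand side of Hölder's inequality: sum |x_k y_k|.
    static double dotAbs(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += Math.abs(a[k] * b[k]);
        return s;
    }
    public static void main(String[] args) {
        double[] x = {1.0, -2.0, 0.5, 3.0};
        double[] y = {0.3, 1.5, -2.0, 1.0};
        double p = 3.0, q = p / (p - 1);          // Hölder conjugates: 1/p + 1/q = 1
        System.out.printf("lhs = %.4f <= rhs = %.4f%n",
                dotAbs(x, y), norm(x, p) * norm(y, q));
    }
}
```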
Lebesgue measure
If S is a measurable subset of Rn with the Lebesgue measure, and f and g are measurable real- or complex-valued
functions on S, then Hölder's inequality is

∫S |f(x) g(x)| dx ≤ (∫S |f(x)|^p dx)^(1/p) (∫S |g(x)|^q dx)^(1/q).
Probability measure
For the probability space (Ω, F, P), let E denote the expectation operator. For real- or complex-valued random
variables X and Y on Ω, Hölder's inequality reads

E|XY| ≤ (E|X|^p)^(1/p) (E|Y|^q)^(1/q).

Let 0 < r < s and define p = s / r. Then q = p / (p−1) is the Hölder conjugate of p. Applying Hölder's inequality to the
random variables |X |^r and 1Ω, we obtain

E|X|^r ≤ (E|X|^s)^(r/s).

In particular, if the sth absolute moment is finite, then the rth absolute moment is finite, too. (This also follows from
Jensen's inequality.)
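The moment inequality E|X|^r ≤ (E|X|^s)^(r/s) can be illustrated with a small empirical distribution (equal point masses at arbitrary sample values). A Java sketch, with made-up data:

```java
public class MomentCheck {
    // E|X|^order under the empirical measure putting equal mass on each point.
    static double absMoment(double[] sample, double order) {
        double s = 0;
        for (double x : sample) s += Math.pow(Math.abs(x), order);
        return s / sample.length;
    }
    public static void main(String[] args) {
        double[] x = {0.2, -1.3, 2.7, -0.4, 1.1};
        double r = 1.0, s = 2.5;                  // r < s
        double lhs = absMoment(x, r);
        double rhs = Math.pow(absMoment(x, s), r / s);
        System.out.printf("E|X|^r = %.4f <= (E|X|^s)^(r/s) = %.4f%n", lhs, rhs);
    }
}
```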

Product measure
For two σ-finite measure spaces (S1, Σ1, μ1) and (S2, Σ2, μ2) define the product measure space by

(S, Σ, μ) = (S1 × S2, Σ1 ⊗ Σ2, μ1 ⊗ μ2),

where S is the Cartesian product of S1 and S2, the σ-algebra Σ arises as product σ-algebra of Σ1 and Σ2, and μ denotes
the product measure of μ1 and μ2. Then Tonelli's theorem allows us to rewrite Hölder's inequality using iterated
integrals: If f and g are Σ-measurable real- or complex-valued functions on the Cartesian product S, then

∫S1 ∫S2 |f(x, y) g(x, y)| dμ2(y) dμ1(x) ≤ (∫S1 ∫S2 |f(x, y)|^p dμ2(y) dμ1(x))^(1/p) (∫S1 ∫S2 |g(x, y)|^q dμ2(y) dμ1(x))^(1/q).
This can be generalized to more than two σ-finite measure spaces.



Vector-valued functions
Let (S, Σ, μ) denote a σ-finite measure space and suppose that f = (f1, …, fn) and g = (g1, …, gn) are Σ-measurable
functions on S, taking values in the n-dimensional real- or complex Euclidean space. By taking the product with the
counting measure on {1, …, n}, we can rewrite the above product measure version of Hölder's inequality in the form

If the two integrals on the right-hand side are finite, then equality holds if and only if there exist real numbers
α, β ≥ 0, not both of them zero, such that
    for μ-almost all x in S.
This finite-dimensional version generalizes to functions f and g taking values in a sequence space.

Proof of Hölder's inequality


There are several proofs of Hölder's inequality; the main idea in the following is Young's inequality.
If ||f ||p = 0, then f is zero μ-almost everywhere, and the product fg is zero μ-almost everywhere, hence the left-hand
side of Hölder's inequality is zero. The same is true if ||g ||q = 0. Therefore, we may assume ||f ||p > 0 and ||g ||q > 0 in
the following.
If ||f ||p = ∞ or ||g ||q = ∞, then the right-hand side of Hölder's inequality is infinite. Therefore, we may assume that ||f
||p and ||g ||q are in (0, ∞).
If p = ∞ and q = 1, then |fg | ≤ ||f ||∞ |g | almost everywhere and Hölder's inequality follows from the monotonicity of
the Lebesgue integral. Similarly for p = 1 and q = ∞. Therefore, we may also assume p, q ∈ (1, ∞).
Dividing f and g by ||f ||p and ||g ||q, respectively, we can assume that

||f ||p = ||g ||q = 1.

We now use Young's inequality, which states that

ab ≤ a^p/p + b^q/q

for all nonnegative a and b, where equality is achieved if and only if a^p = b^q. Hence

|f(s) g(s)| ≤ |f(s)|^p/p + |g(s)|^q/q    for μ-almost all s in S.

Integrating both sides gives

||fg ||1 ≤ 1/p + 1/q = 1,

which proves the claim.


Under the assumptions p ∈ (1, ∞) and ||f ||p = ||g ||q = 1, equality holds if and only if |f |^p = |g |^q almost everywhere.
More generally, if ||f ||p and ||g ||q are in (0, ∞), then Hölder's inequality becomes an equality if and only if there exist
real numbers α, β > 0, namely

α = ||g ||q^q    and    β = ||f ||p^p,

such that

α |f |^p = β |g |^q    μ-almost everywhere.   (*)

The case ||f ||p = 0 corresponds to β = 0 in (*). The case ||g ||q = 0 corresponds to α = 0 in (*).

Extremal equality

Statement
Assume that 1 ≤ p < ∞ and let q denote the Hölder conjugate. Then, for every ƒ ∈ Lp(μ),

||ƒ ||p = max { |∫S ƒg dμ| : g ∈ Lq(μ), ||g ||q ≤ 1 },

where max indicates that there actually is a g maximizing the right-hand side. When p = ∞ and if each set A in the
σ-field Σ with μ(A) = ∞ contains a subset B ∈ Σ with 0 < μ(B) < ∞ (which is true in particular when μ is σ-finite),
then

||ƒ ||∞ = sup { |∫S ƒg dμ| : g ∈ L1(μ), ||g ||1 ≤ 1 }.
Remarks and examples


• The equality for p = ∞ fails whenever there exists a set A in the σ-field Σ with μ(A) = ∞ that has no subset B ∈ Σ
with 0 < μ(B) < ∞ (the simplest example is the σ-field Σ containing just the empty set and S, and the measure μ
with μ(S) = ∞). Then the indicator function 1A satisfies ||1A||∞ = 1, but every g ∈ L1(μ) has to be μ-almost
everywhere constant on A, because it is Σ-measurable, and this constant has to be zero, because g is μ-integrable.
Therefore, the above supremum for the indicator function 1A is zero and the extremal equality fails.
• For p = ∞, the supremum is in general not attained. As an example, let S denote the natural numbers (without
zero), Σ the power set of S, and μ the counting measure. Define ƒ(n) = (n − 1)/n for every natural number n. Then
||ƒ ||∞ = 1. For g ∈ L1(μ) with 0 < ||g ||1 ≤ 1, let m denote the smallest natural number with g(m) ≠ 0. Then

Applications
• The extremal equality is one of the ways for proving the triangle inequality ||ƒ1 + ƒ2||p ≤ ||ƒ1||p + ||ƒ2||p for all ƒ1 and
ƒ2 in Lp(μ), see Minkowski inequality.
• Hölder's inequality implies that every ƒ ∈ Lp(μ) defines a bounded (or continuous) linear functional κƒ on Lq(μ) by
the formula

The extremal equality (when true) shows that the norm of this functional κƒ as element of the continuous dual
space Lq(μ)∗ coincides with the norm of ƒ in Lp(μ) (see also the Lp-space article).

Generalization of Hölder's inequality


Assume that r ∈ (0, ∞) and p1, …, pn ∈ (0, ∞] such that

1/p1 + … + 1/pn = 1/r.

Then, for all measurable real- or complex-valued functions f1, …, fn defined on S,

||f1 f2 ⋯ fn ||r ≤ ||f1 ||p1 ||f2 ||p2 ⋯ ||fn ||pn.

In particular, fk ∈ Lpk(μ) for all k implies f1 f2 ⋯ fn ∈ Lr(μ).

Note:
• For r ∈ (0, 1), contrary to the notation, ||.||r is in general not a norm, because it doesn't satisfy the triangle
inequality.

Interpolation
Let p1, …, pn ∈ (0, ∞] and let θ1, …, θn ∈ (0, 1) denote weights with θ1 + … + θn = 1. Define p as the weighted
harmonic mean, i.e.,

1/p = θ1/p1 + … + θn/pn.

Given a measurable real- or complex-valued function f on S, then by the above generalization of Hölder's inequality,

||f ||p ≤ ||f ||p1^θ1 ||f ||p2^θ2 ⋯ ||f ||pn^θn.

In particular, taking θ1 = θ and θ2 = 1 − θ in the case n = 2, we obtain the interpolation result

||f ||p ≤ ||f ||p1^θ ||f ||p2^(1−θ)

for θ ∈ (0, 1) and

1/p = θ/p1 + (1 − θ)/p2.
Reverse Hölder inequality


Assume that p ∈ (1, ∞) and that the measure space (S, Σ, μ) satisfies μ(S) > 0. Then, for all measurable real- or
complex-valued functions f and g on S such that g(s) ≠ 0 for μ-almost all s ∈ S,

||fg ||1 ≥ ||f ||1/p ||g ||−1/(p−1).

If ||fg ||1 < ∞ and ||g ||−1/(p−1) > 0, then the reverse Hölder inequality is an equality if and only if there exists an α ≥ 0
such that

|f | = α |g |^(−p/(p−1))    μ-almost everywhere.

Note: ||f ||1/p and ||g ||−1/(p−1) are not norms; these expressions are just compact notation for

(∫S |f |^(1/p) dμ)^p    and    (∫S |g |^(−1/(p−1)) dμ)^(−(p−1)).

Conditional Hölder inequality


Let (Ω, F, P) be a probability space, G ⊆ F a sub-σ-algebra, and p, q ∈ (1, ∞) Hölder conjugates, meaning that
1/p + 1/q = 1. Then, for all real- or complex-valued random variables X and Y on Ω,

E[ |XY| | G ] ≤ (E[ |X|^p | G ])^(1/p) (E[ |Y|^q | G ])^(1/q)    P-almost surely.
Remarks:
• If a non-negative random variable Z has infinite expected value, then its conditional expectation is defined by

E[Z | G] = sup(n∈N) E[ min(Z, n) | G ]    almost surely.
• On the right-hand side of the conditional Hölder inequality, 0 times ∞ as well as ∞ times 0 means 0. Multiplying
a > 0 with ∞ gives ∞.

References
• Hardy, G. H.; Littlewood, J. E.; Pólya, G. (1934), Inequalities, Cambridge University Press, pp. XII+314,
ISBN 0-521-35880-9, JFM 60.0169.01, Zbl 0010.10703.
• Hölder, O. (1889), "Ueber einen Mittelwertsatz" [1] (in German), Nachrichten von der Königl. Gesellschaft der
Wissenschaften und der Georg-Augusts-Universität zu Göttingen, Band 1889 (2): 38–47, JFM 21.0260.07.
Available at Digi Zeitschriften [2].
• Kuptsov, L. P. (2001), "Hölder inequality" [3], in Hazewinkel, Michiel, Encyclopedia of Mathematics, Springer,
ISBN 978-1-55608-010-4.
• Rogers, L. J. (February 1888), "An extension of a certain theorem in inequalities" [4], Messenger of Mathematics,
New Series XVII (10): 145–150, JFM 20.0254.02, archived from the original [5] on August 21, 2007.

External links
• Kuttler, Kenneth (2007), An introduction to linear algebra [6], Online e-book in PDF format, Brigham Young
University.
• Lohwater, Arthur (1982) (PDF), Introduction to Inequalities [7].

References
[1] http://resolver.sub.uni-goettingen.de/purl?GDZPPN00252421X
[2] http://www.digizeitschriften.de/index.php?id=64&L=2
[3] http://www.encyclopediaofmath.org/index.php?title=H/h047514
[4] http://www.archive.org/stream/messengermathem01unkngoog#page/n183/mode/1up
[5] http://www.archive.org/details/messengermathem01unkngoog
[6] http://www.math.byu.edu/~klkuttle/Linearalgebra.pdf
[7] http://www.mediafire.com/?1mw1tkgozzu

Inverse Gaussian distribution


Inverse Gaussian

Probability density function

Parameters μ > 0 (mean), λ > 0 (shape)

Support x ∈ (0, ∞)

PDF [λ/(2πx³)]^(1/2) exp(−λ(x − μ)²/(2μ²x))

CDF Φ(√(λ/x)(x/μ − 1)) + e^(2λ/μ) Φ(−√(λ/x)(x/μ + 1)), where Φ is the standard normal (standard
Gaussian) distribution c.d.f.

Mean μ

Mode μ[(1 + 9μ²/(4λ²))^(1/2) − 3μ/(2λ)]

Variance μ³/λ

Skewness 3(μ/λ)^(1/2)

Ex. kurtosis 15μ/λ

MGF exp((λ/μ)[1 − (1 − 2μ²t/λ)^(1/2)])

CF exp((λ/μ)[1 − (1 − 2μ²it/λ)^(1/2)])

In probability theory, the inverse Gaussian distribution (also known as the Wald distribution) is a two-parameter
family of continuous probability distributions with support on (0,∞).
Its probability density function is given by

f(x; μ, λ) = [λ/(2πx³)]^(1/2) exp(−λ(x − μ)²/(2μ²x))

for x > 0, where μ > 0 is the mean and λ > 0 is the shape parameter.


As λ tends to infinity, the inverse Gaussian distribution becomes more like a normal (Gaussian) distribution. The
inverse Gaussian distribution has several properties analogous to a Gaussian distribution. The name can be
misleading: it is an "inverse" only in that, while the Gaussian describes a Brownian Motion's level at a fixed time,

the inverse Gaussian describes the distribution of the time a Brownian Motion with positive drift takes to reach a
fixed positive level.
Its cumulant generating function (logarithm of the characteristic function) is the inverse of the cumulant generating
function of a Gaussian random variable.
To indicate that a random variable X is inverse Gaussian-distributed with mean μ and shape parameter λ we write

X ~ IG(μ, λ).
Properties

Summation
If Xi has an IG(μ0wi, λ0wi²) distribution for i = 1, 2, ..., n and all Xi are independent, then

S = Σi Xi ~ IG(μ0 Σi wi, λ0 (Σi wi)²).

Note that

Var(Xi) / E(Xi) = (μ0³wi/λ0) / (μ0wi) = μ0²/λ0

is constant for all i. This is a necessary condition for the summation. Otherwise S would not be inverse Gaussian.

Scaling
For any t > 0 it holds that

tX ~ IG(tμ, tλ).
Exponential family
The inverse Gaussian distribution is a two-parameter exponential family with natural parameters -λ/(2μ²) and -λ/2,
and natural statistics X and 1/X.

Relationship with Brownian motion


The stochastic process Xt given by

Xt = νt + σWt

(where Wt is a standard Brownian motion and ν > 0) is a Brownian motion with drift ν.

Then, the first passage time for a fixed level α > 0 by Xt is distributed according to an inverse Gaussian:

Tα = inf{ t > 0 : Xt = α } ~ IG(α/ν, α²/σ²).

When drift is zero


A common special case of the above arises when the Brownian motion has no drift. In that case, parameter μ tends to
infinity, and the first passage time for fixed level α has probability density function

f(x) = (α / (σ√(2πx³))) exp(−α²/(2σ²x)).

This is a Lévy distribution with parameters c = α²/σ² and μ = 0.

Maximum likelihood
The model where

Xi ~ IG(μ, λwi),  i = 1, 2, …, n,

with all wi known, (μ, λ) unknown and all Xi independent has the following likelihood function

L(μ, λ) = Πi (λwi / (2πXi³))^(1/2) exp(−λwi(Xi − μ)² / (2μ²Xi)).

Solving the likelihood equation yields the following maximum likelihood estimates

μ̂ = Σi wiXi / Σi wi    and    1/λ̂ = (1/n) Σi wi(1/Xi − 1/μ̂).

μ̂ and λ̂ are independent and

μ̂ ~ IG(μ, λ Σi wi),    n/λ̂ ~ (1/λ) χ²(n − 1).
Generating random variates from an inverse-Gaussian distribution


The following algorithm may be used.[1]
Generate a random variate from a normal distribution with a mean of 0 and 1 standard deviation:

ν ~ N(0, 1).

Square the value:

y = ν²

and use this relation:

x = μ + μ²y/(2λ) − (μ/(2λ)) √(4μλy + μ²y²).

Generate another random variate, this time sampled from a uniform distribution between 0 and 1:

u ~ U(0, 1).

If u ≤ μ/(μ + x), then return x; else return μ²/x.
Sample code in Java:

public double inverseGaussian(double mu, double lambda) {
    Random rand = new Random();
    // sample from a normal distribution with a mean of 0 and 1 standard deviation
    double v = rand.nextGaussian();
    double y = v * v;
    double x = mu + (mu * mu * y) / (2 * lambda)
             - (mu / (2 * lambda)) * Math.sqrt(4 * mu * lambda * y + mu * mu * y * y);
    // sample from a uniform distribution between 0 and 1
    double test = rand.nextDouble();
    if (test <= (mu) / (mu + x))
        return x;
    else
        return (mu * mu) / x;
}
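As a quick sanity check of the algorithm, the sampler can be repeated in a self-contained class (class name and seed below are our own) and its sample mean compared with the theoretical mean μ:

```java
import java.util.Random;

public class InverseGaussianCheck {
    // Michael-Schucany-Haas sampler, as in the algorithm above.
    static double inverseGaussian(double mu, double lambda, Random rand) {
        double v = rand.nextGaussian();
        double y = v * v;
        double x = mu + (mu * mu * y) / (2 * lambda)
                 - (mu / (2 * lambda)) * Math.sqrt(4 * mu * lambda * y + mu * mu * y * y);
        return (rand.nextDouble() <= mu / (mu + x)) ? x : (mu * mu) / x;
    }
    // Average of n samples; should approach mu for large n.
    static double sampleMean(double mu, double lambda, int n, long seed) {
        Random rand = new Random(seed);
        double sum = 0;
        for (int i = 0; i < n; i++) sum += inverseGaussian(mu, lambda, rand);
        return sum / n;
    }
    public static void main(String[] args) {
        double mu = 2.0, lambda = 3.0;
        System.out.printf("sample mean = %.3f (theory %.3f)%n",
                sampleMean(mu, lambda, 1000000, 7L), mu);
    }
}
```

The sample variance can be compared with μ³/λ in the same way.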

Related distributions
• If X ~ IG(μ, λ) and k > 0, then kX ~ IG(kμ, kλ).

• If X ~ IG(μ, λ), then λ(X − μ)² / (μ²X) ~ χ²(1).

• If Xi ~ IG(μ, λ) for i = 1, …, n are independent and identically distributed, then the sample mean satisfies X̄ ~ IG(μ, nλ).

• If the Xi ~ IG(μi, 2μi²) are independent, then Σi Xi ~ IG(Σi μi, 2(Σi μi)²).
The convolution of a Wald distribution and an exponential (the ex-Wald distribution) is used as a model for response
times in psychology.[2]

History
This distribution appears to have been first derived by Schrödinger in 1915 as the time to first passage of a Brownian
motion.[3] The name inverse Gaussian was proposed by Tweedie in 1945.[4] Wald re-derived this distribution in 1947
as the limiting form of a sample in a sequential probability ratio test. Tweedie investigated this distribution in 1957
and established some of its statistical properties.

Software
The R programming language has software for this distribution.[5]

Notes
[1] Generating Random Variates Using Transformations with Multiple Roots by John R. Michael, William R. Schucany and Roy W. Haas,
American Statistician, Vol. 30, No. 2 (May, 1976), pp. 88–90
[2] Schwarz W (2001) The ex-Wald distribution as a descriptive model of response times. Behav Res Methods Instrum Comput 33(4):457-469
[3] Schrödinger E (1915) Zur Theorie der Fall- und Steigversuche an Teilchen mit Brownscher Bewegung. Physikalische Zeitschrift 16,
289-295
[4] Folks JL & Chhikara RS (1978) The inverse Gaussian and its statistical application - a review. J Roy Stat Soc 40(3) 263-289
[5] http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/SuppDists/html/invGauss.html

References
• The inverse gaussian distribution: theory, methodology, and applications by Raj Chhikara and Leroy Folks, 1989
ISBN 0-8247-7997-5
• System Reliability Theory by Marvin Rausand and Arnljot Høyland
• The Inverse Gaussian Distribution by Dr. V. Seshadri, Oxford Univ Press, 1993

External links
• Inverse Gaussian Distribution (http://mathworld.wolfram.com/InverseGaussianDistribution.html) in Wolfram
website.

Inverse-gamma distribution
Inverse-gamma

Probability density function

Cumulative distribution function

Parameters α > 0 shape (real)
β > 0 scale (real)

Support x ∈ (0, ∞)

PDF (β^α / Γ(α)) x^(−α−1) exp(−β/x)

CDF Γ(α, β/x) / Γ(α)

Mean β/(α − 1) for α > 1

Mode β/(α + 1)

Variance β²/((α − 1)²(α − 2)) for α > 2

Skewness 4√(α − 2)/(α − 3) for α > 3

Ex. kurtosis (30α − 66)/((α − 3)(α − 4)) for α > 4

Entropy α + ln(βΓ(α)) − (1 + α)ψ(α)

MGF does not exist

CF (2(−iβt)^(α/2) / Γ(α)) K_α(√(−4iβt))

In probability theory and statistics, the inverse gamma distribution is a two-parameter family of continuous
probability distributions on the positive real line, which is the distribution of the reciprocal of a variable distributed
according to the gamma distribution. Perhaps the chief use of the inverse gamma distribution is in Bayesian
statistics, where it serves as the conjugate prior of the variance of a normal distribution. However, it is common
among Bayesians to consider an alternative parametrization of the normal distribution in terms of the precision,

defined as the reciprocal of the variance, which allows the gamma distribution to be used directly as a conjugate
prior.

Characterization

Probability density function


The inverse gamma distribution's probability density function is defined over the support x > 0:

f(x; α, β) = (β^α / Γ(α)) x^(−α−1) exp(−β/x),

with shape parameter α and scale parameter β.

Cumulative distribution function


The cumulative distribution function is the regularized gamma function

F(x; α, β) = Γ(α, β/x) / Γ(α) = Q(α, β/x),

where the numerator is the upper incomplete gamma function and the denominator is the gamma function. Many
math packages allow you to compute Q, the regularized gamma function, directly.

Properties
For α > 0 and β > 0,

E[ln X] = ln β − ψ(α),

where ψ is the digamma function.

Related distributions
• If X ~ Inv-Gamma(α, β) and k > 0, then kX ~ Inv-Gamma(α, kβ).
• If X ~ Inv-Gamma(α, 1/2), then X ~ Inv-χ²(2α) (inverse-chi-squared distribution).
• If X ~ Inv-Gamma(α/2, 1/2), then X ~ Scale-inv-χ²(α, 1/α) (scaled-inverse-chi-squared distribution).
• If X ~ Inv-Gamma(1/2, c/2), then X ~ Lévy(0, c) (Lévy distribution).
• If X ~ Gamma(α, β) (Gamma distribution with rate parameter β), then 1/X ~ Inv-Gamma(α, β).
• Inverse gamma distribution is a special case of type 5 Pearson distribution
• A multivariate generalization of the inverse-gamma distribution is the inverse-Wishart distribution.
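The reciprocal relationship with the gamma distribution can be checked by simulation. The sketch below (our own naming) samples Gamma(α, rate β) with the standard Marsaglia-Tsang method, which is not part of this article, inverts the draws, and compares the sample mean with the inverse-gamma mean β/(α − 1):

```java
import java.util.Random;

public class InverseGammaCheck {
    // Marsaglia-Tsang sampler for Gamma(alpha, rate beta), valid for alpha >= 1.
    static double gammaRate(double alpha, double beta, Random rng) {
        double d = alpha - 1.0 / 3.0, c = 1.0 / Math.sqrt(9.0 * d);
        while (true) {
            double z = rng.nextGaussian();
            double v = 1.0 + c * z;
            if (v <= 0) continue;
            v = v * v * v;
            double u = rng.nextDouble();
            if (Math.log(u) < 0.5 * z * z + d - d * v + d * Math.log(v))
                return d * v / beta;   // divide by the rate parameter
        }
    }
    // Mean of n inverse-gamma draws obtained as reciprocals of gamma draws.
    static double invGammaSampleMean(double alpha, double beta, int n, long seed) {
        Random rng = new Random(seed);
        double sum = 0;
        for (int i = 0; i < n; i++) sum += 1.0 / gammaRate(alpha, beta, rng);
        return sum / n;
    }
    public static void main(String[] args) {
        double alpha = 3.0, beta = 2.0;   // inverse-gamma mean = 2/(3-1) = 1
        System.out.printf("sample mean = %.3f%n",
                invGammaSampleMean(alpha, beta, 1000000, 11L));
    }
}
```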

Derivation from Gamma distribution


The pdf of the gamma distribution is

f(y) = y^(k−1) e^(−y/θ) / (θ^k Γ(k)),

and defining the transformation x = 1/y, the resulting transformed density is

f(x) = (1 / (θ^k Γ(k))) x^(−k−1) e^(−1/(θx)).

Replacing k with α, θ with 1/β, and y with x results in the inverse-gamma pdf shown above.

Iteratively reweighted least squares


The method of iteratively reweighted least squares (IRLS) is used to solve certain optimization problems. It solves
objective functions of the form

argmin(β) Σi |yi − fi(β)|^p

by an iterative method in which each step involves solving a weighted least squares problem of the form

β(t+1) = argmin(β) Σi wi(β(t)) |yi − fi(β)|².
IRLS is used to find the maximum likelihood estimates of a generalized linear model, and in robust regression to
find an M-estimator, as a way of mitigating the influence of outliers in an otherwise normally-distributed data set.
For example, by minimizing the least absolute error rather than the least square error.
Although not a linear regression problem, Weiszfeld's algorithm for approximating the geometric median can also be
viewed as a special case of iteratively reweighted least squares, in which the objective function is the sum of
distances of the estimator from the samples.
One of the advantages of IRLS over linear and convex programming is that it can be used with Gauss–Newton and
Levenberg–Marquardt numerical algorithms.

Examples

L1 minimization for sparse recovery


IRLS can be used for ℓ1 minimization and smoothed ℓp minimization, p < 1, in compressed sensing problems.
It has been proved that the algorithm has a linear rate of convergence for the ℓ1 norm and superlinear for ℓt with t < 1,
under the restricted isometry property, which is generally a sufficient condition for sparse solutions.[1][2]

Lp norm linear regression


To find the parameters β = (β1, …, βk)^T which minimize the Lp norm for the linear regression problem,

argmin(β) Σi |yi − Xi β|^p,

the IRLS algorithm at step t+1 involves solving the weighted linear least squares problem:[3]

β(t+1) = argmin(β) Σi wi(t) |yi − Xi β|² = (X^T W(t) X)^(−1) X^T W(t) y,

where W(t) is the diagonal matrix of weights with elements:

wi(t) = |yi − Xi β(t)|^(p−2).
In the case p = 1, this corresponds to least absolute deviation regression (in this case, the problem would be better
approached by use of linear programming methods).
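For a concrete illustration, the sketch below (our own naming) applies these weights to the simplest case, a one-parameter model y ≈ βx, where each weighted least squares step has a closed form. The small constant eps is our own addition to guard against division by zero when a residual vanishes at p = 1:

```java
public class IrlsL1 {
    // IRLS fit of beta in y ~ beta*x under the Lp criterion.
    static double fit(double[] x, double[] y, double p, int iters) {
        double beta = 0, eps = 1e-8;
        double[] w = new double[x.length];
        java.util.Arrays.fill(w, 1.0);                   // start from ordinary least squares
        for (int t = 0; t < iters; t++) {
            double num = 0, den = 0;                     // weighted least squares, scalar case
            for (int i = 0; i < x.length; i++) {
                num += w[i] * x[i] * y[i];
                den += w[i] * x[i] * x[i];
            }
            beta = num / den;
            for (int i = 0; i < x.length; i++)           // reweight: |residual|^(p-2)
                w[i] = Math.pow(Math.max(Math.abs(y[i] - x[i] * beta), eps), p - 2);
        }
        return beta;
    }
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2.1, 3.9, 6.0, 8.1, 30.0};         // last point is an outlier
        System.out.printf("L1 slope = %.3f%n", fit(x, y, 1.0, 100));  // robust, near 2
        System.out.printf("L2 slope = %.3f%n", fit(x, y, 2.0, 1));    // pulled up by the outlier
    }
}
```

The L1 fit stays close to the slope of the clean points, while the least-squares fit is dragged toward the outlier.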

Notes
[1] Chartrand, R.; Yin, W. (March 31 - April 4, 2008). "Iteratively reweighted algorithms for compressive sensing"
(http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4518498). IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2008. pp. 3869-3872.
[2] Daubechies, I. et al. (2008). "Iteratively reweighted least squares minimization for sparse recovery"
(http://www.ricam.oeaw.ac.at/people/page/fornasier/DDFG14.pdf). Retrieved 2010-11-02.
[3] Gentle, James (2007). "6.8.1 Solutions that Minimize Other Norms of the Residuals". Matrix algebra. New York: Springer.
doi:10.1007/978-0-387-70873-7. ISBN 978-0-387-70872-0.

References
• Stanford Lecture Notes on the IRLS algorithm by Antoine Guitton (http://sepwww.stanford.edu/public/docs/
sep103/antoine2/paper_html/index.html)
• Numerical Methods for Least Squares Problems by Åke Björck (http://www.mai.liu.se/~akbjo/LSPbook.
html) (Chapter 4: Generalized Least Squares Problems.)
• Practical Least-Squares for Computer Graphics. SIGGRAPH Course 11 (http://graphics.stanford.edu/~jplewis/
lscourse/SLIDES.pdf)

Kendall tau rank correlation coefficient


In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's tau (τ) coefficient, is a
statistic used to measure the association between two measured quantities. A tau test is a non-parametric hypothesis
test which uses the coefficient to test for statistical dependence.
Specifically, it is a measure of rank correlation: that is, the similarity of the orderings of the data when ranked by
each of the quantities. It is named after Maurice Kendall, who developed it in 1938,[1] though Gustav Fechner had
proposed a similar measure in the context of time series in 1897.[2]

Definition
Let (x1, y1), (x2, y2), …, (xn, yn) be a set of joint observations from two random variables X and Y respectively, such
that all the values of (xi) and (yi) are unique. Any pair of observations (xi, yi) and (xj, yj) are said to be concordant if
the ranks for both elements agree: that is, if both xi > xj and yi > yj or if both xi < xj and yi < yj. They are said to be
discordant, if xi > xj and yi < yj or if xi < xj and yi > yj. If xi = xj or yi = yj, the pair is neither concordant nor
discordant.
The Kendall τ coefficient is defined as:

τ = (number of concordant pairs − number of discordant pairs) / ((1/2) n(n − 1)).[3]

Properties
The denominator is the total number of pairs, so the coefficient must be in the range −1 ≤ τ ≤ 1.
• If the agreement between the two rankings is perfect (i.e., the two rankings are the same) the coefficient has value
1.
• If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other) the
coefficient has value −1.
• If X and Y are independent, then we would expect the coefficient to be approximately zero.

Hypothesis test
The Kendall rank coefficient is often used as a test statistic in a statistical hypothesis test to establish whether two
variables may be regarded as statistically dependent. This test is non-parametric, as it does not rely on any
assumptions on the distributions of X or Y.
Under a null hypothesis of X and Y being independent, the sampling distribution of τ will have an expected value of
zero. The precise distribution cannot be characterized in terms of common distributions, but may be calculated
exactly for small samples; for larger samples, it is common to use an approximation to the normal distribution, with
mean zero and variance

2(2n + 5) / (9n(n − 1)).[4]

Accounting for ties


A pair {(xi, yi), (xj, yj)} is said to be tied if xi = xj or yi = yj; a tied pair is neither concordant nor discordant. When tied
pairs arise in the data, the coefficient may be modified in a number of ways to keep it in the range [-1, 1]:

Tau-a
Tau-a statistic tests the strength of association of the cross tabulations. Both variables have to be ordinal. Tau-a will
not make any adjustment for ties.

Tau-b
Tau-b statistic, unlike tau-a, makes adjustments for ties and is suitable for square tables. Values of tau-b range from
−1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A
value of zero indicates the absence of association.
The Kendall tau-b coefficient is defined as:

τB = (nc − nd) / √((n0 − n1)(n0 − n2))

where

n0 = n(n − 1)/2
n1 = Σi ti(ti − 1)/2
n2 = Σj uj(uj − 1)/2
nc = number of concordant pairs
nd = number of discordant pairs
ti = number of tied values in the i-th group of ties for the first quantity
uj = number of tied values in the j-th group of ties for the second quantity
Tau-c
Tau-c differs from tau-b as in being more suitable for rectangular tables than for square tables.

Significance tests
When two quantities are statistically independent, the distribution of τ is not easily characterizable in terms of
known distributions. However, the following statistic, zA, is approximately characterized by a standard
normal distribution when the quantities are statistically independent:

zA = 3(nc − nd) / √(n(n − 1)(2n + 5)/2)

Thus, if you want to test whether two quantities are statistically dependent, compute zA, and find the cumulative
probability for a standard normal distribution at −|zA|. For a 2-tailed test, multiply that number by two and this
gives you the p-value. If the p-value is below your acceptance level (typically 5%), you can reject the null hypothesis
that the quantities are statistically independent and accept the hypothesis that they are dependent.
Numerous adjustments should be added to zA when accounting for ties. The following statistic, zB, provides an
approximation coinciding with the τB distribution and is again approximately characterized by a standard normal
distribution when the quantities are statistically independent:

zB = (nc − nd) / √v

where

v = (v0 − vt − vu)/18 + v1 + v2
v0 = n(n − 1)(2n + 5)
vt = Σi ti(ti − 1)(2ti + 5)
vu = Σj uj(uj − 1)(2uj + 5)
v1 = Σi ti(ti − 1) Σj uj(uj − 1) / (2n(n − 1))
v2 = Σi ti(ti − 1)(ti − 2) Σj uj(uj − 1)(uj − 2) / (9n(n − 1)(n − 2))

Algorithms
The direct computation of the numerator , involves two nested iterations, as characterized by the
following pseudo-code:

numer := 0
for i := 2..N do
    for j := 1..(i-1) do
        numer := numer + sgn(x[i] - x[j]) * sgn(y[i] - y[j])
return numer
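The pseudo-code above translates directly into Java. The class below (our own naming) computes tau-a, i.e., without tie adjustment:

```java
public class KendallTau {
    // O(n^2) Kendall tau-a: (concordant - discordant) / (n(n-1)/2).
    static double tau(double[] x, double[] y) {
        int n = x.length;
        double numer = 0;
        for (int i = 1; i < n; i++)
            for (int j = 0; j < i; j++)
                numer += Math.signum(x[i] - x[j]) * Math.signum(y[i] - y[j]);
        return 2.0 * numer / (n * (n - 1));
    }
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {3, 4, 1, 2, 5};             // partially agreeing ranking
        System.out.printf("tau = %.2f%n", tau(x, y));  // 6 concordant, 4 discordant: 0.20
    }
}
```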

Although quick to implement, this algorithm is O(n²) in complexity and becomes very slow on large samples. A
more sophisticated algorithm[5] built upon the Merge Sort algorithm can be used to compute the numerator in
O(n log n) time.
Begin by ordering your data points sorting by the first quantity, x, and secondarily (among ties in x) by the
second quantity, y. With this initial ordering, y is not sorted, and the core of the algorithm consists of computing
how many steps a Bubble Sort would take to sort this initial y. An enhanced Merge Sort algorithm, with
O(n log n) complexity, can be applied to compute the number of swaps, S(y), that would be required by a
Bubble Sort to sort y. Then the numerator for τ is computed as:

nc − nd = n0 − n1 − n2 + n3 − 2 S(y),

where n3 is computed like n1 and n2, but with respect to the joint ties in x and y.
A Merge Sort partitions the data to be sorted, y, into two roughly equal halves, y_left and y_right, then sorts each half recursively, and then merges the two sorted halves into a fully sorted vector. The number of Bubble Sort swaps is equal to:

    S(y) = S(y_left) + S(y_right) + M(Y_left, Y_right)

where Y_left and Y_right are the sorted versions of y_left and y_right, and M characterizes the Bubble-Sort swap-equivalent for a merge operation. M is computed as depicted in the following pseudo-code:
function M(L[1..n], R[1..m])
    total := n + m
    i := 1
    j := 1
    nSwaps := 0
    while i + j <= total do
        if i > n or R[j] < L[i] then
            nSwaps := nSwaps + n - (i-1)
            j := j + 1
        else
            i := i + 1
    return nSwaps
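As a sketch of the whole procedure in the no-ties case (function names are illustrative): sort the points by x, count with Merge Sort the Bubble-Sort swaps S(y) needed to sort the resulting y, and recover the numerator as n_0 - 2 S(y):

```python
def merge_count(left, right):
    """Merge two sorted lists, returning (merged, swaps), where swaps is
    the Bubble-Sort swap-equivalent M of the merge: each element taken
    from `right` ahead of the remaining elements of `left` accounts for
    that many swaps (inversions)."""
    merged, swaps, i, j = [], 0, 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            swaps += len(left) - i
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, swaps

def sort_and_count(y):
    """Return (sorted(y), S(y)): the Bubble-Sort swap count for y,
    computed recursively as S(y_left) + S(y_right) + M."""
    if len(y) <= 1:
        return list(y), 0
    mid = len(y) // 2
    left, s_left = sort_and_count(y[:mid])
    right, s_right = sort_and_count(y[mid:])
    merged, s_merge = merge_count(left, right)
    return merged, s_left + s_right + s_merge

def kendall_numerator(x, y):
    """n_c - n_d in O(n log n), assuming no ties: every discordant pair
    is an inversion of y once the points are sorted by x, so
    n_c - n_d = n_0 - 2*S(y) with n_0 = n(n-1)/2."""
    n = len(x)
    y_by_x = [yy for _, yy in sorted(zip(x, y))]
    _, swaps = sort_and_count(y_by_x)
    return n * (n - 1) // 2 - 2 * swaps
```

The result agrees with the direct O(n^2) double loop on tie-free data.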

A side effect of the above steps is that you end up with both a sorted version of x and a sorted version of y. With these, the factors t_i and u_j used to compute tau_B are easily obtained in a single linear-time pass through the sorted arrays.
A second algorithm with O(n log n) time complexity, based on AVL trees, was devised by David Christensen.[6]

References
[1] Kendall, M. (1938). "A New Measure of Rank Correlation". Biometrika 30 (1–2): 81–89. doi:10.1093/biomet/30.1-2.81. JSTOR 2332226.
[2] Kruskal, W.H. (1958). "Ordinal Measures of Association". Journal of the American Statistical Association 53 (284): 814–861.
doi:10.2307/2281954. JSTOR 2281954. MR100941.
[3] Nelsen, R.B. (2001), "Kendall tau metric" (http://www.encyclopediaofmath.org/index.php?title=K/k130020), in Hazewinkel, Michiel,
Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4.
[4] Prokhorov, A.V. (2001), "Kendall coefficient of rank correlation" (http://www.encyclopediaofmath.org/index.php?title=K/k055200), in
Hazewinkel, Michiel, Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4.
[5] Knight, W. (1966). "A Computer Method for Calculating Kendall's Tau with Ungrouped Data". Journal of the American Statistical
Association 61 (314): 436–439. doi:10.2307/2282833. JSTOR 2282833.
[6] Christensen, David (2005). "Fast algorithms for the calculation of Kendall's τ". Computational Statistics 20 (1): 51–62.
doi:10.1007/BF02736122.

• Abdi, H. (2007). "Kendall rank correlation" (http://www.utdallas.edu/~herve/Abdi-KendallCorrelation2007-pretty.pdf). In Salkind, N.J., Encyclopedia of Measurement and Statistics. Thousand Oaks (CA): Sage.
• Kendall, M. (1948) Rank Correlation Methods, Charles Griffin & Company Limited

External links
• Tied rank calculation (http://www.statsdirect.com/help/nonparametric_methods/kend.htm)
• Why Kendall tau? (http://www.rsscse-edu.org.uk/tsj/bts/noether/text.html)
• Software for computing Kendall's tau on very large datasets (http://law.dsi.unimi.it/software/)
• Online software: computes Kendall's tau rank correlation (http://www.wessa.net/rwasp_kendall.wasp)
• The CORR Procedure: Statistical Computations (http://www.technion.ac.il/docs/sas/proc/zompmeth.htm)

Kolmogorov–Smirnov test
In statistics, the Kolmogorov–Smirnov test (K–S test) is a nonparametric test for the equality of continuous,
one-dimensional probability distributions that can be used to compare a sample with a reference probability
distribution (one-sample K–S test), or to compare two samples (two-sample K–S test). The Kolmogorov–Smirnov
statistic quantifies a distance between the empirical distribution function of the sample and the cumulative
distribution function of the reference distribution, or between the empirical distribution functions of two samples.
The null distribution of this statistic is calculated under the null hypothesis that the samples are drawn from the same
distribution (in the two-sample case) or that the sample is drawn from the reference distribution (in the one-sample
case). In each case, the distributions considered under the null hypothesis are continuous distributions but are
otherwise unrestricted.
The two-sample KS test is one of the most useful and general nonparametric methods for comparing two samples, as
it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two
samples.
The Kolmogorov–Smirnov test can be modified to serve as a goodness of fit test. In the special case of testing for
normality of the distribution, samples are standardized and compared with a standard normal distribution. This is
equivalent to setting the mean and variance of the reference distribution equal to the sample estimates, and it is
known that using these to define the specific reference distribution changes the null distribution of the test statistic:
see below. Various studies have found that, even in this corrected form, the test is less powerful for testing normality
than the Shapiro–Wilk test or Anderson–Darling test.[1]

Kolmogorov–Smirnov statistic
The empirical distribution function Fn for n iid observations Xi is defined as

    F_n(x) = (1/n) * sum_{i=1..n} I_{Xi <= x}

where I_{Xi <= x} is the indicator function, equal to 1 if Xi ≤ x and equal to 0 otherwise.


The Kolmogorov–Smirnov statistic for a given cumulative distribution function F(x) is

    D_n = sup_x | F_n(x) - F(x) |

where sup_x is the supremum of the set of distances. By the Glivenko–Cantelli theorem, if the sample comes from
distribution F(x), then Dn converges to 0 almost surely. Kolmogorov strengthened this result, by effectively
providing the rate of this convergence (see below). The Donsker theorem provides yet a stronger result.
In practice, the statistic requires a relatively large number of data points to properly reject the null hypothesis.
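Because F_n jumps only at the sample points, the supremum can be computed exactly from the sorted sample; a small Python sketch (the function name is illustrative):

```python
def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic
    D_n = sup_x |F_n(x) - F(x)| for a continuous reference CDF.
    The supremum is attained at or just before an order statistic, so
    it suffices to compare F against i/n and (i-1)/n at each x_(i)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        fx = cdf(x)
        d = max(d, i / n - fx, fx - (i - 1) / n)
    return d
```

For a single observation at 0.5 against the uniform CDF on [0, 1], this gives D_1 = 0.5.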

Kolmogorov distribution
The Kolmogorov distribution is the distribution of the random variable

    K = sup_{t in [0,1]} | B(t) |

where B(t) is the Brownian bridge. The cumulative distribution function of K is given by[2]

    Pr(K <= x) = 1 - 2 * sum_{k=1..inf} (-1)^(k-1) * exp(-2 k^2 x^2)

Both the form of the Kolmogorov–Smirnov test statistic and its asymptotic distribution under the null hypothesis
were published by Andrey Kolmogorov,[3] while a table of the distribution was published by Nikolai Vasilyevich
Smirnov.[4] Recurrence relations for the distribution of the test statistic in finite samples are available.[3]
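The series form of the CDF converges quickly and is easy to evaluate numerically; a sketch (truncating the series at a fixed number of terms):

```python
import math

def kolmogorov_cdf(x, terms=100):
    """Pr(K <= x) for the Kolmogorov distribution, via the series
    1 - 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * x^2)."""
    if x <= 0:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
            for k in range(1, terms + 1))
    return 1.0 - 2.0 * s
```

This reproduces the familiar critical value for the one-sample test: Pr(K <= 1.358) is approximately 0.95.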

Kolmogorov–Smirnov test
Under the null hypothesis that the sample comes from the hypothesized distribution F(x),

    sqrt(n) * D_n  ->  sup_t | B(F(t)) |

in distribution, where B(t) is the Brownian bridge.


If F is continuous then under the null hypothesis sqrt(n) D_n converges to the Kolmogorov distribution, which does
not depend on F. This result may also be known as the Kolmogorov theorem; see Kolmogorov's theorem for
disambiguation.
The goodness-of-fit test or the Kolmogorov–Smirnov test is constructed by using the critical values of the Kolmogorov distribution. The null hypothesis is rejected at level alpha if

    sqrt(n) * D_n > K_alpha

where K_alpha is found from

    Pr(K <= K_alpha) = 1 - alpha
The asymptotic power of this test is 1.

Test with estimated parameters


If either the form or the parameters of F(x) are determined from the data Xi, the critical values determined in this way
are invalid. In such cases, Monte Carlo or other methods may be required, but tables have been prepared for some
cases. Details for the required modifications to the test statistic and for the critical values for the normal distribution
and the exponential distribution have been published,[5] and later publications also include the Gumbel
distribution.[6] The Lilliefors test represents a special case of this for the normal distribution.

Two-sample Kolmogorov–Smirnov test


The Kolmogorov–Smirnov test may also be used to test whether two underlying one-dimensional probability
distributions differ. In this case, the Kolmogorov–Smirnov statistic is

    D_{n,n'} = sup_x | F_{1,n}(x) - F_{2,n'}(x) |

where F_{1,n} and F_{2,n'} are the empirical distribution functions of the first and the second sample respectively.
The null hypothesis is rejected at level alpha if

    D_{n,n'} > c(alpha) * sqrt( (n + n') / (n n') )

Note that the two-sample test checks whether the two data samples come from the same distribution. This does not
specify what that common distribution is (e.g. normal or not normal). Again, tables of critical values have been
published.[5]
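The two-sample statistic can be evaluated exactly at the points of the pooled sample, since both empirical CDFs are step functions; a minimal sketch:

```python
import bisect

def ks_two_sample(a, b):
    """Two-sample KS statistic sup_x |F_{1,n}(x) - F_{2,n'}(x)|,
    evaluated at every point of the pooled sample (where the two
    empirical step functions jump)."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        fa = bisect.bisect_right(a, x) / len(a)
        fb = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d
```

Identical samples give D = 0, and samples with disjoint supports give D = 1.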

Setting confidence limits for the shape of a distribution function


While the Kolmogorov–Smirnov test is usually used to test whether a given F(x) is the underlying probability
distribution of Fn(x), the procedure may be inverted to give confidence limits on F(x) itself. If one chooses a critical
value of the test statistic Dα such that P(Dn > Dα) = α, then a band of width ±Dα around Fn(x) will entirely contain
F(x) with probability 1 − α.

The Kolmogorov–Smirnov statistic in more than one dimension


The Kolmogorov–Smirnov test statistic needs to be modified if a similar test is to be applied to multivariate data.
This is not straightforward because the maximum difference between two joint cumulative distribution functions is
not generally the same as the maximum difference of any of the complementary distribution functions. Thus the
maximum difference will differ depending on which of Pr(X < x and Y < y) or Pr(X < x and Y > y) or any
of the other two possible arrangements is used. One might require that the result of the test used should not depend
on which choice is made.
One approach to generalizing the Kolmogorov–Smirnov statistic to higher dimensions which meets the above
concern is to compare the cdfs of the two samples with all possible orderings, and take the largest of the set of
resulting K-S statistics. In d dimensions, there are 2^d − 1 such orderings. One such variation is due to Peacock[7] and
another to Fasano & Franceschini[8] (see Lopes et al. for a comparison and computational details).[9] Critical values
for the test statistic can be obtained by simulations, but depend on the dependence structure in the joint distribution.

Footnotes
[1] Stephens, M. A. (1974). "EDF Statistics for Goodness of Fit and Some Comparisons". Journal of the American Statistical Association
(American Statistical Association) 69 (347): 730–737. doi:10.2307/2286009. JSTOR 2286009.
[2] Marsaglia G, Tsang WW, Wang J (2003). "Evaluating Kolmogorov's Distribution" (http://www.jstatsoft.org/v08/i18/paper). Journal of
Statistical Software 8 (18): 1–4.
[3] Kolmogorov A (1933). "Sulla determinazione empirica di una legge di distribuzione". G. Inst. Ital. Attuari 4: 83.
[4] Smirnov NV (1948). "Tables for estimating the goodness of fit of empirical distributions". Annals of Mathematical Statistics 19: 279.
[5] Pearson E.S. and Hartley, H.O., ed. (1972). Biometrika Tables for Statisticians. 2. Cambridge University Press. pp. 117–123, Tables 54, 55.
ISBN 0-521-06937-8.
[6] Galen R. Shorack and Jon A. Wellner (1986). Empirical Processes with Applications to Statistics. p. 239. ISBN 047186725X.
[7] Peacock J.A. (1983). "Two-dimensional goodness-of-fit testing in astronomy". Monthly Notices of the Royal Astronomical Society 202:
615–627. Bibcode 1983MNRAS.202..615P.
[8] Fasano, G., Franceschini, A. (1987). "A multidimensional version of the Kolmogorov–Smirnov test" (http://articles.adsabs.harvard.edu/full/1987MNRAS.225..155F). Monthly Notices of the Royal Astronomical Society (ISSN 0035-8711) 225: 155–170.
[9] Lopes, R.H.C., Reid, I., Hobson, P.R. (April 23–27, 2007). "The two-dimensional Kolmogorov-Smirnov test" (http://dspace.brunel.ac.uk/bitstream/2438/1166/1/acat2007.pdf). XI International Workshop on Advanced Computing and Analysis Techniques in Physics Research. Amsterdam, the Netherlands.

References
• Eadie, W.T.; D. Drijard, F.E. James, M. Roos and B. Sadoulet (1971). Statistical Methods in Experimental
Physics. Amsterdam: North-Holland. pp. 269–271. ISBN 0-444-10117-9.
• Stuart, Alan; Ord, Keith; Arnold, Steven [F.] (1999). Classical Inference and the Linear Model. Kendall's
Advanced Theory of Statistics. 2A (Sixth ed.). London: Arnold. pp. 25.37–25.43. ISBN 0-340-66230-1.
MR1687411.
• Corder, G.W., Foreman, D.I. (2009).Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach
Wiley, ISBN 978-0-470-45461-9
• Stephens, M.A. (1979) Test of fit for the logistic distribution based on the empirical distribution function,
Biometrika, 66(3), 591-5.

External links
• Short introduction (http://www.physics.csbsju.edu/stats/KS-test.html)
• KS test explanation (http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm)
• JavaScript implementation of one- and two-sided tests (http://www.ciphersbyritter.com/JAVASCRP/
NORMCHIK.HTM)
• Online calculator with the K-S test (http://jumk.de/statistic-calculator/)
• Open-source C++ code to compute the Kolmogorov distribution (http://root.cern.ch/root/html/TMath.
html#TMath:KolmogorovProb) and perform the K-S test (http://root.cern.ch/root/html/TMath.
html#TMath:KolmogorovTest)
• Paper on Evaluating Kolmogorov’s Distribution (http://www.jstatsoft.org/v08/i18/paper); contains C
implementation. This is the method used in Matlab.

Kronecker's lemma
In mathematics, Kronecker's lemma (see, e.g., Shiryaev (1996, Lemma IV.3.2)) is a result about the relationship
between convergence of infinite sums and convergence of sequences. The lemma is often used in the proofs of
theorems concerning sums of independent random variables such as the strong Law of large numbers. The lemma is
named after the German mathematician Leopold Kronecker.

The lemma
If (x_n) is an infinite sequence of real numbers such that

    s = sum_{n=1..inf} x_n

exists and is finite, then for 0 < b_1 <= b_2 <= b_3 <= ... with b_n -> infinity we have that

    lim_{n->inf} (1/b_n) * sum_{k=1..n} b_k x_k = 0

Proof
Let S_k denote the partial sums of the x's. Using summation by parts,

    (1/b_n) * sum_{k=1..n} b_k x_k = S_n - (1/b_n) * sum_{k=1..n-1} (b_{k+1} - b_k) S_k

Pick any ε > 0. Now choose N so that S_k is ε-close to s for k > N. This can be done as the sequence converges
to s. Then the right hand side is:

    S_n - (1/b_n) * sum_{k=1..N} (b_{k+1} - b_k) S_k - (1/b_n) * sum_{k=N+1..n-1} (b_{k+1} - b_k) S_k
    = S_n - (1/b_n) * sum_{k=1..N} (b_{k+1} - b_k) S_k - (1/b_n) * sum_{k=N+1..n-1} (b_{k+1} - b_k) s - (1/b_n) * sum_{k=N+1..n-1} (b_{k+1} - b_k) (S_k - s)

Now, let n go to infinity. The first term goes to s, which cancels with the third term. The second term goes to zero (as
the sum is a fixed value). Since the b sequence is increasing, the last term is bounded by ε (b_n - b_N)/b_n <= ε.
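A quick numerical illustration of the lemma (the choice x_k = 1/k^2, b_k = k is just one convenient example): the sum of the x_k converges, so the weighted average (1/b_n) sum b_k x_k must tend to zero, here like (ln n)/n:

```python
def kronecker_average(xs, bs):
    """(1/b_n) * sum_{k<=n} b_k * x_k for the given finite prefixes."""
    return sum(b * x for x, b in zip(xs, bs)) / bs[-1]

# x_k = 1/k^2 sums to pi^2/6 (finite), and b_k = k is nondecreasing
# and unbounded, so Kronecker's lemma applies.
n = 10000
xs = [1.0 / (k * k) for k in range(1, n + 1)]
bs = list(range(1, n + 1))
val = kronecker_average(xs, bs)  # here b_k * x_k = 1/k, so this is H_n / n
```

The average shrinks as n grows, consistent with the limit being zero.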

References
• Shiryaev, Albert N. (1996). Probability (2nd ed.). Springer. ISBN 0-387-94549-0.

Kullback–Leibler divergence
In probability theory and information theory, the Kullback–Leibler divergence[1][2][3] (also information
divergence, information gain, relative entropy, or KLIC) is a non-symmetric measure of the difference between
two probability distributions P and Q. KL measures the expected number of extra bits required to code samples from
P when using a code based on Q, rather than using a code based on P. Typically P represents the "true" distribution
of data, observations, or a precisely calculated theoretical distribution. The measure Q typically represents a theory,
model, description, or approximation of P.
Although it is often intuited as a metric or distance, the KL divergence is not a true metric — for example, it is not
symmetric: the KL from P to Q is generally not the same as the KL from Q to P. However, its infinitesimal form,
specifically its Hessian, is a metric tensor: it is the Fisher information metric.
KL divergence is a special case of a broader class of divergences called f-divergences. It was originally introduced
by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions. It can be
derived from the Bregman divergence.

Definition
For probability distributions P and Q of a discrete random variable their K–L divergence is defined to be

    D_KL(P || Q) = sum_i P(i) * log( P(i) / Q(i) )

In words, it is the average of the logarithmic difference between the probabilities P and Q, where the average is taken
using the probabilities P. The K–L divergence is only defined if P and Q both sum to 1 and if Q(i) > 0 for any i
such that P(i) > 0. If the quantity 0 log 0 appears in the formula, it is interpreted as zero.
For distributions P and Q of a continuous random variable, KL-divergence is defined to be the integral:[4]

    D_KL(P || Q) = integral_{-inf..inf} p(x) * log( p(x) / q(x) ) dx

where p and q denote the densities of P and Q.
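For discrete distributions given as probability vectors, the definition translates directly into code; a sketch using base-2 logarithms, so the result is in bits:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for discrete distributions p, q given as
    equal-length lists of probabilities. Terms with p_i = 0 contribute
    zero; q_i must be positive wherever p_i is."""
    return sum(pi * math.log2(pi / qi)
               for pi, qi in zip(p, q) if pi > 0)
```

For example, D_KL([1, 0] || [0.5, 0.5]) is exactly one bit: a certain outcome coded with a fair-coin code wastes one bit. Note also the asymmetry: D_KL(P || Q) and D_KL(Q || P) generally differ.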


More generally, if P and Q are probability measures over a set X, and Q is absolutely continuous with respect to P,
then the Kullback–Leibler divergence from P to Q is defined as

    D_KL(P || Q) = - integral_X log( dQ/dP ) dP

where dQ/dP is the Radon–Nikodym derivative of Q with respect to P, and provided the expression on the right-hand
side exists. Likewise, if P is absolutely continuous with respect to Q, then

    D_KL(P || Q) = integral_X log( dP/dQ ) dP

which we recognize as the entropy of P relative to Q. Continuing in this case, if mu is any measure on X for which
the densities p = dP/dmu and q = dQ/dmu exist, then the Kullback–Leibler divergence from P to Q is given as

    D_KL(P || Q) = integral_X p * log( p/q ) dmu

The logarithms in these formulae are taken to base 2 if information is measured in units of bits, or to base e if
information is measured in nats. Most formulas involving the KL divergence hold irrespective of log base.
In this article, this will be referred to as the divergence from P to Q, although some authors call it the divergence
"from Q to P" and others call it the divergence "between P and Q" (though note it is not symmetric as this latter
terminology implies). Care must be taken due to the lack of standardization in terminology.

Motivation
In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value x_i out of a set of possibilities X can be seen as representing an implicit probability distribution q(x_i) = 2^(-l_i) over X, where l_i is the length of the code for x_i in bits. Therefore, KL divergence can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution Q is used, compared to using a code based on the true distribution P:

    D_KL(P || Q) = - sum_i p(x_i) log q(x_i) + sum_i p(x_i) log p(x_i) = H(P,Q) - H(P)

where H(P,Q) is called the cross entropy of P and Q, and H(P) is the entropy of P.

[Figure: Illustration of the Kullback–Leibler (KL) divergence for two normal Gaussian distributions. Note the typical asymmetry for the KL divergence is clearly visible.]
Note also that there is a relation between the Kullback–Leibler divergence and the "rate function" in the theory of
large deviations.[5][6]

Computing the closed form


For many common families of distributions, the KL-divergence between two distributions in the family can be
derived in closed form. This can often be done most easily using the form of the KL-divergence in terms of expected
values or in terms of information entropy:

    D_KL(P || Q) = E_P[ log P - log Q ] = H(P,Q) - H(P)

where H(P) is the information entropy of P and H(P,Q) is the cross entropy of P and Q.
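The entropy form can be checked numerically; a sketch computing D_KL via the cross entropy minus the entropy for a small discrete example:

```python
import math

def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P,Q) in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.25, 0.25]
# D_KL(P||Q) computed via the entropy form H(P,Q) - H(P):
kl_via_entropy = cross_entropy(p, q) - entropy(p)
```

The result agrees with the direct sum over p_i log(p_i/q_i) and is non-negative, as Gibbs' inequality requires.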



Properties
The Kullback–Leibler divergence is always non-negative,

    D_KL(P || Q) >= 0

a result known as Gibbs' inequality, with DKL(P||Q) zero if and only if P = Q almost everywhere. The entropy H(P)
thus sets a minimum value for the cross-entropy H(P,Q), the expected number of bits required when using a code
based on Q rather than P; and the KL divergence therefore represents the expected number of extra bits that must be
transmitted to identify a value x drawn from X, if a code is used corresponding to the probability distribution Q,
rather than the "true" distribution P.
The Kullback–Leibler divergence remains well-defined for continuous distributions, and furthermore is invariant
under parameter transformations. For example, if a transformation is made from variable x to variable y(x), then,
since P(x)dx = P(y)dy and Q(x)dx = Q(y)dy, the Kullback–Leibler divergence may be rewritten:

    D_KL(P || Q) = integral P(x) log( P(x)/Q(x) ) dx = integral P(y) log( P(y)/Q(y) ) dy

where P(y) and Q(y) are the transformed densities. Although it was assumed that the transformation was continuous, this need
not be the case. This also shows that the Kullback–Leibler divergence produces a dimensionally consistent quantity,
since if x is a dimensioned variable, P(x) and Q(x) are also dimensioned, since e.g. P(x)dx is dimensionless. The
argument of the logarithmic term is and remains dimensionless, as it must. It can therefore be seen as in some ways a
more fundamental quantity than some other properties in information theory[7] (such as self-information or Shannon
entropy), which can become undefined or negative for non-discrete probabilities.
The Kullback–Leibler divergence is additive for independent distributions in much the same way as Shannon
entropy. If P_1, P_2 are independent distributions, with the joint distribution P(x,y) = P_1(x) P_2(y), and Q, Q_1, Q_2
likewise, then

    D_KL(P || Q) = D_KL(P_1 || Q_1) + D_KL(P_2 || Q_2)

KL divergence for normal distributions


The Kullback–Leibler divergence between two multivariate normal distributions of dimension k, with means mu_0, mu_1 and their corresponding nonsingular covariance matrices Sigma_0, Sigma_1, is:[8]

    D_KL(N_0 || N_1) = (1/2) * ( tr(Sigma_1^{-1} Sigma_0) + (mu_1 - mu_0)^T Sigma_1^{-1} (mu_1 - mu_0) - k - ln( det(Sigma_0) / det(Sigma_1) ) )


The logarithm must be taken to base e since the two terms following the logarithm are themselves base-e logarithms
of expressions that are either factors of the density function or otherwise arise naturally. The equation therefore gives
a result measured in nats. Dividing the entire expression above by loge 2 yields the divergence in bits.
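In the univariate case (k = 1) the formula reduces to a one-liner; a sketch in nats:

```python
import math

def kl_normal(mu1, s1, mu2, s2):
    """D_KL( N(mu1, s1^2) || N(mu2, s2^2) ) in nats: the univariate
    special case of the multivariate formula above,
    ln(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 * s2^2) - 1/2."""
    return (math.log(s2 / s1)
            + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2)
            - 0.5)
```

For equal unit variances, shifting the mean by one standard deviation costs exactly half a nat, and the divergence is visibly asymmetric in the two variances.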

Relation to metrics
One might be tempted to call it a "distance metric" on the space of probability distributions, but this would not be
correct as the Kullback–Leibler divergence is not symmetric – that is, D_KL(P || Q) ≠ D_KL(Q || P) – nor does it
satisfy the triangle inequality. Still, being a premetric, it generates a topology on the space of generalized probability
distributions, of which probability distributions proper are a special case. More concretely, if (P_1, P_2, ...) is a sequence of distributions such that

    lim_{n->inf} D_KL(P_n || Q) = 0

then it is said that P_n -> Q in divergence. Pinsker's inequality entails that P_n -> Q in divergence implies P_n -> Q in total variation, where the latter stands for
the usual convergence in total variation.
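Pinsker's inequality, in nats, states D_KL(P || Q) >= 2 * TV(P,Q)^2, where TV is the total variation distance; a quick randomized spot-check:

```python
import math
import random

def kl_nats(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Spot-check Pinsker's inequality D_KL(P||Q) >= 2 * TV(P,Q)^2 on
# random strictly positive distributions over 5 outcomes.
random.seed(1)
holds = True
for _ in range(200):
    p = [random.random() + 0.01 for _ in range(5)]
    q = [random.random() + 0.01 for _ in range(5)]
    sp, sq = sum(p), sum(q)
    p = [v / sp for v in p]
    q = [v / sq for v in q]
    if kl_nats(p, q) < 2.0 * total_variation(p, q) ** 2 - 1e-12:
        holds = False
```

The inequality holds in every trial, as the theory guarantees.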

Following Rényi (1970, 1961)[9][10] the term is sometimes also called the information gain about X achieved if P
can be used instead of Q. It is also called the relative entropy, for using Q instead of P.

Fisher information metric


However, the Kullback–Leibler divergence is rather directly related to a metric, specifically, the Fisher information
metric. This can be made explicit as follows. Presume that the probability distributions P and Q are parameterized by
some (possibly multi-dimensional) parameter theta. Consider then two close-by values of P and Q, so that these differ
by only a small amount Delta theta of the parameter theta. Specifically, take

    P = P(theta)  and  Q = P(theta + Delta theta)

with Delta theta_j an infinitesimal change of theta in the j direction, and the Hessian matrix representing the
corresponding change in the probability distribution. Then, for this expression for P, one has

    D_KL( P(theta) || P(theta + Delta theta) ) = (1/2) * sum_{j,k} Delta theta_j Delta theta_k g_{jk}(theta) + ...

where g_{jk}(theta) is the Fisher information metric.

Relation to other quantities of information theory


Many of the other quantities of information theory can be interpreted as applications of the KL divergence to specific
cases.
The self-information,

    I(m) = D_KL( delta_{im} || { p_i } )

is the KL divergence of the probability distribution P(i) from a Kronecker delta representing certainty that i=m —
i.e. the number of extra bits that must be transmitted to identify i if only the probability distribution P(i) is available
to the receiver, not the fact that i=m.
The mutual information,

    I(X; Y) = D_KL( P(X,Y) || P(X) P(Y) )

is the KL divergence of the product P(X)P(Y) of the two marginal probability distributions from the joint probability
distribution P(X,Y) — i.e. the expected number of extra bits that must be transmitted to identify X and Y if they are
coded using only their marginal distributions instead of the joint distribution. Equivalently, if the joint probability
P(X,Y) is known, it is the expected number of extra bits that must on average be sent to identify Y if the value of X is
not already known to the receiver.
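This view of mutual information computes directly; a sketch for a finite joint distribution given as a 2-D table:

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits, computed as the KL divergence of the joint
    distribution from the product of its marginals; `joint` is a
    list of rows of probabilities summing to 1."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi
```

An independent joint gives 0 bits; two perfectly correlated fair bits give exactly 1 bit.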
The Shannon entropy,

    H(X) = log N - D_KL( P_U(X) || P(X) )

is the number of bits which would have to be transmitted to identify X from N equally likely possibilities, less the KL
divergence of the uniform distribution PU(X) from the true distribution P(X) — i.e. less the expected number of bits
saved, which would have had to be sent if the value of X were coded according to the uniform distribution PU(X)
rather than the true distribution P(X).
The conditional entropy,

    H(X|Y) = log N - D_KL( P(X,Y) || P_U(X) P(Y) )

is the number of bits which would have to be transmitted to identify X from N equally likely possibilities, less the KL
divergence of the product distribution PU(X) P(Y) from the true joint distribution P(X,Y) — i.e. less the expected
number of bits saved which would have had to be sent if the value of X were coded according to the uniform
distribution PU(X) rather than the conditional distribution P(X|Y) of X given Y.
The cross entropy between two probability distributions measures the average number of bits needed to identify an
event from a set of possibilities, if a coding scheme is used based on a given probability distribution Q, rather than
the "true" distribution P. The cross entropy for two distributions P and Q over the same probability space is thus
defined as follows:

    H(P, Q) = E_P[ -log Q ] = H(P) + D_KL(P || Q)

KL divergence and Bayesian updating


In Bayesian statistics the KL divergence can be used as a measure of the information gain in moving from a prior
distribution to a posterior distribution. If some new fact Y = y is discovered, it can be used to update the probability
distribution for X from p(x|I) to a new posterior probability distribution p(x|y,I) using Bayes' theorem:

    p(x|y,I) = p(y|x,I) p(x|I) / p(y|I)

This distribution has a new entropy

which may be less than or greater than the original entropy H(p(·|I)). However, from the standpoint of the new
probability distribution one can estimate that to have used the original code based on p(x|I) instead of a new code
based on p(x|y,I) would have added an expected number of bits

to the message length. This therefore represents the amount of useful information, or information gain, about X, that
we can estimate has been learned by discovering Y = y.
If a further piece of data, Y2 = y2, subsequently comes in, the probability distribution for x can be updated further, to
give a new best guess p(x|y1,y2,I). If one reinvestigates the information gain for using p(x|y1,I) rather than p(x|I), it
turns out that it may be either greater or less than previously estimated:

    D_KL( p(x|y1,y2,I) || p(x|I) )  may be <= or > than  D_KL( p(x|y1,I) || p(x|I) )

and so the combined information gain does not obey the triangle inequality:

    D_KL( p(x|y1,y2,I) || p(x|I) )  may be <, = or > than  D_KL( p(x|y1,y2,I) || p(x|y1,I) ) + D_KL( p(x|y1,I) || p(x|I) )

All one can say is that on average, averaging using p(y2|y1,x,I), the two sides will average out.

Bayesian experimental design


A common goal in Bayesian experimental design is to maximise the expected KL divergence between the prior and
the posterior.[11] When posteriors are approximated to be Gaussian distributions, a design maximising the expected
KL divergence is called Bayes d-optimal.

Discrimination information
The Kullback–Leibler divergence DKL( p(x|H1) || p(x|H0) ) can also be interpreted as the expected discrimination
information for H1 over H0: the mean information per sample for discriminating in favor of a hypothesis H1 against
a hypothesis H0, when hypothesis H1 is true.[12] Another name for this quantity, given to it by I.J. Good, is the
expected weight of evidence for H1 over H0 to be expected from each sample.
The expected weight of evidence for H1 over H0 is not the same as the information gain expected per sample about
the probability distribution p(H) of the hypotheses,
DKL( p(x|H1) || p(x|H0) )  ≠  IG = DKL( p(H|x) || p(H|I) ).
Either of the two quantities can be used as a utility function in Bayesian experimental design, to choose an optimal
next question to investigate: but they will in general lead to rather different experimental strategies.
On the entropy scale of information gain there is very little difference between near certainty and absolute
certainty—coding according to a near certainty requires hardly any more bits than coding according to an absolute
certainty. On the other hand, on the logit scale implied by weight of evidence, the difference between the two is
enormous – infinite perhaps; this might reflect the difference between being almost sure (on a probabilistic level)
that, say, the Riemann hypothesis is correct, compared to being certain that it is correct because one has a
mathematical proof. These two different scales of loss function for uncertainty are both useful, according to how
well each reflects the particular circumstances of the problem in question.

Principle of minimum discrimination information


The idea of Kullback–Leibler divergence as discrimination information led Kullback to propose the Principle of
Minimum Discrimination Information (MDI): given new facts, a new distribution f should be chosen which is as
hard to discriminate from the original distribution f0 as possible; so that the new data produces as small an
information gain DKL( f || f0 ) as possible.
For example, if one had a prior distribution p(x,a) over x and a, and subsequently learnt the true distribution of a was
u(a), the Kullback–Leibler divergence between the new joint distribution for x and a, q(x|a) u(a), and the earlier
prior distribution would be:

i.e. the sum of the KL divergence of p(a) the prior distribution for a from the updated distribution u(a), plus the
expected value (using the probability distribution u(a)) of the KL divergence of the prior conditional distribution
p(x|a) from the new conditional distribution q(x|a). (Note that often the later expected value is called the conditional
KL divergence (or conditional relative entropy) and denoted by DKL(q(x|a)||p(x|a))[13]) This is minimised if q(x|a) =
p(x|a) over the whole support of u(a); and we note that this result incorporates Bayes' theorem, if the new distribution
u(a) is in fact a δ function representing certainty that a has one particular value.
MDI can be seen as an extension of Laplace's Principle of Insufficient Reason, and the Principle of Maximum
Entropy of E.T. Jaynes. In particular, it is the natural extension of the principle of maximum entropy from discrete to
continuous distributions, for which Shannon entropy ceases to be so useful (see differential entropy), but the KL
divergence continues to be just as relevant.
In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE) or
Minxent for short. Minimising the KL divergence of m from p with respect to m is equivalent to minimizing the

cross-entropy of p and m, since

    H(p, m) = H(p) + D_KL(p || m)

which is appropriate if one is trying to choose an adequate approximation to p. However, this is just as often not the
task one is trying to achieve. Instead, just as often it is m that is some fixed prior reference measure, and p that one is
attempting to optimise by minimising DKL(p||m) subject to some constraint. This has led to some ambiguity in the
literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be DKL(p||m),
rather than H(p,m).

Relationship to available work


Surprisals[14] add where probabilities multiply. The surprisal
for an event of probability p is defined as s ≡ k ln[1/p]. If k is
{1,1/ln 2,1.38×10−23} then surprisal is in {nats, bits, or J/K}
so that, for instance, there are N bits of surprisal for landing
all "heads" on a toss of N coins.
Best-guess states (e.g. for atoms in a gas) are inferred by
maximizing the average-surprisal S (entropy) for a given set
of control parameters (like pressure P or volume V). This
constrained entropy maximization, both classically[15] and
quantum mechanically,[16] minimizes Gibbs availability in
entropy units[17] A ≡ −kln Z where Z is a constrained
multiplicity or partition function.

When temperature T is fixed, free-energy (T × A) is also minimized. Thus if T, V and number of molecules N are
constant, the Helmholtz free energy F ≡ U − TS (where U is energy) is minimized as a system "equilibrates." If T and P are
held constant (say during processes in your body), the Gibbs free energy G ≡ U + PV − TS is minimized instead. The
change in free energy under these conditions is a measure of available work that might be done in the process. Thus
available work for an ideal gas at constant temperature To and pressure Po is W = ΔG = NkToΘ[V/Vo] where Vo
= NkTo/Po and Θ[x] ≡ x − 1 − ln x ≥ 0 (see also Gibbs inequality).

[Figure: Pressure versus volume plot of available work from a mole of Argon gas relative to ambient, calculated as To times KL divergence.]

More generally[18] the work available relative to some ambient is obtained by multiplying ambient temperature To by
KL-divergence or net-surprisal ΔI ≥ 0, defined as the average value of k ln[p/po] where po is the probability of a
given state under ambient conditions. For instance, the work available in equilibrating a monatomic ideal gas to
ambient values of Vo and To is thus W =ToΔI, where KL-divergence ΔI = Nk(Θ[V/Vo] + 3⁄2Θ[T/To]). The resulting
contours of constant KL-divergence, at right for a mole of Argon at standard temperature and pressure, for example
put limits on the conversion of hot to cold as in flame-powered air-conditioning or in the unpowered device to
convert boiling-water to ice-water discussed here.[19] Thus KL-divergence measures thermodynamic availability in
bits.
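The ideal-gas formula above is easy to check numerically. The sketch below (the helper names `theta` and `available_work` are illustrative, not from the original text) computes W = To·ΔI with ΔI = Nk(Θ[V/Vo] + 3⁄2 Θ[T/To]) for one mole of monatomic gas:

```python
import math

def theta(x):
    # Theta[x] = x - 1 - ln x >= 0, with equality only at x = 1 (Gibbs inequality)
    return x - 1.0 - math.log(x)

def available_work(T, V, T0, V0, N=6.022e23):
    # W = T0 * DeltaI, with net surprisal DeltaI = N k (Theta[V/V0] + 3/2 Theta[T/T0])
    # for a monatomic ideal gas; k is in J/K, so DeltaI is in entropy units and W in joules
    k = 1.380649e-23
    dI = N * k * (theta(V / V0) + 1.5 * theta(T / T0))
    return T0 * dI

# a mole of gas at twice the ambient temperature, at ambient volume
W = available_work(T=600.0, V=1.0, T0=300.0, V0=1.0)
```

A gas already at ambient conditions (T = To, V = Vo) gives Θ = 0 and hence zero available work, as the Gibbs inequality requires.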

Quantum information theory


For density matrices P and Q on a Hilbert space the K–L divergence (or relative entropy as it is often called in this case) from P to Q is defined to be

    DKL(P‖Q) = Tr(P (log P − log Q)).

In quantum information science the minimum of DKL(P‖Q) over all separable states Q can also be used as a
measure of entanglement in the state P.

Relationship between models and reality


Just as KL-divergence of "ambient from actual" measures thermodynamic availability, KL-divergence of "model
from reality" is also useful even if the only clues we have about reality are some experimental measurements. In the
former case KL-divergence describes distance to equilibrium or (when multiplied by ambient temperature) the
amount of available work, while in the latter case it tells you about surprises that reality has up its sleeve or, in other
words, how much the model has yet to learn.
Although this tool for evaluating models against systems that are accessible experimentally may be applied in any
field, its application to models in ecology via the Akaike information criterion is particularly well described in
papers[20] and a book[21] by Burnham and Anderson. In a nutshell the KL-divergence of a model from reality may be
estimated, to within a constant additive term, by a function (such as the sum of squared deviations) of the deviations observed
between data and the model's predictions. Estimates of such divergence for models that share the same additive term
can in turn be used to choose between models.
When trying to fit parametrized models to data there are various estimators which attempt to minimize
Kullback–Leibler divergence, such as maximum likelihood and maximum spacing estimators.

Symmetrised divergence
Kullback and Leibler themselves actually defined the divergence as:

    DKL(P‖Q) + DKL(Q‖P),
which is symmetric and nonnegative. This quantity has sometimes been used for feature selection in classification
problems, where P and Q are the conditional pdfs of a feature under two different classes.
An alternative is given via the λ divergence,

    Dλ(P‖Q) = λ DKL(P‖λP + (1 − λ)Q) + (1 − λ) DKL(Q‖λP + (1 − λ)Q),
which can be interpreted as the expected information gain about X from discovering which probability distribution X
is drawn from, P or Q, if they currently have probabilities λ and (1 − λ) respectively.
The value λ = 0.5 gives the Jensen–Shannon divergence, defined by

    DJS = ½ DKL(P‖M) + ½ DKL(Q‖M),

where M is the average of the two distributions,

    M = ½(P + Q).

DJS can also be interpreted as the capacity of a noisy information channel with two inputs giving the output
distributions p and q. The Jensen–Shannon divergence is proportional to the square of the Fisher information metric,
is equivalent to the Hellinger metric, and is also equal to one-half the so-called Jeffreys divergence (Rubner et al., 2000; Jeffreys 1946).
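For discrete distributions these symmetrised quantities can be computed directly; a minimal sketch (the function names are illustrative, not from the original text):

```python
import math

def kl(p, q):
    # D_KL(p || q) in nats for discrete distributions; assumes q_i > 0 wherever p_i > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jeffreys(p, q):
    # Kullback and Leibler's symmetric divergence: D_KL(p||q) + D_KL(q||p)
    return kl(p, q) + kl(q, p)

def jensen_shannon(p, q):
    # lambda = 1/2 case: D_JS = (1/2) D_KL(p||m) + (1/2) D_KL(q||m), m the average
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.5]
q = [0.9, 0.1]
```

Unlike DKL itself, both quantities are symmetric in p and q, and DJS is bounded above by ln 2 nats.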

Relationship to Hellinger distance


If P and Q are two probability measures with densities p and q, then the squared Hellinger distance is the quantity given by

    H²(P, Q) = ½ ∫ (√p(x) − √q(x))² dx = 1 − ∫ √(p(x) q(x)) dx.

Noting that ln t ≥ 2(1 − 1/√t) for all t > 0, so that in particular ln(p/q) ≥ 2(1 − √(q/p)), we see that

    p ln(p/q) ≥ 2(p − √(pq)).

Integrating this pointwise inequality, we get

    DKL(P‖Q) ≥ 2(1 − ∫ √(pq) dx).

Hence

    DKL(P‖Q) ≥ 2 H²(P, Q).
Other probability-distance measures


Other measures of probability distance are the histogram intersection, Chi-squared statistic, quadratic form distance,
match distance, Kolmogorov–Smirnov distance, and earth mover's distance.[22]

Data differencing
Just as absolute entropy serves as theoretical background for data compression, relative entropy serves as theoretical
background for data differencing – the absolute entropy of a set of data in this sense being the data required to
reconstruct it (minimum compressed size), while the relative entropy of a target set of data, given a source set of
data, is the data required to reconstruct the target given the source (minimum size of a patch).

References
[1] Kullback, S.; Leibler, R.A. (1951). "On Information and Sufficiency". Annals of Mathematical Statistics 22 (1): 79–86.
doi:10.1214/aoms/1177729694. MR39968.
[2] S. Kullback (1959) Information theory and statistics (John Wiley and Sons, NY).
[3] Kullback, S.; Burnham, K. P.; Laubscher, N. F.; Dallal, G. E.; Wilkinson, L.; Morrison, D. F.; Loyer, M. W.; Eisenberg, B. et al. (1987).
"Letter to the Editor: The Kullback–Leibler distance". The American Statistician 41 (4): 340–341. JSTOR 2684769.
[4] C. Bishop (2006). Pattern Recognition and Machine Learning. p. 55.
[5] Sanov I.N. (1957) "On the probability of large deviations of random magnitudes". Matem. Sbornik, v. 42 (84), 11--44.
[6] Novak S.Y. (2011) ch. 14.5, "Extreme value methods with applications to finance". Chapman & Hall/CRC Press. ISBN 978-1-4398-3574-6.
[7] See the section "differential entropy – 4" in Relative Entropy (http://videolectures.net/nips09_verdu_re/), video lecture by Sergio Verdú, NIPS 2009.
[8] Penny & Roberts, PARG-00-12, (2000) (http://www.allisons.org/ll/MML/KL/Normal). pp. 18
[9] A. Rényi (1970). Probability Theory. New York: Elsevier. Appendix, Sec.4. ISBN 0-486-45867-9.
[10] A. Rényi (1961). "On measures of information and entropy" (http://digitalassets.lib.berkeley.edu/math/ucb/text/math_s4_v1_article-27.pdf). Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability 1960. pp. 547–561.
[11] Chaloner, K. and Verdinelli, I. (1995) Bayesian Experimental Design: A Review. Statistical Science 10 (3): 273–304. doi:10.1214/ss/1177009939
[12] Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007). "Section 14.7.2. Kullback–Leibler Distance" (http://apps.nrbook.com/empanel/index.html#pg=756). Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8.
[13] Thomas M. Cover, Joy A. Thomas (1991) Elements of Information Theory (John Wiley and Sons, New York, NY), p.22
[14] Myron Tribus (1961) Thermodynamics and thermostatics (D. Van Nostrand, New York)
[15] E. T. Jaynes (1957) Information theory and statistical mechanics (http://bayes.wustl.edu/etj/articles/theory.1.pdf), Physical Review 106:620

[16] E. T. Jaynes (1957) Information theory and statistical mechanics II (http://bayes.wustl.edu/etj/articles/theory.2.pdf), Physical Review 108:171
[17] J.W. Gibbs (1873) A method of geometrical representation of thermodynamic properties of substances by means of surfaces, reprinted in
The Collected Works of J. W. Gibbs, Volume I Thermodynamics, ed. W. R. Longley and R. G. Van Name (New York: Longmans, Green,
1931) footnote page 52.
[18] M. Tribus and E. C. McIrvine (1971) Energy and information, Scientific American 224:179–186.
[19] P. Fraundorf (2007) Thermal roots of correlation-based complexity (http://www3.interscience.wiley.com/cgi-bin/abstract/117861985/ABSTRACT), Complexity 13:3, 18–26
[20] Kenneth P. Burnham and David R. Anderson (2001) Kullback–Leibler information as a basis for strong inference in ecological studies (http://www.publish.csiro.au/paper/WR99107.htm), Wildlife Research 28:111–119.
[21] Burnham, K. P. and Anderson D. R. (2002) Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach,
Second Edition (Springer Science, New York) ISBN 978-0-387-95364-9.
[22] Rubner, Y., Tomasi, C., and Guibas, L. J., 2000. The Earth Mover's distance as a metric for image retrieval. International Journal of
Computer Vision, 40(2): 99–121.

External links
• Ruby gem for calculating KL divergence (https://github.com/evansenter/diverge)
• Jon Shlens' tutorial on Kullback–Leibler divergence and likelihood theory (http://www.snl.salk.edu/~shlens/kl.pdf)
• Matlab code for calculating KL divergence (http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=13089&objectType=file)
• Sergio Verdú, Relative Entropy (http://videolectures.net/nips09_verdu_re/), NIPS 2009. One-hour video lecture.
• A modern summary of info-theoretic divergence measures (http://arxiv.org/abs/math/0604246)

Laplace distribution
Laplace

[Plot: probability density function]
[Plot: cumulative distribution function]

Parameters:   μ location (real); b > 0 scale (real)
Support:      x ∈ (−∞, +∞)
PDF:          (1/(2b)) exp(−|x − μ|/b)
CDF:          see text
Mean:         μ
Median:       μ
Mode:         μ
Variance:     2b²
Skewness:     0
Ex. kurtosis: 3
Entropy:      ln(2be)
MGF:          exp(μt)/(1 − b²t²) for |t| < 1/b
CF:           exp(μit)/(1 + b²t²)
In probability theory and statistics, the Laplace distribution is a continuous probability distribution named after
Pierre-Simon Laplace. It is also sometimes called the double exponential distribution, because it can be thought of as
two exponential distributions (with an additional location parameter) spliced together back-to-back, but the term
double exponential distribution is also sometimes used to refer to the Gumbel distribution. The difference between
two independent identically distributed exponential random variables is governed by a Laplace distribution, as is a
Brownian motion evaluated at an exponentially distributed random time. Increments of Laplace motion or a variance
gamma process evaluated over the time scale also have a Laplace distribution.

Characterization

Probability density function


A random variable has a Laplace(μ, b) distribution if its probability density function is

    f(x | μ, b) = (1/(2b)) exp(−|x − μ|/b).

Here, μ is a location parameter and b > 0, which is sometimes referred to as the diversity, is a scale parameter. If μ = 0 and b = 1, the positive half-line is exactly an exponential distribution scaled by 1/2.
The probability density function of the Laplace distribution is also reminiscent of the normal distribution; however,
whereas the normal distribution is expressed in terms of the squared difference from the mean μ, the Laplace density
is expressed in terms of the absolute difference from the mean. Consequently the Laplace distribution has fatter tails
than the normal distribution.

Cumulative distribution function


The Laplace distribution is easy to integrate (if one distinguishes two symmetric cases) due to the use of the absolute
value function. Its cumulative distribution function is as follows:

    F(x) = ½ exp((x − μ)/b)          if x < μ,
    F(x) = 1 − ½ exp(−(x − μ)/b)     if x ≥ μ,

which can be written compactly as F(x) = ½ + ½ sgn(x − μ)(1 − exp(−|x − μ|/b)). The inverse cumulative distribution function is given by

    F⁻¹(p) = μ − b sgn(p − ½) ln(1 − 2|p − ½|).
Generating random variables according to the Laplace distribution


Given a random variable U drawn from the uniform distribution in the interval (−1/2, 1/2], the random variable

    X = μ − b sgn(U) ln(1 − 2|U|)

has a Laplace distribution with parameters μ and b. This follows from the inverse cumulative distribution function
given above.
A Laplace(0, b) variate can also be generated as the difference of two i.i.d. Exponential(1/b) random variables.
Equivalently, a Laplace(0, 1) random variable can be generated as the logarithm of the ratio of two iid uniform
random variables.
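Both recipes are easy to implement. The sketch below (helper names are illustrative) draws Laplace variates by the inverse-CDF transform and by differencing exponentials, and checks the sample mean and variance against μ and 2b²:

```python
import math
import random

def sample_laplace(mu, b, rng=random):
    # inverse-CDF method: U uniform near (-1/2, 1/2), X = mu - b*sgn(U)*ln(1 - 2|U|)
    u = rng.random() - 0.5
    while u == -0.5:          # guard against log(0) at the excluded endpoint
        u = rng.random() - 0.5
    sgn = 1.0 if u >= 0 else -1.0
    return mu - b * sgn * math.log(1.0 - 2.0 * abs(u))

def sample_laplace_diff(b, rng=random):
    # Laplace(0, b) as the difference of two iid Exponential(1/b) variables (mean b)
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)

random.seed(0)
n = 200000
xs = [sample_laplace(1.0, 2.0) for _ in range(n)]
mean = sum(xs) / n                               # should approach mu = 1
var = sum((x - mean) ** 2 for x in xs) / n       # should approach 2*b^2 = 8
```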

Parameter estimation
Given N independent and identically distributed samples x1, x2, ..., xN, the maximum likelihood estimator of μ is
the sample median,[1] and the maximum likelihood estimator of b is

    b̂ = (1/N) Σᵢ₌₁ᴺ |xᵢ − μ̂|

(revealing a link between the Laplace distribution and least absolute deviations).
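These estimators are one line each; a sketch (with an illustrative function name) checked on synthetic data generated as a shifted difference of exponentials:

```python
import random
import statistics

def laplace_mle(xs):
    # MLE of mu is the sample median; MLE of b is the mean absolute deviation about it
    mu_hat = statistics.median(xs)
    b_hat = sum(abs(x - mu_hat) for x in xs) / len(xs)
    return mu_hat, b_hat

random.seed(42)
# Laplace(mu=3, b=1) samples via the difference of two Exponential(1) variables
data = [3.0 + random.expovariate(1.0) - random.expovariate(1.0) for _ in range(100000)]
mu_hat, b_hat = laplace_mle(data)
```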

Moments

The moments about the location parameter μ are E[(X − μ)ʳ] = bʳ r! for even r, and 0 for odd r (by symmetry).

Related distributions
• If X ~ Laplace(μ, b) then kX + c ~ Laplace(kμ + c, |k|b).
• If X ~ Laplace(0, b) then |X| ~ Exponential(b⁻¹) (exponential distribution).
• If X ~ Exponential(λ) and Y ~ Bernoulli(1/2) independent of X, then X(2Y − 1) ~ Laplace(0, λ⁻¹).
• If X ~ Laplace(μ, b) then |X − μ| ~ Exponential(b⁻¹).
• If X ~ Laplace(μ, b) then X follows an exponential power distribution with shape parameter 1 (Exponential power distribution).
• If X1, X2, X3, X4 ~ N(0, 1) (Normal distribution) then X1X2 − X3X4 ~ Laplace(0, 1).
• If Xi ~ Laplace(μ, b) then (2/b) Σᵢ₌₁ⁿ |Xi − μ| ~ χ²(2n) (Chi-squared distribution).
• If X, Y ~ Laplace(μ, b) then |X − μ|/|Y − μ| ~ F(2, 2) (F-distribution).
• If X, Y ~ U(0, 1) (Uniform distribution (continuous)) then log(X/Y) ~ Laplace(0, 1).
• If X ~ Exponential(λ) and Y ~ Exponential(ν) independent of X, then λX − νY ~ Laplace(0, 1).
• If V ~ Exponential(1) and Z ~ N(0, 1) independent of V, then μ + b√(2V) Z ~ Laplace(μ, b).
• If X is a geometric stable distribution with α = 2, β = 0 and μ = 0 then X is a Laplace distribution with μ = 0 and b = λ.
• The Laplace distribution is the limiting case of the Hyperbolic distribution.
• If Z ~ N(0, 1) and R ~ Rayleigh(1) (Rayleigh distribution) independent of Z, then RZ ~ Laplace(0, 1).

Relation to the exponential distribution


A Laplace random variable can be represented as the difference of two iid exponential random variables.[2] One way
to show this is by using the characteristic function approach: for any set of independent continuous random
variables, the characteristic function of any linear combination of those variables (which uniquely determines the
distribution) can be acquired by multiplying the corresponding characteristic functions.

Consider two i.i.d random variables X, Y ~ Exponential(λ). The characteristic functions of X and −Y are

    λ/(λ − it)  and  λ/(λ + it),

respectively. On multiplying these characteristic functions (equivalent to the characteristic function of the sum of the random variables X + (−Y) = X − Y), the result is

    λ²/((λ − it)(λ + it)) = λ²/(λ² + t²).

This is the same as the characteristic function for Z ~ Laplace(0, 1/λ), which is

    1/(1 + t²/λ²).
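The identity can also be verified by simulation: the empirical characteristic function of X − Y for X, Y ~ Exponential(λ) should match λ²/(λ² + t²). A minimal sketch (names are illustrative):

```python
import cmath
import random

random.seed(1)
lam = 2.0
n = 200000
samples = [random.expovariate(lam) - random.expovariate(lam) for _ in range(n)]

def empirical_cf(xs, t):
    # Monte Carlo estimate of E[exp(i t X)]
    return sum(cmath.exp(1j * t * x) for x in xs) / len(xs)

t = 1.5
theory = lam ** 2 / (lam ** 2 + t ** 2)   # CF of Laplace(0, 1/lam) at t
estimate = empirical_cf(samples, t)
```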

Sargan distributions
Sargan distributions are a system of distributions of which the Laplace distribution is a core member. A pth-order
Sargan distribution has density[3][4]

    f(x) = (α/2) exp(−α|x|) (1 + Σⱼ₌₁ᵖ βⱼ αʲ |x|ʲ) / (1 + Σⱼ₌₁ᵖ j! βⱼ),

for parameters α > 0, βⱼ ≥ 0. The Laplace distribution results for p = 0.

Applications
The Laplacian distribution has been used in speech recognition to model priors on DFT coefficients.[5]
The addition of noise drawn from a Laplacian distribution, with scaling parameter appropriate to a function's
sensitivity, to the output of a statistical database query is the most common means to provide differential privacy in
statistical databases.
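As a sketch of that mechanism (the names and parameter choices below are illustrative, not from a particular library): a counting query has sensitivity 1, so adding Laplace(0, 1/ε) noise to the true count yields ε-differential privacy.

```python
import math
import random

def laplace_noise(b, rng=random):
    # inverse-CDF draw from Laplace(0, b)
    u = rng.random() - 0.5
    while u == -0.5:          # avoid log(0) at the excluded endpoint
        u = rng.random() - 0.5
    sgn = 1.0 if u >= 0 else -1.0
    return -b * sgn * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, rng=random):
    # Laplace mechanism for a sensitivity-1 query: noise scale b = 1/epsilon
    return true_count + laplace_noise(1.0 / epsilon, rng=rng)

random.seed(7)
answers = [private_count(100, epsilon=0.5) for _ in range(50000)]
avg = sum(answers) / len(answers)   # the noise has mean 0, so this concentrates near 100
```

Smaller ε means stronger privacy but noisier answers, since the noise scale is 1/ε.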

History
This distribution is often referred to as Laplace's first law of errors. He published it in 1774 when he noted that the
frequency of an error could be expressed as an exponential function of its magnitude once its sign was
disregarded.[6][7]
Laplace in 1778 published his second law of errors wherein he noted that the frequency of an error was proportional
to the exponential of the square of its magnitude. This was subsequently rediscovered by Gauss (possibly in 1795)
and it is now best known as the Normal distribution.
Keynes published a paper in 1911 based on his earlier thesis wherein he showed that the Laplace distribution
minimised the absolute deviation from the median.[8]

References
[1] Robert M. Norton (May 1984). "The Double Exponential Distribution: Using Calculus to Find a Maximum Likelihood Estimator". The
American Statistician (American Statistical Association) 38 (2): 135–136. doi:10.2307/2683252. JSTOR 2683252.
[2] Kotz, Samuel; Kozubowski, Tomasz J.; Podgórski, Krzysztof (2001). The Laplace distribution and generalizations: a revisit with applications to Communications, Economics, Engineering and Finance (http://books.google.com/books?id=cb8B07hwULUC&pg=PA23). Birkhauser. p. 23 (Proposition 2.2.2, Equation 2.2.8). ISBN 9780817641665.
[3] Everitt, B.S. (2002) The Cambridge Dictionary of Statistics, CUP. ISBN 0-521-81099-X
[4] Johnson, N.L., Kotz S., Balakrishnan, N. (1994) Continuous Univariate Distributions, Wiley. ISBN 0-471-58495-9. p. 60
[5] Eltoft, T.; Taesu Kim; Te-Won Lee (2006). "On the multivariate Laplace distribution" (http://eo.uit.no/publications/TE-SPL-06.pdf). IEEE Signal Processing Letters 13 (5): 300–303. doi:10.1109/LSP.2006.870353.
[6] Laplace, P-S. (1774). Mémoire sur la probabilité des causes par les évènements. Mémoires de l’Academie Royale des Sciences Presentés par
Divers Savan, 6, 621–656
[7] Wilson EB (1923) First and second laws of error. JASA 18, 143
[8] Keynes JM (1911) The principal averages and the laws of error which lead to them. J Roy Stat Soc, 74, 322–331

Laplace's equation
In mathematics, Laplace's equation is a second-order partial differential equation named after Pierre-Simon Laplace
who first studied its properties. This is often written as:

    ∇²φ = 0  or  ∆φ = 0,

where ∆ = ∇² is the Laplace operator and φ is a scalar function. In more general contexts, ∆ = ∇² may denote the Laplace–Beltrami or
Laplace–de Rham operator.
Laplace's equation and Poisson's equation are the simplest examples of elliptic partial differential equations.
Solutions of Laplace's equation are called harmonic functions.
The general theory of solutions to Laplace's equation is known as potential theory. The solutions of Laplace's
equation are the harmonic functions, which are important in many fields of science, notably the fields of
electromagnetism, astronomy, and fluid dynamics, because they can be used to accurately describe the behavior of
electric, gravitational, and fluid potentials. In the study of heat conduction, the Laplace equation is the steady-state
heat equation.

Definition
In three dimensions, the problem is to find twice-differentiable real-valued functions , of real variables x, y, and z,
such that
In Cartesian coordinates

In cylindrical coordinates,

In spherical coordinates,

In Curvilinear coordinates,

or

This is often written as

or, especially in more general contexts,

where ∆ = ∇² is the Laplace operator or "Laplacian"

where ∇ ⋅ = div is the divergence, and ∇ = grad is the gradient.


If the right-hand side is specified as a given function, f(x, y, z), i.e., if the whole equation is written as
Laplace's equation 296

then it is called "Poisson's equation".


The Laplace equation is also a special case of the Helmholtz equation.

Boundary conditions
The Dirichlet problem for Laplace's equation consists of finding a solution φ on some domain D such that φ on the boundary of D is equal to some given function. Since the Laplace operator appears in the heat equation, one physical interpretation of this problem is as follows: fix the temperature on the boundary of the domain according to the given specification of the boundary condition. Allow heat to flow until a stationary state is reached in which the temperature at each point on the domain doesn't change anymore. The temperature distribution in the interior will then be given by the solution to the corresponding Dirichlet problem.

[Figure: Laplace's equation on an annulus (inner radius r = 2 and outer radius R = 4) with Dirichlet boundary conditions u(r=2) = 0 and u(r=4) = 4 sin(5θ).]

The Neumann boundary conditions for Laplace's equation specify not the function φ itself on the boundary of D,
but its normal derivative. Physically, this corresponds to the construction of a potential for a vector field whose
effect is known at the boundary of D alone.
Solutions of Laplace's equation are called harmonic functions; they are all analytic within the domain where the
equation is satisfied. If any two functions are solutions to Laplace's equation (or any linear homogeneous differential
equation), their sum (or any linear combination) is also a solution. This property, called the principle of
superposition, is very useful, e.g., solutions to complex problems can be constructed by summing simple solutions.
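The heat-flow interpretation above suggests a simple numerical scheme: repeatedly replace each interior value by the average of its four grid neighbors (Jacobi iteration) until the discrete Laplace equation is satisfied. A sketch on the unit square, with one edge held at temperature 1 and the other three at 0 (grid size and iteration count are arbitrary choices):

```python
# Jacobi iteration for Laplace's equation on an n x n grid over the unit square
n = 21
u = [[0.0] * n for _ in range(n)]
for j in range(n):
    u[0][j] = 1.0   # Dirichlet data: one edge at 1, remaining edges at 0

for _ in range(2000):
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            # discrete mean value property: each point is the average of its neighbors
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
    u = new

center = u[n // 2][n // 2]
```

By superposition, the four solutions with each edge in turn held at 1 sum to the constant solution 1, so the center value must be exactly 1/4; the iteration reproduces this, and all values stay between the boundary extremes 0 and 1, consistent with the maximum principle.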

Laplace equation in two dimensions


The Laplace equation in two independent variables has the form

    ∂²ψ/∂x² + ∂²ψ/∂y² ≡ ψxx + ψyy = 0.
Analytic functions
The real and imaginary parts of a complex analytic function both satisfy the Laplace equation. That is, if z = x + iy,
and if

    f(z) = u(x, y) + iv(x, y),

then the necessary condition that f(z) be analytic is that the Cauchy-Riemann equations be satisfied:

    ux = vy,   vx = −uy,

where ux is the first partial derivative of u with respect to x.

It follows that

    uyy = (−vx)y = −(vy)x = −(ux)x = −uxx.

Therefore u satisfies the Laplace equation. A similar calculation shows that v also satisfies the Laplace equation.
Conversely, given a harmonic function φ, it is the real part of an analytic function f(z) (at least locally). If a trial
form is

    f(z) = φ(x, y) + iψ(x, y),

then the Cauchy-Riemann equations will be satisfied if we set

    ψx = −φy,   ψy = φx.

This relation does not determine ψ, but only its increments:

    dψ = −φy dx + φx dy.

The Laplace equation for φ implies that the integrability condition for ψ is satisfied:

    ψxy = ψyx,

and thus ψ may be defined by a line integral. The integrability condition and Stokes' theorem implies that the value
of the line integral connecting two points is independent of the path. The resulting pair of solutions of the Laplace
equation are called conjugate harmonic functions. This construction is only valid locally, or provided that the path
does not loop around a singularity. For example, if r and θ are polar coordinates and

    φ = log r,

then a corresponding analytic function is

    f(z) = log z = log r + iθ.

However, the angle θ is single-valued only in a region that does not enclose the origin.
The close connection between the Laplace equation and analytic functions implies that any solution of the Laplace
equation has derivatives of all orders, and can be expanded in a power series, at least inside a circle that does not
enclose a singularity. This is in sharp contrast to solutions of the wave equation, which generally have less regularity.
There is an intimate connection between power series and Fourier series. If we expand a function f in a power series
inside a circle of radius R, this means that

    f(z) = Σₙ₌₀^∞ cₙ zⁿ,

with suitably defined coefficients whose real and imaginary parts are given by

    cₙ = aₙ + i bₙ.

Therefore, writing z = r e^{iθ},

    f(z) = Σₙ₌₀^∞ rⁿ (aₙ cos nθ − bₙ sin nθ) + i Σₙ₌₀^∞ rⁿ (aₙ sin nθ + bₙ cos nθ),

which is a Fourier series for f. These trigonometric functions can themselves be expanded, using multiple angle
formulae.

Fluid flow
Let the quantities u and v be the horizontal and vertical components of the velocity field of a steady incompressible,
irrotational flow in two dimensions. The condition that the flow be incompressible is that

    ux + vy = 0,

and the condition that the flow be irrotational is that

    vx − uy = 0.

If we define the differential of a function ψ by

    dψ = −v dx + u dy,

then the incompressibility condition is the integrability condition for this differential: the resulting function is called
the stream function because it is constant along flow lines. The first derivatives of ψ are given by

    ψx = −v,   ψy = u,

and the irrotationality condition implies that ψ satisfies the Laplace equation. The harmonic function φ that is
conjugate to ψ is called the velocity potential. The Cauchy-Riemann equations imply that

    φx = u,   φy = v.

Thus every analytic function corresponds to a steady incompressible, irrotational fluid flow in the plane. The real
part is the velocity potential, and the imaginary part is the stream function.

Electrostatics
According to Maxwell's equations, an electric field (u, v) in two space dimensions that is independent of time satisfies

    ∇ × (u, v) = vx − uy = 0,

and

    ∇ ⋅ (u, v) = ux + vy = ρ,

where ρ is the charge density. The first Maxwell equation is the integrability condition for the differential

    dφ = −u dx − v dy,

so the electric potential φ may be constructed to satisfy

    φx = −u,   φy = −v.

The second of Maxwell's equations then implies that

    φxx + φyy = −ρ,

which is the Poisson equation.


It is important to note that the Laplace equation can be used in three-dimensional problems in electrostatics and fluid
flow just as in two dimensions.

Laplace equation in three dimensions

Fundamental solution
A fundamental solution of Laplace's equation satisfies

    ∆u = uxx + uyy + uzz = −δ(x − x′, y − y′, z − z′),

where the Dirac delta function δ denotes a unit source concentrated at the point (x′, y′, z′). No function has this
property, but it can be thought of as a limit of functions whose integrals over space are unity, and whose support (the
region where the function is non-zero) shrinks to a point (see weak solution). It is common to take a different sign
convention for this equation than one typically does when defining fundamental solutions. This choice of sign is
often convenient to work with because −∆ is a positive operator. The definition of the fundamental solution thus
implies that, if the Laplacian of u is integrated over any volume that encloses the source point, then

    ∭V ∇ ⋅ ∇u dV = −1.

The Laplace equation is unchanged under a rotation of coordinates, and hence we can expect that a fundamental
solution may be obtained among solutions that only depend upon the distance r from the source point. If we choose
the volume to be a ball of radius a around the source point, then Gauss' divergence theorem implies that

    −1 = ∭V ∇ ⋅ ∇u dV = ∬S (du/dr) dS = 4πa² (du/dr)|_{r=a}.

It follows that

    du/dr = −1/(4πr²)

on a sphere of radius r that is centered around the source point, and hence

    u = 1/(4πr).

Note that, with the opposite sign convention (used in Physics), this is the potential generated by a point particle, for
an inverse-square law force, arising in the solution of Poisson equation. A similar argument shows that in two
dimensions

    u = −log(r)/(2π),

where log denotes the natural logarithm. Note that, with the opposite sign convention, this is the potential
generated by a pointlike sink (see point particle), which is the solution of the Euler equations in two-dimensional
incompressible flow.
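That u = 1/(4πr) is harmonic away from the source can be checked numerically with a finite-difference Laplacian (a sketch; the evaluation point and step size h are arbitrary choices):

```python
import math

def u(x, y, z):
    # candidate fundamental solution of the 3-D Laplace equation
    return 1.0 / (4.0 * math.pi * math.sqrt(x * x + y * y + z * z))

def laplacian(f, x, y, z, h=1e-3):
    # second-order central differences for f_xx + f_yy + f_zz
    return (f(x + h, y, z) + f(x - h, y, z)
            + f(x, y + h, z) + f(x, y - h, z)
            + f(x, y, z + h) + f(x, y, z - h)
            - 6.0 * f(x, y, z)) / (h * h)

# away from the origin the Laplacian should vanish (up to discretization error)
residual = laplacian(u, 1.0, 0.5, -0.7)
```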

Green's function
A Green's function is a fundamental solution that also satisfies a suitable condition on the boundary S of a volume V.
For instance, G(x, y, z; x′, y′, z′) may satisfy

    ∇ ⋅ ∇G = −δ(x − x′, y − y′, z − z′)  in V,
    G = 0  on S.

Now if u is any solution of the Poisson equation in V:

    ∇ ⋅ ∇u = −f,

and u assumes the boundary values g on S, then we may apply Green's identity, (a consequence of the divergence
theorem) which states that

    ∭V [G ∇ ⋅ ∇u − u ∇ ⋅ ∇G] dV = ∭V ∇ ⋅ [G ∇u − u ∇G] dV = ∬S [G un − u Gn] dS.

The notations un and Gn denote normal derivatives on S. In view of the conditions satisfied by u and G, this result
simplifies to

    u(x′, y′, z′) = ∭V G f dV − ∬S g Gn dS.

Thus the Green's function describes the influence at (x′, y′, z′) of the data f and g. For the case of the interior of a
sphere of radius a, the Green's function may be obtained by means of a reflection (Sommerfeld, 1949): the source
point P at distance ρ from the center of the sphere is reflected along its radial line to a point P′ that is at a distance

    ρ′ = a²/ρ.

Note that if P is inside the sphere, then P′ will be outside the sphere. The Green's function is then given by

    G = 1/(4πR) − a/(4πρR′),

where R denotes the distance to the source point P and R′ denotes the distance to the reflected point P′. A
consequence of this expression for the Green's function is the Poisson integral formula. Let ρ, θ, and φ be
spherical coordinates for the source point P. Here θ denotes the angle with the vertical axis, which is contrary to the
usual American mathematical notation, but agrees with standard European and physical practice. Then the solution
of the Laplace equation inside the sphere with boundary values g is given by

    u(P) = (a(a² − ρ²)/4π) ∫₀^{2π} ∫₀^{π} [g(θ′, φ′) sin θ′ / (a² + ρ² − 2aρ cos Θ)^{3/2}] dθ′ dφ′,

where

    cos Θ = cos θ cos θ′ + sin θ sin θ′ cos(φ − φ′).

A simple consequence of this formula is that if u is a harmonic function, then the value of u at the center of the
sphere is the mean value of its values on the sphere. This mean value property immediately implies that a
non-constant harmonic function cannot assume its maximum value at an interior point.
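The mean value property is easy to test by Monte Carlo: average a harmonic function (here x² − z², whose Laplacian is 2 + 0 − 2 = 0) over points drawn uniformly from a sphere and compare with its value at the center. A sketch (the center, radius, and sample count are arbitrary choices):

```python
import math
import random

def u(x, y, z):
    # a harmonic polynomial: the Laplacian of x^2 - z^2 is 2 + 0 - 2 = 0
    return x * x - z * z

random.seed(3)
cx, cy, cz, a = 0.3, -0.2, 0.5, 2.0
n = 200000
total = 0.0
for _ in range(n):
    # uniform point on the sphere of radius a via a normalized Gaussian vector
    gx, gy, gz = random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1)
    s = a / math.sqrt(gx * gx + gy * gy + gz * gz)
    total += u(cx + gx * s, cy + gy * s, cz + gz * s)

mean_on_sphere = total / n
center_value = u(cx, cy, cz)   # = 0.09 - 0.25 = -0.16
```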

Electrostatics
In free space the electrostatic potential satisfies Laplace's equation, since the charge density ρ is zero there.

Taking the gradient of the electric potential we get the electrostatic field

    E = −∇V.

Taking the divergence of the electrostatic field, we obtain Poisson's equation, that relates charge density and electric
potential:

    ∇²V = −ρ/ε₀.

In the particular case of the empty space (ρ = 0) Poisson's equation reduces to Laplace's equation for the electric
potential.

Using a uniqueness theorem: if a potential satisfies Laplace's equation (i.e., the second derivatives of V vanish, as in free space) and the potential has the correct values at the boundaries, then the potential is uniquely defined.

A potential that doesn't satisfy Laplace's equation together with the boundary condition is an invalid electrostatic
potential.

References
• Evans, L. C. (1998). Partial Differential Equations. Providence: American Mathematical Society.
ISBN 0-8218-0772-2.
• Petrovsky, I. G. (1967). Partial Differential Equations. Philadelphia: W. B. Saunders.
• Polyanin, A. D. (2002). Handbook of Linear Partial Differential Equations for Engineers and Scientists. Boca
Raton: Chapman & Hall/CRC Press. ISBN 1-58488-299-9.
• Sommerfeld, A. (1949). Partial Differential Equations in Physics. New York: Academic Press.

External links
• Laplace Equation (particular solutions and boundary value problems) [1] at EqWorld: The World of Mathematical
Equations.
• Laplace Differential Equation [2] on PlanetMath
• Example initial-boundary value problems [3] using Laplace's equation from exampleproblems.com.
• Weisstein, Eric W., "Laplace’s Equation [4]" from MathWorld.
• Module for Laplace’s Equation by John H. Mathews [5]
• Find out how boundary value problems governed by Laplace's equation may be solved numerically by boundary
element method [6]

References
[1] http://eqworld.ipmnet.ru/en/solutions/lpde/lpde301.pdf
[2] http://planetmath.org/encyclopedia/LaplaceDifferentialEquation.html
[3] http://www.exampleproblems.com/wiki/index.php/PDE:Laplaces_Equation
[4] http://mathworld.wolfram.com/LaplacesEquation.html
[5] http://math.fullerton.edu/mathews/c2003/DirichletProblemMod.html
[6] http://www.ntu.edu.sg/home/mwtang/bemsite.htm

Laplace's method
In mathematics, Laplace's method, named after Pierre-Simon Laplace, is a technique used to approximate integrals
of the form

    ∫ₐᵇ e^{Mƒ(x)} dx,

where ƒ(x) is some twice-differentiable function, M is a large number, and the integral endpoints a and b could
possibly be infinite. This technique was originally presented in Laplace (1774, pp. 366–367).

The idea of Laplace's method


Assume that the function ƒ(x) has a unique global maximum at x0. Then, the value ƒ(x0) will be larger than the other values ƒ(x). If we multiply this function by a large number M, the ratio Mƒ(x0)/Mƒ(x) = ƒ(x0)/ƒ(x) stays the same, but the gap Mƒ(x0) − Mƒ(x) grows linearly with M, so the ratio of e^{Mƒ(x0)} to e^{Mƒ(x)} grows exponentially (see figure).

Thus, significant contributions to the integral of this function will come only from points x in a neighborhood of x0, which can then be estimated.

[Figure: The function e^{Mƒ(x)}, in blue, is shown on top for M = 0.5, and at the bottom for M = 3. Here, ƒ(x) = sin x/x, with a global maximum at x0 = 0. As M grows larger, the approximation of this function by a Gaussian function (shown in red) gets better. This observation underlies Laplace's method.]

General theory of Laplace's method

To state and motivate the method, we need several assumptions. We will assume that x0 is not an endpoint of the interval of integration, that the values ƒ(x) cannot be very close to ƒ(x0) unless x is close to x0, and that the second derivative ƒ″(x0) < 0.

We can expand ƒ(x) around x0 by Taylor's theorem,

    ƒ(x) = ƒ(x0) + ƒ′(x0)(x − x0) + ½ ƒ″(x0)(x − x0)² + R,

where R = O((x − x0)³).

Since ƒ has a global maximum at x0, and since x0 is not an endpoint, it is a stationary point, so the derivative of ƒ vanishes at x0. Therefore, the function ƒ(x) may be approximated to quadratic order

    ƒ(x) ≈ ƒ(x0) − ½ |ƒ″(x0)| (x − x0)²

for x close to x0 (recall that the second derivative is negative at the global maximum ƒ(x0)). The assumptions made ensure the accuracy of the approximation

    ∫ₐᵇ e^{Mƒ(x)} dx ≈ e^{Mƒ(x0)} ∫ₐᵇ e^{−M|ƒ″(x0)|(x − x0)²/2} dx

(see the picture on the right). This latter integral is a Gaussian integral if the limits of integration go from −∞ to +∞ (which can be assumed because the exponential decays very fast away from x0), and thus it can be calculated. We find

    ∫ₐᵇ e^{Mƒ(x)} dx ≈ √(2π/(M|ƒ″(x0)|)) e^{Mƒ(x0)}   as M → ∞.
A generalization of this method and extension to arbitrary precision is provided by Fog (2008).
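The formula can be tested against an integral with a known closed form: for ƒ(x) = ln x − x on (0, ∞), the maximum is at x0 = 1 with ƒ(1) = −1 and ƒ″(1) = −1, and ∫₀^∞ e^{M(ln x − x)} dx = Γ(M+1)/M^{M+1}. A sketch (the helper name and the choice M = 50 are illustrative):

```python
import math

def laplace_approx(M, f_x0, fpp_x0):
    # Laplace's method: integral ≈ sqrt(2*pi / (M*|f''(x0)|)) * exp(M*f(x0))
    return math.sqrt(2.0 * math.pi / (M * abs(fpp_x0))) * math.exp(M * f_x0)

M = 50.0
# exact value of the integral: Gamma(M+1) / M^(M+1), computed in log space
exact = math.exp(math.lgamma(M + 1.0) - (M + 1.0) * math.log(M))
approx = laplace_approx(M, f_x0=-1.0, fpp_x0=-1.0)
ratio = approx / exact   # tends to 1 as M grows; this is Stirling's approximation
```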
Formal statement and proof:
Assume that ƒ is a twice continuously differentiable function on [a, b] with a unique point x0 ∈ (a, b) such that
ƒ(x0) = max_{x∈[a,b]} ƒ(x). Assume additionally that ƒ″(x0) < 0. Then,

    lim_{n→∞} [ ∫ₐᵇ e^{nƒ(x)} dx ] / [ e^{nƒ(x0)} √(2π/(n(−ƒ″(x0)))) ] = 1.

Proof:
Lower bound:
Let ε > 0. Then by the continuity of ƒ″ there exists δ > 0 such that if |x − x0| < δ then
ƒ″(x) ≥ ƒ″(x0) − ε. By Taylor's Theorem, for any x ∈ (x0 − δ, x0 + δ),

    ƒ(x) ≥ ƒ(x0) + ½ (ƒ″(x0) − ε)(x − x0)².

Then we have the following lower bound:

    ∫ₐᵇ e^{nƒ(x)} dx ≥ ∫_{x0−δ}^{x0+δ} e^{nƒ(x)} dx
                    ≥ e^{nƒ(x0)} ∫_{x0−δ}^{x0+δ} e^{n(ƒ″(x0)−ε)(x−x0)²/2} dx
                    = e^{nƒ(x0)} √(1/(n(−ƒ″(x0)+ε))) ∫_{−δ√(n(−ƒ″(x0)+ε))}^{δ√(n(−ƒ″(x0)+ε))} e^{−y²/2} dy,

where the last equality was obtained by a change of variables y = √(n(−ƒ″(x0)+ε)) (x − x0). Remember
that ƒ″(x0) < 0, so that is why we can take the square root of its negation.

If we divide both sides of the above inequality by e^{nƒ(x0)} √(2π/(n(−ƒ″(x0)))) and take the limit we get:

    lim_{n→∞} [∫ₐᵇ e^{nƒ(x)} dx] / [e^{nƒ(x0)} √(2π/(n(−ƒ″(x0))))]
      ≥ lim_{n→∞} (1/√(2π)) ∫_{−δ√(n(−ƒ″(x0)+ε))}^{δ√(n(−ƒ″(x0)+ε))} e^{−y²/2} dy ⋅ √((−ƒ″(x0))/(−ƒ″(x0)+ε))
      = √((−ƒ″(x0))/(−ƒ″(x0)+ε));

since this is true for arbitrary ε we get the lower bound:

    lim_{n→∞} [∫ₐᵇ e^{nƒ(x)} dx] / [e^{nƒ(x0)} √(2π/(n(−ƒ″(x0))))] ≥ 1.

Upper bound:
The proof of the upper bound is similar to the proof of the lower bound but there are a few annoyances. Again we
start by picking an ε > 0, but in order for the proof to work we need ε small enough so that ƒ″(x0) + ε < 0.
Then, as above, by continuity of ƒ″ and Taylor's Theorem we can find δ > 0 so that if |x − x0| < δ, then

    ƒ(x) ≤ ƒ(x0) + ½ (ƒ″(x0) + ε)(x − x0)².

Lastly, by our assumptions (since x0 is the unique global maximum and [a, b] is compact) there exists an η > 0 such that if
|x − x0| ≥ δ, then ƒ(x) ≤ ƒ(x0) − η.
Then we can calculate the following upper bound:

    ∫ₐᵇ e^{nƒ(x)} dx ≤ (b − a) e^{n(ƒ(x0)−η)} + ∫_{x0−δ}^{x0+δ} e^{nƒ(x)} dx
                     ≤ (b − a) e^{n(ƒ(x0)−η)} + e^{nƒ(x0)} ∫_{x0−δ}^{x0+δ} e^{n(ƒ″(x0)+ε)(x−x0)²/2} dx
                     ≤ (b − a) e^{n(ƒ(x0)−η)} + e^{nƒ(x0)} √(2π/(n(−ƒ″(x0)−ε))).

If we divide both sides of the above inequality by e^{nƒ(x0)} √(2π/(n(−ƒ″(x0)))) and take the limit (the first term vanishes because of the factor e^{−nη}), we get:

    lim_{n→∞} [∫ₐᵇ e^{nƒ(x)} dx] / [e^{nƒ(x0)} √(2π/(n(−ƒ″(x0))))] ≤ √((−ƒ″(x0))/(−ƒ″(x0)−ε)).

Since ε is arbitrary we get the upper bound:

    lim_{n→∞} [∫ₐᵇ e^{nƒ(x)} dx] / [e^{nƒ(x0)} √(2π/(n(−ƒ″(x0))))] ≤ 1.

And combining this with the lower bound gives the result.

Laplace's method extension: Steepest descent


In extensions of Laplace's method, complex analysis, and in particular Cauchy's integral formula, is used to find a
contour of steepest descent for an (asymptotically with large M) equivalent integral, expressed as a line integral. In
particular, if no point x0 where the derivative of ƒ vanishes exists on the real line, it may be necessary to deform the
integration contour to an optimal one, where the above analysis will be possible. Again the main idea is to reduce, at
least asymptotically, the calculation of the given integral to that of a simpler integral that can be explicitly evaluated.
See the book of Erdelyi (1956) for a simple discussion (where the method is termed steepest descents).
The appropriate formulation for the complex z-plane is

\int_a^b e^{M f(z)}\,dz \approx \sqrt{\frac{2\pi}{-M f''(z_0)}}\, e^{M f(z_0)}

for a path passing through the saddle point at z0. Note the explicit appearance of a minus sign to indicate the
direction of the second derivative: one must not take the modulus. Also note that if the integrand is meromorphic,
one may have to add residues corresponding to poles traversed while deforming the contour (see for example section
3 of Okounkov's paper Symmetric functions and random partitions).

Further generalizations
An extension of the steepest descent method is the so-called nonlinear stationary phase/steepest descent method.
Here, instead of integrals, one needs to evaluate asymptotically solutions of Riemann–Hilbert factorization
problems.
Given a contour C in the complex sphere, a function ƒ defined on that contour and a special point, say infinity, one
seeks a function M holomorphic away from the contour C, with prescribed jump across C, and with a given
normalization at infinity. If ƒ and hence M are matrices rather than scalars this is a problem that in general does not
admit an explicit solution.
An asymptotic evaluation is then possible along the lines of the linear stationary phase/steepest descent method. The
idea is to reduce asymptotically the solution of the given Riemann–Hilbert problem to that of a simpler, explicitly
solvable, Riemann–Hilbert problem. Cauchy's theorem is used to justify deformations of the jump contour.
The nonlinear stationary phase was introduced by Deift and Zhou in 1993, based on earlier work of Its. A (properly
speaking) nonlinear steepest descent method was introduced by Kamvissis, K. McLaughlin and P. Miller in 2003,
based on previous work of Lax, Levermore, Deift, Venakides and Zhou.
The nonlinear stationary phase/steepest descent method has applications to the theory of soliton equations and
integrable models, random matrices and combinatorics.

Complex integrals
For complex integrals in the form:

\frac{1}{2\pi i} \int_{c - i\infty}^{c + i\infty} g(s)\, e^{st}\,ds
with t >> 1, we make the substitution t = iu and the change of variable s = c + ix to get the Laplace bilateral
transform:

We then split g(c + ix) into its real and imaginary parts, after which we recover u = t / i. This is useful for inverse Laplace
transforms, the Perron formula and complex integration.

Example 1: Stirling's approximation


Laplace's method can be used to derive Stirling's approximation

N! \approx \sqrt{2\pi N}\, N^N e^{-N}

for a large integer N.


From the definition of the Gamma function, we have

N! = \Gamma(N + 1) = \int_0^{\infty} e^{-x} x^N\,dx.
Now we change variables, letting x = Nz so that dx = N\,dz.
Plug these values back in to obtain

N! = \int_0^{\infty} e^{-Nz} (Nz)^N N\,dz = N^{N+1} \int_0^{\infty} e^{-Nz} z^N\,dz = N^{N+1} \int_0^{\infty} e^{N(\ln z - z)}\,dz.
This integral has the form necessary for Laplace's method with

f(z) = \ln z - z,

which is twice-differentiable:

f'(z) = \frac{1}{z} - 1, \qquad f''(z) = -\frac{1}{z^2}.
The maximum of ƒ(z) lies at z0 = 1, and the second derivative of ƒ(z) has the value −1 at this point. Therefore, we obtain

N! \approx N^{N+1} \sqrt{\frac{2\pi}{N}}\, e^{-N} = \sqrt{2\pi N}\, N^N e^{-N}.
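As a quick numerical sanity check (a throwaway sketch of ours, not part of the derivation), the relative error of Stirling's approximation behaves like 1/(12N):

```python
import math

# Compare N! with the Laplace-derived approximation sqrt(2*pi*N) * N^N * e^(-N)
for N in (5, 10, 20):
    exact = math.factorial(N)
    approx = math.sqrt(2 * math.pi * N) * N ** N * math.exp(-N)
    rel_err = (exact - approx) / exact   # positive: Stirling slightly undershoots
```

At N = 20 the relative error is about 0.42%, close to the 1/(12 · 20) ≈ 0.417% predicted by the next term of the asymptotic series.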

Example 2: parameter estimation and probabilistic inference


Azevedo-Filho and Shachter (1994) review Laplace's method results (univariate and multivariate) and present a
detailed example showing the method used in parameter estimation and probabilistic inference under a Bayesian
perspective. Laplace's method is applied to a meta-analysis problem from the medical domain, involving
experimental data, and compared to other techniques. (Azevedo-Filho & Shachter 1994)

References
• Azevedo-Filho, A.; Shachter, R. (1994), "Laplace's Method Approximations for Probabilistic Inference in Belief
Networks with Continuous Variables", in Mantaras, R.; Poole, D., Uncertainty in Artificial Intelligence, San
Francisco, CA: Morgan Kauffman, CiteSeerX: 10.1.1.91.2064 [1].
• Deift, P.; Zhou, X. (1993), "A steepest descent method for oscillatory Riemann–Hilbert problems. Asymptotics
for the MKdV equation", Ann. of Math. 137 (2): 295–368, doi:10.2307/2946540.
• Erdelyi, A. (1956), Asymptotic Expansions, Dover.
• Fog, A. (2008), "Calculation Methods for Wallenius' Noncentral Hypergeometric Distribution", Communications
in Statistics, Simulation and Computation 37 (2): 258–273, doi:10.1080/03610910701790269.
• Kamvissis, S.; McLaughlin, K. T.-R.; Miller, P. (2003), "Semiclassical Soliton Ensembles for the Focusing
Nonlinear Schrödinger Equation", Annals of Mathematics Studies (Princeton University Press) 154.
• Laplace, P. S. (1774). Memoir on the probability of causes of events. Mémoires de Mathématique et de Physique,
Tome Sixième. (English translation by S. M. Stigler 1986. Statist. Sci., 1(19):364–378).
This article incorporates material from saddle point approximation on PlanetMath, which is licensed under the
Creative Commons Attribution/Share-Alike License.

References
[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.2064

Likelihood-ratio test
In statistics, a likelihood ratio test is a statistical test used to compare the fit of two models, one of which (the null
model) is a special case of the other (the alternative model). The test is based on the likelihood ratio, which expresses
how many times more likely the data are under one model than the other. This likelihood ratio, or equivalently its
logarithm, can then be used to compute a p-value, or compared to a critical value to decide whether to reject the null
model in favour of the alternative model. When the logarithm of the likelihood ratio is used, the statistic is known as
a log-likelihood ratio statistic, and the probability distribution of this test statistic, assuming that the null model is
true, can be approximated using Wilks' theorem.
In the case of distinguishing between two models, each of which has no unknown parameters, use of the likelihood
ratio test can be justified by the Neyman–Pearson lemma, which demonstrates that such a test has the highest power
among all competitors.[1]

Use
Each of the two competing models, the null model and the alternative model, is separately fitted to the data and the
log-likelihood recorded. The test statistic (often denoted by D) is twice the difference in these log-likelihoods:

D = -2\ln\frac{L(\text{null model})}{L(\text{alternative model})} = 2\left[\ln L(\text{alternative model}) - \ln L(\text{null model})\right]
The model with more parameters will always fit at least as well (have a greater log-likelihood). Whether it fits
significantly better and should thus be preferred is determined by deriving the probability or p-value of the
difference D. Where the null hypothesis represents a special case of the alternative hypothesis, the probability
distribution of the test statistic is approximately a chi-squared distribution with degrees of freedom equal to
df2 − df1 .[2] Symbols df1 and df2 represent the number of free parameters of models 1 and 2, the null model and the
alternative model, respectively. The test requires nested models, that is: models in which the more complex one can
be transformed into the simpler model by imposing a set of constraints on the parameters.[3]
For example: if the null model has 1 free parameter and a log-likelihood of −8024 and the alternative model has 3
degrees of freedom and a LL of −8012, then the probability of this difference is that of chi-squared value of
+2·(8024 − 8012) = 24 with 3 − 1 = 2 degrees of freedom. Certain assumptions must be met for the statistic to follow
a chi-squared distribution and often empirical p-values are computed.
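The arithmetic of the example above can be scripted directly. For df = 2 the chi-squared survival function has the closed form exp(−x/2), so no statistics library is needed (with other degrees of freedom one would call a chi-squared CDF routine instead); a sketch:

```python
import math

ll_null, ll_alt = -8024.0, -8012.0     # log-likelihoods from the example
df = 3 - 1                             # difference in free parameters: 2
D = 2 * (ll_alt - ll_null)             # test statistic: 24

# P(chi-squared with 2 df > D): for df = 2 the survival function is exp(-D/2)
p_value = math.exp(-D / 2)             # about 6e-6, so the null model is rejected
```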

Background
The likelihood ratio, often denoted by Λ (the capital Greek letter lambda), is the ratio of the likelihood function
varying the parameters over two different sets in the numerator and denominator. A likelihood-ratio test is a
statistical test for making a decision between two hypotheses based on the value of this ratio.
It is central to the Neyman–Pearson approach to statistical hypothesis testing, and, like statistical hypothesis testing
generally, is both widely used and much criticized.

Simple-versus-simple hypotheses
A statistical model is often a parametrized family of probability density functions or probability mass functions
f_θ(x). A simple-vs-simple hypotheses test has completely specified models under both the null and alternative
hypotheses, which for convenience are written in terms of fixed values of a notional parameter θ:

H_0 : \theta = \theta_0, \qquad H_1 : \theta = \theta_1.

Note that under either hypothesis, the distribution of the data is fully specified; there are no unknown parameters to
estimate. The likelihood ratio test statistic can be written as:[4][5]

\Lambda(x) = \frac{L(\theta_0 \mid x)}{L(\theta_1 \mid x)}

or

\Lambda(x) = \frac{f(x \mid \theta_0)}{f(x \mid \theta_1)},

where L(θ | x) is the likelihood function. Note that some references may use the reciprocal as the definition.[6] In
the form stated here, the likelihood ratio is small if the alternative model is better than the null model and the
likelihood ratio test provides the decision rule as:
If Λ > c, do not reject H_0;
If Λ < c, reject H_0;
Reject H_0 with probability q if Λ = c.
The values c and q are usually chosen to obtain a specified significance level α, through the relation
q · P(Λ = c | H_0) + P(Λ < c | H_0) = α. The Neyman–Pearson lemma states that this likelihood ratio test
is the most powerful among all level-α tests for this problem.
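For instance (a made-up illustration, not from the article): testing H0: μ = 0 against H1: μ = 1 for a unit-variance normal sample, the ratio Λ comes out well below 1 when the data favour the alternative:

```python
import math

def normal_loglik(xs, mu, sigma=1.0):
    # log-likelihood of an i.i.d. normal sample with mean mu
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

xs = [0.9, 1.4, 0.3, 1.1, 0.8]          # invented sample, mean 0.9
# Lambda = L(theta0 | x) / L(theta1 | x); small values favour the alternative
lam = math.exp(normal_loglik(xs, 0.0) - normal_loglik(xs, 1.0))
```

For this sample Λ = e^(−2) ≈ 0.135, so at any critical value c above 0.135 the test rejects H0.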

Definition (likelihood ratio test for composite hypotheses)

A null hypothesis is often stated by saying the parameter θ is in a specified subset Θ_0 of the parameter space Θ.

The likelihood function is L(θ | x) = f(x | θ) (with f being the pdf or pmf), a function of the parameter θ
with x held fixed at the value that was actually observed, i.e., the data. The likelihood ratio test statistic is [7]

\Lambda(x) = \frac{\sup\{\,L(\theta \mid x) : \theta \in \Theta_0\,\}}{\sup\{\,L(\theta \mid x) : \theta \in \Theta\,\}}.

Here, sup denotes the supremum.

A likelihood ratio test is any test with critical region (or rejection region) of the form {x | Λ(x) ≤ c}, where c is any
number satisfying 0 ≤ c ≤ 1. Many common test statistics such as the Z-test, the F-test, Pearson's chi-squared test
and the G-test are tests for nested models and can be phrased as log-likelihood ratios or approximations thereof.

Interpretation
Being a function of the data x, the likelihood ratio Λ is therefore a statistic. The likelihood-ratio test rejects the null hypothesis if
the value of this statistic is too small. How small is too small depends on the significance level of the test, i.e., on
what probability of Type I error is considered tolerable ("Type I" errors consist of the rejection of a null hypothesis
that is true).
The numerator corresponds to the maximum likelihood of the observed outcome under the null hypothesis. The
denominator corresponds to the maximum likelihood of the observed outcome with parameters varying over the whole
parameter space. The numerator of this ratio is less than the denominator, so the likelihood ratio is between 0
and 1. Lower values of the likelihood ratio mean that the observed result was much less likely to occur under the null
hypothesis than under the alternative. Higher values mean that the observed outcome was nearly as likely, or more
likely, to occur under the null hypothesis as under the alternative, and the null hypothesis cannot be rejected.

Distribution: Wilks' theorem


If the distribution of the likelihood ratio corresponding to a particular null and alternative hypothesis can be
explicitly determined then it can directly be used to form decision regions (to accept/reject the null hypothesis). In
most cases, however, the exact distribution of the likelihood ratio corresponding to specific hypotheses is very
difficult to determine. A convenient result, attributed to Samuel S. Wilks, says that as the sample size n approaches
∞, the test statistic −2 ln Λ for a nested model will be asymptotically χ²-distributed with degrees of freedom
equal to the difference in dimensionality of Θ and Θ_0.[8] This means that for a great variety of hypotheses, a
practitioner can compute the likelihood ratio Λ for the data and compare −2 ln Λ to the chi-squared value
corresponding to a desired statistical significance as an approximate statistical test.

Examples

Coin tossing
As an example, in the case of Pearson's test, we might try to compare two coins to determine whether they have the
same probability of coming up heads. Our observation can be put into a contingency table with rows corresponding
to the coin and columns corresponding to heads or tails. The elements of the contingency table will be the number of
times each coin came up heads or tails. The contents of this table are our observation x.

          Heads    Tails
Coin 1    k_{1H}   k_{1T}
Coin 2    k_{2H}   k_{2T}

Here Θ consists of the parameters p_{1H}, p_{1T}, p_{2H}, and p_{2T}, which are the probabilities that coins 1 and 2 come
up heads or tails. The hypothesis space is defined by the usual constraints on a distribution, 0 ≤ p_{ij} ≤ 1,
p_{1H} + p_{1T} = 1, and p_{2H} + p_{2T} = 1. The null hypothesis is the sub-space where p_{1j} = p_{2j}. In all of these constraints,
i = 1, 2 and j = H, T.
Writing \hat p_{ij} for the best values for p_{ij} under the hypothesis H, maximum likelihood is achieved with

\hat p_{ij} = \frac{k_{ij}}{k_{iH} + k_{iT}}.

Writing \hat p_{j} for the best values for p_{1j} = p_{2j} under the null hypothesis H_0, maximum likelihood is achieved with

\hat p_{j} = \frac{k_{1j} + k_{2j}}{k_{1H} + k_{1T} + k_{2H} + k_{2T}},

which does not depend on the coin i.



The hypothesis and null hypothesis can be rewritten slightly so that they satisfy the constraints for the logarithm of
the likelihood ratio to have the desired nice distribution. Since the constraint causes the two-dimensional Θ to be
reduced to the one-dimensional Θ_0, the asymptotic distribution for the test will be χ²(1), the χ² distribution
with one degree of freedom.
For the general contingency table, we can write the log-likelihood ratio statistic as

-2\ln\Lambda = 2\sum_{i,j} k_{ij} \ln\frac{k_{ij}}{E_{ij}},

where E_{ij} denotes the expected count in cell (i, j) under the null hypothesis.
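For the two-coin table this statistic is straightforward to compute. The sketch below uses made-up counts (43/57 and 60/40) and forms the expected counts E_ij from the pooled estimate under the null hypothesis; the function name is ours:

```python
import math

def g_statistic(table):
    # table: list of rows of observed counts k_ij
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    g = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total  # E_ij under the null
            if obs > 0:
                g += 2 * obs * math.log(obs / expected)
    return g

# heads/tails counts for the two coins (invented for illustration)
g = g_statistic([[43, 57], [60, 40]])
```

Here g ≈ 5.8, which exceeds the 5% critical value 3.84 of the χ² distribution with one degree of freedom, so the hypothesis of equal head probabilities would be rejected at that level.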
References
[1] Jerzy Neyman, Egon Pearson (1933). "On the Problem of the Most Efficient Tests of Statistical Hypotheses". Philosophical Transactions of
the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 231 (694–706): 289–337.
doi:10.1098/rsta.1933.0009. JSTOR 91247.
[2] Huelsenbeck, J. P.; Crandall, K. A. (1997). "Phylogeny Estimation and Hypothesis Testing Using Maximum Likelihood". Annual Review of
Ecology and Systematics 28: 437–466. doi:10.1146/annurev.ecolsys.28.1.437.
[3] An example using phylogenetic analyses is described at Huelsenbeck, J. P.; Hillis, D. M.; Nielsen, R. (1996). "A Likelihood-Ratio Test of
Monophyly". Systematic Biology 45 (4): 546. doi:10.1093/sysbio/45.4.546.
[4] Mood, A.M.; Graybill, F.A. (1963) Introduction to the Theory of Statistics, 2nd edition. McGraw-Hill ISBN 978-0070428638 (page 286)
[5] Kendall, M.G., Stuart, A. (1973) The Advanced Theory of Statistics, Volume 2, Griffin. ISBN 0852642156 (page 234)
[6] Cox, D. R. and Hinkley, D. V Theoretical Statistics, Chapman and Hall, 1974. (page 92)
[7] Casella, George; Berger, Roger L. (2001) Statistical Inference, Second edition. ISBN 978-0534243128 (page 375)
[8] Wilks, S. S. (1938). "The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses". The Annals of
Mathematical Statistics 9: 60–62. doi:10.1214/aoms/1177732360.

External links
• Practical application of Likelihood-ratio test described (http://www.itl.nist.gov/div898/handbook/apr/
section2/apr233.htm)
• Vassar College's Likelihood Ratio Given Sensitivity/Specificity/Prevalence (http://faculty.vassar.edu/lowry/
clin2.html) Online Calculator

List of integrals of exponential functions


The following is a list of integrals of exponential functions. For a complete list of Integral functions, please see the
list of integrals.

Indefinite integrals
Indefinite integrals are antiderivative functions. A constant (the constant of integration) may be added to the right
hand side of any of these formulas, but has been suppressed here in the interest of brevity.

for

(erf is the error function)



where

where

and is the gamma function

when , , and

when , , and

Definite integrals

for

\int_0^1 a^x b^{1-x}\,dx = \frac{a - b}{\ln a - \ln b} \quad (a > 0,\ b > 0,\ a \neq b), which is the logarithmic mean

\int_{-\infty}^{+\infty} e^{-x^2}\,dx = \sqrt{\pi} (the Gaussian integral)

\int_{-\infty}^{+\infty} e^{-a(x+b)^2}\,dx = \sqrt{\frac{\pi}{a}} \quad (a > 0) (see Integral of a Gaussian function)

(!! is the double factorial)



\int_0^{2\pi} e^{x \cos\theta}\,d\theta = 2\pi I_0(x) (I_0 is the modified Bessel function of the first kind)
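Entries like these can be spot-checked numerically. For example, a trapezoidal sum reproduces the Gaussian integral, the identity that the integral of exp(−x²) over the whole real line equals √π (a throwaway sketch):

```python
import math

# Spot-check the Gaussian integral: integral over R of exp(-x^2) equals sqrt(pi).
# Truncating to [-8, 8] is harmless since exp(-64) is negligible.
n, a, b = 100000, -8.0, 8.0
h = (b - a) / n
s = 0.5 * (math.exp(-a * a) + math.exp(-b * b))
s += sum(math.exp(-(a + i * h) ** 2) for i in range(1, n))
integral = s * h
```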

References
• Wolfram Mathematica Online Integrator (http://integrals.wolfram.com/index.jsp)
• V. H. Moll, The Integrals in Gradshteyn and Ryzhik (http://www.math.tulane.edu/~vhm/Table.html)

List of integrals of Gaussian functions


In these expressions

\varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}

is the standard normal probability density function,

\Phi(x) = \int_{-\infty}^{x} \varphi(t)\,dt = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)

is the corresponding cumulative distribution function (where erf is the error function), and T(h, a) is Owen's T function.
Owen [1] has an extensive list of Gaussian-type integrals; only a subset is given below.
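Both φ and Φ can be evaluated with nothing beyond the standard library, since Φ reduces to the error function (a sketch; the function names are our own):

```python
import math

def phi(x):
    # standard normal probability density function
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    # standard normal cumulative distribution function via erf
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))
```

For example, Phi(1.96) evaluates to roughly 0.975, the familiar two-sided 5% quantile.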

Indefinite integrals

[2]

(in these integrals, n!! is the double factorial: for even n’s it is equal to the product of all even
numbers from 2 to n, and for odd n’s it is the product of all odd numbers from 1 to n, additionally
it is assumed that 0!! = (−1)!! = 1)

[3]

Definite integrals

[4]

References
[1] Owen (1980)
[2] Patel & Read (1996) list this integral without the minus sign, which is an error. See calculation by WolframAlpha (http://www.wolframalpha.com/input/?fp=1&i=D(-e^(-x^2/2)/sqrt(2pi)*Sum((2k)!!/(2j)!!*x^(2j),{j,0,k}),x)&s=40&incTime=true)
[3] Patel & Read (1996) report this integral with error, see WolframAlpha (http://www.wolframalpha.com/input/?i=Integrate(1/sqrt(2pi)*e^(-x^2/2)*1/sqrt(2pi)*e^(-(a+b*x)^2/2),x))
[4] Patel & Read (1996) report this integral incorrectly by omitting x from the integrand

• Patel, Jagdish K.; Read, Campbell B. (1996). Handbook of the normal distribution (2nd ed.). CRC Press.
ISBN 0-8247-9342-0.
• Owen, D. (1980). "A table of normal integrals". Communications in Statistics: Computation and Simulation B9:
pp. 389 - 419.

List of integrals of hyperbolic functions


The following is a list of integrals (antiderivative functions) of hyperbolic functions. For a complete list of Integral
functions, see list of integrals.
In all formulas the constant a is assumed to be nonzero, and C denotes the constant of integration.

also:

also:

also:

also:

also:

also:

also:

also:

also:

also:

List of integrals of logarithmic functions


The following is a list of integrals (antiderivative functions) of logarithmic functions. For a complete list of integral
functions, see list of integrals.
Note: x>0 is assumed throughout this article, and the constant of integration is omitted for simplicity.

For consecutive integrations, the formula

\int \ln x\,dx = x \ln x - x

generalizes to

\int (\ln x)^n\,dx = x \sum_{k=0}^{n} (-1)^{n-k}\, \frac{n!}{k!}\, (\ln x)^k.
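One closed form for consecutive integration of ln x is ∫(ln x)^n dx = x · Σ_{k=0..n} (−1)^(n−k) (n!/k!) (ln x)^k (constant omitted). The sketch below checks it by numerically differentiating the antiderivative and comparing against (ln x)^n; the helper name is ours:

```python
import math

def int_log_pow(x, n):
    # antiderivative of (ln x)^n:  x * sum_{k=0}^{n} (-1)^(n-k) * n!/k! * (ln x)^k
    return x * sum((-1) ** (n - k) * math.factorial(n) / math.factorial(k)
                   * math.log(x) ** k for k in range(n + 1))

# central-difference derivative of the antiderivative at x = 2 for n = 3;
# this should recover the integrand (ln 2)^3
n, x, h = 3, 2.0, 1e-6
deriv = (int_log_pow(x + h, n) - int_log_pow(x - h, n)) / (2 * h)
```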
References
• Milton Abramowitz and Irene A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and
Mathematical Tables, 1964. A few integrals are listed on page 69 [1].

References
[1] http://www.math.sfu.ca/~cbm/aands/page_69.htm

Lists of integrals
This article is mainly about indefinite integrals in calculus. For a list of definite integrals see List of definite
integrals.
Integration is the basic operation in integral calculus. While differentiation has easy rules by which the derivative of
a complicated function can be found by differentiating its simpler component functions, integration does not, so
tables of known integrals are often useful. This page lists some of the most common antiderivatives.

Historical development of integrals


A compilation of a list of integrals (Integraltafeln) and techniques of integral calculus was published by the German
mathematician Meyer Hirsch in 1810. These tables were republished in the United Kingdom in 1823. More
extensive tables were compiled in 1858 by the Dutch mathematician David Bierens de Haan. A new edition was
published in 1862. These tables, which contain mainly integrals of elementary functions, remained in use until the
middle of the 20th century. They were then replaced by the much more extensive tables of Gradshteyn and Ryzhik.
In Gradshteyn and Ryzhik, integrals originating from the book by Bierens de Haan are denoted by BI.
Not all closed-form expressions have closed-form antiderivatives; this study forms the subject of differential Galois
theory, which was initially developed by Joseph Liouville in the 1830s and 1840s, leading to Liouville's theorem
which classifies which expressions have closed form antiderivatives. A simple example of a function without a
closed form antiderivative is e^{-x^2}, whose antiderivative is (up to constants) the error function.
Since 1968 there has been the Risch algorithm for determining indefinite integrals that can be expressed in terms of
elementary functions, typically using a computer algebra system. Integrals that cannot be expressed using elementary
functions can be manipulated symbolically using general functions such as the Meijer G-function.

Lists of integrals
More detail may be found on the following pages for the lists of integrals:
• List of integrals of rational functions
• List of integrals of irrational functions
• List of integrals of trigonometric functions
• List of integrals of inverse trigonometric functions
• List of integrals of hyperbolic functions
• List of integrals of inverse hyperbolic functions
• List of integrals of exponential functions
• List of integrals of logarithmic functions
• List of integrals of Gaussian functions
Gradshteyn, Ryzhik, Jeffrey, Zwillinger's Table of Integrals, Series, and Products contains a large collection of
results. An even larger, multivolume table is the Integrals and Series by Prudnikov, Brychkov, and Marichev (with
volumes 1–3 listing integrals and series of elementary and special functions, volume 4–5 are tables of Laplace
transforms). More compact collections can be found in e.g. Brychkov, Marichev, Prudnikov's Tables of Indefinite
Integrals, or as chapters in Zwillinger's CRC Standard Mathematical Tables and Formulae, Bronstein and
Semendyayev's Handbook of Mathematics (Springer) and Oxford Users' Guide to Mathematics (Oxford Univ.
Press), and other mathematical handbooks.
Other useful resources include Abramowitz and Stegun and the Bateman Manuscript Project. Both works contain
many identities concerning specific integrals, which are organized with the most relevant topic instead of being
collected into a separate table. Two volumes of the Bateman Manuscript are specific to integral transforms.

There are several web sites which have tables of integrals and integrals on demand. Wolfram Alpha can show
results, and for some simpler expressions, also the intermediate steps of the integration. Wolfram Research also
operates another online service, the Wolfram Mathematica Online Integrator [1].

Integrals of simple functions


C is used for an arbitrary constant of integration that can only be determined if something about the value of the
integral at some point is known. Thus each function has an infinite number of antiderivatives.
These formulas only state in another form the assertions in the table of derivatives.

Integrals with a singularity


When there is a singularity in the function being integrated such that the integral becomes undefined, i.e., it is not
Lebesgue integrable, then C does not need to be the same on both sides of the singularity. The forms below normally
assume the Cauchy principal value around a singularity in the value of C but this is not in general necessary. For
instance in

there is a singularity at 0 and the integral becomes infinite there. If the integral above was used to give a definite
integral between -1 and 1 the answer would be 0. This however is only the value assuming the Cauchy principal
value for the integral around the singularity. If the integration was done in the complex plane the result would
depend on the path around the origin, in this case the singularity contributes −iπ when using a path above the origin
and iπ for a path below the origin. A function on the real line could use a completely different value of C on either
side of the origin as in:

Rational functions
more integrals: List of integrals of rational functions
These rational functions have a non-integrable singularity at 0 for a ≤ −1.

\int x^a\,dx = \frac{x^{a+1}}{a+1} + C \qquad (a \neq -1) (Cavalieri's quadrature formula)

More generally,[2]

\int \frac{1}{x}\,dx = \ln|x| + C.

Exponential functions
more integrals: List of integrals of exponential functions

Logarithms
more integrals: List of integrals of logarithmic functions

Trigonometric functions
more integrals: List of integrals of trigonometric functions

\int \sec x\,dx = \ln\left|\sec x + \tan x\right| + C (See Integral of the secant function. This result was a well-known conjecture in the 17th century.)

\int \sec^3 x\,dx = \frac{1}{2}\left(\sec x \tan x + \ln\left|\sec x + \tan x\right|\right) + C (see integral of secant cubed)



Inverse trigonometric functions


more integrals: List of integrals of inverse trigonometric functions

Hyperbolic functions
more integrals: List of integrals of hyperbolic functions

Inverse hyperbolic functions


more integrals: List of integrals of inverse hyperbolic functions

Products of functions proportional to their second derivatives

Absolute value functions

Special functions
Ci, Si: Trigonometric integrals, Ei: Exponential integral, li: Logarithmic integral function, erf: Error function

Definite integrals lacking closed-form antiderivatives


There are some functions whose antiderivatives cannot be expressed in closed form. However, the values of the
definite integrals of some of these functions over some common intervals can be calculated. A few useful integrals
are given below.

(see also Gamma function)

\int_{-\infty}^{+\infty} e^{-x^2}\,dx = \sqrt{\pi} (the Gaussian integral)

\int_{-\infty}^{+\infty} e^{-a x^2}\,dx = \sqrt{\frac{\pi}{a}} for a > 0

for

a > 0, n is 1, 2, 3, ... and !! is the double factorial.

when a > 0

for a > 0, n = 0, 1, 2, ....

(see also Bernoulli number)

\int_0^{\infty} \frac{\sin x}{x}\,dx = \frac{\pi}{2} (see sinc function and Sine integral)

(if n is an even integer and )

(if is an odd integer and )

(for integers with and

, see also Binomial coefficient)

(for real and non-negative integer, see also Symmetry)

(for

integers with and , see also Binomial coefficient)

(for

integers with and , see also Binomial coefficient)



(where is the exponential function , and

(where is the Gamma function)

\int_0^1 x^{\alpha-1} (1-x)^{\beta-1}\,dx = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)} (the Beta function)

(where is the modified Bessel function of the first kind)

, this is related to the probability density

function of the Student's t-distribution)


The method of exhaustion provides a formula for the general case when no antiderivative exists:

The "sophomore's dream":

\int_0^1 x^{-x}\,dx = \sum_{n=1}^{\infty} n^{-n},

attributed to Johann Bernoulli.
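The sophomore's dream identity, that the integral of x^(−x) over [0, 1] equals the sum of n^(−n), lends itself to a quick numerical check (a sketch; the midpoint rule conveniently avoids evaluating the integrand exactly at the endpoints):

```python
import math

# right-hand side: the rapidly converging series sum of n^(-n)
series = sum(n ** (-n) for n in range(1, 20))

# left-hand side: midpoint rule for the integral of x^(-x) on (0, 1)
m = 200000
h = 1.0 / m
integral = h * sum((h * (i + 0.5)) ** (-(h * (i + 0.5))) for i in range(m))
```

Both sides agree to several decimal places, near the value 1.29129.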

References
[1] http://integrals.wolfram.com/index.jsp
[2] "Reader Survey: log|x| + C (http://golem.ph.utexas.edu/category/2012/03/reader_survey_logx_c.html)", Tom Leinster, The n-category
Café, March 19, 2012

• M. Abramowitz and I.A. Stegun, editors. Handbook of Mathematical Functions with Formulas, Graphs, and
Mathematical Tables.
• I.S. Gradshteyn (И.С. Градштейн), I.M. Ryzhik (И.М. Рыжик); Alan Jeffrey, Daniel Zwillinger, editors. Table of
Integrals, Series, and Products, seventh edition. Academic Press, 2007. ISBN 978-0-12-373637-6. Errata. (http://
www.mathtable.com/gr) (Several previous editions as well.)
• A.P. Prudnikov (А.П. Прудников), Yu.A. Brychkov (Ю.А. Брычков), O.I. Marichev (О.И. Маричев). Integrals
and Series. First edition (Russian), volume 1–5, Nauka, 1981−1986. First edition (English, translated from the
Russian by N.M. Queen), volume 1–5, Gordon & Breach Science Publishers/CRC Press, 1988–1992, ISBN
2-88124-097-6. Second revised edition (Russian), volume 1–3, Fiziko-Matematicheskaya Literatura, 2003.
• Yu.A. Brychkov (Ю.А. Брычков), Handbook of Special Functions: Derivatives, Integrals, Series and Other
Formulas. Russian edition, Fiziko-Matematicheskaya Literatura, 2006. English edition, Chapman & Hall/CRC
Press, 2008, ISBN 1-58488-956-X.

• Daniel Zwillinger. CRC Standard Mathematical Tables and Formulae, 31st edition. Chapman & Hall/CRC Press,
2002. ISBN 1-58488-291-3. (Many earlier editions as well.)

Historical
• Meyer Hirsch, Integraltafeln, oder, Sammlung von Integralformeln (http://books.google.com/
books?id=Cdg2AAAAMAAJ) (Duncker und Humblot, Berlin, 1810)
• Meyer Hirsch, Integral Tables, Or, A Collection of Integral Formulae (http://books.google.com/
books?id=NsI2AAAAMAAJ) (Baynes and son, London, 1823) [English translation of Integraltafeln]
• David Bierens de Haan, Nouvelles Tables d'Intégrales définies (http://www.archive.org/details/
nouvetaintegral00haanrich) (Engels, Leiden, 1862)
• Benjamin O. Peirce, A short table of integrals - revised edition (http://books.google.com/books?id=pYMRAAAAYAAJ) (Ginn & co., Boston, 1899)

External links

Tables of integrals
• Paul's Online Math Notes (http://tutorial.math.lamar.edu/pdf/Common_Derivatives_Integrals.pdf)
• A. Dieckmann, Table of Integrals (Elliptic Functions, Square Roots, Inverse Tangents and More Exotic
Functions): Indefinite Integrals (http://pi.physik.uni-bonn.de/~dieckman/IntegralsIndefinite/IndefInt.html)
Definite Integrals (http://pi.physik.uni-bonn.de/~dieckman/IntegralsDefinite/DefInt.html)
• Math Major: A Table of Integrals (http://mathmajor.org/calculus-and-analysis/table-of-integrals/)
• O'Brien, Francis J. Jr. 500 Integrals of Elementary and Special Functions (http://www.docstoc.com/docs/23969109/500-Integrals-of-Elementary-and-Special-Functions) Derived integrals of exponential and logarithmic
functions
• Rule-based Mathematics (http://www.apmaths.uwo.ca/RuleBasedMathematics/index.html) Precisely defined
indefinite integration rules covering a wide class of integrands

Derivations
• V. H. Moll, The Integrals in Gradshteyn and Ryzhik (http://www.math.tulane.edu/~vhm/Table.html)

Online service
• Integration examples for Wolfram Alpha (http://www.wolframalpha.com/examples/Integrals.html)

Open source programs


• wxmaxima gui for Symbolic and numeric resolution of many mathematical problems (http://wxmaxima.
sourceforge.net/wiki/index.php/Main_Page)

Local regression
LOESS[1] and LOWESS (locally weighted scatterplot smoothing) are two strongly related regression modeling
methods that combine multiple regression models in a k-nearest-neighbor-based meta-model.

LOESS and LOWESS thus build on "classical" methods, such as linear and nonlinear least squares regression. They
address situations in which the classical procedures do not perform well or cannot be effectively applied without
undue labor. LOESS combines much of the simplicity of linear least squares regression with the flexibility of
nonlinear regression. It does this by fitting simple models to localized subsets of the data to build up a function that
describes the deterministic part of the variation in the data, point by point. In fact, one of the chief attractions of this
method is that the data analyst is not required to specify a global function of any form to fit a model to the data, only
to fit segments of the data.

[Figure: LOESS curve fitted to a population sampled from a sine wave with uniform noise added. The LOESS curve approximates the original sine wave.]

The trade-off for these features is increased computation. Because it is so computationally intensive, LOESS would
have been practically impossible to use in the era when least squares regression was being developed. Most other
modern methods for process modeling are similar to LOESS in this respect. These methods have been consciously
designed to use our current computational ability to the fullest possible advantage to achieve goals not easily
achieved by traditional approaches.
Plotting a smooth curve through a set of data points using this statistical technique is called a Loess Curve,
particularly when each smoothed value is given by a weighted quadratic least squares regression over the span of
values of the y-axis scattergram criterion variable. When each smoothed value is given by a weighted linear least
squares regression over the span, this is known as a Lowess curve; however, some authorities treat Lowess and
Loess as synonyms.

Definition of a LOESS model


LOESS, originally proposed by Cleveland (1979) and further developed by Cleveland and Devlin (1988),
specifically denotes a method that is also known as locally weighted polynomial regression. At each point in the data
set a low-degree polynomial is fitted to a subset of the data, with explanatory variable values near the point whose
response is being estimated. The polynomial is fitted using weighted least squares, giving more weight to points near
the point whose response is being estimated and less weight to points further away. The value of the regression
function for the point is then obtained by evaluating the local polynomial using the explanatory variable values for
that data point. The LOESS fit is complete after regression function values have been computed for each of the
data points. Many of the details of this method, such as the degree of the polynomial model and the weights, are
flexible. The range of choices for each part of the method and typical defaults are briefly discussed next.
Local regression 328

Localized subsets of data


The subsets of data used for each weighted least squares fit in LOESS are determined by a nearest neighbors
algorithm. A user-specified input to the procedure called the "bandwidth" or "smoothing parameter" determines how
much of the data is used to fit each local polynomial. The smoothing parameter, α, is a number between (λ + 1)/n
and 1, with λ denoting the degree of the local polynomial and n the number of data points. The value of α is the
proportion of data used in each fit. The subset of data used in each weighted least squares fit comprises the nα
(rounded to the next largest integer) points whose explanatory variables' values are closest to the point at which
the response is being estimated.
α is called the smoothing parameter because it controls the flexibility of the LOESS regression function. Large
values of α produce the smoothest functions that wiggle the least in response to fluctuations in the data. The smaller
α is, the closer the regression function will conform to the data. Using too small a value of the smoothing parameter
is not desirable, however, since the regression function will eventually start to capture the random error in the data.
Useful values of the smoothing parameter typically lie in the range 0.25 to 0.5 for most LOESS applications.

Degree of local polynomials


The local polynomials fit to each subset of the data are almost always of first or second degree; that is, either locally
linear (in the straight line sense) or locally quadratic. Using a zero degree polynomial turns LOESS into a weighted
moving average. Such a simple local model might work well for some situations, but may not always approximate
the underlying function well enough. Higher-degree polynomials would work in theory, but yield models that are not
really in the spirit of LOESS. LOESS is based on the ideas that any function can be well approximated in a small
neighborhood by a low-order polynomial and that simple models can be fit to data easily. High-degree polynomials
would tend to overfit the data in each subset and are numerically unstable, making accurate computations difficult.

Weight function
As mentioned above, the weight function gives the most weight to the data points nearest the point of estimation and
the least weight to the data points that are furthest away. The use of the weights is based on the idea that points near
each other in the explanatory variable space are more likely to be related to each other in a simple way than points
that are further apart. Following this logic, points that are likely to follow the local model best influence the local
model parameter estimates the most. Points that are less likely to actually conform to the local model have less
influence on the local model parameter estimates.
The traditional weight function used for LOESS is the tri-cube weight function,

w(x) = (1 − |x|³)³ for |x| < 1, and w(x) = 0 for |x| ≥ 1.
However, any other weight function that satisfies the properties listed in Cleveland (1979) could also be used. The
weight for a specific point in any localized subset of data is obtained by evaluating the weight function at the
distance between that point and the point of estimation, after scaling the distance so that the maximum absolute
distance over all of the points in the subset of data is exactly one.
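The procedure just described, tri-cube weights plus a weighted polynomial fit on the nearest neighbors, can be sketched in a few lines of Python. This is a minimal illustration rather than Cleveland's full algorithm (it omits the robustness iterations); the function names and the use of numpy's `polyfit` for the weighted fit are choices made here, not part of the original method.

```python
import numpy as np

def tricube(d):
    """Tri-cube weight: (1 - |d|^3)^3 for |d| < 1, else 0."""
    d = np.abs(d)
    return np.where(d < 1, (1 - d**3)**3, 0.0)

def loess(x, y, alpha=0.5, degree=1):
    """Minimal LOESS sketch: at each x[i], fit a weighted polynomial to the
    ceil(n * alpha) nearest neighbors using tri-cube weights."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    k = int(np.ceil(alpha * n))              # points used in each local fit
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        idx = np.argsort(dist)[:k]           # k nearest neighbors
        w = tricube(dist[idx] / dist[idx].max())  # scale so max distance = 1
        # np.polyfit's w multiplies residuals, so pass the square root of
        # the LOESS weights to minimize sum(w_j * (y_j - p(x_j))^2)
        coeffs = np.polyfit(x[idx], y[idx], deg=degree, w=np.sqrt(w))
        fitted[i] = np.polyval(coeffs, x[i])
    return fitted
```

With degree 0 this reduces to a weighted moving average, as noted above; degree 1 or 2 gives the usual locally linear or quadratic fits.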

Advantages of LOESS
As discussed above, the biggest advantage LOESS has over many other methods is the fact that it does not require
the specification of a function to fit a model to all of the data in the sample. Instead the analyst only has to provide a
smoothing parameter value and the degree of the local polynomial. In addition, LOESS is very flexible, making it
ideal for modeling complex processes for which no theoretical models exist. These two advantages, combined with
the simplicity of the method, make LOESS one of the most attractive of the modern regression methods for
applications that fit the general framework of least squares regression but which have a complex deterministic
structure.
Although it is less obvious than for some of the other methods related to linear least squares regression, LOESS also
accrues most of the benefits typically shared by those procedures. The most important of those is the theory for
computing uncertainties for prediction and calibration. Many other tests and procedures used for validation of least
squares models can also be extended to LOESS models.

Disadvantages of LOESS
LOESS makes less efficient use of data than other least squares methods. It requires fairly large, densely sampled
data sets in order to produce good models. This is not really surprising, however, since LOESS needs good empirical
information on the local structure of the process in order to perform the local fitting. In fact, given the results it
provides, LOESS could be more efficient overall than other methods like nonlinear least squares. It may simply
frontload the costs of an experiment in data collection but then reduce analysis costs.
Another disadvantage of LOESS is the fact that it does not produce a regression function that is easily represented by
a mathematical formula. This can make it difficult to transfer the results of an analysis to other people. In order to
transfer the regression function to another person, they would need the data set and software for LOESS calculations.
In nonlinear regression, on the other hand, it is only necessary to write down a functional form in order to provide
estimates of the unknown parameters and the estimated uncertainty. Depending on the application, this could be
either a major or a minor drawback to using LOESS.
Finally, as discussed above, LOESS is a computationally intensive method. This is not usually a problem in our
current computing environment, however, unless the data sets being used are very large. LOESS is also prone to the
effects of outliers in the data set, like other least squares methods. There is an iterative, robust version of LOESS
[Cleveland (1979)] that can be used to reduce LOESS' sensitivity to outliers, but too many extreme outliers can still
overcome even the robust method.

References
[1] LOESS is a later generalization of LOWESS; although it isn't a true initialism, it may be understood as standing for "LOcal regrESSion" (e.g.
John Fox, Nonparametric Regression: Appendix to An R and S-PLUS Companion to Applied Regression
(http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-nonparametric-regression.pdf), January 2002)

• Cleveland, William S. (1979). "Robust Locally Weighted Regression and Smoothing Scatterplots". Journal of the
American Statistical Association 74 (368): 829–836. doi:10.2307/2286407. JSTOR 2286407. MR0556476.
• Cleveland, William S. (1981). "LOWESS: A program for smoothing scatterplots by robust locally weighted
regression". The American Statistician 35 (1): 54. JSTOR 2683591.
• Cleveland, William S.; Devlin, Susan J. (1988). "Locally-Weighted Regression: An Approach to Regression
Analysis by Local Fitting". Journal of the American Statistical Association 83 (403): 596–610.
doi:10.2307/2289282. JSTOR 2289282.
External links
• Local Regression and Election Modeling (http://voteforamerica.net/editorials/Comments.aspx?ArticleId=28&
ArticleName=Electoral+Projections+Using+LOESS)
• Smoothing by Local Regression: Principles and Methods (PostScript Document) (http://www.stat.purdue.edu/
~wsc/papers/localregression.principles.ps)
• NIST Engineering Statistics Handbook Section on LOESS (http://www.itl.nist.gov/div898/handbook/pmd/
section1/pmd144.htm)
• Local Fitting Software (http://www.stat.purdue.edu/~wsc/localfitsoft.html)
• LOESS Smoothing in Excel (http://peltiertech.com/WordPress/loess-smoothing-in-excel/)
• Scatter Plot Smoothing (http://stat.ethz.ch/R-manual/R-patched/library/stats/html/lowess.html)
• The Loess function (http://research.stowers-institute.org/efg/R/Statistics/loess.htm) in R
• Quantile LOWESS (http://www.r-statistics.com/2010/04/
quantile-lowess-combining-a-moving-quantile-window-with-lowess-r-function/) – A method to perform Local
regression on a Quantile moving window (with R code)
 This article incorporates public domain material from websites or documents of the National Institute of
Standards and Technology (http://www.nist.gov).
Log-normal distribution 331

Log-normal distribution
Log-normal

Probability density function

[Figure: Some log-normal density functions with identical location parameter μ but differing
scale parameters σ]

Cumulative distribution function

[Figure: Cumulative distribution function of the log-normal distribution (with μ = 0)]

Notation      ln N(μ, σ²)
Parameters    σ² > 0 — shape (real),
              μ ∈ R — log-scale
Support       x ∈ (0, +∞)
PDF           (1 / (x σ √(2π))) exp(−(ln x − μ)² / (2σ²))
CDF           ½ erfc(−(ln x − μ) / (σ√2)) = Φ((ln x − μ) / σ)
Mean          e^(μ + σ²/2)
Median        e^μ
Mode          e^(μ − σ²)
Variance      (e^(σ²) − 1) e^(2μ + σ²)
Skewness      (e^(σ²) + 2) √(e^(σ²) − 1)
Ex. kurtosis  e^(4σ²) + 2e^(3σ²) + 3e^(2σ²) − 6
Entropy       ½ + ½ ln(2πσ²) + μ
MGF           defined only on the negative half-axis, see text
CF            representation is asymptotically divergent but sufficient for numerical purposes
Fisher information  diag(1/σ², 2/σ²) in the parameters (μ, σ)

In probability theory, a log-normal distribution is a continuous probability distribution of a random variable whose
logarithm is normally distributed. If X is a random variable with a normal distribution, then Y = exp(X) has a
log-normal distribution; likewise, if Y is log-normally distributed, then X = log(Y) has a normal distribution. The
log-normal distribution is the distribution of a random variable that takes only positive real values.
Log-normal is also written log normal or lognormal. The distribution is occasionally referred to as the Galton
distribution or Galton's distribution, after Francis Galton,[1] and other names such as McAlister, Gibrat and
Cobb–Douglas have also been associated with it.[1]
A variable might be modeled as log-normal if it can be thought of as the multiplicative product of many independent
random variables each of which is positive. (This is justified by considering the central limit theorem in the
log-domain.) For example, in finance, the variable could represent the compound return from a sequence of many
trades (each expressed as its return + 1); or a long-term discount factor can be derived from the product of short-term
discount factors. In wireless communication, the attenuation caused by shadowing or slow fading from random
objects is often assumed to be log-normally distributed: see log-distance path loss model.
The log-normal distribution is the maximum entropy probability distribution for a random variate X for which the
mean and variance of ln(X) are fixed.[2]

μ and σ
In a log-normal distribution X, the parameters denoted μ and σ are, respectively, the mean and standard deviation of
the variable's natural logarithm (by definition, the variable's logarithm is normally distributed), which means

X = e^(μ + σZ)

with Z a standard normal variable.


This relationship is true regardless of the base of the logarithmic or exponential function. If log_a(Y) is normally
distributed, then so is log_b(Y), for any two positive numbers a, b ≠ 1. Likewise, if e^X is log-normally distributed,
then so is a^X, where a is a positive number ≠ 1.
On a logarithmic scale, μ and σ can be called the location parameter and the scale parameter, respectively.
In contrast, the mean and standard deviation of the non-logarithmized sample values are denoted m and s.d. in this
article.
Log-normal distribution 333

Characterization

Probability density function


The probability density function of a log-normal distribution is:[1]

f_X(x; μ, σ) = (1 / (x σ √(2π))) e^(−(ln x − μ)² / (2σ²)),   x > 0.

This follows by applying the change-of-variables rule to the density function of a normal distribution.
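The change-of-variables step can be checked numerically. The sketch below, assuming numpy is available (function names are ours), implements both densities directly and verifies that f_X(x) = f_N(ln x)/x:

```python
import numpy as np

def lognorm_pdf(x, mu=0.0, sigma=1.0):
    """Log-normal density: exp(-(ln x - mu)^2 / (2 sigma^2)) / (x sigma sqrt(2 pi))."""
    x = np.asarray(x, float)
    return np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2)) / (x * sigma * np.sqrt(2 * np.pi))

def norm_pdf(z, mu=0.0, sigma=1.0):
    """Normal density."""
    return np.exp(-(z - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# change of variables: f_X(x) = f_N(ln x) * |d(ln x)/dx| = f_N(ln x) / x
xs = np.linspace(0.01, 50, 200001)
assert np.allclose(lognorm_pdf(xs, 0.5, 0.8), norm_pdf(np.log(xs), 0.5, 0.8) / xs)
```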

Cumulative distribution function


The cumulative distribution function is

F_X(x; μ, σ) = ½ erfc(−(ln x − μ) / (σ√2)) = Φ((ln x − μ) / σ),

where erfc is the complementary error function, and Φ is the cumulative distribution function of the standard normal
distribution.

Characteristic function and moment generating function


The characteristic function, E[e^(itX)], has a number of representations. The integral itself converges for Im(t) ≤ 0. The
simplest representation is obtained by Taylor expanding e^(itX) and using the formula for moments below, giving

φ(t) ≈ Σ_{n=0}^{N} (it)^n e^(nμ + n²σ²/2) / n!

This series representation is divergent for Re(σ²) > 0. However, it is sufficient for evaluating the characteristic
function numerically at positive t as long as the upper limit in the sum above is kept bounded, n ≤ N, for a suitable
cutoff N, and σ² < 0.1. To bring the numerical values of the parameters μ, σ into the domain where the strong
inequality holds, one can use the fact that if X is log-normally distributed then X^m is also log-normally distributed,
with parameters μ_m = mμ and σ_m = mσ. Since σ_m → 0 as m → 0, the inequality can be satisfied for sufficiently
small m. The sum of the series first converges to the value of φ(t) with arbitrarily high accuracy if m is small enough
and the left part of the strong inequality is satisfied. If a considerably larger number of terms is taken into account,
the sum eventually diverges when the right part of the strong inequality is no longer valid.
Another useful representation is available[3][4] by means of a double Taylor expansion of e^((ln x − μ)²/(2σ²)).
The moment-generating function for the log-normal distribution does not exist on the domain R, but only exists on
the half-interval (−∞, 0].

Properties

Location and scale


For the log-normal distribution, the location and scale properties of the distribution are more readily treated using the
geometric mean and geometric standard deviation than the arithmetic mean and standard deviation.

Geometric moments
The geometric mean of the log-normal distribution is e^μ. Because the log of a log-normal variable is symmetric and
quantiles are preserved under monotonic transformations, the geometric mean of a log-normal distribution is equal to
its median.[5]
The geometric mean (m_g) can alternatively be derived from the arithmetic mean (m_a) in a log-normal distribution by:

m_g = m_a e^(−σ²/2).

The geometric standard deviation is equal to e^σ.

Arithmetic moments
If X is a lognormally distributed variable, its expected value (E – the arithmetic mean), variance (Var), and standard
deviation (s.d.) are

E[X] = e^(μ + σ²/2),
Var[X] = (e^(σ²) − 1) e^(2μ + σ²),
s.d.[X] = √Var[X] = e^(μ + σ²/2) √(e^(σ²) − 1).

Equivalently, parameters μ and σ can be obtained if the expected value and variance are known:

μ = ln(E[X]) − ½ ln(1 + Var[X]/E[X]²),
σ² = ln(1 + Var[X]/E[X]²).

For any real or complex number s, the sth moment of log-normal X is given by[1]

E[X^s] = e^(sμ + s²σ²/2).

A log-normal distribution is not uniquely determined by its moments E[X^k] for k ≥ 1; that is, there exists some other
distribution with the same moments for all k.[1] In fact, there is a whole family of distributions with the same
moments as the log-normal distribution.
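The conversion between (μ, σ) and the arithmetic mean and variance round-trips cleanly in code. A small sketch (the helper names are ours):

```python
import numpy as np

def lognormal_moments(mu, sigma):
    """Arithmetic mean and variance of X = exp(N(mu, sigma^2))."""
    m = np.exp(mu + sigma**2 / 2.0)
    v = (np.exp(sigma**2) - 1.0) * np.exp(2.0 * mu + sigma**2)
    return m, v

def lognormal_params_from_moments(m, v):
    """Recover (mu, sigma) of ln X from the arithmetic mean m and variance v of X."""
    sigma2 = np.log(1.0 + v / m**2)
    mu = np.log(m) - sigma2 / 2.0
    return mu, np.sqrt(sigma2)
```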

Mode and median


The mode is the point of global maximum of the probability density function. In particular, it solves the equation
(ln ƒ)′ = 0:

Mode[X] = e^(μ − σ²).

The median is the point where F_X = 1/2:

Med[X] = e^μ.

Coefficient of variation

The coefficient of variation is the ratio s.d. over m (on the natural scale) and is equal to:

√(e^(σ²) − 1).

[Figure: Comparison of mean, median and mode of two log-normal distributions with different skewness.]

Partial expectation

The partial expectation of a random variable X with respect to a threshold k is defined as
g(k) = E[X | X > k] P[X > k]. For a log-normal random variable the partial expectation is given by

g(k) = e^(μ + σ²/2) Φ((μ + σ² − ln k)/σ).
This formula has applications in insurance and economics; for example, it is used in solving the partial differential
equation leading to the Black–Scholes formula.

Other
A set of data that arises from the log-normal distribution has a symmetric Lorenz curve (see also Lorenz asymmetry
coefficient).[6]
The harmonic (H), geometric (G) and arithmetic (A) means of this distribution are related;[7] the relation is given
by

H · A = G².
Log-normal distributions are infinitely divisible.[1]

Occurrence
• In biology, variables whose logarithms tend to have a normal distribution include:
• Measures of size of living tissue (length, skin area, weight);[8]
• The length of inert appendages (hair, claws, nails, teeth) of biological specimens, in the direction of growth;
• Certain physiological measurements, such as blood pressure of adult humans (after separation on male/female
subpopulations)[9]
Consequently, reference ranges for measurements in healthy
individuals are more accurately estimated by assuming a
log-normal distribution than by assuming a symmetric
distribution about the mean.
• In hydrology, the log-normal distribution is used to analyze
extreme values of such variables as monthly and annual
maximum values of daily rainfall and river discharge
volumes.[10]
• The image on the right illustrates an example of fitting the log-normal distribution to ranked annually maximum
one-day rainfalls, showing also the 90% confidence belt based on the binomial distribution. The rainfall data are
represented by plotting positions as part of a cumulative frequency analysis.

[Figure: Fitted cumulative log-normal distribution to annually maximum 1-day rainfalls]
• In economics, there is evidence that the income of 97%–99% of the population is distributed log-normally[11].
• In finance, in particular the Black–Scholes model, changes in the logarithm of exchange rates, price indices, and
stock market indices are assumed normal[12] (these variables behave like compound interest, not like simple
interest, and so are multiplicative). However, some mathematicians such as Benoît Mandelbrot have argued that
log-Lévy distributions, which possess heavy tails, would be a more appropriate model, in particular for the
analysis of stock market crashes. Indeed, stock price distributions typically exhibit a fat tail[13].
• The distribution of city sizes is lognormal. This follows from Gibrat's law of proportionate (or scale-free) growth.
Irrespective of their size, all cities follow the same stochastic growth process. As a result, the logarithm of city
size is normally distributed. There is also evidence of lognormality in the firm size distribution and of Gibrat's
law.
• In reliability analysis, the lognormal distribution is often used to model times to repair a maintainable system.
• In wireless communication, "the local-mean power expressed in logarithmic values, such as dB or neper, has a
normal (i.e., Gaussian) distribution." [14]
• It has been proposed that coefficients of friction and wear may be treated as having a lognormal distribution [15]

Maximum likelihood estimation of parameters


For determining the maximum likelihood estimators of the log-normal distribution parameters μ and σ, we can use
the same procedure as for the normal distribution. To avoid repetition, we observe that

f_L(x; μ, σ) = (1/x) f_N(ln x; μ, σ),

where by ƒ_L we denote the probability density function of the log-normal distribution and by ƒ_N that of the normal
distribution. Therefore, using the same indices to denote distributions, we can write the log-likelihood function thus:

ℓ_L(μ, σ | x_1, …, x_n) = −Σ_k ln x_k + ℓ_N(μ, σ | ln x_1, …, ln x_n).

Since the first term is constant with regard to μ and σ, both logarithmic likelihood functions, ℓ_L and ℓ_N, reach their
maximum with the same μ and σ. Hence, using the formulas for the normal distribution maximum likelihood
parameter estimators and the equality above, we deduce that for the log-normal distribution it holds that

μ̂ = (1/n) Σ_k ln x_k,    σ̂² = (1/n) Σ_k (ln x_k − μ̂)².
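In code, the estimation amounts to applying the normal-distribution MLEs to the log data. A quick sketch on simulated data (the seed and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 1.5, 0.4
x = np.exp(rng.normal(mu, sigma, size=100_000))   # log-normal sample

# ML estimates: the normal-distribution MLEs applied to ln x
logx = np.log(x)
mu_hat = logx.mean()
sigma_hat = np.sqrt(((logx - mu_hat) ** 2).mean())
```

With this many observations, both estimates land very close to the true parameter values.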
Multivariate log-normal
If X ∼ N(μ, Σ) is a multivariate normal distribution, then Y = exp(X), taken componentwise, has a multivariate
log-normal distribution[16] with mean

E[Y_i] = e^(μ_i + Σ_ii/2),

and covariance matrix

Var[Y]_ij = e^(μ_i + μ_j + (Σ_ii + Σ_jj)/2) (e^(Σ_ij) − 1).
Generating log-normally distributed random variates


Given a random variate Z drawn from the normal distribution with 0 mean and 1 standard deviation, the variate

X = e^(μ + σZ)

has a log-normal distribution with parameters μ and σ.
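For example, in Python with numpy, one can draw standard normal variates and exponentiate; the sample median and mean should track the closed-form values e^μ and e^(μ + σ²/2):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.5, 0.75
x = np.exp(mu + sigma * rng.standard_normal(200_000))  # log-normal variates

# sanity checks against the closed-form median e^mu and mean e^(mu + sigma^2/2)
sample_median, sample_mean = np.median(x), x.mean()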

Related distributions
• If X ∼ N(μ, σ²) is a normal distribution, then exp(X) ∼ ln N(μ, σ²).
• If X ∼ ln N(μ, σ²) is distributed log-normally, then ln(X) ∼ N(μ, σ²) is a normal random variable.
• If X_j ∼ ln N(μ_j, σ_j²) are n independent log-normally distributed variables, and Y = Π_j X_j, then Y is
also distributed log-normally:

Y ∼ ln N(Σ_j μ_j, Σ_j σ_j²).

• Let X_j ∼ ln N(μ_j, σ_j²) be independent log-normally distributed variables with possibly varying σ and μ
parameters, and Y = Σ_j X_j. The distribution of Y has no closed-form expression, but can be reasonably
approximated by another log-normal distribution Z at the right tail. Its probability density function in the
neighborhood of 0 has been characterized[17] and it does not resemble any log-normal distribution. A commonly
used approximation (due to Fenton and Wilkinson) is obtained by matching the mean and variance:

σ_Z² = ln[ Σ_j (e^(σ_j²) − 1) e^(2μ_j + σ_j²) / (Σ_j e^(μ_j + σ_j²/2))² + 1 ],
μ_Z = ln(Σ_j e^(μ_j + σ_j²/2)) − σ_Z²/2.

In the case that all X_j have the same variance parameter σ_j = σ, these formulas simplify to

σ_Z² = ln[ (e^(σ²) − 1) Σ_j e^(2μ_j) / (Σ_j e^(μ_j))² + 1 ],
μ_Z = ln(Σ_j e^(μ_j)) + σ²/2 − σ_Z²/2.
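The Fenton–Wilkinson moment matching above can be written directly in code; a sketch (the function name is ours):

```python
import numpy as np

def fenton_wilkinson(mus, sigmas):
    """Fenton-Wilkinson approximation: parameters (mu_Z, sigma_Z) of a log-normal
    matching the mean and variance of a sum of independent exp(N(mu_j, sigma_j^2))."""
    mus, sigmas = np.asarray(mus, float), np.asarray(sigmas, float)
    m = np.exp(mus + sigmas**2 / 2).sum()                              # mean of the sum
    v = ((np.exp(sigmas**2) - 1) * np.exp(2 * mus + sigmas**2)).sum()  # variance of the sum
    sigma_z2 = np.log(1 + v / m**2)
    mu_z = np.log(m) - sigma_z2 / 2
    return mu_z, np.sqrt(sigma_z2)
```

With a single summand the "approximation" is exact, which gives a convenient sanity check.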
• If X ∼ ln N(μ, σ²), then X + c is said to have a shifted log-normal distribution with support x ∈ (c, +∞).
E[X + c] = E[X] + c, Var[X + c] = Var[X].
• If X ∼ ln N(μ, σ²), then aX ∼ ln N(μ + ln a, σ²) for a > 0.
• If X ∼ ln N(μ, σ²), then 1/X ∼ ln N(−μ, σ²).
• If X ∼ ln N(μ, σ²), then X^a ∼ ln N(aμ, a²σ²) for a ≠ 0.
• The log-normal distribution is a special case of the semi-bounded Johnson distribution.
• If X | Y ∼ Rayleigh(Y) with Y ∼ ln N(μ, σ²), then X has a Suzuki distribution.

Similar distributions
• A substitute for the log-normal whose integral can be expressed in terms of more elementary functions (Swamee,
2002) can be obtained based on the logistic distribution to get the CDF

F(x; μ, σ) = [1 + (e^μ / x)^(π/(σ√3))]^(−1).

This is a log-logistic distribution.

Notes
[1] Johnson, Norman L.; Kotz, Samuel; Balakrishnan, N. (1994), "14: Lognormal Distributions", Continuous univariate distributions. Vol. 1,
Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics (2nd ed.), New York: John Wiley & Sons,
ISBN 978-0-471-58495-7, MR1299979
[2] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model"
(http://www.wise.xmu.edu.cn/Master/Download/..\..\UploadFiles\paper-masterdownload\2009519932327055475115776.pdf).
Journal of Econometrics (Elsevier): 219–230. Retrieved 2011-06-02.
[3] Leipnik, Roy B. (1991), "On Lognormal Random Variables: I – The Characteristic Function", Journal of the Australian Mathematical Society
Series B, 32, 327–347.
[4] Daniel Dufresne (2009), SUMS OF LOGNORMALS (http://www.soa.org/library/proceedings/arch/2009/arch-2009-iss1-dufresne.pdf),
Centre for Actuarial Studies, University of Melbourne.
[5] Leslie E. Daly, Geoffrey Joseph Bourke (2000) Interpretation and uses of medical statistics
(http://books.google.se/books?id=AY7LnYkiLNkC&pg=PA89) Edition: 5. Wiley-Blackwell ISBN 0-632-04763-1, ISBN 978-0-632-04763-5 (page 89)
[6] Damgaard, Christian; Weiner, Jacob (2000). "Describing inequality in plant size or fecundity". Ecology 81 (4): 1139–1142.
doi:10.1890/0012-9658(2000)081[1139:DIIPSO]2.0.CO;2.
[7] Rossman LA (1990) "Design stream flows based on harmonic means". J Hydraulic Engineering ASCE 116 (7) 946–950
[8] Huxley, Julian S. (1932). Problems of relative growth. London. ISBN 0-486-61114-0. OCLC 476909537.
[9] Makuch, Robert W.; D.H. Freeman, M.F. Johnson (1979). "Justification for the lognormal distribution as a model for blood pressure"
(http://www.sciencedirect.com/science/article/pii/0021968179900705). Journal of Chronic Diseases 32 (3): 245–250.
doi:10.1016/0021-9681(79)90070-5. Retrieved 27 February 2012.
[10] Ritzema, H.P. (ed.) (1994). Frequency and Regression Analysis (http://www.waterlog.info/pdf/freqtxt.pdf). Chapter 6 in: Drainage
Principles and Applications, Publication 16, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The
Netherlands. pp. 175–224. ISBN 90-70754-33-9.
[11] Clementi, F.; Gallegati, M. (2005) "Pareto's law of income distribution: Evidence for Germany, the United Kingdom, and the United States"
(http://ideas.repec.org/p/wpa/wuwpmi/0505006.html), EconWPA
[12] Black, Fischer and Myron Scholes, "The Pricing of Options and Corporate Liabilities", Journal of Political Economy, Vol. 81, No. 3,
(May/June 1973), pp. 637–654.
[13] Bunchen, P., Advanced Option Pricing, University of Sydney coursebook, 2007
[14] http://wireless.per.nl/reference/chaptr03/shadow/shadow.htm
[15] Steele, C. (2008). "Use of the lognormal distribution for the coefficients of friction and wear". Reliability Engineering & System Safety 93
(10): 1574–2013. doi:10.1016/j.ress.2007.09.005.
[16] Tarmast, Ghasem (2001) "Multivariate Log-Normal Distribution" (http://isi.cbs.nl/iamamember/CD2/pdf/329.PDF) ISI Proceedings:
Seoul 53rd Session 2001
[17] Gao, X.; Xu, H.; Ye, D. (2009), "Asymptotic Behaviors of Tail Density for Sum of Correlated Lognormal Variables"
(http://www.hindawi.com/journals/ijmms/2009/630857.html). International Journal of Mathematics and Mathematical Sciences, vol. 2009,
Article ID 630857. doi:10.1155/2009/630857

References
• Aitchison, J. and Brown, J.A.C. (1957) The Lognormal Distribution, Cambridge University Press.
• E. Limpert, W. Stahel and M. Abbt (2001) Log-normal Distributions across the Sciences: Keys and Clues (http://
stat.ethz.ch/~stahel/lognormal/bioscience.pdf), BioScience, 51 (5), 341–352.
• Eric W. Weisstein et al. Log Normal Distribution (http://mathworld.wolfram.com/LogNormalDistribution.
html) at MathWorld. Electronic document, retrieved October 26, 2006.
• Swamee, P. K. (2002). "Near Lognormal Distribution" (http://ascelibrary.org/doi/abs/10.1061/
(ASCE)1084-0699(2002)7:6(441)). Journal of Hydrologic Engineering 7 (6): 441–444.
doi:10.1061/(ASCE)1084-0699(2002)7:6(441).
• Holgate, P. (1989). "The lognormal characteristic function". Communications in Statistics - Theory and Methods
18 (12): 4539–4548. doi:10.1080/03610928908830173.

Further reading
• Robert Brooks, Jon Corson, and J. Donal Wales. "The Pricing of Index Options When the Underlying Assets All
Follow a Lognormal Diffusion" (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=5735), in Advances in
Futures and Options Research, volume 7, 1994.
Logrank test 339

Logrank test
In statistics, the logrank test is a hypothesis test to compare the survival distributions of two samples. It is a
nonparametric test and appropriate to use when the data are right skewed and censored (technically, the censoring
must be non-informative). It is widely used in clinical trials to establish the efficacy of a new treatment compared to
a control treatment when the measurement is the time to event (such as the time from initial treatment to a heart
attack). The test is sometimes called the Mantel–Cox test, named after Nathan Mantel and David Cox. The logrank
test can also be viewed as a time stratified Cochran–Mantel–Haenszel test.
The test was first proposed by Nathan Mantel and was named the logrank test by Richard and Julian Peto.[1][2][3]

Definition
The logrank test statistic compares estimates of the hazard functions of the two groups at each observed event time.
It is constructed by computing the observed and expected number of events in one of the groups at each observed
event time and then adding these to obtain an overall summary across all time points where there is an event.
Let j = 1, ..., J be the distinct times of observed events in either group. For each time j, let N_{1,j} and N_{2,j} be the
number of subjects "at risk" (who have not yet had an event or been censored) at the start of period j in the two groups,
respectively. Let N_j = N_{1,j} + N_{2,j}. Let O_{1,j} and O_{2,j} be the observed number of events in the groups,
respectively, at time j, and define O_j = O_{1,j} + O_{2,j}.
Given that O_j events happened across both groups at time j, under the null hypothesis (of the two groups having
identical survival and hazard functions) O_{1,j} has the hypergeometric distribution with parameters N_j, N_{1,j}, and
O_j. This distribution has expected value E_j = O_j N_{1,j}/N_j and variance

V_j = O_j (N_{1,j}/N_j) (1 − N_{1,j}/N_j) (N_j − O_j)/(N_j − 1).

The logrank statistic compares each O_{1,j} to its expectation E_j under the null hypothesis and is defined as

Z = Σ_{j=1}^{J} (O_{1,j} − E_j) / √(Σ_{j=1}^{J} V_j).
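The computation just described can be coded directly from the observed and expected counts. A minimal sketch; the input conventions (parallel arrays of times and 0/1 event indicators, censored = 0) are choices made here:

```python
import numpy as np

def logrank(times1, events1, times2, events2):
    """Two-sample logrank statistic Z = sum(O1j - Ej) / sqrt(sum(Vj)).
    times*: event or censoring times; events*: 1 if event observed, 0 if censored."""
    t = np.concatenate([times1, times2]).astype(float)
    e = np.concatenate([events1, events2]).astype(bool)
    g = np.concatenate([np.zeros(len(times1), int), np.ones(len(times2), int)])
    num, den = 0.0, 0.0
    for tj in np.unique(t[e]):                  # distinct observed event times
        at_risk = t >= tj
        n_j = at_risk.sum()                     # at risk in both groups
        n1_j = (at_risk & (g == 0)).sum()       # at risk in group 1
        o_j = (e & (t == tj)).sum()             # events at tj, both groups
        o1_j = (e & (t == tj) & (g == 0)).sum() # events at tj, group 1
        num += o1_j - o_j * n1_j / n_j          # O_1j - E_j
        if n_j > 1:                             # hypergeometric variance V_j
            den += o_j * (n1_j / n_j) * (1 - n1_j / n_j) * (n_j - o_j) / (n_j - 1)
    return num / np.sqrt(den)
```

For two identical samples the numerator vanishes term by term, so Z = 0; when group 1's events all occur earlier, Z is positive.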

Asymptotic distribution
If the two groups have the same survival function, the logrank statistic is approximately standard normal. A
one-sided level α test will reject the null hypothesis if Z > z_α, where z_α is the upper α quantile of the standard
normal distribution. If the hazard ratio is λ, there are n total subjects, d is the probability that a subject in either
group will eventually have an event (so that nd is the expected number of events at the time of the analysis), and
the proportion of subjects randomized to each group is 50%, then the logrank statistic is approximately normal with
mean (log λ) √(nd/4) and variance 1.[4] For a one-sided level α test with power 1 − β, the required sample size

is n = 4 (z_α + z_β)² / (d (log λ)²), where z_α and z_β are the quantiles of the standard normal distribution.
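This sample-size formula can be wrapped in a small helper (the function name is ours); it uses Python's `statistics.NormalDist` for the normal quantiles:

```python
from math import log
from statistics import NormalDist

def logrank_sample_size(hazard_ratio, event_prob, alpha=0.05, power=0.8):
    """Total subjects needed (1:1 randomization) for a one-sided level-alpha
    logrank test with the given power: n = 4 (z_a + z_b)^2 / (d (log lam)^2)."""
    za = NormalDist().inv_cdf(1 - alpha)   # upper-alpha quantile
    zb = NormalDist().inv_cdf(power)       # power quantile
    return 4 * (za + zb) ** 2 / (event_prob * log(hazard_ratio) ** 2)
```

For example, detecting a hazard ratio of 2 with one-sided α = 0.025 and 90% power, when every subject is expected to have an event, requires roughly 88 subjects in total.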


Joint distribution
Suppose Z_1 and Z_2 are the logrank statistics at two different time points in the same study (Z_1 earlier). Again,
assume the hazard functions in the two groups are proportional with hazard ratio λ, and let d_1 and d_2 be the
probabilities that a subject will have an event at the two time points, where d_1 ≤ d_2. Z_1 and Z_2 are approximately
bivariate normal with means (log λ) √(n d_1/4) and (log λ) √(n d_2/4) and correlation √(d_1/d_2). Calculations
involving the joint distribution are needed to correctly maintain the error rate when the data are examined multiple
times within a study by a Data Monitoring Committee.

Relationship to other statistics


• The logrank statistic can be derived as the score test for the Cox proportional hazards model comparing two
groups. It is therefore asymptotically equivalent to the likelihood ratio test statistic based from that model.
• The logrank statistic is asymptotically equivalent to the likelihood ratio test statistic for any family of distributions
with a proportional hazard alternative, for example if the data from the two samples have exponential
distributions.
• If Z is the logrank statistic, D is the number of events observed, and λ̂ is the estimate of the hazard ratio, then
log λ̂ ≈ 2 Z / √D. This relationship is useful when two of the quantities are known (e.g. from a published
article), but the third one is needed.


• The logrank statistic can be used when observations are censored. If censored observations are not present in the
data then the Wilcoxon rank sum test is appropriate.
• The logrank statistic gives all calculations the same weight, regardless of the time at which an event occurs. The
Peto logrank statistic gives more weight to earlier events when there are a large number of observations.

References
[1] Mantel, Nathan (1966). "Evaluation of survival data and two new rank order statistics arising in its consideration.". Cancer Chemotherapy
Reports 50 (3): 163–70. PMID 5910392.
[2] Peto, Richard; Peto, Julian (1972). "Asymptotically Efficient Rank Invariant Test Procedures". Journal of the Royal Statistical Society. Series
A (General) (Blackwell Publishing) 135 (2): 185–207. doi:10.2307/2344317. JSTOR 2344317.
[3] Harrington, David (2005). "Linear Rank Tests in Survival Analysis". Encyclopedia of Biostatistics. Wiley Interscience.
doi:10.1002/0470011815.b2a11047.
[4] Schoenfeld, D (1981). "The asymptotic properties of nonparametric tests for comparing survival distributions". Biometrika 68: 316–319.
JSTOR 2335833.

External links
• Bland, J. M.; Altman, D. G. (2004). "The logrank test". BMJ 328 (7447): 1073. doi:10.1136/bmj.328.7447.1073.
PMC 403858. PMID 15117797.
Lévy distribution 341

Lévy distribution
Lévy (unshifted)

Probability density function
Cumulative distribution function

Parameters    μ location; c > 0 scale
Support       x ∈ [μ, ∞)
PDF           √(c/(2π)) e^(−c/(2(x−μ))) / (x−μ)^(3/2)
CDF           erfc(√(c/(2(x−μ))))
Mean          ∞
Median        c/(2 (erfc⁻¹(1/2))²), for μ = 0
Mode          c/3, for μ = 0
Variance      ∞
Skewness      undefined
Ex. kurtosis  undefined
Entropy       (1 + 3γ + ln(16πc²))/2,
              where γ is Euler's constant
MGF           undefined
CF            e^(iμt − √(−2ict))

In probability theory and statistics, the Lévy distribution, named after Paul Pierre Lévy, is a continuous probability
distribution for a non-negative random variable. In spectroscopy this distribution, with frequency as the dependent
variable, is known as a van der Waals profile.[1] It is a special case of the inverse-gamma distribution.
It is one of the few distributions that are stable and that have probability density functions that can be expressed
analytically, the others being the normal distribution and the Cauchy distribution. All three are special cases of the
stable distributions, which do not generally have a probability density function which can be expressed analytically.

Definition
The probability density function of the Lévy distribution over the domain x ≥ μ is

f(x; μ, c) = sqrt(c/(2π)) · e^(−c/(2(x−μ))) / (x−μ)^(3/2),

where μ is the location parameter and c is the scale parameter. The cumulative distribution function is

F(x; μ, c) = erfc(sqrt(c/(2(x−μ)))),

where erfc(z) is the complementary error function. The shift parameter μ has the effect of shifting the curve to
the right by an amount μ, and changing the support to the interval [μ, ∞). Like all stable distributions, the
Lévy distribution has a standard form f(x; 0, 1) which has the following property:

f(x; μ, c) dx = f(y; 0, 1) dy,

where y is defined as

y = (x − μ)/c.

The characteristic function of the Lévy distribution is given by

φ(t; μ, c) = e^(iμt − sqrt(−2ict)).

Note that the characteristic function can also be written in the same form used for the stable distribution with α = 1/2
and β = 1:

φ(t; μ, c) = e^(iμt − |ct|^(1/2) (1 − i sign(t))).

Assuming μ = 0, the nth moment of the unshifted Lévy distribution is formally defined by:

m_n = sqrt(c/(2π)) ∫₀^∞ e^(−c/(2x)) x^n / x^(3/2) dx,

which diverges for all n > 0 so that the moments of the Lévy distribution do not exist. The moment generating
function is then formally defined by:

M(t; c) = sqrt(c/(2π)) ∫₀^∞ e^(tx − c/(2x)) / x^(3/2) dx,

which diverges for t > 0 and is therefore not defined in an interval around zero, so that the moment generating
function is not defined per se. Like all stable distributions except the normal distribution, the wing of the probability
density function exhibits heavy-tail behavior falling off according to a power law:

f(x; μ, c) ~ sqrt(c/(2π)) · 1/x^(3/2)  as x → ∞.

This is illustrated in the diagram below, in which the probability density functions for various values of c and μ = 0
are plotted on a log-log scale.
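The density and distribution function above are straightforward to evaluate with only the standard library; the following is a minimal sketch (function names are illustrative, not from any particular package):

```python
import math

def levy_pdf(x, mu=0.0, c=1.0):
    # f(x; mu, c) = sqrt(c/(2*pi)) * exp(-c / (2*(x - mu))) / (x - mu)**(3/2)
    if x <= mu:
        return 0.0
    return math.sqrt(c / (2 * math.pi)) * math.exp(-c / (2 * (x - mu))) / (x - mu) ** 1.5

def levy_cdf(x, mu=0.0, c=1.0):
    # F(x; mu, c) = erfc( sqrt(c / (2*(x - mu))) )
    if x <= mu:
        return 0.0
    return math.erfc(math.sqrt(c / (2 * (x - mu))))
```

As a sanity check, a numerical derivative of the CDF matches the PDF, and the density peaks at the mode x = c/3 (for μ = 0), in line with the infobox values.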

Related distributions
• If X ~ Lévy(μ, c) then kX + b ~ Lévy(kμ + b, kc).
• If X ~ Lévy(0, c) then X ~ Inv-Gamma(1/2, c/2) (inverse-gamma distribution).
• The Lévy distribution is a special case of the type 5 Pearson distribution.
• If Y ~ Normal(μ, σ²) (normal distribution) then (Y − μ)^(−2) ~ Lévy(0, 1/σ²).
• If X ~ Lévy(μ, c) then X ~ Stable(1/2, 1, c, μ) (stable distribution).
• If X ~ Lévy(0, c) then X ~ Scale-inv-χ²(1, c) (scaled-inverse-chi-squared distribution).
• If X ~ Lévy(0, c) then X^(−1/2) ~ FoldedNormal(0, 1/sqrt(c)) (folded normal distribution).

[Figure: probability density function for the Lévy distribution on a log-log scale.]
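The inverse-gamma relation above can be verified pointwise: with shape 1/2 and scale c/2, the inverse-gamma density reduces exactly to the Lévy(0, c) density. A quick numeric sketch (function names are illustrative):

```python
import math

def levy_pdf(x, c):
    # Lévy(0, c) density: sqrt(c/(2*pi)) * exp(-c/(2x)) / x**(3/2)
    return math.sqrt(c / (2 * math.pi)) * math.exp(-c / (2 * x)) / x ** 1.5

def inv_gamma_pdf(x, alpha, beta):
    # Inverse-gamma density: beta**alpha / Gamma(alpha) * x**(-alpha-1) * exp(-beta/x)
    return beta ** alpha / math.gamma(alpha) * x ** (-alpha - 1) * math.exp(-beta / x)

# Lévy(0, c) coincides with Inv-Gamma(alpha = 1/2, beta = c/2):
for x in (0.5, 1.0, 2.0, 10.0):
    assert abs(levy_pdf(x, 3.0) - inv_gamma_pdf(x, 0.5, 1.5)) < 1e-12
```

The agreement follows because Γ(1/2) = sqrt(π), so (c/2)^(1/2)/Γ(1/2) = sqrt(c/(2π)).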

Applications
• The frequency of geomagnetic reversals appears to follow a Lévy distribution.
• The time of hitting a single point a (different from the starting point 0) by Brownian motion has the Lévy
distribution with μ = 0 and c = a². (For a Brownian motion with drift, this time may follow an inverse Gaussian
distribution, which has the Lévy distribution as a limit.)
• The length of the path followed by a photon in a turbid medium follows the Lévy distribution.[2]

Footnotes
[1] "van der Waals profile" appears with lowercase "van" in almost all sources, such as: Statistical mechanics of the liquid surface by Clive
Anthony Croxton, 1980, A Wiley-Interscience publication, ISBN 0-471-27663-4, ISBN 978-0-471-27663-0
(http://books.google.it/books?id=Wve2AAAAIAAJ&q="Van+der+Waals+profile"&dq="Van+der+Waals+profile"&hl=en); and in Journal of technical
physics, Volume 36, by Instytut Podstawowych Problemów Techniki (Polska Akademia Nauk), publisher: Państwowe Wydawn. Naukowe,
1995 (http://books.google.it/books?id=2XpVAAAAMAAJ&q="Van+der+Waals+profile"&dq="Van+der+Waals+profile"&hl=en)
[2] Rogers, Geoffrey L. (2008). "Multiple path analysis of reflectance from turbid media". Journal of the Optical Society of America A 25 (11):
2879–2883.

Notes

References
• "Information on stable distributions" (http://academic2.american.edu/~jpnolan/stable/stable.html). Retrieved
July 13, 2005. – John P. Nolan's introduction to stable distributions, some papers on stable laws, and a free
program to compute stable densities, cumulative distribution functions, quantiles, estimate parameters, etc. See
especially An introduction to stable distributions, Chapter 1 (http://academic2.american.edu/~jpnolan/stable/chap1.pdf)

Mann–Whitney U
In statistics, the Mann–Whitney U test (also called the Mann–Whitney–Wilcoxon (MWW) or Wilcoxon
rank-sum test) is a non-parametric statistical hypothesis test for assessing whether one of two samples of
independent observations tends to have larger values than the other. It is one of the most well-known non-parametric
significance tests. It was proposed initially[1] by the German Gustav Deuchler in 1914 (with a missing term in the
variance) and later independently by Frank Wilcoxon in 1945,[2] for equal sample sizes, and extended to arbitrary
sample sizes and in other ways by Henry Mann and his student Donald Ransom Whitney in 1947.[3]

Assumptions and formal statement of hypotheses


Although Mann and Whitney[3] developed the MWW test under the assumption of continuous responses with the
alternative hypothesis being that one distribution is stochastically greater than the other, there are many other ways
to formulate the null and alternative hypotheses such that the MWW test will give a valid test.[4]
A very general formulation is to assume that:
1. All the observations from both groups are independent of each other,
2. The responses are ordinal (i.e. one can at least say, of any two observations, which is the greater),
3. Under the null hypothesis the distributions of both groups are equal, so that the probability of an observation from
one population (X) exceeding an observation from the second population (Y) equals the probability of an
observation from Y exceeding an observation from X, that is, there is a symmetry between populations with
respect to probability of random drawing of a larger observation.
4. Under the alternative hypothesis the probability of an observation from one population (X) exceeding an
observation from the second population (Y) (after exclusion of ties) is not equal to 0.5. The alternative may also
be stated in terms of a one-sided test, for example: P(X > Y) + 0.5 P(X = Y) > 0.5.
Under more strict assumptions than those above, e.g., if the responses are assumed to be continuous and the
alternative is restricted to a shift in location (i.e. F1(x) = F2(x + δ)), we can interpret a significant MWW test as
showing a difference in medians. Under this location shift assumption, we can also interpret the MWW as assessing
whether the Hodges–Lehmann estimate of the difference in central tendency between the two populations differs
from zero. The Hodges–Lehmann estimate for this two-sample problem is the median of all possible differences
between an observation in the first sample and an observation in the second sample.

Calculations
The test involves the calculation of a statistic, usually called U, whose distribution under the null hypothesis is
known. In the case of small samples, the distribution is tabulated, but for sample sizes above ~20 approximation
using the normal distribution is fairly good. Some books tabulate statistics equivalent to U, such as the sum of ranks
in one of the samples, rather than U itself.
The U test is included in most modern statistical packages. It is also easily calculated by hand, especially for small
samples. There are two ways of doing this.
First, arrange all the observations into a single ranked series. That is, rank all the observations without regard to
which sample they are in.
Method one:
For small samples a direct method is recommended. It is very quick, and gives an insight into the meaning of the U
statistic.

1. Choose the sample for which the ranks seem to be smaller (The only reason to do this is to make computation
easier). Call this "sample 1," and call the other sample "sample 2."
2. For each observation in sample 1, count the number of observations in sample 2 that have a smaller rank (count a
half for any that are equal to it). The sum of these counts is U.
Method two:
For larger samples, a formula can be used:
1. Add up the ranks for the observations which came from sample 1. The sum of ranks in sample 2 is now
determinate, since the sum of all the ranks equals N(N + 1)/2 where N is the total number of observations.
2. U is then given by:

U1 = R1 − n1(n1 + 1)/2,

where n1 is the sample size for sample 1, and R1 is the sum of the ranks in sample 1.
Note that it doesn't matter which of the two samples is considered sample 1. An equally valid formula
for U is

U2 = R2 − n2(n2 + 1)/2.

The smaller value of U1 and U2 is the one used when consulting significance tables. The sum of the two
values is given by

U1 + U2 = R1 + R2 − n1(n1 + 1)/2 − n2(n2 + 1)/2.

Knowing that R1 + R2 = N(N + 1)/2 and N = n1 + n2, and doing some algebra, we find that the sum is

U1 + U2 = n1·n2.

Properties
The maximum value of U is the product of the sample sizes for the two samples. In such a case, the "other" U would
be 0.

Examples

Illustration of calculation methods


Suppose that Aesop is dissatisfied with his classic experiment in which one tortoise was found to beat one hare in a
race, and decides to carry out a significance test to discover whether the results could be extended to tortoises and
hares in general. He collects a sample of 6 tortoises and 6 hares, and makes them all run his race at once. The order
in which they reach the finishing post (their rank order, from first to last crossing the finish line) is as follows,
writing T for a tortoise and H for a hare:
THHHHHTTTTTH
What is the value of U?
• Using the direct method, we take each tortoise in turn, and count the number of hares it is beaten by, getting 0, 5,
5, 5, 5, 5, which means U = 25. Alternatively, we could take each hare in turn, and count the number of tortoises it
is beaten by. In this case, we get 1, 1, 1, 1, 1, 6. So U = 6 + 1 + 1 + 1 + 1 + 1 = 11. Note that the sum of these two
values for U is 36, which is 6 × 6.
• Using the indirect method:
the sum of the ranks achieved by the tortoises is 1 + 7 + 8 + 9 + 10 + 11 = 46.
Therefore U = 46 − (6×7)/2 = 46 − 21 = 25.

the sum of the ranks achieved by the hares is 2 + 3 + 4 + 5 + 6 + 12 = 32, leading to U = 32 − 21 = 11.
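Both calculation methods take only a few lines of code; the following sketch (with illustrative function names) reproduces the tortoise-and-hare values above:

```python
def u_direct(sample1, sample2):
    # Method one: for each observation in sample 1, count the observations
    # in sample 2 with a smaller rank (half credit for ties).
    return sum(
        sum(1.0 if y < x else 0.5 if y == x else 0.0 for y in sample2)
        for x in sample1
    )

def u_from_ranks(rank_sum1, n1):
    # Method two: U1 = R1 - n1*(n1 + 1)/2
    return rank_sum1 - n1 * (n1 + 1) / 2

tortoises = [1, 7, 8, 9, 10, 11]   # finishing positions (ranks) from THHHHHTTTTTH
hares     = [2, 3, 4, 5, 6, 12]

assert u_direct(tortoises, hares) == 25
assert u_direct(hares, tortoises) == 11
assert u_from_ranks(sum(tortoises), 6) == 25
# The two U values always sum to n1 * n2:
assert u_direct(tortoises, hares) + u_direct(hares, tortoises) == 6 * 6
```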

Illustration of object of test


A second example illustrates the point that the Mann–Whitney does not test for equality of medians. Consider
another hare and tortoise race, with 19 participants of each species, in which the outcomes are as follows:
HHHHHHHHHTTTTTTTTTTHHHHHHHHHHTTTTTTTTT
The median tortoise here comes in at position 19, and thus actually beats the median hare, which comes in at position
20.
However, the value of U (for hares) is 100:

(9 hares beaten by 0 tortoises) + (10 hares beaten by 10 tortoises) = 0 + 100 = 100.

The value of U (for tortoises) is 261:

(10 tortoises beaten by 9 hares) + (9 tortoises beaten by 19 hares) = 90 + 171 = 261.

Consulting tables, or using the approximation below, shows that this U value gives significant evidence that hares
tend to do better than tortoises (p < 0.05, two-tailed). Obviously this is an extreme distribution that would be spotted
easily, but in a larger sample something similar could happen without it being so apparent. Notice that the problem
here is not that the two distributions of ranks have different variances; they are mirror images of each other, so their
variances are the same, but they have very different skewness.

Normal approximation
For large samples, U is approximately normally distributed. In that case, the standardized value

z = (U − mU) / σU,

where mU and σU are the mean and standard deviation of U, is approximately a standard normal deviate whose
significance can be checked in tables of the normal distribution. mU and σU are given by

mU = n1·n2/2  and  σU = sqrt(n1·n2·(n1 + n2 + 1)/12).

The formula for the standard deviation is more complicated in the presence of tied ranks; the full formula is given in
the text books referenced below. However, if the number of ties is small (and especially if there are no large tie
bands) ties can be ignored when doing calculations by hand. The computer statistical packages will use the correctly
adjusted formula as a matter of routine.
Note that since U1 + U2 = n1 n2, the mean n1 n2/2 used in the normal approximation is the mean of the two values of
U. Therefore, the absolute value of the z statistic calculated will be the same whichever value of U is used.
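The approximation is easy to sketch in code (this version omits the tie correction discussed above; applied here to the 19-vs-19 race from the earlier example):

```python
import math

def mann_whitney_z(u, n1, n2):
    # Standardize U using m_U = n1*n2/2 and sigma_U = sqrt(n1*n2*(n1+n2+1)/12).
    # The tie correction to sigma_U is omitted in this sketch.
    m_u = n1 * n2 / 2
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - m_u) / sigma_u

z_hares     = mann_whitney_z(100, 19, 19)   # U for hares in the example above
z_tortoises = mann_whitney_z(261, 19, 19)   # U for tortoises

# |z| is the same whichever U is standardized, and here |z| ~ 2.35 > 1.96,
# consistent with the quoted p < 0.05 (two-tailed).
assert abs(z_hares + z_tortoises) < 1e-12
assert 2.3 < abs(z_hares) < 2.4
```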

Relation to other tests

Comparison to Student's t-test


The U test is useful in the same situations as the independent samples Student's t-test, and the question arises of
which should be preferred.
Ordinal data
U remains the logical choice when the data are ordinal but not interval scaled, so that the spacing between
adjacent values cannot be assumed to be constant.
Robustness
As it compares the sums of ranks,[5] the Mann–Whitney test is less likely than the t-test to spuriously indicate
significance because of the presence of outliers – i.e. Mann–Whitney is more robust.
Efficiency
When normality holds, MWW has an (asymptotic) efficiency of 3/π, or about 0.95, when compared to the
t test.[6] For distributions sufficiently far from normal and for sufficiently large sample sizes, the MWW can be
considerably more efficient than the t.[7]
Overall, the robustness makes the MWW more widely applicable than the t test, and for large samples from the
normal distribution, the efficiency loss compared to the t test is only 5%, so one can recommend MWW as the default
test for comparing interval or ordinal measurements with similar distributions.
The relation between efficiency and power in concrete situations isn't trivial though. For small sample sizes one
should investigate the power of the MWW vs t.
MWW will give very similar results to performing an ordinary parametric two-sample t test on the rankings of the
data.[8]

Area-under-curve (AUC) statistic for ROC curves


The U statistic is equivalent to the area under the receiver operating characteristic curve that can be readily
calculated.[9][10]

Because of its probabilistic form, the U statistic can be generalised to a measure of a classifier's separation power for
more than two classes:[11]

M = (2 / (c(c − 1))) · Σ_(k<l) AUC(k, l),

where c is the number of classes, and the AUC(k, l) term considers only the ranking of the items belonging
to classes k and l (i.e., items belonging to all other classes are ignored) according to the classifier's estimates of the
probability of those items belonging to class k. AUC(k, k) will always be zero but, unlike in the two-class case,
generally AUC(k, l) ≠ AUC(l, k), which is why the M measure sums over all (k, l) pairs, in effect using the average
of AUC(k, l) and AUC(l, k).

Different distributions
If one is only interested in stochastic ordering of the two populations (i.e., the concordance probability P(Y > X)), the
U test can be used even if the shapes of the distributions are different. The concordance probability is exactly equal
to the area under the receiver operating characteristic curve (ROC) that is often used in the context.

Alternatives
If one desires a simple shift interpretation, the U test should not be used when the distributions of the two samples
are very different, as it can give erroneously significant results. In that situation, the unequal variances version of the
t test is likely to give more reliable results, but only if normality holds.
Alternatively, some authors (e.g. Conover) suggest transforming the data to ranks (if they are not already ranks) and
then performing the t test on the transformed data, the version of the t test used depending on whether or not the
population variances are suspected to be different. Rank transformations do not preserve variances, but variances are
recomputed from samples after rank transformations.
The Brown–Forsythe test has been suggested as an appropriate non-parametric equivalent to the F test for equal
variances.

Related test statistics

Kendall's τ
The U test is related to a number of other non-parametric statistical procedures. For example, it is equivalent to
Kendall's τ correlation coefficient if one of the variables is binary (that is, it can only take two values).

ρ statistic
A statistic called ρ that is linearly related to U and widely used in studies of categorization (discrimination learning
involving concepts), and elsewhere,[12] is calculated by dividing U by its maximum value for the given sample sizes,
which is simply n1 × n2. ρ is thus a non-parametric measure of the overlap between two distributions; it can take
values between 0 and 1, and it is an estimate of P(Y > X) + 0.5 P(Y = X), where X and Y are randomly chosen
observations from the two distributions. Both extreme values represent complete separation of the distributions,
while a ρ of 0.5 represents complete overlap. The usefulness of the ρ statistic can be seen in the case of the odd
example used above, where two distributions that were significantly different on a U-test nonetheless had nearly
identical medians: the ρ value in this case is approximately 0.723 in favour of the hares, correctly reflecting the fact
that even though the median tortoise beat the median hare, the hares collectively did better than the tortoises
collectively.
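The ρ computation is a one-liner; the following sketch reproduces the ≈0.723 value quoted above for the 19-vs-19 race:

```python
def rho(u, n1, n2):
    # rho = U / (n1 * n2): a non-parametric estimate of P(Y > X) + 0.5*P(Y = X)
    return u / (n1 * n2)

# 261 of the 19*19 (hare, tortoise) pairs were won by the hare:
r = rho(261, 19, 19)
assert abs(r - 0.723) < 0.001
```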

Example statement of results


In reporting the results of a Mann–Whitney test, it is important to state:
• A measure of the central tendencies of the two groups (means or medians; since the Mann–Whitney is an ordinal
test, medians are usually recommended)
• The value of U
• The sample sizes
• The significance level.
In practice some of this information may already have been supplied and common sense should be used in deciding
whether to repeat it. A typical report might run,
"Median latencies in groups E and C were 153 and 247 ms; the distributions in the two groups differed
significantly (Mann–Whitney U = 10.5, n1 = n2 = 8, P < 0.05 two-tailed)."

A statement that does full justice to the statistical status of the test might run,
"Outcomes of the two treatments were compared using the Wilcoxon–Mann–Whitney two-sample rank-sum
test. The treatment effect (difference between treatments) was quantified using the Hodges–Lehmann (HL)
estimator, which is consistent with the Wilcoxon test.[13] This estimator (HLΔ) is the median of all possible
differences in outcomes between a subject in group B and a subject in group A. A non-parametric 0.95
confidence interval for HLΔ accompanies these estimates as does ρ, an estimate of the probability that a
randomly chosen subject from population B has a higher weight than a randomly chosen subject from
population A. The median [quartiles] weight for subjects on treatment A and B respectively are 147 [121, 177]
and 151 [130, 180] kg. Treatment A decreased weight by HLΔ = 5 kg (0.95 CL [2, 9] kg, 2P = 0.02, ρ =
0.58)."
However it would be rare to find so extended a report in a document whose major topic was not statistical inference.

Implementations
• Online implementation [14] using JavaScript
• ALGLIB [15] includes implementation of the Mann–Whitney U test in C++, C#, Delphi, Visual Basic, etc.
• R includes an implementation of the test (there referred to as the Wilcoxon two-sample test) as wilcox.test [16].
• SAS implements the test in the PROC NPAR1WAY procedure
• Stata includes implementation of Wilcoxon-Mann-Whitney rank-sum test with ranksum [17] command.
• SciPy has the mannwhitneyu [18] function in the stats module.
• MATLAB implements the test with function ranksum in the statistics toolbox.
• Mathematica implements the function as MannWhitneyTest [19].

Notes
[1] Kruskal, William H. (September 1957). "Historical Notes on the Wilcoxon Unpaired Two-Sample Test" (http://www.jstor.org/stable/2280906).
Journal of the American Statistical Association 52 (279): 356–360.
[2] Wilcoxon, Frank (1945). "Individual comparisons by ranking methods". Biometrics Bulletin 1 (6): 80–83. doi:10.2307/3001968.
JSTOR 3001968.
[3] Mann, Henry B.; Whitney, Donald R. (1947). "On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other".
Annals of Mathematical Statistics 18 (1): 50–60. doi:10.1214/aoms/1177730491. MR22058. Zbl 0041.26103.
[4] Fay, Michael P.; Proschan, Michael A. (2010). "Wilcoxon–Mann–Whitney or t-test? On assumptions for hypothesis tests and multiple
interpretations of decision rules". Statistics Surveys 4: 1–39. doi:10.1214/09-SS051. MR2595125. PMC 2857732. PMID 20414472.
[5] Motulsky, Harvey J.; Statistics Guide, San Diego, CA: GraphPad Software, 2007, p. 123
[6] Lehmann, Erich L.; Elements of Large Sample Theory, Springer, 1999, p. 176
[7] Conover, William J.; Practical Nonparametric Statistics
(http://kecubung.webfactional.com/ebook/practical-nonparametric-statistics-conover-download-pdf.pdf), John Wiley & Sons, 1980 (2nd Edition),
pp. 225–226
[8] Conover, William J.; Iman, Ronald L. (1981). "Rank Transformations as a Bridge Between Parametric and Nonparametric Statistics". The
American Statistician 35 (3): 124–129. doi:10.2307/2683975. JSTOR 2683975.
[9] Hanley, James A.; McNeil, Barbara J. (1982). "The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve".
Radiology 143 (1): 29–36. PMID 7063747.
[10] Mason, Simon J.; Graham, Nicholas E. (2002). "Areas beneath the relative operating characteristics (ROC) and relative operating levels
(ROL) curves: Statistical significance and interpretation"
(http://reia.inmet.gov.br/documentos/cursoI_INMET_IRI/Climate_Information_Course/References/Mason+Graham_2002.pdf). Quarterly
Journal of the Royal Meteorological Society (128): 2145–2166.
[11] Hand, David J.; Till, Robert J. (2001). "A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification
Problems" (http://www.springerlink.com/index/nn141j42838n7u21.pdf). Machine Learning 45 (2): 171–186.
doi:10.1023/A:1010920819831.
[12] Herrnstein, Richard J.; Loveland, Donald H.; Cable, Cynthia (1976). "Natural Concepts in Pigeons". Journal of Experimental Psychology:
Animal Behavior Processes 2: 285–302. doi:10.1037/0097-7403.2.4.285.
[13] Myles Hollander and Douglas A. Wolfe (1999). Nonparametric Statistical Methods (2nd ed.). ISBN 978-0471190455.
[14] http://faculty.vassar.edu/lowry/utest.html
[15] http://www.alglib.net/statistics/hypothesistesting/mannwhitneyu.php
[16] http://stat.ethz.ch/R-manual/R-patched/library/stats/html/wilcox.test.html
[17] http://www.stata.com/help.cgi?ranksum
[18] http://docs.scipy.org/doc/scipy-0.9.0/reference/generated/scipy.stats.mannwhitneyu.html
[19] http://reference.wolfram.com/mathematica/ref/MannWhitneyTest.html?q=MannWhitneyTest&lang=en

References
• Lehmann, Erich L. (1975); Nonparametrics: Statistical Methods Based on Ranks.

External links
• Table of critical values of U (pdf) (http://math.usask.ca/~laverty/S245/Tables/wmw.pdf)
• Discussion and table of critical values for the original Wilcoxon Rank-Sum Test, which uses a slightly different
test statistic ( pdf (http://www.stat.auckland.ac.nz/~wild/ChanceEnc/Ch10.wilcoxon.pdf))
• Interactive calculator (http://faculty.vassar.edu/lowry/utest.html) for U and its significance

Matrix calculus
In mathematics, matrix calculus is a specialized notation for doing multivariable calculus, especially over spaces of
matrices. It collects the various partial derivatives of a single function with respect to many variables, and/or of a
multivariate function with respect to a single variable, into vectors and matrices that can be treated as single entities.
This greatly simplifies operations such as finding the maximum or minimum of a multivariate function and solving
systems of differential equations. The notation used here is commonly used in statistics and engineering, while the
tensor index notation is preferred in physics.
Two competing notational conventions split the field of matrix calculus into two separate groups. The two groups
can be distinguished by whether they write the derivative of a scalar with respect to a vector as a column vector or a
row vector. Both of these conventions are possible even when the common assumption is made that vectors should
be treated as column vectors when combined with matrices (rather than row vectors). A single convention can be
somewhat standard throughout a single field that commonly uses matrix calculus (e.g. econometrics, statistics,
estimation theory and machine learning). However, even within a given field different authors can be found using
competing conventions. Authors of both groups often write as though their specific convention is standard. Serious
mistakes can result when combining results from different authors without carefully verifying that compatible
notations are used. Therefore great care should be taken to ensure notational consistency. Definitions of these two
conventions and comparisons between them are collected in the layout conventions section.

Scope
Matrix calculus refers to a number of different notations that use matrices and vectors to collect the derivative of
each component of the dependent variable with respect to each component of the independent variable. In general,
the independent variable can be a scalar, a vector, or a matrix while the dependent variable can be any of these as
well. Each different situation will lead to a different set of rules, or a separate calculus, using the broader sense of the
term. Matrix notation serves as a convenient way to collect the many derivatives in an organized way.
As a first example, consider the gradient from vector calculus. For a scalar function of three independent variables,
f(x, y, z), the gradient is given by the vector equation

∇f = (∂f/∂x) x̂ + (∂f/∂y) ŷ + (∂f/∂z) ẑ,

where x̂ represents a unit vector in the x direction. This type of generalized derivative can be seen as the
derivative of a scalar, f, with respect to a vector, and its result can be easily collected in vector form.
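Collecting the partials into a single vector can be sketched numerically with central differences (an illustrative sketch, not part of the article's notation):

```python
def gradient(f, point, h=1e-6):
    # Central differences: collect df/dx_i into a list, one entry per coordinate,
    # mirroring how the gradient collects the partial derivatives in vector form.
    grad = []
    for i in range(len(point)):
        p_plus, p_minus = list(point), list(point)
        p_plus[i] += h
        p_minus[i] -= h
        grad.append((f(p_plus) - f(p_minus)) / (2 * h))
    return grad

# f(x, y, z) = x**2 + 3*y*z has gradient (2x, 3z, 3y):
g = gradient(lambda p: p[0] ** 2 + 3 * p[1] * p[2], [1.0, 2.0, -1.0])
assert all(abs(a - b) < 1e-4 for a, b in zip(g, [2.0, -3.0, 6.0]))
```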

More complicated examples include the derivative of a scalar function with respect to a matrix, known as the
gradient matrix, which collects the derivative of the scalar with respect to each matrix element in the corresponding
position in the resulting matrix. In that case the scalar must be a function of each of the independent variables in the matrix. As
another example, if we have an n-vector of dependent variables, or functions, of m independent variables we might
consider the derivative of the dependent vector with respect to the independent vector. The result could be collected
in an m×n matrix consisting of all of the possible derivative combinations. There are, of course, a total of nine
possibilities using scalars, vectors, and matrices. Notice that as we consider higher numbers of components in each
of the independent and dependent variables we can be left with a very large number of possibilities.
The six kinds of derivatives that can be most neatly organized in matrix form are collected in the following table.[1]

Types of Matrix Derivatives

  Denominator \ Numerator    Scalar y    Vector y    Matrix Y
  Scalar x                   ∂y/∂x       ∂y/∂x       ∂Y/∂x
  Vector x                   ∂y/∂x       ∂y/∂x
  Matrix X                   ∂y/∂X

Here we have used the term matrix in its most general sense, recognizing that vectors and scalars are simply
matrices with one column and one row respectively. Moreover, we have used bold letters to indicate vectors
and bold capital letters for matrices. This notation is used throughout.
Notice that we could also talk about the derivative of a vector with respect to a matrix, or any of the other unfilled
cells in our table. However, these derivatives are most naturally organized in a tensor of rank higher than 2, so that
they do not fit neatly into a matrix. In the following three sections we will define each one of these derivatives and
relate them to other branches of mathematics. See the layout conventions section for a more detailed table.

Relation to other derivatives


The matrix derivative is a convenient notation for keeping track of partial derivatives for doing calculations. The
Fréchet derivative is the standard way in the setting of functional analysis to take derivatives with respect to vectors.
In the case that a matrix function of a matrix is Fréchet differentiable, the two derivatives will agree up to translation
of notations. As is the case in general for partial derivatives, some formulae may extend under weaker analytic
conditions than the existence of the derivative as approximating linear mapping.

Usages
Matrix calculus is used for deriving optimal stochastic estimators, often involving the use of Lagrange multipliers.
This includes the derivation of:
• Kalman filter
• Wiener filter
• Expectation-maximization algorithm for Gaussian mixture

Notation
The vector and matrix derivatives presented in the sections to follow take full advantage of matrix notation, using a
single variable to represent a large number of variables. In what follows we will distinguish scalars, vectors and
matrices by their typeface. We will let M(n,m) denote the space of real n×m matrices with n rows and m columns.
Such matrices will be denoted using bold capital letters: A, X, Y, etc. An element of M(n,1), that is, a column vector,
is denoted with a boldface lowercase letter: a, x, y, etc. An element of M(1,1) is a scalar, denoted with lowercase
italic typeface: a, t, x, etc. XT denotes matrix transpose, tr(X) is the trace, and det(X) is the determinant. All functions
are assumed to be of differentiability class C1 unless otherwise noted. Generally letters from first half of the alphabet
(a, b, c, …) will be used to denote constants, and from the second half (t, x, y, …) to denote variables.
NOTE: As mentioned above, there are competing notations for laying out systems of partial derivatives in vectors
and matrices, and no standard appears to be emerging as of yet. The next two introductory sections use the numerator
layout convention simply for the purposes of convenience, to avoid overly complicating the discussion. The section
after them discusses layout conventions in more detail. It is important to realize the following:
1. Despite the use of the terms "numerator layout" and "denominator layout", there are actually more than two
possible notational choices involved. The reason is that the choice of numerator vs. denominator (or in some
situations, numerator vs. mixed) can be made independently for scalar-by-vector, vector-by-scalar,
vector-by-vector, and scalar-by-matrix derivatives, and a number of authors mix and match their layout choices in
various ways.
2. The choice of numerator layout in the introductory sections below does not imply that this is the "correct" or
"superior" choice. There are advantages and disadvantages to the various layout types. Serious mistakes can result
from carelessly combining formulas written in different layouts, and converting from one layout to another
requires care to avoid errors. As a result, when working with existing formulas the best policy is probably to
identify whichever layout is used and maintain consistency with it, rather than attempting to use the same layout
in all situations.

Alternatives
The tensor index notation with its Einstein summation convention is very similar to the matrix calculus, except one
writes only a single component at a time. It has the advantage that one can easily manipulate arbitrarily high rank
tensors, whereas tensors of rank higher than two are quite unwieldy with matrix notation. All of the work here can be
done in this notation without use of the single-variable matrix notation. However, many problems in estimation
theory and other areas of applied mathematics would result in too many indices to properly keep track of, pointing in
favor of matrix calculus in those areas. Also, Einstein notation can be very useful in proving the identities presented
here, as an alternative to typical element notation, which can become cumbersome when the explicit sums are carried
around. Note that a matrix can be considered a tensor of rank two.

Derivatives with vectors


Because vectors are matrices with only one column, the simplest matrix derivatives are vector derivatives.
The notations developed here can accommodate the usual operations of vector calculus by identifying the space
M(n,1) of n-vectors with the Euclidean space Rn, and the scalar M(1,1) is identified with R. The corresponding
concept from vector calculus is indicated at the end of each subsection.
NOTE: The discussion in this section assumes the numerator layout convention for pedagogical purposes. Some
authors use different conventions. The section on layout conventions discusses this issue in greater detail. The
identities given further down are presented in forms that can be used in conjunction with all common layout
conventions.

Vector-by-scalar

The derivative of a vector y = [y1 y2 … ym]T, by a scalar x is written (in numerator layout notation) as

∂y/∂x = [∂y1/∂x  ∂y2/∂x  …  ∂ym/∂x]T

In vector calculus the derivative of a vector y with respect to a scalar x is known as the tangent vector of the vector y. Notice here that y : R → Rm.

Example: Simple examples of this include the velocity vector in Euclidean space, which is the tangent vector of the position vector (considered as a function of time). Also, the acceleration is the tangent vector of the velocity.
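This example can be sketched numerically. The circular motion curve r(t) = (cos t, sin t) and the finite-difference helper below are invented for illustration, not from the source; they approximate the tangent vector (velocity) by a central difference:

```python
import math

def position(t):
    # Hypothetical position vector r(t) = (cos t, sin t): a point on the unit circle.
    return (math.cos(t), math.sin(t))

def tangent(r, t, h=1e-6):
    # Central-difference approximation of the derivative of a vector by a scalar:
    # each component of r is differentiated with respect to t.
    a, b = r(t - h), r(t + h)
    return tuple((bi - ai) / (2 * h) for ai, bi in zip(a, b))

t = 1.0
v = tangent(position, t)   # velocity: the tangent vector of the position
# The exact derivative is (-sin t, cos t).
print(v)
```

The same helper applied to `v` itself would approximate the acceleration, the tangent vector of the velocity.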

Scalar-by-vector

The derivative of a scalar y by a vector x = [x1 x2 … xn]T, is written (in numerator layout notation) as

∂y/∂x = [∂y/∂x1  ∂y/∂x2  …  ∂y/∂xn]

In vector calculus the gradient of a scalar field y in the space Rn (whose independent coordinates are the components of x) is the derivative of a scalar by a vector. In physics, the electric field is the negative vector gradient of the electric potential.
The directional derivative of a scalar function f(x) of the space vector x in the direction of the unit vector u is defined using the gradient as follows:

∇u f(x) = lim(h→0) [f(x + h u) − f(x)] / h

Using the notation just defined for the derivative of a scalar with respect to a vector, we can re-write the directional derivative as ∇u f = (∂f/∂x) u. This type of notation will be nice when proving product rules and chain rules that come out looking similar to what we are familiar with for the scalar derivative.
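The gradient and directional derivative can be checked numerically. The scalar field f(x) = x1² + 3x2, the evaluation point, and the finite-difference helper below are invented for illustration:

```python
def f(x):
    # Assumed scalar field f(x) = x1^2 + 3*x2 (not from the source).
    return x[0] ** 2 + 3 * x[1]

def gradient(f, x, h=1e-6):
    # Finite-difference gradient: derivative of a scalar by a vector,
    # one component at a time.
    g = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

x = [2.0, 1.0]
u = [3 / 5, 4 / 5]                # a unit vector
grad = gradient(f, x)             # exact gradient at (2, 1) is (4, 3)
directional = sum(gi * ui for gi, ui in zip(grad, u))   # (df/dx) * u

# Compare with a direct finite difference of f along u.
h = 1e-6
direct = (f([xi + h * ui for xi, ui in zip(x, u)]) - f(x)) / h
print(directional, direct)
```

Both numbers agree (here 4.8), matching the claim that the directional derivative is the gradient contracted with the direction vector.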

Vector-by-vector
Each of the previous two cases can be considered as an application of the derivative of a vector with respect to a
vector, using a vector of size one appropriately. Similarly we will find that the derivatives involving matrices will
reduce to derivatives involving vectors in a corresponding way.

The derivative of a vector function y (a vector whose components are functions), with respect to an independent vector x, is written (in numerator layout notation) as ∂y/∂x, the m×n matrix whose (i, j) entry is ∂yi/∂xj.

In vector calculus the derivative of a vector function y with respect to a vector x whose components represent a space is known as the pushforward or differential, or the Jacobian matrix.

The pushforward along a vector function f with respect to vector v in Rn is given by df(v) = (∂f/∂v) dv.
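A small numeric sketch of the Jacobian and the pushforward, with an invented map f(x) = (x1·x2, x1 + x2) and an invented direction vector (neither is from the source):

```python
def f(x):
    # Assumed vector function f: R^2 -> R^2, f(x) = (x1*x2, x1 + x2).
    return [x[0] * x[1], x[0] + x[1]]

def jacobian(f, x, h=1e-6):
    # Numerator-layout Jacobian: entry (i, j) is d f_i / d x_j,
    # approximated by central differences.
    m, n = len(f(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp = list(x); xp[j] += h
        xm = list(x); xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

x = [2.0, 3.0]
J = jacobian(f, x)                # exact Jacobian at (2, 3): [[3, 2], [1, 1]]
v = [1.0, -1.0]
# The pushforward applies the Jacobian to a direction vector: J * v.
pushforward = [sum(J[i][j] * v[j] for j in range(2)) for i in range(2)]
print(J, pushforward)
```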

Derivatives with matrices


There are two types of derivatives with matrices that can be organized into a matrix of the same size. These are the derivative of a matrix by a scalar and the derivative of a scalar by a matrix, respectively. These can be useful in minimization problems found in many areas of applied mathematics and have adopted the names tangent matrix and gradient matrix respectively, after their analogs for vectors.
NOTE: The discussion in this section assumes the numerator layout convention for pedagogical purposes. Some
authors use different conventions. The section on layout conventions discusses this issue in greater detail. The
identities given further down are presented in forms that can be used in conjunction with all common layout
conventions.

Matrix-by-scalar
The derivative of a matrix function Y by a scalar x is known as the tangent matrix and is given (in numerator layout notation) by ∂Y/∂x, the matrix of the same size as Y whose (i, j) entry is ∂Yij/∂x.

Scalar-by-matrix
The derivative of a scalar function y of a matrix X of independent variables, with respect to the matrix X, is given (in numerator layout notation) by ∂y/∂X, the matrix whose (i, j) entry is ∂y/∂Xji.

Notice that the indexing of the gradient with respect to X is transposed as compared with the indexing of X.
Important examples of scalar functions of matrices include the trace of a matrix and the determinant.
In analog with vector calculus this derivative is often written as ∇X y(X).

Also in analog with vector calculus, the directional derivative of a scalar f(X) of a matrix X in the direction of matrix Y is given by

∇Y f = tr( (∂f/∂X) Y )

It is the gradient matrix, in particular, that finds many uses in minimization problems in estimation theory,
particularly in the derivation of the Kalman filter algorithm, which is of great importance in the field.
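As a numeric sketch of the gradient matrix (the matrices A and X below are invented for illustration): the standard result is that the derivative of tr(AX) with respect to X is A in numerator layout, equivalently AT in denominator layout. The helper computes the denominator-layout form entry by entry:

```python
def trace_AX(A, X):
    # tr(AX) = sum_i sum_j A[i][j] * X[j][i]
    n = len(A)
    return sum(A[i][j] * X[j][i] for i in range(n) for j in range(n))

def grad_matrix(f, X, h=1e-6):
    # Denominator-layout gradient matrix: entry (i, j) is df/dX[i][j],
    # approximated by central differences.
    n = len(X)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            Xp = [row[:] for row in X]; Xp[i][j] += h
            Xm = [row[:] for row in X]; Xm[i][j] -= h
            G[i][j] = (f(Xp) - f(Xm)) / (2 * h)
    return G

A = [[1.0, 2.0], [3.0, 4.0]]   # assumed example matrix
X = [[0.5, -1.0], [2.0, 0.0]]
G = grad_matrix(lambda X_: trace_AX(A, X_), X)
# Denominator layout gives A transpose: [[1, 3], [2, 4]].
print(G)
```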

Other matrix derivatives


The three types of derivatives that have not been considered are those involving vectors-by-matrices,
matrices-by-vectors, and matrices-by-matrices. These are not as widely considered and a notation is not widely
agreed upon. As for vectors, the other two types of higher matrix derivatives can be seen as applications of the
derivative of a matrix by a matrix by using a matrix with one column in the correct place. For this reason, in this
subsection we consider only how one can write the derivative of a matrix by another matrix.
The differential or the matrix derivative of a matrix function F(X) that maps from n×m matrices to p×q matrices, F : M(n,m) → M(p,q), is an element of M(p,q) ⊗ M(m,n), a fourth-rank tensor (the reversal of m and n here indicates the dual space of M(n,m)). In short it is an m×n matrix each of whose entries is a p×q matrix.

and note that each entry is a p×q matrix defined as above. Note also that this matrix has its indexing transposed: m rows and n columns.
rows and n columns. The pushforward along F of an n×m matrix Y in M(n,m) is then

as formal block matrices.

Note that this definition encompasses all of the preceding definitions as special cases.
According to Jan R. Magnus and Heinz Neudecker, the following notations are both unsuitable, as the determinant of
the second resulting matrix would have "no interpretation" and "a useful chain rule does not exist" if these notations
are being used:[2]
Given , a differentiable function of an matrix ,

Given , a differentiable function of an matrix ,

The Jacobian matrix, according to Magnus and Neudecker,[2] is

Layout conventions
This section discusses the similarities and differences between notational conventions that are used in the various
fields that take advantage of matrix calculus. Although there are largely two consistent conventions, some authors
find it convenient to mix the two conventions in forms that are discussed below. After this section equations will be
listed in both competing forms separately.

The fundamental issue is that the derivative of a vector with respect to a vector, i.e. ∂y/∂x, is often written in two competing ways. If the numerator y is of size m and the denominator x of size n, then the result can be laid out as either an m×n matrix or an n×m matrix, i.e. the elements of y laid out in columns and the elements of x laid out in rows, or vice versa. This leads to the following possibilities:
1. Numerator layout, i.e. lay out according to y and xT (i.e. contrarily to x). This is sometimes known as the
Jacobian formulation.
2. Denominator layout, i.e. lay out according to yT and x (i.e. contrarily to y). This is sometimes known as the
Hessian formulation. Some authors term this layout the gradient, in distinction to the Jacobian (numerator
layout), which is its transpose. (However, "gradient" more commonly means the derivative regardless of

layout.)

3. A third possibility sometimes seen is to insist on writing the derivative as ∂y/∂xT (i.e. the derivative is taken with respect to the transpose of x) and follow the numerator layout. This makes it possible to claim that the matrix is laid out according to both numerator and denominator. In practice this produces results the same as the numerator layout.

When handling the gradient ∇y (scalar-by-vector) and the opposite case ∂y/∂x (vector-by-scalar) we have the same issues. To be consistent, we should do one of the following:
1. If we choose numerator layout for the vector-by-vector derivative ∂y/∂x, we should lay out the gradient ∇y as a row vector, and the vector-by-scalar derivative ∂y/∂x as a column vector.
2. If we choose denominator layout for the vector-by-vector derivative ∂y/∂x, we should lay out the gradient ∇y as a column vector, and the vector-by-scalar derivative ∂y/∂x as a row vector.
3. In the third possibility above, we write ∂y/∂xT and ∂y/∂x and use numerator layout.

Not all math textbooks and papers are consistent in this respect. That is, sometimes different conventions are used in different contexts within the same paper. For example, some choose denominator layout for gradients (laying them out as column vectors), but numerator layout for the vector-by-vector derivative ∂y/∂x.
Similarly, when it comes to scalar-by-matrix derivatives ∂y/∂X and matrix-by-scalar derivatives ∂Y/∂x, consistent numerator layout lays out according to Y and XT, while consistent denominator layout lays out according to YT and X. In practice, however, following a denominator layout for ∂Y/∂x and laying the result out according to YT is rarely seen because it makes for ugly formulas that do not correspond to the scalar formulas. As a result, the following layouts can often be found:

1. Consistent numerator layout, which lays out ∂Y/∂x according to Y and ∂y/∂X according to XT.

2. Mixed layout, which lays out ∂Y/∂x according to Y and ∂y/∂X according to X.

3. Use the notation ∂y/∂XT, with results the same as consistent numerator layout.

In the following formulas, we handle the five possible combinations (scalar-by-vector, vector-by-scalar, vector-by-vector, scalar-by-matrix and matrix-by-scalar) separately. We
also handle cases of scalar-by-scalar derivatives that involve an intermediate vector or matrix. (This can arise, for
example, if a multi-dimensional parametric curve is defined in terms of a scalar variable, and then a derivative of a
scalar function of the curve is taken with respect to the scalar that parameterizes the curve.) For each of the various
combinations, we give numerator-layout and denominator-layout results, except in the cases above where
denominator layout rarely occurs. In cases involving matrices where it makes sense, we give numerator-layout and
mixed-layout results. As noted above, cases where vector and matrix denominators are written in transpose notation
are equivalent to numerator layout with the denominators written without the transpose.
Keep in mind that various authors use different combinations of numerator and denominator layouts for different
types of derivatives, and there is no guarantee that an author will consistently use either numerator or denominator
layout for all types. Match up the formulas below with those quoted in the source to determine the layout used for
that particular type of derivative, but be careful not to assume that derivatives of other types necessarily follow the
same kind of layout.
When taking derivatives with an aggregate (vector or matrix) denominator in order to find a maximum or minimum
of the aggregate, it should be kept in mind that using numerator layout will produce results that are transposed with
respect to the aggregate. For example, in attempting to find the maximum likelihood estimate of a multivariate
normal distribution using matrix calculus, if the domain is a k×1 column vector, then the result using the numerator layout will be in the form of a 1×k row vector. Thus, either the results should be transposed at the end or the denominator layout (or mixed layout) should be used.

Result of differentiating various kinds of aggregates with other kinds of aggregates:

• Scalar y by scalar x: ∂y/∂x, a scalar.
• Vector y (size m) by scalar x: ∂y/∂x, a size-m column vector (numerator layout) or a size-m row vector (denominator layout).
• Matrix Y (size m×n) by scalar x: ∂Y/∂x, an m×n matrix (numerator layout).
• Scalar y by vector x (size n): ∂y/∂x, a size-n row vector (numerator layout) or a size-n column vector (denominator layout).
• Vector y (size m) by vector x (size n): ∂y/∂x, an m×n matrix (numerator layout) or an n×m matrix (denominator layout).
• Scalar y by matrix X (size p×q): ∂y/∂X, a q×p matrix (numerator layout) or a p×q matrix (denominator layout).
• Matrix by vector, vector by matrix, and matrix by matrix: no widely agreed notation (see Other matrix derivatives above).

The results of operations will be transposed when switching between numerator-layout and denominator-layout
notation.
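This transposition relationship can be demonstrated numerically. The map f and the finite-difference helper below are invented for illustration; the helper returns the derivative in either layout:

```python
def f(x):
    # Assumed example map f(x) = (x1 + 2*x2, 3*x1*x2); the numerator-layout
    # Jacobian at (1, 1) is [[1, 2], [3, 3]].
    return [x[0] + 2 * x[1], 3 * x[0] * x[1]]

def derivative(f, x, layout="numerator", h=1e-6):
    m, n = len(f(x)), len(x)
    num = [[0.0] * n for _ in range(m)]   # numerator layout: (i, j) -> df_i/dx_j
    for j in range(n):
        xp = list(x); xp[j] += h
        xm = list(x); xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(m):
            num[i][j] = (fp[i] - fm[i]) / (2 * h)
    if layout == "numerator":
        return num
    # Denominator layout is the transpose: (j, i) -> df_i/dx_j.
    return [[num[i][j] for i in range(m)] for j in range(n)]

x = [1.0, 1.0]
N = derivative(f, x, "numerator")
D = derivative(f, x, "denominator")
print(N, D)   # D is the transpose of N
```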

Numerator-layout notation
Using numerator-layout notation, we have:[1]

The following definitions are only provided in numerator-layout notation:



Denominator-layout notation
Using denominator-layout notation, we have:[3][4]

Identities
As noted above, in general, the results of operations will be transposed when switching between numerator-layout
and denominator-layout notation.
To help make sense of all the identities below, keep in mind the most important rules: the chain rule, product rule
and sum rule. The sum rule applies universally, and the product rule applies in most of the cases below, provided that
the order of matrix products is maintained, since matrix products are not commutative. The chain rule applies in
some of the cases, but unfortunately does not apply in matrix-by-scalar derivatives or scalar-by-matrix derivatives
(in the latter case, mostly involving the trace operator applied to matrices). In the latter case, the product rule can't
quite be applied directly, either, but the equivalent can be done with a bit more work using the differential identities.

Vector-by-vector identities
This is presented first because all of the operations that apply to vector-by-vector differentiation apply directly to
vector-by-scalar or scalar-by-vector differentiation simply by reducing the appropriate vector in the numerator or
denominator to a scalar.

Identities: vector-by-vector
Condition Expression Numerator layout, i.e. by y and xT Denominator layout, i.e. by yT and x

a is not a function of x

A is not a function of x

A is not a function of x

a is not a function of
x,
u = u(x)

A is not a function of
x,
u = u(x)

u = u(x), v = v(x)

u = u(x)

u = u(x)

Scalar-by-vector identities
The fundamental identities are placed above the thick black line.

Identities: scalar-by-vector
Condition Expression Denominator layout,
Numerator layout,
i.e. by x; result is column vector
i.e. by xT; result is row vector

a is not a function of [5] [5]


x

a is not a function of
x,
u = u(x)

u = u(x), v = v(x)

u = u(x), v = v(x)

u = u(x)

u = u(x)

u = u(x), v = v(x)

• assumes numerator layout of  • assumes denominator layout of 

u = u(x), v = v(x),
A is not a function of
x • assumes numerator layout of  • assumes denominator layout of 

a is not a function of
x

A is not a function of
x
b is not a function of
x

A is not a function of
x

A is not a function of
x
A is symmetric

A is not a function of
x

A is not a function of
x
A is symmetric

a is not a function of
x,
u = u(x)
• assumes numerator layout of  • assumes denominator layout of 

a, b are not functions


of x

A, b, C, D, e are not
functions of x

a is not a function of
x
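One of the best-known scalar-by-vector identities, ∂(xTAx)/∂x = (A + AT)x in denominator layout, can be verified numerically. The matrix, vector, and finite-difference helper below are invented for illustration:

```python
def quad_form(A, x):
    # y = x^T A x
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def grad(f, x, h=1e-6):
    # Denominator-layout gradient (a column vector): component i is dy/dx_i.
    g = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

A = [[1.0, 2.0], [0.0, 3.0]]   # assumed non-symmetric example matrix
x = [1.0, 2.0]
numeric = grad(lambda x_: quad_form(A, x_), x)
# Identity (denominator layout): d(x^T A x)/dx = (A + A^T) x.
analytic = [sum((A[i][j] + A[j][i]) * x[j] for j in range(2)) for i in range(2)]
print(numeric, analytic)
```

For a symmetric A this reduces to the familiar 2Ax, matching the "A is symmetric" rows of the table above.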

Vector-by-scalar identities

Identities: vector-by-scalar
Condition Expression Numerator layout, i.e. by Denominator layout, i.e. by
y, yT,
result is column vector result is row vector

a is not a function of x

a is not a function of
x,
u = u(x)

A is not a function of
x,
u = u(x)

u = u(x)

u = u(x), v = v(x)

u = u(x), v = v(x)

u = u(x)

Assumes consistent matrix layout; see below.



u = u(x)

Assumes consistent matrix layout; see below.

NOTE: The formulas involving the vector-by-vector derivatives ∂u/∂x and ∂v/∂x (whose outputs are matrices) assume the matrices are laid out consistent with the vector layout, i.e. numerator-layout matrix when numerator-layout vector and vice versa; otherwise, transpose the vector-by-vector derivatives.

Scalar-by-matrix identities
Note that exact equivalents of the scalar product rule and chain rule do not exist when applied to matrix-valued
functions of matrices. However, the product rule of this sort does apply to the differential form (see below), and this
is the way to derive many of the identities below involving the trace function, combined with the fact that the trace
function allows transposing and cyclic permutation, i.e.:

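The transposition and cyclic-permutation properties of the trace, tr(A) = tr(AT) and tr(ABC) = tr(BCA) = tr(CAB), can be checked numerically. The 2×2 matrices below are invented for illustration:

```python
def matmul(A, B):
    # Plain n x n matrix product.
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

# Assumed example matrices; any square matrices of matching size would do.
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
C = [[2.0, -1.0], [0.5, 3.0]]

t1 = trace(matmul(matmul(A, B), C))   # tr(ABC)
t2 = trace(matmul(matmul(B, C), A))   # tr(BCA)
t3 = trace(matmul(matmul(C, A), B))   # tr(CAB)
print(t1, t2, t3)   # all equal: the trace is invariant under cyclic permutation
```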
For example, to compute

Therefore,

Identities: scalar-by-matrix
Condition Expression Denominator layout, i.e. by X
Numerator layout, i.e. by XT

a is not a function of [6] [6]


X

a is not a function of
X, u = u(X)

u = u(X), v = v(X)

u = u(X), v = v(X)

u = u(X)

u = u(X)

U = U(X)
[7]
    

Both forms assume numerator layout for ∂U/∂Xij, i.e. mixed layout if denominator layout for X is being used.

U = U(X), V = V(X)

a is not a function of
X,
U = U(X)

g(X) is any
polynomial with
scalar coefficients,
or any matrix
function defined by
an infinite
polynomial series
(e.g. eX, sin(X),
cos(X), ln(X), etc.
using a Taylor
series); g(x) is the
equivalent scalar
function, g′(x) is its
derivative, and
g′(X) is the
corresponding
matrix function

A is not a function [8]


    
of X

A is not a function [7]


    
of X

A is not a function
[7]
of X     

A is not a function
[7]
of X     

A, B are not
functions of X

A, B, C are not
functions of X

n is a positive
[7]
integer     

A is not a function
[7]
of X,     
n is a positive
integer

[7]
    

[7]
    

[9]
    

a is not a function of
[7] [10]
X

A, B are not
[7]
functions of X     

n is a positive
[7]
integer     

(see pseudo-inverse)
[7]
    

(see pseudo-inverse)
[7]
    

A is not a function
of X,
X is square and
invertible

A is not a function
of X,
X is non-square,
A is symmetric

A is not a function
of X,
X is non-square,
A is non-symmetric

Matrix-by-scalar identities

Identities: matrix-by-scalar
Condition Expression Numerator layout, i.e. by Y

U = U(x)

A, B are not functions of x,


U = U(x)

U = U(x), V = V(x)

U = U(x), V = V(x)

U = U(x), V = V(x)

U = U(x), V = V(x)

U = U(x)

U = U(x,y)

A is not a function of x, g(X) is any polynomial with scalar


coefficients, or any matrix function defined by an infinite
polynomial series (e.g. eX, sin(X), cos(X), ln(X), etc.); g(x) is the
equivalent scalar function, g′(x) is its derivative, and g′(X) is the
corresponding matrix function

A is not a function of x

Scalar-by-scalar identities

With vectors involved

Identities: scalar-by-scalar, with vectors involved


Condition Expression Any layout (assumes dot product ignores row vs. column layout)

u = u(x)

u = u(x), v = v(x)

With matrices involved

Identities: scalar-by-scalar, with matrices involved[7]


Condition Expression Mixed layout,
Consistent numerator layout,
i.e. by Y and X
i.e. by Y and XT

U = U(x)

U = U(x)

U = U(x)

U = U(x)

A is not a function of x, g(X)


is any polynomial with scalar
coefficients, or any matrix
function defined by an infinite
polynomial series (e.g. eX,
sin(X), cos(X), ln(X), etc.);
g(x) is the equivalent scalar
function, g′(x) is its derivative,
and g′(X) is the corresponding
matrix function.

A is not a function of x

Identities in differential form


It is often easier to work in differential form and then convert back to normal derivatives. This only works well using
the numerator layout.

Differential identities: scalar involving matrix [1][7]


Condition Expression Result (numerator layout)

Differential identities: matrix [1][7]


Condition Expression Result (numerator layout)

A is not a function of X

a is not a function of X

(Kronecker product)

(Hadamard product)

(conjugate transpose)

To convert to normal derivative form, first convert it to one of the following canonical forms, and then use these
identities:

Conversion from differential to derivative form [1]


Canonical differential form Equivalent derivative form

Notes
[1] Minka, Thomas P. "Old and New Matrix Algebra Useful for Statistics." December 28, 2000. (http://research.microsoft.com/en-us/um/people/minka/papers/matrix/)
[2] Magnus, Jan R.; Neudecker, Heinz (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley Series in Probability and Statistics (2nd ed.). Wiley. pp. 171–173.
[3] (http://www.colorado.edu/engineering/CAS/courses.d/IFEM.d/IFEM.AppD.pdf)
[4] Dattorro, Jon (2005). Convex Optimization & Euclidean Distance Geometry, Appendix D. Meboo Publishing USA. Version v2010.01.05.
[5] Here, 0 refers to a column vector of all 0's, of size n, where n is the length of x.
[6] Here, 0 refers to a matrix of all 0's, of the same shape as X.
[7] Petersen, Kaare Brandt and Michael Syskind Pedersen. The Matrix Cookbook. November 14, 2008. (http://matrixcookbook.com) This book uses a mixed layout, i.e. by Y in ∂Y/∂x and by X in ∂y/∂X.
[8] Duchi, John C. "Properties of the Trace and Matrix Derivatives" (http://www.cs.berkeley.edu/~jduchi/projects/matrix_prop.pdf). University of California at Berkeley. Retrieved 19 July 2011.
[9] See Determinant#Derivative for the derivation.
[10] The constant a disappears in the result. This is intentional: in general, the derivative of ln(au) coincides with the derivative of ln(u), since d ln(au) = d ln(u).

External links
• Linear Algebra: Determinants, Inverses, Rank (http://www.colorado.edu/engineering/cas/courses.d/IFEM.d/IFEM.AppD.d/IFEM.AppD.pdf) appendix D from the Introduction to Finite Element Methods book at the University of Colorado at Boulder. Uses the Hessian (transpose of the Jacobian) definition of vector and matrix derivatives.
• Matrix Reference Manual (http://www.psi.toronto.edu/matrix/calculus.html), Mike Brookes, Imperial
College London.
• The Matrix Cookbook (http://matrixcookbook.com), with a derivatives chapter. Uses the Hessian definition.
• Linear Algebra and its Applications (http://www.wiley.com/WileyCDA/WileyTitle/
productCd-0471751561,descCd-authorInfo.html) (author information page; see Chapter 9 of book), Peter Lax,
Courant Institute.
• Matrix Differentiation (and some other stuff) (http://www.atmos.washington.edu/~dennis/MatrixCalculus.
pdf), Randal J. Barnes, Department of Civil Engineering, University of Minnesota.
• Notes on Matrix Calculus (http://www4.ncsu.edu/~pfackler/MatCalc.pdf), Paul L. Fackler, North Carolina
State University.

• Matrix Differential Calculus (https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester1_2006_7/slide.


pdf) (slide presentation), Zhang Le, University of Edinburgh.
• Introduction to Vector and Matrix Differentiation (http://www.econ.ku.dk/metrics/Econometrics2_05_II/
LectureNotes/matrixdiff.pdf) (notes on matrix differentiation, in the context of Econometrics), Heino Bohn
Nielsen.
• A note on differentiating matrices (http://mpra.ub.uni-muenchen.de/1239/1/MPRA_paper_1239.pdf) (notes
on matrix differentiation), Pawel Koval, from Munich Personal RePEc Archive.
• Vector/Matrix Calculus (http://www.personal.rdg.ac.uk/~sis01xh/teaching/CY4C9/ANN3.pdf) More notes
on matrix differentiation.
• Matrix Identities (http://itee.uq.edu.au/~comp4702/material/matrixid.pdf) (notes on matrix differentiation),
Sam Roweis.

Maximum likelihood
In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical
model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates
for the model's parameters.
The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example,
one may be interested in the heights of adult female giraffes, but be unable to measure the height of every single
giraffe in a population due to cost or time constraints. Assuming that the heights are normally (Gaussian) distributed
with some unknown mean and variance, the mean and variance can be estimated with MLE while only knowing the
heights of some sample of the overall population. MLE would accomplish this by taking the mean and variance as
parameters and finding particular parametric values that make the observed results the most probable (given the
model).
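The giraffe example can be sketched numerically. The parameter values and sample size below are invented for illustration; for a normal model the MLE has the closed form computed here (the sample mean, and the variance with divisor n rather than n − 1):

```python
import math
import random

random.seed(0)
true_mu, true_sigma = 4.3, 0.5     # hypothetical "giraffe height" parameters (metres)
sample = [random.gauss(true_mu, true_sigma) for _ in range(10000)]

n = len(sample)
mu_hat = sum(sample) / n                              # MLE of the mean: sample mean
var_hat = sum((x - mu_hat) ** 2 for x in sample) / n  # MLE of the variance: divides by n, not n-1
print(mu_hat, math.sqrt(var_hat))
```

With a sample this large, the estimates land close to the values used to generate the data, illustrating the consistency property discussed later in the article.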
In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects values
of the model parameters that produce a distribution that gives the observed data the greatest probability (i.e.,
parameters that maximize the likelihood function). Maximum-likelihood estimation gives a unified approach to
estimation, which is well-defined in the case of the normal distribution and many other problems. However, in some
complicated problems, difficulties do occur: in such problems, maximum-likelihood estimators are unsuitable or do
not exist.

Principles
Suppose there is a sample x1, x2, …, xn of n independent and identically distributed observations, coming from a
distribution with an unknown pdf ƒ0(·). It is however surmised that the function ƒ0 belongs to a certain family of
distributions { ƒ(·|θ), θ ∈ Θ }, called the parametric model, so that ƒ0 = ƒ(·|θ0). The value θ0 is unknown and is
referred to as the "true value" of the parameter. It is desirable to find an estimator which would be as close to the
true value θ0 as possible. Both the observed variables xi and the parameter θ can be vectors.
To use the method of maximum likelihood, one first specifies the joint density function for all observations. For an
i.i.d. sample, this joint density function is

Now we look at this function from a different perspective by considering the observed values x1, x2, ..., xn to be fixed
"parameters" of this function, whereas θ will be the function's variable and allowed to vary freely; this function will
be called the likelihood:

In practice it is often more convenient to work with the logarithm of the likelihood function, called the
log-likelihood:

or the average log-likelihood:

The hat over ℓ indicates that it is akin to some estimator. Indeed, ℓ̂ estimates the expected log-likelihood of a single observation in the model.
The method of maximum likelihood estimates θ0 by finding a value of θ that maximizes ℓ̂(θ|x). This method of estimation defines a maximum-likelihood estimator (MLE) of θ0

if any maximum exists. An MLE estimate is the same regardless of whether we maximize the likelihood or the log-likelihood function, since log is a monotonically increasing transformation.
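This invariance can be demonstrated with a small sketch; the Bernoulli data and the grid search below are invented for illustration and are not from the source:

```python
import math

# Assumed data: 7 successes out of 10 Bernoulli trials.
data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

def likelihood(p):
    return math.prod(p if x == 1 else (1 - p) for x in data)

def log_likelihood(p):
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

# Grid search over p: both objectives peak at the same value,
# because log is monotonically increasing.
grid = [i / 1000 for i in range(1, 1000)]
p_lik = max(grid, key=likelihood)
p_loglik = max(grid, key=log_likelihood)
print(p_lik, p_loglik)   # both 0.7, the sample proportion of successes
```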
For many models, a maximum likelihood estimator can be found as an explicit function of the observed data x1, …,
xn. For many other models, however, no closed-form solution to the maximization problem is known or available,
and an MLE has to be found numerically using optimization methods. For some problems, there may be multiple
estimates that maximize the likelihood. For other problems, no maximum likelihood estimate exists (meaning that
the log-likelihood function increases without attaining the supremum value).
In the exposition above, it is assumed that the data are independent and identically distributed. The method can be
applied however to a broader setting, as long as it is possible to write the joint density function ƒ(x1,…,xn | θ), and its
parameter θ has a finite dimension which does not depend on the sample size n. In a simpler extension, an allowance
can be made for data heterogeneity, so that the joint density is equal to ƒ1(x1|θ) · ƒ2(x2|θ) · … · ƒn(xn|θ). In the more
complicated case of time series models, the independence assumption may have to be dropped as well.
A maximum likelihood estimator coincides with the most probable Bayesian estimator given a uniform prior
distribution on the parameters.

Properties
A maximum-likelihood estimator is an extremum estimator obtained by maximizing, as a function of θ, the objective
function

this being the sample analogue of the expected log-likelihood ℓ(θ) = E[ln f(x|θ)], where this expectation is taken with respect to the true density f(·|θ0).
Maximum-likelihood estimators have no optimum properties for finite samples, in the sense that (when evaluated on
finite samples) other estimators have greater concentration around the true parameter-value.[1] However, like other
estimation methods, maximum-likelihood estimation possesses a number of attractive limiting properties: As the
sample-size increases to infinity, sequences of maximum-likelihood estimators have these properties:
• Consistency: a subsequence of the sequence of MLEs converges in probability to the value being estimated.
• Asymptotic normality: as the sample size increases, the distribution of the MLE tends to the Gaussian distribution with mean θ0 and covariance matrix equal to the inverse of the Fisher information matrix.

• Efficiency, i.e., it achieves the Cramér–Rao lower bound when the sample size tends to infinity. This means that
no asymptotically unbiased estimator has lower asymptotic mean squared error than the MLE (or other estimators
attaining this bound).
• Second-order efficiency after correction for bias.

Consistency
Under the conditions outlined below, the maximum likelihood estimator is consistent: given a sufficiently large number of observations n, it is possible to find the value of θ0 with arbitrary precision. In mathematical terms this means that as n goes to infinity the estimator converges in probability to its true value:

Under slightly stronger conditions, the estimator converges almost surely (or strongly) to θ0.

To establish consistency, the following conditions are sufficient:[2]


1. Identification of the model: distinct parameter values must index distinct distributions, i.e. θ ≠ θ0 implies f(·|θ) ≠ f(·|θ0).
In other words, different parameter values θ correspond to different distributions within the model. If this
condition did not hold, there would be some value θ1 such that θ0 and θ1 generate an identical distribution of the
observable data. Then we wouldn't be able to distinguish between these two parameters even with an infinite
amount of data — these parameters would have been observationally equivalent.
The identification condition is absolutely necessary for the ML estimator to be consistent. When this condition holds, the limiting likelihood function ℓ(θ|·) has a unique global maximum at θ0.
2. Compactness: the parameter space Θ of the model is compact.
The identification condition establishes that the log-likelihood has a unique global maximum. Compactness implies that the likelihood cannot approach the maximum value arbitrarily closely at some other point (as demonstrated for example in the picture on the right). Compactness is only a sufficient condition and not a necessary condition. Compactness can be replaced by some other conditions, such as:
• both concavity of the log-likelihood function and compactness of some (nonempty) upper level sets of the
log-likelihood function, or
• existence of a compact neighborhood N of θ0 such that outside of N the log-likelihood function is less than the
maximum by at least some ε > 0.
3. Continuity: the function ln f(x|θ) is continuous in θ for almost all values of x:

The continuity here can be replaced with a slightly weaker condition of upper semi-continuity.
4. Dominance: there exists an integrable function D(x) such that

By the uniform law of large numbers, the dominance condition together with continuity establish the uniform
convergence in probability of the log-likelihood:

The dominance condition can be employed in the case of i.i.d. observations. In the non-i.i.d. case the uniform
convergence in probability can be checked by showing that the sequence is stochastically equicontinuous.

If one wants to demonstrate that the ML estimator converges to θ0 almost surely, then a stronger condition of
uniform convergence almost surely has to be imposed:

Asymptotic normality
Maximum-likelihood estimators can lack asymptotic normality and can be inconsistent if there is a failure of one (or
more) of the below regularity conditions:
Estimate on boundary. Sometimes the maximum likelihood estimate lies on the boundary of the set of possible
parameters, or (if the boundary is not, strictly speaking, allowed) the likelihood gets larger and larger as the
parameter approaches the boundary. Standard asymptotic theory needs the assumption that the true parameter value
lies away from the boundary. If we have enough data, the maximum likelihood estimate will keep away from the
boundary too. But with smaller samples, the estimate can lie on the boundary. In such cases, the asymptotic theory
clearly does not give a practically useful approximation. Examples here would be variance-component models, where each component of variance, σ2, must satisfy the constraint σ2 ≥ 0.
Data boundary parameter-dependent. For the theory to apply in a simple way, the set of data values which has
positive probability (or positive probability density) should not depend on the unknown parameter. A simple
example where such parameter-dependence does hold is the case of estimating θ from a set of independent identically distributed observations when the common distribution is uniform on the range (0,θ). For estimation purposes the relevant range of θ is such that θ cannot be less than the largest observation. Because the interval (0,θ) is not compact, there exists no maximum for the likelihood function: for any estimate of θ exceeding the largest observation, there exists a smaller such estimate that has greater likelihood. In contrast, the interval [0,θ] includes the end-point θ and is compact, in which case the maximum-likelihood estimator exists. However, in this case, the maximum-likelihood estimator is biased. Asymptotically, this maximum-likelihood estimator is not normally distributed.[3]
Nuisance parameters. For maximum likelihood estimations, a model may have a number of nuisance parameters.
For the asymptotic behaviour outlined to hold, the number of nuisance parameters should not increase with the
number of observations (the sample size). A well-known example of this case is where observations occur as pairs,
where the observations in each pair have a different (unknown) mean but otherwise the observations are independent
and Normally distributed with a common variance. Here for 2N observations, there are N+1 parameters. It is well
known that the maximum likelihood estimate for the variance does not converge to the true value of the variance.
Increasing information. For the asymptotics to hold in cases where the assumption of independent identically
distributed observations does not hold, a basic requirement is that the amount of information in the data increases
indefinitely as the sample size increases. Such a requirement may not be met if either there is too much dependence
in the data (for example, if new observations are essentially identical to existing observations), or if new independent
observations are subject to an increasing observation error.
Some regularity conditions which ensure this behavior are:
1. The first and second derivatives of the log-likelihood function must be defined.
2. The Fisher information matrix must not be zero, and must be continuous as a function of the parameter.
3. The maximum likelihood estimator is consistent.
Suppose that conditions for consistency of maximum likelihood estimator are satisfied, and[4]
1. θ0 ∈ interior(Θ);
2. f(x|θ) > 0 and is twice continuously differentiable in θ in some neighborhood N of θ0;
3. ∫ sup_{θ∈N} ‖∇_θ f(x|θ)‖ dx < ∞, and ∫ sup_{θ∈N} ‖∇_{θθ} f(x|θ)‖ dx < ∞;
4. I = E[ ∇_θ ln f(x|θ0) · ∇_θ ln f(x|θ0)′ ] exists and is nonsingular;
5. E[ sup_{θ∈N} ‖∇_{θθ} ln f(x|θ)‖ ] < ∞.

Then the maximum likelihood estimator has asymptotically normal distribution:

    \sqrt{n}\,\bigl(\hat\theta_{\mathrm{mle}} - \theta_0\bigr) \ \xrightarrow{d}\ \mathcal{N}\bigl(0,\ I^{-1}\bigr).
Proof, skipping the technicalities:


Since the log-likelihood function is differentiable, and θ0 lies in the interior of the parameter set, at the maximum the
first-order condition will be satisfied:

    0 = \nabla_\theta \hat\ell(\hat\theta\mid x) = \frac{1}{n}\sum_{i=1}^n \nabla_\theta \ln f(x_i\mid\hat\theta).
When the log-likelihood is twice differentiable, this expression can be expanded into a Taylor series around the point
θ = θ0:

    0 = \nabla_\theta \hat\ell(\theta_0\mid x) + \bigl[\,\nabla_{\theta\theta}\hat\ell(\tilde\theta\mid x)\,\bigr]\,(\hat\theta - \theta_0),

where \tilde\theta is some point intermediate between θ0 and \hat\theta. From this expression we can derive that

    \sqrt{n}\,(\hat\theta - \theta_0) = \bigl[\,-\nabla_{\theta\theta}\hat\ell(\tilde\theta\mid x)\,\bigr]^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n \nabla_\theta \ln f(x_i\mid\theta_0).
Here the expression in square brackets converges in probability to H = E[−∇θθln f(x|θ0)] by the law of large
numbers. The continuous mapping theorem ensures that the inverse of this expression also converges in probability,
to H−1. The second sum, by the central limit theorem, converges in distribution to a multivariate normal with mean
zero and variance matrix equal to the Fisher information I. Thus, applying Slutsky's theorem to the whole
expression, we obtain that

    \sqrt{n}\,(\hat\theta - \theta_0) \ \xrightarrow{d}\ \mathcal{N}\bigl(0,\ H^{-1} I\, H^{-1}\bigr).
Finally, the information equality guarantees that when the model is correctly specified, matrix H will be equal to the
Fisher information I, so that the variance expression simplifies to just I−1.
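As an illustrative check of this result (a sketch not present in the original; the Bernoulli model, seed, and sample sizes are chosen here for convenience): for i.i.d. Bernoulli(p0) observations the MLE is the sample mean and the Fisher information is I = 1/(p0(1 − p0)), so the theorem predicts that √n(p̂ − p0) has variance close to I⁻¹ = p0(1 − p0).

```python
import random
import statistics

# Monte Carlo check of the asymptotic-normality result for the Bernoulli MLE.
# For X_i ~ Bernoulli(p0) the MLE is the sample mean, and the Fisher
# information is I = 1/(p0*(1 - p0)), so sqrt(n)*(p_hat - p0) should have
# variance close to I^{-1} = p0*(1 - p0) = 0.21 for p0 = 0.3.
random.seed(0)
p0, n, reps = 0.3, 2000, 2000
scaled_errors = []
for _ in range(reps):
    p_hat = sum(random.random() < p0 for _ in range(n)) / n
    scaled_errors.append((n ** 0.5) * (p_hat - p0))

empirical_var = statistics.pvariance(scaled_errors)
theoretical_var = p0 * (1 - p0)  # = I^{-1}
print(empirical_var, theoretical_var)
```

The empirical variance of the scaled errors lands near 0.21, consistent with the stated limiting distribution.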

Functional invariance
The maximum likelihood estimator selects the parameter value which gives the observed data the largest possible
probability (or probability density, in the continuous case). If the parameter consists of a number of components,
then we define their separate maximum likelihood estimators as the corresponding components of the MLE of the
complete parameter. Consistent with this, if \hat\theta is the MLE for θ, and if g(θ) is any transformation of θ, then the
MLE for α = g(θ) is by definition

    \hat\alpha = g(\hat\theta\,).
It maximizes the so-called profile likelihood:

    \bar L(\alpha) = \sup_{\theta:\ \alpha = g(\theta)} L(\theta).
The MLE is also invariant with respect to certain transformations of the data. If Y = g(X) where g is one to one and
does not depend on the parameters to be estimated, then the density functions satisfy

    f_Y(y) = f_X\bigl(g^{-1}(y)\bigr)\,\Bigl|\tfrac{d}{dy}\,g^{-1}(y)\Bigr|,
and hence the likelihood functions for X and Y differ only by a factor that does not depend on the model parameters.
For example, the MLE parameters of the log-normal distribution are the same as those of the normal distribution
fitted to the logarithm of the data.
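The log-normal example can be checked numerically (an illustrative sketch under simulated data; all variable and helper names here are ours):

```python
import math
import random

# The MLE of the log-normal parameters (mu, sigma^2) coincides with the
# normal-distribution MLE computed on the logged data.
random.seed(1)
x = [math.exp(random.gauss(0.5, 1.2)) for _ in range(1000)]  # log-normal sample

log_x = [math.log(v) for v in x]
mu_hat = sum(log_x) / len(log_x)                              # normal MLE of mu
var_hat = sum((v - mu_hat) ** 2 for v in log_x) / len(log_x)  # normal MLE of sigma^2

def lognormal_loglik(mu, var):
    """Log-likelihood of the log-normal sample x at parameters (mu, var)."""
    return sum(-math.log(v) - 0.5 * math.log(2 * math.pi * var)
               - (math.log(v) - mu) ** 2 / (2 * var) for v in x)

# Nudging the parameters in any direction cannot increase the likelihood,
# confirming that (mu_hat, var_hat) also maximizes the log-normal likelihood.
best = lognormal_loglik(mu_hat, var_hat)
assert all(lognormal_loglik(mu_hat + dm, var_hat + dv) <= best
           for dm in (-0.01, 0.0, 0.01) for dv in (-0.01, 0.0, 0.01))
print(mu_hat, var_hat)
```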

Higher-order properties
The standard asymptotics tells us that the maximum-likelihood estimator is √n-consistent and asymptotically efficient,
meaning that it reaches the Cramér–Rao bound:

    \sqrt{n}\,\bigl(\hat\theta_{\mathrm{mle}} - \theta_0\bigr) \ \xrightarrow{d}\ \mathcal{N}\bigl(0,\ I^{-1}\bigr),

where I is the Fisher information matrix:

    I = \mathrm{E}\bigl[\,\nabla_\theta \ln f(x\mid\theta_0)\ \nabla_\theta \ln f(x\mid\theta_0)'\,\bigr].
In particular, it means that the bias of the maximum-likelihood estimator is equal to zero up to the order n^{-1/2}.
However, when we consider the higher-order terms in the expansion of the distribution of this estimator, it turns out
that \hat\theta_{\mathrm{mle}} has bias of order n^{-1}. This bias is equal to (componentwise)[5]

where Einstein's summation convention over repeated indices has been adopted; I^{jk} denotes the (j, k)-th
component of the inverse Fisher information matrix I^{-1}, and

Using these formulas it is possible to estimate the second-order bias of the maximum likelihood estimator, and
correct for that bias by subtracting it:

    \hat\theta^{\,*}_{\mathrm{mle}} = \hat\theta_{\mathrm{mle}} - \hat{b}\,,

where \hat{b} is the estimated bias.
This estimator is unbiased up to the terms of order n−1, and is called the bias-corrected maximum likelihood
estimator.
This bias-corrected estimator is second-order efficient (at least within the curved exponential family), meaning that it
has minimal mean squared error among all second-order bias-corrected estimators, up to the terms of the order n−2. It
is possible to continue this process, that is to derive the third-order bias-correction term, and so on. However as was
shown by Kano (1996), the maximum-likelihood estimator is not third-order efficient.

Examples

Discrete uniform distribution


Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random (see uniform
distribution); thus, the sample size is 1. If n is unknown, then the maximum-likelihood estimator of n is the
number m on the drawn ticket. (The likelihood is 0 for n < m, 1/n for n ≥ m, and this is greatest when n = m. Note
that the maximum likelihood estimate of n occurs at the lower extreme of possible values {m, m + 1, ...}, rather than
somewhere in the "middle" of the range of possible values, which would result in less bias.) The expected value of
the number m on the drawn ticket, and therefore the expected value of , is (n + 1)/2. As a result, the maximum
likelihood estimator for n will systematically underestimate n by (n − 1)/2 with a sample size of 1.
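This one-draw example can be sketched exactly (illustrative code; the helper name `mle_n` is ours):

```python
from fractions import Fraction

# With a single ticket m drawn from {1, ..., n}, the likelihood of n is
# 1/n for n >= m and 0 otherwise, so it is maximized at n_hat = m.
def mle_n(m, candidates):
    return max(candidates, key=lambda n: Fraction(1, n) if n >= m else Fraction(0))

assert mle_n(7, range(1, 101)) == 7  # the MLE equals the observed number

# Bias: for the true n, E[m] = (n + 1)/2, so E[n_hat] - n = -(n - 1)/2.
n_true = 10
expected_m = Fraction(sum(range(1, n_true + 1)), n_true)
assert expected_m == Fraction(n_true + 1, 2)
print(expected_m - n_true)  # -9/2
```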

Discrete distribution, finite parameter space


Suppose one wishes to determine just how biased an unfair coin is. Call the probability of tossing a HEAD p. The
goal then becomes to determine p.
Suppose the coin is tossed 80 times: i.e., the sample might be something like x1 = H, x2 = T, ..., x80 = T, and the
count of the number of HEADS "H" is observed.
The probability of tossing TAILS is 1 − p (so here p is θ above). Suppose the outcome is 49 HEADS and 31 TAILS,
and suppose the coin was taken from a box containing three coins: one which gives HEADS with probability p = 1/3,
one which gives HEADS with probability p = 1/2 and another which gives HEADS with probability p = 2/3. The
coins have lost their labels, so which one it was is unknown. Using maximum likelihood estimation the coin that
has the largest likelihood can be found, given the data that were observed. By using the probability mass function of
the binomial distribution with sample size equal to 80, number of successes equal to 49, but different values of p (the
"probability of success"), the likelihood function (defined below) takes one of three values:

    \Pr(\mathrm{H}=49 \mid p=\tfrac{1}{3}) = \binom{80}{49}\bigl(\tfrac{1}{3}\bigr)^{49}\bigl(\tfrac{2}{3}\bigr)^{31} \approx 0.000,
    \Pr(\mathrm{H}=49 \mid p=\tfrac{1}{2}) = \binom{80}{49}\bigl(\tfrac{1}{2}\bigr)^{80} \approx 0.012,
    \Pr(\mathrm{H}=49 \mid p=\tfrac{2}{3}) = \binom{80}{49}\bigl(\tfrac{2}{3}\bigr)^{49}\bigl(\tfrac{1}{3}\bigr)^{31} \approx 0.054.

The likelihood is maximized when p = 2/3, and so this is the maximum likelihood estimate for p.
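The three-coin comparison can be reproduced with the binomial pmf (a sketch; the helper name `likelihood` is introduced here):

```python
from math import comb

# Binomial pmf with n = 80 tosses and 49 heads, evaluated at the three
# candidate coins from the example.
def likelihood(p, n=80, heads=49):
    return comb(n, heads) * p ** heads * (1 - p) ** (n - heads)

candidates = [1 / 3, 1 / 2, 2 / 3]
for p in candidates:
    print(p, likelihood(p))

best = max(candidates, key=likelihood)
assert best == 2 / 3  # the coin with p = 2/3 has the largest likelihood
```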

Discrete distribution, continuous parameter space


Now suppose that there was only one coin but its p could have been any value 0 ≤ p ≤ 1. The likelihood function to
be maximised is

    L(p) = f_D(\mathrm{H}=49 \mid p) = \binom{80}{49}\, p^{49}\,(1-p)^{31},
and the maximisation is over all possible values 0 ≤ p ≤ 1.


One way to maximize this function is by differentiating with
respect to p and setting to zero:

    0 = \frac{\partial}{\partial p}\left(\binom{80}{49}\, p^{49}\,(1-p)^{31}\right) \ \propto\ 49\,p^{48}(1-p)^{31} - 31\,p^{49}(1-p)^{30} = p^{48}(1-p)^{30}\bigl[\,49(1-p) - 31p\,\bigr],

[Figure: likelihood function for the proportion parameter of a binomial process (n = 10)]

which has solutions p = 0, p = 1, and p = 49/80. The solution which maximizes the likelihood is clearly p = 49/80
(since p = 0 and p = 1 result in a likelihood of zero). Thus the maximum likelihood estimator for p is 49/80.
This result is easily generalized by substituting a letter such as t in the place of 49 to represent the observed number
of 'successes' of our Bernoulli trials, and a letter such as n in the place of 80 to represent the number of Bernoulli
trials. Exactly the same calculation yields the maximum likelihood estimator t / n for any sequence of n Bernoulli
trials resulting in t 'successes'.
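A brute-force check of this result (an illustrative sketch): maximizing L(p) over a fine grid of p recovers the closed-form answer t/n = 49/80.

```python
from math import comb

# Grid maximization of L(p) = C(80,49) p^49 (1-p)^31; the maximizer should
# agree with the closed-form MLE t/n = 49/80 = 0.6125.
n, t = 80, 49

def L(p):
    return comb(n, t) * p ** t * (1 - p) ** (n - t)

grid = [i / 10000 for i in range(10001)]
p_star = max(grid, key=L)
print(p_star, t / n)
assert abs(p_star - t / n) < 2e-4
```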

Continuous distribution, continuous parameter space


For the normal distribution, which has probability density function

    f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),

the corresponding probability density function for a sample of n independent identically distributed normal random
variables (the likelihood) is

    f(x_1, \ldots, x_n \mid \mu, \sigma^2) = \prod_{i=1}^n f(x_i \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left(-\frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2}\right),

or more conveniently:

    f(x_1, \ldots, x_n \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left(-\frac{\sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2}{2\sigma^2}\right),

where \bar{x} is the sample mean.


This family of distributions has two parameters, θ = (μ, σ); we maximize the likelihood
L(\mu, \sigma) = f(x_1, \ldots, x_n \mid \mu, \sigma) over both parameters simultaneously, or if possible, individually.
Since the logarithm is a continuous strictly increasing function over the range of the likelihood, the values which
maximize the likelihood will also maximize its logarithm. Since maximizing the logarithm often requires simpler
algebra, it is the logarithm which is maximized below. (Note: the log-likelihood is closely related to information
entropy and Fisher information.)
    0 = \frac{\partial}{\partial\mu} \ln f(x_1, \ldots, x_n \mid \mu, \sigma^2) = \frac{n(\bar{x} - \mu)}{\sigma^2},

which is solved by
    \hat\mu = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\,.
This is indeed the maximum of the function since it is the only turning point in μ and the second derivative is strictly
less than zero. Its expectation value is equal to the parameter μ of the given distribution,

which means that the maximum-likelihood estimator is unbiased.


Similarly we differentiate the log-likelihood with respect to σ and equate to zero:

    0 = \frac{\partial}{\partial\sigma} \ln f(x_1, \ldots, x_n \mid \mu, \sigma^2) = -\frac{n}{\sigma} + \frac{\sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2}{\sigma^3},

which is solved by

    \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2.

Inserting the estimate \hat\mu = \bar{x} we obtain

    \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.
To calculate its expected value, it is convenient to rewrite the expression in terms of zero-mean random variables
(statistical error) \delta_i \equiv x_i - \mu. Expressing the estimate in these variables yields

    \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (\delta_i - \bar{\delta})^2 = \frac{1}{n}\sum_{i=1}^n \delta_i^2 - \bar{\delta}^{\,2}.

Simplifying the expression above, utilizing the facts that \mathrm{E}[\delta_i] = 0, \mathrm{E}[\delta_i^2] = \sigma^2, and the independence of the \delta_i, allows us to obtain

    \mathrm{E}\bigl[\hat\sigma^2\bigr] = \frac{n-1}{n}\,\sigma^2.

This means that the estimator \hat\sigma^2 is biased. However, \hat\sigma^2 is consistent.


Formally we say that the maximum likelihood estimator for θ = (μ, σ²) is:

    \hat\theta = \bigl(\hat\mu,\ \hat\sigma^2\bigr) = \left(\bar{x},\ \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2\right).
In this case the MLEs could be obtained individually. In general this may not be the case, and the MLEs would have
to be obtained simultaneously.
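The closed-form estimators and the (n − 1)/n bias factor can be verified by simulation (a sketch with arbitrary parameter choices; all names are illustrative):

```python
import math
import random

# Check that mu_hat = x_bar and sigma2_hat = (1/n) * sum (x_i - x_bar)^2
# maximize the normal log-likelihood, and that E[sigma2_hat] is roughly
# ((n - 1)/n) * sigma^2 rather than sigma^2 (the estimator is biased).
random.seed(2)
n, mu_true, sigma_true = 5, 3.0, 2.0

def mle(sample):
    m = sum(sample) / len(sample)
    v = sum((x - m) ** 2 for x in sample) / len(sample)
    return m, v

def loglik(sample, mu, var):
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in sample)

sample = [random.gauss(mu_true, sigma_true) for _ in range(n)]
mu_hat, var_hat = mle(sample)
best = loglik(sample, mu_hat, var_hat)
assert all(loglik(sample, mu_hat + dm, var_hat * dv) <= best
           for dm in (-0.1, 0.0, 0.1) for dv in (0.9, 1.0, 1.1))

# Average sigma2_hat over many samples: close to (4/5) * 4 = 3.2, not 4.0.
avg_var = sum(mle([random.gauss(mu_true, sigma_true) for _ in range(n)])[1]
              for _ in range(20000)) / 20000
print(avg_var)
```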

Non-independent variables
It may be the case that variables are correlated, that is, not independent. Two random variables X and Y are
independent if and only if their joint probability density function is the product of the individual probability density
functions, i.e.

    f(x, y) = f(x)\,f(y).
Suppose one constructs an order-n Gaussian vector out of random variables (x_1, \ldots, x_n), where each variable has
mean given by (\mu_1, \ldots, \mu_n). Furthermore, let the covariance matrix be denoted by \Sigma.
The joint probability density function of these n random variables is then given by:

    f(x_1, \ldots, x_n) = \frac{1}{(2\pi)^{n/2}\sqrt{\det\Sigma}}\,\exp\!\left(-\frac{1}{2}(\vec{x} - \vec{\mu})^{\mathrm T}\,\Sigma^{-1}\,(\vec{x} - \vec{\mu})\right).

In the two variable case, the joint probability density function is given by:
    f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\,\exp\!\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(x-\mu_x)^2}{\sigma_x^2} - \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y} + \frac{(y-\mu_y)^2}{\sigma_y^2}\right)\right].

In this and other cases where a joint density function exists, the likelihood function is defined as above, under
Principles, using this density.

Applications
Maximum likelihood estimation is used for a wide range of statistical models, including:
• linear models and generalized linear models;
• exploratory and confirmatory factor analysis;
• structural equation modeling;
• many situations in the context of hypothesis testing and confidence interval formation;
• discrete choice models.
These uses arise across applications in a widespread set of fields, including:
• communication systems;
• psychometrics;
• econometrics;
• time-delay of arrival (TDOA) in acoustic or electromagnetic detection;
• data modeling in nuclear and particle physics;
• magnetic resonance imaging;
• computational phylogenetics;
• origin/destination and path-choice modeling in transport networks;
• geographical satellite-image classification.

History
Maximum-likelihood estimation was recommended, analyzed (with flawed attempts at proofs) and vastly
popularized by R. A. Fisher between 1912 and 1922[6] (although it had been used earlier by Gauss, Laplace, T. N.
Thiele, and F. Y. Edgeworth).[7] Reviews of the development of maximum likelihood have been provided by a
number of authors.[8]
Much of the theory of maximum-likelihood estimation was first developed for Bayesian statistics, and then
simplified by later authors.[6]

Notes
[1] Pfanzagl (1994, p. 206)
[2] Newey & McFadden (1994, Theorem 2.5.)
[3] Lehmann & Casella (1998)
[4] Newey & McFadden (1994, Theorem 3.3.)
[5] Cox & Snell (1968, formula (20))
[6] Pfanzagl (1994)
[7] Edgeworth (September 1908) and Edgeworth (December 1908)
[8] Savage (1976), Pratt (1976), Stigler (1978, 1986, 1999), Hald (1998, 1999), and Aldrich (1997)

References
• Aldrich, John (1997). "R. A. Fisher and the making of maximum likelihood 1912–1922". Statistical Science 12
(3): 162–176. doi:10.1214/ss/1030037906. MR1617519.
• Andersen, Erling B. (1970); "Asymptotic Properties of Conditional Maximum Likelihood Estimators", Journal of
the Royal Statistical Society B 32, 283–301
• Andersen, Erling B. (1980); Discrete Statistical Models with Social Science Applications, North Holland, 1980
• Basu, Debabrata (1988); Statistical Information and Likelihood : A Collection of Critical Essays by Dr. D. Basu;
in Ghosh, Jayanta K., editor; Lecture Notes in Statistics, Volume 45, Springer-Verlag, 1988
• Cox, David R.; Snell, E. Joyce (1968). "A general definition of residuals". Journal of the Royal Statistical Society.
Series B (Methodological): 248–275. JSTOR 2984505.
• Edgeworth, Francis Y. (Sep 1908). "On the probable errors of frequency-constants". Journal of the Royal
Statistical Society 71 (3): 499–512. doi:10.2307/2339293. JSTOR 2339293.
• Edgeworth, Francis Y. (Dec 1908). "On the probable errors of frequency-constants". Journal of the Royal
Statistical Society 71 (4): 651–678. doi:10.2307/2339378. JSTOR 2339378.
• Ferguson, Thomas S. (1982). "An inconsistent maximum likelihood estimate". Journal of the American Statistical
Association 77 (380): 831–834. JSTOR 2287314.
• Ferguson, Thomas S. (1996). A course in large sample theory. Chapman & Hall. ISBN 0-412-04371-8.
• Hald, Anders (1998). A history of mathematical statistics from 1750 to 1930. New York, NY: Wiley.
ISBN 0-471-17912-4.
• Hald, Anders (1999). "On the history of maximum likelihood in relation to inverse probability and least squares".
Statistical Science 14 (2): 214–222. JSTOR 2676741.
• Kano, Yutaka (1996). "Third-order efficiency implies fourth-order efficiency" (http://www.journalarchive.jst.
go.jp/english/jnlabstract_en.php?cdjournal=jjss1995&cdvol=26&noissue=1&startpage=101). Journal of the
Japan Statistical Society 26: 101–117.
• Le Cam, Lucien (1990). "Maximum likelihood — an introduction". ISI Review 58 (2): 153–171.
• Le Cam, Lucien; Lo Yang, Grace (2000). Asymptotics in statistics: some basic concepts (Second ed.). Springer.
ISBN 0-387-95036-2.
• Le Cam, Lucien (1986). Asymptotic methods in statistical decision theory. Springer-Verlag. ISBN 0-387-96307-3.
• Lehmann, Erich L.; Casella, George (1998). Theory of Point Estimation, 2nd ed. Springer. ISBN 0-387-98502-6.
• Newey, Whitney K.; McFadden, Daniel (1994). "Chapter 35: Large sample estimation and hypothesis testing". In
Engle, Robert; McFadden, Dan. Handbook of Econometrics, Vol.4. Elsevier Science. pp. 2111–2245.
ISBN 0-444-88766-0.
• Pfanzagl, Johann (1994). Parametric statistical theory. with the assistance of R. Hamböker. Berlin, DE: Walter de
Gruyter. pp. 207–208. ISBN 3-11-013863-8.
• Pratt, John W. (1976). "F. Y. Edgeworth and R. A. Fisher on the efficiency of maximum likelihood estimation".
The Annals of Statistics 4 (3): 501–514. doi:10.1214/aos/1176343457. JSTOR 2958222.
• Ruppert, David (2010). Statistics and Data Analysis for Financial Engineering (http://books.google.com/
books?id=i2bD50PbIikC&pg=PA98). Springer. p. 98. ISBN 978-1-4419-7786-1.
• Savage, Leonard J. (1976). "On rereading R. A. Fisher". The Annals of Statistics 4 (3): 441–500.
doi:10.1214/aos/1176343456. JSTOR 2958221.
• Stigler, Stephen M. (1978). "Francis Ysidro Edgeworth, statistician". Journal of the Royal Statistical Society.
Series A (General) 141 (3): 287–322. doi:10.2307/2344804. JSTOR 2344804.
• Stigler, Stephen M. (1986). The history of statistics: the measurement of uncertainty before 1900. Harvard
University Press. ISBN 0-674-40340-1.
• Stigler, Stephen M. (1999). Statistics on the table: the history of statistical concepts and methods. Harvard
University Press. ISBN 0-674-83601-4.
• van der Vaart, Aad W. (1998). Asymptotic Statistics. ISBN 0-521-78450-6.

External links
• Maximum Likelihood Estimation Primer (an excellent tutorial) (http://statgen.iop.kcl.ac.uk/bgim/mle/
sslike_1.html)
• Implementing MLE for your own likelihood function using R (http://www.mayin.org/ajayshah/KB/R/
documents/mle/mle.html)
• A selection of likelihood functions in R (http://www.netstorm.be/home/mle)
• "Tutorial on maximum likelihood estimation". Journal of Mathematical Psychology. CiteSeerX: 10.1.1.74.671
(http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.74.671).

McNemar's test
In statistics, McNemar's test is a non-parametric method used on nominal data. It is applied to 2 × 2 contingency
tables with a dichotomous trait, with matched pairs of subjects, to determine whether the row and column marginal
frequencies are equal ("marginal homogeneity"). It is named after Quinn McNemar, who introduced it in 1947.[1] An
application of the test in genetics is the transmission disequilibrium test for detecting genetic linkage.[2]

Definition
The test is applied to a 2 × 2 contingency table, which tabulates the outcomes of two tests on a sample of n subjects,
as follows.

Test 2 positive Test 2 negative Row total

Test 1 positive a b a+b

Test 1 negative c d c+d

Column total a+c b+d n

The null hypothesis of marginal homogeneity states that the two marginal probabilities for each outcome are the
same, i.e. pa + pb = pa + pc and pc + pd = pb + pd, which is equivalent to pb = pc.
Thus the null and alternative hypotheses are[1]

    H_0:\ p_b = p_c
    H_1:\ p_b \neq p_c
Here pa, etc., denote the theoretical probability of occurrences in cells with the corresponding label.
The McNemar test statistic is:

    \chi^2 = \frac{(b - c)^2}{b + c}.

The statistic with Yates's correction for continuity[3] is given by:

    \chi^2 = \frac{(|b - c| - 0.5)^2}{b + c}.

An alternative correction of 1 instead of 0.5 is attributed to Edwards[4] by Fleiss[5], resulting in a similar equation:

    \chi^2 = \frac{(|b - c| - 1)^2}{b + c}.
Under the null hypothesis, with a sufficiently large number of discordant pairs (cells b and c), χ² has a chi-squared
distribution with 1 degree of freedom. If either b or c is small (b + c < 25) then χ² is not well-approximated by the
chi-squared distribution. The binomial distribution can be used to obtain the exact distribution for an equivalent to
the uncorrected form of McNemar's test statistic.[6] In this formulation, b is compared to a binomial distribution with
McNemar's test 380

size parameter equal to b + c and "probability of success" = ½, which is essentially the same as the binomial sign
test. For b + c < 25, the binomial calculation should be performed, and indeed, most software packages simply
perform the binomial calculation in all cases, since the result then is an exact test in all cases. When comparing the
resulting statistic to the right tail of the chi-squared distribution, the p-value that is found is two-sided, whereas
to achieve a two-sided p-value in the case of the exact binomial test, the p-value of the extreme tail should be
multiplied by 2.
If the result is significant, this provides sufficient evidence to reject the null hypothesis, in favour of the
alternative hypothesis that pb ≠ pc, which would mean that the marginal proportions are significantly different from
each other.

Example
In the following example, a researcher attempts to determine if a drug has an effect on a particular disease. Counts of
individuals are given in the table, with the diagnosis (disease: present or absent) before treatment given in the rows,
and the diagnosis after treatment in the columns. The test requires the same subjects to be included in the
before-and-after measurements (matched pairs).

After: present After: absent Row total

Before: present 101 121 222

Before: absent 59 33 92

Column total 160 154 314

In this example, the null hypothesis of "marginal homogeneity" would mean there was no effect of the treatment.
From the above data, the McNemar test statistic with Yates's continuity correction is

    \chi^2 = \frac{(|121 - 59| - 0.5)^2}{121 + 59} = \frac{61.5^2}{180} \approx 21.01,

a value that is extremely unlikely under the chi-squared distribution implied by the null hypothesis. Thus the test
provides strong evidence to reject the null hypothesis of no treatment effect.
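The arithmetic of this example, together with the exact binomial alternative described earlier, can be sketched as follows (illustrative code, not from the original article):

```python
from math import comb

# Recomputing the worked example: b = 121 and c = 59 discordant pairs.
b, c = 121, 59
chi2_yates = (abs(b - c) - 0.5) ** 2 / (b + c)
print(round(chi2_yates, 2))  # 21.01

# Exact version: under H0, b ~ Binomial(b + c, 1/2); the two-sided p-value
# doubles the probability of the smaller tail.
n = b + c
p_exact = 2 * sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
print(p_exact)
assert p_exact < 1e-4  # strong evidence against marginal homogeneity
```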

Discussion
An interesting observation when interpreting McNemar's test is that the elements of the main diagonal do not
contribute to the decision about whether (in the above example) pre- or post-treatment condition is more favourable.
An extension of McNemar's test exists in situations where independence does not necessarily hold between the pairs;
instead, there are clusters of paired data where the pairs in a cluster may not be independent, but independence holds
between different clusters. An example is analyzing the effectiveness of a dental procedure; in this case, a pair
corresponds to the treatment of an individual tooth in patients who might have multiple teeth treated; the
effectiveness of treatment of two teeth in the same patient is not likely to be independent, but the treatment of two
teeth in different patients is more likely to be independent.[7]

Information in the pairings


John Rice wrote:[8]
85 Hodgkin's patients [...] had a sibling of the same sex who was free of the disease and whose age was
within 5 years of the patient's. These investigators presented the following table:

They calculated a chi-squared statistic of 1.53, which is not significant.[...] [they] had made an error in
their analysis by ignoring the pairings.[...] [their] samples were not independent, because the siblings
were paired [...] we set up a table that exhibits the pairings:

It is to the second table that McNemar's test can be applied. Notice that the sum of the numbers in the second table is
85—the number of pairs of siblings—whereas the sum of the numbers in the first table is twice as big, 170—the
number of individuals. The second table gives more information than the first. The numbers in the first table can be
found by using the numbers in the second table, but not vice versa. The numbers in the first table give only the
marginal totals of the numbers in the second table.

Related tests
• The binomial sign test provides an exact version of McNemar's test.
• Cochran's Q test for two "treatments" is equivalent to McNemar's test.
• Liddell's exact test is an exact alternative to McNemar's test.[9][10]
• The Stuart–Maxwell test is a different generalization of the McNemar test, used for testing marginal homogeneity
in a square table with more than two rows/columns.[11]
• Bhapkar's test (1966) is a more powerful alternative to the Stuart–Maxwell test.[12]

References
[1] McNemar, Quinn (June 18, 1947). "Note on the sampling error of the difference between correlated proportions or percentages".
Psychometrika 12 (2): 153–157. doi:10.1007/BF02295996. PMID 20254758.
[2] Spielman RS; McGinnis RE; Ewens WJ (Mar 1993). "Transmission test for linkage disequilibrium: the insulin gene region and
insulin-dependent diabetes mellitus (IDDM)". Am J Hum Genet. 52 (3): 506–16. PMC 1682161. PMID 8447318.
[3] Yates, F (1934). "Contingency tables involving small numbers and the χ² test". Supplement to the Journal of the Royal Statistical Society 1 (2):
217–235. JSTOR 2983604 (http://www.jstor.org/pss/2983604)
[4] Edwards, A (1948). "Note on the "correction for continuity" in testing the significance of the difference between correlated proportions".
Psychometrika 13: 185–187.
[5] Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: John Wiley. p. 114. ISBN 0-471-06428-9.
[6] Sheskin (2004)
[7] Durkalski, V.L.; Palesch, Y.Y.; Lipsitz, S.R.; Rust, P.F. (2003). "Analysis of clustered matched-pair data"
(http://www3.interscience.wiley.com/journal/104545274/abstract). Statistics in Medicine 22 (15): 2417–28. doi:10.1002/sim.1438.
PMID 12872299. Retrieved April 1, 2009.
[8] Rice, John (1995). Mathematical Statistics and Data Analysis (Second ed.). Belmont, California: Duxbury Press. pp. 492–494.
ISBN 0-534-20934-3.
[9] Liddell, D. (1976). "Practical Tests of 2 × 2 Contingency Tables". Journal of the Royal Statistical Society 25 (4): 295–304. JSTOR 2988087.
[10] http://rimarcik.com/en/navigator/z-nominal.html
[11] Sun, Xuezheng; Yang, Zhao (2008). "Generalized McNemar's Test for Homogeneity of the Marginal Distributions"
(http://www2.sas.com/proceedings/forum2008/382-2008.pdf). SAS Global Forum.
[12] http://www.john-uebersax.com/stat/mcnemar.htm#bhapkar

External links
• Vassar College's McNemar 2×2 Grid (http://faculty.vassar.edu/lowry/propcorr.html)
• McNemar Tests of Marginal Homogeneity (http://john-uebersax.com/stat/mcnemar.htm)

Multicollinearity
Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression
model are highly correlated. In this situation the coefficient estimates may change erratically in response to small
changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as
a whole, at least within the sample data themselves; it only affects calculations regarding individual predictors. That
is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors
predicts the outcome variable, but it may not give valid results about any individual predictor, or about which
predictors are redundant with respect to others.
A high degree of multicollinearity can also cause computer software packages to be unable to perform the matrix
inversion that is required for computing the regression coefficients, or it may make the results of that inversion
inaccurate.
Note that in statements of the assumptions underlying regression analyses such as ordinary least squares, the phrase
"no multicollinearity" is sometimes used to mean the absence of perfect multicollinearity, which is an exact
(non-stochastic) linear relation among the regressors.

Definition
Collinearity is a linear relationship between two explanatory variables. Two variables are perfectly collinear if there
is an exact linear relationship between the two. For example, X_1 and X_2 are perfectly collinear if there exist
parameters \lambda_0 and \lambda_1 such that, for all observations i, we have

    X_{2i} = \lambda_0 + \lambda_1 X_{1i}.
Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are
highly linearly related. We have perfect multicollinearity if, for example as in the equation above, the correlation
between two independent variables is equal to 1 or -1. In practice, we rarely face perfect multicollinearity in a data
set. More commonly, the issue of multicollinearity arises when there is a strong linear relationship among two or
more independent variables.
Mathematically, a set of variables is perfectly multicollinear if there exist one or more exact linear relationships
among some of the variables. For example, we may have

    \lambda_0 + \lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} = 0

holding for all observations i, where the \lambda_j are constants and X_{ji} is the ith observation on the jth explanatory variable.
We can explore one issue caused by multicollinearity by examining the process of attempting to obtain estimates for
the parameters of the multiple regression equation

    Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \varepsilon_i.

The ordinary least squares estimates involve inverting the matrix

    X^{\mathrm T} X,

where X is the n × (k + 1) matrix whose ith row is (1, X_{1i}, X_{2i}, \ldots, X_{ki}).
Multicollinearity 383

If there is an exact linear relationship (perfect multicollinearity) among the independent variables, the rank of X (and
therefore of XTX) is less than k+1, and the matrix XTX will not be invertible.
In most applications, perfect multicollinearity is unlikely. An analyst is more likely to face a high degree of
multicollinearity. For example, suppose that instead of the above equation holding, we have that equation in
modified form with an error term v_i:

    \lambda_0 + \lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} + v_i = 0.

In this case, there is no exact linear relationship among the variables, but the variables are nearly perfectly
multicollinear if the variance of v_i is small for some set of values for the \lambda's. In this case, the matrix X^{T}X has an
inverse, but is ill-conditioned so that a given computer algorithm may or may not be able to compute an approximate
inverse, and if it does so the resulting computed inverse may be highly sensitive to slight variations in the data (due
to magnified effects of rounding error) and so may be very inaccurate.

Detection of multicollinearity
Indicators that multicollinearity may be present in a model:
1. Large changes in the estimated regression coefficients when a predictor variable is added or deleted
2. Insignificant regression coefficients for the affected variables in the multiple regression, but a rejection of the
joint hypothesis that those coefficients are all zero (using an F-test)
3. Some authors have suggested a formal detection-tolerance or the variance inflation factor (VIF) for
multicollinearity:

    \mathrm{tolerance} = 1 - R_j^2, \qquad \mathrm{VIF} = \frac{1}{\mathrm{tolerance}},

where R_j^2 is the coefficient of determination of a regression of explanator j on all the other explanators. A
tolerance of less than 0.20 or 0.10 and/or a VIF of 5 or 10 and above indicates a multicollinearity problem (but
see O'Brien 2007).[1]
4. Condition number test: the standard measure of ill-conditioning in a matrix is the condition index. It indicates
that the inversion of the matrix is numerically unstable with finite-precision numbers (standard computer floats
and doubles), and hence signals the potential sensitivity of the computed inverse to small changes in the
original matrix. The condition number is computed by finding the square root of the maximum eigenvalue
divided by the minimum eigenvalue. If the condition number is above 30, the regression is said to have
significant multicollinearity.
5. Farrar-Glauber Test:[2] If the variables are found to be orthogonal, there is no multicollinearity; if the variables
are not orthogonal, then multicollinearity is present.
6. Construction of a pair-wise correlation matrix will yield indications as to the likelihood that any given couplet of
right-hand-side variables are multicollinear. Correlation values of .4 and higher can indicate a multicollinearity
issue, but sometimes variables may be correlated as highly as .8 without causing such issues.
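The VIF and condition-number diagnostics from the list above can be sketched numerically. This assumes NumPy is available; the synthetic data, variable names, and helper function are illustrative only:

```python
import numpy as np

# Synthetic data in which x3 is nearly a linear combination of x1 and x2,
# so both diagnostics should flag severe multicollinearity.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2 * x1 - x2 + rng.normal(scale=0.05, size=n)  # nearly collinear
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1/(1 - R_j^2) from regressing column j on the other columns."""
    y = X[:, j]
    A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - (y - A @ beta).var() / y.var()
    return 1 / (1 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]

# Condition number: sqrt(max eigenvalue / min eigenvalue) of X^T X.
eigvals = np.linalg.eigvalsh(X.T @ X)
cond = float(np.sqrt(eigvals.max() / eigvals.min()))
print(vifs, cond)
```

On this data the VIFs for all three regressors far exceed the usual thresholds, and the condition number is well above 30, in line with the rules of thumb quoted above.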

Consequences of multicollinearity
As mentioned above, one consequence of a high degree of multicollinearity is that, even if the matrix XTX is
invertible, a computer algorithm may be unsuccessful in obtaining an approximate inverse, and if it does obtain one
it may be numerically inaccurate. But even in the presence of an accurate XTX matrix, the following consequences
arise:
In the presence of multicollinearity, the estimate of one variable's impact on y while controlling for the others tends
to be less precise than if predictors were uncorrelated with one another. The usual interpretation of a regression
coefficient is that it provides an estimate of the effect of a one unit change in an independent variable, , holding
the other variables constant. If is highly correlated with another independent variable, , in the given data set,
then we only have observations for which and have a particular relationship (either positive or negative).
We don't have observations for which changes independently of , so we have an imprecise estimate of the
effect of independent changes in .
In some sense, the collinear variables contain the same information about the dependent variable. If nominally
"different" measures actually quantify the same phenomenon then they are redundant. Alternatively, if the variables
are accorded different names and perhaps employ different numeric measurement scales but are highly correlated
with each other, then they suffer from redundancy.
One of the features of multicollinearity is that the standard errors of the affected coefficients tend to be large. In that
case, the test of the hypothesis that the coefficient is equal to zero leads to a failure to reject the null hypothesis.
However, if a simple linear regression of the explained variable on this explanatory variable is estimated, the
coefficient will be found to be significant; specifically, the analyst will reject the hypothesis that the coefficient is
zero. In the presence of multicollinearity, an analyst might falsely conclude that there is no linear relationship
between an independent and a dependent variable.
A principal danger of such data redundancy is that of overfitting in regression analysis models. The best regression
models are those in which the predictor variables each correlate highly with the dependent (outcome) variable but
correlate at most only minimally with each other. Such a model is often called "low noise" and will be statistically
robust (that is, it will predict reliably across numerous samples of variable sets drawn from the same statistical
population).
So long as the underlying specification is correct, multicollinearity does not actually bias results; it just produces
large standard errors in the related independent variables. If, however, there are other problems (such as omitted
variables) which introduce bias, multicollinearity can multiply (by orders of magnitude) the effects of that bias. More
importantly, the usual use of regression is to take coefficients from the model and then apply them to other data. If
the new data differs in any way from the data that was fitted you may introduce large errors in your predictions
because the pattern of multicollinearity between the independent variables is different in your new data from the data
you used for your estimates.

Remedies for multicollinearity


1. Make sure you have not fallen into the dummy variable trap; including a dummy variable for every category (e.g.,
summer, autumn, winter, and spring) and including a constant term in the regression together guarantee perfect
multicollinearity.
2. Try seeing what happens if you use independent subsets of your data for estimation and apply those estimates to
the whole data set. Theoretically you should obtain somewhat higher variance from the smaller datasets used for
estimation, but the expectation of the coefficient values should be the same. Naturally, the observed coefficient
values will vary, but look at how much they vary.
3. Leave the model as is, despite multicollinearity. The presence of multicollinearity doesn't affect the fitted model
provided that the predictor variables follow the same pattern of multicollinearity as the data on which the
regression model is based.


4. Drop one of the variables. An explanatory variable may be dropped to produce a model with significant
coefficients. However, you lose information (because you've dropped a variable). Omission of a relevant variable
results in biased coefficient estimates for the remaining explanatory variables.
5. Obtain more data, if possible. This is the preferred solution. More data can produce more precise parameter
estimates (with lower standard errors), as seen from the formula in variance inflation factor for the variance of the
estimate of a regression coefficient in terms of the sample size and the degree of multicollinearity.
6. Mean-center the predictor variables. Generating polynomial terms (i.e., x1, x1², x1³, etc.) can cause some multicollinearity if the variable in question has a limited range (e.g., [2,4]). Mean-centering will eliminate this special kind of multicollinearity. However, in general, this has no effect. It can be useful in overcoming problems arising from rounding and other computational steps if a carefully designed computer program is not used.
7. Standardize your independent variables. This may help reduce a false flagging of a condition index above 30.
8. It has also been suggested that, by using the Shapley value, a game-theoretic tool, the model could account for the effects of multicollinearity. The Shapley value assigns a value to each predictor and assesses all possible combinations of importance.[3]
9. Ridge regression or principal component regression can be used.
10. If the correlated explanators are different lagged values of the same underlying explanator, then a distributed lag
technique can be used, imposing a general structure on the relative values of the coefficients to be estimated.
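Remedy 9 can be sketched in a few lines: ridge regression adds a penalty λI to XTX before inversion, which stabilizes the individual coefficients of a nearly collinear pair. The data and penalty value in this NumPy sketch are illustrative assumptions, not a recommendation for any particular λ.

```python
import numpy as np

def ridge(X, y, lam):
    # closed-form ridge solution: (X'X + lam*I)^(-1) X'y; lam = 0 gives OLS
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = x1 + 1e-4 * rng.normal(size=50)     # nearly collinear pair
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=50)  # true coefficients are (1, 1)

beta_ols = ridge(X, y, 0.0)    # individual OLS coefficients are unstable,
                               # though their sum is well determined
beta_ridge = ridge(X, y, 1.0)  # penalty pulls both coefficients together
```

The sum of the coefficients is estimated precisely by both methods (the well-determined direction), but only ridge keeps the two individual coefficients close to each other.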

Examples of contexts in which multicollinearity arises

Survival analysis
Multicollinearity may also represent a serious issue in survival analysis. The problem is that time-varying covariates
may change their value over the time line of the study. A special procedure is recommended to assess the impact of
multicollinearity on the results. See Van den Poel & Larivière (2004)[4] for a detailed discussion.

Interest rates for different terms to maturity


In various situations it might be hypothesized that multiple interest rates of various terms to maturity all influence
some economic decision, such as the amount of money or some other financial asset to hold, or the amount of fixed
investment spending to engage in. In this case, including these various interest rates will in general create a
substantial multicollinearity problem because interest rates tend to move together. If in fact each of the interest rates
has its own separate effect on the dependent variable, it can be extremely difficult to separate out their effects.

Collinearity is a linear relationship between two explanatory variables. Two variables are perfectly collinear if there is an exact linear relationship between the two. For example, X1 and X2 are perfectly collinear if there exist parameters λ0 and λ1 such that, for all observations i, we have

    X2i = λ0 + λ1 X1i.
Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are
highly linearly related. We have perfect multicollinearity if, for example as in the equation above, the correlation
between two independent variables is equal to 1 or -1. In practice, we rarely face perfect multicollinearity in a data
set. More commonly, the issue of multicollinearity arises when there is a strong linear relationship among two or
more independent variables.
Mathematically, a set of variables is perfectly multicollinear if there exist one or more exact linear relationships among some of the variables. For example, we may have

    λ0 + λ1 X1i + λ2 X2i + ⋯ + λk Xki = 0,

holding for all observations i, where the λj are constants and Xji is the ith observation on the jth explanatory variable. We can explore one issue caused by multicollinearity by examining the process of attempting to obtain estimates for the parameters of the multiple regression equation

    Yi = β0 + β1 X1i + ⋯ + βk Xki + εi.

The ordinary least squares estimates involve inverting the matrix XTX, where X is the n × (k+1) matrix whose ith row is [1, X1i, X2i, …, Xki].
If there is an exact linear relationship (perfect multicollinearity) among the independent variables, the rank of X (and therefore of XTX) is less than k+1, and the matrix XTX will not be invertible.
In most applications, perfect multicollinearity is unlikely. An analyst is more likely to face a high degree of
multicollinearity. For example, suppose that instead of the above equation holding, we have that equation in
modified form with an error term :
In this case, there is no exact linear relationship among the variables, but the variables are nearly perfectly
multicollinear if the variance of is small for some set of values for the 's. In this case, the matrix XTX has an inverse,
but is ill-conditioned so that a given computer algorithm may or may not be able to compute an approximate inverse,
and if it does so the resulting computed inverse may be highly sensitive to slight variations in the data (due to
magnified effects of rounding error) and so may be very inaccurate.

References
[1] O'Brien, Robert M. 2007. "A Caution Regarding Rules of Thumb for Variance Inflation Factors," Quality and Quantity 41(5)673-690.
[2] Farrar, Donald E. and Glauber, Robert R. 1967. "Multicollinearity in Regression Analysis: The Problem Revisited," The Review of Economics and Statistics 49(1):92-107.
[3] Lipovetsky and Conklin, 2001, "Analysis of Regression in Game Theory Approach". Applied Stochastic Models and Data Analysis 17 (2001): 319-330.
[4] Van den Poel, Dirk, and Larivière, Bart (2004), "Attrition Analysis for Financial Services Using Proportional Hazard Models," European
Journal of Operational Research, 157 (1), 196-217.

External links
• Earliest Uses: The entry on Multicollinearity has some historical information. (http://jeff560.tripod.com/m.html)
Multivariate normal distribution 387

Multivariate normal distribution


Probability density function

Many samples from a multivariate (bivariate) Gaussian distribution centered at (1,3)


with a standard deviation of 3 in roughly the (0.878, 0.478) direction (longer vector) and of 1 in the second direction (shorter
vector, orthogonal to the longer vector).
Notation
Parameters μ ∈ Rk — location
Σ ∈ Rk×k — covariance (nonnegative-definite matrix)
Support x ∈ μ+span(Σ) ⊆ Rk
PDF
exists only when Σ is positive-definite
CDF (no analytic expression)
Mean μ
Mode μ
Variance Σ
Entropy

MGF

CF

In probability theory and statistics, the multivariate normal distribution or multivariate Gaussian distribution, is
a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One possible
definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k
components has a univariate normal distribution. However, its importance derives mainly from the multivariate
central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set
of (possibly) correlated real-valued random variables each of which clusters around a mean value.

Notation and parametrization


The multivariate normal distribution of a k-dimensional random vector x = [X1, X2, …, Xk] can be written in the following notation:

    x ~ N(μ, Σ),

or to make it explicitly known that x is k-dimensional,

    x ~ Nk(μ, Σ),

with k-dimensional mean vector

    μ = E[x] = [E[X1], E[X2], …, E[Xk]]

and k × k covariance matrix

    Σ = [Cov(Xi, Xj)],   i, j = 1, 2, …, k.

Definition
A random vector x = (X1, …, Xk)' is said to have the multivariate normal distribution if it satisfies the following
equivalent conditions.[1]
• Every linear combination of its components Y = a1X1 + … + akXk is normally distributed. That is, for any constant
vector a ∈ Rk, the random variable Y = a′x has a univariate normal distribution.
• There exists a random ℓ-vector z, whose components are independent standard normal random variables, a
k-vector μ, and a k×ℓ matrix A, such that x = Az + μ. Here ℓ is the rank of the covariance matrix Σ = AA′.
Especially in the case of full rank, see the section below on Geometric Interpretation.
• There is a k-vector μ and a symmetric, nonnegative-definite k×k matrix Σ, such that the characteristic function of x is

    φx(u) = exp( iu′μ − ½ u′Σu ).
The covariance matrix is allowed to be singular (in which case the corresponding distribution has no density). This
case arises frequently in statistics; for example, in the distribution of the vector of residuals in the ordinary least
squares regression. Note also that the Xi are in general not independent; they can be seen as the result of applying the
matrix A to a collection of independent Gaussian variables z.

Properties

Density function

Non-degenerate case
The multivariate normal distribution is said to be "non-degenerate" when the covariance matrix of the
multivariate normal distribution is symmetric and positive definite. In this case the distribution has density

where is the determinant of . Note how the equation above reduces to that of the univariate normal
distribution if is a matrix (i.e. a real number).
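The density can be transcribed directly. This NumPy sketch assumes a positive-definite Σ; the function name is ours:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x; Sigma must be positive-definite."""
    k = len(mu)
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)      # (x-mu)' Sigma^{-1} (x-mu)
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# with a 1x1 Sigma this reduces to the univariate normal density at 0,
# which is 1/sqrt(2*pi)
p = mvn_pdf(np.array([0.0]), np.array([0.0]), np.array([[1.0]]))
```

Solving the linear system instead of forming Σ⁻¹ explicitly is the usual numerically safer choice.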
Bivariate case
In the 2-dimensional nonsingular case (k = rank(Σ) = 2), the probability density function of a vector [X Y]′ is

    f(x, y) = 1/(2π σX σY √(1 − ρ²)) · exp( −1/(2(1 − ρ²)) [ (x − μX)²/σX² + (y − μY)²/σY² − 2ρ(x − μX)(y − μY)/(σX σY) ] ),

where ρ is the correlation between X and Y and where σX > 0 and σY > 0. In this case,

    Σ = [ σX²       ρ σX σY ]
        [ ρ σX σY   σY²     ].

In the bivariate case, we also have a theorem that makes the first equivalent condition for multivariate normality less restrictive: it is sufficient to verify that countably many distinct linear combinations of X and Y are normal in order to conclude that the vector [X Y]′ is bivariate normal.[2]
When plotted in the x,y-plane the distribution appears to be squeezed to the line

    y = sgn(ρ) (σY/σX) (x − μX) + μY

as the correlation parameter ρ increases. This is because the above expression is the best linear unbiased prediction of Y given a value of X.[3]

Degenerate case
If the covariance matrix is not full rank, then the multivariate normal distribution is degenerate and does not have
a density. More precisely, it does not have a density with respect to k-dimensional Lebesgue measure (which is the
usual measure assumed in calculus-level probability courses). Only random vectors whose distributions are
absolutely continuous with respect to a measure are said to have densities (with respect to that measure). To talk
about densities but avoid dealing with measure-theoretic complications, it can be simpler to restrict attention to a subset of rank(Σ) of the coordinates of x such that the covariance matrix for this subset is positive definite; then the other coordinates may be thought of as an affine function of the selected coordinates.
To talk about densities meaningfully in the singular case, then, we must select a different base measure. Using the disintegration theorem we can define a restriction of Lebesgue measure to the rank(Σ)-dimensional affine subspace of Rk where the Gaussian distribution is supported. With respect to this measure the distribution has density

    f(x) = ( det*(2πΣ) )^(−1/2) exp( −½ (x − μ)′ Σ⁺ (x − μ) ),

where Σ⁺ is the generalized inverse and det* is the pseudo-determinant.

Higher moments
The kth-order moments of x are defined by

where r1 + r2 + ⋯ + rN = k.
The kth-order central moments are given as follows:
(a) If k is odd, μ1, …, N(x − μ) = 0.
(b) If k is even with k = 2λ, then

where the sum is taken over all allocations of the set into λ (unordered) pairs. That is, if you have a
kth ( = 2λ = 6) central moment, you will be summing the products of λ = 3 covariances (the -μ notation has been
dropped in the interests of parsimony):

This yields (2λ − 1)!! = (2λ)!/(2^λ · λ!) terms in the sum (15 in the above case), each being the product of λ (in this case 3) covariances. For fourth-order moments (four variables) there are three terms. For sixth-order moments there are 3 × 5 = 15 terms, and for eighth-order moments there are 3 × 5 × 7 = 105 terms.
The covariances are then determined by replacing the terms of the list [1, …, 2λ] by the corresponding terms of the list consisting of r1 ones, then r2 twos, etc. To illustrate this, examine the following 4th-order central moment case:

where is the covariance of xi and xj. The idea with the above method is you first find the general case for a kth
moment where you have k different x variables - and then you can simplify this accordingly. Say,
you have then you simply let xi = xj and realise that σii = σi2.

Likelihood function
If the mean and covariance matrix are unknown, a suitable log likelihood function for a single observation x would be:

    ln L = −½ ln|Σ| − ½ (x − μ)′ Σ^(−1) (x − μ) − (k/2) ln(2π),

where x is a vector of real numbers. The complex case, where z is a vector of complex numbers, would be

    ln L = −ln|Σ| − (z − μ)ᴴ Σ^(−1) (z − μ) − k ln π.

A similar notation is used for multiple linear regression.[4]

Entropy
The differential entropy of the multivariate normal distribution is[5]

    h(x) = ½ ln( (2πe)^k |Σ| ),

where |Σ| is the determinant of the covariance matrix Σ.
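The differential entropy of a multivariate normal, ½ ln((2πe)^k det Σ) in nats, can be computed directly. A NumPy sketch (function name ours):

```python
import numpy as np

def mvn_entropy(Sigma):
    # differential entropy of N(mu, Sigma) in nats:
    # 0.5 * ln((2*pi*e)^k * det(Sigma)); independent of the mean
    k = Sigma.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** k * np.linalg.det(Sigma))

h1 = mvn_entropy(np.eye(1))   # standard univariate normal, about 1.4189 nats
h2 = mvn_entropy(np.eye(2))   # two independent standard normals: exactly 2*h1
```

Entropy is additive for independent components, which the identity-covariance example makes visible.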



Kullback–Leibler divergence
The Kullback–Leibler divergence from N0(μ0, Σ0) to N1(μ1, Σ1), for non-singular matrices Σ0 and Σ1, is:[6]

    DKL(N0 ‖ N1) = ½ ( tr(Σ1^(−1) Σ0) + (μ1 − μ0)′ Σ1^(−1) (μ1 − μ0) − k + ln( det Σ1 / det Σ0 ) ).

The logarithm must be taken to base e since the two terms following the logarithm are themselves base-e logarithms of expressions that are either factors of the density function or otherwise arise naturally. The equation therefore gives a result measured in nats. Dividing the entire expression above by loge 2 yields the divergence in bits.
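The closed form for the divergence between two non-singular Gaussians, ½(tr(Σ1⁻¹Σ0) + (μ1 − μ0)′Σ1⁻¹(μ1 − μ0) − k + ln(det Σ1/det Σ0)), transcribes directly. A NumPy sketch (function name ours):

```python
import numpy as np

def kl_mvn(mu0, S0, mu1, S1):
    """D_KL( N(mu0,S0) || N(mu1,S1) ) in nats, for non-singular S0, S1."""
    k = len(mu0)
    S1_inv = np.linalg.inv(S1)
    d = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

mu = np.zeros(2)
S = np.eye(2)
zero = kl_mvn(mu, S, mu, S)                 # identical distributions: 0
shift = kl_mvn(np.array([0.0]), np.eye(1),
               np.array([1.0]), np.eye(1))  # 1-D unit mean shift: 0.5 nat
```

The two checks are exactly the properties one expects: zero divergence between identical distributions, and ½ nat for a unit mean shift at unit variance.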

Cumulative distribution function


The cumulative distribution function (cdf) F(x0) of a random vector x is defined as the probability that all
components of x are less than or equal to the corresponding values in the vector x0. Though there is no closed form
for F(x), there are a number of algorithms that estimate it numerically.

Tolerance region
The equivalent of the univariate normal distribution's tolerance intervals in the multivariate case is a tolerance region. Such a region consists of those vectors x satisfying

    (x − μ)′ Σ^(−1) (x − μ) ≤ χ²k(p).

Here x is a k-dimensional vector, μ is the known k-dimensional mean vector, Σ is the known covariance matrix and χ²k(p) is the quantile function for probability p of the chi-squared distribution with k degrees of freedom.
When k = 2 the expression defines the interior of an ellipse and the chi-squared distribution simplifies to an exponential distribution with mean equal to two.

Joint normality

Normally distributed and independent


If X and Y are normally distributed and independent, this implies they are "jointly normally distributed", i.e., the pair
(X, Y) must have multivariate normal distribution. However, a pair of jointly normally distributed variables need not
be independent.

Two normally distributed random variables need not be jointly bivariate normal
The fact that two random variables X and Y both have a normal distribution does not imply that the pair (X, Y) has a
joint normal distribution. A simple example is one in which X has a normal distribution with expected value 0 and
variance 1, and Y = X if |X| > c and Y = −X if |X| < c, where c > 0. There are similar counterexamples for more than
two random variables.

Correlations and independence


In general, random variables may be uncorrelated but highly dependent. But if a random vector has a multivariate
normal distribution then any two or more of its components that are uncorrelated are independent. This implies that
any two or more of its components that are pairwise independent are independent.
But it is not true that two random variables that are (separately, marginally) normally distributed and uncorrelated
are independent. Two random variables that are normally distributed may fail to be jointly normally distributed, i.e.,
the vector whose components they are may fail to have a multivariate normal distribution. For an example of two
normally distributed random variables that are uncorrelated but not independent, see normally distributed and
uncorrelated does not imply independent.

Conditional distributions
If μ and Σ are partitioned as follows

    μ = [ μ1 ]   with sizes   [ q × 1 ]
        [ μ2 ]                [ (N−q) × 1 ]

    Σ = [ Σ11  Σ12 ]   with sizes   [ q × q        q × (N−q) ]
        [ Σ21  Σ22 ]                [ (N−q) × q    (N−q) × (N−q) ]

then the distribution of x1 conditional on x2 = a is multivariate normal (x1|x2 = a) ~ N(μ̄, Σ̄), where

    μ̄ = μ1 + Σ12 Σ22^(−1) (a − μ2)

and covariance matrix

    Σ̄ = Σ11 − Σ12 Σ22^(−1) Σ21.[7]

This matrix is the Schur complement of Σ22 in Σ. This means that to calculate the conditional covariance matrix, one inverts the overall covariance matrix, drops the rows and columns corresponding to the variables being conditioned upon, and then inverts back to get the conditional covariance matrix. Here Σ22^(−1) is the generalized inverse of Σ22.
Note that knowing that x2 = a alters the variance, though the new variance does not depend on the specific value of a; perhaps more surprisingly, the mean is shifted by Σ12 Σ22^(−1) (a − μ2); compare this with the situation of not knowing the value of a, in which case x1 would have distribution N(μ1, Σ11).
An interesting fact derived in order to prove this result is that the random vectors x2 and y1 = x1 − Σ12 Σ22^(−1) x2 are independent.
The matrix Σ12Σ22−1 is known as the matrix of regression coefficients.
In the bivariate case where x is partitioned into X1 and X2, the conditional distribution of X1 given X2 is

    X1 | X2 = x2   ~   N( μ1 + (σ1/σ2) ρ (x2 − μ2), (1 − ρ²) σ1² ),

where ρ is the correlation coefficient between X1 and X2.
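The conditional mean and the Schur-complement covariance can be computed mechanically. This NumPy sketch (function name ours, assuming the conditioning block of Σ is invertible) checks the bivariate special case:

```python
import numpy as np

def condition_mvn(mu, Sigma, keep, cond, a):
    """Mean and covariance of x_keep | x_cond = a for x ~ N(mu, Sigma).
    Assumes the conditioning block Sigma[cond, cond] is invertible."""
    S11 = Sigma[np.ix_(keep, keep)]
    S12 = Sigma[np.ix_(keep, cond)]
    S22 = Sigma[np.ix_(cond, cond)]
    W = S12 @ np.linalg.inv(S22)       # matrix of regression coefficients
    mu_c = mu[keep] + W @ (a - mu[cond])
    Sigma_c = S11 - W @ S12.T          # Schur complement of S22 in Sigma
    return mu_c, Sigma_c

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])         # sigma1 = sigma2 = 1, rho = 0.5
m, S = condition_mvn(mu, Sigma, [0], [1], np.array([2.0]))
# m[0] = rho * 2 = 1.0 and S[0, 0] = 1 - rho^2 = 0.75
```

With unit variances, the result matches the bivariate formula: conditional mean ρ·x2 and conditional variance 1 − ρ².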



Bivariate conditional expectation


In the case

    [X1]       ( [0]   [1  ρ] )
    [X2]  ~  N( [0] ,  [ρ  1] ),

the following result holds:

    E(X1 | X2 > z) = ρ · φ(z) / Φ(−z),

where the final ratio here is called the inverse Mills ratio (with φ and Φ the standard normal density and distribution function).

Marginal distributions
To obtain the marginal distribution over a subset of multivariate normal random variables, one only needs to drop the
irrelevant variables (the variables that one wants to marginalize out) from the mean vector and the covariance matrix.
The proof for this follows from the definitions of multivariate normal distributions and linear algebra.[8]
Example
Let x = [X1, X2, X3] be multivariate normal random variables with mean vector μ = [μ1, μ2, μ3] and covariance matrix Σ (standard parametrization for multivariate normal distributions). Then the joint distribution of x′ = [X1, X3] is multivariate normal with mean vector μ′ = [μ1, μ3] and covariance matrix

    Σ′ = [ Σ11  Σ13 ]
         [ Σ31  Σ33 ].
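Marginalization really is just index selection, as this small NumPy sketch shows (the numbers are illustrative):

```python
import numpy as np

mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.3, 0.4],
                  [0.3, 1.0, 0.1],
                  [0.4, 0.1, 1.5]])

keep = [0, 2]                            # marginalize out X2
mu_marg = mu[keep]                       # drop the irrelevant mean entry
Sigma_marg = Sigma[np.ix_(keep, keep)]   # drop its row and column
```

No matrix inversion is involved, in contrast to conditioning.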

Affine transformation
If y = c + Bx is an affine transformation of x ~ N(μ, Σ), where c is an M × 1 vector of constants and B is a constant M × N matrix, then y has a multivariate normal distribution with expected value c + Bμ and variance BΣB′, i.e., y ~ N(c + Bμ, BΣB′). In particular, any subset of the xi has a marginal distribution that is also multivariate normal. To see this, consider the following example: to extract the subset (x1, x2, x4)′, use

    B = [ 1 0 0 0 0 … 0 ]
        [ 0 1 0 0 0 … 0 ]
        [ 0 0 0 1 0 … 0 ],

which extracts the desired elements directly.


Another corollary is that the distribution of Z = b · x, where b is a constant vector of the same length as x and the dot indicates a vector product, is univariate Gaussian with Z ~ N(b · μ, b′Σb). This result follows by taking B to be the 1 × n matrix whose single row is the vector b, so that Bx is exactly b · x. Observe how the positive-definiteness of Σ implies that the variance of the dot product must be positive.
An affine transformation of x such as 2x is not the same as the sum of two independent realisations of x.
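The affine-transformation rule, mean c + Bμ and covariance BΣB′, can be applied numerically. The matrices in this NumPy sketch are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -1.0, 0.5])
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + np.eye(3)        # a generic positive-definite covariance

# y = c + Bx, with B picking out x1, x3 and their sum
B = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 1.0]])
c = np.array([0.0, 0.0, 2.0])

mu_y = c + B @ mu                  # mean of y = c + B mu
Sigma_y = B @ Sigma @ B.T          # covariance of y = B Sigma B'
```

The third component of mu_y is 2 + μ1 + μ3 = 3.5, and Sigma_y is symmetric by construction.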

Geometric interpretation
The equidensity contours of a non-singular multivariate normal distribution are ellipsoids (i.e. linear transformations
of hyperspheres) centered at the mean.[9] The directions of the principal axes of the ellipsoids are given by the
eigenvectors of the covariance matrix Σ. The squared relative lengths of the principal axes are given by the
corresponding eigenvalues.
If Σ = UΛUT = UΛ1/2(UΛ1/2)T is an eigendecomposition where the columns of U are unit eigenvectors and Λ is a diagonal matrix of the eigenvalues, then we have

    x ~ N(μ, Σ)   ⟺   x ~ μ + UΛ1/2 z,   z ~ N(0, I).
Moreover, U can be chosen to be a rotation matrix, as inverting an axis does not have any effect on N(0, Λ), but
inverting a column changes the sign of U's determinant. The distribution N(μ, Σ) is in effect N(0, I) scaled by Λ1/2,
rotated by U and translated by μ.
Conversely, any choice of μ, full rank matrix U, and positive diagonal entries Λi yields a non-singular multivariate
normal distribution. If any Λi is zero and U is square, the resulting covariance matrix UΛUT is singular.
Geometrically this means that every contour ellipsoid is infinitely thin and has zero volume in n-dimensional space,
as at least one of the principal axes has length of zero.

Estimation of parameters
The derivation of the maximum-likelihood estimator of the covariance matrix of a multivariate normal distribution is
perhaps surprisingly subtle and elegant. See estimation of covariance matrices.
In short, the probability density function (pdf) of a k-dimensional multivariate normal is

    f(x) = (2π)^(−k/2) det(Σ)^(−1/2) exp( −½ (x − μ)′ Σ^(−1) (x − μ) ),

and the ML estimator of the covariance matrix from a sample of n observations is

    Σ̂ = (1/n) Σ_{i=1..n} (x_i − x̄)(x_i − x̄)′,

which is simply the sample covariance matrix. This is a biased estimator whose expectation is

    E[Σ̂] = ((n − 1)/n) Σ.

An unbiased sample covariance is

    Σ̂ = (1/(n − 1)) Σ_{i=1..n} (x_i − x̄)(x_i − x̄)′.
The Fisher information matrix for estimating the parameters of a multivariate normal distribution has a closed form
expression. This can be used, for example, to compute the Cramér–Rao bound for parameter estimation in this
setting. See Fisher information for more details.
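The ML and Bessel-corrected covariance estimators differ only by the factor n/(n − 1), as this NumPy sketch confirms on simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 500, 3
X = rng.normal(size=(n, k))        # n draws from N(0, I_3)

xbar = X.mean(axis=0)
D = X - xbar
S_ml = D.T @ D / n                 # ML estimator; biased by the factor (n-1)/n
S_unbiased = D.T @ D / (n - 1)     # sample covariance with Bessel's correction
```

NumPy's own `np.cov(X, rowvar=False)` returns the Bessel-corrected version by default.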

Bayesian inference
In Bayesian statistics, the conjugate prior of the mean vector is another multivariate normal distribution, and the conjugate prior of the covariance matrix is an inverse-Wishart distribution. Suppose then that n observations have been made

and that a conjugate prior has been assigned, where

where

and

Then,

where

Multivariate normality tests


Multivariate normality tests check a given set of data for similarity to the multivariate normal distribution. The null
hypothesis is that the data set is similar to the normal distribution, therefore a sufficiently small p-value indicates
non-normal data. Multivariate normality tests include the Cox-Small test[10] and Smith and Jain's adaptation[11] of
the Friedman-Rafsky test.[12]
Mardia's test[13] is based on multivariate extensions of skewness and kurtosis measures. For a sample {x1, ..., xn} of
k-dimensional vectors we compute

Under the null hypothesis of multivariate normality, the statistic A will have approximately a chi-squared distribution with (1/6)·k(k + 1)(k + 2) degrees of freedom, and B will be approximately standard normal N(0,1).
Mardia's kurtosis statistic is skewed and converges very slowly to the limiting normal distribution. For medium size samples (50 ≤ n < 400), the parameters of the asymptotic distribution of the kurtosis statistic are modified.[14] For small sample tests (n < 50) empirical critical values are used. Tables of critical values for both statistics are given by Rencher[15] for k = 2, 3, 4.
Mardia's tests are affine invariant but not consistent. For example, the multivariate skewness test is not consistent
against symmetric non-normal alternatives.
The BHEP test[16] computes the norm of the difference between the empirical characteristic function and the
theoretical characteristic function of the normal distribution. Calculation of the norm is performed in the L2(μ) space
of square-integrable functions with respect to the Gaussian weighting function . The test statistic is

The limiting distribution of this test statistic is a weighted sum of chi-squared random variables;[17] however, in practice it is more convenient to compute the sample quantiles using Monte-Carlo simulations.
A detailed survey of these and other test procedures is available.[18]

Drawing values from the distribution


A widely used method for drawing a random vector x from the N-dimensional multivariate normal distribution with
mean vector μ and covariance matrix Σ works as follows:
1. Find any real matrix A such that A AT = Σ. When Σ is positive-definite, the Cholesky decomposition is typically
used, and the extended form of this decomposition can be used in the more general nonnegative-definite case: in
both cases a suitable matrix A is obtained. An alternative is to use the matrix A = UΛ½ obtained from a spectral
decomposition Σ = UΛUT of Σ. The former approach is more computationally straightforward but the matrices A
change for different orderings of the elements of the random vector, while the latter approach gives matrices that
are related by simple re-orderings. In theory both approaches give equally good ways of determining a suitable matrix A, but there are differences in computation time.
2. Let z = (z1, …, zN)T be a vector whose components are N independent standard normal variates (which can be
generated, for example, by using the Box–Muller transform).
3. Let x be μ + Az. This has the desired distribution due to the affine transformation property.
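The three steps can be transcribed almost verbatim in NumPy (assuming a positive-definite Σ, so that the Cholesky factorization applies); the numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, 3.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])

# step 1: find A with A @ A.T == Sigma (Cholesky, since Sigma is pos.-def.)
A = np.linalg.cholesky(Sigma)

# step 2: independent standard normal variates
Z = rng.standard_normal((100_000, 2))

# step 3: x = mu + A z, vectorized over all draws at once
X = mu + Z @ A.T

emp_mean = X.mean(axis=0)            # approaches mu
emp_cov = np.cov(X, rowvar=False)    # approaches Sigma
```

The empirical mean and covariance of the draws converge to μ and Σ at the usual Monte-Carlo rate.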

References
[1] Gut, Allan (2009) An Intermediate Course in Probability, Springer. ISBN 9781441901613 (Chapter 5)
[2] Hamedani, G. G.; Tata, M. N. (1975). "On the determination of the bivariate normal distribution from distributions of linear combinations of
the variables". The American Mathematical Monthly 82 (9): 913–915. doi:10.2307/2318494.
[3] Wyatt, John. "Linear least mean-squared error estimation" (http://web.mit.edu/6.041/www/LECTURE/lec22.pdf). Lecture notes, course on applied probability. Retrieved 23 January 2012.
[4] Tong, T. (2010) Multiple Linear Regression: MLE and Its Distributional Results (http://amath.colorado.edu/courses/7400/2010Spr/lecture9.pdf), Lecture Notes.
[5] Gokhale, DV; NA Ahmed, BC Res, NJ Piscataway (May 1989). "Entropy Expressions and Their Estimators for Multivariate Distributions".
Information Theory, IEEE Transactions on 35 (3): 688–692. doi:10.1109/18.30996.
[6] Penny & Roberts, PARG-00-12, (2000) (http://www.allisons.org/ll/MML/KL/Normal). pp. 18.
[7] Eaton, Morris L. (1983). Multivariate Statistics: a Vector Space Approach. John Wiley and Sons. pp. 116–117. ISBN 0-471-02776-6.
[8] The formal proof for marginal distribution is shown here: http://fourier.eng.hmc.edu/e161/lectures/gaussianprocess/node7.html
[9] Nikolaus Hansen. "The CMA Evolution Strategy: A Tutorial" (http://www.lri.fr/~hansen/cmatutorial.pdf) (PDF).
[10] Cox, D. R.; N. J. H. Small (August 1978). "Testing multivariate normality". Biometrika 65 (2): 263–272. doi:10.1093/biomet/65.2.263.
[11] Smith, Stephen P.; Anil K. Jain (September 1988). "A test to determine the multivariate normality of a dataset". IEEE Transactions on
Pattern Analysis and Machine Intelligence 10 (5): 757–761. doi:10.1109/34.6789.
[12] Friedman, J. H. and Rafsky, L. C. (1979) "Multivariate generalizations of the Wald-Wolfowitz and Smirnov two sample tests". Annals of
Statistics, 7, 697–717.
[13] Mardia, K. V. (1970). "Measures of multivariate skewness and kurtosis with applications". Biometrika 57 (3): 519–530.
doi:10.1093/biomet/57.3.519.
[14] Rencher (1995), pages 112-113.
[15] Rencher (1995), pages 493-495.
[16] Epps, T. W.; Pulley, Lawrence B. (1983). "A test for normality based on the empirical characteristic function". Biometrika 70 (3): 723–726. doi:10.1093/biomet/70.3.723.
[17] Baringhaus, L.; Henze, N. (1988). "A consistent test for multivariate normality based on the empirical characteristic function". Metrika 35
(1): 339–348. doi:10.1007/BF02613322.
[18] Henze, Norbert (2002). "Invariant tests for multivariate normality: a critical review". Statistical Papers 43 (4): 467–506.
doi:10.1007/s00362-002-0119-6.

Literature
• Rencher, A.C. (1995). Methods of Multivariate Analysis. New York: Wiley.

n-sphere
In mathematics, an n-sphere is a generalization of the surface of an
ordinary sphere to arbitrary dimension. For any natural number n, an
n-sphere of radius r is defined as the set of points in
(n + 1)-dimensional Euclidean space which are at distance r from a
central point, where the radius r may be any positive real number. In
symbols:

    S^n = { x ∈ R^(n+1) : ‖x‖ = r }.
It is an n-dimensional manifold in Euclidean (n + 1)-space.


In particular:
• a 0-sphere is a pair of points that are the ends of a line segment,
• a 1-sphere is a circle in the plane, and
• a 2-sphere is an ordinary sphere in three-dimensional space.

[Figure: 2-sphere wireframe as an orthogonal projection]


Spheres of dimension n > 2 are sometimes called hyperspheres, with 3-spheres sometimes known as glomes. The
n-sphere of unit radius centered at the origin is called the unit n-sphere, denoted Sn. The unit n-sphere is often
referred to as the n-sphere. An n-sphere is the surface or boundary of an (n + 1)-dimensional ball, and is an
n-dimensional manifold. For n ≥ 2, the n-spheres are the simply connected n-dimensional manifolds of constant,
positive curvature. The n-spheres admit several other topological descriptions: for example, they can be constructed
by gluing two n-dimensional Euclidean spaces together, by identifying the boundary of an n-cube with a point, or
(inductively) by forming the suspension of an (n − 1)-sphere.

Description
For any natural number n, an n-sphere of radius r is defined as the set of points in (n + 1)-dimensional Euclidean
space that are at distance r from some fixed point c, where r may be any positive real number and where c may be
any point in (n + 1)-dimensional space. In particular:
• a 0-sphere is a pair of points {c − r, c + r}, and is the boundary of a line segment (1-ball).
• a 1-sphere is a circle of radius r centered at c, and is the boundary of a disk (2-ball).
• a 2-sphere is an ordinary 2-dimensional sphere in 3-dimensional Euclidean space, and is the boundary of an
ordinary ball (3-ball).
• a 3-sphere is a sphere in 4-dimensional Euclidean space.

Euclidean coordinates in (n + 1)-space


The set of points in (n + 1)-space, (x1, x2, …, xn+1), that define an n-sphere Sn is represented by the equation:

r² = (x1 − c1)² + (x2 − c2)² + ⋯ + (xn+1 − cn+1)²,

where c = (c1, c2, …, cn+1) is a center point, and r is the radius.


The above n-sphere exists in (n + 1)-dimensional Euclidean space and is an example of an n-manifold. The volume form ω of an n-sphere of radius r is given by

ω = (1/r) Σ_{j=1}^{n+1} (−1)^{j−1} xj dx1 ∧ ⋯ ∧ dxj−1 ∧ dxj+1 ∧ ⋯ ∧ dxn+1 = *dr,

where * is the Hodge star operator; see Flanders (1989, §6.1) for a discussion and proof of this formula in the case r = 1. As a result,

dr ∧ ω = dx1 ∧ ⋯ ∧ dxn+1.

[figure: Just as a stereographic projection can project a sphere's surface to a plane, it can also project the surface of a 3-sphere into 3-space. This image shows three coordinate directions projected to 3-space: parallels (red), meridians (blue) and hypermeridians (green). Due to the conformal property of the stereographic projection, the curves intersect each other orthogonally (in the yellow points) as in 4D. All of the curves are circles: the curves that intersect <0,0,0,1> have an infinite radius (= straight line).]

n-ball
The space enclosed by an n-sphere is called an (n + 1)-ball. An (n + 1)-ball is closed if it includes the n-sphere, and it
is open if it does not include the n-sphere.
Specifically:
• A 1-ball, a line segment, is the interior of a 0-sphere.
• A 2-ball, a disk, is the interior of a circle (1-sphere).
• A 3-ball, an ordinary ball, is the interior of a sphere (2-sphere).
• A 4-ball, is the interior of a 3-sphere, etc.

Topological description
Topologically, an n-sphere can be constructed as a one-point compactification of n-dimensional Euclidean space.
Briefly, the n-sphere can be described as Sn = Rn ∪ {∞}, which is n-dimensional Euclidean space plus a
single point representing infinity in all directions. In particular, if a single point is removed from an n-sphere, it
becomes homeomorphic to Rn. This forms the basis for stereographic projection.[1]

Volume and surface area


The n-volume of an n-sphere of radius R or, equivalently, the surface area of an (n + 1)-ball of radius R is:

Sn(R) = (2 π^{(n+1)/2} / Γ((n + 1)/2)) R^n.

The n-volume of an n-ball of radius R:

Vn(R) = (π^{n/2} / Γ(n/2 + 1)) R^n.

The 1-sphere of radius R is the circle of radius R in the Euclidean plane, and this has circumference (1-dimensional measure)

C = 2πR.

The region enclosed by the 1-sphere is the 2-ball, or disk of radius R, and this has area (2-dimensional measure)

A = πR².

Analogously, in 3-dimensional Euclidean space, the surface area (2-dimensional measure) of the 2-sphere of radius R is given by

A = 4πR²,

and the volume enclosed is the volume (3-dimensional measure) of the 3-ball, and is given by

V = (4/3)πR³.
In general, the volume, in n-dimensional Euclidean space, of the n-ball of radius R is proportional to the nth power of R:

Vn(R) = Vn R^n,

where the constant of proportionality Vn, the volume of the unit n-ball, is given by

Vn = π^{n/2} / Γ(n/2 + 1),

where Γ is the gamma function. For even n, this reduces to

Vn = π^{n/2} / (n/2)!,

and since Γ(1/2) = √π, for odd n,

Vn = 2 (2π)^{(n−1)/2} / n!!,

where n!! denotes the double factorial.
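These closed forms are easy to check numerically. The sketch below (an illustration, not part of the original article) compares the gamma-function expression for the unit n-ball volume against the even-n and odd-n special cases:

```python
import math

def unit_ball_volume(n):
    """Volume of the unit n-ball: V_n = pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ... (equal to 1 for n <= 0)."""
    out = 1
    while n > 1:
        out *= n
        n -= 2
    return out

# Even n: V_n = pi^(n/2) / (n/2)!
for n in (2, 4, 6, 8):
    assert abs(unit_ball_volume(n) - math.pi ** (n // 2) / math.factorial(n // 2)) < 1e-9

# Odd n: V_n = 2 * (2*pi)^((n-1)/2) / n!!
for n in (1, 3, 5, 7):
    assert abs(unit_ball_volume(n) - 2 * (2 * math.pi) ** ((n - 1) // 2) / double_factorial(n)) < 1e-9
```

For instance, `unit_ball_volume(3)` recovers the familiar 4π/3 of the ordinary ball.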


The "surface area", or properly the (n − 1)-dimensional volume, of the (n − 1)-sphere at the boundary of the n-ball is

Sn−1(R) = dVn(R)/dR = (n/R) Vn(R) = (2 π^{n/2} / Γ(n/2)) R^{n−1}.

The following relationships hold between the n-spherical surface area and volume:

Vn(R) / Sn−1(R) = R/n    and    Sn+1(R) / Vn(R) = 2πR.

Combining them gives a "reverse-direction" recurrence relation for surface area, which is depicted in the diagram:

Sn+1(R) / Sn−1(R) = 2πR²/n.

[figure: The curved red arrows show the relationship between formulas for different n. The formula coefficient at each arrow's tip equals the formula coefficient at that arrow's tail times the factor in the arrowhead. If the direction of the bottom arrows were reversed, their arrowheads would say to multiply by 2π/(n − 2).]

Index-shifting n to n − 2 then yields the recurrence relations:

Sn−1(R) = (2πR²/(n − 2)) Sn−3(R)
Vn(R) = (2πR²/n) Vn−2(R),

where S0 = 2, V1 = 2R, S1 = 2πR and V2 = πR². (The 0-dimensional Hausdorff measure is the number of points in a set.
The 0-sphere consists of two points, at −R and +R; so S0 = 2.)

The recurrence relation for Vn can be proved via integration with 2-dimensional polar coordinates:

Vn(R) = ∫₀^{2π} ∫₀^{R} Vn−2(√(R² − r²)) r dr dθ = (2πR²/n) Vn−2(R).

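A short numerical check (an illustration, not part of the original article) confirms the recurrences Vn(R) = (2πR²/n)Vn−2(R) and Sn−1(R) = (2πR²/(n − 2))Sn−3(R), starting from the gamma-function closed forms:

```python
import math

def ball_volume(n, R):
    """V_n(R) = pi^(n/2) / Gamma(n/2 + 1) * R^n."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) * R ** n

def sphere_area(n, R):
    """n-dimensional measure of the n-sphere: S_n(R) = 2 pi^((n+1)/2) / Gamma((n+1)/2) * R^n."""
    return 2 * math.pi ** ((n + 1) / 2) / math.gamma((n + 1) / 2) * R ** n

R = 1.5
assert abs(sphere_area(0, R) - 2) < 1e-12       # S_0 = 2 (two points)
assert abs(ball_volume(1, R) - 2 * R) < 1e-12   # V_1 = 2R (a segment)
for n in range(3, 12):
    v = ball_volume(n, R)
    assert abs(v - 2 * math.pi * R ** 2 / n * ball_volume(n - 2, R)) < 1e-9 * max(1, v)
    s = sphere_area(n - 1, R)
    assert abs(s - 2 * math.pi * R ** 2 / (n - 2) * sphere_area(n - 3, R)) < 1e-9 * max(1, s)
```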
Spherical coordinates
We may define a coordinate system in an n-dimensional Euclidean space which is analogous to the spherical
coordinate system defined for 3-dimensional Euclidean space, in which the coordinates consist of a radial coordinate r
and n − 1 angular coordinates φ1, φ2, …, φn−1, where φn−1 ranges over [0, 2π) radians (or over [0, 360)
degrees) and the other angles range over [0, π] radians (or over [0, 180] degrees). If x1, …, xn are the Cartesian
coordinates, then we may compute x1, …, xn from r, φ1, …, φn−1 with:

x1 = r cos(φ1)
x2 = r sin(φ1) cos(φ2)
⋮
xn−1 = r sin(φ1) ⋯ sin(φn−2) cos(φn−1)
xn = r sin(φ1) ⋯ sin(φn−2) sin(φn−1).

Except in the special cases described below, the inverse transformation is unique:

r = √(xn² + xn−1² + ⋯ + x1²)
φi = arccot( xi / √(xn² + xn−1² + ⋯ + xi+1²) )   for i = 1, …, n − 2
φn−1 = 2 arccot( (xn−1 + √(xn² + xn−1²)) / xn ),

where if xk ≠ 0 for some k but all of xk+1, …, xn are zero, then φk = 0 when xk > 0, and φk = π
radians (180 degrees) when xk < 0.
There are some special cases where the inverse transform is not unique; φk for any k will be ambiguous whenever
all of xk, xk+1, …, xn are zero; in this case φk may be chosen to be zero.

Note that a half-angle formula is used for φn−1 because the more straightforward arctan(xn / xn−1) is too
small by an addend of π when xn−1 < 0.

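The forward map can be sketched in a few lines of Python (an illustration, not from the original article; the parameterization is the standard hyperspherical one, x1 = r cos φ1, x2 = r sin φ1 cos φ2, …, xn = r sin φ1 ⋯ sin φn−1):

```python
import math

def spherical_to_cartesian(r, angles):
    """Hyperspherical -> Cartesian in n = len(angles) + 1 dimensions."""
    coords, s = [], r
    for phi in angles:
        coords.append(s * math.cos(phi))   # running product times cos of the next angle
        s *= math.sin(phi)                 # keep r * sin(phi_1) * ... * sin(phi_i)
    coords.append(s)
    return coords

# A point on the 3-sphere of radius 2: the squared coordinates sum back to r^2.
p = spherical_to_cartesian(2.0, [0.7, 1.1, 2.4])
assert abs(sum(c * c for c in p) - 4.0) < 1e-12
assert abs(p[0] - 2.0 * math.cos(0.7)) < 1e-12
```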
Spherical volume element


Expressing the angular measures in radians, the volume element in n-dimensional Euclidean space will be found
from the Jacobian of the transformation:

d^nV = r^{n−1} sin^{n−2}(φ1) sin^{n−3}(φ2) ⋯ sin(φn−2) dr dφ1 dφ2 ⋯ dφn−1,

and the above equation for the volume of the n-ball can be recovered by integrating this element over r ∈ [0, R]
and over the full range of each angular coordinate.
The volume element of the (n − 1)-sphere, which generalizes the area element of the 2-sphere, is given by

d_{Sn−1}V = r^{n−1} sin^{n−2}(φ1) sin^{n−3}(φ2) ⋯ sin(φn−2) dφ1 dφ2 ⋯ dφn−1.

The natural choice of an orthogonal basis over the angular coordinates is a product of ultraspherical (Gegenbauer)
polynomials for j = 1, 2, ..., n − 2, and the functions e^{isφj} for the angle j = n − 1, in concordance with the
spherical harmonics.

Stereographic projection
Just as a two dimensional sphere embedded in three dimensions can be mapped onto a two-dimensional plane by a
stereographic projection, an n-sphere can be mapped onto an n-dimensional hyperplane by the n-dimensional version
of the stereographic projection. For example, the point [x, y, z] on a two-dimensional sphere of radius 1 maps to
the point [x/(1 − z), y/(1 − z)] on the plane. In other words,

[x, y, z] ↦ [x/(1 − z), y/(1 − z)].

Likewise, the stereographic projection of an n-sphere of radius 1 will map to the n-dimensional
hyperplane perpendicular to the xn+1-axis as

[x1, x2, …, xn+1] ↦ [x1/(1 − xn+1), x2/(1 − xn+1), …, xn/(1 − xn+1)].

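The projection and its inverse can be sketched as follows (an illustration, not from the original article; the point is projected from the pole (0, …, 0, 1)):

```python
import math

def stereographic(x):
    """Project a unit-sphere point (last coordinate != 1) onto the hyperplane:
    y_i = x_i / (1 - x_{n+1})."""
    *rest, last = x
    return [c / (1 - last) for c in rest]

def inverse_stereographic(y):
    """Map a hyperplane point back to the unit sphere."""
    s = sum(c * c for c in y)
    return [2 * c / (s + 1) for c in y] + [(s - 1) / (s + 1)]

# Round trip: a point on the unit 3-sphere survives projection and inversion.
x = [0.2, 0.4, 0.4, math.sqrt(1 - 0.36)]
y = stereographic(x)
x_back = inverse_stereographic(y)
assert all(abs(a - b) < 1e-12 for a, b in zip(x, x_back))
```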
Generating random points

Uniformly at random from the (n − 1)-sphere


To generate uniformly distributed random points on the (n − 1)-sphere (i.e., the surface of the n-ball), Marsaglia
(1972) gives the following algorithm.
Generate an n-dimensional vector of normal deviates (it suffices to use N(0, 1), although in fact the choice of the
variance is arbitrary), x = (x1, x2, …, xn).

Now calculate the "radius" of this point, r = √(x1² + x2² + ⋯ + xn²).

The vector (1/r)x is uniformly distributed over the surface of the unit n-ball.
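Marsaglia's algorithm is a few lines of Python (an illustrative sketch, not from the original article):

```python
import math, random

def random_on_sphere(n, rng=random):
    """Uniform point on the (n-1)-sphere: normalize an n-vector of N(0,1) deviates."""
    while True:
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        r = math.sqrt(sum(c * c for c in x))
        if r > 0:  # the all-zero vector has probability zero
            return [c / r for c in x]

random.seed(0)
pts = [random_on_sphere(4) for _ in range(10000)]
# Every sample lies on the unit sphere...
assert all(abs(sum(c * c for c in p) - 1.0) < 1e-9 for p in pts)
# ...and by symmetry each coordinate averages to roughly zero.
mean_x0 = sum(p[0] for p in pts) / len(pts)
assert abs(mean_x0) < 0.03
```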

Examples
For example, when n = 2 the normal density exp(−x1²), when multiplied by the density for another axis exp(−x2²),
takes the form exp(−x1² − x2²) = exp(−r²), and so is only dependent on the distance from the origin.

Alternatives
Another way to generate a random distribution on a hypersphere is to make a uniform distribution over a hypercube
that includes the unit hyperball, exclude those points that are outside the hyperball, then project the remaining
interior points outward from the origin onto the surface. This will give a uniform distribution, but it is necessary to
remove the exterior points. As the relative volume of the hyperball to the hypercube decreases very rapidly with
dimension, this procedure will succeed with high probability only for fairly small numbers of dimensions.
Wendel's theorem gives the probability that all of the points generated will lie in the same half of the hypersphere.
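The rejection method, and the collapse of its acceptance rate Vn/2^n with dimension, can be sketched as follows (an illustration, not from the original article):

```python
import math, random

def random_on_sphere_rejection(n, rng=random):
    """Uniform point in the cube [-1,1]^n; reject points outside the unit ball;
    project the survivors outward onto the sphere."""
    while True:
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        r2 = sum(c * c for c in x)
        if 0 < r2 <= 1.0:
            r = math.sqrt(r2)
            return [c / r for c in x]

def acceptance_rate(n):
    """Probability a cube point lands in the ball: V_n / 2^n."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1) / 2 ** n

random.seed(1)
p = random_on_sphere_rejection(3)
assert abs(sum(c * c for c in p) - 1.0) < 1e-9
# pi/4 in the plane, but vanishingly small already by n = 20:
assert acceptance_rate(2) > 0.78
assert acceptance_rate(20) < 1e-7
```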

Uniformly at random from the n-ball


With a point selected uniformly at random from the surface of the unit n-ball (i.e., from the (n − 1)-sphere), one
needs only a radius to obtain a point uniformly at random within the n-ball. If u is a number generated uniformly at
random from the interval [0, 1] and x is a point selected uniformly at random from the surface of the unit n-ball,
then u^{1/n}x is uniformly distributed over the entire unit n-ball.
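Combining this radius trick with the Gaussian direction sampler gives a complete in-ball sampler (an illustrative sketch, not from the original article):

```python
import math, random

def random_in_ball(n, rng=random):
    """Uniform point in the unit n-ball: uniform direction times u^(1/n) radius."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    r = math.sqrt(sum(c * c for c in x))
    u = rng.random()
    scale = u ** (1.0 / n) / r
    return [c * scale for c in x]

random.seed(2)
n, trials = 3, 20000
inside_half = sum(
    1 for _ in range(trials)
    if sum(c * c for c in random_in_ball(n)) <= 0.25
)
# For a uniform point in the 3-ball, P(|X| <= 1/2) = (1/2)^3 = 0.125.
assert abs(inside_half / trials - 0.125) < 0.015
```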

Specific spheres
0-sphere
The pair of points {±R} with the discrete topology for some R > 0. The only sphere that is disconnected. Has a
natural Lie group structure; isomorphic to O(1). Parallelizable.
1-sphere
Also known as the circle. Has a nontrivial fundamental group. Abelian Lie group structure U(1); the circle
group. Topologically equivalent to the real projective line, RP1. Parallelizable. SO(2) = U(1).
2-sphere
Also known as the sphere. Complex structure; see Riemann sphere. Equivalent to the complex projective line,
CP1. SO(3)/SO(2).
3-sphere
Parallelizable. Principal U(1)-bundle over the 2-sphere. Lie group structure Sp(1), where also
Sp(1) = SO(4)/SO(3) = SU(2) = Spin(3).
4-sphere
Equivalent to the quaternionic projective line, HP1. SO(5)/SO(4).
5-sphere
Principal U(1)-bundle over CP2. SO(6)/SO(5) = SU(3)/SU(2).
6-sphere
Almost complex structure coming from the set of pure unit octonions. SO(7)/SO(6) = G2/SU(3).
7-sphere
Topological quasigroup structure as the set of unit octonions. Principal Sp(1)-bundle over S4. Parallelizable.
SO(8)/SO(7) = SU(4)/SU(3) = Sp(2)/Sp(1) = Spin(7)/G2 = Spin(6)/SU(3). The 7-sphere is of particular interest
since it was in this dimension that the first exotic spheres were discovered.
8-sphere
Equivalent to the octonionic projective line OP1.
23-sphere
A highly dense sphere packing is possible in 24-dimensional space, which is related to the unique qualities of
the Leech lattice.

Notes
[1] James W. Vick (1994). Homology theory, p. 60. Springer

References
• Flanders, Harley (1989). Differential forms with applications to the physical sciences. New York: Dover
Publications. ISBN 978-0-486-66169-8.
• Moura, Eduarda; Henderson, David G. (1996). Experiencing geometry: on plane and sphere (http://www.math.
cornell.edu/~henderson/books/eg00). Prentice Hall. ISBN 978-0-13-373770-7. (Chapter 20: 3-spheres and
hyperbolic 3-spaces.)
• Weeks, Jeffrey R. (1985). The Shape of Space: how to visualize surfaces and three-dimensional manifolds.
Marcel Dekker. ISBN 978-0-8247-7437-0. (Chapter 14: The Hypersphere.)
• Marsaglia, G. (1972). "Choosing a Point from the Surface of a Sphere". Ann. Math. Stat. 43 (2): 645–646.
doi:10.1214/aoms/1177692644.
• Huber, Greg (1982). "Gamma function derivation of n-sphere volumes". Am. Math. Monthly 89 (5): 301–302.
doi:10.2307/2321716. JSTOR 2321716. MR1539933.

External links
• Exploring Hyperspace with the Geometric Product (http://www.bayarea.net/~kins/thomas_briggs/)
• Weisstein, Eric W., " Hypersphere (http://mathworld.wolfram.com/Hypersphere.html)" from MathWorld.

Negative binomial distribution


Different texts adopt slightly different definitions for the negative binomial distribution. They can be distinguished by
whether the support starts at k = 0 or at k = r, and whether p denotes the probability of a success or of a failure.
Probability mass function

The orange line represents the mean, which is equal to 10 in each of these plots;
the green line shows the standard deviation.
Notation NB(r, p)
Parameters r > 0 — number of failures until the experiment is stopped (integer,
but the definition can also be extended to reals)
p ∈ (0,1) — success probability in each experiment (real)
Support k ∈ { 0, 1, 2, 3, … } — number of successes
PMF C(k + r − 1, k) p^k (1 − p)^r, involving a binomial coefficient
CDF 1 − Ip(k + 1, r), the regularized incomplete beta function
Mean pr/(1 − p)
Mode ⌊p(r − 1)/(1 − p)⌋ if r > 1; 0 if r ≤ 1
Variance pr/(1 − p)²
Skewness (1 + p)/√(pr)
Ex. kurtosis 6/r + (1 − p)²/(pr)
MGF ((1 − p)/(1 − p e^t))^r for t < −ln p
CF ((1 − p)/(1 − p e^{it}))^r
PGF ((1 − p)/(1 − p z))^r for |z| < 1/p

In probability theory and statistics, the negative binomial distribution is a discrete probability distribution of the
number of successes in a sequence of Bernoulli trials before a specified (non-random) number of failures (denoted r)
occurs. For example, if one throws a die repeatedly until the third time “1” appears, then the probability distribution of
the number of non-“1”s that had appeared will be negative binomial.
The Pascal distribution (after Blaise Pascal) and Polya distribution (for George Pólya) are special cases of the
negative binomial. There is a convention among engineers, climatologists, and others to reserve “negative binomial”
in a strict sense or “Pascal” for the case of an integer-valued stopping-time parameter r, and use “Polya” for the
real-valued case. The Polya distribution more accurately models occurrences of “contagious” discrete events, like
tornado outbreaks, than the Poisson distribution.

Definition
Suppose there is a sequence of independent Bernoulli trials, each trial having two potential outcomes called
“success” and “failure”. In each trial the probability of success is p and of failure is (1 − p). We are observing this
sequence until a predefined number r of failures has occurred. Then the random number of successes we have seen,
X, will have the negative binomial (or Pascal) distribution:

X ~ NB(r, p).
When applied to real-world problems, outcomes of success and failure may or may not be outcomes we ordinarily
view as good and bad, respectively. Suppose we used the negative binomial distribution to model the number of days
a certain machine works before it breaks down. In this case “success” would be the result on a day when the machine
worked properly, whereas a breakdown would be a “failure”. If we used the negative binomial distribution to model
the number of goal attempts a sportsman makes before scoring a goal, though, then each unsuccessful attempt would
be a “success”, and scoring a goal would be “failure”. If we are tossing a coin, then the negative binomial distribution
can give the number of heads (“success”) we are likely to encounter before we encounter a certain number of tails
(“failure”).
The probability mass function of the negative binomial distribution is

f(k; r, p) ≡ Pr(X = k) = C(k + r − 1, k) p^k (1 − p)^r,   for k = 0, 1, 2, …

Here the quantity in parentheses is the binomial coefficient, and is equal to

C(k + r − 1, k) = (k + r − 1)! / (k! (r − 1)!) = (k + r − 1)(k + r − 2) ⋯ r / k!.

This quantity can alternatively be written in the following manner, explaining the name “negative binomial”:

(k + r − 1)(k + r − 2) ⋯ r / k! = (−1)^k (−r)(−r − 1)(−r − 2) ⋯ (−r − k + 1) / k! = (−1)^k C(−r, k).
To understand the above definition of the probability mass function, note that the probability for every specific
sequence of k successes and r failures is (1 − p)^r p^k, because the outcomes of the k + r trials are supposed to happen
independently. Since the rth failure comes last, it remains to choose the k trials with successes out of the remaining
k + r − 1 trials. The above binomial coefficient, due to its combinatorial interpretation, gives precisely the number of
all these sequences of length k + r − 1.
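The mass function above can be sketched directly from the binomial coefficient (an illustration, not part of the original article); it sums to one over the support and reproduces the mean pr/(1 − p):

```python
import math

def negbin_pmf(k, r, p):
    """P(X = k): k successes (prob. p each) observed before the r-th failure."""
    return math.comb(k + r - 1, k) * p ** k * (1 - p) ** r

r, p = 3, 0.4
probs = [negbin_pmf(k, r, p) for k in range(200)]
assert abs(sum(probs) - 1.0) < 1e-12          # the pmf sums to one
mean = sum(k * q for k, q in enumerate(probs))
assert abs(mean - p * r / (1 - p)) < 1e-9     # mean = pr/(1-p)
```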

Extension to real-valued r
It is possible to extend the definition of the negative binomial distribution to the case of a positive real parameter r.
Although it is impossible to visualize a non-integer number of “failures”, we can still formally define the distribution
through its probability mass function.
As before, we say that X has a negative binomial (or Pólya) distribution if it has a probability mass function:

f(k; r, p) = C(k + r − 1, k) p^k (1 − p)^r,   for k = 0, 1, 2, …

Here r is a real, positive number. The binomial coefficient is then defined by the multiplicative formula and can also
be rewritten using the gamma function:

C(k + r − 1, k) = (k + r − 1)(k + r − 2) ⋯ r / k! = Γ(k + r) / (k! Γ(r)).      (*)

Note that by the binomial series and (*) above, for every 0 ≤ p < 1,

(1 − p)^{−r} = Σ_{k=0}^{∞} C(k + r − 1, k) p^k,
hence the terms of the probability mass function indeed add up to one.

Alternative formulations
Some textbooks may define the negative binomial distribution slightly differently than it is done here. The most
common variations are:
• The definition where X is the total number of trials needed to get r failures, not simply the number of successes.
Since the total number of trials is equal to the number of successes plus the number of failures, this definition
differs from ours by adding constant r. In order to convert formulas written with this definition into the one used
in the article, replace “k” everywhere with “k − r”, and also subtract r from the mean, the median, and the mode. In
order to convert formulas of this article into this alternative definition, replace “k” with “k + r” and add r to the
mean, the median and the mode. Effectively, this implies using the probability mass function

f(k; r, p) = C(k − 1, r − 1) p^{k−r} (1 − p)^r,   for k = r, r + 1, r + 2, …,
which perhaps resembles the binomial distribution more closely than the version above. Note that the arguments
of the binomial coefficient are decremented due to order: the last "failure" must occur last, and so the other events
have one fewer positions available when counting possible orderings. Note that this definition of the negative
binomial distribution does not easily generalize to a positive, real parameter r.
• The definition where p denotes the probability of a failure, not of a success. In order to convert formulas between
this definition and the one used in the article, replace “p” with “1 − p” everywhere.
• The definition where the support X is defined as the number of failures, rather than the number of successes. This
definition — where X counts failures but p is the probability of success — has exactly the same formulas as in the
previous case where X counts successes but p is the probability of failure. However, the corresponding text will
have the words “failure” and “success” swapped compared with the previous case.
• The two alterations above may be applied simultaneously, i.e. X counts total trials, and p is the probability of
failure.

Occurrence

Waiting time in a Bernoulli process


For the special case where r is an integer, the negative binomial distribution is known as the Pascal distribution. It
is the probability distribution of a certain number of failures and successes in a series of independent and identically
distributed Bernoulli trials. For k + r Bernoulli trials with success probability p, the negative binomial gives the
probability of k successes and r failures, with a failure on the last trial. In other words, the negative binomial
distribution is the probability distribution of the number of successes before the rth failure in a Bernoulli process,
with probability p of successes on each trial. A Bernoulli process is a discrete time process, and so the number of
trials, failures, and successes are integers.
Consider the following example. Suppose we repeatedly throw a die, and consider a “1” to be a “failure”. The
probability of failure on each trial is 1/6. The number of successes before the third failure belongs to the infinite set {
0, 1, 2, 3, ... }. That number of successes is a negative-binomially distributed random variable.
When r = 1 we get the probability distribution of the number of successes before the first failure (i.e. the probability of
the first failure occurring on the (k + 1)st trial), which is a geometric distribution:

f(k; 1, p) = p^k (1 − p),   for k = 0, 1, 2, …

Overdispersed Poisson
The negative binomial distribution, especially in its alternative parameterization described above, can be used as an
alternative to the Poisson distribution. It is especially useful for discrete data over an unbounded positive range
whose sample variance exceeds the sample mean. In such cases, the observations are overdispersed with respect to a
Poisson distribution, for which the mean is equal to the variance. Hence a Poisson distribution is not an appropriate
model. Since the negative binomial distribution has one more parameter than the Poisson, the second parameter can
be used to adjust the variance independently of the mean. See Cumulants of some discrete probability distributions.
An application of this is to annual counts of tropical cyclones in the North Atlantic or to monthly to 6-monthly
counts of wintertime extratropical cyclones over Europe, for which the variance is greater than the mean.[1][2][3] In
the case of modest overdispersion, this may produce substantially similar results to an overdispersed Poisson
distribution.[4][5]

Related distributions
• The geometric distribution (on { 0, 1, 2, 3, ... }) is a special case of the negative binomial distribution, with
r = 1: Geom(1 − p) = NB(1, p).
• The negative binomial distribution is a special case of the discrete phase-type distribution.

Poisson distribution
Consider a sequence of negative binomial distributions where the stopping parameter r goes to infinity, whereas the
probability of success in each trial, p, goes to zero in such a way as to keep the mean of the distribution constant.
Denoting this mean λ, the parameter p will have to be

p = λ/(r + λ),   so that the mean pr/(1 − p) equals λ.

Under this parametrization the probability mass function will be

f(k; r, p) = (Γ(k + r)/(k! Γ(r))) · λ^k r^r / (λ + r)^{k+r} = (λ^k / k!) · (Γ(k + r)/(Γ(r) (r + λ)^k)) · (1 + λ/r)^{−r}.

Now if we consider the limit as r → ∞, the second factor will converge to one, and the third to the exponent
function:

lim_{r→∞} f(k; r, p) = (λ^k / k!) · 1 · e^{−λ},
which is the mass function of a Poisson-distributed random variable with expected value λ.
In other words, the alternatively parameterized negative binomial distribution converges to the Poisson distribution
and r controls the deviation from the Poisson. This makes the negative binomial distribution suitable as a robust
alternative to the Poisson, which approaches the Poisson for large r, but which has larger variance than the Poisson
for small r.
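The convergence is easy to see numerically (an illustrative sketch, not part of the original article): holding the mean at λ and letting r grow, the negative binomial pmf approaches the Poisson pmf.

```python
import math

def negbin_pmf(k, r, p):
    return math.comb(k + r - 1, k) * p ** k * (1 - p) ** r

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 4.0
errs = []
for r in (10, 100, 1000):
    p = lam / (r + lam)          # keeps the mean pr/(1-p) fixed at lambda
    errs.append(max(abs(negbin_pmf(k, r, p) - poisson_pmf(k, lam)) for k in range(30)))
assert errs[0] > errs[1] > errs[2]   # the distance shrinks as r grows
assert errs[2] < 2e-3
```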

Gamma–Poisson mixture
The negative binomial distribution also arises as a continuous mixture of Poisson distributions (i.e. a compound
probability distribution) where the mixing distribution of the Poisson rate is a gamma distribution. That is, we can
view the negative binomial as a Poisson(λ) distribution, where λ is itself a random variable, distributed according to
Gamma(r, p/(1 − p)).
Formally, this means that the mass function of the negative binomial distribution can be written as

f(k; r, p) = ∫₀^∞ (λ^k e^{−λ} / k!) · (λ^{r−1} e^{−λ(1−p)/p} (1 − p)^r / (p^r Γ(r))) dλ = C(k + r − 1, k) p^k (1 − p)^r.
Because of this, the negative binomial distribution is also known as the gamma–Poisson (mixture) distribution.
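The mixture identity can be checked by numerical integration (an illustrative sketch, not part of the original article; the Gamma is parameterized by shape r and scale p/(1 − p)):

```python
import math

def negbin_pmf(k, r, p):
    return math.comb(k + r - 1, k) * p ** k * (1 - p) ** r

def gamma_pdf(lam, shape, scale):
    return lam ** (shape - 1) * math.exp(-lam / scale) / (math.gamma(shape) * scale ** shape)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

r, p = 5, 0.3
scale = p / (1 - p)                  # Gamma(r, p/(1-p)) mixing density
h, steps = 0.002, 20000              # midpoint rule on lambda in (0, 40)
lams = [(i + 0.5) * h for i in range(steps)]
weights = [gamma_pdf(l, r, scale) * h for l in lams]
for k in range(8):
    mixture = sum(poisson_pmf(k, l) * w for l, w in zip(lams, weights))
    assert abs(mixture - negbin_pmf(k, r, p)) < 1e-5
```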

Sum of geometric distributions


If Yr is a random variable following the negative binomial distribution with parameters r and p, and support
{0, 1, 2, ...}, then Yr is a sum of r independent variables following the geometric distribution (on { 0, 1, 2, 3, ... })
with parameter 1 − p. As a result of the central limit theorem, Yr (properly scaled and shifted) is therefore
approximately normal for sufficiently large r.
Furthermore, if Bs+r is a random variable following the binomial distribution with parameters s + r and 1 − p, then

Pr(Yr ≤ s) = Pr(Bs+r ≥ r).
In this sense, the negative binomial distribution is the "inverse" of the binomial distribution.
The sum of independent negative-binomially distributed random variables r1 and r2 with the same value for
parameter p is negative-binomially distributed with the same p but with "r-value" r1 + r2.
The negative binomial distribution is infinitely divisible, i.e., if Y has a negative binomial distribution, then for any
positive integer n, there exist independent identically distributed random variables Y1, ..., Yn whose sum has the same
distribution that Y has.

Representation as compound Poisson distribution


The negative binomial distribution NB(r,p) can be represented as a compound Poisson distribution: Let {Yn, n ∈ ℕ0}
denote a sequence of independent and identically distributed random variables, each one having the logarithmic
distribution Log(p), with probability mass function

f(k; p) = −p^k / (k ln(1 − p)),   for k = 1, 2, 3, …
Let N be a random variable, independent of the sequence, and suppose that N has a Poisson distribution with
parameter λ = −r ln(1 − p). Then the random sum

X = Σ_{n=1}^{N} Yn

is NB(r,p)-distributed. To prove this, we calculate the probability generating function GX of X, which is the
composition of the probability generating functions GN and GY1. Using

GN(z) = exp(λ(z − 1))

and

GY1(z) = ln(1 − pz) / ln(1 − p),

we obtain

GX(z) = GN(GY1(z)) = exp( −r ln(1 − p) · (ln(1 − pz)/ln(1 − p) − 1) ) = ((1 − p)/(1 − pz))^r,
which is the probability generating function of the NB(r,p) distribution.
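The composition of generating functions can be verified directly (an illustrative sketch, not part of the original article):

```python
import math

def pgf_negbin(z, r, p):
    """PGF of NB(r, p), X = number of successes: ((1-p)/(1-p z))^r."""
    return ((1 - p) / (1 - p * z)) ** r

def pgf_compound(z, r, p):
    """G_N(G_Y(z)) with N ~ Poisson(-r ln(1-p)) and Y ~ Log(p)."""
    lam = -r * math.log(1 - p)
    g_y = math.log(1 - p * z) / math.log(1 - p)   # PGF of the logarithmic distribution
    return math.exp(lam * (g_y - 1))

for z in (0.0, 0.25, 0.5, 0.9):
    assert abs(pgf_negbin(z, 4.0, 0.35) - pgf_compound(z, 4.0, 0.35)) < 1e-12
```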

Properties

Cumulative distribution function


The cumulative distribution function can be expressed in terms of the regularized incomplete beta function:

F(k; r, p) = Pr(X ≤ k) = 1 − Ip(k + 1, r) = I1−p(r, k + 1).
Sampling and point estimation of p


Suppose p is unknown and an experiment is conducted where it is decided ahead of time that sampling will continue
until r successes are found. A sufficient statistic for the experiment is k, the number of failures.
In estimating p, the minimum variance unbiased estimator is

p̂ = (r − 1) / (r + k − 1).

The maximum likelihood estimate of p is

p̂ = r / (r + k),

but this is a biased estimate. Its inverse, (r + k)/r, is an unbiased estimate of 1/p, however.[6]

Relation to the binomial theorem


Suppose Y is a random variable with a binomial distribution with parameters n and p. Assume p + q = 1, with
p, q ≥ 0. Then the binomial theorem implies that

1 = (p + q)^n = Σ_{k=0}^{n} C(n, k) p^k q^{n−k}.

Using Newton's binomial theorem, this can equally be written as:

(p + q)^n = Σ_{k=0}^{∞} C(n, k) p^k q^{n−k},
in which the upper bound of summation is infinite. In this case, the binomial coefficient

is defined when n is a real number, instead of just a positive integer. But in our case of the binomial distribution it is
zero when k > n. We can then say, for example,

(p + q)^{8.3} = Σ_{k=0}^{∞} C(8.3, k) p^k q^{8.3−k}.

Now suppose r > 0 and we use a negative exponent:

1 = p^r · p^{−r} = p^r (1 − q)^{−r} = p^r Σ_{k=0}^{∞} C(−r, k) (−q)^k.

Then all of the terms are positive, and the term

p^r C(−r, k) (−q)^k = C(k + r − 1, k) p^r q^k

is just the probability that the number of failures before the rth success is equal to k, provided r is an integer. (If r is a
negative non-integer, so that the exponent is a positive non-integer, then some of the terms in the sum above are
negative, so we do not have a probability distribution on the set of all nonnegative integers.)
Now we also allow non-integer values of r. Then we have a proper negative binomial distribution, which is a
generalization of the Pascal distribution, which coincides with the Pascal distribution when r happens to be a positive
integer.
Recall from above that
The sum of independent negative-binomially distributed random variables r1 and r2 with the same value for
parameter p is negative-binomially distributed with the same p but with "r-value" r1 + r2.
This property persists when the definition is thus generalized, and affords a quick way to see that the negative
binomial distribution is infinitely divisible.

Parameter estimation

Maximum likelihood estimation


The likelihood function for N iid observations (k1, ..., kN) is

L(r, p) = Π_{i=1}^{N} f(ki; r, p),

from which we calculate the log-likelihood function

ℓ(r, p) = Σ_{i=1}^{N} [ ln Γ(ki + r) − ln(ki!) − ln Γ(r) + ki ln(p) + r ln(1 − p) ].

To find the maximum we take the partial derivatives with respect to r and p and set them equal to zero:

∂ℓ/∂p = Σ_{i=1}^{N} ki/p − N r/(1 − p) = 0

and

∂ℓ/∂r = Σ_{i=1}^{N} ψ(ki + r) − N ψ(r) + N ln(1 − p) = 0,

where

ψ(k) = Γ′(k)/Γ(k)

is the digamma function.

Solving the first equation for p gives:

p = Σ ki / (N r + Σ ki).

Substituting this in the second equation gives:

Σ_{i=1}^{N} ψ(ki + r) − N ψ(r) + N ln( r / (r + Σ ki / N) ) = 0.

This equation cannot be solved in closed form. If a numerical solution is desired, an iterative technique such as
Newton's method can be used.
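The numerical solution can be sketched in Python (an illustration, not part of the original article; bisection is used here instead of Newton's method, and the digamma function is approximated with a standard recurrence-plus-asymptotic-series scheme):

```python
import math, random

def digamma(x):
    """psi(x) via psi(x) = psi(x + 1) - 1/x and an asymptotic series (adequate here)."""
    acc = 0.0
    while x < 10:
        acc -= 1.0 / x
        x += 1
    return acc + math.log(x) - 1 / (2 * x) - 1 / (12 * x * x) + 1 / (120 * x ** 4)

def mle_r(ks, lo=1e-3, hi=1e6, iters=100):
    """Root of f(r) = sum_i psi(k_i + r) - N psi(r) + N ln(r / (r + mean))."""
    n, mean = len(ks), sum(ks) / len(ks)
    def f(r):
        return sum(digamma(k + r) for k in ks) - n * digamma(r) + n * math.log(r / (r + mean))
    for _ in range(iters):
        mid = math.sqrt(lo * hi)       # bisect on a log scale
        if f(mid) > 0:                 # f decreases from +inf toward negative values
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def sample_negbin(r, p, rng):
    """Count successes (prob. p each) until the r-th failure."""
    k = fails = 0
    while fails < r:
        if rng.random() < p:
            k += 1
        else:
            fails += 1
    return k

rng = random.Random(42)
data = [sample_negbin(5, 0.6, rng) for _ in range(3000)]
r_hat = mle_r(data)
p_hat = (sum(data) / len(data)) / (r_hat + sum(data) / len(data))  # p = mean/(r + mean)
assert 3.5 < r_hat < 7.0 and 0.5 < p_hat < 0.7   # close to the true r = 5, p = 0.6
```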

Examples

Selling candy
Pat is required to sell candy bars to raise money for the 6th grade field trip. There are thirty houses in the
neighborhood, and Pat is not supposed to return home until five candy bars have been sold. So the child goes door to
door, selling candy bars. At each house, there is a 0.4 probability of selling one candy bar and a 0.6 probability of
selling nothing.
What's the probability of selling the last candy bar at the nth house?
Recall that the NegBin(r, p) distribution describes the probability of k failures and r successes in k + r Bernoulli(p)
trials with success on the last trial. Selling five candy bars means getting five successes. The number of trials (i.e.
houses) this takes is therefore k + 5 = n. The random variable we are interested in is the number of houses, so we
substitute k = n − 5 into a NegBin(5, 0.4) mass function and obtain the following mass function of the distribution of
houses (for n ≥ 5):

f(n) = C(n − 1, 4) · 0.4^5 · 0.6^{n−5}
What's the probability that Pat finishes on the tenth house?

f(10) = C(9, 4) · 0.4^5 · 0.6^5 ≈ 0.1003

What's the probability that Pat finishes on or before reaching the eighth house?
To finish on or before the eighth house, Pat must finish at the fifth, sixth, seventh, or eighth house. Sum those
probabilities:

f(5) + f(6) + f(7) + f(8) ≈ 0.01024 + 0.03072 + 0.05530 + 0.07741 ≈ 0.1737
What's the probability that Pat exhausts all 30 houses in the neighborhood?
This can be expressed as the probability that Pat does not finish on the fifth through the thirtieth house:

1 − Σ_{n=5}^{30} f(n) ≈ 0.0015

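The three answers above can be reproduced in a few lines (an illustrative sketch, not part of the original article):

```python
import math

def pmf_house(n, r=5, p=0.4):
    """Probability that the 5th sale happens exactly at house n (n >= 5)."""
    return math.comb(n - 1, r - 1) * p ** r * (1 - p) ** (n - r)

tenth = pmf_house(10)
assert abs(tenth - 0.10033) < 1e-4            # finishes exactly at the tenth house
by_eighth = sum(pmf_house(n) for n in range(5, 9))
assert abs(by_eighth - 0.17367) < 1e-4        # finishes on or before the eighth house
all_thirty = 1 - sum(pmf_house(n) for n in range(5, 31))
assert all_thirty < 0.01                      # exhausting all 30 houses is unlikely
```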
Polygyny in African societies


Data on polygyny among a wide range of traditional African societies suggest that the distribution of wives follows a
range of binomial profiles. The majority of these are negative binomial, indicating the degree of competition for
wives. However, some tend towards a Poisson distribution and even beyond, towards a true binomial, indicating a
degree of conformity in the allocation of wives. Further analysis of these profiles indicates shifts along this
continuum between more competitiveness or more conformity according to the age of the husband and also
according to the status of particular sectors within a society. In this way, these binomial distributions provide a tool
for comparison, between societies, between sectors of societies, and over time.[7]

References
[1] Villarini, G.; Vecchi, G.A. and Smith, J.A. (2010). "Modeling of the dependence of tropical storm counts in the North Atlantic Basin on
climate indices". Monthly Weather Review 138 (7): 2681–2705. doi:10.1175/2010MWR3315.1.
[2] Mailier, P.J.; Stephenson, D.B.; Ferro, C.A.T.; Hodges, K.I. (2006). "Serial Clustering of Extratropical Cyclones". Monthly Weather Review
134 (8): 2224–2240. doi:10.1175/MWR3160.1.
[3] Vitolo, R.; Stephenson, D.B.; Cook, Ian M.; Mitchell-Wallace, K. (2009). "Serial clustering of intense European storms". Meteorologische
Zeitschrift 18 (4): 411–424. doi:10.1127/0941-2948/2009/0393.
[4] McCullagh, Peter; Nelder, John (1989). Generalized Linear Models, Second Edition. Boca Raton: Chapman and Hall/CRC.
ISBN 0-412-31760-5.
[5] Cameron, Adrian C.; Trivedi, Pravin K. (1998). Regression analysis of count data. Cambridge University Press. ISBN 0-521-63567-5.
[6] J. B. S. Haldane, "On a Method of Estimating Frequencies", Biometrika, Vol. 33, No. 3 (Nov., 1945), pp. 222–225. JSTOR 2332299
[7] Spencer, Paul, 1998, The Pastoral Continuum: the Marginalization of Tradition in East Africa, Clarendon Press, Oxford (pp. 51-92).

Further reading
• Hilbe, Joseph M. (2007). Negative Binomial Regression (http://www.cambridge.org/uk/catalogue/catalogue.
asp?isbn=9780521857727). Cambridge, UK: Cambridge University Press.

Noncentral chi-squared distribution


Noncentral chi-squared

Probability density function

Cumulative distribution function

Parameters k > 0 — degrees of freedom

λ > 0 — non-centrality parameter
Support x ∈ [0, +∞)
PDF (1/2) e^{−(x+λ)/2} (x/λ)^{k/4 − 1/2} I_{k/2−1}(√(λx))

CDF 1 − Q_{k/2}(√λ, √x), with Marcum Q-function Q_M

Mean k + λ
Variance 2(k + 2λ)
Skewness 2^{3/2} (k + 3λ)/(k + 2λ)^{3/2}

Ex. kurtosis 12(k + 4λ)/(k + 2λ)²

MGF exp(λt/(1 − 2t)) / (1 − 2t)^{k/2}
for 2t < 1

CF exp(iλt/(1 − 2it)) / (1 − 2it)^{k/2}
In probability theory and statistics, the noncentral chi-squared or noncentral χ² distribution is a generalization of
the chi-squared distribution. This distribution often arises in the power analysis of statistical tests in which the null
distribution is (perhaps asymptotically) a chi-squared distribution; important examples of such tests are the
likelihood ratio tests.

Background
Let (X1, X2, …, Xk) be k independent, normally distributed random variables with means μi and unit variances. Then the
random variable

Σ_{i=1}^{k} Xi²

is distributed according to the noncentral chi-squared distribution. It has two parameters: k, which specifies the
number of degrees of freedom (i.e. the number of Xi), and λ, which is related to the means of the random variables
by:

λ = Σ_{i=1}^{k} μi².

λ is sometimes called the noncentrality parameter. Note that some references define λ in other ways, such as half
of the above sum, or its square root.
This distribution arises in multivariate statistics as a derivative of the multivariate normal distribution. While the
central chi-squared distribution is the squared norm of a random vector with N(0k, Ik) distribution (i.e., the
squared distance from the origin of a point taken at random from that distribution), the non-central χ² is the squared
norm of a random vector with N(μ, Ik) distribution. Here 0k is a zero vector of length k, μ = (μ1, …, μk), and Ik
is the identity matrix of size k.

Definition
The probability density function is given by

$$ f_Y(x; k, \lambda) = \sum_{i=0}^{\infty} \frac{e^{-\lambda/2} (\lambda/2)^i}{i!} f_{Y_{k+2i}}(x), $$

where $Y_q$ is distributed as chi-squared with $q$ degrees of freedom.

From this representation, the noncentral chi-squared distribution is seen to be a Poisson-weighted mixture of central chi-squared distributions. Suppose that a random variable J has a Poisson distribution with mean $\lambda/2$, and the conditional distribution of Z given $J = i$ is chi-squared with k + 2i degrees of freedom. Then the unconditional distribution of Z is noncentral chi-squared with k degrees of freedom and noncentrality parameter $\lambda$.
Alternatively, the pdf can be written as

$$ f_Y(x; k, \lambda) = \frac{1}{2} e^{-(x+\lambda)/2} \left(\frac{x}{\lambda}\right)^{k/4 - 1/2} I_{k/2-1}\!\left(\sqrt{\lambda x}\right), $$

where $I_\nu(y)$ is a modified Bessel function of the first kind given by

$$ I_\nu(y) = \left(\frac{y}{2}\right)^{\nu} \sum_{j=0}^{\infty} \frac{(y^2/4)^j}{j!\, \Gamma(\nu + j + 1)}. $$
Using the relation between Bessel functions and hypergeometric functions, the pdf can also be written in terms of a confluent hypergeometric function.[1]
Siegel (1979) discusses the case k=0 specifically (zero degrees of freedom), in which case the distribution has a
discrete component at zero.
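As a quick numerical sanity check of the Poisson-mixture representation, the sketch below (plain Python; the helper names are ours, not from any library) truncates the series and verifies by crude midpoint integration that the resulting density has total mass 1 and mean k + λ:

```python
import math

def chi2_pdf(x, k):
    """Central chi-squared density with k degrees of freedom."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

def ncx2_pdf(x, k, lam, terms=60):
    """Noncentral chi-squared density as a (truncated) Poisson-weighted
    mixture of central chi-squared densities."""
    total, w = 0.0, math.exp(-lam / 2)   # Poisson(lam/2) weight for i = 0
    for i in range(terms):
        total += w * chi2_pdf(x, k + 2 * i)
        w *= (lam / 2) / (i + 1)         # next Poisson weight
    return total

# Crude midpoint-rule checks on (0, 60): total mass 1, mean k + lam
k, lam = 4, 3.0
h = 0.02
mass = mean = 0.0
for j in range(3000):
    x = h * (j + 0.5)
    d = ncx2_pdf(x, k, lam)
    mass += d * h
    mean += x * d * h
print(round(mass, 3), round(mean, 2))  # ≈ 1.0 and ≈ 7.0
```

The truncation at 60 terms is far more than enough here, since the Poisson weights decay factorially.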

Properties

Moment generating function

The moment generating function is given by

$$ M(t; k, \lambda) = \frac{\exp\left(\frac{\lambda t}{1 - 2t}\right)}{(1 - 2t)^{k/2}}, \qquad 2t < 1. $$

Moments
The first few raw moments are:

$$ \mu'_1 = k + \lambda $$
$$ \mu'_2 = (k + \lambda)^2 + 2(k + 2\lambda) $$
$$ \mu'_3 = (k + \lambda)^3 + 6(k + \lambda)(k + 2\lambda) + 8(k + 3\lambda) $$

The first few central moments are:

$$ \mu_2 = 2(k + 2\lambda) $$
$$ \mu_3 = 8(k + 3\lambda) $$
$$ \mu_4 = 12(k + 2\lambda)^2 + 48(k + 4\lambda) $$

The nth cumulant is

$$ \kappa_n = 2^{n-1}(n-1)!\,(k + n\lambda). $$

Hence the raw moments can be obtained recursively from the cumulants via

$$ \mu'_n = \sum_{j=0}^{n-1} \binom{n-1}{j} \kappa_{n-j}\, \mu'_j, \qquad \mu'_0 = 1. $$
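The first two of these moments can be checked by direct simulation, using the Background section's construction of the variable as a sum of squared normals (the particular means below are illustrative choices, not from the text):

```python
import math
import random

random.seed(42)

k = 3
mus = [1.0, 2.0, 0.5]           # assumed means; all variances set to 1
lam = sum(m * m for m in mus)   # noncentrality parameter, here 5.25

n = 100_000
samples = [sum(random.gauss(m, 1.0) ** 2 for m in mus) for _ in range(n)]
mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n

# Theory: E = k + lam, Var = 2*(k + 2*lam)
print(round(mean, 2), round(var, 1))
```

The printed Monte Carlo values should sit close to 8.25 and 27 respectively, up to sampling noise.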

Cumulative distribution function

Again using the relation between the central and noncentral chi-squared distributions, the cumulative distribution function (cdf) can be written as

$$ P(x; k, \lambda) = e^{-\lambda/2} \sum_{j=0}^{\infty} \frac{(\lambda/2)^j}{j!}\, Q(x; k + 2j), $$

where $Q(x; k)$ is the cumulative distribution function of the central chi-squared distribution with k degrees of freedom, which is given by

$$ Q(x; k) = \frac{\gamma(k/2,\, x/2)}{\Gamma(k/2)}, $$

and where $\gamma(s, y)$ is the lower incomplete gamma function.


The Marcum Q-function can also be used to represent the cdf.[2]
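A minimal sketch of the series form of the cdf, assuming only the standard power series for the regularized lower incomplete gamma function (the function names here are illustrative, not from any library):

```python
import math

def reg_lower_gamma(s, x, terms=200):
    """Regularized lower incomplete gamma P(s, x) via its power series:
    P(s, x) = x^s e^{-x} * sum_{n>=0} x^n / Gamma(s + n + 1)."""
    total = 0.0
    term = 1.0 / math.gamma(s + 1)
    for n in range(terms):
        total += term
        term *= x / (s + n + 1)
    return total * math.exp(s * math.log(x) - x)

def chi2_cdf(x, k):
    return reg_lower_gamma(k / 2, x / 2)

def ncx2_cdf(x, k, lam, terms=100):
    """Noncentral chi-squared CDF as a Poisson-weighted series of central CDFs."""
    total, w = 0.0, math.exp(-lam / 2)
    for j in range(terms):
        total += w * chi2_cdf(x, k + 2 * j)
        w *= (lam / 2) / (j + 1)
    return total

# As lam -> 0 only the j = 0 term survives; for k = 2 the central
# chi-squared CDF is 1 - exp(-x/2).
print(abs(ncx2_cdf(2.0, 2, 1e-12) - (1 - math.exp(-1))))  # tiny
```

This is a didactic implementation; a production code would use a library routine for the incomplete gamma function.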

Approximation
Sankaran[3] discusses a number of closed-form approximations for the cumulative distribution function. In an earlier paper,[4] he derived and states the following approximation:

$$ P(x; k, \lambda) \approx \Phi\left\{ \frac{\left(\frac{x}{k+\lambda}\right)^h - \left(1 + h p\,(h - 1 - 0.5(2 - h) m p)\right)}{h \sqrt{2p}\,(1 + 0.5 m p)} \right\} $$

where
$\Phi\{\cdot\}$ denotes the cumulative distribution function of the standard normal distribution;

$$ h = 1 - \frac{2}{3} \frac{(k+\lambda)(k+3\lambda)}{(k+2\lambda)^2}, \qquad p = \frac{k+2\lambda}{(k+\lambda)^2}, \qquad m = (h-1)(1-3h). $$

This and other approximations are discussed in a later textbook.[5]

To approximate the chi-squared distribution, the noncentrality parameter, $\lambda$, is set to zero, yielding

$$ P(x; k) \approx \Phi\left\{ \frac{\left(\frac{x}{k}\right)^{1/3} - \left(1 - \frac{2}{9k}\right)}{\sqrt{\frac{2}{9k}}} \right\}, $$

essentially approximating the normalized chi-squared distribution X / k as the cube of a Gaussian.

For a given probability, the formula is easily inverted to provide the corresponding approximation for $x$.
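The λ = 0 special case above can be sketched directly and compared against the exact chi-squared CDF for k = 2, where the CDF has the closed form 1 − e^(−x/2) (function names below are ours):

```python
import math

def phi_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def chi2_cdf_approx(x, k):
    """The lam = 0 case of the approximation in the text: the normalized
    chi-squared x/k treated as (approximately) the cube of a Gaussian."""
    num = (x / k) ** (1 / 3) - (1 - 2 / (9 * k))
    return phi_cdf(num / math.sqrt(2 / (9 * k)))

# For k = 2 the exact CDF is 1 - exp(-x/2); the approximation tracks it closely:
for x in (1.0, 2.0, 5.0):
    print(x, round(chi2_cdf_approx(x, 2), 3), round(1 - math.exp(-x / 2), 3))
```

Even at only 2 degrees of freedom the absolute error stays below about 0.01 over this range.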

Derivation of the pdf

The derivation of the probability density function is most easily done by performing the following steps:
1. First, assume without loss of generality that $\sigma_1 = \cdots = \sigma_k = 1$. Then the joint distribution of $X_1, \ldots, X_k$ is spherically symmetric, up to a location shift.
2. The spherical symmetry then implies that the distribution of $X_1^2 + \cdots + X_k^2$ depends on the means only through the squared length, $\lambda = \mu_1^2 + \cdots + \mu_k^2$. Without loss of generality, we can therefore take $\mu_1 = \sqrt{\lambda}$ and $\mu_2 = \cdots = \mu_k = 0$.
3. Now derive the density of $X_1^2$ (i.e. the k = 1 case). A simple transformation of random variables shows that

$$ f_{X_1^2}(x) = \frac{1}{2\sqrt{x}} \left( \phi(\sqrt{x} - \sqrt{\lambda}) + \phi(\sqrt{x} + \sqrt{\lambda}) \right) = \frac{1}{\sqrt{2\pi x}}\, e^{-(x+\lambda)/2} \cosh\!\left(\sqrt{\lambda x}\right), $$

where $\phi(\cdot)$ is the standard normal density.
4. Expand the cosh term in a Taylor series. This gives the Poisson-weighted mixture representation of the density, still for k = 1. The indices on the chi-squared random variables in the series above are 1 + 2i in this case.
5. Finally, the general case. We've assumed, without loss of generality, that $X_2, \ldots, X_k$ are standard normal, and so $X_2^2 + \cdots + X_k^2$ has a central chi-squared distribution with (k − 1) degrees of freedom, independent of $X_1^2$. Using the Poisson-weighted mixture representation for $X_1^2$, and the fact that the sum of chi-squared random variables is also chi-squared, completes the result. The indices in the series are (1 + 2i) + (k − 1) = k + 2i as required.
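The closed form from step 3 can be sanity-checked numerically; the sketch below verifies that the k = 1 density integrates to 1 and has mean 1 + λ (the expected value of (Z + √λ)², for an assumed λ):

```python
import math

def f_k1(x, lam):
    """Density of (Z + sqrt(lam))**2 for Z ~ N(0, 1): step 3 of the derivation."""
    return (math.exp(-(x + lam) / 2) * math.cosh(math.sqrt(lam * x))
            / math.sqrt(2 * math.pi * x))

lam = 2.5
h = 0.001
mass = mean = 0.0
for j in range(60000):                 # midpoint rule on (0, 60)
    x = h * (j + 0.5)
    d = f_k1(x, lam)
    mass += d * h
    mean += x * d * h
print(round(mass, 2), round(mean, 2))  # ≈ 1.0 and ≈ 1 + lam = 3.5
```

The mild x^(−1/2) singularity at the origin is integrable, so the crude midpoint rule still lands very close to the exact values.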

Related distributions
• If $V$ is chi-squared distributed, $V \sim \chi^2_k$, then $V$ is also noncentral chi-squared distributed with zero noncentrality: $V \sim {\chi'}^2_k(0)$.
• If $V_1 \sim {\chi'}^2_{k_1}(\lambda)$ and $V_2 \sim \chi^2_{k_2}$, and $V_1$ is independent of $V_2$, then a noncentral F-distributed variable is developed as $\frac{V_1 / k_1}{V_2 / k_2} \sim F'_{k_1, k_2}(\lambda)$.
• If $J \sim \mathrm{Poisson}(\lambda/2)$, then $\chi^2_{k + 2J} \sim {\chi'}^2_k(\lambda)$.
• Normal approximation:[6] if $V \sim {\chi'}^2_k(\lambda)$, then $\frac{V - (k + \lambda)}{\sqrt{2(k + 2\lambda)}} \to N(0, 1)$ in distribution as either $k \to \infty$ or $\lambda \to \infty$.

Transformations
Sankaran (1963) discusses transformations of the form $z = [(X - b)/(k + \lambda)]^{1/2}$. He analyzes the expansions of the cumulants of $z$ up to the term $O((k + \lambda)^{-4})$ and shows that the following choices of $b$ produce reasonable results:
• $b = (k - 1)/2$ makes the second cumulant of $z$ approximately independent of $\lambda$
• $b = (k - 1)/3$ makes the third cumulant of $z$ approximately independent of $\lambda$
• $b = (k - 1)/4$ makes the fourth cumulant of $z$ approximately independent of $\lambda$
Also, a simpler transformation $z_1 = (X - (k - 1)/2)^{1/2}$ can be used as a variance-stabilizing transformation that produces a random variable with mean $(\lambda + (k - 1)/2)^{1/2}$ and approximately unit variance.
Usability of these transformations may be hampered by the need to take the square roots of negative numbers.

Various chi and chi-squared distributions

Let $X_1, \ldots, X_k$ be independent random variables with $X_i \sim N(\mu_i, \sigma_i^2)$.

Name | Statistic
chi-squared distribution | $\sum_{i=1}^{k} \left(\frac{X_i - \mu_i}{\sigma_i}\right)^2$
noncentral chi-squared distribution | $\sum_{i=1}^{k} \left(\frac{X_i}{\sigma_i}\right)^2$
chi distribution | $\sqrt{\sum_{i=1}^{k} \left(\frac{X_i - \mu_i}{\sigma_i}\right)^2}$
noncentral chi distribution | $\sqrt{\sum_{i=1}^{k} \left(\frac{X_i}{\sigma_i}\right)^2}$



Notes
[1] Muirhead (2005) Theorem 1.3.4
[2] Nuttall, Albert H. (1975): Some Integrals Involving the QM Function (http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1055327), IEEE Transactions on Information Theory, 21(1), 95–96, ISSN 0018-9448
[3] Sankaran, M. (1963). Approximations to the non-central chi-squared distribution (http://biomet.oxfordjournals.org/cgi/content/citation/50/1-2/199), Biometrika, 50(1–2), 199–204
[4] Sankaran, M. (1959). "On the non-central chi-squared distribution", Biometrika 46, 235–237
[5] Johnson et al. (1995) Section 29.8
[6] Muirhead (2005) pages 22–24 and problem 1.18.

References
• Abramowitz, M. and Stegun, I.A. (1972), Handbook of Mathematical Functions, Dover. Section 26.4.25. (http://
www.math.sfu.ca/~cbm/aands/page_942.htm)
• Johnson, N. L., Kotz, S., Balakrishnan, N. (1970), Continuous Univariate Distributions, Volume 2, Wiley. ISBN
0-471-58494-0
• Muirhead, R. (2005) Aspects of Multivariate Statistical Theory (2nd Edition). Wiley. ISBN 0-471-76985-1
• Siegel, A.F. (1979), "The noncentral chi-squared distribution with zero degrees of freedom and testing for
uniformity", Biometrika, 66, 381–386
• Press, S.J. (1966), "Linear combinations of non-central chi-squared variates", The Annals of Mathematical
Statistics 37 (2): 480–487, JSTOR 2238621

Noncentral F-distribution
In probability theory and statistics, the noncentral F-distribution is a continuous probability distribution that is a generalization of the (ordinary) F-distribution. It describes the distribution of the quotient (X/n1)/(Y/n2), where the numerator X has a noncentral chi-squared distribution with n1 degrees of freedom and the denominator Y has a central chi-squared distribution with n2 degrees of freedom. It is also required that X and Y be statistically independent of each other.
It is the distribution of the test statistic in analysis of variance problems when the null hypothesis is false. The
noncentral F-distribution is used to find the power function of such a test.

Occurrence and specification

If $X$ is a noncentral chi-squared random variable with noncentrality parameter $\lambda$ and $\nu_1$ degrees of freedom, and $Y$ is a chi-squared random variable with $\nu_2$ degrees of freedom that is statistically independent of $X$, then

$$ F = \frac{X / \nu_1}{Y / \nu_2} $$

is a noncentral F-distributed random variable. The probability density function for the noncentral F-distribution is[1]

$$ p(f) = \sum_{k=0}^{\infty} \frac{e^{-\lambda/2} (\lambda/2)^k}{B\!\left(\frac{\nu_2}{2}, \frac{\nu_1}{2} + k\right) k!} \left(\frac{\nu_1}{\nu_2}\right)^{\frac{\nu_1}{2}+k} \left(\frac{\nu_2}{\nu_2 + \nu_1 f}\right)^{\frac{\nu_1+\nu_2}{2}+k} f^{\frac{\nu_1}{2} - 1 + k} $$

when $f \ge 0$ and zero otherwise. The degrees of freedom $\nu_1$ and $\nu_2$ are positive. The noncentrality parameter $\lambda$ is nonnegative. The term $B(x, y)$ is the beta function, where

$$ B(x, y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x + y)}. $$

The cumulative distribution function for the noncentral F-distribution is

$$ F(x \mid \nu_1, \nu_2, \lambda) = \sum_{j=0}^{\infty} \frac{e^{-\lambda/2} (\lambda/2)^j}{j!}\, I\!\left(\frac{\nu_1 x}{\nu_2 + \nu_1 x} \,\middle|\, \frac{\nu_1}{2} + j, \frac{\nu_2}{2}\right) $$


where $I(x \mid a, b)$ is the regularized incomplete beta function.


The mean and variance of the noncentral F-distribution are

$$ E[F] = \frac{\nu_2 (\nu_1 + \lambda)}{\nu_1 (\nu_2 - 2)}, \qquad \nu_2 > 2, $$

and

$$ \operatorname{Var}[F] = 2\, \frac{(\nu_1 + \lambda)^2 + (\nu_1 + 2\lambda)(\nu_2 - 2)}{(\nu_2 - 2)^2 (\nu_2 - 4)} \left(\frac{\nu_2}{\nu_1}\right)^2, \qquad \nu_2 > 4. $$
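The mean formula can be spot-checked by simulating the defining ratio of chi-squared variables (the degrees of freedom and noncentrality below are arbitrary illustrative choices):

```python
import math
import random

random.seed(7)

def chi2_draw(k):
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))

def ncx2_draw(k, lam):
    """Noncentral chi-squared draw: the noncentrality is packed into the
    mean of the first of k unit-variance normals."""
    first = random.gauss(math.sqrt(lam), 1.0) ** 2
    return first + chi2_draw(k - 1)

nu1, nu2, lam = 5, 10, 3.0
n = 100_000
mc_mean = sum((ncx2_draw(nu1, lam) / nu1) / (chi2_draw(nu2) / nu2)
              for _ in range(n)) / n

theory = nu2 * (nu1 + lam) / (nu1 * (nu2 - 2))   # = 2.0 for these values
print(round(mc_mean, 2))
```

With these parameters the theoretical mean is exactly 2, and the Monte Carlo estimate should agree to within sampling noise.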

Special cases
When λ = 0, the noncentral F-distribution becomes the F-distribution.

Related distributions
Z has a noncentral chi-squared distribution if

$$ Z = \lim_{\nu_2 \to \infty} \nu_1 F, $$

where F has a noncentral F-distribution.

Implementations
The noncentral F-distribution is implemented in the R language (e.g., the pf function), in MATLAB (the ncfcdf, ncfinv, ncfpdf, ncfrnd and ncfstat functions in the Statistics Toolbox), in Mathematica (the NoncentralFRatioDistribution function), in NumPy (random.noncentral_f), and in the Boost C++ Libraries.[2]
A collaborative wiki page implements an interactive online calculator, programmed in the R language, for the noncentral t, chi-squared, and F distributions, at the Institute of Statistics and Econometrics, School of Business and Economics, Humboldt-Universität zu Berlin.[3]

Notes
[1] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory, (New Jersey: Prentice Hall, 1998), p. 29.
[2] John Maddock, Paul A. Bristow, Hubert Holin, Xiaogang Zhang, Bruno Lalande, Johan Råde. "Noncentral F Distribution: Boost 1.39.0" (http://www.boost.org/doc/libs/1_39_0/libs/math/doc/sf_and_dist/html/math_toolkit/dist/dist_ref/dists/nc_f_dist.html). Boost.org. Retrieved 20 August 2011.
[3] Sigbert Klinke (10 December 2008). "Comparison of noncentral and central distributions" (http://mars.wiwi.hu-berlin.de/mediawiki/slides/index.php/Comparison_of_noncentral_and_central_distributions). Humboldt-Universität zu Berlin.

References
• Weisstein, Eric W., et al. "Noncentral F-distribution" (http://mathworld.wolfram.com/
NoncentralF-Distribution.html). MathWorld. Wolfram Research, Inc. Retrieved 20 August 2011.

Noncentral t-distribution
Noncentral Student's t

(Plot of the probability density function omitted.)

Parameters: $\nu > 0$ — degrees of freedom; $\mu \in \mathbb{R}$ — noncentrality parameter
Support: $x \in (-\infty, +\infty)$
PDF: see text
CDF: see text
Mean: see text
Mode: see text
Variance: see text
Skewness: see text
Ex. kurtosis: see text

In probability and statistics, the noncentral t-distribution (also known as the singly noncentral t-distribution)
generalizes Student's t-distribution using a noncentrality parameter. Like the central t-distribution, the noncentral
t-distribution is primarily used in statistical inference, although it may also be used in robust modeling for data. In
particular, the noncentral t-distribution arises in power analysis.

Characterization
If $Z$ is a normally distributed random variable with unit variance and zero mean, and $V$ is a chi-squared distributed random variable with $\nu$ degrees of freedom that is statistically independent of $Z$, then

$$ T = \frac{Z + \mu}{\sqrt{V / \nu}} $$

is a noncentral t-distributed random variable with $\nu$ degrees of freedom and noncentrality parameter $\mu$. Note that the noncentrality parameter may be negative.
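This characterization is easy to exercise by simulation; the sketch below draws T = (Z + μ)/√(V/ν) directly and compares the Monte Carlo mean with the known mean of the noncentral t-distribution, μ√(ν/2) Γ((ν−1)/2)/Γ(ν/2) for ν > 1 (the parameter values are illustrative):

```python
import math
import random

random.seed(123)

def nct_draw(nu, mu):
    """T = (Z + mu) / sqrt(V / nu), Z ~ N(0,1), V ~ chi-squared with nu df."""
    z = random.gauss(0.0, 1.0)
    v = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return (z + mu) / math.sqrt(v / nu)

nu, mu = 10, 1.5
n = 100_000
mc_mean = sum(nct_draw(nu, mu) for _ in range(n)) / n

# Known mean of the noncentral t: mu * sqrt(nu/2) * Gamma((nu-1)/2) / Gamma(nu/2)
theory = mu * math.sqrt(nu / 2) * math.gamma((nu - 1) / 2) / math.gamma(nu / 2)
print(round(mc_mean, 2), round(theory, 2))
```

Note the mean exceeds μ itself, since 1/√(V/ν) has expectation greater than one.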

Cumulative distribution function

The cumulative distribution function of the noncentral t-distribution with $\nu$ degrees of freedom and noncentrality parameter $\mu$ can be expressed as[1]

$$ F_{\nu,\mu}(x) = \begin{cases} \tilde{F}_{\nu,\mu}(x), & x \ge 0, \\ 1 - \tilde{F}_{\nu,-\mu}(-x), & x < 0, \end{cases} $$

where

$$ \tilde{F}_{\nu,\mu}(x) = \Phi(-\mu) + \frac{1}{2} \sum_{j=0}^{\infty} \left[ p_j\, I_y\!\left(j + \tfrac{1}{2}, \tfrac{\nu}{2}\right) + q_j\, I_y\!\left(j + 1, \tfrac{\nu}{2}\right) \right], $$

$I_y(a, b)$ is the regularized incomplete beta function evaluated at $y = x^2 / (x^2 + \nu)$,

$$ p_j = \frac{e^{-\mu^2/2} (\mu^2/2)^j}{j!}, \qquad q_j = \frac{\mu\, e^{-\mu^2/2} (\mu^2/2)^j}{\sqrt{2}\, \Gamma(j + 3/2)}, $$

and $\Phi$ is the cumulative distribution function of the standard normal distribution.
An alternative expression for the CDF, in terms of the gamma function and the regularized incomplete beta function, also exists. Although there are other forms of the cumulative distribution function, the form presented above is very easy to evaluate through recursive computing.[1] In the statistical software R, the cumulative distribution function is implemented as pt.

Probability density function

The probability density function for the noncentral t-distribution with $\nu$ degrees of freedom and noncentrality parameter $\mu$ can be expressed in several forms: one in terms of a confluent hypergeometric function ${}_1F_1$, and an alternative integral form.[2]
A third form of the density is obtained using its cumulative distribution functions, as follows:

$$ f_{\nu,\mu}(x) = \begin{cases} \dfrac{\nu}{x} \left( F_{\nu+2,\mu}\!\left(x \sqrt{1 + \tfrac{2}{\nu}}\right) - F_{\nu,\mu}(x) \right), & x \neq 0, \\[1ex] \dfrac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\pi\nu}\, \Gamma\!\left(\frac{\nu}{2}\right)}\, e^{-\mu^2/2}, & x = 0. \end{cases} $$

This is the approach implemented by the dt function in R.

Properties

Moments of the Noncentral t-distribution


In general, the th raw moment of the non-central t-distribution is [3]

In particular, the mean and variance of the noncentral t-distribution are

and

Asymmetry
The noncentral t-distribution is asymmetric unless μ is zero, i.e., a central t-distribution. The right tail will be
heavier than the left when μ > 0, and vice versa. However, the usual skewness is not generally a good measure of
asymmetry for this distribution, because if the degrees of freedom is not larger than 3, the third moment does not
exist at all. Even if the degrees of freedom is greater than 3, the sample estimate of the skewness is still very unstable
unless the sample size is very large.

Mode
The noncentral t-distribution is always unimodal and bell shaped, but the mode is not analytically available, although it always lies in the interval[4]

$$ \sqrt{\frac{\nu}{\nu + \frac{5}{2}}}\, \mu \;<\; \text{mode} \;<\; \sqrt{\frac{\nu}{\nu + 1}}\, \mu $$

when $\mu > 0$, and in the corresponding interval with the inequalities reversed when $\mu < 0$.

Moreover, the mode always has the same sign as the noncentrality parameter $\mu$, and the negative of the mode is exactly the mode for a noncentral t-distribution with the same number of degrees of freedom but noncentrality parameter $-\mu$.
The mode is strictly increasing with $\mu$ when $\mu > 0$ and strictly decreasing with $\mu$ when $\mu < 0$. In the limit, when $\mu$ approaches zero, the mode is approximated by

$$ \sqrt{\frac{\nu}{\nu + \frac{5}{2}}}\, \mu, $$

and when $\mu$ approaches infinity, the mode is approximated by

$$ \sqrt{\frac{\nu}{\nu + 1}}\, \mu. $$


Occurrences

Use in power analysis

Suppose we have an independent and identically distributed sample $X_1, \ldots, X_n$, each of which is normally distributed with mean $\theta$ and variance $\sigma^2$, and we are interested in testing the null hypothesis $\theta = \theta_0$ vs. the alternative hypothesis $\theta \neq \theta_0$. We can perform a one-sample t-test using the test statistic

$$ T = \frac{\bar{X} - \theta_0}{\hat{\sigma}/\sqrt{n}} = \frac{\dfrac{\bar{X} - \theta}{\sigma/\sqrt{n}} + \dfrac{\theta - \theta_0}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{\hat{\sigma}^2}{\sigma^2}}}, $$

where $\bar{X}$ is the sample mean and $\hat{\sigma}^2$ is the unbiased sample variance. Since the right-hand side of the second equality exactly matches the characterization of a noncentral t-distribution as described above, $T$ has a noncentral t-distribution with $n - 1$ degrees of freedom and noncentrality parameter $\sqrt{n}(\theta - \theta_0)/\sigma$.
If the test procedure rejects the null hypothesis whenever $|T| > t_{1-\alpha/2}$, where $t_{1-\alpha/2}$ is the upper $\alpha/2$ quantile of the (central) Student's t-distribution for a pre-specified $\alpha \in (0, 1)$, then the power of this test is given by

$$ 1 - F_{n-1,\mu}\!\left(t_{1-\alpha/2}\right) + F_{n-1,\mu}\!\left(-t_{1-\alpha/2}\right), \qquad \mu = \frac{\sqrt{n}(\theta - \theta_0)}{\sigma}. $$
Similar applications of the noncentral t-distribution can be found in the power analysis of the general normal-theory
linear models, which includes the above one sample t-test as a special case.
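The power computation can be approximated by brute-force simulation of the test itself. In the sketch below the critical value 2.262 (Student's t, 9 degrees of freedom, two-sided α = 0.05) is a standard table value, and the sample size and effect size are illustrative choices:

```python
import math
import random

random.seed(2024)

def t_statistic(xs, theta0):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)   # unbiased sample variance
    return (mean - theta0) / math.sqrt(var / n)

n, theta, theta0, sigma = 10, 0.8, 0.0, 1.0
t_crit = 2.262   # upper 0.025 quantile of Student's t with 9 df (table value)

sims = 20_000
rejections = 0
for _ in range(sims):
    sample = [random.gauss(theta, sigma) for _ in range(n)]
    if abs(t_statistic(sample, theta0)) > t_crit:
        rejections += 1
power = rejections / sims
print(round(power, 2))   # noncentrality sqrt(10)*0.8 ≈ 2.53 gives power well above 0.05
```

The simulated rejection rate estimates the same quantity as the noncentral-t power formula above.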

Related distributions
• Central t distribution: The central t-distribution can be converted into a location/scale family. This family of
distributions is used in data modeling to capture various tail behaviors. The location/scale generalization of the
central t-distribution is a different distribution from the noncentral t-distribution discussed in this article. In
particular, this approximation does not respect the asymmetry of the noncentral t-distribution. However, the
central t-distribution can be used as an approximation to the non-central t-distribution.[5]
• If $T$ is noncentral t-distributed with $\nu$ degrees of freedom and noncentrality parameter $\mu$ and $F = T^2$, then $F$ has a noncentral F-distribution with 1 numerator degree of freedom, $\nu$ denominator degrees of freedom, and noncentrality parameter $\mu^2$.
• If $T$ is noncentral t-distributed with $\nu$ degrees of freedom and noncentrality parameter $\mu$ and $Z = \lim_{\nu \to \infty} T$, then $Z$ has a normal distribution with mean $\mu$ and unit variance.
• When the denominator noncentrality parameter of a doubly noncentral t-distribution is zero, it becomes a (singly) noncentral t-distribution.

Special cases
• When $\mu = 0$, the noncentral t-distribution becomes the central (Student's) t-distribution with the same degrees of freedom.

References
[1] Lenth, Russell V (1989). "Algorithm AS 243: Cumulative Distribution Function of the Non-central t Distribution". Journal of the Royal
Statistical Society. Series C (Applied Statistics) 38: 185–189. JSTOR 2347693.
[2] L. Scharf, Statistical Signal Processing, (Massachusetts: Addison-Wesley, 1991), p.177.
[3] Hogben, D; Wilk, MB (1961). "The moments of the non-central t-distribution". Biometrika 48: 465–468. JSTOR 2332772.
[4] van Aubel, A; Gawronski, W (2003). "Analytic properties of noncentral distributions" (http://www.sciencedirect.com/science/article/B6TY8-47G44WX-V/2/7705d2642b1a384b13e0578898a22d48). Applied Mathematics and Computation 141: 3–12. doi:10.1016/S0096-3003(02)00316-8.
[5] Helena Chmura Kraemer; Minja Paik (1979). "A Central t Approximation to the Noncentral t Distribution". Technometrics 21 (3): 357–360.
JSTOR 1267759.

External links
• Eric W. Weisstein. "Noncentral Student's t-Distribution." (http://mathworld.wolfram.com/
NoncentralStudentst-Distribution.html) From MathWorld—A Wolfram Web Resource

Norm (mathematics)
In linear algebra, functional analysis and related areas of mathematics, a norm is a function that assigns a strictly
positive length or size to all vectors in a vector space, other than the zero vector (which has zero length assigned to
it). A seminorm, on the other hand, is allowed to assign zero length to some non-zero vectors (in addition to the zero
vector).
A simple example is the 2-dimensional Euclidean space R2 equipped with the Euclidean norm. Elements in this
vector space (e.g., (3, 7)) are usually drawn as arrows in a 2-dimensional cartesian coordinate system starting at the
origin (0, 0). The Euclidean norm assigns to each vector the length of its arrow. Because of this, the Euclidean norm
is often known as the magnitude.
A vector space with a norm is called a normed vector space. Similarly, a vector space with a seminorm is called a
seminormed vector space.

Notation
The norm of a vector, matrix, or set (its cardinality) is usually denoted using the "double vertical line", Unicode U+2016 ( ‖ ). For example, the norm of a vector v is usually denoted ‖v‖. Sometimes the vertical line, Unicode U+007C ( | ), is used (e.g. |v|), but this latter notation is generally discouraged, because it is also used to denote the absolute value of scalars and the determinant of matrices. The double vertical line should not be confused with the "parallel to" symbol, Unicode U+2225 ( ∥ ). This is usually not a problem, because ‖ is used in parenthesis-like fashion, whereas ∥ is used as an infix operator.

Definition
Given a vector space V over a subfield F of the complex numbers, a norm on V is a function p: V → R with the
following properties:[1]
For all a ∈ F and all u, v ∈ V,
1. p(av) = |a| p(v), (positive homogeneity or positive scalability).
2. p(u + v) ≤ p(u) + p(v) (triangle inequality or subadditivity).
3. If p(v) = 0 then v is the zero vector (separates points).
A simple consequence of the first two axioms, positive homogeneity and the triangle inequality, is p(0) = 0 and thus
p(v) ≥ 0 (positivity).
A seminorm is a norm with the 3rd property (separating points) removed.
Although every vector space is seminormed (e.g., with the trivial seminorm in the Examples section below), it may
not be normed. Every vector space V with seminorm p(v) induces a normed space V/W, called the quotient space,
where W is the subspace of V consisting of all vectors v in V with p(v) = 0. The induced norm on V/W is clearly
well-defined and is given by:
p(W + v) = p(v).
A topological vector space is called normable (seminormable) if the topology of the space can be induced by a
norm (seminorm).

Examples
• All norms are seminorms.
• The trivial seminorm, with p(x) = 0 for all x in V.
• The absolute value is a norm on the real numbers.
• Every linear form f on a vector space defines a seminorm by x → |f(x)|.

Euclidean norm
On an n-dimensional Euclidean space Rn, the intuitive notion of length of the vector x = (x1, x2, ..., xn) is captured by the formula

$$ \|x\| = \sqrt{x_1^2 + \cdots + x_n^2}. $$

This gives the ordinary distance from the origin to the point x, a consequence of the Pythagorean theorem. The Euclidean norm is by far the most commonly used norm on Rn, but there are other norms on this vector space, as will be shown below. However, all these norms are equivalent in the sense that they all define the same topology.
On an n-dimensional complex space Cn the most common norm is

$$ \|x\| = \sqrt{|x_1|^2 + \cdots + |x_n|^2}. $$

In both cases we can also express the norm as the square root of the inner product of the vector and itself:

$$ \|x\| = \sqrt{x^* x}, $$

where x is represented as a column vector ([x1; x2; ...; xn]), and x* denotes its conjugate transpose.
This formula is valid for any inner product space, including Euclidean and complex spaces. For Euclidean spaces, the inner product is equivalent to the dot product. Hence, in this specific case the formula can also be written with the following notation:

$$ \|x\| = \sqrt{x \cdot x}. $$
The Euclidean norm is also called the Euclidean length, L2 distance, ℓ2 distance, L2 norm, or ℓ2 norm; see Lp
space.

The set of vectors in Rn+1 whose Euclidean norm is a given positive constant forms an n-sphere.

Euclidean norm of a complex number


The Euclidean norm of a complex number is the absolute value (also called the modulus) of it, if the complex plane is identified with the Euclidean plane R2. This identification of the complex number x + iy as a vector in the Euclidean plane makes the quantity $\sqrt{x^2 + y^2}$ (as first suggested by Euler) the Euclidean norm associated with the complex number.

Taxicab norm or Manhattan norm

$$ \|x\|_1 = \sum_{i=1}^{n} |x_i|. $$

The name relates to the distance a taxi has to drive in a rectangular street grid to get from the origin to the point x. The set of vectors whose 1-norm is a given constant forms the surface of a cross-polytope of dimension one less than that of the space. The taxicab norm is also called the L1 norm. The distance derived from this norm is called the Manhattan distance or L1 distance.
The 1-norm is simply the sum of the absolute values of the entries.
In contrast,

$$ \sum_{i=1}^{n} x_i $$

is not a norm because it may yield negative results.

p-norm
Let p ≥ 1 be a real number.

$$ \|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}. $$

Note that for p = 1 we get the taxicab norm, for p = 2 we get the Euclidean norm, and as p approaches $\infty$ the p-norm approaches the infinity norm or maximum norm.
This definition is still of some interest for 0 < p < 1, but the resulting function does not define a norm,[2] because it violates the triangle inequality. What is true for this case of 0 < p < 1, even in the measurable analog, is that the corresponding Lp class is a vector space, and it is also true that the function

$$ \int_X |f(x) - g(x)|^p \, d\mu $$

(without p-th root) defines a distance that makes Lp(X) into a complete metric topological vector space. These spaces are of great interest in functional analysis, probability theory, and harmonic analysis. However, outside trivial cases, this topological vector space is not locally convex and has no continuous nonzero linear forms. Thus the topological dual space contains only the zero functional.
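A small sketch of the p-norm in plain Python, illustrating both the limit toward the maximum norm and the failure of the triangle inequality when 0 < p < 1 (helper names are ours):

```python
def p_norm(x, p):
    """The p-norm (sum of |x_i|^p)^(1/p); a genuine norm only for p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1 / p)

def max_norm(x):
    return max(abs(xi) for xi in x)

v = [3.0, -4.0, 1.0]
print(p_norm(v, 1))    # taxicab norm: 8.0
print(p_norm(v, 2))    # Euclidean norm: sqrt(26)
print(round(p_norm(v, 50), 2), max_norm(v))  # large p approaches the max norm

# For 0 < p < 1 the triangle inequality fails:
x, y = [1.0, 0.0], [0.0, 1.0]
p = 0.5
print(p_norm([a + b for a, b in zip(x, y)], p))   # 4.0
print(p_norm(x, p) + p_norm(y, p))                # 2.0
```

The last two lines show ‖x + y‖ = 4 exceeding ‖x‖ + ‖y‖ = 2 for p = 1/2, which is why the p < 1 case is not a norm.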

Maximum norm (special case of: infinity norm, uniform norm, or supremum norm)

$$ \|x\|_\infty = \max\left(|x_1|, \ldots, |x_n|\right). $$

The set of vectors whose infinity norm is a given constant, c, forms the surface of a hypercube with edge length 2c.

Zero norm
In probability and functional analysis, the zero norm induces a complete metric topology for the space of measurable functions and for the F-space of sequences with F-norm

$$ (x_n) \mapsto \sum_n \frac{2^{-n} |x_n|}{1 + |x_n|}, $$

which is discussed by Stefan Rolewicz in Metric Linear Spaces.[3]

Hamming distance of a vector from zero

In metric geometry, the discrete metric takes the value one for distinct points and zero otherwise. When applied
coordinate-wise to the elements of a vector space, the discrete distance defines the Hamming distance, which is
important in coding and information theory. In the field of real or complex numbers, the distance of the discrete
metric from zero is not homogeneous in the non-zero point; indeed, the distance from zero remains one as its
non-zero argument approaches zero. However, the discrete distance of a number from zero does satisfy the other
properties of a norm, namely the triangle inequality and positive definiteness. When applied component-wise to
vectors, the discrete distance from zero behaves like a non-homogeneous "norm", which counts the number of
non-zero components in its vector argument; again, this non-homogeneous "norm" is discontinuous.

In signal processing and statistics, David Donoho referred to the zero "norm" with quotation marks. Following
Donoho's notation, the zero "norm" of x is simply the number of non-zero coordinates of x, or the Hamming distance
of the vector from zero. When this "norm" is localized to a bounded set, it is the limit of p-norms as p approaches 0.
Of course, the zero "norm" is not a B-norm, because it is not positive homogeneous. It is not even an F-norm,
because it is discontinuous, jointly and severally, with respect to the scalar argument in scalar-vector multiplication
and with respect to its vector argument. Abusing terminology, some engineers omit Donoho's quotation marks and
inappropriately call the number-of-nonzeros function the L0 norm (sic.), also misusing the notation for the Lebesgue
space of measurable functions.
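Donoho's zero "norm" and its relation to small-p p-norms can be sketched as follows (plain Python; function names are illustrative):

```python
def zero_norm(x):
    """Donoho's zero "norm": the number of nonzero coordinates
    (the Hamming distance of the vector from the zero vector)."""
    return sum(1 for xi in x if xi != 0)

def p_norm_pth_power(x, p):
    """sum of |x_i|^p over the nonzero coordinates (no p-th root)."""
    return sum(abs(xi) ** p for xi in x if xi != 0)

v = [0.0, 2.5, 0.0, -1.0, 3.0]
print(zero_norm(v))                          # 3
# On a bounded set, sum |x_i|^p approaches the nonzero count as p -> 0:
print(round(p_norm_pth_power(v, 0.001), 2))  # close to 3

# Not homogeneous: scaling the vector does not scale the "norm"
print(zero_norm([10 * xi for xi in v]))      # still 3
```

The last print illustrates the failure of positive homogeneity that disqualifies this function as a norm.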

Other norms
Other norms on Rn can be constructed by combining the above; for example, taking the taxicab norm of some coordinates combined with the Euclidean norm of the remaining ones yields a new norm on R4.
For any norm and any injective linear transformation A we can define a new norm of x, equal to

$$ \|Ax\|. $$
In 2D, with A a rotation by 45° and a suitable scaling, this changes the taxicab norm into the maximum norm. In 2D,
each A applied to the taxicab norm, up to inversion and interchanging of axes, gives a different unit ball: a
parallelogram of a particular shape, size and orientation. In 3D this is similar but different for the 1-norm
(octahedrons) and the maximum norm (prisms with parallelogram base).
All the above formulas also yield norms on Cn without modification.

Infinite-dimensional case
The generalization of the above norms to an infinite number of components leads to the Lp spaces, with norms

$$ \|x\|_p = \left( \sum_{i \in \mathbb{N}} |x_i|^p \right)^{1/p} \quad \text{resp.} \quad \|f\|_{p,X} = \left( \int_X |f(x)|^p \, dx \right)^{1/p} $$

(for complex-valued sequences x resp. functions f defined on $X \subseteq \mathbb{R}$), which can be further generalized (see Haar measure).

Any inner product $\langle \cdot, \cdot \rangle$ induces in a natural way the norm $\|x\| = \sqrt{\langle x, x \rangle}$.

Other examples of infinite dimensional normed vector spaces can be found in the Banach space article.

Properties
The concept of unit circle (the set of all vectors of norm 1) is different in different
norms: for the 1-norm the unit circle in R2 is a square, for the 2-norm (Euclidean
norm) it is the well-known unit circle, while for the infinity norm it is a different
square. For any p-norm it is a superellipse (with congruent axes). See the
accompanying illustration. Note that due to the definition of the norm, the unit circle
is always convex and centrally symmetric (therefore, for example, the unit ball may be
a rectangle but cannot be a triangle).

In terms of the vector space, the seminorm defines a topology on the space, and this is a Hausdorff topology precisely when the seminorm can distinguish between distinct vectors, which is again equivalent to the seminorm being a norm. The topology thus defined (by either a norm or a seminorm) can be understood either in terms of sequences or open sets. A sequence of vectors $\{v_n\}$ is said to converge in norm to $v$ if $\|v_n - v\| \to 0$ as $n \to \infty$. Equivalently, the topology consists of all sets that can be represented as a union of open balls.
Two norms ||•||α and ||•||β on a vector space V are called equivalent if there exist positive real numbers C and D such that

$$ C \|x\|_\alpha \le \|x\|_\beta \le D \|x\|_\alpha $$

for all x in V. For instance, on $\mathbb{R}^n$, if p > r > 0, then

$$ \|x\|_p \le \|x\|_r \le n^{(1/r - 1/p)} \|x\|_p. $$

In particular,

$$ \|x\|_2 \le \|x\|_1 \le \sqrt{n}\, \|x\|_2, \qquad \|x\|_\infty \le \|x\|_2 \le \sqrt{n}\, \|x\|_\infty, \qquad \|x\|_\infty \le \|x\|_1 \le n\, \|x\|_\infty. $$

(Illustrations of unit circles in different norms omitted.)
If the vector space is a finite-dimensional real or complex one, all norms are equivalent. If not, some norms are not. Equivalent norms define the same notions of continuity and convergence and for many purposes do not need to be distinguished. To be more precise, the uniform structures defined by equivalent norms on the vector space are uniformly isomorphic.
Every (semi-)norm is a sublinear function, which implies that every norm is a convex function. As a result, finding a global optimum of a norm-based objective function is often tractable.
Given a finite family of seminorms pi on a vector space, the sum

$$ p(x) = \sum_{i=0}^{n} p_i(x) $$

is again a seminorm.
For any norm p on a vector space V, we have that for all u and v ∈ V:
p(u ± v) ≥ | p(u) − p(v) |
For the lp norms, we have Hölder's inequality[4]

$$ |x^{\mathsf{T}} y| \le \|x\|_p \|y\|_q, \qquad \frac{1}{p} + \frac{1}{q} = 1. $$

A special case of this is the Cauchy–Schwarz inequality:[4]

$$ |x^{\mathsf{T}} y| \le \|x\|_2 \|y\|_2. $$
Classification of seminorms: Absolutely convex absorbing sets


All seminorms on a vector space V can be classified in terms of absolutely convex absorbing sets in V. To each such
set, A, corresponds a seminorm pA called the gauge of A, defined as
pA(x) := inf{α : α > 0, x ∈ α A}
with the property that
{x : pA(x) < 1} ⊆ A ⊆ {x : pA(x) ≤ 1}.
Conversely:
Any locally convex topological vector space has a local basis consisting of absolutely convex sets. A common
method to construct such a basis is to use a separating family (p) of seminorms p: the collection of all finite
intersections of sets {p<1/n} turns the space into a locally convex topological vector space so that every p is
continuous.
Such a method is used to design weak and weak* topologies.
The norm case:
Suppose now that (p) contains a single p: since (p) is separating, p is a norm, and A = {p < 1} is its open unit ball.
Then A is an absolutely convex bounded neighbourhood of 0, and p = pA is continuous.
The converse is due to Kolmogorov: any locally convex and locally bounded topological vector space is
normable. Precisely:
If V is an absolutely convex bounded neighbourhood of 0, the gauge gV (so that V={gV <1}) is a norm.

Notes
[1] Prugovec̆ki 1981, page 20 (http://books.google.com/books?id=GxmQxn2PF3IC&pg=PA20)
[2] Except in R1, where it coincides with the Euclidean norm, and R0, where it is trivial.
[3] Rolewicz, Stefan (1987), Functional analysis and control theory: Linear systems, Mathematics and its Applications (East European Series),
29 (Translated from the Polish by Ewa Bednarczuk ed.), Dordrecht; Warsaw: D. Reidel Publishing Co.; PWN—Polish Scientific Publishers,
pp. xvi+524, ISBN 90-277-2186-6, MR920371, OCLC 13064804
[4] Golub, Gene; Charles F. Van Loan (1996). Matrix Computations - Third Edition. Baltimore: The Johns Hopkins University Press. p. 53.
ISBN 0-8018-5413-X.

References
• Bourbaki, Nicolas (1987). "Chapters 1–5". Topological vector spaces. Springer. ISBN 3-540-13627-4.
• Prugovec̆ki, Eduard (1981). Quantum mechanics in Hilbert space (2nd ed.). Academic Press. p. 20.
ISBN 0-12-566060-X.

Normal distribution
Probability density function

(Plot omitted; the red curve is the standard normal distribution.)

Cumulative distribution function

(Plot omitted.)

Notation: $\mathcal{N}(\mu, \sigma^2)$
Parameters: μ ∈ R — mean (location); σ2 > 0 — variance (squared scale)
Support: x ∈ R
PDF: $\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
CDF: $\frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]$
Mean: μ
Median: μ
Mode: μ
Variance: $\sigma^2$
Skewness: 0
Ex. kurtosis: 0
Entropy: $\tfrac{1}{2}\ln\left(2\pi e \sigma^2\right)$
MGF: $\exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right)$
CF: $\exp\!\left(i\mu t - \tfrac{1}{2}\sigma^2 t^2\right)$

In probability theory, the normal (or Gaussian) distribution is a continuous probability distribution that has a
bell-shaped probability density function, known as the Gaussian function or informally the bell curve:[1]

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}. $$

The parameter μ is the mean or expectation (location of the peak) and σ 2 is the variance. σ is known as the standard
deviation. The distribution with μ = 0 and σ 2 = 1 is called the standard normal distribution or the unit normal
distribution. A normal distribution is often used as a first approximation to describe real-valued random variables
that cluster around a single mean value.
The normal distribution is considered the most prominent probability distribution in statistics. There are several
reasons for this:[2] First, the normal distribution arises from the central limit theorem, which states that under mild
conditions the mean of a large number of random variables drawn from the same distribution is distributed
approximately normally, irrespective of the form of the original distribution. This gives it exceptionally wide
application in, for example, sampling. Secondly, the normal distribution is very tractable analytically, that is, a large
number of results involving this distribution can be derived in explicit form.
For these reasons, the normal distribution is commonly encountered in practice, and is used throughout statistics,
natural sciences, and social sciences[3] as a simple model for complex phenomena. For example, the observational
error in an experiment is usually assumed to follow a normal distribution, and the propagation of uncertainty is
computed using this assumption. Note that a normally-distributed variable has a symmetric distribution about its
mean. Quantities that grow exponentially, such as prices, incomes or populations, are often skewed to the right, and
hence may be better described by other distributions, such as the log-normal distribution or Pareto distribution. In
addition, the probability of seeing a normally-distributed value that is far (i.e. more than a few standard deviations)
from the mean drops off extremely rapidly. As a result, statistical inference using a normal distribution is not robust
to the presence of outliers (data that is unexpectedly far from the mean, due to exceptional circumstances,
observational error, etc.). When outliers are expected, data may be better described using a heavy-tailed distribution
such as the Student's t-distribution.
From a technical perspective, alternative characterizations are possible, for example:
• The normal distribution is the only absolutely continuous distribution all of whose cumulants beyond the first two
(i.e. other than the mean and variance) are zero.
• For a given mean and variance, the corresponding normal distribution is the continuous distribution with the
maximum entropy.[4][5]
The normal distributions are a sub-class of the elliptical distributions.

Definition
The simplest case of a normal distribution is known as the standard normal distribution, described by the probability
density function

    ϕ(x) = (1/√(2π)) e^(−x²/2).

The factor 1/√(2π) in this expression ensures that the total area under the curve ϕ(x) is equal to one[proof], and the
1/2 in the exponent makes the "width" of the curve (measured as half the distance between the inflection points) also equal to
one. It is traditional in statistics to denote this function with the Greek letter ϕ (phi), whereas density functions for all
other distributions are usually denoted with letters f or p.[6] The alternative glyph φ is also used quite often, however
within this article "φ" is reserved to denote characteristic functions.
Every normal distribution is the result of exponentiating a quadratic function (just as an exponential distribution
results from exponentiating a linear function):

    f(x) = e^(ax² + bx + c),   with a < 0.

This yields the classic "bell curve" shape, provided that a < 0 so that the quadratic function is concave for x close to
0 (f(x) > 0 everywhere). One can adjust a to control the "width" of the bell, then adjust b to move the central peak of
the bell along the x-axis, and finally one must choose c such that ∫ f(x) dx = 1 over the whole real line (which is only
possible when a < 0).

Rather than using a, b, and c, it is far more common to describe a normal distribution by its mean μ = −b/(2a) and
variance σ² = −1/(2a). Changing to these new parameters allows one to rewrite the probability density function in a
convenient standard form,

    f(x) = (1/(σ√(2π))) e^(−(x − μ)²/(2σ²)) = (1/σ) ϕ((x − μ)/σ).
For a standard normal distribution, μ = 0 and σ2 = 1. The last part of the equation above shows that any other normal
distribution can be regarded as a version of the standard normal distribution that has been stretched horizontally by a
factor σ and then translated rightward by a distance μ. Thus, μ specifies the position of the bell curve's central peak,
and σ specifies the "width" of the bell curve.
The parameter μ is at the same time the mean, the median and the mode of the normal distribution. The parameter σ2
is called the variance; as for any random variable, it describes how concentrated the distribution is around its mean.
The square root of σ2 is called the standard deviation and is the width of the density function.
The normal distribution is usually denoted by N(μ, σ2).[7] Thus when a random variable X is distributed normally
with mean μ and variance σ², we write

    X ∼ N(μ, σ²).
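As a quick numerical check (a sketch in Python; the function name and the parameter values μ = 1, σ = 2 are illustrative, not part of the article), the density in the standard form above can be evaluated directly and integrated to confirm that the total area under the curve is one:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Riemann sum over [-20, 20], wide enough to capture essentially all the mass
dx = 0.001
area = sum(normal_pdf(-20.0 + i * dx, mu=1.0, sigma=2.0) * dx
           for i in range(40000))
```

The value `area` comes out equal to 1 up to the discretization error of the Riemann sum.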
Alternative formulations
Some authors advocate using the precision instead of the variance. The precision is normally defined as the
reciprocal of the variance (τ = σ−2), although it is occasionally defined as the reciprocal of the standard deviation (τ
= σ−1).[8] This parametrization has an advantage in numerical applications where σ2 is very close to zero and is more
convenient to work with in analysis as τ is a natural parameter of the normal distribution. This parametrization is
common in Bayesian statistics, as it simplifies the Bayesian analysis of the normal distribution. Another advantage
of using this parametrization is in the study of conditional distributions in the multivariate normal case. The form of
the normal distribution with the more common definition τ = σ−2 is as follows:

    f(x; μ, τ) = √(τ/(2π)) e^(−τ(x − μ)²/2).
The question of which normal distribution should be called the "standard" one is also answered differently by
various authors. Starting from the works of Gauss the standard normal was considered to be the one with variance
σ² = 1/2:

    f(x) = e^(−x²)/√π.

Stigler (1982) goes even further and insists the standard normal to be with the variance σ² = 1/(2π):

    f(x) = e^(−πx²).
According to the author, this formulation is advantageous because of a much simpler and easier-to-remember
formula, the fact that the pdf has unit height at zero, and simple approximate formulas for the quantiles of the
distribution.

Characterization
In the previous section the normal distribution was defined by specifying its probability density function. However
there are other ways to characterize a probability distribution. They include: the cumulative distribution function, the
moments, the cumulants, the characteristic function, the moment-generating function, etc.

Probability density function


The probability density function (pdf) of a random variable describes the relative frequencies of different values for
that random variable. The pdf of the normal distribution is given by the formula explained in detail in the previous
section:

    f(x; μ, σ²) = (1/(σ√(2π))) e^(−(x − μ)²/(2σ²)).
This is a proper function only when the variance σ2 is not equal to zero. In that case this is a continuous smooth
function, defined on the entire real line, and which is called the "Gaussian function".
Properties:
• Function f(x) is unimodal and symmetric around the point x = μ, which is at the same time the mode, the median
and the mean of the distribution.[9]
• The inflection points of the curve occur one standard deviation away from the mean (i.e., at x = μ − σ and x = μ +
σ).[9]
• Function f(x) is log-concave.[9]
• The standard normal density ϕ(x) is an eigenfunction of the Fourier transform in that if ƒ is a normalized Gaussian
function with variance σ2, centered at zero, then its Fourier transform is a Gaussian function with variance 1/σ2.
• The function is supersmooth of order 2, implying that it is infinitely differentiable.[10]
• The first derivative of ϕ(x) is ϕ′(x) = −x·ϕ(x); the second derivative is ϕ′′(x) = (x2 − 1)ϕ(x). More generally, the
n-th derivative is given by ϕ(n)(x) = (−1)nHn(x)ϕ(x), where Hn is the Hermite polynomial of order n.[11]
When σ² = 0, the density function does not exist as an ordinary function. However, the distribution can be described
by a generalized function that defines a measure on the real line and can be used to calculate, for example, expected
values:

    f(x; μ, 0) = δ(x − μ),

where δ is the Dirac delta function, which is "equal to infinity" at x = μ and zero elsewhere.

Cumulative distribution function


The cumulative distribution function (CDF) describes probability of a random variable falling in the interval (−∞, x].
The CDF of the standard normal distribution is denoted with the capital Greek letter Φ (phi), and can be computed as
an integral of the probability density function:

    Φ(x) = (1/√(2π)) ∫₋∞ˣ e^(−t²/2) dt.

This integral cannot be expressed in terms of elementary functions, so it is written as a transformation of the error
function erf, a special function:

    Φ(x) = ½ [1 + erf(x/√2)].

Numerical methods for calculation of the standard normal CDF are discussed
below. For a generic normal random variable with mean μ and variance σ² > 0 the CDF will be equal to

    F(x) = Φ((x − μ)/σ) = ½ [1 + erf((x − μ)/(σ√2))].
The complement of the standard normal CDF, Q(x) = 1 − Φ(x), is referred to as the Q-function, especially in
engineering texts.[12][13] This represents the upper tail probability of the Gaussian distribution: that is, the probability
that a standard normal random variable X is greater than the number x. Other definitions of the Q-function, all of
which are simple transformations of Φ, are also used occasionally.[14]
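The expression of Φ and Q in terms of erf translates directly into code; a short Python sketch (the function names are illustrative):

```python
import math

def Phi(x):
    """Standard normal CDF: Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Q(x):
    """Q-function: upper-tail probability that a standard normal exceeds x."""
    return 1.0 - Phi(x)
```

For example, `Phi(0)` returns 0.5, and `Phi(1.96)` is approximately 0.975.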

Properties:
• The standard normal CDF is 2-fold rotationally symmetric around point (0, ½):  Φ(−x) = 1 − Φ(x).
• The derivative of Φ(x) is equal to the standard normal pdf ϕ(x):  Φ′(x) = ϕ(x).
• The antiderivative of Φ(x) is: ∫ Φ(x) dx = x Φ(x) + ϕ(x).
For a normal distribution with zero variance, the CDF is the Heaviside step function (with the H(0) = 1 convention):

    F(x; μ, 0) = H(x − μ).
Quantile function
The quantile function of a distribution is the inverse of the CDF. The quantile function of the standard normal
distribution is called the probit function, and can be expressed in terms of the inverse error function:

    Φ⁻¹(p) = √2 · erf⁻¹(2p − 1),   p ∈ (0, 1).
Quantiles of the standard normal distribution are commonly denoted as zp. The quantile zp is the value such that a
standard normal random variable X falls in the interval (−∞, zp] with probability exactly p. The quantiles are used in
hypothesis testing, construction of confidence intervals and Q–Q plots. The most "famous"
normal quantile is 1.96 = z0.975. A standard normal random variable is greater than 1.96 in absolute value in 5% of
cases.
For a normal random variable with mean μ and variance σ², the quantile function is

    F⁻¹(p) = μ + σ·Φ⁻¹(p) = μ + σ√2 · erf⁻¹(2p − 1).
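Python's math module provides erf but no inverse, so one simple way to evaluate the probit function numerically is to invert Φ by bisection (a sketch; the bracketing interval and iteration count are arbitrary but generous choices):

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def probit(p):
    """Quantile function of the standard normal, by bisection on Phi."""
    lo, hi = -10.0, 10.0          # Phi is essentially 0 and 1 at these endpoints
    for _ in range(200):          # interval shrinks far below float precision
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

`probit(0.975)` recovers the familiar quantile 1.96, and the general normal quantile is then `mu + sigma * probit(p)`.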
Characteristic function and moment generating function


The characteristic function φX(t) of a random variable X is defined as the expected value of eitX, where i is the
imaginary unit, and t ∈ R is the argument of the characteristic function. Thus the characteristic function is the Fourier
transform of the density ϕ(x). For a normally distributed X with mean μ and variance σ2, the characteristic function
is[15]

    φ(t) = E[e^(itX)] = e^(iμt − σ²t²/2).

The characteristic function can be analytically extended to the entire complex plane: one defines
φ(z) = e^(iμz − σ²z²/2) for all z ∈ C.[16]
The moment generating function is defined as the expected value of etX. For a normal distribution, the moment
generating function exists and is equal to

    M(t) = E[e^(tX)] = e^(μt + σ²t²/2).

The cumulant generating function is the logarithm of the moment generating function:

    g(t) = ln M(t) = μt + ½ σ²t².
Since this is a quadratic polynomial in t, only the first two cumulants are nonzero.

Moments
The normal distribution has moments of all orders. That is, for a normally distributed X with mean μ and variance
σ², the expectation E[|X|^p] exists and is finite for every p such that Re[p] > −1. Usually we are interested only in
moments of integer orders: p = 1, 2, 3, ….
• Central moments are the moments of X around its mean μ. Thus, a central moment of order p is the expected
value of (X − μ) p. Using standardization of normal random variables, this expectation will be equal to σ p · E[Zp],
where Z is standard normal.
    E[(X − μ)^p] = 0 for odd p,   E[(X − μ)^p] = σ^p (p − 1)!! for even p.

Here n!! denotes the double factorial, that is, the product of every odd number from n down to 1.
• Central absolute moments are the moments of |X − μ|. They coincide with regular central moments for all even
orders, but are nonzero for all odd p's:

    E|X − μ|^p = σ^p (p − 1)!! √(2/π) for odd p,   E|X − μ|^p = σ^p (p − 1)!! for even p.

In general, E|X − μ|^p = σ^p · 2^(p/2) Γ((p + 1)/2)/√π; this last formula is valid for any p > −1, not necessarily an integer.


• Raw moments and raw absolute moments are the moments of X and |X| respectively. The formulas for these
moments are much more complicated, and are given in terms of confluent hypergeometric functions 1F1 and U.

These expressions remain valid even if p is not integer. See also generalized Hermite polynomials.
• First two cumulants are equal to μ and σ 2 respectively, whereas all higher-order cumulants are equal to zero.

Order   Raw moment                                      Central moment   Cumulant

1       μ                                               0                μ
2       μ² + σ²                                         σ²               σ²
3       μ³ + 3μσ²                                       0                0
4       μ⁴ + 6μ²σ² + 3σ⁴                                3σ⁴              0
5       μ⁵ + 10μ³σ² + 15μσ⁴                             0                0
6       μ⁶ + 15μ⁴σ² + 45μ²σ⁴ + 15σ⁶                     15σ⁶             0
7       μ⁷ + 21μ⁵σ² + 105μ³σ⁴ + 105μσ⁶                  0                0
8       μ⁸ + 28μ⁶σ² + 210μ⁴σ⁴ + 420μ²σ⁶ + 105σ⁸         105σ⁸            0
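The table entries can be sanity-checked numerically. The sketch below (the helper name and the values μ = 2, σ = 3 are arbitrary) integrates x^p against the density with a plain Riemann sum and compares with the closed forms for orders 3 and 4:

```python
import math

def raw_moment(p, mu, sigma, half_width=12.0, dx=0.001):
    """E[X^p] for X ~ N(mu, sigma^2), by a plain Riemann sum over +/- 12 sigma."""
    n = int(2 * half_width * sigma / dx)
    a = mu - half_width * sigma
    total = 0.0
    for i in range(n):
        x = a + i * dx
        z = (x - mu) / sigma
        total += x ** p * math.exp(-0.5 * z * z) * dx
    return total / (sigma * math.sqrt(2.0 * math.pi))

mu, s = 2.0, 3.0
m3 = raw_moment(3, mu, s)   # table: mu^3 + 3*mu*sigma^2 = 62
m4 = raw_moment(4, mu, s)   # table: mu^4 + 6*mu^2*sigma^2 + 3*sigma^4 = 475
```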

Properties

Standardizing normal random variables


Because the normal distribution is a location-scale family, it is possible to relate all normal random variables to the
standard normal. For example if X is normal with mean μ and variance σ², then

    Z = (X − μ)/σ

has mean zero and unit variance, that is Z has the standard normal distribution. Conversely, having a standard normal
random variable Z we can always construct another normal random variable with specific mean μ and variance σ²:

    X = σZ + μ.

This "standardizing" transformation is convenient as it allows one to compute the PDF and especially the CDF of a
normal distribution having the table of PDF and CDF values for the standard normal. They will be related via

    F_X(x) = Φ((x − μ)/σ),    f_X(x) = (1/σ)·ϕ((x − μ)/σ).
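In code, standardization reduces any normal probability to evaluations of the standard normal CDF Φ; a minimal Python sketch (function names are illustrative):

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) via standardization: F(x) = Phi((x - mu)/sigma)."""
    return Phi((x - mu) / sigma)

def interval_prob(a, b, mu, sigma):
    """P(a < X < b) for X ~ N(mu, sigma^2)."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
```

For instance, `interval_prob(3, 7, 5, 2)` is the one-sigma probability, about 0.6827, regardless of the particular μ and σ.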
Standard deviation and tolerance intervals


About 68% of values drawn from a
normal distribution are within one
standard deviation σ away from the
mean; about 95% of the values lie
within two standard deviations; and
about 99.7% are within three standard
deviations. This fact is known as the
68-95-99.7 rule, or the empirical rule,
or the 3-sigma rule. To be more
precise, the area under the bell curve
between μ − nσ and μ + nσ is given by

    F(μ + nσ) − F(μ − nσ) = Φ(n) − Φ(−n) = erf(n/√2),

where erf is the error function.

[Figure: dark blue marks values within one standard deviation of the mean, about 68% of the set; two standard
deviations (medium and dark blue) cover about 95%; three (light, medium, and dark blue) about 99.7%.]

To 12 decimal places, the values for the 1-, 2-, up to 6-sigma points are:[17]

n    F(μ + nσ) − F(μ − nσ)    i.e. 1 minus …    or 1 in …

1 0.682689492137 0.317310507863 3.15148718753

2 0.954499736104 0.045500263896 21.9778945080

3 0.997300203937 0.002699796063 370.398347345

4 0.999936657516 0.000063342484 15787.1927673

5 0.999999426697 0.000000573303 1744277.89362

6 0.999999998027 0.000000001973 506797345.897
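The coverage column of this table is just erf(n/√2), so it can be regenerated in one line of Python (a sketch):

```python
import math

# coverage of [mu - n*sigma, mu + n*sigma] for n = 1..6, as in the table above
coverage = {n: math.erf(n / math.sqrt(2.0)) for n in range(1, 7)}
```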

The next table gives the reverse relation of sigma multiples corresponding to a few often used values for the area
under the bell curve. These values are useful to determine (asymptotic) tolerance intervals of the specified levels
based on normally distributed (or asymptotically normal) estimators:[18]

p      n                      p            n

0.80 1.281551565545 0.999 3.290526731492

0.90 1.644853626951 0.9999 3.890591886413

0.95 1.959963984540 0.99999 4.417173413469

0.98 2.326347874041 0.999999 4.891638475699

0.99 2.575829303549 0.9999999 5.326723886384

0.995 2.807033768344 0.99999999 5.730728868236

0.998 3.090232306168 0.999999999 6.109410204869

where the value on the left of the table is the proportion of values that will fall within a given interval and n is a
multiple of the standard deviation that specifies the width of the interval.

Central limit theorem


The theorem states that under certain (fairly common)
conditions, the sum of a large number of random
variables will have an approximately normal
distribution. For example if (x₁, …, xₙ) is a sequence of iid random variables, each having mean μ and variance σ²,
then the central limit theorem states that

    √n · ( (x₁ + … + xₙ)/n − μ )  →d  N(0, σ²)   as n → ∞.

The theorem will hold even if the summands xᵢ are not iid, although some constraints on the degree of dependence
and the growth rate of moments still have to be imposed.

[Figure: as the number of discrete events increases, the function begins to resemble a normal distribution.]
The importance of the central limit theorem cannot be overemphasized. A great number of test statistics, scores, and
estimators encountered in practice contain sums of certain random variables in them, even more estimators can be
represented as sums of random variables through the use of influence functions – all of these quantities are governed
by the central limit theorem and will have asymptotically normal distribution as a result.
Another practical consequence of the central limit theorem is that certain other distributions can be approximated by
the normal distribution, for example:
• The binomial distribution B(n, p) is approximately normal N(np, np(1 − p)) for large n and for p not too close to
zero or one.
• The Poisson(λ) distribution is approximately normal N(λ, λ) for large values of λ.
• The chi-squared distribution χ²(k) is approximately normal N(k, 2k) for large k.
• The Student's t-distribution t(ν) is approximately
normal N(0, 1) when ν is large.
Whether these approximations are sufficiently accurate
depends on the purpose for which they are needed, and
the rate of convergence to the normal distribution. It is
typically the case that such approximations are less
accurate in the tails of the distribution.
A general upper bound for the approximation error in
the central limit theorem is given by the Berry–Esseen
theorem, improvements of the approximation are given
by the Edgeworth expansions.
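As an illustration of the binomial case, the sketch below compares the exact Binomial(n, p) CDF with the N(np, np(1 − p)) approximation (using the usual continuity correction; the particular values n = 100, p = 0.5 are arbitrary):

```python
import math

def binom_cdf(k, n, p):
    """Exact CDF of Binomial(n, p) at k."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k + 1))

def normal_approx_cdf(k, n, p):
    """Normal approximation N(np, np(1-p)), evaluated at k + 1/2."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    z = (k + 0.5 - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

exact = binom_cdf(55, 100, 0.5)
approx = normal_approx_cdf(55, 100, 0.5)
```

For n = 100 and p = 0.5 the two values agree to about three decimal places; the agreement degrades for p near 0 or 1, as the text notes.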

[Figure: comparison of probability density functions p(k) for the sum of n fair 6-sided dice, showing their
convergence to a normal distribution with increasing n, in accordance with the central limit theorem. In the
bottom-right graph, smoothed profiles of the previous graphs are rescaled, superimposed and compared with a
normal distribution (black curve).]

Miscellaneous
1. The family of normal distributions is closed under linear transformations. That is, if X is normally distributed
with mean μ and variance σ2, then a linear transform aX + b (for some real numbers a and b) is also normally
distributed:

    aX + b ∼ N(aμ + b, a²σ²).

Also if X₁, X₂ are two independent normal random variables, with means μ₁, μ₂ and standard deviations σ₁, σ₂,
then their linear combination will also be normally distributed:[proof]

    aX₁ + bX₂ ∼ N(aμ₁ + bμ₂, a²σ₁² + b²σ₂²).

2. The converse of (1) is also true: if X1 and X2 are independent and their sum X1 + X2 is distributed normally, then
both X1 and X2 must also be normal.[19] This is known as Cramér's decomposition theorem. The interpretation
of this property is that a normal distribution is only divisible by other normal distributions. Another application of
this property is in connection with the central limit theorem: although the CLT asserts that the distribution of a
sum of arbitrary non-normal iid random variables is approximately normal, the Cramér's theorem shows that it
can never become exactly normal.[20]
3. If the characteristic function φX of some random variable X is of the form φX(t) = eQ(t), where Q(t) is a
polynomial, then the Marcinkiewicz theorem (named after Józef Marcinkiewicz) asserts that Q can be at most a
quadratic polynomial, and therefore X a normal random variable.[20] The consequence of this result is that the
normal distribution is the only distribution with a finite number (two) of non-zero cumulants.
4. If X and Y are jointly normal and uncorrelated, then they are independent. The requirement that X and Y should be
jointly normal is essential, without it the property does not hold.[proof] For non-normal random variables
uncorrelatedness does not imply independence.


5. If X and Y are independent N(μ, σ 2) random variables, then X + Y and X − Y are also independent and identically
distributed (this follows from the polarization identity).[21] This property uniquely characterizes normal
distribution, as can be seen from the Bernstein's theorem: if X and Y are independent and such that X + Y and X
− Y are also independent, then both X and Y must necessarily have normal distributions.
More generally, if X₁, ..., Xₙ are independent random variables, then two linear combinations ∑ aₖXₖ and ∑ bₖXₖ
will be independent if and only if all Xₖ's are normal and ∑ aₖbₖσₖ² = 0, where σₖ² denotes the variance of Xₖ.[22]
6. The normal distribution is infinitely divisible:[23] for a normally distributed X with mean μ and variance σ2 we
can find n independent random variables {X₁, …, Xₙ}, each distributed normally with mean μ/n and
variance σ²/n, such that

    X ∼ X₁ + X₂ + … + Xₙ   (equality in distribution).
7. The normal distribution is stable (with exponent α = 2): if X1, X2 are two independent N(μ, σ2) random variables
and a, b are arbitrary real numbers, then

    aX₁ + bX₂ ∼ √(a² + b²) · X₃ + (a + b − √(a² + b²)) μ,

where X₃ is also N(μ, σ²). This relationship directly follows from property (1).
8. The Kullback–Leibler divergence between two normal distributions X₁ ∼ N(μ₁, σ₁²) and X₂ ∼ N(μ₂, σ₂²) is given
by:[24]

    D_KL(X₁ ‖ X₂) = ln(σ₂/σ₁) + (σ₁² + (μ₁ − μ₂)²)/(2σ₂²) − 1/2.
The Hellinger distance between the same distributions is equal to

    H²(X₁, X₂) = 1 − √(2σ₁σ₂/(σ₁² + σ₂²)) · e^(−(μ₁ − μ₂)²/(4(σ₁² + σ₂²))).
9. The Fisher information matrix for a normal distribution is diagonal and takes the form

    I(μ, σ²) = diag( 1/σ², 1/(2σ⁴) ).

10. Normal distributions belong to an exponential family with natural parameters θ₁ = μ/σ² and θ₂ = −1/(2σ²), and
natural statistics x and x². The dual, expectation parameters for the normal distribution are η₁ = μ and η₂ = μ² + σ².

11. The conjugate prior of the mean of a normal distribution is another normal distribution.[25] Specifically, if x₁,
…, xₙ are iid N(μ, σ²) and the prior is μ ∼ N(μ₀, σ₀²), then the posterior distribution of μ will be

    μ | x₁, …, xₙ ∼ N( (σ²μ₀ + σ₀² ∑ xᵢ)/(σ² + nσ₀²),  σ²σ₀²/(σ² + nσ₀²) ).
12. Of all probability distributions over the reals with mean μ and variance σ2, the normal distribution N(μ, σ2) is the
one with the maximum entropy.[26]
13. The family of normal distributions forms a manifold with constant curvature −1. The same family is flat with
respect to the (±1)-connections ∇(e) and ∇(m).[27]
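The Kullback–Leibler formula in property 8 is easy to code and check; a short Python sketch (the function name is illustrative, and the σ arguments are standard deviations):

```python
import math

def kl_normal(mu1, sigma1, mu2, sigma2):
    """D_KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ), from the closed form."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)
```

`kl_normal(0, 1, 0, 1)` is 0, the divergence is positive whenever the two distributions differ, and it is not symmetric in its arguments.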

Related distributions

Operations on a single random variable


If X is distributed normally with mean μ and variance σ2, then
• The exponential of X is distributed log-normally: eX ~ lnN (μ, σ2).
• The absolute value of X has folded normal distribution: |X| ~ Nf (μ, σ2). If μ = 0 this is known as the half-normal
distribution.
• The square of X/σ has the noncentral chi-squared distribution with one degree of freedom: X2/σ2 ~ χ21(μ2/σ2). If μ
= 0, the distribution is called simply chi-squared.
• The distribution of the variable X restricted to an interval [a, b] is called the truncated normal distribution.
• (X − μ)−2 has a Lévy distribution with location 0 and scale σ−2.

Combination of two independent random variables


If X1 and X2 are two independent standard normal random variables with mean 0 and variance 1, then
• Their sum and difference is distributed normally with mean zero and variance two: X1 ± X2 ∼ N(0, 2).
• Their product Z = X1·X2 follows the "product-normal" distribution[28] with density function fZ(z) = π−1K0(|z|),
where K0 is the modified Bessel function of the second kind. This distribution is symmetric around zero,
unbounded at z = 0, and has the characteristic function φZ(t) = (1 + t 2)−1/2.
• Their ratio follows the standard Cauchy distribution: X1 ÷ X2 ∼ Cauchy(0, 1).
• Their Euclidean norm √(X₁² + X₂²) has the Rayleigh distribution, also known as the chi distribution with 2
degrees of freedom.

Combination of two or more independent random variables


• If X₁, X₂, …, Xₙ are independent standard normal random variables, then the sum of their squares has the
chi-squared distribution with n degrees of freedom:

    X₁² + X₂² + … + Xₙ² ∼ χ²(n).
• If X1, X2, …, Xn are independent normally distributed random variables with means μ and variances σ2, then their
sample mean is independent from the sample standard deviation, which can be demonstrated using Basu's
theorem or Cochran's theorem. The ratio of these two quantities will have the Student's t-distribution with n − 1
degrees of freedom:

    t = (X̄ − μ)/(S/√n) ∼ t(n − 1),

where X̄ is the sample mean and S the sample standard deviation.

• If X₁, …, Xₙ, Y₁, …, Yₘ are independent standard normal random variables, then the ratio of their normalized
sums of squares will have the F-distribution with (n, m) degrees of freedom:

    F = [ (X₁² + … + Xₙ²)/n ] / [ (Y₁² + … + Yₘ²)/m ] ∼ F(n, m).

Operations on the density function


The split normal distribution is most directly defined in terms of joining scaled sections of the density functions of
different normal distributions and rescaling the density to integrate to one. The truncated normal distribution results
from rescaling a section of a single density function.

Extensions
The notion of normal distribution, being one of the most important distributions in probability theory, has been
extended far beyond the standard framework of the univariate (that is, one-dimensional) case. All these
extensions are also called normal or Gaussian laws, so a certain ambiguity in names exists.
• Multivariate normal distribution describes the Gaussian law in the k-dimensional Euclidean space. A vector X ∈
Rk is multivariate-normally distributed if any linear combination of its components ∑ aj Xj has a
(univariate) normal distribution. The variance of X is a k×k symmetric positive-definite matrix V.
• Rectified Gaussian distribution a rectified version of normal distribution with all the negative elements reset to 0
• Complex normal distribution deals with the complex normal vectors. A complex vector X ∈ Ck is said to be
normal if both its real and imaginary components jointly possess a 2k-dimensional multivariate normal
distribution. The variance-covariance structure of X is described by two matrices: the variance matrix Γ, and the
relation matrix C.
• Matrix normal distribution describes the case of normally distributed matrices.
• Gaussian processes are the normally distributed stochastic processes. These can be viewed as elements of some
infinite-dimensional Hilbert space H, and thus are the analogues of multivariate normal vectors for the case k = ∞.
A random element h ∈ H is said to be normal if for any constant a ∈ H the scalar product (a, h) has a (univariate)
normal distribution. The variance structure of such Gaussian random element can be described in terms of the
linear covariance operator K: H → H. Several Gaussian processes became popular enough to have their own
names:
• Brownian motion,
• Brownian bridge,
• Ornstein–Uhlenbeck process.
• Gaussian q-distribution is an abstract mathematical construction which represents a "q-analogue" of the normal
distribution.
• the q-Gaussian is an analogue of the Gaussian distribution, in the sense that it maximises the Tsallis entropy, and
is one type of Tsallis distribution. Note that this distribution is different from the Gaussian q-distribution above.
One of the main practical uses of the Gaussian law is to model the empirical distributions of many different random
variables encountered in practice. In such case a possible extension would be a richer family of distributions, having
more than two parameters and therefore being able to fit the empirical distribution more accurately. The examples of
such extensions are:
• Pearson distribution — a four-parametric family of probability distributions that extend the normal law to include
different skewness and kurtosis values.

Normality tests
Normality tests assess the likelihood that the given data set {x1, …, xn} comes from a normal distribution. Typically
the null hypothesis H0 is that the observations are distributed normally with unspecified mean μ and variance σ2,
versus the alternative Ha that the distribution is arbitrary. A great number of tests (over 40) have been devised for
this problem, the more prominent of them are outlined below:
• "Visual" tests are more intuitively appealing but subjective at the same time, as they rely on informal human
judgement to accept or reject the null hypothesis.
• Q–Q plot — is a plot of the sorted values from the data set against the expected values of the corresponding
quantiles from the standard normal distribution. That is, it's a plot of points of the form (Φ⁻¹(pₖ), x₍ₖ₎), where
the plotting positions pₖ are equal to pₖ = (k − α)/(n + 1 − 2α) and α is an adjustment constant which can be anything
between 0 and 1. If the null hypothesis is true, the plotted points should approximately lie on a straight line.
• P–P plot — similar to the Q–Q plot, but used much less frequently. This method consists of plotting the points
(Φ(z₍ₖ₎), pₖ), where z₍ₖ₎ = (x₍ₖ₎ − μ̂)/σ̂ are the standardized sorted observations. For normally distributed data
this plot should lie on the 45° line between (0, 0) and (1, 1).
• Shapiro–Wilk test employs the fact that the line in the Q–Q plot has the slope of σ. The test compares the least
squares estimate of that slope with the value of the sample variance, and rejects the null hypothesis if these two
quantities differ significantly.
• Normal probability plot (rankit plot)
• Moment tests:
• D'Agostino's K-squared test
• Jarque–Bera test
• Empirical distribution function tests:
• Lilliefors test (an adaptation of the Kolmogorov–Smirnov test)
• Anderson–Darling test
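The Q–Q plot coordinates described above are straightforward to compute. A Python sketch (the probit is inverted by bisection since the standard library has no inverse erf; α = 0.375 is one common choice of plotting-position constant, not prescribed by the text):

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def probit(p):
    """Standard normal quantile, by bisection on Phi."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def qq_points(data, alpha=0.375):
    """Points (Phi^-1(p_k), x_(k)) with p_k = (k - alpha) / (n + 1 - 2*alpha)."""
    xs = sorted(data)
    n = len(xs)
    return [(probit((k - alpha) / (n + 1 - 2 * alpha)), xs[k - 1])
            for k in range(1, n + 1)]

pts = qq_points([3.1, 2.9, 3.0])
```

If the data are approximately normal, the points fall close to a straight line whose slope estimates σ.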

Estimation of parameters
It is often the case that we don't know the parameters of the normal distribution, but instead want to estimate them.
That is, having a sample (x1, …, xn) from a normal N(μ, σ2) population we would like to learn the approximate
values of parameters μ and σ2. The standard approach to this problem is the maximum likelihood method, which
requires maximization of the log-likelihood function:

    ln L(μ, σ²) = ∑ᵢ ln f(xᵢ; μ, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) ∑ᵢ (xᵢ − μ)².

Taking derivatives with respect to μ and σ² and solving the resulting system of first order conditions yields the
maximum likelihood estimates:

    μ̂ = x̄ = (1/n) ∑ᵢ xᵢ,    σ̂² = (1/n) ∑ᵢ (xᵢ − x̄)².
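These closed-form estimates are trivial to compute; a Python sketch on a toy sample (the sample values are arbitrary), including the Bessel-corrected variant discussed below:

```python
def normal_mle(xs):
    """Maximum-likelihood estimates: (sample mean, biased sample variance)."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, var_hat

sample = [2.0, 4.0, 6.0]
mu_hat, var_hat = normal_mle(sample)
s2 = var_hat * len(sample) / (len(sample) - 1)   # unbiased (Bessel-corrected) variance
```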
Estimator μ̂ is called the sample mean, since it is the arithmetic mean of all observations. The statistic x̄ is complete
and sufficient for μ, and therefore by the Lehmann–Scheffé theorem, μ̂ is the uniformly minimum variance unbiased
(UMVU) estimator.[29] In finite samples it is distributed normally:

    μ̂ ∼ N(μ, σ²/n).

The variance of this estimator is equal to the μμ-element of the inverse Fisher information matrix I⁻¹. This implies
that the estimator is finite-sample efficient. Of practical importance is the fact that the standard error of μ̂ is
proportional to 1/√n, that is, if one wishes to decrease the standard error by a factor of 10, one must increase the
number of points in the sample by a factor of 100. This fact is widely used in determining sample sizes for opinion
polls and the number of trials in Monte Carlo simulations.


From the standpoint of the asymptotic theory, μ̂ is consistent, that is, it converges in probability to μ as n → ∞. The
estimator is also asymptotically normal, which is a simple corollary of the fact that it is normal in finite samples:

    √n (μ̂ − μ)  →d  N(0, σ²).
The estimator σ̂² is called the sample variance, since it is the variance of the sample (x₁, …, xₙ). In practice, another
estimator is often used instead of σ̂². This other estimator is denoted s², and is also called the sample variance,
which represents a certain ambiguity in terminology; its square root s is called the sample standard deviation. The
estimator s² differs from σ̂² by having (n − 1) instead of n in the denominator (the so-called Bessel's correction):

    s² = (1/(n − 1)) ∑ᵢ (xᵢ − x̄)² = (n/(n − 1)) σ̂².

The difference between s² and σ̂² becomes negligibly small for large n's. In finite samples however, the motivation
behind the use of s² is that it is an unbiased estimator of the underlying parameter σ², whereas σ̂² is biased. Also, by
the Lehmann–Scheffé theorem the estimator s² is uniformly minimum variance unbiased (UMVU),[29] which makes
it the "best" estimator among all unbiased ones. However it can be shown that the biased estimator σ̂² is "better"
than s² in terms of the mean squared error (MSE) criterion. In finite samples both s² and σ̂² have scaled
chi-squared distribution with (n − 1) degrees of freedom:

    s² ∼ (σ²/(n − 1)) · χ²(n − 1),    σ̂² ∼ (σ²/n) · χ²(n − 1).
The first of these expressions shows that the variance of s² is equal to 2σ⁴/(n − 1), which is slightly greater than the
σσ-element of the inverse Fisher information matrix, 2σ⁴/n. Thus, s² is not an efficient estimator for σ², and moreover,
since s² is UMVU, we can conclude that the finite-sample efficient estimator for σ² does not exist.

Applying the asymptotic theory, both estimators s² and σ̂² are consistent, that is they converge in probability to σ² as
the sample size n → ∞. The two estimators are also both asymptotically normal:

    √n (s² − σ²)  →d  N(0, 2σ⁴),    √n (σ̂² − σ²)  →d  N(0, 2σ⁴).
In particular, both estimators are asymptotically efficient for σ2.


By Cochran's theorem, for normal distribution the sample mean μ̂ and the sample variance s² are independent,
which means there can be no gain in considering their joint distribution. There is also a reverse theorem: if in a
sample the sample mean and sample variance are independent, then the sample must have come from the normal
distribution. The independence between μ̂ and s can be employed to construct the so-called t-statistic:

    t = (μ̂ − μ)/(s/√n).
This quantity t has the Student's t-distribution with (n − 1) degrees of freedom, and it is an ancillary statistic
(independent of the value of the parameters). Inverting the distribution of this t-statistic will allow us to construct
the confidence interval for μ;[30] similarly, inverting the χ² distribution of the statistic s² will give us the confidence
interval for σ²:[31]

    μ ∈ [ μ̂ − t_{n−1,1−α/2} · s/√n,  μ̂ + t_{n−1,1−α/2} · s/√n ]  ≈  μ̂ ± |z_{α/2}| · s/√n,
    σ² ∈ [ (n − 1)s²/χ²_{n−1,1−α/2},  (n − 1)s²/χ²_{n−1,α/2} ]  ≈  s² ± |z_{α/2}| · √(2/n) · s²,
where t_{k,p} and χ²_{k,p} are the pth quantiles of the t- and χ²-distributions respectively. These confidence intervals
are of the level 1 − α, meaning that the true values μ and σ² fall outside of these intervals with probability α. In
practice people usually take α = 5%, resulting in the 95% confidence intervals. The approximate formulas in the
display above were derived from
the asymptotic distributions of x̄ and s2. The approximate formulas become valid for large values of n, and are more
convenient for manual calculation since the standard normal quantiles zα/2 do not depend on n. In particular, the
most popular value α = 5% results in |z0.025| = 1.96.
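For illustration, the large-sample interval for μ can be computed with Python's standard library alone; the data values below are made up, and the z-quantile replaces the exact t-quantile, as in the approximate formulas just described.

```python
from statistics import NormalDist, mean, stdev

data = [4.9, 5.1, 4.8, 5.3, 5.0, 5.2, 4.7, 5.1]   # hypothetical sample
n = len(data)
xbar, s = mean(data), stdev(data)   # sample mean and s (n - 1 denominator)

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{0.025}, about 1.96
half = z * s / n ** 0.5
ci = (xbar - half, xbar + half)           # approximate 95% interval for mu
```

For small n the exact interval would use the t-quantile t_{n−1, 1−α/2} instead of z, which widens the interval.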

Bayesian analysis of the normal distribution


Bayesian analysis of normally-distributed data is complicated by the many different possibilities that may be
considered:
• Either the mean, or the variance, or neither, may be considered a fixed quantity.
• When the variance is unknown, analysis may be done directly in terms of the variance, or in terms of the
precision, the reciprocal of the variance. The reason for expressing the formulas in terms of precision is that the
analysis of most cases is simplified.
• Both univariate and multivariate cases need to be considered.
• Either conjugate or improper prior distributions may be placed on the unknown variables.
• An additional set of cases occurs in Bayesian linear regression, where in the basic model the data is assumed to be
normally-distributed, and normal priors are placed on the regression coefficients. The resulting analysis is similar
to the basic cases of independent identically distributed data, but more complex.
The formulas for the non-linear-regression cases are summarized in the conjugate prior article.

The sum of two quadratics

Scalar form
The following auxiliary formula is useful for simplifying the posterior update equations, which otherwise become
fairly tedious:

a(x − y)2 + b(x − z)2 = (a + b)(x − (ay + bz)/(a + b))2 + (ab/(a + b))(y − z)2

This equation rewrites the sum of two quadratics in x by expanding the squares, grouping the terms in x, and
completing the square. Note the following about the constant factors attached to some of the terms:

1. The factor (ay + bz)/(a + b) has the form of a weighted average of y and z.

2. The factor ab/(a + b) = 1/(1/a + 1/b) shows that it can be thought of as resulting from a
situation where the reciprocals of the quantities a and b add directly; to combine a and b themselves, it is necessary
to reciprocate, add, and reciprocate the result again to get back into the original units. This is exactly the sort of
operation performed by the harmonic mean, so it is not surprising that ab/(a + b) is one-half the harmonic mean of a
and b.
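The scalar identity is easy to verify numerically; this sketch assumes the standard completed-square form a(x − y)2 + b(x − z)2 = (a + b)(x − (ay + bz)/(a + b))2 + (ab/(a + b))(y − z)2.

```python
def lhs(x, y, z, a, b):
    # Sum of two quadratics in x.
    return a * (x - y) ** 2 + b * (x - z) ** 2

def rhs(x, y, z, a, b):
    w = (a * y + b * z) / (a + b)   # the weighted average of y and z
    return (a + b) * (x - w) ** 2 + (a * b / (a + b)) * (y - z) ** 2

# Spot-check the identity at a few arbitrary points.
for args in [(1.0, 2.0, -3.0, 0.5, 2.5), (-4.0, 0.1, 7.0, 3.0, 1.0)]:
    assert abs(lhs(*args) - rhs(*args)) < 1e-9
```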

Vector form
A similar formula can be written for the sum of two vector quadratics: if x, y, z are vectors of length k, and
A and B are symmetric, invertible matrices of size k×k, then

(y − x)′A(y − x) + (x − z)′B(x − z) = (x − c)′(A + B)(x − c) + (y − z)′(A−1 + B−1)−1(y − z)

where

c = (A + B)−1(Ay + Bz)

Note that the form x′Ax is called a quadratic form and is a scalar:

x′Ax = Σi,j aij xi xj

In other words, it sums up all possible combinations of products of pairs of elements from x, with a separate
coefficient for each. In addition, since xi xj = xj xi, only the sum aij + aji matters for any off-diagonal
elements of A, and there is no loss of generality in assuming that A is symmetric. Furthermore, if A is
symmetric, then the form x′Ay = y′Ax.
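A numeric spot-check of this vector identity for 2×2 matrices, in the standard form (y − x)′A(y − x) + (x − z)′B(x − z) = (x − c)′(A + B)(x − c) + (y − z)′(A−1 + B−1)−1(y − z) with c = (A + B)−1(Ay + Bz); the dependency-free tuple helpers are purely illustrative, and (A−1 + B−1)−1 is computed as A(A + B)−1B.

```python
def madd(M, N):
    return ((M[0][0] + N[0][0], M[0][1] + N[0][1]),
            (M[1][0] + N[1][0], M[1][1] + N[1][1]))

def mmul(M, N):
    return ((M[0][0]*N[0][0] + M[0][1]*N[1][0], M[0][0]*N[0][1] + M[0][1]*N[1][1]),
            (M[1][0]*N[0][0] + M[1][1]*N[1][0], M[1][0]*N[0][1] + M[1][1]*N[1][1]))

def minv(M):
    det = M[0][0]*M[1][1] - M[0][1]*M[1][0]
    return ((M[1][1]/det, -M[0][1]/det), (-M[1][0]/det, M[0][0]/det))

def mv(M, v):
    return (M[0][0]*v[0] + M[0][1]*v[1], M[1][0]*v[0] + M[1][1]*v[1])

def vsub(u, v):
    return (u[0] - v[0], u[1] - v[1])

def vadd(u, v):
    return (u[0] + v[0], u[1] + v[1])

def quad(v, M):
    # The quadratic form v'Mv.
    w = mv(M, v)
    return v[0]*w[0] + v[1]*w[1]

# Arbitrary symmetric positive-definite A, B and arbitrary vectors x, y, z.
A = ((2.0, 0.5), (0.5, 1.0))
B = ((1.0, -0.3), (-0.3, 3.0))
x, y, z = (0.7, -1.2), (2.0, 0.4), (-0.5, 1.1)

lhs = quad(vsub(y, x), A) + quad(vsub(x, z), B)
AB_inv = minv(madd(A, B))
c = mv(AB_inv, vadd(mv(A, y), mv(B, z)))
C = mmul(A, mmul(AB_inv, B))   # equals (A^-1 + B^-1)^-1
rhs = quad(vsub(x, c), madd(A, B)) + quad(vsub(y, z), C)
assert abs(lhs - rhs) < 1e-9
```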

The sum of differences from the mean


Another useful formula is as follows:

Σi (xi − μ)2 = Σi (xi − x̄)2 + n(x̄ − μ)2

where x̄ = (1/n) Σi xi is the sample mean.
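This identity (in its standard form, Σi (xi − μ)2 = Σi (xi − x̄)2 + n(x̄ − μ)2) can be verified numerically:

```python
xs = [1.2, 3.4, -0.7, 2.2, 5.0]   # arbitrary sample
mu = 1.5                          # arbitrary reference point
n = len(xs)
xbar = sum(xs) / n

# Squared deviations from mu versus the decomposed form.
lhs = sum((x - mu) ** 2 for x in xs)
rhs = sum((x - xbar) ** 2 for x in xs) + n * (xbar - mu) ** 2
assert abs(lhs - rhs) < 1e-9
```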

With known variance


For a set of i.i.d. normally-distributed data points X of size n, where each individual point x follows x ~ N(μ, σ2)
with known variance σ2, the conjugate prior distribution for the mean is also normally-distributed.
This can be shown more easily by rewriting the variance as the precision, i.e. using τ = 1/σ2. Then if
x ~ N(μ, 1/τ) and the prior is μ ~ N(μ0, 1/τ0), we proceed as follows.
First, the likelihood function is (using the formula above for the sum of differences from the mean):

Then, we proceed as follows:



In the above derivation, we used the formula above for the sum of two quadratics and eliminated all constant factors
not involving μ. The result is the kernel of a normal distribution, with mean (τ0μ0 + nτx̄)/(τ0 + nτ) and precision
τ0 + nτ, i.e.

This can be written as a set of Bayesian update equations for the posterior parameters in terms of the prior
parameters:

That is, to combine n data points with total precision nτ (or equivalently, total variance σ2/n) and mean of
values x̄, derive a new total precision simply by adding the total precision of the data to the prior total precision,
and form a new mean through a precision-weighted average, i.e. a weighted average of the data mean and the prior
mean, each weighted by the associated total precision. This makes logical sense if the precision is thought of as
indicating the certainty of the observations: In the distribution of the posterior mean, each of the input components is
weighted by its certainty, and the certainty of this distribution is the sum of the individual certainties. (For the
intuition of this, compare the expression "the whole is (or is not) greater than the sum of its parts". In addition,
consider that the knowledge of the posterior comes from a combination of the knowledge of the prior and likelihood,
so it makes sense that we are more certain of it than of either of its components.)
The above formula reveals why it is more convenient to do Bayesian analysis of conjugate priors for the normal
distribution in terms of the precision. The posterior precision is simply the sum of the prior and likelihood precisions,
and the posterior mean is computed through a precision-weighted average, as described above. The same formulas
can be written in terms of the variance by reciprocating all the precisions, yielding the uglier formulas:
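In code, the known-variance update is one line for each posterior parameter. This sketch assumes the precision parameterization just described: posterior precision τ0 + nτ, posterior mean a precision-weighted average of the prior mean and the data mean.

```python
def posterior_known_variance(mu0, tau0, tau, xs):
    # Prior N(mu0, 1/tau0); each observation has known precision tau = 1/sigma^2.
    n = len(xs)
    xbar = sum(xs) / n
    tau_n = tau0 + n * tau                          # precisions simply add
    mu_n = (tau0 * mu0 + n * tau * xbar) / tau_n    # precision-weighted average
    return mu_n, tau_n

mu_n, tau_n = posterior_known_variance(0.0, 1.0, 4.0, [1.0, 1.2, 0.8])
```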

With known mean


For a set of i.i.d. normally-distributed data points X of size n, where each individual point x follows x ~ N(μ, σ2)
with known mean μ, the conjugate prior of the variance has an inverse gamma distribution or a scaled inverse
chi-squared distribution. The two are equivalent except for having different parameterizations. The use of the inverse
gamma is more common, but the scaled inverse chi-squared is more convenient, so we use it in the following
derivation. The prior for σ2 is as follows:

The likelihood function from above, written in terms of the variance, is:

where

Then:

This is also a scaled inverse chi-squared distribution, where

or equivalently

Reparameterizing in terms of an inverse gamma distribution, the result is:
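This update can be sketched in code. The standard form of the scaled-inverse-chi-squared posterior is assumed here (the display equations above were lost in extraction): νn = ν0 + n and σn2 = (ν0σ02 + Σi (xi − μ)2)/(ν0 + n).

```python
def posterior_scaled_inv_chi2(nu0, s0sq, mu, xs):
    # Prior Scale-inv-chi2(nu0, s0sq); mu is the known mean.
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)   # squared deviations from the known mean
    nu_n = nu0 + n                        # degrees of freedom add
    s_nsq = (nu0 * s0sq + ss) / nu_n      # weighted pooling of prior and data scatter
    return nu_n, s_nsq

nu_n, s_nsq = posterior_scaled_inv_chi2(2.0, 1.0, 0.0, [1.0, -1.0, 2.0])
```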



With unknown mean and variance


For a set of i.i.d. normally-distributed data points X of size n, where each individual point x follows x ~ N(μ, σ2)
with unknown mean μ and variance σ2, a combined (multivariate) conjugate prior is placed over the mean and
variance, consisting of a normal-inverse-gamma distribution. Logically, this originates as follows:
1. From the analysis of the case with unknown mean but known variance, we see that the update equations involve
sufficient statistics computed from the data consisting of the mean of the data points and the total variance of the
data points, computed in turn from the known variance divided by the number of data points.
2. From the analysis of the case with unknown variance but known mean, we see that the update equations involve
sufficient statistics over the data consisting of the number of data points and sum of squared deviations.
3. Keep in mind that the posterior update values serve as the prior distribution when further data is handled. Thus,
we should logically think of our priors in terms of the sufficient statistics just described, with the same semantics
kept in mind as much as possible.
4. To handle the case where both mean and variance are unknown, we could place independent priors over the mean
and variance, with fixed estimates of the average mean, total variance, number of data points used to compute the
variance prior, and sum of squared deviations. Note however that in reality, the total variance of the mean
depends on the unknown variance, and the sum of squared deviations that goes into the variance prior (appears to)
depend on the unknown mean. In practice, the latter dependence is relatively unimportant: Shifting the actual
mean shifts the generated points by an equal amount, and on average the squared deviations will remain the same.
This is not the case, however, with the total variance of the mean: As the unknown variance increases, the total
variance of the mean will increase proportionately, and we would like to capture this dependence.
5. This suggests that we create a conditional prior of the mean on the unknown variance, with a hyperparameter
specifying the mean of the pseudo-observations associated with the prior, and another parameter specifying the
number of pseudo-observations. This number serves as a scaling parameter on the variance, making it possible to
control the overall variance of the mean relative to the actual variance parameter. The prior for the variance also
has two hyperparameters, one specifying the sum of squared deviations of the pseudo-observations associated
with the prior, and another specifying once again the number of pseudo-observations. Note that each of the priors
has a hyperparameter specifying the number of pseudo-observations, and in each case this controls the relative
variance of that prior. These are given as two separate hyperparameters so that the variance (aka the confidence)
of the two priors can be controlled separately.
6. This leads immediately to the normal-inverse-gamma distribution, which is defined as the product of the two
distributions just defined, with conjugate priors used (an inverse gamma distribution over the variance, and a
normal distribution over the mean, conditional on the variance) and with the same four parameters just defined.
The priors are normally defined as follows:

The update equations can be derived, and look as follows:



The respective numbers of pseudo-observations simply have the number of actual observations added to them. The new mean
hyperparameter is once again a weighted average, this time weighted by the relative numbers of observations.
Finally, the update for the sum-of-squared-deviations hyperparameter is similar to the case with known mean, but in this case the sum of squared deviations
is taken with respect to the observed data mean rather than the true mean, and as a result a new "interaction term"
needs to be added to take care of the additional error source stemming from the deviation between the prior and data
means.
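As a sketch, the normal-inverse-gamma update can be written in a common (μ0, λ, α, β) parameterization, where λ counts the pseudo-observations for the mean and (α, β) are the inverse-gamma shape and scale; the parameter names are illustrative, not taken from the text above.

```python
def posterior_nig(mu0, lam, alpha, beta, xs):
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)        # deviations from the *data* mean
    mu_n = (lam * mu0 + n * xbar) / (lam + n)    # weighted by observation counts
    lam_n = lam + n
    alpha_n = alpha + n / 2
    # The last summand is the "interaction term" handling the deviation
    # between the prior mean and the data mean.
    beta_n = beta + 0.5 * ss + lam * n * (xbar - mu0) ** 2 / (2 * (lam + n))
    return mu_n, lam_n, alpha_n, beta_n

mu_n, lam_n, alpha_n, beta_n = posterior_nig(0.0, 1.0, 1.0, 1.0, [2.0, 2.0])
```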
Proof is as follows.

Occurrence
The occurrence of normal distribution in practical problems can be loosely classified into three categories:
1. Exactly normal distributions;
2. Approximately normal laws, for example when such approximation is justified by the central limit theorem; and
3. Distributions modeled as normal – the normal distribution being the distribution with maximum entropy for a
given mean and variance.

Exact normality
Certain quantities in physics are distributed normally, as was first
demonstrated by James Clerk Maxwell. Examples of such quantities
are:
• Velocities of the molecules in an ideal gas. More generally,
velocities of the particles in any system in thermodynamic
equilibrium will have a normal distribution, due to the maximum
entropy principle.
• The probability density function of a ground state in a quantum harmonic oscillator.
• The position of a particle that experiences diffusion. If initially the particle is located at a specific point (that is,
its probability distribution is the Dirac delta function), then after time t its location is described by a normal
distribution with variance t, which satisfies the diffusion equation ∂f(x,t)/∂t = (1/2) ∂2f(x,t)/∂x2.
If the initial location is given by a certain density function g(x), then the density at time t is the convolution
of g and the normal PDF.

Approximate normality
Approximately normal distributions occur in many situations, as explained by the central limit theorem. When the
outcome is produced by a large number of small effects acting additively and independently, its distribution will be
close to normal. The normal approximation will not be valid if the effects act multiplicatively (instead of additively),
or if there is a single external influence which has a considerably larger magnitude than the rest of the effects.
• In counting problems, where the central limit theorem includes a discrete-to-continuum approximation and where
infinitely divisible and decomposable distributions are involved, such as
• Binomial random variables, associated with binary response variables;
• Poisson random variables, associated with rare events;
• Thermal light has a Bose–Einstein distribution on very short time scales, and a normal distribution on longer
timescales due to the central limit theorem.

Assumed normality
I can only recognize the occurrence of the normal curve – the
Laplacian curve of errors – as a very abnormal phenomenon. It
is roughly approximated to in certain distributions; for this
reason, and on account of its beautiful simplicity, we may,
perhaps, use it as a first approximation, particularly in theoretical
investigations.

—Pearson (1901)
There are statistical methods to empirically test that assumption; see
the Normality tests section above.
• In biology, the logarithms of various variables tend to have a normal distribution, that is, the variables themselves
tend to have a log-normal distribution (after separation into male/female subpopulations), with examples including:
• Measures of size of living tissue (length, height, skin area, weight);[32]
• The length of inert appendages (hair, claws, nails, teeth) of biological specimens, in the direction of growth;
presumably the thickness of tree bark also falls under this category;
• Certain physiological measurements, such as blood pressure of adult humans.
• In finance, in particular the Black–Scholes model, changes in the logarithm of exchange rates, price indices, and
stock market indices are assumed normal (these variables behave like compound interest, not like simple interest,
and so are multiplicative). Some mathematicians such as Benoît Mandelbrot have argued that log-Lévy
distributions, which possess heavy tails, would be a more appropriate model, in particular for the analysis of
stock market crashes.
• Measurement errors in physical experiments are often modeled by a normal distribution. This use of a normal
distribution does not imply that one is assuming the measurement errors are normally distributed; rather, using the
normal distribution produces the most conservative predictions possible given only knowledge about the mean
and variance of the errors.[33]
• In standardized testing, results can be made to have a
normal distribution. This is done by either selecting the
number and difficulty of questions (as in the IQ test), or by
transforming the raw test scores into "output" scores by
fitting them to the normal distribution. For example, the
SAT's traditional range of 200–800 is based on a normal
distribution with a mean of 500 and a standard deviation of
100.
• Many scores are derived from the normal distribution,
including percentile ranks ("percentiles" or "quantiles"),
normal curve equivalents, stanines, z-scores, and T-scores.
Additionally, a number of behavioral statistical procedures
are based on the assumption that scores are normally distributed; for example, t-tests and ANOVAs. Bell curve
grading assigns relative grades based on a normal distribution of scores.
Fitted cumulative normal distribution to October rainfalls
• In hydrology the distribution of long duration river discharge or rainfall, e.g. monthly and yearly totals, is often
thought to be practically normal according to the central limit theorem.[34] The blue picture illustrates an example
of fitting the normal distribution to ranked October rainfalls showing the 90% confidence belt based on the
binomial distribution. The rainfall data are represented by plotting positions as part of the cumulative frequency
analysis.

Generating values from normal distribution


In computer simulations, especially in applications of
the Monte-Carlo method, it is often desirable to
generate values that are normally distributed. The
algorithms listed below all generate standard
normal deviates, since a N(μ, σ2) variate can be generated as X
= μ + σZ, where Z is standard normal. All these
algorithms rely on the availability of a random number
generator U capable of producing uniform random
variates.

The bean machine, a device invented by Francis Galton, can be
called the first generator of normal random variables. This machine
consists of a vertical board with interleaved rows of pins. Small balls
are dropped from the top and then bounce randomly left or right as
they hit the pins. The balls are collected into bins at the bottom and
settle down into a pattern resembling the Gaussian curve.

• The most straightforward method is based on the
probability integral transform property: if U is
distributed uniformly on (0,1), then Φ−1(U) will
have the standard normal distribution. The drawback
of this method is that it relies on calculation of the
probit function Φ−1, which cannot be done
analytically. Some approximate methods are
described in Hart (1968) and in the erf article.
Wichura[35] gives a fast algorithm for computing
this function to 16 decimal places, which is used by
R to compute random variates of the normal distribution.
• An easy-to-program approximate approach that relies on the central limit theorem is as follows: generate 12
uniform U(0,1) deviates, add them all up, and subtract 6 – the resulting random variable will have approximately a
standard normal distribution. In truth, the distribution will be Irwin–Hall, which is a 12-section eleventh-order
polynomial approximation to the normal distribution. This random deviate will have a limited range of (−6, 6).[36]
• The Box–Muller method uses two independent random numbers U and V distributed uniformly on (0,1). Then the
two random variables X and Y

will both have the standard normal distribution and will be independent. This formulation arises because for a
bivariate normal random vector (X, Y) the squared norm X2 + Y2 will have the chi-squared distribution with two
degrees of freedom, which is an easily generated exponential random variable corresponding to the quantity
−2ln(U) in these equations; and the angle is distributed uniformly around the circle, chosen by the random
variable V.
• The Marsaglia polar method is a modification of the Box–Muller algorithm which does not require
computation of the functions sin() and cos(). In this method U and V are drawn from the uniform (−1,1)
distribution, and then S = U2 + V2 is computed. If S is greater than or equal to one, the method starts over;
otherwise, the two quantities

are returned. Again, X and Y will be independent and standard normally distributed.
• The Ratio method[37] is a rejection method. The algorithm proceeds as follows:
• Generate two independent uniform deviates U and V;
• Compute X = √8/e (V − 0.5)/U;
• If X2 ≤ 5 − 4e1/4U then accept X and terminate algorithm;

• If X2 ≥ 4e−1.35/U + 1.4 then reject X and start over from step 1;


• If X2 ≤ −4 lnU then accept X; otherwise start the algorithm over.
• The ziggurat algorithm (Marsaglia & Tsang 2000) is faster than the Box–Muller transform and still exact. In
about 97% of all cases it uses only two random numbers (one random integer and one random uniform), one
multiplication, and an if-test. Only in the 3% of cases where the combination of those two falls outside the "core of
the ziggurat" must a kind of rejection sampling, using logarithms, exponentials and more uniform random numbers,
be employed.
• There is also some investigation into the connection between the fast Hadamard transform and the normal
distribution, since the transform employs just addition and subtraction and by the central limit theorem random
numbers from almost any distribution will be transformed into the normal distribution. In this regard a series of
Hadamard transforms can be combined with random permutations to turn arbitrary data sets into normally
distributed data.
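As a sketch, the four classical generators described above (the sum-of-12-uniforms approximation, Box–Muller, the Marsaglia polar method, and the ratio method) can each be written in a few lines; the constants in the ratio method are exactly those quoted in the steps above.

```python
import math
import random

def sum_of_12(rng):
    # Central-limit approximation: 12 U(0,1) deviates have mean 6, variance 1.
    return sum(rng.random() for _ in range(12)) - 6.0

def box_muller(rng):
    # Exponential radius (chi-squared with 2 d.f.) plus a uniform angle.
    u = 1.0 - rng.random()   # shift to (0, 1] to avoid log(0)
    v = rng.random()
    r = math.sqrt(-2.0 * math.log(u))
    return r * math.cos(2.0 * math.pi * v), r * math.sin(2.0 * math.pi * v)

def marsaglia_polar(rng):
    # Box-Muller without sin/cos: rejection-sample a point inside the unit disc.
    while True:
        u = 2.0 * rng.random() - 1.0
        v = 2.0 * rng.random() - 1.0
        s = u * u + v * v
        if 0.0 < s < 1.0:
            f = math.sqrt(-2.0 * math.log(s) / s)
            return u * f, v * f

def ratio_method(rng):
    # Kinderman-Monahan rejection scheme with the squeeze bounds quoted above.
    k = math.sqrt(8.0 / math.e)
    while True:
        u = 1.0 - rng.random()   # u in (0, 1]
        v = rng.random()
        x = k * (v - 0.5) / u
        xx = x * x
        if xx <= 5.0 - 4.0 * math.exp(0.25) * u:    # quick accept
            return x
        if xx >= 4.0 * math.exp(-1.35) / u + 1.4:   # quick reject
            continue
        if xx <= -4.0 * math.log(u):                # exact acceptance test
            return x

# Sanity check one generator statistically.
rng = random.Random(0)
samples = [ratio_method(rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```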

Numerical approximations for the normal CDF


The standard normal CDF is widely used in scientific and statistical computing. The values Φ(x) may be
approximated very accurately by a variety of methods, such as numerical integration, Taylor series, asymptotic series
and continued fractions. Different approximations are used depending on the desired level of accuracy.
• Abramowitz & Stegun (1964) give the approximation for Φ(x) for x > 0 with the absolute error |ε(x)| < 7.5·10−8
(algorithm 26.2.17 [38]):

where ϕ(x) is the standard normal PDF, and b0 = 0.2316419, b1 = 0.319381530, b2 = −0.356563782, b3 =
1.781477937, b4 = −1.821255978, b5 = 1.330274429.
• Hart (1968) lists almost a hundred rational function approximations for the erfc() function. His algorithms
vary in the degree of complexity and the resulting precision, with maximum absolute precision of 24 digits. An
algorithm by West (2009) combines Hart's algorithm 5666 with a continued fraction approximation in the tail to
provide a fast computation algorithm with a 16-digit precision.
• W. J. Cody (1969), after noting that the Hart (1968) solution is not suited for erf, gives a solution for both erf and
erfc, with a maximal relative error bound, via rational Chebyshev approximation (Cody, W. J. (1969). "Rational
Chebyshev Approximations for the Error Function").
• Marsaglia (2004) suggested a simple algorithm[39] based on the Taylor series expansion

for calculating Φ(x) with arbitrary precision. The drawback of this algorithm is its comparatively slow calculation
time (for example, it takes over 300 iterations to calculate the function with 16 digits of precision when x = 10).
• The GNU Scientific Library calculates values of the standard normal CDF using Hart's algorithms and
approximations with Chebyshev polynomials.
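Both the Abramowitz & Stegun formula (with the coefficients quoted above) and Marsaglia's Taylor series, Φ(x) = 1/2 + ϕ(x)·(x + x3/3 + x5/(3·5) + x7/(3·5·7) + …), can be sketched directly:

```python
import math

def phi_pdf(x):
    # Standard normal density.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# Abramowitz & Stegun 26.2.17, valid for x >= 0, absolute error < 7.5e-8.
B0 = 0.2316419
B = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)

def cdf_as(x):
    t = 1.0 / (1.0 + B0 * x)
    return 1.0 - phi_pdf(x) * sum(b * t ** (k + 1) for k, b in enumerate(B))

# Marsaglia's Taylor series; each term gains a factor x^2/(2k+1).
def cdf_taylor(x, tol=1e-16):
    term = total = x
    k = 0
    while abs(term) > tol:
        k += 1
        term *= x * x / (2 * k + 1)
        total += term
    return 0.5 + phi_pdf(x) * total
```

As the text notes, the series route trades speed for arbitrary accuracy, while 26.2.17 is fast but fixed at roughly seven correct decimals.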

History

Development
Some authors[40][41] attribute the credit for the discovery of the normal distribution to de Moivre, who in 1738[42]
published in the second edition of his "The Doctrine of Chances" the study of the coefficients in the binomial
expansion of (a + b)n. De Moivre proved that the middle term in this expansion has the approximate magnitude of
2/√(2πn), and that "If m or ½n be a Quantity infinitely great, then the Logarithm of the Ratio, which a Term distant
from the middle by the Interval ℓ, has to the middle Term, is −2ℓℓ/n."[43] Although this theorem can be interpreted as

the first obscure expression for the normal probability law, Stigler points out that de Moivre himself did not interpret
his results as anything more than the approximate rule for the binomial coefficients, and in particular de Moivre
lacked the concept of the probability density function.[44]
In 1809 Gauss published his monograph "Theoria motus corporum
coelestium in sectionibus conicis solem ambientium", where among
other things he introduces several important statistical concepts, such
as the method of least squares, the method of maximum likelihood, and
the normal distribution. Gauss used M, M′, M′′, … to denote the
measurements of some unknown quantity V, and sought the "most
probable" estimator: the one which maximizes the probability φ(M−V)
· φ(M′−V) · φ(M′′−V) · … of obtaining the observed experimental
results. In his notation φΔ is the probability law of the measurement
errors of magnitude Δ. Not knowing what the function φ is, Gauss
requires that his method should reduce to the well-known answer: the
arithmetic mean of the measured values.[45] Starting from these
principles, Gauss demonstrates that the only law which rationalizes the
choice of the arithmetic mean as an estimator of the location parameter
is the normal law of errors:[46]

where h is "the measure of the precision of the observations". Using this normal law as a generic model for errors in
the experiments, Gauss formulates what is now known as the non-linear weighted least squares (NWLS) method.[47]
Although Gauss was the first to suggest the normal distribution law,
Laplace made significant contributions.[48] It was Laplace who first
posed the problem of aggregating several observations in 1774,[49]
although his own solution led to the Laplacian distribution. It was
Laplace who first calculated the value of the integral ∫ e−t² dt = √π in
1782, providing the normalization constant for the normal
distribution.[50] Finally, it was Laplace who in 1810 proved and
presented to the Academy the fundamental central limit theorem,
which emphasized the theoretical importance of the normal
distribution.[51]

It is of interest to note that in 1809 the American mathematician Adrain
published two derivations of the normal probability law,
simultaneously and independently from Gauss.[52] His works remained
largely unnoticed by the scientific community, until in 1871 they were
"rediscovered" by Abbe.[53]

In the middle of the 19th century Maxwell demonstrated that the
normal distribution is not just a convenient mathematical tool, but may
also occur in natural phenomena:[54] "The number of particles whose
velocity, resolved in a certain direction, lies between x and x + dx is …"

Naming
Since its introduction, the normal distribution has been known by many different names: the law of error, the law of
facility of errors, Laplace's second law, Gaussian law, etc. Gauss himself apparently coined the term with reference
to the "normal equations" involved in its applications, with normal having its technical meaning of orthogonal rather
than "usual".[55] However, by the end of the 19th century some authors[56] had started using the name normal
distribution, where the word "normal" was used as an adjective – the term now being seen as a reflection of the fact
that this distribution was seen as typical, common – and thus "normal". Peirce (one of those authors) once defined
"normal" thus: "...the 'normal' is not the average (or any other kind of mean) of what actually occurs, but of what
would, in the long run, occur under certain circumstances."[57] Around the turn of the 20th century Pearson
popularized the term normal as a designation for this distribution.[58]
Many years ago I called the Laplace–Gaussian curve the normal curve, which name, while it avoids an
international question of priority, has the disadvantage of leading people to believe that all other distributions
of frequency are in one sense or another 'abnormal'.
—Pearson (1920)
Also, it was Pearson who first wrote the distribution in terms of the standard deviation σ as in modern notation. Soon
after this, in 1915, Fisher added the location parameter to the formula for the normal distribution, expressing it in
the way it is written nowadays:

The term "standard normal", denoting the normal distribution with zero mean and unit variance, came into
general use around the 1950s, appearing in the popular textbooks by P.G. Hoel (1947) "Introduction to mathematical
statistics" and A.M. Mood (1950) "Introduction to the theory of statistics".[59]
The "Gaussian distribution" is named after Carl Friedrich Gauss, who introduced the
distribution in 1809 as a way of rationalizing the method of least squares, as outlined above. The related work of
Laplace, also outlined above, has led to the normal distribution being sometimes called Laplacian, especially in
French-speaking countries. Among English speakers, both "normal distribution" and "Gaussian distribution" are in
common use, with different terms preferred by different communities.

Notes
[1] The designation "bell curve" is ambiguous: there are many other distributions which are "bell"-shaped: the Cauchy distribution, Student's
t-distribution, generalized normal, logistic, etc.
[2] Casella & Berger (2001, p. 102)
[3] Gale Encyclopedia of Psychology – Normal Distribution (http:/ / findarticles. com/ p/ articles/ mi_g2699/ is_0002/ ai_2699000241)
[4] Cover, T. M.; Thomas, Joy A (2006). Elements of information theory. John Wiley and Sons. p. 254.
[5] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model". Journal of Econometrics
(Elsevier): 219–230. Retrieved 2011-06-02.
[6] Halperin & et al. (1965, item 7)
[7] McPherson (1990, p. 110)
[8] Bernardo & Smith (2000, p. 121)
[9] Patel & Read (1996, [2.1.4])
[10] Fan (1991, p. 1258)
[11] Patel & Read (1996, [2.1.8])
[12] Scott, Clayton; Robert Nowak (August 7, 2003). "The Q-function" (http:/ / cnx. org/ content/ m11537/ 1. 2/ ). Connexions. .
[13] Barak, Ohad (April 6, 2006). "Q function and error function" (http:/ / www. eng. tau. ac. il/ ~jo/ academic/ Q. pdf). Tel Aviv University. .
[14] Weisstein, Eric W., " Normal Distribution Function (http:/ / mathworld. wolfram. com/ NormalDistributionFunction. html)" from
MathWorld.
[15] Bryc (1995, p. 23)
[16] Bryc (1995, p. 24)

[17] WolframAlpha.com (http:/ / www. wolframalpha. com/ input/ ?i=Table[{N(Erf(n/ Sqrt(2)),+ 12),+ N(1-Erf(n/ Sqrt(2)),+ 12),+ N(1/
(1-Erf(n/ Sqrt(2))),+ 12)},+ {n,1,6}])
[18] part 1 (http:/ / www. wolframalpha. com/ input/ ?i=Table[Sqrt(2)*InverseErf(x),+ {x,+ N({8/ 10,+ 9/ 10,+ 19/ 20,+ 49/ 50,+ 99/ 100,+ 995/
1000,+ 998/ 1000},+ 13)}]), part 2 (http:/ / www. wolframalpha. com/ input/
?i=Table[{N(1-10^(-x),9),N(Sqrt(2)*InverseErf(1-10^(-x)),13)},{x,3,9}])
[19] Galambos & Simonelli (2004, Theorem 3.5)
[20] Bryc (1995, p. 35)
[21] Bryc (1995, p. 27)
[22] Lukacs & King (1954)
[23] Patel & Read (1996, [2.3.6])
[24] http:/ / www. allisons. org/ ll/ MML/ KL/ Normal/
[25] "Stat260: Bayesian Modeling and Inference Lecture Date: February 8th, 2010, The Conjugate Prior for the Normal Distribution, Lecturer:
Michael I. Jordan|" (http:/ / www. cs. berkeley. edu/ ~jordan/ courses/ 260-spring10/ lectures/ lecture5. pdf). .
[26] Cover & Thomas (2006, p. 254)
[27] Amari & Nagaoka (2000)
[28] Mathworld entry for Normal Product Distribution (http:/ / mathworld. wolfram. com/ NormalProductDistribution. html)
[29] Krishnamoorthy (2006, p. 127)
[30] Krishnamoorthy (2006, p. 130)
[31] Krishnamoorthy (2006, p. 133)
[32] Huxley (1932)
[33] Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. pp. 592–593.
[34] Ritzema (ed.), H.P. (1994). Frequency and Regression Analysis (http:/ / www. waterlog. info/ pdf/ freqtxt. pdf). Chapter 6 in: Drainage
Principles and Applications, Publication 16, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The
Netherlands. pp. 175–224. ISBN 90-70754-33-9. .
[35] Wichura, M.J. (1988). "Algorithm AS241: The Percentage Points of the Normal Distribution". Applied Statistics (Blackwell Publishing) 37
(3): 477–484. doi:10.2307/2347330. JSTOR 2347330.
[36] Johnson et al. (1995, Equation (26.48))
[37] Kinderman & Monahan (1976)
[38] http:/ / www. math. sfu. ca/ ~cbm/ aands/ page_932. htm
[39] For example, this algorithm is given in the article Bc programming language.
[40] Johnson et al. (1994, page 85)
[41] Le Cam (2000, p. 74)
[42] De Moivre first published his findings in 1733, in a pamphlet "Approximatio ad Summam Terminorum Binomii (a + b)n in Seriem Expansi"
that was designated for private circulation only. But it was not until the year 1738 that he made his results publicly available. The original
pamphlet was reprinted several times, see for example Walker (1985).
[43] De Moivre (1733), Corollary I – see Walker (1985, p. 77)
[44] Stigler (1986, p. 76)
[45] "It has been customary certainly to regard as an axiom the hypothesis that if any quantity has been determined by several direct
observations, made under the same circumstances and with equal care, the arithmetical mean of the observed values affords the most probable
value, if not rigorously, yet very nearly at least, so that it is always most safe to adhere to it." — Gauss (1809, section 177)
[46] Gauss (1809, section 177)
[47] Gauss (1809, section 179)
[48] "My custom of terming the curve the Gauss–Laplacian or normal curve saves us from proportioning the merit of discovery between the two
great astronomer mathematicians." quote from Pearson (1905, p. 189)
[49] Laplace (1774, Problem III)
[50] Pearson (1905, p. 189)
[51] Stigler (1986, p. 144)
[52] Stigler (1978, p. 243)
[53] Stigler (1978, p. 244)
[54] Maxwell (1860), p. 23
[55] Jaynes, E. T., Probability Theory: The Logic of Science, Ch. 7 (http://www-biba.inrialpes.fr/Jaynes/cc07s.pdf)
[56] Besides those specifically referenced here, such use is encountered in the works of Peirce, Galton and Lexis approximately around 1875.
[57] Peirce, C. S. (c. 1909 MS), Collected Papers v. 6, paragraph 327.
[58] Kruskal & Stigler (1997)
[59] "Earliest uses… (entry STANDARD NORMAL CURVE)" (http://jeff560.tripod.com/s.html).
Normal distribution 458


References
• Aldrich, John; Miller, Jeff. "Earliest uses of symbols in probability and statistics" (http://jeff560.tripod.com/
stat.html).
• Aldrich, John; Miller, Jeff. "Earliest known uses of some of the words of mathematics" (http://jeff560.tripod.
com/mathword.html). In particular, the entries for "bell-shaped and bell curve" (http://jeff560.tripod.com/b.
html), "normal (distribution)" (http://jeff560.tripod.com/n.html), "Gaussian" (http://jeff560.tripod.com/g.
html), and "Error, law of error, theory of errors, etc." (http://jeff560.tripod.com/e.html).
• Amari, Shun-ichi; Nagaoka, Hiroshi (2000). Methods of information geometry. Oxford University Press.
ISBN 0-8218-0531-2.
• Bernardo, J. M.; Smith, A.F.M. (2000). Bayesian Theory. Wiley. ISBN 0-471-49464-X.
• Bryc, Wlodzimierz (1995). The normal distribution: characterizations with applications. Springer-Verlag.
ISBN 0-387-97990-5.
• Casella, George; Berger, Roger L. (2001). Statistical inference (2nd ed.). Duxbury. ISBN 0-534-24312-6.
• Cover, T. M.; Thomas, Joy A. (2006). Elements of information theory. John Wiley and Sons.
• de Moivre, Abraham (1738). The Doctrine of Chances. ISBN 0-8218-2103-2.
• Fan, Jianqing (1991). "On the optimal rates of convergence for nonparametric deconvolution problems". The
Annals of Statistics 19 (3): 1257–1272. doi:10.1214/aos/1176348248. JSTOR 2241949.
• Galambos, Janos; Simonelli, Italo (2004). Products of random variables: applications to problems of physics and
to arithmetical functions. Marcel Dekker, Inc.. ISBN 0-8247-5402-6.
• Gauss, Carolo Friderico (1809) (in Latin). Theoria motvs corporvm coelestivm in sectionibvs conicis Solem
ambientivm [Theory of the motion of the heavenly bodies moving about the Sun in conic sections]. English
translation (http://books.google.com/books?id=1TIAAAAAQAAJ).
• Gould, Stephen Jay (1981). The mismeasure of man (first ed.). W.W. Norton. ISBN 0-393-01489-4.
• Halperin, Max; Hartley, H. O.; Hoel, P. G. (1965). "Recommended standards for statistical symbols and notation.
COPSS committee on symbols and notation". The American Statistician 19 (3): 12–14. doi:10.2307/2681417.
JSTOR 2681417.
• Hart, John F.; et al. (1968). Computer approximations. New York: John Wiley & Sons, Inc. ISBN 0-88275-642-7.
• Hazewinkel, Michiel, ed. (2001), "Normal distribution" (http://www.encyclopediaofmath.org/index.
php?title=p/n067460), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
• Herrnstein, Richard J.; Murray, Charles (1994). The bell curve: intelligence and class structure in American life. Free Press.
ISBN 0-02-914673-9.
• Huxley, Julian S. (1932). Problems of relative growth. London. ISBN 0-486-61114-0. OCLC 476909537.
• Johnson, N.L.; Kotz, S.; Balakrishnan, N. (1994). Continuous univariate distributions, Volume 1. Wiley.
ISBN 0-471-58495-9.
• Johnson, N.L.; Kotz, S.; Balakrishnan, N. (1995). Continuous univariate distributions, Volume 2. Wiley.
ISBN 0-471-58494-0.
• Krishnamoorthy, K. (2006). Handbook of statistical distributions with applications. Chapman & Hall/CRC.
ISBN 1-58488-635-8.
• Kruskal, William H.; Stigler, Stephen M. (1997). Normative terminology: 'normal' in statistics and elsewhere.
Statistics and public policy, edited by Bruce D. Spencer. Oxford University Press. ISBN 0-19-852341-6.
• la Place, M. de (1774). "Mémoire sur la probabilité des causes par les évènemens". Mémoires de Mathématique et
de Physique, Presentés à l'Académie Royale des Sciences, par divers Savans & lûs dans ses Assemblées, Tome
Sixième: 621–656. Translated by S.M.Stigler in Statistical Science 1 (3), 1986: JSTOR 2245476.
• Laplace, Pierre-Simon (1812). Analytical theory of probabilities.

• Lukacs, Eugene; King, Edgar P. (1954). "A property of normal distribution". The Annals of Mathematical
Statistics 25 (2): 389–394. doi:10.1214/aoms/1177728796. JSTOR 2236741.
• McPherson, G. (1990). Statistics in scientific investigation: its basis, application and interpretation.
Springer-Verlag. ISBN 0-387-97137-8.
• Marsaglia, George; Tsang, Wai Wan (2000). "The ziggurat method for generating random variables" (http://
www.jstatsoft.org/v05/i08/paper). Journal of Statistical Software 5 (8).
• Marsaglia, George (2004). "Evaluating the normal distribution" (http://www.jstatsoft.org/v11/i05/paper).
Journal of Statistical Software 11 (4).
• Maxwell, James Clerk (1860). "V. Illustrations of the dynamical theory of gases. — Part I: On the motions and
collisions of perfectly elastic spheres". Philosophical Magazine, series 4 19 (124): 19–32.
doi:10.1080/14786446008642818.
• Patel, Jagdish K.; Read, Campbell B. (1996). Handbook of the normal distribution (2nd ed.). CRC Press.
ISBN 0-8247-9342-0.
• Pearson, Karl (1905). "'Das Fehlergesetz und seine Verallgemeinerungen durch Fechner und Pearson'. A
rejoinder". Biometrika 4 (1): 169–212. JSTOR 2331536.
• Pearson, Karl (1920). "Notes on the history of correlation". Biometrika 13 (1): 25–45.
doi:10.1093/biomet/13.1.25. JSTOR 2331722.
• Stigler, Stephen M. (1978). "Mathematical statistics in the early states". The Annals of Statistics 6 (2): 239–265.
doi:10.1214/aos/1176344123. JSTOR 2958876.
• Stigler, Stephen M. (1982). "A modest proposal: a new standard for the normal". The American Statistician 36
(2): 137–138. doi:10.2307/2684031. JSTOR 2684031.
• Stigler, Stephen M. (1986). The history of statistics: the measurement of uncertainty before 1900. Harvard
University Press. ISBN 0-674-40340-1.
• Stigler, Stephen M. (1999). Statistics on the table. Harvard University Press. ISBN 0-674-83601-4.
• Walker, Helen M (1985). "De Moivre on the law of normal probability" (http://www.york.ac.uk/depts/maths/
histstat/demoivre.pdf). In Smith, David Eugene. A source book in mathematics. Dover. ISBN 0-486-64690-4.
• Weisstein, Eric W. "Normal distribution" (http://mathworld.wolfram.com/NormalDistribution.html).
MathWorld.
• West, Graeme (2009). "Better approximations to cumulative normal functions" (http://www.wilmott.com/pdfs/
090721_west.pdf). Wilmott Magazine: 70–76.
• Zelen, Marvin; Severo, Norman C. (1964). Probability functions (chapter 26) (http://www.math.sfu.ca/~cbm/
aands/page_931.htm). Handbook of mathematical functions with formulas, graphs, and mathematical tables, by
Abramowitz and Stegun: National Bureau of Standards. New York: Dover. ISBN 0-486-61272-4.

External links
• Normal Distribution Video Tutorial Part 1-2 (http://www.youtube.com/watch?v=kB_kYUbS_ig)
• An 8-foot-tall (2.4 m) Probability Machine (named Sir Francis) comparing stock market returns to the randomness of the beans
dropping through the quincunx pattern. (http://www.youtube.com/watch?v=AUSKTk9ENzg) YouTube link originating from Index Funds
Advisors (http://www.ifa.com)
• An interactive Normal (Gaussian) distribution plot (http://peter.freeshell.org/gaussian/)
Order statistic 460

Order statistic
In statistics, the kth order statistic of a statistical sample is equal to its
kth-smallest value. Together with rank statistics, order statistics are
among the most fundamental tools in non-parametric statistics and
inference.
Important special cases of the order statistics are the minimum and
maximum value of a sample, and (with some qualifications discussed
below) the sample median and other sample quantiles.
When using probability theory to analyze order statistics of random samples from a continuous distribution, the
cumulative distribution function is used to reduce the analysis to the case of order statistics of the uniform
distribution.
[Figure: Probability distributions for the n = 5 order statistics of an exponential distribution with θ = 3.]

Notation and examples


For example, suppose that four numbers are observed or recorded, resulting in a sample of size 4. If the sample
values are
6, 9, 3, 8,
they will usually be denoted
x_1 = 6, x_2 = 9, x_3 = 3, x_4 = 8,
where the subscript i in x_i indicates simply the order in which the observations were recorded and is usually
assumed not to be significant. A case when the order is significant is when the observations are part of a time series.
The order statistics would be denoted
x_(1) = 3, x_(2) = 6, x_(3) = 8, x_(4) = 9,
where the subscript (i) enclosed in parentheses indicates the ith order statistic of the sample.
The first order statistic (or smallest order statistic) is always the minimum of the sample, that is,
X_(1) = min{X_1, ..., X_n},
where, following a common convention, we use upper-case letters to refer to random variables, and lower-case
letters (as above) to refer to their actual observed values.
Similarly, for a sample of size n, the nth order statistic (or largest order statistic) is the maximum, that is,
X_(n) = max{X_1, ..., X_n}.
The sample range is the difference between the maximum and minimum. It is clearly a function of the order
statistics:
Range{X_1, ..., X_n} = X_(n) − X_(1).
A similar important statistic in exploratory data analysis that is simply related to the order statistics is the sample
interquartile range.
The sample median may or may not be an order statistic, since there is a single middle value only when the number n
of observations is odd. More precisely, if n = 2m+1 for some integer m, then the sample median is X_(m+1) and so is an
order statistic. On the other hand, when n is even, n = 2m and there are two middle values, X_(m) and X_(m+1), and
the sample median is some function of the two (usually the average) and hence not an order statistic. Similar remarks
apply to all sample quantiles.
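The example above can be reproduced in a few lines (a Python sketch; the variable names are illustrative):

```python
# Observed sample, in recording order: x_1 = 6, x_2 = 9, x_3 = 3, x_4 = 8
sample = [6, 9, 3, 8]

# The order statistics are obtained simply by sorting the sample:
order_stats = sorted(sample)          # [3, 6, 8, 9] -> x_(1), x_(2), x_(3), x_(4)

x_min = order_stats[0]                # first order statistic: the sample minimum
x_max = order_stats[-1]               # nth order statistic: the sample maximum
sample_range = x_max - x_min          # a function of the order statistics

print(order_stats, x_min, x_max, sample_range)
```

Since the sample size is even (n = 4), the sample median here would be the average of `order_stats[1]` and `order_stats[2]`, and hence not itself an order statistic.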

Probabilistic analysis
Given any random variables X1, X2, ..., Xn, the order statistics X(1), X(2), ..., X(n) are also random variables, defined by
sorting the values (realizations) of X1, X2, ..., Xn in increasing order.
When the random variables X1, X2, ..., Xn form a sample, they are independent and identically distributed (iid). This is
the case treated below. In general, the random variables X1, X2, ..., Xn can arise by sampling from more than one
population. Then they are independent but not necessarily identically distributed, and their joint probability
distribution is given by the Bapat–Beg theorem.
From now on, we will assume that the random variables under consideration are continuous and, where convenient,
we will also assume that they have a density (that is, they are absolutely continuous). The peculiarities of the analysis
of distributions assigning mass to points (in particular, discrete distributions) are discussed at the end.

Probability distributions of order statistics


In this section we show that the order statistics of the uniform distribution on the unit interval have marginal
distributions belonging to the Beta distribution family. We also give a simple method to derive the joint distribution
of any number of order statistics, and finally translate these results to arbitrary continuous distributions using the cdf.
We assume throughout this section that X_1, X_2, ..., X_n is a random sample drawn from a continuous distribution
with cdf F_X. Denoting U_i = F_X(X_i) we obtain the corresponding random sample U_1, ..., U_n from the
standard uniform distribution. Note that the order statistics also satisfy U_(i) = F_X(X_(i)).

The order statistics of the uniform distribution


The probability of the order statistic U_(k) falling in the interval [u, u + du] is equal to
n!/((k − 1)!(n − k)!) u^(k−1) (1 − u)^(n−k) du + O(du²),
that is, the kth order statistic of the uniform distribution is a beta-distributed random variable, U_(k) ~ Beta(k, n + 1 − k).

The proof of these statements is as follows. For U_(k) to be between u and u + du, it is necessary that exactly k − 1
elements of the sample are smaller than u, and that at least one is between u and u + du. The probability that more
than one is in this latter interval is already O(du²), so we have to calculate the probability that exactly k − 1, 1 and
n − k observations fall in the intervals (0, u), (u, u + du) and (u + du, 1) respectively. This equals (refer to
multinomial distribution for details)
n!/((k − 1)! 1! (n − k)!) u^(k−1) · du · (1 − u − du)^(n−k)
and the result follows.


The mean of this distribution is k / (n + 1).
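This can be illustrated by simulation; the following sketch (plain Python, with arbitrary choices of n, k, and the number of trials) estimates the mean of the kth uniform order statistic and compares it with k/(n + 1):

```python
import random

random.seed(0)                        # fixed seed so the run is reproducible
n, k, trials = 5, 2, 20000

# Empirical mean of the kth order statistic of n i.i.d. Uniform(0,1) draws
total = 0.0
for _ in range(trials):
    u = sorted(random.random() for _ in range(n))
    total += u[k - 1]                 # u[k-1] is the realization of U_(k)

empirical_mean = total / trials
theoretical_mean = k / (n + 1)        # mean of Beta(k, n + 1 - k)
print(empirical_mean, theoretical_mean)
```

With 20,000 trials the empirical mean should agree with k/(n + 1) = 1/3 to about two decimal places.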

The joint distribution of the order statistics of the uniform distribution


Similarly, for i < j, the joint probability density function of the two order statistics U_(i) < U_(j) can be shown to be
f_{U_(i),U_(j)}(u, v) = n!/((i − 1)!(j − i − 1)!(n − j)!) u^(i−1) (v − u)^(j−i−1) (1 − v)^(n−j),
which is (up to terms of higher order than O(du dv)) the probability that i − 1, 1, j − 1 − i, 1 and n − j sample
elements fall in the intervals (0, u), (u, u + du), (u + du, v), (v, v + dv), (v + dv, 1), respectively.
One reasons in an entirely analogous way to derive the higher-order joint distributions. Perhaps surprisingly, the
joint density of the n order statistics turns out to be constant:
f_{U_(1), ..., U_(n)}(u_1, ..., u_n) = n!,  for 0 ≤ u_1 ≤ u_2 ≤ ... ≤ u_n ≤ 1.
One way to understand this is that the unordered sample does have constant density equal to 1, and that there are n!
different permutations of the sample corresponding to the same sequence of order statistics. This is related to the fact
that 1/n! is the volume of the region 0 < u_1 < u_2 < ... < u_n < 1.

The joint distribution of the order statistics of an absolutely continuous distribution


If F_X is absolutely continuous, it has a density f_X such that dF_X(x) = f_X(x) dx, and we can use the substitutions
u = F_X(x)
and
du = f_X(x) dx
to derive the following probability density functions (pdfs) for the order statistics of a sample of size n drawn from
the distribution of X:
f_{X_(k)}(x) = n!/((k − 1)!(n − k)!) [F_X(x)]^(k−1) [1 − F_X(x)]^(n−k) f_X(x),
f_{X_(j),X_(k)}(x, y) = n!/((j − 1)!(k − j − 1)!(n − k)!) [F_X(x)]^(j−1) [F_X(y) − F_X(x)]^(k−j−1) [1 − F_X(y)]^(n−k) f_X(x) f_X(y),
where j < k and x ≤ y.
Application: confidence intervals for quantiles


An interesting question is how well the order statistics perform as estimators of the quantiles of the underlying
distribution.

A small-sample-size example
The simplest case to consider is how well the sample median estimates the population median.
As an example, consider a random sample of size 6. In that case, the sample median is usually defined as the
midpoint of the interval delimited by the 3rd and 4th order statistics. However, we know from the preceding
discussion that the probability that this interval actually contains the population median is
C(6, 3) (1/2)^6 = 5/16 ≈ 31%.
Although the sample median is probably among the best distribution-independent point estimates of the population
median, what this example illustrates is that it is not a particularly good one in absolute terms. In this particular case,
a better confidence interval for the median is the one delimited by the 2nd and 5th order statistics, which contains the
population median with probability
[C(6, 2) + C(6, 3) + C(6, 4)] (1/2)^6 = 25/32 ≈ 78%.
With such a small sample size, if one wants at least 95% confidence, one is reduced to saying that the median is
between the minimum and the maximum of the 6 observations with probability 31/32 or approximately 97%. Size 6
is, in fact, the smallest sample size such that the interval determined by the minimum and the maximum is at least a
95% confidence interval for the population median.
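These coverage probabilities come from counting how many of the 6 observations fall below the population median, a Binomial(6, 1/2) count; a sketch:

```python
from math import comb

n = 6  # sample size

def coverage(i, j, n):
    """P{X_(i) < population median < X_(j)} for a continuous distribution:
    the interval covers the median exactly when the number of observations
    below the median is at least i and at most j - 1; each observation falls
    below the median independently with probability 1/2."""
    return sum(comb(n, m) for m in range(i, j)) / 2 ** n

p_34 = coverage(3, 4, n)   # interval between 3rd and 4th order statistics
p_25 = coverage(2, 5, n)   # interval between 2nd and 5th order statistics
p_16 = coverage(1, 6, n)   # interval between minimum and maximum
print(p_34, p_25, p_16)    # 5/16, 25/32, 31/32
```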

Large sample sizes


For the uniform distribution, as n tends to infinity, the pth sample quantile is asymptotically normally distributed,
since it is approximated by
U_(⌈np⌉) ~ AN(p, p(1 − p)/n).
For a general distribution F with a continuous non-zero density at F^−1(p), a similar asymptotic normality applies:
X_(⌈np⌉) ~ AN(F^−1(p), p(1 − p)/(n [f(F^−1(p))]²)),
where f is the density function, and F −1 is the quantile function associated with F.
An interesting observation can be made in the case where the distribution is symmetric, and the population median
equals the population mean. In this case, the sample mean, by the central limit theorem, is also asymptotically
normally distributed, but with variance σ2/n instead. This asymptotic analysis suggests that the mean outperforms the
median in cases of low kurtosis, and vice versa. For example, the median achieves better confidence intervals for the
Laplace distribution, while the mean performs better for X that are normally distributed.
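For a standard normal population (σ = 1, median = mean = 0, density f(0) = 1/√(2π)), the two asymptotic variances above can be compared directly; a sketch:

```python
from math import pi, sqrt

# Asymptotic variance of the sample median: p(1-p) / (n f(F^{-1}(p))^2), p = 1/2
def asymptotic_median_variance(n, density_at_median):
    p = 0.5
    return p * (1 - p) / (n * density_at_median ** 2)

n = 100
f0 = 1 / sqrt(2 * pi)                 # standard normal density at its median, x = 0
var_median = asymptotic_median_variance(n, f0)
var_mean = 1.0 / n                    # sigma^2 / n with sigma = 1

efficiency = var_median / var_mean    # equals pi/2: the mean wins for normal data
print(var_median, var_mean, efficiency)
```

The ratio π/2 ≈ 1.57 is the classical asymptotic relative efficiency of the mean over the median for normal data; for the Laplace distribution the comparison goes the other way.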

Proof
It can be shown that
U_(k) has the same distribution as X/(X + Y),
where
X = Z_1 + ... + Z_k and Y = Z_(k+1) + ... + Z_(n+1),
with Zi being independent identically distributed exponential random variables with rate 1. Since X/n and Y/n are
asymptotically normally distributed by the CLT, our results follow by application of the delta method.

Dealing with discrete variables


Suppose X_1, X_2, ..., X_n are i.i.d. random variables from a discrete distribution with cumulative distribution
function F(x) and probability mass function f(x). To find the probabilities of the order statistics, three
values are first needed, namely
p_1 = P(X < x) = F(x) − f(x),  p_2 = P(X = x) = f(x),  p_3 = P(X > x) = 1 − F(x).
The cumulative distribution function of the order statistic X_(k) can be computed by noting that at least k of the n
observations must be less than or equal to x:
P(X_(k) ≤ x) = Σ_{j=k}^{n} C(n, j) (p_1 + p_2)^j p_3^(n−j).
Similarly, P(X_(k) < x) is given by
P(X_(k) < x) = Σ_{j=k}^{n} C(n, j) p_1^j (p_2 + p_3)^(n−j).
Note that the probability mass function of X_(k) is just the difference of these values, that is to say
P(X_(k) = x) = P(X_(k) ≤ x) − P(X_(k) < x).
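The discrete construction via p_1, p_2, p_3 can be checked against brute-force enumeration; a sketch for the minimum (k = 1) of n = 2 draws from the uniform distribution on {1, 2, 3} (an arbitrary small example, using exact rational arithmetic):

```python
from fractions import Fraction
from itertools import product
from math import comb

support = [1, 2, 3]
prob = {v: Fraction(1, 3) for v in support}   # pmf of a single draw
n, k = 2, 1                                    # X_(1): the minimum of two draws

def pmf_order_stat(x):
    p1 = sum(prob[v] for v in support if v < x)   # P(X < x)
    p2 = prob[x]                                  # P(X = x)
    p3 = sum(prob[v] for v in support if v > x)   # P(X > x)
    cdf_le = sum(comb(n, j) * (p1 + p2) ** j * p3 ** (n - j) for j in range(k, n + 1))
    cdf_lt = sum(comb(n, j) * p1 ** j * (p2 + p3) ** (n - j) for j in range(k, n + 1))
    return cdf_le - cdf_lt                        # P(X_(k) = x)

# Brute force: enumerate all n-tuples and take the kth smallest of each
def pmf_brute(x):
    return sum(
        Fraction(1, len(support) ** n)
        for draw in product(support, repeat=n)
        if sorted(draw)[k - 1] == x
    )

formula = {x: pmf_order_stat(x) for x in support}
brute = {x: pmf_brute(x) for x in support}
print(formula, brute)   # both give P(min=1)=5/9, P(min=2)=3/9, P(min=3)=1/9
```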

Computing order statistics


The problem of computing the kth smallest (or largest) element of a list is called the selection problem and is solved
by a selection algorithm. Although this problem is difficult for very large lists, sophisticated selection algorithms
have been created that can solve this problem in time proportional to the number of elements in the list, even if the
list is totally unordered. If the data is stored in certain specialized data structures, this time can be brought down to
O(log n). In many applications all order statistics are required, in which case a sorting algorithm can be used and the
time taken is O(n log n).
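For example (a Python sketch; `heapq.nsmallest` is one standard-library way to select the k smallest elements without fully ordering the list):

```python
import heapq

data = [7, 1, 5, 3, 9, 8, 2, 6, 4]
k = 3

# Via full sorting: O(n log n), but yields every order statistic at once
kth_by_sort = sorted(data)[k - 1]

# Via heap-based selection: avoids producing a fully ordered list
kth_by_heap = heapq.nsmallest(k, data)[-1]

print(kth_by_sort, kth_by_heap)
```

Both approaches return the 3rd order statistic of the list; dedicated selection algorithms such as quickselect achieve expected linear time.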

References
• David, H. A.; Nagaraja, H. N. (2003). Order Statistics (3rd ed.). Wiley, New Jersey. pp. 458. ISBN
0-471-38926-9
• Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York. ISBN
0-471-02403-1

External links
• Order statistics [1] at PlanetMath. Retrieved February 2, 2005.
• Weisstein, Eric W., "Order Statistic [2]" from MathWorld. Retrieved February 2, 2005.
• Dr. Susan Holmes, Order Statistics [3]. Retrieved February 2, 2005.

References
[1] http://planetmath.org/encyclopedia/OrderStatistics.html
[2] http://mathworld.wolfram.com/OrderStatistic.html
[3] http://www-stat.stanford.edu/~susan/courses/s116/node79.html
Ordinary differential equation 465

Ordinary differential equation


In mathematics, an ordinary differential equation (abbreviated ODE) is an equation containing a function of one
independent variable and its derivatives. There are many general forms an ODE can take, and these are classified in
practice (see below).[1][2] The derivatives are called ordinary to distinguish them from partial derivatives, which
apply to functions of several independent variables (see Partial differential equation).
The subject of ODEs is a sophisticated one (more so with PDEs), primarily due to the various forms the ODE can
take and how they can be integrated. Linear differential equations are those whose solutions can be added together and
multiplied by coefficients; their theory is well developed, and exact closed-form solutions can be obtained. By
contrast, ODEs without additive solutions are non-linear, and solving them is much more difficult, because it is rarely
possible to represent their solutions by elementary functions in closed form: instead, the exact (or "analytic")
solutions are given in series or integral form. Graphical and numerical methods, applied by hand or on computer, can
approximate solutions of ODEs (only approximately, but very accurately depending on the specific method used) and
can yield useful information about the properties of the solutions even without solving them exactly, which may be
all that is needed.

Background
Ordinary differential equations arise in many different contexts throughout mathematics and science (social
and natural) one way or another, because when describing changes mathematically, the most accurate
way uses differentials and derivatives (related, though not quite the same). Since various differentials,
derivatives, and functions become inevitably related to each other via equations, a differential equation is the
result, governing dynamical phenomena, evolution and variation. Often, quantities are defined as the rate of
change of other quantities (time derivatives), or gradients of quantities, which is how they enter differential
equations.
[Figure: The trajectory of a projectile launched from a cannon follows a curve determined by an ordinary
differential equation that is derived from Newton's second law.]

Specific mathematical fields include geometry and analytical mechanics. Scientific fields include much of physics
and astronomy (celestial mechanics), meteorology (weather modelling), chemistry (reaction rates),[3] biology (infectious
diseases, genetic variation), ecology and population modelling (population competition), and economics (stock trends,
interest rates and market equilibrium price changes).
Many mathematicians have studied differential equations and contributed to the field, including Newton, Leibniz, the
Bernoulli family, Riccati, Clairaut, d'Alembert and Euler.
A simple example is Newton's second law of motion: the relationship between the displacement x and the time t of
an object under the force F leads to the differential equation
m d²x/dt² = F(x(t))
for the motion of a particle of constant mass m. In general, F depends on the position x(t) of the particle at time t, and
so the unknown function x(t) appears on both sides of the differential equation, as is indicated in the notation
F(x(t)).[4][5][6][7]
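This can be illustrated numerically; the following sketch (Python, with arbitrary values for the mass, force, initial data, and step size) steps m d²x/dt² = F forward in time for a constant force and compares with the exact quadratic trajectory:

```python
m, F = 2.0, 10.0          # mass and a constant applied force (arbitrary values)
x, v = 0.0, 3.0           # initial position and velocity
dt, steps = 1e-4, 10000   # integrate from t = 0 to t = 1

for _ in range(steps):
    a = F / m             # Newton's second law: acceleration = F/m
    v += a * dt           # semi-implicit Euler: update velocity first,
    x += v * dt           # then position with the new velocity

t = dt * steps
exact = 0.0 + 3.0 * t + 0.5 * (F / m) * t ** 2   # x0 + v0 t + (F/2m) t^2
print(x, exact)
```

For constant F the exact solution is a quadratic in t, so the numerical and exact positions agree closely for a small step size.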

Definitions
In what follows, let y be a dependent variable and x an independent variable, so that y = y(x) is an unknown function
in x. The notation for differentiation varies depending upon the author and upon which notation is most useful for the
task at hand. In this context Leibniz's notation (dy/dx, d²y/dx², ..., dⁿy/dxⁿ) is useful for differentials and when
integration is to be done, while Newton's and Lagrange's notation (y′,y′′, ... y(n)) is useful for representing derivatives
of any order compactly.

General definition of an ODE


Let F be a given function of x, y, and derivatives of y. Then an equation of the form
F(x, y, y′, ..., y^(n−1)) = y^(n)
is called an explicit ordinary differential equation of order n.[8][9]
More generally, an implicit ordinary differential equation of order n takes the form:[10]
F(x, y, y′, y′′, ..., y^(n)) = 0
There are further classifications:


Autonomous
A differential equation not depending on x is called autonomous.
Linear
A differential equation is said to be linear if F can be written as a linear combination of the derivatives of y:
y^(n) = Σ_{i=0}^{n−1} a_i(x) y^(i) + r(x),
where a_i(x) and r(x) are continuous functions of x.[11][12][13] Non-linear equations cannot be written in this form. The
function r(x) is called the source term, leading to two further important classifications:[14][15]
Homogeneous: If r(x) = 0, one "automatic" solution is consequently the trivial solution, y = 0. The solution of a
linear homogeneous equation is a complementary function, denoted here by yc.
Nonhomogeneous (or inhomogeneous): If r(x) ≠ 0. The additional solution to the complementary function is the
particular integral, denoted here by yp.
The general solution to a linear equation can be written as y = yc + yp.
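To illustrate the decomposition y = yc + yp, consider the linear equation y′′ + y = x (an illustrative example, not from the text above): its complementary function is yc = C1 cos x + C2 sin x and a particular integral is yp = x. A finite-difference sketch checks both parts:

```python
from math import cos

def y_c(x):                     # complementary function (with C1 = 1, C2 = 0)
    return cos(x)

def y(x):                       # a sample of the general solution: y = y_c + y_p
    return cos(x) + x           # particular integral y_p = x

def second_derivative(f, x, h=1e-4):
    # central finite-difference approximation of f''(x)
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

x0 = 0.7
residual_full = second_derivative(y, x0) + y(x0) - x0   # y'' + y - r(x) with r(x) = x
residual_hom = second_derivative(y_c, x0) + y_c(x0)     # homogeneous case, r(x) = 0
print(residual_full, residual_hom)                       # both near zero
```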

System of ODEs
A number of coupled differential equations form a system of equations. If y is a vector whose elements are
functions, y(x) = [y1(x), y2(x), ..., ym(x)], and F is a vector-valued function of y and its derivatives, then
y^(n) = F(x, y, y′, y′′, ..., y^(n−1))
is an explicit system of ordinary differential equations of order n and dimension m. In column vector form, this
equates each component yi^(n) with the corresponding component Fi of F.
These are not necessarily linear. The implicit analogue is:
F(x, y, y′, y′′, ..., y^(n)) = 0,
where 0 = (0, 0, ..., 0) is the zero vector, and similarly in matrix form each component Fi is equated with zero.


Solutions
Given a differential equation
F(x, y, y′, ..., y^(n)) = 0,
a function u: I ⊂ R → R is called the solution or integral curve for F, if u is n-times differentiable on I, and
F(x, u, u′, ..., u^(n)) = 0 for all x in I.
Given two solutions u: J ⊂ R → R and v: I ⊂ R → R, u is called an extension of v if I ⊂ J and
u(x) = v(x) for all x in I.
A solution which has no extension is called a maximal solution. A solution defined on all of R is called a global
solution.
A general solution of an n-th order equation is a solution containing n arbitrary independent constants of integration.
A particular solution is derived from the general solution by setting the constants to particular values, often chosen
to fulfill set initial conditions or boundary conditions.[16] A singular solution is a solution which cannot be obtained
by assigning definite values to the arbitrary constants in the general solution.[17]

Theories of ODEs

Singular solutions
The theory of singular solutions of ordinary and partial differential equations was a subject of research from the time
of Leibniz, but only since the middle of the nineteenth century did it receive special attention. A valuable but
little-known work on the subject is that of Houtain (1854). Darboux (starting in 1873) was a leader in the theory, and
in the geometric interpretation of these solutions he opened a field which was worked by various writers, notably
Casorati and Cayley. To the latter is due (1872) the theory of singular solutions of differential equations of the first
order as accepted circa 1900.

Reduction to quadratures
The primitive attempt in dealing with differential equations had in view a reduction to quadratures. As it had been
the hope of eighteenth-century algebraists to find a method for solving the general equation of the nth degree, so it
was the hope of analysts to find a general method for integrating any differential equation. Gauss (1799) showed,
however, that the differential equation meets its limitations very soon unless complex numbers are introduced. Hence
analysts began to substitute the study of functions, thus opening a new and fertile field. Cauchy was the first to
appreciate the importance of this view. Thereafter the real question was to be, not whether a solution is possible by
means of known functions or their integrals, but whether a given differential equation suffices for the definition of a
function of the independent variable or variables, and if so, what are the characteristic properties of this function.

Fuchsian theory
Two memoirs by Fuchs (Crelle, 1866, 1868) inspired a novel approach, subsequently elaborated by Thomé and
Frobenius. Collet was a prominent contributor beginning in 1869, although his method for integrating a non-linear
system was communicated to Bertrand in 1868. Clebsch (1873) attacked the theory along lines parallel to those
followed in his theory of Abelian integrals. As the latter can be classified according to the properties of the
fundamental curve which remains unchanged under a rational transformation, so Clebsch proposed to classify the
transcendent functions defined by the differential equations according to the invariant properties of the
corresponding surfaces f = 0 under rational one-to-one transformations.

Lie's theory
From 1870 Sophus Lie's work put the theory of differential equations on a more satisfactory foundation. He showed
that the integration theories of the older mathematicians can, by the introduction of what are now called Lie groups,
be referred to a common source; and that ordinary differential equations which admit the same infinitesimal
transformations present comparable difficulties of integration. He also emphasized the subject of transformations of
contact.
Lie's group theory of differential equations has two principal merits: (1) it unifies the many ad hoc methods
known for solving differential equations, and (2) it provides powerful new ways to find solutions. The theory
has applications to both ordinary and partial differential equations.[18]
A general approach to solve DE's uses the symmetry property of differential equations, the continuous infinitesimal
transformations of solutions to solutions (Lie theory). Continuous group theory, Lie algebras and differential
geometry are used to understand the structure of linear and nonlinear (partial) differential equations for generating
integrable equations, to find its Lax pairs, recursion operators, Bäcklund transform and finally finding exact analytic
solutions to the DE.
Symmetry methods have been recognized to study differential equations arising in mathematics, physics,
engineering, and many other disciplines.

Sturm–Liouville theory
Sturm–Liouville theory is a theory of eigenvalues and eigenfunctions of linear operators defined in terms of
second-order homogeneous linear equations, and is useful in the analysis of certain partial differential equations.

Existence and uniqueness of solutions


There are several theorems that establish existence and uniqueness of solutions to initial value problems involving
ODEs both locally and globally. The two main theorems are

Theorem Assumption Conclusion

Peano existence theorem F continuous local existence only

Picard–Lindelöf theorem F Lipschitz continuous local existence and uniqueness

which are both local results.



Local existence and uniqueness theorem simplified


The theorem can be stated simply as follows.[19] For the equation and initial value problem
y′ = F(x, y),  y(x0) = y0,
if F and ∂F/∂y are continuous in a closed rectangle
R = [x0 − a, x0 + a] × [y0 − b, y0 + b]
in the x-y plane, where a and b are real (symbolically: a, b ∈ ℝ) and × denotes the cartesian product, square brackets
denote closed intervals, then there is an interval
I = [x0 − h, x0 + h] ⊆ [x0 − a, x0 + a],
for some h ∈ ℝ where the solution to the above equation and initial value problem can be found. That is, there is a
solution and it is unique. Since there is no restriction on F to be linear, this applies to non-linear equations which
take the form F(x, y), and it can also be applied to systems of equations.

Global uniqueness and maximum domain of solution


When the hypotheses of the Picard–Lindelöf theorem are satisfied, then local existence and uniqueness can be
extended to a global result. More precisely:[20]
For each initial condition (x0, y0) there exists a unique maximum (possibly infinite) open interval
Imax = (x−, x+), with x± ∈ ℝ ∪ {±∞} and x0 ∈ Imax,
such that any solution which satisfies this initial condition is a restriction of the solution which satisfies this initial
condition with domain Imax.
In the case that Imax ≠ ℝ, there are exactly two possibilities:
• explosion in finite time: |y(x)| → ∞ as x approaches an endpoint of Imax;
• leaves domain of definition: (x, y(x)) approaches the boundary ∂Ω as x approaches an endpoint of Imax;
where Ω is the open set in which F is defined, and ∂Ω is its boundary.
Note that the maximum domain of the solution
• is always an interval (to have uniqueness)
• may be smaller than ℝ
• may depend on the specific choice of (x0, y0).
Example
y′ = y²,  y(0) = 1.
This means that F(x, y) = y², which is C¹ and therefore Lipschitz continuous for all y, satisfying the Picard–Lindelöf
theorem.
Even in such a simple setting, the maximum domain of solution cannot be all of ℝ, since the solution is
y(x) = 1/(1 − x),
which has maximum domain:
Imax = (−∞, 1).
This shows clearly that the maximum interval may depend on the initial conditions. The domain of y could be taken
as being ℝ ∖ {1}, but this would lead to a domain that is not an interval, so that the side opposite to the
initial condition would be disconnected from the initial condition, and therefore not uniquely determined by it.

The maximum domain is not ℝ because
|y(x)| → ∞ as x → 1⁻,
which is one of the two possible cases according to the above theorem.
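For the initial value problem y′ = y² with y(0) = 1, the maximal solution y(x) = 1/(1 − x) and its blow-up as x → 1⁻ can be checked numerically (a sketch):

```python
def y(x):
    return 1.0 / (1.0 - x)        # candidate maximal solution on (-inf, 1)

def dydx(x, h=1e-6):
    # central finite-difference approximation of y'(x)
    return (y(x + h) - y(x - h)) / (2 * h)

# The ODE y' = y^2 holds along the solution, and y(0) = 1
x0 = 0.5
residual = dydx(x0) - y(x0) ** 2  # should be near zero
blow_up = y(0.999999)             # the solution explodes as x -> 1^-

print(y(0.0), residual, blow_up)
```

The residual is tiny at any point of (−∞, 1), while the value of y grows without bound near x = 1, the "explosion in finite time" case of the theorem.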

Reduction of order
Differential equations can usually be solved more easily if the order of the equation can be reduced.

Reduction to a first order system


Any differential equation of order n,
F(x, y, y′, y′′, ..., y^(n−1)) = y^(n),
can be written as a system of n first-order differential equations by defining a new family of unknown functions
y_i = y^(i−1),
for i = 1, 2, ..., n. The n-dimensional system of first-order coupled differential equations is then
y_1′ = y_2,  y_2′ = y_3,  ...,  y_(n−1)′ = y_n,  y_n′ = F(x, y_1, y_2, ..., y_n),
more compactly in vector notation:
y′ = F(x, y),
where
y = (y_1, y_2, ..., y_n) and F(x, y) = (y_2, y_3, ..., y_n, F(x, y_1, ..., y_n)).
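As a sketch of this reduction (Python; forward Euler with an arbitrary step size), the second-order equation y′′ = −y becomes the first-order system y1′ = y2, y2′ = −y1, which can be stepped forward and compared with the exact solution cos(t):

```python
from math import cos

# y'' = -y  ->  y1' = y2, y2' = -y1, with y1 = y and y2 = y'
y1, y2 = 1.0, 0.0          # initial conditions y(0) = 1, y'(0) = 0
dt, steps = 1e-4, 10000    # integrate from t = 0 to t = 1

for _ in range(steps):
    y1, y2 = y1 + dt * y2, y2 - dt * y1   # one forward Euler step on the system

t = dt * steps
print(y1, cos(t))          # exact solution of the original ODE is y(t) = cos(t)
```

Any standard ODE integrator works on the system form, which is why numerical libraries typically accept only first-order systems.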
Summary of exact solutions


Some differential equations have solutions which can be written in an exact and closed form. Several important
classes are given here.
In the table below, P(x), Q(x), P(y), Q(y), and M(x,y), N(x,y) are any integrable functions of x, y, and b and c are real
given constants, and C1, C2,... are arbitrary constants (complex in general). The differential equations are in their
equivalent and alternative forms which lead to the solution through integration.
In the integral solutions, λ and ε are dummy variables of integration (the continuum analogues of indices in
summation), and the notation ∫xF(λ)dλ just means to integrate F(λ) with respect to λ, then after the integration
substitute λ = x, without adding constants (explicitly stated).

Differential equation — Solution method — General solution

Separable equations

First order, separable in x and y (general case; see below for special cases):[21]
  P1(x)Q1(y) + P2(x)Q2(y) dy/dx = 0, i.e. P1(x)Q1(y) dx + P2(x)Q2(y) dy = 0.
  Method: separation of variables (divide by P2Q1).
  Solution: ∫^x P1(λ)/P2(λ) dλ + ∫^y Q2(λ)/Q1(λ) dλ = C.

First order, separable in x:[22]
  dy/dx = F(x), i.e. dy = F(x) dx.
  Method: direct integration.
  Solution: y = ∫^x F(λ) dλ + C.

First order, autonomous, separable in y:[23]
  dy/dx = F(y), i.e. dy = F(y) dx.
  Method: separation of variables (divide by F).
  Solution: x = ∫^y dλ/F(λ) + C.

First order, separable in x and y:[24]
  P(y) dy/dx + Q(x) = 0, i.e. P(y) dy + Q(x) dx = 0.
  Method: integrate throughout.
  Solution: ∫^y P(λ) dλ + ∫^x Q(λ) dλ = C.

General first order equations

First order, homogeneous:[25]
  dy/dx = F(y/x).
  Method: set y = ux, then solve by separation of variables in u and x.
  Solution: ln(Cx) = ∫^{y/x} dλ/(F(λ) − λ).

First order, separable:[26]
  yM(xy) + xN(xy) dy/dx = 0, i.e. yM(xy) dx + xN(xy) dy = 0.
  Method: separation of variables (divide by xy).
  Solution: ln(Cx) = ∫^{xy} N(λ) dλ / (λ[N(λ) − M(λ)]).
  If N = M, the solution is xy = C.

Exact differential, first order:[27]
  M(x, y) dy/dx + N(x, y) = 0, i.e. M(x, y) dy + N(x, y) dx = 0, where ∂M/∂x = ∂N/∂y.
  Method: integrate throughout.
  Solution: F(x, y) = ∫^y M(x, λ) dλ + ∫^x N(λ, y) dλ + Y(y) + X(x) = C,
  where Y(y) and X(x) are functions from the integrals rather than constant values, which are set to make the
  final function F(x, y) satisfy the initial equation.

Inexact differential, first order:[28]
  M(x, y) dy/dx + N(x, y) = 0, where ∂M/∂x ≠ ∂N/∂y.
  Method: find an integration factor μ(x, y) satisfying ∂(μM)/∂x = ∂(μN)/∂y.
  If μ(x, y) can be found, the equation μM dy + μN dx = 0 is exact and is solved as above.

General second order equations

Second order, autonomous:[29]
  d²y/dx² = F(y).
  Method: multiply the equation by 2 dy/dx, substitute 2 (dy/dx)(d²y/dx²) = d[(dy/dx)²]/dx, then integrate twice.
  Solution: x = ± ∫^y dλ / √( 2 ∫^λ F(ε) dε + C1 ) + C2.

Linear equations (up to nth order)

First order, linear, inhomogeneous, function coefficients:[30]
  dy/dx + P(x) y = Q(x).
  Method: integrating factor e^{∫^x P(λ) dλ}.
  Solution: y = e^{−∫^x P(λ) dλ} [ ∫^x e^{∫^λ P(ε) dε} Q(λ) dλ + C ].

Second order, linear, inhomogeneous, constant coefficients:[31]
  d²y/dx² + b dy/dx + cy = r(x).
  Method: complementary function y_c: assume y_c = e^{αx}, substitute and solve the polynomial in α, to find
  the linearly independent functions e^{α_j x}. Particular integral y_p: in general the method of variation of
  parameters, though for very simple r(x) inspection may work.[32]
  If b² > 4c, then y = C1 e^{[−b + √(b² − 4c)] x / 2} + C2 e^{[−b − √(b² − 4c)] x / 2} + y_p.
  If b² = 4c, then y = (C1 x + C2) e^{−bx/2} + y_p.
  If b² < 4c, then y = e^{−bx/2} [ C1 sin( x √(4c − b²) / 2 ) + C2 cos( x √(4c − b²) / 2 ) ] + y_p.

nth order, linear, inhomogeneous, constant coefficients:[33]
  Σ_{j=0}^{n} b_j d^j y/dx^j = r(x).
  Method: complementary function y_c: assume y_c = e^{αx}, substitute and solve the polynomial of degree n,
  Σ_{j=0}^{n} b_j α^j = 0, to find the linearly independent functions e^{α_j x}. Particular integral y_p: in general
  the method of variation of parameters, though for very simple r(x) inspection may work.[34]
  Since the α_j are the solutions of the polynomial of degree n, then:
  for α_j all different, y = Σ_{j=1}^{n} C_j e^{α_j x} + y_p;
  for each root α_j repeated k_j times, y = Σ_j ( Σ_{ℓ=1}^{k_j} C_ℓ x^{ℓ−1} ) e^{α_j x} + y_p;
  for some α_j complex, setting α_j = χ_j + iγ_j and using Euler's formula allows some terms in the previous
  results to be written in the form C_j e^{χ_j x} cos(γ_j x + ϕ_j), where ϕ_j is an arbitrary constant (phase shift).
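As a quick sanity check on the first-order linear row (an illustrative sketch; the concrete example and all helper names are mine), the integrating-factor solution can be evaluated by quadrature and compared with the closed form:

```python
import math

# Sketch: y' + P(x) y = Q(x) with integrating factor mu(x) = exp(integral of P),
# so (mu*y)' = mu*Q and y(x) = (1/mu(x)) * (integral of mu*Q + C).
# Example: y' + y = x, y(0) = 1, whose exact solution is y = x - 1 + 2*exp(-x).
Q = lambda x: x   # right-hand side; here P(x) = 1, so mu(x) = exp(x)

def solve_linear(x, y0=1.0, n=10000):
    # midpoint quadrature of mu(s)*Q(s) on [0, x]
    h = x / n
    integral = sum(math.exp((i + 0.5) * h) * Q((i + 0.5) * h) for i in range(n)) * h
    return math.exp(-x) * (integral + y0)   # C = y(0) since mu(0) = 1

x = 1.5
print(solve_linear(x), x - 1 + 2 * math.exp(-x))
```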

Software for ODE solving


• FuncDesigner (free license: BSD, uses Automatic differentiation, also can be used online via Sage-server [35])
• odeint [36] - A C++ library for solving ordinary differential equations numerically
• VisSim [37] - a visual language for differential equation solving
• Mathematical Assistant on Web [38] online solving first order (linear and with separated variables) and second
order linear differential equations (with constant coefficients), including intermediate steps in the solution.
• DotNumerics: Ordinary Differential Equations for C# and VB.NET [39] Initial-value problem for nonstiff and stiff
ordinary differential equations (explicit Runge-Kutta, implicit Runge-Kutta, Gear’s BDF and Adams-Moulton).
• Online experiments with JSXGraph [40]
• Maxima computer algebra system (GPL)
• COPASI a free (Artistic License 2.0) software package for the integration and analysis of ODEs.

• MATLAB a matrix-programming software (MATrix LABoratory)


• GNU Octave a high-level language, primarily intended for numerical computations.

Notes
[1] Kreyszig (1972, p. 1)
[2] Simmons (1972, p. 2)
[3] Mathematics for Chemists, D.M. Hirst, The Macmillan Press, 1976, SBN 333-18172-7 (no ISBN)
[4] Kreyszig (1972, p. 64)
[5] Simmons (1972, pp. 1,2)
[6] Halliday & Resnick (1977, p. 78)
[7] Tipler (1991, pp. 78–83)
[8] Harper (1976, p. 127)
[9] Kreyszig (1972, p. 2)
[10] Simmons (1972, p. 3)
[11] Harper (1976, p. 127)
[12] Kreyszig (1972, p. 24)
[13] Simmons (1972, p. 47)
[14] Harper (1976, p. 128)
[15] Kreyszig (1972, p. 24)
[16] Kreyszig (1972, p. 78)
[17] Kreyszig (1972, p. 4)
[18] Lawrence (1999, p. 9)
[19] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[20] Boscain & Chitour (2011, p. 21)
[21] Mathematical Handbook of Formulas and Tables (3rd edition), S. Lipschutz, M.R. Spiegel, J. Liu, Schaum's Outline Series, 2009, ISBN
978-0-07-154855-7
[22] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[23] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[24] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[25] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[26] Mathematical Handbook of Formulas and Tables (3rd edition), S. Lipschutz, M.R. Spiegel, J. Liu, Schaum's Outline Series, 2009, ISBN
978-0-07-154855-7
[27] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[28] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[29] Further Elementary Analysis, R. Porter, G. Bell & Sons (London), 1978, ISBN 0-7135-1594-5
[30] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[31] Mathematical Methods for Physics and Engineering, K.F. Riley, M.P. Hobson, S.J. Bence, Cambridge University Press, 2010, ISBN
978-0-521-86153-3
[32] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[33] Mathematical Methods for Physics and Engineering, K.F. Riley, M.P. Hobson, S.J. Bence, Cambridge University Press, 2010, ISBN
978-0-521-86153-3
[34] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[35] http:/ / sage. openopt. org/ welcome
[36] https:/ / github. com/ headmyshoulder/ odeint-v2
[37] http:/ / www. vissim. us
[38] http:/ / user. mendelu. cz/ marik/ maw/ index. php?lang=en& form=ode
[39] http:/ / www. dotnumerics. com/ NumericalLibraries/ DifferentialEquations/
[40] http:/ / jsxgraph. uni-bayreuth. de/ wiki/ index. php/ Differential_equations

References
• Halliday, David; Resnick, Robert (1977), Physics (3rd ed.), New York: Wiley, ISBN 0-471-71716-9
• Harper, Charlie (1976), Introduction to Mathematical Physics, New Jersey: Prentice-Hall, ISBN 0-13-487538-9
• Kreyszig, Erwin (1972), Advanced Engineering Mathematics (3rd ed.), New York: Wiley, ISBN 0-471-50728-8.
• Polyanin, A. D. and V. F. Zaitsev, Handbook of Exact Solutions for Ordinary Differential Equations (2nd
edition), Chapman & Hall/CRC Press, Boca Raton, 2003. ISBN 1-58488-297-2
• Simmons, George F. (1972), Differential Equations with Applications and Historical Notes, New York:
McGraw-Hill
• Tipler, Paul A. (1991), Physics for Scientists and Engineers: Extended version (3rd ed.), New York: Worth
Publishers, ISBN 0-87901-432-6
• Boscain, Ugo; Chitour, Yacine (2011), Introduction à l'automatique [Introduction to Control Theory] (in
French) (http://www.cmapx.polytechnique.fr/~boscain/poly2011.pdf)
• Dresner, Lawrence (1999), Applications of Lie's Theory of Ordinary and Partial Differential Equations, Bristol
and Philadelphia: Institute of Physics Publishing

Bibliography
• Coddington, Earl A.; Levinson, Norman (1955). Theory of Ordinary Differential Equations. New York:
McGraw-Hill.
• Hartman, Philip, Ordinary Differential Equations, 2nd Ed., Society for Industrial & Applied Math, 2002. ISBN
0-89871-510-5.
• W. Johnson, A Treatise on Ordinary and Partial Differential Equations (http://www.hti.umich.edu/cgi/b/bib/
bibperm?q1=abv5010.0001.001), John Wiley and Sons, 1913, in University of Michigan Historical Math
Collection (http://hti.umich.edu/u/umhistmath/)
• E.L. Ince, Ordinary Differential Equations, Dover Publications, 1958, ISBN 0-486-60349-0
• Witold Hurewicz, Lectures on Ordinary Differential Equations, Dover Publications, ISBN 0-486-49510-8
• Ibragimov, Nail H (1993). CRC Handbook of Lie Group Analysis of Differential Equations Vol. 1-3. Providence:
CRC-Press. ISBN 0-8493-4488-3.
• Teschl, Gerald (2012). Ordinary Differential Equations and Dynamical Systems (http://www.mat.univie.ac.at/
~gerald/ftp/book-ode/). Providence: American Mathematical Society. ISBN 978-0-8218-8328-0.
• A. D. Polyanin, V. F. Zaitsev, and A. Moussiaux, Handbook of First Order Partial Differential Equations, Taylor
& Francis, London, 2002. ISBN 0-415-27267-X
• D. Zwillinger, Handbook of Differential Equations (3rd edition), Academic Press, Boston, 1997.

External links
• NCLab (http://nclab.com) provides interactive graphical modules in the web browser to solve ordinary and
partial differential equations.
• Differential Equations (http://www.dmoz.org/Science/Math/Differential_Equations//) at the Open Directory
Project (includes a list of software for solving differential equations).
• EqWorld: The World of Mathematical Equations (http://eqworld.ipmnet.ru/index.htm), containing a list of
ordinary differential equations with their solutions.
• Online Notes / Differential Equations (http://tutorial.math.lamar.edu/classes/de/de.aspx) by Paul Dawkins,
Lamar University.
• Differential Equations (http://www.sosmath.com/diffeq/diffeq.html), S.O.S. Mathematics.
• A primer on analytical solution of differential equations (http://numericalmethods.eng.usf.edu/mws/gen/
08ode/mws_gen_ode_bck_primer.pdf) from the Holistic Numerical Methods Institute, University of South
Florida.

• Ordinary Differential Equations and Dynamical Systems (http://www.mat.univie.ac.at/~gerald/ftp/book-ode/


) lecture notes by Gerald Teschl.
• Notes on Diffy Qs: Differential Equations for Engineers (http://www.jirka.org/diffyqs/) An introductory
textbook on differential equations by Jiri Lebl of UIUC.

Partial differential equation


In mathematics, a partial differential equation (PDE) is a differential
equation that contains unknown multivariable functions and their
partial derivatives. PDEs are used to formulate problems involving
functions of several variables, and are either solved by hand, or used to
create a relevant computer model.
PDEs can be used to describe a wide variety of phenomena such as
sound, heat, electrostatics, electrodynamics, fluid flow, or elasticity.
These seemingly distinct physical phenomena can be formalised
identically in terms of PDEs, which shows that they are governed by
the same underlying dynamic. Just as ordinary differential equations
often model one-dimensional dynamical systems, partial differential
equations often model multidimensional systems. PDEs find their generalisation in stochastic partial differential
equations.
[Figure: a visualisation of a solution to the heat equation on a two-dimensional plane]

Introduction
Partial Differential Equations (PDEs) are equations that involve rates of change with respect to continuous variables.
The configuration of a rigid body is specified by six numbers, but the configuration of a fluid is given by the
continuous distribution of the temperature, pressure, and so forth. The dynamics for the rigid body take place in a
finite-dimensional configuration space; the dynamics for the fluid occur in an infinite-dimensional configuration
space. This distinction usually makes PDEs much harder to solve than Ordinary Differential Equations (ODEs), but
here again there will be simple solutions for linear problems. Classic domains where PDEs are used include
acoustics, fluid flow, electrodynamics, and heat transfer.
A partial differential equation (PDE) for the function u(x1, …, xn) is an equation of the form
F(x1, …, xn; u, ∂u/∂x1, …, ∂u/∂xn; ∂²u/∂x1∂x1, …, ∂²u/∂x1∂xn; …) = 0.
If F is a linear function of u and its derivatives, then the PDE is called linear. Common examples of linear PDEs
include the heat equation, the wave equation, Laplace's equation, Helmholtz equation, Klein–Gordon equation, and
Poisson's equation.
A relatively simple PDE is
∂u/∂x (x, y) = 0.
This relation implies that the function u(x,y) is independent of x. However, the equation gives no information on the
function's dependence on the variable y. Hence the general solution of this equation is
u(x, y) = f(y),
where f is an arbitrary function of y. The analogous ordinary differential equation is
du/dx (x) = 0,


which has the solution
u(x) = c,
where c is any constant value. These two examples illustrate that general solutions of ordinary differential equations
(ODEs) involve arbitrary constants, but solutions of PDEs involve arbitrary functions. A solution of a PDE is
generally not unique; additional conditions must generally be specified on the boundary of the region where the
solution is defined. For instance, in the simple example above, the function f(y) can be determined if u is specified on
the line x = 0.

Existence and uniqueness


Although the issue of existence and uniqueness of solutions of ordinary differential equations has a very satisfactory
answer with the Picard–Lindelöf theorem, that is far from the case for partial differential equations. The
Cauchy–Kowalevski theorem states that the Cauchy problem for any partial differential equation whose coefficients
are analytic in the unknown function and its derivatives, has a locally unique analytic solution. Although this result
might appear to settle the existence and uniqueness of solutions, there are examples of linear partial differential
equations whose coefficients have derivatives of all orders (which are nevertheless not analytic) but which have no
solutions at all: see Lewy (1957). Even if the solution of a partial differential equation exists and is unique, it may
nevertheless have undesirable properties. The mathematical study of these questions is usually in the more powerful
context of weak solutions.
An example of pathological behavior is the sequence of Cauchy problems (depending upon n) for the Laplace
equation
∂²u/∂x² + ∂²u/∂y² = 0,
with boundary conditions
u(x, 0) = 0,  ∂u/∂y (x, 0) = sin(nx)/n,
where n is an integer. The derivative of u with respect to y approaches 0 uniformly in x as n increases, but the
solution is
u(x, y) = sinh(ny) sin(nx) / n².
This solution approaches infinity if nx is not an integer multiple of π for any non-zero value of y. The Cauchy
problem for the Laplace equation is called ill-posed or not well posed, since the solution does not depend
continuously upon the data of the problem. Such ill-posed problems are not usually satisfactory for physical
applications.

Notation
In PDEs, it is common to denote partial derivatives using subscripts. That is:
u_x = ∂u/∂x,  u_xx = ∂²u/∂x²,  u_xy = ∂²u/(∂y ∂x).
Especially in physics, del (∇) is often used for spatial derivatives, and an overdot (u̇) for time derivatives. For
example, the wave equation (described below) can be written as
ü = c² ∇²u
or
ü = c² Δu,
where Δ is the Laplace operator.

Examples

Heat equation in one space dimension


The equation for conduction of heat in one dimension for a homogeneous body has the form
u_t = α u_xx,
where u(t,x) is temperature, and α is a positive constant that describes the rate of diffusion. The Cauchy problem for
this equation consists in specifying u(0, x) = f(x), where f(x) is an arbitrary function.
General solutions of the heat equation can be found by the method of separation of variables. Some examples appear
in the heat equation article. They are examples of Fourier series for periodic f and Fourier transforms for
non-periodic f. Using the Fourier transform, a general solution of the heat equation has the form
u(t, x) = ∫ F(ξ) e^{iξx} e^{−α ξ² t} dξ,
where F is an arbitrary function. To satisfy the initial condition, F is given by the Fourier transform of f, that is
F(ξ) = (1/2π) ∫ f(x) e^{−iξx} dx.
If f represents a very small but intense source of heat, then the preceding integral can be approximated by the delta
distribution, multiplied by the strength of the source. For a source whose strength is normalized to 1, the result is
F(ξ) = 1/2π,
and the resulting solution of the heat equation is
u(t, x) = (1/2π) ∫ e^{iξx} e^{−α ξ² t} dξ.
This is a Gaussian integral. It may be evaluated to obtain
u(t, x) = (1/√(4παt)) e^{−x²/(4αt)}.
This result corresponds to the normal probability density for x with mean 0 and variance 2αt. The heat equation and
similar diffusion equations are useful tools to study random phenomena.

Wave equation in one spatial dimension


The wave equation is an equation for an unknown function u(t, x) of the form
u_tt = c² u_xx.
Here u might describe the displacement of a stretched string from equilibrium, or the difference in air pressure in a
tube, or the magnitude of an electromagnetic field in a tube, and c is a number that corresponds to the velocity of the
wave. The Cauchy problem for this equation consists in prescribing the initial displacement and velocity of a string
or other medium:
u(0, x) = f(x),  u_t(0, x) = g(x),
where f and g are arbitrary given functions. The solution of this problem is given by d'Alembert's formula:
u(t, x) = ½ [ f(x − ct) + f(x + ct) ] + (1/2c) ∫_{x−ct}^{x+ct} g(s) ds.

This formula implies that the solution at (t,x) depends only upon the data on the segment of the initial line that is cut
out by the characteristic curves
x − ct = constant,  x + ct = constant,
that are drawn backwards from that point. These curves correspond to signals that propagate with velocity c forward
and backward. Conversely, the influence of the data at any given point on the initial line propagates with the finite
velocity c: there is no effect outside a triangle through that point whose sides are characteristic curves. This behavior
is very different from the solution for the heat equation, where the effect of a point source appears (with small
amplitude) instantaneously at every point in space. The solution given above is also valid if t is negative, and the
explicit formula shows that the solution depends smoothly upon the data: both the forward and backward Cauchy
problems for the wave equation are well-posed.
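d'Alembert's formula is easy to evaluate directly. The following sketch (a simple midpoint quadrature for the g term; the names are illustrative) reproduces the standing-wave solution for f(x) = sin x and g = 0:

```python
import math

# Sketch of d'Alembert's formula for u_tt = c^2 u_xx with
# u(0, x) = f(x) and u_t(0, x) = g(x).

def dalembert(f, g, c, t, x, n=2000):
    a, b = x - c * t, x + c * t
    h = (b - a) / n
    integral = sum(g(a + (i + 0.5) * h) for i in range(n)) * h  # midpoint rule
    return 0.5 * (f(x - c * t) + f(x + c * t)) + integral / (2 * c)

# With f(x) = sin(x), g = 0 the solution is sin(x) cos(ct): a standing wave.
c, t, x = 2.0, 0.3, 1.1
u = dalembert(math.sin, lambda s: 0.0, c, t, x)
print(u, math.sin(x) * math.cos(c * t))
```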

Generalised heat-like equation in one space dimension


Here "heat-like equation" means an equation of the form
u_t = L u,
where L is a Sturm–Liouville operator in the x coordinate. (This operator may in fact be of the form
L u = (1/w(x)) [ d/dx ( p(x) du/dx ) + q(x) u ],
where w(x) is the weighting function with respect to which the eigenfunctions of L are orthogonal.) Subject to
homogeneous boundary conditions at the ends of the x interval and the initial condition u(0, x) = f(x), separation of
variables gives the eigenfunction expansion
u(t, x) = Σ_n c_n e^{λ_n t} φ_n(x),
where the φ_n are the eigenfunctions of L with eigenvalues λ_n, and
c_n = ⟨f, φ_n⟩_w / ⟨φ_n, φ_n⟩_w.

Spherical waves
Spherical waves are waves whose amplitude depends only upon the radial distance r from a central point source. For
such waves, the three-dimensional wave equation takes the form
u_tt = c² [ u_rr + (2/r) u_r ].
This is equivalent to
(ru)_tt = c² (ru)_rr,
and hence the quantity ru satisfies the one-dimensional wave equation. Therefore a general solution for spherical
waves has the form
u(t, r) = (1/r) [ F(r − ct) + G(r + ct) ],
where F and G are completely arbitrary functions. Radiation from an antenna corresponds to the case where G is
identically zero. Thus the wave form transmitted from an antenna has no distortion in time: the only distorting factor
is 1/r. This feature of undistorted propagation of waves is not present if there are two spatial dimensions.

Laplace equation in two dimensions


The Laplace equation for an unknown function φ of two variables has the form
φ_xx + φ_yy = 0.
Solutions of Laplace's equation are called harmonic functions.

Connection with holomorphic functions


Solutions of the Laplace equation in two dimensions are intimately connected with analytic functions of a complex
variable (a.k.a. holomorphic functions): the real and imaginary parts of any analytic function are conjugate
harmonic functions: they both satisfy the Laplace equation, and their gradients are orthogonal. If f=u+iv, then the
Cauchy–Riemann equations state that
u_x = v_y,  v_x = −u_y,
and it follows that
u_xx + u_yy = 0,  v_xx + v_yy = 0.
Conversely, given any harmonic function in two dimensions, it is the real part of an analytic function, at least locally.
Details are given in Laplace equation.

A typical boundary value problem


A typical problem for Laplace's equation is to find a solution that satisfies arbitrary values on the boundary of a
domain. For example, we may seek a harmonic function that takes on the values u(θ) on a circle of radius one. The
solution was given by Poisson:
φ(r, θ) = (1/2π) ∫₀^{2π} [ (1 − r²) / (1 + r² − 2r cos(θ − θ′)) ] u(θ′) dθ′.
Petrovsky (1967, p. 248) shows how this formula can be obtained by summing a Fourier series for φ. If r<1, the
derivatives of φ may be computed by differentiating under the integral sign, and one can verify that φ is analytic,
even if u is continuous but not necessarily differentiable. This behavior is typical for solutions of elliptic partial
differential equations: the solutions may be much more smooth than the boundary data. This is in contrast to
solutions of the wave equation, and more general hyperbolic partial differential equations, which typically have no
more derivatives than the data.
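Poisson's formula can likewise be evaluated by quadrature. In this sketch (illustrative helper names, a plain midpoint rule on the periodic integrand), the boundary data u(θ) = cos θ reproduces the harmonic function r cos θ inside the disk:

```python
import math

# Sketch of Poisson's formula for the unit disk:
# phi(r, theta) = (1/2pi) * integral over [0, 2pi] of
#   (1 - r^2) / (1 + r^2 - 2 r cos(theta - s)) * u(s) ds.

def poisson(u, r, theta, n=4000):
    h = 2 * math.pi / n
    total = 0.0
    for i in range(n):
        s = (i + 0.5) * h
        kernel = (1 - r * r) / (1 + r * r - 2 * r * math.cos(theta - s))
        total += kernel * u(s)
    return total * h / (2 * math.pi)

# Boundary data u(theta) = cos(theta) extends to the harmonic function r*cos(theta).
r, theta = 0.5, 0.7
print(poisson(math.cos, r, theta), r * math.cos(theta))
```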

Euler–Tricomi equation
The Euler–Tricomi equation
u_xx = x u_yy
is used in the investigation of transonic flow.

Advection equation
The advection equation describes the transport of a conserved scalar ψ in a velocity field u. It is:
ψ_t + ∇·(ψu) = 0.
If the velocity field is solenoidal (that is, ∇·u = 0), then the equation may be simplified to
ψ_t + u·∇ψ = 0.
In the one-dimensional case where u is not constant and is equal to ψ, the equation is referred to as Burgers'
equation.

Ginzburg–Landau equation
The Ginzburg–Landau equation is used in modelling superconductivity. It is
i u_t + p u_xx + q |u|² u = iγu,
where p, q ∈ C and γ ∈ R are constants and i is the imaginary unit.

The Dym equation


The Dym equation is named for Harry Dym and occurs in the study of solitons. It is
u_t = u³ u_xxx.
Initial-boundary value problems


Many problems of mathematical physics are formulated as initial-boundary value problems.

Vibrating string
If the string is stretched between two points where x=0 and x=L and u denotes the amplitude of the displacement of
the string, then u satisfies the one-dimensional wave equation in the region where 0<x<L and t is unlimited. Since the
string is tied down at the ends, u must also satisfy the boundary conditions

as well as the initial conditions

The method of separation of variables for the wave equation

leads to solutions of the form

where

where the constant k must be determined. The boundary conditions then imply that X is a multiple of sin kx, and k
must have the form

where n is an integer. Each term in the sum corresponds to a mode of vibration of the string. The mode with n=1 is
called the fundamental mode, and the frequencies of the other modes are all multiples of this frequency. They form
the overtone series of the string, and they are the basis for musical acoustics. The initial conditions may then be
satisfied by representing f and g as infinite sums of these modes. Wind instruments typically correspond to vibrations
of an air column with one end open and one end closed. The corresponding boundary conditions are
u(t, 0) = 0,  u_x(t, L) = 0.
The method of separation of variables can also be applied in this case, and it leads to a series of odd overtones.
The general problem of this type is solved in Sturm–Liouville theory.
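The mode expansion described above can be sketched numerically (the plucked-string example and the helper names are mine): the initial shape is projected onto the modes sin(nπx/L) and partially resummed:

```python
import math

# Sketch: for a string fixed at x = 0 and x = L, the allowed wavenumbers are
# k_n = n*pi/L, and the initial shape f expands in the modes sin(k_n x).

L = 1.0

def sine_coeff(f, n, m=2000):
    # b_n = (2/L) * integral_0^L f(x) sin(n pi x / L) dx, midpoint rule
    h = L / m
    return (2.0 / L) * h * sum(
        f((i + 0.5) * h) * math.sin(n * math.pi * (i + 0.5) * h / L)
        for i in range(m))

# A "plucked" triangular shape with its peak at x = 1/2.
f = lambda x: x if x < 0.5 else 1.0 - x

def series(x, terms=50):
    # partial sum of the mode expansion; reproduces f away from the corner
    return sum(sine_coeff(f, n) * math.sin(n * math.pi * x / L)
               for n in range(1, terms + 1))

print(series(0.25), f(0.25))
```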

Vibrating membrane
If a membrane is stretched over a curve C that forms the boundary of a domain D in the plane, its vibrations are
governed by the wave equation
u_tt = c² (u_xx + u_yy)
if t>0 and (x,y) is in D. The boundary condition is u(t,x,y) = 0 if (x,y) is on C. The method of separation of variables
leads to the form
u(t, x, y) = T(t) v(x, y),
which in turn must satisfy
v_xx + v_yy + k² v = 0.
The latter equation is called the Helmholtz Equation. The constant k must be determined to allow a non-trivial v to
satisfy the boundary condition on C. Such values of k2 are called the eigenvalues of the Laplacian in D, and the
associated solutions are the eigenfunctions of the Laplacian in D. The Sturm–Liouville theory may be extended to
this elliptic eigenvalue problem (Jost, 2002).

Other examples
The Schrödinger equation is a PDE at the heart of non-relativistic quantum mechanics. In the WKB approximation it
is the Hamilton–Jacobi equation.
Except for the Dym equation and the Ginzburg–Landau equation, the above equations are linear in the sense that
they can be written in the form Au = f for a given linear operator A and a given function f. Other important non-linear
equations include the Navier–Stokes equations describing the flow of fluids, and Einstein's field equations of general
relativity.
Also see the list of non-linear partial differential equations.

Classification
Some linear, second-order partial differential equations can be classified as parabolic, hyperbolic or elliptic. Others
such as the Euler–Tricomi equation have different types in different regions. The classification provides a guide to
appropriate initial and boundary conditions, and to smoothness of the solutions.

Equations of second order


Assuming u_xy = u_yx, the general second-order PDE in two independent variables has the form
A u_xx + 2B u_xy + C u_yy + (lower-order terms) = 0,
where the coefficients A, B, C, etc. may depend upon x and y. If A² + B² + C² > 0 over a region of the xy plane,
the PDE is second-order in that region. This form is analogous to the equation for a conic section:
A x² + 2B xy + C y² + ⋯ = 0.
More precisely, replacing ∂/∂x by X, and likewise for other variables (formally this is done by a Fourier transform),
converts a constant-coefficient PDE into a polynomial of the same degree, with the top degree (a homogeneous
polynomial, here a quadratic form) being most significant for the classification.
Just as one classifies conic sections and quadratic forms into parabolic, hyperbolic, and elliptic based on the
discriminant B² − 4AC, the same can be done for a second-order PDE at a given point. However, the
discriminant in a PDE is given by B² − AC, due to the convention of the xy term being 2B rather than B;
formally, the discriminant (of the associated quadratic form) is (2B)² − 4AC = 4(B² − AC), with the factor
of 4 dropped for simplicity.
1. B² − AC < 0 (elliptic): solutions of elliptic PDEs are as smooth as the coefficients allow, within the interior of the
region where the equation and solutions are defined. For example, solutions of Laplace's equation are analytic
within the domain where they are defined, but solutions may assume boundary values that are not smooth. The
motion of a fluid at subsonic speeds can be approximated with elliptic PDEs, and the Euler–Tricomi equation is
elliptic where x < 0.
2. B² − AC = 0 (parabolic): equations that are parabolic at every point can be transformed into a form analogous to
the heat equation by a change of independent variables. Solutions smooth out as the transformed time variable
increases. The Euler–Tricomi equation has parabolic type on the line where x = 0.
3. B² − AC > 0 (hyperbolic): hyperbolic equations retain any discontinuities of functions or derivatives in the initial
data. An example is the wave equation. The motion of a fluid at supersonic speeds can be approximated with
hyperbolic PDEs, and the Euler–Tricomi equation is hyperbolic where x > 0.
If there are n independent variables x1, x2, …, xn, a general linear partial differential equation of second order has the
form
L u = Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j} ∂²u/∂x_i ∂x_j + (lower-order terms) = 0.
The classification depends upon the signature of the eigenvalues of the coefficient matrix.
1. Elliptic: The eigenvalues are all positive or all negative.
2. Parabolic : The eigenvalues are all positive or all negative, save one that is zero.
3. Hyperbolic: There is only one negative eigenvalue and all the rest are positive, or there is only one positive
eigenvalue and all the rest are negative.
4. Ultrahyperbolic: There is more than one positive eigenvalue and more than one negative eigenvalue, and there are
no zero eigenvalues. There is only limited theory for ultrahyperbolic equations (Courant and Hilbert, 1962).
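The discriminant test for two variables can be sketched in a few lines (the helper names are mine); it recovers the types of the three model equations and the mixed type of the Euler–Tricomi equation:

```python
# Sketch: classify A u_xx + 2B u_xy + C u_yy + ... = 0 by the sign of the
# discriminant B^2 - A*C of the symmetric coefficient matrix [[A, B], [B, C]].

def classify(A, B, C):
    disc = B * B - A * C
    if disc < 0:
        return "elliptic"
    if disc == 0:
        return "parabolic"
    return "hyperbolic"

print(classify(1, 0, 1))    # Laplace u_xx + u_yy = 0
print(classify(1, 0, 0))    # heat-like principal part u_xx
print(classify(1, 0, -1))   # wave u_xx - u_yy = 0
# Euler-Tricomi u_xx = x u_yy, i.e. C = -x: type changes with the sign of x.
print(classify(1, 0, -0.5), classify(1, 0, 0.5))
```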

Systems of first-order equations and characteristic surfaces


The classification of partial differential equations can be extended to systems of first-order equations, where the
unknown u is now a vector with m components, and the coefficient matrices A_ν are m by m matrices for
ν = 1, 2, …, n. The partial differential equation takes the form
L u = Σ_{ν=1}^{n} A_ν ∂u/∂x_ν + B = 0,
where the coefficient matrices A_ν and the vector B may depend upon x and u. If a hypersurface S is given in the
implicit form
φ(x1, x2, …, xn) = 0,
where φ has a non-zero gradient, then S is a characteristic surface for the operator L at a given point if the
characteristic form vanishes:
Q(∂φ/∂x1, …, ∂φ/∂xn) = det[ Σ_{ν=1}^{n} A_ν ∂φ/∂x_ν ] = 0.
The geometric interpretation of this condition is as follows: if data for u are prescribed on the surface S, then it may
be possible to determine the normal derivative of u on S from the differential equation. If the data on S and the
differential equation determine the normal derivative of u on S, then S is non-characteristic. If the data on S and the
differential equation do not determine the normal derivative of u on S, then the surface is characteristic, and the
differential equation restricts the data on S: the differential equation is internal to S.
1. A first-order system Lu=0 is elliptic if no surface is characteristic for L: the values of u on S and the differential
equation always determine the normal derivative of u on S.
2. A first-order system is hyperbolic at a point if there is a space-like surface S with normal ξ at that point. This
means that, given any non-trivial vector η orthogonal to ξ, and a scalar multiplier λ, the equation
Q(λξ + η) = 0
has m real roots λ1, λ2, ..., λm. The system is strictly hyperbolic if these roots are always distinct. The geometrical
interpretation of this condition is as follows: the characteristic form Q(ζ)=0 defines a cone (the normal cone) with
homogeneous coordinates ζ. In the hyperbolic case, this cone has m sheets, and the axis ζ = λ ξ runs inside these
sheets: it does not intersect any of them. But when displaced from the origin by η, this axis intersects every sheet. In
the elliptic case, the normal cone has no real sheets.

Equations of mixed type


If a PDE has coefficients that are not constant, it is possible that it will not belong to any of these categories but
rather be of mixed type. A simple but important example is the Euler–Tricomi equation
u_xx = x u_yy,
which is called elliptic-hyperbolic because it is elliptic in the region x < 0, hyperbolic in the region x > 0, and
degenerate parabolic on the line x = 0.

Infinite-order PDEs in quantum mechanics


Weyl quantization in phase space leads to quantum Hamilton's equations for trajectories of quantum particles. Those
equations are infinite-order PDEs. However, in the semiclassical expansion one has a finite system of ODEs at any
fixed order of ħ. The equation of evolution of the Wigner function is also an infinite-order PDE. The quantum
trajectories are quantum characteristics, with the use of which one can calculate the evolution of the Wigner function.

Analytical methods to solve PDEs

Separation of variables
Linear PDEs can be reduced to systems of ordinary differential equations by the important technique of separation of
variables. The logic of this technique may be confusing upon first acquaintance, but it rests on the uniqueness of
solutions to differential equations: as with ODEs, if you can find any solution that solves the equation and satisfies
the boundary conditions, then it is the solution. We assume as an ansatz that the dependence of the solution on space
and time can be written as a product of terms that each depend on a single coordinate, and then see if and how this
can be made to solve the problem.
In the method of separation of variables, one reduces a PDE to a PDE in fewer variables (an ODE if reduced to one
variable), which is in turn easier to solve.
This is possible for simple PDEs, which are called separable partial differential equations, and the domain is
generally a rectangle (a product of intervals). Separable PDEs correspond to diagonal matrices – thinking of "the
value for fixed x" as a coordinate, each coordinate can be understood separately.
This generalizes to the method of characteristics, and is also used in integral transforms.
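As a brief illustration (the heat equation here is a standard textbook example, not taken from the surrounding text), the separation ansatz splits a PDE into two ODEs:

```latex
% Ansatz: u(x,t) = X(x)\,T(t) in the heat equation u_t = \alpha u_{xx}.
X(x)\,T'(t) = \alpha\,X''(x)\,T(t)
\quad\Longrightarrow\quad
\frac{T'(t)}{\alpha\,T(t)} = \frac{X''(x)}{X(x)} = -\lambda.
% The left side depends only on t and the right side only on x, so both
% must equal a constant, written -\lambda, leaving two ODEs:
T'(t) = -\lambda\,\alpha\,T(t), \qquad X''(x) = -\lambda\,X(x).
```

The boundary conditions then select the admissible values of λ, and the full solution is a superposition of the resulting product solutions.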

Method of characteristics
In special cases, one can find characteristic curves on which the equation reduces to an ODE – changing coordinates
in the domain to straighten these curves allows separation of variables, and is called the method of characteristics.
More generally, one may find characteristic surfaces.

Integral transform
An integral transform may transform the PDE to a simpler one, in particular a separable PDE. This corresponds to
diagonalizing an operator.
An important example of this is Fourier analysis, which diagonalizes the heat equation using the eigenbasis of
sinusoidal waves.
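Concretely (a standard computation, stated here as a sketch), the Fourier transform in x converts the heat equation into a decoupled ODE for each frequency k:

```latex
% Transforming u_t = \alpha u_{xx} in x replaces \partial_x by ik:
\hat{u}_t(k,t) = -\alpha k^2\,\hat{u}(k,t)
\quad\Longrightarrow\quad
\hat{u}(k,t) = \hat{u}(k,0)\,e^{-\alpha k^2 t}.
% Each sinusoidal mode e^{ikx} evolves independently: the operator
% \partial_{xx} has been diagonalized in the eigenbasis of sinusoidal waves.
```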
If the domain is finite or periodic, an infinite sum of solutions such as a Fourier series is appropriate, but an integral
of solutions such as a Fourier integral is generally required for infinite domains. The solution for a point source for
the heat equation given above is an example for use of a Fourier integral.

Change of variables
Often a PDE can be reduced to a simpler form with a known solution by a suitable change of variables. For example,
the Black–Scholes PDE

\frac{\partial V}{\partial t} + \tfrac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2} + rS\frac{\partial V}{\partial S} - rV = 0

is reducible to the heat equation

\frac{\partial u}{\partial \tau} = \frac{\partial^2 u}{\partial x^2}

by the change of variables (for complete details see Solution of the Black Scholes Equation [1])

V(S,t) = K\,v(x,\tau), \quad x = \ln(S/K), \quad \tau = \tfrac{1}{2}\sigma^2 (T-t), \quad v(x,\tau) = e^{-\alpha x - \beta\tau}\,u(x,\tau),

with \kappa = 2r/\sigma^2, \alpha = \tfrac{1}{2}(\kappa - 1) and \beta = \tfrac{1}{4}(\kappa + 1)^2.
Fundamental solution
Inhomogeneous equations can often be solved (for constant coefficient PDEs, always be solved) by finding the
fundamental solution (the solution for a point source), then taking the convolution with the boundary conditions to
get the solution.
This is analogous in signal processing to understanding a filter by its impulse response.
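As a minimal numerical sketch of this idea (the 1-D heat equation and its Gaussian point-source solution are standard; the grid and box-shaped initial data below are illustrative choices):

```python
import numpy as np

def heat_kernel(x, t, alpha=1.0):
    """Fundamental solution (point-source solution) of u_t = alpha * u_xx."""
    return np.exp(-x ** 2 / (4 * alpha * t)) / np.sqrt(4 * np.pi * alpha * t)

def solve_heat(u0, x, t, alpha=1.0):
    """u(., t) as the convolution of the initial data u0 with the kernel."""
    dx = x[1] - x[0]
    # u(x, t) = integral of G(x - y, t) u0(y) dy, approximated by a Riemann sum
    return np.array([np.sum(heat_kernel(xi - x, t, alpha) * u0) * dx for xi in x])

x = np.linspace(-10.0, 10.0, 401)
u0 = np.where(np.abs(x) < 1.0, 1.0, 0.0)   # box-shaped initial data
u = solve_heat(u0, x, t=0.5)
```

The Riemann sum stands in for the exact convolution integral; total heat (the integral of u) is conserved up to discretization error.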

Superposition principle
Because any superposition of solutions of a linear, homogeneous PDE is again a solution, the particular solutions
may then be combined to obtain more general solutions.

Methods for non-linear equations


See also the list of nonlinear partial differential equations.
There are no generally applicable methods to solve non-linear PDEs. Still, existence and uniqueness results (such as
the Cauchy–Kowalevski theorem) are often possible, as are proofs of important qualitative and quantitative
properties of solutions (getting these results is a major part of analysis). Computational solution methods for
nonlinear PDEs, such as the split-step method, exist for specific equations like the nonlinear Schrödinger equation.
Nevertheless, some techniques can be used for several types of equations. The h-principle is the most powerful
method to solve underdetermined equations. The Riquier–Janet theory is an effective method for obtaining
information about many analytic overdetermined systems.
The method of characteristics (similarity transformation method) can be used in some very special cases to solve
partial differential equations.
In some cases, a PDE can be solved via perturbation analysis in which the solution is considered to be a correction to
an equation with a known solution. Alternatives are numerical analysis techniques from simple finite difference
schemes to the more mature multigrid and finite element methods. Many interesting problems in science and
engineering are solved in this way using computers, sometimes high performance supercomputers.

Lie group method


From 1870 Sophus Lie's work put the theory of differential equations on a more satisfactory foundation. He showed
that the integration theories of the older mathematicians can, by the introduction of what are now called Lie groups,
be referred to a common source; and that ordinary differential equations which admit the same infinitesimal
transformations present comparable difficulties of integration. He also emphasized the subject of transformations of
contact.
A general approach to solving PDEs uses the symmetry property of differential equations, the continuous infinitesimal
transformations of solutions to solutions (Lie theory). Continuous group theory, Lie algebras and differential
geometry are used to understand the structure of linear and nonlinear partial differential equations for generating
integrable equations, to find their Lax pairs, recursion operators and Bäcklund transforms, and finally to find exact
analytic solutions to the PDE.
Symmetry methods have been recognized as a tool for studying differential equations arising in mathematics, physics,
engineering, and many other disciplines.

Semianalytical methods
The Adomian decomposition method, the Lyapunov artificial small parameter method, and He's homotopy
perturbation method are all special cases of the more general homotopy analysis method. These are series expansion
methods and, except for the Lyapunov method, are independent of small physical parameters (unlike the well-known
perturbation theory), giving these methods greater flexibility and generality of solution.

Numerical methods to solve PDEs


The three most widely used numerical methods to solve PDEs are the finite element method (FEM), finite volume
methods (FVM) and finite difference methods (FDM). The FEM holds a prominent position among these methods,
especially its exceptionally efficient higher-order version, hp-FEM. Other versions of FEM include the generalized
finite element method (GFEM), extended finite element method (XFEM), spectral finite element method (SFEM),
meshfree finite element method, discontinuous Galerkin finite element method (DGFEM), etc.

Finite element method


The finite element method (FEM) (its practical application often known as finite element analysis (FEA)) is a
numerical technique for finding approximate solutions of partial differential equations (PDE) as well as of integral
equations. The solution approach is based either on eliminating the differential equation completely (steady state
problems), or rendering the PDE into an approximating system of ordinary differential equations, which are then
numerically integrated using standard techniques such as Euler's method, Runge–Kutta, etc.

Finite difference method


Finite-difference methods are numerical methods for approximating the solutions to differential equations using
finite difference equations to approximate derivatives.
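A minimal sketch of an explicit finite-difference scheme for the 1-D heat equation (the grid sizes and stability factor are illustrative choices, not from the original text):

```python
import numpy as np

# Explicit (forward Euler) finite differences for u_t = u_xx on [0, 1]
# with boundary values u(0, t) = u(1, t) = 0.
nx, nt = 51, 1000
dx = 1.0 / (nx - 1)
dt = 0.4 * dx ** 2            # stability of this scheme needs dt <= 0.5 * dx**2
x = np.linspace(0.0, 1.0, nx)
u = np.sin(np.pi * x)         # initial condition with a known exact solution

for _ in range(nt):
    # second derivative approximated by the central difference quotient
    u[1:-1] += dt * (u[2:] - 2 * u[1:-1] + u[:-2]) / dx ** 2

# exact solution of the continuous problem at the final time
exact = np.exp(-np.pi ** 2 * nt * dt) * np.sin(np.pi * x)
```

The single Fourier mode chosen as initial data decays at a known exponential rate, which makes the discretization error easy to check.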

Finite volume method


Similar to the finite difference method or finite element method, values are calculated at discrete places on a meshed
geometry. "Finite volume" refers to the small volume surrounding each node point on a mesh. In the finite volume
method, surface integrals in a partial differential equation that contain a divergence term are converted to volume
integrals, using the divergence theorem. These terms are then evaluated as fluxes at the surfaces of each finite
volume. Because the flux entering a given volume is identical to that leaving the adjacent volume, these methods are
conservative.

References
• Adomian, G. (1994). Solving Frontier problems of Physics: The decomposition method. Kluwer Academic
Publishers.
• Courant, R. & Hilbert, D. (1962), Methods of Mathematical Physics, II, New York: Wiley-Interscience.
• Evans, L. C. (1998), Partial Differential Equations, Providence: American Mathematical Society,
ISBN 0-8218-0772-2.
• Ibragimov, Nail H (1993), CRC Handbook of Lie Group Analysis of Differential Equations Vol. 1-3, Providence:
CRC-Press, ISBN 0-8493-4488-3.
• John, F. (1982), Partial Differential Equations (4th ed.), New York: Springer-Verlag, ISBN 0-387-90609-6.
• Jost, J. (2002), Partial Differential Equations, New York: Springer-Verlag, ISBN 0-387-95428-7.
• Lewy, Hans (1957), "An example of a smooth linear partial differential equation without solution", Annals of
Mathematics, 2nd Series 66 (1): 155–158.
• Liao, S.J. (2003), Beyond Perturbation: Introduction to the Homotopy Analysis Method, Boca Raton: Chapman &
Hall/CRC Press, ISBN 1-58488-407-X.
• Olver, P.J. (1995), Equivalence, Invariants and Symmetry, Cambridge Press.
• Petrovskii, I. G. (1967), Partial Differential Equations, Philadelphia: W. B. Saunders Co.
• Pinchover, Y. & Rubinstein, J. (2005), An Introduction to Partial Differential Equations, New York: Cambridge
University Press, ISBN 0-521-84886-5.
• Polyanin, A. D. (2002), Handbook of Linear Partial Differential Equations for Engineers and Scientists, Boca
Raton: Chapman & Hall/CRC Press, ISBN 1-58488-299-9.
• Polyanin, A. D. & Zaitsev, V. F. (2004), Handbook of Nonlinear Partial Differential Equations, Boca Raton:
Chapman & Hall/CRC Press, ISBN 1-58488-355-3.
• Polyanin, A. D.; Zaitsev, V. F. & Moussiaux, A. (2002), Handbook of First Order Partial Differential Equations,
London: Taylor & Francis, ISBN 0-415-27267-X.
• Solin, P. (2005), Partial Differential Equations and the Finite Element Method, Hoboken, NJ: J. Wiley & Sons,
ISBN 0-471-72070-4.
• Solin, P.; Segeth, K. & Dolezel, I. (2003), Higher-Order Finite Element Methods, Boca Raton: Chapman &
Hall/CRC Press, ISBN 1-58488-438-X.
• Stephani, H. (1989), Differential Equations: Their Solution Using Symmetries. Edited by M. MacCallum,
Cambridge University Press.
• Wazwaz, Abdul-Majid (2009). Partial Differential Equations and Solitary Waves Theory. Higher Education
Press. ISBN 90-5809-369-7.
• Zwillinger, D. (1997), Handbook of Differential Equations (3rd ed.), Boston: Academic Press,
ISBN 0-12-784395-7.
• Gershenfeld, N. (1999), The Nature of Mathematical Modeling (1st ed.), New York: Cambridge University Press,
ISBN 0-521-57095-6.

External links
• Partial Differential Equations: Exact Solutions [2] at EqWorld: The World of Mathematical Equations.
• Partial Differential Equations: Index [3] at EqWorld: The World of Mathematical Equations.
• Partial Differential Equations: Methods [4] at EqWorld: The World of Mathematical Equations.
• Example problems with solutions [5] at exampleproblems.com
• Partial Differential Equations [6] at mathworld.wolfram.com
• Dispersive PDE Wiki [7]
• NEQwiki, the nonlinear equations encyclopedia [8]

References
[1] http://web.archive.org/web/20080411030405/http://www.math.unl.edu/~sdunbar1/Teaching/MathematicalFinance/Lessons/BlackScholes/Solution/solution.shtml
[2] http://eqworld.ipmnet.ru/en/pde-en.htm
[3] http://eqworld.ipmnet.ru/en/solutions/eqindex/eqindex-pde.htm
[4] http://eqworld.ipmnet.ru/en/methods/meth-pde.htm
[5] http://www.exampleproblems.com/wiki/index.php?title=Partial_Differential_Equations
[6] http://mathworld.wolfram.com/PartialDifferentialEquation.html
[7] http://tosio.math.toronto.edu/wiki/index.php/Main_Page
[8] http://www.primat.mephi.ru/wiki/

Pearson's chi-squared test


Pearson's chi-squared test (χ2) is the best-known of several chi-squared tests (Yates's correction, the
likelihood-ratio test, the portmanteau test in time series, and others) – statistical procedures whose results are
evaluated by reference to the chi-squared distribution. Its properties were first investigated by Karl Pearson in
1900.[1] In contexts where it is important to make a distinction between the test statistic and its distribution, names
similar to Pearson χ-squared test or statistic are used.
It tests a null hypothesis stating that the frequency distribution of certain events observed in a sample is consistent
with a particular theoretical distribution. The events considered must be mutually exclusive and have total
probability 1. A common case for this is where the events each cover an outcome of a categorical variable. A simple
example is the hypothesis that an ordinary six-sided die is "fair", i.e., all six outcomes are equally likely to occur.

Definition
Pearson's chi-squared test is used to assess two types of comparison: tests of goodness of fit and tests of
independence.
• A test of goodness of fit establishes whether or not an observed frequency distribution differs from a theoretical
distribution.
• A test of independence assesses whether paired observations on two variables, expressed in a contingency table,
are independent of each other (e.g. polling responses from people of different nationalities to see if one's
nationality affects the response).
The first step is to calculate the chi-squared test statistic, X2, which resembles a normalized sum of squared
deviations between observed and theoretical frequencies (see below). The second step is to determine the degrees of
freedom of that statistic, which is essentially the number of frequencies reduced by the number of parameters of
the fitted distribution. In the third step, X2 is compared to the critical value of no significance from the chi-squared
distribution, which in many cases gives a good approximation of the distribution of X2. A test that does not rely on
this approximation is Fisher's exact test; it is substantially more accurate in obtaining a significance level, especially
with few observations.

Test for fit of a distribution

Discrete uniform distribution


In this case N observations are divided among n cells. A simple application is to test the hypothesis that, in the
general population, values would occur in each cell with equal frequency. The "theoretical frequency" for any cell
(under the null hypothesis of a discrete uniform distribution) is thus calculated as

E_i = \frac{N}{n},

and the reduction in the degrees of freedom is p = 1, notionally because the observed frequencies O_i are
constrained to sum to N.

Other distributions
When testing whether observations are random variables whose distribution belongs to a given family of
distributions, the "theoretical frequencies" are calculated using a distribution from that family fitted in some standard
way. The reduction in the degrees of freedom is calculated as p = s + 1, where s is the number of parameters
used in fitting the distribution. For instance, when checking a three-parameter Weibull distribution, p = 4, and
when checking a normal distribution (where the parameters are mean and standard deviation), p = 3. In other
words, there will be n − p degrees of freedom, where n is the number of categories.
Note that the degrees of freedom are not based on the number of observations, as with a Student's t or
F-distribution. For example, when testing for a fair, six-sided die, there would be five degrees of freedom because
there are six categories (one per face). The number of times the die is rolled has no effect on the number of degrees
of freedom.

Calculating the test-statistic


The value of the test-statistic is

X^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}

where
X^2 = Pearson's cumulative test statistic, which asymptotically approaches a \chi^2 distribution;
O_i = an observed frequency;
E_i = an expected (theoretical) frequency, asserted by the null hypothesis;
n = the number of cells in the table.
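A short sketch of the computation (the die counts below are hypothetical, not data from the article):

```python
# Pearson's cumulative test statistic, sum of (O_i - E_i)^2 / E_i,
# for a hypothetical sample of 60 rolls of a six-sided die.
observed = [5, 8, 9, 8, 10, 20]      # O_i: observed frequency per face
expected = [60 / 6] * 6              # E_i: equal frequencies under the null

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# degrees of freedom: n cells minus 1 (frequencies are constrained to sum to 60)
dof = len(observed) - 1
```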

The chi-squared statistic can then be used to calculate a p-value by comparing the value of the statistic to a
chi-squared distribution. The number of degrees of freedom is equal to the number of cells n, minus the reduction in
degrees of freedom, p.

[Figure: the chi-squared distribution, showing X2 on the x-axis and p-value on the y-axis.]

The result about the number of degrees of freedom is valid when the original data was multinomial and hence the
estimated parameters are efficient for minimizing the chi-squared statistic. More generally, however, when
maximum likelihood estimation does not coincide with minimum chi-squared estimation, the distribution will lie
somewhere between a chi-squared distribution with n − p and n − 1 degrees of freedom (see for instance Chernoff
and Lehmann, 1954).

Bayesian method
In Bayesian statistics, one would instead use a Dirichlet distribution as conjugate prior. If one took a uniform prior,
then the maximum likelihood estimate for the population probability is the observed probability, and one may
compute a credible region around this or another estimate.

Test of independence
In this case, an "observation" consists of the values of two outcomes and the null hypothesis is that the occurrence of
these outcomes is statistically independent. Each observation is allocated to one cell of a two-dimensional array of
cells (called a contingency table) according to the values of the two outcomes. If there are r rows and c columns in
the table, the "theoretical frequency" for a cell, given the hypothesis of independence, is

E_{i,j} = \frac{\left(\sum_{k=1}^{c} O_{i,k}\right)\left(\sum_{k=1}^{r} O_{k,j}\right)}{N},

where N is the total sample size (the sum of all cells in the table). The value of the test-statistic is

X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}.

Fitting the model of "independence" reduces the number of degrees of freedom by p = r + c − 1. The number of
degrees of freedom is equal to the number of cells rc, minus the reduction in degrees of freedom, p, which reduces
to (r − 1)(c − 1).
For the test of independence, also known as the test of homogeneity, a chi-squared probability of less than or equal
to 0.05 (or the chi-squared statistic being at or larger than the 0.05 critical point) is commonly interpreted by applied
workers as justification for rejecting the null hypothesis that the row variable is independent of the column
variable.[2] The alternative hypothesis corresponds to the variables having an association or relationship where the
structure of this relationship is not specified.
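A sketch of the independence test for a hypothetical 3 × 4 polling-style table (rows: nationality, columns: response); the counts are illustrative only:

```python
import numpy as np

# Observed counts O_ij for an r x c contingency table (hypothetical data).
table = np.array([[90, 60, 104, 95],
                  [30, 50,  51, 20],
                  [30, 40,  45, 35]], dtype=float)

N = table.sum()
row_totals = table.sum(axis=1, keepdims=True)
col_totals = table.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / N           # E_ij under independence
x2 = ((table - expected) ** 2 / expected).sum()  # Pearson statistic
dof = (table.shape[0] - 1) * (table.shape[1] - 1)
```

scipy.stats.chi2_contingency performs the same computation and additionally returns the p-value.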

Assumptions
The chi-squared test, when used with the standard approximation that a chi-squared distribution is applicable, has the
following assumptions:
• Simple random sample – The sample data is a random sampling from a fixed distribution or population where
each member of the population has an equal probability of selection. Variants of the test have been developed for
complex samples, such as where the data is weighted.
• Sample size (whole table) – A sample with a sufficiently large size is assumed. If a chi-squared test is conducted
on a sample that is too small, it will yield an inaccurate inference; a researcher using the chi-squared test on small
samples might end up committing a Type II error.
• Expected cell count – Adequate expected cell counts. Some require 5 or more, and others require 10 or more. A
common rule is 5 or more in all cells of a 2-by-2 table, and 5 or more in 80% of cells in larger tables, but no cells
with zero expected count. When this assumption is not met, Yates's Correction is applied.
• Independence – The observations are always assumed to be independent of each other. This means chi-squared
cannot be used to test correlated data (like matched pairs or panel data). In those cases you might want to turn to
McNemar's test.

Examples

Goodness of fit
For example, to test the hypothesis that a random sample of 100 people has been drawn from a population in which
men and women are equal in frequency, the observed number of men and women would be compared to the
theoretical frequencies of 50 men and 50 women. If there were 44 men in the sample and 56 women, then

X^2 = \frac{(44 - 50)^2}{50} + \frac{(56 - 50)^2}{50} = 1.44.
If the null hypothesis is true (i.e., men and women are chosen with equal probability in the sample), the test statistic
will be drawn from a chi-squared distribution with one degree of freedom. Though one might expect two degrees of
freedom (one each for the men and women), we must take into account that the total number of men and women is
constrained (100), and thus there is only one degree of freedom (2 − 1). Alternatively, if the male count is known the
female count is determined, and vice-versa.
Consultation of the chi-squared distribution for 1 degree of freedom shows that the probability of observing this
difference (or a more extreme difference than this) if men and women are equally numerous in the population is
approximately 0.23. This probability is higher than conventional criteria for statistical significance (0.001–0.05), so
normally we would not reject the null hypothesis that the number of men in the population is the same as the number
of women (i.e., we would consider our sample within the range of what we'd expect for a 50/50 male/female ratio.)
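The example above can be reproduced with a few lines of standard-library Python (the identity P(χ²₁ ≥ x) = 2(1 − Φ(√x)) for one degree of freedom avoids needing a chi-squared CDF):

```python
from math import erf, sqrt

# The example above: 44 men and 56 women against expected frequencies 50/50.
observed = [44, 56]
expected = [50, 50]
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))   # 1.44

def std_normal_cdf(z):
    """Standard normal CDF via the error function (standard library only)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# For one degree of freedom, P(chi2_1 >= x2) = 2 * (1 - Phi(sqrt(x2))).
p_value = 2.0 * (1.0 - std_normal_cdf(sqrt(x2)))                 # about 0.23
```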

Problems
The approximation to the chi-squared distribution breaks down if expected frequencies are too low. It will normally
be acceptable so long as no more than 20% of the events have expected frequencies below 5. Where there is only 1
degree of freedom, the approximation is not reliable if expected frequencies are below 10. In this case, a better
approximation can be obtained by reducing the absolute value of each difference between observed and expected
frequencies by 0.5 before squaring; this is called Yates's correction for continuity.
In cases where the expected value, E, is found to be small (indicating either a small underlying population
probability, or a small number of observations), the normal approximation of the multinomial distribution can fail,
and in such cases it is found to be more appropriate to use the G-test, a likelihood ratio-based test statistic. Where the
total sample size is small, it is necessary to use an appropriate exact test, typically either the binomial test or (for
contingency tables) Fisher's exact test; but note that this test assumes fixed and known marginal totals.

Distribution
The null distribution of the Pearson statistic with j rows and k columns is approximated by the chi-squared
distribution with (k − 1)(j − 1) degrees of freedom.[3]
This approximation arises as the true distribution, under the null hypothesis, if the expected value is given by a
multinomial distribution. For large sample sizes, the central limit theorem says this distribution tends toward a
certain multivariate normal distribution.

Two cells
In the special case where there are only two cells in the table, the expected values follow a binomial distribution,

O \sim \mathrm{Bin}(n, p),

where
p = probability, under the null hypothesis,
n = number of observations in the sample.
In the above example the hypothesised probability of a male observation is 0.5, with 100 samples. Thus we expect to
observe 50 males.
If n is sufficiently large, the above binomial distribution may be approximated by a Gaussian (normal) distribution,
and thus the Pearson test statistic approximates a chi-squared distribution,

\mathrm{Bin}(n, p) \approx \mathrm{N}(np, np(1-p)).

Let O1 be the number of observations from the sample that are in the first cell. The Pearson test statistic can be
expressed as

X^2 = \frac{(O_1 - np)^2}{np} + \frac{(n - O_1 - n(1-p))^2}{n(1-p)},

which can in turn be expressed as

X^2 = \frac{(O_1 - np)^2}{np(1-p)}.

By the normal approximation to a binomial this is the square of one standard normal variate, and hence is
distributed as chi-squared with 1 degree of freedom. Note that the denominator is one standard deviation of the
Gaussian approximation, so the statistic can be written

X^2 = \left(\frac{O_1 - np}{\sqrt{np(1-p)}}\right)^2.
Consistent with the meaning of the chi-squared distribution, we are measuring how probable the observed number
of standard deviations away from the mean is under the Gaussian approximation (which is a good approximation for
large n).
The chi-squared distribution is then integrated on the right of the statistic value to obtain the p-value, which equals
the probability of obtaining a statistic at least as large as the one observed, assuming the null hypothesis.
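A quick numerical check of this equivalence, using the men/women counts from the earlier example (n = 100 observations, null probability p = 0.5, first-cell count O1 = 44):

```python
from math import sqrt

# Two-cell case: the Pearson statistic equals the squared standard score
# of the first cell's count under the normal approximation.
n, p, O1 = 100, 0.5, 44

x2 = (O1 - n * p) ** 2 / (n * p) + ((n - O1) - n * (1 - p)) ** 2 / (n * (1 - p))
z = (O1 - n * p) / sqrt(n * p * (1 - p))   # standard deviations from the mean
```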

Two-by-two contingency tables


When the test is applied to a contingency table containing two rows and two columns, the test is equivalent to a
Z-test of proportions.

Many cells
Similar arguments as above lead to the desired result. Each cell (except the final one, whose value is completely
determined by the others) is treated as an independent binomial variable, and their contributions are summed and
each contributes one degree of freedom.

Notes
[1] Karl Pearson (1900). "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is
such that it can be reasonably supposed to have arisen from random sampling". Philosophical Magazine, Series 5 50 (302): 157–175.
doi:10.1080/14786440009463897.
[2] "Critical Values of the Chi-Squared Distribution" (http://www.itl.nist.gov/div898/handbook/eda/section3/eda3674.htm).
NIST/SEMATECH e-Handbook of Statistical Methods. National Institute of Standards and Technology.
[3] Statistics for Applications. MIT OpenCourseWare. Lecture 23 (http://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2003/lecture-notes/lec23.pdf).
Pearson's Theorem. Retrieved 21 March 2007.

References
• Chernoff, H.; Lehmann E.L. (1954). "The use of maximum likelihood estimates in tests for goodness-of-fit".
The Annals of Mathematical Statistics 25 (3): 579–586. doi:10.1214/aoms/1177728726.
• Plackett, R.L. (1983). "Karl Pearson and the Chi-Squared Test". International Statistical Review (International
Statistical Institute (ISI)) 51 (1): 59–72. doi:10.2307/1402731. JSTOR 1402731.
• Greenwood, P.E., Nikulin, M.S.(1996). A guide to chi-squared testing, J.Wiley, New York, ISBN
0-471-55779-X.

Perron–Frobenius theorem
In linear algebra, the Perron–Frobenius theorem, proved by Oskar Perron (1907) and Georg Frobenius (1912),
asserts that a real square matrix with positive entries has a unique largest real eigenvalue and that the corresponding
eigenvector has strictly positive components, and also asserts a similar statement for certain classes of nonnegative
matrices. This theorem has important applications to probability theory (ergodicity of Markov chains); to the theory
of dynamical systems (subshifts of finite type); to economics (Leontief's input-output model)[1]; to demography
(Leslie population age distribution model)[2] to mathematical background of the Internet search engines[3] and even
to ranking of football teams[4]

Statement of the Perron–Frobenius theorem


A matrix in which all entries are positive real numbers is here called positive and a matrix whose entries are
non-negative real numbers is here called non-negative. The eigenvalues of a real square matrix A are complex
numbers and collectively they make up the spectrum of the matrix. The exponential growth rate of the matrix powers
Ak as k → ∞ is controlled by the eigenvalue of A with the largest absolute value. The Perron–Frobenius theorem
describes the properties of the leading eigenvalue and of the corresponding eigenvectors when A is a non-negative
real square matrix. Early results were due to Oskar Perron (1907) and concerned positive matrices. Later, Georg
Frobenius (1912) found their extension to certain classes of non-negative matrices.

Positive matrices
Let A = (aij) be an n × n positive matrix: aij > 0 for 1 ≤ i, j ≤ n. Then the following statements hold.
1. There is a positive real number r, called the Perron root or the Perron–Frobenius eigenvalue, such that r is an
eigenvalue of A and any other eigenvalue λ (possibly, complex) is strictly smaller than r in absolute value, |λ| < r.
Thus, the spectral radius ρ(A) is equal to r.
2. The Perron–Frobenius eigenvalue is simple: r is a simple root of the characteristic polynomial of A.
Consequently, the eigenspace associated to r is one-dimensional. (The same is true for the left eigenspace, i.e., the
eigenspace for AT.)
3. There exists an eigenvector v = (v1,…,vn) of A with eigenvalue r such that all components of v are positive: A v =
r v, vi > 0 for 1 ≤ i ≤ n. (Respectively, there exists a positive left eigenvector w : wT A = r wT, wi > 0.)
4. There are no other positive (moreover non-negative) eigenvectors except positive multiples of v (respectively, left
eigenvectors except w), i.e., all other eigenvectors must have at least one negative or non-real component.
5. \lim_{k \to \infty} A^k / r^k = v w^T, where the left and right eigenvectors for A are normalized so that
w^T v = 1. Moreover, the matrix v w^T is the projection onto the eigenspace corresponding to r. This projection is
called the Perron projection.
6. Collatz–Wielandt formula: for all non-negative non-zero vectors x, let f(x) be the minimum value of [Ax]i / xi
taken over all those i such that xi ≠ 0. Then f is a real valued function whose maximum is the Perron–Frobenius
eigenvalue.
7. A "Min-max" Collatz–Wielandt formula takes a form similar to the one above: for all strictly positive vectors x,
let g(x) be the maximum value of [Ax]i / xi taken over i. Then g is a real valued function whose minimum is the
Perron–Frobenius eigenvalue.
8. The Perron–Frobenius eigenvalue satisfies the inequalities

\min_i \sum_j a_{ij} \le r \le \max_i \sum_j a_{ij}.

These claims can be found in Meyer,[5] chapter 8,[6] claims 8.2.11–15 (page 667) and exercises 8.2.5, 8.2.7 and
8.2.9 (pages 668–669).
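These claims can be checked numerically; the positive matrix below is a hypothetical example, and the checks mirror claims 1, 3 and 8:

```python
import numpy as np

# A hypothetical positive 3x3 matrix (all entries strictly positive).
A = np.array([[2.0, 1.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 4.0]])

eigvals, eigvecs = np.linalg.eig(A)
i = int(np.argmax(eigvals.real))
r = float(eigvals[i].real)        # Perron-Frobenius eigenvalue (claim 1)
v = eigvecs[:, i].real
v = v / v.sum()                   # Perron eigenvector, scaled to sum to 1
row_sums = A.sum(axis=1)          # r lies between min and max row sum (claim 8)
```

The scaled eigenvector has all-positive components (claim 3), and every other eigenvalue is strictly smaller than r in absolute value.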

The left and right eigenvectors v and w are usually normalized so that the sum of their components is equal to 1; in
this case, they are sometimes called stochastic eigenvectors.

Non-negative matrices
An extension of the theorem to matrices with non-negative entries is also available. In order to highlight the
similarities and differences between the two cases, the following points are to be noted: every non-negative matrix
can obviously be obtained as a limit of positive matrices, so one obtains the existence of an eigenvector with
non-negative components; the corresponding eigenvalue will obviously be non-negative and greater than or equal,
in absolute value, to all other eigenvalues.[7] [8] However, the simple examples

\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}

show that for non-negative matrices there may exist eigenvalues of the same absolute value as the maximal one (1
and −1 are the eigenvalues of the first matrix); moreover, the maximal eigenvalue may not be a simple root of the
characteristic polynomial, can be zero, and the corresponding eigenvector (1, 0) is not strictly positive (second
example). So it may seem that most properties are broken for non-negative matrices; however, Frobenius found the
right way.
The key feature of theory in the non-negative case is to find some special subclass of non-negative matrices—
irreducible matrices— for which a non-trivial generalization is possible. Namely, although eigenvalues attaining
maximal absolute value may not be unique, the structure of the maximal eigenvalues is under control: they have the
form e^{2\pi i l/h} r, where h is an integer called the period of the matrix, r is a real strictly positive eigenvalue, and
l = 0, 1, ..., h − 1. The eigenvector corresponding to r has strictly positive components (in contrast with the general
case of non-negative matrices, where components are only non-negative). Also all such eigenvalues are simple roots
of the characteristic polynomial. Further properties are described below.

Classification of matrices
Let A be a square matrix (not necessarily positive or even real). The matrix A is irreducible if any of the following
equivalent properties holds.
Definition 1 : A does not have non-trivial invariant coordinate subspaces. Here a non-trivial coordinate subspace
means a linear subspace spanned by any proper subset of basis vectors. More explicitly, for any linear subspace
spanned by basis vectors ei1 , ..., eik, n > k > 0 its image under the action of A is not contained in the same subspace.
Definition 2: A cannot be conjugated into block upper triangular form by a permutation matrix P:

where E and G are non-trivial (i.e. of size greater than zero) square matrices.
If A is non-negative other definitions exist:
Definition 3: For every pair of indices i and j, there exists a natural number m such that (Am)ij is not equal to 0.
Definition 4: One can associate with a matrix A a certain directed graph GA. It has exactly n vertices, where n is size
of A, and there is an edge from vertex i to vertex j precisely when Aij > 0. Then the matrix A is irreducible if and only
if its associated graph GA is strongly connected.
A matrix is reducible if it is not irreducible.
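Definition 3 lends itself to a direct numerical check. The sketch below (the function name is ours, and NumPy is used for illustration) tests all index pairs using powers up to n, which suffices because any path between two of the n vertices of G_A can be shortened to length at most n:

```python
import numpy as np

def is_irreducible(A):
    """Definition 3: for every pair (i, j), some power m has (A^m)_ij > 0.

    Powers m = 1, ..., n suffice, since a path between two of the n
    vertices of the graph G_A can always be shortened to length <= n.
    """
    n = A.shape[0]
    reach = np.zeros((n, n), dtype=bool)
    P = np.eye(n)
    for _ in range(n):
        P = P @ A
        reach |= P > 0          # record which (i, j) pairs are now reachable
    return bool(reach.all())

# A 3-cycle is irreducible; a block upper-triangular matrix is not.
cycle = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
triang = np.array([[1, 1], [0, 1]], dtype=float)
```

Equivalently, by Definition 4, `is_irreducible` answers whether the directed graph G_A is strongly connected.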
Let A be non-negative. Fix an index i and define the period of index i to be the greatest common divisor of all
natural numbers m such that (A^m)_{ii} > 0. When A is irreducible, the period of every index is the same and is called the
period of A. In fact, when A is irreducible, the period can be defined as the greatest common divisor of the lengths of
the closed directed paths in GA (see Kitchens[9] page 16). The period is also called the index of imprimitivity
Perron–Frobenius theorem 496

(Meyer[5] page 674) or the order of cyclicity.


If the period is 1, A is aperiodic.
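The gcd definition of the period can be probed numerically. The sketch below scans powers up to a finite cutoff (a heuristic sufficient for small examples; the function name and cutoff are ours):

```python
from math import gcd
import numpy as np

def period_of_index(A, i, max_power=60):
    """gcd of all m <= max_power with (A^m)_ii > 0 (a finite-horizon probe)."""
    g = 0
    P = np.eye(A.shape[0])
    for m in range(1, max_power + 1):
        P = P @ A
        if P[i, i] > 0:
            g = gcd(g, m)       # gcd(0, m) == m, so the first hit initializes g
    return g

# A 3-cycle: every closed path through a vertex has length 3k, so the period is 3.
cycle = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
# Adding a self-loop creates a closed path of length 1, making the matrix aperiodic.
loop = cycle.copy()
loop[0, 0] = 1.0
```

For an irreducible matrix, the statement above guarantees that the answer does not depend on which index i is probed.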
A matrix A is primitive if it is non-negative and its mth power is positive for some natural number m (i.e. the same m
works for all pairs of indices). It can be proved that primitive matrices are the same as irreducible aperiodic
non-negative matrices.
A positive square matrix is primitive and a primitive matrix is irreducible. All statements of the Perron–Frobenius
theorem for positive matrices remain true for primitive matrices. However, a general non-negative irreducible matrix
A may possess several eigenvalues whose absolute value is equal to the spectral radius of A, so the statements need
to be correspondingly modified. Actually the number of such eigenvalues is exactly equal to the period. Results for
non-negative matrices were first obtained by Frobenius in 1912.

Perron–Frobenius theorem for irreducible matrices


Let A be an irreducible non-negative n × n matrix with period h and spectral radius ρ(A) = r. Then the following
statements hold.
1. The number r is a positive real number and it is an eigenvalue of the matrix A, called the Perron–Frobenius
eigenvalue.
2. The Perron–Frobenius eigenvalue r is simple. Both right and left eigenspaces associated with r are
one-dimensional.
3. A has a left eigenvector v with eigenvalue r whose components are all positive.
4. Likewise, A has a right eigenvector w with eigenvalue r whose components are all positive.
5. The only eigenvectors whose components are all positive are those associated with the eigenvalue r.
6. Matrix A has exactly h (where h is the period) complex eigenvalues with absolute value r. Each of them is a
simple root of the characteristic polynomial and is the product of r with an hth root of unity.
7. Let ω = 2π/h. Then the matrix A is similar to eiωA, consequently the spectrum of A is invariant under
multiplication by eiω (corresponding to the rotation of the complex plane by the angle ω).
8. If h > 1 then there exists a permutation matrix P such that

where the blocks along the main diagonal are zero square matrices.
9. Collatz–Wielandt formula: for all non-negative non-zero vectors x, let f(x) be the minimum value of [Ax]_i /
x_i taken over all those i such that x_i ≠ 0. Then f is a real-valued function whose maximum is the
Perron–Frobenius eigenvalue.
10. The Perron–Frobenius eigenvalue satisfies the inequalities min_i Σ_j a_{ij} ≤ r ≤ max_i Σ_j a_{ij}.
The matrix shows that the blocks on the diagonal may be of different sizes, the matrices Aj need not

be square, and h need not divide n.
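Statements 1 and 6 can be observed directly on a small example. The matrix below (our choice) is irreducible with period h = 2, so exactly two eigenvalues of modulus r should appear, namely r and r·e^{iπ} = −r:

```python
import numpy as np

A = np.array([[0.0, 2.0],
              [3.0, 0.0]])          # irreducible, period h = 2
eigvals = np.linalg.eigvals(A)
r = max(abs(eigvals))               # Perron-Frobenius eigenvalue: sqrt(6)

# Statement 6: exactly h = 2 eigenvalues of modulus r (here sqrt(6) and -sqrt(6))
on_circle = [z for z in eigvals if abs(abs(z) - r) < 1e-9]
```

The spectrum {√6, −√6} is invariant under multiplication by e^{iπ}, illustrating statement 7 for h = 2.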



Further properties
Let A be an irreducible non-negative matrix, then:
1. (I + A)^{n−1} is a positive matrix. (Meyer[5] claim 8.3.5 p. 672 [6]).
2. Wielandt's theorem. If |B| < A, then ρ(B) ≤ ρ(A). If equality holds (i.e. if μ = ρ(A)e^{iφ} is an eigenvalue of B), then
B = e^{iφ}DAD^{−1} for some diagonal unitary matrix D (i.e. the diagonal elements of D equal e^{iΘ_l} and the off-diagonal elements are zero).[10]
3. If some power Aq is reducible, then it is completely reducible, i.e. for some permutation matrix P, it is true that:

, where Ai are irreducible matrices having the same maximal

eigenvalue. The number of these matrices d is the greatest common divisor of q and h, where h is the period of A.[11]
4. If c(x) = x^n + c_{k_1}x^{n−k_1} + c_{k_2}x^{n−k_2} + ... + c_{k_s}x^{n−k_s} is the characteristic polynomial of A in which only the nonzero
coefficients are listed, then the period of A equals the greatest common divisor of k_1, k_2, ..., k_s.[12]

5. Cesàro averages: lim_{k→∞} (1/k) Σ_{i=0,...,k} A^i/r^i = v w^t, where the left and right eigenvectors for A are normalized
so that w^t v = 1. Moreover the matrix v w^t is the spectral projection corresponding to r, the Perron projection.[13]
6. Let r be the Perron-Frobenius eigenvalue; then the adjugate (classical adjoint) matrix for (r − A) is positive.[14]
7. If A has at least one non-zero diagonal element, then A is primitive.[15]
Also:
• If 0 ≤ A < B, then r_A ≤ r_B; moreover, if A is irreducible, then the inequality is strict: r_A < r_B.
One of the definitions of a primitive matrix requires A to be non-negative with A^m positive for some m.
One may wonder how large m can be, depending on the size of A. The following answers this question.
• Assume A is a non-negative primitive matrix of size n; then A^{n²−2n+2} is positive. Moreover there exists a matrix M,
given below, such that M^k remains not positive (merely non-negative) for all k < n² − 2n + 2; in particular
(M^{n²−2n+1})_{11} = 0.[16]
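The extremal matrix was not reproduced above; the standard example attaining this bound is the Wielandt matrix (a cycle plus one extra edge). Whether it coincides with the M meant in the text is our assumption, but a short check (function name ours) confirms the bound n² − 2n + 2 = 5 for n = 3:

```python
import numpy as np

# The 3x3 Wielandt matrix: a 3-cycle (1 -> 2 -> 3 -> 1) plus the extra edge 3 -> 2.
W = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 1, 0]], dtype=float)

def primitivity_exponent(A, cap=20):
    """Smallest m with A^m entrywise positive, searching up to a cap."""
    P = np.eye(A.shape[0])
    for m in range(1, cap + 1):
        P = P @ A
        if (P > 0).all():
            return m
    return None
```

For n = 3 the exponent is exactly 5 = 3² − 2·3 + 2, and (W⁴)_{11} = 0, matching the pattern stated above.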

Applications
Numerous books have been written on the subject of non-negative matrices, and Perron–Frobenius theory is
invariably a central feature. The examples below only scratch the surface of its vast application domain.

Non-negative matrices
The Perron–Frobenius theorem does not apply directly to general non-negative matrices. Nevertheless, any reducible square
matrix A may be written in upper-triangular block form (known as the normal form of a reducible matrix)[17]

PAP−1 =

where P is a permutation matrix and each Bi is a square matrix that is either irreducible or zero. Now if A is
non-negative then so are all the Bi and the spectrum of A is just the union of their spectra. Therefore many of the
spectral properties of A may be deduced by applying the theorem to the irreducible Bi.
For example, the Perron root is the maximum of the ρ(B_i). While there will still be eigenvectors with non-negative
components, it is quite possible that none of these will be positive.
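The reduction to diagonal blocks can be seen on a deliberately tiny example (the matrix is our choice, already in the normal form with 1×1 blocks B1 = [2] and B2 = [3]):

```python
import numpy as np

# Reducible, already in upper-triangular block form.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
spectrum = sorted(np.linalg.eigvals(A).real)

# spectrum(A) = spectrum(B1) U spectrum(B2) = {2, 3};
# Perron root = max(rho(B1), rho(B2)) = 3.
perron_root = max(abs(np.linalg.eigvals(A)))
```

The off-diagonal block contributes nothing to the spectrum; only the diagonal blocks matter.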

Stochastic matrices
A row (column) stochastic matrix is a square matrix each of whose rows (columns) consists of non-negative real
numbers whose sum is unity. The theorem cannot be applied directly to such matrices because they need not be
irreducible.
If A is row-stochastic then the column vector with each entry 1 is an eigenvector corresponding to the eigenvalue 1,
which is also ρ(A) by the remark above. It might not be the only eigenvalue on the unit circle, and the associated
eigenspace can be multi-dimensional. If A is row-stochastic and irreducible then the Perron projection is also
row-stochastic and all its rows are equal.
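For an irreducible aperiodic stochastic matrix, the Perron projection can be approximated by a high matrix power, and its equal rows give the stationary distribution of the chain. A minimal sketch (the transition matrix is our choice):

```python
import numpy as np

A = np.array([[0.9, 0.1],
              [0.5, 0.5]])           # irreducible, aperiodic, row-stochastic
P = np.linalg.matrix_power(A, 50)    # approximates the Perron projection (r = 1)
pi = P[0]                            # every row ~ the stationary distribution pi
```

Here pi solves pi·A = pi, giving pi = (5/6, 1/6); convergence is fast because the second eigenvalue of A is 0.4.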

Algebraic graph theory


The theorem has particular use in algebraic graph theory. The "underlying graph" of a nonnegative n-square matrix is
the graph with vertices numbered 1, ..., n and arc ij if and only if Aij ≠ 0. If the underlying graph of such a matrix is
strongly connected, then the matrix is irreducible, and thus the theorem applies. In particular, the adjacency matrix of
a strongly connected graph is irreducible.[18][19]

Finite Markov chains


The theorem has a natural interpretation in the theory of finite Markov chains (where it is the matrix-theoretic
equivalent of the convergence of an irreducible finite Markov chain to its stationary distribution, formulated in terms
of the transition matrix of the chain; see, for example, the article on the subshift of finite type).

Compact operators
More generally, it can be extended to the case of non-negative compact operators, which, in many ways, resemble
finite-dimensional matrices. These are commonly studied in physics, under the name of transfer operators, or
sometimes Ruelle–Perron–Frobenius operators (after David Ruelle). In this case, the leading eigenvalue
corresponds to the thermodynamic equilibrium of a dynamical system, and the lesser eigenvalues to the decay modes
of a system that is not in equilibrium. Thus, the theory offers a way of discovering the arrow of time in what would
otherwise appear to be reversible, deterministic dynamical processes, when examined from the point of view of
point-set topology.[20]

Proof methods
A common thread in many proofs is the Brouwer fixed point theorem. Another popular method is that of Wielandt
(1950). He used the Collatz–Wielandt formula described above to extend and clarify Frobenius's work.[21] Another
proof is based on the spectral theory[22] from which part of the arguments are borrowed.

Perron root is strictly maximal eigenvalue for positive (and primitive) matrices
Case: If A is a positive (or more generally primitive) matrix, then there exists a real positive eigenvalue r
(Perron-Frobenius eigenvalue or Perron root), which is strictly greater in absolute value than all other eigenvalues,
hence r is the spectral radius of A.

That claim is wrong for general non-negative irreducible matrices, which have h eigenvalues with the same absolute
eigenvalue as r, where h is the period of A.

Proof for positive matrices


Let A be a positive matrix, assume that its spectral radius ρ(A) = 1 (otherwise consider A/ρ(A)). Hence, there exists
an eigenvalue λ on the unit circle, and all the other eigenvalues are less than or equal to 1 in absolute value. Assume that λ ≠
1. Then there exists a positive integer m such that A^m is a positive matrix and the real part of λ^m is negative. Let ε be
half the smallest diagonal entry of A^m and set T = A^m − εI, which is yet another positive matrix. Moreover, if Ax = λx
then A^m x = λ^m x, thus λ^m − ε is an eigenvalue of T. Because of the choice of m this point lies outside the unit disk,
consequently ρ(T) > 1. On the other hand, all the entries in T are positive and less than or equal to those in A^m, so by
Gelfand's formula ρ(T) ≤ ρ(A^m) ≤ ρ(A)^m = 1. This contradiction means that λ = 1 and there can be no other
eigenvalues on the unit circle.
Exactly the same argument applies in the case of primitive matrices, once one notes the following simple lemma,
which clarifies the properties of primitive matrices.

Lemma
Given a non-negative A, assume there exists m such that A^m is positive; then A^{m+1}, A^{m+2}, A^{m+3}, ... are all positive.
Indeed, A^{m+1} = A·A^m, so A^{m+1} can have a zero element only if some row of A is entirely zero, but in that case the
same row of A^m would be zero as well, contradicting the positivity of A^m.
Applying the same argument as above to primitive matrices proves the main claim.

Power method and the positive eigenpair


Case: For a positive (or more generally irreducible non-negative) matrix A the dominant eigenvector is real and
strictly positive (for a merely non-negative A it is, respectively, non-negative).
This can be argued via the power method, which states that for a sufficiently generic (in the sense below) matrix A the
sequence of vectors b_{k+1} = Ab_k / |Ab_k| converges to the eigenvector with the maximum eigenvalue. (The initial vector
b_0 can be chosen arbitrarily except for some measure-zero set.) Starting with a non-negative vector b_0 produces a
sequence of non-negative vectors b_k. Hence the limiting vector is also non-negative. By the power method this
limiting vector is the dominant eigenvector for A, proving the assertion. The corresponding eigenvalue is
non-negative.
The proof requires two additional arguments. First, the power method converges for matrices which do not have
several eigenvalues of the same absolute value as the maximal one. The previous section's argument guarantees this.
Second, one must ensure strict positivity of all of the components of the eigenvector in the case of irreducible matrices.
This follows from the following fact, which is of independent interest:
Lemma: given a positive (or more generally irreducible non-negative) matrix A and v as any non-negative
eigenvector for A, then it is necessarily strictly positive and the corresponding eigenvalue is also strictly
positive.
Proof. One of the definitions of irreducibility for non-negative matrices is that for all indices i, j there exists m such
that (A^m)_{ij} is strictly positive. Let v be a non-negative eigenvector with at least one strictly positive component, say
the i-th. The corresponding eigenvalue is strictly positive: indeed, given n such that (A^n)_{ii} > 0, one has r^n v_i =
(A^n v)_i ≥ (A^n)_{ii} v_i > 0, hence r is strictly positive. The eigenvector is strictly positive: for any index j, take m
such that (A^m)_{ji} > 0; then r^m v_j = (A^m v)_j ≥ (A^m)_{ji} v_i > 0, hence v_j > 0, i.e., the eigenvector is strictly positive.
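The power-method iteration described above can be sketched in a few lines (the matrix is our choice, taken symmetric only to make the expected eigenvalue easy to compute by hand):

```python
import numpy as np

def power_method(A, steps=200):
    """b_{k+1} = A b_k / |A b_k|, started from a strictly positive b_0."""
    b = np.ones(A.shape[0])
    for _ in range(steps):
        b = A @ b
        b = b / np.linalg.norm(b)
    r = b @ A @ b / (b @ b)      # Rayleigh-quotient estimate of the eigenvalue
    return r, b

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])       # positive matrix; eigenvalues (5 +- sqrt(5))/2
r, v = power_method(A)           # r -> (5 + sqrt(5))/2, v strictly positive
```

Starting from the non-negative vector (1, 1), every iterate stays non-negative, and the limit is strictly positive, as the lemma above predicts.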

Multiplicity one
The proof that the Perron-Frobenius eigenvalue is a simple root of the characteristic polynomial is also elementary.
The arguments here are close to those in Meyer.[5]
Case: The eigenspace associated to Perron-Frobenius eigenvalue r is one-dimensional.
Let v be a strictly positive eigenvector corresponding to r and let w be another eigenvector with the same eigenvalue.
(The vector w can be chosen to be real, because A and r are both real, so the null space of A − r has a basis consisting of
real vectors.) Assume at least one of the components of w is positive (otherwise multiply w by −1). Take the maximal
possible α such that u = v − αw is non-negative; then one of the components of u is zero (otherwise α is not maximal).
The vector u is an eigenvector. If u were non-zero, it would be non-negative and hence, by the lemma of the previous
section, strictly positive; but at least one component of u is zero. The contradiction implies u = 0, so w is a multiple
of v.
Case: There are no Jordan cells corresponding to the Perron-Frobenius eigenvalue r and all other eigenvalues which
have the same absolute value.
If there is a Jordan cell, then the infinity norm ||(A/r)k||∞ tends to infinity for k → ∞ , but that contradicts the existence
of the positive eigenvector.
Assume r = 1 (otherwise pass to A/r). Let v be a Perron-Frobenius strictly positive eigenvector, so Av = v; then
A^k v = v for every k, and for each row i: Σ_j (A^k)_{ij} · min_l v_l ≤ (A^k v)_i = v_i ≤ max_l v_l, so ||A^k||_∞ ≤ max_l v_l / min_l v_l.
So ||Ak||∞ is bounded for all k. This gives another proof that there are no eigenvalues which have greater absolute
value than Perron-Frobenius one. It also contradicts the existence of the Jordan cell for any eigenvalue which has
absolute value equal to 1 (in particular for the Perron-Frobenius one), because existence of the Jordan cell implies
that ||A^k||_∞ is unbounded. For a two-by-two Jordan cell J = ((λ, 1), (0, λ)) one has J^k = ((λ^k, kλ^{k−1}), (0, λ^k)),
hence ||J^k||_∞ = k + |λ| (for |λ| = 1), so it tends to infinity as k does. Since J^k = C^{−1}A^kC, one has ||A^k|| ≥ ||J^k|| /
(||C^{−1}|| ||C||), so it also tends to infinity. The resulting contradiction implies that there are no Jordan cells for the
corresponding eigenvalues.
Combining the two claims above reveals that the Perron-Frobenius eigenvalue r is a simple root of the characteristic
polynomial. In the case of non-primitive matrices, there exist other eigenvalues which have the same absolute value
as r. The same claim is true for them, but requires more work.

No other non-negative eigenvectors


Case: Given positive (or more generally irreducible non-negative matrix) A, the Perron-Frobenius eigenvector is the
only (up to multiplication by constant) non-negative eigenvector for A.
Any other eigenvector must contain a negative or complex component. Eigenvectors for different eigenvalues are
orthogonal in an appropriate sense, and two positive eigenvectors cannot be orthogonal, so a second non-negative
eigenvector would have to correspond to the same eigenvalue; but the eigenspace for the Perron-Frobenius eigenvalue
is one-dimensional.
Assume there exists an eigenpair (λ, y) for A such that the vector y is positive, and let (r, x) be the left
Perron-Frobenius eigenpair for A (i.e. x is an eigenvector for A^t). Then r x^t y = (x^t A) y = x^t (A y) = λ x^t y; also
x^t y > 0, so r = λ. Since the eigenspace for the Perron-Frobenius eigenvalue r is one-dimensional, the non-negative
eigenvector y is a multiple of the Perron-Frobenius one.[23]

Collatz–Wielandt formula
Case: For a positive (or more generally irreducible non-negative) matrix A, and for all non-negative non-zero vectors x,
let f(x) be the minimum value of [Ax]_i / x_i taken over all those i such that x_i ≠ 0; then f is a real-valued function
whose maximum is the Perron–Frobenius eigenvalue r.
The value r is attained for x taken to be the Perron-Frobenius eigenvector v. It remains to prove that the value of f on
any other non-negative vector is less than or equal to r. Given a vector x, let ξ = f(x), so 0 ≤ ξx ≤ Ax, and let w be a
strictly positive left eigenvector for A, so w^t A = r w^t. Then w^t(ξx) ≤ w^t(Ax) = (w^t A)x = r w^t x, and since w^t x > 0, ξ ≤ r.[24]
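The Collatz–Wielandt function is easy to evaluate directly. In this sketch (matrix and names ours), f stays below the Perron root on arbitrary non-negative vectors and attains it at the Perron eigenvector:

```python
import numpy as np

def f(A, x):
    """Collatz-Wielandt function: min over {i : x_i != 0} of (Ax)_i / x_i."""
    Ax = A @ x
    mask = x != 0
    return (Ax[mask] / x[mask]).min()

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigvals, vecs = np.linalg.eig(A)
r = max(abs(eigvals))                          # Perron root, (5 + sqrt(5))/2
v = np.abs(vecs[:, np.argmax(abs(eigvals))])   # Perron eigenvector, made positive
# f(A, x) <= r for every non-negative non-zero x, with equality at x = v
```

Taking x = v makes every ratio (Av)_i / v_i equal to r, so the maximum of f is exactly the Perron root.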

Perron projection as a limit: Ak/rk


Case: Given positive (or more generally primitive) matrix A, and r its Perron-Frobenius eigenvalue, then:
1. There exists a limit A^k/r^k for k → ∞; denote it by P.
2. P is a projection operator: P² = P, which commutes with A: AP = PA.
3. The image of P is one-dimensional and spanned by the Perron-Frobenius eigenvector v (respectively for P^t, by
the Perron-Frobenius eigenvector w for A^t).
4. P = v w^t, where v, w are normalized such that w^t v = 1.
5. Hence P is a positive operator.
Hence P is a spectral projection for the Perron-Frobenius eigenvalue r, and is called the Perron projection. The above
assertion is not true for general non-negative irreducible matrices.
Actually the claims above (except claim 5) are valid for any matrix M such that there exists an eigenvalue r which is
strictly greater than the other eigenvalues in absolute value and is the simple root of the characteristic polynomial.
(These requirements hold for primitive matrices as above).
Assume first that M is diagonalizable, so M is conjugate to a diagonal matrix with eigenvalues r_1, ..., r_n on the diagonal
(denote r_1 = r). The matrix M^k/r^k will be conjugate to diag(1, (r_2/r)^k, ..., (r_n/r)^k), which tends to diag(1, 0, ..., 0) for
k → ∞, so the limit exists. The same method works for general M (without assuming that M is diagonalizable).
The projection and commutativity properties are elementary corollaries of the definition: M·M^k/r^k = M^k/r^k·M; P² =
lim M^{2k}/r^{2k} = P. The third fact is also elementary: M(Pu) = M lim M^k/r^k u = lim r·M^{k+1}/r^{k+1} u, so taking the limit
yields M(Pu) = r(Pu); thus the image of P lies in the r-eigenspace for M, which is one-dimensional by the assumptions.
Denote by v the r-eigenvector for M (and by w the one for M^t). The columns of P are multiples of v, because the image
of P is spanned by it; respectively, the rows of P are multiples of w^t. So P takes the form a·vw^t for some a, and hence
its trace equals a·w^tv. On the other hand, the trace of a projector equals the dimension of its image, and from the
definition one sees that P acts as the identity on the r-eigenvector for M, so P has rank one and trace 1. Choosing the
normalization w^t v = 1 gives P = vw^t.
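For a primitive matrix the limit can be computed by simply raising A/r to a high power. A minimal numerical check of claims 1–5 (matrix ours, symmetric so that v = w):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])               # positive, hence primitive
r = max(abs(np.linalg.eigvals(A)))
P = np.linalg.matrix_power(A / r, 80)    # ~ lim_k (A/r)^k, the Perron projection
# Expect: P^2 = P, AP = PA = rP, P entrywise positive, rank one (trace 1)
```

The second eigenvalue ratio here is about 0.38, so 80 iterations drive the non-Perron part far below floating-point noise.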

Inequalities for Perron–Frobenius eigenvalue


For any non-negative matrix A its Perron–Frobenius eigenvalue r satisfies the inequality r ≤ max_i Σ_j a_{ij}.
This is not specific to non-negative matrices: for any matrix A and any of its eigenvalues λ it is true that
|λ| ≤ max_i Σ_j |a_{ij}|. This is an immediate corollary of the Gershgorin circle theorem. However another proof is
more direct:
Any matrix induced norm satisfies the inequality ||A|| ≥ |λ| for any eigenvalue λ, because ||A|| ≥ ||Ax||/||x|| = ||λx||/||x|| =
|λ|. The infinity norm of a matrix is the maximum of row sums: ||A||_∞ = max_i Σ_j |a_{ij}|. Hence the desired
inequality is exactly ||A||_∞ ≥ |λ| applied to the non-negative matrix A.
Another inequality is min_i Σ_j a_{ij} ≤ r.

This fact is specific to non-negative matrices; for general matrices there is nothing similar. If A is positive (not just
non-negative), then there exists a positive eigenvector w such that Aw = rw and the smallest component of w
(say w_i) is 1. Then r = (Aw)_i ≥ the sum of the numbers in row i of A. Thus the minimum row sum gives a lower
bound for r, and this observation can be extended to all non-negative matrices by continuity.
Another way to argue it is via the Collatz-Wielandt formula. One takes the vector x = (1, 1, ..., 1) and immediately
obtains the inequality.
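Both row-sum bounds can be checked on a small positive matrix (our choice):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
r = max(abs(np.linalg.eigvals(A)))   # Perron root, (5 + sqrt(33))/2 ~ 5.37
row_sums = A.sum(axis=1)             # row sums 3 and 7
# min row sum <= r <= max row sum
```

Note that f((1, 1, ..., 1)) from the Collatz–Wielandt formula is exactly the minimum row sum, which is how the lower bound above was obtained.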

Further proofs

Perron projection
The proof now proceeds using spectral decomposition. The trick here is to split the Perron root from the other
eigenvalues. The spectral projection associated with the Perron root is called the Perron projection and it enjoys the
following property:
Case: The Perron projection of an irreducible non-negative square matrix is a positive matrix.
Perron's findings and also (1)–(5) of the theorem are corollaries of this result. The key point is that a positive
projection always has rank one. This means that if A is an irreducible non-negative square matrix then the algebraic
and geometric multiplicities of its Perron root are both one. Also if P is its Perron projection then AP = PA = ρ(A)P
so every column of P is a positive right eigenvector of A and every row is a positive left eigenvector. Moreover if Ax
= λx then PAx = λPx = ρ(A)Px which means Px = 0 if λ ≠ ρ(A). Thus the only positive eigenvectors are those
associated with ρ(A). If A is a primitive matrix with ρ(A) = 1 then it can be decomposed as P ⊕ (1 − P)A so that An =
P + (1 − P)An. As n increases the second of these terms decays to zero leaving P as the limit of An as n → ∞.
The power method is a convenient way to compute the Perron projection of a primitive matrix. If v and w are the
positive row and column vectors that it generates, then the Perron projection is just wv/(vw). The spectral projections
aren't neatly blocked as in the Jordan form: here they are overlaid and each generally has complex entries extending
to all four corners of the square matrix. Nevertheless they retain their mutual orthogonality, which is what facilitates
the decomposition.

Peripheral projection
The analysis when A is irreducible and non-negative is broadly similar. The Perron projection is still positive but
there may now be other eigenvalues of modulus ρ(A) that negate use of the power method and prevent the powers of
(1 − P)A from decaying as in the primitive case whenever ρ(A) = 1. So enter the peripheral projection: the
spectral projection of A corresponding to all the eigenvalues that have modulus ρ(A).
Case: The peripheral projection of an irreducible non-negative square matrix is a non-negative matrix with a positive diagonal.

Cyclicity
Suppose in addition that ρ(A) = 1 and A has h eigenvalues on the unit circle. If P is the peripheral projection then the
matrix R = AP = PA is non-negative and irreducible, Rh = P, and the cyclic group P, R, R2, ...., Rh−1 represents the
harmonics of A. The spectral projection of A at the eigenvalue λ on the unit circle is given by the formula
h^{−1} Σ_{k=1}^{h} λ^{−k} R^k. All of these projections (including the Perron projection) have the same positive diagonal, moreover
choosing any one of them and then taking the modulus of every entry invariably yields the Perron projection. Some
donkey work is still needed in order to establish the cyclic properties (6)–(8) but it's essentially just a matter of
turning the handle. The spectral decomposition of A is given by A = R ⊕ (1 − P)A so the difference between An and
Rn is An − Rn = (1 − P)An representing the transients of An which eventually decay to zero. P may be computed as the
limit of Anh as n → ∞.

Caveats

The matrices L = ,P= ,T= ,M= provide simple examples of what

can go wrong if the necessary conditions are not met. It is easily seen that the Perron and peripheral projections of L
are both equal to P, thus when the original matrix is reducible the projections may lose non-negativity and there is no
chance of expressing them as limits of its powers. The matrix T is an example of a primitive matrix with zero
diagonal. If the diagonal of an irreducible non-negative square matrix is non-zero then the matrix must be primitive
but this example demonstrates that the converse is false. M is an example of a matrix with several missing spectral
"teeth": if ω = e^{iπ/3} then ω^6 = 1, and the eigenvalues of M are {1, ω^2, ω^3, ω^4}, so ω and ω^5 are both absent.

Terminology
A problem that causes confusion is a lack of standardisation in the definitions. For example, some authors use the
terms strictly positive and positive to mean > 0 and ≥ 0 respectively. In this article positive means > 0 and
non-negative means ≥ 0. Another vexed area concerns decomposability and reducibility: irreducible is an overloaded
term. For avoidance of doubt a non-zero non-negative square matrix A such that 1 + A is primitive is sometimes said
to be connected. Then irreducible non-negative square matrices and connected matrices are synonymous.[25]
The nonnegative eigenvector is often normalized so that the sum of its components is equal to unity; in this case, the
eigenvector is the vector of a probability distribution and is sometimes called a stochastic eigenvector.
Perron–Frobenius eigenvalue and dominant eigenvalue are alternative names for the Perron root. Spectral
projections are also known as spectral projectors and spectral idempotents. The period is sometimes referred to as
the index of imprimitivity or the order of cyclicity.

Notes
[1] Meyer 2000, 8.3.6 p. 681 (http://www.matrixanalysis.com/Chapter8.pdf)
[2] Meyer 2000, 8.3.7 p. 683 (http://www.matrixanalysis.com/Chapter8.pdf)
[3] Langville & Meyer 2006, 15.2 p. 167 (http://books.google.com/books?id=hxvB14-I0twC&lpg=PP1&dq=isbn:0691122024&pg=PA167#v=onepage&q&f=false)
[4] Keener 1993, p. 80 (http://links.jstor.org/sici?sici=0036-1445(199303)35:1<80:TPTATR>2.0.CO;2-O)
[5] Meyer 2000, chapter 8 p. 665 (http://www.matrixanalysis.com/Chapter8.pdf)
[6] http://www.matrixanalysis.com/Chapter8.pdf
[7] Meyer 2000, chapter 8.3 p. 670 (http://www.matrixanalysis.com/Chapter8.pdf)
[8] Gantmacher 2000, chapter XIII.3 theorem 3 p. 66 (http://books.google.com/books?id=cyX32q8ZP5cC&lpg=PA178&vq=preceding section&pg=PA66#v=onepage&q&f=false)
[9] Kitchens, Bruce (1998), Symbolic dynamics: one-sided, two-sided and countable state Markov shifts (http://books.google.ru/books?id=mCcdC_5crpoC&lpg=PA195&ots=RbFr1TkSiY&dq=kitchens perron frobenius&pg=PA16#v=onepage&q&f=false), Springer
[10] Meyer 2000, claim 8.3.11 p. 675 (http://www.matrixanalysis.com/Chapter8.pdf)
[11] Gantmacher 2000, section XIII.5 theorem 9
[12] Meyer 2000, p. 679 (http://www.matrixanalysis.com/Chapter8.pdf)
[13] Meyer 2000, example 8.3.2 p. 677 (http://www.matrixanalysis.com/Chapter8.pdf)
[14] Gantmacher 2000, section XIII.2.2 p. 62 (http://books.google.com/books?id=cyX32q8ZP5cC&lpg=PA178&vq=preceding section&pg=PA62#v=onepage&q&f=true)
[15] Meyer 2000, example 8.3.3 p. 678 (http://www.matrixanalysis.com/Chapter8.pdf)
[16] Meyer 2000, chapter 8 example 8.3.4 p. 679 and exercise 8.3.9 p. 685 (http://www.matrixanalysis.com/Chapter8.pdf)
[17] Varga 2002, 2.43 (p. 51)
[18] Brualdi, Richard A.; Ryser, Herbert John (1992). Combinatorial Matrix Theory. Cambridge: Cambridge UP. ISBN 0-521-32265-0.
[19] Brualdi, Richard A.; Cvetkovic, Dragos (2009). A Combinatorial Approach to Matrix Theory and Its Applications. Boca Raton, FL: CRC Press. ISBN 978-1-4200-8223-4.
[20] Mackey, Michael C. (1992). Time's Arrow: The origins of thermodynamic behaviour. New York: Springer-Verlag. ISBN 0-387-97702-3.
[21] Gantmacher 2000, section XIII.2.2 p. 54 (http://books.google.ru/books?id=cyX32q8ZP5cC&lpg=PR5&dq=Applications of the theory of matrices&pg=PA54#v=onepage&q&f=false)
[22] Smith, Roger (2006), "A Spectral Theoretic Proof of Perron–Frobenius" (ftp://emis.maths.adelaide.edu.au/pub/EMIS/journals/MPRIA/2002/pa102i1/pdf/102a102.pdf), Mathematical Proceedings of the Royal Irish Academy (The Royal Irish Academy) 102 (1): 29–35, doi:10.3318/PRIA.2002.102.1.29
[23] Meyer 2000, chapter 8 claim 8.2.10 p. 666 (http://www.matrixanalysis.com/Chapter8.pdf)
[24] Meyer 2000, chapter 8 p. 666 (http://www.matrixanalysis.com/Chapter8.pdf)
[25] For surveys of results on irreducibility, see Olga Taussky-Todd and Richard A. Brualdi.

References

Original papers
• Perron, Oskar (1907), "Zur Theorie der Matrices", Mathematische Annalen 64 (2): 248–263,
doi:10.1007/BF01449896
• Frobenius, Georg (1912), "Ueber Matrizen aus nicht negativen Elementen", Sitzungsber. Königl. Preuss. Akad.
Wiss.: 456–477
• Frobenius, Georg (1908), "Über Matrizen aus positiven Elementen, 1", Sitzungsber. Königl. Preuss. Akad. Wiss.:
471–476
• Frobenius, Georg (1909), "Über Matrizen aus positiven Elementen, 2", Sitzungsber. Königl. Preuss. Akad. Wiss.:
514–518
• Gantmacher, Felix (2000) [1959], The Theory of Matrices, Volume 2 (http://books.google.com/
books?id=cyX32q8ZP5cC&lpg=PA178&vq=preceding section&pg=PA53#v=onepage&q&f=true), AMS
Chelsea Publishing, ISBN 0-8218-2664-6 (1959 edition had different title: "Applications of the theory of
matrices". Also the numeration of chapters is different in the two editions.)
• Langville, Amy; Meyer, Carl (2006), Google page rank and beyond (http://pagerankandbeyond.com), Princeton
University Press, ISBN 0-691-12202-4
• Keener, James (1993), "The Perron–Frobenius theorem and the ranking of football teams" (http://links.jstor.
org/sici?sici=0036-1445(199303)35:1<80:TPTATR>2.0.CO;2-O), SIAM Review (SIAM) 35 (1): 80–93
• Meyer, Carl (2000), Matrix analysis and applied linear algebra (http://www.matrixanalysis.com/Chapter8.
pdf), SIAM, ISBN 0-89871-454-0
• Romanovsky, V. (1933), "Sur les zéros des matrices stocastiques" (http://www.numdam.org/
item?id=BSMF_1933__61__213_0), Bulletin de la Société Mathématique de France 61: 213–219
• Collatz, Lothar (1942), "Einschließungssatz für die charakteristischen Zahlen von Matrize", Mathematische
Zeitschrift 48 (1): 221–226, doi:10.1007/BF01180013
• Wielandt, Helmut (1950), "Unzerlegbare, nicht negative Matrizen", Mathematische Zeitschrift 52 (1): 642–648,
doi:10.1007/BF02230720

Further reading
• Abraham Berman, Robert J. Plemmons, Nonnegative Matrices in the Mathematical Sciences, 1994, SIAM. ISBN
0-89871-321-8.
• Chris Godsil and Gordon Royle, Algebraic Graph Theory, Springer, 2001.
• A. Graham, Nonnegative Matrices and Applicable Topics in Linear Algebra, John Wiley&Sons, New York, 1987.
• R. A. Horn and C.R. Johnson, Matrix Analysis, Cambridge University Press, 1990
• S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability (https://netfiles.uiuc.edu/meyn/www/
spm_files/book.html) London: Springer-Verlag, 1993. ISBN 0-387-19832-6 (2nd edition, Cambridge University
Press, 2009)
• Henryk Minc, Nonnegative matrices, John Wiley&Sons, New York, 1988, ISBN 0-471-83966-3

• Seneta, E. Non-negative matrices and Markov chains. 2nd rev. ed., 1981, XVI, 288 p., Softcover Springer Series
in Statistics. (Originally published by Allen & Unwin Ltd., London, 1973) ISBN 978-0-387-29765-1
• Suprunenko, D.A. (2001), "P/p072350" (http://www.encyclopediaofmath.org/index.php?title=P/p072350), in
Hazewinkel, Michiel, Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4 (The claim that Aj has
order n/h at the end of the statement of the theorem is incorrect.)
• Richard S. Varga, Matrix Iterative Analysis, 2nd ed., Springer-Verlag, 2002
Poisson distribution 506

Poisson distribution
Poisson

Probability mass function

The horizontal axis is the index k, the number of occurrences. The function is only defined at integer values of k. The
connecting lines are only guides for the eye.
Cumulative distribution function

The horizontal axis is the index k, the number of occurrences. The CDF is discontinuous at the integers of k and flat
everywhere else because a variable that is Poisson distributed only takes on integer values.
Notation: Pois(λ)
Parameters: λ > 0 (real)
Support: k ∈ { 0, 1, 2, 3, ... }
PMF: λ^k e^{−λ} / k!
CDF: e^{−λ} Σ_{i=0}^{⌊k⌋} λ^i / i!, or equivalently Γ(⌊k + 1⌋, λ)/⌊k⌋! (for k ≥ 0, where Γ(x, y) is the
incomplete gamma function and ⌊k⌋ is the floor function)
Mean: λ
Median: ≈ ⌊λ + 1/3 − 0.02/λ⌋
Mode: ⌈λ⌉ − 1, ⌊λ⌋
Variance: λ
Skewness: λ^{−1/2}
Ex. kurtosis: λ^{−1}
Entropy: λ[1 − log(λ)] + e^{−λ} Σ_{k=0}^{∞} λ^k log(k!)/k! (for large λ: (1/2) log(2πeλ) − 1/(12λ) − ...)
MGF: exp(λ(e^t − 1))
CF: exp(λ(e^{it} − 1))
PGF: exp(λ(z − 1))

In probability theory and statistics, the Poisson distribution (pronounced [pwasɔ̃]) is a discrete probability
distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or
space if these events occur with a known average rate and independently of the time since the last event.[1] (The
Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or
volume.)
Suppose someone typically receives 4 pieces of mail per day on average. There will, however, be a certain spread:
sometimes a little more, sometimes a little less, once in a while nothing at all.[2] Given only the average rate for a
certain period of observation (pieces of mail per day, phone calls per hour, etc.), and assuming that the process, or
mix of processes, that produces the event flow is essentially random, the Poisson distribution specifies how likely it
is that the count will be 3, or 5, or 11, or any other number, during one period of observation. That is, it predicts the
degree of spread around a known average rate of occurrence.[2]
The distribution's practical usefulness has been explained by the Poisson law of small numbers.[3]

History
The distribution was first introduced by Siméon Denis Poisson (1781–1840) and published, together with his
probability theory, in 1837 in his work Recherches sur la probabilité des jugements en matière criminelle et en
matière civile (“Research on the Probability of Judgments in Criminal and Civil Matters”).[4] The work focused on
certain random variables N that count, among other things, the number of discrete occurrences (sometimes called
“arrivals”) that take place during a time-interval of given length.
A practical application of this distribution was made by Ladislaus Bortkiewicz in 1898 when he was given the task
of investigating the number of soldiers in the Prussian army killed accidentally by horse kick; this experiment
introduced the Poisson distribution to the field of reliability engineering.[5]

Definition
A discrete stochastic variable X is said to have a Poisson distribution with parameter λ > 0, if for k = 0, 1, 2, ... the
probability mass function of X is given by:

f(k; λ) = Pr(X = k) = λ^k e^(−λ) / k!

where
• e is the base of the natural logarithm (e = 2.71828...)
• k! is the factorial of k.
The positive real number λ is equal to the expected value of X and also to its variance:

λ = E(X) = Var(X).
The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare. The
Poisson distribution is sometimes called a Poissonian.
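As a quick numerical check of the definition above, the mass function can be evaluated directly. A minimal Python sketch (the helper name poisson_pmf is ours, not from the text), using λ = 4 as in the mail example:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Pr(X = k) for X ~ Pois(lam): lam**k * e**(-lam) / k!"""
    return lam ** k * exp(-lam) / factorial(k)

lam = 4.0
probs = [poisson_pmf(k, lam) for k in range(100)]

# The probabilities sum to 1, and both the mean and the variance recover lam.
print(sum(probs))
print(sum(k * p for k, p in enumerate(probs)))
```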
Properties

Mean
• The expected value of a Poisson-distributed random variable is equal to λ and so is its variance.
• The coefficient of variation is λ^(−1/2), while the index of dispersion is 1.[6]
• The mean deviation about the mean is[6]

E|X − λ| = 2 λ^(⌊λ⌋+1) e^(−λ) / ⌊λ⌋!

• The mode of a Poisson-distributed random variable with non-integer λ is equal to ⌊λ⌋, which is the largest
integer less than or equal to λ. This is also written as floor(λ). When λ is a positive integer, the modes are λ and
λ − 1.
• All of the cumulants of the Poisson distribution are equal to the expected value λ. The nth factorial moment of the
Poisson distribution is λ^n.

Median
Bounds for the median ( ν ) of the distribution are known and are sharp:[7]

λ − ln 2 ≤ ν < λ + 1/3
Higher moments
• The higher moments m_k of the Poisson distribution about the origin are Touchard polynomials in λ:

m_k = Σ_{i=0}^{k} λ^i S(k, i),

where the S(k, i) are Stirling numbers of the second kind.[8] The coefficients of the polynomials have a combinatorial meaning. In
fact, when the expected value of the Poisson distribution is 1, then Dobinski's formula says that the nth moment
equals the number of partitions of a set of size n.
• Sums of Poisson-distributed random variables:

If X_i ~ Pois(λ_i), i = 1, ..., n, are independent, then X = Σ_{i=1}^n X_i ~ Pois(Σ_{i=1}^n λ_i).[9]

A converse is Raikov's theorem, which says that if the sum of two independent random variables is
Poisson-distributed, then so is each of those two independent random variables.[10]
Other properties
• The Poisson distributions are infinitely divisible probability distributions.[11][12]
• The directed Kullback–Leibler divergence between Pois(λ) and Pois(λ0) is given by

D_KL(λ ‖ λ0) = λ0 − λ + λ log(λ / λ0).

• Bounds for the tail probabilities of a Poisson random variable X ~ Pois(λ) can be derived using a Chernoff
bound argument:[13]

P(X ≥ x) ≤ e^(−λ) (eλ)^x / x^x,  for x > λ.

Related distributions
• If X1 ~ Pois(λ1) and X2 ~ Pois(λ2) are independent, then the difference Y = X1 − X2 follows a
Skellam distribution.
• If X1 ~ Pois(λ1) and X2 ~ Pois(λ2) are independent, then the distribution of X1 conditional on
X1 + X2 is a binomial distribution. Specifically,

X1 | X1 + X2 = k ~ Binomial(k, λ1 / (λ1 + λ2)).

More generally, if X1, X2, ..., Xn are independent Poisson random variables
with parameters λ1, λ2, ..., λn, then

Xi | Σ_{j=1}^n Xj = k ~ Binomial(k, λi / Σ_{j=1}^n λj).

In fact, the vector (X1, ..., Xn) conditional on Σ_{j=1}^n Xj = k follows a multinomial distribution,
Multinomial(k; λ1/Σ_j λj, ..., λn/Σ_j λj).
• The Poisson distribution can be derived as a limiting case of the binomial distribution as the number of trials goes
to infinity and the expected number of successes remains fixed (see law of rare events below). Therefore, it can
be used as an approximation of the binomial distribution if n is sufficiently large and p is sufficiently small. There
is a rule of thumb stating that the Poisson distribution is a good approximation of the binomial distribution if n is
at least 20 and p is smaller than or equal to 0.05, and an excellent approximation if n ≥ 100 and np ≤ 10.[14]

• For sufficiently large values of λ, (say λ>1000), the normal distribution with mean λ and variance λ (standard
deviation ), is an excellent approximation to the Poisson distribution. If λ is greater than about 10, then the
normal distribution is a good approximation if an appropriate continuity correction is performed, i.e., P(X ≤ x),
where (lower-case) x is a non-negative integer, is replaced by P(X ≤ x + 0.5).

• Variance-stabilizing transformation: When a variable is Poisson distributed, its square root is approximately
normally distributed with expected value of about √λ and variance of about 1/4.[15][16] Under this
transformation, the convergence to normality (as λ increases) is far faster than for the untransformed variable. Other,
slightly more complicated, variance-stabilizing transformations are available,[16] one of which is the Anscombe
transform. See Data transformation (statistics) for more general uses of transformations.
• If for every t > 0 the number of arrivals in the time interval [0,t] follows the Poisson distribution with mean λ t,
then the sequence of inter-arrival times are independent and identically distributed exponential random variables
having mean 1 / λ.[17]
• The cumulative distribution functions of the Poisson and chi-squared distributions are related in the following
ways:[18]

P(X ≤ k) = P(χ²(2(k + 1)) > 2λ),  k = 0, 1, 2, ...,

and[19]

P(X = k) = P(χ²(2(k + 1)) > 2λ) − P(χ²(2k) > 2λ).
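The continuity-corrected normal approximation described above can be checked numerically. A sketch in Python using only the standard library's NormalDist (λ = 20 is an assumed example value):

```python
from math import exp, factorial, sqrt
from statistics import NormalDist

def poisson_cdf(x, lam):
    """P(X <= x) by direct summation of the Poisson PMF."""
    return sum(lam ** k * exp(-lam) / factorial(k) for k in range(int(x) + 1))

lam = 20.0
normal = NormalDist(mu=lam, sigma=sqrt(lam))
for x in (15, 20, 25):
    exact = poisson_cdf(x, lam)
    # Continuity correction: P(X <= x) is replaced by P(Normal <= x + 0.5).
    approx = normal.cdf(x + 0.5)
    print(x, round(exact, 4), round(approx, 4))
```

For λ = 20 the corrected normal CDF agrees with the exact Poisson CDF to within a few hundredths at these points.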

Occurrence
Applications of the Poisson distribution can be found in many fields related to counting:
• Electrical system example: telephone calls arriving in a system.
• Astronomy example: photons arriving at a telescope.
• Biology example: the number of mutations on a strand of DNA per unit time.
• Management example: customers arriving at a counter or call centre.
• Civil Engineering example: cars arriving at a traffic light.
• Finance and Insurance example: Number of Losses/Claims occurring in a given period of Time.
• Earthquake Seismology example: An asymptotic Poisson model of seismic risk for large earthquakes. (Lomnitz,
1994).
The Poisson distribution arises in connection with Poisson processes. It applies to various phenomena of discrete
properties (that is, those that may happen 0, 1, 2, 3, ... times during a given period of time or in a given area)
whenever the probability of the phenomenon happening is constant in time or space. Examples of events that may be
modelled as a Poisson distribution include:
• The number of soldiers killed by horse-kicks each year in each corps in the Prussian cavalry. This example was
made famous by a book of Ladislaus Josephovich Bortkiewicz (1868–1931).
• The number of yeast cells used when brewing Guinness beer. This example was made famous by William Sealy
Gosset (1876–1937).[20]
• The number of phone calls arriving at a call centre per minute.
• The number of goals in sports involving two competing teams.
• The number of deaths per year in a given age group.
• The number of jumps in a stock price in a given time interval.
• Under an assumption of homogeneity, the number of times a web server is accessed per minute.
• The number of mutations in a given stretch of DNA after a certain amount of radiation.
• The proportion of cells that will be infected at a given multiplicity of infection.
How does this distribution arise? — The law of rare events


In several of the above examples (such as the number of mutations in a given sequence of DNA) the events being
counted are actually the outcomes of discrete trials, and would more precisely be modelled using the binomial
distribution, that is

Pr(X = k) = C(n, k) p^k (1 − p)^(n − k).

In such cases n is very large and p is very small (and so the expectation np is of intermediate magnitude). Then
the distribution may be approximated by the less cumbersome Poisson distribution

Pr(X = k) ≈ (np)^k e^(−np) / k!.

This is sometimes known as the law of rare events, since each of the n individual Bernoulli events rarely
occurs. The name may be misleading because the total count of success events in a Poisson process need not be
rare if the parameter np is not small. For example, the number of telephone calls to a busy switchboard in one hour
follows a Poisson distribution with the events appearing frequent to the operator, but they are rare from the point of
view of the average member of the population, who is very unlikely to make a call to that switchboard in that hour.

[Figure omitted: Comparison of the Poisson distribution (black lines) and the binomial distribution with
n=10 (red circles), n=20 (blue circles), n=1000 (green circles). All distributions have a mean of 5. The horizontal
axis shows the number of events k. As n gets larger, the Poisson distribution becomes an increasingly better
approximation for the binomial distribution with the same mean.]
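The quality of this approximation is easy to verify directly. A small Python comparison in the excellent-approximation regime of the rule of thumb above (n = 100, p = 0.05, so np = 5):

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """Binomial PMF: C(n, k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson PMF: lam^k e^(-lam) / k!."""
    return lam ** k * exp(-lam) / factorial(k)

# Regime from the rule of thumb: n >= 100 and np <= 10.
n, p = 100, 0.05
lam = n * p
max_abs_err = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam))
                  for k in range(n + 1))
print(max_abs_err)  # largest pointwise PMF difference, below 0.01
```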

The word law is sometimes used as a synonym of probability distribution, and convergence in law means
convergence in distribution. Accordingly, the Poisson distribution is sometimes called the law of small numbers
because it is the probability distribution of the number of occurrences of an event that happens rarely but has very
many opportunities to happen. The Law of Small Numbers is a book by Ladislaus Bortkiewicz about the Poisson
distribution, published in 1898. Some have suggested that the Poisson distribution should have been called the
Bortkiewicz distribution.[21]

Multi-dimensional Poisson process


The Poisson distribution arises as the distribution of counts of occurrences of events in (multidimensional) intervals
in multidimensional Poisson processes in a directly equivalent way to the result for unidimensional processes. That is,
if D is any region of the multidimensional space for which |D|, the area or volume of the region, is finite, and if N(D) is the
count of the number of events in D, then

P(N(D) = k) = (λ|D|)^k e^(−λ|D|) / k!.
Other applications in science


In a Poisson process, the number of observed occurrences fluctuates about its mean λ with a standard deviation
. These fluctuations are denoted as Poisson noise or (particularly in electronics) as shot noise.
The correlation of the mean and standard deviation in counting independent discrete occurrences is useful
scientifically. By monitoring how the fluctuations vary with the mean signal, one can estimate the contribution of a
single occurrence, even if that contribution is too small to be detected directly. For example, the charge e on an
electron can be estimated by correlating the magnitude of an electric current with its shot noise. If N electrons pass a
point in a given time t on the average, the mean current is I = eN / t; since the current fluctuations should be of
the order σ_I = e√N / t (i.e., the standard deviation of the Poisson process), the charge can be estimated from
the ratio e = t σ_I² / I.
An everyday example is the graininess that appears as photographs are enlarged; the graininess is due to Poisson
fluctuations in the number of reduced silver grains, not to the individual grains themselves. By correlating the
graininess with the degree of enlargement, one can estimate the contribution of an individual grain (which is
otherwise too small to be seen unaided). Many other molecular applications of Poisson noise have been developed,
e.g., estimating the number density of receptor molecules in a cell membrane.

Generating Poisson-distributed random variables


A simple algorithm to generate random Poisson-distributed numbers (pseudo-random number sampling) has been
given by Knuth (see References below):

algorithm poisson random number (Knuth):


init:
Let L ← e^(−λ), k ← 0 and p ← 1.
do:
k ← k + 1.
Generate uniform random number u in [0,1] and let p ← p × u.
while p > L.
return k − 1.
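Knuth's algorithm above translates directly into Python; a minimal sketch (the function name is ours):

```python
import random
from math import exp

def poisson_knuth(lam, rng=random.random):
    """Knuth's method: multiply uniform draws until the product falls below e^(-lam)."""
    L = exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= rng()
    return k - 1

random.seed(1)
samples = [poisson_knuth(4.0) for _ in range(100_000)]
print(sum(samples) / len(samples))  # sample mean, close to lam = 4
```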

While simple, the complexity is linear in λ. There are many other algorithms to overcome this; some are given in
Ahrens & Dieter (see References below). Also, for large values of λ, there may be numerical stability issues because
of the term e^(−λ). One solution for large values of λ is rejection sampling; another is to use a Gaussian approximation
to the Poisson.
Inverse transform sampling is simple and efficient for small values of λ, and requires only one uniform random
number u per sample. Cumulative probabilities are examined in turn until one exceeds u.
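The inverse transform approach can be sketched the same way, using the recurrence P(X = k) = P(X = k − 1) · λ/k to update the running probability (the function name is ours):

```python
import random
from math import exp

def poisson_inverse_transform(lam, rng=random.random):
    """Scan cumulative probabilities until they exceed a single uniform draw u."""
    u = rng()
    k, p = 0, exp(-lam)   # p = P(X = 0)
    cdf = p
    while cdf < u:
        k += 1
        p *= lam / k      # recurrence: P(X = k) = P(X = k - 1) * lam / k
        cdf += p
    return k

random.seed(2)
samples = [poisson_inverse_transform(3.0) for _ in range(100_000)]
print(sum(samples) / len(samples))  # sample mean, close to lam = 3
```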
Parameter estimation

Maximum likelihood
Given a sample of n measured values k_i, i = 1, ..., n, we wish to estimate the value of the parameter λ of the Poisson population
from which the sample was drawn. The maximum likelihood estimate is

λ̂_MLE = (1/n) Σ_{i=1}^n k_i.

Since each observation has expectation λ, so does this sample mean. Therefore, the maximum likelihood estimate is
an unbiased estimator of λ. It is also an efficient estimator, i.e. its estimation variance achieves the Cramér–Rao
lower bound (CRLB); hence it is MVUE. It can also be proved that the sample mean is a complete and sufficient
statistic for λ.

Confidence interval
The confidence interval for the mean of a Poisson distribution is calculated using the relationship between the Poisson and chi-square
distributions, and can be written as:

(1/2) χ²(α/2; 2k) ≤ μ ≤ (1/2) χ²(1 − α/2; 2k + 2),

where k is the number of event occurrences in a given interval and χ²(p; n) is the chi-square quantile with lower
tail area p and n degrees of freedom.[18][22] This interval is 'exact' in the sense that its coverage probability is never
less than the nominal 1 − α.
When quantiles of the chi-square distribution are not available, an accurate approximation to this exact interval was
proposed by DP Byar (based on the Wilson–Hilferty transformation):[23]

k (1 − 1/(9k) − z_(α/2)/(3√k))³ ≤ μ ≤ (k + 1) (1 − 1/(9(k + 1)) + z_(α/2)/(3√(k + 1)))³,

where z_(α/2) denotes the standard normal deviate with upper tail area α/2.
For application of these formulae in the same context as above (given a sample of n measured values k_i), one would
set

k = Σ_{i=1}^n k_i,

calculate an interval for μ = nλ, and then derive the interval for λ.
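Byar's approximation lends itself to a short implementation using only the standard library. A hedged sketch (the formula is the Wilson–Hilferty-based form as we read it, with z the upper-α/2 normal deviate; the function name is ours):

```python
from math import sqrt
from statistics import NormalDist

def byar_interval(k, alpha=0.05):
    """Approximate 1 - alpha confidence interval for the Poisson mean,
    given k observed events (Byar's approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # standard normal deviate, upper tail alpha/2
    lower = k * (1 - 1 / (9 * k) - z / (3 * sqrt(k))) ** 3 if k > 0 else 0.0
    m = k + 1
    upper = m * (1 - 1 / (9 * m) + z / (3 * sqrt(m))) ** 3
    return lower, upper

lo, hi = byar_interval(10)
print(lo, hi)  # roughly 4.79 and 18.39 for k = 10, close to the exact chi-square interval
```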

Bayesian inference
In Bayesian inference, the conjugate prior for the rate parameter λ of the Poisson distribution is the gamma
distribution. Let

λ ~ Gamma(α, β)

denote that λ is distributed according to the gamma density g parameterized in terms of a shape parameter α and an
inverse scale parameter β:

g(λ; α, β) = (β^α / Γ(α)) λ^(α−1) e^(−βλ)  for λ > 0.
Then, given the same sample of n measured values k_i as before, and a prior of Gamma(α, β), the posterior
distribution is

λ ~ Gamma(α + Σ_{i=1}^n k_i, β + n).

The posterior mean E[λ] = (α + Σ_i k_i) / (β + n) approaches the maximum likelihood estimate in the limit as α → 0, β → 0.
The posterior predictive distribution for a single additional observation is a negative binomial distribution,
sometimes called a gamma–Poisson distribution.
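The conjugate update is a one-liner in practice. A minimal sketch (the counts are hypothetical illustrative data, and Gamma(1, 1) is an assumed prior):

```python
def gamma_poisson_update(alpha, beta, counts):
    """Conjugate update: Gamma(alpha, beta) prior plus Poisson counts
    gives a Gamma(alpha + sum(counts), beta + n) posterior."""
    n = len(counts)
    return alpha + sum(counts), beta + n

counts = [2, 5, 3, 4, 6]                 # hypothetical observed counts
a, b = gamma_poisson_update(1.0, 1.0, counts)
print(a / b)                             # posterior mean, (1 + 20) / (1 + 5) = 3.5
print(sum(counts) / len(counts))         # MLE = 4.0; the posterior mean shrinks toward it as n grows
```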

Bivariate Poisson distribution


This distribution has been extended to the bivariate case.[24] The generating function for this distribution is

with

The marginal distributions are Poisson( θ1 ) and Poisson( θ2 ) and the correlation coefficient is limited to the range

The Skellam distribution is a particular case of this distribution.

Notes
[1] Frank A. Haight (1967). Handbook of the Poisson Distribution. New York: John Wiley & Sons.
[2] "Statistics | The Poisson Distribution" (http:/ / www. umass. edu/ wsp/ statistics/ lessons/ poisson/ index. html). Umass.edu. 2007-08-24. .
Retrieved 2012-04-05.
[3] Gullberg, Jan (1997). Mathematics from the birth of numbers. New York: W. W. Norton. pp. 963–965. ISBN 0-393-04002-X.
[4] S.D. Poisson, Probabilité des jugements en matière criminelle et en matière civile, précédées des règles générales du calcul des probabilitiés
(Paris, France: Bachelier, 1837), page 206 (http:/ / books. google. com/ books?id=uovoFE3gt2EC& pg=PA206#v=onepage& q& f=false).
[5] Ladislaus von Bortkiewicz, Das Gesetz der kleinen Zahlen [The law of small numbers] (Leipzig, Germany: B.G. Teubner, 1898). On page 1
(http:/ / books. google. com/ books?id=o_k3AAAAMAAJ& pg=PA1#v=onepage& q& f=false), Bortkiewicz presents the Poisson distribution.
On pages 23-25 (http:/ / books. google. com/ books?id=o_k3AAAAMAAJ& pg=PA23#v=onepage& q& f=false), Bortkiewicz presents his
famous analysis of "4. Beispiel: Die durch Schlag eines Pferdes im preussischen Heere Getöteten." (4. Example: Those killed in the Prussian
army by a horse's kick.).
[6] Johnson, N.L., Kotz, S., Kemp, A.W. (1993) Univariate Discrete distributions (2nd edition). Wiley. ISBN 0-471-54897-9, p157
[7] Choi KP (1994) On the medians of Gamma distributions and an equation of Ramanujan. Proc Amer Math Soc 121 (1) 245–251
[8] Riordan, John (1937). "Moment recurrence relations for binomial, Poisson and hypergeometric frequency distributions". Annals of
Mathematical Statistics 8: 103–111. Also see Haight (1967), p. 6.
[9] E. L. Lehmann (1986). Testing Statistical Hypotheses (second ed.). New York: Springer Verlag. ISBN 0-387-94919-4. page 65.
[10] Raikov, D. (1937). On the decomposition of Poisson laws. Comptes Rendus (Doklady) de l' Academie des Sciences de l'URSS, 14, 9–11.
(The proof is also given in von Mises, Richard (1964). Mathematical Theory of Probability and Statistics. New York: Academic Press.)
[11] Laha, R. G. and Rohatgi, V. K.. Probability Theory. New York: John Wiley & Sons. p. 233. ISBN 0-471-03262-X.
[12] Johnson, N.L., Kotz, S., Kemp, A.W. (1993) Univariate Discrete distributions (2nd edition). Wiley. ISBN 0-471-54897-9, p159
[13] Massimo Franceschetti and Olivier Dousse and David N. C. Tse and Patrick Thiran (2007). "Closing the Gap in the Capacity of Wireless
Networks Via Percolation Theory" (http:/ / circuit. ucsd. edu/ ~massimo/ Journal/ IEEE-TIT-Capacity. pdf). IEEE Transactions on
Information Theory 53 (3): 1009–1018. .
[14] NIST/SEMATECH, ' 6.3.3.1. Counts Control Charts (http:/ / www. itl. nist. gov/ div898/ handbook/ pmc/ section3/ pmc331. htm)',
e-Handbook of Statistical Methods, accessed 25 October 2006
[15] McCullagh, Peter; Nelder, John (1989). Generalized Linear Models. London: Chapman and Hall. ISBN 0-412-31760-5. page 196 gives the
approximation and higher order terms.
[16] Johnson, N.L., Kotz, S., Kemp, A.W. (1993) Univariate Discrete distributions (2nd edition). Wiley. ISBN 0-471-54897-9, p163
[17] S. M. Ross (2007). Introduction to Probability Models (ninth ed.). Boston: Academic Press. ISBN 978-0-12-598062-3. pp. 307–308.
[18] Johnson, N.L., Kotz, S., Kemp, A.W. (1993) Univariate Discrete distributions (2nd edition). Wiley. ISBN 0-471-54897-9, p171
[19] Johnson, N.L., Kotz, S., Kemp, A.W. (1993) Univariate Discrete distributions (2nd edition). Wiley. ISBN 0-471-54897-9, p153
[20] Philip J. Boland. "A Biographical Glimpse of William Sealy Gosset" (http:/ / wfsc. tamu. edu/ faculty/ tdewitt/ biometry/ Boland PJ (1984)
American Statistician 38 179-183 - A biographical glimpse of William Sealy Gosset. pdf). The American Statistician, Vol. 38, No. 3. (Aug.,
1984), pp. 179-183.. . Retrieved 2011-06-22. "At the turn of the 19th century, Arthur Guinness, Son & Co. became interested in hiring
scientists to analyze data concerned with various aspects of its brewing process. Gosset was to be one of the first of these scientists, and so it
was that in 1899 he moved to Dublin to take up a job as a brewer at St. James' Gate... Student published 22 papers, the first of which was
entitled "On the Error of Counting With a Haemacytometer" (Biometrika, 1907). In it, Student illustrated the practical use of the Poisson
distribution in counting the number of yeast cells on a square of a haemacytometer. Up until just before World War II, Guinness would not
allow its employees to publish under their own names, and hence Gosset chose to write under the pseudonym of "Student.""
[21] Good, I. J. (1986). "Some statistical applications of Poisson's work". Statistical Science 1 (2): 157–180. doi:10.1214/ss/1177013690.
JSTOR 2245435.
[22] Garwood, F. (1936). "Fiducial Limits for the Poisson Distribution". Biometrika 28 (3/4): 437–442. doi:10.1093/biomet/28.3-4.437.
[23] Breslow, NE; Day, NE (1987). Statistical Methods in Cancer Research: Volume 2—The Design and Analysis of Cohort Studies (http:/ /
www. iarc. fr/ en/ publications/ pdfs-online/ stat/ sp82/ index. php). Paris: International Agency for Research on Cancer.
ISBN 978-92-832-0182-3. .
[24] Loukas S, Kemp CD (1986) The index of dispersion test for the bivariate Poisson distribution. Biometrics 42(4) 941-948

References
• Joachim H. Ahrens, Ulrich Dieter (1974). "Computer Methods for Sampling from Gamma, Beta, Poisson and
Binomial Distributions". Computing 12 (3): 223–246. doi:10.1007/BF02293108.
• Joachim H. Ahrens, Ulrich Dieter (1982). "Computer Generation of Poisson Deviates". ACM Transactions on
Mathematical Software 8 (2): 163–179. doi:10.1145/355993.355997.
• Ronald J. Evans, J. Boersma, N. M. Blachman, A. A. Jagers (1988). "The Entropy of a Poisson Distribution:
Problem 87-6". SIAM Review 30 (2): 314–317. doi:10.1137/1030059.
• Donald E. Knuth (1969). Seminumerical Algorithms. The Art of Computer Programming, Volume 2. Addison
Wesley.

Poisson process
In probability theory, a Poisson process is a stochastic process which counts the number of events[1] and the times
at which these events occur in a given time interval. The time between each pair of consecutive events has an exponential
distribution with parameter λ, and each of these inter-arrival times is assumed to be independent of the other inter-arrival
times. The process is named after the French mathematician Siméon-Denis Poisson and is a good model of
radioactive decay,[2] telephone calls[3] and requests for a particular document on a web server,[4] among many other
phenomena.
The Poisson process is a continuous-time process; the sum of a Bernoulli process can be thought of as its
discrete-time counterpart. A Poisson process is a pure-birth process, the simplest example of a birth-death process. It
is also a point process on the real half-line.

Definition
The basic form of Poisson process, often referred to simply as "the Poisson process", is a continuous-time counting
process {N(t), t ≥ 0} that possesses the following properties:
• N(0) = 0
• Independent increments (the numbers of occurrences counted in disjoint intervals are independent from each
other)
• Stationary increments (the probability distribution of the number of occurrences counted in any time interval only
depends on the length of the interval)
• No counted occurrences are simultaneous.
Consequences of this definition include:
• The probability distribution of N(t) is a Poisson distribution.
• The probability distribution of the waiting time until the next occurrence is an exponential distribution.
• The occurrences are distributed uniformly on any interval of time. (Note that N(t), the total number of
occurrences, has a Poisson distribution over (0, t], whereas the location of an individual occurrence on t ∈ (a, b] is
uniform.)
Other types of Poisson process are described below.

Types

Homogeneous
The homogeneous Poisson process is one of the most well-known Lévy
processes. This process is characterized by a rate parameter λ, also
known as intensity, such that the number of events in time interval
(t, t + τ] follows a Poisson distribution with associated parameter λτ.
This relation is given as

P[N(t + τ) − N(t) = k] = e^(−λτ) (λτ)^k / k!,  k = 0, 1, 2, ...,

where N(t + τ) − N(t) = k is the number of events in time interval (t, t + τ].

[Figure omitted: sample path of a Poisson process N(t).]


Just as a Poisson random variable is characterized by its scalar parameter λ, a homogeneous Poisson process is
characterized by its rate parameter λ, which is the expected number of "events" or "arrivals" that occur per unit time.
N(t) is a sample homogeneous Poisson process, not to be confused with a density or distribution function.

Non-homogeneous
In general, the rate parameter may change over time; such a process is called a non-homogeneous Poisson process
or inhomogeneous Poisson process. In this case, the generalized rate function is given as λ(t). Now the expected
number of events between time a and time b is

λ_(a,b) = ∫_a^b λ(t) dt.

Thus, the number of arrivals in the time interval (a, b], given as N(b) − N(a), follows a Poisson distribution with
associated parameter λ_(a,b):

P[N(b) − N(a) = k] = e^(−λ_(a,b)) λ_(a,b)^k / k!,  k = 0, 1, 2, ....
A homogeneous Poisson process may be viewed as a special case when λ(t) = λ, a constant rate.
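An inhomogeneous process with a bounded rate can be simulated by thinning (the Lewis–Shedler method, not described above): generate candidate arrivals from a homogeneous process at a dominating rate λ_max, and accept each candidate at time t with probability λ(t)/λ_max. A sketch with an assumed sinusoidal rate function:

```python
import random
from math import sin, pi

def thinning(rate, rate_max, t_end, rng=random.Random(0)):
    """Simulate an inhomogeneous Poisson process on [0, t_end] by thinning
    a homogeneous process of rate rate_max (requires rate(t) <= rate_max)."""
    t, events = 0.0, []
    while True:
        t += rng.expovariate(rate_max)             # next candidate arrival
        if t > t_end:
            return events
        if rng.random() < rate(t) / rate_max:      # keep with probability lambda(t)/lambda_max
            events.append(t)

lam = lambda t: 5.0 * (1 + sin(2 * pi * t))        # assumed example rate, bounded by 10
events = thinning(lam, 10.0, 100.0)
print(len(events))  # expected count = integral of lam over [0, 100] = 500
```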

Spatial
An important variation on the (notionally time-based) Poisson process is the spatial Poisson process. In the case of a
one-dimension space (a line) the theory differs from that of a time-based Poisson process only in the interpretation of
the index variable. For higher dimension spaces, where the index variable (now x) is in some vector space V (e.g. R2
or R3), a spatial Poisson process can be defined by the requirement that the random variables defined as the counts of
the number of "events" inside each of a number of non-overlapping finite sub-regions of V should each have a
Poisson distribution and should be independent of each other.

Space-time
A further variation on the Poisson process, the space-time Poisson process, allows for separately distinguished space
and time variables. Even though this can theoretically be treated as a pure spatial process by treating "time" as just
another component of a vector space, it is convenient in most applications to treat space and time separately, both for
modeling purposes in practical applications and because of the types of properties of such processes that it is
interesting to study.
In comparison to a time-based inhomogeneous Poisson process, the extension to a space-time Poisson process can
introduce a spatial dependence into the rate function, such that it is defined as λ(x, t), where x ∈ V for some
vector space V (e.g. R2 or R3). However, a space-time Poisson process may have a rate function that is constant with
respect to either or both of x and t. For any set S ⊆ V (e.g. a spatial region) with finite measure, the
number of events occurring inside this region can be modeled as a Poisson process with associated rate function λS(t)
such that

λS(t) = ∫_S λ(x, t) dx.
Separable space-time processes


In the special case that this generalized rate function is a separable function of time and space, we have:

λ(x, t) = f(x) λ(t)

for some function f(x). Without loss of generality, let

∫_V f(x) dx = 1.

(If this is not the case, λ(t) can be scaled appropriately.) Now, f(x) represents the spatial probability density
function of these random events in the following sense: the act of sampling this spatial Poisson process is equivalent
to sampling a Poisson process with rate function λ(t), and associating with each event a random vector x sampled
from the probability density function f(x). A similar result can be shown for the general (non-separable) case.

Characterisation
In its most general form, the only two conditions for a counting process to be a Poisson process are:
• Orderliness: which roughly means

P(N(t + Δt) − N(t) > 1) = o(Δt)  as Δt → 0,

which implies that arrivals don't occur simultaneously (but this is actually a mathematically stronger
statement).
• Memorylessness (also called evolution without after-effects): the number of arrivals occurring in any bounded
interval of time after time t is independent of the number of arrivals occurring before time t.
These seemingly unrestrictive conditions actually impose a great deal of structure in the Poisson process. In
particular, they imply that the time between consecutive events (called interarrival times) are independent random
variables. For the homogeneous Poisson process, these inter-arrival times are exponentially distributed with
parameter λ (mean 1/λ).
Proof: Let T1 be the first arrival time of the Poisson process. Its distribution satisfies

P(T1 > t) = P(N(t) = 0) = e^(−λt).
Also, the memorylessness property entails that the number of events in any time interval is independent of the
number of events in any other interval that is disjoint from it. This latter property is known as the independent
increments property of the Poisson process.
Properties
As defined above, the stochastic process {N(t)} is a Markov process, or more specifically, a continuous-time Markov
process.
To illustrate the exponentially distributed inter-arrival times property, consider a homogeneous Poisson process N(t)
with rate parameter λ, and let Tk be the time of the kth arrival, for k = 1, 2, 3, ... . Clearly the number of arrivals
before some fixed time t is less than k if and only if the waiting time until the kth arrival is more than t. In symbols,
the event [N(t) < k] occurs if and only if the event [Tk > t] occurs. Consequently the probabilities of these events are
the same:

P(N(t) < k) = P(Tk > t).

In particular, consider the waiting time until the first arrival. Clearly that time is more than t if and only if the number
of arrivals before time t is 0. Combining this latter property with the above probability distribution for the number of
homogeneous Poisson process events in a fixed interval gives

P(T1 > t) = P(N(t) = 0) = e^(−λt).
Consequently, the waiting time until the first arrival T1 has an exponential distribution, and is thus memoryless. One
can similarly show that the other interarrival times Tk − Tk−1 share the same distribution. Hence, they are
independent, identically distributed (i.i.d.) random variables with parameter λ > 0; and expected value 1/λ. For
example, if the average rate of arrivals is 5 per minute, then the average waiting time between arrivals is 1/5 minute.
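The exponential inter-arrival property gives the standard way to simulate arrival times; a Python sketch for the rate-5 example above (the function name is ours):

```python
import random

def poisson_process_arrivals(lam, t_end, rng=random.Random(42)):
    """Arrival times on [0, t_end]: cumulative sums of Exp(lam) inter-arrival gaps."""
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(lam)   # memoryless inter-arrival time, mean 1/lam
        if t > t_end:
            return arrivals
        arrivals.append(t)

arrivals = poisson_process_arrivals(5.0, 1000.0)   # rate 5 arrivals per minute
gaps = [b - a for a, b in zip([0.0] + arrivals, arrivals)]
print(sum(gaps) / len(gaps))  # average waiting time between arrivals, near 1/5 minute
```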

Applications
The classic example of phenomena well modelled by a Poisson process is deaths due to horse kick in the Prussian
army, as shown by Ladislaus Bortkiewicz in 1898.[5][6] The following examples are also well-modeled by the
Poisson process:
• Requests for telephone calls at a switchboard.
• Goals scored in a soccer match.[7]
• Requests for individual documents on a web server.[8]
• Particle emissions due to radioactive decay by an unstable substance. In this case the Poisson process is
non-homogeneous in a predictable manner - the emission rate declines as particles are emitted.
In queueing theory, the times of customer/job arrivals at queues are often assumed to be a Poisson process.

Occurrence
The Palm–Khintchine theorem provides a result that shows that the superposition of many low intensity non-Poisson
point processes will be close to a Poisson process.

Further reading
• Cox, D. R.; Isham, V. I. (1980). Point Processes. Chapman & Hall. ISBN 0-412-21910-7.
• Ross, S. M. (1995). Stochastic Processes. Wiley. ISBN 978-0-471-12062-9.
• Snyder, D. L.; Miller, M. I. (1991). Random Point Processes in Time and Space. Springer-Verlag.
ISBN 0-387-97577-2.
Notes
[1] The word event used here is not an instance of the concept of event as frequently used in probability theory.
[2] Cannizzaro, F.; Greco, G.; Rizzo, S.; Sinagra, E. (1978). "Results of the measurements carried out in order to verify the validity of the
poisson-exponential distribution in radioactive decay events". The International Journal of Applied Radiation and Isotopes 29 (11): 649.
doi:10.1016/0020-708X(78)90101-1.
[3] Willkomm, D.; Machiraju, S.; Bolot, J.; Wolisz, A. (2009). "Primary user behavior in cellular networks and implications for dynamic
spectrum access". IEEE Communications Magazine 47 (3): 88. doi:10.1109/MCOM.2009.4804392.
[4] Arlitt, Martin F.; Williamson, Carey L. (1997). "Internet Web servers: Workload characterization and performance implications". IEEE/ACM
Transactions on Networking 5 (5): 631. doi:10.1109/90.649565.
[5] Ladislaus von Bortkiewicz, Das Gesetz der kleinen Zahlen [The law of small numbers] (Leipzig, Germany: B.G. Teubner, 1898). On page 1
(http://books.google.com/books?id=o_k3AAAAMAAJ&pg=PA1#v=onepage&q&f=false), Bortkiewicz presents the Poisson distribution.
On pages 23-25 (http://books.google.com/books?id=o_k3AAAAMAAJ&pg=PA23#v=onepage&q&f=false), Bortkiewicz presents his
famous analysis of "4. Beispiel: Die durch Schlag eines Pferdes im preussischen Heere Getöteten." (4. Example: Those killed in the Prussian
army by a horse's kick.).
[6] Gibbons, Robert D.; Bhaumik, Dulal; Aryal, Subhash (2009). Statistical Methods for Groundwater Monitoring. John Wiley and Sons. p. 72.
ISBN 0-470-16496-4.
[7] Heuer, A.; Müller, C.; Rubner, O. (2010). "Soccer: Is scoring goals a predictable Poissonian process?". EPL (Europhysics Letters) 89 (3):
38007. doi:10.1209/0295-5075/89/38007. "To a very good approximation scoring goals during a match can be characterized as independent
Poissonian processes with pre-determined expectation values."
[8] Arlitt, Martin F.; Williamson, Carey L. (1997). "Internet Web servers: Workload characterization and performance implications". IEEE/ACM
Transactions on Networking 5 (5): 631. doi:10.1109/90.649565.

Proportional hazards models


Proportional hazards models are a class of survival models in statistics. Survival models relate the time that passes
before some event occurs to one or more covariates that may be associated with that quantity. In a proportional
hazards model, the unique effect of a unit increase in a covariate is multiplicative with respect to the hazard rate. For
example, taking a drug may halve one's hazard rate for a stroke occurring, or changing the material from which a
manufactured component is constructed may double its hazard rate for failure. Other types of survival models such
as accelerated failure time models do not exhibit proportional hazards. These models could describe a situation such
as a drug that reduces a subject's immediate risk of having a stroke, but where there is no reduction in the hazard rate
after one year for subjects who do not have a stroke in the first year of analysis.

Introduction
Survival models can be viewed as consisting of two parts: the underlying hazard function, often denoted λ₀(t),
describing how the hazard (risk) changes over time at baseline levels of covariates; and the effect parameters,
describing how the hazard varies in response to explanatory covariates. A typical medical example would include
covariates such as treatment assignment, as well as patient characteristics such as age, gender, and the presence of
other diseases in order to reduce variability and/or control for confounding.
The proportional hazards condition[1] states that covariates are multiplicatively related to the hazard. In the simplest
case of stationary coefficients, for example, a treatment with a drug may, say, halve a subject's hazard at any given
time t, while the baseline hazard may vary. Note, however, that the covariate is not restricted to binary predictors;
in the case of a continuous covariate x, the hazard responds logarithmically: each unit increase in x results in
proportional scaling of the hazard. The Cox partial likelihood, shown below, is obtained by using Breslow's estimate
of the baseline hazard function, plugging it into the full likelihood and then observing that the result is a product of
two factors. The first factor is the partial likelihood shown below, in which the baseline hazard has "canceled out".
The second factor is free of the regression coefficients and depends on the data only through the censoring pattern.

The effect of covariates estimated by any proportional hazards model can thus be reported as hazard ratios.
Sir David Cox observed that if the proportional hazards assumption holds (or, is assumed to hold) then it is possible
to estimate the effect parameter(s) without any consideration of the hazard function. This approach to survival data is
called application of the Cox proportional hazards model,[2] sometimes abbreviated to Cox model or to proportional
hazards model.

The partial likelihood


Let Yi denote the observed time (either censoring time or event time) for subject i, and let Ci be the indicator that the
time corresponds to an event (i.e. if Ci = 1 the event occurred and if Ci = 0 the time is a censoring time). The hazard
function for the Cox proportional hazard model has the form

    λ(t | X) = λ₀(t) exp(β′X).
This expression gives the hazard at time t for an individual with covariate vector (explanatory variables) X. Based on
this hazard function, a partial likelihood can be constructed from the data set as

    L(β) = ∏_{i : Ci = 1} θi / ( Σ_{j : Yj ≥ Yi} θj ),
where θj = exp(β′Xj) and X1, ..., Xn are the covariate vectors for the n independently sampled individuals in the
dataset (treated here as column vectors).
The corresponding log partial likelihood is

    ℓ(β) = Σ_{i : Ci = 1} [ β′Xi - log( Σ_{j : Yj ≥ Yi} θj ) ].
This function can be maximized over β to produce maximum partial likelihood estimates of the model parameters.
The partial score function is

    ℓ′(β) = Σ_{i : Ci = 1} [ Xi - ( Σ_{j : Yj ≥ Yi} θj Xj ) / ( Σ_{j : Yj ≥ Yi} θj ) ],
and the Hessian matrix of the partial log likelihood is

    ℓ″(β) = - Σ_{i : Ci = 1} [ ( Σ_j θj Xj Xj′ ) / ( Σ_j θj ) - ( Σ_j θj Xj )( Σ_j θj Xj′ ) / ( Σ_j θj )² ],

where the inner sums run over the risk set { j : Yj ≥ Yi }.
Using this score function and Hessian matrix, the partial likelihood can be maximized using the Newton-Raphson
algorithm. The inverse of the negative Hessian matrix (the observed information matrix), evaluated at the estimate of β,
can be used as an approximate variance-covariance matrix for the estimate, and used to produce approximate standard
errors for the regression coefficients.
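The procedure above can be sketched for a single covariate, assuming no tied event times (with ties, Breslow's method reduces to the same expression). All names are illustrative, not from any particular library; the clamped Newton step is a crude safeguard of our own, not part of the standard algorithm:

```python
import math

def cox_partial_loglik(beta, times, events, x):
    """Cox partial log-likelihood for one covariate, no tied event times."""
    ll = 0.0
    for i, ti in enumerate(times):
        if not events[i]:
            continue                     # censored observations contribute no term
        risk = [j for j in range(len(times)) if times[j] >= ti]   # risk set
        ll += beta * x[i] - math.log(sum(math.exp(beta * x[j]) for j in risk))
    return ll

def cox_newton(times, events, x, iters=50):
    """Maximise the partial likelihood by Newton-Raphson on the score."""
    beta = 0.0
    for _ in range(iters):
        score, hess = 0.0, 0.0
        for i, ti in enumerate(times):
            if not events[i]:
                continue
            risk = [j for j in range(len(times)) if times[j] >= ti]
            w = [math.exp(beta * x[j]) for j in risk]
            s = sum(w)
            m1 = sum(wj * x[j] for wj, j in zip(w, risk)) / s        # weighted mean
            m2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk)) / s   # weighted 2nd moment
            score += x[i] - m1
            hess -= m2 - m1 ** 2       # minus the weighted variance of x
        step = score / hess            # hess < 0 at an interior maximum
        beta -= max(-1.0, min(1.0, step))   # clamp as a crude overshoot safeguard
    return beta

# Hypothetical toy data: events[i] = 0 marks a censored observation.
times  = [5.0, 4.0, 3.0, 2.0, 1.0, 6.0]
events = [1, 1, 1, 1, 1, 0]
xs     = [1.0, 0.0, 2.0, 0.5, 1.5, -1.0]
beta_hat = cox_newton(times, events, xs)
```

The estimated hazard ratio per unit increase of the covariate is then exp(beta_hat), and the concavity of the partial log-likelihood is what makes the Newton iteration reliable here.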

Tied times
Several approaches have been proposed to handle situations in which there are ties in the time data. Breslow's
method describes the approach in which the procedure described above is used unmodified, even when ties are
present. An alternative approach that is considered to give better results is Efron's method.[3] Let tj denote the unique
times, let Hj denote the set of indices i such that Yi = tj and Ci = 1, and let mj = |Hj|. Efron's approach maximizes the
following partial likelihood:

    L(β) = ∏_j [ ∏_{i ∈ Hj} θi ] / ∏_{ℓ=0}^{mj-1} [ Σ_{i : Yi ≥ tj} θi - (ℓ/mj) Σ_{i ∈ Hj} θi ].

The corresponding log partial likelihood is

    ℓ(β) = Σ_j [ Σ_{i ∈ Hj} β′Xi - Σ_{ℓ=0}^{mj-1} log( Σ_{i : Yi ≥ tj} θi - (ℓ/mj) Σ_{i ∈ Hj} θi ) ],


the score function is

and the Hessian matrix is

where

Note that when Hj is empty (all observations with time tj are censored), the summands in these expressions are
treated as zero.

Time-varying predictors and coefficients


Extensions to time dependent variables, time dependent strata, and multiple events per subject, can be incorporated
by the counting process formulation of Andersen and Gill.[4]
In addition to allowing time-varying covariates (i.e., predictors), the Cox model may be generalized to time-varying
coefficients as well. That is, the proportional effect of a treatment may vary with time; e.g. a drug may be very
effective if administered within one month of morbidity, and become less effective as time goes on. The hypothesis
of no change with time (stationarity) of the coefficient may then be tested. Details and software are available in
Martinussen and Scheike (2006).[5]

Specifying the baseline hazard function


The Cox model may be specialized if a reason exists to assume that the baseline hazard follows a particular form. In
this case, the baseline hazard is replaced by a given function. For example, assuming the hazard function to
be the Weibull hazard function gives the Weibull proportional hazards model.
Incidentally, using the Weibull baseline hazard is the only circumstance under which the model satisfies both the
proportional hazards and the accelerated failure time assumptions.
The generic term parametric proportional hazards models can be used to describe proportional hazards models in
which the hazard function is specified. The Cox proportional hazards model is sometimes called a semiparametric
model by contrast.
Some authors (e.g. Bender, Augustin and Blettner[6]) use the term Cox proportional hazards model even when
specifying the underlying hazard function, to acknowledge the debt of the entire field to David Cox.
The term Cox regression model (omitting proportional hazards) is sometimes used to describe the extension of the
Cox model to include time-dependent factors. However, this usage is potentially ambiguous since the Cox
proportional hazards model can itself be described as a regression model.

Relationship to Poisson models


There is a relationship between proportional hazards models and Poisson regression models which is sometimes used
to fit approximate proportional hazards models in software for Poisson regression. The usual reason for doing this is
that calculation is much quicker. This was more important in the days of slower computers but can still be useful for
particularly large data sets or complex problems. Authors giving the mathematical details include Laird and Olivier
(1981),[7] who remark
"Note that we do not assume [the Poisson model] is true, but simply use it as a device for deriving the likelihood."

The book on generalized linear models by McCullagh and Nelder[8] has a chapter on converting proportional hazards
models to generalized linear models.

Notes
[1] Breslow, N. E. (1975). "Analysis of Survival Data under the Proportional Hazards Model". International Statistical Review / Revue
Internationale de Statistique 43 (1): 45–57. doi:10.2307/1402659. JSTOR 1402659.
[2] Cox, David R (1972). "Regression Models and Life-Tables". Journal of the Royal Statistical Society. Series B (Methodological) 34 (2):
187–220. JSTOR 2985181. MR0341758
[3] Efron, Bradley (1977). "The Efficiency of Cox's Likelihood Function for Censored Data". Journal of the American Statistical Association 72
(359): 557–565. JSTOR 2286217.
[4] Andersen, P.; Gill, R. (1982). "Cox's regression model for counting processes, a large sample study.". Annals of Statistics 10 (4): 1100–1120.
doi:10.1214/aos/1176345976. JSTOR 2240714.
[5] Martinussen & Scheike (2006) Dynamic Regression Models for Survival Data (Springer).
[6] Bender, R., Augustin, T. and Blettner, M. (2006). Generating survival times to simulate Cox proportional hazards models, Statistics in
Medicine 2005; 24:1713–1723. doi:10.1002/sim.2369
[7] Nan Laird and Donald Olivier (1981). "Covariance Analysis of Censored Survival Data Using Log-Linear Analysis Techniques". Journal of
the American Statistical Association 76 (374): 231–240. doi:10.2307/2287816. JSTOR 2287816.
[8] P. McCullagh and J. A. Nelder (2000). "Chapter 13: Models for Survival Data". Generalized Linear Models (Second ed.). Boca Raton,
Florida: Chapman & Hall/CRC. ISBN 0-412-31760-5. (Second edition 1989; first CRC reprint 1999.)

References
• D. R. Cox and D. Oakes (1984). Analysis of survival data (Chapman & Hall).
• D. Collett (2003). Modelling survival data in medical research (Chapman & Hall/CRC).
• T. M. Therneau and P. M. Grambsch (2000). Modeling survival data: extending the Cox Model (Springer).
• V. Bagdonavicius, R. Levuliene, M. Nikulin (2010). "Goodness-of-fit criteria for the Cox model from left truncated
and right censored data". Journal of Mathematical Sciences, v. 167, #4, 436–443.

Random permutation statistics


The statistics of random permutations, such as the cycle structure of a random permutation are of fundamental
importance in the analysis of algorithms, especially of sorting algorithms, which operate on random permutations.
Suppose, for example, that we are using quickselect (a cousin of quicksort) to select a random element of a random
permutation. Quickselect will perform a partial sort on the array, as it partitions the array according to the pivot.
Hence a permutation will be less disordered after quickselect has been performed. The amount of disorder that
remains may be analysed with generating functions. These generating functions depend in a fundamental way on the
generating functions of random permutation statistics. Hence it is of vital importance to compute these generating
functions.
The article on random permutations contains an introduction to random permutations.

The fundamental relation


Permutations are sets of labelled cycles. Using the labelled case of the Flajolet–Sedgewick fundamental theorem and
writing P for the set of permutations and Z for the singleton set, we have

    P = SET(CYC(Z)).

Translating into exponential generating functions (EGFs), we have

    1/(1 - z) = exp( log 1/(1 - z) ),

where we have used the fact that the EGF of the set of permutations (there are n! permutations of n elements) is

    Σ_{n ≥ 0} n! z^n / n! = 1/(1 - z).
This one equation will allow us to derive a surprising number of permutation statistics. Firstly, by dropping terms
from exp, we may constrain the number of cycles that a permutation contains; e.g. by restricting the EGF to the
quadratic term, (log 1/(1 - z))²/2, we obtain permutations containing exactly two cycles. Secondly, note that the EGF
of labelled cycles, i.e. of CYC(Z), is

    log 1/(1 - z) = Σ_{k ≥ 1} z^k / k,

because there are k!/k labelled cycles of length k.


This means that by dropping terms from this generating function, we may constrain the size of the cycles that occur
in a permutation and obtain an EGF of the permutations containing only cycles of a given size.
Now instead of dropping, let's put different weights on different size cycles. If b(k) is a weight function that
depends only on the size k of the cycle, and for brevity we write

    b(σ) = Σ_{c ∈ σ} b(|c|),

defining the value of b for a permutation σ to be the sum of its values on the cycles, then we may mark cycles of
length k with u^{b(k)} and obtain a bivariate generating function g(z, u) that describes the parameter, i.e.

    g(z, u) = Σ_{n ≥ 0} (z^n / n!) Σ_{σ ∈ Sn} u^{b(σ)} = exp( Σ_{k ≥ 1} u^{b(k)} z^k / k ).
This is a mixed generating function which is exponential in the permutation size and ordinary in the secondary
parameter u. Differentiating and evaluating at u = 1, we have

    ∂/∂u g(z, u) |_{u=1} = (1/(1 - z)) Σ_{k ≥ 1} b(k) z^k / k,

i.e. the EGF of the sum of b over all permutations, or alternatively, the OGF, or more precisely, PGF (probability
generating function) of the expectation of b.
This article uses the coefficient extraction operator [zn], documented on the page for formal power series.

Number of permutations that are involutions


An involution is a permutation σ so that σ2 = 1 under permutation composition. It follows that σ may only contain
cycles of length one or two, i.e. the EGF g(z) of these permutations is

    g(z) = exp( z + z²/2 ).

This gives the explicit formula for the total number I(n) of involutions among the permutations σ ∈ Sn:

    I(n) = Σ_{k=0}^{⌊n/2⌋} n! / ( 2^k k! (n - 2k)! ).

Dividing by n! yields the probability that a random permutation is an involution.
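The count coming out of the EGF can be cross-checked by a direct search over small symmetric groups; the helper names below are illustrative:

```python
from itertools import permutations
from math import factorial

def involution_count(n):
    """I(n) = n! [z^n] exp(z + z^2/2): choose k two-cycles and n - 2k
    fixed points, dividing out the symmetries of the k two-cycles."""
    return sum(factorial(n) // (2 ** k * factorial(k) * factorial(n - 2 * k))
               for k in range(n // 2 + 1))

def involutions_by_search(n):
    """Directly count sigma in S_n with sigma(sigma(i)) = i for all i."""
    return sum(1 for p in permutations(range(n))
               if all(p[p[i]] == i for i in range(n)))
```

For example, S_4 contains 10 involutions: the identity, six transpositions, and three products of two disjoint transpositions.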

Number of permutations that are mth roots of unity


This generalizes the concept of an involution. An mth root of unity is a permutation σ so that σm = 1 under
permutation composition. Now every time we apply σ we move one step in parallel along all of its cycles. A cycle of
length d applied d times produces the identity permutation on d elements (d fixed points) and d is the smallest value
to do so. Hence m must be a multiple of all cycle sizes d, i.e. the only possible cycles are those whose length d is a
divisor of m. It follows that the EGF g(x) of these permutations is

    g(x) = exp( Σ_{d | m} x^d / d ).

When m = p, where p is prime, this simplifies to

    g(x) = exp( x + x^p / p ).
Number of permutations that are derangements


Suppose there are n people at a party, each of whom brought an umbrella. At the end of the party everyone picks an
umbrella out of the stack of umbrellas and leaves. What is the probability that no one left with his/her own umbrella?
This problem is equivalent to counting permutations with no fixed points, and hence the EGF (subtract out fixed
points by removing z) g(x) is

    g(x) = exp( log 1/(1 - x) - x ) = e^{-x} / (1 - x).

Now multiplication by 1/(1 - x) just sums coefficients, so that we have the following formula for D(n), the
total number of derangements:

    D(n) = n! Σ_{k=0}^{n} (-1)^k / k!.

Hence there are about n!/e derangements and the probability that a random permutation is a derangement is about 1/e.
This result may also be proved by inclusion-exclusion. Using the sets Tp, where 1 ≤ p ≤ n, to denote the set of
permutations that fix p, we have

    | ⋃_p Tp | = Σ_p |Tp| - Σ_{p<q} |Tp ∩ Tq| + ⋯ + (-1)^{n+1} |T1 ∩ ⋯ ∩ Tn|.

This formula counts the number of permutations that have at least one fixed point. The cardinalities are as follows:

    |T_{p1} ∩ ⋯ ∩ T_{pk}| = (n - k)!.
Hence the number of permutations with no fixed point is

    n! - Σ_{k=1}^{n} (-1)^{k+1} C(n, k) (n - k)!

or

    Σ_{k=0}^{n} (-1)^k C(n, k) (n - k)! = Σ_{k=0}^{n} (-1)^k n!/k!,
and we have the claim.
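The inclusion-exclusion formula is easy to verify against a direct search; the function names are illustrative:

```python
from itertools import permutations
from math import factorial

def derangement_count(n):
    """D(n) = n! * sum_{k=0..n} (-1)^k / k!, via inclusion-exclusion,
    computed in exact integer arithmetic."""
    return sum((-1) ** k * (factorial(n) // factorial(k)) for k in range(n + 1))

def derangements_by_search(n):
    """Directly count permutations of {0, ..., n-1} with no fixed point."""
    return sum(1 for p in permutations(range(n))
               if all(p[i] != i for i in range(n)))
```

The ratio D(n)/n! converges to 1/e ≈ 0.3679 extremely fast, which is why "about n!/e" is already accurate for modest n.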


There is a generalization of these numbers, which is known as rencontres numbers, i.e. the number of
permutations of containing m fixed points. The corresponding EGF is obtained by marking cycles of size one
with the variable u, i.e. choosing b(k) equal to one for and zero otherwise, which yields the generating
function of the set of permutations by the number of fixed points:

It follows that

and hence

This immediately implies that

for n large, m fixed.

One hundred prisoners


A prison warden wants to make room in his prison and is considering liberating one hundred prisoners, thereby
freeing one hundred cells. He therefore assembles one hundred prisoners and asks them to play the following game:
he lines up one hundred urns in a row, each containing the name of one prisoner, where every prisoner's name occurs
exactly once. The game is played as follows: every prisoner is allowed to look inside fifty urns. If he or she does not
find his or her name in one of the fifty urns, all prisoners will immediately be executed, otherwise the game
continues. The prisoners have a few moments to decide on a strategy, knowing that once the game has begun, they
will not be able to communicate with each other, mark the urns in any way or move the urns or the names inside
them. Choosing urns at random, their chances of survival are almost zero, but there is a strategy giving them a 30%
chance of survival, assuming that the names are assigned to urns randomly – what is it?
First of all, the survival probability using random choices is

    (1/2)^100,

so this is definitely not a practical strategy.


The 30% survival strategy is to consider the contents of the urns to be a permutation of the prisoners, and traverse
cycles. To keep the notation simple, assign a number to each prisoner, for example by sorting their names
alphabetically. The urns may thereafter be considered to contain numbers rather than names. Now clearly the
contents of the urns define a permutation. The first prisoner opens the first urn. If he finds his name, he has finished
and survives. Otherwise he opens the urn with the number he found in the first urn. The process repeats: the prisoner
opens an urn and survives if he finds his name, otherwise he opens the urn with the number just retrieved, up to a
limit of fifty urns. The second prisoner starts with urn number two, the third with urn number three, and so on. This
strategy is precisely equivalent to a traversal of the cycles of the permutation represented by the urns. Every prisoner
starts with the urn bearing his number and keeps on traversing his cycle up to a limit of fifty urns. The number of the
urn that contains his number is the pre-image of that number under the permutation. Hence the prisoners survive if
all cycles of the permutation contain at most fifty elements. We have to show that this probability is at least 30%.
Note that this assumes that the warden chooses the permutation randomly; if the warden anticipates this strategy, he
can simply choose a permutation with a cycle of length 51. To overcome this, the prisoners may agree in advance on
a random permutation of their names.
We consider the general case of 2n prisoners and n urns being opened. We first calculate the complementary
probability, i.e. that there is a cycle of more than n elements. With this in mind, we introduce

or

so that the desired probability is

because the cycle of more than n elements will necessarily be unique. Using the fact that the number of permutations
of 2n elements containing a cycle of length exactly k > n is (2n)!/k, we find that

which yields

Finally, using an integral estimate such as Euler–MacLaurin summation, or the asymptotic expansion of the nth
harmonic number, we obtain

    H_{2n} - H_n → ln 2,

so that the survival probability tends to

    1 - ln 2 ≈ 0.3069,

or at least 30%, as claimed.
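The finite-n probability is easy to evaluate exactly, and the cycle-length argument can be verified against a direct search over a small symmetric group; the names below are illustrative:

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def survival_probability(n, limit):
    """P(longest cycle <= limit) for a uniform random permutation of n
    elements, valid when 2*limit >= n: a cycle of length k > n/2 occurs
    with probability exactly 1/k, and at most one such cycle can exist."""
    assert 2 * limit >= n
    return 1 - sum(Fraction(1, k) for k in range(limit + 1, n + 1))

def longest_cycle(p):
    """Length of the longest cycle of a permutation in one-line notation."""
    seen, best = set(), 0
    for i in range(len(p)):
        if i not in seen:
            j, length = i, 0
            while j not in seen:
                seen.add(j)
                j = p[j]
                length += 1
            best = max(best, length)
    return best

p100 = survival_probability(100, 50)   # the prisoners' survival chance, about 0.3118
```

Note that the exact value for one hundred prisoners, about 31.18%, is slightly above the limiting value 1 - ln 2.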


A related result is that asymptotically, the expected length of the longest cycle is λn, where λ is the
Golomb–Dickman constant, approximately 0.62.
This example is due to Anna Gál and Peter Bro Miltersen; consult the paper by Peter Winkler for more information,
and see the discussion on Les-Mathematiques.net. Consult the references on 100 prisoners for links to these
references.

The above computation may be performed in a simpler and more direct way, as follows: first note that a permutation
of n elements contains at most one cycle of length strictly greater than n/2. Thus, if we denote

then

For n/2 < k ≤ n, the number of permutations that contain a cycle of length exactly k is

    C(n, k) · (k - 1)! · (n - k)! = n!/k.

Explanation: C(n, k) is the number of ways of choosing the k elements that constitute the cycle; (k - 1)! is the number
of ways of arranging k items in a cycle; and (n - k)! is the number of ways to permute the remaining elements.
Thus,

We conclude that

Number of permutations containing cycles


Applying the Flajolet–Sedgewick fundamental theorem of symbolic combinatorics, i.e. the labelled enumeration
theorem, to the set

we obtain the generating function

The term

yields the Stirling numbers of the first kind, i.e. is the EGF of the unsigned Stirling numbers of the first kind.
We can compute the OGF of these numbers for n fixed, i.e.

Start with

which yields

Summing this, we obtain



Using the formula for on the left, the definition of on the right, and the binomial theorem, we obtain

Comparing the coefficients of , and using the definition of the binomial coefficient, we finally have

a falling factorial.

Expected number of cycles of a given size m


In this problem we use a bivariate generating function g(z, u) as described in the introduction. The value of b for a
cycle not of size m is zero, and one for a cycle of size m. We have

or

This means that the expected number of cycles of size m in a permutation of length n < m is zero (obviously). A
random permutation of length at least m contains on average 1/m cycles of length m. In particular, a random
permutation contains about one fixed point.
The OGF of the expected number of cycles of length less than or equal to m is therefore

    (1/(1 - z)) Σ_{k=1}^{m} z^k / k,

whose coefficient of z^n is the harmonic number Hm for n ≥ m. Hence the expected number of cycles of length at
most m in a random permutation is about ln m.
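Both claims, 1/m cycles of length m on average and Hm cycles of length at most m, can be checked exactly by averaging over a small symmetric group; the helper names are illustrative:

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def cycle_lengths(p):
    """Cycle type of a permutation given in one-line notation."""
    seen, lengths = set(), []
    for i in range(len(p)):
        if i not in seen:
            j, length = i, 0
            while j not in seen:
                seen.add(j)
                j = p[j]
                length += 1
            lengths.append(length)
    return lengths

def expected_m_cycles(n, m):
    """Exact average number of m-cycles over all permutations of S_n."""
    total = sum(cycle_lengths(p).count(m) for p in permutations(range(n)))
    return Fraction(total, factorial(n))
```

Summing expected_m_cycles(n, m) over m = 1, ..., 3 gives H_3 = 11/6, matching the harmonic-number formula.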

Moments of fixed points


The mixed GF of the set of permutations by the number of fixed points is

    g(z, u) = e^{(u - 1) z} / (1 - z).

Let the random variable X be the number of fixed points of a random permutation. Using Stirling numbers of the
second kind, we have the following formula for the mth moment of X:

    E[X^m] = Σ_{k=0}^{m} S(m, k) E[(X)_k],

where (X)_k is a falling factorial. Using g(z, u), we have

    E[(X)_k] = [z^n] (∂^k/∂u^k) g(z, u) |_{u=1} = [z^n] z^k / (1 - z),
which is zero when k > n, and one otherwise. Hence only terms with k ≤ n contribute to the sum. This yields

    E[X^m] = Σ_{k=0}^{n} S(m, k).
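For n ≥ m the sum of S(m, k) over all k is the Bell number B_m, so the first moments of X are 1, 2, 5, 15, ... (the moments of a Poisson(1) variable). This is easy to confirm by brute force; the function name is illustrative:

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def fixed_point_moment(n, m):
    """Exact m-th moment of the number of fixed points of a uniform
    random permutation of S_n, computed by exhaustive enumeration."""
    total = sum(sum(1 for i in range(n) if p[i] == i) ** m
                for p in permutations(range(n)))
    return Fraction(total, factorial(n))
```

That the low moments agree with those of Poisson(1) is another face of the rencontres asymptotics above.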

Expected number of cycles of any length of a random permutation


We construct the bivariate generating function g(z, u) using b(k), where b(k) is one for all cycles (every cycle
contributes one to the total number of cycles).
Note that g(z, u) has the closed form

    g(z, u) = exp( u log 1/(1 - z) ) = (1 - z)^{-u}

and generates the unsigned Stirling numbers of the first kind.


We have

    ∂/∂u g(z, u) |_{u=1} = (1/(1 - z)) log 1/(1 - z).

Hence the expected number of cycles is Hn, the nth harmonic number, or about ln n.
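The harmonic-number answer can be verified exactly for small n; the names below are illustrative:

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def number_of_cycles(p):
    """Number of cycles of a permutation given in one-line notation."""
    seen, count = set(), 0
    for i in range(len(p)):
        if i not in seen:
            count += 1
            j = i
            while j not in seen:
                seen.add(j)
                j = p[j]
    return count

def average_cycle_count(n):
    """Exact mean number of cycles over all of S_n."""
    total = sum(number_of_cycles(p) for p in permutations(range(n)))
    return Fraction(total, factorial(n))

def harmonic(n):
    return sum(Fraction(1, k) for k in range(1, n + 1))
```

For instance, the average number of cycles of a permutation of four elements is H_4 = 25/12.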

Expected number of transpositions of a random permutation


We can use the disjoint cycle decomposition of a permutation to factorize it as a product of transpositions by
replacing a cycle of length k by k - 1 transpositions. E.g. the cycle (1 2 3 4) factors as (1 2) (2 3) (3 4). The
function b(k) for cycles is equal to k - 1 and we obtain

and

Hence the expected number of transpositions is

    n - Hn,

where Hn is the nth harmonic number.
We could also have obtained this formula by noting that the number of transpositions is obtained by adding the
lengths of all cycles (which gives n) and subtracting one for every cycle (which gives Hn in expectation, by the
previous section).
Note that again generates the unsigned Stirling numbers of the first kind, but in reverse order. More
precisely, we have

To see this, note that the above is equivalent to

and that

which we saw to be the EGF of the unsigned Stirling numbers of the first kind in the section on permutations
consisting of precisely m cycles.

Expected cycle size of a random element


We select a random element q of a random permutation and ask about the expected size of the cycle that contains
q. Here the function b(k) is equal to k², because a cycle of length k contributes k elements that are on cycles of
length k. Note that unlike the previous computations, we need to average out this parameter after we extract it from
the generating function (divide by n). We have

    ∂/∂u g(z, u) |_{u=1} = (1/(1 - z)) Σ_{k ≥ 1} k z^k = z/(1 - z)³,

whose coefficient of z^n is n(n + 1)/2. Hence the expected length of the cycle that contains q is

    (1/n) · n(n + 1)/2 = (n + 1)/2.
Probability that a random element lies on a cycle of size m


This average parameter represents the probability that if we again select a random element of a random
permutation, the element lies on a cycle of size m. The function b(k) is equal to m for k = m and zero
otherwise, because only cycles of length m contribute, namely m elements that lie on a cycle of length m. We have

    ∂/∂u g(z, u) |_{u=1} = (1/(1 - z)) · m · z^m/m = z^m/(1 - z),

whose coefficient of z^n is one for n ≥ m. It follows that the probability that a random element lies on a cycle of
length m is

    1/n for n ≥ m, and zero otherwise.
Probability that a random subset of [n] lies on the same cycle


Select a random subset Q of [n] containing m elements and a random permutation, and ask about the probability that
all elements of Q lie on the same cycle. This is another average parameter. The function b(k) is equal to C(k, m),
because a cycle of length k contributes C(k, m) subsets of size m, where C(k, m) = 0 for k < m. This yields

    ∂/∂u g(z, u) |_{u=1} = (1/(1 - z)) Σ_{k ≥ m} C(k, m) z^k / k,

whose coefficient of z^n is Σ_{k=m}^{n} C(k, m)/k = C(n, m)/m. Averaging out, we obtain that the probability of the
elements of Q being on the same cycle is

    ( C(n, m)/m ) / C(n, m)

or

    1/m.

In particular, the probability that two elements p < q are on the same cycle is 1/2.
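The answer 1/m, independent of n, is striking enough to be worth checking by brute force; the names are illustrative:

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def same_cycle_probability(n, m):
    """Exact probability that the m elements 0, ..., m-1 all lie on one
    cycle of a uniform random permutation of S_n."""
    def cycle_containing_zero(p):
        cyc, j = {0}, p[0]
        while j != 0:
            cyc.add(j)
            j = p[j]
        return cyc
    hits = sum(1 for p in permutations(range(n))
               if set(range(m)) <= cycle_containing_zero(p))
    return Fraction(hits, factorial(n))
```

By symmetry the choice of which m elements we track does not matter, so tracking 0, ..., m-1 loses no generality.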

Number of permutations containing an even number of even cycles


We may use the Flajolet–Sedgewick fundamental theorem of symbolic combinatorics directly and compute more
advanced permutation statistics. (Check the symbolic combinatorics page for an explanation of how the operators we will use are computed.)
For example, the set of permutations containing an even number of even cycles is given by

Translating to exponential generating functions (EGFs), we obtain

or

This simplifies to

    g(z) = (1/2) · 1/(1 - z) + (1 + z)/2

or

    g(z) = 1 + z + Σ_{n ≥ 2} (n!/2) z^n/n!.
This says that there is one permutation of size zero containing an even number of even cycles (the empty
permutation, which contains zero cycles of even length), one such permutation of size one (the fixed point, which
also contains zero cycles of even length), and that for n ≥ 2, there are n!/2 such permutations.
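The surprisingly clean count n!/2 can be confirmed by enumeration over small symmetric groups; the function name is illustrative:

```python
from itertools import permutations
from math import factorial

def count_even_number_of_even_cycles(n):
    """Permutations of S_n whose number of even-length cycles is even."""
    def even_cycles(p):
        seen, count = set(), 0
        for i in range(n):
            if i not in seen:
                j, length = i, 0
                while j not in seen:
                    seen.add(j)
                    j = p[j]
                    length += 1
                if length % 2 == 0:
                    count += 1
        return count
    return sum(1 for p in permutations(range(n)) if even_cycles(p) % 2 == 0)
```

Since a permutation is even exactly when it has an even number of even cycles, this is the same as counting the alternating group, which explains the value n!/2.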

Permutations that are squares


Consider what happens when we square a permutation. Fixed points are mapped to fixed points. Odd cycles are
mapped to odd cycles in a one-to-one correspondence, e.g. (1 2 3) turns into (1 3 2). Even cycles
split in two and produce a pair of cycles of half the size of the original cycle, e.g. (1 2 3 4) turns into
(1 3) (2 4). Hence permutations that are squares may contain any number of odd cycles, and an even number of
cycles of size two, an even number of cycles of size four etc., and are given by

which yields the EGF

    g(z) = √( (1 + z)/(1 - z) ) · ∏_{k ≥ 1} cosh( z^{2k} / (2k) ).
Odd cycle invariants


The types of permutations presented in the preceding two sections, i.e. permutations containing an even number of
even cycles and permutations that are squares, are examples of so-called odd cycle invariants, studied by Sung and
Zhang (see external links). The term odd cycle invariant simply means that membership in the respective
combinatorial class is independent of the size and number of odd cycles occurring in the permutation. In fact we can
prove that all odd cycle invariants obey a simple recurrence, which we will derive. First, here are some more
examples of odd cycle invariants.

Permutations where the sum of the lengths of the even cycles is six
This class has the specification

and the generating function

The first few values are

Permutations where all even cycles have the same length


This class has the specification

and the generating function

There is a semantic nuance here. We could consider permutations containing no even cycles as belonging to this class,
since zero is even. The first few values are

Permutations where the maximum length of an even cycle is four


This class has the specification

and the generating function

The first few values are

The recurrence
Observe carefully how the specifications of the even cycle component are constructed. It is best to think of them in
terms of parse trees. These trees have three levels. The nodes at the lowest level represent sums of products of
even-length cycles of the singleton . The nodes at the middle level represent restrictions of the set operator.
Finally the node at the top level sums products of contributions from the middle level. Note that restrictions of the set
operator, when applied to a generating function that is even, will preserve this feature, i.e. produce another even
generating function. But all the inputs to the set operators are even since they arise from even-length cycles. The
result is that all generating functions involved have the form

where is an even function. This means that



is even, too, and hence

Letting and extracting coefficients, we find that

which yields the recurrence

A problem from the 2005 Putnam competition


A link to the Putnam competition website appears in the section External links. The problem asks for a proof that

    Σ_{σ ∈ Sn} sgn(σ) / (ν(σ) + 1) = (-1)^{n+1} n/(n + 1),

where the sum is over all permutations σ of [n], sgn(σ) is the sign of σ, i.e. sgn(σ) = 1 if σ is even and
sgn(σ) = -1 if σ is odd, and ν(σ) is the number of fixed points of σ.
Now the sign of σ is given by

    sgn(σ) = ∏_c (-1)^{|c| - 1},

where the product is over all cycles c of σ, as explained e.g. on the page on even and odd permutations.
Hence we consider the combinatorial class

where marks one minus the length of a contributing cycle, and marks fixed points. Translating to generating
functions, we obtain

or

Now we have

and hence the desired quantity is given by

Doing the computation, we obtain

or

Extracting coefficients, we find that the coefficient of is zero. The constant is one, which does not agree with
the formula (should be zero). For n positive, however, we obtain

or

which is the desired result.
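The identity can be verified exactly for small n by summing over the symmetric group directly, computing the sign from the cycle structure via sgn(σ) = (-1)^(n - number of cycles); the function name is illustrative:

```python
from fractions import Fraction
from itertools import permutations

def putnam_sum(n):
    """Exact value of the sum over sigma in S_n of sgn(sigma)/(nu(sigma)+1),
    where nu(sigma) is the number of fixed points of sigma."""
    total = Fraction(0)
    for p in permutations(range(n)):
        seen, cycles, fixed = set(), 0, 0
        for i in range(n):
            if i not in seen:
                cycles += 1
                j, length = i, 0
                while j not in seen:
                    seen.add(j)
                    j = p[j]
                    length += 1
                if length == 1:
                    fixed += 1
        sign = -1 if (n - cycles) % 2 else 1   # sgn = (-1)^(n - #cycles)
        total += Fraction(sign, fixed + 1)
    return total
```

For n = 3, for instance, the identity contributes 1/4, the three transpositions contribute -3/2, and the two three-cycles contribute 2, giving 3/4 = (-1)^4 · 3/4.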


As an interesting aside, we observe that may be used to evaluate the following determinant of an
matrix:

where . Recall the formula for the determinant:

Now the value of the product on the right for a permutation is , where f is the number of fixed points of
. Hence

which yields

and finally

External links
• Alois Panholzer, Helmut Prodinger, Marko Riedel, Measuring post-quickselect disorder. [1]
• Putnam Competition Archive, William Lowell Putnam Competition Archive [2]
• Philip Sung, Yan Zhang, Recurring Recurrences in Counting Permutations [3]

100 prisoners
• Anna Gál, Peter Bro Miltersen, The cell probe complexity of succinct data structures [4]
• Peter Winkler, Seven puzzles you think you must not have heard correctly [5]
• Various authors, Les-Mathematiques.net [6]. Cent prisonniers [7] (French)

References
[1] http://www.mathematik.uni-stuttgart.de/~riedelmo/papers/qsdis-jalc.pdf
[2] http://www.unl.edu/amc/a-activities/a7-problems/putnamindex.shtml
[3] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.1088&rep=rep1&type=pdf
[4] http://www.daimi.au.dk/~bromille/Papers/succinct.pdf
[5] http://www.math.dartmouth.edu/~pw/solutions.pdf
[6] http://les-mathematiques.net
[7] http://les-mathematiques.u-strasbg.fr/phorum5/read.php?12,341672

Rank (linear algebra)


The column rank of a matrix A is the maximum number of linearly independent column vectors of A. The row rank
of a matrix A is the maximum number of linearly independent row vectors of A. Equivalently, the column rank of A
is the dimension of the column space of A, while the row rank of A is the dimension of the row space of A.
A result of fundamental importance in linear algebra is that the column rank and the row rank are always equal
(see below for proofs). This number (i.e. the number of linearly independent rows or columns) is simply called the
rank of A. It is commonly denoted by either rk(A) or rank A. Since the column vectors of A are the row vectors of
the transpose of A (denoted here by AT), column rank of A equals row rank of A is equivalent to saying that the rank
of a matrix is equal to the rank of its transpose, i.e. rk(A) = rk(AT).
The rank is also the dimension of the image of the linear transformation that is multiplication by A. More generally,
if a linear operator on a vector space (possibly infinite-dimensional) has finite-dimensional range (e.g., a finite-rank
operator), then the rank of the operator is defined as the dimension of the range.
The rank of an m × n matrix cannot be greater than m or n; that is, rk(A) ≤ min(m, n). A matrix whose rank is as
large as possible, i.e. equal to min(m, n), is said to have full rank; otherwise, the matrix is rank deficient.

Column rank = row rank or rk(A) = rk(A^T)


This result forms a very important part of the fundamental theorem of linear algebra. We present two proofs of this
result. The first is short and uses only basic properties of linear combinations of vectors. The second is an elegant
argument using orthogonality and is based upon: Mackiw, G. (1995). A Note on the Equality of the Column and
Row Rank of a Matrix. Mathematics Magazine, Vol. 68, No. 4. Interestingly, the first proof begins with a basis for
the column space, while the second builds from a basis for the row space. The first proof is valid when the matrices
are defined over any field of scalars, while the second works only on inner-product spaces; of course, both work
for real and complex Euclidean spaces. Also, the proofs are easily adapted when A is a linear transformation.
First proof: Let A be an m × n matrix whose column rank is r. Therefore, the dimension of the column space
of A is r. Let c_1, c_2, ..., c_r be any basis for the column space of A and place them as column vectors to form
the m × r matrix C = [c_1, c_2, ..., c_r]. Therefore, each column vector of A is a linear combination of the r
columns of C. From the definition of matrix multiplication, there exists an r × n matrix R, such that
A = CR. (The (i, j)-th element of R is the coefficient of c_i when the j-th column of A is expressed as a
linear combination of the columns of C. Also see rank factorization.)
Now, since A = CR, every row vector of A is a linear combination of the row vectors of R. (The (i, j)-th
element of C is the coefficient of the j-th row vector of R when the i-th row of A is expressed as a linear
combination of the rows of R.) This means that the row space of A is contained within the row space of R.
Therefore, we have row rank of A ≤ row rank of R. But note that R has r rows, so the row rank of R ≤ r =
column rank of A. This proves that row rank of A ≤ column rank of A. Now apply the result to the transpose of A
to get the reverse inequality: column rank of A = row rank of A^T ≤ column rank of A^T = row rank of A.
This proves column rank of A equals row rank of A. See a very similar but more direct proof for rk(A) = rk(A^T)
under rank factorization. QED
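The rank factorization A = CR used in this proof is easy to check numerically. The sketch below (assuming NumPy; the 3 × 4 matrix is a hypothetical example) picks a basis of the column space as C, solves for the coefficient matrix R, and confirms that row rank and column rank agree:

```python
import numpy as np

# Hypothetical 3 x 4 matrix: col2 = 2*col1 and col4 = col1 + col3, so rank 2.
A = np.array([[1., 2., 0., 1.],
              [0., 0., 1., 1.],
              [1., 2., 0., 1.]])

# Row rank equals column rank: rk(A) = rk(A^T).
assert np.linalg.matrix_rank(A) == np.linalg.matrix_rank(A.T) == 2

# Rank factorization A = C R: C holds a basis of the column space
# (columns 1 and 3 of A); R holds the combination coefficients.
C = A[:, [0, 2]]                           # shape (3, 2), full column rank
R, *_ = np.linalg.lstsq(C, A, rcond=None)  # shape (2, 4)
assert np.allclose(C @ R, A)
```

Any basis of the column space works for C; the least-squares solve recovers R exactly because every column of A lies in the span of C.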


Second proof: Let A be an m × n matrix whose row rank is r. Therefore, the dimension of the row space of A
is r, and suppose that x_1, x_2, ..., x_r is a basis of the row space of A. We claim that the vectors
Ax_1, Ax_2, ..., Ax_r are linearly independent. To see why, consider the linear homogeneous relation involving
these vectors with scalar coefficients c_1, c_2, ..., c_r:

0 = c_1 Ax_1 + c_2 Ax_2 + ... + c_r Ax_r = A(c_1 x_1 + c_2 x_2 + ... + c_r x_r) = Av,

where v = c_1 x_1 + c_2 x_2 + ... + c_r x_r. We make two observations: (a) v is a linear combination of vectors in the
row space of A, which implies that v belongs to the row space of A, and (b) since Av = 0, v is orthogonal to
every row vector of A and, hence, is orthogonal to every vector in the row space of A. The facts (a) and (b)
together imply that v is orthogonal to itself, which proves that v = 0 or, by the definition of v:

c_1 x_1 + c_2 x_2 + ... + c_r x_r = 0.

But recall that the x_i's are linearly independent because they are a basis of the row space of A. This implies that
c_1 = c_2 = ... = c_r = 0, which proves our claim that Ax_1, Ax_2, ..., Ax_r are linearly independent.
Now, each Ax_i is obviously a vector in the column space of A. So, Ax_1, Ax_2, ..., Ax_r is a set of r linearly
independent vectors in the column space of A and, hence, the dimension of the column space of A (i.e. the column
rank of A) must be at least as big as r. This proves that row rank of A = r ≤ column rank of A. Now apply this
result to the transpose of A to get the reverse inequality: column rank of A = row rank of A^T ≤ column rank of A^T
= row rank of A. This proves column rank of A equals row rank of A or, equivalently, rk(A) = rk(A^T).
QED.
Finally, we provide a proof of the related result, rk(A) = rk(A*), where A* is the conjugate transpose or Hermitian
transpose of A. When the elements of A are real numbers, this result becomes rk(A) = rk(A^T) and can constitute
another proof for row rank = column rank. Otherwise, for complex matrices, rk(A) = rk(A*) is not equivalent to row
rank = column rank, and one of the above two proofs should be used. This proof is short, elegant and makes use of
the null space.
Third proof: Let A be an m × n matrix. Define rk(A) to mean the column rank of A and let A* denote the
conjugate transpose or Hermitian transpose of A. First note that A*Ax = 0 if and only if Ax = 0. This is
elementary linear algebra – one direction is trivial; the other follows from:

A*Ax = 0  ⇒  x*A*Ax = 0  ⇒  (Ax)*(Ax) = 0  ⇒  ‖Ax‖² = 0  ⇒  Ax = 0,

where ‖·‖ is the Euclidean norm. This proves that the null space of A is equal to the null space of A*A. From
the rank-nullity theorem, we obtain rk(A) = rk(A*A). (Alternate argument: since A*Ax = 0 if and only if
Ax = 0, the columns of A*A satisfy the same linear relationships as the columns of A. In particular, they must
have the same number of linearly independent columns and, hence, the same column rank.) Each column of A*A is
a linear combination of the columns of A*. Therefore, the column space of A*A is a subspace of the column
space of A*. This implies that rk(A*A) ≤ rk(A*). We have proved: rk(A) = rk(A*A) ≤ rk(A*).
Now apply this result to A* to obtain the reverse inequality: since (A*)* = A, we can write
rk(A*) ≤ rk((A*)*) = rk(A). This proves rk(A) = rk(A*). When the elements of A are real, the
conjugate transpose is the transpose and we obtain rk(A) = rk(A^T). QED.

Alternative definitions
dimension of image
If one considers the matrix A as a linear mapping
f : Fn → Fm
such that
f(x) = Ax
then the rank of A can also be defined as the dimension of the image of f (see linear map for a discussion of image
and kernel). This definition has the advantage that it can be applied to any linear map without need for a specific
matrix. The rank can also be defined as n minus the dimension of the kernel of f; the rank-nullity theorem states that
this is the same as the dimension of the image of f.
column rank – dimension of column space
The maximal number of linearly independent columns of the m×n matrix A with entries in the field
F is equal to the dimension of the column space of A (the column space being the subspace of Fm generated by the
columns of A, which is in fact just the image of A as a linear map).
row rank – dimension of row space
Since the column rank and the row rank are the same, we can also define the rank of A as the dimension of the row
space of A, or the number of rows in a basis of the row space.
decomposition rank
The rank can also be characterized as the decomposition rank: the minimum k such that A can be factored as
A = CR, where C is an m×k matrix and R is a k×n matrix. Like the "dimension of image" characterization, this
can be generalized to a definition of the rank of a linear map: the rank of a linear map f from V → W is the minimal
dimension k of an intermediate space X such that f can be written as the composition of a map V → X and a map X →
W. While this definition does not suggest an efficient way to compute the rank (for which it is better to use one of
the alternative definitions), it does make many of the properties of the rank easy to understand, for instance that the
rank of the transpose of A is the same as that of A. See rank factorization for details.
determinantal rank – size of largest non-vanishing minor
Another equivalent definition of the rank of a matrix is the greatest order of any non-zero minor in the matrix (the
order of a minor being the size of the square sub-matrix of which it is the determinant). Like the decomposition rank
characterization, this does not give an efficient way of computing the rank, but it is useful theoretically: a single
non-zero minor witnesses a lower bound (namely its order) for the rank of the matrix, which can be useful to prove
that certain operations do not lower the rank of a matrix.
Equivalence of the determinantal definition (size of the largest non-vanishing minor) is usually proved separately. It
is a generalization of the statement that if the span of n vectors has dimension p, then p of those vectors span the
space: one can choose a spanning set that is a subset of the vectors. For determinantal rank, the statement is that if
the row rank (column rank) of a matrix is p, then one can choose a p × p submatrix that is invertible: a subset of the
rows and a subset of the columns simultaneously define an invertible submatrix. It can be alternatively stated as: if
the span of n vectors has dimension p, then p of these vectors span the space and there is a set of p coordinates on
which they are linearly independent.
A non-vanishing p-minor (p × p submatrix with non-vanishing determinant) shows that the rows and columns of that
submatrix are linearly independent, and thus those rows and columns of the full matrix are linearly independent (in
the full matrix), so the row and column rank are at least as large as the determinantal rank; however, the converse is
less straightforward.
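The determinantal characterization can be tested directly, if very inefficiently, by enumerating square submatrices. A brute-force sketch (assuming NumPy; `determinantal_rank` and the 4 × 4 matrix are illustrations of my own, not a practical algorithm):

```python
from itertools import combinations
import numpy as np

def determinantal_rank(A, tol=1e-10):
    """Largest p such that some p x p submatrix of A has non-zero determinant.

    Brute force over all square submatrices -- exponential cost, so this is
    purely an illustration of the definition."""
    m, n = A.shape
    for p in range(min(m, n), 0, -1):
        for rows in combinations(range(m), p):
            for cols in combinations(range(n), p):
                if abs(np.linalg.det(A[np.ix_(rows, cols)])) > tol:
                    return p
    return 0

# Hypothetical matrix: col2 = 2*col1 and col4 = col1 + col3, so rank 2.
A = np.array([[ 2.,  4., 1., 3.],
              [-1., -2., 1., 0.],
              [ 0.,  0., 2., 2.],
              [ 3.,  6., 2., 5.]])
# The largest non-vanishing minor has order 2, matching the usual rank.
assert determinantal_rank(A) == np.linalg.matrix_rank(A) == 2
```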
tensor rank – minimum number of simple tensors
Rank (linear algebra) 538

The rank of a matrix A can also be characterized as the tensor rank: the minimum number of simple tensors
(rank 1 tensors) needed to express A as a sum, A = u_1 v_1^T + ... + u_k v_k^T. Here a rank 1 tensor (the matrix product
of a column vector and a row vector) is the same thing as a rank 1 matrix of the given size. This interpretation can be
generalized in the separable models interpretation of the singular value decomposition.

Properties
We assume that A is an m-by-n matrix over either the real numbers or the complex numbers, and we define the linear
map f by f(x) = Ax as above.
• Only a zero matrix has rank zero.

• f is injective if and only if A has rank n (in this case, we say that A has full column rank).
• f is surjective if and only if A has rank m (in this case, we say that A has full row rank).
• If A is a square matrix (i.e., m = n), then A is invertible if and only if A has rank n (that is, A has full rank).
• If B is any n-by-k matrix, then
rank(AB) ≤ min(rank(A), rank(B)).
• If B is an n-by-k matrix with rank n, then
rank(AB) = rank(A).
• If C is an l-by-m matrix with rank m, then
rank(CA) = rank(A).
• The rank of A is equal to r if and only if there exists an invertible m-by-m matrix X and an invertible n-by-n
matrix Y such that, in block form,
XAY = [ I_r 0 ; 0 0 ],
where I_r denotes the r-by-r identity matrix and the 0's denote zero blocks of the appropriate sizes.


• Sylvester’s rank inequality: if A is an m-by-n matrix and B is n-by-k, then
rank(A) + rank(B) − n ≤ rank(AB).[1]
This is a special case of the next inequality.


• The inequality due to Frobenius: if AB, ABC and BC are defined, then
rank(AB) + rank(BC) ≤ rank(B) + rank(ABC).[2]
• Subadditivity: rank(A + B) ≤ rank(A) + rank(B) when A and B are of the same dimension. As a
consequence, a rank-k matrix can be written as the sum of k rank-1 matrices, but not fewer.
• The rank of a matrix plus the nullity of the matrix equals the number of columns of the matrix. (This is the
rank–nullity theorem.)
• The rank of a matrix and the rank of its corresponding Gram matrix are equal. Thus, for real matrices
rank(A^T A) = rank(A A^T) = rank(A) = rank(A^T).
This can be shown by proving equality of their null spaces. The null space of the Gram matrix is given by the vectors
x for which A^T A x = 0. If this condition is fulfilled, then 0 = x^T A^T A x = |Ax|², so Ax = 0 also holds. This proof
was adapted from Mirsky.[3]
• If A* denotes the conjugate transpose of A (i.e., the adjoint of A), then
rank(A) = rank(A*) = rank(A* A) = rank(A A*).

Rank from row-echelon forms


A common approach to finding the rank of a matrix is to reduce it to a simpler form, generally row-echelon form by
row operations. Row operations do not change the row space (hence do not change the row rank), and, being
invertible, map the column space to an isomorphic space (hence do not change the column rank). Once in
row-echelon form, the rank is clearly the same for both row rank and column rank, and equals the number of pivots
(or basic columns) and also the number of non-zero rows, say r; further, the column space has been mapped to a
coordinate subspace which has dimension r.
A potentially easier way to identify a matrix's rank is to use elementary row operations to put the matrix in reduced
row-echelon form and simply count the number of non-zero rows in the matrix. Below is an example of this process
on a small matrix.

The matrix

  A = [ 1  2  1 ]
      [-2 -3  1 ]
      [ 3  5  0 ]

can be put in reduced row-echelon form by using the following elementary row operations: add 2 × (row 1) to
row 2; subtract 3 × (row 1) from row 3; add row 2 to row 3; subtract 2 × (row 2) from row 1. This yields

      [ 1  0 -5 ]
      [ 0  1  3 ]
      [ 0  0  0 ]

By looking at the final matrix (the reduced row-echelon form) one sees that the first non-zero entry in both row 1
and row 2 is a 1, and that there are two non-zero rows. Therefore the rank of matrix A is 2.

Computation
The easiest way to compute the rank of a matrix A is given by the Gauss elimination method. The row-echelon form
of A produced by the Gauss algorithm has the same rank as A, and its rank can be read off as the number of non-zero
rows.
Consider for example the 4-by-4 matrix

  A = [ 2  4  1  3 ]
      [-1 -2  1  0 ]
      [ 0  0  2  2 ]
      [ 3  6  2  5 ]

We see that the second column is twice the first column, and that the fourth column equals the sum of the first and
the third. The first and the third columns are linearly independent, so the rank of A is two. This can be confirmed
with the Gauss algorithm. It produces the following row echelon form of A:

      [ 1  2  0  1 ]
      [ 0  0  1  1 ]
      [ 0  0  0  0 ]
      [ 0  0  0  0 ]

which has two non-zero rows.


When applied to floating point computations on computers, basic Gaussian elimination (LU decomposition) can be
unreliable, and a rank revealing decomposition should be used instead. An effective alternative is the singular value
decomposition (SVD), but there are other less expensive choices, such as QR decomposition with pivoting (so-called
rank-revealing QR factorization), which are still more numerically robust than Gaussian elimination. Numerical
determination of rank requires a criterion for deciding when a value, such as a singular value from the SVD, should
be treated as zero, a practical choice which depends on both the matrix and the application.
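A minimal sketch of this tolerance-based decision, assuming NumPy (whose `matrix_rank` uses the same default criterion), on a hypothetical 4-by-4 matrix with the dependent columns described in the example above:

```python
import numpy as np

# Hypothetical 4 x 4 matrix: col2 = 2*col1 and col4 = col1 + col3, so rank 2.
A = np.array([[ 2.,  4., 1., 3.],
              [-1., -2., 1., 0.],
              [ 0.,  0., 2., 2.],
              [ 3.,  6., 2., 5.]])

s = np.linalg.svd(A, compute_uv=False)  # singular values, non-increasing
# Treat a singular value as zero when it falls below a tolerance tied to
# the largest singular value, the matrix size, and machine epsilon
# (the default criterion used by numpy.linalg.matrix_rank).
tol = max(A.shape) * np.finfo(A.dtype).eps * s[0]
numerical_rank = int(np.sum(s > tol))
assert numerical_rank == np.linalg.matrix_rank(A) == 2
```

In floating point the two trailing singular values are not exactly zero, only negligibly small; the tolerance is what turns "negligibly small" into a rank decision.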

Applications
One useful application of calculating the rank of a matrix is the computation of the number of solutions of a system
of linear equations. According to the Rouché–Capelli theorem, the system is inconsistent if the rank of the
augmented matrix is greater than the rank of the coefficient matrix. If, on the other hand, ranks of these two matrices
are equal, the system must have at least one solution. The solution is unique if and only if the rank equals the number
of variables. Otherwise the general solution has k free parameters where k is the difference between the number of
variables and the rank.
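The Rouché–Capelli classification amounts to comparing two ranks. A sketch (assuming NumPy; `classify_system` and the sample systems are illustrations of my own):

```python
import numpy as np

def classify_system(A, b):
    """Classify the linear system A x = b via the Rouché–Capelli theorem."""
    r = np.linalg.matrix_rank(A)
    r_aug = np.linalg.matrix_rank(np.column_stack([A, b]))
    n_vars = A.shape[1]
    if r_aug > r:
        return "inconsistent"                  # no solution
    if r == n_vars:
        return "unique solution"
    return f"{n_vars - r} free parameter(s)"   # infinitely many solutions

A = np.array([[1., 1.],
              [2., 2.]])                       # rank 1
assert classify_system(A, np.array([1., 3.])) == "inconsistent"
assert classify_system(A, np.array([1., 2.])) == "1 free parameter(s)"
assert classify_system(np.eye(2), np.array([1., 2.])) == "unique solution"
```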
In control theory, the rank of a matrix can be used to determine whether a linear system is controllable, or
observable.

Generalization
There are different generalisations of the concept of rank to matrices over arbitrary rings. In those generalisations,
column rank, row rank, dimension of column space and dimension of row space of a matrix may be different from
the others or may not exist.
Thinking of matrices as tensors, the tensor rank generalizes to arbitrary tensors; note that for tensors of order greater
than 2 (matrices are order 2 tensors), rank is very hard to compute, unlike for matrices.
There is a notion of rank for smooth maps between smooth manifolds. It is equal to the linear rank of the derivative.

Matrices as tensors
Matrix rank should not be confused with tensor order, which is called tensor rank. Tensor order is the number of
indices required to write a tensor, and thus matrices all have tensor order 2. More precisely, matrices are tensors of
type (1,1), having one row index and one column index, also called covariant order 1 and contravariant order 1; see
Tensor (intrinsic definition) for details.
Note that the tensor rank of a matrix can also mean the minimum number of simple tensors necessary to express the
matrix as a linear combination, and that this definition does agree with matrix rank as here discussed.

References
[1] Proof: Apply the rank-nullity theorem to the inequality
dim ker(AB) ≤ dim ker(A) + dim ker(B).
[2] Proof: The map from ker(ABC)/ker(BC) to ker(AB)/ker(B) defined by x + ker(BC) ↦ Cx + ker(B)
is well-defined and injective. We thus obtain the inequality in terms of dimensions of kernels, which can then be converted to the inequality in
terms of ranks by the rank-nullity theorem. Alternatively, if M is a linear subspace then dim(AM) ≤ dim(M); apply this inequality
to the subspace defined by the orthogonal complement of the image of BC in the image of B, whose dimension is
rank(B) − rank(BC); its image under A has dimension at least rank(AB) − rank(ABC), which gives the inequality.
[3] Leon Mirsky: An Introduction to Linear Algebra, 1990, ISBN 0-486-66434-1

Further reading
• Horn, Roger A. and Johnson, Charles R. Matrix Analysis. Cambridge University Press, 1985. ISBN
0-521-38632-2.
• Kaw, Autar K. Two Chapters from the book Introduction to Matrix Algebra: 1. Vectors (http://
numericalmethods.eng.usf.edu/mws/che/04sle/mws_che_sle_bck_vectors.pdf) and System of Equations
(http://numericalmethods.eng.usf.edu/mws/che/04sle/mws_che_sle_bck_system.pdf)
• Mike Brookes: Matrix Reference Manual. (http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/property.
html#rank)

Resampling (statistics)
In statistics, resampling is any of a variety of methods for doing one of the following:
1. Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data
(jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
2. Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests,
randomization tests, or re-randomization tests)
3. Validating models by using random subsets (bootstrapping, cross validation)
Common resampling techniques include bootstrapping, jackknifing and permutation tests.

Bootstrap
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with
replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors
and confidence intervals of a population parameter like a mean, median, proportion, odds ratio, correlation
coefficient or regression coefficient. It may also be used for constructing hypothesis tests. It is often used as a robust
alternative to inference based on parametric assumptions when those assumptions are in doubt, or where parametric
inference is impossible or requires very complicated formulas for the calculation of standard errors.
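A minimal bootstrap sketch in pure Python (the data and the helper name `bootstrap_se` are illustrations of my own): resample with replacement, recompute the statistic for each resample, and take the standard deviation of the replicates as the estimated standard error:

```python
import random
import statistics

def bootstrap_se(data, stat, n_boot=2000, seed=42):
    """Bootstrap standard error: resample with replacement, recompute the
    statistic for each resample, return the std. dev. of the replicates."""
    rng = random.Random(seed)
    reps = [stat([rng.choice(data) for _ in data]) for _ in range(n_boot)]
    return statistics.stdev(reps)

sample = [12, 15, 9, 22, 18, 14, 25, 11, 17, 20]
se_median = bootstrap_se(sample, statistics.median)
se_mean = bootstrap_se(sample, statistics.mean)
```

Note that the same routine works unchanged for the median, a statistic whose standard error has no simple closed form; this is the practical appeal of the method.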

Jackknife
Jackknifing, which is similar to bootstrapping, is used in statistical inference to estimate the bias and standard error
(variance) of a statistic, when a random sample of observations is used to calculate it. The basic idea behind the
jackknife variance estimator lies in systematically recomputing the statistic estimate leaving out one or more
observations at a time from the sample set. From this new set of replicates of the statistic, an estimate for the bias
and an estimate for the variance of the statistic can be calculated.
Both methods, the bootstrap and the jackknife, estimate the variability of a statistic from the variability of that
statistic between subsamples, rather than from parametric assumptions. The bootstrap can be seen as a random
approximation of the more general delete-m observations jackknife. Both yield similar numerical
results, which is why each can be seen as approximation to the other. Although there are huge theoretical differences
in their mathematical insights, the main practical difference for statistics users is that the bootstrap gives different
results when repeated on the same data, whereas the jackknife gives exactly the same result each time. Because of
this, the jackknife is popular when the estimates need to be verified several times before publishing (e.g. official
statistics agencies). On the other hand, when this verification feature is not crucial and it is of interest not to have a
number but just an idea of its distribution the bootstrap is preferred (e.g. studies in physics, economics, biological
sciences).

Whether to use the bootstrap or the jackknife may depend less on statistical concerns than on operational aspects of a
survey. The bootstrap provides a powerful and easy way to estimate not just the variance of a point estimator but its
whole distribution, at the cost of being highly computer-intensive. On the other hand, the jackknife (originally used for
bias reduction) only provides estimates of the variance of the point estimator. This can be enough for basic statistical
inference (e.g. hypothesis testing, confidence intervals). Hence, the jackknife is a specialized method for estimating
variances, whereas the bootstrap first estimates the whole distribution, from which the variance is then assessed.
"The bootstrap can be applied to both variance and distribution estimation problems. However, the bootstrap
variance estimator is not as good as the jackknife or the balanced repeated replication (BRR) variance estimator in
terms of the empirical results. Furthermore, the bootstrap variance estimator usually requires more computations
than the jackknife or the BRR. Thus, the bootstrap is mainly recommended for distribution estimation."[1]
There is a special consideration with the jackknife, particularly with the delete-1 observation jackknife. It should
only be used with smooth, differentiable statistics (e.g. totals, means, proportions, ratios, odds ratios, regression
coefficients), but not with medians or quantiles. This may become a practical disadvantage (or not,
depending on the needs of the user), and it is the usual argument against the jackknife and in favor of the
bootstrap. More general jackknives than the delete-1, such as the delete-m jackknife, overcome this problem for
medians and quantiles by relaxing the smoothness requirements for consistent variance estimation.
Usually the jackknife is easier to apply to complex sampling schemes than the bootstrap. Complex sampling schemes
may involve stratification, multiple stages (clustering), varying sampling weights (non-response adjustments,
calibration, post-stratification) and unequal-probability sampling designs. Theoretical aspects of both the
bootstrap and the jackknife can be found in,[2] whereas a basic introduction is given in.[3]
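The delete-1 recomputation described above can be sketched in pure Python (the data and the helper name `jackknife` are illustrative). For the sample mean, the jackknife bias estimate is zero and the standard-error estimate reduces to the classical s/√n:

```python
import statistics

def jackknife(data, stat):
    """Delete-1 jackknife estimates of the bias and standard error of `stat`."""
    n = len(data)
    theta_hat = stat(data)
    # Systematically recompute the statistic, leaving out one observation at a time.
    reps = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    mean_rep = statistics.mean(reps)
    bias = (n - 1) * (mean_rep - theta_hat)
    variance = (n - 1) / n * sum((r - mean_rep) ** 2 for r in reps)
    return bias, variance ** 0.5

data = [12, 15, 9, 22, 18, 14, 25, 11, 17, 20]
bias, se = jackknife(data, statistics.mean)
# For the sample mean: zero bias estimate, and se equals s / sqrt(n).
assert abs(bias) < 1e-9
assert abs(se - statistics.stdev(data) / len(data) ** 0.5) < 1e-9
```

Unlike the bootstrap sketch earlier in this article, this procedure is deterministic: rerunning it on the same data always returns exactly the same estimates.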

Cross-validation
Cross-validation is a statistical method for validating a predictive model. Subsets of the data are held out for use as
validating sets; a model is fit to the remaining data (a training set) and used to predict for the validation set.
Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.
One form of cross-validation leaves out a single observation at a time; this is similar to the jackknife. Another,
K-fold cross-validation, splits the data into K subsets; each is held out in turn as the validation set.
This avoids "self-influence". For comparison, in regression analysis methods such as linear regression, each y value
draws the regression line toward itself, making the prediction of that value appear more accurate than it really is.
Cross-validation applied to linear regression predicts the y value for each observation without using that observation.
This is often used for deciding how many predictor variables to use in regression. Without cross-validation, adding
predictors always reduces the residual sum of squares (or possibly leaves it unchanged). In contrast, the
cross-validated mean-square error will tend to decrease if valuable predictors are added, but increase if worthless
predictors are added.
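A sketch of K-fold cross-validation for simple linear regression, in pure Python (the data and the helper name `kfold_mse` are illustrative): each fold is held out once, the line y = a + b·x is fit on the remaining folds, and squared prediction errors on the held-out points are averaged:

```python
import random

def kfold_mse(xs, ys, k=5, seed=0):
    """K-fold cross-validated mean squared error for simple linear regression."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_err, count = 0.0, 0
    for held_out in folds:
        train = [i for i in idx if i not in held_out]
        # Ordinary least-squares fit of y = a + b*x on the training folds.
        mx = sum(xs[i] for i in train) / len(train)
        my = sum(ys[i] for i in train) / len(train)
        b = (sum((xs[i] - mx) * (ys[i] - my) for i in train)
             / sum((xs[i] - mx) ** 2 for i in train))
        a = my - b * mx
        for i in held_out:                 # predict each y without using it
            sq_err += (ys[i] - (a + b * xs[i])) ** 2
            count += 1
    return sq_err / count

xs = list(range(20))
ys = [2.0 * x + 1.0 + random.Random(x).uniform(-1.0, 1.0) for x in xs]
cv_mse = kfold_mse(xs, ys)
assert 0.0 < cv_mse < 3.0
```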

Permutation tests
A permutation test (also called a randomization test, re-randomization test, or an exact test) is a type of statistical
significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all
possible values of the test statistic under rearrangements of the labels on the observed data points. In other words, the
method by which treatments are allocated to subjects in an experimental design is mirrored in the analysis of that
design. If the labels are exchangeable under the null hypothesis, then the resulting tests yield exact significance
levels; see also exchangeability. Confidence intervals can then be derived from the tests. The theory has evolved
from the works of R.A. Fisher and E.J.G. Pitman in the 1930s.
To illustrate the basic idea of a permutation test, suppose we have two groups A and B whose sample means are x̄_A
and x̄_B, and that we want to test, at 5% significance level, whether they come from the same distribution. Let
n_A and n_B be the sample size corresponding to each group. The permutation test is designed to determine whether the
observed difference between the sample means is large enough to reject the null hypothesis H0 that the two groups
have identical probability distributions.
The test proceeds as follows. First, the difference in means between the two samples is calculated: this is the
observed value of the test statistic, T(obs). Then the observations of groups A and B are pooled.
Next, the difference in sample means is calculated and recorded for every possible way of dividing these pooled
values into two groups of size n_A and n_B (i.e., for every permutation of the group labels A and B). The set of
these calculated differences is the exact distribution of possible differences under the null hypothesis that the group
label does not matter.
The one-sided p-value of the test is calculated as the proportion of sampled permutations where the difference in
means was greater than or equal to T(obs). The two-sided p-value of the test is calculated as the proportion of
sampled permutations where the absolute difference was greater than or equal to |T(obs)|.
If the only purpose of the test is to reject or not reject the null hypothesis, we can as an alternative sort the recorded
differences, and then observe if T(obs) is contained within the middle 95% of them. If it is not, we reject the
hypothesis of identical probability curves at the 5% significance level.
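The procedure above can be sketched in pure Python for two small hypothetical samples. Because the groups are well separated, only the observed split and its mirror image reach |T(obs)|:

```python
from itertools import combinations

def exact_permutation_test(a, b):
    """Two-sided exact permutation test for a difference in means: enumerate
    every division of the pooled data into groups of the original sizes."""
    pooled = a + b
    t_obs = abs(sum(a) / len(a) - sum(b) / len(b))
    count = total = 0
    for ids in combinations(range(len(pooled)), len(a)):
        grp_a = [pooled[i] for i in ids]
        grp_b = [pooled[i] for i in range(len(pooled)) if i not in ids]
        t = abs(sum(grp_a) / len(grp_a) - sum(grp_b) / len(grp_b))
        total += 1
        if t >= t_obs - 1e-12:          # tolerance for floating-point ties
            count += 1
    return count / total

a = [19.8, 23.4, 22.1, 24.0]            # hypothetical treatment group
b = [15.2, 16.9, 14.8, 17.5]            # hypothetical control group
p_two_sided = exact_permutation_test(a, b)
# Only the observed split and its mirror are this extreme: p = 2/70.
assert abs(p_two_sided - 2 / 70) < 1e-9
```

With n_A = n_B = 4 there are C(8, 4) = 70 relabelings, so complete enumeration is trivial; for larger samples the Monte Carlo variant discussed later in this article is used instead.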

Relation to parametric tests


Permutation tests are a subset of non-parametric statistics. The basic premise is to use only the assumption that it is
possible that all of the treatment groups are equivalent, and that every member of them is the same before sampling
began (i.e. the slot that they fill is not differentiable from other slots before the slots are filled). From this, one can
calculate a statistic and then see to what extent this statistic is special by seeing how likely it would be if the
treatment assignments had been jumbled.
In contrast to permutation tests, the reference distributions for many popular "classical" statistical tests, such as the
t-test, F-test, z-test and χ² test, are obtained from theoretical probability distributions. Fisher's exact test is an
example of a commonly used permutation test for evaluating the association between two dichotomous variables.
When sample sizes are large, the Pearson's chi-square test will give accurate results. For small samples, the
chi-square reference distribution cannot be assumed to give a correct description of the probability distribution of the
test statistic, and in this situation the use of Fisher's exact test becomes more appropriate. A rule of thumb is that the
expected count in each cell of the table should be greater than 5 before Pearson's chi-squared test is used.
Permutation tests exist in many situations where parametric tests do not (e.g., when deriving an optimal test when
losses are proportional to the size of an error rather than its square). All simple and many relatively complex
parametric tests have a corresponding permutation test version that is defined by using the same test statistic as the
parametric test, but obtains the p-value from the sample-specific permutation distribution of that statistic, rather than
from the theoretical distribution derived from the parametric assumption. For example, it is possible in this manner
to construct a permutation t-test, a permutation chi-squared test of association, a permutation version of Aly's test for
comparing variances and so on.
The major downsides to permutation tests are that they
• Can be computationally intensive and may require "custom" code for difficult-to-calculate statistics. This must be
rewritten for every case.
• Are primarily used to provide a p-value. The inversion of the test to get confidence regions/intervals requires even
more computation.

Advantages
Permutation tests exist for any test statistic, regardless of whether or not its distribution is known. Thus one is always
free to choose the statistic which best discriminates between hypothesis and alternative and which minimizes losses.
Permutation tests can be used for analyzing unbalanced designs [4] and for combining dependent tests on mixtures of
categorical, ordinal, and metric data (Pesarin, 2001).
Before the 1980s, the burden of creating the reference distribution was overwhelming except for data sets with small
sample sizes.
Since the 1980s, the confluence of relatively inexpensive fast computers and the development of new sophisticated
path algorithms applicable in special situations has made the application of permutation test methods practical for a
wide range of problems. It also initiated the addition of exact-test options in the main statistical software packages
and the appearance of specialized software for performing a wide range of uni- and multi-variable exact tests and
computing test-based "exact" confidence intervals.

Limitations
An important assumption behind a permutation test is that the observations are exchangeable under the null
hypothesis. An important consequence of this assumption is that tests of difference in location (like a permutation
t-test) require equal variance. In this respect, the permutation t-test shares the same weakness as the classical
Student's t-test (the Behrens–Fisher problem). A third alternative in this situation is to use a bootstrap-based test.
Good (2000) explains the difference between permutation tests and bootstrap tests the following way: "Permutations
test hypotheses concerning distributions; bootstraps test hypotheses concerning parameters. As a result, the bootstrap
entails less-stringent assumptions." Of course, bootstrap tests are not exact.

Monte Carlo testing


An asymptotically equivalent permutation test can be created when there are too many possible orderings of the data
to allow complete enumeration in a convenient manner. This is done by generating the reference distribution by
Monte Carlo sampling, which takes a small (relative to the total number of permutations) random sample of the
possible replicates. The realization that this could be applied to any permutation test on any dataset was an important
breakthrough in the area of applied statistics. The earliest known reference to this approach is Dwass (1957).[5] This
type of permutation test is known under various names: approximate permutation test, Monte Carlo permutation
tests or random permutation tests.[6]
After random permutations, it is possible to obtain a confidence interval for the p-value based on the Binomial
distribution. For example, if after random permutations the p-value is estimated to be , then a 99% confidence
interval for the true (the one that would result from trying all possible permutations) is .
On the other hand, the purpose of estimating the p-value is most often to decide whether , where is the threshold at
which the null hypothesis will be rejected (typically ). In the example above, the confidence interval only tells us that
there is roughly a 50% chance that the p-value is smaller than 0.05, i.e. it is completely unclear whether the null
hypothesis should be rejected at a level .
If it is only important to know whether p ≤ α for a given α, it is logical to continue simulating until the statement can be
established to be true or false with a very low probability of error. Given a bound ε on the admissible probability of
error (the probability of concluding that p > α when in fact p ≤ α, or vice versa), the question of how many permutations to generate
can be seen as the question of when to stop generating permutations, based on the outcomes of the simulations so far,
in order to guarantee that the conclusion (which is either p ≤ α or p > α) is correct with probability at least as large as 1 − ε. (ε will
typically be chosen to be extremely small, e.g. 1/1000.) Stopping rules to achieve this have been developed[7] which
can be incorporated with minimal additional computational cost. In fact, depending on the true underlying p-value it
will often be found that the number of simulations required is remarkably small (e.g. as low as 5 and often not larger
than 100) before a decision can be reached with virtual certainty.
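The sampling scheme above can be sketched as follows. This is a minimal illustration: the two data samples and the choice of an absolute mean-difference statistic are invented for the example, and 2.58 is the 99% normal-approximation quantile for the binomial confidence interval.

```python
import math
import random

def perm_test_pvalue(x, y, n_perm=10_000, seed=0):
    """Monte Carlo permutation test for a difference in means.

    Returns the estimated p-value and a 99% normal-approximation
    confidence interval for the true permutation p-value.
    """
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # one random relabelling
        xs, ys = pooled[:len(x)], pooled[len(x):]
        stat = abs(sum(xs) / len(xs) - sum(ys) / len(ys))
        if stat >= observed:
            hits += 1
    p_hat = (hits + 1) / (n_perm + 1)            # add-one correction keeps p > 0
    half = 2.58 * math.sqrt(p_hat * (1 - p_hat) / n_perm)
    return p_hat, (max(0.0, p_hat - half), min(1.0, p_hat + half))

x = [94, 197, 16, 38, 99, 141, 23]
y = [52, 104, 146, 10, 51, 30, 40, 27, 46]
p, (lo, hi) = perm_test_pvalue(x, y)
```

If the resulting interval straddles α, one would keep simulating (or apply a sequential stopping rule as discussed above) rather than decide.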

Bibliography

Introductory statistics
• Good, P. (2005) Introduction to Statistics Through Resampling Methods and R/S-PLUS. Wiley. ISBN
0-471-71575-1
• Good, P. (2005) Introduction to Statistics Through Resampling Methods and Microsoft Office Excel. Wiley. ISBN
0-471-73191-9
• Hesterberg, T. C., D. S. Moore, S. Monaghan, A. Clipson, and R. Epstein (2005). Bootstrap Methods and
Permutation Tests.
• Wolter, K.M. (2007). Introduction to Variance Estimation. Second Edition. Springer, Inc.

Bootstrapping
• Efron, Bradley (1979). "Bootstrap methods: Another look at the jackknife" [8], The Annals of Statistics, 7, 1-26.
• Efron, Bradley (1981). "Nonparametric estimates of standard error: The jackknife, the bootstrap and other
methods", Biometrika, 68, 589-599.
• Efron, Bradley (1982). The jackknife, the bootstrap, and other resampling plans, In Society of Industrial and
Applied Mathematics CBMS-NSF Monographs, 38.
• Diaconis, P.; Efron, Bradley (1983), "Computer-intensive methods in statistics," Scientific American, May,
116-130.
• Efron, Bradley; Tibshirani, Robert J. (1993). An introduction to the bootstrap, New York: Chapman & Hall,
software [9].
• Davison, A. C. and Hinkley, D. V. (1997): Bootstrap Methods and their Application, software [10].
• Mooney, C Z & Duval, R D (1993). Bootstrapping. A Nonparametric Approach to Statistical Inference. Sage
University Paper series on Quantitative Applications in the Social Sciences, 07-095. Newbury Park, CA: Sage.
• Simon, J. L. (1997): Resampling: The New Statistics [11].

Jackknife
• Berger, Y.G. (2007). A jackknife variance estimator for unistage stratified samples with unequal probabilities.
Biometrika. Vol. 94, 4, pp. 953–964.
• Berger, Y.G. and Rao, J.N.K. (2006). Adjusted jackknife for imputation under unequal probability sampling
without replacement. Journal of the Royal Statistical Society B. Vol. 68, 3, pp. 531–547.
• Berger, Y.G. and Skinner, C.J. (2005). A jackknife variance estimator for unequal probability sampling. Journal
of the Royal Statistical Society B. Vol. 67, 1, pp. 79–89.
• Jiang, J., Lahiri, P. and Wan, S-M. (2002). A unified jackknife theory for empirical best prediction with
M-estimation. The Annals of Statistics. Vol. 30, 6, pp. 1782–810.
• Jones, H.L. (1974). Jackknife estimation of functions of stratum means. Biometrika. Vol. 61, 2, pp. 343–348.
• Kish, L. and Frankel M.R. (1974). Inference from complex samples. Journal of the Royal Statistical Society B.
Vol. 36, 1, pp. 1–37.
• Krewski, D. and Rao, J.N.K. (1981). Inference from stratified samples: properties of the linearization, jackknife
and balanced repeated replication methods. The Annals of Statistics. Vol. 9, 5, pp. 1010–1019.
• Quenouille, M.H. (1956). Notes on bias in estimation. Biometrika. Vol. 43, pp. 353–360.
• Rao, J.N.K. and Shao, J. (1992). Jackknife variance estimation with survey data under hot deck imputation.
Biometrika. Vol. 79, 4, pp. 811–822.
• Rao, J.N.K., Wu, C.F.J. and Yue, K. (1992). Some recent work on resampling methods for complex surveys.
Survey Methodology. Vol. 18, 2, pp. 209–217.
• Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. Springer-Verlag, Inc.
• Tukey, J.W. (1958). Bias and confidence in not-quite large samples (abstract). The Annals of Mathematical
Statistics. Vol. 29, 2, pp. 614.
• Wu, C.F.J. (1986). Jackknife, Bootstrap and other resampling methods in regression analysis. The Annals of
Statistics. Vol. 14, 4, pp. 1261–1295.

Monte Carlo methods


• George S. Fishman (1995). Monte Carlo: Concepts, Algorithms, and Applications, Springer, New York. ISBN
0-387-94527-X.
• James E. Gentle (2009). Computational Statistics, Springer, New York. Part III: Methods of Computational
Statistics. ISBN 978-0-387-98143-7.
• Dirk P. Kroese, Thomas Taimre and Zdravko I. Botev. Handbook of Monte Carlo Methods, John Wiley & Sons,
New York. ISBN 978-0-470-17793-8.
• Christian P. Robert and George Casella (2004). Monte Carlo Statistical Methods, Second ed., Springer, New
York. ISBN 0-387-21239-6.
• Shlomo Sawilowsky and Gail Fahoome (2003). Statistics via Monte Carlo Simulation with Fortran. Rochester
Hills, MI: JMASM. ISBN 0-9740236-0-4.

Permutation test
Original references:
• Fisher, R.A. (1935) The Design of Experiments, New York: Hafner
• Pitman, E. J. G. (1937) "Significance tests which may be applied to samples from any population", Royal
Statistical Society Supplement, 4: 119-130 and 225-32 (parts I and II). JSTOR 2984124 JSTOR 2983647
• Pitman, E. J. G. (1938) "Significance tests which may be applied to samples from any population. Part III. The
analysis of variance test", Biometrika, 29 (3-4): 322-335. doi:10.1093/biomet/29.3-4.322
Modern references:
• Edgington. E.S. (1995) Randomization tests, 3rd ed. New York: Marcel-Dekker
• Good, Phillip I. (2005) Permutation, Parametric and Bootstrap Tests of Hypotheses, 3rd ed., Springer ISBN
0-387-98898-X
• Good, P. (2002) "Extensions of the concept of exchangeability and their applications", J. Modern Appl. Statist.
Methods, 1:243-247.
• Lunneborg, Cliff. (1999) Data Analysis by Resampling, Duxbury Press. ISBN 0-534-22110-6.
• Pesarin, F. (2001). Multivariate Permutation Tests : With Applications in Biostatistics, John Wiley & Sons. ISBN
978-0471496700
• Welch, W. J. (1990) "Construction of permutation tests", Journal of the American Statistical Association,
85:693-698.
Computational methods:
• Mehta, C. R.; Patel, N. R. (1983). "A network algorithm for performing Fisher's exact test in r x c contingency
tables", J. Amer. Statist. Assoc, 78(382):427–434.
• Mehta, C. R.; Patel, N. R.; Senchaudhuri, P. (1988). "Importance sampling for estimating exact probabilities in
permutational inference", J. Am. Statist. Assoc., 83(404):999–1005.
• Gill, P. M. W. (2007). "Efficient calculation of p-values in linear-statistic permutation significance tests", Journal
of Statistical Computation and Simulation , 77(1):55-61. doi:10.1080/10629360500108053

Resampling methods
• Good, P. (2006) Resampling Methods. 3rd Ed. Birkhauser.
• Wolter, K.M. (2007). Introduction to Variance Estimation. 2nd Edition. Springer, Inc.

External links

Current research on permutation tests


• Bootstrap Sampling tutorial [12]
• Hesterberg, T. C., D. S. Moore, S. Monaghan, A. Clipson, and R. Epstein (2005): Bootstrap Methods and
Permutation Tests [13], software [14].
• Moore, D. S., G. McCabe, W. Duckworth, and S. Sclove (2003): Bootstrap Methods and Permutation Tests [15]
• Simon, J. L. (1997): Resampling: The New Statistics [11].
• Yu, Chong Ho (2003): Resampling methods: concepts, applications, and justification. Practical Assessment,
Research & Evaluation, 8(19) [16]. (statistical bootstrapping)
• Resampling: A Marriage of Computers and Statistics (ERIC Digests) [17]

Software
• Angelo Canty and Brian Ripley (2010). boot: Bootstrap R (S-Plus) Functions. R package version 1.2-43. [18]
Functions and datasets for bootstrapping from the book Bootstrap Methods and Their Applications by A. C.
Davison and D. V. Hinkley (1997, CUP).
• Statistics101: Resampling, Bootstrap, Monte Carlo Simulation program [19]

References
[1] Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. Springer-Verlag, Inc. pp. 281.
[2] Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. Springer-Verlag, Inc.
[3] Wolter, K.M. (2007). Introduction to Variance Estimation. Second Edition. Springer, Inc.
[4] http://tbf.coe.wayne.edu/jmasm/vol1_no2.pdf
[5] Meyer Dwass, "Modified Randomization Tests for Nonparametric Hypotheses", The Annals of Mathematical Statistics, 28:181–187, 1957.
[6] Thomas E. Nichols, Andrew P. Holmes (2001). "Nonparametric Permutation Tests For Functional Neuroimaging: A Primer with Examples"
(http://www.fil.ion.ucl.ac.uk/spm/doc/papers/NicholsHolmes.pdf). Human Brain Mapping 15 (1): 1–25. doi:10.1002/hbm.1058.
PMID 11747097.
[7] Gandy, Axel (2009). "Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk". Journal of the American
Statistical Association 104 (488): 1504–1511.
[8] http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aos/1176344552
[9] http://lib.stat.cmu.edu/S/bootstrap.funs
[10] http://statwww.epfl.ch/davison/BMA/library.html
[11] http://www.resample.com/content/text/index.shtml
[12] http://people.revoledu.com/kardi/tutorial/Bootstrap/index.html
[13] http://bcs.whfreeman.com/ips5e/content/cat_080/pdf/moore14.pdf
[14] http://www.insightful.com/Hesterberg/bootstrap
[15] http://bcs.whfreeman.com/pbs/cat_140/chap18.pdf
[16] http://PAREonline.net/getvn.asp?v=8&n=19
[17] http://www.ericdigests.org/1993/marriage.htm
[18] http://cran.at.r-project.org/web/packages/boot/index.html
[19] http://www.statistics101.net

Schur complement
In linear algebra and the theory of matrices, the Schur complement of a matrix block (i.e., a submatrix within a
larger matrix) is defined as follows. Suppose A, B, C, D are respectively p×p, p×q, q×p and q×q matrices, and D is
invertible. Let

M = [ A  B ]
    [ C  D ]

so that M is a (p+q)×(p+q) matrix.

Then the Schur complement of the block D of the matrix M is the p×p matrix

M/D = A − BD⁻¹C.

It is named after Issai Schur, who used it to prove Schur's lemma, although it had been used previously.[1] Emilie
Haynsworth was the first to call it the Schur complement.[2]

Background
The Schur complement arises as the result of performing a block Gaussian elimination by multiplying the matrix M
from the right with the "block lower triangular" matrix

L = [ Ip      0  ]
    [ −D⁻¹C  Iq ]

Here Ip denotes a p×p identity matrix. After multiplication with the matrix L the Schur complement appears in the
upper p×p block. The product matrix is

ML = [ A − BD⁻¹C  B ]
     [ 0          D ]

That is, we have shown that

M = [ A − BD⁻¹C  B ] · L⁻¹,   with L⁻¹ = [ Ip     0  ]
    [ 0          D ]                     [ D⁻¹C  Iq ]

and the inverse of M may thus be expressed using only D⁻¹ and the inverse of the Schur complement S = A − BD⁻¹C (if it exists) as

M⁻¹ = [ S⁻¹         −S⁻¹BD⁻¹            ]
      [ −D⁻¹CS⁻¹    D⁻¹ + D⁻¹CS⁻¹BD⁻¹   ]

Cf. the matrix inversion lemma, which illustrates relationships between the above and the equivalent derivation with the
roles of A and D interchanged.
If M is a positive-definite symmetric matrix, then so is the Schur complement of D in M.
If p and q are both 1 (i.e. A, B, C and D are all scalars), we get the familiar formula for the inverse of a 2-by-2
matrix:

M⁻¹ = 1/(AD − BC) · [ D   −B ]
                    [ −C   A ]

provided that AD − BC is non-zero.
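The block-inverse formula above can be checked numerically. A minimal sketch with numpy (block sizes and the shift keeping D invertible are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 2
A = rng.standard_normal((p, p))
B = rng.standard_normal((p, q))
C = rng.standard_normal((q, p))
D = rng.standard_normal((q, q)) + 5 * np.eye(q)   # shift keeps D invertible

M = np.block([[A, B], [C, D]])
S = A - B @ np.linalg.inv(D) @ C                  # Schur complement of D in M

# Assemble M^{-1} from D^{-1} and S^{-1} only, then compare with direct inversion.
Dinv, Sinv = np.linalg.inv(D), np.linalg.inv(S)
Minv = np.block([
    [Sinv,             -Sinv @ B @ Dinv],
    [-Dinv @ C @ Sinv,  Dinv + Dinv @ C @ Sinv @ B @ Dinv],
])
assert np.allclose(Minv, np.linalg.inv(M))
```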



Application to solving linear equations


The Schur complement arises naturally in solving a system of linear equations such as

Ax + By = a
Cx + Dy = b

where x, a are p-dimensional column vectors, y, b are q-dimensional column vectors, and A, B, C, D are as above.
Multiplying the bottom equation by BD⁻¹ and then subtracting from the top equation one obtains

(A − BD⁻¹C) x = a − BD⁻¹b.

Thus if one can invert D as well as the Schur complement of D, one can solve for x, and then by using the equation
Cx + Dy = b one can solve for y = D⁻¹(b − Cx). This reduces the problem of inverting a (p+q)×(p+q) matrix to that of
inverting a p×p matrix and a q×q matrix. In practice one needs D to be well-conditioned in order for this algorithm to
be numerically accurate.
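The elimination steps above can be sketched in code. The block sizes and diagonal shifts (which keep D and the Schur complement well-conditioned) are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 4, 3
A = rng.standard_normal((p, p)) + 4 * np.eye(p)
B = rng.standard_normal((p, q))
C = rng.standard_normal((q, p))
D = rng.standard_normal((q, q)) + 4 * np.eye(q)
a = rng.standard_normal(p)
b = rng.standard_normal(q)

# Eliminate y: (A - B D^{-1} C) x = a - B D^{-1} b, then back-substitute for y.
S = A - B @ np.linalg.solve(D, C)
x = np.linalg.solve(S, a - B @ np.linalg.solve(D, b))
y = np.linalg.solve(D, b - C @ x)

# Compare with solving the full (p+q)x(p+q) system directly.
full = np.linalg.solve(np.block([[A, B], [C, D]]), np.concatenate([a, b]))
assert np.allclose(np.concatenate([x, y]), full)
```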

Applications to probability theory and statistics


Suppose the random column vectors X, Y live in Rⁿ and Rᵐ respectively, and the vector (X, Y) in Rⁿ⁺ᵐ has a
multivariate normal distribution whose variance is the symmetric positive-definite matrix

V = [ A   B ]
    [ Bᵀ  C ]

where A is n-by-n and C is m-by-m.

Then the conditional variance of X given Y is the Schur complement of C in V:

Var(X | Y) = A − BC⁻¹Bᵀ.

If we take the matrix V above to be, not a variance of a random vector, but a sample variance, then it may have a
Wishart distribution. In that case, the Schur complement of C in V also has a Wishart distribution.
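As a small numerical illustration of the conditional-variance formula (the covariance entries below are invented for the example, with X in R² and Y in R¹):

```python
import numpy as np

A = np.array([[2.0, 0.3], [0.3, 1.0]])   # Var(X)
B = np.array([[0.5], [0.2]])             # Cov(X, Y)
C = np.array([[1.5]])                    # Var(Y)
V = np.block([[A, B], [B.T, C]])

# Conditional variance Var(X | Y) is the Schur complement of C in V.
cond = A - B @ np.linalg.inv(C) @ B.T
assert np.all(np.linalg.eigvalsh(cond) > 0)   # still positive definite
```

Conditioning on Y shrinks the variance of X: each eigenvalue of the Schur complement is no larger than the corresponding eigenvalue of A.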

Schur complement condition for positive definiteness


Let X be a symmetric matrix given by

X = [ A   B ]
    [ Bᵀ  C ]

Let S be the Schur complement of A in X, that is:

S = C − BᵀA⁻¹B.

Then
• X is positive definite if and only if A and S are both positive definite.
• X is positive definite if and only if C and the Schur complement of C, A − BC⁻¹Bᵀ, are both positive definite.
• If A is positive definite, then X is positive semidefinite if and only if S is positive semidefinite.
• If C is positive definite, then X is positive semidefinite if and only if A − BC⁻¹Bᵀ is positive semidefinite.
These statements can be derived by considering the minimizer of the quantity

uᵀAu + 2vᵀBᵀu + vᵀCv

as a function of u (for fixed v).[3]
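The first equivalence above can be spot-checked numerically. A minimal sketch (the construction G Gᵀ + 0.1 I, which guarantees a positive-definite test matrix, and the 3/2 block split are arbitrary choices for the example):

```python
import numpy as np

def is_pd(M):
    """Positive definiteness via the smallest eigenvalue."""
    return bool(np.all(np.linalg.eigvalsh(M) > 0))

rng = np.random.default_rng(2)
G = rng.standard_normal((5, 5))
X = G @ G.T + 0.1 * np.eye(5)        # symmetric positive definite by construction
A, B, C = X[:3, :3], X[:3, 3:], X[3:, 3:]

S = C - B.T @ np.linalg.inv(A) @ B   # Schur complement of A in X
# X > 0 exactly when A > 0 and the Schur complement X/A > 0.
assert is_pd(X) == (is_pd(A) and is_pd(S))
```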

References
[1] Zhang, Fuzhen (2005). The Schur Complement and Its Applications. Springer. doi:10.1007/b105056. ISBN 0-387-24271-6.
[2] Haynsworth, E. V., "On the Schur Complement", Basel Mathematical Notes, #BNB 20, 17 pages, June 1968.
[3] Boyd, S. and Vandenberghe, L. (2004), "Convex Optimization", Cambridge University Press (Appendix A.5.5)

Sign test
In statistics, the sign test can be used to test the hypothesis that there is "no difference in medians" between the
continuous distributions of two random variables X and Y, in the situation when we can draw paired samples from X
and Y. It is a non-parametric test which makes very few assumptions about the nature of the distributions under
test; this means that it has very general applicability but may lack the statistical power of other tests such as the
paired-samples t-test or the Wilcoxon signed-rank test.

Method
Let p = Pr(X > Y), and then test the null hypothesis H0: p = 0.50. In other words, the null hypothesis states that given
a random pair of measurements (xi, yi), each of xi and yi is equally likely to be the larger of the two.
To test the null hypothesis, independent pairs of sample data are collected from the populations {(x1, y1), (x2, y2), . .
., (xn, yn)}. Pairs are omitted for which there is no difference so that there is a possibility of a reduced sample of m
pairs.[1]
Then let W be the number of pairs for which yi − xi > 0. Assuming that H0 is true, then W follows a binomial
distribution W ~ b(m, 0.5). The "W" is for Frank Wilcoxon who developed the test, then later, the more powerful
Wilcoxon signed-rank test.[2]

Assumptions
Let Zi = Yi – Xi for i = 1, ... , n.
1. The differences Zi are assumed to be independent.
2. Each Zi comes from the same continuous population.
3. The values of Xi and Yi are ordered (measured on at least an ordinal scale), so the comparisons "greater than", "less
than", and "equal to" are meaningful.

Significance testing
Since the test statistic is expected to follow a binomial distribution, the standard binomial test is used to calculate
significance. The normal approximation to the binomial distribution can be used for large sample sizes, m > 25.[1]
The left-tail value is computed by Pr(W ≤ w), which is the p-value for the alternative H1: p < 0.50. This alternative
means that the X measurements tend to be higher.
The right-tail value is computed by Pr(W ≥ w), which is the p-value for the alternative H1: p > 0.50. This alternative
means that the Y measurements tend to be higher.
For a two-sided alternative H1 the p-value is twice the smaller tail-value.
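The procedure above can be sketched with exact binomial tail probabilities (the paired measurements below are invented for the example; ties are dropped as described under Method):

```python
from math import comb

def sign_test(pairs):
    """Two-sided sign test for paired samples (ties dropped).

    Returns (w, m, p_value), where w counts pairs with y > x
    and m is the number of non-tied pairs.
    """
    diffs = [y - x for x, y in pairs if y != x]
    m = len(diffs)
    w = sum(d > 0 for d in diffs)
    # Exact Binomial(m, 0.5) tail probabilities.
    left = sum(comb(m, k) for k in range(0, w + 1)) / 2**m    # Pr(W <= w)
    right = sum(comb(m, k) for k in range(w, m + 1)) / 2**m   # Pr(W >= w)
    return w, m, min(1.0, 2 * min(left, right))

pairs = [(142, 138), (140, 136), (144, 147), (144, 139), (142, 143),
         (146, 141), (149, 143), (150, 145), (142, 136), (148, 146)]
w, m, p = sign_test(pairs)
```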

References
[1] Mendenhall, W.; Wackerly, D. D. and Scheaffer, R. L. (1989), "15: Nonparametric statistics", Mathematical statistics with applications
(Fourth ed.), PWS-Kent, pp. 674–679, ISBN 0-534-92026-8
[2] Karas, J. & Savage, I.R. (1967) Publications of Frank Wilcoxon (1892–1965). Biometrics 23(1): 1–10

• Gibbons, J.D. and Chakraborti, S. (1992). Nonparametric Statistical Inference. Marcel Dekker Inc., New York.
• Kitchens, L.J.(2003). Basic Statistics and Data Analysis. Duxbury.
• Conover, W. J. (1980). Practical Nonparametric Statistics, 2nd ed. Wiley, New York.
• Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden and Day, San Francisco.

Singular value decomposition


In linear algebra, the singular value
decomposition (SVD) is a factorization of a real
or complex matrix, with many useful
applications in signal processing and statistics.
Formally, the singular value decomposition of an
m×n real or complex matrix M is a factorization
of the form

M = UΣV*

where U is an m×m real or complex unitary matrix, Σ is an m×n rectangular diagonal matrix with nonnegative real
numbers on the diagonal, and V* (the conjugate transpose of V) is an n×n real or complex unitary matrix. The
diagonal entries Σi,i of Σ are known as the singular values of M. The m columns of U and the n columns of V are
called the left-singular vectors and right-singular vectors of M, respectively.
[Figure: Visualization of the SVD of a two-dimensional, real shearing matrix M. First, we see the unit disc in blue
together with the two canonical unit vectors. We then see the action of M, which distorts the disk to an ellipse. The
SVD decomposes M into three simple transformations: a rotation V*, a scaling Σ along the rotated coordinate axes,
and a second rotation U. The lengths σ1 and σ2 of the semi-axes of the ellipse are the singular values of M.]
The singular value decomposition and the eigendecomposition are closely related. Namely:
• The left-singular vectors of M are eigenvectors of MM*.
• The right-singular vectors of M are eigenvectors of M*M.
• The non-zero-singular values of M (found on the diagonal entries of Σ) are the square roots of the non-zero
eigenvalues of both M*M and MM*.
Applications which employ the SVD include computing the pseudoinverse, least squares fitting of data, matrix
approximation, and determining the rank, range and null space of a matrix.
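The definitional properties listed above can be verified directly with a numerical SVD. A minimal sketch (a random real 4×5 matrix is assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 5))

U, s, Vh = np.linalg.svd(M)            # full SVD: U is 4x4, s has 4 values, Vh is 5x5
Sigma = np.zeros((4, 5))
np.fill_diagonal(Sigma, s)

assert np.allclose(U @ Sigma @ Vh, M)          # M = U Sigma V*
assert np.allclose(U.T @ U, np.eye(4))         # U unitary (orthogonal in the real case)
assert np.allclose(Vh @ Vh.T, np.eye(5))       # V unitary
# Singular values are square roots of the eigenvalues of M M* (and of M* M).
eig = np.sort(np.linalg.eigvalsh(M @ M.T))[::-1]
assert np.allclose(np.sqrt(eig), s)
```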

Statement of the theorem


Suppose M is an m×n matrix whose entries come from the field K, which is either the field of real numbers or the
field of complex numbers. Then there exists a factorization of the form

M = UΣV*

where U is an m×m unitary matrix over K, the matrix Σ is an m×n diagonal matrix with nonnegative real numbers on
the diagonal, and the n×n unitary matrix V* denotes the conjugate transpose of V. Such a factorization is called the
singular value decomposition of M.


The diagonal entries of Σ are known as the singular values of M. A common convention is to list the singular
values in descending order. In this case, the diagonal matrix Σ is uniquely determined by M (though the matrices U
and V are not).

Intuitive interpretations

Rotation, scaling, rotation


In the special but common case in which M is just an m×m square matrix with positive determinant whose entries
are plain real numbers, then U, V*, and Σ are m×m matrices of real numbers as well, Σ can be regarded as a scaling
matrix, and U and V* can be viewed as rotation matrices.
If the abovementioned conditions are met, the expression M = UΣV* can thus be intuitively interpreted as a
composition (or sequence) of three geometrical transformations: a rotation, a scaling, and another rotation. For
instance, the figure above explains how a shear matrix can be described as such a sequence.

Singular values as semiaxes of an ellipse or ellipsoid


As shown in the figure, the singular values can be interpreted as the semiaxes of an ellipse in 2D. This concept can
be generalized to n-dimensional Euclidean space, with the singular values of any n×n square matrix being viewed as
the semiaxes of an n-dimensional ellipsoid. See below for further details.

The columns of U and V are orthonormal bases


Since U and V* are unitary, the columns of each of them form a set of orthonormal vectors, which can be regarded as
basis vectors. By the definition of unitary matrix, the same is true for their conjugate transposes U* and V. In short,
U, U*, V, and V* are orthonormal bases.

Example
Consider the 4×5 matrix

A singular value decomposition of this matrix is given by

Notice Σ is zero outside of the diagonal and one diagonal element is zero. Furthermore, because the matrices U and V*
are unitary, multiplying each by its respective conjugate transpose yields an identity matrix (U U* = I₄ and V V* = I₅). In
this case, because U and V* are real valued, they each are an orthogonal matrix.
This particular singular value decomposition is not unique. Because one singular value vanishes, the right-singular
vectors associated with it may be replaced by any orthonormal basis of the null space, and the result
is also a valid singular value decomposition.
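A sketch of this kind of example in code. The entries of the 4×5 matrix below are assumed for illustration (a rank-3 matrix with one zero singular value, matching the structure described above):

```python
import numpy as np

# A rank-3, 4x5 matrix: one singular value is zero.
M = np.array([[1., 0., 0., 0., 2.],
              [0., 0., 3., 0., 0.],
              [0., 0., 0., 0., 0.],
              [0., 4., 0., 0., 0.]])

U, s, Vh = np.linalg.svd(M)
# Rows of M are mutually orthogonal, so the singular values are the row norms, sorted.
assert np.allclose(s, [4.0, 3.0, np.sqrt(5.0), 0.0])

# Rebuild M from the factors: Sigma is 4x5 with s on the main diagonal.
Sigma = np.hstack([np.diag(s), np.zeros((4, 1))])
assert np.allclose(U @ Sigma @ Vh, M)
```

Because the zero singular value is degenerate in its choice of null-space directions, repeated SVD computations may return different last columns of V while still reconstructing the same M.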

Singular values, singular vectors, and their relation to the SVD


A non-negative real number σ is a singular value for M if and only if there exist unit-length vectors u in Kᵐ and v in
Kⁿ such that

Mv = σu  and  M*u = σv.

The vectors u and v are called left-singular and right-singular vectors for σ, respectively.
In any singular value decomposition M = UΣV*,
the diagonal entries of Σ are equal to the singular values of M. The columns of U and V are, respectively, left- and
right-singular vectors for the corresponding singular values. Consequently, the above theorem implies that:
• An m × n matrix M has at least one and at most p = min(m,n) distinct singular values.
• It is always possible to find an orthogonal basis U for Km consisting of left-singular vectors of M.
• It is always possible to find an orthogonal basis V for Kn consisting of right-singular vectors of M.
A singular value for which we can find two left (or right) singular vectors that are linearly independent is called
degenerate.
Non-degenerate singular values always have unique left- and right-singular vectors, up to multiplication by a
unit-phase factor eiφ (for the real case up to sign). Consequently, if all singular values of M are non-degenerate and
non-zero, then its singular value decomposition is unique, up to multiplication of a column of U by a unit-phase
factor and simultaneous multiplication of the corresponding column of V by the same unit-phase factor.
Degenerate singular values, by definition, have non-unique singular vectors. Furthermore, if u1 and u2 are two
left-singular vectors which both correspond to the singular value σ, then any normalized linear combination of the
two vectors is also a left-singular vector corresponding to the singular value σ. The similar statement is true for
right-singular vectors. Consequently, if M has degenerate singular values, then its singular value decomposition is
not unique.

Applications of the SVD

Pseudoinverse
The singular value decomposition can be used for computing the pseudoinverse of a matrix. Indeed, the
pseudoinverse of the matrix M with singular value decomposition M = UΣV* is

M⁺ = VΣ⁺U*

where Σ+ is the pseudoinverse of Σ, which is formed by replacing every nonzero diagonal entry by its reciprocal and
transposing the resulting matrix. The pseudoinverse is one way to solve linear least squares problems.
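A sketch of the recipe above, assuming a full-column-rank matrix so every singular value is nonzero (the 5×3 random matrix is chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((5, 3))

# Thin SVD; s has 3 entries, all nonzero for a generic full-rank matrix.
U, s, Vh = np.linalg.svd(M, full_matrices=False)
M_pinv = Vh.T @ np.diag(1.0 / s) @ U.T        # V Sigma^+ U*
assert np.allclose(M_pinv, np.linalg.pinv(M))

# The pseudoinverse gives the least-squares solution of M x ~ b.
b = rng.standard_normal(5)
x = M_pinv @ b
assert np.allclose(x, np.linalg.lstsq(M, b, rcond=None)[0])
```

For rank-deficient matrices, one would instead invert only the singular values above a tolerance and zero the rest, which is what `np.linalg.pinv` does internally.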

Solving homogeneous linear equations


A set of homogeneous linear equations can be written as Ax = 0 for a matrix A and vector x. A typical
situation is that A is known and a non-zero x is to be determined which satisfies the equation. Such an x belongs
to A's null space and is sometimes called a (right) null vector of A. It can be characterized as a right-singular
vector corresponding to a singular value of A that is zero. This observation means that if A is a square matrix and
has no vanishing singular value, the equation has no non-zero x as a solution. It also means that if there are several
vanishing singular values, any linear combination of the corresponding right-singular vectors is a valid solution.
Analogously to the definition of a (right) null vector, a non-zero x satisfying x*A = 0, with x* denoting the
conjugate transpose of x, is called a left null vector of A.

Total least squares minimization


A total least squares problem refers to determining the vector x which minimizes the 2-norm of the vector Ax under
the constraint ||x|| = 1. The solution turns out to be the right-singular vector of A corresponding to the smallest
singular value.

Range, null space and rank


Another application of the SVD is that it provides an explicit representation of the range and null space of a matrix
M. The right-singular vectors corresponding to vanishing singular values of M span the null space of M. E.g., the null
space is spanned by the last two columns of V in the above example. The left-singular vectors corresponding to the
non-zero singular values of M span the range of M. As a consequence, the rank of M equals the number of non-zero
singular values, which is the same as the number of non-zero diagonal elements in Σ.
In numerical linear algebra the singular values can be used to determine the effective rank of a matrix, as rounding
error may lead to small but non-zero singular values in a rank deficient matrix.
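The rank and null-space extraction described above can be sketched as follows (the 3×3 rank-deficient matrix and the tolerance convention, the one also used by `np.linalg.matrix_rank`, are assumptions for the example):

```python
import numpy as np

M = np.array([[1., 2., 3.],
              [2., 4., 6.],     # multiple of row 1, so M is rank deficient
              [0., 1., 1.]])

U, s, Vh = np.linalg.svd(M)
tol = max(M.shape) * np.finfo(float).eps * s[0]   # threshold for "numerically zero"
rank = int(np.sum(s > tol))
null_space = Vh[rank:].T          # right-singular vectors for the zero singular values

assert rank == 2
assert np.allclose(M @ null_space, 0)   # columns of null_space solve M x = 0
```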

Low-rank matrix approximation


Some practical applications need to solve the problem of approximating a matrix M with another matrix M̃, said
truncated, which has a specific rank r. In the case that the approximation is based on minimizing the Frobenius
norm of the difference between M and M̃ under the constraint that rank(M̃) = r, it turns out that the solution
is given by the SVD of M, namely

M̃ = U Σ̃ V*

where Σ̃ is the same matrix as Σ except that it contains only the r largest singular values (the other singular values
are replaced by zero). This is known as the Eckart–Young theorem, as it was proved by those two authors in 1936
(although it was later found to have been known to earlier authors; see Stewart 1993).
Also see CUR matrix approximation for another low-rank approximation that is easier to interpret.
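The truncation above can be sketched in a few lines (matrix size, target rank r = 2, and the random rival used for the spot check are all arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((6, 4))
U, s, Vh = np.linalg.svd(M, full_matrices=False)

r = 2
M_r = U[:, :r] @ np.diag(s[:r]) @ Vh[:r]        # keep the r largest singular values

# Eckart-Young: the Frobenius error equals the norm of the discarded singular values,
# and no other rank-r matrix does better (spot-checked against a random rival).
assert np.isclose(np.linalg.norm(M - M_r), np.sqrt(np.sum(s[r:] ** 2)))
rival = rng.standard_normal((6, r)) @ rng.standard_normal((r, 4))   # rank <= r
assert np.linalg.norm(M - rival) >= np.linalg.norm(M - M_r)
```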

Separable models
The SVD can be thought of as decomposing a matrix into a weighted, ordered sum of separable matrices. By
separable, we mean that a matrix A can be written as an outer product of two vectors, A = u ⊗ v, or, in
coordinates, Aij = ui vj. Specifically, the matrix M can be decomposed as:

M = ∑i σi Ui ⊗ Vi

Here Ui and Vi are the ith columns of the corresponding SVD matrices, σi are the ordered singular values, and
each Ui ⊗ Vi is separable. The SVD can be used to find the decomposition of an image processing filter into separable
horizontal and vertical filters. Note that the number of non-zero σi is exactly the rank of the matrix.
Separable models often arise in biological systems, and the SVD decomposition is useful to analyze such systems.
For example, some visual area V1 simple cells' receptive fields can be well described[1] by a Gabor filter in the space
domain multiplied by a modulation function in the time domain. Thus, given a linear filter evaluated through, for
example, reverse correlation, one can rearrange the two spatial dimensions into one dimension, thus yielding a two
dimensional filter (space, time) which can be decomposed through SVD. The first column of U in the SVD
decomposition is then a Gabor while the first column of V represents the time modulation (or vice versa). One may
then define an index of separability, α = σ1² / ∑i σi², which is the fraction of the power in the matrix M that is
accounted for by the first separable matrix in the decomposition.[2]

Nearest orthogonal matrix


It is possible to use the SVD of a square matrix A to determine the orthogonal matrix O closest to A. The closeness of fit is
measured by the Frobenius norm of O − A. The solution is the product UV*.[3] This intuitively makes sense
because an orthogonal matrix would have the decomposition UIV* where I is the identity matrix, so that if
A = UΣV* then the product UV* amounts to replacing the singular values with ones.
A similar problem, with interesting applications in shape analysis, is the orthogonal Procrustes problem, which
consists of finding the orthogonal matrix O which most closely maps A to B. Specifically,

O = argmin over Ω of ||AΩ − B||F   subject to   ΩᵀΩ = I,

where || · ||F denotes the Frobenius norm.

This problem is equivalent to finding the nearest orthogonal matrix to the matrix M = AᵀB.
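The nearest-orthogonal-matrix construction can be sketched as follows (the 3×3 random matrix and the random-rotation spot check are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((3, 3))

U, s, Vh = np.linalg.svd(A)
O = U @ Vh                         # nearest orthogonal matrix to A in Frobenius norm
assert np.allclose(O @ O.T, np.eye(3))

# Spot check: O is at least as close to A as randomly generated orthogonal matrices.
for _ in range(100):
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    assert np.linalg.norm(A - O) <= np.linalg.norm(A - Q) + 1e-12
```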

The Kabsch algorithm


The Kabsch algorithm (called Wahba's problem in other fields) uses SVD to compute the optimal rotation (with
respect to least-squares minimization) that will align a set of points with a corresponding set of points. It is used,
among other applications, to compare the structures of molecules.

Other examples
The SVD is also applied extensively to the study of linear inverse problems, and is useful in the analysis of
regularization methods such as that of Tikhonov. It is widely used in statistics where it is related to principal
component analysis and to Correspondence analysis, and in signal processing and pattern recognition. It is also used
in output-only modal analysis, where the non-scaled mode shapes can be determined from the singular vectors. Yet
another usage is latent semantic indexing in natural language text processing.
The SVD also plays a crucial role in the field of quantum information, in a form often referred to as the Schmidt
decomposition. Through it, states of two quantum systems are naturally decomposed, providing a necessary and
sufficient condition for them to be entangled: the rank of the coefficient matrix is larger than one.

One application of SVD to rather large matrices is in numerical weather prediction, where Lanczos methods are used
to estimate the few most rapidly growing linear perturbations to the central numerical weather prediction over a
given initial forward time period – i.e. the singular vectors corresponding to the largest singular values of the
linearized propagator for the global weather over that time interval. The output singular vectors in this case are entire
weather systems. These perturbations are then run through the full nonlinear model to generate an ensemble forecast,
giving a handle on some of the uncertainty that should be allowed for around the current central prediction.
Another everyday application of the SVD is that a point in a perspective view can be unprojected in a photo using the
calculated SVD matrix; this makes it possible to measure length (i.e. the distance between two unprojected points in a
perspective photo) by marking out the four corner points of a known-size object in a single photo. PRuler is a demo
that implements this application by taking a photo of a regular credit card.

Relation to eigenvalue decomposition


The singular value decomposition is very general in the sense that it can be applied to any m × n matrix whereas
eigenvalue decomposition can only be applied to certain classes of square matrices. Nevertheless, the two
decompositions are related.
Given an SVD of M, as described above, the following two relations hold:

M*M = V (Σ*Σ) V*
MM* = U (ΣΣ*) U*

The right-hand sides of these relations describe the eigenvalue decompositions of the left-hand sides. Consequently:
• The columns of V (right-singular vectors) are eigenvectors of M*M.
• The columns of U (left-singular vectors) are eigenvectors of MM*.
• The non-zero elements of Σ (non-zero singular values) are the square roots of the non-zero eigenvalues of M*M
or MM*.
In the special case that M is a normal matrix, which by definition must be square, the spectral theorem says that it
can be unitarily diagonalized using a basis of eigenvectors, so that it can be written M = UDU* for a unitary
matrix U and a diagonal matrix D. When M is also positive semi-definite, the decomposition M = UDU* is also
a singular value decomposition.
However, the eigenvalue decomposition and the singular value decomposition differ for all other matrices M: the
eigenvalue decomposition is M = UDU⁻¹, where U is not necessarily unitary and D is not necessarily positive
semi-definite, while the SVD is M = UΣV*, where Σ is diagonal and positive semi-definite, and U and V are
unitary matrices that are not necessarily related except through the matrix M.
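The two relations above can be verified numerically. A minimal sketch (a random real 4×3 matrix is assumed for illustration, so conjugate transposes reduce to plain transposes):

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.standard_normal((4, 3))
U, s, Vh = np.linalg.svd(M)

# M* M = V (Sigma* Sigma) V*: the columns of V are eigenvectors of M* M,
# and the squared singular values are its eigenvalues.
w, Q = np.linalg.eigh(M.T @ M)           # eigh returns eigenvalues in ascending order
assert np.allclose(np.sort(s ** 2), w)

# Each row of Vh (column of V) is an eigenvector of M* M for the matching eigenvalue.
for i in range(3):
    v = Vh[i]
    assert np.allclose(M.T @ (M @ v), (s[i] ** 2) * v)
```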

Existence
An eigenvalue λ of a matrix M is characterized by the algebraic relation M u = λ u. When M is Hermitian, a variational
characterization is also available. Let M be a real n × n symmetric matrix. Define f : Rⁿ → R by f(x) = xᵀ M x. By the
extreme value theorem, this continuous function attains a maximum at some u when restricted to the closed unit
sphere {||x|| ≤ 1}. By the Lagrange multipliers theorem, u necessarily satisfies

∇(xᵀMx) − λ ∇(xᵀx) = 0

for some real number λ, where the nabla symbol, ∇, is the del operator.


A short calculation shows the above leads to M u = λ u (symmetry of M is needed here). Therefore λ is the largest
eigenvalue of M. The same calculation performed on the orthogonal complement of u gives the next largest
eigenvalue and so on. The complex Hermitian case is similar; there f(x) = x* M x is a real-valued function of 2n real
variables.
Singular value decomposition 557

Singular values are similar in that they can be described algebraically or from variational principles, although,
unlike the eigenvalue case, Hermiticity or symmetry of M is no longer required.
This section gives these two arguments for the existence of the singular value decomposition.

Based on the spectral theorem


Let M be an m-by-n matrix with complex entries. M*M is positive semidefinite and Hermitian. By the spectral
theorem, there exists a unitary n-by-n matrix V such that

V* (M*M) V = [D, 0; 0, 0],

where D is diagonal and positive definite. Partition V = [V1 V2] conformably, so we can write

[V1* (M*M) V1, V1* (M*M) V2; V2* (M*M) V1, V2* (M*M) V2] = [D, 0; 0, 0].

Therefore V1*M*MV1 = D and V2*M*MV2 = 0. The latter means MV2 = 0.


Also, since V is unitary, V1*V1 = I, V2*V2 = I and V1V1* + V2V2* = I.
Define

U1 = M V1 D^(−1/2).

Then

U1 D^(1/2) V1* = M V1 D^(−1/2) D^(1/2) V1* = M V1 V1* = M (I − V2 V2*) = M − (M V2) V2* = M,

using V1V1* + V2V2* = I and MV2 = 0.
We see that this is almost the desired result, except that U1 and V1 are not unitary in general, but merely isometries.
To finish the argument, one simply has to "fill out" these matrices to obtain unitaries. For example, one can choose
U2 such that

U = [U1  U2]

is unitary.
Define

Σ = [D^(1/2), 0; 0, 0],

where extra zero rows are added or removed to make the number of zero rows equal the number of columns of U2.
Then

[U1, U2] [D^(1/2), 0; 0, 0] [V1, V2]* = [U1, U2] [D^(1/2) V1*; 0] = U1 D^(1/2) V1* = M,

which is the desired result:

M = U Σ V*.

Notice the argument could begin with diagonalizing MM* rather than M*M (this shows directly that MM* and M*M
have the same non-zero eigenvalues).
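The construction in this proof can be traced step by step in code. A NumPy sketch for a real full-rank tall matrix, where V2 is empty and no filling-out step is needed (the test matrix is an arbitrary choice):

```python
import numpy as np

# Build an SVD from the eigendecomposition of M^T M, following the proof.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 3))            # tall, full rank with probability 1

# Diagonalize M^T M: eigenvectors form V, eigenvalues form D.
d, V = np.linalg.eigh(M.T @ M)             # ascending eigenvalues
d, V = d[::-1], V[:, ::-1]                 # reorder to descending, SVD-style

Sigma = np.sqrt(d)                         # singular values = sqrt(eigenvalues)
U1 = M @ V / Sigma                         # U1 = M V D^(-1/2), an isometry

# U1 diag(Sigma) V^T reproduces M, and the columns of U1 are orthonormal:
assert np.allclose(U1 @ np.diag(Sigma) @ V.T, M)
assert np.allclose(U1.T @ U1, np.eye(3))
```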

Based on variational characterization


The singular values can also be characterized as the maxima of u^T M v, considered as a function of u and v, over
particular subspaces. The singular vectors are the values of u and v where these maxima are attained.
Let M denote an m × n matrix with real entries. Let S^(m−1) and S^(n−1) denote the sets of unit 2-norm vectors in R^m
and R^n respectively. Define the function

σ(u, v) = u^T M v

for vectors u ∈ S^(m−1) and v ∈ S^(n−1). Consider the function σ restricted to S^(m−1) × S^(n−1). Since both S^(m−1) and S^(n−1)
are compact sets, their product is also compact. Furthermore, since σ is continuous, it attains a largest value
for at least one pair of vectors u ∈ S^(m−1) and v ∈ S^(n−1). This largest value is denoted σ1 and the corresponding
vectors are denoted u1 and v1. Since σ1 is the largest value of σ(u, v) it must be non-negative. If it were negative,
changing the sign of either u1 or v1 would make it positive and therefore larger.
Statement: u1, v1 are left and right-singular vectors of M with corresponding singular value σ1.
Proof: Similar to the eigenvalues case, by assumption the two vectors satisfy the Lagrange multiplier equation:

∇ σ = ∇ u^T M v − λ1 ∇ u^T u − λ2 ∇ v^T v = 0.

After some algebra, this becomes

M v1 = 2 λ1 u1

and

M^T u1 = 2 λ2 v1.

Multiplying the first equation from the left by u1^T and the second equation from the left by v1^T, and taking ||u|| = ||v|| = 1 into
account, gives

u1^T M v1 = 2 λ1 = 2 λ2.

So σ1 = 2 λ1 = 2 λ2. Substituting this back into the two equations above gives

M v1 = σ1 u1.

Similarly,

M^T u1 = σ1 v1.

This proves the statement.


More singular vectors and singular values can be found by maximizing σ(u, v) over normalized u, v which are
orthogonal to u1 and v1, respectively.
The passage from real to complex is similar to the eigenvalue case.

Geometric meaning
Because U and V are unitary, we know that the columns u1,...,um of U yield an orthonormal basis of Km and the
columns v1,...,vn of V yield an orthonormal basis of Kn (with respect to the standard scalar products on these spaces).
The linear transformation T :Kn → Km that takes a vector x to Mx has a particularly simple description with respect to
these orthonormal bases: we have T(vi) = σi ui, for i = 1,...,min(m,n), where σi is the i-th diagonal entry of Σ, and
T(vi) = 0 for i > min(m,n).
The geometric content of the SVD theorem can thus be summarized as follows: for every linear map T :Kn → Km one
can find orthonormal bases of Kn and Km such that T maps the i-th basis vector of Kn to a non-negative multiple of
the i-th basis vector of Km, and sends the left-over basis vectors to zero. With respect to these bases, the map T is
therefore represented by a diagonal matrix with non-negative real diagonal entries.


To get a more visual flavour of singular values and SVD decomposition —at least when working on real vector
spaces— consider the sphere S of radius one in Rn. The linear map T maps this sphere onto an ellipsoid in Rm.
Non-zero singular values are simply the lengths of the semi-axes of this ellipsoid. Especially when n = m, and all the
singular values are distinct and non-zero, the SVD of the linear map T can be easily analysed as a succession of three
consecutive moves: consider the ellipsoid T(S) and specifically its axes; then consider the directions in Rn sent by T
onto these axes. These directions happen to be mutually orthogonal. Apply first an isometry v* sending these
directions to the coordinate axes of Rn. On a second move, apply an endomorphism d diagonalized along the
coordinate axes and stretching or shrinking in each direction, using the semi-axes lengths of T(S) as stretching
coefficients. The composition d ∘ v* then sends the unit sphere onto an ellipsoid isometric to T(S). To define the third
and last move u, apply an isometry to this ellipsoid so as to carry it over T(S). As can be easily checked, the
composition u ∘ d ∘ v* coincides with T.
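The three moves can be carried out numerically. A NumPy sketch in R² (sampling the unit circle at 200 points is an arbitrary choice):

```python
import numpy as np

# Rotate (V^T), stretch along the axes (Sigma), rotate again (U):
# the composition equals the original map T(x) = M x.
rng = np.random.default_rng(2)
M = rng.standard_normal((2, 2))
U, s, Vh = np.linalg.svd(M)

theta = np.linspace(0, 2 * np.pi, 200)
circle = np.vstack([np.cos(theta), np.sin(theta)])   # unit circle S

step1 = Vh @ circle          # isometry v*: align singular directions with axes
step2 = np.diag(s) @ step1   # stretch/shrink by the semi-axis lengths
step3 = U @ step2            # isometry u: carry the ellipse onto T(S)

assert np.allclose(step3, M @ circle)   # u ∘ d ∘ v* coincides with T
# The longest semi-axis of the ellipse T(S) has length s[0]:
assert np.isclose(np.max(np.linalg.norm(M @ circle, axis=0)), s[0], atol=1e-2)
```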

Calculating the SVD

Numerical Approach
The SVD of a matrix M is typically computed by a two-step procedure. In the first step, the matrix is reduced to a
bidiagonal matrix. This takes O(mn²) floating-point operations (flops), assuming that m ≥ n (this formulation uses
the big O notation). The second step is to compute the SVD of the bidiagonal matrix. This step can only be done
with an iterative method (as with eigenvalue algorithms). However, in practice it suffices to compute the SVD up to
a certain precision, like the machine epsilon. If this precision is considered constant, then the second step takes O(n)
iterations, each costing O(n) flops. Thus, the first step is more expensive, and the overall cost is O(mn²) flops
(Trefethen & Bau III 1997, Lecture 31).
The first step can be done using Householder reflections for a cost of 4mn² − 4n³/3 flops, assuming that only the
singular values are needed and not the singular vectors. If m is much larger than n then it is advantageous to first
reduce the matrix M to a triangular matrix with the QR decomposition and then use Householder reflections to
further reduce the matrix to bidiagonal form; the combined cost is 2mn² + 2n³ flops (Trefethen & Bau III 1997,
Lecture 31).
The second step can be done by a variant of the QR algorithm for the computation of eigenvalues, which was first
described by Golub & Kahan (1965). The LAPACK subroutine DBDSQR[4] implements this iterative method, with
some modifications to cover the case where the singular values are very small (Demmel & Kahan 1990). Together
with a first step using Householder reflections and, if appropriate, QR decomposition, this forms the DGESVD[5]
routine for the computation of the singular value decomposition.
The same algorithm is implemented in the GNU Scientific Library (GSL). The GSL also offers an alternative
method, which uses a one-sided Jacobi orthogonalization in step 2 (GSL Team 2007). This method computes the
SVD of the bidiagonal matrix by solving a sequence of 2-by-2 SVD problems, similar to how the Jacobi eigenvalue
algorithm solves a sequence of 2-by-2 eigenvalue problems (Golub & Van Loan 1996, §8.6.3). Yet another method
for step 2 uses the idea of divide-and-conquer eigenvalue algorithms (Trefethen & Bau III 1997, Lecture 31).
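If SciPy is available, both the Golub–Kahan-style driver (LAPACK DGESVD) and the divide-and-conquer driver (DGESDD) described above can be selected explicitly; a brief sketch:

```python
import numpy as np
from scipy import linalg

# SciPy's linalg.svd wraps the LAPACK routines discussed above; the
# lapack_driver argument chooses between "gesvd" (QR-iteration on the
# bidiagonal form) and "gesdd" (divide-and-conquer).
rng = np.random.default_rng(3)
M = rng.standard_normal((200, 50))

U, s, Vh = linalg.svd(M, lapack_driver="gesvd")
U2, s2, Vh2 = linalg.svd(M, lapack_driver="gesdd")

# Both drivers agree on the singular values to machine precision:
assert np.allclose(s, s2)

# Computing only the singular values (no vectors) is cheaper, as noted above:
s_only = linalg.svd(M, compute_uv=False)
assert np.allclose(s, s_only)
```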

Analytic result of 2-by-2 SVD


The singular values of a 2-by-2 matrix can be found analytically. Let the matrix be

M = z0 I + z1 σ1 + z2 σ2 + z3 σ3,

where z0, z1, z2, z3 are complex numbers that parameterize the matrix, I is the
identity matrix, and σ1, σ2, σ3 denote the Pauli matrices. Then its two singular values σ± are given by

σ±² = |z0|² + |z1|² + |z2|² + |z3|² ± √( (|z0|² + |z1|² + |z2|² + |z3|²)² − |z0² − z1² − z2² − z3²|² ).
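The formula can be checked against a numerical SVD. The sketch below uses the equivalent form obtained from tr(M*M) and |det M| (an elementary identity for 2-by-2 matrices), with random coefficients:

```python
import numpy as np

# Pauli matrices and the identity.
s1 = np.array([[0, 1], [1, 0]], dtype=complex)
s2 = np.array([[0, -1j], [1j, 0]])
s3 = np.array([[1, 0], [0, -1]], dtype=complex)
I = np.eye(2, dtype=complex)

rng = np.random.default_rng(4)
z = rng.standard_normal(4) + 1j * rng.standard_normal(4)
M = z[0] * I + z[1] * s1 + z[2] * s2 + z[3] * s3

t = np.sum(np.abs(z) ** 2)                        # = tr(M*M) / 2
d = abs(z[0]**2 - z[1]**2 - z[2]**2 - z[3]**2)    # = |det M|
sigma_plus = np.sqrt(t + np.sqrt(t**2 - d**2))
sigma_minus = np.sqrt(t - np.sqrt(t**2 - d**2))

assert np.allclose(sorted([sigma_minus, sigma_plus]),
                   sorted(np.linalg.svd(M, compute_uv=False)))
```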
Reduced SVDs
In applications it is quite unusual for the full SVD, including a full unitary decomposition of the null-space of the
matrix, to be required. Instead, it is often sufficient (as well as faster, and more economical for storage) to compute a
reduced version of the SVD. The following can be distinguished for an m×n matrix M of rank r:

Thin SVD

Only the n column vectors of U corresponding to the row vectors of V* are calculated. The remaining column
vectors of U are not calculated. This is significantly quicker and more economical than the full SVD if n<<m. The
matrix Un is thus m×n, Σn is n×n diagonal, and V is n×n.
The first stage in the calculation of a thin SVD will usually be a QR decomposition of M, which can make for a
significantly quicker calculation if n<<m.

Compact SVD

Only the r column vectors of U and r row vectors of V* corresponding to the non-zero singular values Σr are
calculated. The remaining vectors of U and V* are not calculated. This is quicker and more economical than the thin
SVD if r<<n. The matrix Ur is thus m×r, Σr is r×r diagonal, and Vr* is r×n.

Truncated SVD

Only the t column vectors of U and t row vectors of V* corresponding to the t largest singular values Σt are
calculated. The rest of the matrix is discarded. This can be much quicker and more economical than the compact
SVD if t<<r. The matrix Ut is thus m×t, Σt is t×t diagonal, and Vt* is t×n.
Of course the truncated SVD is no longer an exact decomposition of the original matrix M, but as discussed above,
the approximate matrix is in a very useful sense the closest approximation to M that can be achieved by a matrix
of rank t.
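In NumPy, the thin SVD is obtained with full_matrices=False, and a truncated SVD by keeping the first t singular triples; a sketch (matrix sizes and t are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((100, 20))          # n << m, so the thin SVD saves work

# Thin SVD: U is m-by-n instead of m-by-m.
U, s, Vh = np.linalg.svd(M, full_matrices=False)
assert U.shape == (100, 20) and Vh.shape == (20, 20)
assert np.allclose(U @ np.diag(s) @ Vh, M)

# Truncated SVD of rank t: keep only the t largest singular triples.
t = 5
M_t = U[:, :t] @ np.diag(s[:t]) @ Vh[:t, :]
assert np.linalg.matrix_rank(M_t) == t

# The Frobenius error of this best rank-t approximation is the root of the
# sum of the squared discarded singular values (Eckart-Young):
assert np.isclose(np.linalg.norm(M - M_t, "fro"), np.sqrt(np.sum(s[t:] ** 2)))
```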

Norms

Ky Fan norms
The sum of the k largest singular values of M is a matrix norm, the Ky Fan k-norm of M.
The first of the Ky Fan norms, the Ky Fan 1-norm is the same as the operator norm of M as a linear operator with
respect to the Euclidean norms of Km and Kn. In other words, the Ky Fan 1-norm is the operator norm induced by the
standard l2 Euclidean inner product. For this reason, it is also called the operator 2-norm. One can easily verify the
relationship between the Ky Fan 1-norm and singular values. It is true in general, for a bounded operator M on
(possibly infinite dimensional) Hilbert spaces, that

||M|| = ||M*M||^(1/2).

But, in the matrix case, (M*M)^(1/2) is a normal matrix, so ||M*M||^(1/2) is the largest eigenvalue of (M*M)^(1/2), i.e. the largest
singular value of M.
The last of the Ky Fan norms, the sum of all singular values, is the trace norm (also known as the 'nuclear norm'),
defined by ||M|| = Tr[(M*M)^(1/2)] (the eigenvalues of M*M are the squares of the singular values).
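A quick numerical check of these identities (the test matrix is arbitrary):

```python
import numpy as np

# Ky Fan k-norm: sum of the k largest singular values. k = 1 gives the
# operator 2-norm; k = min(m, n) gives the trace (nuclear) norm.
rng = np.random.default_rng(6)
M = rng.standard_normal((6, 4))
s = np.linalg.svd(M, compute_uv=False)        # singular values, descending

assert np.isclose(s[0], np.linalg.norm(M, 2))          # operator 2-norm
assert np.isclose(s.sum(), np.linalg.norm(M, "nuc"))   # nuclear / trace norm

# Trace-norm definition via the eigenvalues of M^T M, whose square roots
# are the singular values:
w = np.linalg.eigvalsh(M.T @ M)
assert np.isclose(s.sum(), np.sqrt(np.clip(w, 0, None)).sum())
```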

Hilbert–Schmidt norm
The singular values are related to another norm on the space of operators. Consider the Hilbert–Schmidt inner
product on the n × n matrices, defined by ⟨M, N⟩ = tr(N*M). So the induced norm is
||M|| = ⟨M, M⟩^(1/2) = (tr(M*M))^(1/2). Since the trace is invariant under unitary equivalence, this shows

||M||² = Σi σi²

where σi are the singular values of M. This is called the Frobenius norm, Schatten 2-norm, or Hilbert–Schmidt
norm of M. Direct calculation shows that if M = (mij), the Frobenius norm of M coincides with

( Σij |mij|² )^(1/2).
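The three descriptions of this norm can be compared directly; a NumPy sketch:

```python
import numpy as np

# Frobenius / Hilbert-Schmidt / Schatten 2-norm three ways: entrywise,
# via the trace inner product, and via the singular values.
rng = np.random.default_rng(7)
M = rng.standard_normal((5, 5)) + 1j * rng.standard_normal((5, 5))

fro_entries = np.sqrt(np.sum(np.abs(M) ** 2))        # sqrt(sum |m_ij|^2)
fro_trace = np.sqrt(np.trace(M.conj().T @ M).real)   # sqrt(<M, M>)
s = np.linalg.svd(M, compute_uv=False)
fro_sv = np.sqrt(np.sum(s ** 2))                     # sqrt(sum sigma_i^2)

assert np.allclose([fro_entries, fro_trace], fro_sv)
assert np.isclose(fro_sv, np.linalg.norm(M, "fro"))
```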
Tensor SVD
Unfortunately, the problem of finding a low-rank approximation to a tensor is ill-posed. In other words, a best
possible solution need not exist; instead there may only be a sequence of better and better approximations whose
component matrices grow without bound. In spite of this, there are several ways of attempting this decomposition. There exist two types of
tensor decompositions which generalise SVD to multi-way arrays. One decomposition decomposes a tensor into a
sum of rank-1 tensors, see Candecomp-PARAFAC (CP) algorithm. The CP algorithm should not be confused with a
rank-R decomposition but, for a given N, it decomposes a tensor into a sum of N rank-1 tensors that optimally fit the
original tensor. The second type of decomposition computes the orthonormal subspaces associated with the different
axes or modes of a tensor (orthonormal row space, column space, fiber space, etc.). This decomposition is referred to
in the literature as the Tucker3/TuckerM, M-mode SVD, multilinear SVD and sometimes referred to as a
higher-order SVD (HOSVD). In addition, multilinear principal component analysis in multilinear subspace learning
involves the same mathematical operations as Tucker decomposition, being used in a different context of
dimensionality reduction.
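A minimal sketch of the HOSVD idea for a 3-way array, using only NumPy (the unfold helper and the tensor shape are illustrative, not a library API):

```python
import numpy as np

rng = np.random.default_rng(8)
T = rng.standard_normal((3, 4, 5))

def unfold(T, mode):
    """Mode-n unfolding: flatten T with the given mode along the rows."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

# Factor matrices: left singular vectors of each mode unfolding give the
# orthonormal subspaces associated with the modes.
U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0] for m in range(3)]

# Core tensor: the tensor expressed in those bases (multiply each mode
# by the corresponding U^T).
core = np.einsum("ijk,ia,jb,kc->abc", T, U[0], U[1], U[2])

# With no truncation, the HOSVD reconstructs T exactly:
T_rec = np.einsum("abc,ia,jb,kc->ijk", core, U[0], U[1], U[2])
assert np.allclose(T_rec, T)
```

Truncating the columns of each factor matrix yields a Tucker-style low-multilinear-rank approximation, though (per the ill-posedness noted above) not generally an optimal one.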

Bounded operators on Hilbert spaces


The factorization M = U Σ V* can be extended to a bounded operator M on a separable Hilbert space H. Namely,
for any bounded operator M, there exist a partial isometry U, a unitary V, a measure space (X, μ), and a non-negative
measurable f such that

M = U T_f V*,

where T_f is the multiplication by f on L²(X, μ).


This can be shown by mimicking the linear algebraic argument for the matricial case above. V T_f V* is the unique
positive square root of M*M, as given by the Borel functional calculus for self-adjoint operators. The reason why U
need not be unitary is that, unlike the finite-dimensional case, given an isometry U1 with non-trivial kernel, a
suitable U2 may not be found such that

[U1  U2]

is a unitary operator.
As for matrices, the singular value factorization is equivalent to the polar decomposition for operators: we can
simply write

M = (U V*) · (V T_f V*)

and notice that U V* is still a partial isometry while V T_f V* is positive.

Singular values and compact operators


To extend the notion of singular values and left/right-singular vectors to the operator case, one needs to restrict to
compact operators. It is a general fact that compact operators on Banach spaces have only discrete spectrum. This is
also true for compact operators on Hilbert spaces, since Hilbert spaces are a special case of Banach spaces. If T is
compact, every non-zero λ in its spectrum is an eigenvalue. Furthermore, a compact self-adjoint operator can be
diagonalized by its eigenvectors. If M is compact, so is M*M. Applying the diagonalization result, the unitary image
of its positive square root T_f has a set of orthonormal eigenvectors {ei} corresponding to strictly positive eigenvalues
{σi}. For any ψ ∈ H,

M ψ = Σi σi ⟨ψ, V ei⟩ U ei,
where the series converges in the norm topology on H. Notice how this resembles the expression from the finite
dimensional case. The σi 's are called the singular values of M. {U ei} and {V ei} can be considered the left- and
right-singular vectors of M respectively.
Compact operators on a Hilbert space are the closure of finite-rank operators in the uniform operator topology. The
above series expression gives an explicit such representation. An immediate consequence of this is:
Theorem M is compact if and only if M*M is compact.

History
The singular value decomposition was originally developed by differential geometers, who wished to determine
whether a real bilinear form could be made equal to another by independent orthogonal transformations of the two
spaces it acts on. Eugenio Beltrami and Camille Jordan discovered independently, in 1873 and 1874 respectively,
that the singular values of the bilinear forms, represented as a matrix, form a complete set of invariants for bilinear
forms under orthogonal substitutions. James Joseph Sylvester also arrived at the singular value decomposition for
real square matrices in 1889, apparently independent of both Beltrami and Jordan. Sylvester called the singular
values the canonical multipliers of the matrix A. The fourth mathematician to discover the singular value
decomposition independently was Autonne in 1915, who arrived at it via the polar decomposition. The first proof of
the singular value decomposition for rectangular and complex matrices seems to be by Carl Eckart and Gale Young
in 1936;[6] they saw it as a generalization of the principal axis transformation for Hermitian matrices.
In 1907, Erhard Schmidt defined an analog of singular values for integral operators (which are compact, under some
weak technical assumptions); it seems he was unaware of the parallel work on singular values of finite matrices. This
theory was further developed by Émile Picard in 1910, who is the first to call the numbers singular values (or
rather, valeurs singulières).
Practical methods for computing the SVD date back to Kogbetliantz in 1954–1955 and Hestenes in 1958,[7]
closely resembling the Jacobi eigenvalue algorithm, which uses plane rotations or Givens rotations. However, these
were replaced by the method of Gene Golub and William Kahan published in 1965,[8] which uses Householder
transformations or reflections. In 1970, Golub and Christian Reinsch[9] published a variant of the Golub/Kahan

algorithm that is still the one most-used today.

Notes
[1] DeAngelis GC, Ohzawa I, Freeman RD (October 1995). "Receptive-field dynamics in the central visual pathways" (http://linkinghub.elsevier.com/retrieve/pii/0166-2236(95)94496-R). Trends Neurosci. 18 (10): 451–8. doi:10.1016/0166-2236(95)94496-R. PMID 8545912.
[2] Depireux DA, Simon JZ, Klein DJ, Shamma SA (March 2001). "Spectro-temporal response field characterization with dynamic ripples in ferret primary auditory cortex" (http://jn.physiology.org/cgi/pmidlookup?view=long&pmid=11247991). J. Neurophysiol. 85 (3): 1220–34. PMID 11247991.
[3] The Singular Value Decomposition in Symmetric (Lowdin) Orthogonalization and Data Compression (http://www.wou.edu/~beavers/Talks/Willamette1106.pdf)
[4] Netlib.org (http://www.netlib.org/lapack/double/dbdsqr.f)
[5] Netlib.org (http://www.netlib.org/lapack/double/dgesvd.f)
[6] Eckart, C.; Young, G. (1936). "The approximation of one matrix by another of lower rank". Psychometrika 1 (3): 211–8. doi:10.1007/BF02288367.
[7] Hestenes, M. R. (1958). "Inversion of Matrices by Biorthogonalization and Related Results". Journal of the Society for Industrial and Applied Mathematics 6 (1): 51–90. doi:10.1137/0106005. JSTOR 2098862. MR0092215.
[8] Golub, G. H.; Kahan, W. (1965). "Calculating the singular values and pseudo-inverse of a matrix". Journal of the Society for Industrial and Applied Mathematics: Series B, Numerical Analysis 2 (2): 205–224. doi:10.1137/0702016. JSTOR 2949777. MR0183105.
[9] Golub, G. H.; Reinsch, C. (1970). "Singular value decomposition and least squares solutions". Numerische Mathematik 14 (5): 403–420. doi:10.1007/BF02163027. MR1553974.

References
• Trefethen, Lloyd N.; Bau III, David (1997). Numerical linear algebra. Philadelphia: Society for Industrial and
Applied Mathematics. ISBN 978-0-89871-361-9.
• Demmel, James; Kahan, William (1990). "Accurate singular values of bidiagonal matrices". Society for Industrial
and Applied Mathematics. Journal on Scientific and Statistical Computing 11 (5): 873–912.
doi:10.1137/0911052.
• Golub, Gene H.; Kahan, William (1965). "Calculating the singular values and pseudo-inverse of a matrix".
Journal of the Society for Industrial and Applied Mathematics: Series B, Numerical Analysis 2 (2): 205–224.
doi:10.1137/0702016. JSTOR 2949777.
• Golub, Gene H.; Van Loan, Charles F. (1996). Matrix Computations (3rd ed.). Johns Hopkins.
ISBN 978-0-8018-5414-9.
• GSL Team (2007). "§13.4 Singular Value Decomposition" (http://www.gnu.org/software/gsl/manual/
html_node/Singular-Value-Decomposition.html). GNU Scientific Library. Reference Manual.
• Halldor, Bjornsson and Venegas, Silvia A. (1997). "A manual for EOF and SVD analyses of climate data" (http://
www.vedur.is/~halldor/TEXT/eofsvd.html). McGill University, CCGCR Report No. 97-1, Montréal, Québec,
52pp.
• Hansen, P. C. (1987). "The truncated SVD as a method for regularization". BIT 27: 534–553.
doi:10.1007/BF01937276.
• Horn, Roger A.; Johnson, Charles R. (1985). "Section 7.3". Matrix Analysis. Cambridge University Press.
ISBN 0-521-38632-2.
• Horn, Roger A.; Johnson, Charles R. (1991). "Chapter 3". Topics in Matrix Analysis. Cambridge University Press.
ISBN 0-521-46713-6.
• Samet, H. (2006). Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann.
ISBN 0-12-369446-9.
• Strang G. (1998). "Section 6.7". Introduction to Linear Algebra (3rd ed.). Wellesley-Cambridge Press.
ISBN 0-9614088-5-5.
• Stewart, G. W. (1993). "On the Early History of the Singular Value Decomposition" (http://citeseer.ist.psu.
edu/stewart92early.html). SIAM Review 35 (4): 551–566. doi:10.1137/1035134.

• Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha (2003). "Singular value decomposition and principal
component analysis" (http://public.lanl.gov/mewall/kluwer2002.html). In D.P. Berrar, W. Dubitzky, M.
Granzow. A Practical Approach to Microarray Data Analysis. Norwell, MA: Kluwer. pp. 91–109.
• Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007), "Section 2.6" (http://apps.nrbook.com/
empanel/index.html?pg=65), Numerical Recipes: The Art of Scientific Computing (3rd ed.), New York:
Cambridge University Press, ISBN 978-0-521-88068-8

External links

Implementations

Libraries that support complex and real SVD


• LAPACK ( website (http://www.netlib.org/lapack/lug/node53.html)), the Linear Algebra Package. The user
manual gives details of subroutines to calculate the SVD (see also (http://www.netlib.org/lapack/lug/node32.html)).
• LINPACK Z ( website (http://people.sc.fsu.edu/~jburkardt/cpp_src/linpack_z/linpack_z.html)), Linear
Algebra Library. Has officially been superseded by LAPACK, but it includes a C version of SVD for complex
numbers.
• For the Python programming language:
• NumPy (http://www.scipy.org/doc/numpy_api_docs/numpy.linalg.linalg.html#svd) (NumPy is a module
for numerical computing with arrays and matrices)
• SciPy (http://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.svd.html) (SciPy contains many
numerical routines)
• NMath ( NMath SVD Documentation (http://www.centerspace.net/doc/NMath/user/
matrix-decompositions-3.html)) Math and Statistics libraries for .NET.

Libraries that support real SVD


• GNU Scientific Library ( website (http://www.gnu.org/software/gsl)), a numerical C/C++ library supporting
SVD (see (http://www.gnu.org/software/gsl/manual/html_node/Singular-Value-Decomposition.html)).
• For the Python programming language:
• NumPy (http://www.scipy.org/doc/numpy_api_docs/numpy.linalg.linalg.html#svd) (NumPy is a module
for numerical computing with arrays and matrices)
• SciPy (http://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.svd.html) (SciPy contains many
numerical routines)
• Gensim (http://radimrehurek.com/gensim), efficient randomized algorithm on top of NumPy; unlike other
implementations, allows SVD of matrices larger than RAM (incremental online SVD).
• sparsesvd (http://pypi.python.org/pypi/sparsesvd), Python wrapper of SVDLIBC.
• SVD-Python (http://stitchpanorama.sourceforge.net/Python/svd.py), pure Python SVD under GNU GPL.
• ALGLIB (http://www.alglib.net/matrixops/general/svd.php), includes a partial port of the LAPACK to C++,
C#, Delphi, Visual Basic, etc.
• JAMA (http://math.nist.gov/javanumerics/jama/), a Java matrix package provided by the NIST.
• COLT (http://acs.lbl.gov/~hoschek/colt/index.html), a Java package for High Performance Scientific and
Technical Computing, provided by CERN.
• Eigen (http://eigen.tuxfamily.org/dox/classEigen_1_1JacobiSVD.html), a templated C++ implementation.
• redsvd (http://code.google.com/p/redsvd/), efficient randomized algorithm on top of C++ Eigen.
• PROPACK (http://soi.stanford.edu/~rmunk/PROPACK/index.html), computes the SVD of large and sparse
or structured matrices, in Fortran 77.

• SVDPACK (http://www.netlib.org/svdpack/), a library in ANSI FORTRAN 77 implementing four iterative


SVD methods. Includes C and C++ interfaces.
• SVDLIBC (http://tedlab.mit.edu/~dr/SVDLIBC/), re-writing of SVDPACK in C, with minor bug fixes.
• SVDLIBJ (http://bender.unibe.ch/svn/codemap/Archive/svdlibj/src/ch/akuhn/edu/mit/tedlab/), a Java
port of SVDLIBC. (Also available as an executable .jar similar to SVDLIBC in the S-Space Package (http://
code.google.com/p/airhead-research/downloads/detail?name=svd.jar))
• SVDLIBC# (http://www.semanticsearchart.com/researchSVD.html) SVDLIBC converted to C#.
• dANN (http://wiki.syncleus.com/index.php/DANN) part of the linear algebra package of the dANN java
Artificial Intelligence library by Syncleus, Inc.
• GraphLab (http://www.graphlab.ml.cmu.edu/pmf.html) GraphLab collaborative filtering library, large scale
parallel implementation of SVD (in C++) for multicore.

Texts and demonstrations


• MIT Lecture (http://ocw.mit.edu/OcwWeb/Mathematics/18-06Spring-2005/VideoLectures/index.htm)
series by Gilbert Strang. See Lecture No. 29 on the SVD (scroll down to the bottom till you see "Singular Value
Decomposition"). The first 17 minutes give the overview. Then Prof. Strang works two examples. Then the last 4
minutes (min 36 to min 40) are a summary. You can probably fast forward the examples, but the first and last are
an excellent concise visual presentation of the topic.
• Applications of SVD (http://www.imm.dtu.dk/~pch/Projekter/tsvd.html) on PC Hansen's web site.
• Introduction to the Singular Value Decomposition (http://www.uwlax.edu/faculty/will/svd/) by Todd Will of
the University of Wisconsin—La Crosse. This site has animations for the visual minded as well as demonstrations
of compression using SVD.
• Los Alamos group's book chapter (http://public.lanl.gov/mewall/kluwer2002.html) has helpful gene data
analysis examples.
• SVD (http://www.kwon3d.com/theory/jkinem/svd.html), another explanation of singular value
decomposition
• SVD Tutorial (http://www.puffinwarellc.com/p3a.htm), yet another explanation of SVD. Very intuitive.
• Javascript script (http://users.pandora.be/paul.larmuseau/SVD.htm) demonstrating SVD more extensively,
paste your data from a spreadsheet.
• (http://www.stasegem.be/shop2/SVD.htm), demonstrating an SVD recommender system (same as above, but
showing how to make your own recommender matrix)
• Chapter from "Numerical Recipes in C" (http://www.nrbook.com/a/bookcpdf/c2-6.pdf) gives more
information about implementation and applications of SVD. (Acrobat DRM plug-in required)
• Online Matrix Calculator (http://www.bluebit.gr/matrix-calculator/) Performs singular value decomposition of
matrices.
• A simple tutorial on SVD and applications of Spectral Methods (http://www.cse.iitb.ac.in/~ranade/miscdocs/
svd.pdf)
• Matrix and Tensor Decompositions in Genomic Signal Processing (http://www.alterlab.org/publications/)
• SVD (http://mathworld.wolfram.com/SingularValueDecomposition.html) on MathWorld, with image
compression as an example application (http://demonstrations.wolfram.com/
ImageCompressionViaTheSingularValueDecomposition/).
• Notes on Rank-K Approximation (and SVD for the uninitiated) (http://www.cs.utexas.edu/users/flame/
Notes/NotesOnRankKApprox.pdf) at The University of Texas at Austin. This demo with Octave uses the data
file lenna.m (http://www.cs.utexas.edu/users/flame/Notes/lenna.m).
• If you liked this... (http://www.nytimes.com/2008/11/23/magazine/23Netflix-t.html?_r=1&
pagewanted=all) New York Times article on SVD in movie-ratings and Netflix

• David Austin, We Recommend a Singular Value Decomposition (http://www.ams.org/featurecolumn/archive/


svd.html), Featured Column from the AMS, August 2009.

Songs
• It Had To Be U (http://www.youtube.com/StatisticalSongs#p/u/4/JEYLfIVvR9I) is a song, written by
Michael Greenacre, about the singular value decomposition, explaining its definition and role in statistical
dimension reduction. It was first performed at the joint meetings of the 9th Tartu Conference on Multivariate
Statistics and 20th International Workshop on Matrices and Statistics, in Tartu, Estonia, June 2011.

Stein's method
Stein's method is a general method in probability theory to obtain bounds on the distance between two probability
distributions with respect to a probability metric. It was introduced by Charles Stein, who first published it in 1972,[1]
to obtain a bound between the distribution of a sum of m-dependent random variables and a standard
normal distribution in the Kolmogorov (uniform) metric, and hence to prove not only a central limit theorem, but also
bounds on the rates of convergence for the given metric.

History
At the end of the 1960s, unsatisfied with the by-then known proofs of a specific central limit theorem, Charles Stein
developed a new way of proving the theorem for his statistics lecture.[2] The seminal paper[1] was presented in 1970
at the sixth Berkeley Symposium and published in the corresponding proceedings.
Later, his Ph.D. student Louis Chen Hsiao Yun modified the method so as to obtain approximation results for the
Poisson distribution;[3] the method is therefore often referred to as the Stein–Chen method. Whereas moderate attention
was given to the new method in the 1970s, it underwent major development in the 1980s, when many important
contributions were made, on which today's view of the method is largely based. Probably the most important
contributions are the monograph by Stein (1986), where he presents his view of the method and the concept of
auxiliary randomisation, in particular using exchangeable pairs, and the articles by Barbour (1988) and Götze
(1991), who introduced the so-called generator interpretation, which made it possible to easily adapt the method to
many other probability distributions. An important contribution was also an article by Bolthausen (1984) on a
long-standing open problem around the so-called combinatorial central limit theorem, which surely helped the
method to become widely known.
In the 1990s the method was adapted to a variety of distributions, such as Gaussian processes by Barbour (1990), the
binomial distribution by Ehm (1991), Poisson processes by Barbour and Brown (1992), the Gamma distribution by
Luk (1994), and many others.
Stein's method 567

The basic approach

Probability metrics
Stein's method is a way to bound the distance between two probability distributions in a specific probability metric. To be
tractable with the method, the metric must be given in the form

d(P, Q) = sup_{h ∈ H} | E h(W) − E h(Y) |.     (1.1)

Here, P and Q are probability measures on a measurable space (X, A), W and Y are random variables with
distribution P and Q respectively, E is the usual expectation operator and H is a set of functions from X to
the real numbers. This set has to be large enough, so that the above definition indeed yields a metric. Important
examples are the total variation metric, where we let H consist of all the indicator functions of measurable sets; the
Kolmogorov (uniform) metric for probability measures on the real numbers, where we consider all the half-line
indicator functions; and the Lipschitz (first order Wasserstein; Kantorovich) metric, where the underlying space is
itself a metric space and we take the set H to be all Lipschitz-continuous functions with Lipschitz-constant 1.
However, note that not every metric can be represented in the form (1.1).
In what follows we think of P as a complicated distribution (e.g. the distribution of a sum of dependent random variables), which we
want to approximate by a much simpler and tractable distribution Q (e.g. the standard normal distribution), to obtain
a central limit theorem.

The Stein operator


We assume now that the distribution Q is a fixed distribution; in what follows we shall in particular consider the
case where Q is the standard normal distribution, which serves as a classical example of the application of Stein's
method.
First of all, we need an operator A, which acts on functions f from X to the real numbers, and which
'characterizes' the distribution Q in the sense that the following equivalence holds:

E (A f)(Y) = 0 for all f   ⟺   Y has distribution Q.     (2.1)

We call such an operator the Stein operator. For the standard normal distribution, Stein's lemma exactly yields such
an operator:

E ( f′(Y) − Y f(Y) ) = 0 for all bounded, continuously differentiable f   ⟺   Y has standard normal distribution,     (2.2)

thus we can take

(A f)(x) = f′(x) − x f(x).     (2.3)
We note that there are in general infinitely many such operators and it still remains an open question, which one to
choose. However, it seems that for many distributions there is a particular good one, like (2.3) for the normal
distribution.
There are different ways to find Stein operators. But by far the most important one is via generators. This approach
was, as already mentioned, introduced by Barbour and Götze. Assume that (Z_t)_{t≥0} is a (homogeneous)
continuous-time Markov process taking values in the underlying space. If (Z_t) has the stationary distribution Q it is easy to see that,
if 𝒜 is the generator of (Z_t), we have E[(𝒜f)(X)] = 0 for X with distribution Q and a large set of functions f. Thus, generators are
natural candidates for Stein operators and this approach will also help us in later computations.
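Stein's lemma can be checked empirically: for standard normal samples the identity E[f'(Z) − Z f(Z)] = 0 holds for smooth test functions, while it fails for a non-normal distribution with the same mean and variance. A minimal Monte Carlo sketch (the test function f(x) = sin x is an arbitrary choice, not from the original text):

```python
import math
import random

random.seed(0)

def stein_expectation(samples, f, fprime):
    """Monte Carlo estimate of E[f'(X) - X f(X)]."""
    return sum(fprime(x) - x * f(x) for x in samples) / len(samples)

N = 200_000
normal = [random.gauss(0.0, 1.0) for _ in range(N)]
# Uniform on [-sqrt(3), sqrt(3)] also has mean 0 and variance 1,
# but it is not normal, so the Stein identity fails for it.
a = math.sqrt(3.0)
uniform = [random.uniform(-a, a) for _ in range(N)]

# Test function f(x) = sin(x), with f'(x) = cos(x).
e_normal = stein_expectation(normal, math.sin, math.cos)
e_uniform = stein_expectation(uniform, math.sin, math.cos)

print(e_normal)   # close to 0
print(e_uniform)  # clearly away from 0
```

The size of |E[f'(X) − X f(X)]| over a class of test functions is exactly what Stein's method turns into a distance bound.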
Stein's method 568

Setting up the Stein equation


Observe now that saying that the law of W is close to Q with respect to the metric (1.1) is equivalent to saying that the difference of
expectations in (1.1) is close to 0, and indeed it is equal to 0 if the laws coincide. We hope now that the operator 𝒜 exhibits
the same behavior: clearly we have E[(𝒜f)(Z)] = 0 if Z has distribution Q, and hopefully E[(𝒜f)(W)] ≈ 0 if the law of W is close
to Q.
To make this statement rigorous we could find a function f, such that, for a given function h,

    E[(𝒜f)(W)] = E[h(W)] − E[h(Z)],     (3.1)

where Z has distribution Q, so that the behavior of the right hand side is reproduced by the operator 𝒜 and f. However, this equation is too
general. We solve instead the more specific equation

    (𝒜f)(x) = h(x) − E[h(Z)],     (3.2)

which is called the Stein equation. Replacing x by W and taking expectations with respect to W, we are back to (3.1),
which is what we effectively want. Now all the effort is worth it only if the left hand side of (3.1) is easier to bound
than the right hand side. This is, surprisingly, often the case.
If Q is the standard normal distribution and we use (2.3), the corresponding Stein equation is

    f'(x) − x f(x) = h(x) − E[h(Z)],     (3.3)

which is just an ordinary differential equation.

Solving the Stein equation


Now, in general, we cannot say much about how the equation (3.2) is to be solved. However, there are important
cases where we can.
Analytic methods. We see from (3.3) that equation (3.2) can in particular be a differential equation (if Q is
concentrated on the integers, it will often turn out to be a difference equation). As there are many methods available
to treat such equations, we can use them to solve the equation. For example, (3.3) can easily be solved explicitly:

    f(x) = e^{x²/2} ∫_{−∞}^{x} [h(s) − E h(Z)] e^{−s²/2} ds.     (4.1)
Generator method. If 𝒜 is the generator of a Markov process (Z_t)_{t≥0} as explained before, we can give a general
solution to (3.2):

    f(x) = −∫_0^∞ [E^x h(Z_t) − E h(Z)] dt,     (4.2)

where E^x denotes expectation with respect to the process (Z_t) being started in x. However, one still has to prove
that the solution (4.2) exists for all desired functions h.
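The explicit solution (4.1) can be evaluated by simple quadrature and checked against the Stein equation (3.3). A sketch under two assumed test cases: h = cos, for which E[cos Z] = e^{−1/2} for standard normal Z, and h(s) = s², for which (4.1) reduces to the closed form f(x) = −x:

```python
import math

def solve_stein(h, Eh, x, lo=-10.0, n=20_000):
    """Evaluate f(x) = exp(x^2/2) * integral_{lo}^{x} (h(s) - Eh) exp(-s^2/2) ds
    by the trapezoid rule; lo stands in for -infinity."""
    step = (x - lo) / n
    g = lambda s: (h(s) - Eh) * math.exp(-s * s / 2.0)
    total = 0.5 * (g(lo) + g(x)) + sum(g(lo + i * step) for i in range(1, n))
    return math.exp(x * x / 2.0) * total * step

h, Eh = math.cos, math.exp(-0.5)  # E[cos Z] = e^{-1/2} for Z ~ N(0,1)

# Check the ODE f'(x) - x f(x) = h(x) - Eh at a test point via central differences.
x, eps = 0.7, 1e-4
fp = (solve_stein(h, Eh, x + eps) - solve_stein(h, Eh, x - eps)) / (2 * eps)
residual = fp - x * solve_stein(h, Eh, x) - (h(x) - Eh)
print(abs(residual))  # small

# Closed-form case: h(s) = s^2, E h(Z) = 1, whose solution is f(x) = -x.
print(solve_stein(lambda s: s * s, 1.0, 0.7))  # near -0.7
```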

Properties of the solution to the Stein equation


After showing the existence of a solution to (3.2) we can now try to analyze its properties. Usually, one tries to give
bounds on f and its derivatives (which have to be carefully defined if the underlying space is more complicated) or differences
in terms of h and its derivatives or differences, that is, inequalities of the form

    ||D^k f|| ≤ C_{k,l} ||D^l h||,     (5.1)

for some specific k, l = 0, 1, 2, … (depending on the form of the Stein operator), and where often ||·|| is taken to be
the supremum norm. Here, D denotes the differential operator, but in discrete settings it usually refers to a difference
operator. The constants C_{k,l} may contain the parameters of the distribution Q. If there are any, they are often
referred to as Stein factors or magic factors.
In the case of (4.1) we can prove for the supremum norm that

    ||f|| ≤ √(π/2) ||h − E h(Z)||,   ||f'|| ≤ 2 ||h − E h(Z)||,   ||f''|| ≤ 2 ||h'||,     (5.2)

where the last bound is of course only applicable if h is differentiable (or at least Lipschitz-continuous, which, for
example, is not the case if we regard the total variation metric or the Kolmogorov metric!). As the standard normal
distribution has no extra parameters, in this specific case the constants are free of additional parameters.
Note that, up to this point, we did not make use of the random variable W. So, the steps up to here in general have
to be calculated only once for a specific combination of distribution Q, metric and Stein operator 𝒜.
However, if we have bounds in the general form (5.1), we usually are able to treat many probability metrics together.
Furthermore, as there is often a particular 'good' Stein operator for a distribution (e.g., no operator other than (2.3) has
been used for the standard normal distribution up to now), one can often just start with the next step below, if bounds
of the form (5.1) are already available (which is the case for many distributions).

An abstract approximation theorem


We are now in a position to bound the left hand side of (3.1). As this step heavily depends on the form of the Stein
operator, we directly regard the case of the standard normal distribution.
At this point we could directly plug in the random variable W which we want to approximate and try to find
upper bounds. However, it is often fruitful to formulate a more general theorem using only abstract properties of
W. Let us consider here the case of local dependence.

To this end, assume that W = X_1 + ⋯ + X_n is a sum of random variables such that E[W] = 0 and variance
Var(W) = 1. Assume that, for every i = 1, …, n, there is a set A_i ⊂ {1, …, n}, such that X_i is
independent of all the random variables X_j with j ∉ A_i. We call this set the 'neighborhood' of X_i. Likewise let
B_i ⊂ {1, …, n} be a set such that all X_j with j ∈ A_i are independent of all X_k, k ∉ B_i. We can
think of B_i as containing the neighbors of the neighborhood of X_i, a second-order neighborhood, so to speak. For a set
A ⊂ {1, …, n} define now the sum X_A := Σ_{j∈A} X_j.
Using basically only Taylor expansion, it is possible to prove that

Note that, if we follow this line of argument, we can bound (1.1) only for functions h where ||h'|| is bounded,
because of the third inequality of (5.2) (and in fact, if h has discontinuities, so will f''). To obtain a bound similar
to (6.1) which contains only the expressions ||f|| and ||f'||, the argument is much more involved and the
result is not as simple as (6.1); however, it can be done.
Theorem A. If is as described above, we have for the Lipschitz metric that

Proof. Recall that the Lipschitz metric is of the form (1.1) where the functions h are Lipschitz-continuous with
Lipschitz constant 1, thus ||h'|| ≤ 1. Combining this with (6.1) and the last bound in (5.2) proves the theorem.
Thus, roughly speaking, we have proved that, to calculate the Lipschitz distance between a W with local
dependence structure and a standard normal distribution, we only need to know the third moments of the X_i and the
size of the neighborhoods A_i and B_i.

Application of the theorem


We can treat the case of sums of independent and identically distributed random variables with Theorem A. So
assume now that W = n^{−1/2}(ξ_1 + ⋯ + ξ_n), where the ξ_i are i.i.d. with E[ξ_i] = 0, E[ξ_i²] = 1 and E|ξ_i|³ < ∞. We can
take A_i = B_i = {i} and we obtain from Theorem A that the Lipschitz distance between W and the standard normal
distribution is bounded by a universal constant times E|ξ_1|³ / √n.
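The n^{−1/2} decay of the Lipschitz (Wasserstein-1) distance for standardized i.i.d. sums can be checked empirically. A rough sketch (the centered exponential summands and the quantile-matching distance estimate are illustrative choices, not from the original text):

```python
import math
import random
from statistics import NormalDist

random.seed(1)

def w1_to_normal(n, m=20_000):
    """Estimate the W1 distance between W_n = n^{-1/2} * (sum of n centered
    Exp(1) variables) and N(0,1), by matching sorted samples to exact
    normal quantiles at midpoints."""
    samples = sorted(
        sum(random.expovariate(1.0) - 1.0 for _ in range(n)) / math.sqrt(n)
        for _ in range(m)
    )
    nd = NormalDist()
    quantiles = [nd.inv_cdf((i + 0.5) / m) for i in range(m)]
    return sum(abs(s - q) for s, q in zip(samples, quantiles)) / m

d4, d64 = w1_to_normal(4), w1_to_normal(64)
print(d4, d64)  # the distance shrinks roughly like 1/sqrt(n)
```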
Connections to other methods


• Lindeberg's method. Lindeberg (1922) introduced in a seminal article a method in which the difference in (1.1) is
directly bounded. This method usually also relies heavily on Taylor expansion and thus shows some similarities
with Stein's method.
• Tikhomirov's method. Clearly the approach via (1.1) and (3.1) does not involve characteristic functions.
However, Tikhomirov (1980) presented a proof of a central limit theorem based on characteristic functions and a
differential operator similar to (2.3). The basic observation is that the characteristic function φ(t) = e^{−t²/2} of the standard
normal distribution satisfies the differential equation φ'(t) + t φ(t) = 0 for all t. Thus, if the characteristic
function ψ(t) of W is such that ψ'(t) + t ψ(t) ≈ 0 we expect that ψ(t) ≈ φ(t) and hence that
W is close to the normal distribution. Tikhomirov states in his paper that he was inspired by Stein's seminal
paper.

Notes
[1] Stein, C. (1972). "A bound for the error in the normal approximation to the distribution of a sum of dependent random variables"
(http://projecteuclid.org/euclid.bsmsp/1200514239). Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability:
583–602. MR402873. Zbl 0278.60026.
[2] Charles Stein: The Invariant, the Direct and the "Pretentious" (http://www.ims.nus.edu.sg/imprints/interviews/CharlesStein.pdf).
Interview given in 2003 in Singapore.
[3] Chen, L.H.Y. (1975). "Poisson approximation for dependent trials". Annals of Probability 3 (3): 534–545. doi:10.1214/aop/1176996359.
JSTOR 2959474. MR428387. Zbl 0335.60016.

References
• Barbour, A. D. (1988). "Stein's method and Poisson process convergence". J. Appl. Probab. (Applied Probability
Trust) 25A: 175–184. doi:10.2307/3214155. JSTOR 3214155.
• Barbour, A. D. (1990). "Stein's method for diffusion approximations". Probab. Theory Related Fields 84 (3):
297–322. doi:10.1007/BF01197887.
• Barbour, A. D. and Brown, T. C. (1992). "Stein's method and point process approximation". Stochastic Process.
Appl. 43 (1): 9–31. doi:10.1016/0304-4149(92)90073-Y.
• Bolthausen, E. (1984). "An estimate of the remainder in a combinatorial central limit theorem". Z. Wahrsch.
Verw. Gebiete 66 (3): 379–386. doi:10.1007/BF00533704.
• Ehm, W. (1991). "Binomial approximation to the Poisson binomial distribution". Statist. Probab. Lett. 11 (1):
7–16. doi:10.1016/0167-7152(91)90170-V.
• Götze, F. (1991). "On the rate of convergence in the multivariate CLT". Ann. Probab. 19 (2): 724–739.
doi:10.1214/aop/1176990448.
• Lindeberg, J. W. (1922). "Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung".
Math. Z. 15 (1): 211–225. doi:10.1007/BF01494395.
• Luk, H. M. (1994). Stein's method for the gamma distribution and related statistical applications. Dissertation.
• Stein, C. (1986). Approximate computation of expectations. Institute of Mathematical Statistics Lecture Notes,
Monograph Series, 7. ISBN 0-940600-08-0.

• Tikhomirov, A. N. (1980). "Convergence rate in the central limit theorem for weakly dependent random
variables". Teor. Veroyatnost. I Primenen. 25: 800–818. English translation in Theory Probab. Appl. 25
(1980–81): 790–809.

Literature
The following text is advanced, and gives a comprehensive overview of the normal case
• Chen, L.H.Y., Goldstein, L., and Shao, Q.M (2011). Normal approximation by Stein's method.
www.springer.com. ISBN 978-3-642-15006-7.
Another advanced book, but having some introductory character, is
• ed. Barbour, A.D. and Chen, L.H.Y. (2005). An introduction to Stein's method. Lecture Notes Series, Institute for
Mathematical Sciences, National University of Singapore. 4. Singapore University Press. ISBN 981-256-280-X.
A standard reference is the book by Stein,
• Stein, C. (1986). Approximate computation of expectations. Institute of Mathematical Statistics Lecture Notes,
Monograph Series, 7. Hayward, Calif.: Institute of Mathematical Statistics. ISBN 0-940600-08-0.
which contains a lot of interesting material, but may be a little hard to understand at first reading.
Despite its age, there are few standard introductory books about Stein's method available. The following recent
textbook has a chapter (Chapter 2) devoted to introducing Stein's method:
• Ross, Sheldon and Peköz, Erol (2007). A second course in probability. www.ProbabilityBookstore.com.
ISBN 978-0-9795704-0-7.
Although the book
• Barbour, A. D. and Holst, L. and Janson, S. (1992). Poisson approximation. Oxford Studies in Probability. 2. The
Clarendon Press Oxford University Press. ISBN 0-19-852235-5.
is in large part about Poisson approximation, it nevertheless contains a lot of information about the generator
approach, in particular in the context of Poisson process approximation.
Stirling's approximation 572

Stirling's approximation
In mathematics, Stirling's approximation (or Stirling's formula) is an approximation for large factorials. It is
named after James Stirling.
The formula as typically used in applications is

    ln n! = n ln n − n + O(ln n).

The next term in the O(ln n) is (1/2) ln(2πn); a more precise variant of the formula is therefore

    n! ∼ √(2πn) (n/e)^n,

often written

    n! ≈ √(2πn) (n/e)^n.

[Figure: the ratio of ln n! to n ln n − n approaches unity as n increases.]

Sometimes, bounds for n! rather than asymptotics are required: one has, for all positive integers n,

    √(2π) n^{n+1/2} e^{−n} ≤ n! ≤ e n^{n+1/2} e^{−n},

so for all n the ratio n! / (n^{n+1/2} e^{−n}) is always between √(2π) ≈ 2.5066 and e ≈ 2.7183, e.g. between 2.5 and 2.8.
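Both the two-sided bound and the shrinking relative error of Stirling's formula are easy to verify numerically; a minimal sketch:

```python
import math

def stirling(n):
    """Stirling's approximation: sqrt(2*pi*n) * (n/e)^n."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

for n in (1, 5, 10, 20):
    exact = math.factorial(n)
    ratio = exact / (n ** (n + 0.5) * math.exp(-n))  # between sqrt(2*pi) and e
    rel_err = abs(stirling(n) - exact) / exact       # roughly 1/(12n)
    print(n, ratio, rel_err)
```

At n = 1 the upper bound e is attained exactly; as n grows the ratio drifts down toward √(2π).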

Derivation
The formula, together with precise estimates of its error, can be derived as follows. Instead of approximating n!, one
considers its natural logarithm, as this is a slowly varying function:

    ln n! = ln 1 + ln 2 + ⋯ + ln n.

The right-hand side of this equation minus (1/2) ln n is the approximation by the trapezoid rule of the integral

    ∫_1^n ln x dx = n ln n − n + 1,

and the error in this approximation is given by the Euler–Maclaurin formula:

where Bk is a Bernoulli number and Rm,n is the remainder term in the Euler–Maclaurin formula. Take limits to find
that

Denote this limit by y. Because the remainder Rm,n in the Euler–Maclaurin formula satisfies

where we use Big-O notation, combining the equations above yields the approximation formula in its logarithmic
form:

Taking the exponential of both sides, and choosing any positive integer m, we get a formula involving an unknown
quantity e^y. For m = 1, the formula is

    n! = e^y √n (n/e)^n (1 + O(1/n)).

The quantity e^y can be found by taking the limit on both sides as n tends to infinity and using Wallis' product, which
shows that e^y = √(2π). Therefore, we get Stirling's formula:

    n! = √(2πn) (n/e)^n (1 + O(1/n)).
The formula may also be obtained by repeated integration by parts, and the leading term can be found through
Laplace's method. Stirling's formula, without the factor √(2πn) that is often irrelevant in applications, can be
quickly obtained by approximating the sum

    ln n! = Σ_{j=1}^n ln j

with an integral:

    ln n! ≈ ∫_1^n ln x dx = n ln n − n + 1.
Speed of convergence and error estimates


More precisely,

    n! = √(2πn) (n/e)^n e^{λ_n}

with

    1/(12n + 1) < λ_n < 1/(12n).

Stirling's formula is in fact the first approximation to the following series (now called the Stirling series):

    n! ∼ √(2πn) (n/e)^n (1 + 1/(12n) + 1/(288n²) − 139/(51840n³) − 571/(2488320n⁴) + ⋯).

[Figure: the relative error in a truncated Stirling series vs. n, for 1 to 5 terms.]

The first graph in this section shows the relative error vs. n, for 1 through all 5 terms listed above.
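The truncated series can be sketched directly with the coefficients listed above; for a fixed, moderately large n each additional term still improves the estimate:

```python
import math

# Coefficients of the Stirling series 1 + 1/(12n) + 1/(288n^2) - ...
COEFFS = [1.0, 1 / 12, 1 / 288, -139 / 51840, -571 / 2488320]

def stirling_series(n, terms):
    """sqrt(2*pi*n) * (n/e)^n times the first `terms` series coefficients."""
    s = sum(c / n ** k for k, c in enumerate(COEFFS[:terms]))
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n * s

n = 10
exact = math.factorial(n)
errors = [abs(stirling_series(n, t) - exact) / exact for t in (1, 2, 3, 4, 5)]
print(errors)  # each extra term reduces the relative error for n this large
```

For much smaller n, or many more terms, the divergent nature of the series eventually makes accuracy worse, as the text below explains.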

As n → ∞, the error in the truncated series is asymptotically equal
to the first omitted term. This is an example of an asymptotic
expansion. It is not a convergent series; for any particular value of n
there are only so many terms of the series that improve accuracy, after
which point accuracy actually gets worse. This is demonstrated in the
next graph, which shows the relative error vs. the number of terms in
the series, for larger numbers of terms. (More precisely, let S(n, t) be
the Stirling series to t terms evaluated at n. The graphs show
|ln(S(n, t)/n!)|, which, when small, is essentially the relative
error.)

[Figure: the relative error in a truncated Stirling series vs. the number of terms used.]

Writing Stirling's series in the form

    ln n! ∼ (1/2) ln(2πn) + n ln n − n + 1/(12n) − 1/(360n³) + 1/(1260n⁵) − ⋯,

it is known that the error in truncating the series is always of the same sign and at most the same magnitude as the
first omitted term.

Stirling's formula for the Gamma function


For all positive integers,

    n! = Π(n) = Γ(n + 1).

However, the Pi function, unlike the factorial, is more broadly defined for all complex numbers other than
non-positive integers; nevertheless, Stirling's formula may still be applied. If Re(z) > 0 then

    ln Γ(z) = (z − 1/2) ln z − z + (1/2) ln(2π) + 2 ∫_0^∞ arctan(t/z) / (e^{2πt} − 1) dt.

Repeated integration by parts gives

    ln Γ(z) ∼ (z − 1/2) ln z − z + (1/2) ln(2π) + Σ_{n=1}^∞ B_{2n} / (2n(2n − 1) z^{2n−1}),

where B_{2n} is the 2n-th Bernoulli number (note that the infinite sum is not convergent, so this formula is just an
asymptotic expansion). The formula is valid for z large enough in absolute value when |arg(z)| < π − ε, where ε
is positive, with an error term of O(z^{−(2m+1)}) when the first m terms are used. The corresponding approximation
may now be written:

    Γ(z) = √(2π/z) (z/e)^z (1 + O(1/z)).

A convergent version of Stirling's formula


Thomas Bayes showed, in a letter to John Canton published by the Royal Society in 1763, that Stirling's formula did
not give a convergent series.[1]
Obtaining a convergent version of Stirling's formula entails evaluating

One way to do this is by means of a convergent series of inverted rising exponentials. If

    z^{(n)} = z(z + 1)(z + 2) ⋯ (z + n − 1)

denotes the rising factorial, then

where

where s(n, k) denotes the Stirling numbers of the first kind. From this we obtain a version of Stirling's series

which converges when Re(z) > 0.

Versions suitable for calculators


The approximation

    Γ(z) ≈ √(2π/z) ((z/e) √(z sinh(1/z) + 1/(810 z⁶)))^z,

or equivalently,

    ln Γ(z) ≈ (1/2)(ln(2π) − ln z) + z (ln z − 1 + (1/2) ln(z sinh(1/z) + 1/(810 z⁶))),

can be obtained by rearranging Stirling's extended formula and observing a coincidence between the resultant power
series and the Taylor series expansion of the hyperbolic sine function. This approximation is good to more than 8
decimal digits for z with a real part greater than 8. Robert H. Windschitl suggested it in 2002 for computing the
Gamma function with fair accuracy on calculators with limited program or register memory.[2]
Gergő Nemes proposed in 2007 an approximation which gives the same number of exact digits as the Windschitl
approximation but is much simpler:[3]

    Γ(z) ≈ √(2π/z) ((1/e)(z + 1/(12z − 1/(10z))))^z,

or equivalently,

    ln Γ(z) ≈ (1/2)(ln(2π) − ln z) + z (ln(z + 1/(12z − 1/(10z))) − 1).

An apparently superior approximation for ln n! was also given by Srinivasa Ramanujan (Ramanujan 1988):

    ln n! ≈ n ln n − n + (1/6) ln(n(1 + 4n(1 + 2n))) + (1/2) ln π.
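Both calculator-oriented approximations can be tested against math.gamma. The formulas below are the Windschitl and Nemes approximations as reconstructed in this section, so treat the exact constants as assumptions:

```python
import math

def gamma_windschitl(z):
    """Windschitl's approximation to Gamma(z)."""
    return math.sqrt(2 * math.pi / z) * (
        (z / math.e) * math.sqrt(z * math.sinh(1 / z) + 1 / (810 * z ** 6))
    ) ** z

def gamma_nemes(z):
    """Nemes's approximation to Gamma(z)."""
    return math.sqrt(2 * math.pi / z) * (
        (z + 1 / (12 * z - 1 / (10 * z))) / math.e
    ) ** z

for z in (8.5, 10.0, 20.0):
    exact = math.gamma(z)
    print(z,
          abs(gamma_windschitl(z) - exact) / exact,
          abs(gamma_nemes(z) - exact) / exact)
```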

History
The formula was first discovered by Abraham de Moivre[4][5] in the form

    n! ∼ [constant] · n^{n+1/2} e^{−n}.

De Moivre gave an expression for the constant in terms of its natural logarithm. Stirling's contribution consisted of
showing that the constant is √(2π). The more precise versions are due to Jacques Binet.

Notes
[1] http://www.york.ac.uk/depts/maths/histstat/letter.pdf
[2] Toth, V. T. Programmable Calculators: Calculators and the Gamma Function (2006) (http://www.rskey.org/gamma.htm)
[3] Nemes, Gergő (2010), "New asymptotic expansion for the Gamma function", Archiv der Mathematik 95 (2): 161–169,
doi:10.1007/s00013-010-0146-9, ISSN 0003-889X.
[4] Le Cam, L. (1986), "The central limit theorem around 1935", Statistical Science 1 (1): 78–96 [p. 81], doi:10.1214/ss/1177013818, "The
result, obtained using a formula originally proved by de Moivre but now called Stirling's formula, occurs in his 'Doctrine of Chances' of
1733."
[5] Pearson, Karl, "Historical note on the origin of the normal curve of errors", Biometrika 16: 402–404 [p. 403], "I consider that the fact that
Stirling showed that De Moivre's arithmetical constant was √(2π) does not entitle him to claim the theorem, [...]"

References
• Abramowitz, M. & Stegun, I. (2002), Handbook of Mathematical Functions (http://www.math.hkbu.edu.hk/
support/aands/toc.htm)
• Nemes, G. (2010), "New asymptotic expansion for the Gamma function", Archiv der Mathematik 95 (2):
161–169, doi:10.1007/s00013-010-0146-9
• Paris, R. B. & Kaminsky, D. (2001), Asymptotics and the Mellin–Barnes Integrals, New York: Cambridge
University Press, ISBN 0-521-79001-8
• Whittaker, E. T. & Watson, G. N. (1996), A Course in Modern Analysis (4th ed.), New York: Cambridge
University Press, ISBN 0-521-58807-3
• Dan Romik, Stirling’s Approximation for n!: The Ultimate Short Proof?, The American Mathematical Monthly,
Vol. 107, No. 6 (Jun. – Jul., 2000), 556–557.
• Y.-C. Li, A Note on an Identity of the Gamma Function and Stirling's Formula, Real Analysis Exchange, Vol.
32(1), 2006/2007, pp. 267–272.

External links
• Peter Luschny, Approximation formulas for the factorial function n! (http://www.luschny.de/math/factorial/
approx/SimpleCases.html)
• Weisstein, Eric W., " Stirling's Approximation (http://mathworld.wolfram.com/StirlingsApproximation.html)"
from MathWorld.
• Stirling's approximation (http://planetmath.org/encyclopedia/StirlingsApproximation.html) at PlanetMath
Student's t-distribution 577

Student's t-distribution
Student’s t

[Figure: probability density function]

[Figure: cumulative distribution function]

Parameters: ν > 0 degrees of freedom (real)
Support: x ∈ (−∞, +∞)
PDF: Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) (1 + x²/ν)^(−(ν+1)/2)
CDF: 1/2 + x Γ((ν+1)/2) ₂F₁(1/2, (ν+1)/2; 3/2; −x²/ν) / (√(νπ) Γ(ν/2)), where ₂F₁ is the hypergeometric function
Mean: 0 for ν > 1, otherwise undefined
Median: 0
Mode: 0
Variance: ν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2, otherwise undefined
Skewness: 0 for ν > 3, otherwise undefined
Ex. kurtosis: 6/(ν − 4) for ν > 4, ∞ for 2 < ν ≤ 4, otherwise undefined
Entropy: ((ν+1)/2) [ψ((ν+1)/2) − ψ(ν/2)] + ln(√ν B(ν/2, 1/2)), where ψ is the digamma function and B is the beta function
MGF: undefined
CF: K_{ν/2}(√ν |t|) (√ν |t|)^(ν/2) / (Γ(ν/2) 2^(ν/2 − 1)) for ν > 0, where K_{ν/2}(x) is a Bessel function[1]

In probability and statistics, Student’s t-distribution (or simply the t-distribution) is a family of continuous
probability distributions that arises when estimating the mean of a normally distributed population in situations
where the sample size is small and population standard deviation is unknown. It plays a role in a number of widely
used statistical analyses, including the Student’s t-test for assessing the statistical significance of the difference
between two sample means, the construction of confidence intervals for the difference between two population
means, and in linear regression analysis. The Student’s t-distribution also arises in the Bayesian analysis of data from
a normal family.
The t-distribution is symmetric and bell-shaped, like the normal distribution, but has heavier tails, meaning that it is
more prone to producing values that fall far from its mean. This makes it useful for understanding the statistical
behavior of certain types of ratios of random quantities, in which variation in the denominator is amplified and may
produce outlying values when the denominator of the ratio falls close to zero. The Student’s t-distribution is a special
case of the generalised hyperbolic distribution.

Definition

Probability density function


Student's t-distribution has the probability density function given by

    f(t) = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) (1 + t²/ν)^(−(ν+1)/2),

where ν is the number of degrees of freedom and Γ is the Gamma function. This may also be written as

    f(t) = 1 / (√ν B(1/2, ν/2)) (1 + t²/ν)^(−(ν+1)/2),

where B is the Beta function.

For ν even,

    Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) = (ν − 1)(ν − 3) ⋯ 5 · 3 / (2√ν (ν − 2)(ν − 4) ⋯ 4 · 2).

For ν odd,

    Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) = (ν − 1)(ν − 3) ⋯ 4 · 2 / (π√ν (ν − 2)(ν − 4) ⋯ 5 · 3).
The overall shape of the probability density function of the t-distribution resembles the bell shape of a normally
distributed variable with mean 0 and variance 1, except that it is a bit lower and wider. As the number of degrees of
freedom grows, the t-distribution approaches the normal distribution with mean 0 and variance 1.
The following images show the density of the t-distribution for increasing values of ν. The normal distribution is
shown as a blue line for comparison. Note that the t-distribution (red line) becomes closer to the normal distribution
as ν increases.

[Figure: density of the t-distribution (red) for 1, 2, 3, 5, 10, and 30 degrees of freedom compared to the standard
normal distribution (blue); previous plots shown in green.]

Cumulative distribution function


The cumulative distribution function can be written in terms of I, the regularized incomplete beta function. For
t > 0,[2]

    F(t) = 1 − (1/2) I_{x(t)}(ν/2, 1/2),

with

    x(t) = ν / (t² + ν).
Other values would be obtained by symmetry. An alternative formula, valid for t² < ν, is[2]

    F(t) = 1/2 + t Γ((ν+1)/2) ₂F₁(1/2, (ν+1)/2; 3/2; −t²/ν) / (√(πν) Γ(ν/2)),

where ₂F₁ is a particular case of the hypergeometric function.

Special cases
Certain values of ν give an especially simple form.

ν = 1:

Distribution function: F(x) = 1/2 + (1/π) arctan(x)

Density function: f(x) = 1 / (π(1 + x²))

See Cauchy distribution

ν = 2:

Distribution function: F(x) = (1/2)(1 + x / √(x² + 2))

Density function: f(x) = 1 / (x² + 2)^(3/2)

ν = 3:

Density function: f(x) = 6√3 / (π(x² + 3)²)

ν = ∞:

Density function: f(x) = (1/√(2π)) e^(−x²/2)

See Normal distribution

How the t-distribution arises


Let x_1, ..., x_n be the numbers observed in a sample from a continuously distributed population with expected value μ.
The sample mean and sample variance are respectively

    x̄ = (x_1 + ⋯ + x_n)/n,   s² = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)².

The resulting t-value is

    t = (x̄ − μ) / (s/√n).

The t-distribution with n − 1 degrees of freedom is the sampling distribution of the t-value when the samples consist
of independent identically distributed observations from a normally distributed population.
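The computation of the t-value can be sketched directly from these definitions:

```python
import math
from statistics import mean, stdev  # stdev uses the n-1 denominator

def t_value(sample, mu):
    """t = (sample mean - mu) / (s / sqrt(n))."""
    n = len(sample)
    return (mean(sample) - mu) / (stdev(sample) / math.sqrt(n))

# Example: mean 3, s^2 = 2.5, n = 5, tested against mu = 2.
t = t_value([1, 2, 3, 4, 5], 2.0)
print(t)  # sqrt(2) = 1.41421...
```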

History and etymology


In statistics, the t-distribution was first derived as a posterior distribution by Helmert[3][4][5] and Lüroth.[6][7][8] In the
English literature, a derivation of the t-distribution was published in 1908 by William Sealy Gosset[9] while he
worked at the Guinness Brewery in Dublin. One version of the origin of the pseudonym Student is that Gosset's
employer forbade members of its staff from publishing scientific papers, so he had to hide his identity. Another
version is that Guinness did not want their competition to know that they were using the t-test to test the quality of
raw material.[10] The t-test and the associated theory became well-known through the work of R.A. Fisher, who
called the distribution "Student's distribution".[11][12]

Characterization

As the distribution of a test statistic


Student's t-distribution with ν degrees of freedom can be defined as the distribution of the random variable T with
[2][13]

    T = Z / √(V/ν),

where
• Z is normally distributed with expected value 0 and variance 1;
• V has a chi-squared distribution with ν ("nu") degrees of freedom;
• Z and V are independent.

A different distribution is defined as that of the random variable defined, for a given constant μ, by

    (Z + μ) / √(V/ν).

This random variable has a noncentral t-distribution with noncentrality parameter μ. This distribution is important in
studies of the power of Student's t test.

Derivation
Suppose X_1, ..., X_n are independent values that are normally distributed with expected value μ and variance σ². Let

    X̄_n = (X_1 + ⋯ + X_n)/n

be the sample mean, and

    S_n² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²

be an unbiased estimate of the variance from the sample. It can be shown that the random variable

    V = (n − 1) S_n² / σ²

has a chi-squared distribution with n − 1 degrees of freedom (by Cochran's theorem).[14] It is readily shown that the
quantity

    Z = (X̄_n − μ) √n / σ

is normally distributed with mean 0 and variance 1, since the sample mean X̄_n is normally distributed with mean μ
and variance σ²/n. Moreover, it is possible to show that these two random variables (the normally distributed one and
the chi-squared-distributed one) are independent. Consequently the pivotal quantity

    T = (X̄_n − μ) √n / S_n,

which differs from Z in that the exact standard deviation σ is replaced by the random variable S_n, has a Student's
t-distribution as defined above. Notice that the unknown population variance σ² does not appear in T, since it was in
both the numerator and the denominator, so it cancelled. Gosset intuitively obtained the probability density function
stated above, with ν equal to n − 1, and Fisher proved it in 1925.[15]
The distribution of the test statistic T depends on ν, but not on μ or σ; the lack of dependence on μ and σ is what
makes the t-distribution important in both theory and practice.

As a maximum entropy distribution


Student's t-distribution is the maximum entropy probability distribution for a random variate X for which
E[ln(ν + X²)] is fixed.[16]

Properties

Moments
The raw moments of the t-distribution are

    E[T^k] = 0                                                 for k odd, 0 < k < ν,
    E[T^k] = Γ((k+1)/2) Γ((ν−k)/2) ν^(k/2) / (√π Γ(ν/2))       for k even, 0 < k < ν,
    E[T^k] undefined                                           for k odd, 0 < ν ≤ k,
    E[T^k] = ∞                                                 for k even, 0 < ν ≤ k.

The distinction between "undefined" and "defined with the value of infinity" should be kept in mind. This is
equivalent to the distinction between the result of 0/0 vs. 1/0. Attempting to evaluate the odd moments in the cases
above listed as "undefined" results in the expression ∞ − ∞. Because the mean (first raw moment) is undefined
when ν = 1 (equivalent to the Cauchy distribution), all of the central moments and standardized moments are
likewise undefined, including the variance, skewness and kurtosis.
It should be noted that the term for 0 < k < ν, k even, may be simplified using the properties of the Gamma
function to

    E[T^k] = ν^(k/2) ∏_{i=1}^{k/2} (2i − 1)/(ν − 2i).

For a t-distribution with ν degrees of freedom, the expected value is 0, and its variance is ν/(ν − 2) if ν > 2.
The skewness is 0 if ν > 3 and the excess kurtosis is 6/(ν − 4) if ν > 4.

Relation to F distribution
• T² has an F-distribution with (1, ν) degrees of freedom if T has a Student's t-distribution with ν degrees of freedom.

Monte Carlo sampling


There are various approaches to constructing random samples from the Student-t distribution. The matter depends on
whether the samples are required on a stand-alone basis, or are to be constructed by application of a quantile function
to uniform samples, e.g. in multi-dimensional applications involving copula dependency. In the case of
stand-alone sampling, an extension of the Box–Muller method and its polar form is easily deployed.[17] It has the
merit that it applies equally well to all real positive degrees of freedom ν, while many other candidate methods fail if
ν is close to zero.[17]
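A simple stand-alone sampler, different from the polar scheme referred to above, follows from the representation T = Z/√(V/ν) with V chi-squared (generated here as a Gamma(ν/2, 2) variate); a sketch:

```python
import math
import random

random.seed(2)

def t_sample(nu):
    """Draw one Student-t variate via T = Z / sqrt(V / nu), V ~ chi^2 with nu d.o.f."""
    z = random.gauss(0.0, 1.0)
    v = random.gammavariate(nu / 2.0, 2.0)  # chi-squared(nu) as Gamma(nu/2, scale 2)
    return z / math.sqrt(v / nu)

nu, n = 5.0, 200_000
draws = [t_sample(nu) for _ in range(n)]
var = sum(x * x for x in draws) / n  # the mean is 0, so this estimates the variance
print(var)  # should be near nu/(nu - 2) = 5/3
```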

Integral of Student's probability density function and p-value


The function A(t|ν) is the integral of Student's probability density function, f(t), between −t and t, for t ≥ 0. It thus
gives the probability that a value of t less than that calculated from observed data would occur by chance. Therefore,
the function A(t|ν) can be used when testing whether the difference between the means of two sets of data is
statistically significant, by calculating the corresponding value of t and the probability of its occurrence if the two
sets of data were drawn from the same population. This is used in a variety of situations, particularly in t-tests. For
the statistic t, with ν degrees of freedom, A(t|ν) is the probability that t would be less than the observed value if the
two means were the same (provided that the smaller mean is subtracted from the larger, so that t ≥ 0). It can be easily
calculated from the cumulative distribution function F_ν(t) of the t-distribution:

    A(t|ν) = F_ν(t) − F_ν(−t) = 1 − I_{ν/(ν + t²)}(ν/2, 1/2),

where I_x(a, b) is the regularized incomplete beta function.


For statistical hypothesis testing this function is used to construct the p-value.

Non-standardized Student's t-distribution

In terms of standard deviation


Student's t distribution can be generalized to a three-parameter location–scale family, introducing a location
parameter μ and a scale parameter σ. The resulting non-standardized Student's t-distribution has a density
defined by[18]

    p(x | ν, μ, σ) = Γ((ν+1)/2) / (Γ(ν/2) √(πν) σ) (1 + (1/ν)((x − μ)/σ)²)^(−(ν+1)/2).

Equivalently, it can be written in terms of σ² (corresponding to variance instead of standard deviation):

    p(x | ν, μ, σ²) = Γ((ν+1)/2) / (Γ(ν/2) √(πνσ²)) (1 + (x − μ)²/(νσ²))^(−(ν+1)/2).

Other properties of this version of the distribution are[18]:

    E[X] = μ for ν > 1 (otherwise undefined),
    Var(X) = σ² ν/(ν − 2) for ν > 2 (∞ for 1 < ν ≤ 2, otherwise undefined),
    mode = μ.

This distribution results from compounding a Gaussian distribution (normal distribution) with mean μ and
unknown variance, with an inverse gamma distribution placed over the variance with parameters a = ν/2 and
b = νσ²/2. In other words, the random variable X is assumed to have a Gaussian distribution with an unknown
variance distributed as inverse gamma, and then the variance is marginalized out (integrated out). The reason for the
usefulness of this characterization is that the inverse gamma distribution is the conjugate prior distribution of the
variance of a Gaussian distribution. As a result, the non-standardized Student's t-distribution arises naturally in many
Bayesian inference problems. See below.
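The compounding construction can be checked by simulation: drawing a variance from an inverse gamma with the parameters given above and then drawing X from the resulting Gaussian should reproduce, e.g., the non-standardized t variance σ²ν/(ν − 2). A sketch (the parameter values are arbitrary):

```python
import math
import random

random.seed(3)

nu, mu, sigma = 8.0, 0.0, 2.0
a, b = nu / 2.0, nu * sigma ** 2 / 2.0  # inverse-gamma parameters

def draw_compound():
    var = b / random.gammavariate(a, 1.0)    # variance ~ InvGamma(a, b)
    return random.gauss(mu, math.sqrt(var))  # X | var ~ N(mu, var)

n = 300_000
xs = [draw_compound() for _ in range(n)]
sample_var = sum((x - mu) ** 2 for x in xs) / n
print(sample_var)  # near sigma^2 * nu / (nu - 2) = 16/3
```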
Equivalently, this distribution results from compounding a Gaussian distribution with a scaled-inverse-chi-squared
distribution with parameters ν and σ². The scaled-inverse-chi-squared distribution is exactly the same distribution
as the inverse gamma distribution, but with a different parameterization, i.e. Scale-inv-χ²(ν, σ²) = Inv-Gamma(ν/2, νσ²/2).

In terms of precision
An alternative parameterization in terms of precision λ (reciprocal of variance) arises from the relation .
[19]
Then the density is defined by

Other properties of this version of the distribution are[19]:

This distribution results from compounding a Gaussian distribution with mean and unknown precision (the
reciprocal of the variance), with a gamma distribution placed over the precision with parameters and
. In other words, the random variable X is assumed to have a normal distribution with an unknown
precision distributed as gamma, and then this is marginalized over the gamma distribution.

Related distributions

Noncentral t-distribution
The noncentral t-distribution is a different way of generalizing the t-distribution to include a location parameter.
Unlike the nonstandardized t-distributions, the noncentral distributions are asymmetric (the median is not the same
as the mode).

Discrete Student's t-distribution


The "discrete Student's t-distribution" is defined by its probability mass function at r being proportional to[20]

Here a, b, and k are parameters. This distribution arises from the construction of a system of discrete distributions
similar to that of the Pearson distributions for continuous distributions.[21]

Uses

In frequentist statistical inference


Student's t-distribution arises in a variety of statistical estimation problems where the goal is to estimate an unknown
parameter, such as a mean value, in a setting where the data are observed with additive errors. If (as in nearly all
practical statistical work) the population standard deviation of these errors is unknown and has to be estimated from
the data, the t-distribution is often used to account for the extra uncertainty that results from this estimation. In most
such problems, if the standard deviation of the errors were known, a normal distribution would be used instead of the
t-distribution.
Confidence intervals and hypothesis tests are two statistical procedures in which the quantiles of the sampling
distribution of a particular statistic (e.g. the standard score) are required. In any situation where this statistic is a
linear function of the data, divided by the usual estimate of the standard deviation, the resulting quantity can be
rescaled and centered to follow Student's t-distribution. Statistical analyses involving means, weighted means, and
regression coefficients all lead to statistics having this form.

Quite often, textbook problems will treat the population standard deviation as if it were known and thereby avoid the
need to use the Student's t-distribution. These problems are generally of two kinds: (1) those in which the sample
size is so large that one may treat a data-based estimate of the variance as if it were certain, and (2) those that
illustrate mathematical reasoning, in which the problem of estimating the standard deviation is temporarily ignored
because that is not the point that the author or instructor is then explaining.

Hypothesis testing
A number of statistics can be shown to have t-distributions for samples of moderate size under null hypotheses that
are of interest, so that the t-distribution forms the basis for significance tests. For example, the distribution of
Spearman's rank correlation coefficient ρ, in the null case (zero correlation), is well approximated by the t-distribution
for sample sizes above about 20.

Confidence intervals
Suppose the number A is so chosen that

$$\Pr(-A < T < A) = 0.9$$

when T has a t-distribution with n − 1 degrees of freedom. By symmetry, this is the same as saying that A satisfies

$$\Pr(T < A) = 0.95,$$

so A is the "95th percentile" of this probability distribution, or $A = t_{(0.05,\,n-1)}$. Then

$$\Pr\left(-A < \frac{\overline{X}_n - \mu}{S_n/\sqrt{n}} < A\right) = 0.9,$$

and this is equivalent to

$$\Pr\left(\overline{X}_n - A\,\frac{S_n}{\sqrt{n}} < \mu < \overline{X}_n + A\,\frac{S_n}{\sqrt{n}}\right) = 0.9.$$

Therefore the interval whose endpoints are

$$\overline{X}_n \pm A\,\frac{S_n}{\sqrt{n}}$$

is a 90-percent confidence interval for μ. Therefore, if we find the mean of a set of observations that we can
reasonably expect to have a normal distribution, we can use the t-distribution to examine whether the confidence
limits on that mean include some theoretically predicted value, such as the value predicted by a null hypothesis.
It is this result that is used in the Student's t-tests: since the difference between the means of samples from two
normal distributions is itself distributed normally, the t-distribution can be used to examine whether that difference
can reasonably be supposed to be zero.
If the data are normally distributed, the one-sided (1 − a) upper confidence limit (UCL) of the mean can be
calculated using the following equation:

$$\mathrm{UCL}_{1-a} = \overline{X} + t_{a,n-1}\,\frac{S}{\sqrt{n}}.$$

The resulting UCL will be the greatest average value that will occur for a given confidence interval and population
size. In other words, $\overline{X}$ being the mean of the set of observations, the probability that the mean of the distribution
is inferior to UCL1−a is equal to the confidence level 1 − a.
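As a minimal numerical sketch of the UCL formula (the sample numbers are hypothetical, and the critical value is read from the table given later in this article, row df = 10, one-sided 95% column):

```python
import math

# One-sided upper confidence limit: UCL_{1-a} = xbar + t_{a, n-1} * s / sqrt(n).
xbar, s, n = 10.0, math.sqrt(2.0), 11   # hypothetical sample summary, df = 10
t_crit = 1.812                          # table entry: df = 10, one-sided 95%
ucl = xbar + t_crit * s / math.sqrt(n)  # greatest plausible mean at this level
```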

Prediction intervals
The t-distribution can be used to construct a prediction interval for an unobserved sample from a normal distribution
with unknown mean and variance.

In Bayesian statistics
The Student's t-distribution, especially in its three-parameter (location-scale) version, arises frequently in Bayesian
statistics as a result of its connection with the normal distribution. Whenever the variance of a normally distributed
random variable is unknown and the conjugate prior placed over it follows an inverse gamma distribution, the
resulting marginal distribution of the variable will follow a Student's t-distribution. Equivalent constructions with the
same results involve a conjugate scaled-inverse-chi-squared distribution over the variance, or a conjugate gamma
distribution over the precision. If an improper prior proportional to 1/σ² is placed over the variance, the
t-distribution also arises. This is the case regardless of whether the mean of the normally distributed variable is
known, is unknown distributed according to a conjugate normally distributed prior, or is unknown distributed
according to an improper constant prior.
Related situations that also produce a t-distribution are:
• The marginal posterior distribution of the unknown mean of a normally distributed variable, with unknown prior
mean and variance following the above model.
• The prior predictive distribution and posterior predictive distribution of a new normally distributed data point
when a series of independent identically distributed normally distributed data points have been observed, with
prior mean and variance as in the above model.

Robust parametric modeling


The t-distribution is often used as an alternative to the normal distribution as a model for data.[22] It is frequently the
case that real data have heavier tails than the normal distribution allows for. The classical approach was to identify
outliers and exclude or downweight them in some way. However, it is not always easy to identify outliers (especially
in high dimensions), and the t-distribution is a natural choice of model for such data and provides a parametric
approach to robust statistics.
Lange et al. explored the use of the t-distribution for robust modeling of heavy tailed data in a variety of contexts. A
Bayesian account can be found in Gelman et al. The degrees of freedom parameter controls the kurtosis of the
distribution and is correlated with the scale parameter. The likelihood can have multiple local maxima and, as such,
it is often necessary to fix the degrees of freedom at a fairly low value and estimate the other parameters taking this
as given. Some authors report that values between 3 and 9 are often good choices. Venables and Ripley suggest that
a value of 5 is often a good choice.

Table of selected values


Most statistical textbooks include tables of the t-distribution. Nowadays, the better way to obtain a fully precise critical
t value or a cumulative probability is the statistical function implemented in spreadsheets (Office Excel, OpenOffice Calc, etc.)
or an interactive calculating web page. The relevant spreadsheet functions are TDIST and TINV, while online
calculators avoid difficulties such as the ordering of parameters or the names of functions. For example, a MediaWiki page
supported by the R extension can easily give the interactive result [23] of critical values or cumulative probability, even
for the noncentral t-distribution.
The following table lists a few selected values for t-distributions with ν degrees of freedom for a range of one-sided
or two-sided critical regions. For an example of how to read this table, take the fourth row, which begins with 4; that
means ν, the number of degrees of freedom, is 4 (and if we are dealing, as above, with n values with a fixed sum, n
= 5). Take the fifth entry, in the column headed 95% for one-sided (90% for two-sided). The value of that entry is

"2.132". Then the probability that T is less than 2.132 is 95% or Pr(−∞ < T < 2.132) = 0.95; or mean that
Pr(−2.132 < T < 2.132) = 0.9.
This can be calculated by the symmetry of the distribution,
Pr(T < −2.132) = 1 − Pr(T > −2.132) = 1 − 0.95 = 0.05,
and so
Pr(−2.132 < T < 2.132) = 1 − 2(0.05) = 0.9.
Note that the last row also gives critical points: a t-distribution with infinitely many degrees of freedom is a normal
distribution. (See Related distributions above).
The first column is the number of degrees of freedom.

One Sided 75% 80% 85% 90% 95% 97.5% 99% 99.5% 99.75% 99.9% 99.95%

Two Sided 50% 60% 70% 80% 90% 95% 98% 99% 99.5% 99.8% 99.9%

1 1.000 1.376 1.963 3.078 6.314 12.71 31.82 63.66 127.3 318.3 636.6

2 0.816 1.061 1.386 1.886 2.920 4.303 6.965 9.925 14.09 22.33 31.60

3 0.765 0.978 1.250 1.638 2.353 3.182 4.541 5.841 7.453 10.21 12.92

4 0.741 0.941 1.190 1.533 2.132 2.776 3.747 4.604 5.598 7.173 8.610

5 0.727 0.920 1.156 1.476 2.015 2.571 3.365 4.032 4.773 5.893 6.869

6 0.718 0.906 1.134 1.440 1.943 2.447 3.143 3.707 4.317 5.208 5.959

7 0.711 0.896 1.119 1.415 1.895 2.365 2.998 3.499 4.029 4.785 5.408

8 0.706 0.889 1.108 1.397 1.860 2.306 2.896 3.355 3.833 4.501 5.041

9 0.703 0.883 1.100 1.383 1.833 2.262 2.821 3.250 3.690 4.297 4.781

10 0.700 0.879 1.093 1.372 1.812 2.228 2.764 3.169 3.581 4.144 4.587

11 0.697 0.876 1.088 1.363 1.796 2.201 2.718 3.106 3.497 4.025 4.437

12 0.695 0.873 1.083 1.356 1.782 2.179 2.681 3.055 3.428 3.930 4.318

13 0.694 0.870 1.079 1.350 1.771 2.160 2.650 3.012 3.372 3.852 4.221

14 0.692 0.868 1.076 1.345 1.761 2.145 2.624 2.977 3.326 3.787 4.140

15 0.691 0.866 1.074 1.341 1.753 2.131 2.602 2.947 3.286 3.733 4.073

16 0.690 0.865 1.071 1.337 1.746 2.120 2.583 2.921 3.252 3.686 4.015

17 0.689 0.863 1.069 1.333 1.740 2.110 2.567 2.898 3.222 3.646 3.965

18 0.688 0.862 1.067 1.330 1.734 2.101 2.552 2.878 3.197 3.610 3.922

19 0.688 0.861 1.066 1.328 1.729 2.093 2.539 2.861 3.174 3.579 3.883

20 0.687 0.860 1.064 1.325 1.725 2.086 2.528 2.845 3.153 3.552 3.850

21 0.686 0.859 1.063 1.323 1.721 2.080 2.518 2.831 3.135 3.527 3.819

22 0.686 0.858 1.061 1.321 1.717 2.074 2.508 2.819 3.119 3.505 3.792

23 0.685 0.858 1.060 1.319 1.714 2.069 2.500 2.807 3.104 3.485 3.767

24 0.685 0.857 1.059 1.318 1.711 2.064 2.492 2.797 3.091 3.467 3.745

25 0.684 0.856 1.058 1.316 1.708 2.060 2.485 2.787 3.078 3.450 3.725

26 0.684 0.856 1.058 1.315 1.706 2.056 2.479 2.779 3.067 3.435 3.707

27 0.684 0.855 1.057 1.314 1.703 2.052 2.473 2.771 3.057 3.421 3.690

28 0.683 0.855 1.056 1.313 1.701 2.048 2.467 2.763 3.047 3.408 3.674

29 0.683 0.854 1.055 1.311 1.699 2.045 2.462 2.756 3.038 3.396 3.659

30 0.683 0.854 1.055 1.310 1.697 2.042 2.457 2.750 3.030 3.385 3.646

40 0.681 0.851 1.050 1.303 1.684 2.021 2.423 2.704 2.971 3.307 3.551

50 0.679 0.849 1.047 1.299 1.676 2.009 2.403 2.678 2.937 3.261 3.496

60 0.679 0.848 1.045 1.296 1.671 2.000 2.390 2.660 2.915 3.232 3.460

80 0.678 0.846 1.043 1.292 1.664 1.990 2.374 2.639 2.887 3.195 3.416

100 0.677 0.845 1.042 1.290 1.660 1.984 2.364 2.626 2.871 3.174 3.390

120 0.677 0.845 1.041 1.289 1.658 1.980 2.358 2.617 2.860 3.160 3.373

∞ 0.674 0.842 1.036 1.282 1.645 1.960 2.326 2.576 2.807 3.090 3.291

The number at the beginning of each row in the table above is ν, which has been defined above as n − 1. The
percentage along the top is 100%(1 − α). The numbers in the main body of the table are t_{α,ν}. If a quantity T is
distributed as a Student's t-distribution with ν degrees of freedom, then there is a probability 1 − α that T will be less
than t_{α,ν}. (Calculated as for a one-tailed or one-sided test, as opposed to a two-tailed test.)
For example, given a sample with a sample variance 2 and sample mean of 10, taken from a sample set of 11 (10
degrees of freedom), using the formula

$$\overline{X}_n \pm t_{\alpha,\nu}\,\frac{S_n}{\sqrt{n}},$$

we can determine that at 90% confidence, we have a true mean lying below

$$10 + 1.372\sqrt{\frac{2}{11}} = 10.585.$$

(In other words, on average, 90% of the times that an upper threshold is calculated by this method, this upper
threshold exceeds the true mean.) And, still at 90% confidence, we have a true mean lying above

$$10 - 1.372\sqrt{\frac{2}{11}} = 9.415.$$

(In other words, on average, 90% of the times that a lower threshold is calculated by this method, this lower
threshold lies below the true mean.) So that at 80% confidence (calculated from 1 − 2 × (1 − 90%) = 80%), we have
a true mean lying within the interval

$$\left(10 - 1.372\sqrt{\frac{2}{11}},\ 10 + 1.372\sqrt{\frac{2}{11}}\right).$$

This is generally expressed in interval notation, e.g., for this case, at 80% confidence the true mean is within the
interval [9.41490, 10.58510].
(In other words, on average, 80% of the times that upper and lower thresholds are calculated by this method, the true
mean is both below the upper threshold and above the lower threshold. This is not the same thing as saying that there
is an 80% probability that the true mean lies between a particular pair of upper and lower thresholds that have been
calculated by this method—see confidence interval and prosecutor's fallacy.)
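The arithmetic of this worked example can be sketched in a few lines (1.372 is the table entry for ν = 10 in the 90% one-sided column):

```python
import math

# Sample mean 10, sample variance 2, n = 11 observations (df = 10).
mean, var, n = 10.0, 2.0, 11
t_crit = 1.372                          # table entry: df = 10, one-sided 90%
half_width = t_crit * math.sqrt(var / n)
lower, upper = mean - half_width, mean + half_width   # 80% two-sided interval
```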
For information on the inverse cumulative distribution function see quantile function.

Notes
[1] Hurst, Simon. The Characteristic Function of the Student-t Distribution (http://wwwmaths.anu.edu.au/research.reports/srr/95/044/),
Financial Mathematics Research Report No. FMRR006-95, Statistics Research Report No. SRR044-95.
[2] Johnson, N.L., Kotz, S., Balakrishnan, N. (1995). Continuous Univariate Distributions, Volume 2, 2nd Edition. Wiley, ISBN 0-471-58494-0
(Chapter 28).
[3] Helmert, F. R. (1875). "Über die Bestimmung des wahrscheinlichen Fehlers aus einer endlichen Anzahl wahrer Beobachtungsfehler". Z.
Math. Phys., 20, 300–303.
[4] Helmert, F. R. (1876a). "Über die Wahrscheinlichkeit der Potenzsummen der Beobachtungsfehler und über einige damit in Zusammenhang
stehende Fragen". Z. Math. Phys., 21, 192–218.
[5] Helmert, F. R. (1876b). "Die Genauigkeit der Formel von Peters zur Berechnung des wahrscheinlichen Beobachtungsfehlers direkter
Beobachtungen gleicher Genauigkeit". Astron. Nachr., 88, 113–132.
[6] Lüroth, J. (1876). "Vergleichung von zwei Werten des wahrscheinlichen Fehlers". Astron. Nachr. 87 (14): 209–220.
Bibcode 1876AN.....87..209L. doi:10.1002/asna.18760871402.
[7] Pfanzagl, J.; Sheynin, O. (1996). "A forerunner of the t-distribution (Studies in the history of probability and statistics XLIV)"
(http://biomet.oxfordjournals.org/cgi/content/abstract/83/4/891). Biometrika 83 (4): 891–898. doi:10.1093/biomet/83.4.891. MR1766040.
[8] Sheynin, O. (1995). "Helmert's work in the theory of errors". Arch. Hist. Ex. Sci. 49: 73–104. doi:10.1007/BF00374700.
[9] Student [William Sealy Gosset] (March 1908). "The probable error of a mean" (http://www.york.ac.uk/depts/maths/histstat/student.pdf).
Biometrika 6 (1): 1–25. doi:10.1093/biomet/6.1.1.
[10] Mortimer, Robert G. (2005). Mathematics for Physical Chemistry. Academic Press, 3rd edition. ISBN 0-12-508347-5 (page 326).
[11] Fisher, R. A. (1925). "Applications of "Student's" distribution" (http://digital.library.adelaide.edu.au/coll/special/fisher/43.pdf).
Metron 5: 90–104.
[12] Walpole, Ronald; Myers, Raymond; Myers, Sharon; Ye, Keying (2002). Probability and Statistics for Engineers and Scientists. Pearson
Education, 7th edition, p. 237. ISBN 81-7758-404-9.
[13] Hogg & Craig (1978, Sections 4.4 and 4.8).
[14] Cochran, W. G. (April 1934). "The distribution of quadratic forms in a normal system, with applications to the analysis of covariance".
Mathematical Proceedings of the Cambridge Philosophical Society 30 (2): 178–191. doi:10.1017/S0305004100016595.
[15] Fisher, R. A. (1925). "Applications of "Student's" distribution" (http://digital.library.adelaide.edu.au/coll/special/fisher/43.pdf).
Metron 5: 90–104.
[16] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model"
(http://www.wise.xmu.edu.cn/Master/Download/..\..\UploadFiles\paper-masterdownload\2009519932327055475115776.pdf). Journal of Econometrics
(Elsevier): 219–230. Retrieved 2011-06-02.
[17] Bailey, R. W. (1994). "Polar Generation of Random Variates with the t-Distribution". Mathematics of Computation 62 (206): 779–781.
doi:10.2307/2153537.
[18] Jackman, Simon (2009). Bayesian Analysis for the Social Sciences. Wiley.
[19] Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
[20] Ord, J.K. (1972). Families of Frequency Distributions. Griffin. ISBN 0-85264-137-0 (Table 5.1).
[21] Ord, J.K. (1972). Families of Frequency Distributions. Griffin. ISBN 0-85264-137-0 (Chapter 5).
[22] Lange, Kenneth L.; Little, Roderick J.A.; Taylor, Jeremy M.G. (1989). "Robust statistical modeling using the t-distribution". JASA 84 (408):
881–896. JSTOR 2290063.
[23] http://mars.wiwi.hu-berlin.de/mediawiki/slides/index.php/Comparison_of_noncentral_and_central_distributions

References
• Senn, S.; Richardson, W. (1994). "The first t-test". Statistics in Medicine 13 (8): 785–803.
doi:10.1002/sim.4780130802. PMID 8047737.
• Hogg, R.V.; Craig, A.T. (1978). Introduction to Mathematical Statistics. New York: Macmillan.
• Venables, W.N.; Ripley, B.D. (2002). Modern Applied Statistics with S, Fourth Edition. Springer.
• Gelman, Andrew; John B. Carlin, Hal S. Stern, Donald B. Rubin (2003). Bayesian Data Analysis (Second
Edition) (http://www.stat.columbia.edu/~gelman/book/). CRC/Chapman & Hall. ISBN 1-58488-388-X.

External links
• Earliest Known Uses of Some of the Words of Mathematics (S) (http://jeff560.tripod.com/s.html) (Remarks
on the history of the term "Student's distribution")

Summation by parts
In mathematics, summation by parts transforms the summation of products of sequences into other summations,
often simplifying the computation or (especially) estimation of certain types of sums. The summation by parts
formula is sometimes called Abel's lemma or Abel transformation.

Statement
Suppose $\{f_n\}$ and $\{g_n\}$ are two sequences. Then,

$$\sum_{n=m}^{M} f_n (g_{n+1} - g_n) = \left[f_{M+1}\, g_{M+1} - f_m\, g_m\right] - \sum_{n=m}^{M} g_{n+1} (f_{n+1} - f_n).$$

Using the forward difference operator $\Delta$, it can be stated more succinctly as

$$\sum_{n=m}^{M} f_n \,\Delta g_n = \left[f_{M+1}\, g_{M+1} - f_m\, g_m\right] - \sum_{n=m}^{M} g_{n+1} \,\Delta f_n.$$

Note that summation by parts is an analogue of the integration by parts formula,

$$\int f \, dg = f g - \int g \, df.$$

Note also that although applications almost always deal with convergence of sequences, the statement is purely
algebraic and will work in any field. It will also work when one sequence is in a vector space, and the other is in the
relevant field of scalars.

Newton series
The formula is sometimes given in one of these slightly different forms

which represent special cases ( ) of the more general rule

both result from iterated application of the initial formula. The auxiliary quantities are Newton series:

and

A noteworthy particular ( ) result is the identity

Here, is the binomial coefficient.

Method
For two given sequences $(a_n)$ and $(b_n)$, with $n \in \mathbb{N}$, one wants to study the sum of the following series:

$$S_N = \sum_{n=0}^{N} a_n b_n.$$

If we define $B_n = \sum_{k=0}^{n} b_k$, then for every $n > 0$, $b_n = B_n - B_{n-1}$, and

$$S_N = a_0 b_0 + \sum_{n=1}^{N} a_n (B_n - B_{n-1}).$$

Finally

$$S_N = a_N B_N - \sum_{n=0}^{N-1} B_n (a_{n+1} - a_n).$$

This process, called an Abel transformation, can be used to prove several criteria of convergence for $S_N$.
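The Abel transformation can be verified numerically. A small sketch with arbitrarily chosen sequences, checking that with B_n = b_0 + … + b_n one has ∑ a_n b_n = a_N B_N − ∑ B_n (a_{n+1} − a_n):

```python
# Numerical check of the Abel transformation (summation by parts).
N = 50
a = [1.0 / (n + 1) for n in range(N + 1)]      # a_n = 1/(n+1)
b = [(-1.0) ** n for n in range(N + 1)]        # b_n = (-1)^n

direct = sum(a[n] * b[n] for n in range(N + 1))

# Partial sums B_n of the b sequence.
B, running = [], 0.0
for bn in b:
    running += bn
    B.append(running)

transformed = a[N] * B[N] - sum(B[n] * (a[n + 1] - a[n]) for n in range(N))
```

Both expressions agree up to floating-point rounding, as the purely algebraic identity requires.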

Similarity with an integration by parts


The formula for an integration by parts is

$$\int_a^b f(x)\,g'(x)\,dx = \Bigl[f(x)\,g(x)\Bigr]_a^b - \int_a^b f'(x)\,g(x)\,dx.$$

Beside the boundary conditions, we notice that the first integral contains two multiplied functions, one which is
integrated in the final integral ($g'$ becomes $g$) and one which is differentiated ($f$ becomes $f'$).
The process of the Abel transformation is similar, since one of the two initial sequences is summed ($b_n$ becomes
$B_n$) and the other one is differenced ($a_n$ becomes $a_{n+1} - a_n$).

Applications
We suppose that $a_n \to 0$; otherwise it is obvious that $\sum_n a_n b_n$ may be a divergent series.

If $(B_n)$ is bounded by a real $M$ and $\sum_n (a_{n+1} - a_n)$ is absolutely convergent, then $\sum_n a_n b_n$ is a convergent series.

And the sum of the series satisfies:

$$\left|\sum_{n=0}^{\infty} a_n b_n\right| \le M \sum_{n=0}^{\infty} |a_{n+1} - a_n|.$$


References
• Abel's lemma [1], PlanetMath.org.

References
[1] http://planetmath.org/?op=getobj&from=objects&id=3843

Taylor series
In mathematics, a Taylor series is a
representation of a function as an infinite
sum of terms that are calculated from the
values of the function's derivatives at a
single point.
The concept of a Taylor series was formally
introduced by the English mathematician
Brook Taylor in 1715. If the Taylor series is
centered at zero, then that series is also
called a Maclaurin series, named after the
Scottish mathematician Colin Maclaurin,
who made extensive use of this special case
of Taylor series in the 18th century.

It is common practice to approximate a function by using a finite number of terms of its Taylor series. Taylor's
theorem gives quantitative estimates on the error in this approximation. Any finite number of initial terms of the
Taylor series of a function is called a Taylor polynomial. The Taylor series of a function is the limit of that
function's Taylor polynomials, provided that the limit exists. A function may not be equal to its Taylor series,
even if its Taylor series converges at every point. A function that is equal to its Taylor series in an open interval
(or a disc in the complex plane) is known as an analytic function.

[Figure: As the degree of the Taylor polynomial rises, it approaches the correct function. The image shows sin x
and its Taylor approximations, polynomials of degree 1, 3, 5, 7, 9, 11 and 13.]

Definition
The Taylor series of a real or complex-valued function ƒ(x) that is infinitely differentiable in a neighborhood of a real
or complex number a is the power series

$$f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^{2} + \frac{f'''(a)}{3!}(x-a)^{3} + \cdots,$$

which can be written in the more compact sigma notation as

$$\sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x-a)^{n},$$

where n! denotes the factorial of n and ƒ(n)(a) denotes the nth derivative of ƒ evaluated at the point a. The derivative
of order zero of ƒ is defined to be ƒ itself, and (x − a)0 and 0! are both defined to be 1. In the
case that a = 0, the series is also called a Maclaurin series.
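The definition can be illustrated with a minimal sketch that sums the first terms of the series, here for the exponential function, whose derivatives at a = 0 are all equal to 1:

```python
import math

def taylor_poly(derivs_at_a, a, x):
    """Evaluate sum_n f^(n)(a)/n! * (x - a)^n given the derivative values."""
    return sum(d / math.factorial(n) * (x - a) ** n
               for n, d in enumerate(derivs_at_a))

# For f = exp, every derivative at a = 0 equals 1, so the Maclaurin
# polynomial of degree 14 evaluated at x = 1 approximates e.
approx = taylor_poly([1.0] * 15, 0.0, 1.0)
```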

Examples
The Maclaurin series for any polynomial is the polynomial itself.
The Maclaurin series for (1 − x)−1 for |x| < 1 is the geometric series

$$1 + x + x^{2} + x^{3} + \cdots,$$

so the Taylor series for x−1 at a = 1 is

$$1 - (x-1) + (x-1)^{2} - (x-1)^{3} + \cdots.$$

[Figure: The exponential function (in blue), and the sum of the first n + 1 terms of its Taylor series at 0 (in red).]

By integrating the above Maclaurin series, we find the Maclaurin series for log(1 − x), where log denotes the
natural logarithm:

$$-x - \tfrac{1}{2}x^{2} - \tfrac{1}{3}x^{3} - \tfrac{1}{4}x^{4} - \cdots,$$

and the corresponding Taylor series for log(x) at a = 1 is

$$(x-1) - \tfrac{1}{2}(x-1)^{2} + \tfrac{1}{3}(x-1)^{3} - \tfrac{1}{4}(x-1)^{4} + \cdots,$$

and more generally, the corresponding Taylor series for log(x) at an arbitrary nonzero point a is:

$$\log(a) + \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n\,a^{n}}(x-a)^{n}.$$

The Taylor series for the exponential function ex at a = 0 is

$$\sum_{n=0}^{\infty} \frac{x^{n}}{n!} = 1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + \cdots.$$

The above expansion holds because the derivative of ex with respect to x is also ex and e0 equals 1. This leaves the
terms (x − 0)n in the numerator and n! in the denominator for each term in the infinite sum.

History
The Greek philosopher Zeno considered the problem of summing an infinite series to achieve a finite result, but
rejected it as an impossibility: the result was Zeno's paradox. Later, Aristotle proposed a philosophical resolution of
the paradox, but the mathematical content was apparently unresolved until taken up by Democritus and then
Archimedes. It was through Archimedes's method of exhaustion that an infinite number of progressive subdivisions
could be performed to achieve a finite result.[1] Liu Hui independently employed a similar method a few centuries
later.[2]
In the 14th century, the earliest examples of the use of Taylor series and closely related methods were given by
Madhava of Sangamagrama.[3][4] Though no record of his work survives, writings of later Indian mathematicians
suggest that he found a number of special cases of the Taylor series, including those for the trigonometric functions
of sine, cosine, tangent, and arctangent. The Kerala school of astronomy and mathematics further expanded his

works with various series expansions and rational approximations until the 16th century.
In the 17th century, James Gregory also worked in this area and published several Maclaurin series. It was not until
1715 however that a general method for constructing these series for all functions for which they exist was finally
provided by Brook Taylor,[5] after whom the series are now named.
The Maclaurin series was named after Colin Maclaurin, a professor in Edinburgh, who published the special case of
the Taylor result in the 18th century.

Analytic functions
If f(x) is given by a convergent power series in an open disc (or interval in the real line) centered at b, it is said
to be analytic in this disc. Thus for x in this disc, f is given by a convergent power series

$$f(x) = \sum_{n=0}^{\infty} a_n (x-b)^{n}.$$

Differentiating by x the above formula n times, then setting x = b gives:

$$\frac{f^{(n)}(b)}{n!} = a_n,$$

and so the power series expansion agrees with the Taylor series. Thus a function is analytic in an open disc
centered at b if and only if its Taylor series converges to the value of the function at each point of the disc.

[Figure: The function e−1/x² is not analytic at x = 0: the Taylor series is identically 0, although the function
is not.]
If f(x) is equal to its Taylor series everywhere it is called entire. The polynomials and the exponential function ex and
the trigonometric functions sine and cosine are examples of entire functions. Examples of functions that are not
entire include the logarithm, the trigonometric function tangent, and its inverse arctan. For these functions the Taylor
series do not converge if x is far from b. Taylor series can be used to calculate the value of an entire function in every
point, if the value of the function, and of all of its derivatives, are known at a single point.
Uses of the Taylor series for analytic functions include:
1. The partial sums (the Taylor polynomials) of the series can be used as approximations of the entire function.
These approximations are good if sufficiently many terms are included.
2. Differentiation and integration of power series can be performed term by term and is hence particularly easy.
3. An analytic function is uniquely extended to a holomorphic function on an open disk in the complex plane. This
makes the machinery of complex analysis available.
4. The (truncated) series can be used to compute function values numerically, (often by recasting the polynomial
into the Chebyshev form and evaluating it with the Clenshaw algorithm).
5. Algebraic operations can be done readily on the power series representation; for instance the Euler's formula
follows from Taylor series expansions for trigonometric and exponential functions. This result is of fundamental
importance in such fields as harmonic analysis.
6. Approximations using the first few terms of a Taylor series can make otherwise unsolvable problems possible for
a restricted domain; this approach is often used in physics.

Approximation and convergence


Pictured on the right is an accurate approximation of sin(x) around the point x = 0. The pink curve is a
polynomial of degree seven:

$$\sin(x) \approx x - \frac{x^{3}}{3!} + \frac{x^{5}}{5!} - \frac{x^{7}}{7!}.$$

The error in this approximation is no more than |x|9/9!. In particular, for −1 < x < 1, the error is less than
0.000003.

[Figure: The sine function (blue) is closely approximated by its Taylor polynomial of degree 7 (pink) for a full
period centered at the origin.]

In contrast, also shown is a picture of the natural logarithm function log(1 + x) and some of its Taylor
polynomials around a = 0. These approximations converge to the function only in the region −1 < x ≤ 1;
outside of this region the higher-degree Taylor polynomials are worse approximations for the function. This is
similar to Runge's phenomenon.
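The degree-7 approximation and its |x|⁹/9! error bound can be checked numerically; a minimal sketch:

```python
import math

def p7(x):
    """Degree-7 Taylor polynomial of sin about 0."""
    return x - x**3 / 6 + x**5 / 120 - x**7 / 5040

x = 0.9
err = abs(math.sin(x) - p7(x))          # actual approximation error
bound = abs(x) ** 9 / math.factorial(9) # remainder bound from the text
```

At x = 0.9 the actual error is just under the bound, as Taylor's theorem guarantees.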

The error incurred in approximating a function by its nth-degree Taylor polynomial is called the remainder or
residual and is denoted by the function Rn(x). Taylor's theorem can be used to obtain a bound on the size of the
remainder.

In general, Taylor series need not be convergent at all. In fact the set of functions with a convergent Taylor
series is a meager set in the Fréchet space of smooth functions. Even if the Taylor series of a function f does
converge, its limit need not in general be equal to the value of the function f(x). For example, the function

$$f(x) = \begin{cases} e^{-1/x^{2}} & \text{if } x \neq 0 \\ 0 & \text{if } x = 0 \end{cases}$$

is infinitely differentiable at x = 0, and has all derivatives zero there. Consequently, the Taylor series of f(x)
about x = 0 is identically zero. However, f(x) is not equal to the zero function, and so it is not equal to its
Taylor series around the origin.

[Figure: The Taylor polynomials for log(1 + x) only provide accurate approximations in the range −1 < x ≤ 1.
For x > 1, the Taylor polynomials of higher degree are worse approximations.]

In real analysis, this example shows that there are infinitely differentiable functions f(x) whose Taylor series are not
equal to f(x) even if they converge. By contrast, in complex analysis there are no holomorphic functions f(z) whose
Taylor series converges to a value different from f(z). The complex function e−1/z² does not approach 0 as z
approaches 0 along the imaginary axis, and its Taylor series is thus not defined there.

More generally, every sequence of real or complex numbers can appear as coefficients in the Taylor series of an
infinitely differentiable function defined on the real line, a consequence of Borel's lemma (see also Non-analytic
smooth function). As a result, the radius of convergence of a Taylor series can be zero. There are even infinitely
differentiable functions defined on the real line whose Taylor series have a radius of convergence 0 everywhere.[6]
Some functions cannot be written as Taylor series because they have a singularity; in these cases, one can often still
achieve a series expansion if one allows also negative powers of the variable x; see Laurent series. For example,
f(x) = e^{−1/x²} can be written as a Laurent series.

Generalization
There is, however, a generalization[7][8] of the Taylor series that does converge to the value of the function itself for
any bounded continuous function on (0,∞), using the calculus of finite differences. Specifically, one has the
following theorem, due to Einar Hille, that for any t > 0,

$$\lim_{h \to 0^{+}} \sum_{n=0}^{\infty} \frac{t^{n}}{n!} \frac{\Delta_h^{n} f(a)}{h^{n}} = f(a+t).$$

Here $\Delta_h^{n}$ is the n-th finite difference operator with step size h. The series is precisely the Taylor series, except that
divided differences appear in place of differentiation: the series is formally similar to the Newton series. When the
function f is analytic at a, the terms in the series converge to the terms of the Taylor series, and in this sense this
formula generalizes the usual Taylor series.

In general, for any infinite sequence ai, the following power series identity holds:

$$\sum_{n=0}^{\infty} \frac{u^{n}}{n!} \Delta^{n} a_i = e^{-u} \sum_{j=0}^{\infty} \frac{u^{j}}{j!} a_{i+j}.$$

So in particular,

$$f(a+t) = \lim_{h \to 0^{+}} e^{-t/h} \sum_{j=0}^{\infty} f(a+jh) \frac{(t/h)^{j}}{j!}.$$

The series on the right is the expectation value of f(a + X), where X is a Poisson-distributed random variable that
takes the value jh with probability e−t/h(t/h)j/j!. Hence,

$$f(a+t) = \lim_{h \to 0^{+}} \mathbb{E}\left[f(a+X)\right].$$

The law of large numbers implies that the identity holds.
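A sketch of the finite-difference series, truncated after a fixed number of terms. Note that h cannot be taken arbitrarily small in floating point, since the repeated subtractions in the difference table suffer cancellation; the values below are chosen so the computation stays well conditioned:

```python
import math

def hille_series(f, a, t, h, terms=40):
    """Truncated generalized Taylor series:
    sum over n of (t/h)^n / n! * (forward difference Delta_h^n f)(a)."""
    vals = [f(a + k * h) for k in range(terms + 1)]  # f(a), f(a+h), ...
    total, factor = 0.0, 1.0                          # factor = (t/h)^n / n!
    for n in range(terms + 1):
        total += factor * vals[0]                     # vals[0] == Delta_h^n f(a)
        vals = [vals[k + 1] - vals[k] for k in range(len(vals) - 1)]
        factor *= (t / h) / (n + 1)
    return total

# As h shrinks, the series approaches f(a + t) = exp(0.5).
errors = [abs(hille_series(math.exp, 0.0, 0.5, h) - math.exp(0.5))
          for h in (0.2, 0.1, 0.05)]
```

The error decreases monotonically as h is reduced over this range, consistent with the h → 0⁺ limit in the theorem.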

List of Maclaurin series of some common functions


See also List of mathematical series

Several important Maclaurin series expansions follow.[9] All these expansions are valid for complex arguments x.

Exponential function:

$$e^{x} = \sum_{n=0}^{\infty} \frac{x^{n}}{n!} \quad\text{for all } x$$

Natural logarithm:

$$\log(1-x) = -\sum_{n=1}^{\infty} \frac{x^{n}}{n} \quad\text{for } |x| < 1$$

$$\log(1+x) = \sum_{n=1}^{\infty} (-1)^{n+1}\frac{x^{n}}{n} \quad\text{for } |x| < 1$$

[Figures: the real part of the cosine function in the complex plane; an 8th-degree approximation of the cosine
function in the complex plane; the two curves put together.]

Finite geometric series:

$$\sum_{n=0}^{m} x^{n} = \frac{1-x^{m+1}}{1-x} \quad\text{for } x \neq 1$$

Infinite geometric series:

$$\sum_{n=0}^{\infty} x^{n} = \frac{1}{1-x} \quad\text{for } |x| < 1$$

Variants of the infinite geometric series:

$$\sum_{n=m}^{\infty} x^{n} = \frac{x^{m}}{1-x} \quad\text{for } |x| < 1$$

$$\sum_{n=1}^{\infty} n x^{n} = \frac{x}{(1-x)^{2}} \quad\text{for } |x| < 1$$

Square root:

$$\sqrt{1+x} = \sum_{n=0}^{\infty} \frac{(-1)^{n}(2n)!}{(1-2n)(n!)^{2}4^{n}} x^{n} \quad\text{for } |x| \le 1$$

Binomial series (includes the square root for α = 1/2 and the infinite geometric series for α = −1):

$$(1+x)^{\alpha} = \sum_{n=0}^{\infty} \binom{\alpha}{n} x^{n} \quad\text{for } |x| < 1 \text{ and all complex } \alpha$$

with generalized binomial coefficients

$$\binom{\alpha}{n} = \prod_{k=1}^{n} \frac{\alpha-k+1}{k} = \frac{\alpha(\alpha-1)\cdots(\alpha-n+1)}{n!}$$

Trigonometric functions:

$$\sin x = \sum_{n=0}^{\infty} \frac{(-1)^{n}}{(2n+1)!} x^{2n+1} \quad\text{for all } x$$

$$\cos x = \sum_{n=0}^{\infty} \frac{(-1)^{n}}{(2n)!} x^{2n} \quad\text{for all } x$$

$$\tan x = \sum_{n=1}^{\infty} \frac{B_{2n}(-4)^{n}(1-4^{n})}{(2n)!} x^{2n-1} \quad\text{for } |x| < \frac{\pi}{2}$$

where the Bs are Bernoulli numbers.

$$\sec x = \sum_{n=0}^{\infty} \frac{(-1)^{n} E_{2n}}{(2n)!} x^{2n} \quad\text{for } |x| < \frac{\pi}{2}$$

Hyperbolic functions:

$$\sinh x = \sum_{n=0}^{\infty} \frac{x^{2n+1}}{(2n+1)!} \quad\text{for all } x$$

$$\cosh x = \sum_{n=0}^{\infty} \frac{x^{2n}}{(2n)!} \quad\text{for all } x$$

$$\tanh x = \sum_{n=1}^{\infty} \frac{B_{2n}\, 4^{n}(4^{n}-1)}{(2n)!} x^{2n-1} \quad\text{for } |x| < \frac{\pi}{2}$$

The numbers Bk appearing in the summation expansions of tan(x) and tanh(x) are the Bernoulli numbers. The Ek in
the expansion of sec(x) are Euler numbers.

Calculation of Taylor series


Several methods exist for the calculation of Taylor series of a large number of functions. One can attempt to use the
Taylor series as-is and generalize the form of the coefficients, or one can use manipulations such as substitution,
multiplication or division, addition or subtraction of standard Taylor series to construct the Taylor series of a
function, by virtue of Taylor series being power series. In some cases, one can also derive the Taylor series by
repeatedly applying integration by parts. Particularly convenient is the use of computer algebra systems to calculate
Taylor series.

First example
Compute the 7th degree Maclaurin polynomial for the function

$$f(x) = \log\cos x, \quad x \in \left(-\frac{\pi}{2}, \frac{\pi}{2}\right).$$

First, rewrite the function as

$$f(x) = \log\bigl(1 + (\cos x - 1)\bigr).$$

We have for the natural logarithm (by using the big O notation)

$$\log(1+x) = x - \frac{x^{2}}{2} + \frac{x^{3}}{3} + O(x^{4}),$$

and for the cosine function

$$\cos x - 1 = -\frac{x^{2}}{2} + \frac{x^{4}}{24} - \frac{x^{6}}{720} + O(x^{8}).$$

The latter series expansion has a zero constant term, which enables us to substitute the second series into the first one
and to easily omit terms of higher order than the 7th degree by using the big O notation:

$$\log\cos x = -\frac{x^{2}}{2} - \frac{x^{4}}{12} - \frac{x^{6}}{45} + O(x^{8}).$$

Since the cosine is an even function, the coefficients for all the odd powers x, x3, x5, x7, ... have to be zero.
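A numerical spot-check of the 7th-degree result −x²/2 − x⁴/12 − x⁶/45 (for f(x) = log cos x, the function treated in this example); for small x the discrepancy should be of order x⁸:

```python
import math

def p(x):
    """7th-degree Maclaurin polynomial of log(cos x)."""
    return -x**2 / 2 - x**4 / 12 - x**6 / 45

x = 0.1
err = abs(math.log(math.cos(x)) - p(x))  # should be about (x^8)-sized
```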

Second example
Suppose we want the Taylor series at 0 of the function

$$g(x) = \frac{e^{x}}{\cos x}.$$

We have for the exponential function

$$e^{x} = 1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + \frac{x^{4}}{4!} + \cdots,$$

and, as in the first example,

$$\cos x = 1 - \frac{x^{2}}{2!} + \frac{x^{4}}{4!} - \cdots.$$

Assume the power series is

$$\frac{e^{x}}{\cos x} = c_0 + c_1 x + c_2 x^{2} + c_3 x^{3} + \cdots.$$

Then multiplication with the denominator and substitution of the series of the cosine yields

$$e^{x} = (c_0 + c_1 x + c_2 x^{2} + c_3 x^{3} + \cdots)\left(1 - \frac{x^{2}}{2!} + \frac{x^{4}}{4!} - \cdots\right).$$

Collecting the terms up to fourth order yields

$$e^{x} = c_0 + c_1 x + \left(c_2 - \frac{c_0}{2}\right)x^{2} + \left(c_3 - \frac{c_1}{2}\right)x^{3} + \left(c_4 - \frac{c_2}{2} + \frac{c_0}{24}\right)x^{4} + \cdots.$$

Comparing coefficients with the above series of the exponential function yields the desired Taylor series

$$\frac{e^{x}}{\cos x} = 1 + x + x^{2} + \frac{2x^{3}}{3} + \frac{x^{4}}{2} + \cdots.$$
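A numerical spot-check of the resulting expansion 1 + x + x² + (2/3)x³ + (1/2)x⁴ (for g(x) = eˣ/cos x, the function treated in this example); the discrepancy should be of order x⁵:

```python
import math

def p(x):
    """4th-degree Maclaurin polynomial of exp(x)/cos(x)."""
    return 1 + x + x**2 + 2 * x**3 / 3 + x**4 / 2

x = 0.1
err = abs(math.exp(x) / math.cos(x) - p(x))  # should be about (x^5)-sized
```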

Third example
Here we use a method called "indirect expansion" to expand the given function. This method uses the known Taylor
series of a standard function to carry out the expansion.
Q: Expand the following function as a power series of x
.
We know the Taylor series of the function e^x is:

$$e^{x} = \sum_{n=0}^{\infty} \frac{x^{n}}{n!}.$$

Thus,

Taylor series as definitions


Classically, algebraic functions are defined by an algebraic equation, and transcendental functions (including those
discussed above) are defined by some property that holds for them, such as a differential equation. For example, the
exponential function is the function which is equal to its own derivative everywhere, and assumes the value 1 at the
origin. However, one may equally well define an analytic function by its Taylor series.
Taylor series are used to define functions and "operators" in diverse areas of mathematics. In particular, this is true in
areas where the classical definitions of functions break down. For example, using Taylor series, one may define
analytical functions of matrices and operators, such as the matrix exponential or matrix logarithm.
In other areas, such as formal analysis, it is more convenient to work directly with the power series themselves. Thus
one may define a solution of a differential equation as a power series which, one hopes to prove, is the Taylor series
of the desired solution.

Taylor series in several variables


The Taylor series may also be generalized to functions of more than one variable with

T(x_1, ..., x_d) = Σ_{n_1=0}^{∞} ··· Σ_{n_d=0}^{∞} ((x_1 − a_1)^{n_1} ··· (x_d − a_d)^{n_d} / (n_1! ··· n_d!)) (∂^{n_1+···+n_d} f / ∂x_1^{n_1} ··· ∂x_d^{n_d})(a_1, ..., a_d)
For example, for a function f(x, y) that depends on two variables, x and y, the Taylor series to second order about the point (a, b) is:

f(a, b) + (x − a) f_x(a, b) + (y − b) f_y(a, b) + (1/2)[(x − a)^2 f_xx(a, b) + 2(x − a)(y − b) f_xy(a, b) + (y − b)^2 f_yy(a, b)]

where the subscripts denote the respective partial derivatives.



A second-order Taylor series expansion of a scalar-valued function of more than one variable can be written compactly as

T(x) = f(a) + (x − a)^T ∇f(a) + (1/2)(x − a)^T H(a)(x − a) + ···

where ∇f(a) is the gradient of f evaluated at x = a and H(a) is the Hessian matrix. Applying the multi-index notation the Taylor series for several variables becomes

T(x) = Σ_{|α| ≥ 0} ((x − a)^α / α!) (∂^α f)(a)
which is to be understood as a still more abbreviated multi-index version of the first equation of this paragraph, again
in full analogy to the single variable case.

Example
Compute a second-order Taylor series expansion around a point of a function of two variables.
Firstly, we compute all the partial derivatives we need.
[Figure: second-order Taylor series approximation (in gray) of a function around the origin.]
The Taylor series is

which in this case becomes

Since log(1 + y) is analytic in |y| < 1, we have

for |y| < 1.

Fractional Taylor series


With the emergence of fractional calculus, a natural question arises about what the Taylor series expansion would
be. Odibat and Shawagfeh[10] answered this in 2007. By using the Caputo fractional derivative and indicating the
limit as we approach from the right, the fractional Taylor series can be written as

Notes
[1] Kline, M. (1990) Mathematical Thought from Ancient to Modern Times. Oxford University Press. pp. 35-37.
[2] Boyer, C. and Merzbach, U. (1991) A History of Mathematics. John Wiley and Sons. pp. 202-203.
[3] "Neither Newton nor Leibniz - The Pre-History of Calculus and Celestial Mechanics in Medieval Kerala" (http://www.canisius.edu/topos/rajeev.asp). MAT 314. Canisius College. Retrieved 2006-07-09.
[4] S. G. Dani (2012). "Ancient Indian Mathematics – A Conspectus". Resonance 17 (3): 236-246.
[5] Taylor, Brook, Methodus Incrementorum Directa et Inversa [Direct and Reverse Methods of Incrementation] (London, 1715), pages 21-23
(Proposition VII, Theorem 3, Corollary 2). Translated into English in D. J. Struik, A Source Book in Mathematics 1200-1800 (Cambridge,
Massachusetts: Harvard University Press, 1969), pages 329-332.
[6] Rudin, Walter (1980), Real and Complex Analysis, New Delhi: McGraw-Hill, p. 418, Exercise 13, ISBN 0-07-099557-5
[7] Feller, William (1971), An introduction to probability theory and its applications, Volume 2 (3rd ed.), Wiley, pp. 230–232.
[8] Hille, Einar; Phillips, Ralph S. (1957), Functional analysis and semi-groups, AMS Colloquium Publications, 31, American Mathematical
Society, p. 300–327.
[9] Most of these can be found in (Abramowitz & Stegun 1970).
[10] Odibat, ZM., Shawagfeh, NT., 2007. "Generalized Taylor's formula." Applied Mathematics and Computation 186, 286-293.

References
• Abramowitz, Milton; Stegun, Irene A. (1970), Handbook of Mathematical Functions with Formulas, Graphs, and
Mathematical Tables, New York: Dover Publications, Ninth printing
• Thomas, George B. Jr.; Finney, Ross L. (1996), Calculus and Analytic Geometry (9th ed.), Addison Wesley,
ISBN 0-201-53174-7
• Greenberg, Michael (1998), Advanced Engineering Mathematics (2nd ed.), Prentice Hall, ISBN 0-13-321431-1

External links
• Weisstein, Eric W., " Taylor Series (http://mathworld.wolfram.com/TaylorSeries.html)" from MathWorld.
• Madhava of Sangamagramma (http://www-groups.dcs.st-and.ac.uk/~history/Projects/Pearce/Chapters/
Ch9_3.html)
• Taylor Series Representation Module by John H. Mathews (http://math.fullerton.edu/mathews/c2003/
TaylorSeriesMod.html)
• " Discussion of the Parker-Sochacki Method (http://csma31.csm.jmu.edu/physics/rudmin/ParkerSochacki.
htm)"
• Another Taylor visualisation (http://stud3.tuwien.ac.at/~e0004876/taylor/Taylor_en.html) - where you can
choose the point of the approximation and the number of derivatives
• Taylor series revisited for numerical methods (http://numericalmethods.eng.usf.edu/topics/taylor_series.
html) at Numerical Methods for the STEM Undergraduate (http://numericalmethods.eng.usf.edu)
• Cinderella 2: Taylor expansion (http://cinderella.de/files/HTMLDemos/2C02_Taylor.html)
• Taylor series (http://www.sosmath.com/calculus/tayser/tayser01/tayser01.html)
• Inverse trigonometric functions Taylor series (http://www.efunda.com/math/taylor_series/inverse_trig.cfm)

Uniform distribution (continuous)


Uniform
[Infobox: probability density function (using the maximum convention) and cumulative distribution function plots; most entries lost in extraction. Recoverable entries: mode — any value in (a, b); skewness — 0.]

In probability theory and statistics, the continuous uniform distribution or rectangular distribution is a family of
probability distributions such that for each member of the family, all intervals of the same length on the distribution's
support are equally probable. The support is defined by the two parameters, a and b, which are its minimum and
maximum values. The distribution is often abbreviated U(a,b). It is the maximum entropy probability distribution for
a random variate X under no constraint other than that it is contained in the distribution's support.[1]

Characterization

Probability density function


The probability density function of the continuous uniform distribution is:

f(x) = 1/(b − a) for a ≤ x ≤ b, and f(x) = 0 for x < a or x > b
The values of f(x) at the two boundaries a and b are usually unimportant because they do not alter the values of the
integrals of f(x) dx over any interval, nor of x f(x) dx or any higher moment. Sometimes they are chosen to be zero,
and sometimes chosen to be 1/(b − a). The latter is appropriate in the context of estimation by the method of
maximum likelihood. In the context of Fourier analysis, one may take the value of f(a) or f(b) to be 1/(2(b − a)),
since then the inverse transform of many integral transforms of this uniform function will yield back the function
itself, rather than a function which is equal "almost everywhere", i.e. except on a set of points with zero measure.
Also, it is consistent with the sign function which has no such ambiguity.
In terms of mean μ and variance σ^2, the probability density may be written as:

f(x) = 1/(2σ√3) for −σ√3 ≤ x − μ ≤ σ√3, and f(x) = 0 otherwise
Cumulative distribution function


The cumulative distribution function is:

F(x) = 0 for x < a;  F(x) = (x − a)/(b − a) for a ≤ x < b;  F(x) = 1 for x ≥ b

Its inverse is:

F^{−1}(p) = a + p(b − a) for 0 < p < 1
In mean and variance notation, the cumulative distribution function is:

and the inverse is:



Generating functions

Moment-generating function
The moment-generating function is

M_X(t) = E[e^{tX}] = (e^{tb} − e^{ta}) / (t(b − a)) for t ≠ 0, and M_X(0) = 1
from which we may calculate the raw moments m_k:

m_k = (b^{k+1} − a^{k+1}) / ((k + 1)(b − a))

For a random variable following this distribution, the expected value is then m_1 = (a + b)/2 and the variance is
m_2 − m_1^2 = (b − a)^2/12.

Cumulant-generating function
For n ≥ 2, the nth cumulant of the uniform distribution on the interval [0, 1] is B_n/n, where B_n is the nth Bernoulli
number.

Properties

Moments and parameters


The first two moments of the distribution are:

E(X) = (a + b)/2,   E(X^2) = (a^2 + ab + b^2)/3

Solving these two equations for parameters a and b, given known moments E(X) and V(X), yields:

a = E(X) − √(3 V(X)),   b = E(X) + √(3 V(X))
Order statistics
Let X1, ..., Xn be an i.i.d. sample from U(0,1). Let X(k) be the kth order statistic from this sample. Then the probability
distribution of X(k) is a Beta distribution with parameters k and n − k + 1. The expected value is

E(X(k)) = k / (n + 1)
This fact is useful when making Q-Q plots.


The variances are

Var(X(k)) = k(n − k + 1) / ((n + 1)^2 (n + 2))

Uniformity
The probability that a uniformly distributed random variable falls within any interval of fixed length is independent
of the location of the interval itself (but it is dependent on the interval size), so long as the interval is contained in the
distribution's support.
To see this, if X ~ U(a,b) and [x, x+d] is a subinterval of [a,b] with fixed d > 0, then

P(x ≤ X ≤ x + d) = ∫_x^{x+d} dy/(b − a) = d/(b − a)
which is independent of x. This fact motivates the distribution's name.

Generalization to Borel sets


This distribution can be generalized to more complicated sets than intervals. If S is a Borel set of positive, finite
measure, the uniform probability distribution on S can be specified by defining the pdf to be zero outside S and
constantly equal to 1/K on S, where K is the Lebesgue measure of S.

Standard uniform
Restricting a = 0 and b = 1, the resulting distribution U(0,1) is called a standard uniform distribution.
One interesting property of the standard uniform distribution is that if u1 has a standard uniform distribution, then so
does 1-u1. This property can be used for generating antithetic variates, among other things.

Related distributions
• If X has a standard uniform distribution, then by the inverse transform sampling method, Y = − ln(X) / λ has an
exponential distribution with (rate) parameter λ.
• If X has a standard uniform distribution, then Y = X^n has a beta distribution with parameters 1/n and 1. (Note this
implies that the standard uniform distribution is a special case of the beta distribution, with parameters 1 and 1.)
• The Irwin–Hall distribution is the sum of n i.i.d. U(0,1) distributions.
• The sum of two independent, equally distributed, uniform distributions yields a symmetric triangular distribution.

Relationship to other functions


As long as the same conventions are followed at the transition points, the probability density function may also be
expressed in terms of the Heaviside step function:

or in terms of the rectangle function

There is no ambiguity at the transition point of the sign function. Using the half-maximum convention at the
transition points, the uniform distribution may be expressed in terms of the sign function as:

Applications
In statistics, when a p-value is used as a test statistic for a simple null hypothesis, and the distribution of the test
statistic is continuous, then the p-value is uniformly distributed between 0 and 1 if the null hypothesis is true.

Sampling from a uniform distribution


There are many applications in which it is useful to run simulation experiments. Many programming languages have
the ability to generate pseudo-random numbers which are effectively distributed according to the standard uniform
distribution.
If u is a value sampled from the standard uniform distribution, then the value a + (b − a)u follows the uniform
distribution parametrised by a and b, as described above.
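This shift-and-scale transformation is a one-liner in code. A minimal sketch (function and parameter names are illustrative):

```python
import random
import statistics

def uniform_ab(a, b, rng=random.random):
    # shift and scale a standard-uniform draw onto [a, b]
    return a + (b - a) * rng()

random.seed(0)
samples = [uniform_ab(2.0, 5.0) for _ in range(100_000)]
# sample moments should approach (a+b)/2 = 3.5 and (b-a)^2/12 = 0.75
print(statistics.fmean(samples), statistics.variance(samples))
```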

Sampling from an arbitrary distribution


The uniform distribution is useful for sampling from arbitrary distributions. A general method is the inverse
transform sampling method, which uses the cumulative distribution function (CDF) of the target random variable.
This method is very useful in theoretical work. Since simulations using this method require inverting the CDF of the
target variable, alternative methods have been devised for the cases where the cdf is not known in closed form. One
such method is rejection sampling.
The normal distribution is an important example where the inverse transform method is not efficient. However, there
is an exact method, the Box–Muller transformation, which uses the inverse transform to convert two independent
uniform random variables into two independent normally distributed random variables.
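The exponential distribution, whose CDF inverts in closed form, is the textbook case where inverse transform sampling is efficient. A minimal sketch (names are illustrative):

```python
import math
import random
import statistics

def exponential_sample(lam, rng=random.random):
    # F(x) = 1 - exp(-lam*x), so F^{-1}(u) = -ln(1 - u)/lam;
    # since 1 - U is also standard uniform, -ln(U)/lam works equally well
    return -math.log(rng()) / lam

random.seed(1)
xs = [exponential_sample(2.0) for _ in range(200_000)]
print(statistics.fmean(xs))  # should be close to 1/lam = 0.5
```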

Estimation

Estimation of maximum
Given a uniform distribution on [0, N] with unknown N, the UMVU estimator for the maximum is given by

N̂ = ((k + 1)/k) m = m + m/k
where m is the sample maximum and k is the sample size, sampling without replacement (though this distinction
almost surely makes no difference for a continuous distribution). This follows for the same reasons as estimation for
the discrete distribution, and can be seen as a very simple case of maximum spacing estimation. This problem is
commonly known as the German tank problem, due to application of maximum estimation to estimates of German
tank production during World War II.
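The unbiasedness of the (1 + 1/k)·m estimator can be checked by simulation. A small sketch (variable names are illustrative):

```python
import random
import statistics

def umvu_max(sample):
    # UMVU estimator for N given a sample from U(0, N): (1 + 1/k) * m
    m, k = max(sample), len(sample)
    return (1 + 1 / k) * m

random.seed(2)
N, k = 100.0, 10
estimates = [umvu_max([random.uniform(0, N) for _ in range(k)])
             for _ in range(20_000)]
print(statistics.fmean(estimates))  # unbiased: close to N = 100
```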

Estimation of midpoint
The midpoint of the distribution (a + b) / 2 is both the mean and the median of the uniform distribution. Although
both the sample mean and the sample median are unbiased estimators of the midpoint, neither is as efficient as the
sample mid-range, i.e. the arithmetic mean of the sample maximum and the sample minimum, which is the UMVU
estimator of the midpoint (and also the maximum likelihood estimate).

Confidence interval for the maximum


Let X1, X2, X3, ..., Xn be a sample from U(0, L) where L is the population maximum. Then X(n) = max( X1, X2, X3,
..., Xn ) has the density[2]

f(t) = n t^{n−1} / L^n for 0 ≤ t ≤ L, and f(t) = 0 otherwise

The confidence interval for the estimated population maximum is then ( X(n), X(n) / α^{1/n} ) where 100(1 − α)% is the
confidence level sought. In symbols,

P( X(n) ≤ L ≤ X(n) / α^{1/n} ) = 1 − α
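The stated coverage can be verified by simulation, since X(n) ≤ L always and P(X(n) ≥ L α^{1/n}) = 1 − α. A sketch (names illustrative):

```python
import random

def ci_for_max(sample, alpha=0.05):
    # 100(1 - alpha)% interval for L: ( X(n), X(n) / alpha**(1/n) )
    xn, n = max(sample), len(sample)
    return xn, xn / alpha ** (1 / n)

random.seed(3)
L, n, trials = 10.0, 5, 10_000
hits = 0
for _ in range(trials):
    lo, hi = ci_for_max([random.uniform(0, L) for _ in range(n)])
    if lo <= L <= hi:
        hits += 1
print(hits / trials)  # coverage should be close to 1 - alpha = 0.95
```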
References
[1] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http://www.wise.xmu.edu.cn/Master/Download/..\..\UploadFiles\paper-masterdownload\2009519932327055475115776.pdf). Journal of Econometrics (Elsevier): 219–230. Retrieved 2011-06-02.
[2] Nechval KN, Nechval NA, Vasermanis EK, Makeev VY (2002) Constructing shortest-length confidence intervals. Transport and
Telecommunication 3 (1) 95-103

External links
• Online calculator of Uniform distribution (continuous) (http://www.stud.feec.vutbr.cz/~xvapen02/vypocty/
ro.php?language=english)

Uniform distribution (discrete)


discrete uniform
[Infobox: probability mass function (shown for n = 5, where n = b − a + 1) and cumulative distribution function plots; most entries lost in extraction. Recoverable entries: mode — N/A; variance[1] not legible.]

In probability theory and statistics, the discrete uniform distribution is a probability distribution whereby a finite
number of equally spaced values are equally likely to be observed; every one of n values has equal probability 1/n.
Another way of saying "discrete uniform distribution" would be "a known, finite number of equally spaced outcomes
equally likely to happen."
If a random variable has any of n possible values k1, k2, ..., kn that are equally spaced and equally probable,
then it has a discrete uniform distribution. The probability of any outcome ki is 1/n. A simple example of the
discrete uniform distribution is throwing a fair die. The possible values of k are 1, 2, 3, 4, 5, 6; and each time the die
is thrown, the probability of a given score is 1/6. If two dice are thrown and their values added, the uniform
distribution no longer fits since the values from 2 to 12 do not have equal probabilities.
The cumulative distribution function (CDF) can be expressed in terms of a degenerate distribution as

F(k) = (1/n) Σ_{i=1}^{n} H(k − ki)

where the Heaviside step function H(x − x0) is the CDF of the degenerate distribution centered at x0, using the
convention that H(0) = 1.
Estimation of maximum
This example is described by saying that a sample of k observations is obtained from a uniform distribution on the
integers , with the problem being to estimate the unknown maximum N. This problem is commonly
known as the German tank problem, following the application of maximum estimation to estimates of German tank
production during World War II.
The UMVU estimator for the maximum is given by

N̂ = ((k + 1)/k) m − 1 = m + m/k − 1
where m is the sample maximum and k is the sample size, sampling without replacement.[2][3] This can be seen as a
very simple case of maximum spacing estimation.
The formula may be understood intuitively as:
"The sample maximum plus the average gap between observations in the sample",
the gap being added to compensate for the negative bias of the sample maximum as an estimator for the population
maximum.[4]
This has a variance of[2]

(1/k)(N − k)(N + 1)/(k + 2) ≈ N^2/k^2 for small samples k ≪ N

so a standard deviation of approximately N/k, the (population) average size of a gap between samples; compare m/k
above.

The sample maximum is the maximum likelihood estimator for the population maximum, but, as discussed above, it
is biased.
If samples are not numbered but are recognizable or markable, one can instead estimate population size via the
capture-recapture method.
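The "sample maximum plus average gap" formula is easy to sketch; the serial numbers below are made-up illustrative data:

```python
from fractions import Fraction

def umvu_max(sample):
    # "sample maximum plus the average gap between observations":
    # m + (m - k)/k  =  m(k + 1)/k - 1
    m, k = max(sample), len(sample)
    return Fraction(m * (k + 1), k) - 1

# hypothetical observed serial numbers from 1..N
print(umvu_max([19, 40, 42, 60]))  # 60 * 5/4 - 1 = 74
```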

Random permutation
See rencontres numbers for an account of the probability distribution of the number of fixed points of a uniformly
distributed random permutation.

Notes
[1] http://adorio-research.org/wordpress/?p=519
[2] Johnson, Roger (1994), "Estimating the Size of a Population", Teaching Statistics (http://www.rsscse.org.uk/ts/index.htm) 16 (2 (Summer)), doi:10.1111/j.1467-9639.1994.tb00688.x
[3] Johnson, Roger (2006), "Estimating the Size of a Population" (http://www.rsscse.org.uk/ts/gtb/johnson.pdf), Getting the Best from Teaching Statistics (http://www.rsscse.org.uk/ts/gtb/contents.html).
[4] The sample maximum is never more than the population maximum, but can be less, hence it is a biased estimator: it will tend to underestimate the population maximum.


Weibull distribution
Weibull (2-Parameter)
[Infobox: probability density function and cumulative distribution function plots; parameters — scale (real) and shape (real); excess kurtosis — see text; most remaining entries lost in extraction.]

In probability theory and statistics, the Weibull distribution is a continuous probability distribution. It is named
after Waloddi Weibull, who described it in detail in 1951, although it was first identified by Fréchet (1927) and first
applied by Rosin & Rammler (1933) to describe the size distribution of particles.

Definition
The probability density function of a Weibull random variable x is:[1]

f(x; k, λ) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k} for x ≥ 0, and f(x; k, λ) = 0 for x < 0
where k > 0 is the shape parameter and λ > 0 is the scale parameter of the distribution. Its complementary
cumulative distribution function is a stretched exponential function. The Weibull distribution is related to a number
of other probability distributions; in particular, it interpolates between the exponential distribution (k = 1) and the
Rayleigh distribution (k = 2).
If the quantity x is a "time-to-failure", the Weibull distribution gives a distribution for which the failure rate is
proportional to a power of time. The shape parameter, k, is that power plus one, and so this parameter can be
interpreted directly as follows:
• A value of k<1 indicates that the failure rate decreases over time. This happens if there is significant "infant
mortality", or defective items failing early and the failure rate decreasing over time as the defective items are
weeded out of the population.
• A value of k=1 indicates that the failure rate is constant over time. This might suggest random external events are
causing mortality, or failure.
• A value of k>1 indicates that the failure rate increases with time. This happens if there is an "aging" process, or
parts that are more likely to fail as time goes on.
In the field of materials science, the shape parameter k of a distribution of strengths is known as the Weibull
modulus.
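The three failure-rate regimes can be checked numerically from the hazard function h(x; k, λ) = (k/λ)(x/λ)^(k−1), the closed form obtained by dividing the pdf by the survival function. A minimal sketch:

```python
def weibull_hazard(x, k, lam):
    # failure (hazard) rate: pdf / survival = (k/lam) * (x/lam)**(k - 1)
    return (k / lam) * (x / lam) ** (k - 1)

# k < 1: decreasing ("infant mortality"); k = 1: constant; k > 1: increasing
for k in (0.5, 1.0, 2.0):
    print(k, [round(weibull_hazard(x, k, 1.0), 3) for x in (0.5, 1.0, 2.0)])
```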

Properties

Density function
The form of the density function of the Weibull distribution changes drastically with the value of k. For 0 < k < 1, the
density function tends to ∞ as x approaches zero from above and is strictly decreasing. For k = 1, the density function
tends to 1/λ as x approaches zero from above and is strictly decreasing. For k > 1, the density function tends to zero
as x approaches zero from above, increases until its mode and decreases after it. It is interesting to note that the
density function has infinite negative slope at x = 0 if 0 < k < 1, infinite positive slope at x = 0 if 1 < k < 2 and null
slope at x = 0 if k > 2. For k = 2 the density has a finite positive slope at x = 0. As k goes to infinity, the Weibull
distribution converges to a Dirac delta distribution centred at x = λ. Moreover, the skewness and coefficient of
variation depend only on the shape parameter.

Distribution function
The cumulative distribution function for the Weibull distribution is

F(x; k, λ) = 1 − e^{−(x/λ)^k}

for x ≥ 0, and F(x; k, λ) = 0 for x < 0.


The failure rate h (or hazard rate) is given by

h(x; k, λ) = (k/λ)(x/λ)^{k−1}
Moments
The moment generating function of the logarithm of a Weibull distributed random variable is given by[2]

E[e^{t log X}] = λ^t Γ(1 + t/k)

where Γ is the gamma function. Similarly, the characteristic function of log X is given by

E[e^{it log X}] = λ^{it} Γ(1 + it/k)

In particular, the nth raw moment of X is given by

m_n = λ^n Γ(1 + n/k)

The mean and variance of a Weibull random variable can be expressed as

E(X) = λ Γ(1 + 1/k)

and

var(X) = λ^2 [ Γ(1 + 2/k) − (Γ(1 + 1/k))^2 ]
The skewness is given by

where the mean is denoted by μ and the standard deviation is denoted by σ.


The excess kurtosis is given by

where . The kurtosis excess may also be written as:



Moment generating function


A variety of expressions are available for the moment generating function of X itself. As a power series, since the
raw moments are already known, one has

Alternatively, one can attempt to deal directly with the integral

If the parameter k is assumed to be a rational number, expressed as k = p/q where p and q are integers, then this
integral can be evaluated analytically.[3] With t replaced by −t, one finds

where G is the Meijer G-function.


The characteristic function has also been obtained by Muraleedharan et al. (2007).

Information entropy
The information entropy is given by

H(X) = γ(1 − 1/k) + ln(λ/k) + 1

where γ is the Euler–Mascheroni constant.

Weibull plot
The goodness of fit of data to a Weibull distribution can be visually assessed using a Weibull Plot.[4] The Weibull
Plot is a plot of the empirical cumulative distribution function of data on special axes in a type of Q-Q plot.
The axes are ln(−ln(1 − F(x))) versus ln(x). The reason for this change of variables is that the cumulative
distribution function can be linearised:

ln(−ln(1 − F(x))) = k ln x − k ln λ

which can be seen to be in the standard form of a straight line. Therefore if the data came from a Weibull distribution
then a straight line is expected on a Weibull plot.
There are various approaches to obtaining the empirical distribution function from data: one method is to obtain the
vertical coordinate for each point using F = (i − 0.3)/(n + 0.4), where i is the rank of the data point and n is the number
of data points.[5]
Linear regression can also be used to numerically assess goodness of fit and estimate the parameters of the Weibull
distribution. The gradient informs one directly about the shape parameter and the scale parameter can also be
inferred.

Uses
The Weibull distribution is used
• In survival analysis[6]
• In reliability engineering and failure analysis
• In industrial engineering to represent manufacturing and delivery times
• In extreme value theory
• In weather forecasting
• To describe wind speed distributions, as the natural distribution often matches the Weibull shape[7]
• In communications systems engineering
• In radar systems to model the dispersion of the
received signals level produced by some types of
clutters
• To model fading channels in wireless
communications, as the Weibull fading model
seems to exhibit good fit to experimental fading
channel measurements
• In General insurance to model the size of Reinsurance
claims, and the cumulative development of Asbestosis
losses
• In forecasting technological change (also known as the Sharif-Islam model)
[Figure: fitted cumulative Weibull distribution to maximum one-day rainfalls]
• In hydrology the Weibull distribution is applied to
extreme events such as annual maximum one-day rainfalls and river discharges. The blue picture illustrates an
example of fitting the Weibull distribution to ranked annually maximum one-day rainfalls showing also the 90%
confidence belt based on the binomial distribution. The rainfall data are represented by plotting positions as part
of the cumulative frequency analysis.
• In describing the size of particles generated by grinding, milling and crushing operations, the 2-Parameter
Weibull distribution is used, and in these applications it is sometimes known as the Rosin-Rammler distribution.
In this context it predicts fewer fine particles than the Log-normal distribution and it is generally most accurate
for narrow particle size distributions. The interpretation of the cumulative distribution function is that F(x; k; λ) is
the mass fraction of particles with diameter smaller than x, where λ is the mean particle size and k is a measure of
the spread of particle sizes.

Related distributions
• The translated Weibull distribution contains an additional parameter.[2] It has the probability density function

f(x; k, λ, θ) = (k/λ)((x − θ)/λ)^{k−1} e^{−((x − θ)/λ)^k}

for x ≥ θ and f(x; k, λ, θ) = 0 for x < θ, where k > 0 is the shape parameter, λ > 0 is the scale parameter and
θ is the location parameter of the distribution. When θ = 0, this reduces to the 2-parameter distribution.
• The Weibull distribution can be characterized as the distribution of a random variable X such that the random
variable (X/λ)^k has the standard exponential distribution with intensity 1.[2]


• The Weibull distribution interpolates between the exponential distribution with intensity 1/λ when k = 1 and a
Rayleigh distribution of mode σ = λ/√2 when k = 2.
• The Weibull distribution can also be characterized in terms of a uniform distribution: if X is uniformly distributed
on (0,1), then the random variable λ(−ln X)^{1/k} is Weibull distributed with parameters k and λ. This
leads to an easily implemented numerical scheme for simulating a Weibull distribution.
• The Weibull distribution (usually sufficient in reliability engineering) is a special case of the three parameter
Exponentiated Weibull distribution where the additional exponent equals 1. The Exponentiated Weibull
distribution accommodates unimodal, bathtub shaped[8] and monotone failure rates.
• The Weibull distribution is a special case of the generalized extreme value distribution. It was in this connection
that the distribution was first identified by Maurice Fréchet in 1927. The closely related Fréchet distribution,
named for this work, has the probability density function

• The distribution of a random variable that is defined as the minimum of several random variables, each having a
different Weibull distribution, is a poly-Weibull distribution.
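The uniform-distribution characterization above yields a one-line simulation scheme, sketched here with illustrative names:

```python
import math
import random
import statistics

def weibull_sample(k, lam, rng=random.random):
    # if U ~ U(0,1), then lam * (-ln U)**(1/k) is Weibull(k, lam)
    return lam * (-math.log(rng())) ** (1 / k)

random.seed(5)
k, lam = 2.0, 1.5
xs = [weibull_sample(k, lam) for _ in range(200_000)]
# sample mean should approach lam * Gamma(1 + 1/k)
print(statistics.fmean(xs), lam * math.gamma(1 + 1 / k))
```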

References
[1] Papoulis, A.; Pillai, S. U., Probability, Random Variables, and Stochastic Processes, 4th edition.
[2] Johnson, Kotz & Balakrishnan 1994
[3] See (Cheng, Tellambura & Beaulieu 2004) for the case when k is an integer, and (Sagias & Karagiannidis 2005) for the rational case.
[4] The Weibull plot (http://www.itl.nist.gov/div898/handbook/eda/section3/weibplot.htm)
[5] Wayne Nelson (2004) Applied Life Data Analysis. Wiley-Blackwell ISBN 0-471-64462-5
[6] Survival/Failure Time Analysis (http://www.statsoft.com/textbook/survival-failure-time-analysis/#distribution)
[7] Wind Speed Distribution Weibull (http://www.reuk.co.uk/Wind-Speed-Distribution-Weibull.htm)
[8] "System evolution and reliability of systems" (http://www.sys-ev.com/reliability01.htm). Sysev (Belgium). 2010-01-01.

Bibliography
• Fréchet, Maurice (1927), "Sur la loi de probabilité de l'écart maximum", Annales de la Société Polonaise de
Mathematique, Cracovie 6: 93–116.
• Johnson, Norman L.; Kotz, Samuel; Balakrishnan, N. (1994), Continuous univariate distributions. Vol. 1, Wiley
Series in Probability and Mathematical Statistics: Applied Probability and Statistics (2nd ed.), New York: John
Wiley & Sons, ISBN 978-0-471-58495-7, MR1299979
• Muraleedharan, G.; Rao, A.G.; Kurup, P.G.; Nair, N. Unnikrishnan; Sinha, Mourani (2007), "Coastal
Engineering", Coastal Engineering 54 (8): 630–638, doi:10.1016/j.coastaleng.2007.05.001
• Rosin, P.; Rammler, E. (1933), "The Laws Governing the Fineness of Powdered Coal", Journal of the Institute of
Fuel 7: 29–36.
• Sagias, Nikos C.; Karagiannidis, George K. (2005), "Gaussian class multivariate Weibull distributions: theory and
applications in fading channels", Institute of Electrical and Electronics Engineers. Transactions on Information
Theory 51 (10): 3608–3619, doi:10.1109/TIT.2005.855598, ISSN 0018-9448, MR2237527
• Weibull, W. (1951), "A statistical distribution function of wide applicability" (http://www.barringer1.com/
wa_files/Weibull-ASME-Paper-1951.pdf), J. Appl. Mech.-Trans. ASME 18 (3): 293–297.
• "Engineering statistics handbook" (http://www.itl.nist.gov/div898/handbook/eda/section3/eda3668.htm).
National Institute of Standards and Technology. 2008.
• Nelson, Jr, Ralph (2008-02-05). "Dispersing Powders in Liquids, Part 1, Chap 6: Particle Volume Distribution"
(http://www.erpt.org/014Q/nelsa-06.htm). Retrieved 2008-02-05.

External links
• Mathpages - Weibull Analysis (http://www.mathpages.com/home/kmath122/kmath122.htm)
• The Weibull Distribution (http://www.weibull.com/LifeDataWeb/the_weibull_distribution.htm)
• Reliability Analysis with Weibull (http://www.crgraph.com/Weibull11e.pdf)

Wilcoxon signed-rank test


The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used when comparing two related
samples, matched samples, or repeated measurements on a single sample to assess whether their population mean
ranks differ (i.e. it is a paired difference test). It can be used as an alternative to the paired Student's t-test, t-test for
matched pairs, or the t-test for dependent samples when the population cannot be assumed to be normally
distributed.[1]
The test is named for Frank Wilcoxon (1892–1965) who, in a single paper, proposed both it and the rank-sum test
for two independent samples (Wilcoxon, 1945).[2] The test was popularized by Siegel (1956)[3] in his influential text
book on non-parametric statistics. Siegel used the symbol T for the value defined below as W. In consequence, the
test is sometimes referred to as the Wilcoxon T test, and the test statistic is reported as a value of T. Other names
may include the 't-test for matched pairs' or the 't-test for dependent samples'.

Assumptions
1. Data are paired and come from the same population.
2. Each pair is chosen randomly and independently.
3. The data are measured on an interval scale (ordinal is not sufficient because we take differences), but need not be
normal.

Test procedure
Let N be the sample size, i.e. the number of pairs. Thus, there are a total of 2N data points. For i = 1, ..., N, let x1,i and
x2,i denote the measurements.
H0: the differences (x2,i − x1,i) are symmetric about zero.
1. For i = 1, ..., N, calculate sgn(x2,i − x1,i) and |x2,i − x1,i|, where sgn is the sign function.
2. Exclude pairs with |x2,i − x1,i| = 0. Let Nr be the reduced sample size.
3. Order the remaining Nr pairs from smallest absolute difference to largest absolute difference, |x2,i − x1,i|.
4. Rank the pairs, starting with the smallest as 1. Ties receive a rank equal to the average of the ranks they span. Let
Ri denote the rank.
5. Calculate the test statistic W:

W = | Σ_{i=1}^{Nr} sgn(x2,i − x1,i) · Ri |, the absolute value of the sum of the signed ranks.

6. As Nr increases, the sampling distribution of W converges to a normal distribution. Thus,

for Nr ≥ 10, a z-score can be calculated as z = W / σW, where σW = √( Nr (Nr + 1)(2 Nr + 1) / 6 ). If z >
zcritical, reject H0.

For Nr < 10, W is compared to a critical value from a reference table[1]. If W ≥ Wcritical,Nr, reject H0.
Alternatively, a p-value can be calculated from enumeration of all possible combinations of W given Nr.

Example
               original order                         ordered by absolute difference

  i   x2,i  x1,i  sgn  |x2,i − x1,i|      i   x2,i  x1,i  sgn  |x2,i − x1,i|  rank  signed rank
  1   125   110    1        15            5   140   140    0         0         —        —
  2   115   122   –1         7            3   130   125    1         5        1.5      1.5
  3   130   125    1         5            9   140   135    1         5        1.5      1.5
  4   140   120    1        20            2   115   122   –1         7         3       –3
  5   140   140    0         0            6   115   124   –1         9         4       –4
  6   115   124   –1         9           10   135   145   –1        10         5       –5
  7   140   123    1        17            8   125   137   –1        12         6       –6
  8   125   137   –1        12            1   125   110    1        15         7        7
  9   140   135    1         5            7   140   123    1        17         8        8
 10   135   145   –1        10            4   140   120    1        20         9        9

Here sgn is the sign function, |·| is the absolute value, and R is the rank. Notice that pairs 3 and 9 are tied in
absolute value. They would be ranked 1 and 2, so each gets the average of those ranks, 1.5.
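The ranking steps in this example can be reproduced with a short routine (a sketch; tie handling averages the spanned ranks as described in the procedure):

```python
def wilcoxon_W(x2, x1):
    # steps 1-5: signed differences, drop zeros, rank the absolute values
    # (ties get the average of the ranks they span), W = |sum of sgn * rank|
    d = [u - v for u, v in zip(x2, x1) if u != v]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(d):
        j = i
        while j + 1 < len(d) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        i = j + 1
    return abs(sum((1 if v > 0 else -1) * r for v, r in zip(d, ranks)))

x2 = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]
x1 = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]
print(wilcoxon_W(x2, x1))  # 9.0 for the worked example
```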

References
[1] Lowry, Richard. "Concepts & Applications of Inferential Statistics" (http://faculty.vassar.edu/lowry/ch12a.html). Retrieved 24 March 2011.
[2] Wilcoxon, Frank (Dec 1945). "Individual comparisons by ranking methods" (http://sci2s.ugr.es/keel/pdf/algorithm/articulo/wilcoxon1945.pdf). Biometrics Bulletin 1 (6): 80–83.
[3] Siegel, Sidney (1956). Non-parametric statistics for the behavioral sciences. New York: McGraw-Hill. pp. 75–83.

External links
• Description of how to calculate p for the Wilcoxon signed-ranks test (http://comp9.psych.cornell.edu/
Darlington/wilcoxon/wilcox0.htm)
• Example of using the Wilcoxon signed-rank test (http://faculty.vassar.edu/lowry/ch12a.html)
• An online version of the test (http://faculty.vassar.edu/lowry/wilcoxon.html)
• A table of critical values for the Wilcoxon signed-rank test (http://www.sussex.ac.uk/Users/grahamh/
RM1web/WilcoxonTable2005.pdf)

Implementations
• ALGLIB (http://www.alglib.net/statistics/hypothesistesting/wilcoxonsignedrank.php) includes
implementation of the Wilcoxon signed-rank test in C++, C#, Delphi, Visual Basic, etc.
• The free statistical software R includes an implementation of the test as wilcox.test(x,y,
paired=TRUE), where x and y are vectors of equal length.
• GNU Octave implements various one-tailed and two-tailed versions of the test in the wilcoxon_test
function.
• SciPy (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html) includes an
implementation of the Wilcoxon signed-rank test in Python.

Wishart distribution
Wishart

Parameters:  n > p − 1, degrees of freedom (real)
             V > 0, scale matrix (p × p, pos. def.)
Support:     X, a p × p positive definite matrix
PDF:         f(X) = |X|^{(n−p−1)/2} exp(−tr(V^{−1}X)/2) / (2^{np/2} |V|^{n/2} Γ_p(n/2))
             • Γ_p is the multivariate gamma function
             • tr is the trace function
Mean:        nV
Mode:        (n − p − 1)V, for n ≥ p + 1
Variance:    Var(X_{ij}) = n(v_{ij}² + v_{ii}v_{jj})
Entropy:     see below
CF:          Θ ↦ |I − 2iΘV|^{−n/2}

In statistics, the Wishart distribution is a generalization to multiple dimensions of the chi-squared distribution, or,
in the case of non-integer degrees of freedom, of the gamma distribution. It is named in honor of John Wishart, who
first formulated the distribution in 1928.[1]
It is any of a family of probability distributions defined over symmetric, nonnegative-definite matrix-valued random
variables (“random matrices”). These distributions are of great importance in the estimation of covariance matrices in
multivariate statistics. In Bayesian statistics, the Wishart distribution is the conjugate prior of the inverse
covariance-matrix of a multivariate-normal random-vector.

Definition
Suppose X is an n × p matrix, each row of which is independently drawn from a p-variate normal distribution with
zero mean:

    X_{(i)} ~ N_p(0, V).

Then the Wishart distribution is the probability distribution of the p × p random matrix

    S = X^T X,

known as the scatter matrix. One indicates that S has that probability distribution by writing

    S ~ W_p(V, n).
The positive integer n is the number of degrees of freedom. Sometimes this is written W(V, p, n). For n ≥ p the
matrix S is invertible with probability 1 if V is invertible.
If p = 1 and V = 1 then this distribution is a chi-squared distribution with n degrees of freedom.
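This definition can be illustrated numerically: build the scatter matrix S = X^T X from normal draws and check the p = 1, V = 1 case, where S reduces to a chi-squared variate with n degrees of freedom. The helper below is our own sketch (V = I assumed for simplicity), not a vetted implementation:

```python
import random

def scatter_matrix(n, p, rng):
    """S = X^T X for an n x p matrix X with i.i.d. N(0, 1) rows (i.e. V = I)."""
    S = [[0.0] * p for _ in range(p)]
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for i in range(p):
            for j in range(p):
                S[i][j] += x[i] * x[j]
    return S

# With p = 1 and V = 1, S is chi-squared with n d.f., so E[S] = n:
rng = random.Random(0)
draws = [scatter_matrix(8, 1, rng)[0][0] for _ in range(5000)]
print(sum(draws) / len(draws))  # close to 8
```

For p > 1 the same construction yields a symmetric matrix whose diagonal entries are each marginally chi-squared, as Corollary 2 below makes precise.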

Occurrence
The Wishart distribution arises as the distribution of the sample covariance matrix for a sample from a multivariate
normal distribution. It occurs frequently in likelihood-ratio tests in multivariate statistical analysis. It also arises in
the spectral theory of random matrices and in multidimensional Bayesian analysis.

Probability density function


The Wishart distribution can be characterized by its probability density function, as follows.
Let X be a p × p symmetric matrix of random variables that is positive definite. Let V be a (fixed) positive definite
matrix of size p × p.
Then, if n ≥ p, X has a Wishart distribution with n degrees of freedom if it has a probability density function given
by

    f(X) = |X|^{(n−p−1)/2} exp(−tr(V^{−1}X)/2) / (2^{np/2} |V|^{n/2} Γ_p(n/2)),

where Γ_p(·) is the multivariate gamma function defined as

    Γ_p(n/2) = π^{p(p−1)/4} ∏_{j=1..p} Γ((n + 1 − j)/2).

In fact the above definition can be extended to any real n > p − 1. If n ≤ p − 1, then the Wishart no longer has a
density—instead it represents a singular distribution.[2]

Properties

Log-expectation
Note the following formula:[3]

    E[ln |X|] = Σ_{i=1..p} ψ((n + 1 − i)/2) + p ln 2 + ln |V|,

where ψ is the digamma function (the derivative of the log of the gamma function).
This plays a role in variational Bayes derivations for Bayes networks involving the Wishart distribution.

Entropy
The information entropy of the distribution has the following formula:[3]

    H[X] = −ln B(V, n) − ((n − p − 1)/2) E[ln |X|] + np/2,

where B(V, n) is the normalizing constant of the distribution:

    B(V, n) = 1 / (|V|^{n/2} 2^{np/2} Γ_p(n/2)).

This can be expanded by substituting the formula for E[ln |X|] from the log-expectation section above.


Characteristic function
The characteristic function of the Wishart distribution is

    Θ ↦ |I − 2iΘV|^{−n/2}.

In other words,

    E[exp(i tr(XΘ))] = |I − 2iΘV|^{−n/2},

where E denotes expectation. (Here Θ and I are matrices the same size as V; I is the identity matrix, and i
is the square root of −1.)

Theorem
If X has a Wishart distribution with m degrees of freedom and variance matrix V—write X ~ W_p(V, m)—and C is
a q × p matrix of rank q, then

    CXC^T ~ W_q(CVC^T, m).
Corollary 1
If λ is a nonzero p × 1 constant vector, then λ^T X λ / σ_λ² ~ χ²_m.
In this case, χ²_m is the chi-squared distribution and σ_λ² = λ^T V λ (note that σ_λ² is a constant; it is positive because
V is positive definite).

Corollary 2
Consider the case where λ = (0, ..., 0, 1, 0, ..., 0)^T (that is, the jth element is one and all others zero). Then
corollary 1 above shows that

    X_{jj}/v_{jj} ~ χ²_m

gives the marginal distribution of each of the elements on the matrix's diagonal.
Noted statistician George Seber points out that the Wishart distribution is not called the “multivariate chi-squared
distribution” because the marginal distribution of the off-diagonal elements is not chi-squared. Seber prefers to
reserve the term multivariate for the case when all univariate marginals belong to the same family.

Estimator of the multivariate normal distribution


The Wishart distribution is the sampling distribution of the maximum-likelihood estimator (MLE) of the covariance
matrix of a multivariate normal distribution with mean zero. A derivation of the MLE uses the spectral theorem.

Bartlett decomposition
The Bartlett decomposition of a matrix X from a p-variate Wishart distribution with scale matrix V and n degrees
of freedom is the factorization:

    X = L A A^T L^T,

where L is the Cholesky factor of V, and A is a lower-triangular random matrix whose diagonal entries c_i and
below-diagonal entries n_{ij} satisfy

    c_i² ~ χ²_{n−i+1}   and   n_{ij} ~ N(0, 1),

all independently. This provides a useful method for obtaining random samples from a Wishart distribution.[4]
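The Bartlett recipe can be sketched in pure Python. The function name is ours, and a chi-squared variate with k degrees of freedom is drawn as Gamma(k/2, scale 2) via the standard library; this is an illustrative sketch, not a production sampler:

```python
import math
import random

def sample_wishart_bartlett(L, n, rng):
    """One draw from W_p(V, n) via X = L A A^T L^T, where L is the
    lower-triangular Cholesky factor of V (Bartlett decomposition)."""
    p = len(L)
    A = [[0.0] * p for _ in range(p)]
    for i in range(p):
        # Diagonal (1-indexed i+1): c^2 ~ chi-squared with n - i degrees of freedom,
        # drawn as Gamma((n - i)/2, scale 2).
        A[i][i] = math.sqrt(rng.gammavariate((n - i) / 2.0, 2.0))
        for j in range(i):
            A[i][j] = rng.gauss(0.0, 1.0)  # below-diagonal: standard normal
    # Form LA = L A, then X = (L A)(L A)^T.
    LA = [[sum(L[i][k] * A[k][j] for k in range(p)) for j in range(p)]
          for i in range(p)]
    return [[sum(LA[i][k] * LA[j][k] for k in range(p)) for j in range(p)]
            for i in range(p)]

# Example: V = I_2 (so L = I_2), n = 5; averaging many draws approaches nV = 5 I.
rng = random.Random(42)
X = sample_wishart_bartlett([[1.0, 0.0], [0.0, 1.0]], 5, rng)
```

Each draw costs O(p³) plus p gamma and p(p − 1)/2 normal variates, which is why this factorization is the usual sampling route.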

The possible range of the shape parameter


It can be shown[5] that the Wishart distribution can be defined if and only if the shape parameter n belongs to the set

    Λ_p := {0, 1, ..., p − 1} ∪ (p − 1, ∞).

This set is named after Gindikin, who introduced it[6] in the seventies in the context of gamma distributions on
homogeneous cones. However, for the new parameters in the discrete spectrum of the Gindikin ensemble, namely

    Λ*_p := {0, 1, ..., p − 1},

the corresponding Wishart distribution has no Lebesgue density.



Relationships to other distributions


• The Wishart distribution is related to the inverse-Wishart distribution, denoted W_p^{−1}, as follows: if
X ~ W_p(V, n) and we make the change of variables C = X^{−1}, then C ~ W_p^{−1}(V^{−1}, n). This
relationship may be derived by computing the absolute value of the Jacobian determinant of this change of
variables; see for example equation (15.15) in [7].
• In Bayesian statistics, the Wishart distribution is a conjugate prior for the precision parameter of the multivariate
normal distribution, when the mean parameter is known.[8]
• A generalization is the multivariate gamma distribution.
• A different type of generalization is the normal-Wishart distribution, essentially the product of a multivariate
normal distribution with a Wishart distribution.

References
[1] Wishart, J. (1928). "The generalised product moment distribution in samples from a normal multivariate population". Biometrika 20A (1–2): 32–52. doi:10.1093/biomet/20A.1-2.32. JFM 54.0565.02.
[2] Uhlig, Harald (1994). "On singular Wishart and singular multivariate beta distributions". The Annals of Statistics: 395–405. (http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1176325375)
[3] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. p. 693.
[4] Smith, W. B.; Hocking, R. R. (1972). "Algorithm AS 53: Wishart Variate Generator". Journal of the Royal Statistical Society, Series C (Applied Statistics) 21 (3): 341–345. JSTOR 2346290.
[5] Peddada, Shyamal Das; Richards, Donald St. P. (1991). "Proof of a Conjecture of M. L. Eaton on the Characteristic Function of the Wishart Distribution". Annals of Probability 19 (2): 868–874. doi:10.1214/aop/1176990455.
[6] Gindikin, S. G. (1975). "Invariant generalized functions in homogeneous domains". Funct. Anal. Appl. 9 (1): 50–52. doi:10.1007/BF01078179.
[7] Dwyer, Paul S. (1967). "Some Applications of Matrix Derivatives in Multivariate Analysis". JASA 62: 607–625. JSTOR (http://www.jstor.org/pss/2283988).
[8] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
farn, Dicklyon, Douglas-Lanman, Dropsciencenotbombs, Edratzer, Erhanbas, Eric Kvaalen, Finfobia, GeypycGn, Giftlite, Glopk, GongYi, Hakeem.gadi, Hbeigi, Headbomb, Hike395, Hild,
Ismailari, Iwaterpolo, Jakarr, Jamshidian, Jheald, Jjmerelo, Jmc200, Joeyo, John Vandenberg, Jrouquie, Jszymon, Jwmarck, KYPark, Kallerdis, Karada, Kiefer.Wolfowitz, Klutzy, Ladypine,
Lamro, Lavaka, LeonardoWeiss, Libro0, M.A.Dabbah, Maechler, MarkSweep, Market Efficiency, Mcld, Melcombe, Michael Hardy, MisterSheik, Mosaliganti1.1, Nageh, Nbarth, Nils Grimsmo,
Nocheenlatierra, Numbo3, O18, Onco p53, Orderud, Osquar F, Owenman, Phil Boswell, Pine900, Piotrus, Pratx, Qiemem, Qwerty9967, Rama, Requestion, Richard Bartholomew, Rjwilmsi,
RobHar, Robbyjo, Rodrigob, Rusmike, Rxnt, Régis B., Salih, Schmock, Sitush, Skittleys, Slon02, Statna, Stpasha, Sunjuren, Talgalili, Tambal, Tekhnofiend, Tiedyeina, User A1, Vadmium, Wile
E. Heresiarch, Yasuo2, Yogtad, Zzpmarco, Ɯ, 155 anonymous edits

Exponential distribution  Source: http://en.wikipedia.org/w/index.php?oldid=504270643  Contributors: 2610:10:20:216:225:FF:FEF4:CAAF, A.M.R., A3r0, ActivExpression, Aiden Fisher,
Amonet, Asitgoes, Avabait, Avraham, AxelBoldt, Bdmy, Beaumont, Benwing, Boriaj, Bryan Derksen, Btyner, Butchbrody, CD.Rutgers, CYD, Calmer Waters, CapitalR, Cazort, Cburnett,
Closedmouth, Coffee2theorems, Cyp, Dcljr, Dcoetzee, Decrypt3, Den fjättrade ankan, Dudubur, Duoduoduo, Edward, Enchanter, Erzbischof, Fvw, Gauss, Giftlite, GorillaWarfare,
Grinofadrunkwoman, Headbomb, Henrygb, Hsne, Hyoseok, IanOsgood, Igny, Ilmari Karonen, Isheden, Isis, Iwaterpolo, Jason Goldstick, Jester7777, Johndburger, Kan8eDie, Kappa,
Karl-Henner, Kastchei, Kyng, LOL, MStraw, MarkSweep, Markjoseph125, Mattroberts, Mcld, Mdf, MekaD, Melcombe, Memming, Michael Hardy, Mindmatrix, MisterSheik, Monsterman222,
Mpaa, Mwanner, Nothlit, Oysindi, PAR, Paul August, Qwfp, R'n'B, R.J.Oosterbaan, Remohammadi, Rich Farmbrough, Rp, Scortchi, Sergey Suslov, Shaile, Sheppa28, Shingkei, Skbkekas,
Skittleys, Smack, Spartanfox86, Sss41, Stpasha, Taral, Taw, The Thing That Should Not Be, Thegeneralguy, TimBentley, Tomi, UKoch, Ularevalo98, User A1, Vsmith, WDavis1911, Wilke,
Wjastle, Woohookitty, Wyatts, Yoyod, Z.E.R.O., Zeno of Elea, Zeycus, Zvika, Zzxterry, 210 anonymous edits

F-distribution  Source: http://en.wikipedia.org/w/index.php?oldid=484010808  Contributors: Adouzzy, Albmont, Amonet, Art2SpiderXL, Arthena, Bluemaster, Brenda Hmong, Jr, Bryan
Derksen, Btyner, Califasuseso, Cburnett, DanSoper, DarrylNester, Dysprosia, Elmer Clark, Emilpohl, Ethaniel, Fnielsen, Ged.R, Giftlite, Gperjim, Hectorlamadrid, Henrygb, Jan eissfeldt, Jitse
Niesen, JokeySmurf, Kastchei, Livingthingdan, MarkSweep, Markjoseph125, Materialscientist, Mdebets, Melcombe, Michael Hardy, MrOllie, Nehalem, O18, Oscar, PBH, Quietbritishjim, Qwfp,
Robinh, Salix alba, Seglea, Sheppa28, TedE, The Squicks, Timholy, Tom.Reding, Tomi, TomyDuby, UKoch, Unyoyega, Zorgkang, 50 anonymous edits

F-test  Source: http://en.wikipedia.org/w/index.php?oldid=503070399  Contributors: Adam Lyle Taylor, Alexandrov, Andre Engels, Berland, Btyner, Cached, Cherkash, CryptoDerk, Danger,
Davinchicode, Dfarrar, Dloeckx, Dlrohrer2003, Drilnoth, Edstat, Feinstein, Geinitz, Giftlite, Guoguo12, HazeNZ, J. J. F. Nau, Jayen466, Jollyroger131, JoseMires, Kgres, Kiefer.Wolfowitz,
Kolyma, Kudret abi, Landroni, Mathstat, Melcombe, Michael Hardy, Mike.lifeguard, Miserlou, Namkyu, Nostato, Oleg Alexandrov, PartierSP, Piotrus, Potamites, Qwfp, Rjwilmsi, Rkandilarov,
Salix alba, SalvNaut, Seaphoto, Skbkekas, Smith609, Spaceman1979, Szczepanh, Tekhnofiend, Theking2, Thesilverbail, TomyDuby, Tristanb, Uwhoff, Valley2390, Vapniks, Velocidex,
Vt007ken, Vulturelainen, Wolf87, Xtrememachineuk, Yeehoong, 101 anonymous edits

Fisher information  Source: http://en.wikipedia.org/w/index.php?oldid=506160066  Contributors: Aaronchall, Acidador, Adfernandes, Agl, Amir8797, Arthur Rubin, Barak, Camrn86,
Capricorn42, Cazort, Cburnett, Chris Howard, Cocomo-jp, Coffee2theorems, Cyan, DRHagen, Den fjättrade ankan, Eric Kvaalen, Eug, Fangz, Flammifer, Florian Huber, Freerow@gmail.com,
Giftlite, Gveda, Holon, Icep, Icosmology, John of Reading, Josh Parris, Josuechan, Jpillow, Kiefer.Wolfowitz, Lgallindo, Linas, Lupin, Matt.voroney, Mdf, Mebden, Melcombe, Michael Hardy,
Nathanielvirgo, Nbarth, OliverObst, Paolo.dL, PhysPhD, Physicistjedi, Polymath1976, Quantling, Qwfp, Rjwilmsi, Robinh, Sigmundg, Skittleys, Sohale, Soosed, Stpasha, Taxman, TedE, The
Anome, Thrasibule, Tobias Bergemann, Tomixdf, Violetriga, Vsmith, Wangandibeijing, WaysToEscape, Wikomidia, Winterfors, Wolf87, X42bn6, Yahya Abdal-Aziz, Zouxiaohui, Zvika, ‫ירון‬,
93 anonymous edits

Fisher's exact test  Source: http://en.wikipedia.org/w/index.php?oldid=502320370  Contributors: 3mta3, AgarwalSumeet, Antelan, Archimerged, Australisergosum, Avenue, Baccyak4H,
Beetstra, Bobcoyote, Btyner, Bueller 007, CWenger, Cannin, Cbogart2, Charles Matthews, Chris the speller, Chris53516, Cyclist, David Eppstein, David.Gross, Den fjättrade ankan, Dfarrar,
Douglas R. White, Eric Kvaalen, Giftlite, Helohe, Hughjonesd, Ispy1981, Jia.meng, John Vandenberg, Kastchei, Kbh3rd, Kenta2, Kierano, Lfriedl, Lixiaoxu, Malhonen, MarkSweep, Melcombe,
Michael Hardy, Mikeblas, Moverly, Nbarth, Pgan002, Ph.eyes, Phatsphere, Rich Farmbrough, Rjwilmsi, Robinh, RupertMillard, Scentoni, Schwnj, Seb951, Seglea, ShotgunApproach, Shyamal,
Sjoosse, Skbkekas, Slartibarti, Talgalili, TedE, Thincat, Tim bates, TimBentley, Tkirkman, Tomi, Wavelength, Welhaven, 60 anonymous edits

Gamma distribution  Source: http://en.wikipedia.org/w/index.php?oldid=504005178  Contributors: A5, Aastrup, Abtweed98, Adam Clark, Adfernandes, Albmont, Amonet, Aple123,
Apocralyptic, Arg, Asteadman, Autopilot, Baccyak4H, Barak, Bdmy, Benwing, Berland, Bethb88, Bo Jacoby, Bobmath, Bobo192, Brenton, Bryan Derksen, Btyner, CanadianLinuxUser,
CapitalR, Cburnett, Cerberus0, ClaudeLo, Cmghim925, Complex01, Darin, David Haslam, Dicklyon, Dlituiev, Dobromila, Donmegapoppadoc, DrMicro, Dshutin, Entropeneur, Entropeter,
Erik144, Eug, Fangz, Fnielsen, Frau K, Frobnitzem, Gaius Cornelius, Gandalf61, Gauss, Giftlite, Gjnaasaa, Henrygb, Hgkamath, Iwaterpolo, Jason Goldstick, Jirka6, Jlc46, JonathanWilliford,
Jshadias, Kastchei, Langmore, Linas, Lovibond, LoyalSoldier, LukeSurl, Luqmanskye, MarkSweep, Mathstat, Mcld, Mebden, Melcombe, Mich8611, Michael Hardy, MisterSheik, Mpaa,
MrOllie, MuesLee, Mundhenk, Narc813, Nickfeng88, O18, PAR, PBH, Patrke, Paul Pogonyshev, Paulginz, Phil Boswell, Pichote, Popnose, Qiuxing, Quietbritishjim, Qwfp, Qzxpqbp, RSchlicht,
Robbyjo, Robinh, Rockykumar1982, Samsara, Sandrobt, Schmock, Smmurphy, Stephreg, Stevvers, Sun Creator, Supergrane, Talgalili, Tayste, TestUser001, Thomas stieltjes, Thric3,
Tom.Reding, Tomi, Tommyjs, True rover, Umpi77, User A1, Vminin, Wavelength, Wiki me, Wiki5d, Wikid77, Wile E. Heresiarch, Wjastle, Xuehuit, Zvika, 313 anonymous edits

Gamma function  Source: http://en.wikipedia.org/w/index.php?oldid=507824244  Contributors: 1exec1, 209.218.246.xxx, 65.197.2.xxx, A. Pichler, Adriaan Joubert, Adselsum, Alamino,
Alansohn, Alejo2083, Alex Dainiak, Aliotra, Amir bike, Ams80, Anonymous Dissident, Apophos, Arabic Pilot, Arthur Rubin, Ashley Y, Asymptoticus, Atlien, AugPi, AxelBoldt, B75a, BRG,
Baccyak4H, Beeson, Ben Tillman, Bmusician, Bob K, Brad7777, Bromskloss, Bryan Derksen, Bubba73, Bubbha, CBM, CRGreathouse, Casey Abell, Charles Matthews, Cheese Sandwich,
Chinju, Chortos-2, Christian75, Closedmouth, Coasterlover1994, Conversion script, Cybercobra, DaveFoster110@hotmail.com, David Shay, Davius, Dcoetzee, Dicklyon, Didi12321,
Discospinster, Dmr2, Dojarca, Domitori, Dpmathmajor112, Drjt87, Dubosen, Dysprosia, Dzordzm, EdJohnston, Elharo, Ellywa, Evil saltine, Excirial, Eyrryds, Favonian, Feinstein, Fintor, Fred
Stober, Fredrik, Frencheigh, Fresheneesz, Gamesguru2, Gauss, Geeklizzard, Gene Ward Smith, Gesslein, Giftlite, Glenn L, GraemeL, Graham87, GregorB, Gulliveig, Hairy Dude, Hannes Eder,
Hao2lian, HappyCamper, HappyInGeneral, Herbee, Hypercube, Inquisitus, J6w5, JJL, JabberWok, Jackzhp, James mcl, JamesBWatson, Jdgilbey, JensG, Jfmantis, Joelphillips, John aveas,
Joke137, JonDePlume, Josevellezcaldas, Jshadias, Jsondow, Julian Brown, Junling, Justin W Smith, Kausikghatak, Kc135, Khinchin's constant, Kiensvay, Kit Cloudkicker, Kmarawer, Lambiam,
Linas, Looxix, Maksim-e, Marc van Leeuwen, Markhurd, Materialscientist, MathHisSci, Mathmo3141592653589, Maurice Carbonaro, Melchoir, Michael Hardy, Miha Ulanov, Mimpian228,
MuDavid, Muller spiegel, Murtasa, Nbarth, Ndenison, Nicolas Bray, Nohat, Nono64, Norm mit, Nozzer42, ObsessiveMathsFreak, Octaazacubane, Octahedron80, Oleg Alexandrov, Oliphaunt,
OneWeirdDude, Outriggr, PAR, PMajer, Pabristow, Pagw, Paul Pogonyshev, Pcb21, Peak, Phreed, Pmanderson, Policron, Poor Yorick, Powermath, Pra1998, Prince Max (scientist), Pt,
Quantling, Qwfp, R. J. Mathar, R.e.b., RDBury, Rbj, RedWolf, Reddi, Rgdboer, RobHar, Robinh, RogierBrussee, Rohan Ghatak, Romanm, Rp, Sabbut, Salgueiro, Sam Derbyshire, Sandrobt,
Scythe33, Senalba, ServiceAT, Setreset, Shadowjams, Singularity, Slawekb, Sligocki, Stan Sykora, Stevenj, Sverdrup, TakuyaMurata, Tal physdancer, Tarquin, Tassedethe, Taxman, The new
math, Tide rolls, Tobias Bergemann, ToddDeLuca, Tom Buktu, Tomi, TomyDuby, Tuhinsubhrakonar, Uranographer, Van helsing, Vanished User 0001, Vinícius Machado Vogt, Waabu, Warut,
Wavelength, WiiStation360, Wile E. Heresiarch, Wtuvell, YelloWord, ZakTek, Zero0000, Zero2ninE, Zstk, Ztar, 285 anonymous edits

Geometric distribution  Source: http://en.wikipedia.org/w/index.php?oldid=505976779  Contributors: AdamSmithee, AlanUS, Alexf, Amonet, Apocralyptic, Ashkax, Bjcairns, Bo Jacoby,
Bryan Derksen, Btyner, Calbaer, Capricorn42, Cburnett, Classicalecon, Count ludwig, Damian Yerrick, Deineka, Digfarenough, El C, Eraserhead1, Felipehsantos, Frietjes, Gauss, Giftlite,
Gogobera, Gsimard, Gökhan, Hhassanein, Hilgerdenaar, Imranaf, Iwaterpolo, Juergik, K.F., LOL, MarkSweep, MathKnight, Mav, Mcld, Melcombe, Michael Hardy, MichaelRutter, Mike Rosoft,
Mikez, Mr.gondolier, NeonMerlin, Nov ialiste, PhotoBox, Qwfp, Ricklethickets, Robma, Rumping, Ryguasu, Serdagger, Skbkekas, Speed8224, Squizzz, Steve8675309, Sun Creator,
SyedAshrafulla, TakuyaMurata, Terrek, Tomi, VoltzJer, Vrenator, Wafulz, Wikid77, Wjastle, Wrogiest, Wtruttschel, Xanthoxyl, Youandme, 119 anonymous edits

Hypergeometric distribution  Source: http://en.wikipedia.org/w/index.php?oldid=502616533  Contributors: Alexius08, Antoine 245, Arnold90, Baccyak4H, Benwing, Bilgrau, Bo Jacoby,
Booyabazooka, Bryan Derksen, Btyner, Burn, Cburnett, ChevyC, Commander Keane, DarrylNester, David Eppstein, David Shay, DavidLDill, Drcrnc, Eidolon232, El C, Eug, FedeLebron,
Felipehsantos, Gauss, Giftlite, Gnathan87, Goshng, Gunungblau, Herr blaschke, I3iaaach, I9606, Intervallic, It Is Me Here, Iwaterpolo, Jack Joff, Janto, Jht4060, Jia.meng, Johnlv12, Josh Cherry,
Kamrik, Kingboyk, Kiwi4boy, LOL, Linas, MC-CPO, MSGJ, MaxEnt, Maximilianh, Melcombe, Michael Hardy, Mindmatrix, Mtmoore321, Nbarth, Nerdmaster, Ott2, PAR, PBH,
Peteraandrews, Pleasantville, Pol098, Porejide, Prőhle Tamás, Qwfp, Randomactsofkindness2, Reb42, Rgclegg, Sboehringer, Schutz, Screech1941, Seattle Jörg, Skaphan, SkatingNerd,
TakuyaMurata, Talgalili, Tomi, User A1, Veryhuman, Wtmitchell, Wtruttschel, Yvswan, Zigger, ‫ﺯﺭﺷﮏ‬, 176 anonymous edits

Hölder's inequality  Source: http://en.wikipedia.org/w/index.php?oldid=506936260  Contributors: 3mta3, A. Pichler, Alan Liefting, Arvid42, AxelBoldt, Bdmy, Bh3u4m, Bryan Derksen,
Cazort, Cbunix23, Daniele.tampieri, DavidCBryant, Detritus, Eslip17, Fwappler, GBlanchard, Gene Nygaard, Giftlite, GoingBatty, Igny, Ilmari Karonen, Lantonov, Makotoy, MarSch,
MarkSweep, Mct mht, Melcombe, Merewyn, Minesweeper, Myasuda, Nousernamesleft, Oleg Alexandrov, Pred, Pulkitgrover, Quietbritishjim, R.e.b., Rich Farmbrough, Schmock, Small potato,
Stevenj, Sławomir Biały, Weierstrass, Whendrik, Wik, Wittlicher, Zvika, 52 anonymous edits

Inverse Gaussian distribution  Source: http://en.wikipedia.org/w/index.php?oldid=499832929  Contributors: Aastrup, Abtweed98, Baccyak4H, Batman50, Braincricket, Btyner, David Haslam,
Deavik, Dima373, DrMicro, Felipehsantos, Giftlite, Iwaterpolo, Jfr26, Kristjan.Jonasson, LachlanA, LandruBek, Melcombe, Memming, Michael Hardy, MisterSheik, NickMulgan, Oleg
Alexandrov, Qwfp, Rhfeng, Sheppa28, Sterrys, The real moloch57, Tjagger, Tomi, User A1, Vana Seshadri, Wikid77, Wjastle, Zfeinst, 50 anonymous edits

Inverse-gamma distribution  Source: http://en.wikipedia.org/w/index.php?oldid=497414146  Contributors: Benwing, Biostatprof, Btyner, Cburnett, Cquike, Dstivers, Fnielsen, Giftlite,
Greenw2, Iwaterpolo, Josevellezcaldas, Kastchei, M27315, MarkSweep, Melcombe, MisterSheik, PAR, Qwfp, Rfinlay@gmail.com, Rlendog, Rphlypo, Shadowjams, Sheppa28, Slavatrudu,
Tomi, User A1, Wjastle, 40 anonymous edits

Iteratively reweighted least squares  Source: http://en.wikipedia.org/w/index.php?oldid=484990627  Contributors: 3mta3, BenFrantzDale, Benwing, David Eppstein, Giggy, Grumpfel,
Kiefer.Wolfowitz, Lambiam, Lesswire, LutzL, Melcombe, Michael Hardy, Oleg Alexandrov, RainerBlome, Salix alba, Serg3d2, Stpasha, Wesleyyin, 9 anonymous edits

Kendall tau rank correlation coefficient  Source: http://en.wikipedia.org/w/index.php?oldid=482594886  Contributors: 3mta3, Adilapapaya, Arcadian, Arthur Rubin, As530, Barticus88,
Cronholm144, David Eppstein, Digisus, Dryke, Edurant, Fmccown, G716, Headbomb, Icedwater, Ichbin-dcw, Jacwa01, JamesHAndrews, Jtneill, Ldc, Mcld, Melcombe, Michael Hardy,
Monty669, Nick Number, O18, Olaf, Penpen, Ph.eyes, Piotrus, Qwfp, Rich Farmbrough, Sasikedi, Schmock, Schwnj, Squeakywaffle, Thecheesykid, WinerFresh, Yikes2000, 44 anonymous edits

Kolmogorov–Smirnov test  Source: http://en.wikipedia.org/w/index.php?oldid=507111462  Contributors: A. Pichler, Adam Lein, Adoniscik, Amonet, Avraham, AxelBoldt, Axl, Bender235,
Bryan Derksen, Casey Abell, Chris53516, CiaPan, Conversion script, DeadEyeArrow, Den fjättrade ankan, Doremo, Dresdnhope, EagerToddler39, EddEdmondson, Encyclops, Everettr2, Free
Software Knight, Geregen2, Giftlite, Goudzovski, GregorB, HandsomeFella, Headbomb, Huji, Igny, Inter, Irishguy, Jasondet, Jmjanzen, Jovillal, K.F., Kiefer.Wolfowitz, Klonimus,
Larry_Sanger, MBlakley, MH, Magioladitis, Mairi, Makemineamoose, MarkSweep, Melcombe, Memming, Michael Hardy, Miguel, Moverly, O18, Olaf, Patrick, Pgan002, Pgr94, Phr, Predictor,
Qwerty Binary, Qwfp, Ragout, Rjwilmsi, Robinh, Ruud Koot, Schmock, Schutz, Schwnj, Selket, Smb1001, Smith609, Snoyes, Spangineer, Stangaa, Statisticsblog, Stern, Stpasha, Strafpeloton2,
Strait, Tabletop, TedDunning, Thorwald, Tomi, TomyDuby, Tyger7th, Wikid77, Wtng, Yaris678, Zaqrfv, 78 anonymous edits

Kronecker's lemma  Source: http://en.wikipedia.org/w/index.php?oldid=452306725  Contributors: Aastrup, Cenarium, Charles Matthews, Cmdrjameson, David Eppstein, FF2010, Giftlite,
Kurdo777, LennK, Linas, Ukookami, 1 anonymous edits

Kullback–Leibler divergence  Source: http://en.wikipedia.org/w/index.php?oldid=505224311  Contributors: 3mta3, A. Pichler, Adfernandes, Amit Moscovich, Amkilpatrick, Avraham,
Baisemain, Benwing, BlaiseFEgan, Charles Matthews, Cronholm144, Cstahlhut, Cyan, Deepmath, Den fjättrade ankan, Dfass, Dmb000006, Dnavarro, Edward, Epistemenical, Epomqo,
Ereiniona, FilipeS, Forwardmeasure, Francis liberty, Giftlite, Gzabers, Ignacioerrico, Inkling, Iturrate, JForget, Jamelan, Jheald, Jmorgan, Jon Awbrey, Kastchei, Kevin Baas, Kiefer.Wolfowitz,
Kyellan, Linas, Loniousmonk, MDReid, MarkSweep, MartinSpacek, Mcld, Mct mht, Mebden, Melcombe, Memming, Michael Hardy, Mike Lin, Miranda, MisterSheik, Mmernex, Mottzo,
Mpost89, Mundhenk, Nathanielvirgo, Nbarth, Neonleonb, Nothing1212, Object01, Oleg Alexandrov, PAR, Punkstar89, Quantumelfmage, Qwfp, Rinconsoleao, Rjwilmsi, Rkrish67, Romanpoet,
Schizoid, SciCompTeacher, Shreevatsa, Sir Vicious, Stangaa, Stern, Stpasha, Sun Creator, Thermochap, Wikomidia, Wile E. Heresiarch, Winterfors, Wittnate, Wullj, X7q, Yoderj, 犬 牙, 130
anonymous edits

Laplace distribution  Source: http://en.wikipedia.org/w/index.php?oldid=502197415  Contributors: Alektzin, Btyner, CRGreathouse, Cburnett, Charles Matthews, Comfortably Paranoid, Dcljr,
Dcoetzee, DrMicro, Fasten, Fnielsen, Foobarhoge, Giftlite, Henrygb, Huoer, Igny, Iwaterpolo, Jdobelman, Johnlv12, Jraudhi, Jurgen, Kabla002, Kastchei, Ludovic89, M.A.Dabbah, MarkSweep,
Mashiah Davidson, Mathstat, Meemoxp, Melcombe, Memming, Michael Hardy, MisterSheik, Mohammad Al-Aggan, PAR, Qwfp, Rjwilmsi, Rlendog, Sheppa28, Sterrys, Straightontillmorning,
Sun Creator, User A1, Vovchyck, Wastle, Wjastle, Wolf87, Zundark, Zvika, 45 anonymous edits

Laplace's equation  Source: http://en.wikipedia.org/w/index.php?oldid=503925895  Contributors: !jim, 124Nick, 213.253.39.xxx, Acipsen, Andrei Polyanin, Andrei r, Andres,
Anythingyouwant, Ap, Archeryguru2000, Astozzia, AugPi, Awickert, AxelBoldt, Bender235, Bh3u4m, BigJohnHenry, Blueboy814, Charles Baynham, Charles Matthews, Chubby Chicken,
Cj67, Coelacan, Crowsnest, DavidCBryant, Dmp450, Donludwig, Dratman, Drywallandoswald, Eerb, El C, ElTyrant, Giftlite, Goheeca, Gonfer, GuidoGer, Hadal, Haseldon, Hellisp, Hypernurb,
Infinityprob, Jasperdoomen, Jgwade, Juansempere, Jzsfvss, KamasamaK, Kibibu, Lfscheidegger, Linas, Liuyao, LokiClock, Lombar2, MarcelB612, Mecanismo, Mel Etitis, Mets501, Mhym,
Michael Hardy, Mleconte, Moink, Ninly, Nuwewsco, Oleg Alexandrov, Paolodm, Patrick, Phelimb, RayAYang, RexNL, Rhun, Rich Farmbrough, Roadrunner, Salih, Sandycx, Shinji311, Silly
rabbit, Slightsmile, Stsmith, Sullivan.t.j, TakuyaMurata, Tarquin, Tbsmith, Thenub314, Tim Starling, User A1, Wikibacc, Wolfkeeper, Wthered, Xbr 0511, くま兄やん, 111 anonymous
edits

Laplace's method  Source: http://en.wikipedia.org/w/index.php?oldid=503012375  Contributors: 777sms, Alekh, Anthony Appleyard, Arnold90, BDQBD, BenFrantzDale, Bluemaster,
Bluemasterbr, Charles Matthews, Conscious, Coolwangyx, Deville, Ephraim33, Giftlite, JabberWok, Jitse Niesen, Joriki, Karl-H, Keithcc, Krishnavedala, Leperous, Linas, LittleOldMe, McKay,
Michael Hardy, Monsterman222, Msalins, Mt06, MuDavid, Oleg Alexandrov, Olegalexandrov, Phil Boswell, R.e.b., Rossweisse, Trogsworth, Wilke, Zero sharp, 65 anonymous edits

Likelihood-ratio test  Source: http://en.wikipedia.org/w/index.php?oldid=502800268  Contributors: 1ForTheMoney, Adismalscientist, Adoniscik, AgentPeppermint, AnRtist, Arcadian,
ArcadianOnUnsecuredLoc, Arknascar44, Aryan1989, Babbage, Badgettrg, Btyner, Cancan101, Cburnett, Chuk.plante, Cmcnicoll, Conversion script, Corti, Dchudz, Den fjättrade ankan, Draeco,
El C, Elysdir, Fanyavizuri, Fayue1015, Fnielsen, Fortdj33, Frietjes, Giftlite, Graham87, Guy Macon, Henrygb, Jackzhp, Jeremiahrounds, Jfitzg, Jheald, Jmac2222, Kastchei, Kku, Kniwor,
LilHelpa, Mack2, Madbix, MarkSweep, Meduz, Melcombe, Michael Hardy, Mild Bill Hiccup, Moverly, NaftaliHarris, Nbarth, NeoUrfahraner, Nescio, Nilayvaish, Notheruser, Oleg Alexandrov,
Pete.Hurd, Pgan002, Quantling, Qwfp, RL0919, Rajah9, Ridgeback22, RobDe68, Robertvan1, Robinh, Sboludo, Seans Potato Business, Seglea, Smith609, Tayste, The Anome, Thecurran, Tim
bates, Torfason, Twri, Unknown, Vthesniper, Wiki091005!!, Yimmieg, Zaqrfv, 91 anonymous edits

List of integrals of exponential functions  Source: http://en.wikipedia.org/w/index.php?oldid=505272291  Contributors: Adrian.benko, Angalla, Anuclanus, Astrotrebor, Bilboq, Blah314,
Borgx, Cleaver2008, Csigabi, Deineka, Diffequa, Dojarca, Donarreiskoffer, Dungodung, Dusik, Dwees, Edsanville, EmilJ, Evil saltine, Fnielsen, Germandemat, Going3killu, Hasanshabbir786,
Helder.wiki, HenningThielemann, Icairns, Itai, Ivan Štambuk, JRSpriggs, Jasondet, JeffBobFrank, Jleedev, Kenyon, LeaW, Lzur, Mar.marco, Melchoir, Melink14, Mikez302, Mxn, NickFr, Oleg
Alexandrov, Physman, Pw brady, Schneelocke, Scythe33, Seyfried, SkiDragon, Smack, TakuyaMurata, Txus.aparicio, Unyoyega, Versus22, Viames, Waabu, Will5430, ZeroOne, 63 anonymous
edits

List of integrals of Gaussian functions  Source: http://en.wikipedia.org/w/index.php?oldid=501094160  Contributors: Michael Hardy, Qwfp, Stpasha, 14 anonymous edits

List of integrals of hyperbolic functions  Source: http://en.wikipedia.org/w/index.php?oldid=488579551  Contributors: Adrian.benko, Bilboq, Deineka, Dmcq, Enjuneer, Eric Burnett, Eynar,
Germandemat, Guiltyspark, Icairns, Itai, Ivan Štambuk, KnightRider, Lambiam, Lzur, NickFr, Number Googol, Oleg Alexandrov, Rmashhadi, Schneelocke, Smack, TakuyaMurata, Viames,
ZeroOne, Zvika, Zzedar, 11 anonymous edits

List of integrals of logarithmic functions  Source: http://en.wikipedia.org/w/index.php?oldid=501881412  Contributors: Albmont, Bilboq, Borgx, Charles Matthews, Daryl7569, Dojarca, Evil
saltine, Fnielsen, Germandemat, Icairns, Icek, Isnow, Itai, Ivan Štambuk, Jeffreyarcand, Lzur, Maksim-e, Moshi1618, NickFr, Oleg Alexandrov, Physman, Rmashhadi, Schneelocke,
TakuyaMurata, Trevva, Txus.aparicio, Viames, ZeroOne, Ziaris, Петър Петров, គីមស៊្រុន, 28 anonymous edits

Lists of integrals  Source: http://en.wikipedia.org/w/index.php?oldid=506714348  Contributors: 00Ragora00, Akikidis, Albert D. Rich, Amazins490, AngrySaki, Ant314159265,
ArnoldReinhold, Asmeurer, BANZ111, BananaFiend, BehzadAhmadi, Bilboq, Bruno3469, Brutha, CWenger, Ciphers, Cícero, DJPhoenix719, DavidWBrooks, Dcirovic, Deineka, DerHexer,
Dmcq, Doctormatt, Dogcow, Doraemonpaul, Dpb2104, Drahmedov, Dysprosia, Euty, FerrousTigrus, Fieldday-sunday, Fredrik, Giftlite, Giulio.orru, Gloriphobia, Happy-melon, IDGC, Icairns,
Imperial Monarch, Itai, Itu, Ivan Štambuk, IznoRepeat, JNW, Jaisenberg, Jimp, Jj137, John Vandenberg, Jon R W, Jumpythehat, Jwillbur, KSmrq, Kantorghor, Kiatdd, Kilonum, Kusluj,
LachlanA, LeaveSleaves, Legendre17, Lesonyrra, Linas, LizardJr8, Lzur, Macrakis, MathFacts, Michael Hardy, MrOllie, Msablic, Muro de Aguas, NNemec, Nbarth, New Math,
NewEnglandYankee, NickFr, NinjaCross, Oleg Alexandrov, Perelaar, Phatsphere, Physman, Physmanir, Pimvantend, Pokipsy76, Pschemp, Qmtead, RobHar, Salih, Salix alba, Schneelocke,
Scythe33, ShakataGaNai, Sseyler, Stpasha, TStein, TakuyaMurata, Template namespace initialisation script, Tetzcatlipoca, The Transhumanist, Thenub314, Tkreuz, Unyoyega, VasilievVV,
Vedantm, Waabu, Wile E. Heresiarch, Willking1979, Woohookitty, Xanthoxyl, Yeungchunk, Ylai, Zmoney918, 282 anonymous edits

Local regression  Source: http://en.wikipedia.org/w/index.php?oldid=503830743  Contributors: 3mta3, Afa86, Benwing, Btyner, Caitifty, Coppertwig, Den fjättrade ankan, Dontdoit,
DutchCanadian, Glane23, Gpeilon, JHunterJ, JonMcLoone, JonPeltier, Kendrick7, Kierano, Lambiam, Melcombe, Michael Hardy, Qwertyus, Ryepup, Sintaku, Talgalili, The Anome, Tjhalva,
Urhixidur, 21 anonymous edits

Log-normal distribution  Source: http://en.wikipedia.org/w/index.php?oldid=506211496  Contributors: 2D, A. Pichler, Acct4, Albmont, Alue, Ashkax, Asitgoes, Autopilot, AxelBoldt,
Baccyak4H, BenB4, Berland, Bfinn, Biochem67, Bryan Derksen, Btyner, Cburnett, Christian Damgaard, Ciberelm, Ciemo, Cleared as filed, Cmglee, ColinGillespie, Constructive editor,
Danhash, David.hilton.p, DonkeyKong64, DrMicro, Encyclops, Erel Segal, Evil Monkey, Floklk, Fluctuator, Frederic Y Bois, Fredrik, Gausseliminering, Giftlite, Humanengr, Hxu, IanOsgood,
IhorLviv, Isheden, Iwaterpolo, Jackzhp, Jeff3000, Jetlee0618, Jimt075, Jitse Niesen, Khukri, Kiwi4boy, Lbwhu, Leav, Letsgoexploring, LilHelpa, Lojikl, Lunch, Mange01, Martarius, Martinp23,
Mcld, Melcombe, Michael Hardy, Mikael Häggström, Mishnadar, MisterSheik, Nehalem, Nite1010, NonDucor, Ocatecir, Occawen, Osbornd, Oxymoron83, PAR, PBH, Paul Pogonyshev, Philip
Trueman, Philtime, Phoxhat, Pichote, Pontus, Porejide, Qwfp, R.J.Oosterbaan, Raddick, Rgbcmy, Rhowell77, Ricardogpn, Rjwilmsi, Rlendog, Rmaus, RobertHannah89, Safdarmarwat,
Sairvinexx, Schutz, Seriousme, Sheppa28, Skunkboy74, SqueakBox, Sterrys, Stigin, Stpasha, Ta bu shi da yu, Techman224, The Siktath, Till Riffert, Tkinias, Tomi, Umpi, Unyoyega, Urhixidur,
User A1, Vincent Semeria, Wavelength, Weialawaga, Wikomidia, Wile E. Heresiarch, Wjastle, Zachlipton, ZeroOne, ^demon, ‫ירון‬, 201 anonymous edits

Logrank test  Source: http://en.wikipedia.org/w/index.php?oldid=506626143  Contributors: Bender235, Btyner, Cherkash, Dstivers, G716, Hermel, Johnlv12, Lilac Soul, Melcombe, Michael
Hardy, Ph.eyes, Qwfp, Reader0527, Rich Farmbrough, Rjwilmsi, Skbkekas, 18 anonymous edits

Lévy distribution  Source: http://en.wikipedia.org/w/index.php?oldid=500941726  Contributors: 84user, Badger Drink, Btyner, Caviare, DBrane, Digfarenough, Dysmorodrepanis, Eric Kvaalen,
Gaius Cornelius, Gbellocchi, Gene Nygaard, Giftlite, JamieBallingall, Jfr26, Kastchei, Kloveland, Krishnavedala, Lovibond, Melcombe, Michael Hardy, Nbarth, Night Gyr, PAR, Ptrf, PyonDude,
Qwfp, Rlendog, Saihtam, SebastianHelm, Sheppa28, Smarket, Tassedethe, Tsirel, Uniquejeff, WJVaughn3, Wainson, Xcentaur, Ynhockey, 25 anonymous edits

Mann–Whitney U  Source: http://en.wikipedia.org/w/index.php?oldid=507333032  Contributors: 3mta3, AbsolutDan, Adamsiepel, AndrewHZ, Baccyak4H, Bender235, Bequw, Blehfu,
Bobo192, Brian Everlasting, Briancady413, Buzhan, Chafe66, Charles Matthews, Chris53516, Ctacmo, Darrel francis, DeLarge, Den fjättrade ankan, Dfxoreilly, Fmccown, Gabe rosser,
Gandalf61, Giftlite, GregorB, Gstatistics, Happydaysarehere, Harrelfe, Headbomb, Jarekt, Jeremymiles, Jmorgan01007, Jowa fan, Kenkleinman, Kgwet, Kiefer.Wolfowitz, Kku, Klaus scheicher,
Klonimus, Kmk, KnightRider, Lefschetz, LenoxBlue, Lovewarcoffee, Mai-Thai, Marenty, MarkSweep, Markjoseph125, Martious, Mcld, Melcombe, Memming, Michael Hardy, Mikael
Häggström, Moverly, Mpf3205, MrOllie, Mwtoews, Navy blue84, Nbarth, Nvf, Omnipaedista, Pgan002, Ph.eyes, PigFlu Oink, Purple, Rjwilmsi, RobKushler, Robert Weemeyer, Searke, Seglea,
Selket, Sethant, Smith609, Strafpeloton2, Suruena, Talgalili, Tatpong, Tayste, TeaDrinker, Ted7815, Tim bates, Timothyarnold85, Tomi, Trevor Burnham, Urhixidur, Wayiran,
Where'stheindian?, Wiendietry, Xenonx, Zufar, 159 anonymous edits

Matrix calculus  Source: http://en.wikipedia.org/w/index.php?oldid=505301664  Contributors: Aalopes, Ahmadabdolkader, Albmont, Alelbre, Altenmann, Anonymous Dissident, ArloLora,
Arthur Rubin, Ashigabou, AugPi, Benwing, Blaisorblade, Bloodshedder, Brad7777, Brent Perreault, CBM, CamCairns, Charles Matthews, Cooli46, Cs32en, Ctacmo, Ctsourak, DJ Clayworth,
DRHagen, Danielbaa, Dattorro, Dimarudoy, Dlohcierekim, Download, Enisbayramoglu, Eroblar, Esoth, Excirial, F=q(E+v^B), Fred Bauder, Freddy2222, Gauge, Geometry guy, Giftlite,
Giro720, Guohonghao, Hhchen1105, Hu12, Immunize, IznoRepeat, Jan mei118, Jitse Niesen, JohnBlackburne, Kirbin, Lethe, Lgstarn, Maschen, Melcombe, Michael Hardy, Morning Sunshine,
MrOllie, NawlinWiki, Oli Filth, Orderud, Oussjarrouse, Ozob, Pan Chenguang, Pearle, PeterShook, RJFJR, Rich Farmbrough, SDC, Sanchan89, Steve98052, Stpasha, Surya Prakash.S.A.,
SyedAshrafulla, TStein, The Thing That Should Not Be, Vgmddg, Willking1979, Wtmitchell, Xiaodi.Hou, Yuzisee, 222 anonymous edits

Maximum likelihood  Source: http://en.wikipedia.org/w/index.php?oldid=504777258  Contributors: 3mta3, Af1523, Albmont, Alfalfahotshots, Algebraic, Algocu, Arthena, Atabəy, Avraham,
BD2412, BPets, Baccyak4H, BenFrantzDale, Binarybits, Bo Jacoby, Brandynwhite, Btyner, Cal-linux, Cancan101, Casp11, Cbrown1023, Cburnett, Cehc84, Chadhoward, ChangChienFu,
Cherkash, Chinasaur, Chowells, Chris the speller, Cjpuffin, Classicalecon, CurranH, Davidmosen, Davyzhu, Den fjättrade ankan, Dimtsit, Dlituiev, Drazick, Dreadstar, Dysmorodrepanis, Earlh,
EduardoValle, F0rbidik, Flavio Guitian, Freeside3, G716, Gareth Griffith-Jones, Giftlite, Gill110951, Gjshisha, Graham87, Guan, Hawk8103, Headbomb, Henrikholm, Henrygb, Hike395,
Hongooi, Hu12, Inky, JA(000)Davidson, James I Hall, Jason Quinn, JeffreyRMiles, JimJJewett, Jmc200, John254, Jrtayloriv, Jsd115, Juffi, Julian Brown, Karada, Khazar2, Kiefer.Wolfowitz,
Koavf, Lavaka, Lexor, Lilac Soul, Logan, Loodog, Lucifer87, MJamesCA, Mathdrum, Mathuranathan, Matt Gleeson, Maye, Melcombe, Michael Hardy, Mikhail Ryazanov, MrOllie, Nak9x,
Nbarth, Nick Number, Ninja247, Nivix, Oleg Alexandrov, PAR, Patrick, Phil Boswell, Quietbritishjim, Qwerpoiu, Qwfp, R'n'B, RVS, Rama, Ramiromagalhaes, Reetep, Rich Farmbrough,
Rjwilmsi, Rlsheehan, Robinh, Rogerbrent, Royalguard11, Rschulz, Samikrc, Samsara, Saric, Set theorist, Shadowjams, Simo Kaupinmäki, Slaunger, SolarMcPanel, Stpasha, Svick, TedE, The
Anome, The Thing That Should Not Be, TheMathAddict, Travelbird, Ultramarine, Urhixidur, Velocidex, Violetriga, Vitanyi, Warren.cheung, Wavelength, Xappppp, XpXiXpY, Z10x, Zbodnar,
Zfeinst, Zonuleofzinn, Zvika, 215 anonymous edits

McNemar's test  Source: http://en.wikipedia.org/w/index.php?oldid=503452574  Contributors: Archanamiya, Bkkbrad, Bluemoose, Btyner, Calimo, Cannin, Chris53516, Chzz, Coruscater,
Davidswelt, Den fjättrade ankan, Dougweller, Ellogo, Functious, Gaius Cornelius, Giftlite, Headbomb, JerroldPease-Atlanta, Jitse Niesen, Johannes Hüsing, Kallerdis, Kastchei, Kgwet,
MarkSweep, Mehdimoodi, Melcombe, Michael Hardy, Monterey Bay, Moverly, Practical321, Qwfp, Rjwilmsi, Rtlam, Staats, Subversified, Talgalili, Tayste, Tim bates, TimBentley,
Toot123toot, Wasell, Zundark, 18 anonymous edits

Multicollinearity  Source: http://en.wikipedia.org/w/index.php?oldid=507151980  Contributors: 4wajzkd02, Altenmann, Bkwillwm, Bobo192, CBM, Counterfact, Den fjättrade ankan,
Dholtschlag, DickStartz, Dscannon, Duoduoduo, Dvdpwiki, Eagerbo, EconProf86, Ed Poor, Edward, Epa101, Fungus, Gap9551, Giftlite, Guillaume2303, Inhumandecency, Iridescent, JForget,
Jichuan, Joe.mellor, Joylee1130, KHamsun, Kukini, Kvng, Maddraven1716, Melcombe, Michael Hardy, Mishrasknehu, MrOllie, Nilesh2293, Robbyjo, S, Saxman77, Seriousj, Shethzulfi,
Studycourts, Sławomir Biały, Utcursch, Varuag doos, Whisky brewer, Ybbor, ‫ﻣﺎﻧﻲ‬, 104 anonymous edits

Multivariate normal distribution  Source: http://en.wikipedia.org/w/index.php?oldid=507275312  Contributors: A3 nm, Alanb, Arvinder.virk, AussieLegend, AxelBoldt, BenFrantzDale,
Benwing, BernardH, BlueScreenD, Breno, Bryan Derksen, Btyner, Cburnett, Cfp, ChristophE, Chromaticity, Ciphergoth, Coffee2theorems, Colin Rowat, Dannybix, Delirium, Delldot, Derfugu,
Distantcube, Eamon Nerbonne, Giftlite, Hongooi, HyDeckar, Isch de, J heisenberg, Jackzhp, Jasondet, Jondude11, Jorgenumata, Josuechan, KHamsun, Kaal, Karam.Anthony.K, KipKnight,
KrodMandooon, KurtSchwitters, Lambiam, Lockeownzj00, Longbiaochen, MER-C, Marc.coram, MarkSweep, Mathstat, Mauryaan, MaxSem, Mcld, Mct mht, Mdf, Mebden, Meduz, Melcombe,
Michael Hardy, Miguel, Mjdslob, Moriel, Mrwojo, Myasuda, Nabla, Ninjagecko, O18, Ogo, Oli Filth, Omrit, Opabinia regalis, Orderud, Paul August, Peni, PhysPhD, Picapica, Pycoucou,
Quantling, Qwfp, R'n'B, Riancon, Rich Farmbrough, Richardcherron, RickK, Rjwilmsi, Robinh, Rumping, Sanders muc, SebastianHelm, Selket, Set theorist, SgtThroat, Sigma0 1, SimonFunk,
Sinuhet, Steve8675309, Stpasha, Strashny, Sun Creator, Tabletop, Talgalili, TedPavlic, Toddst1, Tommyjs, Ulner, Velocitay, Viraltux, Waldir, Wavelength, Wikomidia, Winterfors, Winterstein,
Wjastle, Yoderj, Zelda, Zero0000, Zvika, มือใหม่, 209 anonymous edits

n-sphere  Source: http://en.wikipedia.org/w/index.php?oldid=507623088  Contributors: 4, Aetheling, Army1987, ArnoldReinhold, AstroHurricane001, Bcnof, BenBaker, Berland, Bkell,
Brad7777, CYD, Cicco, David Eppstein, Deepmath, Dingenis, Dionyziz, Diti, Donarreiskoffer, DryaUnda, Epistemenical, Eric119, Erud, Fly by Night, Freelance Intellectual, Fropuff, Geometry
guy, Giftlite, GoingBatty, GraemeMcRae, Gut Monk, Headbomb, Henboppa, Herbee, Howwhowhatwhen, Icairns, Iteloo, JJ Harrison, JRSpriggs, JYolkowski, Jaapie, Jackzhp, Jakob.scholbach,
Jasonphysics, JohnBlackburne, Johnflux, JokeySmurf, Jonathanledlie, Joseph Lindenberg, Jugander, Just granpa, Jwy, Jwz, Jörg Knappen, KSmrq, LVC, Lethe, LokiClock, Loudogg, MarSch,
Marozols, Maurice Carbonaro, Mcnaknik, Mebden, Michael Angelkovich, Michael Hardy, Mikey likes mountains, Ndickson, NeilHynes, Njerseyguy, PAR, PV=nRT, Patrick, Pauloj96, Paweł
Ziemian, Pbroks13, Poulpy, Pstanton, Quantling, Qubiter, Qwertyus, R. J. Mathar, RJD ^)$, Randomblue, Reaper Eternal, Reyk, Rocchini, RodC, SE16, Salix alba, Searchme, Shanes, Silly
rabbit, Slawekb, Smartcat, Spinningspark, Spoon!, Subh83, Sławomir Biały, TakuyaMurata, Tamfang, Thehotelambush, Thewhyman, Thomas s. briggs, Tkinsman, Tkuvho, Tompw, Tomruen,
Tosha, Trigamma, Turul2, UU, VKokielov, Velocidex, Wangtailun, Worm That Turned, Xenure, Zundark, Zvika, 142 anonymous edits

Negative binomial distribution  Source: http://en.wikipedia.org/w/index.php?oldid=500134820  Contributors: Airplaneman, Alexius08, Arunsingh16, Ascánder, Asymmetric, AxelBoldt,
Benwing, Bo Jacoby, Bryan Derksen, Btyner, Burn, CALR, CRGreathouse, Cburnett, Charles Matthews, Chocrates, Colinpmillar, Cretog8, Damian Yerrick, Dcljr, Deathbyfungwahbus,
DutchCanadian, DwightKingsbury, Econstatgeek, Eggstone, Entropeter, Evra83, Facorread, Felipehsantos, Formivore, Gabbe, Gauss, Giftlite, Headbomb, Henrygb, Hilgerdenaar, Iowawindow,
Iwaterpolo, Jahredtobin, Jason Davies, Jfr26, Jmc200, Keltus, Kevinhsun, Kodiologist, Linas, Ludovic89, MC-CPO, Manoguru, MarkSweep, Mathstat, McKay, Melcombe, Michael Hardy,
Mindmatrix, Moldi, Nov ialiste, O18, Odysseuscalypso, Oxymoron83, Panicpgh, Phantomofthesea, Pmokeefe, Qwfp, Rar, Renatovitolo, Rje, Rumping, Salgueiro, Sapphic, Schmock, Shreevatsa,
Sleempaster21229, Sleepmaster21229, Statone, Steve8675309, Stpasha, Sumsum2010, TGS, Talgalili, Taraborn, Tedtoal, TomYHChan, Tomi, Trevor.maynard, User A1, Waltpohl, Wikid77,
Wile E. Heresiarch, Wjastle, Zvika, 151 anonymous edits

Noncentral chi-squared distribution  Source: http://en.wikipedia.org/w/index.php?oldid=491008998  Contributors: AaronMSwan, Alanb, Barrywbrown, Ctacmo, Fintor, Gaius Cornelius,
Giftlite, Hsne, Kastchei, Lixiaoxu, Melcombe, Memming, Michael Hardy, Nielses, Oleg Alexandrov, PAR, PV=nRT, Pspijker, Renatokeshet, Shae, SnakeBDD, Splash, Steve8675309, Tc1008,
Tim1357, Tokorode, Tomi, TomyDuby, Tristanreid, Viruseb, Willem, Zaqrfv, Zvika, ‫ﻣﺤﻤﺪ‬.‫ﺭﺿﺎ‬, 34 anonymous edits

Noncentral F-distribution  Source: http://en.wikipedia.org/w/index.php?oldid=451728415  Contributors: Alankjackson, DanSoper, Dima373, Eric Kvaalen, Fnielsen, Giftlite, Jerryobject,
Josve05a, Kastchei, Lixiaoxu, MarkSweep, Michael Hardy, Natalie Erin, Patrick57, PrairieDogDoug, Simoneau, Steve8675309, 9 anonymous edits

Noncentral t-distribution  Source: http://en.wikipedia.org/w/index.php?oldid=451730621  Contributors: AnRtist, Barrywbrown, Benwing, David Eppstein, Fortdj33, Hypnotoad33, Janto,
Kastchei, LilHelpa, Lixiaoxu, MatthewVanitas, Melcombe, Qwfp, Rjwilmsi, Skbkekas, Steve8675309, Zenohockey, 11 anonymous edits

Norm (mathematics)  Source: http://en.wikipedia.org/w/index.php?oldid=507310470  Contributors: ABCD, ANONYMOUS COWARD0xC0DE, Algebraist, Allispaul, Almit39, Altenmann,
Army1987, Arthur Rubin, Baldphil, Beau, BenFrantzDale, Bestian, Bkell, Bobo192, Brews ohare, CBM, ChongDae, Chutzpan, CiaPan, Connelly, Crasshopper, Cybercobra, Cícero, D4g0thur,
DMacks, Dalf, Dan Granahan, Dan Polansky, DannyAsher, Datahaki, David Kernow, Dicklyon, Dlazesz, Dmitri666, Don Quixote de la Mancha, Dratman, Effigies, Eraserhead1, Falcongl,
Fmccown, Free0willy, Fropuff, Giftlite, Hairy Dude, HannsEwald, Hans Adler, Headbomb, Helptry, Heysan, InverseHypercube, Irritate, JackSchmidt, Jenny Harrison, JohnBlackburne,
JosephSilverman, Jowa fan, JumpDiscont, Jxramos, KHamsun, KSmrq, Kan8eDie, Kiefer.Wolfowitz, Killerandy, Lambiam, Lethe, Linas, Lovasoa, Lucaswilkins, Lunch, MFH, MathMartin, Mct
mht, Melchoir, Mhss, MiNombreDeGuerra, Michael Hardy, Mike Segal, MisterSheik, Mpd1989, Nbarth, Ncik, NearSetAccount, Oleg Alexandrov, Oli Filth, PMajer, Paolo.dL, Patrick, Paul
August, PhotoBox, Quondum, Reminiscenza, Rheyik, Robin S, Rockfang, Rudjek, Saavek47, Sbharris, SebastianHelm, Sebjlan, Selket, Selvik, Sendhil, Shadowjams, Silly rabbit, SimonD,
Singularitarian, Sperling, Steve Kroon, Stpasha, Sullivan.t.j, Sławomir Biały, Tamfang, Tardis, That Guy, From That Show!, The Anome, Tobias Bergemann, Tom Peleg, TomJF, Tomo,
Tomruen, Tosha, Tribaal, Trovatore, Urdutext, Urhixidur, Veromies, VikC, Wikimorphism, Xnn, Xtv, Zero0000, Ziyuang, Zundark, ‫אביב‬, ‫ﺳﻌﯽ‬, 125 anonymous edits

Normal distribution  Source: http://en.wikipedia.org/w/index.php?oldid=507959136  Contributors: 119, 194.203.111.xxx, 213.253.39.xxx, 5:40, A. Parrot, A. Pichler, A.M.R., AaronSw,
Abecedare, Abtweed98, Alektzin, Alex.j.flint, Ali Obeid, AllanBz, Alpharigel, Amanjain, AndrewHowse, Anna Lincoln, Appoose, Art LaPella, Asitgoes, Aude, Aurimus, Awickert, AxelBoldt,
Aydee, Aylex, Baccyak4H, Beetstra, BenFrantzDale, Benwing, Bhockey10, Bidabadi, Bluemaster, Bo Jacoby, Boreas231, Boxplot, Br43402, Brock, Bryan Derksen, Bsilverthorn, Btyner,
Bubba73, Burn, CBM, CRGreathouse, Calvin 1998, Can't sleep, clown will eat me, CapitalR, Cburnett, Cenarium, Cgibbard, Charles Matthews, Charles Wolf, Cherkash, Cheungpuiho04, Chill
doubt, Chris53516, ChrisHodgesUK, Christopher Parham, Ciphergoth, Cmglee, Coffee2theorems, ComputerPsych, Conversion script, Coolhandscot, Coppertwig, Coubure, Courcelles,
Crescentnebula, Cruise, Cwkmail, Cybercobra, DEMcAdams, DFRussia, DVdm, Damian Yerrick, DanSoper, Dannya222, Darwinek, David Haslam, David.hilton.p, DavidCBryant, Davidiad,
Den fjättrade ankan, Denis.arnaud, Derekleungtszhei, Dima373, Dj thegreat, Doood1, DrMicro, Drilnoth, Drostie, Dudzcom, Duoduoduo, Dzordzm, EOBarnett, Eclecticos, Ed Poor, Edin1,
Edokter, EelkeSpaak, Egorre, Elektron, Elockid, Enochlau, Epbr123, Eric Kvaalen, Ericd, Evan Manning, Fang Aili, Fangz, Fergusq, Fgnievinski, Fibonacci, FilipeS, Fintor, Firelog, Fjdulles,
Fledylids, Fnielsen, Fresheneesz, G716, GB fan, Galastril, Gandrusz, Gary King, Gauravm1312, Gauss, Geekinajeep, Gex999, GibboEFC, Giftlite, Gil Gamesh, Gioto, GordontheGorgon,
Gperjim, Graft, Graham87, Gunnar Larsson, Gzornenplatz, Gökhan, Habbie, Headbomb, Heimstern, Henrygb, HereToHelp, Heron, Hiihammuk, Hiiiiiiiiiiiiiiiiiiiii, Hu12, Hughperkins, Hugo
gasca aragon, I dream of horses, Ian Pitchford, IdealOmniscience, It Is Me Here, Itsapatel, Ivan Štambuk, Iwaterpolo, J heisenberg, JA(000)Davidson, JBancroftBrown, JaGa, Jackzhp, Jacobolus,
JahJah, JanSuchy, Jason.yosinski, Javazen, Jeff560, Jeffjnet, Jgonion, Jia.meng, Jim.belk, Jitse Niesen, Jmlk17, Joebeone, Jonkerz, Jorgenumata, Joris Gillis, Jorisverbiest, Josephus78, Josuechan,
Jpk, Jpsauro, Junkinbomb, KMcD, KP-Adhikari, Karl-Henner, Kaslanidi, Kastchei, Kay Dekker, Keilana, KipKnight, Kjtobo, Knutux, LOL, Lansey, Laurifer, Lee Daniel Crocker, Leon7, Lilac
Soul, Livius3, Lixy, Loadmaster, Lpele, Lscharen, Lself, MATThematical, MIT Trekkie, ML5, Manticore, MarkSweep, Markhebner, Markus Krötzsch, Marlasdad, Mateoee, Mathstat, Mcorazao,
Mdebets, Mebden, Meelar, Melcombe, Melongrower, Message From Xenu, Michael Hardy, Michael Zimmermann, Miguel, Mikael Häggström, Mikewax, Millerdl, Mindmatrix, MisterSheik,
Mkch, Mm 202, Morqueozwald, Mr Minchin, Mr. okinawa, MrKIA11, MrOllie, MrZeebo, Mrocklin, Mundhenk, Mwtoews, Mysteronald, Naddy, Nbarth, Netheril96, Nicholasink, Nicolas1981,
Nilmerg, NoahDawg, Noe, Nolanbard, NuclearWarfare, O18, Ohnoitsjamie, Ojigiri, Oleg Alexandrov, Oliphaunt, Olivier, Orderud, Ossiemanners, Owenozier, P.jansson, PAR, PGScooter,
Pablomme, Pabristow, Paclopes, Patrick, Paul August, Paulpeeling, Pcody, Pdumon, Personman, Petri Krohn, Pfeldman, Pgan002, Pinethicket, Piotrus, Plantsurfer, Plastikspork, Policron,
Polyester, Prodego, Prumpf, Ptrf, Qonnec, Quietbritishjim, Qwfp, R.J.Oosterbaan, R3m0t, RDBury, RHaworth, RSStockdale, Rabarberski, Rajah, Rajasekaran Deepak, Randomblue, Rbrwr,
Renatokeshet, RexNL, Rich Farmbrough, Richwales, Rishi.bedi, Rjwilmsi, Robbyjo, Robma, Romanski, Ronz, Rubicon, RxS, Ryguasu, SGBailey, SJP, Saintrain, SamuelTheGhost, Samwb123,
Sander123, Schbam, Schmock, Schwnj, Scohoust, Seaphoto, Seidenstud, Seliopou, Seraphim, Sergey Suslov, SergioBruno66, Shabbychef, Shaww, Shuhao, Siddiganas, Sirex98, Smidas3,
Snoyes, Sole Soul, Somebody9973, Stan Lioubomoudrov, Stephenb, Stevan White, Stpasha, StradivariusTV, Sullivan.t.j, Sun Creator, SusanLarson, Sverdrup, Svick, Talgalili, Taxman,
Tdunning, TeaDrinker, The Anome, The Tetrast, TheSeven, Thekilluminati, TimBentley, Tomeasy, Tomi, Tommy2010, Tony1, Trewin, Tristanreid, Trollderella, Troutinthemilk, Tryggvi bt,
Tschwertner, Tstrobaugh, Unyoyega, Vakulgupta, Velocidex, Vhlafuente, Vijayarya, Vinodmp, Vrkaul, Waagh, Wakamex, Wavelength, Why Not A Duck, Wikidilworth, Wile E. Heresiarch,
Wilke, Will Thimbleby, Willking1979, Winsteps, Wissons, Wjastle, Wwoods, XJamRastafire, Yoshigev, Yoshis88, Zaracattle, Zero0000, Zhurov, Zrenneh, Zundark, Zvika, มือใหม่, 744
anonymous edits

Order statistic  Source: http://en.wikipedia.org/w/index.php?oldid=506783010  Contributors: A. Pichler, Bruzie, Charles Matthews, Cognitionmachine, David Haslam, Dchristle, Dcljr,
Dcoetzee, Den fjättrade ankan, Fangz, Gareth Jones, Giftlite, Hannes Eder, Jmath666, Karada, LOL, Lambiam, LorCC, Melcombe, Michael Hardy, Miguel, Mr Adequate, Nbarth, Nethgirb,
Nicolas.Wu, Pgan002, Phauly, Quantling, R'n'B, Rodolfo Hermans, Se'taan, Sirslope, Slendle84, Unixxx, Vaidhy, YahoKa, Zundark, 44 anonymous edits

Ordinary differential equation  Source: http://en.wikipedia.org/w/index.php?oldid=506887949  Contributors: 192.115.22.xxx, 48v, A. di M., Aboctok, Absurdburger, AdamSmithee, After
Midnight, Ahadley, Ahoerstemeier, Alfy Alf, Alll, Amixyue, Andrei Polyanin, Anetode, Anita5192, Ap, Arthena, Arthur Rubin, BL, BMF81, Baccala@freesoft.org, Bemoeial, BenFrantzDale,
Benjamin.friedrich, Berean Hunter, Bernhard Bauer, Beve, Bloodshedder, Bo Jacoby, Bogdangiusca, Bryan Derksen, Charles Matthews, Chenyu, Chilti, Chris in denmark, ChrisUK, Christian
List, Ciro.santilli, Cloudmichael, Cmdrjameson, Cmprince, Conversion script, Cpuwhiz11, Cutler, Danger, Danuthaiduc, Delaszk, Dickdock, Dicklyon, DiegoPG, Dmitrey, Dmr2, Dmytro,
DominiqueNC, Dominus, Don4of4, Donludwig, Doradus, Dysprosia, Ed Poor, Ekotkie, Emperorbma, Enochlau, Enzotib, F=q(E+v^B), Fintor, Fruge, Fzix info, Gauge, Gene s, Gerbrant, Giftlite,
Gombang, Graham87, HappyCamper, Heuwitt, Hongsichuan, Ht686rg90, Icairns, Isilanes, Iulianu, Jack in the box, Jak86, Jao, JeLuF, Jitse Niesen, Jni, JoanneB, John C PI, JohnBlackburne,
Jokes Free4Me, JonMcLoone, Josevellezcaldas, Juansempere, Kakila, Kawautar, Kdmckale, Klaas van Aarsen, Krakhan, Kwantus, L-H, LachlanA, Let01, Lethe, Linas, Lingwitt, Liquider, Lupo,
MarkGallagher, Math.geek3.1415926, MathMartin, Mathuvw, Matusz, Melikamp, Michael Hardy, Mikez, Mild Bill Hiccup, Moskvax, MrOllie, Msh210, Mtness, Niteowlneils, Oleg Alexandrov,
Patrick, Paul August, Paul Matthews, PaulTanenbaum, PavelSolin, Pdenapo, PenguiN42, Phil Bastian, Pichpich, PizzaMargherita, Pm215, Poonamjadhav, Poor Yorick, Pt, Randomguess,
Rasterfahrer, Raven in Orbit, Recentchanges, RedWolf, Rich Farmbrough, Rl, RobHar, Rogper, Romanm, Rpm, Ruakh, Salix alba, Sbyrnes321, Sekky, Shandris, Shirt58, SilverSurfer314, Sofia
karampataki, Ssd, Starlight37, Stevertigo, Stw, Superlaza, Susvolans, Sverdrup, Tarquin, Tbsmith, Technopilgrim, Telso, Template namespace initialisation script, The Anome, Thenub314,
Tobias Hoevekamp, TomyDuby, Tot12, TotientDragooned, Tristanreid, Twin Bird, Tyagi, UKoch, Ulner, Vadimvadim, Waltpohl, Wclxlus, Whommighter, Wideofthemark, WriterHound, Xrchz,
Yhkhoo, 今 古 庸 龍, 190 anonymous edits

Partial differential equation  Source: http://en.wikipedia.org/w/index.php?oldid=507434086  Contributors: Afluent Rider, Ahoerstemeier, Aliotra, Alpha Quadrant (alt), Andrei Polyanin,
AndrewHowse, AnkhMorpork, Arnero, ArnoldReinhold, Arthena, AxelBoldt, BASANTDUBE, Beckman16, Belovedfreak, Bemoeial, Ben pcc, BenFrantzDale, Bender235, Bertik, Bjorn.sjodin,
Bkocsis, Borgx, Brian Tvedt, CYD, Cbm, Charles Matthews, Chbarts, Chris in denmark, ChristophE, Cj67, Ckatz, Crowsnest, Crust, CyrilB, D.328, DStoykov, David Crawshaw,
Dharma6662000, Dicklyon, Dirkbb, Djordjes, DominiqueNC, Donludwig, DrHok, Druzhnik, Dysprosia, Egriffin, Eienmaru, Eigenlambda, El C, EmmetCaulfield, Epbr123, Erxnmedia,
Evankeane, F=q(E+v^B), Filemon, Fintor, Foober, Forbes72, Frosted14, Fuse809, Gaj0129, Gerasime, Germandemat, Giese, Giftlite, GraemeL, Gseryakov, Gurch, HappySophie, Hongooi, Hut
8.5, Isnow, Iwfyita, Ixfd64, JNW, JaGa, Jitse Niesen, Jmath666, Jon Cates, JonMcLoone, Jonathanstray, Jss214322, Jyril, Kbolino, Kwiki, L-H, Linas, MFH, Magister Mathematicae, Mandarax,
Mandolinface, Manticore, Marupio, MathMartin, Mathsfreak, Maurice Carbonaro, Mazi, Mhaitham.shammaa, Mhym, Michael Devore, Michael Hardy, Moink, Mpatel, Msh210, NSiDms,
Nbarth, Nneonneo, Ojcit, Oleg Alexandrov, Oliver Pereira, Orenburg1, OrgasGirl, Oscarjquintana, PL290, Pacaro, Patrick, Paul August, Paul Matthews, PeR, PhotoBox, Photonique, Pokespa,
Pranagailu1436, Prime Entelechy, Pt, Quibik, R'n'B, R.e.b., Rausch, RayAYang, Rememberlands, Richard77, Rjwilmsi, Rnt20, Roadrunner, Robinh, Roesser, Rpchase, Salih, Sbarnard,
Siegmaralber, SobakaKachalova, Spartan-James, Srleffler, Stevenj, Stizz, Super Cleverly, SwisterTwister, Sławomir Biały, THEN WHO WAS PHONE?, Tarquin, Tbsmith, The Anome, The
Transhumanist, Thenub314, Tiddly Tom, Timwi, Topbanana, Tosha, Ub3rm4th, Unigfjkl, User A1, Waltpohl, Wavelength, Wavesmikey, Winston365, Wolfrock, Wsulli74, Wtt, Yaje, Yhkhoo,
Zhou Yu, Zzuuzz, 317 anonymous edits

Pearson's chi-squared test  Source: http://en.wikipedia.org/w/index.php?oldid=502812581  Contributors: A-k-h, AbsolutDan, Agüeybaná, Aljeirou, Andropod, Arcadian, Asqueella, Athaler,
Avraham, Bender235, BlaiseFEgan, Bobo192, Btyner, Bubba73, Cherkash, Chuck Carroll, Connet, Cortonin, Czenek, Delirium, Den fjättrade ankan, Doyoung, Dpbsmith, Egil, Ektodu,
Fgnievinski, Free Software Knight, Funk17, Furrykef, Giuseppedn, Grotendeels Onschadelijk, Hirak 99, Horn.imh, Jcobb, Jfitzg, Jmcclung711, Joel B. Lewis, John254, JustAGal, Kastchei,
Kgwet, Kjtobo, KohanX, Kwamikagami, Lambiam, Lexor, LilHelpa, Loadmaster, Loodog, Mad Scientist, MarkSweep, Matt Crypto, Maxal, Maxbox51, Melcombe, Michael Hardy, Mikael
Häggström, Motorneuron, Moverly, MrOllie, Muhali, N5iln, Navywings, Nbarth, Neffk, O18, Omicron1234, Paul August, Paulck, Piotrus, PowerWill500, Qartis, Quadduc, Qwfp, Ranger2006,
Rar74B, Requestion, Retobaum, Rjwilmsi, Robinh, Rvrocha, Sander123, Sayantan m, Sbmehta, Schwnj, Seglea, Shadow308b4, Skbkekas, Spangineer, Ssola, Talgalili, Tambal, Tayste, The
Anome, Tim bates, TimBentley, TimBock, ToddDeLuca, Tomi, Triacylglyceride, Wtmitchell, 159 anonymous edits

Perron–Frobenius theorem  Source: http://en.wikipedia.org/w/index.php?oldid=506218307  Contributors: Alexander Chervov, Arcfrk, BenFrantzDale, Bender235, BeteNoir, Bill luv2fly,
Bmusician, Brad7777, Charles Matthews, Comfortably Paranoid, Cvdwoest, David Eppstein, Dcclark, Dima373, Doctorilluminatus, Flyingspuds, G.perarnau, Gdm, Giftlite, Justin Mauger,
Kiefer.Wolfowitz, Kirbin, Lfstevens, Lifeonahilltop, Linas, MRFS, Michael Hardy, Moogwrench, Nbarth, Niceguyedc, Pavel Stanley, Psychonaut, R.e.b., Rschwieb, Shining Celebi, Sodin,
Stigin, Tcnuk, TedPavlic, Urhixidur, Vinsz, Xnn, 67 anonymous edits

Poisson distribution  Source: http://en.wikipedia.org/w/index.php?oldid=507759981  Contributors: 2620:C6:8000:300:E84F:1D25:AA79:1023, Abtweed98, Adair2324, AdjustShift, Adoniscik,
Aeusoes1, AhmedHan, Ahoerstemeier, AlanUS, Alexius08, Alfpooh, Anchoar2001, Andre Engels, Ankit.shende, Anomalocaris, Army1987, Atonix, AxelBoldt, BL, Baccyak4H, Bdmy, Beetstra,
BenFrantzDale, Bender235, Bgeelhoed, Bidabadi, Bikasuishin, Bjcairns, Bobblewik, Brendan642, Bryan Derksen, Btyner, CameronHarris, Camitz, Captain-n00dle, Caricologist, Cburnett,
ChevyC, Chinasaur, Chriscf, Cimon Avaro, Ciphergoth, Citrus Lover, Constructive editor, Coppertwig, Count ludwig, Cqqbme, Cubic Hour, Cwkmail, DARTH SIDIOUS 2, Damistmu, Danger,
Dannomatic, DannyAsher, David Haslam, Deacon of Pndapetzim, Debastein, Dejvid, Denis.arnaud, Derek farn, Dhollm, Dougsim, DrMicro, Dreadstar, Drevicko, Duke Ganote, Eduardo Antico,
Edward, Emilpohl, Enric Naval, EnumaElish, Everyking, Falk Lieder, Favonian, Fayenatic london, Fnielsen, Fresheneesz, Frobnitzem, Fuzzyrandom, Gaius Cornelius, Gcbernier, Giftlite,
Giganut, Gigemag76, Giraffedata, Godix, Gperjim, GregorB, HamburgerRadio, Headbomb, Heliac, Henrygb, Hgkamath, Hu, HyDeckar, Hypnotoad33, Ian.Shannon, Ilmari Karonen, Inquisitus,
Intervallic, Iridescent, Iwaterpolo, Jamesscottbrown, JavOs, Jeff G., Jfr26, Jitse Niesen, Jleedev, Jmichael ll, Joeltenenbaum, Joseph.m.ernst, Jpk, Jrennie, Jshen6, KSmrq, Kastchei, Kay Kiljae
Lee, Kbk, King of Hearts, Kjfahmipedia, Kmtmeth, LOL, Laussy, Lgallindo, Lilac Soul, Linas, Ling Kah Jai, Lklundin, Logan, Loom91, Lucaswilkins, MC-CPO, Magicxcian, Marie Poise,
MarkSweep, Mathstat, Maxis ftw, McKay, Mdebets, Melcombe, Michael Hardy, Michael Ross, Miguel, Mike Young, Mindmatrix, Minesweeper, MisterSheik, Mobius, MrOllie, Mufka,
Mungbean, Munksm, NAHID, NabeelNM42, Nasnema, Nealmcb, Ned Scott, Netrapt, Nevsan, New Thought, Nicooo, Nijdam, Nipoez, Njerseyguy, Nsaa, O18, Ohnoitsjamie, Orionus, Ott2,
PAR, PBH, Pabristow, Pb30, Pftupper, Pfunk42, Philaulait, Phreed, PierreAbbat, Piplicus, Plasmidmap, Pmokeefe, Postrach, Princebustr, Qacek, Quietbritishjim, Qwfp, Qxz, Robert Hiller,
Ryguasu, SPART82, Saimhe, Salgueiro, Sanmele, Saxenad, Schaber, Scientist xkcd fan, Sdedeo, Sean r lynch, SeanJA, Seaphoto, SebastianHelm, Selket, Sergey Suslov, Sergio01, Serrano24,
Sheldrake, SiriusB, Skbkekas, Skittleys, Slarson, Snorgy, Spoon!, Stangaa, Steve8675309, Storm Rider, Stpasha, Strait, Sun Creator, Suslindisambiguator, Svick, Syzygy, TakuyaMurata,
Talgalili, Taw, Taylanwiki, Tayste, Tbhotch, Tcaruso2, TeaDrinker, Tedtoal, Teply, The Anome, The Thing That Should Not Be, TheNoise, TheTaintedOne, Theda, Thorvaldurgunn, Tomi,
Tommyjs, Tpb, Tpbradbury, Tpruane, Uncle Dick, Uriyan, User A1, Vector Alfawaz, VodkaJazz, Vrkaul, Wavelength, Weedier Mickey, Whosanehusy, Wikibuki, Wikid77, Wileycount,
WillowW, Wjastle, Wtanaka, XJamRastafire, YogeshwerSharma, Youandme, ZioX, Zundark, Þjóðólfr, 448 anonymous edits

Poisson process  Source: http://en.wikipedia.org/w/index.php?oldid=502479315  Contributors: AbsolutDan, Adler.fa, Aetheling, Almwi, Anaholic, Arnaudf, Atemperman, Bender235, Bezenek,
Binish fatimah, Bjcairns, Canley, CesarB, Changodoa, Charles Matthews, D nath1, Dale101usa, Darklilac, DavidRideout, Denisarona, Donmegapoppadoc, Dries, Drumlin mcl, Eslip17,
EtudiantEco, False vacuum, Faridani, Filur, Gareth Jones, Giftlite, Greatdiwei, Griffgruff, Grubbiv, ILikeThings, InvictaHOG, J heisenberg, JA(000)Davidson, Jackbaird, Jitse Niesen, Jonkerz,
Jtedor, Keilandreas, Kyrades, LOL, Lawrennd, LilHelpa, LuisPedroCoelho, Lyddea, Melcombe, Memming, Michael Hardy, Mike Rosoft, Mindmatrix, Muhends, Nbarth, O18, Oleg Alexandrov,
PAR, PanagosTheOther, Qlmatrix, Quantyz, Qwfp, Rgclegg, Rp, Sam Hocevar, Shd, Sjara, Skittleys, Slartibarti, Stochastic, Tdslk, Terhorst, Tomixdf, Uliba, Wavelength, Willem, Zundark, 128
anonymous edits

Proportional hazards models  Source: http://en.wikipedia.org/w/index.php?oldid=499739784  Contributors: 3mta3, Benjamin.haley, Boffob, Cherkash, Cyan, Den fjättrade ankan, E.amira,
Erianna, Favonian, G716, In base 4, John Vandenberg, JonAWellner, Kierano, Kllin1231, LilHelpa, Lilac Soul, Mathstat, Mebden, Melcombe, Memming, PamD, Pstevens, Qaswed-Ger, Qwfp,
Rich Farmbrough, Rjwilmsi, Rw251, Skbkekas, Skittleys, Theoriste, 19 anonymous edits

Random permutation statistics  Source: http://en.wikipedia.org/w/index.php?oldid=469021663  Contributors: Djozwebo, Edward, Emurphy42, Lantonov, Melchoir, Melcombe, Mhym, Michael
Hardy, Nbarth, Rjwilmsi, Shira.kritchman, Skysmith, Stasyuha, Sucharit, The Anome, Zahlentheorie, Zaslav, 12 anonymous edits

Rank (linear algebra)  Source: http://en.wikipedia.org/w/index.php?oldid=502602977  Contributors: (:Julien:), Andrevruas, Ashwin, AxelBoldt, Banus, Bappusona, BenFrantzDale,
Bender2k14, BiT, Bob.v.R, Bogdanno, Brian Tvedt, Chrystomath, Confluente, Conversion script, Counterfeit114, Cscwin, Danydib, David Eppstein, Demize.public, DirkOliverTheis, Dysprosia,
Dzordzm, Echocontrol, Ernsts, Fropuff, GPhilip, Geometry guy, Giftlite, Heron, Hroðulf, Hyperbola, Ivan Štambuk, Jcarroll, Jeltz, Jitse Niesen, Jmath666, JoergenB, Justin W Smith, KSmrq,
Kaspar.jan, Kawautar, Kbh3rd, Kck9f, Krunal800, Ligand, LokiClock, Lunch, Lupin, Marc van Leeuwen, Mattblack82, Matty j, Meigel, Mental Blank, Miaow Miaow, Michael Hardy, Milcak,
NClement, Naddy, Nbarth, Od Mishehu, Oleg Alexandrov, PapLorinc, Polizzi, Poor Yorick, R'n'B, Ram einstein, Richard Giuly, Saaska, Salih, Semifinalist, Shim'on, Skunkboy74, Sławomir
Biały, TakuyaMurata, Tardis, TeH nOmInAtOr, TexasAndroid, Trieu, Vql, Wshun, Zhaoway, 111 anonymous edits

Resampling (statistics)  Source: http://en.wikipedia.org/w/index.php?oldid=502973193  Contributors: A923812, Annabel, Archimerged, Avenue, AxelBoldt, BD2412, Baccyak4H, Biruitorul,
Bondegezou, Boxplot, Brendon1191, Btyner, Buettcher, Cherkash, Chris53516, Commadot, CristianCantoro, Damistmu, Dcoetzee, Den fjättrade ankan, Diegotorquemada, Dougher, Edstat,
Fisherjs, Fnielsen, Freeparameter, Garik, Gideon.fell, Giftlite, Hilverd, Jackverr, JavaManAz, Jmc200, Jncraton, Jonkerz, Jrvz, Karol Langner, Kenkleinman, Koala9, Kpmiyapuram, Ling.Nut,
Mathstat, Matumio, Mcld, Melcombe, Michael Hardy, Minhtuanht, Mishnadar, Moachim, Nbarth, O18, Ohnoitsjamie, Oleg Alexandrov, Onionmon, Patrubdel, Pgan002, Pifflesticks, Polluxonis,
QualitycontrolUS, Quaristice, Qwfp, Rich Farmbrough, Rickogorman, Ronz, Ropable, Skbkekas, Spm, Tesi1700, Thefellswooper, TimHesterberg, ToddDeLuca, Tolstoy the Cat, Valter Sundh,
Vegaswikian, Wissons, Woohookitty, Wstomv, X7q, СтудентК, 122 anonymous edits

Schur complement  Source: http://en.wikipedia.org/w/index.php?oldid=492085177  Contributors: A. Pichler, Acroterion, Aranel, BernardH, Dannoryan, Danpovey, Giftlite, Khaosoahk,
Kkddkkdd, LachlanA, Lavaka, Lockeownzj00, Mct mht, Michael Hardy, Nick, Rotkraut, Shreevatsa, Teorth, Wikomidia, Yuzhang49, Zfeinst, 50 anonymous edits

Sign test  Source: http://en.wikipedia.org/w/index.php?oldid=503454667  Contributors: Asqueella, Btyner, DV8 2XL, Daytona2, FlowerFaerie087, Kareekacha, MC-CPO, Male1979,
MarkSweep, Mbhiii, Mcld, Melcombe, Michael Hardy, PamD, Qwfp, Radagast83, Sean1040, Talgalili, That Guy, From That Show!, Unschool, 14 anonymous edits

Singular value decomposition  Source: http://en.wikipedia.org/w/index.php?oldid=507711964  Contributors: 3mta3, ABCD, AdamSmithee, Alexanderfrey, Alexmov, Anoko moonlight, Argav,
Arthur Frayn, AxelBoldt, Bciguy, Bdmy, Ben pcc, BenFrantzDale, BigrTex, Billlion, Brech, Browndar, CBM, Cccddd2012, Celique, Chadnash, Charivari, Charles Matthews, Chnv, ChrisDing,
Coffee2theorems, Coppertwig, Countchoc, D1ma5ad, Damien d, Danielcohn, Danielx, David Eppstein, Daytona2, Ddcarnage, Dean P Foster, Debeo Morium, Diomidis Spinellis, Dohn joe,
Douglas guo, Dragon Phoenix Studio, Drizzd, EduardoValle, EmmetCaulfield, Eric Le Bigot, EverettYou, Fcpp, Fgnievinski, Frammm, FrozenPurpleCube, Gauge, Geneffects, Georg-Johann,
Giftlite, GromXXVII, Guaka, Guy.schaffer, Harold f, Headbomb, Headlessplatter, Helwr, Hike395, Hobsonlane, Humanengr, Iprometheus, Isnow, Ivann.exe, JEBrown87544, Jack446,
JackSchmidt, Jamelan, Jdpipe, JerroldPease-Atlanta, Jheald, Jitse Niesen, John, JohnBlackburne, Jonnat, Joriki, Jérôme, K.menin, KSmrq, KYN, Kaol, Kaspar.jan, Kgutwin, Kiefer.Wolfowitz,
Kieff, Kierano, Kjetil1001, KnowledgeOfSelf, Knowledgeis4all, Kupopo, LapoLuchini, LiamH, Lifeonahilltop, Loren.wilton, Lotje, Lourakis, Lupin, Mahakp, Male1979, Marco.lombardi,
Marozols, MathMartin, Matthewmatician, Mct mht, Mdnahas, Melcombe, Merilius, Mhsajadi, Michael Hardy, Michael.greenacre, Morana, Musiphil, Mwtoews, Nbarth, NoobX, Olberd, Oleg
Alexandrov, Orderud, Orie0505, Orlyal, Oyz, Paolo.dL, Peachris, Pftupper, Phrank36, Phrenophobia, Physicistjedi, PigFlu Oink, Pmineault, PolarYukon, ProveIt, Qjqflash3, Qtea, RDBrown,
Rafmag, Rakeshchalasani, Rama, Ranicki, Rdecker02, Reinderien, Rhubbarb, Rich Farmbrough, Rinconsoleao, Rjwilmsi, Robinsor, Ronz, Rschwieb, Semifinalist, ServiceAT, Shadowjams,
Sikfreeze, Sim, Slawekb, Spellbuilder, Starsky617, Stepa, SteveMyers999, Stevenj, Stpasha, Sullivan.t.j, Sushruthg, Szabolcs Nagy, Tauwasser, Tbackstr, TedPavlic, The Thing That Should Not
Be, Thecheesykid, Thorwald, Tom Duff, TomViza, Tony1, Trifon Triantafillidis, Veinor, Wagonsoul773, WhisperingGadfly, WikiMSL, Willem, Woohookitty, Wooooosaj, Wshun, Wsiegmund,
X7q, XiagenFeng, Zanetu, Zvika, Zwitter689, Пика Пика, 353 anonymous edits

Stein's method  Source: http://en.wikipedia.org/w/index.php?oldid=502856881  Contributors: 3mta3, Headbomb, J04n, Melcombe, Michael Hardy, Mild Bill Hiccup, Patschy, Paul August,
RomainThibaux, ScienceNUS, Slartibarti, Tabletop, Tassedethe, 21 anonymous edits

Stirling's approximation  Source: http://en.wikipedia.org/w/index.php?oldid=502070373  Contributors: A. Pichler, Aaron Rotenberg, Abovechief, Alberto da Calvairate, AvicAWB, AxelBoldt,
Balcer, Barak, Bender235, Bender2k14, Berland, Black Yoshi, Bluemaster, Bluemoose, Bluestarlight37, Btyner, C. Trifle, Calréfa Wéná, Charles Matthews, Cybercobra, DMJ001,
DavidCBryant, DavidHouse, Dcoetzee, Doctormatt, Dogcow, EL Willy, Enochlau, Eranb, Eric119, FilipeS, Fredrik, Frencheigh, GangofOne, Gene Ward Smith, GeneChase, Gerbrant, Giftlite,
Glenn L, Goldencako, Gregie156, Gruntler, Hannes Eder, He Who Is, Henrygb, Herbee, JabberWok, Karl-Henner, Keithcc, Krishnachandranvn, Linas, Lisatwo, Looxix, MIT Trekkie, MOBle,
Ma'ame Michu, Marcol, McKay, Melchoir, Michael Hardy, Minesweeper, MuDavid, Ncik, Nonenmac, OwenX, PAR, PMajer, Pde, Pete Ridges, Plasticup, Poor Yorick, Qwertyus, RDBury,
RJHall, RexNL, Rh, Rogper, Salgueiro, Sandrobt, Slaniel, Spud Gun, Spudbeach, StradivariusTV, Sverdrup, Thomas Bliem, Thomas9987, TigerTjäder, Tim Starling, TomyDuby, Toolnut,
Trovatore, Tsirel, Usurper, Varitek123, Vincent Semeria, Wilke, Wonghang, Woscafrench, Zero0000, 110 anonymous edits

Student's t-distribution  Source: http://en.wikipedia.org/w/index.php?oldid=505048269  Contributors: 3mta3, A bit iffy, A. di M., A.M.R., Addone, Afluent Rider, Albmont, AlexAlex,
Alvin-cs, Amonet, Arsenikk, Arthur Rubin, Asperal, Avraham, AxelBoldt, B k, Beetstra, Benwing, Bless sins, Bobo192, BradBeattie, Bryan Derksen, Btyner, CBM, Cburnett, Chiqago,
Chris53516, Chriscf, Classical geographer, Clbustos, Coppertwig, Count Iblis, Crouchy7, Daige, DanSoper, Danko Georgiev, Daveswahl, Dchristle, Ddxc, Dejo, Dkf11, Dmcalist, Dmcg026,
Duncharris, EPadmirateur, EdJohnston, Eleassar, Eric Kvaalen, Ethan, Everettr2, F.morett, Fgimenez, Finnancier, Fnielsen, Frankmeulenaar, Freerow@gmail.com, Furrykef, G716,
Gabrielhanzon, Giftlite, Gperjim, Guibber, Hadleywickham, Hankwang, Hemanshu, Hirak 99, History Sleuth, Huji, Icairns, Ichbin-dcw, Ichoran, Ilhanli, Iwaterpolo, JMiall, JamesBWatson, Jitse
Niesen, Jmk, John Baez, Johnson Lau, Jost Riedel, Kastchei, Kiefer.Wolfowitz, Koavf, Kotar, Kroffe, Kummi, Kyosuke Aoki, Lifeartist, Linas, Lvzon, M.S.K., MATThematical, Madcoverboy,
Maelgwn, Mandarax, MarkSweep, Mcarling, Mdebets, Melcombe, Michael C Price, Michael Hardy, Mig267, Millerdl, MisterSheik, MrOllie, Muzzamo, Nbarth, Netheril96, Ngwt, Nick Number,
NuclearWarfare, O18, Ocorrigan, Oliphaunt, PAR, PBH, Pegasus1457, Petter Strandmark, Phb, Piotrus, Pmanderson, Quietbritishjim, Qwfp, R'n'B, R.e.b., Rich Farmbrough, Rjwilmsi, Rlendog,
Robert Ham, Robinh, Royote, Salgueiro, Sam Derbyshire, Sander123, Scientific29, Secretlondon, Seglea, Serdagger, Sgb 85, Shaww, Shoefly, Skbkekas, Sonett72, Sougandh, Sprocketoreo,
Srbislav Nesic, Stasyuha, Steve8675309, Stpasha, Strait, TJ0513, Techman224, Tgambon, The Anome, Theodork, Thermochap, ThorinMuglindir, Tjfarrar, Tolstoy the Little Black Cat,
Tom.Reding, TomCerul, Tomi, Tutor dave, Uncle G, Unknown, User A1, Valravn, Velocidex, Waldo000000, Wastle, Wavelength, Wikid77, Wile E. Heresiarch, Xenonice, ZantTrang, 294
anonymous edits

Summation by parts  Source: http://en.wikipedia.org/w/index.php?oldid=490149714  Contributors: A. Pichler, Bdmy, Brad7777, Calle, Charles Matthews, Charvest, ChrisHodgesUK, David
Eppstein, DavidGSterling, Enchanter, FF2010, Julien Tuerlinckx, Linas, Michael Hardy, Myasuda, Oleg Alexandrov, Oracleofottawa, Radagast83, Shreevatsa, Stan Lioubomoudrov, Tbennert,
Tcnuk, 虞 海, 41 anonymous edits

Taylor series  Source: http://en.wikipedia.org/w/index.php?oldid=507980872  Contributors: 1exec1, AbcXyz, Abcwikip, AdamGomaa, Akitchin, Alamino, Alansfault, Alberto da Calvairate, Ali
Obeid, Alksub, Anaraug, Anonymous Dissident, Antandrus, Audacity, Autarkaw, Avarice2593, AxelBoldt, BD2412, Baccyak4H, Banaticus, Barak Sh, Basketball110, Bdmy, BenFrantzDale,
Benzi455, BigJohnHenry, Bill Malloy, BlueSoxSWJ, Bo Jacoby, Bo198214, Brad7777, Byakuya1995, CBM, CambridgeBayWeather, Carifio24, CecilWard, Charles Matthews, Chetrasho, Chris
the speller, Clarknj, Cmcb, Conversion script, CornellRunner314, Cwkmail, Cyp, Dalstadt, Datoews, Deeptrivia, Diotti, Djmutex, Doctormatt, Dominus, Doradus, Druseltal2005, Drzib, Dspdude,
Dudzcom, Duplode, Dysprosia, Egmontaz, Ejrh, El Jogg, Elb2000, Elocute, Epbr123, Eric119, Error792, Estherholleman, Fibonacci, Flcelloguy, FootballHK, Forbes72, Fram, Fredrik,
Frencheigh, Fresheneesz, Fundamental metric tensor, Fvw, Fx0701, GTBacchus, Gandalf61, Genjix, Gesslein, Giftlite, Gijs.peek, Glane23, Gombang, Goodale, Gracenotes, Graham87, Gtstricky,
Guardian of Light, Gulliveig, Hadal, Haham hanuka, Harryboyles, Headbomb, Hede2000, Holger Flier, Hulaxhula15, Ideyal, Igodard, J.delanoy, JRSpriggs, JabberWok, Jagged 85,
Jakob.scholbach, James T Curran, Jane Fallen, Jaro.p, Jasanwiki, Jatinshah, Jaylowblow, Jclemens, Jeff G., Jeff223, Jeronimo, Jesper Carlstrom, Jim1138, JimJast, Jimothy 46, Jitse Niesen,
Jmnbatista, Johnuniq, Jschnur, Jwestbrook, Kallikanzarid, Kepke, Koertefa, Krishnachandranvn, LaGrange, LachlanA, Lambiam, Laurifer, Lethe, Linas, Loisel, LucaB, Lumaga, Mav, McKay,
Melikamp, Mfwitten, Mh, Mhallwiki, Michael Hardy, Miguel, Minesweeper, Mjec, Mordacil, Msh210, Musicguyguy, Mwilde, Myasuda, Nadav1, Nakon, Naught101, NawlinWiki, Netheril96,
Nilesj, Ohconfucius, Olaf, Oleg Alexandrov, OneWeirdDude, Orionus, Patrick, Paul August, Perlmonger42, Petri Krohn, Pink-isnt-well, Plasticspork, Plastikspork, Pmonaragala, PoiZaN, Pomte,
Populus, PouyaDT, Pps, Pranathi, Preetum, Pruthvi.Vallabh, Puckly, Qgluca, Qorilla, Qualc1, Quondum, RETROFUTURE, RJFJR, RJaguar3, RRRR0000RRRR, RageGarden, Randomath,
Randomblue, RayAYang, Red Denim, Reinderien, Reminiscenza, Renfield, ResearchRave, RexNL, Richard L. Peterson, RickK, Rinn0, Roadrunner, Robertsrap111, Rudminjd,
Rudminjw@jmu.edu, Runningonbrains, Sajoka, Sam Derbyshire, Schapel, Schmock, Shalom Yechiel, Shanman7, Shinglor, Silly rabbit, Slash, Slawekb, Sligocki, Sloq, Staplesauce, Stealth HR,
Stephen.metzger, Stevertigo, Stikonas, Support.and.Defend, Sverdrup, Sławomir Biały, Talgalili, Tarquin, Tcnuk, TheQuestionGuy, Thesevenseas, Tide rolls, Tikai, Tobias Bergemann, Troubled
asset, Tsemii, Tskuzzy, Twisterplus, VKokielov, Vanished User 0001, Wclark, Wiki alf, WinterSpw, WojciechSwiderski, Wowus, XJamRastafire, Xantharius, Xem 007, Xionbox, Zeno Gantner,
Zhefurui, Zzuuzz, 石 庭 豐, 559 anonymous edits

Uniform distribution (continuous)  Source: http://en.wikipedia.org/w/index.php?oldid=504265526  Contributors: A.M.R., Abdullah Chougle, Aegis Maelstrom, Albmont, AlekseyP, Algebraist,
Amatulic, ArnoldReinhold, B k, Baccyak4H, Benlisquare, Brianga, Brumski, Btyner, Capricorn42, Cburnett, Ceancata, DaBler, DixonD, DrMicro, Duoduoduo, Euchiasmus, Fasten, FilipeS,
Gala.martin, Gareth Owen, Giftlite, Gilliam, Gritzko, Henrygb, Iae, Iwaterpolo, Jamelan, Jitse Niesen, Marie Poise, Melcombe, Michael Hardy, MisterSheik, Nbarth, Nsaa, Oleg Alexandrov,
Ossska, PAR, Qwfp, Ray Chason, Robbyjo, Ruy Pugliesi, Ryan Vesey, Sandrarossi, Sl, Stpasha, Stwalkerster, Sun Creator, Tpb, User A1, Vilietha, Warriorman21, Wikomidia, Zundark, 97
anonymous edits

Uniform distribution (discrete)  Source: http://en.wikipedia.org/w/index.php?oldid=490538818  Contributors: Alansohn, Alstublieft, Bob.warfield, Btyner, DVdm, DaBler, Dec1707, DixonD,
Duoduoduo, Fangz, Fasten, FilipeS, Furby100, Giftlite, Gvstorm, Hatster301, Henrygb, Iwaterpolo, Jamelan, Klausness, LimoWreck, Melcombe, Michael Hardy, Mike74dk, Nbarth, O18, P64,
PAR, Paul August, Postrach, Qwfp, Random2001, Stannered, Taylorluker, The Wordsmith, User A1, 59 anonymous edits

Weibull distribution  Source: http://en.wikipedia.org/w/index.php?oldid=504942769  Contributors: A. Pichler, Agriculture, Alexey Sanko, Alfpooh, Anomalocaris, Argyriou, Asitgoes,
Avraham, AxelBoldt, Bender235, Bryan Derksen, Btyner, Calimo, Cburnett, Corecode, Corfuman, Craigy144, Cyan, Darrel francis, David Haslam, Dhatfield, Diegotorquemada, Dmh, Doradus,
Edratzer, Eliezg, Emilpohl, Epzsl2, Erianna, Felipehsantos, Fæ, Gausseliminering, Gcm, Giftlite, Gobeirne, GuidoGer, Isheden, Iwaterpolo, J6w5, JA(000)Davidson, JJ Harrison, Janlo, Jason A
Johnson, Jfcorbett, Joanmg, KenT, Kghose, Kpmiyapuram, Lachambre, LachlanA, LilHelpa, MH, Mack2, Mebden, Melcombe, Michael Hardy, MisterSheik, O18, Olaf, Oznickr, PAR, Pleitch,
Policron, Prof. Frink, Qwfp, R.J.Oosterbaan, RekishiEJ, Rickysmithcmrp, Rlendog, Robertmbaldwin, Saad31, Sam Blacketer, Samikrc, Sandeep4tech, Slawekb, Smalljim, Stern, Stpasha, Strypd,
Sławomir Biały, TDogg310, Tassedethe, Tom harrison, Tomi, Uppland, WalNi, Wiki5d, Wikipelli, Wjastle, Yanyanjun, Zundark, 143 anonymous edits

Wilcoxon signed-rank test  Source: http://en.wikipedia.org/w/index.php?oldid=506946847  Contributors: AbsolutDan, Amechtley, Arauzo, Asqueella, Baccyak4H, Brian Everlasting,
Chris53516, CowboyBear, Ddxc, Diego Moya, Dwhdwh, Eric Kvaalen, Gstatistics, Hula-hooper0, Infovoria, Iwaterpolo, J-a-x, Jdoev121, JeremyA, Jmaferreira, Jogloran, JordiGH, Joxemai,
Kastchei, Keimzelle, Law, Ldm653, Lleeoo, Ma lafortune, MarkSweep, Mcld, Melcombe, MichaK, Michael Hardy, MrOllie, Mscnln, Muboshgu, Mwtoews, O18, Olaf, Oleg Alexandrov,
Pgan002, Podgorec, Qwerty Binary, Qwfp, Rasnake, RichardMills65, Sango123, Schutz, Schwnj, Seglea, Silverfish, Talgalili, Thorwald, ToddDeLuca, Wzsamd, X7q, YorkBW, Yrithinnd, 69
anonymous edits

Wishart distribution  Source: http://en.wikipedia.org/w/index.php?oldid=503124768  Contributors: 3mta3, Aetheling, Aleenf1, Amonet, AtroX Worf, Baccyak4H, Benwing, Bryan Derksen,
Btyner, Crusoe8181, David Eppstein, Deacon of Pndapetzim, Dean P Foster, Entropeneur, Erki der Loony, Gammalgubbe, Giftlite, Headbomb, Ixfd64, Joriki, Jrennie, Kastchei,
Kiefer.Wolfowitz, Kurtitski, Lockeownzj00, MDSchneider, Melcombe, Michael Hardy, P omega sigma, P.wirapati, Panosmarko, Perturbationist, PhysPhD, Qwfp, R'n'B, Robbyjo, Robinh, Ryker,
Shae, Srbauer, TNeloms, Tom.Reding, Tomi, WhiteHatLurker, Wjastle, Zvika, 55 anonymous edits
Image Sources, Licenses and Contributors


Image:Beta distribution pdf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Beta_distribution_pdf.svg  License: Creative Commons Attribution-Sharealike 3.0  Contributors:
Krishnavedala
Image:Beta distribution cdf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Beta_distribution_cdf.svg  License: Creative Commons Attribution-Sharealike 3.0  Contributors:
Krishnavedala
File:Mode_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Mode_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike 3.0  Contributors:
Dr. J. Rodal
File:Median_Beta_Distribution_for_alpha_and_beta_from_0_to_5_-_J._Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Median_Beta_Distribution_for_alpha_and_beta_from_0_to_5_-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike 3.0
 Contributors: Dr. J. Rodal
File:Relative_Error_for_Approximation_to_Median_of_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Relative_Error_for_Approximation_to_Median_of_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg  License: Creative
Commons Attribution-Sharealike 3.0  Contributors: Dr. J. Rodal
File:Error_in_Median_Apprx._relative_to_Mean-Mode_distance_for_Beta_Distribution_with_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Error_in_Median_Apprx._relative_to_Mean-Mode_distance_for_Beta_Distribution_with_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg  License:
Creative Commons Attribution-Sharealike 3.0  Contributors: Dr. J. Rodal
File:Mean Beta Distribution for alpha and beta from 0 to 5 - J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Mean_Beta_Distribution_for_alpha_and_beta_from_0_to_5_-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike 3.0  Contributors:
Dr. J. Rodal
File:Variance for Beta Distribution for alpha and beta ranging from 0 to 5 - J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Variance_for_Beta_Distribution_for_alpha_and_beta_ranging_from_0_to_5_-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike
3.0  Contributors: Dr. J. Rodal
File:Skewness for Beta Distribution as a function of the variance and the mean - J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Skewness_for_Beta_Distribution_as_a_function_of_the_variance_and_the_mean_-_J._Rodal.jpg  License: Creative Commons
Attribution-Sharealike 3.0  Contributors: User:Dr. J. Rodal
File:Skewness_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Skewness_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike 3.0
 Contributors: Dr. J. Rodal
File:Skewness_Beta_Distribution_for_alpha_and_beta_from_.1_to_5_-_J._Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Skewness_Beta_Distribution_for_alpha_and_beta_from_.1_to_5_-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike 3.0
 Contributors: Dr. J. Rodal
File:Excess Kurtosis for Beta Distribution as a function of variance and mean - J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Excess_Kurtosis_for_Beta_Distribution_as_a_function_of_variance_and_mean_-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike
3.0  Contributors: User:Dr. J. Rodal
File:Excess_Kurtosis_for_Beta_Distribution_with_alpha_and_beta_ranging_from_1_to_5_-_J._Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Excess_Kurtosis_for_Beta_Distribution_with_alpha_and_beta_ranging_from_1_to_5_-_J._Rodal.jpg  License: Creative Commons
Attribution-Sharealike 3.0  Contributors: Dr. J. Rodal
File:Excess_Kurtosis_for_Beta_Distribution_with_alpha_and_beta_ranging_from_0.1_to_5_-_J._Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Excess_Kurtosis_for_Beta_Distribution_with_alpha_and_beta_ranging_from_0.1_to_5_-_J._Rodal.jpg  License: Creative Commons
Attribution-Sharealike 3.0  Contributors: Dr. J. Rodal
File:Re(CharacteristicFunction) Beta Distr alpha=beta from 0 to 25 Back - J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Re(CharacteristicFunction)_Beta_Distr_alpha=beta_from_0_to_25_Back_-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike 3.0
 Contributors: User:Dr. J. Rodal
File:Re(CharacteristicFunc) Beta Distr alpha=beta from 0 to 25 Front- J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Re(CharacteristicFunc)_Beta_Distr_alpha=beta_from_0_to_25_Front-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike 3.0
 Contributors: User:Dr. J. Rodal
File:Re(CharacteristFunc) Beta Distr alpha from 0 to 25 and beta=alpha+0.5 Back - J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Re(CharacteristFunc)_Beta_Distr_alpha_from_0_to_25_and_beta=alpha+0.5_Back_-_J._Rodal.jpg  License: Creative Commons
Attribution-Sharealike 3.0  Contributors: User:Dr. J. Rodal
File:Re(CharacterFunc) Beta Distrib. beta from 0 to 25, alpha=beta+0.5 Back - J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Re(CharacterFunc)_Beta_Distrib._beta_from_0_to_25,_alpha=beta+0.5_Back_-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike
3.0  Contributors: User:Dr. J. Rodal
File:Re(CharacterFunc) Beta Distr. beta from 0 to 25, alpha=beta+0.5 Front - J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Re(CharacterFunc)_Beta_Distr._beta_from_0_to_25,_alpha=beta+0.5_Front_-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike
3.0  Contributors: User:Dr. J. Rodal
File:Differential_Entropy_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Differential_Entropy_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike 3.0
 Contributors: Dr. J. Rodal
File:Differential_Entropy_Beta_Distribution_for_alpha_and_beta_from_0.1_to_5_-_J._Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Differential_Entropy_Beta_Distribution_for_alpha_and_beta_from_0.1_to_5_-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike
3.0  Contributors: Dr. J. Rodal
File:Mean_Median_Difference_-_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Mean_Median_Difference_-_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg  License: Creative Commons
Attribution-Sharealike 3.0  Contributors: Dr. J. Rodal
File:Mean_Mode_Difference_-_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Mean_Mode_Difference_-_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg  License: Creative Commons
Attribution-Sharealike 3.0  Contributors: Dr. J. Rodal
File:Mode Beta Distribution for both alpha and beta greater than 1 - J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Mode_Beta_Distribution_for_both_alpha_and_beta_greater_than_1_-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike 3.0
 Contributors: User:Dr. J. Rodal
File:Mode Beta Distribution for both alpha and beta greater than 1 - another view - J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Mode_Beta_Distribution_for_both_alpha_and_beta_greater_than_1_-_another_view_-_J._Rodal.jpg  License: Creative Commons
Attribution-Sharealike 3.0  Contributors: User:Dr. J. Rodal
File:Skewness Beta Distribution for mean full range and variance between 0.05 and 0.25 - Dr. J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Skewness_Beta_Distribution_for_mean_full_range_and_variance_between_0.05_and_0.25_-_Dr._J._Rodal.jpg  License: Creative Commons
Attribution-Sharealike 3.0  Contributors: User:Dr. J. Rodal
File:Skewness Beta Distribution for mean and variance both full range - J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Skewness_Beta_Distribution_for_mean_and_variance_both_full_range_-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike 3.0
 Contributors: User:Dr. J. Rodal
File:Excess Kurtosis Beta Distribution with mean for full range and variance from 0.05 to 0.25 - J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Excess_Kurtosis_Beta_Distribution_with_mean_for_full_range_and_variance_from_0.05_to_0.25_-_J._Rodal.jpg  License: Creative Commons
Attribution-Sharealike 3.0  Contributors: User:Dr. J. Rodal
File:Excess Kurtosis Beta Distribution with mean and variance for full range - J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Excess_Kurtosis_Beta_Distribution_with_mean_and_variance_for_full_range_-_J._Rodal.jpg  License: Creative Commons Attribution-Sharealike
3.0  Contributors: User:Dr. J. Rodal
File:Differential Entropy Beta Distribution with mean from 0.2 to 0.8 and variance from 0.01 to 0.09 - J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Differential_Entropy_Beta_Distribution_with_mean_from_0.2_to_0.8_and_variance_from_0.01_to_0.09_-_J._Rodal.jpg  License: Creative
Commons Attribution-Sharealike 3.0  Contributors: User:Dr. J. Rodal
File:Differential Entropy Beta Distribution with mean from 0.3 to 0.7 and variance from 0 to 0.2 - J. Rodal.jpg  Source:
http://en.wikipedia.org/w/index.php?title=File:Differential_Entropy_Beta_Distribution_with_mean_from_0.3_to_0.7_and_variance_from_0_to_0.2_-_J._Rodal.jpg  License: Creative Commons
Attribution-Sharealike 3.0  Contributors: User:Dr. J. Rodal
File:Karl_Pearson_2.jpg  Source: http://en.wikipedia.org/w/index.php?title=File:Karl_Pearson_2.jpg  License: Public Domain  Contributors: User:Struthious Bandersnatch
Image:Beta-binomial distribution pmf.png  Source: http://en.wikipedia.org/w/index.php?title=File:Beta-binomial_distribution_pmf.png  License: Creative Commons Attribution-Sharealike 3.0
 Contributors: Nschuma
Image:Beta-binomial cdf.png  Source: http://en.wikipedia.org/w/index.php?title=File:Beta-binomial_cdf.png  License: Creative Commons Attribution-Sharealike 3.0  Contributors: Nschuma
Image:Pascal's triangle 5.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Pascal's_triangle_5.svg  License: GNU Free Documentation License  Contributors: User:Conrad.Irwin
originally User:Drini
Image:Pascal's triangle - 1000th row.png  Source: http://en.wikipedia.org/w/index.php?title=File:Pascal's_triangle_-_1000th_row.png  License: Creative Commons Attribution-Sharealike 3.0
 Contributors: Endlessoblivion
File:Binomial distribution pmf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Binomial_distribution_pmf.svg  License: Public Domain  Contributors: Tayste
File:Binomial distribution cdf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Binomial_distribution_cdf.svg  License: Public Domain  Contributors: Tayste
File:Binomial Distribution.PNG  Source: http://en.wikipedia.org/w/index.php?title=File:Binomial_Distribution.PNG  License: unknown  Contributors: Schlurcher
File:Pascal's triangle; binomial distribution.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Pascal's_triangle;_binomial_distribution.svg  License: Public Domain  Contributors:
Lipedia
File:Binomial Distribution.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Binomial_Distribution.svg  License: GNU Free Documentation License  Contributors: cflm (talk)
Image:cauchy pdf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Cauchy_pdf.svg  License: Creative Commons Attribution 3.0  Contributors: Skbkekas
Image:cauchy cdf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Cauchy_cdf.svg  License: Creative Commons Attribution 3.0  Contributors: Skbkekas
File:Sinc simple.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Sinc_simple.svg  License: Public Domain  Contributors: Stpasha
File:2 cfs coincide over a finite interval.svg  Source: http://en.wikipedia.org/w/index.php?title=File:2_cfs_coincide_over_a_finite_interval.svg  License: Public Domain  Contributors: Stpasha
Image:Chernoff bound.png  Source: http://en.wikipedia.org/w/index.php?title=File:Chernoff_bound.png  License: Public Domain  Contributors: Dcoetzee
File:chi-square pdf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Chi-square_pdf.svg  License: Creative Commons Attribution 3.0  Contributors: Geek3
File:chi-square distributionCDF.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Chi-square_distributionCDF.svg  License: Creative Commons Zero  Contributors: Philten, 2
anonymous edits
Image:Convergence in distribution (sum of uniform rvs).gif  Source: http://en.wikipedia.org/w/index.php?title=File:Convergence_in_distribution_(sum_of_uniform_rvs).gif  License: Creative
Commons Zero  Contributors: Stpasha
Image:Comparison test series.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Comparison_test_series.svg  License: GNU Free Documentation License  Contributors: Titoxd
Image:ExpConvergence.gif  Source: http://en.wikipedia.org/w/index.php?title=File:ExpConvergence.gif  License: GNU Free Documentation License  Contributors: User:Rpchase
Image:LogConvergenceAnim.gif  Source: http://en.wikipedia.org/w/index.php?title=File:LogConvergenceAnim.gif  License: Public Domain  Contributors: Kan8eDie
File:Gaussian copula gaussian marginals.png  Source: http://en.wikipedia.org/w/index.php?title=File:Gaussian_copula_gaussian_marginals.png  License: Creative Commons
Attribution-Sharealike 3.0  Contributors: Matteo Zandi
File:Biv gumbel dist.png  Source: http://en.wikipedia.org/w/index.php?title=File:Biv_gumbel_dist.png  License: Creative Commons Attribution-Sharealike 3.0  Contributors: Matteo Zandi
File:copule ord.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Copule_ord.svg  License: Creative Commons Attribution-Sharealike 3.0  Contributors: Matteo Zandi
File:Copula gaussian.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Copula_gaussian.svg  License: Creative Commons Attribution-Sharealike 3.0  Contributors: Matteo Zandi
Image:Area parallellogram as determinant.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Area_parallellogram_as_determinant.svg  License: Public Domain  Contributors: Jitse
Niesen
Image:Determinant parallelepiped.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Determinant_parallelepiped.svg  License: Creative Commons Attribution 3.0  Contributors:
Claudio Rocchini
Image:Sarrus rule.png  Source: http://en.wikipedia.org/w/index.php?title=File:Sarrus_rule.png  License: Creative Commons Attribution 3.0  Contributors: Kmhkmh
Image:Dirichlet distributions.png  Source: http://en.wikipedia.org/w/index.php?title=File:Dirichlet_distributions.png  License: Public domain  Contributors: en:User:ThG
Image:Dirichlet example.png  Source: http://en.wikipedia.org/w/index.php?title=File:Dirichlet_example.png  License: Public Domain  Contributors: Mitch3
Image:Gamma distribution pdf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Gamma_distribution_pdf.svg  License: Creative Commons Attribution-ShareAlike 3.0 Unported
 Contributors: Gamma_distribution_pdf.png: MarkSweep and Cburnett derivative work: Autopilot (talk)
Image:Gamma distribution cdf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Gamma_distribution_cdf.svg  License: Creative Commons Attribution-ShareAlike 3.0 Unported
 Contributors: Gamma_distribution_cdf.png: MarkSweep and Cburnett derivative work: Autopilot (talk)
File:Em old faithful.gif  Source: http://en.wikipedia.org/w/index.php?title=File:Em_old_faithful.gif  License: Creative Commons Attribution-Sharealike 3.0  Contributors: 3mta3 (talk) 16:55, 23
March 2009 (UTC)
File:exponential pdf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Exponential_pdf.svg  License: Creative Commons Attribution 3.0  Contributors: Skbkekas
File:exponential cdf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Exponential_cdf.svg  License: Creative Commons Attribution 3.0  Contributors: Skbkekas
File:Mean exp.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Mean_exp.svg  License: Creative Commons Attribution-Sharealike 3.0,2.5,2.0,1.0  Contributors: Erzbischof
File:Median exp.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Median_exp.svg  License: Creative Commons Attribution-Sharealike 3.0,2.5,2.0,1.0  Contributors: Erzbischof
File:FitExponDistr.tif  Source: http://en.wikipedia.org/w/index.php?title=File:FitExponDistr.tif  License: Creative Commons Attribution-Sharealike 3.0  Contributors: Buenas días
Image:F distributionPDF.png  Source: http://en.wikipedia.org/w/index.php?title=File:F_distributionPDF.png  License: GNU Free Documentation License  Contributors: en:User:Pdbailey
Image:F distributionCDF.png  Source: http://en.wikipedia.org/w/index.php?title=File:F_distributionCDF.png  License: GNU Free Documentation License  Contributors: en:User:Pdbailey
Image:F-dens-2-15df.svg  Source: http://en.wikipedia.org/w/index.php?title=File:F-dens-2-15df.svg  License: Creative Commons Attribution-Sharealike 3.0  Contributors: Landroni
Image:Gamma-PDF-3D.png  Source: http://en.wikipedia.org/w/index.php?title=File:Gamma-PDF-3D.png  License: Creative Commons Attribution-Sharealike 3.0  Contributors:
User:Ronhjones
Image:Gamma-KL-3D.png  Source: http://en.wikipedia.org/w/index.php?title=File:Gamma-KL-3D.png  License: Creative Commons Attribution-Sharealike 3.0  Contributors: User:Ronhjones
Image:Gamma plot.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Gamma_plot.svg  License: Creative Commons Attribution-ShareAlike 3.0 Unported  Contributors: Alessio
Damato
Image:Factorial interpolation.png  Source: http://en.wikipedia.org/w/index.php?title=File:Factorial_interpolation.png  License: Public Domain  Contributors: Berland, Fredrik, Kilom691
Image:Complex gamma.jpg  Source: http://en.wikipedia.org/w/index.php?title=File:Complex_gamma.jpg  License: Public Domain  Contributors: Jan Homann
Image:Gamma function 2.png  Source: http://en.wikipedia.org/w/index.php?title=File:Gamma_function_2.png  License: Public Domain  Contributors: Bobmath, TakuyaMurata
Image:Complex gamma function abs.png  Source: http://en.wikipedia.org/w/index.php?title=File:Complex_gamma_function_abs.png  License: Public Domain  Contributors: Bender2k14,
Fredrik, Lipedia, 2 anonymous edits
Image:Complex gamma function Re.png  Source: http://en.wikipedia.org/w/index.php?title=File:Complex_gamma_function_Re.png  License: Public Domain  Contributors: Fredrik
Image:Complex gamma function Im.png  Source: http://en.wikipedia.org/w/index.php?title=File:Complex_gamma_function_Im.png  License: Public Domain  Contributors: Fredrik
File:DanielBernoulliLettreAGoldbach-1729-10-06.jpg  Source: http://en.wikipedia.org/w/index.php?title=File:DanielBernoulliLettreAGoldbach-1729-10-06.jpg  License: unknown
 Contributors: Wirkstoff
Image:Euler factorial paper.png  Source: http://en.wikipedia.org/w/index.php?title=File:Euler_factorial_paper.png  License: Public Domain  Contributors: Euler
Image:Jahnke gamma function.png  Source: http://en.wikipedia.org/w/index.php?title=File:Jahnke_gamma_function.png  License: Public Domain  Contributors: Eugene Jahnke, Fritz Emde
File:geometric pmf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Geometric_pmf.svg  License: Creative Commons Attribution 3.0  Contributors: Skbkekas
File:geometric cdf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Geometric_cdf.svg  License: Creative Commons Attribution 3.0  Contributors: Skbkekas
Image:PDF invGauss.png  Source: http://en.wikipedia.org/w/index.php?title=File:PDF_invGauss.png  License: Creative Commons Attribution-Sharealike 2.5  Contributors: Thomas Steiner
Image:Inverse gamma pdf.png  Source: http://en.wikipedia.org/w/index.php?title=File:Inverse_gamma_pdf.png  License: GNU General Public License  Contributors: Alejo2083, Cburnett
Image:Inverse gamma cdf.png  Source: http://en.wikipedia.org/w/index.php?title=File:Inverse_gamma_cdf.png  License: GNU General Public License  Contributors: Alejo2083, Cburnett
File:KL-Gauss-Example.png  Source: http://en.wikipedia.org/w/index.php?title=File:KL-Gauss-Example.png  License: Creative Commons Attribution-Sharealike 3.0  Contributors: Mundhenk
(talk)
File:ArgonKLdivergence.png  Source: http://en.wikipedia.org/w/index.php?title=File:ArgonKLdivergence.png  License: GNU Free Documentation License  Contributors: P. Fraundorf
Image:Laplace distribution pdf.png  Source: http://en.wikipedia.org/w/index.php?title=File:Laplace_distribution_pdf.png  License: GNU General Public License  Contributors: It Is Me Here,
Kilom691, MarkSweep
Image:Laplace distribution cdf.png  Source: http://en.wikipedia.org/w/index.php?title=File:Laplace_distribution_cdf.png  License: GNU General Public License  Contributors: Bender235,
MarkSweep, Perhelion
File:Laplace's equation on an annulus.jpg  Source: http://en.wikipedia.org/w/index.php?title=File:Laplace's_equation_on_an_annulus.jpg  License: Creative Commons Attribution-Sharealike
3.0  Contributors: DavidianSkitzou
File:Laplaces method.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Laplaces_method.svg  License: Creative Commons Zero  Contributors: User:Krishnavedala
Image:Loess curve.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Loess_curve.svg  License: Creative Commons Attribution-Sharealike 3.0  Contributors: Kierano
Image:PD-icon.svg  Source: http://en.wikipedia.org/w/index.php?title=File:PD-icon.svg  License: Public Domain  Contributors: Alex.muller, Anomie, Anonymous Dissident, CBM, MBisanz,
PBS, Quadell, Rocket000, Strangerer, Timotheus Canens, 1 anonymous edits
Image:Some log-normal distributions.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Some_log-normal_distributions.svg  License: Creative Commons Attribution-Sharealike 3.0
 Contributors: original by User:Par Derivative by Mikael Häggström from the original, File:Lognormal distribution PDF.png by User:Par
Image:Lognormal distribution CDF.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Lognormal_distribution_CDF.svg  License: Creative Commons Attribution-ShareAlike 3.0
Unported  Contributors: Lognormal_distribution_CDF.png: User:PAR derivative work: Autopilot (talk)
Image:Comparison mean median mode.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Comparison_mean_median_mode.svg  License: Creative Commons Attribution-Sharealike
3.0  Contributors: Cmglee
File:FitLogNormDistr.tif  Source: http://en.wikipedia.org/w/index.php?title=File:FitLogNormDistr.tif  License: Creative Commons Attribution-Sharealike 3.0  Contributors: Buenas días
Image:Levy0 distributionPDF.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Levy0_distributionPDF.svg  License: Creative Commons Zero  Contributors: User:Krishnavedala
Image:Levy0 distributionCDF.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Levy0_distributionCDF.svg  License: Creative Commons Zero  Contributors: User:Krishnavedala
Image:Levy0 LdistributionPDF.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Levy0_LdistributionPDF.svg  License: Creative Commons Zero  Contributors: User:Krishnavedala
Image:Ee noncompactness.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Ee_noncompactness.svg  License: Public Domain  Contributors: Stpasha
Image:MLfunctionbinomial-en.svg  Source: http://en.wikipedia.org/w/index.php?title=File:MLfunctionbinomial-en.svg  License: Creative Commons Attribution-Sharealike 3.0  Contributors:
User:Casp11
Image:GaussianScatterPCA.png  Source: http://en.wikipedia.org/w/index.php?title=File:GaussianScatterPCA.png  License: GNU Free Documentation License  Contributors: —Ben FrantzDale
(talk) (Transferred by ILCyborg)
Image:Sphere wireframe.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Sphere_wireframe.svg  License: Creative Commons Attribution-Sharealike 3.0,2.5,2.0,1.0  Contributors:
Geek3
Image:Hypersphere coord.PNG  Source: http://en.wikipedia.org/w/index.php?title=File:Hypersphere_coord.PNG  License: Creative Commons Attribution 3.0  Contributors: derivative work:
Pbroks13 (talk) Hypersphere_coord.gif: Claudio Rocchini
File:N SpheresVolumeAndSurfaceArea.png  Source: http://en.wikipedia.org/w/index.php?title=File:N_SpheresVolumeAndSurfaceArea.png  License: Creative Commons
Attribution-Sharealike 3.0  Contributors: User:Joseph Lindenberg
File:Negbinomial.gif  Source: http://en.wikipedia.org/w/index.php?title=File:Negbinomial.gif  License: Public Domain  Contributors: Stpasha
File:Chi-Squared-(nonCentral)-pdf.png  Source: http://en.wikipedia.org/w/index.php?title=File:Chi-Squared-(nonCentral)-pdf.png  License: Creative Commons Attribution-Sharealike 2.5
 Contributors: Thomas Steiner
File:Chi-Squared-(nonCentral)-cdf.png  Source: http://en.wikipedia.org/w/index.php?title=File:Chi-Squared-(nonCentral)-cdf.png  License: Creative Commons Attribution-Sharealike 2.5
 Contributors: Thomas Steiner
Image:nc student t pdf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Nc_student_t_pdf.svg  License: Creative Commons Attribution 3.0  Contributors: Skbkekas
Image:Vector norm sup.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Vector_norm_sup.svg  License: Public Domain  Contributors: Wiso, 1 anonymous edits
Image:Vector norms.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Vector_norms.svg  License: GNU Free Documentation License  Contributors: User:Esmil
File:Normal Distribution PDF.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Normal_Distribution_PDF.svg  License: Public Domain  Contributors: Inductiveload
File:Normal Distribution CDF.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Normal_Distribution_CDF.svg  License: Public Domain  Contributors: Inductiveload
File:standard deviation diagram.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Standard_deviation_diagram.svg  License: Creative Commons Attribution 2.5  Contributors:
Mwtoews
File:De moivre-laplace.gif  Source: http://en.wikipedia.org/w/index.php?title=File:De_moivre-laplace.gif  License: Public Domain  Contributors: Stpasha
File:Dice sum central limit theorem.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Dice_sum_central_limit_theorem.svg  License: Creative Commons Attribution-Sharealike 3.0
 Contributors: Cmglee
File:QHarmonicOscillator.png  Source: http://en.wikipedia.org/w/index.php?title=File:QHarmonicOscillator.png  License: GNU Free Documentation License  Contributors:
en:User:FlorianMarquardt
File:Fisher iris versicolor sepalwidth.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Fisher_iris_versicolor_sepalwidth.svg  License: Creative Commons Attribution-Sharealike 3.0
 Contributors: en:User:Qwfp (original); Pbroks13 (talk) (redraw)
File:FitNormDistr.tif  Source: http://en.wikipedia.org/w/index.php?title=File:FitNormDistr.tif  License: Public Domain  Contributors: Buenas días
File:Planche de Galton.jpg  Source: http://en.wikipedia.org/w/index.php?title=File:Planche_de_Galton.jpg  License: Creative Commons Attribution-Sharealike 3.0  Contributors: Antoine
Taveneaux
File:Carl Friedrich Gauss.jpg  Source: http://en.wikipedia.org/w/index.php?title=File:Carl_Friedrich_Gauss.jpg  License: Public Domain  Contributors: Gottlieb BiermannA. Wittmann (photo)
File:Pierre-Simon Laplace.jpg  Source: http://en.wikipedia.org/w/index.php?title=File:Pierre-Simon_Laplace.jpg  License: Public Domain  Contributors: Ashill, Ecummenic, Elcobbola,
Gene.arboit, Jimmy44, Olivier2, 霧木諒二, 1 anonymous edits
Image:OrderStatistics.gif  Source: http://en.wikipedia.org/w/index.php?title=File:OrderStatistics.gif  License: GNU Free Documentation License  Contributors: Bokken, Flappiefh, Rodolfo
Hermans, 1 anonymous edits
Image:Parabolic trajectory.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Parabolic_trajectory.svg  License: Public Domain  Contributors: Oleg Alexandrov
File:Heat eqn.gif  Source: http://en.wikipedia.org/w/index.php?title=File:Heat_eqn.gif  License: Public Domain  Contributors: Oleg Alexandrov
File:Chi-square distributionCDF-English.png  Source: http://en.wikipedia.org/w/index.php?title=File:Chi-square_distributionCDF-English.png  License: Public Domain  Contributors: Mikael
Häggström
File:poisson pmf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Poisson_pmf.svg  License: Creative Commons Attribution 3.0  Contributors: Skbkekas
File:poisson cdf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Poisson_cdf.svg  License: Creative Commons Attribution 3.0  Contributors: Skbkekas
File:Binomial versus poisson.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Binomial_versus_poisson.svg  License: Creative Commons Attribution-Sharealike 3.0  Contributors:
Sergio01
Image:SampleProcess.png  Source: http://en.wikipedia.org/w/index.php?title=File:SampleProcess.png  License: Public Domain  Contributors: Willem (talk)
File:Singular value decomposition.gif  Source: http://en.wikipedia.org/w/index.php?title=File:Singular_value_decomposition.gif  License: Public Domain  Contributors: Kieff
Image:Stirling's Approximation.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Stirling's_Approximation.svg  License: Creative Commons Attribution-Share Alike  Contributors:
R. A. Nonenmacher
Image:StirlingErrorGraphBB.svg  Source: http://en.wikipedia.org/w/index.php?title=File:StirlingErrorGraphBB.svg  License: Public Domain  Contributors: DMJ001
Image:StirlingError1.svg  Source: http://en.wikipedia.org/w/index.php?title=File:StirlingError1.svg  License: Public Domain  Contributors: DMJ001
Image:student t pdf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Student_t_pdf.svg  License: Creative Commons Attribution 3.0  Contributors: Skbkekas
Image:student t cdf.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Student_t_cdf.svg  License: Creative Commons Attribution 3.0  Contributors: Skbkekas
Image:T distribution 1df.png  Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_1df.png  License: GNU Free Documentation License  Contributors: Juiced lemon, Maksim,
1 anonymous edits
Image:T distribution 2df.png  Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_2df.png  License: GNU Free Documentation License  Contributors: Juiced lemon, Maksim,
1 anonymous edits
Image:T distribution 3df.png  Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_3df.png  License: GNU Free Documentation License  Contributors: Juiced lemon, Maksim,
1 anonymous edits
Image:T distribution 5df.png  Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_5df.png  License: GNU Free Documentation License  Contributors: Juiced lemon, Maksim,
1 anonymous edits
Image:T distribution 10df.png  Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_10df.png  License: GNU Free Documentation License  Contributors: Juiced lemon,
Maksim, 1 anonymous edits
Image:T distribution 30df.png  Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_30df.png  License: GNU Free Documentation License  Contributors: Juiced lemon,
Maksim, 1 anonymous edits
Image:sintay.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Sintay.svg  License: Creative Commons Attribution-ShareAlike 3.0 Unported  Contributors: User:Qualc1
Image:Exp series.gif  Source: http://en.wikipedia.org/w/index.php?title=File:Exp_series.gif  License: Public Domain  Contributors: Oleg Alexandrov
Image:Exp neg inverse square.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Exp_neg_inverse_square.svg  License: Public Domain  Contributors: Plastikspork (talk). Original
uploader was Plastikspork at en.wikipedia
Image:Taylorsine.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Taylorsine.svg  License: Public Domain  Contributors: Geek3, Hellisp, Riojajar, 1 anonymous edits
Image:LogTay.svg  Source: http://en.wikipedia.org/w/index.php?title=File:LogTay.svg  License: Public Domain  Contributors: Niles
Image:TaylorCosCos.png  Source: http://en.wikipedia.org/w/index.php?title=File:TaylorCosCos.png  License: GNU Free Documentation License  Contributors: Original uploader was Sam
Derbyshire at en.wikipedia
Image:TaylorCosPol.png  Source: http://en.wikipedia.org/w/index.php?title=File:TaylorCosPol.png  License: GNU Free Documentation License  Contributors: Original uploader was Sam
Derbyshire at en.wikipedia
Image:TaylorCosAll.png  Source: http://en.wikipedia.org/w/index.php?title=File:TaylorCosAll.png  License: GNU Free Documentation License  Contributors: Original uploader was Sam
Derbyshire at en.wikipedia
Image:Taylor e^xln1plusy.png  Source: http://en.wikipedia.org/w/index.php?title=File:Taylor_e^xln1plusy.png  License: Creative Commons Attribution-Sharealike 3.0  Contributors:
Slobo486, 1 anonymous edits
image:Uniform distribution PDF.png  Source: http://en.wikipedia.org/w/index.php?title=File:Uniform_distribution_PDF.png  License: Public Domain  Contributors: EugeneZelenko, It Is Me
Here, Joxemai, PAR
image:Uniform distribution CDF.png  Source: http://en.wikipedia.org/w/index.php?title=File:Uniform_distribution_CDF.png  License: Public Domain  Contributors: EugeneZelenko, Joxemai,
PAR
Image:DUniform distribution PDF.png  Source: http://en.wikipedia.org/w/index.php?title=File:DUniform_distribution_PDF.png  License: GNU Free Documentation License  Contributors:
EugeneZelenko, PAR, WikipediaMaster
Image:Dis Uniform distribution CDF.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Dis_Uniform_distribution_CDF.svg  License: GNU General Public License  Contributors:
en:User:Pdbailey, traced by User:Stannered
Image:Weibull PDF.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Weibull_PDF.svg  License: Creative Commons Attribution-Sharealike 3.0,2.5,2.0,1.0  Contributors: Calimo
Image:Weibull CDF.svg  Source: http://en.wikipedia.org/w/index.php?title=File:Weibull_CDF.svg  License: Creative Commons Attribution-Sharealike 3.0,2.5,2.0,1.0  Contributors: Calimo,
after Philip Leitch.
File:FitWeibullDistr.tif  Source: http://en.wikipedia.org/w/index.php?title=File:FitWeibullDistr.tif  License: Public Domain  Contributors: Buenas días
License
Creative Commons Attribution-Share Alike 3.0 Unported
//creativecommons.org/licenses/by-sa/3.0/
