Articles
Bernoulli distribution 1
Beta distribution 3
Beta function 31
Beta-binomial distribution 35
Binomial coefficient 41
Binomial distribution 59
Cauchy distribution 66
Cauchy–Schwarz inequality 74
Characteristic function (probability theory) 81
Chernoff bound 89
Chi-squared distribution 95
Computational complexity of mathematical operations 102
Conjugate prior 106
Continuous mapping theorem 112
Convergence of random variables 114
Convergent series 123
Copula (probability theory) 127
Coupon collector's problem 134
Degrees of freedom (statistics) 137
Determinant 143
Dirichlet distribution 161
Effect size 169
Erlang distribution 179
Expectation–maximization algorithm 182
Exponential distribution 191
F-distribution 201
F-test 204
Fisher information 209
Fisher's exact test 217
Gamma distribution 221
Gamma function 232
Geometric distribution 246
Hypergeometric distribution 250
Hölder's inequality 257
Inverse Gaussian distribution 264
Inverse-gamma distribution 269
Iteratively reweighted least squares 271
Kendall tau rank correlation coefficient 273
Kolmogorov–Smirnov test 277
Kronecker's lemma 280
Kullback–Leibler divergence 281
Laplace distribution 291
Laplace's equation 295
Laplace's method 301
Likelihood-ratio test 307
List of integrals of exponential functions 311
List of integrals of Gaussian functions 313
List of integrals of hyperbolic functions 315
List of integrals of logarithmic functions 317
Lists of integrals 319
Local regression 327
Log-normal distribution 331
Logrank test 339
Lévy distribution 341
Mann–Whitney U 344
Matrix calculus 350
Maximum likelihood 368
McNemar's test 379
Multicollinearity 382
Multivariate normal distribution 387
n-sphere 397
Negative binomial distribution 405
Noncentral chi-squared distribution 414
Noncentral F-distribution 419
Noncentral t-distribution 421
Norm (mathematics) 425
Normal distribution 432
Order statistic 460
Ordinary differential equation 465
Partial differential equation 475
Pearson's chi-squared test 488
Perron–Frobenius theorem 494
Poisson distribution 506
Poisson process 515
Proportional hazards models 519
Random permutation statistics 523
Rank (linear algebra) 535
Resampling (statistics) 541
Schur complement 548
Sign test 550
Singular value decomposition 551
Stein's method 566
Stirling's approximation 572
Student's t-distribution 577
Summation by parts 590
Taylor series 592
Uniform distribution (continuous) 603
Uniform distribution (discrete) 609
Weibull distribution 612
Wilcoxon signed-rank test 618
Wishart distribution 621
References
Article Sources and Contributors 626
Image Sources, Licenses and Contributors 634
Article Licenses
License 638
Bernoulli distribution 1
Bernoulli distribution
Bernoulli
Parameters: 0 ≤ p ≤ 1, q = 1 − p
Support: k ∈ {0, 1}
PMF: q for k = 0; p for k = 1
CDF: 0 for k < 0; q for 0 ≤ k < 1; 1 for k ≥ 1
Mean: p
Median: 0 if q > p; 1/2 if q = p; 1 if q < p
Mode: 0 if q > p; 0 and 1 if q = p; 1 if q < p
Variance: pq
Skewness: (q − p)/√(pq)
Ex. kurtosis: (1 − 6pq)/(pq)
Entropy: −q ln q − p ln p
MGF: q + p e^t
CF: q + p e^(it)
PGF: q + p z
In probability theory and statistics, the Bernoulli distribution, named after Swiss
scientist Jacob Bernoulli, is a discrete probability distribution which takes value 1 with success probability p and
value 0 with failure probability q = 1 − p. So if X is a random variable with this distribution, we have:
Pr(X = 1) = p = 1 − Pr(X = 0) = 1 − q.
A classical example of a Bernoulli experiment is a single toss of a coin. The coin might come up heads with
probability p and tails with probability 1-p. The experiment is called fair if p=0.5, indicating the origin of the
terminology in betting (the bet is fair if both possible outcomes have the same probability).
The probability mass function f of this distribution is
f(k; p) = p^k (1 − p)^(1−k)  for k ∈ {0, 1}.
The above can be derived by viewing the Bernoulli distribution as a special case of the binomial distribution with a single trial (n = 1).[1]
The kurtosis goes to infinity for high and low values of p, but for p = 1/2 the Bernoulli distribution has a lower
excess kurtosis than any other probability distribution, namely −2.
The Bernoulli distribution is a member of the exponential family.
The maximum likelihood estimator of p based on a random sample is the sample mean.
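This estimator is easy to check numerically. The sketch below is plain Python with illustrative helper names (such as `mle_p`) that are not from any particular library; it draws Bernoulli variates and recovers p as the sample mean:

```python
import random

def bernoulli_sample(p, n, rng):
    """Draw n Bernoulli(p) variates as 0/1 integers."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def mle_p(sample):
    """Maximum likelihood estimate of p: the sample mean."""
    return sum(sample) / len(sample)

rng = random.Random(0)
data = bernoulli_sample(0.3, 100_000, rng)
print(mle_p(data))  # close to the true p = 0.3
```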
Related distributions
• If X_1, …, X_n are independent, identically distributed (i.i.d.) random variables, all Bernoulli distributed with
success probability p, then their sum follows a binomial distribution: X_1 + … + X_n ~ B(n, p). The Bernoulli
distribution is simply B(1, p).
• The categorical distribution is the generalization of the Bernoulli distribution for variables with any constant
number of discrete values.
• The Beta distribution is the conjugate prior of the Bernoulli distribution.
• The geometric distribution is the number of Bernoulli trials needed to get one success.
Notes
[1] McCullagh and Nelder (1989), Section 4.2.2.
References
• McCullagh, Peter; Nelder, John (1989). Generalized Linear Models, Second Edition. Boca Raton: Chapman and
Hall/CRC. ISBN 0-412-31760-5.
• Johnson, N.L., Kotz, S., Kemp A. (1993) Univariate Discrete Distributions (2nd Edition). Wiley. ISBN
0-471-54897-9
External links
• Hazewinkel, Michiel, ed. (2001), "Binomial distribution" (http://www.encyclopediaofmath.org/index.
php?title=p/b016420), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
• Weisstein, Eric W., " Bernoulli Distribution (http://mathworld.wolfram.com/BernoulliDistribution.html)"
from MathWorld.
Beta distribution 3
Beta distribution
Beta
CDF
Mean
Variance
Ex. kurtosis
MGF
In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined
on the interval [0, 1] parametrized by two positive shape parameters, denoted by α and β. The beta distribution has
been applied to model the behavior of random variables limited to intervals of finite length. It has been used in
population genetics for a statistical description of the allele frequencies in the components of a sub-divided
population. It has also been used extensively in PERT, critical path method (CPM) and other project management /
control systems to describe the statistical distributions of the time to completion and the cost of a task. It has also
been applied in acoustic analysis to assess damage to gears, as the kurtosis of the beta distribution has been reported
as a good indicator of the condition of gears[1]. It has also been used to model sunshine data for application to solar
renewable energy utilization[2]. It has also been used for parametrizing variability of soil properties at the regional
level for crop yield estimation, modeling crop response over the area of the association[3]. It has also been used to
determine well-log shale parameters, to describe the proportions of the mineralogical components existing in a
certain stratigraphic interval[4]. It is used extensively in Bayesian inference, since beta distributions provide a family
of conjugate prior distributions for binomial and geometric distributions. For example, the beta distribution can be
used in Bayesian analysis to describe initial knowledge concerning probability of success such as the probability that
a space vehicle will successfully complete a specified mission. The beta distribution is a suitable model for the
random behavior of percentages. It can be suited to the statistical modelling of proportions in applications where
values of proportions equal to 0 or 1 do not occur. One theoretical case where the beta distribution arises is as the
distribution of the ratio formed by one random variable having a Gamma distribution divided by the sum of it and
another independent random variable also having a Gamma distribution with the same scale parameter (but possibly
different shape parameter).
The usual formulation of the beta distribution is also known as the beta distribution of the first kind, whereas beta
distribution of the second kind is an alternative name for the beta prime distribution.
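The gamma-ratio construction mentioned above can be checked with a short simulation. This sketch uses the standard-library `random.gammavariate`; the helper name `beta_via_gammas` is illustrative:

```python
import random

def beta_via_gammas(alpha, beta, rng, scale=1.0):
    """Beta(alpha, beta) variate as G1/(G1 + G2), where G1 ~ Gamma(alpha) and
    G2 ~ Gamma(beta) share the same scale parameter (the scale cancels in the ratio)."""
    g1 = rng.gammavariate(alpha, scale)
    g2 = rng.gammavariate(beta, scale)
    return g1 / (g1 + g2)

rng = random.Random(1)
xs = [beta_via_gammas(2.0, 3.0, rng) for _ in range(100_000)]
mean = sum(xs) / len(xs)
print(mean)  # close to alpha/(alpha + beta) = 0.4
```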
Characterization
The probability density function of the beta distribution, for 0 ≤ x ≤ 1 and shape parameters α > 0 and β > 0, is
f(x; α, β) = x^(α−1) (1 − x)^(β−1) / B(α, β) = [Γ(α + β) / (Γ(α) Γ(β))] x^(α−1) (1 − x)^(β−1)
where Γ is the gamma function. The beta function, B, appears as a normalization constant to ensure that the total
probability integrates to unity.
This definition includes both ends x= 0 and x=1, which is consistent with definitions for other continuous
distributions supported on a bounded interval which are special cases of the beta distribution, for example the arcsine
distribution, and consistent with several authors, such as N.L.Johnson and S.Kotz[5][6][7][8] . Note, however, that
several other authors, including W. Feller [9] [10][11], choose to exclude the ends x= 0 and x=1, (such that the two
ends are not actually part of the density function) and consider instead 0<x<1.
Several authors, including N.L.Johnson and S.Kotz[5], use the nomenclature p instead of α and q instead of β for the
shape parameters of the beta distribution, reminiscent of the nomenclature traditionally used for the parameters of the
Bernoulli distribution, because the beta distribution approaches the Bernoulli distribution in the limit as both shape
parameters α and β approach the value of zero.
In the following, that a random variable X is beta-distributed with parameters α and β will be denoted by:
X ~ Beta(α, β)
The cumulative distribution function is
F(x; α, β) = B(x; α, β) / B(α, β) = I_x(α, β)
where B(x; α, β) is the incomplete beta function and I_x(α, β) is the regularized incomplete beta function.
Properties
Mode
The mode of a beta-distributed random variable X with both parameters α and β greater than one is:[5]
mode = (α − 1)/(α + β − 2)
When both parameters are less than one (α < 1 and β < 1), this is the anti-mode: the lowest point of the probability
density curve.[7] Letting α = β in the above expression one obtains mode = 1/2, showing that for α = β
the mode (in the case α > 1 and β > 1), or the anti-mode (in the case α < 1 and β < 1), is at the center of the
distribution: it is symmetric in those cases. See the "Shapes" section in this article for a full list of mode cases, for
arbitrary values of α and β. For several of these cases, the maximum value of the density function occurs at one or
both ends. In some cases the (maximum) value of the density function occurring at the end is finite, for example in
the case of α = 2, β = 1 (or α = 1, β = 2), the right-triangle distribution, while in several other cases there is a singularity at
the end, and hence the value of the density function approaches infinity at the end, for example in the case α = β = 1/2,
the arcsine distribution. The choice whether to include the ends x = 0 and x = 1 as part of the density
function, whether a singularity can be considered to be a mode, and whether cases with two maxima are to be
considered bimodal, determines whether some authors consider these maximum values at the ends of the density
function to be modes[12] or not.[10]
Median
The median of the beta distribution is the unique real number x for which the regularized incomplete beta function
satisfies I_x(α, β) = 1/2. There is no general closed-form expression for the median of the beta distribution for
arbitrary values of α and β. Closed-form expressions for particular values of the parameters α and β follow:
[Figure: Mode for Beta distribution for 1 ≤ α ≤ 5 and 1 ≤ β ≤ 5]
• For symmetric cases α = β, median = 1/2.
• For α = 1 and β > 0, median = 1 − 2^(−1/β)
(this case is the mirror-image of the power function [0,1] distribution)
• For β = 1 and α > 0, median = 2^(−1/α) (this case is the power function [0,1] distribution[10])
• For α = 3 and β = 2, median = 0.61427..., the real [0,1] solution to the quartic
equation 1 − 8x³ + 6x⁴ = 0
• For α = 2 and β = 3, median = 0.38573... = 1 − median(Beta(3, 2))
A reasonable approximation of the value of the median of the beta distribution, for both α and β greater or
equal to one, is given by the formula[13]
median ≈ (α − 1/3)/(α + β − 2/3)  for α, β ≥ 1
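Assuming the commonly quoted approximation median ≈ (α − 1/3)/(α + β − 2/3) for α, β ≥ 1, a rough numerical check can compare it against a median computed directly from the density. The sketch below (pure Python; the integration and bisection are illustrative, not a production-quality quantile routine) does this for α = 2, β = 3:

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density via log-gamma (valid for 0 < x < 1)."""
    log_b = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_b)

def beta_cdf(x, a, b, steps=4000):
    """Trapezoidal integral of the density on [0, x]; assumes a, b > 1 so f(0) = 0."""
    h = x / steps
    total = 0.5 * beta_pdf(x, a, b)
    for i in range(1, steps):
        total += beta_pdf(i * h, a, b)
    return total * h

def beta_median(a, b, tol=1e-6):
    """Bisection solve of beta_cdf(x) = 1/2."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

a, b = 2.0, 3.0
med = beta_median(a, b)
approx = (a - 1 / 3) / (a + b - 2 / 3)
print(med, approx)  # ~0.3857 vs ~0.3846
```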
Mean
The expected value (mean) μ of a beta-distributed random variable X with parameters α and β is:[5]
μ = E[X] = α/(α + β)
Letting α = β in the above expression one obtains μ = 1/2, showing that for α = β the mean is at the center of
the distribution: it is symmetric. Also, the following limits can be obtained from the above expression:
lim_(β→0) μ = 1,  lim_(α→∞) μ = 1,  lim_(α→0) μ = 0,  lim_(β→∞) μ = 0
Therefore, for β → 0, or for α → ∞, the mean is located at the right end, x = 1. For these limit ratios, the beta
distribution becomes a 1 point Degenerate distribution with a Dirac delta function spike at the right end, x = 1, with
probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at
the right end, x = 1.
Similarly, for α → 0, or for β → ∞, the mean is located at the left end, x = 0. The beta distribution becomes a
1 point Degenerate distribution with a Dirac delta function spike at the left end, x = 0, with probability 1, and zero
probability everywhere else. There is 100% probability (absolute certainty) concentrated at the left end, x = 0.
Variance
The variance (the second moment centered around the mean) of a beta-distributed random variable X with
parameters α and β is:[5]
var(X) = αβ / [(α + β)²(α + β + 1)]
[Figure: Mean for Beta distribution for 0 ≤ α ≤ 5 and 0 ≤ β ≤ 5]
Letting α = β in the above expression one obtains var(X) = 1/[4(2α + 1)], showing that for α = β the variance decreases
monotonically as α = β increases. Setting α = β = 0 in this expression, one finds the maximum
variance var(X) = 1/4,[5] which only occurs approaching the limit, at α = β = 0.
The beta distribution may also be parametrized in terms of its mean μ (0 < μ < 1) and sample size ν = α + β (ν > 0);
in that parametrization α = μν and β = (1 − μ)ν (see section below titled "Mean and sample size").
Using this parametrization, one can express the variance in terms of the mean μ and the sample size ν as follows:
var(X) = μ(1 − μ)/(1 + ν)
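Both parametrizations describe the same variance, which can be checked directly. A minimal sketch (illustrative helper names), assuming var(X) = αβ/[(α + β)²(α + β + 1)] and var(X) = μ(1 − μ)/(1 + ν):

```python
def beta_var(a, b):
    """Variance of Beta(a, b) in the shape-parameter form."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

def beta_var_mu_nu(mu, nu):
    """Same variance in the mean/sample-size form (a = mu*nu, b = (1 - mu)*nu)."""
    return mu * (1 - mu) / (1 + nu)

a, b = 2.0, 3.0
mu, nu = a / (a + b), a + b
print(beta_var(a, b), beta_var_mu_nu(mu, nu))  # both 0.04
```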
Also, the following limits (with only the noted variable approaching the limit) can be obtained from the above
expressions:
Skewness
The skewness (the third moment
centered around the mean, normalized
by the 3/2 power of the variance) of
the beta distribution is[5]
Letting in the above expression one obtains , showing once again that for the distribution
is symmetric and hence the skewness is zero. Positive skew (right-tailed) for α < β, negative skew (left-tailed) for α
> β.
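A small sketch of the closed form skewness = 2(β − α)√(α + β + 1)/[(α + β + 2)√(αβ)] (assumed standard form; illustrative function name), confirming zero skewness for α = β and positive skew for α < β:

```python
import math

def beta_skewness(a, b):
    """Skewness of Beta(a, b): 2(b - a) sqrt(a + b + 1) / ((a + b + 2) sqrt(ab))."""
    return 2 * (b - a) * math.sqrt(a + b + 1) / ((a + b + 2) * math.sqrt(a * b))

print(beta_skewness(2.0, 2.0))  # 0.0: symmetric case
print(beta_skewness(2.0, 3.0))  # positive (right-tailed) since a < b
```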
Using the parametrization in terms of mean μ and sample size ν = α + β (that is, α = μν and β = (1 − μ)ν),
one can express the skewness in terms of the mean μ and the sample size ν as follows:
skewness = 2(1 − 2μ)√(1 + ν) / [(2 + ν)√(μ(1 − μ))]
The skewness can also be expressed just in terms of the variance var and the mean μ as follows:
skewness = 2(1 − 2μ)√var / [var + μ(1 − μ)]  (valid for var < μ(1 − μ))
The accompanying plot of skewness as a function of variance and mean shows that maximum variance (1/4) is
coupled with zero skewness and the symmetry condition (μ = 1/2), and that maximum skewness (positive or negative
infinity) occurs when the mean is located at one end or the other, so that that the "mass" of the probability
distribution is concentrated at the ends (minimum variance).
For the symmetric case (α = β), skewness = 0 over the whole range, and the following limits apply:
For the unsymmetric cases (α ≠ β) the following limits (with only the noted variable approaching the limit) can be
obtained from the above expressions:
Kurtosis
The beta distribution has been applied
in acoustic analysis to assess damage
to gears, as the kurtosis of the beta
distribution has been reported to be a
good indicator of the condition of a
gear[1]. Kurtosis has also been used to
distinguish the seismic signal
generated by a person's footsteps from
other signals. As persons or other
targets moving on the ground generate
continuous signals in the form of seismic waves, one can separate different targets based on the seismic waves they
generate. Kurtosis is sensitive to impulsive signals, so it’s much more sensitive to the signal generated by human
footsteps than other signals generated by vehicles, winds, noise, etc.[14] Unfortunately, the notation for kurtosis has
not been standardized. Kenney and Keeping[15] use the symbol γ₂ for the excess kurtosis, but Abramowitz and
Stegun[16] use different terminology. To prevent confusion[17] between kurtosis (the fourth moment centered around
the mean, normalized by the square of the variance) and excess kurtosis, when using symbols, they will be spelled
out as follows:[11][10]
excess kurtosis = kurtosis − 3 = 6[(α − β)²(α + β + 1) − αβ(α + β + 2)] / [αβ(α + β + 2)(α + β + 3)]
Letting α = β in the above expression one obtains excess kurtosis = −6/(3 + 2α).
Therefore for symmetric beta distributions, the excess kurtosis is negative, increasing from a minimum value of −2
at the limit as {α = β} → 0, and approaching a maximum value of zero as {α = β} → ∞. The value of −2 is the
minimum value of excess kurtosis that any distribution (not just beta distributions, but any distribution of any
possible kind) can ever achieve. This minimum value is reached when all the probability density is entirely
concentrated at each end x = 0 and x = 1, with nothing in between: a 2-point Bernoulli distribution with equal
probability 1/2 at each end (a coin toss: see section below "Kurtosis bounded by the square of the skewness" for
further discussion). The description of kurtosis as a measure of the "peakedness" (or "heavy tails") of the probability
distribution, is strictly applicable to unimodal distributions (for example the normal distribution). However, for more
general distributions, like the beta distribution, a more general description of kurtosis is that it is a measure of the
proportion of the mass density near the mean. The higher the proportion of mass density near the mean, the higher
the kurtosis, while the higher the mass density away from the mean, the lower the kurtosis. For α ≠ β, skewed beta
distributions, the excess kurtosis can reach unlimited positive values (particularly for α → 0 for finite β, or for β → 0
for finite α) because all the mass density is concentrated at the mean when the mean coincides with one of the ends.
Minimum kurtosis takes place when the mass density is concentrated equally at each end (and therefore the mean is
at the center), and there is no probability mass density in between the ends.
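These limiting behaviors can be checked against the standard closed form for the excess kurtosis, 6[(α − β)²(α + β + 1) − αβ(α + β + 2)]/[αβ(α + β + 2)(α + β + 3)] (a sketch; for α = β this reduces to −6/(3 + 2α), which runs from −2 toward 0):

```python
def beta_excess_kurtosis(a, b):
    """Excess kurtosis of Beta(a, b) in its standard closed form."""
    num = 6 * ((a - b) ** 2 * (a + b + 1) - a * b * (a + b + 2))
    den = a * b * (a + b + 2) * (a + b + 3)
    return num / den

# Symmetric case: -6/(3 + 2a), approaching -2 as a -> 0 and 0 as a -> infinity.
for a in (0.01, 1.0, 100.0):
    print(a, beta_excess_kurtosis(a, a))
```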
Using the parametrization in terms of mean μ and sample size ν = α + β:
one can express the excess kurtosis in terms of the mean μ and the sample size ν as follows:
The excess kurtosis can also be expressed in terms of just the following two parameters: the variance var, and the
sample size ν as follows:
The plot of excess kurtosis as a function of the variance and the mean shows that the minimum value of the excess
kurtosis (- 2, which is the minimum possible value for excess kurtosis for any distribution) is intimately coupled with
the maximum value of variance (1/4) and the symmetry condition: the mean occurring at the midpoint (μ = 1/2). This
occurs for the symmetric case of α = β = 0, with zero skewness. At the limit, this is the 2 point Bernoulli distribution
with equal probability 1/2 at each Dirac delta function end x = 0 and x = 1 and zero probability everywhere else. (A
coin toss: one face of the coin being x = 0 and the other face being x = 1.) Variance is maximum because the
distribution is bimodal with nothing in between the two modes (spikes) at each end. Excess kurtosis is minimum: the
probability density "mass" is zero at the mean and it is concentrated at the two peaks at each end. Excess kurtosis
reaches the minimum possible value (for any distribution) when the probability density function has two spikes at
each end: it is bi-"peaky" with nothing in between them.
On the other hand, the plot shows that for extreme skewed cases, where the mean is located near one or the other end
(μ = 0 or μ = 1), the variance is close to zero, and the excess kurtosis rapidly approaches infinity when the mean of
the distribution approaches either end.
Alternatively, the excess kurtosis can also be expressed in terms of just the following two parameters: the square of
the skewness, and the sample size ν, as follows:
excess kurtosis = [6/(3 + ν)] { [(2 + ν)/4] (skewness)² − 1 }
From this last expression, one can obtain the same limits published practically a century ago by Karl Pearson in his
paper[18], for the beta distribution (see section below titled "Kurtosis bounded by the square of the skewness").
Setting α + β= ν = 0 in the above expression, one obtains Pearson's lower boundary (values for the skewness and
excess kurtosis below the boundary (excess kurtosis + 2 - skewness2 = 0) cannot occur for any distribution, and
hence Karl Pearson appropriately called the region below this boundary the "impossible region"). The limit of α + β=
ν → ∞ determines Pearson's upper boundary.
therefore:
lim_(ν→0) excess kurtosis = (skewness)² − 2  (Pearson's lower boundary)
lim_(ν→∞) excess kurtosis = (3/2)(skewness)²  (Pearson's upper boundary)
Values of ν= α + β such that ν ranges from zero to infinity, 0 < ν < ∞, span the whole region of the beta distribution
in the plane of excess kurtosis versus squared skewness.
For the symmetric case (α = β), the following limits apply:
For the unsymmetric cases (α ≠ β) the following limits (with only the noted variable approaching the limit) can be
obtained from the above expressions:
Characteristic function
The characteristic function is the Fourier transform of the probability density function. The characteristic
function of the beta distribution is Kummer's confluent hypergeometric function (of the first kind):[5][16][19]
φ(t) = E[e^(itX)] = ₁F₁(α; α + β; it) = Σ_(n=0..∞) [α^(n) (it)^n] / [(α + β)^(n) n!]
where
x^(n) = x(x + 1)(x + 2) ⋯ (x + n − 1)
is the rising factorial, also called the "Pochhammer symbol". The value of the characteristic function for t = 0 is one:
φ(0) = ₁F₁(α; α + β; 0) = 1.
Also, the real and imaginary parts of the characteristic function enjoy the following symmetries with respect to the
origin of variable t:
The symmetric case α = β simplifies the characteristic function of the beta distribution to a Bessel function, since in
the special case the confluent hypergeometric function (of the first kind) reduces to a Bessel function
(the modified Bessel function of the first kind ) using Kummer's second transformation as follows:
In the accompanying plots, the real part (Re) of the characteristic function of the beta distribution is displayed for
symmetric (α = β) and skewed (α ≠ β) cases.
Higher moments
Using the moment generating function, the kth raw moment is given by[5] the factor
multiplying the (exponential series) term (t^k)/(k!) in the series of the moment generating function:
E[X^k] = α^(k) / (α + β)^(k) = ∏_(r=0..k−1) (α + r)/(α + β + r)
where x^(k) is a Pochhammer symbol representing rising factorial. It can also be written in a recursive form as
E[X^k] = [(α + k − 1)/(α + β + k − 1)] E[X^(k−1)]
The following transformation by inversion of the random variable, X/(1 − X), gives the expected value of the inverted
beta distribution or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI):[5]
E[X/(1 − X)] = α/(β − 1)  if β > 1
E[(1 − X)/X] = β/(α − 1)  if α > 1
Expected values for logarithmic transformations (that may be useful for maximum likelihood estimates, for
example):
E[ln X] = ψ(α) − ψ(α + β)
E[ln(1 − X)] = ψ(β) − ψ(α + β)
where ψ is the digamma function.
Higher order logarithmic moments can be expressed in terms of higher order poly-gamma functions; for example,
the variance of the logarithmically transformed variable is
var[ln X] = ψ₁(α) − ψ₁(α + β)
where ψ₁ is the trigamma function.
These identities can be derived by using the representation of a Beta distribution as a proportion of two Gamma
distributions and differentiating through the integral.
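The identity E[ln X] = ψ(α) − ψ(α + β) can be verified by simulation, using the gamma-ratio representation just mentioned. The sketch below approximates the digamma function ψ by a finite difference of `math.lgamma` (an illustrative shortcut, not a library routine):

```python
import math
import random

def digamma(x, h=1e-5):
    """Numerical digamma via a central difference of log-gamma (adequate for a sketch)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def expected_log(a, b):
    """E[ln X] for X ~ Beta(a, b): psi(a) - psi(a + b)."""
    return digamma(a) - digamma(a + b)

# Monte Carlo check using the gamma-ratio representation of a beta variate.
rng = random.Random(2)
a, b, n = 2.0, 3.0, 100_000
acc = 0.0
for _ in range(n):
    g1, g2 = rng.gammavariate(a, 1.0), rng.gammavariate(b, 1.0)
    acc += math.log(g1 / (g1 + g2))
mc = acc / n
print(mc, expected_log(a, b))  # both near -1.083
```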
The digamma function appears in the formula for the differential entropy as a consequence of Euler's integral
formula for the harmonic numbers which follows from the integral:
at an end, and the probability density is zero everywhere else. If both shape parameters are equal (the symmetric
case), α = β, and they approach infinity simultaneously, the probability density becomes a spike (Dirac delta
function) concentrated at the middle x = 1/2, and hence there is 100% probability at the middle x = 1/2 and zero
probability everywhere else.
The (continuous case) differential entropy was introduced by Shannon in his original paper (where he named it the
"entropy of a continuous distribution"), as the concluding part[21] of the same paper where he defined the discrete
entropy. It has been known since then that the differential entropy may differ from the infinitesimal limit of the discrete
entropy by an infinite offset; therefore, the differential entropy can be negative (as it is for the beta distribution).
What really matters is the relative value of entropy.
Given two beta distributed random variables, X ~ Beta(α, β) and Y ~ Beta(α', β'), the cross entropy is (measured in
nats)
It follows that the relative entropy, or Kullback–Leibler divergence, between these two beta distributions is
(measured in nats)
If 1 < β < α then the order of the inequalities is reversed. For α > 1 and β > 1 the absolute distance between the
mean and the median is less than 5% of the distance between the maximum and minimum values of x. On the other
hand, the absolute distance between the mean and the mode can reach 50% of the distance between the maximum
and minimum values of x, for the (pathological) case of α ≈ 1 and β ≈ 1 (for which values the beta distribution
approaches the uniform distribution and the differential entropy approaches its maximum value, and hence maximum
"disorder").
For example, for α = 1.0001 and β = 1.00000001:
• mode = 0.9999; PDF(mode) = 1.00010
• mean = 0.500025; PDF(mean) = 1.00003
or, equivalently,
At a time when there were no powerful digital computers, Karl Pearson accurately computed further
boundaries[18][8], for example, separating the "U-shaped" from the "J-shaped" distributions. The lower boundary line
(excess kurtosis + 2 − skewness² = 0) is produced by "U-shaped" beta distributions with values of shape parameters α
and β close to zero. The upper boundary line (excess kurtosis − (3/2) skewness² = 0) is produced by extremely
skewed distributions with very large values of one of the parameters and very small values of the other parameter.
An example of a beta distribution near the upper boundary (excess kurtosis − (3/2) skewness² = 0) is given by α = 0.1,
β = 1000, for which the ratio (excess kurtosis)/(skewness²) = 1.49835 approaches the upper limit of 1.5 from below.
An example of a beta distribution near the lower boundary (excess kurtosis + 2 − skewness² = 0) is given by α =
0.0001, β = 0.1, for which values the expression (excess kurtosis + 2)/(skewness²) = 1.01621 approaches the lower limit
of 1 from above. In the infinitesimal limit for both α and β approaching zero symmetrically, the excess kurtosis
reaches its minimum value at −2. This minimum value occurs at the point at which the lower boundary line intersects
the vertical axis (ordinate). (Note, however, that in Pearson's original chart, the ordinate is kurtosis, instead of excess
kurtosis, and that it increases downwards rather than upwards.)
Values for the skewness and excess kurtosis below the lower boundary (excess kurtosis + 2 − skewness² = 0) cannot
occur for any distribution, and hence Karl Pearson appropriately called the region below this boundary the
"impossible region." The boundary for this "impossible region" is determined by (symmetric or skewed) bimodal
"U"-shaped distributions for which parameters α and β approach zero and hence all the probability density is
concentrated at each end: x = 0 and x = 1, with practically nothing in between them. Since for α ≈ β ≈ 0 the
probability density is concentrated at the two ends x = 0 and x = 1, this "impossible boundary" is determined by a
2-point distribution: the probability can only take 2 values (Bernoulli distribution), one value with probability p and
the other with probability q = 1 − p. For cases approaching this limit boundary with symmetry α = β, skewness ≈ 0,
excess kurtosis ≈ −2 (this is the lowest excess kurtosis possible for any distribution), and the probabilities are p ≈ q ≈
1/2. For cases approaching this limit boundary with skewness, excess kurtosis ≈ −2 + skewness², and the probability
density is concentrated more at one end than the other end (with practically nothing in between), with probability q at
the left end x = 0 and probability p at the right end x = 1.
Symmetry
All statements are conditional on α > 0 and β > 0
• Probability density function
• Variance symmetry
• Skewness skew-symmetry
• Characteristic function symmetry of Real part (with respect to the origin of variable "t")
• Characteristic function skew-symmetry of Imaginary part (with respect to the origin of variable "t")
Shapes
The beta density function can take a wide variety of different shapes depending on the values of the two parameters
α and β:
Symmetric
• the density function is symmetric about 1/2 (blue & teal plots).
•
•
• is U-shaped (blue plot).
•
[5]
•
•
• is a 2 point Bernoulli distribution with equal probability 1/2 at each Dirac delta
function end x = 0 and x = 1 and zero probability everywhere else. A coin toss: one face of the coin
being x = 0 and the other face being x = 1.
to reach.
• The differential entropy approaches a minimum value of
• is the arcsine distribution
•
•
• is the uniform [0,1] distribution
•
•
•
• The (negative anywhere else) differential entropy reaches its maximum value of zero
• is symmetric unimodal
•
[5]
•
•
• is a semi-elliptic [0,1] distribution, see: Wigner semicircle distribution
•
•
• is the parabolic [0,1] distribution
•
•
• is bell-shaped, with inflection points located to either side of the mode
•
•
• is a 1 point Degenerate distribution with a Dirac delta function spike at the
midpoint x = 1/2 with probability 1, and zero probability everywhere else. There is 100%
probability (absolute certainty) concentrated at the single point x = 1/2.
•
• The differential entropy approaches a minimum value of
Skewed
• the density function is skewed. An interchange of parameter values yields the mirror image (the reverse) of
the initial curve.
• is skewed U-shaped. Positive skew for α < β, negative skew for α > β.
•
•
•
• is skewed unimodal (magenta & cyan plots). Positive skew for α < β, negative skew for α >
β.
•
•
•
• is reverse J-shaped with a right tail, positively skewed, strictly decreasing, strictly convex
•
•
• (maximum variance occurs for , or α=Φ the golden
ratio conjugate)
• is positively skewed, strictly decreasing (red plot), a reversed (mirror-image) power function
[0,1] distribution
•
• is strictly concave
•
•
• is a straight line with slope -2, the right-triangular distribution with right angle at the
left end, at x = 0
•
•
• is reverse J-shaped with a right tail, strictly convex
•
•
• is J-shaped with a left tail, negatively skewed, strictly increasing, strictly convex
•
•
• (maximum variance occurs for , or β=Φ the golden
ratio conjugate)
• is negatively skewed, strictly increasing (green plot), the power function [0,1] distribution[10]
•
• is strictly concave
•
•
• is a straight line with slope +2, the right-triangular distribution with right angle at the
right end, at x = 1
•
•
• is J-shaped with a left tail, strictly convex
•
•
Parameter estimation
Method of moments
Using the method of moments, with the first two moments (sample mean x̄ and sample variance v̄), let:
α̂ = x̄ [ x̄(1 − x̄)/v̄ − 1 ], conditional on v̄ < x̄(1 − x̄)
β̂ = (1 − x̄) [ x̄(1 − x̄)/v̄ − 1 ], conditional on v̄ < x̄(1 − x̄)
When the distribution is required over a known interval other than [0, 1], say [a, c], then replace x̄ with (x̄ − a)/(c − a)
and v̄ with v̄/(c − a)² in the above equations (see "Alternative parametrizations, four parameters" section
below).[23]
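A sketch of these moment estimates on the unit interval (illustrative function name; test data generated with the standard-library `random.betavariate`):

```python
import random

def beta_method_of_moments(xs):
    """Moment estimates of (alpha, beta) from the sample mean and variance.
    Valid only when the sample variance is below mean*(1 - mean)."""
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / n
    if v >= m * (1 - m):
        raise ValueError("sample variance too large for a beta fit")
    c = m * (1 - m) / v - 1
    return m * c, (1 - m) * c

rng = random.Random(3)
xs = [rng.betavariate(2.0, 3.0) for _ in range(100_000)]
a_hat, b_hat = beta_method_of_moments(xs)
print(a_hat, b_hat)  # near (2, 3)
```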
Maximum likelihood
As is also the case for maximum likelihood estimates for the gamma distribution, the maximum likelihood
estimates for the beta distribution do not have a closed form solution for arbitrary values of the shape parameters. If
X_1, …, X_N are independent random variables each having a beta distribution, the following system of coupled
maximum likelihood estimate equations (for the average log-likelihoods) needs to be inverted to obtain the
(unknown) shape parameter estimates α̂, β̂ in terms of the (known) average of logarithms of the samples:[5]
ψ(α̂) − ψ(α̂ + β̂) = (1/N) Σ ln X_i
ψ(β̂) − ψ(α̂ + β̂) = (1/N) Σ ln(1 − X_i)
where ψ is the digamma function. These coupled equations containing digamma functions of the shape parameter
estimates must be solved by numerical methods as done, for example, by Beckman et al.[24]
Gnanadesikan et al. give numerical solutions for a few cases.[25] N.L. Johnson and S. Kotz[5] suggest that for "not too
small" shape parameter estimates, the logarithmic approximation to the digamma function
ψ(x) ≈ ln(x − 1/2) may be used to obtain initial values for an iterative solution, since the equations resulting
from this approximation can be solved exactly:
α̂ ≈ 1/2 + Ĝ_X / [2(1 − Ĝ_X − Ĝ_(1−X))]
β̂ ≈ 1/2 + Ĝ_(1−X) / [2(1 − Ĝ_X − Ĝ_(1−X))]
where Ĝ_X = exp[(1/N) Σ ln X_i] is the geometric mean of the sample and Ĝ_(1−X) is the geometric mean of the (1 − X_i).
More readily, and perhaps more accurately, the estimates provided by the method of moments can instead be used as
initial values for an iterative solution of the maximum likelihood coupled equations in terms of the digamma
functions.
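The closed-form initial values from the logarithmic approximation to the digamma function can be sketched as follows (assuming the geometric-mean form α̂ ≈ 1/2 + Ĝ_X/[2(1 − Ĝ_X − Ĝ_(1−X))] and its counterpart for β̂; helper names are illustrative). As an approximation it carries some bias, so the values are starting points, not final estimates:

```python
import math
import random

def beta_mle_init(xs):
    """Initial shape-parameter values from the psi(x) ~ ln(x - 1/2) approximation,
    using the geometric means of the X_i and of the (1 - X_i)."""
    n = len(xs)
    gx = math.exp(sum(math.log(x) for x in xs) / n)
    g1x = math.exp(sum(math.log(1 - x) for x in xs) / n)
    d = 2 * (1 - gx - g1x)
    return 0.5 + gx / d, 0.5 + g1x / d

rng = random.Random(4)
xs = [rng.betavariate(2.0, 3.0) for _ in range(100_000)]
a0, b0 = beta_mle_init(xs)
print(a0, b0)  # roughly (2, 3); slightly above the true values in this example
```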
When the distribution is required over a known interval other than [0, 1], say [a, c], then replace ln X_i in the
first equation with ln[(X_i − a)/(c − a)], and replace ln(1 − X_i) in the second equation with ln[(c − X_i)/(c − a)]
(see "Alternative parametrizations, four parameters" section below).
If, for example, is known, the unknown parameter is provided, exactly, by the inverse[26] of the digamma
function:
Note that this is the logarithm of the transformation by inversion of the random variable, X/(1 − X), that transforms a
beta distribution to the inverted beta distribution or beta prime distribution (also known as beta distribution of the
second kind or Pearson's Type VI).
In particular, if one of the shape parameters has a value of unity, for example for (the power function
distribution with bounded support [0,1]), using the recurrence relation , the maximum
Related distributions
Transformations
• If X ~ Beta(α, β) then 1 − X ~ Beta(β, α): mirror-image symmetry
• If X ~ Beta(α, β) then X/(1 − X) ~ β′(α, β). The beta prime distribution, also called "beta distribution of the second kind".
• If X ~ Beta(n/2, m/2) then mX/(n(1 − X)) ~ F(n, m) (assuming n > 0 and m > 0). The Fisher–Snedecor F distribution.
• If X ~ Beta(1 + λ(m − a)/(c − a), 1 + λ(c − m)/(c − a)) then a + X(c − a) ~ PERT(a, c, m, λ), where PERT denotes a distribution used in PERT analysis, and m = most likely value.[29][30] Traditionally λ = 4 in PERT analysis.
• If X ~ Beta(α, 1) then X ~ Kumaraswamy distribution with parameters (α, 1)
• If X ~ Beta(1, β) then X ~ Kumaraswamy distribution with parameters (1, β)
• If X ~ Beta(α, 1) then −ln X ~ Exponential(α)
• If X ~ Gamma(α, θ) and Y ~ Gamma(β, θ) are independent then X/(X + Y) ~ Beta(α, β)
• If X ~ χ²(2α) and Y ~ χ²(2β) are independent then X/(X + Y) ~ Beta(α, β)
• If X ~ U(0, 1) and α > 0 then X^(1/α) ~ Beta(α, 1). The power function distribution.
Generalisations
• The Dirichlet distribution is a multivariate generalization of the beta distribution. Univariate marginals of the
Dirichlet distribution have a beta distribution.
• The beta distribution is equivalent to the values that make the Pearson type I distribution a proper probability
distribution.
• The noncentral beta distribution
Applications
Order statistics
The beta distribution has an important application in the theory of order statistics. A basic result is that the distribution of the k-th smallest of a sample of size n from a continuous uniform distribution has a beta distribution.[28]
This result is summarized as:

U_(k) ~ Beta(k, n + 1 − k)
From this, and application of the theory related to the probability integral transform, the distribution of any
individual order statistic from any continuous distribution can be derived.[28]
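A quick Monte Carlo check of this result (the sample size n = 5 and order-statistic index k = 2 are illustrative choices of ours): the 2nd smallest of 5 uniform draws should follow Beta(2, 4), whose mean is 2/6 and whose variance is 2·4/(6²·7).

```python
import random

random.seed(0)
n, k = 5, 2                      # sample size and order-statistic index
trials = 200000
draws = []
for _ in range(trials):
    u = sorted(random.random() for _ in range(n))
    draws.append(u[k - 1])       # k-th smallest value

emp_mean = sum(draws) / trials
emp_var = sum((d - emp_mean) ** 2 for d in draws) / trials

alpha, beta = k, n + 1 - k       # Beta(k, n + 1 - k)
exact_mean = alpha / (alpha + beta)                                    # 1/3
exact_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))  # 8/252
```

The empirical mean and variance of the simulated order statistic should agree with the Beta(2, 4) moments to within Monte Carlo error.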
Rule of succession
A classic application of the beta distribution is the rule of succession, introduced in the 18th century by Pierre-Simon
Laplace in the course of treating the sunrise problem. It states that, given s successes in n conditionally independent
Bernoulli trials with probability p, that p should be estimated as (s + 1)/(n + 2). This estimate may be regarded as the
expected value of the posterior distribution over p, namely Beta(s + 1, n − s + 1), which is given by Bayes' rule if one
assumes a uniform prior over p (i.e., Beta(1, 1)) and then observes that p generated s successes in n trials.
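The rule is a one-liner once it is read as a posterior mean (a sketch; the function name is ours):

```python
def laplace_rule_of_succession(s, n):
    """Posterior mean of p under a uniform Beta(1, 1) prior,
    after s successes in n Bernoulli trials."""
    alpha_post, beta_post = s + 1, (n - s) + 1     # posterior Beta(s+1, n-s+1)
    return alpha_post / (alpha_post + beta_post)   # equals (s + 1) / (n + 2)

# Sunrise-problem flavour: 10 successes in 10 trials gives 11/12, not 1.
estimate = laplace_rule_of_succession(10, 10)
```

Note that with no data at all (s = 0, n = 0) the rule returns 1/2, the mean of the uniform prior.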
Bayesian inference
Beta distributions are used extensively in Bayesian inference, since beta distributions provide a family of conjugate
prior distributions for binomial (including Bernoulli) and geometric distributions. The Beta(0,0) distribution is an improper prior and is sometimes used to represent ignorance of parameter values.
The domain of the beta distribution can be viewed as a probability, and in fact the beta distribution is often used to
describe the distribution of an unknown probability value — typically, as the prior distribution over a probability
parameter, such as the probability of success in a binomial distribution or Bernoulli distribution. In fact, the beta
distribution is the conjugate prior of the binomial distribution and Bernoulli distribution.
The beta distribution is the special case of the Dirichlet distribution with only two parameters, and the beta is
conjugate to the binomial and Bernoulli distributions in exactly the same way as the Dirichlet distribution is
conjugate to the multinomial distribution and categorical distribution.
In Bayesian inference, the beta distribution can be derived as the posterior probability of the parameter p of a
binomial distribution after observing α − 1 successes (with probability p of success) and β − 1 failures (with
probability 1 − p of failure). Another way to express this is that placing a prior distribution of Beta(α,β) on the
parameter p of a binomial distribution is equivalent to adding α pseudo-observations of "success" and β
pseudo-observations of "failure" to the actual number of successes and failures observed, then estimating the
parameter p by the proportion of successes over both real- and pseudo-observations. If α and β are greater than 0,
this has the effect of smoothing out the distribution of the parameters by ensuring that some positive probability
mass is assigned to all parameter values, even when no actual observations corresponding to those values have been made.
Values of α and β less than 1 favor sparsity, i.e. distributions where the parameter p is close to either 0 or 1. In effect,
α and β, when operating together, function as a concentration parameter; see that article for more details.
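The pseudo-observation reading of the conjugate update can be sketched directly (function names are ours):

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugate update: a Beta(alpha, beta) prior plus binomial data
    gives a Beta(alpha + s, beta + f) posterior."""
    return alpha + successes, beta + failures

def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# A Beta(2, 2) prior acts like 2 pseudo-successes and 2 pseudo-failures:
a, b = beta_binomial_update(2.0, 2.0, successes=7, failures=3)
smoothed = posterior_mean(a, b)   # (2 + 7) / (2 + 2 + 7 + 3)
```

The smoothed estimate 9/14 sits between the raw proportion 7/10 and the prior mean 1/2, exactly the "proportion of successes over both real and pseudo-observations" described above.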
Subjective logic
In standard logic, propositions are considered to be either true or false. In contradistinction, subjective logic assumes
that humans cannot determine with absolute certainty whether a proposition about the real world is absolutely true or
false. In subjective logic the posterior probability estimates of binary events can be represented by beta distributions.[31]
Wavelet analysis
A wavelet is a wave-like oscillation with an amplitude that starts out at zero, increases, and then decreases back to
zero. It can typically be visualized as a "brief oscillation" that promptly decays. Wavelets can be used to extract
information from many different kinds of data, including – but certainly not limited to – audio signals and images.
Thus, wavelets are purposefully crafted to have specific properties that make them useful for signal processing.
Wavelets are localized in both time and frequency whereas the standard Fourier transform is only localized in
frequency. Therefore, standard Fourier Transforms are only applicable to stationary processes, while wavelets are
applicable to non-stationary processes. Continuous wavelets can be constructed based on the beta distribution. Beta
wavelets[32] can be viewed as a soft variety of Haar wavelets whose shape is fine-tuned by two shape parameters α
and β .
In project management applications, the mean and standard deviation of a beta distribution defined on the interval [a, c] are commonly approximated by the PERT three-point estimates:

μ(X) ≈ (a + 4b + c)/6

where a is the minimum, c is the maximum, and b is the most likely value (the mode for α > 1 and β > 1).
The above estimate for the mean is known as the PERT three-point estimation and it is exact for either of the following values of β (for arbitrary α within these ranges):

β = α > 1 (symmetric case, with excess kurtosis = −6/(2α + 3))

or

β = 6 − α for 1 < α < 5

σ(X) ≈ (c − a)/6

The above estimate for the standard deviation is exact for either of the following values of α and β:

α = β = 4 (symmetric case, with excess kurtosis = −6/11)
or
β = 6 − α and α = 3 − √2
or
β = 6 − α and α = 3 + √2

Otherwise, these can be poor approximations for beta distributions with other values of α and β.
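The exactness claim can be checked numerically for one of the stated cases, α = β = 4 (the interval endpoints below are illustrative values of ours):

```python
def beta4_mean(alpha, beta, a, c):
    """Exact mean of a beta distribution rescaled to [a, c]."""
    return a + (c - a) * alpha / (alpha + beta)

def beta4_std(alpha, beta, a, c):
    """Exact standard deviation of a beta distribution on [a, c]."""
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return (c - a) * var ** 0.5

def beta4_mode(alpha, beta, a, c):
    """Mode, defined for alpha > 1 and beta > 1."""
    return a + (c - a) * (alpha - 1) / (alpha + beta - 2)

a, c = 10.0, 70.0
alpha = beta = 4.0                 # one of the exactness cases
b = beta4_mode(alpha, beta, a, c)  # most likely value

pert_mean = (a + 4 * b + c) / 6    # PERT three-point mean estimate
pert_std = (c - a) / 6             # PERT standard-deviation estimate
```

For α = β = 4 on [10, 70] both PERT estimates coincide with the exact beta moments (mean 40, standard deviation 10); repeating the comparison with, say, α = 2, β = 3 shows the approximation error the text warns about.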
Alternative parametrizations
Two parameters
Mean and sample size
The beta distribution may be reparametrized in terms of its mean μ = α/(α + β) and "sample size" ν = α + β (so that α = μν and β = (1 − μ)ν). Under this parametrization, one can place a uniform prior over the mean, and a vague prior (such as an exponential or gamma distribution) over the positive reals for the sample size.
Mean (allele frequency) and (Wright's) genetic distance between two populations
The Balding–Nichols model is a two-parameter parametrization of the beta distribution used in population genetics.
It is a statistical description of the allele frequencies in the components of a sub-divided population:
See the articles Balding–Nichols model, F-statistics, fixation index and coefficient of relationship, for further
information.
This parametrization of the beta distribution may lead to a more intuitive understanding (than the one based on the original parameters α and β), for example, by expressing the mode, skewness, excess kurtosis and differential entropy in terms of these parameters.
Four parameters
A beta distribution with the two shape parameters α and β is supported on the range [0, 1]. It is possible to alter the location and scale of the distribution by introducing two further parameters representing the minimum, a, and maximum, c, of the distribution,[5] by a linear transformation substituting the non-dimensional variable x in terms of the new variable y (with support [a, c]) and the parameters a and c:

x = (y − a)/(c − a)

The probability density function of the four-parameter beta distribution is then given by

f(y; α, β, a, c) = (y − a)^(α−1) (c − y)^(β−1) / ( (c − a)^(α+β−1) B(α, β) )

The mean, mode and variance of the four-parameter beta distribution are:

mean = a + (c − a) α/(α + β)
mode = a + (c − a) (α − 1)/(α + β − 2), for α, β > 1
variance = (c − a)² αβ / ( (α + β)² (α + β + 1) )
Since the skewness and excess kurtosis are non-dimensional quantities (as moments normalized by the standard deviation), they are independent of the parameters a and c, and therefore equal to the expressions given above in terms of X (with support [0, 1]).
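The density of the rescaled distribution follows directly from the change of variables (a sketch using only the standard library; the parameter values in the sanity check are our own):

```python
import math

def beta4_pdf(y, alpha, beta, a, c):
    """Density of the four-parameter beta distribution on [a, c]."""
    if not a < y < c:
        return 0.0
    log_b = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    x = (y - a) / (c - a)                       # map back to [0, 1]
    return math.exp((alpha - 1) * math.log(x)
                    + (beta - 1) * math.log(1 - x) - log_b) / (c - a)

# Numerical sanity check (alpha = 2, beta = 3 on [1, 5]): the density
# integrates to ~1 and its mean is a + (c - a) * alpha / (alpha + beta).
a, c, alpha, beta = 1.0, 5.0, 2.0, 3.0
steps = 50000
h = (c - a) / steps
grid = [a + (i + 0.5) * h for i in range(steps)]
total = sum(beta4_pdf(y, alpha, beta, a, c) for y in grid) * h
mean = sum(y * beta4_pdf(y, alpha, beta, a, c) for y in grid) * h
```

The midpoint-rule integral recovers a total mass of 1 and the mean 1 + 4·(2/5) = 2.6 predicted by the formula above.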
History
The first systematic, modern discussion of the beta distribution is probably due to Karl Pearson FRS[37] (27 March 1857 – 27 April 1936[38]), an influential English mathematician who has been credited with establishing the discipline of mathematical statistics.[39] In Pearson's papers[18] the beta distribution is couched as a solution of a differential equation: Pearson's Type I distribution. The beta distribution is essentially identical to Pearson's Type I distribution for Pearson's parameter values for which Pearson's differential equation solution becomes a proper statistical distribution (with area under the probability density equal to 1). In fact, in several English books and journal articles in the few decades prior to World War II, it was common to refer to the beta distribution as Pearson's Type I distribution. According to David and Edwards's comprehensive treatise on the history of statistics,[40] the first modern treatment of the beta distribution,[41] using the beta designation that has become standard, is due to Corrado Gini (May 23, 1884 – March 13, 1965), an Italian statistician, demographer, and sociologist, who developed the Gini coefficient.

[Figure: Karl Pearson analyzed the beta distribution as the solution "Type I" of Pearson distributions]
References
[1] Oguamanam, D.C.D.; Martin, H.R. , Huissoon, J.P. (1995). "On the application of the beta distribution to gear damage analysis". Applied
Acoustics 45 (3): 247–261. doi:10.1016/0003-682X(95)00001-P.
[2] Sulaiman, M.Yusof; W.M Hlaing Oo, Mahdi Abd Wahab, Azmi Zakaria (December 1999). "Application of beta distribution model to
Malaysian sunshine data". Renewable Energy 18 (4): 573–579. doi:10.1016/S0960-1481(99)00002-6.
[3] Haskett, Jonathan D.; Yakov A. Pachepsky, Basil Acock (1995). "Use of the beta distribution for parameterizing variability of soil properties
at the regional level for crop yield estimation". Agricultural Systems 48 (1): 73–86. doi:10.1016/0308-521X(95)93646-U.
[4] Gullco, Robert S.; Malcolm Anderson (December 2009). "Use of the Beta Distribution To Determine Well-Log Shale Parameters". SPE
Reservoir Evaluation & Engineering 12 (6): 929-942. doi:10.2118/106746-PA.
[5] Johnson, Norman L.; Kotz, Samuel; Balakrishnan, N. (1995). "Chapter 21:Beta Distributions". Continuous Univariate Distributions Vol. 2
(2nd ed.). Wiley. ISBN 978-0-471-58494-0.
[6] Keeping, E. S. (2010). Introduction to Statistical Inference. Dover Publications. 462 pages. ISBN 978-0486685021.
[7] Wadsworth, George P. and Joseph Bryan (1960). Introduction to Probability and Random Variables. McGraw-Hill. pp. 101.
[8] Hahn, Gerald J. and S. Shapiro (1994). Statistical Models in Engineering (Wiley Classics Library). Wiley-Interscience. pp. 376.
ISBN 978-0471040651.
[9] Feller, William (1971). An Introduction to Probability Theory and Its Applications, Vol. 2. Wiley. pp. 669. ISBN 978-0471257097.
[10] Gupta (Editor), Arjun K. (2004). Handbook of Beta Distribution and Its Applications. CRC Press. pp. 42. ISBN 978-0824753962.
[11] Panik, Michael J (2005). Advanced Statistics from an Elementary Point of View. Academic Press. pp. 273. ISBN 978-0120884940.
[12] Rose, Colin, and Murray D. Smith (2002). Mathematical Statistics with MATHEMATICA. Springer. 496 pages. ISBN 978-0387952345.
[13] Kerman J (2011) "A closed-form approximation for the median of the beta distribution". arXiv:1111.0433v1
[14] Liang, Zhiqiang; Jianming Wei, Junyu Zhao, Haitao Liu, Baoqing Li, Jie Shen and Chunlei Zheng (27 August 2008). "The Statistical
Meaning of Kurtosis and Its New Application to Identification of Persons Based on Seismic Signals". Sensors 8: 5106–5119.
doi:10.3390/s8085106.
[15] Kenney, J. F., and E. S. Keeping (1951). Mathematics of Statistics Part Two, 2nd edition. D. Van Nostrand Company Inc. pp. 429.
[16] Abramowitz, Milton and Irene A. Stegun (1965). Handbook Of Mathematical Functions With Formulas, Graphs, And Mathematical Tables. Dover. 1046 pages. ISBN 978-0486612720.
[17] Weisstein, Eric W. "Kurtosis" (http://mathworld.wolfram.com/Kurtosis.html). MathWorld – A Wolfram Web Resource. Retrieved 13 August 2012.
[18] Pearson, Karl (1916). "Mathematical contributions to the theory of evolution, XIX: Second supplement to a memoir on skew variation".
Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 216
(538–548): 429–457. Bibcode 1916RSPTA.216..429P. doi:10.1098/rsta.1916.0009. JSTOR 91092.
[19] Gradshteyn, I. S. , and I. M. Ryzhik (2000). Table of Integrals, Series, and Products, 6th edition. Academic Press. pp. 1163.
ISBN 978-0122947575.
[20] A. C. G. Verdugo Lazo and P. N. Rathie. "On the entropy of continuous probability distributions," IEEE Trans. Inf. Theory, IT-24:120–122,
1978.
[21] Shannon, Claude E., "A Mathematical Theory of Communication," Bell System Technical Journal, 27 (4):623–656,1948. PDF (http:/ /
www. alcatel-lucent. com/ bstj/ vol27-1948/ articles/ bstj27-4-623. pdf)
[22] Pearson, Egon S. (July 1969). "Some historical reflections traced through the development of the use of frequency curves" (http:/ / www.
smu. edu/ Dedman/ Academics/ Departments/ Statistics/ Research/ TechnicalReports). THEMIS Statistical Analysis Research Program,
Technical Report 38 Office of Naval Research, Contract N000014-68-A-0515 (Project NR 042-260): 23. .
[23] Engineering Statistics Handbook (http:/ / www. itl. nist. gov/ div898/ handbook/ eda/ section3/ eda366h. htm)
[24] Beckman, R. J.; G. L. Tietjen (1978). "Maximum likelihood estimation for the beta distribution". Journal of Statistical Computation and
Simulation 7 (3-4): 253-258. doi:10.1080/00949657808810232.
[25] Gnanadesikan, R., Pinkham and Hughes (1967). "Maximum likelihood estimation of the parameters of the beta distribution from smallest order statistics". Technometrics 9: 607–620.
[26] Fackler, Paul. "Inverse Digamma Function (Matlab)" (http:/ / hips. seas. harvard. edu/ content/ inverse-digamma-function-matlab). Harvard
University School of Engineering and Applied Sciences. . Retrieved 08/18/2012.
[27] van der Waerden, B. L., "Mathematical Statistics", Springer, ISBN 978-3-540-04507-6.
[28] David, H. A., Nagaraja, H. N. (2003) Order Statistics (3rd Edition). Wiley, New Jersey pp 458. ISBN 0-471-38926-9
[29] Herrerías-Velasco, José Manuel and Herrerías-Pleguezuelo, Rafael and René van Dorp, Johan. (2011). Revisiting the PERT mean and
Variance. European Journal of Operational Research (210), p. 448–451.
[30] Malcolm, D. G.; Roseboom, C. E., Clark, C. E. and Fazar, W., (1959). "Application of a technique for research and development program
evaluation". Operations Research 7: 646–649.
[31] A. Jøsang. A Logic for Uncertain Probabilities. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 9(3),
pp.279-311, June 2001. PDF (http:/ / www. unik. no/ people/ josang/ papers/ Jos2001-IJUFKS. pdf)
[32] H.M. de Oliveira and G.A.A. Araújo,. Compactly Supported One-cyclic Wavelets Derived from Beta Distributions. Journal of
Communication and Information Systems. vol.20, n.3, pp.27-33, 2005.
[33] Keefer, Donald L. and Verdini, William A. (1993). Better Estimation of PERT Activity Time Parameters. Management Science 39(9), p.
1086–1091.
[34] Keefer, Donald L. and Bodily, Samuel E. (1983). Three-point Approximations for Continuous Random variables. Management Science
29(5), p. 595–609.
[35] DRMI Newsletter, Issue 12, April 8, 2005 (http:/ / www. nps. edu/ drmi/ docs/ 1apr05-newsletter. pdf)
[36] Kruschke, J. (2011). Doing Bayesian data analysis: A tutorial with R and BUGS. Academic Press / Elsevier ISBN 978-0123814852 (p. 83)
[37] Yule, G. U.; Filon, L. N. G. (1936). "Karl Pearson. 1857-1936". Obituary Notices of Fellows of the Royal Society 2 (5): 72.
doi:10.1098/rsbm.1936.0007. JSTOR 769130.
[38] "Library and Archive catalogue" (http:/ / www2. royalsociety. org/ DServe/ dserve. exe?dsqIni=Dserve. ini& dsqApp=Archive&
dsqCmd=Show. tcl& dsqDb=Persons& dsqPos=0& dsqSearch=((text)=' Pearson: Karl (1857 - 1936) '))). Sackler Digital Archive. Royal
Society. . Retrieved 2011-07-01.
[39] "Karl Pearson sesquicentenary conference" (http:/ / www. economics. soton. ac. uk/ staff/ aldrich/ KP150. htm). Royal Statistical Society.
2007-03-03. . Retrieved 2008-07-25.
[40] David, H. A. and A.W.F. Edwards (2001). Annotated Readings in the History of Statistics. Springer; 1 edition. pp. 252.
ISBN 978-0387988443.
[41] Gini, Corrado (1911). Studi Economico-Giuridici della Università de Cagliari Anno III (reproduced in Metron 15, 133,171, 1949): 5-41.
External links
• Weisstein, Eric W., " Beta Distribution (http://mathworld.wolfram.com/BetaDistribution.html)" from
MathWorld.
• "Beta Distribution" (http://demonstrations.wolfram.com/BetaDistribution/) by Fiona Maclachlan, the Wolfram
Demonstrations Project, 2007.
• Beta Distribution – Overview and Example (http://www.xycoon.com/beta.htm), xycoon.com
• Beta Distribution (http://www.brighton-webs.co.uk/distributions/beta.htm), brighton-webs.co.uk
Beta function
In mathematics, the beta function, also called the Euler integral of the first kind, is a special function defined by

B(x, y) = ∫₀¹ t^(x−1) (1 − t)^(y−1) dt

for Re(x) > 0, Re(y) > 0.
The beta function was studied by Euler and Legendre and was given its name by Jacques Binet; its symbol Β is a
Greek capital β rather than the similar Latin capital B.
Properties
The beta function is symmetric, meaning that

B(x, y) = B(y, x)[1]

A key property of the beta function is its relationship to the gamma function:

B(x, y) = Γ(x) Γ(y) / Γ(x + y)

When x and y are positive integers, it follows trivially from the definition of the gamma function that:[1]

B(x, y) = (x − 1)! (y − 1)! / (x + y − 1)!

The beta function has many other forms, including:[2]

B(x, y) = 2 ∫₀^(π/2) (sin θ)^(2x−1) (cos θ)^(2y−1) dθ, for Re(x) > 0, Re(y) > 0
B(x, y) · (t ↦ t₊^(x+y−1)) = (t ↦ t₊^(x−1)) * (t ↦ t₊^(y−1))[2]

where t ↦ t₊^x is a truncated power function and the star denotes convolution. The trigonometric integral, evaluated at x = y = 1/2, shows in particular that Γ(1/2) = √π. Some of these identities, e.g. the trigonometric formula, can be applied to deriving the volume of an n-ball in Cartesian coordinates.
Euler's integral for the beta function may be converted into an integral over the Pochhammer contour C as

(1 − e^(2πiα)) (1 − e^(2πiβ)) B(α, β) = ∫_C t^(α−1) (1 − t)^(β−1) dt

This Pochhammer contour integral converges for all values of α and β and so gives the analytic continuation of the beta function.
Just as the gamma function for integers describes factorials, the beta function can define a binomial coefficient after adjusting indices:

C(n, k) = 1 / ( (n + 1) B(n − k + 1, k + 1) )

Moreover, for integer n, this can be integrated to give a closed form, an interpolation function for continuous values of k:

C(n, k) = (−1)^n n! · sin(πk) / ( π ∏_{i=0}^{n} (k − i) )
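The first of these identities is easy to verify numerically (a sketch; the helper names are ours, and `math.lgamma` is used to avoid overflow):

```python
import math

def beta(x, y):
    """Complete beta function via log-gamma: B(x, y) = G(x)G(y)/G(x + y)."""
    return math.exp(math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y))

def binomial_via_beta(n, k):
    """C(n, k) = 1 / ((n + 1) * B(n - k + 1, k + 1))."""
    return 1.0 / ((n + 1) * beta(n - k + 1, k + 1))

value = binomial_via_beta(10, 3)   # close to C(10, 3) = 120
```

Rounding the floating-point result reproduces the exact integer binomial coefficients for small n.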
The beta function was the first known scattering amplitude in string theory, first conjectured by Gabriele Veneziano.
It also occurs in the theory of the preferential attachment process, a type of stochastic urn process.
Relationship to the gamma function
To derive the relation B(x, y) = Γ(x) Γ(y) / Γ(x + y), write Γ(x) Γ(y) as a double integral and change variables (u = st, v = s(1 − t)):

Γ(x) Γ(y) = ∫₀^∞ ∫₀^∞ u^(x−1) e^(−u) v^(y−1) e^(−v) du dv = ∫₀^∞ s^(x+y−1) e^(−s) ds · ∫₀¹ t^(x−1) (1 − t)^(y−1) dt

Hence Γ(x) Γ(y) = Γ(x + y) B(x, y).
The stated identity may be seen as a particular case of the identity for the integral of a convolution. Taking f(u) := e^(−u) u^(x−1) 1_(u ≥ 0) and g(u) := e^(−u) u^(y−1) 1_(u ≥ 0), one has Γ(x) Γ(y) = ∫ f · ∫ g = ∫ (f * g), which again yields Γ(x + y) B(x, y).
Derivatives
We have

∂/∂x B(x, y) = B(x, y) ( ψ(x) − ψ(x + y) )

where ψ(x) denotes the digamma function.
Integrals
The Nörlund–Rice integral is a contour integral involving the beta function.
Approximation
Stirling's approximation gives the asymptotic formula

B(x, y) ~ √(2π) · x^(x − 1/2) y^(y − 1/2) / (x + y)^(x + y − 1/2)

for large x and large y. If on the other hand x is large and y is fixed, then

B(x, y) ~ Γ(y) x^(−y)
Incomplete beta function
The incomplete beta function, a generalization of the beta function, is defined as

B(x; a, b) = ∫₀^x t^(a−1) (1 − t)^(b−1) dt

For x = 1, the incomplete beta function coincides with the complete beta function. The relationship between the two functions is like that between the gamma function and its generalization the incomplete gamma function.
The regularized incomplete beta function (or regularized beta function for short) is defined in terms of the incomplete beta function and the complete beta function:

I_x(a, b) = B(x; a, b) / B(a, b)

Working out the integral (one can use integration by parts) for integer values of a and b, one finds:

I_x(a, b) = Σ_{j=a}^{a+b−1} [ (a + b − 1)! / ( j! (a + b − 1 − j)! ) ] x^j (1 − x)^(a+b−1−j)

The regularized incomplete beta function is the cumulative distribution function of the beta distribution, and is related to the cumulative distribution function of a random variable X from a binomial distribution, where the "probability of success" is p and the sample size is n:

F(k; n, p) = Pr(X ≤ k) = I_(1−p)(n − k, k + 1) = 1 − I_p(k + 1, n − k)
Properties

I_0(a, b) = 0
I_1(a, b) = 1
I_x(a, b) = 1 − I_(1−x)(b, a)
Calculation
Even if unavailable directly, the complete and incomplete Beta function values can be calculated using functions
commonly included in spreadsheet or Computer algebra systems. With Excel as an example, using the GammaLn
and (cumulative) Beta distribution functions, we have:
Complete Beta Value = Exp(GammaLn(a) + GammaLn(b) - GammaLn(a + b))
and,
Incomplete Beta Value = BetaDist(x, a, b) * Exp(GammaLn(a) + GammaLn(b) - GammaLn(a + b)).
These result from rearranging the formulae relating the beta distribution to the incomplete and complete beta functions, with the complete beta function expressed via log-gamma as above.
Similarly, in MATLAB and GNU Octave, betainc (incomplete beta function) computes the regularized incomplete beta function (which is, in fact, the cumulative beta distribution), and so, to get the actual incomplete beta function, one must multiply the result of betainc by the result returned by the corresponding beta function.
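The same log-gamma recipe used in the Excel formula above carries over to, e.g., Python's standard library (a sketch; the function name is ours):

```python
import math

def complete_beta(a, b):
    """Mirror of the Excel recipe:
    Exp(GammaLn(a) + GammaLn(b) - GammaLn(a + b))."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

# B(2, 3) = 1! * 2! / 4! = 1/12
value = complete_beta(2, 3)
```

Working in log space this way avoids the overflow that direct factorial evaluation would cause for large arguments.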
References
[1] Davis (1972) 6.2.2 p.258
[2] Davis (1972) 6.2.1 p.258
• Askey, R. A.; Roy, R. (2010), "Beta function" (http://dlmf.nist.gov/5.12), in Olver, Frank W. J.; Lozier,
Daniel M.; Boisvert, Ronald F. et al., NIST Handbook of Mathematical Functions, Cambridge University Press,
ISBN 978-0521192255, MR2723248
• Zelen, M.; Severo, N. C. (1972), "26. Probability functions", in Abramowitz, Milton; Stegun, Irene A., Handbook
of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, New York: Dover Publications,
pp. 925-995, ISBN 978-0-486-61272-0
• Davis, Philip J. (1972), "6. Gamma function and related functions" (http://www.math.sfu.ca/~cbm/aands/
page_258.htm), in Abramowitz, Milton; Stegun, Irene A., Handbook of Mathematical Functions with Formulas,
Graphs, and Mathematical Tables, New York: Dover Publications, ISBN 978-0-486-61272-0
• Paris, R. B. (2010), "Incomplete beta functions" (http://dlmf.nist.gov/8.17), in Olver, Frank W. J.; Lozier,
Daniel M.; Boisvert, Ronald F. et al., NIST Handbook of Mathematical Functions, Cambridge University Press,
ISBN 978-0521192255, MR2723248
• Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007), "Section 6.1 Gamma Function, Beta Function,
Factorials" (http://apps.nrbook.com/empanel/index.html?pg=256), Numerical Recipes: The Art of Scientific
Computing (3rd ed.), New York: Cambridge University Press, ISBN 978-0-521-88068-8
External links
• Evaluation of beta function using Laplace transform (http://planetmath.org/?op=getobj&from=objects&
amp;id=6206), PlanetMath.org.
• Arbitrarily accurate values can be obtained from:
• The Wolfram Functions Site (http://functions.wolfram.com): Evaluate Beta Regularized Incomplete beta
(http://functions.wolfram.com/webMathematica/FunctionEvaluation.jsp?name=BetaRegularized)
• danielsoper.com: Incomplete Beta Function Calculator (http://www.danielsoper.com/statcalc/calc36.aspx),
Regularized Incomplete Beta Function Calculator (http://www.danielsoper.com/statcalc/calc37.aspx)
Beta-binomial distribution
Probability mass function

f(k | n, α, β) = C(n, k) B(k + α, n − k + β) / B(α, β)

Mean: n α / (α + β)
Variance: n α β (α + β + n) / ( (α + β)² (α + β + 1) )

The CDF can be expressed via the generalized hypergeometric function ₃F₂(1, α + k + 1, −n + k + 1; k + 2, −β − n + k + 2; 1).
In probability theory and statistics, the beta-binomial distribution is a family of discrete probability distributions on
a finite support of non-negative integers arising when the probability of success in each of a fixed or known number
of Bernoulli trials is either unknown or random. It is frequently used in Bayesian statistics, empirical Bayes methods
and classical statistics as an overdispersed binomial distribution.
It reduces to the Bernoulli distribution as a special case when n = 1. For α = β = 1, it is the discrete uniform
distribution from 0 to n. It also approximates the binomial distribution arbitrarily well for large α and β. The
beta-binomial is a one-dimensional version of the Dirichlet-multinomial distribution, as the binomial and beta
distributions are special cases of the multinomial and Dirichlet distributions, respectively.
Using the properties of the beta function, this can alternatively be written

f(k | n, α, β) = [ Γ(n + 1) / ( Γ(k + 1) Γ(n − k + 1) ) ] · [ Γ(k + α) Γ(n − k + β) / Γ(n + α + β) ] · [ Γ(α + β) / ( Γ(α) Γ(β) ) ]
It is within this context that the beta-binomial distribution appears often in Bayesian statistics: the beta-binomial is
the predictive distribution of a binomial random variable with a beta distribution prior on the success probability.
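The probability mass function is straightforward to evaluate in log space (a sketch; helper names are ours, and the α = β = 1 case is included because the text notes it reduces to the discrete uniform distribution):

```python
import math

def log_beta(a, b):
    """log B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    """P(K = k) = C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta)."""
    return math.comb(n, k) * math.exp(
        log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))

n = 12
probs = [beta_binomial_pmf(k, n, 2.0, 3.0) for k in range(n + 1)]
uniform = [beta_binomial_pmf(k, n, 1.0, 1.0) for k in range(n + 1)]
```

The probabilities sum to 1, and with α = β = 1 every outcome has probability 1/(n + 1).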
The variance can be written as

Var(k) = n p (1 − p) [ 1 + (n − 1) ρ ], with p = α/(α + β),

where ρ = 1/(α + β + 1) is the pairwise correlation between the n Bernoulli draws and is called the over-dispersion parameter.
Point estimates
Method of moments
The method of moments estimates can be gained by noting the first and second moments of the beta-binomial, namely

μ₁ = n α / (α + β)
μ₂ = n α ( n (1 + α) + β ) / ( (α + β) (1 + α + β) )

and setting these raw moments equal to the first and second raw sample moments m₁ and m₂ and solving for α and β:

α̂ = (n m₁ − m₂) / ( n (m₂/m₁ − m₁ − 1) + m₁ )
β̂ = (n − m₁)(n − m₂/m₁) / ( n (m₂/m₁ − m₁ − 1) + m₁ )

Note that these estimates can be nonsensically negative, which is evidence that the data are either undispersed or underdispersed relative to the binomial distribution. In this case, the binomial distribution and the hypergeometric distribution are alternative candidates respectively.
Example
The following data gives the number of male children among the first 12 children of family size 13 in 6115 families
taken from hospital records in 19th century Saxony (Sokal and Rohlf, p. 59 from Lindsey). The 13th child is ignored
to assuage the effect of families non-randomly stopping when a desired gender is reached.
Males 0 1 2 3 4 5 6 7 8 9 10 11 12
Families 3 24 104 286 670 1033 1343 1112 829 478 181 45 7
The maximum likelihood fit of the beta-binomial model has a lower AIC than the competing binomial model (for which AIC = 25070.34), and thus we see that the beta-binomial model provides a superior fit to the data; i.e., there is evidence for overdispersion. Trivers and Willard posit a theoretical justification for heterogeneity in gender-proneness among families (i.e. overdispersion).
The superior fit is evident especially among the tails.
Males 0 1 2 3 4 5 6 7 8 9 10 11 12
Observed Families 3 24 104 286 670 1033 1343 1112 829 478 181 45 7
Predicted (Beta-Binomial) 2.3 22.6 104.8 310.9 655.7 1036.2 1257.9 1182.1 853.6 461.9 177.9 43.8 5.2
Predicted (Binomial p = 0.519215) 0.9 12.1 71.8 258.5 628.1 1085.2 1367.3 1265.6 854.2 410.0 132.8 26.1 2.3
Because the marginal is a complex, non-linear function of Gamma and Digamma functions, it is quite difficult to
obtain a marginal maximum likelihood estimate (MMLE) for the mean and variance. Instead, we use the method of
iterated expectations to find the expected value of the marginal moments.
Let us write our model as a two-stage compound sampling model. Let k_i be the number of successes out of n_i trials for event i:

k_i ~ Bin(n_i, θ_i)
θ_i ~ Beta(μM, M(1 − μ)), independently
We can find iterated moment estimates for the mean and variance using the moments for the distributions in the two-stage model:

E[k_i / n_i] = E[ E[k_i / n_i | θ_i] ] = E[θ_i] = μ
Var(k_i / n_i) = E[ Var(k_i / n_i | θ_i) ] + Var( E[k_i / n_i | θ_i] ) = ( μ(1 − μ)/n_i ) [ 1 + (n_i − 1)/(M + 1) ]

(Here we have used the law of total expectation and the law of total variance.)
We want point estimates for μ and M. The estimated mean μ̂ is calculated from the sample:

μ̂ = Σ_i k_i / Σ_i n_i

The estimate of the hyperparameter M is obtained by equating the moment estimate of the variance from the two-stage model to the sample variance of the k_i/n_i and solving for M.
Since we now have parameter point estimates α̂ = μ̂ M̂ and β̂ = (1 − μ̂) M̂ for the underlying distribution, we would like to find a point estimate π̂_i for the probability of success for event i. This is the weighted average of the event estimate k_i/n_i and the estimate of the prior mean μ̂. Given our point estimates for the prior, we may now plug in these values to find a point estimate for the posterior:

π̂_i = E[θ_i | k_i] = (α̂ + k_i) / (α̂ + β̂ + n_i)
Shrinkage factors
We may write the posterior estimate as a weighted average:

π̂_i = ŵ_i (k_i / n_i) + (1 − ŵ_i) μ̂

where ŵ_i = n_i / (n_i + M̂) is the shrinkage factor for event i: the larger n_i, the more weight is placed on the observed frequency k_i/n_i, and the smaller n_i, the more the estimate shrinks toward the prior mean μ̂.
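The algebraic fact that the posterior mean is a shrinkage estimate can be verified directly (a sketch; the numbers are illustrative choices of ours):

```python
def posterior_point_estimate(k_i, n_i, alpha, beta):
    """Posterior mean (alpha + k_i) / (alpha + beta + n_i)."""
    return (alpha + k_i) / (alpha + beta + n_i)

def as_weighted_average(k_i, n_i, alpha, beta):
    """Same estimate written as shrinkage toward the prior mean."""
    m = alpha + beta                      # prior "sample size" M
    w = n_i / (n_i + m)                   # shrinkage factor
    return w * (k_i / n_i) + (1 - w) * (alpha / m)

direct = posterior_point_estimate(7, 10, 2.0, 3.0)
shrunk = as_weighted_average(7, 10, 2.0, 3.0)
```

Both forms give the same number: here (2 + 7)/(5 + 10) = 0.6, sitting between the raw frequency 0.7 and the prior mean 0.4.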
Related distributions
• BetaBin(n, 1, 1) ~ U(0, n), where U(0, n) is the discrete uniform distribution on {0, 1, ..., n}.
References
* Minka, Thomas P. (2003). Estimating a Dirichlet distribution [1]. Microsoft Technical Report.
External links
• Using the Beta-binomial distribution to assess performance of a biometric identification device [2]
• Fastfit [3] contains Matlab code for fitting Beta-Binomial distributions (in the form of two-dimensional Pólya
distributions) to data.
References
[1] http:/ / research. microsoft. com/ ~minka/ papers/ dirichlet/
[2] http:/ / it. stlawu. edu/ ~msch/ biometrics/ papers. htm
[3] http:/ / research. microsoft. com/ ~minka/ software/ fastfit/
Binomial coefficient
In mathematics, binomial coefficients are a family of positive
integers that occur as coefficients in the binomial theorem. They
are indexed by two nonnegative integers; the binomial coefficient
indexed by n and k is usually written with n stacked above k
between parentheses, here denoted C(n, k). It is the coefficient of
the x^k term in the polynomial expansion of the binomial power
(1 + x)^n. Under suitable circumstances the value of the coefficient
is given by the expression n!/(k!(n − k)!). Arranging binomial
coefficients into rows for successive values of n, and in which k
ranges from 0 to n, gives a triangular array called Pascal's triangle.
[Figure: The binomial coefficients can be arranged to form Pascal's triangle.]

This family of numbers also arises in many areas other than algebra, notably in combinatorics. For any set containing n
elements, the number of distinct k-element subsets of it that can be formed (the k-combinations of its elements) is
given by the binomial coefficient C(n, k). Therefore C(n, k) is often read as "n choose k". The properties of binomial
coefficients have led to extending the meaning of the symbol beyond the basic case where n and k are
nonnegative integers with k ≤ n; such expressions are then still called binomial coefficients.
The notation was introduced by Andreas von Ettingshausen in 1826,[1] although the numbers were already
known centuries before that (see Pascal's triangle). The earliest known detailed discussion of binomial coefficients is
in a tenth-century commentary, due to Halayudha, on an ancient Hindu classic, Pingala's chandaḥśāstra. In about
1150, the Hindu mathematician Bhaskaracharya gave a very clear exposition of binomial coefficients in his book
Lilavati.[2]
Alternative notations include C(n, k), nCk, nCk, Ckn, Cnk,[3] in all of which the C stands for combinations or choices.
They occur in the expansion of a power of a binomial:

(x + y)^n = Σ_{k=0}^{n} C(n, k) x^(n−k) y^k

(valid for any elements x, y of a commutative ring), which explains the name "binomial coefficient".
Another occurrence of this number is in combinatorics, where it gives the number of ways, disregarding order, that k
objects can be chosen from among n objects; more formally, the number of k-element subsets (or k-combinations) of
an n-element set. This number can be seen as equal to the one of the first definition, independently of any of the
formulas below to compute it: if in each of the n factors of the power (1 + X)n one temporarily labels the term X with
an index i (running from 1 to n), then each subset of k indices gives after expansion a contribution Xk, and the
coefficient of that monomial in the result will be the number of such subsets. This shows in particular that is a
natural number for any natural numbers n and k. There are many other combinatorial interpretations of binomial
coefficients (counting problems for which the answer is given by a binomial coefficient expression), for instance the
number of words formed of n bits (digits 0 or 1) whose sum is k is given by C(n, k), while the number of ways to write
k = a_1 + a_2 + ... + a_n where every a_i is a nonnegative integer is given by C(n + k − 1, n − 1). Most of these
interpretations are easily seen to be equivalent to counting k-combinations.
Recursive formula
One has a recursive formula for binomial coefficients

C(n, k) = C(n − 1, k − 1) + C(n − 1, k)

The formula follows either from tracing the contributions to x^k in (1 + x)^(n−1)(1 + x), or by counting k-combinations of {1, 2, ..., n} that contain n and that do not contain n separately. It follows easily that C(n, k) = 0 when k > n, and C(n, 0) = C(n, n) = 1 for all n, so the recursion can stop when reaching such cases. This recursive formula then allows the construction of Pascal's triangle.
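The recursion translates directly into code (a sketch; memoization keeps the naive recursion from recomputing the same entries):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def binom(n, k):
    """Binomial coefficient via the recursive (Pascal) formula."""
    if k < 0 or k > n:
        return 0          # C(n, k) = 0 outside the range 0 <= k <= n
    if k == 0 or k == n:
        return 1          # boundary cases stop the recursion
    return binom(n - 1, k - 1) + binom(n - 1, k)

row5 = [binom(5, k) for k in range(6)]   # one row of Pascal's triangle
```

Without the cache the recursion takes exponential time; with it, each entry of the triangle is computed once, exactly as when filling in Pascal's triangle by hand.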
Multiplicative formula
A more efficient method to compute individual binomial coefficients is given by the formula

C(n, k) = [ n (n − 1) (n − 2) ⋯ (n − k + 1) ] / [ k (k − 1) (k − 2) ⋯ 1 ] = ∏_{i=1}^{k} (n + 1 − i)/i

where the numerator of the first fraction is expressed as a falling factorial power. This formula is easiest to
understand for the combinatorial interpretation of binomial coefficients. The numerator gives the number of ways to
select a sequence of k distinct objects, retaining the order of selection, from a set of n objects. The denominator
counts the number of distinct sequences that define the same k-combination when order is disregarded.
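The multiplicative formula can be implemented with exact integer arithmetic, because every prefix product of the factors (n + 1 − i)/i is itself a binomial coefficient and hence an integer (a sketch; the function name is ours):

```python
def binom_multiplicative(n, k):
    """C(n, k) as the product of (n + 1 - i)/i for i = 1..k,
    using exact integer arithmetic (every prefix product is an integer)."""
    k = min(k, n - k)          # exploit the symmetry C(n, k) = C(n, n - k)
    result = 1
    for i in range(1, k + 1):
        result = result * (n + 1 - i) // i   # exact: divisible at each step
    return result

value = binom_multiplicative(52, 5)   # number of 5-card poker hands: 2598960
```

Multiplying before dividing at each step keeps every intermediate value an integer, so no rounding ever occurs, unlike a floating-point evaluation of the same product.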
Factorial formula
Finally there is a formula using factorials that is easy to remember:
where n! denotes the factorial of n. This formula follows from the multiplicative formula above by multiplying
numerator and denominator by (n − k)!; as a consequence it involves many factors common to numerator and
denominator. It is less practical for explicit computation unless common factors are first canceled (in particular since
factorial values grow very rapidly). The formula does exhibit a symmetry that is less evident from the multiplicative
formula (though it is from the definitions):
C(n, k) = C(n, n − k).    (1)
Generalization and connection to the binomial series
The multiplicative formula allows extending the definition of binomial coefficients by replacing n by an arbitrary
number α (negative, real, or complex):
C(α, k) = α(α − 1)(α − 2) ⋯ (α − k + 1) / k!.
With this definition one has a generalization of the binomial formula (with one of the variables set to 1), which
justifies still calling these numbers binomial coefficients:
(1 + X)^α = Σ_{k=0}^{∞} C(α, k) X^k.    (2)
This formula is valid for all complex numbers α and X with |X| < 1. It can also be interpreted as an identity of formal
power series in X, where it actually can serve as definition of arbitrary powers of series with constant coefficient
equal to 1; the point is that with this definition all identities hold that one expects for exponentiation, notably
(1 + X)^α (1 + X)^β = (1 + X)^(α+β) and ((1 + X)^α)^β = (1 + X)^(αβ).
If α is a nonnegative integer n, then all terms with k > n are zero, and the infinite series becomes a finite sum, thereby
recovering the binomial formula. However for other values of α, including negative integers and rational numbers,
the series is really infinite.
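For non-integer α the series is genuinely infinite but converges for |X| < 1; a small numerical sketch (the helper `gen_binom` is an illustrative name, computing the generalized coefficient by the multiplicative formula):

```python
def gen_binom(alpha, k):
    # Generalized coefficient: alpha (alpha - 1) ... (alpha - k + 1) / k!
    result = 1.0
    for i in range(k):
        result *= (alpha - i) / (i + 1)
    return result

# Partial sum of (1 + x)^alpha for alpha = 1/2, x = 0.2:
alpha, x = 0.5, 0.2
s = sum(gen_binom(alpha, k) * x**k for k in range(40))
print(abs(s - 1.2**0.5) < 1e-9)  # True: the series recovers sqrt(1.2)
```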
Pascal's triangle
Pascal's rule is the important recurrence relation
C(n, k) + C(n, k + 1) = C(n + 1, k + 1),    (3)
which can be used to prove by mathematical induction that C(n, k) is a natural number for all n and k (equivalent to the
statement that k! divides the product of k consecutive integers), a fact that is not immediately obvious from formula
(1).
Pascal's rule also gives rise to Pascal's triangle:
0: 1
1: 1 1
2: 1 2 1
3: 1 3 3 1
4: 1 4 6 4 1
5: 1 5 10 10 5 1
6: 1 6 15 20 15 6 1
7: 1 7 21 35 35 21 7 1
8: 1 8 28 56 70 56 28 8 1
Row number n contains the numbers C(n, k) for k = 0, …, n. It is constructed by starting with ones at the outside and
then always adding two adjacent numbers and writing the sum directly underneath. This method allows the quick
calculation of binomial coefficients without the need for fractions or multiplications. For instance, by looking at row
number 5 of the triangle, one can quickly read off that
(x + y)5 = 1 x5 + 5 x4y + 10 x3y2 + 10 x2y3 + 5 x y4 + 1 y5.
The differences between elements on other diagonals are the elements in the previous diagonal, as a consequence of
the recurrence relation (3) above.
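The construction rule ("add two adjacent numbers and write the sum underneath") translates directly into code; a short sketch:

```python
def pascal_rows(nmax):
    # Row 0 is [1]; each later row adds adjacent pairs of the previous row
    # and puts a 1 at each end.
    rows = [[1]]
    for _ in range(nmax):
        prev = rows[-1]
        rows.append([1] + [a + b for a, b in zip(prev, prev[1:])] + [1])
    return rows

print(pascal_rows(5)[5])  # [1, 5, 10, 10, 5, 1]
```

Row 5 reproduces the coefficients of (x + y)^5 read off in the text.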
• There are C(n, k) ways to choose k elements from a set of n elements. See Combination.
• There are C(n + k − 1, k) ways to choose k elements from a set of n if repetitions are allowed. See Multiset.
• There are C(n + k, k) strings containing k ones and n zeros.
• There are C(n + 1, k) strings consisting of k ones and n zeros such that no two ones are adjacent.[5]
For each k, the polynomial C(t, k) = t(t − 1) ⋯ (t − k + 1) / k! can be characterized as the unique degree k polynomial p(t) satisfying p(0) = p(1) =
... = p(k − 1) = 0 and p(k) = 1.
Its coefficients are expressible in terms of Stirling numbers of the first kind, by definition of the latter:
Every polynomial p(t) of degree d can be expressed as a linear combination of the polynomials C(t, k): in
p(t) = Σ_{k=0}^{d} a_k C(t, k), the coefficient a_k is the kth difference of the sequence p(0), p(1), …, p(k). Explicitly,[6]
a_k = Σ_{i=0}^{k} (−1)^(k−i) C(k, i) p(i).    (3.5)
Integer-valued polynomials
Each polynomial is integer-valued: it takes integer values at integer inputs. (One way to prove this is by
induction on k, using Pascal's identity.) Therefore any integer linear combination of binomial coefficient polynomials
is integer-valued too. Conversely, (3.5) shows that any integer-valued polynomial is an integer linear combination of
these binomial coefficient polynomials. More generally, for any subring R of a characteristic 0 field K, a polynomial
in K[t] takes values in R at all integers if and only if it is an R-linear combination of binomial coefficient
polynomials.
Example
The integer-valued polynomial 3t(3t + 1)/2 can be rewritten as
3t(3t + 1)/2 = 9 C(t, 2) + 6 C(t, 1).    (4)
Identities involving binomial coefficients
The formula
Σ_{k=0}^{n} C(n, k) = 2^n    (5)
is obtained from (2) using x = 1. This is equivalent to saying that the elements in one row of Pascal's triangle always
add up to two raised to an integer power. A combinatorial interpretation of this fact involving double counting is
given by counting subsets of size 0, size 1, size 2, and so on up to size n of a set S of n elements. Since we count the
number of subsets of size i for 0 ≤ i ≤ n, this sum must be equal to the number of subsets of S, which is known to be
2n. That is, Equation 5 is a statement that the power set for a finite set with n elements has size 2n. More explicitly,
consider a bit string with n digits. This bit string can be used to represent 2n numbers. Now consider all of the bit
strings with no ones in them. There is just one, or rather n choose 0. Next consider the number of bit strings with just
a single one in them. There are n, or rather n choose 1. Continuing this way we can see that the equation above holds.
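The double-counting argument can be checked mechanically for a small set; a sketch using only the standard library:

```python
from itertools import combinations
from math import comb

n = 6
# Count subsets of {0, ..., 5} by size and compare with the binomial row.
by_size = [sum(1 for _ in combinations(range(n), k)) for k in range(n + 1)]
assert by_size == [comb(n, k) for k in range(n + 1)]
print(sum(by_size) == 2**n)  # True: the row sums to 2^6 = 64
```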
The formulas
Σ_{k=0}^{n} k C(n, k) = n 2^(n−1)    (6a)
and
Σ_{k=0}^{n} k(k − 1) C(n, k) = n(n − 1) 2^(n−2)    (6b)
follow from (2) after differentiating with respect to x (twice in the latter) and then substituting x = 1.
The Chu–Vandermonde identity, which holds for any complex values m and n and any nonnegative integer k, is
Σ_{j=0}^{k} C(m, j) C(n − m, k − j) = C(n, k)    (7a)
and can be found by examination of the coefficient of x^k in the expansion of (1 + x)^m (1 + x)^(n − m) = (1 + x)^n using
equation (2). When m = 1, equation (7a) reduces to equation (3).
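The Chu–Vandermonde identity is easy to verify numerically for integer parameters; a brief sketch:

```python
from math import comb  # math.comb returns 0 when k > n

for n in range(10):
    for m in range(n + 1):
        for k in range(n + 1):
            lhs = sum(comb(m, j) * comb(n - m, k - j) for j in range(k + 1))
            assert lhs == comb(n, k)
print("Chu-Vandermonde holds for all n < 10")
```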
A similar looking formula, which applies for any integers j, k, and n satisfying 0 ≤ j ≤ k ≤ n, is
Σ_{m=0}^{n} C(m, j) C(n − m, k − j) = C(n + 1, k + 1).    (7b)
From expansion (7a) using n = 2m, k = m, and (1), one finds
Σ_{j=0}^{m} C(m, j)² = C(2m, m).    (8)
Let F(n) denote the n-th Fibonacci number. We obtain a formula about the diagonals of Pascal's triangle
Σ_{k=0}^{⌊n/2⌋} C(n − k, k) = F(n + 1).    (9)
This can be proved by induction using (3) or by Zeckendorf's representation (Just note that the lhs gives the number
of subsets of {F(2),...,F(n)} without consecutive members, which also form all the numbers below F(n+1)). A
combinatorial proof is given below.
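The diagonal sums of Pascal's triangle can be checked directly (taking F(1) = F(2) = 1):

```python
from math import comb

def fib(n):
    a, b = 0, 1          # F(0) = 0, F(1) = 1
    for _ in range(n):
        a, b = b, a + b
    return a

for n in range(1, 20):
    diagonal = sum(comb(n - k, k) for k in range(n // 2 + 1))
    assert diagonal == fib(n + 1)
print(sum(comb(10 - k, k) for k in range(6)))  # 89 = F(11)
```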
Also using (3) and induction, one can show that
Σ_{j=k}^{n} C(j, k) = C(n + 1, k + 1).    (10)
Although there is no closed formula for partial sums of binomial coefficients (unless one resorts to hypergeometric
functions), one can again use (3) and induction to show that for k = 0, ..., n − 1,
Σ_{j=0}^{k} (−1)^j C(n, j) = (−1)^k C(n − 1, k),    (11)
as well as
Σ_{j=0}^{n} (−1)^j C(n, j) = 0    (12)
[except in the trivial case where n = 0, where the result is 1 instead], which is itself a special case of the result from
the theory of finite differences that for any polynomial P(x) of degree less than n,[7]
Σ_{j=0}^{n} (−1)^j C(n, j) P(j) = 0.    (13a)
Differentiating (2) k times and setting x = −1 yields this for P(x) = x(x − 1)(x − 2) ⋯ (x − k + 1), when 0 ≤ k < n,
and the general case follows by taking linear combinations of these.
When P(x) is of degree less than or equal to n,
Σ_{j=0}^{n} (−1)^j C(n, j) P(j) = (−1)^n n! a_n,    (13b)
where a_n is the coefficient of degree n in P(x). More generally,
Σ_{j=0}^{n} (−1)^j C(n, j) P(m + dj) = (−1)^n d^n n! a_n,    (13c)
where m and d are complex numbers. This follows immediately applying (13b) to the polynomial Q(x):=P(m + dx)
instead of P(x), and observing that Q(x) has still degree less than or equal to n, and that its coefficient of degree n is
dnan.
The infinite series
Σ_{j=k}^{∞} 1 / C(j, k) = k / (k − 1)    (14)
is convergent for k ≥ 2. This formula is used in the analysis of the German tank problem. It is equivalent to the
formula for the finite sum
Σ_{j=k}^{M} 1 / C(j, k) = (k / (k − 1)) (1 − 1 / C(M, k − 1))    (15)
and
(16)
Series multisection gives the following identity for the sum of binomial coefficients taken with a step s and offset t
as a closed-form sum of s terms:
C(n, t) + C(n, t + s) + C(n, t + 2s) + ⋯ = (1/s) Σ_{j=0}^{s−1} (2 cos(πj/s))^n cos(π(n − 2t)j / s).    (16b)
Combinatorial proofs
The identity
Σ_{j=q}^{n} C(n, j) C(j, q) = C(n, q) 2^(n−q)
can be given a double counting proof as follows. The left side counts the number of ways of selecting a subset of [n]
= {1, 2, …, n} with at least q elements, and marking q elements among those selected. The right side counts the
same parameter, because there are C(n, q) ways of choosing a set of q marks, and they occur in all subsets that
additionally contain some subset of the remaining elements, of which there are 2^(n−q).
In Pascal's rule
C(n, k) = C(n − 1, k − 1) + C(n − 1, k),
both sides count the number of k-element subsets of [n], with the right hand side first grouping them into those that
contain element n and those that do not.
The identity (8) also has a combinatorial proof. The identity reads
Σ_{k=0}^{n} C(n, k)² = C(2n, n).
Suppose you have 2n empty squares arranged in a row and you want to mark (select) n of them. There are C(2n, n)
ways to do this. On the other hand, you may select your n squares by selecting k squares from among the first n and
n − k squares from the remaining n squares; any k from 0 to n will work. This gives
Σ_{k=0}^{n} C(n, k) C(n, n − k) = C(2n, n),
which by the symmetry (1) is the identity above.
has the following combinatorial proof. The number denotes the number of paths in a two-dimensional lattice
from to using steps and . This is easy to see: there are steps in total and
one may choose the steps. Now, replace each step by a step; note that there are exactly .
Then one arrives at point using steps and . Doing this for all between and gives
all paths from to using steps and . Clearly, there are exactly such paths.
Sum of coefficients row
The number of k-combinations for all k, Σ_{0≤k≤n} C(n, k) = 2^n, is the sum of the nth row (counting from 0) of the
binomial coefficients. These combinations are enumerated by the 1 digits of the set of base 2 numbers counting from
0 to 2^n − 1, where each digit position is an item from the set of n.
Dixon's identity
Dixon's identity is
Σ_{k=−a}^{a} (−1)^k C(2a, a + k)³ = (3a)! / (a!)³.
Continuous identities
Certain trigonometric integrals have values expressible in terms of binomial coefficients:
For nonnegative integers m and n:
(17)
(18)
(19)
These can be proved by using Euler's formula to convert trigonometric functions to complex exponentials, expanding
using the binomial theorem, and integrating term by term.
Generating functions
Another bivariate generating function of the binomial coefficients, which is symmetric, is:
Σ_{n=0}^{∞} Σ_{k=0}^{∞} C(n + k, k) x^k y^n = 1 / (1 − x − y).
Divisibility properties
In 1852, Kummer proved that if m and n are nonnegative integers and p is a prime number, then the largest power of
p dividing C(m + n, m) equals p^c, where c is the number of carries when m and n are added in base p. Equivalently, the
exponent of a prime p in C(n, k) equals the number of nonnegative integers j such that the fractional part of k/p^j is
greater than the fractional part of n/p^j. It can be deduced from this that C(n, k) is divisible by n/gcd(n, k).
A somewhat surprising result by David Singmaster (1974) is that any integer divides almost all binomial
coefficients. More precisely, fix an integer d and let f(N) denote the number of binomial coefficients C(n, k) with n < N
such that d divides C(n, k). Then
lim_{N→∞} f(N) / (N(N + 1)/2) = 1.
Since the number of binomial coefficients with n < N is N(N+1) / 2, this implies that the density of binomial
coefficients divisible by d goes to 1.
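Singmaster's density statement is visible already at modest sizes; an illustrative count for d = 2 (the cutoff N = 256 is arbitrary):

```python
from math import comb

d, N = 2, 256
total = divisible = 0
for n in range(N):
    for k in range(n + 1):
        total += 1
        if comb(n, k) % d == 0:
            divisible += 1
print(round(divisible / total, 2))  # about 0.8 already at N = 256
```

By Kummer's theorem the entries not divisible by 2 are exactly those with no carries in base 2, and there are only 3^8 of them among the first 256 rows, which is why the fraction is already high.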
Another fact: An integer n ≥ 2 is prime if and only if all the intermediate binomial coefficients
C(n, 1), C(n, 2), …, C(n, n − 1)
are divisible by n.
Proof: When p is prime, p divides
C(p, k) = p(p − 1) ⋯ (p − k + 1) / k!    for all 0 < k < p,
because C(p, k) is a natural number and the numerator has a prime factor p but the denominator does not have a prime
factor p.
When n is composite, let p be the smallest prime factor of n and let k = n/p. Then 0 < p < n and
C(n, p) = n(n − 1)(n − 2) ⋯ (n − p + 1) / p!
is not divisible by n: otherwise the numerator k(n − 1)(n − 2) ⋯ (n − p + 1) would have to be divisible by n = k × p,
which can only be the case when (n − 1)(n − 2) ⋯ (n − p + 1) is divisible by p. But n is divisible by p, so p does not
divide n − 1, n − 2, ..., n − p + 1, and because p is prime, p does not divide (n − 1)(n − 2) ⋯ (n − p + 1); so the
numerator cannot be divisible by n.
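The primality criterion above can be tested directly (far slower than real primality tests, but a faithful check):

```python
from math import comb

def is_prime_via_binomials(n):
    # n >= 2 is prime iff n divides every C(n, k) with 0 < k < n.
    return n >= 2 and all(comb(n, k) % n == 0 for k in range(1, n))

print([m for m in range(2, 30) if is_prime_via_binomials(m)])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```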
For fixed k, one has the asymptotic equivalence C(n, k) ~ n^k / k! as n → ∞, while for the central coefficient
C(2n, n) ~ 4^n / √(πn) as n → ∞.
This asymptotic behaviour is contained in the approximation
as well. (Here H_k is the k-th harmonic number and γ is the Euler–Mascheroni constant.)
The sum of binomial coefficients Σ_{i=0}^{k} C(n, i) can be bounded by a term exponential in n and the binary entropy of the largest
ratio k/n that occurs. More precisely, for integers n ≥ k ≥ 1 with k/n ≤ 1/2, it holds[8]
Σ_{i=0}^{k} C(n, i) ≤ 2^(n·H(k/n)),
where H(p) = −p log₂(p) − (1 − p) log₂(1 − p) is the binary entropy function.
Generalizations
Generalization to multinomials
Binomial coefficients can be generalized to multinomial coefficients. They are defined to be the number
C(n; k1, k2, …, kr) = n! / (k1! k2! ⋯ kr!),
where
k1 + k2 + ⋯ + kr = n.
While the binomial coefficients represent the coefficients of (x + y)^n, the multinomial coefficients represent the
coefficients of the polynomial
(x1 + x2 + ⋯ + xr)^n.
The multinomial coefficients satisfy the analogue of Pascal's rule and the symmetry
C(n; k1, k2, …, kr) = C(n; kσ(1), kσ(2), …, kσ(r))
for any permutation σ of (1, 2, …, r).
Taylor series
Using Stirling numbers of the first kind the series expansion around any arbitrarily chosen point is
This shows up when expanding into a power series using the Newton binomial series :
where the connection coefficients are multinomial coefficients. In terms of labelled combinatorial objects, the
connection coefficients represent the number of ways to assign m+n-k labels to a pair of labelled combinatorial
objects—of weight m and n respectively—that have had their first k labels identified, or glued together to get a new
labelled combinatorial object of weight m+n-k. (That is, to separate the labels into three portions to apply to the
glued part, the unglued part of the first object, and the unglued part of the second object.) In this regard, binomial
coefficients are to exponential generating series what falling factorials are to ordinary generating series.
The identity can be obtained by showing that both sides satisfy the differential equation (1 + z) f′(z) = α f(z).
Binomial coefficient with two real or complex arguments
The binomial coefficient can be generalized to two real or complex arguments x and y using the gamma function:
C(x, y) = Γ(x + 1) / (Γ(y + 1) Γ(x − y + 1)).
The resulting function has been little-studied, apparently first being graphed in (Fowler 1996). Notably, many
binomial identities fail: C(n, m) = C(n, n − m) but C(−n, m) ≠ C(−n, −n − m) for n positive (so −n negative). The behavior is
quite complex, and markedly different in various octants (that is, with respect to the x and y axes and the line
y = x), with the behavior for negative x having singularities at negative integer values and a checkerboard of
positive and negative regions:
• in the octant 0 ≤ y ≤ x it is a smoothly interpolated form of the usual binomial, with a ridge ("Pascal's ridge").
• in the octant and in the quadrant the function is close to zero.
• in the quadrant the function is alternatingly very large positive and negative on the
parallelograms with vertices
• in the octant the behavior is again alternatingly very large positive and negative, but on a square
grid.
• in the octant it is close to zero, except for near the singularities.
Generalization to q-series
The binomial coefficient has a q-analog generalization known as the Gaussian binomial coefficient.
Generalization to infinite cardinals
The definition of the binomial coefficient can be generalized to infinite cardinals by defining
C(α, β) = |{ B ⊆ A : |B| = β }|,
where A is some set with cardinality α. One can show that the generalized binomial coefficient is well-defined, in
the sense that no matter what set we choose to represent the cardinal number α, C(α, β) will remain the same. For
finite cardinals, this definition coincides with the standard definition of the binomial coefficient.
Assuming the Axiom of Choice, one can show that C(α, α) = 2^α for any infinite cardinal α.
Binomial coefficient in programming languages
The two-line notation is convenient in handwriting but inconvenient for typewriters and computer terminals. Many
programming languages do not offer a standard subroutine for computing the binomial coefficient, but for example
the J programming language uses the exclamation mark: k ! n.
Naive implementations of the factorial formula, computing all three factorials in full, are very slow and unusable for
large n (in languages such as C or Java they additionally suffer from overflow errors). A direct implementation of the
multiplicative formula works well.
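The Python snippets this passage originally showed did not survive extraction; the following is a plausible reconstruction (function names are illustrative). The multiplicative version uses floor division, which stays exact because each partial product is itself a binomial coefficient:

```python
from math import factorial

def binomial_factorial(n, k):
    # Naive factorial formula: computes three huge factorials in full.
    # (Python integers do not overflow, but C or Java fixed-width ints would.)
    return factorial(n) // (factorial(k) * factorial(n - k))

def binomial_multiplicative(n, k):
    # Multiplicative formula; range(k) runs 0 .. k-1, hence the i + 1 divisor.
    result = 1
    for i in range(k):
        result = result * (n - i) // (i + 1)
    return result

print(binomial_multiplicative(52, 5))  # 2598960
```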
The computation can also be written in functional style. The following Scheme example uses the recursive
definition:
(define (binomial n k)
;; Helper function to compute C(n,k) via forward recursion
(define (binomial-iter n k i prev)
(if (>= i k)
prev
(binomial-iter n k (+ i 1) (/ (* (- n i) prev) (+ i 1)))))
;; Use symmetry property C(n,k)=C(n, n-k)
(if (< k (- n k))
(binomial-iter n k 0 1)
(binomial-iter n (- n k) 0 1)))
Another way to compute the binomial coefficient when using large numbers is to recognize that
ln C(n, k) = ln Γ(n + 1) − ln Γ(k + 1) − ln Γ(n − k + 1),
where ln Γ denotes the natural logarithm of the gamma function. It is a special function that is easily
computed and is standard in some programming languages, such as log_gamma in Maxima, LogGamma in
Mathematica, or gammaln in MATLAB. Roundoff error may cause the returned value not to be an integer.
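A minimal sketch of the log-gamma approach in Python (`math.lgamma`); rounding repairs the floating-point error for moderate arguments:

```python
from math import exp, lgamma

def binomial_lgamma(n, k):
    # ln C(n, k) = ln Gamma(n + 1) - ln Gamma(k + 1) - ln Gamma(n - k + 1)
    return round(exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)))

print(binomial_lgamma(52, 5))  # 2598960
```

For very large n the rounded result eventually loses exactness, but the logarithm itself remains usable when only the magnitude is needed.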
Notes
[1] Higham (1998)
[2] Lilavati Section 6, Chapter 4 (see Knuth (1997)).
[3] Shilov (1977)
[4] See (Graham, Knuth & Patashnik 1994), which also defines C(n, k) = 0 for k < 0. Alternative generalizations, such as to two real or
complex valued arguments using the Gamma function, assign nonzero values to C(n, k) for k < 0, but this causes most binomial coefficient
identities to fail, and thus is not widely used in the majority of definitions. One such choice of nonzero values leads to the aesthetically pleasing
"Pascal windmill" in Hilton, Holton and Pedersen, Mathematical Reflections: In a Room with Many Mirrors, Springer, 1997, but causes even
Pascal's identity to fail (at the origin).
[5] Muir, Thomas (1902). "Note on Selected Combinations" (http://books.google.com/books/reader?id=EN8vAAAAIAAJ&output=reader&pg=GBS.PA102). Proceedings of the Royal Society of Edinburgh.
[6] This can be seen as a discrete analog of Taylor's theorem. It is closely related to Newton's polynomial. Alternating sums of this form may be
expressed as the Nörlund–Rice integral.
[7] Ruiz, Sebastian (1996). "An Algebraic Identity Leading to Wilson's Theorem" (http://www.jstor.org/stable/3618534). The Mathematical
Gazette 80 (489): 579–582.
[8] see e.g. Flum & Grohe (2006, p. 427)
References
• Benjamin, Arthur T.; Quinn, Jennifer (2003). Proofs that Really Count: The Art of Combinatorial Proof (https://
www.maa.org/EbusPPRO/Bookstore/ProductDetail/tabid/170/Default.aspx?ProductId=675), Mathematical
Association of America.
• Bryant, Victor (1993). Aspects of combinatorics. Cambridge University Press. ISBN 0-521-41974-3.
• Flum, Jörg; Grohe, Martin (2006). Parameterized Complexity Theory (http://www.springer.com/east/home/
generic/search/results?SGWID=5-40109-22-141358322-0). Springer. ISBN 978-3-540-29952-3.
• Fowler, David (January 1996). "The Binomial Coefficient Function". The American Mathematical Monthly
(Mathematical Association of America) 103 (1): 1–17. doi:10.2307/2975209. JSTOR 2975209
• Graham, Ronald L.; Knuth, Donald E.; Patashnik, Oren (1994). Concrete Mathematics (Second ed.).
Addison-Wesley. pp. 153–256. ISBN 0-201-55802-5.
• Higham, Nicholas J. (1998). Handbook of writing for the mathematical sciences. SIAM. p. 25.
ISBN 0-89871-420-6.
• Knuth, Donald E. (1997). The Art of Computer Programming, Volume 1: Fundamental Algorithms (Third ed.).
Addison-Wesley. pp. 52–74. ISBN 0-201-89683-4.
• Singmaster, David (1974). "Notes on binomial coefficients. III. Any integer divides almost all binomial
coefficients". J. London Math. Soc. (2) 8 (3): 555–560. doi:10.1112/jlms/s2-8.3.555.
• Shilov, G. E. (1977). Linear algebra. Dover Publications. ISBN 978-0-486-63518-7.
External links
• Calculation of Binomial Coefficient (http://www.stud.feec.vutbr.cz/~xvapen02/vypocty/komb.
php?language=english)
This article incorporates material from the following PlanetMath articles, which are licensed under the Creative
Commons Attribution/Share-Alike License: Binomial Coefficient, Bounds for binomial coefficients, Proof that C(n,k)
is an integer, Generalized binomial coefficients.
Binomial distribution 59
Binomial distribution
Probability mass function
Notation B(n, p)
Parameters n ∈ N0 — number of trials
p ∈ [0,1] — success probability in each trial
Support k ∈ { 0, …, n } — number of successes
PMF
CDF
Mean np
Median ⌊np⌋ or ⌈np⌉
Mode ⌊(n + 1)p⌋ or ⌊(n + 1)p⌋ − 1
Variance np(1 − p)
Skewness
Ex. kurtosis
Entropy
MGF
CF
PGF
Specification
Probability mass function
In general, if the random variable X follows the binomial distribution with parameters n and p, we write X ~ B(n, p).
The probability of getting exactly k successes in n trials is given by the probability mass function
f(k; n, p) = Pr(X = k) = C(n, k) p^k (1 − p)^(n−k)
for k = 0, 1, 2, …, n, where
C(n, k) = n! / (k! (n − k)!)
is the binomial coefficient (hence the name of the distribution), "n choose k", also denoted nCk. The
formula can be understood as follows: we want k successes (probability p^k) and n − k failures (probability (1 − p)^(n−k)). However, the k
successes can occur anywhere among the n trials, and there are C(n, k) different ways of distributing k successes in a
sequence of n trials.
In creating reference tables for binomial distribution probability, usually the table is filled in up to n/2 values. This is
because for k > n/2, the probability can be calculated by its complement as
f(k; n, p) = f(n − k; n, 1 − p).
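Both the mass function and the complement relation f(k; n, p) = f(n − k; n, 1 − p) are straightforward to check; an illustrative sketch:

```python
from math import comb

def binom_pmf(k, n, p):
    # f(k; n, p) = C(n, k) p^k (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 12, 0.3
for k in range(n + 1):
    # Table symmetry: the upper half of the table is the lower half at 1 - p.
    assert abs(binom_pmf(k, n, p) - binom_pmf(n - k, n, 1 - p)) < 1e-12
print(round(binom_pmf(3, n, p), 4))  # 0.2397
```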
Looking at the expression ƒ(k, n, p) as a function of k, there is a k value that maximizes it. This k value can be found
by calculating the ratio
ƒ(k + 1, n, p) / ƒ(k, n, p) = (n − k)p / ((k + 1)(1 − p))
and comparing it to 1, which gives M = ⌊(n + 1)p⌋.
ƒ(k, n, p) is monotone increasing for k < M and monotone decreasing for k > M, with the exception of the case where
(n + 1)p is an integer. In this case, there are two values for which ƒ is maximal: (n + 1)p and (n + 1)p − 1. M is the
most probable (most likely) outcome of the Bernoulli trials and is called the mode. Note that the probability of it
occurring can be fairly small.
The cumulative distribution function can be expressed as
F(k; n, p) = Pr(X ≤ k) = Σ_{i=0}^{⌊k⌋} C(n, i) p^i (1 − p)^(n−i),
where ⌊k⌋ is the "floor" under k, i.e. the greatest integer less than or equal to k.
It can also be represented in terms of the regularized incomplete beta function, as follows:
F(k; n, p) = I_{1−p}(n − k, k + 1).
For k ≤ np, upper bounds for the lower tail of the distribution function can be derived. In particular, Hoeffding's
inequality yields the bound
F(k; n, p) ≤ exp(−2n (p − k/n)²).
Moreover, these bounds are reasonably tight when p = 1/2, since the following expression holds for all k ≥ 3n/8[1]
In general, there is no single formula to find the median for a binomial distribution, and it may even be non-unique.
However several special results have been established:
• If np is an integer, then the mean, median, and mode coincide and equal np.[2][3]
• Any median m must lie within the interval ⌊np⌋ ≤ m ≤ ⌈np⌉.[4]
• A median m cannot lie too far away from the mean: |m − np| ≤ min{ ln 2, max{p, 1 − p} }.[5]
• The median is unique and equal to m = round(np) in cases when either p ≤ 1 − ln 2 or p ≥ ln 2 or
|m − np| ≤ min{p, 1 − p} (except for the case when p = ½ and n is odd).[4][5]
• When p = 1/2 and n is odd, any number m in the interval ½(n − 1) ≤ m ≤ ½(n + 1) is a median of the binomial
distribution. If p = 1/2 and n is even, then m = n/2 is the unique median.
Covariance between two binomials
If two binomially distributed random variables X and Y are observed together, estimating their covariance can be
useful. In the case n = 1 (that is, for Bernoulli trials), the covariance is
Cov(X, Y) = E(XY) − μX μY.
The first term is non-zero only when both X and Y are one, and μX and μY are equal to the two probabilities. Defining
pB as the probability of both happening at the same time, this gives
Cov(X, Y) = pB − pX pY,
and for n such independent pairwise trials
Cov(X, Y)_n = n(pB − pX pY).
If X and Y are the same variable, this reduces to the variance formula given above.
Sums of binomials
If X ~ B(n, p) and Y ~ B(m, p) are independent binomial variables with the same probability p, then X + Y is again a
binomial variable; its distribution is
Conditional binomials
If X ~ B(n, p) and, conditional on X, Y ~ B(X, q), then Y is a simple binomial variable with distribution
Y ~ B(n, pq).
Bernoulli distribution
The Bernoulli distribution is a special case of the binomial distribution, where n = 1. Symbolically, X ~ B(1, p) has
the same meaning as X ~ Bern(p). Conversely, any binomial distribution, B(n, p), is the sum of n independent
Bernoulli trials, Bern(p), each with the same probability p.
Normal approximation
If n is large enough, then the skew of the distribution is not too great. In this case a reasonable approximation to
B(n, p) is given by the normal distribution
N(np, np(1 − p)).
• Another commonly used rule holds that the normal approximation is appropriate only if everything within 3
standard deviations of its mean is within the range of possible values, that is if
np ± 3√(np(1 − p)) ∈ [0, n].
The following is an example of applying a continuity correction. Suppose one wishes to calculate Pr(X ≤ 8) for a
binomial random variable X. If Y has a distribution given by the normal approximation, then Pr(X ≤ 8) is
approximated by Pr(Y ≤ 8.5). The addition of 0.5 is the continuity correction; the uncorrected normal approximation
gives considerably less accurate results.
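With illustrative parameters (n = 20, p = 0.5, which are not specified in the text), the effect of the 0.5 correction is easy to quantify:

```python
from math import comb, erf, sqrt

def phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p, k = 20, 0.5, 8                      # assumed example parameters
mu, sigma = n * p, sqrt(n * p * (1 - p))

exact = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))
uncorrected = phi((k - mu) / sigma)
corrected = phi((k + 0.5 - mu) / sigma)   # continuity correction

print(round(exact, 4), round(uncorrected, 4), round(corrected, 4))
# 0.2517 0.1855 0.2512 -- the corrected value is far closer to the exact one
```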
This approximation, known as de Moivre–Laplace theorem, is a huge time-saver when undertaking calculations by
hand (exact calculations with large n are very onerous); historically, it was the first use of the normal distribution,
introduced in Abraham de Moivre's book The Doctrine of Chances in 1738. Nowadays, it can be seen as a
consequence of the central limit theorem since B(n, p) is a sum of n independent, identically distributed Bernoulli
variables with parameter p. This fact is the basis of a hypothesis test, a "proportion z-test," for the value of p using
x/n, the sample proportion and estimator of p, in a common test statistic.[7]
For example, suppose one randomly samples n people out of a large population and asks them whether they agree
with a certain statement. The proportion of people who agree will of course depend on the sample. If groups of n
people were sampled repeatedly and truly randomly, the proportions would follow an approximate normal
distribution with mean equal to the true proportion p of agreement in the population and with standard deviation
σ = (p(1 − p)/n)1/2. Large sample sizes n are good because the standard deviation, as a proportion of the expected
value, gets smaller, which allows a more precise estimate of the unknown parameter p.
Poisson approximation
The binomial distribution converges towards the Poisson distribution as the number of trials goes to infinity while
the product np remains fixed. Therefore the Poisson distribution with parameter λ = np can be used as an
approximation to B(n, p) of the binomial distribution if n is sufficiently large and p is sufficiently small. According
to two rules of thumb, this approximation is good if n ≥ 20 and p ≤ 0.05, or if n ≥ 100 and np ≤ 10.[8]
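The rule of thumb can be illustrated with, say, n = 100 and p = 0.02, so that λ = 2 (these numbers are illustrative):

```python
from math import comb, exp, factorial

n, p = 100, 0.02
lam = n * p                     # Poisson parameter, lambda = 2

worst = 0.0
for k in range(15):
    b = comb(n, k) * p**k * (1 - p)**(n - k)
    pois = exp(-lam) * lam**k / factorial(k)
    worst = max(worst, abs(b - pois))
print(worst < 0.01)  # True: the two mass functions agree to within 0.01
```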
Limiting distributions
• Poisson limit theorem: As n approaches ∞ and p approaches 0 while np remains fixed at λ > 0 or at least np
approaches λ > 0, then the Binomial(n, p) distribution approaches the Poisson distribution with expected value λ.
• de Moivre–Laplace theorem: As n approaches ∞ while p remains fixed, the distribution of
approaches the normal distribution with expected value 0 and variance 1. This result is sometimes loosely
stated by saying that the distribution of X is asymptotically normal with expected value np and
variance np(1 − p). This result is a specific case of the central limit theorem.
Confidence intervals
Even for quite large values of n, the actual distribution of the mean is significantly nonnormal.[9] Because of this
problem several methods to estimate confidence intervals have been proposed.
Let n1 be the number of successes out of n, the total number of trials, and let
be the proportion of successes. Let zα/2 be the 100 ( 1 − α / 2 )th percentile of the standard normal distribution.
• Wald method
p̂ ± z_{α/2} √( p̂(1 − p̂) / n ).
• ArcSine method[11]
sin²( arcsin(√p̂) ± z_{α/2} / (2√n) ).
The exact (Clopper–Pearson) method is the most conservative.[9] The Wald method, although commonly
recommended in textbooks, is the most biased.
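A sketch of the Wald interval (z = 1.96 for a 95% interval; clamping to [0, 1] is a common practical addition, not part of the formula itself):

```python
from math import sqrt

def wald_interval(successes, n, z=1.96):
    # p_hat +/- z * sqrt(p_hat (1 - p_hat) / n)
    p_hat = successes / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

lo, hi = wald_interval(40, 100)
print(round(lo, 3), round(hi, 3))  # 0.304 0.496
```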
References
[1] Matoušek, J.; Vondrák, J.: The Probabilistic Method (lecture notes) (http://kam.mff.cuni.cz/~matousek/prob-ln.ps.gz).
[2] Neumann, P. (1966). "Über den Median der Binomial- and Poissonverteilung" (in German). Wissenschaftliche Zeitschrift der Technischen
Universität Dresden 19: 29–33.
[3] Lord, Nick. (July 2010). "Binomial averages when the mean is an integer", The Mathematical Gazette 94, 331-332.
[4] Kaas, R.; Buhrman, J.M. (1980). "Mean, Median and Mode in Binomial Distributions". Statistica Neerlandica 34 (1): 13–18.
doi:10.1111/j.1467-9574.1980.tb00681.x.
[5] Hamza, K. (1995). "The smallest uniform upper bound on the distance between the mean and the median of the binomial and Poisson
distributions". Statistics & Probability Letters 23: 21–25. doi:10.1016/0167-7152(94)00090-U.
[6] Box, Hunter and Hunter (1978). Statistics for experimenters. Wiley. p. 130.
[7] NIST/SEMATECH, "7.2.4. Does the proportion of defectives meet requirements?" (http://www.itl.nist.gov/div898/handbook/prc/section2/prc24.htm), e-Handbook of Statistical Methods.
[8] NIST/SEMATECH, "6.3.3.1. Counts Control Charts" (http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc331.htm), e-Handbook of Statistical Methods.
[9] Brown LD, Cai T. and DasGupta A (2001). Interval estimation for a binomial proportion (with discussion). Statist Sci 16: 101–133
[10] Agresti A, Coull BA (1998) "Approximate is better than 'exact' for interval estimation of binomial proportions". The American Statistician
52:119–126
[11] Pires MA () Confidence intervals for a binomial proportion: comparison of methods and software evaluation.
[12] Wilson EB (1927) "Probable inference, the law of succession, and statistical inference". Journal of the American Statistical Association 22:
209–212
[13] Devroye, Luc (1986). Non-Uniform Random Variate Generation. New York: Springer-Verlag. (See especially Chapter X, Discrete
Univariate Distributions (http://cg.scs.carleton.ca/~luc/chapter_ten.pdf).)
[14] Kachitvichyanukul, V.; Schmeiser, B. W. (1988). "Binomial random variate generation". Communications of the ACM 31 (2): 216–222.
doi:10.1145/42372.42381.
Cauchy distribution 66
Cauchy distribution
Cauchy
CDF
Mean undefined
Median x0
Mode x0
Variance undefined
Skewness undefined
Ex. kurtosis undefined
Entropy
MGF does not exist
CF
The Cauchy distribution, named after Augustin Cauchy, is a continuous probability distribution. It is also known,
especially among physicists, as the Lorentz distribution (after Hendrik Lorentz), Cauchy–Lorentz distribution,
Lorentz(ian) function, or Breit–Wigner distribution.
The Cauchy distribution is often used in statistics as the canonical example of a "pathological" distribution. Its mean
does not exist and its variance is infinite. The Cauchy distribution does not have finite moments of order greater than
or equal to one; only fractional absolute moments exist.[1] The Cauchy distribution has no moment generating
function.
Its importance in physics is the result of its being the solution to the differential equation describing forced
resonance.[2] In mathematics, it is closely related to the Poisson kernel, which is the fundamental solution for the
Laplace equation in the upper half-plane. In spectroscopy, it is the description of the shape of spectral lines which are
subject to homogeneous broadening in which all atoms interact in the same way with the frequency range contained
in the line shape. Many mechanisms cause homogeneous broadening, most notably collision broadening, and
Chantler–Alda radiation.[3] In its standard form, it is the maximum entropy probability distribution for a random
variate X for which E[ln(1 + X²)] = ln 4.[4]
Characterisation
Probability density function
The Cauchy distribution has the probability density function
f(x; x0, γ) = 1 / ( πγ [ 1 + ((x − x0)/γ)² ] ),
where x0 is the location parameter, specifying the location of the peak of the distribution, and γ is the scale parameter
which specifies the half-width at half-maximum (HWHM). γ is also equal to half the interquartile range and is
sometimes called the probable error. Augustin-Louis Cauchy exploited such a density function in 1827 with an
infinitesimal scale parameter, defining what would now be called a Dirac delta function.
The amplitude of the above Lorentzian function is given by 1/(πγ).
The special case when x0 = 0 and γ = 1 is called the standard Cauchy distribution, with the probability density
function
f(x; 0, 1) = 1 / ( π(1 + x²) ).
Cumulative distribution function
The cumulative distribution function is
F(x; x0, γ) = (1/π) arctan((x − x0)/γ) + 1/2,
and the quantile function (inverse cdf) of the Cauchy distribution is
Q(p; x0, γ) = x0 + γ tan(π(p − 1/2)).
It follows that the first and third quartiles are x0 − γ and x0 + γ, and hence the interquartile range is 2γ.
The derivative of the quantile function, the quantile density function, for the Cauchy distribution is
Q′(p; γ) = γπ sec²(π(p − 1/2)).
The differential entropy of a distribution can be defined in terms of its quantile density,[5] specifically
h = ∫₀¹ ln(Q′(p)) dp = ln(4πγ).
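The quartile and interquartile-range statements follow directly from the quantile function; a quick sketch:

```python
from math import atan, pi, tan

def cauchy_cdf(x, x0=0.0, gamma=1.0):
    return atan((x - x0) / gamma) / pi + 0.5

def cauchy_quantile(p, x0=0.0, gamma=1.0):
    return x0 + gamma * tan(pi * (p - 0.5))

x0, gamma = 2.0, 3.0
q1 = cauchy_quantile(0.25, x0, gamma)   # x0 - gamma
q3 = cauchy_quantile(0.75, x0, gamma)   # x0 + gamma
print(round(q3 - q1, 9))  # 6.0 = 2 * gamma, the interquartile range
```

The same quantile function, applied to uniform random numbers, is also the standard way to simulate Cauchy variates.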
Properties
The Cauchy distribution is an example of a distribution which has no mean, variance or higher moments defined. Its
mode and median are well defined and are both equal to x0.
When U and V are two independent normally distributed random variables with expected value 0 and variance 1,
then the ratio U/V has the standard Cauchy distribution.
If are independent and identically distributed random variables, each with a standard Cauchy distribution, then the
sample mean has the same standard Cauchy distribution (the sample median, which is not
affected by extreme values, can be used as a measure of central tendency). To see that this is true, compute the
characteristic function of the sample mean:
φ_X̄(t) = E[e^{i X̄ t}] = ( e^{−|t|/n} )^n = e^{−|t|},
where X̄ is the sample mean. This example serves to show that the hypothesis of finite variance in the central limit
theorem cannot be dropped. It is also an example of a more generalized version of the central limit theorem that is
characteristic of all stable distributions, of which the Cauchy distribution is a special case.
The Cauchy distribution is an infinitely divisible probability distribution. It is also a strictly stable distribution.[6]
The standard Cauchy distribution coincides with the Student's t-distribution with one degree of freedom.
Like all stable distributions, the location-scale family to which the Cauchy distribution belongs is closed under linear
transformations with real coefficients. In addition, the Cauchy distribution is the only univariate distribution which is
closed under linear fractional transformations with real coefficients.[7] In this connection, see also McCullagh's
parametrization of the Cauchy distributions.
Cauchy distribution 69
Characteristic function
Let X denote a Cauchy distributed random variable. The characteristic function of the Cauchy distribution is given
by
φ_X(t) = E(e^{iXt}) = e^{i x0 t − γ|t|},
which is just the Fourier transform of the probability density. The original probability density may be expressed in
terms of the characteristic function, essentially by using the inverse Fourier transform:
Observe that the characteristic function is not differentiable at the origin: this corresponds to the fact that the Cauchy
distribution does not have an expected value.
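The closed form φ_X(t) = e^{i x0 t − γ|t|} can be spot-checked by integrating the density directly. A rough sketch for the standard case (the truncation window and step size are ad-hoc choices of this example):

```python
import math

def cauchy_cf_numeric(t, half_width=200.0, n=200_000):
    # Trapezoidal approximation of ∫ cos(t x) / (pi (1 + x^2)) dx over
    # [-half_width, half_width]; the imaginary part vanishes by symmetry.
    dx = 2 * half_width / n
    total = 0.0
    for k in range(n + 1):
        x = -half_width + k * dx
        w = 0.5 if k in (0, n) else 1.0  # trapezoidal end weights
        total += w * math.cos(t * x) / (math.pi * (1.0 + x * x))
    return total * dx

for t in (0.5, 1.0, 2.0):
    print(t, cauchy_cf_numeric(t), math.exp(-abs(t)))  # the two columns agree
```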
Mean
If a probability distribution has a density function f(x), then the mean is
(1)   ∫_{−∞}^{∞} x f(x) dx.
The question is now whether this is the same thing as
(2)   ∫_{0}^{∞} x f(x) dx − ∫_{−∞}^{0} (−x) f(x) dx.
If at most one of the two terms in (2) is infinite, then (1) is the same as (2). But in the case of the Cauchy
distribution, both the positive and negative terms of (2) are infinite. This means (2) is undefined. Moreover, if (1) is
construed as a Lebesgue integral, then (1) is also undefined, because (1) is then defined simply as the difference (2)
between positive and negative parts.
However, if (1) is construed as an improper integral rather than a Lebesgue integral, then (2) is undefined, and (1) is
not necessarily well-defined. We may take (1) to mean
and this is its Cauchy principal value, which is zero, but we could also take (1) to mean, for example,
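Because the antiderivative of x f(x) is log(1 + x²)/(2π), the dependence on how the integration limits grow is easy to exhibit; a small sketch (the −T to 2T choice is just one illustration of a non-symmetric limit):

```python
import math

def partial_mean(a, b):
    # ∫_a^b x/(pi*(1+x^2)) dx, using the antiderivative log(1+x^2)/(2*pi).
    F = lambda u: math.log1p(u * u) / (2 * math.pi)
    return F(b) - F(a)

for T in (10.0, 1_000.0, 100_000.0):
    sym = partial_mean(-T, T)       # symmetric limits: the principal value, 0
    skew = partial_mean(-T, 2 * T)  # limits growing at different rates
    print(T, sym, skew)             # skew approaches log(2)/pi ~ 0.2206
```

Different growth rates of the limits give different finite answers, which is exactly why (1) is not well-defined.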
Higher moments
The Cauchy distribution does not have finite moments of any order. Some of the higher raw moments do exist and
have a value of infinity, for example the raw second moment:
Higher even-powered raw moments will also evaluate to infinity. Odd-powered raw moments, however, do not exist
at all (i.e. are undefined), which is distinctly different from existing with the value of infinity. (Consider 1/0, which
is defined with the value of infinity, vs. 0/0, which is undefined.) The odd-powered raw moments are undefined
because their values are essentially equivalent to ∞ − ∞, since the two halves of the integral both diverge and
have opposite signs. The first raw moment is the mean, which, being odd, does not exist. (See also the discussion
above about this.) This in turn means that all of the central moments and standardized moments do not exist (are
undefined), since they are all based on the mean. The variance — which is the second central moment — is likewise
non-existent (despite the fact that the raw second moment exists with the value infinity).
The results for higher moments follow from Hölder's inequality, which implies that higher moments (or halves of
moments) diverge if lower ones do.
Estimation of parameters
Because the mean and variance of the Cauchy distribution are not defined, attempts to estimate these parameters will
not be successful. For example, if n samples are taken from a Cauchy distribution, one may calculate the sample
mean as
x̄ = (1/n) Σ_{i=1}^{n} x_i.
Although the sample values xi will be concentrated about the central value x0, the sample mean will become
increasingly variable as more samples are taken, because of the increased likelihood of encountering sample points
with a large absolute value. In fact, the distribution of the sample mean will be equal to the distribution of the
samples themselves; i.e., the sample mean of a large sample is no better (or worse) an estimator of x0 than any single
observation from the sample. Similarly, calculating the sample variance will result in values that grow larger as more
samples are taken.
Therefore, more robust means of estimating the central value x0 and the scaling parameter γ are needed. One simple
method is to take the median value of the sample as an estimator of x0 and half the sample interquartile range as an
estimator of γ. Other, more precise and robust methods have been developed.[8][9] For example, the truncated mean
of the middle 24% of the sample order statistics produces an estimate for x0 that is more efficient than using either
the sample median or the full sample mean.[10][11] However, because of the fat tails of the Cauchy distribution, the
efficiency of the estimator decreases if more than 24% of the sample is used.[10][11]
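A quick simulation comparing these robust estimators is straightforward; a stdlib-Python sketch (the true parameters, sample size, and seed are arbitrary):

```python
import math
import random
import statistics

rng = random.Random(42)
x0, gamma, n = 3.0, 2.0, 10_001  # true parameters, arbitrary for this demo

# Cauchy(x0, gamma) sample via the inverse CDF.
xs = sorted(x0 + gamma * math.tan(math.pi * (rng.random() - 0.5))
            for _ in range(n))

median_est = xs[n // 2]                    # robust estimate of x0
q1, q3 = xs[n // 4], xs[3 * n // 4]
gamma_est = (q3 - q1) / 2                  # half the IQR estimates gamma

# Truncated mean of the middle 24% of the order statistics (see text).
lo, hi = int(0.38 * n), int(0.62 * n)
trunc_mean = statistics.fmean(xs[lo:hi])

print(round(median_est, 2), round(gamma_est, 2), round(trunc_mean, 2))
```

All three estimates land close to the true values, while the full sample mean would not.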
Maximum likelihood can also be used to estimate the parameters x0 and γ. However, this tends to be complicated by
the fact that this requires finding the roots of a high degree polynomial, and there can be multiple roots that represent
local maxima.[12] Also, while the maximum likelihood estimator is asymptotically efficient, it is relatively inefficient
for small samples.[13] The log-likelihood function for the Cauchy distribution for sample size n is:
ℓ(x0, γ) = −n log(γπ) − Σ_{i=1}^{n} log(1 + ((x_i − x0)/γ)²)
Maximizing the log-likelihood function with respect to x0 and γ produces the following system of equations:
Σ_{i=1}^{n} (x_i − x0) / (γ² + (x_i − x0)²) = 0
Σ_{i=1}^{n} (x_i − x0)² / (γ² + (x_i − x0)²) = n/2
Note that Σ (x_i − x0)² / (γ² + (x_i − x0)²) is a monotone function of γ and that the solution γ must satisfy min_i |x_i − x0| ≤ γ ≤ max_i |x_i − x0|.
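For illustration, the log-likelihood can also be maximized numerically without any root-finding machinery; the sketch below uses a deterministic quantile "sample" and a crude coarse-to-fine grid search, both choices of this example rather than of the article:

```python
import math

# Deterministic "sample": n quantiles of Cauchy(x0 = 1, gamma = 0.5).
n, true_x0, true_gamma = 2001, 1.0, 0.5
xs = [true_x0 + true_gamma * math.tan(math.pi * ((j + 0.5) / n - 0.5))
      for j in range(n)]

def loglik(x0, g):
    # Cauchy log-likelihood: n*log(g) - n*log(pi) - sum log(g^2 + (x - x0)^2)
    return n * math.log(g) - n * math.log(math.pi) - sum(
        math.log(g * g + (x - x0) ** 2) for x in xs)

# Crude coarse-to-fine grid search (hill climbing with step halving).
best, step = (0.0, 1.0), 1.0
for _ in range(16):
    cx, cg = best
    candidates = [(cx + i * step, max(cg + j * step, 1e-6))
                  for i in (-1, 0, 1) for j in (-1, 0, 1)]
    best = max(candidates, key=lambda p: loglik(*p))
    if best == (cx, cg):
        step /= 2  # no improvement: refine the search
print(best)  # should land near (1.0, 0.5)
```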
Circular Cauchy distribution
If X is Cauchy distributed with median μ and scale parameter γ, then the complex variable Z = (X − i)/(X + i) has unit modulus and is distributed on the unit circle, and ψ = μ + iγ expresses the two parameters of the associated linear Cauchy distribution for X as a complex number.[15][16]
The distribution of Z is called the circular Cauchy distribution (also the complex Cauchy distribution)
with parameter ζ = (ψ − i)/(ψ + i). The circular Cauchy distribution is related to the wrapped Cauchy distribution. If Z is
a wrapped Cauchy distribution with the parameter representing the parameters of the corresponding
"unwrapped" Cauchy distribution in the variable y where , then
See also McCullagh's parametrization of the Cauchy distributions and Poisson kernel for related concepts.
The circular Cauchy distribution expressed in complex form has finite moments of all orders. For |φ| < 1, the linear fractional transformation T(z) = (z + φ)/(1 + φ̄z)
is holomorphic on the unit disk, and the transformed variable T(Z) is distributed as complex Cauchy with
parameter T(ζ).
Given a sample of size n > 2, the maximum-likelihood equation can be solved by fixed-point iteration starting from an initial estimate. The sequence of likelihood values is non-decreasing, and the solution is unique for samples containing at least three distinct values.[17]
The maximum-likelihood estimate for the median ( ) and scale parameter ( ) of a real Cauchy sample is
obtained by the inverse transformation:
For n ≤ 4, closed-form expressions for the maximum-likelihood estimates are known.[12] The density of the maximum-likelihood estimator at t in the
unit disk is necessarily of the form:
where
Multivariate Cauchy distribution
A random vector X = (X1, …, Xk) is said to have the multivariate Cauchy distribution if every linear combination of its components Y = a1X1 + … + akXk has a Cauchy distribution. That is, for any constant vector a ∈ Rk, the random variable Y = aᵀX should have a univariate Cauchy distribution.[19] The characteristic function of a multivariate Cauchy distribution is given by:
where x0(t) and γ(t) are real functions with x0(t) a homogeneous function of degree one and γ(t) a positive
homogeneous function of degree one.[19] More formally:[19]
and for all t.
An example of a bivariate Cauchy distribution can be given by:[20]
Note that in this example, even though there is no analogue to a covariance matrix, x and y are not statistically
independent.[20]
Analogously to the univariate density, the multidimensional Cauchy density also relates to the multivariate Student
distribution. They are equivalent when the degrees of freedom parameter is equal to one. The density of a k
dimension Student distribution with one degree of freedom becomes:
Properties and details for this density can be obtained by taking it as a particular case of the multivariate Student
density.
Transformation properties
• If X ~ Cauchy(x0, γ), then X + a ~ Cauchy(x0 + a, γ)
• If X ~ Cauchy(x0, γ), then aX ~ Cauchy(a x0, |a|γ)
• If X1 ~ Cauchy(x1, γ1) and X2 ~ Cauchy(x2, γ2) are independent, then X1 + X2 ~ Cauchy(x1 + x2, γ1 + γ2)
• If X ~ Cauchy(0, γ), then 1/X ~ Cauchy(0, 1/γ)
• McCullagh's parametrization of the Cauchy distributions: Expressing a Cauchy distribution in terms of one
complex parameter ψ = x0 + iγ, define X ~ Cauchy(ψ) to mean X ~ Cauchy(x0, γ). If X ~ Cauchy(ψ)
then (aX + b)/(cX + d) ~ Cauchy((aψ + b)/(cψ + d)), where a, b, c and d are real numbers. Moreover:
(X − i)/(X + i) ~ CCauchy((ψ − i)/(ψ + i))
Related distributions
• Student's t distribution
• Non-standardized Student's t distribution
• If X ~ N(0, 1) and Y ~ N(0, 1) are independent, then X/Y ~ Cauchy(0, 1)
• If U ~ Uniform(0, 1), then tan(π(U − 1/2)) ~ Cauchy(0, 1)
• If X ~ Log-Cauchy(0, 1), then ln(X) ~ Cauchy(0, 1)
• The Cauchy distribution is a limiting case of a Pearson distribution of type 4
• The Cauchy distribution is a special case of a Pearson distribution of type 7.[1]
• The Cauchy distribution is a stable distribution: if X ~ Stable(1, 0, γ, μ), then X ~ Cauchy(μ, γ).
• The Cauchy distribution is a singular limit of a Hyperbolic distribution
• The wrapped Cauchy distribution, taking values on a circle, is derived from the Cauchy distribution by wrapping
it around the circle.
References
[1] N. L. Johnson, S. Kotz, and N. Balakrishnan (1994). Continuous Univariate Distributions, Volume 1. New York: Wiley., Chapter 16.
[2] http://webphysics.davidson.edu/Projects/AnAntonelli/node5.html Note that the intensity, which follows the Cauchy distribution, is the
square of the amplitude.
[3] E. Hecht (1987). Optics (2nd ed.). Addison-Wesley. p. 603.
[4] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http://www.econ.yorku.ca/cesg/papers/berapark.pdf). Journal of Econometrics (Elsevier): 219–230. Retrieved 2011-06-02.
[5] Vasicek, Oldrich (1976). "A Test for Normality Based on Sample Entropy". Journal of the Royal Statistical Society, Series B
(Methodological) 38 (1): 54–59.
[6] S. Kotz et al. (2006). Encyclopedia of Statistical Sciences (2nd ed.). John Wiley & Sons. p. 778. ISBN 978-0-471-15044-2.
[7] F. B. Knight (1976). "A characterization of the Cauchy type". Proceedings of the American Mathematical Society 55: 130–135.
[8] Cane, Gwenda J. (1974). "Linear Estimation of Parameters of the Cauchy Distribution Based on Sample Quantiles". Journal of the American
Statistical Association 69 (345): 243–245. JSTOR 2285535.
[9] Zhang, Jin (2010). "A Highly Efficient L-estimator for the Location Parameter of the Cauchy Distribution" (http://www.springerlink.com/content/3p1430175v4806jq). Computational Statistics 25 (1): 97–105.
[10] Rothenberg, Thomas J.; Fisher, Franklin M.; Tilanus, C. B. (1966). "A note on estimation from a Cauchy sample". Journal of the American
Statistical Association 59 (306): 460–463.
[11] Bloch, Daniel (1966). "A note on the estimation of the location parameters of the Cauchy distribution". Journal of the American Statistical
Association 61 (316): 852–855. JSTOR 2282794.
[12] Ferguson, Thomas S. (1978). "Maximum Likelihood Estimates of the Parameters of the Cauchy Distribution for Samples of Size 3 and 4".
Journal of the American Statistical Association 73 (361): 211. JSTOR 2286549.
[13] Cohen Freue, Gabriella V. (2007). "The Pitman estimator of the Cauchy location parameter" (http://faculty.ksu.edu.sa/69424/USEPAP/Coushy dist.pdf). Journal of Statistical Planning and Inference 137: 1901.
[14] Barnett, V. D. (1966). "Order Statistics Estimators of the Location of the Cauchy Distribution". Journal of the American Statistical
Association 61 (316): 1205. JSTOR 2283210.
[15] McCullagh, P., "Conditional inference and Cauchy models" (http://biomet.oxfordjournals.org/cgi/content/abstract/79/2/247), Biometrika, volume 79 (1992), pages 247–259. PDF (http://www.stat.uchicago.edu/~pmcc/pubs/paper18.pdf) from McCullagh's homepage.
[16] K.V. Mardia (1972). Statistics of Directional Data. Academic Press.
[17] J. Copas (1975). "On the unimodality of the likelihood function for the Cauchy distribution". Biometrika 62: 701–704.
[18] P. McCullagh (1996). "Mobius transformation and Cauchy parameter estimation.". Annals of Statistics 24: 786–808. JSTOR 2242674.
[19] Ferguson, Thomas S. (1962). "A Representation of the Symmetric Bivariate Cauchy Distribution". Journal of the American Statistical
Association: 1256. JSTOR 2237984.
[20] Molenberghs, Geert; Lesaffre, Emmanuel (1997). "Non-linear Integral Equations to Approximate Bivariate Densities with Given Marginals
and Dependence Function" (http:/ / www3. stat. sinica. edu. tw/ statistica/ oldpdf/ A7n310. pdf). Statistica Sinica 7: 713–738. .
External links
• Earliest Uses: The entry on Cauchy distribution has some historical information. (http://jeff560.tripod.com/c.
html)
• Weisstein, Eric W., " Cauchy Distribution (http://mathworld.wolfram.com/CauchyDistribution.html)" from
MathWorld.
• GNU Scientific Library – Reference Manual (http://www.gnu.org/software/gsl/manual/gsl-ref.
html#SEC294)
Cauchy–Schwarz inequality
In mathematics, the Cauchy–Schwarz inequality (also known as the Bunyakovsky inequality, the Schwarz
inequality, or the Cauchy–Bunyakovsky–Schwarz inequality, or Cauchy–Bunyakovsky inequality), is a useful
inequality encountered in many different settings, such as linear algebra, analysis, probability theory, and other
areas. It is considered to be one of the most important inequalities in all of mathematics.[1] It has a number of
generalizations, among them Hölder's inequality.
The inequality for sums was published by Augustin-Louis Cauchy (1821), while the corresponding inequality for
integrals was first stated by Viktor Bunyakovsky (1859) and rediscovered by Hermann Amandus Schwarz (1888).
The inequality states that for all vectors x and y of an inner product space,
|⟨x, y⟩|² ≤ ⟨x, x⟩ · ⟨y, y⟩,
where ⟨·, ·⟩ is the inner product. Equivalently, by taking the square root of both sides, and referring to the norms of
the vectors, the inequality is written as
|⟨x, y⟩| ≤ ‖x‖ ‖y‖.
Moreover, the two sides are equal if and only if x and y are linearly dependent (or, in a geometrical sense, they are
parallel or one of the vectors is equal to zero).
If x1, …, xn and y1, …, yn are any complex numbers, the inner product is the standard inner
product, and the bar notation is used for complex conjugation, then the inequality may be restated more explicitly
as
|x1 ȳ1 + … + xn ȳn|² ≤ (|x1|² + … + |xn|²)(|y1|² + … + |yn|²).
When viewed in this way the numbers x1, ..., xn, and y1, ..., yn are the components of x and y with respect to an
orthonormal basis of V.
Even more compactly written:
Cauchy–Schwarz inequality 75
Equality holds if and only if x and y are linearly dependent, that is, one is a scalar multiple of the other (which
includes the case when one or both are zero).
The finite-dimensional case of this inequality for real vectors was proved by Cauchy in 1821, and in 1859 Cauchy's
student Bunyakovsky noted that by taking limits one can obtain an integral form of Cauchy's inequality. The general
result for an inner product space was obtained by Schwarz in 1885.
Proof
Let u, v be arbitrary vectors in a vector space V over F with an inner product, where F is the field of real or complex
numbers. We prove the inequality
and the fact that equality holds only when u and v are linearly dependent (the fact that conversely one has equality if
u and v are linearly dependent is immediate from the properties of the inner product).
This inequality is trivial, and in fact an equality, in the case v = 0, and in this case u and v are also linearly dependent,
regardless of u. The theorem being thus proved for this case, we henceforth assume that v is nonzero. Let
z = u − (⟨u, v⟩ / ⟨v, v⟩) v.
Then, by linearity of the inner product in its first argument, one has
⟨z, v⟩ = ⟨u, v⟩ − (⟨u, v⟩ / ⟨v, v⟩) ⟨v, v⟩ = 0,
i.e., z is a vector orthogonal to the vector v (indeed, z is the projection of u onto the plane orthogonal to v). We can
thus apply the Pythagorean theorem to
u = (⟨u, v⟩ / ⟨v, v⟩) v + z,
which gives
‖u‖² = |⟨u, v⟩|² / ‖v‖² + ‖z‖² ≥ |⟨u, v⟩|² / ‖v‖²,
and after multiplication by ||v||2 the Cauchy–Schwarz inequality. Moreover, if the relation '≥' in the above expression
is actually an equality, then ||z||2 = 0 and hence z = 0; the definition of z then establishes a relation of linear
dependence between u and v. This establishes the theorem.
Rn
In Euclidean space Rn with the standard inner product, the Cauchy–Schwarz inequality is
To prove this form of the inequality, consider the following quadratic polynomial in z:
(x1 z + y1)² + … + (xn z + yn)² = (Σ xi²) z² + 2 (Σ xi yi) z + Σ yi².
Since it is nonnegative it has at most one real root in z, whence its discriminant is less than or equal to zero, that is,
(Σ xi yi)² − (Σ xi²)(Σ yi²) ≤ 0.
collecting together identical terms (albeit with different summation indices) we find:
Because the left-hand side of the equation is a sum of the squares of real numbers it is greater than or equal to zero,
thus:
When n = 3 the Cauchy–Schwarz inequality can also be deduced from Lagrange's identity, which takes the form
(x1² + x2² + x3²)(y1² + y2² + y3²) − (x1y1 + x2y2 + x3y3)² = (x1y2 − x2y1)² + (x2y3 − x3y2)² + (x3y1 − x1y3)²,
whose right-hand side is the squared norm of the cross product x × y.
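Lagrange's identity for n = 3 says the slack in the Cauchy–Schwarz inequality equals the squared norm of the cross product; a small stdlib-Python check (the two vectors are arbitrary):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

a = (1.0, -2.0, 3.0)
b = (4.0, 0.5, -1.0)

lhs = dot(a, a) * dot(b, b) - dot(a, b) ** 2  # Cauchy-Schwarz slack
rhs = dot(cross(a, b), cross(a, b))           # |a x b|^2 (Lagrange's identity)
print(lhs, rhs)  # equal, and nonnegative
```

Since the right-hand side is a sum of squares, the slack is nonnegative, which is exactly the inequality.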
L2
For the inner product space of square-integrable complex-valued functions, one has
|∫ f(x) ḡ(x) dx|² ≤ ∫ |f(x)|² dx · ∫ |g(x)|² dx.
Use
The triangle inequality for the inner product norm is often shown as a consequence of the Cauchy–Schwarz inequality, as
follows: given vectors x and y,
‖x + y‖² = ⟨x + y, x + y⟩ = ‖x‖² + 2 Re⟨x, y⟩ + ‖y‖² ≤ ‖x‖² + 2‖x‖‖y‖ + ‖y‖² = (‖x‖ + ‖y‖)²,
and taking square roots gives ‖x + y‖ ≤ ‖x‖ + ‖y‖.
The Cauchy–Schwarz inequality also allows one to define the angle θ between two nonzero vectors x and y of a real inner product space by cos θ = ⟨x, y⟩ / (‖x‖ ‖y‖). The inequality proves that this definition is sensible, by showing that the right-hand side lies in the
interval [−1, 1], and justifies the notion that (real) Hilbert spaces are simply generalizations of the Euclidean space.
It can also be used to define an angle in complex inner product spaces, by taking the absolute value of the right hand
side, as is done when extracting a metric from quantum fidelity.
The Cauchy–Schwarz is used to prove that the inner product is a continuous function with respect to the topology
induced by the inner product itself.
The Cauchy–Schwarz inequality is usually used to show Bessel's inequality.
Probability theory
For the multivariate case, the inequality states that Var(X) − Cov(X, Y) Var(Y)⁻¹ Cov(Y, X) is positive semi-definite.
For the univariate case, Cov(X, Y)² ≤ Var(X) Var(Y). Indeed, for random variables X and Y, |E(XY)|² ≤ E(X²) E(Y²), and applying this to the centered variables X − E(X) and Y − E(Y) yields the covariance bound.
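The covariance form Cov(X, Y)² ≤ Var(X) Var(Y) holds for sample moments as well, since it is Cauchy–Schwarz applied to the centered data vectors; a simulation sketch (distribution, coefficients, and seed are arbitrary):

```python
import random

rng = random.Random(7)
n = 50_000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [2 * xi + rng.gauss(0, 3) for xi in x]  # correlated with x, but noisy

mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
vx = sum((a - mx) ** 2 for a in x) / (n - 1)
vy = sum((b - my) ** 2 for b in y) / (n - 1)

print(cov * cov <= vx * vy)  # True: the sample version of Cov^2 <= Var*Var
```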
Generalizations
Various generalizations of the Cauchy–Schwarz inequality exist in the context of operator theory, e.g. for
operator-convex functions, and operator algebras, where the domain and/or range of φ are replaced by a C*-algebra
or W*-algebra.
This section lists a few of such inequalities from the operator algebra setting, to give a flavor of results of this type.
One can discuss inner products in terms of positive functionals: given the Hilbert space L2(m), m being a finite measure, the inner product gives rise to a positive functional φ by φ(g) = ⟨g, 1⟩. Since ⟨ƒ, ƒ⟩ ≥ 0, φ(ƒ*ƒ) ≥ 0 for all ƒ in L2(m), where ƒ* is the pointwise conjugate of ƒ. So φ is positive. Conversely
every positive functional φ gives a corresponding inner product ⟨ƒ, g⟩φ = φ(g*ƒ). In this language, the
Cauchy–Schwarz inequality becomes
|φ(g*ƒ)|² ≤ φ(ƒ*ƒ) φ(g*g).
Since φ is a positive linear map whose range, the complex numbers C, is a commutative C*-algebra, φ is completely
positive. Therefore
This is precisely the Cauchy–Schwarz inequality. If ƒ and g are elements of a C*-algebra, f* and g* denote their
respective adjoints.
We can also deduce from above that every positive linear functional is bounded, corresponding to the fact that the
inner product is jointly continuous.
Positive maps
Positive functionals are special cases of positive maps. A linear map Φ between C*-algebras is said to be a positive
map if a ≥ 0 implies Φ(a) ≥ 0. It is natural to ask whether inequalities of Schwarz-type exist for positive maps. In
this more general setting, usually additional assumptions are needed to obtain such results.
Kadison–Schwarz inequality
The following theorem is named after Richard Kadison.
Theorem. If Φ is a unital positive map, then for every normal element a in its domain, we have Φ(a*a) ≥ Φ(a*)Φ(a)
and Φ(a*a) ≥ Φ(a)Φ(a*).
This extends the fact φ(a*a) · 1 ≥ φ(a)*φ(a) = |φ(a)|2, when φ is a linear functional.
The case when a is self-adjoint, i.e. a = a*, is sometimes known as Kadison's inequality.
2-positive maps
When Φ is 2-positive, a stronger assumption than merely positive, one has something that looks very similar to the
original Cauchy–Schwarz inequality:
Theorem (Modified Schwarz inequality for 2-positive maps)[2] For a 2-positive map Φ between C*-algebras, for all
a, b in its domain,
1.
2.
A simple argument for (2) is as follows. Consider the positive matrix
By 2-positivity of Φ,
is positive. The desired inequality then follows from the properties of positive 2 × 2 (operator) matrices.
Physics
The general formulation of the Heisenberg uncertainty principle is derived using the Cauchy–Schwarz inequality in
the Hilbert space of quantum observables.
Notes
[1] The Cauchy–Schwarz Master Class: an Introduction to the Art of Mathematical Inequalities, Ch. 1 (http://www-stat.wharton.upenn.edu/~steele/Publications/Books/CSMC/CSMC_index.html) by J. Michael Steele.
[2] Paulsen (2002), Completely Bounded Maps and Operator Algebras (http://books.google.com/books?id=VtSFHDABxMIC&pg=PA40), ISBN 9780521816694, page 40.
References
• Bityutskov, V.I. (2001), "Bunyakovskii inequality" (http://www.encyclopediaofmath.org/index.php?title=b/
b017770), in Hazewinkel, Michiel, Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
• Bouniakowsky, V. (1859), "Sur quelques inegalités concernant les intégrales aux différences finies" (http://
www-stat.wharton.upenn.edu/~steele/Publications/Books/CSMC/bunyakovsky.pdf) (PDF), Mem. Acad. Sci.
St. Petersbourg 7 (1): 9
• Cauchy, A. (1821), Oeuvres 2, III, p. 373
• Dragomir, S. S. (2003), "A survey on Cauchy–Bunyakovsky–Schwarz type discrete inequalities" (http://jipam.
vu.edu.au/article.php?sid=301), JIPAM. J. Inequal. Pure Appl. Math. 4 (3): 142 pp
• Kadison, R.V. (1952), "A generalized Schwarz inequality and algebraic invariants for operator algebras", Ann.
Math. 56 (3): 494–503, doi:10.2307/1969657, JSTOR 1969657.
• Lohwater, Arthur (1982), Introduction to Inequalities (http://www.mediafire.com/?1mw1tkgozzu), Online
e-book in PDF format
• Paulsen, V. (2003), Completely Bounded Maps and Operator Algebras, Cambridge University Press.
• Schwarz, H. A. (1888), "Über ein Flächen kleinsten Flächeninhalts betreffendes Problem der Variationsrechnung"
(http://www-stat.wharton.upenn.edu/~steele/Publications/Books/CSMC/Schwarz.pdf) (PDF), Acta
Societatis scientiarum Fennicae XV: 318
• Solomentsev, E.D. (2001), "Cauchy inequality" (http://www.encyclopediaofmath.org/index.php?title=C/
c020880), in Hazewinkel, Michiel, Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
• Steele, J.M. (2004), The Cauchy–Schwarz Master Class (http://www-stat.wharton.upenn.edu/~steele/
Publications/Books/CSMC/CSMC_index.html), Cambridge University Press, ISBN 0-521-54677-X
External links
• Earliest Uses: The entry on the Cauchy–Schwarz inequality has some historical information. (http://jeff560.
tripod.com/c.html)
• Example of application of Cauchy–Schwarz inequality to determine Linearly Independent Vectors (http://
people.revoledu.com/kardi/tutorial/LinearAlgebra/LinearlyIndependent.html#LinearlyIndependentVectors)
Tutorial and Interactive program.
Characteristic function (probability theory) 81
Introduction
The characteristic function provides an alternative way for describing a random variable. Similarly to the cumulative
distribution function
(where 1{X ≤ x} is the indicator function — it is equal to 1 when X ≤ x, and zero
otherwise)
which completely determines behavior and properties of the probability distribution of the random variable X, the
characteristic function
also completely determines behavior and properties of the probability distribution of the random variable X. The two
approaches are equivalent in the sense that knowing one of the functions it is always possible to find the other, yet
they both provide different insight for understanding the features of the random variable. However, in particular
cases, there can be differences in whether these functions can be represented as expressions involving simple
standard functions.
If a random variable admits a density function, then the characteristic function is its dual, in the sense that each of
them is a Fourier transform of the other. If a random variable has a moment-generating function, then the domain of
the characteristic function can be extended to the complex plane, and
[1]
Note however that the characteristic function of a distribution always exists, even when the probability density
function or moment-generating function do not.
The characteristic function approach is particularly useful in analysis of linear combinations of independent random
variables: a classical proof of the Central Limit Theorem uses characteristic functions and Lévy's continuity theorem.
Another important application is to the theory of the decomposability of random variables.
Definition
For a scalar random variable X the characteristic function is defined as the expected value of eitX, where i is the
imaginary unit, and t ∈ R is the argument of the characteristic function:
Here FX is the cumulative distribution function of X, and the integral is of the Riemann–Stieltjes kind. If random
variable X has a probability density function ƒX, then the characteristic function is its Fourier transform,[2] and the last
formula in parentheses is valid.
It should be noted though, that this convention for the constants appearing in the definition of the characteristic
function differs from the usual convention for the Fourier transform.[3] For example some authors[4] define
φX(t) = Ee−2πitX, which is essentially a change of parameter. Other notation may be encountered in the literature:
as the characteristic function for a probability measure p, or as the characteristic function corresponding to a
density ƒ.
Generalizations
The notion of characteristic functions generalizes to multivariate random variables and more complicated random
elements. The argument of the characteristic function will always belong to the continuous dual of the space where
random variable X takes values. For common cases such definitions are listed below:
• If X is a k-dimensional random vector, then for t ∈ Rk
• If X(s) is a stochastic process, then for all functions t(s) such that the integral ∫Rt(s)X(s)ds converges for almost
all realizations of X [7]
Here tᵀ denotes the matrix transpose, tr(·) the matrix trace operator, Re(·) the real part of a complex number, z̄
the complex conjugate of z, and * the conjugate transpose (that is, z* = z̄ᵀ).
Examples
Degenerate δa: φ(t) = e^{ita}
Bernoulli Bern(p): φ(t) = 1 − p + p e^{it}
Binomial B(n, p): φ(t) = (1 − p + p e^{it})^n
Poisson Pois(λ): φ(t) = e^{λ(e^{it} − 1)}
Uniform U(a, b): φ(t) = (e^{itb} − e^{ita}) / (it(b − a))
Laplace L(μ, b): φ(t) = e^{itμ} / (1 + b²t²)
Chi-squared χ²_k: φ(t) = (1 − 2it)^{−k/2}
Cauchy C(μ, θ): φ(t) = e^{itμ − θ|t|}
Gamma Γ(k, θ): φ(t) = (1 − itθ)^{−k}
Exponential Exp(λ): φ(t) = (1 − itλ^{−1})^{−1}
Multivariate Cauchy MultiCauchy(μ, Σ): φ(t) = e^{i tᵀμ − √(tᵀΣt)} [8]
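Entries of this kind can be sanity-checked numerically, e.g. a Poisson characteristic function by direct summation against the pmf and an exponential one by quadrature; a sketch (the values of t and λ are arbitrary):

```python
import cmath
import math

t, lam = 0.7, 3.0  # arbitrary argument and rate

# Poisson(lam): E[e^{itX}] summed directly against the pmf.
poisson_cf = 0 + 0j
term = math.exp(-lam)  # P(X = 0)
for k in range(120):
    poisson_cf += cmath.exp(1j * t * k) * term
    term *= lam / (k + 1)  # P(X = k+1) from P(X = k)
closed_poisson = cmath.exp(lam * (cmath.exp(1j * t) - 1))

# Exponential(lam): ∫_0^∞ e^{itx} lam e^{-lam x} dx by the midpoint rule.
dx, n = 1e-4, 200_000  # integrate up to x = 20; the tail e^{-60} is negligible
exp_cf = sum(
    cmath.exp(1j * t * (k + 0.5) * dx) * lam * math.exp(-lam * (k + 0.5) * dx) * dx
    for k in range(n)
)
closed_exp = 1 / (1 - 1j * t / lam)

print(abs(poisson_cf - closed_poisson), abs(exp_cf - closed_exp))  # both tiny
```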
Properties
• The characteristic function of a real-valued random variable always exists, since it is an integral of a bounded
continuous function over a space whose measure is finite.
• A characteristic function is uniformly continuous on the entire space
• It is non-vanishing in a region around zero: φ(0) = 1.
• It is bounded: | φ(t) | ≤ 1.
• It is Hermitian: φ(−t) = φ(t). In particular, the characteristic function of a symmetric (around the origin) random
variable is real-valued and even.
• There is a bijection between distribution functions and characteristic functions. That is, for any two random
variables X1, X2
• If a random variable X has moments up to k-th order, then the characteristic function φX is k times continuously
differentiable on the entire real line. In this case
• If a characteristic function φX has a k-th derivative at zero, then the random variable X has all moments up to k if k
is even, but only up to k – 1 if k is odd.[9]
• If X1, …, Xn are independent random variables, and a1, …, an are some constants, then the characteristic function
of the linear combination of Xi's is
One specific case would be the sum of two independent random variables X1 and X2, in which case one would
have φ_{X1+X2}(t) = φ_{X1}(t) φ_{X2}(t).
• The tail behavior of the characteristic function determines the smoothness of the corresponding density function.
Continuity
The bijection stated above between probability distributions and characteristic functions is continuous. That is,
whenever a sequence of distribution functions { Fj(x) } converges (weakly) to some distribution F(x), the
corresponding sequence of characteristic functions { φj(t) } will also converge, and the limit φ(t) will correspond to
the characteristic function of law F. More formally, this is stated as
Lévy’s continuity theorem: A sequence { Xj } of n-variate random variables converges in distribution to
random variable X if and only if the sequence { φXj } converges pointwise to a function φ which is continuous
at the origin. Then φ is the characteristic function of X.[10]
This theorem is frequently used to prove the law of large numbers, and the central limit theorem.
Inversion formulas
Since there is a one-to-one correspondence between cumulative distribution functions and characteristic functions, it
is always possible to find one of these functions if we know the other one. The formula in definition of characteristic
function allows us to compute φ when we know the distribution function F (or density ƒ). If, on the other hand, we
know the characteristic function φ and want to find the corresponding distribution function, then one of the
following inversion theorems can be used.
Theorem. If characteristic function φX is integrable, then FX is absolutely continuous, and therefore X has the
probability density function given by
when X is scalar;
in multivariate case the pdf is understood as the Radon–Nikodym derivative of the distribution μX with respect to the
Lebesgue measure λ:
Theorem (Lévy).[11] If φX is the characteristic function of distribution function FX, and two points a < b are such that {x | a < x
< b} is a continuity set of μX (in the univariate case this condition is equivalent to continuity of FX at points a and b),
then
FX(b) − FX(a) = lim_{T→∞} (1/2π) ∫_{−T}^{T} ((e^{−ita} − e^{−itb}) / (it)) φX(t) dt
if X is a scalar random variable.
Theorem (Gil-Pelaez).[12] For a univariate random variable X, if x is a continuity point of FX then
The integral may not be Lebesgue-integrable; for example, when X is the discrete random variable that is always 0, it
becomes the Dirichlet integral.
Inversion formulas for multivariate distributions are available.[13]
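The first inversion theorem can be illustrated with the standard normal, whose characteristic function e^{−t²/2} decays fast enough for a naive quadrature; a sketch (the truncation window and step count are ad-hoc choices):

```python
import math

def normal_pdf_from_cf(x, t_max=12.0, n=24_000):
    # f(x) = (1/2pi) ∫ e^{-itx} phi(t) dt with phi(t) = e^{-t^2/2}; by symmetry
    # only the real part, cos(t x) e^{-t^2/2}, contributes.
    dt = 2 * t_max / n
    total = 0.0
    for k in range(n + 1):
        tk = -t_max + k * dt
        w = 0.5 if k in (0, n) else 1.0  # trapezoidal end weights
        total += w * math.cos(tk * x) * math.exp(-tk * tk / 2)
    return total * dt / (2 * math.pi)

for x in (0.0, 1.0, 2.0):
    exact = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    print(x, normal_pdf_from_cf(x), exact)  # the two columns agree closely
```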
• Mathias’ theorem. A real, even, continuous, absolutely integrable function φ equal to 1 at the origin is a
characteristic function if and only if
for n = 0,1,2,…, and all p > 0. Here H2n denotes the Hermite polynomial of degree 2n.
Pólya’s theorem. If φ is a real-valued continuous
function which satisfies the conditions
1. φ(0) = 1,
2. φ is even,
3. φ is convex for t>0,
4. φ(∞) = 0,
then φ(t) is the characteristic function of an
absolutely continuous symmetric distribution.
• A convex linear combination (with nonnegative coefficients that sum to one) of a finite or a countable number
of characteristic functions is also a characteristic function.
Pólya's theorem can be used to construct an example of two random variables whose characteristic functions
coincide over a finite interval but are different elsewhere.
• The product of a finite number of characteristic functions is also a characteristic function. The same
holds for an infinite product provided that it converges to a function continuous at the origin.
• If φ is a characteristic function and α is a real number, then the complex conjugate φ̄, Re[φ], |φ|², and φ(αt) are also
characteristic functions.
Uses
Because of the continuity theorem, characteristic functions are used in the most frequently seen proof of the central
limit theorem. The main trick involved in making calculations with a characteristic function is recognizing the
function as the characteristic function of a particular distribution.
If X1, …, Xn are independent random variables and Sn = a1X1 + … + anXn, where the ai are constants, then the characteristic function for Sn is given by
φ_{Sn}(t) = φ_{X1}(a1 t) φ_{X2}(a2 t) … φ_{Xn}(an t).
In particular, φX+Y(t) = φX(t)φY(t). To see this, write out the definition of characteristic function:
Observe that the independence of X and Y is required to establish the equality of the third and fourth expressions.
Another special case of interest is when ai = 1/n, and then Sn is the sample mean. In this case, writing X̄ for the mean,
φ_{X̄}(t) = (φ_X(t/n))^n.
Moments
Characteristic functions can also be used to find moments of a random variable. Provided that the nth moment exists,
the characteristic function can be differentiated n times and
E(X^n) = (−i)^n φ_X^{(n)}(0).
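This relation can be checked with finite differences on a known characteristic function; here the exponential distribution is used (the rate λ and step size h are arbitrary choices of this example):

```python
lam = 2.0

def phi(t):
    # Characteristic function of Exponential(lam): (1 - it/lam)^{-1}
    return 1 / (1 - 1j * t / lam)

h = 1e-4  # finite-difference step
d1 = (phi(h) - phi(-h)) / (2 * h)            # approximates phi'(0) = i E(X)
d2 = (phi(h) - 2 * phi(0) + phi(-h)) / h**2  # approximates phi''(0) = -E(X^2)

print((d1 / 1j).real, (-d2).real)  # E(X) = 1/lam = 0.5, E(X^2) = 2/lam^2 = 0.5
```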
For example, suppose X has a standard Cauchy distribution. Then φX(t) = e^{−|t|}, which is not differentiable at
t = 0, showing that the Cauchy distribution has no expectation. Also, by the result of the previous section, the characteristic function of the sample
mean X̄ of n independent observations is φ_{X̄}(t) = (e^{−|t|/n})^n = e^{−|t|}. This is the characteristic function of the standard Cauchy distribution: thus, the sample mean has
the same distribution as the population itself.
The logarithm of a characteristic function is a cumulant generating function, which is useful for finding cumulants;
note that some instead define the cumulant generating function as the logarithm of the moment-generating function,
and call the logarithm of the characteristic function the second cumulant generating function.
Data analysis
Characteristic functions can be used as part of procedures for fitting probability distributions to samples of data.
Cases where this provides a practicable option compared to other possibilities include fitting the stable distribution,
since closed-form expressions for the density are not available, which makes implementation of maximum likelihood
estimation difficult. Estimation procedures are available which match the theoretical characteristic function to the
empirical characteristic function, calculated from the data. Paulson et al. (1975) and Heathcote (1977) provide some
theoretical background for such an estimation procedure. In addition, Yu (2004) describes applications of empirical
characteristic functions to fit time series models where likelihood procedures are impractical.
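The empirical characteristic function itself is simple to compute: it is the sample average of e^(itX). A minimal sketch, assuming NumPy is available (the helper name `ecf` and the Gaussian test sample are illustrative choices, not part of any cited procedure):

```python
import numpy as np

def ecf(x, t):
    """Empirical characteristic function: the sample mean of exp(i t X)."""
    return np.exp(1j * np.outer(t, x)).mean(axis=1)

rng = np.random.default_rng(1)
x = rng.normal(size=50_000)          # sample from N(0, 1)
t = np.linspace(-2.0, 2.0, 9)
phi_hat = ecf(x, t)

# For N(0, 1) the theoretical characteristic function is exp(-t^2 / 2),
# so the empirical version should be close to it for moderate |t|.
err = np.max(np.abs(phi_hat - np.exp(-t**2 / 2)))
print(err)
```

Fitting procedures of the kind described above minimize a distance between this empirical function and the theoretical one over the model parameters.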
Example
The gamma distribution with scale parameter θ and shape parameter k has the characteristic function (1 − θit)^(−k).
Now suppose that we have X ~ Γ(k1, θ) and Y ~ Γ(k2, θ),
with X and Y independent from each other, and we wish to know what the distribution of X + Y is. The characteristic
functions are φX(t) = (1 − θit)^(−k1) and φY(t) = (1 − θit)^(−k2), and by independence
φX+Y(t) = φX(t) φY(t) = (1 − θit)^(−(k1+k2)).
This is the characteristic function of the gamma distribution with scale parameter θ and shape parameter k1 + k2, and we
therefore conclude X + Y ~ Γ(k1 + k2, θ).
The result can be extended to n independent gamma-distributed random variables with the same scale parameter: if
Xi ~ Γ(ki, θ), then X1 + ⋯ + Xn ~ Γ(k1 + ⋯ + kn, θ).
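The additivity of gamma shape parameters can be checked by simulation. A sketch assuming NumPy (seed, parameter values, and sample size are arbitrary), comparing the first two moments of the sum against those of Γ(k1 + k2, θ):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, k1, k2 = 2.0, 3.0, 5.0
n = 200_000

# Sum of independent Gamma(k1, theta) and Gamma(k2, theta) draws.
s = rng.gamma(k1, theta, n) + rng.gamma(k2, theta, n)

# Gamma(k1 + k2, theta) has mean (k1 + k2) * theta and variance (k1 + k2) * theta^2.
print(s.mean())   # near (3 + 5) * 2 = 16
print(s.var())    # near (3 + 5) * 4 = 32
```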
Related concepts
Related concepts include the moment-generating function and the probability-generating function. The characteristic
function exists for all probability distributions; however, this is not the case for the moment-generating function.
The characteristic function is closely related to the Fourier transform: the characteristic function of a probability
density function p(x) is the complex conjugate of the continuous Fourier transform of p(x) (according to the usual
convention; see continuous Fourier transform – other conventions).
where P(t) denotes the continuous Fourier transform of the probability density function p(x). Likewise, p(x) may be
recovered from φX(t) through the inverse Fourier transform:
Indeed, even when the random variable does not have a density, the characteristic function may be seen as the
Fourier transform of the measure corresponding to the random variable.
Notes
[1] Lukacs (1970) p. 196
[2] Billingsley (1995)
[3] Pinsky (2002)
[4] Bochner (1955)
[5] Andersen et al. (1995, Definition 1.10)
[6] Andersen et al. (1995, Definition 1.20)
[7] Sobczyk (2001, p. 20)
[8] Kotz et al. p. 37, using 1 as the number of degrees of freedom to recover the Cauchy distribution
[9] Lukacs (1970), Corollary 1 to Theorem 2.3.1
[10] Cuppens (1975, Theorem 2.6.9)
[11] Named after the French mathematician Paul Pierre Lévy
[12] Wendel, J.G. (1961)
[13] Shephard (1991a,b)
[14] Lukacs (1970), p.84
[15] Lukacs (1970, Chapter 7)
References
• Andersen, H.H., M. Højbjerre, D. Sørensen, P.S. Eriksen (1995). Linear and graphical models for the
multivariate complex normal distribution. Lecture notes in statistics 101. New York: Springer-Verlag.
ISBN 0-387-94521-0.
• Billingsley, Patrick (1995). Probability and measure (3rd ed.). John Wiley & Sons. ISBN 0-471-00710-2.
• Bisgaard, T. M.; Z. Sasvári (2000). Characteristic functions and moment sequences. Nova Science.
• Bochner, Salomon (1955). Harmonic analysis and the theory of probability. University of California Press.
• Cuppens, R. (1975). Decomposition of multivariate probabilities. Academic Press.
• Heathcote, C.R. (1977). "The integrated squared error estimation of parameters". Biometrika 64 (2): 255–264.
doi:10.1093/biomet/64.2.255.
• Lukacs, E. (1970). Characteristic functions. London: Griffin.
• Kotz, Samuel; Nadarajah, Saralees (2004). Multivariate T Distributions and Their Applications. Cambridge
University Press.
• Oberhettinger, Fritz (1973). Fourier Transforms of Distributions and their Inverses: A Collection of Tables.
Academic Press.
• Paulson, A.S.; E.W. Holcomb, R.A. Leitch (1975). "The estimation of the parameters of the stable laws".
Biometrika 62 (1): 163–170. doi:10.1093/biomet/62.1.163.
• Pinsky, Mark (2002). Introduction to Fourier analysis and wavelets. Brooks/Cole. ISBN 0-534-37660-6.
• Sobczyk, Kazimierz (2001). Stochastic differential equations. Kluwer Academic Publishers.
ISBN 978-1-4020-0345-5.
• Wendel, J.G. (1961). "The non-absolute convergence of Gil-Pelaez' inversion integral". The Annals of
Mathematical Statistics 32 (1): 338–339. doi:10.1214/aoms/1177705164.
• Yu, J. (2004). "Empirical characteristic function estimation and its applications". Econometric Reviews 23 (2):
93–123. doi:10.1081/ETC-120039605.
• Shephard, N. G. (1991a) From characteristic function to distribution function: A simple framework for the theory.
Econometric Theory, 7, 519–529.
• Shephard, N. G. (1991b) Numerical integration rules for multivariate inversions. J. Statist. Comput. Simul., 39,
37–46.
Chernoff bound
In probability theory, the Chernoff bound, named after Herman Chernoff, gives exponentially decreasing bounds on
tail distributions of sums of independent random variables. It is sharper than the first- or second-moment-based tail
bounds such as Markov's inequality or Chebyshev's inequality, which yield only power-law bounds on tail decay.
However, the Chernoff bound requires that the variates be independent, a condition that neither Markov's nor
Chebyshev's inequality requires.
It is related to the (historically earliest) Bernstein inequalities, and to Hoeffding's inequality.
Definition
Let X1, ..., Xn be independent Bernoulli random variables, each taking the value 1 with probability p > 1/2. Then the probability of
simultaneous occurrence of more than n/2 of the events has an exact value S, where
S = Σi=⌊n/2⌋+1..n C(n, i) p^i (1 − p)^(n−i).
The Chernoff bound shows that S has the following lower bound:
S ≥ 1 − e^(−2n (p − 1/2)²).
This result admits various generalizations as outlined below. One can encounter many flavours of Chernoff bounds:
the original additive form (which gives a bound on the absolute error) or the more practical multiplicative form
(which bounds the error relative to the mean).
A motivating example
The simplest case of Chernoff bounds is used to bound the success
probability of majority agreement for n independent, equally likely
events.
A simple motivating example is to consider a biased coin. One side
(say, Heads), is more likely to come up than the other, but you don't
know which and would like to find out. The obvious solution is to flip
it many times and then choose the side that comes up the most. But
how many times do you have to flip it to be confident that you've
chosen correctly?
In our example, let Xi denote the event that the ith coin flip comes up
Heads; suppose that we want to ensure we choose the wrong side with
at most a small probability ε. Then, rearranging the above, we must
have:
n ≥ (1 / (2 (p − 1/2)²)) ln(1/ε).
If the coin is noticeably biased, say coming up on one side 60% of the time (p = 0.6), then we can guess that side with
95% accuracy (ε = 0.05) after 150 flips. If it is 90% biased, then a mere 10 flips suffices. If the coin
is only biased a tiny amount, like most real coins are, the number of necessary flips becomes much larger.
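The flip counts quoted above can be reproduced from the additive (Hoeffding-type) form of the bound. This is a sketch, taking n ≥ ln(1/ε) / (2 (p − 1/2)²) as the rearrangement referred to in the text; the function name is illustrative:

```python
import math

def flips_needed(p, eps):
    """Number of flips n so that majority vote picks the wrong side with
    probability at most eps, via n >= ln(1/eps) / (2 * (p - 1/2)^2)."""
    return math.ceil(math.log(1 / eps) / (2 * (p - 0.5) ** 2))

print(flips_needed(0.6, 0.05))  # 150, as in the text
print(flips_needed(0.9, 0.05))  # 10
```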
More practically, the Chernoff bound is used in randomized algorithms (or in computational devices such as
quantum computers) to determine a bound on the number of runs necessary to determine a value by majority
agreement, up to a specified probability. For example, suppose an algorithm (or machine) A computes the correct
value of a function f with probability p > 1/2. If we choose n satisfying the inequality above, the probability that a
majority exists and is equal to the correct value is at least 1 − ε, which for small enough ε is quite reliable. If p is a
constant, ε diminishes exponentially with growing n, which is what makes algorithms in the complexity class BPP
efficient.
Notice that if p is very close to 1/2, the necessary n can become very large. For example, if p = 1/2 + 1/2m, as it
might be in some PP algorithms, the result is that n is bounded below by an exponential function in m:
Similarly,
and so,
and
where
is the Kullback-Leibler divergence between Bernoulli distributed random variables with parameters and
respectively. If , then
Proof
The proof starts from the general inequality (+) above. Taking a = mq in (+), we obtain:
Therefore we can easily compute the infimum, using calculus and some logarithms. Thus,
so that .
Thus, .
As , we see that , so our bound is satisfied on . Having solved for , we can plug back
into the equations above to find that
To complete the proof for the symmetric case, we simply define the random variable , apply the
same proof, and plug it into our bound.
Simpler bounds
A simpler bound follows by relaxing the theorem using , which follows from the convexity
Proof
According to (+),
The third line above follows because takes the value with probability and the value with probability
. This is identical to the calculation above in the proof of the Theorem for additive form (absolute error).
Rewriting as and recalling that (with strict inequality if ),
we set . The same result can be obtained by directly replacing a in the equation for the Chernoff
bound with .[1]
Thus,
This proves the result desired. A similar proof strategy can be used to show that
(a) .
Then,
,
and therefore also
(b)
Then,
the sum of independent samples is precisely the maximum deviation among independent random walks of length . In
order to achieve a fixed bound on the maximum deviation with constant probability, it is easy to see that should grow
logarithmically with in this scenario.[4]
The following theorem can be obtained by assuming has low rank, in order to avoid the dependency on the
dimensions.
References
[1] Refer to the proof above
[2] http://books.google.com/books?id=0bAYl6d7hvkC&printsec=frontcover&source=gbs_summary_r&cad=0#PPA71,M1
[3] http://books.google.com/books?id=0bAYl6d7hvkC&printsec=frontcover&source=gbs_summary_r&cad=0#PPA72,M1
[4] Magen, A.; Zouzias, A. (2011). "Low Rank Matrix-Valued Chernoff Bounds and Approximate Matrix Multiplication".
arXiv:1005.2724 [cs.DM].
• Chernoff, H. (1952). "A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the sum of
Observations". Annals of Mathematical Statistics 23 (4): 493–507. doi:10.1214/aoms/1177729330.
JSTOR 2236576. MR57518. Zbl 0048.11804.
• Hoeffding, W. (1963). "Probability Inequalities for Sums of Bounded Random Variables". Journal of the
American Statistical Association 58 (301): 13–30. doi:10.2307/2282952. JSTOR 2282952.
• Chernoff, H. (1981). "A Note on an Inequality Involving the Normal Distribution". Annals of Probability 9 (3):
533. doi:10.1214/aop/1176994428. JSTOR 2243541. MR614640. Zbl 0457.60014.
• Hagerup, T. (1990). "A guided tour of Chernoff bounds". Information Processing Letters 33 (6): 305.
doi:10.1016/0020-0190(90)90214-I.
• Ahlswede, R.; Winter, A. (2003). "Strong Converse for Identification via Quantum Channels". IEEE Transactions
on Information Theory 48 (3): 569–579. arXiv:quant-ph/0012127.
• Mitzenmacher, M.; Upfal, E. (2005). Probability and Computing: Randomized Algorithms and Probabilistic
Analysis (http://books.google.com/books?id=0bAYl6d7hvkC). ISBN 978-0-521-83540-4.
• Nielsen, F. (2011). "Chernoff information of exponential families". arXiv:1102.2684 [cs.IT].
Chi-squared distribution
Probability density function
Notation: χ²(k) or χ²k
Parameters: k ∈ N>0 (known as "degrees of freedom")
Support: x ∈ [0, +∞)
PDF: x^(k/2 − 1) e^(−x/2) / (2^(k/2) Γ(k/2))
CDF: γ(k/2, x/2) / Γ(k/2)
Mean: k
Median: ≈ k (1 − 2/(9k))³
Mode: max{ k − 2, 0 }
Variance: 2k
Skewness: √(8/k)
Ex. kurtosis: 12 / k
Entropy: k/2 + ln(2 Γ(k/2)) + (1 − k/2) ψ(k/2)
In probability theory and statistics, the chi-squared distribution (also chi-square or χ²-distribution) with k degrees
of freedom is the distribution of a sum of the squares of k independent standard normal random variables. It is one of
the most widely used probability distributions in inferential statistics, e.g., in hypothesis testing or in construction of
confidence intervals.[2][3][4][5] When there is a need to contrast it with the noncentral chi-squared distribution, this
distribution is sometimes called the central chi-squared distribution.
The chi-squared distribution is used in the common chi-squared tests for goodness of fit of an observed distribution
to a theoretical one, the independence of two criteria of classification of qualitative data, and in confidence interval
estimation for a population standard deviation of a normal distribution from a sample standard deviation. Many other
statistical tests also use this distribution, like Friedman's analysis of variance by ranks.
The chi-squared distribution is a special case of the gamma distribution.
Definition
If Z1, ..., Zk are independent, standard normal random variables, then the sum of their squares,
Q = Z1² + ⋯ + Zk²,
is distributed according to the chi-squared distribution with k degrees of freedom. This is usually denoted as
Q ~ χ²(k) or Q ~ χ²k.
The chi-squared distribution has one parameter: k — a positive integer that specifies the number of degrees of
freedom (i.e. the number of Zi’s)
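The definition can be verified directly by simulation. A sketch assuming NumPy (seed, k, and sample size are arbitrary), checking the sum of squared standard normals against the known chi-squared mean k and variance 2k:

```python
import numpy as np

rng = np.random.default_rng(3)
k, n = 5, 100_000

# Each row is k independent standard normals; sum the squares along the row.
q = (rng.normal(size=(n, k)) ** 2).sum(axis=1)

# A chi-squared(k) variable has mean k and variance 2k.
print(q.mean())  # near 5
print(q.var())   # near 10
```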
Characteristics
Further properties of the chi-squared distribution can be found in the box above.
The probability density function is f(x; k) = x^(k/2 − 1) e^(−x/2) / (2^(k/2) Γ(k/2)) for x ≥ 0, where Γ(k/2) denotes the Gamma function, which has closed-form values for odd k.
For derivations of the pdf in the cases of one and two degrees of freedom, see Proofs related to chi-squared
distribution.
The cumulative distribution function is F(x; k) = γ(k/2, x/2) / Γ(k/2) = P(k/2, x/2), where γ(k, z) is the lower incomplete Gamma function and P(k, z) is the regularized Gamma function.
In the special case of k = 2 this function has the simple form F(x; 2) = 1 − e^(−x/2).
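Because the k = 2 case has a closed-form CDF, its upper-tail critical values can be computed by hand: solving 1 − F(x) = p gives x = −2 ln p. A minimal sketch (the function name is illustrative), reproducing familiar table entries:

```python
import math

def chi2_critical_k2(p):
    """Upper-tail critical value for chi-squared with k = 2 degrees of freedom,
    using the closed form F(x) = 1 - exp(-x/2), i.e. x = -2 ln(p)."""
    return -2 * math.log(p)

print(round(chi2_critical_k2(0.05), 2))   # 5.99
print(round(chi2_critical_k2(0.01), 2))   # 9.21
print(round(chi2_critical_k2(0.001), 2))  # 13.82
```

These match the k = 2 row of standard chi-squared tables.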
For the cases when 0 < z < 1 (which include all of the cases when this CDF is less than half), the following Chernoff
upper bound may be obtained:[6]
The tail bound for the cases when z > 1 follows similarly
Tables of this cumulative distribution function are widely available and the function is included in many
spreadsheets and all statistical packages. For another approximation for the CDF modeled after the cube of a
Gaussian, see under Noncentral chi-squared distribution.
Additivity
It follows from the definition of the chi-squared distribution that the sum of independent chi-squared variables is also
chi-squared distributed. Specifically, if {Xi}i=1n are independent chi-squared variables with {ki}i=1n degrees of
freedom, respectively, then Y = X1 + ⋯ + Xn is chi-squared distributed with k1 + ⋯ + kn degrees of freedom.
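Additivity can be checked empirically by comparing quantiles. A sketch assuming NumPy (seed, degrees of freedom, and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
k1, k2, n = 3, 4, 200_000

# Sum of independent chi-squared(k1) and chi-squared(k2) draws...
y = rng.chisquare(k1, n) + rng.chisquare(k2, n)
# ...compared against direct chi-squared(k1 + k2) draws.
z = rng.chisquare(k1 + k2, n)

# If the two have the same distribution, their sample quantiles agree.
qs = np.linspace(0.1, 0.9, 9)
gap = np.max(np.abs(np.quantile(y, qs) - np.quantile(z, qs)))
print(gap)
```

The maximum quantile gap stays at the level of sampling noise, consistent with Y ~ χ²(k1 + k2).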
Information entropy
The information entropy is given by H = k/2 + ln(2 Γ(k/2)) + (1 − k/2) ψ(k/2), where ψ(x) is the digamma function.
Noncentral moments
The moments about zero of a chi-squared distribution with k degrees of freedom are given by[8][9] E(X^m) = k (k + 2) (k + 4) ⋯ (k + 2m − 2) = 2^m Γ(m + k/2) / Γ(k/2).
Cumulants
The cumulants are readily obtained by a (formal) power series expansion of the logarithm of the characteristic
function: κn = 2^(n−1) (n − 1)! k.
Asymptotic properties
By the central limit theorem, because the chi-squared distribution is the sum of k independent random variables with
finite mean and variance, it converges to a normal distribution for large k. For many practical purposes, for k > 50
the distribution is sufficiently close to a normal distribution for the difference to be ignored.[10] Specifically, if
X ~ χ²(k), then as k tends to infinity, the distribution of (X − k)/√(2k) tends to a standard normal distribution.
However, convergence is slow, as the skewness is √(8/k) and the excess kurtosis is 12/k. Other functions of the
chi-squared distribution converge more rapidly to a normal distribution. Some examples are:
• If X ~ χ²(k) then √(2X) is approximately normally distributed with mean √(2k − 1) and unit variance (result credited
to R. A. Fisher).
• If X ~ χ²(k) then (X/k)^(1/3) is approximately normally distributed with mean 1 − 2/(9k) and variance 2/(9k).[11]
This is known as the Wilson–Hilferty transformation.
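The Wilson–Hilferty transformation can be checked numerically. A sketch assuming NumPy (seed, k, and sample size are arbitrary; tolerances are chosen loosely):

```python
import numpy as np

rng = np.random.default_rng(5)
k, n = 10, 200_000

x = rng.chisquare(k, n)
w = (x / k) ** (1 / 3)   # Wilson-Hilferty transform

# Approximately N(1 - 2/(9k), 2/(9k)); compare the sample mean and variance.
print(w.mean())  # near 1 - 2/90 = 0.9778
print(w.var())   # near 2/90 = 0.0222
```

Even at k = 10 the cube-root transform is already very close to its normal approximation.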
Generalizations
The chi-squared distribution is obtained as the sum of the squares of k independent, zero-mean, unit-variance
Gaussian random variables. Generalizations of this distribution can be obtained by summing the squares of other
types of Gaussian random variables. Several such distributions are described below.
Chi-squared distributions
Applications
The chi-squared distribution has numerous applications in inferential statistics, for instance in chi-squared tests and
in estimating variances. It enters the problem of estimating the mean of a normally distributed population and the
problem of estimating the slope of a regression line via its role in Student’s t-distribution. It enters all analysis of
variance problems via its role in the F-distribution, which is the distribution of the ratio of two independent
chi-squared random variables, each divided by their respective degrees of freedom.
Following are some of the most common situations in which the chi-squared distribution arises from a
Gaussian-distributed sample.
• If X1, ..., Xn are i.i.d. N(μ, σ²) random variables, then Σi=1..n (Xi − X̄)² ~ σ² χ²(n − 1), where X̄ = (1/n) Σi=1..n Xi.
• The box below shows some statistics based on Xi ∼ Normal(μi, σ²i), i = 1, ⋯, k, independent random variables,
whose distributions have names starting with chi:
Name — Statistic
chi-squared distribution: Σi=1..k ((Xi − μi)/σi)²
chi distribution: √( Σi=1..k ((Xi − μi)/σi)² )
df    p = 0.95  0.90  0.80  0.70  0.50  0.30  0.20   0.10   0.05   0.01   0.001
1     0.004     0.02  0.06  0.15  0.46  1.07  1.64   2.71   3.84   6.64   10.83
2     0.10      0.21  0.45  0.71  1.39  2.41  3.22   4.60   5.99   9.21   13.82
3     0.35      0.58  1.01  1.42  2.37  3.66  4.64   6.25   7.82   11.34  16.27
4     0.71      1.06  1.65  2.20  3.36  4.88  5.99   7.78   9.49   13.28  18.47
5     1.14      1.61  2.34  3.00  4.35  6.06  7.29   9.24   11.07  15.09  20.52
6     1.63      2.20  3.07  3.83  5.35  7.23  8.56   10.64  12.59  16.81  22.46
7     2.17      2.83  3.82  4.67  6.35  8.38  9.80   12.02  14.07  18.48  24.32
8     2.73      3.49  4.59  5.53  7.34  9.52  11.03  13.36  15.51  20.09  26.12
9     3.32      4.17  5.38  6.39  8.34  10.66 12.24  14.68  16.92  21.67  27.88
10    3.94      4.86  6.18  7.27  9.34  11.78 13.44  15.99  18.31  23.21  29.59
P values (probabilities) of 0.05 or below are conventionally regarded as significant; larger P values as nonsignificant.
History
This distribution was first described by the German statistician Helmert.
References
[1] M.A. Sanders. "Characteristic function of the central chi-squared distribution" (http://www.planetmathematics.com/CentralChiDistr.pdf).
Retrieved 2009-03-06.
[2] Abramowitz, Milton; Stegun, Irene A., eds. (1965), "Chapter 26" (http://www.math.sfu.ca/~cbm/aands/page_940.htm), Handbook of
Mathematical Functions with Formulas, Graphs, and Mathematical Tables, New York: Dover, pp. 940, ISBN 978-0486612720, MR0167642.
[3] NIST (2006). Engineering Statistics Handbook – Chi-Squared Distribution (http://www.itl.nist.gov/div898/handbook/eda/section3/eda3666.htm)
[4] Johnson, N.L.; Kotz, S.; Balakrishnan, N. (1994). Continuous Univariate Distributions (Second Ed., Vol. 1, Chapter 18). John Wiley and
Sons. ISBN 0-471-58495-9.
[5] Mood, Alexander; Franklin A. Graybill, Duane C. Boes (1974). Introduction to the Theory of Statistics (Third Edition, p. 241-246).
McGraw-Hill. ISBN 0-07-042864-6.
[6] Dasgupta, Sanjoy D. A.; Gupta, Anupam K. (2002). "An Elementary Proof of a Theorem of Johnson and Lindenstrauss"
(http://cseweb.ucsd.edu/~dasgupta/papers/jl.pdf). Random Structures and Algorithms 22: 60–65. Retrieved 2012-05-01.
[7] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model". Journal of Econometrics
(Elsevier): 219–230. Retrieved 2011-06-02.
[8] Chi-squared distribution (http://mathworld.wolfram.com/Chi-SquaredDistribution.html), from MathWorld, retrieved Feb. 11, 2009
[9] M. K. Simon, Probability Distributions Involving Gaussian Random Variables, New York: Springer, 2002, eq. (2.35), ISBN
978-0-387-34657-1
[10] Box, Hunter and Hunter. Statistics for experimenters. Wiley. p. 46.
[11] Wilson, E.B.; Hilferty, M.M. (1931) "The distribution of chi-squared". Proceedings of the National Academy of Sciences, Washington, 17,
684–688.
[12] Chi-Squared Test (http://www2.lv.psu.edu/jxm57/irp/chisquar.html) Table B.2. Dr. Jacqueline S. McLaughlin at The Pennsylvania
State University. In turn citing: R.A. Fisher and F. Yates, Statistical Tables for Biological Agricultural and Medical Research, 6th ed., Table
IV
External links
• Earliest Uses of Some of the Words of Mathematics: entry on Chi squared has a brief history (http://jeff560.
tripod.com/c.html)
• Course notes on Chi-Squared Goodness of Fit Testing (http://www.stat.yale.edu/Courses/1997-98/101/chigf.
htm) from Yale University Stats 101 class.
• Mathematica demonstration showing the chi-squared sampling distribution of various statistics, e.g. Σx², for a
normal population (http://demonstrations.wolfram.com/StatisticsAssociatedWithNormalSamples/)
• Simple algorithm for approximating cdf and inverse cdf for the chi-squared distribution with a pocket calculator
(http://www.jstor.org/stable/2348373)
Computational complexity of mathematical operations
Arithmetic functions
• Addition: two n-digit numbers → one (n+1)-digit number. Schoolbook addition with carry: Θ(n).
• Subtraction: two n-digit numbers → one (n+1)-digit number. Schoolbook subtraction with borrow: Θ(n).
• Multiplication: two n-digit numbers → one 2n-digit number. Schoolbook long multiplication: O(n^2);
Karatsuba algorithm: O(n^1.585); Fürer's algorithm:[3] O(n log n 2^(log* n)).
• Division: two n-digit numbers → one n-digit number. Schoolbook long division: O(n^2).
• Square root: one n-digit number → one n-digit number. Newton's method: O(M(n)).
• Modular exponentiation: two n-digit numbers and a k-bit exponent → one n-digit number.
Repeated multiplication and reduction: O(2^k M(n)); exponentiation by squaring: O(k M(n)).
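The Karatsuba recursion mentioned above replaces four half-size multiplications with three, giving the O(n^1.585) bound. A minimal sketch for non-negative Python integers (the split point and base case are illustrative choices):

```python
def karatsuba(x, y):
    """Multiply two non-negative integers via Karatsuba's three-multiply recursion."""
    if x < 10 or y < 10:
        return x * y
    # Split both numbers at m bits: x = hi_x * 2^m + lo_x, likewise for y.
    m = max(x.bit_length(), y.bit_length()) // 2
    hi_x, lo_x = x >> m, x & ((1 << m) - 1)
    hi_y, lo_y = y >> m, y & ((1 << m) - 1)
    z2 = karatsuba(hi_x, hi_y)
    z0 = karatsuba(lo_x, lo_y)
    # One multiplication recovers the two cross terms: z1 = hi_x*lo_y + lo_x*hi_y.
    z1 = karatsuba(hi_x + lo_x, hi_y + lo_y) - z2 - z0
    return (z2 << (2 * m)) + (z1 << m) + z0
```

For example, `karatsuba(1234, 5678)` returns the same value as `1234 * 5678`.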
Schnorr and Stumpf[4] conjectured that no fastest algorithm for multiplication exists.
Algebraic functions
• Polynomial evaluation: one polynomial of degree n with fixed-size coefficients → one fixed-size number.
Direct evaluation: Θ(n); Horner's method: Θ(n).
• Polynomial gcd (over Z[x] or F[x]): two polynomials of degree n with fixed-size coefficients → one polynomial of
degree at most n. Euclidean algorithm: O(n^2); fast Euclidean algorithm:[5] O(n (log n)^2 log log n).
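Horner's method achieves the Θ(n) bound with one multiplication and one addition per coefficient. A minimal sketch:

```python
def horner(coeffs, x):
    """Evaluate a polynomial at x with Horner's rule.
    coeffs are listed from the highest-degree term down to the constant."""
    acc = 0
    for c in coeffs:
        acc = acc * x + c   # one multiply and one add per coefficient
    return acc

# 2x^3 - 6x^2 + 2x - 1 at x = 3:
print(horner([2, -6, 2, -1], 3))  # 5
```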
Special functions
Many of the methods in this section are given in Borwein & Borwein.[6]
Elementary functions
The elementary functions are constructed by composing arithmetic operations, the exponential function (exp), the
natural logarithm (log), trigonometric functions (sin, cos), and their inverses. The complexity of an elementary
function is equivalent to that of its inverse, since all elementary functions are analytic and hence invertible by means
of Newton's method. In particular, if either exp or log can be computed with some complexity, then that complexity
is attainable for all other elementary functions.
Below, the size n refers to the number of digits of precision at which the function is to be evaluated.
• exp, log, sin, cos — Taylor series; repeated argument reduction (e.g. exp(2x) = [exp(x)]^2) and direct summation: O(n^(1/2) M(n)).
• exp, log, sin, cos — Taylor series; FFT-based acceleration: O(n^(1/3) (log n)^2 M(n)).
• exp, log — arithmetic-geometric mean iteration: O(log n M(n)).
It is not known whether O(log n M(n)) is the optimal complexity for elementary functions. The best known lower
bound is the trivial bound Ω(M(n)).
Non-elementary functions
• Gamma function, n-digit number — series approximation of the incomplete gamma function: O(n^(1/2) (log n)^2 M(n)).
• Hypergeometric function pFq, n-digit number — as described in Borwein & Borwein: O(n^(1/2) (log n)^2 M(n)).
Mathematical constants
This table gives the complexity of computing approximations to the given constants to n correct digits.
• Euler's number, e — binary splitting of the Taylor series for the exponential function: O(log n M(n)).
• Pi, π — binary splitting of the arctan series in Machin's formula: O((log n)^2 M(n)).
• Euler's constant, γ — Sweeney's method (approximation in terms of the exponential integral): O((log n)^2 M(n)).
Number theory
Algorithms for number theoretical calculations are studied in computational number theory.
• Greatest common divisor: two n-digit numbers → one number with at most n digits.
Euclidean algorithm: O(n^2); binary GCD algorithm: O(n^2); left/right k-ary binary GCD algorithm:[8] O(n^2 / log n);
Schönhage controlled Euclidean descent algorithm:[10] O(log n M(n)).
• Factorial: a fixed-size number m → one O(m log m)-digit number.
Bottom-up multiplication: O(m^2 log m); binary splitting: O(log m M(m log m)).
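Binary splitting of the factorial multiplies balanced halves of the range so that the large multiplications dominate, which is where fast integer multiplication pays off. A minimal sketch:

```python
def factorial_split(n):
    """n! by binary splitting: recursively multiply balanced sub-products."""
    def prod(lo, hi):
        # Product of the integers lo..hi inclusive.
        if lo > hi:
            return 1
        if lo == hi:
            return lo
        mid = (lo + hi) // 2
        return prod(lo, mid) * prod(mid + 1, hi)
    return prod(1, n)
```

With schoolbook multiplication this offers no asymptotic gain; combined with fast multiplication it yields the O(log m M(m log m)) bound in the table.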
Matrix algebra
The following complexity figures assume that arithmetic with individual elements has complexity O(1), as is the
case with fixed-precision floating-point arithmetic.
• Matrix multiplication: two n×n matrices → one n×n matrix.
Schoolbook matrix multiplication: O(n^3); Williams algorithm:[14] O(n^2.373).
• Matrix multiplication: one n×m matrix and one m×p matrix → one n×p matrix.
Schoolbook matrix multiplication: O(nmp).
• LU decomposition of an n×n matrix: O(n^3).
In 2005, Henry Cohn, Robert Kleinberg, Balázs Szegedy and Christopher Umans showed that either of two different
conjectures would imply that the exponent of matrix multiplication is 2.[16] It has also been conjectured that no
fastest algorithm for matrix multiplication exists, in light of the nearly 20 successive improvements leading to the
Williams algorithm.
* Because a matrix can be inverted blockwise, where inversion of an n×n matrix requires the inversion of
two half-sized matrices and six multiplications between two half-sized matrices, and since matrix multiplication has a
lower bound of Ω(n^2 log n) operations[17], it can be shown that a divide-and-conquer algorithm that uses blockwise
inversion to invert a matrix runs with the same time complexity as the matrix multiplication algorithm that is used
internally.
References
[1] A. Schönhage, A.F.W. Grotefeld, E. Vetter: Fast Algorithms—A Multitape Turing Machine Implementation, BI Wissenschafts-Verlag,
Mannheim, 1994
[2] D. Knuth. The Art of Computer Programming, Volume 2. Third Edition, Addison-Wesley 1997.
[3] Martin Fürer. Faster Integer Multiplication (http://www.cse.psu.edu/~furer/Papers/mult.pdf). Proceedings of the 39th Annual ACM
Symposium on Theory of Computing, San Diego, California, USA, June 11–13, 2007, pp. 55–67.
[4] C. P. Schnorr and G. Stumpf. A characterization of complexity sequences. Zeitschrift fur Mathematische Logik und Grundlagen der
Mathematik 21(1):47–56, 1975.
[5] http://planetmath.org/encyclopedia/HalfGCDAlgorithm.html
[6] J. Borwein & P. Borwein. Pi and the AGM: A Study in Analytic Number Theory and Computational Complexity. John Wiley 1987.
[7] David and Gregory Chudnovsky. Approximations and complex multiplication according to Ramanujan. Ramanujan revisited, Academic
Press, 1988, pp 375–472.
[8] J. Sorenson. (1994). "Two Fast GCD Algorithms". Journal of Algorithms 16 (1): 110–144. doi:10.1006/jagm.1994.1006.
[9] R. Crandall & C. Pomerance. Prime Numbers - A Computational Perspective. Second Edition, Springer 2005.
[10] Möller N (2008). "On Schönhage's algorithm and subquadratic integer gcd computation" (http://www.lysator.liu.se/~nisse/archive/sgcd.pdf).
Mathematics of Computation 77 (261): 589–607. doi:10.1090/S0025-5718-07-02017-0.
[11] Bernstein D J. "Faster Algorithms to Find Non-squares Modulo Worst-case Integers" (http://cr.yp.to/papers/nonsquare.ps).
[12] Richard P. Brent; Paul Zimmermann (2010). "An O(M(n) log n) algorithm for the Jacobi symbol". arXiv:1004.2091.
[13] P. Borwein. "On the complexity of calculating factorials". Journal of Algorithms 6, 376-380 (1985)
[14] Virginia Vassilevska Williams, Breaking the Coppersmith-Winograd barrier (http://www.cs.berkeley.edu/~virgi/matrixmult.pdf), 2011
preprint.
[15] J. B. Fraleigh and R. A. Beauregard, "Linear Algebra," Addison-Wesley Publishing Company, 1987, p 95.
[16] Henry Cohn, Robert Kleinberg, Balazs Szegedy, and Chris Umans. Group-theoretic Algorithms for Matrix Multiplication.
arXiv:math.GR/0511460. Proceedings of the 46th Annual Symposium on Foundations of Computer Science, 23–25 October 2005, Pittsburgh,
PA, IEEE Computer Society, pp. 379–388.
[17] Ran Raz. On the complexity of matrix product. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing. ACM
Press, 2002. doi:10.1145/509907.509932.
Conjugate prior
In Bayesian probability theory, if the posterior distributions p(θ|x) are in the same family as the prior probability
distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate
prior for the likelihood. For example, the Gaussian family is conjugate to itself (or self-conjugate) with respect to a
Gaussian likelihood function: if the likelihood function is Gaussian, choosing a Gaussian prior over the mean will
ensure that the posterior distribution is also Gaussian. This means that the Gaussian distribution is a conjugate prior
for the likelihood which is also Gaussian. The concept, as well as the term "conjugate prior", were introduced by
Howard Raiffa and Robert Schlaifer in their work on Bayesian decision theory.[1] A similar concept had been
discovered independently by George Alfred Barnard.[2]
Consider the general problem of inferring a distribution for a parameter θ given some datum or data x. From Bayes'
theorem, the posterior distribution is equal to the product of the likelihood function p(x|θ) and the prior p(θ),
normalized (divided) by the probability of the data p(x):
p(θ|x) = p(x|θ) p(θ) / p(x).
Let the likelihood function be considered fixed; the likelihood function is usually well-determined from a statement
of the data-generating process. It is clear that different choices of the prior distribution p(θ) may make the integral
more or less difficult to calculate, and the product p(x|θ) × p(θ) may take one algebraic form or another. For certain
choices of the prior, the posterior has the same algebraic form as the prior (generally with different parameter
values). Such a choice is a conjugate prior.
A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior: otherwise a difficult
numerical integration may be necessary. Further, conjugate priors may give intuition, by more transparently showing
how a likelihood function updates a distribution.
All members of the exponential family have conjugate priors. See Gelman et al.[3] for a catalog.
Example
The form of the conjugate prior can generally be determined by inspection of the probability density or probability
mass function of a distribution. For example, consider a random variable which is a Bernoulli trial with unknown
probability of success q in [0,1]. The probability density function has the form
p(x) = q^x (1 − q)^(1−x).
Expressed as a function of q, this has the form q^a (1 − q)^b for some constants a and b. Generally, this functional form will have an additional multiplicative factor (the
normalizing constant) ensuring that the function is a probability distribution, i.e. the integral over the entire range is
1. This factor will often be a function of a and b, but never of q.
In fact, the usual conjugate prior is the beta distribution with parameters α and β:
p(q) = q^(α−1) (1 − q)^(β−1) / B(α, β),
where α and β are chosen to reflect any existing belief or information (α = 1 and β = 1 would give a uniform
distribution) and B(α, β) is the Beta function acting as a normalising constant.
In this context, α and β are called hyperparameters (parameters of the prior), to distinguish them from parameters
of the underlying model (here q). It is a typical characteristic of conjugate priors that the dimensionality of the
hyperparameters is one greater than that of the parameters of the original distribution. If all parameters are scalar
values, then this means that there will be one more hyperparameter than parameter; but this also applies to
vector-valued and matrix-valued parameters. (See the general article on the exponential family, and consider also the
Wishart distribution, conjugate prior of the covariance matrix of a multivariate normal distribution, for an example
where a large dimensionality is involved.)
If we then sample this random variable and get s successes and f failures, the posterior is proportional to
q^(s+α−1) (1 − q)^(f+β−1), i.e. a Beta(α + s, β + f) distribution,
which is another Beta distribution with a simple change to the (hyper)parameters. This posterior distribution could
then be used as the prior for more samples, with the hyperparameters simply adding each extra piece of information
as it comes.
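The update just described is pure bookkeeping on the hyperparameters. A minimal sketch of the beta-binomial conjugate update (the function name and numbers are illustrative):

```python
def update_beta(alpha, beta, s, f):
    """Conjugate update: Beta(alpha, beta) prior plus s successes and
    f failures in Bernoulli trials gives a Beta(alpha + s, beta + f) posterior."""
    return alpha + s, beta + f

# Start from the uniform prior Beta(1, 1) and observe 7 successes, 3 failures:
a, b = update_beta(1, 1, 7, 3)
posterior_mean = a / (a + b)   # 8 / 12 = 2/3
print(a, b, posterior_mean)
```

Feeding the resulting (a, b) back in as the prior for the next batch of observations implements the sequential updating described above.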
Pseudo-observations
It is often useful to think of the hyperparameters of a conjugate prior distribution as corresponding to having
observed a certain number of pseudo-observations with properties specified by the parameters. For example, the
values and of a beta distribution can be thought of as corresponding to successes and failures
if the posterior mode is used to choose an optimal parameter setting, or successes and failures if the posterior
mean is used to choose an optimal parameter setting. In general, for nearly all conjugate prior distributions, the
hyperparameters can be interpreted in terms of pseudo-observations. This can help both in providing an intuition
behind the often messy update equations, as well as to help choose reasonable hyperparameters for a prior.
Interpretations
using a mixture density of conjugate priors, rather than a single conjugate prior.
Dynamical system
One can think of conditioning on conjugate priors as defining a kind of (discrete time) dynamical system: from a
given set of hyperparameters, incoming data updates these hyperparameters, so one can see the change in
hyperparameters as a kind of "time evolution" of the system, corresponding to "learning". Starting at different points
yields different flows over time. This is again analogous with the dynamical system defined by a linear operator, but
note that since different samples lead to different inference, this is not simply dependent on time, but rather on data
over time. For related approaches, see Recursive Bayesian estimation and Data assimilation.
Bernoulli p Beta
successes,
(probability) [4]
failures
Binomial p Beta
successes,
(probability) [4] (beta-binomial)
failures
Negative p Beta
total
Binomial (probability)
successes,
with known [4]
failures (i.e.
failure number
r
experiments,
assuming stays
fixed)
Multinomial p Dirichlet
occurrences
(probability [4] (Dirichlet-multinomial)
of category
vector), k
(number of
categories,
i.e. size of
p)
Geometric p0 Beta
experiments,
(probability)
total
[4]
failures
deviations )
deviations )
squared deviations )
Conjugate prior 110
squared deviations )
Notes
[1] Howard Raiffa and Robert Schlaifer. Applied Statistical Decision Theory. Division of Research, Graduate School of Business Administration,
Harvard University, 1961.
[2] Jeff Miller et al. Earliest Known Uses of Some of the Words of Mathematics (http:/ / jeff560. tripod. com/ mathword. html), "conjugate prior
distributions" (http:/ / jeff560. tripod. com/ c. html). Electronic document, revision of November 13, 2005, retrieved December 2, 2005.
[3] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis, 2nd edition. CRC Press, 2003. ISBN
1-58488-388-X.
[4] The exact interpretation of the parameters of a beta distribution in terms of number of successes and failures depends on what function is used
to extract a point estimate from the distribution. The mode of a beta distribution is which corresponds to successes
and failures; but the mean is which corresponds to successes and failures. The use of and has
the advantage that a uniform prior corresponds to 0 successes and 0 failures, but the use of and is somewhat more
convenient mathematically and also corresponds well with the fact that Bayesians generally prefer to use the posterior mean rather than the
posterior mode as a point estimate. The same issues apply to the Dirichlet distribution.
[5] This is the posterior predictive distribution of a new data point given the observed data points, with the parameters marginalized out.
Variables with primes indicate the posterior values of the parameters.
[6] β is rate or inverse scale. In parameterization of gamma distribution,θ = 1/β and k = α.
[7] Fink, D. (1997). "A Compendium of Conjugate Priors" (In progress report: Extension and enhancement of methods for setting data quality
objectives). DOE contract 95‑831. CiteSeerX: 10.1.1.157.5540 (http:/ / citeseerx. ist. psu. edu/ viewdoc/ summary?doi=10. 1. 1. 157. 5540).
[8] This is the posterior predictive distribution of a new data point given the observed data points, with the parameters marginalized out.
Variables with primes indicate the posterior values of the parameters. and refer to the normal distribution and Student's t-distribution,
respectively, or to the multivariate normal distribution and multivariate t-distribution in the multivariate cases.
[9] Murphy, Kevin P. (2007). "Conjugate Bayesian analysis of the Gaussian distribution." (http:/ / www. cs. ubc. ca/ ~murphyk/ Papers/
bayesGauss. pdf)
Conjugate prior 112
References
External links
• Step-by-step calculation of normal distribution posterior hyperparameters (http://www.eisber.net/StatWiki/
index.php/Mathematische_Statistik_-_Übung_Ergänzungsaufgabe_2_Beispiel_2)
Statement
Let {Xn}, X be random elements defined on a metric space S. Suppose a function g: S→S′ (where S′ is another metric
space) has the set of discontinuity points Dg such that Pr[X ∈ Dg] = 0. Then[2][3][4]
1.
2.
3.
Proof
This proof has been adopted from (van der Vaart 1998, Theorem 2.3)
Spaces S and S′ are equipped with certain metrics. For simplicity we will denote both of these metrics using the |x−y|
notation, even though the metrics may be arbitrary and not necessarily Euclidian.
Convergence in distribution
We will need a particular statement from the portmanteau theorem: that convergence in distribution is
equivalent to
Fix an arbitrary closed set F⊂S′. Denote by g−1(F) the pre-image of F under the mapping g: the set of all points x∈S
such that g(x)∈F. Consider a sequence {xk} such that g(xk)∈F and xk→x. Then this sequence lies in g−1(F), and its
limit point x belongs to the closure of this set, g−1(F) (by definition of the closure). The point x may be either:
• a continuity point of g, in which case g(xk)→g(x), and hence g(x)∈F because F is a closed set, and therefore in
this case x belongs to the pre-image of F, or
• a discontinuity point of g, so that x∈Dg.
Thus the following relationship holds:
Continuous mapping theorem 113
Consider the event {g(Xn)∈F}. The probability of this event can be estimated as
and by the portmanteau theorem the limsup of the last expression is less than or equal to Pr(X∈g−1(F)). Using the
formula we derived in the previous paragraph, this can be written as
On plugging this back into the original expression, it can be seen that
which, by the portmanteau theorem, implies that g(Xn) converges to g(X) in distribution.
Convergence in probability
Fix an arbitrary ε>0. Then for any δ>0 consider the set Bδ defined as
This is the set of continuity points x of the function g(·) for which it is possible to find, within the δ-neighborhood of
x, a point which maps outside the ε-neighborhood of g(x). By definition of continuity, this set shrinks as δ goes to
zero, so that limδ→0Bδ = ∅.
Now suppose that |g(X) − g(Xn)| > ε. This implies that at least one of the following is true: either |X−Xn|≥δ, or X∈Dg,
or X∈Bδ. In terms of probabilities this can be written as
On the right-hand side, the first term converges to zero as n → ∞ for any fixed δ, by the definition of convergence in
probability of the sequence {Xn}. The second term converges to zero as δ → 0, since the set Bδ shrinks to an empty
set. And the last term is identically equal to zero by assumption of the theorem. Therefore the conclusion is that
References
Literature
• Amemiya, Takeshi (1985). Advanced Econometrics. Cambridge, MA: Harvard University Press.
ISBN 0-674-00560-0. LCCN HB139.A54 1985.
• Billingsley, Patrick (1969). Convergence of Probability Measures. John Wiley & Sons. ISBN 0-471-07242-7.
• Billingsley, Patrick (1999). Convergence of Probability Measures (2nd ed.). John Wiley & Sons.
ISBN 0-471-19745-9.
• Mann, H.B.; Wald, A. (1943). "On stochastic limit and order relationships". The Annals of Mathematical
Statistics 14 (3): 217–226. doi:10.1214/aoms/1177731415. JSTOR 2235800.
• Van der Vaart, A. W. (1998). Asymptotic statistics. New York: Cambridge University Press.
ISBN 978-0-521-49603-2. LCCN QA276 .V22 1998.
Notes
[1] Amemiya 1985, p. 88
[2] Van der Vaart 1998, Theorem 2.3, page 7
[3] Billingsley 1969, p. 31, Corollary 1
[4] Billingsley 1999, p. 21, Theorem 2.7
Background
"Stochastic convergence" formalizes the idea that a sequence of essentially random or unpredictable events can
sometimes be expected to settle into a pattern. The pattern may for instance be
• Convergence in the classical sense to a fixed value, perhaps itself coming from a random event
• An increasing similarity of outcomes to what a purely deterministic function would produce
• An increasing preference towards a certain outcome
• An increasing "aversion" against straying far away from a certain outcome
Some less obvious, more theoretical patterns could be
• That the probability distribution describing the next outcome may grow increasingly similar to a certain
distribution
• That the series formed by calculating the expected value of the outcome's distance from a particular value may
converge to 0
• That the variance of the random variable describing the next event grows smaller and smaller.
These other types of patterns that may arise are reflected in the different types of stochastic convergence that have
been studied.
Convergence of random variables 115
While the above discussion has related to the convergence of a single series to a limiting value, the notion of the
convergence of two series towards each other is also important, but this is easily handled by studying the sequence
defined as either the difference or the ratio of the two series.
For example, if the average of n uncorrelated random variables Yi, i = 1, ..., n, all having the same finite mean and
variance, is given by
then as n tends to infinity, Xn converges in probability (see below) to the common mean, μ, of the random variables
Yi. This result is known as the weak law of large numbers. Other forms of convergence are important in other useful
theorems, including the central limit theorem.
Throughout the following, we assume that (Xn) is a sequence of random variables, and X is a random variable, and
all of them are defined on the same probability space .
Convergence in distribution
Suppose a new dice factory has just been built. The first few dice come out quite biased, due to imperfections in the production process. The
outcome from tossing any of them will follow a distribution markedly different from the desired uniform distribution.
As the factory is improved, the dice become less and less loaded, and the outcomes from tossing a newly produced dice will follow the uniform
distribution more and more closely.
Tossing coins
Let Xn be the fraction of heads after tossing up an unbiased coin n times. Then X1 has the Bernoulli distribution with expected value μ = 0.5 and
variance σ2 = 0.25. The subsequent random variables X2, X3, … will all be distributed binomially.
As n grows larger, this distribution will gradually start to take shape more and more similar to the bell curve of the normal distribution. If we shift
and rescale Xn’s appropriately, then will be converging in distribution to the standard normal, the result that follows from the
celebrated central limit theorem.
Graphic example
Suppose { Xi } is an iid sequence of uniform U(−1,1) random variables. Let be their (normalized) sums. Then according to the
central limit theorem, the distribution of Zn approaches the normal N(0, ⅓) distribution. This convergence is shown in the picture: as n grows larger,
the shape of the pdf function gets closer and closer to the Gaussian curve.
With this mode of convergence, we increasingly expect to see the next outcome in a sequence of random
experiments becoming better and better modeled by a given probability distribution.
Convergence of random variables 116
Convergence in distribution is the weakest form of convergence, since it is implied by all other types of convergence
mentioned in this article. However convergence in distribution is very frequently used in practice; most often it
arises from application of the central limit theorem.
Definition
A sequence {X1, X2, …} of random variables is said to converge in distribution, or converge weakly, or converge
in law to a random variable X if
for every number x ∈ R at which F is continuous. Here Fn and F are the cumulative distribution functions of random
variables Xn and X correspondingly.
The requirement that only the continuity points of F should be considered is essential. For example if Xn are
distributed uniformly on intervals [0, 1⁄n], then this sequence converges in distribution to a degenerate random
variable X = 0. Indeed, Fn(x) = 0 for all n when x ≤ 0, and Fn(x) = 1 for all x ≥ 1⁄n when n > 0. However, for this
limiting random variable F(0) = 1, even though Fn(0) = 0 for all n. Thus the convergence of cdfs fails at the point x =
0 where F is discontinuous.
Convergence in distribution may be denoted as
where is the law (probability distribution) of X. For example if X is standard normal we can write .
For random vectors {X1, X2, …} ⊂ Rk the convergence in distribution is defined similarly. We say that this sequence
converges in distribution to a random k-vector X if
for all continuous bounded functions h(·).[2] Here E* denotes the outer expectation, that is the expectation of a
“smallest measurable function g that dominates h(Xn)”.
Properties
• Since F(a) = Pr(X ≤ a), the convergence in distribution means that the probability for Xn to be in a given range is
approximately equal to the probability that the value of X is in that range, provided n is sufficiently large.
• In general, convergence in distribution does not imply that the sequence of corresponding probability density
functions will also converge. As an example one may consider random variables with densities
ƒn(x) = (1 − cos(2πnx))1{x∈(0,1)}. These random variables converge in distribution to a uniform U(0, 1), whereas
their densities do not converge at all.[3]
• Portmanteau lemma provides several equivalent definitions of convergence in distribution. Although these
definitions are less intuitive, they are used to prove a number of statistical theorems. The lemma states that {Xn}
converges in distribution to X if and only if any of the following statements are true:
Convergence of random variables 117
Convergence in probability
Consider the following experiment. First, pick a random person in the street. Let X be his/her height, which is ex ante a random variable. Then you
start asking other people to estimate this height by eye. Let Xn be the average of the first n responses. Then (provided there is no systematic error) by
the law of large numbers, the sequence Xn will converge in probability to the random variable X.
Archer
Suppose a person takes a bow and starts shooting arrows at a target. Let Xn be his score in n-th shot. Initially he will be very likely to score zeros,
but as the time goes and his archery skill increases, he will become more and more likely to hit the bullseye and score 10 points. After the years of
practice the probability that he hit anything but 10 will be getting increasingly smaller and smaller. Thus, the sequence Xn converges in probability
to X = 10.
Note that Xn does not converge almost surely however. No matter how professional the archer becomes, there will always be a small probability of
making an error. Thus the sequence {Xn} will never turn stationary: there will always be non-perfect scores in it, even if they are becoming
increasingly less frequent.
The basic idea behind this type of convergence is that the probability of an “unusual” outcome becomes smaller and
smaller as the sequence progresses.
The concept of convergence in probability is used very often in statistics. For example, an estimator is called
consistent if it converges in probability to the quantity being estimated. Convergence in probability is also the type of
convergence established by the weak law of large numbers.
Definition
A sequence {Xn} of random variables converges in probability towards X if for all ε > 0
Formally, pick any ε > 0 and any δ > 0. Let Pn be the probability that Xn is outside the ball of radius ε centered at X.
Then for Xn to converge in probability to X there should exist a number Nδ such that for all n ≥ Nδ the probability Pn
is less than δ.
Convergence in probability is denoted by adding the letter p over an arrow indicating convergence, or using the
“plim” probability limit operator:
Convergence of random variables 118
For random elements {Xn} on a separable metric space (S, d), convergence in probability is defined similarly by[4]
Properties
• Convergence in probability implies convergence in distribution.[proof]
• Convergence in probability does not imply almost sure convergence.[proof]
• In the opposite direction, convergence in distribution implies convergence in probability only when the limiting
random variable X is a constant.[proof]
• The continuous mapping theorem states that for every continuous function g(·), if , then also .
• Convergence in probability defines a topology on the space of random variables over a fixed probability space.
This topology is metrizable by the Ky Fan metric:[5]
Consider an animal of some short-lived species. We record the amount of food that this animal consumes per day. This sequence of numbers will be
unpredictable, but we may be quite certain that one day the number will become zero, and will stay zero forever after.
Example 2
Consider a man who tosses seven coins every morning. Each afternoon, he donates one pound to a charity for each head that appeared. The first
time the result is all tails, however, he will stop permanently.
Let X1, X2, … be the daily amounts the charity receives from him.
We may be almost sure that one day this amount will be zero, and stay zero forever after that.
However, when we consider any finite number of days, there is a nonzero probability the terminating condition will not occur.
This is the type of stochastic convergence that is most similar to pointwise convergence known from elementary real
analysis.
Definition
To say that the sequence Xn converges almost surely or almost everywhere or with probability 1 or strongly
towards X means that
This means that the values of Xn approach the value of X, in the sense (see almost surely) that events for which Xn
does not converge to X have probability 0. Using the probability space and the concept of the random
variable as a function from Ω to R, this is equivalent to the statement
Almost sure convergence is often denoted by adding the letters a.s. over an arrow indicating convergence:
For generic random elements {Xn} on a metric space (S, d), convergence almost surely is defined similarly:
Convergence of random variables 119
Properties
• Almost sure convergence implies convergence in probability, and hence implies convergence in distribution. It is
the notion of convergence used in the strong law of large numbers.
• The concept of almost sure convergence does not come from a topology on the space of random variables. This
means there is no topology on the space of random variables such that the almost surely convergent sequences are
exactly the converging sequences with respect to that topology. In particular, there is no metric of almost sure
convergence.
Sure convergence
To say that the sequence or random variables (Xn) defined over the same probability space (i.e., a random process)
converges surely or everywhere or pointwise towards X means
where Ω is the sample space of the underlying probability space over which the random variables are defined.
This is the notion of pointwise convergence of sequence functions extended to sequence of random variables. (Note
that random variables themselves are functions).
Sure convergence of a random variable implies all the other kinds of convergence stated above, but there is no
payoff in probability theory by using sure convergence compared to using almost sure convergence. The difference
between the two only exists on sets with probability zero. This is why the concept of sure convergence of random
variables is very rarely used.
Convergence in mean
We say that the sequence Xn converges in the r-th mean (or in the Lr-norm) towards X, for some r ≥ 1, if r-th
absolute moments of Xn and X exist, and
where the operator E denotes the expected value. Convergence in r-th mean tells us that the expectation of the r-th
power of the difference between Xn and X converges to zero.
This type of convergence is often denoted by adding the letter Lr over an arrow indicating convergence:
Convergence in the r-th mean, for r > 0, implies convergence in probability (by Markov's inequality), while if r > s ≥
1, convergence in r-th mean implies convergence in s-th mean. Hence, convergence in mean square implies
convergence in mean.
Convergence of random variables 120
This is a rather "technical" mode of convergence. We essentially compute a sequence of real numbers, one number
for each random variable, and check if this sequence is convergent in the ordinary sense.
Formal definition
If for some real number a, then {Xn} converges in rth-order mean to a.
Properties
The chain of implications between the various notions of convergence are noted in their respective sections. They
are, using the arrow notation:
These properties, together with a number of other special cases, are summarized in the following list:
• Almost sure convergence implies convergence in probability:[7][proof]
• Convergence in probability implies there exists a sub-sequence which almost surely converges:[8]
• Convergence in r-th order mean implies convergence in lower order mean, assuming that both orders are greater
than one:
provided r ≥ s ≥ 1.
• If Xn converges in distribution to a constant c, then Xn converges in probability to c:[7][proof]
provided c is a constant.
• If Xn converges in distribution to X and the difference between Xn and Yn converges in probability to zero, then Yn
also converges in distribution to X:[7][proof]
• If Xn converges in distribution to X and Yn converges in distribution to a constant c, then the joint vector (Xn, Yn)
converges in distribution to (X, c):[7][proof]
provided c is a constant.
Note that the condition that Yn converges to a constant is important, if it were to converge to a random variable Y
then we wouldn’t be able to conclude that (Xn, Yn) converges to (X, Y).
Convergence of random variables 121
• If Xn converges in probability to X and Yn converges in probability to Y, then the joint vector (Xn, Yn) converges in
probability to (X, Y):[7][proof]
• If Xn converges in probability to X, and if P(|Xn| ≤ b) = 1 for all n and some b, then Xn converges in rth mean to X
for all r ≥ 1. In other words, if Xn converges in probability to X and all random variables Xn are almost surely
bounded above and below, then Xn converges to X also in any rth mean.
• Almost sure representation. Usually, convergence in distribution does not imply convergence almost surely.
However for a given sequence {Xn} which converges in distribution to X0 it is always possible to find a new
probability space (Ω, F, P) and random variables {Yn, n = 0,1,…} defined on it such that Yn is equal in
distribution to Xn for each n ≥ 0, and Yn converges to Y0 almost surely.[9]
• If for all ε > 0,
then we say that Xn converges almost completely, or almost in probability towards X. When Xn converges
almost completely towards X then it also converges almost surely to X. In other words, if Xn converges in
probability to X sufficiently quickly (i.e. the above sequence of tail probabilities is summable for all ε > 0),
then Xn also converges almost surely to X. This is a direct implication from the Borel-Cantelli lemma.
• If Sn is a sum of n real independent random variables:
• A necessary and sufficient condition for L1 convergence is and the sequence (Xn) is uniformly
integrable.
Notes
[1] Bickel et al. 1998, A.8, page 475
[2] van der Vaart & Wellner 1996, p. 4
[3] Romano & Siegel 1985, Example 5.26
[4] Dudley 2002, Chapter 9.2, page 287
[5] Dudley 2002, p. 289
[6] Porat, B. (1994). Digital Processing of Random Signals: Theory & Methods. Prentice Hall. pp. 19. ISBN 0-13-063751-3.
[7] van der Vaart 1998, Theorem 2.7
[8] Gut, Allan (2005). Probability: A graduate course. Theorem 3.4: Springer. ISBN 0-387-22833-0.
[9] van der Vaart 1998, Th.2.19
Convergence of random variables 122
References
• Bickel, Peter J.; Klaassen, Chris A.J.; Ritov, Ya’acov; Wellner, Jon A. (1998). Efficient and adaptive estimation
for semiparametric models. New York: Springer-Verlag. ISBN 0-387-98473-9. LCCN QA276.8.E374.
• Billingsley, Patrick (1986). Probability and Measure. Wiley Series in Probability and Mathematical Statistics
(2nd ed.). Wiley.
• Billingsley, Patrick (1999). Convergence of probability measures (2nd ed.). John Wiley & Sons. pp. 1–28.
ISBN 0-471-19745-9.
• Dudley, R.M. (2002). Real analysis and probability. Cambridge, UK: Cambridge University Press.
ISBN 0-521-80972-X.
• Grimmett, G.R.; Stirzaker, D.R. (1992). Probability and random processes (2nd ed.). Clarendon Press, Oxford.
pp. 271–285. ISBN 0-19-853665-8.
• Jacobsen, M. (1992). Videregående Sandsynlighedsregning (Advanced Probability Theory) (3rd ed.). HCØ-tryk,
Copenhagen. pp. 18–20. ISBN 87-91180-71-6.
• Ledoux, Michel; Talagrand, Michel (1991). Probability in Banach spaces. Berlin: Springer-Verlag. pp. xii+480.
ISBN 3-540-52013-9. MR1102015.
• Romano, Joseph P.; Siegel, Andrew F. (1985). Counterexamples in probability and statistics. Great Britain:
Chapman & Hall. ISBN 0-412-98901-8. LCCN QA273.R58 1985.
• van der Vaart, Aad W.; Wellner, Jon A. (1996). Weak convergence and empirical processes. New York:
Springer-Verlag. ISBN 0-387-94640-3. LCCN QA274.V33 1996.
• van der Vaart, Aad W. (1998). Asymptotic statistics. New York: Cambridge University Press.
ISBN 978-0-521-49603-2. LCCN QA276.V22 1998.
• Williams, D. (1991). Probability with Martingales. Cambridge University Press. ISBN 0-521-40605-6.
• Wong, E.; Hájek, B. (1985). Stochastic Processes in Engineering Systems. New York: Springer–Verlag.
This article incorporates material from the Citizendium article "Stochastic convergence", which is licensed under
the Creative Commons Attribution-ShareAlike 3.0 Unported License but not under the GFDL.
Convergent series 123
Convergent series
In mathematics, a series is the sum of the terms of a sequence of numbers.
Given a sequence , the nth partial sum is the sum of the first n terms of the sequence, that
is,
A series is convergent if the sequence of its partial sums converges. In more formal language,
a series converges if there exists a limit such that for any arbitrarily small positive number , there is a
large integer such that for all ,
• Alternating the signs of the reciprocals of positive integers produces a convergent series:
• Alternating the signs of the reciprocals of positive odd integers produces a convergent series (the Leibniz formula
for pi):
• The reciprocals of prime numbers produce a divergent series (so the set of primes is "large"):
• The reciprocals of square numbers produce a convergent series (the Basel problem):
• The reciprocals of powers of 2 produce a convergent series (so the set of powers of 2 is "small"):
Convergence tests
There are a number of methods of determining whether a series converges or diverges.
Comparison test. The terms of the sequence
are compared to those of another sequence . If,
then so does
However, if,
so does
where "lim sup" denotes the limit superior (possibly ∞; if the limit exists it is the same value).
If r < 1, then the series converges. If r > 1, then the series diverges. If r = 1, the root test is inconclusive, and the
series may converge or diverge.
The ratio test and the root test are both based on comparison with a geometric series, and as such they work in
similar situations. In fact, if the ratio test works (meaning that the limit exists and is not equal to 1) then so does the
root test; the converse, however, is not true. The root test is therefore more generally applicable, but as a practical
matter the limit is often difficult to compute for commonly seen types of series.
Integral test. The series can be compared to an integral to establish convergence or divergence. Let be
a positive and monotone decreasing function. If
then the series converges. But if the integral diverges, then the series does so as well.
Limit comparison test. If , and the limit exists and is not zero, then
Alternating series test. Also known as the Leibniz criterion, the alternating series test states that for an alternating
series of the form , if is monotone decreasing, and has a limit of 0 at infinity, then the series
converges.
Convergent series 125
Cauchy condensation test. If is a monotone decreasing sequence, then converges if and only if
converges.
Dirichlet's test
Abel's test
Raabe's test
Uniform convergence
Illustration of the conditional convergence of the
Main article: uniform convergence.
power series of log(z+1) around 0 evaluated at z =
Let be a sequence of functions. The series exp((π−1⁄3)i). The length of the line is infinite.
converges uniformly to f.
There is an analogue of the comparison test for infinite series of functions called the Weierstrass M-test.
Convergent series 126
converges if and only if the sequence of partial sums is a Cauchy sequence. This means that for every there
is a positive integer such that for all we have
which is equivalent to
References
• Rudin, Walter (1976). Principles of Mathematical Analysis. McGrawHill.
• Spivak, Michael (1994). Calculus (3rd ed.). Houston, Texas: Publish or Perish, Inc. ISBN 0-914098-89-6.
External links
• Weisstein, Eric (2005). Riemann Series Theorem [1]. Retrieved May 16, 2005.
References
[1] http:/ / mathworld. wolfram. com/ RiemannSeriesTheorem. html
Copula (probability theory) 127
The copula C contains all information on the dependence structure between the components of
whereas the marginal cumulative distribution functions contain all information on the marginal distributions.
The importance of the above is that the reverse of these steps can be used to generate pseudo-random samples from
general classes of multivariate probability distributions. That is, given a procedure to generate a sample
from the copula distribution, the required sample can be constructed as
The inverses are unproblematic as the were assumed to be continuous. The above formula for the copula
function can be rewritten to correspond to this as:
Definition
In probabilistic terms, is a d-dimensional copula if C is a joint cumulative distribution
function of a d-dimensional random vector on the unit cube with uniform marginals.[1]
In analytic terms, is a d-dimensional copula if
• , the copula is zero if one of the arguments is zero,
• , the copula is equal to u if one argument is u and all others 1,
non-negative:
where the .
Copula (probability theory) 128
Sklar's theorem
Sklar's theorem[2] provides the theoretical foundation for the
application of copulas. Sklar's theorem states that a multivariate
cumulative distribution function
where is a copula.
The theorem also states that, given , the copula is unique on
, which is the cartesian product of the
ranges of the marginal cdf's. This implies that the copula is unique if
the margins are continuous.
The converse is also true: given a copula and
margins then defines a
Density and contour plot of two Normal
d-dimensional cumulative distribution function. marginals joint with a Gumbel copula
The upper bound is sharp: M is always a copula, it corresponds to comonotone random variables.
The lower bound is point-wise sharp, in the sense that for fixed u, there is a copula such that .
However, W is a copula only in two dimensions, in which case it corresponds to countermonotonic random variables.
In two dimensions, i.e. the bivariate case, the Fréchet–Hoeffding Theorem states
Copula (probability theory) 129
Families of copulas
Gaussian copula
The Gaussian copula is a distribution over the unit cube . It is
constructed from a multivariate normal distribution over by using
the probability integral transform.
For a given correlation matrix , the Gaussian copula with
parameter matrix can be written as
Archimedean copulas
Archimedean copulas are an associative class of copulas. Most common Archimedean copulas admit an explicit
formula for C, something not possible, for instance, for the Gaussian copula. In practice, Archimedean copulas are
popular because they allow modelling dependence in arbitrarily high dimensions with only one parameter, governing
the strength of dependence.
A copula C is called Archimedean if it admits the representation
C(u_1, …, u_d) = ψ^{[−1]}(ψ(u_1) + ⋯ + ψ(u_d)),
where ψ : [0,1] → [0,∞) is a continuous, strictly decreasing, convex function with ψ(1) = 0, called the generator,
and ψ^{[−1]} is its pseudo-inverse.
Some important bivariate Archimedean copulas, with parameter θ:
• Ali–Mikhail–Haq: C(u, v) = uv / (1 − θ(1 − u)(1 − v)), θ ∈ [−1, 1)
• Clayton:[7] C(u, v) = (max{u^{−θ} + v^{−θ} − 1, 0})^{−1/θ}, θ ∈ [−1, ∞) \ {0}
• Frank: C(u, v) = −(1/θ) ln(1 + (e^{−θu} − 1)(e^{−θv} − 1) / (e^{−θ} − 1)), θ ∈ R \ {0}
• Gumbel: C(u, v) = exp(−((−ln u)^θ + (−ln v)^θ)^{1/θ}), θ ∈ [1, ∞)
• Independence: C(u, v) = uv
• Joe: C(u, v) = 1 − ((1 − u)^θ + (1 − v)^θ − (1 − u)^θ (1 − v)^θ)^{1/θ}, θ ∈ [1, ∞)
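The bivariate Clayton, Gumbel and Frank copulas admit short closed forms; as a minimal sketch (the function names are ours), one can implement them directly and verify the copula boundary property C(u, 1) = u:

```python
import math

def clayton(u, v, theta):
    """Bivariate Clayton copula, theta in [-1, inf) excluding 0."""
    return max(u**(-theta) + v**(-theta) - 1.0, 0.0) ** (-1.0 / theta)

def gumbel(u, v, theta):
    """Bivariate Gumbel copula, theta in [1, inf)."""
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-(s ** (1.0 / theta)))

def frank(u, v, theta):
    """Bivariate Frank copula, theta != 0."""
    num = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(num / math.expm1(-theta)) / theta
```

For example, `clayton(0.3, 1.0, 2.0)` returns 0.3 (up to floating-point rounding), as required by the boundary property that a copula with one argument equal to 1 reduces to the other argument.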
Empirical copulas
When studying multivariate data, one might want to investigate the underlying copula. Suppose we have
observations
(X_1^i, X_2^i, …, X_d^i), i = 1, …, n,
from a random vector (X_1, X_2, …, X_d) with continuous margins. The corresponding "true" copula observations
would be
(U_1^i, U_2^i, …, U_d^i) = (F_1(X_1^i), F_2(X_2^i), …, F_d(X_d^i)), i = 1, …, n.
However, the marginal distribution functions F_k are usually not known. Therefore, one can construct pseudo copula
observations by using the empirical distribution functions
F_k^n(x) = (1/n) Σ_{i=1}^n 1(X_k^i ≤ x)
instead, setting the pseudo copula observations to
(Ũ_1^i, …, Ũ_d^i) = (F_1^n(X_1^i), …, F_d^n(X_d^i)), i = 1, …, n.
The components of the pseudo copula samples can also be written as Ũ_k^i = R_k^i / n, where R_k^i is the rank of the
observation X_k^i:
R_k^i = Σ_{j=1}^n 1(X_k^j ≤ X_k^i).
Therefore, the empirical copula can be seen as the empirical distribution of the rank transformed data.
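The rank transformation above can be sketched in a few lines (assuming NumPy; ties are ignored, consistent with the continuous-margins assumption):

```python
import numpy as np

def pseudo_observations(x):
    """Pseudo copula observations rank/n for an (n, d) data matrix,
    where each entry is the rank of observation i within column k."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    # argsort of argsort yields 0-based ranks per column; +1 gives 1..n
    ranks = x.argsort(axis=0).argsort(axis=0) + 1
    return ranks / n
```

For example, `pseudo_observations([[1.0, 10.0], [3.0, 5.0], [2.0, 7.0]])` maps the first column to (1/3, 1, 2/3) and the second to (1, 1/3, 2/3): only the ordering of the data survives, which is exactly the point of the empirical copula.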
The expected value of a function g of the random vector (X_1, …, X_d) can be written in terms of copula and margins as
E[g(X_1, …, X_d)] = ∫_{[0,1]^d} g(F_1^{-1}(u_1), …, F_d^{-1}(u_d)) dC(u_1, …, u_d).
In case the copula C is absolutely continuous, i.e. C has a density c, this equation can be written as
E[g(X_1, …, X_d)] = ∫_{[0,1]^d} g(F_1^{-1}(u_1), …, F_d^{-1}(u_d)) c(u_1, …, u_d) du_1 ⋯ du_d.
If copula and margins are known (or if they have been estimated), this expectation can be approximated through the
following Monte Carlo algorithm:
1. Draw a sample (U_1^k, …, U_d^k) ~ C (k = 1, …, n) of size n from the copula C.
2. By applying the inverse marginal cdf's, produce a sample of (X_1, …, X_d) by setting (X_1^k, …, X_d^k) = (F_1^{-1}(U_1^k), …, F_d^{-1}(U_d^k)), k = 1, …, n.
3. Approximate E[g(X_1, …, X_d)] by its empirical value (1/n) Σ_{k=1}^n g(X_1^k, …, X_d^k).
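The two-step sampling scheme can be sketched as follows (assuming NumPy and SciPy; the Gaussian copula with correlation 0.5 and the Exponential/Normal margins are hypothetical choices for illustration):

```python
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(42)
R = np.array([[1.0, 0.5], [0.5, 1.0]])  # Gaussian copula parameter matrix

# 1. Draw a sample of size n from the copula C (here: a Gaussian copula).
z = rng.multivariate_normal([0.0, 0.0], R, size=5_000)
u = norm.cdf(z)

# 2. Apply the inverse marginal cdf's (quantile functions) to obtain a
#    sample with the desired margins but the copula's dependence.
x1 = expon.ppf(u[:, 0], scale=2.0)  # Exponential margin, mean 2
x2 = norm.ppf(u[:, 1], loc=1.0)     # Normal margin, mean 1

# 3. Approximate an expectation by its empirical value.
estimate = np.mean(x1 * x2)
```

Because the copula induces positive dependence, the estimated E[X1·X2] exceeds the product of the marginal means (2 × 1 = 2) that independence would give.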
Applications
Quantitative finance
The applications of copulas in quantitative finance are numerous, both in the real-world probability of risk/portfolio
management and in the risk-neutral probability of derivatives pricing.
In risk/portfolio management, copulas are used to perform stress-tests and robustness checks: panic copulas are
glued with market estimates of the marginal distributions to analyze the effects of panic regimes on the portfolio
profit and loss distribution. Panic copulas are created by Monte Carlo simulation, mixed with a re-weighting of the
probability of each scenario.[9]
As far as derivatives pricing is concerned, dependence modelling with copula functions is widely used in
applications of financial risk assessment and actuarial analysis – for example in the pricing of collateralized debt
obligations (CDOs).[10] Some believe the methodology of applying the Gaussian copula to credit derivatives to be
one of the reasons behind the global financial crisis of 2008–2009.[11][12] Despite this perception, there are
documented attempts of the financial industry, occurring before the crisis, to address the limitations of the Gaussian
copula and of copula functions more generally, specifically the lack of dependence dynamics and the poor
representation of extreme events.[13] There have been attempts to propose models rectifying some of the copula
limitations.[13][14][15]
While the application of copulas in credit has seen both popularity and misfortune during the global
financial crisis of 2008–2009,[16] it is arguably still an industry-standard model for pricing CDOs. Less arguably, copulas
have also been applied to other asset classes as a flexible tool in analyzing multi-asset derivative products. The first
such application outside credit was to use a copula to construct an implied basket volatility surface,[17] taking into
account the volatility smile of basket components. Copulas have since gained popularity in pricing and risk
management [18] of options on multi-assets in the presence of volatility smile/skew, in equity, foreign exchange and
fixed income derivative business. Some typical example applications of copulas are listed below:
• Analyzing and pricing volatility smile/skew of exotic baskets, e.g. best/worst of;
• Analyzing and pricing volatility smile/skew of less liquid FX cross, which is effectively a basket: C = S1/S2 or C
= S1*S2;
• Analyzing and pricing spread options, in particular in fixed income constant maturity swap spread options.
Civil engineering
Recently, copula functions have been successfully applied to the database formulation for the reliability analysis of
highway bridges, and to various multivariate simulation studies in civil,[19] mechanical and offshore engineering.
Medicine
Copula functions have been successfully applied to the analysis of spike counts in neuroscience.[20]
Weather research
Copulas have been extensively used in climate and weather related research.[21]
References
[1] Nelsen, Roger B. (1999), An Introduction to Copulas, New York: Springer, ISBN 0-387-98623-5
[2] Sklar, A. (1959), "Fonctions de répartition à n dimensions et leurs marges", Publ. Inst. Statist. Univ. Paris 8: 229–231
[3] O'Connor, J J; Robertson, E F (March 2011). "Biography of Wassily Hoeffding" (http://www-history.mcs.st-andrews.ac.uk/Biographies/Hoeffding.html). School of Mathematics and Statistics, University of St Andrews, Scotland. Retrieved 8 November 2011.
[4] Arbenz, Philipp (2011). "Bayesian Copulae Distributions, with Application to Operational Risk Management - Some Comments".
Methodology and Computing in Applied Probability Forthcoming. doi:10.1007/s11009-011-9224-0.
[5] McNeil, A. J.; Nešlehová, J. (2009). "Multivariate Archimedean copulas, d-monotone functions and 1-norm symmetric distributions".
Annals of Statistics 37 (5b): 3059–3097. doi:10.1214/07-AOS556.
[6] Hofert, Jan Marius (2010). Sampling Nested Archimedean Copulas with Applications to CDO Pricing (http://vts.uni-ulm.de/docs/2010/7242/vts_7242_10223.pdf). Dissertation at the University of Ulm.
[7] Clayton, David G. (1978). "A model for association in bivariate life tables and its application in epidemiological studies of familial tendency
in chronic disease incidence". Biometrika 65 (1): 141–151. JSTOR 2335289.
[8] Alexander J. McNeil, Rudiger Frey and Paul Embrechts (2005) "Quantitative Risk Management: Concepts, Techniques, and Tools",
Princeton Series in Finance
[9] Meucci, Attilio (2011), "A New Breed of Copulas for Risk and Portfolio Management" (http://symmys.com/node/335), Risk 24 (9): 122–126
[10] Meneguzzo, David; Vecchiato, Walter (Nov 2003), "Copula sensitivity in collateralized debt obligations and basket default swaps", Journal
of Futures Markets 24 (1): 37–70, doi:10.1002/fut.10110
[11] Recipe for Disaster: The Formula That Killed Wall Street (http://www.wired.com/techbiz/it/magazine/17-03/wp_quant?currentPage=all), Wired, 2/23/2009
[12] MacKenzie, Donald (2008), "End-of-the-World Trade" (http://www.lrb.co.uk/v30/n09/mack01_.html), London Review of Books, 2008-05-08, retrieved 2009-07-27
[13] Lipton, Alexander; Rennie, Andrew. Credit Correlation: Life After Copulas. World Scientific. ISBN 978-981-270-949-3.
[14] Donnelly, C; Embrechts, P, (2010). The devil is in the tails: actuarial mathematics and the subprime mortgage crisis. ASTIN Bulletin 40(1),
1–33
[15] Brigo, D; Pallavicini, A; Torresetti, R (2010). Credit Models and the Crisis: A Journey into CDOs, Copulas, Correlations and dynamic
Models. Wiley and Sons
[16] Jones, Sam (April 24, 2009), "The formula that felled Wall St" (http://www.ft.com/cms/s/2/912d85e8-2d75-11de-9eba-00144feabdc0.html), Financial Times
[17] Qu, Dong (2001). "Basket Implied Volatility Surface". Derivatives Week (4 June).
[18] Qu, Dong (2005). "Pricing Basket Options With Skew". Wilmott Magazine (July).
[19] Thompson, David; Kilgore, Roger (2011), "Estimating Joint Flow Probabilities at Stream Confluences using Copulas" (http://trb.metapress.com/content/m3146tg612k80771/?p=d6b0d7200af148b8a4e18e592ca1e269&pi=3), Transportation Research Record 2262: 200–206, retrieved 2012-02-21
[20] Onken, A; Grünewälder, S; Munk, MH; Obermayer, K (2009), Aertsen, Ad, ed., "Analyzing Short-Term Noise Dependencies of Spike-Counts in Macaque Prefrontal Cortex Using Copulas and the Flashlight Transformation" (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000577), PLoS Computational Biology 5 (11): e1000577, doi:10.1371/journal.pcbi.1000577, PMC 2776173, PMID 19956759
[21] doi:10.5194/npg-15-761-2008
Further reading
• The standard reference for an introduction to copulas. Covers all fundamental aspects, summarizes the most
popular copula classes, and provides proofs for the important theorems related to copulas
Roger B. Nelsen (1999), "An Introduction to Copulas", Springer. ISBN 978-0-387-98623-4
• A book covering current topics in mathematical research on copulas:
Piotr Jaworski, Fabrizio Durante, Wolfgang Karl Härdle, Tomasz Rychlik (Editors): (2010): "Copula
Theory and Its Applications" Lecture Notes in Statistics, Springer. ISBN 978-3-642-12464-8
• A paper covering the historic development of copula theory, by the person associated with the "invention" of
copulas, Abe Sklar.
Abe Sklar (1997): "Random variables, distribution functions, and copulas – a personal look backward
and forward" in Rüschendorf, L., Schweizer, B. und Taylor, M. (eds) Distributions With Fixed
Marginals & Related Topics (Lecture Notes – Monograph Series Number 28). ISBN
978-0-940600-40-9
• The standard reference for multivariate models and copula theory in the context of financial and insurance models
Alexander J. McNeil, Rudiger Frey and Paul Embrechts (2005) "Quantitative Risk Management:
Concepts, Techniques, and Tools", Princeton Series in Finance. ISBN 978-0-691-12255-7
External links
• Copula Wiki: community portal for researchers with interest in copulas (http://sites.google.com/site/
copulawiki/)
• A collection of Copula simulation and estimation codes (http://www.mathfinance.cn/tags/copula)
• Thorsten Schmidt (2006): "Coping with Copulas" (http://www.math.uni-leipzig.de/~tschmidt/
TSchmidt_Copulas.pdf)
• Copulas & Correlation using Excel Simulation Articles (http://www.crystalballservices.com/Resources/
ConsultantsCornerBlog/tagid/21/Correlation.aspx)
Coupon collector's problem 134
Solution
The expected time T to collect all n coupons is E(T) = n·H_n, where H_n = 1 + 1/2 + ⋯ + 1/n is the harmonic number. Using the asymptotics of the harmonic numbers, we obtain:
E(T) = n·H_n = n ln n + γn + 1/2 + o(1),
where γ ≈ 0.5772156649 is the Euler–Mascheroni constant.
The variance of T satisfies Var(T) < (π²/6)·n², where the last equality uses a value of the Riemann zeta function (see Basel problem).
Now one can use the Chebyshev inequality to bound the desired probability:
P(|T − n·H_n| ≥ c·n) ≤ π²/(6c²).
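A small simulation (a sketch; the sample sizes and seed are arbitrary choices) agrees closely with the expectation E(T) = n·H_n:

```python
import random

def expected_draws(n):
    """Expected number of draws E(T) = n * H_n to collect all n coupons."""
    return n * sum(1.0 / k for k in range(1, n + 1))

def simulate(n, trials=2000, seed=1):
    """Monte Carlo estimate of E(T): repeatedly draw uniform coupons
    until every one of the n types has been seen at least once."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen, draws = set(), 0
        while len(seen) < n:
            seen.add(rng.randrange(n))
            draws += 1
        total += draws
    return total / trials
```

For n = 50 this gives E(T) ≈ 224.96, matching the worked example in the notes below.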
Tail estimates
A different upper bound can be derived from the following observation. Let Z_i^r denote the event that the i-th
coupon was not picked in the first r trials. Then:
P[Z_i^r] = (1 − 1/n)^r ≤ e^{−r/n}.
Thus, for r = βn log n, we have P[Z_i^r] ≤ e^{−β log n} = n^{−β}, and by the union bound
P[T > βn log n] ≤ P[∪_i Z_i^r] ≤ n · n^{−β} = n^{1−β}.
• Donald J. Newman and Lawrence Shepp found a generalization of the coupon collector's problem when k copies
of each coupon need to be collected. Let T_k be the first time k copies of each coupon are collected. They showed
that the expectation in this case satisfies:
E(T_k) = n log n + (k − 1) n log log n + O(n).
Here k is fixed. When k = 1 we get the earlier formula for the expectation.
• A common generalization, also due to Erdős and Rényi:
P(T_k < n log n + (k − 1) n log log n + cn) → e^{−e^{−c}/(k−1)!}, as n → ∞.
Notes
[1] Here is the calculation: 50 ln(50) ≈ 195.60; 50γ ≈ 28.86; 50 ln(50) + 50γ ≈ 224.46; and E(50) = 50·(1 + 1/2 + 1/3 + ⋯ + 1/50) ≈ 224.96, the
expected number of trials to collect all 50 coupons.
References
• Blom, Gunnar; Holst, Lars; Sandell, Dennis (1994), "7.5 Coupon collecting I, 7.6 Coupon collecting II, and 15.4
Coupon collecting III" (http://books.google.com/books?id=KCsSWFMq2u0C&pg=PA85), Problems and
Snapshots from the World of Probability, New York: Springer-Verlag, pp. 85–87, 191, ISBN 0-387-94161-4,
MR1265713.
• Dawkins, Brian (1991), "Siobhan's problem: the coupon collector revisited", The American Statistician 45 (1):
76–82, JSTOR 2685247.
• Erdős, Paul; Rényi, Alfréd (1961), "On a classical problem of probability theory" (http://www.renyi.hu/
~p_erdos/1961-09.pdf), Magyar Tudományos Akadémia Matematikai Kutató Intézetének Közleményei 6:
215–220, MR0150807.
• Newman, Donald J.; Shepp, Lawrence (1960), "The double dixie cup problem", American Mathematical Monthly
67: 58–61, doi:10.2307/2308930, MR0120672
• Flajolet, Philippe; Gardy, Danièle; Thimonier, Loÿs (1992), "Birthday paradox, coupon collectors, caching
algorithms and self-organizing search" (http://algo.inria.fr/flajolet/Publications/alloc2.ps.gz), Discrete
Applied Mathematics 39 (3): 207–229, doi:10.1016/0166-218X(92)90177-C, MR1189469.
• Isaac, Richard (1995), "8.4 The coupon collector's problem solved" (http://books.google.com/
books?id=a_2vsIx4FQMC&pg=PA80), The Pleasures of Probability, Undergraduate Texts in Mathematics, New
York: Springer-Verlag, pp. 80–82, ISBN 0-387-94415-X, MR1329545.
• Motwani, Rajeev; Raghavan, Prabhakar (1995), "3.6. The Coupon Collector's Problem" (http://books.google.
com/books?id=QKVY4mDivBEC&pg=PA57), Randomized algorithms, Cambridge: Cambridge University
Press, pp. 57–63, MR1344451.
External links
• " Coupon Collector Problem (http://demonstrations.wolfram.com/CouponCollectorProblem/)" by Ed Pegg, Jr.,
the Wolfram Demonstrations Project. Mathematica package.
• Coupon Collector Problem (http://www-stat.stanford.edu/~susan/surprise/Collector.html), a simple Java
applet.
• How Many Singles, Doubles, Triples, Etc., Should The Coupon Collector Expect? (http://www.math.rutgers.
edu/~zeilberg/mamarim/mamarimhtml/coupon.html), a short note by Doron Zeilberger.
Degrees of freedom (statistics) 137
Notation
In equations, the typical symbol for degrees of freedom is ν (the lowercase Greek letter nu). In text and tables, the
abbreviation "d.f." is commonly used. R. A. Fisher used n to symbolize degrees of freedom (writing n′ for sample
size), but modern usage typically reserves n for sample size.
Residuals
A common way to think of degrees of freedom is as the number of independent pieces of information available to
estimate another piece of information. More concretely, the number of degrees of freedom is the number of
independent observations in a sample of data that are available to estimate a parameter of the population from which
that sample is drawn. For example, if we have two observations, when calculating the mean we have two
independent observations; however, when calculating the variance, we have only one independent observation, since
the two observations are equally distant from the mean.
In fitting statistical models to data, the vectors of residuals are constrained to lie in a space of smaller dimension than
the number of components in the vector. That smaller dimension is the number of degrees of freedom for error.
Linear regression
Perhaps the simplest example is this. Suppose X_1, …, X_n are random variables, each with expected value μ, and let
X̄_n = (X_1 + ⋯ + X_n)/n
be their sample mean. Then the quantities
X_i − X̄_n
are residuals that may be considered estimates of the errors X_i − μ. The sum of the residuals (unlike the sum of the
errors) is necessarily 0. If one knows the values of any n − 1 of the residuals, one can thus find the last one. That
means they are constrained to lie in a space of dimension n − 1. One says that "there are n − 1 degrees of freedom
for residuals."
An only slightly less simple example is that of least squares estimation of a and b in the model
Y_i = a + b·x_i + ε_i, for i = 1, …, n,
where the x_i are given, and ε_i and hence Y_i are random. Let â and b̂ be the least-squares estimates of a and b. Then the residuals
e_i = y_i − (â + b̂·x_i)
are constrained to lie within the space defined by the two equations
e_1 + ⋯ + e_n = 0 and x_1·e_1 + ⋯ + x_n·e_n = 0.
One says that there are n − 2 degrees of freedom for error.
The capital Y is used in specifying the model, and lower-case y in the definition of the residuals. That is because the
former are hypothesized random variables and the latter are data.
We can generalise this to multiple regression involving p parameters and covariates (e.g. p − 1 predictors and one
mean), in which case the cost in degrees of freedom of the fit is p.
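A numeric sketch of these constraints (assuming NumPy; the data are synthetic): after fitting an intercept and slope by least squares, the residual vector satisfies two linear constraints and so lies in an (n − 2)-dimensional subspace.

```python
import numpy as np

# Synthetic data: n points from a (hypothetical) line plus noise.
rng = np.random.default_rng(0)
n = 30
x = rng.uniform(0.0, 10.0, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, n)

# Least-squares fit of intercept a and slope b.
X = np.column_stack([np.ones(n), x])  # design matrix with p = 2 columns
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# The residuals obey sum(e_i) = 0 and sum(x_i * e_i) = 0: two
# constraints, leaving n - 2 = 28 degrees of freedom for error.
```

The two assertions below are the two constraint equations, verified up to floating-point rounding.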
Suppose X_1, …, X_n are independent, normally distributed random variables, collected into the random vector
(X_1, …, X_n). Since this random vector can lie anywhere in n-dimensional space, it has n degrees of freedom.
Now, let X̄ be the sample mean. The random vector can be decomposed as the sum of the sample mean plus a
vector of residuals:
(X_1, …, X_n) = X̄·(1, …, 1) + (X_1 − X̄, …, X_n − X̄).
The first vector on the right-hand side is constrained to be a multiple of the vector of 1's, and the only free quantity is
X̄. It therefore has 1 degree of freedom.
The second vector is constrained by the relation (X_1 − X̄) + ⋯ + (X_n − X̄) = 0. The first n − 1 components of this vector can
be anything. However, once you know the first n − 1 components, the constraint tells you the value of the nth
component. Therefore, this vector has n − 1 degrees of freedom.
Mathematically, the first vector is the orthogonal, or least-squares, projection of the data vector onto the subspace
spanned by the vector of 1's. The 1 degree of freedom is the dimension of this subspace. The second residual vector
is the least-squares projection onto the (n − 1)-dimensional orthogonal complement of this subspace, and has n − 1
degrees of freedom.
In statistical testing applications, often one isn't directly interested in the component vectors, but rather in their
squared lengths. In the example above, the residual sum-of-squares is
(X_1 − X̄)² + ⋯ + (X_n − X̄)².
If the data points X_i are normally distributed with mean 0 and variance σ², then the residual sum of squares has a
scaled chi-squared distribution (scaled by the factor σ²), with n − 1 degrees of freedom. The degrees-of-freedom,
here a parameter of the distribution, can still be interpreted as the dimension of an underlying vector subspace.
Likewise, the one-sample t-test statistic,
t = (X̄_n − μ_0) / (S_n / √n),
follows a Student's t distribution with n − 1 degrees of freedom when the hypothesized mean μ_0 is correct. Again,
the degrees-of-freedom arises from the residual vector in the denominator.
where X̄_1, X̄_2, X̄_3 are the means of the individual samples, and X̄ is the mean of all 3n
observations. In vector notation this decomposition can be written as
(X_{11}, …, X_{1n}, X_{21}, …, X_{3n}) = X̄·(1, …, 1) + (X̄_1 − X̄, …, X̄_1 − X̄, X̄_2 − X̄, …, X̄_3 − X̄) + (X_{11} − X̄_1, …, X_{3n} − X̄_3).
The observation vector, on the left-hand side, has 3n degrees of freedom. On the right-hand side, the first vector has
one degree of freedom (or dimension) for the overall mean. The second vector depends on three random variables,
, and . However, these must sum to 0 and so are constrained; the vector therefore
must lie in a 2-dimensional subspace, and has 2 degrees of freedom. The remaining 3n − 3 degrees of freedom are in
the residual vector (made up of n − 1 degrees of freedom within each of the populations).
The residual sum-of-squares Σ_i Σ_j (X_{ij} − X̄_i)² has a scaled chi-squared distribution, with 3(n − 1) degrees of freedom. Of course, introductory books on ANOVA usually state formulae without showing
the vectors, but it is this underlying geometry that gives rise to SS formulae, and shows how to unambiguously
determine the degrees of freedom in any given situation.
Under the null hypothesis of no difference between population means (and assuming that standard ANOVA
regularity assumptions are satisfied) the sums of squares have scaled chi-squared distributions, with the
corresponding degrees of freedom. The F-test statistic is the ratio, after scaling by the degrees of freedom. If there is
no difference between population means this ratio follows an F distribution with 2 and 3n − 3 degrees of freedom.
In some complicated settings, such as unbalanced split-plot designs, the sums-of-squares no longer have scaled
chi-squared distributions. Comparison of sum-of-squares with degrees-of-freedom is no longer meaningful, and
software may report certain fractional 'degrees of freedom' in these cases. Such numbers have no genuine
degrees-of-freedom interpretation, but are simply providing an approximate chi-squared distribution for the
corresponding sum-of-squares. The details of such approximations are beyond the scope of this page.
For example, the statistic (n − 1)·S²/σ², where S² is the sample variance, follows a chi-squared distribution with n − 1 degrees of freedom. Here, the degrees of freedom arises from the
residual sum-of-squares in the numerator, and in turn the n − 1 degrees of freedom of the underlying residual vector
(X_1 − X̄, …, X_n − X̄).
In the application of these distributions to linear models, the degrees of freedom parameters can take only integer
values. The underlying families of distributions allow fractional values for the degrees-of-freedom parameters, which
can arise in more sophisticated uses. One set of examples is problems where chi-squared approximations based on
effective degrees of freedom are used. In other applications, such as modelling heavy-tailed data, a t or F distribution
may be used as an empirical model. In these cases, there is no particular degrees of freedom interpretation to the
distribution parameters, even though the terminology may continue to be used.
In standard regression and smoothing procedures, the vector of fitted values can be written as
ŷ = H·y,
where ŷ is the vector of fitted values at each of the original covariate values from the fitted model, y is the original
vector of responses, and H is the hat matrix or, more generally, smoother matrix.
For statistical inference, sums-of-squares can still be formed: the model sum-of-squares is y'H'Hy; the residual
sum-of-squares is y'(I − H)'(I − H)y. However, because H does not correspond to an ordinary least-squares fit (i.e.
is not an orthogonal projection), these sums-of-squares no longer have (scaled, non-central) chi-squared
distributions, and dimensionally defined degrees-of-freedom are not useful. The distribution is a generalized
chi-squared distribution, and the theory associated with this distribution[5] provides an alternative route to the
answers provided by effective degrees of freedom.
The effective degrees of freedom of the fit can be defined in various ways to implement goodness-of-fit tests,
cross-validation and other inferential procedures. Here one can distinguish between regression effective degrees of
freedom and residual effective degrees of freedom.
Regression effective degrees of freedom.
Regarding the former, appropriate definitions can include the trace of the hat matrix,[6] tr(H), the trace of the
quadratic form of the hat matrix, tr(H'H), the form tr(2H − HH'), or the Satterthwaite approximation,
tr(H'H)²/tr(H'HH'H). In the case of linear regression, the hat matrix H is X(X'X)^{−1}X', and all these definitions reduce
to the usual degrees of freedom. Notice that
tr(H) = Σ_i h_{ii} = Σ_i ∂ŷ_i/∂y_i,
i.e., the regression (not residual) degrees of freedom in linear models are "the sum of the sensitivities of the fitted
values with respect to the observed response values".[7]
Residual effective degrees of freedom.
There are corresponding definitions of residual effective degrees-of-freedom (redf), with H replaced by I − H. For
example, if the goal is to estimate error variance, the redf would be defined as tr((I − H)'(I − H)), and the unbiased
estimate is (with r = y − ŷ = (I − H)y):
σ̂² = r'r / tr((I − H)'(I − H)),
or, equivalently and approximately:[8][9][10]
σ̂² = r'r / (n − tr(2H − HH')) ≈ r'r / (n − tr(H)).
The last approximation above[9] reduces the computational cost from O(n²) to only O(n). In general the numerator
would be the objective function being minimized; e.g., if the hat matrix includes an observation covariance matrix,
Σ, then r'r becomes r'Σ^{−1}r.
General.
Note that unlike in the original case, non-integer degrees of freedom are allowed, though the value must usually still
be constrained between 0 and n.
Consider, as an example, the k-nearest neighbour smoother, which is the average of the k nearest measured values to
the given point. Then, at each of the n measured points, the weight of the original value on the linear combination
that makes up the predicted value is just 1/k. The trace of the hat matrix is therefore n/k, and so the smooth costs n/k
effective degrees of freedom.
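This can be checked directly (a sketch assuming NumPy; the grid of measurement points and the choice k = 3 are arbitrary):

```python
import numpy as np

def knn_hat_matrix(x, k):
    """Hat matrix of the k-nearest-neighbour smoother on points x:
    row i puts weight 1/k on the k measured points closest to x[i]
    (each point is its own nearest neighbour, at distance 0)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        nearest = np.argsort(np.abs(x - x[i]))[:k]
        H[i, nearest] = 1.0 / k
    return H

H = knn_hat_matrix(np.linspace(0.0, 1.0, 12), k=3)
# trace(H) = n/k = 12/3 = 4 effective degrees of freedom
```

Each diagonal entry of H is 1/k, since every point is among its own k nearest neighbours, so the trace is n/k as claimed.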
As another example, consider the existence of nearly duplicated observations. Naive application of the classical formula,
n − p, would lead to over-estimation of the residual degrees of freedom, as if each observation were independent.
More realistically, though, the hat matrix H = X(X ' Σ−1 X)−1X ' Σ−1 would involve an observation covariance matrix
Σ indicating the non-zero correlation among observations. The more general formulation of effective degree of
freedom would result in a more realistic estimate for, e.g., the error variance σ2.
Similar concepts are the equivalent degrees of freedom in non-parametric regression,[11] the degree of freedom of
signal in atmospheric studies,[12][13] and the non-integer degree of freedom in geodesy.[14][15]
References
[1] "Degrees of Freedom" (http://www.animatedsoftware.com/statglos/sgdegree.htm). Glossary of Statistical Terms. Animated Software. Retrieved 2008-08-21.
[2] Lane, David M. "Degrees of Freedom" (http://davidmlane.com/hyperstat/A42408.html). HyperStat Online. Statistics Solutions. Retrieved 2008-08-21.
[3] Walker, H. M. (April 1940). "Degrees of Freedom". Journal of Educational Psychology 31 (4): 253–269. doi:10.1037/h0054588.
[4] Christensen, Ronald (2002). Plane Answers to Complex Questions: The Theory of Linear Models (Third ed.). New York: Springer.
ISBN 0-387-95361-2.
[5] Jones, D.A. (1983) "Statistical analysis of empirical models fitted by optimisation", Biometrika, 70 (1), 67–88
[6] Trevor Hastie, Robert Tibshirani, Jerome H. Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., 746 p. ISBN 978-0-387-84857-0, doi:10.1007/978-0-387-84858-7 (http://books.google.com/books?id=tVIjmNS3Ob8C&pg=PA154) (eq.(5.16))
[7] Ye, J. (1998), "On Measuring and Correcting the Effects of Data Mining and Model Selection", Journal of the American Statistical
Association, 93 (441), 120-131. JSTOR 2669609 (eq.(7))
[8] Clive Loader (1999), Local Regression and Likelihood, ISBN 978-0-387-98775-0, doi:10.1007/b98858 (http://books.google.com/books?id=D7GgBAfL4ngC&pg=PA28) (eq.(2.18), p.30)
[9] Trevor Hastie, Robert Tibshirani (1990), Generalized Additive Models, CRC Press (http://books.google.com/books?id=qa29r1Ze1coC&pg=PA54) (p.54 and eq.(B.1), p.305)
[10] Simon N. Wood (2006), Generalized Additive Models: An Introduction with R, CRC Press (http://books.google.com/books?id=hr17lZC-3jQC&pg=PA172) (eq.(4.14), p.172)
[11] Peter J. Green, B. W. Silverman (1994), Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, CRC Press (http://books.google.com/books?id=-AIVXozvpLUC&pg=PA37) (eq.(3.15), p.37)
[12] Clive D. Rodgers (2000), Inverse methods for atmospheric sounding: theory and practice, World Scientific (eq.(2.56), p.31)
[13] Adrian Doicu, Thomas Trautmann, Franz Schreier (2010), Numerical Regularization for Atmospheric Inverse Problems, Springer (eq.(4.26),
p.114)
[14] D. Dong, T. A. Herring and R. W. King (1997), Estimating regional deformation from a combination of space and terrestrial geodetic data,
J. Geodesy, 72 (4), 200-214, doi:10.1007/s001900050161 (eq.(27), p.205)
[15] H. Theil (1963), "On the Use of Incomplete Prior Information in Regression Analysis", Journal of the American Statistical Association, 58
(302), 401-414 JSTOR 2283275 (eq.(5.19)-(5.20))
External links
• Walker, HW (1940) "Degrees of Freedom" Journal of Educational Psychology 31(4) 253-269. Transcription by C
Olsen with errata (http://courses.ncssm.edu/math/Stat_Inst/PDFS/DFWalker.pdf)
• Good, IJ (1973). "What Are Degrees of Freedom?". The American Statistician (The American Statistician, Vol.
27, No. 5) 27 (5): 227–228. doi:10.2307/3087407. JSTOR 3087407.
• Yu, Chong-ho (1997) Illustrating degrees of freedom in terms of sample size and dimensionality (http://www.
creative-wisdom.com/computer/sas/df.html)
• Dallal, GE. (2003) Degrees of Freedom (http://www.tufts.edu/~gdallal/dof.htm)
Determinant
In linear algebra, the determinant is a value associated with a square matrix. It can be computed from the entries of
the matrix by a specific arithmetic expression, while other ways to determine its value exist as well. The determinant
provides important information when the matrix is that of the coefficients of a system of linear equations, or when it
corresponds to a linear transformation of a vector space: in the first case the system has a unique solution if and only
if the determinant is nonzero, while in the second case that same condition means that the transformation has an
inverse operation. A geometric interpretation can be given to the value of the determinant of a square matrix with
real entries: the absolute value of the determinant gives the scale factor by which area or volume is multiplied under
the associated linear transformation, while its sign indicates whether the transformation preserves orientation. Thus a
2 × 2 matrix with determinant −2, when applied to a region of the plane with finite area, will transform that region
into one with twice the area, while reversing its orientation.
Determinants occur throughout mathematics. The use of determinants in calculus includes the Jacobian determinant
in the substitution rule for integrals of functions of several variables. They are used to define the characteristic
polynomial of a matrix that is an essential tool in eigenvalue problems in linear algebra. In some cases they are used
just as a compact notation for expressions that would otherwise be unwieldy to write down.
The determinant of a matrix A is denoted det(A), det A, or |A|.[1] In the case where the matrix entries are written out
in full, the determinant is denoted by surrounding the matrix entries by vertical bars instead of the brackets or
parentheses of the matrix. For instance, the determinant of the matrix with rows (a, b) and (c, d) is written |a b; c d|.
Although most often used for matrices whose entries are real or complex numbers, the definition of the determinant
only involves addition, subtraction and multiplication, and so it can be defined for square matrices with entries taken
from any commutative ring. Thus for instance the determinant of a matrix with integer coefficients will be an
integer, and the matrix has an inverse with integer coefficients if and only if this determinant is 1 or −1 (these being
the only invertible elements of the integers). For square matrices with entries in a non-commutative ring, for instance
the quaternions, there is no unique definition for the determinant, and no definition that has all the usual properties of
determinants over commutative rings.
Definition
There are various ways to define the determinant of a square matrix A, i.e. one with the same number of rows and
columns. Perhaps the most natural way is expressed in terms of the columns of the matrix. If we write an n-by-n
matrix in terms of its column vectors
where the a_j are vectors of size n, then the determinant of A is defined so that
det [a_1, …, b·a_j + c·v, …, a_n] = b·det [a_1, …, a_j, …, a_n] + c·det [a_1, …, v, …, a_n],
det [a_1, …, a_j, a_{j+1}, …, a_n] = −det [a_1, …, a_{j+1}, a_j, …, a_n],
det(I) = 1,
where b and c are scalars, v is any vector of size n and I is the identity matrix of size n. These properties state that the
determinant is an alternating multilinear function of the columns, and they suffice to uniquely calculate the
determinant of any square matrix. Provided the underlying scalars form a field (more generally, a commutative ring
with unity), the definition below shows that such a function exists, and it can be shown to be unique.[2]
Equivalently, the determinant can be expressed as a sum of products of entries of the matrix where each product has
n terms and the coefficient of each product is -1 or 1 or 0 according to a given rule: it is a polynomial expression of
the matrix entries. This expression grows rapidly with the size of the matrix (an n-by-n matrix contributes n! terms),
so it will first be given explicitly for the case of 2-by-2 matrices and 3-by-3 matrices, followed by the rule for
arbitrary size matrices, which subsumes these two cases.
Assume A is a square matrix with n rows and n columns, so that it can be written as
The entries can be numbers or expressions (as happens when the determinant is used to define a characteristic
polynomial); the definition of the determinant depends only on the fact that they can be added and multiplied
together in a commutative manner.
The determinant of A is denoted det(A) or |A|; the latter notation writes enclosing bars instead of brackets directly around the matrix entries.
2-by-2 matrices
The determinant of a 2×2 matrix is defined by
det [[a, b], [c, d]] = ad − bc.
If the matrix entries are real numbers, the matrix A can be used to
represent two linear mappings: one that maps the standard basis
vectors to the rows of A, and one that maps them to the columns of A.
In either case, the images of the basis vectors form a parallelogram that
represents the image of the unit square under the mapping. The
parallelogram defined by the rows of the above matrix is the one with
vertices at (0,0), (a,b), (a + c, b + d), and (c,d), as shown in the
accompanying diagram. The absolute value of ad − bc is the area of
the parallelogram, and thus represents the scale factor by which areas
are transformed by A. (The parallelogram formed by the columns of A is in general a different parallelogram, but since
the determinant is symmetric with respect to rows and columns, the area will be the same.)
The area of the parallelogram is the absolute value of the determinant of the matrix formed by the vectors representing the parallelogram's sides.
The absolute value of the determinant together with the sign becomes the oriented area of the parallelogram. The
oriented area is the same as the usual area, except that it is negative when the angle from the first to the second
vector defining the parallelogram turns in a clockwise direction (which is opposite to the direction one would get for
the identity matrix).
Thus the determinant gives the scaling factor and the orientation induced by the mapping represented by A. When the
determinant is equal to one, the linear mapping defined by the matrix is equi-areal and orientation-preserving.
3-by-3 matrices
The determinant of a 3×3 matrix is defined by
det [[a, b, c], [d, e, f], [g, h, i]] = aei + bfg + cdh − ceg − bdi − afh.
The volume of this parallelepiped is the absolute value of the determinant of the
matrix formed by the rows r1, r2, and r3.
This scheme for calculating the determinant of a 3×3 matrix does not carry over into higher dimensions. However, an
extension of Sarrus' rule to 4×4 matrices has been developed that requires lining up 14 columns next to each other;
since the determinant of a 4×4 matrix comprises 24 terms, at least 12 columns are needed to use this method.[3]
n-by-n matrices
The determinant of a matrix of arbitrary size can be defined by the Leibniz formula or the Laplace formula.
The Leibniz formula for the determinant of an n-by-n matrix A is
det(A) = Σ_{σ ∈ S_n} sgn(σ) ∏_{i=1}^{n} a_{i,σ_i}.
Here the sum is computed over all permutations σ of the set {1, 2, ..., n}. A permutation is a function that reorders
this set of integers. The value in the i-th position after the reordering σ is denoted σi. For example, for n = 3, the
original sequence 1, 2, 3 might be reordered to σ = [2, 3, 1], with σ1 = 2, σ2 = 3, and σ3 = 1. The set of all such
permutations (also known as the symmetric group on n elements) is denoted Sn. For each permutation σ, sgn(σ)
denotes the signature of σ; it is +1 for even σ and −1 for odd σ. Evenness or oddness can be defined as follows: the
permutation is even (odd) if the new sequence can be obtained by an even number (odd, respectively) of switches of
numbers. For example, starting from [1, 2, 3] (and starting with the convention that the signature sgn([1,2,3]) = +1)
and switching the positions of 2 and 3 yields [1, 3, 2], with sgn([1,3,2]) = –1. Switching once more yields [3, 1, 2],
with sgn([3,1,2]) = +1 again. Finally, after a total of three switches (an odd number), the resulting permutation is [3,
2, 1], with sgn([3,2,1]) = –1. Therefore [3, 2, 1] is an odd permutation. Similarly, the permutation [2, 3, 1] is even:
[1, 2, 3] → [2, 1, 3] → [2, 3, 1], with an even number of switches.
A permutation cannot be simultaneously even and odd, but sometimes it is convenient to accept non-permutations:
sequences with repeated or skipped numbers, like [1, 2, 1]. In that case, the signature of any non-permutation is zero:
sgn([1,2,1]) = 0.
Here ∏_{i=1}^{n} a_{i,σ_i} is notation for the product of the entries at positions (i, σ_i), where i ranges from 1 to n: a_{1,σ_1} · a_{2,σ_2} · ⋯ · a_{n,σ_n}.
This agrees with the rule of Sarrus given in the previous section.
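As an illustration, the Leibniz formula can be translated directly into code. The sketch below (function names are our own) computes the signature of each permutation by counting inversions and sums over all n! permutations, so it is only practical for small matrices:

```python
from itertools import permutations

def sgn(perm):
    # Signature via inversion count: +1 for an even permutation, -1 for odd.
    inversions = sum(1
                     for i in range(len(perm))
                     for j in range(i + 1, len(perm))
                     if perm[i] > perm[j])
    return -1 if inversions % 2 else 1

def det_leibniz(a):
    # Leibniz formula: sum over all permutations sigma of
    # sgn(sigma) * a[0][sigma(0)] * ... * a[n-1][sigma(n-1)].
    n = len(a)
    total = 0
    for sigma in permutations(range(n)):
        prod = 1
        for i in range(n):
            prod *= a[i][sigma[i]]
        total += sgn(sigma) * prod
    return total
```

For a 10×10 matrix this already sums 3,628,800 products, which is why the decomposition methods discussed later are preferred in practice.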
The formal extension to arbitrary dimensions was made by Tullio Levi-Civita using a pseudo-tensor symbol (see
Levi-Civita symbol).
Levi-Civita symbol
The determinant for an n-by-n matrix can be expressed in terms of the totally antisymmetric Levi-Civita symbol as
follows:
det(A) = Σ ε_{i_1 ⋯ i_n} a_{1,i_1} ⋯ a_{n,i_n},
where the sum ranges over all n-tuples of indices (i_1, ..., i_n) taken from {1, ..., n}.
A basic property is that det(I_n) = 1, where I_n is the n×n identity matrix.
This can be deduced from some of the properties below, but it follows most easily directly from the Leibniz formula
(or from the Laplace expansion), in which the identity permutation is the only one that gives a non-zero contribution.
A number of additional properties relate to the effects on the determinant of changing particular rows or columns:
• Viewing an n×n matrix as being composed of n columns, the determinant is an n-linear function. This means
that if one column of a matrix A is written as a sum v + w of two column vectors, and all other columns are
left unchanged, then the determinant of A is the sum of the determinants of the matrices obtained from A by
replacing the column by v and by w respectively (and a similar relation holds when writing a column as a scalar
multiple of a column vector).
• Viewing an n×n matrix as being composed of n rows, the determinant is an n-linear function.
2. This n-linear function is an alternating form: whenever two rows of a matrix are identical, its determinant is 0.
3. Interchanging two columns of a matrix multiplies its determinant by −1. This follows from properties 7 and 8 (it
is a general property of multilinear alternating maps). Iterating gives that more generally a permutation of the
columns multiplies the determinant by the sign of the permutation. Similarly a permutation of the rows multiplies
the determinant by the sign of the permutation.
4. Adding a scalar multiple of one column to another column does not change the value of the determinant. This is a
consequence of properties 7 and 8: by property 7 the determinant changes by a multiple of the determinant of a
matrix with two equal columns, which determinant is 0 by property 8. Similarly, adding a scalar multiple of one
row to another row leaves the determinant unchanged.
These properties can be used to facilitate the computation of determinants by simplifying the matrix to the point
where the determinant can be determined immediately. Specifically, for matrices with coefficients in a field,
properties 11 and 12 can be used to transform any matrix into a triangular matrix, whose determinant is given by
property 6; this is essentially the method of Gaussian elimination.
For example, the determinant of
A = [[−2, 2, −3], [−1, 1, 3], [2, 0, −1]]
can be computed using the following matrices:
B = [[−2, 2, −3], [0, 0, 4.5], [2, 0, −1]],  C = [[−2, 2, −3], [0, 0, 4.5], [0, 2, −4]],  D = [[−2, 2, −3], [0, 2, −4], [0, 0, 4.5]].
Here, B is obtained from A by adding −1/2 × the first row to the second, so that det(A) = det(B). C is obtained from B
by adding the first to the third row, so that det(C) = det(B). Finally, D is obtained from C by exchanging the second
and third row, so that det(D) = −det(C). The determinant of the (upper) triangular matrix D is the product of its
entries on the main diagonal: (−2) · 2 · 4.5 = −18. Therefore det(A) = −det(D) = +18.
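The row-reduction procedure illustrated above can be sketched in code; det_gauss is a hypothetical helper that eliminates below each pivot and flips the sign of the result on every row swap (partial pivoting is added for numerical stability, and was not needed in the hand computation above):

```python
def det_gauss(a):
    # Reduce to upper-triangular form with row operations; the determinant
    # is the product of the pivots, with a sign flip per row swap.
    a = [row[:] for row in a]  # work on a copy
    n = len(a)
    det = 1.0
    for k in range(n):
        # Partial pivoting: pick the row with the largest entry in column k.
        pivot = max(range(k, n), key=lambda r: abs(a[r][k]))
        if abs(a[pivot][k]) < 1e-12:
            return 0.0  # treat a vanishing pivot as a singular matrix
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            det = -det  # a row swap multiplies the determinant by -1
        for r in range(k + 1, n):
            m = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= m * a[k][c]
        det *= a[k][k]
    return det
```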
The determinant of a product of square matrices equals the product of their determinants: det(AB) = det(A) · det(B).
Thus the determinant is a multiplicative map. This property is a consequence of the characterization given above of
the determinant as the unique n-linear alternating function of the columns with value 1 on the identity matrix, since
the function Mn(K) → K that maps M ↦ det(AM) can easily be seen to be n-linear and alternating in the columns of
M, and takes the value det(A) at the identity. The formula can be generalized to (square) products of rectangular
matrices, giving the Cauchy-Binet formula, which also provides an independent proof of the multiplicative property.
The determinant det(A) of a matrix A is non-zero if and only if A is invertible or, yet another equivalent statement, if
its rank equals the size of the matrix. If so, the determinant of the inverse matrix is given by
det(A⁻¹) = 1/det(A) = det(A)⁻¹.
In particular, products and inverses of matrices with determinant one still have this property. Thus, the set of such
matrices (of fixed size n) form a group known as the special linear group. More generally, the word "special"
indicates the subgroup of another matrix group of matrices of determinant one. Examples include the special
orthogonal group (which if n is 2 or 3 consists of all rotation matrices), and the special unitary group.
The Laplace expansion expresses the determinant of A in terms of its minors: expanding along the i-th row,
det(A) = Σ_{j=1}^{n} (−1)^{i+j} a_{i,j} M_{i,j},
where the minor M_{i,j} is the determinant of the (n−1)×(n−1) matrix obtained from A by deleting the i-th row and
the j-th column. Calculating det(A) by means of that formula is referred to as expanding the determinant along a row
or column. For a 3-by-3 matrix, Laplace expansion along the second column (j = 2) sums the three products formed
from the entries of that column and their cofactors.
The eigenvalues of A are the roots of the characteristic polynomial det(A − λI) = 0, where I is the identity matrix of
the same dimension as A. Conversely, det(A) is the product of the eigenvalues of A, counted with their algebraic
multiplicities. The product of all non-zero eigenvalues is referred to as the pseudo-determinant.
A Hermitian matrix is positive definite if all its eigenvalues are positive. Sylvester's criterion asserts that this is
equivalent to the determinants of the leading principal submatrices (the upper-left k×k submatrices, for k = 1, ..., n)
all being positive.
For any square matrix A, det(exp(A)) = exp(tr(A)). Here exp(A) denotes the matrix exponential of A; the identity
holds because every eigenvalue λ of A corresponds to the eigenvalue exp(λ) of exp(A). In particular, given any
logarithm of A, that is, any matrix L satisfying exp(L) = A, the determinant of A is given by det(A) = exp(tr(L)).
Cramer's rule
For a matrix equation Ax = b with det(A) ≠ 0, the solution is given by Cramer's rule:
x_i = det(A_i) / det(A),   i = 1, 2, ..., n,
where A_i is the matrix formed by replacing the i-th column of A by the column vector b. This follows immediately by
column expansion of the determinant, i.e.
det(A_i) = det [a_1, ..., b, ..., a_n] = Σ_j x_j · det [a_1, ..., a_j, ..., a_n] = x_i · det(A),
where the vectors a_j are the columns of A. The rule is also implied by the identity A · adj(A) = adj(A) · A = det(A) · I_n.
It has recently been shown that Cramer's rule can be implemented in O(n³) time,[6] which is comparable to more
common methods of solving systems of linear equations, such as LU, QR, or singular value decomposition.
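Cramer's rule itself is straightforward to sketch, assuming a determinant helper (here a naive Leibniz-formula det, adequate only for small systems; the function names are our own):

```python
from itertools import permutations

def det(a):
    # Naive Leibniz-formula determinant (fine for small n).
    n = len(a)
    total = 0
    for sigma in permutations(range(n)):
        inv = sum(1 for i in range(n) for j in range(i + 1, n) if sigma[i] > sigma[j])
        prod = 1
        for i in range(n):
            prod *= a[i][sigma[i]]
        total += (-1) ** inv * prod
    return total

def cramer_solve(a, b):
    # x_i = det(A_i) / det(A), where A_i is A with column i replaced by b.
    d = det(a)
    if d == 0:
        raise ValueError("matrix is singular")
    n = len(a)
    x = []
    for i in range(n):
        a_i = [row[:] for row in a]
        for r in range(n):
            a_i[r][i] = b[r]
        x.append(det(a_i) / d)
    return x
```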
Block matrices
Suppose A, B, C, and D are n×n-, n×m-, m×n-, and m×m-matrices, respectively. Then
det [[A, 0], [C, D]] = det(A) · det(D) = det [[A, B], [0, D]].
This can be seen from the Leibniz formula or by induction on n. When A is invertible, employing the following
identity
[[A, B], [C, D]] = [[A, 0], [C, I]] · [[I, A⁻¹B], [0, D − C A⁻¹ B]]
leads to
det [[A, B], [C, D]] = det(A) · det(D − C A⁻¹ B).
When D is invertible, a similar identity with det(D) factored out can be derived analogously,[7] that is,
det [[A, B], [C, D]] = det(D) · det(A − B D⁻¹ C).
When the blocks are square matrices of the same order further formulas hold. For example, if C and D commute (i.e.,
CD = DC), then the following formula comparable to the determinant of a 2-by-2 matrix holds:[8]
det [[A, B], [C, D]] = det(AD − BC).
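The commuting-block identity det [[A, B], [C, D]] = det(AD − BC) can be checked numerically. In the sketch below D is taken to be the identity matrix, which commutes with any C; the 4×4 determinant is computed with a naive Leibniz helper (names are our own):

```python
from itertools import permutations

def det(a):
    # Leibniz-formula determinant; fine for the tiny matrices used here.
    n = len(a)
    total = 0
    for sigma in permutations(range(n)):
        inv = sum(1 for i in range(n) for j in range(i + 1, n) if sigma[i] > sigma[j])
        prod = 1
        for i in range(n):
            prod *= a[i][sigma[i]]
        total += (-1) ** inv * prod
    return total

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

# 2x2 blocks; D is the identity, so C and D commute and the formula applies.
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 1]]
D = [[1, 0], [0, 1]]

# Assemble the 4x4 block matrix [[A, B], [C, D]] row by row.
M = [A[0] + B[0], A[1] + B[1], C[0] + D[0], C[1] + D[1]]
AD = matmul(A, D)
BC = matmul(B, C)
lhs = det(M)
rhs = det([[AD[i][j] - BC[i][j] for j in range(2)] for i in range(2)])
```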
Derivative
By definition, e.g. using the Leibniz formula, the determinant of real (or analogously complex) square matrices
is a polynomial function from R^(n×n) to R. As such it is everywhere differentiable. Its derivative can be expressed
using Jacobi's formula:
d/dt det(A(t)) = tr( adj(A(t)) · dA(t)/dt ),
where adj(A) denotes the adjugate of A.
This identity is used in describing the tangent space of certain matrix Lie groups.
If the matrix A is written as A = [a b c], where a, b, c are column vectors, then the gradient over one of the three
vectors may be written as the cross product of the other two:
∇_a det(A) = b × c,   ∇_b det(A) = c × a,   ∇_c det(A) = a × b.
Determinant of an endomorphism
The above identities concerning the determinants of products and inverses of matrices imply that similar matrices
have the same determinant: two matrices A and B are similar if there exists an invertible matrix X such that A =
X⁻¹BX. Indeed, repeatedly applying the above identities yields
det(A) = det(X)⁻¹ · det(B) · det(X) = det(B).
The determinant is therefore also called a similarity invariant. The determinant of a linear transformation T : V → V
for some finite-dimensional vector space V is defined to be the determinant of the matrix describing it, with respect
to an arbitrary choice of basis in V. By the similarity invariance, this determinant is independent of the choice of the
basis for V and therefore only depends on the endomorphism T.
Exterior algebra
The determinant can also be characterized as the unique function
D : M_n(K) → K
from the set of all n-by-n matrices with entries in a field K to this field satisfying the following three properties: first,
D is an n-linear function: considering all but one column of A fixed, the determinant is linear in the remaining
column, that is
D(v_1, ..., v_{i−1}, a·v + b·w, v_{i+1}, ..., v_n) = a · D(v_1, ..., v_{i−1}, v, v_{i+1}, ..., v_n) + b · D(v_1, ..., v_{i−1}, w, v_{i+1}, ..., v_n)
for any column vectors v_1, ..., v_n, and w and any scalars (elements of K) a and b. Second, D is an alternating function:
for any matrix A with two identical columns, D(A) = 0. Finally, D(I_n) = 1. Here I_n is the identity matrix.
This fact also implies that every other n-linear alternating function F : M_n(K) → K satisfies
F(M) = F(I) · det(M).
The last part in fact follows from the preceding statement: one easily sees that if F is nonzero it satisfies F(I) ≠ 0,
and the function that associates F(M)/F(I) to M satisfies all conditions of the theorem. The importance of stating this part
is mainly that it remains valid[9] if K is any commutative ring rather than a field, in which case the given argument
does not apply.
The determinant of a linear transformation A : V → V of an n-dimensional vector space V can be formulated in a
coordinate-free manner by considering the n-th exterior power ΛⁿV of V. A induces a linear map
ΛⁿA : ΛⁿV → ΛⁿV,   v_1 ∧ v_2 ∧ ... ∧ v_n ↦ Av_1 ∧ Av_2 ∧ ... ∧ Av_n.
As ΛⁿV is one-dimensional, the map ΛⁿA is given by multiplying with some scalar. This scalar coincides with the
determinant of A, that is to say
(ΛⁿA)(v_1 ∧ ... ∧ v_n) = det(A) · v_1 ∧ ... ∧ v_n.
This definition agrees with the more concrete coordinate-dependent definition. This follows from the
characterization of the determinant given above. For example, switching two columns changes the parity of the
determinant; likewise, permuting the vectors in the exterior product v1 ∧ v2 ∧ ... ∧ vn to v2 ∧ v1 ∧ v3 ∧ ... ∧ vn, say,
also alters the parity.
For this reason, the highest non-zero exterior power Λn(V) is sometimes also called the determinant of V and
similarly for more involved objects such as vector bundles or chain complexes of vector spaces. Minors of a matrix
can also be cast in this setting, by considering lower alternating forms ΛkV with k < n.
The determinant can also be defined for matrices with entries in a commutative ring R, that is, a ring in which the
identity r·s = s·r is supposed to hold for all elements r and s of the ring. For example, the integers form a commutative ring.
Many of the above statements and notions carry over mutatis mutandis to determinants of these more general
matrices: the determinant is multiplicative in this more general situation, and Cramer's rule also holds. A square
matrix over a commutative ring R is invertible if and only if its determinant is a unit in R, that is, an element having a
(multiplicative) inverse. (If R is a field, this latter condition is equivalent to the determinant being nonzero, thus
giving back the above characterization.) For example, a matrix A with entries in Z, the integers, is invertible (in the
sense that the inverse matrix has again integer entries) if the determinant is +1 or −1. Such a matrix is called
unimodular.
First, the determinant defines a map det : GL_n(R) → R^× between the group of invertible n×n matrices with entries in R
and the multiplicative group of units in R. Since it respects the multiplication in both groups, this map is a group
homomorphism. Secondly, given a ring
homomorphism f: R → S, there is a map GLn(R) → GLn(S) given by replacing all entries in R by their images under
f. The determinant respects these maps, i.e., given a matrix A = (a_{i,j}) with entries in R, the identity
f(det((a_{i,j}))) = det((f(a_{i,j})))
holds. For example, the determinant of the complex conjugate of a complex matrix (which is also the determinant of
its conjugate transpose) is the complex conjugate of its determinant, and for integer matrices: the reduction
modulo m of the determinant of such a matrix is equal to the determinant of the matrix reduced modulo m (the latter
determinant being computed using modular arithmetic). In the more high-brow parlance of category theory, the
determinant is a natural transformation between the two functors GLn and (⋅)×.[10] Adding yet another layer of
abstraction, this is captured by saying that the determinant is a morphism of algebraic groups, from the general linear
group to the multiplicative group: det : GL_n → G_m.
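The compatibility with ring homomorphisms can be illustrated with reduction modulo m: taking the determinant of an integer matrix and then reducing mod m gives the same result as reducing the entries first. A small sketch (with a naive Leibniz determinant; names are our own):

```python
from itertools import permutations

def det(a):
    # Naive Leibniz-formula determinant over the integers.
    n = len(a)
    total = 0
    for sigma in permutations(range(n)):
        inv = sum(1 for i in range(n) for j in range(i + 1, n) if sigma[i] > sigma[j])
        prod = 1
        for i in range(n):
            prod *= a[i][sigma[i]]
        total += (-1) ** inv * prod
    return total

A = [[3, 7, 2], [5, 1, 8], [4, 6, 9]]
m = 5

# Reducing the entries mod m first, then taking the determinant ...
det_of_reduced = det([[x % m for x in row] for row in A]) % m
# ... gives the same result as reducing the determinant mod m.
reduced_det = det(A) % m
```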
Infinite matrices
For matrices with an infinite number of rows and columns, the above definitions of the determinant do not carry over
directly. For example, in Leibniz' formula, an infinite sum (all of whose terms are infinite products) would have to be
calculated. Functional analysis provides different extensions of the determinant for such infinite-dimensional
situations, which however only work for particular kinds of operators.
The Fredholm determinant defines the determinant for operators known as trace class operators by an appropriate
generalization of the finite-dimensional formula det(I + A) = exp(tr(log(I + A))).
Further variants
Determinants of matrices in superrings (that is, Z/2-graded rings) are known as Berezinians or superdeterminants.[11]
The permanent of a matrix is defined as the determinant, except that the factors sgn(σ) occurring in Leibniz' rule are
omitted. The immanant generalizes both by introducing a character of the symmetric group Sn in Leibniz' rule.
Calculation
Determinants are mainly used as a theoretical tool. They are rarely calculated explicitly in numerical linear algebra,
where for applications like checking invertibility and finding eigenvalues the determinant has largely been
supplanted by other techniques.[12] Nonetheless, explicitly calculating determinants is required in some situations,
and different methods are available to do so.
Naive methods of implementing an algorithm to compute the determinant include using Leibniz' formula or
Laplace's formula. Both these approaches are extremely inefficient for large matrices, though, since the number of
required operations grows very quickly: it is of order n! (n factorial) for an n×n matrix M. For example, Leibniz'
formula requires calculating n! products. Therefore, more involved techniques have been developed for calculating
determinants.
Decomposition methods
Given a matrix A, some methods compute its determinant by writing A as a product of matrices whose determinants
can be more easily computed. Such techniques are referred to as decomposition methods. Examples include the LU
decomposition, the QR decomposition or the Cholesky decomposition (for positive definite matrices). These
methods are of order O(n³), which is a significant improvement over O(n!).
The LU decomposition expresses A in terms of a lower triangular matrix L, an upper triangular matrix U and a
permutation matrix P:
P·A = L·U.
The determinants of L and U can be quickly calculated, since they are the products of the respective diagonal entries.
The determinant of P is just the sign ε of the corresponding permutation (+1 for an even number of transpositions,
−1 for an odd number). The determinant of A is then
det(A) = ε · det(L) · det(U).
Moreover, the decomposition can be chosen such that L is a unitriangular matrix and therefore has determinant 1, in
which case the formula further simplifies to
det(A) = ε · det(U).
Further methods
If the determinant of A and the inverse of A have already been computed, the matrix determinant lemma allows quick
calculation of the determinant of A + uvᵀ, where u and v are column vectors:
det(A + uvᵀ) = (1 + vᵀ A⁻¹ u) · det(A).
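The matrix determinant lemma, det(A + uvᵀ) = (1 + vᵀA⁻¹u)·det(A), is easy to verify numerically for a 2×2 example; the helper names below are our own:

```python
def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    # Closed-form inverse of a 2x2 matrix.
    d = det2(m)
    return [[ m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d,  m[0][0] / d]]

A = [[3.0, 1.0], [2.0, 4.0]]
u = [1.0, 2.0]
v = [0.5, -1.0]

# Right-hand side of the lemma: (1 + v^T A^-1 u) * det(A).
Ainv = inv2(A)
vAinvu = sum(v[i] * sum(Ainv[i][j] * u[j] for j in range(2)) for i in range(2))
lemma = (1 + vAinvu) * det2(A)

# Left-hand side: build A + u v^T explicitly and take its determinant.
B = [[A[i][j] + u[i] * v[j] for j in range(2)] for i in range(2)]
direct = det2(B)
```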
Since the definition of the determinant does not need divisions, a question arises: do fast algorithms exist that do not
need divisions? This is especially interesting for matrices over rings. Indeed, algorithms with run time proportional to
n⁴ exist. An algorithm of Mahajan and Vinay, and Berkowitz[13] is based on closed ordered walks ("clows" for short). It
computes more products than the determinant definition requires, but some of these products cancel and the sum of
these products can be computed more efficiently. The final algorithm looks very much like an iterated product of
triangular matrices.
If two matrices of order n can be multiplied in time M(n), where M(n) ≥ n^a for some a > 2, then the determinant can be
computed in time O(M(n)).[14] This means, for example, that an O(n^2.376) algorithm exists based on the
Coppersmith–Winograd algorithm.
Algorithms can also be assessed according to their bit complexity, i.e., how many bits of accuracy are needed to
store intermediate values occurring in the computation. For example, the Gaussian elimination (or LU
decomposition) method is of order O(n³), but the bit length of intermediate values can become exponentially
long.[15] The Bareiss algorithm, on the other hand, is an exact-division method based on Sylvester's identity that is
also of order n³, but with bit complexity roughly the bit size of the original entries in the matrix times n.[16]
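A minimal sketch of the Bareiss exact-division scheme for integer matrices follows; every `//` division below is exact, which is the point of the method, so integer input yields an integer determinant without fractions:

```python
def det_bareiss(a):
    # Fraction-free Gaussian elimination (Bareiss): intermediate entries
    # stay integers with polynomially bounded bit length.
    a = [row[:] for row in a]
    n = len(a)
    sign = 1
    prev = 1  # the previous pivot; divisions by it are exact
    for k in range(n - 1):
        if a[k][k] == 0:
            # Swap in a row with a nonzero pivot, if any exists.
            swap = next((r for r in range(k + 1, n) if a[r][k] != 0), None)
            if swap is None:
                return 0
            a[k], a[swap] = a[swap], a[k]
            sign = -sign
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                a[i][j] = (a[i][j] * a[k][k] - a[i][k] * a[k][j]) // prev
            a[i][k] = 0
        prev = a[k][k]
    return sign * a[n - 1][n - 1]
```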
History
Historically, determinants were considered without reference to matrices: originally, a determinant was defined as a
property of a system of linear equations. The determinant "determines" whether the system has a unique solution
(which occurs precisely if the determinant is non-zero). In this sense, determinants were first used in the Chinese
mathematics textbook The Nine Chapters on the Mathematical Art (九章算術), compiled by Chinese scholars around
the 3rd century BC. In Europe, two-by-two determinants were considered by Cardano at the end of the 16th century and
larger ones by Leibniz.[17][18][19][20]
In Europe, Cramer (1750) added to the theory, treating the subject in relation to sets of equations. The recurrence law
was first announced by Bézout (1764).
It was Vandermonde (1771) who first recognized determinants as independent functions.[17] Laplace (1772) [21][22]
gave the general method of expanding a determinant in terms of its complementary minors: Vandermonde had
already given a special case. Immediately following, Lagrange (1773) treated determinants of the second and third
order. Lagrange was the first to apply determinants to questions of elimination theory; he proved many special cases
of general identities.
Gauss (1801) made the next advance. Like Lagrange, he made much use of determinants in the theory of numbers.
He introduced the word determinants (Laplace had used resultant), though not in the present signification, but rather
as applied to the discriminant of a quantic. Gauss also arrived at the notion of reciprocal (inverse) determinants, and
came very near the multiplication theorem.
The next contributor of importance is Binet (1811, 1812), who formally stated the theorem relating to the product of
two matrices of m columns and n rows, which for the special case of m = n reduces to the multiplication theorem. On
the same day (November 30, 1812) that Binet presented his paper to the Academy, Cauchy also presented one on the
subject. (See Cauchy-Binet formula.) In this he used the word determinant in its present sense,[23][24] summarized
and simplified what was then known on the subject, improved the notation, and gave the multiplication theorem with
a proof more satisfactory than Binet's.[17][25] With him begins the theory in its generality.
The next important figure was Jacobi[18] (from 1827). He early used the functional determinant which Sylvester later
called the Jacobian, and in his memoirs in Crelle for 1841 he specially treats this subject, as well as the class of
alternating functions which Sylvester has called alternants. About the time of Jacobi's last memoirs, Sylvester (1839)
and Cayley began their work.[26][27]
The study of special forms of determinants has been the natural result of the completion of the general theory.
Axisymmetric determinants have been studied by Lebesgue, Hesse, and Sylvester; persymmetric determinants by
Sylvester and Hankel; circulants by Catalan, Spottiswoode, Glaisher, and Scott; skew determinants and Pfaffians, in
connection with the theory of orthogonal transformation, by Cayley; continuants by Sylvester; Wronskians (so called
by Muir) by Christoffel and Frobenius; compound determinants by Sylvester, Reiss, and Picquet; Jacobians and
Hessians by Sylvester; and symmetric gauche determinants by Trudi. Of the text-books on the subject
Spottiswoode's was the first. In America, Hanus (1886), Weld (1893), and Muir/Metzler (1933) published treatises.
Applications
Linear independence
As mentioned above, the determinant of a matrix (with real or complex entries, say) is zero if and only if the column
vectors of the matrix are linearly dependent. Thus, determinants can be used to characterize linearly dependent
vectors. For example, given two vectors v1, v2 in R3, a third vector v3 lies in the plane spanned by the former two
vectors exactly if the determinant of the 3-by-3 matrix consisting of the three vectors is zero. The same idea is also
used in the theory of differential equations: given n functions f1(x), ..., fn(x) (supposed to be n−1 times
differentiable), the Wronskian is defined to be the determinant of the n×n matrix whose (i, j) entry is the (i−1)-st
derivative of fj at x.
It is non-zero (for some x) in a specified interval if and only if the given functions and all their derivatives up to
order n−1 are linearly independent. If it can be shown that the Wronskian is zero everywhere on an interval then, in
the case of analytic functions, this implies the given functions are linearly dependent. See the Wronskian and linear
independence.
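For two functions the Wronskian reduces to a 2×2 determinant, f·g′ − g·f′. A small sketch with the derivatives supplied analytically (the function name is illustrative):

```python
import math

def wronskian_2(f, df, g, dg, x):
    # W(f, g)(x) = f(x) g'(x) - g(x) f'(x), a 2x2 determinant.
    return f(x) * dg(x) - g(x) * df(x)

# sin and cos are linearly independent; their Wronskian is identically -1
# since sin * (-sin) - cos * cos = -(sin^2 + cos^2) = -1.
w = wronskian_2(math.sin, math.cos, math.cos, lambda x: -math.sin(x), 0.7)
```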
Orientation of a basis
The determinant can be thought of as assigning a number to every sequence of n vectors in Rn, by using the square matrix
whose columns are the given vectors. For instance, an orthogonal matrix with entries in Rn represents an
orthonormal basis in Euclidean space. The determinant of such a matrix determines whether the orientation of the
basis is consistent with or opposite to the orientation of the standard basis. Namely, if the determinant is +1, the basis
has the same orientation. If it is −1, the basis has the opposite orientation.
More generally, if the determinant of A is positive, A represents an orientation-preserving linear transformation (if A
is an orthogonal 2×2 or 3×3 matrix, this is a rotation), while if it is negative, A switches the orientation of the basis.
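In two dimensions this sign test is a standard orientation predicate; a minimal sketch (the function name is our own):

```python
def orientation(u, v):
    # Sign of det [[u_x, v_x], [u_y, v_y]]: +1 if (u, v) has the same
    # orientation as the standard basis (counterclockwise turn from u to v),
    # -1 if opposite, 0 if u and v are collinear.
    d = u[0] * v[1] - u[1] * v[0]
    return (d > 0) - (d < 0)
```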
By calculating the volume of the tetrahedron bounded by four points, they can be used to identify skew lines. The
volume of any tetrahedron, given its vertices a, b, c, and d, is (1/6)·|det(a − b, b − c, c − d)|, or any other
combination of pairs of vertices that would form a spanning tree over the vertices.
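The tetrahedron-volume formula above, (1/6)·|det(a − b, b − c, c − d)|, can be sketched directly (helper names are our own):

```python
def det3(m):
    # Expansion of a 3x3 determinant along the first row.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def tetra_volume(a, b, c, d):
    # |det(a - b, b - c, c - d)| / 6, with the difference vectors as rows.
    rows = [
        [a[i] - b[i] for i in range(3)],
        [b[i] - c[i] for i in range(3)],
        [c[i] - d[i] for i in range(3)],
    ]
    return abs(det3(rows)) / 6
```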
For a general differentiable function, much of the above carries over by considering the Jacobian matrix of f. For
f : Rn → Rn, the Jacobian is the n×n matrix whose (i, j) entry is the partial derivative ∂fi/∂xj.
Its determinant, the Jacobian determinant, appears in the higher-dimensional version of integration by substitution:
for suitable functions f and an open subset U of Rn (the domain of f), the integral over f(U) of some other function
φ : Rn → R is given by
∫_{f(U)} φ(v) dv = ∫_U φ(f(u)) |det(Df)(u)| du.
Vandermonde determinant (alternant)
The Vandermonde determinant of the numbers x1, x2, ..., xn is
∏_{1 ≤ i < j ≤ n} (xj − xi),
where the right-hand side is the continued product of all the differences that can be formed from the n(n−1)/2 pairs of
numbers taken from x1, x2, ..., xn, with the order of the differences taken in the reversed order of the suffixes that are
involved.
Circulants
Second order
det [[x, y], [y, x]] = (x + y)(x − y)
Third order
det [[x, y, z], [z, x, y], [y, z, x]] = (x + y + z)(x + ωy + ω²z)(x + ω²y + ωz),
where ω and ω² are the complex cube roots of 1. In general, the nth-order circulant determinant is[28]
∏_{j=1}^{n} (x1 + x2 ωj + x3 ωj² + ⋯ + xn ωj^{n−1}),
where ωj is an nth root of unity.
Notes
[1] Poole, David (2006), Linear Algebra: A Modern Introduction, Thomson Brooks/Cole, p. 262, ISBN 0-534-99845-3
[2] Serge Lang, Linear Algebra, 2nd Edition, Addison-Wesley, 1971, pp 173, 191.
[3] Ramazi, P., Shoeiby, B. and Abbasian, T. (2012), "The extension of Sarrus' rule for finding the determinant of a 4×4 matrix", The American
Mathematical Monthly.
[4] In a non-commutative setting left-linearity (compatibility with left-multiplication by scalars) should be distinguished from right-linearity.
Assuming linearity in the columns is taken to be left-linearity, one would have, for non-commuting scalars a, b:
References
• Axler, Sheldon Jay (1997), Linear Algebra Done Right (2nd ed.), Springer-Verlag, ISBN 0-387-98259-0
• de Boor, Carl (1990), "An empty exercise" (http://ftp.cs.wisc.edu/Approx/empty.pdf), ACM SIGNUM
Newsletter 25 (2): 3–7, doi:10.1145/122272.122273.
• Lay, David C. (August 22, 2005), Linear Algebra and Its Applications (3rd ed.), Addison Wesley,
ISBN 978-0-321-28713-7
• Meyer, Carl D. (February 15, 2001), Matrix Analysis and Applied Linear Algebra (http://www.matrixanalysis.
com/DownloadChapters.html), Society for Industrial and Applied Mathematics (SIAM),
ISBN 978-0-89871-454-8
• Poole, David (2006), Linear Algebra: A Modern Introduction (2nd ed.), Brooks/Cole, ISBN 0-534-99845-3
• Anton, Howard (2005), Elementary Linear Algebra (Applications Version) (9th ed.), Wiley International
• Leon, Steven J. (2006), Linear Algebra With Applications (7th ed.), Pearson Prentice Hall
External links
• Hazewinkel, Michiel, ed. (2001), "Determinant" (http://www.encyclopediaofmath.org/index.
php?title=Determinant&oldid=12692), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
• Weisstein, Eric W., " Determinant (http://mathworld.wolfram.com/Determinant.html)" from MathWorld.
• O'Connor, John J.; Robertson, Edmund F., "Matrices and determinants" (http://www-history.mcs.st-andrews.
ac.uk/HistTopics/Matrices_and_determinants.html), MacTutor History of Mathematics archive, University of
St Andrews.
• WebApp to calculate determinants and descriptively solve systems of linear equations (http://sole.ooz.ie/en)
• Determinant Interactive Program and Tutorial (http://people.revoledu.com/kardi/tutorial/LinearAlgebra/
MatrixDeterminant.html)
• Online Matrix Calculator (http://matri-tri-ca.narod.ru/en.index.html)
• Linear algebra: determinants. (http://www.umat.feec.vutbr.cz/~novakm/determinanty/en/) Compute
determinants of matrices up to order 6 using Laplace expansion you choose.
• Matrices and Linear Algebra on the Earliest Uses Pages (http://www.economics.soton.ac.uk/staff/aldrich/
matrices.htm)
• Determinants explained in an easy fashion in the 4th chapter as a part of a Linear Algebra course. (http://algebra.
math.ust.hk/course/content.shtml)
• Instructional Video on taking the determinant of an nxn matrix (Khan Academy) (http://khanexercises.appspot.
com/video?v=H9BWRYJNIv4)
• Online matrix calculator (determinant, track, inverse, adjoint, transpose) (http://www.stud.feec.vutbr.cz/
~xvapen02/vypocty/matreg.php?language=english) Compute determinant of matrix up to order 8
Dirichlet distribution 161
Dirichlet distribution
In probability and statistics, the Dirichlet distribution (after Johann Peter Gustav Lejeune Dirichlet), often denoted
Dir(α), is a family of continuous multivariate probability distributions parametrized by a vector α of positive
reals. It is the multivariate generalization of the beta distribution.[1] Dirichlet distributions are very often used as
prior distributions in Bayesian statistics, and in fact the Dirichlet distribution is the conjugate prior of the categorical
distribution and multinomial distribution. That is, its probability density function returns the belief that the
probabilities of K rival events are xi given that each event has been observed αi − 1 times.
The infinite-dimensional generalization of the Dirichlet distribution is the Dirichlet process.
Probability density function
The Dirichlet distribution of order K ≥ 2 with parameters α1, ..., αK > 0 has a probability density function given by
f(x1, ..., xK−1; α1, ..., αK) = (1 / B(α)) ∏_{i=1}^{K} xi^{αi − 1}
for all x1, ..., xK−1 > 0 satisfying x1 + ... + xK−1 < 1, and where xK = 1 − x1 − ... − xK−1. The density is zero outside this
open (K − 1)-dimensional simplex.
The normalizing constant is the multinomial beta function, which can be expressed in terms of the gamma function:
B(α) = (∏_{i=1}^{K} Γ(αi)) / Γ(∑_{i=1}^{K} αi),   α = (α1, ..., αK).
Support
The support of the Dirichlet distribution is the set of K-dimensional vectors x whose entries are real numbers in
the interval (0, 1); furthermore, the sum of the coordinates is 1. These can be viewed as the
probabilities of a K-way categorical event. Another way to express this is that the domain of the Dirichlet
distribution is itself a probability distribution, specifically a K-dimensional discrete distribution. Note that the
technical term for the set of points in the support of a K-dimensional Dirichlet distribution is the open standard
(K − 1)-simplex, which is a generalization of a triangle, embedded in the next-higher dimension. For example, with
K = 3, the support looks like an equilateral triangle embedded in a downward-angle fashion in three-dimensional
space, with vertices at (1, 0, 0), (0, 1, 0) and (0, 0, 1), i.e. touching each of the coordinate axes at a point 1 unit
away from the origin.
Special cases
A very common special case is the symmetric Dirichlet distribution, where all of the elements making up the
parameter vector α have the same value. Symmetric Dirichlet distributions are often used when a Dirichlet prior is
called for, since there typically is no prior knowledge favoring one component over another. Since all elements of
the parameter vector have the same value, the distribution can alternatively be parametrized by a single scalar value
α, called the concentration parameter. The density function then simplifies to

f(x1, ..., xK; α) = (Γ(Kα) / Γ(α)^K) ∏i xi^(α − 1).

When α = 1, the symmetric Dirichlet distribution is equivalent to a uniform
distribution over the open standard (K − 1)-simplex, i.e. it is uniform over all points in its support. Values of the
concentration parameter above 1 favor variates that are dense, evenly distributed distributions, i.e. all probabilities
returned are similar to each other. Values of the concentration parameter below 1 favor sparse distributions, i.e.
most of the probabilities returned will be close to 0, and the vast majority of the mass will be concentrated in a few
of the probabilities.
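A quick simulation illustrates this. The sketch below uses only the Python standard library and the gamma-based sampling recipe described later in this article; the helper name is ours:

```python
import random

def sample_symmetric_dirichlet(alpha, K, rng=random):
    """One draw from a symmetric Dirichlet: normalize K Gamma(alpha, 1) draws."""
    ys = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    total = sum(ys)
    return [y / total for y in ys]

random.seed(0)
K = 5
dense = sample_symmetric_dirichlet(10.0, K)   # concentration well above 1
sparse = sample_symmetric_dirichlet(0.1, K)   # concentration well below 1

# alpha >> 1: components cluster near 1/K; alpha << 1: most of the mass
# piles onto one or two components while the rest sit near zero.
print(sorted(dense, reverse=True))
print(sorted(sparse, reverse=True))
```

Running this repeatedly shows the qualitative contrast: the α = 10 draws stay close to the uniform vector (0.2, ..., 0.2), while the α = 0.1 draws are typically dominated by a single large component.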
More generally, the parameter vector is sometimes written as the product α = a·n of a (scalar) concentration parameter
a and a (vector) base measure n = (n1, ..., nK), where n lies within the (K − 1)-simplex (i.e.: its coordinates sum
to one). The concentration parameter in this case is larger by a factor of K than the concentration parameter for a
symmetric Dirichlet distribution described above. This construction ties in with the concept of a base measure when
discussing Dirichlet processes and is often used in the topic modelling literature.
If we define the concentration parameter as the sum of the Dirichlet parameters for each dimension, then the
Dirichlet distribution is uniform when the concentration parameter is K, the dimension of the distribution.
Properties
Moments
Let X = (X1, ..., XK) ~ Dir(α), meaning that the first K – 1 components have the above density and
XK = 1 − X1 − ... − XK−1. Define α0 = Σi αi. Then[2][3]

E[Xi] = αi / α0,   Var[Xi] = αi (α0 − αi) / (α0² (α0 + 1)).

Furthermore, if i ≠ j,

Cov[Xi, Xj] = −αi αj / (α0² (α0 + 1)).
Mode
The mode of the distribution is the vector (x1, ..., xK) with

xi = (αi − 1) / (α0 − K),   for αi > 1.
Marginal distributions
The marginal distributions are beta distributions:[4]

Xi ~ Beta(αi, α0 − αi).
Conjugate to categorical/multinomial
The Dirichlet distribution is the conjugate prior distribution of the categorical distribution (a generic discrete
probability distribution with a given number of possible outcomes) and multinomial distribution (the distribution
over observed counts of each possible category in a set of categorically distributed observations). This means that if
a data point has either a categorical or multinomial distribution, and the prior distribution of the data point's
parameter (the vector of probabilities that generates the data point) is distributed as a Dirichlet, then the posterior
distribution of the parameter is also a Dirichlet. Intuitively, in such a case, starting from what we know about the
parameter prior to observing the data point, we then can update our knowledge based on the data point and end up
with a new distribution of the same form as the old one. This means that we can successively update our knowledge
of a parameter by incorporating new observations one at a time, without running into mathematical difficulties.
Formally, this can be expressed as follows. Given a model

p | α ~ Dir(α),   X1, ..., XN | p ~ Cat(p),

with observed category counts c = (c1, ..., cK), the posterior distribution is

p | X, α ~ Dir(α + c) = Dir(α1 + c1, ..., αK + cK).
This relationship is used in Bayesian statistics to estimate the underlying parameter p of a categorical distribution
given a collection of N samples. Intuitively, we can view the hyperprior vector α as pseudocounts, i.e. as
representing the number of observations in each category that we have already seen. Then we simply add in the
counts for all the new observations (the vector c) in order to derive the posterior distribution.
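The pseudocount update is mechanical: add the observed per-category counts to the prior concentration vector. A minimal sketch (function and variable names are ours):

```python
from collections import Counter

def dirichlet_posterior(prior_alphas, observations, categories):
    """Posterior Dir(alpha + c): add per-category counts to prior pseudocounts."""
    counts = Counter(observations)
    return [a + counts.get(cat, 0) for a, cat in zip(prior_alphas, categories)]

categories = ["a", "b", "c"]
prior = [1.0, 1.0, 1.0]                    # uniform prior over the simplex
data = ["a", "a", "b", "a", "c", "a"]      # counts: a=4, b=1, c=1

posterior = dirichlet_posterior(prior, data, categories)
print(posterior)                           # -> [5.0, 2.0, 2.0]
# The posterior mean probability of "a" is 5 / 9.
```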
In Bayesian mixture models and other hierarchical Bayesian models with mixture components, Dirichlet
distributions are commonly used as the prior distributions for the categorical variables appearing in the models. See
the section on applications below for more information.
Entropy
If X is a Dir(α) random variable, then the exponential family differential identities can be used to get an analytic
expression for the expectation of ln Xi and its associated covariance matrix:

E[ln Xi] = ψ(αi) − ψ(α0)

and

Cov[ln Xi, ln Xj] = ψ′(αi) δij − ψ′(α0),

where ψ is the digamma function, ψ′ is the trigamma function, and δij is the Kronecker delta. The formula for
E[ln Xi] yields the following formula for the information entropy of X:

H(X) = ln B(α) + (α0 − K) ψ(α0) − Σj (αj − 1) ψ(αj).
Aggregation
If X = (X1, ..., XK) ~ Dir(α1, ..., αK) then, if the random variables with subscripts i and j are dropped
from the vector and replaced by their sum,

X′ = (X1, ..., Xi + Xj, ..., XK) ~ Dir(α1, ..., αi + αj, ..., αK).

This aggregation property may be used to derive the marginal distribution of Xi mentioned above.
Neutrality
If X = (X1, ..., XK) ~ Dir(α), then the vector X is said to be neutral[5] in the sense that X1 is
independent of X(−1),[6] where

X(−1) = (X2 / (1 − X1), X3 / (1 − X1), ..., XK / (1 − X1)),

and similarly for removing any of X2, ..., XK−1. Observe that any permutation of X is also neutral (a property
not possessed by samples drawn from a generalized Dirichlet distribution).[7]
Related distributions
If, for i ∈ {1, 2, ..., K},

Yi ~ Gamma(shape = αi, scale = 1) independently,

then[8]

V = Σi Yi ~ Gamma(shape = Σi αi, scale = 1)

and

X = (X1, ..., XK) = (Y1/V, ..., YK/V) ~ Dir(α1, ..., αK).

Although the Xi are not independent from one another, they can be seen to be generated from a set of K
independent gamma random variables (see [9] for a proof). Unfortunately, since the sum V is lost in forming X, it is
not possible to recover the original gamma random variables from these values alone. Nevertheless, because
independent random variables are simpler to work with, this reparametrization can still be useful for proofs about
properties of the Dirichlet distribution.
Applications
Dirichlet distributions are most commonly used as the prior distribution of categorical variables or multinomial
variables in Bayesian mixture models and other hierarchical Bayesian models. (Note that in many fields, such as in
natural language processing, categorical variables are often imprecisely called "multinomial variables". Such a usage
is liable to cause confusion, just as if Bernoulli distributions and binomial distributions were commonly conflated.)
Inference over hierarchical Bayesian models is often done using Gibbs sampling, and in such a case, instances of the
Dirichlet distribution are typically marginalized out of the model by integrating out the Dirichlet random variable.
This causes the various categorical variables drawn from the same Dirichlet random variable to become correlated,
and the joint distribution over them assumes a Dirichlet-multinomial distribution, conditioned on the
hyperparameters of the Dirichlet distribution (the concentration parameters). One of the reasons for doing this is that
Gibbs sampling of the Dirichlet-multinomial distribution is extremely easy; see that article for more information.
Gamma distribution
A fast method to sample a random vector x = (x1, ..., xK) from the K-dimensional Dirichlet distribution with
parameters (α1, ..., αK) follows immediately from this connection. First, draw K independent random samples
y1, ..., yK from gamma distributions, each with density

yi^(αi − 1) e^(−yi) / Γ(αi),

and finally set

xi = yi / (y1 + ... + yK).
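In code, the recipe is a few lines; this sketch uses only the Python standard library (the function name is ours):

```python
import random

def dirichlet(alphas, rng=random):
    """Sample Dir(alphas): draw y_i ~ Gamma(alpha_i, 1), then normalize."""
    ys = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(ys)
    return [y / total for y in ys]

random.seed(1)
x = dirichlet([2.0, 3.0, 5.0])
print(x)   # a point on the open 2-simplex: positive entries summing to 1
```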
String cutting
One example use of the Dirichlet distribution is if one wanted to cut strings (each of initial length 1.0) into K pieces
with different lengths, where each piece had a designated average length, but allowing some variation in the relative
sizes of the pieces. The αi/α0 values specify the mean lengths of the cut pieces of string resulting from the
distribution. The variance around this mean varies inversely with α0.
Pólya's urn
Consider an urn containing balls of K different colors. Initially, the urn contains α1 balls of color 1, α2 balls of color
2, and so on. Now perform N draws from the urn, where after each draw, the ball is placed back into the urn with an
additional ball of the same color. In the limit as N approaches infinity, the proportions of different colored balls in
the urn will be distributed as Dir(α1,...,αK).[11]
For a formal proof, note that the proportions of the different colored balls form a bounded [0,1]K-valued martingale,
hence by the martingale convergence theorem, these proportions converge almost surely and in mean to a limiting
random vector. To see that this limiting vector has the above Dirichlet distribution, check that all mixed moments
agree.
Note that each draw from the urn modifies the probability of drawing a ball of any one color from the urn in the
future. This modification diminishes with the number of draws, since the relative effect of adding a new ball to the
urn diminishes as the urn accumulates increasing numbers of balls. This "diminishing returns" effect can also help
explain how small α values yield Dirichlet distributions with most of the probability mass concentrated around a
single point on the simplex.
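The urn scheme is easy to simulate. The sketch below (helper name ours) reinforces each drawn color with one extra ball of the same color and reports the final proportions, which for many draws approximate a single realization from Dir(α1, ..., αK):

```python
import random

def polya_urn(alphas, draws, rng=random):
    """Simulate a Polya urn: each draw returns the drawn ball plus one more
    of the same color; report final color proportions."""
    counts = list(alphas)
    for _ in range(draws):
        r = rng.random() * sum(counts)
        acc = 0.0
        for i, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]

random.seed(7)
props = polya_urn([1.0, 1.0, 2.0], draws=5000)
print(props)   # one (random) limiting proportion vector
```

Re-running with different seeds produces scattered proportion vectors whose ensemble distribution is approximately Dir(1, 1, 2).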
References
[1] S. Kotz, N. Balakrishnan, and N. L. Johnson (2000). Continuous Multivariate Distributions. Volume 1: Models and Applications. New York:
Wiley. ISBN 0-471-18387-3. (Chapter 49: Dirichlet and Inverted Dirichlet Distributions)
[2] Eq. (49.9) on page 488 of Kotz, Balakrishnan & Johnson (2000). Continuous Multivariate Distributions. Volume 1: Models and Applications.
New York: Wiley. (http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471183873.html)
[3] Balakrishnan, N.; Nevzorov, V. B. (2005). "Chapter 27. Dirichlet Distribution". A Primer on Statistical Distributions. Hoboken, NJ: John Wiley & Sons.
p. 274. ISBN 978-0-471-42798-8.
[4] Ferguson, Thomas S. (1973). "A Bayesian analysis of some nonparametric problems". The Annals of Statistics 1 (2): 209–230.
doi:10.1214/aos/1176342360.
[5] Connor, Robert J.; Mosimann, James E (1969). "Concepts of Independence for Proportions with a Generalization of the Dirichlet
Distribution". Journal of the American statistical association (American Statistical Association) 64 (325): 194–206. doi:10.2307/2283728.
JSTOR 2283728.
[6] Bela A. Frigyik, Amol Kapila, and Maya R. Gupta (2010). "Introduction to the Dirichlet Distribution and Related Processes"
(http://ee.washington.edu/research/guptalab/publications/UWEETR-2010-0006.pdf) (Technical Report UWEETR-2010-0006). University of
Washington Department of Electrical Engineering. Retrieved May 2012.
[7] See Kotz, Balakrishnan & Johnson (2000), Section 8.5, "Connor and Mosimann's Generalization", pp. 519–521.
[8] Devroye, Luc (1986). Non-Uniform Random Variate Generation (http://luc.devroye.org/chapter_nine.pdf). p. 402.
[9] Devroye, Luc (1986). Non-Uniform Random Variate Generation (http://luc.devroye.org/rnbookindex.html). p. 594. (Chapter 11.)
[10] A. Gelman and J. B. Carlin and H. S. Stern and D. B. Rubin (2003). Bayesian Data Analysis (2nd ed.). pp. 582. ISBN 1-58488-388-X.
[11] Blackwell, David; MacQueen, James B. (1973). "Ferguson distributions via Polya urn schemes". Ann. Stat. 1 (2): 353–355.
doi:10.1214/aos/1176342372.
External links
• Dirichlet Distribution (http://www.cis.hut.fi/ahonkela/dippa/node95.html)
• How to estimate the parameters of the Dirichlet distribution using expectation-maximization (EM) (http://www.
ee.washington.edu/research/guptalab/publications/EMbookChenGupta2010.pdf)
• Luc Devroye. "Non-Uniform Random Variate Generation" (http://luc.devroye.org/rnbookindex.html).
Retrieved May 2012.
• Dirichlet Random Measures, Method of Construction via Compound Poisson Random Variables, and
Exchangeability Properties of the resulting Gamma Distribution (http://www.cs.princeton.edu/courses/
archive/fall07/cos597C/scribe/20071130.pdf)
Effect size 169
Effect size
In statistics, an effect size is a measure of the strength of the relationship between two variables in a statistical
population, or a sample-based estimate of that quantity. An effect size calculated from data is a descriptive statistic
that conveys the estimated magnitude of a relationship without making any statement about whether the apparent
relationship in the data reflects a true relationship in the population. In that way, effect sizes complement inferential
statistics such as p-values. Among other uses, effect size measures play an important role in meta-analysis studies
that summarize findings from a specific area of research, and in statistical power analyses.
The concept of effect size appears already in everyday language. For example, a weight loss program may boast that
it leads to an average weight loss of 30 pounds. In this case, 30 pounds is an indicator of the claimed effect size.
Another example is that a tutoring program may claim that it raises school performance by one letter grade. This
grade increase is the claimed effect size of the program. These are both examples of "absolute effect sizes", meaning
that they convey the average difference between two groups without any discussion of the variability within the
groups. For example, if the weight loss program results in an average loss of 30 pounds, it is possible that every
participant loses exactly 30 pounds, or half the participants lose 60 pounds and half lose no weight at all.
Reporting effect sizes is considered good practice when presenting empirical research findings in many fields.[1][2]
The reporting of effect sizes facilitates the interpretation of the substantive, as opposed to the statistical, significance
of a research result.[3] Effect sizes are particularly prominent in social and medical research. Relative and absolute
measures of effect size convey different information, and can be used complementarily. A prominent task force in
the psychology research community expressed the following recommendation:
Always present effect sizes for primary outcomes...If the units of measurement are meaningful on a practical
level (e.g., number of cigarettes smoked per day), then we usually prefer an unstandardized measure
(regression coefficient or mean difference) to a standardized measure (r or d).
— L. Wilkinson and APA Task Force on Statistical Inference (1999, p. 599)
Overview
Types
Pearson r (correlation)
Pearson's correlation, often denoted r and introduced by Karl Pearson, is widely used as an effect size when paired
quantitative data are available; for instance if one were studying the relationship between birth weight and longevity.
The correlation coefficient can also be used when the data are binary. Pearson's r can vary in magnitude from −1 to
1, with −1 indicating a perfect negative linear relation, 1 indicating a perfect positive linear relation, and 0 indicating
no linear relation between two variables. Cohen gives the following guidelines for the social sciences:[6][7]
Effect size r
Small 0.10
Medium 0.30
Large 0.50
Coefficient of determination
A related effect size is r², the coefficient of determination (also referred to as "r-squared"), calculated as the square of
the Pearson correlation r. In the case of paired data, this is a measure of the proportion of variance shared by the two
variables, and varies from 0 to 1. For example, with an r of 0.21 the coefficient of determination is 0.0441, meaning
that 4.4% of the variance of either variable is shared with the other variable. The r² is always non-negative, so it does not
convey the direction of the correlation between the two variables.
Cohen's ƒ2
Cohen's ƒ2 is one of several effect size measures to use in the context of an F-test for ANOVA or multiple regression.
Note that it estimates the effect size for the sample rather than the population, and is biased (it overestimates the
effect size for the ANOVA).
The ƒ2 effect size measure for multiple regression is defined as:

ƒ2 = (R2AB − R2A) / (1 − R2AB)

where R2A is the variance accounted for by a set of one or more independent variables A, and R2AB is the
combined variance accounted for by A and another set of one or more independent variables B.
By convention, ƒ2 effect sizes of 0.02, 0.15, and 0.35 are termed small, medium, and large, respectively.[6]
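As a worked check, taking the regression effect size as ƒ2 = (R2AB − R2A) / (1 − R2AB) (the R² values below are made-up illustrations):

```python
def cohens_f2(r2_ab, r2_a=0.0):
    """Cohen's f^2 = (R2_AB - R2_A) / (1 - R2_AB) for hierarchical regression."""
    return (r2_ab - r2_a) / (1.0 - r2_ab)

# Effect size of a full model explaining 30% of the variance.
print(round(cohens_f2(0.30), 3))        # 0.3 / 0.7, a "large" effect by convention
# Incremental effect of a set B over a set A (R2 rises from 0.20 to 0.30).
print(round(cohens_f2(0.30, 0.20), 3))  # 0.1 / 0.7, roughly "medium"
```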
Cohen's ƒ2 can also be found for factorial analysis of variance (ANOVA, aka the F-test) by working backwards from
the F statistic. In a balanced design (equivalent sample sizes across groups) of ANOVA, the corresponding population parameter
is

ƒ2 = (Σj (μj − μ)² / K) / σ²,

wherein μj denotes the population mean within the jth group of the total K groups, μ the grand mean, and σ the common population
standard deviation within the groups; SS denotes the sum of squares for the manipulation in ANOVA. An unbiased estimator
for ANOVA would be based on omega squared, which estimates for the population.
ω2
A less biased estimator of the variance explained in the population is omega-squared:[8][9][10]

ω² = (SStreatment − dftreatment · MSerror) / (SStotal + MSerror).

This form of the formula is limited to between-subjects analysis with equal sample sizes in all cells.[10] Since it is
less biased, ω² is preferable to Cohen's ƒ2; however, it can be more inconvenient to calculate for complex analyses. A
generalized form of the estimator has been published for between-subjects and within-subjects analysis, repeated
measures, mixed design, and randomized block design experiments.[11] In addition, methods to calculate partial
ω² for individual factors and combined factors in designs with up to three independent variables have been
published.[11]
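Given the standard ANOVA table quantities, the between-subjects estimator ω² = (SSt − dft·MSe) / (SStot + MSe) is a one-liner (the argument names and the table values below are illustrative):

```python
def omega_squared(ss_treatment, df_treatment, ss_total, ms_error):
    """Between-subjects omega-squared from standard ANOVA table quantities."""
    return (ss_treatment - df_treatment * ms_error) / (ss_total + ms_error)

# Made-up ANOVA table for 3 groups (so df_treatment = 2).
w2 = omega_squared(ss_treatment=100.0, df_treatment=2, ss_total=400.0, ms_error=10.0)
print(w2)   # (100 - 2*10) / (400 + 10) = 80 / 410
```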
Effect sizes based on means

A population effect size θ based on means usually considers the standardized mean difference,

θ = (μ1 − μ2) / σ,

where μ1 is the mean for one population, μ2 is the mean for the other population, and σ is a standard deviation based
on either or both populations.
In the practical setting the population values are typically not known and must be estimated from sample statistics.
The several versions of effect sizes based on means differ with respect to which statistics are used.
This form for the effect size resembles the computation for a t-test statistic, with the critical difference that the t-test
statistic includes a factor of √n. This means that for a given effect size, the significance level increases with the
sample size. Unlike the t-test statistic, the effect size aims to estimate a population parameter, so it is not affected by
the sample size.
Cohen's d
Cohen's d is defined as the difference between two means divided by a standard deviation for the data,

d = (x̄1 − x̄2) / s.
Cohen's d is frequently used in estimating sample sizes. A lower Cohen's d indicates a necessity of larger sample
sizes, and vice versa, as can subsequently be determined together with the additional parameters of desired
significance level and statistical power.[13]
What precisely the standard deviation s is was not originally made explicit by Jacob Cohen because he defined it
(using the symbol "σ") as "the standard deviation of either population (since they are assumed equal)".[6]:20 Other
authors make the computation of the standard deviation more explicit with the following definition for a pooled
standard deviation with two independent samples:[14]:14

s = √(((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2)),

where s1² and s2² are the variances of the two samples.
This definition of "Cohen's d" is termed the maximum likelihood estimator by Hedges and Olkin,[12] and it is related
to Hedges' g (see below) by a scaling[12]:82
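The two pooling conventions are easy to compare directly. A sketch (sample data and function name are illustrative) computing the standardized mean difference with the maximum-likelihood pooling (denominator n1 + n2) and the Hedges-style pooling (denominator n1 + n2 − 2):

```python
import math

def standardized_diff(xs, ys, df_correction=0):
    """(mean1 - mean2) / pooled SD. df_correction=0 gives the ML pooling;
    df_correction=2 gives the n1 + n2 - 2 pooling used for Hedges' g."""
    n1, n2 = len(xs), len(ys)
    m1, m2 = sum(xs) / n1, sum(ys) / n2
    ss = sum((x - m1) ** 2 for x in xs) + sum((y - m2) ** 2 for y in ys)
    s = math.sqrt(ss / (n1 + n2 - df_correction))
    return (m1 - m2) / s

treatment = [5.1, 4.9, 6.0, 5.5, 5.3]   # made-up samples
control = [4.2, 4.8, 4.5, 4.0, 4.6]

d = standardized_diff(treatment, control)                    # "Cohen's d" (ML)
g = standardized_diff(treatment, control, df_correction=2)   # Hedges-style
print(d, g)   # d exceeds g, since the ML pooled SD is slightly smaller
```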
Glass's Δ
In 1976 Gene V. Glass proposed an estimator of the effect size that uses only the standard deviation of the second
group[12]:78
The second group may be regarded as a control group, and Glass argued that if several treatments were compared to
the control group it would be better to use just the standard deviation computed from the control group, so that effect
sizes would not differ under equal means and different variances.
Under an assumption of equal population variances a pooled estimate for σ is more precise.
Hedges' g
Hedges' g, suggested by Larry Hedges in 1981,[15] is like the other measures based on a standardized difference,[12]:79

g = (x̄1 − x̄2) / s*,

but its pooled standard deviation s* is computed slightly differently from Cohen's d:

s* = √(((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)).

As an estimator for the population effect size θ it is biased. However, this bias can be approximately corrected through
multiplication by a factor:

g* = J(n1 + n2 − 2) · g ≈ (1 − 3 / (4(n1 + n2) − 9)) · g.

Hedges and Olkin refer to this unbiased estimator as d,[12] but it is not the same as Cohen's d. The exact form for
the correction factor J() involves the gamma function:[12]:104

J(a) = Γ(a/2) / (√(a/2) · Γ((a − 1)/2)).

Provided that the data are Gaussian distributed, a scaled Hedges' g, √(n1 n2 / (n1 + n2)) · g, follows a noncentral
t-distribution with noncentrality parameter √(n1 n2 / (n1 + n2)) · θ and (n1 + n2 − 2) degrees of freedom.
φ, Cramér's φ, or Cramér's V
The best measure of association for the chi-squared test is phi (or Cramér's phi or V). Phi is related to the
point-biserial correlation coefficient and Cohen's d and estimates the extent of the relationship between two variables
(2 x 2).[16] Cramér's Phi may be used with variables having more than two levels.
Phi can be computed by finding the square root of the chi-squared statistic divided by the sample size:

φ = √(χ² / N).

Similarly, Cramér's phi is computed by taking the square root of the chi-squared statistic divided by the sample size
times the smaller dimension minus one:

φc = √(χ² / (N(k − 1))),

where k is the smaller of the number of rows r or columns c.
φc is the intercorrelation of the two discrete variables[17] and may be computed for any value of r or c. However, as
chi-squared values tend to increase with the number of cells, the greater the difference between r and c, the more
likely φc will tend to 1 without strong evidence of a meaningful correlation.
Cramér's phi may also be applied to 'goodness of fit' chi-squared models (i.e. those where c=1). In this case it
functions as a measure of tendency towards a single outcome (i.e. out of k outcomes).
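The computation from a contingency table is direct. The sketch below builds the chi-squared statistic from scratch (function names and the 2 × 2 counts are ours); for a 2 × 2 table, φc reduces to φ:

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic and total N for an r x c count table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = sum((obs - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
               for i, row in enumerate(table) for j, obs in enumerate(row))
    return stat, n

def cramers_v(table):
    stat, n = chi_squared(table)
    k = min(len(table), len(table[0]))   # smaller of r and c
    return math.sqrt(stat / (n * (k - 1)))

table = [[30, 10], [15, 25]]             # hypothetical 2 x 2 counts
stat, n = chi_squared(table)
phi = math.sqrt(stat / n)
print(phi, cramers_v(table))             # identical for a 2 x 2 table
```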
Odds ratio
The odds ratio (OR) is another useful effect size. It is appropriate when both variables are binary. For example,
consider a study on spelling. In a control group, two students pass the class for every one who fails, so the odds of
passing are two to one (or more briefly 2/1 = 2). In the treatment group, six students pass for every one who fails, so
the odds of passing are six to one (or 6/1 = 6). The effect size can be computed by noting that the odds of passing in
the treatment group are three times higher than in the control group (because 6 divided by 2 is 3). Therefore, the odds
ratio is 3. However, odds ratio statistics are on a different scale to Cohen's d. So, this '3' is not comparable to a
Cohen's d of 3.
Relative risk
The relative risk (RR), also called risk ratio, is simply the risk (probability) of an event relative to some independent
variable. This measure of effect size differs from the odds ratio in that it compares probabilities instead of odds, but
asymptotically approaches the latter for small probabilities. Using the example above, the probabilities of passing for those in
the control group and treatment group are 2/3 (or 0.67) and 6/7 (or 0.86), respectively. The effect size can be
computed the same way as above, but using the probabilities instead. Therefore, the relative risk is approximately 1.29. Since rather
large probabilities of passing were used, there is a large difference between relative risk and odds ratio. Had failure
(a smaller probability) been used as the event (rather than passing), the difference between the two measures of
effect size would not be so great.
While both measures are useful, they have different statistical uses. In medical research, the odds ratio is commonly
used for case-control studies, as odds, but not probabilities, are usually estimated.[18] Relative risk is commonly used
in randomized controlled trials and cohort studies.[19] When the incidence of outcomes are rare in the study
population (generally interpreted to mean less than 10%), the odds ratio is considered a good estimate of the risk
ratio. However, as outcomes become more common, the odds ratio and risk ratio diverge, with the odds ratio
overestimating or underestimating the risk ratio when the estimates are greater than or less than 1, respectively.
When estimates of the incidence of outcomes are available, methods exist to convert odds ratios to risk ratios.[20][21]
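The spelling example can be reproduced in a few lines:

```python
# Control: 2 students pass for each failure; treatment: 6 pass for each failure.
control_pass, control_fail = 2, 1
treatment_pass, treatment_fail = 6, 1

odds_ratio = (treatment_pass / treatment_fail) / (control_pass / control_fail)
print(odds_ratio)          # 6 / 2 = 3.0

p_control = control_pass / (control_pass + control_fail)            # 2/3
p_treatment = treatment_pass / (treatment_pass + treatment_fail)    # 6/7
relative_risk = p_treatment / p_control
print(relative_risk)       # 9/7, about 1.29 -- far from the odds ratio of 3

# With failure (the rarer outcome) as the event, the two measures agree better.
rr_fail = (1 - p_treatment) / (1 - p_control)   # (1/7) / (1/3) = 3/7
or_fail = 1 / odds_ratio                        # 1/3
print(rr_fail, or_fail)
```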
One-way ANOVA test for mean difference across multiple independent groups
The one-way ANOVA test applies the noncentral F-distribution, while with a given (known) population standard deviation σ the
same test question applies the noncentral chi-squared distribution.
For K independent groups of the same size n, the total sample size is N := n·K.
The t-test for a pair of independent groups is a special case of one-way ANOVA. Note that the noncentrality parameter
λF of F is not comparable to the noncentrality parameter λt of the corresponding t; in fact, λF = λt² in that case.
References
[1] Wilkinson, Leland; APA Task Force on Statistical Inference (1999). "Statistical methods in psychology journals: Guidelines and
explanations". American Psychologist 54 (8): 594–604. doi:10.1037/0003-066X.54.8.594.
[2] Nakagawa, Shinichi; Cuthill, Innes C (2007). "Effect size, confidence interval and statistical significance: a practical guide for biologists".
Biological Reviews Cambridge Philosophical Society 82 (4): 591–605. doi:10.1111/j.1469-185X.2007.00027.x. PMID 17944619.
[3] Ellis, Paul D. (2010). The Essential Guide to Effect Sizes: An Introduction to Statistical Power, Meta-Analysis and the Interpretation of
Research Results. United Kingdom: Cambridge University Press.
[4] Brand A, Bradley MT, Best LA, Stoica G (2008). "Accuracy of effect size estimates from published psychological research"
(http://mtbradley.com/brandbradelybeststoicapdf.pdf). Perceptual and Motor Skills 106 (2): 645–649. doi:10.2466/PMS.106.2.645-649.
PMID 18556917.
[5] Brand A, Bradley MT, Best LA, Stoica G (2011). "Multiple trials may yield exaggerated effect size estimates"
(http://www.ipsychexpts.com/brand_et_al_(2011).pdf). The Journal of General Psychology 138 (1): 1–11. doi:10.1080/00221309.2010.520360.
[6] Jacob Cohen (1988). Statistical Power Analysis for the Behavioral Sciences (second ed.). Lawrence Erlbaum Associates.
[7] Cohen, J (1992). "A power primer". Psychological Bulletin 112 (1): 155–159. doi:10.1037/0033-2909.112.1.155. PMID 19565683.
[8] Bortz, 1999, p. 269f.;
[9] Bühner & Ziegler (2009, p. 413f)
[10] Tabachnick & Fidell (2007, p. 55)
[11] Olejnik, S. & Algina, J. (2003). "Generalized Eta and Omega Squared Statistics: Measures of Effect Size for Some Common Research Designs".
Psychological Methods 8 (4): 434–447. http://cps.nova.edu/marker/olejnik2003.pdf
[12] Larry V. Hedges & Ingram Olkin (1985). Statistical Methods for Meta-Analysis. Orlando: Academic Press. ISBN 0-12-336380-2.
[13] Chapter 13 (http://davidakenny.net/statbook/chapter_13.pdf), page 215, in: Kenny, David A. (1987). Statistics for the social and
behavioral sciences. Boston: Little, Brown. ISBN 0-316-48915-8.
[14] Joachim Hartung, Guido Knapp & Bimal K. Sinha (2008). Statistical Meta-Analysis with Application. Hoboken, New Jersey: Wiley.
[15] Larry V. Hedges (1981). "Distribution theory for Glass's estimator of effect size and related estimators". Journal of Educational Statistics 6
(2): 107–128. doi:10.3102/10769986006002107.
[16] Aaron, B., Kromrey, J. D., & Ferron, J. M. (1998, November). Equating r-based and d-based effect-size indices: Problems with a commonly
recommended formula. (http://www.eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?accno=ED433353)
Paper presented at the annual meeting of the Florida Educational Research Association, Orlando, FL.
(ERIC Document Reproduction Service No. ED433353)
[17] Sheskin, David J. (1997). Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Fl: CRC Press.
[18] Deeks J (1998). "When can odds ratios mislead? : Odds ratios should be used only in case-control studies and logistic regression analyses".
BMJ 317 (7166): 1155–6. PMC 1114127. PMID 9784470.
[19] Medical University of South Carolina. Odds ratio versus relative risk (http://www.musc.edu/dc/icrebm/oddsratio.html). Accessed on:
September 8, 2005.
[20] Zhang, J.; Yu, K. (1998). "What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes". JAMA:
the Journal of the American Medical Association 280 (19): 1690–1691. doi:10.1001/jama.280.19.1690. PMID 9832001.
[21] Greenland, S. (2004). "Model-based Estimation of Relative Risks and Other Epidemiologic Measures in Studies of Common Outcomes and
in Case-Control Studies". American Journal of Epidemiology 160 (4): 301–305. doi:10.1093/aje/kwh221. PMID 15286014.
[22] Russell V. Lenth. "Java applets for power and sample size" (http://www.stat.uiowa.edu/~rlenth/Power/). Division of Mathematical
Sciences, College of Liberal Arts, The University of Iowa. Retrieved 2008-10-08.
Further reading
• Aaron, B., Kromrey, J. D., & Ferron, J. M. (1998, November). Equating r-based and d-based effect-size indices:
Problems with a commonly recommended formula. Paper presented at the annual meeting of the Florida
Educational Research Association, Orlando, FL. (ERIC Document Reproduction Service No. ED433353) (http://
www.eric.ed.gov/ERICWebPortal/contentdelivery/servlet/ERICServlet?accno=ED433353)
• Bonett, D.G. (2008). Confidence intervals for standardized linear contrasts of means, Psychological Methods, 13,
99-109.
• Bonett, D.G. (2009). Estimating standardized linear contrasts of means with desired precision, Psychological
Methods, 14, 1-5.
• Cumming, G. and Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals
that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 530–572.
• Kelley, K. (2007). Confidence intervals for standardized effect sizes: Theory, application, and implementation.
Journal of Statistical Software, 20(8), 1-24. (http://www.jstatsoft.org/v20/i08/paper)
• Lipsey, M.W., & Wilson, D.B. (2001). Practical meta-analysis. Sage: Thousand Oaks, CA.
External links
Software
• compute.es: Compute Effect Sizes (http://cran.r-project.org/web/packages/compute.es/index.html) (R
package)
• MIX 2.0 (http://www.meta-analysis-made-easy.com) Software for professional meta-analysis in Excel. Many
effect sizes available.
• Effect Size Calculators (http://myweb.polyu.edu.hk/~mspaul/calculator/calculator.html) Calculate d and r
from a variety of statistics.
• Free Effect Size Generator (http://www.clintools.com/victims/resources/software/effectsize/
effect_size_generator.html) - PC & Mac Software
• MBESS (http://cran.r-project.org/web/packages/MBESS/index.html) - One of R's packages providing
confidence intervals of effect sizes based on non-central parameters
• Free GPower Software (http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/) - PC & Mac Software
Erlang distribution
Erlang
Parameters: shape k ∈ {1, 2, 3, ...}, rate λ > 0 (real)
  alt.: scale μ = 1/λ (real)
Support: x ∈ [0, ∞)
PDF: λ^k x^(k−1) e^(−λx) / (k − 1)!
CDF: 1 − Σ_{n=0}^{k−1} e^(−λx) (λx)^n / n!
Mean: k/λ
Median: no simple closed form
Mode: (k − 1)/λ for k ≥ 1
Variance: k/λ²
Skewness: 2/√k
Ex. kurtosis: 6/k
Entropy: (1 − k)ψ(k) + ln[Γ(k)/λ] + k
MGF: (1 − t/λ)^(−k) for t < λ
CF: (1 − it/λ)^(−k)
The Erlang distribution is a continuous probability distribution with wide applicability primarily due to its relation
to the exponential and Gamma distributions. The Erlang distribution was developed by A. K. Erlang to examine the
number of telephone calls which might be made at the same time to the operators of the switching stations. This
work on telephone traffic engineering has been expanded to consider waiting times in queueing systems in general.
The distribution is now used in the fields of stochastic processes and of biomathematics.
Erlang distribution 180
Overview
The distribution is a continuous distribution, which has a positive value for all real numbers greater than zero, and is
given by two parameters: the shape k, which is a positive integer, and the rate λ, which is a positive
real number. The distribution is sometimes defined using the inverse of the rate parameter, the scale μ = 1/λ. It is the
distribution of the sum of k independent exponential variables, each with mean μ.
When the shape parameter k equals 1, the distribution simplifies to the exponential distribution. The Erlang
distribution is a special case of the Gamma distribution where the shape parameter is an integer. In the Gamma
distribution, this parameter is not restricted to the integers.
Characterization
The parameter is called the shape parameter and the parameter is called the rate parameter. An alternative, but
equivalent, parametrization uses the scale parameter which is the reciprocal of the rate parameter (i.e.,
):
When the scale parameter equals 2, then distribution simplifies to the chi-squared distribution with 2k degrees of
freedom. It can therefore be regarded as a generalized chi-squared distribution.
Because of the factorial function in the denominator, the Erlang distribution is only defined when the parameter k is
a positive integer. In fact, this distribution is sometimes called the Erlang-k distribution (e.g., an Erlang-2
distribution is an Erlang distribution with k = 2). The Gamma distribution generalizes the Erlang distribution by
allowing k to be any positive real number, using the gamma function instead of the factorial function.
The cumulative distribution function is
F(x; k, λ) = γ(k, λx) / (k − 1)!,
where γ is the lower incomplete gamma function. The CDF may also be expressed as
F(x; k, λ) = 1 − Σ_{n=0}^{k−1} e^(−λx) (λx)^n / n!.
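The density f(x; k, λ) = λ^k x^(k−1) e^(−λx)/(k − 1)! and the finite-sum form of the CDF can be transcribed directly. A minimal sketch (the function names are ours, not from any library):

```python
import math

def erlang_pdf(x, k, lam):
    """f(x; k, lam) = lam^k x^(k-1) e^(-lam x) / (k-1)!  for x >= 0."""
    if x < 0:
        return 0.0
    return (lam ** k) * (x ** (k - 1)) * math.exp(-lam * x) / math.factorial(k - 1)

def erlang_cdf(x, k, lam):
    """F(x; k, lam) = 1 - sum_{n=0}^{k-1} e^(-lam x) (lam x)^n / n!"""
    if x < 0:
        return 0.0
    return 1.0 - sum(math.exp(-lam * x) * (lam * x) ** n / math.factorial(n)
                     for n in range(k))
```

With k = 1 both functions reduce to the exponential distribution's pdf and CDF, as the text notes.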
Occurrence
Waiting times
Events that occur independently with some average rate are modeled with a Poisson process. The waiting times
between k occurrences of the event are Erlang distributed. (The related question of the number of events in a given
amount of time is described by the Poisson distribution.)
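A quick simulation illustrates this: summing k Exponential(λ) inter-arrival times gives one draw of the Erlang(k, λ) waiting time, whose mean is k/λ. A sketch (the helper name is ours):

```python
import random

def erlang_variate(k, lam, rng=random):
    # Waiting time until the k-th event: sum of k exponential gaps.
    return sum(rng.expovariate(lam) for _ in range(k))

random.seed(0)
k, lam, n = 3, 2.0, 200_000
sample_mean = sum(erlang_variate(k, lam) for _ in range(n)) / n
# The theoretical mean of Erlang(3, 2.0) is k/lam = 1.5.
```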
The Erlang distribution, which measures the time between incoming calls, can be used in conjunction with the
expected duration of incoming calls to produce information about the traffic load measured in Erlang units. This can
be used to determine the probability of packet loss or delay, according to various assumptions made about whether
blocked calls are aborted (Erlang B formula) or queued until served (Erlang C formula). The Erlang-B and C
formulae are still in everyday use for traffic modeling for applications such as the design of call centers.
A. K. Erlang worked extensively in traffic modeling. There are thus two other Erlang formulae, both used in
modeling traffic:
• Erlang B formula: the simpler of the two; it can be used, for example, in a call centre to calculate the
number of trunks needed to carry a given amount of phone traffic with a given target service level.
• Erlang C formula: more involved; it is often used, for example, to calculate how long
callers will have to wait before being connected to a human in a call centre or similar situation.
Stochastic processes
The Erlang distribution is the distribution of the sum of k independent identically distributed random variables,
each having an exponential distribution. The long-run rate at which events occur is the reciprocal of the expectation
of X, that is λ/k. The (age-specific event) rate of the Erlang distribution is, for k > 1, monotonic in x,
increasing from zero at x = 0 to λ as x tends to infinity.[1]
Related distributions
• If X ~ Erlang(k, λ) and a > 0, then aX ~ Erlang(k, λ/a).
• For large k, the central limit theorem gives Erlang(k, λ) ≈ N(k/λ, k/λ²) (normal distribution).
• If X ~ Erlang(k1, λ) and Y ~ Erlang(k2, λ) are independent, then X + Y ~ Erlang(k1 + k2, λ).
• If X ~ Exponential(λ), then X ~ Erlang(1, λ).
• The Erlang distribution is a special case of the type 3 Pearson distribution.
• If X ~ Gamma(k, θ) (gamma distribution) with integer shape k, then X ~ Erlang(k, 1/θ).
• If X ~ Erlang(k, λ), then 2λX is chi-squared distributed with 2k degrees of freedom.
Notes
[1] Cox, D.R. (1967) Renewal Theory, p20, Methuen.
References
• Ian Angus, "An Introduction to Erlang B and Erlang C" (http://www.tarrani.net/linda/ErlangBandC.pdf),
Telemanagement #187 (PDF; includes terms and formulae plus a short biography)
External links
• Erlang Distribution (http://www.xycoon.com/erlang.htm)
• Resource Dimensioning Using Erlang-B and Erlang-C (http://www.eventhelix.com/RealtimeMantra/
CongestionControl/resource_dimensioning_erlang_b_c.htm)
• Erlang-C (http://www.kooltoolz.com/Erlang-C.htm)
Expectation–maximization algorithm
In statistics, an expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood
or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on
unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a
function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a
maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step.
These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.
History
The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and
Donald Rubin.[1] They pointed out that the method had been "proposed many times in special circumstances" by
earlier authors. In particular, a very detailed treatment of the EM method for exponential families was published by
Rolf Sundberg in his thesis and several papers[2][3][4] following his collaboration with Per Martin-Löf and Anders
Martin-Löf.[5][6][7][8][9][10][11] The Dempster-Laird-Rubin paper in 1977 generalized the method and sketched a
convergence analysis for a wider class of problems. Regardless of earlier inventions, the innovative
Dempster-Laird-Rubin paper in the Journal of the Royal Statistical Society received an enthusiastic discussion at the
Royal Statistical Society meeting with Sundberg calling the paper "brilliant". The Dempster-Laird-Rubin paper
established the EM method as an important tool of statistical analysis.
The convergence analysis of the Dempster-Laird-Rubin paper was flawed and a correct convergence analysis was
published by C. F. Jeff Wu in 1983. Wu's proof established the EM method's convergence outside of the exponential
family, as claimed by Dempster-Laird-Rubin.[12]
Introduction
The EM algorithm is used to find the maximum likelihood parameters of a statistical model in cases where the
equations cannot be solved directly. Typically these models involve latent variables in addition to unknown
parameters and known data observations. That is, either there are missing values among the data, or the model can be
formulated more simply by assuming the existence of additional unobserved data points. (For example, a mixture
model can be described more simply by assuming that each observed data point has a corresponding unobserved data
point, or latent variable, specifying the mixture component that each data point belongs to.)
Finding a maximum likelihood solution requires taking the derivatives of the likelihood function with respect to all
the unknown values — i.e. both the parameters and the latent variables — and simultaneously solving the resulting
equations. In statistical models with latent variables, this usually is not possible. Instead, the result is typically a set
of interlocking equations in which the solution to the parameters requires the values of the latent variables and
vice-versa, but substituting one set of equations into the other produces an unsolvable equation.
The EM algorithm proceeds from the observation that the following is a way to solve these two sets of equations
numerically. One can simply pick arbitrary values for one of the two sets of unknowns, use them to estimate the
second set, then use these new values to find a better estimate of the first set, and then keep alternating between the
two until the resulting values both converge to fixed points. It's not obvious that this will work at all, but in fact it can
be proven that in this particular context it does, and that the value is a local maximum of the likelihood function. In
general there may be multiple maxima, and no guarantee that the global maximum will be found. Some likelihoods
also have singularities in them, i.e. nonsensical maxima. For example, one of the "solutions" that may be found by
EM in a mixture model involves setting one of the components to have zero variance and the mean parameter for the
same component equal to one of the data points.
Description
Given a statistical model consisting of a set X of observed data, a set Z of unobserved latent data or missing values,
and a vector of unknown parameters θ, along with a likelihood function L(θ; X, Z) = p(X, Z | θ), the
maximum likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the
observed data
L(θ; X) = p(X | θ) = Σ_Z p(X, Z | θ).
However, this quantity is often intractable. The EM algorithm seeks to find the MLE of the marginal likelihood by
iteratively applying the following two steps:
Expectation step (E step): Calculate the expected value of the log-likelihood function with respect to the
conditional distribution of Z given X under the current estimate of the parameters θ^(t):
Q(θ | θ^(t)) = E_{Z | X, θ^(t)} [ log L(θ; X, Z) ].
Maximization step (M step): Find the parameter that maximizes this quantity:
θ^(t+1) = argmax_θ Q(θ | θ^(t)).
In the typical case where Z takes discrete values, the E step amounts to computing the
probability of each possible value of Z for each data point, and then using the probabilities associated with a particular
value of Z to compute a weighted average over the entire set of data points. The resulting algorithm is commonly
called soft EM, and is the type of algorithm normally associated with EM. The counts used to compute these
weighted averages are called soft counts (as opposed to the hard counts used in a hard-EM-type algorithm such as
k-means). The probabilities computed for Z are posterior probabilities and are what is computed in the E step. The soft
counts used to compute new parameter values are what is computed in the M step.
Properties
Speaking of an expectation (E) step is a bit of a misnomer. What is calculated in the first step are the fixed,
data-dependent parameters of the function Q. Once the parameters of Q are known, it is fully determined and is
maximized in the second (M) step of an EM algorithm.
Although an EM iteration does increase the observed data (i.e. marginal) likelihood function, there is no guarantee
that the sequence converges to a maximum likelihood estimator. For multimodal distributions, this means that an EM
algorithm may converge to a local maximum of the observed data likelihood function, depending on starting values.
There are a variety of heuristic or metaheuristic approaches for escaping a local maximum such as random restart
(starting with several different random initial estimates θ(t)), or applying simulated annealing methods.
EM is particularly useful when the likelihood is an exponential family: the E step becomes the sum of expectations
of sufficient statistics, and the M step involves maximizing a linear function. In such a case, it is usually possible to
derive closed form updates for each step, using the Sundberg formula (published by Rolf Sundberg using
unpublished results of Per Martin-Löf and Anders Martin-Löf).[3][4][7][8][9][10][11]
The EM method was modified to compute maximum a posteriori (MAP) estimates for Bayesian inference in the
original paper by Dempster, Laird, and Rubin.
There are other methods for finding maximum likelihood estimates, such as gradient descent, conjugate gradient or
variations of the Gauss–Newton method. Unlike EM, such methods typically require the evaluation of first and/or
second derivatives of the likelihood function.
Proof of correctness
Expectation–maximization works to improve Q(θ | θ^(t)) rather than directly improving log p(X | θ). Here we
show that improvements to the former imply improvements to the latter.[13]
For any Z with non-zero probability p(Z | X, θ), we can write
log p(X | θ) = log p(X, Z | θ) − log p(Z | X, θ).
We take the expectation over values of Z by multiplying both sides by p(Z | X, θ^(t)) and summing (or integrating)
over Z. The left-hand side is the expectation of a constant, so we get:
log p(X | θ) = Q(θ | θ^(t)) + H(θ | θ^(t)),
where H(θ | θ^(t)) is defined by the negated sum it is replacing:
H(θ | θ^(t)) = − Σ_Z p(Z | X, θ^(t)) log p(Z | X, θ).
This last equation holds for any value of θ, including θ = θ^(t),
log p(X | θ^(t)) = Q(θ^(t) | θ^(t)) + H(θ^(t) | θ^(t)),
and subtracting this last equation from the previous equation gives
log p(X | θ) − log p(X | θ^(t)) = Q(θ | θ^(t)) − Q(θ^(t) | θ^(t)) + H(θ | θ^(t)) − H(θ^(t) | θ^(t)).
By Gibbs' inequality, H(θ | θ^(t)) ≥ H(θ^(t) | θ^(t)), so any choice of θ that increases Q(θ | θ^(t)) beyond
Q(θ^(t) | θ^(t)) cannot decrease log p(X | θ).
Alternative description
Under some circumstances, it is convenient to view the EM algorithm as two alternating maximization steps.[14][15]
Consider the function:
F(q, θ) = E_q[ log L(θ; x, Z) ] + H(q) = − D_KL( q ‖ p_{Z|X}(· | x; θ) ) + log L(θ; x),
where q is an arbitrary probability distribution over the unobserved data z, p_{Z|X}(· | x; θ) is the conditional distribution
of the unobserved data given the observed data x, H is the entropy and D_KL is the Kullback–Leibler divergence.
Then the steps in the EM algorithm may be viewed as:
Expectation step: Choose q to maximize F: q^(t) = argmax_q F(q, θ^(t)).
Maximization step: Choose θ to maximize F: θ^(t+1) = argmax_θ F(q^(t), θ).
Applications
EM is frequently used for data clustering in machine learning and computer vision. In natural language processing,
two prominent instances of the algorithm are the Baum-Welch algorithm (also known as forward-backward) and the
inside-outside algorithm for unsupervised induction of probabilistic context-free grammars.
In psychometrics, EM is almost indispensable for estimating item parameters and latent abilities of item response
theory models.
With its ability to handle missing data and unobserved variables, EM is becoming a useful tool for pricing
and managing the risk of a portfolio.
The EM algorithm (and its faster variant Ordered subset expectation maximization) is also widely used in medical
image reconstruction, especially in positron emission tomography and single photon emission computed
tomography. See below for other faster variants of EM.
Filtering and smoothing EM algorithms
EM can also be used to estimate the noise variances of a state-space model whose states are estimated by a filter
or a smoother. An updated measurement noise variance estimate can be calculated by
σ̂_v² = (1/N) Σ_{k=1}^N (z_k − x̂_k)²,
where x̂_k are scalar output estimates calculated by a filter or a smoother from N scalar measurements z_k.
Similarly, for a first-order auto-regressive process, an updated process noise variance estimate can be calculated by
σ̂_w² = (1/N) Σ_{k=1}^N (x̂_{k+1} − F̂ x̂_k)²,
where x̂_k and x̂_{k+1} are scalar state estimates calculated by a filter or a smoother. The updated model coefficient
estimate is obtained via
F̂ = Σ_{k=1}^N x̂_{k+1} x̂_k / Σ_{k=1}^N x̂_k².
The convergence of the above parameter estimates is studied in [16][17].
Variants
A number of methods have been proposed to accelerate the sometimes slow convergence of the EM algorithm, such
as those utilising conjugate gradient and modified Newton–Raphson techniques.[18] Additionally EM can be utilised
with constrained estimation techniques.
Expectation conditional maximization (ECM) replaces each M step with a sequence of conditional maximization
(CM) steps in which each parameter θi is maximized individually, conditionally on the other parameters remaining
fixed.[19]
This idea is further extended in generalized expectation maximization (GEM) algorithm, in which one only seeks
an increase in the objective function F for both the E step and M step under the alternative description.[14]
It is also possible to consider the EM algorithm as a subclass of the MM (Majorize/Minimize or Minorize/Maximize,
depending on context) algorithm,[20] and therefore use any machinery developed in the more general case.
α-EM algorithm
The Q-function used in the EM algorithm is based on the log likelihood. Therefore, it is regarded as the log-EM
algorithm. The use of the log likelihood can be generalized to that of the α-log likelihood ratio. Then, the α-log
likelihood ratio of the observed data can be exactly expressed as equality by using the Q-function of the α-log
likelihood ratio and the α-divergence. Obtaining this Q-function is a generalized E step. Its maximization is a
generalized M step. This pair is called the α-EM algorithm,[21] which contains the log-EM algorithm as a subclass.
Thus, the α-EM algorithm by Yasuo Matsuyama is an exact generalization of the log-EM algorithm. No computation
of gradient or Hessian matrix is needed. The α-EM shows faster convergence than the log-EM algorithm when an
appropriate α is chosen. The α-EM algorithm leads to a faster version of the Hidden Markov model estimation
algorithm, α-HMM.[22]
Geometric interpretation
In information geometry, the E step and the M step are interpreted as projections under dual affine connections,
called the e-connection and the m-connection; the Kullback–Leibler divergence can also be understood in these
terms.
Examples
Gaussian mixture
Let x = (x1,x2,…,xn) be a sample of n independent
observations from a mixture of two multivariate normal
distributions of dimension d, and let z=(z1,z2,…,zn) be the
latent variables that determine the component from which
the observation originates.[15]
X_i | (Z_i = 1) ~ N_d(μ1, Σ1)  and  X_i | (Z_i = 2) ~ N_d(μ2, Σ2),
where
P(Z_i = 1) = τ1  and  P(Z_i = 2) = τ2 = 1 − τ1.
The aim is to estimate the unknown parameters θ = (τ, μ1, Σ1, μ2, Σ2). The likelihood function is
L(θ; x, z) = P(x, z | θ) = ∏_{i=1}^n ∑_{j=1}^2 I(z_i = j) τ_j f(x_i; μ_j, Σ_j),
where I is an indicator function and f is the probability density function of a multivariate normal. This may be
rewritten in exponential family form:
L(θ; x, z) = exp{ ∑_{i=1}^n ∑_{j=1}^2 I(z_i = j) [ log τ_j − ½ log|Σ_j| − ½ (x_i − μ_j)ᵀ Σ_j^(−1) (x_i − μ_j) − (d/2) log(2π) ] }.
To see the last equality, note that for each i all indicators are equal to zero, except for one which is equal
to one. The inner sum thus reduces to a single term.
E step
Given our current estimate of the parameters θ^(t), the conditional distribution of the Z_i is determined by Bayes'
theorem to be the proportional height of the normal density weighted by τ:
T_{j,i}^(t) := P(Z_i = j | X_i = x_i; θ^(t)) = τ_j^(t) f(x_i; μ_j^(t), Σ_j^(t)) / [ τ_1^(t) f(x_i; μ_1^(t), Σ_1^(t)) + τ_2^(t) f(x_i; μ_2^(t), Σ_2^(t)) ].
M step
The quadratic form of Q(θ|θ(t)) means that determining the maximising values of θ is relatively straightforward.
Firstly note that τ, (μ1,Σ1) and (μ2,Σ2) may be all maximised independently of each other since they all appear in
separate linear terms.
Firstly, consider τ, which has the constraint τ1 + τ2 = 1. Maximising the τ terms of Q subject to this constraint gives
τ_j^(t+1) = (1/n) ∑_{i=1}^n T_{j,i}^(t).
This has the same form as the MLE for the binomial distribution: the new mixing weight is the average
responsibility. For the next estimates of (μ1, Σ1),
μ_1^(t+1) = ∑_{i=1}^n T_{1,i}^(t) x_i / ∑_{i=1}^n T_{1,i}^(t)
and
Σ_1^(t+1) = ∑_{i=1}^n T_{1,i}^(t) (x_i − μ_1^(t+1))(x_i − μ_1^(t+1))ᵀ / ∑_{i=1}^n T_{1,i}^(t).
This has the same form as a weighted MLE for a normal distribution, and, by symmetry, analogous expressions
hold for μ_2^(t+1) and Σ_2^(t+1).
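The E and M steps above can be sketched for a one-dimensional two-component mixture. This is an illustrative toy implementation, not from any library; the quartile-based initialisation is our own choice:

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, iters=100):
    n = len(xs)
    s = sorted(xs)
    mu = [s[n // 4], s[3 * n // 4]]   # crude initial means (sample quartiles)
    sig2 = [1.0, 1.0]                 # initial variances
    tau = [0.5, 0.5]                  # initial mixing weights
    for _ in range(iters):
        # E step: responsibilities T[j][i] = P(Z_i = j | x_i; current params)
        T = [[0.0] * n for _ in range(2)]
        for i, x in enumerate(xs):
            w = [tau[j] * normal_pdf(x, mu[j], sig2[j]) for j in range(2)]
            z = w[0] + w[1]
            for j in range(2):
                T[j][i] = w[j] / z
        # M step: weighted MLE updates using the soft counts
        for j in range(2):
            nj = sum(T[j])
            tau[j] = nj / n
            mu[j] = sum(T[j][i] * xs[i] for i in range(n)) / nj
            sig2[j] = sum(T[j][i] * (xs[i] - mu[j]) ** 2 for i in range(n)) / nj
    return tau, mu, sig2

# Synthetic data: equal mixture of N(0, 1) and N(5, 1).
random.seed(1)
data = ([random.gauss(0.0, 1.0) for _ in range(500)]
        + [random.gauss(5.0, 1.0) for _ in range(500)])
tau, mu, sig2 = em_gmm(data)
```

On this well-separated mixture the fitted means approach 0 and 5 and the weights approach one half each.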
References
• Robert Hogg, Joseph McKean and Allen Craig. Introduction to Mathematical Statistics. pp. 359–364. Upper
Saddle River, NJ: Pearson Prentice Hall, 2005.
• The on-line textbook: Information Theory, Inference, and Learning Algorithms [24], by David J.C. MacKay
includes simple examples of the EM algorithm such as clustering using the soft k-means algorithm, and
emphasizes the variational view of the EM algorithm, as described in Chapter 33.7 of version 7.2 (fourth edition).
• Theory and Use of the EM Method [25] by M. R. Gupta and Y. Chen is a well-written short book on EM,
including detailed derivation of EM for GMMs, HMMs, and Dirichlet.
• Bilmes, Jeff. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian
Mixture and Hidden Markov Models. CiteSeerX: 10.1.1.28.613 [26], includes a simplified derivation of the EM
equations for Gaussian Mixtures and Gaussian Mixture Hidden Markov Models.
• Variational Algorithms for Approximate Bayesian Inference [27], by M. J. Beal includes comparisons of EM to
Variational Bayesian EM and derivations of several models including Variational Bayesian HMMs (chapters [28]).
• Dellaert, Frank. The Expectation Maximization Algorithm. CiteSeerX: 10.1.1.9.9735 [29], gives an easier
explanation of the EM algorithm in terms of lower-bound maximization.
• The Expectation Maximization Algorithm: A short tutorial [30], A self contained derivation of the EM Algorithm
by Sean Borman.
• The EM Algorithm [31], by Xiaojin Zhu.
• EM algorithm and variants: an informal tutorial [32] by Alexis Roche. A concise and very clear description of EM
and many interesting variants.
• Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. ISBN 0-387-31073-8.
• Einicke, G.A. (2012). Smoothing, Filtering and Prediction: Estimating the Past, Present and Future [33]. Rijeka,
Croatia: Intech. ISBN 978-953-307-752-9.
References
[1] Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal
Statistical Society. Series B (Methodological) 39 (1): 1–38. JSTOR 2984875. MR0501537.
[2] Sundberg, Rolf (1974). "Maximum likelihood theory for incomplete data from an exponential family". Scandinavian Journal of Statistics 1
(2): 49–58. JSTOR 4615553. MR381110.
[3] Rolf Sundberg. 1971. Maximum likelihood theory and applications for distributions generated when observing a function of an exponential
family variable. Dissertation, Institute for Mathematical Statistics, Stockholm University.
[4] Sundberg, Rolf (1976). "An iterative method for solution of the likelihood equations for incomplete data from exponential families".
Communications in Statistics – Simulation and Computation 5 (1): 55–64. doi:10.1080/03610917608812007. MR443190.
[5] See the acknowledgement by Dempster, Laird and Rubin on pages 3, 5 and 11.
[6] G. Kulldorff. 1961. Contributions to the theory of estimation from grouped and partially grouped samples. Almqvist & Wiksell.
[7] Anders Martin-Löf. 1963. "Utvärdering av livslängder i subnanosekundsområdet" ("Evaluation of sub-nanosecond lifetimes"). ("Sundberg
formula")
[8] Per Martin-Löf. 1966. Statistics from the point of view of statistical mechanics. Lecture notes, Mathematical Institute, Aarhus University.
("Sundberg formula" credited to Anders Martin-Löf).
[9] Per Martin-Löf. 1970. Statistika Modeller (Statistical Models): Anteckningar från seminarier läsåret 1969–1970 (Notes from seminars in the
academic year 1969-1970), with the assistance of Rolf Sundberg. Stockholm University. ("Sundberg formula")
[10] Martin-Löf, P. The notion of redundancy and its use as a quantitative measure of the deviation between a statistical hypothesis and a set of
observational data. With a discussion by F. Abildgård, A. P. Dempster, D. Basu, D. R. Cox, A. W. F. Edwards, D. A. Sprott, G. A. Barnard, O.
Barndorff-Nielsen, J. D. Kalbfleisch and G. Rasch and a reply by the author. Proceedings of Conference on Foundational Questions in
Statistical Inference (Aarhus, 1973), pp. 1–42. Memoirs, No. 1, Dept. Theoret. Statist., Inst. Math., Univ. Aarhus, Aarhus, 1974.
[11] Martin-Löf, Per The notion of redundancy and its use as a quantitative measure of the discrepancy between a statistical hypothesis and a set
of observational data. Scand. J. Statist. 1 (1974), no. 1, 3–18.
[12] Wu, C. F. Jeff (Mar. 1983). "On the Convergence Properties of the EM Algorithm". Annals of Statistics 11 (1): 95–103.
doi:10.1214/aos/1176346060. JSTOR 2240463. MR684867.
[13] Little, Roderick J.A.; Rubin, Donald B. (1987). Statistical Analysis with Missing Data. Wiley Series in Probability and Mathematical
Statistics. New York: John Wiley & Sons. pp. 134–136. ISBN 0-471-80254-9.
[14] Neal, Radford; Hinton, Geoffrey (1999). Michael I. Jordan. ed. "A view of the EM algorithm that justifies incremental, sparse, and other
variants" (ftp://ftp.cs.toronto.edu/pub/radford/emk.pdf). Learning in Graphical Models (Cambridge, MA: MIT Press): 355–368.
ISBN 0-262-60032-3. Retrieved 2009-03-22.
[15] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2001). "8.5 The EM algorithm". The Elements of Statistical Learning. New York:
Springer. pp. 236–243. ISBN 0-387-95284-5.
[16] Einicke, G.A.; Malos, J.T.; Reid, D.C.; Hainsworth, D.W. (January 2009). "Riccati Equation and EM Algorithm Convergence for Inertial
Navigation Alignment". IEEE Trans. Signal Processing 57 (1): 370–375
[17] Einicke, G.A.; Falco, G.; Malos, J.T. (May 2010). "EM Algorithm State Matrix Estimation for Navigation". IEEE Signal Processing Letters
17 (5): 437–440
[18] Jamshidian, Mortaza; Jennrich, Robert I. (1997). "Acceleration of the EM Algorithm by using Quasi-Newton Methods". Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 59 (2): 569–587. doi:10.1111/1467-9868.00083. MR1452026.
[19] Meng, Xiao-Li; Rubin, Donald B. (1993). "Maximum likelihood estimation via the ECM algorithm: A general framework". Biometrika 80
(2): 267–278. doi:10.1093/biomet/80.2.267. MR1243503.
[20] Hunter DR and Lange K (2004), A Tutorial on MM Algorithms (http://www.stat.psu.edu/~dhunter/papers/mmtutorial.pdf), The
American Statistician, 58: 30–37
[21] Matsuyama, Yasuo (2003). "The α-EM algorithm: Surrogate likelihood maximization using α-logarithmic information measures". IEEE
Transactions on Information Theory 49 (3): 692–706.
[22] Matsuyama, Yasuo (2011). "Hidden Markov model estimation based on alpha-EM algorithm: Discrete and continuous alpha-HMMs".
International Joint Conference on Neural Networks: 808–816.
[23] Wolynetz, M.S. (1979) "Maximum Likelihood estimation in a Linear model from Confined and Censored Normal Data". Journal of the
Royal Statistical Society (Series C), 28(2), 195–206
[24] http://www.inference.phy.cam.ac.uk/mackay/itila/
[25] http://ee.washington.edu/research/guptalab/publications/EMbookChenGupta2010.pdf
[26] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.613
[27] http://www.cse.buffalo.edu/faculty/mbeal/papers/beal03.pdf
[28] http://www.cse.buffalo.edu/faculty/mbeal/thesis/index.html
[29] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.9.9735
[30] http://www.seanborman.com/publications/EM_algorithm.pdf
[31] http://pages.cs.wisc.edu/~jerryzhu/cs838/EM.pdf
[32] http://arxiv.org/pdf/1105.1476.pdf
[33] http://www.intechopen.com/books/smoothing-filtering-and-prediction-estimating-the-past-present-and-future
External links
• Various 1D, 2D and 3D demonstrations of EM together with Mixture Modeling (http://wiki.stat.ucla.edu/socr/
index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture) are provided as part of the
paired SOCR activities and applets. These applets and activities show empirically the properties of the EM
algorithm for parameter estimation in diverse settings.
• Class hierarchy in C++ (GPL) including Gaussian Mixtures (https://github.com/l-/CommonDataAnalysis)
Exponential distribution
In probability theory and statistics, the exponential distribution (a.k.a. negative exponential distribution) is a
family of continuous probability distributions. It describes the time between events in a Poisson process, i.e. a
process in which events occur continuously and independently at a constant average rate. It is the continuous
analogue of the geometric distribution.
Note that the exponential distribution is not the same as the class of exponential families of distributions, which is a
large class of probability distributions that includes the exponential distribution as one of its members, but also
includes the normal distribution, binomial distribution, gamma distribution, Poisson, and many others.
Characterization
The probability density function (pdf) of an exponential distribution is
f(x; λ) = λ e^(−λx) for x ≥ 0, and f(x; λ) = 0 for x < 0.
Alternatively, this can be defined using the Heaviside step function, H(x):
f(x; λ) = λ e^(−λx) H(x).
Here λ > 0 is the parameter of the distribution, often called the rate parameter. The distribution is supported on the
interval [0, ∞). If a random variable X has this distribution, we write X ~ Exp(λ).
The exponential distribution exhibits infinite divisibility.
Alternative parameterization
A commonly used alternative parameterization is to define the probability density function (pdf) of an exponential
distribution as
f(x; β) = (1/β) e^(−x/β) H(x),
where β > 0 is a scale parameter of the distribution and is the reciprocal of the rate parameter, λ, defined above. In
this specification, β is a survival parameter in the sense that if a random variable X is the duration of time that a
given biological or mechanical system manages to survive and X ~ Exponential(β) then E[X] = β. That is to say, the
expected duration of survival of the system is β units of time. The parameterisation involving the "rate" parameter
arises in the context of events arriving at a rate λ, when the time between events (which might be modelled using an
exponential distribution) has a mean of β = λ−1.
The alternative specification is sometimes more convenient than the one given above, and some authors will use it as
a standard definition. This alternative specification is not used here. Unfortunately this gives rise to a notational
ambiguity. In general, the reader must check which of these two specifications is being used if an author writes
"X ~ Exponential(λ)", since either the notation in the previous (using λ) or the notation in this section (here, using β
to avoid confusion) could be intended.
Properties
The mean or expected value of an exponentially distributed random variable X with rate parameter λ is
E[X] = 1/λ.
In light of the examples given above, this makes sense: if you receive
phone calls at an average rate of 2 per hour, then you can expect to
wait half an hour for every call.
The variance of X is given by
Var[X] = 1/λ².
Memorylessness
An important property of the exponential distribution is that it is
memoryless. This means that if a random variable T is exponentially
distributed, its conditional probability obeys
Pr(T > s + t | T > s) = Pr(T > t) for all s, t ≥ 0.
[Figure: The median is the preimage F^(−1)(1/2).]
This says that the conditional probability that we need to wait, for example, more than another 10 seconds before the
first arrival, given that the first arrival has not yet happened after 30 seconds, is equal to the initial probability that
we need to wait more than 10 seconds for the first arrival. So, if we waited for 30 seconds and the first arrival didn't
happen (T > 30), probability that we'll need to wait another 10 seconds for the first arrival (T > 30 + 10) is the same
as the initial probability that we need to wait more than 10 seconds for the first arrival (T > 10). The fact that
Pr(T > 40 | T > 30) = Pr(T > 10) does not mean that the events T > 40 and T > 30 are independent.
To summarize: "memorylessness" of the probability distribution of the waiting time T until the first arrival means
Pr(T > s + t | T > s) = Pr(T > t) for all s, t ≥ 0.
The exponential distribution is consequently also necessarily the only continuous probability distribution that has a
constant failure rate.
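Memorylessness is easy to check empirically. A Monte Carlo sketch (the rate λ = 1/20 is an arbitrary choice for illustration):

```python
import math
import random

random.seed(0)
lam = 1.0 / 20.0                       # arbitrary rate: mean wait of 20 seconds
sample = [random.expovariate(lam) for _ in range(500_000)]

over_30 = [t for t in sample if t > 30]
p_cond = sum(t > 40 for t in over_30) / len(over_30)  # Pr(T > 40 | T > 30)
p_10 = sum(t > 10 for t in sample) / len(sample)      # Pr(T > 10)
# Both estimates should be close to exp(-10 * lam) = exp(-0.5).
```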
Quantiles
The quantile function (inverse cumulative distribution function) for Exponential(λ) is
F^(−1)(p; λ) = −ln(1 − p)/λ,   0 ≤ p < 1.
Kullback–Leibler divergence
The directed Kullback–Leibler divergence between Exp(λ0) ('true' distribution) and Exp(λ) ('approximating'
distribution) is given by
Δ(λ0 ‖ λ) = log(λ0/λ) + λ/λ0 − 1.
Distribution of the minimum of exponential random variables
If X1, ..., Xn are independent exponentially distributed random variables with rate parameters λ1, ..., λn, then
min(X1, ..., Xn) is also exponentially distributed, with rate parameter λ1 + ... + λn. The index of the variable which
achieves the minimum is distributed according to the law
Pr(Xk = min{X1, ..., Xn}) = λk / (λ1 + ... + λn).
Note that max(X1, ..., Xn) is not exponentially distributed.
Parameter estimation
Suppose a given variable is exponentially distributed and the rate parameter λ is to be estimated.
Maximum likelihood
The likelihood function for λ, given an independent and identically distributed sample x = (x1, ..., xn) drawn from the
variable, is
L(λ) = ∏_{i=1}^n λ exp(−λ x_i) = λ^n exp(−λ n x̄),
where
x̄ = (1/n) ∑_{i=1}^n x_i
is the sample mean. The derivative of the log-likelihood is zero at λ̂ = 1/x̄, which is therefore the maximum
likelihood estimate of the rate parameter.
While this estimate is the most likely reconstruction of the true parameter λ, it is only an estimate, and as such, one
can imagine that the more data points are available the better the estimate will be. It so happens that one can compute
an exact confidence interval – that is, a confidence interval that is valid for all number of samples, not just large
ones. The 100(1 − α)% exact confidence interval for this estimate is given by[2]
χ²_{2n}(α/2) / (2 n x̄)  <  λ  <  χ²_{2n}(1 − α/2) / (2 n x̄),
where x̄ is the sample mean, λ is the true value of the parameter, and χ²_{2n}(p) is the p-quantile of the chi-squared
distribution with 2n degrees of freedom.
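The estimate λ̂ = 1/x̄ can be verified numerically. A minimal sketch (the sample is simulated with a known rate purely for illustration):

```python
import random

random.seed(0)
true_lam = 2.0
xs = [random.expovariate(true_lam) for _ in range(100_000)]
lam_hat = len(xs) / sum(xs)   # MLE: reciprocal of the sample mean
# lam_hat should be close to true_lam for a large sample.
```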
Bayesian inference
The conjugate prior for the exponential distribution is the gamma distribution (of which the exponential distribution
is a special case). The following parameterization of the gamma pdf is useful:
Gamma(λ; α, β) = (β^α / Γ(α)) λ^(α−1) exp(−λβ).
The posterior distribution p can then be expressed in terms of the likelihood function defined above and a gamma
prior:
p(λ | x) ∝ L(λ) Gamma(λ; α, β) ∝ λ^(α+n−1) exp(−λ(β + n x̄)).
Now the posterior density p has been specified up to a missing normalizing constant. Since it has the form of a
gamma pdf, this can easily be filled in, and one obtains
p(λ | x) = Gamma(λ; α + n, β + n x̄).
Here the parameter α can be interpreted as the number of prior observations, and β as the sum of the prior
observations.
Confidence interval
A simple and rapid method to calculate an approximate confidence interval for the estimation of λ is based on the
application of the central limit theorem.[3] This method provides a good approximation of the confidence interval
limits, for samples containing at least 15 to 20 elements. Denoting by N the sample size, the upper and lower limits of
the 95% confidence interval are given by:
λ_lower = λ̂ (1 − 1.96/√N),   λ_upper = λ̂ (1 + 1.96/√N).
Generating exponential variates
A conceptually simple method for generating exponential variates is based on inverse transform sampling: given a
random variate U drawn from the uniform distribution on the unit interval (0, 1), the variate
T = F^(−1)(U) = −ln(1 − U)/λ
has an exponential distribution with rate parameter λ.
Moreover, if U is uniform on (0, 1), then so is 1 − U. This means one can generate exponential variates as follows:
T = −ln(U)/λ.
Other methods for generating exponential variates are discussed by Knuth[4] and Devroye.[5]
The ziggurat algorithm is a fast method for generating exponential variates.
A fast method for generating a set of ready-ordered exponential variates without using a sorting routine is also
available.[5]
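The inverse-CDF method described above takes one line per variate. A sketch (the helper name is ours):

```python
import math
import random

def exp_variate(lam, rng=random):
    # Inverse transform sampling: U uniform on [0, 1) makes 1 - U lie in
    # (0, 1], so the logarithm below is always defined.
    return -math.log(1.0 - rng.random()) / lam

random.seed(0)
lam = 0.5
sample = [exp_variate(lam) for _ in range(200_000)]
mean = sum(sample) / len(sample)   # should approach 1/lam = 2.0
```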
Related distributions
• The exponential distribution is closed under scaling by a positive factor: if X ~ Exp(λ) and k > 0, then kX ~ Exp(λ/k).
• If X ~ Exp(λ1) and Y ~ Exp(λ2) are independent, then min(X, Y) ~ Exp(λ1 + λ2).
• If X ~ Exp(λ), then X ~ Gamma(1, λ), i.e. the exponential distribution is the gamma distribution with shape parameter 1.
• The Benktander Weibull distribution reduces to a truncated exponential distribution: if X ~ Exp(λ), then X ~ BenktanderWeibull(λ, 1).
• The exponential distribution is a limit of a scaled beta distribution: lim_{n→∞} n Beta(1, n) = Exp(1).
• If X1, ..., Xn ~ Exp(λ) are independent, then X1 + ... + Xn ~ Erlang(n, λ).
• If X ~ Exp(1) and Y ~ Exp(1) are independent, then log(X/Y) follows a logistic distribution: see skew-logistic distribution.
• Y ~ Gumbel(μ, β), i.e. Y has a Gumbel distribution, if Y = μ − β log(λX) and X ~ Exponential(λ).
• X ~ χ²₂, i.e. X has a chi-squared distribution with 2 degrees of freedom, if X = 2λY and Y ~ Exponential(λ).
Applications
Occurrence of events
The exponential distribution occurs naturally when describing the lengths of the inter-arrival times in a homogeneous
Poisson process.
The exponential distribution may be viewed as a continuous counterpart of the geometric distribution, which
describes the number of Bernoulli trials necessary for a discrete process to change state. In contrast, the exponential
distribution describes the time for a continuous process to change state.
In real-world scenarios, the assumption of a constant rate (or probability per unit time) is rarely satisfied. For
example, the rate of incoming phone calls differs according to the time of day. But if we focus on a time interval
during which the rate is roughly constant, such as from 2 to 4 p.m. during work days, the exponential distribution can
be used as a good approximate model for the time until the next phone call arrives. Similar caveats apply to the
following examples which yield approximately exponentially distributed variables:
• The time until a radioactive particle decays, or the time between clicks of a geiger counter
• The time it takes before your next telephone call
• The time until default (on payment to company debt holders) in reduced form credit risk modeling
Exponential variables can also be used to model situations where certain events occur with a constant probability per
unit length, such as the distance between mutations on a DNA strand, or between roadkills on a given road.
In queuing theory, the service times of agents in a system (e.g. how long it takes for a bank teller etc. to serve a
customer) are often modeled as exponentially distributed variables. (The arrival of customers, for instance, is
typically modeled by the Poisson distribution in most management science textbooks.) The length of a
process that can be thought of as a sequence of several independent tasks is better modeled by a variable following
the Erlang distribution (which is the distribution of the sum of several independent exponentially distributed
variables).
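The Erlang relation in the preceding sentence can be checked by simulation; the parameter values below are arbitrary:

```python
import random

# The sum of k independent Exponential(lam) task durations follows an
# Erlang(k, lam) distribution, with mean k/lam and variance k/lam**2.
random.seed(2)
k, lam, n = 4, 2.0, 100_000
totals = [sum(random.expovariate(lam) for _ in range(k)) for _ in range(n)]
mean_total = sum(totals) / n                                  # ~ k/lam = 2.0
var_total = sum((t - mean_total) ** 2 for t in totals) / n    # ~ k/lam**2 = 1.0
```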
Reliability theory and reliability engineering also
make extensive use of the exponential distribution.
Because of the memoryless property of this
distribution, it is well-suited to model the constant
hazard rate portion of the bathtub curve used in
reliability theory. It is also very convenient because it
is so easy to add failure rates in a reliability model.
The exponential distribution is however not
appropriate to model the overall lifetime of
organisms or technical devices, because the "failure
rates" here are not constant: more failures occur for
very young and for very old systems.
[Figure: Fitted cumulative exponential distribution to annually maximum 1-day rainfalls using CumFreq.[6]]
In physics, if you observe a gas at a fixed temperature and pressure in a uniform gravitational field, the heights of
the various molecules also follow an approximate exponential distribution. This is a consequence of the entropy
property mentioned below.
In hydrology, the exponential distribution is used to analyze extreme values of such variables as monthly and annual
maximum values of daily rainfall and river discharge volumes.[7]
The blue picture illustrates an example of fitting the exponential distribution to ranked annually maximum
one-day rainfalls showing also the 90% confidence belt based on the binomial distribution. The rainfall data
are represented by plotting positions as part of the cumulative frequency analysis.
Prediction
Having observed a sample of n data points from an unknown exponential distribution, a common task is to use these
samples to make predictions about future data from the same source. A common predictive distribution over future
samples is the so-called plug-in distribution, formed by plugging a suitable estimate for the rate parameter λ into the
exponential density function. A common choice of estimate is the one provided by the principle of maximum
likelihood, λ̂ = n/(x1 + ··· + xn), and using this yields the predictive density over a future sample x_(n+1),
conditioned on the observed samples x = (x1, ..., xn), given by

p_ML(x_(n+1) | x) = λ̂ exp(−λ̂ x_(n+1)).
The Bayesian approach provides a predictive distribution which takes into account the uncertainty of the estimated
parameter, although this may depend crucially on the choice of prior.
A predictive distribution free of the issues of choosing priors that arise under the subjective Bayesian approach is

p(x_(n+1) | x1, ..., xn) = n S^n / (S + x_(n+1))^(n+1), where S = x1 + ··· + xn,

which can be considered as (1) a frequentist confidence distribution, obtained from the distribution of the pivotal
quantity x_(n+1)/S;[8] (2) a profile predictive likelihood, obtained by eliminating the parameter λ from the joint
likelihood of x_(n+1) and λ by maximization;[9] (3) an objective Bayesian predictive posterior distribution, obtained
using the non-informative Jeffreys prior 1/λ; and (4) the Conditional Normalized Maximum Likelihood (CNML)
predictive distribution, from information theoretic considerations.[10]
The accuracy of a predictive distribution may be measured using the distance or divergence between the true
exponential distribution with rate parameter, λ0, and the predictive distribution based on the sample x. The
Kullback–Leibler divergence is a commonly used, parameterisation-free measure of the difference between two
distributions. Letting Δ(λ0||p) denote the Kullback–Leibler divergence between an exponential with rate parameter λ0
and a predictive distribution p it can be shown that
where the expectation is taken with respect to the exponential distribution with rate parameter λ0 ∈ (0, ∞), and ψ( · )
is the digamma function. It is clear that the CNML predictive distribution is strictly superior to the maximum
likelihood plug-in distribution in terms of average Kullback–Leibler divergence for all sample sizes n > 0.
References
[1] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http://www.wise.xmu.edu.cn/Master/Download/..\..\UploadFiles\paper-masterdownload\2009519932327055475115776.pdf). Journal of Econometrics (Elsevier): 219–230. Retrieved 2011-06-02.
[2] Ross, Sheldon M. (2009). Introduction to Probability and Statistics for Engineers and Scientists (http://books.google.com/books?id=mXP_UEiUo9wC&pg=PA267) (4th ed.). Associated Press. p. 267. ISBN 978-0-12-370483-2.
[3] Guerriero, V. et al. (2010). "Quantifying uncertainties in multi-scale studies of fractured reservoir analogues: Implemented statistical analysis of scan line data from carbonate rocks" (PDF). Journal of Structural Geology (Elsevier). doi:10.1016/j.jsg.2009.04.016.
[4] Knuth, Donald E. (1998). The Art of Computer Programming, Volume 2: Seminumerical Algorithms (3rd ed.). Boston: Addison–Wesley. ISBN 0-201-89684-2. See section 3.4.1, p. 133.
[5] Devroye, Luc (1986). Non-Uniform Random Variate Generation (http://luc.devroye.org/rnbookindex.html). New York: Springer-Verlag. ISBN 0-387-96305-7. See chapter IX (http://luc.devroye.org/chapter_nine.pdf), section 2, pp. 392–401.
[6] "CumFreq, a free computer program for cumulative frequency analysis" (http://www.waterlog.info/cumfreq.htm).
[7] Ritzema, H.P. (ed.) (1994). Frequency and Regression Analysis (http://www.waterlog.info/pdf/freqtxt.pdf). Chapter 6 in: Drainage Principles and Applications, Publication 16, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands. pp. 175–224. ISBN 90-70754-33-9.
[8] Lawless, J.F.; Fredette, M. (2005). "Frequentist prediction intervals and predictive distributions". Biometrika 92 (3): 529–542.
[9] Bjornstad, J.F. (1990). "Predictive Likelihood: A Review". Statistical Science 5 (2): 242–254.
[10] Schmidt, D. F.; Makalic, E. (2009). "Universal Models for the Exponential Distribution" (http://www.emakalic.org/blog/wp-content/uploads/2010/04/SchmidtMakalic09b.pdf). IEEE Transactions on Information Theory 55 (7): 3087–3090. doi:10.1109/TIT.2009.2018331.
External links
• Online calculator of Exponential Distribution (http://www.stud.feec.vutbr.cz/~xvapen02/vypocty/ex.
php?language=english)
F-distribution
Fisher–Snedecor
Mean: d2/(d2 − 2), for d2 > 2
Mode: ((d1 − 2)/d1) · (d2/(d2 + 2)), for d1 > 2
Variance: 2 d2² (d1 + d2 − 2) / (d1 (d2 − 2)² (d2 − 4)), for d2 > 4
Skewness: ((2 d1 + d2 − 2) √(8 (d2 − 4))) / ((d2 − 6) √(d1 (d1 + d2 − 2))), for d2 > 6
Ex. kurtosis: see text
CF: see text
In probability theory and statistics, the F-distribution is a continuous probability distribution.[1][2][3][4] It is also
known as Snedecor's F distribution or the Fisher-Snedecor distribution (after R.A. Fisher and George W.
Snedecor). The F-distribution arises frequently as the null distribution of a test statistic, most notably in the analysis
of variance; see F-test.
Definition
If a random variable X has an F-distribution with parameters d1 and d2, we write X ∼ F(d1, d2). Then the
probability density function for X is given by

f(x; d1, d2) = sqrt( (d1 x)^d1 · d2^d2 / (d1 x + d2)^(d1 + d2) ) / ( x · B(d1/2, d2/2) )

for real x ≥ 0. Here B is the beta function. In many applications, the parameters d1 and d2 are positive integers,
but the distribution is well-defined for positive real values of these parameters.
The cumulative distribution function is

F(x; d1, d2) = I_(d1 x/(d1 x + d2))(d1/2, d2/2),

where I is the regularized incomplete beta function.[5]
The k-th moment of an F(d1, d2) distribution exists and is finite only when 2k < d2, and it is equal to

E[X^k] = (d2/d1)^k · Γ(d1/2 + k) Γ(d2/2 − k) / ( Γ(d1/2) Γ(d2/2) ).
The F-distribution is a particular parametrization of the beta prime distribution, which is also called the beta
distribution of the second kind.
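The density can be evaluated numerically via the log-gamma function (a sketch assuming the standard form of the pdf; the crude Riemann sum below only checks that the density integrates to approximately 1):

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F-distribution with d1 and d2 degrees of freedom,
    f(x) = sqrt((d1*x)**d1 * d2**d2 / (d1*x + d2)**(d1+d2)) / (x * B(d1/2, d2/2)),
    evaluated in log space for numerical stability."""
    if x <= 0:
        return 0.0
    log_beta = math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    log_num = 0.5 * (d1 * math.log(d1 * x) + d2 * math.log(d2)
                     - (d1 + d2) * math.log(d1 * x + d2))
    return math.exp(log_num - log_beta) / x

# Crude normalization check: integrate over (0, 200] with a Riemann sum.
d1, d2 = 5, 8
step = 0.001
total = sum(f_pdf(i * step, d1, d2) * step for i in range(1, 200_000))
```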
The characteristic function is listed incorrectly in many standard references (e.g., [2]). The correct expression[6] is

φ(s) = ( Γ((d1 + d2)/2) / Γ(d2/2) ) · U(d1/2, 1 − d2/2, −(d2/d1) i s),

where U(a, b, z) is the confluent hypergeometric function of the second kind.
Characterization
A random variate of the F-distribution with parameters d1 and d2 arises as the ratio of two appropriately scaled
chi-squared variates:

X = (U1/d1) / (U2/d2),

where
• U1 and U2 have chi-squared distributions with d1 and d2 degrees of freedom respectively, and
• U1 and U2 are independent.
In instances where the F-distribution is used, for instance in the analysis of variance, independence of U1 and U2
might be demonstrated by applying Cochran's theorem.
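This characterization also gives a simple way to generate F variates; the sketch below uses the fact that a chi-squared variate with df degrees of freedom is a Gamma(df/2, scale 2) variate:

```python
import random

def chi2_variate(df, rng):
    """Chi-squared(df) equals Gamma(shape=df/2, scale=2)."""
    return rng.gammavariate(df / 2, 2)

# An F(d1, d2) variate is (U1/d1)/(U2/d2) for independent chi-squared
# variates U1, U2 with d1 and d2 degrees of freedom.
rng = random.Random(3)
d1, d2 = 5, 10
n = 200_000
f_samples = [(chi2_variate(d1, rng) / d1) / (chi2_variate(d2, rng) / d2)
             for _ in range(n)]
mean_f = sum(f_samples) / n   # expected to approach d2/(d2 - 2) = 1.25
```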
Generalization
A generalization of the (central) F-distribution is the noncentral F-distribution.
• Equivalently (by the characterization above), if X ∼ F(d1, d2), then d1X/(d1X + d2) ∼ Beta(d1/2, d2/2).
• If X ∼ F(d1, d2), then 1/X ∼ F(d2, d1).
References
[1] Johnson, Norman Lloyd; Kotz, Samuel; Balakrishnan, N. (1995). Continuous Univariate Distributions, Volume 2 (2nd ed., Section 27). Wiley. ISBN 0-471-58494-0.
[2] Abramowitz, Milton; Stegun, Irene A., eds. (1965). "Chapter 26" (http://www.math.sfu.ca/~cbm/aands/page_946.htm). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. New York: Dover. p. 946. ISBN 978-0486612720. MR0167642.
[3] NIST (2006). Engineering Statistics Handbook – F Distribution (http://www.itl.nist.gov/div898/handbook/eda/section3/eda3665.htm)
[4] Mood, Alexander; Graybill, Franklin A.; Boes, Duane C. (1974). Introduction to the Theory of Statistics (3rd ed., pp. 246–249). McGraw-Hill. ISBN 0-07-042864-6.
[5] Taboga, Marco. "The F distribution" (http://www.statlect.com/F_distribution.htm).
[6] Phillips, P. C. B. (1982). "The true characteristic function of the F distribution". Biometrika 69: 261–264. JSTOR 2335882.
External links
• Table of critical values of the F-distribution (http://www.itl.nist.gov/div898/handbook/eda/section3/
eda3673.htm)
• Earliest Uses of Some of the Words of Mathematics: entry on F-distribution contains a brief history (http://
jeff560.tripod.com/f.html)
F-test
An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis. It is most
often used when comparing statistical models that have been fit to a data set, in order to identify the model that best
fits the population from which the data were sampled. Exact F-tests mainly arise when the models have been fit to
the data using least squares. The name was coined by George W. Snedecor, in honour of Sir Ronald A. Fisher. Fisher
initially developed the statistic as the variance ratio in the 1920s.[1]
The formula for the one-way ANOVA F-test statistic is

F = (explained variance) / (unexplained variance),

or

F = (between-group variability) / (within-group variability).

The "explained variance", or "between-group variability", is

Σ_i n_i (Ȳ_i − Ȳ)² / (K − 1),

where Ȳ_i denotes the sample mean in the ith group, n_i is the number of observations in the ith group, Ȳ denotes
the overall mean of the data, and K denotes the number of groups.
The "unexplained variance", or "within-group variability", is

Σ_(i,j) (Y_ij − Ȳ_i)² / (N − K),

where Y_ij is the jth observation in the ith out of K groups and N is the overall sample size. This F-statistic follows the
F-distribution with (K − 1, N − K) degrees of freedom under the null hypothesis. The statistic will be large if the
between-group variability is large relative to the within-group variability, which is unlikely to happen if the
population means of the groups all have the same value.
Note that when there are only two groups for the one-way ANOVA F-test, F = t2 where t is the Student's t statistic.
Regression problems
Consider two models, 1 and 2, where model 1 is 'nested' within model 2. Model 1 is the Restricted model, and Model
2 is the Unrestricted one. That is, model 1 has p1 parameters, and model 2 has p2 parameters, where p2 > p1, and for
any choice of parameters in model 1, the same regression curve can be achieved by some choice of the parameters of
model 2. (We use the convention that any constant parameter in a model is included when counting the parameters.
For instance, the simple linear model y = mx + b has p = 2 under this convention.) The model with more parameters
will always be able to fit the data at least as well as the model with fewer parameters. Thus typically model 2 will
give a better (i.e. lower error) fit to the data than model 1. But one often wants to determine whether model 2 gives a
significantly better fit to the data. One approach to this problem is to use an F test.
If there are n data points from which to estimate the parameters of both models, then one can calculate the F
statistic, given by

F = ( (RSS1 − RSS2) / (p2 − p1) ) / ( RSS2 / (n − p2) ),
where RSSi is the residual sum of squares of model i. If your regression model has been calculated with weights,
then replace RSSi with χ2, the weighted sum of squared residuals. Under the null hypothesis that model 2 does not
provide a significantly better fit than model 1, F will have an F distribution, with (p2 − p1, n − p2) degrees of
freedom. The null hypothesis is rejected if the F calculated from the data is greater than the critical value of the
F-distribution for some desired false-rejection probability (e.g. 0.05). The F-test is a Wald test.
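The nested-model F statistic just defined is one line of arithmetic; the residual sums of squares below are hypothetical numbers for illustration:

```python
def regression_f_statistic(rss1, rss2, p1, p2, n):
    """F statistic for comparing nested least-squares models:
    model 1 (restricted, p1 parameters, residual sum of squares rss1)
    against model 2 (p2 > p1 parameters, rss2 <= rss1)."""
    return ((rss1 - rss2) / (p2 - p1)) / (rss2 / (n - p2))

# Hypothetical numbers: 20 data points, a 1-parameter mean-only model
# with RSS 50.0 against a 2-parameter straight-line fit with RSS 20.0.
f = regression_f_statistic(50.0, 20.0, 1, 2, 20)   # (30/1)/(20/18) = 27.0
```

Under the null hypothesis this value would be compared against the F(1, 18) critical value.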
Consider an experiment with three levels of a factor, a1, a2, and a3, and six observations per level:

a1 a2 a3
6 8 13
8 12 9
4 9 11
5 11 8
3 6 7
4 8 12
The null hypothesis, denoted H0, for the overall F-test for this experiment would be that all three levels of the factor
produce the same response, on average. To calculate the F-ratio:
Step 1: Calculate the mean within each group:

Ȳ1 = (6 + 8 + 4 + 5 + 3 + 4)/6 = 5, Ȳ2 = (8 + 12 + 9 + 11 + 6 + 8)/6 = 9, Ȳ3 = (13 + 9 + 11 + 8 + 7 + 12)/6 = 10.

Step 2: Calculate the overall mean: Ȳ = (5 + 9 + 10)/3 = 8.
Step 3: Calculate the "between-group" sum of squares:

S_B = 6(5 − 8)² + 6(9 − 8)² + 6(10 − 8)² = 84.

The between-group degrees of freedom are K − 1 = 2, so the between-group mean square is MS_B = 84/2 = 42.
Step 4: Calculate the "within-group" sum of squares. Begin by centering the data in each group:

a1 a2 a3
1 −1 3
3 3 −1
−1 0 1
0 2 −2
−2 −3 −3
−1 −1 2

The within-group sum of squares is the sum of squares of all 18 values in this table: S_W = 16 + 24 + 28 = 68. The
within-group degrees of freedom are N − K = 15, so MS_W = 68/15 ≈ 4.5, and the F-ratio is F = MS_B/MS_W = 42/4.5 ≈ 9.3.
After performing the F-test, it is common to carry out some "post-hoc" analysis of the group means. In this case, the
first two group means differ by 4 units, the first and third group means differ by 5 units, and the second and third
group means differ by only 1 unit. The standard error of each of these differences is √(2 MS_W/6) = √(2 × 4.5/6) ≈ 1.2.
Thus the first group is strongly different from the other groups, as the mean difference is more than 3 times the standard
error, so we can be highly confident that the population mean of the first group differs from the population means of
the other groups. However there is no evidence that the second and third groups have different population means
from each other, as their mean difference of one unit is comparable to the standard error.
Note F(x, y) denotes an F-distribution with x degrees of freedom in the numerator and y degrees of freedom in the
denominator.
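The full calculation for the example data above can be carried out directly:

```python
# One-way ANOVA F-ratio for the three groups in the worked example.
groups = {
    "a1": [6, 8, 4, 5, 3, 4],
    "a2": [8, 12, 9, 11, 6, 8],
    "a3": [13, 9, 11, 8, 7, 12],
}
k = len(groups)                                    # number of groups: 3
n_total = sum(len(g) for g in groups.values())     # total observations: 18
grand_mean = sum(sum(g) for g in groups.values()) / n_total    # 8.0

# "Between-group" sum of squares: group sizes times squared deviations
# of the group means from the grand mean.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups.values())         # 84.0
# "Within-group" sum of squares: squared deviations from group means.
ss_within = sum((y - sum(g) / len(g)) ** 2
                for g in groups.values() for y in g)           # 68.0

ms_between = ss_between / (k - 1)                  # 42.0
ms_within = ss_within / (n_total - k)              # 68/15, about 4.53
f_ratio = ms_between / ms_within                   # about 9.26
```

With (2, 15) degrees of freedom this F-ratio is well beyond the usual 5% critical value, matching the conclusion in the text.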
References
[1] Lomax, Richard G. (2007) Statistical Concepts: A Second Course, p. 10, ISBN 0-8058-5850-4
[2] Box, G.E.P. (1953). "Non-Normality and Tests on Variances". Biometrika 40 (3/4): 318–335. JSTOR 2333350.
[3] Markowski, Carol A; Markowski, Edward P. (1990). "Conditions for the Effectiveness of a Preliminary Test of Variance". The American
Statistician 44 (4): 322–326. doi:10.2307/2684360. JSTOR 2684360.
[4] Sawilowsky, S. (2002). "Fermat, Schubert, Einstein, and Behrens-Fisher:The Probable Difference Between Two Means When σ12 ≠ σ22".
Journal of Modern Applied Statistical Methods, 1(2), 461–472.
[5] Blair, R. C. (1981). "A reaction to 'Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and
covariance.'" Review of Educational Research, 51, 499-507.
[6] Randolf, E. A., & Barcikowski, R. S. (1989, November). "Type I error rate when real study values are used as population parameters in a
Monte Carlo study". Paper presented at the 11th annual meeting of the Mid-Western Educational Research Association, Chicago.
[7] Sawilowsky, S. (1990). Nonparametric tests of interaction in experimental design. Review of Educational Research, 25(20-59).
External links
• Testing utility of model – F-test (http://www.public.iastate.edu/~alicia/stat328/Multiple regression - F test.
pdf)
• F-test (http://rkb.home.cern.ch/rkb/AN16pp/node81.html)
• Table of F-test critical values (http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm)
• FTEST in Microsoft Excel which is different (http://office.microsoft.com/en-gb/excel-help/
ftest-HP005209098.aspx)
Fisher information
In mathematical statistics and information theory, the Fisher information (sometimes simply called information[1])
can be defined as the variance of the score, or as the expected value of the observed information. In Bayesian
statistics, the asymptotic distribution of the posterior mode depends on the Fisher information and not on the prior
(according to the Bernstein–von Mises theorem, which was anticipated by Laplace for exponential families).[2] The
role of the Fisher information in the asymptotic theory of maximum-likelihood estimation was emphasized by the
statistician R.A. Fisher (following some initial results by F. Y. Edgeworth). The Fisher information is also used in
the calculation of the Jeffreys prior, which is used in Bayesian statistics.
The Fisher-information matrix is used to calculate the covariance matrices associated with maximum-likelihood
estimates. It can also be used in the formulation of test statistics, such as the Wald test.
History
The Fisher information was discussed by several early statisticians, notably F. Y. Edgeworth.[3] For example,
Savage[4] says: "In it [Fisher information], he [Fisher] was to some extent anticipated (Edgeworth 1908–9 esp. 502,
507–8, 662, 677–8, 82–5 and references he [Edgeworth] cites including Pearson and Filon 1898 [. . .])." There are a
number of early historical sources[5] and a number of reviews of this early work.[6][7][8]
Definition
The Fisher information is a way of measuring the amount of information that an observable random variable X
carries about an unknown parameter θ upon which the probability of X depends. The probability function for X,
which is also the likelihood function for θ, is a function ƒ(X; θ); it is the probability mass (or probability density) of
the random variable X conditional on the value of θ. The partial derivative with respect to θ of the natural logarithm
of the likelihood function is called the score. Under certain regularity conditions, it can be shown that the first
moment of the score is 0. The second moment is called the Fisher information:

I(θ) = E[ (∂/∂θ ln f(X; θ))² | θ ],

where, for any given value of θ, the expression E[…|θ] denotes the conditional expectation over values for X with
respect to the probability function f(x; θ) given θ. Note that I(θ) ≥ 0. A random variable carrying high
Fisher information implies that the absolute value of the score is often high. The Fisher information is not a function
of a particular observation, as the random variable X has been averaged out.
Since the expectation of the score is zero, the Fisher information is also the variance of the score.
If log ƒ(x; θ) is twice differentiable with respect to θ, and under certain regularity conditions, then the Fisher
information may also be written as[9]

I(θ) = −E[ ∂²/∂θ² ln f(X; θ) | θ ].
Thus, the Fisher information is the negative of the expectation of the second derivative with respect to θ of the
natural logarithm of f. Information may be seen to be a measure of the "curvature" of the support curve near the
maximum likelihood estimate of θ. A "blunt" support curve (one with a shallow maximum) would have a low
negative expected second derivative, and thus low information; while a sharp one would have a high negative
expected second derivative and thus high information.
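For a single Bernoulli(p) observation both definitions can be evaluated exactly, since the expectation is a two-term sum over x ∈ {0, 1}; both give 1/(p(1 − p)):

```python
# Fisher information of one Bernoulli(p) observation, computed both as
# the variance of the score and as minus the expected second derivative
# of the log-likelihood; both equal 1/(p*(1-p)).
p = 0.3

# Score: d/dp log f(x; p) = x/p - (1-x)/(1-p), for x in {0, 1}.
score = {1: 1 / p, 0: -1 / (1 - p)}
mean_score = p * score[1] + (1 - p) * score[0]      # 0, as the theory requires
info_variance = p * score[1] ** 2 + (1 - p) * score[0] ** 2

# Curvature: d2/dp2 log f(x; p) = -x/p**2 - (1-x)/(1-p)**2.
info_curvature = -(p * (-1 / p ** 2) + (1 - p) * (-1 / (1 - p) ** 2))
```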
Information is additive, in that the information yielded by two independent experiments is the sum of the information
from each experiment separately:

I_(X,Y)(θ) = I_X(θ) + I_Y(θ).
This result follows from the elementary fact that if random variables are independent, the variance of their sum is the
sum of their variances. Hence the information in a random sample of size n is n times that in a sample of size 1 (if
observations are independent).
The information provided by a sufficient statistic is the same as that of the sample X. This may be seen by using
Neyman's factorization criterion for a sufficient statistic. If T(X) is sufficient for θ, then

f(X; θ) = g(T(X), θ) h(X)

for some functions g and h. See sufficient statistic for a more detailed explanation. The equality of information then
follows from the following fact:

∂/∂θ ln f(X; θ) = ∂/∂θ ln g(T(X), θ),

which follows from the definition of Fisher information, and the independence of h(X) from θ. More generally, if T
= t(X) is a statistic, then

I_T(θ) ≤ I_X(θ),

with equality if and only if T is a sufficient statistic.
The likelihood function ƒ(X; θ) describes the probability that we observe a given sample x given a known value of θ.
If ƒ is sharply peaked with respect to changes in θ, it is easy to intuit the "correct" value of θ given the data, and
hence the data contains a lot of information about the parameter. If the likelihood ƒ is flat and spread-out, then it
would take many, many samples of X to estimate the actual "true" value of θ. Therefore, we would intuit that the
data contain much less information about the parameter.
Now suppose θ̂ = θ̂(X) is an unbiased estimator of θ, so that E[θ̂ | θ] = θ. Differentiating this unbiasedness condition with respect to θ gives
We now make use of two facts. The first is that the likelihood ƒ is just the probability of the data given the parameter.
Since it is a probability, it must be normalized, implying that

∫ f(x; θ) dx = 1.
The left-most factor is the expected mean-squared error of the estimator θ̂, which for an unbiased estimator equals
its variance. Combining these facts through the Cauchy–Schwarz inequality yields the Cramér–Rao bound,
Var(θ̂) ≥ 1/I(θ). In other words, the precision to which we can estimate θ is fundamentally limited by the Fisher
information of the likelihood function.
For a Bernoulli experiment with n trials, the Fisher information about the success probability can be computed in a
short chain of steps: (1) defines Fisher information. (2) invokes the fact that the information in a sufficient statistic
is the same as that of the sample itself. (3) expands the natural logarithm term and drops a constant. (4) and (5)
differentiate with respect to θ. (6) replaces A and B with their expectations. (7) is algebra.
The end result, namely,

I(p) = n / ( p (1 − p) ),

is the reciprocal of the variance of the mean number of successes in n Bernoulli trials, as expected (see the last
sentence of the preceding section).
Matrix form
When there are N parameters, so that θ is an N×1 vector θ = (θ1, ..., θN)^T, then the Fisher information takes
the form of an N×N matrix, the Fisher information matrix (FIM), with typical element

(I(θ))_(i,j) = E[ (∂/∂θ_i ln f(X; θ)) (∂/∂θ_j ln f(X; θ)) | θ ].

The FIM is an N×N positive semidefinite symmetric matrix, defining a Riemannian metric on the N-dimensional
parameter space, thus connecting Fisher information to differential geometry. In that context, this metric is known as
the Fisher information metric, and the topic is called information geometry.
Under certain regularity conditions, the Fisher information matrix may also be written as

(I(θ))_(i,j) = −E[ ∂²/∂θ_i ∂θ_j ln f(X; θ) | θ ].
The metric is interesting in several ways; it can be derived as the Hessian of the relative entropy; it can be
understood as a metric induced from the Euclidean metric, after appropriate change of variable; in its
complex-valued form, it is the Fubini-Study metric.
Orthogonal parameters
We say that two parameters θi and θj are orthogonal if the element of the ith row and jth column of the Fisher
information matrix is zero. Orthogonal parameters are easy to deal with in the sense that their maximum likelihood
estimates are independent and can be calculated separately. When dealing with research problems, it is very common
for the researcher to invest some time searching for an orthogonal parametrization of the densities involved in the
problem.
For the multivariate normal distribution X ∼ N(μ(θ), Σ(θ)), the Fisher information matrix has typical element

I_(m,n) = (∂μ/∂θ_m)^T Σ^(−1) (∂μ/∂θ_n) + (1/2) tr( Σ^(−1) (∂Σ/∂θ_m) Σ^(−1) (∂Σ/∂θ_n) ),

where (·)^T denotes the transpose of a vector and tr(·) denotes the trace of a square matrix. Note that a special, but
very common, case is the one where Σ(θ) = Σ, a constant; then

I_(m,n) = (∂μ/∂θ_m)^T Σ^(−1) (∂μ/∂θ_n).
In this case the Fisher information matrix may be identified with the coefficient matrix of the normal equations of
least squares estimation theory.
Another special case is that in which the mean and covariance depend on two different parameter vectors, say, β and
θ. This is especially popular in the analysis of spatial data, which often uses a linear model with correlated residuals.
In this case the Fisher information matrix is block-diagonal in β and θ. A proof of this special case is given in the
literature;[10] the same technique yields the general result.
Properties
Reparametrization
The Fisher information depends on the parametrization of the problem. If θ and η are two scalar parametrizations of
an estimation problem, and θ is a continuously differentiable function of η, then

I_η(η) = I_θ(θ(η)) (dθ/dη)².

In the vector case, I_η(η) = J^T I_θ(θ(η)) J, where the (i, j)th element of the k × k Jacobian matrix J is defined by
J_(ij) = ∂θ_i/∂η_j.
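The scalar reparametrization rule can be checked on a standard closed-form example: for one exponential observation, the information about the rate λ is 1/λ² and about the scale θ = 1/λ it is 1/θ²:

```python
# Reparametrization check for a scalar parameter.  For one Exponential
# observation, the Fisher information about the rate lam is 1/lam**2
# and about the scale theta = 1/lam it is 1/theta**2.  The chain rule
# I_eta(eta) = I_theta(theta(eta)) * (d theta / d eta)**2 links them.
lam = 2.0                      # rate parametrization: I(lam) = 1/lam**2
theta = 1 / lam                # scale parametrization: I(theta) = 1/theta**2
dtheta_dlam = -1 / lam ** 2    # derivative of the map theta(lam)

info_rate = 1 / lam ** 2
info_from_scale = (1 / theta ** 2) * dtheta_dlam ** 2   # equals info_rate
```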
Applications
Relation to KL-divergence
The Fisher information matrix is the Hessian matrix (matrix of second derivatives) of the Kullback–Leibler
divergence of the distribution p(·; θ) from the true distribution p(·; θ0).[17] Here, θ0 is the true value of the
parameter and derivatives are taken with respect to θ.
The difference between the negative Hessian and the Fisher information is
This extra term vanishes if one instead considers the Hessian of the relative entropy rather than that of the Shannon
entropy; the relative entropy can be thought of as incorporating the Bayesian prior into the calculation.
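The Hessian-of-KL relation can be verified numerically for the Bernoulli family, where the Fisher information is 1/(p0(1 − p0)); the finite-difference step below is an arbitrary small value:

```python
import math

def kl_bernoulli(p0, p):
    """Kullback-Leibler divergence D(Bernoulli(p0) || Bernoulli(p))."""
    return (p0 * math.log(p0 / p)
            + (1 - p0) * math.log((1 - p0) / (1 - p)))

# Numerical second derivative of the divergence in p, evaluated at the
# true parameter p0; it should match the Fisher information.
p0, h = 0.4, 1e-4
hessian = (kl_bernoulli(p0, p0 + h) - 2 * kl_bernoulli(p0, p0)
           + kl_bernoulli(p0, p0 - h)) / h ** 2
fisher_info = 1 / (p0 * (1 - p0))   # about 4.167
```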
Equality
In particular, the Fisher Information matrix will be the same as the negative of the Hessian of the entropy in
situations where is zero for all i, j, X, and θ. For instance, a two-dimensional example that makes the
two equal is
Inequality
A one-dimensional example where the Fisher information differs from the negative Hessian of the entropy is the
normal distribution with unknown mean, f(x; θ) = N(θ, σ²). In this case, the entropy H is independent of the
distribution mean θ. Thus, the second derivative of the entropy with respect to θ is zero. However, for the Fisher
information, we have I(θ) = 1/σ².
Notes
[1] Lehmann and Casella, p. 115
[2] Lucien Le Cam (1986) Asymptotic Methods in Statistical Decision Theory: Pages 336 and 618–621 (von Mises and Bernstein).
[3] Savage (1976)
[4] Savage(1976), page 156
[5] Edgeworth (Sept. 1908, Dec. 1908)
[6] Pratt(1976)
[7] Stigler (1978,1986,1999)
[8] Hald (1998,1999)
[9] Lehmann and Casella, eq. (2.5.16).
[10] Maximum likelihood estimation of models for residual covariance in spatial regression, K. V. Mardia and R. J. Marshall, Biometrika (1984),
71, 1, pp. 135-46
[11] Lehmann and Casella, eq. (2.5.11).
[12] Lehmann and Casella, eq. (2.6.16)
[13] W. Janke, D. A. Johnston, and R. Kenna, Physica A 336, 181 (2004).
[14] M. Prokopenko, J. T. Lizier, O. Obst, and X. R. Wang, Relating Fisher information to order parameters, Physical Review E, 84, 041116,
2011.
[15] Friedrick Pukelsheim, Optimal Design of Experiments, 1993
[16] Bayesian theory, Jose M. Bernardo and Adrian FM. Smith, John Wiley & Sons, 1994
[17] Christian Gourieroux. Statistics and Econometric Models. 1995. p88
References
• Edgeworth, F. Y. (Sep. 1908). "On the Probable Errors of Frequency-Constants". Journal of the Royal Statistical
Society 71 (3): 499–512. doi:10.2307/2339293. JSTOR 2339293.
• Edgeworth, F. Y. (Dec. 1908). "On the Probable Errors of Frequency-Constants". Journal of the Royal Statistical
Society 71 (4): 651–678. doi:10.2307/2339378. JSTOR 2339378.
• Frieden, B. Roy (2004) Science from Fisher Information: A Unification. Cambridge Univ. Press. ISBN
0-521-00911-1.
• Hald, A. (May 1999). "On the History of Maximum Likelihood in Relation to Inverse Probability and Least
Squares". Statistical Science 14 (2): 214–222. JSTOR 2676741.
• Hald, A. (1998). A History of Mathematical Statistics from 1750 to 1930. New York: Wiley.
ISBN 0-471-17912-4.
• Lehmann, E. L.; Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. ISBN 0-387-98502-6.
• Le Cam, Lucien (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag.
ISBN 0-387-96307-3.
• Pratt, John W. (May 1976). "F. Y. Edgeworth and R. A. Fisher on the Efficiency of Maximum Likelihood
Estimation". The Annals of Statistics 4 (3): 501–514. doi:10.1214/aos/1176343457. JSTOR 2958222.
• Leonard J. Savage (May 1976). "On Rereading R. A. Fisher". The Annals of Statistics 4 (3): 441–500.
doi:10.1214/aos/1176343456. JSTOR 2958221.
• Schervish, Mark J. (1995). "Section 2.3.1". Theory of Statistics. New York: Springer. ISBN 0-387-94546-6.
• Stephen Stigler (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Harvard
University Press. ISBN 0-674-40340-1.
• Stephen M. Stigler (1978). "Francis Ysidro Edgeworth, Statistician". Journal of the Royal Statistical Society.
Series A (General) 141 (3): 287–322. doi:10.2307/2344804. JSTOR 2344804.
• Stephen Stigler (1999). Statistics on the Table: The History of Statistical Concepts and Methods. Harvard
University Press. ISBN 0-674-83601-4.
• Van Trees, H. L. (1968). Detection, Estimation, and Modulation Theory, Part I. New York: Wiley.
ISBN 0-471-09517-6.
External links
• Fisher4Cast: a Matlab, GUI-based Fisher information tool (http://www.mathworks.com/matlabcentral/
fileexchange/loadFile.do?objectId=20008&objectType=File) for research and teaching, primarily aimed at
cosmological forecasting applications.
• FandPLimitTool (http://www4.utsouthwestern.edu/wardlab/fandplimittool.asp) a GUI-based software to
calculate the Fisher information and CRLB with application to single-molecule microscopy.
Fisher's exact test
Example
For example, a sample of teenagers might be divided into male and female on the one hand, and those that are and
are not currently dieting on the other. We hypothesize, for example, that the proportion of dieting individuals is
higher among the women than among the men, and we want to test whether any difference of proportions that we
observe is significant. The data might look like this:
            Men   Women   Row total
Dieting       1       9          10
Non-dieting  11       3          14
Col. total   12      12          24
These data would not be suitable for analysis by a chi-squared test, because the expected values in the table are all
below 10, and in a 2 × 2 contingency table, the number of degrees of freedom is always 1.
The question we ask about these data is: knowing that 10 of these 24 teenagers are dieters, and that 12 of the 24 are
female, what is the probability that these 10 dieters would be so unevenly distributed between the women and the
men? If we were to choose 10 of the teenagers at random, what is the probability that 9 of them would be among the
12 women, and only 1 from among the 12 men?
Before we proceed with the Fisher test, we first introduce some notation. We represent the cells by the letters a, b, c
and d, call the totals across rows and columns marginal totals, and represent the grand total by n. So the table now
looks like this:
            Men     Women    Row total
Dieting       a         b        a + b
Non-dieting   c         d        c + d
Col. total  a + c     b + d    a + b + c + d (= n)
Fisher showed that the probability of obtaining any such set of values was given by the hypergeometric distribution:

p = C(a + b, a) C(c + d, c) / C(n, a + c) = ( (a + b)! (c + d)! (a + c)! (b + d)! ) / ( a! b! c! d! n! ),

where C(n, k) = n!/(k!(n − k)!) is the binomial coefficient and the symbol ! indicates the factorial operator.
This formula gives the exact probability of observing this particular arrangement of the data, assuming the given
marginal totals, on the null hypothesis that men and women are equally likely to be dieters. To put it another way, if
we assume that the probability that a man is a dieter is p, the probability that a woman is a dieter is p, and we assume
that both men and women enter our sample independently of whether or not they are dieters, then this
hypergeometric formula gives the conditional probability of observing the values a, b, c, d in the four cells,
conditionally on the observed marginals. This remains true even if men enter our sample with different probabilities
than women. The requirement is merely that the two classification characteristics - gender, and dieter (or not) - are
not associated. For example, suppose we knew probabilities P,Q,p,q with P+Q=p+q=1 such that (male dieter, male
non-dieter, female dieter, female non-dieter) had respective probabilities (Pp,Pq,Qp,Qq) for each individual
encountered under our sampling procedure. Then still, were we to calculate the distribution of cell entries conditional
given marginals, we would obtain the above formula in which neither p nor P occurs. Thus, we can calculate the
exact probability of any arrangement of the 24 teenagers into the four cells of the table, but Fisher showed that to
generate a significance level, we need consider only the cases where the marginal totals are the same as in the
observed table, and among those, only the cases where the arrangement is as extreme as the observed arrangement,
or more so. (Barnard's test relaxes this constraint on one set of the marginal totals.) In the example, there are 11 such
cases. Of these only one is more extreme in the same direction as our data; it looks like this:
              Men   Women   Total
Dieting        0     10      10
Non-dieting   12      2      14
Totals        12     12      24
In order to calculate the significance of the observed data, i.e. the total probability of observing data as extreme or
more extreme if the null hypothesis is true, we have to calculate the values of p for both these tables, and add them
together. This gives a one-tailed test; for a two-tailed test we must also consider tables that are equally extreme but in
the opposite direction. Unfortunately, classification of the tables according to whether or not they are 'as extreme' is
problematic. The approach used by R's fisher.test function computes the p-value by summing the
probabilities of all tables with probability less than or equal to that of the observed table. For tables with small
counts, the 2-sided p-value can differ substantially from twice the 1-sided value, unlike the case with test statistics
that have a symmetric sampling distribution.
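This summation rule can be sketched directly with Python's math.comb (an illustrative implementation of the convention just described, not the code any particular package uses):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed one."""
    n = a + b + c + d
    r1, c1 = a + b, a + c              # first row total, first column total
    denom = comb(n, c1)

    def p_table(x):                    # probability of the table with cell (1,1) = x
        return comb(r1, x) * comb(n - r1, c1 - x) / denom

    p_obs = p_table(a)
    lo, hi = max(0, c1 - (n - r1)), min(r1, c1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-12))

# The dieting example: 1 of 12 men and 9 of 12 women were dieting.
p = fisher_exact_2x2(1, 9, 11, 3)
```

For small counts such as these, this direct enumeration over the eleven admissible tables is entirely adequate.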
As noted above, most modern statistical packages will calculate the significance of Fisher tests, in some cases even
where the chi-squared approximation would also be acceptable. The actual computations as performed by statistical
software packages will as a rule differ from those described above, because numerical difficulties may result from
the large values taken by the factorials. A simple, somewhat better computational approach relies on a gamma
function or log-gamma function, but methods for accurate computation of hypergeometric and binomial probabilities
remain an active research area.
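The log-gamma approach just mentioned can be sketched as follows (illustrative; math.lgamma is Python's log-gamma):

```python
from math import lgamma, exp

def log_comb(n, k):
    # ln C(n, k) via the log-gamma function: lgamma(x + 1) = ln(x!)
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def hypergeom_prob(a, b, c, d):
    """P(table) = C(a+b, a) C(c+d, c) / C(n, a+c), computed in log space
    so that the large factorials never overflow."""
    n = a + b + c + d
    return exp(log_comb(a + b, a) + log_comb(c + d, c) - log_comb(n, a + c))

p1 = hypergeom_prob(1, 9, 11, 3)   # one table from the dieting example
```

Working in log space keeps the computation stable even when the cell counts run into the hundreds, where the individual factorials would overflow floating point.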
Controversies
Despite the fact that Fisher's test gives exact p-values, some authors have argued that it is conservative, i.e. that its
actual rejection rate is below the nominal significance level.[10][11][12] The apparent contradiction stems from the
combination of a discrete statistic with fixed significance levels.[13][14] To be more precise, consider the following
proposal for a significance test at the 5%-level: reject the null hypothesis for each table to which Fisher's test assigns
a p-value equal to or smaller than 5%. Because the set of all tables is discrete, there may not be a table for which
equality is achieved. If is the largest p-value smaller than 5% which can actually occur for some table, then the
proposed test effectively tests at the -level. For small sample sizes, might be significantly lower than
[10][11][12]
5%. While this effect occurs for any discrete statistic (not just in contingency tables, or for Fisher's test), it
has been argued that the problem is compounded by the fact that Fisher's test conditions on the marginals.[15] To
avoid the problem, many authors discourage the use of fixed significance levels when dealing with discrete
problems.[13][14]
Another early discussion revolved around the necessity to condition on the marginals.[16][17] Fisher's test gives exact
p-values both for fixed and for random marginals. Other tests, most prominently Barnard's, require random
marginals. Some authors[13][14][17] (including, later, Barnard himself[13]) have criticized Barnard's test based on this
property. They argue that the marginal totals are an (almost[14]) ancillary statistic, containing (almost) no
information about the tested property.
Alternatives
An alternative exact test, Barnard's exact test, has been developed and proponents of it suggest that this method is
more powerful, particularly in 2 × 2 tables. Another alternative is to use maximum likelihood estimates to calculate a
p-value from the exact binomial or multinomial distributions and accept or reject based on the p-value.
References
[1] Fisher, R. A. (1922). "On the interpretation of χ2 from contingency tables, and the calculation of P". Journal of the Royal Statistical Society
85 (1): 87–94. doi:10.2307/2340521. JSTOR 2340521.
[2] Fisher, R.A. (1954). Statistical Methods for Research Workers. Oliver and Boyd. ISBN 0-05-002170-2.
[3] Agresti, Alan (1992). "A Survey of Exact Inference for Contingency Tables". Statistical Science 7 (1): 131–153. doi:10.1214/ss/1177011454.
JSTOR 2246001.
[4] Larntz, Kinley (1978). "Small-sample comparisons of exact levels for chi-squared goodness-of-fit statistics". Journal of the American
Statistical Association 73 (362): 253–263. doi:10.2307/2286650. JSTOR 2286650.
[5] Mehta, Cyrus R; Patel, Nitin R; Tsiatis, Anastasios A (1984). "Exact significance testing to establish treatment equivalence with ordered
categorical data". Biometrics 40 (3): 819–825. doi:10.2307/2530927. JSTOR 2530927. PMID 6518249.
[6] Mehta, C. R. 1995. SPSS 6.1 Exact test for Windows. Englewood Cliffs, NJ: Prentice Hall.
[7] Mehta C.R., Patel N.R. (1983). "A Network Algorithm for Performing Fisher's Exact Test in r × c Contingency Tables". Journal of the
American Statistical Association 78 (382): 427–434. doi:10.2307/2288652.
[8] mathworld.wolfram.com (http://mathworld.wolfram.com/FishersExactTest.html) Page giving the formula for the general form of Fisher's
exact test for m × n contingency tables
[9] Cyrus R. Mehta and Nitin R. Patel (1986). "ALGORITHM 643: FEXACT: a FORTRAN subroutine for Fisher's exact test on unordered r×c
contingency tables". ACM Trans. Math. Softw. 12 (2): 154–161. doi:10.1145/6497.214326.
[10] Liddell, Douglas (1976). "Practical tests of 2×2 contingency tables". The Statistician 25 (4): 295–304. doi:10.2307/2988087.
JSTOR 2988087.
[11] Berkson, Joseph (1978). "In dispraise of the exact test". Journal of Statistical Planning and Inference 2: 27–42.
[12] D'Agostino, R. B., Chase, W., and Belanger, A. (1988). "The Appropriateness of Some Common Procedures for Testing Equality of Two
Independent Binomial Proportions". The American Statistician 42 (3): 198–202. doi:10.2307/2685002. JSTOR 2685002.
[13] Yates, F. (1984). "Tests of Significance for 2 x 2 Contingency Tables (with discussion)". Journal of the Royal Statistical Society, Ser. A 147
(3): 426–463. doi:10.2307/2981577. JSTOR 2981577.
[14] Roderick J. A. Little (1989). "Testing the Equality of Two Independent Binomial Proportions". The American Statistician 43 (4): 283–288.
doi:10.2307/2685390. JSTOR 2685390.
[15] Cyrus R. Mehta and Pralay Senchaudhuri, "Conditional versus Unconditional Exact Tests for Comparing Two Binomials" (4 September
2003). Retrieved 20 November 2009 from http://www.cytel.com/Papers/twobinomials.pdf
[16] Barnard, G.A (1945). "A New Test for 2×2 Tables". Nature 156 (3954): 177. doi:10.1038/156177a0.
[17] Fisher (1945). "A New Test for 2 × 2 Tables". Nature 156: 388. doi:10.1038/156388a0.; Barnard, G.A (1945). "A New Test for 2×2 Tables".
Nature 156: 783. doi:10.1038/156783b0.
Web Resources
Calculate Fisher's Exact Test Online (http://www.langsrud.com/stat/fisher.htm)
Gamma distribution 221
Gamma distribution
In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability
distributions. There are two different parameterizations in common use:
1. With a shape parameter k and a scale parameter θ.
2. With a shape parameter α = k and an inverse scale parameter β = 1⁄θ, called a rate parameter.
The parameterization with k and θ appears to be more common in econometrics and certain other applied fields,
where e.g. the gamma distribution is frequently used to model waiting times. For instance, in life testing, the waiting
time until death is a random variable that is frequently modeled with a gamma distribution.[1]
The parameterization with α and β is more common in Bayesian statistics, where the gamma distribution is used as a
conjugate prior distribution for various types of inverse scale (aka rate) parameters, such as the λ of an exponential
distribution or a Poisson distribution — or for that matter, the β of the gamma distribution itself. (The closely related
inverse gamma distribution is used as a conjugate prior for scale parameters, such as the variance of a normal
distribution.)
If k is an integer, then the distribution represents an Erlang distribution; i.e., the sum of k independent exponentially
distributed random variables, each of which has a mean of θ (which is equivalent to a rate parameter of 1/θ).
Equivalently, if α is an integer, then the distribution again represents an Erlang distribution, i.e. the sum of α
independent exponentially distributed random variables, each of which has a mean of 1/β (which is equivalent to a
rate parameter of β).
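The Erlang characterization above is easy to check by simulation: a sum of k independent exponentials with mean θ should reproduce the moments of Γ(k, θ). A small sketch (parameter values arbitrary):

```python
import random
from statistics import mean, variance

# A sum of k independent exponentials with mean theta should match
# Gamma(k, theta), whose mean is k*theta = 8 and variance k*theta^2 = 16
# for k = 4, theta = 2.
random.seed(3)
k, theta, n = 4, 2.0, 60_000
erlang = [sum(random.expovariate(1 / theta) for _ in range(k)) for _ in range(n)]
m, v = mean(erlang), variance(erlang)
```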
The gamma distribution is the maximum entropy probability distribution for a random variable X for which
E[X] = kθ = α/β is fixed and greater than zero, and E[ln X] = ψ(k) + ln θ = ψ(α) − ln β is fixed (ψ is the
digamma function).[2]
[Figure: Illustration of the gamma PDF for parameter values over k and x, with θ set to 1, 2, 3, 4, 5 and 6; the surface can also be viewed layered by θ,[3] by k,[4] and by x.[5]]
Both parametrizations are common because either can be more convenient depending on the situation.
Properties
Skewness
The skewness depends only on the shape parameter (α); the distribution approaches a normal distribution when α is large
(approximately when α > 10).
Median calculation
Unlike the mode and the mean, which have readily calculable formulas based on the parameters, the median does not
have an easy closed-form equation. The median for this distribution is defined as the constant x₀ such that

∫₀^x₀ x^(k−1) e^(−x/θ) / (Γ(k) θ^k) dx = 1/2.

The difficulty of this calculation depends on the parameter k; it is best done by computer, since the quantities
involved quickly grow out of hand.
For the Γ(n + 1, 1) distribution the median (ν) is known[7] to lie between the mode n and the mean n + 1.[8]
A method of estimating the median for any gamma distribution has been derived based on the ratio μ/(μ − ν), which
to a very good approximation is a linear function of α when α ≥ 1.[9]
Summation
If X_i has a Γ(k_i, θ) distribution for i = 1, 2, ..., N (i.e., all distributions have the same scale parameter θ) and the
X_i are independent, then

∑_{i=1}^N X_i ∼ Γ(∑_{i=1}^N k_i, θ).
Scaling
If X ∼ Γ(k, θ), then for any c > 0,

cX ∼ Γ(k, cθ).
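Both properties can be checked empirically with the standard library's random.gammavariate (a quick simulation sketch with arbitrary parameters):

```python
import random
from statistics import mean

# A sum of independent Gamma(k_i, theta) draws should behave like
# Gamma(sum k_i, theta), and c*X like Gamma(k, c*theta).
random.seed(42)
theta, shapes, c = 2.0, [0.5, 1.5, 3.0], 4.0
n = 50_000

sums = [sum(random.gammavariate(k, theta) for k in shapes) for _ in range(n)]
scaled = [c * random.gammavariate(2.0, theta) for _ in range(n)]

m_sum = mean(sums)       # Gamma(5, 2) has mean 10
m_scaled = mean(scaled)  # Gamma(2, 8) has mean 16
```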
Exponential family
The Gamma distribution is a two-parameter exponential family with natural parameters k − 1 and −1⁄θ (equivalently,
α − 1 and −β), and natural statistics X and ln(X).
If the shape parameter α is held fixed, the resulting one-parameter family of distributions is a natural exponential
family.
Logarithmic expectation
One can show that

E[ln X] = ψ(α) − ln β,

or equivalently,

E[ln X] = ψ(k) + ln θ,

where ψ is the digamma function.
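This identity can be checked numerically; the sketch below approximates the digamma function ψ by a finite difference of math.lgamma (illustrative only; a special-function library would normally provide ψ):

```python
import random
from math import lgamma, log
from statistics import mean

def digamma(x, h=1e-5):
    # psi(x) as a central difference of ln Gamma (numerical sketch)
    return (lgamma(x + h) - lgamma(x - h)) / (2 * h)

random.seed(0)
k, theta = 3.0, 2.0                          # arbitrary example parameters
samples = [random.gammavariate(k, theta) for _ in range(100_000)]

empirical = mean(log(x) for x in samples)
predicted = digamma(k) + log(theta)          # E[ln X] = psi(k) + ln(theta)
```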
Information entropy
The information entropy can be derived as

H(X) = k + ln θ + ln Γ(k) + (1 − k) ψ(k).
Kullback–Leibler divergence
The Kullback–Leibler divergence (KL-divergence), like the information entropy and various other theoretical
properties, is more commonly seen using the α, β parameterization because of its uses in Bayesian and other
theoretical statistics frameworks.
The KL-divergence of Γ(α_p, β_p) (the "true" distribution) from Γ(α_q, β_q) (the "approximating" distribution) is given
by[10]

D_KL(α_p, β_p; α_q, β_q) = (α_p − α_q) ψ(α_p) − ln Γ(α_p) + ln Γ(α_q) + α_q (ln β_p − ln β_q) + α_p (β_q − β_p)/β_p.

[Figure: Illustration of the Kullback–Leibler (KL) divergence for two gamma PDFs, with β = β₀ + 1 for β₀ set to 1, 2, 3, 4, 5 and 6. The typical asymmetry of the KL divergence is clearly visible.]
Laplace transform
The Laplace transform of the gamma PDF is

F(s) = (1 + θs)^(−k) = β^α / (s + β)^α.
Parameter estimation
Maximum likelihood estimation
The likelihood function for N independent and identically distributed observations (x₁, ..., x_N) is

L(k, θ) = ∏_{i=1}^N f(x_i; k, θ),

from which we calculate the log-likelihood function

ℓ(k, θ) = (k − 1) ∑_i ln x_i − (∑_i x_i)/θ − Nk ln θ − N ln Γ(k).

Finding the maximum with respect to θ by taking the derivative and setting it equal to zero yields the maximum
likelihood estimator of the θ parameter:

θ̂ = (1/(kN)) ∑_i x_i.

Substituting this back and finding the maximum with respect to k by taking the derivative and setting it equal to zero yields

ln k − ψ(k) = ln((1/N) ∑_i x_i) − (1/N) ∑_i ln x_i,

where ψ is the digamma function. There is no closed-form solution for k. If we let

s = ln((1/N) ∑_i x_i) − (1/N) ∑_i ln x_i,

then k is approximately

k ≈ (3 − s + √((s − 3)² + 24s)) / (12s),

which is within 1.5% of the correct value. An explicit form for the Newton–Raphson update of this initial guess is
given by Choi and Wette (1969) as the following expression:

k ← k − (ln k − ψ(k) − s) / (1/k − ψ′(k)),
where ψ′ denotes the trigamma function (the derivative of the digamma function).
The digamma and trigamma functions can be difficult to calculate with high precision. However, approximations
known to be good to several significant figures can be computed using the following approximation formulae:

ψ(k) ≈ ln k − 1/(2k) − 1/(12k²) + 1/(120k⁴)

and

ψ′(k) ≈ 1/k + 1/(2k²) + 1/(6k³) − 1/(30k⁵).
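The whole estimation procedure can be sketched as follows, with ψ and ψ′ approximated by finite differences of math.lgamma (illustrative; real implementations use proper digamma/trigamma routines):

```python
import random
from math import lgamma, log, sqrt

def digamma(x, h=1e-5):
    # numerical psi via a central difference of ln Gamma (illustrative only)
    return (lgamma(x + h) - lgamma(x - h)) / (2 * h)

def trigamma(x, h=1e-4):
    # numerical psi' via a second central difference of ln Gamma
    return (lgamma(x + h) - 2 * lgamma(x) + lgamma(x - h)) / (h * h)

def fit_gamma_mle(xs, iterations=10):
    """Closed-form initial guess for k followed by Newton-Raphson on
    ln k - psi(k) = s; then theta-hat = mean / k."""
    m = sum(xs) / len(xs)
    s = log(m) - sum(log(x) for x in xs) / len(xs)
    k = (3 - s + sqrt((s - 3) ** 2 + 24 * s)) / (12 * s)   # initial guess
    for _ in range(iterations):                            # Newton updates
        k -= (log(k) - digamma(k) - s) / (1 / k - trigamma(k))
    return k, m / k

random.seed(1)
data = [random.gammavariate(2.5, 3.0) for _ in range(20_000)]
k_hat, theta_hat = fit_gamma_mle(data)
```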
Bayesian minimum mean-squared error
With known k and the scale-invariant prior p(θ) ∝ 1/θ, the posterior for θ given N observations is proportional to
θ^(−Nk−1) e^(−y/θ). Denoting y = ∑_i x_i, integration over θ can be carried out using a change of variables, revealing that
1/θ is gamma-distributed with parameters α = Nk, β = y,
which shows that the mean ± standard deviation estimate of the posterior distribution for θ is

y/(Nk − 1) ± y/((Nk − 1)√(Nk − 2)).
Generating gamma-distributed random variables
Given the scaling property above, it is enough to generate gamma variates with θ = 1 and to later convert to any
value of θ with a simple multiplication. Using the fact that a Γ(1, 1) distribution is the same as an exponential
distribution with mean 1, if U is uniformly distributed on (0, 1], then −ln U ∼ Γ(1, 1). Now, using the "α-addition"
property, for integer n,

−∑_{i=1}^n ln U_i ∼ Γ(n, 1),

where the U_i are all uniformly distributed on (0, 1] and independent. All that is left now is to generate a variable
distributed as Γ(δ, 1) for 0 < δ < 1 (the fractional part of the shape parameter) and apply the "α-addition" property once more.
This is the most difficult part.
Random generation of gamma variates is discussed in detail by Devroye,[11] noting that none are uniformly fast for
all shape parameters. For small values of the shape parameter, the algorithms are often not valid.[12] For arbitrary
values of the shape parameter, one can apply the Ahrens and Dieter[13] modified acceptance-rejection method
Algorithm GD (shape k ≥ 1), or transformation method[14] when 0 < k < 1. Also see Cheng and Feast Algorithm
GKM 3[15] or Marsaglia's squeeze method.[16]
The following is a version of the Ahrens-Dieter acceptance-rejection method:[13]
1. Let m be 1.
2. Generate V_{3m−2}, V_{3m−1} and V_{3m} as independent uniformly distributed on (0, 1] variables.
3. If V_{3m−2} ≤ v₀, where v₀ = e/(e + δ), then go to step 4, else go to step 5.
4. Let ξ_m = V_{3m−1}^(1/δ), η_m = V_{3m} ξ_m^(δ−1). Go to step 6.
5. Let ξ_m = 1 − ln V_{3m−1}, η_m = V_{3m} e^(−ξ_m).
6. If η_m > ξ_m^(δ−1) e^(−ξ_m), then increment m and go to step 2.
7. Assume ξ = ξ_m to be the realization of Γ(δ, 1).
A summary of this is

θ (ξ − ∑_{i=1}^{[k]} ln U_i) ∼ Γ(k, θ),

where
• [k] is the integral part of k,
• ξ has been generated using the algorithm above with δ = {k} (the fractional part of k),
• the U_i and the V_l are distributed as explained above and are all independent.
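A direct transcription of the method into Python (an illustrative sketch; 1 − random() is used to keep the uniform draws in (0, 1]):

```python
import math
import random

def gamma_variate(k, theta, rng=random):
    """Gamma(shape k, scale theta) sampler following the scheme above:
    Ahrens-Dieter acceptance-rejection for the fractional part of k,
    plus a sum of exponentials for the integer part."""
    u = lambda: 1.0 - rng.random()           # uniform on (0, 1]
    delta, n = k - int(k), int(k)
    xi = 0.0
    if delta > 0:
        v0 = math.e / (math.e + delta)
        while True:
            v1, v2, v3 = u(), u(), u()
            if v1 <= v0:                     # propose from x^(delta-1) on (0, 1)
                xi = v2 ** (1 / delta)
                eta = v3 * xi ** (delta - 1)
            else:                            # propose from the exponential tail
                xi = 1 - math.log(v2)
                eta = v3 * math.exp(-xi)
            if eta <= xi ** (delta - 1) * math.exp(-xi):
                break                        # accept xi as a Gamma(delta, 1) draw
    return theta * (xi - sum(math.log(u()) for _ in range(n)))

random.seed(7)
draws = [gamma_variate(2.7, 1.5) for _ in range(40_000)]
sample_mean = sum(draws) / len(draws)        # should be near k*theta = 4.05
```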
Related distributions
Special cases
• If X ∼ Γ(1, 1/λ), then X has an exponential distribution with rate parameter λ.
• If X ∼ Γ(ν/2, 2), then X is identical to χ²(ν), the chi-squared distribution with ν degrees of freedom.
Conversely, if Q ∼ χ²(ν) and c is a positive constant, then cQ ∼ Γ(ν/2, 2c).
• If k is an integer, the gamma distribution is an Erlang distribution and is the probability distribution of the
waiting time until the k-th "arrival" in a one-dimensional Poisson process with intensity 1/θ. If X ∼ Γ(k, θ)
with integer k and Y ∼ Pois(x/θ), then P(X ≤ x) = P(Y ≥ k).
• If X has a Maxwell–Boltzmann distribution with parameter a, then X² ∼ Γ(3/2, 2a²).
• If X ∼ Γ(k, θ), then a positive power X^q follows a generalized gamma distribution with parameters determined by k, θ and q.
• A logarithmic transformation of a skew-logistic random variable is Γ(1, θ)-distributed, i.e. an exponential distribution: see skew-logistic distribution.
Conjugate prior
In Bayesian inference, the gamma distribution is the conjugate prior to many likelihood distributions: the Poisson,
exponential, normal (with known mean), Pareto, gamma with known shape parameter, inverse gamma with known shape
parameter, and Gompertz with known scale parameter.
The gamma distribution's conjugate prior is:[17]

P(α, β | p, q, r, s) = (1/Z) p^(α−1) e^(−βq) β^(αs) / Γ(α)^r,

where Z is the normalizing constant, which has no closed-form solution. The posterior distribution can be found by
updating the parameters as follows:

p′ = p ∏_i x_i,   q′ = q + ∑_i x_i,   r′ = r + n,   s′ = s + n,

where n is the number of observations and x_i is the i-th observation.
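As a concrete illustration of conjugate updating with the gamma family (here as the prior for a Poisson rate, one of the likelihoods listed above; the numbers are invented example data):

```python
# Conjugate update for a Poisson rate lam with prior lam ~ Gamma(alpha, rate beta):
# after observing counts x_1..x_n the posterior is Gamma(alpha + sum(x), beta + n),
# a standard conjugacy result.
def gamma_poisson_update(alpha, beta, counts):
    return alpha + sum(counts), beta + len(counts)

alpha0, beta0 = 2.0, 1.0           # example prior hyperparameters
counts = [3, 5, 4, 6, 2]           # example Poisson observations
alpha1, beta1 = gamma_poisson_update(alpha0, beta0, counts)
posterior_mean = alpha1 / beta1    # (2 + 20) / (1 + 5)
```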
Compound gamma
If the shape parameter of the gamma distribution is known but the inverse-scale parameter is unknown, then a
gamma distribution for the inverse scale forms a conjugate prior. The compound distribution, which results from
integrating out the inverse scale, has a closed-form solution, known as the compound gamma distribution.[18]
Others
• If X has a Γ(k, θ) distribution, then 1/X has an inverse-gamma distribution with parameters k and θ^(−1).
• If X and Y are independently distributed Γ(α, θ) and Γ(β, θ) respectively, then X / (X + Y) has a beta distribution
with parameters α and β.
• If Xi are independently distributed Γ(αi, 1) respectively, then the vector (X1 / S, ..., Xn / S), where S = X1 + ... + Xn,
follows a Dirichlet distribution with parameters α1, …, αn.
• For large k the gamma distribution converges to a Gaussian distribution with mean μ = kθ and variance σ² = kθ².
• The Gamma distribution is the conjugate prior for the precision of the normal distribution with known mean.
• The Wishart distribution is a multivariate generalization of the gamma distribution (samples are positive-definite
matrices rather than positive real numbers).
• The Gamma distribution is a special case of the generalized gamma distribution, the generalized integer gamma
distribution, and the generalized inverse Gaussian distribution.
• Among the discrete distributions, the negative binomial distribution is sometimes considered the discrete
analogue of the Gamma distribution.
Applications
The gamma distribution has been used to model the size of insurance claims and rainfall.[19] This means that
aggregate insurance claims and the amount of rainfall accumulated in a reservoir are modelled by a gamma process.
The gamma distribution is also used to model errors in multi-level Poisson regression models, because the
combination of the Poisson distribution and a gamma distribution is a negative binomial distribution.
In neuroscience, the gamma distribution is often used to describe the distribution of inter-spike intervals.[20]
Although in practice the gamma distribution often provides a good fit, there is no underlying biophysical motivation
for using it.
In bacterial gene expression, the copy number of a constitutively expressed protein often follows the gamma
distribution, where the scale and shape parameter are, respectively, the mean number of bursts per cell cycle and the
mean number of protein molecules produced by a single mRNA during its lifetime.[21]
The gamma distribution is widely used as a conjugate prior in Bayesian statistics. It is the conjugate prior for the
precision (i.e. inverse of the variance) of a normal distribution. It is also the conjugate prior for the exponential
distribution.
Notes
[1] See Hogg and Craig (1978, Remark 3.3.1) for an explicit motivation
[2] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http://www.wise.xmu.edu.cn/Master/Download/UploadFiles/paper-masterdownload/2009519932327055475115776.pdf). Journal of Econometrics (Elsevier):
219–230. Retrieved 2011-06-02.
[3] http://commons.wikimedia.org/wiki/File:Gamma-PDF-3D-by-k.png
[4] http://commons.wikimedia.org/wiki/File:Gamma-PDF-3D-by-Theta.png
[5] http://commons.wikimedia.org/wiki/File:Gamma-PDF-3D-by-x.png
[6] Papoulis, Pillai, Probability, Random Variables, and Stochastic Processes, Fourth Edition
[7] Chen J, Rubin H (1986) Bounds for the difference between median and mean of Gamma and Poisson distributions. Statist Probab Lett 4:
281–283
[8] Choi KP (1994) On the medians of Gamma distributions and an equation of Ramanujan. Proc Amer Math Soc 121 (1) 245–251
[9] Banneheka BMSG, Ekanayake GEMUPD (2009) A new point estimator for the median of Gamma distribution. Viyodaya J Science
14:95-103
[10] W.D. Penny, KL-Divergences of Normal, Gamma, Dirichlet, and Wishart densities
[11] Luc Devroye (1986). Non-Uniform Random Variate Generation (http://luc.devroye.org/rnbookindex.html). New York: Springer-Verlag.
See Chapter 9, Section 3, pages 401–428.
[12] Devroye (1986), p. 406.
[13] Ahrens, J. H. and Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the ACM, 25, 47–54.
Algorithm GD, p. 53.
[14] Ahrens, J. H.; Dieter, U. (1974). "Computer methods for sampling from gamma, beta, Poisson and binomial distributions". Computing 12:
223–246. CiteSeerX: 10.1.1.93.3828 (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.93.3828).
[15] Cheng, R.C.H., and Feast, G.M. Some simple gamma variate generators. Appl. Stat. 28 (1979), 290-295.
[16] Marsaglia, G. The squeeze method for generating gamma variates. Comput. Math. Appl. 3 (1977), 321–325.
[17] Fink, D. 1995 A Compendium of Conjugate Priors (http://www.stat.columbia.edu/~cook/movabletype/mlm/CONJINTRnew+TEX.pdf).
In progress report: Extension and enhancement of methods for setting data quality objectives. (DOE contract 95‑831).
[18] Dubey, Satya D. (December 1970). "Compound gamma, beta and F distributions" (http://www.springerlink.com/content/u750hg4630387205/). Metrika 16: 27–31. doi:10.1007/BF02613934.
[19] Aksoy, H. (2000) "Use of Gamma Distribution in Hydrological Analysis" (http://journals.tubitak.gov.tr/engineering/issues/muh-00-24-6/muh-24-6-7-9909-13.pdf), Turk J. Engin Environ Sci, 24, 419–428.
[20] J. G. Robson and J. B. Troy, "Nature of the maintained discharge of Q, X, and Y retinal ganglion cells of the cat," J. Opt. Soc. Am. A 4,
2301-2307 (1987)
[21] N. Friedman, L. Cai and X. S. Xie (2006) "Linking stochastic dynamics to population distribution: An analytical framework of gene
expression," Phys. Rev. Lett. 97, 168302.
References
• R. V. Hogg and A. T. Craig. (1978) Introduction to Mathematical Statistics, 4th edition. New York: Macmillan.
(See Section 3.3.)
• S. C. Choi and R. Wette. (1969) Maximum Likelihood Estimation of the Parameters of the Gamma Distribution
and Their Bias, Technometrics, 11(4) 683–690
External links
• Weisstein, Eric W., " Gamma distribution (http://mathworld.wolfram.com/GammaDistribution.html)" from
MathWorld.
• Engineering Statistics Handbook (http://www.itl.nist.gov/div898/handbook/eda/section3/eda366b.htm)
Gamma function 232
Gamma function
In mathematics, the gamma function (represented by the capital Greek letter Γ) is an extension of the factorial
function, with its argument shifted down by 1, to real and complex numbers. That is, if n is a positive integer:

Γ(n) = (n − 1)!.
Motivation
The gamma function can be seen as a solution to the
following interpolation problem:
"Find a smooth curve that connects the
points (x, y) given by y = (x − 1)! at the positive
integer values for x."
[Figure: It is easy graphically to interpolate the factorial function to non-integer values, but is there a formula that describes the resulting curve?]
A plot of the first few factorials makes clear that such a curve can be drawn, but it would be preferable to have
a formula that precisely describes the curve, in which the number of operations does not depend on the size
of n. The simple formula for the factorial, n! = 1 × 2 × … × n, cannot be used directly for fractional values
of n since it is only valid when n is a natural number (i.e., a positive integer). There are, relatively speaking,
no such simple solutions for factorials; any combination of sums, products, powers, exponential
functions, or logarithms with a fixed number of terms will not suffice to express n!. Stirling's approximation is
asymptotically equal to the factorial function for large values of n. It is possible to find a general formula for
factorials using tools such as integrals and limits from calculus. A good solution to this is the gamma function.
There are infinitely many continuous extensions of the factorial to non-integers: infinitely many curves can be drawn
through any set of isolated points. The gamma function is the most useful solution in practice, being analytic (except
at the non-positive integers), and it can be characterized in several ways. However, it is not the only analytic function
which extends the factorial, as adding to it any analytic function which is zero on the positive integers will give
another function with that property.
A more restrictive property than satisfying the above interpolation is to satisfy the recurrence relation defining a
slightly translated version of the factorial function,

f(1) = 1,
f(x + 1) = x f(x),
for x equal to any positive real number. The Bohr–Mollerup theorem proves that these properties, together with the
assumption that f be logarithmically convex (aka: "superconvex"[1]), uniquely determine f for positive, real inputs.
From there, the gamma function can be extended to all real and complex values (except the negative integers and
zero) by using the unique analytic continuation of f.
Definition
Main definition
The notation Γ(z) is due to Legendre. If the real part of the complex number z is positive (Re(z) > 0), then the
integral

Γ(z) = ∫₀^∞ t^(z−1) e^(−t) dt

converges absolutely. Using integration by parts, one can show that

Γ(z + 1) = z Γ(z).    (1)

Since Γ(1) = 1, it follows that Γ(n + 1) = n! for every positive integer n.
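For Re(z) > 1, Euler's integral can be checked numerically against a library gamma function; the following crude trapezoidal sketch (truncating the integral at an assumed upper limit of 50) is illustrative, not a serious quadrature scheme:

```python
import math

def gamma_integral(z, upper=50.0, steps=200_000):
    """Trapezoidal approximation of the Euler integral of t^(z-1) e^(-t)
    on [0, upper]; adequate for moderate z > 1 because the integrand
    decays rapidly."""
    h = upper / steps
    total = 0.0
    for i in range(1, steps):      # both endpoints contribute ~0 here
        t = i * h
        total += t ** (z - 1) * math.exp(-t)
    return total * h

approx = gamma_integral(2.5)
exact = math.gamma(2.5)
```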
Alternative definitions
The following infinite product definitions for the gamma function, due to Euler and Weierstrass respectively, are
valid for all complex numbers z, except the non-positive integers:

Γ(z) = (1/z) ∏_{n=1}^∞ (1 + 1/n)^z / (1 + z/n) = lim_{n→∞} n! n^z / (z (z + 1) ⋯ (z + n)),

Γ(z) = (e^(−γz)/z) ∏_{n=1}^∞ (1 + z/n)^(−1) e^(z/n),

where γ is the Euler–Mascheroni constant. It is straightforward to show that the Euler definition
satisfies the functional equation (1) above.
A somewhat curious parametrization of the gamma function is given in terms of generalized Laguerre polynomials,
where the symbol ~ means that the quotient of both sides converges to 1.
The behavior for nonpositive z is more intricate. Euler's integral does not converge for Re(z) ≤ 0, but the function it
defines in the positive complex half-plane has a unique analytic continuation to the negative half-plane. One way to
find that analytic continuation is to use Euler's integral for positive arguments and extend the domain to negative
numbers by repeated application of the recurrence formula,

Γ(z) = Γ(z + n + 1) / (z (z + 1) ⋯ (z + n)),

choosing n such that z + n is positive. The product in the denominator is zero when z equals any of the integers
0, −1, −2, ... . Thus, the gamma function must be undefined at those points; it is a meromorphic function with simple
poles at the nonpositive integers. The residues of the function at those points are:

Res(Γ, −n) = (−1)^n / n!.
[Figure: the graph of the gamma function along the real line.]
The gamma function is nonzero everywhere along the real line, although it comes arbitrarily close to zero as z → −∞.
There is in fact no complex number z for which Γ(z) = 0, and hence the reciprocal gamma function 1/Γ(z) is an
entire function, with zeros at z = 0, −1, −2, .... We see that the gamma function has a local minimum at
z ≈ 1.46163, where it attains the value Γ(z) ≈ 0.885603. The gamma function must alternate sign
between the poles because the product in the forward recurrence contains an odd number of negative factors if the
number of poles between z and z + n is odd, and an even number if the number of poles is even.
Properties
General
Other important functional equations for the gamma function are Euler's reflection formula

Γ(1 − z) Γ(z) = π / sin(πz),   z not an integer,

and the duplication formula

Γ(z) Γ(z + 1/2) = 2^(1−2z) √π Γ(2z).

A simple but useful property, which can be seen from the limit definition, is:

Γ(1/2) = √π,

which can be found by setting z = 1/2 in the reflection or duplication formulas, by using the relation to the beta
function given below with x = y = 1/2, or simply by making the substitution u = √t in the integral definition of the
gamma function, resulting in a Gaussian integral. In general, for non-negative integer values of n we have:

Γ(1/2 + n) = ((2n − 1)!! / 2^n) √π,
Γ(1/2 − n) = ((−2)^n / (2n − 1)!!) √π,

where n!! denotes the double factorial and, when n = 0, (−1)!! = 1. See Particular values of the gamma function for
calculated values.
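The half-integer formula is easy to verify against a library gamma function; double_factorial below is a local helper, not a standard routine:

```python
import math

def double_factorial(m):
    # helper: m!! with the convention (-1)!! = 1
    return 1 if m <= 0 else m * double_factorial(m - 2)

def gamma_half_plus(n):
    # Gamma(1/2 + n) = (2n - 1)!! * sqrt(pi) / 2^n
    return double_factorial(2 * n - 1) * math.sqrt(math.pi) / 2 ** n

vals = [(gamma_half_plus(n), math.gamma(n + 0.5)) for n in range(6)]
```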
It might be tempting to generalize the result that Γ(1/2) = √π by looking for a formula for other individual
values Γ(r) where r is rational. However, these numbers are not known to be expressible by themselves in terms
of elementary functions. It has been proved that Γ(n + r) is a transcendental number and algebraically
independent of π for any integer n and each of the fractions r = 1/6, 1/4, 1/3, 2/3, 3/4, and 5/6.[2] In general, when
computing values of the gamma function, we must settle for numerical approximations.
Another useful limit for asymptotic approximations is:

lim_{n→∞} Γ(n + α) / (Γ(n) n^α) = 1.
The derivatives of the gamma function are described in terms of the polygamma function. For example:

Γ′(z) = Γ(z) ψ(z),

where ψ is the digamma function. For a positive integer m the derivative of the gamma function can be calculated as
follows (here γ is the Euler–Mascheroni constant):

Γ′(m + 1) = m! (−γ + ∑_{k=1}^m 1/k).[3]

The gamma function has simple poles at z = −n = 0, −1, −2, −3, ... . The residue there is

Res(Γ, −n) = (−1)^n / n!.
The Bohr–Mollerup theorem states that among all functions extending the factorial functions to the positive real
numbers, only the gamma function is log-convex, that is, its natural logarithm is convex.
Pi function
An alternative notation which was originally introduced by Gauss and which was sometimes used is the Pi function,
which in terms of the gamma function is

Π(z) = Γ(z + 1) = z Γ(z),

so that

Π(n) = n!

for every non-negative integer n. Using the Pi function, the reflection formula takes on the form

Π(z) Π(−z) = πz / sin(πz) = 1 / sinc(z),

where sinc is the normalized sinc function, while the multiplication theorem takes on the form

Π(z/m) Π((z − 1)/m) ⋯ Π((z − m + 1)/m) = ((2π)^(m−1) / m)^(1/2) m^(−z) Π(z).

We also sometimes find

π(z) = 1 / Π(z),

which is an entire function, defined for every complex number. That π(z) is entire entails it has no poles, so Γ(z) has
no zeros.
• The derivative of the logarithm of the gamma function is called the digamma function; higher derivatives are the
polygamma functions.
• The analog of the gamma function over a finite field or a finite ring is given by Gaussian sums, a type of exponential
sum.
• The reciprocal gamma function is an entire function and has been studied as a specific topic.
• The gamma function also shows up in an important relation with the Riemann zeta function, ζ(z).
The logarithm of the gamma function satisfies the following formula due to Lerch:

ln Γ(x) = ζ_H′(0, x) − ζ′(0),

where ζ_H is the Hurwitz zeta function, ζ is the Riemann zeta function and the prime (′) denotes
differentiation in the first variable.
• The gamma function is intimately related to the stretched exponential function. For instance, the moments of that
function are

⟨τ^n⟩ ≡ ∫₀^∞ t^(n−1) e^(−(t/τ)^β) dt = (τ^n / β) Γ(n/β).
Particular values
Some particular values of the gamma function are:

Γ(−3/2) = 4√π/3 ≈ 2.363
Γ(−1/2) = −2√π ≈ −3.545
Γ(1/2) = √π ≈ 1.772
Γ(1) = 0! = 1
Γ(3/2) = √π/2 ≈ 0.886
Γ(2) = 1! = 1
Γ(5/2) = 3√π/4 ≈ 1.329
Γ(3) = 2! = 2
Γ(7/2) = 15√π/8 ≈ 3.323
Γ(4) = 3! = 6
Approximations
Complex values of the gamma function can be computed numerically with arbitrary precision using Stirling's
approximation or the Lanczos approximation.
The gamma function can be computed to fixed precision for Re(z) ∈ [1, 2] by applying integration by parts to Euler's
integral. For any positive number x the gamma function can be written

Γ(z) = x^z e^(−x) ∑_{n=0}^∞ x^n / (z (z + 1) ⋯ (z + n)) + ∫_x^∞ t^(z−1) e^(−t) dt.

When Re(z) ∈ [1, 2] and x ≥ 1, the absolute value of the last integral is smaller than (x + 1) e^(−x). By choosing x large
enough, this last expression can be made smaller than 2^(−N) for any desired value N. Thus, the gamma function can be
evaluated to N bits of precision with the above series.
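A sketch of this evaluation scheme (the tail integral is simply dropped, which for the assumed x = 40 contributes less than (x + 1)e^(−x) ≈ 2 × 10⁻¹⁶):

```python
import math

def gamma_series(z, x=40.0, terms=300):
    """Gamma(z) via the integration-by-parts series above, dropping the
    tail integral (bounded by (x + 1) e^(-x))."""
    term = 1.0 / z                 # the n = 0 term, x^0 / z
    total = term
    for n in range(1, terms):
        term *= x / (z + n)        # each step multiplies by x / (z + n)
        total += term
    return x ** z * math.exp(-x) * total

val = gamma_series(1.5)            # compare with math.gamma(1.5)
```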
The only fast algorithm for calculation of the Euler gamma function for any algebraic argument (including rational)
was constructed by E.A. Karatsuba,[4][5] see for details "Fast Algorithms and the FEE Method".[6]
For arguments that are integer multiples of 1/24 the gamma function can also be evaluated quickly using
arithmetic-geometric mean iterations (see particular values of the gamma function).
Because the Gamma and factorial functions grow so rapidly for moderately large arguments, many computing
environments include a function that returns the natural logarithm of the gamma function (often given the name
lngamma in programming environments or gammaln in spreadsheets); this grows much more slowly, and for
combinatorial calculations allows adding and subtracting logs instead of multiplying and dividing very large values.
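For example, binomial coefficients can be computed through the log-gamma function without forming any large factorial (a standard trick, sketched with Python's math.lgamma):

```python
from math import lgamma, exp

def log_binomial(n, k):
    # C(n, k) = n! / (k! (n-k)!) as a difference of ln Gamma values,
    # avoiding the huge intermediate factorials entirely.
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

log_c = log_binomial(1000, 500)    # ln C(1000, 500); the coefficient itself
                                   # has over 290 decimal digits
small = exp(log_binomial(10, 3))   # recovers an ordinary-sized value, 120
```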
The digamma function, which is the derivative of this function, is also commonly seen. In the context of technical
and physical applications, e.g. with wave propagation, the functional equation

Γ(z + 1) = z Γ(z)

is often used, since it allows one to determine function values in one strip of width 1 in z from the neighbouring strip.
In particular, starting with a good approximation for a z with large real part one may go step by step down to the
desired z. Following an indication of Carl Friedrich Gauss, Rocktaeschel (1922) proposed for ln Γ(z) an
approximation for large Re(z):

ln Γ(z) ≈ (z − 1/2) ln z − z + (1/2) ln(2π).

This can be used to accurately approximate ln Γ(z) for z with a smaller Re(z) via (P. E. Böhmer, 1939)

ln Γ(z − m) = ln Γ(z) − ∑_{k=1}^m ln(z − k).
A more accurate approximation can be obtained by using more terms from the asymptotic expansions of ln Γ(z)
and Γ(z), which are based on Stirling's approximation.
An asymptotic approximation of the gamma function is

Γ(z) ∼ √(2π) z^(z−1/2) e^(−z) as z → ∞.
Applications
Opening a random page in an advanced table of formulas, one may be as likely to spot the gamma function as a
trigonometric function. One author describes the gamma function as "Arguably, the most common special function,
or the least 'special' of them. The other transcendental functions listed below are called 'special' because you could
conceivably avoid some of them by staying away from many specialized mathematical topics. On the other hand, the
gamma function is most difficult to avoid."[7]
Integration problems
The gamma function finds application in such diverse areas as quantum physics, astrophysics and fluid dynamics.[8]
The gamma distribution, which is formulated in terms of the gamma function, is used in statistics to model a wide
range of processes; for example, the time between occurrences of earthquakes.[9]
The primary reason for the gamma function's usefulness in such contexts is the prevalence of expressions of the type
f(t) e^(−g(t)), which describe processes that decay exponentially in time or space. Integrals of such expressions can
occasionally be solved in terms of the gamma function when no elementary solution exists. For example, if f is a
power function and g is a linear function, a simple change of variables gives the evaluation
∫₀^∞ t^b e^(−ct) dt = Γ(b + 1)/c^(b+1)   (for b > −1, c > 0).
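The evaluation of a power function times a decaying exponential in terms of the gamma function can be checked numerically. A hedged sketch: `simpson` is a hypothetical helper name, and the finite cutoff at x = 50 stands in for the infinite upper limit (the integrand is negligible beyond it):

```python
import math

def simpson(f, a, b, n=10_000):
    """Composite Simpson's rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    odd = sum(f(a + (2 * i - 1) * h) for i in range(1, n // 2 + 1))
    even = sum(f(a + 2 * i * h) for i in range(1, n // 2))
    return (f(a) + f(b) + 4 * odd + 2 * even) * h / 3

b_pow, c = 3.0, 2.0                     # integrand x^3 * exp(-2x)
numeric = simpson(lambda x: x**b_pow * math.exp(-c * x), 0.0, 50.0)
closed_form = math.gamma(b_pow + 1) / c ** (b_pow + 1)   # Gamma(4)/2^4 = 6/16
```

The quadrature result and the closed form agree to well beyond single precision.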
The fact that the integration is performed along the entire positive real line might signify that the gamma function
describes the cumulation of a time-dependent process that continues indefinitely, or the value might be the total of a
distribution in an infinite space.
It is of course frequently useful to take limits of integration other than 0 and to describe the cumulation of a
finite process, in which case the ordinary gamma function is no longer a solution; the solution is then called an
incomplete gamma function. (The ordinary gamma function, obtained by integrating across the entire positive real
line, is sometimes called the complete gamma function for contrast).
An important category of exponentially decaying functions is that of Gaussian functions and integrals
thereof, such as the error function. There are many interrelations between these functions and the gamma function;
notably, the square root of π we obtained by evaluating Γ(1/2) is the "same" as that found in the normalizing
factor of the error function and the normal distribution.
The integrals we have discussed so far involve transcendental functions, but the gamma function also arises from
integrals of purely algebraic functions. In particular, the arc lengths of ellipses and of the lemniscate, which are
curves defined by algebraic equations, are given by elliptic integrals that in special cases can be evaluated in terms of
the gamma function. The gamma function can also be used to calculate the "volume" and "area" of n-dimensional
hyperspheres.
Another important special case is that of the beta function,
B(x, y) = Γ(x) Γ(y) / Γ(x + y).
Calculating products
The gamma function's ability to generalize factorial products immediately leads to applications in many areas of
mathematics: in combinatorics, and by extension in areas such as probability theory and the calculation of power
series. Many expressions involving products of successive integers can be written as some combination of factorials,
the most important example perhaps being that of the binomial coefficient
The example of binomial coefficients motivates why the properties of the gamma function, when extended to
negative numbers, are natural. A binomial coefficient gives the number of ways to choose k elements from a set of n
elements; if k > n, there are of course no ways. If k > n, (n − k)! is the factorial of a negative integer and
hence infinite if we use the gamma function definition of factorials; dividing by infinity gives the expected value
of 0.
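The behaviour described above can be illustrated by defining binomial coefficients through the gamma function; `gamma_binom` is an illustrative name of mine, not a standard API. Note how a pole of Γ at a non-positive integer makes the coefficient 0, exactly as argued:

```python
import math

def gamma_binom(x, k):
    """C(x, k) = Gamma(x+1) / (Gamma(k+1) * Gamma(x-k+1)), extended beyond integers.

    A pole of Gamma at a non-positive integer amounts to 'dividing by infinity',
    so the coefficient is taken to be 0 there."""
    try:
        return math.gamma(x + 1) / (math.gamma(k + 1) * math.gamma(x - k + 1))
    except ValueError:        # math.gamma raises at its poles (0, -1, -2, ...)
        return 0.0
```

For example, gamma_binom(5, 2) gives 10, gamma_binom(3, 5) gives 0 (no way to choose 5 elements from 3), and gamma_binom(0.5, 2) gives the generalized value −1/8.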
We can replace the factorial by a gamma function to extend any such formula to the complex numbers. Generally,
this works for any product wherein each factor is a rational function of the index variable, by factoring the rational
function into linear expressions. If P and Q are monic polynomials of degree m and n with respective roots
p1, …, pm and q1, …, qn, we have
If we have a way to calculate the gamma function numerically, it is a breeze to calculate numerical values of such
products. The number of gamma functions in the right-hand side depends only on the degree of the polynomials, so it
does not matter whether equals 5 or . Moreover, due to the poles of the gamma function, the equation
also holds (in the sense of taking limits) when the left-hand product contains zeros or poles.
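As a sketch of the idea just described, consider the rational product ∏_{i=1}^{n} i/(i+2), whose numerator and denominator are monic and linear; it reduces to a fixed number of gamma factors, Γ(n+1) Γ(3)/Γ(n+3). The helper names below are mine, and lgamma is used only to keep intermediate values from overflowing for large n:

```python
import math

def product_direct(n):
    """prod_{i=1}^{n} i/(i+2), multiplied out term by term (n factors)."""
    out = 1.0
    for i in range(1, n + 1):
        out *= i / (i + 2)
    return out

def product_gamma(n):
    """Same product as a fixed number of gamma factors, independent of n:
    prod = Gamma(n+1) * Gamma(3) / Gamma(n+3) = 2 / ((n+1)(n+2))."""
    return math.exp(math.lgamma(n + 1) + math.lgamma(3) - math.lgamma(n + 3))
```

Both routes agree, but the gamma form costs the same whether n is 5 or enormous, which is the point made above.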
By taking limits, certain rational products with infinitely many factors can be evaluated in terms of the gamma
function as well. Due to the Weierstrass factorization theorem, analytic functions can be written as infinite products,
and these can sometimes be represented as finite products or quotients of the gamma function. We have already seen
one striking example: the reflection formula essentially represents the sine function as the product of two gamma
functions. Starting from this formula, the exponential function as well as all the trigonometric and hyperbolic
functions can be expressed in terms of the gamma function.
More functions yet, including the hypergeometric function and special cases thereof, can be represented by means of
complex contour integrals of products and quotients of the gamma function, called Mellin–Barnes integrals.
Among other things, this provides an explicit form for the analytic continuation of the zeta function to a
meromorphic function in the complex plane and leads to an immediate proof that the zeta function has infinitely
many so-called "trivial" zeros on the real line. Borwein et al. call this formula "one of the most beautiful findings in
mathematics".[10] Another champion for that title might be
Both formulas were derived by Bernhard Riemann in his seminal 1859 paper "Über die Anzahl der Primzahlen unter
einer gegebenen Grösse" ("On the Number of Prime Numbers less than a Given Quantity"), one of the milestones in
the development of analytic number theory — the branch of mathematics that studies prime numbers using the tools
of mathematical analysis. Factorial numbers, considered as discrete objects, are an important concept in classical
number theory because they contain many prime factors, but Riemann found a use for their continuous extension that
arguably turned out to be even more important.
History
The gamma function has caught the interest of some of the most prominent mathematicians of all time. Its history,
notably documented by Philip J. Davis in an article that won him the 1963 Chauvenet Prize, reflects many of the
major developments within mathematics since the 18th century. In the words of Davis, "each generation has found
something of interest to say about the gamma function. Perhaps the next generation will also."[11]
Euler's result can be written as n! = ∫₀¹ (−ln s)^n ds, which is valid for n > 0. By the change of variables
t = −ln s, this becomes the familiar Euler integral. Euler
published his results in the paper "De progressionibus transcendentibus seu quarum termini generales algebraice dari
nequeunt" ("On transcendental progressions, that is, those whose general terms cannot be given algebraically"),
submitted to the St. Petersburg Academy on November 28, 1729.[12] Euler further discovered some of the gamma
function's important functional properties, including the reflection formula.
James Stirling, a contemporary of Euler, also attempted to find a continuous expression for the factorial and came up
with what is now known as Stirling's formula. Although Stirling's formula gives a good estimate of n!, also for
non-integers, it does not provide the exact value. Extensions of his formula that correct the error were given by
Stirling himself and by Jacques Philippe Marie Binet.
Abramowitz and Stegun became the standard reference for this and many other special functions after its publication
in 1964.
Double-precision floating-point implementations of the gamma function and its logarithm are now available in most
scientific computing software and special functions libraries, for example Matlab, GNU Octave, and the GNU
Scientific Library. The gamma function was also added to the C standard library (math.h). Arbitrary-precision
implementations are available in most computer algebra systems, such as Mathematica and Maple. PARI/GP, MPFR
and MPFUN contain free arbitrary-precision implementations.
Notes
[1] Kingman, J. F. C. (1961). "A convexity property of positive matrices". Quart. J. Math. Oxford (2) 12, 283–284.
[2] Waldschmidt, M. (2006). "Transcendence of Periods: The State of the Art (http://www.math.jussieu.fr/~miw/articles/pdf/TranscendencePeriods.pdf)". Pure and Applied Mathematics Quarterly, Volume 2, Number 2, 435–463. (PDF copy published by the author.)
[3] This can be derived by differentiating the integral form of the gamma function with respect to x, and using the technique of differentiation
under the integral sign.
[4] E. A. Karatsuba, "Fast evaluation of transcendental functions". Probl. Inf. Transm. Vol. 27, No. 4, pp. 339–360 (1991).
[5] E. A. Karatsuba, "On a new method for fast evaluation of transcendental functions". Russ. Math. Surv. Vol. 46, No. 2, pp. 246–247 (1991).
[6] E. A. Karatsuba, "Fast Algorithms and the FEE Method (http://www.ccas.ru/personal/karatsuba/algen.htm)".
[7] Michon, G. P. "Trigonometry and Basic Functions (http://home.att.net/~numericana/answer/functions.htm)". Numericana. Retrieved
May 5, 2007.
[8] Chaudry, M. A. & Zubair, S. M. (2001). On a Class of Incomplete Gamma Functions with Applications. p. 37.
[9] Rice, J. A. (1995). Mathematical Statistics and Data Analysis (2nd ed.). pp. 52–53.
[10] Borwein, J., Bailey, D. H. & Girgensohn, R. (2003). Experimentation in Mathematics. A. K. Peters. p. 133. ISBN 1-56881-136-5.
[11] Davis, P. J. (1959). "Leonhard Euler's Integral: A Historical Profile of the Gamma Function". The American Mathematical Monthly, Vol. 66,
No. 10 (Dec., 1959), pp. 849–869. (http://mathdl.maa.org/mathDL/22/?pa=content&sa=viewDocument&nodeId=3104)
[12] Euler's paper was published in Commentarii academiae scientiarum Petropolitanae 5, 1738, 36–57. See E19 – De progressionibus
transcendentibus seu quarum termini generales algebraice dari nequeunt (http://math.dartmouth.edu/~euler/pages/E019.html), from The
Euler Archive, which includes a scanned copy of the original article. An English translation (http://home.sandiego.edu/~langton/eg.pdf)
by S. Langton is also available.
[13] Remmert, R., Kay, L. D. (translator) (2006). Classical Topics in Complex Function Theory. Springer. ISBN 0-387-98221-3.
[14] Lanczos, C. (1964). "A precision approximation of the gamma function." J. SIAM Numer. Anal. Ser. B, Vol. 1.
[15] Knuth, D. E. (1997). The Art of Computer Programming, volume 1 (Fundamental Algorithms). Addison-Wesley.
[16] Berry, M. "Why are special functions special? (http://scitation.aip.org/journals/doc/PHTOAD-ft/vol_54/iss_4/11_1.shtml?bypassSSO=1)". Physics Today, April 2001.
References
• Milton Abramowitz and Irene A. Stegun, eds. Handbook of Mathematical Functions with Formulas, Graphs, and
Mathematical Tables. New York: Dover, 1972. (See Chapter 6) (http://www.math.sfu.ca/~cbm/aands/
page_253.htm)
• G. E. Andrews, R. Askey, R. Roy, Special Functions, Cambridge University Press, 2001. ISBN
978-0-521-78988-2. Chapter one, covering the gamma and beta functions, is highly readable and definitive.
• Emil Artin, "The Gamma Function", in Rosen, Michael (ed.) Exposition by Emil Artin: a selection; History of
Mathematics 30. Providence, RI: American Mathematical Society (2006).
• Askey, R. A.; Roy, R. (2010), "Gamma function" (http://dlmf.nist.gov/5), in Olver, Frank W. J.; Lozier, Daniel
M.; Boisvert, Ronald F. et al., NIST Handbook of Mathematical Functions, Cambridge University Press,
ISBN 978-0521192255, MR2723248
• P. E. Böhmer, "Differenzengleichungen und bestimmte Integrale", Köhler Verlag, Leipzig, 1939.
• Philip J. Davis, "Leonhard Euler's Integral: A Historical Profile of the Gamma Function," American Mathematical
Monthly 66, 849-869 (1959)
• Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007), "Section 6.1. Gamma Function" (http://apps.
nrbook.com/empanel/index.html?pg=256), Numerical Recipes: The Art of Scientific Computing (3rd ed.), New
York: Cambridge University Press, ISBN 978-0-521-88068-8
• O. R. Rocktaeschel, "Methoden zur Berechnung der Gammafunktion für komplexes Argument", University of
Dresden, Dresden, 1922.
• Nico M. Temme, "Special Functions: An Introduction to the Classical Functions of Mathematical Physics", John
Wiley & Sons, New York, ISBN 0-471-11313-1, 1996.
• E. T. Whittaker and G. N. Watson, A Course of Modern Analysis. Cambridge University Press (1927; reprinted
1996) ISBN 978-0521588072
External links
• Pascal Sebah and Xavier Gourdon. Introduction to the Gamma Function. In PostScript (http://numbers.
computation.free.fr/Constants/Miscellaneous/gammaFunction.ps) and HTML (http://numbers.computation.
free.fr/Constants/Miscellaneous/gammaFunction.html) formats.
• C++ reference for std::tgamma (http://en.cppreference.com/w/cpp/numeric/math/tgamma)
• Examples of problems involving the gamma function can be found at Exampleproblems.com (http://www.
exampleproblems.com/wiki/index.php?title=Special_Functions).
• Hazewinkel, Michiel, ed. (2001), "Gamma function" (http://www.encyclopediaofmath.org/index.php?title=p/
g043310), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
• Wolfram gamma function evaluator (arbitrary precision) (http://functions.wolfram.com/webMathematica/
FunctionEvaluation.jsp?name=Gamma)
• Gamma (http://functions.wolfram.com/GammaBetaErf/Gamma/) at the Wolfram Functions Site
• Volume of n-Spheres and the Gamma Function (http://www.mathpages.com/home/kmath163/kmath163.htm)
at MathPages
• Weisstein, Eric W., " Gamma Function (http://mathworld.wolfram.com/GammaFunction.html)" from
MathWorld.
• "Elementary Proofs and Derivations" (http://www.docstoc.com/docs/3507375/
500-Integrals-of-Elementary-and-Special-Functions,)
• "Selected Transformations, Identities, and Special Values for the Gamma Function" (http://www.docstoc.com/
docs/5836783/Selected-Transformations-Identities--and-Special-Values--for-the-Gamma-Function,)
• This article incorporates material from the Citizendium article "Gamma function", which is licensed under the
Creative Commons Attribution-ShareAlike 3.0 Unported License but not under the GFDL.
Geometric distribution 246
Geometric distribution
In probability theory and statistics, the geometric distribution is either of two discrete probability distributions:
• The probability distribution of the number X of Bernoulli trials needed to get one success, supported on the
set { 1, 2, 3, ... }
• The probability distribution of the number Y = X − 1 of failures before the first success, supported on the
set { 0, 1, 2, 3, ... }
Which of these one calls "the" geometric distribution is a matter of convention and convenience.
Geometric (two conventions: X = number of trials, support {1, 2, 3, ...}; Y = X − 1 = number of failures, support {0, 1, 2, ...})
Parameters: success probability p, 0 < p ≤ 1
Mean: E[X] = 1/p; E[Y] = (1 − p)/p
Median: ⌈−1/log₂(1 − p)⌉ for X, one less for Y (not unique if −1/log₂(1 − p) is an integer)
Mode: 1 for X; 0 for Y
Variance: (1 − p)/p² (both)
Skewness: (2 − p)/√(1 − p) (both)
Excess kurtosis: 6 + p²/(1 − p) (both)
Entropy: [−(1 − p) log₂(1 − p) − p log₂ p]/p (both)
Moment-generating function (mgf): E[e^(tX)] = p e^t/(1 − (1 − p)e^t); E[e^(tY)] = p/(1 − (1 − p)e^t), for t < −ln(1 − p)
Characteristic function: E[e^(itX)] = p e^(it)/(1 − (1 − p)e^(it)); E[e^(itY)] = p/(1 − (1 − p)e^(it))
These two different geometric distributions should not be confused with each other. Often, the name shifted
geometric distribution is adopted for the former one (distribution of the number X); however, to avoid ambiguity, it
is considered wise to indicate which is intended, by mentioning the range explicitly.
The geometric distribution gives the probability that the first success requires k independent trials, each with success
probability p. If the probability of success on each trial is p, then the probability that the kth trial is
the first success is
Pr(X = k) = (1 − p)^(k−1) p
for k = 1, 2, 3, ....
The above form of the geometric distribution is used for modeling the number of trials until the first success. By
contrast, the following form is used for modeling the number of failures before the first success:
Pr(Y = k) = (1 − p)^k p
for k = 0, 1, 2, 3, ....
In either case, the sequence of probabilities is a geometric sequence.
For example, suppose an ordinary die is thrown repeatedly until the first time a "1" appears. The probability
distribution of the number of times it is thrown is supported on the infinite set { 1, 2, 3, ... } and is a geometric
distribution with p = 1/6.
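The die example can be reproduced directly from the probability mass function of the number of trials; a small Python sketch with illustrative function names:

```python
import random

def geometric_pmf(k, p):
    """P(X = k) = (1-p)^(k-1) * p on the support {1, 2, 3, ...}."""
    return (1 - p) ** (k - 1) * p

p = 1 / 6                                                   # die thrown until the first "1"
partial = sum(geometric_pmf(k, p) for k in range(1, 500))   # geometric series, ~1

random.seed(0)
def throws_until_one():
    """Simulate one experiment: count throws until a 1 appears."""
    n = 0
    while True:
        n += 1
        if random.randint(1, 6) == 1:
            return n

mean = sum(throws_until_one() for _ in range(100_000)) / 100_000   # near 1/p = 6
```

The simulated mean number of throws comes out close to 1/p = 6, as the distribution predicts.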
Similarly, the expected value of the geometrically distributed random variable Y is (1 − p)/p, and its variance is
(1 − p)/p²:
Let μ = (1 − p)/p be the expected value of Y. Then the cumulants κ_n of the probability distribution of Y satisfy the
recursion κ_{n+1} = μ(μ + 1) dκ_n/dμ.
Outline of proof: That the expected value is (1 − p)/p can be shown in the following way. Let Y be as above. Then
(The interchange of summation and differentiation is justified by the fact that convergent power series converge
uniformly on compact subsets of the set of points where they converge.)
Parameter estimation
For both variants of the geometric distribution, the parameter p can be estimated by equating the expected value with
the sample mean. This is the method of moments, which in this case happens to yield maximum likelihood estimates
of p.
Specifically, for the first variant let k1, ..., kn be a sample where ki ≥ 1 for i = 1, ..., n. Then p can be estimated as
p̂ = n / (k1 + ··· + kn).
In Bayesian inference, the Beta distribution is the conjugate prior distribution for the parameter p. If this parameter is
given a Beta(α, β) prior, then the posterior distribution is
p ~ Beta(α + n, β + (k1 + ··· + kn) − n).
The posterior mean E[p] approaches the maximum likelihood estimate p̂ as α and β approach zero.
In the alternative case, let k1, ..., kn be a sample where ki ≥ 0 for i = 1, ..., n. Then p can be estimated as
p̂ = n / (n + k1 + ··· + kn).
Again the posterior mean E[p] approaches the maximum likelihood estimate as α and β approach zero.
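Both estimators, equating the expected value with the sample mean, amount to one line each; a sketch with hypothetical helper names:

```python
def estimate_p_trials(sample):
    """{1, 2, 3, ...} variant: p_hat = n / (k_1 + ... + k_n), i.e. 1 / sample mean."""
    return len(sample) / sum(sample)

def estimate_p_failures(sample):
    """{0, 1, 2, ...} variant: p_hat = n / (n + k_1 + ... + k_n)."""
    return len(sample) / (len(sample) + sum(sample))
```

For example, estimate_p_trials([1, 2, 3, 4]) gives 0.4, and so does estimate_p_failures([0, 1, 2, 3]) for the shifted sample.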
Other properties
• The probability-generating functions of X and Y are, respectively,
• Like its continuous analogue (the exponential distribution), the geometric distribution is memoryless. That means
that if you intend to repeat an experiment until the first success, then, given that the first success has not yet
occurred, the conditional probability distribution of the number of additional trials does not depend on how many
failures have been observed. The die one throws or the coin one tosses does not have a "memory" of these
failures. The geometric distribution is in fact the only memoryless discrete distribution.
• Among all discrete probability distributions supported on {1, 2, 3, ... } with given expected value μ, the geometric
distribution X with parameter p = 1/μ is the one with the largest entropy.
• The geometric distribution of the number Y of failures before the first success is infinitely divisible, i.e., for any
positive integer n, there exist independent identically distributed random variables Y1, ..., Yn whose sum has the
same distribution that Y has. These will not be geometrically distributed unless n = 1; they follow a negative
binomial distribution.
• The decimal digits of the geometrically distributed random variable Y are a sequence of independent (and not
identically distributed) random variables. For example, the hundreds digit D has this probability distribution:
where q = 1 − p, and similarly for the other digits, and, more generally, similarly for numeral systems with
other bases than 10. When the base is 2, this shows that a geometrically distributed random variable can be
written as a sum of independent random variables whose probability distributions are indecomposable.
• Golomb coding is the optimal prefix code for the geometric discrete distribution.
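The memoryless property in the list above can be checked directly from the tail probabilities P(X > k) = (1 − p)^k of the {1, 2, 3, ...} variant; a minimal sketch:

```python
p = 0.3

def tail(k):
    """P(X > k) = (1 - p)^k for the geometric distribution on {1, 2, 3, ...}."""
    return (1 - p) ** k

m, n = 4, 7
conditional = tail(m + n) / tail(m)   # P(X > m + n | X > m)
# memorylessness: this conditional tail equals the unconditional tail P(X > n)
```

After 4 failures, the distribution of the number of additional trials is the same as it was at the start.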
Related distributions
• The geometric distribution Y is a special case of the negative binomial distribution, with r = 1. More generally, if
Y1, ..., Yr are independent geometrically distributed variables with parameter p, then the sum Z = Y1 + ··· + Yr
follows a negative binomial distribution with parameters r and p.
• Suppose 0 < r < 1, and for k = 1, 2, 3, ... the random variable Xk has a Poisson distribution with expected value
r^k/k. Then
X1 + 2 X2 + 3 X3 + ···
has a geometric distribution taking values in the set {0, 1, 2, ...}, with expected value r/(1 − r).
• The exponential distribution is the continuous analogue of the geometric distribution. If X is an exponentially
distributed random variable with parameter λ, then Y = ⌊X⌋,
where ⌊ ⌋ is the floor (or greatest integer) function, is a geometrically distributed random variable with
parameter p = 1 − e^(−λ) (thus λ = −ln(1 − p)[2]) and taking values in the set {0, 1, 2, ...}. This can be used to
generate geometrically distributed pseudorandom numbers by first generating exponentially distributed
pseudorandom numbers from a uniform pseudorandom number generator: ⌊ln(U)/ln(1 − p)⌋ is
geometrically distributed with parameter p, if U is uniformly distributed in [0, 1].
References
[1] Pitman, Jim. Probability (1993 edition). Springer. p. 372.
[2] http://www.wolframalpha.com/input/?i=inverse+p+%3D+1+-+e^-l
External links
• Geometric distribution (http://planetmath.org/?op=getobj&from=objects&id=3456),
PlanetMath.org.
• Geometric distribution (http://mathworld.wolfram.com/GeometricDistribution.html) on MathWorld.
• Online geometric distribution calculator (http://www.solvemymath.com/online_math_calculator/statistics/
discrete_distributions/geometric/index.php)
Hypergeometric distribution
Hypergeometric
Parameters: N ∈ {0, 1, 2, ...} (population size); m ∈ {0, 1, ..., N} (number of success states); n ∈ {0, 1, ..., N} (number of draws)
Support: k ∈ {max(0, n + m − N), ..., min(m, n)}
PMF: C(m, k) C(N − m, n − k) / C(N, n)
Mean: n m/N
Mode: ⌊(n + 1)(m + 1)/(N + 2)⌋
Variance: n (m/N)(1 − m/N)(N − n)/(N − 1)
Skewness: (N − 2m)√(N − 1)(N − 2n) / (√(n m (N − m)(N − n)) (N − 2))
In probability theory and statistics, the hypergeometric distribution is a discrete probability distribution that
describes the probability of k successes in n draws, without replacement, from a finite population of size N
containing m successes. (Cf. the binomial distribution, which describes the probability of k successes in n draws with
replacement.)
Hypergeometric distribution 251
Definition
A random variable X follows the hypergeometric distribution if its probability mass function is given by:[1]
P(X = k) = C(m, k) C(N − m, n − k) / C(N, n)
where
• N is the population size,
• m is the number of success states in the population,
• n is the number of draws,
• k is the number of observed successes, and
• C(a, b) is a binomial coefficient.
The PMF is positive when max(0, n + m − N) ≤ k ≤ min(m, n).
Combinatorial identities
As one would expect intuitively, the probabilities sum up to 1:
Σ_k C(m, k) C(N − m, n − k) = C(N, n)   (a form of Vandermonde's identity).
This follows clearly from the symmetry of the problem, but it can also be shown easily by expressing the binomial
coefficients in terms of factorials, and rearranging the latter.
               drawn      not drawn        total
white marbles  k          m − k            m
black marbles  n − k      N + k − n − m    N − m
total          n          N − n            N
Now, assume (for example) that there are 5 white and 45 black marbles in the urn. Standing next to the urn, you
close your eyes and draw 10 marbles without replacement. What is the probability that exactly 4 of the 10 are white?
Note that although we are looking at success/failure, the data are not accurately modeled by the binomial
distribution, because the probability of success on each trial is not the same, as the size of the remaining population
changes as we remove each marble.
This problem is summarized by the following contingency table:
               drawn        not drawn               total
white marbles  k = 4        m − k = 1               m = 5
black marbles  n − k = 6    N + k − n − m = 39      N − m = 45
total          n = 10       N − n = 40              N = 50
The probability of drawing exactly k white marbles can be calculated by the formula
Intuitively we would expect it to be even more unlikely for all 5 marbles to be white.
As expected, the probability of drawing 5 white marbles is roughly 33 times smaller than that of drawing 4.
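The marble computation can be reproduced with exact integer binomial coefficients; `hypergeom_pmf` is an illustrative name. In exact arithmetic the ratio P(X = 4)/P(X = 5) here is 5 · C(45, 6)/C(45, 5) = 100/3 ≈ 33.3:

```python
from math import comb

def hypergeom_pmf(k, N, m, n):
    """P(X = k): k successes in n draws, without replacement, from a
    population of N containing m successes."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

p4 = hypergeom_pmf(4, 50, 5, 10)     # exactly 4 white among the 10 drawn
p5 = hypergeom_pmf(5, 50, 5, 10)     # all 5 white marbles drawn
total = sum(hypergeom_pmf(k, 50, 5, 10) for k in range(6))   # sums to 1
```

Here p4 is about 0.004, and drawing all 5 whites is about 33 times less likely still.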
Symmetries
Swapping the roles of black and white marbles:
Symmetry application
The metaphor of defective and drawn objects depicts an application of the hypergeometric distribution in which the
interchange symmetry between n and m is not of foremost concern. Here is an alternative metaphor which brings this
symmetry into sharper focus, as there are also applications where it serves no purpose to distinguish n from m.
Suppose you have a set of N children who have been identified with an unusual bone marrow antigen. The doctor
wishes to conduct a heredity study to determine the inheritance pattern of this antigen. For the purposes of this study,
the doctor wishes to draw tissue from the bone marrow from the biological mother and biological father of each
child. This is an uncomfortable procedure, and not all the mothers and fathers will agree to participate. Of the
mothers, m participate and N-m decline. Of the fathers, n participate and N-n decline.
We assume here that the decisions made by the mothers are independent of the decisions made by the fathers. Under
this assumption, the doctor, who is given n and m, wishes to estimate k, the number of children where both parents
have agreed to participate. The hypergeometric distribution can be used to determine this distribution over k. It's not
straightforward why the doctor would know n and m, but not k. Perhaps n and m are dictated by the experimental
design, while the experimenter is left blind to the true value of k.
It is important to recognize that for given N, n and m a single degree of freedom partitions N into four
sub-populations:
1. Children where both parents participate
2. Children where only the mother participates
3. Children where only the father participates and
4. Children where neither parent participates.
Knowing any one of these four values determines the other three by simple arithmetic relations. For this reason, each
of these quadrants is governed by an equivalent hypergeometric distribution. The mean, mode, and values of k
contained within the support differ from one quadrant to another, but the size of the support, the variance, and other
higher-order statistics do not.
For the purpose of this study, it might make no difference to the doctor whether the mother participates or the father
participates. If this happens to be true, the doctor will view the result as a three-way partition: children where both
parents participate, children where one parent participates, children where neither parent participates. Under this
view, the last remaining distinction between n and m has been eliminated. The distribution where one parent
participates is the sum of the distributions where either parent alone participates.
Order of draws
The probability of drawing any sequence of white and black marbles (the hypergeometric distribution) depends only
on the number of white and black marbles, not on the order in which they appear; i.e., it is an exchangeable
distribution. As a result, the probability of drawing a white marble in the i-th draw is
m/N.
This can be shown by induction. First, it is certainly true for the first draw that the probability of drawing a white
marble is m/N. Since in the i-th draw either a white or a black marble needs to be drawn, we also know that the
probabilities of drawing a white marble and of drawing a black marble in that draw sum to 1.
Combining these two equations immediately yields
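The claim that the i-th draw is white with probability m/N, independent of i, can be verified by brute-force enumeration of all equally likely draw orders for a small urn; a sketch with m = 2 white out of N = 5:

```python
from itertools import permutations
from fractions import Fraction

marbles = ["W", "W", "B", "B", "B"]        # m = 2 white in a population of N = 5
orders = list(permutations(marbles))       # all 5! = 120 equally likely sequences
probs = [Fraction(sum(1 for o in orders if o[i] == "W"), len(orders))
         for i in range(len(marbles))]     # P(white on draw i), i = 1..5
```

Every entry of probs is exactly 2/5 = m/N, regardless of the draw position, which is the exchangeability statement above.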
Related distributions
Let X ~ Hypergeometric(m, N, n) and p = m/N.
• If n = 1 then X has a Bernoulli distribution with parameter p.
• Let Y have a binomial distribution with parameters n and p; this models the number of successes in the
analogous sampling problem with replacement. If N and m are large compared to n, and p is not close to 0 or
1, then X and Y have similar distributions, i.e., P(X ≤ k) ≈ P(Y ≤ k).
• If n is large, N and m are large compared to n, and p is not close to 0 or 1, then
Multivariate hypergeometric
Parameters: c colors; color counts (m1, ..., mc); N = m1 + ··· + mc; number of draws n
Support: {(k1, ..., kc) : ki ∈ {0, ..., mi}, k1 + ··· + kc = n}
PMF: P(k1, ..., kc) = [C(m1, k1) ··· C(mc, kc)] / C(N, n)
Mean: E[ki] = n mi/N
Variance: Var(ki) = n (mi/N)(1 − mi/N)(N − n)/(N − 1)
The model of an urn with black and white marbles can be extended to the case where there are more than two colors
of marbles. If there are mi marbles of color i in the urn and you take n marbles at random without replacement, then
the number of marbles of each color in the sample (k1,k2,...,kc) has the multivariate hypergeometric distribution. This
has the same relationship to the multinomial distribution that the hypergeometric distribution has to the binomial
distribution—the multinomial distribution is the "with-replacement" distribution and the multivariate hypergeometric
is the "without-replacement" distribution.
The properties of this distribution are given in the adjacent table, where c is the number of different colors and
N = m1 + ··· + mc is the total number of marbles.
Example
Suppose there are 5 black, 10 white, and 15 red marbles in an urn. You reach in and randomly select six marbles
without replacement. What is the probability that you pick exactly two of each color?
Note: When picking the six marbles without replacement, the expected number of black marbles is 6*(5/30) = 1, the
expected number of white marbles is 6*(10/30) = 2, and the expected number of red marbles is 6*(15/30) = 3.
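The two-of-each-color probability can be computed from the multivariate PMF ∏ C(mi, ki)/C(N, n); a sketch with an illustrative function name:

```python
from math import comb

def mv_hypergeom_pmf(color_counts, picks):
    """P(k_1, ..., k_c) = (prod_i C(m_i, k_i)) / C(N, n), for drawing
    n = sum(picks) marbles without replacement from N = sum(color_counts)."""
    N, n = sum(color_counts), sum(picks)
    numerator = 1
    for m_i, k_i in zip(color_counts, picks):
        numerator *= comb(m_i, k_i)
    return numerator / comb(N, n)

p_two_each = mv_hypergeom_pmf([5, 10, 15], [2, 2, 2])
# C(5,2) * C(10,2) * C(15,2) / C(30,6) = 47250 / 593775, roughly 0.08
```

So picking exactly two of each color happens in about 8% of such draws, even though (2, 2, 2) sits close to the expected counts (1, 2, 3).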
References
[1] Rice, John A. (2007). Mathematical Statistics and Data Analysis (3rd ed.). Duxbury Press. p. 42.
[2] Rivals, I.; Personnaz, L.; Taing, L.; Potier, M.-C. (2007). "Enrichment or depletion of a GO category within a class of genes: which test?".
Bioinformatics 23: 401–407.
[3] K. Preacher and N. Briggs. "Calculation for Fisher's Exact Test: An interactive calculation tool for Fisher's exact probability test for 2 x 2
tables (interactive page)" (http://quantpsy.org/fisher/fisher.htm).
External links
• GNU implementation for C/C++ (http://www.gnu.org/software/gsl/manual/html_node/The-Hypergeometric-Distribution.html)
• The Hypergeometric Distribution (http://demonstrations.wolfram.com/TheHypergeometricDistribution/) and
Binomial Approximation to a Hypergeometric Random Variable (http://demonstrations.wolfram.com/
BinomialApproximationToAHypergeometricRandomVariable/) by Chris Boucher, Wolfram Demonstrations
Project.
• Weisstein, Eric W., " Hypergeometric Distribution (http://mathworld.wolfram.com/
HypergeometricDistribution.html)" from MathWorld.
• Hypergeometric tail inequalities: ending the insanity (http://ansuz.sooke.bc.ca/professional/hypergeometric.
pdf) by Matthew Skala.
• Survey Analysis Tool (http://www.i-marvin.si) using discrete hypergeometric distribution based on A.
Berkopec, HyperQuick algorithm for discrete hypergeometric distribution, Journal of Discrete Algorithms,
Elsevier, 2006 (http://dx.doi.org/10.1016/j.jda.2006.01.001).
Hölder's inequality
In mathematical analysis Hölder's inequality, named after Otto Hölder, is a fundamental inequality between
integrals and an indispensable tool for the study of Lp spaces.
Let (S, Σ, μ) be a measure space and let 1 ≤ p, q ≤ ∞ with 1/p + 1/q = 1. Then, for all measurable real- or
complex-valued functions f and g on S,
‖fg‖₁ ≤ ‖f‖p ‖g‖q.
The numbers p and q above are said to be Hölder conjugates of each other. The special case p = q = 2 gives a form
of the Cauchy–Schwarz inequality.
Hölder's inequality holds even if ||fg ||1 is infinite, the right-hand side also being infinite in that case. In particular, if f
is in Lp(μ) and g is in Lq(μ), then fg is in L1(μ).
For 1 < p, q < ∞ and f ∈ Lp(μ) and g ∈ Lq(μ), Hölder's inequality becomes an equality if and only if |f |p and |g |q are
linearly dependent in L1(μ), meaning that there exist real numbers α, β ≥ 0, not both of them zero, such that α |f |p = β
|g |q μ-almost everywhere.
Hölder's inequality is used to prove the Minkowski inequality, which is the triangle inequality in the space Lp(μ), and
also to establish that Lq(μ) is the dual space of Lp(μ) for 1 ≤ p < ∞.
Hölder's inequality was first found by L. J. Rogers (1888), and discovered independently by Hölder (1889).
Remarks
Conventions
The brief statement of Hölder's inequality uses some conventions.
• In the definition of Hölder conjugates, 1/ ∞ means zero.
• If 1 ≤ p, q < ∞, then ||f ||p and ||g ||q stand for the (possibly infinite) expressions
(∫S |f|^p dμ)^(1/p) and (∫S |g|^q dμ)^(1/q).
• If p = ∞, then ||f ||∞ stands for the essential supremum of |f |, similarly for ||g ||∞.
• The notation ||f ||p with 1 ≤ p ≤ ∞ is a slight abuse, because in general it is only a norm of f if ||f ||p is finite and f is
considered as equivalence class of μ-almost everywhere equal functions. If f ∈ Lp(μ) and g ∈ Lq(μ), then the
notation is adequate.
• On the right-hand side of Hölder's inequality, 0 times ∞ as well as ∞ times 0 means 0. Multiplying a > 0 with ∞
gives ∞.
and the similar one for fg hold, and Hölder's inequality can be applied to the right-hand side. In particular, if f and g
are in the Hilbert space L2(μ), then Hölder's inequality for p = q = 2 implies
where the angle brackets refer to the inner product of L2(μ). This is also called Cauchy–Schwarz inequality, but
requires for its statement that ||f||2 and ||g||2 are finite to make sure that the inner product of f and g is well defined.
We may recover the original inequality (for the case p=2) by using the functions |f| and |g| in place of f and g.
Counting measure
In the case of n-dimensional Euclidean space, when the set S is {1, …, n} with the counting measure, we have
Σ_{k=1}^n |x_k y_k| ≤ (Σ_{k=1}^n |x_k|^p)^(1/p) (Σ_{k=1}^n |y_k|^q)^(1/q).
If S = N with the counting measure, then we get Hölder's inequality for sequence spaces:
Σ_{k=1}^∞ |x_k y_k| ≤ (Σ_{k=1}^∞ |x_k|^p)^(1/p) (Σ_{k=1}^∞ |y_k|^q)^(1/q).
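The counting-measure form is easy to test numerically for finite sequences, including the equality case where |y| is proportional to |x|^(p−1); a sketch (the helper name is mine):

```python
import random

def lp_norm(x, p):
    """l^p norm of a finite sequence (counting measure)."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

random.seed(1)
p = 3.0
q = p / (p - 1)                     # Hölder conjugate: 1/p + 1/q = 1
x = [random.uniform(-1, 1) for _ in range(10)]
y = [random.uniform(-1, 1) for _ in range(10)]
lhs = sum(abs(a * b) for a, b in zip(x, y))
rhs = lp_norm(x, p) * lp_norm(y, q)

# equality case: |y| proportional to |x|^(p-1)
y_eq = [abs(a) ** (p - 1) for a in x]
lhs_eq = sum(abs(a * b) for a, b in zip(x, y_eq))
rhs_eq = lp_norm(x, p) * lp_norm(y_eq, q)
```

The strict inequality holds for the random pair, while the proportional pair attains equality up to rounding, matching the linear-dependence condition stated earlier.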
Lebesgue measure
If S is a measurable subset of Rn with the Lebesgue measure, and f and g are measurable real- or complex-valued
functions on S, then Hölder's inequality is
Probability measure
For the probability space (Ω, F, P), let E denote the expectation operator. For real- or complex-valued random
variables X and Y on Ω, Hölder's inequality reads
E|XY| ≤ (E|X|^p)^(1/p) (E|Y|^q)^(1/q).
Let 0 < r < s and define p = s / r. Then q = p / (p−1) is the Hölder conjugate of p. Applying Hölder's inequality to the
random variables |X |r and 1Ω, we obtain
(E|X|^r)^(1/r) ≤ (E|X|^s)^(1/s).
In particular, if the sth absolute moment is finite, then the r th absolute moment is finite, too. (This also follows from
Jensen's inequality.)
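The same monotonicity of moments holds exactly under the empirical measure of a sample, which makes it easy to check; a minimal illustrative sketch:

```python
def power_mean(xs, r):
    """Empirical r-th power mean (E|X|^r)^(1/r) under the empirical measure."""
    n = len(xs)
    return (sum(abs(x) ** r for x in xs) / n) ** (1.0 / r)

sample = [0.3, 1.7, 2.2, 0.9, 4.1]
# 0 < r < s implies the r-th power mean is no larger than the s-th.
assert power_mean(sample, 1.5) <= power_mean(sample, 3.0)
```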
Product measure
For two σ-finite measure spaces (S1, Σ1, μ1) and (S2, Σ2, μ2) define the product measure space by
where S is the Cartesian product of S1 and S2, the σ-algebra Σ arises as product σ-algebra of Σ1 and Σ2, and μ denotes
the product measure of μ1 and μ2. Then Tonelli's theorem allows us to rewrite Hölder's inequality using iterated
integrals: If f and g are Σ-measurable real- or complex-valued functions on the Cartesian product S, then
Vector-valued functions
Let (S, Σ, μ) denote a σ-finite measure space and suppose that f = (f1, …, fn) and g = (g1, …, gn) are Σ-measurable
functions on S, taking values in the n-dimensional real- or complex Euclidean space. By taking the product with the
counting measure on {1, …, n}, we can rewrite the above product measure version of Hölder's inequality in the form
If the two integrals on the right-hand side are finite, then equality holds if and only if there exist real numbers
α, β ≥ 0, not both of them zero, such that
for μ-almost all x in S.
This finite-dimensional version generalizes to functions f and g taking values in a sequence space.
$ab \le \frac{a^p}{p} + \frac{b^q}{q}$
for all nonnegative a and b, where equality is achieved if and only if $a^p = b^q$. Hence
and
such that
μ-almost everywhere (*)
The case ||f ||p = 0 corresponds to β = 0 in (*). The case ||g ||q = 0 corresponds to α = 0 in (*).
Extremal equality
Statement
Assume that 1 ≤ p < ∞ and let q denote the Hölder conjugate. Then, for every f ∈ Lp(μ),
$\|f\|_p = \max\left\{ \left|\int_S fg\,\mathrm{d}\mu\right| : g \in L^q(\mu),\ \|g\|_q \le 1 \right\},$
where max indicates that there actually is a g maximizing the right-hand side. When p = ∞ and if each set A in the
σ-field Σ with μ(A) = ∞ contains a subset B ∈ Σ with 0 < μ(B) < ∞ (which is true in particular when μ is σ-finite),
then
$\|f\|_\infty = \sup\left\{ \left|\int_S fg\,\mathrm{d}\mu\right| : g \in L^1(\mu),\ \|g\|_1 \le 1 \right\}.$
Applications
• The extremal equality is one of the ways for proving the triangle inequality ||ƒ1 + ƒ2||p ≤ ||ƒ1||p + ||ƒ2||p for all ƒ1 and
ƒ2 in Lp(μ), see Minkowski inequality.
• Hölder's inequality implies that every ƒ ∈ Lp(μ) defines a bounded (or continuous) linear functional κƒ on Lq(μ) by
the formula
The extremal equality (when true) shows that the norm of this functional κƒ as element of the continuous dual
space Lq(μ)∗ coincides with the norm of ƒ in Lp(μ) (see also the Lp-space article).
In particular,
Note:
• For r ∈ (0, 1), contrary to the notation, ||.||r is in general not a norm, because it doesn't satisfy the triangle
inequality.
Interpolation
Let p1, …, pn ∈ (0, ∞] and let θ1, …, θn ∈ (0, 1) denote weights with θ1 + … + θn = 1. Define p as the weighted
harmonic mean, i.e.,
$\frac{1}{p} = \sum_{k=1}^{n} \frac{\theta_k}{p_k}.$
Then the generalized Hölder inequality gives $\|f\|_p \le \prod_{k=1}^{n} \|f\|_{p_k}^{\theta_k}$. In particular, taking θ1 = θ and θ2 = 1 − θ, in the case n = 2, we obtain the interpolation result
$\|f\|_p \le \|f\|_{p_1}^{\theta}\,\|f\|_{p_2}^{1-\theta}.$
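For finite sequences (counting measure) the interpolation inequality can be verified numerically; a small illustrative sketch, with p computed as the weighted harmonic mean described above:

```python
def lp_norm(xs, p):
    """l^p norm of a finite sequence under the counting measure."""
    return sum(abs(x) ** p for x in xs) ** (1.0 / p)

xs = [1.0, 2.5, 0.3, 4.0]
p1, p2, theta = 2.0, 6.0, 0.4
p = 1.0 / (theta / p1 + (1.0 - theta) / p2)  # weighted harmonic mean of p1, p2
assert lp_norm(xs, p) <= lp_norm(xs, p1) ** theta * lp_norm(xs, p2) ** (1.0 - theta) + 1e-12
```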
If ||fg ||1 < ∞ and ||g ||−1/(p −1) > 0, then the reverse Hölder inequality is an equality if and only if there exists an α ≥ 0
such that
μ-almost everywhere.
Note: ||f||1/p and ||g||−1/(p−1) are not norms; these expressions are just compact notation for
$\|f\|_{1/p} = \left(\int_S |f|^{1/p}\,\mathrm{d}\mu\right)^{p}$ and $\|g\|_{-1/(p-1)} = \left(\int_S |g|^{-1/(p-1)}\,\mathrm{d}\mu\right)^{-(p-1)}.$
Remarks:
• If a non-negative random variable Z has infinite expected value, then its conditional expectation is defined by
$\operatorname{E}[Z \mid \mathcal{F}] = \sup_{n \in \mathbb{N}} \operatorname{E}[\min(Z, n) \mid \mathcal{F}]$ almost surely.
• On the right-hand side of the conditional Hölder inequality, 0 times ∞ as well as ∞ times 0 means 0. Multiplying
a > 0 with ∞ gives ∞.
References
• Hardy, G. H.; Littlewood, J. E.; Pólya, G. (1934), Inequalities, Cambridge University Press, pp. XII+314,
ISBN 0-521-35880-9, JFM 60.0169.01, Zbl 0010.10703.
• Hölder, O. (1889), "Ueber einen Mittelwertsatz" [1] (in German), Nachrichten von der Königl. Gesellschaft der
Wissenschaften und der Georg-Augusts-Universität zu Göttingen, Band 1889 (2): 38–47, JFM 21.0260.07.
Available at Digi Zeitschriften [2].
• Kuptsov, L. P. (2001), "Hölder inequality" [3], in Hazewinkel, Michiel, Encyclopedia of Mathematics, Springer,
ISBN 978-1-55608-010-4.
• Rogers, L. J. (February 1888), "An extension of a certain theorem in inequalities" [4], Messenger of Mathematics,
New Series XVII (10): 145–150, JFM 20.0254.02, archived from the original [5] on August 21, 2007.
External links
• Kuttler, Kenneth (2007), An introduction to linear algebra [6], Online e-book in PDF format, Brigham Young
University.
• Lohwater, Arthur (1982) (PDF), Introduction to Inequalities [7].
References
[1] http://resolver.sub.uni-goettingen.de/purl?GDZPPN00252421X
[2] http://www.digizeitschriften.de/index.php?id=64&L=2
[3] http://www.encyclopediaofmath.org/index.php?title=H/h047514
[4] http://www.archive.org/stream/messengermathem01unkngoog#page/n183/mode/1up
[5] http://www.archive.org/details/messengermathem01unkngoog
[6] http://www.math.byu.edu/~klkuttle/Linearalgebra.pdf
[7] http://www.mediafire.com/?1mw1tkgozzu
Inverse Gaussian distribution 264
Parameters: μ > 0 (mean), λ > 0 (shape)
Support: x ∈ (0, ∞)
PDF: $\left[\frac{\lambda}{2\pi x^3}\right]^{1/2} \exp\left(\frac{-\lambda (x-\mu)^2}{2\mu^2 x}\right)$
CDF: $\Phi\left(\sqrt{\frac{\lambda}{x}}\left(\frac{x}{\mu}-1\right)\right) + \exp\left(\frac{2\lambda}{\mu}\right)\Phi\left(-\sqrt{\frac{\lambda}{x}}\left(\frac{x}{\mu}+1\right)\right)$, where Φ is the standard normal (standard Gaussian) cumulative distribution function
Mean: μ
Variance: μ³/λ
Skewness: $3\left(\mu/\lambda\right)^{1/2}$
Ex. kurtosis: 15μ/λ
MGF: $\exp\left(\frac{\lambda}{\mu}\left[1-\sqrt{1-\frac{2\mu^2 t}{\lambda}}\right]\right)$
CF: $\exp\left(\frac{\lambda}{\mu}\left[1-\sqrt{1-\frac{2\mu^2 i t}{\lambda}}\right]\right)$
In probability theory, the inverse Gaussian distribution (also known as the Wald distribution) is a two-parameter
family of continuous probability distributions with support on (0,∞).
Its probability density function is given by
$f(x;\mu,\lambda) = \left[\frac{\lambda}{2\pi x^3}\right]^{1/2} \exp\left(\frac{-\lambda (x-\mu)^2}{2\mu^2 x}\right)$
for x > 0, where μ > 0 is the mean and λ > 0 is the shape parameter. While the Gaussian describes the level of a Brownian motion at a fixed time, the inverse Gaussian describes the distribution of the time a Brownian motion with positive drift takes to reach a fixed positive level.
Its cumulant generating function (logarithm of the characteristic function) is the inverse of the cumulant generating
function of a Gaussian random variable.
To indicate that a random variable X is inverse Gaussian-distributed with mean μ and shape parameter λ we write
$X \sim IG(\mu, \lambda).$
Properties
Summation
If Xi has an IG(μ0wi, λ0wi2) distribution for i = 1, 2, ..., n and all Xi are independent, then
$S = \sum_{i=1}^{n} X_i \sim IG\left(\mu_0 \sum_{i=1}^{n} w_i,\ \lambda_0 \left(\sum_{i=1}^{n} w_i\right)^2\right).$
Note that
$\frac{\lambda_i}{\mu_i^2} = \frac{\lambda_0 w_i^2}{\mu_0^2 w_i^2} = \frac{\lambda_0}{\mu_0^2}$
is constant for all i. This is a necessary condition for the summation; otherwise S would not be inverse Gaussian.
Scaling
For any t > 0 it holds that
$tX \sim IG(t\mu, t\lambda).$
Exponential family
The inverse Gaussian distribution is a two-parameter exponential family with natural parameters -λ/(2μ²) and -λ/2,
and natural statistics X and 1/X.
Maximum likelihood
The model
$X_i \sim IG(\mu, \lambda w_i), \qquad i = 1, 2, \ldots, n,$
with all wi known, (μ, λ) unknown and all Xi independent has the following likelihood function:
$L(\mu, \lambda) = \left(\frac{\lambda}{2\pi}\right)^{n/2} \left(\prod_{i=1}^{n} \frac{w_i}{X_i^3}\right)^{1/2} \exp\left(\frac{\lambda}{\mu}\sum_{i=1}^{n} w_i - \frac{\lambda}{2\mu^2}\sum_{i=1}^{n} w_i X_i - \frac{\lambda}{2}\sum_{i=1}^{n} \frac{w_i}{X_i}\right).$
Solving the likelihood equation yields the following maximum likelihood estimates:
$\hat\mu = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}, \qquad \frac{1}{\hat\lambda} = \frac{1}{n}\sum_{i=1}^{n} w_i\left(\frac{1}{X_i} - \frac{1}{\hat\mu}\right).$
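For the unweighted case (all wi = 1) the estimates reduce to simple sample averages; a minimal illustrative sketch (the function name is hypothetical):

```python
def invgauss_mle(xs):
    """ML estimates (mu_hat, lambda_hat) for IG(mu, lambda), all weights = 1."""
    n = len(xs)
    mu_hat = sum(xs) / n
    # 1/lambda_hat is the average of (1/X_i - 1/mu_hat).
    inv_lambda_hat = sum(1.0 / x - 1.0 / mu_hat for x in xs) / n
    return mu_hat, 1.0 / inv_lambda_hat

mu_hat, lam_hat = invgauss_mle([1.0, 2.0, 4.0])
```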
Generate a random variate from a standard normal distribution,
$\nu \sim N(0, 1),$
and let y = ν². Then use the relation
$x = \mu + \frac{\mu^2 y}{2\lambda} - \frac{\mu}{2\lambda}\sqrt{4\mu\lambda y + \mu^2 y^2}.$
Generate another random variate, this time sampled from a uniform distribution between 0 and 1:
$z \sim U(0, 1).$
If
$z \le \frac{\mu}{\mu + x}$
then return x, else return μ²/x.[1]
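The complete sampling scheme, following the transformation-with-multiple-roots method of Michael, Schucany and Haas cited in note [1], can be sketched as follows (the function name is illustrative):

```python
import math
import random

def rand_invgauss(mu, lam):
    """Draw one IG(mu, lam) variate: take the smaller root x of the
    chi-squared transformation, accept it with probability mu/(mu+x),
    otherwise return the other root mu^2/x."""
    y = random.gauss(0.0, 1.0) ** 2
    x = (mu + mu * mu * y / (2.0 * lam)
         - (mu / (2.0 * lam)) * math.sqrt(4.0 * mu * lam * y + mu * mu * y * y))
    if random.random() <= mu / (mu + x):
        return x
    return mu * mu / x
```

Averaging many draws should approach the mean μ.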
Related distributions
• If $X \sim IG(\mu, \lambda)$, then $kX \sim IG(k\mu, k\lambda)$ for any k > 0.
• If $X_i \sim IG(\mu, \lambda)$, then $\sum_{i=1}^{n} X_i \sim IG(n\mu, n^2\lambda)$.
• If $X_i \sim IG(\mu, \lambda)$ for i = 1, …, n, then $\bar{X} \sim IG(\mu, n\lambda)$.
• If $X_i \sim IG(\mu_i, 2\mu_i^2)$, then $\sum_{i=1}^{n} X_i \sim IG\left(\sum_{i=1}^{n} \mu_i, 2\left(\sum_{i=1}^{n} \mu_i\right)^2\right)$.
The convolution of a Wald distribution and an exponential (the ex-Wald distribution) is used as a model for response
times in psychology.[2]
History
This distribution appears to have been first derived by Schrödinger in 1915 as the time to first passage of a Brownian
motion.[3] The name inverse Gaussian was proposed by Tweedie in 1945.[4] Wald re-derived this distribution in 1947
as the limiting form of a sample in a sequential probability ratio test. Tweedie investigated this distribution in 1957
and established some of its statistical properties.
Software
The R programming language has software for this distribution.[5]
Notes
[1] Generating Random Variates Using Transformations with Multiple Roots by John R. Michael, William R. Schucany and Roy W. Haas,
American Statistician, Vol. 30, No. 2 (May, 1976), pp. 88–90
[2] Schwarz W (2001) The ex-Wald distribution as a descriptive model of response times. Behav Res Methods Instrum Comput 33(4):457-469
[3] Schrodinger E (1915) Zur Theorie der Fall—und Steigversuche an Teilchenn mit Brownscher Bewegung. Physikalische Zeitschrift 16,
289-295
[4] Folks JL & Chhikara RS (1978) The inverse Gaussian and its statistical application - a review. J Roy Stat Soc 40(3) 263-289
[5] http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/SuppDists/html/invGauss.html
References
• The inverse gaussian distribution: theory, methodology, and applications by Raj Chhikara and Leroy Folks, 1989
ISBN 0-8247-7997-5
• System Reliability Theory by Marvin Rausand and Arnljot Høyland
• The Inverse Gaussian Distribution by Dr. V. Seshadri, Oxford Univ Press, 1993
External links
• Inverse Gaussian Distribution (http://mathworld.wolfram.com/InverseGaussianDistribution.html) in Wolfram
website.
Inverse-gamma distribution 269
Inverse-gamma distribution
Inverse-gamma
Parameters: α > 0 (shape), β > 0 (scale)
Support: x ∈ (0, ∞)
PDF: $\frac{\beta^\alpha}{\Gamma(\alpha)} x^{-\alpha-1} \exp\left(-\frac{\beta}{x}\right)$
CDF: $\frac{\Gamma(\alpha, \beta/x)}{\Gamma(\alpha)}$
Mean: $\frac{\beta}{\alpha - 1}$ for α > 1
Mode: $\frac{\beta}{\alpha + 1}$
Variance: $\frac{\beta^2}{(\alpha-1)^2(\alpha-2)}$ for α > 2
Skewness: $\frac{4\sqrt{\alpha-2}}{\alpha-3}$ for α > 3
Ex. kurtosis: $\frac{30\alpha - 66}{(\alpha-3)(\alpha-4)}$ for α > 4
Entropy: $\alpha + \ln(\beta\,\Gamma(\alpha)) - (1+\alpha)\,\psi(\alpha)$
MGF: does not exist
CF: $\frac{2(-i\beta t)^{\alpha/2}}{\Gamma(\alpha)} K_\alpha\left(\sqrt{-4i\beta t}\right)$
In probability theory and statistics, the inverse gamma distribution is a two-parameter family of continuous
probability distributions on the positive real line, which is the distribution of the reciprocal of a variable distributed
according to the gamma distribution. Perhaps the chief use of the inverse gamma distribution is in Bayesian
statistics, where it serves as the conjugate prior of the variance of a normal distribution. However, it is common
among Bayesians to consider an alternative parametrization of the normal distribution in terms of the precision,
defined as the reciprocal of the variance, which allows the gamma distribution to be used directly as a conjugate
prior.
Characterization
The inverse gamma distribution's probability density function is defined over the support x > 0:
$f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{-\alpha-1} \exp\left(-\frac{\beta}{x}\right),$
with shape parameter α and scale parameter β. The cumulative distribution function is
$F(x; \alpha, \beta) = \frac{\Gamma(\alpha, \beta/x)}{\Gamma(\alpha)} = Q\left(\alpha, \frac{\beta}{x}\right),$
where the numerator is the upper incomplete gamma function and the denominator is the gamma function. Many
math packages allow direct computation of Q, the regularized gamma function.
Properties
For α > 0 and β > 0, $\mathbb{E}[\ln X] = \ln\beta - \psi(\alpha)$ and $\mathbb{E}\left[X^{-1}\right] = \frac{\alpha}{\beta}.$
Related distributions
• If $X \sim \text{Inv-Gamma}(\alpha, \beta)$, then $kX \sim \text{Inv-Gamma}(\alpha, k\beta)$ for k > 0.
• If $X \sim \text{Inv-Gamma}(\nu/2, 1/2)$, then $X \sim \text{Inv-}\chi^2(\nu)$ (inverse-chi-squared distribution).
• If $X \sim \text{Inv-Gamma}(\nu/2, \nu\sigma^2/2)$, then $X \sim \text{Scale-inv-}\chi^2(\nu, \sigma^2)$ (scaled-inverse-chi-squared distribution).
• If $X \sim \text{Inv-Gamma}(1/2, c/2)$, then $X \sim \text{Levy}(0, c)$ (Lévy distribution).
• If $X \sim \text{Gamma}(\alpha, \beta)$ with rate parameter β (Gamma distribution), then $1/X \sim \text{Inv-Gamma}(\alpha, \beta)$.
• Inverse gamma distribution is a special case of type 5 Pearson distribution
• A multivariate generalization of the inverse-gamma distribution is the inverse-Wishart distribution.
Replacing k with α; θ with 1/β; and x with 1/x (including the 1/x² Jacobian factor from the change of variables) results in the inverse-gamma pdf shown above.
Iteratively reweighted least squares
The method of iteratively reweighted least squares (IRLS) solves certain optimization problems by an iterative method in which each step involves solving a weighted least squares problem of the form:
$\boldsymbol\beta^{(t+1)} = \underset{\boldsymbol\beta}{\operatorname{arg\,min}} \sum_{i=1}^{n} w_i\left(\boldsymbol\beta^{(t)}\right) \left| y_i - f_i(\boldsymbol\beta) \right|^2.$
IRLS is used to find the maximum likelihood estimates of a generalized linear model, and in robust regression to
find an M-estimator, as a way of mitigating the influence of outliers in an otherwise normally distributed data set,
for example by minimizing the least absolute errors rather than the least squared errors.
Although not a linear regression problem, Weiszfeld's algorithm for approximating the geometric median can also be
viewed as a special case of iteratively reweighted least squares, in which the objective function is the sum of
distances of the estimator from the samples.
One of the advantages of IRLS over linear and convex programming is that it can be used with Gauss–Newton and
Levenberg–Marquardt numerical algorithms.
Iteratively reweighted least squares 272
Examples
To find the parameters β = (β1, …, βk)ᵀ that minimize the Lp norm for the linear regression problem,
$\underset{\boldsymbol\beta}{\operatorname{arg\,min}} \left\| \mathbf{y} - X\boldsymbol\beta \right\|_p = \underset{\boldsymbol\beta}{\operatorname{arg\,min}} \sum_{i=1}^{n} \left| y_i - X_i\boldsymbol\beta \right|^p,$
the IRLS algorithm at step t+1 involves solving the weighted linear least squares problem:[3]
$\boldsymbol\beta^{(t+1)} = \underset{\boldsymbol\beta}{\operatorname{arg\,min}} \sum_{i=1}^{n} w_i^{(t)} \left| y_i - X_i\boldsymbol\beta \right|^2 = \left(X^{\mathrm{T}} W^{(t)} X\right)^{-1} X^{\mathrm{T}} W^{(t)} \mathbf{y},$
where W^(t) is the diagonal matrix of weights, updated after each iteration to
$w_i^{(t)} = \left| y_i - X_i\boldsymbol\beta^{(t)} \right|^{p-2}.$
In the case p = 1, this corresponds to least absolute deviation regression (in this case, the problem would be better
approached by use of linear programming methods).
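The p = 1 case can be sketched in a few lines, with the weights clipped away from zero for numerical stability; this is an illustrative sketch for a straight-line fit, not a library routine:

```python
def irls_lad_line(x, y, iters=100, delta=1e-8):
    """Fit y ~ a + b*x in the least-absolute-deviation sense by IRLS.

    Each pass solves the 2x2 weighted least squares normal equations
    with weights w_i = 1 / max(|residual_i|, delta)."""
    a = b = 0.0
    for _ in range(iters):
        w = [1.0 / max(abs(yi - (a + b * xi)), delta) for xi, yi in zip(x, y)]
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swy = sum(wi * yi for wi, yi in zip(w, y))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
        det = sw * swxx - swx * swx
        a = (swy * swxx - swx * swxy) / det
        b = (sw * swxy - swx * swy) / det
    return a, b

# Four points on y = 1 + 2x plus one gross outlier: the L1 fit ignores the outlier.
a, b = irls_lad_line([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 5.0, 7.0, 100.0])
```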
Notes
[1] Chartrand, R.; Yin, W. (March 31 – April 4, 2008). "Iteratively reweighted algorithms for compressive sensing" (http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4518498). IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008. pp. 3869–3872.
[2] Daubechies, I. et al (2008). "Iteratively reweighted least squares minimization for sparse recovery" (http://www.ricam.oeaw.ac.at/people/page/fornasier/DDFG14.pdf). Retrieved 2010-11-02.
[3] Gentle, James (2007). "6.8.1 Solutions that Minimize Other Norms of the Residuals". Matrix algebra. New York: Springer.
doi:10.1007/978-0-387-70873-7. ISBN 978-0-387-70872-0.
References
• Stanford Lecture Notes on the IRLS algorithm by Antoine Guitton (http://sepwww.stanford.edu/public/docs/
sep103/antoine2/paper_html/index.html)
• Numerical Methods for Least Squares Problems by Åke Björck (http://www.mai.liu.se/~akbjo/LSPbook.
html) (Chapter 4: Generalized Least Squares Problems.)
• Practical Least-Squares for Computer Graphics. SIGGRAPH Course 11 (http://graphics.stanford.edu/~jplewis/
lscourse/SLIDES.pdf)
Kendall tau rank correlation coefficient 273
Definition
Let (x1, y1), (x2, y2), …, (xn, yn) be a set of joint observations from two random variables X and Y respectively, such
that all the values of (xi) and (yi) are unique. Any pair of observations (xi, yi) and (xj, yj) are said to be concordant if
the ranks for both elements agree: that is, if both xi > xj and yi > yj or if both xi < xj and yi < yj. They are said to be
discordant if xi > xj and yi < yj or if xi < xj and yi > yj. If xi = xj or yi = yj, the pair is neither concordant nor
discordant.
The Kendall τ coefficient is defined as:
$\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\frac{1}{2} n (n-1)}.$[3]
Properties
The denominator is the total number of pairs, so the coefficient must be in the range −1 ≤ τ ≤ 1.
• If the agreement between the two rankings is perfect (i.e., the two rankings are the same) the coefficient has value
1.
• If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other) the
coefficient has value −1.
• If X and Y are independent, then we would expect the coefficient to be approximately zero.
Hypothesis test
The Kendall rank coefficient is often used as a test statistic in a statistical hypothesis test to establish whether two
variables may be regarded as statistically dependent. This test is non-parametric, as it does not rely on any
assumptions on the distributions of X or Y.
Under a null hypothesis of X and Y being independent, the sampling distribution of τ will have an expected value of
zero. The precise distribution cannot be characterized in terms of common distributions, but may be calculated
exactly for small samples; for larger samples, it is common to use an approximation to the normal distribution, with
mean zero and variance
$\frac{2(2n+5)}{9n(n-1)}.$[4]
Tau-a
The tau-a statistic tests the strength of association of cross tabulations. Both variables have to be ordinal. Tau-a makes no adjustment for ties.
Tau-b
Tau-b statistic, unlike tau-a, makes adjustments for ties and is suitable for square tables. Values of tau-b range from
−1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A
value of zero indicates the absence of association.
The Kendall tau-b coefficient is defined as:
$\tau_B = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}},$
where
$n_0 = \frac{n(n-1)}{2}, \quad n_1 = \sum_i \frac{t_i(t_i-1)}{2}, \quad n_2 = \sum_j \frac{u_j(u_j-1)}{2},$
with $n_c$ the number of concordant pairs, $n_d$ the number of discordant pairs, $t_i$ the number of tied values in the i-th group of ties for the first quantity, and $u_j$ the number of tied values in the j-th group of ties for the second quantity.
Tau-c
Tau-c differs from tau-b in being more suitable for rectangular tables than for square tables.
Significance tests
When two quantities are statistically independent, the distribution of $\tau$ is not easily characterizable in terms of
known distributions. However, the following statistic, $z_A$, is approximately characterized by a standard
normal distribution when the quantities are statistically independent:
$z_A = \frac{3(n_c - n_d)}{\sqrt{n(n-1)(2n+5)/2}}.$
Thus, to test whether two quantities are statistically dependent, compute $z_A$, and find the cumulative
probability for a standard normal distribution at $-|z_A|$. For a 2-tailed test, multiply that number by two to obtain
the p-value. If the p-value is below a given significance level (typically 5%), the null hypothesis
that the quantities are statistically independent can be rejected in favour of the hypothesis that they are dependent.
Numerous adjustments should be added to $z_A$ when accounting for ties. The following statistic, $z_B$, provides an
approximation coinciding with the $\tau_B$ distribution and is again approximately characterized by a standard normal
distribution when the quantities are statistically independent:
$z_B = \frac{n_c - n_d}{\sqrt{v}},$
where
$v = \frac{v_0 - v_t - v_u}{18} + v_1 + v_2,$
$v_0 = n(n-1)(2n+5), \quad v_t = \sum_i t_i(t_i-1)(2t_i+5), \quad v_u = \sum_j u_j(u_j-1)(2u_j+5),$
$v_1 = \frac{\sum_i t_i(t_i-1) \sum_j u_j(u_j-1)}{2n(n-1)}, \quad v_2 = \frac{\sum_i t_i(t_i-1)(t_i-2) \sum_j u_j(u_j-1)(u_j-2)}{9n(n-1)(n-2)}.$
Algorithms
The direct computation of the numerator $n_c - n_d$ involves two nested iterations, as characterized by the
following pseudo-code:
numer := 0
for i := 2..N do
    for j := 1..(i-1) do
        numer := numer + sgn(x[i] - x[j]) * sgn(y[i] - y[j])
return numer
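In Python the same double loop reads as follows (for tie-free data; `kendall_tau` is an illustrative name):

```python
def sgn(v):
    """Sign of v: -1, 0, or +1."""
    return (v > 0) - (v < 0)

def kendall_tau(x, y):
    """O(n^2) Kendall tau: (concordant - discordant) / (n(n-1)/2)."""
    n = len(x)
    numer = sum(sgn(x[i] - x[j]) * sgn(y[i] - y[j])
                for i in range(1, n) for j in range(i))
    return numer / (n * (n - 1) / 2)
```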
Although quick to implement, this algorithm is O(n²) in complexity and becomes very slow on large samples. A
more sophisticated algorithm[5] built upon the Merge Sort algorithm can be used to compute the numerator in
O(n·log n) time.
Begin by ordering your data points by the first quantity, x, and secondarily (among ties in x) by the
second quantity, y. With this initial ordering, y is not sorted, and the core of the algorithm consists of computing
how many steps a Bubble Sort would take to sort this initial y. An enhanced Merge Sort algorithm, with
O(n log n) complexity, can be applied to compute the number of swaps, S(y), that would be required by a
Bubble Sort to sort y. Then the numerator for $\tau$ is computed as:
$n_c - n_d = n_0 - n_1 - n_2 + n_3 - 2 S(y),$
where $n_3$ is computed like $n_1$ and $n_2$, but with respect to the joint ties in x and y.
A Merge Sort partitions the data to be sorted, y, into two roughly equal halves, $y_{\mathrm{left}}$ and $y_{\mathrm{right}}$, then sorts each
half recursively, and then merges the two sorted halves into a fully sorted vector. The number of Bubble Sort swaps is
equal to:
$S(y) = S(y_{\mathrm{left}}) + S(y_{\mathrm{right}}) + M(Y_{\mathrm{left}}, Y_{\mathrm{right}}),$
where $Y_{\mathrm{left}}$ and $Y_{\mathrm{right}}$ are the sorted versions of $y_{\mathrm{left}}$ and $y_{\mathrm{right}}$, and $M(\cdot,\cdot)$ characterizes the Bubble Sort
swap-equivalent for a merge operation. $M(\cdot,\cdot)$ is computed as depicted in the following pseudo-code:
function M(L[1..n], R[1..m])
    i := 1
    j := 1
    nSwaps := 0
    while i <= n and j <= m do
        if R[j] < L[i] then
            nSwaps := nSwaps + n - i + 1
            j := j + 1
        else
            i := i + 1
    return nSwaps
A side effect of the above steps is that you end up with both a sorted version of x and a sorted version of y. With
these, the factors $t_i$ and $u_j$ used to compute $\tau_B$ are easily obtained in a single linear-time pass through the sorted
arrays.
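Specialized to tie-free data (so n1 = n2 = n3 = 0 and the numerator reduces to n0 − 2S(y)), the whole O(n log n) scheme can be sketched as:

```python
def sort_and_count(a):
    """Merge sort that also counts Bubble-Sort swaps (inversions)."""
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, s_left = sort_and_count(a[:mid])
    right, s_right = sort_and_count(a[mid:])
    merged, swaps, i, j = [], s_left + s_right, 0, 0
    while i < len(left) or j < len(right):
        if j == len(right) or (i < len(left) and left[i] <= right[j]):
            merged.append(left[i])
            i += 1
        else:
            swaps += len(left) - i  # right[j] jumps the remaining left elements
            merged.append(right[j])
            j += 1
    return merged, swaps

def kendall_tau_fast(x, y):
    """Tau for tie-free data: sort by x, count inversions of y."""
    ys = [yi for _, yi in sorted(zip(x, y))]
    n0 = len(x) * (len(x) - 1) // 2
    _, s = sort_and_count(ys)
    return (n0 - 2 * s) / n0
```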
A second algorithm with O(n·log n) time complexity, based on AVL trees, was devised by David Christensen.[6]
References
[1] Kendall, M. (1938). "A New Measure of Rank Correlation". Biometrika 30 (1–2): 81–89. doi:10.1093/biomet/30.1-2.81. JSTOR 2332226.
[2] Kruskal, W.H. (1958). "Ordinal Measures of Association". Journal of the American Statistical Association 53 (284): 814–861.
doi:10.2307/2281954. JSTOR 2281954. MR100941.
[3] Nelsen, R.B. (2001), "Kendall tau metric" (http://www.encyclopediaofmath.org/index.php?title=K/k130020), in Hazewinkel, Michiel,
Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4.
[4] Prokhorov, A.V. (2001), "Kendall coefficient of rank correlation" (http://www.encyclopediaofmath.org/index.php?title=K/k055200), in
Hazewinkel, Michiel, Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4.
[5] Knight, W. (1966). "A Computer Method for Calculating Kendall's Tau with Ungrouped Data". Journal of the American Statistical
Association 61 (314): 436–439. doi:10.2307/2282833. JSTOR 2282833.
[6] Christensen, David (2005). "Fast algorithms for the calculation of Kendall's τ". Computational Statistics 20 (1): 51–62.
doi:10.1007/BF02736122.
External links
• Tied rank calculation (http://www.statsdirect.com/help/nonparametric_methods/kend.htm)
• Why Kendall tau? (http://www.rsscse-edu.org.uk/tsj/bts/noether/text.html)
• Software for computing Kendall's tau on very large datasets (http://law.dsi.unimi.it/software/)
• Online software: computes Kendall's tau rank correlation (http://www.wessa.net/rwasp_kendall.wasp)
• The CORR Procedure: Statistical Computations (http://www.technion.ac.il/docs/sas/proc/zompmeth.htm)
Kolmogorov–Smirnov test 277
Kolmogorov–Smirnov test
In statistics, the Kolmogorov–Smirnov test (K–S test) is a nonparametric test for the equality of continuous,
one-dimensional probability distributions that can be used to compare a sample with a reference probability
distribution (one-sample K–S test), or to compare two samples (two-sample K–S test). The Kolmogorov–Smirnov
statistic quantifies a distance between the empirical distribution function of the sample and the cumulative
distribution function of the reference distribution, or between the empirical distribution functions of two samples.
The null distribution of this statistic is calculated under the null hypothesis that the samples are drawn from the same
distribution (in the two-sample case) or that the sample is drawn from the reference distribution (in the one-sample
case). In each case, the distributions considered under the null hypothesis are continuous distributions but are
otherwise unrestricted.
The two-sample KS test is one of the most useful and general nonparametric methods for comparing two samples, as
it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two
samples.
The Kolmogorov–Smirnov test can be modified to serve as a goodness of fit test. In the special case of testing for
normality of the distribution, samples are standardized and compared with a standard normal distribution. This is
equivalent to setting the mean and variance of the reference distribution equal to the sample estimates, and it is
known that using these to define the specific reference distribution changes the null distribution of the test statistic:
see below. Various studies have found that, even in this corrected form, the test is less powerful for testing normality
than the Shapiro–Wilk test or Anderson–Darling test.[1]
Kolmogorov–Smirnov statistic
The empirical distribution function Fn for n iid observations Xi is defined as
$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{X_i \le x},$
where $I_{X_i \le x}$ is the indicator function, equal to 1 if $X_i \le x$ and equal to 0 otherwise.
The Kolmogorov–Smirnov statistic for a given cumulative distribution function F(x) is
$D_n = \sup_x |F_n(x) - F(x)|,$
where sup x is the supremum of the set of distances. By the Glivenko–Cantelli theorem, if the sample comes from
distribution F(x), then Dn converges to 0 almost surely. Kolmogorov strengthened this result, by effectively
providing the rate of this convergence (see below). The Donsker theorem provides yet a stronger result.
In practice, the statistic requires a relatively large number of data points to properly reject the null hypothesis.
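Because Fn is a step function, the supremum is attained at a sample point, so Dn can be computed exactly by checking Fn just before and at each order statistic; a minimal illustrative sketch (`ks_statistic` is not a library name):

```python
def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic D_n = sup_x |F_n(x) - F(x)|."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        fx = cdf(x)
        # F_n jumps from i/n to (i+1)/n at x; check both sides of the jump.
        d = max(d, fx - i / n, (i + 1) / n - fx)
    return d

# Two points checked against the Uniform(0, 1) cdf.
d = ks_statistic([0.25, 0.75], lambda t: t)
```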
Kolmogorov distribution
The Kolmogorov distribution is the distribution of the random variable
$K = \sup_{t \in [0,1]} |B(t)|,$
where B(t) is the Brownian bridge. The cumulative distribution function of K is given by[2]
$\Pr(K \le x) = 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 x^2} = \frac{\sqrt{2\pi}}{x} \sum_{k=1}^{\infty} e^{-(2k-1)^2 \pi^2 / (8x^2)}.$
Both the form of the Kolmogorov–Smirnov test statistic and its asymptotic distribution under the null hypothesis
were published by Andrey Kolmogorov,[3] while a table of the distribution was published by Nikolai Vasilyevich
Smirnov.[4] Recurrence relations for the distribution of the test statistic in finite samples are available.[3]
Kolmogorov–Smirnov test
Under the null hypothesis that the sample comes from the hypothesized distribution F(x),
$\sqrt{n}\,D_n \xrightarrow{n\to\infty} \sup_t |B(F(t))|$
in distribution, where B(t) is the Brownian bridge. If F is continuous, then under the null hypothesis $\sqrt{n}\,D_n$ converges to the Kolmogorov distribution, which does not depend on F.
Two-sample Kolmogorov–Smirnov test
The Kolmogorov–Smirnov test may also be used to test whether two underlying one-dimensional probability distributions differ. In this case, the Kolmogorov–Smirnov statistic is
$D_{n,m} = \sup_x |F_{1,n}(x) - F_{2,m}(x)|,$
where $F_{1,n}$ and $F_{2,m}$ are the empirical distribution functions of the first and the second sample respectively.
The null hypothesis is rejected at level α if
$\sqrt{\frac{n m}{n + m}}\, D_{n,m} > K_\alpha.$
Note that the two-sample test checks whether the two data samples come from the same distribution. This does not
specify what that common distribution is (e.g. normal or not normal). Again, tables of critical values have been
published.[5]
Footnotes
[1] Stephens, M. A. (1974). "EDF Statistics for Goodness of Fit and Some Comparisons". Journal of the American Statistical Association
(American Statistical Association) 69 (347): 730–737. doi:10.2307/2286009. JSTOR 2286009.
[2] Marsaglia G, Tsang WW, Wang J (2003). "Evaluating Kolmogorov's Distribution" (http://www.jstatsoft.org/v08/i18/paper). Journal of
Statistical Software 8 (18): 1–4.
[3] Kolmogorov A (1933). "Sulla determinazione empirica di una legge di distribuzione". G. Inst. Ital. Attuari 4: 83.
[4] Smirnov NV (1948). "Tables for estimating the goodness of fit of empirical distributions". Annals of Mathematical Statistics 19: 279.
[5] Pearson E.S. and Hartley, H.O., ed. (1972). Biometrika Tables for Statisticians. 2. Cambridge University Press. pp. 117–123, Tables 54, 55.
ISBN 0-521-06937-8.
[6] Galen R. Shorack and Jon A. Wellner (1986). Empirical Processes with Applications to Statistics. p. 239. ISBN 047186725X.
[7] Peacock J.A. (1983). "Two-dimensional goodness-of-fit testing in astronomy". Monthly Notices of the Royal Astronomical Society 202:
615–627. Bibcode 1983MNRAS.202..615P.
[8] Fasano, G., Franceschini, A. (1987). "A multidimensional version of the Kolmogorov–Smirnov test" (http://articles.adsabs.harvard.edu/full/1987MNRAS.225..155F). Monthly Notices of the Royal Astronomical Society (ISSN 0035-8711) 225: 155–170.
[9] Lopes, R.H.C., Reid, I., Hobson, P.R. (April 23–27, 2007). "The two-dimensional Kolmogorov–Smirnov test" (http://dspace.brunel.ac.uk/bitstream/2438/1166/1/acat2007.pdf). XI International Workshop on Advanced Computing and Analysis Techniques in Physics Research.
Amsterdam, the Netherlands.
References
• Eadie, W.T.; D. Drijard, F.E. James, M. Roos and B. Sadoulet (1971). Statistical Methods in Experimental
Physics. Amsterdam: North-Holland. pp. 269–271. ISBN 0-444-10117-9.
• Stuart, Alan; Ord, Keith; Arnold, Steven [F.] (1999). Classical Inference and the Linear Model. Kendall's
Advanced Theory of Statistics. 2A (Sixth ed.). London: Arnold. pp. 25.37–25.43. ISBN 0-340-66230-1.
MR1687411.
• Corder, G.W., Foreman, D.I. (2009).Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach
Wiley, ISBN 978-0-470-45461-9
• Stephens, M.A. (1979) Test of fit for the logistic distribution based on the empirical distribution function,
Biometrika, 66(3), 591-5.
External links
• Short introduction (http://www.physics.csbsju.edu/stats/KS-test.html)
• KS test explanation (http://www.itl.nist.gov/div898/handbook/eda/section3/eda35g.htm)
• JavaScript implementation of one- and two-sided tests (http://www.ciphersbyritter.com/JAVASCRP/
NORMCHIK.HTM)
• Online calculator with the K-S test (http://jumk.de/statistic-calculator/)
• Open-source C++ code to compute the Kolmogorov distribution (http://root.cern.ch/root/html/TMath.
html#TMath:KolmogorovProb) and perform the K-S test (http://root.cern.ch/root/html/TMath.
html#TMath:KolmogorovTest)
• Paper on Evaluating Kolmogorov’s Distribution (http://www.jstatsoft.org/v08/i18/paper); contains C
implementation. This is the method used in Matlab.
Kronecker's lemma
In mathematics, Kronecker's lemma (see, e.g., Shiryaev (1996, Lemma IV.3.2)) is a result about the relationship
between convergence of infinite sums and convergence of sequences. The lemma is often used in the proofs of
theorems concerning sums of independent random variables such as the strong Law of large numbers. The lemma is
named after the German mathematician Leopold Kronecker.
The lemma
If $(x_n)_{n=1}^{\infty}$ is an infinite sequence of real numbers such that
$\sum_{n=1}^{\infty} x_n = s$
exists and is finite, then for any sequence $0 < b_1 \le b_2 \le b_3 \le \cdots$ with $b_n \to \infty$ we have
$\lim_{n\to\infty} \frac{1}{b_n} \sum_{k=1}^{n} b_k x_k = 0.$
Proof
Let $S_k$ denote the partial sums of the x's. Using summation by parts,
$\frac{1}{b_n}\sum_{k=1}^{n} b_k x_k = S_n - \frac{1}{b_n}\sum_{k=1}^{n-1} (b_{k+1} - b_k) S_k.$
Pick any ε > 0. Now choose N so that $S_k$ is ε-close to s for k > N. This can be done as the sequence $S_k$ converges
to s. Then the right hand side is:
$S_n - \frac{1}{b_n}\sum_{k=1}^{N-1} (b_{k+1} - b_k) S_k - \frac{b_n - b_N}{b_n}\, s - \frac{1}{b_n}\sum_{k=N}^{n-1} (b_{k+1} - b_k)(S_k - s).$
Now, let n go to infinity. The first term goes to s, which cancels with the third term. The second term goes to zero (as
the sum is a fixed value). Since the b sequence is increasing, the last term is bounded by $\varepsilon\,(b_n - b_N)/b_n \le \varepsilon$.
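The lemma can be illustrated numerically with x_n = (−1)^(n+1)/n, whose series converges (to ln 2), and b_n = n, so that b_n·x_n = (−1)^(n+1); the averaged sums go to zero even though the weighted series itself does not converge:

```python
def kronecker_average(n):
    """(1/b_n) * sum_{k<=n} b_k x_k with x_k = (-1)^(k+1)/k and b_k = k."""
    total = sum((-1) ** (k + 1) for k in range(1, n + 1))
    return total / n

# The partial sums of (-1)^(k+1) oscillate between 0 and 1, so the
# average is bounded by 1/n and tends to zero, as the lemma predicts.
assert abs(kronecker_average(10 ** 5)) <= 1e-4
```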
Kronecker's lemma 281
References
• Shiryaev, Albert N. (1996). Probability (2nd ed.). Springer. ISBN 0-387-94549-0.
Kullback–Leibler divergence
In probability theory and information theory, the Kullback–Leibler divergence[1][2][3] (also information
divergence, information gain, relative entropy, or KLIC) is a non-symmetric measure of the difference between
two probability distributions P and Q. KL measures the expected number of extra bits required to code samples from
P when using a code based on Q, rather than using a code based on P. Typically P represents the "true" distribution
of data, observations, or a precisely calculated theoretical distribution. The measure Q typically represents a theory,
model, description, or approximation of P.
Although it is often intuited as a metric or distance, the KL divergence is not a true metric — for example, it is not
symmetric: the KL from P to Q is generally not the same as the KL from Q to P. However, its infinitesimal form,
specifically its Hessian, is a metric tensor: it is the Fisher information metric.
KL divergence is a special case of a broader class of divergences called f-divergences. It was originally introduced
by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions. It can be
derived from the Bregman divergence.
Definition
For probability distributions P and Q of a discrete random variable, their K–L divergence is defined to be
$D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \ln\frac{P(i)}{Q(i)}.$
In words, it is the average of the logarithmic difference between the probabilities P and Q, where the average is taken
using the probabilities P. The K–L divergence is only defined if P and Q both sum to 1 and if Q(i) = 0 implies P(i) = 0 for any i
(absolute continuity). If the quantity 0 ln 0 appears in the formula, it is interpreted as zero.
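The discrete definition translates directly into code, handling the 0·ln 0 convention and the absolute-continuity requirement explicitly (`kl_divergence` is an illustrative name; the result is in nats):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue              # 0 * log(0/q) is interpreted as zero
        if qi == 0.0:
            return float("inf")   # P must vanish wherever Q does
        total += pi * math.log(pi / qi)
    return total
```

Note that the divergence is not symmetric in its arguments, as the surrounding text emphasizes.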
For distributions P and Q of a continuous random variable, the KL divergence is defined to be the integral:[4]
$D_{\mathrm{KL}}(P\|Q) = \int_{-\infty}^{\infty} p(x) \ln\frac{p(x)}{q(x)}\,\mathrm{d}x,$
where p and q denote the densities of P and Q. More generally, if P and Q are probability measures over a set X, then
$D_{\mathrm{KL}}(P\|Q) = -\int_X \ln\frac{\mathrm{d}Q}{\mathrm{d}P}\,\mathrm{d}P,$
where $\frac{\mathrm{d}Q}{\mathrm{d}P}$ is the Radon–Nikodym derivative of Q with respect to P, and provided the expression on the right-hand
side exists, which we recognize as the entropy of P relative to Q. Continuing in this case, if $\mu$ is any measure on X for which
$p = \frac{\mathrm{d}P}{\mathrm{d}\mu}$ and $q = \frac{\mathrm{d}Q}{\mathrm{d}\mu}$ exist, then the Kullback–Leibler divergence from P to Q is given as
$D_{\mathrm{KL}}(P\|Q) = \int_X p \ln\frac{p}{q}\,\mathrm{d}\mu.$
Kullback–Leibler divergence 282
The logarithms in these formulae are taken to base 2 if information is measured in units of bits, or to base e if
information is measured in nats. Most formulas involving the KL divergence hold irrespective of log base.
In this article, this will be referred to as the divergence from P to Q, although some authors call it the divergence
"from Q to P" and others call it the divergence "between P and Q" (though note it is not symmetric as this latter
terminology implies). Care must be taken due to the lack of standardization in terminology.
Motivation
In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value xi out of a set of possibilities X can be seen as representing an implicit probability distribution q(xi) = 2^(−li) over X, where li is the length of the code for xi in bits. Therefore, KL divergence can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution Q is used, compared to using a code based on the true distribution P:
$D_{\mathrm{KL}}(P\|Q) = -\sum_i p(x_i) \log q(x_i) + \sum_i p(x_i) \log p(x_i) = H(P, Q) - H(P),$
where H(P,Q) is called the cross entropy of P and Q, and H(P) is the entropy of P.
[Figure: Illustration of the Kullback–Leibler (KL) divergence for two normal Gaussian distributions. Note the typical asymmetry for the KL divergence is clearly visible.]
Note also that there is a relation between the Kullback–Leibler divergence and the "rate function" in the theory of
large deviations.[5][6]
Properties
The Kullback–Leibler divergence is always non-negative,
$D_{\mathrm{KL}}(P\|Q) \ge 0,$
a result known as Gibbs' inequality, with DKL(P||Q) zero if and only if P = Q almost everywhere. The entropy H(P)
thus sets a minimum value for the cross-entropy H(P,Q), the expected number of bits required when using a code
based on Q rather than P; and the KL divergence therefore represents the expected number of extra bits that must be
transmitted to identify a value x drawn from X, if a code is used corresponding to the probability distribution Q,
rather than the "true" distribution P.
The Kullback–Leibler divergence remains well-defined for continuous distributions, and furthermore is invariant
under parameter transformations. For example, if a transformation is made from variable x to variable y(x), then,
since P(x)dx = P(y)dy and Q(x)dx = Q(y)dy, the Kullback–Leibler divergence may be rewritten:
$D_{\mathrm{KL}}(P\|Q) = \int P(x)\ln\frac{P(x)}{Q(x)}\,\mathrm{d}x = \int P(y)\ln\frac{P(y)}{Q(y)}\,\mathrm{d}y.$
Although it was assumed that the transformation was continuous, this need
not be the case. This also shows that the Kullback–Leibler divergence produces a dimensionally consistent quantity,
since if x is a dimensioned variable, P(x) and Q(x) are also dimensioned, since e.g. P(x)dx is dimensionless. The
argument of the logarithmic term is and remains dimensionless, as it must. It can therefore be seen as in some ways a
more fundamental quantity than some other properties in information theory[7] (such as self-information or Shannon
entropy), which can become undefined or negative for non-discrete probabilities.
The Kullback–Leibler divergence is additive for independent distributions in much the same way as Shannon
entropy. If $P_1, P_2$ are independent distributions, with the joint distribution $P(x, y) = P_1(x) P_2(y)$, and
$Q, Q_1, Q_2$ likewise, then
$D_{\mathrm{KL}}(P\|Q) = D_{\mathrm{KL}}(P_1\|Q_1) + D_{\mathrm{KL}}(P_2\|Q_2).$
[8]
The logarithm must be taken to base e since the two terms following the logarithm are themselves base-e logarithms
of expressions that are either factors of the density function or otherwise arise naturally. The equation therefore gives
a result measured in nats. Dividing the entire expression above by loge 2 yields the divergence in bits.
Relation to metrics
One might be tempted to call the Kullback–Leibler divergence a "distance metric" on the space of probability
distributions, but this would not be correct as it is not symmetric – that is, DKL(P||Q) ≠ DKL(Q||P) – nor does it
satisfy the triangle inequality. Still, being a premetric, it generates a topology on the space of generalized probability
distributions, of which probability distributions proper are a special case. More concretely, if {P1, P2, ...} is a
sequence of distributions such that
DKL(Pn||Q) → 0,
then it is said that Pn → Q. Pinsker's inequality entails that DKL(Pn||Q) → 0 implies ||Pn − Q||TV → 0, where the latter stands for
the usual convergence in total variation.
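Both the asymmetry and Pinsker's inequality (total variation distance bounded by √(DKL/2), in nats) are easy to check numerically; a small Python sketch with an illustrative pair of two-point distributions:

```python
import math

def kl(p, q):
    """D_KL(P||Q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

# Asymmetry: D_KL(P||Q) and D_KL(Q||P) generally differ
print(kl(p, q), kl(q, p))

# Pinsker's inequality: ||P - Q||_TV <= sqrt(D_KL(P||Q) / 2)
tv = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
assert tv <= math.sqrt(kl(p, q) / 2)
assert tv <= math.sqrt(kl(q, p) / 2)
```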
Following Rényi (1970, 1961),[9][10] the divergence DKL(P||Q) is sometimes also called the information gain about X achieved if P
can be used instead of Q. It is also called the relative entropy of P with respect to Q.
with an infinitesimal change of in the j direction, and the Hessian matrix representing the
corresponding change in the probability distribution. Then, for this expression for P, one has
The self-information
is the KL divergence of the probability distribution P(i) from a Kronecker delta representing certainty that i=m —
i.e. the number of extra bits that must be transmitted to identify i if only the probability distribution P(i) is available
to the receiver, not the fact that i=m.
The mutual information,
is the KL divergence of the product P(X)P(Y) of the two marginal probability distributions from the joint probability
distribution P(X,Y) — i.e. the expected number of extra bits that must be transmitted to identify X and Y if they are
coded using only their marginal distributions instead of the joint distribution. Equivalently, if the joint probability
P(X,Y) is known, it is the expected number of extra bits that must on average be sent to identify Y if the value of X is
not already known to the receiver.
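This identity — mutual information as the KL divergence of the product of marginals from the joint — can be verified directly. A minimal Python sketch (the joint distribution is an illustrative 2×2 example):

```python
import math

# Joint distribution P(X, Y) on a 2x2 alphabet (illustrative numbers)
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# I(X;Y) = D_KL( P(X,Y) || P(X)P(Y) ), in bits
mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
print(mi)  # positive, since X and Y are correlated here
```

The same number falls out of the textbook identity I(X;Y) = H(X) + H(Y) − H(X,Y), which gives an independent check.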
The Shannon entropy,
is the number of bits which would have to be transmitted to identify X from N equally likely possibilities, less the KL
divergence of the uniform distribution PU(X) from the true distribution P(X) — i.e. less the expected number of bits
saved, which would have had to be sent if the value of X were coded according to the uniform distribution PU(X)
rather than the true distribution P(X).
The conditional entropy,
is the number of bits which would have to be transmitted to identify X from N equally likely possibilities, less the KL
divergence of the product distribution PU(X) P(Y) from the true joint distribution P(X,Y) — i.e. less the expected
number of bits saved which would have had to be sent if the value of X were coded according to the uniform
distribution PU(X) rather than the conditional distribution P(X|Y) of X given Y.
The cross entropy between two probability distributions measures the average number of bits needed to identify an
event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than
the "true" distribution p. The cross entropy for two distributions p and q over the same probability space is thus
defined as follows:
H(p, q) = Ep[−log q] = H(p) + DKL(p||q).
which may be less than or greater than the original entropy H(p(·|I)). However, from the standpoint of the new
probability distribution one can estimate that to have used the original code based on p(x|I) instead of a new code
based on p(x|y,I) would have added an expected number of bits
DKL( p(x|y,I) || p(x|I) )
to the message length. This therefore represents the amount of useful information, or information gain, about X, that
we can estimate has been learned by discovering Y = y.
If a further piece of data, Y2 = y2, subsequently comes in, the probability distribution for x can be updated further, to
give a new best guess p(x|y1,y2,I). If one reinvestigates the information gain for using p(x|y1,I) rather than p(x|I), it
turns out that it may be either greater or less than previously estimated:
and so the combined information gain does not obey the triangle inequality:
All one can say is that on average, averaging using p(y2|y1,x,I), the two sides will average out.
Discrimination information
The Kullback–Leibler divergence DKL( p(x|H1) || p(x|H0) ) can also be interpreted as the expected discrimination
information for H1 over H0: the mean information per sample for discriminating in favor of a hypothesis H1 against
a hypothesis H0, when hypothesis H1 is true.[12] Another name for this quantity, given to it by I.J. Good, is the
expected weight of evidence for H1 over H0 to be expected from each sample.
The expected weight of evidence for H1 over H0, DKL( p(x|H1) || p(x|H0) ), is not the same as the information gain
expected per sample about the probability distribution p(H) of the hypotheses,
IG = DKL( p(H|x) || p(H|I) ).
Either of the two quantities can be used as a utility function in Bayesian experimental design, to choose an optimal
next question to investigate: but they will in general lead to rather different experimental strategies.
On the entropy scale of information gain there is very little difference between near certainty and absolute
certainty—coding according to a near certainty requires hardly any more bits than coding according to an absolute
certainty. On the other hand, on the logit scale implied by weight of evidence, the difference between the two is
enormous – infinite perhaps; this might reflect the difference between being almost sure (on a probabilistic level)
that, say, the Riemann hypothesis is correct, compared to being certain that it is correct because one has a
mathematical proof. These two different scales of loss function for uncertainty are both useful, according to how
well each reflects the particular circumstances of the problem in question.
i.e. the sum of the KL divergence of p(a) the prior distribution for a from the updated distribution u(a), plus the
expected value (using the probability distribution u(a)) of the KL divergence of the prior conditional distribution
p(x|a) from the new conditional distribution q(x|a). (Note that often the latter expected value is called the conditional
KL divergence (or conditional relative entropy) and denoted by DKL(q(x|a)||p(x|a))[13]) This is minimised if q(x|a) =
p(x|a) over the whole support of u(a); and we note that this result incorporates Bayes' theorem, if the new distribution
u(a) is in fact a δ function representing certainty that a has one particular value.
MDI can be seen as an extension of Laplace's Principle of Insufficient Reason, and the Principle of Maximum
Entropy of E.T. Jaynes. In particular, it is the natural extension of the principle of maximum entropy from discrete to
continuous distributions, for which Shannon entropy ceases to be so useful (see differential entropy), but the KL
divergence continues to be just as relevant.
In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE) or
Minxent for short. Minimising the KL divergence of m from p with respect to m is equivalent to minimizing the
cross-entropy of p and m, since H(p, m) = H(p) + DKL(p||m),
which is appropriate if one is trying to choose an adequate approximation to p. However, this is just as often not the
task one is trying to achieve. Instead, just as often it is m that is some fixed prior reference measure, and p that one is
attempting to optimise by minimising DKL(p||m) subject to some constraint. This has led to some ambiguity in the
literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be DKL(p||m),
rather than H(p,m).
(Figure: pressure versus volume plot of available work from a mole of Argon gas relative to ambient, calculated as To
times KL divergence.)
When temperature T is fixed, free energy (T × A) is also minimized. Thus if T, V and the number of molecules N are
constant, the Helmholtz free energy F ≡ U − TS (where U is energy) is minimized as a system "equilibrates." If T and
P are held constant (say during processes in your body), the Gibbs free energy G ≡ U + PV − TS is minimized instead. The
change in free energy under these conditions is a measure of available work that might be done in the process. Thus
available work for an ideal gas at constant temperature To and pressure Po is W = ΔG = NkToΘ[V/Vo] where Vo
= NkTo/Po and Θ[x] ≡ x − 1 − ln x ≥ 0 (see also Gibbs inequality).
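The nonnegativity of Θ[x] = x − 1 − ln x (an instance of Gibbs' inequality) and the resulting net surprisal can be checked numerically; a small Python sketch, written per molecule in units of k, with illustrative state ratios:

```python
import math

def theta(x):
    """Θ[x] = x − 1 − ln x, nonnegative with equality only at x = 1."""
    return x - 1 - math.log(x)

# Θ is nonnegative over a range of ratios, vanishing only at the ambient state
assert all(theta(x) >= 0 for x in (0.1, 0.5, 1.0, 2.0, 10.0))
assert theta(1.0) == 0.0

def delta_I(v_ratio, t_ratio):
    """Net surprisal ΔI per molecule (units of k) for a monatomic ideal gas:
    ΔI = Θ[V/Vo] + (3/2) Θ[T/To]."""
    return theta(v_ratio) + 1.5 * theta(t_ratio)

print(delta_I(2.0, 1.5))  # positive for any state away from ambient
```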
More generally[18] the work available relative to some ambient is obtained by multiplying ambient temperature To by
KL-divergence or net-surprisal ΔI ≥ 0, defined as the average value of k ln[p/po] where po is the probability of a
given state under ambient conditions. For instance, the work available in equilibrating a monatomic ideal gas to
ambient values of Vo and To is thus W =ToΔI, where KL-divergence ΔI = Nk(Θ[V/Vo] + 3⁄2Θ[T/To]). The resulting
contours of constant KL-divergence, at right for a mole of Argon at standard temperature and pressure, for example
put limits on the conversion of hot to cold as in flame-powered air-conditioning or in the unpowered device to
convert boiling-water to ice-water discussed here.[19] Thus KL-divergence measures thermodynamic availability in
bits.
In quantum information science the minimum of DKL(P||Q) over all separable states Q can also be used as a
measure of entanglement in the state P.
Symmetrised divergence
Kullback and Leibler themselves actually defined the divergence as:
DKL(P||Q) + DKL(Q||P),
which is symmetric and nonnegative. This quantity has sometimes been used for feature selection in classification
problems, where P and Q are the conditional pdfs of a feature under two different classes.
An alternative is given via the λ divergence,
Dλ(P||Q) = λ DKL(P || λP + (1−λ)Q) + (1−λ) DKL(Q || λP + (1−λ)Q),
which can be interpreted as the expected information gain about X from discovering which probability distribution X
is drawn from, P or Q, if they currently have probabilities λ and (1 − λ) respectively.
The value λ = 0.5 gives the Jensen–Shannon divergence, defined by
DJS = ½ DKL(P||M) + ½ DKL(Q||M), where M = ½(P + Q).
DJS can also be interpreted as the capacity of a noisy information channel with two inputs giving the output
distributions p and q. The Jensen–Shannon divergence is proportional to the square of the Fisher information metric
and is equivalent to the Hellinger metric; it is also equal to one-half the so-called
Jeffreys divergence (Rubner et al., 2000; Jeffreys 1946).
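A small Python sketch (distributions illustrative) of the Jensen–Shannon divergence, computed as the average KL divergence of P and Q from their mixture M = ½(P + Q), checking that unlike DKL it is symmetric and bounded by one bit:

```python
import math

def kl(p, q):
    """D_KL(P||Q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen–Shannon divergence: the λ = 0.5 case of the λ divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.9, 0.1]
q = [0.1, 0.9]

assert abs(jsd(p, q) - jsd(q, p)) < 1e-12  # symmetric, unlike D_KL
assert 0 <= jsd(p, q) <= 1                 # bounded by 1 bit
print(jsd(p, q))
```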
Data differencing
Just as absolute entropy serves as theoretical background for data compression, relative entropy serves as theoretical
background for data differencing – the absolute entropy of a set of data in this sense being the data required to
reconstruct it (minimum compressed size), while the relative entropy of a target set of data, given a source set of
data, is the data required to reconstruct the target given the source (minimum size of a patch).
References
[1] Kullback, S.; Leibler, R.A. (1951). "On Information and Sufficiency". Annals of Mathematical Statistics 22 (1): 79–86.
doi:10.1214/aoms/1177729694. MR39968.
[2] S. Kullback (1959) Information theory and statistics (John Wiley and Sons, NY).
[3] Kullback, S.; Burnham, K. P.; Laubscher, N. F.; Dallal, G. E.; Wilkinson, L.; Morrison, D. F.; Loyer, M. W.; Eisenberg, B. et al. (1987).
"Letter to the Editor: The Kullback–Leibler distance". The American Statistician 41 (4): 340–341. JSTOR 2684769.
[4] C. Bishop (2006). Pattern Recognition and Machine Learning. p. 55.
[5] Sanov I.N. (1957) "On the probability of large deviations of random magnitudes". Matem. Sbornik, v. 42 (84), 11--44.
[6] Novak S.Y. (2011) ch. 14.5, "Extreme value methods with applications to finance". Chapman & Hall/CRC Press. ISBN 978-1-4398-3574-6.
[7] See the section "differential entropy – 4" in Relative Entropy (http://videolectures.net/nips09_verdu_re/), video lecture by Sergio Verdú,
NIPS 2009.
[8] Penny & Roberts, PARG-00-12, (2000) (http://www.allisons.org/ll/MML/KL/Normal). pp. 18
[9] A. Rényi (1970). Probability Theory. New York: Elsevier. Appendix, Sec.4. ISBN 0-486-45867-9.
[10] A. Rényi (1961). "On measures of information and entropy" (http://digitalassets.lib.berkeley.edu/math/ucb/text/math_s4_v1_article-27.pdf).
Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability 1960. pp. 547–561.
[11] Chaloner, K. and Verdinelli, I. (1995) Bayesian Experimental Design: A Review. Statistical Science 10 (3): 273–304.
doi:10.1214/ss/1177009939
[12] Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007). "Section 14.7.2. Kullback-Leibler Distance"
(http://apps.nrbook.com/empanel/index.html#pg=756). Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press.
ISBN 978-0-521-88068-8.
[13] Thomas M. Cover, Joy A. Thomas (1991) Elements of Information Theory (John Wiley and Sons, New York, NY), p.22
[14] Myron Tribus (1961) Thermodynamics and thermostatics (D. Van Nostrand, New York)
[15] E. T. Jaynes (1957) Information theory and statistical mechanics (http://bayes.wustl.edu/etj/articles/theory.1.pdf), Physical Review
106:620
[16] E. T. Jaynes (1957) Information theory and statistical mechanics II (http://bayes.wustl.edu/etj/articles/theory.2.pdf), Physical Review
108:171
[17] J.W. Gibbs (1873) A method of geometrical representation of thermodynamic properties of substances by means of surfaces, reprinted in
The Collected Works of J. W. Gibbs, Volume I Thermodynamics, ed. W. R. Longley and R. G. Van Name (New York: Longmans, Green,
1931) footnote page 52.
[18] M. Tribus and E. C. McIrvine (1971) Energy and information, Scientific American 224:179–186.
[19] P. Fraundorf (2007) Thermal roots of correlation-based complexity (http://www3.interscience.wiley.com/cgi-bin/abstract/117861985/ABSTRACT),
Complexity 13:3, 18–26
[20] Kenneth P. Burnham and David R. Anderson (2001) Kullback–Leibler information as a basis for strong inference in ecological studies
(http://www.publish.csiro.au/paper/WR99107.htm), Wildlife Research 28:111–119.
[21] Burnham, K. P. and Anderson D. R. (2002) Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach,
Second Edition (Springer Science, New York) ISBN 978-0-387-95364-9.
[22] Rubner, Y., Tomasi, C., and Guibas, L. J., 2000. The Earth Mover's distance as a metric for image retrieval. International Journal of
Computer Vision, 40(2): 99–121.
External links
• Ruby gem for calculating KL divergence (https://github.com/evansenter/diverge)
• Jon Shlens' tutorial on Kullback-Leibler divergence and likelihood theory (http://www.snl.salk.edu/~shlens/
kl.pdf)
• Matlab code for calculating KL divergence (http://www.mathworks.com/matlabcentral/fileexchange/loadFile.
do?objectId=13089&objectType=file)
• Sergio Verdú, Relative Entropy (http://videolectures.net/nips09_verdu_re/), NIPS 2009. One-hour video
lecture.
• A modern summary of info-theoretic divergence measures (http://arxiv.org/abs/math/0604246)
Laplace distribution
In probability theory and statistics, the Laplace distribution is a continuous probability distribution named after
Pierre-Simon Laplace. It is also sometimes called the double exponential distribution, because it can be thought of as
two exponential distributions (with an additional location parameter) spliced together back-to-back, but the term
double exponential distribution is also sometimes used to refer to the Gumbel distribution. The difference between
two independent identically distributed exponential random variables is governed by a Laplace distribution, as is a
Brownian motion evaluated at an exponentially distributed random time. Increments of Laplace motion or a variance
gamma process evaluated over the time scale also have a Laplace distribution.
Characterization
The Laplace distribution has probability density function
f(x | μ, b) = (1/(2b)) exp(−|x − μ| / b).
Here, μ is a location parameter and b > 0, which is sometimes referred to as the diversity, is a scale parameter. If μ =
0 and b = 1, the positive half-line is exactly an exponential distribution scaled by 1/2.
The probability density function of the Laplace distribution is also reminiscent of the normal distribution; however,
whereas the normal distribution is expressed in terms of the squared difference from the mean μ, the Laplace density
is expressed in terms of the absolute difference from the mean. Consequently the Laplace distribution has fatter tails
than the normal distribution.
Given a random variable U drawn from the uniform distribution on (−1/2, 1/2], the variate
X = μ − b sgn(U) ln(1 − 2|U|)
has a Laplace distribution with parameters μ and b. This follows from the inverse cumulative distribution function
given above.
A Laplace(0, b) variate can also be generated as the difference of two i.i.d. Exponential(1/b) random variables.
Equivalently, a Laplace(0, 1) random variable can be generated as the logarithm of the ratio of two iid uniform
random variables.
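The difference-of-exponentials construction can be sketched in Python (the sample size and parameters below are arbitrary illustrative choices); for Laplace(μ, b) the mean is μ and the variance is 2b²:

```python
import random
import statistics

def laplace_variate(mu=0.0, b=1.0, rng=random):
    # The difference of two i.i.d. Exponential(1/b) variables is Laplace(0, b)
    return mu + rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)

rng = random.Random(0)
samples = [laplace_variate(mu=2.0, b=1.0, rng=rng) for _ in range(100_000)]

# Sample mean should be near μ = 2, sample variance near 2b² = 2
print(statistics.mean(samples), statistics.variance(samples))
```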
Parameter estimation
Given N independent and identically distributed samples x1, x2, ..., xN, the maximum likelihood estimator of μ is
the sample median,[1] and the maximum likelihood estimator of b is the mean absolute deviation from the median,
b̂ = (1/N) Σi |xi − μ̂|
(revealing a link between the Laplace distribution and least absolute deviations).
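These maximum likelihood estimators — the sample median for μ and the mean absolute deviation from the median for b — are simple to compute; a Python sketch on synthetic data (the true parameters μ = 0, b = 2 are an illustrative choice):

```python
import random
import statistics

def fit_laplace(xs):
    """ML estimates for Laplace(μ, b): μ̂ = sample median,
    b̂ = mean absolute deviation from the median."""
    mu = statistics.median(xs)
    b = sum(abs(x - mu) for x in xs) / len(xs)
    return mu, b

rng = random.Random(1)
# Difference of two Exponential(1/2) variables ~ Laplace(0, 2)
data = [rng.expovariate(0.5) - rng.expovariate(0.5) for _ in range(50_000)]
mu_hat, b_hat = fit_laplace(data)
print(mu_hat, b_hat)  # near 0 and 2
```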
Moments
Related distributions
• If then
• If then (exponential distribution)
• If and then .
• If then
• If then (Exponential power distribution)
• If (Normal distribution) for then
Sargan distributions
Sargan distributions are a system of distributions of which the Laplace distribution is a core member. A pth-order
Sargan distribution has density[3][4]
for parameters α > 0, βj ≥ 0. The Laplace distribution results for p=0.
Applications
The Laplacian distribution has been used in speech recognition to model priors on DFT coefficients.[5]
The addition of noise drawn from a Laplacian distribution, with scaling parameter appropriate to a function's
sensitivity, to the output of a statistical database query is the most common means to provide differential privacy in
statistical databases.
History
This distribution is often referred to as Laplace's first law of errors. He published it in 1774 when he noted that the
frequency of an error could be expressed as an exponential function of its magnitude once its sign was
disregarded.[6][7]
Laplace in 1778 published his second law of errors wherein he noted that the frequency of an error was proportional
to the exponential of the square of its magnitude. This was subsequently rediscovered by Gauss (possibly in 1795)
and it is now best known as the Normal distribution.
Keynes published a paper in 1911 based on his earlier thesis wherein he showed that the Laplace distribution
minimised the absolute deviation from the median.[8]
References
[1] Robert M. Norton (May 1984). "The Double Exponential Distribution: Using Calculus to Find a Maximum Likelihood Estimator". The
American Statistician (American Statistical Association) 38 (2): 135–136. doi:10.2307/2683252. JSTOR 2683252.
[2] Kotz, Samuel; Kozubowski, Tomasz J.; Podgórski, Krzysztof (2001). The Laplace distribution and generalizations: a revisit with
applications to Communications, Economics, Engineering and Finance (http://books.google.com/books?id=cb8B07hwULUC).
Birkhauser. p. 23 (Proposition 2.2.2, Equation 2.2.8). ISBN 9780817641665.
[3] Everitt, B.S. (2002) The Cambridge Dictionary of Statistics, CUP. ISBN 0-521-81099-X
[4] Johnson, N.L., Kotz S., Balakrishnan, N. (1994) Continuous Univariate Distributions, Wiley. ISBN 0-471-58495-9. p. 60
[5] Eltoft, T.; Taesu Kim; Te-Won Lee (2006). "On the multivariate Laplace distribution" (http://eo.uit.no/publications/TE-SPL-06.pdf).
IEEE Signal Processing Letters 13 (5): 300-303. doi:10.1109/LSP.2006.870353. .
[6] Laplace, P-S. (1774). Mémoire sur la probabilité des causes par les évènements. Mémoires de l’Academie Royale des Sciences Presentés par
Divers Savan, 6, 621–656
[7] Wilson EB (1923) First and second laws of error. JASA 18, 143
[8] Keynes JM (1911) The principal averages and the laws of error which lead to them. J Roy Stat Soc, 74, 322–331
Laplace's equation
In mathematics, Laplace's equation is a second-order partial differential equation named after Pierre-Simon Laplace
who first studied its properties. This is often written as:
∇²φ = 0 or Δφ = 0,
where ∆ = ∇² is the Laplace operator and φ is a scalar function. In general, ∆ = ∇² is the Laplace–Beltrami or
Laplace–de Rham operator.
Laplace's equation and Poisson's equation are the simplest examples of elliptic partial differential equations.
Solutions of Laplace's equation are called harmonic functions.
The general theory of solutions to Laplace's equation is known as potential theory. The solutions of Laplace's
equation are the harmonic functions, which are important in many fields of science, notably the fields of
electromagnetism, astronomy, and fluid dynamics, because they can be used to accurately describe the behavior of
electric, gravitational, and fluid potentials. In the study of heat conduction, the Laplace equation is the steady-state
heat equation.
Definition
In three dimensions, the problem is to find twice-differentiable real-valued functions f, of real variables x, y, and z,
such that
In Cartesian coordinates
∂²f/∂x² + ∂²f/∂y² + ∂²f/∂z² = 0.
In cylindrical coordinates,
(1/r) ∂/∂r (r ∂f/∂r) + (1/r²) ∂²f/∂φ² + ∂²f/∂z² = 0.
In spherical coordinates,
(1/ρ²) ∂/∂ρ (ρ² ∂f/∂ρ) + (1/(ρ² sin θ)) ∂/∂θ (sin θ ∂f/∂θ) + (1/(ρ² sin²θ)) ∂²f/∂φ² = 0.
In Curvilinear coordinates,
or
Boundary conditions
The Dirichlet problem for Laplace's equation consists of finding a solution φ on some domain D such that φ on the
boundary of D is equal to some given function. Since the Laplace operator appears in the heat equation, one physical
interpretation of this problem is as follows: fix the temperature on the boundary of the domain according to the given
specification of the boundary condition. Allow heat to flow until a stationary state is reached in which the
temperature at each point on the domain doesn't change anymore. The temperature distribution in the interior will
then be given by the solution to the corresponding Dirichlet problem.
(Figure: Laplace's equation on an annulus (r=2 and R=4) with Dirichlet boundary conditions u(r=2)=0 and
u(r=4)=4 sin(5θ).)
The Neumann boundary conditions for Laplace's equation specify not the function itself on the boundary of ,
but its normal derivative. Physically, this corresponds to the construction of a potential for a vector field whose
effect is known at the boundary of alone.
Solutions of Laplace's equation are called harmonic functions; they are all analytic within the domain where the
equation is satisfied. If any two functions are solutions to Laplace's equation (or any linear homogeneous differential
equation), their sum (or any linear combination) is also a solution. This property, called the principle of
superposition, is very useful, e.g., solutions to complex problems can be constructed by summing simple solutions.
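As an illustration of the Dirichlet problem and the averaging character of harmonic functions, here is a minimal finite-difference sketch in Python (grid size, boundary data, and iteration count are arbitrary illustrative choices, not from the text):

```python
# Solve Laplace's equation on a square grid with Dirichlet boundary values
# u = 1 on the top edge and u = 0 on the other edges, by Jacobi iteration:
# each interior point is repeatedly replaced by the mean of its 4 neighbours.
n = 20
u = [[0.0] * n for _ in range(n)]
for j in range(n):
    u[0][j] = 1.0  # top boundary held at 1

for _ in range(2000):
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1])
    u = new

# Maximum principle: interior values lie strictly between the boundary extremes
interior = [u[i][j] for i in range(1, n - 1) for j in range(1, n - 1)]
print(min(interior), max(interior))
```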
Analytic functions
The real and imaginary parts of a complex analytic function both satisfy the Laplace equation. That is, if z = x + iy,
and if
f(z) = u(x, y) + iv(x, y),
then the necessary condition that f(z) be analytic is that the Cauchy–Riemann equations be satisfied:
u_x = v_y,  v_x = −u_y.
It follows that u_xx + u_yy = (v_y)_x + (−v_x)_y = 0.
Therefore u satisfies the Laplace equation. A similar calculation shows that v also satisfies the Laplace equation.
Conversely, given a harmonic function φ, it is the real part of an analytic function f(z) (at least locally). If a trial
form is
f(z) = φ(x, y) + iψ(x, y),
then the Cauchy–Riemann equations will be satisfied if we set
ψ_x = −φ_y,  ψ_y = φ_x.
The Laplace equation for φ implies that the integrability condition for ψ is satisfied:
ψ_xy = ψ_yx,
and thus ψ may be defined by a line integral. The integrability condition and Stokes' theorem implies that the value
of the line integral connecting two points is independent of the path. The resulting pair of solutions of the Laplace
equation are called conjugate harmonic functions. This construction is only valid locally, or provided that the path
does not loop around a singularity. For example, if r and θ are polar coordinates and
φ = log r,
then a corresponding analytic function is
f(z) = log z = log r + iθ.
However, the angle θ is single-valued only in a region that does not enclose the origin.
The close connection between the Laplace equation and analytic functions implies that any solution of the Laplace
equation has derivatives of all orders, and can be expanded in a power series, at least inside a circle that does not
enclose a singularity. This is in sharp contrast to solutions of the wave equation, which generally have less regularity.
There is an intimate connection between power series and Fourier series. If we expand a function f in a power series
inside a circle of radius R, this means that
f(z) = Σ c_n z^n,
with suitably defined coefficients whose real and imaginary parts are given by
c_n = a_n + i b_n.
Therefore
f(z) = Σ r^n (a_n cos nθ − b_n sin nθ) + i Σ r^n (a_n sin nθ + b_n cos nθ),
which is a Fourier series for f. These trigonometric functions can themselves be expanded, using multiple angle
formulae.
Laplace's equation 298
Fluid flow
Let the quantities u and v be the horizontal and vertical components of the velocity field of a steady incompressible,
irrotational flow in two dimensions. The condition that the flow be incompressible is that
u_x + v_y = 0,
and the condition that it be irrotational is that
v_x − u_y = 0.
If we define the differential of a function ψ by
dψ = −v dx + u dy,
then the incompressibility condition is the integrability condition for this differential: the resulting function is called
the stream function because it is constant along flow lines. The first derivatives of ψ are given by
ψ_x = −v,  ψ_y = u,
and the irrotationality condition implies that ψ satisfies the Laplace equation. The harmonic function φ that is
conjugate to ψ is called the velocity potential. The Cauchy–Riemann equations imply that
φ_x = u = ψ_y,  φ_y = v = −ψ_x.
Thus every analytic function corresponds to a steady incompressible, irrotational fluid flow in the plane. The real
part is the velocity potential, and the imaginary part is the stream function.
Electrostatics
According to Maxwell's equations, an electric field (u, v) in two space dimensions that is independent of time satisfies
∇ × (u, v) = v_x − u_y = 0
and
∇ · (u, v) = u_x + v_y = ρ,
where ρ is the charge density. The first Maxwell equation is the integrability condition for the differential
dφ = −u dx − v dy.
Fundamental solution
A fundamental solution of Laplace's equation satisfies
Δu = u_xx + u_yy + u_zz = −δ(x − x′, y − y′, z − z′),
where the Dirac delta function δ denotes a unit source concentrated at the point (x′, y′, z′). No function has this
property, but it can be thought of as a limit of functions whose integrals over space are unity, and whose support (the
region where the function is non-zero) shrinks to a point (see weak solution). It is common to take a different sign
convention for this equation than one typically does when defining fundamental solutions. This choice of sign is
often convenient to work with because is a positive operator. The definition of the fundamental solution thus
implies that, if the Laplacian of u is integrated over any volume that encloses the source point, then
∫∫∫_V ∇ · ∇u dV = −1.
The Laplace equation is unchanged under a rotation of coordinates, and hence we can expect that a fundamental
solution may be obtained among solutions that only depend upon the distance r from the source point. If we choose
the volume to be a ball of radius a around the source point, then Gauss' divergence theorem implies that
−1 = ∫∫∫_V ∇ · ∇u dV = ∫∫_S (du/dr) dS = 4πa² (du/dr)|_{r=a}.
It follows that
du/dr = −1/(4πr²)
on a sphere of radius r that is centered around the source point, and hence
u = 1/(4πr).
Note that, with the opposite sign convention (used in Physics), this is the potential generated by a point particle, for
an inverse-square law force, arising in the solution of the Poisson equation. A similar argument shows that in two
dimensions
u = −(1/(2π)) log r,
where log denotes the natural logarithm. Note that, with the opposite sign convention, this is the potential
generated by a pointlike sink (see point particle), which is the solution of the Euler equations in two-dimensional
incompressible flow.
Green's function
A Green's function is a fundamental solution that also satisfies a suitable condition on the boundary S of a volume V.
For instance, G may satisfy
∇ · ∇G = −δ(x − x′) in V,  G = 0 on S.
Now if u is any solution of the Poisson equation in V,
∇ · ∇u = −f,
and u assumes the boundary values g on S, then we may apply Green's identity, (a consequence of the divergence
theorem) which states that
∫∫∫_V [G ∇·∇u − u ∇·∇G] dV = ∫∫_S [G u_n − u G_n] dS.
The notations un and Gn denote normal derivatives on S. In view of the conditions satisfied by u and G, this result
simplifies to
Thus the Green's function describes the influence at of the data f and g. For the case of the interior of a
sphere of radius a, the Green's function may be obtained by means of a reflection (Sommerfeld, 1949): the source
point P at distance ρ from the center of the sphere is reflected along its radial line to a point P′ that is at a distance
ρ′ = a²/ρ.
Note that if P is inside the sphere, then P′ will be outside the sphere. The Green's function is then given by
where R denotes the distance to the source point P and R′ denotes the distance to the reflected point P′. A
consequence of this expression for the Green's function is the Poisson integral formula. Let ρ, θ, and φ be
spherical coordinates for the source point P. Here θ denotes the angle with the vertical axis, which is contrary to the
usual American mathematical notation, but agrees with standard European and physical practice. Then the solution
of the Laplace equation inside the sphere is given by
u(P) = (1/(4π)) a(a² − ρ²) ∫ ∫ g(θ′, φ′) sin θ′ dθ′ dφ′ / (a² + ρ² − 2aρ cos Θ)^(3/2),
where
cos Θ = cos θ cos θ′ + sin θ sin θ′ cos(φ − φ′).
A simple consequence of this formula is that if u is a harmonic function, then the value of u at the center of the
sphere is the mean value of its values on the sphere. This mean value property immediately implies that a
non-constant harmonic function cannot assume its maximum value at an interior point.
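The mean value property is easy to verify numerically for a concrete harmonic function. A minimal Python sketch in two dimensions, using u(x, y) = x² − y² (harmonic, since u_xx + u_yy = 2 − 2 = 0) and an arbitrary illustrative circle:

```python
import math

def u(x, y):
    # A harmonic function: u_xx + u_yy = 2 - 2 = 0
    return x * x - y * y

cx, cy, r = 0.3, -0.2, 1.5  # centre and radius, illustrative choices
n = 100_000

# Average u over n equally spaced points on the circle of radius r
avg = sum(u(cx + r * math.cos(2 * math.pi * k / n),
            cy + r * math.sin(2 * math.pi * k / n)) for k in range(n)) / n

# Mean value property: the average equals the value at the centre
print(avg, u(cx, cy))
```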
Electrostatics
In free space the Laplacian of any electrostatic potential must equal zero, since the charge density ρ is zero in
free space.
Taking the gradient of the electric potential, we get the electrostatic field
E = −∇V.
Taking the divergence of the electrostatic field, we obtain Poisson's equation, which relates charge density and
electric potential:
∇²V = −ρ/ε0.
In the particular case of empty space (ρ = 0), Poisson's equation reduces to Laplace's equation for the electric
potential.
By a uniqueness theorem, if a potential satisfies Laplace's equation in free space and takes the correct values at the
boundaries, then it is uniquely determined. A potential that does not satisfy Laplace's equation together with the
boundary condition is not a valid electrostatic potential.
References
• Evans, L. C. (1998). Partial Differential Equations. Providence: American Mathematical Society.
ISBN 0-8218-0772-2.
• Petrovsky, I. G. (1967). Partial Differential Equations. Philadelphia: W. B. Saunders.
• Polyanin, A. D. (2002). Handbook of Linear Partial Differential Equations for Engineers and Scientists. Boca
Raton: Chapman & Hall/CRC Press. ISBN 1-58488-299-9.
• Sommerfeld, A. (1949). Partial Differential Equations in Physics. New York: Academic Press.
External links
• Laplace Equation (particular solutions and boundary value problems) [1] at EqWorld: The World of Mathematical
Equations.
• Laplace Differential Equation [2] on PlanetMath
• Example initial-boundary value problems [3] using Laplace's equation from exampleproblems.com.
• Weisstein, Eric W., "Laplace’s Equation [4]" from MathWorld.
• Module for Laplace’s Equation by John H. Mathews [5]
• Find out how boundary value problems governed by Laplace's equation may be solved numerically by boundary
element method [6]
References
[1] http://eqworld.ipmnet.ru/en/solutions/lpde/lpde301.pdf
[2] http://planetmath.org/encyclopedia/LaplaceDifferentialEquation.html
[3] http://www.exampleproblems.com/wiki/index.php/PDE:Laplaces_Equation
[4] http://mathworld.wolfram.com/LaplacesEquation.html
[5] http://math.fullerton.edu/mathews/c2003/DirichletProblemMod.html
[6] http://www.ntu.edu.sg/home/mwtang/bemsite.htm
Laplace's method
In mathematics, Laplace's method, named after Pierre-Simon Laplace, is a technique used to approximate integrals
of the form
∫_a^b e^(M ƒ(x)) dx,
where ƒ(x) is some twice-differentiable function, M is a large number, and the integral endpoints a and b could
possibly be infinite. This technique was originally presented in Laplace (1774, pp. 366–367).
Suppose that ƒ attains its maximum at a unique point x0 with ƒ''(x0) < 0. As M grows, e^(M ƒ(x)) becomes sharply
peaked around x0. Thus, significant contributions to the integral of this function will come only from points x in a
neighborhood of x0, which can then be estimated. By Taylor's theorem,
ƒ(x) ≈ ƒ(x0) + ½ ƒ''(x0)(x − x0)^2
for x close to x0 (recall that the second derivative is negative at the global maximum ƒ(x0)). The assumptions made
ensure the accuracy of the approximation
∫_a^b e^(M ƒ(x)) dx ≈ e^(M ƒ(x0)) ∫_a^b e^(−M |ƒ''(x0)| (x − x0)^2 / 2) dx.
This latter integral is a Gaussian integral if the limits of integration go from −∞ to +∞
(which can be assumed because the exponential decays very fast away from x0), and thus it can be calculated. We
find
∫_a^b e^(M ƒ(x)) dx ≈ √(2π / (M |ƒ''(x0)|)) e^(M ƒ(x0))  as M → ∞.
A generalization of this method and extension to arbitrary precision is provided by Fog (2008).
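The leading-order formula can be checked numerically. The sketch below compares the Laplace approximation
with a brute-force midpoint-rule evaluation; the choice of ƒ(x) = sin x on (0, π) and M = 50 is purely illustrative,
not from the source:

```python
import math

def laplace_approx(f, fpp_x0, x0, m):
    # Laplace's method: integral of e^{M f(x)} dx over (a, b) is approximately
    # sqrt(2*pi / (M * |f''(x0)|)) * e^{M f(x0)} when f peaks at an interior x0.
    return math.sqrt(2 * math.pi / (m * abs(fpp_x0))) * math.exp(m * f(x0))

def numeric_integral(f, a, b, m, n=200000):
    # Midpoint-rule evaluation of the same integral, for comparison.
    h = (b - a) / n
    return sum(math.exp(m * f(a + (i + 0.5) * h)) for i in range(n)) * h

f = math.sin                 # maximum of sin on (0, pi) at x0 = pi/2, f''(x0) = -1
M = 50
approx = laplace_approx(f, -1.0, math.pi / 2, M)
exact = numeric_integral(f, 0.0, math.pi, M)
print(approx / exact)        # ratio tends to 1 as M grows
```

The relative error of the leading term shrinks like 1/M, so the ratio is already within a fraction of a percent at M = 50.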
Formal statement and proof:
Assume that ƒ is a twice continuously differentiable function on [a, b], and that x0 ∈ (a, b) is the unique point such that
ƒ(x0) = max over [a, b] of ƒ(x). Assume additionally that ƒ''(x0) < 0.
Then,
lim_{n→∞} [ ∫_a^b e^(n ƒ(x)) dx ] / [ e^(n ƒ(x0)) √(2π / (n (−ƒ''(x0)))) ] = 1.
Proof:
Lower bound:
Let ε > 0. Then by the continuity of ƒ'' there exists δ > 0 such that if |x − x0| < δ then
ƒ''(x) ≥ ƒ''(x0) − ε. By Taylor's theorem, for any x ∈ (x0 − δ, x0 + δ),
ƒ(x) ≥ ƒ(x0) + ½ (ƒ''(x0) − ε)(x − x0)^2.
Note that ƒ''(x0) − ε < 0, which is why we can take the square root of its negation.
If we divide both sides of the resulting inequality by e^(n ƒ(x0)) √(2π / (n (−ƒ''(x0)))), take the limit
n → ∞, and then let ε → 0, we obtain the lower bound.
Upper bound:
The proof of the upper bound is similar to the proof of the lower bound, but there are a few annoyances. Again we
start by picking an ε > 0, but in order for the proof to work we need ε small enough so that ƒ''(x0) + ε < 0.
Then, as above, by continuity of ƒ'' and Taylor's theorem we can find δ > 0 so that if |x − x0| < δ, then
ƒ(x) ≤ ƒ(x0) + ½ (ƒ''(x0) + ε)(x − x0)^2.
Then we can calculate the following upper bound:
If we divide both sides of the above inequality by the same normalizing factor as before and take the limit, we get the upper bound.
And combining this with the lower bound gives the result.
In the complex-plane version of the method (the method of steepest descent), the analogous formula holds
for a path passing through the saddle point at z0. Note the explicit appearance of a minus sign to indicate the
direction of the second derivative: one must not take the modulus. Also note that if the integrand is meromorphic,
one may have to add residues corresponding to poles traversed while deforming the contour (see for example section
3 of Okounkov's paper Symmetric functions and random partitions).
Further generalizations
An extension of the steepest descent method is the so-called nonlinear stationary phase/steepest descent method.
Here, instead of integrals, one needs to evaluate asymptotically solutions of Riemann–Hilbert factorization
problems.
Given a contour C in the complex sphere, a function ƒ defined on that contour and a special point, say infinity, one
seeks a function M holomorphic away from the contour C, with prescribed jump across C, and with a given
normalization at infinity. If ƒ and hence M are matrices rather than scalars this is a problem that in general does not
admit an explicit solution.
An asymptotic evaluation is then possible along the lines of the linear stationary phase/steepest descent method. The
idea is to reduce asymptotically the solution of the given Riemann–Hilbert problem to that of a simpler, explicitly
solvable, Riemann–Hilbert problem. Cauchy's theorem is used to justify deformations of the jump contour.
The nonlinear stationary phase was introduced by Deift and Zhou in 1993, based on earlier work of Its. A (properly
speaking) nonlinear steepest descent method was introduced by Kamvissis, K. McLaughlin and P. Miller in 2003,
based on previous work of Lax, Levermore, Deift, Venakides and Zhou.
The nonlinear stationary phase/steepest descent method has applications to the theory of soliton equations and
integrable models, random matrices and combinatorics.
Complex integrals
For complex integrals in the form:
with t >> 1, we make the substitution t = iu and the change of variable s = c + ix to get the Laplace bilateral
transform:
We then split g(c + ix) into its real and imaginary parts, after which we recover u = t / i. This is useful for inverse Laplace
transforms, the Perron formula and complex integration.
As an example, Laplace's method can be used to derive Stirling's approximation. Starting from the integral
representation N! = ∫_0^∞ e^(−x) x^N dx and substituting x = Nz, so that
N! = N^(N+1) ∫_0^∞ e^(N (ln z − z)) dz.
This integral has the form necessary for Laplace's method with
ƒ(z) = ln z − z,
which is twice-differentiable:
ƒ'(z) = 1/z − 1,  ƒ''(z) = −1/z^2.
The maximum of ƒ(z) lies at z0 = 1, and the second derivative of ƒ(z) has the value −1 at this point. Therefore, we
obtain
N! ≈ N^(N+1) √(2π/N) e^(−N) = √(2πN) N^N e^(−N).
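The result is Stirling's approximation, N! ≈ √(2πN) N^N e^(−N), which can be checked numerically (a quick
sketch; the sample values of N are arbitrary):

```python
import math

# Stirling's approximation, obtained above via Laplace's method:
# N! is approximately sqrt(2*pi*N) * N**N * exp(-N)
def stirling(n):
    return math.sqrt(2 * math.pi * n) * n ** n * math.exp(-n)

for n in (5, 20, 50):
    print(n, stirling(n) / math.factorial(n))  # ratio approaches 1 as n grows
```

The leading term slightly underestimates N!; the relative error is roughly 1/(12N).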
References
• Azevedo-Filho, A.; Shachter, R. (1994), "Laplace's Method Approximations for Probabilistic Inference in Belief
Networks with Continuous Variables", in Mantaras, R.; Poole, D., Uncertainty in Artificial Intelligence, San
Francisco, CA: Morgan Kauffman, CiteSeerX: 10.1.1.91.2064 [1].
• Deift, P.; Zhou, X. (1993), "A steepest descent method for oscillatory Riemann–Hilbert problems. Asymptotics
for the MKdV equation", Ann. of Math. 137 (2): 295–368, doi:10.2307/2946540.
• Erdelyi, A. (1956), Asymptotic Expansions, Dover.
• Fog, A. (2008), "Calculation Methods for Wallenius' Noncentral Hypergeometric Distribution", Communications
in Statistics, Simulation and Computation 37 (2): 258–273, doi:10.1080/03610910701790269.
• Kamvissis, S.; McLaughlin, K. T.-R.; Miller, P. (2003), "Semiclassical Soliton Ensembles for the Focusing
Nonlinear Schrödinger Equation", Annals of Mathematics Studies (Princeton University Press) 154.
• Laplace, P. S. (1774). Memoir on the probability of causes of events. Mémoires de Mathématique et de Physique,
Tome Sixième. (English translation by S. M. Stigler 1986. Statist. Sci., 1(19):364–378).
This article incorporates material from saddle point approximation on PlanetMath, which is licensed under the
Creative Commons Attribution/Share-Alike License.
References
[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.2064
Likelihood-ratio test
In statistics, a likelihood ratio test is a statistical test used to compare the fit of two models, one of which (the null
model) is a special case of the other (the alternative model). The test is based on the likelihood ratio, which expresses
how many times more likely the data are under one model than the other. This likelihood ratio, or equivalently its
logarithm, can then be used to compute a p-value, or compared to a critical value to decide whether to reject the null
model in favour of the alternative model. When the logarithm of the likelihood ratio is used, the statistic is known as
a log-likelihood ratio statistic, and the probability distribution of this test statistic, assuming that the null model is
true, can be approximated using Wilks' theorem.
In the case of distinguishing between two models, each of which has no unknown parameters, use of the likelihood
ratio test can be justified by the Neyman–Pearson lemma, which demonstrates that such a test has the highest power
among all competitors.[1]
Use
Each of the two competing models, the null model and the alternative model, is separately fitted to the data and the
log-likelihood recorded. The test statistic (often denoted by D) is twice the difference in these log-likelihoods:
D = 2 × (log-likelihood of the alternative model − log-likelihood of the null model).
The model with more parameters will always fit at least as well (have a greater log-likelihood). Whether it fits
significantly better and should thus be preferred is determined by deriving the probability or p-value of the
difference D. Where the null hypothesis represents a special case of the alternative hypothesis, the probability
distribution of the test statistic is approximately a chi-squared distribution with degrees of freedom equal to
df2 − df1.[2] Symbols df1 and df2 represent the number of free parameters of models 1 and 2, the null model and the
alternative model, respectively. The test requires nested models, that is: models in which the more complex one can
be transformed into the simpler model by imposing a set of constraints on the parameters.[3]
For example: if the null model has 1 free parameter and a log-likelihood of −8024 and the alternative model has 3
free parameters and a log-likelihood of −8012, then the probability of this difference is that of a chi-squared value of
2·(8024 − 8012) = 24 with 3 − 1 = 2 degrees of freedom. Certain assumptions must be met for the statistic to follow
a chi-squared distribution and often empirical p-values are computed.
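The arithmetic of the example above can be reproduced in a few lines of Python (a sketch; for two degrees of
freedom the chi-squared survival function reduces to exp(−x/2), so no statistics library is needed):

```python
import math

# Likelihood-ratio test statistic for nested models:
# D = 2 * (loglik_alt - loglik_null), compared to chi-squared with df_alt - df_null d.f.
loglik_null, df_null = -8024.0, 1
loglik_alt, df_alt = -8012.0, 3

D = 2 * (loglik_alt - loglik_null)   # 24.0
df = df_alt - df_null                # 2
# For 2 degrees of freedom, the chi-squared survival function is exp(-x/2).
p_value = math.exp(-D / 2)
print(D, df, p_value)                # p is about 6.1e-06: reject the null model
```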
Background
The likelihood ratio, often denoted by Λ (the capital Greek letter lambda), is the ratio of the likelihood function
varying the parameters over two different sets in the numerator and denominator. A likelihood-ratio test is a
statistical test for making a decision between two hypotheses based on the value of this ratio.
It is central to the Neyman–Pearson approach to statistical hypothesis testing, and, like statistical hypothesis testing
generally, is both widely used and much criticized.
Simple-versus-simple hypotheses
A statistical model is often a parametrized family of probability density functions or probability mass functions
f(x; θ). A simple-vs-simple hypothesis test has completely specified models under both the null and alternative
hypotheses, which for convenience are written in terms of fixed values of a notional parameter θ:
H0: θ = θ0 versus H1: θ = θ1.
Note that under either hypothesis, the distribution of the data is fully specified; there are no unknown parameters to
estimate. The likelihood ratio test statistic can be written as:[4][5]
Λ(x) = L(θ0 | x) / L(θ1 | x),
or
Λ(x) = f(x; θ0) / f(x; θ1),
where L(θ | x) = f(x; θ) is the likelihood function. Note that some references may use the reciprocal as the definition.[6] In
the form stated here, the likelihood ratio is small if the alternative model is better than the null model and the
likelihood ratio test provides the decision rule as:
If Λ > c, do not reject H0;
If Λ < c, reject H0;
Reject H0 with probability q if Λ = c.
The values c and q are usually chosen to obtain a specified significance level α, through the relation
q·P(Λ = c | H0) + P(Λ < c | H0) = α. The Neyman–Pearson lemma states that this likelihood ratio test
is the most powerful among all level-α tests for this problem.
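For a concrete simple-vs-simple case, consider n Bernoulli trials with the heads probability fully specified under
each hypothesis (the numbers below are illustrative, not from the source):

```python
# Simple-vs-simple likelihood ratio for n Bernoulli trials:
# H0: p = 0.5 versus H1: p = 0.7 (illustrative values), with 14 heads in 20 tosses.
p0, p1 = 0.5, 0.7
heads, n = 14, 20

def lik(p):
    # Likelihood of the observed sequence under heads probability p
    return p ** heads * (1 - p) ** (n - heads)

lam = lik(p0) / lik(p1)
print(lam)   # about 0.19: small values favour rejecting H0 in favour of H1
```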
More generally, for composite hypotheses H0: θ ∈ Θ0 versus H1: θ ∈ Θ \ Θ0, the likelihood function
L(θ | x) = f(x | θ) (with f being the pdf or pmf) is a function of the parameter θ
with x held fixed at the value that was actually observed, i.e., the data. The likelihood ratio test statistic is [7]
Λ(x) = sup{ L(θ | x) : θ ∈ Θ0 } / sup{ L(θ | x) : θ ∈ Θ }.
Interpretation
Being a function of the data x, the likelihood ratio Λ is therefore a statistic. The likelihood-ratio test rejects the null hypothesis if
the value of this statistic is too small. How small is too small depends on the significance level of the test, i.e., on
what probability of Type I error is considered tolerable ("Type I" errors consist of the rejection of a null hypothesis
that is true).
The numerator corresponds to the maximum likelihood of the observed outcome under the null hypothesis. The
denominator corresponds to the maximum likelihood of the observed outcome with the parameters varying over the whole
parameter space. The numerator of this ratio cannot exceed the denominator, so the likelihood ratio lies between 0
and 1. Low values of the likelihood ratio mean that the observed result was much less likely to occur under the null
hypothesis than under the alternative. High values of the statistic mean that the observed outcome was nearly as
likely (or as likely) to occur under the null hypothesis as under the alternative, and the
null hypothesis cannot be rejected.
Examples
Coin tossing
As an example, in the case of Pearson's test, we might try to compare two coins to determine whether they have the
same probability of coming up heads. Our observations can be put into a contingency table with rows corresponding
to the coins and columns corresponding to heads or tails. The elements of the contingency table will be the number of
times each coin came up heads or tails. The contents of this table are our observation X.
        Heads  Tails
Coin 1  k1H    k1T
Coin 2  k2H    k2T
Here the parameter vector consists of the parameters p1H, p1T, p2H and p2T, which are the probabilities that coins 1 and 2 come
up heads or tails. The hypothesis space H is defined by the usual constraints on a distribution, 0 ≤ pij ≤ 1 and
piH + piT = 1. The null hypothesis H0 is the sub-space where p1j = p2j. In all of these constraints,
i = 1, 2 and j = H, T.
Writing p̂ij for the best values for pij under the hypothesis H, maximum likelihood is achieved with
p̂ij = kij / (kiH + kiT).
Writing p̂j for the best values for p1j = p2j under the null hypothesis H0, maximum likelihood is achieved with
p̂j = (k1j + k2j) / (k1H + k1T + k2H + k2T).
The hypothesis and null hypothesis can be rewritten slightly so that they satisfy the constraints for the logarithm of
the likelihood ratio to have the desired nice distribution. Since the constraint causes the two-dimensional H to be
reduced to the one-dimensional H0, the asymptotic distribution for the test will be χ²(1), the chi-squared distribution
with one degree of freedom.
For the general contingency table, we can write the log-likelihood ratio statistic as
−2 log Λ = 2 Σi Σj kij log( p̂ij / p̂j ).
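A minimal sketch of this computation in Python, using made-up counts (the observed table is illustrative; under H0
the expected count in each cell is row total × column total / n):

```python
import math

# Log-likelihood ratio (G) statistic for a 2x2 contingency table,
# testing whether two coins share the same heads probability.
table = [[43, 57],   # coin 1: heads, tails (illustrative counts)
         [52, 48]]   # coin 2

n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]

# G = 2 * sum over cells of observed * log(observed / expected),
# which equals the -2 log Lambda statistic above.
G = 2 * sum(o * math.log(o / (row_tot[i] * col_tot[j] / n))
            for i, row in enumerate(table)
            for j, o in enumerate(row))
print(G)   # compare against a chi-squared distribution with 1 degree of freedom
```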
References
[1] Jerzy Neyman, Egon Pearson (1933). "On the Problem of the Most Efficient Tests of Statistical Hypotheses". Philosophical Transactions of
the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 231 (694–706): 289–337.
doi:10.1098/rsta.1933.0009. JSTOR 91247.
[2] Huelsenbeck, J. P.; Crandall, K. A. (1997). "Phylogeny Estimation and Hypothesis Testing Using Maximum Likelihood". Annual Review of
Ecology and Systematics 28: 437–466. doi:10.1146/annurev.ecolsys.28.1.437.
[3] An example using phylogenetic analyses is described at Huelsenbeck, J. P.; Hillis, D. M.; Nielsen, R. (1996). "A Likelihood-Ratio Test of
Monophyly". Systematic Biology 45 (4): 546. doi:10.1093/sysbio/45.4.546.
[4] Mood, A.M.; Graybill, F.A. (1963) Introduction to the Theory of Statistics, 2nd edition. McGraw-Hill ISBN 978-0070428638 (page 286)
[5] Kendall, M.G., Stuart, A. (1973) The Advanced Theory of Statistics, Volume 2, Griffin. ISBN 0852642156 (page 234)
[6] Cox, D. R. and Hinkley, D. V Theoretical Statistics, Chapman and Hall, 1974. (page 92)
[7] Casella, George; Berger, Roger L. (2001) Statistical Inference, Second edition. ISBN 978-0534243128 (page 375)
[8] Wilks, S. S. (1938). "The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses". The Annals of
Mathematical Statistics 9: 60–62. doi:10.1214/aoms/1177732360.
External links
• Practical application of Likelihood-ratio test described (http://www.itl.nist.gov/div898/handbook/apr/
section2/apr233.htm)
• Vassar College's Likelihood Ratio Given Sensitivity/Specificity/Prevalence (http://faculty.vassar.edu/lowry/
clin2.html) Online Calculator
List of integrals of exponential functions
Indefinite integrals
Indefinite integrals are antiderivative functions. A constant (the constant of integration) may be added to the right
hand side of any of these formulas, but has been suppressed here in the interest of brevity.
Definite integrals
References
• Wolfram Mathematica Online Integrator (http://integrals.wolfram.com/index.jsp)
• V. H. Moll, The Integrals in Gradshteyn and Ryzhik (http://www.math.tulane.edu/~vhm/Table.html)
List of integrals of Gaussian functions
Owen [1] has an extensive list of Gaussian-type integrals; only a subset is given below.
Indefinite integrals
[2]
(in these integrals, n!! is the double factorial: for even n’s it is equal to the product of all even
numbers from 2 to n, and for odd n’s it is the product of all odd numbers from 1 to n, additionally
it is assumed that 0!! = (−1)!! = 1)
[3]
Definite integrals
[4]
References
[1] Owen (1980)
[2] Patel & Read (1996) list this integral without the minus sign, which is an error. See calculation by WolframAlpha (http://www.wolframalpha.com/input/?fp=1&i=D(-e^(-x^2/2)/sqrt(2pi)*Sum((2k)!!/(2j)!!*x^(2j),{j,0,k}),x)&s=40&incTime=true)
[3] Patel & Read (1996) report this integral with error, see WolframAlpha (http://www.wolframalpha.com/input/?i=Integrate(1/sqrt(2pi)*e^(-x^2/2)*1/sqrt(2pi)*e^(-(a+b*x)^2/2),x))
[4] Patel & Read (1996) report this integral incorrectly by omitting x from the integrand
• Patel, Jagdish K.; Read, Campbell B. (1996). Handbook of the normal distribution (2nd ed.). CRC Press.
ISBN 0-8247-9342-0.
• Owen, D. (1980). "A table of normal integrals". Communications in Statistics: Computation and Simulation B9:
389–419.
List of integrals of hyperbolic functions
References
• Milton Abramowitz and Irene A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and
Mathematical Tables, 1964. A few integrals are listed on page 69 [1].
References
[1] http://www.math.sfu.ca/~cbm/aands/page_69.htm
Lists of integrals
This article is mainly about indefinite integrals in calculus. For a list of definite integrals see List of definite
integrals.
Integration is the basic operation in integral calculus. While differentiation has easy rules by which the derivative of
a complicated function can be found by differentiating its simpler component functions, integration does not, so
tables of known integrals are often useful. This page lists some of the most common antiderivatives.
Lists of integrals
More detail may be found on the following pages for the lists of integrals:
• List of integrals of rational functions
• List of integrals of irrational functions
• List of integrals of trigonometric functions
• List of integrals of inverse trigonometric functions
• List of integrals of hyperbolic functions
• List of integrals of inverse hyperbolic functions
• List of integrals of exponential functions
• List of integrals of logarithmic functions
• List of integrals of Gaussian functions
Gradshteyn, Ryzhik, Jeffrey, Zwillinger's Table of Integrals, Series, and Products contains a large collection of
results. An even larger, multivolume table is the Integrals and Series by Prudnikov, Brychkov, and Marichev (with
volumes 1–3 listing integrals and series of elementary and special functions, and volumes 4–5 tables of Laplace
transforms). More compact collections can be found in e.g. Brychkov, Marichev, Prudnikov's Tables of Indefinite
Integrals, or as chapters in Zwillinger's CRC Standard Mathematical Tables and Formulae, Bronstein and
Semendyayev's Handbook of Mathematics (Springer) and Oxford Users' Guide to Mathematics (Oxford Univ.
Press), and other mathematical handbooks.
Other useful resources include Abramowitz and Stegun and the Bateman Manuscript Project. Both works contain
many identities concerning specific integrals, which are organized with the most relevant topic instead of being
collected into a separate table. Two volumes of the Bateman Manuscript are specific to integral transforms.
There are several web sites which have tables of integrals and integrals on demand. Wolfram Alpha can show
results, and for some simpler expressions, also the intermediate steps of the integration. Wolfram Research also
operates another online service, the Wolfram Mathematica Online Integrator [1].
The antiderivative ∫ dx/x = ln|x| + C deserves special mention: there is a singularity at 0, and the integral becomes
infinite there. If the integral above were used to give a definite integral between −1 and 1, the answer would be 0.
This, however, is only the value assuming the Cauchy principal value for the integral around the singularity. If the
integration were done in the complex plane, the result would depend on the path around the origin; in this case the
singularity contributes −iπ when using a path above the origin and iπ for a path below the origin. A function on the
real line could use a completely different value of C on either side of the origin, as in:
∫ dx/x = ln|x| + C1 for x < 0, and ln|x| + C2 for x > 0.
Rational functions
more integrals: List of integrals of rational functions
These rational functions have a non-integrable singularity at 0 for a ≤ −1.
More generally,[2]
Exponential functions
more integrals: List of integrals of exponential functions
Logarithms
more integrals: List of integrals of logarithmic functions
Trigonometric functions
more integrals: List of integrals of trigonometric functions
(See Integral of the secant function. This result was a well-known conjecture in the 17th century.)
Hyperbolic functions
more integrals: List of integrals of hyperbolic functions
Special functions
Ci, Si: Trigonometric integrals, Ei: Exponential integral, li: Logarithmic integral function, erf: Error function
References
[1] http://integrals.wolfram.com/index.jsp
[2] "Reader Survey: log|x| + C (http://golem.ph.utexas.edu/category/2012/03/reader_survey_logx_c.html)", Tom Leinster, The n-Category
Café, March 19, 2012
• M. Abramowitz and I.A. Stegun, editors. Handbook of Mathematical Functions with Formulas, Graphs, and
Mathematical Tables.
• I.S. Gradshteyn (И.С. Градштейн), I.M. Ryzhik (И.М. Рыжик); Alan Jeffrey, Daniel Zwillinger, editors. Table of
Integrals, Series, and Products, seventh edition. Academic Press, 2007. ISBN 978-0-12-373637-6. Errata. (http://
www.mathtable.com/gr) (Several previous editions as well.)
• A.P. Prudnikov (А.П. Прудников), Yu.A. Brychkov (Ю.А. Брычков), O.I. Marichev (О.И. Маричев). Integrals
and Series. First edition (Russian), volume 1–5, Nauka, 1981−1986. First edition (English, translated from the
Russian by N.M. Queen), volume 1–5, Gordon & Breach Science Publishers/CRC Press, 1988–1992, ISBN
2-88124-097-6. Second revised edition (Russian), volume 1–3, Fiziko-Matematicheskaya Literatura, 2003.
• Yu.A. Brychkov (Ю.А. Брычков), Handbook of Special Functions: Derivatives, Integrals, Series and Other
Formulas. Russian edition, Fiziko-Matematicheskaya Literatura, 2006. English edition, Chapman & Hall/CRC
Press, 2008, ISBN 1-58488-956-X.
• Daniel Zwillinger. CRC Standard Mathematical Tables and Formulae, 31st edition. Chapman & Hall/CRC Press,
2002. ISBN 1-58488-291-3. (Many earlier editions as well.)
Historical
• Meyer Hirsch, Integraltafeln, oder, Sammlung von Integralformeln (http://books.google.com/
books?id=Cdg2AAAAMAAJ) (Duncker und Humblot, Berlin, 1810)
• Meyer Hirsch, Integral Tables, Or, A Collection of Integral Formulae (http://books.google.com/
books?id=NsI2AAAAMAAJ) (Baynes and son, London, 1823) [English translation of Integraltafeln]
• David Bierens de Haan, Nouvelles Tables d'Intégrales définies (http://www.archive.org/details/
nouvetaintegral00haanrich) (Engels, Leiden, 1862)
• Benjamin O. Pierce A short table of integrals - revised edition (http://books.google.com/
books?id=pYMRAAAAYAAJ) (Ginn & co., Boston, 1899)
External links
Tables of integrals
• Paul's Online Math Notes (http://tutorial.math.lamar.edu/pdf/Common_Derivatives_Integrals.pdf)
• A. Dieckmann, Table of Integrals (Elliptic Functions, Square Roots, Inverse Tangents and More Exotic
Functions): Indefinite Integrals (http://pi.physik.uni-bonn.de/~dieckman/IntegralsIndefinite/IndefInt.html)
Definite Integrals (http://pi.physik.uni-bonn.de/~dieckman/IntegralsDefinite/DefInt.html)
• Math Major: A Table of Integrals (http://mathmajor.org/calculus-and-analysis/table-of-integrals/)
• O'Brien, Francis J. Jr. Integrals (http://www.docstoc.com/docs/23969109/
500-Integrals-of-Elementary-and-Special-Functions''500) Derived integrals of exponential and logarithmic
functions
• Rule-based Mathematics (http://www.apmaths.uwo.ca/RuleBasedMathematics/index.html) Precisely defined
indefinite integration rules covering a wide class of integrands
Derivations
• V. H. Moll, The Integrals in Gradshteyn and Ryzhik (http://www.math.tulane.edu/~vhm/Table.html)
Online service
• Integration examples for Wolfram Alpha (http://www.wolframalpha.com/examples/Integrals.html)
Local regression
LOESS,[1] and LOWESS (locally weighted scatterplot smoothing) are two strongly related regression modeling
methods that combine multiple regression models in a k-nearest-neighbor-based meta-model.
The trade-off for these features is increased computation. Because it is so computationally intensive, LOESS would
have been practically impossible to use in the era when least squares regression was being developed. Most other
modern methods for process modeling are similar to LOESS in this respect. These methods have been consciously
designed to use our current computational ability to the fullest possible advantage to achieve goals not easily
achieved by traditional approaches.
Plotting a smooth curve through a set of data points using this statistical technique is called a Loess Curve,
particularly when each smoothed value is given by a weighted quadratic least squares regression over the span of
values of the y-axis scattergram criterion variable. When each smoothed value is given by a weighted linear least
squares regression over the span, this is known as a Lowess curve; however, some authorities treat Lowess and
Loess as synonyms.
Weight function
As mentioned above, the weight function gives the most weight to the data points nearest the point of estimation and
the least weight to the data points that are furthest away. The use of the weights is based on the idea that points near
each other in the explanatory variable space are more likely to be related to each other in a simple way than points
that are further apart. Following this logic, points that are likely to follow the local model best influence the local
model parameter estimates the most. Points that are less likely to actually conform to the local model have less
influence on the local model parameter estimates.
The traditional weight function used for LOESS is the tri-cube weight function,
w(d) = (1 − |d|^3)^3 for |d| < 1, and w(d) = 0 for |d| ≥ 1.
However, any other weight function that satisfies the properties listed in Cleveland (1979) could also be used. The
weight for a specific point in any localized subset of data is obtained by evaluating the weight function at the
distance between that point and the point of estimation, after scaling the distance so that the maximum absolute
distance over all of the points in the subset of data is exactly one.
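The scaling and weighting just described can be sketched in a few lines of Python. This is a minimal, illustrative
single-point LOWESS evaluation (the function names and the choice of k nearest neighbours are this sketch's own,
not from the source): distances to the k nearest neighbours are rescaled so the farthest sits at exactly 1, tri-cube
weights are applied, and a weighted linear least squares fit is evaluated at the point of estimation.

```python
def tricube(d):
    # Tri-cube weight: (1 - |d|^3)^3 for |d| < 1, else 0
    a = abs(d)
    return (1 - a ** 3) ** 3 if a < 1 else 0.0

def lowess_point(xs, ys, x0, k):
    # One LOWESS evaluation at x0: weighted linear least squares over
    # the k nearest neighbours, with tri-cube weights on scaled distances.
    idx = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    dmax = max(abs(xs[i] - x0) for i in idx) or 1.0
    w = [tricube(abs(xs[i] - x0) / dmax) for i in idx]
    sw = sum(w)
    mx = sum(wi * xs[i] for wi, i in zip(w, idx)) / sw
    my = sum(wi * ys[i] for wi, i in zip(w, idx)) / sw
    sxx = sum(wi * (xs[i] - mx) ** 2 for wi, i in zip(w, idx))
    sxy = sum(wi * (xs[i] - mx) * (ys[i] - my) for wi, i in zip(w, idx))
    slope = sxy / sxx if sxx else 0.0
    return my + slope * (x0 - mx)

xs = [float(i) for i in range(10)]
ys = [2 * x + 1 for x in xs]          # exactly linear data
print(lowess_point(xs, ys, 4.5, 5))   # a local linear fit recovers 2*4.5 + 1 = 10
```

Note that the farthest of the k neighbours receives weight zero, since tricube(1) = 0; on exactly linear data the local
fit reproduces the line.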
Advantages of LOESS
As discussed above, the biggest advantage LOESS has over many other methods is the fact that it does not require
the specification of a function to fit a model to all of the data in the sample. Instead the analyst only has to provide a
smoothing parameter value and the degree of the local polynomial. In addition, LOESS is very flexible, making it
ideal for modeling complex processes for which no theoretical models exist. These two advantages, combined with
the simplicity of the method, make LOESS one of the most attractive of the modern regression methods for
applications that fit the general framework of least squares regression but which have a complex deterministic
structure.
Although it is less obvious than for some of the other methods related to linear least squares regression, LOESS also
accrues most of the benefits typically shared by those procedures. The most important of those is the theory for
computing uncertainties for prediction and calibration. Many other tests and procedures used for validation of least
squares models can also be extended to LOESS models.
Disadvantages of LOESS
LOESS makes less efficient use of data than other least squares methods. It requires fairly large, densely sampled
data sets in order to produce good models. This is not really surprising, however, since LOESS needs good empirical
information on the local structure of the process in order to perform the local fitting. In fact, given the results it
provides, LOESS could be more efficient overall than other methods like nonlinear least squares. It may simply
frontload the costs of an experiment in data collection but then reduce analysis costs.
Another disadvantage of LOESS is the fact that it does not produce a regression function that is easily represented by
a mathematical formula. This can make it difficult to transfer the results of an analysis to other people. In order to
transfer the regression function to another person, they would need the data set and software for LOESS calculations.
In nonlinear regression, on the other hand, it is only necessary to write down a functional form in order to provide
estimates of the unknown parameters and the estimated uncertainty. Depending on the application, this could be
either a major or a minor drawback to using LOESS.
Finally, as discussed above, LOESS is a computationally intensive method. This is not usually a problem in our
current computing environment, however, unless the data sets being used are very large. LOESS is also prone to the
effects of outliers in the data set, like other least squares methods. There is an iterative, robust version of LOESS
[Cleveland (1979)] that can be used to reduce LOESS' sensitivity to outliers, but too many extreme outliers can still
overcome even the robust method.
References
[1] LOESS is a later generalization of LOWESS; although it isn't a true initialism, it may be understood as standing for "LOcal regrESSion" (e.g.
John Fox, Nonparametric Regression: Appendix to An R and S-PLUS Companion to Applied Regression (http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-nonparametric-regression.pdf), January 2002)
• Cleveland, William S. (1979). "Robust Locally Weighted Regression and Smoothing Scatterplots". Journal of the
American Statistical Association 74 (368): 829–836. doi:10.2307/2286407. JSTOR 2286407. MR0556476.
• Cleveland, William S. (1981). "LOWESS: A program for smoothing scatterplots by robust locally weighted
regression". The American Statistician 35 (1): 54. JSTOR 2683591.
• Cleveland, William S.; Devlin, Susan J. (1988). "Locally-Weighted Regression: An Approach to Regression
Analysis by Local Fitting". Journal of the American Statistical Association 83 (403): 596–610.
doi:10.2307/2289282. JSTOR 2289282.
External links
• Local Regression and Election Modeling (http://voteforamerica.net/editorials/Comments.aspx?ArticleId=28&
ArticleName=Electoral+Projections+Using+LOESS)
• Smoothing by Local Regression: Principles and Methods (PostScript Document) (http://www.stat.purdue.edu/
~wsc/papers/localregression.principles.ps)
• NIST Engineering Statistics Handbook Section on LOESS (http://www.itl.nist.gov/div898/handbook/pmd/
section1/pmd144.htm)
• Local Fitting Software (http://www.stat.purdue.edu/~wsc/localfitsoft.html)
• LOESS Smoothing in Excel (http://peltiertech.com/WordPress/loess-smoothing-in-excel/)
• Scatter Plot Smoothing (http://stat.ethz.ch/R-manual/R-patched/library/stats/html/lowess.html)
• The Loess function (http://research.stowers-institute.org/efg/R/Statistics/loess.htm) in R
• Quantile LOWESS (http://www.r-statistics.com/2010/04/
quantile-lowess-combining-a-moving-quantile-window-with-lowess-r-function/) – A method to perform Local
regression on a Quantile moving window (with R code)
This article incorporates public domain material from websites or documents of the National Institute of
Standards and Technology (http://www.nist.gov).
Log-normal distribution
Log-normal
Notation: ln N(μ, σ²)
Parameters: σ² > 0 (shape, real), μ ∈ R (log-scale)
Support: x ∈ (0, +∞)
PDF: 1/(xσ√(2π)) · exp(−(ln x − μ)^2 / (2σ^2))
CDF: Φ((ln x − μ)/σ)
Mean: exp(μ + σ^2/2)
Median: exp(μ)
Mode: exp(μ − σ^2)
Variance: (exp(σ^2) − 1) exp(2μ + σ^2)
Skewness: (exp(σ^2) + 2) √(exp(σ^2) − 1)
Ex. kurtosis: exp(4σ^2) + 2 exp(3σ^2) + 3 exp(2σ^2) − 6
Entropy: ½ + ½ ln(2πσ^2) + μ
Fisher information
In probability theory, a log-normal distribution is a continuous probability distribution of a random variable whose
logarithm is normally distributed. If X is a random variable with a normal distribution, then Y = exp(X) has a
log-normal distribution; likewise, if Y is log-normally distributed, then X = log(Y) has a normal distribution. The
log-normal distribution is the distribution of a random variable that takes only positive real values.
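This relationship is easy to verify by simulation. The sketch below (parameter values arbitrary) checks that the
sample mean of exp(X) matches exp(μ + σ²/2), the known mean of the log-normal distribution, rather than exp(μ):

```python
import math
import random

random.seed(0)
mu, sigma = 0.5, 0.4

# If X ~ N(mu, sigma^2), then Y = exp(X) is log-normal; its mean is
# exp(mu + sigma^2 / 2), not exp(mu).
samples = [math.exp(random.gauss(mu, sigma)) for _ in range(200000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean, math.exp(mu + sigma ** 2 / 2))  # both close to exp(0.58)
```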
Log-normal is also written log normal or lognormal. The distribution is occasionally referred to as the Galton
distribution or Galton's distribution, after Francis Galton,[1] and other names, such as McAlister, Gibrat and
Cobb–Douglas, have also been associated with it.[1]
A variable might be modeled as log-normal if it can be thought of as the multiplicative product of many independent
random variables each of which is positive. (This is justified by considering the central limit theorem in the
log-domain.) For example, in finance, the variable could represent the compound return from a sequence of many
trades (each expressed as its return + 1); or a long-term discount factor can be derived from the product of short-term
discount factors. In wireless communication, the attenuation caused by shadowing or slow fading from random
objects is often assumed to be log-normally distributed: see log-distance path loss model.
The log-normal distribution is the maximum entropy probability distribution for a random variate X for which the
mean and variance of ln(X) are fixed.[2]
μ and σ
In a log-normal distribution X, the parameters denoted μ and σ are, respectively, the mean and standard deviation of
the variable's natural logarithm (which is, by definition, normally distributed), which means
$\mu = \operatorname{E}[\ln X], \qquad \sigma = \sqrt{\operatorname{Var}[\ln X]}.$
Characterization
The probability density function is
$f_X(x;\mu,\sigma) = \frac{1}{x\sigma\sqrt{2\pi}}\, e^{-\frac{(\ln x-\mu)^2}{2\sigma^2}}, \qquad x > 0.$
This follows by applying the change-of-variables rule to the density function of a normal distribution. The cumulative distribution function is
$F_X(x;\mu,\sigma) = \frac{1}{2}\operatorname{erfc}\!\left(-\frac{\ln x-\mu}{\sigma\sqrt{2}}\right) = \Phi\!\left(\frac{\ln x-\mu}{\sigma}\right),$
where erfc is the complementary error function, and Φ is the cumulative distribution function of the standard normal
distribution.
This series representation is divergent for Re(σ2) > 0. However, it is sufficient for evaluating the characteristic
function numerically at positive t as long as the upper limit in the sum above is kept bounded, n ≤ N, and σ2 < 0.1.
To bring the numerical values of the parameters μ, σ into the domain where this strong inequality holds, one can use
the fact that if X is log-normally distributed then Xm is also log-normally distributed, with parameters mμ and mσ.
Since σm = mσ, the inequality can be satisfied for sufficiently small m. The truncated sum first converges to the
value of φ(t) with arbitrarily high accuracy if m is small enough and the left part of the strong inequality is satisfied.
If a considerably larger number of terms is taken into account, the sum eventually diverges when the right part of the
strong inequality no longer holds.
Another useful representation is available[3][4] by means of a double Taylor expansion of $e^{(\ln x - \mu)^2/(2\sigma^2)}$.
The moment-generating function of the log-normal distribution does not exist on the whole real line; it is defined
only on the half-interval (−∞, 0].
Properties
Geometric moments
The geometric mean of the log-normal distribution is $e^{\mu}$. Because the log of a log-normal variable is symmetric and
quantiles are preserved under monotonic transformations, the geometric mean of a log-normal distribution is equal to
its median.[5]
The geometric mean (mg) can alternatively be derived from the arithmetic mean (ma) in a log-normal distribution by
$m_g = m_a\, e^{-\sigma^2/2}.$
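These identities are easy to check by simulation; a minimal sketch (Python standard library only, parameter values chosen purely for illustration) draws log-normal samples and compares the geometric mean, the median, and the rescaled arithmetic mean:

```python
import math
import random
import statistics

random.seed(4)
mu, sigma = 0.8, 0.7        # illustrative parameters
n = 200_001
xs = [math.exp(random.gauss(mu, sigma)) for _ in range(n)]

geo_mean = math.exp(sum(math.log(x) for x in xs) / n)
median = statistics.median(xs)
arith_mean = sum(xs) / n

print(round(geo_mean, 2), round(median, 2))            # both approximate e^mu
print(round(arith_mean * math.exp(-sigma**2 / 2), 2))  # approximates the geometric mean
```

All three printed values agree up to sampling noise, reflecting that the median and geometric mean coincide at $e^{\mu}$.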
Arithmetic moments
If X is a lognormally distributed variable, its expected value (E – the arithmetic mean), variance (Var), and standard
deviation (s.d.) are
$\operatorname{E}[X] = e^{\mu+\sigma^2/2},$
$\operatorname{Var}[X] = (e^{\sigma^2}-1)\,e^{2\mu+\sigma^2},$
$\operatorname{s.d.}[X] = \sqrt{\operatorname{Var}[X]} = e^{\mu+\sigma^2/2}\sqrt{e^{\sigma^2}-1}.$
Equivalently, parameters μ and σ can be obtained if the expected value and variance are known:
$\mu = \ln\operatorname{E}[X] - \tfrac{1}{2}\ln\!\left(1+\frac{\operatorname{Var}[X]}{\operatorname{E}[X]^2}\right), \qquad \sigma^2 = \ln\!\left(1+\frac{\operatorname{Var}[X]}{\operatorname{E}[X]^2}\right).$
For any real or complex number s, the sth moment of log-normal X is given by[1]
$\operatorname{E}[X^s] = e^{s\mu + s^2\sigma^2/2}.$
A log-normal distribution is not uniquely determined by its moments E[X^k] for k ≥ 1; that is, there exists some other
distribution with the same moments for all k.[1] In fact, there is a whole family of distributions with the same
moments as the log-normal distribution.
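The closed-form moments $\operatorname{E}[X^s] = e^{s\mu+s^2\sigma^2/2}$ can be checked against Monte Carlo estimates; a minimal sketch with illustrative parameter values:

```python
import math
import random

random.seed(1)
mu, sigma = 0.4, 0.5    # illustrative parameters
n = 400_000
xs = [math.exp(random.gauss(mu, sigma)) for _ in range(n)]

def moment(s):
    # closed form: E[X^s] = exp(s*mu + s^2 * sigma^2 / 2)
    return math.exp(s * mu + s * s * sigma * sigma / 2.0)

estimates = {s: sum(x ** s for x in xs) / n for s in (1, 2, 3)}
for s, mc in estimates.items():
    print(s, round(mc, 2), round(moment(s), 2))
```

The Monte Carlo estimates match the closed form up to sampling error, which grows with s because higher powers of a heavy-right-tailed variable have larger variance.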
Coefficient of variation
The coefficient of variation is the ratio of the s.d. over the mean (on the natural scale) and is equal to
$\sqrt{e^{\sigma^2}-1}.$

[Figure: Comparison of the mean, median and mode of two log-normal distributions with different skewness.]

Partial expectation
The partial expectation of a random variable X with respect to a threshold k is defined as
g(k) = E[X | X > k]P[X > k]. For a log-normal random variable the partial expectation is given by
$g(k) = e^{\mu+\sigma^2/2}\,\Phi\!\left(\frac{\mu+\sigma^2-\ln k}{\sigma}\right).$
This formula has applications in insurance and economics; it is used in solving the partial differential equation
leading to the Black–Scholes formula.
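The partial-expectation formula $g(k)=e^{\mu+\sigma^2/2}\,\Phi\!\big(\tfrac{\mu+\sigma^2-\ln k}{\sigma}\big)$ can be verified by simulation; a minimal sketch with illustrative values of μ, σ and the threshold k:

```python
import math
import random

random.seed(2)
mu, sigma, k = 0.0, 0.5, 1.2    # illustrative values
n = 400_000
xs = [math.exp(random.gauss(mu, sigma)) for _ in range(n)]

def phi(z):
    # standard normal CDF via the complementary error function
    return 0.5 * math.erfc(-z / math.sqrt(2))

# closed form for g(k) = E[X; X > k]
g = math.exp(mu + sigma**2 / 2) * phi((mu + sigma**2 - math.log(k)) / sigma)
mc = sum(x for x in xs if x > k) / n   # Monte Carlo estimate of E[X; X > k]
print(round(g, 3), round(mc, 3))
```

The two printed numbers agree up to Monte Carlo noise.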
Other
A set of data that arises from the log-normal distribution has a symmetric Lorenz curve (see also Lorenz asymmetry
coefficient).[6]
The harmonic (H), geometric (G) and arithmetic (A) means of this distribution are related;[7] the relation is given
by
$G^2 = A H.$
Occurrence
• In biology, variables whose logarithms tend to have a normal distribution include:
• Measures of size of living tissue (length, skin area, weight);[8]
• The length of inert appendages (hair, claws, nails, teeth) of biological specimens, in the direction of growth;
• Certain physiological measurements, such as blood pressure of adult humans (after separation into male and
female subpopulations).[9]
Consequently, reference ranges for measurements in healthy individuals are more accurately estimated by assuming
a log-normal distribution than by assuming a symmetric distribution about the mean.
• In hydrology, the log-normal distribution is used to analyze
extreme values of such variables as monthly and annual
maximum values of daily rainfall and river discharge
volumes.[10]
• The image on the right illustrates an example of fitting the log-normal distribution to ranked annually maximum
one-day rainfalls, showing also the 90% confidence belt based on the binomial distribution. The rainfall data are
represented by plotting positions as part of a cumulative frequency analysis. [Figure: Fitted cumulative
log-normal distribution to annually maximum 1-day rainfalls.]
• In economics, there is evidence that the income of 97%–99% of the population is distributed log-normally[11].
• In finance, in particular the Black–Scholes model, changes in the logarithm of exchange rates, price indices, and
stock market indices are assumed normal[12] (these variables behave like compound interest, not like simple
interest, and so are multiplicative). However, some mathematicians, such as Benoît Mandelbrot, have argued that
log-Lévy distributions, which possess heavy tails, would be a more appropriate model, in particular for the
analysis of stock market crashes. Indeed, stock price distributions typically exhibit a fat tail.[13]
• The distribution of city sizes is lognormal. This follows from Gibrat's law of proportionate (or scale-free) growth.
Irrespective of their size, all cities follow the same stochastic growth process. As a result, the logarithm of city
size is normally distributed. There is also evidence of lognormality in the firm size distribution and of Gibrat's
law.
• In reliability analysis, the lognormal distribution is often used to model times to repair a maintainable system.
• In wireless communication, "the local-mean power expressed in logarithmic values, such as dB or neper, has a
normal (i.e., Gaussian) distribution." [14]
• It has been proposed that coefficients of friction and wear may be treated as having a lognormal distribution.[15]

Maximum likelihood estimation of parameters

For determining the maximum likelihood estimators of the log-normal distribution parameters μ and σ, we can use
the same procedure as for the normal distribution. The density functions satisfy
$f_L(x;\mu,\sigma) = \frac{1}{x}\, f_N(\ln x;\mu,\sigma),$
where by ƒL we denote the probability density function of the log-normal distribution and by ƒN that of the normal
distribution. Therefore, using the same indices to denote distributions, we can write the log-likelihood function thus:
$\ell_L(\mu,\sigma \mid x_1,\ldots,x_n) = -\sum_k \ln x_k + \ell_N(\mu,\sigma \mid \ln x_1,\ldots,\ln x_n).$
Since the first term is constant with regard to μ and σ, both logarithmic likelihood functions, ℓL and ℓN, reach their
maximum with the same μ and σ. Hence, using the formulas for the normal distribution maximum likelihood
parameter estimators and the equality above, we deduce that for the log-normal distribution it holds that
$\hat\mu = \frac{\sum_k \ln x_k}{n}, \qquad \hat\sigma^2 = \frac{\sum_k (\ln x_k - \hat\mu)^2}{n}.$
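These estimators amount to applying the normal-distribution MLEs to the logged data; a minimal sketch on synthetic data (true parameter values invented for illustration):

```python
import math
import random

random.seed(3)
true_mu, true_sigma = 1.0, 0.4        # illustrative true parameters
data = [math.exp(random.gauss(true_mu, true_sigma)) for _ in range(100_000)]

# ML estimates: the normal-distribution MLEs applied to ln(x)
logs = [math.log(x) for x in data]
mu_hat = sum(logs) / len(logs)
sigma_hat = math.sqrt(sum((v - mu_hat) ** 2 for v in logs) / len(logs))
print(round(mu_hat, 2), round(sigma_hat, 2))
```

With this many observations the estimates recover the true μ and σ to two decimal places.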
Multivariate log-normal

If $\boldsymbol X \sim \mathcal N(\boldsymbol\mu, \boldsymbol\Sigma)$ is a multivariate normal random vector, then $\boldsymbol Y = \exp(\boldsymbol X)$ (taken componentwise) has a multivariate log-normal
distribution[16] with mean
$\operatorname{E}[\boldsymbol Y]_i = e^{\mu_i + \frac{1}{2}\Sigma_{ii}}$
and covariance
$\operatorname{Var}[\boldsymbol Y]_{ij} = e^{\mu_i+\mu_j+\frac{1}{2}(\Sigma_{ii}+\Sigma_{jj})}\left(e^{\Sigma_{ij}}-1\right).$
Related distributions
• If $X \sim \mathcal N(\mu, \sigma^2)$ is a normal distribution, then $e^X \sim \operatorname{Log-}\mathcal N(\mu, \sigma^2)$.
• If $X \sim \operatorname{Log-}\mathcal N(\mu, \sigma^2)$ is distributed log-normally, then $\ln X \sim \mathcal N(\mu, \sigma^2)$ is a normal random variable.
• If $X_j \sim \operatorname{Log-}\mathcal N(\mu_j, \sigma_j^2)$ are n independent log-normally distributed variables, and $Y = \prod_{j=1}^n X_j$, then Y is
also distributed log-normally: $Y \sim \operatorname{Log-}\mathcal N\big(\sum_j \mu_j,\, \sum_j \sigma_j^2\big)$.
In the case that all $X_j$ have the same variance parameter $\sigma_j = \sigma$, these formulas simplify to $Y \sim \operatorname{Log-}\mathcal N\big(\sum_j \mu_j,\, n\sigma^2\big)$.
• If $X \sim \operatorname{Log-}\mathcal N(\mu, \sigma^2)$, then X + c is said to have a shifted log-normal distribution with support x ∈ (c, +∞).
E[X + c] = E[X] + c, Var[X + c] = Var[X].
• If $X \sim \operatorname{Log-}\mathcal N(\mu, \sigma^2)$, then $aX \sim \operatorname{Log-}\mathcal N(\mu + \ln a,\, \sigma^2)$ for a > 0.
• If $X \sim \operatorname{Log-}\mathcal N(\mu, \sigma^2)$, then $X^a \sim \operatorname{Log-}\mathcal N(a\mu,\, a^2\sigma^2)$.
• The asymptotic tail behaviour of the density of a sum of correlated log-normal variables is analysed in [17].
• The lognormal distribution is a special case of the semi-bounded Johnson distribution.
• If $X \mid Y \sim \operatorname{Rayleigh}(Y)$ with $Y \sim \operatorname{Log-}\mathcal N(\mu, \sigma^2)$, then X has a Suzuki distribution.
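The product rule above is straightforward to check numerically; the sketch below multiplies three independent log-normal variables with hypothetical parameters and verifies that the log of the product has mean Σμⱼ and standard deviation √(Σσⱼ²):

```python
import math
import random

random.seed(5)
params = [(0.2, 0.3), (0.5, 0.4), (-0.1, 0.2)]   # hypothetical (mu_j, sigma_j) pairs
n = 200_000

logs = []
for _ in range(n):
    y = 1.0
    for m, s in params:
        y *= math.exp(random.gauss(m, s))        # product of independent log-normals
    logs.append(math.log(y))

mean = sum(logs) / n
sd = math.sqrt(sum((v - mean) ** 2 for v in logs) / n)
print(round(mean, 2), round(sd, 2))  # should approximate 0.6 and sqrt(0.29)
```

Here Σμⱼ = 0.6 and Σσⱼ² = 0.29, and the simulated log-product matches both up to sampling noise.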
Similar distributions
• A substitute for the log-normal, whose integral can be expressed in terms of more elementary functions (Swamee,
2002), can be obtained based on the logistic distribution, giving an approximation to the CDF.
Notes
[1] Johnson, Norman L.; Kotz, Samuel; Balakrishnan, N. (1994), "14: Lognormal Distributions", Continuous univariate distributions. Vol. 1,
Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics (2nd ed.), New York: John Wiley & Sons,
ISBN 978-0-471-58495-7, MR1299979
[2] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http://www.wise.xmu.edu.cn/Master/Download/..\..\UploadFiles\paper-masterdownload\2009519932327055475115776.pdf). Journal of Econometrics (Elsevier): 219–230. Retrieved 2011-06-02.
[3] Leipnik, Roy B. (1991), "On Lognormal Random Variables: I – The Characteristic Function", Journal of the Australian Mathematical Society
Series B, 32, 327–347.
[4] Daniel Dufresne (2009), Sums of Lognormals (http://www.soa.org/library/proceedings/arch/2009/arch-2009-iss1-dufresne.pdf), Centre for Actuarial Studies, University of Melbourne.
[5] Leslie E. Daly, Geoffrey Joseph Bourke (2000) Interpretation and Uses of Medical Statistics (http://books.google.se/books?id=AY7LnYkiLNkC&pg=PA89), 5th edition. Wiley-Blackwell. ISBN 0-632-04763-1, ISBN 978-0-632-04763-5 (page 89)
[6] Damgaard, Christian; Weiner, Jacob (2000). "Describing inequality in plant size or fecundity". Ecology 81 (4): 1139–1142.
doi:10.1890/0012-9658(2000)081[1139:DIIPSO]2.0.CO;2.
[7] Rossman LA (1990) "Design stream flows based on harmonic means". J Hydraulic Engineering ASCE 116 (7) 946–950
[8] Huxley, Julian S. (1932). Problems of relative growth. London. ISBN 0-486-61114-0. OCLC 476909537.
[9] Makuch, Robert W.; D.H. Freeman, M.F. Johnson (1979). "Justification for the lognormal distribution as a model for blood pressure" (http://www.sciencedirect.com/science/article/pii/0021968179900705). Journal of Chronic Diseases 32 (3): 245–250. doi:10.1016/0021-9681(79)90070-5. Retrieved 27 February 2012.
[10] Ritzema, H.P. (ed.) (1994). Frequency and Regression Analysis (http://www.waterlog.info/pdf/freqtxt.pdf). Chapter 6 in: Drainage Principles and Applications, Publication 16, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands. pp. 175–224. ISBN 90-70754-33-9.
[11] Clementi, F.; Gallegati, M. (2005) "Pareto's law of income distribution: Evidence for Germany, the United Kingdom, and the United States" (http://ideas.repec.org/p/wpa/wuwpmi/0505006.html), EconWPA.
[12] Black, Fischer and Myron Scholes, "The Pricing of Options and Corporate Liabilities", Journal of Political Economy, Vol. 81, No. 3,
(May/June 1973), pp. 637–654.
[13] Bunchen, P., Advanced Option Pricing, University of Sydney coursebook, 2007
[14] http://wireless.per.nl/reference/chaptr03/shadow/shadow.htm
[15] Steele, C. (2008). "Use of the lognormal distribution for the coefficients of friction and wear". Reliability Engineering & System Safety 93
(10): 1574–2013. doi:10.1016/j.ress.2007.09.005.
[16] Tarmast, Ghasem (2001) "Multivariate Log-Normal Distribution" (http://isi.cbs.nl/iamamember/CD2/pdf/329.PDF). ISI Proceedings: Seoul 53rd Session 2001.
[17] Gao, X.; Xu, H.; Ye, D. (2009). "Asymptotic Behaviors of Tail Density for Sum of Correlated Lognormal Variables" (http://www.hindawi.com/journals/ijmms/2009/630857.html). International Journal of Mathematics and Mathematical Sciences, vol. 2009, Article ID 630857. doi:10.1155/2009/630857.
References
• Aitchison, J. and Brown, J.A.C. (1957) The Lognormal Distribution, Cambridge University Press.
• E. Limpert, W. Stahel and M. Abbt (2001) Log-normal Distributions across the Sciences: Keys and Clues (http://
stat.ethz.ch/~stahel/lognormal/bioscience.pdf), BioScience, 51 (5), 341–352.
• Eric W. Weisstein et al. Log Normal Distribution (http://mathworld.wolfram.com/LogNormalDistribution.
html) at MathWorld. Electronic document, retrieved October 26, 2006.
• Swamee, P.K. (2002). "Near Lognormal Distribution" (http://ascelibrary.org/doi/abs/10.1061/(ASCE)1084-0699(2002)7:6(441)). Journal of Hydrologic Engineering 7 (6): 441–444.
doi:10.1061/(ASCE)1084-0699(2002)7:6(441).
• Holgate, P. (1989). "The lognormal characteristic function". Communications in Statistics - Theory and Methods
18 (12): 4539–4548. doi:10.1080/03610928908830173.
Further reading
• Robert Brooks, Jon Corson, and J. Donal Wales. "The Pricing of Index Options When the Underlying Assets All
Follow a Lognormal Diffusion" (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=5735), in Advances in
Futures and Options Research, volume 7, 1994.
Logrank test 339
Logrank test
In statistics, the logrank test is a hypothesis test to compare the survival distributions of two samples. It is a
nonparametric test and appropriate to use when the data are right skewed and censored (technically, the censoring
must be non-informative). It is widely used in clinical trials to establish the efficacy of a new treatment compared to
a control treatment when the measurement is the time to event (such as the time from initial treatment to a heart
attack). The test is sometimes called the Mantel–Cox test, named after Nathan Mantel and David Cox. The logrank
test can also be viewed as a time stratified Cochran–Mantel–Haenszel test.
The test was first proposed by Nathan Mantel and was named the logrank test by Richard and Julian Peto.[1][2][3]
Definition
The logrank test statistic compares estimates of the hazard functions of the two groups at each observed event time.
It is constructed by computing the observed and expected number of events in one of the groups at each observed
event time and then adding these to obtain an overall summary across all time points where there is an event.
Let j = 1, ..., J be the distinct times of observed events in either group. For each time $j$, let $N_{1j}$ and $N_{2j}$ be the
number of subjects "at risk" (who have not yet had an event or been censored) at the start of period $j$ in the two
groups, respectively. Let $N_j = N_{1j} + N_{2j}$. Let $O_{1j}$ and $O_{2j}$ be the observed number of events in the groups,
respectively, at time $j$, and define $O_j = O_{1j} + O_{2j}$.
Given that $O_j$ events happened across both groups at time $j$, under the null hypothesis (of the two groups having
identical survival and hazard functions) $O_{1j}$ has the hypergeometric distribution with parameters $N_j$, $N_{1j}$, and $O_j$.
This distribution has expected value $E_{1j} = O_j \frac{N_{1j}}{N_j}$ and variance $V_j = O_j \frac{N_{1j}}{N_j}\left(1-\frac{N_{1j}}{N_j}\right)\frac{N_j-O_j}{N_j-1}$.
The logrank statistic compares each $O_{1j}$ to its expectation $E_{1j}$ under the null hypothesis and is defined as
$Z = \frac{\sum_{j=1}^J \left(O_{1j} - E_{1j}\right)}{\sqrt{\sum_{j=1}^J V_j}}.$
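The computation just described can be implemented directly; the survival data below (times, event indicators, group labels) are synthetic, invented purely for illustration:

```python
import math

# Survival data: (time, event_observed, group); group in {1, 2}.
data = [(1, 1, 1), (2, 1, 1), (3, 0, 1), (4, 1, 1), (5, 1, 1),
        (2, 1, 2), (3, 1, 2), (4, 0, 2), (5, 1, 2), (6, 1, 2)]

times = sorted({t for t, e, g in data if e == 1})  # distinct event times
num, den = 0.0, 0.0
for t in times:
    # subjects still at risk at the start of period t
    n1 = sum(1 for tt, e, g in data if g == 1 and tt >= t)
    n2 = sum(1 for tt, e, g in data if g == 2 and tt >= t)
    # observed events at time t in each group
    o1 = sum(1 for tt, e, g in data if g == 1 and tt == t and e == 1)
    o2 = sum(1 for tt, e, g in data if g == 2 and tt == t and e == 1)
    n, o = n1 + n2, o1 + o2
    e1 = o * n1 / n                      # hypergeometric expectation for group 1
    v = o * (n1 / n) * (n2 / n) * (n - o) / (n - 1) if n > 1 else 0.0
    num += o1 - e1
    den += v

z = num / math.sqrt(den)
print(round(z, 3))
```

The resulting Z is referred to the standard normal distribution, as described in the next section.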
Asymptotic distribution
If the two groups have the same survival function, the logrank statistic is approximately standard normal. A
one-sided level-$\alpha$ test will reject the null hypothesis if $Z > z_\alpha$, where $z_\alpha$ is the upper $\alpha$ quantile of the standard
normal distribution. If the hazard ratio is $\lambda$, there are $n$ total subjects, $d$ is the probability that a subject in either
group will eventually have an event (so that $nd$ is the expected number of events at the time of the analysis), and
the proportion of subjects randomized to each group is 50%, then the logrank statistic is approximately normal with
mean $(\log\lambda)\sqrt{\frac{n\,d}{4}}$ and variance 1.[4] For a one-sided level-$\alpha$ test with power $1-\beta$, the sample size required is
$n = \frac{4\,(z_\alpha + z_\beta)^2}{d\,(\log\lambda)^2}.$
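Under these assumptions the required sample size for 1:1 allocation is approximately n = 4(z_α + z_β)²/(d (log λ)²); the sketch below computes it for hypothetical design values (the inverse-normal helper is a simple bisection, not a library call):

```python
import math

def normal_quantile(p):
    # Inverse standard-normal CDF via bisection on erfc (sufficient accuracy here).
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = (lo + hi) / 2
        if 0.5 * math.erfc(-mid / math.sqrt(2)) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def logrank_sample_size(hazard_ratio, d, alpha=0.05, power=0.8):
    # n = 4 (z_alpha + z_beta)^2 / (d * log(hazard_ratio)^2), 1:1 allocation
    za = normal_quantile(1 - alpha)   # one-sided test
    zb = normal_quantile(power)
    return 4 * (za + zb) ** 2 / (d * math.log(hazard_ratio) ** 2)

# Hypothetical trial: hazard ratio 0.667, 75% of subjects expected to have an event.
print(round(logrank_sample_size(0.667, 0.75)))
```

Smaller event probabilities d inflate the required n, since the test's information comes from events rather than from enrolled subjects.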
Joint distribution
Suppose $Z_1$ and $Z_2$ are the logrank statistics at two different time points in the same study ($Z_1$ earlier). Again,
assume the hazard functions in the two groups are proportional with hazard ratio $\lambda$, and let $d_1$ and $d_2$ be the
probabilities that a subject will have an event at the two time points, where $d_1 \le d_2$. $Z_1$ and $Z_2$ are approximately
bivariate normal with means $\log\lambda\,\sqrt{\frac{n\,d_1}{4}}$ and $\log\lambda\,\sqrt{\frac{n\,d_2}{4}}$ and correlation $\sqrt{\frac{d_1}{d_2}}$. Calculations involving the
joint distribution are needed to correctly maintain the error rate when the data are examined multiple times within a
study by a Data Monitoring Committee.
References
[1] Mantel, Nathan (1966). "Evaluation of survival data and two new rank order statistics arising in its consideration.". Cancer Chemotherapy
Reports 50 (3): 163–70. PMID 5910392.
[2] Peto, Richard; Peto, Julian (1972). "Asymptotically Efficient Rank Invariant Test Procedures". Journal of the Royal Statistical Society. Series
A (General) (Blackwell Publishing) 135 (2): 185–207. doi:10.2307/2344317. JSTOR 2344317.
[3] Harrington, David (2005). "Linear Rank Tests in Survival Analysis". Encyclopedia of Biostatistics. Wiley Interscience.
doi:10.1002/0470011815.b2a11047.
[4] Schoenfeld, D (1981). "The asymptotic properties of nonparametric tests for comparing survival distributions". Biometrika 68: 316–319.
JSTOR 2335833.
External links
• Bland, J. M.; Altman, D. G. (2004). "The logrank test". BMJ 328 (7447): 1073. doi:10.1136/bmj.328.7447.1073.
PMC 403858. PMID 15117797.
Lévy distribution 341
Lévy distribution
Lévy (unshifted)
Parameters c > 0 — scale
Support x ∈ [0, ∞)
PDF $\sqrt{\frac{c}{2\pi}}\;\frac{e^{-c/(2x)}}{x^{3/2}}$
CDF $\operatorname{erfc}\!\left(\sqrt{\frac{c}{2x}}\right)$
Mean $\infty$
Median $\frac{c}{2\left(\operatorname{erfc}^{-1}(1/2)\right)^2}$, for μ = 0
Mode $\frac{c}{3}$, for μ = 0
Variance $\infty$
Skewness undefined
Ex. kurtosis undefined
Entropy $\frac{1+3\gamma+\ln(16\pi c^2)}{2}$,
where $\gamma$ is Euler's constant
MGF undefined
CF $e^{-\sqrt{-2ict}}$
In probability theory and statistics, the Lévy distribution, named after Paul Pierre Lévy, is a continuous probability
distribution for a non-negative random variable. In spectroscopy this distribution, with frequency as the dependent
variable, is known as a van der Waals profile.[1] It is a special case of the inverse-gamma distribution.
It is one of the few distributions that are stable and that have probability density functions that can be expressed
analytically, the others being the normal distribution and the Cauchy distribution. All three are special cases of the
stable distributions, which do not generally have a probability density function which can be expressed analytically.
Definition
The probability density function of the Lévy distribution over the domain $x \ge \mu$ is
$f(x;\mu,c) = \sqrt{\frac{c}{2\pi}}\;\frac{e^{-\frac{c}{2(x-\mu)}}}{(x-\mu)^{3/2}},$
where $\mu$ is the location parameter and $c$ is the scale parameter. The cumulative distribution function is
$F(x;\mu,c) = \operatorname{erfc}\!\left(\sqrt{\frac{c}{2(x-\mu)}}\right),$
where $\operatorname{erfc}$ is the complementary error function. The shift parameter $\mu$ has the effect of shifting the curve to
the right by an amount $\mu$, and changing the support to the interval [$\mu$, $\infty$). Like all stable distributions, the
Lévy distribution has a standard form f(x; 0, 1) which has the following property:
$f(x;\mu,c)\,dx = f(y;0,1)\,dy,$
where y is defined as
$y = \frac{x-\mu}{c}.$
The characteristic function of the Lévy distribution is
$\varphi(t;\mu,c) = e^{i\mu t - \sqrt{-2ict}}.$
Note that the characteristic function can also be written in the same form used for the stable distribution with
$\alpha = 1/2$ and $\beta = 1$:
$\varphi(t;\mu,c) = e^{i\mu t - |ct|^{1/2}\left(1 - i\,\operatorname{sign}(t)\right)}.$
Assuming $\mu = 0$, the nth moment of the unshifted Lévy distribution is formally defined by
$m_n = \sqrt{\frac{c}{2\pi}} \int_0^\infty \frac{e^{-c/(2x)}\, x^n}{x^{3/2}}\, dx,$
which diverges for all n > 0, so that the moments of the Lévy distribution do not exist. The moment generating
function is then formally defined by
$M(t;c) = \sqrt{\frac{c}{2\pi}} \int_0^\infty \frac{e^{-c/(2x) + tx}}{x^{3/2}}\, dx,$
which diverges for $t > 0$ and is therefore not defined in an interval around zero, so that the moment generating
function is not defined per se. Like all stable distributions except the normal distribution, the wing of the probability
density function exhibits heavy-tail behaviour falling off according to a power law:
$f(x;\mu,c) \sim \sqrt{\frac{c}{2\pi}}\,\frac{1}{x^{3/2}} \quad \text{as } x \to \infty.$
This is illustrated in the diagram below, in which the probability density functions for various values of c and $\mu = 0$
are plotted on a log–log scale.
Related distributions
• If $X \sim \operatorname{Levy}(\mu, c)$ then $kX + b \sim \operatorname{Levy}(k\mu + b,\, kc)$.
• If $X \sim \operatorname{Levy}(0, c)$ then $X \sim \operatorname{Inv-Gamma}\!\left(\tfrac{1}{2}, \tfrac{c}{2}\right)$ (inverse-gamma distribution).
• If $X \sim \operatorname{Normal}(\mu, \sigma^2)$ then $(X-\mu)^{-2} \sim \operatorname{Levy}(0, 1/\sigma^2)$.
• If $X \sim \operatorname{Levy}(\mu, c)$ then $X \sim \operatorname{Stable}\!\left(\tfrac{1}{2}, 1, c, \mu\right)$ (stable distribution).
• If $X \sim \operatorname{Levy}(0, c)$ then $X \sim \operatorname{Scale-inv-}\chi^2(1, c)$ (scaled-inverse-chi-squared distribution).
• If $X \sim \operatorname{Levy}(0, c)$ then $X^{-1/2} \sim |\mathcal N(0, 1/c)|$ (folded normal distribution).

[Figure: Probability density function for the Lévy distribution on a log-log scale.]
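The connection to the normal distribution gives a convenient sampler: if Z ~ N(0, 1), then c/Z² follows the unshifted Lévy distribution with scale c. The sketch below (illustrative scale value) checks this against the erfc form of the CDF:

```python
import math
import random

random.seed(6)
c = 2.0           # illustrative scale parameter
n = 200_000
# If Z ~ N(0, 1), then c / Z^2 has the unshifted Levy distribution with scale c.
xs = [c / random.gauss(0, 1) ** 2 for _ in range(n)]

def levy_cdf(x, c):
    # F(x; 0, c) = erfc(sqrt(c / (2 x)))
    return math.erfc(math.sqrt(c / (2 * x)))

# Compare empirical and theoretical CDF at a few points.
checks = {x: sum(1 for v in xs if v <= x) / n for x in (1.0, 5.0, 20.0)}
for x, emp in checks.items():
    print(x, round(emp, 3), round(levy_cdf(x, c), 3))
```

At each checkpoint the empirical and closed-form CDF values agree up to sampling noise, and the slow approach of F(x) to 1 reflects the heavy tail.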
Applications
• The frequency of geomagnetic reversals appears to follow a Lévy distribution.
• The time of hitting a single point $a$ (different from the starting point 0) by standard Brownian motion has the
Lévy distribution with $c = a^2$. (For a Brownian motion with drift, this time may follow an inverse Gaussian
distribution, which has the Lévy distribution as a limit.)
• The length of the path followed by a photon in a turbid medium follows the Lévy distribution.[2]
Footnotes
[1] "van der Waals profile" appears with lowercase "van" in almost all sources, such as: Statistical Mechanics of the Liquid Surface by Clive Anthony Croxton, 1980, A Wiley-Interscience publication, ISBN 0-471-27663-4, ISBN 978-0-471-27663-0 (http://books.google.it/books?id=Wve2AAAAIAAJ); and in Journal of Technical Physics, Volume 36, by Instytut Podstawowych Problemów Techniki (Polska Akademia Nauk), publisher: Państwowe Wydawn. Naukowe, 1995 (http://books.google.it/books?id=2XpVAAAAMAAJ).
[2] Rogers, Geoffrey L, Multiple path analysis of reflectance from turbid media. Journal of the Optical Society of America A, 25:11, p 2879-2883
(2008).
References
• "Information on stable distributions" (http://academic2.american.edu/~jpnolan/stable/stable.html). Retrieved
July 13, 2005. John P. Nolan's introduction to stable distributions, some papers on stable laws, and a free
program to compute stable densities, cumulative distribution functions, quantiles, estimate parameters, etc. See
especially An Introduction to Stable Distributions, Chapter 1
(http://academic2.american.edu/~jpnolan/stable/chap1.pdf).
Mann–Whitney U
In statistics, the Mann–Whitney U test (also called the Mann–Whitney–Wilcoxon (MWW) or Wilcoxon
rank-sum test) is a non-parametric statistical hypothesis test for assessing whether one of two samples of
independent observations tends to have larger values than the other. It is one of the most well-known non-parametric
significance tests. It was proposed initially[1] by the German Gustav Deuchler in 1914 (with a missing term in the
variance) and later independently by Frank Wilcoxon in 1945,[2] for equal sample sizes, and extended to arbitrary
sample sizes and in other ways by Henry Mann and his student Donald Ransom Whitney in 1947.[3]
Calculations
The test involves the calculation of a statistic, usually called U, whose distribution under the null hypothesis is
known. In the case of small samples, the distribution is tabulated, but for sample sizes above ~20 approximation
using the normal distribution is fairly good. Some books tabulate statistics equivalent to U, such as the sum of ranks
in one of the samples, rather than U itself.
The U test is included in most modern statistical packages. It is also easily calculated by hand, especially for small
samples. There are two ways of doing this.
First, arrange all the observations into a single ranked series. That is, rank all the observations without regard to
which sample they are in.
Method one:
For small samples a direct method is recommended. It is very quick, and gives an insight into the meaning of the U
statistic.
Mann–Whitney U 345
1. Choose the sample for which the ranks seem to be smaller (The only reason to do this is to make computation
easier). Call this "sample 1," and call the other sample "sample 2."
2. For each observation in sample 1, count the number of observations in sample 2 that have a smaller rank (count a
half for any that are equal to it). The sum of these counts is U.
Method two:
For larger samples, a formula can be used:
1. Add up the ranks for the observations which came from sample 1. The sum of ranks in sample 2 is now
determinate, since the sum of all the ranks equals N(N + 1)/2 where N is the total number of observations.
2. U is then given by:
$U_1 = R_1 - \frac{n_1(n_1+1)}{2},$
where n1 is the sample size for sample 1, and R1 is the sum of the ranks in sample 1.
Note that it doesn't matter which of the two samples is considered sample 1. An equally valid formula
for U is
$U_2 = R_2 - \frac{n_2(n_2+1)}{2}.$
The smaller value of U1 and U2 is the one used when consulting significance tables. The sum of the two
values is given by
$U_1 + U_2 = R_1 + R_2 - \frac{n_1(n_1+1)}{2} - \frac{n_2(n_2+1)}{2}.$
Knowing that R1 + R2 = N(N + 1)/2 and N = n1 + n2, and doing some algebra, we find that the sum is
$U_1 + U_2 = n_1 n_2.$
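Both calculation methods can be sketched in a few lines; the two small samples below are invented for illustration (they contain no ties):

```python
def mann_whitney_u(sample1, sample2):
    """Direct-count method: for each value in sample1, count the values in
    sample2 that rank below it (ties would count one half each)."""
    u = 0.0
    for x in sample1:
        for y in sample2:
            if y < x:
                u += 1
            elif y == x:
                u += 0.5
    return u

# Hypothetical measurements, for illustration only.
a = [1.1, 2.3, 2.9, 3.8]
b = [3.0, 3.2, 4.5, 4.8, 5.0]

u1 = mann_whitney_u(a, b)
u2 = mann_whitney_u(b, a)
assert u1 + u2 == len(a) * len(b)          # the two U values sum to n1*n2

# The rank-sum formula gives the same result: U1 = R1 - n1(n1+1)/2
pooled = sorted(a + b)
r1 = sum(pooled.index(x) + 1 for x in a)   # ranks (valid here since no ties)
assert u1 == r1 - len(a) * (len(a) + 1) / 2
print(u1, u2)
```

The direct count and the rank-sum formula agree, and the smaller of the two U values is the one compared against significance tables.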
Properties
The maximum value of U is the product of the sample sizes for the two samples. In such a case, the "other" U would
be 0.
Examples
Suppose, for example, that six hares race six tortoises, and that all twelve animals are ranked together by the order
in which they finish. If the sum of the ranks achieved by the hares is 2 + 3 + 4 + 5 + 6 + 12 = 32, this leads to
U = 32 − 21 = 11. Consulting tables, or using the approximation below, shows that this U value gives significant
evidence that hares tend to do better than tortoises (p < 0.05, two-tailed). Obviously this is an extreme distribution
that would be spotted easily, but in a larger sample something similar could happen without being so apparent.
Notice that the problem here is not that the two distributions of ranks have different variances; they are mirror
images of each other, so their variances are the same, but they have very different skewness.
Normal approximation
For large samples, U is approximately normally distributed. In that case, the standardized value
$z = \frac{U - m_U}{\sigma_U},$
where mU and σU are the mean and standard deviation of U, is approximately a standard normal deviate whose
significance can be checked in tables of the normal distribution. mU and σU are given by
$m_U = \frac{n_1 n_2}{2} \qquad \text{and} \qquad \sigma_U = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}.$
The formula for the standard deviation is more complicated in the presence of tied ranks; the full formula is given in
the text books referenced below. However, if the number of ties is small (and especially if there are no large tie
bands) ties can be ignored when doing calculations by hand. The computer statistical packages will use the correctly
adjusted formula as a matter of routine.
Note that since U1 + U2 = n1 n2, the mean n1 n2/2 used in the normal approximation is the mean of the two values of
U. Therefore, the absolute value of the z statistic calculated will be same whichever value of U is used.
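A minimal sketch of the standardization, using hypothetical values of U, n₁ and n₂ (no tie correction):

```python
import math

def u_normal_z(u, n1, n2):
    # Standardize U with m_U = n1*n2/2 and
    # sigma_U = sqrt(n1*n2*(n1 + n2 + 1)/12) (ties ignored).
    m_u = n1 * n2 / 2
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - m_u) / sigma_u

# Hypothetical case: U = 20 observed with samples of size 10 and 12.
print(round(u_normal_z(20, 10, 12), 3))
```

Using the other U value (n₁n₂ − U) flips only the sign of z, consistent with the remark above.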
Because of its probabilistic form, the U statistic can be generalised to a measure of a classifier's separation power for
more than two classes:[11]
$M = \frac{1}{c(c-1)} \sum_{k \ne l} \mathrm{AUC}_{k,l},$
where c is the number of classes, and the $\mathrm{AUC}_{k,l}$ term considers only the ranking of the items belonging
to classes k and l (i.e., items belonging to all other classes are ignored) according to the classifier's estimates of the
probability of those items belonging to class k. $\mathrm{AUC}_{k,k}$ will always be zero but, unlike in the two-class case,
generally $\mathrm{AUC}_{k,l} \ne \mathrm{AUC}_{l,k}$, which is why the M measure sums over all (k, l) pairs, in effect using the average
of $\mathrm{AUC}_{k,l}$ and $\mathrm{AUC}_{l,k}$.
Different distributions
If one is only interested in stochastic ordering of the two populations (i.e., the concordance probability P(Y > X)), the
U test can be used even if the shapes of the distributions are different. The concordance probability is exactly equal
to the area under the receiver operating characteristic (ROC) curve that is often used in this context.
Alternatives
If one desires a simple shift interpretation, the U test should not be used when the distributions of the two samples
are very different, as it can give erroneously significant results. In that situation, the unequal variances version of the
t test is likely to give more reliable results, but only if normality holds.
Alternatively, some authors (e.g. Conover) suggest transforming the data to ranks (if they are not already ranks) and
then performing the t test on the transformed data, the version of the t test used depending on whether or not the
population variances are suspected to be different. Rank transformations do not preserve variances, but variances are
recomputed from samples after rank transformations.
The Brown–Forsythe test has been suggested as an appropriate non-parametric equivalent to the F test for equal
variances.
Kendall's τ
The U test is related to a number of other non-parametric statistical procedures. For example, it is equivalent to
Kendall's τ correlation coefficient if one of the variables is binary (that is, it can only take two values).
ρ statistic
A statistic called ρ that is linearly related to U and widely used in studies of categorization (discrimination learning
involving concepts), and elsewhere,[12] is calculated by dividing U by its maximum value for the given sample sizes,
which is simply n1 × n2. ρ is thus a non-parametric measure of the overlap between two distributions; it can take
values between 0 and 1, and it is an estimate of P(Y > X) + 0.5 P(Y = X), where X and Y are randomly chosen
observations from the two distributions. Both extreme values represent complete separation of the distributions,
while a ρ of 0.5 represents complete overlap. The usefulness of the ρ statistic can be seen in the case of the odd
example used above, where two distributions that were significantly different on a U-test nonetheless had nearly
identical medians: the ρ value in this case is approximately 0.723 in favour of the hares, correctly reflecting the fact
that even though the median tortoise beat the median hare, the hares collectively did better than the tortoises
collectively.
A statement that does full justice to the statistical status of the test might run,
"Outcomes of the two treatments were compared using the Wilcoxon–Mann–Whitney two-sample rank-sum
test. The treatment effect (difference between treatments) was quantified using the Hodges–Lehmann (HL)
estimator, which is consistent with the Wilcoxon test.[13] This estimator (HLΔ) is the median of all possible
differences in outcomes between a subject in group B and a subject in group A. A non-parametric 0.95
confidence interval for HLΔ accompanies these estimates as does ρ, an estimate of the probability that a
randomly chosen subject from population B has a higher weight than a randomly chosen subject from
population A. The median [quartiles] weight for subjects on treatment A and B respectively are 147 [121, 177]
and 151 [130, 180] kg. Treatment A decreased weight by HLΔ = 5 kg (0.95 CL [2, 9] kg, 2P = 0.02, ρ =
0.58)."
However it would be rare to find so extended a report in a document whose major topic was not statistical inference.
Implementations
• Online implementation [14] using javascript
• ALGLIB [15] includes implementation of the Mann–Whitney U test in C++, C#, Delphi, Visual Basic, etc.
• R includes an implementation of the test (there referred to as the Wilcoxon two-sample test) as wilcox.test [16].
• SAS implements the test in the PROC NPAR1WAY procedure
• Stata includes implementation of Wilcoxon-Mann-Whitney rank-sum test with ranksum [17] command.
• SciPy has the mannwhitneyu [18] function in the stats module.
• MATLAB implements the test with function ranksum in the statistics toolbox.
• Mathematica implements the function as MannWhitneyTest [19].
Notes
[1] Kruskal, William H. (September 1957). "Historical Notes on the Wilcoxon Unpaired Two-Sample Test" (http://www.jstor.org/stable/2280906). Journal of the American Statistical Association 52 (279): 356–360.
[2] Wilcoxon, Frank (1945). "Individual comparisons by ranking methods". Biometrics Bulletin 1 (6): 80–83. doi:10.2307/3001968.
JSTOR 3001968.
[3] Mann, Henry B.; Whitney, Donald R. (1947). "On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other".
Annals of Mathematical Statistics 18 (1): 50–60. doi:10.1214/aoms/1177730491. MR22058. Zbl 0041.26103.
[4] Fay, Michael P.; Proschan, Michael A. (2010). "Wilcoxon–Mann–Whitney or t-test? On assumptions for hypothesis tests and multiple
interpretations of decision rules". Statistics Surveys 4: 1–39. doi:10.1214/09-SS051. MR2595125. PMC 2857732. PMID 20414472.
[5] Motulsky, Harvey J.; Statistics Guide, San Diego, CA: GraphPad Software, 2007, p. 123
[6] Lehmann, Erich L.; Elements of Large Sample Theory, Springer, 1999, p. 176
[7] Conover, William J.; Practical Nonparametric Statistics (http://kecubung.webfactional.com/ebook/practical-nonparametric-statistics-conover-download-pdf.pdf), John Wiley & Sons, 1980 (2nd edition), pp. 225–226
[8] Conover, William J.; Iman, Ronald L. (1981). "Rank Transformations as a Bridge Between Parametric and Nonparametric Statistics". The
American Statistician 35 (3): 124–129. doi:10.2307/2683975. JSTOR 2683975.
[9] Hanley, James A.; McNeil, Barbara J. (1982). "The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve". Radiology 143 (1): 29–36. PMID 7063747.
[10] Mason, Simon J.; Graham, Nicholas E. (2002). "Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation" (http://reia.inmet.gov.br/documentos/cursoI_INMET_IRI/Climate_Information_Course/References/Mason+Graham_2002.pdf). Quarterly Journal of the Royal Meteorological Society (128): 2145–2166.
[11] Hand, David J.; Till, Robert J. (2001). "A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems" (http://www.springerlink.com/index/nn141j42838n7u21.pdf). Machine Learning 45 (2): 171–186. doi:10.1023/A:1010920819831.
[12] Herrnstein, Richard J.; Loveland, Donald H.; Cable, Cynthia (1976). "Natural Concepts in Pigeons". Journal of Experimental Psychology:
Animal Behavior Processes 2: 285–302. doi:10.1037/0097-7403.2.4.285.
[13] Myles Hollander and Douglas A. Wolfe (1999). Nonparametric Statistical Methods (2 ed.). ISBN 978-0471190455.
[14] http://faculty.vassar.edu/lowry/utest.html
References
• Lehmann, Erich L. (1975); Nonparametrics: Statistical Methods Based on Ranks.
External links
• Table of critical values of U (pdf) (http://math.usask.ca/~laverty/S245/Tables/wmw.pdf)
• Discussion and table of critical values for the original Wilcoxon Rank-Sum Test, which uses a slightly different
test statistic ( pdf (http://www.stat.auckland.ac.nz/~wild/ChanceEnc/Ch10.wilcoxon.pdf))
• Interactive calculator (http://faculty.vassar.edu/lowry/utest.html) for U and its significance
Matrix calculus
In mathematics, matrix calculus is a specialized notation for doing multivariable calculus, especially over spaces of
matrices. It collects the various partial derivatives of a single function with respect to many variables, and/or of a
multivariate function with respect to a single variable, into vectors and matrices that can be treated as single entities.
This greatly simplifies operations such as finding the maximum or minimum of a multivariate function and solving
systems of differential equations. The notation used here is commonly used in statistics and engineering, while the
tensor index notation is preferred in physics.
Two competing notational conventions split the field of matrix calculus into two separate groups. The two groups
can be distinguished by whether they write the derivative of a scalar with respect to a vector as a column vector or a
row vector. Both of these conventions are possible even when the common assumption is made that vectors should
be treated as column vectors when combined with matrices (rather than row vectors). A single convention can be
somewhat standard throughout a single field that commonly uses matrix calculus (e.g. econometrics, statistics,
estimation theory and machine learning). However, even within a given field different authors can be found using
competing conventions. Authors of both groups often write as though their specific convention is standard. Serious
mistakes can result when combining results from different authors without carefully verifying that compatible
notations are used. Therefore great care should be taken to ensure notational consistency. Definitions of these two
conventions and comparisons between them are collected in the layout conventions section.
Scope
Matrix calculus refers to a number of different notations that use matrices and vectors to collect the derivative of
each component of the dependent variable with respect to each component of the independent variable. In general,
the independent variable can be a scalar, a vector, or a matrix while the dependent variable can be any of these as
well. Each different situation will lead to a different set of rules, or a separate calculus, using the broader sense of the
term. Matrix notation serves as a convenient way to collect the many derivatives in an organized way.
As a first example, consider the gradient from vector calculus. For a scalar function of three independent variables, f(x₁, x₂, x₃), the gradient is given by the vector equation

$$\nabla f = \frac{\partial f}{\partial x_1}\,\hat{x}_1 + \frac{\partial f}{\partial x_2}\,\hat{x}_2 + \frac{\partial f}{\partial x_3}\,\hat{x}_3,$$

where $\hat{x}_i$ represents a unit vector in the $x_i$ direction. This type of generalized derivative can be seen as the derivative of a scalar, f, with respect to a vector, x, and its result can be easily collected in vector form.
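The gradient as a collection of partial derivatives can be sketched numerically; the function below is illustrative, not from the article, and central differences are an assumption of this sketch.

```python
import numpy as np

# Hypothetical scalar function of three variables: f(x) = x1^2 + x2*x3.
def f(x):
    return x[0]**2 + x[1]*x[2]

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient vector."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2*h)
    return g

x = np.array([1.0, 2.0, 3.0])
# Analytic gradient is [2*x1, x3, x2] = [2, 3, 2] at this point.
g = numerical_gradient(f, x)
```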
More complicated examples include the derivative of a scalar function with respect to a matrix, known as the
gradient matrix, which collects the derivative of with respect to each matrix element in the corresponding position in
the resulting matrix. In that case the scalar must be a function of each of the independent variables in the matrix. As
another example, if we have an n-vector of dependent variables, or functions, of m independent variables we might
consider the derivative of the dependent vector with respect to the independent vector. The result could be collected
in an m×n matrix consisting of all of the possible derivative combinations. There are, of course, a total of nine
possibilities using scalars, vectors, and matrices. Notice that as we consider higher numbers of components in each
of the independent and dependent variables we can be left with a very large number of possibilities.
The six kinds of derivatives that can be most neatly organized in matrix form are collected in the following list.[1]

  scalar-by-scalar:  ∂y/∂x
  vector-by-scalar:  ∂y/∂x  (y a vector; the tangent vector)
  matrix-by-scalar:  ∂Y/∂x  (Y a matrix; the tangent matrix)
  scalar-by-vector:  ∂y/∂x  (x a vector; related to the gradient)
  vector-by-vector:  ∂y/∂x  (y, x vectors; related to the Jacobian)
  scalar-by-matrix:  ∂y/∂X  (X a matrix; the gradient matrix)
Here we have used the term matrix in its most general sense, recognizing that vectors and scalars are simply matrices with one column or one row, respectively. Moreover, we have used bold letters to indicate vectors
and bold capital letters for matrices. This notation is used throughout.
Notice that we could also talk about the derivative of a vector with respect to a matrix, or any of the other unfilled
cells in our table. However, these derivatives are most naturally organized in a tensor of rank higher than 2, so that
they do not fit neatly into a matrix. In the following three sections we will define each one of these derivatives and
relate them to other branches of mathematics. See the layout conventions section for a more detailed table.
Usages
Matrix calculus is used for deriving optimal stochastic estimators, often involving the use of Lagrange multipliers.
This includes the derivation of:
• Kalman filter
• Wiener filter
• Expectation-maximization algorithm for Gaussian mixture
Notation
The vector and matrix derivatives presented in the sections to follow take full advantage of matrix notation, using a
single variable to represent a large number of variables. In what follows we will distinguish scalars, vectors and
matrices by their typeface. We will let M(n,m) denote the space of real n×m matrices with n rows and m columns.
Such matrices will be denoted using bold capital letters: A, X, Y, etc. An element of M(n,1), that is, a column vector,
is denoted with a boldface lowercase letter: a, x, y, etc. An element of M(1,1) is a scalar, denoted with lowercase
italic typeface: a, t, x, etc. XT denotes matrix transpose, tr(X) is the trace, and det(X) is the determinant. All functions
are assumed to be of differentiability class C1 unless otherwise noted. Generally letters from first half of the alphabet
(a, b, c, …) will be used to denote constants, and from the second half (t, x, y, …) to denote variables.
NOTE: As mentioned above, there are competing notations for laying out systems of partial derivatives in vectors
and matrices, and no standard appears to be emerging as of yet. The next two introductory sections use the numerator
layout convention simply for the purposes of convenience, to avoid overly complicating the discussion. The section
after them discusses layout conventions in more detail. It is important to realize the following:
1. Despite the use of the terms "numerator layout" and "denominator layout", there are actually more than two
possible notational choices involved. The reason is that the choice of numerator vs. denominator (or in some
situations, numerator vs. mixed) can be made independently for scalar-by-vector, vector-by-scalar,
vector-by-vector, and scalar-by-matrix derivatives, and a number of authors mix and match their layout choices in
various ways.
2. The choice of numerator layout in the introductory sections below does not imply that this is the "correct" or
"superior" choice. There are advantages and disadvantages to the various layout types. Serious mistakes can result
from carelessly combining formulas written in different layouts, and converting from one layout to another
requires care to avoid errors. As a result, when working with existing formulas the best policy is probably to
identify whichever layout is used and maintain consistency with it, rather than attempting to use the same layout
in all situations.
Alternatives
The tensor index notation with its Einstein summation convention is very similar to matrix calculus, except one
writes only a single component at a time. It has the advantage that one can easily manipulate arbitrarily high rank
tensors, whereas tensors of rank higher than two are quite unwieldy with matrix notation. All of the work here can be
done in this notation without use of the single-variable matrix notation. However, many problems in estimation
theory and other areas of applied mathematics would result in too many indices to properly keep track of, pointing in
favor of matrix calculus in those areas. Also, Einstein notation can be very useful in proving the identities presented
here, as an alternative to typical element notation, which can become cumbersome when the explicit sums are carried
around. Note that a matrix can be considered a tensor of rank two.
Vector-by-scalar
In vector calculus the derivative of a vector y with respect to a scalar x is known as the tangent vector of the vector, written (in numerator layout notation)

$$\frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \dfrac{\partial y_1}{\partial x} \\ \vdots \\ \dfrac{\partial y_m}{\partial x} \end{bmatrix}.$$

Notice here that y : ℝ → ℝᵐ.
Example: Simple examples of this include the velocity vector in Euclidean space, which is the tangent vector of the position vector (considered as a function of time). Also, the acceleration is the tangent vector of the velocity.
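A minimal numerical sketch of the velocity example, assuming a unit-circle position vector (an illustration, not from the article):

```python
import numpy as np

# Position vector r(t) = (cos t, sin t); its derivative with respect to the
# scalar t is the velocity (tangent) vector (-sin t, cos t).
def r(t):
    return np.array([np.cos(t), np.sin(t)])

def tangent(r, t, h=1e-6):
    # Central difference of each vector component with respect to t.
    return (r(t + h) - r(t - h)) / (2*h)

v = tangent(r, 0.5)   # close to [-sin(0.5), cos(0.5)]
```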
Scalar-by-vector
In vector calculus the gradient of a scalar field y, in the space ℝⁿ whose independent coordinates are the components of x, is the derivative of a scalar by a vector. In numerator layout this derivative is the row vector

$$\frac{\partial y}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial y}{\partial x_1} & \dfrac{\partial y}{\partial x_2} & \cdots & \dfrac{\partial y}{\partial x_n} \end{bmatrix},$$

while the gradient ∇y is the corresponding column vector. In physics, the electric field is the negative vector gradient of the electric potential.
The directional derivative of a scalar function f(x) of the space vector x in the direction of the unit vector u is defined using the gradient as follows:

$$\nabla_{\mathbf{u}} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u}.$$
Using the notation just defined for the derivative of a scalar with respect to a vector, we can re-write the directional derivative as

$$\nabla_{\mathbf{u}} f = \frac{\partial f}{\partial \mathbf{x}}\,\mathbf{u}.$$

This type of notation will be nice when proving product rules and chain rules that come out looking similar to what we are familiar with for the scalar derivative.
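A quick numerical sketch of the directional derivative (illustrative function, not from the article), comparing ∇f·u with a finite difference along u:

```python
import numpy as np

def f(x):                                    # illustrative scalar field
    return x[0]**2 + 3*x[1]

grad_f = lambda x: np.array([2*x[0], 3.0])   # analytic gradient of f

x = np.array([1.0, 2.0])
u = np.array([3.0, 4.0]) / 5.0               # unit direction vector

h = 1e-6
via_gradient = grad_f(x) @ u                       # grad f . u
via_limit = (f(x + h*u) - f(x - h*u)) / (2*h)      # finite difference along u
```

Both quantities agree (here both equal 2·0.6 + 3·0.8 = 3.6 up to discretization error).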
Vector-by-vector
Each of the previous two cases can be considered as an application of the derivative of a vector with respect to a
vector, using a vector of size one appropriately. Similarly we will find that the derivatives involving matrices will
reduce to derivatives involving vectors in a corresponding way.
In vector calculus, the derivative of a vector function y (a vector whose components are functions) with respect to an independent vector x, whose components represent a space, is known as the pushforward or differential, or the Jacobian matrix. In numerator layout, for y of size m and x of size n,

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_m}{\partial x_1} & \cdots & \dfrac{\partial y_m}{\partial x_n} \end{bmatrix}.$$
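A finite-difference sketch of the Jacobian in numerator layout (illustrative function; the m×n shape follows y in rows and x in columns):

```python
import numpy as np

# y(x) maps R^3 -> R^2; its numerator-layout derivative is the 2x3 Jacobian
# J[i, j] = dy_i / dx_j.  The denominator layout is the 3x2 transpose.
def y(x):
    return np.array([x[0]*x[1], np.sin(x[2])])

def jacobian(y, x, h=1e-6):
    m, n = len(y(x)), len(x)
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (y(x + e) - y(x - e)) / (2*h)   # column j: d y / d x_j
    return J

x = np.array([1.0, 2.0, 0.0])
J = jacobian(y, x)    # analytically [[x2, x1, 0], [0, 0, cos(x3)]] = [[2,1,0],[0,0,1]]
```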
Matrix-by-scalar
The derivative of a matrix function Y by a scalar x is known as the tangent matrix and is given (in numerator layout notation) by

$$\frac{\partial \mathbf{Y}}{\partial x} = \begin{bmatrix} \dfrac{\partial y_{11}}{\partial x} & \cdots & \dfrac{\partial y_{1n}}{\partial x} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_{m1}}{\partial x} & \cdots & \dfrac{\partial y_{mn}}{\partial x} \end{bmatrix}.$$
Scalar-by-matrix
The derivative of a scalar function y of a p×q matrix X of independent variables, with respect to the matrix X, is given (in numerator layout notation) by the q×p matrix with entries

$$\left[\frac{\partial y}{\partial \mathbf{X}}\right]_{ij} = \frac{\partial y}{\partial x_{ji}}.$$

Notice that the indexing of the gradient with respect to X is transposed as compared with the indexing of X. Important examples of scalar functions of matrices include the trace of a matrix and the determinant.
In analog with vector calculus this derivative is often written as the following:

$$\nabla_{\mathbf{X}} y(\mathbf{X}) = \frac{\partial y(\mathbf{X})}{\partial \mathbf{X}}.$$

Also in analog with vector calculus, the directional derivative of a scalar f(X) of a matrix X in the direction of matrix Y is given by

$$\nabla_{\mathbf{Y}} f = \operatorname{tr}\!\left( \frac{\partial f}{\partial \mathbf{X}}\, \mathbf{Y} \right).$$
It is the gradient matrix, in particular, that finds many uses in minimization problems in estimation theory,
particularly in the derivation of the Kalman filter algorithm, which is of great importance in the field.
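To make the scalar-by-matrix gradient concrete, here is a small numerical check (an illustrative sketch, not from the article) of the well-known identity ∂ tr(AX)/∂X = Aᵀ in denominator layout:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))

def f(X):
    return np.trace(A @ X)       # scalar function of the matrix X

# Denominator-layout numerical derivative: G[i, j] = df / dX[i, j].
def matrix_gradient(f, X, h=1e-6):
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2*h)
    return G

X = rng.normal(size=(3, 3))
G = matrix_gradient(f, X)        # close to A.T; numerator layout would give A
```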
Matrix-by-matrix
The derivative of a matrix function F by a matrix X collects one block per entry of X; note that each block is a p×q matrix defined as above. Note also that this arrangement has its indexing transposed: m rows and n columns. The pushforward along F of an n×m matrix Y in M(n,m) is then
Note that this definition encompasses all of the preceding definitions as special cases.
According to Jan R. Magnus and Heinz Neudecker, the following notations are both unsuitable, as the determinant of
the second resulting matrix would have "no interpretation" and "a useful chain rule does not exist" if these notations
are being used:[2]
Given , a differentiable function of an matrix ,
Layout conventions
This section discusses the similarities and differences between notational conventions that are used in the various
fields that take advantage of matrix calculus. Although there are largely two consistent conventions, some authors
find it convenient to mix the two conventions in forms that are discussed below. After this section equations will be
listed in both competing forms separately.
The fundamental issue is that the derivative of a vector with respect to a vector, i.e. ∂y/∂x, is often written in two
competing ways. If the numerator y is of size m and the denominator x of size n, then the result can be laid out as
either an m×n matrix or n×m matrix, i.e. the elements of y laid out in columns and the elements of x laid out in rows,
or vice-versa. This leads to the following possibilities:
1. Numerator layout, i.e. lay out according to y and xT (i.e. contrarily to x). This is sometimes known as the
Jacobian formulation.
2. Denominator layout, i.e. lay out according to yT and x (i.e. contrarily to y). This is sometimes known as the
Hessian formulation. Some authors term this layout the gradient, in distinction to the Jacobian (numerator
layout), which is its transpose. (However, "gradient" more commonly means the derivative regardless of
layout.)
3. A third possibility sometimes seen is to insist on writing the derivative as ∂y/∂x′ (i.e. the derivative is taken with
respect to the transpose of x) and follow the numerator layout. This makes it possible to claim that the matrix is
laid out according to both numerator and denominator. In practice this produces results the same as the numerator
layout.
When handling the gradient (the scalar-by-vector derivative) and the opposite vector-by-scalar case we have the same issues. To be consistent, we should do one of the following:
1. If we choose numerator layout for the scalar-by-vector derivative ∂y/∂x, we should lay out the gradient ∇y as a row vector, and the vector-by-scalar derivative ∂y/∂x as a column vector.
2. If we choose denominator layout for the scalar-by-vector derivative, we should lay out the gradient ∇y as a column vector, and the vector-by-scalar derivative as a row vector.
3. In the third possibility above, we write ∂y/∂x′ and ∂y/∂x, and use numerator layout.
Not all math textbooks and papers are consistent in this respect throughout the entire paper. That is, sometimes
different conventions are used in different contexts within the same paper. For example, some choose denominator
layout for gradients (laying them out as column vectors), but numerator layout for the vector-by-vector derivative
Similarly, when it comes to scalar-by-matrix derivatives ∂y/∂X and matrix-by-scalar derivatives ∂Y/∂x, consistent numerator layout lays out according to Y and Xᵀ, while consistent denominator layout lays out according to Yᵀ and X. In practice, however, following a denominator layout for ∂Y/∂x and laying the result out according to Yᵀ is rarely seen because it makes for ugly formulas that do not correspond to the scalar formulas. As a result, the following layouts can often be found:
1. Consistent numerator layout, which lays out ∂Y/∂x according to Y and ∂y/∂X according to Xᵀ.
2. Mixed layout, which lays out ∂Y/∂x according to Y and ∂y/∂X according to X.
3. Use the notation ∂y/∂X′, with results the same as consistent numerator layout.
In the following formulas, we handle the five possible combinations (scalar-by-vector ∂y/∂x, vector-by-scalar ∂y/∂x, vector-by-vector ∂y/∂x, scalar-by-matrix ∂y/∂X, and matrix-by-scalar ∂Y/∂x) separately. We
also handle cases of scalar-by-scalar derivatives that involve an intermediate vector or matrix. (This can arise, for
example, if a multi-dimensional parametric curve is defined in terms of a scalar variable, and then a derivative of a
scalar function of the curve is taken with respect to the scalar that parameterizes the curve.) For each of the various
combinations, we give numerator-layout and denominator-layout results, except in the cases above where
denominator layout rarely occurs. In cases involving matrices where it makes sense, we give numerator-layout and
mixed-layout results. As noted above, cases where vector and matrix denominators are written in transpose notation
are equivalent to numerator layout with the denominators written without the transpose.
Keep in mind that various authors use different combinations of numerator and denominator layouts for different
types of derivatives, and there is no guarantee that an author will consistently use either numerator or denominator
layout for all types. Match up the formulas below with those quoted in the source to determine the layout used for
that particular type of derivative, but be careful not to assume that derivatives of other types necessarily follow the
same kind of layout.
When taking derivatives with an aggregate (vector or matrix) denominator in order to find a maximum or minimum
of the aggregate, it should be kept in mind that using numerator layout will produce results that are transposed with
respect to the aggregate. For example, in attempting to find the maximum likelihood estimate of a multivariate
normal distribution using matrix calculus, if the domain is a k×1 column vector, then the result using the numerator
layout will be in the form of a 1×k row vector. Thus, either the results should be transposed at the end or the
denominator layout (or mixed layout) should be used.
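A toy illustration of this shape issue (an assumed objective, not from the article): maximizing f(x) = bᵀx − ½xᵀx over a k×1 vector, the numerator-layout derivative is a 1×k row vector and must be transposed before a gradient-style update:

```python
import numpy as np

k = 4
b = np.arange(1.0, k + 1)                 # illustrative constants

def f(x):
    return b @ x - 0.5 * x @ x            # toy objective, maximized at x = b

x = np.zeros(k)
deriv_numerator = (b - x).reshape(1, k)   # numerator layout: 1 x k row vector
x = x + 0.5 * deriv_numerator.T.ravel()   # transpose back to update the k x 1 vector
```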
Result of differentiating with respect to a vector x of size n:
• a scalar y: size-n row vector (numerator layout) or size-n column vector (denominator layout);
• a vector y of size m: m×n matrix (numerator layout) or n×m matrix (denominator layout);
• a matrix Y: not neatly representable as a matrix (a tensor of rank higher than 2).
The results of operations will be transposed when switching between numerator-layout and denominator-layout notation.
Numerator-layout notation
Using numerator-layout notation, we have:[1]

$$\frac{\partial y}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y}{\partial x_1} & \cdots & \frac{\partial y}{\partial x_n} \end{bmatrix}, \quad
\frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x} \\ \vdots \\ \frac{\partial y_m}{\partial x} \end{bmatrix}, \quad
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}, \quad
\left[\frac{\partial y}{\partial \mathbf{X}}\right]_{ij} = \frac{\partial y}{\partial x_{ji}}.$$
Denominator-layout notation
Using denominator-layout notation, we have:[3][4]

$$\frac{\partial y}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{bmatrix}, \quad
\frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x} & \cdots & \frac{\partial y_m}{\partial x} \end{bmatrix}, \quad
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}, \quad
\left[\frac{\partial y}{\partial \mathbf{X}}\right]_{ij} = \frac{\partial y}{\partial x_{ij}}.$$
Identities
As noted above, in general, the results of operations will be transposed when switching between numerator-layout
and denominator-layout notation.
To help make sense of all the identities below, keep in mind the most important rules: the chain rule, product rule
and sum rule. The sum rule applies universally, and the product rule applies in most of the cases below, provided that
the order of matrix products is maintained, since matrix products are not commutative. The chain rule applies in
some of the cases, but unfortunately does not apply in matrix-by-scalar derivatives or scalar-by-matrix derivatives
(in the latter case, mostly involving the trace operator applied to matrices). In the latter case, the product rule can't
quite be applied directly, either, but the equivalent can be done with a bit more work using the differential identities.
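The chain rule for vector-by-vector derivatives (in numerator layout, Jacobians multiply with the outermost function on the left) can be spot-checked with finite differences; the functions below are illustrative, not from the article:

```python
import numpy as np

def y(x):                        # inner function, R^2 -> R^2
    return np.array([x[0]**2, x[0] + x[1]])

def z(y):                        # outer function, R^2 -> R^2
    return np.array([np.sin(y[0]), y[0]*y[1]])

def jac(f, x, h=1e-6):
    """Numerator-layout Jacobian by central differences."""
    fx = f(x)
    J = np.zeros((len(fx), len(x)))
    for j in range(len(x)):
        e = np.zeros(len(x))
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2*h)
    return J

x = np.array([0.7, -1.2])
lhs = jac(lambda t: z(y(t)), x)      # dz/dx computed directly
rhs = jac(z, y(x)) @ jac(y, x)       # (dz/dy)(dy/dx): order matters
```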
Vector-by-vector identities
This is presented first because all of the operations that apply to vector-by-vector differentiation apply directly to
vector-by-scalar or scalar-by-vector differentiation simply by reducing the appropriate vector in the numerator or
denominator to a scalar.
Identities: vector-by-vector ∂y/∂x

Condition                            Expression       Numerator layout (by y and xᵀ)   Denominator layout (by yᵀ and x)
a is not a function of x             ∂a/∂x            0                                0
A is not a function of x             ∂(Ax)/∂x         A                                Aᵀ
A is not a function of x             ∂(xᵀA)/∂x        Aᵀ                               A
a is not a function of x, u = u(x)   ∂(au)/∂x         a ∂u/∂x                          a ∂u/∂x
A is not a function of x, u = u(x)   ∂(Au)/∂x         A ∂u/∂x                          (∂u/∂x) Aᵀ
u = u(x), v = v(x)                   ∂(u + v)/∂x      ∂u/∂x + ∂v/∂x                    ∂u/∂x + ∂v/∂x
u = u(x)                             ∂g(u)/∂x         (∂g(u)/∂u)(∂u/∂x)                (∂u/∂x)(∂g(u)/∂u)
u = u(x)                             ∂f(g(u))/∂x      (∂f(g)/∂g)(∂g(u)/∂u)(∂u/∂x)      (∂u/∂x)(∂g(u)/∂u)(∂f(g)/∂g)
Scalar-by-vector identities
The fundamental identities are placed above the thick black line.
Identities: scalar-by-vector
Columns: Condition; Expression; Numerator layout (i.e. by xᵀ; result is a row vector); Denominator layout (i.e. by x; result is a column vector)
a is not a function of
x,
u = u(x)
u = u(x), v = v(x)
u = u(x), v = v(x)
u = u(x)
u = u(x)
u = u(x), v = v(x)
u = u(x), v = v(x),
A is not a function of
x   • assumes numerator layout of ∂u/∂x, ∂v/∂x   • assumes denominator layout of ∂u/∂x, ∂v/∂x
a is not a function of
x
A is not a function of
x
b is not a function of
x
A is not a function of
x
A is not a function of
x
A is symmetric
A is not a function of
x
A is not a function of
x
A is symmetric
a is not a function of
x,
u = u(x)
• assumes numerator layout of ∂u/∂x   • assumes denominator layout of ∂u/∂x
A, b, C, D, e are not
functions of x
a is not a function of
x
Vector-by-scalar identities
Identities: vector-by-scalar
Columns: Condition; Expression; Numerator layout (i.e. by y; result is a column vector); Denominator layout (i.e. by yᵀ; result is a row vector)
a is not a function of x
a is not a function of
x,
u = u(x)
A is not a function of
x,
u = u(x)
u = u(x)
u = u(x), v = v(x)
u = u(x), v = v(x)
u = u(x)
u = u(x)
NOTE: The formulas involving the vector-by-vector derivatives and (whose outputs are matrices)
assume the matrices are laid out consistent with the vector layout, i.e. numerator-layout matrix when
numerator-layout vector and vice-versa; otherwise, transpose the vector-by-vector derivatives.
Scalar-by-matrix identities
Note that exact equivalents of the scalar product rule and chain rule do not exist when applied to matrix-valued
functions of matrices. However, the product rule of this sort does apply to the differential form (see below), and this
is the way to derive many of the identities below involving the trace function, combined with the fact that the trace
function allows transposing and cyclic permutation, i.e.

$$\operatorname{tr}(\mathbf{A}) = \operatorname{tr}(\mathbf{A}^{\mathsf T}), \qquad \operatorname{tr}(\mathbf{ABCD}) = \operatorname{tr}(\mathbf{BCDA}) = \operatorname{tr}(\mathbf{CDAB}) = \operatorname{tr}(\mathbf{DABC}).$$

Therefore, for example (in denominator layout),

$$\frac{\partial \operatorname{tr}(\mathbf{AXB})}{\partial \mathbf{X}} = \frac{\partial \operatorname{tr}(\mathbf{BAX})}{\partial \mathbf{X}} = (\mathbf{BA})^{\mathsf T} = \mathbf{A}^{\mathsf T}\mathbf{B}^{\mathsf T}.$$
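The transpose and cyclic-permutation properties of the trace are easy to spot-check numerically (an illustrative sketch with random matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C = (rng.normal(size=(4, 4)) for _ in range(3))

t1 = np.trace(A @ B @ C)
t2 = np.trace(B @ C @ A)       # cyclic permutation
t3 = np.trace(C @ A @ B)       # cyclic permutation
t4 = np.trace((A @ B @ C).T)   # transposing leaves the trace unchanged
```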
Identities: scalar-by-matrix
Columns: Condition; Expression; Numerator layout (i.e. by Xᵀ); Denominator layout (i.e. by X)
a is not a function of
X, u = u(X)
u = u(X), v = v(X)
u = u(X), v = v(X)
u = u(X)
u = u(X)
U = U(X)
[7]
U = U(X), V = V(X)
a is not a function of
X,
U = U(X)
g(X) is any
polynomial with
scalar coefficients,
or any matrix
function defined by
an infinite
polynomial series
(e.g. eX, sin(X),
cos(X), ln(X), etc.
using a Taylor
series); g(x) is the
equivalent scalar
function, g′(x) is its
derivative, and
g′(X) is the
corresponding
matrix function
A is not a function
[7]
of X
A is not a function
[7]
of X
A, B are not
functions of X
A, B, C are not
functions of X
n is a positive
[7]
integer
A is not a function
[7]
of X,
n is a positive
integer
[7]
[7]
[9]
a is not a function of
[7] [10]
X
A, B are not
[7]
functions of X
n is a positive
[7]
integer
(see pseudo-inverse)
[7]
(see pseudo-inverse)
[7]
A is not a function
of X,
X is square and
invertible
A is not a function
of X,
X is non-square,
A is symmetric
A is not a function
of X,
X is non-square,
A is non-symmetric
Matrix-by-scalar identities
Identities: matrix-by-scalar
Condition Expression Numerator layout, i.e. by Y
U = U(x)
U = U(x), V = V(x)
U = U(x), V = V(x)
U = U(x), V = V(x)
U = U(x), V = V(x)
U = U(x)
U = U(x,y)
A is not a function of x
Scalar-by-scalar identities
u = u(x)
u = u(x), v = v(x)
U = U(x)
U = U(x)
U = U(x)
U = U(x)
A is not a function of x
A is not a function of X
a is not a function of X
(Kronecker product)
(Hadamard product)
(conjugate transpose)
To convert to normal derivative form, first convert it to one of the following canonical forms, and then use these
identities:
Notes
[1] Minka, Thomas P. "Old and New Matrix Algebra Useful for Statistics." December 28, 2000. (http://research.microsoft.com/en-us/um/people/minka/papers/matrix/)
[2] Magnus, Jan R.; Neudecker, Heinz (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley Series in
Probability and Statistics (2nd ed.). Wiley. pp. 171–173.
[3] (http://www.colorado.edu/engineering/CAS/courses.d/IFEM.d/IFEM.AppD.pdf)
[4] Dattorro, Jon (2005). Convex Optimization & Euclidean Distance Geometry, Appendix D. Meßoo Publishing USA. Version v2010.01.05.
[5] Here, 0 refers to a column vector of all 0's, of size n, where n is the length of x.
[6] Here, 0 refers to a matrix of all 0's, of the same shape as X.
[7] Petersen, Kaare Brandt and Michael Syskind Pedersen. The Matrix Cookbook. November 14, 2008. http://matrixcookbook.com.
[8] Duchi, John C. "Properties of the Trace and Matrix Derivatives" (http://www.cs.berkeley.edu/~jduchi/projects/matrix_prop.pdf). University of California at Berkeley. Retrieved 19 July 2011.
[9] See Determinant#Derivative for the derivation.
[10] The constant a disappears in the result. This is intentional: since ln(au) = ln(a) + ln(u), a multiplicative constant inside a logarithm does not affect the derivative.
External links
• Linear Algebra: Determinants, Inverses, Rank (http://www.colorado.edu/engineering/cas/courses.d/IFEM.
d/IFEM.AppD.d/IFEM.AppD.pdf) appendix D from Introduction to Finite Element Methods book on
University of Colorado at Boulder. Uses the Hessian (transpose to Jacobian) definition of vector and matrix
derivatives.
• Matrix Reference Manual (http://www.psi.toronto.edu/matrix/calculus.html), Mike Brookes, Imperial
College London.
• The Matrix Cookbook (http://matrixcookbook.com), with a derivatives chapter. Uses the Hessian definition.
• Linear Algebra and its Applications (http://www.wiley.com/WileyCDA/WileyTitle/
productCd-0471751561,descCd-authorInfo.html) (author information page; see Chapter 9 of book), Peter Lax,
Courant Institute.
• Matrix Differentiation (and some other stuff) (http://www.atmos.washington.edu/~dennis/MatrixCalculus.
pdf), Randal J. Barnes, Department of Civil Engineering, University of Minnesota.
• Notes on Matrix Calculus (http://www4.ncsu.edu/~pfackler/MatCalc.pdf), Paul L. Fackler, North Carolina
State University.
Maximum likelihood
In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical
model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates
for the model's parameters.
The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example,
one may be interested in the heights of adult female giraffes, but be unable to measure the height of every single
giraffe in a population due to cost or time constraints. Assuming that the heights are normally (Gaussian) distributed
with some unknown mean and variance, the mean and variance can be estimated with MLE while only knowing the
heights of some sample of the overall population. MLE would accomplish this by taking the mean and variance as
parameters and finding particular parametric values that make the observed results the most probable (given the
model).
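For the normal model in this example the MLE is available in closed form: the sample mean and the (biased, divisor-n) sample variance. A small simulation sketch with assumed numbers, not data from the article:

```python
import numpy as np

# Simulated "giraffe height" sample from a normal distribution (illustrative).
rng = np.random.default_rng(42)
heights = rng.normal(loc=4.3, scale=0.5, size=200)

mu_hat = heights.mean()                        # MLE of the mean
sigma2_hat = np.mean((heights - mu_hat)**2)    # MLE of the variance (divisor n, not n-1)

def log_likelihood(mu, sigma2, x):
    # Normal log-likelihood: -n/2 ln(2*pi*sigma2) - sum((x-mu)^2)/(2*sigma2)
    return -0.5 * len(x) * np.log(2 * np.pi * sigma2) - np.sum((x - mu)**2) / (2 * sigma2)
```

By construction, no other choice of (mu, sigma2) gives the observed sample a higher log-likelihood than (mu_hat, sigma2_hat).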
In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects values
of the model parameters that produce a distribution that gives the observed data the greatest probability (i.e.,
parameters that maximize the likelihood function). Maximum-likelihood estimation gives a unified approach to
estimation, which is well-defined in the case of the normal distribution and many other problems. However, in some
complicated problems, difficulties do occur: in such problems, maximum-likelihood estimators are unsuitable or do
not exist.
Principles
Suppose there is a sample x1, x2, …, xn of n independent and identically distributed observations, coming from a
distribution with an unknown pdf ƒ0(·). It is however surmised that the function ƒ0 belongs to a certain family of
distributions { ƒ(·|θ), θ ∈ Θ }, called the parametric model, so that ƒ0 = ƒ(·|θ0). The value θ0 is unknown and is
referred to as the "true value" of the parameter. It is desirable to find an estimator which would be as close to the
true value θ0 as possible. Both the observed variables xi and the parameter θ can be vectors.
To use the method of maximum likelihood, one first specifies the joint density function for all observations. For an i.i.d. sample, this joint density function is

$$f(x_1, x_2, \ldots, x_n \mid \theta) = f(x_1 \mid \theta)\, f(x_2 \mid \theta) \cdots f(x_n \mid \theta).$$

Now we look at this function from a different perspective by considering the observed values x1, x2, ..., xn to be fixed "parameters" of this function, whereas θ will be the function's variable and allowed to vary freely; this function will be called the likelihood:

$$\mathcal{L}(\theta \mid x_1, \ldots, x_n) = f(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta).$$
In practice it is often more convenient to work with the logarithm of the likelihood function, called the log-likelihood, or its average value, the average log-likelihood:

$$\ln \mathcal{L}(\theta \mid x_1, \ldots, x_n) = \sum_{i=1}^n \ln f(x_i \mid \theta), \qquad \hat{\ell}(\theta \mid x) = \frac{1}{n} \ln \mathcal{L}.$$

The hat over ℓ indicates that it is akin to some estimator. Indeed, ℓ̂ estimates the expected log-likelihood of a single observation in the model.
The method of maximum likelihood estimates θ0 by finding a value of θ that maximizes ℓ̂(θ | x). This method of estimation defines a maximum-likelihood estimator (MLE) of θ0,

$$\hat{\theta}_{\mathrm{mle}} \in \operatorname*{arg\,max}_{\theta \in \Theta} \hat{\ell}(\theta \mid x_1, \ldots, x_n),$$

if any maximum exists. An MLE estimate is the same regardless of whether we maximize the likelihood or the log-likelihood function, since log is a monotone transformation.
For many models, a maximum likelihood estimator can be found as an explicit function of the observed data x1, …,
xn. For many other models, however, no closed-form solution to the maximization problem is known or available,
and an MLE has to be found numerically using optimization methods. For some problems, there may be multiple
estimates that maximize the likelihood. For other problems, no maximum likelihood estimate exists (meaning that
the log-likelihood function increases without attaining the supremum value).
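As a sketch of numerical maximization (an assumed exponential model and a crude grid search, standing in for a real optimizer), one can compare a brute-force maximizer of the log-likelihood with the known closed form λ̂ = 1/x̄:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=500)     # simulated data, true rate lambda = 0.5

def log_likelihood(lam, x):
    # Exponential model f(x | lambda) = lambda * exp(-lambda * x)
    return len(x) * np.log(lam) - lam * x.sum()

grid = np.linspace(0.01, 2.0, 20000)
lam_numeric = grid[np.argmax(log_likelihood(grid, x))]   # grid-search maximizer
lam_closed = 1.0 / x.mean()                              # known closed-form MLE
```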
In the exposition above, it is assumed that the data are independent and identically distributed. The method can be
applied however to a broader setting, as long as it is possible to write the joint density function ƒ(x1,…,xn | θ), and its
parameter θ has a finite dimension which does not depend on the sample size n. In a simpler extension, an allowance
can be made for data heterogeneity, so that the joint density is equal to ƒ1(x1|θ) · ƒ2(x2|θ) · … · ƒn(xn|θ). In the more
complicated case of time series models, the independence assumption may have to be dropped as well.
A maximum likelihood estimator coincides with the most probable Bayesian estimator given a uniform prior
distribution on the parameters.
Properties
A maximum-likelihood estimator is an extremum estimator obtained by maximizing, as a function of θ, the objective function

$$\hat{\ell}(\theta \mid x) = \frac{1}{n} \sum_{i=1}^n \ln f(x_i \mid \theta),$$

this being the sample analogue of the expected log-likelihood $\ell(\theta) = \operatorname{E}[\ln f(x_i \mid \theta)]$, where this expectation is taken with respect to the true density f(·|θ0).
Maximum-likelihood estimators have no optimum properties for finite samples, in the sense that (when evaluated on
finite samples) other estimators have greater concentration around the true parameter-value.[1] However, like other
estimation methods, maximum-likelihood estimation possesses a number of attractive limiting properties: As the
sample-size increases to infinity, sequences of maximum-likelihood estimators have these properties:
• Consistency: a subsequence of the sequence of MLEs converges in probability to the value being estimated.
• Asymptotic normality: as the sample size increases, the distribution of the MLE tends to the Gaussian distribution
with mean and covariance matrix equal to the inverse of the Fisher information matrix.
• Efficiency, i.e., it achieves the Cramér–Rao lower bound when the sample size tends to infinity. This means that
no asymptotically unbiased estimator has lower asymptotic mean squared error than the MLE (or other estimators
attaining this bound).
• Second-order efficiency after correction for bias.
Consistency
Under the conditions outlined below, the maximum likelihood estimator is consistent. Consistency means that given a sufficiently large number of observations n, it is possible to find the value of θ0 with arbitrary precision. In mathematical terms this means that as n goes to infinity the estimator converges in probability to its true value:

$$\hat{\theta}_{\mathrm{mle}} \ \xrightarrow{\ p\ }\ \theta_0.$$

Under slightly stronger conditions, the estimator converges almost surely (or strongly) to θ0:

$$\hat{\theta}_{\mathrm{mle}} \ \xrightarrow{\ \mathrm{a.s.}\ }\ \theta_0.$$

To establish consistency, the following conditions are sufficient:
1. Identification of the model:

$$\theta \neq \theta_0 \quad \Leftrightarrow \quad f(\cdot \mid \theta) \neq f(\cdot \mid \theta_0).$$
In other words, different parameter values θ correspond to different distributions within the model. If this
condition did not hold, there would be some value θ1 such that θ0 and θ1 generate an identical distribution of the
observable data. Then we wouldn't be able to distinguish between these two parameters even with an infinite
amount of data — these parameters would have been observationally equivalent.
The identification condition is essential for the ML estimator to be consistent. When this condition
holds, the limiting likelihood function ℓ(θ|·) has a unique global maximum at θ0.
2. Compactness: the parameter space Θ of the model is compact.
The identification condition establishes that the log-likelihood has a unique global maximum. Compactness implies
that the likelihood cannot approach the maximum value arbitrarily closely at some other point (as demonstrated, for
example, in the picture on the right). Compactness is only a sufficient condition and not a necessary condition; it
can be replaced by some other conditions, such as:
• both concavity of the log-likelihood function and compactness of some (nonempty) upper level sets of the
log-likelihood function, or
• existence of a compact neighborhood N of θ0 such that outside of N the log-likelihood function is less than the
maximum by at least some ε > 0.
3. Continuity: the function ln f(x|θ) is continuous in θ for almost all values of x. The continuity here can be
replaced with a slightly weaker condition of upper semi-continuity.
4. Dominance: there exists an integrable function D(x) such that |ln f(x|θ)| < D(x) for all θ ∈ Θ.
By the uniform law of large numbers, the dominance condition together with continuity establish the uniform
convergence in probability of the log-likelihood:
supθ∈Θ |ℓ̂(θ|x) − ℓ(θ)| →p 0.
The dominance condition can be employed in the case of i.i.d. observations. In the non-i.i.d. case the uniform
convergence in probability can be checked by showing that the sequence is stochastically equicontinuous.
If one wants to demonstrate that the ML estimator converges to θ0 almost surely, then a stronger condition of
uniform convergence almost surely has to be imposed:
supθ∈Θ |ℓ̂(θ|x) − ℓ(θ)| → 0 almost surely.
Asymptotic normality
Maximum-likelihood estimators can lack asymptotic normality and can be inconsistent if there is a failure of one (or
more) of the regularity conditions below:
Estimate on boundary. Sometimes the maximum likelihood estimate lies on the boundary of the set of possible
parameters, or (if the boundary is not, strictly speaking, allowed) the likelihood gets larger and larger as the
parameter approaches the boundary. Standard asymptotic theory needs the assumption that the true parameter value
lies away from the boundary. If we have enough data, the maximum likelihood estimate will keep away from the
boundary too. But with smaller samples, the estimate can lie on the boundary. In such cases, the asymptotic theory
clearly does not give a practically useful approximation. Examples here would be variance-component models,
where each component of variance, σ2, must satisfy the constraint σ2 ≥0.
Data boundary parameter-dependent. For the theory to apply in a simple way, the set of data values which has
positive probability (or positive probability density) should not depend on the unknown parameter. A simple
example where such parameter-dependence does hold is the case of estimating θ from a set of independent
identically distributed observations when the common distribution is uniform on the range (0,θ). For estimation
purposes the relevant range of θ is such that θ cannot be less than the largest observation. Because the interval
(0,θ) is not compact, there exists no maximum for the likelihood function: for any estimate of θ, there exists a greater
estimate that also has greater likelihood. In contrast, the interval [0,θ] includes the end-point θ and is compact, in
which case the maximum-likelihood estimator exists. However, in this case, the maximum-likelihood estimator is
biased. Asymptotically, this maximum-likelihood estimator is not normally distributed.[3]
Nuisance parameters. For maximum likelihood estimations, a model may have a number of nuisance parameters.
For the asymptotic behaviour outlined to hold, the number of nuisance parameters should not increase with the
number of observations (the sample size). A well-known example of this case is where observations occur as pairs,
where the observations in each pair have a different (unknown) mean but otherwise the observations are independent
and Normally distributed with a common variance. Here for 2N observations, there are N+1 parameters. It is well
known that the maximum likelihood estimate for the variance does not converge to the true value of the variance.
Increasing information. For the asymptotics to hold in cases where the assumption of independent identically
distributed observations does not hold, a basic requirement is that the amount of information in the data increases
indefinitely as the sample size increases. Such a requirement may not be met if either there is too much dependence
in the data (for example, if new observations are essentially identical to existing observations), or if new independent
observations are subject to an increasing observation error.
Some regularity conditions which ensure this behavior are:
1. The first and second derivatives of the log-likelihood function must be defined.
2. The Fisher information matrix must not be zero, and must be continuous as a function of the parameter.
3. The maximum likelihood estimator is consistent.
Suppose that the conditions for consistency of the maximum likelihood estimator are satisfied, and[4]
1. θ0 ∈ interior(Θ);
2. f(x|θ) > 0 and is twice continuously differentiable in θ in some neighborhood N of θ0;
3. ∫ supθ∈N||∇θf(x|θ)||dx < ∞, and ∫ supθ∈N||∇θθf(x|θ)||dx < ∞;
4. I = E[∇θlnf(x|θ0) ∇θlnf(x|θ0)′] exists and is nonsingular;
5. E[ supθ∈N||∇θθlnf(x|θ)||] < ∞.
Then the maximum likelihood estimator is asymptotically normal: √n(θ̂mle − θ0) →d N(0, I⁻¹).
Proof sketch: since the log-likelihood is differentiable and θ0 lies in the interior of Θ, the maximum is
characterized by the first-order condition ∇θℓ̂(θ̂|x) = 0. When the log-likelihood is twice differentiable, this
expression can be expanded into a Taylor series around the point θ = θ0:
0 = ∇θℓ̂(θ0|x) + ∇θθℓ̂(θ̃|x) · (θ̂ − θ0),
where θ̃ is some point intermediate between θ0 and θ̂. From this expression we can derive that
√n(θ̂ − θ0) = [ −(1/n) ∇θθℓ̂(θ̃|x) ]⁻¹ · (1/√n) ∇θℓ̂(θ0|x).
Here the expression in square brackets converges in probability to H = E[−∇θθln f(x|θ0)] by the law of large
numbers. The continuous mapping theorem ensures that the inverse of this expression also converges in probability,
to H−1. The second sum, by the central limit theorem, converges in distribution to a multivariate normal with mean
zero and variance matrix equal to the Fisher information I. Thus, applying Slutsky's theorem to the whole
expression, we obtain that
√n(θ̂mle − θ0) →d N(0, H⁻¹ I H⁻¹).
Finally, the information equality guarantees that when the model is correctly specified, matrix H will be equal to the
Fisher information I, so that the variance expression simplifies to just I−1.
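The information equality can be illustrated numerically in the scalar Bernoulli case; this is a small sketch (not from the original article), checking that the expected negative second derivative of the log-likelihood, computed by finite differences, agrees with the closed form I(p) = 1/(p(1−p)).

```python
import math

def bernoulli_fisher_info(p, h=1e-5):
    """Numerically compute I(p) = E[-d^2/dp^2 ln f(x|p)] for a Bernoulli(p) model."""
    def loglik(x, q):
        return x * math.log(q) + (1 - x) * math.log(1 - q)
    info = 0.0
    for x, prob in ((1, p), (0, 1 - p)):
        # central finite-difference approximation of the second derivative
        d2 = (loglik(x, p + h) - 2 * loglik(x, p) + loglik(x, p - h)) / h**2
        info += prob * (-d2)
    return info

p = 0.3
numeric = bernoulli_fisher_info(p)
analytic = 1.0 / (p * (1 - p))  # closed form: I(p) = 1/(p(1-p))
print(numeric, analytic)
```

For p = 0.3 the two values agree to several decimal places, consistent with H = I when the model is correctly specified.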
Functional invariance
The maximum likelihood estimator selects the parameter value which gives the observed data the largest possible
probability (or probability density, in the continuous case). If the parameter consists of a number of components,
then we define their separate maximum likelihood estimators as the corresponding components of the MLE of the
complete parameter. Consistent with this, if θ̂mle is the MLE for θ, and if g(θ) is any transformation of θ, then the
MLE for α = g(θ) is by definition
α̂mle = g(θ̂mle).
The MLE is also invariant with respect to certain transformations of the data. If Y = g(X) where g is one-to-one and
does not depend on the parameters to be estimated, then the density functions satisfy
fY(y) = fX(g⁻¹(y)) · |d g⁻¹(y) / dy|,
and hence the likelihood functions for X and Y differ only by a factor that does not depend on the model parameters.
For example, the MLE parameters of the log-normal distribution are the same as those of the normal distribution
fitted to the logarithm of the data.
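The log-normal/normal statement can be checked numerically; a minimal sketch with made-up data (the function and variable names are illustrative): the normal MLE fitted to the logarithms of the data also maximizes the log-normal likelihood of the original data.

```python
import math
import random

random.seed(0)
# hypothetical data: draws from a log-normal with mu = 1.0, sigma = 0.5
data = [math.exp(random.gauss(1.0, 0.5)) for _ in range(500)]

logs = [math.log(x) for x in data]
n = len(logs)
mu_hat = sum(logs) / n                                  # normal MLE of the mean of log X
sigma2_hat = sum((y - mu_hat) ** 2 for y in logs) / n   # normal MLE of the variance (1/n denominator)

def lognormal_loglik(mu, sigma2):
    """Log-likelihood of the original data under LogNormal(mu, sigma2)."""
    return sum(-math.log(x) - 0.5 * math.log(2 * math.pi * sigma2)
               - (math.log(x) - mu) ** 2 / (2 * sigma2) for x in data)

best = lognormal_loglik(mu_hat, sigma2_hat)
# perturbing the parameters can only lower the log-normal likelihood
assert best >= lognormal_loglik(mu_hat + 0.05, sigma2_hat)
assert best >= lognormal_loglik(mu_hat, sigma2_hat * 1.1)
```

This works because the log-normal log-likelihood, viewed as a function of (μ, σ²), differs from the normal log-likelihood of the log-data only by a term that does not involve the parameters.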
Higher-order properties
The standard asymptotics says that the maximum-likelihood estimator is √n-consistent and asymptotically efficient,
meaning that it reaches the Cramér–Rao bound:
√n(θ̂mle − θ0) →d N(0, I⁻¹),
where I is the Fisher information matrix. In particular, it means that the bias of the maximum-likelihood estimator
is equal to zero up to the order n^(−1/2). However when we consider the higher-order terms in the expansion of the
distribution of this estimator, it turns out that θ̂mle has a bias of order n⁻¹. This bias is equal to
(componentwise)[5]
b_h ≡ E[(θ̂mle − θ0)_h] = (1/n) · I^{hi} I^{jk} ( ½ K_{ijk} + J_{j,ik} ),
where Einstein's summation convention over the repeated indices has been adopted; I^{jk} denotes the (j,k)-th
component of the inverse Fisher information matrix I⁻¹, K_{ijk} = E[∂³ln f(x|θ0) / ∂θi ∂θj ∂θk], and
J_{j,ik} = E[(∂ln f(x|θ0)/∂θj) · (∂²ln f(x|θ0)/∂θi ∂θk)].
Using these formulas it is possible to estimate the second-order bias of the maximum likelihood estimator, and to
correct for that bias by subtracting it:
θ̂*mle = θ̂mle − b̂.
This estimator is unbiased up to the terms of order n⁻¹, and is called the bias-corrected maximum likelihood
estimator.
This bias-corrected estimator is second-order efficient (at least within the curved exponential family), meaning that it
has minimal mean squared error among all second-order bias-corrected estimators, up to the terms of the order n⁻². It
is possible to continue this process, that is, to derive the third-order bias-correction term, and so on. However, as was
shown by Kano (1996), the maximum-likelihood estimator is not third-order efficient.
Examples
Consider, for example, a sequence of 80 Bernoulli trials (tosses of an unfair coin) resulting in 49 successes (heads)
and 31 failures (tails). If the probability of success p is restricted to the three values 1/3, 1/2 and 2/3, the
likelihood is maximized when p = 2/3, and so this is the maximum likelihood estimate for p.
If instead p may take any value in [0, 1], the likelihood function to be maximized is
L(p) = C(80, 49) p^49 (1 − p)^31,
and setting its derivative with respect to p equal to zero gives
p^48 (1 − p)^30 (49(1 − p) − 31p) = 0,
which has solutions p = 0, p = 1, and p = 49/80. The solution which maximizes the likelihood is clearly p = 49/80
(since p = 0 and p = 1 result in a likelihood of zero). Thus the maximum likelihood estimator for p is 49/80.
This result is easily generalized by substituting a letter such as t in the place of 49 to represent the observed number
of 'successes' of our Bernoulli trials, and a letter such as n in the place of 80 to represent the number of Bernoulli
trials. Exactly the same calculation yields the maximum likelihood estimator t / n for any sequence of n Bernoulli
trials resulting in t 'successes'.
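The closed-form result p̂ = t/n can also be checked numerically; this is an illustrative sketch (the grid search below is not part of the original derivation).

```python
import math

t, n = 49, 80  # observed successes and total number of Bernoulli trials

def log_likelihood(p):
    # log of C(n, t) * p^t * (1-p)^(n-t); the constant C(n, t) does not affect the argmax
    return t * math.log(p) + (n - t) * math.log(1 - p)

# grid search over the open interval (0, 1)
grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=log_likelihood)
print(p_hat)  # close to 49/80 = 0.6125
assert abs(p_hat - t / n) < 1e-3
```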
For the normal distribution N(μ, σ²), the corresponding probability density function for a sample of n independent
identically distributed normal random variables (the likelihood) is
f(x1, …, xn | μ, σ²) = (2πσ²)^(−n/2) exp( −Σi (xi − μ)² / (2σ²) ),
or more conveniently, taking logarithms:
ln f = −(n/2) ln(2πσ²) − Σi (xi − μ)² / (2σ²).
Maximizing over μ, the derivative condition Σi (xi − μ) = 0 is obtained, which is solved by
μ̂ = x̄ = (1/n) Σi xi.
This is indeed the maximum of the function, since it is the only turning point in μ and the second derivative is
strictly less than zero. Its expectation value is equal to the parameter μ of the given distribution,
E[μ̂] = μ,
which means that the maximum-likelihood estimator μ̂ is unbiased.
Similarly we differentiate the log-likelihood with respect to σ², which is solved by
σ̂² = (1/n) Σi (xi − μ)².
Inserting μ = μ̂ we obtain
σ̂² = (1/n) Σi (xi − x̄)² = (1/n) Σi xi² − (1/n²) Σi Σj xi xj.
To calculate its expected value, it is convenient to rewrite the expression in terms of zero-mean random variables
(statistical error) δi ≡ xi − μ. Expressing the estimate in these variables yields
σ̂² = (1/n) Σi δi² − (1/n²) Σi Σj δi δj.
Simplifying the expression above, utilizing the facts that E[δi] = 0 and E[δi²] = σ², allows us to obtain
E[σ̂²] = ((n − 1)/n) σ².
In this case the MLEs could be obtained individually. In general this may not be the case, and the MLEs would have
to be obtained simultaneously.
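The closed-form normal estimators above can be sketched in code (an illustration with synthetic data, not from the article); note the 1/n denominator in the variance estimate and its (n−1)/n bias.

```python
import random

random.seed(1)
mu_true, sigma_true, n = 2.0, 3.0, 10

def mle_normal(xs):
    """Maximum likelihood estimates for a normal sample: sample mean and 1/n variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)  # biased: E[v] = (n-1)/n * sigma^2
    return m, v

# average the variance estimate over many samples to see the (n-1)/n bias
trials = 20000
avg_v = 0.0
for _ in range(trials):
    xs = [random.gauss(mu_true, sigma_true) for _ in range(n)]
    avg_v += mle_normal(xs)[1]
avg_v /= trials
print(avg_v)  # close to (n-1)/n * sigma^2 = 0.9 * 9 = 8.1
```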
Non-independent variables
It may be the case that variables are correlated, that is, not independent. Two random variables X and Y are
independent if and only if their joint probability density function is the product of the individual probability
density functions, i.e.
fX,Y(x, y) = fX(x) fY(y).
Suppose one constructs an order-n Gaussian vector out of random variables (X1, …, Xn), where variable i has
mean μi. Furthermore, let the covariance matrix be denoted by Σ.
The joint probability density function of these n random variables is then given by:
f(x1, …, xn) = (2π)^(−n/2) det(Σ)^(−1/2) exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) ).
In the two variable case, the joint probability density function is given by:
f(x, y) = 1 / (2π σx σy √(1 − ρ²)) · exp( −1/(2(1 − ρ²)) [ (x − μx)²/σx² − 2ρ(x − μx)(y − μy)/(σx σy) + (y − μy)²/σy² ] ).
In this and other cases where a joint density function exists, the likelihood function is defined as above, under
Principles, using this density.
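The joint density above can be evaluated directly; a minimal numpy sketch (illustrative, not the article's code) of the multivariate normal log-density from which such a likelihood is built:

```python
import numpy as np

def mvn_logpdf(x, mu, cov):
    """Log-density of a multivariate normal N(mu, cov) evaluated at x."""
    n = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)        # numerically stable log-determinant
    quad = diff @ np.linalg.solve(cov, diff)  # (x - mu)^T cov^{-1} (x - mu)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

# sanity check: with identity covariance and zero mean, the density at 0
# is (2*pi)^(-n/2), i.e. the product of n independent standard normals
val = mvn_logpdf(np.zeros(3), np.zeros(3), np.eye(3))
print(val, -1.5 * np.log(2 * np.pi))
```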
Applications
Maximum likelihood estimation is used for a wide range of statistical models, including:
• linear models and generalized linear models;
• exploratory and confirmatory factor analysis;
• structural equation modeling;
• many situations in the context of hypothesis testing and confidence interval formation;
• discrete choice models.
These uses arise across applications in a wide range of fields, including:
• communication systems;
• psychometrics;
• econometrics;
• time-delay of arrival (TDOA) in acoustic or electromagnetic detection;
• data modeling in nuclear and particle physics;
• magnetic resonance imaging;
• computational phylogenetics;
• origin/destination and path-choice modeling in transport networks;
• geographical satellite-image classification.
History
Maximum-likelihood estimation was recommended, analyzed (with flawed attempts at proofs) and vastly
popularized by R. A. Fisher between 1912 and 1922[6] (although it had been used earlier by Gauss, Laplace, T. N.
Thiele, and F. Y. Edgeworth).[7] Reviews of the development of maximum likelihood have been provided by a
number of authors.[8]
Much of the theory of maximum-likelihood estimation was first developed for Bayesian statistics, and then
simplified by later authors.[6]
Notes
[1] Pfanzagl (1994, p. 206)
[2] Newey & McFadden (1994, Theorem 2.5.)
[3] Lehmann & Casella (1998)
[4] Newey & McFadden (1994, Theorem 3.3.)
[5] Cox & Snell (1968, formula (20))
[6] Pfanzagl (1994)
[7] Edgeworth (September 1908) and Edgeworth (December 1908)
[8] Savage (1976), Pratt (1976), Stigler (1978, 1986, 1999), Hald (1998, 1999), and Aldrich (1997)
References
• Aldrich, John (1997). "R. A. Fisher and the making of maximum likelihood 1912–1922". Statistical Science 12
(3): 162–176. doi:10.1214/ss/1030037906. MR1617519.
• Andersen, Erling B. (1970); "Asymptotic Properties of Conditional Maximum Likelihood Estimators", Journal of
the Royal Statistical Society B 32, 283–301
• Andersen, Erling B. (1980); Discrete Statistical Models with Social Science Applications, North Holland, 1980
• Basu, Debabrata (1988); Statistical Information and Likelihood : A Collection of Critical Essays by Dr. D. Basu;
in Ghosh, Jayanta K., editor; Lecture Notes in Statistics, Volume 45, Springer-Verlag, 1988
• Cox, David R.; Snell, E. Joyce (1968). "A general definition of residuals". Journal of the Royal Statistical Society.
Series B (Methodological): 248–275. JSTOR 2984505.
• Edgeworth, Francis Y. (Sep 1908). "On the probable errors of frequency-constants". Journal of the Royal
Statistical Society 71 (3): 499–512. doi:10.2307/2339293. JSTOR 2339293.
• Edgeworth, Francis Y. (Dec 1908). "On the probable errors of frequency-constants". Journal of the Royal
Statistical Society 71 (4): 651–678. doi:10.2307/2339378. JSTOR 2339378.
• Ferguson, Thomas S. (1982). "An inconsistent maximum likelihood estimate". Journal of the American Statistical
Association 77 (380): 831–834. JSTOR 2287314.
• Ferguson, Thomas S. (1996). A course in large sample theory. Chapman & Hall. ISBN 0-412-04371-8.
• Hald, Anders (1998). A history of mathematical statistics from 1750 to 1930. New York, NY: Wiley.
ISBN 0-471-17912-4.
• Hald, Anders (1999). "On the history of maximum likelihood in relation to inverse probability and least squares".
Statistical Science 14 (2): 214–222. JSTOR 2676741.
• Kano, Yutaka (1996). "Third-order efficiency implies fourth-order efficiency" (http://www.journalarchive.jst.
go.jp/english/jnlabstract_en.php?cdjournal=jjss1995&cdvol=26&noissue=1&startpage=101). Journal of the
Japan Statistical Society 26: 101–117.
• Le Cam, Lucien (1990). "Maximum likelihood — an introduction". ISI Review 58 (2): 153–171.
• Le Cam, Lucien; Lo Yang, Grace (2000). Asymptotics in statistics: some basic concepts (Second ed.). Springer.
ISBN 0-387-95036-2.
• Le Cam, Lucien (1986). Asymptotic methods in statistical decision theory. Springer-Verlag. ISBN 0-387-96307-3.
• Lehmann, Erich L.; Casella, George (1998). Theory of Point Estimation, 2nd ed. Springer. ISBN 0-387-98502-6.
• Newey, Whitney K.; McFadden, Daniel (1994). "Chapter 35: Large sample estimation and hypothesis testing". In
Engle, Robert; McFadden, Dan. Handbook of Econometrics, Vol.4. Elsevier Science. pp. 2111–2245.
ISBN 0-444-88766-0.
• Pfanzagl, Johann (1994). Parametric statistical theory. with the assistance of R. Hamböker. Berlin, DE: Walter de
Gruyter. pp. 207–208. ISBN 3-11-013863-8.
• Pratt, John W. (1976). "F. Y. Edgeworth and R. A. Fisher on the efficiency of maximum likelihood estimation".
The Annals of Statistics 4 (3): 501–514. doi:10.1214/aos/1176343457. JSTOR 2958222.
• Ruppert, David (2010). Statistics and Data Analysis for Financial Engineering (http://books.google.com/
books?id=i2bD50PbIikC&pg=PA98). Springer. p. 98. ISBN 978-1-4419-7786-1.
• Savage, Leonard J. (1976). "On rereading R. A. Fisher". The Annals of Statistics 4 (3): 441–500.
doi:10.1214/aos/1176343456. JSTOR 2958221.
• Stigler, Stephen M. (1978). "Francis Ysidro Edgeworth, statistician". Journal of the Royal Statistical Society.
Series A (General) 141 (3): 287–322. doi:10.2307/2344804. JSTOR 2344804.
• Stigler, Stephen M. (1986). The history of statistics: the measurement of uncertainty before 1900. Harvard
University Press. ISBN 0-674-40340-1.
• Stigler, Stephen M. (1999). Statistics on the table: the history of statistical concepts and methods. Harvard
University Press. ISBN 0-674-83601-4.
• van der Vaart, Aad W. (1998). Asymptotic Statistics. ISBN 0-521-78450-6.
External links
• Maximum Likelihood Estimation Primer (an excellent tutorial) (http://statgen.iop.kcl.ac.uk/bgim/mle/
sslike_1.html)
• Implementing MLE for your own likelihood function using R (http://www.mayin.org/ajayshah/KB/R/
documents/mle/mle.html)
• A selection of likelihood functions in R (http://www.netstorm.be/home/mle)
• "Tutorial on maximum likelihood estimation". Journal of Mathematical Psychology. CiteSeerX: 10.1.1.74.671
(http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.74.671).
McNemar's test
In statistics, McNemar's test is a non-parametric method used on nominal data. It is applied to 2 × 2 contingency
tables with a dichotomous trait, with matched pairs of subjects, to determine whether the row and column marginal
frequencies are equal ("marginal homogeneity"). It is named after Quinn McNemar, who introduced it in 1947.[1] An
application of the test in genetics is the transmission disequilibrium test for detecting genetic linkage.[2]
Definition
The test is applied to a 2 × 2 contingency table, which tabulates the outcomes of two tests on a sample of n subjects,
as follows.
                  Test 2 positive   Test 2 negative   Row total
Test 1 positive         a                 b             a + b
Test 1 negative         c                 d             c + d
Column total          a + c             b + d             n
The null hypothesis of marginal homogeneity states that the two marginal probabilities for each outcome are the
same, i.e. pa + pb = pa + pc and pc + pd = pb + pd.
Thus the null and alternative hypotheses are[1]
H0: pb = pc
H1: pb ≠ pc
Here pa, etc., denote the theoretical probability of occurrences in cells with the corresponding label.
The McNemar test statistic is:
χ² = (b − c)² / (b + c).
With Yates's correction for continuity, 0.5 is subtracted from the absolute difference:[3]
χ² = (|b − c| − 0.5)² / (b + c).
An alternative correction of 1 instead of 0.5 is attributed to Edwards[4] by Fleiss,[5] resulting in a similar equation:
χ² = (|b − c| − 1)² / (b + c).
Under the null hypothesis, with a sufficiently large number of discordant pairs (cells b and c), the statistic χ² has a
chi-squared distribution with 1 degree of freedom. If either b or c is small (b + c < 25) then χ² is not well-approximated by the
chi-squared distribution. The binomial distribution can be used to obtain the exact distribution for an equivalent to
the uncorrected form of McNemar's test statistic.[6] In this formulation, b is compared to a binomial distribution with
McNemar's test 380
size parameter equal to b + c and "probability of success" = ½, which is essentially the same as the binomial sign
test. For b + c < 25, the binomial calculation should be performed, and indeed, most software packages simply
perform the binomial calculation in all cases, since the result then is an exact test in all cases. When comparing the
resulting statistic to the right tail of the chi-squared distribution, the p-value that is found is two-sided, whereas
to achieve a two-sided p-value in the case of the exact binomial test, the p-value of the extreme tail should be
multiplied by 2.
If the result is significant, this provides sufficient evidence to reject the null hypothesis, in favour of the
alternative hypothesis that pb ≠ pc, which would mean that the marginal proportions are significantly different from
each other.
Example
In the following example, a researcher attempts to determine if a drug has an effect on a particular disease. Counts of
individuals are given in the table, with the diagnosis (disease: present or absent) before treatment given in the rows,
and the diagnosis after treatment in the columns. The test requires the same subjects to be included in the
before-and-after measurements (matched pairs).
                  After: present   After: absent   Row total
Before: present        101              121           222
Before: absent          59               33            92
Column total           160              154           314
In this example, the null hypothesis of "marginal homogeneity" would mean there was no effect of the treatment.
From the above data, the McNemar test statistic with Yates's continuity correction is
χ² = (|121 − 59| − 0.5)² / (121 + 59) = 21.01,
which is extremely unlikely from the distribution implied by the null hypothesis. Thus the test
provides strong evidence to reject the null hypothesis of no treatment effect.
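Both the corrected χ² statistic and the exact binomial variant described above can be computed directly; a minimal sketch (the function names are illustrative):

```python
import math

def mcnemar_chi2(b, c, correction=0.5):
    """McNemar statistic on the discordant counts b and c, with continuity correction."""
    return (abs(b - c) - correction) ** 2 / (b + c)

def mcnemar_exact(b, c):
    """Two-sided exact p-value: b compared to Binomial(b + c, 1/2)."""
    n = b + c
    k = max(b, c)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # double the extreme tail, capped at 1

# the example above: 121 and 59 discordant pairs
print(round(mcnemar_chi2(121, 59), 2))  # 21.01
```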
Discussion
An interesting observation when interpreting McNemar's test is that the elements of the main diagonal do not
contribute to the decision about whether (in the above example) pre- or post-treatment condition is more favourable.
An extension of McNemar's test exists in situations where independence does not necessarily hold between the pairs;
instead, there are clusters of paired data where the pairs in a cluster may not be independent, but independence holds
between different clusters. An example is analyzing the effectiveness of a dental procedure; in this case, a pair
corresponds to the treatment of an individual tooth in patients who might have multiple teeth treated; the
effectiveness of treatment of two teeth in the same patient is not likely to be independent, but the treatment of two
teeth in different patients is more likely to be independent.[7]
Rice (1995) gives the following example of the need to take the pairing into account, concerning a study of 85
matched pairs of siblings:[8]
They calculated a chi-squared statistic of 1.53, which is not significant. [...] [they] had made an error in
their analysis by ignoring the pairings. [...] [their] samples were not independent, because the siblings
were paired [...] we set up a table that exhibits the pairings:
It is to the second table that McNemar's test can be applied. Notice that the sum of the numbers in the second table is
85—the number of pairs of siblings—whereas the sum of the numbers in the first table is twice as big, 170—the
number of individuals. The second table gives more information than the first. The numbers in the first table can be
found by using the numbers in the second table, but not vice versa. The numbers in the first table give only the
marginal totals of the numbers in the second table.
Related tests
• The binomial sign test gives an exact version of McNemar's test.
• Cochran's Q test for two "treatments" is equivalent to McNemar's test.
• Liddell's exact test is an exact alternative to McNemar's test.[9][10]
• The Stuart–Maxwell test is a different generalization of the McNemar test, used for testing marginal homogeneity
in a square table with more than two rows/columns.[11]
• Bhapkar's test (1966) is a more powerful alternative to the Stuart–Maxwell test.[12]
References
[1] McNemar, Quinn (June 18, 1947). "Note on the sampling error of the difference between correlated proportions or percentages".
Psychometrika 12 (2): 153–157. doi:10.1007/BF02295996. PMID 20254758.
[2] Spielman RS; McGinnis RE; Ewens WJ (Mar 1993). "Transmission test for linkage disequilibrium: the insulin gene region and
insulin-dependent diabetes mellitus (IDDM)". Am J Hum Genet. 52 (3): 506–16. PMC 1682161. PMID 8447318.
[3] Yates, F. (1934). "Contingency tables involving small numbers and the χ² test". Supplement to the Journal of the Royal Statistical Society
1 (2): 217–235. JSTOR 2983604 (http://www.jstor.org/pss/2983604).
[4] Edwards, A (1948). "Note on the "correction for continuity" in testing the significance of the difference between correlated proportions".
Psychometrika 13: 185–187.
[5] Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: John Wiley. p. 114. ISBN 0-471-06428-9.
[6] Sheskin (2004)
[7] Durkalski, V.L.; Palesch, Y.Y.; Lipsitz, S.R.; Rust, P.F. (2003). "Analysis of clustered matched-pair data"
(http://www3.interscience.wiley.com/journal/104545274/abstract). Statistics in Medicine 22 (15): 2417–28. doi:10.1002/sim.1438.
PMID 12872299. Retrieved April 1, 2009.
[8] Rice, John (1995). Mathematical Statistics and Data Analysis (Second ed.). Belmont, California: Duxbury Press. pp. 492–494.
ISBN 0-534-20934-3.
[9] Liddell, D. (1976). "Practical Tests of 2 × 2 Contingency Tables". Journal of the Royal Statistical Society 25 (4): 295–304. JSTOR 2988087.
[10] http://rimarcik.com/en/navigator/z-nominal.html
[11] Sun, Xuezheng; Yang, Zhao (2008). "Generalized McNemar's Test for Homogeneity of the Marginal Distributions"
(http://www2.sas.com/proceedings/forum2008/382-2008.pdf). SAS Global Forum.
[12] http://www.john-uebersax.com/stat/mcnemar.htm#bhapkar
External links
• Vassar College's McNemar 2×2 Grid (http://faculty.vassar.edu/lowry/propcorr.html)
• McNemar Tests of Marginal Homogeneity (http://john-uebersax.com/stat/mcnemar.htm)
Multicollinearity
Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression
model are highly correlated. In this situation the coefficient estimates may change erratically in response to small
changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as
a whole, at least within the sample data themselves; it only affects calculations regarding individual predictors. That
is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors
predicts the outcome variable, but it may not give valid results about any individual predictor, or about which
predictors are redundant with respect to others.
A high degree of multicollinearity can also cause computer software packages to be unable to perform the matrix
inversion that is required for computing the regression coefficients, or it may make the results of that inversion
inaccurate.
Note that in statements of the assumptions underlying regression analyses such as ordinary least squares, the phrase
"no multicollinearity" is sometimes used to mean the absence of perfect multicollinearity, which is an exact
(non-stochastic) linear relation among the regressors.
Definition
Collinearity is a linear relationship between two explanatory variables. Two variables are perfectly collinear if there
is an exact linear relationship between the two. For example, X1 and X2 are perfectly collinear if there exist
parameters λ0 and λ1 such that, for all observations i, we have
X2i = λ0 + λ1 X1i.
Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are
highly linearly related. We have perfect multicollinearity if, for example as in the equation above, the correlation
between two independent variables is equal to 1 or -1. In practice, we rarely face perfect multicollinearity in a data
set. More commonly, the issue of multicollinearity arises when there is a strong linear relationship among two or
more independent variables.
Mathematically, a set of variables is perfectly multicollinear if there exist one or more exact linear relationships
among some of the variables. For example, we may have
λ0 + λ1 X1i + λ2 X2i + ⋯ + λk Xki = 0
holding for all observations i, where the λj are constants and Xji is the ith observation on the jth explanatory variable.
We can explore one issue caused by multicollinearity by examining the process of attempting to obtain estimates for
the parameters of the multiple regression equation
Yi = β0 + β1 X1i + ⋯ + βk Xki + εi.
The ordinary least squares estimates involve inverting the matrix XTX, where X is the n × (k + 1) design matrix
whose ith row is (1, X1i, …, Xki).
Multicollinearity 383
If there is an exact linear relationship (perfect multicollinearity) among the independent variables, the rank of X (and
therefore of XTX) is less than k+1, and the matrix XTX will not be invertible.
In most applications, perfect multicollinearity is unlikely. An analyst is more likely to face a high degree of
multicollinearity. For example, suppose that instead of the above equation holding exactly, we have that equation in
modified form with an error term νi:
λ0 + λ1 X1i + λ2 X2i + ⋯ + λk Xki + νi = 0.
In this case, there is no exact linear relationship among the variables, but the variables are nearly perfectly
multicollinear if the variance of νi is small for some set of values of the λj's. In this case, the matrix XTX has an
inverse, but is ill-conditioned so that a given computer algorithm may or may not be able to compute an approximate
inverse, and if it does so the resulting computed inverse may be highly sensitive to slight variations in the data (due
to magnified effects of rounding error) and so may be very inaccurate.
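A small numpy sketch of this effect (the data are synthetic): two nearly collinear columns make XTX ill-conditioned, while independent columns keep it well-behaved.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_near = x1 + 1e-6 * rng.normal(size=n)   # almost an exact copy of x1
x2_free = rng.normal(size=n)               # independent regressor

def gram_condition(*cols):
    """Condition number of X^T X for a design with an intercept column."""
    X = np.column_stack([np.ones(n), *cols])
    return np.linalg.cond(X.T @ X)

print(gram_condition(x1, x2_near))  # enormous: near-singular Gram matrix
print(gram_condition(x1, x2_free))  # modest: well-conditioned
```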
Detection of multicollinearity
Indicators that multicollinearity may be present in a model:
1. Large changes in the estimated regression coefficients when a predictor variable is added or deleted
2. Insignificant regression coefficients for the affected variables in the multiple regression, but a rejection of the
joint hypothesis that those coefficients are all zero (using an F-test)
3. Some authors have suggested a formal detection tolerance or the variance inflation factor (VIF) for
multicollinearity:
tolerance = 1 − R²j,   VIF = 1 / tolerance,
where R²j is the coefficient of determination of a regression of explanator j on all the other explanators. A
tolerance of less than 0.20 or 0.10 and/or a VIF of 5 or 10 and above indicates a multicollinearity problem (but
see O'Brien 2007).[1]
4. Condition number test: the standard measure of ill-conditioning in a matrix is the condition index. It indicates
that the inversion of the matrix is numerically unstable with finite-precision numbers (standard computer floats
and doubles), and thus signals the potential sensitivity of the computed inverse to small changes in the original
matrix. The condition number is computed by finding the square root of the maximum eigenvalue divided by the
minimum eigenvalue. If the condition number is above 30, the regression is said to have significant
multicollinearity.
5. Farrar-Glauber Test:[2] If the variables are found to be orthogonal, there is no multicollinearity; if the variables
are not orthogonal, then multicollinearity is present.
6. Construction of a pair-wise correlation matrix will yield indications as to the likelihood that any given couplet of
right-hand-side variables are multicollinear. Correlation values of 0.4 and higher can indicate a multicollinearity
issue, but sometimes variables may be correlated as highly as 0.8 without causing such issues.
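The tolerance/VIF diagnostic from item 3 above can be sketched with numpy (the function name and data are illustrative): regress each explanator on the others, take R², and form VIF = 1/(1 − R²).

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j of design matrix X (no intercept column in X)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # regress X_j on the other explanators
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + 0.05 * rng.normal(size=300)   # strongly collinear with x1
x3 = rng.normal(size=300)               # independent
X = np.column_stack([x1, x2, x3])
print(vif(X, 0), vif(X, 2))  # large for x1, near 1 for x3
```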
Consequences of multicollinearity
As mentioned above, one consequence of a high degree of multicollinearity is that, even if the matrix XTX is
invertible, a computer algorithm may be unsuccessful in obtaining an approximate inverse, and if it does obtain one
it may be numerically inaccurate. But even in the presence of an accurate XTX matrix, the following consequences
arise:
In the presence of multicollinearity, the estimate of one variable's impact on the dependent variable y while
controlling for the others tends to be less precise than if the predictors were uncorrelated with one another. The
usual interpretation of a regression coefficient is that it provides an estimate of the effect of a one-unit change in
an independent variable, X1, holding the other variables constant. If X1 is highly correlated with another
independent variable, X2, in the given data set, then we only have observations for which X1 and X2 have a
particular relationship (either positive or negative). We do not have observations for which X1 changes
independently of X2, so we have an imprecise estimate of the effect of independent changes in X1.
In some sense, the collinear variables contain the same information about the dependent variable. If nominally
"different" measures actually quantify the same phenomenon then they are redundant. Alternatively, if the variables
are accorded different names and perhaps employ different numeric measurement scales but are highly correlated
with each other, then they suffer from redundancy.
One of the features of multicollinearity is that the standard errors of the affected coefficients tend to be large. In that
case, the test of the hypothesis that the coefficient is equal to zero leads to a failure to reject the null hypothesis.
However, if a simple linear regression of the explained variable on this explanatory variable is estimated, the
coefficient will be found to be significant; specifically, the analyst will reject the hypothesis that the coefficient is
zero. In the presence of multicollinearity, an analyst might falsely conclude that there is no linear relationship
between an independent and a dependent variable.
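The inflated standard errors described above are commonly quantified by the variance inflation factor (VIF), 1/(1 − R²) from regressing one predictor on the remaining predictors. Below is a minimal numerical sketch of this diagnostic, assuming numpy; the data, seed, and the conventional threshold of 10 are illustrative, not taken from the text:

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j of data matrix X:
    1 / (1 - R^2) from regressing column j on the remaining columns."""
    y = X[:, j]
    Z = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(Z)), Z])          # add an intercept
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)               # unrelated predictor
X = np.column_stack([x1, x2, x3])

vif_collinear = vif(X, 0)     # large: x1 is nearly determined by x2
vif_independent = vif(X, 2)   # close to 1
```

A common rule of thumb (see reference [1]'s caution) flags VIF values above 10 as indicating problematic multicollinearity.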
A principal danger of such data redundancy is that of overfitting in regression analysis models. The best regression
models are those in which the predictor variables each correlate highly with the dependent (outcome) variable but
correlate at most only minimally with each other. Such a model is often called "low noise" and will be statistically
robust (that is, it will predict reliably across numerous samples of variable sets drawn from the same statistical
population).
So long as the underlying specification is correct, multicollinearity does not actually bias results, it just produces
large standard errors in the related independent variables. If, however, there are other problems (such as omitted
variables) which introduce bias, multicollinearity can multiply (by orders of magnitude) the effects of that bias. More
importantly, the usual use of regression is to take coefficients from the model and then apply them to other data. If
the new data differs in any way from the data that was fitted you may introduce large errors in your predictions
because the pattern of multicollinearity between the independent variables is different in your new data from the data
you used for your estimates.
Survival analysis
Multicollinearity may also represent a serious issue in survival analysis. The problem is that time-varying covariates
may change their value over the time line of the study. A special procedure is recommended to assess the impact of
multicollinearity on the results. See Van den Poel & Larivière (2004)[4] for a detailed discussion.
The ordinary least squares estimates involve inverting the matrix XᵀX, where X is the design matrix whose rows
are the observations (with a leading column of ones for the intercept). If there is an exact linear relationship
(perfect multicollinearity) among the independent variables, the rank of X (and therefore of XᵀX) is less than k+1,
and the matrix XᵀX will not be invertible.
In most applications, perfect multicollinearity is unlikely. An analyst is more likely to face a high degree of
multicollinearity. For example, suppose that instead of the above equation holding exactly, we have that equation in
modified form with an error term v_i:

X_{ki} = λ_0 + λ_1 X_{1i} + ⋯ + λ_{k−1} X_{(k−1)i} + v_i.

In this case, there is no exact linear relationship among the variables, but the variables are nearly perfectly
multicollinear if the variance of v_i is small for some set of values for the λ's. In this case, the matrix XᵀX has an
inverse, but it is ill-conditioned, so a given computer algorithm may or may not be able to compute an approximate
inverse; if it does, the resulting computed inverse may be highly sensitive to slight variations in the data (due to
magnified effects of rounding error) and so may be very inaccurate.
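The ill-conditioning of XᵀX can be observed directly through its condition number. A small sketch, with made-up data and an illustrative noise scale, showing how the condition number explodes as one column approaches an exact linear combination of another:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)

def cond_number(eps):
    """Condition number of X^T X when one column differs from
    another column only by noise of scale eps."""
    x2 = x1 + eps * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2, x3])
    return np.linalg.cond(X.T @ X)

well_conditioned = cond_number(1.0)    # columns only moderately correlated
ill_conditioned = cond_number(1e-6)    # near-perfect multicollinearity
```

A huge condition number means that rounding errors in the data can be magnified by a comparable factor when the inverse is computed, which is exactly the numerical failure mode described above.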
References
[1] O'Brien, Robert M. (2007). "A Caution Regarding Rules of Thumb for Variance Inflation Factors". Quality and Quantity 41(5): 673–690.
[2] Farrar, Donald E. and Glauber, Robert R. (1967). "Multicollinearity in Regression Analysis: The Problem Revisited". The Review of Economics
and Statistics 49(1): 92–107.
[3] Lipovetsky and Conklin (2001). "Analysis of Regression in Game Theory Approach". Applied Stochastic Models and Data Analysis 17 (2001):
319–330.
[4] Van den Poel, Dirk, and Larivière, Bart (2004), "Attrition Analysis for Financial Services Using Proportional Hazard Models," European
Journal of Operational Research, 157 (1), 196-217.
External links
• Earliest Uses: The entry on Multicollinearity has some historical information. (http://jeff560.tripod.com/m.html)
Multivariate normal distribution 387
In probability theory and statistics, the multivariate normal distribution, or multivariate Gaussian distribution, is
a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One possible
definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k
components has a univariate normal distribution. However, its importance derives mainly from the multivariate
central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set
of (possibly) correlated real-valued random variables each of which clusters around a mean value.
Definition
A random vector x = (X1, …, Xk)' is said to have the multivariate normal distribution if it satisfies the following
equivalent conditions.[1]
• Every linear combination of its components Y = a1X1 + … + akXk is normally distributed. That is, for any constant
vector a ∈ Rk, the random variable Y = a′x has a univariate normal distribution.
• There exists a random ℓ-vector z, whose components are independent standard normal random variables, a
k-vector μ, and a k×ℓ matrix A, such that x = Az + μ. Here ℓ is the rank of the covariance matrix Σ = AA′.
For the case of full rank, see the section on geometric interpretation below.
• There is a k-vector μ and a symmetric, nonnegative-definite k×k matrix Σ, such that the characteristic function of
x is

φ_x(u) = exp( i uᵀμ − ½ uᵀΣu ).
The covariance matrix is allowed to be singular (in which case the corresponding distribution has no density). This
case arises frequently in statistics; for example, in the distribution of the vector of residuals in the ordinary least
squares regression. Note also that the Xi are in general not independent; they can be seen as the result of applying the
matrix A to a collection of independent Gaussian variables z.
Properties
Density function
Non-degenerate case
The multivariate normal distribution is said to be "non-degenerate" when the covariance matrix Σ of the
multivariate normal distribution is symmetric and positive definite. In this case the distribution has density

f_x(x_1, …, x_k) = (2π)^{−k/2} |Σ|^{−1/2} exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) ),

where |Σ| is the determinant of Σ. Note how the equation above reduces to that of the univariate normal
distribution if Σ is a 1×1 matrix (i.e. a single real number).
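The density formula and its reduction to the univariate case can be checked numerically. A minimal sketch using numpy (the values of s and x0 are arbitrary illustrations):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x (non-degenerate case):
    (2*pi)^(-k/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))."""
    k = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# With a 1x1 covariance "matrix" [[s**2]] the formula reduces to the
# univariate normal density.
s, x0 = 1.7, 0.4
univariate = np.exp(-x0 ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))
multivariate = mvn_pdf(np.array([x0]), np.array([0.0]), np.array([[s ** 2]]))
```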
Bivariate case
In the 2-dimensional nonsingular case (k = rank(Σ) = 2), the probability density function of a vector [X Y]′ is

f(x, y) = 1/(2π σ_X σ_Y √(1 − ρ²)) · exp( −1/(2(1 − ρ²)) [ (x − μ_X)²/σ_X² + (y − μ_Y)²/σ_Y² − 2ρ(x − μ_X)(y − μ_Y)/(σ_X σ_Y) ] ),
where ρ is the correlation between X and Y and where σ_X > 0 and σ_Y > 0. In this case,

μ = (μ_X, μ_Y),  Σ = ( σ_X²  ρσ_Xσ_Y ; ρσ_Xσ_Y  σ_Y² ).
In the bivariate case, we also have a theorem that makes the first equivalent condition for multivariate normality less
restrictive: it is sufficient to verify that countably many distinct linear combinations of X and Y are normal in order
to conclude that the vector [X Y]′ is bivariate normal.[2]
When plotted in the x,y-plane, the distribution appears to be squeezed to the line

y(x) = sgn(ρ) (σ_Y/σ_X)(x − μ_X) + μ_Y

as the correlation parameter ρ approaches ±1. This is because the above expression, with sgn(ρ) replaced by ρ,
is the best linear unbiased prediction of Y given a value of X.[3]
Degenerate case
If the covariance matrix is not full rank, then the multivariate normal distribution is degenerate and does not have
a density. More precisely, it does not have a density with respect to k-dimensional Lebesgue measure (which is the
usual measure assumed in calculus-level probability courses). Only random vectors whose distributions are
absolutely continuous with respect to a measure are said to have densities (with respect to that measure). To talk
about densities but avoid dealing with measure-theoretic complications it can be simpler to restrict attention to a
subset of rank(Σ) of the coordinates of x such that the covariance matrix for this subset is positive definite; then the
other coordinates may be thought of as an affine function of the selected coordinates.
To talk about densities meaningfully in the singular case, then, we must select a different base measure. Using the
disintegration theorem we can define a restriction of Lebesgue measure to the rank(Σ)-dimensional affine
subspace of ℝ^k where the Gaussian distribution is supported, i.e. {μ + Σ^{1/2}v : v ∈ ℝ^k}. With respect to this
probability measure the distribution has density

f(x) = ( det*(2πΣ) )^{−1/2} exp( −½ (x − μ)ᵀ Σ⁺ (x − μ) ),

where Σ⁺ is the generalized inverse of Σ and det* is the pseudo-determinant.
Higher moments
The kth-order moments of x are defined by

μ_{r_1,…,r_N}(x) = E[ x_1^{r_1} x_2^{r_2} ⋯ x_N^{r_N} ],

where r1 + r2 + ⋯ + rN = k.
The central kth-order moments are given as follows.
(a) If k is odd, μ_{1,…,N}(x − μ) = 0.
(b) If k is even with k = 2λ, then

μ_{1,…,2λ}(x − μ) = Σ ( σ_{ij} σ_{kl} ⋯ σ_{xz} ),

where the sum is taken over all allocations of the set {1, …, 2λ} into λ (unordered) pairs. That is, for a
sixth-order (k = 2λ = 6) central moment, one sums the products of λ = 3 covariances (the −μ notation has been
dropped in the interests of parsimony):

E[x_1 x_2 x_3 x_4 x_5 x_6] = σ_12σ_34σ_56 + σ_12σ_35σ_46 + σ_12σ_36σ_45 + σ_13σ_24σ_56 + ⋯ (15 terms in all).
Multivariate normal distribution 390
This yields (2λ − 1)!! = (k − 1)!! terms in the sum (15 in the above case), each being the product of λ (in
this case 3) covariances. For fourth-order moments (four variables) there are three terms. For sixth-order moments
there are 3 × 5 = 15 terms, and for eighth-order moments there are 3 × 5 × 7 = 105 terms.
The covariances are then determined by replacing the terms of the list [1, …, 2λ] by the corresponding terms of
the list consisting of r1 ones, then r2 twos, etc. To illustrate this, examine the following 4th-order central moment
case:

E[x_i² x_j²] = σ_{ii}σ_{jj} + 2σ_{ij}²,
E[x_i² x_j x_k] = σ_{ii}σ_{jk} + 2σ_{ij}σ_{ik},
E[x_i x_j x_k x_n] = σ_{ij}σ_{kn} + σ_{ik}σ_{jn} + σ_{in}σ_{jk},

where σ_{ij} is the covariance of x_i and x_j. The idea with the above method is that you first find the general case for a kth
moment with k different x variables, and then simplify accordingly. Say you want E[x_i² x_j x_k]; then you simply
let x_i equal two of the variables and use the fact that σ_{ii} = σ_i².
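The pairing count (2λ − 1)!! quoted above can be verified by brute-force enumeration. A small self-contained sketch:

```python
def pairings(elems):
    """Enumerate all partitions of the list elems into unordered pairs,
    pairing the first element with each possible partner in turn."""
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail

# Number of pairings of k = 2*lam elements is (k-1)!! = 1*3*5*...*(k-1):
# 3 terms for fourth-order, 15 for sixth-order, 105 for eighth-order moments.
counts = {k: sum(1 for _ in pairings(list(range(k)))) for k in (4, 6, 8)}
```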
Likelihood function
If the mean and covariance matrix are unknown, a suitable log-likelihood function for a single observation x would be

ln L(μ, Σ; x) = −½ ln|Σ| − ½ (x − μ)ᵀ Σ⁻¹ (x − μ) − (k/2) ln(2π),

where x is a vector of real numbers. The complex case, where z is a vector of complex numbers, would be

ln L(μ, Σ; z) = −ln|Σ| − (z − μ)† Σ⁻¹ (z − μ) − k ln π.
A similar notation is used for multiple linear regression.[4]
Entropy
The differential entropy of the multivariate normal distribution is[5]

h(x) = ½ ln( (2πe)^k |Σ| ).
Kullback–Leibler divergence
The Kullback–Leibler divergence from N(μ_0, Σ_0) to N(μ_1, Σ_1), for non-singular matrices Σ0 and Σ1, is:[6]

D_KL(N_0 ‖ N_1) = ½ ( tr(Σ_1⁻¹ Σ_0) + (μ_1 − μ_0)ᵀ Σ_1⁻¹ (μ_1 − μ_0) − k + ln( |Σ_1| / |Σ_0| ) ).
The logarithm must be taken to base e since the two terms following the logarithm are themselves base-e logarithms
of expressions that are either factors of the density function or otherwise arise naturally. The equation therefore gives
a result measured in nats. Dividing the entire expression above by loge 2 yields the divergence in bits.
Tolerance region
The multivariate analogue of the tolerance interval for the univariate normal distribution is a tolerance region.
Such a region consists of those vectors x satisfying

(x − μ)ᵀ Σ⁻¹ (x − μ) ≤ χ²_k(p).

Here x is a k-dimensional vector, μ is the known k-dimensional mean vector, Σ is the known covariance
matrix and χ²_k(p) is the quantile function for probability p of the chi-squared distribution with k degrees of
freedom.
When k = 2 the expression defines the interior of an ellipse and the chi-squared distribution simplifies to an
exponential distribution with mean equal to two.
Joint normality
Two normally distributed random variables need not be jointly bivariate normal
The fact that two random variables X and Y both have a normal distribution does not imply that the pair (X, Y) has a
joint normal distribution. A simple example is one in which X has a normal distribution with expected value 0 and
variance 1, and Y = X if |X| > c and Y = −X if |X| < c, where c > 0. There are similar counterexamples for more than
two random variables.
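The counterexample above can be exhibited numerically: both marginals are standard normal, yet X + Y has an atom at zero, which a jointly normal pair could never produce. A short simulation sketch (the seed, sample size, and c = 1 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
c = 1.0
X = rng.normal(size=100_000)
Y = np.where(np.abs(X) > c, X, -X)   # Y is also standard normal by symmetry

# If (X, Y) were jointly normal, X + Y would be normal; instead it has an
# atom at zero, since X + Y = 0 exactly whenever |X| < c.
frac_zero = np.mean(X + Y == 0.0)
expected = 0.6827                    # P(|X| < 1) for a standard normal
```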
Conditional distributions
If μ and Σ are partitioned as follows

μ = ( μ_1 ; μ_2 )  with sizes ( q×1 ; (k−q)×1 ),

Σ = ( Σ_11  Σ_12 ; Σ_21  Σ_22 )  with sizes ( q×q  q×(k−q) ; (k−q)×q  (k−q)×(k−q) ),

then the distribution of x_1 conditional on x_2 = a is multivariate normal, (x_1 | x_2 = a) ~ N(μ̄, Σ̄), with mean

μ̄ = μ_1 + Σ_12 Σ_22⁻¹ (a − μ_2)

and covariance matrix

Σ̄ = Σ_11 − Σ_12 Σ_22⁻¹ Σ_21.

This matrix is the Schur complement of Σ22 in Σ. This means that to calculate the conditional covariance matrix, one
inverts the overall covariance matrix, drops the rows and columns corresponding to the variables being conditioned
upon, and then inverts back to get the conditional covariance matrix. Here Σ_22⁻¹ denotes the generalized inverse of
Σ_22 (used when Σ_22 is singular).
Note that knowing that x2 = a alters the variance, though the new variance does not depend on the specific value of
a; perhaps more surprisingly, the mean is shifted by Σ_12 Σ_22⁻¹ (a − μ_2); compare this with the situation of not
knowing the value of a, in which case x1 would have distribution N_q(μ_1, Σ_11).
An interesting fact derived in order to prove this result is that the random vectors x_2 and y_1 = x_1 − Σ_12 Σ_22⁻¹ x_2
are independent.
The matrix Σ12Σ22−1 is known as the matrix of regression coefficients.
In the bivariate case where x is partitioned into X1 and X2, the conditional distribution of X1 given X2 = a is

X_1 | X_2 = a  ~  N( μ_1 + (σ_1/σ_2) ρ (a − μ_2), (1 − ρ²) σ_1² ),

and, for a standard bivariate normal, the conditional expectation under one-sided truncation is
E(X_1 | X_2 > z) = ρ · φ(z)/(1 − Φ(z)), where the final ratio here is called the inverse Mills ratio.
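The Schur-complement formula and the "invert, drop rows and columns, invert back" description are two routes to the same conditional covariance, which can be checked directly. A numpy sketch with an illustrative 3×3 covariance matrix (values made up for the demo):

```python
import numpy as np

# Partition: x1 = first component, x2 = the remaining two components.
Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 3.0, 0.5],
                  [0.8, 0.5, 2.0]])
mu1, mu2 = np.array([1.0]), np.array([0.5, -0.2])
a = np.array([1.0, 0.3])                       # observed value of x2

S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

# Conditional moments: shifted mean and Schur complement covariance
cond_mean = mu1 + S12 @ np.linalg.solve(S22, a - mu2)
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)

# "Invert, drop, invert back": the (1,1) block of Sigma^{-1}, inverted,
# equals the Schur complement of S22 in Sigma.
P = np.linalg.inv(Sigma)
cond_cov_via_precision = np.linalg.inv(P[:1, :1])
```

Conditioning never increases the variance, so the conditional variance comes out smaller than the marginal Σ11 entry.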
Marginal distributions
To obtain the marginal distribution over a subset of multivariate normal random variables, one only needs to drop the
irrelevant variables (the variables that one wants to marginalize out) from the mean vector and the covariance matrix.
The proof for this follows from the definitions of multivariate normal distributions and linear algebra.[8]
Example
Let x = [X1, X2, X3] be multivariate normal random variables with mean vector μ = [μ1, μ2, μ3] and covariance matrix

Σ = ( σ_11 σ_12 σ_13 ; σ_21 σ_22 σ_23 ; σ_31 σ_32 σ_33 )

(the standard parametrization for multivariate normal distributions). Then the joint distribution of x′ = [X1, X3] is
multivariate normal with mean vector μ′ = [μ1, μ3] and covariance matrix

Σ′ = ( σ_11 σ_13 ; σ_31 σ_33 ).
Affine transformation
If y = c + Bx is an affine transformation of x ~ N(μ, Σ), where c is an m×1 vector of constants and B is a
constant m×k matrix, then y has a multivariate normal distribution with expected value c + Bμ and variance
BΣBᵀ, i.e., y ~ N(c + Bμ, BΣBᵀ). In particular, any subset of the x_i has a marginal distribution that is also
multivariate normal. To see this, consider the following example: to extract the subset (x1, x2, x4)ᵀ, use

B = ( 1 0 0 0 0 … 0 ; 0 1 0 0 0 … 0 ; 0 0 0 1 0 … 0 ),

which extracts the desired elements directly. Another corollary is that the distribution of Z = b·x, where b is a
constant vector of the same length as x, is univariate Gaussian with Z ~ N(b·μ, bᵀΣb); this follows by taking B
with first row equal to b and considering only the first component of the product. Observe how the
positive-definiteness of Σ implies that the variance of the dot product must be positive.
An affine transformation of x such as 2x is not the same as the sum of two independent realisations of x.
Geometric interpretation
The equidensity contours of a non-singular multivariate normal distribution are ellipsoids (i.e. linear transformations
of hyperspheres) centered at the mean.[9] The directions of the principal axes of the ellipsoids are given by the
eigenvectors of the covariance matrix Σ. The squared relative lengths of the principal axes are given by the
corresponding eigenvalues.
If Σ = UΛUᵀ = UΛ^{1/2}(UΛ^{1/2})ᵀ is an eigendecomposition where the columns of U are unit eigenvectors and Λ is a
diagonal matrix of the eigenvalues, then we have

x ~ N(μ, Σ)  ⟺  x ~ μ + UΛ^{1/2} z,  where z ~ N(0, I).

Moreover, U can be chosen to be a rotation matrix, as inverting an axis does not have any effect on N(0, Λ), but
inverting a column changes the sign of U's determinant. The distribution N(μ, Σ) is in effect N(0, I) scaled by Λ^{1/2},
rotated by U and translated by μ.
Conversely, any choice of μ, full rank matrix U, and positive diagonal entries Λi yields a non-singular multivariate
normal distribution. If any Λi is zero and U is square, the resulting covariance matrix UΛUT is singular.
Geometrically this means that every contour ellipsoid is infinitely thin and has zero volume in n-dimensional space,
as at least one of the principal axes has length of zero.
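The geometric recipe above (scale a standard normal vector by Λ^{1/2}, rotate by U, translate by μ) is also a practical way to sample from N(μ, Σ). A numpy sketch with an illustrative 2×2 covariance matrix, checking the sample moments against the targets:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative target distribution (values are made up)
Sigma = np.array([[2.0, 0.9],
                  [0.9, 1.0]])
mu = np.array([1.0, -2.0])

# Eigendecomposition Sigma = U diag(lam) U^T, so A = U diag(sqrt(lam))
# satisfies A A^T = Sigma.
lam, U = np.linalg.eigh(Sigma)
A = U @ np.diag(np.sqrt(lam))

z = rng.normal(size=(200_000, 2))   # independent standard normals
x = z @ A.T + mu                    # scaled, rotated, translated

sample_mean = x.mean(axis=0)
sample_cov = np.cov(x, rowvar=False)
```

A Cholesky factor of Σ would serve equally well as A; the eigendecomposition is used here because it matches the geometric discussion.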
Estimation of parameters
The derivation of the maximum-likelihood estimator of the covariance matrix of a multivariate normal distribution is
perhaps surprisingly subtle and elegant. See estimation of covariance matrices.
In short, the probability density function (pdf) of a k-dimensional multivariate normal is

f(x) = (2π)^{−k/2} |Σ|^{−1/2} exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) ),

and the maximum-likelihood estimator of the covariance matrix based on a sample of n observations is

Σ̂ = (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)ᵀ,

which is simply the sample covariance matrix. This is a biased estimator whose expectation is

E[Σ̂] = ((n − 1)/n) Σ.

An unbiased sample covariance is obtained by using n − 1 instead of n in the denominator.
The Fisher information matrix for estimating the parameters of a multivariate normal distribution has a closed form
expression. This can be used, for example, to compute the Cramér–Rao bound for parameter estimation in this
setting. See Fisher information for more details.
Bayesian inference
In Bayesian statistics, the conjugate prior of the mean vector is another multivariate normal distribution, and the
conjugate prior of the covariance matrix is an inverse-Wishart distribution W⁻¹. Suppose then that n observations
have been made

X = {x_1, …, x_n} ~ N(μ, Σ),

and that a conjugate prior has been assigned, where

p(μ, Σ) = p(μ | Σ) p(Σ),

with

p(μ | Σ) ~ N(μ_0, m⁻¹Σ)  and  p(Σ) ~ W⁻¹(Ψ, n_0).

Then

p(μ | Σ, X) ~ N( (n x̄ + m μ_0)/(n + m), (n + m)⁻¹ Σ ),
p(Σ | X) ~ W⁻¹( Ψ + n S + (nm/(n + m)) (x̄ − μ_0)(x̄ − μ_0)ᵀ, n + n_0 ),

where

x̄ = (1/n) Σ_{i=1}^n x_i  and  S = (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)ᵀ.
Multivariate normality tests
Multivariate normality tests check a given data set for similarity to the multivariate normal distribution.
Mardia's test[13] is based on multivariate extensions of skewness and kurtosis measures. Under the null hypothesis
of multivariate normality, the skewness statistic A will have approximately a chi-squared distribution
with (1/6)·k(k + 1)(k + 2) degrees of freedom, and the kurtosis statistic B will be approximately standard normal N(0,1).
Mardia's kurtosis statistic is skewed and converges very slowly to the limiting normal distribution. For medium-size
samples (50 ≤ n < 400), the parameters of the asymptotic distribution of the kurtosis statistic are modified.[14]
For small-sample tests (n < 50) empirical critical values are used. Tables of critical values for both statistics are
given by Rencher[15] for k=2,3,4.
Mardia's tests are affine invariant but not consistent. For example, the multivariate skewness test is not consistent
against symmetric non-normal alternatives.
The BHEP test[16] computes the norm of the difference between the empirical characteristic function and the
theoretical characteristic function of the normal distribution. Calculation of the norm is performed in the L2(μ) space
of square-integrable functions with respect to the Gaussian weighting function φ_β(t) = (2πβ²)^{−k/2} exp( −|t|²/(2β²) ), where β is a smoothing parameter.
The limiting distribution of this test statistic is a weighted sum of chi-squared random variables,[17] however in
practice it is more convenient to compute the sample quantiles using Monte Carlo simulation.
A detailed survey of these and other test procedures is available.[18]
References
[1] Gut, Allan (2009) An Intermediate Course in Probability, Springer. ISBN 9781441901613 (Chapter 5)
[2] Hamedani, G. G.; Tata, M. N. (1975). "On the determination of the bivariate normal distribution from distributions of linear combinations of
the variables". The American Mathematical Monthly 82 (9): 913–915. doi:10.2307/2318494.
[3] Wyatt, John. "Linear least mean-squared error estimation" (http://web.mit.edu/6.041/www/LECTURE/lec22.pdf). Lecture notes, course
on applied probability. Retrieved 23 January 2012.
[4] Tong, T. (2010) Multiple Linear Regression: MLE and Its Distributional Results (http://amath.colorado.edu/courses/7400/2010Spr/lecture9.pdf), Lecture Notes.
[5] Gokhale, DV; NA Ahmed, BC Res, NJ Piscataway (May 1989). "Entropy Expressions and Their Estimators for Multivariate Distributions".
Information Theory, IEEE Transactions on 35 (3): 688–692. doi:10.1109/18.30996.
[6] Penny & Roberts, PARG-00-12, (2000) (http://www.allisons.org/ll/MML/KL/Normal). pp. 18.
[7] Eaton, Morris L. (1983). Multivariate Statistics: a Vector Space Approach. John Wiley and Sons. pp. 116–117. ISBN 0-471-02776-6.
[8] The formal proof for the marginal distribution is shown here: http://fourier.eng.hmc.edu/e161/lectures/gaussianprocess/node7.html
[9] Nikolaus Hansen. "The CMA Evolution Strategy: A Tutorial" (http://www.lri.fr/~hansen/cmatutorial.pdf) (PDF).
[10] Cox, D. R.; N. J. H. Small (August 1978). "Testing multivariate normality". Biometrika 65 (2): 263–272. doi:10.1093/biomet/65.2.263.
[11] Smith, Stephen P.; Anil K. Jain (September 1988). "A test to determine the multivariate normality of a dataset". IEEE Transactions on
Pattern Analysis and Machine Intelligence 10 (5): 757–761. doi:10.1109/34.6789.
[12] Friedman, J. H. and Rafsky, L. C. (1979) "Multivariate generalizations of the Wald-Wolfowitz and Smirnov two sample tests". Annals of
Statistics, 7, 697–717.
[13] Mardia, K. V. (1970). "Measures of multivariate skewness and kurtosis with applications". Biometrika 57 (3): 519–530.
doi:10.1093/biomet/57.3.519.
[14] Rencher (1995), pages 112-113.
[15] Rencher (1995), pages 493-495.
[16] Epps, T. W.; Pulley, Lawrence B. (1983). "A test for normality based on the empirical characteristic function". Biometrika 70 (3):
723–726. doi:10.1093/biomet/70.3.723.
[17] Baringhaus, L.; Henze, N. (1988). "A consistent test for multivariate normality based on the empirical characteristic function". Metrika 35
(1): 339–348. doi:10.1007/BF02613322.
[18] Henze, Norbert (2002). "Invariant tests for multivariate normality: a critical review". Statistical Papers 43 (4): 467–506.
doi:10.1007/s00362-002-0119-6.
Literature
• Rencher, A.C. (1995). Methods of Multivariate Analysis. New York: Wiley.
n-sphere
In mathematics, an n-sphere is a generalization of the surface of an
ordinary sphere to arbitrary dimension. For any natural number n, an
n-sphere of radius r is defined as the set of points in
(n + 1)-dimensional Euclidean space which are at distance r from a
central point, where the radius r may be any positive real number. In
symbols:

S^n(r) = { x ∈ ℝ^{n+1} : ‖x‖ = r }.
Description
For any natural number n, an n-sphere of radius r is defined as the set of points in (n + 1)-dimensional Euclidean
space that are at distance r from some fixed point c, where r may be any positive real number and where c may be
any point in (n + 1)-dimensional space. In particular:
• a 0-sphere is a pair of points {c − r, c + r}, and is the boundary of a line segment (1-ball).
• a 1-sphere is a circle of radius r centered at c, and is the boundary of a disk (2-ball).
• a 2-sphere is an ordinary 2-dimensional sphere in 3-dimensional Euclidean space, and is the boundary of an
ordinary ball (3-ball).
The volume form ω of an n-sphere of radius r is given by

ω = (1/r) Σ_{j=1}^{n+1} (−1)^{j−1} x_j dx_1 ∧ ⋯ ∧ dx_{j−1} ∧ dx_{j+1} ∧ ⋯ ∧ dx_{n+1} = ∗dr,

where ∗ is the Hodge star operator; see Flanders (1989, §6.1) for a discussion and proof of this formula in the case
r = 1. As a result, dr ∧ ω = dx_1 ∧ ⋯ ∧ dx_{n+1}.
n-ball
The space enclosed by an n-sphere is called an (n + 1)-ball. An (n + 1)-ball is closed if it includes the n-sphere, and it
is open if it does not include the n-sphere.
Specifically:
• A 1-ball, a line segment, is the interior of a 0-sphere.
• A 2-ball, a disk, is the interior of a circle (1-sphere).
• A 3-ball, an ordinary ball, is the interior of a sphere (2-sphere).
• A 4-ball, is the interior of a 3-sphere, etc.
Topological description
Topologically, an n-sphere can be constructed as a one-point compactification of n-dimensional Euclidean space.
Briefly, the n-sphere can be described as , which is n-dimensional Euclidean space plus a
single point representing infinity in all directions. In particular, if a single point is removed from an n-sphere, it
becomes homeomorphic to . This forms the basis for stereographic projection.[1]
The 1-sphere of radius R is the circle of radius R in the Euclidean plane, and this has circumference (1-dimensional
measure)

C = 2πR.

The region enclosed by the 1-sphere is the 2-ball, or disk of radius R, and this has area (2-dimensional measure)

A = πR².
Analogously, in 3-dimensional Euclidean space, the surface area (2-dimensional measure) of the 2-sphere of radius R
is given by

A = 4πR²,

and the volume enclosed is the volume (3-dimensional measure) of the 3-ball, given by

V = (4/3)πR³.
In general, the volume, in n-dimensional Euclidean space, of the n-ball of radius R is proportional to the nth power of
R:

V_n(R) = C_n R^n,

where the constant of proportionality, the volume of the unit n-ball, is given by

C_n = π^{n/2} / Γ(n/2 + 1),

where Γ is the gamma function. Since Γ(n/2 + 1) = (n/2)! for even n, and since

Γ(n/2 + 1) = √π · n!! / 2^{(n+1)/2}

for odd n, the volumes and the surface areas S_n(R) = (n + 1) V_{n+1}(R)/R satisfy the recurrences

V_n(R) = (2πR²/n) V_{n−2}(R),  S_n(R) = (2πR²/(n − 1)) S_{n−2}(R),

where S_0 = 2, V_1 = 2R, S_1 = 2πR and V_2 = πR². (The 0-dimensional Hausdorff measure is the number of points in a set.
The 0-sphere consists of two points, at −R and +R; so S_0 = 2.)
The recurrence relation for V_n can be proved via integration with 2-dimensional polar coordinates:

V_n(R) = ∫_0^{2π} ∫_0^R V_{n−2}( √(R² − r²) ) r dr dθ = (2πR²/n) V_{n−2}(R).
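The closed form and the recurrence can be checked numerically. A small sketch using Python's math.gamma (the helper names are ours, not from the text):

```python
from math import pi, gamma

def ball_volume(n, R=1.0):
    """Volume of the n-ball: pi**(n/2) / Gamma(n/2 + 1) * R**n."""
    return pi ** (n / 2) / gamma(n / 2 + 1) * R ** n

def sphere_surface(n, R=1.0):
    """Surface measure of the n-sphere: S_n = (n + 1) V_{n+1} / R."""
    return (n + 1) * ball_volume(n + 1, R) / R

# Low-dimensional sanity checks (R = 1): V1 = 2, V2 = pi, V3 = 4*pi/3,
# S1 = 2*pi, S2 = 4*pi.
V1, V2, V3 = ball_volume(1), ball_volume(2), ball_volume(3)
S1, S2 = sphere_surface(1), sphere_surface(2)

# Recurrence V_n = (2*pi/n) * V_{n-2} (with R = 1)
recurrence_gap = ball_volume(5) - (2 * pi / 5) * ball_volume(3)
```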
Spherical coordinates
We may define a coordinate system in an n-dimensional Euclidean space which is analogous to the spherical
coordinate system defined for 3-dimensional Euclidean space, in which the coordinates consist of a radial coordinate
r and n − 1 angular coordinates φ_1, …, φ_{n−1}, where φ_{n−1} ranges over [0, 2π) radians (or over [0, 360)
degrees) and the other angles range over [0, π] radians (or over [0, 180] degrees). If x_1, …, x_n are the Cartesian
coordinates, then we may compute them from r, φ_1, …, φ_{n−1} with:

x_1 = r cos(φ_1)
x_2 = r sin(φ_1) cos(φ_2)
⋮
x_{n−1} = r sin(φ_1) ⋯ sin(φ_{n−2}) cos(φ_{n−1})
x_n = r sin(φ_1) ⋯ sin(φ_{n−2}) sin(φ_{n−1})
Except in the special cases described below, the inverse transformation is unique:

r = √(x_1² + x_2² + ⋯ + x_n²),
φ_k = arccot( x_k / √(x_{k+1}² + ⋯ + x_n²) )  for k = 1, …, n − 2,
φ_{n−1} = 2 arccot( (x_{n−1} + √(x_{n−1}² + x_n²)) / x_n ),

where if x_k ≠ 0 for some k but all of x_{k+1}, …, x_n are zero, then φ_k = 0 when x_k > 0, and
π radians (180 degrees) when x_k < 0.
There are some special cases where the inverse transform is not unique; φ_k for any k will be ambiguous whenever
all of x_k, x_{k+1}, …, x_n are zero; in this case φ_k may be chosen to be zero.
Note that a half-angle formula is used for φ_{n−1} because the more straightforward arccot(x_{n−1}/x_n) is too
ambiguous: it cannot distinguish angles that differ by π.
The volume element of the (n−1)-sphere, which generalizes the area element of the 2-sphere, is given by

d_{S^{n−1}}V = sin^{n−2}(φ_1) sin^{n−3}(φ_2) ⋯ sin(φ_{n−2}) dφ_1 dφ_2 ⋯ dφ_{n−1},

and the above equation for the volume of the n-ball can be recovered by integrating:

V_n(R) = ∫_0^{2π} ∫_0^π ⋯ ∫_0^π ∫_0^R r^{n−1} sin^{n−2}(φ_1) ⋯ sin(φ_{n−2}) dr dφ_1 ⋯ dφ_{n−1}.

The natural choice of an orthogonal basis over the angular coordinates is a product of ultraspherical polynomials
for j = 1, 2, …, n − 2, and e^{isφ_j} for the angle j = n − 1, in concordance with the spherical harmonics.
Stereographic projection
Just as a two dimensional sphere embedded in three dimensions can be mapped onto a two-dimensional plane by a
stereographic projection, an n-sphere can be mapped onto an n-dimensional hyperplane by the n-dimensional version
of the stereographic projection. For example, the point [x, y, z] on a two-dimensional sphere of radius 1 maps to
the point [x/(1 − z), y/(1 − z)] on the xy-plane. Likewise, the stereographic projection of an n-sphere of radius 1
will map to the n-dimensional hyperplane perpendicular to the x_{n+1}-axis as

[x_1, x_2, …, x_n, x_{n+1}] ↦ [ x_1/(1 − x_{n+1}), x_2/(1 − x_{n+1}), …, x_n/(1 − x_{n+1}) ].
Generating uniformly distributed random points
If x = (x_1, …, x_n) is a vector of n independent standard normal deviates and r = ‖x‖, then the vector
(1/r)(x_1, …, x_n) is uniformly distributed over the surface of the unit n-ball.
Examples
For example, when n = 2, the one-axis density exp(−x_1²), multiplied by the density exp(−x_2²) for a second axis,
takes the form exp(−x_1² − x_2²) = exp(−r²), and so depends only on the distance from the origin; this rotational
symmetry is what makes the direction of the normalized Gaussian vector uniform.
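The normalize-a-Gaussian-vector recipe can be sketched and sanity-checked in a few lines of numpy (seed and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def uniform_on_sphere(n, size, rng):
    """Draw `size` points uniformly on the unit sphere in R^n by
    normalizing vectors of independent standard normal deviates."""
    z = rng.normal(size=(size, n))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

pts = uniform_on_sphere(3, 50_000, rng)
norms = np.linalg.norm(pts, axis=1)

# For a uniform point on the 2-sphere, each coordinate is uniform on
# [-1, 1] (Archimedes' hat-box theorem), so its variance should be
# close to 1/3.
coord_var = pts[:, 0].var()
```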
Alternatives
Another way to generate a random distribution on a hypersphere is to make a uniform distribution over a hypercube
that includes the unit hyperball, exclude those points that are outside the hyperball, then project the remaining
interior points outward from the origin onto the surface. This will give a uniform distribution, but it is necessary to
remove the exterior points. As the relative volume of the hyperball to the hypercube decreases very rapidly with
dimension, this procedure will succeed with high probability only for fairly small numbers of dimensions.
Wendel's theorem gives the probability that all of the points generated will lie in the same half of the hypersphere.
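The acceptance probability of this rejection scheme is exactly the ratio V_n/2^n of the unit n-ball's volume to the enclosing hypercube's, which collapses rapidly with dimension. A short sketch quantifying the claim:

```python
from math import pi, gamma

def acceptance_rate(n):
    """Probability that a uniform point of the cube [-1, 1]^n falls
    inside the unit n-ball: V_n / 2^n."""
    return pi ** (n / 2) / gamma(n / 2 + 1) / 2 ** n

# pi/4 ~ 0.785 in the plane, but already ~0.0025 in 10 dimensions
rates = {n: acceptance_rate(n) for n in (2, 5, 10, 20)}
```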
Specific spheres
0-sphere
The pair of points {±R} with the discrete topology for some R > 0. The only sphere that is disconnected. Has a
natural Lie group structure; isomorphic to O(1). Parallelizable.
1-sphere
Also known as the circle. Has a nontrivial fundamental group. Abelian Lie group structure U(1); the circle
group. Topologically equivalent to the real projective line, RP1. Parallelizable. SO(2) = U(1).
2-sphere
Also known as the sphere. Complex structure; see Riemann sphere. Equivalent to the complex projective line,
CP1. SO(3)/SO(2).
3-sphere
Parallelizable. Principal U(1)-bundle over the 2-sphere. Lie group structure Sp(1), where also
Sp(1) = Spin(3) = SU(2) = SO(4)/SO(3).
4-sphere
Equivalent to the quaternionic projective line, HP1. SO(5)/SO(4).
5-sphere
Principal U(1)-bundle over CP2. SO(6)/SO(5) = SU(3)/SU(2).
6-sphere
Almost complex structure coming from the set of pure unit octonions. SO(7)/SO(6) = G2/SU(3).
7-sphere
Topological quasigroup structure as the set of unit octonions. Principal Sp(1)-bundle over S4. Parallelizable.
SO(8)/SO(7) = SU(4)/SU(3) = Sp(2)/Sp(1) = Spin(7)/G2 = Spin(6)/SU(3). The 7-sphere is of particular interest
since it was in this dimension that the first exotic spheres were discovered.
8-sphere
Equivalent to the octonionic projective line OP1.
23-sphere
A highly dense sphere-packing is possible in 24 dimensional space, which is related to the unique qualities of
the Leech lattice.
Notes
[1] James W. Vick (1994). Homology theory, p. 60. Springer
References
• Flanders, Harley (1989). Differential forms with applications to the physical sciences. New York: Dover
Publications. ISBN 978-0-486-66169-8.
• Moura, Eduarda; Henderson, David G. (1996). Experiencing geometry: on plane and sphere (http://www.math.
cornell.edu/~henderson/books/eg00). Prentice Hall. ISBN 978-0-13-373770-7 (Chapter 20: 3-spheres and
hyperbolic 3-spaces.)
• Weeks, Jeffrey R. (1985). The Shape of Space: how to visualize surfaces and three-dimensional manifolds.
Marcel Dekker. ISBN 978-0-8247-7437-0 (Chapter 14: The Hypersphere)
• Marsaglia, G. (1972). "Choosing a Point from the Surface of a Sphere". Ann. Math. Stat. 43 (2): 645–646.
doi:10.1214/aoms/1177692644.
• Huber, Greg (1982). "Gamma function derivation of n-sphere volumes". Am. Math. Monthly 89 (5): 301–302.
doi:10.2307/2321716. JSTOR 2321716. MR1539933.
External links
• Exploring Hyperspace with the Geometric Product (http://www.bayarea.net/~kins/thomas_briggs/)
• Weisstein, Eric W., " Hypersphere (http://mathworld.wolfram.com/Hypersphere.html)" from MathWorld.
Negative binomial distribution 405
[Figure: plots of the negative binomial distribution. The orange line represents the mean, which is equal to 10 in each of these plots; the green line shows the standard deviation.]
Notation NB(r, p)
Parameters r > 0 — number of failures until the experiment is stopped (integer,
but the definition can also be extended to reals)
p ∈ (0,1) — success probability in each experiment (real)
Support k ∈ { 0, 1, 2, 3, … } — number of successes
PMF C(k + r − 1, k) · p^k (1 − p)^r, involving a binomial coefficient
Mean pr/(1 − p)
Mode ⌊p(r − 1)/(1 − p)⌋ if r > 1, 0 if r ≤ 1
Variance pr/(1 − p)²
Skewness (1 + p)/√(pr)
Ex. kurtosis 6/r + (1 − p)²/(pr)
MGF ( (1 − p)/(1 − p e^t) )^r for t < −ln p
CF ( (1 − p)/(1 − p e^{it}) )^r
PGF ( (1 − p)/(1 − p z) )^r
In probability theory and statistics, the negative binomial distribution is a discrete probability distribution of the
number of successes in a sequence of Bernoulli trials before a specified (non-random) number of failures (denoted r)
occur. For example, if one throws a die repeatedly until the third time “1” appears, then the probability distribution of
the number of non-“1”s that had appeared will be negative binomial.
The Pascal distribution (after Blaise Pascal) and Polya distribution (for George Pólya) are special cases of the
negative binomial. There is a convention among engineers, climatologists, and others to reserve “negative binomial”
in a strict sense or “Pascal” for the case of an integer-valued stopping-time parameter r, and use “Polya” for the
real-valued case. The Polya distribution more accurately models occurrences of “contagious” discrete events, like
Definition
Suppose there is a sequence of independent Bernoulli trials, each trial having two potential outcomes called
“success” and “failure”. In each trial the probability of success is p and of failure is (1 − p). We are observing this
sequence until a predefined number r of failures has occurred. Then the random number of successes we have seen,
X, will have the negative binomial (or Pascal) distribution:

X ~ NB(r, p).
When applied to real-world problems, outcomes of success and failure may or may not be outcomes we ordinarily
view as good and bad, respectively. Suppose we used the negative binomial distribution to model the number of days
a certain machine works before it breaks down. In this case “success” would be the result on a day when the machine
worked properly, whereas a breakdown would be a “failure”. If we used the negative binomial distribution to model
the number of goal attempts a sportsman makes before scoring a goal, though, then each unsuccessful attempt would
be a “success”, and scoring a goal would be “failure”. If we are tossing a coin, then the negative binomial distribution
can give the number of heads (“success”) we are likely to encounter before we encounter a certain number of tails
(“failure”).
The probability mass function of the negative binomial distribution is

f(k; r, p) = Pr(X = k) = C(k + r − 1, k) · (1 − p)^r p^k,  for k = 0, 1, 2, ….

This quantity can alternatively be written in the following manner, explaining the name “negative binomial”:

C(k + r − 1, k) = (k + r − 1)(k + r − 2) ⋯ r / k! = (−1)^k · C(−r, k).   (*)

To understand the above definition of the probability mass function, note that the probability for every specific
sequence of k successes and r failures is p^k (1 − p)^r, because the outcomes of the k + r trials are supposed to happen
independently. Since the rth failure comes last, it remains to choose the k trials with successes out of the remaining
k + r − 1 trials. The above binomial coefficient, due to its combinatorial interpretation, gives precisely the number of
all these sequences of length k + r − 1.
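The pmf and the mean pr/(1 − p) can be checked with a few lines of Python (the parameter values and the truncation point of the infinite sum are illustrative):

```python
from math import comb

def nb_pmf(k, r, p):
    """P(X = k): probability of seeing k successes (each with
    probability p) before the r-th failure, as in the formula above."""
    return comb(k + r - 1, k) * (1 - p) ** r * p ** k

r, p = 3, 0.4
# The probabilities sum to 1 (truncating the infinite sum far in the tail)
total = sum(nb_pmf(k, r, p) for k in range(500))
# and the mean comes out to p*r/(1 - p)
mean = sum(k * nb_pmf(k, r, p) for k in range(500))
```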
Extension to real-valued r
It is possible to extend the definition of the negative binomial distribution to the case of a positive real parameter r.
Although it is impossible to visualize a non-integer number of “failures”, we can still formally define the distribution
through its probability mass function.
As before, we say that X has a negative binomial (or Pólya) distribution if it has a probability mass function:
f(k; r, p) = C(k + r − 1, k) p^k (1 − p)^r,  k = 0, 1, 2, …
Here r is a real, positive number. The binomial coefficient is then defined by the multiplicative formula and can also
be rewritten using the gamma function:
C(k + r − 1, k) = (k + r − 1)(k + r − 2)⋯r / k! = Γ(k + r) / (k! Γ(r)).   (*)
Note that by the binomial series and (*) above, for every 0 ≤ p < 1,
Σ_{k=0}^∞ C(k + r − 1, k) p^k (1 − p)^r = (1 − p)^r · (1 − p)^{−r} = 1,
hence the terms of the probability mass function indeed add up to one.
Alternative formulations
Some textbooks may define the negative binomial distribution slightly differently than it is done here. The most
common variations are:
• The definition where X is the total number of trials needed to get r failures, not simply the number of successes.
Since the total number of trials is equal to the number of successes plus the number of failures, this definition
differs from ours by adding constant r. In order to convert formulas written with this definition into the one used
in the article, replace everywhere “k” with “k - r”, and also subtract r from the mean, the median, and the mode. In
order to convert formulas of this article into this alternative definition, replace “k” with “k + r” and add r to the
mean, the median and the mode. Effectively, this implies using the probability mass function
f(k; r, p) = C(k − 1, k − r) p^{k − r} (1 − p)^r = C(k − 1, r − 1) p^{k − r} (1 − p)^r,  k = r, r + 1, r + 2, …,
which perhaps resembles the binomial distribution more closely than the version above. Note that the arguments
of the binomial coefficient are decremented due to order: the last "failure" must occur last, and so the other events
have one fewer positions available when counting possible orderings. Note that this definition of the negative
binomial distribution does not easily generalize to a positive, real parameter r.
• The definition where p denotes the probability of a failure, not of a success. In order to convert formulas between
this definition and the one used in the article, replace “p” with “1 − p” everywhere.
• The definition where the support X is defined as the number of failures, rather than the number of successes. This
definition — where X counts failures but p is the probability of success — has exactly the same formulas as in the
previous case where X counts successes but p is the probability of failure. However, the corresponding text will
have the words “failure” and “success” swapped compared with the previous case.
• The two alterations above may be applied simultaneously, i.e. X counts total trials, and p is the probability of
failure.
Occurrence
Overdispersed Poisson
The negative binomial distribution, especially in its alternative parameterization described above, can be used as an
alternative to the Poisson distribution. It is especially useful for discrete data over an unbounded positive range
whose sample variance exceeds the sample mean. In such cases, the observations are overdispersed with respect to a
Poisson distribution, for which the mean is equal to the variance. Hence a Poisson distribution is not an appropriate
model. Since the negative binomial distribution has one more parameter than the Poisson, the second parameter can
be used to adjust the variance independently of the mean. See Cumulants of some discrete probability distributions.
An application of this is to annual counts of tropical cyclones in the North Atlantic or to monthly to 6-monthly
counts of wintertime extratropical cyclones over Europe, for which the variance is greater than the mean.[1][2][3] In
the case of modest overdispersion, this may produce substantially similar results to an overdispersed Poisson
distribution.[4][5]
Related distributions
• The geometric distribution (on { 0, 1, 2, 3, ... }) is a special case of the negative binomial distribution, with r = 1: it gives the number of successes before the first failure.
• The negative binomial distribution is a special case of the discrete phase-type distribution.
Poisson distribution
Consider a sequence of negative binomial distributions where the stopping parameter r goes to infinity, whereas the
probability of success in each trial, p, goes to zero in such a way as to keep the mean of the distribution constant.
Denoting this mean λ, the parameter p will have to be
p = λ/(r + λ).
The probability mass function can then be written as the product
f(k; r, p) = (λ^k / k!) · [Γ(k + r) / (Γ(r) (r + λ)^k)] · [1 / (1 + λ/r)^r].
Now if we consider the limit as r → ∞, the second factor will converge to one, and the third to the exponential
function:
lim_{r→∞} f(k; r, p) = (λ^k / k!) e^{−λ},
which is the mass function of a Poisson-distributed random variable with expected value λ.
In other words, the alternatively parameterized negative binomial distribution converges to the Poisson distribution
and r controls the deviation from the Poisson. This makes the negative binomial distribution suitable as a robust
alternative to the Poisson, which approaches the Poisson for large r, but which has larger variance than the Poisson
for small r.
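This limit can be checked numerically. A small sketch (assuming the parameterization above, with p = λ/(r + λ) and a large integer r):

```python
from math import comb, exp, factorial

def nb_pmf(k, r, p):
    # negative binomial pmf: k successes before the r-th failure
    return comb(k + r - 1, k) * p**k * (1 - p)**r

def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

lam, r = 4.0, 2000            # large r; p is chosen so the mean stays at lam
p = lam / (r + lam)
# The pointwise distance between the NB pmf and the Poisson pmf shrinks as r grows
err = max(abs(nb_pmf(k, r, p) - poisson_pmf(k, lam)) for k in range(25))
```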
Gamma–Poisson mixture
The negative binomial distribution also arises as a continuous mixture of Poisson distributions (i.e. a compound
probability distribution) where the mixing distribution of the Poisson rate is a gamma distribution. That is, we can
view the negative binomial as a Poisson(λ) distribution, where λ is itself a random variable, distributed according to
Gamma(r, p/(1 − p)).
Formally, this means that the mass function of the negative binomial distribution can be written as
f(k; r, p) = ∫_0^∞ (λ^k e^{−λ} / k!) · (λ^{r−1} e^{−λ(1−p)/p} / (Γ(r) (p/(1−p))^r)) dλ = C(k + r − 1, k) p^k (1 − p)^r.
Because of this, the negative binomial distribution is also known as the gamma–Poisson (mixture) distribution.
In this sense, the negative binomial distribution is the "inverse" of the binomial distribution.
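A quick Monte Carlo check of the mixture representation (a sketch; the Poisson sampler below is an illustrative helper using Knuth's method, not a library call):

```python
import random
from math import exp

def poisson_sample(lam, rng):
    """Knuth's method; adequate for small rates."""
    threshold = exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(1)
r, p = 3, 0.4
scale = p / (1 - p)                    # gamma scale parameter from the text
n = 200_000
# Draw the Poisson rate from Gamma(r, p/(1-p)), then a Poisson count
samples = [poisson_sample(rng.gammavariate(r, scale), rng) for _ in range(n)]
mean = sum(samples) / n                # NB mean: p*r/(1-p) = 2
p0 = samples.count(0) / n              # NB P(X=0): (1-p)**r = 0.216
```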
The sum of independent negative-binomially distributed random variables r1 and r2 with the same value for
parameter p is negative-binomially distributed with the same p but with "r-value" r1 + r2.
The negative binomial distribution is infinitely divisible, i.e., if Y has a negative binomial distribution, then for any
positive integer n, there exist independent identically distributed random variables Y1, ..., Yn whose sum has the same
distribution that Y has.
Let Y_1, Y_2, … be a sequence of independent and identically distributed random variables, each having the
logarithmic distribution with parameter p, that is, with mass function Pr(Y_1 = k) = −p^k/(k ln(1 − p)) for k = 1, 2, 3, ….
Let N be a random variable, independent of the sequence, and suppose that N has a Poisson distribution with
parameter λ = −r ln(1 − p). Then the random sum
X = Y_1 + Y_2 + ⋯ + Y_N
is NB(r,p)-distributed. To prove this, we calculate the probability generating function GX of X, which is the
composition of the probability generating functions GN and GY1. Using
G_N(z) = exp(λ(z − 1))
and
G_{Y_1}(z) = ln(1 − pz) / ln(1 − p),
we obtain
G_X(z) = G_N(G_{Y_1}(z)) = exp(−r ln(1 − p)(ln(1 − pz)/ln(1 − p) − 1)) = ((1 − p)/(1 − pz))^r,
which is the probability generating function of the NB(r, p) distribution.
Properties
Suppose r is known, and the probability is to be estimated from the observed number of successes k before the rth
failure. The maximum likelihood estimate of the failure probability 1 − p is then r/(r + k), but this is a biased estimate. Its
inverse, (r + k)/r, is an unbiased estimate of 1/(1 − p), however.[6]
in which the upper bound of summation is infinite. In this case, the binomial coefficient C(n, k)
is defined when n is a real number, instead of just a positive integer. But in our case of the binomial distribution it is
zero when k > n. We can then say, for example,
is just the probability that the number of failures before the rth success is equal to k, provided r is an integer. (If r is a
negative non-integer, so that the exponent is a positive non-integer, then some of the terms in the sum above are
negative, so we do not have a probability distribution on the set of all nonnegative integers.)
Now we also allow non-integer values of r. Then we have a proper negative binomial distribution, which is a
generalization of the Pascal distribution, which coincides with the Pascal distribution when r happens to be a positive
integer.
Recall from above that
The sum of independent negative-binomially distributed random variables r1 and r2 with the same value for
parameter p is negative-binomially distributed with the same p but with "r-value" r1 + r2.
This property persists when the definition is thus generalized, and affords a quick way to see that the negative
binomial distribution is infinitely divisible.
Parameter estimation
Suppose k_1, …, k_N are independent observations from a negative binomial NB(r, p) distribution. The log-likelihood is
ℓ(r, p) = Σ_{i=1}^N [ln Γ(k_i + r) − ln(k_i!) − ln Γ(r)] + (Σ_i k_i) ln p + N r ln(1 − p).
To find the maximum we take the partial derivatives with respect to r and p and set them equal to zero:
∂ℓ/∂p = (Σ_i k_i)/p − N r/(1 − p) = 0
and
∂ℓ/∂r = Σ_{i=1}^N ψ(k_i + r) − N ψ(r) + N ln(1 − p) = 0,
where
ψ(x) is the digamma function. Solving the first equation gives p = Σ_i k_i / (N r + Σ_i k_i); substituting this into the
second yields an equation in r alone. This equation cannot be solved in closed form. If a numerical solution is
desired, an iterative technique such as Newton's method can be used.
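As a sketch of the numerical approach: for fixed r the likelihood is maximized at p = Σk_i/(Nr + Σk_i), so p can be profiled out and r found by a one-dimensional search (the search bounds and simulated data below are illustrative assumptions):

```python
import random
from math import lgamma, log

def nb_profile_loglik(r, ks):
    """Log-likelihood with p replaced by its conditional optimum
    p = sum(k) / (N*r + sum(k))."""
    n, s = len(ks), sum(ks)
    p = s / (n * r + s)
    return (sum(lgamma(k + r) - lgamma(k + 1) for k in ks)
            - n * lgamma(r) + s * log(p) + n * r * log(1 - p))

def fit_r(ks, lo=1e-3, hi=50.0, iters=60):
    """Golden-section search; assumes the profile likelihood is unimodal."""
    g = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if nb_profile_loglik(c, ks) > nb_profile_loglik(d, ks):
            b = d
        else:
            a = c
    return (a + b) / 2

# Simulate NB(r=4, p=0.3) data: total successes before the 4th failure.
rng = random.Random(0)
def nb_sample():
    total = 0
    for _ in range(4):                 # r = 4 failures
        while rng.random() < 0.3:      # count successes before each failure
            total += 1
    return total

data = [nb_sample() for _ in range(5000)]
r_hat = fit_r(data)
p_hat = sum(data) / (len(data) * r_hat + sum(data))
```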
Examples
Selling candy
Pat is required to sell candy bars to raise money for the 6th grade field trip. There are thirty houses in the
neighborhood, and Pat is not supposed to return home until five candy bars have been sold. So the child goes door to
door, selling candy bars. At each house, there is a 0.4 probability of selling one candy bar and a 0.6 probability of
selling nothing.
What's the probability of selling the last candy bar at the nth house?
Recall that the NegBin(r, p) distribution describes the probability of k failures and r successes in k + r Bernoulli(p)
trials with success on the last trial. Selling five candy bars means getting five successes. The number of trials (i.e.
houses) this takes is therefore k + 5 = n. The random variable we are interested in is the number of houses, so we
substitute k = n − 5 into a NegBin(5, 0.4) mass function and obtain the following mass function of the distribution of
houses (for n ≥ 5):
f(n) = C(n − 1, 4) (0.4)^5 (0.6)^{n−5}.
What's the probability that Pat finishes on or before reaching the eighth house?
To finish on or before the eighth house, Pat must finish at the fifth, sixth, seventh, or eighth house. Sum those
probabilities:
f(5) + f(6) + f(7) + f(8) = 0.01024 + 0.03072 + 0.055296 + 0.0774144 ≈ 0.1737.
What's the probability that Pat exhausts all 30 houses in the neighborhood?
This can be expressed as the probability that Pat does not finish on the fifth through the thirtieth house:
1 − Σ_{n=5}^{30} f(n) ≈ 0.0015.
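The three answers can be computed directly from the mass function of the example (a sketch):

```python
from math import comb

def prob_finish_at(n, r=5, p=0.4):
    """Probability the r-th sale happens exactly at house n (n >= r)."""
    return comb(n - 1, r - 1) * p**r * (1 - p)**(n - r)

p_house_5 = prob_finish_at(5)                                # 0.4**5 = 0.01024
p_by_8 = sum(prob_finish_at(n) for n in range(5, 9))         # about 0.1737
p_all_30 = 1 - sum(prob_finish_at(n) for n in range(5, 31))  # small probability
```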
References
[1] Villarini, G.; Vecchi, G.A. and Smith, J.A. (2010). "Modeling of the dependence of tropical storm counts in the North Atlantic Basin on
climate indices". Monthly Weather Review 138 (7): 2681–2705. doi:10.1175/2010MWR3315.1.
[2] Mailier, P.J.; Stephenson, D.B.; Ferro, C.A.T.; Hodges, K.I. (2006). "Serial Clustering of Extratropical Cyclones". Monthly Weather Review
134 (8): 2224–2240. doi:10.1175/MWR3160.1.
[3] Vitolo, R.; Stephenson, D.B.; Cook, Ian M.; Mitchell-Wallace, K. (2009). "Serial clustering of intense European storms". Meteorologische
Zeitschrift 18 (4): 411–424. doi:10.1127/0941-2948/2009/0393.
[4] McCullagh, Peter; Nelder, John (1989). Generalized Linear Models, Second Edition. Boca Raton: Chapman and Hall/CRC.
ISBN 0-412-31760-5.
[5] Cameron, Adrian C.; Trivedi, Pravin K. (1998). Regression analysis of count data. Cambridge University Press. ISBN 0-521-63567-5.
[6] J. B. S. Haldane, "On a Method of Estimating Frequencies", Biometrika, Vol. 33, No. 3 (Nov., 1945), pp. 222–225. JSTOR 2332299
[7] Spencer, Paul, 1998, The Pastoral Continuum: the Marginalization of Tradition in East Africa, Clarendon Press, Oxford (pp. 51-92).
Further reading
• Hilbe, Joseph M., Negative Binomial Regression, Cambridge, UK: Cambridge University Press (2007) Negative
Binomial Regression – Cambridge University Press (http://www.cambridge.org/uk/catalogue/catalogue.
asp?isbn=9780521857727)
Noncentral chi-squared distribution 414
In probability theory and statistics, the noncentral chi-squared or noncentral χ² distribution is a generalization of
the chi-squared distribution. This distribution often arises in the power analysis of statistical tests in which the null
distribution is (perhaps asymptotically) a chi-squared distribution; important examples of such tests are the
likelihood ratio tests.
Background
Let (X_1, X_2, …, X_k) be k independent, normally distributed random variables with means μ_i and variances σ_i². Then the
random variable
X = Σ_{i=1}^k (X_i/σ_i)²
is distributed according to the noncentral chi-squared distribution. It has two parameters: k, which specifies the
number of degrees of freedom (i.e. the number of X_i), and λ, which is related to the mean of the random variables X_i
by:
λ = Σ_{i=1}^k (μ_i/σ_i)².
λ is sometimes called the noncentrality parameter. Note that some references define λ in other ways, such as half
of the above sum, or its square root.
This distribution arises in multivariate statistics as a derivative of the multivariate normal distribution. While the
central chi-squared distribution is the squared norm of a random vector with N(0_k, I_k) distribution (i.e., the
squared distance from the origin of a point taken at random from that distribution), the noncentral χ² is the squared
norm of a random vector with N(μ, I_k) distribution. Here 0_k is a zero vector of length k, μ = (μ_1, …, μ_k), and I_k
is the identity matrix of size k.
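A Monte Carlo sanity check of this construction (a sketch; the means below are arbitrary choices, with unit variances so that λ = Σ μ_i²):

```python
import random

rng = random.Random(42)
mus = [1.0, -0.5, 2.0]            # arbitrary means; k = 3
k, lam = len(mus), sum(m * m for m in mus)   # noncentrality lambda = 5.25

n = 200_000
# Sample sum of squared N(mu_i, 1) variables and average
mean = sum(
    sum(rng.gauss(m, 1.0) ** 2 for m in mus) for _ in range(n)
) / n
# The noncentral chi-squared mean is k + lambda = 8.25
```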
Definition
The probability density function is given by
f_X(x; k, λ) = (1/2) e^{−(x+λ)/2} (x/λ)^{k/4 − 1/2} I_{k/2−1}(√(λx)),  x ≥ 0,
where I_ν(y) is a modified Bessel function of the first kind. Using the relation between Bessel functions and
hypergeometric functions, the pdf can also be written in terms of the confluent hypergeometric limit function.[1]
Siegel (1979) discusses the case k=0 specifically (zero degrees of freedom), in which case the distribution has a
discrete component at zero.
Properties
Moments
The first few raw moments are:
E[X] = k + λ,
E[X²] = (k + λ)² + 2(k + 2λ),
E[X³] = (k + λ)³ + 6(k + λ)(k + 2λ) + 8(k + 3λ).
Cumulative distribution function
Since the noncentral chi-squared distribution is a Poisson-weighted mixture of central chi-squared distributions, the
cumulative distribution function can be written as
F(x; k, λ) = Σ_{j=0}^∞ e^{−λ/2} ((λ/2)^j / j!) Q(x; k + 2j),
where Q(x; k) is the cumulative distribution function of the central chi-squared distribution with k degrees of
freedom, which is given by
Q(x; k) = γ(k/2, x/2) / Γ(k/2),
with γ(s, t) the lower incomplete gamma function.
Approximation
Sankaran [3] discusses a number of closed form approximations for the cumulative distribution function. In an earlier
paper,[4] he derived and states the following approximation:
where
Φ(·) denotes the cumulative distribution function of the standard normal distribution;
4. Expand the cosh term in a Taylor series. This gives the Poisson-weighted mixture representation of the density,
still for k=1. The indices on the chi-squared random variables in the series above are 1+2i in this case.
5. Finally, for the general case. We've assumed, without loss of generality, that X_2, …, X_k are standard normal,
and so X_2² + ⋯ + X_k² has a central chi-squared distribution with (k − 1) degrees of freedom, independent of
X_1². Using the Poisson-weighted mixture representation for X_1², and the fact that the sum of independent
chi-squared random variables is also chi-squared, completes the result. The indices in the series are
(1 + 2i) + (k − 1) = k + 2i as required.
Related distributions
• If V is chi-squared distributed, V ~ χ²_k, then V is also noncentral chi-squared distributed: V ~ χ′²_k(0).
• If V_1 ~ χ′²_{n_1}(λ) and V_2 ~ χ²_{n_2}, and V_1 is independent of V_2, then a noncentral F-distributed variable is
developed as (V_1/n_1)/(V_2/n_2).
• If J ~ Poisson(λ/2), then χ²_{k+2J} ~ χ′²_k(λ).
Transformations
Sankaran (1963) discusses the transformations of the form . He analyzes the
expansions of the cumulants of up to the term and shows that the following choices of
produce reasonable results:
• makes the second cumulant of approximately independent of
• makes the third cumulant of approximately independent of
• makes the fourth cumulant of approximately independent of
Also, a simpler transformation can be used as a variance stabilizing transformation
that produces a random variable with mean and variance .
Usability of these transformations may be hampered by the need to take the square roots of negative numbers.
• chi-squared distribution
• chi distribution
Notes
[1] Muirhead (2005) Theorem 1.3.4
[2] Nuttall, Albert H. (1975): Some Integrals Involving the QM Function (http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1055327),
IEEE Transactions on Information Theory, 21(1), 95–96, ISSN 0018-9448
[3] Sankaran, M. (1963). Approximations to the non-central chi-squared distribution (http://biomet.oxfordjournals.org/cgi/content/citation/50/1-2/199), Biometrika, 50(1-2), 199–204
[4] Sankaran, M. (1959). "On the non-central chi-squared distribution", Biometrika 46, 235–237
[5] Johnson et al. (1995) Section 29.8
[6] Muirhead (2005) pages 22–24 and problem 1.18.
References
• Abramowitz, M. and Stegun, I.A. (1972), Handbook of Mathematical Functions, Dover. Section 26.4.25. (http://
www.math.sfu.ca/~cbm/aands/page_942.htm)
• Johnson, N. L., Kotz, S., Balakrishnan, N. (1970), Continuous Univariate Distributions, Volume 2, Wiley. ISBN
0-471-58494-0
• Muirhead, R. (2005) Aspects of Multivariate Statistical Theory (2nd Edition). Wiley. ISBN 0-471-76985-1
• Siegel, A.F. (1979), "The noncentral chi-squared distribution with zero degrees of freedom and testing for
uniformity", Biometrika, 66, 381–386
• Press, S.J. (1966), "Linear combinations of non-central chi-squared variates", The Annals of Mathematical
Statistics 37 (2): 480–487, JSTOR 2238621
Noncentral F-distribution
In probability theory and statistics, the noncentral F-distribution is a continuous probability distribution that is a
generalization of the (ordinary) F-distribution. It describes the distribution of the quotient (X/n1)/(Y/n2), where the
numerator X has a noncentral chi-squared distribution with n1 degrees of freedom and the denominator Y has a
central chi-squared distribution with n2 degrees of freedom. It is also required that X and Y are statistically independent of
each other.
It is the distribution of the test statistic in analysis of variance problems when the null hypothesis is false. The
noncentral F-distribution is used to find the power function of such a test.
F = (X/n_1)/(Y/n_2) is a noncentral F-distributed random variable. The probability density function for the noncentral F-distribution is [1]
p(f) = Σ_{k=0}^∞ [e^{−λ/2} (λ/2)^k / (B(ν_2/2, ν_1/2 + k) k!)] (ν_1/ν_2)^{ν_1/2 + k} (ν_2/(ν_2 + ν_1 f))^{(ν_1 + ν_2)/2 + k} f^{ν_1/2 − 1 + k}
when f ≥ 0, and zero otherwise. The degrees of freedom ν_1 and ν_2 are positive. The noncentrality parameter λ
is nonnegative. The term B(m, n) is the beta function, where
B(m, n) = Γ(m)Γ(n)/Γ(m + n).
Special cases
When λ = 0, the noncentral F-distribution becomes the F-distribution.
Related distributions
Z has a noncentral chi-squared distribution if Z = lim_{ν_2→∞} ν_1 F, where F has a noncentral F-distribution with ν_1 and ν_2 degrees of freedom and noncentrality parameter λ; then Z ~ χ′²_{ν_1}(λ).
Implementations
The noncentral F-distribution is implemented in the R language (e.g., pf function), in MATLAB (ncfcdf, ncfinv,
ncfpdf, ncfrnd and ncfstat functions in the statistics toolbox) in Mathematica (NoncentralFRatioDistribution
function), in NumPy (random.noncentral_f), and in Boost C++ Libraries.[2]
A collaborative wiki page implements an interactive online calculator, programmed in R language, for noncentral t,
chisquare, and F, at the Institute of Statistics and Econometrics, School of Business and Economics,
Humboldt-Universität zu Berlin.[3]
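Independently of any particular library, the defining construction can also be simulated directly. A sketch using only the standard library (the parameter values are arbitrary):

```python
import random
from math import sqrt

rng = random.Random(0)

def noncentral_chi2(df, lam, rng):
    # Sum of df squared normals, with the noncentrality folded into one mean.
    total = rng.gauss(sqrt(lam), 1.0) ** 2
    for _ in range(df - 1):
        total += rng.gauss(0.0, 1.0) ** 2
    return total

n1, n2, lam = 5, 12, 3.0
n = 100_000
mean = sum(
    (noncentral_chi2(n1, lam, rng) / n1) / (noncentral_chi2(n2, 0.0, rng) / n2)
    for _ in range(n)
) / n
# E[F] = n2*(n1 + lam)/(n1*(n2 - 2)) = 12*8/(5*10) = 1.92 for n2 > 2
```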
Notes
[1] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory, (New Jersey: Prentice Hall, 1998), p. 29.
[2] John Maddock, Paul A. Bristow, Hubert Holin, Xiaogang Zhang, Bruno Lalande, Johan Råde. "Noncentral F Distribution: Boost 1.39.0"
(http://www.boost.org/doc/libs/1_39_0/libs/math/doc/sf_and_dist/html/math_toolkit/dist/dist_ref/dists/nc_f_dist.html). Boost.org.
Retrieved 20 August 2011.
[3] Sigbert Klinke (10 December 2008). "Comparison of noncentral and central distributions" (http://mars.wiwi.hu-berlin.de/mediawiki/slides/index.php/Comparison_of_noncentral_and_central_distributions). Humboldt-Universität zu Berlin.
References
• Weisstein, Eric W., et al. "Noncentral F-distribution" (http://mathworld.wolfram.com/
NoncentralF-Distribution.html). MathWorld. Wolfram Research, Inc. Retrieved 20 August 2011.
Noncentral t-distribution
In probability and statistics, the noncentral t-distribution (also known as the singly noncentral t-distribution)
generalizes Student's t-distribution using a noncentrality parameter. Like the central t-distribution, the noncentral
t-distribution is primarily used in statistical inference, although it may also be used in robust modeling for data. In
particular, the noncentral t-distribution arises in power analysis.
Characterization
If Z is a normally distributed random variable with unit variance and zero mean, and V is a chi-squared
distributed random variable with ν degrees of freedom that is statistically independent of Z, then
T = (Z + μ)/√(V/ν)
is a noncentral t-distributed random variable with ν degrees of freedom and noncentrality parameter μ. Note that
the noncentrality parameter may be negative.
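The characterization can be verified by simulation against the known mean, μ √(ν/2) Γ((ν−1)/2)/Γ(ν/2) for ν > 1 (a sketch with arbitrary ν and μ):

```python
import random
from math import gamma, sqrt

rng = random.Random(7)
nu, mu = 10, 1.5

def noncentral_t_sample():
    z = rng.gauss(0.0, 1.0)                               # N(0, 1)
    v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))  # chi-squared, nu df
    return (z + mu) / sqrt(v / nu)

n = 100_000
mc_mean = sum(noncentral_t_sample() for _ in range(n)) / n
exact_mean = mu * sqrt(nu / 2) * gamma((nu - 1) / 2) / gamma(nu / 2)
```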
where
and
is the cumulative distribution function of the standard normal distribution.
Alternatively, the noncentral t-distribution CDF can be expressed as:
where is the gamma function and is the regularized incomplete beta function.
Although there are other forms of the cumulative distribution function, the first form presented above is very easy to
evaluate through recursive computing.[1] In statistical software R, the cumulative distribution function is
implemented as pt.
A third form of the density is obtained using its cumulative distribution functions, as follows.
Properties
and
Asymmetry
The noncentral t-distribution is asymmetric unless μ is zero, i.e., unless it is a central t-distribution. The right tail will be
heavier than the left when μ > 0, and vice versa. However, the usual skewness is not generally a good measure of
asymmetry for this distribution, because if the number of degrees of freedom is not larger than 3, the third moment does not
exist at all. Even when the number of degrees of freedom is greater than 3, the sample estimate of the skewness is still very
unstable unless the sample size is very large.
Mode
The noncentral t-distribution is always unimodal and bell shaped, but the mode is not analytically available,
although it always lies in the interval[4]
when and
when
Moreover, the mode always has the same sign as the noncentrality parameter μ, and the negative of the mode is
exactly the mode for a noncentral t-distribution with the same number of degrees of freedom but noncentrality
parameter −μ.
The mode is strictly increasing with μ when μ > 0 and strictly decreasing with μ when μ < 0. In the limit,
when μ approaches zero, the mode is approximated by
Occurrences
Suppose X_1, …, X_n is an independent and identically distributed sample from a normal distribution with mean θ and
variance σ², and we wish to test the null hypothesis θ = θ_0. The usual t-statistic can be written as
T = (X̄ − θ_0)/(S/√n) = [(X̄ − θ)/(σ/√n) + (θ − θ_0)/(σ/√n)] / √(S²/σ²),
where X̄ is the sample mean and S² is the unbiased sample variance. Since the right hand side of the second
equality exactly matches the characterization of a noncentral t-distribution as described above, T has a noncentral
t-distribution with n − 1 degrees of freedom and noncentrality parameter δ = √n(θ − θ_0)/σ.
If the test procedure rejects the null hypothesis whenever |T| > t_{1−α/2}, where t_{1−α/2} is the upper α/2 quantile
of the (central) Student's t-distribution for a pre-specified α ∈ (0, 1), then the power of this test is given by
1 − F_{n−1,δ}(t_{1−α/2}) + F_{n−1,δ}(−t_{1−α/2}),
where F_{n−1,δ} is the cumulative distribution function of the noncentral t-distribution with n − 1 degrees of freedom
and noncentrality parameter δ.
Similar applications of the noncentral t-distribution can be found in the power analysis of the general normal-theory
linear models, which includes the above one sample t-test as a special case.
Related distributions
• Central t distribution: The central t-distribution can be converted into a location/scale family. This family of
distributions is used in data modeling to capture various tail behaviors. The location/scale generalization of the
central t-distribution is a different distribution from the noncentral t-distribution discussed in this article. In
particular, this approximation does not respect the asymmetry of the noncentral t-distribution. However, the
central t-distribution can be used as an approximation to the non-central t-distribution.[5]
• If T is noncentral t-distributed with ν degrees of freedom and noncentrality parameter μ and F = T², then F
has a noncentral F-distribution with 1 numerator degree of freedom, ν denominator degrees of freedom, and
noncentrality parameter μ².
• If T is noncentral t-distributed with ν degrees of freedom and noncentrality parameter μ and Z = lim_{ν→∞} T,
then Z has a normal distribution with mean μ and unit variance.
• When the denominator noncentrality parameter of a doubly noncentral t-distribution is zero, then it becomes a
noncentral t-distribution.
Special cases
• When μ = 0, the noncentral t-distribution becomes the central (Student's) t-distribution with the same degrees
of freedom.
References
[1] Lenth, Russell V (1989). "Algorithm AS 243: Cumulative Distribution Function of the Non-central t Distribution". Journal of the Royal
Statistical Society. Series C (Applied Statistics) 38: 185–189. JSTOR 2347693.
[2] L. Scharf, Statistical Signal Processing, (Massachusetts: Addison-Wesley, 1991), p.177.
[3] Hogben, D; Wilk, MB (1961). "The moments of the non-central t-distribution". Biometrika 48: 465–468. JSTOR 2332772.
[4] van Aubel, A; Gawronski, W (2003). "Analytic properties of noncentral distributions" (http://www.sciencedirect.com/science/article/B6TY8-47G44WX-V/2/7705d2642b1a384b13e0578898a22d48). Applied Mathematics and Computation 141: 3–12.
doi:10.1016/S0096-3003(02)00316-8. .
[5] Helena Chmura Kraemer; Minja Paik (1979). "A Central t Approximation to the Noncentral t Distribution". Technometrics 21 (3): 357–360.
JSTOR 1267759.
External links
• Eric W. Weisstein. "Noncentral Student's t-Distribution." (http://mathworld.wolfram.com/
NoncentralStudentst-Distribution.html) From MathWorld—A Wolfram Web Resource
Norm (mathematics)
In linear algebra, functional analysis and related areas of mathematics, a norm is a function that assigns a strictly
positive length or size to all vectors in a vector space, other than the zero vector (which has zero length assigned to
it). A seminorm, on the other hand, is allowed to assign zero length to some non-zero vectors (in addition to the zero
vector).
A simple example is the 2-dimensional Euclidean space R2 equipped with the Euclidean norm. Elements in this
vector space (e.g., (3, 7)) are usually drawn as arrows in a 2-dimensional cartesian coordinate system starting at the
origin (0, 0). The Euclidean norm assigns to each vector the length of its arrow. Because of this, the Euclidean norm
is often known as the magnitude.
A vector space with a norm is called a normed vector space. Similarly, a vector space with a seminorm is called a
seminormed vector space.
Notation
The norm of a vector, matrix, or set (its cardinality) is usually denoted using the "double vertical line", Unicode
U+2016 ( ‖ ). For example, the norm of a vector v is usually denoted ‖v‖. Sometimes the vertical line, Unicode
U+007C ( | ), is used (e.g. |v|), but this latter notation is generally discouraged, because it is also used to denote the
absolute value of scalars and the determinant of matrices. The double vertical line should not be confused with the
"parallel to" symbol, Unicode U+2225 ( ∥ ). This is usually not a problem because ‖ is used in parenthesis-like
fashion, whereas ∥ is used as an infix operator.
Definition
Given a vector space V over a subfield F of the complex numbers, a norm on V is a function p: V → R with the
following properties:[1]
For all a ∈ F and all u, v ∈ V,
1. p(av) = |a| p(v), (positive homogeneity or positive scalability).
2. p(u + v) ≤ p(u) + p(v) (triangle inequality or subadditivity).
3. If p(v) = 0 then v is the zero vector (separates points).
A simple consequence of the first two axioms, positive homogeneity and the triangle inequality, is p(0) = 0 and thus
p(v) ≥ 0 (positivity).
A seminorm is a norm with the 3rd property (separating points) removed.
Although every vector space is seminormed (e.g., with the trivial seminorm in the Examples section below), it may
not be normed. Every vector space V with seminorm p(v) induces a normed space V/W, called the quotient space,
where W is the subspace of V consisting of all vectors v in V with p(v) = 0. The induced norm on V/W is clearly
well-defined and is given by:
p(W + v) = p(v).
A topological vector space is called normable (seminormable) if the topology of the space can be induced by a
norm (seminorm).
Examples
• All norms are seminorms.
• The trivial seminorm, with p(x) = 0 for all x in V.
• The absolute value is a norm on the real numbers.
• Every linear form f on a vector space defines a seminorm by x → |f(x)|.
Euclidean norm
On an n-dimensional Euclidean space Rn, the intuitive notion of length of the vector x = (x1, x2, ..., xn) is captured by
the formula
‖x‖ := √(x1² + x2² + ⋯ + xn²).
This gives the ordinary distance from the origin to the point x, a consequence of the Pythagorean theorem. The
Euclidean norm is by far the most commonly used norm on Rn, but there are other norms on this vector space as will
be shown below. However all these norms are equivalent in the sense that they all define the same topology.
On an n-dimensional complex space Cn the most common norm is
‖z‖ := √(|z1|² + |z2|² + ⋯ + |zn|²).
In both cases we can also express the norm as the square root of the inner product of the vector and itself:
‖x‖ := √(x* x),
where x is represented as a column vector ([x1; x2; ...; xn]), and x* denotes its conjugate transpose.
This formula is valid for any inner product space, including Euclidean and complex spaces. For Euclidean spaces,
the inner product is equivalent to the dot product. Hence, in this specific case the formula can be also written with
the following notation:
‖x‖ := √(x · x).
The Euclidean norm is also called the Euclidean length, L2 distance, ℓ2 distance, L2 norm, or ℓ2 norm; see Lp
space.
The set of vectors in Rn+1 whose Euclidean norm is a given positive constant forms an n-sphere.
Absolute-value norm
The absolute value ‖x‖ = |x| is a norm on the one-dimensional vector spaces formed by the real or complex numbers.
Taxicab norm or Manhattan norm
‖x‖1 := |x1| + |x2| + ⋯ + |xn|.
The name relates to the distance a taxi has to drive in a rectangular street grid to get from the origin to the point x.
The set of vectors whose 1-norm is a given constant forms the surface of a cross polytope of dimension equivalent to
that of the norm minus 1. The Taxicab norm is also called the L1 norm. The distance derived from this norm is
called the Manhattan distance or L1 distance.
The 1-norm is simply the sum of the absolute values of the coordinates. In contrast, the plain sum of the coordinates,
x1 + x2 + ⋯ + xn, is not a norm, because it can be negative.
p-norm
Let p ≥ 1 be a real number.
‖x‖p := (Σ_{i=1}^n |xi|^p)^{1/p}.
Note that for p = 1 we get the taxicab norm, for p = 2 we get the Euclidean norm, and as p approaches ∞ the
p-norm approaches the infinity norm or maximum norm.
This definition is still of some interest for 0 < p < 1, but the resulting function does not define a norm,[2] because it
violates the triangle inequality. What is true for this case of 0 < p < 1, even in the measurable analog, is that the
corresponding Lp class is a vector space, and it is also true that the function
(without p-th root) defines a distance that makes Lp(X) into a complete metric topological vector space. These spaces
are of great interest in functional analysis, probability theory, and harmonic analysis. However, outside trivial cases,
this topological vector space is not locally convex and has no continuous nonzero linear forms. Thus the topological
dual space contains only the zero functional.
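A short sketch illustrating the p-norm, its limiting behaviour, and the failure of the triangle inequality for p < 1:

```python
def p_norm(x, p):
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

x = [3.0, -4.0, 1.0]
taxicab = p_norm(x, 1)      # 1-norm: 8.0
euclid = p_norm(x, 2)       # 2-norm: sqrt(26)
near_max = p_norm(x, 200)   # approaches max(|x_i|) = 4 as p grows
# For p < 1 the resulting function violates the triangle inequality:
lhs = p_norm([1.0, 1.0], 0.5)                            # ||u + v|| = 4
rhs = p_norm([1.0, 0.0], 0.5) + p_norm([0.0, 1.0], 0.5)  # 1 + 1 = 2
```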
Maximum norm (special case of: infinity norm, uniform norm, or supremum norm)
‖x‖∞ := max(|x1|, |x2|, …, |xn|).
The set of vectors whose infinity norm is a given constant, c, forms the
surface of a hypercube with edge length 2c.
Zero norm
In probability and functional analysis, the zero norm induces a
complete metric topology for the space of measurable functions and
for the F-space of sequences with F-norm (xn) ↦ Σn 2^−n xn/(1 + xn), which is discussed by Stefan
Rolewicz in Metric Linear Spaces.
In metric geometry, the discrete metric takes the value one for distinct points and zero otherwise. When applied
coordinate-wise to the elements of a vector space, the discrete distance defines the Hamming distance, which is
important in coding and information theory. In the field of real or complex numbers, the distance of the discrete
metric from zero is not homogeneous in the non-zero point; indeed, the distance from zero remains one as its
non-zero argument approaches zero. However, the discrete distance of a number from zero does satisfy the other
properties of a norm, namely the triangle inequality and positive definiteness. When applied component-wise to
vectors, the discrete distance from zero behaves like a non-homogeneous "norm", which counts the number of
non-zero components in its vector argument; again, this non-homogeneous "norm" is discontinuous.
In signal processing and statistics, David Donoho referred to the zero "norm" with quotation marks. Following
Donoho's notation, the zero "norm" of x is simply the number of non-zero coordinates of x, or the Hamming distance
of the vector from zero. When this "norm" is localized to a bounded set, it is the limit of p-norms as p approaches 0.
Of course, the zero "norm" is not a B-norm, because it is not positive homogeneous. It is not even an F-norm,
because it is discontinuous, jointly and severally, with respect to the scalar argument in scalar-vector multiplication
and with respect to its vector argument. Abusing terminology, some engineers omit Donoho's quotation marks and
inappropriately call the number-of-nonzeros function the L0 norm (sic.), also misusing the notation for the Lebesgue
space of measurable functions.
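Donoho's zero "norm" and its relation to the p-norms can be sketched as follows:

```python
def count_nonzero(x):
    """Donoho's zero "norm": the Hamming distance of x from the zero vector."""
    return sum(1 for t in x if t != 0)

def p_power_sum(x, p):
    # sum of |x_i|**p, i.e. the p-norm raised to the p-th power
    return sum(abs(t) ** p for t in x if t != 0)

x = [0.0, 2.5, 0.0, -1.0, 0.25]
zero_norm = count_nonzero(x)                 # 3 non-zero coordinates
limit_approx = p_power_sum(x, 1e-6)          # tends to 3 as p -> 0
scaled = count_nonzero([10 * t for t in x])  # still 3: not homogeneous
```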
Other norms
Other norms on Rn can be constructed by combining the above; for example
is a norm on R4.
For any norm and any injective linear transformation A we can define a new norm of x, equal to
In 2D, with A a rotation by 45° and a suitable scaling, this changes the taxicab norm into the maximum norm. In 2D,
each A applied to the taxicab norm, up to inversion and interchanging of axes, gives a different unit ball: a
parallelogram of a particular shape, size and orientation. In 3D this is similar but different for the 1-norm
(octahedrons) and the maximum norm (prisms with parallelogram base).
All the above formulas also yield norms on Cn without modification.
Norm (mathematics) 429
Infinite-dimensional case
The generalization of the above norms to an infinite number of components leads to the Lp spaces, with norms
||x||p = (Σn |xn|^p)^(1/p) resp. ||f||p = (∫ |f(x)|^p dx)^(1/p)
(for complex-valued sequences x resp. functions f defined on R), which can be further generalized (see Haar measure).
Properties
The concept of unit circle (the set of all vectors of norm 1) is different in different
norms: for the 1-norm the unit circle in R2 is a square, for the 2-norm (Euclidean
norm) it is the well-known unit circle, while for the infinity norm it is a different
square. For any p-norm it is a superellipse (with congruent axes). See the
accompanying illustration. Note that due to the definition of the norm, the unit circle
is always convex and centrally symmetric (therefore, for example, the unit ball may be
a rectangle but cannot be a triangle).
In terms of the vector space, the seminorm defines a topology on the space, and this is
a Hausdorff topology precisely when the seminorm can distinguish between distinct
vectors, which is again equivalent to the seminorm being a norm. The topology thus
defined (by either a norm or a seminorm) can be understood either in terms of
sequences or open sets. A sequence of vectors {vn} is said to converge in norm to v if ||vn − v|| → 0 as n → ∞. Equivalently, the topology consists of all sets that can be represented as a union of open balls.
Two norms ||•||α and ||•||β on a vector space V are called equivalent if there exist positive real numbers C and D such that
C ||x||α ≤ ||x||β ≤ D ||x||α for all x ∈ V.
For instance, on Rn, ||x||2 ≤ ||x||1 ≤ √n ||x||2 and ||x||∞ ≤ ||x||2 ≤ √n ||x||∞.
If the vector space is a finite-dimensional real or complex one, all norms are equivalent; on infinite-dimensional spaces this is no longer true.
Equivalent norms define the same notions of continuity and convergence and for many purposes do not need to be
distinguished. To be more precise the uniform structure defined by equivalent norms on the vector space is
uniformly isomorphic.
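A minimal numerical check of norm equivalence on Rn (helper names are illustrative): for the 1- and 2-norms the constants C = 1 and D = √n suffice:

```python
import math
import random

def norm1(v):
    return sum(abs(c) for c in v)

def norm2(v):
    return math.sqrt(sum(c * c for c in v))

# On R^n the 1- and 2-norms are equivalent with C = 1 and D = sqrt(n):
#   ||x||_2 <= ||x||_1 <= sqrt(n) * ||x||_2
random.seed(0)
n = 10
for _ in range(1000):
    x = [random.uniform(-1, 1) for _ in range(n)]
    assert norm2(x) <= norm1(x) + 1e-12
    assert norm1(x) <= math.sqrt(n) * norm2(x) + 1e-12
```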
Every (semi)-norm is a sublinear function, which implies that every norm is a convex function. As a result, finding a
global optimum of a norm-based objective function is often tractable.
Given a finite family of seminorms pi on a vector space, the sum
p(x) = Σi pi(x)
is again a seminorm.
For any norm p on a vector space V, we have that for all u and v ∈ V:
p(u ± v) ≥ | p(u) − p(v) |
For the lp norms, we have Hölder's inequality:[4]
|xᵀy| ≤ ||x||p ||y||q, where 1/p + 1/q = 1.
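A numerical sanity check of Hölder's inequality for finite vectors, assuming the standard statement |Σ xi yi| ≤ ||x||p ||y||q with conjugate exponents 1/p + 1/q = 1 (function names are ad hoc):

```python
import math
import random

def p_norm(v, p):
    return sum(abs(c) ** p for c in v) ** (1.0 / p)

def holder_holds(x, y, p):
    """Check |sum x_i y_i| <= ||x||_p * ||y||_q with 1/p + 1/q = 1."""
    q = p / (p - 1.0)          # conjugate exponent
    lhs = abs(sum(a * b for a, b in zip(x, y)))
    rhs = p_norm(x, p) * p_norm(y, q)
    return lhs <= rhs + 1e-12

random.seed(1)
for p in (1.5, 2.0, 3.0):
    for _ in range(500):
        x = [random.uniform(-5, 5) for _ in range(8)]
        y = [random.uniform(-5, 5) for _ in range(8)]
        assert holder_holds(x, y, p)
```

For p = 2 this reduces to the Cauchy–Schwarz inequality.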
Notes
[1] Prugovec̆ki 1981, page 20 (http://books.google.com/books?id=GxmQxn2PF3IC&pg=PA20)
[2] Except in R1, where it coincides with the Euclidean norm, and R0, where it is trivial.
[3] Rolewicz, Stefan (1987), Functional analysis and control theory: Linear systems, Mathematics and its Applications (East European Series),
29 (Translated from the Polish by Ewa Bednarczuk ed.), Dordrecht; Warsaw: D. Reidel Publishing Co.; PWN—Polish Scientific Publishers,
pp. xvi+524, ISBN 90-277-2186-6, MR920371, OCLC 13064804
[4] Golub, Gene; Charles F. Van Loan (1996). Matrix Computations - Third Edition. Baltimore: The Johns Hopkins University Press. p. 53.
ISBN 0-8018-5413-X.
References
• Bourbaki, Nicolas (1987). "Chapters 1–5". Topological vector spaces. Springer. ISBN 3-540-13627-4.
• Prugovec̆ki, Eduard (1981). Quantum mechanics in Hilbert space (2nd ed.). Academic Press. p. 20.
ISBN 0-12-566060-X.
Normal distribution 432
Normal distribution
Probability density function
Notation: N(μ, σ²)
Parameters: μ ∈ R — mean (location)
σ² > 0 — variance (squared scale)
Support: x ∈ R
PDF: (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))
CDF: ½[1 + erf((x−μ)/(σ√2))]
Mean: μ
Median: μ
Mode: μ
Variance: σ²
Skewness: 0
Ex. kurtosis: 0
Entropy: ½ ln(2πeσ²)
MGF: e^(μt + σ²t²/2)
CF: e^(iμt − σ²t²/2)
In probability theory, the normal (or Gaussian) distribution is a continuous probability distribution that has a bell-shaped probability density function, known as the Gaussian function or informally the bell curve:[1]
f(x; μ, σ²) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)).
The parameter μ is the mean or expectation (location of the peak) and σ 2 is the variance. σ is known as the standard
deviation. The distribution with μ = 0 and σ 2 = 1 is called the standard normal distribution or the unit normal
distribution. A normal distribution is often used as a first approximation to describe real-valued random variables
that cluster around a single mean value.
The normal distribution is considered the most prominent probability distribution in statistics. There are several
reasons for this:[2] First, the normal distribution arises from the central limit theorem, which states that under mild
conditions the mean of a large number of random variables drawn from the same distribution is distributed
approximately normally, irrespective of the form of the original distribution. This gives it exceptionally wide
application in, for example, sampling. Secondly, the normal distribution is very tractable analytically, that is, a large
number of results involving this distribution can be derived in explicit form.
For these reasons, the normal distribution is commonly encountered in practice, and is used throughout statistics,
natural sciences, and social sciences[3] as a simple model for complex phenomena. For example, the observational
error in an experiment is usually assumed to follow a normal distribution, and the propagation of uncertainty is
computed using this assumption. Note that a normally-distributed variable has a symmetric distribution about its
mean. Quantities that grow exponentially, such as prices, incomes or populations, are often skewed to the right, and
hence may be better described by other distributions, such as the log-normal distribution or Pareto distribution. In
addition, the probability of seeing a normally-distributed value that is far (i.e. more than a few standard deviations)
from the mean drops off extremely rapidly. As a result, statistical inference using a normal distribution is not robust
to the presence of outliers (data that is unexpectedly far from the mean, due to exceptional circumstances,
observational error, etc.). When outliers are expected, data may be better described using a heavy-tailed distribution
such as the Student's t-distribution.
From a technical perspective, alternative characterizations are possible, for example:
• The normal distribution is the only absolutely continuous distribution all of whose cumulants beyond the first two
(i.e. other than the mean and variance) are zero.
• For a given mean and variance, the corresponding normal distribution is the continuous distribution with the
maximum entropy.[4][5]
The normal distributions are a sub-class of the elliptical distributions.
Definition
The simplest case of a normal distribution is known as the standard normal distribution, described by the probability density function
ϕ(x) = (1/√(2π)) e^(−x²/2).
The factor 1/√(2π) in this expression ensures that the total area under the curve ϕ(x) is equal to one, and the 1/2 in the exponent makes the "width" of the curve (measured as half the distance between the inflection points) also equal to
one. It is traditional in statistics to denote this function with the Greek letter ϕ (phi), whereas density functions for all
other distributions are usually denoted with letters f or p.[6] The alternative glyph φ is also used quite often, however
within this article "φ" is reserved to denote characteristic functions.
Every normal distribution is the result of exponentiating a quadratic function (just as an exponential distribution results from exponentiating a linear function):
f(x) = e^(ax² + bx + c), with a < 0.
This yields the classic "bell curve" shape, provided that a < 0 so that the quadratic function is concave; f(x) > 0 everywhere. One can adjust a to control the "width" of the bell, then adjust b to move the central peak of the bell along the x-axis, and finally one must choose c such that ∫ f(x) dx = 1 (which is only possible when a < 0).
Rather than using a, b, and c, it is far more common to describe a normal distribution by its mean μ = −b/(2a) and variance σ² = −1/(2a). Changing to these new parameters allows one to rewrite the probability density function in a convenient standard form,
f(x; μ, σ²) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)) = (1/σ) ϕ((x − μ)/σ).
For a standard normal distribution, μ = 0 and σ2 = 1. The last part of the equation above shows that any other normal
distribution can be regarded as a version of the standard normal distribution that has been stretched horizontally by a
factor σ and then translated rightward by a distance μ. Thus, μ specifies the position of the bell curve's central peak,
and σ specifies the "width" of the bell curve.
The parameter μ is at the same time the mean, the median and the mode of the normal distribution. The parameter σ2
is called the variance; as for any random variable, it describes how concentrated the distribution is around its mean.
The square root of σ² is called the standard deviation and roughly describes the width of the density function.
The normal distribution is usually denoted by N(μ, σ2).[7] Thus when a random variable X is distributed normally with mean μ and variance σ2, we write
X ~ N(μ, σ²).
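The correspondence between the (a, b, c) parametrization and (μ, σ²) described above can be sketched as follows; the normalizing constant c is computed from a and b so that the density integrates to one (helper names are illustrative):

```python
import math

def normal_pdf(x, mu, var):
    """Standard-form normal density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def quadratic_pdf(x, a, b):
    """exp(a x^2 + b x + c) with c chosen so the density integrates to one.

    Since the integral of exp(a x^2 + b x) over R (with a < 0) equals
    sqrt(pi / -a) * exp(-b^2 / (4a)), normalization forces
    c = b^2/(4a) - (1/2) ln(pi / -a).
    """
    assert a < 0
    c = b * b / (4 * a) - 0.5 * math.log(math.pi / -a)
    return math.exp(a * x * x + b * x + c)

mu, var = 1.5, 4.0
a = -1.0 / (2 * var)      # a determines the width: var = -1/(2a)
b = mu / var              # b shifts the peak: mu = -b/(2a)
for x in (-3.0, 0.0, 1.5, 4.2):
    assert math.isclose(quadratic_pdf(x, a, b), normal_pdf(x, mu, var))
```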
Alternative formulations
Some authors advocate using the precision instead of the variance. The precision is normally defined as the
reciprocal of the variance (τ = σ−2), although it is occasionally defined as the reciprocal of the standard deviation (τ
= σ−1).[8] This parametrization has an advantage in numerical applications where σ2 is very close to zero and is more
convenient to work with in analysis as τ is a natural parameter of the normal distribution. This parametrization is
common in Bayesian statistics, as it simplifies the Bayesian analysis of the normal distribution. Another advantage
of using this parametrization is in the study of conditional distributions in the multivariate normal case. The form of
the normal distribution with the more common definition τ = σ⁻² is as follows:
f(x; μ, τ) = √(τ/(2π)) e^(−τ(x−μ)²/2).
The question of which normal distribution should be called the "standard" one is also answered differently by various authors. Starting from the works of Gauss the standard normal was considered to be the one with variance σ² = 1/2:
f(x) = (1/√π) e^(−x²).
Stigler (1982) goes even further and insists the standard normal be the one with variance σ² = 1/(2π):
f(x) = e^(−πx²).
According to the author, this formulation is advantageous because of a much simpler and easier-to-remember
formula, the fact that the pdf has unit height at zero, and simple approximate formulas for the quantiles of the
distribution.
Characterization
In the previous section the normal distribution was defined by specifying its probability density function. However
there are other ways to characterize a probability distribution. They include: the cumulative distribution function, the
moments, the cumulants, the characteristic function, the moment-generating function, etc.
Probability density function
The density given above is a proper function only when the variance σ2 is not equal to zero. In that case it is a smooth function, defined on the entire real line, which is called the "Gaussian function".
Properties:
• Function f(x) is unimodal and symmetric around the point x = μ, which is at the same time the mode, the median
and the mean of the distribution.[9]
• The inflection points of the curve occur one standard deviation away from the mean (i.e., at x = μ − σ and x = μ +
σ).[9]
• Function f(x) is log-concave.[9]
• The standard normal density ϕ(x) is an eigenfunction of the Fourier transform in that if ƒ is a normalized Gaussian
function with variance σ2, centered at zero, then its Fourier transform is a Gaussian function with variance 1/σ2.
• The function is supersmooth of order 2, implying that it is infinitely differentiable.[10]
• The first derivative of ϕ(x) is ϕ′(x) = −x·ϕ(x); the second derivative is ϕ′′(x) = (x² − 1)ϕ(x). More generally, the n-th derivative is given by ϕ⁽ⁿ⁾(x) = (−1)ⁿ Hn(x) ϕ(x), where Hn is the Hermite polynomial of order n.[11]
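The derivative identity above can be verified exactly with polynomial arithmetic: writing ϕ⁽ⁿ⁾(x) = pn(x)·ϕ(x), the relation ϕ′ = −x·ϕ gives the recursion p(n+1) = pn′ − x·pn, and pn should coincide with (−1)ⁿ Hen, the probabilists' Hermite polynomial. A sketch (all names are ad hoc):

```python
def poly_derivative(p):
    """Derivative of a polynomial given as ascending coefficients."""
    return [i * c for i, c in enumerate(p)][1:] or [0]

def poly_sub(p, q):
    n = max(len(p), len(q))
    p = p + [0] * (n - len(p))
    q = q + [0] * (n - len(q))
    return [a - b for a, b in zip(p, q)]

def poly_mul_x(p):
    return [0] + p

def derivative_poly(n):
    """p_n with phi^(n)(x) = p_n(x) phi(x): p_0 = 1, p_{n+1} = p_n' - x p_n."""
    p = [1]
    for _ in range(n):
        p = poly_sub(poly_derivative(p), poly_mul_x(p))
    return p

def hermite(n):
    """Probabilists' Hermite polynomials: He_{n+1} = x He_n - n He_{n-1}."""
    h_prev, h = [1], [0, 1]
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, poly_sub(poly_mul_x(h), [k * c for c in h_prev])
    return h

# Check phi^(n) = (-1)^n He_n(x) phi(x) exactly, coefficient by coefficient
for n in range(8):
    sign = (-1) ** n
    assert derivative_poly(n) == [sign * c for c in hermite(n)]
```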
When σ² = 0, the density function does not exist. However the distribution can be represented by a generalized function that defines a measure on the real line and can be used to calculate, for example, expected values:
f(x) = δ(x − μ),
where δ(x) is the Dirac delta function, which is equal to infinity at x = μ and is zero elsewhere.
Cumulative distribution function
The CDF of the standard normal distribution is
Φ(x) = (1/√(2π)) ∫₋∞ˣ e^(−t²/2) dt.
This integral cannot be expressed in terms of elementary functions, so it is simply written as a transformation of the error function erf, a special function:
Φ(x) = ½[1 + erf(x/√2)].
Numerical methods for calculation of the standard normal CDF are discussed below. For a generic normal random variable with mean μ and variance σ² > 0 the CDF will be equal to
F(x) = Φ((x − μ)/σ).
The complement of the standard normal CDF, Q(x) = 1 − Φ(x), is referred to as the Q-function, especially in
engineering texts.[12][13] This represents the upper tail probability of the Gaussian distribution: that is, the probability
that a standard normal random variable X is greater than the number x. Other definitions of the Q-function, all of
which are simple transformations of Φ, are also used occasionally.[14]
Properties:
• The standard normal CDF is 2-fold rotationally symmetric around point (0, ½): Φ(−x) = 1 − Φ(x).
• The derivative of Φ(x) is equal to the standard normal pdf ϕ(x): Φ′(x) = ϕ(x).
• The antiderivative of Φ(x) is: ∫ Φ(x) dx = x Φ(x) + ϕ(x).
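A small sketch of the CDF, the Q-function, and the symmetry property, using the error function from the standard library (function names are illustrative):

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def Q(x):
    """Upper tail probability, Q(x) = 1 - Phi(x)."""
    return 1 - Phi(x)

# Symmetry: Phi(-x) = 1 - Phi(x)
for x in (0.0, 0.5, 1.0, 2.5):
    assert math.isclose(Phi(-x), 1 - Phi(x))

# The derivative of Phi is phi (checked with a central difference)
h = 1e-6
assert math.isclose((Phi(1.0 + h) - Phi(1.0 - h)) / (2 * h), phi(1.0),
                    rel_tol=1e-6)
```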
For a normal distribution with zero variance, the CDF is the translated Heaviside step function (with the H(0) = 1 convention): F(x) = H(x − μ).
Quantile function
The quantile function of a distribution is the inverse of the CDF. The quantile function of the standard normal distribution is called the probit function, and can be expressed in terms of the inverse error function:
Φ⁻¹(p) = √2 erf⁻¹(2p − 1), p ∈ (0, 1).
Quantiles of the standard normal distribution are commonly denoted as zp. The quantile zp is the value such that a standard normal random variable X falls inside the (−∞, zp] interval with probability exactly p. The quantiles are used in hypothesis testing, construction of confidence intervals and Q-Q plots. The most "famous" normal quantile is 1.96 = z0.975: a standard normal random variable is greater than 1.96 in absolute value in 5% of cases.
For a normal random variable with mean μ and variance σ², the quantile function is
F⁻¹(p) = μ + σ Φ⁻¹(p) = μ + σ√2 erf⁻¹(2p − 1).
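Since no closed form exists, the probit function can be sketched by bisecting the CDF; this is a didactic implementation, not a production algorithm:

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def probit(p, lo=-10.0, hi=10.0):
    """Quantile of the standard normal, computed by bisecting Phi.

    80 halvings of the bracket give far more than double precision.
    """
    assert 0 < p < 1
    for _ in range(80):
        mid = (lo + hi) / 2
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def normal_quantile(p, mu, sigma):
    """F^{-1}(p) = mu + sigma * Phi^{-1}(p)."""
    return mu + sigma * probit(p)

assert abs(probit(0.975) - 1.9600) < 1e-3   # the "famous" quantile
assert abs(probit(0.5)) < 1e-9
```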
The characteristic function of a normal random variable is φ(t) = e^(iμt − σ²t²/2). It can be analytically extended to the entire complex plane: one defines φ(z) = e^(iμz − σ²z²/2) for all z ∈ C.[16]
The moment generating function is defined as the expected value of e^(tX). For a normal distribution, the moment generating function exists and is equal to
M(t) = E[e^(tX)] = e^(μt + σ²t²/2).
The cumulant generating function is the logarithm of the moment generating function:
g(t) = ln M(t) = μt + ½σ²t².
Since this is a quadratic polynomial in t, only the first two cumulants are nonzero.
Moments
The normal distribution has moments of all orders. That is, for a normally distributed X with mean μ and variance σ², the expectation E[|X|^p] exists and is finite for all p such that Re[p] > −1. Usually we are interested only in moments of integer orders: p = 1, 2, 3, ….
• Central moments are the moments of X around its mean μ. Thus, a central moment of order p is the expected value of (X − μ)^p. Using standardization of normal random variables, this expectation will be equal to σ^p · E[Z^p], where Z is standard normal. For integer p,
E[(X − μ)^p] = 0 if p is odd, and E[(X − μ)^p] = σ^p (p − 1)!! if p is even.
Here n!! denotes the double factorial, that is, the product of every odd number from n down to 1.
• Central absolute moments are the moments of |X − μ|. They coincide with regular central moments for all even orders, but are nonzero for all odd p's:
E[|X − μ|^p] = σ^p (p − 1)!! √(2/π) if p is odd, and σ^p (p − 1)!! if p is even.
These expressions remain valid even if p is not an integer. See also generalized Hermite polynomials.
• The first two cumulants are equal to μ and σ² respectively, whereas all higher-order cumulants are equal to zero.
Order p   Raw moment E[X^p]                          Central moment E[(X−μ)^p]   Cumulant κp
1         μ                                          0                           μ
2         μ² + σ²                                    σ²                          σ²
3         μ³ + 3μσ²                                  0                           0
4         μ⁴ + 6μ²σ² + 3σ⁴                           3σ⁴                         0
5         μ⁵ + 10μ³σ² + 15μσ⁴                        0                           0
6         μ⁶ + 15μ⁴σ² + 45μ²σ⁴ + 15σ⁶                15σ⁶                        0
7         μ⁷ + 21μ⁵σ² + 105μ³σ⁴ + 105μσ⁶             0                           0
8         μ⁸ + 28μ⁶σ² + 210μ⁴σ⁴ + 420μ²σ⁶ + 105σ⁸    105σ⁸                       0
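The table rows can be reproduced from the even/odd central-moment formula together with the binomial expansion E[X^n] = Σk C(n,k) μ^(n−k) E[(X−μ)^k]; a sketch with ad hoc names:

```python
from math import comb, isclose

def double_factorial(n):
    """n!! = n * (n-2) * ... down to 1 (with (-1)!! = 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def central_moment(p, sigma2):
    """E[(X - mu)^p] for X ~ N(mu, sigma2): 0 for odd p, sigma^p (p-1)!! else."""
    if p % 2 == 1:
        return 0.0
    return sigma2 ** (p // 2) * double_factorial(p - 1)

def raw_moment(n, mu, sigma2):
    """E[X^n] via the binomial expansion of ((X - mu) + mu)^n."""
    return sum(comb(n, k) * mu ** (n - k) * central_moment(k, sigma2)
               for k in range(n + 1))

mu, s2 = 2.0, 3.0
# Rows of the moment table, e.g. E[X^4] = mu^4 + 6 mu^2 s^2 + 3 s^4:
assert isclose(raw_moment(2, mu, s2), mu**2 + s2)
assert isclose(raw_moment(4, mu, s2), mu**4 + 6*mu**2*s2 + 3*s2**2)
assert isclose(raw_moment(6, mu, s2),
               mu**6 + 15*mu**4*s2 + 45*mu**2*s2**2 + 15*s2**3)
```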
Properties
Standardizing normal random variables: if X ~ N(μ, σ²), then
Z = (X − μ)/σ
has mean zero and unit variance, that is, Z has the standard normal distribution. Conversely, given a standard normal random variable Z we can always construct another normal random variable with specific mean μ and variance σ²:
X = σZ + μ.
This "standardizing" transformation is convenient as it allows one to compute the PDF and especially the CDF of a normal distribution having the table of PDF and CDF values for the standard normal. They will be related via
F_X(x) = Φ((x − μ)/σ),  f_X(x) = (1/σ) ϕ((x − μ)/σ).
The next table gives the reverse relation of sigma multiples corresponding to a few often used values for the area
under the bell curve. These values are useful to determine (asymptotic) tolerance intervals of the specified levels
based on normally distributed (or asymptotically normal) estimators:[18]
where the value on the left of the table is the proportion of values that will fall within a given interval and n is a
multiple of the standard deviation that specifies the width of the interval.
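The proportion of mass within n standard deviations of the mean is erf(n/√2), which reproduces the familiar 68–95–99.7 values; a minimal sketch:

```python
import math

def coverage(n_sigma):
    """P(|X - mu| <= n*sigma) for a normal variable: erf(n / sqrt(2))."""
    return math.erf(n_sigma / math.sqrt(2))

# The classical 68-95-99.7 rule:
assert abs(coverage(1) - 0.6827) < 1e-4
assert abs(coverage(2) - 0.9545) < 1e-4
assert abs(coverage(3) - 0.9973) < 1e-4

# Reverse direction, as in the table: 95% of the mass lies within ~1.96 sigma
assert 0.9499 < coverage(1.959964) < 0.9501
```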
Miscellaneous
1. The family of normal distributions is closed under linear transformations. That is, if X is normally distributed with mean μ and variance σ², then a linear transform aX + b (for some real numbers a and b) is also normally distributed:
aX + b ~ N(aμ + b, a²σ²).
Also if X1, X2 are two independent normal random variables, with means μ1, μ2 and standard deviations σ1, σ2, then their linear combination will also be normally distributed:
aX1 + bX2 ~ N(aμ1 + bμ2, a²σ1² + b²σ2²).
2. The converse of (1) is also true: if X1 and X2 are independent and their sum X1 + X2 is distributed normally, then
both X1 and X2 must also be normal.[19] This is known as Cramér's decomposition theorem. The interpretation
of this property is that a normal distribution is only divisible by other normal distributions. Another application of
this property is in connection with the central limit theorem: although the CLT asserts that the distribution of a
sum of arbitrary non-normal iid random variables is approximately normal, Cramér's theorem shows that it
can never become exactly normal.[20]
3. If the characteristic function φX of some random variable X is of the form φX(t) = eQ(t), where Q(t) is a
polynomial, then the Marcinkiewicz theorem (named after Józef Marcinkiewicz) asserts that Q can be at most a
quadratic polynomial, and therefore X a normal random variable.[20] The consequence of this result is that the
normal distribution is the only distribution with a finite number (two) of non-zero cumulants.
4. If X and Y are jointly normal and uncorrelated, then they are independent. The requirement that X and Y should be jointly normal is essential; without it the property does not hold. For non-normal random variables uncorrelatedness does not imply independence.
7. The normal distribution is stable (with exponent α = 2): if X1, X2 are two independent N(μ, σ²) random variables and a, b are arbitrary real numbers, then
aX1 + bX2 ~ cX3 + d, where c = √(a² + b²) and d = (a + b − c)μ,
where X3 is also N(μ, σ²). This relationship directly follows from property (1).
8. The Kullback–Leibler divergence between two normal distributions X1 ~ N(μ1, σ1²) and X2 ~ N(μ2, σ2²) is given by:[24]
D_KL(X1 || X2) = ln(σ2/σ1) + (σ1² + (μ1 − μ2)²)/(2σ2²) − 1/2.
9. The Fisher information matrix for a normal distribution is diagonal and takes the form
I(μ, σ²) = diag(1/σ², 1/(2σ⁴)).
10. Normal distributions belong to an exponential family with natural parameters θ1 = μ/σ² and θ2 = −1/(2σ²), and natural statistics x and x². The dual (expectation) parameters for the normal distribution are η1 = μ and η2 = μ² + σ².
11. The conjugate prior of the mean of a normal distribution is another normal distribution.[25] Specifically, if x1, …, xn are iid N(μ, σ²) and the prior is μ ~ N(μ0, σ0²), then the posterior distribution for μ will be
μ | x1, …, xn ~ N( (μ0/σ0² + Σi xi/σ²) / (1/σ0² + n/σ²), (1/σ0² + n/σ²)⁻¹ ).
one with the maximum entropy.[26]
13. The family of normal distributions forms a manifold with constant curvature −1. The same family is flat with
respect to the (±1)-connections ∇(e) and ∇(m).[27]
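The Kullback–Leibler divergence named in item 8 above can be sketched directly (the function name is ad hoc):

```python
import math

def kl_normal(mu1, var1, mu2, var2):
    """KL divergence D(N(mu1, var1) || N(mu2, var2)) in nats:
    ln(sigma2/sigma1) + (var1 + (mu1 - mu2)^2) / (2 var2) - 1/2."""
    return (math.log(math.sqrt(var2 / var1))
            + (var1 + (mu1 - mu2) ** 2) / (2 * var2)
            - 0.5)

# KL of a distribution with itself is zero
assert abs(kl_normal(1.0, 2.0, 1.0, 2.0)) < 1e-12

# Shifting the mean of a unit normal by d costs d^2/2 nats
assert math.isclose(kl_normal(1.0, 1.0, 0.0, 1.0), 0.5)

# KL is asymmetric in general
assert kl_normal(0.0, 1.0, 0.0, 4.0) != kl_normal(0.0, 4.0, 0.0, 1.0)
```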
Related distributions
• If X1, …, Xn, Y1, …, Ym are independent standard normal random variables, then the ratio of their normalized sums of squares will have the F-distribution with (n, m) degrees of freedom:
F = ((X1² + … + Xn²)/n) / ((Y1² + … + Ym²)/m) ~ F(n, m).
Extensions
The notion of normal distribution, being one of the most important distributions in probability theory, has been extended far beyond the standard framework of the univariate (that is, one-dimensional) case. All these extensions are also called normal or Gaussian laws, so a certain ambiguity in names exists.
• Multivariate normal distribution describes the Gaussian law in the k-dimensional Euclidean space. A vector X ∈
Rk is multivariate-normally distributed if any linear combination of its components ∑ aj Xj has a
(univariate) normal distribution. The variance of X is a k×k symmetric positive-definite matrix V.
• Rectified Gaussian distribution is a rectified version of the normal distribution with all the negative elements reset to 0.
• Complex normal distribution deals with the complex normal vectors. A complex vector X ∈ Ck is said to be
normal if both its real and imaginary components jointly possess a 2k-dimensional multivariate normal
distribution. The variance-covariance structure of X is described by two matrices: the variance matrix Γ, and the
relation matrix C.
• Matrix normal distribution describes the case of normally distributed matrices.
• Gaussian processes are the normally distributed stochastic processes. These can be viewed as elements of some
infinite-dimensional Hilbert space H, and thus are the analogues of multivariate normal vectors for the case k = ∞.
A random element h ∈ H is said to be normal if for any constant a ∈ H the scalar product (a, h) has a (univariate)
normal distribution. The variance structure of such Gaussian random element can be described in terms of the
linear covariance operator K: H → H. Several Gaussian processes became popular enough to have their own
names:
• Brownian motion,
• Brownian bridge,
• Ornstein–Uhlenbeck process.
• Gaussian q-distribution is an abstract mathematical construction which represents a "q-analogue" of the normal
distribution.
• the q-Gaussian is an analogue of the Gaussian distribution, in the sense that it maximises the Tsallis entropy, and
is one type of Tsallis distribution. Note that this distribution is different from the Gaussian q-distribution above.
One of the main practical uses of the Gaussian law is to model the empirical distributions of many different random variables encountered in practice. In such cases a possible extension would be a richer family of distributions, having more than two parameters and therefore being able to fit the empirical distribution more accurately. Examples of such extensions are:
• Pearson distribution — a four-parametric family of probability distributions that extend the normal law to include
different skewness and kurtosis values.
Normality tests
Normality tests assess the likelihood that the given data set {x1, …, xn} comes from a normal distribution. Typically
the null hypothesis H0 is that the observations are distributed normally with unspecified mean μ and variance σ2,
versus the alternative Ha that the distribution is arbitrary. A great number of tests (over 40) have been devised for
this problem; the more prominent among them are outlined below:
• "Visual" tests are more intuitively appealing but subjective at the same time, as they rely on informal human
judgement to accept or reject the null hypothesis.
• Q-Q plot — is a plot of the sorted values from the data set against the expected values of the corresponding
quantiles from the standard normal distribution. That is, it is a plot of points of the form (Φ⁻¹(pk), x(k)), where the plotting points pk are equal to pk = (k − α)/(n + 1 − 2α) and α is an adjustment constant, which can be anything between 0 and 1. If the null hypothesis is true, the plotted points should approximately lie on a straight line.
• P-P plot — similar to the Q-Q plot, but used much less frequently. This method consists of plotting the points (Φ(z(k)), pk), where z(k) = (x(k) − μ̂)/σ̂. For normally distributed data this plot should lie on a 45° line between (0, 0) and (1, 1).
• Shapiro–Wilk test employs the fact that the line in the Q-Q plot has the slope of σ. The test compares the least
squares estimate of that slope with the value of the sample variance, and rejects the null hypothesis if these two
quantities differ significantly.
• Normal probability plot (rankit plot)
• Moment tests:
• D'Agostino's K-squared test
• Jarque–Bera test
• Empirical distribution function tests:
• Lilliefors test (an adaptation of the Kolmogorov–Smirnov test)
• Anderson–Darling test
Estimation of parameters
It is often the case that we don't know the parameters of the normal distribution, but instead want to estimate them.
That is, having a sample (x1, …, xn) from a normal N(μ, σ2) population we would like to learn the approximate
values of parameters μ and σ². The standard approach to this problem is the maximum likelihood method, which requires maximization of the log-likelihood function:
ln L(μ, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) Σi (xi − μ)².
Taking derivatives with respect to μ and σ² and solving the resulting system of first order conditions yields the maximum likelihood estimates:
μ̂ = x̄ = (1/n) Σi xi,  σ̂² = (1/n) Σi (xi − x̄)².
The estimator μ̂ is called the sample mean, since it is the arithmetic mean of all observations. The statistic μ̂ is complete and sufficient for μ, and therefore by the Lehmann–Scheffé theorem, μ̂ is the uniformly minimum variance unbiased (UMVU) estimator.[29] In finite samples it is distributed normally:
μ̂ ~ N(μ, σ²/n).
The variance of this estimator is equal to the μμ-element of the inverse Fisher information matrix I⁻¹. This implies that the estimator is finite-sample efficient. Of practical importance is the fact that the standard error of μ̂ is proportional to 1/√n, that is, if one wishes to decrease the standard error by a factor of 10, one must increase the number of points in the sample by a factor of 100. This fact is widely used in determining sample sizes for opinion polls.
The estimator σ̂² is called the sample variance, since it is the variance of the sample (x1, …, xn). In practice, another estimator is often used instead of σ̂². This other estimator is denoted s², and is also called the sample variance, which represents a certain ambiguity in terminology; its square root s is called the sample standard deviation. The estimator s² differs from σ̂² by having (n − 1) instead of n in the denominator (the so-called Bessel's correction):
s² = (1/(n − 1)) Σi (xi − x̄)².
The difference between s² and σ̂² becomes negligibly small for large n's. In finite samples, however, the motivation behind the use of s² is that it is an unbiased estimator of the underlying parameter σ², whereas σ̂² is biased. Also, by the Lehmann–Scheffé theorem the estimator s² is uniformly minimum variance unbiased (UMVU),[29] which makes it the "best" estimator among all unbiased ones. However it can be shown that the biased estimator σ̂² is "better" than s² in terms of the mean squared error (MSE) criterion. In finite samples both s² and σ̂² have scaled chi-squared distributions with (n − 1) degrees of freedom:
s² ~ (σ²/(n − 1)) χ²(n − 1),  σ̂² ~ (σ²/n) χ²(n − 1).
The first of these expressions shows that the variance of s² is equal to 2σ⁴/(n − 1), which is slightly greater than the σσ-element of the inverse Fisher information matrix I⁻¹. Thus, s² is not an efficient estimator for σ², and moreover, since s² is UMVU, we can conclude that the finite-sample efficient estimator for σ² does not exist.
Applying the asymptotic theory, both estimators s² and σ̂² are consistent, that is, they converge in probability to σ² as the sample size n → ∞. The two estimators are also both asymptotically normal:
√n (s² − σ²) → N(0, 2σ⁴),  √n (σ̂² − σ²) → N(0, 2σ⁴)  (in distribution).
The quantity t = √n (x̄ − μ)/s has the Student's t-distribution with (n − 1) degrees of freedom, and it is an ancillary statistic (independent of the value of the parameters). Inverting the distribution of this t-statistic allows us to construct the confidence interval for μ;[30] similarly, inverting the χ² distribution of the statistic s² gives us the confidence interval for σ².[31] Approximate formulas based on the asymptotic distributions of μ̂ and s² are valid for large values of n, and are more convenient for manual calculation since the standard normal quantiles zα/2 do not depend on n. In particular, the most popular value of α = 5% results in |z0.025| = 1.96, so that the approximate 95% confidence interval for μ is x̄ ± 1.96 s/√n.
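The point estimates and the approximate large-n interval for μ can be sketched as follows (an illustrative helper, with z = 1.96 hard-coded for the 95% level):

```python
import math

def mean_ci(sample, z=1.96):
    """Sample mean, Bessel-corrected variance, and approximate 95% CI for mu."""
    n = len(sample)
    xbar = sum(sample) / n                                  # sample mean
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)     # unbiased variance
    half_width = z * math.sqrt(s2 / n)                      # z * s / sqrt(n)
    return xbar, s2, (xbar - half_width, xbar + half_width)

data = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.9, 5.1]
xbar, s2, (lo, hi) = mean_ci(data)
assert math.isclose(xbar, 5.0)
assert lo < 5.0 < hi

# Quadrupling n roughly halves the width (standard error ~ 1/sqrt(n))
_, _, (lo4, hi4) = mean_ci(data * 4)
assert (hi4 - lo4) < 0.6 * (hi - lo)
```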
Scalar form
The following auxiliary formula is useful for simplifying the posterior update equations, which otherwise become fairly tedious:
a(x − y)² + b(x − z)² = (a + b)(x − (ay + bz)/(a + b))² + (ab/(a + b))(y − z)².
This equation rewrites the sum of two quadratics in x by expanding the squares, grouping the terms in x, and completing the square. Note the following about the constant factors attached to some of the terms:
The factor ab/(a + b) = 1/(1/a + 1/b) arises from a situation where the reciprocals of quantities a and b add directly, so to combine a and b themselves it is necessary to reciprocate, add, and reciprocate the result again to get back into the original units. This is exactly the sort of operation performed by the harmonic mean, so it is not surprising that 1/(1/a + 1/b) is one-half the harmonic mean of a and b.
Vector form
A similar formula can be written for the sum of two vector quadratics: if x, y, z are vectors of length k, and A and B are symmetric, invertible matrices of size k×k, then
(y − x)ᵀA(y − x) + (x − z)ᵀB(x − z) = (x − c)ᵀ(A + B)(x − c) + (y − z)ᵀ(A⁻¹ + B⁻¹)⁻¹(y − z),
where c = (A + B)⁻¹(Ay + Bz).
Note that the form xᵀAx is a quadratic form, xᵀAx = Σij aij xi xj. In other words, it sums up all possible combinations of products of pairs of elements from x, with a separate coefficient for each. In addition, since xi xj = xj xi, only the sum aij + aji matters for any off-diagonal elements of A, and there is no loss of generality in assuming that A is symmetric. Furthermore, if A is symmetric, then the form xᵀAy = yᵀAx.
In the above derivation, we used the formula above for the sum of two quadratics and eliminated all constant factors
not involving . The result is the kernel of a normal distribution, with mean and precision
, i.e.
This can be written as a set of Bayesian update equations for the posterior parameters in terms of the prior
parameters:
That is, to combine data points with total precision of (or equivalently, total variance of ) and mean of
values , derive a new total precision simply by adding the total precision of the data to the prior total precision,
and form a new mean through a precision-weighted average, i.e. a weighted average of the data mean and the prior
mean, each weighted by the associated total precision. This makes logical sense if the precision is thought of as
indicating the certainty of the observations: In the distribution of the posterior mean, each of the input components is
weighted by its certainty, and the certainty of this distribution is the sum of the individual certainties. (For the
intuition of this, compare the expression "the whole is (or is not) greater than the sum of its parts". In addition,
consider that the knowledge of the posterior comes from a combination of the knowledge of the prior and likelihood,
so it makes sense that we are more certain of it than of either of its components.)
The above formula reveals why it is more convenient to do Bayesian analysis of conjugate priors for the normal
distribution in terms of the precision. The posterior precision is simply the sum of the prior and likelihood precisions,
and the posterior mean is computed through a precision-weighted average, as described above. The same formulas
can be written in terms of variances by reciprocating all the precisions, yielding uglier formulas
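A sketch of the precision-form update for the mean with known data precision (the names are ad hoc, and the per-observation precision is assumed known):

```python
import math

def posterior_normal(prior_mean, prior_precision, data, data_precision):
    """Posterior for the mean of a normal with known observation precision.

    Precisions add, and the posterior mean is the precision-weighted
    average of the prior mean and the data mean.
    """
    n = len(data)
    data_mean = sum(data) / n
    post_precision = prior_precision + n * data_precision
    post_mean = (prior_precision * prior_mean
                 + n * data_precision * data_mean) / post_precision
    return post_mean, post_precision

# One observation with the same precision as the prior: the posterior mean
# lands halfway between prior mean and observation, and certainty doubles.
m, p = posterior_normal(0.0, 1.0, [10.0], 1.0)
assert math.isclose(m, 5.0) and math.isclose(p, 2.0)

# More data pulls the posterior mean toward the data mean
m2, p2 = posterior_normal(0.0, 1.0, [10.0] * 100, 1.0)
assert m2 > 9.8 and p2 == 101.0
```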
The likelihood function from above, written in terms of the variance, is:
where
Then:
or equivalently
The respective numbers of pseudo-observations just add the number of actual observations to them. The new mean
hyperparameter is once again a weighted average, this time weighted by the relative numbers of observations.
Finally, the update for is similar to the case with known mean, but in this case the sum of squared deviations
is taken with respect to the observed data mean rather than the true mean, and as a result a new "interaction term"
needs to be added to take care of the additional error source stemming from the deviation between prior and data
mean.
Occurrence
The occurrence of normal distribution in practical problems can be loosely classified into three categories:
1. Exactly normal distributions;
2. Approximately normal laws, for example when such approximation is justified by the central limit theorem; and
3. Distributions modeled as normal – the normal distribution being the distribution with maximum entropy for a
given mean and variance.
Exact normality
Certain quantities in physics are distributed normally, as was first
demonstrated by James Clerk Maxwell. Examples of such quantities
are:
• Velocities of the molecules in an ideal gas. More generally,
velocities of the particles in any system in thermodynamic
equilibrium will have a normal distribution, due to the maximum
entropy principle.
• Probability density function of a ground state in a quantum harmonic
oscillator.
[Figure: The ground state of a quantum harmonic oscillator has the Gaussian distribution.]
• The position of a particle which experiences diffusion. If initially the particle is located at a specific point (that is, its probability
distribution is the Dirac delta function), then after time t its location is described by a normal distribution with variance t, which
satisfies the diffusion equation ∂f(x,t)/∂t = ½ ∂²f(x,t)/∂x². If the initial location is given by a certain density function g(x), then
the density at time t is the convolution of g and the normal PDF.
Approximate normality
Approximately normal distributions occur in many situations, as explained by the central limit theorem. When the
outcome is produced by a large number of small effects acting additively and independently, its distribution will be
close to normal. The normal approximation will not be valid if the effects act multiplicatively (instead of additively),
or if there is a single external influence which has a considerably larger magnitude than the rest of the effects.
• In counting problems, where the central limit theorem includes a discrete-to-continuum approximation and where
infinitely divisible and decomposable distributions are involved, such as
• Binomial random variables, associated with binary response variables;
• Poisson random variables, associated with rare events;
• Thermal light has a Bose–Einstein distribution on very short time scales, and a normal distribution on longer
timescales due to the central limit theorem.
Assumed normality
I can only recognize the occurrence of the normal curve – the
Laplacian curve of errors – as a very abnormal phenomenon. It
is roughly approximated to in certain distributions; for this
reason, and on account of its beautiful simplicity, we may,
perhaps, use it as a first approximation, particularly in theoretical
investigations.
—Pearson (1901)
There are statistical methods to empirically test that assumption; see
the Normality tests section above.
• In biology, the logarithms of various variables tend to have a normal distribution; that is, the variables tend to have a
log-normal distribution (after separation into male/female subpopulations), with examples including:
• Measures of size of living tissue (length, height, skin area, weight);[32]
• The length of inert appendages (hair, claws, nails, teeth) of biological specimens, in the direction of growth;
presumably the thickness of tree bark also falls under this category;
• Certain physiological measurements, such as blood pressure of adult humans.
• In finance, in particular the Black–Scholes model, changes in the logarithm of exchange rates, price indices, and
stock market indices are assumed normal (these variables behave like compound interest, not like simple interest,
and so are multiplicative). Some mathematicians such as Benoît Mandelbrot have argued that log-Lévy
distributions, which possess heavy tails, would be a more appropriate model, in particular for the analysis of
stock market crashes.
• Measurement errors in physical experiments are often modeled by a normal distribution. This use of a normal
distribution does not imply that one is assuming the measurement errors are normally distributed; rather, using the
normal distribution produces the most conservative predictions possible given only knowledge about the mean
and variance of the errors.[33]
• In standardized testing, results can be made to have a
normal distribution. This is done by either selecting the
number and difficulty of questions (as in the IQ test), or by
transforming the raw test scores into "output" scores by
fitting them to the normal distribution. For example, the
SAT's traditional range of 200–800 is based on a normal
distribution with a mean of 500 and a standard deviation of
100.
• Many scores are derived from the normal distribution,
including percentile ranks ("percentiles" or "quantiles"),
normal curve equivalents, stanines, z-scores, and T-scores.
Additionally, a number of behavioral statistical procedures
are based on the assumption that scores are normally distributed; for example, t-tests and ANOVAs. Bell curve
grading assigns relative grades based on a normal distribution of scores.
[Figure: Fitted cumulative normal distribution to October rainfalls]
• In hydrology the distribution of long duration river discharge or rainfall, e.g. monthly and yearly totals, is often
thought to be practically normal according to the central limit theorem.[34] The blue picture illustrates an example
of fitting the normal distribution to ranked October rainfalls showing the 90% confidence belt based on the
binomial distribution. The rainfall data are represented by plotting positions as part of the cumulative frequency
analysis.
will both have the standard normal distribution, and will be independent. This formulation arises because for a
bivariate normal random vector (X, Y) the squared norm X² + Y² will have the chi-squared distribution with two
degrees of freedom, which is an easily generated exponential random variable corresponding to the quantity
−2 ln(U) in these equations; and the angle is distributed uniformly around the circle, chosen by the random
variable V.
• The Marsaglia polar method is a modification of the Box–Muller algorithm which does not require
computation of the functions sin() and cos(). In this method U and V are drawn from the uniform (−1,1)
distribution, and then S = U² + V² is computed. If S is greater than or equal to one, the method starts over;
otherwise the two quantities
X = U √(−2 ln S / S),  Y = V √(−2 ln S / S)
are returned. Again, X and Y will be independent and standard normally distributed.
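The polar method as just described can be sketched directly (the function name is ours):

```python
import math
import random

def marsaglia_polar(rng):
    """Return one pair of independent standard normal deviates.

    (U, V) is drawn uniformly from the square (-1, 1)^2 and accepted only
    when S = U^2 + V^2 falls strictly inside the unit circle; the accepted
    point is then rescaled, with no call to sin() or cos().
    """
    while True:
        u = rng.uniform(-1.0, 1.0)
        v = rng.uniform(-1.0, 1.0)
        s = u * u + v * v
        if 0.0 < s < 1.0:
            factor = math.sqrt(-2.0 * math.log(s) / s)
            return u * factor, v * factor
```

Roughly π/4 ≈ 79% of candidate pairs are accepted, the price paid for avoiding the trigonometric calls of the Box–Muller transform.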
• The Ratio method[37] is a rejection method. The algorithm proceeds as follows:
• Generate two independent uniform deviates U and V;
• Compute X = √(8/e)·(V − 0.5)/U;
• If X² ≤ 5 − 4e^(1/4)·U then accept X and terminate the algorithm;
where ϕ(x) is the standard normal PDF, and b0 = 0.2316419, b1 = 0.319381530, b2 = −0.356563782, b3 =
1.781477937, b4 = −1.821255978, b5 = 1.330274429.
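The coefficients listed above match the classic Zelen & Severo approximation Φ(x) ≈ 1 − ϕ(x)(b₁t + b₂t² + b₃t³ + b₄t⁴ + b₅t⁵) with t = 1/(1 + b₀x), valid for x ≥ 0. Assuming that is the formula intended here, a sketch:

```python
import math

# Coefficients b0..b5 as listed in the text.
B = (0.2316419, 0.319381530, -0.356563782,
     1.781477937, -1.821255978, 1.330274429)

def phi_cdf(x):
    """Approximate standard normal CDF Phi(x).

    Uses Phi(x) ~ 1 - phi(x) * (b1*t + b2*t^2 + ... + b5*t^5) with
    t = 1 / (1 + b0*x) for x >= 0, and symmetry Phi(-x) = 1 - Phi(x).
    Absolute error is below roughly 7.5e-8.
    """
    if x < 0.0:
        return 1.0 - phi_cdf(-x)
    t = 1.0 / (1.0 + B[0] * x)
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    poly = t * (B[1] + t * (B[2] + t * (B[3] + t * (B[4] + t * B[5]))))
    return 1.0 - pdf * poly
```

The polynomial is evaluated in Horner form, which keeps the rounding error small and the evaluation fast.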
• Hart (1968) lists almost a hundred rational-function approximations for the erfc() function. His algorithms
vary in degree of complexity and resulting precision, with maximum absolute precision of 24 digits. An
algorithm by West (2009) combines Hart's algorithm 5666 with a continued-fraction approximation in the tail to
provide a fast computation algorithm with 16-digit precision.
• W. J. Cody (1969), after noting that the Hart (1968) solution is not suited for erf, gives a solution for both erf and erfc, with
a maximal relative error bound, via rational Chebyshev approximation (Cody, W. J. (1969), "Rational Chebyshev
Approximations for the Error Function").
• Marsaglia (2004) suggested a simple algorithm[39] based on the Taylor series expansion
for calculating Φ(x) with arbitrary precision. The drawback of this algorithm is its comparatively slow calculation
time (for example, it takes over 300 iterations to calculate the function with 16 digits of precision when x = 10).
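Marsaglia's expansion is commonly written as Φ(x) = 1/2 + ϕ(x)·(x + x³/3 + x⁵/(3·5) + x⁷/(3·5·7) + …). Assuming that form, a direct implementation might look like (the function name is ours):

```python
import math

def phi_series(x, tol=1e-16):
    """Phi(x) via the Taylor-series form
    Phi(x) = 1/2 + phi(x) * (x + x^3/3 + x^5/(3*5) + x^7/(3*5*7) + ...).

    Converges for every x, but slowly for large |x| (the text notes that
    hundreds of iterations are needed for 16-digit precision at x = 10).
    """
    term = x          # term_k = x^(2k+1) / (1 * 3 * ... * (2k+1))
    total = x
    k = 0
    while abs(term) > tol:
        k += 1
        term *= x * x / (2 * k + 1)  # ratio of consecutive terms
        total += term
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return 0.5 + pdf * total
```

Each term follows from the previous one by a single multiplication, so the arbitrary precision claimed in the text comes only from running the same recurrence with higher-precision arithmetic.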
• The GNU Scientific Library calculates values of the standard normal CDF using Hart's algorithms and
approximations with Chebyshev polynomials.
History
Development
Some authors[40][41] attribute the credit for the discovery of the normal distribution to de Moivre, who in 1738 [42]
published in the second edition of his "The Doctrine of Chances" the study of the coefficients in the binomial
expansion of (a + b)n. De Moivre proved that the middle term in this expansion has the approximate magnitude of
, and that "If m or ½n be a Quantity infinitely great, then the Logarithm of the Ratio, which a Term distant
from the middle by the Interval ℓ, has to the middle Term, is ."[43] Although this theorem can be interpreted as
the first obscure expression for the normal probability law, Stigler points out that de Moivre himself did not interpret
his results as anything more than the approximate rule for the binomial coefficients, and in particular de Moivre
lacked the concept of the probability density function.[44]
In 1809 Gauss published his monograph "Theoria motus corporum
coelestium in sectionibus conicis solem ambientium" where among
other things he introduces several important statistical concepts, such
as the method of least squares, the method of maximum likelihood, and
the normal distribution. Gauss used M, M′, M′′, … to denote the
measurements of some unknown quantity V, and sought the "most
probable" estimator: the one which maximizes the probability φ(M−V)
· φ(M′−V) · φ(M′′−V) · … of obtaining the observed experimental
results. In his notation φΔ is the probability law of the measurement
errors of magnitude Δ. Not knowing what the function φ is, Gauss
requires that his method should reduce to the well-known answer: the
arithmetic mean of the measured values.[45] Starting from these
principles, Gauss demonstrates that the only law which rationalizes the
choice of arithmetic mean as an estimator of the location parameter, is
the normal law of errors:[46]
φΔ = (h/√π) e^(−h²Δ²)
[Figure: Carl Friedrich Gauss discovered the normal distribution in 1809 as a way to rationalize the method of least squares.]
where h is "the measure of the precision of the observations". Using this normal law as a generic model for errors in
the experiments, Gauss formulates what is now known as the non-linear weighted least squares (NWLS) method.[47]
Although Gauss was the first to suggest the normal distribution law,
Laplace made significant contributions.[48] It was Laplace who first
posed the problem of aggregating several observations in 1774,[49]
although his own solution led to the Laplacian distribution. It was
Laplace who first calculated the value of the integral ∫ e^(−t²) dt = √π in
1782, providing the normalization constant for the normal
distribution.[50] Finally, it was Laplace who in 1810 proved and
presented to the Academy the fundamental central limit theorem,
which emphasized the theoretical importance of the normal
distribution.[51]
Naming
Since its introduction, the normal distribution has been known by many different names: the law of error, the law of
facility of errors, Laplace's second law, Gaussian law, etc. Gauss himself apparently coined the term with reference
to the "normal equations" involved in its applications, with normal having its technical meaning of orthogonal rather
than "usual".[55] However, by the end of the 19th century some authors[56] had started using the name normal
distribution, where the word "normal" was used as an adjective – the term now being seen as a reflection of the fact
that this distribution was seen as typical, common – and thus "normal". Peirce (one of those authors) once defined
"normal" thus: "...the 'normal' is not the average (or any other kind of mean) of what actually occurs, but of what
would, in the long run, occur under certain circumstances."[57] Around the turn of the 20th century Pearson
popularized the term normal as a designation for this distribution.[58]
Many years ago I called the Laplace–Gaussian curve the normal curve, which name, while it avoids an
international question of priority, has the disadvantage of leading people to believe that all other distributions
of frequency are in one sense or another 'abnormal'.
—Pearson (1920)
Also, it was Pearson who first wrote the distribution in terms of the standard deviation σ, as in modern notation. Soon
after this, in 1915, Fisher added the location parameter to the formula for the normal distribution, expressing it in
the way it is written nowadays:
The term "standard normal", which denotes the normal distribution with zero mean and unit variance, came into
general use around the 1950s, appearing in the popular textbooks by P. G. Hoel (1947), Introduction to Mathematical
Statistics, and A. M. Mood (1950), Introduction to the Theory of Statistics.[59]
The name "Gaussian distribution" honors Carl Friedrich Gauss, who introduced the
distribution in 1809 as a way of rationalizing the method of least squares, as outlined above. The related work of
Laplace, also outlined above, has led to the normal distribution sometimes being called Laplacian, especially in
French-speaking countries. Among English speakers, both "normal distribution" and "Gaussian distribution" are in
common use, with different terms preferred by different communities.
Notes
[1] The designation "bell curve" is ambiguous: there are many other distributions which are "bell"-shaped: the Cauchy distribution, Student's
t-distribution, generalized normal, logistic, etc.
[2] Casella & Berger (2001, p. 102)
[3] Gale Encyclopedia of Psychology – Normal Distribution (http://findarticles.com/p/articles/mi_g2699/is_0002/ai_2699000241)
[4] Cover, T. M.; Thomas, Joy A (2006). Elements of information theory. John Wiley and Sons. p. 254.
[5] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http://www.wise.xmu.edu.cn/Master/Download/..\..\UploadFiles\paper-masterdownload\2009519932327055475115776.pdf). Journal of Econometrics (Elsevier):
219–230. Retrieved 2011-06-02.
[6] Halperin & et al. (1965, item 7)
[7] McPherson (1990, p. 110)
[8] Bernardo & Smith (2000, p. 121)
[9] Patel & Read (1996, [2.1.4])
[10] Fan (1991, p. 1258)
[11] Patel & Read (1996, [2.1.8])
[12] Scott, Clayton; Robert Nowak (August 7, 2003). "The Q-function" (http://cnx.org/content/m11537/1.2/). Connexions.
[13] Barak, Ohad (April 6, 2006). "Q function and error function" (http://www.eng.tau.ac.il/~jo/academic/Q.pdf). Tel Aviv University.
[14] Weisstein, Eric W., "Normal Distribution Function (http://mathworld.wolfram.com/NormalDistributionFunction.html)" from
MathWorld.
[15] Bryc (1995, p. 23)
[16] Bryc (1995, p. 24)
[17] WolframAlpha.com (http://www.wolframalpha.com/input/?i=Table[{N(Erf(n/Sqrt(2)),+12),+N(1-Erf(n/Sqrt(2)),+12),+N(1/(1-Erf(n/Sqrt(2))),+12)},+{n,1,6}])
[18] part 1 (http://www.wolframalpha.com/input/?i=Table[Sqrt(2)*InverseErf(x),+{x,+N({8/10,+9/10,+19/20,+49/50,+99/100,+995/1000,+998/1000},+13)}]), part 2 (http://www.wolframalpha.com/input/?i=Table[{N(1-10^(-x),9),N(Sqrt(2)*InverseErf(1-10^(-x)),13)},{x,3,9}])
[19] Galambos & Simonelli (2004, Theorem 3.5)
[20] Bryc (1995, p. 35)
[21] Bryc (1995, p. 27)
[22] Lukacs & King (1954)
[23] Patel & Read (1996, [2.3.6])
[24] http://www.allisons.org/ll/MML/KL/Normal/
[25] "Stat260: Bayesian Modeling and Inference, Lecture Date: February 8th, 2010, The Conjugate Prior for the Normal Distribution, Lecturer:
Michael I. Jordan" (http://www.cs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture5.pdf).
[26] Cover & Thomas (2006, p. 254)
[27] Amari & Nagaoka (2000)
[28] MathWorld entry for Normal Product Distribution (http://mathworld.wolfram.com/NormalProductDistribution.html)
[29] Krishnamoorthy (2006, p. 127)
[30] Krishnamoorthy (2006, p. 130)
[31] Krishnamoorthy (2006, p. 133)
[32] Huxley (1932)
[33] Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. pp. 592–593.
[34] Ritzema, H.P. (ed.) (1994). Frequency and Regression Analysis (http://www.waterlog.info/pdf/freqtxt.pdf). Chapter 6 in: Drainage
Principles and Applications, Publication 16, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The
Netherlands. pp. 175–224. ISBN 90-70754-33-9.
[35] Wichura, M.J. (1988). "Algorithm AS241: The Percentage Points of the Normal Distribution". Applied Statistics (Blackwell Publishing) 37
(3): 477–484. doi:10.2307/2347330. JSTOR 2347330.
[36] Johnson et al. (1995, Equation (26.48))
[37] Kinderman & Monahan (1976)
[38] http://www.math.sfu.ca/~cbm/aands/page_932.htm
[39] For example, this algorithm is given in the article Bc programming language.
[40] Johnson et al. (1994, page 85)
[41] Le Cam (2000, p. 74)
[42] De Moivre first published his findings in 1733, in a pamphlet "Approximatio ad Summam Terminorum Binomii (a + b)n in Seriem Expansi"
that was designated for private circulation only. But it was not until the year 1738 that he made his results publicly available. The original
pamphlet was reprinted several times, see for example Walker (1985).
[43] De Moivre (1733), Corollary I – see Walker (1985, p. 77)
[44] Stigler (1986, p. 76)
[45] "It has been customary certainly to regard as an axiom the hypothesis that if any quantity has been determined by several direct
observations, made under the same circumstances and with equal care, the arithmetical mean of the observed values affords the most probable
value, if not rigorously, yet very nearly at least, so that it is always most safe to adhere to it." — Gauss (1809, section 177)
[46] Gauss (1809, section 177)
[47] Gauss (1809, section 179)
[48] "My custom of terming the curve the Gauss–Laplacian or normal curve saves us from proportioning the merit of discovery between the two
great astronomer mathematicians." quote from Pearson (1905, p. 189)
[49] Laplace (1774, Problem III)
[50] Pearson (1905, p. 189)
[51] Stigler (1986, p. 144)
[52] Stigler (1978, p. 243)
[53] Stigler (1978, p. 244)
[54] Maxwell (1860), p. 23
[55] Jaynes, E. T., Probability Theory: The Logic of Science, Ch. 7 (http://www-biba.inrialpes.fr/Jaynes/cc07s.pdf)
[56] Besides those specifically referenced here, such use is encountered in the works of Peirce, Galton and Lexis approximately around 1875.
[57] Peirce, C. S. (c. 1909 MS), Collected Papers v. 6, paragraph 327.
[58] Kruskal & Stigler (1997)
[59] "Earliest uses… (entry STANDARD NORMAL CURVE)" (http://jeff560.tripod.com/s.html).
References
• Aldrich, John; Miller, Jeff. "Earliest uses of symbols in probability and statistics" (http://jeff560.tripod.com/
stat.html).
• Aldrich, John; Miller, Jeff. "Earliest known uses of some of the words of mathematics" (http://jeff560.tripod.
com/mathword.html). In particular, the entries for "bell-shaped and bell curve" (http://jeff560.tripod.com/b.
html), "normal (distribution)" (http://jeff560.tripod.com/n.html), "Gaussian" (http://jeff560.tripod.com/g.
html), and "Error, law of error, theory of errors, etc." (http://jeff560.tripod.com/e.html).
• Amari, Shun-ichi; Nagaoka, Hiroshi (2000). Methods of information geometry. Oxford University Press.
ISBN 0-8218-0531-2.
• Bernardo, J. M.; Smith, A.F.M. (2000). Bayesian Theory. Wiley. ISBN 0-471-49464-X.
• Bryc, Wlodzimierz (1995). The normal distribution: characterizations with applications. Springer-Verlag.
ISBN 0-387-97990-5.
• Casella, George; Berger, Roger L. (2001). Statistical inference (2nd ed.). Duxbury. ISBN 0-534-24312-6.
• Cover, T. M.; Thomas, Joy A. (2006). Elements of information theory. John Wiley and Sons.
• de Moivre, Abraham (1738). The Doctrine of Chances. ISBN 0-8218-2103-2.
• Fan, Jianqing (1991). "On the optimal rates of convergence for nonparametric deconvolution problems". The
Annals of Statistics 19 (3): 1257–1272. doi:10.1214/aos/1176348248. JSTOR 2241949.
• Galambos, Janos; Simonelli, Italo (2004). Products of random variables: applications to problems of physics and
to arithmetical functions. Marcel Dekker, Inc. ISBN 0-8247-5402-6.
• Gauss, Carolo Friderico (1809) (in Latin). Theoria motvs corporvm coelestivm in sectionibvs conicis Solem
ambientivm [Theory of the motion of the heavenly bodies moving about the Sun in conic sections]. English
translation (http://books.google.com/books?id=1TIAAAAAQAAJ).
• Gould, Stephen Jay (1981). The mismeasure of man (first ed.). W.W. Norton. ISBN 0-393-01489-4.
• Halperin, Max; Hartley, H. O.; Hoel, P. G. (1965). "Recommended standards for statistical symbols and notation.
COPSS committee on symbols and notation". The American Statistician 19 (3): 12–14. doi:10.2307/2681417.
JSTOR 2681417.
• Hart, John F.; et al. (1968). Computer approximations. New York: John Wiley & Sons, Inc. ISBN 0-88275-642-7.
• Hazewinkel, Michiel, ed. (2001), "Normal distribution" (http://www.encyclopediaofmath.org/index.
php?title=p/n067460), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
• Herrnstein, Richard J.; Murray, Charles (1994). The bell curve: intelligence and class structure in American life. Free Press.
ISBN 0-02-914673-9.
• Huxley, Julian S. (1932). Problems of relative growth. London. ISBN 0-486-61114-0. OCLC 476909537.
• Johnson, N.L.; Kotz, S.; Balakrishnan, N. (1994). Continuous univariate distributions, Volume 1. Wiley.
ISBN 0-471-58495-9.
• Johnson, N.L.; Kotz, S.; Balakrishnan, N. (1995). Continuous univariate distributions, Volume 2. Wiley.
ISBN 0-471-58494-0.
• Krishnamoorthy, K. (2006). Handbook of statistical distributions with applications. Chapman & Hall/CRC.
ISBN 1-58488-635-8.
• Kruskal, William H.; Stigler, Stephen M. (1997). Normative terminology: 'normal' in statistics and elsewhere.
Statistics and public policy, edited by Bruce D. Spencer. Oxford University Press. ISBN 0-19-852341-6.
• la Place, M. de (1774). "Mémoire sur la probabilité des causes par les évènemens". Mémoires de Mathématique et
de Physique, Presentés à l'Académie Royale des Sciences, par divers Savans & lûs dans ses Assemblées, Tome
Sixième: 621–656. Translated by S.M.Stigler in Statistical Science 1 (3), 1986: JSTOR 2245476.
• Laplace, Pierre-Simon (1812). Analytical theory of probabilities.
• Lukacs, Eugene; King, Edgar P. (1954). "A property of normal distribution". The Annals of Mathematical
Statistics 25 (2): 389–394. doi:10.1214/aoms/1177728796. JSTOR 2236741.
• McPherson, G. (1990). Statistics in scientific investigation: its basis, application and interpretation.
Springer-Verlag. ISBN 0-387-97137-8.
• Marsaglia, George; Tsang, Wai Wan (2000). "The ziggurat method for generating random variables" (http://
www.jstatsoft.org/v05/i08/paper). Journal of Statistical Software 5 (8).
• Marsaglia, George (2004). "Evaluating the normal distribution" (http://www.jstatsoft.org/v11/i05/paper).
Journal of Statistical Software 11 (4).
• Maxwell, James Clerk (1860). "V. Illustrations of the dynamical theory of gases. — Part I: On the motions and
collisions of perfectly elastic spheres". Philosophical Magazine, series 4 19 (124): 19–32.
doi:10.1080/14786446008642818.
• Patel, Jagdish K.; Read, Campbell B. (1996). Handbook of the normal distribution (2nd ed.). CRC Press.
ISBN 0-8247-9342-0.
• Pearson, Karl (1905). "'Das Fehlergesetz und seine Verallgemeinerungen durch Fechner und Pearson'. A
rejoinder". Biometrika 4 (1): 169–212. JSTOR 2331536.
• Pearson, Karl (1920). "Notes on the history of correlation". Biometrika 13 (1): 25–45.
doi:10.1093/biomet/13.1.25. JSTOR 2331722.
• Stigler, Stephen M. (1978). "Mathematical statistics in the early states". The Annals of Statistics 6 (2): 239–265.
doi:10.1214/aos/1176344123. JSTOR 2958876.
• Stigler, Stephen M. (1982). "A modest proposal: a new standard for the normal". The American Statistician 36
(2): 137–138. doi:10.2307/2684031. JSTOR 2684031.
• Stigler, Stephen M. (1986). The history of statistics: the measurement of uncertainty before 1900. Harvard
University Press. ISBN 0-674-40340-1.
• Stigler, Stephen M. (1999). Statistics on the table. Harvard University Press. ISBN 0-674-83601-4.
• Walker, Helen M (1985). "De Moivre on the law of normal probability" (http://www.york.ac.uk/depts/maths/
histstat/demoivre.pdf). In Smith, David Eugene. A source book in mathematics. Dover. ISBN 0-486-64690-4.
• Weisstein, Eric W. "Normal distribution" (http://mathworld.wolfram.com/NormalDistribution.html).
MathWorld.
• West, Graeme (2009). "Better approximations to cumulative normal functions" (http://www.wilmott.com/pdfs/
090721_west.pdf). Wilmott Magazine: 70–76.
• Zelen, Marvin; Severo, Norman C. (1964). Probability functions (chapter 26) (http://www.math.sfu.ca/~cbm/
aands/page_931.htm). Handbook of mathematical functions with formulas, graphs, and mathematical tables, by
Abramowitz and Stegun: National Bureau of Standards. New York: Dover. ISBN 0-486-61272-4.
External links
• Normal Distribution Video Tutorial Part 1-2 (http://www.youtube.com/watch?v=kB_kYUbS_ig)
• An 8-foot-tall Probability Machine (named Sir Francis) comparing stock
market returns to the randomness of the beans dropping through the quincunx pattern (http://www.youtube.com/watch?v=AUSKTk9ENzg). YouTube link originating from Index Funds Advisors (http://www.ifa.com)
• An interactive Normal (Gaussian) distribution plot (http://peter.freeshell.org/gaussian/)
Order statistic 460
Order statistic
In statistics, the kth order statistic of a statistical sample is equal to its
kth-smallest value. Together with rank statistics, order statistics are
among the most fundamental tools in non-parametric statistics and
inference.
Important special cases of the order statistics are the minimum and
maximum value of a sample, and (with some qualifications discussed
below) the sample median and other sample quantiles.
When using probability theory to analyze order statistics of random samples from a continuous distribution, the
cumulative distribution function is used to reduce the analysis to the case of order statistics of the uniform
distribution.
[Figure: Probability distributions for the n = 5 order statistics of an exponential distribution with θ = 3]
where the subscript i in Xi indicates simply the order in which the observations were recorded and is usually
assumed not to be significant. A case when the order is significant is when the observations are part of a time series.
The order statistics would be denoted
where the subscript (i) enclosed in parentheses indicates the ith order statistic of the sample.
The first order statistic (or smallest order statistic) is always the minimum of the sample, that is,
where, following a common convention, we use upper-case letters to refer to random variables, and lower-case
letters (as above) to refer to their actual observed values.
Similarly, for a sample of size n, the nth order statistic (or largest order statistic) is the maximum, that is,
The sample range is the difference between the maximum and minimum. It is clearly a function of the order
statistics:
A similar important statistic in exploratory data analysis that is simply related to the order statistics is the sample
interquartile range.
The sample median may or may not be an order statistic, since there is a single middle value only when the number n
of observations is odd. More precisely, if n = 2m + 1 for some integer m, then the sample median is X(m+1) and so is an
order statistic. On the other hand, when n is even, n = 2m and there are two middle values, X(m) and X(m+1), and
the sample median is some function of the two (usually the average) and hence not an order statistic. Similar remarks
apply to all sample quantiles.
Probabilistic analysis
Given any random variables X1, X2, ..., Xn, the order statistics X(1), X(2), ..., X(n) are also random variables, defined by
sorting the values (realizations) of X1, X2, ..., Xn in increasing order.
When the random variables X1, X2, ..., Xn form a sample, they are independent and identically distributed (iid). This is
the case treated below. In general, the random variables X1, X2, ..., Xn can arise by sampling from more than one
population. Then they are independent but not necessarily identically distributed, and their joint probability
distribution is given by the Bapat–Beg theorem.
From now on, we will assume that the random variables under consideration are continuous and, where convenient,
we will also assume that they have a density (that is, they are absolutely continuous). The peculiarities of the analysis
of distributions assigning mass to points (in particular, discrete distributions) are discussed at the end.
that is, the kth order statistic of the uniform distribution is a Beta(k, n + 1 − k) random variable.
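This Beta law can be checked by simulation. Since Beta(k, n + 1 − k) has mean k/(n + 1), the empirical mean of the kth smallest of n uniform draws should be close to that value. The sketch below (parameters are arbitrary, chosen for illustration) does exactly this:

```python
import random

def kth_order_stat_samples(k, n, n_trials, rng):
    """Simulate the kth order statistic of n iid Uniform(0, 1) draws."""
    return [sorted(rng.random() for _ in range(n))[k - 1]
            for _ in range(n_trials)]

# Beta(k, n + 1 - k) has mean k / (n + 1); check empirically for k = 2, n = 4.
rng = random.Random(0)
samples = kth_order_stat_samples(2, 4, 20000, rng)
empirical_mean = sum(samples) / len(samples)   # should be near 2/5
```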
The proof of these statements is as follows. For the kth order statistic to be between u and u + du, it is necessary that exactly k − 1
elements of the sample are smaller than u, and that at least one is between u and u + du. The probability that more
than one is in this latter interval is already O(du²), so we have to calculate the probability that exactly k − 1, 1 and
n − k observations fall in the intervals (0, u), [u, u + du) and (u + du, 1) respectively. This equals (refer to
multinomial distribution for details)
which is (up to terms of higher order than du dv) the probability that i − 1, 1, j − 1 − i, 1 and n − j sample
elements fall in the intervals (0, u), [u, u + du), (u + du, v), [v, v + dv), (v + dv, 1), respectively.
One reasons in an entirely analogous way to derive the higher-order joint distributions. Perhaps surprisingly, the
joint density of the n order statistics turns out to be constant:
One way to understand this is that the unordered sample does have constant density equal to 1, and that there are n!
different permutations of the sample corresponding to the same sequence of order statistics. This is related to the fact
that 1/n! is the volume of the region 0 < x1 < x2 < … < xn < 1.
and
to derive the following probability density functions (pdfs) for the order statistics of a sample of size n drawn from
the distribution of X:
where
A small-sample-size example
The simplest case to consider is how well the sample median estimates the population median.
As an example, consider a random sample of size 6. In that case, the sample median is usually defined as the
midpoint of the interval delimited by the 3rd and 4th order statistics. However, we know from the preceding
discussion that the probability that this interval actually contains the population median is
C(6,3) / 2⁶ = 20/64 = 5/16 ≈ 31%.
Although the sample median is probably among the best distribution-independent point estimates of the population
median, what this example illustrates is that it is not a particularly good one in absolute terms. In this particular case,
a better confidence interval for the median is the one delimited by the 2nd and 5th order statistics, which contains the
population median with probability
[C(6,2) + C(6,3) + C(6,4)] / 2⁶ = 50/64 = 25/32 ≈ 78%.
With such a small sample size, if one wants at least 95% confidence, one is reduced to saying that the median is
between the minimum and the maximum of the 6 observations with probability 31/32 or approximately 97%. Size 6
is, in fact, the smallest sample size such that the interval determined by the minimum and the maximum is at least a
95% confidence interval for the population median.
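These coverage probabilities come from counting how many observations fall below the median, a count that is Binomial(n, 1/2) for a continuous distribution. A small helper (the function name is ours) reproduces, for example, the 31/32 figure quoted above:

```python
from math import comb

def median_coverage(n, lo, hi):
    """P( X_(lo) < population median < X_(hi) ) for a sample of size n from a
    continuous distribution.  The number of observations below the median is
    Binomial(n, 1/2), and the interval covers the median exactly when that
    count lies in {lo, ..., hi - 1}."""
    return sum(comb(n, k) for k in range(lo, hi)) / 2 ** n
```

For n = 6 this gives 20/64 ≈ 31% for the interval between the 3rd and 4th order statistics, 50/64 = 25/32 ≈ 78% for the 2nd and 5th, and 62/64 = 31/32 ≈ 97% for the interval between the minimum and the maximum.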
For a general distribution F with a continuous non-zero density at F −1(p), a similar asymptotic normality applies:
where f is the density function, and F −1 is the quantile function associated with F.
An interesting observation can be made in the case where the distribution is symmetric, and the population median
equals the population mean. In this case, the sample mean, by the central limit theorem, is also asymptotically
normally distributed, but with variance σ2/n instead. This asymptotic analysis suggests that the mean outperforms the
median in cases of low kurtosis, and vice versa. For example, the median achieves better confidence intervals for the
Laplace distribution, while the mean performs better for X that are normally distributed.
Proof
It can be shown that
where
with Zi being independent identically distributed exponential random variables with rate 1. Since X/n and Y/n are
asymptotically normally distributed by the CLT, our results follow by application of the delta method.
The cumulative distribution function of the order statistic can be computed by noting that
Similarly, is given by
Note that the probability mass function of is just the difference of these values, that is to say
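The computation noted here amounts to the standard finite-sum formula P(X(k) ≤ x) = Σ from j = k to n of C(n,j) pʲ(1 − p)ⁿ⁻ʲ with p = F(x); assuming that form, it can be sketched as (the function name is ours):

```python
from math import comb

def order_stat_cdf(k, n, p):
    """P( X_(k) <= x ) for n iid draws when P(X_i <= x) = p = F(x):
    the kth order statistic is at most x exactly when at least k of the
    n observations fall at or below x."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k, n + 1))
```

For a discrete X, the probability mass function of X(k) at a point x is then order_stat_cdf(k, n, F(x)) minus the same expression evaluated just below x, the difference mentioned in the text.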
References
• David, H. A.; Nagaraja, H. N. (2003). Order Statistics (3rd edition). Wiley, New Jersey. pp. 458. ISBN
0-471-38926-9.
• Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York. ISBN
0-471-02403-1.
External links
• Order statistics [1] at PlanetMath. Retrieved Feb 02, 2005
• Weisstein, Eric W., "Order Statistic [2]" from MathWorld. Retrieved Feb 02, 2005
• Dr. Susan Holmes, Order Statistics [3]. Retrieved Feb 02, 2005
References
[1] http://planetmath.org/encyclopedia/OrderStatistics.html
[2] http://mathworld.wolfram.com/OrderStatistic.html
[3] http://www-stat.stanford.edu/~susan/courses/s116/node79.html
Ordinary differential equation 465
Background
Ordinary differential equations arise in many different contexts throughout mathematics and science (social and
natural), because when changes are described mathematically, the most accurate description uses differentials and
derivatives (related, though not quite the same). Since various differentials, derivatives, and functions inevitably
become related to each other via equations, a differential equation is the result, governing dynamical phenomena,
evolution and variation. Often, quantities are defined as the rate of change of other quantities (time derivatives), or
as gradients of quantities, which is how they enter differential equations.
[Figure: The trajectory of a projectile launched from a cannon follows a curve determined by an ordinary differential
equation that is derived from Newton's second law.]
Specific mathematical fields include geometry and analytical mechanics. Scientific fields include much of physics
and astronomy (celestial mechanics), meteorology (weather modelling), chemistry (reaction rates)[3], biology (infectious
diseases, genetic variation), ecology and population modelling (population competition), economics (stock trends,
interest rates and the market equilibrium price changes).
Many mathematicians have studied differential equations and contributed to the field, including Newton, Leibniz, the
Bernoulli family, Riccati, Clairaut, d'Alembert and Euler.
A simple example is Newton's second law of motion: the relationship between the displacement x and the time t of
an object under the force F leads to the differential equation
m d²x/dt² = F(x(t))
for the motion of a particle of constant mass m. In general, F depends on the position x(t) of the particle at time t, and
so the unknown function x(t) appears on both sides of the differential equation, as is indicated in the notation
F(x(t)).[4][5][6][7]
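As a minimal worked instance (a constant force is an illustrative special case of ours, not an example from the text), take F to be the constant gravitational force −mg:

```latex
m\,\frac{\mathrm{d}^2 x}{\mathrm{d}t^2} = F(x(t))
\quad\xrightarrow{\;F \equiv -mg\;}\quad
\frac{\mathrm{d}^2 x}{\mathrm{d}t^2} = -g
\quad\Longrightarrow\quad
x(t) = x_0 + v_0 t - \tfrac{1}{2} g t^2,
```

where x₀ and v₀ are the initial position and velocity; differentiating x(t) twice recovers −g, confirming the solution.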
Definitions
In what follows, let y be a dependent variable and x an independent variable, so that y = y(x) is an unknown function
in x. The notation for differentiation varies depending upon the author and upon which notation is most useful for the
task at hand. In this context the Leibniz notation (dy/dx, d²y/dx², …, dⁿy/dxⁿ) is useful for differentials and when
integration is to be done, while Newton's and Lagrange's notation (y′, y′′, …, y⁽ⁿ⁾) is useful for representing derivatives
of any order compactly.
A linear ODE of order n has the form
aₙ(x) dⁿy/dxⁿ + ⋯ + a₁(x) dy/dx + a₀(x) y = r(x),
where the ai(x) and r(x) are continuous functions in x[11][12][13]. Non-linear equations cannot be written in this form. The
function r(x) is called the source term, leading to two further important classifications:[14][15]
Homogeneous: If r(x) = 0, and consequently one "automatic" solution is the trivial solution, y = 0. The solution of a
linear homogeneous equation is a complementary function, denoted here by yc.
Nonhomogeneous (or inhomogeneous): If r(x) ≠ 0. The additional solution to the complementary function is the
particular integral, denoted here by yp.
The general solution to a linear equation can be written as y = yc + yp.
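A small worked example (our own illustration) of the decomposition y = yc + yp, for the linear equation y′ + y = x:

```latex
y' + y = x,\qquad
y_c = C e^{-x}\ \ (\text{solves the homogeneous equation } y' + y = 0),\qquad
y_p = x - 1\ \ (\text{check: } y_p' + y_p = 1 + x - 1 = x),
```

so the general solution is y = Ce^{−x} + x − 1, with the constant C fixed by an initial condition.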
System of ODEs
A number of coupled differential equations form a system of equations. If y is a vector whose elements are
functions, y(x) = [y₁(x), y₂(x), …, y_m(x)], and F is a vector-valued function of y and its derivatives, then
y⁽ⁿ⁾ = F(x, y, y′, …, y⁽ⁿ⁻¹⁾)
is an explicit system of ordinary differential equations of order n and dimension m.
Solutions
Given a differential equation
F(x, u, u′, …, u⁽ⁿ⁾) = 0,
a function u: I ⊂ R → R is called the solution or integral curve for F, if u is n-times differentiable on I, and
F(x, u(x), u′(x), …, u⁽ⁿ⁾(x)) = 0 for all x in I.
A solution which has no extension is called a maximal solution. A solution defined on all of R is called a global
solution.
A general solution of an n-th order equation is a solution containing n arbitrary independent constants of integration.
A particular solution is derived from the general solution by setting the constants to particular values, often chosen
to fulfill set initial conditions or boundary conditions.[16] A singular solution is a solution that cannot be obtained
by assigning definite values to the arbitrary constants in the general solution.[17]
Theories of ODEs
Singular solutions
The theory of singular solutions of ordinary and partial differential equations was a subject of research from the time
of Leibniz, but only since the middle of the nineteenth century did it receive special attention. A valuable but
little-known work on the subject is that of Houtain (1854). Darboux (starting in 1873) was a leader in the theory, and
in the geometric interpretation of these solutions he opened a field which was worked by various writers, notably
Casorati and Cayley. To the latter is due (1872) the theory of singular solutions of differential equations of the first
order as accepted circa 1900.
Reduction to quadratures
The primitive attempt in dealing with differential equations had in view a reduction to quadratures. As it had been
the hope of eighteenth-century algebraists to find a method for solving the general equation of the nth degree, so it
was the hope of analysts to find a general method for integrating any differential equation. Gauss (1799) showed,
however, that the differential equation meets its limitations very soon unless complex numbers are introduced. Hence
analysts began to substitute the study of functions, thus opening a new and fertile field. Cauchy was the first to
appreciate the importance of this view. Thereafter the real question was to be, not whether a solution is possible by
means of known functions or their integrals, but whether a given differential equation suffices for the definition of a
function of the independent variable or variables, and if so, what are the characteristic properties of this function.
Fuchsian theory
Two memoirs by Fuchs (Crelle, 1866, 1868) inspired a novel approach, subsequently elaborated by Thomé and
Frobenius. Collet was a prominent contributor beginning in 1869, although his method for integrating a non-linear
system was communicated to Bertrand in 1868. Clebsch (1873) attacked the theory along lines parallel to those
followed in his theory of Abelian integrals. As the latter can be classified according to the properties of the
fundamental curve which remains unchanged under a rational transformation, so Clebsch proposed to classify the
transcendent functions defined by the differential equations according to the invariant properties of the
corresponding surfaces f = 0 under rational one-to-one transformations.
Lie's theory
From 1870 Sophus Lie's work put the theory of differential equations on a more satisfactory foundation. He showed
that the integration theories of the older mathematicians can, by the introduction of what are now called Lie groups,
be referred to a common source; and that ordinary differential equations which admit the same infinitesimal
transformations present comparable difficulties of integration. He also emphasized the subject of transformations of
contact.
Lie's group theory of differential equations has been certified, namely: (1) that it unifies the many ad hoc methods
known for solving differential equations, and (2) that it provides powerful new ways to find solutions. The theory
has applications to both ordinary and partial differential equations.[18]
A general approach to solving DEs uses the symmetry property of differential equations, the continuous infinitesimal
transformations of solutions to solutions (Lie theory). Continuous group theory, Lie algebras and differential
geometry are used to understand the structure of linear and nonlinear (partial) differential equations for generating
integrable equations, to find their Lax pairs, recursion operators and Bäcklund transforms, and finally to find exact
analytic solutions to the DE.
Symmetry methods have been applied to differential equations arising in mathematics, physics, engineering, and
many other disciplines.
Sturm–Liouville theory
Sturm–Liouville theory is a theory of eigenvalues and eigenfunctions of linear operators defined in terms of
second-order homogeneous linear equations, and is useful in the analysis of certain partial differential equations.
For the initial value problem y′ = F(x, y), y(x₀) = y₀, suppose F and ∂F/∂y are continuous on a closed rectangle
R = [x₀ − a, x₀ + a] × [y₀ − b, y₀ + b]
in the x-y plane, where a and b are real (symbolically: a, b ∈ ℝ) and × denotes the cartesian product, square brackets
denote closed intervals. Then there is an interval
I = [x₀ − h, x₀ + h] ⊆ [x₀ − a, x₀ + a],
for some h ∈ ℝ, where the solution to the above equation and initial value problem can be found. That is, there is a
solution and it is unique. Since there is no restriction on F to be linear, this applies to non-linear equations that
take the form F(x, y), and it can also be applied to systems of equations.
such that any solution which satisfies this initial condition is a restriction of the solution which satisfies this initial
condition with domain Imax.
In the case that , there are exactly two possibilities
This means that F(x, y) = y², which is C¹ and therefore locally Lipschitz continuous, satisfying the Picard–Lindelöf
theorem.
Even in such a simple setting, the maximum domain of solution cannot be all ℝ, since the solution is
This shows clearly that the maximum interval may depend on the initial conditions. The domain of y could be taken
as being , but this would lead to a domain that is not an interval, so that the side opposite to the
initial condition would be disconnected from the initial condition, and therefore not uniquely determined by it.
which is one of the two possible cases according to the above theorem.
Reduction of order
Differential equations can usually be solved more easily if the order of the equation can be reduced.
An explicit equation of order n,
y⁽ⁿ⁾ = F(x, y, y′, …, y⁽ⁿ⁻¹⁾),
can be written as a system of n first-order differential equations by defining a new family of unknown functions
yᵢ = y⁽ⁱ⁻¹⁾
for i = 1, 2, …, n. The n-dimensional system of first-order coupled differential equations is then
y₁′ = y₂, y₂′ = y₃, …, y_{n−1}′ = yₙ, yₙ′ = F(x, y₁, …, yₙ).
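This reduction to a first-order system is exactly the form that standard numerical solvers consume. A minimal sketch (our own illustrative code, using the classical fourth-order Runge–Kutta method) for y″ = −y, rewritten with y₁ = y, y₂ = y′:

```python
import math

def rk4_step(f, x, y, h):
    # One classical fourth-order Runge-Kutta step for the system y' = f(x, y).
    k1 = f(x, y)
    k2 = f(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(x + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

# y'' = -y  becomes the first-order system  y1' = y2,  y2' = -y1.
f = lambda x, y: [y[1], -y[0]]

steps = 3142
h = math.pi / steps
x, y = 0.0, [1.0, 0.0]          # y(0) = 1, y'(0) = 0, so exactly y(x) = cos x
for _ in range(steps):
    y = rk4_step(f, x, y, h)
    x += h
# After integrating to x = pi, y[0] is very close to cos(pi) = -1.
```

The same two-line reduction works for any explicit nth-order equation; only the right-hand-side function f changes.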
Separable equations
The main exact-solution methods for separable equations are:
• Separation of variables – first order, separable in x and y (general case; see below for special cases): divide
through by P₂Q₁.[21]
• Direct integration – first order, separable in x.[22]
• Second order – multiply the equation by 2 dy/dx and substitute; here Y(y) and X(x) are functions obtained from the
integrals rather than constant values, and they are set so that the final function F(x, y) satisfies the initial
equation.
• Integrating factor – first order, linear, inhomogeneous, with function coefficients.[30]
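As a worked instance of separation of variables (our own example, not one from the table above): for dy/dx = xy,

```latex
\frac{dy}{dx} = xy
\;\Longrightarrow\;
\int \frac{dy}{y} = \int x\,dx
\;\Longrightarrow\;
\ln|y| = \tfrac{1}{2}x^2 + C
\;\Longrightarrow\;
y = A e^{x^2/2},
```

and substituting back gives dy/dx = Ax e^{x²/2} = xy, as required.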
Notes
[1] Kreyszig (1972, p. 1)
[2] Simmons (1972, p. 2)
[3] Mathematics for Chemists, D.M. Hirst, THE MACMILLAN PRESS, 1976, (No ISBN) SBN: 333-18172-7
[4] Kreyszig (1972, p. 64)
[5] Simmons (1972, pp. 1,2)
[6] Halliday & Resnick (1977, p. 78)
[7] Tipler (1991, pp. 78–83)
[8] Harper (1976, p. 127)
[9] Kreyszig (1972, p. 2)
[10] Simmons (1972, p. 3)
[11] Harper (1976, p. 127)
[12] Kreyszig (1972, p. 24)
[13] Simmons (1972, p. 47)
[14] Harper (1976, p. 128)
[15] Kreyszig (1972, p. 24)
[16] Kreyszig (1972, p. 78)
[17] Kreyszig (1972, p. 4)
[18] Lawrence (1999, p. 9)
[19] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[20] Boscain; Chitour 2011, p. 21
[21] Mathematical Handbook of Formulas and Tables (3rd edition), S. Lipschutz, M.R. Spiegel, J. Liu, Schaum's Outline Series, 2009, ISBN
978-0-07-154855-7
[22] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[23] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[24] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[25] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[26] Mathematical Handbook of Formulas and Tables (3rd edition), S. Lipschutz, M.R. Spiegel, J. Liu, Schaum's Outline Series, 2009, ISBN
978-0-07-154855-7
[27] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[28] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[29] Further Elementary Analysis, R. Porter, G.Bell & Sons (London), 1978, ISBN 0-7135-1594-5
[30] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[31] Mathematical methods for physics and engineering, K.F. Riley, M.P. Hobson, S.J. Bence, Cambridge University Press, 2010, ISBN
978-0-521-86153-3
[32] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[33] Mathematical methods for physics and engineering, K.F. Riley, M.P. Hobson, S.J. Bence, Cambridge University Press, 2010, ISBN
978-0-521-86153-3
[34] Elementary Differential Equations and Boundary Value Problems (4th Edition), W.E. Boyce, R.C. Diprima, Wiley International, John Wiley
& Sons, 1986, ISBN 0-471-83824-1
[35] http://sage.openopt.org/welcome
[36] https://github.com/headmyshoulder/odeint-v2
[37] http://www.vissim.us
[38] http://user.mendelu.cz/marik/maw/index.php?lang=en&form=ode
[39] http://www.dotnumerics.com/NumericalLibraries/DifferentialEquations/
[40] http://jsxgraph.uni-bayreuth.de/wiki/index.php/Differential_equations
References
• Halliday, David; Resnick, Robert (1977), Physics (3rd ed.), New York: Wiley, ISBN 0-471-71716-9
• Harper, Charlie (1976), Introduction to Mathematical Physics, New Jersey: Prentice-Hall, ISBN 0-13-487538-9
• Kreyszig, Erwin (1972), Advanced Engineering Mathematics (3rd ed.), New York: Wiley, ISBN 0-471-50728-8.
• Polyanin, A. D. and V. F. Zaitsev, Handbook of Exact Solutions for Ordinary Differential Equations (2nd
edition)", Chapman & Hall/CRC Press, Boca Raton, 2003. ISBN 1-58488-297-2
• Simmons, George F. (1972), Differential Equations with Applications and Historical Notes, New York:
McGraw-Hill
• Tipler, Paul A. (1991), Physics for Scientists and Engineers: Extended version (3rd ed.), New York: Worth
Publishers, ISBN 0-87901-432-6
• Boscain, Ugo; Chitour, Yacine (2011) (in French), Introduction à l'automatique (http://www.cmapx.
polytechnique.fr/~boscain/poly2011.pdf)
• Lawrence, Dresner (1999), Applications of Lie's Theory of Ordinary and Partial Differential Equations, Bristol
and Philadelphia: Institute of Physics Publishing
Bibliography
• Coddington, Earl A.; Levinson, Norman (1955). Theory of Ordinary Differential Equations. New York:
McGraw-Hill.
• Hartman, Philip, Ordinary Differential Equations, 2nd Ed., Society for Industrial & Applied Math, 2002. ISBN
0-89871-510-5.
• W. Johnson, A Treatise on Ordinary and Partial Differential Equations (http://www.hti.umich.edu/cgi/b/bib/
bibperm?q1=abv5010.0001.001), John Wiley and Sons, 1913, in University of Michigan Historical Math
Collection (http://hti.umich.edu/u/umhistmath/)
• E.L. Ince, Ordinary Differential Equations, Dover Publications, 1958, ISBN 0-486-60349-0
• Witold Hurewicz, Lectures on Ordinary Differential Equations, Dover Publications, ISBN 0-486-49510-8
• Ibragimov, Nail H (1993). CRC Handbook of Lie Group Analysis of Differential Equations Vol. 1-3. Providence:
CRC-Press. ISBN 0-8493-4488-3.
• Teschl, Gerald (2012). Ordinary Differential Equations and Dynamical Systems (http://www.mat.univie.ac.at/
~gerald/ftp/book-ode/). Providence: American Mathematical Society. ISBN 978-0-8218-8328-0.
• A. D. Polyanin, V. F. Zaitsev, and A. Moussiaux, Handbook of First Order Partial Differential Equations, Taylor
& Francis, London, 2002. ISBN 0-415-27267-X
• D. Zwillinger, Handbook of Differential Equations (3rd edition), Academic Press, Boston, 1997.
External links
• NCLab (http://nclab.com) provides interactive graphical modules in the web browser to solve ordinary and
partial differential equations.
• Differential Equations (http://www.dmoz.org/Science/Math/Differential_Equations//) at the Open Directory
Project (includes a list of software for solving differential equations).
• EqWorld: The World of Mathematical Equations (http://eqworld.ipmnet.ru/index.htm), containing a list of
ordinary differential equations with their solutions.
• Online Notes / Differential Equations (http://tutorial.math.lamar.edu/classes/de/de.aspx) by Paul Dawkins,
Lamar University.
• Differential Equations (http://www.sosmath.com/diffeq/diffeq.html), S.O.S. Mathematics.
• A primer on analytical solution of differential equations (http://numericalmethods.eng.usf.edu/mws/gen/
08ode/mws_gen_ode_bck_primer.pdf) from the Holistic Numerical Methods Institute, University of South
Florida.
Partial differential equation 475
Introduction
Partial Differential Equations (PDEs) are equations that involve rates of change with respect to continuous variables.
The configuration of a rigid body is specified by six numbers, but the configuration of a fluid is given by the
continuous distribution of the temperature, pressure, and so forth. The dynamics for the rigid body take place in a
finite-dimensional configuration space; the dynamics for the fluid occur in an infinite-dimensional configuration
space. This distinction usually makes PDEs much harder to solve than Ordinary Differential Equations (ODEs), but
here again there will be simple solutions for linear problems. Classic domains where PDEs are used include
acoustics, fluid flow, electrodynamics, and heat transfer.
A partial differential equation (PDE) for the function u(x₁, …, xₙ) is an equation of the form
F(x₁, …, xₙ, u, ∂u/∂x₁, …, ∂u/∂xₙ, ∂²u/∂x₁∂x₁, …) = 0.
If F is a linear function of u and its derivatives, then the PDE is called linear. Common examples of linear PDEs
include the heat equation, the wave equation, Laplace's equation, Helmholtz equation, Klein–Gordon equation, and
Poisson's equation.
A relatively simple PDE is
∂u/∂x (x, y) = 0.
This relation implies that the function u(x,y) is independent of x. However, the equation gives no information on the
function's dependence on the variable y. Hence the general solution of this equation is
u(x, y) = f(y),
where f is an arbitrary function of y. The analogous ordinary differential equation is du/dx = 0, which has the
solution u(x) = c, where c is any constant value. These two examples illustrate that general solutions of ordinary differential equations
(ODEs) involve arbitrary constants, but solutions of PDEs involve arbitrary functions. A solution of a PDE is
generally not unique; additional conditions must generally be specified on the boundary of the region where the
solution is defined. For instance, in the simple example above, the function f(y) can be determined if u is specified on
the line x = 0.
A striking example of ill-posedness: consider the Laplace equation u_xx + u_yy = 0 with the Cauchy data
u(x, 0) = 0, u_y(x, 0) = sin(nx)/n,
where n is an integer. The derivative of u with respect to y approaches 0 uniformly in x as n increases, but the
solution is
u(x, y) = sinh(ny) sin(nx) / n².
This solution approaches infinity if nx is not an integer multiple of π for any non-zero value of y. The Cauchy
problem for the Laplace equation is called ill-posed or not well posed, since the solution does not depend
continuously upon the data of the problem. Such ill-posed problems are not usually satisfactory for physical
applications.
Notation
In PDEs, it is common to denote partial derivatives using subscripts. That is:
u_x = ∂u/∂x, u_xx = ∂²u/∂x², u_xy = ∂²u/∂x∂y.
Especially in physics, del (∇) is often used for spatial derivatives, and u̇, ü for time derivatives. For example, the
wave equation (described below) can be written as
ü = c²∇²u
or ü = c²Δu, where Δ is the Laplace operator.
Examples
Heat equation in one space dimension
The equation for conduction of heat in one dimension for a homogeneous body has the form
u_t = α u_xx,
where u(t,x) is temperature, and α is a positive constant that describes the rate of diffusion. The Cauchy problem for
this equation consists in specifying u(0, x) = f(x), where f(x) is an arbitrary function.
General solutions of the heat equation can be found by the method of separation of variables. Some examples appear
in the heat equation article. They are examples of Fourier series for periodic f and Fourier transforms for
non-periodic f. Using the Fourier transform, a general solution of the heat equation has the form
u(t, x) = (1/√(2π)) ∫ F(ξ) e^{iξx − αξ²t} dξ,
where F is an arbitrary function. To satisfy the initial condition, F is given by the Fourier transform of f, that is
F(ξ) = (1/√(2π)) ∫ f(x) e^{−iξx} dx.
If f represents a very small but intense source of heat, then the preceding integral can be approximated by the delta
distribution, multiplied by the strength of the source. For a source whose strength is normalized to 1, the result is
u(t, x) = e^{−x²/(4αt)} / √(4παt).
This result corresponds to the normal probability density for x with mean 0 and variance 2αt. The heat equation and
similar diffusion equations are useful tools to study random phenomena.
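The identification of the point-source solution with a normal density can be checked numerically. The sketch below is our own illustration (the step size, truncation range, and parameter values are arbitrary choices): it integrates the kernel e^{−x²/(4αt)}/√(4παt) on a grid and confirms unit mass and variance 2αt.

```python
import math

def heat_kernel(x, t, alpha):
    # Point-source ("fundamental") solution of u_t = alpha * u_xx.
    return math.exp(-x * x / (4 * alpha * t)) / math.sqrt(4 * math.pi * alpha * t)

alpha, t, h = 0.5, 1.0, 0.001
xs = [i * h for i in range(-10000, 10001)]            # grid on [-10, 10]
mass = h * sum(heat_kernel(x, t, alpha) for x in xs)
var = h * sum(x * x * heat_kernel(x, t, alpha) for x in xs)
# mass is close to 1 and var is close to 2 * alpha * t.
```

Because the kernel decays like a Gaussian, the truncation to [−10, 10] loses essentially nothing here.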
Wave equation in one spatial dimension
The wave equation is
u_tt = c² u_xx.
Here u might describe the displacement of a stretched string from equilibrium, or the difference in air pressure in a
tube, or the magnitude of an electromagnetic field in a tube, and c is a number that corresponds to the velocity of the
wave. The Cauchy problem for this equation consists in prescribing the initial displacement and velocity of a string
or other medium:
u(0, x) = f(x),  u_t(0, x) = g(x),
where f and g are arbitrary given functions. The solution of this problem is given by d'Alembert's formula:
u(t, x) = (1/2)[ f(x − ct) + f(x + ct) ] + (1/(2c)) ∫ from x−ct to x+ct of g(s) ds.
This formula implies that the solution at (t,x) depends only upon the data on the segment of the initial line that is cut
out by the characteristic curves
x − ct = constant,  x + ct = constant,
that are drawn backwards from that point. These curves correspond to signals that propagate with velocity c forward
and backward. Conversely, the influence of the data at any given point on the initial line propagates with the finite
velocity c: there is no effect outside a triangle through that point whose sides are characteristic curves. This behavior
is very different from the solution for the heat equation, where the effect of a point source appears (with small
amplitude) instantaneously at every point in space. The solution given above is also valid if t is negative, and the
explicit formula shows that the solution depends smoothly upon the data: both the forward and backward Cauchy
problems for the wave equation are well-posed.
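d'Alembert's formula can be sanity-checked by finite differences. In this sketch (our own illustration: the Gaussian initial profile, wave speed, and sample point are arbitrary choices, and the initial velocity g is taken to be zero), the residual u_tt − c²u_xx should be near zero:

```python
import math

c = 2.0
f = lambda x: math.exp(-x * x)      # initial displacement; initial velocity g = 0

def u(t, x):
    # d'Alembert's solution for zero initial velocity: half of the
    # initial profile travels left and half travels right, at speed c.
    return 0.5 * (f(x - c * t) + f(x + c * t))

t0, x0, h = 0.3, 0.4, 1e-3
u_tt = (u(t0 + h, x0) - 2 * u(t0, x0) + u(t0 - h, x0)) / h ** 2
u_xx = (u(t0, x0 + h) - 2 * u(t0, x0) + u(t0, x0 - h)) / h ** 2
residual = u_tt - c * c * u_xx      # near zero: u solves the wave equation
```

The residual is not exactly zero only because the second derivatives are approximated to O(h²) accuracy.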
where w(x) is the weighting function with respect to which the eigenfunctions of the spatial operator are orthogonal
in the x coordinate, subject to the boundary conditions.
Spherical waves
Spherical waves are waves whose amplitude depends only upon the radial distance r from a central point source. For
such waves, the three-dimensional wave equation takes the form
u_tt = c² ( u_rr + (2/r) u_r ).
This is equivalent to
(ru)_tt = c² (ru)_rr,
and hence the quantity ru satisfies the one-dimensional wave equation. Therefore a general solution for spherical
waves has the form
u(t, r) = (1/r) ( F(r − ct) + G(r + ct) ),
where F and G are completely arbitrary functions. Radiation from an antenna corresponds to the case where G is
identically zero. Thus the wave form transmitted from an antenna has no distortion in time: the only distorting factor
is 1/r. This feature of undistorted propagation of waves is not present if there are two spatial dimensions.
Conversely, given any harmonic function in two dimensions, it is the real part of an analytic function, at least locally.
Details are given in Laplace equation.
Petrovsky (1967, p. 248) shows how this formula can be obtained by summing a Fourier series for φ. If r<1, the
derivatives of φ may be computed by differentiating under the integral sign, and one can verify that φ is analytic,
even if u is continuous but not necessarily differentiable. This behavior is typical for solutions of elliptic partial
differential equations: the solutions may be much more smooth than the boundary data. This is in contrast to
solutions of the wave equation, and more general hyperbolic partial differential equations, which typically have no
more derivatives than the data.
Euler–Tricomi equation
The Euler–Tricomi equation, u_xx = x u_yy, is used in the investigation of transonic flow.
Advection equation
The advection equation describes the transport of a conserved scalar ψ in a velocity field u = (u, v, w). It is:
ψ_t + ∇·(ψu) = 0.
If the velocity field is solenoidal (that is, ∇·u = 0), then the equation may be simplified to
ψ_t + u·∇ψ = 0.
In the one-dimensional case where u is not constant and is equal to ψ, the equation is referred to as Burgers'
equation.
Ginzburg–Landau equation
The Ginzburg–Landau equation is used in modelling superconductivity. It is
Vibrating string
If the string is stretched between two points where x=0 and x=L and u denotes the amplitude of the displacement of
the string, then u satisfies the one-dimensional wave equation in the region where 0<x<L and t is unlimited. Since the
string is tied down at the ends, u must also satisfy the boundary conditions
u(t, 0) = 0,  u(t, L) = 0.
The method of separation of variables seeks solutions of the form u(t, x) = T(t) X(x); substituting into the wave
equation gives X″ + k² X = 0, where the constant k must be determined. The boundary conditions then imply that X
is a multiple of sin kx, and k must have the form
k = nπ/L,
where n is an integer. Each term in the sum corresponds to a mode of vibration of the string. The mode with n=1 is
called the fundamental mode, and the frequencies of the other modes are all multiples of this frequency. They form
the overtone series of the string, and they are the basis for musical acoustics. The initial conditions may then be
satisfied by representing f and g as infinite sums of these modes. Wind instruments typically correspond to vibrations
of an air column with one end open and one end closed. The corresponding boundary conditions are
u(t, 0) = 0,  u_x(t, L) = 0.
The method of separation of variables can also be applied in this case, and it leads to a series of odd overtones.
The general problem of this type is solved in Sturm–Liouville theory.
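The overtone structure follows directly from k = nπ/L. A small sketch (our own illustration; the string length and wave speed are arbitrary values):

```python
import math

def mode_frequency(n, L, c):
    # The boundary conditions force k = n*pi/L; the n-th mode oscillates
    # at angular frequency omega = c*k, i.e. at frequency f = omega / (2*pi).
    k = n * math.pi / L
    return c * k / (2 * math.pi)

L, c = 1.0, 340.0
freqs = [mode_frequency(n, L, c) for n in range(1, 5)]
# The fundamental is c/(2L) = 170 Hz and every overtone is an integer
# multiple of it, giving the overtone series described above.
```

For the open/closed air column, the boundary conditions instead select k = (2n − 1)π/(2L), which is why only the odd overtones appear.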
Vibrating membrane
If a membrane is stretched over a curve C that forms the boundary of a domain D in the plane, its vibrations are
governed by the wave equation
u_tt = c² ( u_xx + u_yy )
if t>0 and (x,y) is in D. The boundary condition is u(t,x,y) = 0 if (x,y) is on C. The method of separation of variables
leads to the form
u(t, x, y) = T(t) v(x, y),  with  T″ + c² k² T = 0  and  ∇²v + k² v = 0.
The latter equation is called the Helmholtz equation. The constant k must be determined to allow a non-trivial v to
satisfy the boundary condition on C. Such values of k² are called the eigenvalues of the Laplacian in D, and the
associated solutions are the eigenfunctions of the Laplacian in D. The Sturm–Liouville theory may be extended to
this elliptic eigenvalue problem (Jost, 2002).
Other examples
The Schrödinger equation is a PDE at the heart of non-relativistic quantum mechanics. In the WKB approximation it
is the Hamilton–Jacobi equation.
Except for the Dym equation and the Ginzburg–Landau equation, the above equations are linear in the sense that
they can be written in the form Au = f for a given linear operator A and a given function f. Other important non-linear
equations include the Navier–Stokes equations describing the flow of fluids, and Einstein's field equations of general
relativity.
Also see the list of non-linear partial differential equations.
Classification
Some linear, second-order partial differential equations can be classified as parabolic, hyperbolic or elliptic. Others
such as the Euler–Tricomi equation have different types in different regions. The classification provides a guide to
appropriate initial and boundary conditions, and to smoothness of the solutions.
The general second-order PDE in two independent variables has the form
A u_xx + 2B u_xy + C u_yy + (lower-order terms) = 0,
where the coefficients A, B, C etc. may depend upon x and y. If A² + B² + C² > 0 over a region of the xy plane,
the PDE is second-order in that region. This form is analogous to the equation for a conic section:
A x² + 2B xy + C y² + ⋯ = 0.
More precisely, replacing ∂/∂x by X, and likewise for other variables (formally this is done by a Fourier transform),
converts a constant-coefficient PDE into a polynomial of the same degree, with the top degree (a homogeneous
polynomial, here a quadratic form) being most significant for the classification.
Just as one classifies conic sections and quadratic forms into parabolic, hyperbolic, and elliptic based on the
discriminant B² − 4AC, the same can be done for a second-order PDE at a given point. However, the
discriminant in a PDE is given by B² − AC due to the convention of the xy term being 2B rather than B;
formally, the discriminant (of the associated quadratic form) is (2B)² − 4AC = 4(B² − AC), with the factor
of 4 dropped for simplicity.
1. B² − AC < 0 (elliptic): solutions of elliptic PDEs are as smooth as the coefficients allow, within the interior of the
region where the equation and solutions are defined. For example, solutions of Laplace's equation are analytic
within the domain where they are defined, but solutions may assume boundary values that are not smooth. The
motion of a fluid at subsonic speeds can be approximated with elliptic PDEs, and the Euler–Tricomi equation is
elliptic where x < 0.
2. B² − AC = 0 (parabolic): equations that are parabolic at every point can be transformed into a form analogous to the
heat equation by a change of independent variables. Solutions smooth out as the transformed time variable
increases. The Euler–Tricomi equation has parabolic type on the line where x=0.
3. B² − AC > 0 (hyperbolic): hyperbolic equations retain any discontinuities of functions or derivatives in the initial data.
An example is the wave equation. The motion of a fluid at supersonic speeds can be approximated with
hyperbolic PDEs, and the Euler–Tricomi equation is hyperbolic where x>0.
If there are n independent variables x1, x2 , ..., xn, a general linear partial differential equation of second order has the
form
Lu = Σᵢ Σⱼ a_{i,j} ∂²u/∂xᵢ∂xⱼ + (lower-order terms) = 0.
The classification depends upon the signature of the eigenvalues of the coefficient matrix.
1. Elliptic: The eigenvalues are all positive or all negative.
2. Parabolic : The eigenvalues are all positive or all negative, save one that is zero.
3. Hyperbolic: There is only one negative eigenvalue and all the rest are positive, or there is only one positive
eigenvalue and all the rest are negative.
4. Ultrahyperbolic: There is more than one positive eigenvalue and more than one negative eigenvalue, and there are
no zero eigenvalues. There is only limited theory for ultrahyperbolic equations (Courant and Hilbert, 1962).
Consider the first-order system
Lu = Σᵥ Aᵥ ∂u/∂xᵥ + B = 0,
where the coefficient matrices Aν and the vector B may depend upon x and u. If a hypersurface S is given in the
implicit form
φ(x₁, x₂, …, xₙ) = 0,
where φ has a non-zero gradient, then S is a characteristic surface for the operator L at a given point if the
characteristic form vanishes:
Q(∂φ/∂x₁, …, ∂φ/∂xₙ) = det [ Σᵥ Aᵥ ∂φ/∂xᵥ ] = 0.
The geometric interpretation of this condition is as follows: if data for u are prescribed on the surface S, then it may
be possible to determine the normal derivative of u on S from the differential equation. If the data on S and the
differential equation determine the normal derivative of u on S, then S is non-characteristic. If the data on S and the
differential equation do not determine the normal derivative of u on S, then the surface is characteristic, and the
differential equation restricts the data on S: the differential equation is internal to S.
1. A first-order system Lu=0 is elliptic if no surface is characteristic for L: the values of u on S and the differential
equation always determine the normal derivative of u on S.
2. A first-order system is hyperbolic at a point if there is a space-like surface S with normal ξ at that point. This
means that, given any non-trivial vector η orthogonal to ξ, and a scalar multiplier λ, the equation
Q(λξ + η) = 0
has m real roots λ1, λ2, ..., λm. The system is strictly hyperbolic if these roots are always distinct. The geometrical
interpretation of this condition is as follows: the characteristic form Q(ζ)=0 defines a cone (the normal cone) with
homogeneous coordinates ζ. In the hyperbolic case, this cone has m sheets, and the axis ζ = λ ξ runs inside these
sheets: it does not intersect any of them. But when displaced from the origin by η, this axis intersects every sheet. In
the elliptic case, the normal cone has no real sheets.
An example is the Euler–Tricomi equation uxx = x uyy, which is called elliptic-hyperbolic because it is elliptic in the
region x < 0, hyperbolic in the region x > 0, and degenerate parabolic on the line x = 0.
Separation of variables
Linear PDEs can be reduced to systems of ordinary differential equations by the important technique of separation of
variables. The logic of this technique may be confusing upon first acquaintance, but it rests on the uniqueness of
solutions to differential equations: as with ODEs, if you can find any solution that solves the equation and satisfies
the boundary conditions, then it is the solution. We assume as an ansatz that the dependence of the solution on space
and time can be written as a product of terms that each depend on a single coordinate, and then see if and how this
can be made to solve the problem.
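As a standard illustration of the ansatz (the one-dimensional heat equation is our choice of example, not one from the text above), writing u(x, t) = X(x)T(t) in the equation u_t = α u_xx separates the PDE:

```latex
u(x,t) = X(x)\,T(t)
\;\Longrightarrow\;
X(x)\,T'(t) = \alpha\,X''(x)\,T(t)
\;\Longrightarrow\;
\frac{T'(t)}{\alpha\,T(t)} = \frac{X''(x)}{X(x)} = -\lambda .
```

Since the left side depends only on t and the right side only on x, both must equal a constant, written −λ, leaving the two ODEs T′ = −αλT and X″ = −λX; the boundary conditions then select the admissible values of λ.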
In the method of separation of variables, one reduces a PDE to a PDE in fewer variables, which is an ODE if in one
variable – these are in turn easier to solve.
This is possible for simple PDEs, which are called separable partial differential equations, and the domain is
generally a rectangle (a product of intervals). Separable PDEs correspond to diagonal matrices – thinking of "the
value for fixed x" as a coordinate, each coordinate can be understood separately.
This generalizes to the method of characteristics, and is also used in integral transforms.
Method of characteristics
In special cases, one can find characteristic curves on which the equation reduces to an ODE – changing coordinates
in the domain to straighten these curves allows separation of variables, and is called the method of characteristics.
More generally, one may find characteristic surfaces.
Integral transform
An integral transform may transform the PDE to a simpler one, in particular a separable PDE. This corresponds to
diagonalizing an operator.
An important example of this is Fourier analysis, which diagonalizes the heat equation using the eigenbasis of
sinusoidal waves.
If the domain is finite or periodic, an infinite sum of solutions such as a Fourier series is appropriate, but an integral
of solutions such as a Fourier integral is generally required for infinite domains. The solution for a point source for
the heat equation given above is an example for use of a Fourier integral.
Change of variables
Often a PDE can be reduced to a simpler form with a known solution by a suitable change of variables. For example,
the Black–Scholes PDE is reducible to the heat equation by a change of variables (for complete details see Solution
of the Black–Scholes Equation [1]).
Fundamental solution
Inhomogeneous equations can often be solved (for constant coefficient PDEs, always be solved) by finding the
fundamental solution (the solution for a point source), then taking the convolution of the fundamental solution with
the source term to obtain the solution.
This is analogous in signal processing to understanding a filter by its impulse response.
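To make the signal-processing analogy concrete, here is a minimal discrete sketch (all names are ours): the output of a linear time-invariant filter is the convolution of its input with the impulse response, just as the solution of a constant-coefficient inhomogeneous PDE is the convolution of the source with the fundamental solution.

```python
def convolve(signal, impulse_response):
    """Discrete convolution: out[n] = sum over k of h[k] * x[n - k]."""
    n, m = len(signal), len(impulse_response)
    out = [0.0] * (n + m - 1)
    for i, x in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out

h = [0.5, 0.3, 0.2]
# A unit impulse reproduces the impulse response itself, just as the
# fundamental solution is the response to a point source.
assert convolve([1.0], h) == h
# A general input is a superposition of shifted, scaled impulses.
print(convolve([1.0, 2.0], h))   # approximately [0.5, 1.3, 0.8, 0.4]
```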
Superposition principle
Because any superposition of solutions of a linear, homogeneous PDE is again a solution, the particular solutions
may then be combined to obtain more general solutions.
Semianalytical methods
The Adomian decomposition method, the Lyapunov artificial small parameter method, and He's homotopy
perturbation method are all special cases of the more general homotopy analysis method. These are series expansion
methods, and except for the Lyapunov method they are independent of small physical parameters, unlike the
well-known perturbation theory, giving these methods greater flexibility and generality of solution.
References
• Adomian, G. (1994). Solving Frontier problems of Physics: The decomposition method. Kluwer Academic
Publishers.
• Courant, R. & Hilbert, D. (1962), Methods of Mathematical Physics, II, New York: Wiley-Interscience.
• Evans, L. C. (1998), Partial Differential Equations, Providence: American Mathematical Society,
ISBN 0-8218-0772-2.
• Ibragimov, Nail H (1993), CRC Handbook of Lie Group Analysis of Differential Equations Vol. 1-3, Providence:
CRC-Press, ISBN 0-8493-4488-3.
• John, F. (1982), Partial Differential Equations (4th ed.), New York: Springer-Verlag, ISBN 0-387-90609-6.
• Jost, J. (2002), Partial Differential Equations, New York: Springer-Verlag, ISBN 0-387-95428-7.
• Lewy, Hans (1957), "An example of a smooth linear partial differential equation without solution", Annals of
Mathematics, 2nd Series 66 (1): 155–158.
• Liao, S.J. (2003), Beyond Perturbation: Introduction to the Homotopy Analysis Method, Boca Raton: Chapman &
Hall/ CRC Press, ISBN 1-58488-407-X
• Olver, P.J. (1995), Equivalence, Invariants and Symmetry, Cambridge Press.
• Petrovskii, I. G. (1967), Partial Differential Equations, Philadelphia: W. B. Saunders Co.
• Pinchover, Y. & Rubinstein, J. (2005), An Introduction to Partial Differential Equations, New York: Cambridge
University Press, ISBN 0-521-84886-5.
• Polyanin, A. D. (2002), Handbook of Linear Partial Differential Equations for Engineers and Scientists, Boca
Raton: Chapman & Hall/CRC Press, ISBN 1-58488-299-9.
• Polyanin, A. D. & Zaitsev, V. F. (2004), Handbook of Nonlinear Partial Differential Equations, Boca Raton:
Chapman & Hall/CRC Press, ISBN 1-58488-355-3.
• Polyanin, A. D.; Zaitsev, V. F. & Moussiaux, A. (2002), Handbook of First Order Partial Differential Equations,
London: Taylor & Francis, ISBN 0-415-27267-X.
• Solin, P. (2005), Partial Differential Equations and the Finite Element Method, Hoboken, NJ: J. Wiley & Sons,
ISBN 0-471-72070-4.
• Solin, P.; Segeth, K. & Dolezel, I. (2003), Higher-Order Finite Element Methods, Boca Raton: Chapman &
Hall/CRC Press, ISBN 1-58488-438-X.
• Stephani, H. (1989), Differential Equations: Their Solution Using Symmetries. Edited by M. MacCallum,
Cambridge University Press.
• Wazwaz, Abdul-Majid (2009). Partial Differential Equations and Solitary Waves Theory. Higher Education
Press. ISBN 90-5809-369-7.
• Zwillinger, D. (1997), Handbook of Differential Equations (3rd ed.), Boston: Academic Press,
ISBN 0-12-784395-7.
• Gershenfeld, N. (1999), The Nature of Mathematical Modeling (1st ed.), New York: Cambridge University Press,
New York, NY, USA, ISBN 0-521-57095-6.
External links
• Partial Differential Equations: Exact Solutions [2] at EqWorld: The World of Mathematical Equations.
• Partial Differential Equations: Index [3] at EqWorld: The World of Mathematical Equations.
• Partial Differential Equations: Methods [4] at EqWorld: The World of Mathematical Equations.
• Example problems with solutions [5] at exampleproblems.com
• Partial Differential Equations [6] at mathworld.wolfram.com
• Dispersive PDE Wiki [7]
• NEQwiki, the nonlinear equations encyclopedia [8]
References
[1] http://web.archive.org/web/20080411030405/http://www.math.unl.edu/~sdunbar1/Teaching/MathematicalFinance/Lessons/BlackScholes/Solution/solution.shtml
[2] http://eqworld.ipmnet.ru/en/pde-en.htm
[3] http://eqworld.ipmnet.ru/en/solutions/eqindex/eqindex-pde.htm
[4] http://eqworld.ipmnet.ru/en/methods/meth-pde.htm
[5] http://www.exampleproblems.com/wiki/index.php?title=Partial_Differential_Equations
[6] http://mathworld.wolfram.com/PartialDifferentialEquation.html
[7] http://tosio.math.toronto.edu/wiki/index.php/Main_Page
[8] http://www.primat.mephi.ru/wiki/
Pearson's chi-squared test
Definition
Pearson's chi-squared test is used to assess two types of comparison: tests of goodness of fit and tests of
independence.
• A test of goodness of fit establishes whether or not an observed frequency distribution differs from a theoretical
distribution.
• A test of independence assesses whether paired observations on two variables, expressed in a contingency table,
are independent of each other (e.g. polling responses from people of different nationalities to see if one's
nationality affects the response).
The first step is to calculate the chi-squared test statistic, X2, which resembles a normalized sum of squared
deviations between observed and theoretical frequencies (see below). The second step is to determine the degrees of
freedom, df, of that statistic, which is essentially the number of frequencies reduced by the number of parameters of
the fitted distribution. In the third step, X2 is compared to the critical value of no significance from the χ2
distribution, which in many cases gives a good approximation of the distribution of X2. A test that does not rely on
this approximation is Fisher's exact test; it is substantially more accurate in obtaining a significance level, especially
with few observations.
Pearson's chi-squared test 489
and the reduction in the degrees of freedom is p = 1, notionally because the observed frequencies Oi are
constrained to sum to N.
Other distributions
When testing whether observations are random variables whose distribution belongs to a given family of
distributions, the "theoretical frequencies" are calculated using a distribution from that family fitted in some standard
way. The reduction in the degrees of freedom is calculated as p = s + 1, where s is the number of parameters
used in fitting the distribution. For instance, when checking a three-parameter Weibull distribution, p = 4, and
when checking a normal distribution (where the parameters are mean and standard deviation), p = 3. In other
words, there will be n − p degrees of freedom, where n is the number of categories.
Note that the degrees of freedom are not based on the number of observations as with a Student's t or
F-distribution. For example, if testing for a fair, six-sided die, there would be five degrees of freedom because there
are six categories (one for each face). The number of times the die is rolled has no effect on the number of degrees
of freedom.
The value of the test statistic is

X2 = Σi=1..n (Oi − Ei)2 / Ei

where
X2 = Pearson's cumulative test statistic, which asymptotically approaches a χ2 distribution;
Oi = an observed frequency;
Ei = an expected (theoretical) frequency, asserted by the null hypothesis;
n = the number of cells in the table.
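The statistic is a one-liner in code. A small sketch (the function name and the 60-roll die counts are ours, chosen for illustration):

```python
def chi_squared(observed, expected):
    """Pearson's cumulative statistic: X2 = sum of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for a die rolled 60 times; a fair die expects 10 per face.
x2 = chi_squared([5, 8, 9, 8, 10, 20], [10] * 6)
print(round(x2, 2))   # 13.4, with n = 6 cells and 6 - 1 = 5 degrees of freedom
```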
Bayesian method
In Bayesian statistics, one would instead use a Dirichlet distribution as conjugate prior. If one took a uniform prior,
then the maximum likelihood estimate for the population probability is the observed probability, and one may
compute a credible region around this or another estimate.
Test of independence
In this case, an "observation" consists of the values of two outcomes and the null hypothesis is that the occurrence of
these outcomes is statistically independent. Each observation is allocated to one cell of a two-dimensional array of
cells (called a table) according to the values of the two outcomes. If there are r rows and c columns in the table, the
"theoretical frequency" for a cell, given the hypothesis of independence, is

Ei,j = (Σk=1..c Oi,k) (Σk=1..r Ok,j) / N,

where N is the total sample size (the sum of all cells in the table). The value of the test-statistic is

X2 = Σi=1..r Σj=1..c (Oi,j − Ei,j)2 / Ei,j.
Fitting the model of "independence" reduces the number of degrees of freedom by p = r + c − 1. The number of
degrees of freedom is equal to the number of cells rc, minus the reduction in degrees of freedom, p, which reduces
to (r − 1)(c − 1).
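The bookkeeping for an r × c table can be sketched as follows (helper names and the sample table are ours):

```python
def expected_counts(table):
    """E_{i,j} = (row_i total) * (col_j total) / N under independence."""
    r, c = len(table), len(table[0])
    row = [sum(table[i]) for i in range(r)]
    col = [sum(table[i][j] for i in range(r)) for j in range(c)]
    N = sum(row)
    return [[row[i] * col[j] / N for j in range(c)] for i in range(r)]

def df_independence(r, c):
    """rc cells minus the p = r + c - 1 fitted constraints = (r-1)(c-1)."""
    return r * c - (r + c - 1)

table = [[10, 20], [30, 40]]          # a 2 x 2 contingency table
print(expected_counts(table))         # [[12.0, 18.0], [28.0, 42.0]]
print(df_independence(2, 2))          # 1
```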
For the test of independence, also known as the test of homogeneity, a chi-squared probability of less than or equal
to 0.05 (or the chi-squared statistic being at or larger than the 0.05 critical point) is commonly interpreted by applied
workers as justification for rejecting the null hypothesis that the row variable is independent of the column
variable.[2] The alternative hypothesis corresponds to the variables having an association or relationship where the
structure of this relationship is not specified.
Assumptions
The chi-squared test, when used with the standard approximation that a chi-squared distribution is applicable, has the
following assumptions:
• Simple random sample – The sample data is a random sampling from a fixed distribution or population where
each member of the population has an equal probability of selection. Variants of the test have been developed for
complex samples, such as where the data is weighted.
• Sample size (whole table) – A sample with a sufficiently large size is assumed. If a chi-squared test is conducted
on a sample of smaller size, it will yield an inaccurate inference; the researcher might end up committing a
Type II error.
• Expected cell count – Adequate expected cell counts. Some require 5 or more, and others require 10 or more. A
common rule is 5 or more in all cells of a 2-by-2 table, and 5 or more in 80% of cells in larger tables, but no cells
with zero expected count. When this assumption is not met, Yates's Correction is applied.
• Independence – The observations are always assumed to be independent of each other. This means chi-squared
cannot be used to test correlated data (like matched pairs or panel data); in such cases McNemar's test may be
more appropriate.
Examples
Goodness of fit
For example, to test the hypothesis that a random sample of 100 people has been drawn from a population in which
men and women are equal in frequency, the observed number of men and women would be compared to the
theoretical frequencies of 50 men and 50 women. If there were 44 men in the sample and 56 women, then

X2 = (44 − 50)2/50 + (56 − 50)2/50 = 1.44
If the null hypothesis is true (i.e., men and women are chosen with equal probability in the sample), the test statistic
will be drawn from a chi-squared distribution with one degree of freedom. Though one might expect two degrees of
freedom (one each for the men and women), we must take into account that the total number of men and women is
constrained (100), and thus there is only one degree of freedom (2 − 1). Alternatively, if the male count is known the
female count is determined, and vice-versa.
Consultation of the chi-squared distribution for 1 degree of freedom shows that the probability of observing this
difference (or a more extreme difference than this) if men and women are equally numerous in the population is
approximately 0.23. This probability is higher than conventional criteria for statistical significance (0.001–0.05), so
normally we would not reject the null hypothesis that the number of men in the population is the same as the number
of women (i.e., we would consider our sample within the range of what we'd expect for a 50/50 male/female ratio.)
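The arithmetic of this example can be replayed in a few lines; for one degree of freedom the chi-squared tail probability reduces to the complementary error function, P = erfc(√(X2/2)):

```python
import math

O, E = [44, 56], [50, 50]
x2 = sum((o - e) ** 2 / e for o, e in zip(O, E))
print(x2)                         # 1.44

# Chi-squared upper-tail probability for 1 degree of freedom:
p = math.erfc(math.sqrt(x2 / 2))
print(round(p, 2))                # 0.23
```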
Problems
The approximation to the chi-squared distribution breaks down if expected frequencies are too low. It will normally
be acceptable so long as no more than 20% of the events have expected frequencies below 5. Where there is only 1
degree of freedom, the approximation is not reliable if expected frequencies are below 10. In this case, a better
approximation can be obtained by reducing the absolute value of each difference between observed and expected
frequencies by 0.5 before squaring; this is called Yates's correction for continuity.
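Yates's correction changes only one line of the computation (the function name is ours); applied to the 44/56 example above it lowers the statistic from 1.44 to 1.21:

```python
def chi_squared_yates(observed, expected):
    """Reduce each |O - E| by 0.5 before squaring (Yates's continuity correction)."""
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))

# Same 44 men / 56 women sample: 2 * 5.5**2 / 50 = 1.21
print(round(chi_squared_yates([44, 56], [50, 50]), 2))   # 1.21
```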
In cases where the expected value, E, is found to be small (indicating either a small underlying population
probability, or a small number of observations), the normal approximation of the multinomial distribution can fail,
and in such cases it is found to be more appropriate to use the G-test, a likelihood ratio-based test statistic. Where the
total sample size is small, it is necessary to use an appropriate exact test, typically either the binomial test or (for
contingency tables) Fisher's exact test; but note that this test assumes fixed and known marginal totals.
Distribution
The null distribution of the Pearson statistic with j rows and k columns is approximated by the chi-squared
distribution with (k − 1)(j − 1) degrees of freedom.[3]
This approximation arises as the true distribution, under the null hypothesis, if the expected value is given by a
multinomial distribution. For large sample sizes, the central limit theorem says this distribution tends toward a
certain multivariate normal distribution.
Two cells
In the special case where there are only two cells in the table, the expected values follow a binomial distribution,
where
p = probability, under the null hypothesis,
n = number of observations in the sample.
In the above example the hypothesised probability of a male observation is 0.5, with 100 samples. Thus we expect to
observe 50 males.
If n is sufficiently large, the above binomial distribution may be approximated by a Gaussian (normal) distribution
and thus the Pearson test statistic approximates a chi-squared distribution,
Let O1 be the number of observations from the sample that are in the first cell. The Pearson test statistic can be
expressed as

X2 = (O1 − np)2/(np) + (n − O1 − n(1 − p))2/(n(1 − p)) = (O1 − np)2/(np(1 − p)).

By the normal approximation to a binomial this is the square of one standard normal variate, and hence is
distributed as chi-squared with 1 degree of freedom. Note that the denominator is the variance of the
Gaussian approximation, so the statistic can be written

X2 = ((O1 − np)/√(np(1 − p)))2.
So, consistent with the meaning of the chi-squared distribution, we are measuring how probable the observed
number of standard deviations away from the mean is under the Gaussian approximation (which is a good
approximation for large n).
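The algebraic collapse of the two-cell statistic into a single squared z-score can be verified numerically (names are ours; the 44-males example above is reused):

```python
import math

def two_cell_statistic(O1, n, p):
    """X2 for two cells; algebraically equal to (O1 - np)^2 / (np(1-p))."""
    E1, E2 = n * p, n * (1 - p)
    return (O1 - E1) ** 2 / E1 + ((n - O1) - E2) ** 2 / E2

O1, n, p = 44, 100, 0.5
x2 = two_cell_statistic(O1, n, p)
z = (O1 - n * p) / math.sqrt(n * p * (1 - p))   # one standard normal variate
print(x2, z ** 2)   # both 1.44: X2 is the square of a z-score
```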
The chi-squared distribution is then integrated on the right of the statistic value to obtain the P-value, which is equal
to the probability of getting a statistic equal or bigger than the observed one, assuming the null hypothesis.
Many cells
Similar arguments as above lead to the desired result. Each cell (except the final one, whose value is completely
determined by the others) is treated as an independent binomial variable, and their contributions are summed and
each contributes one degree of freedom.
Notes
[1] Karl Pearson (1900). "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is
such that it can be reasonably supposed to have arisen from random sampling". Philosophical Magazine, Series 5 50 (302): 157–175.
doi:10.1080/14786440009463897.
[2] "Critical Values of the Chi-Squared Distribution" (http://www.itl.nist.gov/div898/handbook/eda/section3/eda3674.htm).
NIST/SEMATECH e-Handbook of Statistical Methods. National Institute of Standards and Technology.
[3] Statistics for Applications. MIT OpenCourseWare. Lecture 23 (http://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2003/lecture-notes/lec23.pdf).
Pearson's Theorem. Retrieved 21 March 2007.
References
• Chernoff, H.; Lehmann E.L. (1954). "The use of maximum likelihood estimates in tests for goodness-of-fit".
The Annals of Mathematical Statistics 25 (3): 579–586. doi:10.1214/aoms/1177728726.
• Plackett, R.L. (1983). "Karl Pearson and the Chi-Squared Test". International Statistical Review (International
Statistical Institute (ISI)) 51 (1): 59–72. doi:10.2307/1402731. JSTOR 1402731.
• Greenwood, P.E. & Nikulin, M.S. (1996). A Guide to Chi-Squared Testing, J. Wiley, New York,
ISBN 0-471-55779-X.
Perron–Frobenius theorem 494
Perron–Frobenius theorem
In linear algebra, the Perron–Frobenius theorem, proved by Oskar Perron (1907) and Georg Frobenius (1912),
asserts that a real square matrix with positive entries has a unique largest real eigenvalue and that the corresponding
eigenvector has strictly positive components, and also asserts a similar statement for certain classes of nonnegative
matrices. This theorem has important applications to probability theory (ergodicity of Markov chains); to the theory
of dynamical systems (subshifts of finite type); to economics (Leontief's input-output model);[1] to demography
(Leslie population age distribution model);[2] to the mathematical background of Internet search engines;[3] and even
to the ranking of football teams.[4]
Positive matrices
Let A = (aij) be an n × n positive matrix: aij > 0 for 1 ≤ i, j ≤ n. Then the following statements hold.
1. There is a positive real number r, called the Perron root or the Perron–Frobenius eigenvalue, such that r is an
eigenvalue of A and any other eigenvalue λ (possibly, complex) is strictly smaller than r in absolute value, |λ| < r.
Thus, the spectral radius ρ(A) is equal to r.
2. The Perron–Frobenius eigenvalue is simple: r is a simple root of the characteristic polynomial of A.
Consequently, the eigenspace associated to r is one-dimensional. (The same is true for the left eigenspace, i.e., the
eigenspace for AT.)
3. There exists an eigenvector v = (v1,…,vn) of A with eigenvalue r such that all components of v are positive: A v =
r v, vi > 0 for 1 ≤ i ≤ n. (Respectively, there exists a positive left eigenvector w : wT A = r wT, wi > 0.)
4. There are no other positive (moreover non-negative) eigenvectors except positive multiples of v (respectively, left
eigenvectors except w), i.e., all other eigenvectors must have at least one negative or non-real component.
5. lim(k→∞) A^k/r^k = v wT, where the left and right eigenvectors for A are normalized so that wTv = 1. Moreover, the
matrix v wT is the projection onto the eigenspace corresponding to r. This projection is called the Perron
projection.
6. Collatz–Wielandt formula: for all non-negative non-zero vectors x, let f(x) be the minimum value of [Ax]i / xi
taken over all those i such that xi ≠ 0. Then f is a real valued function whose maximum is the Perron–Frobenius
eigenvalue.
7. A "Min-max" Collatz–Wielandt formula takes a form similar to the one above: for all strictly positive vectors x,
let g(x) be the maximum value of [Ax]i / xi taken over i. Then g is a real valued function whose minimum is the
Perron–Frobenius eigenvalue.
8. The Perron–Frobenius eigenvalue satisfies the inequalities

mini Σj aij ≤ r ≤ maxi Σj aij.
The left and right eigenvectors v and w are usually normalized so that the sum of their components is equal to 1; in
this case, they are sometimes called stochastic eigenvectors.
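A sketch of how the Perron root is found in practice (power iteration; all names and the sample matrix are ours), with numerical checks of the Collatz–Wielandt formula from item 6 and the row-sum bounds from item 8:

```python
def perron(A, iters=200):
    """Power iteration for a positive matrix: converges to the
    Perron-Frobenius eigenvalue r and a positive eigenvector v."""
    n = len(A)
    v = [1.0] * n
    r = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        r = max(w)               # normalize v so its largest component is 1
        v = [x / r for x in w]
    return r, v

A = [[1.0, 2.0], [3.0, 4.0]]     # positive matrix; exact Perron root (5 + sqrt(33))/2
r, v = perron(A)
print(round(r, 4))               # 5.3723, between min row sum 3 and max row sum 7

# Collatz-Wielandt: f(x) = min_i [Ax]_i / x_i attains its maximum r at x = v.
f = lambda x: min(sum(A[i][j] * x[j] for j in range(2)) / x[i] for i in range(2))
print(round(f(v), 4))            # 5.3723
```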
Non-negative matrices
An extension of the theorem to matrices with non-negative entries is also available. In order to highlight the
similarities and differences between the two cases the following points are to be noted: every non-negative matrix
can obviously be obtained as a limit of positive matrices, and thus one obtains the existence of an eigenvector with
non-negative components; the corresponding eigenvalue will obviously be non-negative and greater than or equal,
in absolute value, to all other eigenvalues.[7] [8] However, the simple examples

[0 1; 1 0] and [0 1; 0 0]

show that for non-negative matrices there may exist eigenvalues of the same absolute value as the maximal one (1
and −1 are eigenvalues of the first matrix); moreover, the maximal eigenvalue may not be a simple root of the
characteristic polynomial, can be zero, and the corresponding eigenvector (1,0) need not be strictly positive (second
example). So it may seem that most properties are broken for non-negative matrices; however, Frobenius found the
right way.
The key feature of the theory in the non-negative case is to find some special subclass of non-negative matrices, the
irreducible matrices, for which a non-trivial generalization is possible. Namely, although the eigenvalues attaining
the maximal absolute value may not be unique, the structure of the maximal eigenvalues is under control: they have
the form e^(i2πl/h) r, where h is an integer called the period of the matrix, r is a real strictly positive eigenvalue, and
l = 0, 1, ..., h − 1. The eigenvector corresponding to r has strictly positive components (in contrast with the general
case of non-negative matrices, where components are only non-negative). Also all such eigenvalues are simple roots
of the characteristic polynomial. Further properties are described below.
Classification of matrices
Let A be a square matrix (not necessarily positive or even real). The matrix A is irreducible if any of the following
equivalent properties holds.
Definition 1: A does not have non-trivial invariant coordinate subspaces. Here a non-trivial coordinate subspace
means a linear subspace spanned by any proper subset of basis vectors. More explicitly, for any linear subspace
spanned by basis vectors e_i1, ..., e_ik, 0 < k < n, its image under the action of A is not contained in the same subspace.
Definition 2: A cannot be conjugated into block upper triangular form by a permutation matrix P:

PAP−1 = [E F; 0 G],

where E and G are non-trivial (i.e. of size greater than zero) square matrices.
If A is non-negative other definitions exist:
Definition 3: For every pair of indices i and j, there exists a natural number m such that (Am)ij is not equal to 0.
Definition 4: One can associate with a matrix A a certain directed graph GA. It has exactly n vertices, where n is size
of A, and there is an edge from vertex i to vertex j precisely when Aij > 0. Then the matrix A is irreducible if and only
if its associated graph GA is strongly connected.
A matrix is reducible if it is not irreducible.
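Definition 4 is the easiest to turn into a test. A sketch (names are ours) that checks strong connectivity of the associated digraph by depth-first search:

```python
def reachable(A, s):
    """Vertices reachable from s in the digraph with an edge i -> j iff A[i][j] > 0."""
    n, seen, stack = len(A), {s}, [s]
    while stack:
        i = stack.pop()
        for j in range(n):
            if A[i][j] > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def is_irreducible(A):
    """Definition 4: A is irreducible iff its graph G_A is strongly connected."""
    n = len(A)
    return all(len(reachable(A, s)) == n for s in range(n))

print(is_irreducible([[0, 1], [1, 0]]))   # True  (a 2-cycle)
print(is_irreducible([[0, 1], [0, 0]]))   # False (no edge back to vertex 0)
```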
Let A be non-negative. Fix an index i and define the period of index i to be the greatest common divisor of all
natural numbers m such that (A^m)ii > 0. When A is irreducible, the period of every index is the same and is called the
period of A. In fact, when A is irreducible, the period can be defined as the greatest common divisor of the lengths of
the closed directed paths in GA (see Kitchens[9] page 16). The period is also called the index of imprimitivity.
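The period can be computed directly from this definition, as the gcd of the return times m with (A^m)ii > 0. A small sketch (names are ours; it scans m up to a fixed bound, which suffices for small irreducible examples):

```python
from math import gcd

def bool_power(A, m):
    """Boolean m-th power of the adjacency pattern of A (m >= 1)."""
    n = len(A)
    B = [[A[i][j] > 0 for j in range(n)] for i in range(n)]
    P = B
    for _ in range(m - 1):
        P = [[any(P[i][k] and B[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    return P

def period(A, limit=50):
    """gcd of all m <= limit with (A^m)_00 > 0; for an irreducible A every
    index gives the same answer, so index 0 is enough."""
    h = 0
    for m in range(1, limit + 1):
        if bool_power(A, m)[0][0]:
            h = gcd(h, m)
    return h

cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]   # 3-cycle: returns at m = 3, 6, ...
print(period(cycle))   # 3
```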
where the blocks along the main diagonal are zero square matrices.
9. Collatz–Wielandt formula: for all non-negative non-zero vectors x let f(x) be the minimum value of [Ax]i /
xi taken over all those i such that xi ≠ 0. Then f is a real valued function whose maximum is the
Perron–Frobenius eigenvalue.
10. The Perron–Frobenius eigenvalue satisfies the inequalities

mini Σj aij ≤ r ≤ maxi Σj aij.
The matrix shows that the blocks on the diagonal may be of different sizes, the matrices Aj need not
Further properties
Let A be an irreducible non-negative matrix, then:
1. (I + A)^(n−1) is a positive matrix (Meyer[5] claim 8.3.5 p. 672 [6]).
2. Wielandt's theorem. If |B| < A, then ρ(B) ≤ ρ(A). If equality holds (i.e. if μ = ρ(A)e^(iφ) is an eigenvalue of B), then
B = e^(iφ) DAD^(−1) for some diagonal unitary matrix D (i.e. the diagonal elements of D equal e^(iΘl) and the
off-diagonal elements are zero).[10]
3. If some power A^q is reducible, then it is completely reducible, i.e. for some permutation matrix P it is true that

PA^qP^(−1) = diag(A1, A2, ..., Ad),

where the Ai are irreducible matrices having the same maximal eigenvalue. The number of these matrices d is
the greatest common divisor of q and h, where h is the period of A.[11]
4. If c(x) = x^n + c_k1 x^(n−k1) + c_k2 x^(n−k2) + ... + c_ks x^(n−ks) is the characteristic polynomial of A in which only the
nonzero coefficients are listed, then the period of A equals the greatest common divisor of k1, k2, ..., ks.[12]
5. Cesàro averages: lim(k→∞) (1/k) Σ(i=0..k−1) A^i/r^i = v wT, where the left and right eigenvectors for A are normalized
so that wTv = 1. Moreover, the matrix v wT is the spectral projection corresponding to r, the Perron projection.[13]
6. Let r be the Perron-Frobenius eigenvalue; then the adjoint matrix for (r − A) is positive.[14]
7. If A has at least one non-zero diagonal element, then A is primitive.[15]
Also:
• If 0 ≤ A < B, then rA ≤ rB; moreover, if A is irreducible, then the inequality is strict: rA < rB.
One of the definitions of a primitive matrix requires A to be non-negative and that there exists m such that A^m is
positive. One may wonder how big m can be, depending on the size of A. The following answers this question.
• Assume A is a non-negative primitive matrix of size n; then A^(n^2−2n+2) is positive. Moreover there exists a matrix M
(Wielandt's matrix, not reproduced here) such that M^k remains not positive (just non-negative) for all
k < n^2 − 2n + 2; in particular, (M^(n^2−2n+1))11 = 0.[16]
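The bound can be checked directly for n = 3, where n^2 − 2n + 2 = 5; the matrix W below is our reconstruction of the classical extremal example (a 3-cycle plus one extra edge):

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def smallest_positive_power(A, limit=20):
    """Smallest m with all entries of A^m strictly positive (the primitivity index)."""
    P = A
    for m in range(1, limit + 1):
        if all(x > 0 for row in P for x in row):
            return m
        P = matmul(P, A)
    return None

# Reconstructed Wielandt-type matrix for n = 3: edges 1->2, 2->3, 3->1 and 3->2.
W = [[0, 1, 0], [0, 0, 1], [1, 1, 0]]
print(smallest_positive_power(W))   # 5 = 3**2 - 2*3 + 2
```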
Applications
Numerous books have been written on the subject of non-negative matrices, and Perron–Frobenius theory is
invariably a central feature. The following examples only scratch the surface of its vast application domain.
Non-negative matrices
The Perron–Frobenius theorem does not apply directly to non-negative matrices. Nevertheless any reducible square
matrix A may be written in upper-triangular block form (known as the normal form of a reducible matrix)[17]
PAP−1 = [B1 * … *; 0 B2 … *; … ; 0 0 … Bh],
where P is a permutation matrix and each Bi is a square matrix that is either irreducible or zero. Now if A is
non-negative then so are all the Bi and the spectrum of A is just the union of their spectra. Therefore many of the
spectral properties of A may be deduced by applying the theorem to the irreducible Bi.
For example, the Perron root is the maximum of the ρ(Bi). While there will still be eigenvectors with non-negative
components, it is quite possible that none of these will be positive.
Stochastic matrices
A row (column) stochastic matrix is a square matrix each of whose rows (columns) consists of non-negative real
numbers whose sum is unity. The theorem cannot be applied directly to such matrices because they need not be
irreducible.
If A is row-stochastic then the column vector with each entry 1 is an eigenvector corresponding to the eigenvalue 1,
which is also ρ(A) by the remark above. It might not be the only eigenvalue on the unit circle, and the associated
eigenspace can be multi-dimensional. If A is row-stochastic and irreducible then the Perron projection is also
row-stochastic and all its rows are equal.
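The eigenvalue-1 claim is immediate to verify numerically (the matrix below is our own example):

```python
A = [[0.9, 0.1], [0.4, 0.6]]   # rows sum to 1: row-stochastic

# The all-ones column vector is a right eigenvector with eigenvalue 1.
ones = [1.0, 1.0]
Av = [sum(A[i][j] * ones[j] for j in range(2)) for i in range(2)]
print(Av)   # [1.0, 1.0], i.e. 1 * ones
```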
Compact operators
More generally, it can be extended to the case of non-negative compact operators, which, in many ways, resemble
finite-dimensional matrices. These are commonly studied in physics, under the name of transfer operators, or
sometimes Ruelle–Perron–Frobenius operators (after David Ruelle). In this case, the leading eigenvalue
corresponds to the thermodynamic equilibrium of a dynamical system, and the lesser eigenvalues to the decay modes
of a system that is not in equilibrium. Thus, the theory offers a way of discovering the arrow of time in what would
otherwise appear to be reversible, deterministic dynamical processes, when examined from the point of view of
point-set topology.[20]
Proof methods
A common thread in many proofs is the Brouwer fixed point theorem. Another popular method is that of Wielandt
(1950). He used the Collatz–Wielandt formula described above to extend and clarify Frobenius's work.[21] Another
proof is based on the spectral theory[22] from which part of the arguments are borrowed.
Perron root is strictly maximal eigenvalue for positive (and primitive) matrices
Case: If A is a positive (or more generally primitive) matrix, then there exists a real positive eigenvalue r
(Perron-Frobenius eigenvalue or Perron root), which is strictly greater in absolute value than all other eigenvalues,
hence r is the spectral radius of A.
That claim is wrong for general non-negative irreducible matrices, which have h eigenvalues with the same absolute
value as r, where h is the period of A.
Lemma
Given a non-negative A, assume there exists m such that A^m is positive; then A^(m+1), A^(m+2), A^(m+3), ... are all positive.
A^(m+1) = A·A^m, so it can have a zero element only if some row of A is entirely zero; but in that case the same row of
A^m would be zero, contradicting the positivity of A^m.
Applying the same arguments as above for primitive matrices proves the main claim.
Multiplicity one
The proof that the Perron-Frobenius eigenvalue is a simple root of the characteristic polynomial is also elementary.
The arguments here are close to those in Meyer.[5]
Case: The eigenspace associated to the Perron-Frobenius eigenvalue r is one-dimensional.
Let v be a strictly positive eigenvector corresponding to r, and let w be another eigenvector with the same eigenvalue.
(The vector w can be chosen to be real, because A and r are both real, so the null space of A − r has a basis consisting
of real vectors.) Assume at least one of the components of w is positive (otherwise multiply w by −1). Given the
maximal possible α such that u = v − αw is non-negative, one of the components of u is zero (otherwise α would not
be maximal). The vector u is an eigenvector. It is non-negative, and by the lemma described in the previous section
non-negativity implies strict positivity for any eigenvector. On the other hand, at least one component of u is zero.
The contradiction implies that u = 0, so w is proportional to v and no second independent eigenvector exists.
Case: There are no Jordan cells corresponding to the Perron–Frobenius eigenvalue r, or to any other eigenvalue of the
same absolute value.
If there were a Jordan cell, the infinity norm ||(A/r)^k||∞ would tend to infinity as k → ∞, but that contradicts the existence
of the positive eigenvector.
Assume r = 1 (otherwise replace A by A/r). Let v be a strictly positive Perron–Frobenius eigenvector, so Av = v; then for
every row i and every k,
Σ_j (A^k)_{ij} v_j = v_i, hence Σ_j (A^k)_{ij} ≤ v_i / min_j v_j.
So ||A^k||∞ is bounded for all k. This gives another proof that there are no eigenvalues of absolute value greater than the
Perron–Frobenius one. It also rules out a Jordan cell for any eigenvalue of absolute value 1 (in particular for the
Perron–Frobenius one), because the existence of a Jordan cell would make ||A^k||∞ unbounded. For a two-by-two Jordan cell
J = (λ 1; 0 λ), one has J^k = (λ^k kλ^{k−1}; 0 λ^k),
hence ||J^k||∞ = |λ|^k + k|λ|^{k−1} = k + 1 (for |λ| = 1), which tends to infinity with k. Since J^k = C^{−1} A^k C, we get
||A^k|| ≥ ||J^k|| / (||C^{−1}|| ||C||), so ||A^k|| also tends to infinity. The resulting contradiction implies that there are no
Jordan cells for the corresponding eigenvalues.
Combining the two claims above shows that the Perron–Frobenius eigenvalue r is a simple root of the characteristic
polynomial. For non-primitive matrices there exist other eigenvalues with the same absolute value as r; the same claim
holds for them too, but requires more work.
Collatz–Wielandt formula
Case: Given a positive (or, more generally, irreducible non-negative) matrix A, define, for every non-negative non-zero
vector x, f(x) as the minimum of [Ax]_i / x_i taken over all i such that x_i ≠ 0. Then f is a real-valued function whose
maximum is the Perron–Frobenius eigenvalue r.
The value r is attained when x is the Perron–Frobenius eigenvector v. It remains to show that f(x) ≤ r for every other
vector x. Let ξ = f(x), so 0 ≤ ξx ≤ Ax, and let w be a strictly positive left eigenvector for A, i.e. w^T A = r w^T. Then
w^T ξx ≤ w^T(Ax) = (w^T A)x = r w^T x, and hence ξ ≤ r.[24]
This is not specific to non-negative matrices: for any matrix A and any eigenvalue λ of A it is true that |λ| ≤ ||A||∞.
This is an immediate corollary of the Gershgorin circle theorem, but a direct proof is short:
any matrix-induced norm satisfies ||A|| ≥ |λ| for any eigenvalue λ, because if x is a corresponding eigenvector then
||A|| ≥ ||Ax||/||x|| = ||λx||/||x|| = |λ|. The infinity norm of a matrix is the maximum of its row sums:
||A||∞ = max_i Σ_j |a_{ij}|. Hence the desired inequality is exactly ||A||∞ ≥ |λ| applied to the non-negative matrix A.
Another inequality is: min_i Σ_j a_{ij} ≤ r.
This fact is specific to non-negative matrices; for general matrices there is nothing similar. If A is positive (not just
non-negative), then there exists a positive eigenvector w such that Aw = rw whose smallest component (say w_i) is 1.
Then r = (Aw)_i ≥ the sum of the numbers in row i of A. Thus the minimum row sum gives a lower bound for r, and this
observation can be extended to all non-negative matrices by continuity.
Another way to argue it is via the Collatz-Wielandt formula. One takes the vector x = (1, 1, ..., 1) and immediately
obtains the inequality.
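These bounds are easy to verify numerically. A minimal sketch with a made-up positive 2x2 matrix: power iteration recovers the Perron root r, the row sums of A bracket r, and the Collatz–Wielandt value f(x) never exceeds r, with equality at the Perron eigenvector:

```python
A = [[2.0, 1.0],
     [1.0, 3.0]]      # hypothetical positive matrix; r = (5 + sqrt(5))/2

def apply_m(M, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in M]

def f(x):
    """Collatz-Wielandt function: min over i of (Ax)_i / x_i, x_i != 0."""
    Ax = apply_m(A, x)
    return min(Ax[i] / x[i] for i in range(len(x)) if x[i] != 0)

# Power iteration: repeated application of A (with renormalization)
# converges to the Perron eigenvector v of a positive matrix.
v = [1.0, 1.0]
for _ in range(200):
    w = apply_m(A, v)
    s = max(w)
    v = [wi / s for wi in w]
r = max(apply_m(A, v))            # Av ~ r v and max(v) = 1, so this is r

row_sums = [sum(row) for row in A]
assert min(row_sums) <= r <= max(row_sums)   # row sums bracket r
assert f([1.0, 1.0]) == min(row_sums)        # f on all-ones = min row sum
assert f([1.0, 1.0]) <= r                    # f never exceeds r ...
assert abs(f(v) - r) < 1e-9                  # ... and attains r at v
```

Taking x = (1, 1, ..., 1) in f makes the minimum-row-sum lower bound of the previous paragraph a one-line corollary.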
Further proofs
Perron projection
The proof now proceeds using spectral decomposition. The trick here is to split the Perron root from the other
eigenvalues. The spectral projection associated with the Perron root is called the Perron projection and it enjoys the
following property:
Case: The Perron projection of an irreducible non-negative square matrix is a positive matrix.
Perron's findings and also (1)–(5) of the theorem are corollaries of this result. The key point is that a positive
projection always has rank one. This means that if A is an irreducible non-negative square matrix then the algebraic
and geometric multiplicities of its Perron root are both one. Also if P is its Perron projection then AP = PA = ρ(A)P
so every column of P is a positive right eigenvector of A and every row is a positive left eigenvector. Moreover if Ax
= λx then PAx = λPx = ρ(A)Px which means Px = 0 if λ ≠ ρ(A). Thus the only positive eigenvectors are those
associated with ρ(A). If A is a primitive matrix with ρ(A) = 1 then it can be decomposed as P ⊕ (1 − P)A so that An =
P + (1 − P)An. As n increases the second of these terms decays to zero leaving P as the limit of An as n → ∞.
The power method is a convenient way to compute the Perron projection of a primitive matrix. If v and w are the
positive row and column vectors that it generates, then the Perron projection is just wv/vw. Unlike in the Jordan form,
the spectral projections are not neatly blocked: here they are overlaid, and each generally has complex entries extending
to all four corners of the square matrix. Nevertheless they retain their mutual orthogonality, which is what facilitates
the decomposition.
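A minimal sketch of this recipe, using an assumed primitive row-stochastic matrix (so its Perron root is 1): the power method applied to A and to its transpose yields the positive column and row eigenvectors w and v, the Perron projection is the rank-one matrix wv/vw, and A^n converges to it entrywise:

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def apply_m(M, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def power_vector(M):
    """Power iteration with sum-normalization; returns the Perron vector."""
    x = [1.0] * len(M)
    for _ in range(500):
        y = apply_m(M, x)
        s = sum(y)
        x = [yi / s for yi in y]
    return x

A = [[0.5, 0.5],
     [0.25, 0.75]]                # primitive and row-stochastic: rho(A) = 1

w = power_vector(A)               # positive right (column) eigenvector
v = power_vector(transpose(A))    # positive left (row) eigenvector

vw = sum(vi * wi for vi, wi in zip(v, w))
P = [[w[i] * v[j] / vw for j in range(2)] for i in range(2)]   # P = wv/vw

# AP = PA = rho(A) P (here rho(A) = 1), and A^n -> P entrywise.
AP = matmul(A, P)
An = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(200):
    An = matmul(An, A)

assert all(abs(AP[i][j] - P[i][j]) < 1e-12 for i in range(2) for j in range(2))
assert all(abs(An[i][j] - P[i][j]) < 1e-10 for i in range(2) for j in range(2))
```

For this stochastic example the rows of P are all equal to v, the stationary distribution (1/3, 2/3), which is the familiar Markov-chain limit of A^n.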
Peripheral projection
The analysis when A is irreducible and non-negative is broadly similar. The Perron projection is still positive, but
there may now be other eigenvalues of modulus ρ(A) that rule out use of the power method and prevent the powers of
(1 − P)A from decaying as in the primitive case whenever ρ(A) = 1. This motivates the peripheral projection: the
spectral projection of A corresponding to all the eigenvalues of modulus ρ(A).
Case: The peripheral projection of an irreducible non-negative square matrix is a non-negative matrix with a positive diagonal.
Cyclicity
Suppose in addition that ρ(A) = 1 and A has h eigenvalues on the unit circle. If P is the peripheral projection then the
matrix R = AP = PA is non-negative and irreducible, R^h = P, and the cyclic group P, R, R^2, ..., R^{h−1} represents the
harmonics of A. The spectral projection of A at the eigenvalue λ on the unit circle is given by the formula
(1/h) Σ_{k=0}^{h−1} λ^{−k} R^k.
All of these projections (including the Perron projection) have the same positive diagonal; moreover,
choosing any one of them and then taking the modulus of every entry invariably yields the Perron projection. Some
routine work is still needed to establish the cyclic properties (6)–(8), but it is essentially mechanical. The spectral
decomposition of A is given by A = R ⊕ (1 − P)A, so the difference between A^n and
R^n is A^n − R^n = (1 − P)A^n, representing the transients of A^n, which eventually decay to zero. P may be computed as the
limit of A^{nh} as n → ∞.
Caveats
Example matrices L, P, T and M illustrate what can go wrong if the necessary conditions are not met. It is easily seen
that the Perron and peripheral projections of L are both equal to P; thus when the original matrix is reducible the
projections may lose non-negativity, and there is no chance of expressing them as limits of its powers. The matrix T is
an example of a primitive matrix with zero diagonal. If the diagonal of an irreducible non-negative square matrix is
non-zero then the matrix must be primitive, but this example demonstrates that the converse is false. M is an example
of a matrix with several missing spectral teeth: if ω = e^{iπ/3} then ω^6 = 1 and the eigenvalues of M are
{1, ω^2, ω^3, ω^4}, so ω and ω^5 are both absent.
Terminology
A problem that causes confusion is a lack of standardisation in the definitions. For example, some authors use the
terms strictly positive and positive to mean > 0 and ≥ 0 respectively. In this article positive means > 0 and
non-negative means ≥ 0. Another vexed area concerns decomposability and reducibility: irreducible is an overloaded
term. For avoidance of doubt a non-zero non-negative square matrix A such that 1 + A is primitive is sometimes said
to be connected. Then irreducible non-negative square matrices and connected matrices are synonymous.[25]
The non-negative eigenvector is often normalized so that the sum of its components is equal to unity; in this case, the
eigenvector is the vector of a probability distribution and is sometimes called a stochastic eigenvector.
Perron–Frobenius eigenvalue and dominant eigenvalue are alternative names for the Perron root. Spectral
projections are also known as spectral projectors and spectral idempotents. The period is sometimes referred to as
the index of imprimitivity or the order of cyclicity.
Notes
[1] Meyer 2000, 8.3.6, p. 681 (http://www.matrixanalysis.com/Chapter8.pdf)
[2] Meyer 2000, 8.3.7, p. 683 (http://www.matrixanalysis.com/Chapter8.pdf)
[3] Langville & Meyer 2006, 15.2, p. 167 (http://books.google.com/books?id=hxvB14-I0twC&lpg=PP1&dq=isbn:0691122024&pg=PA167#v=onepage&q&f=false)
[4] Keener 1993, p. 80 (http://links.jstor.org/sici?sici=0036-1445(199303)35:1<80:TPTATR>2.0.CO;2-O)
[5] Meyer 2000, chapter 8, p. 665 (http://www.matrixanalysis.com/Chapter8.pdf)
[6] http://www.matrixanalysis.com/Chapter8.pdf
[7] Meyer 2000, chapter 8.3, p. 670 (http://www.matrixanalysis.com/Chapter8.pdf)
[8] Gantmacher 2000, chapter XIII.3, theorem 3, p. 66 (http://books.google.com/books?id=cyX32q8ZP5cC&lpg=PA178&vq=preceding section&pg=PA66#v=onepage&q&f=false)
[9] Kitchens, Bruce (1998), Symbolic Dynamics: One-sided, Two-sided and Countable State Markov Shifts (http://books.google.ru/books?id=mCcdC_5crpoC&lpg=PA195&ots=RbFr1TkSiY&dq=kitchens perron frobenius&pg=PA16#v=onepage&q&f=false), Springer
[10] Meyer 2000, claim 8.3.11, p. 675 (http://www.matrixanalysis.com/Chapter8.pdf)
[11] Gantmacher 2000, section XIII.5, theorem 9
[12] Meyer 2000, p. 679 (http://www.matrixanalysis.com/Chapter8.pdf)
[13] Meyer 2000, example 8.3.2, p. 677 (http://www.matrixanalysis.com/Chapter8.pdf)
[14] Gantmacher 2000, section XIII.2.2, p. 62 (http://books.google.com/books?id=cyX32q8ZP5cC&lpg=PA178&vq=preceding section&pg=PA62#v=onepage&q&f=true)
[15] Meyer 2000, example 8.3.3, p. 678 (http://www.matrixanalysis.com/Chapter8.pdf)
[16] Meyer 2000, chapter 8, example 8.3.4, p. 679 and exercise 8.3.9, p. 685 (http://www.matrixanalysis.com/Chapter8.pdf)
[17] Varga 2002, 2.43, p. 51
[18] Brualdi, Richard A.; Ryser, Herbert John (1992). Combinatorial Matrix Theory. Cambridge: Cambridge UP. ISBN 0-521-32265-0.
[19] Brualdi, Richard A.; Cvetkovic, Dragos (2009). A Combinatorial Approach to Matrix Theory and Its Applications. Boca Raton, FL: CRC Press. ISBN 978-1-4200-8223-4.
[20] Mackey, Michael C. (1992). Time's Arrow: The Origins of Thermodynamic Behaviour. New York: Springer-Verlag. ISBN 0-387-97702-3.
[21] Gantmacher 2000, section XIII.2.2, p. 54 (http://books.google.ru/books?id=cyX32q8ZP5cC&lpg=PR5&dq=Applications of the theory of matrices&pg=PA54#v=onepage&q&f=false)
[22] Smith, Roger (2006), "A Spectral Theoretic Proof of Perron–Frobenius" (ftp://emis.maths.adelaide.edu.au/pub/EMIS/journals/MPRIA/2002/pa102i1/pdf/102a102.pdf), Mathematical Proceedings of the Royal Irish Academy (The Royal Irish Academy) 102 (1): 29–35, doi:10.3318/PRIA.2002.102.1.29
[23] Meyer 2000, chapter 8, claim 8.2.10, p. 666 (http://www.matrixanalysis.com/Chapter8.pdf)
[24] Meyer 2000, chapter 8, p. 666 (http://www.matrixanalysis.com/Chapter8.pdf)
[25] For surveys of results on irreducibility, see Olga Taussky-Todd and Richard A. Brualdi.
References
Original papers
• Perron, Oskar (1907), "Zur Theorie der Matrices", Mathematische Annalen 64 (2): 248–263,
doi:10.1007/BF01449896
• Frobenius, Georg (1912), "Ueber Matrizen aus nicht negativen Elementen", Sitzungsber. Königl. Preuss. Akad.
Wiss.: 456–477
• Frobenius, Georg (1908), "Über Matrizen aus positiven Elementen, 1", Sitzungsber. Königl. Preuss. Akad. Wiss.:
471–476
• Frobenius, Georg (1909), "Über Matrizen aus positiven Elementen, 2", Sitzungsber. Königl. Preuss. Akad. Wiss.:
514–518
• Gantmacher, Felix (2000) [1959], The Theory of Matrices, Volume 2 (http://books.google.com/
books?id=cyX32q8ZP5cC&lpg=PA178&vq=preceding section&pg=PA53#v=onepage&q&f=true), AMS
Chelsea Publishing, ISBN 0-8218-2664-6 (1959 edition had different title: "Applications of the theory of
matrices". Also the numeration of chapters is different in the two editions.)
• Langville, Amy; Meyer, Carl (2006), Google page rank and beyond (http://pagerankandbeyond.com), Princeton
University Press, ISBN 0-691-12202-4
• Keener, James (1993), "The Perron–Frobenius theorem and the ranking of football teams" (http://links.jstor.
org/sici?sici=0036-1445(199303)35:1<80:TPTATR>2.0.CO;2-O), SIAM Review (SIAM) 35 (1): 80–93
• Meyer, Carl (2000), Matrix analysis and applied linear algebra (http://www.matrixanalysis.com/Chapter8.
pdf), SIAM, ISBN 0-89871-454-0
• Romanovsky, V. (1933), "Sur les zéros des matrices stocastiques" (http://www.numdam.org/
item?id=BSMF_1933__61__213_0), Bulletin de la Société Mathématique de France 61: 213–219
• Collatz, Lothar (1942), "Einschließungssatz für die charakteristischen Zahlen von Matrize", Mathematische
Zeitschrift 48 (1): 221–226, doi:10.1007/BF01180013
• Wielandt, Helmut (1950), "Unzerlegbare, nicht negative Matrizen", Mathematische Zeitschrift 52 (1): 642–648,
doi:10.1007/BF02230720
Further reading
• Abraham Berman, Robert J. Plemmons, Nonnegative Matrices in the Mathematical Sciences, 1994, SIAM. ISBN
0-89871-321-8.
• Chris Godsil and Gordon Royle, Algebraic Graph Theory, Springer, 2001.
• A. Graham, Nonnegative Matrices and Applicable Topics in Linear Algebra, John Wiley&Sons, New York, 1987.
• R. A. Horn and C.R. Johnson, Matrix Analysis, Cambridge University Press, 1990
• S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability (https://netfiles.uiuc.edu/meyn/www/
spm_files/book.html) London: Springer-Verlag, 1993. ISBN 0-387-19832-6 (2nd edition, Cambridge University
Press, 2009)
• Henryk Minc, Nonnegative matrices, John Wiley&Sons, New York, 1988, ISBN 0-471-83966-3
• Seneta, E. Non-negative matrices and Markov chains. 2nd rev. ed., 1981, XVI, 288 p., Softcover Springer Series
in Statistics. (Originally published by Allen & Unwin Ltd., London, 1973) ISBN 978-0-387-29765-1
• Suprunenko, D.A. (2001), "P/p072350" (http://www.encyclopediaofmath.org/index.php?title=P/p072350), in
Hazewinkel, Michiel, Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4 (The claim that Aj has
order n/h at the end of the statement of the theorem is incorrect.)
• Richard S. Varga, Matrix Iterative Analysis, 2nd ed., Springer-Verlag, 2002
Poisson distribution 506
Poisson distribution
Poisson
Probability mass function
The horizontal axis is the index k, the number of occurrences. The function is defined only at integer values of k; the
connecting lines are only guides for the eye.
Cumulative distribution function
The horizontal axis is the index k, the number of occurrences. The CDF is discontinuous at the integers of k and flat
everywhere else because a Poisson-distributed variable takes on only integer values.
Notation: Pois(λ)
Parameters: λ > 0 (real)
Support: k ∈ { 0, 1, 2, 3, ... }
PMF: λ^k e^{−λ} / k!
CDF: e^{−λ} Σ_{i=0}^{⌊k⌋} λ^i / i!, or equivalently Γ(⌊k+1⌋, λ) / ⌊k⌋! (for k ≥ 0, where Γ(x, y) is the upper
incomplete gamma function and ⌊k⌋ is the floor function)
Entropy: λ(1 − ln λ) + e^{−λ} Σ_{k=0}^{∞} λ^k ln(k!) / k!, approximately (1/2) ln(2πeλ) − 1/(12λ) − ... (for large λ)
MGF: exp(λ(e^t − 1))
CF: exp(λ(e^{it} − 1))
PGF: exp(λ(z − 1))
In probability theory and statistics, the Poisson distribution (pronounced [pwasɔ̃]) is a discrete probability
distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or
space if these events occur with a known average rate and independently of the time since the last event.[1] (The
Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or
volume.)
Suppose someone typically receives on average 4 pieces of mail per day. There will, however, be a certain spread:
sometimes a little more, sometimes a little less, once in a while nothing at all.[2] Given only the average rate for a
certain period of observation (pieces of mail per day, phone calls per hour, etc.), and assuming that the process, or
mix of processes, producing the event flow is essentially random, the Poisson distribution specifies how likely it
is that the count will be 3, or 5, or 11, or any other number during one period of observation. That is, it predicts the
degree of spread around a known average rate of occurrence.[2]
The distribution's practical usefulness has been explained by the Poisson law of small numbers.[3]
History
The distribution was first introduced by Siméon Denis Poisson (1781–1840) and published, together with his
probability theory, in 1837 in his work Recherches sur la probabilité des jugements en matière criminelle et en
matière civile (“Research on the Probability of Judgments in Criminal and Civil Matters”).[4] The work focused on
certain random variables N that count, among other things, the number of discrete occurrences (sometimes called
“arrivals”) that take place during a time-interval of given length.
A practical application of this distribution was made by Ladislaus Bortkiewicz in 1898 when he was given the task
of investigating the number of soldiers in the Prussian army killed accidentally by horse kick; this experiment
introduced the Poisson distribution to the field of reliability engineering.[5]
Definition
A discrete random variable X is said to have a Poisson distribution with parameter λ > 0 if for k = 0, 1, 2, ... the
probability mass function of X is given by:
f(k; λ) = Pr(X = k) = λ^k e^{−λ} / k!,
where
• e is the base of the natural logarithm (e = 2.71828...)
• k! is the factorial of k.
The positive real number λ is equal to the expected value of X and also to its variance:
λ = E(X) = Var(X).
The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare. The
Poisson distribution is sometimes called a Poissonian.
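The definition can be checked directly. A minimal sketch with the made-up value λ = 4: the pmf sums to 1 over a long truncation of the support, and both the mean and the variance come out equal to λ:

```python
from math import exp, factorial

lam = 4.0

def pmf(k):
    """Poisson probability mass function for parameter lam."""
    return lam ** k * exp(-lam) / factorial(k)

# Truncate the infinite support; the tail beyond k = 100 is negligible here.
ks = range(100)
total = sum(pmf(k) for k in ks)
mean = sum(k * pmf(k) for k in ks)
var = sum((k - mean) ** 2 * pmf(k) for k in ks)

assert abs(total - 1) < 1e-12   # probabilities sum to 1
assert abs(mean - lam) < 1e-9   # E(X) = lambda
assert abs(var - lam) < 1e-9    # Var(X) = lambda
```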
Properties
Mean
• The expected value of a Poisson-distributed random variable is equal to λ, and so is its variance.
• The coefficient of variation is λ^{−1/2}, while the index of dispersion is 1.[6]
• The mean deviation about the mean is[6] E|X − λ| = 2λ^{⌊λ⌋+1} e^{−λ} / ⌊λ⌋!.
• The mode of a Poisson-distributed random variable with non-integer λ is equal to ⌊λ⌋, the largest
integer less than or equal to λ (also written floor(λ)). When λ is a positive integer, the modes are λ and
λ − 1.
• All of the cumulants of the Poisson distribution are equal to the expected value λ. The nth factorial moment of the
Poisson distribution is λ^n.
Median
Bounds for the median ( ν ) of the distribution are known and are sharp:[7] λ − ln 2 ≤ ν < λ + 1/3.
Higher moments
• The higher moments m_k of the Poisson distribution about the origin are Touchard polynomials in λ:
m_k = Σ_{i=0}^{k} λ^i S(k, i),
where the S(k, i) are Stirling numbers of the second kind.[8] The coefficients of the polynomials have a combinatorial meaning. In
fact, when the expected value of the Poisson distribution is 1, then Dobinski's formula says that the nth moment
equals the number of partitions of a set of size n.
• Sums of Poisson-distributed random variables: if X_i ~ Pois(λ_i), i = 1, ..., n, are independent, then
Σ_{i=1}^{n} X_i ~ Pois(Σ_{i=1}^{n} λ_i).[9]
A converse is Raikov's theorem, which says that if the sum of two independent random variables is
Poisson-distributed, then so is each of those two independent random variables.[10]
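Dobinski's observation above is easy to verify numerically: with λ = 1, the nth moment of the Poisson distribution equals the nth Bell number (the number of partitions of an n-element set), whose first values are 1, 1, 2, 5, 15, 52. A minimal sketch:

```python
from math import exp, factorial

def poisson_moment(n, lam=1.0, support=100):
    """n-th raw moment E[X^n] of Pois(lam), via a truncated sum."""
    def p(k):
        return lam ** k * exp(-lam) / factorial(k)
    return sum(k ** n * p(k) for k in range(support))

# Bell numbers B_0 .. B_5 count the partitions of sets of size 0 .. 5.
bell = [1, 1, 2, 5, 15, 52]
for n, b in enumerate(bell):
    assert abs(poisson_moment(n) - b) < 1e-6
```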
Other properties
• The Poisson distributions are infinitely divisible probability distributions.[11][12]
• The directed Kullback–Leibler divergence between Pois(λ) and Pois(λ0) is given by
D_KL(Pois(λ) || Pois(λ0)) = λ0 − λ + λ ln(λ/λ0).
• Bounds for the tail probabilities of a Poisson random variable can be derived using a Chernoff
bound argument.[13]
Related distributions
• If X1 ~ Pois(λ1) and X2 ~ Pois(λ2) are independent, then the difference Y = X1 − X2 follows a
Skellam distribution.
• If X1 ~ Pois(λ1) and X2 ~ Pois(λ2) are independent, then the distribution of X1 conditional on
X1 + X2 is a binomial distribution. Specifically, X1 | (X1 + X2 = k) ~ Binomial(k, λ1/(λ1 + λ2)).
More generally, if X1, X2, ..., Xn are independent Poisson random variables
with parameters λ1, λ2, ..., λn, then
X_i | (Σ_{j} X_j = k) ~ Binomial(k, λ_i / Σ_{j} λ_j).
• The Poisson distribution can be derived as a limiting case to the binomial distribution as the number of trials goes
to infinity and the expected number of successes remains fixed — see law of rare events below. Therefore it can
be used as an approximation of the binomial distribution if n is sufficiently large and p is sufficiently small. There
is a rule of thumb stating that the Poisson distribution is a good approximation of the binomial distribution if n is
at least 20 and p is smaller than or equal to 0.05, and an excellent approximation if n ≥ 100 and np ≤ 10.[14]
• For sufficiently large values of λ (say λ > 1000), the normal distribution with mean λ and variance λ (standard
deviation √λ) is an excellent approximation to the Poisson distribution. If λ is greater than about 10, then the
normal distribution is a good approximation if an appropriate continuity correction is performed, i.e., P(X ≤ x),
where (lower-case) x is a non-negative integer, is replaced by P(X ≤ x + 0.5).
• Variance-stabilizing transformation: When a variable is Poisson distributed, its square root is approximately
normally distributed with expected value of about √λ and variance of about 1/4.[15][16] Under this
transformation, the convergence to normality (as λ increases) is far faster than the untransformed variable. Other,
slightly more complicated, variance stabilizing transformations are available,[16] one of which is Anscombe
transform. See Data transformation (statistics) for more general uses of transformations.
• If for every t > 0 the number of arrivals in the time interval [0,t] follows the Poisson distribution with mean λ t,
then the sequence of inter-arrival times are independent and identically distributed exponential random variables
having mean 1 / λ.[17]
• The cumulative distribution functions of the Poisson and chi-squared distributions are related in the following
ways:[18]
F_Poisson(k; λ) = 1 − F_{χ²}(2λ; 2(k + 1))   for integer k,
and[19]
Pr(X = k) = F_{χ²}(2λ; 2(k + 1)) − F_{χ²}(2λ; 2k).
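The law-of-rare-events limit described above can be checked numerically. A minimal sketch with made-up values n = 1000 and λ = 3: the binomial pmf with p = λ/n is already very close to the Poisson pmf with the same mean:

```python
from math import comb, exp, factorial

lam, n = 3.0, 1000
p = lam / n                 # small success probability, np = lam fixed

def binom_pmf(k):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k):
    return lam ** k * exp(-lam) / factorial(k)

# The two pmfs agree to about three decimal places at every small k.
for k in range(15):
    assert abs(binom_pmf(k) - poisson_pmf(k)) < 2e-3
```

Increasing n (while keeping np = λ fixed) shrinks the discrepancy further, which is exactly the limiting statement.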
Occurrence
Applications of the Poisson distribution can be found in many fields related to counting:
• Electrical system example: telephone calls arriving in a system.
• Astronomy example: photons arriving at a telescope.
• Biology example: the number of mutations on a strand of DNA per unit time.
• Management example: customers arriving at a counter or call centre.
• Civil Engineering example: cars arriving at a traffic light.
• Finance and Insurance example: Number of Losses/Claims occurring in a given period of Time.
• Earthquake Seismology example: An asymptotic Poisson model of seismic risk for large earthquakes. (Lomnitz,
1994).
The Poisson distribution arises in connection with Poisson processes. It applies to various phenomena of discrete
properties (that is, those that may happen 0, 1, 2, 3, ... times during a given period of time or in a given area)
whenever the probability of the phenomenon happening is constant in time or space. Examples of events that may be
modelled as a Poisson distribution include:
• The number of soldiers killed by horse-kicks each year in each corps in the Prussian cavalry. This example was
made famous by a book of Ladislaus Josephovich Bortkiewicz (1868–1931).
• The number of yeast cells used when brewing Guinness beer. This example was made famous by William Sealy
Gosset (1876–1937).[20]
• The number of phone calls arriving at a call centre per minute.
• The number of goals in sports involving two competing teams.
• The number of deaths per year in a given age group.
• The number of jumps in a stock price in a given time interval.
• Under an assumption of homogeneity, the number of times a web server is accessed per minute.
• The number of mutations in a given stretch of DNA after a certain amount of radiation.
• The proportion of cells that will be infected at a given multiplicity of infection.
The word law is sometimes used as a synonym of probability distribution, and convergence in law means
convergence in distribution. Accordingly, the Poisson distribution is sometimes called the law of small numbers
because it is the probability distribution of the number of occurrences of an event that happens rarely but has very
many opportunities to happen. The Law of Small Numbers is a book by Ladislaus Bortkiewicz about the Poisson
distribution, published in 1898. Some have suggested that the Poisson distribution should have been called the
Bortkiewicz distribution.[21]
While simple, this direct method of generating samples has complexity linear in λ. There are many other algorithms to
overcome this; some are given in Ahrens & Dieter, see References below. Also, for large values of λ, there may be
numerical stability issues because of the term e^{−λ}. One solution for large values of λ is rejection sampling;
another is to use a Gaussian approximation to the Poisson.
Inverse transform sampling is simple and efficient for small values of λ, and requires only one uniform random
number u per sample. Cumulative probabilities are examined in turn until one exceeds u.
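The inverse transform sampler just described can be sketched as follows: accumulate the cumulative probabilities p(0), p(0) + p(1), ... until the running total exceeds the uniform draw u, and return that k. The recurrence p(k) = p(k−1) · λ/k avoids computing factorials:

```python
from math import exp

def poisson_inverse_transform(u, lam):
    """Smallest k whose Poisson CDF exceeds u (inverse transform method)."""
    k = 0
    p = exp(-lam)       # p(0)
    cdf = p
    while cdf <= u:
        k += 1
        p *= lam / k    # p(k) = p(k-1) * lam / k
        cdf += p
    return k

# With lam = 4: F(3) ~ 0.4335 and F(4) ~ 0.6288, so u = 0.5 yields k = 4.
assert poisson_inverse_transform(0.0, 4.0) == 0
assert poisson_inverse_transform(0.5, 4.0) == 4
```

Each sample consumes exactly one uniform number, e.g. `poisson_inverse_transform(random.random(), lam)`, and the expected number of loop iterations is about λ, which is the linear complexity noted above.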
Parameter estimation
Maximum likelihood
Given a sample of n measured values k_i (i = 1, ..., n), we wish to estimate the value of the parameter λ of the Poisson
population from which the sample was drawn. The maximum likelihood estimate is
λ̂_MLE = (1/n) Σ_{i=1}^{n} k_i.
Since each observation has expectation λ, so does this sample mean. Therefore the maximum likelihood estimate is
an unbiased estimator of λ. It is also an efficient estimator, i.e. its estimation variance achieves the Cramér–Rao
lower bound (CRLB). Hence it is MVUE. Also it can be proved that the sample mean is a complete and sufficient
statistic for λ.
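A minimal sketch with a made-up sample: the estimate is simply the sample mean, and a grid check on the log-likelihood confirms that no nearby λ scores higher:

```python
from math import factorial, log

sample = [2, 4, 3, 5, 4, 1, 3, 4]
lam_hat = sum(sample) / len(sample)     # sample mean: 26 / 8 = 3.25

def loglik(lam):
    """Poisson log-likelihood of the sample at rate lam."""
    return sum(k * log(lam) - lam - log(factorial(k)) for k in sample)

# The log-likelihood is strictly concave in lam and peaks at the mean.
best = loglik(lam_hat)
for lam in (lam_hat - 0.5, lam_hat - 0.1, lam_hat + 0.1, lam_hat + 0.5):
    assert loglik(lam) < best
```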
Confidence interval
The confidence interval for the mean of a Poisson distribution is calculated using the relationship between the Poisson
and chi-squared distributions, and can be written as:
(1/2) χ²(α/2; 2k) ≤ μ ≤ (1/2) χ²(1 − α/2; 2k + 2),
where k is the number of event occurrences in a given interval and χ²(p; n) is the chi-square deviate with lower
tail area p and n degrees of freedom.[18][22] This interval is 'exact' in the sense that its coverage probability is never
less than the nominal 1 − α.
When quantiles of the chi-square distribution are not available, an accurate approximation to this exact interval was
proposed by DP Byar (based on the Wilson–Hilferty transformation):[23]
k (1 − 1/(9k) − z_{α/2}/(3√k))³ ≤ μ ≤ (k + 1) (1 − 1/(9(k + 1)) + z_{α/2}/(3√(k + 1)))³,
where z_{α/2} denotes the standard normal deviate with upper tail area α/2.
For application of these formulae in the same context as above (given a sample of n measured values k_i), one would set
k = Σ_{i=1}^{n} k_i,
calculate an interval for μ = nλ, and then derive the interval for λ.
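A sketch of Byar's approximation as stated above (my reading of the formula: the lower limit uses k and the upper limit uses k + 1), using only the standard library for the normal quantile:

```python
from statistics import NormalDist

def byar_interval(k, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for a Poisson mean,
    given k observed events, via the Wilson-Hilferty cube."""
    z = NormalDist().inv_cdf(1 - alpha / 2)      # standard normal deviate
    lo = k * (1 - 1 / (9 * k) - z / (3 * k ** 0.5)) ** 3
    m = k + 1
    hi = m * (1 - 1 / (9 * m) + z / (3 * m ** 0.5)) ** 3
    return lo, hi

lo, hi = byar_interval(10)
# The exact chi-square (Garwood) limits for k = 10 are roughly (4.80, 18.39);
# the approximation lands very close to both.
assert 4.6 < lo < 5.0
assert 18.0 < hi < 18.8
assert lo < 10 < hi
```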
Bayesian inference
In Bayesian inference, the conjugate prior for the rate parameter λ of the Poisson distribution is the gamma
distribution. Let
λ ~ Gamma(α, β)
denote that λ is distributed according to the gamma density g parameterized in terms of a shape parameter α and an
inverse scale parameter β:
g(λ; α, β) = (β^α / Γ(α)) λ^{α−1} e^{−βλ}   for λ > 0.
Then, given the same sample of n measured values k_i as before, and a prior of Gamma(α, β), the posterior
distribution is
λ ~ Gamma(α + Σ_{i=1}^{n} k_i, β + n).
The posterior mean E[λ] approaches the maximum likelihood estimate λ̂_MLE in the limit as α → 0, β → 0.
The posterior predictive distribution for a single additional observation is a negative binomial distribution,
sometimes called a gamma–Poisson distribution.
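The conjugate update is a one-liner on the parameters; a minimal sketch with a made-up sample, checking that the posterior mean approaches the MLE (the sample mean) as the prior becomes uninformative:

```python
sample = [2, 4, 3, 5, 4, 1, 3, 4]
n, s = len(sample), sum(sample)
mle = s / n                              # sample mean = 3.25

def posterior_mean(alpha, beta):
    """Mean of the Gamma(alpha + sum(k_i), beta + n) posterior."""
    return (alpha + s) / (beta + n)

# With a Gamma(1, 1) prior the posterior mean is (1 + 26) / (1 + 8) = 3.0.
assert abs(posterior_mean(1.0, 1.0) - 3.0) < 1e-12
# As alpha, beta -> 0 the posterior mean converges to the MLE.
assert abs(posterior_mean(1e-9, 1e-9) - mle) < 1e-6
```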
A bivariate Poisson distribution can also be defined; its marginal distributions are Poisson( θ1 ) and Poisson( θ2 ), and
the correlation coefficient is limited to a restricted non-negative range.[24]
Notes
[1] Frank A. Haight (1967). Handbook of the Poisson Distribution. New York: John Wiley & Sons.
[2] "Statistics | The Poisson Distribution" (http://www.umass.edu/wsp/statistics/lessons/poisson/index.html). Umass.edu. 2007-08-24. Retrieved 2012-04-05.
[3] Gullberg, Jan (1997). Mathematics from the Birth of Numbers. New York: W. W. Norton. pp. 963–965. ISBN 0-393-04002-X.
[4] S.D. Poisson, Probabilité des jugements en matière criminelle et en matière civile, précédées des règles générales du calcul des probabilités (Paris, France: Bachelier, 1837), page 206 (http://books.google.com/books?id=uovoFE3gt2EC&pg=PA206#v=onepage&q&f=false).
[5] Ladislaus von Bortkiewicz, Das Gesetz der kleinen Zahlen [The Law of Small Numbers] (Leipzig, Germany: B.G. Teubner, 1898). On page 1 (http://books.google.com/books?id=o_k3AAAAMAAJ&pg=PA1#v=onepage&q&f=false), Bortkiewicz presents the Poisson distribution. On pages 23–25 (http://books.google.com/books?id=o_k3AAAAMAAJ&pg=PA23#v=onepage&q&f=false), Bortkiewicz presents his famous analysis of "4. Beispiel: Die durch Schlag eines Pferdes im preussischen Heere Getöteten" (4th example: Those killed in the Prussian army by a horse's kick).
[6] Johnson, N.L., Kotz, S., Kemp, A.W. (1993) Univariate Discrete Distributions (2nd edition). Wiley. ISBN 0-471-54897-9, p. 157
[7] Choi KP (1994) On the medians of gamma distributions and an equation of Ramanujan. Proc Amer Math Soc 121 (1) 245–251
[8] Riordan, John (1937). "Moment recurrence relations for binomial, Poisson and hypergeometric frequency distributions". Annals of Mathematical Statistics 8: 103–111. Also see Haight (1967), p. 6.
[9] E. L. Lehmann (1986). Testing Statistical Hypotheses (second ed.). New York: Springer Verlag. ISBN 0-387-94919-4. p. 65.
[10] Raikov, D. (1937). On the decomposition of Poisson laws. Comptes Rendus (Doklady) de l'Académie des Sciences de l'URSS, 14, 9–11. (The proof is also given in von Mises, Richard (1964). Mathematical Theory of Probability and Statistics. New York: Academic Press.)
[11] Laha, R. G. and Rohatgi, V. K. Probability Theory. New York: John Wiley & Sons. p. 233. ISBN 0-471-03262-X.
[12] Johnson, N.L., Kotz, S., Kemp, A.W. (1993) Univariate Discrete Distributions (2nd edition). Wiley. ISBN 0-471-54897-9, p. 159
[13] Massimo Franceschetti, Olivier Dousse, David N. C. Tse and Patrick Thiran (2007). "Closing the Gap in the Capacity of Wireless Networks Via Percolation Theory" (http://circuit.ucsd.edu/~massimo/Journal/IEEE-TIT-Capacity.pdf). IEEE Transactions on Information Theory 53 (3): 1009–1018.
[14] NIST/SEMATECH, '6.3.3.1. Counts Control Charts (http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc331.htm)', e-Handbook of Statistical Methods, accessed 25 October 2006
[15] McCullagh, Peter; Nelder, John (1989). Generalized Linear Models. London: Chapman and Hall. ISBN 0-412-31760-5. Page 196 gives the approximation and higher order terms.
[16] Johnson, N.L., Kotz, S., Kemp, A.W. (1993) Univariate Discrete Distributions (2nd edition). Wiley. ISBN 0-471-54897-9, p. 163
[17] S. M. Ross (2007). Introduction to Probability Models (ninth ed.). Boston: Academic Press. ISBN 978-0-12-598062-3. pp. 307–308.
[18] Johnson, N.L., Kotz, S., Kemp, A.W. (1993) Univariate Discrete Distributions (2nd edition). Wiley. ISBN 0-471-54897-9, p. 171
[19] Johnson, N.L., Kotz, S., Kemp, A.W. (1993) Univariate Discrete Distributions (2nd edition). Wiley. ISBN 0-471-54897-9, p. 153
[20] Philip J. Boland. "A Biographical Glimpse of William Sealy Gosset" (http://wfsc.tamu.edu/faculty/tdewitt/biometry/Boland PJ (1984) American Statistician 38 179-183 - A biographical glimpse of William Sealy Gosset.pdf). The American Statistician, Vol. 38, No. 3 (Aug. 1984), pp. 179–183. Retrieved 2011-06-22. "At the turn of the 19th century, Arthur Guinness, Son & Co. became interested in hiring scientists to analyze data concerned with various aspects of its brewing process. Gosset was to be one of the first of these scientists, and so it was that in 1899 he moved to Dublin to take up a job as a brewer at St. James' Gate... Student published 22 papers, the first of which was entitled "On the Error of Counting With a Haemacytometer" (Biometrika, 1907). In it, Student illustrated the practical use of the Poisson distribution in counting the number of yeast cells on a square of a haemacytometer. Up until just before World War II, Guinness would not allow its employees to publish under their own names, and hence Gosset chose to write under the pseudonym of "Student.""
[21] Good, I. J. (1986). "Some statistical applications of Poisson's work". Statistical Science 1 (2): 157–180. doi:10.1214/ss/1177013690. JSTOR 2245435.
[22] Garwood, F. (1936). "Fiducial Limits for the Poisson Distribution". Biometrika 28 (3/4): 437–442. doi:10.1093/biomet/28.3-4.437.
[23] Breslow, NE; Day, NE (1987). Statistical Methods in Cancer Research: Volume 2, The Design and Analysis of Cohort Studies (http://www.iarc.fr/en/publications/pdfs-online/stat/sp82/index.php). Paris: International Agency for Research on Cancer. ISBN 978-92-832-0182-3.
[24] Loukas S, Kemp CD (1986) The index of dispersion test for the bivariate Poisson distribution. Biometrics 42(4) 941–948
References
• Joachim H. Ahrens, Ulrich Dieter (1974). "Computer Methods for Sampling from Gamma, Beta, Poisson and
Binomial Distributions". Computing 12 (3): 223–246. doi:10.1007/BF02293108.
• Joachim H. Ahrens, Ulrich Dieter (1982). "Computer Generation of Poisson Deviates". ACM Transactions on
Mathematical Software 8 (2): 163–179. doi:10.1145/355993.355997.
• Ronald J. Evans, J. Boersma, N. M. Blachman, A. A. Jagers (1988). "The Entropy of a Poisson Distribution:
Problem 87-6". SIAM Review 30 (2): 314–317. doi:10.1137/1030059.
• Donald E. Knuth (1969). Seminumerical Algorithms. The Art of Computer Programming, Volume 2. Addison
Wesley.
Poisson process
In probability theory, a Poisson process is a stochastic process which counts the number of events[1] and the
times at which these events occur in a given time interval. The time between each pair of consecutive events has an exponential
distribution with parameter λ and each of these inter-arrival times is assumed to be independent of other inter-arrival
times. The process is named after the French mathematician Siméon-Denis Poisson and is a good model of
radioactive decay,[2] telephone calls[3] and requests for a particular document on a web server,[4] among many other
phenomena.
The Poisson process is a continuous-time process; the sum of a Bernoulli process can be thought of as its
discrete-time counterpart. A Poisson process is a pure-birth process, the simplest example of a birth-death process. It
is also a point process on the real half-line.
Definition
The basic form of Poisson process, often referred to simply as "the Poisson process", is a continuous-time counting
process {N(t), t ≥ 0} that possesses the following properties:
• N(0) = 0
• Independent increments (the numbers of occurrences counted in disjoint intervals are independent of each
other)
• Stationary increments (the probability distribution of the number of occurrences counted in any time interval only
depends on the length of the interval)
• No counted occurrences are simultaneous.
Consequences of this definition include:
• The probability distribution of N(t) is a Poisson distribution.
• The probability distribution of the waiting time until the next occurrence is an exponential distribution.
• The occurrences are distributed uniformly on any interval of time. (Note that N(t), the total number of
occurrences, has a Poisson distribution over (0, t], whereas the location of an individual occurrence on t ∈ (a, b] is
uniform.)
Other types of Poisson process are described below.
Types
Homogeneous
The homogeneous Poisson process is one of the most well-known Lévy
processes. This process is characterized by a rate parameter λ, also
known as intensity, such that the number of events in time interval
(t, t + τ] follows a Poisson distribution with associated parameter λτ.
This relation is given as
P[N(t + τ) − N(t) = k] = e^(−λτ) (λτ)^k / k!,   k = 0, 1, 2, …
[Figure: sample path of a Poisson process N(t)]
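The definition above suggests a direct way to simulate a homogeneous Poisson process: sum i.i.d. exponential inter-arrival times with parameter λ. A minimal sketch (function names are illustrative, not from the article), which also checks that the resulting count has mean λτ:

```python
import random

def simulate_poisson_process(rate, horizon, rng=random):
    """Event times of a homogeneous Poisson process of the given rate on
    (0, horizon], built by summing i.i.d. exponential inter-arrival times."""
    times = []
    t = rng.expovariate(rate)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times

# N(horizon) should then be Poisson with mean rate * horizon (= 20 here).
random.seed(0)
counts = [len(simulate_poisson_process(2.0, 10.0)) for _ in range(2000)]
mean_count = sum(counts) / len(counts)
```

Averaging the counts over many replications recovers the Poisson mean λτ to within sampling error.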
Non-homogeneous
In general, the rate parameter may change over time; such a process is called a non-homogeneous Poisson process
or inhomogeneous Poisson process. In this case, the generalized rate function is given as λ(t). Now the expected
number of events between time a and time b is
λ_{a,b} = ∫_a^b λ(t) dt.
Thus, the number of arrivals in the time interval (a, b], given as N(b) − N(a), follows a Poisson distribution with
associated parameter λ_{a,b}.
A homogeneous Poisson process may be viewed as a special case when λ(t) = λ, a constant rate.
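One common way to simulate a non-homogeneous Poisson process, not described in the article itself, is Lewis–Shedler thinning: generate candidates from a homogeneous process at a dominating rate and accept each with probability λ(t)/λmax. A sketch with illustrative names:

```python
import random

def thinning(rate_fn, rate_max, horizon, rng=random):
    """Sample a non-homogeneous Poisson process with intensity rate_fn(t) on
    (0, horizon] by thinning a homogeneous process of rate rate_max
    (Lewis-Shedler); rate_fn(t) <= rate_max must hold on the interval."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate_max)
        if t > horizon:
            return times
        # Keep a candidate arrival with probability rate_fn(t) / rate_max.
        if rng.random() <= rate_fn(t) / rate_max:
            times.append(t)

# Linearly increasing intensity lambda(t) = t on (0, 10]: the expected count
# is the integral of lambda over the interval, 10**2 / 2 = 50.
random.seed(1)
counts = [len(thinning(lambda t: t, 10.0, 10.0)) for _ in range(1000)]
mean_count = sum(counts) / len(counts)
```

The empirical mean count matches the integrated rate λ_{a,b} = ∫ λ(t) dt, as the text states.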
Spatial
An important variation on the (notionally time-based) Poisson process is the spatial Poisson process. In the case of a
one-dimensional space (a line) the theory differs from that of a time-based Poisson process only in the interpretation of
the index variable. For higher-dimensional spaces, where the index variable (now x) is in some vector space V (e.g. R2
or R3), a spatial Poisson process can be defined by the requirement that the random variables defined as the counts of
the number of "events" inside each of a number of non-overlapping finite sub-regions of V should each have a
Poisson distribution and should be independent of each other.
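A standard recipe for sampling a homogeneous spatial Poisson process on a rectangle (a sketch under that assumption; all names are illustrative) is to draw a Poisson number of points with mean intensity × area and scatter them uniformly; counts in disjoint sub-regions then come out Poisson and independent, as the definition requires:

```python
import math
import random

def poisson_sample(mean, rng=random):
    """Draw one Poisson(mean) variate (Knuth's multiplicative method)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def spatial_poisson(intensity, width, height, rng=random):
    """Homogeneous spatial Poisson process on [0, width] x [0, height]:
    a Poisson(intensity * area) number of points, each placed uniformly."""
    n = poisson_sample(intensity * width * height, rng)
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]

# Counts inside any sub-region are Poisson with mean intensity * sub-area:
# the left half of a 2 x 1 window at intensity 3 should average 3 points.
random.seed(2)
left_counts = [sum(1 for (x, _) in spatial_poisson(3.0, 2.0, 1.0) if x < 1.0)
               for _ in range(2000)]
mean_left = sum(left_counts) / len(left_counts)
```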
Space-time
A further variation on the Poisson process, the space-time Poisson process, allows for separately distinguished space
and time variables. Even though this can theoretically be treated as a pure spatial process by treating "time" as just
another component of a vector space, it is convenient in most applications to treat space and time separately, both for
modeling purposes in practical applications and because of the types of properties of such processes that it is
interesting to study.
In comparison to a time-based inhomogeneous Poisson process, the extension to a space-time Poisson process can
introduce a spatial dependence into the rate function, such that it is defined as λ(x, t), where x ∈ V for some
vector space V (e.g. R2 or R3). However a space-time Poisson process may have a rate function that is constant with
respect to either or both of x and t. For any set S ⊆ V (e.g. a spatial region) with finite measure μ(S), the
number of events occurring inside this region can be modeled as a Poisson process with associated rate function λS(t)
such that
λS(t) = ∫S λ(x, t) dμ(x).
In the special case in which this rate function separates as λ(x, t) = f(x) λ(t) with ∫V f(x) dμ(x) = 1
(if this is not the case, λ(t) can be scaled appropriately), f(x) represents the spatial probability density
function of these random events in the following sense. The act of sampling this spatial Poisson process is equivalent
to sampling a Poisson process with rate function λ(t), and associating with each event a random vector x sampled
from the probability density function f(x). A similar result can be shown for the general (non-separable) case.
Characterisation
In its most general form, the only two conditions for a counting process to be a Poisson process are:
• Orderliness: which roughly means
lim_{Δt→0} P( N(t + Δt) − N(t) > 1 | N(t + Δt) − N(t) ≥ 1 ) = 0,
which implies that arrivals don't occur simultaneously (but this is actually a mathematically stronger
statement).
• Memorylessness (also called evolution without after-effects): the number of arrivals occurring in any bounded
interval of time after time t is independent of the number of arrivals occurring before time t.
These seemingly unrestrictive conditions actually impose a great deal of structure on the Poisson process. In
particular, they imply that the times between consecutive events (called interarrival times) are independent random
variables. For the homogeneous Poisson process, these interarrival times are exponentially distributed with
parameter λ (mean 1/λ).
Proof: Let T1 be the first arrival time of the Poisson process. Its distribution satisfies
P(T1 > t) = P(no arrival in (0, t]) = P(N(t) = 0) = e^(−λt).
Also, the memorylessness property entails that the number of events in any time interval is independent of the
number of events in any other interval that is disjoint from it. This latter property is known as the independent
increments property of the Poisson process.
Properties
As defined above, the stochastic process {N(t)} is a Markov process, or more specifically, a continuous-time Markov
process.
To illustrate the exponentially distributed inter-arrival times property, consider a homogeneous Poisson process N(t)
with rate parameter λ, and let Tk be the time of the kth arrival, for k = 1, 2, 3, ... . Clearly the number of arrivals
before some fixed time t is less than k if and only if the waiting time until the kth arrival is more than t. In symbols,
the event [N(t) < k] occurs if and only if the event [Tk > t] occurs. Consequently the probabilities of these events are
the same:
P(N(t) < k) = P(Tk > t).
In particular, consider the waiting time until the first arrival. Clearly that time is more than t if and only if the number
of arrivals before time t is 0. Combining this latter property with the above probability distribution for the number of
homogeneous Poisson process events in a fixed interval gives
P(T1 > t) = P(N(t) = 0) = e^(−λt).
Consequently, the waiting time until the first arrival T1 has an exponential distribution, and is thus memoryless. One
can similarly show that the other interarrival times Tk − Tk−1 share the same distribution. Hence, they are
independent, identically distributed (i.i.d.) exponential random variables with parameter λ > 0 and expected value 1/λ. For
example, if the average rate of arrivals is 5 per minute, then the average waiting time between arrivals is 1/5 minute.
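The path-by-path equivalence of the events [N(t) < k] and [T_k > t] used above can be checked directly in a small simulation (illustrative code, not from the article):

```python
import random

def arrival_times(rate, n, rng=random):
    """First n arrival times T_1 < ... < T_n of a rate-`rate` Poisson process."""
    times, t = [], 0.0
    for _ in range(n):
        t += rng.expovariate(rate)
        times.append(t)
    return times

# On every sample path the events [N(t) < k] and [T_k > t] coincide.
random.seed(3)
rate, t_fixed, k, trials = 1.5, 4.0, 5, 1000
agree = 0
for _ in range(trials):
    T = arrival_times(rate, k)                # T[k-1] is the k-th arrival
    n_t = sum(1 for s in T if s <= t_fixed)   # arrivals among the first k by t
    if (n_t < k) == (T[k - 1] > t_fixed):
        agree += 1
```

The two events agree on every one of the simulated paths, since fewer than k arrivals by time t is exactly the statement that the k-th arrival falls after t.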
Applications
The classic example of phenomena well modelled by a Poisson process is deaths due to horse kick in the Prussian
army, as shown by Ladislaus Bortkiewicz in 1898.[5][6] The following examples are also well-modeled by the
Poisson process:
• Requests for telephone calls at a switchboard.
• Goals scored in a soccer match.[7]
• Requests for individual documents on a web server.[8]
• Particle emissions due to radioactive decay by an unstable substance. In this case the Poisson process is
non-homogeneous in a predictable manner - the emission rate declines as particles are emitted.
In queueing theory, the times of customer/job arrivals at queues are often assumed to be a Poisson process.
Occurrence
The Palm–Khintchine theorem provides a result that shows that the superposition of many low intensity non-Poisson
point processes will be close to a Poisson process.
Further reading
• Cox, D. R.; Isham, V. I. (1980). Point Processes. Chapman & Hall. ISBN 0-412-21910-7.
• Ross, S. M. (1995). Stochastic Processes. Wiley. ISBN 978-0-471-12062-9.
• Snyder, D. L.; Miller, M. I. (1991). Random Point Processes in Time and Space. Springer-Verlag.
ISBN 0-387-97577-2.
Notes
[1] The word event used here is not an instance of the concept of event as frequently used in probability theory.
[2] Cannizzaro, F.; Greco, G.; Rizzo, S.; Sinagra, E. (1978). "Results of the measurements carried out in order to verify the validity of the
poisson-exponential distribution in radioactive decay events". The International Journal of Applied Radiation and Isotopes 29 (11): 649.
doi:10.1016/0020-708X(78)90101-1.
[3] Willkomm, D.; Machiraju, S.; Bolot, J.; Wolisz, A. (2009). "Primary user behavior in cellular networks and implications for dynamic
spectrum access". IEEE Communications Magazine 47 (3): 88. doi:10.1109/MCOM.2009.4804392.
[4] Arlitt, Martin F.; Williamson, Carey L. (1997). "Internet Web servers: Workload characterization and performance implications". IEEE/ACM
Transactions on Networking 5 (5): 631. doi:10.1109/90.649565.
[5] Ladislaus von Bortkiewicz, Das Gesetz der kleinen Zahlen [The law of small numbers] (Leipzig, Germany: B.G. Teubner, 1898). On page 1
(http:/ / books. google. com/ books?id=o_k3AAAAMAAJ& pg=PA1#v=onepage& q& f=false), Bortkiewicz presents the Poisson distribution.
On pages 23-25 (http:/ / books. google. com/ books?id=o_k3AAAAMAAJ& pg=PA23#v=onepage& q& f=false), Bortkiewicz presents his
famous analysis of "4. Beispiel: Die durch Schlag eines Pferdes im preussischen Heere Getöteten." (4. Example: Those killed in the Prussian
army by a horse's kick.).
[6] Gibbons, Robert D.; Bhaumik, Dulal; Aryal, Subhash (2009). Statistical Methods for Groundwater Monitoring. John Wiley and Sons. p. 72.
ISBN 0-470-16496-4.
[7] Heuer, A.; Müller, C.; Rubner, O. (2010). "Soccer: Is scoring goals a predictable Poissonian process?". EPL (Europhysics Letters) 89 (3):
38007. doi:10.1209/0295-5075/89/38007. "To a very good approximation scoring goals during a match can be characterized as independent
Poissonian processes with pre-determined expectation values."
[8] Arlitt, Martin F.; Williamson, Carey L. (1997). "Internet Web servers: Workload characterization and performance implications". IEEE/ACM
Transactions on Networking 5 (5): 631. doi:10.1109/90.649565.
References
Introduction
Survival models can be viewed as consisting of two parts: the underlying hazard function, often denoted λ0(t),
describing how the hazard (risk) changes over time at baseline levels of covariates; and the effect parameters,
describing how the hazard varies in response to explanatory covariates. A typical medical example would include
covariates such as treatment assignment, as well as patient characteristics such as age, gender, and the presence of
other diseases in order to reduce variability and/or control for confounding.
The proportional hazards condition[1] states that covariates are multiplicatively related to the hazard. In the simplest
case of stationary coefficients, for example, a treatment with a drug may, say, halve a subject's hazard at any given
time t, while the baseline hazard may vary. Note, however, that the covariate is not restricted to binary predictors;
in the case of a continuous covariate x, the hazard responds logarithmically; each unit increase in x results in
proportional scaling of the hazard. The Cox partial likelihood, shown below, is obtained by using Breslow's estimate
of the baseline hazard function, plugging it into the full likelihood and then observing that the result is a product of
two factors. The first factor is the partial likelihood shown below, in which the baseline hazard has "canceled out".
The second factor is free of the regression coefficients and depends on the data only through the censoring pattern.
Proportional hazards models 520
The effect of covariates estimated by any proportional hazards model can thus be reported as hazard ratios.
Sir David Cox observed that if the proportional hazards assumption holds (or, is assumed to hold) then it is possible
to estimate the effect parameter(s) without any consideration of the hazard function. This approach to survival data is
called application of the Cox proportional hazards model,[2] sometimes abbreviated to Cox model or to proportional
hazards model.
The Cox model specifies the hazard function as
λ(t | X) = λ0(t) exp(β′X).
This expression gives the hazard at time t for an individual with covariate vector (explanatory variables) X. Let Yi
denote the observed time (event time or censoring time) for subject i, and let Ci be the indicator that the time
corresponds to an event (Ci = 1) rather than a censoring (Ci = 0). Based on
this hazard function, a partial likelihood can be constructed from the datasets as
L(β) = ∏_{i: Ci = 1} θi / ( ∑_{j: Yj ≥ Yi} θj ),
where θj = exp(β′Xj) and X1, ..., Xn are the covariate vectors for the n independently sampled individuals in the
dataset (treated here as column vectors).
The corresponding log partial likelihood is
ℓ(β) = ∑_{i: Ci = 1} ( β′Xi − log ∑_{j: Yj ≥ Yi} θj ).
This function can be maximized over β to produce maximum partial likelihood estimates of the model parameters.
The partial score function is
ℓ′(β) = ∑_{i: Ci = 1} ( Xi − ( ∑_{j: Yj ≥ Yi} θj Xj ) / ( ∑_{j: Yj ≥ Yi} θj ) ).
Using this score function and Hessian matrix, the partial likelihood can be maximized using the Newton-Raphson
algorithm. The inverse of the Hessian matrix, evaluated at the estimate of β, can be used as an approximate
variance-covariance matrix for the estimate, and used to produce approximate standard errors for the regression
coefficients.
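The objective that the Newton–Raphson iteration maximizes can be sketched directly. The following is a hypothetical single-covariate implementation of the negative log partial likelihood, assuming no tied event times (all data and names are illustrative):

```python
import math

def cox_neg_log_partial_likelihood(beta, times, events, covariates):
    """Negative log partial likelihood for a Cox model with a single
    scalar covariate, assuming no tied event times.

    times      -- observed times Y_i (event or censoring)
    events     -- C_i: 1 if an event was observed at times[i], 0 if censored
    covariates -- scalar covariate X_i for each subject
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    loglik = 0.0
    for pos, i in enumerate(order):
        if events[i] == 1:
            # Risk set: all subjects still under observation at times[i].
            risk = sum(math.exp(beta * covariates[j]) for j in order[pos:])
            loglik += beta * covariates[i] - math.log(risk)
    return -loglik

# Hypothetical toy data: at beta = 0 every theta_j is 1, so each event's
# contribution reduces to the log of the size of its risk set.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]
covariates = [0.5, -0.2, 0.3, 0.1]
nll_at_zero = cox_neg_log_partial_likelihood(0.0, times, events, covariates)
# risk sets of sizes 4, 3 and 1 -> log 4 + log 3 + log 1 = log 12
```

In practice this function would be handed to a Newton-type optimizer, with the Hessian inverse supplying the approximate variance-covariance matrix as described above.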
Tied times
Several approaches have been proposed to handle situations in which there are ties in the time data. Breslow's
method describes the approach in which the procedure described above is used unmodified, even when ties are
present. An alternative approach that is considered to give better results is Efron's method.[3] Let tj denote the unique
times, let Hj denote the set of indices i such that Yi = tj and Ci = 1, and let mj = |Hj|. Efron's approach maximizes the
following partial likelihood:
L(β) = ∏_j ( ∏_{i ∈ Hj} θi ) / ∏_{ℓ=0}^{mj−1} ( φj − (ℓ/mj) ψj ),
where
φj = ∑_{i: Yi ≥ tj} θi   and   ψj = ∑_{i ∈ Hj} θi.
Note that when Hj is empty (all observations with time tj are censored), the summands in these expressions are
treated as zero.
The book on generalized linear models by McCullagh and Nelder[8] has a chapter on converting proportional hazards
models to generalized linear models.
Notes
[1] Breslow, N. E. (1975). "Analysis of Survival Data under the Proportional Hazards Model". International Statistical Review / Revue
Internationale de Statistique 43 (1): 45–57. doi:10.2307/1402659. JSTOR 1402659.
[2] Cox, David R (1972). "Regression Models and Life-Tables". Journal of the Royal Statistical Society. Series B (Methodological) 34 (2):
187–220. JSTOR 2985181. MR0341758
[3] Efron, Bradley (1977). "The Efficiency of Cox's Likelihood Function for Censored Data". Journal of the American Statistical Association 72
(359): 557–565. JSTOR 2286217.
[4] Andersen, P.; Gill, R. (1982). "Cox's regression model for counting processes, a large sample study.". Annals of Statistics 10 (4): 1100–1120.
doi:10.1214/aos/1176345976. JSTOR 2240714.
[5] Martinussen & Scheike (2006) Dynamic Regression Models for Survival Data (Springer).
[6] Bender, R., Augustin, T. and Blettner, M. (2006). Generating survival times to simulate Cox proportional hazards models, Statistics in
Medicine 2005; 24:1713–1723. doi:10.1002/sim.2369
[7] Nan Laird and Donald Olivier (1981). "Covariance Analysis of Censored Survival Data Using Log-Linear Analysis Techniques". Journal of
the American Statistical Association 76 (374): 231–240. doi:10.2307/2287816. JSTOR 2287816.
[8] P. McCullagh and J. A. Nelder (2000). "Chapter 13: Models for Survival Data". Generalized Linear Models (Second ed.). Boca Raton,
Florida: Chapman & Hall/CRC. ISBN 0-412-31760-5. (Second edition 1989; first CRC reprint 1999.)
References
• D. R. Cox and D. Oakes (1984). Analysis of survival data (Chapman & Hall).
• D. Collett (2003). Modelling survival data in medical research (Chapman & Hall/CRC).
• T. M. Therneau and P. M. Grambsch (2000). Modeling survival data: extending the Cox Model (Springer).
• V.Bagdonavicius, R.Levuliene, M.Nikulin (2010). "Goodness-of-fit criteria for the Cox model from left truncated
and right censored data". Journal of Mathematical Sciences, v.167, #4, 436-443.
Random permutation statistics 523
where we have used the fact that the EGF of the set of permutations (there are n! permutations of n elements) is
1/(1 − z).
This one equation will allow us to derive a surprising number of permutation statistics. Firstly, by dropping terms
from exp, i.e. by retaining only some terms of its series, we may constrain the number of cycles that a permutation
contains; e.g. by restricting the EGF to (1/2) (log 1/(1 − z))^2 we obtain permutations containing exactly two cycles.
Secondly, note that the EGF of labelled cycles, i.e. of CYC(Z), is log 1/(1 − z); if we define
the value of b for a permutation to be the sum of its values on the cycles, then we may mark cycles of length k
with ub(k) and obtain a bivariate generating function g(z, u) that describes the parameter, i.e.
This is a mixed generating function which is exponential in the permutation size and ordinary in the secondary
parameter u. Differentiating and evaluating at u = 1, we have
i.e. the EGF of the sum of b over all permutations, or alternatively, the OGF, or more precisely, PGF (probability
generating function) of the expectation of b.
This article uses the coefficient extraction operator [zn], documented on the page for formal power series.
This gives the explicit formula for the total number of involutions among the permutations σ ∈ Sn:
I(n) = n! ∑_{a+2b=n} 1/(a! 2^b b!).
Now multiplication by 1/(1 − z) just sums coefficients, so that we have the following formula for d(n), the
total number of derangements:
Hence there are about n!/e derangements and the probability that a random permutation is a derangement is approximately 1/e.
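The 1/e limit is easy to confirm by brute force for small n (illustrative code, not part of the derivation):

```python
from itertools import permutations
from math import e, factorial

def derangement_count(n):
    """Count permutations of {0, ..., n-1} with no fixed point (brute force)."""
    return sum(1 for p in permutations(range(n))
               if all(p[i] != i for i in range(n)))

d7 = derangement_count(7)          # the known value is 1854
ratio = d7 / factorial(7)          # already close to 1/e ~ 0.3679
```

Even at n = 7 the ratio d(n)/n! agrees with 1/e to about four decimal places, reflecting the rapid convergence of the alternating series.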
This result may also be proved by inclusion-exclusion. Using the sets Ap, where p = 1, …, n, to denote the set of
permutations that fix p, we have
This formula counts the number of permutations that have at least one fixed point. The cardinalities are as follows:
or
It follows that
and hence
or
because a cycle of more than n/2 elements will necessarily be unique. Using the fact that , we
find that
which yields
Finally, using an integral estimate such as Euler–Maclaurin summation, or the asymptotic expansion of the nth
harmonic number, we obtain
so that
The above computation may be performed in a simpler and more direct way, as follows: first note that a permutation
of n elements contains at most one cycle of length strictly greater than n/2. Thus, if we denote
then
Explanation: C(n, k) is the number of ways of choosing the k elements that constitute the cycle; (k − 1)! is the number of
ways of arranging k items in a cycle; and (n − k)! is the number of ways to permute the remaining n − k elements.
Thus,
We conclude that
The term
yields the Stirling numbers of the first kind, i.e. is the EGF of the unsigned Stirling numbers of the first kind.
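The unsigned Stirling numbers of the first kind satisfy the classical recurrence c(n, k) = c(n−1, k−1) + (n−1) c(n−1, k), which gives a quick way to tabulate them; a small sketch (function name illustrative):

```python
from math import factorial

def stirling_first_unsigned(n):
    """Row n of the unsigned Stirling numbers of the first kind c(n, k):
    the number of permutations of n elements with exactly k cycles,
    via the recurrence c(n, k) = c(n-1, k-1) + (n-1) * c(n-1, k)."""
    row = [1]                       # c(0, 0) = 1
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            new[k] = row[k - 1] + (m - 1) * (row[k] if k < m else 0)
        row = new
    return row

row4 = stirling_first_unsigned(4)   # [0, 6, 11, 6, 1]
```

Each row sums to n!, since every permutation of n elements has some number of cycles.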
We can compute the OGF of these numbers for n fixed, i.e.
Start with
which yields
Using the formula for on the left, the definition of on the right, and the binomial theorem, we obtain
Comparing the coefficients of , and using the definition of the binomial coefficient, we finally have
a falling factorial.
or
This means that the expected number of cycles of size m in a permutation of length n is zero when n < m (obviously).
A random permutation of length at least m contains on average 1/m cycles of length m. In particular, a random
permutation contains about one fixed point.
The OGF of the expected number of cycles of length less than or equal to m is therefore
where Hm is the mth harmonic number. Hence the expected number of cycles of length at most m in a random
permutation is about ln m.
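Both facts — the expected number of cycles equals the harmonic number H_n, and the expected number of fixed points is one — can be verified exactly by exhaustive enumeration for small n (illustrative code using exact rational arithmetic):

```python
from fractions import Fraction
from itertools import permutations

def cycle_count(p):
    """Number of cycles of the permutation p of {0, ..., len(p)-1}."""
    seen, cycles = set(), 0
    for start in range(len(p)):
        if start not in seen:
            cycles += 1
            j = start
            while j not in seen:
                seen.add(j)
                j = p[j]
    return cycles

n = 6
perms = list(permutations(range(n)))
expected_cycles = Fraction(sum(cycle_count(p) for p in perms), len(perms))
harmonic = sum(Fraction(1, k) for k in range(1, n + 1))   # H_6 = 49/20
avg_fixed = Fraction(sum(sum(1 for i in range(n) if p[i] == i) for p in perms),
                     len(perms))
```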
Let the random variable X be the number of fixed points of a random permutation. Using Stirling numbers of the
second kind, we have the following formula for the mth moment of X:
which is zero when k > n, and one otherwise. Hence only terms with k ≤ n contribute to the sum. This yields
and
We could also have obtained this formula by noting that the number of transpositions is obtained by adding the
lengths of all cycles (which gives n) and subtracting one for every cycle (which gives by the previous
section).
Note that again generates the unsigned Stirling numbers of the first kind, but in reverse order. More
precisely, we have
and that
which we saw to be the EGF of the unsigned Stirling numbers of the first kind in the section on permutations
consisting of precisely m cycles.
It follows that the probability that a random element lies on a cycle of length m is 1/n, independently of m, for 1 ≤ m ≤ n.
Averaging out we obtain that the probability of the elements of Q being on the same cycle is
or
In particular, the probability that two elements p < q are on the same cycle is 1/2.
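The 1/2 claim can be checked exhaustively for small n (illustrative sketch):

```python
from fractions import Fraction
from itertools import permutations

def same_cycle(p, a, b):
    """True if elements a and b lie on the same cycle of permutation p."""
    if a == b:
        return True
    j = p[a]
    while j != a:
        if j == b:
            return True
        j = p[j]
    return False

def prob_same_cycle(n):
    """Exact probability that elements 0 and 1 share a cycle in a
    uniformly random permutation of n elements (brute force)."""
    perms = list(permutations(range(n)))
    return Fraction(sum(1 for p in perms if same_cycle(p, 0, 1)), len(perms))
```

By symmetry the choice of the two elements does not matter, and the exact probability is 1/2 for every n ≥ 2.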
or
This simplifies to
or
This says that there is one permutation of size zero containing an even number of even cycles (the empty
permutation, which contains zero cycles of even length), one such permutation of size one (the fixed point, which
also contains zero cycles of even length), and that for n ≥ 2, there are n!/2 such permutations.
Permutations where the sum of the lengths of the even cycles is six
This class has the specification
There is a semantic nuance here. We could consider permutations containing no even cycles as belonging to this class,
since zero is even. The first few values are
The recurrence
Observe carefully how the specifications of the even cycle component are constructed. It is best to think of them in
terms of parse trees. These trees have three levels. The nodes at the lowest level represent sums of products of
even-length cycles of the singleton . The nodes at the middle level represent restrictions of the set operator.
Finally the node at the top level sums products of contributions from the middle level. Note that restrictions of the set
operator, when applied to a generating function that is even, will preserve this feature, i.e. produce another even
generating function. But all the inputs to the set operators are even since they arise from even-length cycles. The
result is that all generating functions involved have the form
where the sum is over all permutations σ of n elements, sgn(σ) is the sign of σ, i.e. sgn(σ) = 1 if σ is even and
sgn(σ) = −1 if σ is odd, and f(σ) is the number of fixed points of σ.
Now the sign of σ is given by
sgn(σ) = ∏_{c ∈ σ} (−1)^{|c| − 1},
where the product is over all cycles c of σ, as explained e.g. on the page on even and odd permutations.
Hence we consider the combinatorial class
where marks one minus the length of a contributing cycle, and marks fixed points. Translating to generating
functions, we obtain
or
Now we have
or
Extracting coefficients, we find that the coefficient of is zero. The constant is one, which does not agree with
the formula (should be zero). For positive, however, we obtain
or
Now the value of the product on the right for a permutation is , where f is the number of fixed points of
. Hence
which yields
and finally
External links
• Alois Panholzer, Helmut Prodinger, Marko Riedel, Measuring post-quickselect disorder. [1]
• Putnam Competition Archive, William Lowell Putnam Competition Archive [2]
• Philip Sung, Yan Zhang, Recurring Recurrences in Counting Permutations [3]
100 prisoners
• Anna Gál, Peter Bro Miltersen, The cell probe complexity of succinct data structures [4]
• Peter Winkler, Seven puzzles you think you must not have heard correctly [5]
• Various authors, Les-Mathematiques.net [6]. Cent prisonniers [7] (French)
References
[1] http:/ / www. mathematik. uni-stuttgart. de/ ~riedelmo/ papers/ qsdis-jalc. pdf
[2] http:/ / www. unl. edu/ amc/ a-activities/ a7-problems/ putnamindex. shtml
[3] http:/ / citeseerx. ist. psu. edu/ viewdoc/ download?doi=10. 1. 1. 91. 1088& rep=rep1& type=pdf
[4] http:/ / www. daimi. au. dk/ ~bromille/ Papers/ succinct. pdf
[5] http:/ / www. math. dartmouth. edu/ ~pw/ solutions. pdf
[6] http:/ / les-mathematiques. net
[7] http:/ / les-mathematiques. u-strasbg. fr/ phorum5/ read. php?12,341672
But recall that the xi's are linearly independent because they are a basis of the row space of A. This implies that
c1 = c2 = … = cr = 0, which proves our claim that Ax1, Ax2, …, Axr are linearly independent.
Now, each Axi is obviously a vector in the column space of A. So, {Ax1, Ax2, …, Axr} is a set of r linearly
independent vectors in the column space of A and, hence, the dimension of the column space of A (i.e. the column
rank of A) must be at least as big as r. This proves that row rank of A = r ≤ column rank of A. Now apply this
result to the transpose of A to get the reverse inequality: column rank of A = row rank of AT ≤ column rank of
AT = row rank of A. This proves column rank of A equals row rank of A or, equivalently, rk(A) = rk(AT).
QED.
Finally, we provide a proof of the related result, rk(A) = rk(A*), where A* is the conjugate transpose or hermitian
transpose of A. When the elements of A are real numbers, this result becomes rk(A) = rk(AT) and can constitute
another proof for row rank = column rank. Otherwise, for complex matrices, rk(A) = rk(A*) is not equivalent to row
rank = column rank, and one of the above two proofs should be used. This proof is short, elegant and makes use of
the null space.
Third proof: Let A be an m × n matrix. Define rk(A) to mean the column rank of A and let A* denote the
conjugate transpose or hermitian transpose of A. First note that A*Ax = 0 if and only if Ax = 0. This is
elementary linear algebra – one direction is trivial; the other follows from:
A*Ax = 0 ⇒ x*A*Ax = 0 ⇒ ‖Ax‖^2 = 0 ⇒ Ax = 0,
where ‖ · ‖ is the Euclidean norm. This proves that the null space of A is equal to the null space of A*A. From
the rank-nullity theorem, we obtain rk(A) = rk(A*A). (Alternate argument: Since A*Ax = 0 if and only if
Ax = 0, the columns of A*A satisfy the same linear relationships as the columns of A. In particular, they must
have the same number of linearly independent columns and, hence, the same column rank.) Each column of A*A is
a linear combination of the columns of A*. Therefore, the column space of A*A is a subspace of the column
space of A*. This implies that rk(A*A) ≤ rk(A*). We have proved: rk(A) = rk(A*A) ≤ rk(A*).
Now apply this result to A* to obtain the reverse inequality: since (A*)* = A, we can write
rk(A*) ≤ rk((A*)*) = rk(A). This proves rk(A) = rk(A*). When the elements of A are real, the
conjugate transpose is the transpose and we obtain rk(A) = rk(AT). QED.
Rank (linear algebra) 537
Alternative definitions
dimension of image
If one considers the matrix A as a linear mapping
f : Fn → Fm
such that
f(x) = Ax
then the rank of A can also be defined as the dimension of the image of f (see linear map for a discussion of image
and kernel). This definition has the advantage that it can be applied to any linear map without need for a specific
matrix. The rank can also be defined as n minus the dimension of the kernel of f; the rank-nullity theorem states that
this is the same as the dimension of the image of f.
column rank – dimension of column space
The maximal number of linearly independent columns of the m×n matrix A with entries in the field
F is equal to the dimension of the column space of A (the column space being the subspace of Fm generated by the
columns of A, which is in fact just the image of A as a linear map).
row rank – dimension of row space
Since the column rank and the row rank are the same, we can also define the rank of A as the dimension of the row
space of A, or the number of rows in a basis of the row space.
decomposition rank
The rank can also be characterized as the decomposition rank: the minimum k such that A can be factored as
, where C is an m×k matrix and R is a k×n matrix. Like the "dimension of image" characterization this
can be generalized to a definition of the rank of a linear map: the rank of a linear map f from V → W is the minimal
dimension k of an intermediate space X such that f can be written as the composition of a map V → X and a map X →
W. While this definition does not suggest an efficient manner to compute the rank (for which it is better to use one of
the alternative definitions), it does make many of the properties of the rank easy to understand, for instance that the
rank of the transpose of A is the same as that of A. See rank factorization for details.
determinantal rank – size of largest non-vanishing minor
Another equivalent definition of the rank of a matrix is the greatest order of any non-zero minor in the matrix (the
order of a minor being the size of the square sub-matrix of which it is the determinant). Like the decomposition rank
characterization, this does not give an efficient way of computing the rank, but it is useful theoretically: a single
non-zero minor witnesses a lower bound (namely its order) for the rank of the matrix, which can be useful to prove
that certain operations do not lower the rank of a matrix.
Equivalence of the determinantal definition (rank of largest non-vanishing minor) is generally proved alternatively. It
is a generalization of the statement that if the span of n vectors has dimension p, then p of those vectors span the
space: one can choose a spanning set that is a subset of the vectors. For determinantal rank, the statement is that if
the row rank (column rank) of a matrix is p, then one can choose a p × p submatrix that is invertible: a subset of the
rows and a subset of the columns simultaneously define an invertible submatrix. It can be alternatively stated as: if
the span of n vectors has dimension p, then p of these vectors span the space and there is a set of p coordinates on
which they are linearly independent.
A non-vanishing p-minor (p × p submatrix with non-vanishing determinant) shows that the rows and columns of that
submatrix are linearly independent, and thus those rows and columns of the full matrix are linearly independent (in
the full matrix), so the row and column rank are at least as large as the determinantal rank; however, the converse is
less straightforward.
tensor rank – minimum number of simple tensors
The rank of a square matrix A can also be characterized as the tensor rank: the minimum number k of simple tensors
(rank 1 tensors) needed to express A as a linear combination, A = ∑_{i=1}^{k} ui viT. Here a rank 1 tensor (matrix product
of a column vector and a row vector) is the same thing as a rank 1 matrix of the given size. This interpretation can be
generalized in the separable models interpretation of the singular value decomposition.
Properties
We assume that A is an m-by-n matrix over either the real numbers or the complex numbers, and we define the linear
map f by f(x) = Ax as above.
• Only a zero matrix has rank zero.
• rk(A) ≤ min(m, n).
• f is injective if and only if A has rank n (in this case, we say that A has full column rank).
• f is surjective if and only if A has rank m (in this case, we say that A has full row rank).
• If A is a square matrix (i.e., m = n), then A is invertible if and only if A has rank n (that is, A has full rank).
• If B is any n-by-k matrix, then rk(AB) ≤ min(rk(A), rk(B)).
• The rank of A is equal to r if and only if there exists an invertible m-by-m matrix X and an invertible n-by-n
matrix Y such that A = X [Ir 0; 0 0] Y, where Ir denotes the r × r identity matrix and the 0 blocks are zero
matrices of the appropriate sizes.
• A and its Gram matrix AᵀA have the same rank: rank(AᵀA) = rank(A). This can be shown by proving equality of their null spaces.
The null space of the Gram matrix consists of the vectors x for which AᵀAx = 0. If this condition is fulfilled, then Ax = 0
also holds, since 0 = xᵀAᵀAx = ‖Ax‖². This proof was adapted from Mirsky.[3]
• If A* denotes the conjugate transpose of A (i.e., the adjoint of A), then rank(A) = rank(A*) = rank(A*A) = rank(AA*).
The matrix can be put in reduced row-echelon form by using elementary row operations. Looking at the final matrix
(reduced row-echelon form), one sees that the first non-zero entry in each of its two non-zero rows is a 1. Therefore the
rank of the matrix is 2.
Computation
The easiest way to compute the rank of a matrix A is given by the Gauss elimination method. The row-echelon form
of A produced by the Gauss algorithm has the same rank as A, and its rank can be read off as the number of non-zero
rows.
Consider for example the 4-by-4 matrix
$$A = \begin{pmatrix} 2 & 4 & 1 & 3 \\ -1 & -2 & 1 & 0 \\ 0 & 0 & 2 & 2 \\ 3 & 6 & 2 & 5 \end{pmatrix}.$$
We see that the second column is twice the first column, and that the fourth column equals the sum of the first and
the third. The first and the third columns are linearly independent, so the rank of A is two. This can be confirmed
with the Gauss algorithm, which produces the following row echelon form of A:
$$\begin{pmatrix} 2 & 4 & 1 & 3 \\ 0 & 0 & 3/2 & 3/2 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},$$
which has two non-zero rows.
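The procedure just described, reducing to row echelon form and counting the non-zero rows, can be sketched in a few lines of Python. This is an illustrative helper; the function name, tolerance, and partial-pivoting strategy are my own choices, not from the article:

```python
def matrix_rank_gauss(rows, tol=1e-10):
    """Rank of a matrix (given as a list of rows) via Gaussian elimination:
    reduce to row echelon form and count the non-zero rows."""
    a = [list(map(float, r)) for r in rows]
    m = len(a)
    n = len(a[0]) if m else 0
    rank = 0
    for col in range(n):
        # choose the largest pivot in this column (partial pivoting, for stability)
        pivot = max(range(rank, m), key=lambda i: abs(a[i][col]), default=None)
        if pivot is None or abs(a[pivot][col]) < tol:
            continue  # no usable pivot in this column
        a[rank], a[pivot] = a[pivot], a[rank]
        for i in range(rank + 1, m):  # eliminate entries below the pivot
            f = a[i][col] / a[rank][col]
            for j in range(col, n):
                a[i][j] -= f * a[rank][j]
        rank += 1
        if rank == m:
            break
    return rank
```

Applied to a matrix whose second column is twice the first and whose fourth column is the sum of the first and third, it returns 2, as the text predicts.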
Applications
One useful application of calculating the rank of a matrix is the computation of the number of solutions of a system
of linear equations. According to the Rouché–Capelli theorem, the system is inconsistent if the rank of the
augmented matrix is greater than the rank of the coefficient matrix. If, on the other hand, the ranks of these two matrices
are equal, the system must have at least one solution. The solution is unique if and only if the rank equals the number
of variables. Otherwise the general solution has k free parameters where k is the difference between the number of
variables and the rank.
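The Rouché–Capelli criterion can be checked mechanically by comparing the two ranks. A minimal sketch with NumPy, on a made-up three-variable system:

```python
import numpy as np

# Rouché–Capelli: compare the rank of the coefficient matrix with the rank of
# the augmented matrix [A | b]. The system here is invented for illustration.
A = np.array([[1., 1., 1.],
              [0., 2., 5.],
              [2., 5., -1.]])
b = np.array([6., -4., 27.])

rank_A = np.linalg.matrix_rank(A)
rank_Ab = np.linalg.matrix_rank(np.column_stack([A, b]))
n_vars = A.shape[1]

if rank_A < rank_Ab:
    outcome = "inconsistent"
elif rank_A == n_vars:
    outcome = "unique solution"
else:
    outcome = f"{n_vars - rank_A} free parameters"
```

Here the coefficient matrix has full rank 3, equal to the number of variables, so the system has a unique solution.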
In control theory, the rank of a matrix can be used to determine whether a linear system is controllable, or
observable.
Generalization
There are different generalisations of the concept of rank to matrices over arbitrary rings. In those generalisations,
column rank, row rank, dimension of column space and dimension of row space of a matrix may be different from
the others or may not exist.
Thinking of matrices as tensors, the tensor rank generalizes to arbitrary tensors; note that for tensors of order greater
than 2 (matrices are order 2 tensors), rank is very hard to compute, unlike for matrices.
There is a notion of rank for smooth maps between smooth manifolds. It is equal to the linear rank of the derivative.
Matrices as tensors
Matrix rank should not be confused with tensor order, which is called tensor rank. Tensor order is the number of
indices required to write a tensor, and thus matrices all have tensor order 2. More precisely, matrices are tensors of
type (1,1), having one row index and one column index, also called covariant order 1 and contravariant order 1; see
Tensor (intrinsic definition) for details.
Note that the tensor rank of a matrix can also mean the minimum number of simple tensors necessary to express the
matrix as a linear combination, and that this definition does agree with matrix rank as here discussed.
References
[1] Proof: The map from ker(ABC)/ker(BC) to ker(AB)/ker(B) induced by C is well-defined and injective. We thus obtain the inequality
dim ker(ABC) − dim ker(BC) ≤ dim ker(AB) − dim ker(B) in terms of dimensions of kernels, which can then be converted to the inequality
rank(AB) + rank(BC) ≤ rank(B) + rank(ABC) in terms of ranks by the rank–nullity theorem. Alternatively, if M is a linear subspace then dim(AM) ≤ dim(M); apply this inequality
to the subspace defined by the orthogonal complement of the image of BC in the image of B, whose dimension is
rank(B) − rank(BC); its image under A has dimension at least rank(AB) − rank(ABC), which yields the same inequality.
[3] Leon Mirsky: An Introduction to Linear Algebra, 1990, ISBN 0-486-66434-1
Further reading
• Horn, Roger A. and Johnson, Charles R. Matrix Analysis. Cambridge University Press, 1985. ISBN
0-521-38632-2.
• Kaw, Autar K. Two Chapters from the book Introduction to Matrix Algebra: 1. Vectors (http://
numericalmethods.eng.usf.edu/mws/che/04sle/mws_che_sle_bck_vectors.pdf) and System of Equations
(http://numericalmethods.eng.usf.edu/mws/che/04sle/mws_che_sle_bck_system.pdf)
• Mike Brookes: Matrix Reference Manual. (http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/property.
html#rank)
Resampling (statistics)
In statistics, resampling is any of a variety of methods for doing one of the following:
1. Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data
(jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
2. Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests,
randomization tests, or re-randomization tests)
3. Validating models by using random subsets (bootstrapping, cross validation)
Common resampling techniques include bootstrapping, jackknifing and permutation tests.
Bootstrap
Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with
replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors
and confidence intervals of a population parameter like a mean, median, proportion, odds ratio, correlation
coefficient or regression coefficient. It may also be used for constructing hypothesis tests. It is often used as a robust
alternative to inference based on parametric assumptions when those assumptions are in doubt, or where parametric
inference is impossible or requires very complicated formulas for the calculation of standard errors.
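As a minimal sketch of the idea (the data values and function name are invented for illustration), the bootstrap standard error of a sample mean can be estimated by resampling with replacement:

```python
import random
import statistics

def bootstrap_se(data, stat=statistics.mean, n_boot=2000, seed=0):
    """Bootstrap estimate of the standard error of a statistic: resample the
    data with replacement many times and take the standard deviation of the
    statistic across the replicates."""
    rng = random.Random(seed)
    n = len(data)
    replicates = [stat([rng.choice(data) for _ in range(n)]) for _ in range(n_boot)]
    return statistics.stdev(replicates)

sample = [2.1, 3.4, 1.9, 5.6, 4.4, 3.3, 2.8, 4.1, 3.7, 2.5]
se_mean = bootstrap_se(sample)
```

For the mean, the bootstrap standard error should land close to the textbook estimate s/√n; with more replicates the estimate stabilizes.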
Jackknife
Jackknifing, which is similar to bootstrapping, is used in statistical inference to estimate the bias and standard error
(variance) of a statistic, when a random sample of observations is used to calculate it. The basic idea behind the
jackknife variance estimator lies in systematically recomputing the statistic estimate leaving out one or more
observations at a time from the sample set. From this new set of replicates of the statistic, an estimate for the bias
and an estimate for the variance of the statistic can be calculated.
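The delete-1 recipe just described, recomputing the statistic with one observation left out at a time and combining the replicates, can be sketched as follows (data values are illustrative):

```python
import statistics

def jackknife(data, stat=statistics.mean):
    """Delete-1 jackknife estimates of the bias and variance of a statistic."""
    n = len(data)
    theta_hat = stat(data)
    # recompute the statistic leaving out one observation at a time
    leave_one_out = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    theta_bar = sum(leave_one_out) / n
    bias = (n - 1) * (theta_bar - theta_hat)
    variance = (n - 1) / n * sum((t - theta_bar) ** 2 for t in leave_one_out)
    return bias, variance

sample = [2.1, 3.4, 1.9, 5.6, 4.4]
bias, var = jackknife(sample)
```

For the mean, the jackknife bias estimate is exactly zero and the variance estimate reduces to s²/n, which is a useful sanity check.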
Both methods, the bootstrap and the jackknife, estimate the variability of a statistic from the variability of that
statistic between subsamples, rather than from parametric assumptions. The bootstrap can be seen as a random
approximation of the more general delete-m observations jackknife. Both yield similar numerical
results, which is why each can be seen as an approximation to the other. Although there are huge theoretical differences
in their mathematical insights, the main practical difference for statistics users is that the bootstrap gives different
results when repeated on the same data, whereas the jackknife gives exactly the same result each time. Because of
this, the jackknife is popular when the estimates need to be verified several times before publishing (e.g. official
statistics agencies). On the other hand, when this verification feature is not crucial and it is of interest not to have a
number but just an idea of its distribution the bootstrap is preferred (e.g. studies in physics, economics, biological
sciences).
Whether to use the bootstrap or the jackknife may depend less on statistical concerns than on operational aspects of a
survey. The bootstrap provides a powerful and easy way to estimate not just the variance of a point estimator but its
whole distribution, though it thereby becomes highly computer-intensive. On the other hand, the jackknife (originally used for
bias reduction) only provides estimates of the variance of the point estimator. This can be enough for basic statistical
inference (e.g. hypothesis testing, confidence intervals). Hence, the jackknife is a specialized method for estimating
variances whereas the bootstrap first estimates the whole distribution from where the variance is assessed.
"The bootstrap can be applied to both variance and distribution estimation problems. However, the bootstrap
variance estimator is not as good as the jackknife or the balanced repeated replication (BRR) variance estimator in
terms of the empirical results. Furthermore, the bootstrap variance estimator usually requires more computations
than the jackknife or the BRR. Thus, the bootstrap is mainly recommended for distribution estimation."[1]
There is a special consideration with the jackknife, particularly with the delete-1 observation jackknife. It should
only be used with smooth, differentiable statistics (e.g. totals, means, proportions, ratios, odds ratios, regression
coefficients), not with medians or quantiles. This may become a practical disadvantage (or not,
depending on the needs of the user), and it is usually the argument made against the jackknife and in favor of the
bootstrap. More general jackknives than the delete-1, such as the delete-m jackknife, overcome this problem for
medians and quantiles by relaxing the smoothness requirements for consistent variance estimation.
Usually the jackknife is easier to apply to complex sampling schemes than the bootstrap. Complex sampling schemes
may involve stratification, multiple stages (clustering), varying sampling weights (non-response adjustments,
calibration, post-stratification) and unequal-probability sampling designs. Theoretical aspects of both the
bootstrap and the jackknife can be found in Shao and Tu (1995),[2] whereas a basic introduction is given in Wolter (2007).[3]
Cross-validation
Cross-validation is a statistical method for validating a predictive model. Subsets of the data are held out for use as
validating sets; a model is fit to the remaining data (a training set) and used to predict for the validation set.
Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.
One form of cross-validation leaves out a single observation at a time; this is similar to the jackknife. Another,
K-fold cross-validation, splits the data into K subsets; each is held out in turn as the validation set.
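A minimal K-fold cross-validation sketch for a simple least-squares line fit; the helper names and the noise-free toy data are my own, chosen so the cross-validated error is essentially zero:

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (closed form)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def kfold_cv_mse(xs, ys, k=5, seed=0):
    """K-fold cross-validation: hold out each fold in turn, fit on the rest,
    and average the squared prediction error over the held-out points."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_errors.extend((ys[i] - (a + b * xs[i])) ** 2 for i in fold)
    return sum(sq_errors) / len(sq_errors)

xs = list(range(20))
ys = [2.0 + 0.5 * x for x in xs]  # an exact line, so out-of-fold error is ~0
cv_mse = kfold_cv_mse(xs, ys)
```

Because each prediction is made for a point the model never saw, the averaged error is an honest estimate of predictive accuracy, which is the "self-influence" point made above.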
This avoids "self-influence". For comparison, in regression analysis methods such as linear regression, each y value
draws the regression line toward itself, making the prediction of that value appear more accurate than it really is.
Cross-validation applied to linear regression predicts the y value for each observation without using that observation.
This is often used for deciding how many predictor variables to use in regression. Without cross-validation, adding
predictors always reduces the residual sum of squares (or possibly leaves it unchanged). In contrast, the
cross-validated mean-square error will tend to decrease if valuable predictors are added, but increase if worthless
predictors are added.
Permutation tests
A permutation test (also called a randomization test, re-randomization test, or an exact test) is a type of statistical
significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all
possible values of the test statistic under rearrangements of the labels on the observed data points. In other words, the
method by which treatments are allocated to subjects in an experimental design is mirrored in the analysis of that
design. If the labels are exchangeable under the null hypothesis, then the resulting tests yield exact significance
levels; see also exchangeability. Confidence intervals can then be derived from the tests. The theory has evolved
from the works of R.A. Fisher and E.J.G. Pitman in the 1930s.
To illustrate the basic idea of a permutation test, suppose we have two groups A and B whose sample means are
$\bar{x}_A$ and $\bar{x}_B$, and that we want to test, at the 5% significance level, whether they come from the same distribution. Let
$n_A$ and $n_B$ be the sample size corresponding to each group. The permutation test is designed to determine whether the
observed difference between the sample means is large enough to reject the null hypothesis H0 that the two groups
have identical probability distributions.
The test proceeds as follows. First, the difference in means between the two samples is calculated: this is the
observed value of the test statistic, T(obs). Then the observations of groups A and B are pooled.
Next, the difference in sample means is calculated and recorded for every possible way of dividing these pooled
values into two groups of size $n_A$ and $n_B$ (i.e., for every permutation of the group labels A and B). The set of
these calculated differences is the exact distribution of possible differences under the null hypothesis that the group label
does not matter.
The one-sided p-value of the test is calculated as the proportion of sampled permutations where the difference in
means was greater than or equal to T(obs). The two-sided p-value of the test is calculated as the proportion of
sampled permutations where the absolute difference was greater than or equal to ABS(T(obs)).
If the only purpose of the test is to reject or not reject the null hypothesis, we can as an alternative sort the recorded
differences and observe whether T(obs) is contained within the middle 95% of them. If it is not, we reject the
hypothesis of identical probability distributions at the 5% significance level.
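For small samples the procedure can be implemented directly by enumerating every division of the pooled data; the two toy groups below are invented for illustration:

```python
from itertools import combinations

def permutation_test(group_a, group_b):
    """Exact two-sided permutation test for a difference in means: enumerate
    every way of splitting the pooled data into groups of the original sizes
    and compare each absolute difference in means with the observed one."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    count = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        total += 1
        if diff >= observed - 1e-12:  # count splits at least as extreme
            count += 1
    return count / total  # two-sided p-value

p = permutation_test([19.8, 23.4, 25.1, 24.0], [10.2, 12.7, 11.1])
```

With these well-separated groups only the observed split (1 of C(7,4) = 35) is as extreme, so p = 1/35 ≈ 0.029 and the null hypothesis is rejected at the 5% level.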
Advantages
Permutation tests exist for any test statistic, regardless of whether or not its distribution is known. Thus one is always
free to choose the statistic which best discriminates between the null hypothesis and the alternative and which minimizes losses.
Permutation tests can be used for analyzing unbalanced designs [4] and for combining dependent tests on mixtures of
categorical, ordinal, and metric data (Pesarin, 2001).
Before the 1980s, the burden of creating the reference distribution was overwhelming except for data sets with small
sample sizes.
Since the 1980s, the confluence of relatively inexpensive fast computers and the development of new sophisticated
path algorithms applicable in special situations has made the application of permutation test methods practical for a
wide range of problems. It also initiated the addition of exact-test options in the main statistical software packages
and the appearance of specialized software for performing a wide range of uni- and multi-variable exact tests and
computing test-based "exact" confidence intervals.
Limitations
An important assumption behind a permutation test is that the observations are exchangeable under the null
hypothesis. An important consequence of this assumption is that tests of difference in location (like a permutation
t-test) require equal variance. In this respect, the permutation t-test shares the same weakness as the classical
Student's t-test (the Behrens–Fisher problem). A third alternative in this situation is to use a bootstrap-based test.
Good (2000) explains the difference between permutation tests and bootstrap tests the following way: "Permutations
test hypotheses concerning distributions; bootstraps test hypotheses concerning parameters. As a result, the bootstrap
entails less-stringent assumptions." Of course, bootstrap tests are not exact.
Bibliography
Introductory statistics
• Good, P. (2005) Introduction to Statistics Through Resampling Methods and R/S-PLUS. Wiley. ISBN
0-471-71575-1
• Good, P. (2005) Introduction to Statistics Through Resampling Methods and Microsoft Office Excel. Wiley. ISBN
0-471-73191-9
• Hesterberg, T. C., D. S. Moore, S. Monaghan, A. Clipson, and R. Epstein (2005). Bootstrap Methods and
Permutation Tests.
• Wolter, K.M. (2007). Introduction to Variance Estimation. Second Edition. Springer, Inc.
Bootstrapping
• Efron, Bradley (1979). "Bootstrap methods: Another look at the jackknife" [8], The Annals of Statistics, 7, 1-26.
• Efron, Bradley (1981). "Nonparametric estimates of standard error: The jackknife, the bootstrap and other
methods", Biometrika, 68, 589-599.
• Efron, Bradley (1982). The jackknife, the bootstrap, and other resampling plans, In Society of Industrial and
Applied Mathematics CBMS-NSF Monographs, 38.
• Diaconis, P.; Efron, Bradley (1983), "Computer-intensive methods in statistics," Scientific American, May,
116-130.
• Efron, Bradley; Tibshirani, Robert J. (1993). An introduction to the bootstrap, New York: Chapman & Hall,
software [9].
• Davison, A. C. and Hinkley, D. V. (1997): Bootstrap Methods and their Application, software [10].
• Mooney, C Z & Duval, R D (1993). Bootstrapping. A Nonparametric Approach to Statistical Inference. Sage
University Paper series on Quantitative Applications in the Social Sciences, 07-095. Newbury Park, CA: Sage.
• Simon, J. L. (1997): Resampling: The New Statistics [11].
Jackknife
• Berger, Y.G. (2007). A jackknife variance estimator for unistage stratified samples with unequal probabilities.
Biometrika. Vol. 94, 4, pp. 953–964.
• Berger, Y.G. and Rao, J.N.K. (2006). Adjusted jackknife for imputation under unequal probability sampling
without replacement. Journal of the Royal Statistical Society B. Vol. 68, 3, pp. 531–547.
• Berger, Y.G. and Skinner, C.J. (2005). A jackknife variance estimator for unequal probability sampling. Journal
of the Royal Statistical Society B. Vol. 67, 1, pp. 79–89.
• Jiang, J., Lahiri, P. and Wan, S-M. (2002). A unified jackknife theory for empirical best prediction with
M-estimation. The Annals of Statistics. Vol. 30, 6, pp. 1782–810.
• Jones, H.L. (1974). Jackknife estimation of functions of stratum means. Biometrika. Vol. 61, 2, pp. 343–348.
• Kish, L. and Frankel M.R. (1974). Inference from complex samples. Journal of the Royal Statistical Society B.
Vol. 36, 1, pp. 1–37.
• Krewski, D. and Rao, J.N.K. (1981). Inference from stratified samples: properties of the linearization, jackknife
and balanced repeated replication methods. The Annals of Statistics. Vol. 9, 5, pp. 1010–1019.
• Quenouille, M.H. (1956). Notes on bias in estimation. Biometrika. Vol. 43, pp. 353–360.
• Rao, J.N.K. and Shao, J. (1992). Jackknife variance estimation with survey data under hot deck imputation.
Biometrika. Vol. 79, 4, pp. 811–822.
• Rao, J.N.K., Wu, C.F.J. and Yue, K. (1992). Some recent work on resampling methods for complex surveys.
Survey Methodology. Vol. 18, 2, pp. 209–217.
• Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. Springer-Verlag, Inc.
• Tukey, J.W. (1958). Bias and confidence in not-quite large samples (abstract). The Annals of Mathematical
Statistics. Vol. 29, 2, pp. 614.
• Wu, C.F.J. (1986). Jackknife, Bootstrap and other resampling methods in regression analysis. The Annals of
Statistics. Vol. 14, 4, pp. 1261–1295.
Permutation test
Original references:
• Fisher, R.A. (1935) The Design of Experiments, New York: Hafner
• Pitman, E. J. G. (1937) "Significance tests which may be applied to samples from any population", Royal
Statistical Society Supplement, 4: 119-130 and 225-32 (parts I and II). JSTOR 2984124 JSTOR 2983647
• Pitman, E. J. G. (1938) "Significance tests which may be applied to samples from any population. Part III. The
analysis of variance test", Biometrika, 29 (3-4): 322-335. doi:10.1093/biomet/29.3-4.322
Modern references:
• Edgington. E.S. (1995) Randomization tests, 3rd ed. New York: Marcel-Dekker
• Good, Phillip I. (2005) Permutation, Parametric and Bootstrap Tests of Hypotheses, 3rd ed., Springer ISBN
0-387-98898-X
• Good, P. (2002) "Extensions of the concept of exchangeability and their applications", J. Modern Appl. Statist.
Methods, 1:243-247.
• Lunneborg, Cliff. (1999) Data Analysis by Resampling, Duxbury Press. ISBN 0-534-22110-6.
• Pesarin, F. (2001). Multivariate Permutation Tests : With Applications in Biostatistics, John Wiley & Sons. ISBN
978-0471496700
• Welch, W. J. (1990) "Construction of permutation tests", Journal of the American Statistical Association,
85:693-698.
Computational methods:
• Mehta, C. R.; Patel, N. R. (1983). "A network algorithm for performing Fisher's exact test in r x c contingency
tables", J. Amer. Statist. Assoc, 78(382):427–434.
• Metha, C. R.; Patel, N. R.; Senchaudhuri, P. (1988). "Importance sampling for estimating exact probabilities in
permutational inference", J. Am. Statist. Assoc., 83(404):999–1005.
• Gill, P. M. W. (2007). "Efficient calculation of p-values in linear-statistic permutation significance tests", Journal
of Statistical Computation and Simulation , 77(1):55-61. doi:10.1080/10629360500108053
Resampling methods
• Good, P. (2006) Resampling Methods. 3rd Ed. Birkhauser.
• Wolter, K.M. (2007). Introduction to Variance Estimation. 2nd Edition. Springer, Inc.
External links
Software
• Angelo Canty and Brian Ripley (2010). boot: Bootstrap R (S-Plus) Functions. R package version 1.2-43. [18]
Functions and datasets for bootstrapping from the book Bootstrap Methods and Their Applications by A. C.
Davison and D. V. Hinkley (1997, CUP).
• Statistics101: Resampling, Bootstrap, Monte Carlo Simulation program [19]
References
[1] Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. Springer-Verlag, Inc. pp. 281.
[2] Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. Springer-Verlag, Inc.
[3] Wolter, K.M. (2007). Introduction to Variance Estimation. Second Edition. Springer, Inc.
[4] http://tbf.coe.wayne.edu/jmasm/vol1_no2.pdf
[5] Meyer Dwass, "Modified Randomization Tests for Nonparametric Hypotheses", The Annals of Mathematical Statistics, 28:181-187, 1957.
[6] Thomas E. Nichols, Andrew P. Holmes (2001). "Nonparametric Permutation Tests For Functional Neuroimaging: A Primer with Examples"
(http://www.fil.ion.ucl.ac.uk/spm/doc/papers/NicholsHolmes.pdf). Human Brain Mapping 15 (1): 1–25. doi:10.1002/hbm.1058.
PMID 11747097.
[7] Gandy, Axel (2009). "Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk". Journal of the American
Statistical Association 104 (488): 1504-1511.
[8] http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aos/1176344552
[9] http://lib.stat.cmu.edu/S/bootstrap.funs
[10] http://statwww.epfl.ch/davison/BMA/library.html
[11] http://www.resample.com/content/text/index.shtml
[12] http://people.revoledu.com/kardi/tutorial/Bootstrap/index.html
[13] http://bcs.whfreeman.com/ips5e/content/cat_080/pdf/moore14.pdf
[14] http://www.insightful.com/Hesterberg/bootstrap
[15] http://bcs.whfreeman.com/pbs/cat_140/chap18.pdf
[16] http://PAREonline.net/getvn.asp?v=8&n=19
[17] http://www.ericdigests.org/1993/marriage.htm
[18] http://cran.at.r-project.org/web/packages/boot/index.html
[19] http://www.statistics101.net
Schur complement
In linear algebra and the theory of matrices, the Schur complement of a matrix block (i.e., a submatrix within a
larger matrix) is defined as follows. Suppose A, B, C, D are respectively p×p, p×q, q×p and q×q matrices, and D is
invertible. Let
$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix},$$
so that M is a (p+q)×(p+q) matrix. Then the Schur complement of the block D of the matrix M is the p×p matrix
$$A - BD^{-1}C.$$
It is named after Issai Schur who used it to prove Schur's lemma, although it had been used previously.[1] Emilie
Haynsworth was the first to call it the Schur complement.[2]
Background
The Schur complement arises as the result of performing a block Gaussian elimination by multiplying the matrix M
from the right with the "block lower triangular" matrix
$$L = \begin{pmatrix} I_p & 0 \\ -D^{-1}C & I_q \end{pmatrix}.$$
Here Ip denotes a p×p identity matrix. After multiplication with the matrix L the Schur complement appears in the
upper p×p block. The product matrix is
$$ML = \begin{pmatrix} A - BD^{-1}C & B \\ 0 & D \end{pmatrix},$$
and the inverse of M may thus be expressed using D−1 and the inverse of the Schur complement (if it exists) only, as
$$M^{-1} = \begin{pmatrix} (A - BD^{-1}C)^{-1} & -(A - BD^{-1}C)^{-1}BD^{-1} \\ -D^{-1}C(A - BD^{-1}C)^{-1} & D^{-1} + D^{-1}C(A - BD^{-1}C)^{-1}BD^{-1} \end{pmatrix}.$$
Cf. the matrix inversion lemma, which illustrates relationships between the above and the equivalent derivation with the
roles of A and D interchanged.
If M is a positive-definite symmetric matrix, then so is the Schur complement of D in M.
If p and q are both 1 (i.e., A, B, C and D are all scalars), we get the familiar formula for the inverse of a 2-by-2
matrix:
$$M^{-1} = \frac{1}{AD - BC}\begin{pmatrix} D & -B \\ -C & A \end{pmatrix},$$
provided that AD − BC is non-zero.
Application to solving linear equations
The Schur complement arises naturally in solving a system of linear equations such as
$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} a \\ b \end{pmatrix},$$
where x, a are p-dimensional column vectors, y, b are q-dimensional column vectors, and A, B, C, D are as above.
Multiplying the bottom equation by BD−1 and then subtracting it from the top equation, one obtains
$$(A - BD^{-1}C)\,x = a - BD^{-1}b.$$
Thus if one can invert D as well as the Schur complement of D, one can solve for x, and then by using the equation
Cx + Dy = b one can solve for y = D−1(b − Cx). This reduces the problem of inverting a (p+q)×(p+q) matrix to that of
inverting a p×p matrix and a q×q matrix. In practice one needs D to be well-conditioned in order for this algorithm to
be numerically accurate.
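The two-step solve described above can be checked numerically. A sketch with NumPy on a small hypothetical block system, chosen with a well-conditioned D:

```python
import numpy as np

# Solve [[A, B], [C, D]] [x; y] = [a; b] via the Schur complement of D:
# first (A - B D^{-1} C) x = a - B D^{-1} b, then y = D^{-1} (b - C x).
A = np.array([[4., 1., 0.], [1., 3., 0.], [0., 0., 2.]])
B = np.array([[1., 0.], [0., 1.], [1., 1.]])
C = B.T
D = 5.0 * np.eye(2)  # deliberately well-conditioned

a = np.array([1., 2., 3.])
b = np.array([1., 0.])

S = A - B @ np.linalg.solve(D, C)                    # Schur complement of D
x = np.linalg.solve(S, a - B @ np.linalg.solve(D, b))
y = np.linalg.solve(D, b - C @ x)

# Compare with solving the full (p+q)-by-(p+q) system directly.
M = np.block([[A, B], [C, D]])
xy_direct = np.linalg.solve(M, np.concatenate([a, b]))
```

Only a 3×3 and a 2×2 solve are needed, rather than one 5×5 solve, illustrating the reduction the text describes.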
If the matrix V above is taken to be not the covariance matrix of a random vector but a sample covariance matrix, then it may have a
Wishart distribution. In that case, the Schur complement of C in V also has a Wishart distribution.
Schur complement condition on positive definiteness
Let X be a symmetric matrix given by
$$X = \begin{pmatrix} A & B \\ B^{\mathsf T} & C \end{pmatrix}.$$
Then:
• X is positive definite if and only if C and the Schur complement of C, $A - BC^{-1}B^{\mathsf T}$, are both positive definite.
• X is positive definite if and only if A and the Schur complement of A, $C - B^{\mathsf T}A^{-1}B$, are both positive definite.
• If C is positive definite, then X is positive semidefinite if and only if the Schur complement $A - BC^{-1}B^{\mathsf T}$ is positive semidefinite.
• If A is positive definite, then X is positive semidefinite if and only if the Schur complement $C - B^{\mathsf T}A^{-1}B$ is positive semidefinite.[3]
These statements can be derived by considering the minimizer of the quantity $u^{\mathsf T}Au + 2v^{\mathsf T}B^{\mathsf T}u + v^{\mathsf T}Cv$ as a function of u, for fixed v.
References
[1] Zhang, Fuzhen (2005). The Schur Complement and Its Applications. Springer. doi:10.1007/b105056. ISBN 0-387-24271-6.
[2] Haynsworth, E. V., "On the Schur Complement", Basel Mathematical Notes, #BNB 20, 17 pages, June 1968.
[3] Boyd, S. and Vandenberghe, L. (2004), "Convex Optimization", Cambridge University Press (Appendix A.5.5)
Sign test
In statistics, the sign test can be used to test the hypothesis that there is "no difference in medians" between the
continuous distributions of two random variables X and Y, in the situation when we can draw paired samples from X
and Y. It is a non-parametric test which makes very few assumptions about the nature of the distributions under test;
this means that it has very general applicability but may lack the statistical power of alternatives such as the
paired-samples t-test or the Wilcoxon signed-rank test.
Method
Let p = Pr(X > Y), and then test the null hypothesis H0: p = 0.50. In other words, the null hypothesis states that for
a random pair of measurements (xi, yi), xi and yi are each equally likely to be the larger of the two.
To test the null hypothesis, independent pairs of sample data are collected from the populations {(x1, y1), (x2, y2), . .
., (xn, yn)}. Pairs are omitted for which there is no difference so that there is a possibility of a reduced sample of m
pairs.[1]
Then let W be the number of pairs for which yi − xi > 0. Assuming that H0 is true, W follows a binomial
distribution W ~ b(m, 0.5). The "W" is for Frank Wilcoxon, who developed the test and, later, the more powerful
Wilcoxon signed-rank test.[2]
Assumptions
Let Zi = Yi – Xi for i = 1, ... , n.
1. The differences Zi are assumed to be independent.
2. Each Zi comes from the same continuous population.
3. The values of Xi and Yi are ordered (measured at least on an ordinal scale), so the comparisons "greater than", "less
than", and "equal to" are meaningful.
Significance testing
Since the test statistic is expected to follow a binomial distribution, the standard binomial test is used to calculate
significance. The normal approximation to the binomial distribution can be used for large sample sizes, m > 25.[1]
The left-tail value is computed by Pr(W ≤ w), which is the p-value for the alternative H1: p < 0.50. This alternative
means that the X measurements tend to be higher.
The right-tail value is computed by Pr(W ≥ w), which is the p-value for the alternative H1: p > 0.50. This alternative
means that the Y measurements tend to be higher.
For a two-sided alternative H1 the p-value is twice the smaller tail-value.
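The whole procedure fits in a few lines using exact binomial tail sums; the paired measurements below are invented for illustration:

```python
from math import comb

def sign_test_p(xs, ys):
    """Two-sided sign test: count the pairs with y > x among the m non-tied
    pairs and compute binomial tail probabilities under W ~ Binomial(m, 0.5)."""
    diffs = [y - x for x, y in zip(xs, ys) if y != x]  # drop tied pairs
    m = len(diffs)
    w = sum(d > 0 for d in diffs)
    left = sum(comb(m, k) for k in range(0, w + 1)) / 2 ** m    # Pr(W <= w)
    right = sum(comb(m, k) for k in range(w, m + 1)) / 2 ** m   # Pr(W >= w)
    return min(1.0, 2 * min(left, right))  # two-sided: twice the smaller tail

xs = [142, 140, 144, 144, 142, 146, 149, 150, 142, 148]
ys = [138, 136, 147, 139, 143, 141, 143, 145, 136, 146]
p_value = sign_test_p(xs, ys)
```

Here w = 2 of m = 10 differences are positive, so the two-sided p-value is 2 · Pr(W ≤ 2) = 2 · 56/1024 ≈ 0.109, and the null hypothesis is not rejected at the 5% level.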
References
[1] Mendenhall, W.; Wackerly, D. D. and Scheaffer, R. L. (1989), "15: Nonparametric statistics", Mathematical statistics with applications
(Fourth ed.), PWS-Kent, pp. 674–679, ISBN 0-534-92026-8
[2] Karas, J. & Savage, I.R. (1967) Publications of Frank Wilcoxon (1892–1965). Biometrics 23(1): 1–10
• Gibbons, J.D. and Chakraborti, S. (1992). Nonparametric Statistical Inference. Marcel Dekker Inc., New York.
• Kitchens, L.J.(2003). Basic Statistics and Data Analysis. Duxbury.
• Conover, W. J. (1980). Practical Nonparametric Statistics, 2nd ed. Wiley, New York.
• Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden and Day, San Francisco.
$$M = U\Sigma V^{*},$$
where U is an m×m unitary matrix over K, the matrix Σ is an m×n diagonal matrix with nonnegative real numbers on
the diagonal, and the n×n unitary matrix V* denotes the conjugate transpose of the n×n unitary matrix V. Such a
factorization is called the singular value decomposition of M.
Singular value decomposition 552
Intuitive interpretations
Example
Consider the 4×5 matrix
$$M = \begin{pmatrix} 1 & 0 & 0 & 0 & 2 \\ 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 & 0 \end{pmatrix}.$$
A singular value decomposition of this matrix is given by M = UΣV* with
$$\Sigma = \begin{pmatrix} 4 & 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 & 0 \\ 0 & 0 & \sqrt{5} & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}.$$
Notice Σ is zero outside of the diagonal and one diagonal element is zero. Furthermore, because the matrices U and V*
are unitary, multiplying each by its respective conjugate transpose yields an identity matrix. In
this case, because U and V* are real valued, each is an orthogonal matrix.
This particular singular value decomposition is not unique: choosing V differently on the columns corresponding to
the vanishing singular value also yields a valid singular value decomposition.
A non-negative real number σ is a singular value for M if and only if there exist unit-length vectors u in Km and v in Kn such that Mv = σu and M*u = σv. The vectors u and v are called left-singular and right-singular vectors for σ, respectively.
In any singular value decomposition
the diagonal entries of Σ are equal to the singular values of M. The columns of U and V are, respectively, left- and
right-singular vectors for the corresponding singular values. Consequently, the above theorem implies that:
• An m × n matrix M has at least one and at most p = min(m,n) distinct singular values.
• It is always possible to find an orthonormal basis U for Km consisting of left-singular vectors of M.
• It is always possible to find an orthonormal basis V for Kn consisting of right-singular vectors of M.
A singular value for which we can find two left (or right) singular vectors that are linearly independent is called
degenerate.
Non-degenerate singular values always have unique left- and right-singular vectors, up to multiplication by a
unit-phase factor eiφ (for the real case up to sign). Consequently, if all singular values of M are non-degenerate and
non-zero, then its singular value decomposition is unique, up to multiplication of a column of U by a unit-phase
factor and simultaneous multiplication of the corresponding column of V by the same unit-phase factor.
Degenerate singular values, by definition, have non-unique singular vectors. Furthermore, if u1 and u2 are two
left-singular vectors which both correspond to the singular value σ, then any normalized linear combination of the
two vectors is also a left-singular vector corresponding to the singular value σ. The similar statement is true for
right-singular vectors. Consequently, if M has degenerate singular values, then its singular value decomposition is
not unique.
Pseudoinverse
The singular value decomposition can be used for computing the pseudoinverse of a matrix. Indeed, the
pseudoinverse of the matrix M with singular value decomposition is
where Σ+ is the pseudoinverse of Σ, which is formed by replacing every nonzero diagonal entry by its reciprocal and
transposing the resulting matrix. The pseudoinverse is one way to solve linear least squares problems.
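For a real matrix, the recipe above (invert the nonzero diagonal entries of Σ and transpose the shape) can be sketched with NumPy; the helper name and tolerance are my own choices:

```python
import numpy as np

def pinv_via_svd(M, tol=1e-12):
    """Moore–Penrose pseudoinverse of a real matrix from its SVD:
    M+ = V Sigma+ U^T, where Sigma+ inverts the nonzero singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_inv = np.array([1.0 / x if x > tol else 0.0 for x in s])
    return Vt.T @ np.diag(s_inv) @ U.T

M = np.array([[1., 0., 0., 0., 2.],
              [0., 0., 3., 0., 0.],
              [0., 0., 0., 0., 0.],
              [0., 4., 0., 0., 0.]])
M_plus = pinv_via_svd(M)
```

The result satisfies the defining property M M+ M = M even though M here is rank-deficient and not invertible.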
Low-rank matrix approximation
Some practical applications need to approximate a matrix M with another matrix $\tilde{M}$ of a given rank r. When the
approximation is based on minimizing the Frobenius norm of the difference between M and $\tilde{M}$, the solution is
given by the SVD of M: $\tilde{M} = U\tilde{\Sigma}V^{*}$, where $\tilde{\Sigma}$ is the same matrix as Σ except that it contains only the r largest singular values (the other singular values
are replaced by zero). This is known as the Eckart–Young theorem, as it was proved by those two authors in 1936
(although it was later found to have been known to earlier authors; see Stewart 1993).
Also see CUR matrix approximation for another low-rank approximation that is easier to interpret.
Separable models
The SVD can be thought of as decomposing a matrix into a weighted, ordered sum of separable matrices. By
separable, we mean that a matrix A can be written as an outer product of two vectors, A = u ⊗ v, or, in
coordinates, Aij = ui vj. Specifically, the matrix M can be decomposed as

M = Σi σi Ui ⊗ Vi.

Here Ui and Vi are the i-th columns of the corresponding SVD matrices, the σi are the ordered singular values, and
each term σi Ui ⊗ Vi is separable. The SVD can be used to find the decomposition of an image processing filter into separable
horizontal and vertical filters. Note that the number of non-zero σi is exactly the rank of the matrix.
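The rank-1 expansion above is easy to verify numerically; the following sketch (NumPy, arbitrary random test matrix) rebuilds M as the weighted sum of outer products of the singular-vector columns:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 5))
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# M as a weighted, ordered sum of separable (rank-1) matrices sigma_i u_i v_i^T
M_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(M, M_rebuilt)

# The number of non-zero singular values equals the rank of the matrix
assert np.count_nonzero(s > 1e-12) == np.linalg.matrix_rank(M)
```
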
Separable models often arise in biological systems, and the SVD decomposition is useful to analyze such systems.
For example, some visual area V1 simple cells' receptive fields can be well described[1] by a Gabor filter in the space
domain multiplied by a modulation function in the time domain. Thus, given a linear filter evaluated through, for
example, reverse correlation, one can rearrange the two spatial dimensions into one dimension, thus yielding a
two-dimensional filter (space, time) which can be decomposed through SVD. The first column of U in the SVD
decomposition is then a Gabor while the first column of V represents the time modulation (or vice versa). One may
then define an index of separability,

α = σ1² / Σi σi²,

which is the fraction of the power in the matrix M which is accounted for by the first separable matrix in the decomposition.[2]
Other examples
The SVD is also applied extensively to the study of linear inverse problems, and is useful in the analysis of
regularization methods such as that of Tikhonov. It is widely used in statistics where it is related to principal
component analysis and to Correspondence analysis, and in signal processing and pattern recognition. It is also used
in output-only modal analysis, where the non-scaled mode shapes can be determined from the singular vectors. Yet
another usage is latent semantic indexing in natural language text processing.
The SVD also plays a crucial role in the field of quantum information, in a form often referred to as the Schmidt
decomposition. Through it, states of two quantum systems are naturally decomposed, providing a necessary and
sufficient condition for them to be entangled: the rank of the M matrix is larger than one.
One application of SVD to rather large matrices is in numerical weather prediction, where Lanczos methods are used
to estimate the few perturbations to the central numerical weather prediction that grow most quickly linearly over a
given initial forward time period – i.e. the singular vectors corresponding to the largest singular values of the
linearized propagator for the global weather over that time interval. The output singular vectors in this case are entire
weather systems. These perturbations are then run through the full nonlinear model to generate an ensemble forecast,
giving a handle on some of the uncertainty that should be allowed for around the current central prediction.
Another application of the SVD in everyday life is that points in a perspective view can be unprojected in a photo using the
calculated SVD matrix. This leads to measuring length (i.e., the distance between two unprojected points in a
perspective photo) by marking out the four corner points of a known-size object in a single photo. PRuler is a demo
that implements this application by taking a photo of a regular credit card.
Relation to eigenvalue decomposition
Given a singular value decomposition M = U Σ V*, the following two relations hold:

M* M = V (Σ* Σ) V*
M M* = U (Σ Σ*) U*.

The right-hand sides of these relations describe the eigenvalue decompositions of the left-hand sides. Consequently:
• The columns of V (right-singular vectors) are eigenvectors of M* M.
• The columns of U (left-singular vectors) are eigenvectors of M M*.
• The non-zero elements of Σ (non-zero singular values) are the square roots of the non-zero eigenvalues of M* M
or M M*.
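These relations can be checked numerically; in this sketch (NumPy, random real test matrix, so M* is just the transpose) the eigenvalues of M*M are compared with the squared singular values, and each column of V is verified to be an eigenvector of M*M:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Eigenvalues of M*M (descending): these are the squared singular values
eigvals = np.sort(np.linalg.eigvalsh(M.T @ M))[::-1]
assert np.allclose(np.sqrt(eigvals), s)

# Each column of V is an eigenvector of M*M with eigenvalue sigma_i^2
for i in range(3):
    v = Vt[i]
    assert np.allclose(M.T @ M @ v, s[i] ** 2 * v)
```
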
In the special case that M is a normal matrix, which by definition must be square, the spectral theorem says that it
can be unitarily diagonalized using a basis of eigenvectors, so that it can be written M = U D U* for a unitary
matrix U and a diagonal matrix D. When M is also positive semi-definite, the decomposition M = U D U* is also
a singular value decomposition.
However, the eigenvalue decomposition and the singular value decomposition differ for all other matrices M: the
eigenvalue decomposition is M = U D U⁻¹, where U is not necessarily unitary and D is not necessarily positive
semi-definite, while the SVD is M = U Σ V*, where Σ is diagonal and positive semi-definite, and U and V are
unitary matrices that are not necessarily related except through the matrix M.
Existence
An eigenvalue λ of a matrix is characterized by the algebraic relation M u = λ u. When M is Hermitian, a variational
characterization is also available. Let M be a real n × n symmetric matrix. Define f : R^n → R by f(x) = xᵀ M x. By the
extreme value theorem, this continuous function attains a maximum at some u when restricted to the closed unit
sphere {||x|| ≤ 1}. By the Lagrange multipliers theorem, u necessarily satisfies

∇ xᵀ M x − λ · ∇ xᵀ x = 0

for some real number λ. Since M is symmetric, this reduces to M u = λ u, so u is a unit-length eigenvector of M and λ
the corresponding eigenvalue.
Singular values are similar in that they can be described algebraically or from variational principles, although,
unlike the eigenvalue case, Hermiticity or symmetry of M is no longer required.
This section gives these two arguments for the existence of the singular value decomposition.
Then
We see that this is almost the desired result, except that U1 and V1 are not unitary in general, but merely isometries.
To finish the argument, one simply has to "fill out" these matrices to obtain unitaries. For example, one can choose
U2 such that
is unitary.
Define
where extra zero rows are added or removed to make the number of zero rows equal the number of columns of U2.
Then
Notice that the argument could begin with diagonalizing MM* rather than M*M (this shows directly that MM* and M*M
have the same non-zero eigenvalues).
Define σ(u, v) = uᵀ M v for vectors u ∈ S^(m−1) and v ∈ S^(n−1), the unit spheres in R^m and R^n. Consider the
function σ restricted to S^(m−1) × S^(n−1). Since both S^(m−1) and S^(n−1)
are compact sets, their product is also compact. Furthermore, since σ is continuous, it attains a largest value
for at least one pair of vectors u ∈ S^(m−1) and v ∈ S^(n−1). This largest value is denoted σ1 and the corresponding
vectors are denoted u1 and v1. Since σ1 is the largest value of σ(u, v), it must be non-negative. If it were negative,
changing the sign of either u1 or v1 would make it positive and therefore larger.
Statement: u1, v1 are left and right-singular vectors of M with corresponding singular value σ1.
Proof: Similarly to the eigenvalues case, by assumption the two vectors satisfy the Lagrange multiplier equation

∇σ = ∇ uᵀ M v − λ1 · ∇ uᵀ u − λ2 · ∇ vᵀ v = 0,

which, after some algebra, becomes

M v1 = 2 λ1 u1   and   Mᵀ u1 = 2 λ2 v1.

Multiplying the first equation from the left by u1ᵀ and the second equation from the left by v1ᵀ, and taking ||u|| = ||v|| = 1 into
account, gives

σ1 = 2 λ1 = 2 λ2.

Putting this into the pair of equations above, we have

M v1 = σ1 u1   and   Mᵀ u1 = σ1 v1,

which proves the statement. More singular vectors and singular values can be found by maximizing σ(u, v) over
normalized u and v which are orthogonal to u1 and v1, respectively.
Geometric meaning
Because U and V are unitary, we know that the columns u1,...,um of U yield an orthonormal basis of Km and the
columns v1,...,vn of V yield an orthonormal basis of Kn (with respect to the standard scalar products on these spaces).
The linear transformation T :Kn → Km that takes a vector x to Mx has a particularly simple description with respect to
these orthonormal bases: we have T(vi) = σi ui, for i = 1,...,min(m,n), where σi is the i-th diagonal entry of Σ, and
T(vi) = 0 for i > min(m,n).
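In code, the relation T(vi) = σi ui reads as follows (a NumPy sketch with an arbitrary random real matrix):

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(M)  # full SVD: U is 4x4, s has 3 entries, Vt is 3x3

# M maps the i-th right-singular vector to sigma_i times the i-th left one
for i in range(3):
    assert np.allclose(M @ Vt[i], s[i] * U[:, i])
```
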
The geometric content of the SVD theorem can thus be summarized as follows: for every linear map T :Kn → Km one
can find orthonormal bases of Kn and Km such that T maps the i-th basis vector of Kn to a non-negative multiple of
the i-th basis vector of Km, and sends the left-over basis vectors to zero. With respect to these bases, the map T is
therefore represented by a diagonal matrix with non-negative real diagonal entries.
Numerical Approach
The SVD of a matrix M is typically computed by a two-step procedure. In the first step, the matrix is reduced to a
bidiagonal matrix. This takes O(mn2) floating-point operations (flops), assuming that m ≥ n (this formulation uses
the big O notation). The second step is to compute the SVD of the bidiagonal matrix. This step can only be done
with an iterative method (as with eigenvalue algorithms). However, in practice it suffices to compute the SVD up to
a certain precision, like the machine epsilon. If this precision is considered constant, then the second step takes O(n)
iterations, each costing O(n) flops. Thus, the first step is more expensive, and the overall cost is O(mn2) flops
(Trefethen & Bau III 1997, Lecture 31).
The first step can be done using Householder reflections for a cost of 4mn2 − 4n3/3 flops, assuming that only the
singular values are needed and not the singular vectors. If m is much larger than n then it is advantageous to first
reduce the matrix M to a triangular matrix with the QR decomposition and then use Householder reflections to
further reduce the matrix to bidiagonal form; the combined cost is 2mn2 + 2n3 flops (Trefethen & Bau III 1997,
Lecture 31).
The second step can be done by a variant of the QR algorithm for the computation of eigenvalues, which was first
described by Golub & Kahan (1965). The LAPACK subroutine DBDSQR[4] implements this iterative method, with
some modifications to cover the case where the singular values are very small (Demmel & Kahan 1990). Together
with a first step using Householder reflections and, if appropriate, QR decomposition, this forms the DGESVD[5]
routine for the computation of the singular value decomposition.
The same algorithm is implemented in the GNU Scientific Library (GSL). The GSL also offers an alternative
method, which uses a one-sided Jacobi orthogonalization in step 2 (GSL Team 2007). This method computes the
SVD of the bidiagonal matrix by solving a sequence of 2-by-2 SVD problems, similar to how the Jacobi eigenvalue
algorithm solves a sequence of 2-by-2 eigenvalue problems (Golub & Van Loan 1996, §8.6.3). Yet another method
for step 2 uses the idea of divide-and-conquer eigenvalue algorithms (Trefethen & Bau III 1997, Lecture 31).
Reduced SVDs
In applications it is quite unusual for the full SVD, including a full unitary decomposition of the null-space of the
matrix, to be required. Instead, it is often sufficient (as well as faster, and more economical for storage) to compute a
reduced version of the SVD. The following can be distinguished for an m×n matrix M of rank r:
Thin SVD
Only the n column vectors of U corresponding to the row vectors of V* are calculated. The remaining column
vectors of U are not calculated. This is significantly quicker and more economical than the full SVD if n ≪ m. The
matrix Un is thus m×n, Σn is n×n diagonal, and V is n×n.
The first stage in the calculation of a thin SVD will usually be a QR decomposition of M, which can make for a
significantly quicker calculation if n ≪ m.
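In NumPy, the thin SVD corresponds to passing full_matrices=False (a sketch with a random tall test matrix, n much smaller than m):

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((100, 4))  # tall matrix: n = 4 much smaller than m = 100

# Thin SVD: only the n = 4 columns of U are computed
U, s, Vt = np.linalg.svd(M, full_matrices=False)
assert U.shape == (100, 4) and s.shape == (4,) and Vt.shape == (4, 4)
assert np.allclose(M, U @ np.diag(s) @ Vt)
```
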
Compact SVD
Only the r column vectors of U and r row vectors of V* corresponding to the non-zero singular values Σr are
calculated. The remaining vectors of U and V* are not calculated. This is quicker and more economical than the thin
SVD if r ≪ n. The matrix Ur is thus m×r, Σr is r×r diagonal, and Vr* is r×n.
Truncated SVD
Only the t column vectors of U and t row vectors of V* corresponding to the t largest singular values Σt are
calculated. The rest of the matrix is discarded. This can be much quicker and more economical than the compact
SVD if t ≪ r. The matrix Ut is thus m×t, Σt is t×t diagonal, and Vt* is t×n.
Of course the truncated SVD is no longer an exact decomposition of the original matrix M, but as discussed above,
the approximate matrix M̃ is in a very useful sense the closest approximation to M that can be achieved by a matrix
of rank t.
Norms
Ky Fan norms
The sum of the k largest singular values of M is a matrix norm, the Ky Fan k-norm of M.
The first of the Ky Fan norms, the Ky Fan 1-norm is the same as the operator norm of M as a linear operator with
respect to the Euclidean norms of Km and Kn. In other words, the Ky Fan 1-norm is the operator norm induced by the
standard l2 Euclidean inner product. For this reason, it is also called the operator 2-norm. One can easily verify the
relationship between the Ky Fan 1-norm and singular values. It is true in general, for a bounded operator M on
(possibly infinite-dimensional) Hilbert spaces, that

||M|| = ||M* M||^(1/2).

But, in the matrix case, (M* M)^(1/2) is a normal matrix, so ||M* M||^(1/2) is the largest eigenvalue of (M* M)^(1/2), i.e. the largest
singular value of M.
The last of the Ky Fan norms, the sum of all singular values, is the trace norm (also known as the 'nuclear norm'),
defined by ||M|| = Tr[(M*M)½] (the eigenvalues of M* M are the squares of the singular values).
Hilbert–Schmidt norm
The singular values are related to another norm on the space of operators. Consider the Hilbert–Schmidt inner
product on the n × n matrices, defined by ⟨M, N⟩ = Tr(N* M). So the induced norm is
||M|| = ⟨M, M⟩^(1/2) = (Tr M* M)^(1/2). Since the trace is invariant under unitary equivalence, this shows

||M|| = (Σi σi²)^(1/2),

where the σi are the singular values of M. This is called the Frobenius norm, Schatten 2-norm, or Hilbert–Schmidt
norm of M. Direct calculation shows that the Frobenius norm of M = (mij) coincides with (Σij |mij|²)^(1/2).
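The norm identities of this section are all easy to confirm numerically (a NumPy sketch on a random test matrix; 'nuc' denotes the nuclear/trace norm in numpy.linalg.norm):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.standard_normal((5, 4))
s = np.linalg.svd(M, compute_uv=False)  # singular values only, descending

# Operator 2-norm (Ky Fan 1-norm) is the largest singular value
assert np.isclose(np.linalg.norm(M, 2), s[0])
# Frobenius / Hilbert-Schmidt norm is the l2 norm of the singular values
assert np.isclose(np.linalg.norm(M, 'fro'), np.sqrt(np.sum(s ** 2)))
# Trace (nuclear) norm is the sum of all singular values
assert np.isclose(np.linalg.norm(M, 'nuc'), np.sum(s))
```
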
Tensor SVD
Unfortunately, the problem of finding a low-rank approximation to a tensor is ill-posed: a best possible solution does
not exist in general; instead, there may only be a sequence of better and better approximations whose individual
terms grow without bound. But in spite of this, there are several ways of attempting this decomposition. There exist two types of
tensor decompositions which generalise SVD to multi-way arrays. One decomposition decomposes a tensor into a
sum of rank-1 tensors, see Candecomp-PARAFAC (CP) algorithm. The CP algorithm should not be confused with a
rank-R decomposition but, for a given N, it decomposes a tensor into a sum of N rank-1 tensors that optimally fit the
original tensor. The second type of decomposition computes the orthonormal subspaces associated with the different
axes or modes of a tensor (orthonormal row space, column space, fiber space, etc.). This decomposition is referred to
in the literature as the Tucker3/TuckerM, M-mode SVD, multilinear SVD and sometimes referred to as a
higher-order SVD (HOSVD). In addition, multilinear principal component analysis in multilinear subspace learning
involves the same mathematical operations as Tucker decomposition, being used in a different context of
dimensionality reduction.
is a unitary operator.
As for matrices, the singular value factorization is equivalent to the polar decomposition for operators: we can
simply write
where the series converges in the norm topology on H. Notice how this resembles the expression from the finite
dimensional case. The σi 's are called the singular values of M. {U ei} and {V ei} can be considered the left- and
right-singular vectors of M respectively.
Compact operators on a Hilbert space are the closure of finite-rank operators in the uniform operator topology. The
above series expression gives an explicit such representation. An immediate consequence of this is:
Theorem M is compact if and only if M*M is compact.
History
The singular value decomposition was originally developed by differential geometers, who wished to determine
whether a real bilinear form could be made equal to another by independent orthogonal transformations of the two
spaces it acts on. Eugenio Beltrami and Camille Jordan discovered independently, in 1873 and 1874 respectively,
that the singular values of the bilinear forms, represented as a matrix, form a complete set of invariants for bilinear
forms under orthogonal substitutions. James Joseph Sylvester also arrived at the singular value decomposition for
real square matrices in 1889, apparently independent of both Beltrami and Jordan. Sylvester called the singular
values the canonical multipliers of the matrix A. The fourth mathematician to discover the singular value
decomposition independently is Autonne in 1915, who arrived at it via the polar decomposition. The first proof of
the singular value decomposition for rectangular and complex matrices seems to be by Carl Eckart and Gale Young
in 1936;[6] they saw it as a generalization of the principal axis transformation for Hermitian matrices.
In 1907, Erhard Schmidt defined an analog of singular values for integral operators (which are compact, under some
weak technical assumptions); it seems he was unaware of the parallel work on singular values of finite matrices. The
theory was further developed by Émile Picard in 1910, who was the first to call the numbers singular values (or
rather, valeurs singulières).
Practical methods for computing the SVD date back to Kogbetliantz in 1954 and 1955 and Hestenes in 1958,[7]
whose methods closely resemble the Jacobi eigenvalue algorithm, which uses plane rotations or Givens rotations. However, these
were replaced by the method of Gene Golub and William Kahan published in 1965,[8] which uses Householder
transformations or reflections. In 1970, Golub and Christian Reinsch[9] published a variant of the Golub/Kahan
algorithm that is still the one most used today.
Notes
[1] DeAngelis GC, Ohzawa I, Freeman RD (October 1995). "Receptive-field dynamics in the central visual pathways"
(http://linkinghub.elsevier.com/retrieve/pii/0166-2236(95)94496-R). Trends Neurosci. 18 (10): 451–8. doi:10.1016/0166-2236(95)94496-R. PMID 8545912.
[2] Depireux DA, Simon JZ, Klein DJ, Shamma SA (March 2001). "Spectro-temporal response field characterization with dynamic ripples in
ferret primary auditory cortex" (http://jn.physiology.org/cgi/pmidlookup?view=long&pmid=11247991). J. Neurophysiol. 85 (3):
1220–34. PMID 11247991.
[3] The Singular Value Decomposition in Symmetric (Lowdin) Orthogonalization and Data Compression
(http://www.wou.edu/~beavers/Talks/Willamette1106.pdf)
[4] Netlib.org (http://www.netlib.org/lapack/double/dbdsqr.f)
[5] Netlib.org (http://www.netlib.org/lapack/double/dgesvd.f)
[6] Eckart, C.; Young, G. (1936). "The approximation of one matrix by another of lower rank". Psychometrika 1 (3): 211–8.
doi:10.1007/BF02288367.
[7] Hestenes, M. R. (1958). "Inversion of Matrices by Biorthogonalization and Related Results". Journal of the Society for Industrial and Applied
Mathematics 6 (1): 51–90. doi:10.1137/0106005. JSTOR 2098862. MR0092215.
[8] Golub, G. H.; Kahan, W. (1965). "Calculating the singular values and pseudo-inverse of a matrix". Journal of the Society for Industrial and
Applied Mathematics: Series B, Numerical Analysis 2 (2): 205–224. doi:10.1137/0702016. JSTOR 2949777. MR0183105.
[9] Golub, G. H.; Reinsch, C. (1970). "Singular value decomposition and least squares solutions". Numerische Mathematik 14 (5): 403–420.
doi:10.1007/BF02163027. MR1553974.
References
• Trefethen, Lloyd N.; Bau III, David (1997). Numerical linear algebra. Philadelphia: Society for Industrial and
Applied Mathematics. ISBN 978-0-89871-361-9.
• Demmel, James; Kahan, William (1990). "Accurate singular values of bidiagonal matrices". Society for Industrial
and Applied Mathematics. Journal on Scientific and Statistical Computing 11 (5): 873–912.
doi:10.1137/0911052.
• Golub, Gene H.; Kahan, William (1965). "Calculating the singular values and pseudo-inverse of a matrix".
Journal of the Society for Industrial and Applied Mathematics: Series B, Numerical Analysis 2 (2): 205–224.
doi:10.1137/0702016. JSTOR 2949777.
• Golub, Gene H.; Van Loan, Charles F. (1996). Matrix Computations (3rd ed.). Johns Hopkins.
ISBN 978-0-8018-5414-9.
• GSL Team (2007). "§13.4 Singular Value Decomposition" (http://www.gnu.org/software/gsl/manual/
html_node/Singular-Value-Decomposition.html). GNU Scientific Library. Reference Manual.
• Halldor, Bjornsson and Venegas, Silvia A. (1997). "A manual for EOF and SVD analyses of climate data" (http://
www.vedur.is/~halldor/TEXT/eofsvd.html). McGill University, CCGCR Report No. 97-1, Montréal, Québec,
52pp.
• Hansen, P. C. (1987). "The truncated SVD as a method for regularization". BIT 27: 534–553.
doi:10.1007/BF01937276.
• Horn, Roger A.; Johnson, Charles R. (1985). "Section 7.3". Matrix Analysis. Cambridge University Press.
ISBN 0-521-38632-2.
• Horn, Roger A.; Johnson, Charles R. (1991). "Chapter 3". Topics in Matrix Analysis. Cambridge University Press.
ISBN 0-521-46713-6.
• Samet, H. (2006). Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann.
ISBN 0-12-369446-9.
• Strang G. (1998). "Section 6.7". Introduction to Linear Algebra (3rd ed.). Wellesley-Cambridge Press.
ISBN 0-9614088-5-5.
• Stewart, G. W. (1993). "On the Early History of the Singular Value Decomposition" (http://citeseer.ist.psu.
edu/stewart92early.html). SIAM Review 35 (4): 551–566. doi:10.1137/1035134.
• Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha (2003). "Singular value decomposition and principal
component analysis" (http://public.lanl.gov/mewall/kluwer2002.html). In D.P. Berrar, W. Dubitzky, M.
Granzow. A Practical Approach to Microarray Data Analysis. Norwell, MA: Kluwer. pp. 91–109.
• Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007), "Section 2.6" (http://apps.nrbook.com/
empanel/index.html?pg=65), Numerical Recipes: The Art of Scientific Computing (3rd ed.), New York:
Cambridge University Press, ISBN 978-0-521-88068-8
External links
Implementations
Songs
• It Had To Be U (http://www.youtube.com/StatisticalSongs#p/u/4/JEYLfIVvR9I) is a song, written by
Michael Greenacre, about the singular value decomposition, explaining its definition and role in statistical
dimension reduction. It was first performed at the joint meetings of the 9th Tartu Conference on Multivariate
Statistics and 20th International Workshop on Matrices and Statistics, in Tartu, Estonia, June 2011.
Stein's method
Stein's method is a general method in probability theory to obtain bounds on the distance between two probability
distributions with respect to a probability metric. It was introduced by Charles Stein, who first published it in 1972,[1]
to obtain a bound between the distribution of a sum of m-dependent random variables and a standard
normal distribution in the Kolmogorov (uniform) metric, and hence to prove not only a central limit theorem but also
bounds on the rate of convergence for the given metric.
History
At the end of the 1960s, unsatisfied with the by-then known proofs of a specific central limit theorem, Charles Stein
developed a new way of proving the theorem for his statistics lecture.[2] The seminal paper[1] was presented in 1970
at the sixth Berkeley Symposium and published in the corresponding proceedings.
Later, his Ph.D. student Louis Chen Hsiao Yun modified the method so as to obtain approximation results for the
Poisson distribution;[3] the method is therefore often referred to as the Stein–Chen method. Whereas the new
method received only moderate attention in the 1970s, it underwent major development in the 1980s, when many important
contributions were made on which today's view of the method is largely based. Probably the most important
contributions are the monograph by Stein (1986), where he presents his view of the method and the concept of
auxiliary randomisation, in particular using exchangeable pairs, and the articles by Barbour (1988) and Götze
(1991), who introduced the so-called generator interpretation, which made it possible to easily adapt the method to
many other probability distributions. An important contribution was also an article by Bolthausen (1984) on a
long-standing open problem around the so-called combinatorial central limit theorem, which surely helped the
method to become widely known.
In the 1990s the method was adapted to a variety of distributions, such as Gaussian processes by Barbour (1990), the
binomial distribution by Ehm (1991), Poisson processes by Barbour and Brown (1992), the Gamma distribution by
Luk (1994), and many others.
Probability metrics
Stein's method is a way to bound the distance between two probability distributions in a specific probability metric. To be
tractable with the method, the metric must be given in the form

(1.1)   d(P, Q) = sup over h ∈ H of | E h(W) − E h(Y) |.

Here, P and Q are probability measures on a measurable space X, W and Y are random variables with
distribution P and Q respectively, E is the usual expectation operator and H is a set of functions from X to
the real numbers. This set has to be large enough, so that the above definition indeed yields a metric. Important
examples are the total variation metric, where we let H consist of all the indicator functions of measurable sets, the
Kolmogorov (uniform) metric for probability measures on the real numbers, where we consider all the half-line
indicator functions, and the Lipschitz (first-order Wasserstein; Kantorovich) metric, where the underlying space X is
itself a metric space and we take the set H to be all Lipschitz-continuous functions with Lipschitz constant 1.
However, note that not every metric can be represented in the form (1.1).
In what follows, we think of P as a complicated distribution (e.g. the distribution of a sum of dependent random
variables) which we want to approximate by a much simpler and tractable distribution Q (e.g. the standard normal
distribution, to obtain a central limit theorem).
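As a concrete, illustrative instance of a metric of the form (1.1) (the example is not from the text), the following sketch computes the Kolmogorov (uniform) distance between a standardized Binomial(30, 1/2), i.e. a standardized sum of independent indicators, and the standard normal distribution, taking the supremum of |E h(W) − E h(Y)| over the half-line indicators h:

```python
import math

# Kolmogorov metric d_K(P, Q) = sup_x |P(W <= x) - Q(Y <= x)|, i.e. the
# metric (1.1) with H the set of half-line indicator functions.
n, p = 30, 0.5
mean, sd = n * p, math.sqrt(n * p * (1 - p))

def binom_cdf(k):
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# The supremum is attained at the atoms of the binomial (from either side).
d_K = max(
    max(abs(binom_cdf(k) - normal_cdf((k - mean) / sd)),
        abs((binom_cdf(k - 1) if k > 0 else 0.0) - normal_cdf((k - mean) / sd)))
    for k in range(n + 1)
)
assert 0 < d_K < 0.1  # the normal approximation is already close at n = 30
```
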
The Stein operator
We now seek an operator A acting on a class of functions which characterises the distribution Q, in the sense that
E (Af)(Y) = 0 for all f in the class if and only if Y has distribution Q. We call such an operator the Stein operator.
For the standard normal distribution, Stein's lemma yields exactly such an operator:

(2.3)   E ( f′(Y) − Y f(Y) ) = 0 for all f in a suitable class of absolutely continuous functions
        if and only if Y has the standard normal distribution.

We note that there are in general infinitely many such operators and it still remains an open question which one to
choose. However, it seems that for many distributions there is a particularly good one, like (2.3) for the normal
distribution.
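The characterisation (2.3) can be sanity-checked by Monte Carlo simulation (an illustrative sketch; the test function f(x) = sin(x) is an arbitrary choice, not one used in the text):

```python
import math
import random

# Numerical check of the normal Stein identity E[f'(Z) - Z f(Z)] = 0
# for standard normal Z, with the arbitrary test function f(x) = sin(x).
random.seed(0)
N = 200_000
total = 0.0
for _ in range(N):
    z = random.gauss(0.0, 1.0)
    total += math.cos(z) - z * math.sin(z)  # f'(z) - z f(z)
mean = total / N
assert abs(mean) < 0.02  # zero, up to Monte Carlo error
```
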
There are different ways to find Stein operators, but by far the most important one is via generators. This approach
was, as already mentioned, introduced by Barbour and Götze. Assume that Z = (Zt) for t ≥ 0 is a (homogeneous)
continuous-time Markov process taking values in X. If Z has the stationary distribution Q, it is easy to see that,
if A is the generator of Z, we have E (Af)(Y) = 0 for a large set of functions f. Thus, generators are
natural candidates for Stein operators, and this approach will also help us for later computations.
The Stein equation
P is close to Q with respect to d if and only if the right-hand side of (1.1) is small for every h ∈ H. Using the Stein
operator, one therefore tries to find, for each h, a function f with

(3.1)   E h(W) − E h(Y) = E (Af)(W),

so that the behaviour of the right-hand side is reproduced by the operator A and the function f. However, this equation is too
general. We solve instead the more specific equation

(3.2)   (Af)(x) = h(x) − E h(Y),

which is called the Stein equation. Replacing x by W and taking expectation with respect to W, we are back to (3.1),
which is what we effectively want. Now all the effort is worth only if the left-hand side of (3.1) is easier to bound
than the right-hand side. This is, surprisingly, often the case.
If Q is the standard normal distribution and we use (2.3), the corresponding Stein equation is

(4.1)   f′(x) − x f(x) = h(x) − E h(Y),

which is just an ordinary differential equation.
Generator method. If A is the generator of a Markov process Z as explained before, we can give a general
solution to (3.2):

(4.2)   f(x) = − ∫ from 0 to ∞ of [ E^x h(Zt) − E h(Y) ] dt,

where E^x denotes expectation with respect to the process Z being started in x. However, one still has to prove
that the solution (4.2) exists for all desired functions h.
where the last bound is of course only applicable if h is differentiable (or at least Lipschitz-continuous, which, for
example, is not the case if we regard the total variation metric or the Kolmogorov metric!). As the standard normal
distribution has no extra parameters, in this specific case the constants are free of additional parameters.
Note that, up to this point, we did not make use of the random variable W. So, the steps up to here in general have
to be calculated only once for a specific combination of distribution Q, metric d and Stein operator A.
However, if we have bounds in the general form (5.1), we usually are able to treat many probability metrics together.
Furthermore, as there is often a particularly 'good' Stein operator for a distribution (e.g., no operator other than (2.3) has
been used for the standard normal distribution so far), one can often just start with the next step below, if bounds
of the form (5.1) are already available (which is the case for many distributions).
To this end, assume that W = Σi Xi is a sum of random variables such that E W = 0 and Var W = 1.
Note that, if we follow this line of argument, we can bound (1.1) only for functions h where ||h′|| is bounded,
because of the third inequality of (5.2) (and in fact, if h has discontinuities, so will f′). To obtain a bound similar
to (6.1) for metrics whose test functions need not be differentiable (such as the Kolmogorov metric), the argument is much more involved and the
result is not as simple as (6.1); however, it can be done.
Theorem A. If W is as described above, we have for the Lipschitz metric that
Proof. Recall that the Lipschitz metric is of the form (1.1) where the functions h are Lipschitz-continuous with
Lipschitz constant 1, thus ||h′|| ≤ 1. Combining this with (6.1) and the last bound in (5.2) proves the theorem.
Thus, roughly speaking, we have proved that, to calculate the Lipschitz distance between a sum W with local
dependence structure and a standard normal distribution, we only need to know the third moments of the Xi and the
size of the dependence neighborhoods.
Notes
[1] Stein, C. (1972). "A bound for the error in the normal approximation to the distribution of a sum of dependent random variables"
(http://projecteuclid.org/euclid.bsmsp/1200514239). Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability:
583–602. MR402873. Zbl 0278.60026.
[2] Charles Stein: The Invariant, the Direct and the "Pretentious" (http://www.ims.nus.edu.sg/imprints/interviews/CharlesStein.pdf).
Interview given in 2003 in Singapore.
[3] Chen, L.H.Y. (1975). "Poisson approximation for dependent trials". Annals of Probability 3 (3): 534–545. doi:10.1214/aop/1176996359.
JSTOR 2959474. MR428387. Zbl 0335.60016.
References
• Barbour, A. D. (1988). "Stein's method and Poisson process convergence". J. Appl. Probab. (Applied Probability
Trust) 25A: 175–184. doi:10.2307/3214155. JSTOR 3214155.
• Barbour, A. D. (1990). "Stein's method for diffusion approximations". Probab. Theory Related Fields 84 (3):
297–322. doi:10.1007/BF01197887.
• Barbour, A. D. and Brown, T. C. (1992). "Stein's method and point process approximation". Stochastic Process.
Appl. 43 (1): 9–31. doi:10.1016/0304-4149(92)90073-Y.
• Bolthausen, E. (1984). "An estimate of the remainder in a combinatorial central limit theorem". Z. Wahrsch.
Verw. Gebiete 66 (3): 379–386. doi:10.1007/BF00533704.
• Ehm, W. (1991). "Binomial approximation to the Poisson binomial distribution". Statist. Probab. Lett. 11 (1):
7–16. doi:10.1016/0167-7152(91)90170-V.
• Götze, F. (1991). "On the rate of convergence in the multivariate CLT". Ann. Probab. 19 (2): 724–739.
doi:10.1214/aop/1176990448.
• Lindeberg, J. W. (1922). "Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechung".
Math. Z. 15 (1): 211–225. doi:10.1007/BF01494395.
• Luk, H. M. (1994). Stein's method for the gamma distribution and related statistical applications. Dissertation.
• Stein, C. (1986). Approximate computation of expectations. Institute of Mathematical Statistics Lecture Notes,
Monograph Series, 7. ISBN 0-940600-08-0.
• Tikhomirov, A. N. (1980). "Convergence rate in the central limit theorem for weakly dependent random
variables". Teor. Veroyatnost. I Primenen. 25: 800–818. English translation in Theory Probab. Appl. 25
(1980–81): 790–809.
Literature
The following text is advanced, and gives a comprehensive overview of the normal case
• Chen, L.H.Y., Goldstein, L., and Shao, Q.M (2011). Normal approximation by Stein's method.
www.springer.com. ISBN 978-3-642-15006-7.
Another advanced book, but having some introductory character, is
• ed. Barbour, A.D. and Chen, L.H.Y. (2005). An introduction to Stein's method. Lecture Notes Series, Institute for
Mathematical Sciences, National University of Singapore. 4. Singapore University Press. ISBN 981-256-280-X.
A standard reference is the book by Stein,
• Stein, C. (1986). Approximate computation of expectations. Institute of Mathematical Statistics Lecture Notes,
Monograph Series, 7. Hayward, Calif.: Institute of Mathematical Statistics. ISBN 0-940600-08-0.
which contains a lot of interesting material, but may be a little hard to understand at first reading.
Although the method is by now several decades old, few standard introductory books about Stein's method are
available. The following recent textbook has a chapter (Chapter 2) devoted to introducing Stein's method:
• Ross, Sheldon and Peköz, Erol (2007). A second course in probability. www.ProbabilityBookstore.com.
ISBN 978-0-9795704-0-7.
Although the book
• Barbour, A. D. and Holst, L. and Janson, S. (1992). Poisson approximation. Oxford Studies in Probability. 2. The
Clarendon Press, Oxford University Press. ISBN 0-19-852235-5.
is in large part about Poisson approximation, it nevertheless contains a lot of information about the generator
approach, in particular in the context of Poisson process approximation.
Stirling's approximation
In mathematics, Stirling's approximation (or Stirling's formula) is an approximation for large factorials. It is named after James Stirling.
The formula as typically used in applications is

\ln n! = n \ln n - n + O(\ln n)

(in Big-O notation). The next term in the O(\ln n) is \tfrac{1}{2}\ln(2\pi n); a more precise variant of the formula is therefore often written

n! \sim \sqrt{2\pi n}\left(\frac{n}{e}\right)^{n}.
[Figure: The ratio of ln n! to (n ln n − n) approaches unity as n increases.]
Sometimes, bounds for n! rather than asymptotics are required: one has, for all positive integers n,

\sqrt{2\pi}\; n^{n+1/2} e^{-n} \;\le\; n! \;\le\; e\; n^{n+1/2} e^{-n},

so for all n \ge 1 the ratio n!\,/\,(n^{n+1/2} e^{-n}) is always between \sqrt{2\pi} \approx 2.5066 and e \approx 2.7183, e.g. between 2.5 and 2.8.
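These bounds are easy to check numerically. The sketch below uses only the Python standard library (`math.lgamma` gives ln Γ(n + 1) = ln n!, so the ratio can be computed in log space without overflow); the function name is of course just illustrative:

```python
import math

def stirling_ratio(n):
    """Ratio n! / (n^(n + 1/2) * e^(-n)), computed in log space to avoid overflow."""
    log_ratio = math.lgamma(n + 1) - ((n + 0.5) * math.log(n) - n)
    return math.exp(log_ratio)

# The ratio always lies between sqrt(2*pi) ~ 2.5066 and e ~ 2.7183,
# and tends to sqrt(2*pi) from above as n grows.
for n in (1, 2, 5, 10, 100, 1000):
    r = stirling_ratio(n)
    assert math.sqrt(2 * math.pi) - 1e-12 <= r <= math.e + 1e-12, (n, r)
```

At n = 1 the ratio is exactly e, and it decreases monotonically toward √(2π) as n grows.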
Derivation
The formula, together with precise estimates of its error, can be derived as follows. Instead of approximating n!, one considers its natural logarithm, as this is a slowly varying function:

\ln n! = \ln 1 + \ln 2 + \cdots + \ln n.

Approximating the sum by an integral and correcting the approximation yields, by the Euler–Maclaurin formula:

\ln n! - \tfrac{1}{2}\ln n = n\ln n - n + 1 + \sum_{k=2}^{m} \frac{(-1)^{k} B_{k}}{k(k-1)}\left(\frac{1}{n^{k-1}} - 1\right) + R_{m,n},

where B_k is a Bernoulli number and R_{m,n} is the remainder term in the Euler–Maclaurin formula. Take limits to find that

\lim_{n\to\infty}\left(\ln n! - n\ln n + n - \tfrac{1}{2}\ln n\right) = 1 - \sum_{k=2}^{m}\frac{(-1)^{k} B_{k}}{k(k-1)} + \lim_{n\to\infty} R_{m,n}.

Denote this limit by y. Because the remainder R_{m,n} in the Euler–Maclaurin formula satisfies

R_{m,n} = \lim_{n\to\infty} R_{m,n} + O\!\left(\frac{1}{n^{2m-1}}\right),
where we use Big-O notation, combining the equations above yields the approximation formula in its logarithmic form:

\ln n! = n\ln n - n + \tfrac{1}{2}\ln n + y + \sum_{k=2}^{m}\frac{(-1)^{k} B_{k}}{k(k-1)\,n^{k-1}} + O\!\left(\frac{1}{n^{2m-1}}\right).

Taking the exponential of both sides, and choosing any positive integer m, we get a formula involving an unknown quantity e^{y}. For m = 1, the formula is

n! = e^{y} \sqrt{n}\left(\frac{n}{e}\right)^{n}\left(1 + O\!\left(\frac{1}{n}\right)\right).

The quantity e^{y} can be found by taking the limit on both sides as n tends to infinity and using Wallis' product, which shows that e^{y} = \sqrt{2\pi}. Therefore, we get Stirling's formula:

n! = \sqrt{2\pi n}\left(\frac{n}{e}\right)^{n}\left(1 + O\!\left(\frac{1}{n}\right)\right).
The formula may also be obtained by repeated integration by parts, and the leading term can be found through Laplace's method. Stirling's formula, without the factor \sqrt{2\pi n} that is often irrelevant in applications, can be quickly obtained by approximating the sum

\ln n! = \sum_{j=1}^{n} \ln j

with an integral:

\sum_{j=1}^{n} \ln j \approx \int_{1}^{n} \ln x \, dx = n\ln n - n + 1.
Speed of convergence and error estimates
More precisely, Stirling's formula is the first approximation to the series (now called Stirling's series)

\ln n! \sim n\ln n - n + \tfrac{1}{2}\ln(2\pi n) + \frac{1}{12n} - \frac{1}{360n^{3}} + \frac{1}{1260n^{5}} - \cdots.

The first graph in this section shows the relative error vs. n, for truncations after 1 through all 5 of the terms listed above.
As n → ∞, the error in the truncated series is asymptotically equal to the first omitted term. Stirling's series is a typical example of an asymptotic expansion: it is known that the error in truncating the series is always of the same sign and at most the same magnitude as the first omitted term.
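This sign-and-magnitude behaviour can be observed directly. The sketch below (standard library only) compares ln n! with truncations of Stirling's series using the first correction terms 1/(12n) and −1/(360n³) quoted above; the helper name is illustrative:

```python
import math

CORRECTIONS = [lambda n: 1 / (12 * n),
               lambda n: -1 / (360 * n**3),
               lambda n: 1 / (1260 * n**5)]

def stirling_series(n, m):
    """ln n! approximated by Stirling's series with the first m correction terms."""
    s = n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)
    return s + sum(c(n) for c in CORRECTIONS[:m])

n = 10.0
exact = math.lgamma(n + 1)  # exact ln 10!
for m in range(3):
    err = exact - stirling_series(n, m)
    first_omitted = CORRECTIONS[m](n)
    # The truncation error has the sign of the first omitted term
    # and is at most that term in magnitude.
    assert err * first_omitted > 0
    assert abs(err) <= abs(first_omitted)
```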
Stirling's formula may also be applied to the Gamma function, since n! = \Pi(n) = \Gamma(n+1) for all positive integers. However, the Pi function, unlike the factorial, is more broadly defined for all complex numbers other than non-positive integers; nevertheless, Stirling's formula may still be applied. If \Re(z) > 0 then

\ln \Gamma(z) \sim \left(z - \tfrac{1}{2}\right)\ln z - z + \tfrac{1}{2}\ln(2\pi) + \sum_{n=1}^{\infty} \frac{B_{2n}}{2n(2n-1)\,z^{2n-1}},

where B_{2n} is the 2n-th Bernoulli number (note that the infinite sum is not convergent, so this formula is just an asymptotic expansion). The formula is valid for z large enough in absolute value when |\arg z| < \pi - \varepsilon, where ε is positive, with an error term of O\!\left(z^{-2m-1}\right) when the first m terms are used. The corresponding approximation may now be written:

\Gamma(z) = \sqrt{\frac{2\pi}{z}}\left(\frac{z}{e}\right)^{z}\left(1 + O\!\left(\frac{1}{z}\right)\right).
A convergent version of Stirling's formula can be obtained by expressing the error term as a convergent series of inverted rising factorials whose coefficients involve the Stirling numbers of the first kind. From this we obtain a version of Stirling's series that converges for \Re(z) > 0.

Versions suitable for calculators
The approximation

\Gamma(z) \approx \sqrt{\frac{2\pi}{z}}\left(\frac{z}{e}\sqrt{z \sinh\frac{1}{z} + \frac{1}{810 z^{6}}}\right)^{z},

or equivalently,

2\ln\Gamma(z) \approx \ln(2\pi) - \ln z + z\left(2\ln z + \ln\!\left(z \sinh\frac{1}{z} + \frac{1}{810 z^{6}}\right) - 2\right),

can be obtained by rearranging Stirling's extended formula and observing a coincidence between the resultant power series and the Taylor series expansion of the hyperbolic sine function. This approximation is good to more than 8 decimal digits for z with a real part greater than 8. Robert H. Windschitl suggested it in 2002 for computing the Gamma function with fair accuracy on calculators with limited program or register memory.[2]
Gergő Nemes proposed in 2007 an approximation which gives the same number of exact digits as the Windschitl approximation but is much simpler:[3]

\Gamma(z) \approx \sqrt{\frac{2\pi}{z}}\left(\frac{1}{e}\left(z + \frac{1}{12z - \frac{1}{10z}}\right)\right)^{z},

or equivalently,

\ln\Gamma(z) \approx \tfrac{1}{2}\left(\ln(2\pi) - \ln z\right) + z\left(\ln\!\left(z + \frac{1}{12z - \frac{1}{10z}}\right) - 1\right).
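Both calculator-oriented approximations can be tested against a library Gamma function. A sketch using only the Python standard library (`math.gamma` as the reference; the tolerance 1e-7 is a deliberately loose check of the "about 8 digits" claim):

```python
import math

def gamma_windschitl(z):
    """Windschitl's approximation to Gamma(z), as stated above."""
    return math.sqrt(2 * math.pi / z) * (
        (z / math.e) * math.sqrt(z * math.sinh(1 / z) + 1 / (810 * z**6))
    ) ** z

def gamma_nemes(z):
    """Nemes's approximation to Gamma(z), as stated above."""
    return math.sqrt(2 * math.pi / z) * (
        (z + 1 / (12 * z - 1 / (10 * z))) / math.e
    ) ** z

for z in (10.0, 15.0, 20.0):
    exact = math.gamma(z)
    assert abs(gamma_windschitl(z) - exact) / exact < 1e-7
    assert abs(gamma_nemes(z) - exact) / exact < 1e-7
```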
An apparently superior approximation for \ln n! was also given by Srinivasa Ramanujan (Ramanujan 1988):

\ln n! \approx n\ln n - n + \tfrac{1}{6}\ln\!\left(8n^{3} + 4n^{2} + n + \tfrac{1}{30}\right) + \tfrac{1}{2}\ln\pi.
History
The formula was first discovered by Abraham de Moivre[4][5] in the form

n! \sim c\, n^{n+1/2} e^{-n} \quad \text{for some constant } c.

De Moivre gave an approximate expression for the constant in terms of its natural logarithm. Stirling's contribution consisted of showing that the constant is \sqrt{2\pi}. The more precise versions are due to Jacques Binet.
Notes
[1] http://www.york.ac.uk/depts/maths/histstat/letter.pdf
[2] Toth, V. T. Programmable Calculators: Calculators and the Gamma Function (2006) (http://www.rskey.org/gamma.htm)
[3] Nemes, Gergő (2010), "New asymptotic expansion for the Gamma function", Archiv der Mathematik 95 (2): 161–169,
doi:10.1007/s00013-010-0146-9, ISSN 0003-889X.
[4] Le Cam, L. (1986), "The central limit theorem around 1935", Statistical Science 1 (1): 78–96 [p. 81], doi:10.1214/ss/1177013818, "The
result, obtained using a formula originally proved by de Moivre but now called Sterling's formula, occurs in his `Doctrine of Chances' of
1733.".
[5] Pearson, Karl, "Historical note on the origin of the normal curve of errors", Biometrika 16: 402–404 [p. 403], "I consider that the fact that
Stirling showed that De Moivre's arithmetical constant was \sqrt{2\pi} does not entitle him to claim the theorem, [...]"
References
• Abramowitz, M. & Stegun, I. (2002), Handbook of Mathematical Functions (http://www.math.hkbu.edu.hk/
support/aands/toc.htm)
• Nemes, G. (2010), "New asymptotic expansion for the Gamma function", Archiv der Mathematik 95 (2):
161–169, doi:10.1007/s00013-010-0146-9
• Paris, R. B. & Kaminsky, D. (2001), Asymptotics and the Mellin–Barnes Integrals, New York: Cambridge
University Press, ISBN 0-521-79001-8
• Whittaker, E. T. & Watson, G. N. (1996), A Course in Modern Analysis (4th ed.), New York: Cambridge
University Press, ISBN 0-521-58807-3
• Dan Romik, Stirling’s Approximation for n!: The Ultimate Short Proof?, The American Mathematical Monthly,
Vol. 107, No. 6 (Jun. – Jul., 2000), 556–557.
• Y.-C. Li, A Note on an Identity of the Gamma Function and Stirling's Formula, Real Analysis Exchange, Vol.
32(1), 2006/2007, pp. 267–272.
External links
• Peter Luschny, Approximation formulas for the factorial function n! (http://www.luschny.de/math/factorial/
approx/SimpleCases.html)
• Weisstein, Eric W., " Stirling's Approximation (http://mathworld.wolfram.com/StirlingsApproximation.html)"
from MathWorld.
• Stirling's approximation (http://planetmath.org/encyclopedia/StirlingsApproximation.html) at PlanetMath
Student's t-distribution
[Infobox: Student's t; plots of the probability density function and cumulative distribution function for several degrees of freedom \nu. Notation: ψ is the digamma function, B is the beta function. The moment-generating function (MGF) is undefined. The characteristic function (CF) is

\varphi(t) = \frac{K_{\nu/2}\!\left(\sqrt{\nu}\,|t|\right)\left(\sqrt{\nu}\,|t|\right)^{\nu/2}}{\Gamma(\nu/2)\,2^{\nu/2-1}} \quad \text{for } \nu > 0,

[1]
where K_{\nu/2}(x) is a modified Bessel function of the second kind.]
In probability and statistics, Student’s t-distribution (or simply the t-distribution) is a family of continuous
probability distributions that arises when estimating the mean of a normally distributed population in situations
where the sample size is small and the population standard deviation is unknown. It plays a role in a number of widely
used statistical analyses, including the Student’s t-test for assessing the statistical significance of the difference
between two sample means, the construction of confidence intervals for the difference between two population
means, and in linear regression analysis. The Student’s t-distribution also arises in the Bayesian analysis of data from
a normal family.
The t-distribution is symmetric and bell-shaped, like the normal distribution, but has heavier tails, meaning that it is
more prone to producing values that fall far from its mean. This makes it useful for understanding the statistical
behavior of certain types of ratios of random quantities, in which variation in the denominator is amplified and may
produce outlying values when the denominator of the ratio falls close to zero. The Student’s t-distribution is a special
case of the generalised hyperbolic distribution.
Definition
Student's t-distribution has the probability density function

f(t) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^{2}}{\nu}\right)^{-\frac{\nu+1}{2}},

where \nu is the number of degrees of freedom and \Gamma is the Gamma function. This may also be written as

f(t) = \frac{1}{\sqrt{\nu}\,B\!\left(\frac{1}{2}, \frac{\nu}{2}\right)} \left(1 + \frac{t^{2}}{\nu}\right)^{-\frac{\nu+1}{2}},

where B is the Beta function. For \nu even,

\frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} = \frac{(\nu-1)(\nu-3)\cdots 5\cdot 3}{2\sqrt{\nu}\,(\nu-2)(\nu-4)\cdots 4\cdot 2}.

For \nu odd,

\frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} = \frac{(\nu-1)(\nu-3)\cdots 4\cdot 2}{\pi\sqrt{\nu}\,(\nu-2)(\nu-4)\cdots 5\cdot 3}.
The overall shape of the probability density function of the t-distribution resembles the bell shape of a normally
distributed variable with mean 0 and variance 1, except that it is a bit lower and wider. As the number of degrees of
freedom grows, the t-distribution approaches the normal distribution with mean 0 and variance 1.
The following images show the density of the t-distribution for increasing values of \nu. The normal distribution is
shown as a blue line for comparison. Note that the t-distribution (red line) becomes closer to the normal distribution
as \nu increases.
[Figure: Density of the t-distribution (red) for 1, 2, 3, 5, 10, and 30 df compared to the standard normal distribution (blue); previous plots shown in green.]
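The density is straightforward to evaluate numerically. A sketch using only the Python standard library (`math.lgamma` for the Gamma factors; the function name is illustrative), checking the ν = 1 (Cauchy) value and the large-ν normal limit at t = 0:

```python
import math

def t_pdf(t, nu):
    """Density of Student's t-distribution with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log(1 + t * t / nu))

# nu = 1 is the Cauchy distribution: density 1/pi at t = 0.
assert abs(t_pdf(0.0, 1) - 1 / math.pi) < 1e-12

# For large nu the density approaches the standard normal, 1/sqrt(2*pi) at t = 0.
assert abs(t_pdf(0.0, 1e6) - 1 / math.sqrt(2 * math.pi)) < 1e-6
```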
Cumulative distribution function
The cumulative distribution function can be written in terms of I, the regularized incomplete beta function. For t > 0,

F(t) = 1 - \tfrac{1}{2}\, I_{x(t)}\!\left(\tfrac{\nu}{2}, \tfrac{1}{2}\right), \qquad \text{with } x(t) = \frac{\nu}{t^{2} + \nu}.

Other values would be obtained by symmetry. An alternative formula, valid for t^{2} < \nu, is[2]

F(t) = \frac{1}{2} + t\,\frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\pi\nu}\,\Gamma\!\left(\frac{\nu}{2}\right)}\; {}_{2}F_{1}\!\left(\tfrac{1}{2}, \tfrac{\nu+1}{2}; \tfrac{3}{2}; -\tfrac{t^{2}}{\nu}\right),

where {}_{2}F_{1} is a particular case of the hypergeometric function.
Special cases
Certain values of \nu give an especially simple form.
• \nu = 1
Distribution function:
F(t) = \frac{1}{2} + \frac{1}{\pi}\arctan t.
Density function:
f(t) = \frac{1}{\pi\left(1 + t^{2}\right)}
(see Cauchy distribution).
• \nu = 2
Density function:
f(t) = \frac{1}{\left(2 + t^{2}\right)^{3/2}}.
• \nu = 3
Density function:
f(t) = \frac{6\sqrt{3}}{\pi\left(3 + t^{2}\right)^{2}}.
• \nu = \infty
Density function:
f(t) = \frac{1}{\sqrt{2\pi}}\, e^{-t^{2}/2}
(see Normal distribution).
The t-distribution with n − 1 degrees of freedom is the sampling distribution of the t-value when the samples consist
of independent identically distributed observations from a normally distributed population.
Characterization
Student's t-distribution with \nu degrees of freedom can be defined as the distribution of the random variable T with

T = \frac{Z}{\sqrt{V/\nu}},

where
• Z is normally distributed with expected value 0 and variance 1;
• V has a chi-squared distribution with \nu ("nu") degrees of freedom;
• Z and V are independent.
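This characterization can be checked by simulation. A sketch using only the Python standard library (`random.gauss` for the normals; the chi-squared draw is built as a sum of squared normals; sample size and tolerances are illustrative choices):

```python
import math
import random

random.seed(0)

def t_sample(nu):
    """Draw T = Z / sqrt(V/nu), with Z standard normal and V chi-squared(nu)."""
    z = random.gauss(0, 1)
    v = sum(random.gauss(0, 1) ** 2 for _ in range(nu))
    return z / math.sqrt(v / nu)

nu = 10
samples = [t_sample(nu) for _ in range(50_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

# Theory (see Moments below): mean 0 and variance nu/(nu-2) = 1.25 for nu = 10.
assert abs(mean) < 0.03
assert abs(var - nu / (nu - 2)) < 0.05
```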
A different distribution is defined as that of the random variable defined, for a given constant μ, by (Z + \mu)\big/\sqrt{V/\nu}.
This random variable has a noncentral t-distribution with noncentrality parameter μ. This distribution is important in
studies of the power of Student's t test.
Derivation
Suppose X_1, ..., X_n are independent values that are normally distributed with expected value μ and variance σ². Let

\bar{X}_n = \tfrac{1}{n}\left(X_1 + \cdots + X_n\right)

be the sample mean, and let

S_n^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^{2}

be an unbiased estimate of the variance from the sample. It can be shown that the random variable

V = (n-1)\,\frac{S_n^{2}}{\sigma^{2}}

has a chi-squared distribution with n − 1 degrees of freedom (by Cochran's theorem).[14] It is readily shown that the quantity

Z = \left(\bar{X}_n - \mu\right)\frac{\sqrt{n}}{\sigma}

is normally distributed with mean 0 and variance 1, since the sample mean \bar{X}_n is normally distributed with mean μ and variance σ²/n. Moreover, it is possible to show that these two random variables (the normally distributed Z and the chi-squared-distributed V) are independent. Consequently the pivotal quantity

T \equiv \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}},
which differs from Z in that the exact standard deviation σ is replaced by the random variable S_n, has a Student's t-distribution as defined above. Notice that the unknown population variance σ² does not appear in T, since it was in both the numerator and the denominator, so it canceled. Gosset intuitively obtained the probability density function stated above, with \nu equal to n − 1, and Fisher proved it in 1925.[15]
The distribution of the test statistic, T, depends on \nu, but not on μ or σ; the lack of dependence on μ and σ is what makes the t-distribution important in both theory and practice.
Properties
Moments
The raw moments of the t-distribution are

E[T^{k}] = \begin{cases}
0 & k \text{ odd},\quad 0 < k < \nu,\\
\dfrac{\Gamma\!\left(\frac{k+1}{2}\right)\Gamma\!\left(\frac{\nu-k}{2}\right)\nu^{k/2}}{\sqrt{\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} & k \text{ even},\quad 0 < k < \nu,\\
\text{undefined} & k \text{ odd},\quad 0 < \nu \le k,\\
\infty & k \text{ even},\quad 0 < \nu \le k.
\end{cases}

The distinction between "undefined" and "defined with the value of infinity" should be kept in mind. This is
equivalent to the distinction between the result of 0/0 vs. 1/0. Attempting to evaluate the odd moments in the cases
above listed as "undefined" results in the expression ∞ − ∞. Because the mean (first raw moment) is undefined
when \nu = 1 (equivalent to the Cauchy distribution), all of the central moments and standardized moments are
likewise undefined, including the variance, skewness and kurtosis.
The term for 0 < k < \nu, k even, may be simplified using the properties of the Gamma function to

E[T^{k}] = \nu^{k/2} \prod_{i=1}^{k/2} \frac{2i-1}{\nu-2i}, \qquad k \text{ even},\quad 0 < k < \nu.

For a t-distribution with \nu degrees of freedom, the expected value is 0, and its variance is \nu/(\nu-2) if \nu > 2.
The skewness is 0 if \nu > 3 and the excess kurtosis is 6/(\nu-4) if \nu > 4.
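The simplified product form of the even moments can be checked directly against the Gamma-function expression. A sketch using only the Python standard library (`math.lgamma` to keep the Gamma factors stable; function names are illustrative):

```python
import math

def raw_moment_gamma(k, nu):
    """E[T^k] for k even, 0 < k < nu, via the Gamma-function expression."""
    return math.exp(
        math.lgamma((k + 1) / 2) + math.lgamma((nu - k) / 2)
        - math.lgamma(nu / 2) - 0.5 * math.log(math.pi)
    ) * nu ** (k / 2)

def raw_moment_product(k, nu):
    """The simplified form: nu^(k/2) * prod_{i=1}^{k/2} (2i-1)/(nu-2i)."""
    out = nu ** (k / 2)
    for i in range(1, k // 2 + 1):
        out *= (2 * i - 1) / (nu - 2 * i)
    return out

for nu in (7, 10, 25):
    for k in (2, 4, 6):          # even moments with k < nu
        a = raw_moment_gamma(k, nu)
        b = raw_moment_product(k, nu)
        assert abs(a - b) / b < 1e-12
```

For k = 2 and ν = 10 both forms give the variance ν/(ν − 2) = 1.25.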
Relation to F distribution
• X = T^{2} has an F-distribution with (1, \nu) degrees of freedom if T has a Student's t-distribution with \nu degrees of freedom.
This distribution results from compounding a Gaussian distribution (normal distribution) with mean \mu and
unknown variance, with an inverse gamma distribution placed over the variance with parameters a = \nu/2 and
b = \nu\sigma^{2}/2. In other words, the random variable X is assumed to have a Gaussian distribution with an unknown
variance distributed as inverse gamma, and then the variance is marginalized out (integrated out). The reason for the
usefulness of this characterization is that the inverse gamma distribution is the conjugate prior distribution of the
variance of a Gaussian distribution. As a result, the non-standardized Student's t-distribution arises naturally in many
Bayesian inference problems. See below.
Equivalently, this distribution results from compounding a Gaussian distribution with a scaled-inverse-chi-squared
distribution with parameters \nu and \sigma^{2}. The scaled-inverse-chi-squared distribution is exactly the same distribution
as the inverse gamma distribution, but with a different parameterization, i.e. \nu = 2a,\ \sigma^{2} = b/a.
In terms of precision
An alternative parameterization in terms of precision λ (reciprocal of variance) arises from the relation \lambda = 1/\sigma^{2}.[19]
Then the density is defined by

p(x \mid \mu, \lambda, \nu) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \left(1 + \frac{\lambda (x-\mu)^{2}}{\nu}\right)^{-\frac{\nu+1}{2}}.
This distribution results from compounding a Gaussian distribution with mean \mu and unknown precision (the
reciprocal of the variance), with a gamma distribution placed over the precision with parameters a = \nu/2 and
b = \nu/(2\lambda). In other words, the random variable X is assumed to have a normal distribution with an unknown
precision distributed as gamma, and then this is marginalized over the gamma distribution.
Related distributions
Noncentral t-distribution
The noncentral t-distribution is a different way of generalizing the t-distribution to include a location parameter.
Unlike the nonstandardized t-distributions, the noncentral distributions are asymmetric (the median is not the same
as the mode).
Discrete Student's t-distribution
Here a, b, and k are parameters. This distribution arises from the construction of a system of discrete distributions
similar to that of the Pearson distributions for continuous distributions.[21]
Uses
Quite often, textbook problems will treat the population standard deviation as if it were known and thereby avoid the
need to use the Student's t-distribution. These problems are generally of two kinds: (1) those in which the sample
size is so large that one may treat a data-based estimate of the variance as if it were certain, and (2) those that
illustrate mathematical reasoning, in which the problem of estimating the standard deviation is temporarily ignored
because that is not the point that the author or instructor is then explaining.
Hypothesis testing
A number of statistics can be shown to have t-distributions for samples of moderate size under null hypotheses that
are of interest, so that the t-distribution forms the basis for significance tests. For example, the distribution of
Spearman's rank correlation coefficient ρ, in the null case (zero correlation), is well approximated by the t-distribution
for sample sizes above about 20.
Confidence intervals
Suppose the number A is so chosen that

\Pr(-A < T < A) = 0.9,

when T has a t-distribution with n − 1 degrees of freedom. By symmetry, this is the same as saying that A satisfies

\Pr(T < A) = 0.95,

so A is the "95th percentile" of this probability distribution. Then

\left[\bar{X}_n - A\,\frac{S_n}{\sqrt{n}},\ \bar{X}_n + A\,\frac{S_n}{\sqrt{n}}\right]

is a 90-percent confidence interval for μ. Therefore, if we find the mean of a set of observations that we can reasonably expect to have a normal distribution, we can use the t-distribution to examine whether the confidence limits on that mean include some theoretically predicted value, such as the value predicted on a null hypothesis.
It is this result that is used in the Student's t-tests: since the difference between the means of samples from two
normal distributions is itself distributed normally, the t-distribution can be used to examine whether that difference
can reasonably be supposed to be zero.
If the data are normally distributed, the one-sided (1 − a)-upper confidence limit (UCL) of the mean can be calculated using the following equation:

\mathrm{UCL}_{1-a} = \bar{X}_n + t_{a,n-1}\,\frac{S_n}{\sqrt{n}}.
The resulting UCL will be the greatest average value that will occur for a given confidence interval and population
size. In other words, \bar{X}_n being the mean of the set of observations, the probability that the mean of the distribution
is less than \mathrm{UCL}_{1-a} is equal to the confidence level 1 − a.
Prediction intervals
The t-distribution can be used to construct a prediction interval for an unobserved sample from a normal distribution
with unknown mean and variance.
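One standard form of such a prediction interval (not spelled out in the text above, so treat this as an illustrative sketch) is X̄ ± t·s·√(1 + 1/n), where t is the appropriate two-sided critical value for n − 1 degrees of freedom; the extra 1 under the square root accounts for the variability of the new observation itself. In Python (standard library only; the data are hypothetical):

```python
import math

def prediction_interval(data, t_crit):
    """Two-sided prediction interval for one future draw from the same
    normal population: mean +/- t_crit * s * sqrt(1 + 1/n)."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    half = t_crit * s * math.sqrt(1 + 1 / n)
    return mean - half, mean + half

# Hypothetical data; 2.262 is the two-sided 95% critical value for 9 df
# (row 9, "Two Sided" 95% column of the table below).
data = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.9, 5.1, 5.0, 5.0]
lo, hi = prediction_interval(data, 2.262)
```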
In Bayesian statistics
The Student's t-distribution, especially in its three-parameter (location-scale) version, arises frequently in Bayesian
statistics as a result of its connection with the normal distribution. Whenever the variance of a normally distributed
random variable is unknown and a conjugate prior placed over it that follows an inverse gamma distribution, the
resulting marginal distribution of the variable will follow a Student's t-distribution. Equivalent constructions with the
same results involve a conjugate scaled-inverse-chi-squared distribution over the variance, or a conjugate gamma
distribution over the precision. If an improper prior proportional to \sigma^{-2} is placed over the variance, the
t-distribution also arises. This is the case regardless of whether the mean of the normally distributed variable is
known, is unknown distributed according to a conjugate normally distributed prior, or is unknown distributed
according to an improper constant prior.
Related situations that also produce a t-distribution are:
• The marginal posterior distribution of the unknown mean of a normally distributed variable, with unknown prior
mean and variance following the above model.
• The prior predictive distribution and posterior predictive distribution of a new normally distributed data point
when a series of independent identically distributed normally distributed data points have been observed, with
prior mean and variance as in the above model.
Table of selected values
The following table lists values for t-distributions with \nu degrees of freedom for a range of one-sided or two-sided
critical regions. For an example of how to read the table, take the row beginning with 4 (so \nu = 4): the entry in the
95% one-sided column is "2.132". Then the probability that T is less than 2.132 is 95%, or Pr(−∞ < T < 2.132) = 0.95; this also means that
Pr(−2.132 < T < 2.132) = 0.9.
This can be calculated by the symmetry of the distribution,
Pr(T < −2.132) = Pr(T > 2.132) = 1 − 0.95 = 0.05,
and so
Pr(−2.132 < T < 2.132) = 1 − 2(0.05) = 0.9.
Note that the last row also gives critical points: a t-distribution with infinitely many degrees of freedom is a normal
distribution. (See Related distributions above).
The first column is the number of degrees of freedom.
One Sided 75% 80% 85% 90% 95% 97.5% 99% 99.5% 99.75% 99.9% 99.95%
Two Sided 50% 60% 70% 80% 90% 95% 98% 99% 99.5% 99.8% 99.9%
1 1.000 1.376 1.963 3.078 6.314 12.71 31.82 63.66 127.3 318.3 636.6
2 0.816 1.061 1.386 1.886 2.920 4.303 6.965 9.925 14.09 22.33 31.60
3 0.765 0.978 1.250 1.638 2.353 3.182 4.541 5.841 7.453 10.21 12.92
4 0.741 0.941 1.190 1.533 2.132 2.776 3.747 4.604 5.598 7.173 8.610
5 0.727 0.920 1.156 1.476 2.015 2.571 3.365 4.032 4.773 5.893 6.869
6 0.718 0.906 1.134 1.440 1.943 2.447 3.143 3.707 4.317 5.208 5.959
7 0.711 0.896 1.119 1.415 1.895 2.365 2.998 3.499 4.029 4.785 5.408
8 0.706 0.889 1.108 1.397 1.860 2.306 2.896 3.355 3.833 4.501 5.041
9 0.703 0.883 1.100 1.383 1.833 2.262 2.821 3.250 3.690 4.297 4.781
10 0.700 0.879 1.093 1.372 1.812 2.228 2.764 3.169 3.581 4.144 4.587
11 0.697 0.876 1.088 1.363 1.796 2.201 2.718 3.106 3.497 4.025 4.437
12 0.695 0.873 1.083 1.356 1.782 2.179 2.681 3.055 3.428 3.930 4.318
13 0.694 0.870 1.079 1.350 1.771 2.160 2.650 3.012 3.372 3.852 4.221
14 0.692 0.868 1.076 1.345 1.761 2.145 2.624 2.977 3.326 3.787 4.140
15 0.691 0.866 1.074 1.341 1.753 2.131 2.602 2.947 3.286 3.733 4.073
16 0.690 0.865 1.071 1.337 1.746 2.120 2.583 2.921 3.252 3.686 4.015
17 0.689 0.863 1.069 1.333 1.740 2.110 2.567 2.898 3.222 3.646 3.965
18 0.688 0.862 1.067 1.330 1.734 2.101 2.552 2.878 3.197 3.610 3.922
19 0.688 0.861 1.066 1.328 1.729 2.093 2.539 2.861 3.174 3.579 3.883
20 0.687 0.860 1.064 1.325 1.725 2.086 2.528 2.845 3.153 3.552 3.850
21 0.686 0.859 1.063 1.323 1.721 2.080 2.518 2.831 3.135 3.527 3.819
22 0.686 0.858 1.061 1.321 1.717 2.074 2.508 2.819 3.119 3.505 3.792
23 0.685 0.858 1.060 1.319 1.714 2.069 2.500 2.807 3.104 3.485 3.767
24 0.685 0.857 1.059 1.318 1.711 2.064 2.492 2.797 3.091 3.467 3.745
25 0.684 0.856 1.058 1.316 1.708 2.060 2.485 2.787 3.078 3.450 3.725
26 0.684 0.856 1.058 1.315 1.706 2.056 2.479 2.779 3.067 3.435 3.707
27 0.684 0.855 1.057 1.314 1.703 2.052 2.473 2.771 3.057 3.421 3.690
28 0.683 0.855 1.056 1.313 1.701 2.048 2.467 2.763 3.047 3.408 3.674
29 0.683 0.854 1.055 1.311 1.699 2.045 2.462 2.756 3.038 3.396 3.659
30 0.683 0.854 1.055 1.310 1.697 2.042 2.457 2.750 3.030 3.385 3.646
40 0.681 0.851 1.050 1.303 1.684 2.021 2.423 2.704 2.971 3.307 3.551
50 0.679 0.849 1.047 1.299 1.676 2.009 2.403 2.678 2.937 3.261 3.496
60 0.679 0.848 1.045 1.296 1.671 2.000 2.390 2.660 2.915 3.232 3.460
80 0.678 0.846 1.043 1.292 1.664 1.990 2.374 2.639 2.887 3.195 3.416
100 0.677 0.845 1.042 1.290 1.660 1.984 2.364 2.626 2.871 3.174 3.390
120 0.677 0.845 1.041 1.289 1.658 1.980 2.358 2.617 2.860 3.160 3.373
∞ 0.674 0.842 1.036 1.282 1.645 1.960 2.326 2.576 2.807 3.090 3.291
The number at the beginning of each row in the table above is \nu, which has been defined above as n − 1. The
percentage along the top is 100%(1 − α). The numbers in the main body of the table are t_{\alpha,\nu}. If a quantity T is
distributed as a Student's t-distribution with \nu degrees of freedom, then there is a probability 1 − α that T will be less
than t_{\alpha,\nu}. (Calculated as for a one-tailed or one-sided test, as opposed to a two-tailed test.)
For example, given a sample with a sample variance 2 and sample mean of 10, taken from a sample set of 11 (10
degrees of freedom), using the formula

\bar{X}_n \pm t_{\alpha,\nu}\,\frac{S_n}{\sqrt{n}},

we can determine that at 90% confidence, we have a true mean lying below

10 + 1.37218\,\sqrt{\frac{2}{11}} = 10.58510.

(In other words, on average, 90% of the times that an upper threshold is calculated by this method, this upper
threshold exceeds the true mean.) And, still at 90% confidence, we have a true mean lying over

10 - 1.37218\,\sqrt{\frac{2}{11}} = 9.41490.

(In other words, on average, 90% of the times that a lower threshold is calculated by this method, this lower
threshold lies below the true mean.) So at 80% confidence (calculated from 1 − 2 × (1 − 90%) = 80%), we have
a true mean lying within the interval

\left(10 - 1.37218\,\sqrt{\frac{2}{11}},\ 10 + 1.37218\,\sqrt{\frac{2}{11}}\right).

This is generally expressed in interval notation, e.g., for this case, at 80% confidence the true mean is within the
interval [9.41490, 10.58510].
(In other words, on average, 80% of the times that upper and lower thresholds are calculated by this method, the true
mean is both below the upper threshold and above the lower threshold. This is not the same thing as saying that there
is an 80% probability that the true mean lies between a particular pair of upper and lower thresholds that have been
calculated by this method—see confidence interval and prosecutor's fallacy.)
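The arithmetic of this example is easy to reproduce (a sketch; 1.37218 is a more precise value of the one-sided 90% critical point for 10 degrees of freedom, tabulated above as 1.372):

```python
import math

def t_confidence_interval(mean, sample_var, n, t_crit):
    """Two-sided interval: mean +/- t_crit * sqrt(sample_var / n)."""
    half = t_crit * math.sqrt(sample_var / n)
    return mean - half, mean + half

lo, hi = t_confidence_interval(10, 2, 11, 1.37218)
print(round(lo, 5), round(hi, 5))  # → 9.4149 10.5851
```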
For information on the inverse cumulative distribution function see quantile function.
Notes
[1] Hurst, Simon, The Characteristic Function of the Student-t Distribution (http://wwwmaths.anu.edu.au/research.reports/srr/95/044/),
Financial Mathematics Research Report No. FMRR006-95, Statistics Research Report No. SRR044-95
[2] Johnson, N.L., Kotz, S., Balakrishnan, N. (1995) Continuous Univariate Distributions, Volume 2, 2nd Edition. Wiley, ISBN 0-471-58494-0
(Chapter 28)
[3] Helmert, F. R. (1875). "Über die Bestimmung des wahrscheinlichen Fehlers aus einer endlichen Anzahl wahrer Beobachtungsfehler". Z.
Math. Phys., 20, 300-3.
[4] Helmert, F. R. (1876a). "Über die Wahrscheinlichkeit der Potenzsummen der Beobachtungsfehler und uber einige damit in Zusammenhang
stehende Fragen". Z. Math. Phys., 21, 192-218.
[5] Helmert, F. R. (1876b). "Die Genauigkeit der Formel von Peters zur Berechnung des wahrscheinlichen Beobachtungsfehlers director
Beobachtungen gleicher Genauigkeit", Astron. Nachr., 88, 113-32.
[6] Lüroth, J (1876). "Vergleichung von zwei Werten des wahrscheinlichen Fehlers". Astron. Nachr. 87 (14): 209–20.
Bibcode 1876AN.....87..209L. doi:10.1002/asna.18760871402.
[7] Pfanzagl, J.; Sheynin, O. (1996). "A forerunner of the t-distribution (Studies in the history of probability and statistics XLIV)"
(http://biomet.oxfordjournals.org/cgi/content/abstract/83/4/891). Biometrika 83 (4): 891–898. doi:10.1093/biomet/83.4.891. MR1766040.
[8] Sheynin, O (1995). "Helmert's work in the theory of errors". Arch. Hist. Ex. Sci. 49: 73–104. doi:10.1007/BF00374700.
[9] Student [William Sealy Gosset] (March 1908). "The probable error of a mean" (http://www.york.ac.uk/depts/maths/histstat/student.pdf).
Biometrika 6 (1): 1–25. doi:10.1093/biomet/6.1.1.
[10] Mortimer, Robert G. (2005) Mathematics for Physical Chemistry, Academic Press. 3rd edition. ISBN 0-12-508347-5 (page 326)
[11] Fisher, R. A. (1925). "Applications of "Student's" distribution" (http://digital.library.adelaide.edu.au/coll/special/fisher/43.pdf).
Metron 5: 90–104.
[12] Walpole, Ronald; Myers, Raymond; Myers, Sharon; Ye, Keying. (2002) Probability and Statistics for Engineers and Scientists. Pearson
Education, 7th edition, pg. 237 ISBN 81-7758-404-9
[13] Hogg & Craig (1978, Sections 4.4 and 4.8.)
[14] Cochran, W. G. (April 1934). "The distribution of quadratic forms in a normal system, with applications to the analysis of covariance".
Mathematical Proceedings of the Cambridge Philosophical Society 30 (2): 178–191. doi:10.1017/S0305004100016595.
[15] Fisher, R. A. (1925). "Applications of "Student's" distribution" (http://digital.library.adelaide.edu.au/coll/special/fisher/43.pdf).
Metron 5: 90–104.
[16] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http:/ / www. wise. xmu.
edu. cn/ Master/ Download/ . . \. . \UploadFiles\paper-masterdownload\2009519932327055475115776. pdf). Journal of Econometrics
(Elsevier): 219–230. . Retrieved 2011-06-02.
[17] Bailey, R. W. (1994). "Polar Generation of Random Variates with the t-Distribution". Mathematics of Computation 62 (206): 779–781.
doi:10.2307/2153537.
[18] Jackman, Simon (2009). Bayesian Analysis for the Social Sciences. Wiley.
[19] Bishop, C.M. (2006). Pattern recognition and machine learning. Springer.
[20] Ord, J.K. (1972) Families of Frequency Distributions, Griffin. ISBN 0-85264-137-0 (Table 5.1)
[21] Ord, J.K. (1972) Families of Frequency Distributions, Griffin. ISBN 0-85264-137-0 (Chapter 5)
[22] Lange, Kenneth L.; Little, Roderick J.A.; Taylor, Jeremy M.G. (1989). "Robust statistical modeling using the t-distribution". JASA 84 (408):
881–896. JSTOR 2290063.
[23] http://mars.wiwi.hu-berlin.de/mediawiki/slides/index.php/Comparison_of_noncentral_and_central_distributions
References
• Senn, S.; Richardson, W. (1994). "The first t-test". Statistics in Medicine 13 (8): 785–803.
doi:10.1002/sim.4780130802. PMID 8047737.
• Hogg, R.V.; Craig, A.T. (1978). Introduction to Mathematical Statistics. New York: Macmillan.
• Venables, W.N.; Ripley, B.D. (2002) Modern Applied Statistics with S, Fourth Edition, Springer
• Gelman, Andrew; John B. Carlin, Hal S. Stern, Donald B. Rubin (2003). Bayesian Data Analysis (Second
Edition) (http://www.stat.columbia.edu/~gelman/book/). CRC/Chapman & Hall. ISBN 1-58488-388-X.
External links
• Earliest Known Uses of Some of the Words of Mathematics (S) (http://jeff560.tripod.com/s.html) (Remarks
on the history of the term "Student's distribution")
Summation by parts
In mathematics, summation by parts transforms the summation of products of sequences into other summations,
often simplifying the computation or (especially) estimation of certain types of sums. The summation by parts
formula is sometimes called Abel's lemma or Abel transformation.
Statement
Suppose \{f_k\} and \{g_k\} are two sequences. Then,

\sum_{k=m}^{n} f_k \left(g_{k+1} - g_k\right) = \left(f_{n+1} g_{n+1} - f_m g_m\right) - \sum_{k=m}^{n} g_{k+1} \left(f_{k+1} - f_k\right).
Note also that although applications almost always deal with convergence of sequences, the statement is purely
algebraic and will work in any field. It will also work when one sequence is in a vector space, and the other is in the
relevant field of scalars.
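Because the statement is purely algebraic, it can be verified numerically for arbitrary sequences. The sketch below checks the summation-by-parts identity Σₖ fₖ(gₖ₊₁ − gₖ) = (fₙ₊₁gₙ₊₁ − fₘgₘ) − Σₖ gₖ₊₁(fₖ₊₁ − fₖ) on random data (standard library only; the sequence length and index range are arbitrary):

```python
import random

random.seed(1)
f = [random.uniform(-1, 1) for _ in range(12)]
g = [random.uniform(-1, 1) for _ in range(12)]

m, n = 2, 10  # sum over k = m..n; requires f[n+1] and g[n+1] to exist

lhs = sum(f[k] * (g[k + 1] - g[k]) for k in range(m, n + 1))
rhs = (f[n + 1] * g[n + 1] - f[m] * g[m]) - sum(
    g[k + 1] * (f[k + 1] - f[k]) for k in range(m, n + 1)
)
assert abs(lhs - rhs) < 1e-12
```

The two sums telescope against each other: adding them term by term leaves only the boundary values f·g at the endpoints, which is exactly the discrete analogue of the boundary term in integration by parts.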
Newton series
The formula is sometimes given in one of these slightly different forms,
both result from iterated application of the initial formula. The auxiliary quantities are Newton series:
and
Method
For two given sequences (a_n) and (b_n), with n \in \mathbb{N}, one wants to study the sum of the following series:

S_N = \sum_{n=0}^{N} a_n b_n.

If we define B_n = \sum_{k=0}^{n} b_k, then for every n > 0, b_n = B_n - B_{n-1} and

S_N = a_0 b_0 + \sum_{n=1}^{N} a_n \left(B_n - B_{n-1}\right).

Finally

S_N = a_N B_N - \sum_{n=0}^{N-1} B_n \left(a_{n+1} - a_n\right).

This process, called an Abel transformation, can be used to prove several criteria of convergence for S_N.
The analogy with integration by parts is instructive. Beside the boundary terms, we notice that the first integral contains two multiplied functions, one which is integrated in the final integral (g' becomes g) and one which is differentiated (f becomes f').
The process of the Abel transformation is similar, since one of the two initial sequences is summed (b_n becomes B_n) and the other one is differenced (a_n becomes a_{n+1} - a_n).
Applications
We suppose that a_n \to 0 as n \to \infty; otherwise it is obvious that \sum_n a_n b_n may be a divergent series.
References
• Abel's lemma [1], PlanetMath.org.
References
[1] http://planetmath.org/?op=getobj&from=objects&id=3843
Taylor series
In mathematics, a Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function's derivatives at a single point.
The concept of a Taylor series was formally introduced by the English mathematician Brook Taylor in 1715. If the Taylor series is centered at zero, then that series is also called a Maclaurin series, named after the Scottish mathematician Colin Maclaurin, who made extensive use of this special case of Taylor series in the 18th century.
Definition
The Taylor series of a real or complex-valued function ƒ(x) that is infinitely differentiable in a neighborhood of a real or complex number a is the power series

f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^{2} + \frac{f'''(a)}{3!}(x-a)^{3} + \cdots,

which can be written in the more compact sigma notation as

\sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}\,(x-a)^{n},

where n! denotes the factorial of n and f^{(n)}(a) denotes the nth derivative of f evaluated at the point a.
Examples
The Maclaurin series for any polynomial is the polynomial itself.
The Maclaurin series for (1 − x)^{−1} for |x| < 1 is the geometric series

1 + x + x^{2} + x^{3} + \cdots,

so the Taylor series for x^{-1} at a = 1 is

1 - (x-1) + (x-1)^{2} - (x-1)^{3} + \cdots.

By integrating the above Maclaurin series, we find the Maclaurin series for \log(1 - x), where \log denotes the natural logarithm:

-x - \tfrac{1}{2}x^{2} - \tfrac{1}{3}x^{3} - \tfrac{1}{4}x^{4} - \cdots,

and more generally, the corresponding Taylor series for \log(x) at some a > 0 is:

\log a + \frac{x-a}{a} - \frac{(x-a)^{2}}{2a^{2}} + \frac{(x-a)^{3}}{3a^{3}} - \cdots.

The Taylor series for the exponential function e^{x} at a = 0 is

1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + \cdots = \sum_{n=0}^{\infty} \frac{x^{n}}{n!}.

The above expansion holds because the derivative of e^{x} with respect to x is also e^{x} and e^{0} equals 1. This leaves the terms (x − 0)^{n} in the numerator and n! in the denominator for each term in the infinite sum.
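Truncating the exponential series gives increasingly good approximations to e^x. A quick sketch (standard library only; the function name is illustrative):

```python
import math

def exp_taylor(x, terms):
    """Partial sum of the Maclaurin series: sum of x^n / n! for n < terms."""
    return sum(x ** n / math.factorial(n) for n in range(terms))

# The partial sums converge to exp(x) for every x; here x = 1, limit e.
for terms, tol in ((5, 1e-1), (10, 1e-5), (20, 1e-12)):
    assert abs(exp_taylor(1.0, terms) - math.e) < tol
```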
History
The Greek philosopher Zeno considered the problem of summing an infinite series to achieve a finite result, but
rejected it as an impossibility: the result was Zeno's paradox. Later, Aristotle proposed a philosophical resolution of
the paradox, but the mathematical content was apparently unresolved until taken up by Democritus and then
Archimedes. It was through Archimedes's method of exhaustion that an infinite number of progressive subdivisions
could be performed to achieve a finite result.[1] Liu Hui independently employed a similar method a few centuries
later.[2]
In the 14th century, the earliest examples of the use of Taylor series and closely related methods were given by
Madhava of Sangamagrama.[3][4] Though no record of his work survives, writings of later Indian mathematicians
suggest that he found a number of special cases of the Taylor series, including those for the trigonometric functions
of sine, cosine, tangent, and arctangent. The Kerala school of astronomy and mathematics further expanded his
works with various series expansions and rational approximations until the 16th century.
In the 17th century, James Gregory also worked in this area and published several Maclaurin series. It was not until
1715 however that a general method for constructing these series for all functions for which they exist was finally
provided by Brook Taylor,[5] after whom the series are now named.
The Maclaurin series was named after Colin Maclaurin, a professor in Edinburgh, who published the special case of
the Taylor result in the 18th century.
Analytic functions
If f(x) is given by a convergent power series in an open disc (or interval in the real line) centered at b, it is said to be analytic in this disc. Thus for x in this disc, f is given by a convergent power series

f(x) = \sum_{n=0}^{\infty} a_n (x - b)^{n}.

Differentiating the above formula n times with respect to x, then setting x = b, gives

\frac{f^{(n)}(b)}{n!} = a_n,

and so the power series expansion agrees with the Taylor series. Thus a function is analytic in an open disc centered at b if and only if its Taylor series converges to the value of the function at each point of the disc.

[Figure: The function e^{-1/x^2} is not analytic at x = 0: the Taylor series is identically 0, although the function is not.]
If f(x) is equal to its Taylor series everywhere, it is called entire. The polynomials, the exponential function e^x, and
the trigonometric functions sine and cosine are examples of entire functions. Examples of functions that are not
entire include the logarithm, the trigonometric function tangent, and its inverse, arctan. For these functions the Taylor
series do not converge if x is far from b. A Taylor series can be used to calculate the value of an entire function at every
point, if the value of the function, and of all of its derivatives, are known at a single point.
Uses of the Taylor series for analytic functions include:
1. The partial sums (the Taylor polynomials) of the series can be used as approximations of the entire function.
These approximations are good if sufficiently many terms are included.
2. Differentiation and integration of power series can be performed term by term and is hence particularly easy.
3. An analytic function is uniquely extended to a holomorphic function on an open disk in the complex plane. This
makes the machinery of complex analysis available.
4. The (truncated) series can be used to compute function values numerically (often by recasting the polynomial
into the Chebyshev form and evaluating it with the Clenshaw algorithm).
5. Algebraic operations can be done readily on the power series representation; for instance, Euler's formula
follows from Taylor series expansions for trigonometric and exponential functions. This result is of fundamental
importance in such fields as harmonic analysis.
6. Approximations using the first few terms of a Taylor series can make otherwise unsolvable problems possible for
a restricted domain; this approach is often used in physics.
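Point 1 can be illustrated with a short pure-Python sketch (the helper name exp_taylor is ours, not a library function): partial sums of the Maclaurin series of e^x approximate the function better as more terms are included.

```python
import math

def exp_taylor(x, n):
    """Partial sum of the Maclaurin series of e^x up to degree n."""
    return sum(x ** k / math.factorial(k) for k in range(n + 1))

# The approximation error near the expansion point shrinks rapidly
# as the degree of the Taylor polynomial grows.
err5 = abs(exp_taylor(1.0, 5) - math.e)
err10 = abs(exp_taylor(1.0, 10) - math.e)
```

With ten terms the error at x = 1 is already below 10^(−6), reflecting the factorial decay of the remainder term.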
[Figure: The Taylor polynomials for log(1+x) only provide accurate approximations in the range −1 < x ≤ 1; for x > 1, Taylor polynomials of higher degree are worse approximations.]
The function f(x) = e^(−1/x²) (with f(0) = 0) is infinitely differentiable at x = 0, and has
all derivatives zero there. Consequently, the Taylor series of f(x) about x = 0 is
identically zero. However, f(x) is not equal to the zero function, and so it is not equal to its Taylor series around the
origin.
In real analysis, this example shows that there are infinitely differentiable functions f(x) whose Taylor series are not
equal to f(x) even if they converge. By contrast, in complex analysis there are no holomorphic functions f(z) whose
Taylor series converges to a value different from f(z). The complex function e^(−1/z²) does not approach 0 as z
approaches 0 along the imaginary axis, so it is not continuous at the origin and its Taylor series is thus not defined there.
More generally, every sequence of real or complex numbers can appear as coefficients in the Taylor series of an
infinitely differentiable function defined on the real line, a consequence of Borel's lemma (see also Non-analytic
smooth function). As a result, the radius of convergence of a Taylor series can be zero. There are even infinitely
differentiable functions defined on the real line whose Taylor series have a radius of convergence 0 everywhere.[6]
Some functions cannot be written as Taylor series because they have a singularity; in these cases, one can often still
achieve a series expansion if one also allows negative powers of the variable x; see Laurent series. For example,
f(x) = e^(−1/x²) can be written as a Laurent series.
Generalization
There is, however, a generalization[7][8] of the Taylor series that does converge to the value of the function itself for
any bounded continuous function on (0,∞), using the calculus of finite differences. Specifically, one has the
following theorem, due to Einar Hille, that for any t > 0,
lim_{h→0⁺} Σ_{n=0}^∞ (t^n / n!) · Δ_h^n f(a) / h^n = f(a + t).
Here Δ_h^n is the nth finite difference operator with step size h. The series is precisely the Taylor series, except that
divided differences appear in place of differentiation: the series is formally similar to the Newton series. When the
function f is analytic at a, the terms in the series converge to the terms of the Taylor series, and in this sense the formula
generalizes the usual Taylor series.
In general, for any infinite sequence a_i, the following power series identity holds:
Σ_{n=0}^∞ (u^n / n!) Δ^n a_i = e^(−u) Σ_{j=0}^∞ (u^j / j!) a_{i+j}.
So in particular,
f(a + t) = lim_{h→0⁺} e^(−t/h) Σ_{j=0}^∞ f(a + jh) (t/h)^j / j!.
The series on the right is the expectation value of f(a + X), where X is a Poisson-distributed random variable that
takes the value jh with probability e^(−t/h)·(t/h)^j/j!. Hence,
f(a + t) = lim_{h→0⁺} E[f(a + X)].
List of Maclaurin series of some common functions[9]
Natural logarithm (for −1 < x ≤ 1):
ln(1 + x) = x − x²/2 + x³/3 − x⁴/4 + ⋯
Binomial series (includes the square root for α = 1/2 and the infinite geometric series for α = −1), for |x| < 1:
(1 + x)^α = Σ_{n=0}^∞ C(α, n) x^n, where C(α, n) = α(α − 1)⋯(α − n + 1)/n!
Trigonometric functions (for all x):
sin x = x − x³/3! + x⁵/5! − ⋯,  cos x = 1 − x²/2! + x⁴/4! − ⋯
Hyperbolic functions (for all x):
sinh x = x + x³/3! + x⁵/5! + ⋯,  cosh x = 1 + x²/2! + x⁴/4! + ⋯
The numbers B_k appearing in the expansions of tan(x) and tanh(x) are the Bernoulli numbers. The E_k in
the expansion of sec(x) are Euler numbers.
First example
Compute the 7th degree Maclaurin polynomial for the function
.
First, rewrite the function as
.
We have for the natural logarithm (by using the big O notation)
The latter series expansion has a zero constant term, which enables us to substitute the second series into the first one
and to easily omit terms of higher order than the 7th degree by using the big O notation:
Since the cosine is an even function, the coefficients for all the odd powers x, x3, x5, x7, ... have to be zero.
Second example
Suppose we want the Taylor series at 0 of the function
Then multiplication with the denominator and substitution of the series of the cosine yields
Comparing coefficients with the above series of the exponential function yields the desired Taylor series
Third example
Here we use a method called "indirect expansion" to expand the given function. This method uses the known
Taylor series of a related function to carry out the expansion.
Q: Expand the following function as a power series of x
.
We know the Taylor series of function is:
Thus,
For example, for a function that depends on two variables, x and y, the Taylor series to second order about the point
(a, b) is:
A second-order Taylor series expansion of a scalar-valued function of more than one variable can be written
compactly as
where ∇f(a) is the gradient of f evaluated at x = a and H(a) is its Hessian matrix. Applying the
multi-index notation the Taylor series for several variables becomes
which is to be understood as a still more abbreviated multi-index version of the first equation of this paragraph, again
in full analogy to the single variable case.
Example
Compute a second-order Taylor series expansion around point
of a function
for |y| < 1.
Notes
[1] Kline, M. (1990) Mathematical Thought from Ancient to Modern Times. Oxford University Press. pp. 35-37.
[2] Boyer, C. and Merzbach, U. (1991) A History of Mathematics. John Wiley and Sons. pp. 202-203.
[3] "Neither Newton nor Leibniz - The Pre-History of Calculus and Celestial Mechanics in Medieval Kerala" (http://www.canisius.edu/topos/rajeev.asp). MAT 314. Canisius College. Retrieved 2006-07-09.
[4] S. G. Dani (2012). "Ancient Indian Mathematics – A Conspectus". Resonance 17 (3): 236-246.
[5] Taylor, Brook, Methodus Incrementorum Directa et Inversa [Direct and Reverse Methods of Incrementation] (London, 1715), pages 21-23
(Proposition VII, Theorem 3, Corollary 2). Translated into English in D. J. Struik, A Source Book in Mathematics 1200-1800 (Cambridge,
Massachusetts: Harvard University Press, 1969), pages 329-332.
[6] Rudin, Walter (1980), Real and Complex Analysis, New Delhi: McGraw-Hill, p. 418, Exercise 13, ISBN 0-07-099557-5
[7] Feller, William (1971), An introduction to probability theory and its applications, Volume 2 (3rd ed.), Wiley, pp. 230–232.
[8] Hille, Einar; Phillips, Ralph S. (1957), Functional analysis and semi-groups, AMS Colloquium Publications, 31, American Mathematical
Society, p. 300–327.
[9] Most of these can be found in (Abramowitz & Stegun 1970).
[10] Odibat, ZM., Shawagfeh, NT., 2007. "Generalized Taylor's formula." Applied Mathematics and Computation 186, 286-293.
References
• Abramowitz, Milton; Stegun, Irene A. (1970), Handbook of Mathematical Functions with Formulas, Graphs, and
Mathematical Tables, New York: Dover Publications, Ninth printing
• Thomas, George B. Jr.; Finney, Ross L. (1996), Calculus and Analytic Geometry (9th ed.), Addison Wesley,
ISBN 0-201-53174-7
• Greenberg, Michael (1998), Advanced Engineering Mathematics (2nd ed.), Prentice Hall, ISBN 0-13-321431-1
External links
• Weisstein, Eric W., " Taylor Series (http://mathworld.wolfram.com/TaylorSeries.html)" from MathWorld.
• Madhava of Sangamagramma (http://www-groups.dcs.st-and.ac.uk/~history/Projects/Pearce/Chapters/
Ch9_3.html)
• Taylor Series Representation Module by John H. Mathews (http://math.fullerton.edu/mathews/c2003/
TaylorSeriesMod.html)
• " Discussion of the Parker-Sochacki Method (http://csma31.csm.jmu.edu/physics/rudmin/ParkerSochacki.
htm)"
• Another Taylor visualisation (http://stud3.tuwien.ac.at/~e0004876/taylor/Taylor_en.html) - where you can
choose the point of the approximation and the number of derivatives
• Taylor series revisited for numerical methods (http://numericalmethods.eng.usf.edu/topics/taylor_series.
html) at Numerical Methods for the STEM Undergraduate (http://numericalmethods.eng.usf.edu)
• Cinderella 2: Taylor expansion (http://cinderella.de/files/HTMLDemos/2C02_Taylor.html)
• Taylor series (http://www.sosmath.com/calculus/tayser/tayser01/tayser01.html)
• Inverse trigonometric functions Taylor series (http://www.efunda.com/math/taylor_series/inverse_trig.cfm)
Uniform distribution (continuous) 603
In probability theory and statistics, the continuous uniform distribution or rectangular distribution is a family of
probability distributions such that for each member of the family, all intervals of the same length on the distribution's
support are equally probable. The support is defined by the two parameters, a and b, which are its minimum and
maximum values. The distribution is often abbreviated U(a,b). It is the maximum entropy probability distribution for
a random variate X under no constraint other than that it is contained in the distribution's support.[1]
Characterization
The values of f(x) at the two boundaries a and b are usually unimportant because they do not alter the values of the
integrals of f(x) dx over any interval, nor of x f(x) dx or any higher moment. Sometimes they are chosen to be zero,
and sometimes chosen to be 1/(b − a). The latter is appropriate in the context of estimation by the method of
maximum likelihood. In the context of Fourier analysis, one may take the value of f(a) or f(b) to be 1/(2(b − a)),
since then the inverse transform of many integral transforms of this uniform function will yield back the function
itself, rather than a function which is equal "almost everywhere", i.e. except on a set of points with zero measure.
Also, it is consistent with the sign function which has no such ambiguity.
In terms of mean μ and variance σ², the probability density may be written as:
f(x) = 1/(2σ√3) for −σ√3 ≤ x − μ ≤ σ√3, and f(x) = 0 otherwise.
Generating functions
Moment-generating function
The moment-generating function is
M_X(t) = E[e^(tX)] = (e^(tb) − e^(ta)) / (t(b − a)) for t ≠ 0, with M_X(0) = 1,
from which the raw moments m_k may be calculated. For a random variable following this distribution, the expected value is then m₁ = (a + b)/2 and the variance is
m₂ − m₁² = (b − a)²/12.
Cumulant-generating function
For n ≥ 2, the nth cumulant of the uniform distribution on the interval [0, 1] is B_n/n, where B_n is the nth Bernoulli
number.
Properties
The mean and variance are E(X) = (a + b)/2 and V(X) = (b − a)²/12. Solving these two equations for the parameters a and b, given known moments E(X) and V(X), yields:
a = E(X) − √(3 V(X)),  b = E(X) + √(3 V(X)).
Order statistics
Let X₁, ..., X_n be an i.i.d. sample from U(0,1), and let X_(k) be the kth order statistic from this sample. Then the probability
distribution of X_(k) is a Beta distribution with parameters k and n − k + 1. The expected value is
E[X_(k)] = k / (n + 1).
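A quick Monte Carlo sketch (pure Python; the helper name is ours) checking that the kth order statistic of a U(0,1) sample of size n has the Beta(k, n − k + 1) mean k/(n + 1):

```python
import random

def empirical_kth_order_stat_mean(n, k, trials=20000, seed=1):
    """Monte Carlo estimate of E[X_(k)] for U(0,1) samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = sorted(rng.random() for _ in range(n))
        total += sample[k - 1]          # kth order statistic (1-indexed)
    return total / trials

# Beta(k, n - k + 1) has mean k / (n + 1); here n = 5, k = 2 gives 1/3.
est = empirical_kth_order_stat_mean(n=5, k=2)
theory = 2 / 6
```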
Uniformity
The probability that a uniformly distributed random variable falls within any interval of fixed length is independent
of the location of the interval itself (but it is dependent on the interval size), so long as the interval is contained in the
distribution's support.
To see this, if X ~ U(a,b) and [x, x + d] is a subinterval of [a, b] with fixed d > 0, then
P(x ≤ X ≤ x + d) = d / (b − a),
which does not depend on x.
Standard uniform
Restricting a = 0 and b = 1, the resulting distribution U(0,1) is called a standard uniform distribution.
One interesting property of the standard uniform distribution is that if u₁ has a standard uniform distribution, then so
does 1 − u₁. This property can be used for generating antithetic variates, among other things.
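A minimal sketch of antithetic variates (pure Python; the estimator and its name are ours): pairing each draw u with 1 − u leaves a Monte Carlo mean estimator unbiased while typically reducing its variance.

```python
import random

def mc_mean(f, n=10000, antithetic=False, seed=7):
    """Monte Carlo estimate of E[f(U)], U ~ U(0,1), optionally antithetic."""
    rng = random.Random(seed)
    if not antithetic:
        return sum(f(rng.random()) for _ in range(n)) / n
    total = 0.0
    for _ in range(n // 2):
        u = rng.random()
        total += 0.5 * (f(u) + f(1.0 - u))   # u and 1 - u are both U(0,1)
    return total / (n // 2)

# E[U^2] = 1/3; both estimators should land near it.
plain = mc_mean(lambda u: u * u)
anti = mc_mean(lambda u: u * u, antithetic=True)
```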
Related distributions
• If X has a standard uniform distribution, then by the inverse transform sampling method, Y = − ln(X) / λ has an
exponential distribution with (rate) parameter λ.
• If X has a standard uniform distribution, then Y = Xn has a beta distribution with parameters 1/n and 1. (Note this
implies that the standard uniform distribution is a special case of the beta distribution, with parameters 1 and 1.)
• The Irwin–Hall distribution is the sum of n i.i.d. U(0,1) distributions.
• The sum of two independent, equally distributed, uniform distributions yields a symmetric triangular distribution.
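The first bullet, inverse transform sampling of the exponential distribution, can be sketched in a few lines of pure Python (helper name ours):

```python
import math
import random

def exponential_via_inverse_transform(lam, n=50000, seed=3):
    """Turn standard-uniform draws into Exp(lam) draws via Y = -ln(U)/lam."""
    rng = random.Random(seed)
    # 1 - random() lies in (0, 1], which avoids log(0)
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

lam = 2.0
draws = exponential_via_inverse_transform(lam)
sample_mean = sum(draws) / len(draws)   # should be close to 1/lam = 0.5
```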
Using the half-maximum convention at the transition points, the uniform distribution may be expressed in terms of the sign function as:
f(x) = (sgn(x − a) − sgn(x − b)) / (2(b − a)).
With this convention there is no ambiguity at the transition points of the sign function.
Applications
In statistics, when a p-value is used as a test statistic for a simple null hypothesis, and the distribution of the test
statistic is continuous, then the p-value is uniformly distributed between 0 and 1 if the null hypothesis is true.
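This can be checked by simulation. The sketch below (pure Python; the z-test helper is ours) draws data under a true null hypothesis and confirms that the resulting p-values behave like U(0,1) draws:

```python
import math
import random

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test p-value for H0: mean = mu0 with known sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))

# Under a true null the p-value is uniform on (0, 1):
# its mean should be about 1/2, and P(p < 0.05) should be about 0.05.
rng = random.Random(9)
pvals = [z_test_p_value([rng.gauss(0.0, 1.0) for _ in range(20)])
         for _ in range(5000)]
mean_p = sum(pvals) / len(pvals)
frac_below = sum(p < 0.05 for p in pvals) / len(pvals)
```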
Estimation
Estimation of maximum
Given a uniform distribution on [0, N] with unknown N, the UMVU estimator for the maximum is given by
N̂ = ((k + 1)/k) m = m + m/k,
where m is the sample maximum and k is the sample size, sampling without replacement (though this distinction
almost surely makes no difference for a continuous distribution). This follows for the same reasons as estimation for
the discrete distribution, and can be seen as a very simple case of maximum spacing estimation. This problem is
commonly known as the German tank problem, due to application of maximum estimation to estimates of German
tank production during World War II.
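A simulation sketch of the continuous-case estimator, the sample maximum scaled by (k + 1)/k (pure Python; the function name is ours):

```python
import random

def umvu_max_continuous(sample):
    """Estimate N for U(0, N) as ((k + 1)/k) * sample maximum."""
    k = len(sample)
    return (k + 1) / k * max(sample)

# Simulated check of unbiasedness with true N = 100 and samples of size 10.
rng = random.Random(11)
N, k, trials = 100.0, 10, 20000
avg = sum(umvu_max_continuous([rng.random() * N for _ in range(k)])
          for _ in range(trials)) / trials
```

The average of the estimates over many samples lands very close to the true N, as unbiasedness requires.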
Estimation of midpoint
The midpoint of the distribution (a + b) / 2 is both the mean and the median of the uniform distribution. Although
both the sample mean and the sample median are unbiased estimators of the midpoint, neither is as efficient as the
sample mid-range, i.e. the arithmetic mean of the sample maximum and the sample minimum, which is the UMVU
estimator of the midpoint (and also the maximum likelihood estimate).
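The three estimators of the midpoint can be compared side by side (pure Python sketch; helper name ours):

```python
import random

def midpoint_estimates(sample):
    """Sample mean, sample median, and sample mid-range for the same data."""
    s = sorted(sample)
    k = len(s)
    mean = sum(s) / k
    median = s[k // 2] if k % 2 else 0.5 * (s[k // 2 - 1] + s[k // 2])
    midrange = 0.5 * (s[0] + s[-1])     # arithmetic mean of min and max
    return mean, median, midrange

# All three estimate (a + b)/2 = 5 for U(2, 8); the mid-range is the
# most efficient of the three for uniform data.
rng = random.Random(5)
mean, median, midrange = midpoint_estimates(
    [rng.uniform(2.0, 8.0) for _ in range(1001)])
```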
The confidence interval for the estimated population maximum is then (X_(n), X_(n)/α^(1/n)), where 100(1 − α)% is the
confidence level sought. In symbols,
P( X_(n) ≤ N ≤ X_(n) α^(−1/n) ) = 1 − α.
References
[1] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http://www.wise.xmu.edu.cn/Master/Download/..\..\UploadFiles\paper-masterdownload\2009519932327055475115776.pdf). Journal of Econometrics (Elsevier): 219–230. Retrieved 2011-06-02.
[2] Nechval KN, Nechval NA, Vasermanis EK, Makeev VY (2002) Constructing shortest-length confidence intervals. Transport and
Telecommunication 3 (1) 95-103
External links
• Online calculator of Uniform distribution (continuous) (http://www.stud.feec.vutbr.cz/~xvapen02/vypocty/
ro.php?language=english)
Uniform distribution (discrete) 609
In probability theory and statistics, the discrete uniform distribution is a probability distribution whereby a finite
number of equally spaced values are equally likely to be observed; every one of n values has equal probability 1/n.
Another way of saying "discrete uniform distribution" would be "a known, finite number of equally spaced outcomes
equally likely to happen."
If a random variable has any of n possible values k₁, k₂, …, kₙ that are equally spaced and equally probable,
then it has a discrete uniform distribution. The probability of any outcome kᵢ is 1/n. A simple example of the
discrete uniform distribution is throwing a fair die. The possible values are 1, 2, 3, 4, 5, 6; and each time the die
is thrown, the probability of a given score is 1/6. If two dice are thrown and their values added, the uniform
distribution no longer fits, since the values from 2 to 12 do not have equal probabilities.
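The die example can be made exact with a few lines of Python, using rational arithmetic so that no probability is approximated:

```python
from fractions import Fraction
from itertools import product

# Exact PMF of one fair die (uniform) versus the sum of two dice (not uniform).
single = {v: Fraction(1, 6) for v in range(1, 7)}
two_dice = {}
for a, b in product(range(1, 7), repeat=2):
    two_dice[a + b] = two_dice.get(a + b, Fraction(0)) + Fraction(1, 36)
```

Every face of one die gets probability 1/6, while the sum of two dice peaks at 7 (probability 6/36) and falls off toward 2 and 12 (probability 1/36 each).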
The cumulative distribution function (CDF) can be expressed as an average of the CDFs of degenerate distributions,
where the Heaviside step function H(x − k₀) is the CDF of the degenerate distribution centered at k₀, using the
convention that H(0) = 1.
Estimation of maximum
This example is described by saying that a sample of k observations is obtained from a uniform distribution on the
integers 1, 2, …, N, with the problem being to estimate the unknown maximum N. This problem is commonly
known as the German tank problem, following the application of maximum estimation to estimates of German tank
production during World War II.
The UMVU estimator for the maximum is given by
N̂ = m + m/k − 1 = ((k + 1)/k) m − 1,
where m is the sample maximum and k is the sample size, sampling without replacement.[2][3] This can be seen as a
very simple case of maximum spacing estimation.
The formula may be understood intuitively as:
"The sample maximum plus the average gap between observations in the sample",
the gap being added to compensate for the negative bias of the sample maximum as an estimator for the population
maximum.[4]
This has a variance of[2]
Var(N̂) = (N − k)(N + 1) / (k(k + 2)) ≈ N²/k² for small samples k ≪ N,
so a standard deviation of approximately N/k, the (population) average size of a gap between samples; compare m/k
above.
The sample maximum is the maximum likelihood estimator for the population maximum, but, as discussed above, it
is biased.
If samples are not numbered but are recognizable or markable, one can instead estimate population size via the
capture-recapture method.
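The "sample maximum plus the average gap" estimator can be sketched and checked by simulation (pure Python; function name ours):

```python
import random

def umvu_max_discrete(sample):
    """Sample maximum plus the average gap between observations: m + m/k - 1."""
    k, m = len(sample), max(sample)
    return m + m / k - 1

# Simulated check with N = 100 "tanks" and samples of size k = 5,
# drawn without replacement as in the German tank problem.
rng = random.Random(42)
N, k, trials = 100, 5, 20000
avg = sum(umvu_max_discrete(rng.sample(range(1, N + 1), k))
          for _ in range(trials)) / trials
```

Averaged over many samples the estimate sits very close to the true N = 100, while the raw sample maximum would systematically fall short of it.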
Random permutation
See rencontres numbers for an account of the probability distribution of the number of fixed points of a uniformly
distributed random permutation.
Notes
[1] http://adorio-research.org/wordpress/?p=519
[2] Johnson, Roger (1994), "Estimating the Size of a Population", Teaching Statistics (http://www.rsscse.org.uk/ts/index.htm) 16 (2 (Summer)), doi:10.1111/j.1467-9639.1994.tb00688.x
[3] Johnson, Roger (2006), "Estimating the Size of a Population" (http://www.rsscse.org.uk/ts/gtb/johnson.pdf), Getting the Best from Teaching Statistics (http://www.rsscse.org.uk/ts/gtb/contents.html).
[4] The sample maximum is never more than the population maximum, but can be less, hence it is a biased estimator: it will tend to
underestimate the population maximum.
Weibull distribution 612
Weibull distribution
In probability theory and statistics, the Weibull distribution is a continuous probability distribution. It is named
after Waloddi Weibull, who described it in detail in 1951, although it was first identified by Fréchet (1927) and first
applied by Rosin & Rammler (1933) to describe the size distribution of particles.
Definition
The probability density function of a Weibull random variable x is:[1]
f(x; λ, k) = (k/λ) (x/λ)^(k−1) e^(−(x/λ)^k) for x ≥ 0, and f(x; λ, k) = 0 for x < 0,
where k > 0 is the shape parameter and λ > 0 is the scale parameter of the distribution. Its complementary
cumulative distribution function is a stretched exponential function. The Weibull distribution is related to a number
of other probability distributions; in particular, it interpolates between the exponential distribution (k = 1) and the
Rayleigh distribution (k = 2).
If the quantity x is a "time-to-failure", the Weibull distribution gives a distribution for which the failure rate is
proportional to a power of time. The shape parameter, k, is that power plus one, and so this parameter can be
interpreted directly as follows:
• A value of k<1 indicates that the failure rate decreases over time. This happens if there is significant "infant
mortality", or defective items failing early and the failure rate decreasing over time as the defective items are
weeded out of the population.
• A value of k=1 indicates that the failure rate is constant over time. This might suggest random external events are
causing mortality, or failure.
• A value of k>1 indicates that the failure rate increases with time. This happens if there is an "aging" process, or
parts that are more likely to fail as time goes on.
In the field of materials science, the shape parameter k of a distribution of strengths is known as the Weibull
modulus.
Properties
Density function
The form of the density function of the Weibull distribution changes drastically with the value of k. For 0 < k < 1, the
density function tends to ∞ as x approaches zero from above and is strictly decreasing. For k = 1, the density function
tends to 1/λ as x approaches zero from above and is strictly decreasing. For k > 1, the density function tends to zero
as x approaches zero from above, increases until its mode and decreases after it. It is interesting to note that the
density function has infinite negative slope at x = 0 if 0 < k < 1, infinite positive slope at x = 0 if 1 < k < 2 and null
slope at x = 0 if k > 2. For k = 2 the density has a finite positive slope at x = 0. As k goes to infinity, the Weibull
distribution converges to a Dirac delta distribution centred at x = λ. Moreover, the skewness and coefficient of
variation depend only on the shape parameter.
Distribution function
The cumulative distribution function for the Weibull distribution is
F(x; k, λ) = 1 − e^(−(x/λ)^k) for x ≥ 0, and F(x; k, λ) = 0 for x < 0.
Moments
The moment generating function of the logarithm of a Weibull distributed random variable is given by[2]
E[e^(t log X)] = λ^t Γ(1 + t/k),
where Γ is the gamma function. Similarly, the characteristic function of log X is given by
E[e^(it log X)] = λ^(it) Γ(1 + it/k).
In particular, the nth raw moment of X is E[X^n] = λ^n Γ(1 + n/k); the mean is E[X] = λ Γ(1 + 1/k) and the variance is
Var(X) = λ² [Γ(1 + 2/k) − (Γ(1 + 1/k))²].
If the parameter k is assumed to be a rational number, expressed as k = p/q where p and q are integers, then the
moment generating function of X itself can also be evaluated analytically.[3]
Information entropy
The information entropy is given by
H(X) = γ(1 − 1/k) + ln(λ/k) + 1,
where γ is the Euler–Mascheroni constant.
Weibull plot
The goodness of fit of data to a Weibull distribution can be visually assessed using a Weibull plot.[4] The Weibull
plot is a plot of the empirical cumulative distribution function F̂(x) of data on special axes in a type of Q-Q plot.
The axes are ln(−ln(1 − F̂(x))) versus ln(x). The reason for this change of variables is that the cumulative
distribution function can be linearised:
ln(−ln(1 − F(x; k, λ))) = k ln x − k ln λ,
which can be seen to be in the standard form of a straight line. Therefore, if the data came from a Weibull distribution,
then a straight line is expected on a Weibull plot.
There are various approaches to obtaining the empirical distribution function from data: one method is to obtain the
vertical coordinate for each point using F̂ = (i − 0.3)/(n + 0.4), where i is the rank of the data point and n is the number
of data points.[5]
Linear regression can also be used to numerically assess goodness of fit and estimate the parameters of the Weibull
distribution. The gradient informs one directly about the shape parameter and the scale parameter can also be
inferred.
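The regression approach can be sketched as follows (pure Python; the helper names and the median-rank plotting position (i − 0.3)/(n + 0.4) are our choices, and the test data is synthetic, generated by inverse transform sampling):

```python
import math
import random

def weibull_plot_fit(data):
    """Least-squares fit on Weibull-plot axes: y = ln(-ln(1 - F)) against
    t = ln(x) gives slope k and intercept -k*ln(lambda)."""
    xs = sorted(data)
    n = len(xs)
    ts, ys = [], []
    for i, x in enumerate(xs, start=1):
        F = (i - 0.3) / (n + 0.4)        # median-rank plotting position
        ts.append(math.log(x))
        ys.append(math.log(-math.log(1.0 - F)))
    tbar = sum(ts) / n
    ybar = sum(ys) / n
    slope = sum((t - tbar) * (y - ybar) for t, y in zip(ts, ys)) \
        / sum((t - tbar) ** 2 for t in ts)
    intercept = ybar - slope * tbar
    k_hat = slope                         # gradient gives the shape parameter
    lam_hat = math.exp(-intercept / slope)  # scale recovered from the intercept
    return k_hat, lam_hat

# Synthetic data from Weibull(k=2, lambda=3) via x = lambda*(-ln(1-u))^(1/k)
rng = random.Random(0)
data = [3.0 * (-math.log(1.0 - rng.random())) ** 0.5 for _ in range(2000)]
k_hat, lam_hat = weibull_plot_fit(data)
```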
Uses
The Weibull distribution is used
• In survival analysis[6]
• In reliability engineering and failure analysis
• In industrial engineering to represent manufacturing and delivery times
• In extreme value theory
• In weather forecasting
• To describe wind speed distributions, as the natural distribution often matches the Weibull shape[7]
• In communications systems engineering
• In radar systems, to model the dispersion of the received signal level produced by some types of clutter
• To model fading channels in wireless communications, as the Weibull fading model seems to exhibit good fit to experimental fading channel measurements
• In general insurance, to model the size of reinsurance claims and the cumulative development of asbestosis losses
• In forecasting technological change (also known as the Sharif-Islam model)
[Figure: Fitted cumulative Weibull distribution to maximum one-day rainfalls]
• In hydrology the Weibull distribution is applied to
extreme events such as annual maximum one-day rainfalls and river discharges. The blue picture illustrates an
example of fitting the Weibull distribution to ranked annual maximum one-day rainfalls, showing also the 90%
confidence belt based on the binomial distribution. The rainfall data are represented by plotting positions as part
of the cumulative frequency analysis.
• In describing the size of particles generated by grinding, milling and crushing operations, the 2-Parameter
Weibull distribution is used, and in these applications it is sometimes known as the Rosin-Rammler distribution.
In this context it predicts fewer fine particles than the Log-normal distribution and it is generally most accurate
for narrow particle size distributions. The interpretation of the cumulative distribution function is that F(x; k; λ) is
the mass fraction of particles with diameter smaller than x, where λ is the mean particle size and k is a measure of
the spread of particle sizes.
Related distributions
• The translated Weibull distribution contains an additional parameter.[2] It has the probability density function
f(x; k, λ, θ) = (k/λ) ((x − θ)/λ)^(k−1) e^(−((x − θ)/λ)^k)
for x ≥ θ and f(x; k, λ, θ) = 0 for x < θ, where k > 0 is the shape parameter, λ > 0 is the scale parameter and
θ is the location parameter of the distribution. When θ = 0, this reduces to the 2-parameter distribution.
• The Weibull distribution can be characterized as the distribution of a random variable X such that the random
variable
U = (X/λ)^k
has a standard exponential distribution.
• The distribution of a random variable that is defined as the minimum of several random variables, each having a
different Weibull distribution, is a poly-Weibull distribution.
References
[1] Papoulis, Athanasios; Pillai, S. Unnikrishna, Probability, Random Variables, and Stochastic Processes, 4th edition.
[2] Johnson, Kotz & Balakrishnan 1994
[3] See (Cheng, Tellambura & Beaulieu 2004) for the case when k is an integer, and (Sagias & Karagiannidis 2005) for the rational case.
[4] The Weibull plot (http://www.itl.nist.gov/div898/handbook/eda/section3/weibplot.htm)
[5] Wayne Nelson (2004) Applied Life Data Analysis. Wiley-Blackwell ISBN 0-471-64462-5
[6] Survival/Failure Time Analysis (http://www.statsoft.com/textbook/survival-failure-time-analysis/#distribution)
[7] Wind Speed Distribution Weibull (http://www.reuk.co.uk/Wind-Speed-Distribution-Weibull.htm)
[8] "System evolution and reliability of systems" (http://www.sys-ev.com/reliability01.htm). Sysev (Belgium). 2010-01-01.
Bibliography
• Fréchet, Maurice (1927), "Sur la loi de probabilité de l'écart maximum", Annales de la Société Polonaise de
Mathematique, Cracovie 6: 93–116.
• Johnson, Norman L.; Kotz, Samuel; Balakrishnan, N. (1994), Continuous univariate distributions. Vol. 1, Wiley
Series in Probability and Mathematical Statistics: Applied Probability and Statistics (2nd ed.), New York: John
Wiley & Sons, ISBN 978-0-471-58495-7, MR1299979
• Muraleedharan, G.; Rao, A.G.; Kurup, P.G.; Nair, N. Unnikrishnan; Sinha, Mourani (2007), "Coastal
Engineering", Coastal Engineering 54 (8): 630–638, doi:10.1016/j.coastaleng.2007.05.001
• Rosin, P.; Rammler, E. (1933), "The Laws Governing the Fineness of Powdered Coal", Journal of the Institute of
Fuel 7: 29–36.
• Sagias, Nikos C.; Karagiannidis, George K. (2005), "Gaussian class multivariate Weibull distributions: theory and
applications in fading channels", Institute of Electrical and Electronics Engineers. Transactions on Information
Theory 51 (10): 3608–3619, doi:10.1109/TIT.2005.855598, ISSN 0018-9448, MR2237527
• Weibull, W. (1951), "A statistical distribution function of wide applicability" (http://www.barringer1.com/
wa_files/Weibull-ASME-Paper-1951.pdf), J. Appl. Mech.-Trans. ASME 18 (3): 293–297.
• "Engineering statistics handbook" (http://www.itl.nist.gov/div898/handbook/eda/section3/eda3668.htm).
National Institute of Standards and Technology. 2008.
• Nelson, Jr, Ralph (2008-02-05). "Dispersing Powders in Liquids, Part 1, Chap 6: Particle Volume Distribution"
(http://www.erpt.org/014Q/nelsa-06.htm). Retrieved 2008-02-05.
External links
• Mathpages - Weibull Analysis (http://www.mathpages.com/home/kmath122/kmath122.htm)
• The Weibull Distribution (http://www.weibull.com/LifeDataWeb/the_weibull_distribution.htm)
• Reliability Analysis with Weibull (http://www.crgraph.com/Weibull11e.pdf)
Assumptions
1. Data are paired and come from the same population.
2. Each pair is chosen randomly and independently.
3. The data are measured on an interval scale (ordinal is not sufficient because we take differences), but need not be
normal.
Test procedure
Let N be the sample size, i.e. the number of pairs. Thus, there are a total of 2N data points. For i = 1, ..., N, let x_{1,i} and
x_{2,i} denote the measurements.
1. For i = 1, ..., N, calculate |x_{2,i} − x_{1,i}| and sgn(x_{2,i} − x_{1,i}), where sgn is the sign function.
2. Exclude pairs with |x_{2,i} − x_{1,i}| = 0. Let N_r be the reduced sample size.
3. Order the remaining N_r pairs from smallest absolute difference to largest absolute difference.
4. Rank the pairs, starting with the smallest as 1. Ties receive a rank equal to the average of the ranks they span. Let
R_i denote the rank.
5. Calculate the test statistic W:
W = | Σ_{i=1}^{N_r} sgn(x_{2,i} − x_{1,i}) · R_i |,
the absolute value of the sum of the signed ranks.
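The steps above can be sketched in pure Python (this computes one common form of W, the absolute value of the sum of signed ranks; the function name is ours):

```python
def wilcoxon_w(pairs):
    """Signed-rank statistic: zero differences excluded, tied absolute
    differences given the average of the ranks they span."""
    diffs = [x2 - x1 for x1, x2 in pairs if x2 - x1 != 0]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    j = 0
    while j < len(order):
        k = j
        # find the run of tied absolute differences starting at position j
        while k < len(order) and abs(diffs[order[k]]) == abs(diffs[order[j]]):
            k += 1
        avg_rank = (j + 1 + k) / 2.0      # average of ranks j+1 .. k
        for idx in order[j:k]:
            ranks[idx] = avg_rank
        j = k
    return abs(sum((1 if d > 0 else -1) * r for d, r in zip(diffs, ranks)))
```

For example, the pairs (1,3), (2,2), (5,1), (4,6) yield differences 2, −4, 2 after excluding the zero; the tied differences share rank 1.5, and the signed ranks 1.5 − 3 + 1.5 sum to 0.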
Example
[Table of the sample pairs, ordered by absolute difference, omitted.]
Here sgn denotes the sign function, |x_{2,i} − x_{1,i}| the absolute difference, and R_i the rank. Notice that pairs 3 and 9 are tied in
absolute value. They would be ranked 1 and 2, so each gets the average of those ranks, 1.5.
External links
• Description of how to calculate p for the Wilcoxon signed-ranks test (http://comp9.psych.cornell.edu/
Darlington/wilcoxon/wilcox0.htm)
• Example of using the Wilcoxon signed-rank test (http://faculty.vassar.edu/lowry/ch12a.html)
• An online version of the test (http://faculty.vassar.edu/lowry/wilcoxon.html)
• A table of critical values for the Wilcoxon signed-rank test (http://www.sussex.ac.uk/Users/grahamh/
RM1web/WilcoxonTable2005.pdf)
Wilcoxon signed-rank test 620
Implementations
• ALGLIB (http://www.alglib.net/statistics/hypothesistesting/wilcoxonsignedrank.php) includes
implementation of the Wilcoxon signed-rank test in C++, C#, Delphi, Visual Basic, etc.
• The free statistical software R includes an implementation of the test as wilcox.test(x,y,
paired=TRUE), where x and y are vectors of equal length.
• GNU Octave implements various one-tailed and two-tailed versions of the test in the wilcoxon_test
function.
• SciPy (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html) includes an
implementation of the Wilcoxon signed-rank test in Python
Wishart distribution 621
Wishart distribution
In statistics, the Wishart distribution is a generalization to multiple dimensions of the chi-squared distribution, or,
in the case of non-integer degrees of freedom, of the gamma distribution. It is named in honor of John Wishart, who
first formulated the distribution in 1928.[1]
It is any of a family of probability distributions defined over symmetric, nonnegative-definite matrix-valued random
variables (“random matrices”). These distributions are of great importance in the estimation of covariance matrices in
multivariate statistics. In Bayesian statistics, the Wishart distribution is the conjugate prior of the inverse
covariance-matrix of a multivariate-normal random-vector.
Definition
Suppose X is an n × p matrix, each row of which is independently drawn from a p-variate normal distribution with
zero mean and covariance matrix V. Then the Wishart distribution is the probability distribution of the p × p random matrix
S = X′X,
known as the scatter matrix. One indicates that S has that probability distribution by writing
S ~ W_p(V, n).
The positive integer n is the number of degrees of freedom. Sometimes this is written W(V, p, n). For n ≥ p the
matrix S is invertible with probability 1 if V is invertible.
If p = 1 and V = 1 then this distribution is a chi-squared distribution with n degrees of freedom.
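The definition translates directly into a simulation: draw n independent rows from N_p(0, V) and accumulate the scatter matrix. A minimal pure-Python sketch, taking the lower-triangular Cholesky factor L of V as input (so V = L L^T); the function and variable names are illustrative:

```python
import random

def scatter_matrix_sample(L, n, rng):
    """One draw of S = X^T X, where each of the n rows of X is
    L z for a vector z of independent standard normals, so each
    row is N_p(0, V) with V = L L^T (L lower-triangular)."""
    p = len(L)
    S = [[0.0] * p for _ in range(p)]
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # x = L z (L is lower-triangular, so sum only up to the diagonal)
        x = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(p)]
        for i in range(p):
            for j in range(p):
                S[i][j] += x[i] * x[j]
    return S

rng = random.Random(42)
L = [[1.0, 0.0], [0.5, 1.0]]            # V = [[1, 0.5], [0.5, 1.25]]
S = scatter_matrix_sample(L, n=10, rng=rng)
```

Averaging many such draws approximates E[S] = nV, which is one way to sanity-check the sampler.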
Occurrence
The Wishart distribution arises as the distribution of the sample covariance matrix for a sample from a multivariate
normal distribution. It occurs frequently in likelihood-ratio tests in multivariate statistical analysis. It also arises in
the spectral theory of random matrices and in multidimensional Bayesian analysis.
In fact the above definition can be extended to any real n > p − 1. If n ≤ p − 1, then the Wishart no longer has a
density; instead it represents a singular distribution.[2]
Properties
Log-expectation
Note the following formula:[3]
E[ln|S|] = Σ_{i=1..p} ψ((n + 1 − i)/2) + p ln 2 + ln|V|,
where ψ is the digamma function (the derivative of the log of the gamma function).
This plays a role in variational Bayes derivations for Bayes networks involving the Wishart distribution.
Entropy
The information entropy of the distribution has the following formula:[3]
H[S] = ((p + 1)/2) ln|V| + (p(p + 1)/2) ln 2 + ln Γ_p(n/2) − ((n − p − 1)/2) ψ_p(n/2) + np/2,
where Γ_p is the multivariate gamma function and ψ_p(a) = Σ_{i=1..p} ψ(a + (1 − i)/2) is the multivariate digamma function.
Characteristic function
The characteristic function of the Wishart distribution is
Θ ↦ |I − 2iΘV|^(−n/2).
In other words,
E[exp(i tr(SΘ))] = |I − 2iΘV|^(−n/2),
where E denotes expectation. (Here Θ and I are matrices the same size as V (I is the identity matrix); and i
is the square root of −1.)
Theorem
If S has a Wishart distribution with m degrees of freedom and variance matrix V (write S ~ W_p(V, m)), and C is
a q × p matrix of rank q, then
C S C^T ~ W_q(C V C^T, m).
Corollary 1
If λ is a nonzero p × 1 constant vector, then σ^(−2) λ^T S λ ~ χ²_m.
In this case, χ²_m is the chi-squared distribution and σ² = λ^T V λ (note that σ² is a constant; it is positive because
V is positive definite).
Corollary 2
Consider the case where λ = (0, ..., 0, 1, 0, ..., 0)^T (that is, the jth element is one and all others zero). Then
corollary 1 above shows that
S_jj / σ_jj ~ χ²_m,
where σ_jj = V_jj, gives the marginal distribution of each of the elements on the matrix's diagonal.
Noted statistician George Seber points out that the Wishart distribution is not called the “multivariate chi-squared
distribution” because the marginal distribution of the off-diagonal elements is not chi-squared. Seber prefers to
reserve the term multivariate for the case when all univariate marginals belong to the same family.
Bartlett decomposition
The Bartlett decomposition of a matrix S from a p-variate Wishart distribution with scale matrix V and n degrees
of freedom is the factorization:
S = L A A^T L^T,
where L is the Cholesky factor of V (V = L L^T) and A is a lower-triangular p × p random matrix with diagonal
entries A_ii = sqrt(c_i), c_i ~ χ²_(n−i+1), and below-diagonal entries A_ij ~ N(0, 1) for i > j, all drawn
independently. This provides a useful method for obtaining random samples from a Wishart distribution.[4]
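The Bartlett construction can be sketched in pure Python as follows. This assumes integer n ≥ p, so each chi-squared variate can be drawn as a sum of squared standard normals; the helper names are illustrative, not from any particular library.

```python
import math
import random

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def bartlett_sample(L, n, rng):
    """Draw S ~ W_p(V, n) with V = L L^T via S = L A A^T L^T:
    A is lower triangular with A[i][i] the square root of a
    chi-squared variate with n - i degrees of freedom (0-based i,
    i.e. n - i + 1 in 1-based indexing) and A[i][j] ~ N(0, 1)
    below the diagonal, all independent."""
    p = len(L)
    A = [[0.0] * p for _ in range(p)]
    for i in range(p):
        A[i][i] = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2
                                for _ in range(n - i)))
        for j in range(i):
            A[i][j] = rng.gauss(0.0, 1.0)
    LA = matmul(L, A)
    LAt = [[LA[j][i] for j in range(p)] for i in range(p)]  # (L A)^T
    return matmul(LA, LAt)

rng = random.Random(7)
S = bartlett_sample([[1.0, 0.0], [0.5, 1.0]], n=5, rng=rng)
```

The practical advantage over sampling n full normal rows is that only p(p + 1)/2 random variates are needed per draw instead of np.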
The possible range of the shape parameter
It can be shown that the Wishart distribution can be defined if and only if the shape parameter n belongs to the set
Λ_p := {0, ..., p − 1} ∪ (p − 1, ∞).
This set is named after Gindikin, who introduced it[6] in the seventies in the context of gamma distributions on
homogeneous cones. However, for the new parameters in the discrete spectrum of the Gindikin ensemble, namely
Λ*_p := {0, ..., p − 1}, the corresponding Wishart distribution has no Lebesgue density.
References
[1] Wishart, J. (1928). "The generalised product moment distribution in samples from a normal multivariate population". Biometrika 20A (1-2):
32–52. doi:10.1093/biomet/20A.1-2.32. JFM 54.0565.02.
[2] Uhlig, Harald (1994). "On singular Wishart and singular multivariate beta distributions". The Annals of Statistics: 395–405.
(http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1176325375)
[3] C.M. Bishop, Pattern Recognition and Machine Learning, Springer 2006, p. 693.
[4] Smith, W. B.; Hocking, R. R. (1972). "Algorithm AS 53: Wishart Variate Generator". Journal of the Royal Statistical Society. Series C
(Applied Statistics) 21 (3): 341–345. JSTOR 2346290.
[5] Peddada, Shyamal Das; Richards, Donald St. P. (1991). "Proof of a Conjecture of M. L. Eaton on the Characteristic Function of
the Wishart Distribution". Annals of Probability 19 (2): 868–874. doi:10.1214/aop/1176990455.
[6] Gindikin, S.G. (1975). "Invariant generalized functions in homogeneous domains". Funct. Anal. Appl. 9 (1): 50–52.
doi:10.1007/BF01078179.
[7] Dwyer, Paul S. (1967). "Some Applications of Matrix Derivatives in Multivariate Analysis". Journal of the American Statistical
Association 62: 607–625. JSTOR (http://www.jstor.org/pss/2283988).
[8] C.M. Bishop, Pattern Recognition and Machine Learning, Springer 2006.
Article Sources and Contributors 626
Beta distribution Source: http://en.wikipedia.org/w/index.php?oldid=508032247 Contributors: Adamace123, AllenDowney, AnRtist, Arauzo, Art2SpiderXL, Awaterl, Baccyak4H, Benwing,
Betadistribution, BlaiseFEgan, Bootstoots, Bryan Derksen, Btyner, Cburnett, Crasshopper, Cronholm144, DFRussia, Dean P Foster, Dr. J. Rodal, DrMicro, Dshutin, Eric Kvaalen, FilipeS, Fintor,
Fnielsen, Giftlite, Gill110951, GregorB, Gruntfuterk, Gökhan, HappyCamper, Henrygb, Hilgerdenaar, Hypnotoad33, IapetusWave, ImperfectlyInformed, J04n, Jamessungjin.kim, Janlo, Jhapk,
Jheald, Joriki, Josang, Ketiltrout, Krishnavedala, Kts, Ladislav Mecir, Linas, LiranKatzir, Livius3, Lovibond, MarkSweep, Mcld, Melcombe, Michael Hardy, MisterSheik, Mochan Shrestha,
Mpaa, MrOllie, Nbarth, O18, Oberobic, Ohanian, Oleg Alexandrov, Ott2, PAR, PBH, Paulginz, Pleasantville, Pnrj, Qwfp, Rcsprinter123, Rjwilmsi, Robbyjo, Robinh, Robma, Rodrigo braz,
Rumping, SJP, ST47, Saric, Schmock, SharkD, Sheppa28, Steve8675309, Stoni, Sukisuki, Thumperward, Tomi, Tthrall, UndercoverAgents, Urhixidur, Wile E. Heresiarch, Wjastle, YearOfGlad,
161 anonymous edits
Beta function Source: http://en.wikipedia.org/w/index.php?oldid=506352439 Contributors: A. Pichler, Akurn, Albmont, AnRtist, Andre Engels, Arabani, Avihu, Bo Jacoby, CRGreathouse,
Charles Matthews, Damian Yerrick, DeadEyeArrow, Deepmath, Deltahedron, Djozwebo, Dysprosia, Dzordzm, Eric Kvaalen, Fintor, Fnielsen, Fredrik, Giftlite, Gildos, GregorB, Gurch,
Headbomb, Helder Ribeiro, HenningThielemann, Herbee, Jmeppley, JordiGH, Joule36e5, Jpod2, Jtico, Jugger90, Karho.Yau, Kiensvay, Kotasik, L33tsk33t3r, Linas, Loodog, Lumidek, LutzL,
MagnaMopus, MarkSweep, Melcombe, Michael Hardy, MrOllie, Niceguyedc, Nog33, Oleg Alexandrov, PAR, PMajer, PV=nRT, Pabristow, Qwfp, R.e.b., RL0919, Rar, RobHar, Romanm,
Rorro, Ruud Koot, ServiceAT, Stan Lioubomoudrov, Surfo, TakuyaMurata, Tarquin, Tarret, Tomfy, TomyDuby, Wasbeer, Wdvorak, Wile E. Heresiarch, 99 anonymous edits
Beta-binomial distribution Source: http://en.wikipedia.org/w/index.php?oldid=506664672 Contributors: Auntof6, Baccyak4H, Benwing, Charlesmartin14, Chris the speller, Domminico,
Frederic Y Bois, Giftlite, Gnp, GoingBatty, Kdisarno, Massbless, Melcombe, Michael Hardy, Myasuda, Nschuma, PigFlu Oink, Qwfp, Rjwilmsi, Sheppa28, Thouis.r.jones, Thtanner, Tomixdf,
Willy.feng, 34 anonymous edits
Binomial coefficient Source: http://en.wikipedia.org/w/index.php?oldid=507273858 Contributors: .:Ajvol:., 137.112.129.xxx, 3ICE, 777sms, A. Pichler, Altenmann, Amit man, Anonymous
Dissident, Askewchan, Atlastawake, Avé, AxelBoldt, Basploeger, Bo Jacoby, Boaex, BosRedSox13, Bsskchaitanya, Btyner, CRGreathouse, Calculuslover, CarloWood, Catapult, Cdang, Charles
Matthews, Cherkash, Classicalecon, Cometstyles, CommandoGuard, Connor Behan, Conversion script, Cornince, Cryptography project, DAGwyn, DVD R W, DVdm, Dalahäst, Daniel5Ko,
Danski14, David Eppstein, Dcoetzee, Dejan Jovanović, DejanSpasov, Denelson83, Devoutb3nji, Dgw, Don4of4, Doug Bell, DrBob, Duoduoduo, Dysprosia, E rulez, Ebony Jackson, Edemaine,
Eleuther, Emul0c, Endlessoblivion, Eric119, Excelsiorfireblade, Ferkel, Fredrik, Fresheneesz, Fropuff, Gauge, Ghazer, Giftlite, Graham87, Gzorg, Hashar, Hawthorn, Hede2000, Henri.vanliempt,
Hierakares, Himynameisbrak, Indianaed, Isilando2, Jim Sukwutput, Jim.belk, Jitse Niesen, Jmabel, Joao.pimentel.ferreira, Jobu0101, Joel B. Lewis, Joey96, Josh3580, Jrvz, Jusdafax, Jökullmar,
KSmrq, Kaimbridge, Karl Stroetmann, Kestrelsummer, Keta, King Bee, Kneufeld, Knightry, LakeHMM, Lambiam, Lantonov, Law, Lclem, Linas, Ling.Nut, Llama320, Llamabr, Luke
Gustafson, Mabuhelwa, Macrakis, Madmath789, Magioladitis, Magister Mathematicae, Mangojuice, Marc van Leeuwen, Maxal, Maxwell helper, Mboverload, Mcld, Mcmlxxxi, Mhym, Michael
Hardy, Michael Slone, Mikeblas, Mormegil, Nbarth, Ninly, Nk, Oleg Alexandrov, Oliverknill, Ondra.pelech, Orangutan, Ott2, PMajer, Paolo.dL, Patrick, Paul August, PaulTanenbaum, Pfunk42,
Pgc512, PhotoBox, Postxian, PrimeHunter, Quantling, RHB100, Ragzouken, Rahence, Rhebus, Rich Farmbrough, Rponamgi, Sander123, Sanjaymjoshi, Scharan, Schmock, Sciyoshi, Shelandy,
Shelbymoore3, Simoneau, Sjorford, Small potato, Spacepotato, Sreyan, Stcisthegreatest, Stebulus, Stellmach, Stpasha, TakuyaMurata, Tetracube, Thiago R Ramos, Timwi, Tomi, Twas Now,
Uncle uncle uncle, Uncompetence, Vonkje, WardenWalk, Wavelength, Wellithy, XJamRastafire, Xanthoxyl, Ylloh, Zahlentheorie, Zalethon, Ziyuang, ス マ ス リ ク, 286 anonymous edits
Binomial distribution Source: http://en.wikipedia.org/w/index.php?oldid=505964622 Contributors: -- April, Aarond10, AchatesAVC, AdamRetchless, Ahoerstemeier, Ajs072, AlanUS,
Alexb@cut-the-knot.com, Alexius08, Alzarian16, Anupam, Atemperman, Atlant, AxelBoldt, Ayla, BPets, Baccyak4H, BenFrantzDale, Benwing, Bill Malloy, Blue520, Br43402, Brutha, Bryan
Derksen, Btyner, Can't sleep, clown will eat me, Cburnett, Cdang, Cflm001, Charles Matthews, Chewings72, Conversion script, Coppertwig, Crackerbelly, Cuttlefishy, DaDexter, David
Martland, DavidFHoughton, Daytona2, Deville, Dick Beldin, DrMicro, Dricherby, Duoduoduo, Eesnyder, Elipongo, Eric Kvaalen, Falk Lieder, Fisherjs, Froid, G716, Garde, Gary King,
Gauravm1312, Gauss, Gerald Tros, Giftlite, Gogobera, GorillaWarfare, Gperjim, Graham87, Hede2000, Henrygb, Hirak 99, Ian.Shannon, Ilmari Karonen, Intelligentsium, Iwaterpolo, J04n,
JB82, JEH, JamesBWatson, Janlo, Johnstjohn, Kakofonous, Kmassey, Knutux, Koczy, LOL, Larry_Sanger, LiDaobing, Linas, Lipedia, Ljwsummer, Logan, MC-CPO, MER-C, ML5, MSGJ,
Madkaugh, Mark Arsten, MarkSweep, Marvinrulesmars, Materialscientist, Mboverload, McKay, Meisterkoch, Melcombe, Mhadi.afrasiabi, Michael Hardy, MichaelGensheimer, Miguel,
MisterSheik, Mmustafa, Moseschinyama, Mr Ape, MrOllie, Musiphil, N5iln, N6ne, Nasnema, NatusRoma, Nbarth, Neshatian, New Thought, Nguyenngaviet, Nschuma, Oleg Alexandrov, PAR,
Pallaviagarwal90, Paul August, Ph.eyes, PhotoBox, Phr, Pleasantville, Postrach, PsyberS, Pt, Pufferfish101, Qonnec, Quietbritishjim, Qwertyus, Qwfp, R'n'B, Redtryfan77, Rgclegg, Rich
Farmbrough, Rjmorris, Rlendog, Robinh, Ruber chiken, Sealed123, Seglea, Sigma0 1, Sintau.tayua, Smachet, SoSaysChappy, Spellcast, Stebulus, Steven J. Anderson, Stigin, Stpasha,
Supergroupiejoy, TakuyaMurata, Talgalili, Tayste, Tedtoal, The Thing That Should Not Be, Tim1357, Timwi, Toll booth, Tomi, Tyw7, VectorPosse, Vincent Semeria, Welhaven, Westm,
Wikid77, WillKitch, Wjastle, Xiao Fei, Ylloh, Youandme, ZantTrang, Zmoboros, 385 anonymous edits
Cauchy distribution Source: http://en.wikipedia.org/w/index.php?oldid=505038072 Contributors: 1diot, Abdullah Chougle, Acdx, Albmont, Andycjp, Arthur Rubin, AxelBoldt, Baccyak4H,
Beaumont, Benwing, Bfigura's puppy, BoH, Bryan Derksen, Btyner, Clbustos, Clíodhna-2, Conversion script, Cretog8, DJIndica, Derfugu, Dicklyon, Emilpohl, Fjhickernell, Fnielsen, FrankH,
GGripenberg, Gareth Owen, Giftlite, HEL, Hannes Eder, Headbomb, Henrygb, Heron, Hidaspal, Hxu, IRWolfie-, Igny, Irigi, Jan1nad, Jitse Niesen, K.F., KSmrq, Kribbeh, Kurykh, LOL,
Lambiam, Leendert, Lightst, Marie Poise, MarkSweep, Mathstat, Melchoir, Melcombe, Metacomet, Michael Hardy, Miguel, MisterSheik, MrOllie, Nbarth, Nichtich, O18, Oleg Alexandrov, Ott2,
PAR, PBH, Paul August, Perimosocordiae, PeterC, PhilipHeller55, Pizzadeliveryboy, Purple Post-its, QuantumSquirrel, Quantumor, Quietbritishjim, Qwfp, Rjwilmsi, Rlendog, Robinh,
Rogerbrent, Rolf.turner, Romanm, Rvencio, S2000magician, Sheppa28, Skbkekas, Snoyes, Sterrys, StevenBell, Stpasha, Sławomir Biały, Tensorpudding, Teply, The Anome, Thesilverbail,
Tkuvho, Tomeasy, Tomi, UKoch, WJVaughn3, Weialawaga, Wikid77, ZeroOne, Zeycus, Zundark, Zvika, 119 anonymous edits
Cauchy–Schwarz inequality Source: http://en.wikipedia.org/w/index.php?oldid=504201354 Contributors: 209.218.246.xxx, ARUNKUMAR P.R, Alcapwned86, AleksaStankovic,
Alexmorgan, Algebra123230, Andrea Allais, AnyFile, Arbindkumarlal, Arjunarikeri, Arved, Avia, AxelBoldt, BWDuncan, Belizefan, Bender235, Bh3u4m, Bkell, Brad7777, CJauff, CSTAR,
Cbunix23, Chas zzz brown, Chinju, Christopherlumb, Conversion script, Cyan, D6, Dale101usa, Dicklyon, DifferCake, Dodger67, Dratman, EagleScout, Eliosh, EugenG, FelixP, Gauge, Giftlite,
GoingBatty, Graham87, Haham hanuka, HairyFotr, Hans Lundmark, Headbomb, Hede2000, Iamthewinnerofthemonth, Jackzhp, Jmsteele, JohnBlackburne, JohnMathTeacher, Justin W Smith,
Katzmik, Kawanoz, Krakhan, Madmath789, Marc van Leeuwen, MarkSweep, Martijn Hoekstra, MathMartin, Mathtyke, Maxzimet, Mboverload, Mct mht, Melcombe, Memming, Mertdikmen,
Mhym, Michael Hardy, Michael Slone, Microcell, Miguel, Missyouronald, Nicol, Njerseyguy, Nsk92, Okaygecko, Oleg Alexandrov, Omnipaedista, Orange Suede Sofa, Orioane, Pan Chenguang,
Paul August, PaulGarner, Phil Boswell, Phys, Prari, Primalbeing, Q0k, Quietbritishjim, Qwfp, R.e.b., Rajb245, Rchase, Reflex Reaction, Rjwilmsi, S tyler, Salgueiro, Sbyrnes321, Schutz, Shlav,
Simonchubocka, Sl, SlipperyHippo, Small potato, Sodin, Somewikian, StevinSimon, Stevo2001, Sverdrup, Sławomir Biały, TakuyaMurata, TedPavlic, Teorth, Tercer, The suffocated,
ThibautLienart, Tobias Bergemann, TomyDuby, Tosha, Uncia, Vaughan Pratt, Vovchyck, Xentram, Zdorovo, Zenosparadox, Zundark, Zuphilip, Zvika, 159 anonymous edits
Characteristic function (probability theory) Source: http://en.wikipedia.org/w/index.php?oldid=503465626 Contributors: Aastrup, Aberdeen01, Aetheling, Albmont, Amonet, AndrewHowse,
Baccyak4H, Bdmy, Benwing, Bwilkins, Cwtyler, First Harmonic, Fnsteed, Giftlite, Ideal gas equation, JA(000)Davidson, Jamelan, James086, Jason Goldstick, Jeff G., Jk350, K ganju, Khazar2,
Kku, LOL, Lambiam, Lovibond, MLauba, MagnusPI, Maksim-e, Mathieu Perrin, Mathstat, MehdiPedia, Melcombe, Michael Hardy, Nbarth, PV=nRT, Peiresc, Quantling, Qwfp, Rabbanis,
Rlendog, Robinh, Stpasha, Thenub314, Tktktk, TomyDuby, Tsirel, Ulamgamer, Unbitwise, Vincent Semeria, Xiawi, 61 , לאף טוףanonymous edits
Chernoff bound Source: http://en.wikipedia.org/w/index.php?oldid=504236001 Contributors: 3mta3, A3 nm, CALR, CSTAR, Cburnett, Dcoetzee, DrMicro, Ece8950, Emurphy42, Falsifian,
Giftlite, Headbomb, Jalanpalmer, Jduchi, JerroldPease-Atlanta, Jittat, Josh Parris, Kilom691, Kku, MarkSweep, Meand, Melcombe, Michael Hardy, Muyiwamc2, Pmsyyz, Qjqflash3, Radagast83,
Rchandan, Rjwilmsi, Sadeq, Sidawang, Sodin, Stijn Vermeeren, Svnsvn, Wullj, Ylloh, ﻣﺎﻧﻲ, 67 anonymous edits
Chi-squared distribution Source: http://en.wikipedia.org/w/index.php?oldid=507956032 Contributors: A.R., AaronSw, AdamSmithee, Afa86, Alvin-cs, Analytics447, Animeronin, Ap,
AstroWiki, AxelBoldt, BenFrantzDale, BiT, Blaisorblade, Bluemaster, Bryan Derksen, Btyner, CBM, Cburnett, Chaldor, Chris53516, Constructive editor, Control.valve, DanSoper, Dbachmann,
Dbenbenn, Den fjättrade ankan, Dgw, Digisus, DrMicro, Drhowey, EOBarnett, Eliel Jimenez, Eliezg, Emilpohl, Etoombs, Ettrig, Fergikush, Fibonacci, Fieldday-sunday, FilipeS, Fintor, G716,
Gaara144, Gauss, Giftlite, Gperjim, Henrygb, Herbee, Hgamboa, HyDeckar, Iav, Icseaturtles, Isopropyl, It Is Me Here, Iwaterpolo, J-stan, Jackzhp, Jaekrystyn, Jason Goldstick, Jdgilbey, Jitse
Niesen, Johnlemartirao, Johnlv12, Jspacemen01-wiki, Kastchei, Knetlalala, KnightRider, Kotasik, LeilaniLad, Leotolstoy, LilHelpa, Lixiaoxu, Loodog, Loren.wilton, Lovibond, MATThematical,
MER-C, MarkSweep, Markg0803, Master of Puppets, Materialscientist, Mcorazao, Mdebets, Melcombe, Mgiganteus1, Michael Hardy, Microball, Mikael Häggström, Mindmatrix, MisterSheik,
MrOllie, MtBell, Nbarth, Neon white, Nm420, NocturneNoir, Notatoad, O18, Oleg Alexandrov, PAR, Pabristow, Pahan, Paul August, Paulginz, Pet3ris, Philten, Pstevens, Qiuxing, Quantling,
Quietbritishjim, Qwfp, Rflrob, Rich Farmbrough, Rigadoun, Robinh, Ronz, Saippuakauppias, Sam Blacketer, SamuelTheGhost, Sander123, Schmock, Schwnj, Seglea, Shadowjams, Sheppa28,
Shoefly, Sietse, Silly rabbit, Sligocki, Stephen C. Carlson, Steve8675309, Stpasha, Talgalili, Tarkashastri, The Anome, TheProject, TimBentley, Tom.Reding, Tombomp, Tomi, TomyDuby,
Tony1, U+003F, User A1, Volkan.cevher, Wasell, Wassermann7, Weialawaga, Willem, Wjastle, Xnn, Zero0000, Zfr, Zvika, 295 anonymous edits
Computational complexity of mathematical operations Source: http://en.wikipedia.org/w/index.php?oldid=506021956 Contributors: Barticus88, Berland, Branger, CRGreathouse, Coma28,
EmilJ, Fangz, Fbahr, Fredrik, Giftlite, Halo, HappyCamper, Headbomb, Hermel, Hmonroe, Jafet, Jalal0, Jitse Niesen, Kaluppollo, Kri, Kuleebaba, Marek69, Mike40033, Mm, Optikos, Pdokj,
Prosfilaes, Qwertyus, Rjwilmsi, RobinK, TerraFrost, WhiteDragon, 32 anonymous edits
Conjugate prior Source: http://en.wikipedia.org/w/index.php?oldid=506468521 Contributors: Aetheling, Avraham, Azylber, Bazugb07, Benwing, Blueyoshi321, Closedmouth, Dave44000,
Debejyo, Den fjättrade ankan, Dmeburk, Dstivers, Eisber, Gautam raw, Giftlite, Gnathan87, Gosselinf, Heneryville, Henrygb, Hilgerdenaar, Jitse Niesen, Junling, Jurgen, Kastchei, Kierano,
Kupirijo, Kzollman, LessHeard vanU, Lscharen, MarkSweep, Mavam, Melcombe, Michael Hardy, MisterSheik, Nbarth, Ninjagecko, Nparikh, Occawen, Ogo, Oleg Alexandrov, Paulpeeling, Phil
Boswell, Qwfp, Rjwilmsi, Schutz, SkeletorUK, Stimakov, Struway, Tcrykken, Tomyumgoong, Wile E. Heresiarch, Yunus.saatci, 60 anonymous edits
Continuous mapping theorem Source: http://en.wikipedia.org/w/index.php?oldid=491563088 Contributors: Giftlite, Headbomb, Jmath666, Lifebonzza, Luizabpr, Melcombe, Rjwilmsi,
Stpasha, 1 anonymous edits
Convergence of random variables Source: http://en.wikipedia.org/w/index.php?oldid=504588871 Contributors: A. Pichler, Aastrup, Albmont, Amir Aliev, Andrea Ambrosio, Ardonik,
AxelBoldt, Bjcairns, Brad7777, CALR, Cenarium, Chungc, Constructive editor, DHN, David Eppstein, Deepakr, Derveish, DrWhitechalk, Ensign beedrill, Fpoursafaei, Fram, GaborPete, Giftlite,
Headbomb, HyDeckar, Igny, J04n, JA(000)Davidson, JamesBWatson, Jarble, Jmath666, Jncraton, Josh Guffin, Josh Parris, Kolesarm, Kri, Landroni, LoveOfFate, McKay, Melcombe, Mets501,
Michael Hardy, Notedgrant, Oleg Alexandrov, Paul August, Philtime, PigFlu Oink, Prewitt81, Qwfp, Ricardogpn, Robin S, Robinh, Sam Hocevar, SchreiberBike, Sligocki, Spireguy, Steve
Kroon, Stigin, Stpasha, Sullivan.t.j, Szepi, TedPavlic, The Mad Echidna, Voorlandt, Wartoilet, Weierstraß, Zvika, 120 anonymous edits
Convergent series Source: http://en.wikipedia.org/w/index.php?oldid=496768666 Contributors: 16@r, AdamSmithee, Alansohn, Bo Jacoby, Brad7777, BradBeattie, Chilkes, CiaPan,
Crasshopper, Dan Hoey, David Radcliffe, Dbsanfte, Dhollm, Dreamfall, Etscrivner, Feydey, Fmtmaster, Giftlite, Glenn L, GraemeMcRae, Henrygb, Jizzpus, Jkimath, JohnBlackburne, Jrtayloriv,
Kan8eDie, Maksim-e, MathMartin, Matman from Lublin, Maxal, Melchoir, Michael Hardy, Mox83, NatusRoma, Netheril96, Oleg Alexandrov, Oli Filth, PV=nRT, PabloCastellano, Patrick, Paul
August, Petr Dlouhý, R'n'B, Rejnal, Robo37, Rpchase, Rprimmer, Ryan Reich, Scarlet Lioness, Titoxd, Trovatore, Vinnie2k, Wikeepedian, William M. Connolley, 50 anonymous edits
Copula (probability theory) Source: http://en.wikipedia.org/w/index.php?oldid=506844751 Contributors: 2001:718:2:1634:21C:25FF:FE1A:58B0, A. Pichler, Alanb, Albmont, Alg543,
Amir9a, AndrewDressel, Andrewpmack, AnonMoos, Aptperson, ArséniureDeGallium, Asjogren, Asnelt, Avraham, Baartvaark, BenFrantzDale, Bender235, Benjaminveal, Bkcurrier, Caco21,
Cecody, Charles Matthews, Cherkash, Christofurio, ChristophE, Clément Pillias, CopulaTomograph, Crasshopper, De2k, Den fjättrade ankan, Derex, Diegotorquemada, Edward,
Fabrizio.durante, Favonian, Feraudyh, GUONing, Gabeornelas, Gbohus, Gene Nygaard, Giftlite, Helgus, Hu12, Ikiwdq55, Indiv55, Inference, JA(000)Davidson, Jeffq, Jer ome, Jitse Niesen,
KHamsun, Kimys, Kmanoj, Magicmike, Martinp, Martyndorey, Mcld, Melcombe, Michael Hardy, Mwarren us, Nelson50, Nutcracker, Oleg Alexandrov, Omnipaedista, Ossiemanners, Paul H.,
PeterSarkoci, Philtime, Piloter, Qwfp, Rajah9, Roadrunner, Robertschulze, SHINIGAMI-HIKARU, Scwarebang, Sflanker, Shuetrim, Skbkekas, Sodin, Srini121, Stigin, Sumple, SymmyS, The
dood 7475, Tomixdf, Vvarkey, Waldir, Woohookitty, Woutersmet, Yonkeltron, Zasf, Zsolt Tulassay, Zundark, 164 anonymous edits
Coupon collector's problem Source: http://en.wikipedia.org/w/index.php?oldid=483443997 Contributors: Adriantam, CRGreathouse, Credema, David Eppstein, Geometry guy, Giftlite,
Haruth, Igorpak, Joriki, Kmhkmh, Leo euler, Magicmike, Melcombe, Mhym, Michael Hardy, Natema, Piotrm, Pleasantville, Shreevatsa, Urhixidur, Zahlentheorie, 20 anonymous edits
Degrees of freedom (statistics) Source: http://en.wikipedia.org/w/index.php?oldid=507675614 Contributors: Acalamari, Alaexis, Antonrojo, Askari Mark, Baccyak4H, Banaman1, BirgitteSB,
BradBeattie, Btyner, Cherkash, Chris53516, Cruise, Dabomb87, Dcljr, Dingar, Fangz, Fgnievinski, Forsakendaemon, G716, Gherardo, Giftlite, Goblin5, Gökhan, Hajhouse, Hede2000,
Hobartimus, Icairns, Iridescent, J.delanoy, Jgillenw, Jitse Niesen, Jmkim dot com, Jmnbatista, Joelr31, Jtneill, Kf4bdy, Kilva, Kjtobo, KnightRider, Koavf, Laeviar, LilHelpa, LindaGlass,
LindaGlass25, Male1979, Melcombe, Mfield, Michael Hardy, Mindmatrix, Missdipsy, Mmanneva, Nemeths, Oleg Alexandrov, Patrick, Pearle, Piotrus, Purpleidea, Qwfp, Ratcho, Rich
Farmbrough, Rituparnadas, Rjwilmsi, Rlsheehan, SPiNoZA, Salih, Sam Blacketer, Seervoitek, Sophus Bie, SteveChervitzTrutane, Stevemiller, Sun Creator, T8191, TheSuave, Troy 07, Vishnava,
Vizcarra, Vugluskr, Wavelength, Wikiacc, Zaqrfv, Zink Dawg, Zven, 137 anonymous edits
Determinant Source: http://en.wikipedia.org/w/index.php?oldid=507622576 Contributors: 01001, 165.123.179.xxx, A-asadi, A. B., AbsolutDan, Adam4445, Adamp, Ae77, Ahoerstemeier,
AlanUS, Alex Sabaka, Alexander Chervov, Alexandre Duret-Lutz, Alexandre Martins, Algebraist, Alison, Alkarex, Alksub, Anakata, Andres, Anonymous Dissident, Anskas, Ardonik,
ArnoldReinhold, Arved, Asmeurer, AugPi, AxelBoldt, BPets, Balagen, Barking Mad142, BenFrantzDale, Bender2k14, Benji9072, Benwing, Betacommand, Big Jim Fae Scotland, BjornPoonen,
BrianOfRugby, Bryan Derksen, Burn, CBM, CRGreathouse, Campuzano85, Camrn86, Carbonrodney, Catfive, Cbogart2, Ccandan, Cesarth, Charles Matthews, Chester Markel, Chewings72,
Chocochipmuffin, Christopher Parham, Cjkstephenson, Closedmouth, Cmmthf, Cobi, Coffee2theorems, Cokaban, Connelly, Conversion script, Cowanae, Crasshopper, Cronholm144, Crystal
whacker, Cthulhu.mythos, Cwkmail, Danaman5, Dantestyrael, Dark Formal, Datahaki, Dcoetzee, Delirium, Demize, Dmbrown00, Dmcq, Doctormatt, Dysprosia, EconoPhysicist, Elphion,
Eniagrom, Entropeneur, Epbr123, Euphrat1508, EverettYou, Executive Outcomes, Ffatossaajvazii, Fredrik, Fropuff, Gauge, Gejikeiji, Gene Ward Smith, Gershwinrb, Giftlite, Graham87,
GrewalWiki, Greynose, Guiltyspark, Gwernol, Hangitfresh, Headbomb, Heili.brenna, HenkvD, HenningThielemann, Hlevkin, Ian13, Icairns, Ijpulido, Ino5hiro, Istcol, Itai, JC Chu, JJ Harrison,
JackSchmidt, Jackzhp, Jagged 85, Jakob.scholbach, Jasonevans, Jeff G., Jemebius, Jerry, Jersey Devil, Jewbacca, Jheald, Jim.belk, Jitse Niesen, Joejc, Jogers, Johnuniq, Jondaman21, Jordgette,
Joriki, Josp-mathilde, Josteinaj, Jrgetsin, Jshen6, Juansempere, Justin W Smith, Kaarebrandt, Kallikanzarid, Kaspar.jan, Kd345205, Khabgood, Kingpin13, Kmhkmh, Kokin, Kstueve, Kunal
Bhalla, Kurykh, Kwantus, LAncienne, LOL, Lagelspeil, Lambiam, Lavaka, Leakeyjee, Lethe, Lhf, Lightmouse, LilHelpa, Logapragasan, Luiscardona89, MackSalmon, Marc van Leeuwen,
Marek69, Marozols, MartinOtter, MathMartin, McKay, Mcconnell3, Mcstrother, Mdnahas, Merge, Mets501, Michael Hardy, Michael P. Barnett, Michael Slone, Mikael Häggström, Mild Bill
Hiccup, Misza13, Mkhan3189, Mmxx, Mobiusthefrost, Mrsaad31, Msa11usec, MuDavid, Myshka spasayet lva, N3vln, Nachiketvartak, NeilenMarais, Nekura, Netdragon, Nethgirb, Netrapt,
Nickj, Nicolae Coman, Nistra, Nsaa, Numbo3, Obradovic Goran, Octahedron80, Oleg Alexandrov, Oli Filth, Paolo.dL, Patamia, Patrick, Paul August, Pedrose, Pensador82, Personman, PhysPhD,
Pigei, Priitliivak, Protonk, Pt, Quadell, Quadrescence, Quantling, Quondum, R.e.b., RDBury, RIBEYE special, Rayray28, Rbb l181, Recentchanges, Reinyday, RekishiEJ, RexNL, Rgdboer, Rich
Farmbrough, Robinh, Rocchini, Roentgenium111, Rogper, Rpchase, Rpyle731, Rumblethunder, SUL, Sabri76, Salgueiro, Sandro.bosio, Sangwine, Sayahoy, SchreiberBike, Shai-kun, Shreevatsa,
Siener, Simon Sang, SkyWalker, Slady, Smithereens, Snoyes, Spartan S58, Spireguy, Spoon!, Ssd, Stdazi, Stefano85, Stevenj, StradivariusTV, Sun Creator, Supreme fascist, Swerdnaneb,
SwordSmurf, Sławomir Biały, T8191, Tarif Ezaz, Tarquin, Taw, TedPavlic, Tegla, Tekhnofiend, Tgr, The Thing That Should Not Be, TheEternalVortex, TheIncredibleEdibleOompaLoompa,
Thehelpfulone, Thenub314, Timberframe, Tobias Bergemann, TomViza, Tosha, Trashbird1240, TreyGreer62, Trifon Triantafillidis, Trivialsegfault, Truthnlove, Ulisse0, Unbitwise, Urdutext,
Vanka5, Vincent Semeria, Wellithy, Wik, Wirawan0, Wolfrock, Woscafrench, Wshun, Xaos, Zaslav, Ztutz, Zzedar, ^demon, 439 anonymous edits
Dirichlet distribution Source: http://en.wikipedia.org/w/index.php?oldid=507653009 Contributors: A5, Adfernandes, Amit Moscovich, Azeari, Azhag, BSVulturis, Barak, Ben Ben,
Bender2k14, Benwing, BrotherE, Btyner, Charles Matthews, ChrisGualtieri, Coffee2theorems, Crasshopper, Cretog8, Daf, Dlwhall, Drevicko, Dycotiles, Erikerhardt, Finnancier, Franktuyl,
Frigyik, Giftlite, Gxhrid, Herve1729, Ipeirotis, Ivan.Savov, J04n, Josang, Kzollman, Liuyipei, M0nkey, Maiermarco, Mandarax, MarkSweep, Mathknightapprentice, Mathstat, Mavam, Mcld,
Melcombe, Michael Hardy, MisterSheik, Mitch3, Myasuda, Nbarth, Oscar Täckström, Prasenjitmukherjee, Qwfp, Robinh, Rvencio, Salgueiro, Saturdayswiki, Schmock, Shaharsh, Sinuhet,
Slxu.public, Tomi, Tomixdf, Tuonawa, Wavelength, Whosasking, Wjastle, Wolfman, Zvika, 75 anonymous edits
Effect size Source: http://en.wikipedia.org/w/index.php?oldid=503543559 Contributors: 2610:130:103:D00:B10D:288C:C17D:A470, Abdul raja, Aldo samulo, Brainfsck, Ceyockey, Cherkash,
Chinju, Chris53516, Cronholm144, DanSoper, DanielCD, Davewho2, Dcoetzee, Devilly, Dfrg.msc, Dhaluza, Dogface, ESkog, Es uomikim, FlssSUT, Fnielsen, Freerow@gmail.com, Friend of
facts, Fugyoo, G716, Gary Cziko, George Burgess, Giftlite, Heathkeeper, Ian Pitchford, Ibpassociation, Igoldste, IncognitoErgoSum, Ioannes Pragensis, Isaac Dupree, J.delanoy,
JA(000)Davidson, JagDragon, Janarius, Jncraton, John Quiggin, JonathanWilliford, Jonkerz, JorisvS, JoshuaEyer, Jrdioko, Jthetzel, Jtneill, Kastchei, Kiefer.Wolfowitz, Kieff, Leonbax, LilHelpa,
Lixiaoxu, MarcoLittel, Mcdomell, Melcombe, Michael Hardy, Mikael Häggström, Mspraveen, Mycatharsis, NYC2TLV, Nesbit, NiallB, Ostracon, PE, Pgan002, Ph.eyes, Pi3832, Quercus
basaseachicensis, Qwfp, Rdledesma, Rgclegg, Rich Farmbrough, Rjwilmsi, Rlew234, Robert P. O'Shea, RyanCross, Sam Hocevar, Sammcq, Scarian, Schmitta1573, Schwnj, Sean a wallis,
Seedre, Shineydiamond, Skbkekas, Sue Wigg, Supernaut76, Talgalili, Tedtoal, Tillmo, Tim bates, Trontonian, Ultramarine, Wmahan, Xomyork, XpXiXpY, 145 anonymous edits
Erlang distribution Source: http://en.wikipedia.org/w/index.php?oldid=499725168 Contributors: Acuster, Afa86, Aranel, Autopilot, Avatar, Basten, Bobo192, Bryan Derksen, Bullmoose953,
Calltech, Cbauckhage, Cnmirose, CraigNYC, Derisavi, Diogenes404, Donmegapoppadoc, DudMc3, Giftlite, Gustavf, Ian Geoffrey Kennedy, Iwaterpolo, Jbfung, Jim.henderson, Joshdick,
Jrdioko, Jwortman, Kastchei, Kitsonk, Luckyz, Mange01, MarkSweep, Maths2412, McKay, Melcombe, Michael Hardy, MisterSheik, Myleslong, PAR, Pbruel, Pichote, Qwfp, RHaworth,
Rmashhadi, Salsa Shark, Sheppa28, Stangaa, TedPavlic, User A1, Welsh, Zvika, 93 anonymous edits
Expectation–maximization algorithm Source: http://en.wikipedia.org/w/index.php?oldid=507971854 Contributors: 2405:B000:600:262:0:0:36:61, 3mta3, A876, Aanthony1243, Alex
Kosorukoff, Alex Selby, AnAj, Andyrew609, BAxelrod, Benwing, Bigredbrain, Bilgrau, Bkkbrad, Blaisorblade, Bluemoose, BradBeattie, Btyner, Cataphract, Cburnett, Chire, Daviddoria, Derek
farn, Dicklyon, Douglas-Lanman, Dropsciencenotbombs, Edratzer, Erhanbas, Eric Kvaalen, Finfobia, GeypycGn, Giftlite, Glopk, GongYi, Hakeem.gadi, Hbeigi, Headbomb, Hike395, Hild,
Ismailari, Iwaterpolo, Jakarr, Jamshidian, Jheald, Jjmerelo, Jmc200, Joeyo, John Vandenberg, Jrouquie, Jszymon, Jwmarck, KYPark, Kallerdis, Karada, Kiefer.Wolfowitz, Klutzy, Ladypine,
Lamro, Lavaka, LeonardoWeiss, Libro0, M.A.Dabbah, Maechler, MarkSweep, Market Efficiency, Mcld, Melcombe, Michael Hardy, MisterSheik, Mosaliganti1.1, Nageh, Nbarth, Nils Grimsmo,
Nocheenlatierra, Numbo3, O18, Onco p53, Orderud, Osquar F, Owenman, Phil Boswell, Pine900, Piotrus, Pratx, Qiemem, Qwerty9967, Rama, Requestion, Richard Bartholomew, Rjwilmsi,
RobHar, Robbyjo, Rodrigob, Rusmike, Rxnt, Régis B., Salih, Schmock, Sitush, Skittleys, Slon02, Statna, Stpasha, Sunjuren, Talgalili, Tambal, Tekhnofiend, Tiedyeina, User A1, Vadmium, Wile
E. Heresiarch, Yasuo2, Yogtad, Zzpmarco, Ɯ, 155 anonymous edits
Exponential distribution Source: http://en.wikipedia.org/w/index.php?oldid=504270643 Contributors: 2610:10:20:216:225:FF:FEF4:CAAF, A.M.R., A3r0, ActivExpression, Aiden Fisher,
Amonet, Asitgoes, Avabait, Avraham, AxelBoldt, Bdmy, Beaumont, Benwing, Boriaj, Bryan Derksen, Btyner, Butchbrody, CD.Rutgers, CYD, Calmer Waters, CapitalR, Cazort, Cburnett,
Closedmouth, Coffee2theorems, Cyp, Dcljr, Dcoetzee, Decrypt3, Den fjättrade ankan, Dudubur, Duoduoduo, Edward, Enchanter, Erzbischof, Fvw, Gauss, Giftlite, GorillaWarfare,
Grinofadrunkwoman, Headbomb, Henrygb, Hsne, Hyoseok, IanOsgood, Igny, Ilmari Karonen, Isheden, Isis, Iwaterpolo, Jason Goldstick, Jester7777, Johndburger, Kan8eDie, Kappa,
Karl-Henner, Kastchei, Kyng, LOL, MStraw, MarkSweep, Markjoseph125, Mattroberts, Mcld, Mdf, MekaD, Melcombe, Memming, Michael Hardy, Mindmatrix, MisterSheik, Monsterman222,
Mpaa, Mwanner, Nothlit, Oysindi, PAR, Paul August, Qwfp, R'n'B, R.J.Oosterbaan, Remohammadi, Rich Farmbrough, Rp, Scortchi, Sergey Suslov, Shaile, Sheppa28, Shingkei, Skbkekas,
Skittleys, Smack, Spartanfox86, Sss41, Stpasha, Taral, Taw, The Thing That Should Not Be, Thegeneralguy, TimBentley, Tomi, UKoch, Ularevalo98, User A1, Vsmith, WDavis1911, Wilke,
Wjastle, Woohookitty, Wyatts, Yoyod, Z.E.R.O., Zeno of Elea, Zeycus, Zvika, Zzxterry, 210 anonymous edits
F-distribution Source: http://en.wikipedia.org/w/index.php?oldid=484010808 Contributors: Adouzzy, Albmont, Amonet, Art2SpiderXL, Arthena, Bluemaster, Brenda Hmong, Jr, Bryan
Derksen, Btyner, Califasuseso, Cburnett, DanSoper, DarrylNester, Dysprosia, Elmer Clark, Emilpohl, Ethaniel, Fnielsen, Ged.R, Giftlite, Gperjim, Hectorlamadrid, Henrygb, Jan eissfeldt, Jitse
Niesen, JokeySmurf, Kastchei, Livingthingdan, MarkSweep, Markjoseph125, Materialscientist, Mdebets, Melcombe, Michael Hardy, MrOllie, Nehalem, O18, Oscar, PBH, Quietbritishjim, Qwfp,
Robinh, Salix alba, Seglea, Sheppa28, TedE, The Squicks, Timholy, Tom.Reding, Tomi, TomyDuby, UKoch, Unyoyega, Zorgkang, 50 anonymous edits
F-test Source: http://en.wikipedia.org/w/index.php?oldid=503070399 Contributors: Adam Lyle Taylor, Alexandrov, Andre Engels, Berland, Btyner, Cached, Cherkash, CryptoDerk, Danger,
Davinchicode, Dfarrar, Dloeckx, Dlrohrer2003, Drilnoth, Edstat, Feinstein, Geinitz, Giftlite, Guoguo12, HazeNZ, J. J. F. Nau, Jayen466, Jollyroger131, JoseMires, Kgres, Kiefer.Wolfowitz,
Kolyma, Kudret abi, Landroni, Mathstat, Melcombe, Michael Hardy, Mike.lifeguard, Miserlou, Namkyu, Nostato, Oleg Alexandrov, PartierSP, Piotrus, Potamites, Qwfp, Rjwilmsi, Rkandilarov,
Salix alba, SalvNaut, Seaphoto, Skbkekas, Smith609, Spaceman1979, Szczepanh, Tekhnofiend, Theking2, Thesilverbail, TomyDuby, Tristanb, Uwhoff, Valley2390, Vapniks, Velocidex,
Vt007ken, Vulturelainen, Wolf87, Xtrememachineuk, Yeehoong, 101 anonymous edits
Fisher information Source: http://en.wikipedia.org/w/index.php?oldid=506160066 Contributors: Aaronchall, Acidador, Adfernandes, Agl, Amir8797, Arthur Rubin, Barak, Camrn86,
Capricorn42, Cazort, Cburnett, Chris Howard, Cocomo-jp, Coffee2theorems, Cyan, DRHagen, Den fjättrade ankan, Eric Kvaalen, Eug, Fangz, Flammifer, Florian Huber, Freerow@gmail.com,
Giftlite, Gveda, Holon, Icep, Icosmology, John of Reading, Josh Parris, Josuechan, Jpillow, Kiefer.Wolfowitz, Lgallindo, Linas, Lupin, Matt.voroney, Mdf, Mebden, Melcombe, Michael Hardy,
Nathanielvirgo, Nbarth, OliverObst, Paolo.dL, PhysPhD, Physicistjedi, Polymath1976, Quantling, Qwfp, Rjwilmsi, Robinh, Sigmundg, Skittleys, Sohale, Soosed, Stpasha, Taxman, TedE, The
Anome, Thrasibule, Tobias Bergemann, Tomixdf, Violetriga, Vsmith, Wangandibeijing, WaysToEscape, Wikomidia, Winterfors, Wolf87, X42bn6, Yahya Abdal-Aziz, Zouxiaohui, Zvika, ירון,
93 anonymous edits
Fisher's exact test Source: http://en.wikipedia.org/w/index.php?oldid=502320370 Contributors: 3mta3, AgarwalSumeet, Antelan, Archimerged, Australisergosum, Avenue, Baccyak4H,
Beetstra, Bobcoyote, Btyner, Bueller 007, CWenger, Cannin, Cbogart2, Charles Matthews, Chris the speller, Chris53516, Cyclist, David Eppstein, David.Gross, Den fjättrade ankan, Dfarrar,
Douglas R. White, Eric Kvaalen, Giftlite, Helohe, Hughjonesd, Ispy1981, Jia.meng, John Vandenberg, Kastchei, Kbh3rd, Kenta2, Kierano, Lfriedl, Lixiaoxu, Malhonen, MarkSweep, Melcombe,
Michael Hardy, Mikeblas, Moverly, Nbarth, Pgan002, Ph.eyes, Phatsphere, Rich Farmbrough, Rjwilmsi, Robinh, RupertMillard, Scentoni, Schwnj, Seb951, Seglea, ShotgunApproach, Shyamal,
Sjoosse, Skbkekas, Slartibarti, Talgalili, TedE, Thincat, Tim bates, TimBentley, Tkirkman, Tomi, Wavelength, Welhaven, 60 anonymous edits
Gamma distribution Source: http://en.wikipedia.org/w/index.php?oldid=504005178 Contributors: A5, Aastrup, Abtweed98, Adam Clark, Adfernandes, Albmont, Amonet, Aple123,
Apocralyptic, Arg, Asteadman, Autopilot, Baccyak4H, Barak, Bdmy, Benwing, Berland, Bethb88, Bo Jacoby, Bobmath, Bobo192, Brenton, Bryan Derksen, Btyner, CanadianLinuxUser,
CapitalR, Cburnett, Cerberus0, ClaudeLo, Cmghim925, Complex01, Darin, David Haslam, Dicklyon, Dlituiev, Dobromila, Donmegapoppadoc, DrMicro, Dshutin, Entropeneur, Entropeter,
Erik144, Eug, Fangz, Fnielsen, Frau K, Frobnitzem, Gaius Cornelius, Gandalf61, Gauss, Giftlite, Gjnaasaa, Henrygb, Hgkamath, Iwaterpolo, Jason Goldstick, Jirka6, Jlc46, JonathanWilliford,
Jshadias, Kastchei, Langmore, Linas, Lovibond, LoyalSoldier, LukeSurl, Luqmanskye, MarkSweep, Mathstat, Mcld, Mebden, Melcombe, Mich8611, Michael Hardy, MisterSheik, Mpaa,
MrOllie, MuesLee, Mundhenk, Narc813, Nickfeng88, O18, PAR, PBH, Patrke, Paul Pogonyshev, Paulginz, Phil Boswell, Pichote, Popnose, Qiuxing, Quietbritishjim, Qwfp, Qzxpqbp, RSchlicht,
Robbyjo, Robinh, Rockykumar1982, Samsara, Sandrobt, Schmock, Smmurphy, Stephreg, Stevvers, Sun Creator, Supergrane, Talgalili, Tayste, TestUser001, Thomas stieltjes, Thric3,
Tom.Reding, Tomi, Tommyjs, True rover, Umpi77, User A1, Vminin, Wavelength, Wiki me, Wiki5d, Wikid77, Wile E. Heresiarch, Wjastle, Xuehuit, Zvika, 313 anonymous edits
Gamma function Source: http://en.wikipedia.org/w/index.php?oldid=507824244 Contributors: 1exec1, 209.218.246.xxx, 65.197.2.xxx, A. Pichler, Adriaan Joubert, Adselsum, Alamino,
Alansohn, Alejo2083, Alex Dainiak, Aliotra, Amir bike, Ams80, Anonymous Dissident, Apophos, Arabic Pilot, Arthur Rubin, Ashley Y, Asymptoticus, Atlien, AugPi, AxelBoldt, B75a, BRG,
Baccyak4H, Beeson, Ben Tillman, Bmusician, Bob K, Brad7777, Bromskloss, Bryan Derksen, Bubba73, Bubbha, CBM, CRGreathouse, Casey Abell, Charles Matthews, Cheese Sandwich,
Chinju, Chortos-2, Christian75, Closedmouth, Coasterlover1994, Conversion script, Cybercobra, DaveFoster110@hotmail.com, David Shay, Davius, Dcoetzee, Dicklyon, Didi12321,
Discospinster, Dmr2, Dojarca, Domitori, Dpmathmajor112, Drjt87, Dubosen, Dysprosia, Dzordzm, EdJohnston, Elharo, Ellywa, Evil saltine, Excirial, Eyrryds, Favonian, Feinstein, Fintor, Fred
Stober, Fredrik, Frencheigh, Fresheneesz, Gamesguru2, Gauss, Geeklizzard, Gene Ward Smith, Gesslein, Giftlite, Glenn L, GraemeL, Graham87, GregorB, Gulliveig, Hairy Dude, Hannes Eder,
Hao2lian, HappyCamper, HappyInGeneral, Herbee, Hypercube, Inquisitus, J6w5, JJL, JabberWok, Jackzhp, James mcl, JamesBWatson, Jdgilbey, JensG, Jfmantis, Joelphillips, John aveas,
Joke137, JonDePlume, Josevellezcaldas, Jshadias, Jsondow, Julian Brown, Junling, Justin W Smith, Kausikghatak, Kc135, Khinchin's constant, Kiensvay, Kit Cloudkicker, Kmarawer, Lambiam,
Linas, Looxix, Maksim-e, Marc van Leeuwen, Markhurd, Materialscientist, MathHisSci, Mathmo3141592653589, Maurice Carbonaro, Melchoir, Michael Hardy, Miha Ulanov, Mimpian228,
MuDavid, Muller spiegel, Murtasa, Nbarth, Ndenison, Nicolas Bray, Nohat, Nono64, Norm mit, Nozzer42, ObsessiveMathsFreak, Octaazacubane, Octahedron80, Oleg Alexandrov, Oliphaunt,
OneWeirdDude, Outriggr, PAR, PMajer, Pabristow, Pagw, Paul Pogonyshev, Pcb21, Peak, Phreed, Pmanderson, Policron, Poor Yorick, Powermath, Pra1998, Prince Max (scientist), Pt,
Quantling, Qwfp, R. J. Mathar, R.e.b., RDBury, Rbj, RedWolf, Reddi, Rgdboer, RobHar, Robinh, RogierBrussee, Rohan Ghatak, Romanm, Rp, Sabbut, Salgueiro, Sam Derbyshire, Sandrobt,
Scythe33, Senalba, ServiceAT, Setreset, Shadowjams, Singularity, Slawekb, Sligocki, Stan Sykora, Stevenj, Sverdrup, TakuyaMurata, Tal physdancer, Tarquin, Tassedethe, Taxman, The new
math, Tide rolls, Tobias Bergemann, ToddDeLuca, Tom Buktu, Tomi, TomyDuby, Tuhinsubhrakonar, Uranographer, Van helsing, Vanished User 0001, Vinícius Machado Vogt, Waabu, Warut,
Wavelength, WiiStation360, Wile E. Heresiarch, Wtuvell, YelloWord, ZakTek, Zero0000, Zero2ninE, Zstk, Ztar, 285 anonymous edits
Geometric distribution Source: http://en.wikipedia.org/w/index.php?oldid=505976779 Contributors: AdamSmithee, AlanUS, Alexf, Amonet, Apocralyptic, Ashkax, Bjcairns, Bo Jacoby,
Bryan Derksen, Btyner, Calbaer, Capricorn42, Cburnett, Classicalecon, Count ludwig, Damian Yerrick, Deineka, Digfarenough, El C, Eraserhead1, Felipehsantos, Frietjes, Gauss, Giftlite,
Gogobera, Gsimard, Gökhan, Hhassanein, Hilgerdenaar, Imranaf, Iwaterpolo, Juergik, K.F., LOL, MarkSweep, MathKnight, Mav, Mcld, Melcombe, Michael Hardy, MichaelRutter, Mike Rosoft,
Mikez, Mr.gondolier, NeonMerlin, Nov ialiste, PhotoBox, Qwfp, Ricklethickets, Robma, Rumping, Ryguasu, Serdagger, Skbkekas, Speed8224, Squizzz, Steve8675309, Sun Creator,
SyedAshrafulla, TakuyaMurata, Terrek, Tomi, VoltzJer, Vrenator, Wafulz, Wikid77, Wjastle, Wrogiest, Wtruttschel, Xanthoxyl, Youandme, 119 anonymous edits
Hypergeometric distribution Source: http://en.wikipedia.org/w/index.php?oldid=502616533 Contributors: Alexius08, Antoine 245, Arnold90, Baccyak4H, Benwing, Bilgrau, Bo Jacoby,
Booyabazooka, Bryan Derksen, Btyner, Burn, Cburnett, ChevyC, Commander Keane, DarrylNester, David Eppstein, David Shay, DavidLDill, Drcrnc, Eidolon232, El C, Eug, FedeLebron,
Felipehsantos, Gauss, Giftlite, Gnathan87, Goshng, Gunungblau, Herr blaschke, I3iaaach, I9606, Intervallic, It Is Me Here, Iwaterpolo, Jack Joff, Janto, Jht4060, Jia.meng, Johnlv12, Josh Cherry,
Kamrik, Kingboyk, Kiwi4boy, LOL, Linas, MC-CPO, MSGJ, MaxEnt, Maximilianh, Melcombe, Michael Hardy, Mindmatrix, Mtmoore321, Nbarth, Nerdmaster, Ott2, PAR, PBH,
Peteraandrews, Pleasantville, Pol098, Porejide, Prőhle Tamás, Qwfp, Randomactsofkindness2, Reb42, Rgclegg, Sboehringer, Schutz, Screech1941, Seattle Jörg, Skaphan, SkatingNerd,
TakuyaMurata, Talgalili, Tomi, User A1, Veryhuman, Wtmitchell, Wtruttschel, Yvswan, Zigger, ﺯﺭﺷﮏ, 176 anonymous edits
Hölder's inequality Source: http://en.wikipedia.org/w/index.php?oldid=506936260 Contributors: 3mta3, A. Pichler, Alan Liefting, Arvid42, AxelBoldt, Bdmy, Bh3u4m, Bryan Derksen,
Cazort, Cbunix23, Daniele.tampieri, DavidCBryant, Detritus, Eslip17, Fwappler, GBlanchard, Gene Nygaard, Giftlite, GoingBatty, Igny, Ilmari Karonen, Lantonov, Makotoy, MarSch,
MarkSweep, Mct mht, Melcombe, Merewyn, Minesweeper, Myasuda, Nousernamesleft, Oleg Alexandrov, Pred, Pulkitgrover, Quietbritishjim, R.e.b., Rich Farmbrough, Schmock, Small potato,
Stevenj, Sławomir Biały, Weierstrass, Whendrik, Wik, Wittlicher, Zvika, 52 anonymous edits
Inverse Gaussian distribution Source: http://en.wikipedia.org/w/index.php?oldid=499832929 Contributors: Aastrup, Abtweed98, Baccyak4H, Batman50, Braincricket, Btyner, David Haslam,
Deavik, Dima373, DrMicro, Felipehsantos, Giftlite, Iwaterpolo, Jfr26, Kristjan.Jonasson, LachlanA, LandruBek, Melcombe, Memming, Michael Hardy, MisterSheik, NickMulgan, Oleg
Alexandrov, Qwfp, Rhfeng, Sheppa28, Sterrys, The real moloch57, Tjagger, Tomi, User A1, Vana Seshadri, Wikid77, Wjastle, Zfeinst, 50 anonymous edits
Inverse-gamma distribution Source: http://en.wikipedia.org/w/index.php?oldid=497414146 Contributors: Benwing, Biostatprof, Btyner, Cburnett, Cquike, Dstivers, Fnielsen, Giftlite,
Greenw2, Iwaterpolo, Josevellezcaldas, Kastchei, M27315, MarkSweep, Melcombe, MisterSheik, PAR, Qwfp, Rfinlay@gmail.com, Rlendog, Rphlypo, Shadowjams, Sheppa28, Slavatrudu,
Tomi, User A1, Wjastle, 40 anonymous edits
Iteratively reweighted least squares Source: http://en.wikipedia.org/w/index.php?oldid=484990627 Contributors: 3mta3, BenFrantzDale, Benwing, David Eppstein, Giggy, Grumpfel,
Kiefer.Wolfowitz, Lambiam, Lesswire, LutzL, Melcombe, Michael Hardy, Oleg Alexandrov, RainerBlome, Salix alba, Serg3d2, Stpasha, Wesleyyin, 9 anonymous edits
Kendall tau rank correlation coefficient Source: http://en.wikipedia.org/w/index.php?oldid=482594886 Contributors: 3mta3, Adilapapaya, Arcadian, Arthur Rubin, As530, Barticus88,
Cronholm144, David Eppstein, Digisus, Dryke, Edurant, Fmccown, G716, Headbomb, Icedwater, Ichbin-dcw, Jacwa01, JamesHAndrews, Jtneill, Ldc, Mcld, Melcombe, Michael Hardy,
Monty669, Nick Number, O18, Olaf, Penpen, Ph.eyes, Piotrus, Qwfp, Rich Farmbrough, Sasikedi, Schmock, Schwnj, Squeakywaffle, Thecheesykid, WinerFresh, Yikes2000, 44 anonymous edits
Kolmogorov–Smirnov test Source: http://en.wikipedia.org/w/index.php?oldid=507111462 Contributors: A. Pichler, Adam Lein, Adoniscik, Amonet, Avraham, AxelBoldt, Axl, Bender235,
Bryan Derksen, Casey Abell, Chris53516, CiaPan, Conversion script, DeadEyeArrow, Den fjättrade ankan, Doremo, Dresdnhope, EagerToddler39, EddEdmondson, Encyclops, Everettr2, Free
Software Knight, Geregen2, Giftlite, Goudzovski, GregorB, HandsomeFella, Headbomb, Huji, Igny, Inter, Irishguy, Jasondet, Jmjanzen, Jovillal, K.F., Kiefer.Wolfowitz, Klonimus,
Larry_Sanger, MBlakley, MH, Magioladitis, Mairi, Makemineamoose, MarkSweep, Melcombe, Memming, Michael Hardy, Miguel, Moverly, O18, Olaf, Patrick, Pgan002, Pgr94, Phr, Predictor,
Qwerty Binary, Qwfp, Ragout, Rjwilmsi, Robinh, Ruud Koot, Schmock, Schutz, Schwnj, Selket, Smb1001, Smith609, Snoyes, Spangineer, Stangaa, Statisticsblog, Stern, Stpasha, Strafpeloton2,
Strait, Tabletop, TedDunning, Thorwald, Tomi, TomyDuby, Tyger7th, Wikid77, Wtng, Yaris678, Zaqrfv, 78 anonymous edits
Kronecker's lemma Source: http://en.wikipedia.org/w/index.php?oldid=452306725 Contributors: Aastrup, Cenarium, Charles Matthews, Cmdrjameson, David Eppstein, FF2010, Giftlite,
Kurdo777, LennK, Linas, Ukookami, 1 anonymous edits
Kullback–Leibler divergence Source: http://en.wikipedia.org/w/index.php?oldid=505224311 Contributors: 3mta3, A. Pichler, Adfernandes, Amit Moscovich, Amkilpatrick, Avraham,
Baisemain, Benwing, BlaiseFEgan, Charles Matthews, Cronholm144, Cstahlhut, Cyan, Deepmath, Den fjättrade ankan, Dfass, Dmb000006, Dnavarro, Edward, Epistemenical, Epomqo,
Ereiniona, FilipeS, Forwardmeasure, Francis liberty, Giftlite, Gzabers, Ignacioerrico, Inkling, Iturrate, JForget, Jamelan, Jheald, Jmorgan, Jon Awbrey, Kastchei, Kevin Baas, Kiefer.Wolfowitz,
Kyellan, Linas, Loniousmonk, MDReid, MarkSweep, MartinSpacek, Mcld, Mct mht, Mebden, Melcombe, Memming, Michael Hardy, Mike Lin, Miranda, MisterSheik, Mmernex, Mottzo,
Mpost89, Mundhenk, Nathanielvirgo, Nbarth, Neonleonb, Nothing1212, Object01, Oleg Alexandrov, PAR, Punkstar89, Quantumelfmage, Qwfp, Rinconsoleao, Rjwilmsi, Rkrish67, Romanpoet,
Schizoid, SciCompTeacher, Shreevatsa, Sir Vicious, Stangaa, Stern, Stpasha, Sun Creator, Thermochap, Wikomidia, Wile E. Heresiarch, Winterfors, Wittnate, Wullj, X7q, Yoderj, 犬 牙, 130
anonymous edits
Laplace distribution Source: http://en.wikipedia.org/w/index.php?oldid=502197415 Contributors: Alektzin, Btyner, CRGreathouse, Cburnett, Charles Matthews, Comfortably Paranoid, Dcljr,
Dcoetzee, DrMicro, Fasten, Fnielsen, Foobarhoge, Giftlite, Henrygb, Huoer, Igny, Iwaterpolo, Jdobelman, Johnlv12, Jraudhi, Jurgen, Kabla002, Kastchei, Ludovic89, M.A.Dabbah, MarkSweep,
Mashiah Davidson, Mathstat, Meemoxp, Melcombe, Memming, Michael Hardy, MisterSheik, Mohammad Al-Aggan, PAR, Qwfp, Rjwilmsi, Rlendog, Sheppa28, Sterrys, Straightontillmorning,
Sun Creator, User A1, Vovchyck, Wastle, Wjastle, Wolf87, Zundark, Zvika, 45 anonymous edits
Laplace's equation Source: http://en.wikipedia.org/w/index.php?oldid=503925895 Contributors: !jim, 124Nick, 213.253.39.xxx, Acipsen, Andrei Polyanin, Andrei r, Andres,
Anythingyouwant, Ap, Archeryguru2000, Astozzia, AugPi, Awickert, AxelBoldt, Bender235, Bh3u4m, BigJohnHenry, Blueboy814, Charles Baynham, Charles Matthews, Chubby Chicken,
Cj67, Coelacan, Crowsnest, DavidCBryant, Dmp450, Donludwig, Dratman, Drywallandoswald, Eerb, El C, ElTyrant, Giftlite, Goheeca, Gonfer, GuidoGer, Hadal, Haseldon, Hellisp, Hypernurb,
Infinityprob, Jasperdoomen, Jgwade, Juansempere, Jzsfvss, KamasamaK, Kibibu, Lfscheidegger, Linas, Liuyao, LokiClock, Lombar2, MarcelB612, Mecanismo, Mel Etitis, Mets501, Mhym,
Michael Hardy, Mleconte, Moink, Ninly, Nuwewsco, Oleg Alexandrov, Paolodm, Patrick, Phelimb, RayAYang, RexNL, Rhun, Rich Farmbrough, Roadrunner, Salih, Sandycx, Shinji311, Silly
rabbit, Slightsmile, Stsmith, Sullivan.t.j, TakuyaMurata, Tarquin, Tbsmith, Thenub314, Tim Starling, User A1, Wikibacc, Wolfkeeper, Wthered, Xbr 0511, くま兄やん, 111 anonymous
edits
Laplace's method Source: http://en.wikipedia.org/w/index.php?oldid=503012375 Contributors: 777sms, Alekh, Anthony Appleyard, Arnold90, BDQBD, BenFrantzDale, Bluemaster,
Bluemasterbr, Charles Matthews, Conscious, Coolwangyx, Deville, Ephraim33, Giftlite, JabberWok, Jitse Niesen, Joriki, Karl-H, Keithcc, Krishnavedala, Leperous, Linas, LittleOldMe, McKay,
Michael Hardy, Monsterman222, Msalins, Mt06, MuDavid, Oleg Alexandrov, Olegalexandrov, Phil Boswell, R.e.b., Rossweisse, Trogsworth, Wilke, Zero sharp, 65 anonymous edits
Likelihood-ratio test Source: http://en.wikipedia.org/w/index.php?oldid=502800268 Contributors: 1ForTheMoney, Adismalscientist, Adoniscik, AgentPeppermint, AnRtist, Arcadian,
ArcadianOnUnsecuredLoc, Arknascar44, Aryan1989, Babbage, Badgettrg, Btyner, Cancan101, Cburnett, Chuk.plante, Cmcnicoll, Conversion script, Corti, Dchudz, Den fjättrade ankan, Draeco,
El C, Elysdir, Fanyavizuri, Fayue1015, Fnielsen, Fortdj33, Frietjes, Giftlite, Graham87, Guy Macon, Henrygb, Jackzhp, Jeremiahrounds, Jfitzg, Jheald, Jmac2222, Kastchei, Kku, Kniwor,
LilHelpa, Mack2, Madbix, MarkSweep, Meduz, Melcombe, Michael Hardy, Mild Bill Hiccup, Moverly, NaftaliHarris, Nbarth, NeoUrfahraner, Nescio, Nilayvaish, Notheruser, Oleg Alexandrov,
Pete.Hurd, Pgan002, Quantling, Qwfp, RL0919, Rajah9, Ridgeback22, RobDe68, Robertvan1, Robinh, Sboludo, Seans Potato Business, Seglea, Smith609, Tayste, The Anome, Thecurran, Tim
bates, Torfason, Twri, Unknown, Vthesniper, Wiki091005!!, Yimmieg, Zaqrfv, 91 anonymous edits
List of integrals of exponential functions Source: http://en.wikipedia.org/w/index.php?oldid=505272291 Contributors: Adrian.benko, Angalla, Anuclanus, Astrotrebor, Bilboq, Blah314,
Borgx, Cleaver2008, Csigabi, Deineka, Diffequa, Dojarca, Donarreiskoffer, Dungodung, Dusik, Dwees, Edsanville, EmilJ, Evil saltine, Fnielsen, Germandemat, Going3killu, Hasanshabbir786,
Helder.wiki, HenningThielemann, Icairns, Itai, Ivan Štambuk, JRSpriggs, Jasondet, JeffBobFrank, Jleedev, Kenyon, LeaW, Lzur, Mar.marco, Melchoir, Melink14, Mikez302, Mxn, NickFr, Oleg
Alexandrov, Physman, Pw brady, Schneelocke, Scythe33, Seyfried, SkiDragon, Smack, TakuyaMurata, Txus.aparicio, Unyoyega, Versus22, Viames, Waabu, Will5430, ZeroOne, 63 anonymous
edits
List of integrals of Gaussian functions Source: http://en.wikipedia.org/w/index.php?oldid=501094160 Contributors: Michael Hardy, Qwfp, Stpasha, 14 anonymous edits
List of integrals of hyperbolic functions Source: http://en.wikipedia.org/w/index.php?oldid=488579551 Contributors: Adrian.benko, Bilboq, Deineka, Dmcq, Enjuneer, Eric Burnett, Eynar,
Germandemat, Guiltyspark, Icairns, Itai, Ivan Štambuk, KnightRider, Lambiam, Lzur, NickFr, Number Googol, Oleg Alexandrov, Rmashhadi, Schneelocke, Smack, TakuyaMurata, Viames,
ZeroOne, Zvika, Zzedar, 11 anonymous edits
List of integrals of logarithmic functions Source: http://en.wikipedia.org/w/index.php?oldid=501881412 Contributors: Albmont, Bilboq, Borgx, Charles Matthews, Daryl7569, Dojarca, Evil
saltine, Fnielsen, Germandemat, Icairns, Icek, Isnow, Itai, Ivan Štambuk, Jeffreyarcand, Lzur, Maksim-e, Moshi1618, NickFr, Oleg Alexandrov, Physman, Rmashhadi, Schneelocke,
TakuyaMurata, Trevva, Txus.aparicio, Viames, ZeroOne, Ziaris, Петър Петров, គីមស៊្រុន, 28 anonymous edits
Lists of integrals Source: http://en.wikipedia.org/w/index.php?oldid=506714348 Contributors: 00Ragora00, Akikidis, Albert D. Rich, Amazins490, AngrySaki, Ant314159265,
ArnoldReinhold, Asmeurer, BANZ111, BananaFiend, BehzadAhmadi, Bilboq, Bruno3469, Brutha, CWenger, Ciphers, Cícero, DJPhoenix719, DavidWBrooks, Dcirovic, Deineka, DerHexer,
Dmcq, Doctormatt, Dogcow, Doraemonpaul, Dpb2104, Drahmedov, Dysprosia, Euty, FerrousTigrus, Fieldday-sunday, Fredrik, Giftlite, Giulio.orru, Gloriphobia, Happy-melon, IDGC, Icairns,
Imperial Monarch, Itai, Itu, Ivan Štambuk, IznoRepeat, JNW, Jaisenberg, Jimp, Jj137, John Vandenberg, Jon R W, Jumpythehat, Jwillbur, KSmrq, Kantorghor, Kiatdd, Kilonum, Kusluj,
LachlanA, LeaveSleaves, Legendre17, Lesonyrra, Linas, LizardJr8, Lzur, Macrakis, MathFacts, Michael Hardy, MrOllie, Msablic, Muro de Aguas, NNemec, Nbarth, New Math,
NewEnglandYankee, NickFr, NinjaCross, Oleg Alexandrov, Perelaar, Phatsphere, Physman, Physmanir, Pimvantend, Pokipsy76, Pschemp, Qmtead, RobHar, Salih, Salix alba, Schneelocke,
Scythe33, ShakataGaNai, Sseyler, Stpasha, TStein, TakuyaMurata, Template namespace initialisation script, Tetzcatlipoca, The Transhumanist, Thenub314, Tkreuz, Unyoyega, VasilievVV,
Vedantm, Waabu, Wile E. Heresiarch, Willking1979, Woohookitty, Xanthoxyl, Yeungchunk, Ylai, Zmoney918, 282 anonymous edits
Local regression Source: http://en.wikipedia.org/w/index.php?oldid=503830743 Contributors: 3mta3, Afa86, Benwing, Btyner, Caitifty, Coppertwig, Den fjättrade ankan, Dontdoit,
DutchCanadian, Glane23, Gpeilon, JHunterJ, JonMcLoone, JonPeltier, Kendrick7, Kierano, Lambiam, Melcombe, Michael Hardy, Qwertyus, Ryepup, Sintaku, Talgalili, The Anome, Tjhalva,
Urhixidur, 21 anonymous edits
Log-normal distribution Source: http://en.wikipedia.org/w/index.php?oldid=506211496 Contributors: 2D, A. Pichler, Acct4, Albmont, Alue, Ashkax, Asitgoes, Autopilot, AxelBoldt,
Baccyak4H, BenB4, Berland, Bfinn, Biochem67, Bryan Derksen, Btyner, Cburnett, Christian Damgaard, Ciberelm, Ciemo, Cleared as filed, Cmglee, ColinGillespie, Constructive editor,
Danhash, David.hilton.p, DonkeyKong64, DrMicro, Encyclops, Erel Segal, Evil Monkey, Floklk, Fluctuator, Frederic Y Bois, Fredrik, Gausseliminering, Giftlite, Humanengr, Hxu, IanOsgood,
IhorLviv, Isheden, Iwaterpolo, Jackzhp, Jeff3000, Jetlee0618, Jimt075, Jitse Niesen, Khukri, Kiwi4boy, Lbwhu, Leav, Letsgoexploring, LilHelpa, Lojikl, Lunch, Mange01, Martarius, Martinp23,
Mcld, Melcombe, Michael Hardy, Mikael Häggström, Mishnadar, MisterSheik, Nehalem, Nite1010, NonDucor, Ocatecir, Occawen, Osbornd, Oxymoron83, PAR, PBH, Paul Pogonyshev, Philip
Trueman, Philtime, Phoxhat, Pichote, Pontus, Porejide, Qwfp, R.J.Oosterbaan, Raddick, Rgbcmy, Rhowell77, Ricardogpn, Rjwilmsi, Rlendog, Rmaus, RobertHannah89, Safdarmarwat,
Sairvinexx, Schutz, Seriousme, Sheppa28, Skunkboy74, SqueakBox, Sterrys, Stigin, Stpasha, Ta bu shi da yu, Techman224, The Siktath, Till Riffert, Tkinias, Tomi, Umpi, Unyoyega, Urhixidur,
User A1, Vincent Semeria, Wavelength, Weialawaga, Wikomidia, Wile E. Heresiarch, Wjastle, Zachlipton, ZeroOne, ^demon, ירון, 201 anonymous edits
Logrank test Source: http://en.wikipedia.org/w/index.php?oldid=506626143 Contributors: Bender235, Btyner, Cherkash, Dstivers, G716, Hermel, Johnlv12, Lilac Soul, Melcombe, Michael
Hardy, Ph.eyes, Qwfp, Reader0527, Rich Farmbrough, Rjwilmsi, Skbkekas, 18 anonymous edits
Lévy distribution Source: http://en.wikipedia.org/w/index.php?oldid=500941726 Contributors: 84user, Badger Drink, Btyner, Caviare, DBrane, Digfarenough, Dysmorodrepanis, Eric Kvaalen,
Gaius Cornelius, Gbellocchi, Gene Nygaard, Giftlite, JamieBallingall, Jfr26, Kastchei, Kloveland, Krishnavedala, Lovibond, Melcombe, Michael Hardy, Nbarth, Night Gyr, PAR, Ptrf, PyonDude,
Qwfp, Rlendog, Saihtam, SebastianHelm, Sheppa28, Smarket, Tassedethe, Tsirel, Uniquejeff, WJVaughn3, Wainson, Xcentaur, Ynhockey, 25 anonymous edits
Mann–Whitney U Source: http://en.wikipedia.org/w/index.php?oldid=507333032 Contributors: 3mta3, AbsolutDan, Adamsiepel, AndrewHZ, Baccyak4H, Bender235, Bequw, Blehfu,
Bobo192, Brian Everlasting, Briancady413, Buzhan, Chafe66, Charles Matthews, Chris53516, Ctacmo, Darrel francis, DeLarge, Den fjättrade ankan, Dfxoreilly, Fmccown, Gabe rosser,
Gandalf61, Giftlite, GregorB, Gstatistics, Happydaysarehere, Harrelfe, Headbomb, Jarekt, Jeremymiles, Jmorgan01007, Jowa fan, Kenkleinman, Kgwet, Kiefer.Wolfowitz, Kku, Klaus scheicher,
Klonimus, Kmk, KnightRider, Lefschetz, LenoxBlue, Lovewarcoffee, Mai-Thai, Marenty, MarkSweep, Markjoseph125, Martious, Mcld, Melcombe, Memming, Michael Hardy, Mikael
Häggström, Moverly, Mpf3205, MrOllie, Mwtoews, Navy blue84, Nbarth, Nvf, Omnipaedista, Pgan002, Ph.eyes, PigFlu Oink, Purple, Rjwilmsi, RobKushler, Robert Weemeyer, Searke, Seglea,
Selket, Sethant, Smith609, Strafpeloton2, Suruena, Talgalili, Tatpong, Tayste, TeaDrinker, Ted7815, Tim bates, Timothyarnold85, Tomi, Trevor Burnham, Urhixidur, Wayiran,
Where'stheindian?, Wiendietry, Xenonx, Zufar, 159 anonymous edits
Matrix calculus Source: http://en.wikipedia.org/w/index.php?oldid=505301664 Contributors: Aalopes, Ahmadabdolkader, Albmont, Alelbre, Altenmann, Anonymous Dissident, ArloLora,
Arthur Rubin, Ashigabou, AugPi, Benwing, Blaisorblade, Bloodshedder, Brad7777, Brent Perreault, CBM, CamCairns, Charles Matthews, Cooli46, Cs32en, Ctacmo, Ctsourak, DJ Clayworth,
DRHagen, Danielbaa, Dattorro, Dimarudoy, Dlohcierekim, Download, Enisbayramoglu, Eroblar, Esoth, Excirial, F=q(E+v^B), Fred Bauder, Freddy2222, Gauge, Geometry guy, Giftlite,
Giro720, Guohonghao, Hhchen1105, Hu12, Immunize, IznoRepeat, Jan mei118, Jitse Niesen, JohnBlackburne, Kirbin, Lethe, Lgstarn, Maschen, Melcombe, Michael Hardy, Morning Sunshine,
MrOllie, NawlinWiki, Oli Filth, Orderud, Oussjarrouse, Ozob, Pan Chenguang, Pearle, PeterShook, RJFJR, Rich Farmbrough, SDC, Sanchan89, Steve98052, Stpasha, Surya Prakash.S.A.,
SyedAshrafulla, TStein, The Thing That Should Not Be, Vgmddg, Willking1979, Wtmitchell, Xiaodi.Hou, Yuzisee, 222 anonymous edits
Maximum likelihood Source: http://en.wikipedia.org/w/index.php?oldid=504777258 Contributors: 3mta3, Af1523, Albmont, Alfalfahotshots, Algebraic, Algocu, Arthena, Atabəy, Avraham,
BD2412, BPets, Baccyak4H, BenFrantzDale, Binarybits, Bo Jacoby, Brandynwhite, Btyner, Cal-linux, Cancan101, Casp11, Cbrown1023, Cburnett, Cehc84, Chadhoward, ChangChienFu,
Cherkash, Chinasaur, Chowells, Chris the speller, Cjpuffin, Classicalecon, CurranH, Davidmosen, Davyzhu, Den fjättrade ankan, Dimtsit, Dlituiev, Drazick, Dreadstar, Dysmorodrepanis, Earlh,
EduardoValle, F0rbidik, Flavio Guitian, Freeside3, G716, Gareth Griffith-Jones, Giftlite, Gill110951, Gjshisha, Graham87, Guan, Hawk8103, Headbomb, Henrikholm, Henrygb, Hike395,
Hongooi, Hu12, Inky, JA(000)Davidson, James I Hall, Jason Quinn, JeffreyRMiles, JimJJewett, Jmc200, John254, Jrtayloriv, Jsd115, Juffi, Julian Brown, Karada, Khazar2, Kiefer.Wolfowitz,
Koavf, Lavaka, Lexor, Lilac Soul, Logan, Loodog, Lucifer87, MJamesCA, Mathdrum, Mathuranathan, Matt Gleeson, Maye, Melcombe, Michael Hardy, Mikhail Ryazanov, MrOllie, Nak9x,
Nbarth, Nick Number, Ninja247, Nivix, Oleg Alexandrov, PAR, Patrick, Phil Boswell, Quietbritishjim, Qwerpoiu, Qwfp, R'n'B, RVS, Rama, Ramiromagalhaes, Reetep, Rich Farmbrough,
Rjwilmsi, Rlsheehan, Robinh, Rogerbrent, Royalguard11, Rschulz, Samikrc, Samsara, Saric, Set theorist, Shadowjams, Simo Kaupinmäki, Slaunger, SolarMcPanel, Stpasha, Svick, TedE, The
Anome, The Thing That Should Not Be, TheMathAddict, Travelbird, Ultramarine, Urhixidur, Velocidex, Violetriga, Vitanyi, Warren.cheung, Wavelength, Xappppp, XpXiXpY, Z10x, Zbodnar,
Zfeinst, Zonuleofzinn, Zvika, 215 anonymous edits
McNemar's test Source: http://en.wikipedia.org/w/index.php?oldid=503452574 Contributors: Archanamiya, Bkkbrad, Bluemoose, Btyner, Calimo, Cannin, Chris53516, Chzz, Coruscater,
Davidswelt, Den fjättrade ankan, Dougweller, Ellogo, Functious, Gaius Cornelius, Giftlite, Headbomb, JerroldPease-Atlanta, Jitse Niesen, Johannes Hüsing, Kallerdis, Kastchei, Kgwet,
MarkSweep, Mehdimoodi, Melcombe, Michael Hardy, Monterey Bay, Moverly, Practical321, Qwfp, Rjwilmsi, Rtlam, Staats, Subversified, Talgalili, Tayste, Tim bates, TimBentley,
Toot123toot, Wasell, Zundark, 18 anonymous edits
Multicollinearity Source: http://en.wikipedia.org/w/index.php?oldid=507151980 Contributors: 4wajzkd02, Altenmann, Bkwillwm, Bobo192, CBM, Counterfact, Den fjättrade ankan,
Dholtschlag, DickStartz, Dscannon, Duoduoduo, Dvdpwiki, Eagerbo, EconProf86, Ed Poor, Edward, Epa101, Fungus, Gap9551, Giftlite, Guillaume2303, Inhumandecency, Iridescent, JForget,
Jichuan, Joe.mellor, Joylee1130, KHamsun, Kukini, Kvng, Maddraven1716, Melcombe, Michael Hardy, Mishrasknehu, MrOllie, Nilesh2293, Robbyjo, S, Saxman77, Seriousj, Shethzulfi,
Studycourts, Sławomir Biały, Utcursch, Varuag doos, Whisky brewer, Ybbor, ﻣﺎﻧﻲ, 104 anonymous edits
Multivariate normal distribution Source: http://en.wikipedia.org/w/index.php?oldid=507275312 Contributors: A3 nm, Alanb, Arvinder.virk, AussieLegend, AxelBoldt, BenFrantzDale,
Benwing, BernardH, BlueScreenD, Breno, Bryan Derksen, Btyner, Cburnett, Cfp, ChristophE, Chromaticity, Ciphergoth, Coffee2theorems, Colin Rowat, Dannybix, Delirium, Delldot, Derfugu,
Distantcube, Eamon Nerbonne, Giftlite, Hongooi, HyDeckar, Isch de, J heisenberg, Jackzhp, Jasondet, Jondude11, Jorgenumata, Josuechan, KHamsun, Kaal, Karam.Anthony.K, KipKnight,
KrodMandooon, KurtSchwitters, Lambiam, Lockeownzj00, Longbiaochen, MER-C, Marc.coram, MarkSweep, Mathstat, Mauryaan, MaxSem, Mcld, Mct mht, Mdf, Mebden, Meduz, Melcombe,
Michael Hardy, Miguel, Mjdslob, Moriel, Mrwojo, Myasuda, Nabla, Ninjagecko, O18, Ogo, Oli Filth, Omrit, Opabinia regalis, Orderud, Paul August, Peni, PhysPhD, Picapica, Pycoucou,
Quantling, Qwfp, R'n'B, Riancon, Rich Farmbrough, Richardcherron, RickK, Rjwilmsi, Robinh, Rumping, Sanders muc, SebastianHelm, Selket, Set theorist, SgtThroat, Sigma0 1, SimonFunk,
Sinuhet, Steve8675309, Stpasha, Strashny, Sun Creator, Tabletop, Talgalili, TedPavlic, Toddst1, Tommyjs, Ulner, Velocitay, Viraltux, Waldir, Wavelength, Wikomidia, Winterfors, Winterstein,
Wjastle, Yoderj, Zelda, Zero0000, Zvika, มือใหม่, 209 anonymous edits
n-sphere Source: http://en.wikipedia.org/w/index.php?oldid=507623088 Contributors: 4, Aetheling, Army1987, ArnoldReinhold, AstroHurricane001, Bcnof, BenBaker, Berland, Bkell,
Brad7777, CYD, Cicco, David Eppstein, Deepmath, Dingenis, Dionyziz, Diti, Donarreiskoffer, DryaUnda, Epistemenical, Eric119, Erud, Fly by Night, Freelance Intellectual, Fropuff, Geometry
guy, Giftlite, GoingBatty, GraemeMcRae, Gut Monk, Headbomb, Henboppa, Herbee, Howwhowhatwhen, Icairns, Iteloo, JJ Harrison, JRSpriggs, JYolkowski, Jaapie, Jackzhp, Jakob.scholbach,
Jasonphysics, JohnBlackburne, Johnflux, JokeySmurf, Jonathanledlie, Joseph Lindenberg, Jugander, Just granpa, Jwy, Jwz, Jörg Knappen, KSmrq, LVC, Lethe, LokiClock, Loudogg, MarSch,
Marozols, Maurice Carbonaro, Mcnaknik, Mebden, Michael Angelkovich, Michael Hardy, Mikey likes mountains, Ndickson, NeilHynes, Njerseyguy, PAR, PV=nRT, Patrick, Pauloj96, Paweł
Ziemian, Pbroks13, Poulpy, Pstanton, Quantling, Qubiter, Qwertyus, R. J. Mathar, RJD ^)$, Randomblue, Reaper Eternal, Reyk, Rocchini, RodC, SE16, Salix alba, Searchme, Shanes, Silly
rabbit, Slawekb, Smartcat, Spinningspark, Spoon!, Subh83, Sławomir Biały, TakuyaMurata, Tamfang, Thehotelambush, Thewhyman, Thomas s. briggs, Tkinsman, Tkuvho, Tompw, Tomruen,
Tosha, Trigamma, Turul2, UU, VKokielov, Velocidex, Wangtailun, Worm That Turned, Xenure, Zundark, Zvika, 142 anonymous edits
Negative binomial distribution Source: http://en.wikipedia.org/w/index.php?oldid=500134820 Contributors: Airplaneman, Alexius08, Arunsingh16, Ascánder, Asymmetric, AxelBoldt,
Benwing, Bo Jacoby, Bryan Derksen, Btyner, Burn, CALR, CRGreathouse, Cburnett, Charles Matthews, Chocrates, Colinpmillar, Cretog8, Damian Yerrick, Dcljr, Deathbyfungwahbus,
DutchCanadian, DwightKingsbury, Econstatgeek, Eggstone, Entropeter, Evra83, Facorread, Felipehsantos, Formivore, Gabbe, Gauss, Giftlite, Headbomb, Henrygb, Hilgerdenaar, Iowawindow,
Iwaterpolo, Jahredtobin, Jason Davies, Jfr26, Jmc200, Keltus, Kevinhsun, Kodiologist, Linas, Ludovic89, MC-CPO, Manoguru, MarkSweep, Mathstat, McKay, Melcombe, Michael Hardy,
Mindmatrix, Moldi, Nov ialiste, O18, Odysseuscalypso, Oxymoron83, Panicpgh, Phantomofthesea, Pmokeefe, Qwfp, Rar, Renatovitolo, Rje, Rumping, Salgueiro, Sapphic, Schmock, Shreevatsa,
Sleempaster21229, Sleepmaster21229, Statone, Steve8675309, Stpasha, Sumsum2010, TGS, Talgalili, Taraborn, Tedtoal, TomYHChan, Tomi, Trevor.maynard, User A1, Waltpohl, Wikid77,
Wile E. Heresiarch, Wjastle, Zvika, 151 anonymous edits
Noncentral chi-squared distribution Source: http://en.wikipedia.org/w/index.php?oldid=491008998 Contributors: AaronMSwan, Alanb, Barrywbrown, Ctacmo, Fintor, Gaius Cornelius,
Giftlite, Hsne, Kastchei, Lixiaoxu, Melcombe, Memming, Michael Hardy, Nielses, Oleg Alexandrov, PAR, PV=nRT, Pspijker, Renatokeshet, Shae, SnakeBDD, Splash, Steve8675309, Tc1008,
Tim1357, Tokorode, Tomi, TomyDuby, Tristanreid, Viruseb, Willem, Zaqrfv, Zvika, ﻣﺤﻤﺪ.ﺭﺿﺎ, 34 anonymous edits
Noncentral F-distribution Source: http://en.wikipedia.org/w/index.php?oldid=451728415 Contributors: Alankjackson, DanSoper, Dima373, Eric Kvaalen, Fnielsen, Giftlite, Jerryobject,
Josve05a, Kastchei, Lixiaoxu, MarkSweep, Michael Hardy, Natalie Erin, Patrick57, PrairieDogDoug, Simoneau, Steve8675309, 9 anonymous edits
Noncentral t-distribution Source: http://en.wikipedia.org/w/index.php?oldid=451730621 Contributors: AnRtist, Barrywbrown, Benwing, David Eppstein, Fortdj33, Hypnotoad33, Janto,
Kastchei, LilHelpa, Lixiaoxu, MatthewVanitas, Melcombe, Qwfp, Rjwilmsi, Skbkekas, Steve8675309, Zenohockey, 11 anonymous edits
Norm (mathematics) Source: http://en.wikipedia.org/w/index.php?oldid=507310470 Contributors: ABCD, ANONYMOUS COWARD0xC0DE, Algebraist, Allispaul, Almit39, Altenmann,
Army1987, Arthur Rubin, Baldphil, Beau, BenFrantzDale, Bestian, Bkell, Bobo192, Brews ohare, CBM, ChongDae, Chutzpan, CiaPan, Connelly, Crasshopper, Cybercobra, Cícero, D4g0thur,
DMacks, Dalf, Dan Granahan, Dan Polansky, DannyAsher, Datahaki, David Kernow, Dicklyon, Dlazesz, Dmitri666, Don Quixote de la Mancha, Dratman, Effigies, Eraserhead1, Falcongl,
Fmccown, Free0willy, Fropuff, Giftlite, Hairy Dude, HannsEwald, Hans Adler, Headbomb, Helptry, Heysan, InverseHypercube, Irritate, JackSchmidt, Jenny Harrison, JohnBlackburne,
JosephSilverman, Jowa fan, JumpDiscont, Jxramos, KHamsun, KSmrq, Kan8eDie, Kiefer.Wolfowitz, Killerandy, Lambiam, Lethe, Linas, Lovasoa, Lucaswilkins, Lunch, MFH, MathMartin, Mct
mht, Melchoir, Mhss, MiNombreDeGuerra, Michael Hardy, Mike Segal, MisterSheik, Mpd1989, Nbarth, Ncik, NearSetAccount, Oleg Alexandrov, Oli Filth, PMajer, Paolo.dL, Patrick, Paul
August, PhotoBox, Quondum, Reminiscenza, Rheyik, Robin S, Rockfang, Rudjek, Saavek47, Sbharris, SebastianHelm, Sebjlan, Selket, Selvik, Sendhil, Shadowjams, Silly rabbit, SimonD,
Singularitarian, Sperling, Steve Kroon, Stpasha, Sullivan.t.j, Sławomir Biały, Tamfang, Tardis, That Guy, From That Show!, The Anome, Tobias Bergemann, Tom Peleg, TomJF, Tomo,
Tomruen, Tosha, Tribaal, Trovatore, Urdutext, Urhixidur, Veromies, VikC, Wikimorphism, Xnn, Xtv, Zero0000, Ziyuang, Zundark, אביב, ﺳﻌﯽ, 125 anonymous edits
Normal distribution Source: http://en.wikipedia.org/w/index.php?oldid=507959136 Contributors: 119, 194.203.111.xxx, 213.253.39.xxx, 5:40, A. Parrot, A. Pichler, A.M.R., AaronSw,
Abecedare, Abtweed98, Alektzin, Alex.j.flint, Ali Obeid, AllanBz, Alpharigel, Amanjain, AndrewHowse, Anna Lincoln, Appoose, Art LaPella, Asitgoes, Aude, Aurimus, Awickert, AxelBoldt,
Aydee, Aylex, Baccyak4H, Beetstra, BenFrantzDale, Benwing, Bhockey10, Bidabadi, Bluemaster, Bo Jacoby, Boreas231, Boxplot, Br43402, Brock, Bryan Derksen, Bsilverthorn, Btyner,
Bubba73, Burn, CBM, CRGreathouse, Calvin 1998, Can't sleep, clown will eat me, CapitalR, Cburnett, Cenarium, Cgibbard, Charles Matthews, Charles Wolf, Cherkash, Cheungpuiho04, Chill
doubt, Chris53516, ChrisHodgesUK, Christopher Parham, Ciphergoth, Cmglee, Coffee2theorems, ComputerPsych, Conversion script, Coolhandscot, Coppertwig, Coubure, Courcelles,
Crescentnebula, Cruise, Cwkmail, Cybercobra, DEMcAdams, DFRussia, DVdm, Damian Yerrick, DanSoper, Dannya222, Darwinek, David Haslam, David.hilton.p, DavidCBryant, Davidiad,
Den fjättrade ankan, Denis.arnaud, Derekleungtszhei, Dima373, Dj thegreat, Doood1, DrMicro, Drilnoth, Drostie, Dudzcom, Duoduoduo, Dzordzm, EOBarnett, Eclecticos, Ed Poor, Edin1,
Edokter, EelkeSpaak, Egorre, Elektron, Elockid, Enochlau, Epbr123, Eric Kvaalen, Ericd, Evan Manning, Fang Aili, Fangz, Fergusq, Fgnievinski, Fibonacci, FilipeS, Fintor, Firelog, Fjdulles,
Fledylids, Fnielsen, Fresheneesz, G716, GB fan, Galastril, Gandrusz, Gary King, Gauravm1312, Gauss, Geekinajeep, Gex999, GibboEFC, Giftlite, Gil Gamesh, Gioto, GordontheGorgon,
Gperjim, Graft, Graham87, Gunnar Larsson, Gzornenplatz, Gökhan, Habbie, Headbomb, Heimstern, Henrygb, HereToHelp, Heron, Hiihammuk, Hiiiiiiiiiiiiiiiiiiiii, Hu12, Hughperkins, Hugo
gasca aragon, I dream of horses, Ian Pitchford, IdealOmniscience, It Is Me Here, Itsapatel, Ivan Štambuk, Iwaterpolo, J heisenberg, JA(000)Davidson, JBancroftBrown, JaGa, Jackzhp, Jacobolus,
JahJah, JanSuchy, Jason.yosinski, Javazen, Jeff560, Jeffjnet, Jgonion, Jia.meng, Jim.belk, Jitse Niesen, Jmlk17, Joebeone, Jonkerz, Jorgenumata, Joris Gillis, Jorisverbiest, Josephus78, Josuechan,
Jpk, Jpsauro, Junkinbomb, KMcD, KP-Adhikari, Karl-Henner, Kaslanidi, Kastchei, Kay Dekker, Keilana, KipKnight, Kjtobo, Knutux, LOL, Lansey, Laurifer, Lee Daniel Crocker, Leon7, Lilac
Soul, Livius3, Lixy, Loadmaster, Lpele, Lscharen, Lself, MATThematical, MIT Trekkie, ML5, Manticore, MarkSweep, Markhebner, Markus Krötzsch, Marlasdad, Mateoee, Mathstat, Mcorazao,
Mdebets, Mebden, Meelar, Melcombe, Melongrower, Message From Xenu, Michael Hardy, Michael Zimmermann, Miguel, Mikael Häggström, Mikewax, Millerdl, Mindmatrix, MisterSheik,
Mkch, Mm 202, Morqueozwald, Mr Minchin, Mr. okinawa, MrKIA11, MrOllie, MrZeebo, Mrocklin, Mundhenk, Mwtoews, Mysteronald, Naddy, Nbarth, Netheril96, Nicholasink, Nicolas1981,
Nilmerg, NoahDawg, Noe, Nolanbard, NuclearWarfare, O18, Ohnoitsjamie, Ojigiri, Oleg Alexandrov, Oliphaunt, Olivier, Orderud, Ossiemanners, Owenozier, P.jansson, PAR, PGScooter,
Pablomme, Pabristow, Paclopes, Patrick, Paul August, Paulpeeling, Pcody, Pdumon, Personman, Petri Krohn, Pfeldman, Pgan002, Pinethicket, Piotrus, Plantsurfer, Plastikspork, Policron,
Polyester, Prodego, Prumpf, Ptrf, Qonnec, Quietbritishjim, Qwfp, R.J.Oosterbaan, R3m0t, RDBury, RHaworth, RSStockdale, Rabarberski, Rajah, Rajasekaran Deepak, Randomblue, Rbrwr,
Renatokeshet, RexNL, Rich Farmbrough, Richwales, Rishi.bedi, Rjwilmsi, Robbyjo, Robma, Romanski, Ronz, Rubicon, RxS, Ryguasu, SGBailey, SJP, Saintrain, SamuelTheGhost, Samwb123,
Article Sources and Contributors 631
Sander123, Schbam, Schmock, Schwnj, Scohoust, Seaphoto, Seidenstud, Seliopou, Seraphim, Sergey Suslov, SergioBruno66, Shabbychef, Shaww, Shuhao, Siddiganas, Sirex98, Smidas3,
Snoyes, Sole Soul, Somebody9973, Stan Lioubomoudrov, Stephenb, Stevan White, Stpasha, StradivariusTV, Sullivan.t.j, Sun Creator, SusanLarson, Sverdrup, Svick, Talgalili, Taxman,
Tdunning, TeaDrinker, The Anome, The Tetrast, TheSeven, Thekilluminati, TimBentley, Tomeasy, Tomi, Tommy2010, Tony1, Trewin, Tristanreid, Trollderella, Troutinthemilk, Tryggvi bt,
Tschwertner, Tstrobaugh, Unyoyega, Vakulgupta, Velocidex, Vhlafuente, Vijayarya, Vinodmp, Vrkaul, Waagh, Wakamex, Wavelength, Why Not A Duck, Wikidilworth, Wile E. Heresiarch,
Wilke, Will Thimbleby, Willking1979, Winsteps, Wissons, Wjastle, Wwoods, XJamRastafire, Yoshigev, Yoshis88, Zaracattle, Zero0000, Zhurov, Zrenneh, Zundark, Zvika, มือใหม่, 744
anonymous edits
Order statistic Source: http://en.wikipedia.org/w/index.php?oldid=506783010 Contributors: A. Pichler, Bruzie, Charles Matthews, Cognitionmachine, David Haslam, Dchristle, Dcljr,
Dcoetzee, Den fjättrade ankan, Fangz, Gareth Jones, Giftlite, Hannes Eder, Jmath666, Karada, LOL, Lambiam, LorCC, Melcombe, Michael Hardy, Miguel, Mr Adequate, Nbarth, Nethgirb,
Nicolas.Wu, Pgan002, Phauly, Quantling, R'n'B, Rodolfo Hermans, Se'taan, Sirslope, Slendle84, Unixxx, Vaidhy, YahoKa, Zundark, 44 anonymous edits
Ordinary differential equation Source: http://en.wikipedia.org/w/index.php?oldid=506887949 Contributors: 192.115.22.xxx, 48v, A. di M., Aboctok, Absurdburger, AdamSmithee, After
Midnight, Ahadley, Ahoerstemeier, Alfy Alf, Alll, Amixyue, Andrei Polyanin, Anetode, Anita5192, Ap, Arthena, Arthur Rubin, BL, BMF81, Baccala@freesoft.org, Bemoeial, BenFrantzDale,
Benjamin.friedrich, Berean Hunter, Bernhard Bauer, Beve, Bloodshedder, Bo Jacoby, Bogdangiusca, Bryan Derksen, Charles Matthews, Chenyu, Chilti, Chris in denmark, ChrisUK, Christian
List, Ciro.santilli, Cloudmichael, Cmdrjameson, Cmprince, Conversion script, Cpuwhiz11, Cutler, Danger, Danuthaiduc, Delaszk, Dickdock, Dicklyon, DiegoPG, Dmitrey, Dmr2, Dmytro,
DominiqueNC, Dominus, Don4of4, Donludwig, Doradus, Dysprosia, Ed Poor, Ekotkie, Emperorbma, Enochlau, Enzotib, F=q(E+v^B), Fintor, Fruge, Fzix info, Gauge, Gene s, Gerbrant, Giftlite,
Gombang, Graham87, HappyCamper, Heuwitt, Hongsichuan, Ht686rg90, Icairns, Isilanes, Iulianu, Jack in the box, Jak86, Jao, JeLuF, Jitse Niesen, Jni, JoanneB, John C PI, JohnBlackburne,
Jokes Free4Me, JonMcLoone, Josevellezcaldas, Juansempere, Kakila, Kawautar, Kdmckale, Klaas van Aarsen, Krakhan, Kwantus, L-H, LachlanA, Let01, Lethe, Linas, Lingwitt, Liquider, Lupo,
MarkGallagher, Math.geek3.1415926, MathMartin, Mathuvw, Matusz, Melikamp, Michael Hardy, Mikez, Mild Bill Hiccup, Moskvax, MrOllie, Msh210, Mtness, Niteowlneils, Oleg Alexandrov,
Patrick, Paul August, Paul Matthews, PaulTanenbaum, PavelSolin, Pdenapo, PenguiN42, Phil Bastian, Pichpich, PizzaMargherita, Pm215, Poonamjadhav, Poor Yorick, Pt, Randomguess,
Rasterfahrer, Raven in Orbit, Recentchanges, RedWolf, Rich Farmbrough, Rl, RobHar, Rogper, Romanm, Rpm, Ruakh, Salix alba, Sbyrnes321, Sekky, Shandris, Shirt58, SilverSurfer314, Sofia
karampataki, Ssd, Starlight37, Stevertigo, Stw, Superlaza, Susvolans, Sverdrup, Tarquin, Tbsmith, Technopilgrim, Telso, Template namespace initialisation script, The Anome, Thenub314,
Tobias Hoevekamp, TomyDuby, Tot12, TotientDragooned, Tristanreid, Twin Bird, Tyagi, UKoch, Ulner, Vadimvadim, Waltpohl, Wclxlus, Whommighter, Wideofthemark, WriterHound, Xrchz,
Yhkhoo, 今 古 庸 龍, 190 anonymous edits
Partial differential equation Source: http://en.wikipedia.org/w/index.php?oldid=507434086 Contributors: Afluent Rider, Ahoerstemeier, Aliotra, Alpha Quadrant (alt), Andrei Polyanin,
AndrewHowse, AnkhMorpork, Arnero, ArnoldReinhold, Arthena, AxelBoldt, BASANTDUBE, Beckman16, Belovedfreak, Bemoeial, Ben pcc, BenFrantzDale, Bender235, Bertik, Bjorn.sjodin,
Bkocsis, Borgx, Brian Tvedt, CYD, Cbm, Charles Matthews, Chbarts, Chris in denmark, ChristophE, Cj67, Ckatz, Crowsnest, Crust, CyrilB, D.328, DStoykov, David Crawshaw,
Dharma6662000, Dicklyon, Dirkbb, Djordjes, DominiqueNC, Donludwig, DrHok, Druzhnik, Dysprosia, Egriffin, Eienmaru, Eigenlambda, El C, EmmetCaulfield, Epbr123, Erxnmedia,
Evankeane, F=q(E+v^B), Filemon, Fintor, Foober, Forbes72, Frosted14, Fuse809, Gaj0129, Gerasime, Germandemat, Giese, Giftlite, GraemeL, Gseryakov, Gurch, HappySophie, Hongooi, Hut
8.5, Isnow, Iwfyita, Ixfd64, JNW, JaGa, Jitse Niesen, Jmath666, Jon Cates, JonMcLoone, Jonathanstray, Jss214322, Jyril, Kbolino, Kwiki, L-H, Linas, MFH, Magister Mathematicae, Mandarax,
Mandolinface, Manticore, Marupio, MathMartin, Mathsfreak, Maurice Carbonaro, Mazi, Mhaitham.shammaa, Mhym, Michael Devore, Michael Hardy, Moink, Mpatel, Msh210, NSiDms,
Nbarth, Nneonneo, Ojcit, Oleg Alexandrov, Oliver Pereira, Orenburg1, OrgasGirl, Oscarjquintana, PL290, Pacaro, Patrick, Paul August, Paul Matthews, PeR, PhotoBox, Photonique, Pokespa,
Pranagailu1436, Prime Entelechy, Pt, Quibik, R'n'B, R.e.b., Rausch, RayAYang, Rememberlands, Richard77, Rjwilmsi, Rnt20, Roadrunner, Robinh, Roesser, Rpchase, Salih, Sbarnard,
Siegmaralber, SobakaKachalova, Spartan-James, Srleffler, Stevenj, Stizz, Super Cleverly, SwisterTwister, Sławomir Biały, THEN WHO WAS PHONE?, Tarquin, Tbsmith, The Anome, The
Transhumanist, Thenub314, Tiddly Tom, Timwi, Topbanana, Tosha, Ub3rm4th, Unigfjkl, User A1, Waltpohl, Wavelength, Wavesmikey, Winston365, Wolfrock, Wsulli74, Wtt, Yaje, Yhkhoo,
Zhou Yu, Zzuuzz, 317 anonymous edits
Pearson's chi-squared test Source: http://en.wikipedia.org/w/index.php?oldid=502812581 Contributors: A-k-h, AbsolutDan, Agüeybaná, Aljeirou, Andropod, Arcadian, Asqueella, Athaler,
Avraham, Bender235, BlaiseFEgan, Bobo192, Btyner, Bubba73, Cherkash, Chuck Carroll, Connet, Cortonin, Czenek, Delirium, Den fjättrade ankan, Doyoung, Dpbsmith, Egil, Ektodu,
Fgnievinski, Free Software Knight, Funk17, Furrykef, Giuseppedn, Grotendeels Onschadelijk, Hirak 99, Horn.imh, Jcobb, Jfitzg, Jmcclung711, Joel B. Lewis, John254, JustAGal, Kastchei,
Kgwet, Kjtobo, KohanX, Kwamikagami, Lambiam, Lexor, LilHelpa, Loadmaster, Loodog, Mad Scientist, MarkSweep, Matt Crypto, Maxal, Maxbox51, Melcombe, Michael Hardy, Mikael
Häggström, Motorneuron, Moverly, MrOllie, Muhali, N5iln, Navywings, Nbarth, Neffk, O18, Omicron1234, Paul August, Paulck, Piotrus, PowerWill500, Qartis, Quadduc, Qwfp, Ranger2006,
Rar74B, Requestion, Retobaum, Rjwilmsi, Robinh, Rvrocha, Sander123, Sayantan m, Sbmehta, Schwnj, Seglea, Shadow308b4, Skbkekas, Spangineer, Ssola, Talgalili, Tambal, Tayste, The
Anome, Tim bates, TimBentley, TimBock, ToddDeLuca, Tomi, Triacylglyceride, Wtmitchell, 159 anonymous edits
Perron–Frobenius theorem Source: http://en.wikipedia.org/w/index.php?oldid=506218307 Contributors: Alexander Chervov, Arcfrk, BenFrantzDale, Bender235, BeteNoir, Bill luv2fly,
Bmusician, Brad7777, Charles Matthews, Comfortably Paranoid, Cvdwoest, David Eppstein, Dcclark, Dima373, Doctorilluminatus, Flyingspuds, G.perarnau, Gdm, Giftlite, Justin Mauger,
Kiefer.Wolfowitz, Kirbin, Lfstevens, Lifeonahilltop, Linas, MRFS, Michael Hardy, Moogwrench, Nbarth, Niceguyedc, Pavel Stanley, Psychonaut, R.e.b., Rschwieb, Shining Celebi, Sodin,
Stigin, Tcnuk, TedPavlic, Urhixidur, Vinsz, Xnn, 67 anonymous edits
Poisson distribution Source: http://en.wikipedia.org/w/index.php?oldid=507759981 Contributors: 2620:C6:8000:300:E84F:1D25:AA79:1023, Abtweed98, Adair2324, AdjustShift, Adoniscik,
Aeusoes1, AhmedHan, Ahoerstemeier, AlanUS, Alexius08, Alfpooh, Anchoar2001, Andre Engels, Ankit.shende, Anomalocaris, Army1987, Atonix, AxelBoldt, BL, Baccyak4H, Bdmy, Beetstra,
BenFrantzDale, Bender235, Bgeelhoed, Bidabadi, Bikasuishin, Bjcairns, Bobblewik, Brendan642, Bryan Derksen, Btyner, CameronHarris, Camitz, Captain-n00dle, Caricologist, Cburnett,
ChevyC, Chinasaur, Chriscf, Cimon Avaro, Ciphergoth, Citrus Lover, Constructive editor, Coppertwig, Count ludwig, Cqqbme, Cubic Hour, Cwkmail, DARTH SIDIOUS 2, Damistmu, Danger,
Dannomatic, DannyAsher, David Haslam, Deacon of Pndapetzim, Debastein, Dejvid, Denis.arnaud, Derek farn, Dhollm, Dougsim, DrMicro, Dreadstar, Drevicko, Duke Ganote, Eduardo Antico,
Edward, Emilpohl, Enric Naval, EnumaElish, Everyking, Falk Lieder, Favonian, Fayenatic london, Fnielsen, Fresheneesz, Frobnitzem, Fuzzyrandom, Gaius Cornelius, Gcbernier, Giftlite,
Giganut, Gigemag76, Giraffedata, Godix, Gperjim, GregorB, HamburgerRadio, Headbomb, Heliac, Henrygb, Hgkamath, Hu, HyDeckar, Hypnotoad33, Ian.Shannon, Ilmari Karonen, Inquisitus,
Intervallic, Iridescent, Iwaterpolo, Jamesscottbrown, JavOs, Jeff G., Jfr26, Jitse Niesen, Jleedev, Jmichael ll, Joeltenenbaum, Joseph.m.ernst, Jpk, Jrennie, Jshen6, KSmrq, Kastchei, Kay Kiljae
Lee, Kbk, King of Hearts, Kjfahmipedia, Kmtmeth, LOL, Laussy, Lgallindo, Lilac Soul, Linas, Ling Kah Jai, Lklundin, Logan, Loom91, Lucaswilkins, MC-CPO, Magicxcian, Marie Poise,
MarkSweep, Mathstat, Maxis ftw, McKay, Mdebets, Melcombe, Michael Hardy, Michael Ross, Miguel, Mike Young, Mindmatrix, Minesweeper, MisterSheik, Mobius, MrOllie, Mufka,
Mungbean, Munksm, NAHID, NabeelNM42, Nasnema, Nealmcb, Ned Scott, Netrapt, Nevsan, New Thought, Nicooo, Nijdam, Nipoez, Njerseyguy, Nsaa, O18, Ohnoitsjamie, Orionus, Ott2,
PAR, PBH, Pabristow, Pb30, Pftupper, Pfunk42, Philaulait, Phreed, PierreAbbat, Piplicus, Plasmidmap, Pmokeefe, Postrach, Princebustr, Qacek, Quietbritishjim, Qwfp, Qxz, Robert Hiller,
Ryguasu, SPART82, Saimhe, Salgueiro, Sanmele, Saxenad, Schaber, Scientist xkcd fan, Sdedeo, Sean r lynch, SeanJA, Seaphoto, SebastianHelm, Selket, Sergey Suslov, Sergio01, Serrano24,
Sheldrake, SiriusB, Skbkekas, Skittleys, Slarson, Snorgy, Spoon!, Stangaa, Steve8675309, Storm Rider, Stpasha, Strait, Sun Creator, Suslindisambiguator, Svick, Syzygy, TakuyaMurata,
Talgalili, Taw, Taylanwiki, Tayste, Tbhotch, Tcaruso2, TeaDrinker, Tedtoal, Teply, The Anome, The Thing That Should Not Be, TheNoise, TheTaintedOne, Theda, Thorvaldurgunn, Tomi,
Tommyjs, Tpb, Tpbradbury, Tpruane, Uncle Dick, Uriyan, User A1, Vector Alfawaz, VodkaJazz, Vrkaul, Wavelength, Weedier Mickey, Whosanehusy, Wikibuki, Wikid77, Wileycount,
WillowW, Wjastle, Wtanaka, XJamRastafire, YogeshwerSharma, Youandme, ZioX, Zundark, Þjóðólfr, 448 anonymous edits
Poisson process Source: http://en.wikipedia.org/w/index.php?oldid=502479315 Contributors: AbsolutDan, Adler.fa, Aetheling, Almwi, Anaholic, Arnaudf, Atemperman, Bender235, Bezenek,
Binish fatimah, Bjcairns, Canley, CesarB, Changodoa, Charles Matthews, D nath1, Dale101usa, Darklilac, DavidRideout, Denisarona, Donmegapoppadoc, Dries, Drumlin mcl, Eslip17,
EtudiantEco, False vacuum, Faridani, Filur, Gareth Jones, Giftlite, Greatdiwei, Griffgruff, Grubbiv, ILikeThings, InvictaHOG, J heisenberg, JA(000)Davidson, Jackbaird, Jitse Niesen, Jonkerz,
Jtedor, Keilandreas, Kyrades, LOL, Lawrennd, LilHelpa, LuisPedroCoelho, Lyddea, Melcombe, Memming, Michael Hardy, Mike Rosoft, Mindmatrix, Muhends, Nbarth, O18, Oleg Alexandrov,
PAR, PanagosTheOther, Qlmatrix, Quantyz, Qwfp, Rgclegg, Rp, Sam Hocevar, Shd, Sjara, Skittleys, Slartibarti, Stochastic, Tdslk, Terhorst, Tomixdf, Uliba, Wavelength, Willem, Zundark, 128
anonymous edits
Proportional hazards models Source: http://en.wikipedia.org/w/index.php?oldid=499739784 Contributors: 3mta3, Benjamin.haley, Boffob, Cherkash, Cyan, Den fjättrade ankan, E.amira,
Erianna, Favonian, G716, In base 4, John Vandenberg, JonAWellner, Kierano, Kllin1231, LilHelpa, Lilac Soul, Mathstat, Mebden, Melcombe, Memming, PamD, Pstevens, Qaswed-Ger, Qwfp,
Rich Farmbrough, Rjwilmsi, Rw251, Skbkekas, Skittleys, Theoriste, 19 anonymous edits
Random permutation statistics Source: http://en.wikipedia.org/w/index.php?oldid=469021663 Contributors: Djozwebo, Edward, Emurphy42, Lantonov, Melchoir, Melcombe, Mhym, Michael
Hardy, Nbarth, Rjwilmsi, Shira.kritchman, Skysmith, Stasyuha, Sucharit, The Anome, Zahlentheorie, Zaslav, 12 anonymous edits
Rank (linear algebra) Source: http://en.wikipedia.org/w/index.php?oldid=502602977 Contributors: (:Julien:), Andrevruas, Ashwin, AxelBoldt, Banus, Bappusona, BenFrantzDale,
Bender2k14, BiT, Bob.v.R, Bogdanno, Brian Tvedt, Chrystomath, Confluente, Conversion script, Counterfeit114, Cscwin, Danydib, David Eppstein, Demize.public, DirkOliverTheis, Dysprosia,
Dzordzm, Echocontrol, Ernsts, Fropuff, GPhilip, Geometry guy, Giftlite, Heron, Hroðulf, Hyperbola, Ivan Štambuk, Jcarroll, Jeltz, Jitse Niesen, Jmath666, JoergenB, Justin W Smith, KSmrq,
Kaspar.jan, Kawautar, Kbh3rd, Kck9f, Krunal800, Ligand, LokiClock, Lunch, Lupin, Marc van Leeuwen, Mattblack82, Matty j, Meigel, Mental Blank, Miaow Miaow, Michael Hardy, Milcak,
NClement, Naddy, Nbarth, Od Mishehu, Oleg Alexandrov, PapLorinc, Polizzi, Poor Yorick, R'n'B, Ram einstein, Richard Giuly, Saaska, Salih, Semifinalist, Shim'on, Skunkboy74, Sławomir
Biały, TakuyaMurata, Tardis, TeH nOmInAtOr, TexasAndroid, Trieu, Vql, Wshun, Zhaoway, 111 anonymous edits
Resampling (statistics) Source: http://en.wikipedia.org/w/index.php?oldid=502973193 Contributors: A923812, Annabel, Archimerged, Avenue, AxelBoldt, BD2412, Baccyak4H, Biruitorul,
Bondegezou, Boxplot, Brendon1191, Btyner, Buettcher, Cherkash, Chris53516, Commadot, CristianCantoro, Damistmu, Dcoetzee, Den fjättrade ankan, Diegotorquemada, Dougher, Edstat,
Fisherjs, Fnielsen, Freeparameter, Garik, Gideon.fell, Giftlite, Hilverd, Jackverr, JavaManAz, Jmc200, Jncraton, Jonkerz, Jrvz, Karol Langner, Kenkleinman, Koala9, Kpmiyapuram, Ling.Nut,
Mathstat, Matumio, Mcld, Melcombe, Michael Hardy, Minhtuanht, Mishnadar, Moachim, Nbarth, O18, Ohnoitsjamie, Oleg Alexandrov, Onionmon, Patrubdel, Pgan002, Pifflesticks, Polluxonis,
QualitycontrolUS, Quaristice, Qwfp, Rich Farmbrough, Rickogorman, Ronz, Ropable, Skbkekas, Spm, Tesi1700, Thefellswooper, TimHesterberg, ToddDeLuca, Tolstoy the Cat, Valter Sundh,
Vegaswikian, Wissons, Woohookitty, Wstomv, X7q, СтудентК, 122 anonymous edits
Schur complement Source: http://en.wikipedia.org/w/index.php?oldid=492085177 Contributors: A. Pichler, Acroterion, Aranel, BernardH, Dannoryan, Danpovey, Giftlite, Khaosoahk,
Kkddkkdd, LachlanA, Lavaka, Lockeownzj00, Mct mht, Michael Hardy, Nick, Rotkraut, Shreevatsa, Teorth, Wikomidia, Yuzhang49, Zfeinst, 50 anonymous edits
Sign test Source: http://en.wikipedia.org/w/index.php?oldid=503454667 Contributors: Asqueella, Btyner, DV8 2XL, Daytona2, FlowerFaerie087, Kareekacha, MC-CPO, Male1979,
MarkSweep, Mbhiii, Mcld, Melcombe, Michael Hardy, PamD, Qwfp, Radagast83, Sean1040, Talgalili, That Guy, From That Show!, Unschool, 14 anonymous edits
Singular value decomposition Source: http://en.wikipedia.org/w/index.php?oldid=507711964 Contributors: 3mta3, ABCD, AdamSmithee, Alexanderfrey, Alexmov, Anoko moonlight, Argav,
Arthur Frayn, AxelBoldt, Bciguy, Bdmy, Ben pcc, BenFrantzDale, BigrTex, Billlion, Brech, Browndar, CBM, Cccddd2012, Celique, Chadnash, Charivari, Charles Matthews, Chnv, ChrisDing,
Coffee2theorems, Coppertwig, Countchoc, D1ma5ad, Damien d, Danielcohn, Danielx, David Eppstein, Daytona2, Ddcarnage, Dean P Foster, Debeo Morium, Diomidis Spinellis, Dohn joe,
Douglas guo, Dragon Phoenix Studio, Drizzd, EduardoValle, EmmetCaulfield, Eric Le Bigot, EverettYou, Fcpp, Fgnievinski, Frammm, FrozenPurpleCube, Gauge, Geneffects, Georg-Johann,
Giftlite, GromXXVII, Guaka, Guy.schaffer, Harold f, Headbomb, Headlessplatter, Helwr, Hike395, Hobsonlane, Humanengr, Iprometheus, Isnow, Ivann.exe, JEBrown87544, Jack446,
JackSchmidt, Jamelan, Jdpipe, JerroldPease-Atlanta, Jheald, Jitse Niesen, John, JohnBlackburne, Jonnat, Joriki, Jérôme, K.menin, KSmrq, KYN, Kaol, Kaspar.jan, Kgutwin, Kiefer.Wolfowitz,
Kieff, Kierano, Kjetil1001, KnowledgeOfSelf, Knowledgeis4all, Kupopo, LapoLuchini, LiamH, Lifeonahilltop, Loren.wilton, Lotje, Lourakis, Lupin, Mahakp, Male1979, Marco.lombardi,
Marozols, MathMartin, Matthewmatician, Mct mht, Mdnahas, Melcombe, Merilius, Mhsajadi, Michael Hardy, Michael.greenacre, Morana, Musiphil, Mwtoews, Nbarth, NoobX, Olberd, Oleg
Alexandrov, Orderud, Orie0505, Orlyal, Oyz, Paolo.dL, Peachris, Pftupper, Phrank36, Phrenophobia, Physicistjedi, PigFlu Oink, Pmineault, PolarYukon, ProveIt, Qjqflash3, Qtea, RDBrown,
Rafmag, Rakeshchalasani, Rama, Ranicki, Rdecker02, Reinderien, Rhubbarb, Rich Farmbrough, Rinconsoleao, Rjwilmsi, Robinsor, Ronz, Rschwieb, Semifinalist, ServiceAT, Shadowjams,
Sikfreeze, Sim, Slawekb, Spellbuilder, Starsky617, Stepa, SteveMyers999, Stevenj, Stpasha, Sullivan.t.j, Sushruthg, Szabolcs Nagy, Tauwasser, Tbackstr, TedPavlic, The Thing That Should Not
Be, Thecheesykid, Thorwald, Tom Duff, TomViza, Tony1, Trifon Triantafillidis, Veinor, Wagonsoul773, WhisperingGadfly, WikiMSL, Willem, Woohookitty, Wooooosaj, Wshun, Wsiegmund,
X7q, XiagenFeng, Zanetu, Zvika, Zwitter689, Пика Пика, 353 anonymous edits
Stein's method Source: http://en.wikipedia.org/w/index.php?oldid=502856881 Contributors: 3mta3, Headbomb, J04n, Melcombe, Michael Hardy, Mild Bill Hiccup, Patschy, Paul August,
RomainThibaux, ScienceNUS, Slartibarti, Tabletop, Tassedethe, 21 anonymous edits
Stirling's approximation Source: http://en.wikipedia.org/w/index.php?oldid=502070373 Contributors: A. Pichler, Aaron Rotenberg, Abovechief, Alberto da Calvairate, AvicAWB, AxelBoldt,
Balcer, Barak, Bender235, Bender2k14, Berland, Black Yoshi, Bluemaster, Bluemoose, Bluestarlight37, Btyner, C. Trifle, Calréfa Wéná, Charles Matthews, Cybercobra, DMJ001,
DavidCBryant, DavidHouse, Dcoetzee, Doctormatt, Dogcow, EL Willy, Enochlau, Eranb, Eric119, FilipeS, Fredrik, Frencheigh, GangofOne, Gene Ward Smith, GeneChase, Gerbrant, Giftlite,
Glenn L, Goldencako, Gregie156, Gruntler, Hannes Eder, He Who Is, Henrygb, Herbee, JabberWok, Karl-Henner, Keithcc, Krishnachandranvn, Linas, Lisatwo, Looxix, MIT Trekkie, MOBle,
Ma'ame Michu, Marcol, McKay, Melchoir, Michael Hardy, Minesweeper, MuDavid, Ncik, Nonenmac, OwenX, PAR, PMajer, Pde, Pete Ridges, Plasticup, Poor Yorick, Qwertyus, RDBury,
RJHall, RexNL, Rh, Rogper, Salgueiro, Sandrobt, Slaniel, Spud Gun, Spudbeach, StradivariusTV, Sverdrup, Thomas Bliem, Thomas9987, TigerTjäder, Tim Starling, TomyDuby, Toolnut,
Trovatore, Tsirel, Usurper, Varitek123, Vincent Semeria, Wilke, Wonghang, Woscafrench, Zero0000, 110 anonymous edits
Student's t-distribution Source: http://en.wikipedia.org/w/index.php?oldid=505048269 Contributors: 3mta3, A bit iffy, A. di M., A.M.R., Addone, Afluent Rider, Albmont, AlexAlex,
Alvin-cs, Amonet, Arsenikk, Arthur Rubin, Asperal, Avraham, AxelBoldt, B k, Beetstra, Benwing, Bless sins, Bobo192, BradBeattie, Bryan Derksen, Btyner, CBM, Cburnett, Chiqago,
Chris53516, Chriscf, Classical geographer, Clbustos, Coppertwig, Count Iblis, Crouchy7, Daige, DanSoper, Danko Georgiev, Daveswahl, Dchristle, Ddxc, Dejo, Dkf11, Dmcalist, Dmcg026,
Duncharris, EPadmirateur, EdJohnston, Eleassar, Eric Kvaalen, Ethan, Everettr2, F.morett, Fgimenez, Finnancier, Fnielsen, Frankmeulenaar, Freerow@gmail.com, Furrykef, G716,
Gabrielhanzon, Giftlite, Gperjim, Guibber, Hadleywickham, Hankwang, Hemanshu, Hirak 99, History Sleuth, Huji, Icairns, Ichbin-dcw, Ichoran, Ilhanli, Iwaterpolo, JMiall, JamesBWatson, Jitse
Niesen, Jmk, John Baez, Johnson Lau, Jost Riedel, Kastchei, Kiefer.Wolfowitz, Koavf, Kotar, Kroffe, Kummi, Kyosuke Aoki, Lifeartist, Linas, Lvzon, M.S.K., MATThematical, Madcoverboy,
Maelgwn, Mandarax, MarkSweep, Mcarling, Mdebets, Melcombe, Michael C Price, Michael Hardy, Mig267, Millerdl, MisterSheik, MrOllie, Muzzamo, Nbarth, Netheril96, Ngwt, Nick Number,
NuclearWarfare, O18, Ocorrigan, Oliphaunt, PAR, PBH, Pegasus1457, Petter Strandmark, Phb, Piotrus, Pmanderson, Quietbritishjim, Qwfp, R'n'B, R.e.b., Rich Farmbrough, Rjwilmsi, Rlendog,
Robert Ham, Robinh, Royote, Salgueiro, Sam Derbyshire, Sander123, Scientific29, Secretlondon, Seglea, Serdagger, Sgb 85, Shaww, Shoefly, Skbkekas, Sonett72, Sougandh, Sprocketoreo,
Srbislav Nesic, Stasyuha, Steve8675309, Stpasha, Strait, TJ0513, Techman224, Tgambon, The Anome, Theodork, Thermochap, ThorinMuglindir, Tjfarrar, Tolstoy the Little Black Cat,
Tom.Reding, TomCerul, Tomi, Tutor dave, Uncle G, Unknown, User A1, Valravn, Velocidex, Waldo000000, Wastle, Wavelength, Wikid77, Wile E. Heresiarch, Xenonice, ZantTrang, 294
anonymous edits
Summation by parts Source: http://en.wikipedia.org/w/index.php?oldid=490149714 Contributors: A. Pichler, Bdmy, Brad7777, Calle, Charles Matthews, Charvest, ChrisHodgesUK, David
Eppstein, DavidGSterling, Enchanter, FF2010, Julien Tuerlinckx, Linas, Michael Hardy, Myasuda, Oleg Alexandrov, Oracleofottawa, Radagast83, Shreevatsa, Stan Lioubomoudrov, Tbennert,
Tcnuk, 虞 海, 41 anonymous edits
Taylor series Source: http://en.wikipedia.org/w/index.php?oldid=507980872 Contributors: 1exec1, AbcXyz, Abcwikip, AdamGomaa, Akitchin, Alamino, Alansfault, Alberto da Calvairate, Ali
Obeid, Alksub, Anaraug, Anonymous Dissident, Antandrus, Audacity, Autarkaw, Avarice2593, AxelBoldt, BD2412, Baccyak4H, Banaticus, Barak Sh, Basketball110, Bdmy, BenFrantzDale,
Benzi455, BigJohnHenry, Bill Malloy, BlueSoxSWJ, Bo Jacoby, Bo198214, Brad7777, Byakuya1995, CBM, CambridgeBayWeather, Carifio24, CecilWard, Charles Matthews, Chetrasho, Chris
the speller, Clarknj, Cmcb, Conversion script, CornellRunner314, Cwkmail, Cyp, Dalstadt, Datoews, Deeptrivia, Diotti, Djmutex, Doctormatt, Dominus, Doradus, Druseltal2005, Drzib, Dspdude,
Dudzcom, Duplode, Dysprosia, Egmontaz, Ejrh, El Jogg, Elb2000, Elocute, Epbr123, Eric119, Error792, Estherholleman, Fibonacci, Flcelloguy, FootballHK, Forbes72, Fram, Fredrik,
Frencheigh, Fresheneesz, Fundamental metric tensor, Fvw, Fx0701, GTBacchus, Gandalf61, Genjix, Gesslein, Giftlite, Gijs.peek, Glane23, Gombang, Goodale, Gracenotes, Graham87, Gtstricky,
Guardian of Light, Gulliveig, Hadal, Haham hanuka, Harryboyles, Headbomb, Hede2000, Holger Flier, Hulaxhula15, Ideyal, Igodard, J.delanoy, JRSpriggs, JabberWok, Jagged 85,
Jakob.scholbach, James T Curran, Jane Fallen, Jaro.p, Jasanwiki, Jatinshah, Jaylowblow, Jclemens, Jeff G., Jeff223, Jeronimo, Jesper Carlstrom, Jim1138, JimJast, Jimothy 46, Jitse Niesen,
Jmnbatista, Johnuniq, Jschnur, Jwestbrook, Kallikanzarid, Kepke, Koertefa, Krishnachandranvn, LaGrange, LachlanA, Lambiam, Laurifer, Lethe, Linas, Loisel, LucaB, Lumaga, Mav, McKay,
Melikamp, Mfwitten, Mh, Mhallwiki, Michael Hardy, Miguel, Minesweeper, Mjec, Mordacil, Msh210, Musicguyguy, Mwilde, Myasuda, Nadav1, Nakon, Naught101, NawlinWiki, Netheril96,
Nilesj, Ohconfucius, Olaf, Oleg Alexandrov, OneWeirdDude, Orionus, Patrick, Paul August, Perlmonger42, Petri Krohn, Pink-isnt-well, Plasticspork, Plastikspork, Pmonaragala, PoiZaN, Pomte,
Populus, PouyaDT, Pps, Pranathi, Preetum, Pruthvi.Vallabh, Puckly, Qgluca, Qorilla, Qualc1, Quondum, RETROFUTURE, RJFJR, RJaguar3, RRRR0000RRRR, RageGarden, Randomath,
Randomblue, RayAYang, Red Denim, Reinderien, Reminiscenza, Renfield, ResearchRave, RexNL, Richard L. Peterson, RickK, Rinn0, Roadrunner, Robertsrap111, Rudminjd,
Rudminjw@jmu.edu, Runningonbrains, Sajoka, Sam Derbyshire, Schapel, Schmock, Shalom Yechiel, Shanman7, Shinglor, Silly rabbit, Slash, Slawekb, Sligocki, Sloq, Staplesauce, Stealth HR,
Stephen.metzger, Stevertigo, Stikonas, Support.and.Defend, Sverdrup, Sławomir Biały, Talgalili, Tarquin, Tcnuk, TheQuestionGuy, Thesevenseas, Tide rolls, Tikai, Tobias Bergemann, Troubled
asset, Tsemii, Tskuzzy, Twisterplus, VKokielov, Vanished User 0001, Wclark, Wiki alf, WinterSpw, WojciechSwiderski, Wowus, XJamRastafire, Xantharius, Xem 007, Xionbox, Zeno Gantner,
Zhefurui, Zzuuzz, 石 庭 豐, 559 anonymous edits
Uniform distribution (continuous) Source: http://en.wikipedia.org/w/index.php?oldid=504265526 Contributors: A.M.R., Abdullah Chougle, Aegis Maelstrom, Albmont, AlekseyP, Algebraist,
Amatulic, ArnoldReinhold, B k, Baccyak4H, Benlisquare, Brianga, Brumski, Btyner, Capricorn42, Cburnett, Ceancata, DaBler, DixonD, DrMicro, Duoduoduo, Euchiasmus, Fasten, FilipeS,
Gala.martin, Gareth Owen, Giftlite, Gilliam, Gritzko, Henrygb, Iae, Iwaterpolo, Jamelan, Jitse Niesen, Marie Poise, Melcombe, Michael Hardy, MisterSheik, Nbarth, Nsaa, Oleg Alexandrov,
Ossska, PAR, Qwfp, Ray Chason, Robbyjo, Ruy Pugliesi, Ryan Vesey, Sandrarossi, Sl, Stpasha, Stwalkerster, Sun Creator, Tpb, User A1, Vilietha, Warriorman21, Wikomidia, Zundark, 97
anonymous edits
Uniform distribution (discrete) Source: http://en.wikipedia.org/w/index.php?oldid=490538818 Contributors: Alansohn, Alstublieft, Bob.warfield, Btyner, DVdm, DaBler, Dec1707, DixonD,
Duoduoduo, Fangz, Fasten, FilipeS, Furby100, Giftlite, Gvstorm, Hatster301, Henrygb, Iwaterpolo, Jamelan, Klausness, LimoWreck, Melcombe, Michael Hardy, Mike74dk, Nbarth, O18, P64,
PAR, Paul August, Postrach, Qwfp, Random2001, Stannered, Taylorluker, The Wordsmith, User A1, 59 anonymous edits
Weibull distribution Source: http://en.wikipedia.org/w/index.php?oldid=504942769 Contributors: A. Pichler, Agriculture, Alexey Sanko, Alfpooh, Anomalocaris, Argyriou, Asitgoes,
Avraham, AxelBoldt, Bender235, Bryan Derksen, Btyner, Calimo, Cburnett, Corecode, Corfuman, Craigy144, Cyan, Darrel francis, David Haslam, Dhatfield, Diegotorquemada, Dmh, Doradus,
Edratzer, Eliezg, Emilpohl, Epzsl2, Erianna, Felipehsantos, Fæ, Gausseliminering, Gcm, Giftlite, Gobeirne, GuidoGer, Isheden, Iwaterpolo, J6w5, JA(000)Davidson, JJ Harrison, Janlo, Jason A
Johnson, Jfcorbett, Joanmg, KenT, Kghose, Kpmiyapuram, Lachambre, LachlanA, LilHelpa, MH, Mack2, Mebden, Melcombe, Michael Hardy, MisterSheik, O18, Olaf, Oznickr, PAR, Pleitch,
Policron, Prof. Frink, Qwfp, R.J.Oosterbaan, RekishiEJ, Rickysmithcmrp, Rlendog, Robertmbaldwin, Saad31, Sam Blacketer, Samikrc, Sandeep4tech, Slawekb, Smalljim, Stern, Stpasha, Strypd,
Sławomir Biały, TDogg310, Tassedethe, Tom harrison, Tomi, Uppland, WalNi, Wiki5d, Wikipelli, Wjastle, Yanyanjun, Zundark, 143 anonymous edits
Wilcoxon signed-rank test Source: http://en.wikipedia.org/w/index.php?oldid=506946847 Contributors: AbsolutDan, Amechtley, Arauzo, Asqueella, Baccyak4H, Brian Everlasting,
Chris53516, CowboyBear, Ddxc, Diego Moya, Dwhdwh, Eric Kvaalen, Gstatistics, Hula-hooper0, Infovoria, Iwaterpolo, J-a-x, Jdoev121, JeremyA, Jmaferreira, Jogloran, JordiGH, Joxemai,
Kastchei, Keimzelle, Law, Ldm653, Lleeoo, Ma lafortune, MarkSweep, Mcld, Melcombe, MichaK, Michael Hardy, MrOllie, Mscnln, Muboshgu, Mwtoews, O18, Olaf, Oleg Alexandrov,
Pgan002, Podgorec, Qwerty Binary, Qwfp, Rasnake, RichardMills65, Sango123, Schutz, Schwnj, Seglea, Silverfish, Talgalili, Thorwald, ToddDeLuca, Wzsamd, X7q, YorkBW, Yrithinnd, 69
anonymous edits
Wishart distribution Source: http://en.wikipedia.org/w/index.php?oldid=503124768 Contributors: 3mta3, Aetheling, Aleenf1, Amonet, AtroX Worf, Baccyak4H, Benwing, Bryan Derksen,
Btyner, Crusoe8181, David Eppstein, Deacon of Pndapetzim, Dean P Foster, Entropeneur, Erki der Loony, Gammalgubbe, Giftlite, Headbomb, Ixfd64, Joriki, Jrennie, Kastchei,
Kiefer.Wolfowitz, Kurtitski, Lockeownzj00, MDSchneider, Melcombe, Michael Hardy, P omega sigma, P.wirapati, Panosmarko, Perturbationist, PhysPhD, Qwfp, R'n'B, Robbyjo, Robinh, Ryker,
Shae, Srbauer, TNeloms, Tom.Reding, Tomi, WhiteHatLurker, Wjastle, Zvika, 55 anonymous edits
Image Sources, Licenses and Contributors 634
File:Excess Kurtosis Beta Distribution with mean for full range and variance from 0.05 to 0.25 - J. Rodal.jpg Source:
http://en.wikipedia.org/w/index.php?title=File:Excess_Kurtosis_Beta_Distribution_with_mean_for_full_range_and_variance_from_0.05_to_0.25_-_J._Rodal.jpg License: Creative Commons
Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal
File:Excess Kurtosis Beta Distribution with mean and variance for full range - J. Rodal.jpg Source:
http://en.wikipedia.org/w/index.php?title=File:Excess_Kurtosis_Beta_Distribution_with_mean_and_variance_for_full_range_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike
3.0 Contributors: User:Dr. J. Rodal
File:Differential Entropy Beta Distribution with mean from 0.2 to 0.8 and variance from 0.01 to 0.09 - J. Rodal.jpg Source:
http://en.wikipedia.org/w/index.php?title=File:Differential_Entropy_Beta_Distribution_with_mean_from_0.2_to_0.8_and_variance_from_0.01_to_0.09_-_J._Rodal.jpg License: Creative
Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal
File:Differential Entropy Beta Distribution with mean from 0.3 to 0.7 and variance from 0 to 0.2 - J. Rodal.jpg Source:
http://en.wikipedia.org/w/index.php?title=File:Differential_Entropy_Beta_Distribution_with_mean_from_0.3_to_0.7_and_variance_from_0_to_0.2_-_J._Rodal.jpg License: Creative Commons
Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal
File:Karl_Pearson_2.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Karl_Pearson_2.jpg License: Public Domain Contributors: User:Struthious Bandersnatch
Image:Beta-binomial distribution pmf.png Source: http://en.wikipedia.org/w/index.php?title=File:Beta-binomial_distribution_pmf.png License: Creative Commons Attribution-Sharealike 3.0
Contributors: Nschuma
Image:Beta-binomial cdf.png Source: http://en.wikipedia.org/w/index.php?title=File:Beta-binomial_cdf.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: Nschuma
Image:Pascal's triangle 5.svg Source: http://en.wikipedia.org/w/index.php?title=File:Pascal's_triangle_5.svg License: GNU Free Documentation License Contributors: User:Conrad.Irwin
originally User:Drini
Image:Pascal's triangle - 1000th row.png Source: http://en.wikipedia.org/w/index.php?title=File:Pascal's_triangle_-_1000th_row.png License: Creative Commons Attribution-Sharealike 3.0
Contributors: Endlessoblivion
File:Binomial distribution pmf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Binomial_distribution_pmf.svg License: Public Domain Contributors: Tayste
File:Binomial distribution cdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Binomial_distribution_cdf.svg License: Public Domain Contributors: Tayste
File:Binomial Distribution.PNG Source: http://en.wikipedia.org/w/index.php?title=File:Binomial_Distribution.PNG License: unknown Contributors: Schlurcher
File:Pascal's triangle; binomial distribution.svg Source: http://en.wikipedia.org/w/index.php?title=File:Pascal's_triangle;_binomial_distribution.svg License: Public Domain Contributors:
Lipedia
File:Binomial Distribution.svg Source: http://en.wikipedia.org/w/index.php?title=File:Binomial_Distribution.svg License: GNU Free Documentation License Contributors: cflm (talk)
Image:cauchy pdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Cauchy_pdf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas
Image:cauchy cdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Cauchy_cdf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas
File:Sinc simple.svg Source: http://en.wikipedia.org/w/index.php?title=File:Sinc_simple.svg License: Public Domain Contributors: Stpasha
File:2 cfs coincide over a finite interval.svg Source: http://en.wikipedia.org/w/index.php?title=File:2_cfs_coincide_over_a_finite_interval.svg License: Public Domain Contributors: Stpasha
Image:Chernoff bound.png Source: http://en.wikipedia.org/w/index.php?title=File:Chernoff_bound.png License: Public Domain Contributors: Dcoetzee
File:chi-square pdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Chi-square_pdf.svg License: Creative Commons Attribution 3.0 Contributors: Geek3
File:chi-square distributionCDF.svg Source: http://en.wikipedia.org/w/index.php?title=File:Chi-square_distributionCDF.svg License: Creative Commons Zero Contributors: Philten, 2
anonymous edits
Image:Convergence in distribution (sum of uniform rvs).gif Source: http://en.wikipedia.org/w/index.php?title=File:Convergence_in_distribution_(sum_of_uniform_rvs).gif License: Creative
Commons Zero Contributors: Stpasha
Image:Comparison test series.svg Source: http://en.wikipedia.org/w/index.php?title=File:Comparison_test_series.svg License: GNU Free Documentation License Contributors: Titoxd
Image:ExpConvergence.gif Source: http://en.wikipedia.org/w/index.php?title=File:ExpConvergence.gif License: GNU Free Documentation License Contributors: User:Rpchase
Image:LogConvergenceAnim.gif Source: http://en.wikipedia.org/w/index.php?title=File:LogConvergenceAnim.gif License: Public Domain Contributors: Kan8eDie
File:Gaussian copula gaussian marginals.png Source: http://en.wikipedia.org/w/index.php?title=File:Gaussian_copula_gaussian_marginals.png License: Creative Commons
Attribution-Sharealike 3.0 Contributors: Matteo Zandi
File:Biv gumbel dist.png Source: http://en.wikipedia.org/w/index.php?title=File:Biv_gumbel_dist.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: Matteo Zandi
File:copule ord.svg Source: http://en.wikipedia.org/w/index.php?title=File:Copule_ord.svg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Matteo Zandi
File:Copula gaussian.svg Source: http://en.wikipedia.org/w/index.php?title=File:Copula_gaussian.svg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Matteo Zandi
Image:Area parallellogram as determinant.svg Source: http://en.wikipedia.org/w/index.php?title=File:Area_parallellogram_as_determinant.svg License: Public Domain Contributors: Jitse
Niesen
Image:Determinant parallelepiped.svg Source: http://en.wikipedia.org/w/index.php?title=File:Determinant_parallelepiped.svg License: Creative Commons Attribution 3.0 Contributors:
Claudio Rocchini
Image:Sarrus rule.png Source: http://en.wikipedia.org/w/index.php?title=File:Sarrus_rule.png License: Creative Commons Attribution 3.0 Contributors: Kmhkmh
Image:Dirichlet distributions.png Source: http://en.wikipedia.org/w/index.php?title=File:Dirichlet_distributions.png License: Public domain Contributors: en:User:ThG
Image:Dirichlet example.png Source: http://en.wikipedia.org/w/index.php?title=File:Dirichlet_example.png License: Public Domain Contributors: Mitch3
Image:Gamma distribution pdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Gamma_distribution_pdf.svg License: Creative Commons Attribution-ShareAlike 3.0 Unported
Contributors: Gamma_distribution_pdf.png: MarkSweep and Cburnett derivative work: Autopilot (talk)
Image:Gamma distribution cdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Gamma_distribution_cdf.svg License: Creative Commons Attribution-ShareAlike 3.0 Unported
Contributors: Gamma_distribution_cdf.png: MarkSweep and Cburnett derivative work: Autopilot (talk)
File:Em old faithful.gif Source: http://en.wikipedia.org/w/index.php?title=File:Em_old_faithful.gif License: Creative Commons Attribution-Sharealike 3.0 Contributors: 3mta3 (talk) 16:55, 23
March 2009 (UTC)
File:exponential pdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Exponential_pdf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas
File:exponential cdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Exponential_cdf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas
File:Mean exp.svg Source: http://en.wikipedia.org/w/index.php?title=File:Mean_exp.svg License: Creative Commons Attribution-Sharealike 3.0,2.5,2.0,1.0 Contributors: Erzbischof
File:Median exp.svg Source: http://en.wikipedia.org/w/index.php?title=File:Median_exp.svg License: Creative Commons Attribution-Sharealike 3.0,2.5,2.0,1.0 Contributors: Erzbischof
File:FitExponDistr.tif Source: http://en.wikipedia.org/w/index.php?title=File:FitExponDistr.tif License: Creative Commons Attribution-Sharealike 3.0 Contributors: Buenas días
Image:F distributionPDF.png Source: http://en.wikipedia.org/w/index.php?title=File:F_distributionPDF.png License: GNU Free Documentation License Contributors: en:User:Pdbailey
Image:F distributionCDF.png Source: http://en.wikipedia.org/w/index.php?title=File:F_distributionCDF.png License: GNU Free Documentation License Contributors: en:User:Pdbailey
Image:F-dens-2-15df.svg Source: http://en.wikipedia.org/w/index.php?title=File:F-dens-2-15df.svg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Landroni
Image:Gamma-PDF-3D.png Source: http://en.wikipedia.org/w/index.php?title=File:Gamma-PDF-3D.png License: Creative Commons Attribution-Sharealike 3.0 Contributors:
User:Ronhjones
Image:Gamma-KL-3D.png Source: http://en.wikipedia.org/w/index.php?title=File:Gamma-KL-3D.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Ronhjones
Image:Gamma plot.svg Source: http://en.wikipedia.org/w/index.php?title=File:Gamma_plot.svg License: Creative Commons Attribution-ShareAlike 3.0 Unported Contributors: Alessio
Damato
Image:Factorial interpolation.png Source: http://en.wikipedia.org/w/index.php?title=File:Factorial_interpolation.png License: Public Domain Contributors: Berland, Fredrik, Kilom691
Image:Complex gamma.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Complex_gamma.jpg License: Public Domain Contributors: Jan Homann
Image:Gamma function 2.png Source: http://en.wikipedia.org/w/index.php?title=File:Gamma_function_2.png License: Public Domain Contributors: Bobmath, TakuyaMurata
Image:Complex gamma function abs.png Source: http://en.wikipedia.org/w/index.php?title=File:Complex_gamma_function_abs.png License: Public Domain Contributors: Bender2k14,
Fredrik, Lipedia, 2 anonymous edits
Image:Complex gamma function Re.png Source: http://en.wikipedia.org/w/index.php?title=File:Complex_gamma_function_Re.png License: Public Domain Contributors: Fredrik
Image:Complex gamma function Im.png Source: http://en.wikipedia.org/w/index.php?title=File:Complex_gamma_function_Im.png License: Public Domain Contributors: Fredrik
File:DanielBernoulliLettreAGoldbach-1729-10-06.jpg Source: http://en.wikipedia.org/w/index.php?title=File:DanielBernoulliLettreAGoldbach-1729-10-06.jpg License: unknown
Contributors: Wirkstoff
Image:Euler factorial paper.png Source: http://en.wikipedia.org/w/index.php?title=File:Euler_factorial_paper.png License: Public Domain Contributors: Euler
Image:Jahnke gamma function.png Source: http://en.wikipedia.org/w/index.php?title=File:Jahnke_gamma_function.png License: Public Domain Contributors: Eugene Jahnke, Fritz Emde
File:geometric pmf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Geometric_pmf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas
File:geometric cdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Geometric_cdf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas
Image:PDF invGauss.png Source: http://en.wikipedia.org/w/index.php?title=File:PDF_invGauss.png License: Creative Commons Attribution-Sharealike 2.5 Contributors: Thomas Steiner
Image:Inverse gamma pdf.png Source: http://en.wikipedia.org/w/index.php?title=File:Inverse_gamma_pdf.png License: GNU General Public License Contributors: Alejo2083, Cburnett
Image:Inverse gamma cdf.png Source: http://en.wikipedia.org/w/index.php?title=File:Inverse_gamma_cdf.png License: GNU General Public License Contributors: Alejo2083, Cburnett
File:KL-Gauss-Example.png Source: http://en.wikipedia.org/w/index.php?title=File:KL-Gauss-Example.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: Mundhenk
(talk)
File:ArgonKLdivergence.png Source: http://en.wikipedia.org/w/index.php?title=File:ArgonKLdivergence.png License: GNU Free Documentation License Contributors: P. Fraundorf
Image:Laplace distribution pdf.png Source: http://en.wikipedia.org/w/index.php?title=File:Laplace_distribution_pdf.png License: GNU General Public License Contributors: It Is Me Here,
Kilom691, MarkSweep
Image:Laplace distribution cdf.png Source: http://en.wikipedia.org/w/index.php?title=File:Laplace_distribution_cdf.png License: GNU General Public License Contributors: Bender235,
MarkSweep, Perhelion
File:Laplace's equation on an annulus.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Laplace's_equation_on_an_annulus.jpg License: Creative Commons Attribution-Sharealike
3.0 Contributors: DavidianSkitzou
File:Laplaces method.svg Source: http://en.wikipedia.org/w/index.php?title=File:Laplaces_method.svg License: Creative Commons Zero Contributors: User:Krishnavedala
Image:Loess curve.svg Source: http://en.wikipedia.org/w/index.php?title=File:Loess_curve.svg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Kierano
Image:PD-icon.svg Source: http://en.wikipedia.org/w/index.php?title=File:PD-icon.svg License: Public Domain Contributors: Alex.muller, Anomie, Anonymous Dissident, CBM, MBisanz,
PBS, Quadell, Rocket000, Strangerer, Timotheus Canens, 1 anonymous edits
Image:Some log-normal distributions.svg Source: http://en.wikipedia.org/w/index.php?title=File:Some_log-normal_distributions.svg License: Creative Commons Attribution-Sharealike 3.0
Contributors: original by User:Par Derivative by Mikael Häggström from the original, File:Lognormal distribution PDF.png by User:Par
Image:Lognormal distribution CDF.svg Source: http://en.wikipedia.org/w/index.php?title=File:Lognormal_distribution_CDF.svg License: Creative Commons Attribution-ShareAlike 3.0
Unported Contributors: Lognormal_distribution_CDF.png: User:PAR derivative work: Autopilot (talk)
Image:Comparison mean median mode.svg Source: http://en.wikipedia.org/w/index.php?title=File:Comparison_mean_median_mode.svg License: Creative Commons Attribution-Sharealike
3.0 Contributors: Cmglee
File:FitLogNormDistr.tif Source: http://en.wikipedia.org/w/index.php?title=File:FitLogNormDistr.tif License: Creative Commons Attribution-Sharealike 3.0 Contributors: Buenas días
Image:Levy0 distributionPDF.svg Source: http://en.wikipedia.org/w/index.php?title=File:Levy0_distributionPDF.svg License: Creative Commons Zero Contributors: User:Krishnavedala
Image:Levy0 distributionCDF.svg Source: http://en.wikipedia.org/w/index.php?title=File:Levy0_distributionCDF.svg License: Creative Commons Zero Contributors: User:Krishnavedala
Image:Levy0 LdistributionPDF.svg Source: http://en.wikipedia.org/w/index.php?title=File:Levy0_LdistributionPDF.svg License: Creative Commons Zero Contributors: User:Krishnavedala
Image:Ee noncompactness.svg Source: http://en.wikipedia.org/w/index.php?title=File:Ee_noncompactness.svg License: Public Domain Contributors: Stpasha
Image:MLfunctionbinomial-en.svg Source: http://en.wikipedia.org/w/index.php?title=File:MLfunctionbinomial-en.svg License: Creative Commons Attribution-Sharealike 3.0 Contributors:
User:Casp11
Image:GaussianScatterPCA.png Source: http://en.wikipedia.org/w/index.php?title=File:GaussianScatterPCA.png License: GNU Free Documentation License Contributors: —Ben FrantzDale
(talk) (Transferred by ILCyborg)
Image:Sphere wireframe.svg Source: http://en.wikipedia.org/w/index.php?title=File:Sphere_wireframe.svg License: Creative Commons Attribution-Sharealike 3.0,2.5,2.0,1.0 Contributors:
Geek3
Image:Hypersphere coord.PNG Source: http://en.wikipedia.org/w/index.php?title=File:Hypersphere_coord.PNG License: Creative Commons Attribution 3.0 Contributors: derivative work:
Pbroks13 (talk) Hypersphere_coord.gif: Claudio Rocchini
File:N SpheresVolumeAndSurfaceArea.png Source: http://en.wikipedia.org/w/index.php?title=File:N_SpheresVolumeAndSurfaceArea.png License: Creative Commons
Attribution-Sharealike 3.0 Contributors: User:Joseph Lindenberg
File:Negbinomial.gif Source: http://en.wikipedia.org/w/index.php?title=File:Negbinomial.gif License: Public Domain Contributors: Stpasha
File:Chi-Squared-(nonCentral)-pdf.png Source: http://en.wikipedia.org/w/index.php?title=File:Chi-Squared-(nonCentral)-pdf.png License: Creative Commons Attribution-Sharealike 2.5
Contributors: Thomas Steiner
File:Chi-Squared-(nonCentral)-cdf.png Source: http://en.wikipedia.org/w/index.php?title=File:Chi-Squared-(nonCentral)-cdf.png License: Creative Commons Attribution-Sharealike 2.5
Contributors: Thomas Steiner
Image:nc student t pdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Nc_student_t_pdf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas
Image:Vector norm sup.svg Source: http://en.wikipedia.org/w/index.php?title=File:Vector_norm_sup.svg License: Public Domain Contributors: Wiso, 1 anonymous edits
Image:Vector norms.svg Source: http://en.wikipedia.org/w/index.php?title=File:Vector_norms.svg License: GNU Free Documentation License Contributors: User:Esmil
File:Normal Distribution PDF.svg Source: http://en.wikipedia.org/w/index.php?title=File:Normal_Distribution_PDF.svg License: Public Domain Contributors: Inductiveload
File:Normal Distribution CDF.svg Source: http://en.wikipedia.org/w/index.php?title=File:Normal_Distribution_CDF.svg License: Public Domain Contributors: Inductiveload
File:standard deviation diagram.svg Source: http://en.wikipedia.org/w/index.php?title=File:Standard_deviation_diagram.svg License: Creative Commons Attribution 2.5 Contributors:
Mwtoews
File:De moivre-laplace.gif Source: http://en.wikipedia.org/w/index.php?title=File:De_moivre-laplace.gif License: Public Domain Contributors: Stpasha
File:Dice sum central limit theorem.svg Source: http://en.wikipedia.org/w/index.php?title=File:Dice_sum_central_limit_theorem.svg License: Creative Commons Attribution-Sharealike 3.0
Contributors: Cmglee
File:QHarmonicOscillator.png Source: http://en.wikipedia.org/w/index.php?title=File:QHarmonicOscillator.png License: GNU Free Documentation License Contributors:
en:User:FlorianMarquardt
File:Fisher iris versicolor sepalwidth.svg Source: http://en.wikipedia.org/w/index.php?title=File:Fisher_iris_versicolor_sepalwidth.svg License: Creative Commons Attribution-Sharealike 3.0
Contributors: en:User:Qwfp (original); Pbroks13 (talk) (redraw)
File:FitNormDistr.tif Source: http://en.wikipedia.org/w/index.php?title=File:FitNormDistr.tif License: Public Domain Contributors: Buenas días
File:Planche de Galton.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Planche_de_Galton.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Antoine
Taveneaux
File:Carl Friedrich Gauss.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Carl_Friedrich_Gauss.jpg License: Public Domain Contributors: Gottlieb BiermannA. Wittmann (photo)
File:Pierre-Simon Laplace.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Pierre-Simon_Laplace.jpg License: Public Domain Contributors: Ashill, Ecummenic, Elcobbola,
Gene.arboit, Jimmy44, Olivier2, 霧木諒二, 1 anonymous edits
Image:OrderStatistics.gif Source: http://en.wikipedia.org/w/index.php?title=File:OrderStatistics.gif License: GNU Free Documentation License Contributors: Bokken, Flappiefh, Rodolfo
Hermans, 1 anonymous edits
Image:Parabolic trajectory.svg Source: http://en.wikipedia.org/w/index.php?title=File:Parabolic_trajectory.svg License: Public Domain Contributors: Oleg Alexandrov
File:Heat eqn.gif Source: http://en.wikipedia.org/w/index.php?title=File:Heat_eqn.gif License: Public Domain Contributors: Oleg Alexandrov
File:Chi-square distributionCDF-English.png Source: http://en.wikipedia.org/w/index.php?title=File:Chi-square_distributionCDF-English.png License: Public Domain Contributors: Mikael
Häggström
File:poisson pmf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Poisson_pmf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas
File:poisson cdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Poisson_cdf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas
File:Binomial versus poisson.svg Source: http://en.wikipedia.org/w/index.php?title=File:Binomial_versus_poisson.svg License: Creative Commons Attribution-Sharealike 3.0 Contributors:
Sergio01
Image:SampleProcess.png Source: http://en.wikipedia.org/w/index.php?title=File:SampleProcess.png License: Public Domain Contributors: Willem (talk)
File:Singular value decomposition.gif Source: http://en.wikipedia.org/w/index.php?title=File:Singular_value_decomposition.gif License: Public Domain Contributors: Kieff
Image:Stirling's Approximation.svg Source: http://en.wikipedia.org/w/index.php?title=File:Stirling's_Approximation.svg License: Creative Commons Attribution-Share Alike Contributors:
R. A. Nonenmacher
Image:StirlingErrorGraphBB.svg Source: http://en.wikipedia.org/w/index.php?title=File:StirlingErrorGraphBB.svg License: Public Domain Contributors: DMJ001
Image:StirlingError1.svg Source: http://en.wikipedia.org/w/index.php?title=File:StirlingError1.svg License: Public Domain Contributors: DMJ001
Image:student t pdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Student_t_pdf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas
Image:student t cdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Student_t_cdf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas
Image:T distribution 1df.png Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_1df.png License: GNU Free Documentation License Contributors: Juiced lemon, Maksim,
1 anonymous edits
Image:T distribution 2df.png Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_2df.png License: GNU Free Documentation License Contributors: Juiced lemon, Maksim,
1 anonymous edits
Image:T distribution 3df.png Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_3df.png License: GNU Free Documentation License Contributors: Juiced lemon, Maksim,
1 anonymous edits
Image:T distribution 5df.png Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_5df.png License: GNU Free Documentation License Contributors: Juiced lemon, Maksim,
1 anonymous edits
Image:T distribution 10df.png Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_10df.png License: GNU Free Documentation License Contributors: Juiced lemon,
Maksim, 1 anonymous edits
Image:T distribution 30df.png Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_30df.png License: GNU Free Documentation License Contributors: Juiced lemon,
Maksim, 1 anonymous edits
Image:sintay.svg Source: http://en.wikipedia.org/w/index.php?title=File:Sintay.svg License: Creative Commons Attribution-ShareAlike 3.0 Unported Contributors: User:Qualc1
Image:Exp series.gif Source: http://en.wikipedia.org/w/index.php?title=File:Exp_series.gif License: Public Domain Contributors: Oleg Alexandrov
Image:Exp neg inverse square.svg Source: http://en.wikipedia.org/w/index.php?title=File:Exp_neg_inverse_square.svg License: Public Domain Contributors: Plastikspork (talk). Original uploader was Plastikspork at en.wikipedia
Image:Taylorsine.svg Source: http://en.wikipedia.org/w/index.php?title=File:Taylorsine.svg License: Public Domain Contributors: Geek3, Hellisp, Riojajar, 1 anonymous edits
Image:LogTay.svg Source: http://en.wikipedia.org/w/index.php?title=File:LogTay.svg License: Public Domain Contributors: Niles
Image:TaylorCosCos.png Source: http://en.wikipedia.org/w/index.php?title=File:TaylorCosCos.png License: GNU Free Documentation License Contributors: Original uploader was Sam
Derbyshire at en.wikipedia
Image:TaylorCosPol.png Source: http://en.wikipedia.org/w/index.php?title=File:TaylorCosPol.png License: GNU Free Documentation License Contributors: Original uploader was Sam
Derbyshire at en.wikipedia
Image:TaylorCosAll.png Source: http://en.wikipedia.org/w/index.php?title=File:TaylorCosAll.png License: GNU Free Documentation License Contributors: Original uploader was Sam
Derbyshire at en.wikipedia
Image:Taylor e^xln1plusy.png Source: http://en.wikipedia.org/w/index.php?title=File:Taylor_e^xln1plusy.png License: Creative Commons Attribution-Sharealike 3.0 Contributors:
Slobo486, 1 anonymous edits
image:Uniform distribution PDF.png Source: http://en.wikipedia.org/w/index.php?title=File:Uniform_distribution_PDF.png License: Public Domain Contributors: EugeneZelenko, It Is Me
Here, Joxemai, PAR
image:Uniform distribution CDF.png Source: http://en.wikipedia.org/w/index.php?title=File:Uniform_distribution_CDF.png License: Public Domain Contributors: EugeneZelenko, Joxemai,
PAR
Image:DUniform distribution PDF.png Source: http://en.wikipedia.org/w/index.php?title=File:DUniform_distribution_PDF.png License: GNU Free Documentation License Contributors:
EugeneZelenko, PAR, WikipediaMaster
Image:Dis Uniform distribution CDF.svg Source: http://en.wikipedia.org/w/index.php?title=File:Dis_Uniform_distribution_CDF.svg License: GNU General Public License Contributors:
en:User:Pdbailey, traced by User:Stannered
Image:Weibull PDF.svg Source: http://en.wikipedia.org/w/index.php?title=File:Weibull_PDF.svg License: Creative Commons Attribution-Sharealike 3.0,2.5,2.0,1.0 Contributors: Calimo
Image:Weibull CDF.svg Source: http://en.wikipedia.org/w/index.php?title=File:Weibull_CDF.svg License: Creative Commons Attribution-Sharealike 3.0,2.5,2.0,1.0 Contributors: Calimo,
after Philip Leitch.
File:FitWeibullDistr.tif Source: http://en.wikipedia.org/w/index.php?title=File:FitWeibullDistr.tif License: Public Domain Contributors: Buenas días
License
Creative Commons Attribution-Share Alike 3.0 Unported
//creativecommons.org/licenses/by-sa/3.0/