
In statistics, data transformation is the application of a deterministic mathematical function to each point in a dataset; that is, each data point zi is replaced with the transformed value yi = f(zi), where f is a function. Transforms are usually applied so that the data appear to more closely meet the assumptions of a statistical inference procedure that is to be applied, or to improve the interpretability or appearance of graphs.

Nearly always, the function that is used to transform the data is invertible, and generally is continuous. The transformation is usually applied to a collection of comparable measurements. For example, if we are working with data on people's incomes in some currency unit, it would be common to transform each person's income value by the logarithm function.
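As a minimal illustration of such a pointwise transform in Python, the income figures below are made-up values used only to show the mechanics:

    import numpy as np

    # Hypothetical incomes in some currency unit (illustrative values only).
    incomes = np.array([12_000, 25_000, 40_000, 95_000, 1_200_000])

    # Apply the deterministic function f(z) = log(z) to each data point.
    log_incomes = np.log(incomes)

    print(log_incomes)  # the transformed values yi = f(zi)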

Contents
1 Reasons for transforming data
2 Data transformation in regression
2.1 Alternative
3 Examples of transformations
4 Common transformations
5 Transforming to normality
6 Transforming to a uniform distribution
7 Variance stabilizing transformations
8 Transformations for multivariate data

Reasons for transforming data

[Figure: A scatterplot in which the areas of the sovereign states and dependent territories in the world are plotted on the vertical axis against their populations on the horizontal axis. The upper plot uses raw data. In the lower plot, both the area and population data have been transformed using the logarithm function.]

Guidance for how data should be transformed, or whether a transformation should be applied at all, should come from the particular statistical analysis to be performed. For example, a simple way to construct an approximate 95% confidence interval for the population mean is to take the sample mean plus or minus two standard error units. However, the constant factor 2 used here is particular to the normal distribution, and is only applicable if the sample mean varies approximately normally. The central limit theorem states that in many situations, the sample mean does vary normally if the sample size is reasonably large. However, if the population is substantially skewed and the sample size is at most moderate, the approximation provided by the central limit theorem can be poor, and the resulting confidence interval will likely have the wrong coverage probability. Thus, when there is evidence of substantial skew in the data, it is common to transform the data to a symmetric distribution before constructing a confidence interval. If desired, the confidence interval can then be transformed back to the original scale using the inverse of the transformation that was applied to the data.
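A minimal sketch of this back-transformation idea in Python, assuming the log-transformed data are roughly symmetric; note the resulting interval describes the mean on the log scale (the geometric mean on the original scale), not the arithmetic mean:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.lognormal(mean=10, sigma=1, size=50)  # skewed positive data

    # Work on the log scale, where the data are approximately symmetric.
    logx = np.log(x)
    m = logx.mean()
    se = stats.sem(logx)

    lo, hi = m - 2 * se, m + 2 * se   # interval on the log scale
    print(np.exp(lo), np.exp(hi))     # back-transformed to the original scale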
Data can also be transformed to make them easier to visualize. For example, suppose we have a scatterplot in which the points are the countries of the world, and the data values being plotted are the land area and population of each country. If the plot is made using untransformed data (e.g., square kilometers for area and the number of people for population), most of the countries would be plotted in a tight cluster of points in the lower left corner of the graph. The few countries with very large areas and/or populations would be spread thinly around most of the graph's area. Simply rescaling units (e.g., to thousands of square kilometers, or to millions of people) will not change this. However, following logarithmic transformations of both area and population, the points will be spread more uniformly in the graph.
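A sketch of that comparison with Matplotlib, using a few made-up area/population pairs; real country data would show the same clustering effect:

    import matplotlib.pyplot as plt

    # Hypothetical (area in km^2, population) pairs spanning several magnitudes.
    area = [700, 9_000, 110_000, 2_000_000, 9_600_000]
    pop = [500_000, 4_000_000, 10_000_000, 140_000_000, 1_400_000_000]

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.scatter(pop, area)    # raw data: one tight cluster in the corner
    ax2.scatter(pop, area)
    ax2.set_xscale("log")     # log scales spread the points out
    ax2.set_yscale("log")
    plt.show()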
A final reason that data can be transformed is to improve interpretability, even if no formal statistical analysis or visualization is to be performed. For example, suppose we are comparing cars in terms of their fuel economy. These data are usually presented as "kilometers per liter" or "miles per gallon." However, if the goal is to assess how much additional fuel a person would use in one year when driving one car compared to another, it is more natural to work with the data transformed by the reciprocal function, yielding liters per kilometer, or gallons per mile.
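A small worked example of the reciprocal transform, with invented fuel-economy figures: the gap in annual fuel use between 10 and 15 mpg is far larger than between 30 and 35 mpg, which the raw mpg values obscure.

    # Hypothetical fuel economies in miles per gallon.
    mpg = [10, 15, 30, 35]

    # Reciprocal transform: gallons per mile.
    gpm = [1 / x for x in mpg]

    # Fuel used over a 12,000-mile year.
    annual_gallons = [12_000 * g for g in gpm]
    print(annual_gallons)  # [1200.0, 800.0, 400.0, ~342.9]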

Data transformation in regression

Linear regression is a statistical technique for relating a dependent variable Y to one or more independent variables X. The simplest regression models capture a linear relationship between the expected value of Y and each independent variable (when the other independent variables are held fixed). If linearity fails to hold, even approximately, it is sometimes possible to transform either the independent or dependent variables in the regression model to improve the linearity.
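A minimal sketch, assuming a relationship that is linear in log(X): fit Y against the transformed predictor rather than against X directly.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(1, 100, size=200)
    y = 2.0 + 3.0 * np.log(x) + rng.normal(0, 0.5, size=200)

    # Fit Y on log(X) instead of X itself.
    slope, intercept = np.polyfit(np.log(x), y, 1)
    print(slope, intercept)  # close to 3.0 and 2.0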
Another assumption of linear regression is that the variance is the same for each possible expected value (this is known as homoscedasticity). Univariate normality is not needed for least squares estimates of the regression parameters to be meaningful (see the Gauss-Markov theorem). However, confidence intervals and hypothesis tests will have better statistical properties if the variables exhibit multivariate normality. This can be assessed empirically by plotting the fitted values against the residuals, and by inspecting the normal quantile plot of the residuals. Note that it is not relevant whether the dependent variable Y is marginally normally distributed.
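One way to produce those two diagnostic plots, sketched here with statsmodels and SciPy on simulated data (any least-squares fit would do):

    import matplotlib.pyplot as plt
    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, size=100)
    y = 1.0 + 2.0 * x + rng.normal(0, 1, size=100)

    fit = sm.OLS(y, sm.add_constant(x)).fit()

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.scatter(fit.fittedvalues, fit.resid)   # look for constant spread
    stats.probplot(fit.resid, plot=ax2)        # normal quantile plot of residuals
    plt.show()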

Alternative

Generalized linear models (GLMs) provide a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. GLMs allow the linear model to be related to the response variable via a link function, and allow the magnitude of the variance of each measurement to be a function of its predicted value.
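For instance, a Poisson GLM with a log link can be fit directly instead of log-transforming a count response; a minimal statsmodels sketch on simulated counts:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 2, size=200)
    y = rng.poisson(np.exp(0.5 + 1.0 * x))   # counts with a log-linear mean

    # The log link relates the linear predictor to the mean; variance = mean.
    model = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson())
    print(model.fit().params)                # close to [0.5, 1.0]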

Examples of transformations

Equation: Y = a + bX
Meaning: A unit increase in X is associated with an average of b units increase in Y.

Equation: log(Y) = a + bX (from exponentiating both sides of the equation: Y = e^a e^(bX))
Meaning: A unit increase in X is associated with an average increase of about 100b% in Y (for small b).

Equation: Y = a + b log(X)
Meaning: A 1% increase in X is associated with an average b/100 units increase in Y.

Equation: log(Y) = a + b log(X) (from exponentiating both sides of the equation: Y = e^a X^b)
Meaning: A 1% increase in X is associated with approximately a b% increase in Y.
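A quick numerical check of the last interpretation (the log-log model is an elasticity: a 1% change in X scales Y by 1.01^b, which is about 1 + b/100):

    import numpy as np

    a, b = 1.0, 2.0
    f = lambda x: np.exp(a) * x ** b   # Y = e^a X^b, i.e. log(Y) = a + b log(X)

    x = 50.0
    pct_change = 100 * (f(1.01 * x) - f(x)) / f(x)
    print(pct_change)                  # ~2.01, i.e. about b percent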

Common transformations

The logarithm and square root transformations are commonly used for positive data, and the multiplicative inverse (reciprocal) transformation can be used for non-zero data. The power transformation is a family of transformations parameterized by a value λ that includes the logarithm, square root, and multiplicative inverse as special cases. To approach data transformation systematically, it is possible to use statistical estimation techniques to estimate the parameter λ in the power transformation, thereby identifying the transformation that is approximately the most appropriate in a given setting. Since the power transformation family also includes the identity transformation, this approach can also indicate whether it would be best to analyze the data without a transformation. In regression analysis, this approach is known as the Box-Cox technique.
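SciPy's boxcox implements exactly this estimation, choosing λ by maximum likelihood; a minimal sketch on simulated skewed data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    x = rng.lognormal(size=500)   # positively skewed data

    # Estimate lambda by maximum likelihood and transform in one call.
    y, lam = stats.boxcox(x)
    print(lam)                    # near 0, i.e. close to a log transform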
The reciprocal and some power transformations can be meaningfully applied to data that include both positive and negative values (the power transformation is invertible over all real numbers if the exponent is an odd integer). However, when both negative and positive values are observed, it is more common to begin by adding a constant to all values, producing a set of non-negative data to which any power transformation can be applied.
A common situation where a data transformation is applied is when a value of interest ranges over several orders of magnitude. Many physical and social phenomena exhibit such behavior: incomes, species populations, galaxy sizes, and rainfall volumes, to name a few. Power transforms, and in particular the logarithm, can often be used to induce symmetry in such data. The logarithm is often favored because it is easy to interpret its result in terms of "fold changes."

The logarithm also has a useful effect on ratios. If we are comparing positive quantities X and Y using the ratio X / Y, then if X < Y, the ratio is in the unit interval (0, 1), whereas if X > Y, the ratio is in the half-line (1, ∞), where a ratio of 1 corresponds to equality. In an analysis where X and Y are treated symmetrically, the log-ratio log(X / Y) is zero in the case of equality, and it has the property that if X is K times greater than Y, the log-ratio is equidistant from zero as in the situation where Y is K times greater than X (the log-ratios are log(K) and −log(K) in these two situations).

If values are naturally restricted to be in the range 0 to 1, not including the end-points, then a logit transformation may be appropriate: this yields values in the range (−∞, ∞).
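The logit, available as scipy.special.logit, maps proportions symmetrically onto the whole real line:

    from scipy.special import logit

    # Proportions strictly between 0 and 1 map to (-inf, inf),
    # with 0.5 going to 0 and p, 1-p going to +/- the same value.
    print(logit([0.1, 0.5, 0.9]))   # [-2.197..., 0.0, 2.197...]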

Transforming to normality

It is not always necessary or desirable to transform a data set to resemble a normal distribution. However, if symmetry or normality are desired, they can often be induced through one of the power transformations.

To assess whether normality has been achieved, a graphical approach is usually more informative than a formal statistical test. A normal quantile plot is commonly used to assess the fit of a data set to a normal population. Alternatively, rules of thumb based on the sample skewness and kurtosis have also been proposed, such as having skewness in the range of −0.8 to 0.8 and kurtosis in the range of −3.0 to 3.0.[citation needed]
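Both checks, the quantile plot and the sample skewness/kurtosis, are available in SciPy; a short sketch on log-transformed skewed data:

    import matplotlib.pyplot as plt
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    x = np.log(rng.lognormal(size=300))      # log transform of skewed data

    print(stats.skew(x), stats.kurtosis(x))  # both should be near 0

    stats.probplot(x, plot=plt)              # normal quantile plot
    plt.show()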

Transforming to a uniform distribution

If we observe a set of n values X1, ..., Xn with no ties (i.e., there are n distinct values), we can replace Xi with the transformed value Yi = k, where k is defined such that Xi is the kth largest among all the X values. This is called the rank transform[citation needed], and creates data with a perfect fit to a uniform distribution. This approach has a population analogue.

Using the probability integral transform, if X is any random variable, and F is the cumulative distribution function of X, then as long as F is invertible, the random variable U = F(X) follows a uniform distribution on the unit interval [0, 1].

From a uniform distribution, we can transform to any distribution with an invertible cumulative distribution function. If G is an invertible cumulative distribution function, and U is a uniformly distributed random variable, then the random variable G⁻¹(U) has G as its cumulative distribution function.
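A sketch of both steps with SciPy: rank-transform to an (approximately) uniform sample, then push through an inverse CDF G⁻¹ to reach any target distribution, here a standard normal:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    x = rng.exponential(size=1000)

    # Rank transform: ranks scaled into (0, 1) are approximately uniform.
    u = stats.rankdata(x) / (len(x) + 1)

    # Inverse-CDF step: G^(-1)(U) has G as its distribution function.
    z = stats.norm.ppf(u)
    print(stats.skew(z))   # near 0: z is approximately standard normal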

Variance stabilizing transformations

Main article: Variance-stabilizing transformation


Many types of statistical data exhibit a "variance-on-mean relationship", meaning that the variability is different for
data values with different expected values. As an example, in comparing different populations in the world, the
variance of income tends to increase with mean income. If we consider a number of small area units (e.g., counties
in the United States) and obtain the mean and variance of incomes within each county, it is common that the
counties with higher mean income also have higher variances.
A variance-stabilizing transformation aims to remove a variance-on-mean relationship, so that the variance becomes constant relative to the mean. Examples of variance-stabilizing transformations are the Fisher transformation for the sample correlation coefficient, the square root transformation or Anscombe transform for Poisson data (count data), the Box-Cox transformation for regression analysis, and the arcsine square root transformation or angular transformation for proportions (binomial data). While commonly used for statistical analysis of proportional data, the arcsine square root transformation is not recommended because logistic regression or a logit transformation are more appropriate for binomial or non-binomial proportions, respectively, especially due to decreased type-II error.[1]
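As one concrete case, the Anscombe transform for Poisson counts, 2·sqrt(x + 3/8), yields values whose variance is approximately 1 regardless of the mean; a quick simulation check:

    import numpy as np

    rng = np.random.default_rng(7)
    for mean in (5, 50, 500):
        x = rng.poisson(mean, size=100_000)
        y = 2 * np.sqrt(x + 3 / 8)     # Anscombe transform
        print(mean, x.var(), y.var())  # raw variance grows with the mean;
                                       # the transformed variance stays near 1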

Transformations for multivariate data

Univariate functions can be applied point-wise to multivariate data to modify their marginal distributions. It is also possible to modify some attributes of a multivariate distribution using an appropriately constructed transformation. For example, when working with time series and other types of sequential data, it is common to difference the data to improve stationarity. If data are observed as random vectors Xi with covariance matrix Σ, a linear transformation can be used to decorrelate the data. To do this, use the Cholesky decomposition to express Σ = A A'. Then the transformed vector Yi = A⁻¹Xi has the identity matrix as its covariance matrix.
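A NumPy sketch of this decorrelation; np.linalg.cholesky returns the lower-triangular A with Σ = A A':

    import numpy as np

    rng = np.random.default_rng(8)
    cov = np.array([[2.0, 0.8], [0.8, 1.0]])
    X = rng.multivariate_normal([0, 0], cov, size=10_000)

    A = np.linalg.cholesky(cov)      # cov = A @ A.T
    Y = np.linalg.solve(A, X.T).T    # Yi = A^(-1) Xi for each sample vector

    print(np.cov(Y.T))               # approximately the identity matrix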
