Robustness
Resistance
Transformation
Outlier
2013 spring
Introduction
In Chapter 2, we discussed the mechanics of using t-procedures to perform statistical inference, namely t-tests and confidence intervals. We base these procedures on certain assumptions:
- we have random samples, representative of the populations
- data come from Normal populations
- samples are drawn independently
- in pooled two-sample settings, we have equal variances (σ1 = σ2 = σ)
In practice, these assumptions are usually not strictly met. When are these procedures still appropriate?
Graphical Summaries
library("Sleuth2")
boxplot(Rainfall ~ Treatment, ylab='Rainfall (acre-feet)', data=case0301)
[Figure: side-by-side box plots of Rainfall (acre-feet) for the Unseeded and Seeded groups]
Graphical Summaries
par(mfrow=c(2,1), mar=c(4,4,1,0.5))
hist(case0301$Rainfall[case0301$Treatment=="Seeded"], breaks=10,
     main="Seeded - Rainfall", xlim=c(0,3000), col="gray", xlab="")
hist(case0301$Rainfall[case0301$Treatment=="Unseeded"], breaks=8,
     main="Unseeded - Rainfall", xlim=c(0,3000), col="gray", xlab="")
[Figure: histograms of Rainfall for the Seeded group (top) and the Unseeded group (bottom); horizontal axes run from 0 to 3000, vertical axes show Frequency]
Numerical Summaries: Do it yourself (follow the R code on page 42 of the Chapter 2 slides).
Graphical and numerical summaries indicate that rainfall tended to be greater on seeded days. However, there are problems with our necessary assumptions:
- both distributions are very skewed
- both distributions have outliers
- variability is much greater in the seeded group than in the unseeded group
Can we use our usual t-tools to analyze these data? How?
Can we do this?
> t.test(Rainfall ~ Treatment, alternative="two.sided",
+        var.equal=TRUE, data=case0301)

        Two Sample t-test

data:  Rainfall by Treatment
t = -1.9982, df = 50, p-value = 0.05114
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -556.224179    1.431851
sample estimates:
mean in group Unseeded   mean in group Seeded
              164.5885               441.9846

How much did the violations of our assumptions affect these results?
Robustness
t-tools may be used even when assumptions are violated, to a certain degree, because the t-tools are robust.
Robustness: A statistical procedure is robust to departures from a particular assumption if it is valid even when the assumption is not met.
Recall that the Central Limit Theorem (CLT) states that sample averages have approximately Normal sampling distributions, regardless of the shape of the population distribution, for large samples. As long as samples are large enough, the t-ratio will follow an approximate t-distribution even if the data is non-Normal.
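The CLT statement above can be checked with a small simulation. This is only a sketch: the Exponential(rate = 1) population, the sample sizes, the number of replications, and the helper `skewness` are all assumptions chosen for illustration, not part of the course material.

```r
# Simulate the sampling distribution of the sample mean for a skewed population.
# Assumption: an Exponential(rate = 1) population (mean 1), used only as an example.
set.seed(1)

n_small <- 5      # a small sample size
n_large <- 100    # a large sample size
reps    <- 5000   # number of simulated samples per setting

means_small <- replicate(reps, mean(rexp(n_small, rate = 1)))
means_large <- replicate(reps, mean(rexp(n_large, rate = 1)))

# Simple moment-based skewness (hypothetical helper); near 0 for a symmetric shape
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_small <- skewness(means_small)
skew_large <- skewness(means_large)

# The sample means are less skewed when n is larger
c(n5 = skew_small, n100 = skew_large)
```

Histograms of `means_small` and `means_large` (for example with `hist()`) should show the same pattern visually.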
- If two populations have the same standard deviations and approximately the same shapes, and if n1 ≈ n2, then the validity of t-tools is affected very little by skewness.
- If two populations have the same standard deviations and approximately the same shapes, but n1 ≠ n2, then the validity of t-tools is affected substantially by skewness. Larger sample sizes diminish this effect.
- If the skewness in the two populations differs considerably, the tools can be very misleading with small and moderate sample sizes.
See Display 3.4 in the textbook for simulation results.
A cluster effect occurs when the data have been collected in subgroups. Observations in the same subgroup tend to be more similar in their responses than observations in different subgroups.
A serial effect occurs when measurements are taken over time and observations close together in time tend to be more similar (or more different) than observations collected at distant time points.
When the assumption of independence is violated, the standard error becomes very inaccurate. t-tools are usually not recommended in such cases.
Question: Can you tell the difference between Robustness and Resistance?
Example of Outlier
[Figure: example of an outlier, plotted against x]
Example of Resistance
Consider a hypothetical sample: 10, 20, 30, 50, 70
The sample mean is 36, and the sample median is 30.
Now consider the sample: 10, 20, 30, 50, 700
What happens to the sample mean? What about the sample median?
The sample median is resistant to any change in a single observation, while the sample mean is not.
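The two samples above can be compared directly in R. The variable names are illustrative only.

```r
# The two hypothetical samples from the example
original <- c(10, 20, 30, 50, 70)
modified <- c(10, 20, 30, 50, 700)   # the largest value changed from 70 to 700

mean_original   <- mean(original)     # 36
median_original <- median(original)   # 30

mean_modified   <- mean(modified)     # 162: the mean is pulled toward the extreme value
median_modified <- median(modified)   # 30: the median is unchanged

c(mean_original, median_original, mean_modified, median_modified)
```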
Resistance of t-Tools
Since t-tools are based on the mean, they are not resistant. A small portion of the data can have a major influence on the results. One or two outliers can affect a 95% CI or change a p-value enough to alter a conclusion.
Solution: When you have an outlier, it is good practice to run your analysis with and without the outlier in the data set. Compare your results to see how influential the outlier in question is.
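A minimal sketch of the with-and-without comparison, using made-up numbers. The vectors `group_a` and `group_b` are hypothetical and are not taken from any data set in the course; the last value of `group_b` plays the role of the suspected outlier.

```r
# Hypothetical two-group data; the last value in group_b is a suspected outlier
group_a <- c(12, 15, 14, 16, 13, 15, 14, 17)
group_b <- c(16, 18, 17, 19, 18, 20, 17, 60)

# Pooled two-sample t-test with all observations
with_outlier <- t.test(group_a, group_b, var.equal = TRUE)

# Same test after dropping the suspected outlier (last element of group_b)
without_outlier <- t.test(group_a, group_b[-length(group_b)], var.equal = TRUE)

# Compare the two p-values and the two confidence intervals
c(with_outlier = with_outlier$p.value, without_outlier = without_outlier$p.value)
rbind(with_outlier = with_outlier$conf.int, without_outlier = without_outlier$conf.int)
```

In this toy example the single extreme value inflates the pooled standard deviation so much that the test loses significance, even though it moves the group mean further from the other group.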
Our task is to size up actual conditions, using available data, and evaluate the appropriateness of t-tools:
1. Think about possible cluster and serial effects.
2. Evaluate the suitability of t-tools by examining graphical displays (side-by-side histograms or box plots).
3. Consider alternatives:
   a. Transform the data (Section 3.5) to see if the transformed data look nicer.
   b. Use alternative tools that do not require model assumptions (Chapter 4).
Transformations of Data
For positive data, the most useful transformation is the logarithm (log), particularly the natural (base e) logarithm (e = 2.71828...).
log(1) = 0
log(e^x) = x
[Figure: the log function, log(x) plotted against x]
If samples are skewed, and the group with the larger average also has the greater spread, then a log transformation could be a good choice.
[Figure: two panels, "before transformation" (y) and "after transformation" (log(y))]
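The effect of the log on spread can be sketched with simulated data. Everything here is an assumption made for illustration: the two groups are drawn from log-normal distributions that differ only in their mean on the log scale, and the names `low`, `high`, and the chosen parameters are hypothetical.

```r
# Hypothetical right-skewed data: two log-normal groups that differ only
# in their mean on the log scale (assumed values, for illustration)
set.seed(42)
low  <- rlnorm(200, meanlog = 4, sdlog = 0.5)
high <- rlnorm(200, meanlog = 5, sdlog = 0.5)

# Ratio of standard deviations (high / low) before and after taking logs
sd_ratio_raw <- sd(high) / sd(low)
sd_ratio_log <- sd(log(high)) / sd(log(low))

# On the raw scale the group with the larger average is much more spread out;
# on the log scale the two spreads are about equal
c(raw = sd_ratio_raw, log = sd_ratio_log)
```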
[Figure: box plots of rainfall for the Unseeded and Seeded groups after transformation]
Two-Sample t-Analysis
Before:
> t.test(Rainfall ~ Treatment, alternative="two.sided",
+        var.equal=TRUE, data=case0301)

        Two Sample t-test

data:  Rainfall by Treatment
t = -1.9982, df = 50, p-value = 0.05114
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -556.224179    1.431851
sample estimates:
mean in group Unseeded   mean in group Seeded
              164.5885               441.9846

After:
> t.test(logRain ~ Treatment, data=case0301,
+        alternative="less", var.equal=TRUE)

        Two Sample t-test

data:  logRain by Treatment
t = -2.5444, df = 50, p-value = 0.007041
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
       -Inf -0.3904045
sample estimates:
mean in group Unseeded   mean in group Seeded
              3.990406               5.134187
We interpret this in the following way: The volume of rainfall produced by a seeded cloud is estimated to be 3.14 times as large as the volume that would have been produced in the absence of seeding.
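The factor 3.14 can be recovered from the group means printed in the t-test output on the log scale. This sketch assumes that `logRain` is the natural log of `Rainfall`; the variable names below are illustrative.

```r
# Group means of log rainfall, copied from the t-test output
mean_log_unseeded <- 3.990406
mean_log_seeded   <- 5.134187

# Difference on the log scale (seeded minus unseeded)
diff_log <- mean_log_seeded - mean_log_unseeded

# Back-transform: a difference of logs becomes a ratio on the original scale
ratio <- exp(diff_log)
round(ratio, 2)   # 3.14
```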
Confidence Interval
> (test <- t.test(logRain ~ Treatment, data=case0301, var.equal=TRUE))

        Two Sample t-test

data:  logRain by Treatment
t = -2.5444, df = 50, p-value = 0.01408
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.0466973 -0.2408651
sample estimates:
mean in group Unseeded   mean in group Seeded
              3.990406               5.134187

> test$conf.int
[1] -2.0466973 -0.2408651
attr(,"conf.level")
[1] 0.95
> exp(test$conf.int)
[1] 0.1291608 0.7859476
attr(,"conf.level")
[1] 0.95

A 95% confidence interval for the multiplicative effect (unseeded relative to seeded) is 0.129 to 0.786 times.
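Because R takes the difference as Unseeded minus Seeded, the back-transformed interval describes unseeded rainfall relative to seeded rainfall. To express the effect as seeded relative to unseeded, one option is to invert the endpoints, as in this sketch. The variable names are illustrative; the endpoints are copied from the output.

```r
# Back-transformed interval from the output: unseeded / seeded
ci_unseeded_over_seeded <- c(0.1291608, 0.7859476)

# Taking reciprocals flips the ratio; rev() puts the endpoints back in increasing order
ci_seeded_over_unseeded <- rev(1 / ci_unseeded_over_seeded)

round(ci_seeded_over_unseeded, 2)   # 1.27 7.74
```

Read this way, seeded rainfall is estimated to be between 1.27 and 7.74 times the unseeded rainfall, which brackets the point estimate of 3.14.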
Often there is no way to know what caused the outlier(s). Two tools exist:
- employ a resistant statistical tool (Chapter 4)
- adopt a careful strategy (see Display 3.6 in the text)
The idea is to perform the analysis with and without the suspected outliers.
- If both analyses give the same answer, report only the results INCLUDING the suspected outliers.
- If not, report both results.
> ### dealing with Missing data ###
> (cc <- complete.cases(ex0327))
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[16] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> data2 <- ex0327[cc, ]; data2[15:17, ]
       Country Life Income           Type
15    Portugal 68.1    956 Industrialized
17      Sweden 74.7   5596 Industrialized
18 Switzerland 72.1   2963 Industrialized
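The same pattern can be tried on a small made-up data frame. The data frame `toy` below is hypothetical and is not the `ex0327` data; it only mimics the column names.

```r
# Hypothetical data frame with one missing Life value (not the ex0327 data)
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  Life    = c(70.1, NA, 68.4, 72.9),
  Income  = c(1200, 3400, 950, 2800)
)

cc <- complete.cases(toy)    # TRUE for rows with no NA in any column
toy_complete <- toy[cc, ]    # keep only the complete rows

nrow(toy_complete)           # 3
rownames(toy_complete)       # original row labels are kept: "1" "3" "4"
```

This explains why the slide output jumps from row 15 to row 17: subsetting drops the incomplete row but keeps the original row labels.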
A: None, they're all waiting for the unseen hand of the market to correct the lighting disequilibrium.