
Computing Simulations in SAS

Jordan Elm
7/26/2007
Reference:
SAS for Monte Carlo Studies: A Guide for Quantitative Researchers
by Xitao Fan, Akos Felsovalyi, Stephen Sivo, and Sean Keenan
Copyright(c) 2002 by SAS Institute Inc., Cary, NC, USA

What is Meant by Running Simulations

Simulating data: use a random number generator to generate data with a certain distribution/shape.

Monte Carlo simulations: use a random number generator, DO loops, and macros to generate data and compare the performance of different methods of analysis.

Monte Carlo Simulations

The use of random sampling techniques and a computer to obtain approximate solutions to mathematical (probability) problems.
Can find solutions to mathematical problems (which may have many variables) that cannot easily be solved by, for example, integral calculus or other numerical methods.
Learn how a statistic varies from sample to sample (i.e., obtain the sampling distribution for the statistic) by repeatedly drawing random samples from a specific population.

Suitable Questions

How does the sample median behave versus the sample mean for a particular distribution?
How much variability is there in a sample correlation coefficient for a given sample size?
How does non-normality of the data affect the regression coefficients (PROC GLM)?
When theory is weak or assumptions are violated, MC simulations are needed to answer the question.

What is MC Simulation Used For?

Determining the power/sample size of a statistical method during the planning phase of a study.
Assessing the consequences of violating assumptions (e.g., homogeneity of variance for the t-test, normality of the data).
Comparing the performance (e.g., power, Type I error rate) of different statistical methods.

Example: Rolling a Die Twice

What are the chances of obtaining 2 as the sum of rolling a die twice?
1. Roll the die twice, 10,000 times, by hand, and estimate the chance of obtaining 2 as the sum.
2. Apply probability theory: (1/6)*(1/6) = 1/36, approximately 0.028.
3. Empirical approach: Monte Carlo simulation. The outcomes of rolling the die are SIMULATED. Requires a computer and software (SAS, Stata, R).

Rolling Die Twice: Prob(Sum = 2)
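A minimal sketch of how this simulation might be coded in SAS (the seed, dataset, and variable names are illustrative assumptions, not from the original slides):

**** Sketch: simulate rolling a die twice, 10,000 times, and estimate Pr(sum=2) ****;
DATA ROLLS;
DO REP = 1 TO 10000;
DIE1 = CEIL(6*RANUNI(123));   * integer from 1 to 6, each with probability 1/6;
DIE2 = CEIL(6*RANUNI(123));
SUM = DIE1 + DIE2;
SUM_IS_2 = (SUM = 2);         * indicator for the event of interest;
OUTPUT;
END;
RUN;
*** The mean of the indicator estimates Pr(sum=2), expected to be near 1/36 = 0.028 ***;
PROC MEANS DATA=ROLLS MEAN;
VAR SUM_IS_2;
RUN;
**************************************************************************;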

Basic Programming Steps

1. Generate a random sample (DATA step).
2. Perform the analysis in question and output the statistic to a dataset (PROC).
3. Repeat (1,000 to 1,000,000 times, depending on the desired precision) using a macro or DO loop.
4. Analyze the accumulated statistic of interest.
5. Present the results.
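One possible skeleton of these steps as a SAS macro (the macro name, dataset names, and the placeholder analysis are assumptions; only the structure matters):

**** Sketch: generic Monte Carlo skeleton (generate, analyze, accumulate, summarize) ****;
%MACRO MCSIM(NREPS=1000, N=50, SEED=123);
%DO I = 1 %TO &NREPS;
* Step 1: generate one random sample of size &N;
DATA SAMPLE;
DO OBS = 1 TO &N;
X = RANNOR(&SEED + &I);   * change the seed for each replication;
OUTPUT;
END;
RUN;
* Step 2: perform the analysis and output the statistic to a dataset;
PROC MEANS DATA=SAMPLE NOPRINT;
VAR X;
OUTPUT OUT=ONE_REP MEAN=XBAR;
RUN;
* Step 3: accumulate the statistic across replications;
PROC APPEND BASE=ALL_REPS DATA=ONE_REP; RUN;
%END;
* Step 4: analyze the accumulated statistic of interest;
PROC UNIVARIATE DATA=ALL_REPS;
VAR XBAR;
RUN;
%MEND MCSIM;
%MCSIM(NREPS=1000, N=50, SEED=123);
**************************************************************************;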

Step 1: Generating Data

Use functions to generate data with a known distribution, e.g., RANUNI, RANEXP, RANNOR, RAND.

Transform the data to the desired shape:
x = MU + sqrt(S2)*rannor(seed);   ~ Normal(MU, S2)
x = ranexp(seed)/lambda;          ~ Exponential(lambda)
x = ranbin(seed, n, p);           ~ Binomial(n, p)
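For example, a small DATA step exercising these functions might look like the following (the parameter values are arbitrary illustrations):

**** Sketch: generate 1,000 observations from three known distributions ****;
DATA GEN;
DO I = 1 TO 1000;
XNORM = 50 + SQRT(100)*RANNOR(123);   * ~ Normal(mean=50, variance=100);
XEXP = RANEXP(123)/0.5;               * ~ Exponential(lambda=0.5);
XBIN = RANBIN(123, 20, 0.3);          * ~ Binomial(n=20, p=0.3);
OUTPUT;
END;
RUN;
PROC MEANS DATA=GEN N MEAN STD;       * check the simulated moments;
VAR XNORM XEXP XBIN;
RUN;
**************************************************************************;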

Generating Data

Transform the data to the desired shape:
x = MU + sqrt(S2)*rannor(seed);

Seed: an integer.
If seed <= 0, the time of day is used to initialize the seed stream, and the stream of random numbers is not replicable.
If you use a positive seed, you can always replicate the stream of random numbers by re-running the same DATA step, but your macro program must change the seed for each replication of the DO loop.
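A small illustration of this seed behavior (the seed values are arbitrary):

**** With a positive seed, re-running the same DATA step reproduces the stream ****;
DATA REP1;
DO I = 1 TO 5; X = RANNOR(456); OUTPUT; END;
RUN;
DATA REP2;
DO I = 1 TO 5; X = RANNOR(456); OUTPUT; END;   * identical values to REP1;
RUN;
DATA NOTREP;
DO I = 1 TO 5; X = RANNOR(0); OUTPUT; END;     * seed 0: time of day, not replicable;
RUN;
**************************************************************************;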

Generating Data

Multivariate data: %MVN macro.
Download from http://support.sas.com/ctx/samples/index.jsp?sid=509

Tip for a faster program: generate ALL the data first, then use the BY statement within the PROC to analyze each sample.
E.g., if the sample size is 50 and the number of reps is set to 1,000, then generate data with 50,000 obs.
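One way this tip might be coded (the REP index and distribution parameters are illustrative assumptions):

**** Sketch: generate all 1,000 samples of size 50 at once, analyze by sample ****;
DATA ALLDATA;
DO REP = 1 TO 1000;          * replication (sample) number;
DO OBS = 1 TO 50;            * observations within each sample;
X = 50 + 10*RANNOR(789);
OUTPUT;
END;
END;
RUN;
*** One PROC call with BY analyzes every sample, instead of 1,000 separate PROC steps ***;
PROC MEANS DATA=ALLDATA NOPRINT;
BY REP;
VAR X;
OUTPUT OUT=RESULTS MEAN=XBAR STD=SD;
RUN;
**************************************************************************;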

Generating Data that Mirror Your Sample Characteristics

How well does the t-test do when the data are non-normal?

Generate non-normal data:
Obtain the first 4 moments from your sample data (mean, SD, skewness, kurtosis).
Obtain inter-variable correlations (PROC CORR) if the variables are correlated.
Use the sample moments and correlations as population parameters, and generate data accordingly.
Fleishman's power transformation method: Y = a + b*Z + c*Z**2 + d*Z**3, where Y is the non-normal variable, Z ~ N(0,1), and a, b, c, d are given by Fleishman (1978) for different values of skewness and kurtosis.

Example: Generating Non-Normal Data
**** Program 4.3 Fleishman Method for Generating 3 Non-Normal Variables ****;
DATA A;
DO I = 1 TO 10000;
X1 = RANNOR (0);
X2 = RANNOR (0);
X3 = RANNOR (0);
*** Fleishman non-normality transformation;
X1 = -.124833577 + .978350485*X1 + .124833577*X1**2 + .001976943*X1**3;
X2 = .124833577 + .978350485*X2 - .124833577*X2**2 + .001976943*X2**3;
X3 = -.096435287 + .843688891*X3 + .096435287*X3**2 + .046773413*X3**3;
X1 = 100 + 15*X1;
***linear transformation;
X2 = 50 + 10*X2;
X3 = X3;
OUTPUT;
END;
PROC MEANS N MEAN STD SKEWNESS KURTOSIS;
VAR X1 X2 X3;
PROC CORR NOSIMPLE;
VAR X1 X2 X3;
RUN;
**************************************************************************;

Example of an MC Study

Assessing the effect of non-normal data on the Type I error rate of an ANOVA test.

PROC IML

The matrix language within SAS.
Allows for faster programming; however, it is still slower than Stata, R, and S-Plus.

Program 6.3: Assessing the Effect of Data Non-Normality on the Type I Error Rate in ANOVA
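The book's Program 6.3 is not reproduced here; a simplified sketch of the same idea (three groups with equal population means but skewed data, counting how often ANOVA rejects at alpha = 0.05) might look like this. The ODS table and variable names are assumptions to verify against your SAS version:

**** Sketch: Type I error rate of ANOVA when the data are skewed (non-normal) ****;
DATA SIM;
DO REP = 1 TO 1000;                  * number of simulated experiments;
DO GROUP = 1 TO 3;                   * three groups with EQUAL population means;
DO OBS = 1 TO 20;
Y = RANEXP(123);                     * skewed (exponential) data, same mean in every group;
OUTPUT;
END;
END;
END;
RUN;
PROC GLM DATA=SIM;
BY REP;
CLASS GROUP;
MODEL Y = GROUP;
ODS OUTPUT MODELANOVA=PVALS;         * capture the F-test p-values for each replication;
RUN; QUIT;
*** Empirical Type I error rate = proportion of replications with p < 0.05 ***;
DATA REJECT;
SET PVALS;
WHERE HYPOTHESISTYPE = 3;            * keep the Type III test for GROUP;
REJECT = (PROBF < 0.05);
RUN;
PROC MEANS DATA=REJECT MEAN;
VAR REJECT;
RUN;
**************************************************************************;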

Bootstrapping & Jackknifing

Bootstrapping (Efron 1979): drawing a sample from an existing dataset, of the same size as (or smaller than) the original dataset, by re-sampling with replacement.
Purpose: to estimate the dispersion (variance) of poorly understood statistics (e.g., nonparametric statistics).

Jackknifing: re-sampling without replacement from an existing dataset; each sample is the same size as the original dataset minus 1 observation (leave-one-out).
Used to detect outliers or to make sure that results are repeatable (cross-validation).
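A common way to draw bootstrap samples in SAS is PROC SURVEYSELECT with unrestricted random sampling; a minimal sketch, assuming an existing dataset MYDATA with an analysis variable X (both names are placeholders):

**** Sketch: 500 bootstrap samples, each the same size as the original dataset ****;
PROC SURVEYSELECT DATA=MYDATA OUT=BOOT SEED=123
METHOD=URS         /* unrestricted random sampling = sampling with replacement */
SAMPRATE=1         /* each bootstrap sample has the size of the original */
REPS=500           /* number of bootstrap replicates */
OUTHITS;           /* write one record per selection, so duplicates are kept */
RUN;
*** Compute the statistic of interest within each bootstrap sample (here, the median) ***;
PROC MEANS DATA=BOOT NOPRINT;
BY REPLICATE;
VAR X;
OUTPUT OUT=BOOTSTATS MEDIAN=MED;
RUN;
*** The spread of the bootstrap medians estimates the standard error of the median ***;
PROC MEANS DATA=BOOTSTATS N MEAN STD;
VAR MED;
RUN;
**************************************************************************;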

Examples of Simulation
Studies in Epidemiology

Simulation Study of
Confounder-Selection Strategies
G Maldonado, S Greenland
American Journal of Epidemiology Vol. 138, No. 11: 923-936

In the absence of prior knowledge about population relations, investigators frequently employ a strategy that uses the data to help them decide whether
to adjust for a variable. The authors compared the performance of several
such strategies for fitting multiplicative Poisson regression models to cohort
data: 1) the "change-in-estimate" strategy, in which a variable is controlled if
the adjusted and unadjusted estimates differ by some important amount; 2)
the "significance-test-of-the-covariate" strategy, in which a variable is
controlled if its coefficient is significantly different from zero at some
predetermined significance level; 3) the "significance-test-of-the-difference"
strategy, which tests the difference between the adjusted and unadjusted
exposure coefficients; 4) the "equivalence-test-of-the-difference" strategy,
which significance-tests the equivalence of the adjusted and unadjusted
exposure coefficients; and 5) a hybrid strategy that takes a weighted
average of adjusted and unadjusted estimates. Data were generated from
8,100 population structures at each of several sample sizes. The
performance of the different strategies was evaluated by computing bias,
mean squared error, and coverage rates of confidence intervals. At least one
variation of each strategy that was examined performed acceptably. The
change-in-estimate and equivalence-test-of-the-difference strategies
performed best when the cut-point for deciding whether crude and adjusted
estimates differed by an important amount was set to a low value (10%).
The significance test strategies performed best when the alpha level was set
to much higher than conventional levels (0.20).

Confidence Intervals for Biomarker-based Human Immunodeficiency Virus Incidence Estimates and Differences Using Prevalent Data
Cole et al. American J Epid 165 (1): 94 (2007)

Prevalent biologic specimens can be used to estimate human immunodeficiency virus (HIV) incidence using a two-stage immunologic testing algorithm that hinges on the average time, T, between testing HIV-positive on highly sensitive enzyme immunoassays and testing HIV-positive on less sensitive enzyme immunoassays. Common approaches to
confidence interval (CI) estimation for this incidence measure have
included 1) ignoring the random error in T or 2) employing a Bonferroni
adjustment of the box method. The authors present alternative Monte
Carlo-based CIs for this incidence measure, as well as CIs for the
biomarker-based incidence difference; standard approaches to CIs are
typically appropriate for the incidence ratio. Using American Red Cross
blood donor data as an example, the authors found that ignoring the
random error in T provides a 95% CI for incidence as much as 0.26 times
the width of the Monte Carlo CI, while the Bonferroni-box method
provides a 95% CI as much as 1.57 times the width of the Monte Carlo
CI. Further research is needed to understand under what circumstances
the proposed Monte Carlo methods fail to provide valid CIs. The Monte
Carlo-based CI may be preferable to competing methods because of the
ease of extension to the incidence difference or to exploration of
departures from assumptions.
http://aje.oxfordjournals.org/cgi/content/full/165/1/94#APP2

Your Turn to Try

Assess the effect of unequal population variances in a 2-sample t-test.

Design an MC study to determine:
What happens to the Type I error rate?
What happens to the power?

Problem

Do 1,000 replications.
Let the sample size for the 2 groups (X1 and X2) be 20 per group.
Alpha = 0.05.
Mean = 50 (under the null), Mean = 40 (under the alternative).
SD = 10 and 15.
Compute a 2-sample t-test.
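One possible sketch for this exercise under the null hypothesis (the ODS table handling and variable names are assumptions; set one group's mean to 40 to study power instead):

**** Sketch: 2-sample t-test with unequal population variances, 1,000 replications ****;
DATA SIM;
DO REP = 1 TO 1000;
DO OBS = 1 TO 20;
GROUP = 1; X = 50 + 10*RANNOR(123); OUTPUT;   * group 1: mean 50, SD 10;
GROUP = 2; X = 50 + 15*RANNOR(123); OUTPUT;   * group 2: mean 50, SD 15;
END;
END;
RUN;
PROC TTEST DATA=SIM PLOTS=NONE;
BY REP;
CLASS GROUP;
VAR X;
ODS OUTPUT TTESTS=TT;                * p-values for the pooled and Satterthwaite tests;
RUN;
*** Type I error rate = proportion of replications rejecting at alpha = 0.05 ***;
DATA REJECT;
SET TT;
REJECT = (PROBT < 0.05);
RUN;
PROC SORT DATA=REJECT; BY METHOD; RUN;
PROC MEANS DATA=REJECT MEAN;
BY METHOD;                           * compare the pooled vs. Satterthwaite results;
VAR REJECT;
RUN;
**************************************************************************;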

Reference

SAS for Monte Carlo Studies: A Guide for Quantitative Researchers
by Xitao Fan, Akos Felsovalyi, Stephen Sivo, and Sean Keenan
Copyright (c) 2002 by SAS Institute Inc., Cary, NC, USA
ISBN 1-59047-141-5
