
An Introduction to SAS

SAS is the program and programming language that will be used for nearly all analyses in our course this
semester. These days nearly all statistical packages have the familiar window-based format with drop-down
menus that allow the user to quickly select the operations of choice. SAS has bucked this trend: although
it has the classic window layout, it does not provide simple drop-down menu choices for analyses.
Instead, SAS requires that you program your analytic choices. This requires skill at programming, knowledge
of the programming language and knowledge of how SAS operates.
This short tutorial will provide you with a working introduction to SAS. It is in no way meant to be a
comprehensive course on the subject. Instead, this tutorial is meant to be a starting point from which you may
extend your understanding of SAS when the need arises.

The SAS interface is shown in the following screen shot:

There are several items of importance shown in the screen shot. First there are the usual main menu items, such
as File, Edit, etc. I seldom use anything but the File and Edit menu items. In the File menu there are the typical
Open, Close, Save, Print, etc. items, while in the Edit menu there is Copy, Paste, Replace, etc. For the most part
these are all I use. The Help menu may or may not be fully functional. In the past it was the way to
open the SAS online documentation. On my current PC I must open the online documentation by going to
the START icon, then All Programs, then SAS in the list of programs; the Documentation is among the SAS
software items listed there.

On the Command bar (second bar from the top) are a series of icons, the most important of which are the
Submit, Clear All and Break (4th, 3rd and 2nd from the last items, respectively). If you place the cursor over
these icons you will get a brief description of each. The Submit button submits the set of SAS code that is
located in the Editor window; all programs will originate from the Editor window. The Clear All button simply
clears all text in the Editor window and the Break button will stop SAS processing a set of code. This is very
useful if you have made a coding mistake that results in an infinite loop or when a SAS procedure is running too
long due to model complexity.
There are typically three windows of importance to SAS programs (only two of the important ones are shown in
the screen shot). I typically close the Explorer window in the left panel. The two of importance that are shown
are the Editor, where all SAS code goes, and the Log, which reports on the operation of SAS. All warnings and
errors will show up in the Log, so it is a good idea to check the Log for problems after every submission. The other
window of importance is the Output window. This is where all of the results from the submitted program will
reside. You can bring any of the three windows into focus by clicking on the appropriate button at the bottom of
the screen (e.g., Output, Log and Editor). You can also tile these windows.
A Short Course on SAS Commands
Before any data can be analyzed, you must make the data available to SAS. This is performed in a DATA step.
Data Input
SAS code manipulates SAS data files, whether it is to compute the statistics for an analysis of variance or to add
two numbers together. By default a SAS data file is temporary: it persists only while the session is running (from
the time you load SAS until you terminate the program). Therefore, to perform any analyses using SAS you must
first create a SAS data file. Several SAS statements are used to create a SAS data file. These include the DATA,
INPUT, INFILE and CARDS statements. Some of these statements have several options available. All SAS
statements end with a semicolon (;).
There are essentially two methods of SAS data file creation: either the data are entered in the program code
stream or the data are read from an external data file.
The general form of a data step with the data entered in the program stream is:
DATA filename;
INPUT variable_1 variable_2 ... variable_n;
SAS statement_1;
SAS statement_2;
.
.
.
SAS statement_n;
CARDS;
data row 1
data row 2
.
.
.
data row n
;
RUN;

The general form of a data step with the data entered from an external file is:
DATA filename;
INFILE path and filename options;
INPUT variable_1 variable_2 ... variable_n;
SAS statement_1;
SAS statement_2;
.
.
.
SAS statement_n;
RUN;
For all programs developed over the next several pages only the first approach will be used, since the example
data sets are relatively small. However, it is usually easier to enter your data in a spreadsheet program,
such as Excel, save it as a comma separated value (CSV) file, and then read the data from that file by using
the INFILE statement.

The SAS statements in the DATA step:


DATA filename;
     Initializes the data step and assigns the filename to the
     SAS data file.

INFILE path and file name options;
     Data will be read from an external file. The path and file
     name are specified in quotes. Two main options are of use
     here: dlm = 'character' and firstobs = n. The character in
     the dlm option is typically a comma, since this is the
     separator between columns of a comma separated value
     (CSV) file, a format MS Excel can save. The firstobs
     option indicates the row of the data file at which to start
     reading; n is an integer.

INPUT variable_1 variable_2 ... variable_n;
     Variables variable_1, variable_2, ..., variable_n are to be
     created from the data. Variable names are separated by
     spaces, not commas. There are several options available
     for the INPUT statement, but only a couple will be
     discussed. Data may be numeric or character. The default
     data type is numeric. The name of a character variable
     must be followed by a $. If a data value is missing, enter
     a period in its place. The @@ characters at the end of an
     INPUT statement allow several complete observations to be
     read from the same data line.

CARDS; or DATALINES;
     Indicates that the data are entered in-stream and
     directly follow the CARDS (or DATALINES) statement.
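
As a sketch of the external-file form described above, the step below reads a comma separated file whose first row holds column headings. The path C:\mydata\moths.csv, the data set name moths_csv and the two variable names are hypothetical; substitute your own.

```sas
/* Hypothetical file: row 1 is a header, the remaining rows look like 1,41 */
data moths_csv;
    infile "C:\mydata\moths.csv" dlm = ',' firstobs = 2;  /* comma separated; start at row 2 */
    input Trap_Type Moths;                                 /* two numeric variables */
run;

proc print data = moths_csv;   /* check that the data were read as expected */
run;
```

Printing the data set right after reading it is a cheap way to catch a wrong delimiter or a header row that was read as data.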

SAS Statements in the data step can be used to manipulate data or create new variables. These SAS Statements
appear in the data step following the INPUT statement. These SAS statements can be arithmetic, logical or
functional in nature and are located within the data step, but after the definition of the variables.
Arithmetic Operators
operator    function            example
**          raise to a power    X = Y**Z;
*           multiplication      X = Y*Z;
/           division            X = Y/Z;
+           addition            X = Y + Z;
-           subtraction         X = Y - Z;

Comparison Operators
operator    function                    example
= or EQ     equal to                    if Y = Z then X = 2;
^= or NE    not equal to                if Y ^= Z then X = 2;
> or GT     greater than                if Y > Z then X = 2;
< or LT     less than                   if Y < Z then X = 2;
>= or GE    greater than or equal to    if Y >= Z then X = 2;
<= or LE    less than or equal to       if Y <= Z then X = 2;

Logical Operators
operator    function                        example
& or AND    intersection of two logicals    if Y = Z AND X = 2 then W = X;
| or OR     union of two logicals           if Y = Z OR X = 2 then W = X;
^ or NOT    complement                      if NOT (Y = 2) then W = X;

SAS Functions
SAS functions can be applied to variables or constants. For example, suppose you wanted to create a new
variable X that was the natural logarithm of the variable Y. The functional form would be:
X = LOG(Y);
Here, LOG is the function for the natural logarithm and Y is the argument. The argument can be either a
defined variable or a constant.
A Survey of SAS Functions
function    definition
ABS         Absolute value
SQRT        Square root
ROUND       Rounds to the nearest roundoff unit
LOG         Natural logarithm
LOG10       Logarithm base 10
ARCOS       Arc-cosine
ARSIN       Arc-sine
ATAN        Arc-tangent
COS         Cosine
SIN         Sine
TAN         Tangent
POISSON     Poisson probability distribution function (cumulative probability)
PROBBNML    Binomial probability distribution function (cumulative probability)
PROBCHI     Chi-square probability distribution function (cumulative probability)
PROBF       F probability distribution function (cumulative probability)
PROBNORM    Standard normal probability distribution function (cumulative probability)
PROBT       T probability distribution function (cumulative probability)
CINV        Inverse of the chi-square distribution function (quantile)
FINV        Inverse of the F distribution function (quantile)
PROBIT      Inverse of the normal distribution function (quantile)
TINV        Inverse of the T distribution function (quantile)
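
The operators and functions above can be combined inside a single data step. The following is a minimal sketch; the data set name funcs, the variable names and the two data rows are made up for illustration.

```sas
data funcs;
    input Y Z;
    X = log(Y);                           /* natural logarithm of Y */
    Ratio = round(Y / Z, 0.01);           /* quotient rounded to the nearest 0.01 */
    if Y >= Z and Z ^= 0 then Flag = 1;   /* comparison and logical operators together */
    else Flag = 0;
    P = probnorm(1.96);                   /* cumulative standard normal probability, about 0.975 */
    T = tinv(0.975, 17);                  /* 0.975 quantile of the t distribution with 17 df */
cards;
10 4
2.5 5
;
proc print data = funcs;
run;
```

Each assignment is evaluated once per data row, so the printed data set has one value of X, Ratio and Flag for every observation, while P and T are the same constant on every row.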

SAS Procedures
The effectiveness of SAS as a statistical software package lies in the union of the ability to manipulate data in
the data step and then send the data to a procedure for further analysis. SAS procedures are essentially
functions. However, unlike the functions just described, SAS procedures are far more complex and produce an
array of output. All SAS procedure calls begin with the keyword PROC. Some of the basic statistical
procedures supplied by SAS include PROC UNIVARIATE, PROC GLM, PROC REG, PROC PLOT, PROC
ANOVA, PROC MIXED and PROC LOGISTIC. There are many more SAS PROCs, too many to discuss here.
The procedures of interest to the analysis of variance include:
PROC GLM           general linear models procedure for analysis of variance and regression
PROC ANOVA         analysis of variance procedure
PROC MIXED         mixed model analysis of variance procedure
PROC UNIVARIATE    univariate statistics and tests for normality
PROC PLOT          X, Y scatter plot procedure
PROC SORT          sort data in ascending or descending order
PROC TTEST         perform both an independent two sample t-test and a paired t-test
PROC REG           regression analysis procedure

The use of arithmetic operators, comparison operators, logical operators, SAS functions and SAS procedures
is discussed in detail within the online documentation. You should explore these pages for your own
benefit. The arithmetic operators, comparison operators, logical operators and SAS functions can be found in
the Base SAS pages, while the procedures documentation can be found in the SAS/STAT pages (near the
bottom of the list).
Now that you have some understanding of the layout of SAS, it is time to create a SAS data set and use a
procedure.
Example - SAS Data File Creation Using In-Stream Data Entry:
In the following example, a SAS data set named new will be created with variables X, Y and Z. X will be
numeric, while Y and Z will be character variables. SAS reads data line by line, reading from left to right.
The data set new is printed (results will appear in the Output window) using the Print procedure. To
invoke a procedure you simply use the term proc followed by the procedure name. Each procedure will
have many possible options or statements. For the Print procedure the only option used is specification of
the data set to use, new in this case.
data new;
input X Y $ Z $;
cards;
10 Kim Forester
15 Jim Green
25 Peter Zanis
;
Proc Print data = new;
run;
The Proc Print data = new; and run; statements are not needed for the data step itself; they are needed to see
the results of the data step.
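
The same example can be used to see how missing values are entered. In this sketch the data set name new_missing is made up, and two values have been replaced by periods purely for illustration.

```sas
data new_missing;
    input X Y $ Z $;
cards;
10 Kim Forester
.  Jim Green
25 Peter .
;
proc print data = new_missing;
run;
```

In the printed output the missing numeric value appears as a period, while the missing character value appears as a blank.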

Entering the example SAS code into the SAS Editor window and clicking on the Submit button produces the
following screen shot:

Comparison of Two Means for Independent and Dependent Samples Using SAS Proc TTEST
Example - Independent Samples
An entomologist is interested in the effectiveness of two different types of moth traps. He places eleven
(11) type 1 traps and eight (8) type 2 traps out during the night. Later, he counts the number of moths caught
in each trap. The entomologist would like to know if there is a difference between the catch rate of the two
traps. The data collected from this study are presented below.
TRAP 1    41 34 33 36 40 25 31 37 34 30 38
TRAP 2    52 57 62 55 64 57 56 55

SAS Code using Proc ttest


options pageno = 1;
title "Comparison of Two Types of Moth Traps";
data moths;
input Trap_Type Moths @@;
cards;
1 41 1 34 1 33 1 36 1 40 1 25 1 31 1 37 1 34 1 30 1 38
2 52 2 57 2 62 2 55 2 64 2 57 2 56 2 55
;
proc print data = moths;
run;
title2 "Analysis Using an Independent Two-Sample T-test";
proc ttest data = moths;
class Trap_Type;
var Moths;
run;

The components of the program, with explanation, including Proc ttest for the comparison of two means
from independent samples, are:

options pageno = 1;
     The options statement can go anywhere in a SAS program and is used to control
     some aspects of the output. In this instance the pageno = 1 option
     forces SAS to number its pages starting with 1. If this were left off,
     SAS would continue numbering from where it left off. Other
     options for this statement are linesize = n and pagesize = n, which
     set the number of characters across a page and the number of
     lines on a page.

title "Comparison of Two Types of Moth Traps";
title2 "Analysis Using an Independent Two-Sample T-test";
     These are the title and subtitle printed on each page of the output following
     their definition (the subtitle will not be printed for the above program
     until after the Proc Print has been run).

proc ttest data = moths;
     Invokes Proc ttest and tells SAS to use the moths data set.
class Trap_Type;
     Tells Proc ttest which variable to use to determine the two
     treatments or populations.
var Moths;
     Tells Proc ttest which variable is the response.
run;
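
For reference, the pooled (equal-variance) form of the independent two-sample t statistic can be written as below. The notation is introduced here and is not taken from the text: the sample means, sample variances and sample sizes of the two trap types are written with subscripts 1 and 2. Proc ttest also reports an unequal-variance (Satterthwaite) version, which is not shown.

```latex
t = \frac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
```

The pooled statistic has n_1 + n_2 - 2 degrees of freedom, which for the moth data is 11 + 8 - 2 = 17.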

Example - Dependent Samples


Tardive dyskinesia denotes a syndrome comprising a variety of abnormal involuntary movements assumed
to follow long-term use of antipsychotic drugs. In an experiment to see whether the drug deanol produced
an effect over baseline scores of a placebo treatment, the two treatments were administered for four weeks
each in random order to 10 patients. Results from these treatments, as measured by total severity index
(TSI) scores are:

Patient    1     2     3     4     5     6     7     8     9     10
---------------------------------------------------------------------
Deanol     12.4  12.6  13.2  12.1  5.9   12.0  11.5  13.0  5.1   9.6
Placebo    9.2   12.2  12.7  12.4  5.9   8.5   7.8   9.1   3.5   6.4
---------------------------------------------------------------------

SAS Code using Proc ttest:


options pageno = 1;
title "Comparison of Two Types of Treatments for Tardive Dyskinesia";
data drug;
input deanol placebo;
datalines;
12.4 9.2
12.6 12.2
13.2 12.7
12.1 12.4
5.9 5.9
12.0 8.5
11.5 7.8
13.0 9.1
5.1 3.5
9.6 6.4
;
title2 "Analysis Using a Paired T-test";
proc ttest data = drug;
paired placebo*deanol;
run;

The components, with explanation, of Proc ttest for the comparison of two means from paired samples are:
proc ttest data = drug;
     Invokes Proc ttest and tells SAS to use the drug data set.
paired placebo*deanol;
     Describes the paired responses for comparison.
run;
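
A paired t-test is equivalent to a one-sample t-test on the within-patient differences. The sketch below illustrates this, assuming the data set drug from the program above has already been created in the current session; the names drug_diff and diff are made up for illustration.

```sas
/* Assumes the data set drug (variables placebo and deanol) already exists */
data drug_diff;
    set drug;
    diff = placebo - deanol;          /* within-patient difference */
run;

proc ttest data = drug_diff h0 = 0;   /* one-sample t-test that the mean difference is 0 */
    var diff;
run;
```

The t statistic and p-value from this run should match those from the paired statement, which is a useful check that the pairing was set up as intended.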

Example - Randomization of Treatments to Experimental Units for a CRD using SAS


For a completely randomized design the randomization of the treatments to a set of experimental units is a
fairly simple process. First you must construct a variable which has entries corresponding to the treatment
values (e.g., 1, 2, etc.), with each treatment value repeated as many times as there are observations for that
treatment. Next a set of uniform random numbers is created using the SAS uniform random number
function (RANUNI). The treatments are then sorted by the uniform random numbers and printed. The
assignment of experimental units, labeled 1 to n, follows directly from the printed treatments.
Example
Suppose that an experiment which was to follow the protocol for a completely randomized design was to
involve 4 treatments, each with 5 observations (replicates). The SAS code performing the random
assignment of treatments to experimental units is
data random;
input treatments @@;
rv = ranuni(1);
cards;
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
;
proc sort data = random; by rv;
run;
data random; set random;
Replicate = _n_;
proc print data = random;
run;

rv = ranuni(1);
     In the above code the function ranuni(1)
     creates a uniform (0, 1) random variable for each
     observation (total of 20). The 1 in parentheses is the
     seed to start the random number generation and is
     required. The seed can be any positive whole number,
     but you must choose one.

proc sort data = random; by rv;
     The Sort procedure sorts the declared data set (random)
     using the values in the variable rv. By default they
     are sorted in ascending order. Options for the sort
     routine allow for other methods of sorting. You should
     look at the SAS Online documentation to learn more.

data random; set random;
     You can re-open a data set, or create a new data set
     from an existing data set, by starting a new DATA step.
     In this instance, the data set random is re-created by the
     statement data random; and is populated from the
     existing data set random by the statement set
     random;.

Replicate = _n_;
     This statement creates a new variable Replicate whose
     values index the sorted rows (1, 2, ..., 20), taken from
     the automatic SAS variable _n_.
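
For larger designs, typing every treatment label becomes tedious. The following sketch builds the same 20 labels (4 treatments, 5 replicates each) with nested DO loops instead of in-stream data; the loop counter rep is a name introduced here for illustration.

```sas
data random;
    do treatments = 1 to 4;
        do rep = 1 to 5;
            rv = ranuni(1);   /* uniform (0, 1) random number, seed 1 */
            output;           /* write one observation per treatment label */
        end;
    end;
    drop rep;                 /* the loop counter is not needed afterwards */
run;

proc sort data = random; by rv;
run;

data random; set random;
    Replicate = _n_;          /* label the experimental units 1 to 20 in sorted order */
run;

proc print data = random;
run;
```

Changing the loop limits is then all that is needed to randomize a design with a different number of treatments or replicates.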

Analysis of a Completely Randomized Design with a One-Way Treatment Structure Using SAS
The Environmental Protection Agency (EPA) utilizes the services of a number of analytic laboratories. It is
of interest to EPA that these laboratories produce equivalent results when asked to analyze water samples for
possible contamination. In order to assess the quality of the analyses from these laboratories the EPA
decided to send each of 3 laboratories a set of 6 samples that were known to have a DDT contamination
level of 1000 ppm. Each water sample was constructed independent of the others and randomly assigned to
a laboratory. The resulting data are given in the following table:
                        Replicates
Laboratory     1      2      3      4      5      6
1            1005   1015   1033   1028   1023   1043
2             995   1008    976   1014   1011    982
3             950    975    988   1015   1008    994

title "Analysis of EPA Laboratory Data";


title2 "Using a Completely Randomized Design";
title3 "With a One-Way Treatment Structure";
data epa;
input laboratory ddt @@;
cards;
1 1005 1 1015 1 1033 1 1028 1 1023 1 1043
2 995 2 1008 2 976 2 1014 2 1011 2 982
3 950 3 975 3 988 3 1015 3 1008 3 994
;
run;
proc print data = epa;
run;
proc glm data = epa;
class laboratory;
model ddt = laboratory;
run;

The statistical design is a completely randomized design with a one-way treatment structure (Laboratories).
The model for a CRD with a one-way treatment structure codes for a grand mean or model intercept and
the treatment effects (Laboratory in the above example). Replicates are the source of random variation and
appear in the residual error. In the above code, the response (DDT) is a function of the treatment
(Laboratory). Any remaining variation goes to the residual error.
The key elements of the analysis are:
proc glm data = epa;
     Proc GLM is the general linear models procedure, which
     can be used to run ANOVA or regression analysis.
class laboratory;
     The class statement tells Proc GLM which factors are in
     the ANOVA. In this case Laboratory is the factor.
model ddt = laboratory;
     The model statement relates the response (DDT) to the
     factor (Laboratory). This will produce the ANOVA
     table with the F-test and p-values.
run;


Proc Mixed is another SAS procedure that will produce the test statistics and parameter estimates for
ANOVA models. However, Proc Mixed uses likelihood-based estimation (restricted maximum likelihood, REML,
by default) to obtain the estimates of the model parameters. For the simple models of a completely randomized
design with a one-way treatment structure, the results from Proc GLM and Proc Mixed are identical. Later it will
become apparent that these methods are not exactly the same. In fact, we will want to use Proc Mixed for more
complicated designs because likelihood-based estimation is more robust to violations of the
assumptions and to missing observations. In addition, Proc Mixed can be used to model the
variance structure when you have non-constant variance.
For now, the SAS Proc Mixed code for the EPA data is:
title "Analysis of EPA Laboratory Data";
title2 "Using a Completely Randomized Design";
title3 "With a One-Way Treatment Structure";
data epa;
input laboratory ddt @@;
cards;
1 1005 1 1015 1 1033 1 1028 1 1023 1 1043
2 995 2 1008 2 976 2 1014 2 1011 2 982
3 950 3 975 3 988 3 1015 3 1008 3 994
;
run;
proc print data = epa;
run;
proc mixed data = epa;
class laboratory;
model ddt = laboratory;
run;

Please recall that the model for the EPA data is based on a CRD with a one-way treatment structure. For
this data set, the model is of the following structure:
$$Y_{ij} = \mu + \tau_i + \varepsilon_{(i)j}$$
Definitions and Assumptions:
$Y_{ij}$               the measured response for the ith treatment and jth observation
$\mu$                  the grand mean
$\tau_i$               $= \mu_i - \mu$ is the effect of the ith treatment
$\varepsilon_{(i)j}$   are independently and identically distributed normal random errors having mean 0 and common
                       variance $\sigma^2$.

If the response Y was recorded as DDT, the treatments were recorded as Laboratory (1, 2, 3) and the
replication was recorded as Sample (1, 2, 3, 4, 5 and 6 within each laboratory), then the model statement
that includes the replication would be
model DDT = Laboratory Sample(Laboratory);
Remember, you can label the replicates as a factor, in addition to the primary factor (Laboratory). However,
there is a consequence to running the analysis in this manner. Explore this consequence by making the
appropriate changes to the SAS code on this page.
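
One possible starting point for that exploration is sketched below. It re-enters the EPA data with an added Sample column (1 to 6 within each laboratory, in the order the values are listed in the table) and shows the Proc GLM version; the data set name epa_rep is made up for illustration.

```sas
data epa_rep;
    input Laboratory Sample DDT @@;   /* three values per observation, several per line */
cards;
1 1 1005 1 2 1015 1 3 1033 1 4 1028 1 5 1023 1 6 1043
2 1 995  2 2 1008 2 3 976  2 4 1014 2 5 1011 2 6 982
3 1 950  3 2 975  3 3 988  3 4 1015 3 5 1008 3 6 994
;
proc glm data = epa_rep;
    class Laboratory Sample;          /* both factors must be declared in the class statement */
    model DDT = Laboratory Sample(Laboratory);
run;
```

Compare the ANOVA table from this run with the one from the original model statement, paying particular attention to the error line.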


Computing Power of the Test for the Independent Two Sample t-Test
SAS Proc Power can be used to determine power or sample size for designed experiments. Proc Power is
quite easy to use; however, there are a couple of concepts that you must understand before using the
procedure. First, you must understand that the results of the power calculations are only a guideline. This is
because you must specify the values of the means that are expected if the experiment were to be
conducted and you must specify the true value of the variance. If you actually knew the true means for the
different treatments you wouldn't even need to run the experiment. Of course you don't know the values of
the means, so you must come up with values that you think represent a meaningful set of values. By
meaningful I mean values for the means that imply the minimum detectable difference that is important.
You might deduce these values from a pilot study, or from the literature. If this were the case, then you
would also have a value of the anticipated variance.
As an example, consider the moth trap problem. If we believe that a meaningful difference in the true
means is 10 and the standard deviation is 4.38 (variance is 19.19), then the SAS Proc Power code for
determining the total sample size for a power of 0.8, with the two group sizes in a 1:2 ratio (the
groupweights option), is as follows:
proc power;
twosamplemeans
meandiff= 10
stdev=4.38
groupweights=(1 2)
power=0.8
ntotal=.;
run;

Results
Comparison of Two Types of Moth Traps                                   1
Analysis Using an Independent Two-Sample T-test
                                               08:46 Friday, May 10, 2013
The POWER Procedure
Two-sample t Test for Mean Difference

Fixed Scenario Elements
Distribution          Normal
Method                Exact
Mean Difference       10
Standard Deviation    4.38
Group 1 Weight        1
Group 2 Weight        2
Nominal Power         0.8
Number of Sides       2
Null Difference       0
Alpha                 0.05

Computed N Total
Actual Power    N Total
0.918           12
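
Proc Power can also be run in the other direction, solving for power when the group sizes are fixed. The sketch below assumes the group sizes of the moth trap study (11 and 8) together with the same mean difference and standard deviation; it uses the groupns and stddev options of the twosamplemeans statement, and no output is shown because it was not part of the original analysis.

```sas
proc power;
    twosamplemeans
        meandiff = 10          /* meaningful difference in the true means */
        stddev = 4.38          /* assumed common standard deviation */
        groupns = (11 8)       /* fixed group sizes: type 1 traps, type 2 traps */
        power = .;             /* the period marks the quantity to be solved for */
run;
```

As with ntotal = . in the sample size calculation, exactly one quantity is set to a period, and that is the one Proc Power computes.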
