
SAS Training Session 2

Basic Statistical Analysis Using SAS


Sun Li
Centre for Academic Computing
lsun@smu.edu.sg
Outline
Produce reports
Using Output Delivery System (ODS) to produce output
Producing summary report using PROC MEANS & PROC FREQ
Perform elementary statistical procedures
Simple inference using PROC FREQ & PROC UNIVARIATE
Correlation using PROC CORR
Group comparison using PROC TTEST
<10-min Break>
Introduction to regression procedures
ANOVA procedures using PROC ANOVA and PROC GLM
General Linear regression using PROC REG
Logistic models using PROC LOGISTIC
Produce reports
Using Output Delivery System (ODS) to produce output
Producing HTML output
Producing RTF file
Producing descriptive statistics using PROC MEANS
Computing descriptive statistics
Creating a summarized data set
Produce tabular reports using PROC FREQ
Creating frequency tables
Creating cross-tabulations
Produce reports - ODS
Using Output Delivery System (ODS)
RTF output
** producing HTML output and RTF file;
PROC PRINT data=sas2.marchflights (obs=10);
RUN;
ODS html body='E:\lsun\marchflights.html';
PROC PRINT data=sas2.marchflights (obs=10);
RUN;
ODS html close;
ODS listing close;
ODS rtf file='E:\lsun\insure.rtf';
PROC PRINT data=sas2.insure;
RUN;
PROC TABULATE data=sas2.insure;
var total balancedue;
table min max mean, total balancedue;
RUN;
ODS rtf close;
ODS listing;
Produce reports - PROC MEANS
PROC MEANS <DATA=SAS-data-set>
<statistic-keyword(s)> <option(s)>;
RUN;
Statistic-keywords:
Keyword   Description                     Keyword   Description
CLM       Two-sided confidence limits     RANGE     Range
CSS       Corrected sum of squares        SKEWNESS  Skewness
CV        Coefficient of variation        STDDEV    Standard deviation
KURTOSIS  Kurtosis                        STDERR    Standard error of the mean
LCLM      Lower confidence limit          SUM       Sum
MAX       Maximum value                   SUMWGT    Sum of the weights
MEAN      Mean                            UCLM      Upper confidence limit
MIN       Minimum value                   USS       Uncorrected sum of squares
N         Number of non-missing values    VAR       Variance
NMISS     Number of missing values        PROBT     Probability for Student's t
MEDIAN    Median                          T         Student's t
Q1        25% quantile                    Q3        75% quantile
P1        1% quantile                     P5        5% quantile
P10       10% quantile                    P90       90% quantile
P95       95% quantile                    P99       99% quantile
** computing statistics using proc means;
DATA prdsale;
set sas2.prdsale;
RUN;
PROC PRINT data=prdsale (obs=10); RUN;
PROC MEANS data=prdsale maxdec=2 alpha=0.1 clm mean std;
var actual predict;
class product;
RUN;
PROC SORT data=prdsale;
by product;
RUN;
PROC MEANS data=prdsale maxdec=2 alpha=0.1 clm mean std;
var actual predict;
by product;
RUN;
BY statement vs. CLASS statement:
1. Unlike CLASS processing, BY processing
requires that your data already be sorted or indexed
in the order of the BY variables.
2. BY group results have a layout that is different
from the layout of CLASS group results.
** creating a summarized data set using proc means;
PROC MEANS data=prdsale mean clm;
var actual predict;
class product year;
output out=prdstats mean=ave_act ave_pre uclm=uclm_act
uclm_pre lclm=lclm_act lclm_pre;
RUN;
OUTPUT OUT=SAS-data-set statistic=variable(s);
OUT= specifies the name of the output data set.
statistic= specifies the summary statistic to write out.
variable(s) specifies the names of the variables to create. These variables hold
the statistics for the analysis variables listed in the VAR statement, in the same order.
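With two CLASS variables, the output data set by default also contains rows for the overall summary and for each CLASS variable on its own (distinguished by the automatic _TYPE_ variable). The sketch below keeps only the full product-by-year combinations by adding NWAY. It assumes that SASHELP.PRDSALE (shipped with SAS) can stand in for sas2.prdsale; the data set name work.prdstats_nway is made up for this illustration.

```sas
/* Assumption: SASHELP.PRDSALE stands in for the course file sas2.prdsale */
proc means data=sashelp.prdsale noprint nway;   /* NOPRINT: write the data set only */
  class product year;
  var actual predict;
  /* one new variable per VAR variable, in VAR order */
  output out=work.prdstats_nway
         mean=ave_act ave_pre
         lclm=lclm_act lclm_pre
         uclm=uclm_act uclm_pre;
run;

/* Inspect the result: one row per product-year combination */
proc print data=work.prdstats_nway noobs;
run;
```

Dropping NWAY brings the lower-order summary rows back, which is useful when you also want overall or per-product statistics in the same data set.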
Produce reports - PROC FREQ
PROC FREQ <DATA=SAS-data-set> ;
TABLES variables
variable-1*variable-2 <* ... variable-n>;
RUN;
**creating tables in proc freq;
PROC FREQ data=Color;
weight Count;
tables Eyes Hair Eyes*Hair;
RUN;
PROC SORT data=Color;
by region;
RUN;
PROC FREQ data=color nlevels;
weight count;
tables eyes*hair /crosslist;
by region;
RUN;
NLEVELS: displays the number of levels for
the variables listed.
CROSSLIST: displays the cross-tabulation
results in a listing form.
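The Color data set is not created anywhere in these slides, so here is a minimal runnable sketch of the same idea. The counts are invented purely for illustration and the name work.color_demo is hypothetical. The point is that each input row is a pre-counted cell, and the WEIGHT statement tells PROC FREQ to treat Count as the number of observations in that cell.

```sas
/* Hypothetical, made-up counts: one row per eyes-by-hair cell within a region */
data work.color_demo;
  input region eyes $ hair $ count;
  datalines;
1 blue fair 20
1 blue dark 10
1 brown fair 5
1 brown dark 25
2 blue fair 15
2 blue dark 12
2 brown fair 8
2 brown dark 30
;
run;

proc freq data=work.color_demo;
  weight count;                       /* each row stands for COUNT observations */
  tables eyes*hair / norow nocol;     /* cross-tabulation without row/column percentages */
run;
```

Without the WEIGHT statement every row would count as a single observation, so all cells would show a frequency of 2 here.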
Produce reports
QUIZ 1
See the file QUIZ-SAS2.pdf.
Elementary statistical procedures
Perform elementary statistical procedures
Simple statistical inference
PROC FREQ
PROC UNIVARIATE
Correlation using PROC CORR
Group comparison using PROC TTEST
One-sample T test
Two-independent samples T test
Elementary statistical procedures - PROC FREQ
Simple statistical inference
Chi-square test using PROC FREQ
More hypothesis tests in PROC UNIVARIATE
**chi-square test using proc freq;
PROC FREQ data=color;
weight Count;
table eyes*hair /chisq cl;
RUN;
CHISQ: displays chi-square test results.
CL: displays confidence limits (95% by default) for the measures of association.
PROC UNIVARIATE <DATA=SAS-data-set> PLOT NORMAL ;
CLASS variable(s) ;
HISTOGRAM variable(s) /normal ;
PROBPLOT variable(s) /normal ;
QQPLOT variable(s);
VAR variable(s) ;
RUN ;
PLOT: generates a stem and leaf plot, a box plot and a normal probability plot
NORMAL: generates normality test
HISTOGRAM/normal: generates histogram with fitted distribution curve
PROBPLOT/normal: generates probability plot with specified distribution
QQPLOT: generates QQ plot
**simple hypothesis tests in proc univariate;
PROC UNIVARIATE data=prdsale modes plot normal cibasic(alpha=.1);
var actual;
histogram /normal (color=red);
qqplot;
RUN;
Elementary statistical procedures - PROC UNIVARIATE
PROC UNIVARIATE vs. PROC MEANS
In general, if you're interested in an overall view of the distribution of
your data, or if you want to run some simple hypothesis tests
(normality, etc.), then PROC UNIVARIATE is appropriate. Otherwise, if
you only need specific statistics, then PROC MEANS with a specific
list of statistic keywords may be more efficient.
If you're looking for a probability plot or a histogram, PROC UNIVARIATE
may be what you need.
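A side-by-side sketch of the two procedures on the same variable may make the difference concrete. It assumes SASHELP.CLASS (shipped with SAS) as a stand-in data set; it is not one of the course files.

```sas
/* Assumption: SASHELP.CLASS is used only as a convenient sample data set */

/* PROC MEANS: a compact table with just the statistics you ask for */
proc means data=sashelp.class n mean std min max maxdec=2;
  var height;
run;

/* PROC UNIVARIATE: full distributional summary, normality tests, and a histogram */
proc univariate data=sashelp.class normal;
  var height;
  histogram height / normal;   /* overlays a fitted normal curve */
run;
```

The first step prints one line of output; the second prints moments, quantiles, extreme observations and the tests for normality, which is more than you need if you only want a mean and a standard deviation.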
Correlation
Elementary statistical procedures - PROC CORR
PROC CORR <DATA=SAS-data-set> options;
VAR variables ;
RUN ;
**correlation using proc corr;
ODS html;
ODS graphics on;
PROC CORR data=sas2.citiday nomiss pearson spearman cov
plots=matrix;
var snydjcm snysecm dsiuswil;
RUN;
ODS graphics off;
ODS html close;
COV requests the covariance matrix (computed along with the Pearson correlations).
PEARSON requests Pearson's product-moment correlation coefficients (default).
KENDALL requests Kendall's tau-b correlation coefficients.
SPEARMAN requests Spearman's rank-order correlation coefficients.
PARTIAL (a statement, not a PROC option) produces partial correlations among the VAR variables,
controlling for the variables listed in the PARTIAL statement.
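The example on this slide does not use PARTIAL, so here is a minimal sketch of how it is written. It assumes SASHELP.CLASS as a stand-in data set and asks for the correlation between height and weight after controlling for age.

```sas
/* Assumption: SASHELP.CLASS is used only as a convenient sample data set */
proc corr data=sashelp.class pearson spearman;
  var height weight;   /* variables to correlate */
  partial age;         /* variable to control for */
run;
```

Running the same step without the PARTIAL statement gives the ordinary correlations, which makes it easy to see how much of the height-weight association is shared with age.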
Group comparison
PROC TTEST <DATA=SAS-data-set> ;
VAR variables ;
CLASS stratifier ;
RUN ;
**group comparison using proc ttest;
PROC TABULATE data=prdsale;
var actual;
class year quarter;
tables year*quarter, actual*(mean min max);
RUN;
PROC TTEST data=prdsale h0=500 alpha=0.1;
var actual;
where year=1994;
RUN;
PROC TTEST data=prdsale;
class year;
var actual;
RUN;
Variable name   Variable information
permno          CRSP Permanent Number
date            Numeric date
ret             Holding Period Return
retx            Return without dividends
mktrf           Excess return on market
smb             Small-minus-big return
hml             High-minus-low return
rf              Risk-free return rate
umd             Momentum factor
QUIZ 2
See the file QUIZ-SAS2.pdf.
Introduction to regression procedures
Introduction to regression procedures
Introduction to ANOVA procedures
Comparing groups using PROC ANOVA
Unbalanced design using PROC GLM
General linear models
The REG procedure
Logistic regression
Statistical background
The Logistic procedure
Regression procedures - PROC ANOVA
Introduction to ANOVA procedures
PROC ANOVA <DATA=SAS-data-set> ;
CLASS stratifier ;
MODEL dependents = effects ;
MEANS var / options;
RUN ;
QUIT;
**comparing groups using proc anova;
PROC TABULATE data=sas2.cargo99;
var cargorev cargowgt;
class routeid;
tables routeid, cargorev*mean
cargowgt*mean;
RUN;
ODS html;
ODS graphics on;
PROC ANOVA data=sas2.cargo99;
class routeid;
model cargorev = routeid;
RUN;
means routeid / Bon;
RUN;
ODS graphics off;
ODS html close;
QUIT;
CLASS specifies stratifier variables.
MODEL defines the model to be fit.
MEANS computes and compares means.
PROC GLM <DATA=SAS-data-set> ;
CLASS variables;
MODEL dependents = effects ;
Means effects;
RUN ;
QUIT;
**unbalanced ANOVA for two-way design with interaction;
ODS html;
ODS graphics on;
PROC GLM data=sas2.cargo99;
class routeid;
model cargorev=routeid cargowgt routeid*cargowgt / ss1 ss2 ss3 ss4;
RUN;
lsmeans routeid / pdiff=all adjust=bon ;
RUN;
ODS graphics off;
ODS html close;
QUIT;
Regression procedures - PROC GLM
Linear regression procedures
General linear models
$$ Y = X\beta + \varepsilon $$
$Y$ is the $n \times 1$ vector of responses.
$X$ is the $n \times (p+1)$ matrix of explanatory variables.
Least-squares estimates of the regression coefficients:
$$ \hat{\beta} = (X'X)^{-1} X'Y $$
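For readers who want the step between the model and the estimator: a short derivation sketch for the linear model $Y = X\beta + \varepsilon$, under the assumption (not stated on the slide) that $X$ has full column rank so that $X'X$ is invertible.

```latex
\begin{aligned}
S(\beta) &= (Y - X\beta)'(Y - X\beta)
          = Y'Y - 2\beta'X'Y + \beta'X'X\beta \\
\frac{\partial S}{\partial \beta} &= -2X'Y + 2X'X\beta = 0
  \quad\Longrightarrow\quad X'X\,\hat{\beta} = X'Y \\
\hat{\beta} &= (X'X)^{-1}X'Y
\end{aligned}
```

The middle line gives the normal equations; both PROC GLM and PROC REG report least-squares estimates that solve them.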
PROC GLM:
It uses the method of least squares to fit general linear models
relating one or several continuous dependent variables to one or
several independent variables.
Strengths:
direct specification of polynomial effects
ease of specifying categorical effects (PROC GLM automatically
generates dummy variables for class variables)
Weaknesses:
No collinearity diagnostics
No influence diagnostics
No scatter plots
Only one model at one time
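A minimal sketch of the two strengths listed above, assuming SASHELP.CLASS as a stand-in data set: the categorical predictor goes in the CLASS statement with no hand-made dummy variables, and the quadratic term is written directly in the MODEL statement.

```sas
/* Assumption: SASHELP.CLASS is used only as a convenient sample data set */
proc glm data=sashelp.class;
  class sex;                                   /* dummy variables are generated automatically */
  model weight = sex height height*height      /* height*height is a polynomial effect */
        / solution ss3;                        /* SOLUTION prints the parameter estimates */
run;
quit;
```

In PROC REG the same model would need a squared variable and a 0/1 indicator created in a DATA step first, as the insurance and USPopulation examples in this session do.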
PROC REG: Provides the most general analysis capabilities
handles multiple regression models
provides nine model-selection methods
allows interactive changes both in the model and in the data used to
fit the model
allows linear equality restrictions on parameters
tests linear hypotheses and multivariate hypotheses
produces collinearity diagnostics, influence diagnostics, and partial
regression leverage plots
saves estimates, predicted values, residuals, confidence limits, and
other diagnostic statistics in output SAS data sets
generates plots of data and of various statistics
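As a sketch of the diagnostics mentioned in the list above, the step below requests collinearity and influence diagnostics and saves predicted values and residuals. It assumes SASHELP.CLASS as a stand-in data set; the output data set name work.class_diag and the new variable names are made up for this illustration.

```sas
/* Assumption: SASHELP.CLASS is used only as a convenient sample data set */
proc reg data=sashelp.class;
  model weight = height age
        / vif collin influence;      /* collinearity and influence diagnostics */
  output out=work.class_diag         /* hypothetical output data set name */
         p=pred r=resid cookd=cooksd;
run;
quit;
```

VIF and COLLIN address collinearity among the predictors, INFLUENCE lists observation-level influence statistics, and the OUTPUT statement writes the predicted values, residuals and Cook's D to a data set for further use.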
Regression procedures - PROC REG
PROC REG <DATA=SAS-dataset> ;
MODEL dependent-variable = predictors /
selection=method R CLI CLM ;
PLOT r.*p. ;
RUN ;
QUIT;
*Regression using proc reg ;
* Create the data set first so that PROC REG can read it;
DATA insurance;
input time size type @@;
sizetype=size*type;
datalines;
17 151 0 26 92 0 21 175 0 30 31 0 22 104 0
0 277 0 12 210 0 19 120 0 4 290 0 16 238 0
28 164 1 15 272 1 11 295 1 38 68 1 31 85 1
21 224 1 20 166 1 13 305 1 30 124 1 14 246 1
;
PROC REG data=insurance;
model time = size type sizetype
/selection=none;
RUN;
delete sizetype;
print;
RUN;
plot r.*p. time*p.;
RUN;
QUIT;
SELECTION: specifies the model selection method: forward, backward, etc.
DELETE: deletes variables from the model.
PRINT: prints the analysis results.
PLOT: produces diagnostic plots.
*Polynomial regression using proc reg;
PROC REG data=USPopulation;
var YearSq;
model Population=Year / selection=none;
plot r.*p. ;
RUN;
add YearSq;
print;
plot / cframe=ligr;
RUN;
plot (Population predicted. u95. l95.)*Year / overlay cframe=ligr;
RUN;
QUIT;
ODS html;
ODS graphics on;
PROC REG data=USPopulation;
Linear: model Population=Year;
Quadratic:model Population=Year YearSq;
RUN;
ODS graphics off;
ODS html close;
QUIT;
Logistic regression procedures
Logistic models
Binary logistic model: dichotomous response outcomes
e.g.: presence or absence of an event
PROC LOGISTIC provides the capability of model-fitting.
Ordinal logistic model: ordinal response variable with more than two
ordered categories
e.g.: a 5-point Likert scale
PROC LOGISTIC fits the proportional odds model with CLOGIT link.
Multinomial logistic model: nominal response variables with more than
two categories
e.g.: different types of programs in school
PROC LOGISTIC fits the generalized logit model if you specify the GLOGIT link.
Binary logistic model
$$ \pi_i = E(y_i \mid x_i), \qquad g(\pi) = \operatorname{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right) = \alpha + \beta'X $$
Ordinal logistic model
$$ g\big(\Pr(Y \le i \mid X)\big) = \alpha_i + \beta'X, \qquad i = 1, \ldots, k $$
Multinomial logistic model
$$ \log\left(\frac{\Pr(Y = i \mid X)}{\Pr(Y = k+1 \mid X)}\right) = \alpha_i + \beta_i'X, \qquad i = 1, \ldots, k $$
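To read fitted coefficients as probabilities, the binary logit $\log(\pi/(1-\pi)) = \alpha + \beta'X$ can be inverted. A short worked sketch, using the same symbols as the binary model:

```latex
\begin{aligned}
\log\left(\frac{\pi}{1-\pi}\right) = \alpha + \beta'X
\quad&\Longrightarrow\quad
\pi = \frac{\exp(\alpha + \beta'X)}{1 + \exp(\alpha + \beta'X)}
    = \frac{1}{1 + \exp\{-(\alpha + \beta'X)\}} \\
\alpha + \beta'X = 0 \quad&\Longrightarrow\quad \pi = \tfrac{1}{2}
\end{aligned}
```

A one-unit increase in a predictor $x_j$ multiplies the odds $\pi/(1-\pi)$ by $\exp(\beta_j)$, which is why exponentiated estimates are read as odds ratios.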
PROC LOGISTIC <DATA=SAS-dataset> ;
CLASS variables;
MODEL dependent-variable = predictors / options;
RUN ;
Regression procedures - PROC LOGISTIC
Binary logistic model
Variable name Variable information
age Age in years
ed Level of education
1= didn't complete high school 2= high school degree
3= college degree 4= undergraduate 5= postgraduate
employ Years with current employer
address Years in current address
income Household income in thousands
debtinc Debt to income ratio (*100)
creddebt Credit card debt in thousands
othdebt Other debts in thousands
default Previously defaulted (1=Yes; 0=No)
Goal: identify customers with a high chance of defaulting on a bank
loan. We have 700 records from the bank database (bankloan).
*Binary logistic model;
PROC MEANS data=sas2.bankloan;
var age employ address income debtinc creddebt othdebt;
class default;
RUN;
PROC LOGISTIC data=sas2.bankloan;
class ed(ref='1') / param=ref;
model default(event='1')= ed age employ address income debtinc
creddebt othdebt
/ selection=stepwise slentry=0.3 slstay=0.35 details
rsquare lackfit;
output out=bankloanpred p=prob lower=lcl upper=ucl xbeta=logit;
ods output parameterestimates=bankloanest;
RUN;
SELECTION: specifies model selection methods.
SLENTRY=0.3 : a significance level of 0.3 is required to allow a variable into the model.
SLSTAY=0.35: a significance level of 0.35 is required for a variable to stay in the model.
DETAILS: produces a detailed account of the variable selection process.
RSQUARE: produces generalized R-square measure.
LACKFIT: produces Hosmer and Lemeshow goodness-of-fit test for the final selected model.
PARAM=REF: specifies the reference cell coding.
REF: specifies reference group for categorical predictors.
EVENT: specifies which category of the dependent variable is modeled as the event (here default=1).
Proportional Odds Model for Ordinal Logistic Model
To identify factors that influence a person's income category.
*Ordinal logistic model;
DATA income;
set sas2.bankloan;
* SAS treats a missing value as smaller than any number, so test for it first;
if missing(income) then inccat=.;
else if income < 20 then inccat=1;
else if income < 30 then inccat=2;
else if income < 40 then inccat=3;
else if income < 50 then inccat=4;
else inccat=5;
RUN;
PROC LOGISTIC data=income;
class ed(ref='5') / param=ref;
model inccat = age ed employ address debtinc / link=clogit;
output out=incpred p=prob xbeta=linp;
RUN;
LINK: specifies the link function.
CLOGIT: cumulative logits.
Generalized Logits Model for Multinomial Logistic Model
*Multinomial logistic model;
DATA school;
input school program style $ count;
datalines;
1 1 self 10
1 1 team 17
1 1 class 26
1 2 self 5
1 2 team 12
1 2 class 50
2 1 self 21
2 1 team 17
2 1 class 26
2 2 self 16
2 2 team 12
2 2 class 36
3 1 self 15
3 1 team 15
3 1 class 16
3 2 self 12
3 2 team 12
3 2 class 20
;
Goal: identify how study styles differ among
schools and programs.
PROC LOGISTIC data=school;
freq count;
class school program / order=data;
model style = school program school*program / link=glogit;
output out=progstat p=prob;
ods output parameterestimates=progest;
RUN;
PROC FREQ data=progstat;
format prob 5.4;
tables school*program*_level_*prob /list nopercent nocum;
RUN;
DATA progodd;
set progest;
odds=exp(estimate);
RUN;
PROC PRINT data=progodd;
var response estimate odds;
RUN;
LINK: specifies the link function.
GLOGIT: generalized logit function.
Resources and books
Regression methods
Applied Regression Analysis, Linear Models, and Related Methods by John Fox
Regression Analysis by Example by Chatterjee, Hadi and Price
An Introduction to Generalized Linear Models, Second Edition by Annette J. Dobson
Logistic regression and categorical data analysis
Applied Logistic Regression, Second Edition by David Hosmer and Stanley Lemeshow
An Introduction to Categorical Data Analysis by Alan Agresti
CAC statistical consultation support:
CAC statistical WIKI page:
http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/SAS.aspx
Statistical consultation service: lsun@smu.edu.sg
End!
