
Introduction to SAS by Kaz Download from www.src.uchicago.edu/users/ueka

Introduction to SAS
Version 1.4, updated 9/29/2002
by Kazuaki Uekawa, Ph.D.
Visiting Scholar, The Department of Sociology, The University of Chicago; Population Research Center at NORC
Address: 1155 E. 60th St., Room 340, Chicago, IL 60637
www.src.uchicago.edu/users/ueka
kuekawa@alumni.uchicago.edu
Copyright 2002 by Kazuaki Uekawa. All rights reserved.

Table of Contents
I. Introduction
II. How to start?
III. LIBNAME: Assigning library name
IV. Create SAS data for a practice
V. Creating New Variables
VI. Procedures
  a. PROC CONTENTS: Description of Contents
  b. PROC PRINT: See Data
  c. PROC SORT: Sorting Observations based on a value of variable
  d. PROC MEANS: Get Descriptive Statistics (Mean, STD, Min, Max)
  e. PROC FREQ: Get Frequencies
  f. PROC UNIVARIATE: Get elaborate statistics and a univariate plot
  g. PROC PLOT: Plotting Two Variables
  h. PROC TIMEPLOT: Time Plot
  i. PROC CORR: Correlation
  j. PROC REG: OLS Regression
  k. PROC LOGISTIC: Logistic Regression
  l. MAKE AN ASCII FILE
VII. More Procedures
  m. PROC STANDARD: Standardize Values
  n. PROC RANK: Rank observations
  o. PROC SQL: Creating group-level mean variables
VIII. Merging Data Sets
IX. Temporary and Permanent Data Sets


I. Introduction

I recommend SAS over other statistical packages because:
a) ODS (Output Delivery System) allows users to save statistical results as data. A user can build tables from the result data set within one single program (as opposed to printing the results on paper and using Excel to finish the tables). The table can be as sophisticated as http://www.src.uchicago.edu/users/ueka/SAS/proc_mixed_example1output.txt and it can be further saved in Excel format using PROC EXPORT.
b) It has a rich array of macro functions.
c) Email support service with quick response: support@sas.com
d) Users come from many fields, including the social and natural sciences, as well as business. Thus, SAS programming skill can be an asset in the job market.

I discuss both ODS and macros in Introduction to SAS 2, which is available from the same website.

Idiosyncrasy of this document
I wrote this document on my Japanese PC, where the backslash is displayed as a yen sign (¥). If you see ¥ in a path name, read it as a backslash (\).

U. of Chicago people can access SAS on-line on the web! SAS On-line for version 8: http://gsbapp2.uchicago.edu/sas/sashtml/main.htm

Note on SAS email support: When you email SAS support with a question, you need to identify yourself as a legitimate SAS customer. Look at the head of a log file, and copy and paste the information there into the beginning of your email text. It looks like this:
NOTE: Copyright (c) 1999-2001 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software Release 8.2 (TS2M0) Licensed to UNIVERSITY OF XXXXX, Site XXXXX.
NOTE: This session is executing on the WIN_ME platform.


II. How to start?


1. Start SAS. You can find the shortcut by going from START to PROGRAMS to The SAS System.
2. Type in syntax in the EDITOR window. Syntax is something you learn in this document.
3. Click on the runner icon to run the program. Alternatively, you can highlight the part of the syntax that you want to run and then click the runner to run only that part. (The downside of using UNIX instead of WINDOWS is that UNIX cannot let you do this selective run.)

The LOG window contains messages. Watch for the words error and warning. The OUTPUT window contains output.

If you mistype syntax and want to undo it, press control-z. This is the same command that is used in Microsoft Office products. To cancel a run while it is happening, click on the stop icon (which looks like !) right next to the runner icon.

III. LIBNAME: Assigning library name

Typing full path names for directories (e.g., C:\temp\abc\old) is too tedious, so we give them nicknames at the beginning of a program. The path must be enclosed in quotation marks.

libname here "C:\TEMP";
libname there "C:\";

So from now on, here.abc means the data set named abc placed in the directory nicknamed here.
there.xyz means the data set named xyz placed in the directory nicknamed there.
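To confirm that a nickname points where you expect, a minimal sketch (assuming the folder C:\TEMP exists on your machine) is to ask SAS to list the library's attributes in the log:

```sas
/* Assumption: the folder C:\TEMP exists on this PC */
libname here "C:\TEMP";

/* Write the attributes of the library HERE (engine, physical path) to the log */
libname here list;
```

If the path is mistyped, the log is the place where the problem shows up, so it is worth checking there before running the rest of a program.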


IV. Create SAS data for a practice


Description of Practice Data

The data come from TIMSS (Third International Mathematics and Science Study), in which some 40 nations and three population groups (3rd and 4th graders, 7th and 8th graders, and high school seniors) participated. I aggregated the data at the national level. The variables are:
acro: acronym for the participant nation
nation: name of the country
name: complete name of the country
mat8: 8th graders' average math test score
mat7: 7th graders' average math test score
GNP14: GNP per capita
prop: proportion of 8th graders in schooling
NATEXAM: a national-level exam is administered
NATSYLB: the syllabus is decided at the national level
NATTEXT: textbooks are chosen at the national level
block: regional block of the nation (values such as weuro, seasia, namer)

libname here "C:\TEMP";
libname there "C:\";


/*NATION is read from columns 6-14 and NAME from columns 15-33, so keep the data lines aligned as they are here*/
data kaz;
input acro $ NATION $ 6-14 NAME $ 15-33 MAT7 MAT8 GNP14 PROP NATEXAM NATSYLB NATTEXT block $;
cards;
aus  Australi Australia           498 529.63 -0.15526  84 0 1 0 ocea
aut  Austria  Austria             509 539.43 -0.29163 100 0 0 1 weuro
bfl  Belgi_FL Belgium (Fl)        558 565.18 -0.25157 100 1 1 0 weuro
bfr  Belgi_FR Belgium (Fr)        507 526.26 -0.25157 100 0 1 0 weuro
can  Canada   Canada              494 527.24  0.07184  88 0 0 0 namer
col  Colombia Colombia            369 384.76 -0.23699  62 0 1 0 samer
cyp  Cyprus   Cyprus              446 473.59 -0.41906  95 0 1 1 seuro
csk  Czech    Czech Republic      523 563.75 -0.34840  86 0 1 0 eeuro
dnk  Denmark  Denmark             465 502.29 -0.34057 100 1 0 0 weuro
fra  France   France              492 537.83  0.55791 100 0 1 0 weuro
deu  Germany  Germany             484 509.16  0.91992 100 0 0 0 weuro
grc  Greece   Greece              440 483.90 -0.32620  99 0 1 1 seuro
hkg  HongKong Hong Kong           564 588.02 -0.31638  98 1 1 1 seasia
hun  Hungary  Hungary             502 537.26 -0.37602  81 0 0 0 eeuro
isl  Iceland  Iceland             459 486.78 -0.42606 100 0 0 0 neuro
irn  Iran     Iran, Islamic Rep.  401 428.33 -0.17095  66 0 1 1 meast
irl  Ireland  Ireland             500 527.40 -0.38919 100 1 1 0 weuro
isr  Israel   Israel                . 521.59 -0.35464  87 0 1 0 meast
jpn  Japan    Japan               571 604.77  1.85543  96 0 1 0 seasia
kor  Korea    Korea               577 607.38 -0.01168  93 0 1 1 seasia
kwt  Kuwait   Kuwait                . 392.18 -0.40359  60 0 1 1 meast
lva  Latvia   Latvia (LSS)        462 493.36 -0.42319  87 0 0 0 eeuro
ltu  Lithuani Lithuania           428 477.23 -0.41785  78 1 1 1 eeuro
nld  Netherla Netherlands         516 540.99 -0.18184  93 1 0 0 weuro
nzl  NewZeala New Zealand         472 507.80 -0.38319 100 1 1 0 ocea
nor  Norway   Norway              461 503.29 -0.35450 100 0 1 1 neuro
prt  Portugal Portugal            423 454.45 -0.32588  81 0 1 0 weuro
rom  Romania  Romania             454 481.55 -0.35396  82 1 1 1 eeuro
rus  RussianF Russian Federation  501 535.47  0.12827  88 1 0 0 eeuro
sco  Scotland Scotland            463 498.46  0.48017 100 0 0 0 weuro
sgp  Singapor Singapore           601 643.30 -0.37279  84 1 1 1 seasia
slv  SlovakRe Slovak Republic     508 547.11 -0.40217  89 0 1 0 eeuro
svn  Slovenia Slovenia            498 540.80 -0.41310  85 0 1 1 eeuro
esp  Spain    Spain               448 487.35  0.03461 100 0 1 1 weuro
swe  Sweden   Sweden              477 518.64 -0.30049  99 0 1 0 neuro
che  Switzerl Switzerland         506 545.44 -0.27916  91 0 0 0 weuro
tha  Thailand Thailand            495 522.37 -0.14533  37 0 1 1 seasia
usa  USA      United States       476 499.76  5.37506  97 0 0 0 namer
;
run;


/*this prints out the data*/
proc print;
run;


Advanced Topic: Alternatively, you can save the data above (just the data lines) as a plain text file named kaz.txt in the temp directory of your C drive. (In case you only have this document as a hard copy, visit www.src.uchicago.edu/users/ueka for a digital version, so you can copy and paste.) Then use the program below to read in the file.

/*these two lines are not crucial in this example, but let's just put them at the beginning of your program*/
libname here "C:\TEMP";
libname there "C:\";

data kaz;
infile "C:\TEMP\kaz.txt" missover;
input acro $ NATION $ 6-14 NAME $ 15-33 MAT7 MAT8 GNP14 PROP NATEXAM NATSYLB NATTEXT block $;
run;

MISSOVER tells SAS what to do when a data line runs out of values before the INPUT statement is satisfied: instead of moving on to the next line to look for the remaining values, SAS sets the remaining variables to missing. It is safe to use it. $ means that the variable named before it is a character variable as opposed to numeric.
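The effect of MISSOVER is easiest to see on a tiny example. The sketch below is not part of the practice data; the data set name short and the variables id, x, and y are made up for illustration. The second data line has no value for y.

```sas
/* Hypothetical toy data: the second line is one value short */
data short;
  infile datalines missover;   /* remove MISSOVER to compare the behavior */
  input id x y;
  datalines;
1 10 20
2 30
3 50 60
;
run;

proc print data=short;
run;
```

With MISSOVER, the second observation simply gets a missing y. Without it, SAS would go to the next line to look for y, and the lines would no longer correspond to observations one to one.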

V. Creating New Variables

Data kaz2; set kaz;
/*ADDITION*/
var1=mat7+mat8;
/*OR*/
var2=sum(of mat7 mat8);
/*SUBTRACTION*/
var3=mat8-mat7;
/*MULTIPLICATION*/
var4=mat7*mat8;
/*DIVISION*/
var5=mat7/mat8;
/*Use brackets effectively*/
var6=1/(mat7+mat8);
/*MEAN of several variables*/
var7=mean(of mat7 mat8);
/*MAX of several variables*/
var8=max(of mat7 mat8);
/*MIN of several variables*/
var9=min(of mat7 mat8);
/*LOG: a value to enter must be positive*/
var10=log(mat7);
/*Absolute values: this takes out negative signs*/
var11=abs(gnp14);
run;

/*TO SEE WHAT YOU DID, USE PROC PRINT*/
proc print data=kaz2;
title "Lots of manipulations: See results";
var mat7 mat8 var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11;
run;

Advanced Topics: How is Z=mean(of X1 X2 X3) different from Z=(X1+X2+X3)/3;? How is Z=sum(of X1 X2 X3) different from Z=X1+X2+X3;? Functions such as mean(of ...) or sum(of ...) take statistics of the non-missing values. They return a value even when some of the variables in the brackets are missing. For example, if X1 is missing, Z=mean(of X1 X2 X3); will return the average of X2 and X3. In contrast, Z=(X1+X2+X3)/3; will return a missing value, namely, a period (.).
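A small sketch makes the difference concrete. The data set name demo and the values are made up for illustration; X1 is deliberately set to missing.

```sas
/* Hypothetical values: x1 is missing on purpose */
data demo;
  x1 = .;
  x2 = 4;
  x3 = 8;
  z_func  = mean(of x1 x2 x3);    /* averages the non-missing values 4 and 8 */
  z_arith = (x1 + x2 + x3) / 3;   /* arithmetic with a missing value gives missing */
  s_func  = sum(of x1 x2 x3);     /* adds the non-missing values 4 and 8 */
  s_arith = x1 + x2 + x3;         /* missing again */
run;

proc print data=demo;
run;
```

The log also carries a note that missing values were generated, which is a useful reminder whenever plain arithmetic meets incomplete data.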

Read this after you study PROC REG later in the document. When we compare several regression models (e.g., coefficients, R2, goodness-of-fit, etc.), we want to keep the number of observations the same across the models. Because predictors may have different patterns of missing values, this does not happen by itself; you have to make it happen. For example, mat7, the 7th graders' mathematics score, has some missing cases, because some nations only let their 8th graders participate in this international test. Use the NMISS function to create a new variable john.

data kaz2; set kaz;
john=nmiss(of GNP14 mat8 mat7); /*this returns the number of missing values among these three variables*/
run;

/*check what the data look like now*/
proc print data=kaz2;
var name gnp14 mat8 mat7 john;
run;

/*Apply OLS regression to cases with perfect data (no missing values). In this way, model 1 and model 2 will have the same number of cases, or to be more precise, the same data.*/
proc reg data=kaz2;
where john=0; /*Run only when john=0, namely, the number of missing values is 0*/
model mat8=mat7;
model mat8=mat7 gnp14;
run;
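Before filtering, it can help to see how many values are missing in each variable. A minimal sketch, assuming the practice data set kaz from Section IV is available:

```sas
/* N = number of non-missing values, NMISS = number of missing values */
proc means data=kaz n nmiss;
  var gnp14 mat8 mat7;
run;
```

Note that NMISS here is a statistic keyword of PROC MEANS that counts down a column, while the NMISS function in a data step counts across variables within one observation.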

VI. Procedures

a. PROC CONTENTS: Description of Contents

PROC CONTENTS data=kaz;
run;

Advanced topic: By default the variables are listed in alphabetical order. They can also be shown by their position in the data set (left to right) by adding the position option:

proc contents data=kaz position;
run;

I like this option because in this way you can find related variables close to each other.

b. PROC PRINT: See Data

PROC PRINT data=kaz;
VAR nation mat7 mat8 natexam; /*without this, all variables will be printed*/
run;

Advanced topic: You can selectively print observations.

/*print only when natexam=1*/
proc print data=kaz; where natexam=1; var nation mat7 mat8; run;

/*print by group units; the data must be sorted by the group variable first*/
proc sort data=kaz out=kaz2; by block; run;
proc print data=kaz2; by block; var nation mat7 mat8; run;

/*print only up to a certain number of observations*/
/*you want to do this when your data is big and you don't want to print every observation*/
data kaz2; set kaz;
john=_n_; /*this creates a new variable holding the observation (row) number*/
run;
proc print data=kaz2; where john < 5; run; /*this shows the first 4 observations*/

If you want a nicer print-out, try proc report.

c. PROC SORT: Sorting Observations based on a value of variable
You will use this procedure a lot, but be careful with large data sets: sorting consumes a lot of computation time.

PROC SORT data=kaz out=kaz2; /*If you don't want to create a new data set, just write out=kaz*/
by mat8;
run;

Advanced topics:

proc sort data=kaz out=kaz2 nodupkey;
by block;
run;
proc print data=kaz2; run;

This takes only the first observation of each block. Imagine that you have data with individual-level variables (e.g., 100 students) and group-level variables (e.g., 10 schools), and that you want to get school-level information from this data. The above procedure takes just the first observation of each school and gets you ten lines of data for 10 schools. Ignore the individual-level variables, however: they only hold the values of whichever student happened to come first.

You can use more than one variable in the by line.

proc sort data=kaz out=kaz2;
by natexam block;
run;
/*What would the new data look like?*/
proc print data=kaz2; run;
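PROC SORT sorts in ascending order unless told otherwise. As a small sketch using the practice data kaz, the DESCENDING keyword placed before a by-variable reverses the order for that variable, which puts the highest-scoring nations first:

```sas
/* Highest 8th grade math score first */
proc sort data=kaz out=kaz2;
  by descending mat8;
run;

proc print data=kaz2;
  var nation mat8;
run;
```

DESCENDING applies only to the variable that follows it, so in a by line with several variables each one can have its own direction.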

d. PROC MEANS: Get Descriptive Statistics (Mean, STD, Min, Max)

PROC MEANS data=kaz;
VAR mat7 mat8;
run;


Advanced topic: Group means.

/*Report group means*/
proc sort data=kaz out=kaz2; by block; run;
proc means data=kaz2;
by block;
var mat7 mat8;
run;

You can also use a class statement instead of a by statement. The class statement is easier because you don't need to sort the data by the by-variable beforehand. I forgot what the downside of it was.

proc means data=kaz2; /*now, kaz2 does not have to be sorted by block*/
class block;
var mat7 mat8;
run;

/*Save group means*/
ods listing close; /*printing of results suppressed*/
proc means data=kaz2; /*make sure kaz2 is already sorted by group ID*/
by block;
var mat7 mat8;
ods output summary=john; /*Output Delivery System used. See SAS manual 2*/
run;
ods listing; /*printing of results resumed*/
proc print data=john;
run;
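Group means can also be saved without ODS, through the OUTPUT statement of PROC MEANS. This is a sketch of that alternative, not the approach used above; the data set name john2 and the variable names mean_mat7 and mean_mat8 are made up for illustration.

```sas
/* NOPRINT suppresses the printed table; NWAY keeps only the block-level rows */
proc means data=kaz nway noprint;
  class block;
  var mat7 mat8;
  output out=john2 mean=mean_mat7 mean_mat8;
run;

proc print data=john2;
run;
```

The output data set also contains the automatic variables _TYPE_ and _FREQ_; _FREQ_ is the number of observations in each block.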

/*Get standard errors by adding STDERR*/
/*But once you list a statistic, only the listed statistics are printed, so you must also add the other statistics you would like. Here: mean, N, STD, MAX, and MIN*/
PROC MEANS data=kaz mean n std max min stderr;
VAR mat7 mat8;
run;

I recommend reading the chapter on PROC MEANS in SAS CD-online. It is a very versatile procedure.

e. PROC FREQ: Get Frequencies

PROC FREQ data=kaz;
Tables natexam;
Run;

Advanced topics: Get a cross tabulation:

PROC FREQ data=kaz;
tables natexam*block;
run;

f. PROC UNIVARIATE: Get elaborate statistics and a univariate plot

PROC UNIVARIATE PLOT DATA=KAZ;
var mat7 mat8 gnp14;
run;

Advanced topic: Get box-and-whisker plots by subgroups, so you can compare group values. But the output is text-based and pretty ugly.

proc sort data=kaz out=kaz2;
by block;
run;
PROC UNIVARIATE data=kaz2 plot;
by block;
var mat8;
run;


g. PROC PLOT: Plotting Two Variables

This is a text-based graph. Use proc gplot for a nicer graphic.

PROC PLOT data=KAZ;
Plot mat7*mat8;
run;

h. PROC TIMEPLOT: Time Plot

proc timeplot data=KAZ;
plot mat8='*';
id NAME;
run;

Advanced topics:

/*Sort first by the variable of your interest and see it*/
/*you will be seeing a ranking of nations*/
proc sort data=kaz out=kaz2;
by mat8;
run;
proc timeplot data=KAZ2;
plot mat8='*';
id NAME;
run;

Add bells and whistles. Below, I am asking: does GNP have anything to do with test scores?

/*First sort by GNP*/
proc sort data=kaz out=kaz2;
by gnp14;
run;
proc timeplot data=KAZ2;
title "TIMSS countries sorted by GNP";
plot mat7 mat8/overlay hiloc npp;
id NAME block gnp14 prop;
run;


i. PROC CORR: Correlation

PROC CORR DATA=KAZ;
VAR mat7 mat8 gnp14;
Run;

j. PROC REG: OLS Regression

PROC REG DATA=KAZ;
MODEL mat8=natexam gnp14;
Run;

Advanced Topic: See www.src.uchicago.edu/users/ueka for the creation of an OLS table using ODS. Also see the PROC IML instruction on the same page to learn how OLS estimates its coefficients.

k. PROC LOGISTIC: Logistic Regression

/*I don't know if natexam can be considered a dependent variable, but for the sake of demonstration*/
PROC logistic data=kaz;
Model natexam=gnp14;
run;

l. MAKE AN ASCII FILE

To use a stand-alone software program, you may have to create a simple ASCII file. But I rarely do this lately because many software packages read SAS data directly.

data timss; set kaz;
file "aschi_example.txt";
put (nation) ($10.) (mat7 mat8) (8.0); /*nation is a character variable, so it takes a character format*/
run;

VII. More Procedures

m. PROC STANDARD: Standardize Values

Make Z-scores with a mean of 0 and a standard deviation of 1.


proc standard data=kaz out=kaz2 mean=0 std=1;
var mat7 mat8;
run;
/*then see what you did*/
proc print data=kaz2;
run;

Advanced technique: Standardize within groups.

/*First sort by group ID*/
proc sort data=kaz out=kaz2;
by block;
run;
/*Use by statement*/
proc standard data=kaz2 out=kaz3 mean=0 std=1;
by block;
var mat7 mat8;
run;
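A quick way to confirm that the standardization did what you expect is to run PROC MEANS on the result. This is a minimal sketch using the practice data kaz; the MAXDEC= option only limits the number of decimals printed.

```sas
proc standard data=kaz out=kaz2 mean=0 std=1;
  var mat7 mat8;
run;

/* The means should come out as 0 and the standard deviations as 1, up to rounding */
proc means data=kaz2 mean std maxdec=2;
  var mat7 mat8;
run;
```

For the within-group version, the same check can be run with a by block statement on the sorted, standardized data.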


n. PROC RANK: Rank observations

proc rank data=kaz out=kaz2 group=3; /*Creates 3 groups. The new values will be 0, 1, and 2.*/
var mat7 mat8;
RANKS Rmat7 Rmat8; /*give names to the new variables*/
Run;
/*see what happened*/
proc print data=kaz2;
var mat7 Rmat7 mat8 Rmat8;
RUN;

Research Tip: Why do we use rank?
a. We can split the sample based on the rank, e.g., a high SES student sample versus a low SES student sample.
b. We can create dummy variables quickly by specifying group=2, e.g., high SES students will receive 1; else 0.

This grouping occurs at the median of a variable, which may or may not be the best strategy. An alternative is to assign 1 and 0 based on some meaningful threshold. For example, if I have temperature data, I may use the median to split the data if it makes sense, but I may instead use 0 degrees (the freezing point) as a meaningful point to split the data.
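A threshold-based dummy variable does not need PROC RANK; a data step is enough. The sketch below uses the practice data kaz, and the cut-off of 500 and the variable name high8 are assumptions chosen only for illustration.

```sas
/* Hypothetical threshold: 500 on the 8th grade math score */
data kaz2;
  set kaz;
  if mat8 = . then high8 = .;    /* keep missing scores missing */
  else high8 = (mat8 >= 500);    /* a comparison returns 1 if true, 0 if false */
run;

proc freq data=kaz2;
  tables high8;
run;
```

The explicit check for missing values matters because SAS treats a missing value as smaller than any number, so without it a missing score would silently be coded 0.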


o. PROC SQL: Creating group-level mean variables

One could use proc means to derive group-level means. I don't recommend this since it involves the extra step of merging the mean data back into the main data set. Extra steps always create room for errors. PROC SQL does it at once.

proc sql;
create table kaz2 as
select *,
mean(mat7) as mean_mat7,
mean(mat8) as mean_mat8,
mean(gnp14) as mean_gnp
from kaz
group by block;
quit; /*proc sql ends with a quit statement; a run statement has no effect here*/

proc print data=kaz2;
run;
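The same query can compute each nation's deviation from its block mean in one pass. This is a sketch under the same setup as above; the table name kaz3 and the variable name dev_mat8 are made up for illustration.

```sas
proc sql;
  create table kaz3 as
  select *,
         mean(mat8) as mean_mat8,
         mat8 - mean(mat8) as dev_mat8   /* distance from the block mean */
  from kaz
  group by block;
quit;

proc print data=kaz3;
  var nation block mat8 mean_mat8 dev_mat8;
run;
```

SAS writes a note to the log saying that the query requires remerging summary statistics back with the original data. That note describes exactly the step that would otherwise be done by hand with PROC MEANS and a merge.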

VIII. Merging Data Sets

libname here "C:\TEMP";

/*Create two data sets A and B.*/
data A; set kaz; /*I am assuming that you already have the data set kaz from running the program in Section IV of this document.*/
keep nation mat7;
run;
data B; set kaz;
keep nation mat8;
run;

/*MERGE DATA SETS*/
/*First sort them by a common ID*/
/*Here they are already sorted, so the following two lines are not really necessary*/
proc sort data=A; by nation; run;
proc sort data=B; by nation; run;

data NEW;
merge A B;
by nation;
run;

/*Confirm*/
proc print data=NEW;
run;

IX. Temporary and Permanent Data Sets

There are temporary and permanent SAS data sets. When you turn off SAS, the temporary data will be erased. Throughout the exercise, you have seen kaz and kaz2. They are temporary data sets. To actually see these data, go to the Explorer (on the left side of the SAS window), then to Libraries, and find the folders in there. The default directory is called Work. (You will also find the folders that you nicknamed.) Click them to open and find the data in them.

If you want to make data sets permanent, so they don't disappear when you turn off SAS, add the directory nickname in front of the new data set name. For example:

Data here.abc; set kaz;
keep nation growth;
growth=mat8-mat7;


run;

You are bringing in the temporary data set kaz and creating a new permanent data set called abc in the directory C:\TEMP (nicknamed here by a libname statement). You are creating a variable called growth, and it is now in here.abc. Only nation and growth are kept in the new data set.

You can also do the opposite: bring in a permanent data set this time and create a temporary data set.

Data xyz; set here.abc; /*this example assumes that here.abc still contains mat8 and mat7, i.e., that it was saved without the keep statement used earlier*/
growth=mat8-mat7;
drop mat8 mat7;
run;

You are bringing in a permanent data set called abc placed in C:\TEMP and creating a new data set xyz in SAS's default directory (Work). You created a variable called growth, and it is now in xyz. Mat8 and mat7 are dropped from the new data set. (Of course, reading in a permanent data set and creating a permanent data set is also possible, e.g., data here.xyz; set here.xyz;)
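Besides clicking through the Explorer, you can list the data sets in a library from a program. A minimal sketch, assuming the folder C:\TEMP exists and is nicknamed here:

```sas
/* Assumption: the folder C:\TEMP exists on this PC */
libname here "C:\TEMP";

/* List the permanent data sets stored in HERE; NODS suppresses the details of each one */
proc contents data=here._all_ nods;
run;

/* The same for the temporary data sets in WORK */
proc contents data=work._all_ nods;
run;
```

Running this at the start and end of a session is one way to see which data sets will survive once SAS is closed.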

Research Tip: I recommend that you make permanent data as infrequently as possible. Just save your syntax program and create fresh temporary data each time you start. In this way, you save disk space, because you only keep your small syntax program. Research is also a lot easier if you have only a few programs and data sets. For example:

http://www.src.uchicago.edu/users/ueka/SAS/Dataextractor8.3.txt

Every time I need to work on this study, I can just run this one single program to reproduce the data. I don't have to remember the naming convention and location of the data sets that I have to deal with. For this particular study, I only need to deal with the file above and one more file that actually does the analyses:

http://www.src.uchicago.edu/users/ueka/SAS/MakeFinalTables7.2.txt

If I need to make changes to my analyses, I know I just have to look into these two files. This would be impossible if I had too many files and data sets flying all over the place, even in one directory.

HOWEVER, if your data is huge (e.g., census data), then you may be better off saving permanent data, so it is quicker.

END of Document
