
SAS ROUTINES

Contents

1  Label variable
2  Report
3  Creating user defined format
4  Displaying user defined format
5  Random number generation
6  Define
7  Initialize a variable with a value
8  Reading from text file
9  ODS
10 Reading from csv file
11 Check missing value in dataset
12 Standardization of variables
     12.1.1 How do I standardize variables in SAS?
13 Capturing output in file
     13.1.1 Creating an output data set
     13.1.2 How to identify output objects
     13.1.3 Using object label to create an output data set
     13.1.4 Turn the listing output off
     13.1.5 Output to an HTML file
14 95% confidence interval of mean
15 Column input
16 formatted input
17 Formatting in proc report
18 Proc report
19 Across variable to group variable horizontally
20 Computed variable
21 Summary report
22 Viewing contents of dataset
23 printing portion of dataset
24 Print by
25 create new library
26 creating and adding data to dataset
27 importing csv file
28 importing excel file
29 IMPORTING TEXT FILE
30 COPYING DATASET
31 ADDING NEW VARIABLES AND CREATING DATASET
32 DROP AND KEEP VARIABLES IN NEW DATASET
33 PRINTING NO OBSERVATION NUMBER
34 SUBSTRING
35 OTHER STRING FUNCTIONS
36 DATE FUNCTION
37 PRINTING VARIABLES IN DATASET
38 SORT
39 REMOVING DUPLICATE
40 REMOVE DUPLICATE BASED ON KEY
41 MOVE DUPLICATES INTO NEW DATASET
42 PLOT
43 Median in proc sql
44 PROC SQL
45 Proc sql case
46 MERGING TWO DATASETS
47 SAMPLING
48 PRINTING VERTICAL HEADING
49 MEAN CALCULATION
50 Moving means data into output file
51 Merging two data together
52 QUANTILES
53 TO REMOVE NA VALUES TO NUMERICAL
54 CREATE FREQUENCY TABLE
55 create two variable categorical frequency table
56 Weight statement
57 order
58 three variable frequency
59 Correlation
60 Regression
61 logistic regression
62 test stationarity
63 create a differenced time series
64 create ACF and PACF Plots
65 to calculate the ESACF and SCAN function values
66 forecast using ARIMA
67 advanced mean concepts
     67.1 Basic mean
     67.2 Selecting Analysis Variables, Analyses to be Performed by PROC MEANS, and Rounding of Results
     67.3 Selecting Other Analyses
     67.4 Step 4: Analysis with CLASS (variables)
     67.5 Step 5: Don't Miss the Missings!
     67.6 Survey means - How to Estimate a Ratio of Means using SAS
68 Mean and ratios
69 SAS QUESTIONS

1 Label variable
Label varname = 'label text';
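
A minimal sketch of LABEL inside a DATA step (the dataset and label text are illustrative):

data work.demo;
   set sashelp.class;
   label Height = 'Height in inches'
         Weight = 'Weight in pounds';
run;

proc contents data=work.demo;   /* the labels appear in the variable listing */
run;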

2 Report
Proc report data = dataset-name;
Column age weight;
prints age and weight in columns

3 Creating user defined format

To define a format for displaying one or more values:
Use the VALUE statement
Format:
VALUE format-name
range1='label1'
range2='label2'
...;
o format-name is the name of the format that is being created
o range specifies one or more variable values
o label is a text string enclosed in quotation marks, or an existing format
Non-inclusive ranges can be specified:

using the less-than symbol to avoid overlapping, as in a-<b
using the keywords LOW and HIGH to identify lower and upper limits, as in low-<b
and a-<high
using the keyword OTHER to label any values, including missing values, that are not
covered by the other ranges.
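
A minimal sketch putting the VALUE statement inside PROC FORMAT (the format name and ranges are illustrative):

proc format;
   value agegrp                   /* numeric format for age ranges */
      low -< 18  = 'Minor'
      18  -< 65  = 'Adult'
      65  - high = 'Senior'
      other      = 'Unknown';     /* catches anything not covered above, including missing */
run;

proc print data=sashelp.class;
   format Age agegrp.;            /* apply the format when printing */
run;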

4 Displaying user defined format

To display a list of all the formats in the format catalog, with the
descriptions of their values, the keyword FMTLIB can be added to the
PROC FORMAT statement.
The name, range, and label for each format is provided, as well as:
the length of the longest value
the number of values defined by the format
the version of SAS the format is compatible with
the date and time the format was created.
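
A short sketch, assuming the formats live in the default WORK.FORMATS catalog:

proc format library=work fmtlib;   /* list every format in the WORK.FORMATS catalog */
run;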

5 Random number generation


The RANUNI function returns a number that is generated from the uniform distribution on
the interval (0,1) using a prime modulus multiplicative generator with modulus 2**31 - 1 and
multiplier 397204094 (Fishman and Moore 1982).

You can use a multiplier to change the length of the interval and an added constant to move
the interval. For example,

random_variate=a*ranuni(seed)+b;
returns a number that is generated from the uniform distribution on the interval (b,a+b).

6 Define
The DEFINE statement in PROC REPORT assigns formats to variables and specifies column headings and widths.
Proc report data = dataset <options>;
Define variable / <usage> <attributes> <options> <justification> <'column-heading'>;
Run;
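
A minimal sketch of DEFINE statements against the SASHELP.CLASS sample table (the usages and headings are illustrative):

proc report data=sashelp.class nowd;
   column Name Sex Weight;
   define Name   / display width=10 'Student';
   define Sex    / display center   'Gender';
   define Weight / analysis mean format=6.1 'Mean Weight';
run;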

7 Initialize a variable with a value

By default, an accumulator variable is initialized to 0. To initialize it to a
different number, use the RETAIN statement.

RETAIN variable initial-value;

o variable is the name of the variable whose value will be retained
o initial-value is the initial numeric or character value for the variable.
Restrictions on the RETAIN statement include:
it has no effect on variables which are read with SET, MERGE, and UPDATE statements
the retained variable will be initialized to missing if no initial value is specified
the RETAIN statement is a compile-time-only statement; it creates the variable if it does
not already exist.
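
A minimal sketch of RETAIN used for a running total (the accumulator name is illustrative):

data totals;
   set sashelp.class;
   retain running_weight 0;                    /* initialize once, keep the value across iterations */
   running_weight = running_weight + Weight;   /* accumulate */
run;

proc print data=totals;
   var Name Weight running_weight;
run;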

8 Reading from text file


data Sample2;
infile 'c:\books\statistics by example\delim.txt';
length Gender $ 1;
input ID Age Gender $;
run;

The LENGTH statement tells SAS that the variable Gender is character (the
dollar sign indicates this) and that you want to store Gender in 1 byte (the 1
indicates this). The INPUT statement lists the variable names in the same order
as the values in the text file. Because you already told SAS that Gender is a
character variable, the dollar sign following the name Gender on the INPUT
statement is not necessary. If you had not included a LENGTH statement, the
dollar sign following Gender on the INPUT statement would have been necessary.
SAS assumes variables are numeric unless you tell it otherwise.

9 ODS
Output can be created in a variety of formats with ODS:
HTML output
Output data set of procedures results
Traditional SAS listing output.
When ODS statements are submitted and the output is created by the SAS program:
ODS creates the output in the form of output objects; each output object
containing:
o Data component - the results of a procedure or DATA step
o Table definition - information about how the results are rendered.
The output object is sent to specified ODS destination(s) and creates the
formatted output controlled by the destination.

ODS destinations supported include:

HTML - output formatted in HTML
Listing - output formatted like traditional SAS procedure output
Markup Language Family - output formatted in markup languages, such as XML
ODS Document - hierarchy of output objects
Output - SAS data sets
Printer Family - output formatted in PS, PDF, or PCL files
RTF - output formatted in RTF format for use with Microsoft Word.
For each type of formatted output created, an ODS statement is required to open
the destination and another ODS statement to close the destination. The default
destination is Listing.
To open and close ODS destinations:
Use ODS statement
Format:
ODS open-destination;
ODS close-destination CLOSE ;
o open-destination is the keyword and any required options for the output type to
be created
o close-destination is the keyword for the type of output.
Since the default destination is Listing, it is considered always open. Best practices
would close the listing destination at the beginning of the program. If multiple ODS
destinations are open concurrently, they can be closed at the same time using the
keyword _ALL_.
To create HTML output with a table of contents:
Use ODS HTML statement
Format:
ODS HTML
o BODY=body-file-specification;
o CONTENTS=contents-file-specification;
o FRAME=frame-file-specification;
ODS HTML CLOSE ;
body-file-specification is the name of the HTML file containing the procedure
output
contents-file-specification is the name of the HTML file containing a table of
contents with links to the procedure output
frame-file-specification is the name of the HTML file connecting the table of
contents
with the body file
A CONTENTS= option is required if FRAME= is used
A table of contents will contain a numbered heading for each procedure that
creates output

Links are generated between files by using HTML filenames specified in the ODS
HTML statement.
The URL= suboption in the BODY= or CONTENTS= file specification supplies a URL
to be used for all links created to that file. Either relative or absolute URLs can
be specified.
The PATH= option specifies a location where the HTML output is stored and helps
streamline the ODS HTML statement.
The STYLE= option can be used to change the appearance of the HTML output;
any valid SAS-supplied or user-defined style definition can be used.
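
A minimal sketch of the BODY=/CONTENTS=/FRAME= pattern described above (file names and the procedure are illustrative):

ods listing close;                          /* close the default Listing destination */
ods html body     = 'report_body.html'
         contents = 'report_toc.html'
         frame    = 'report_frame.html';    /* the frame file ties the contents to the body */

proc means data=sashelp.class n mean std;
   var Height Weight;
run;

ods html close;                             /* finish writing the HTML files */
ods listing;                                /* reopen the Listing destination */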

10 Reading from csv file

data Sample2;
infile 'c:\books\statistics by example\comma.csv' dsd;
length Gender $ 1;
input ID Age Gender $;
run;

The DSD option specifies that two consecutive commas represent a
missing value and that the default delimiter is a comma.

11 Check missing value in dataset


Proc means data = football n nmiss;
Run;

12 Standardization of variables
12.1.1 How do I standardize variables in SAS?
To standardize variables in SAS, you can use proc standard. The example shown below creates
a data file cars and then uses proc standard to standardize weight and price.
DATA cars;
INPUT mpg weight price ;
DATALINES;
22 2930 4099
17 3350 4749
22 2640 3799
20 3250 4816
15 4080 7827
;
RUN;

PROC STANDARD DATA=cars MEAN=0 STD=1 OUT=zcars;


VAR weight price ;
RUN;
PROC MEANS DATA=zcars;
RUN;

The mean=0 and std=1 options are used to tell SAS what you want the mean and standard
deviation to be for the variables named on the var statement. Of course, a mean of 0 and
standard deviation of 1 indicate that you want to standardize the variables. The out=zcars option
states that the output file with the standardized variables will be called zcars.
The proc means on zcars is used to verify that the standardization was performed properly. The
output below confirms that the variables have been properly standardized.
Variable    N           Mean       Std Dev       Minimum      Maximum
----------------------------------------------------------------------
MPG         5     19.2000000     3.1144823    15.0000000   22.0000000
WEIGHT      5   -4.44089E-17     1.0000000    -1.1262551    1.5324455
PRICE       5   -4.44089E-17     1.0000000    -0.7835850    1.7233892
----------------------------------------------------------------------

Often times you would like to have both the standardized variables and the unstandardized
variables in the same data file. The example below shows how you can do that. By making extra
copies of the variables zweight and zprice, we can standardize those variables and then have
weight and price as the unchanged values.
DATA cars2;
SET cars;
zweight = weight;
zprice = price;
RUN;
PROC STANDARD DATA=cars2 MEAN=0 STD=1 OUT=zcars;
VAR zweight zprice ;
RUN;
PROC MEANS DATA=zcars;
RUN;

As before, we use proc means to confirm that the variables are properly standardized.
Variable    N           Mean       Std Dev       Minimum      Maximum
----------------------------------------------------------------------
MPG         5     19.2000000     3.1144823    15.0000000   22.0000000
WEIGHT      5        3250.00   541.6179465       2640.00      4080.00
PRICE       5        5058.00       1606.72       3799.00      7827.00
ZWEIGHT     5   -4.44089E-17     1.0000000    -1.1262551    1.5324455
ZPRICE      5   -4.44089E-17     1.0000000    -0.7835850    1.7233892
----------------------------------------------------------------------

As we see in the output above, zweight and zprice have been standardized, and weight and
price remain unchanged.
PROC STANDARD <option(s)>;

Task                                                           Option
Specify the input data set                                     DATA=
Specify the output data set                                    OUT=
Computational options:
  Exclude observations with nonpositive weights                EXCLNPWGT
  Specify the mean value                                       MEAN=
  Replace missing values with a variable mean or MEAN= value   REPLACE
  Specify the standard deviation value                         STD=
  Specify the divisor for variance calculations                VARDEF=
Control printed output:
  Print statistics for each variable to standardize            PRINT
  Suppress all printed output                                  NOPRINT

Without Options
If you do not specify MEAN=, REPLACE, or STD=, the output data set is an identical copy
of the input data set.

Options
DATA=SAS-data-set
identifies the input SAS data set.
Main discussion: Input Data Sets
Restriction: You cannot use PROC STANDARD with an engine that supports concurrent
access if another user is updating the data set at the same time.

EXCLNPWGT
excludes observations with nonpositive weight values (zero or negative). The
procedure does not use the observation to calculate the mean and standard
deviation, but the observation is still standardized. By default, the procedure treats
observations with negative weights like those with zero weights and counts them in
the total number of observations.
Alias: EXCLNPWGTS
MEAN=mean-value
standardizes variables to a mean of mean-value.
Default: mean of the input values
Featured in: Standardizing to a Given Mean and Standard Deviation
NOPRINT
suppresses the printing of the procedure output. NOPRINT is the default value.

OUT=SAS-data-set
identifies the output data set. If SAS-data-set does not exist, PROC STANDARD
creates it. If you omit OUT=, the data set is named DATAn, where n is the smallest
integer that makes the name unique.
Default: DATAn
Featured in: Standardizing to a Given Mean and Standard Deviation

PRINT
prints the original frequency, mean, and standard deviation for each variable to
standardize.
Featured in: Standardizing BY Groups and Replacing Missing Values

REPLACE
replaces missing values with the variable mean.
Interaction: If you use MEAN=, PROC STANDARD replaces missing values with the given mean.
Featured in: Standardizing BY Groups and Replacing Missing Values
STD=std-value
standardizes variables to a standard deviation of std-value.
Default: standard deviation of the input values
Featured in: Standardizing to a Given Mean and Standard Deviation
VARDEF=divisor
specifies the divisor to use in the calculation of variances and the standard deviation.
The following table shows the possible values for divisor and the associated divisors.

Possible Values for VARDEF=

Value          Divisor
DF             degrees of freedom, n - 1 (the default)
N              number of observations, n
WDF            sum of weights minus one
WEIGHT|WGT     sum of weights

The procedure computes the variance as the corrected sum of squares divided by the
divisor. When you weight the analysis variables and use VARDEF=DF, this yields an
estimate of the variance of an observation with unit weight. When you use the WEIGHT
statement and VARDEF=WGT, the computed variance is asymptotically (for large n) an
estimate of the variance of an observation with average weight.
See also: WEIGHT, Keywords and Formulas

13 Capturing output in file

Ods csvall file = 'd:\ramesh\output\secind.csv';
Proc means data = mydata.loan_all mean;
Class Tenure;
Var default;
Run;
Ods csvall close;
Ods listing;

13.1.1 Creating an output data set
13.1.2 How to identify output objects
13.1.3 Using object label to create an output data set
13.1.4 Turn the listing output off
13.1.5 Output to an HTML file
SAS introduced the Output Delivery System (ODS) with Version 7, making output much more
flexible. We show some examples using ODS here. We are going to use the data set below for
the purpose of demonstration.
OPTIONS nocenter;
DATA hsb25;
INPUT id female race ses schtype $ prog
read write math science socst;
DATALINES;
147 1 1 3 pub 1 47 62 53 53 61
108 0 1 2 pub 2 34 33 41 36 36
18 0 3 2 pub 3 50 33 49 44 36
153 0 1 2 pub 3 39 31 40 39 51
50 0 2 2 pub 2 50 59 42 53 61
51 1 2 1 pub 2 42 36 42 31 39
102 0 1 1 pub 1 52 41 51 53 56
57 1 1 2 pub 1 71 65 72 66 56
160 1 1 2 pub 1 55 65 55 50 61
136 0 1 2 pub 1 65 59 70 63 51
88 1 1 1 pub 1 68 60 64 69 66
177 0 1 2 pri 1 55 59 62 58 51
95 0 1 1 pub 1 73 60 71 61 71
144 0 1 1 pub 2 60 65 58 61 66
139 1 1 2 pub 1 68 59 61 55 71
135 1 1 3 pub 1 63 60 65 54 66
191 1 1 1 pri 1 47 52 43 48 61
171 0 1 2 pub 1 60 54 60 55 66
22 0 3 2 pub 3 42 39 39 56 46
47 1 2 3 pub 1 47 46 49 33 41
56 0 1 2 pub 3 55 45 46 58 51
128 0 1 1 pub 1 39 33 38 47 41
36 1 2 3 pub 2 44 49 44 35 51
53 0 2 2 pub 3 34 37 46 39 31
26 1 4 1 pub 1 60 59 62 61 51
;
RUN;

Creating an output data set

Let's say we have a data set of student scores and want to conduct a paired t-test on writing score
and math score for each program type. For some reason, we want to save the t-values and
p-values to a data set for later use. Without ODS, this would not be easy to do, since proc
ttest does not have an output statement. With ODS it is only one more line of code.

We will sort the data set first by variable prog and use statement ods output Ttests=test_output
to create a temporary data set called test_output containing information of t-values and p-values
together with degrees of freedom for each t-test conducted.
proc sort data=hsb25;
by prog;
proc ttest data=hsb25;
by prog;
paired write*math;
ods output Ttests=ttest_output;
run;
proc print data=ttest_output;
run;
The SAS System

Obs   prog   Variable1   Variable2    Difference     tValue    DF     Probt
 1      1    write       math         write - math    -1.57    14    0.1389
 2      2    write       math         write - math     0.66     4    0.5475
 3      3    write       math         write - math    -2.37     4    0.0766

How to identify output objects

For each SAS procedure, SAS produces a group of ODS output objects. For example, in the
above example, Ttests is the name of such an object associated with proc ttest. In order to know
what objects are associated with a particular proc, we use the ods trace on statement right before the
proc and turn the trace off right after it. Let's look at another example using proc reg. The listing
option on ods trace on displays the information about an object along with the corresponding
output. Below we see three objects (data sets in this case) associated with proc reg when no
extra options are used. The ANOVA part of the output is stored in a data set called ANOVA. The
parameter estimates are stored in ParameterEstimates. Each object has a name, a label and a
path along with its template. Once we obtain the name or the label of the object, we can use the ods
output statement to output it to a dataset as shown in the example above.
ods trace on /listing;
proc reg data=hsb25;
model write = female math;
run;
quit;
ods trace off;
The REG Procedure
Model: MODEL1
Dependent Variable: write

Output Added:
-------------
Name:      ANOVA
Label:     Analysis of Variance
Template:  Stat.REG.ANOVA
Path:      Reg.MODEL1.Fit.write.ANOVA
-------------

                          Analysis of Variance

                                  Sum of          Mean
Source              DF           Squares        Square    F Value    Pr > F
Model                2        2154.11191    1077.05596      19.39    <.0001
Error               22        1222.04809      55.54764
Corrected Total     24        3376.16000

Output Added:
-------------
Name:      FitStatistics
Label:     Fit Statistics
Template:  Stat.REG.FitStatistics
Path:      Reg.MODEL1.Fit.write.FitStatistics
-------------

Root MSE            7.45303     R-Square    0.6380
Dependent Mean     50.44000     Adj R-Sq    0.6051
Coeff Var          14.77603

Output Added:
-------------
Name:      ParameterEstimates
Label:     Parameter Estimates
Template:  Stat.REG.ParameterEstimates
Path:      Reg.MODEL1.Fit.write.ParameterEstimates
-------------

                          Parameter Estimates

                       Parameter       Standard
Variable      DF        Estimate          Error    t Value    Pr > |t|
Intercept      1         7.07533        7.56161       0.94      0.3596
female         1         5.95697        3.07209       1.94      0.0654
math           1         0.76991        0.14323       5.38      <.0001

Using object label to create an output data set

Along with the name of an object, we also see the label for the object. We can use the label to
create a data set just as using the name.
ods output "Parameter Estimates"=parest;
proc reg data=hsb25;
model write = female math;

run;
quit;
ods output close;
proc print data=parest;
run;
Obs    Model     Dependent    Variable     DF    Estimate     StdErr    tValue     Probt
 1     MODEL1    write        Intercept     1     7.07533    7.56161      0.94    0.3596
 2     MODEL1    write        female        1     5.95697    3.07209      1.94    0.0654
 3     MODEL1    write        math          1     0.76991    0.14323      5.38    <.0001

Turn the listing output off

Since we can save our output from a proc to a dataset using ODS, we sometimes want to turn the
listing output off. We can NOT use the noprint option, since ODS requires an output object. What
we do instead is use an ODS statement, as shown in the example below. This makes sense because
listing output is just another form of ODS output. The statement ods listing close keeps the output
from appearing in the output window. After the proc reg, we turn the listing output back on so output
will appear in the output window again.
ods listing close;
ods output "Parameter Estimates"=parest;
proc reg data=hsb25;
model write = female math;
run;
quit;
ods output close;
ods listing;

Output to an HTML file

Let's say that we want to write the output of our proc reg to an HTML file. This can be done
very easily using ODS. First we specify the file name we are going to use. Then we point the ods
html output to it. At the end we close the ods html output to finish writing to the HTML file. You
can view procreg.html created by the following code.
filename myhtml "c:\examples\procreg.html";
ods html body=myhtml;
proc reg data=hsb25;
model write= female math;
run;
quit;
ods html close;

14 95% confidence interval of mean

Use the CLM option in PROC MEANS.
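
A minimal sketch using the SASHELP.CLASS sample table:

proc means data=sashelp.class mean clm alpha=0.05;   /* CLM prints the 95% confidence limits for the mean */
   var Height Weight;
run;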

15 Column input
If you have ID data in columns 1-3, Age in columns 4-6, and Gender in
column 7 of your raw data file, your input statement might look like this:
input ID $ 1-3 Age 4-6 Gender $ 7;

16 formatted input
input @1 ID $3.
      @4 Age 3.
      @7 Gender $1.;

The informat $3. tells SAS to read three columns of character data; the 3. informat says to
read three columns of numeric data; the $1. informat says to read one column of character
data. The two informats, n. and $n., are used to read n columns of numeric and character
data, respectively.

17 Formatting in proc report

Format= option assigns a format to a column:
Define revenue / format = dollar15.2;
Width= option sets the column width:
Define flight / width = 7;
Spacing= option sets the space between the selected column and the column to its left.

18 Proc report
Usage types in the DEFINE statement: display, order, group, across, analysis or computed.

Character variables are display variables by default; numeric variables are analysis variables by default.

Define flight / order 'Flight/Number' width = 6 center;

proc print data=vlib.emp1;
where lastname < 'KAP' and payrate > 30 * overtime;
run;

19 Across variable to group variable horizontally

Define flight / across;

20 Computed variable
A computed variable is not part of the input dataset; it is created in a COMPUTE block.
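
A minimal sketch of a computed column in PROC REPORT, using SASHELP.CLASS (the ratio column is illustrative):

proc report data=sashelp.class nowd;
   column Name Weight Height ratio;
   define Weight / analysis 'Weight';
   define Height / analysis 'Height';
   define ratio  / computed format=6.2 'Wt/Ht';
   compute ratio;
      /* analysis columns are referenced as variable.statistic inside a COMPUTE block */
      ratio = Weight.sum / Height.sum;
   endcomp;
run;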

21 Summary report
Define flight / group 'Flight/Number' width = 6 center;

PROC SUMMARY DATA=preteen NWAY;
CLASS sex;
VAR age height weight;
OUTPUT OUT=group_averages(DROP = _type_ _freq_)
MIN (age )=Youngest
MAX (age )=Oldest
MEAN(height)=Avg_Height
MEAN(weight)=Avg_Weight;
RUN;

NWAY suppresses the grand total and keeps only the highest-level class combinations.

22 Viewing contents of dataset


proc contents data=sashelp.air;
run;

23 printing portion of dataset


proc print data=sashelp.air(obs=10);

run;

24 Print by
Proc print data = order_finance;
Var payment_gateway payment_mode;
By payment_mode;

Prints each of the values in the VAR list for each payment_mode (the data must be sorted by payment_mode first).

The data set is not mandatory in proc print. If a data set is not given, it prints the
most recently created dataset.

25 create new library


libname <Library name> '<Library path>';

26 creating and adding data to dataset


data mydata.income;
input income expense;
datalines;

4500 2000
5000 2300
7890 2810
8900 5400
2300 2000
;
run;

27 importing csv file


PROC IMPORT OUT= MYDATA.sat_exam
DATAFILE= "C:\Users\Documents\Datasets\SAT_Exam.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;

28 importing excel file

PROC IMPORT OUT= MYDATA.sat_exam
DATAFILE= "C:\Users\Documents\Datasets\SAT_Exam.xls"
DBMS=EXCEL REPLACE;
SHEET="Sheet1$";
GETNAMES=YES;
RUN;

PROC IMPORT OUT= WORK.add_budget


DATAFILE= "C:\Users\VENKAT\Google Drive\Training\Books\Content\
Regression Analysis\Add_budget_data.xls"
DBMS=EXCEL REPLACE;
RANGE="budget$";

GETNAMES=YES;
MIXED=NO;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;

29 IMPORTING TEXT FILE


PROC IMPORT OUT= MYDATA.SAT_EXAM_data_from_text_file
DATAFILE= "C:\Users\Documents\Datasets\SAT_Exam.txt"
DBMS=TAB REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;

30 COPYING DATASET
data MYDATA.sat_exam_copy;
set MYDATA.sat_exam;
run;

31 ADDING NEW VARIABLES AND CREATING DATASET


data new_data;
set old_data;
<new var statements>;
run;

32 DROP AND KEEP VARIABLES IN NEW DATASET


data new_data;
set old_data(Keep=Var1 Var2 Var3);
<Rest of the statements>

run;

data new_data;
set old_data(Drop=Var5 Var6 Var7);
<rest of the statements>
run;

33 PRINTING NO OBSERVATION NUMBER


proc print data=market_asset(obs=10) noobs;
run;

34 SUBSTRING
New variable = SUBSTR(variable, start position, number of characters);
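
A minimal sketch (the variable names and values are illustrative):

data codes;
   length area_code $ 3;
   phone = '4155551234';
   area_code = substr(phone, 1, 3);   /* 3 characters starting at position 1 -> '415' */
run;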

35 OTHER STRING FUNCTIONS


LENGTH
TRIM
UPCASE
LOWCASE
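
A minimal sketch exercising the functions listed above (the values are illustrative):

data strings;
   name    = '  John Smith   ';
   len     = length(name);      /* length ignoring trailing blanks */
   trimmed = trim(name);        /* trailing blanks removed */
   upper   = upcase(name);      /* converted to upper case */
   lower   = lowcase(name);     /* converted to lower case */
run;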

36 DATE FUNCTION
Duration_days=INTCK('day',start_date,end_date); /* Finds the duration in days */
Duration_months=INTCK('month',start_date,end_date); /* Finds the duration in
months */
Duration_weeks=INTCK('week',start_date,end_date); /* Finds the duration in
weeks */
MONTH
YEAR

37 PRINTING VARIABLES IN DATASET


proc contents data=sashelp.stocks varnum;
run;

proc contents data=sashelp.stocks varnum short;


run;

38 SORT
proc sort data=<dataset>;
by <variable>;
run;

proc sort data=<dataset> out = <New Data set>;


by <variable>;
run;

proc sort data=MYDATA.bill out=mydata.bill_top;


by descending Bill_Amount ;
run;

proc sort data=MYDATA.bill out=mydata.bill_top100k;


by descending Bill_Amount ;
where Bill_Amount>100000;
run;

39 REMOVING DUPLICATE
proc sort data=MYDATA.bill out=mydata.bill_wod nodup;
by cust_id ;
run;

40 REMOVE DUPLICATE BASED ON KEY


proc sort data=MYDATA.bill out=mydata.bill_cust_wod nodupkey;
by Bill_Id ;
run;

41 MOVE DUPLICATES INTO NEW DATASET


proc sort data=MYDATA.bill out=mydata.bill_wod nodup
dupout=mydata.nodup_cust_id ;
by cust_id ;
run;
proc print data=mydata.nodup_cust_id;
run;

42 PLOT
proc gplot data= <data set>;
plot y*x;
run;

symbol i=none;
proc gplot data= market_asset;
plot reach*budget;
run;

proc gchart data= market_asset;

vbar category;
Run;

proc gchart data= market_asset;


pie category ;
Run;

/* 3D bar chart */
proc gchart data= market_asset;
vbar3d category ;
Run;
/* 3D pie chart */
proc gchart data= market_asset;
pie3d category ;
Run;

43 Median in proc sql


PROC SQL;
SELECT MEDIAN(a) LABEL='Median of 1'
FROM threex3
;
QUIT;

We get:

Median
  of 1
------
   1.1
     6
   7.7

You were probably expecting to see just a 6, the median of the values in column A.
Instead we have, for each row, the trivial median of a single value of A. In other words,
the processing was horizontal rather than vertical.

The explanation: vertical calculation of medians is not supported in PROC SQL (though
it is in PROC SUMMARY). Thus, there is no ambiguity. In SQL, the only valid
interpretation of MEDIAN with a single argument is that it is a SAS function call, to be
computed horizontally.
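
For the vertical (down-the-column) median, a minimal sketch with PROC MEANS, assuming the same threex3 table and column a:

proc means data=threex3 median;
   var a;      /* computes the median down the column: 6 for the values 1.1, 6, 7.7 */
run;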

44 PROC SQL
proc sql;
create table buss_fin /* This is the new dataset */
as select *
from market_asset
where Category= 'Business/Finance';
Quit;

proc sql;
select Category, sum(budget) as total_budget
from market_asset
group by Category;
Quit;

45 Proc sql case


PROC SQL;
CREATE TABLE trip_list AS
SELECT fname,
age,
sex,
CASE WHEN age=11 THEN 'Zoo'
WHEN sex='F' THEN 'Museum'
ELSE '[None]'
END
AS Trip
FROM preteen
;
QUIT;

46 MERGING TWO DATASETS

Data Students_1_2;        /* SET with two datasets concatenates (stacks) them */
set students1 students2;
run;

proc sort data=students1;


by name;
run;
proc sort data=students2;
by name;
run;
data studentmerge;
Merge students1 students2;
by name;
run;

data final;
Merge data1(in=a) data2(in=b);
by var;
if a;
run;

data final1;
Merge data1(in=a) data2(in=b);
by var;
if b;
run;

data final2;
Merge data1(in=a) data2(in=b);
by var;
if a and b;
run;

47 SAMPLING
proc surveyselect data = sashelp.prdsale
method = SRS
rep = 1
sampsize = 30 seed = 12345 out = prod_sample_30;
id _all_;
run;

48 PRINTING VERTICAL HEADING

proc print data=<<data set>> label noobs heading=vertical;
var <<variable-list>>;
by var1;
run;
Label: prints variable labels as column headings instead of variable names.
Noobs: removes the OBS column from the output.
Heading=vertical: prints the column headings vertically. This is useful
when the names are long but the values of the variable are short.
By: produces output grouped by values of the mentioned variables.

49 MEAN CALCULATION
Proc means data=online_sales mean;
var listPrice;
class brand;
run;

proc means data = online_sales nmiss kurtosis var;
run;

The CLASS statement provides the mean of listPrice for each level of brand.

libname example 'c:\books\statistics by example';

title "Descriptive Statistics for SBP and DBP";
proc means data=example.Blood_Pressure n nmiss mean std median
maxdec=3;
var SBP DBP;
run;

What if you want to see the grand mean, as well as the means broken down by Drug, all
in one listing? The PROC MEANS option PRINTALLTYPES does this for you when
you include a CLASS statement. Here is the modified program:

title "Descriptive Statistics Broken Down by Drug";
proc means data=example.Blood_Pressure n nmiss
mean std median printalltypes maxdec=3;
class Drug;
var SBP DBP;
run;

Option     Description
N          Number of nonmissing observations
NMISS      Number of observations with missing values
MEAN       Arithmetic mean
STD        Standard deviation
STDERR     Standard error
MIN        Minimum value
MAX        Maximum value
MEDIAN     Median
MAXDEC=    Maximum number of decimal places to display
CLM        95% confidence limit on the mean
CV         Coefficient of variation

50 Moving means data into output file
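
This heading has no example in the notes; a minimal sketch of sending PROC MEANS statistics to a dataset with OUTPUT OUT=, reusing the online_sales names from section 49 as assumed inputs:

proc means data=online_sales noprint;
   class brand;
   var listPrice;
   output out=brand_means        /* dataset that receives the statistics */
          mean=mean_price
          n=n_items;
run;

proc print data=brand_means;
run;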

51 Merging two data together

52 QUANTILES
Proc univariate data= online_sales;
var listPrice;
run;

title "Demonstrating PROC UNIVARIATE";

proc univariate data=example.Blood_Pressure;
id Subj;
var SBP DBP;
histogram;
probplot / normal(mu=est sigma=est);
run;

The ID statement is not necessary, but it is particularly useful with PROC UNIVARIATE.

the PROBPLOT statement requests a probability plot. This plot shows


percentiles from a theoretical distribution on the x-axis and data values on the
y-axis.
This example program selects the normal distribution using the NORMAL option
after
the forward slash. If your data values are normally distributed, the points on this
plot will
form a straight line. To make it easier to see deviations from normality, the
option
NORMAL also produces a reference line where your data values would fall if they
came
from a normal distribution. When you use the NORMAL option, you also need to
specify

a mean and standard deviation. Specify these by using the keyword MU= to
specify the
mean and the keyword SIGMA= to specify a standard deviation. The keyword
EST tells
the procedure to use the data values to estimate the mean and standard
deviation, instead
of some theoretical value.

Notice the slash between the word PROBPLOT and NORMAL. Using a slash here
follows standard SAS syntax: if you want to specify options for any statement in
a PROC
step, follow the statement keyword with a slash.

/* BOX PLOT*/
Proc univariate data= health_claim plot;
var Claim_amount ;
run;

53 TO REMOVE NA VALUES TO NUMERICAL


Data cust_cred_raw_v1;
Set cust_cred_raw;
MonthlyIncome_new= MonthlyIncome*1;
NumberOfDependents_new=NumberOfDependents*1;
run;

54 CREATE FREQUENCY TABLE


Title 'Frequency table for Serious delinquency in 2 years ';
proc freq data= cust_cred_raw_v1;

table SeriousDlqin2yrs;
run;

55 create two variable categorical frequency table


data respire;
input treat $ outcome $ count;
datalines;
placebo f 16
placebo u 48

test f 40
test u 20
;
proc freq;
weight count;
tables treat*outcome;
run;
tables treat*outcome / crosslist;

The CROSSLIST option displays crosstabulation tables in ODS
column format. CROSSLIST creates a table with a customizable table
definition. A table definition is customized using the TEMPLATE
procedure.

A LIST option will allow bulky, complex crosstabulations to be read as


a continuous list - great for n-way tables that have three or more
variables specified. It eliminates row and column frequencies and
percents
Format:
TABLES variable-1* variable-2 <*.. variable-n> /LIST;
The output of the FREQ procedures can be limited to a few specific
statistics:
NOFREQ suppresses cell frequencies
NOPERCENT suppresses cell percentages
NOROW suppresses row percentages
NOCOL suppresses column percentages

56 Weight statement
The WEIGHT statement is necessary to tell the procedure that the data are count
data, or frequency data; the variable listed in the WEIGHT statement contains the
values of the count variable.
If the data are stored in record form (one row per observation), the WEIGHT statement is not required, i.e.:
data respire;
input treat $ outcome $ @@;
datalines;
placebo f placebo f placebo f
placebo f placebo f
placebo u placebo u placebo u
placebo u placebo u placebo u
placebo u placebo u placebo u
placebo u
test f test f test f
test f test f test f
test f test f
test u test u test u
;

proc freq;
tables treat*outcome;
run;

57 order
ORDER=DATA specifies that the sort order is the same order in which the values are
encountered in the data set. Thus, since marked comes first, it is first in the sort order.
Since some is the second value for IMPROVE encountered in the data set, it is second in
the sort order. And none is third in the sort order. This is the desired sort order. The
following PROC FREQ statements produce a table displaying the sort order resulting from
the ORDER=DATA option:
proc freq order=data;
weight count;
tables treatment*improve;
run;

58 three variable frequency


proc freq order=data;
weight count;
tables sex*treatment*improve / nocol nopct;
run;

NOCOL and NOPCT options suppress the printing of column percentages and cell
percentages, respectively

59 Correlation
proc corr data=add_budget ;
var Online_Budget Responses_online ;
run;

60 Regression
/* Predicting SAT score using the rest of the four variables: General_knowledge,
Aptitude, Mathematics, and Science */


proc reg data=sat_score;
model SAT=General_knowledge Aptitude Mathematics Science;
run;

61 logistic regression
Proc logistic data=ice_cream_sales;
model buy_ind=age;
run;

62 test stationarity
proc arima data= ms;
identify var= stock_price stationarity=(DICKEY);
run;

63 create a differenced time series

data ms1;
set ms;
dif_stock_price=stock_price-lag1(stock_price);
run;

ms1 is the new data set, LAG1 gives the Y(t-1) values, and dif_stock_price is
the new differenced series.

64 create ACF and PACF Plots


proc arima data=ts15 plots=all;
identify var=Series_Values ;
run;

65 to calculate the ESACF and SCAN function values


proc arima data= TS13 plots=all;
identify var=x SCAN ESACF ;
run;

66 forecast using ARIMA


proc arima data=web_views;
Identify var=Visitors;
Estimate p=1 q=0 method=ml;
Forecast lead=7;
run;

67 advanced mean concepts


Two SAS data sets are used to generate the examples you'll see in this tutorial. An Early Adopter Release of SAS 9
Software was used to create the code and output, but everything presented in this paper is available in Release 8.0
and higher of the SAS System.
The first data set, ELEC_ANNUAL, contains about 16,300 customer-level observations (rows) with information about
how much electricity they consumed in a year, the rate schedule on which they were billed for the electricity, the total
revenue billed for that energy and the geographic region in which they live. The variables in the data set are:
PREMISE Premise Number [Unique identifier for customer meter]
TOTKWH Total Kilowatt Hours [KwH is the basic unit of electricity consumption]
TOTREV Total Revenue [Amount billed for the KwH consumed]
TOTHRS Total Hours [Total Hours Service in Calendar Year]
RATE_SCHEDULE Rate Schedule [Table of Rates for Electric Consumption Usage]
REGION Geographic Region [Area in which customer lives]
The second data set, CARD_TRANS2, contains about 1.35 million observations (rows), each representing one
(simulated) credit card transaction. The variables in the data set are:
CARDNUMBER Credit Card Number
CARDTYPE Credit Card Type [Visa, MasterCard, etc.]
CHARGE_AMOUNT Transaction Amount (in dollars/cents)
TRANS_DATE Transaction Date [SAS Date Variable]
TRANS_TYPE Transaction Type [1=Electronic 2=Manual]

67.1 Basic mean


By default, PROC MEANS will analyze all numeric variables in your data set and deliver those analyses to your
Output Window. Five default statistical measures are calculated:
N Number of observations with a non-missing value of the analysis variable
MEAN Mean (Average) of the analysis variable's non-missing values
STD Standard Deviation
MAX Largest (Maximum) Value
MIN Smallest (Minimum) Value
Using the ELEC_ANNUAL Data Set and PROC MEANS, we can see how the default actions of PROC MEANS are
carried out by submitting the following code:
* Step 1: Basics and Defaults;

PROC MEANS DATA=SUGI.ELEC_ANNUAL;


title 'SUGI 29 in Montreal';
title2 'Steps to Success with PROC MEANS';
title3 'Step 1: The Basics and Defaults';
run;
The results are displayed in the Output Window.

Since TOTKWH, TOTREV and TOTHRS are all numeric variables, PROC MEANS calculated the five default
statistical measures on them and placed the results in the Output Window.

67.2 Selecting Analysis Variables, Analyses to be Performed by PROC MEANS, and Rounding of Results

In most situations, your data sets will probably have many more numeric variables than you want PROC MEANS to
analyze. This is particularly true if some of your numeric variables don't admit of a meaningful arithmetic operation,
which is a fancy way of saying that calculating a statistic on them results in meaningless information.
For example, the sum of ZIPCODE or the MEAN of telephone number is unlikely to be useful. So, we don't want to
waste time having these values calculated or clutter up our output with meaningless information. Also, we may not
need all of the five statistical analyses that PROC MEANS will perform automatically. And, we may want to round the
values to a more useful number of decimal places than what PROC MEANS will do for us automatically.
Again using the ELEC_ANNUAL data set, here is how we can take more control over what PROC MEANS will do for
us. Suppose we just want the SUM and MEAN of TOTREV, rounded to two decimal places. The PROC
MEANS task sketched below gets us just what we want.
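
The Step 2 code was shown as a figure in the original paper and is not reproduced in these notes; a minimal sketch consistent with the description that follows:

* Step 2: Selecting Variables, Statistics and Rounding;
PROC MEANS DATA=SUGI.ELEC_ANNUAL SUM MEAN MAXDEC=2;
   VAR TOTREV;
   title3 'Step 2: Selecting Variables, Statistics and Rounding';
run;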

The important features in Step 2: First, the SUM and MEAN statistics
keywords were specified, which instructs PROC MEANS to perform just those analyses. Second, the MAXDEC=
option was used to round the results in the Output Window to just two decimal places. (If we had wanted the
analyses rounded to the nearest whole number, then MAXDEC=0 would have been specified.) Finally, the VAR
Statement was added, giving the name of the variable for which the analyses were desired. You can put as many
(numeric) variables as you need/want into one VAR Statement in your PROC MEANS task.

The results are displayed in the Output Window.

67.3 Selecting Other Analyses

So far we've worked with some of the (five) default statistical analyses available from PROC MEANS. There are many
other statistical analyses you can obtain from the procedure (the complete list of statistics keywords is in the PROC MEANS documentation).

Suppose the observations in ELEC_ANNUAL are a random sample from a larger population of utility customers. We
might therefore want to obtain, say, a 95 percent confidence interval around the mean total KwH consumption and
around the mean billed revenue, along with the mean and median. From the list of statistics keywords, you can see that the
MEAN, MEDIAN and CLM keywords will generate the desired analyses. The PROC MEANS task below
generates the desired analyses. The task also includes a LABEL Statement, which adds additional information about
the variables in the Output Window.
* Step 3: Selecting Statistics;
PROC MEANS DATA=SUGI.ELEC_ANNUAL
MEDIAN MEAN CLM MAXDEC=0;
Label TOTREV = 'Total Billed Revenue'
TOTKWH = 'Total KwH Consumption';
VAR TOTREV TOTKWH;
title3 'Step 3: Selecting Statistics';
run;
The output appears in the Output Window.

67.4 Step 4: Analysis with CLASS (variables)

So far we've analyzed the values of variables from ELEC_ANNUAL without regard to the values of potentially
interesting and useful classification variables. PROC MEANS can do this for you with a minimum of additional
coding. First, we need to understand what the CLASS and BY Statements do when included in a PROC MEANS
task. The CLASS statement does not require that the input (source) data set be sorted by the values of the
classification variables. On the other hand, using the BY Statement requires that the input data set be sorted by the
values of the classification variables.
In most situations, it does not matter if you use the CLASS or BY statement to request analyses classified by the
values of a classification variable. If you are working with a very large file, however, with many classification
variables (and/or classification variables with many distinct values), you may obtain significant processing time
reductions if you first use PROC SORT to sort the data by the values of the classification variable and then use
PROC MEANS with a BY Statement. Unfortunately, I cannot give you a magic number of observations or variables
at which it becomes more efficient to first sort and then use a BY statement versus using the CLASS statement on an
unsorted data set. Factors such as the actual number of observations, the number of unique values of the CLASS
variables, memory allocation/availability, CPU power, etc. all come into play and can't really be estimated in
advance. You'll have to use some trial and error to figure out which approach is best for your unique data structures
and computing capabilities.
Having said all of this, let's take a look at how we can obtain the MEAN and SUM of TOTREV classified by REGION
in the ELEC_ANNUAL data set. All we need to do is add the CLASS statement (with REGION as the classification
variable) to the PROC MEANS task, as shown in the sketch below.
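
The Step 4 code was also a figure in the original; a minimal sketch consistent with the surrounding description:

* Step 4: Analysis with a CLASS variable;
PROC MEANS DATA=SUGI.ELEC_ANNUAL MEAN SUM MAXDEC=2;
   CLASS REGION;
   VAR TOTREV TOTKWH;
   title3 'Step 4: Analysis with CLASS Variables';
run;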

By specifying REGION in the CLASS Statement, we now have the MEAN and SUM of TOTREV and TOTKWH for
each unique value of region. We also have a column called N Obs, which is worthy of further discussion. By
default, PROC MEANS shows the number of observations for each value of the classification variable. So, we can
see that there are, for example, 5,061 observations in the data set from the WESTERN Region.
How does PROC MEANS handle missing values of classification variables? Suppose there were some observations
in ELEC_ANNUAL with missing values for REGION. By default, those observations would not be included in the
analyses generated by PROC MEANS; but we have an option in PROC MEANS that we can use to include
observations with missing values of the classification variables in our analysis. This option is shown in Step 5.

67.5 Step 5: Don't Miss the Missings!

As we saw in Step 4, PROC MEANS automatically creates a column called N Obs when a classification variable is
placed in a CLASS or BY Statement. But, observations with a missing value are, by default, excluded (not portrayed)
in the output analysis. There are certainly many instances where it would be useful to know: a) how many
observations have a missing value for the classification variable and b) what the analyses of the analysis variables
are for observations that have a missing value for the given classification variable. We can easily obtain this
information by specifying the MISSING option in the PROC MEANS statement. Here is how to do it (sketched below):
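
The Step 5 code was likewise shown as a figure; a minimal sketch of the MISSING option as described:

* Step 5: Including missing CLASS values;
PROC MEANS DATA=SUGI.ELEC_ANNUAL MEAN SUM MAXDEC=2 MISSING;
   CLASS REGION;        /* observations with a missing REGION now get their own row */
   VAR TOTREV TOTKWH;
   title3 'Step 5: Do Not Miss the Missings';
run;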

67.6 Survey means - How to Estimate a Ratio of Means using SAS


This section describes how to use SAS to estimate a ratio of means for all adults and for males
and females separately. To illustrate this, the sum of calcium from milk is divided by the sum of
total calcium for each population group as an example.
Sorting is not a necessary first step in SAS as it is in SUDAAN. Therefore, properly weighted
estimated means and standard errors, using complex survey design factors (e.g., strata and PSU),
can be obtained with the single SAS procedure PROC SURVEYMEANS.

67.6.1.1 Use SAS to Estimate How Much Dietary Calcium Consumed by Adults, Ages 20 Years and Older, Comes from Milk

67.6.1.2 Sample Code

*-------------------------------------------------------------------------;
* Use the PROC SURVEYMEANS procedure in SAS to compute a properly weighted;
* estimated ratio of means for all persons ages 20+ and by gender.
;
*-------------------------------------------------------------------------;
* Run analysis for overall subpopulation of interest;
proc surveymeans data=DTTOT;
where usedat=1 ;
strata SDMVSTRA;
cluster SDMVPSU;
weight WTDRD1;
var D1MCALC DR1TCALC;
ratio D1MCALC / DR1TCALC;
title " Ratio of Means -- All Persons ages 20+" ;
run ;
*-------------------------------------------------------------------------;
* Use the PROC SORT procedure to sort the data by gender.
*-------------------------------------------------------------------------;

proc sort data =DTTOT;


by RIAGENDR;
run ;
* Run analysis by gender within subpopulation of interest;
proc surveymeans data=DTTOT;
where usedat= 1 ;
strata SDMVSTRA;
cluster SDMVPSU;
weight WTDRD1;
var D1MCALC DR1TCALC;
ratio D1MCALC / DR1TCALC;
by RIAGENDR;
title " Ratios of Means -- by Gender" ;
run ;

67.6.1.3 Output of Program

Ratio of Means -- All Persons ages 20+

Data Summary
Number of Strata                 15
Number of Clusters               30
Number of Observations         4448
Sum of Weights            205284669

Statistics
                                              Std Error    Lower 95%      Upper 95%
Variable   Label           N        Mean        of Mean    CL for Mean    CL for Mean
--------------------------------------------------------------------------------------
d1mcalc    Calcium (mg)   4448   101.162167     7.647887    84.861081     117.463253
DR1TCALC   Calcium (mg)   4448   880.130855    16.722099   844.488545     915.773166
--------------------------------------------------------------------------------------

Ratio Analysis
Numerator   Denominator      N       Ratio      Std Err    95% Confidence Interval
--------------------------------------------------------------------------------------
d1mcalc     DR1TCALC       4448    0.114940    0.006826    0.100390       0.129490
--------------------------------------------------------------------------------------

Ratios of Means -- by Gender
Gender - Adjudicated=male

Data Summary
Number of Strata                 15
Number of Clusters               30
Number of Observations         2135
Sum of Weights           98664010.2

Statistics
                                              Std Error    Lower 95%      Upper 95%
Variable   Label           N        Mean        of Mean    CL for Mean    CL for Mean
--------------------------------------------------------------------------------------
d1mcalc    Calcium (mg)   2135   122.142347     8.719800   103.556533     140.728162
DR1TCALC   Calcium (mg)   2135   998.359501    21.809584   951.873474    1044.845528
--------------------------------------------------------------------------------------

Ratio Analysis
Numerator   Denominator      N       Ratio      Std Err    95% Confidence Interval
--------------------------------------------------------------------------------------
d1mcalc     DR1TCALC       2135    0.122343    0.007148    0.107107       0.137579
--------------------------------------------------------------------------------------

Gender - Adjudicated=female

Data Summary
Number of Strata                 15
Number of Clusters               30
Number of Observations         2313
Sum of Weights            106620659

Statistics
                                              Std Error    Lower 95%      Upper 95%
Variable   Label           N        Mean        of Mean    CL for Mean    CL for Mean
--------------------------------------------------------------------------------------
d1mcalc    Calcium (mg)   2313    81.747649     9.880726    60.687380     102.807918
DR1TCALC   Calcium (mg)   2313   770.725113    15.292108   738.130756     803.319469
--------------------------------------------------------------------------------------

Ratio Analysis
Numerator   Denominator      N       Ratio      Std Err    95% Confidence Interval
--------------------------------------------------------------------------------------
d1mcalc     DR1TCALC       2313    0.106066    0.011329    0.081919       0.130213
--------------------------------------------------------------------------------------

Highlights from the output include:

The ratio of mean calcium from milk to total calcium, for all persons ages 20
and older, is 0.11 (with a standard error of 0.01). The corresponding values
for males and females, respectively, are 0.12 (0.01) and 0.11 (0.01).

Note that, even though this analysis did not incorporate a domain statement,
the results are exactly equal to those obtained using SUDAAN and its
SUBPOPN statement because the subgroup of interest was one for which the
weighted NHANES sample is representative.

68 Mean and ratios

Question: I need to find a ratio of two mean values that I have found using proc means.

proc means data=a;
class X Y;
var x1 x2;
run;

I get the output mean values for variables x1 and x2 in the two categories of X and Y,
but it is x1/x2 for each category that I am interested in, and doing it by hand is not really a
solution.
Answer: You need to precompute x1/x2 or postcompute x1/x2 (depending on whether you want
mean(x1/x2) or mean(x1)/mean(x2), which can have different answers if x1 and x2 have
different numbers of responses).
So either (... means fill in what you have already)
data premean;
set have;
x1x2 = x1/x2;
run;

proc means ... ;


class ... ;
var x1x2;
run;

or
proc means ...;
class ... ;
var x1 x2;
output out=postmeans mean=;
run;

data want;
set postmeans;

x1x2=x1/x2;

run;

69 SAS QUESTIONS
1. To create a raw data file:
Use the SET statement with DATA _NULL_
Format:
DATA _null_;
SET dataset;
_NULL_ allows the DATA step to be used without
creating a SAS data set

2. Result formats for SAS output can be specified as:
a. Traditional SAS listing
b. HTML document
c. Both a listing and an HTML document.

3. The appearance of the output can be controlled, specifically (for SAS listings):
a. line size (the maximum width of the log and output)
b. page size (the number of lines per printed page)
c. page numbers displayed
d. date and time displayed
4. Variable length identifies the number of bytes used to store the variable.
Length is dependent on type:
a. Character variables can be up to 32,767 bytes long
b. Numeric variables have a constant default length of 8, with
an infinite number of digits possible
c. Numeric variables have a constant length because they
are stored as floating-point numbers
d. A different length can be specified for numeric variables.
5. A data set has two parts: a descriptive portion and a data portion.

6. Descriptive Portion - Contains information about the data set, including:


a. data set name
b. data type
c. creation date and time
d. number of observations
e. number of variables
f. number of indexes

g. Contains information about the variables in the data set, including:


i. name
ii. type
iii. length
iv. format
v. informat
vi. label.
7. Drop statements
If a variable should be processed but not appear in the
new data set, use the DROP= option in the DATA
statement
If a variable should not be processed nor appear in the
new data set, use the DROP= option in the SET
statement
8. Observations in the input data set are read as they appear in the physical file,
or sequentially. Sequential reading can be bypassed using a POINT= option
9. The criteria used by one-to-one readings to select data:
a. The new data set contains all variables from all input data sets
b. If data sets have variables of the same name, the values from the last
data set overwrite the values read from previous data sets
c. The number of observations in the new data set will be the same as
the number of observations in the smallest original data set
d. Observations are combined based on their relative position, that is,
the first observations from each data set are combined, and so on.
10.Raw data can be organized in several ways:
a. arranged in columns, or fixed fields
b. arranged without columns, or free format.
11.Raw data can contain:
a. standard data without any special characters
b. nonstandard data with special characters.
12.SAS has three input styles
a. column input
b. formatted input
c. list input.
13.Informats
a. The $w. informat allows character data to be read. The dollar sign
indicates character only data. The w represents the field width, or
number of columns, of the data. The period ends the informat.
b. The w.d format allows standard numeric data to be read. The w
represents the field width, or number of columns, of the data. If a
decimal point exists with the raw data, that acts as one decimal. The
period acts as a delimiter. The optional d specifies the number of
implied decimal places (not necessary if the value already has decimal
places).
c. The COMMAw.d will read nonstandard numeric data, removing any
embedded:
i. blanks
ii. commas

iii. dashes
iv. dollar signs
v. percent signs
vi. right parentheses
vii. left parentheses.
14.Record Formats of external file define how data is read by column input and
formatted input processes - The default value of the maximum record length
is determined by the operating environment. The maximum record length can
be changed using the LRECL=option in the INFILE statement
15.A List input can read standard and nonstandard data in a free-format record
16.By default, List input does not have specified column locations, so:
a. all fields must be separated by at least one delimiter
b. fields cannot be skipped or re-read
c. the order for reading fields is from left to right
17.List input - By default several limitations exist on the type of data that can be
read using list input:
a. Character values that are longer than eight characters will be
truncated
b. Data must be in standard numeric or character format
c. Character values cannot contain embedded delimiters
d. Missing numeric and character values must be represented by a
period or some other character
18.List input - The default length of character values is 8. Variables that are
longer than 8 are truncated when written to the program data vector. Using a
LENGTH statement before the INPUT statement will define the length and
type of the variable.
19.List input missing values:
a. If missing values occur at the end of the record, the MISSOVER option
in the INFILE statement can be used to read them. The MISSOVER
option will prevent SAS from going to another record if values
cannot be found for every specified variable in the current line.
b. MISSOVER only works with missing values at the end of a record.
c. To begin to read missing values in the beginning or middle of the
record, the DSD option in the INFILE statement can be used.
d. DSD changes how delimiters are treated when using a list input:
i. by setting the default delimiter to a comma
ii. treating two consecutive delimiters as a missing value
iii. removing quotation marks from values.
e. If multiple delimiters are used in the original file, they can be identified
by using the DLM=option.
f. Modifiers allow list input to be more versatile. Two modifiers can
be used:
i. the ampersand is used to read character values that contain
embedded blanks
ii. the colon is used to read nonstandard data values and
character values longer than eight characters
20.Creating customized layout

a. Using the BY statement in conjunction with the ID and SUM statements
will show the BY variable heading only once. If the variable specified in
the ID statement is the same as the BY variable, then:
i. The OBS column is suppressed
ii. The ID/BY variable is printed in the far left column
iii. Each ID/BY value is printed at the start of each BY group and on
the same line as the group's subtotal
b. Each BY group can be printed on a separate page using the PAGEBY
statement: Format: PAGEBY BY-variable;
c. To double space the report, use the DOUBLE option in the PROC PRINT
statement.
21.Foot notes and titles - TITLE and FOOTNOTE statements are global
statements and are in place until they are modified, canceled, or the SAS
session ends. When redefining a title or footnote, all higher-numbered titles
or footnotes are canceled. A null TITLE or FOOTNOTE statement has no
number or text.
22.To temporarily assign a label or format to the data output, use the LABEL or
FORMAT statements within the PROC PRINT step. To permanently assign a
label or format to the data , u
se the LABEL or FORMAT statements within
the DATA step.
23.PROC REPORT LISTING - The appearance of the headings found in the list
report can be changed using two options:
a. HEADLINE underlines all column headings and the spaces between
them
b. HEADSKIP creates a blank line beneath all column headings or after
the underline if the HEADLINE option is used.

URLs to visit
http://www.ats.ucla.edu/stat/
http://www.okstate.edu/sas/v8/saspdf/proc/c21.pdf
