- Archit Kumar
SI–BI
BOFA
© 2008 Infosys Technologies Ltd. Strictly private and confidential.
No part of this document should be reproduced or distributed without the prior permission of Infosys Technologies Ltd.
Contents
Introduction to SAS
Introduction to SAS programs
SAS data libraries
Producing list reports – PRINT procedure
Customizing report appearance – creating HTML reports
Reading raw data files
Dropping and keeping variables
Concatenating SAS data sets
Producing summary reports
Introduction to graphics
Controlling input and output
Summarizing data
Reading and writing different types of data
Data transformation –
- Manipulating character values
- Manipulating numeric values
- Manipulating date values
Do loops in SAS
SAS arrays
Match merging two or more data sets
Using SQL queries in SAS
SAS macros
Basic efficiency techniques
Overview of SAS system
Data Processing
Process of delivering meaningful information –
- Accessing data
- Transforming data
- Managing data
- Storing and retrieving data
- Analysis
Raw data
Introduction to SAS program
/*********************Data step*****************************/
data work.staff;
infile 'raw data file';
input LastName $ 1-20 FirstName $ 21-30 JobTitle $ 36-43 Salary 54-59;
run;
/**********************************************************/
/********************Proc Step******************************/
proc print data=work.staff;
run;
proc means data=work.staff;
class JobTitle;
var Salary;
run;
/**********************************************************/
Fundamental Concepts
SAS variables
SAS - Syntax Rules
SAS – Data Libraries
A SAS data library is a collection of SAS files that SAS recognizes as a unit.
SAS data sets are one kind of SAS file stored in a library.
Types of SAS libraries –
Temporary library - When SAS is invoked, it automatically gives access to a
temporary library named WORK. Data sets created here are removed
once the SAS session ends.
Permanent library - SASUSER is a permanent SAS library that is available by default. We
can assign additional permanent SAS libraries using the LIBNAME statement.
Syntax – libname libref 'SAS data library' <options>;
Rules - 1. The libref must be 8 characters or less.
2. It must begin with a letter or underscore.
3. Remaining characters are letters, numbers or underscores.
e.g. libname Test_lib 'c:\workshop\prog1';
Once the libref is assigned, data sets can be created inside the library by
referring to them with the two-level name libref.data-set-name.
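A minimal sketch of one-level and two-level names. The libref mylib and the folder path are assumptions for illustration; SASHELP.CLASS is a sample data set that ships with SAS.

```sas
/* Assumed path: replace it with a folder that exists on your system. */
libname mylib 'c:\workshop\prog1';

/* Two-level name: the data set is stored permanently in MYLIB. */
data mylib.class_copy;
   set sashelp.class;
run;

/* One-level name: the data set goes to the temporary WORK library
   and is removed when the SAS session ends. */
data class_tmp;
   set sashelp.class;
run;
```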
PRINT Procedure
General form of print procedure
proc print data= SAS data set;
run;
The PRINT procedure prints all the variables (columns) of the data set and adds an
Obs column, which holds the observation (row) number.
Features of print procedure
1. Titles and footnotes – Discussed in subsequent slides
2. Formatted value - Discussed in subsequent slides
3. Printing selected variable –
proc print data= ia.empdata;
var empname salary jobcode;
run;
The VAR statement prints only the selected variables, in the order in which they
are listed.
4. Suppressing the observation columns – NOOBS option
proc print data= ia.empdata noobs;
run;
5. Subsetting data – The WHERE statement is used to select only some observations.
Syntax – where <condition>;
The condition is built from operands (constants or variables) and operators (comparison, logical,
special operators or functions).
e.g. – Comparison – where salary>25000;
Logical – where Jobcode='A' and Salary=25000; OR and NOT can be used similarly.
Special operator – Between – where salary between 5000 and 7000;
Contains (?) – where lastname ? 'LAM';
Example of a PROC step with a WHERE statement -
proc print data=ia.empdata;
var Jobcode Empid Salary;
where Jobcode = 'A' and Salary between 20000 and 30000;
run;
6. Column totals – Sum statement is used to get column total.
e.g. proc print data = ia.empdata;
var jobcode Salary Empid;
sum salary;
run;
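The PROC PRINT features above can be combined in one step. This is a sketch that uses the SASHELP.CLASS sample data set instead of ia.empdata, so the variable names are not the ones from the slides.

```sas
/* NOOBS suppresses the Obs column, VAR selects and orders the columns,
   WHERE subsets the rows, SUM adds a column total. */
proc print data=sashelp.class noobs;
   var name sex age height;
   where sex = 'F' and age between 12 and 14;
   sum height;
run;
```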
Special where statements
Sequencing and Grouping observations
Sort procedure – PROC SORT is used to put observations in sequence.
1. Rearranges the observations in a SAS data set.
2. Can create a new data set containing the rearranged observations.
3. Can sort on multiple variables.
4. Does not generate printed output.
5. Treats missing value as the smallest possible value.
6. Sorts in ascending order by default.
Syntax – proc sort data = input dataset out= Output dataset;
by <descending> by-variable;
run;
e.g. proc sort data= ia.empdata out=work.jobsal;
by jobcode descending salary;
run;
Grouping data and printing subtotals and grand totals – Using a BY statement with
PROC PRINT groups the output by the values of the BY
variable.
e.g. – proc print data=ia.empdata;
by jobcode;
sum salary;
run;
The above code groups the output by Jobcode; the SUM statement
prints the sum of Salary for each Jobcode group (the subtotal) and the grand total at the end.
Note - The data must be sorted or indexed by the BY variable in order to use a BY statement.
Page breaks – The PAGEBY statement is used to put each BY group on a separate page.
e.g. - proc print data=ia.empdata;
by jobcode;
pageby jobcode;
sum salary;
run;
PAGEBY must be used together with a BY statement, and only a variable that appears in the BY
statement can be used in the PAGEBY statement.
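Sorting and BY-group printing usually go together. The sketch below uses SASHELP.CLASS and an assumed output name work.class_sorted rather than the slide data.

```sas
/* Sort first: BY-group processing needs sorted (or indexed) data. */
proc sort data=sashelp.class out=work.class_sorted;
   by sex descending age;
run;

/* One group per value of SEX, each on its own page,
   with a HEIGHT subtotal per group and a grand total. */
proc print data=work.class_sorted;
   by sex;
   pageby sex;
   sum height;
run;
```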
Enhancing outputs
ID statement - The ID statement suppresses the Obs column and
places the ID variable in the leftmost column instead. The ID
statement can be used together with a BY statement. If the same
variable appears in both the ID and BY statements, the output is also
grouped by that variable.
e.g. – proc print data=ia.empdata;
id Jobcode;
by Jobcode;
pageby Jobcode;
sum Salary;
run;
The above code prints one page per Jobcode group, with Jobcode
shown in place of the Obs column; at the end of each
page the sum of the Salary values on that page is displayed.
Customizing Report Appearance
Column labels – The LABEL statement assigns descriptive labels to variables; the LABEL option in the PROC PRINT statement displays them as column headings.
e.g. – proc print data=ia.empdata label;
label lastname='Last Name'
Firstname='First Name';
run;
The SPLIT='delimiter' option, if used instead of LABEL in the PROC PRINT
statement, displays the labels and splits each one into separate lines at the
specified delimiter character.
SAS System Options – SAS system options are used to change the
appearance of a report.
1. DATE – Prints the date and time at which the SAS session began
at the top of each page.
2. NODATE – Does not print the date and time.
3. LINESIZE=width – Specifies the line size.
4. PAGESIZE=n – Specifies the number of lines per page.
5. NUMBER – Prints the page number on the first line of
each page of output.
6. NONUMBER – Does not print page numbers.
7. PAGENO=n – Specifies the starting page number.
Example – options nodate nonumber ls=72;
The OPTIONS statement is a global statement; it is not part of a DATA or PROC step.
Formatting Data Values
SAS Formats
Date Formats
SAS dates are stored as the number of days between 1st January 1960
and the specified date, so date formats are used to print dates in a
readable form. Date formats available and the values they display are (e.g.
Date = 16Oct2001) -
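A short sketch of how one stored date value looks under a few common date formats. The formats chosen here are examples only, not necessarily the ones listed on the original slide.

```sas
data _null_;
   d = '16OCT2001'd;                 /* SAS date literal */
   put 'Stored value: ' d;           /* number of days since 01JAN1960 */
   put 'DATE9.:    ' d date9.;       /* 16OCT2001 */
   put 'MMDDYY10.: ' d mmddyy10.;    /* 10/16/2001 */
   put 'WORDDATE.: ' d worddate.;    /* month name, day and year */
run;
```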
User Defined Format
Assigning labels to single character values and to ranges of character values:
proc format;
value $grade 'A'='Good'
'B'-'D'='Fair'
'F'='Poor'
other='Miscoded';
run;
Applying format
proc print data=ia.student;
format CGPA $grade.;
run;
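The same technique on the SASHELP.CLASS sample data, as a sketch. The format name $sexfmt is made up for this illustration.

```sas
/* Define a character format (character format names start with $). */
proc format;
   value $sexfmt 'F'   = 'Female'
                 'M'   = 'Male'
                 other = 'Miscoded';
run;

/* Apply it at print time; the stored values are unchanged. */
proc print data=sashelp.class;
   var name sex age;
   format sex $sexfmt.;
run;
```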
Creating HTML reports
ODS (Output Delivery System) is used to create output in a
variety of forms.
The ODS HTML statement opens, closes and manages the HTML
destination.
General form of the ODS HTML statement -
ods html file='HTML file specification';
SAS code;
ods html close;
Example –
ods html file='D:\odscode.html';
proc print data=ia.empdata;
run;
ods html close;
Reading raw data file
Input specification –
• Names the SAS variables.
• Identifies each variable as character or numeric.
• Specifies the locations of the fields in the raw data file.
• Can be specified as column, formatted, list or named input.
Formatting Input
Example
Data work.dfwlax;
Infile 'Raw data file';
Input @1 Flight $3.
@4 Date mmddyy8.
@12 Dest $3.
@15 Firstclass 3.
@18 Economy 3.;
Run;
The above code reads Flight as 3 characters starting at position 1,
Date from position 4 using the mmddyy8. informat, Dest as 3
characters starting at position 12, Firstclass as a 3-digit number
starting at position 15, and Economy as a 3-digit number
starting at position 18.
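To try formatted input without an external file, the records can be placed in-stream with DATALINES. The two data records below are invented sample values laid out to match the column positions described above.

```sas
data work.dfwlax;
   input @1  Flight     $3.
         @4  Date       mmddyy8.
         @12 Dest       $3.
         @15 FirstClass 3.
         @18 Economy    3.;
   format Date date9.;   /* display the stored date value readably */
   datalines;
43912/11/00LAX 20137
92112/11/00DFW 20131
;
run;

proc print data=work.dfwlax;
run;
```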
Reading SAS data sets
Steps for creating a SAS data set from another data set:
Use a DATA statement to start a DATA step and name the SAS data set being
created.
Use a SET statement to identify the SAS data set being read.
Use an assignment statement to create a new variable or to modify the values of
an existing variable.
Example –
data work.new_data;
set ia.dfwlax;
total = FirstClass + Economy;
run;
The above code reads all the variables and observations from ia.dfwlax and
creates a new variable named total in work.new_data.
Operators
Using SAS functions
MONTH(SAS date) – Extracts the month from a SAS date and returns a number from 1
to 12.
WEEKDAY(SAS date) – Extracts the day of the week from a SAS date and returns a
number from 1 to 7, where 1 is Sunday, 2 is Monday and so on.
Dropping and Keeping variables
DROP and KEEP statements control which variables are
written to the new data set.
General form – drop variables; / keep variables;
Example –
data test_new;
set ia.dfwlax;
drop FirstClass Economy;
Total = FirstClass + Economy;
run;
The above code creates a new data set with the Total variable but without the FirstClass and Economy
variables. Dropped variables are still available for calculations inside the DATA step; they are only left out of the output data set.
Conditional processing
Executing set of conditional statements
DO and END statements can be used to execute a group of statements conditionally.
Example –
data flightrev;
set ia.dfwlax;
total=sum(Firstclass,Economy);
if Dest='LAX' then do;
revenue=sum(2000*Firstclass,1200*Economy);
city='Dallas';
end;
else if Dest='DFW' then do;
revenue=sum(1500*Firstclass,900*Economy);
city='Los Angeles';
end;
run;
Variable Lengths
Deleting or Selecting Rows
Concatenating SAS data sets
Example –
data newhires;
set na1 na2;
run;
If the number and names of the variables are the same in na1 and na2, then
newhires will have all the variables, with the observations from na2 following the observations
from na1.
If the variable names differ, we can rename variables using the
RENAME= data set option. E.g. if na1 contains Name, Gender and Jobcode and
na2 contains Name, Gender and Jcode, then we can rename Jcode to Jobcode.
Example –
data newhires;
set na1 na2(rename=(Jcode=Jobcode));
run;
We can also interleave the data sets using a BY statement. Each input data set must already be sorted by the BY variable.
data newhires;
set na1 na2(rename=(Jcode=Jobcode));
by name;
run;
The above code produces the newhires data set ordered by name.
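A self-contained sketch of interleaving with RENAME=. The rows in na1 and na2 are invented sample values, entered already sorted by Name so that the BY statement is valid.

```sas
data na1;
   input Name $ Gender $ Jobcode $;
   datalines;
Adams M A1
Clark F B2
;
run;

data na2;
   input Name $ Gender $ Jcode $;
   datalines;
Baker F A1
Davis M C3
;
run;

/* Jcode is renamed as it is read, so both inputs share Jobcode.
   BY Name interleaves the rows: Adams, Baker, Clark, Davis. */
data newhires;
   set na1 na2(rename=(Jcode=Jobcode));
   by Name;
run;
```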
Merging Data Sets
Conditional merging
Additional Features
Summary Reports
Summary report procedures used are –
Proc Freq
The FREQ procedure displays frequency counts of the data values
in a SAS data set.
By default, it analyzes every variable in the SAS data set.
Displays each distinct data value.
Calculates the number of observations in which each data value
appears and the corresponding percentage.
Indicates for each variable how many observations have missing
values.
Example –
proc freq data=ia.dfwlax;
run;
Features of proc freq
We can limit the variables whose frequencies are reported. The TABLES statement
lists the variables to analyze. SAS creates a separate frequency table for each
variable named in the TABLES statement, separated by spaces.
Example – proc freq data=ia.dfwlax;
tables economy flight;
run;
The NLEVELS option displays the number of levels for each variable, i.e.
the number of distinct values.
The NOPRINT option suppresses the frequency tables; it is generally used
with NLEVELS when only the number of levels is required.
Example – proc freq data=ia.dfwlax nlevels;
tables _all_ / noprint;
title 'Number of levels';
run;
Formats can also be applied when displaying frequency reports.
Cross tabular frequency
A cross-tabular frequency report analyzes all possible combinations of the distinct
values of two variables.
Example – proc format;
value $codefmt
'FLTAT1' - 'FLTAT2' = 'Flight Attendant'
'PILOT1' - 'PILOT2' = 'Pilot';
value money
low - <25000 = 'Less than 25000'
25000 - 50000 = '25,000 to 50,000'
50000 <- high = 'More than 50000';
run;
proc freq data=ia.crew;
tables jobcode*salary;
format jobcode $codefmt. salary money.;
run;
The CROSSLIST option in the TABLES statement (placed after the slash, like NOPRINT) displays the crosstabulation in a list-style layout.
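A runnable variant on SASHELP.CLASS, as a sketch. The format name agegrp and its ranges are assumptions chosen for this sample data.

```sas
/* Group the numeric AGE values into three labelled ranges. */
proc format;
   value agegrp low - 12   = '12 and under'
                12 <- 14   = '13 to 14'
                14 <- high = '15 and over';
run;

/* Cross-tabulate SEX by formatted AGE; CROSSLIST prints the
   table in list style instead of the default cell layout. */
proc freq data=sashelp.class;
   tables sex*age / crosslist;
   format age agegrp.;
run;
```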
Proc Means
Proc Report
Report procedure
The default listing displays –
Each data value as it is stored in the data set.
Variable names as report column headings.
A default width for columns.
Character values left-justified.
Numeric values right-justified.
Printing selected variables –
The COLUMN statement is used to print selected variables in the order
in which they are specified.
Example –
Title 'Salary Analysis';
Proc report data=ia.crew;
Column Jobcode Location Salary;
Run;
Define statement
Group variable – The GROUP option can be used with several variables. The groups are
shown in the report in the order in which the variables are
written. ORDER cannot be combined with GROUP. Grouping also displays the sum
of numeric variables for each group; if GROUP is not used, the grand
total of the numeric values is displayed.
SUM – Displays the sum of the values.
MEAN – Displays the mean of the values.
N – Displays the number of non-missing values.
MAX – Displays the maximum value.
MIN – Displays the minimum value.
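A sketch of DEFINE usages on SASHELP.CLASS. The column headings are made up, and the RBREAK line is included only to show where a report-level summary row would go.

```sas
proc report data=sashelp.class nowd;
   column sex age height;
   define sex    / group 'Sex';                      /* one row per value */
   define age    / analysis mean 'Mean age';         /* statistic per group */
   define height / analysis sum  'Total height';
   rbreak after / summarize;                         /* summary row at the end */
run;
```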
RBREAK
Introduction to Graphics – Bar and Pie Charts
Options Contd.
SUMVAR= – Specifies the summary variable for the bar variable; the bars
show this variable instead of the frequency.
TYPE= – Used along with SUMVAR= to specify which statistic of the
summary variable is charted for each value of the bar variable, e.g. MEAN | SUM.
Example – proc gchart data=ia.crew;
vbar Jobcode / sumvar=Salary type=mean;
run;
The above code produces a vertical bar chart with Jobcode as the bar
variable; the height of each bar is the mean of Salary for that
Jobcode.
FILL= – Used with pie charts to specify whether to fill pie
slices with a solid (FILL=S) or a cross-hatched (FILL=X) pattern.
EXPLODE= – EXPLODE='value' pulls the slice for that
value out from the rest of the pie.
Producing PLOTS
GPLOT is used to plot one variable against another variable using
coordinate axis.
General Form –
Proc GPLOT data=SAS data set;
PLOT vertical variable* horizontal variable </Options>;
Run;
You can –
1. Specify the symbol to represent data.
2. Use different methods of interpolation.
3. Specify line styles, colors and thickness.
4. Draw reference lines within the axes.
5. Place one or more plot lines within the axes.
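A small sketch of the general form, assuming SAS/GRAPH is licensed. It uses SASHELP.CLASS; the SYMBOL settings and the reference-line value are arbitrary choices for illustration.

```sas
/* Plotting symbol, interpolation method and color for the first plot line. */
symbol1 value=dot interpol=join color=blue;

proc gplot data=sashelp.class;
   /* vertical variable * horizontal variable; VREF= draws a reference line */
   plot weight*height / vref=100;
run;
quit;
```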
Example
Options
Controlling Axis
Outputting Observations
A DATA step implicitly writes the contents of the PDV to the output data set at the end of each iteration. If
we write an explicit OUTPUT statement, it overrides the implicit output.
General form - OUTPUT <SAS data set1> <SAS data set2> ...;
The OUTPUT statement can be used to –
1. Create two or more SAS observations from each line of input.
2. Write observations to multiple SAS data sets.
Example –
Data forecast;
drop numemps;
set prog2.growth;
year=1;
Newtotal=Numemps *(1 + increase);
output;
year=2;
Newtotal=newtotal*(1 + increase);
output;
year=3;
Newtotal=newtotal*(1 + increase);
output;
Run;
Writing to multiple data sets
The OUTPUT statement is used to write observations to the desired data sets.
Example –
data army navy airforce;
drop type;
set prog2.military;
if type eq 'Army' then
output army;
else if type eq 'Navy' then
output navy;
else if type eq 'Air force' then
output airforce;
run;
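The same pattern on SASHELP.CLASS, as a sketch; the output data set names males and females are made up.

```sas
/* Every data set written to must be named in the DATA statement. */
data males females;
   set sashelp.class;
   if sex = 'M' then output males;
   else if sex = 'F' then output females;
run;
```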
The FIRSTOBS= and OBS= data set options control which
observations are read from an input data set.
OBS= option – set prog2.military(obs=25); reads the first
25 observations from the input data set.
FIRSTOBS= option – set prog2.military(firstobs=11 obs=25); starts
reading at the 11th observation of the input data set and stops after the
25th observation, so 15 observations are read.
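A quick check of the FIRSTOBS= and OBS= behavior, sketched on SASHELP.CLASS (19 observations); the output name work.subset is an assumption.

```sas
/* OBS= is the number of the LAST observation to read, not a count:
   this reads observations 11 through 15, i.e. 5 observations. */
data work.subset;
   set sashelp.class(firstobs=11 obs=15);
run;
```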
Writing to an external file
Data can be written to an external file using either ODS or the FILE statement.
ODS method –
ods csvall file='raw-data-file';
proc print data=prog2.maysales noobs;
format listdate
selldate date9.;
run;
ods csvall close;
FILE statement –
data _null_;
set prog2.maysales;
file 'raw-data-file';
put description
listdate : date9.;
run;
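_N_ automatic variable and the END= option - _N_ counts the iterations of the DATA step. IsLast is not an automatic variable; it is created by the END= option in the SET statement and is set to 1 when the last observation is read.
data _null_;
set prog2.maysales end=IsLast;
file 'raw-data-file';
if _N_=1 then
put 'Description' 'ListDate';
put description
listdate : date9.;
if IsLast = 1 then
put 'End of data';
run;
Specifying a delimiter – The DLM= option is used to specify the delimiter in the file.
Example – file 'raw-data-file' DLM=',';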
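A sketch that puts _N_, END= and DLM= together. It writes SASHELP.CLASS to a temporary file; the fileref outcsv and the header text are assumptions for illustration.

```sas
filename outcsv temp;   /* temporary external file managed by SAS */

data _null_;
   set sashelp.class end=IsLast;
   file outcsv dlm=',';
   if _N_ = 1 then put 'Name,Age,Height';   /* header line, written once */
   put name age height;                     /* list output, comma delimited */
   if IsLast then put 'End of data';        /* trailer after the last row */
run;
```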
Summarizing data
Creating an accumulating variable – We can use the RETAIN statement to create a
variable holding a running sum of another numeric variable.
The RETAIN statement –
1. Retains the value of the variable in the PDV across iterations of the
DATA step.
2. Initializes the retained variable to missing if no initial value is specified.
Example –
data mnthtot;
set prog2.daysales;
retain mth2dte 0;
mth2dte=mth2dte+saleamt;
run;
The above code creates a new variable mth2dte holding a running sum of saleamt.
However, if saleamt is missing in any observation, all subsequent values of mth2dte will
be missing. To avoid this we use the sum statement (variable + expression;), which retains the variable automatically, initializes it to 0 and ignores missing values.
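A minimal sketch of the sum statement with in-stream data; the three sale amounts are invented sample values, including one missing value.

```sas
data mnthtot;
   input saleamt;
   mth2dte + saleamt;   /* sum statement: retained, starts at 0, skips missing */
   datalines;
100
.
250
;
run;

/* mth2dte is 100, 100, 350: the missing amount does not break the total. */
proc print data=mnthtot;
run;
```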
Accumulating totals for a group of data
Reading delimited raw data file
Common delimiters are blanks, commas and tab characters. The default
delimiter is a blank.
To tell SAS how to read a data value, we can
specify an informat.
To specify an informat in list input, place a colon between the variable name and the informat.
The colon signals SAS to read from delimiter to delimiter.
The length of a variable can also be specified in advance using a LENGTH statement.
With a LENGTH statement, a character variable does not need the colon and informat.
Example –
data airplanes;
length ID $5;
infile 'raw data file';
input ID $
Inservice : date9.
passcap cargocap;
run;
Delimiters and missing data
The DLM= option is used to specify the delimiter in the following manner:
infile 'raw data file' dlm=':';
If you specify several characters in the DLM= option, each
of the characters is treated as a delimiter, e.g. – DLM=':!';
If a record has fewer values than the INPUT statement expects, SAS by default moves to the
next record to read the remaining values. To avoid this the MISSOVER option is used; it sets the remaining variables to missing.
infile 'raw data file' dlm=':' missover;
With column or formatted input, if a data value is shorter than the specified width,
MISSOVER sets it to missing. To read the shorter value instead, we use the
TRUNCOVER option.
infile 'raw data file' dlm=':' truncover;
Two consecutive delimiters are treated as one, so a missing value
needs a placeholder, which can be '.' for a numeric field and a blank
for a character field.
If no placeholder is present, we can use the DSD option.
Features of the DSD option –
1. Sets the default delimiter to a comma.
2. Treats two consecutive delimiters as a missing value.
3. Enables SAS to read values with embedded delimiters if the
value is surrounded by quotation marks.
Example – infile 'Raw data file' dsd;
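A sketch of DSD with in-stream records. The Model variable and both data records are invented for illustration; TRUNCOVER is added so that a short record does not pull values from the next line.

```sas
data airplanes;
   infile datalines dsd truncover;
   input ID :$5. Model :$20. PassCap CargoCap;
   datalines;
50001,"Jet, large",,255
50002,Turboprop,196,
;
run;

/* Record 1: the quoted comma stays inside Model and the empty
   field makes PassCap missing. Record 2: CargoCap is missing. */
proc print data=airplanes;
run;
```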
Controlling when a record loads
SAS loads a new record into the input buffer each time it executes an INPUT
statement.
We can also use a forward slash (/), which moves the pointer to the next line.
input Lname $20. Fname $10. /
City $10. State $20.;
This code reads Lname and Fname from the first line, then moves to the
next line and reads City and State.
#n moves the pointer to line n.
input #1 Lname $20. Fname $10.
#2 City $10. State $20.;
This reads Lname and Fname from the first line and City and State from the
second line. The cycle repeats for the 3rd and 4th records and so on until the
end of the file is reached.
An IF statement can also be used to control how a record is read, based on
the value of a field.
Example –
input salesid 5. Location $3.;
if Location='USA' then
input Saledate : mmddyy10.
Amount;
if Location='EUR' then
input Saledate : date9.
Amount : comma8.;
The intent of the above code is to read salesid and Location first and then, depending on the
value of Location, read Saledate and Amount with the matching informats.
As written, however, the second INPUT statement loads the next record, so Saledate and Amount are not read from the same line as salesid and Location.
To avoid this, we can end the first INPUT statement with a trailing @.
The trailing @ holds the raw data record in the input buffer
until –
1. An INPUT statement with no trailing @ executes, or
2. The next iteration of the DATA step begins.
input var1 var2 var3 ... @;
Reading multiple observations from one record – Multiple observations
can be read from one record if we use a double trailing @ (@@), which holds the record across iterations of the DATA step.
input var1 var2 var3 ... @@;
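A runnable sketch of the trailing @ technique using list input; the two data records are invented sample values.

```sas
data sales;
   infile datalines truncover;
   /* The trailing @ keeps the record in the input buffer ... */
   input SalesID Location :$3. @;
   /* ... so the next INPUT continues on the same record. */
   if Location = 'USA' then
      input SaleDate :mmddyy10. Amount;
   else if Location = 'EUR' then
      input SaleDate :date9. Amount :comma8.;
   format SaleDate date9.;
   datalines;
101 USA 01/20/1999 3295.50
3034 EUR 30JAN1999 1,876
;
run;
```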
Data Transformation
SAS Functions
Concatenation operator - This operator is used to concatenate two or more
strings. Either !! or || can be used.
General form – Newvar = String1 !! String2;
TRIM function – This function removes trailing blanks from a string.
General form – Newvar = TRIM(argument);
If the argument is blank, it returns a single blank. TRIM does not remove
leading blanks; for that we can use a combination of LEFT and TRIM.
Example – Fullname = trim(left(Firstname)) !! ' ' !! Lastname;
CATX function – This function concatenates character strings, removes
leading and trailing blanks, and inserts separators.
General form – CATX(separator, string-1, ..., string-n);
Similarly, CAT concatenates without removing blanks, CATS
concatenates and removes leading and trailing blanks, and CATT
concatenates and removes trailing blanks only.
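A sketch comparing the operator-based approach with CATX; the variable names and values are invented for illustration.

```sas
data _null_;
   length First Last $12;
   First = '  Ann';    /* leading blanks on purpose */
   Last  = 'Lee';
   full1 = trim(left(First)) !! ' ' !! Last;   /* manual clean-up */
   full2 = catx(' ', First, Last);             /* same result in one call */
   put full1= full2=;                          /* both show Ann Lee */
run;
```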
FIND function – This function searches for a specific substring within a string
and returns its position if found, or 0 if not found.
General form – Position = FIND(target, value <, modifiers> <, start>);
- The modifier can be I or T. I makes the search case insensitive (by default
it is case sensitive). T makes the search ignore trailing blanks.
- Start gives the starting position of the search; a positive value means a forward
search and a negative value means a backward search.
The INDEX function works like FIND except that it does not have the modifier
and start arguments.
UPCASE function – This converts all the letters in the argument to uppercase
and has no effect on digits and special characters.
General form – NewVal = UPCASE(argument);
The LOWCASE function converts the text to lowercase.
The PROPCASE function capitalizes the first letter of each word and converts the remaining letters to lowercase.
TRANWRD function – This function replaces every occurrence of a given
substring in a string with another substring.
Example – Desert = Tranwrd(Desert, 'Pumpkin', 'Apple');
This replaces Pumpkin with Apple in Desert. If the replacement makes the
result longer than the length of the variable it is assigned to, the result is truncated.
If the result is assigned to a new variable with no declared length, that variable gets a length of 200.
SUBSTR (left side) – If the SUBSTR function is used on the left side of an
assignment statement, it replaces that part of the string with the
value on the right.
General form – SUBSTR(string, start <, length>) = value;
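A sketch that exercises FIND, TRANWRD, left-side SUBSTR and UPCASE on one value; the variable name Dessert and its contents are chosen for illustration.

```sas
data _null_;
   length Dessert $20;
   Dessert = 'Pumpkin Pie';
   pos1 = find(Dessert, 'pie');        /* case sensitive: 0, not found */
   pos2 = find(Dessert, 'pie', 'i');   /* case insensitive: 9 */
   Dessert = tranwrd(Dessert, 'Pumpkin', 'Apple');   /* Apple Pie */
   substr(Dessert, 1, 1) = 'a';        /* left-side SUBSTR: apple Pie */
   up = upcase(Dessert);               /* APPLE PIE */
   put pos1= pos2= Dessert= up=;
run;
```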
Manipulating numeric values
Manipulating Date values
Creating a SAS date value – The MDY function returns a SAS date from a month, day and
year given separately.
General form - Newdate = MDY(month, day, year);
TODAY() – This function returns the current system date as a SAS date.
Extracting information – We can extract the day, month or year from a SAS date using
DAY(SAS date), MONTH(SAS date) or YEAR(SAS date) respectively. Similarly we
can use QTR and WEEKDAY.
Calculating an interval in years – The YRDIF function calculates the difference in years between
two SAS dates.
General form – Diff = YRDIF(sdate, edate, basis);
Basis can take the following values –
1. 'ACT/ACT' – Uses the actual number of days between the dates and the actual number of days in each year.
2. '30/360' – Assumes a 30-day month and a 360-day year.
3. 'ACT/360' – Takes the actual number of days and divides it by 360.
4. 'ACT/365' – Takes the actual number of days and divides it by 365.
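A sketch of the date functions above; the dates are arbitrary sample values and the variable names are made up.

```sas
data _null_;
   d   = mdy(10, 16, 2001);   /* month, day, year */
   yr  = year(d);
   mon = month(d);
   dy  = day(d);
   q   = qtr(d);
   wd  = weekday(d);          /* 1 = Sunday ... 7 = Saturday */
   /* whole years between two dates using the ACT/ACT basis */
   yrs = yrdif('01JAN1990'd, '01JAN2000'd, 'ACT/ACT');
   put d= date9. yr= mon= dy= q= wd= yrs=;
run;
```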
Converting variable type
Automatic conversions
Do loop Processing
DO WHILE loop – This is used for conditional iteration of a set of statements.
General form – DO WHILE(expression);
END;
The expression is evaluated at the top of the loop; the statements execute only while it is true, so the loop may not execute at all.
DO UNTIL loop - This is used for conditional iteration of a set of statements.
General form – DO UNTIL(expression);
END;
The expression is evaluated at the bottom of the loop, so the statements always execute at least once; the loop repeats until the expression is true.
Combining DO WHILE and DO UNTIL with an iterative DO – This method is used to
avoid an infinite loop.
DO index-variable = start TO stop <BY increment>
WHILE | UNTIL (expression);
END;
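A sketch of an iterative DO combined with UNTIL. The deposit amount, interest rate and target are invented numbers; the loop stops at the target or after 30 years, whichever comes first.

```sas
data invest;
   capital = 0;
   do year = 1 to 30 until (capital > 250000);
      capital + 5000;                /* yearly deposit */
      capital + capital * 0.045;     /* yearly interest */
   end;
   put year= capital=;               /* values when the loop ended */
run;
```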
Nested Do loops
SAS arrays
Creating variables with arrays –
Example -
data percent(drop=qtr);
set donate;
Total = sum(of qtr1-qtr4);
array contrib(4) qtr1-qtr4;
array percent(4);
do qtr=1 to 4;
percent(qtr)=contrib(qtr)/total;
end;
run;
In the above code, contrib refers to the existing variables qtr1 to qtr4, and percent is a new array whose elements (percent1 to percent4) are created by the ARRAY statement. We
can also apply a format to the array variables.
Example - var ID Percent1-Percent4;
format Percent1-Percent4 percent6.;
The PERCENTw.d format multiplies the value by 100 and adds a % sign at the end.
Assigning initial values
Example –
data compare(drop=qtr goal1-goal4);
set donate;
array contrib(4) qtr1-qtr4;
array diff(4);
array goal(4) goal1-goal4 (10,15,5,10);
do qtr=1 to 4;
diff(qtr) = contrib(qtr) - goal(qtr);
end;
run;
The above code maps the existing variables qtr1-qtr4 to contrib, assigns
initial values to the new array goal with variable names goal1 to goal4, and calculates the values of the
diff array. Initial values are retained until new values are assigned. If fewer
values than array elements are given, the remaining variables are set to missing.
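A variation as a sketch: when the goal values are needed only for the calculation, a _TEMPORARY_ array avoids creating goal1-goal4 and dropping them again. The donate rows are invented sample values, and the position of the missing quarter is an assumption.

```sas
data donate;
   input ID $ qtr1-qtr4;
   datalines;
E00224 12 33 22 .
E00367 35 48 40 30
;
run;

data compare(drop=qtr);
   set donate;
   array contrib(4) qtr1-qtr4;
   array diff(4);                              /* creates diff1-diff4 */
   array goal(4) _temporary_ (10,15,5,10);     /* no variables created */
   do qtr = 1 to 4;
      diff(qtr) = contrib(qtr) - goal(qtr);
   end;
run;
```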
Temporary arrays
Rotating SAS data set
Input Data Set
ID QTR1 QTR2 QTR3 QTR4
E00224 12 33 22
E00367 35 48 40 30
SAS Program for rotation
Conditional match merging of SAS data sets
Solution
data Newtrans
Noactiv(drop = trans amt)
Noacct(drop = branch);
merge transact(in = InTrans)
Branches(in = InBanks);
by actnum;
if InTrans and InBanks
then output Newtrans;
else if InBanks and not InTrans
then output Noactiv;
else if InTrans and not InBanks
then output Noacct;
run;
Writing SQL queries in SAS
We can use SQL queries in SAS by enclosing them between PROC SQL; and
QUIT;
When joining two data sets with an SQL query, the data sets need not
be sorted, unlike the MERGE statement with a BY statement, where the input data
sets must be sorted (or indexed) by the BY variable.
Example –
Proc SQL;
Select T.Actnum, T.Trans, T.Amt, B.Branch
from Transact T , Branches B
where T.Actnum = B.Actnum;
Quit;
No RUN command is required for an SQL query.
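A self-contained sketch of the join, written with explicit INNER JOIN syntax and CREATE TABLE so the result is stored. The rows are invented sample values, and branches is deliberately left unsorted.

```sas
data transact;
   input ActNum Trans $ Amt;
   datalines;
56891 D 126.32
57900 C 50.00
;
run;

data branches;
   input ActNum Branch $;
   datalines;
58001 South
56891 North
;
run;

proc sql;
   create table newtrans as
   select t.ActNum, t.Trans, t.Amt, b.Branch
   from transact as t inner join branches as b
        on t.ActNum = b.ActNum;   /* only account 56891 matches */
quit;
```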
SAS Macros
Macros construct input for the SAS compiler.
Functions of the SAS macro processor:
• pass symbolic values between SAS statements and steps
• establish default symbolic values
• conditionally execute SAS steps
• invoke very long, complex code in a quick, short way.
Advantages of SAS macros -
• substitute text in statements like TITLEs
• communicate across SAS steps
• establish default values
• conditionally execute SAS steps
• hide complex code that can be invoked easily.
Components of SAS macros
Macro variables:
• used to store and manipulate character strings
• follow SAS naming rules
• are NOT the same as DATA step variables
• are stored in memory in a macro symbol table.
Macro statements:
• begin with a % and a macro keyword and end with semicolon (;)
• assign values, substitute values, and change macro variables
• can branch or generate SAS statements conditionally.
Automatic macro variables
Displaying macro variables
%PUT is used to display macro variables on the log.
Example –
%PUT **** SYSDAY = &SYSDAY;
%PUT **** SYSTIME = &SYSTIME;
%PUT **** SYSDATE = &SYSDATE;
The above code prints –
**** SYSDAY = Friday
**** SYSTIME = 13:42
**** SYSDATE = 25JUL08
Example of PROC CONTENTS using a macro variable –
proc contents data=&SYSLAST;
title "contents of &SYSLAST";
run;
User defined macro variables
Macro variables can be defined by using %LET statement.
General form - %LET var_name = value;
This variable can be referenced anywhere by prefixing its name with an ampersand (&).
Example –
%LET NAME=PAYROLL;
PROC PRINT DATA=&NAME;
TITLE "PRINT OF DATASET &NAME";
RUN;
The above code substitutes PAYROLL for &NAME in the PROC PRINT
step and prints the data set.
%STR allows values that contain semicolons (;).
Example - %LET CHART=%STR(PROC CHART;VBAR EMP;RUN;);
&CHART;
Defining and Using Macros
Parameterized Macro
Example –
%MACRO CHART(NAME,BARVAR);
PROC CHART DATA=&NAME;
VBAR &BARVAR;
RUN;
%MEND;
%CHART(PAYROLL,EMP);
The above macro resolves to –
PROC CHART DATA=PAYROLL;
VBAR EMP;
RUN;
Conditional Macro
%IF and %DO can be used inside a macro to execute a set of steps conditionally.
Example –
%MACRO PTCHT(PRTCH,NAME,BARVAR);
%IF &PRTCH=YES %THEN
%DO;
PROC PRINT DATA=&NAME;
TITLE "PRINT OF DATASET &NAME";
RUN;
%END;
PROC CHART DATA=&NAME;
VBAR &BARVAR;
RUN;
%MEND;
%PTCHT(YES,PAYROLL,EMP)
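A runnable sketch of the same idea on SASHELP.CLASS. The macro name prtsum and its parameter names are made up, and PROC MEANS stands in for PROC CHART.

```sas
%macro prtsum(prt, dsn, var);
   /* %UPCASE makes the test accept yes, Yes or YES */
   %if %upcase(&prt) = YES %then %do;
      proc print data=&dsn;
         title "Print of data set &dsn";
      run;
   %end;
   /* This step is generated on every call */
   proc means data=&dsn;
      var &var;
   run;
%mend prtsum;

%prtsum(yes, sashelp.class, height)
```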
Transferring values between SAS steps
CALL SYMPUT and SYMGET transfer values between DATA step variables and macro variables,
which makes a value computed in one step available to later DATA or PROC steps.
Example –
%MACRO OBSCOUNT(NAME);
DATA _NULL_;
SET &NAME NOBS=OBSOUT;
CALL SYMPUT('MOBSOUT',OBSOUT);
STOP;
RUN;
PROC PRINT DATA=&NAME;
TITLE "DATASET &NAME CONTAINS &MOBSOUT OBSERVATIONS";
RUN;
%MEND;
%OBSCOUNT(PAYROLL);
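A variation as a sketch: CALL SYMPUTX does the same job as CALL SYMPUT but strips leading and trailing blanks from the stored value, so a number reads cleanly in a title or log message. SASHELP.CLASS stands in for PAYROLL here.

```sas
data _null_;
   /* NOBS= is filled in at compile time, so no row needs to be read */
   if 0 then set sashelp.class nobs=n;
   call symputx('mobsout', n);
   stop;
run;

%put NOTE: SASHELP.CLASS contains &mobsout observations.;
```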
Efficiency Techniques
Selecting Observations
When we want to test for different values of a variable using the IF statement, we can
choose between the IN operator or the OR operator. The examples below show that the
IN operator requires more CPU time. The difference becomes even more important when
testing huge set of records.
PROGRAM 1-A
DATA PRODUCTSALES;
SET DATA1.SALES;
WHERE PRODUCT_ID IN ('111', '142', '152', '165', '166');
RUN;
PROGRAM 1-B
DATA PRODUCTSALES;
SET DATA1.SALES;
IF PRODUCT_ID = '111' OR
PRODUCT_ID = '142' OR
PRODUCT_ID = '152' OR
PRODUCT_ID = '165' OR
PRODUCT_ID = '166';
RUN;
PROGRAM 1-C
DATA PRODUCTSALES;
SET DATA1.SALES;
WHERE PRODUCT_ID IN ('111', '142', '152', '165', '166', '411',
'412', '417', '421', '423', '519', '525',
'526', '733', '736');
RUN;
PROGRAM 1-D
DATA PRODUCTSALES;
SET DATA1.SALES;
IF PRODUCT_ID = '111' OR
PRODUCT_ID = '142' OR
PRODUCT_ID = '152' OR
PRODUCT_ID = '165' OR
PRODUCT_ID = '166' OR
PRODUCT_ID = '411' OR
PRODUCT_ID = '412' OR
PRODUCT_ID = '417' OR
PRODUCT_ID = '421' OR
PRODUCT_ID = '423' OR
PRODUCT_ID = '519' OR
PRODUCT_ID = '525' OR
PRODUCT_ID = '526' OR
PRODUCT_ID = '733' OR
PRODUCT_ID = '736';
RUN;
Comparison on the basis of time
Program number Method used and size of data CPU time elapsed
Subsetting data in a DATA step is possible through the IF statement or the WHERE
statement. Usually the WHERE statement is more efficient than the IF statement,
because the IF statement is executed on the data after it is in the Program Data Vector,
whereas the WHERE statement is executed before the data is brought into the Program
Data Vector. The following examples show this behavior.
PROGRAM 2-A
DATA CLIENT;
SET DATA1.CLIENT;
IF LAST_NAME = 'VAN BRUSSELS';
RUN;
PROGRAM 2-B
DATA CLIENT;
SET DATA1.CLIENT;
WHERE LAST_NAME = 'VAN BRUSSELS';
RUN;
109
PROGRAM 2-C
DATA CLIENT;
SET DATA1.CLIENT;
IF SUBSTR (LAST_NAME, 1, 3) = 'VAN';
RUN;

PROGRAM 2-D
DATA CLIENT;
SET DATA1.CLIENT;
WHERE SUBSTR (LAST_NAME, 1, 3) = 'VAN';
RUN;

PROGRAM 2-E
DATA CLIENT;
SET DATA1.CLIENT;
WHERE LAST_NAME LIKE 'VAN%';
RUN;
There is an exception for the WHERE statement, however. The above examples show that
using the SUBSTR function in a WHERE statement increases the CPU time considerably
compared to the corresponding IF statement. When using a typical WHERE operator
(LIKE), the same subset is created, but CPU time decreases and again gives better
performance than the subsetting IF statement.
110
Comparison on the basis of time
2-A IF 0.90
111
Reducing Observation Length
Several data manipulation functions have 'space leaks': if a LENGTH statement is not
specified to define the resulting variable, a lot of disk space might be wasted. Two
examples illustrate this behavior. In the first example the variable INITIALS
contains the output of the SUBSTR function, but the length of this variable equals the
sum of the lengths of the contributing variables. As a result, every observation in the
output table contains (length of first name + length of last name - 2) redundant blanks.
If, for example, the first name and the last name each have a length of 20, every
INITIALS value carries 38 redundant blanks.
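The first example can be sketched as follows. This is only an illustration: the table WORK.CLIENT, its two sample rows, and the output table names are hypothetical and not taken from the original slides. FIRST_NAME and LAST_NAME are assumed to be 20 characters each.

```sas
/* Hypothetical sample data: both name variables are $20 */
DATA WORK.CLIENT;
LENGTH FIRST_NAME LAST_NAME $ 20;
INPUT FIRST_NAME $ LAST_NAME $;
DATALINES;
ANNA JANSSENS
PETER SMITH
;
RUN;

/* No LENGTH statement: INITIALS inherits 20 + 20 = 40 bytes */
DATA WORK.INITIALS_WIDE;
SET WORK.CLIENT;
INITIALS = SUBSTR (FIRST_NAME, 1, 1) || SUBSTR (LAST_NAME, 1, 1);
RUN;

/* LENGTH statement placed before the assignment: INITIALS is stored in 2 bytes */
DATA WORK.INITIALS_NARROW;
SET WORK.CLIENT;
LENGTH INITIALS $ 2;
INITIALS = SUBSTR (FIRST_NAME, 1, 1) || SUBSTR (LAST_NAME, 1, 1);
RUN;

/* Compare the stored length of INITIALS in both tables */
PROC CONTENTS DATA = WORK.INITIALS_WIDE;
RUN;
PROC CONTENTS DATA = WORK.INITIALS_NARROW;
RUN;
```

The LENGTH statement has to come before the first assignment to INITIALS, because the length of a character variable is fixed the first time the variable is encountered.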
112
Some functions, like the SCAN function, create a result with a default length of 200, which was
historically the maximum length of a character variable. The following example shows the space
wasted in that case.
PROGRAM 1-C
DATA CLIENT;
SET DATA1.CLIENT;
COUNTRY = SCAN (CLIENT_ID, 1, '-');
CITY = SCAN (CLIENT_ID, 2, '-');
NUMBER = SCAN (CLIENT_ID, 3, '-');
RUN;

PROGRAM 1-D
DATA CLIENT;
SET DATA1.CLIENT;
LENGTH COUNTRY CITY $ 2
       NUMBER $ 8;
COUNTRY = SCAN (CLIENT_ID, 1, '-');
CITY = SCAN (CLIENT_ID, 2, '-');
NUMBER = SCAN (CLIENT_ID, 3, '-');
RUN;
113
Comparison on the basis of size
114
Indexing
Although an index is considered for use in a WHERE statement and not in a subsetting IF
statement, we still find several programs using an IF statement to subset a table with an
index. The gain in CPU time becomes more important as the subset returned by the index gets
smaller. In the following examples, a simple index exists on each of the variables SHOP_ID and
CUSTOMER_ID. The variable SHOP_ID has only 7 distinct values, whereas the variable
CUSTOMER_ID contains approximately 80,000 different values. Accessing the data
through the index on SHOP_ID returns about 15% of the data, resulting in only a small
difference between the WHERE statement (using the index) and the IF statement
(performing a sequential search).
115
Accessing the data through the index on CUSTOMER_ID returns less than 0.01% of the
data and is extremely fast compared to the subsetting IF statement.
PROGRAM 2-A
DATA SALES_12345;
SET DATA1.SALES_INDEXED;
IF CUSTOMER_ID = '12345';
RUN;

PROGRAM 2-B
DATA SALES_12345;
SET DATA1.SALES_INDEXED;
WHERE CUSTOMER_ID = '12345';
RUN;
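The programs above assume that the indexes already exist. As a sketch only, simple indexes could be created with the DATASETS procedure. The WORK table and its generated values below are hypothetical stand-ins for DATA1.SALES_INDEXED, and the variable AMOUNT is invented for illustration. The MSGLEVEL = I option asks SAS to write a note to the log when an index is used.

```sas
/* Hypothetical stand-in table: 7 shops, 1000 distinct customers */
DATA WORK.SALES_INDEXED;
LENGTH SHOP_ID $ 2 CUSTOMER_ID $ 5;
DO I = 1 TO 1000;
   SHOP_ID = PUT (MOD (I, 7) + 1, Z2.);
   CUSTOMER_ID = PUT (12000 + I, Z5.);
   AMOUNT = I;
   OUTPUT;
END;
DROP I;
RUN;

/* Create one simple index per variable */
PROC DATASETS LIBRARY = WORK NOLIST;
MODIFY SALES_INDEXED;
INDEX CREATE SHOP_ID;
INDEX CREATE CUSTOMER_ID;
QUIT;

/* Write index-usage notes to the log */
OPTIONS MSGLEVEL = I;

/* WHERE lets SAS consider the index; a subsetting IF would not */
DATA WORK.SALES_12345;
SET WORK.SALES_INDEXED;
WHERE CUSTOMER_ID = '12345';
RUN;
```

Whether SAS actually uses an index for a given WHERE condition is decided at run time, so the log note is the place to check.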
116
Comparison on the basis of time
117
Compressing
118
Comparison on the basis of time
119
Subsetting external files
The INPUT statement, which structures the input buffer's content into variables in the
Program Data Vector, consumes considerable CPU time. If you only need to process
a subset of the external file, examine only part of the input buffer first and, if this part
meets your subsetting condition, read the rest of the input buffer. The trailing @
in the INPUT statement holds the contents of the input buffer.
PROGRAM 1-A
DATA CLIENT;
INFILE CLIENT;
INPUT CLIENT_ID $ 1 - 14
      LAST_NAME $ 16 - 35
      FIRST_NAME $ 37 - 56
      HOME_CITY $ 58 - 77
      HOME_COUNTRY $ 79 - 93
      …;
RUN;
DATA CLIENT_LONDON;
SET CLIENT;
IF HOME_CITY = 'LONDON';
RUN;
120
PROGRAM 1-B
DATA CLIENT_LONDON;
INFILE CLIENT;
INPUT CLIENT_ID $ 1 - 14
      LAST_NAME $ 16 - 35
      FIRST_NAME $ 37 - 56
      HOME_CITY $ 58 - 77
      HOME_COUNTRY $ 79 - 93
      …;
IF HOME_CITY = 'LONDON';
RUN;

PROGRAM 1-C
DATA CLIENT_LONDON;
INFILE CLIENT;
INPUT HOME_CITY $ 58 - 77 @;
IF HOME_CITY = 'LONDON';
INPUT CLIENT_ID $ 1 - 14
      LAST_NAME $ 16 - 35
      FIRST_NAME $ 37 - 56
      HOME_COUNTRY $ 79 - 93
      …;
RUN;
121
Comparison on the basis of time
122
EFFICIENTLY COMBINING DATA - CONCATENATING SAS DATA SETS
Many users are familiar with the APPEND procedure for adding a new table directly to a
master table, without reading or rewriting the master table. Still, they rarely code the APPEND
procedure, because they are used to typing the DATA step, which is quick to code. In the next
example the traditional DATA step concatenation is compared with the APPEND procedure and
with the OUTER UNION CORR operator in the SQL procedure. The result can also be created using
the SQL INSERT statement to add all observations of the second table to the end of the master table.
PROGRAM 1-A
DATA SALES;
SET SALES DATA1.SALES2003;
RUN;

PROGRAM 1-B
PROC APPEND BASE = SALES
            DATA = DATA1.SALES2003;
RUN;

PROGRAM 1-C
PROC SQL;
INSERT INTO SALES
SELECT * FROM DATA1.SALES2003;
QUIT;

PROGRAM 1-D
PROC SQL;
CREATE TABLE SALES AS
SELECT *
FROM SALES
OUTER UNION CORR
SELECT *
FROM DATA1.SALES2003;
QUIT;
123
Comparison on the basis of time
124
Interleaving dataset
You can concatenate two sorted input SAS data sets into a sorted result in several ways.
The following example compares the traditional DATA step followed by a SORT
procedure with a BY statement specified directly in the DATA step and with the
OUTER UNION CORR operator with an ORDER BY clause in the SQL procedure. As
expected, the SQL procedure requires more CPU time than the DATA step.
PROGRAM 1-A
DATA SALES;
SET DATA1.SALES_B DATA1.SALES_NL;
RUN;
PROC SORT DATA = SALES;
BY SALES_DATE;
RUN;

PROGRAM 1-B
DATA SALES;
SET DATA1.SALES_B DATA1.SALES_NL;
BY SALES_DATE;
RUN;

PROGRAM 1-C
PROC SQL;
CREATE TABLE SALES AS
SELECT *
FROM DATA1.SALES_B
OUTER UNION CORR
SELECT *
FROM DATA1.SALES_NL
ORDER BY SALES_DATE;
QUIT;
125
Comparison on the basis of time
126