
Base SAS programming skills

- Archit Kumar
SI–BI
BOFA
© 2008 Infosys Technologies Ltd. Strictly private and confidential.
No part of this document should be reproduced or distributed without the prior permission of Infosys Technologies Ltd.

1
Contents

• Introduction to SAS
• Introduction to SAS programs
• SAS data libraries
• Producing list reports – PRINT procedure
• Customizing report appearance – creating HTML reports
• Reading raw data files
• Dropping and keeping variables
• Concatenating SAS data sets
• Producing summary reports
• Introduction to graphics
• Controlling input and output
• Summarizing data
• Reading and writing different types of data
• Data transformation –
- Manipulating character values
- Manipulating numeric values
- Manipulating date values
• DO loops in SAS
• SAS arrays
• Match merging two or more data sets
• Using SQL queries in SAS
• SAS macros
• Basic efficiency techniques

3
Overview of SAS system

The functionality of the SAS system is built around four data-driven tasks:
1. Data access – addresses the data required by the application.
2. Data management – shapes data into the form required by the application.
3. Data analysis – summarizes, reduces, or otherwise transforms raw data into meaningful and useful information.
4. Data presentation – communicates information in ways that clearly demonstrate its significance.

4
Data Processing
Process of delivering meaningful information –
-Accessing data
- Transforming data
- Managing data
- Storing and retrieving data
- Analysis
Flow: Raw data or SAS data set -> DATA step -> SAS data set -> PROC step -> Report

5
Introduction to SAS program
/*********************Data step*****************************/
data work.staff;
infile 'raw data file';
input LastName $ 1-20 FirstName $ 21-30 JobTitle $ 36-43 Salary 54-59;
run;
/**********************************************************/
/********************Proc Step******************************/
proc print data=work.staff;
run;
proc means data=work.staff;
class JobTitle;
var Salary;
run;
/**********************************************************/

6
Fundamental Concepts

SAS data sets


Descriptor portion
proc contents data= SAS data set;
run;
PROC CONTENTS displays the following information about the data set:
• General information such as the data set name, number of observations and number of variables.
• Variable attributes such as name, type, length, position, informat and format.
Data portion
proc print data= SAS data set;
run;
The data portion holds the data values in tabular form: variables (columns) correspond to fields, and observations (rows) correspond to data lines.

7
SAS variables

There are two types of variables:
• Character – contains any value: letters, numbers, special characters and blanks. Character values range in length from 1 to 32,767 characters.
• Numeric – stored as floating point numbers in 8 bytes of storage by default. Eight bytes of floating point storage provide room for roughly 15 to 16 significant digits.
SAS variable names:
• Can be up to 32 characters long.
• Can be uppercase, lowercase or mixed case.
• Must start with a letter or underscore. Subsequent characters can be letters, underscores or digits.
Date values – SAS date values are stored as numeric values: the number of days between January 1, 1960 and the given date.
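A minimal sketch of how a SAS date is just a number, assuming nothing beyond base SAS. The variable names d0 and d are made up for illustration; the output goes to the SAS log.

```sas
data _null_;
  d0 = '01JAN1960'd;     /* date constant for the reference date */
  d  = '16OCT2001'd;     /* date used elsewhere in these slides  */
  put d0=;               /* unformatted: prints d0=0             */
  put d=;                /* unformatted: days since 01JAN1960    */
  put d= date9.;         /* same value shown through a format    */
  diff = d - d0;         /* dates can be subtracted like numbers */
  put diff=;
run;
```

Because the stored value is numeric, date arithmetic needs no special functions; formats only change how the value is displayed.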

8
SAS - Syntax Rules

SAS statements:
• Usually begin with an identifying keyword.
• Always end with a semicolon.
• Are free format: they can begin and end in any column.
• Can span multiple lines; several statements can also share one line.
Comments:
• A block comment begins with /* and ends with */ and can span multiple lines.
• A comment statement begins with an asterisk and ends with a semicolon.

9
SAS – Data Libraries

A SAS data library is a collection of SAS files (here, SAS data sets) that SAS recognizes as a unit.
Types of SAS libraries –
• Temporary library – when SAS is invoked, it automatically provides a temporary library named WORK. Data sets created there are deleted when the SAS session ends.
• Permanent library – SASUSER is a permanent library that SAS provides. Additional permanent libraries are assigned with the LIBNAME statement.
Syntax – libname libref 'SAS data library' <options>;
Rules for the libref –
1. Must be 8 characters or fewer.
2. Must begin with a letter or underscore.
3. Remaining characters are letters, numbers or underscores.
e.g. libname Test_lib 'c:\workshop\prog1';
Once the libref is assigned, data sets in the library are referenced with a two-level name: libref.data-set-name.
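A short sketch of one-level versus two-level names. It assumes the folder from the slide exists and is writable; the libref testlib and the data set name demo are hypothetical.

```sas
/* Assign a permanent library (path taken from the slide; it must already exist) */
libname testlib 'c:\workshop\prog1';

/* Two-level name: the data set is stored permanently in the testlib library */
data testlib.demo;
  x = 1;
run;

/* One-level name: equivalent to work.demo, deleted when the session ends */
data demo;
  x = 2;
run;

/* List the descriptor portion of the permanent copy */
proc contents data=testlib.demo;
run;
```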

10
PRINT Procedure
 General form of print procedure
proc print data= SAS data set;
run;
PROC PRINT lists every variable of the data set and adds an Obs column showing the observation (row) number.
 Features of print procedure
1. Titles and footnotes – Discussed in subsequent slides
2. Formatted value - Discussed in subsequent slides
3. Printing selected variable –
proc print data= ia.empdata;
var empname salary jobcode;
run;
This statement prints the selected variables only in the order in which they
are written.

11
4. Suppressing the observation columns – NOOBS option
proc print data= ia.empdata noobs;
run;
5. Subsetting data – the WHERE statement selects only the observations that meet a condition.
Syntax – where <condition>;
The condition is built from operands (constants or variables) and operators (comparison, logical and special operators, or functions).
e.g. – Comparison – where Salary > 25000;
Logical – where Jobcode = 'A' and Salary = 25000; OR and NOT can be used in the same way.
Special operator – BETWEEN – where Salary between 5000 and 7000;
CONTAINS (?) – where LastName ? 'LAM';
Example of a PROC step with a WHERE statement –
proc print data= ia.empdata;
var Jobcode Empid Salary;
where Jobcode = 'A' and Salary between 20000 and 30000;
run;
6. Column totals – Sum statement is used to get column total.
e.g. proc print data = ia.empdata;
var jobcode Salary Empid;
sum salary;
run;

12
Special where statements

Additional special operators supported by the WHERE statement are –
• LIKE – selects observations by comparing character values to a specified pattern. A percent sign (%) matches any number of characters; an underscore (_) matches exactly one character.
e.g. – where code like 'E_U%';
This selects code values that begin with E, followed by any single character, followed by U, followed by any number of characters.
• Sounds like – the sounds-like (=*) operator selects observations that contain spelling variations of the specified word.
e.g. – where name =* 'SMITH';
Selects names such as SMYTHE and SMITT.
• IS NULL or IS MISSING – selects observations in which the value of the variable is missing.
e.g. – where flight is missing;
where flight is null;
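The special operators can be tried on a small in-line data set. This is a sketch only: work.demo and its three rows are invented for illustration and are not part of the course data.

```sas
/* Hypothetical sample rows; a lone period is read as a missing character value */
data work.demo;
  input code $ name $ flight $;
  datalines;
E1U22 SMITH IA101
E2X33 SMYTHE .
A9U44 JONES IA205
;
run;

/* LIKE: E, any one character, U, then anything (matches E1U22 only) */
proc print data=work.demo;
  where code like 'E_U%';
run;

/* Sounds-like: spelling variations of SMITH */
proc print data=work.demo;
  where name =* 'SMITH';
run;

/* IS MISSING: rows with no flight value (the SMYTHE row) */
proc print data=work.demo;
  where flight is missing;
run;
```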

13
Sequencing and Grouping observations
• SORT procedure – used to sequence the observations.
1. Rearranges the observations in a SAS data set.
2. Can create a new data set containing the rearranged observations (OUT= option); without OUT=, the input data set is replaced.
3. Can sort on multiple variables.
4. Does not generate printed output.
5. Treats a missing value as the smallest possible value.
6. Sorts in ascending order by default.
• Syntax – proc sort data= input-data-set out= output-data-set;
by <descending> by-variable(s);
run;
e.g. proc sort data= ia.empdata out=work.jobsal;
by jobcode descending salary;
run;

14
 Grouping data and Printing Subtotals and Grand totals – Using a by clause with
proc print procedure groups the data according to the different values of that
variable.
e.g. – proc print data=ia.empdata;
by jobcode;
sum salary;
run;
The above code groups the data by Jobcode, and the SUM statement prints the Salary total for each Jobcode group (the subtotal) as well as the grand total.
Note – the data must be sorted (or indexed) by the BY variable before a BY statement can be used.
 Page Breaks – PAGEBY statement is used to put each subgroup on a separate page.
e.g. - proc print data=ia.empdata;
by jobcode;
pageby jobcode;
sum salary;
run;
Pageby must be used along with a by clause and the variable appearing in the by
clause only can be used in the pageby clause.

15
Enhancing outputs
• ID statement – suppresses the Obs column and places the ID variable in its place as the leftmost column. When the same variable appears in both the ID and BY statements, the report is also grouped by that variable.
e.g. – proc print data=ia.empdata;
id Jobcode;
by Jobcode;
pageby Jobcode;
sum Salary;
run;
The above code prints each Jobcode group on its own page, with Jobcode in place of the Obs column and the Salary subtotal for the group at the end of each page.

16
Customizing Report Appearance

Titles and Footnotes –
1. Titles appear at the top of the page.
2. The default SAS title is The SAS System.
3. The null TITLE statement, title;, cancels all titles.
4. Footnotes appear at the bottom of the page.
5. No footnote appears unless one is specified.
6. The null FOOTNOTE statement, footnote;, cancels all footnotes.
7. Multiple title or footnote lines are defined by numbering the statements title1, title2, ..., titlen (likewise footnote1 ... footnoten), where n can be at most 10. E.g. title1 'First Line'; title2 'Second Line';
8. A new TITLEn statement replaces the existing line n and cancels all higher-numbered title lines; FOOTNOTEn behaves the same way.
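The replace-and-cancel behaviour of numbered titles can be seen with a small sketch. It uses sashelp.class, a sample data set shipped with SAS, purely as a convenient stand-in for the course data.

```sas
title1 'First Line';
title2 'Second Line';
footnote 'Internal use only';

proc print data=sashelp.class(obs=3);   /* shows both title lines and the footnote */
run;

title1 'New First Line';                /* replaces line 1 and cancels title2 */

proc print data=sashelp.class(obs=3);   /* shows only the new first line */
run;

title;                                  /* cancel all titles    */
footnote;                               /* cancel all footnotes */
```

Titles and footnotes are global, so they stay in effect for later steps until they are changed or cancelled.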

17
 Column Labels – This assigns labels to different fields.
e.g. – proc print data=ia.empdata label;
label lastname=‘Last Name’
Firstname=‘First Name’;
run;
Using the split=' ' option in place of label in the PROC PRINT statement also displays the labels and breaks each label onto a new line wherever the specified split character (here a blank) occurs.
 SAS System Options – SAS options are used to change the
appearance of report.

18
1. Date – specifies to print the date and time at which SAS session began
at the top of each page.
2. Nodate – Specifies not to print the date and time.
3. Linesize =width – Specifies the line size.
4. Pagesize=n - Specifies the number of lines per page.
5. Number – Specifies that page number be printed on the first line of
each page output.
6. Nonumber – specifies page number not to be printed.
7. Pageno=n – Specifies the page number at which numbering begins.
Example – options nodate nonumber ls=72;
OPTIONS is a global statement: it does not belong to any DATA or PROC step and is normally written outside them.

19
Formatting Data Values

 To apply a format to a specific SAS variable, use the format statement.


 General form of format statement –
FORMAT variable name format;
 Example –
proc print data=ia.empdata;
format Salary dollar11.2;
run;
The above code will print the data with salary values formatted,
preceded by a dollar sign, with commas, having a total length 11 and 2
decimal places.

20
SAS Formats

Format       Description                  Example      Meaning
w.d          Standard numeric format      8.2          Width=8, 2 decimal places
$w.          Standard character format    $5.          Width=5
COMMAw.d     Commas in a number           comma9.2     Width=9, 2 decimal places
DOLLARw.d    Dollar sign and commas       dollar10.2   Width=10, 2 decimal places

21
Date Formats

SAS dates are stored as the number of days between January 1, 1960 and the specified date, so date formats are used to print dates in a readable form. Some of the available date formats and the values they display (for the date October 16, 2001) –

Format       Displayed value
MMDDYY6.     101601
MMDDYY8.     10/16/01
MMDDYY10.    10/16/2001
DATE7.       16OCT01
DATE9.       16OCT2001
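The table can be reproduced in the SAS log with a short DATA step. This is a sketch; the variable name d is arbitrary.

```sas
data _null_;
  d = '16OCT2001'd;          /* date constant for October 16, 2001 */
  put d mmddyy6.             /* 101601     */
    / d mmddyy8.             /* 10/16/01   */
    / d mmddyy10.            /* 10/16/2001 */
    / d date7.               /* 16OCT01    */
    / d date9.;              /* 16OCT2001  */
run;
```

The slash in the PUT statement starts a new output line, so each format appears on its own line of the log.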

22
User Defined Format

• The FORMAT procedure is used to define custom formats.
• General form of PROC FORMAT –
proc format;
value format-name range1='label1'
range2='label2'
...;
run;
• Example –
proc format;
value gender 1='Female'
2='Male'
other='Miscoded';
run;
The above code defines a format named gender that displays the value 1 as Female, 2 as Male, and any other value as Miscoded.

23
• Assigning labels to character values and ranges of character values –
proc format;
value $grade 'A'='Good'
'B' - 'D'='Fair'
'F' = 'Poor'
other= 'Miscoded';
run;
• Applying the format –
proc print data=ia.student;
format CGPA $grade.;
run;

24
Creating HTML reports
• ODS (Output Delivery System) is used to create output in a variety of forms.
• The ODS HTML statement opens, closes and manages the HTML destination.
• General form –
ods html file='HTML file specification';
SAS code;
ods html close;
• Example –
ods html file='D:\odscode.html';
proc print data=ia.empdata;
run;
ods html close;

25
Reading raw data file

Steps for creating a SAS data set
• Start a DATA step and name the SAS data set being created (DATA statement).
DATA libref.SAS-data-set;
e.g. – data work.dwflax;
• Identify the location of the raw data file to read (INFILE statement).
INFILE 'filename';
e.g. – infile 'C:\workshop\dwflax.txt';
• Describe how to read the data fields from the raw data file (INPUT statement).
INPUT input-specifications;

26
Input specification –
• Names the SAS variables.
• Identifies each variable as character or numeric.
• Specifies the locations of the fields in the raw data file.
• Can be written as column, formatted, list or named input.

Example (column input) –
data work.dwflax;
infile 'C:\workshop\dwflax.txt';
input Flight $ 1-3 Date $ 4-11
Dest $ 12-14 FirstClass 15-17;
run;

27
Formatting Input

Formatted input reads data values by –
• Moving the input pointer to the starting position of the field.
• Specifying a variable name.
• Specifying an informat.
Pointer controls –
• @n : moves the pointer to column n.
• +n : moves the pointer forward n positions.
An informat is written in the following form –
<$>informat-name w.<d>
Here $ indicates a character informat, w is the total width of the field, the period (.) is a required delimiter, and d is the optional number of decimal places for numeric values.

28
Example

Data work.dfwlax;
Infile 'Raw data file';
Input @1 Flight $3.
@4 Date mmddyy8.
@12 Dest $3.
@15 Firstclass 3.
@18 Economy 3.;
Run;
The above code reads Flight as 3 characters starting at column 1, Date starting at column 4 with the mmddyy8. informat, Dest as 3 characters starting at column 12, Firstclass as a 3-digit number starting at column 15, and Economy as a 3-digit number starting at column 18.
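To try formatted input without an external file, the records can be supplied in-line with DATALINES. This is a sketch: the data set name and the two records are invented and only follow the column layout described above.

```sas
data work.dfwlax_demo;
  input @1  Flight     $3.        /* columns 1-3                      */
        @4  Date       mmddyy8.   /* columns 4-11, read as a SAS date */
        @12 Dest       $3.        /* columns 12-14                    */
        @15 FirstClass 3.         /* columns 15-17                    */
        @18 Economy    3.;        /* columns 18-20                    */
  format Date date9.;             /* display the stored date readably */
  datalines;
43912/11/00LAX 20137
92112/11/00DFW 20131
;
run;

proc print data=work.dfwlax_demo;
run;
```

Without the FORMAT statement, Date would print as a plain number of days, since the informat only controls how the value is read.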

29
Reading SAS data sets

Steps for creating a SAS data set from another SAS data set –
• DATA statement to start a DATA step and name the SAS data set being created.
• SET statement to identify the SAS data set being read.
Use an assignment statement to create a new variable or to modify the values of an existing variable.
Example –
data work.new_data;
set ia.dwflax;
total = FirstClass + Economy;
run;
The above code reads all the variables and observations from dwflax and adds a new variable named total to new_data.

30
Operators

Operator   Action            Example             Priority
+          Addition          Sum = x + y         III
-          Subtraction       Diff = x - y        III
*          Multiplication    Mul = x * y         II
/          Division          Div = x / y         II
**         Exponentiation    Raise = x ** y      I
-          Negative prefix   Negative = -x       I

Priority I operations are performed first, then II, then III. Operations of priority I are evaluated right to left; those of priority II and III are evaluated left to right.
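A small sketch of the priority rules; the variable names a, b and c are arbitrary and the results are written to the log.

```sas
data _null_;
  a = 2 + 3 * 4;        /* multiplication (II) before addition (III): 14 */
  b = (2 + 3) * 4;      /* parentheses override the priorities: 20       */
  c = 2 ** 3 ** 2;      /* priority I runs right to left: 2 ** 9 = 512   */
  put a= b= c=;
run;
```

When in doubt, parentheses make the intended order explicit and cost nothing.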

31
Using SAS functions

• SUM function – calculates the sum of its arguments.
e.g. – Total = sum(FirstClass,Economy);
The SUM function ignores missing arguments and adds the non-missing ones, whereas the + operator returns a missing value if any operand is missing.
• TODAY() – obtains the current date from the system clock as a SAS date value.
• MDY(month,day,year) – takes numeric month, day and year values and returns the corresponding SAS date value.
• YEAR(SAS date) – extracts the year from a SAS date and returns a four-digit value.
• QTR(SAS date) – extracts the quarter from a SAS date and returns a number from 1 to 4.

32
• MONTH(SAS date) – extracts the month from a SAS date and returns a number from 1 to 12.
• WEEKDAY(SAS date) – extracts the day of the week from a SAS date and returns a number from 1 to 7, where 1 is Sunday, 2 is Monday and so on.
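The functions above can be exercised in one DATA step. This is a sketch with made-up values; results are written to the log.

```sas
data _null_;
  FirstClass = 10;
  Economy    = .;                      /* missing value                        */
  t1 = FirstClass + Economy;           /* + operator: result is missing        */
  t2 = sum(FirstClass, Economy);       /* SUM ignores the missing argument: 10 */

  d = mdy(10, 16, 2001);               /* build a SAS date from month, day, year */
  y = year(d);                         /* 2001        */
  q = qtr(d);                          /* 4           */
  m = month(d);                        /* 10          */
  w = weekday(d);                      /* 3 (Tuesday) */
  now = today();                       /* current date from the system clock */

  put t1= t2= y= q= m= w=;
  put now= date9.;
run;
```

The addition with a missing operand also writes a note to the log that missing values were generated.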

33
Dropping and Keeping variables

• The DROP and KEEP statements control which variables are written to the new data set.
• General form – DROP variables; or KEEP variables;
• Example –
data test_new;
set ia.dwflax;
drop FirstClass Economy;
Total = FirstClass + Economy;
run;
The above code creates a new data set that contains Total but not FirstClass and Economy. The dropped variables are still available for calculations inside the DATA step; DROP only affects what is written to the output data set.

34
Conditional processing

• IF-THEN / ELSE statements are used to process observations conditionally and to select a subset of observations.
• Example –
data flightrev;
set ia.dwflax;
total=sum(Firstclass,Economy);
if Dest='LAX' then
revenue=sum(2000*Firstclass,1200*Economy);
else if Dest='DFW' then
revenue=sum(1500*Firstclass,900*Economy);
run;

35
Executing a set of conditional statements
• DO and END statements are used to execute a group of statements when a condition is true.
Example –
data flightrev;
set ia.dwflax;
total=sum(Firstclass,Economy);
if Dest='DFW' then do;
revenue=sum(1500*Firstclass,900*Economy);
city='Dallas';
end;
else if Dest='LAX' then do;
revenue=sum(2000*Firstclass,1200*Economy);
city='Los Angeles';
end;
run;

36
Variable Lengths

• The length of a variable is fixed at compile time, the first time the variable is encountered in the DATA step. To control it, specify the length of the variable before its first assignment.
e.g. – In the previous example, the first value assigned to city is Dallas, so the length of city is set to 6 and Los Angeles is truncated to Los An.
To avoid this, declare the length of city before the IF condition:
length city $ 11;
The $ indicates a character variable.
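The truncation and its fix can be compared side by side. This is a sketch: the data set names and the two in-line records are hypothetical.

```sas
/* Without LENGTH: city takes length 6 from 'Dallas', the first assignment compiled */
data work.city_short;
  input Dest $;
  if Dest='DFW' then city='Dallas';
  else if Dest='LAX' then city='Los Angeles';   /* stored as 'Los An' */
  datalines;
DFW
LAX
;
run;

/* With LENGTH declared first: the full value fits */
data work.city_full;
  length city $ 11;
  input Dest $;
  if Dest='DFW' then city='Dallas';
  else if Dest='LAX' then city='Los Angeles';
  datalines;
DFW
LAX
;
run;

proc print data=work.city_short; run;
proc print data=work.city_full;  run;
```

Note that the length comes from the order of the statements in the code, not from the order of the data values.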

37
Deleting or Selecting Rows

• Rows can be deleted with a DELETE statement in an IF-THEN statement.
Example –
In the previous example we can add one more statement after the total assignment –
if total le 175 then delete;
This deletes the rows for which the value of total is less than or equal to 175.
• Similarly, rows can be selected with a subsetting IF statement (an IF statement without THEN).
Example –
if total gt 175;
• In the same way, date values can be compared with a date constant written in the form 'ddMMMyyyy'd, e.g. '16OCT2001'd.

38
Concatenating SAS data sets

Steps for concatenating data sets –
• Use the SET statement in a DATA step to concatenate SAS data sets.
• Use the RENAME= data set option to change the names of variables.
• Use the SET and BY statements together to interleave data sets.
• General form –
DATA SAS-data-set;
SET SAS-data-set1 SAS-data-set2;
run;
• The above code works roughly like UNION ALL in an SQL query: the observations of the second data set are appended after those of the first, and duplicates are kept.

39
• Example –
data newhires;
set na1 na2;
run;
• If na1 and na2 have the same number of variables with the same names, newhires contains all the variables, with the observations from na2 following the observations from na1.
• If the variable names differ, a variable can be renamed with the RENAME= data set option. E.g. if na1 has Name, Gender and Jobcode while na2 has Name, Gender and Jcode, Jcode can be renamed to Jobcode.

40
• Example –
data newhires;
set na1 na2(rename=(Jcode=Jobcode));
run;
• The data sets can also be interleaved with a BY statement. Each input data set must already be sorted by the BY variable.
data newhires;
set na1 na2(rename=(Jcode=Jobcode));
by name;
run;
• The above code combines the observations so that newhires is in order of name.

41
Merging Data Sets

• The MERGE statement joins corresponding observations from two or more data sets into single observations.
• General form –
DATA SAS-data-set;
MERGE SAS-data-sets;
BY by-variable(s);
run;
• The resulting data set contains the BY variable and all other variables from the input data sets. Observations with the same BY value are combined; when a BY value occurs in only one data set, the variables that come from the other data sets are missing.
• Each input data set must be sorted (or indexed) by the BY variable.
• A match-merge therefore works much like a join in SQL.

42
Conditional merging

• The IN= data set option identifies which data sets contribute to the current observation. Testing these flags determines whether the result behaves like a left join, a right join, an inner join and so on.
• Example –
data work.combine;
merge ia.gercrew(in = Increw)
work.gersched(in = Insched);
by EmpId;
if Insched=1;
run;
• IN= creates a temporary variable for its data set that is 1 when the data set contributes to the current observation and 0 otherwise. The subsetting IF above writes an observation to the result only when gersched contributed to it (Insched is 1).
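The IN= flags can be combined to split matches from non-matches. The sketch below uses two tiny invented data sets (work.crew and work.sched, already in EmpID order) instead of the course data sets.

```sas
/* Hypothetical sample data, already sorted by EmpID */
data work.crew;
  input EmpID $ Name $;
  datalines;
E01 Anna
E02 Bert
E03 Cara
;
run;

data work.sched;
  input EmpID $ FlightNo $;
  datalines;
E02 IA101
E03 IA205
E04 IA310
;
run;

/* Route each merged observation according to which inputs contributed */
data work.matched work.crew_only;
  merge work.crew(in=InCrew) work.sched(in=InSched);
  by EmpID;
  if InCrew and InSched then output work.matched;   /* E02 and E03 (inner join) */
  else if InCrew then output work.crew_only;        /* E01: no schedule row     */
run;
```

Changing the test to `if InCrew;` alone would keep every crew member, scheduled or not, which corresponds to a left join.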

43
Additional Features

• In addition to one-to-one merges, there can be one-to-many and many-to-many merges.
• In a one-to-many merge, a unique BY value in one data set has several matches in the other. The result has one observation per match, each repeating the values from the first data set alongside the different values from the second.
• In a many-to-many merge, a BY value occurs several times in both data sets. Observations are paired in sequence within the BY group; once the data set with fewer observations for that value is exhausted, its last observation is paired with the remaining observations of the other data set.

44
Summary Reports
Summary report procedures used are –

 Proc Freq – Calculates frequency counts.

 Proc Means – Produces simple statistics.

 Proc Report – Produces flexible, detailed and summary reports.

45
Proc Freq

 Proc Freq procedure displays the frequency counts of the data values
in a SAS data set.
 It analyzes every variable in the SAS data set.
 Displays each distinct data value.
 Calculates the number of observations in which each data value
appears and the corresponding percentage.
 Indicates for each variable how many observations have missing
values.
 Example –
proc freq data=ia.dfwlax;
run;

46
Features of proc freq
• The TABLES statement limits the report to chosen variables. SAS creates a separate frequency table for each variable listed in the TABLES statement, separated by spaces.
Example – proc freq data=ia.dfwlax;
tables economy flight;
run;
• The NLEVELS option displays the number of levels, i.e. the number of distinct values, of each variable.
• The NOPRINT option suppresses the frequency tables; it is generally used with NLEVELS when only the number of levels is required.
• Example – proc freq data=ia.dfwlax nlevels;
tables _all_ / noprint;
title 'Number of levels';
run;
• Formats can also be applied when displaying frequency reports.

47
Cross tabular frequency
• A crosstabulation (two-way frequency) report analyzes all combinations of the distinct values of two variables.
• Example – proc format;
value $codefmt
'FLTAT1' - 'FLTAT2' = 'Flight Attendant'
'PILOT1' - 'PILOT2' = 'Pilot';
value money
low -< 25000 = 'Less than 25,000'
25000 - 50000 = '25,000 to 50,000'
50000 <- high = 'More than 50,000';
run;
proc freq data=ia.crew;
tables jobcode*salary;
format jobcode $codefmt. salary money.;
run;
• The CROSSLIST option of the TABLES statement (tables jobcode*salary / crosslist;) displays the crosstabulation in list form.

48
Proc Means

• By default this procedure gives the number of non-missing observations, mean, standard deviation, minimum and maximum for every numeric variable in the SAS data set. Additional statistics that can be requested include range, median, sum and nmiss (number of missing values).
• The VAR statement limits the output to selected variables, and the CLASS statement groups the output by the values of a classification variable.
• Example –
proc means data=ia.crew;
var salary;
class jobcode;
title 'Salary for Job code';
run;

49
Proc Report

Proc report enables –
• Creating listing reports.
• Creating summary reports using the SUM, GROUP and ORDER options of the DEFINE statement.
• Enhancing reports.
• Requesting separate subtotals and grand totals.
• Extra features provided by the REPORT procedure in comparison to the PRINT procedure are –
1. Summary reports.
2. Cross tabular reports.
3. Sorting data for the report.

50
Report procedure
Default listing displays –
• Each data value as it is stored in the data set.
• Variable names as report column headings.
• Default width for columns.
• Character values left justified.
• Numeric values right justified.
Printing selected variables –
The COLUMN statement prints only the selected variables, in the order in which they are specified.
Example –
title 'Salary Analysis';
proc report data=ia.crew;
column Jobcode Location Salary;
run;

51
Define statement

• Reports can be enhanced with the DEFINE statement and its attributes.
• General form – DEFINE variable / <attribute list>;
• Functions of the DEFINE statement –
1. Format – assigns a format to the variable; the default is the format stored in the SAS data set.
2. Width – assigns the column width; the default is the width stored in the data set, otherwise the variable length for character variables and 9 for numeric variables.
3. Order – orders the rows by the values of that variable, ascending by default; DESCENDING must be specified explicitly. Repeated values are suppressed.

52
• Group – the GROUP option can be used with several variables. Group variables appear in the report in the order in which they are written. ORDER cannot be used together with GROUP. GROUP also displays the sum of the numeric variables for each group; if GROUP is not used, the grand total of the numeric values is displayed.
• Sum – prints the sum of all values.
• Mean – displays the mean of all values.
• N – displays the number of non-missing values.
• Max – displays the maximum value.
• Min – displays the minimum value.

53
RBREAK

• RBREAK is used for the following purposes –
1. Adding a grand total at the top or the bottom of the report.
2. Adding a line before the grand total.
3. Adding a line after the grand total.
• General form – RBREAK BEFORE | AFTER </ options>;
• Options –
1. SUMMARIZE – prints the total.
2. OL – prints a single line above the total.
3. DOL – prints a double line above the total.
4. UL – prints a single line below the total.
5. DUL – prints a double line below the total.

54
Introduction to Graphics – Bar and Pie Charts

• The GCHART procedure is used to –
1. Specify the form of the chart.
2. Identify the chart variable.
3. Optionally identify an analysis variable.
• General form –
proc gchart data= SAS data set;
HBAR | VBAR | PIE chart-variable </ options>;
run;
• This produces a chart with one bar or slice for each value of the chart variable; the length of the bar or the size of the pie slice depends on the frequency of that value.
• For a numeric chart variable, SAS automatically divides the values into intervals, identifies their midpoints and creates one bar per midpoint. To avoid this, use the DISCRETE option.

55
Options Contd.
• SUMVAR= – names the summary (analysis) variable; the bars then represent a statistic of that variable instead of the frequency.
• TYPE= – used along with SUMVAR= to specify which statistic of the summary variable is charted, e.g. TYPE=MEAN | SUM.
• Example – proc gchart data=ia.crew;
vbar Jobcode / sumvar=Salary type=mean;
run;
The above code produces a vertical bar chart with Jobcode as the chart variable; the height of each bar is the mean Salary for that Jobcode.
• FILL= – used with pie charts to specify whether the slices are filled with a solid (FILL=S) or a cross-hatched (FILL=X) pattern.
• EXPLODE= – EXPLODE='value' pulls the slice for that value away from the rest of the pie.

56
Producing PLOTS
 GPLOT is used to plot one variable against another variable using
coordinate axis.
 General Form –
Proc GPLOT data=SAS data set;
PLOT vertical variable* horizontal variable </Options>;
Run;
 You can –
1. Specify the symbol to represent data.
2. Use different methods of interpolation.
3. Specify line styles, colors and thickness.
4. Draw reference lines within the axes.
5. Place one or more plot lines within the axes.

57
Example

• proc gplot data = ia.Flight;
where date between '02mar2001'd and '08mar2001'd;
plot Boarded * Date;
title 'Total Passengers for flight 114';
title2 'between 02mar2001 and 08mar2001';
run;
• This plots Boarded against Date for the specified range of flight dates.
• By default the plot symbol is a plus sign (+), and the points are shown individually without any interpolation.

58
Options

• SYMBOL – options that the SYMBOL statement can take are –
1. VALUE= (V=) – specifies the plot symbol, e.g. plus (default), star, diamond, square, triangle or none.
2. I= – specifies the interpolation method, e.g. I=join | needle | spline.
3. WIDTH= (W=) – specifies the width of the line.
4. COLOR= (C=) – specifies the color of the symbol and line.
Example –
proc gplot data = ia.Flight;
plot Boarded * Date;
symbol value=square i=join w=2 c=red;
title 'Total Passengers for flight 114';
run;

59
Controlling Axis

 We can use the following options with PLOT statement –


1. HAXIS – It scales the horizontal axis.
2. VAXIS – It scales the vertical axis.
3. CAXIS – Specifies color of both the axes.
4. CTEXT – Specifies the color of text on both axes.
 Example –
Plot Boarded * Date / Vaxis = 100 to 200 by 25 ctext=blue;

60
Outputting Observations

• By default, a DATA step implicitly writes the contents of the program data vector (PDV) to the output data set at the end of each iteration. An explicit OUTPUT statement anywhere in the step overrides this implicit output.
• General form – OUTPUT <SAS-data-set1> <SAS-data-set2> ...;
• The OUTPUT statement can be used to –
1. Create two or more observations from each line of input.
2. Write observations to multiple SAS data sets.
• Example –

61
Data forecast;
drop numemps;
set prog2.growth;
year=1;
Newtotal=Numemps *(1 + increase);
output;
year=2;
Newtotal=newtotal*(1 + increase);
output;
year=3;
Newtotal=newtotal*(1 + increase);
output;
Run;

62
Writing to multiple data sets
 Output statement is used to write observations to desired data sets.
 Example –
data army navy airforce;
drop type;
set prog2.military;
if type eq 'Army' then
output army;
else if type eq 'Navy' then
output navy;
else if type eq 'Air force' then
output airforce;
run;

63
• The FIRSTOBS= and OBS= data set options control which observations are read from an input data set.
• OBS= option – set prog2.military(obs=25); reads only the first 25 observations of the input data set.
• FIRSTOBS= option – set prog2.military(firstobs=11 obs=25); starts reading at the 11th observation of the input data set and stops after the 25th, so 15 observations are read.
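A quick way to see the effect of the two options is to read from a data set whose rows are numbered. This is a sketch; work.nums and work.slice are made-up names.

```sas
/* Build 30 numbered observations */
data work.nums;
  do i = 1 to 30;
    output;
  end;
run;

/* Read observations 11 through 25 only: 15 rows, i = 11 ... 25 */
data work.slice;
  set work.nums(firstobs=11 obs=25);
run;

proc print data=work.slice;
run;
```

OBS= names the last observation to read, not a count, which is why the result has 15 rows rather than 25.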

64
Writing to an external file
• Data can be written to an external file using either ODS or the FILE statement.
• ODS method –
ods csvall file='raw-data-file';
proc print data=prog2.maysales noobs;
format listdate
selldate date9.;
run;
ods csvall close;
• FILE statement –
data _null_;
set prog2.maysales;
file 'raw-data-file';
put description
listdate : date9.;
run;

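65
• The automatic variable _N_ and the END= option –
_N_ counts the iterations of the DATA step. The END= option of the SET statement names a temporary variable (IsLast here) that is set to 1 when the last observation is read.
data _null_;
set prog2.maysales end=IsLast;
file 'raw-data-file';
if _N_=1 then
put 'Description ListDate';
put description
listdate : date9.;
if IsLast = 1 then
put 'End of data';
run;
• Specifying a delimiter – the DLM= option of the FILE statement specifies the delimiter written between values.
Example – file 'raw-data-file' dlm=',';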

66
Summarizing data
• Creating an accumulating variable – the RETAIN statement can be used to create a variable that holds a running sum of another numeric variable.
• RETAIN statement –
1. Retains the value of the variable in the PDV across iterations of the DATA step.
2. Initializes the retained variable to missing if no initial value is specified.
• Example –
data mnthtot;
set prog2.daysales;
retain mth2dte 0;
mth2dte=mth2dte+saleamt;
run;
• The above code creates a new variable mth2dte holding a running sum of saleamt. If saleamt is missing in any observation, however, all subsequent values of mth2dte are missing. The sum statement (mth2dte + saleamt;) avoids this: it retains the variable automatically, initializes it to 0 and ignores missing values, so it can replace the RETAIN statement and the assignment.
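The difference between the two approaches shows up as soon as one value is missing. This is a sketch with three invented sales amounts; the variable names run_a and run_b are hypothetical.

```sas
/* Hypothetical daily sales, second value missing */
data work.daysales;
  input saleamt;
  datalines;
100
.
50
;
run;

data work.mnthtot;
  set work.daysales;
  retain run_a 0;
  run_a = run_a + saleamt;   /* RETAIN + assignment: 100, then missing, missing */
  run_b + saleamt;           /* sum statement: 100, 100, 150                     */
run;

proc print data=work.mnthtot;
run;
```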

67
Accumulating totals for a group of data

• To accumulate totals for each value of a variable, first sort the data by that variable, then use it as the BY variable together with an IF statement on FIRST.variable, as follows.
 Example -
data work.divsal(keep= jcode divsal);
set work.salary;
by jcode;
if first.jcode then divsal=0;
divsal + sal;
run;
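The example above writes every observation with its running group total. A variation, sketched here on invented data, adds a subsetting IF on LAST.jcode so that only one row per group, holding the final total, is written.

```sas
/* Hypothetical salary rows */
data work.salary;
  input jcode $ sal;
  datalines;
A 100
B 50
A 200
;
run;

/* BY-group processing requires sorted input */
proc sort data=work.salary;
  by jcode;
run;

data work.divtot(keep=jcode divsal);
  set work.salary;
  by jcode;
  if first.jcode then divsal = 0;   /* reset at the start of each group      */
  divsal + sal;                     /* sum statement: accumulates and retains */
  if last.jcode;                    /* keep only the final row: A 300, B 50   */
run;
```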

68
Reading delimited raw data file
• Common delimiters are blanks, commas and tab characters. The default delimiter is a blank.
• An informat can be specified to tell SAS how to read a data value.
• To specify an informat in list input, place a colon between the variable name and the informat name. The colon tells SAS to read from delimiter to delimiter instead of a fixed width.
• The length of a character variable can also be declared in advance with a LENGTH statement, in which case the colon and informat are not needed for it.
• Example –
data airplanes;
length ID $5;
infile 'raw data file';
input ID $
Inservice : date9.
passcap cargocap;
run;

69
Delimiters and missing data
 The DLM= option specifies the delimiter in the following manner
infile 'raw data file' dlm=':';
 If you specify several characters in the DLM= option then each of them
is treated as a delimiter, e.g. DLM=':!';
 If a record holds fewer values than the INPUT statement expects, SAS by default
loads the next record to find the remaining values. To avoid this, use the MISSOVER
option, which sets the remaining variables to missing.
infile 'raw data file' dlm=':' missover;
 With column or formatted input, if a data value is shorter than the specified length
then MISSOVER sets it to missing. To read the shorter value instead, use the
TRUNCOVER option in place of MISSOVER.
infile 'raw data file' dlm=':' truncover;
 Two consecutive delimiters are treated as one, so a missing value
needs a placeholder, which can be '.' for a numeric field and a blank
for a character field.

70
 If no placeholder is present then we can use the DSD option.
 Features of the DSD option –
1. Sets the default delimiter to a comma.
2. Treats consecutive delimiters as missing values.
3. Enables SAS to read values with embedded delimiters if the
value is surrounded by quotation marks.
 Example – infile 'Raw data file' dsd;
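A small self-contained sketch of the DSD behaviour, using in-stream data instead of an external file. The variable Owner and the sample values are made up for illustration.

```sas
data airplanes;
   infile datalines dsd;              /* comma-delimited, quotes protect commas */
   input ID : $5. Owner : $20. InService : date9. PassCap;
   format InService date9.;
   datalines;
50001,"Smith, John",04AUG1989,182
50002,Jones,,170
;
```

In the first record the quoted value keeps its embedded comma; in the second record the two consecutive commas leave InService missing.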

71
Controlling when a record loads

 SAS loads a new record into the input buffer each time it executes an INPUT
statement.
 We can also use a forward slash (/), which moves the pointer to the next line.
input Lname $20. Fname $10. /
City $10. State $20.;
This code reads Lname and Fname from the first line, then moves to the
next line and reads City and State.
 #n moves the pointer to line n.
input #1 Lname $20. Fname $10.
#2 City $10. State $20.;
This reads Lname and Fname from the first line and City and State from the
second line. The cycle repeats for the 3rd and 4th records and so on until
the end of the file is reached.

72
 An IF statement can also be used to control how the rest of a record is read, based on
the value of a field.
Example –
input salesid 5. Location $3.;
if Location='USA' then
input Saledate : mmddyy10.
Amount;
if Location='EUR' then
input Saledate : date9.
Amount : comma8.;
 The above code is intended to load salesid and location first and then, depending on
the value of location, read saledate and amount with the matching informats.
 As written, however, each INPUT statement loads a new record, so saledate and amount
are read from the next record instead of the current one.

73
 To avoid this scenario, we can use the trailing '@'.
 A trailing @ holds the raw data record in the input buffer
until SAS –
1. Executes an INPUT statement with no trailing @ or
2. Begins the next iteration of the data step.
Input var1 var2 var3….@;
 Reading multiple observations from one record – Multiple observations
can be read from one record if we use the double trailing '@@', which holds the
record across iterations of the data step.
Input var1 var2 var3…..@@;
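A minimal sketch of the trailing @ applied to the location-dependent example. It uses list input and in-stream sample records, both of which are assumptions made for illustration.

```sas
data sales;
   input SalesID $ Location $ @;        /* trailing @ holds the record */
   if Location = 'USA' then
      input SaleDate : mmddyy10. Amount;
   else if Location = 'EUR' then
      input SaleDate : date9. Amount : comma8.;
   format SaleDate date9.;
   datalines;
101 USA 01-20-2008 3295.50
102 EUR 30JAN2008 1,876.30
;
```

Because the first INPUT ends with @, the second INPUT continues on the same record rather than loading the next one.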

74
Data Transformation

 SAS provides variable lists, which can be used to refer to a set of
variables together.
Type of list          Form             Meaning
Numbered range list   X1-Xn            Specifies all variables from X1 to Xn inclusive. It can
                                       begin with any number and end with any number as long
                                       as the rules for user-supplied variable names are not
                                       violated.
Name range lists      X--A             Specifies all variables from X to A
                      X-numeric-A      Specifies all numeric variables from X to A
                      X-character-A    Specifies all character variables from X to A
Name prefix lists     Sum(of REV:)     Calculates the sum of all the variables that begin
                                       with REV
Special SAS names     _ALL_            All variables defined in a data step
                      _NUMERIC_        All numeric variables in a data step
                      _CHARACTER_      All character variables in a data step

75
SAS Functions

 SUBSTR function – Used to extract part of a string.
General form – Newvar = SUBSTR(string, start, <length>);
Here string can be a character constant or a variable name, start is the start position
and length is the number of characters to extract. If length is
omitted then all characters up to the end are extracted.
 RIGHT/LEFT function – Used for right or left justification.
General form - Newvar=RIGHT(argument);
Here the argument is right justified and the trailing blanks are
moved to the start. The LEFT function does the reverse.
 SCAN function – Returns the nth word of a string.
General form – Newvar = SCAN(string, n, <delimiter>);
The delimiter can be omitted, in which case a blank (among other default characters) is used as the delimiter.

76
 Concatenation operator - This operator is used to concatenate two or more
strings. To concatenate, we can use either (!!) or (||).
General Form – Newvar = String1 !! String2;
 TRIM function – This function removes trailing blanks from the string.
General form – Newvar = TRIM(argument);
If the argument is blank then it returns a single blank. TRIM does not remove
leading blanks; for that we can use a combination of LEFT and TRIM.
Example – Fullname = trim(left(Firstname)) !! ' ' !! Lastname;
 CATX function – This function concatenates character strings, removes
leading and trailing blanks and inserts separators.
General Form – CATX(separator, string 1,……,string n);
Similarly, CAT concatenates without removing blanks, CATS
concatenates and removes leading and trailing blanks, and CATT
concatenates and removes trailing blanks only.
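A short sketch that puts several of these character functions side by side. The names and values are invented for illustration.

```sas
data _null_;
   length FirstName LastName $ 12 FullName $ 25;
   FirstName = '  John';                          /* leading blanks on purpose */
   LastName  = 'Smith';
   FullName  = catx(' ', FirstName, LastName);    /* strips blanks, adds one separator */
   Initial   = substr(left(FirstName), 1, 1);     /* first non-blank character */
   Surname   = scan(FullName, 2, ' ');            /* second word of FullName */
   put FullName= Initial= Surname=;
run;
```

CATX does the work of the trim(left(...)) !! ' ' !! pattern shown earlier in a single call.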

77
 FIND function – This function searches for a specific substring within a string
and returns its position if found and 0 if not found.
General Form – Position = FIND(string,substring,<modifiers>,<start>);
- Modifiers can be I or T. I makes the search case insensitive (by default
it is case sensitive). T makes the search ignore trailing blanks.
- Start identifies the start position of the search; a positive value means a forward
search and a negative value means a backward search.
 The INDEX function works like FIND except that it does not have the modifier
and start arguments.
 UPCASE function – This converts all letters in the argument to upper case
and has no effect on digits and special characters.
General Form – NewVal = UPCASE(argument);
 LOWCASE function converts the text to lowercase.
 PROPCASE function converts the text to proper case (the first letter of each word capitalized).

78
 TRANWRD function – This function replaces every occurrence of one set of
characters in a string with another set of characters.
General Form – Dessert = Tranwrd(Dessert , 'Pumpkin' , 'Apple');
This replaces Pumpkin with Apple in Dessert. If the replacement
string is longer than the string it replaces, the result can be truncated
when the receiving variable is not long enough.
 SUBSTR left side – If the SUBSTR function is used on the left side of an
assignment statement then it replaces that substring of the variable with the
value on the right.
General Form – SUBSTR(string , start , <length>)=value;

79
Manipulating numeric values

 ROUND function - This function returns a value rounded to the
nearest multiple of the round-off unit.
General Form – NewVar = ROUND(argument,<round off unit>);
The round-off unit is numeric and positive, e.g. .01 rounds to two decimal
places and 10 rounds to the nearest ten. If it is omitted, the value is rounded to the nearest integer.
 CEIL function – This function returns the smallest integer greater than
or equal to the argument.
 FLOOR function – This function returns the greatest integer less than or
equal to the argument.
 INT function – This function returns the integer part of the argument.
 MEAN function – This returns the mean of all the nonmissing arguments.
 MIN function – This returns the minimum nonmissing value.
 MAX function - This returns the maximum value.
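A brief sketch of these numeric functions on literal values, with the results noted in comments:

```sas
data _null_;
   a = ceil(4.2);              /* 5 */
   b = floor(4.8);             /* 4 */
   c = floor(-4.8);            /* -5 */
   d = int(-4.8);              /* -4: INT truncates toward zero */
   e = round(1234.567, 10);    /* 1230 */
   f = round(1234.567, .1);    /* 1234.6 */
   g = mean(1, ., 3);          /* 2: the missing value is ignored */
   h = min(1, ., 3);           /* 1 */
   put a= b= c= d= e= f= g= h=;
run;
```

FLOOR and INT differ only for negative arguments, as lines c and d show.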

80
Manipulating Date values
 Creating SAS date value – The MDY function returns a SAS date from month, day and
year values given separately.
General Form - Newdate=MDY(month,day,year);
 TODAY() – This function returns the current system date as a SAS date.
 Extracting information – We can extract the day, month or year from a SAS date using
DAY(SAS date), MONTH(SAS date) or YEAR(SAS date) respectively. Similarly we
can use QTR and WEEKDAY.
 Calculating interval of years – The YRDIF function calculates the difference in years between
two SAS dates.
General Form – Diff = YRDIF(sdate , edate , basis);
Basis can take the following values –
1. 'ACT/ACT' – Uses the actual number of days between the dates and the actual length of each year.
2. '30/360' – Specifies a 30-day month and a 360-day year.
3. 'ACT/360' – Takes the actual number of days and divides it by 360.
4. 'ACT/365' – Takes the actual number of days and divides it by 365.
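A small sketch of the date functions, using 25 July 2008 (the date that appears in the macro examples later in these slides). The start date passed to YRDIF is an arbitrary illustrative value.

```sas
data _null_;
   SaleDate = mdy(7, 25, 2008);
   Day   = day(SaleDate);         /* 25 */
   Month = month(SaleDate);       /* 7 */
   Year  = year(SaleDate);        /* 2008 */
   Qtr   = qtr(SaleDate);         /* 3 */
   Wkday = weekday(SaleDate);     /* 6: Sunday=1, so Friday=6 */
   Years = yrdif(mdy(1, 1, 2000), SaleDate, 'ACT/ACT');
   put SaleDate= date9. Day= Month= Year= Qtr= Wkday= Years= 6.2;
run;
```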

81
Converting variable type

 The INPUT function is used to convert a character value to a numeric value.
General Form – Numvar=INPUT(source,informat);
The type of an existing variable cannot be changed, so the result must be
assigned to a new variable; the converted variable cannot have the same name
as the source. To keep the original name, rename the source variable first and
then assign the result to the original name.
 The PUT function is used to convert a numeric value to a character value.
General Form – Charvar=PUT(Source,format);
The same rules apply to the PUT function. The format must be
valid for the type of the source (a numeric format for a numeric source).
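A minimal sketch of both conversions. CharAmt, NumID and their values are hypothetical and exist only to make the step self-contained.

```sas
data converted(drop=CharAmt NumID);
   CharAmt = '1,234';                /* character source */
   NumID   = 42;                     /* numeric source   */
   Amt = input(CharAmt, comma8.);    /* character -> numeric: 1234    */
   ID  = put(NumID, z5.);            /* numeric -> character: '00042' */
run;
```

Each result goes into a new variable, since the type of CharAmt and NumID cannot be changed in place.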

82
Automatic conversions

 Automatic conversion from character to numeric is done in the following
cases –
1. Assignment to a numeric variable.
2. An arithmetic operation.
3. Logical comparison with a numeric value.
4. A function that takes a numeric argument.
The conversion produces a numeric missing value if the character value does not
conform to standard numeric notation.
 Automatic numeric to character conversion is done in the following cases –
1. Assignment to a character variable.
2. A concatenation operation.
3. A function that accepts character arguments.

83
Do loop Processing

 A DO loop is used to eliminate redundant code and perform repetitive
work.
General Form – DO index-variable = start TO stop <BY increment>;
End;
Example- Data invest;
do year = 2001 to 2003;
Capital + 5000;
Capital + (Capital * .075);
end;
run;
The above code writes only the final value of Capital to the data set.
If we write output; before the end of the DO loop then it writes all the
intermediate values of Capital to the data set.

84
 DO WHILE loop – This is used for conditional iteration of a set of statements.
General form – DO WHILE(expression);
END;
The expression is evaluated at the top of the loop; the statements execute only while it is true,
so the loop may not execute at all.
 DO UNTIL loop - This is used for conditional iteration of a set of statements.
General form – DO UNTIL(expression);
END;
The expression is evaluated at the bottom of the loop, so the statements execute at least once
and repeat until the expression becomes true.
 Combining DO WHILE and DO UNTIL with the iterative DO – This method is used to
avoid an infinite loop.
DO index-variable = start TO stop <BY increment>
WHILE | UNTIL (expression);
END;
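A minimal sketch of the combined form, reusing the Capital calculation from the earlier DO loop example. The 30-year limit and the 250000 target are illustrative assumptions.

```sas
data invest;
   /* Stops when Capital passes the target or after 30 years,
      whichever comes first, so the loop cannot run forever. */
   do Year = 1 to 30 until (Capital > 250000);
      Capital + 5000;
      Capital + (Capital * .075);
   end;
run;
```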

85
Nested Do loops

 Rules for nesting DO loops are –
1. Use a different index variable for each DO loop.
2. Make sure that every DO has a corresponding END.
Example – Data invest;
Do Year = 1 to 5;
Capital + 5000;
Do Quarter = 1 to 4;
Capital + (Capital * (.075/4));
End;
Output;
End;
Run;

86
SAS arrays
 Creating variables with arrays –
 Example -
Data percent (drop = qtr);
Set donate;
Total = sum(of qtr1-qtr4);
array contrib(4) qtr1-qtr4;
array percent(4);
do qtr=1 to 4;
percent(qtr)=contrib(qtr)/total;
end;
run;
In the above code, contrib refers to the existing variables qtr1 to qtr4 and percent creates the
new variables percent1 to percent4. We can also apply a format to the new variables when reporting.
Example - var ID Percent1-Percent4;
Format percent1-percent4 percent6.;
The PERCENTw.d format multiplies the value by 100 and adds a % sign at the end.

87
Assigning initial values
 Example –
data compare(drop = qtr goal1-goal4);
set donate;
array contrib(4) qtr1-qtr4;
array diff(4);
array goal(4) goal1-goal4 (10,15,5,10);
do qtr=1 to 4;
diff(qtr) = contrib(qtr) - goal(qtr);
end;
run;
The above code refers to the existing variables qtr1-qtr4 through contrib, assigns initial
values to the new array goal with variable names goal1 to goal4, and calculates the values of the
diff array. Initial values are retained until new values are assigned. If fewer
values than array elements are given, the remaining variables are set to missing.

88
Temporary arrays

 A temporary array can be created if we need an array for calculation purposes only, e.g.
in the previous example, array goal is an intermediate array and is
not required in the output data set.
 For that we can use _TEMPORARY_ instead of variable names.
Example – array goal(4) _temporary_ (10,15,5,10);
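A sketch of the previous comparison step rewritten with a temporary array, assuming the same donate data set with qtr1 to qtr4:

```sas
data compare(drop=qtr);
   set donate;
   array contrib(4) qtr1-qtr4;
   array diff(4);
   /* No goal1-goal4 variables are created, so nothing has to be dropped */
   array goal(4) _temporary_ (10,15,5,10);
   do qtr = 1 to 4;
      diff(qtr) = contrib(qtr) - goal(qtr);
   end;
run;
```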

89
Rotating SAS data set
Input Data Set
ID       QTR1  QTR2  QTR3  QTR4
E00224     12    33    22     .
E00367     35    48    40    30

Output Data Set
ID       QTR  Amount
E00224     1      12
E00224     2      33
E00224     3      22
E00224     4       .
E00367     1      35
E00367     2      48
E00367     3      40
E00367     4      30

90
SAS Program for rotation

Data rotate(drop = Qtr1-Qtr4);
Set donate;
array Contrib(4) Qtr1-Qtr4;
do Qtr=1 to 4;
Amount = Contrib(qtr);
Output;
end;
run;
For every observation read from the donate data set in the above code, contrib
refers to the values of Qtr1-Qtr4. Inside the loop
these values are assigned to Amount one by one, and
in every iteration an observation is written to the
output data set along with the value of the Qtr variable.

91
Conditional match merging of SAS data sets

 Suppose we have two data sets: transact, holding the week's transactions
with account number, transaction type and amount as
fields, and branches, holding the account number and branch
location for each account.
 Our objective is to create three data sets.
 Newtrans, holding the week's transactions with fields account number,
transaction type, amount and branch.
 Noactiv, showing accounts with no transaction this week, with fields
account number and branch.
 Noacct, showing transactions whose account number has no match in branches, with
fields account number, transaction type and amount.

92
Solution

Data Newtrans
Noactiv(drop = trans amt)
Noacct(drop = branch);
Merge transact(IN = Intrans)
Branches(IN = InBanks);
By actnum;
If Intrans and Inbanks
Then output Newtrans;
Else if Inbanks and not InTrans
then output Noactiv;
Else If Intrans and not Inbanks
then output Noacct;
Run;

93
Writing SQL queries in SAS

 We can use SQL queries in SAS by enclosing them between PROC SQL; and
QUIT;
 When joining two data sets with an SQL query the data sets need not
be sorted, unlike the MERGE statement, where the input data
sets must be sorted by the BY variable.
 Example –
Proc SQL;
Select T.Actnum, T.Trans, T.Amt, B.Branch
from Transact T , Branches B
where T.Actnum = B.Actnum;
Quit;
 No RUN command is required for an SQL query.
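A sketch of the same join written so that the result is stored as a data set rather than printed. It assumes the transact and branches data sets from the match-merge slides; newtrans is the output name used there.

```sas
proc sql;
   create table newtrans as
   select t.actnum, t.trans, t.amt, b.branch
   from transact as t
        inner join branches as b
        on t.actnum = b.actnum;    /* only accounts present in both tables */
quit;
```

The inner join corresponds to the "Intrans and Inbanks" branch of the conditional merge.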

94
SAS Macros
Macros construct input for the SAS compiler.
Functions of the SAS macro processor:
• pass symbolic values between SAS statements and steps
• establish default symbolic values
• conditionally execute SAS steps
• invoke very long, complex code in a quick, short way.

95
Advantages of SAS macros -
• substitute text in statements like TITLEs
• communicate across SAS steps
• establish default values
• conditionally execute SAS steps
• hide complex code that can be invoked easily.

96
Components of SAS macros
Macro variables:
• used to store and manipulate character strings
• follow SAS naming rules
• are NOT the same as DATA step variables
• are stored in memory in a macro symbol table.
Macro statements:
• begin with a % and a macro keyword and end with semicolon (;)
• assign values, substitute values, and change macro variables
• can branch or generate SAS statements conditionally.

97
Automatic macro variables

Some of the automatic macro variables are –
 SYSDATE – Date on which the SAS session started, in DATE7. format.
 SYSDAY – Day of the week on which the SAS session started.
 SYSDSN/SYSLAST – Last data set built.
These are the most commonly used automatic macro variables.
Example –
footnote "this report was run on &SYSDAY, &SYSDATE";
The above code resolves to –
footnote "this report was run on Friday, 25jul08";

98
Displaying macro variables
 %PUT is used to display macro variables on the log.
Example –
 %PUT **** SYSDAY = &SYSDAY;
 %PUT **** SYSTIME = &SYSTIME;
 %PUT **** SYSDATE = &SYSDATE;
The above code prints –
**** SYSDAY = Friday
**** SYSTIME = 13:42
**** SYSDATE = 25JUL08
Example of PROC CONTENTS using a macro variable –
proc contents data=&SYSLAST;
title "contents of &SYSLAST";
run;

99
User defined macro variables
 Macro variables can be defined by using %LET statement.
 General form - %LET var_name = value;
 This variable can be used anywhere using a ‘&’ sign.
Example –
%LET NAME=PAYROLL;
PROC PRINT DATA=&NAME;
TITLE "PRINT OF DATASET &NAME";
RUN;
The above code will substitute NAME with PAYROLL in the proc print
procedure and prints the data set.
 %STR allows values that contain a semicolon (;).
Example - %LET CHART=%STR(PROC CHART;VBAR EMP;RUN;);
&CHART;

100
Defining and Using Macros

 %MACRO and %MEND are used to define a macro.
 %macro-name is used to call (invoke) the macro.
 Example –
%MACRO CHART;
PROC CHART DATA=&NAME;
VBAR EMP;
RUN;
%MEND;
%CHART;
 %CHART will invoke the macro and run the code inside the definition
of the macro.

101
Parameterized Macro

 Example –
%MACRO CHART(NAME,BARVAR);
PROC CHART DATA=&NAME;
VBAR &BARVAR;
RUN;
%MEND;
%CHART(PAYROLL,EMP);
 The above macro resolves to –
PROC CHART DATA=PAYROLL;
VBAR EMP;
RUN;

102
Conditional Macro
 %IF and %DO can be used inside macro to execute a set of steps conditionally.
 Example –
%MACRO PTCHT(PRTCH,NAME,BARVAR);
%IF &PRTCH=YES %THEN
%DO;
PROC PRINT DATA=&NAME;
TITLE "PRINT OF DATASET &NAME";
RUN;
%END;
PROC CHART DATA=&NAME;
VBAR &BARVAR;
RUN;
%MEND;
%PTCHT(YES,PAYROLL,EMP)

103
Transferring values between SAS steps

 SYMGET and SYMPUT can be used to transfer values between data steps
or proc steps.
 Example –
%MACRO OBSCOUNT(NAME);
DATA _NULL_;
SET &NAME NOBS=OBSOUT;
CALL SYMPUT('MOBSOUT',OBSOUT);
STOP;
RUN;
PROC PRINT DATA=&NAME;
TITLE "DATASET &NAME CONTAINS &MOBSOUT OBSERVATIONS";
RUN;
%MEND;
%OBSCOUNT(PAYROLL);
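A hedged variant of the macro above that also shows SYMGET. It assumes a SAS release in which CALL SYMPUTX is available; SYMPUTX is a substitution for the SYMPUT call on the slide, chosen because it removes the leading blanks that SYMPUT leaves when it converts a number.

```sas
%macro obscount(name);
   data _null_;
      if 0 then set &name nobs=obsout;   /* NOBS= is filled at compile time   */
      call symputx('mobsout', obsout);   /* store the count, blanks stripped  */
      stop;
   run;

   data _null_;
      /* SYMGET reads the macro variable back while the step executes */
      count = input(symget('mobsout'), 8.);
      put count=;
   run;

   %put NOTE: &name contains &mobsout observations;
%mend obscount;

%obscount(payroll)
```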

104
Efficiency Techniques

• Selecting observations – Comparison between the IN, OR and WHERE operators
while selecting.
• Reducing observation length – Comparison between the SCAN and SUBSTR
functions in terms of disk space usage.
• Indexing – Usage of an index in a WHERE statement as compared to an IF statement.
• Compressing – Making a data set from another sorted data set, depending on
whether the input or the output is compressed.
• Sub setting external files – Usage of the IF statement at different stages while sub
setting an external file.
• Concatenating data sets – Comparison between simple concatenation,
APPEND, INSERT INTO in SQL and UNION.
• Interleaving data sets - Using a separate sort, the BY statement and ORDER
BY in a UNION.

105
Selecting Observations

When we want to test for different values of a variable using the IF statement, we can
choose between the IN operator or the OR operator. The examples below show that the
IN operator requires more CPU time. The difference becomes even more important when
testing a huge set of records.

PROGRAM 1-A
DATA PRODUCTSALES;
SET DATA1.SALES;
WHERE PRODUCT_ID IN ('111', '142', '152', '165', '166');
RUN;

PROGRAM 1-B
DATA PRODUCTSALES;
SET DATA1.SALES;
IF PRODUCT_ID = '111' OR
PRODUCT_ID = '142' OR
PRODUCT_ID = '152' OR
PRODUCT_ID = '165' OR
PRODUCT_ID = '166';
RUN;

106
PROGRAM 1-C
DATA PRODUCTSALES;
SET DATA1.SALES;
WHERE PRODUCT_ID IN ('111', '142', '152', '165', '166',
'411', '412', '417', '421', '423',
'519', '525', '526', '733', '736');
RUN;

PROGRAM 1-D
DATA PRODUCTSALES;
SET DATA1.SALES;
IF PRODUCT_ID = '111' OR
PRODUCT_ID = '142' OR
PRODUCT_ID = '152' OR
PRODUCT_ID = '165' OR
PRODUCT_ID = '166' OR
PRODUCT_ID = '411' OR
PRODUCT_ID = '412' OR
PRODUCT_ID = '417' OR
PRODUCT_ID = '421' OR
PRODUCT_ID = '423' OR
PRODUCT_ID = '519' OR
PRODUCT_ID = '525' OR
PRODUCT_ID = '526' OR
PRODUCT_ID = '733' OR
PRODUCT_ID = '736';
RUN;

107
Comparison on the basis of time
Program number   Method used and size of data   CPU time elapsed
1-A              5 values – IN operator         1.94 sec
1-B              5 values – OR operator         0.80 sec
1-C              15 values – IN operator        3.92 sec
1-D              15 values – OR operator        0.90 sec

108
Sub setting data in a DATA step is possible through the IF statement or the WHERE
statement. Usually the WHERE statement is more efficient than the IF statement,
because the IF statement is executed on the data after it has been brought into the Program Data Vector,
whereas the WHERE statement is executed before bringing the data into the Program
Data Vector. The following examples show this behavior.

PROGRAM 2-A
DATA CLIENT;
SET DATA1.CLIENT;
IF LAST_NAME = 'VAN BRUSSELS';
RUN;

PROGRAM 2-B
DATA CLIENT;
SET DATA1.CLIENT;
WHERE LAST_NAME = 'VAN BRUSSELS';
RUN;

109
PROGRAM 2-C
DATA CLIENT;
SET DATA1.CLIENT;
IF SUBSTR (LAST_NAME, 1, 3) = 'VAN';
RUN;

PROGRAM 2-D
DATA CLIENT;
SET DATA1.CLIENT;
WHERE SUBSTR (LAST_NAME, 1, 3) = 'VAN';
RUN;

PROGRAM 2-E
DATA CLIENT;
SET DATA1.CLIENT;
WHERE LAST_NAME LIKE 'VAN%';
RUN;

The WHERE statement has an exception too. The above examples show that
using the SUBSTR function in a WHERE statement increases the CPU time noticeably
compared to the corresponding IF statement. When using a typical WHERE operator
(LIKE), the same subset is created, but CPU time decreases and performance is
again better than with the sub setting IF statement.

110
Comparison on the basis of time

Program number Method used CPU time elapsed (seconds)

2-A IF 0.90

2-B Where 0.07

2-C IF – SUBSTR 0.11

2-D Where – SUBSTR 0.22

2-E Where – LIKE 0.09

111
Reducing Observation Length

Several data manipulation functions have 'space leaks': if a LENGTH statement is not
specified for the resulting variable, a lot of disk space might be wasted. Two
examples illustrate this behavior. In the first example the variable INITIALS
contains the output of the SUBSTR function, but the length of this variable equals the
sum of the lengths of the contributing variables. As a result, every observation in the output table
contains (length of first name + length of last name - 2) redundant blanks. If we
assume that first name and last name each have length 20, every INITIALS value
will have 38 redundant blanks.

PROGRAM 1-A
DATA CLIENT;
SET DATA1.CLIENT;
INITIALS = SUBSTR (FIRST_NAME, 1, 1) !!
SUBSTR (LAST_NAME, 1, 1);
RUN;

PROGRAM 1-B
DATA CLIENT;
SET DATA1.CLIENT;
LENGTH INITIALS $ 2;
INITIALS = SUBSTR (FIRST_NAME, 1, 1) !!
SUBSTR (LAST_NAME, 1, 1);
RUN;

112
Some functions, like the SCAN function, create a result with a default length of 200 when
the receiving variable has no declared length. Following is an example of space wastage in that case.

PROGRAM 1-C
DATA CLIENT;
SET DATA1.CLIENT;
COUNTRY = SCAN (CLIENT_ID, 1, '-');
CITY = SCAN (CLIENT_ID, 2, '-');
NUMBER = SCAN (CLIENT_ID, 3, '-');
RUN;

PROGRAM 1-D
DATA CLIENT;
SET DATA1.CLIENT;
LENGTH COUNTRY CITY $ 2
NUMBER $ 8;
COUNTRY = SCAN (CLIENT_ID, 1, '-');
CITY = SCAN (CLIENT_ID, 2, '-');
NUMBER = SCAN (CLIENT_ID, 3, '-');
RUN;

113
Comparison on the basis of size
Program number   Method used       Length of variables in different cases
1-A              SUBSTR            20 + 20
1-B              SUBSTR – Length   2
1-C              SCAN              3 x 200 = 600
1-D              SCAN – Length     2 + 2 + 8 = 12

114
Indexing
Although an index is considered for use in a WHERE statement and not in a sub setting IF
statement, we still find several programs using an IF statement to subset a table with an
index. The gain in CPU time becomes more important as the subset returned by the index gets
smaller. In the following examples, a simple index exists on each of the variables SHOP_ID and
CUSTOMER_ID. The variable SHOP_ID has only 7 distinct values, whereas the variable
CUSTOMER_ID contains approximately 80,000 different values. Accessing the data
through the index on SHOP_ID returns about 15% of the data, resulting in only a small
difference between the WHERE statement (using the index) and the IF statement
(performing a sequential search).

PROGRAM 1-A
DATA SALES_B_B;
SET DATA1.SALES_INDEXED;
IF SHOP_ID = 'B-B';
RUN;

PROGRAM 1-B
DATA SALES_B_B;
SET DATA1.SALES_INDEXED;
WHERE SHOP_ID = 'B-B';
RUN;

115
Accessing the data through the index on CUSTOMER_ID returns less than 0.01% of the
data and is extremely fast compared to the sub setting IF statement.

PROGRAM 2-A
DATA SALES_12345;
SET DATA1.SALES_INDEXED;
IF CUSTOMER_ID = '12345';
RUN;

PROGRAM 2-B
DATA SALES_12345;
SET DATA1.SALES_INDEXED;
WHERE CUSTOMER_ID = '12345';
RUN;

116
Comparison on the basis of time
Program number   Description              CPU time (seconds)
1-A              7 shops – If             1.31
1-B              7 shops – Where          1.02
2-A              100.00 clients – If      0.76
2-B              100.00 clients – Where   0.01

117
Compressing

Compression can be useful if disk space is a problem. Compression must be
added in a sensible way: both compressing and decompressing the data
require CPU time. For that reason the COMPRESS = YES option should not be
specified in the global OPTIONS statement. The following examples illustrate the CPU cost of
compression: an input SAS data set is sorted into an output SAS data set.

PROGRAM 1-A
PROC SORT DATA = DATA1.CLIENT
OUT = CLIENT;
BY HOME_CITY;
RUN;

PROGRAM 1-B
PROC SORT DATA = DATA1.CLIENT
OUT = CLIENT_COMPRESSED
(COMPRESS = YES);
BY HOME_CITY;
RUN;

PROGRAM 1-C
PROC SORT DATA = DATA1.CLIENT_COMPRESSED
OUT = CLIENT;
BY HOME_CITY;
RUN;

PROGRAM 1-D
PROC SORT DATA = DATA1.CLIENT_COMPRESSED
OUT = CLIENT_COMPRESSED
(COMPRESS = YES);
BY HOME_CITY;
RUN;

118
Comparison on the basis of time
Program   Description                                    CPU time (seconds)
1-A       Input not compressed, output not compressed    0.51
1-B       Input not compressed, output compressed        0.78
1-C       Input compressed, output not compressed        0.48
1-D       Input compressed, output compressed            0.80

119
Sub
Subsetting
settingexternal
externalfiles
files

The INPUT statement, which structures the content of the input buffer into variables
in the Program Data Vector, consumes a considerable amount of CPU time. If you only
need to process a subset of the external file, read only the part of the input buffer
that the subsetting condition needs, and read the rest of the input buffer only if
that part meets the condition. The trailing @ in the INPUT statement holds the
contents of the input buffer for the next INPUT statement.

PROGRAM 1-A
DATA CLIENT;
INFILE CLIENT;
INPUT CLIENT_ID    $ 1 - 14
      LAST_NAME    $ 16 - 35
      FIRST_NAME   $ 37 - 56
      HOME_CITY    $ 58 - 77
      HOME_COUNTRY $ 79 - 93
      …;
RUN;
DATA CLIENT_LONDON;
SET CLIENT;
IF HOME_CITY = 'LONDON';
RUN;

PROGRAM 1-B
DATA CLIENT_LONDON;
INFILE CLIENT;
INPUT CLIENT_ID    $ 1 - 14
      LAST_NAME    $ 16 - 35
      FIRST_NAME   $ 37 - 56
      HOME_CITY    $ 58 - 77
      HOME_COUNTRY $ 79 - 93
      …;
IF HOME_CITY = 'LONDON';
RUN;

PROGRAM 1-C
DATA CLIENT_LONDON;
INFILE CLIENT;
INPUT HOME_CITY $ 58 - 77 @;
IF HOME_CITY = 'LONDON';
INPUT CLIENT_ID    $ 1 - 14
      LAST_NAME    $ 16 - 35
      FIRST_NAME   $ 37 - 56
      HOME_COUNTRY $ 79 - 93
      …;
RUN;

Comparison on the basis of time

Program Number   Description                  CPU Time (min:sec)
1-A              DATA (Input) – DATA (If)     4:22.80
1-B              DATA (Input – If)            2:25.98
1-C              DATA (Input – If – Input)    0:15.91
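
The Input – If – Input pattern can be tried on a small scale with the sketch below. It is only an illustration under stated assumptions: the fileref DEMO, the three sample records, and the column layout are made up to mirror the CLIENT file described above. FULLSTIMER asks SAS to write more detailed resource usage to the log, which helps when comparing variants.

```sas
options fullstimer;          /* detailed CPU and memory figures in the log */

filename demo temp;          /* hypothetical temporary raw data file */

/* Write three fixed-column sample records. */
data _null_;
  file demo;
  put @1 'C001' @16 'SMITH'  @37 'ANNA' @58 'LONDON'  @79 'UK';
  put @1 'C002' @16 'JANSEN' @37 'PIET' @58 'UTRECHT' @79 'NL';
  put @1 'C003' @16 'BROWN'  @37 'JOHN' @58 'LONDON'  @79 'UK';
run;

data client_london;
  infile demo truncover;
  /* Read only the subsetting column; the trailing @ holds the record. */
  input home_city $ 58-77 @;
  /* Subsetting IF: other records are dropped before the rest is parsed. */
  if home_city = 'LONDON';
  /* Parse the remaining columns of the held record. */
  input client_id    $ 1-14
        last_name    $ 16-35
        first_name   $ 37-56
        home_country $ 79-93;
run;

proc print data = client_london;
run;
```

With these sample records only C001 and C003 reach the second INPUT statement; the record for UTRECHT is discarded after a single column has been read.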

EFFICIENTLY COMBINING DATA - CONCATENATING SAS DATA SETS

Many users are familiar with the APPEND procedure, which adds a new table directly to a
master table without reading or rewriting the master table. Still, they rarely code the APPEND
procedure, because they are used to typing the DATA step, which is quick to code. The next
example compares the traditional DATA step concatenation with the APPEND procedure and with
the OUTER UNION CORR operator in the SQL procedure. The result can also be created using the
SQL INSERT statement to add all observations of the second table to the end of the master table.

PROGRAM 1-A
DATA SALES;
SET SALES DATA1.SALES2003;
RUN;

PROGRAM 1-B
PROC APPEND BASE = SALES
            DATA = DATA1.SALES2003;
RUN;

PROGRAM 1-C
PROC SQL;
INSERT INTO SALES
SELECT * FROM DATA1.SALES2003;
QUIT;

PROGRAM 1-D
PROC SQL;
CREATE TABLE SALES AS
SELECT *
FROM SALES
OUTER UNION CORR
SELECT *
FROM DATA1.SALES2003;
QUIT;

Comparison on the basis of time

Program Number   Description               CPU Time (seconds)
1-A              DATA (Set)                1.65
1-B              Append                    0.11
1-C              SQL (Insert into)         0.59
1-D              SQL (Outer Union Corr)    3.98
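
A self-contained way to try the APPEND approach is sketched below. The tables WORK.SALES and WORK.SALES2003 and their two variables are hypothetical stand-ins for the master table and the new table; the sketch assumes both tables have the same variables with the same types.

```sas
/* Hypothetical master table. */
data work.sales;
  input sales_date :date9. amount;
  format sales_date date9.;
  datalines;
01JAN2002 100
15FEB2002 250
;
run;

/* Hypothetical new table with the same structure. */
data work.sales2003;
  input sales_date :date9. amount;
  format sales_date date9.;
  datalines;
10JAN2003 300
;
run;

options fullstimer;   /* detailed resource usage in the log */

/* Only WORK.SALES2003 is read; its observations are added to the end of WORK.SALES. */
proc append base = work.sales
            data = work.sales2003;
run;
```

Because the master table is not read or rewritten, the cost of this step depends mainly on the size of the table being added, which is consistent with the low CPU time shown for the Append program.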

Interleaving data sets

You can combine two sorted input SAS data sets into a sorted result in several ways.
The following example compares three of them: the traditional DATA step followed by a
SORT procedure, a DATA step with a BY statement specified directly after the SET
statement, and the OUTER UNION CORR operator with an ORDER BY clause in the SQL
procedure. As expected, the SQL procedure requires more CPU time than the DATA step.

PROGRAM 1-A
DATA SALES;
SET DATA1.SALES_B DATA1.SALES_NL;
RUN;
PROC SORT DATA = SALES;
BY SALES_DATE;
RUN;

PROGRAM 1-B
DATA SALES;
SET DATA1.SALES_B DATA1.SALES_NL;
BY SALES_DATE;
RUN;

PROGRAM 1-C
PROC SQL;
CREATE TABLE SALES AS
SELECT *
FROM DATA1.SALES_B
OUTER UNION CORR
SELECT *
FROM DATA1.SALES_NL
ORDER BY SALES_DATE;
QUIT;

Comparison on the basis of time

Program Number   Description                          CPU Time (seconds)
1-A              DATA (Set) - Sort                    6.15
1-B              DATA (Set – By)                      2.10
1-C              SQL (Outer Union Corr – Order By)    11.32
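
The Set – By variant works only when every input data set is already sorted by the BY variable. The sketch below shows this on two small, hypothetical tables (WORK.SALES_B and WORK.SALES_NL, with made-up dates and a COUNTRY variable added only to make the origin of each row visible).

```sas
/* Hypothetical input 1, already in SALES_DATE order. */
data work.sales_b;
  input sales_date :date9. country $;
  format sales_date date9.;
  datalines;
05JAN2003 B
20JAN2003 B
;
run;

/* Hypothetical input 2, already in SALES_DATE order. */
data work.sales_nl;
  input sales_date :date9. country $;
  format sales_date date9.;
  datalines;
12JAN2003 NL
25JAN2003 NL
;
run;

/* SET with BY interleaves the two inputs, so no separate sort step is needed. */
data work.sales;
  set work.sales_b work.sales_nl;
  by sales_date;
run;

proc print data = work.sales;
run;
```

The result alternates between the two inputs in date order (05JAN2003, 12JAN2003, 20JAN2003, 25JAN2003). If one of the inputs is not sorted by SALES_DATE, the DATA step stops with an error rather than producing a wrongly ordered result.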
