NESUG 16

Posters

PS019

CSV: A MACRO WHICH WRITES SAS PROGRAMS TO READ CSV FILES

Victor Kamensky, Albert Einstein College of Medicine


ABSTRACT
Macro CSV converts any CSV file into a SAS data set. A SAS program to do this conversion is the output. The
user can later modify the program. Only BASE SAS is needed.
WHY WAS CSV MACRO WRITTEN?
If an EXCEL spreadsheet is converted into SAS by using PROC IMPORT (or the EXPORT/IMPORT facility), part of the
data can be lost. For example, if a column contains mainly numbers (the first lines in the column contain numbers only,
and some later lines contain nonnumeric characters), the default use of PROC IMPORT will convert this column into a
numeric variable, and all character values will be lost. Or, vice versa, if the first lines in the column are blank and/or
contain characters, the column will be converted into a character variable and some numbers can be lost.
Sometimes PROC IMPORT just refuses to convert an EXCEL file or runs for a very long time. Sometimes part of a
column is converted into missing values for no obvious reason. If SAS ACCESS TO PC FILE FORMATS
is not available, EXCEL can be converted into SAS by DDE. In this case the user must supply the variable names. The
usual solution is to output a CSV (comma-delimited) file from EXCEL and read the CSV file into SAS. Many other
programs besides EXCEL can also output data as a CSV file.
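The loss of nonnumeric values described above can be illustrated with a small data step. This is only a sketch of the effect, not of what PROC IMPORT does internally; the data set name DEMO, the variable names and the four sample cells are hypothetical.

```sas
/* Hypothetical column: mostly numbers, with two nonnumeric entries */
data demo;
  length cell $ 10;
  input cell $;
  /* Read the cell as a number; ?? suppresses the invalid-data messages */
  as_number = input(cell, ?? best12.);
  datalines;
12
7.5
<5
ND
;
run;

proc print data=demo;
run;
```

In the printed output AS_NUMBER is missing for the cells "<5" and "ND", while CELL still holds the original text. Keeping the column as a character variable preserves that text.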
A CSV file can be imported into SAS by using PROC IMPORT or the EXPORT/IMPORT facility, but only if
SAS ACCESS TO PC FILE FORMATS is available. There are some macros that convert CSV into SAS
using BASE SAS only; for example, macro READDSD (by Ian Whitlock) was posted on SAS-L. But usually it is
convenient to have a program that reads the CSV into SAS, so that the program can be modified later if necessary.
Macro CSV writes such a program and converts the CSV into SAS (by running the program). If the user wants a
program to read a CSV file into SAS and SAS ACCESS TO PC FILE FORMATS is available, it is still faster to use macro
CSV than to run PROC IMPORT, take its log and edit the log to get a program.
TYPES OF EXCEL (CSV) COLUMNS
EXCEL is the main source of CSV files. That is why the term "EXCEL COLUMNS" is used in this paper to describe
variables in CSV files. Macro CSV considers the following types of EXCEL columns:

1    NUMERIC          All cells contain only numbers (or blanks). If all cells are blank, the column is NUMERIC.
2    DATE             Dates formatted as MMDDYY SAS formats.
3    MIXED NUMERIC    At least one cell contains numbers only.
     CHARACTER        Each nonblank cell contains at least one nonnumeric character.

WHAT DOES CSV MACRO DO?


Macro CSV writes a SAS program that converts a CSV file into a SAS data set. The user can later modify the program.
Macro CSV then runs the program to convert the CSV file. The program can be ignored if the user only wants the
converted data set. The resulting SAS data sets have the following features:

EXCEL COLUMN     SAS VARIABLE                   Default

NUMERIC          NUMERIC
DATE             NUMERIC formatted as MMDDYY
MIXED NUMERIC    CHARACTER                      Numeric variable is also created. This numeric variable contains
                                                numbers for all the cells that contain numbers only and missing
                                                values for all other cells.
CHARACTER        CHARACTER                      Length of each character variable is defined as the maximum length
                                                of cells in the EXCEL column.

By default, columns with all blank cells are dropped from the SAS data set. The order of variables in the SAS data
set is the same as the order of the EXCEL columns. The numeric variable created (by default) from a MIXED NUMERIC
EXCEL column is placed after the corresponding character variable.
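To make these features concrete, the following is a hand-written sketch of the kind of program the macro could produce. It is reconstructed from the PUT statements in the source code, not copied from real output. The file name MYCSV.csv, its three columns (ID, VISIT, RESULT) and the data values are hypothetical; the first step only creates that sample file in the current directory. For brevity the conversion step uses a single IF statement where the generated program uses arrays.

```sas
/* Create a small hypothetical CSV file in the current directory */
data _null_;
  file 'MYCSV.csv';
  put 'ID,VISIT,RESULT';
  put '1,01/15/2003,4.2';
  put '2,02/20/2003,<0.5';
run;

/* Read the file: ID is NUMERIC, VISIT is DATE, RESULT is MIXED NUMERIC */
data MYSASDS;
  infile 'MYCSV.csv' missover dlm=',' dsd lrecl=32767 firstobs=2;
  length _RESULT $ 4;          /* longest cell in the column */
  informat VISIT mmddyy10.;
  format VISIT mmddyy10.;
  input ID VISIT _RESULT;
run;

/* Numeric companion: filled only for cells that contain numbers only */
data MYSASDS;
  set MYSASDS;
  if _RESULT ^= '' and compress(_RESULT, '0123456789.') = ''
    then RESULT = input(_RESULT, best12.);
run;

/* Restore the column order; the numeric variable follows the character one */
data MYSASDS;
  retain ID VISIT _RESULT RESULT;
  set MYSASDS;
run;
```

In this sketch RESULT is 4.2 in the first observation and missing in the second, while _RESULT keeps both original cell values.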
MACRO PARAMETERS
Macro CSV has 2 positional parameters and 9 keyword parameters. Positional parameters are required. Keyword
parameters are optional.
Name       Description                                             Default value

POSITIONAL PARAMETERS

CSV        CSV file name. The full path should be specified if
           the CSV file is not in the user's current directory
           (i.e. the directory whose name can be seen on the
           bottom of the SAS session).
DATASET    SAS data set name. Can be permanent or temporary.

KEYWORD PARAMETERS

PROGNAME   SAS program name. Created in the user's current         _a, i.e. the default program name is
           directory.                                              _a.SAS.
LSIZE      Maximum length of a record in the CSV file.             32767 (maximum possible)
MAXLEN     Maximum length of any cell.                             200
CONVNUM    MIXED NUMERIC EXCEL columns can be converted into       Y
           numeric variables if CONVNUM is equal to Y.
PERC_NUM   A MIXED NUMERIC EXCEL column is converted into a        0, i.e. all MIXED NUMERIC EXCEL columns
           numeric variable if the percent of numeric cells in     are converted into numeric variables (if
           this column is greater than or equal to PERC_NUM.       CONVNUM=Y), even if only one numeric
                                                                   cell exists.
DROPMISS   Blank columns are dropped from the SAS data set if      Y
           DROPMISS is equal to Y.
LENDATE    EXCEL columns are checked to be DATE formatted as       10, i.e. dates are supposed to be
           MMDDYY&LENDATE.                                         formatted as MMDDYY10.
LENADD     Constant added to the maximum existing length of a      0, i.e. the default length of a character
           CHARACTER EXCEL column to define the SAS character      variable is equal to the maximum cell
           variable length.                                        length in the EXCEL column.
_NAMES_    Name of the SAS data set that keeps variable names.     _names_

HOW MACRO CSV WORKS


The letters A-N appear in the table below and in the source code of macro CSV (see the SOURCE CODE section). These
letters mark the logical steps of the macro.

A.    The first line of the &CSV file is read into one character variable LINE in data set &_NAMES_. LINE is
      checked for the presence of double quotation marks ("). Commas (,) inside double quotation marks are changed
      into underscores (_). Data set &_NAMES_ has 1 observation.
B.    The variable LINE is divided into NAMEs separated by commas (,). The following is done with each
      NAME. NAME is truncated to 32 characters. All characters except letters and digits are replaced with
      underscores (_). If NAME starts with a digit, the letter F is added as the first character; if it starts with
      an underscore (_), the underscore is replaced with F. Blank names are replaced with F#, where # means the
      variable number (VARNUM). The number of observations in data set &_NAMES_ is now equal to the number of
      EXCEL columns (&NUMCOL).
C.    The purpose of this macro loop is to get rid of duplicate NAMEs. If a name is duplicated, the variable
      number is added at the end of the name.
D.    The CSV file is read into &DATASET. But it is not the final &DATASET. Each variable in
      &DATASET is now character, with length &MAXLEN (default 200). Variable names are V1-V&NUMCOL.
E.    &DATASET is transposed (into itself, by PROC TRANSPOSE). The transposed &DATASET is merged with
      &_NAMES_. The first version of this macro used CALL EXECUTE to find frequencies of all variables (instead of
      PROC TRANSPOSE), but that takes a long time even for a small CSV file. Another version attempted to
      replace PROC TRANSPOSE with a data step, but PROC TRANSPOSE turned out to be the fastest method.
F.    Column types are defined for all columns of the EXCEL spreadsheet. The columns now correspond to variables
      COL1-COL&NUMROW in the &_NAMES_ data set.
      COL is NUMERIC (type=1) if all cells contain only numbers or blanks.
      COL is DATE (date=1) if all cells are dates formatted as MMDDYY&LENDATE.
      COL is CHARACTER if it is not NUMERIC or DATE.
      CHARACTER COL is MIXED NUMERIC if at least one cell contains numbers only (n_number > 0).
      PERC_NUM is the percent of numeric values in a MIXED NUMERIC column.
      If a MIXED NUMERIC column is converted into a numeric variable, the name of the character variable will be
      an underscore plus the value of NAME ('_'||trim(NAME); the value of NAME is the numeric variable name). If a
      MIXED NUMERIC column is not converted into a numeric variable, the name of the character variable will be
      the value of NAME.
G-M.  A program to read the &CSV file is written. Its name is &PROGNAME..sas.
      K. Conversion of MIXED NUMERIC columns into numeric variables is written. Executed only if
         &CONVNUM=Y.
      L. Data steps to drop blank columns from &DATASET are written. Executed only if
         &DROPMISS=Y.
      M. A RETAIN statement that keeps the variables in &DATASET in the same order as the columns in the CSV
         file (EXCEL spreadsheet) is written.
         A numeric variable converted from a MIXED NUMERIC column will be placed after the corresponding
         character variable.
N.    &CSV is converted to &DATASET by running &PROGNAME..sas.
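The tests used in step F can be tried on individual cells. The sketch below applies the same COMPRESS-based checks as the macro to one cell at a time, whereas the macro combines them over a whole column. The data set name CELL_KINDS, the variables CELL, KIND and STRIPPED, and the sample cells are hypothetical, and the date check assumes LENDATE=10.

```sas
data cell_kinds;
  length cell $ 20 kind $ 10;
  input cell $;
  stripped = cell;
  /* Like the macro, ignore one leading minus sign */
  if stripped =: '-' and stripped ne '-' then stripped = substr(stripped, 2);
  /* Only digits and dots, and not just a dot or a blank: a number */
  if compress(stripped, '0123456789.') = '' and compress(stripped, '. ') ne ''
    then kind = 'number';
  /* Only digits and slashes, and readable by the MMDDYY10. informat: a date */
  else if compress(cell, '0123456789/') = ''
          and input(cell, ?? mmddyy10.) > .
    then kind = 'date';
  else kind = 'character';
  drop stripped;
  datalines;
12
-3.5
01/15/2003
<5
;
run;

proc print data=cell_kinds;
run;
```

Here the first two cells are classified as numbers, the third as a date and the last as character. A column holding all four cells would be treated as MIXED NUMERIC, because at least one cell fails the number and date tests and at least one contains numbers only.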

HOW TO CALL MACRO CSV


Example I. Default values are used for all keyword parameters.

%csv(MYCSV,MYSASDS);

1  CSV file MYCSV.CSV should by default be in the user's current directory (the directory whose name can
   be seen on the bottom of the SAS session).
2  MYSASDS data set is created.
3  Program _a.SAS is written into the user's current directory.
4  The length of records in the CSV file should not be greater than 32767. The length of any cell cannot be
   more than 200.
5  For columns with mixed numeric and character data, 2 variables are created: one character variable,
   which keeps all data, and one numeric variable, which keeps only numeric cells.
6  Missing columns are dropped from the final data set.
7  Dates formatted as MMDDYY10 (if any) are converted into numeric variables formatted as MMDDYY10.
8  The length of each character variable is equal to the maximum length of its corresponding column.

Example II. Keyword parameter PROGNAME is added.

%csv(MYCSV,MYSASDS,PROGNAME=myprog);

The same as example I except
3  Program myprog.SAS is written into the user's current directory.

Example III. Keyword parameter LENDATE is added.

%csv(MYCSV,MYSASDS,PROGNAME=myprog,LENDATE=8);

The same as example II except
7  Dates formatted as MMDDYY8 (if any) are converted into numeric variables formatted as MMDDYY8.
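The remaining keyword parameters are passed in the same way. The two calls below are not among the original examples; they are a sketch based on the parameter descriptions and the source code, and they assume that the macro has already been compiled in the session and that MYCSV.csv exists in the current directory.

```sas
/* Keep MIXED NUMERIC columns as character variables only:             */
/* no companion numeric variables and no underscore-prefixed names.    */
%csv(MYCSV,MYSASDS,PROGNAME=myprog,CONVNUM=N);

/* Create a numeric companion only when at least 50 percent of the     */
/* cells in a MIXED NUMERIC column are numeric, and make every         */
/* character variable 5 bytes longer than its longest cell.            */
%csv(MYCSV,MYSASDS,PROGNAME=myprog,PERC_NUM=50,LENADD=5);
```

The extra length from LENADD can be useful if longer values are expected when the written program is later reused on an updated CSV file.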

SOURCE CODE
%macro csv(CSV,DATASET,PROGNAME=_a,LSIZE=32767,MAXLEN=200,CONVNUM=Y,
           PERC_NUM=0,DROPMISS=Y,LENDATE=10,LENADD=0,_NAMES_=_names_);
options mprint validvarname=v7;
%*A; /* read the header line and change commas inside double quotes to underscores */
data &_NAMES_; length line $&LSIZE. ;
  infile "&CSV..csv" ls=&LSIZE lrecl=&LSIZE missover length=l;
  input line $varying&LSIZE.. l;
  if _n_=1;
  if index(line,'"') > 0 then
  do;
    _in=0; DROP _in _ii;
    do _ii=1 to length(trim(line));
      if _in=1 and substr(line,_ii,1)=',' then substr(line,_ii,1)='_';
      if substr(line,_ii,1)='"' then _in=1-_in;
    end;
  end;
run;
%*B; /* split the header into one NAME per column and make each NAME a valid SAS name */
data &_NAMES_(keep=name varnum); length name $ 32;
  set &_NAMES_;
  varnum=0; ind=0;
  do until(index(line,',')=0);
    varnum=varnum+1;
    line=substr(line,ind+1);
    name=line;
    ind=index(line,',');
    if ind > 1 then name=scan(line,1,',');
    if ind = 1 then name='F'||compress(put(varnum,best.));
    name=left(trim(name));
    do _ii=1 to length(trim(name));
      if index('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_',
               upcase(substr(name,_ii,1))) = 0 then
        substr(name,_ii,1)='_';
    end;
    if index('0123456789',upcase(substr(name,1,1))) > 0 then
      name='F'||name;
    name=upcase(substr(name,1,1))||substr(name,2);
    if substr(name,1,1)='_' then substr(name,1,1)='F';
    output;
  end;
  call symput('numcol',compress(put(varnum,best.)));
run;
%*C; /* repeat until all names are different */
%let dif_names=0;
%do %until (&dif_names gt 0);
  proc sort data=&_NAMES_; by name varnum; run;
  data &_NAMES_; set &_NAMES_ nobs=nobs end=last; by name;
    retain n_dif; drop n_dif;
    if first.name then n_dif+1;
    if not first.name then
    do;
      length _vn $ 10; drop _vn ln;
      _vn =compress(put(varnum,best.));
      ln =32-length(compress(_vn));
      name=compress(substr(name,1,ln)||compress(_vn));
    end;
    if last and nobs=n_dif then
      call symput('dif_names','1');
  run;
%end;

proc sort data=&_NAMES_; by varnum; run;
%*D; /* read every column of the CSV file as a character variable */
data &DATASET;
  infile "&CSV..csv" ls=&LSIZE lrecl=&LSIZE
         missover dlm=',' dsd firstobs=2;
  length v1-v&numcol $&MAXLEN.; input v1-v&numcol ;
run;
/* store the number of data rows in macro variable numrow */
data _null_; call symput('numrow',compress(put(numrow,best.)));
  stop; set &DATASET nobs=numrow;
run;
%*E; /* transpose so that each observation holds one spreadsheet column */
proc transpose data=&DATASET out=&DATASET name=name;
  var _all_;
run;
data &_NAMES_; merge &_NAMES_ &DATASET(drop=name); run;
%*F; /* define the type of each column */
data &_NAMES_; length name $ 32; set &_NAMES_;
  array col col1-col&numrow; drop col1-col&numrow;
  type=1; date=1;
  do over col;
    length=max(length,length(trim(col)));
    if compress(col)=:'-' and compress(col)^='-' then col=substr(left(col),2);
    if compress(col,'0123456789.')^='' then type=2;
    if compress(col,'0123456789.') ='' and compress(col,'. ')^='' then
      n_number=sum(n_number,1);
    if compress(col)^='' and date=1 then
    do; date=.;
      if compress(col,'0123456789/')='' and
         length(trim(col)) in (&LENDATE,%eval(&LENDATE-1),%eval(&LENDATE-2))
      then
      do;
        if input(compress(col),mmddyy&LENDATE..) > . then date=1;
      end;
    end;
    if compress(col)=' ' then n_blank=sum(n_blank,1);
    if compress(col)='.' then n_dots =sum(n_dots,1);
  end;
  nobs=&numrow;
  if date=1 then type=1;
  if type =2 and n_blank =nobs then do; dropflag=1; type=1; end;
  if type =1 and (n_blank > . or n_dots > .) then
    if sum(n_blank,n_dots)=nobs then
    do; dropflag=1; date=.; end;
  drop n_blank n_dots;
  if type=2 and n_number > 0 then
  do;
    perc_num=(n_number/nobs)*100;
    if perc_num ge &PERC_NUM and "&CONVNUM"="Y" then
      name_='_'||trim(name);
  end;
run;
%*G; /* start the program file with the TITLE, DATA and INFILE statements */
data _null_;
  file "&PROGNAME..sas";
  put "title '" "&PROGNAME..sas" "';";
  put "data &DATASET;";
  put @ 5 "infile '" "&CSV..csv" "' missover dlm=',' dsd ls=&LSIZE "@ ;
  put " lrecl=&LSIZE firstobs=2;";
run;

%*H; /* write the LENGTH statement for character variables */
data _null_;
  set &_NAMES_(where=(type=2)) end=last;
  file "&PROGNAME..sas" mod;
  length=length+&LENADD;
  if _n_=1 then put ' LENGTH ';
  if name_ ='' then put @ 5 name ' $ ' length;
  else put @ 8 name_ ' $ ' length;
  if last then put ';';
run;
%*I; /* write INFORMAT and FORMAT statements for date variables */
data _null_;
  set &_NAMES_(where=(date=1)) end=last;
  file "&PROGNAME..sas" mod;
  put @ 5 ' informat ' name " mmddyy&LENDATE..;" @;
  put ' format ' name " mmddyy&LENDATE..;" ;
run;
%*J; /* write the INPUT statement */
data _null_;
  set &_NAMES_ end=last;
  file "&PROGNAME..sas" mod;
  if _n_=1 then put ' INPUT ';
  if name_ ='' then put @ 5 name ;
  else put @ 8 name_ ;
  if last then put ';run;';
run;
%*K; /* write the data step that converts MIXED NUMERIC columns into numeric variables */
%if &CONVNUM eq Y %then
%do;
  data _null_;
    set &_NAMES_(where=(name_^='')) end=last;
    file "&PROGNAME..sas" mod;
    if _n_=1 then
    do;
      put "data &DATASET; set &DATASET;";
      put @ 3 " ARRAY _char_ ";
    end;
    put @ 5 name_ ;
    if last then put @8 ';';
  run;
  data _null_;
    set &_NAMES_(where=(name_^='')) end=last;
    file "&PROGNAME..sas" mod;
    if _n_=1 then
    do;
      put @ 3 " ARRAY _num_ ";
    end;
    put @ 5 name ;
    if last then
    do;
      put @8 ';';
      put @ 5 ' do over _char_; ';
      put @ 9 "if _char_ ^='' and compress(_char_,'0123456789.')='' ";
      put @ 10 " then _num_=input(_char_,best.); ";
      put @ 5 ' end; ';
      put 'run;';
    end;
  run;
%end;

%*L; /* write the data step that drops blank columns */
%if &DROPMISS eq Y %then
%do;
  data _null_; set &_NAMES_(where=(dropflag=1)) end=last;
    file "&PROGNAME..sas" mod;
    if _n_=1 then
    do;
      put "data &DATASET; set &DATASET;";
      put @ 5 "DROP ";
    end;
    put @ 7 name_ name ;
    if last then put ';run;';
  run;
  data &_NAMES_; set &_NAMES_ ; if dropflag ne 1; run;
%end;
%*M; /* write the RETAIN statement that restores the column order */
data _null_; set &_NAMES_ end=last;
  file "&PROGNAME..sas" mod;
  if _n_=1 then
  do;
    put "data &DATASET;" ;
    put @ 5 "RETAIN ";
  end;
  put @ 7 name_ name ;
  if last then
  do;
    put @8 ';';
    put @3 "set &DATASET;";
    put 'run;';
  end;
run;
%*N; /* run the program that was just written */
options mprint source2;
%include "&PROGNAME..sas";
%mend csv;

SAS and all other SAS Institute Inc. product or service names are registered trademarks or
trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.