
Get it together: Combining data with SAS MERGE, UPDATE, and SET

Mel Widawski, MHW Consulting, Culver City, CA


ABSTRACT
Combining data sets is easy with the MERGE, UPDATE, and SET statements. You will learn when to use each and how they differ. Simple examples illustrate the results of combining data sets. Pitfalls are also discussed.

INTRODUCTION

There are a number of ways to combine data in SAS, and this paper covers a few of the most common methods. The first thing you must determine is how you want the data to be combined. Are you adding cases, bringing in additional variables, correcting data, changing specific values, or performing table look-ups? This paper is organized by function and deals with combining data in the DATA step using the SET, MERGE, and UPDATE statements. In addition there is a discussion of the uses of the BY statement where appropriate.

SET is used primarily for adding cases, but it can also be used to propagate variables across an entire file. MERGE combines two or more files that have the same cases and different variables. It can also be used for updating values when you wish to force a change regardless of the new value. UPDATE performs much the same function as MERGE with two exceptions: 1) only two data sets can be combined, and 2) if a value is missing in the transaction data set the value is not changed in the target data set. A BY statement is used in combination with SET to interleave lines of data, and with MERGE and UPDATE to assure that the appropriate lines are combined. The paper also discusses labeling the source of a case through the use of the IN= data set option, and performing table look-ups with the KEY= option on the SET statement; this is the most complex example in this presentation. In order to use the KEY= option, the look-up data set must have indexes stored with it. There are some useful techniques available in SQL for joining data sets, but they are not covered in this paper.

COMBINING CASES
As mentioned previously, the SET statement is used for combining cases. Cases, or lines of data, can be appended to the end of another file, or they can be interleaved in order by a variable that determines the sequence.
ADDING CASES TO THE END OF A FILE

This is one of the simplest tasks to perform, and it uses statements you will be familiar with if you have ever read a SAS data set in a DATA step. The only difference is that more than one data set is named on the SET statement. Assume that you have the following data set:

   DATA TEST01;
      INPUT ID v2 v3 v4;
      CARDS;
   1 1 1 1
   2 1 1 1
   3 1 1 1
   4 1 1 1
   5 1 1 1
   6 1 1 1
   7 1 1 1
   8 1 1 1
   RUN;

The values in the body have been set to one so that the additional cases will be easily distinguished from the originals. The data are included in a DATA step which reads them into a SAS data set. The data set you want to add to this is presented in the following DATA step:

   DATA TEST02a;
      INPUT ID v2 v3 v4;
      CARDS;
   2 2 2 2
   5 5 5 5
   8 8 8 8
   RUN;

Naming both data sets on the same SET statement in the following DATA step combines the files one after the other:

   DATA test11;
      SET test01 test02a;
   RUN;

The resulting file, TEST11, produced with PROC PRINT, follows:

   SET: Adding Lines Of Data

   Obs    ID    v2    v3    v4
     1     1     1     1     1
     2     2     1     1     1
     3     3     1     1     1
     4     4     1     1     1
     5     5     1     1     1
     6     6     1     1     1
     7     7     1     1     1
     8     8     1     1     1
     9     2     2     2     2
    10     5     5     5     5
    11     8     8     8     8

The additional lines of data appear at the end. While it is possible to combine two data sets where only some of the variables are in both, it usually works best if all of the variables are the same in both files. Large areas of missing values may be created if only some of the variables in the files being combined are the same.
INTERLEAVING CASES IN TWO FILES

Many times it is convenient to have the cases in the new data set ordered by some variable (usually ID). This can be accomplished by using PROC SORT after combining the data sets. If the data sets are already ordered by a variable, it is easy to maintain that order in the resulting data set by including a BY statement after the SET statement. In the following example the same data sets used previously are combined; adding a BY statement to the program used above maintains the order by ID:

   DATA TEST12;
      SET test01 test02a;
      BY id;
   RUN;

The resulting file, TEST12, produced with PROC PRINT, follows:

   Obs    ID    v2    v3    v4
     1     1     1     1     1
     2     2     1     1     1
     3     2     2     2     2
     4     3     1     1     1
     5     4     1     1     1
     6     5     1     1     1
     7     5     5     5     5
     8     6     1     1     1
     9     7     1     1     1
    10     8     1     1     1
    11     8     8     8     8

The additional lines of data appear interspersed according to ID. Remember that in order for this to work the files must be ordered by the variable of interest.
INDICATING THE SOURCE FILE IN THE INTERLEAVED FILES

It is helpful, when looking at a combined file later, to be able to determine which of the original files produced each line of data. Adding the IN= data set option to the SET statement in the program used above provides information on the source of the data:

   DATA TEST13;
      SET test01 (IN=in1) test02a (IN=in2a);
      BY id;
      in01 =in1;
      in02a=in2a;
   RUN;

Notice that in addition to creating IN1 and IN2A with the IN= data set option, we also set two additional variables, IN01 and IN02A, equal to them in assignment statements. The variables created by IN= are temporary and available only during the DATA step. In order to have the information saved in the file for future use, new permanent variables must be created.

The resulting file, TEST13, produced with PROC PRINT, follows:

   Obs    ID    v2    v3    v4    in01    in02a
     1     1     1     1     1      1        0
     2     2     1     1     1      1        0
     3     2     2     2     2      0        1
     4     3     1     1     1      1        0
     5     4     1     1     1      1        0
     6     5     1     1     1      1        0
     7     5     5     5     5      0        1
     8     6     1     1     1      1        0
     9     7     1     1     1      1        0
    10     8     1     1     1      1        0
    11     8     8     8     8      0        1

The additional lines of data appear interspersed according to ID, along with the new variables indicating the source of each line. Since no line of data in this example can receive input from more than one file, a single variable would be sufficient to indicate the source. This may be accomplished with the following modification to the DATA step:

   DATA TEST13a;
      SET test01 (IN=in1) test02a;
      BY id;
      LENGTH source $8;
      IF in1=1 THEN source='test01';
      ELSE source='test02a';
   RUN;

The LENGTH statement is necessary because the IF statement assigns a value with fewer characters than the subsequent ELSE statement. If it is not included, SOURCE takes its length from the first assignment, and the final "a" in "test02a" would be truncated, resulting in "test02". File TEST13a is the result of running this program:

   Obs    ID    v2    v3    v4    source
     1     1     1     1     1    test01
     2     2     1     1     1    test01
     3     2     2     2     2    test02a
     4     3     1     1     1    test01
     5     4     1     1     1    test01
     6     5     1     1     1    test01
     7     5     5     5     5    test02a
     8     6     1     1     1    test01
     9     7     1     1     1    test01
    10     8     1     1     1    test01
    11     8     8     8     8    test02a
WHEN THE VARIABLES DO NOT MATCH

The presence of different variables in each file yields strange results: large areas of missing values. The following program creates a data set with the new variables v5, v6, and v7:

   DATA TEST02;
      INPUT ID v5 v6 v7;
      CARDS;
   2 2 2 2
   5 5 5 5
   8 8 8 8
   ;;
   RUN;

The program that combines the data sets is almost identical to the previous program:

   DATA TEST14;
      SET test01 (IN=in1) test02 (IN=in2);
      BY id;
      in01=in1;
      in02=in2;
   RUN;

The resulting file contains seven variables instead of four, and the values of the variables from one file are missing on lines of data that come from the other file. In this example none of the variables match.

File TEST14 is the result of running this program:

   Obs    ID    v2    v3    v4    v5    v6    v7    in01    in02
     1     1     1     1     1     .     .     .      1       0
     2     2     1     1     1     .     .     .      1       0
     3     2     .     .     .     2     2     2      0       1
     4     3     1     1     1     .     .     .      1       0
     5     4     1     1     1     .     .     .      1       0
     6     5     1     1     1     .     .     .      1       0
     7     5     .     .     .     5     5     5      0       1
     8     6     1     1     1     .     .     .      1       0
     9     7     1     1     1     .     .     .      1       0
    10     8     1     1     1     .     .     .      1       0
    11     8     .     .     .     8     8     8      0       1

This is usually not the desired result when the data sets you wish to combine have the same IDs and different variables. Usually in that case you want to add the variables to the existing cases, as discussed in the section on adding variables that follows.

MERGING FILES AND ADDING VARIABLES


Combining files by merging additional variables from different sources is one of the most common tasks, for example merging demographic data with questionnaire or laboratory data. In order to do this there must be a variable to match the observations in each file, and the files should be ordered by this variable. A one-to-one match is most common. If there is no information to match an observation in one file, missing values are generated for the variables from that file. It is also possible to add information from one file to every case with the same key variable value in another file; this is called a table look-up. Strange results are generated if both files have multiple records with the same values of the key variables.
ONE TO ONE MERGING

A one-to-one merge implies that there is at most one case in each file with a given value of the variable used for matching. The following program performs this operation on the files presented above:

   DATA TEST03;
      MERGE test01 test02;
      BY id;
   RUN;

This produces the following file:

   Obs    ID    v2    v3    v4    v5    v6    v7
     1     1     1     1     1     .     .     .
     2     2     1     1     1     2     2     2
     3     3     1     1     1     .     .     .
     4     4     1     1     1     .     .     .
     5     5     1     1     1     5     5     5
     6     6     1     1     1     .     .     .
     7     7     1     1     1     .     .     .
     8     8     1     1     1     8     8     8

Notice that there was information to be added only to the cases with IDs 2, 5, and 8. Missing values were generated for cases not in the TEST02 data set.
SIMPLE TABLE LOOK-UPS

Table look-ups involve a single file that contains information on a group that needs to be included with the information for each person in the group, or information for a person that needs to be spread to each time-period record. Assume the following data set with a group identifier GP:

   DATA test01B;
      INPUT ID GP v3 v4;
      CARDS;
   2 1 2 3
   8 3 9 9
   3 1 3 4
   5 2 5 6
   6 3 7 8
   1 1 1 2
   4 2 4 5
   7 3 8 9
   ;;
   RUN;

   PROC SORT DATA=test01B OUT=test01BS;
      BY gp id;
   RUN;

Notice that the data set is not in order by group, so it is necessary to sort it with PROC SORT. The following data set contains information on each group that needs to be spread to each appropriate record above:

   DATA test02i;
      INPUT GP v6 v7;
      GP2=GP;
      CARDS;
   1 2 2
   2 5 5
   3 8 8
   ;;
   RUN;

The MERGE statement can be used to accomplish this task just as it did for the one-to-one matching problem:

   DATA TEST17;
      MERGE test01BS test02i;
      BY gp;
   RUN;

The data set produced by this process follows:

   Obs    ID    GP    v3    v4    v6    v7    GP2
     1     1     1     1     2     2     2     1
     2     2     1     2     3     2     2     1
     3     3     1     3     4     2     2     1
     4     4     2     4     5     5     5     2
     5     5     2     5     6     5     5     2
     6     6     3     7     8     8     8     3
     7     7     3     8     9     8     8     3
     8     8     3     9     9     8     8     3

Notice that the information from the second data set is spread to each case with the same group designation in the first data set. This is also known as a one-to-many match.
TABLE LOOK-UP: A VARIATION

Sometimes it is useful to calculate summary statistics for groups and propagate them to each record for that group. The look-up data set can be created with PROC MEANS and then spread to the members of each group from the original data set. The following PROC MEANS step calculates means for each group, GP:

   PROC MEANS DATA=TEST01bs;
      BY GP;
      VAR V3 V4;
      OUTPUT OUT=TEST01bsm MEAN(V3 V4)=M3 M4;
   RUN;

The OUTPUT statement creates the new SAS data set, TEST01bsm, containing the calculated means for each group. The BY statement specifies that each group's means should be calculated separately. The following data set was produced by the PROC step above:

   Obs    GP    _TYPE_    _FREQ_    M3     M4
     1     1       0         3      2.0    3.00000
     2     2       0         2      4.5    5.50000
     3     3       0         3      8.0    8.66667

Two automatic variables were produced by PROC MEANS: _TYPE_ and _FREQ_. The first is irrelevant for this example, and _FREQ_ contains the number of cases in the group.
The table look-up is accomplished as in the previous example. The MERGE statement can be used just as it was for the one-to-one matching problem:

   DATA TEST17b;
      MERGE test01BS test01BSm;
      BY gp;
      DROP _type_;
      RENAME _freq_=N;
   RUN;

The DROP statement eliminates _TYPE_ from the new data set, and the RENAME statement changes the name of _FREQ_ to N. The data set produced by this process follows:

   Obs    ID    GP    v3    v4    N    M3     M4
     1     2     1     2     3    3    2.0    3.00000
     2     3     1     3     4    3    2.0    3.00000
     3     1     1     1     2    3    2.0    3.00000
     4     5     2     5     6    2    4.5    5.50000
     5     4     2     4     5    2    4.5    5.50000
     6     8     3     9     9    3    8.0    8.66667
     7     6     3     7     8    3    8.0    8.66667
     8     7     3     8     9    3    8.0    8.66667

Notice that the means from the second data set are spread to each case with the same group designation in the first data set.
MULTIPLE RECORDS WITH THE SAME GROUP IN BOTH FILES

As mentioned earlier, merging data sets that both have more than one record with the same value of the matching variable yields strange results. The group data set below contains multiple records for some groups:

   DATA test02k;
      INPUT GP v6 v7;
      GP2=GP;
      CARDS;
   1 2 2
   1 3 3
   2 5 5
   3 8 8
   3 7 7
   ;;
   RUN;

The following program merges the two data sets:

   DATA TEST19;
      MERGE test01bs test02k;
      BY gp;
   RUN;

The data set produced by this merge follows:

   Obs    ID    GP    v3    v4    v6    v7    GP2
     1     1     1     1     2     2     2     1
     2     2     1     2     3     3     3     1
     3     3     1     3     4     3     3     1
     4     4     2     4     5     5     5     2
     5     5     2     5     6     5     5     2
     6     6     3     7     8     8     8     3
     7     7     3     8     9     7     7     3
     8     8     3     9     9     7     7     3

Notice that the first group 1 record in the group data set (test02k) is attached to the first group 1 record in the main data set (test01bs), and the second group 1 record is propagated to the remaining two observations. The same thing happens for group 3. This is generally not the desired result.

CHANGING THE VALUES OF EXISTING VARIABLES


Rather than changing the data directly in a data set, entering the new data in a separate data set and then using it to update the original provides a record of the changes. It also enables reverting to the original values.
FORCING CHANGES WITH MERGE

When the transaction (second) data set contains the same variables as the main data set and MERGE is used to combine the data sets, the value of any matching variable in the transaction data set replaces the value in the main file for the matching line of data. The main data set for this example is TEST01, presented again to aid in observing the modifications:

   DATA TEST01;
      INPUT ID v2 v3 v4;
      CARDS;
   1 1 1 1
   2 1 1 1
   3 1 1 1
   4 1 1 1
   5 1 1 1
   6 1 1 1
   7 1 1 1
   8 1 1 1
   RUN;

The following program creates the transaction data set:

   DATA TEST02b;
      INPUT ID v2 v3 v4;
      CARDS;
   2 2 2 .
   5 . 5 5
   8 8 . 8
   ;;
   RUN;

The missing values in this transaction data set will be propagated to the new main data set: using MERGE forces the change regardless of whether the new value is valid or missing. The following program performs this operation on the files presented above:

   DATA TEST03;
      MERGE test01 test02b;
      BY id;
   RUN;

This produces the following file:

   Obs    ID    v2    v3    v4
     1     1     1     1     1
     2     2     2     2     .
     3     3     1     1     1
     4     4     1     1     1
     5     5     .     5     5
     6     6     1     1     1
     7     7     1     1     1
     8     8     8     .     8

Notice that the value was changed even when the value in the transaction data set was missing.
CONDITIONAL CHANGES WITH UPDATE

When the transaction (second) data set contains the same variables as the main data set and UPDATE is used to combine the data sets, the valid values of any matching variable in the transaction data set replace the values in the main file for the matching line of data. Missing values in the transaction data set do not replace values in the main data set. The following program performs this operation on the files presented above:

   DATA TEST03;
      UPDATE test01 test02b;
      BY id;
   RUN;

This produces the following file:

   Obs    ID    v2    v3    v4
     1     1     1     1     1
     2     2     2     2     1
     3     3     1     1     1
     4     4     1     1     1
     5     5     1     5     5
     6     6     1     1     1
     7     7     1     1     1
     8     8     8     1     8

Notice that a value was changed only if the value in the transaction data set was not missing. This allows using a transaction data set that contains all of the variables even if some values do not need to be changed for some lines of data. The two primary differences between UPDATE and MERGE are that UPDATE never replaces a value in the main data set with a missing value from the transaction data set, and that only two data sets may be used on the UPDATE statement.
MULTIPLE TRANSACTION DATA SETS

When only a few changes are to be made at a time, for only a small number of cases and variables, it may be more convenient to use multiple transaction data sets, one for each variable that needs updating. This example still uses TEST01 as the primary data set.

Assume the following transaction data sets, one per variable, containing only the values to be updated:

   DATA TEST02e;
      INPUT ID v3;
      CARDS;
   2 2
   8 8
   ;;
   RUN;

   DATA TEST02f;
      INPUT ID v4;
      CARDS;
   5 5
   8 8
   ;;
   RUN;

Each data set contains only the values that need to be changed; additional transaction data sets can be added as more variables need updating. The MERGE statement can be used to accomplish this task, with more than two data sets specified:

   DATA TEST09;
      MERGE test01 test02e test02f;
      BY id;
   RUN;

The data set produced by this process follows:

   Obs    ID    v2    v3    v4
     1     1     1     1     1
     2     2     1     2     1
     3     3     1     1     1
     4     4     1     1     1
     5     5     1     1     5
     6     6     1     1     1
     7     7     1     1     1
     8     8     1     8     8

All values for variables in the transaction data sets replace the value for that variable and case in the primary data set; if there is no value listed in a transaction data set for a case and variable, no change is made.
UPDATE WITH ADDITIONAL VARIABLES AND CASES IN THE TRANSACTION FILE

Variables and cases can be added using the transaction file, but if a variable or case has no entry in that file, missing values are created. The following data set contains an additional variable not in the original data set (v5) and an additional observation (ID 9):

   DATA TEST02bx;
      INPUT ID v2 v3 v4 v5;
      CARDS;
   2 2 2 . 2
   5 . 5 5 5
   8 8 . 8 8
   9 9 9 9 9
   ;;
   RUN;

The UPDATE statement can be used to accomplish this task:

   DATA TEST06b;
      UPDATE test01 test02bx;
      BY id;
   RUN;

The data set produced by this process follows:

   Obs    ID    v2    v3    v4    v5
     1     1     1     1     1     .
     2     2     2     2     1     2
     3     3     1     1     1     .
     4     4     1     1     1     .
     5     5     1     5     5     5
     6     6     1     1     1     .
     7     7     1     1     1     .
     8     8     8     1     8     8
     9     9     9     9     9     9

Non-missing values in the transaction data set replace the appropriate values in the primary data set. New variables (v5) in the transaction data set are added to the new data set, but missing values for those variables are created for cases not in the transaction data set. Similarly, new cases (ID 9) may be added to the file from the transaction data set, but any variable not in the transaction data set is set to missing for the new case.

PROPAGATING VALUES OF NEW VARIABLES TO EVERY CASE


Sometimes it is necessary to propagate the values of a group of new variables to every line of data. This can be accomplished by including an assignment statement for each of these variables, but if the number of variables is large this method is cumbersome. If the values of these new variables are already in a file, it is advantageous to use that file. One way to accomplish this task is as a table look-up, but modifications must be made to the files: first create a new variable in the primary file with a value of one for all of the cases, then create that same variable with a value of one in the single-line look-up file, and finally merge the two files by the new constant variable. There is another way to do this without creating the constant variable. That technique uses multiple SET statements and makes use of the fact that the values of the variables brought in by one SET statement are retained until the next case is read from that file.
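The constant-variable table look-up described above can be sketched as follows. This is a minimal illustration rather than code from the paper; the single-line data set NEWVARS and the result name TEST16x are hypothetical stand-ins:

   DATA test01c;              /* add a constant key to every case of the primary file */
      SET test01;
      one=1;
   RUN;

   DATA newvarsc;             /* add the same constant to the hypothetical            */
      SET newvars;            /* single-line file of new variables                    */
      one=1;
   RUN;

   DATA test16x;              /* a one-to-many merge on the constant spreads the      */
      MERGE test01c newvarsc; /* new variables to every case                          */
      BY one;
      DROP one;
   RUN;

Since every observation in both files has ONE=1, the merge behaves exactly like the one-to-many match shown earlier, with the single look-up record matching every case.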
PROPAGATING VALUES WITH SET

If you have a multi-line file and a file of new variables that need to be spread to each record in the original file, values can be propagated by bringing in each record of the primary file with a SET statement and bringing in the single-line file only once with a separate SET statement. First look at TEST01 in the DATA step below:

   DATA TEST01;
      INPUT ID v2 v3 v4;
      CARDS;
   1 1 1 1
   2 1 1 1
   3 1 1 1
   4 1 1 1
   5 1 1 1
   6 1 1 1
   7 1 1 1
   8 1 1 1
   ;;
   RUN;

This is the primary data set for this example. The following data set contains the new variables which need to be spread across the records of the primary data set:

   DATA TEST02h;
      INPUT v5 v6 v7;
      CARDS;
   5 6 7
   ;;
   RUN;

The IF statement in the program below ensures that the single-line data set is read only once. It makes use of the automatic temporary variable _N_, which numbers the iterations of the DATA step; it equals one only for the first line of data read from TEST01. When using separate SET statements, remember that the values of all of the variables named on a SET statement are retained until reset by a subsequent execution of that SET statement. The following program demonstrates this concept:

   DATA TEST15;
      SET TEST01;
      IF _N_=1 THEN SET TEST02h;
   RUN;

The data set produced by this process follows:

   Obs    ID    v2    v3    v4    v5    v6    v7
     1     1     1     1     1     5     6     7
     2     2     1     1     1     5     6     7
     3     3     1     1     1     5     6     7
     4     4     1     1     1     5     6     7
     5     5     1     1     1     5     6     7
     6     6     1     1     1     5     6     7
     7     7     1     1     1     5     6     7
     8     8     1     1     1     5     6     7

Notice that the values (5 6 7) of the variables v5, v6, and v7 in the TEST02h data set are propagated to all of the cases in the TEST01 data set. This feature of the SET statement allows great control over combining files.

COMPLEX TABLE LOOK-UPS WITH SET


There are a variety of uses for complex table look-ups that cannot easily be done with simple merges. The multiple-SET-statement feature discussed above for propagating values, combined with the KEY= feature of the SET statement, provides a very powerful tool. The following scoring task could be accomplished through brute force with cumbersome programs, but this method is rather simple and has the advantage of being easily expanded.
TEST SCORING TABLE LOOK-UP EXAMPLE

Assume that a test is being administered where each item is scored according to a scoring key that is provided. The scoring key contains the question number, the code that matches the response, and the score assigned for that code on that question. The following DATA step reads in the test responses and sets up the main data set:

   DATA main (INDEX=(id));             /**** create indexes ***/
      INPUT id Q1 $ Q2 $ Q3 $;
      n2=_n_;
      CARDS;
   5 a b e
   6 c d b
   8 c b a
   9 d a b
   10 a a e
   ;;

Notice the INDEX= data set option. This option can be used in lieu of sorting the file to create a sort order. The advantage over sorting is that multiple sort orders can be imposed on the same data set at the same time, without constant resorting. The following data set contains the scoring key:

   DATA scoring (INDEX=(qc=(Q code))); /**** create keys ***/
      INPUT Q code $ score;
      n2=_n_;
      CARDS;
   1 a 1
   1 b 2
   1 c 3
   1 d 4
   2 a 4
   2 b 3
   2 c 2
   2 d 1
   3 a 1
   3 b 2
   3 c 3
   3 d 4
   ;;
   PROC PRINT;
   RUN;

Notice the composite index (QC) created for this file. The index allows access to the file by the combined variables Q and CODE. The appropriate record is brought in during a DATA step according to the values of those two variables when the SET statement is keyed on QC.


In the following program the scoring key is consulted for each item for each individual:

   DATA scored;
      SET main;                        /**** select cases from master ***/
      ARRAY qarray (q) q1-q3;
      ARRAY scarray (q) sc1-sc3;
      DO OVER qarray;
         code=qarray;
         SET scoring KEY=qc / UNIQUE;  /**** select scoring record with key ***/
         IF _error_=1 THEN DO;         /**** error = score not found ***/
            scarray=.;
            PUT 'Score Not found ' id= Q= code=;
            _error_=0;
         END;
         ELSE DO;
            scarray=score;
         END;
      END;
      OUTPUT;
      DROP score q code;
   RUN;

Notice that the questions and the scoring variables are arrayed, with the index variable Q referring to the respective question and score; when looping through the array, the first question, Q1, is referenced when Q=1. The DO loop progresses through the questions in the scale for each case. Setting the variable CODE equal to the value of the current array element sets up the other variable needed for accessing the score. A SET statement for the scoring data set is invoked with the KEY= option and the UNIQUE option. The appropriate record from the scoring data set is brought in based on question number and response, and the new variable SCORE holds the scoring for that response. Placing the value in the scoring array (SCARRAY) keeps it available for writing the file when the score of the next question is input. The UNIQUE option directs SAS to start at the beginning of the index when searching for a case in the look-up file that matches the key. The data set produced by this process follows:

   Obs    id    Q1    sc1    Q2    sc2    Q3    sc3
     1     5    a      1     b      3     e      .
     2     6    c      3     d      1     b      2
     3     8    c      3     b      3     a      1
     4     9    d      4     a      4     b      2
     5    10    a      1     a      4     e      .

In printing this data set the response variables and scoring variables were alternated for ease of reading. This method of scoring allows for scores that are not simple reversals. If the instrument is expanded by adding more questions, then once the scoring key is extended only the arrays need to be enlarged with the new question and scoring variables for the program to continue working.
It is possible to use a macro variable to provide the ending question and scoring variable number, for easy expansion of this program to handle larger questionnaires. Additionally, this technique can be used when there are various records to be combined, each keyed on its own variables: for example, a clinical database that has patient information, visit information keyed on patient ID and visit number, surgery records, and laboratory records, each keyed on their own variables.
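The macro-variable expansion mentioned above might be sketched as follows (the macro variable name NQ is an assumption, not from the paper). Only the %LET needs to change when questions are added, provided the scoring key has been extended to cover them:

   %LET nq=3;                          /* ending question number; change to expand */

   DATA scored;
      SET main;
      ARRAY qarray  (q) q1-q&nq;       /* responses q1-q3 when nq=3 */
      ARRAY scarray (q) sc1-sc&nq;     /* scores   sc1-sc3          */
      DO OVER qarray;
         code=qarray;
         SET scoring KEY=qc / UNIQUE;  /* look up score for (Q, CODE) */
         IF _error_=1 THEN DO;
            scarray=.;
            PUT 'Score Not found ' id= q= code=;
            _error_=0;
         END;
         ELSE scarray=score;
      END;
      OUTPUT;
      DROP score q code;
   RUN;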

CONCLUSION
In conclusion, there are a variety of ways of combining data in SAS. Usually one of a small number of DATA step statements suffices: SET, MERGE, or UPDATE. SET is usually used for adding observations to a file, but can also be used for propagating values and for complex table look-ups. If more than one SET statement is used, the values of variables unique to each SET statement are retained while the DATA step loops through the other file. Using the KEY= option on the SET statement looks up records in the data set based on the value of the key variable or variables; this requires that the data set be indexed. MERGE is used to combine files with different variables, for simple table look-ups, and for updating values when the new value is to replace the old even if the new value is missing. More than two data sets may be merged on a single statement. UPDATE replaces values only with new valid values, and only two data sets may be used on an UPDATE statement. To order the cases for combining, use PROC SORT or create indexes with the INDEX= data set option on the DATA statement. Beware of trying to match files where each file contains multiple cases with the same value of the BY variable.


Recapping what you have learned, the following table lists the functions, sub-tasks, statements needed, and required conditions.

   FUNCTION: Adding cases
      To the end of the file
         Statements: DATA, SET
         Conditions: The same variables in each file.
      Interleaving lines of data
         Statements: DATA, SET; BY
         Conditions: Ditto, plus sorted on the same variable.

   FUNCTION: Combining files with the same cases and additional variables
         Statements: DATA, MERGE; BY
         Conditions: Sorted on the same ID variable. Other variables have
                     different names in the two files.

   FUNCTION: Making data corrections
      Forcing corrections
         Statements: DATA, MERGE; BY
         Conditions: Sorted on the same ID variable. Variables to be
                     corrected have the same names in the two files.
      Selective corrections
         Statements: DATA, UPDATE; BY
         Conditions: Sorted on the same ID variable. Variables to be
                     corrected have the same names in the two files.
      Selective corrections and additions
         Statements: DATA, UPDATE; BY
         Conditions: Ditto; the transaction file can contain additional
                     lines of data and variables.

   FUNCTION: Propagating values
      From one file to all of the cases of the other file
         Statements: DATA, SET, SET; IF _N_=1
         Conditions: One line of data in the file read by the conditional SET.
      Adding computed stats to each case
         Statements: DATA, SET, SET, PROC MEANS; IF _N_=1
         Conditions: One line of data in the file read by the conditional SET.

   FUNCTION: Table look-ups
      A simple one-to-many merge
         Statements: DATA, MERGE; BY
         Conditions: One file has unique values on the BY variable.
      Complex
         Statements: DATA, SET, SET; DO, ARRAY, KEY= option
         Conditions: Indexing the look-up file. The look-up file must have
                     unique values of the key variables.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:

   Mel Widawski
   MHW Consulting
   5281 Dobson Way
   Culver City, CA 90230
   Work Phone: (310) 397-4446
   Email: mel@ucla.edu

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

