
Week 9 SET, MERGE and Multiple Operations

Unit 4
SAS for Data Management

Week 9: Introduction to SAS – SET, MERGE, and Multiple Operations

Welcome.

SAS is an excellent software program for manipulating data! In this regard, it is a data

manager’s dream. This reading is a detailed introduction to a variety of SAS techniques for data

management, ranging from the simple definition of missing values to the alteration of the data

structure itself (e.g., creating multiple records from one or creating one record from many).

Goals of Week 9: Introduction to SAS – SET, MERGE, and Multiple Operations

1. To understand the rationale for and be competent in the creation of subsets (defined by variable

selection or by subject selection or both) of SAS data sets;

2. To understand the distinction between concatenating versus merging data sets and to be

competent in these techniques;

3. To appreciate, especially, the use of the MERGE statement in the manipulation of relational

databases;

4. To be competent in the use of conditional expressions such as IF, THEN, and ELSE IF;

5. To appreciate the efficiency of ARRAYS and be competent in their use; and

6. To appreciate the necessity (at times) of planning and outlining in advance the writing of SAS

code.


Week 9 Outline – Introduction to SAS: SET, MERGE and Multiple Operations


Section   Topic

1. Using the SET Statement
   a. How to Save Only a Subset of Variables for Current Use (KEEP, DROP)
   b. How to Save Only a Subset of Subjects for Current Use (IF, DELETE)
   c. How to Create New Variables, Part 1
   d. How to Concatenate Datasets
   e. How to Use the IN Instruction for Identification and Selection
   f. How to Use BY, FIRST., and LAST. for Identification and Selection
   g. How to Use the OUTPUT Statement to Create Several Data Sets at Once
   h. How to Create MULTIPLE Records from One Record
   i. How to Use RETAIN to Create ONE Record from Multiple Records

2. Using the MERGE Statement
   a. How to Combine Data Sets Without Duplication

3. How to Create New Variables, Part 2
   a. Addition, Subtraction, Multiplication, Division, Exponentiation
   b. How to Code Conditional Expressions (IF, THEN, ELSE IF)
   c. How to Code Multiple Instructions using DO and END
   d. How to Code Repetition of Instructions using an ARRAY

4. Illustration


Now you have a SAS data set. It is in SAS format. A variety of techniques are available for the

manipulation of SAS data sets.

Some manipulations of SAS data sets that you might want:

a. Create new variables

b. Make subsets of the original data

c. Restructure the data

d. Combine several data sets

There are three statements for managing SAS data sets within a DATA step: (1) SET, (2) MERGE,

and (3) UPDATE.

Statements for Managing SAS Data Files in a DATA Step

Statement                   What it does
SET file1 file2;            Concatenates a list of files into a single file
MERGE file1 file2;          Combines, record by record, multiple files into a single file
UPDATE master1 newfile1;    Replaces variable values in the MASTER1 file with those
                            from the NEWFILE1 file (records are matched with a BY statement)

In this reading, the SET and MERGE statements are introduced in detail. Not presented is the

UPDATE statement.


1. Using the SET Statement

The SET statement can be used in a SAS DATA step to accomplish a variety of tasks, including (but

not limited to):

1. Create a copy of a SAS data set (for example, a permanent copy)

2. Create a subset of observations or records in a data set

3. Add new variables to a data set

4. Concatenate several data sets

5. Create a subset of variables

6. Identify data sources

7. Rearrange the structure of a data set

8. Combinations of the above.

To illustrate the syntax for a SAS SET statement and some options for application, a series of examples follows.

TIP on the naming of a new data set:

If the same name is used for the new data set as the old data set, the new data set will

overwrite the old version. The old data set is then gone!


1a. How to Save Only a Subset of Variables for Current Use (KEEP and DROP)

Rationale

• It is very often the case that specific analyses involve only a subset of the study variables. Less storage and greater efficiency of the SAS program will be achieved if only the necessary subset of study variables is used.

Example

• This example creates a working copy from a permanent data set, saving only a subset of the

variables for current use.

• If you want to keep only a subset of the variables, you must use either the KEEP or DROP

option.

• Otherwise, the default is in place and all of the variables are kept in the new data set.

• Choose KEEP or DROP – whichever list is shorter

• OR, choose KEEP or DROP – whichever is the listing that you want to be explicitly coded.

*** Example of KEEP dataset option ***;


*** to keep a subset of variables ***;
LIBNAME SDAT 'C:\TEMP';
DATA EXER(KEEP=SID AGE SEX HT WT);
SET SDAT.EXER1;
RUN;
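
The same kind of subset can be written with DROP= when the list of unwanted variables is the shorter one. This is only a sketch: it assumes, hypothetically, that SDAT.EXER1 contains two extra variables named NOTES and VISITDT. Those names are illustrative and do not come from the course data.

```sas
*** Sketch: DROP= data set option, the mirror image of KEEP= ***;
*** NOTES and VISITDT are hypothetical variable names ***;
LIBNAME SDAT 'C:\TEMP';
DATA EXER(DROP=NOTES VISITDT);
  SET SDAT.EXER1;
RUN;
```

Every variable in SDAT.EXER1 other than the two named on DROP= is carried into EXER.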


KEEP (or DROP) can be used in 2 different places in a data step; the syntax differs slightly.

1. As part of the data statement.

The option appears in parentheses ( ) after the name of the new data set, and KEEP= is

followed by a list of variables to be kept in the new dataset.

2. As a separate statement.

• When used as a separate statement in a data step, the word KEEP (or DROP) is followed by a

list of variables to be kept in (or dropped from) the new data set.

• This statement can appear anywhere after the SET statement as long as it appears before the

RUN; statement that ends the step.

• There is no equal sign when used as a separate statement.

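• TIP: A KEEP statement is not executed at the point where it appears; it applies to the whole DATA step, so it can name variables that are created elsewhere in the step. Placing it after the statements that create new variables simply makes the program easier to follow. If you use a RENAME statement to rename a variable, use the OLD name on a KEEP statement, but the NEW name on a KEEP= dataset option.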

*** Example of KEEP statement ***;


*** to keep a subset of variables ***;
LIBNAME SDAT 'C:\TEMP';
DATA EXER;
SET SDAT.EXER1;
KEEP SID AGE SEX HT WT;
RUN;

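TIP: In general, in SAS: KEEP and DROP are terms used to describe actions on variables (columns), while DELETE and the subsetting IF are actions on observations (rows). RETAIN, despite its name, does not select observations; it holds the value of a variable from one observation to the next (see section 1i).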


1b. How to Save Only a Subset of Subjects for Current Use (IF, DELETE)

Rationale

• The rationale is similar to the one given above. Specifically, it is very often the case that specific analyses involve only a subset of the study subjects. Less storage and greater efficiency of the SAS program will be achieved if only the necessary subset of subjects is used.

Example

This example creates a subset of the data using an IF statement. The new data set contains data on

females only; they are selected for inclusion for the reason of having the value ‘F’ for the variable

SEX.

Note: To refer to values of character variables, the values must be enclosed in quotes (single or double).

*** Create a data set called F_EXER. Include only females (SEX=’F’) ***;
LIBNAME SDAT 'C:\TEMP';
DATA F_EXER;
SET SDAT.EXER1;
IF SEX='F'; /* Include in new data set only IF SEX=’F’ */
RUN;


This could also be done using the DELETE statement

IF SEX='M' THEN DELETE;

CAUTION:

• This instruction tells SAS to delete all observations with the value of 'M' for SEX.

• Therefore, observations with a missing value for SEX are still kept in the new data set.

• Thus, IF SEX=’F’ is a better choice.
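
The difference between the two forms is easiest to see on a small test data set that includes a missing value for SEX. The sketch below uses made-up data; the data set names DEMO, F_ONLY and NOT_M are hypothetical and chosen only for this illustration.

```sas
*** Sketch: subsetting IF versus DELETE when SEX has a missing value ***;
DATA DEMO;                       * hypothetical test data;
  INPUT SID SEX $;
  DATALINES;
1 F
2 M
3 .
4 F
;
RUN;

DATA F_ONLY;                     * keeps SID 1 and 4 only;
  SET DEMO;
  IF SEX='F';
RUN;

DATA NOT_M;                      * keeps SID 1, 3 and 4;
  SET DEMO;
  IF SEX='M' THEN DELETE;
RUN;
```

SID 3, whose SEX is missing, ends up in NOT_M but not in F_ONLY, which is the situation the caution describes.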

1c. How to Create New Variables – Part 1

Rationale

• Self explanatory, yes?

Tip: A suggested tip for data set archival and SAS programming is (1) retaining as a

permanent (“raw”) SAS data set, the data set that contains only the source variables and (2)

writing a stand alone program that has as its only tasks labeling, formatting, and the creation

of new variables. I find that this system makes it easier for me to retrace, debug, modify, and

amend my work - cb.

Tip: Be careful about overwriting. While overwriting can save workspace in SAS since fewer data

sets will be kept during a work session – BE CAREFUL about overwriting. In most instances where

you create new variables or modify old ones, you may want to have the old data set around and you

may want to store a separate data set containing the new version of the data, to help keep track of

the sequence of changes you have made.


Example

This example adds a new variable, height in cm, to the data set.

*** Notice that a separate new variable is created and the old still
exists ***;
LIBNAME SDAT 'C:\TEMP';
DATA SDAT.EXER1;
SET SDAT.EXER1;
* CREATE HEIGHT IN CM from HTIN in inches *;
HTCM = HTIN * 2.54;
RUN;

• If you did not want to keep the variable HTIN in the new version of the data set, then the data

statement could read:

DATA SDAT.EXER1(DROP=HTIN);

• Actually, any variables to be dropped could be named in the DROP option, as described in an

earlier section.

• Further details of computing with SAS variables are covered in section 3 ("How to Create New Variables, Part 2").
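
In line with the earlier tip about overwriting, the same computation can be written so that the original permanent data set is left untouched. A minimal sketch; the output name SDAT.EXER1_CM is a hypothetical choice, not a data set from the course materials.

```sas
*** Sketch: add HTCM but save under a new name ***;
*** SDAT.EXER1_CM is a hypothetical data set name ***;
LIBNAME SDAT 'C:\TEMP';
DATA SDAT.EXER1_CM;
  SET SDAT.EXER1;
  * height in cm computed from HTIN in inches;
  HTCM = HTIN * 2.54;
RUN;
```

SDAT.EXER1 remains available as the source version, and SDAT.EXER1_CM records the change.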


1d How to Concatenate Datasets

• To concatenate data sets, list the names of the data sets to be concatenated, one after another

on the SET statement.

• There is no restriction on the number of data sets you can name. For practical purposes, don’t

let it get too large – it can be confusing.

• If the variables are not in the same order in the 2 data sets, the order will be determined by the

first-named data set, with any new variables in subsequent data sets listed after all those from

the first data set.

Rationale

• It is sometimes of interest to pool “like” data sets into a single large data set.

• An example is pooling site specific data sets from a multi-center clinical trial into a single

analytic data set.

Example

• This example concatenates two working data sets into a single permanent data set.

• When concatenating files, the number of observations in the new dataset is the sum of the

number in the original datasets.

• All of the variables in the two data sets will be contained in the new one (unless a KEEP

or DROP option or statement is used).

• If the two original data sets do not contain the same variables, then observations from the first

data set will have missing values for the variables contributed by the second, and vice versa.


• The default missing values of a period ‘.’ for numeric data, and a blank for character data will be

used when missing values are created during data processing.

*** concatenating 2 datasets ***;


LIBNAME SDAT 'C:\TEMP';
DATA SDAT.EXER1_2;
SET EXER1 EXER2;
RUN;

• If the data set EXER1 contained the variables SID, AGE, HT and WT, while the data set EXER2

contained SID, HT, and WT then all records from EXER2 in the new data set EXER1_2 would

be assigned a missing value (.) for AGE.
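
The behavior described above can be checked on two tiny work data sets. This is a sketch with invented values; only the variable names SID, AGE, HT and WT come from the text.

```sas
*** Sketch: concatenating data sets that do not share all variables ***;
DATA EXER1;                      * made-up values;
  INPUT SID AGE HT WT;
  DATALINES;
1 34 170 65
2 41 182 80
;
RUN;

DATA EXER2;                      * no AGE variable here;
  INPUT SID HT WT;
  DATALINES;
3 165 58
;
RUN;

DATA EXER1_2;
  SET EXER1 EXER2;
RUN;

PROC PRINT DATA=EXER1_2;
RUN;
```

The listing should show three observations, with AGE printed as a period for the record contributed by EXER2.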

1e How to use the IN instruction for Identification and Selection

Rationale

When concatenating several data sets into one, you may want to keep track of an observation’s

origins. For example, in pooling the site specific data from a multi-center clinical trial, the investigator

might well want to retain information on study site.


Guidelines for Using the IN Instruction.

• SAS “IN” variables are variables that SAS creates automatically within a data step and are

saved ONLY for the duration of the data step.

• The IN variables are indicator variables, taking the value of 1 if the data comes from the

specified data set, and 0 otherwise.

• “IN” variables can be used to identify data sources. For example when concatenating data from

two hospitals, a new variable HOSPID can be created to identify the source hospital, by use of

the IN variable, as in the next example.

• How to use IN variables: For each input data set named on a SET statement, an IN variable

can be named. The IN variable for a source data set is named by the phrase

(IN=invariablename) after the data set name on the SET statement.

Example

In this example two data sets, one from each of two hospitals, are pooled. The IN= option keeps

track of the source hospital data set for each observation. This, in turn, is used to create a HOSPID

variable that retains permanently this source hospital identification.

*** using IN variable to identify ***;


*** data source when concatenating files ***;
LIBNAME SDAT 'C:\TEMP';
DATA SDAT.HOSP1_2;
SET HOSP1(IN=INH1)
HOSP2(IN=INH2);
IF INH1=1 THEN HOSPID=1;
ELSE IF INH2=1 THEN HOSPID=2;
RUN;


• Data is read from the input data sets (HOSP1 and HOSP2) in the order in which they are listed.

As an observation is read from HOSP1, the variable INH1 is assigned the value 1, and the

variable INH2 is assigned the value 0. These variables are then used to define the values of

the new variable, HOSPID.

• The IN variables are not saved with the data set, and can only be used for the duration of

the data step.

• NOTE on the meaning of indicator (“logical”) variables: The phrase IF INH2=1 is

equivalent to saying IF INH2. The abbreviated form is permitted because indicator variables

defined as 0/1 are logical variables. A logical variable is one that has value 1 when the

accompanying condition is true and has the value 0 otherwise.
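
The abbreviated form can be tried on a self-contained version of the hospital example. In this sketch the variable PATID and its values are hypothetical stand-ins for whatever the hospital files actually contain.

```sas
*** Sketch: IN variables used as logical variables ***;
DATA HOSP1;                      * hypothetical patient IDs;
  INPUT PATID;
  DATALINES;
11
12
;
RUN;

DATA HOSP2;
  INPUT PATID;
  DATALINES;
21
;
RUN;

DATA HOSP1_2;
  SET HOSP1(IN=INH1) HOSP2(IN=INH2);
  IF INH1 THEN HOSPID=1;         * shorthand for IF INH1=1;
  ELSE IF INH2 THEN HOSPID=2;    * shorthand for IF INH2=1;
RUN;
```

HOSPID is saved in HOSP1_2, while INH1 and INH2 are not.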

1f How to Use BY, FIRST., and LAST. for Identification and Selection

• Often, a data set will contain multiple records for each subject. In this situation, it may be of interest to select one or more particular records for a given individual (e.g., the baseline record from among a longitudinal series of records for that individual).

• Using the BY statement together with the FIRST. and LAST. variables, it is possible to select a subset of observations that satisfy a particular condition.

• Consider a study where everyone should have 2 records: a record from a baseline visit and a record from a follow-up visit. We might want to find which subjects are missing one of the 2 visits. Such subjects would contribute only one record in the data set. When the data have been sorted by a variable, in this case the subject ID, a SET statement followed by a BY statement can be used. The BY statement names the variable used to group observations; in this case, grouping is by subject.


Guidelines for Using SET, BY, and FIRST. and LAST.

• When SET and BY statements are used together, SAS creates for you (automatically) two

temporary indicator variables: FIRST.variablename and LAST.variablename, where

variablename is the name of the variable named in the BY statement.

• For each set of multiple records for a given individual, the variable FIRST.variablename has the value 1 for the first occurrence of each value of variablename. It has the value 0 for all other records for that individual.

• Similarly, for each set of multiple records for a given individual, the variable LAST.variablename has the value 1 for the last occurrence of each value of variablename. It has the value 0 for all other records for that individual.

• Thus, FIRST.variablename and LAST.variablename are simultaneously equal to 1 ONLY if

there is only ONE record for that subject. That is, the observation is both the first and last

occurrence of that value.
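
One common use of these guidelines is picking out the earliest record for each subject. The sketch below assumes a hypothetical work data set VISITS with the variables SID, VISIT and SCORE; sorting by SID and then VISIT puts the earliest visit first within each subject.

```sas
*** Sketch: keep the first record per subject with FIRST.SID ***;
DATA VISITS;                     * hypothetical test data;
  INPUT SID VISIT SCORE;
  DATALINES;
2 2 54
1 1 87
3 1 62
2 1 77
3 2 77
;
RUN;

PROC SORT DATA=VISITS;
  BY SID VISIT;
RUN;

DATA BASELINE;
  SET VISITS;
  BY SID;
  IF FIRST.SID;                  * first record within each SID;
RUN;
```

BASELINE holds one record per subject. Replacing FIRST.SID with LAST.SID would keep the latest visit instead.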

Example - Consider again the study where everyone should have 2 records – a record from a

baseline visit and a follow-up visit. As part of our data quality assessment, we want to find which

subjects contribute only one record instead of the intended two records. In this example, the data

have been sorted by the subject identification variable (SID). Once sorted, a SET statement followed

by a BY statement is used. The BY statement names the variable used to form the groups of

observations. In this example, we want to group the records for each individual subject (SID).


** example using FIRST. and LAST. Variables **;


** step 1: create data set with some repeat subjects (SID) **;
DATA TEMP1;
INPUT SID VISIT SCORE;
CARDS;
01 1 87 /* SID=1 contributes ONE record */
02 1 77 /* note that SID=2 contributes TWO records */
03 1 62 /* ditto SID=3 */
02 2 54
03 2 77
04 2 21 /* SID=4 contributes ONE record */
;
RUN;

** Sort by subject ID variable **;


PROC SORT DATA=TEMP1;
BY SID;
RUN;
** Create subset with only 1 visit **;
DATA ONEVISIT;
SET TEMP1;
BY SID;
IF FIRST.SID=1 AND LAST.SID=1 then output;/* select only those with one visit */
RUN;

PROC PRINT DATA=ONEVISIT;


ID SID;
TITLE1 'SUBJECTS WITH ONLY ONE VISIT';
RUN;
** create subset with 2 visits **;
DATA TWOVISIT;
SET TEMP1;
BY SID;
IF FIRST.SID=0 or LAST.SID=0; * selects with two visits;
RUN;
PROC PRINT DATA=TWOVISIT;
ID SID;
TITLE1 'BOTH VISITS FOR SUBJECTS WITH 2 VISITS';
RUN;


Note:

IF FIRST.SID=1 AND LAST.SID=1 then output;

Is the SAME as the instruction

IF FIRST.SID AND LAST.SID;

• Which would you rather have? Readability or brevity? It depends on how comfortable you are!

The resulting output would be:


SUBJECTS WITH ONLY ONE VISIT

SID VISIT SCORE


1 1 87
4 2 21

BOTH VISITS FOR SUBJECTS WITH 2 VISITS

SID VISIT SCORE


2 1 77
2 2 54
3 1 62
3 2 77


Let’s analyze how this works. Once the data were in sorted order by SID they would look like this:

01 1 87
02 1 77
02 2 54
03 1 62
03 2 77
04 2 21

• Once the SET and BY statements are given, SAS creates values for the FIRST. and LAST. variables.

• The values of these variables are given below. FIRST.SID has the value 1 at the first

occurrence of each value of SID, and 0 otherwise. LAST.SID has the value 1 at the last

occurrence of each value of SID, and 0 otherwise.

• When both FIRST.SID and LAST.SID are 1 this is the only occurrence of that value of SID. If

either FIRST.SID or LAST.SID is 0, there must be more than one occurrence of that value of

SID in the data set.

SID VISIT SCORE FIRST.SID LAST.SID


01 1 87 1 1

02 1 77 1 0
02 2 54 0 1

03 1 62 1 0
03 2 77 0 1

04 2 21 1 1


TIP – How to Check for Duplicate Records

First. and Last. variables are useful to check for duplicate records that may have occurred

inadvertently in data entry. The following lines could be used to look for duplicates:

** check for duplicate records for each patient identified by PATID **;
PROC SORT DATA=D1;
BY PATID;
RUN;
DATA DUPS;
SET D1;
BY PATID;
IF FIRST.PATID=0 or LAST.PATID=0; *selects obs with repeat of PATID;
RUN;

PROC PRINT DATA=DUPS;


TITLE1 'Duplicate PATient ID numbers';
RUN;

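Note the use of OR in the selection statement: an observation is selected if it is not the first, or not the last, record for its PATID, so every record belonging to a repeated PATID is listed.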


1g How to Use the OUTPUT Statement to Create Several Data Sets at Once

Idea

• Provide the names of the multiple data sets in a single data step.

• Then, use one or more OUTPUT statements, each naming the data set to write to and each governed by the condition that the current record must satisfy in order to be output, or "written", to that data set.

Example

A data set contains 3 records for each subject such that each record corresponds to a different day.

Wanted instead are 3 separate data sets, where each data set corresponds to a given day.

*** Example creating multiple data sets ***;


*** in one data step ***;
LIBNAME SDAT 'C:\TEMP';
DATA EXER1 EXER2 EXER3; * name 3 datasets in 1 step;
SET SDAT.EXER_ALL;
IF DAY=1 THEN OUTPUT EXER1; * EXER1 will contain day 1 records only;
ELSE IF DAY=2 THEN OUTPUT EXER2; * EXER2 will contain day 2 records only;
ELSE IF DAY=3 THEN OUTPUT EXER3; * EXER3 will contain day 3 records only;
RUN;


1h How to Create MULTIPLE records from one record.

Rationale

• This manipulation is of interest when repeated measurements are collected in one record and you want to use repeated measurements analysis of variance software (SAS is one such program) that requires the repeated measures to be in separate records.

Example

• In this example, each subject has measurements (SCORE1 - SCORE3) from three occasions

and they are all listed in the same record.

• To make separate observations, 3 for each subject, the following statements could be used:

*** Example creating multiple observations from a single observation ***;


*** Original data has repeat values for each subject ***;
DATA TEST;
INPUT SID $ SCORE1 SCORE2 SCORE3;
CARDS;
A 99 98 97
B 88 87 85
;
RUN;
PROC PRINT DATA=TEST;
TITLE 'LIST OF SCORES BY SUBJECT: ORIGINAL DATA';
RUN;

*** define 2 new vars: ***;


*** score = score regardless of timing ***;
*** status defines timing of score ***;
DATA MULT(KEEP=SID SCORE STATUS);
SET TEST;
SCORE=SCORE1; STATUS=1; OUTPUT; *set score=1st score-write obs to file;
SCORE=SCORE2; STATUS=2; OUTPUT;
SCORE=SCORE3; STATUS=3; OUTPUT; * repeat for each score ;
RUN;
PROC PRINT DATA=MULT;
TITLE 'LIST OF SCORES BY STATUS: NEW VERSION';
RUN;


The resulting output would be:

LIST OF SCORES BY SUBJECT: ORIGINAL DATA

OBS SID SCORE1 SCORE2 SCORE3


1 A 99 98 97
2 B 88 87 85

LIST OF SCORES BY STATUS: NEW VERSION

OBS SID SCORE STATUS


1 A 99 1
2 A 98 2
3 A 97 3
4 B 88 1
5 B 87 2
6 B 85 3

Note on the Use of the OUTPUT statement –

• The OUTPUT statement causes a new observation to be written at the point where it appears. Thus, in this example, the OUTPUT statement appears three times, so that three observations are created for each one observation that is read in.

• If the OUTPUT statement had NOT been used, one observation would have been written to

the new data set for each one observation read in from the source data set.
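
A conditional OUTPUT shows the other side of this note: once an OUTPUT statement appears anywhere in a DATA step, SAS no longer writes an observation automatically at the end of each pass, so only the records that reach an OUTPUT statement are saved. A small sketch with made-up data; the data set names ONE and HIGH and the cutoff of 80 are illustrative assumptions.

```sas
*** Sketch: an explicit OUTPUT replaces the automatic one ***;
DATA ONE;                        * hypothetical test data;
  INPUT SID SCORE;
  DATALINES;
1 50
2 90
;
RUN;

DATA HIGH;
  SET ONE;
  IF SCORE >= 80 THEN OUTPUT;    * only SID 2 is written;
RUN;
```

HIGH contains a single observation, because SID 1 never reaches the OUTPUT statement.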


1i How to Use RETAIN to Create ONE Observation from Multiple Records

Rationale -

• Sometimes, it is the reverse that is desired. You start with one data set with several records

per subject, and you wish to have one record per subject.

• An example is when you wish to produce tables of descriptive statistics for the repeated

measures.

Example –

Beginning with the data set MULT that contains multiple records for each subject, you wish to produce a data set, TEST2, that contains just one record for each subject (re-creating the structure of the original data set TEST). This can be accomplished by the following:

** use SORT to get all data from each subject together **;
PROC SORT DATA=MULT;
BY SID;
RUN;

** set BY SID, so FIRST.SID and LAST.SID indicators can be used;
DATA TEST2(KEEP=SID SCORE1 SCORE2 SCORE3);
SET MULT;
BY SID;

** Retain 3 new score variables to hold values **;
RETAIN SCORE1 SCORE2 SCORE3 .;

** Reset at the first record of each subject so that a subject **;
** with a missing STATUS record does not inherit earlier values **;
IF FIRST.SID THEN CALL MISSING(SCORE1, SCORE2, SCORE3);

** Assign 1st score when status is 1, etc. **;
IF STATUS=1 THEN SCORE1=SCORE;
ELSE IF STATUS=2 THEN SCORE2=SCORE;
ELSE IF STATUS=3 THEN SCORE3=SCORE;

** Now write data to output data set after last obs **;
** for the subject is read **;
IF LAST.SID THEN OUTPUT;
RUN;

PROC PRINT DATA=TEST2;
TITLE 'TEST2: ONE OBS PER SUBJECT, RECREATED';
RUN;

The Idea of the RETAIN Statement

• The RETAIN statement is a placeholder; it is used to retain or hold the value of the variable(s)

from the previous record.

• Since three records from the input data set must be read to create one observation in the output

data, a RETAIN statement is required.

• Without the RETAIN statement, the value of a variable created in the DATA step is automatically reset to missing before a new observation is read. (Try this example without the RETAIN statement to see the difference.)

• The BY statement must be used so that the LAST. variable is available.

• An OUTPUT instruction is given ONLY AFTER all the observations from a subject have been

read.
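
The effect of RETAIN can be seen side by side in a small step that assigns two new variables under the same condition but retains only one of them. This is a sketch with invented data; the names SCORES, DEMO, KEPT and NOTKEPT and the cutoff of 90 are hypothetical.

```sas
*** Sketch: a retained variable versus an ordinary new variable ***;
DATA SCORES;                     * hypothetical test data;
  INPUT SID $ SCORE;
  DATALINES;
A 99
A 85
B 88
;
RUN;

DATA DEMO;
  SET SCORES;
  RETAIN KEPT;                   * KEPT holds its value across observations;
  IF SCORE > 90 THEN DO;
    KEPT = SCORE;
    NOTKEPT = SCORE;             * NOTKEPT is reset to missing each time;
  END;
RUN;

PROC PRINT DATA=DEMO;
RUN;
```

Only the first observation satisfies the condition. KEPT carries the value 99 forward to the second and third observations, while NOTKEPT is missing on both.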


2 Using the MERGE Statement

2a How to Combine Data Sets Without Duplication

Rationale

• We have just seen how to use the SET statement to concatenate files. It was mentioned that in

concatenating files, subjects that appear in multiple files will contribute multiple records. Thus,

concatenation can be viewed as just “piling together” the separate sets of data. We may wish

to be more selective than just “piling together”.

• For example, in the use of relational databases, it is often of interest to join together data from

two (or more) files where the two (or more) are linked by one or more common variables.

• Consider a relational database that includes one data file containing baseline information, and

another data file with follow-up information on the same subjects. Alternatively, one file may

contain data from a questionnaire, and another file clinical data on the same patients.

• Thus, it is often of interest to merge files to produce a single output file in which a single observation or record is created for each subject.

• If a SET statement were used to concatenate the files, subjects that were included in each file

would appear twice – have 2 records – in the combined file, one contributed from each file. To

avoid this duplication, files are combined using a MERGE statement.


Guidelines for the Use of the MERGE Statement

• To combine the two (or more) files we first identify a common (“linking”) variable in the files.

• The data must then be sorted (PROC SORT) within each file according to the common

(“linking”) variable.

• The data sets are then MERGED, where the linking variable is specified with a BY statement.

In the example that follows, the common variable is the study ID, SID.

• Inclusion of the BY statement naming the linking variable is crucial. Without this, the first

record from the first data set is matched with the first record from the second data set, and so

on, without regard to the subject. Warning - You will not see an error message if you

forget a BY statement – in fact the program will run just fine. Thus, if you find mismatches in

your data – check that the proper SORT and BY statements were used.
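
The warning about a forgotten BY statement can be reproduced on two small data sets. This sketch uses made-up weights; the data set names BASE, FOLLOW, WRONG and RIGHT and the variables WT1 and WT2 are hypothetical. The test data are entered already in SID order, so PROC SORT is omitted here.

```sas
*** Sketch: MERGE without and with a BY statement ***;
DATA BASE;                       * hypothetical baseline data;
  INPUT SID WT1;
  DATALINES;
101 150
102 160
103 170
;
RUN;

DATA FOLLOW;                     * subject 102 has no follow-up;
  INPUT SID WT2;
  DATALINES;
101 148
103 172
;
RUN;

DATA WRONG;                      * no BY statement, rows paired by position;
  MERGE BASE FOLLOW;
RUN;

DATA RIGHT;                      * BY SID, rows paired by subject;
  MERGE BASE FOLLOW;
  BY SID;
RUN;
```

In WRONG, the second record pairs the baseline weight of subject 102 with the follow-up weight of subject 103, and no error is reported. In RIGHT, subject 102 keeps its own record with WT2 missing.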


Example -
*************************************;
*** Example using MERGE statement ***;
*************************************;
LIBNAME SDAT 'C:\temp';
* sort (by linking variable SID) and print data from baseline visit *;
PROC SORT DATA=SDAT.EXER1;
BY SID;
RUN;
PROC PRINT DATA=SDAT.EXER1;
TITLE1 'BASELINE DATA';
RUN;
* sort (by linking variable SID) and print data from follow-up visit;
PROC SORT DATA=SDAT.EXER2;
BY SID;
RUN;
PROC PRINT DATA=SDAT.EXER2;
TITLE1 'FOLLOW-UP DATA';
RUN;
** merge files matching on SID **;
DATA SDAT.EXALL1;
MERGE SDAT.EXER1
SDAT.EXER2;
BY SID;
RUN;
PROC PRINT DATA=SDAT.EXALL1;
TITLE1 'MERGED DATA';
RUN;

The resulting files are listed below:

BASELINE DATA
OBS SID VDATE1 DOB
1 101 06/12/90 01/27/84
2 102 06/12/90 10/11/87
3 103 06/13/90 05/09/87


FOLLOW-UP DATA
OBS SID VDATE2 DOB
1 101 08/15/90 01/27/84
2 102 08/15/90 11/10/87
3 104 08/17/90 04/22/85

MERGED DATA
OBS SID VDATE1 DOB VDATE2
1 101 06/12/90 01/27/84 08/15/90
2 102 06/12/90 11/10/87 08/15/90
3 103 06/13/90 05/09/87 .
4 104 . 04/22/85 08/17/90

• Notice that the SIDs represented in the baseline data are 101, 102 and 103, whereas the SIDs represented in the follow-up data are 101, 102 and 104. The created (merged) data set has all four SIDs represented: 101, 102, 103 and 104. However, there are some missing values.

• The combined file contains one record for each of the records from the individual files, where

records appearing in both files have been merged, and are recorded as a single record with a

complete set of variables from both of the original files. When a record does not have a match

in the other file, the values for variables from the other file are set to missing values.

• Yuck: DOB (date of birth) for SID=102 has different values in the baseline and follow-up data sets.

• IMPORTANT NOTE - When merging files that contain the same variables (with the same variable names), the value from the LAST-NAMED data set will be the value retained in the merged data set. Thus, be careful of the order in which data sets are named on the MERGE statement. In the above example, birthdate (DOB) is recorded at both visits with the same variable name, and the value from the follow-up data file is retained (see SID 102, where the DOB values don't match).

• TIP - When merging 2 files with the same variable names, careful planning is required to

retain important information in the merged data set.

a. If you have values for the same variable, for example weight measured at two

occasions, make sure the data does not have the same variable name in both data sets

if you wish to retain both values in the new data set.

b. If the variables have the same name, such as WT, and you wish to retain both weight

variables in the merged data (for example to compute weight loss or gain), you must

rename (at least) one of them in a DATA step prior to the merge. This can be done with

a RENAME statement: RENAME oldname = newname; for example:

RENAME WT = WT1;

The advantage of using rename, rather than simply creating a new variable and dropping

the old one (WT1 = WT; DROP WT; ), is in retaining all the variable attributes (such

as length, formatting, labeling) that were previously set, when rename is used.
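
As an alternative to a separate DATA step with a RENAME statement, the renaming can also be done with the RENAME= data set option on the MERGE statement itself. The sketch below is an illustration with invented data; the data set names VISIT1, VISIT2 and BOTH and the variable WTCHANGE are hypothetical.

```sas
*** Sketch: keep both weights by renaming during the merge ***;
DATA VISIT1;                     * hypothetical first-visit weights;
  INPUT SID WT;
  DATALINES;
101 150
102 160
;
RUN;

DATA VISIT2;                     * hypothetical second-visit weights;
  INPUT SID WT;
  DATALINES;
101 148
102 163
;
RUN;

DATA BOTH;
  MERGE VISIT1(RENAME=(WT=WT1))
        VISIT2(RENAME=(WT=WT2));
  BY SID;
  WTCHANGE = WT2 - WT1;          * change in weight between visits;
RUN;
```

Because each WT is renamed as it is read, neither value overwrites the other, and the source data sets themselves are not changed.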

As with the SET statement, several variables can be automatically created with the MERGE statement.

"IN" variables can be defined that indicate whether the current record was available in a particular data set. FIRST. and LAST. variables can also be created that indicate the first and last occurrence of a BY variable value. These variables can be used to specially tailor the resulting data set. The IN variables and FIRST. or LAST. variables are available during a DATA step, but are not added to the output data set. Their values can be saved in the output data set only by defining new variables from them in the data step. These automatic variables are all indicator variables: they have a value of "1" when the condition is true, and "0" otherwise. Note that FIRST. and LAST. variables are created for each variable named in the BY statement that is used in the DATA step.
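
Saving the information carried by IN variables takes one assignment per flag. A minimal sketch, assuming two hypothetical data sets BASE and FOLLOW that share the linking variable SID; the flag names IN_BASE, IN_FOLLOW and IN_BOTH are also illustrative.

```sas
*** Sketch: copy temporary IN variables into saved variables ***;
DATA BASE;                       * hypothetical baseline data;
  INPUT SID X;
  DATALINES;
101 5
102 7
;
RUN;

DATA FOLLOW;                     * hypothetical follow-up data;
  INPUT SID Y;
  DATALINES;
102 8
104 9
;
RUN;

DATA COMBINED;
  MERGE BASE(IN=INB) FOLLOW(IN=INF);
  BY SID;
  IN_BASE   = INB;               * 1 if SID appears in BASE;
  IN_FOLLOW = INF;               * 1 if SID appears in FOLLOW;
  IN_BOTH   = (INB AND INF);     * 1 only when both files contribute;
RUN;
```

The three new variables are ordinary variables and are written to COMBINED, so later steps can subset on them.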

Example – Using the MERGE statement and IN Variables

• In the study of warm/cold cardiopulmonary bypass and neurologic function, baseline test data of

various sources was entered in several files for all individuals tested at baseline.

• However several subjects were later found to be ineligible for the study, due to complications

during surgery, or mistakes made in reviewing patient history.

• Rather than try and delete each such case individually from each of the study files, it is simpler

to keep a master "inclusion" file. Only those subjects whose names and ID numbers are kept

in the latest version of the inclusion file, i.e., those deemed eligible, are used in data analyses.

*** program to merge inclusion file ***;


*** with psych data file, keeping ***;
*** only subjects in inclusion file ***;
LIBNAME SDAT 'C:\temp';

* sort data from include file- This file contains the deemed eligible;
PROC SORT DATA=SDAT.INCLUDE1;
BY PATID;
RUN;

* sort data from psych testing;


PROC SORT DATA=SDAT.PSYCH1;
BY PATID;
RUN;


** merge by PATID, and keep only those patients deemed eligible**;


** these are the subjects listed in INCLUDE file **;
DATA PSYCH1;
MERGE SDAT.INCLUDE1(IN=INCL1)
SDAT.PSYCH1;
BY PATID;
IF INCL1 = 1;
RUN;
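One point worth checking in a merge like this: IF INCL1 = 1 keeps every subject on the inclusion list, including any who have no record in the psych file (their psych variables will simply be missing). The sketch below, using small made-up data sets (ELIG, PSY, PSY_OK and NO_PSY are hypothetical names), uses a second IN variable to separate eligible subjects with psych data from eligible subjects without it.

```sas
* hypothetical inclusion list, already in PATID order;
DATA ELIG;
  INPUT PATID;
  DATALINES;
28
65
74
;
RUN;

* hypothetical psych scores, none for subject 74 and subject 99 is ineligible;
DATA PSY;
  INPUT PATID INF;
  DATALINES;
28 7
65 10
99 5
;
RUN;

DATA PSY_OK NO_PSY;
  MERGE ELIG(IN=INCL1) PSY(IN=INPSY);
  BY PATID;
  * eligible and tested;
  IF INCL1 AND INPSY THEN OUTPUT PSY_OK;
  * eligible but no psych record found;
  ELSE IF INCL1 THEN OUTPUT NO_PSY;
RUN;
```

Here PSY_OK receives subjects 28 and 65, NO_PSY receives subject 74, and subject 99 is written to neither data set.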


Example - Sometimes, combinations of subsetting, concatenating and merging files are used to

create analysis data sets.

• In the cardiopulmonary bypass study, in order to summarize the pre-surgical test results, a

subset of variables (totals, rather than subscores) was selected from each data file.

• In addition, a subset of records, the baseline or pre-surgery records were selected.

• These were then merged with the "inclusion" file to keep only data for eligible patients.

• Summary baseline reports were based on final files created in this manner, for neurologic,

neuropsych, and delirium testing.

LIBNAME SDAT 'C:\temp';


************************************************************************;
** SELECT TOTALS (Nscore,Mscore), ID, AND DATE VARIABLES **;
** KEEP PRE-TEST DATA ONLY (PSTATUS=1) **;
************************************************************************;
DATA N1(KEEP=PATID NSCORE MSCORE NDATE);
SET SDAT.NEURO1;
IF PSTATUS=1; * keep pre-test data only;
RUN;

PROC SORT DATA=SDAT.INCLUDE1; * must sort data from include file;


BY PATID;
RUN;
PROC SORT DATA=N1; * must sort data from neuro testing;
BY PATID;
RUN;
**************************************************;
** SAVE NEURO TOTALS FOR ELIGIBLE PATIENTS **;
**************************************************;
DATA SDAT.NBASE1(LABEL='BASELINE NEURO TOTALS');
MERGE SDAT.INCLUDE1(IN=INCL1)
N1;
BY PATID;
IF INCL1;
RUN;


(further steps to produce summary statistics)


3 How to Create New Variables – Part 2

Very often your data as brought into SAS from data entry is “raw”. It is not in the final form necessary

for summarization and analysis (“analytic”). Typically, you will need to make subsets of the data and

merge files, as described above. You will also want to compute or create new variables. New

variables can be created in any data step, after the INPUT or SET or MERGE statement. However,

these instructions MUST be placed before the RUN statement (or before the CARDS statement for

instream data).

3a Addition, Subtraction, Multiplication, Division, Exponentiation

New numeric variables can be created by

• adding (+)

• subtracting (-)

• multiplying (*)

• dividing (/)

• exponentiating (**)

existing variables by constants or by other variables. New variables can also be created using IF -

THEN conditional statements, and by combinations of these.


• When defining a new variable the name of the new variable is always given on the left

side of the equal sign, and the expression or value on the right side of the equal sign.

• The statement is evaluated for each observation, so that a new variable is created with a value

for each record, according to the given expression.

• Complex combinations of operations can be used following standard conventions for order of

operations. Use parentheses – ( ) – to be explicit about order of operations. Make sure that

every open parenthesis – ( – is paired with a close parenthesis – ). Some examples follow.

Examples -

HTCM = HT * 2.54; * create ht in cm from ht in inches;

WTKG = WT / 2.2; * create wt in kg from wt in pounds;

STD = VAR ** .5; * compute std dev as sq root of variance;

MSCORE=(SCORE1 + SCORE2 + SCORE3)/3; * compute average of 3 scores;

AGE = (VDATE1 - DOB)/365.25 ; * compute age in years at visit;

INTCPT = 1; * define a constant for all observations;

• In the above examples, if any of the original variables (on the right side of the equation) has a

missing value, then the value of the new variable will be missing also, except for the last

example, where the value of INTCPT is 1 for all observations.


NOTE - Other Variable Creation Techniques

Computation of new variables for each observation can also be done using SAS functions. The

functions handle missing data in a different way than standard computation, giving more options to

the user. Functions will be discussed in a later section.
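A small sketch of the difference, using made-up scores (the data set SCORES and the names MSCORE_A and MSCORE_F are hypothetical): the arithmetic expression returns missing as soon as one score is missing, while the MEAN function averages whatever scores are present.

```sas
DATA SCORES;
  INPUT ID SCORE1 SCORE2 SCORE3;
  * arithmetic operators give missing if any operand is missing;
  MSCORE_A = (SCORE1 + SCORE2 + SCORE3) / 3;
  * the MEAN function averages the non-missing arguments;
  MSCORE_F = MEAN(SCORE1, SCORE2, SCORE3);
  DATALINES;
1 6 8 10
2 6 . 10
;
RUN;

PROC PRINT DATA=SCORES;
RUN;
```

For subject 2, MSCORE_A is missing while MSCORE_F is 8, the average of 6 and 10. Which behavior is appropriate depends on the analysis.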


3b How to Code Conditional Expressions (IF, THEN, ELSE IF)

Rationale –

• Suppose you want to perform an operation or group of operations for some observations but

not for all. IF - THEN statements are used in this situation.

• A few examples have already been given in the sections on SET and MERGE statements.

Examples of valid conditional statements follow.

• Note once again: to refer to specific values of character variables, the values must be

enclosed in single quotes.

IF SEX = 'F'; * retains cases with value F for SEX;

IF SEX = 'M' THEN DELETE; * deletes cases with value M for SEX;

IF AGE < 0 THEN DELETE; * deletes cases with missing or invalid AGE;

IF SEX = 'F' AND AGE > 0; * retains cases with valid AGE and SEX value F;

IF SEX = 'F' OR SEX = 'M'; * retains cases with valid SEX ;

New variables can also be defined with conditional statements.

For example to create a variable AGEGR to represent age groups, the following set of statements

can be used:

IF 0< AGE<20 THEN AGEGR=1;


ELSE IF 20<=AGE<40 THEN AGEGR=2;
ELSE IF 40<=AGE<60 THEN AGEGR=3;
ELSE IF 60<=AGE THEN AGEGR=4;


Use of the ELSE IF, while not required, means that the data will be evaluated more efficiently, and

can save processing time in large data sets. Once a condition has been met for an observation, the

subsequent ELSE IF statements are not evaluated for that observation. For example, for an 18 year

old, the first statement is evaluated, the value ‘1’ assigned for AGEGR, and the subsequent

statements bypassed since the condition was met. For a 45 year old the first 3 statements are

evaluated. If you know that most of your sample falls into one group, putting that conditional

statement first will be the most efficient way to create a new variable. Be careful using ELSE IF

when your condition is based upon more than one variable – it can be tricky.
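The AGEGR statements can be tried out in a complete DATA step such as the sketch below (the data set AGES and its values are made up). It also shows why the first condition is written as 0 < AGE < 20 rather than AGE < 20: SAS treats a missing value as smaller than any number, so a plain AGE < 20 would place subjects with missing age in group 1.

```sas
DATA AGES;
  INPUT ID AGE;
  IF 0 < AGE < 20 THEN AGEGR = 1;
  ELSE IF 20 <= AGE < 40 THEN AGEGR = 2;
  ELSE IF 40 <= AGE < 60 THEN AGEGR = 3;
  ELSE IF 60 <= AGE THEN AGEGR = 4;
  DATALINES;
1 18
2 45
3 72
4 .
5 -3
;
RUN;

PROC PRINT DATA=AGES;
RUN;
```

Subjects 4 (missing age) and 5 (invalid age) satisfy none of the conditions, so AGEGR stays missing for them.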

New character variables can also be defined using conditional statements.

Tip – Be careful: the length of a character variable is determined by the first value defined in

the IF-THEN sequence. If longer values are subsequently given, they will be truncated to the length

of the first value named. In the next example, LIGHT, MODERATE, and HEAVY smokers are defined

from the number of cigarettes smoked per day (CIGSDAY).

** creating ordinal groups for smoking **;


IF 0<CIGSDAY<=10 THEN HSMOKE='LIGHT';
ELSE IF 10<CIGSDAY<=24 THEN HSMOKE='MODERATE';
ELSE IF 24<CIGSDAY THEN HSMOKE='HEAVY';

In this example, the value MODERATE would be truncated to 5 characters (MODER) to match the

length of LIGHT. To avoid this, the first named value should be padded with blanks, i.e.,

HSMOKE='LIGHT   ';


How to Ensure that Created Character Variables Have the Desired Length

Use a LENGTH statement before the variable is first assigned to set its length. The statement

LENGTH HSMOKE $ 8;

placed before the variable HSMOKE is defined allows all values to be stored without truncation.
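Put together, a complete step might look like the following sketch. The data set SMOKE_C and its three records are made up for illustration.

```sas
DATA SMOKE_C;
  * set the length before HSMOKE is first assigned;
  LENGTH HSMOKE $ 8;
  INPUT CASEID CIGSDAY;
  IF 0 < CIGSDAY <= 10 THEN HSMOKE = 'LIGHT';
  ELSE IF 10 < CIGSDAY <= 24 THEN HSMOKE = 'MODERATE';
  ELSE IF 24 < CIGSDAY THEN HSMOKE = 'HEAVY';
  DATALINES;
1 5
2 20
3 40
;
RUN;

PROC PRINT DATA=SMOKE_C;
RUN;
```

With the LENGTH statement in place, case 2 is stored as MODERATE in full. One side effect to be aware of is that HSMOKE becomes the first variable in the data set, because it is the first variable named in the step.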

A feature of SAS to be aware of when using character values is that many SAS procedures

will automatically order the values alphabetically when printing summary tables. This reordering is

often not appropriate for ordinal variables, such as HSMOKE, which would appear in the order

HEAVY, LIGHT, MODERATE in frequency tables and graphic displays. To avoid this, HSMOKE can

be defined as a numeric variable with codes 1,2, and 3, and then formats assigned to the codes.

** create grouped, ordinal variable for cigs/day **;


DATA SMOKE2(KEEP=CASEID HSMOKE);
SET SMOKE1;
IF 0<CIGSDAY<=10 THEN HSMOKE=1;
ELSE IF 10<CIGSDAY<=24 THEN HSMOKE=2;
ELSE IF 24<CIGSDAY THEN HSMOKE=3;
RUN;
** create formats **;
PROC FORMAT;
VALUE SMKFMT 1='LIGHT'
2='MODERATE'
3='HEAVY';
RUN;
** get frequencies of hsmoke **;
PROC FREQ DATA=SMOKE2;
FORMAT HSMOKE SMKFMT.;
TABLES HSMOKE;
TITLE1 'FREQUENCY OF LIGHT, MODERATE, AND HEAVY SMOKING';
RUN;


In this program, the frequency table of HSMOKE would be ordered by its numeric value, but the
words LIGHT, MODERATE, and HEAVY would appear in the table.


3c How to Code Multiple Instructions Using DO and END statements

Rationale –

• It might be that, when a condition is satisfied, you want more than one operation to occur.

• In this case an IF-THEN-DO set of statements is used.

Example (This also illustrates the creation of design variables) –

We want to use the heavy/moderate/light smoking status at baseline as a predictor of QUIT status at

a 6-month follow-up survey. However, it is not appropriate to use the variable HSMOKE in an

analysis as defined above, since the 1-2-3 spacing implies an equal distance between light-moderate

smoking, and moderate-heavy smoking, which is unlikely to be reasonable. Instead, two indicator

variables are needed, which can be defined from the following table:

HSMOKE   MOD   HVY
1        0     0
2        1     0
3        0     1

Light smokers have value 0 for both new variables MOD and HVY, MOD is an indicator of moderate

smoking, and HVY is an indicator of heavy smoking. The following statements would create these

variables, which could then be used in PROC LOGISTIC, a logistic regression procedure, or other

regression procedure.


** example using IF-THEN-DO to create design variables (two indicators) *;


** for each value of HSMOKE **;
DATA SMOKE3;
SET SMOKE2;
IF HSMOKE=1 THEN DO; * When HSMOKE=1, MOD=0 HVY=0 ;
MOD=0; HVY=0;
END;
ELSE IF HSMOKE=2 THEN DO;
MOD=1; HVY=0; * When HSMOKE=2, MOD=1 HVY=0 ;
END;
ELSE IF HSMOKE=3 THEN DO;
MOD=0; HVY=1; * When HSMOKE=3, MOD=0 HVY=1 ;
END;
RUN;

• A condition is defined beginning with IF …, followed by THEN DO;. A series of statements

taking action, such as defining new variables, follows the DO;.

• An END; statement is required to end the set of operations to be done when the condition is

met.

• Warning: it is easy to lose track of DO and END statements. An ‘END;’ is required for every

‘DO;’

There are other ways to create these variables, too.
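One such alternative, sketched here on a made-up data set (SMK and SMK2 are hypothetical names), relies on the fact that SAS evaluates a comparison to 1 when true and 0 when false. The check for missing values is needed because a comparison with a missing HSMOKE would otherwise give 0 rather than missing.

```sas
* hypothetical data with one subject of unknown smoking status;
DATA SMK;
  INPUT CASEID HSMOKE;
  DATALINES;
1 1
2 2
3 3
4 .
;
RUN;

DATA SMK2;
  SET SMK;
  * leave MOD and HVY missing when HSMOKE is missing;
  IF NOT MISSING(HSMOKE) THEN DO;
    MOD = (HSMOKE = 2);
    HVY = (HSMOKE = 3);
  END;
RUN;

PROC PRINT DATA=SMK2;
RUN;
```

This gives the same MOD and HVY values as the design table above for HSMOKE values 1, 2 and 3, and leaves both indicators missing for case 4.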


3d How to Code Repetition of Instructions Using an Array

Rationale

• You want to perform the same operation (or same set of multiple operations) on each of many

variables. Why write this instruction over and over again for each variable? To avoid having to

repeat the same instructions for each variable, an ARRAY can be exploited.

Definition Array

• An array (or ordered listing of variables) is defined on an ARRAY statement by giving the array a

name, followed by the number of elements (variables) in curly brackets { }, followed by the list

of variables. The operation(s) to be carried out on the array elements are then defined in a DO

loop.

Example – Using an Array Statement to Define Missing Value Codes.

Suppose an input data set contains five test scores for each subject, with valid scores of 0 to 10,

while 99 represents a missing value. To convert these 99’s to SAS missing values for all test scores

at once, use the following:

** example using array statement to **;


** change missing code to SAS missing **;
** values for 5 variables at once **;
DATA SCORE2(DROP=I);
SET SCORE1;
** define array named TESTS with 5 elements (variables) *;
ARRAY TESTS{5} TEST1 TEST2 TEST3 TEST4 TEST5;

** use DO loop to change missing code 99 to . **;


DO I = 1 TO 5;
IF TESTS{I}=99 THEN TESTS{I}=.;
END;
RUN;


• The above program defines an ARRAY called TESTS with the five test scores as elements.

• The DO statement says to do the subsequent operation on the Ith element, as I goes from 1

to 5.

• When I=1, the statement acts as IF TEST1=99 THEN TEST1=.;

because TEST1 is the first-named variable on the array statement.

• When I=3 the statement acts as IF TEST3=99 THEN TEST3=.;

• Note that “I” will be included as a variable in the data set unless you specifically DROP it.
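A variation that some programmers prefer is to let SAS count the elements, by writing {*} in the ARRAY statement and using the DIM function as the upper limit of the DO loop. The loop then does not need to be edited if variables are added to the array. The sketch below uses made-up data; the data set names SC1 and SC2 are hypothetical.

```sas
* hypothetical scores, with 99 used as the missing code;
DATA SC1;
  INPUT ID TEST1 TEST2 TEST3 TEST4 TEST5;
  DATALINES;
1 7 99 10 3 99
2 99 8 9 10 0
;
RUN;

DATA SC2(DROP=I);
  SET SC1;
  * the asterisk tells SAS to count the listed variables;
  ARRAY TESTS{*} TEST1 TEST2 TEST3 TEST4 TEST5;
  * DIM returns the number of elements in the array, here 5;
  DO I = 1 TO DIM(TESTS);
    IF TESTS{I} = 99 THEN TESTS{I} = .;
  END;
RUN;

PROC PRINT DATA=SC2;
RUN;
```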


SAS permits you to use a shorthand listing for the variable name elements in the ARRAY statement.

In the example below, TEST1-TEST5 is the SAS shorthand notation for TEST1 TEST2 TEST3

TEST4 TEST5. In general, when variable names have the same prefix (in this example the prefix is

TEST) with sequential numbering, it is possible to name only the first and last variable, with a single

hyphen (no spaces) separating the names on any SAS statement, as in the example below.

*** program to recode missing, refused codes to SAS missing values***;


*** and compute an average of non-missing scores ***;
DATA SCORE2(DROP=I);
SET SCORE1;
ARRAY TESTS{5} TEST1-TEST5; ** define array TESTS with 5 elements;

DO I = 1 TO 5; ** start DO loop **;


IF TESTS{I}=99 THEN TESTS{I}=.; ** recode missing ;
ELSE IF TESTS{I}=88 THEN TESTS{I}=.R; ** and recode refusals;
END;

MTEST = MEAN(OF TEST1-TEST5); ** compute mean non-missing test scores;


RUN;
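To see how the special missing value .R behaves, the sketch below runs the same recoding on two made-up records and adds counts of valid and missing scores with the N and NMISS functions. The data set names SC1 and SC2 and the variables NVALID and NMISSING are hypothetical additions for illustration.

```sas
* hypothetical scores, 99 = missing and 88 = refused;
DATA SC1;
  INPUT ID TEST1-TEST5;
  DATALINES;
1 7 99 10 88 6
2 88 88 99 99 99
;
RUN;

DATA SC2(DROP=I);
  SET SC1;
  ARRAY TESTS{5} TEST1-TEST5;
  DO I = 1 TO 5;
    IF TESTS{I} = 99 THEN TESTS{I} = .;
    ELSE IF TESTS{I} = 88 THEN TESTS{I} = .R;
  END;
  * MEAN ignores both ordinary and special missing values;
  MTEST    = MEAN(OF TEST1-TEST5);
  * count non-missing and missing scores for each subject;
  NVALID   = N(OF TEST1-TEST5);
  NMISSING = NMISS(OF TEST1-TEST5);
RUN;

PROC PRINT DATA=SC2;
RUN;
```

Subject 1 has three valid scores (7, 10 and 6), so NVALID is 3 and NMISSING is 2. Subject 2 has no valid scores, so MTEST is missing. The refusals print as R, which keeps them distinguishable from ordinary missing values.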

New variables can be defined by naming them in an ARRAY statement and then values may be

assigned in DO loop processing:

Consider a study of infant weight gain, where weights have been recorded, in ounces, at one-month

intervals from birth to 1 year. There are 13 measurements for each infant, WT0 being the birth

weight, WT1 the weight at 1 month, and so on to WT12 at 12 months. These weights are in the

dataset OLDWT.sas7bdat. However we are interested in doing the analysis on weight in grams.

Birth weight in grams can be computed as:

WTGM0 = WT0 * 28.4;


To convert each weight individually would require 13 separate statements like the one above. Or this

can be accomplished with the following, using ARRAY statements and DO loops.

** example defining new variables in array **;


LIBNAME SDAT 'C:\TEMP';
DATA SDAT.NEWWT(DROP=I); *drop I, variable for DO loop;
SET SDAT.OLDWT;
ARRAY OZ{13} WT0-WT12; *array weights in ounces;
ARRAY GM{13} WTGM0-WTGM12; *array weights in grams-no values yet;
DO I = 1 TO 13;
GM{I} = OZ{I} * 28.4; *compute wt in gm from wt in oz;
END;
RUN;

• The first array statement names an array called OZ with thirteen variables. These are found in

the old data set.

• The second array statement names the 13 new variables to be computed: WTGM0, WTGM1,

... WTGM12.

• Because WTGM0-WTGM12 do not yet exist, SAS creates them as new numeric variables, and

they are missing (.) until values are assigned in the DO loop. No initial value is needed. (If initial

values are wanted in an ARRAY statement, they are given as a list in parentheses after the variable

names.)

• The statement GM{I}=OZ{I}*28.4; says to compute the ith variable in the array GM as

28.4 times the ith variable in the array OZ.

• The dimensions (number of elements) of the arrays must match, and the variables in

the two arrays must be named in the same order. If the orders don’t match, you may be

computing (incorrectly) WTGM1 = WT2 * 28.4;


4. Illustration

Example of steps in developing a more complicated program, which makes use of array statements

for repetitive processing of variables. This example also makes use of RETAIN statements and

FIRST. and LAST. variables.

Consider again the warm versus cold cardiopulmonary bypass study. In the pilot study, patients are

given a battery of psychological tests pre-operatively, post-operatively, and at a follow-up visit. At

each test period the same 16 tests designed to evaluate different aspects of cognitive processing

were given. The series of tests were time-consuming and tiring. Many patients refused to participate

at the follow-up or post-operative testing periods, or refused to complete the full battery of tests at one

or more of the testing periods. This meant that there were large amounts of missing data. It is

therefore particularly important that missing values are appropriately handled.

For ease of data entry, data were input with one record for each test period, so that there are up to

three records per patient. The variable PSTATUS identifies the patient status as 1 (pre), 2 (post) or 3

(follow-up). The structure of the data set is given below. The variable PATID is used as the patient

ID, and there are 16 variables to identify the test scores.


PSYCH TESTING DATA FROM WARM/COLD CPB STUDY
DATA WITH MULTIPLE OBSERVATIONS PER SUBJECT

----Variables Ordered by Position----

 #  Variable  Type  Len  Pos  Format  Label
 1  PSTATUS   Num     8    4  1.      PATIENT STATUS
 2  PATID     Num     8   12  6.      PATIENT ID NUMBER
 3  INF       Num     8   20  2.      INFORMATION-SCALED SCORE
 4  PIC       Num     8   28  2.      PICTURES-SCALED SCORE
 5  VOC       Num     8   36  2.      VOCABULARY-SCALED SCORE
 6  BLK       Num     8   44  2.      BLOCK DESIGN-SCALED SCORE
 7  OBJ       Num     8   52  2.      OBJECT ASSEMBLY-SCALED SCORE
 8  SIM       Num     8   60  2.      SIMILARITIES-SCALED SCORE
 9  IOQ1      Num     8   68  2.      INFORM. & ORIENT. QUES.
10  LOGMEM1   Num     8   76  2.      LOGICAL MEMORY 1
11  LOGMEM2   Num     8   84  2.      LOGICAL MEMORY 2
12  VIS1      Num     8   92  2.      VISUOSPATIAL 1
13  VIS2      Num     8  100  2.      VISUOSPATIAL 2
14  STRPW     Num     8  108  2.      STROOP-WORDS
15  STRPC     Num     8  116  2.      STROOP-COLORS
16  STRPCW    Num     8  124  2.      STROOP-WORDS/COLORS
17  SDMT      Num     8  132  2.      SDMT
18  FAS       Num     8  140  2.      FAS

A listing of a couple of the psych testing variables for the first eight patients is given below.

MULTIPLE RECORDS PER SUBJECT

OBS PATID PSTATUS INF FAS

1 28 1 7 .
2 28 2 7 .
3 65 1 10 39
4 65 2 . .
5 65 3 12 48
6 74 1 8 32
7 74 2 8 33
8 74 3 9 .


9 144 1 6 28
10 144 2 6 .
11 192 1 6 6
12 192 2 7 .
13 192 3 . .
14 196 1 9 .
15 196 2 8 17
16 210 1 8 .
17 245 1 11 .
18 245 2 13 52
19 245 3 12 55

Investigators were interested in which, if any, of the cognitive processes show significant decreases

post-operatively compared to pre-operatively, and whether or not there is full recovery or

improvement by the follow-up period. To do this, difference scores, post-op minus pre-op, and follow-

up minus pre-op for each of the 16 scores are needed.

We want a final data set that will have one observation for each subject, with the pre, post, and

follow-up scores, along with the difference scores. The steps required in the program are:

Create one observation per subject with pre-op, post-op, follow-up and difference scores

1. Sort the data so that the observations for each subject are grouped (sort by PATID)

2. Read the data so that the start and end for each subject can be identified (use BY PATID; so
FIRST. and LAST. can be used)

3. Make sure that values read for pre, post and follow-up will be kept on single new observation
for each subject (use RETAIN statement)

4. Identify start of a new subject (use FIRST.PATID), and reset the retained values to missing
so that values from the previous subject are not carried over when the new subject has no record for a test period

5. Assign values to pre, post and follow-up scores, depending upon patient status (IF
PSTATUS=...)

6. After all data read for subject, compute difference scores, and output an observation (use
LAST.PATID)


First a program is developed that will do this for one variable, in this case INF, the information score.

We will examine in some detail the processing that goes on in this case. Then the program that will

process all 16 variables at once will be given.

*************************************************************;
** Program to create one observation per subject **;
** with pre, post, follow-up scores, pre-post, pre-foll **;
** difference scores for the variable INF **;
*************************************************************;
** sort input data, so BY can be used **;
PROC SORT DATA=PSYCH3;
BY PATID;
RUN;

** create new dataset, keep new variables and subject ID **;


DATA PSYCH4(KEEP=PATID INF1 INF2 INF3 INFPP INFFP);
SET PSYCH3;
BY PATID; * for identification of start and end of;
* subject with first. and last. ;

** identify values that will be retained **;


** as next record for subject is read **;
RETAIN INF1 INF2 INF3 .; * 1=pre, 2=post, 3=foll scores;

** identify start of new subject and set scores to missing **;


** before reading values for new subject **;
IF FIRST.PATID THEN DO;
INF1=.; INF2=.; INF3=.;
END;
** assign values to new vars based upon the patient status **;
IF PSTATUS=1 THEN INF1=INF; * find pre-op score;
ELSE IF PSTATUS=2 THEN INF2=INF; * find post-op score;
ELSE IF PSTATUS=3 THEN INF3=INF; * find follow-up score;

** identify end of records for subject, **;
** compute difference scores and output a record **;
IF LAST.PATID THEN DO;
INFPP = INF2 - INF1; * post - pre;
INFFP = INF3 - INF1; * foll - pre;
OUTPUT; * write obs to new data set;
END;
RUN;

Now, to understand the processing, let’s examine in some detail what happens when this program is

used. Suppose we begin with the data set below, with 2 subjects, x and y. x has 3 records, and y

has only 2. The values of the FIRST. and LAST. variables, available as the data is read in with the

BY PATID statement are also shown.

PATID PSTATUS INF FIRST.PATID LAST.PATID


x 1 6 1 0
x 2 5 0 0
x 3 8 0 1
y 1 7 1 0
y 2 4 0 1

Values of variables in the new data set are initially all missing, until an observation is read:

PATID INF1 INF2 INF3 INFPP INFFP


. . . . . .

As the first observation is read, missing values are replaced. In this case PSTATUS is 1 so the value

of INF1 is replaced:

PATID INF1 INF2 INF3 INFPP INFFP


x 6 . . . .


The value of INF1 is retained as the next observation is read from the input data set, and this time

INF2 is replaced:

PATID INF1 INF2 INF3 INFPP INFFP


x 6 5 . . .

INF2 and INF1 are retained as the next observation is read in, and INF3 is replaced:

PATID INF1 INF2 INF3 INFPP INFFP


x 6 5 8 . .

We have now reached an observation where LAST.PATID = 1, identifying the last observation for a

subject, so the next few statements are executed. The differences are computed, and the

observation is written to the new dataset:

PATID INF1 INF2 INF3 INFPP INFFP


x 6 5 8 -1 2

As the next observation is read from the input data set, the values of INF1, INF2, and INF3 are reset

to missing, since the FIRST.PATID=1 identifies the start of a new subject. We don't need to reset the

other variables – they are automatically reset to missing before a new observation is read from the

input data, since they were not named in the RETAIN statement:

PATID INF1 INF2 INF3 INFPP INFFP


x 6 5 8 -1 2
. . . . . .

will become:


PATID INF1 INF2 INF3 INFPP INFFP


x 6 5 8 -1 2
y 7 4 . -3 .

If we did not reset the values to missing at the start of the second subject, the value of INF3 would be

retained at 8 from the first subject, since there is no follow-up visit for the second subject to replace

this value.
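The carry-over can be demonstrated with the sketch below, which reads the same two subjects (coded here as PATID 1 and 2 in place of x and y) but deliberately leaves out the reset at FIRST.PATID. The data set names DEMO and NORESET are made up for illustration.

```sas
* the five records from the walk-through, with x=1 and y=2;
DATA DEMO;
  INPUT PATID PSTATUS INF;
  DATALINES;
1 1 6
1 2 5
1 3 8
2 1 7
2 2 4
;
RUN;

DATA NORESET(KEEP=PATID INF1 INF2 INF3);
  SET DEMO;
  BY PATID;
  RETAIN INF1 INF2 INF3 .;
  * the IF FIRST.PATID reset is deliberately left out here;
  IF PSTATUS=1 THEN INF1=INF;
  ELSE IF PSTATUS=2 THEN INF2=INF;
  ELSE IF PSTATUS=3 THEN INF3=INF;
  IF LAST.PATID THEN OUTPUT;
RUN;

PROC PRINT DATA=NORESET;
RUN;
```

In NORESET, subject 2 shows INF3=8 even though that subject has no follow-up record. The value belongs to subject 1 and was carried over by the RETAIN statement.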

This program worked well for processing a single variable, but there are actually 16 cognitive test

variables that we want to process. For each of the original 16 scores, there will be 5 variables on the

output data set, or 80 new variables. The following program follows the same logic as the program

above, but makes use of arrays to process all 16 variables at one time.

* example program for all 16 variables, using arrays **;


LIBNAME C 'C:\temp';

** sort by subject id **;


PROC SORT DATA=C.PSYCH3;
BY PATID;
RUN;

** create dataset with one obs per subject **;


DATA C.PSYCH4(LABEL='PSYCH DIFFERENCE SCORES'
DROP=I PSTATUS INF--FAS);
SET C.PSYCH3;
BY PATID; ** allows use of LAST.PATID **;
LENGTH DEFAULT=3; ** set length for new vars **;

** name arrays, creating separate **;


** pre, post, and follow-up arrays **;
** and differences, pp=post-pre, fp=foll-pre **;


ARRAY INIT{16} INF--FAS;
** new variables must be listed in the same order as INF--FAS **;
** (STRPW comes before STRPC in the input data set) **;
ARRAY PRE{16} INF1 PIC1 VOC1 BLK1 OBJ1 SIM1 IOQ11 LM1_1 LM2_1
VIS1_1 VIS2_1 STRPW1 STRPC1 STRPCW1 SDMT1 FAS1;
ARRAY POST{16} INF2 PIC2 VOC2 BLK2 OBJ2 SIM2 IOQ2 LM1_2 LM2_2
VIS1_2 VIS2_2 STRPW2 STRPC2 STRPCW2 SDMT2 FAS2;
ARRAY FOLL{16} INF3 PIC3 VOC3 BLK3 OBJ3 SIM3 IOQ3 LM1_3 LM2_3
VIS1_3 VIS2_3 STRPW3 STRPC3 STRPCW3 SDMT3 FAS3;
ARRAY PP{16} INFPP PICPP VOCPP BLKPP OBJPP SIMPP IOQPP LM1PP
LM2PP VIS1PP VIS2PP STRPWPP STRPCPP STRPCWPP
SDMTPP FASPP;
ARRAY FP{16} INFFP PICFP VOCFP BLKFP OBJFP SIMFP IOQFP LM1FP
LM2FP VIS1FP VIS2FP STRPWFP STRPCFP STRPCWFP
SDMTFP FASFP;

** use RETAIN statement so value of pre, post will be kept **;


RETAIN INF1--FAS3 .;

** reset pre,post,foll values to missing before each subject **;


IF FIRST.PATID THEN DO I=1 TO 16;
PRE{I}=.;
POST{I}=.;
FOLL{I}=.;
END;
** assign values to new variables **;
IF PSTATUS=1 THEN DO I=1 TO 16;
PRE{I} = INIT{I};
END;
ELSE IF PSTATUS=2 THEN DO I=1 TO 16;
POST{I} = INIT{I};
END;
ELSE IF PSTATUS=3 THEN DO I=1 TO 16;
FOLL{I} = INIT{I};
END;

** when all data for a subject has been read **;


** compute difference scores, and then **;
** output an observation for each subject **;


IF LAST.PATID THEN DO;


DO I=1 TO 16;
PP{I} = POST{I} - PRE{I};
FP{I} = FOLL{I} - PRE{I};
END;
OUTPUT;
END;
RUN;
PROC PRINT DATA=C.PSYCH4;
VAR PATID INF1 INF2 INF3 INFPP INFFP FAS1 FAS2 FAS3 FASPP
FASFP;
TITLE2 'DATA WITH ONE OBSERVATION PER SUBJECT';
RUN;

This creates the following output, shown for the same eight subjects and for the first and last test variables (INF and FAS):

DATA WITH ONE OBSERVATION PER SUBJECT

OBS PATID INF1 INF2 INF3 INFPP INFFP FAS1 FAS2 FAS3 FASPP FASFP

1 28 7 7 . 0 . . . . . .
2 65 10 . 12 . 2 39 . 48 . 9
3 74 8 8 9 0 1 32 33 . 1 .
4 144 6 6 . 0 . 28 . . . .
5 192 6 7 . 1 . 6 . . . .
6 196 9 8 . -1 . . 17 . . .
7 210 8 . . . . . . . . .
8 245 11 13 12 2 1 . 52 55 . .


• In this program, the initial 16 score variables are listed in an array called INIT.

• The new variables to be created are named in five arrays, for PRE operative, POST operative,

and FOLLow-up scores, and PP (post - pre) and FP (follow-up - pre) differences.

• The logic of the program is the same as the earlier one, but at each step a DO loop is used to

process all 16 scores, rather than just one.

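• Note also that the RETAIN statement uses a shorthand listing for the 48 new pre, post, and follow-up variables:
RETAIN INF1--FAS3 .;

A list of variables can be given by naming the first variable (here INF1) followed by 2 hyphens (--), no

spaces, followed by the last variable. The list includes every variable between the two, in order by

position. For variables read from an input data set (as with INF--FAS), this is their position in that data

set. For new variables (as with INF1--FAS3), it is the order in which they are first named in the DATA

step, here in the PRE, POST, and FOLL array statements. This is where the POSITION option of PROC

CONTENTS comes in handy. When the list is used to define an array, all of the variables on the list

must be of the same type, either numeric or character.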
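The two kinds of shorthand list can be compared with the sketch below, which uses a one-record made-up data set (LISTS is a hypothetical name). It uses the VARNUM option of PROC CONTENTS to list variables by position. This is an assumption about your SAS release; if VARNUM is not recognized, try the POSITION option mentioned above.

```sas
DATA LISTS;
  INPUT PATID INF PIC VOC TEST1-TEST3;
  DATALINES;
28 7 9 8 1 2 3
;
RUN;

* list the variables in order of position in the data set;
PROC CONTENTS DATA=LISTS VARNUM;
RUN;

* double hyphen gives a positional list, here INF PIC VOC;
PROC PRINT DATA=LISTS;
  VAR INF--VOC;
RUN;

* single hyphen gives a numbered list, here TEST1 TEST2 TEST3;
PROC PRINT DATA=LISTS;
  VAR TEST1-TEST3;
RUN;
```

The single-hyphen form depends only on the variable names, while the double-hyphen form depends on where the variables sit in the data set, so it is worth checking the positions before relying on it.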
