Unit 4
SAS for Data Management
Welcome.
This reading is a wonderful reservoir of tools, such as: how to document data entry errors in the SAS log
using the PUT statement, how to select a simple random sample, and how to clean character data.
2. to be aware of the multiple uses of the PUT statement (such as report writing, forms creation, etc)
4. to understand how SAS stores dates and how to do calculations involving dates;
7. to appreciate that SAS offers multiple ways of working with missing information;
8. to know how to select a simple random sample without replacement from a sampling frame; and
week 10 10.1
Week 10 More on Manipulating Data
SAS offers a variety of ways to code instructions that write or output information:
• A DATA statement together with a SET statement outputs data to a SAS data set. (See Unit 4
• A PROC PRINT writes out information in a SAS data set to the results window. The results window
• The PUT statement is another SAS instruction for writing or outputting of information.
Rationale
• It may be of interest to output data to an ASCII (text) file for some other use
General Guidelines
• Use a PUT statement to write information to the location indicated in a preceding FILE statement.
Note – A FILE statement does not require a companion libname statement. This is because a FILE
• If no FILE statement precedes, the PUT statement will write the information to the SAS log.
• Thus, the PUT statement with no preceding FILE statement is the way to write information to the
SAS log.
• Tip - Be aware that there are many options for use with the PUT statement. See the SAS Language
Example
The PUT statement was introduced in Unit 4, Week 2 as a method for writing/exporting SAS data to an
LIBNAME old "z:\river valley\sasdata"; /* LIBNAME tells SAS the location of the source SAS data */
DATA _NULL_; /* _NULL_ is a placeholder SAS data set that will not be saved */
SET old.screen02; /* SAS is told to read source data from screen02.sas7bdat */
FILE "z:\river valley\txtdata\screen02.dta"; /* Specifies the name and destination of the ASCII file */
PUT medrec cd4; /* Specifies the variables that are to be written */
RUN;
In this example the two variables (MEDREC and CD4) would be listed in the file separated by a space.
Additional Tips
• The PUT statement can be used to write information to the log, to the output window or to files and,
as a choice of SAS instruction for purposes of writing, the PUT statement has certain advantages.
• Information “written” with a PUT statement can include variable values with or without variable
• By default, a PUT statement writes lines to the SAS LOG when no file specification is given.
Rationale
• In some instances (especially when the data set size is reasonable), it may be good documentation
to have recorded in the SAS log the data that was actually read in.
Guidelines
• To write the data to the log as it is read in, use the statement: PUT _INFILE_; after the INPUT
statement.
Example
This example also illustrates use of the SAS instruction _INFILE_ . It refers specifically to data that is read
1 DATA BOOKS;
2 INFILE 'C:\SPH\HW6#3.DAT';
3 INPUT NAME $ DAYS SUBJECT $15.;
4 PUT _INFILE_;
5 RUN;
BARBARA 2 ENGLISH
CAROL 2 SCIENCE
CAROL 3 ENGLISH
CAROL 4 MATH
DONALD 1 ART
JAMES 4 MATH
JOYCE 5 HOME ECONOMICS
JOYCE 6 SCIENCE
MARY 2 MECHANICS
PHILIP 1 HOME ECONOMICS
PHILIP 3 MATH
WILLIAM 2 SCIENCE
NOTE: The infile 'C:\TEMP\HW4P3.DAT' is file C:\TEMP\HW4P3.DAT.
NOTE: 12 records were read from the infile C:\TEMP\HW4P3.DAT.
The minimum record length was 15.
The maximum record length was 26.
NOTE: The data set WORK.BOOKS has 12 observations and 3 variables.
NOTE: The DATA statement used 6.00 seconds.
Rationale
• You may wish to identify subjects who have missing or invalid data for a particular variable, or those
• One way to get such a list is to make a subset of the data for observations with the characteristic of
interest, and then to list the data using PROC PRINT. This can require many different data steps
• Alternatively, the list of problem observations can be written to the output file or to the log using PUT
Example
• In the first, the variables to be written to the log are named with an equal sign ( PUT VARNAME= ).
When this form is used the ‘variable name = value’ is printed to the log (VARNAME=value).
• In the second example, a mixture of text and variable values is written to the log: the
PUT statement says to (1) write the value of the variable NAME, (2) write the text enclosed in
quotes, (3) write the value of the variable WEIGHT, and (4) write the text enclosed in quotes.
Following is the SAS log that was produced. Recall that the PUT statement writes to the LOG by default,
unless preceded by a FILE statement. The lines in the SAS log that were produced by the PUT statements
1 DATA CLASS;
2 INFILE 'C:\SPH\691F\PUT2.DAT';
3 INPUT NAME $ SEX $ AGE HEIGHT WEIGHT;
4
5 * list students with invalid weight *;
6 * and changes weight to missing *;
7 IF WEIGHT > 200 THEN DO;
8 PUT NAME= WEIGHT= ;
9 PUT NAME 'has invalid weight of ' WEIGHT 'lbs.';
10 WEIGHT=.;
11 END;
12 RUN;
NAME=JOHN WEIGHT=995
JOHN has invalid weight of 995 lbs.
NOTE: The infile 'C:\TEMP\PUT2.DAT' is file C:\TEMP\PUT2.DAT.
NOTE: 19 records were read from the infile C:\TEMP\PUT2.DAT.
The minimum record length was 21.
The maximum record length was 23.
NOTE: The data set WORK.CLASS has 19 observations and 5 variables.
Rationale
• It may be that you want to write a list of data to the SAS output file so that it accompanies
• Such SAS output files might be descriptions, analysis results, or the reporting of feedback
information.
Guidelines
• All the options available for reading data with INPUT statements can be used for writing data with
PUT statements.
Example
ALICE F 13
BARBARA F 13
CAROL F 14
JANE F 12
JANET F 15
JOYCE F 11
JUDY F 14
LOUISE F 12
MARY F 15
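A minimal sketch of how a listing like the one above could be written to the output window with PUT. It assumes in-stream data with hypothetical records, a filter on SEX, and hypothetical column positions; the FILE PRINT statement routes the PUT output to the output window rather than the log, and the @ pointers and formats work as they would in an INPUT statement.

```sas
DATA _NULL_;
  INPUT name $ sex $ age;            /* hypothetical in-stream records */
  FILE PRINT;                        /* send PUT output to the output window */
  IF sex = 'F' THEN
    PUT @1 name $10. @12 sex $1. @15 age 2.;   /* column pointers and formats */
  DATALINES;
ALICE F 13
ALFRED M 14
BARBARA F 13
;
RUN;
```

Removing the FILE PRINT statement would send the same lines to the SAS log instead.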
e. How to Use the PUT Instruction to Assign Sequential Identification Values Plus Some Bells
and Whistles
Rationale
• A PUT statement can be used to create a header page for surveys or other data collection forms,
Example
• In the example that follows a header page is created for a chart review study that requires reviewing
• The form is used to record the names and medical record numbers of the mother and infant, and a
place is given for recording the date of chart review, reviewer’s name, as well as date of data entry,
• A DO loop is used to create the sequential ID numbers -- in this example one hundred forms are
• For each form a new page is started, using the PUT _PAGE_ command. Then instructions for the
• The @ symbol is used to define the starting column, just as in an INPUT statement.
• Text to be printed, including lines to be drawn must be enclosed in quotes. Use double
• See the SAS Language Guide or SAS HELP, section on PUT statements for more details
on line and pointer controls.
Note that there is one very long PUT statement that ends with a semi-colon. This PUT statement is listed
**********************************************;
** example of creating sequential ID number **;
** and cover page for data collection form **;
**********************************************;
OPTIONS nodate nonumber nocenter ls=78 ps=60;
_______________________________________
| |
| BAYSTATE MEDICAL CENTER |
| FETAL MACROSOMIA STUDY |
| |
|_______________________________________|
STUDY ID = 101
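The steps described for the header page can be sketched roughly as follows. This is an illustration under assumptions, not the original program: the ID range 101 to 200, the column positions, and the blank fields are hypothetical, and the form is written with two PUT statements per page rather than one long one.

```sas
OPTIONS NODATE NONUMBER NOCENTER LS=78 PS=60;

DATA _NULL_;
  FILE PRINT;                          /* write the forms to the output window */
  DO id = 101 TO 200;                  /* one hundred sequential study IDs (assumed range) */
    PUT _PAGE_;                        /* start a new page for each form */
    PUT @10 'BAYSTATE MEDICAL CENTER'
      / @10 'FETAL MACROSOMIA STUDY'
     // @10 'STUDY ID = ' id
     // @10 "MOTHER'S NAME: ______________________"   /* double quotes because of the apostrophe */
     // @10 'DATE OF CHART REVIEW: ___/___/___';
  END;
RUN;
```

Each / moves to the next line and each @10 sets the starting column, so every form has the same layout with only the study ID changing.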
Much more complicated and detailed forms can be generated with PUT statements. This can be
particularly useful in follow-up surveys, where a form for each subject in the baseline survey can be
pre-printed with name, address, phone number, ID numbers, etc. Any useful information from the baseline
survey can be incorporated and printed using a PUT statement. To do this, the DATA _NULL_; statement
would be followed by a SET statement, naming the source data file containing the baseline data, before the
SAS stores dates internally using numbers that are related to the date January 1, 1960
• In SAS, a date is recorded as the number of days from January 1, 1960, with Jan 1, 1960 having a
value of 0.
• Dates after 01/01/1960 have positive values, and dates before this date have negative values.
• If you fail to assign a format to a date using a format statement, the number printed will be the
number of days since January 1, 1960 – not a particularly useful piece of data to look at – so
• A record length of 5 is sufficient for all recent dates (dates after 778 BC) represented as month, day,
year since there are at most 5 significant digits. This record length also is sufficient for negative
SAS dates.
• People record and read dates as month, day, and year, but SAS cannot interpret that notation on its own.
• Therefore, special DATE informats exist to translate this specification of the date to the SAS numeric
representation.
• An informat works in a manner similar to a format – it provides instruction for reading values, rather
• Similarly, since the SAS representation of a date as the number of days from 01/01/1960 is not
readily recognized as a month, day and year, formats exist to convert SAS numeric dates to month,
Example
• Data in the following example includes information on birth date, date of surgery and date of post-
operative testing.
• The dates in this file are in MMDDYY format, in this case with a slash (/) to separate month, day, and
year.
• The subject birth dates use 10 columns (MMDDYY10. format), with year taking 4 columns
• The other dates use 8 columns with year taking 2 columns, or MMDDYY8. format.
• Note the INFORMAT statement used before the INPUT statement to define the date informats.
• Alternatively, without an informat statement the format for reading the dates can be indicated directly
• In the program that follows, copies of each of the dates are made under new names so that several of
• Note that date formats are always available for use in SAS, without running PROC FORMAT.
• However, formats must be assigned to date variables in a FORMAT statement in a data or proc step,
• A format should always be assigned to DATE variables in SAS for printing, since looking at
101 12/03/1917 03DEC17 Jul 17, 1991 Wednesday, Jul 17, 1991 73
102 03/18/1910 18MAR10 Jul 19, 1991 Friday, Jul 19, 1991 81
103 09/21/1930 21SEP30 Jul 22, 1991 Monday, Jul 22, 1991 60
104 08/27/1924 27AUG24 Jul 26, 1991 Friday, Jul 26, 1991 66
105 10/11/1938 11OCT38 Jul 29, 1991 Monday, Jul 29, 1991 52
106 06/14/1919 14JUN19 Jul 30, 1991 Tuesday, Jul 30, 1991 72
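A minimal sketch of reading and formatting dates along the lines described above. The data set name, variable names, and the two in-stream records are hypothetical, and the formats chosen (MMDDYY10., DATE7., WORDDATE12.) are one possible selection rather than the ones used to produce the listing.

```sas
DATA surg;
  INFORMAT dob MMDDYY10. surdate MMDDYY8.;   /* how to read each date */
  INPUT id dob surdate;
  dob2 = dob;                                /* copy, so a second format can be shown */
  age  = INT((surdate - dob) / 365.25);      /* age in whole years at surgery */
  FORMAT dob MMDDYY10. dob2 DATE7. surdate WORDDATE12.;
  DATALINES;
101 12/03/1917 07/17/91
102 03/18/1910 07/19/91
;
RUN;

PROC PRINT DATA=surg;
RUN;
```

Without the FORMAT statement, PROC PRINT would show the dates as day counts from January 1, 1960.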
SAS's system for internal storage of dates makes computations involving dates straightforward.
• The advantage of storing dates as a number of days from some reference point lies in the ease of
• In the above example, the number of days between surgery and post-operative testing can be
• In the same manner, AGE (in years) at surgery can be computed as the number of days between
AGE = (SURDATE-DOB)/365.25;
• In the example in the program above the INT (integer) function was used to round down to the integer
Occasionally, you may need to be mindful of the Y2K problem in working with dates
TIP - It’s a good idea to use 4 digits for year whenever possible when reading and writing dates to avoid
confusion. In the above example, the birth dates were listed with 4 digit years for this reason. When
working with birthdates of subjects in the early 1900’s, dates will be read incorrectly as early 2000’s, if only
the final 2 digits of the year are given. These can be corrected using SAS Functions (see following, 3.
TIME and DATETIME formats and informats are available for use in SAS. These can be used when times,
and time differences are of interest. An example is given in the section on date and time functions, which
follows.
• Constant values for dates, times, and datetimes can also be referred to using the DATE. , TIME. or
DATETIME. formats.
• For example given information on birth date (DOB), to compute a subject's age in years at a specific
date, for example Jan 1, 1992, you could use the following statement:
• The date constant must be in DATE9. format (days in 2 digits, month in 3 letter abbreviation,
and years in 4 digits) enclosed in single quotes, and followed by the letter d (d or D works),
no spaces.
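A minimal sketch of a computation using a date constant, assuming a variable DOB read from hypothetical in-stream records; the names AGES and AGE92 are illustrative only.

```sas
DATA ages;
  INPUT id dob :MMDDYY10.;                       /* read birth date as a SAS date */
  age92 = INT(('01JAN1992'D - dob) / 365.25);    /* whole years of age on Jan 1, 1992 */
  FORMAT dob MMDDYY10.;
  DATALINES;
101 12/03/1917
102 03/18/1910
;
RUN;
```

The quoted string followed by D is converted to the same day count that an informat would produce, so it can be used directly in arithmetic.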
Similar to dates, TIME and DATETIME constants can be defined. See the section on date, time and
datetime constants in the chapter on SAS expressions in the SAS Language Guide for more details.
The statement:
TDATE = TODAY();
creates the variable TDATE with the current date on the computer as the value. The word today is followed
by empty parentheses.
The DATE( ) function gives the same result as the TODAY( ) function.
TDATE = DATE();
Many special functions for computing and manipulating numeric and character data are available for use in
Mathematical:   EXP (e raised to a power), LOG (natural log), LOG10 (log base 10)
Date and time:  DATE (today's date), DATEJUL (Julian date to SAS date), MDY, MONTH, DAY, YEAR, others
• Functions can provide a simple way of computing complex formulae, such as a variance or probability
function.
• TIP: Many of the functions handle missing values differently than when computing with
expressions. It is important to pay attention to this detail when choosing to compute using SAS
Suppose you want to compute the mean of 3 scores for a subject (MEAN function). There is more than one
approach.
• This can be done by defining an expression to sum the three scores and divide by 3.
• Alternatively, a mean can be computed by naming the three variables in the MEAN function.
• TIP - The result is not the same because missing values are handled differently in the two
approaches. Using the first method, the mean score has a missing value if any one of the three
scores is missing. Using the MEAN function, the mean score has a value if at least one of the three
. . 3
. . .
;
RUN;
PROC PRINT DATA=EX1;
VAR SCORE1-SCORE3 M1 M2;
TITLE1 'MEAN SCORES COMPUTED TWO WAYS';
RUN;
• To use the mean function, the word MEAN is followed by a variable list enclosed in parentheses.
• If the variable names all have the same prefix with sequential numbering, an alternative hyphenated
M2 = MEAN(SCORE1,SCORE2,SCORE3);
or
M2 = MEAN(OF SCORE1-SCORE3);
Results of the above program follow. Note that the expression only computes a mean when all values are
non-missing, while the MEAN function computes a mean if at least one value is present.
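A small self-contained sketch of the two approaches, using hypothetical score values, shows the difference in how missing values are handled:

```sas
DATA ex1;
  INPUT score1-score3;
  m1 = (score1 + score2 + score3) / 3;   /* expression: missing if any score is missing */
  m2 = MEAN(OF score1-score3);           /* function: uses whatever scores are present */
  DATALINES;
4 5 6
4 . 6
. . 3
. . .
;
RUN;

PROC PRINT DATA=ex1;
  VAR score1-score3 m1 m2;
RUN;
```

With these values, M1 is non-missing only for the first record, while M2 is 5, 5 and 3 for the first three records and missing only for the last, where no score is present.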
Tip – Have a look in the SAS Language Guide on functions. You may find some kind of
Example - Computing a mean score for subjects with at least 11 of 14 subscores (the N function, MEAN
function)
• As part of a study on smoking cessation, subjects were given a 14-question form used to rate a
• For each question a score of 1-7 was possible. High scores were expected to correlate well with
efficacy in quitting.
• Investigators were interested in using the mean of the 14 scores as a predictor of quit status at the
follow-up periods.
• However many subjects had skipped questions on the form so that their data were incomplete.
• The study investigators decided that if at least 11 of the 14 scores were available, the mean would be
• The following program computes a mean for subjects with at least 11 scores, creating a missing value
for the others. It makes use of the N function to find the number of non-missing values, and the
MEAN function.
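DATA EFF2;
SET EFF1;
* find # of non-missing scores using N function *;
NSCORE = N(OF SCORE1-SCORE14);
* compute the mean only when 11 or more scores are present *;
* (MSCORE is left missing otherwise) *;
IF NSCORE GE 11 THEN MSCORE = MEAN(OF SCORE1-SCORE14);
RUN;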
In this example two variables were created: NSCORE the number of non-missing scores, and MSCORE,
the mean of the non-missing scores, when 11 or more scores were present.
• When computing age at some event from the date of birth and the event date, it is common to want
• Rounding off, or formatting the age to 0 decimal places would result in rounding up for fractional parts
over .5, e.g., 8.67 rounds up to 9 years rather than 8 years, which is not how we usually think of
age.
• In this case the integer function, INT, can be used to select the integer part of the computed age:
• Date and Time functions can be used to create DATE, TIME and DATETIME variables from other
values, or to abstract information from DATE and TIME values, such as the day, month or year from
a date value.
• For example in the Wesson Women's Time Study, dates and times were entered as separate values
• The following program gives an example of how these can be converted into DATETIME values, so
that the time spent in a unit can be computed as the difference between two DATETIME values.
• The example makes use of the MDY function to create a date value from the month, day and year of
the study, 1991, and then uses the DHMS function to create a DATETIME value from the Date,
• This example computes the time in hours spent in WETU, the Women's Evaluation and Treatment
Unit.
• Note that when a time difference is computed from a stored DATETIME value, the result is in
units of seconds.
* arrival datetime;
W_ARR_DT=DHMS(WARRDATE,W_ARRHRS,W_ARRMIN,0);
* transfer datetime;
W_TRANDT=DHMS(WTRADATE,W_TRAHRS,W_TRAMIN,0);
• Note that the time spent in the Women’s Evaluation and Treatment Unit, WETUTIME, is in units of
• For example 1.25 means one-and-one-quarter hours, or 1 hour and 15 minutes, and 0.50 means a
half-hour, or 30 minutes.
• SAS TIME formats are meant for use with clock time representations, not durations. If you want
these durations printed in hours and minutes, you would need to compute the minutes from the
fractional hours, and print these as two variables. The following programming statements could be
used:
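A minimal sketch of the whole computation, from separate month, day, hour and minute values to a duration split into hours and minutes. All variable names and the two in-stream records are hypothetical; the study year 1991 is taken from the text.

```sas
DATA wetu;
  INPUT arrmon arrday arrhrs arrmin tramon traday trahrs tramin;
  arrdate  = MDY(arrmon, arrday, 1991);           /* date values from month, day, year */
  tradate  = MDY(tramon, traday, 1991);
  arr_dt   = DHMS(arrdate, arrhrs, arrmin, 0);    /* arrival datetime */
  tra_dt   = DHMS(tradate, trahrs, tramin, 0);    /* transfer datetime */
  wetutime = (tra_dt - arr_dt) / 3600;            /* difference is in seconds; convert to hours */
  hours    = INT(wetutime);                       /* whole hours */
  minutes  = ROUND((wetutime - hours) * 60);      /* minutes from the fractional hours */
  FORMAT arr_dt tra_dt DATETIME16.;
  DATALINES;
7 17 22 30 7 17 23 45
7 19 23 15 7 20 0 45
;
RUN;
```

The first record gives 1.25 hours (1 hour, 15 minutes); the second crosses midnight and gives 1.5 hours (1 hour, 30 minutes), which is why the full datetime is needed and not the clock time alone.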
4 How to Select a Simple Random Sample without Replacement: Using Random Number
Generating Functions
Rationale –
You may have used SAS to create a sampling frame (list of all members of population) and now wish to
select a simple random sample from the list. Random number generators enable the sample to be
selected. An illustration of a method for selecting a simple random sample without replacement from a
population of 10 subjects (with one record per subject) follows. Suppose the subjects are listed in a file as
shown below.
ID Age Sex
1 23 M
2 29 M
3 34 F
4 19 F
5 29 F
6 15 F
7 32 M
8 32 F
9 31 M
10 28 M
Using the Uniform ( ) Function to Select a Simple Random Sample Without Replacement
Assume we wish to select a simple random sample without replacement of 4 subjects from this list. We
can use the Uniform random number generator, which generates a random number between 0 and 1, to do
• Data are read in on one line using the trailing @@ to hold the line.
• The RETAIN statement initializes the total size of the population (n) and the size of the sample (k)
that is to be selected.
• These variables must be named in a RETAIN statement to keep track of the number of observations
left in the population, along with the number of observations remaining to be selected into the
sample.
• As the sample is selected, the value from the previous observation is held unless a change is
specified.
• The variable SAMPLE1 is an indicator variable with value of "0" for observations not in the sample,
and value "1" for observations selected to be in the sample. When selecting a simple random
observation is selected.
• This proportion is given by PROB, the remaining number of observations to be selected divided by
• To determine whether the particular observation in the data step is to be selected in the sample, a
• If the random number is less than PROB, then the observation is selected, and the values of "k" and
"n" are both reduced by one. The indicator for the sample, SAMPLE1, is also set to one. If RN is
bigger than PROB, only the population size "n" is reduced by one and the sampling continues. A
OBS  ID  AGE  SEX  SAMPLE1    RN       PROB     n   k
  1   1   23   M      0     0.75321  0.40000   9   4
  2   2   29   M      1     0.34198  0.44444   8   3
  3   3   34   F      0     0.50989  0.37500   7   3
  4   4   19   F      1     0.41783  0.42857   6   2
  5   5   29   F      0     0.36324  0.33333   5   2
  6   6   15   F      0     0.51605  0.40000   4   2
  7   7   32   M      1     0.02004  0.50000   3   1
  8   8   32   F      1     0.32841  0.33333   2   0
  9   9   31   M      0     0.62433  0.00000   1   0
 10  10   28   M      0     0.59228  0.00000   0   0
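The selection logic described above can be sketched as a single DATA step. This is a reconstruction under assumptions rather than the original program: the data set name is hypothetical, and the variable names follow those mentioned in the text (n, k, PROB, RN, SAMPLE1).

```sas
DATA sample;
  RETAIN n 10 k 4;            /* n = observations left in population, k = left to select */
  INPUT id age sex $ @@;      /* trailing @@ holds the line so several records can be read */
  sample1 = 0;                /* indicator: 1 if selected into the sample */
  prob = k / n;               /* proportion of remaining observations still to be selected */
  rn = UNIFORM(0);            /* uniform random number between 0 and 1 */
  IF rn < prob THEN DO;       /* select this observation */
    sample1 = 1;
    k = k - 1;
  END;
  n = n - 1;                  /* one fewer observation left in the population */
  DATALINES;
1 23 M 2 29 M 3 34 F 4 19 F 5 29 F 6 15 F 7 32 M 8 32 F 9 31 M 10 28 M
;
RUN;
```

Because PROB reaches 1 whenever the number left to select equals the number left in the population, exactly four observations are always selected, and PROB drops to 0 once the sample is full.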
• The uniform random number generator is a SAS function that has "0" as an argument in the previous
program.
• The "0" is a special code that indicates that the uniform random number sequence should start based
• Re-running the program will produce a different set of Uniform random numbers.
• Although independent samples should be based on a different set of random numbers, it may be
desirable to be able to duplicate the sample selection. This can be done by specifying a 5-7
• This is illustrated below, with the selection of two samples, where a common starting point – seed
k3=k3-1;
end;
n3=n3-1;
run;
Notice that the same random numbers were generated (RN2 = RN3) when the same ‘seed’ or starting
value for the random number generator is used, resulting in selection of the same sample.
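A minimal sketch of the effect of a fixed seed, assuming a hypothetical seed value of 20231 and two separate DATA steps. Separate steps are used here because calls to the random number function within one DATA step draw from a single stream.

```sas
DATA s2;
  DO id = 1 TO 10;
    rn2 = UNIFORM(20231);     /* hypothetical seed; fixes the starting point */
    OUTPUT;
  END;
RUN;

DATA s3;
  DO id = 1 TO 10;
    rn3 = UNIFORM(20231);     /* same seed, so the same sequence is produced */
    OUTPUT;
  END;
RUN;

DATA both;
  MERGE s2 s3;
  BY id;
  same = (rn2 = rn3);         /* 1 when the two random numbers agree */
RUN;

PROC PRINT DATA=both;
RUN;
```

Replacing the seed with 0 in either step would give a sequence that changes from run to run, and the two columns would no longer match.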
• There is an array of SAS functions designed for working with character data.
• Tip - Character data can be tricky to work with, since capitalization, spelling, spacing and punctuation
• Many of the character functions can be used to modify values, selectively read parts of character
values, or search values for certain character strings. A few helpful examples follow.
• At the time of data entry, use of capitalization may have been inconsistent. As a result, some
observations that should be matching will not be read as having the same value.
• For example, if data for sex has inconsistently been entered as F, f, M, or m, then a frequency table
SEX2 = UPCASE(SEX);
would convert all values to uppercase, so only 2 distinct values would appear in the new variable
• However, when names, addresses or other character data with more than just a few possible values
Rationale –
• When spacing or punctuation has been inconsistently used, values will not match.
• For example, when names have been entered in one place as ‘Last, First’ and in another as ‘Last
NAME2 = COMPRESS(NAME,',');
Guidelines
• To use the Compress function, the form COMPRESS(arg1,arg2); is used, where arg1 is the variable to
• If you are listing specific characters, then these must be named with single quotes (or double quotes
• Tip - Do not use delimiters (e.g., spaces or commas) between characters to remove, or the space or
.,'); will remove spaces, periods and commas. If no 2nd argument (arg2) is specified, by default
Rationale -
• Inconsistent data entry, or data derived from different sources can leave you with values that match,
• The LEFT and RIGHT functions simply move values to be left or right aligned. LEFT removes blanks
from the start of a value, and RIGHT moves blanks from the end to the start. For example, if values
for sex have been entered as 'Female' or 'Male ' or ' Male', the last 2 values will be read as
different values, since the spacing or alignment differs. Using the statement:
SEX2 = LEFT(SEX);
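The capitalization, punctuation and alignment fixes discussed so far can be combined. The following is a small sketch with hypothetical values assigned directly in a DATA step, so that the leading blank is preserved:

```sas
DATA raw;
  LENGTH sex $6 name $20;
  sex = ' male';  name = 'Smith, John';  OUTPUT;   /* leading blank, comma in name */
  sex = 'Male';   name = 'Smith John';   OUTPUT;   /* different case and punctuation */
RUN;

DATA clean;
  SET raw;
  sex2  = UPCASE(LEFT(sex));        /* left-align, then convert to uppercase: MALE */
  name2 = COMPRESS(name, ', ');     /* remove commas and blanks: SmithJohn */
RUN;

PROC FREQ DATA=clean;
  TABLES sex2 name2;
RUN;
```

After cleaning, both records share one value of SEX2 and one value of NAME2, so the frequency tables show a single level for each.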
Rationale -
• This is a great function! When information is encoded within a variable, for example, with an
• In the Anaconda Soil Ingestion Study, informative ID numbers for samples (SID), of the form ‘AnnSdd’
were used, where the first character (A) indicated the particular study, nn the child ID number, S the
• Sample analysis results were output in electronic form from the laboratory equipment used for
analysis. It was necessary to extract information from the sample ID number to determine which
child, and which sample type. The following statement will do this:
CHILDID = SUBSTR(SID,1,3);
STYPE = SUBSTR(SID,4,1);
• The CHILDID is read from the sample ID (SID) starting in position 1, reading for 3 characters.
• The sample type (STYPE) is read from SID, starting in position 4, reading 1 character. For example,
Guidelines
To use SUBSTR(arg1,arg2,arg3),
Rationale -
Verify can be used to check that all values in your data fit a limited set of acceptable values.
For example, suppose you have data, stored as character codes (CODE), which should take on only the
values 0, 1, 2, 3 or 9, the following statements can be used to check for invalid data:
CHK = '01239';
VCODE = VERIFY(CODE,CHK);
IF VCODE NE 0 THEN PUT 'INVALID CODE: ' CODE= SID=;
• The value of VCODE will be 0 if the value of CODE contains only 0,1,2,3, or 9, but will return the first
• The IF-THEN statement will write a message to the log with the invalid code and subject ID, when
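A minimal sketch combining SUBSTR and VERIFY, assuming hypothetical sample IDs of the form 'AnnSdd' and a one-character CODE. CODE is read with length 1 so that trailing blanks are not flagged as invalid characters.

```sas
DATA lab;
  INPUT sid :$6. code :$1.;
  childid = SUBSTR(sid, 1, 3);            /* positions 1-3 of the sample ID */
  stype   = SUBSTR(sid, 4, 1);            /* position 4: sample type */
  vcode   = VERIFY(code, '01239');        /* 0 if valid, else position of first invalid character */
  IF vcode NE 0 THEN PUT 'INVALID CODE: ' code= sid=;
  DATALINES;
A12S03 1
A12F01 7
A07S02 9
;
RUN;
```

With these records, only the second one is reported in the log, since 7 is not among the acceptable values.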