LEARN SAS Within 7 Weeks: Part4 (More On Manipulating Data)

Week 10 More on Manipulating Data
Unit 4
SAS for Data Management
Week 10: More on Manipulating Data
Welcome.
This reading is a wonderful reservoir of tools, such as: how to document data entry errors in the SAS log
using the PUT statement, how to select a simple random sample, and how to clean character data.
Numerous examples illustrate the specification of the necessary code.
Goals of Week 10: More on Manipulating Data
1. to be competent in using SAS to verify data entry in an easy to read manner;
2. to be aware of the multiple uses of the PUT statement (such as report writing, forms creation, etc)
3. to know how to create sequential study identification numbers;
4. to understand how SAS stores dates and how to do calculations involving dates;
5. to be competent in the informatting and formatting of SAS dates and times;
6. to be aware of the variety of SAS functions that are available;
7. to appreciate that SAS offers multiple ways of working with missing information;
8. to know how to select a simple random sample without replacement from a sampling frame; and
9. to be competent in manipulating character data.
week 10 10.1
Week 10 Outline – More on Manipulating Data

Section Topic Page
1. Using the PUT Statement ………………..…………….……. …………….. 3
a. How to Write to a File ………………………………………………………… 3

b. How to Write Input Data to the Log .………………………………………... 5
c. Data Quality Assessment – A Good Way to Write Out a Subset of Data . 6
d. How to Output Data Using Column Format………………………………… 9
e. How to Use PUT Instruction to Assign Sequential Identification Values
Plus Some Bells and Whistles …………………………………………. … 11
2. Dates and Time in SAS ……….……………………………………..……………. 15
a. How to Use Informats/Formats to Read/Write Dates ………………………. 15

b. How to Compute Intervals Between Dates ……………………………… 18
c. How to Work with a Date as a Constant ………………………………… 19
d. The TODAY( ) and DATE( ) Functions …………………………………. 20
3. Introduction to SAS Functions ………….……………………………………….... 22
a. Overview of SAS Functions ……………………………………. ………….. 22

b. Using SAS to Manipulate Missing Values ……………………..…………... 23
c. How to Compute Age in Completed Years (INT) ………..………………… 27
d. How to Extract Date and Time Information (MDY,DHMS)… ..…………….. 27
4. How to Select a Simple Random Sample Without Replacement ..………….. 30
5. Introduction to SAS Character Functions ……………………………..……………. 36
a. How to Convert to Upper Case (UPCASE) …..……………………………. 36

b. How to Remove Spaces and Special Characters (COMPRESS) …..…. 37
c. How to Align Character Data (LEFT, RIGHT) …………………………… 38
d. How to Extract a String of Characters (SUBSTR)………………………… 38
e. How to Validate a Character Value (VERIFY) ……………………………. 39
week 10 10.2
1 Using the PUT Statement
There is a variety of ways in SAS to code instructions that will result in the writing or outputting of
information
• A DATA statement together with a SET statement outputs data to a SAS data set. (See Unit 4
Week 2; this is the Week 8 reading)
• A PROC PRINT writes out information in a SAS data set to the results window. The results window
can then be saved as a “.lst” file.
• The PUT statement is another SAS instruction for writing or outputting of information.
a. How to Write to a File
Rationale
• It may be of interest to output data to an ASCII (text) file for some other use
General Guidelines
• Use a PUT statement to write information to the location indicated in a preceding FILE statement.
Note – A FILE statement does not require a companion libname statement. This is because a FILE
statement indicates a full path address already.
• If no FILE statement precedes, the PUT statement will write the information to the SAS log.
• Thus, the PUT statement with no preceding FILE statement is the way to write information to the
SAS log.
• Tip - Be aware that there are many options for use with the PUT statement. See the SAS Language
Guide or refer to SAS help.
week 10 10.3
Example
The PUT statement was introduced in Unit 4, Week 2 as a method for writing/exporting SAS data to an
ASCII (text) file using a DATA step:
Libname old “z:\river valley\sasdata”; /* libname tells SAS the location of the source SAS data */
Data _null_; /* _null_ is a placeholder SAS data set that will not be saved */
set old.screen02; /* SAS is told to read source data from screen02.sas7bdat */
file “z:\river valley\txtdata\screen02.dta”; /* Specifies the name and destination of the ascii data set */
put medrec cd4; /* Specifies the variables that are to be written */
run;
In this example the two variables (MEDREC and CD4) would be listed in the file separated by a space.
Additional Tips
• The PUT statement can be used to write information to the log, to the output window or to files and,
as a choice of SAS instruction for purposes of writing, the PUT statement has certain advantages.
• Information “written” with a PUT statement can include variable values with or without variable
names, as well as text that you type in.
• By default, a PUT statement writes lines to the SAS LOG when no file specification is given.
week 10 10.4
b How to Write Input Data to the Log
Rationale
• In some instances (especially when the data set size is reasonable), it may be good documentation
to have recorded in the SAS log the data that was actually read in.
• This may aid in correcting data problems later.
Guidelines
• To write the data to the log as it is read in, use the statement: PUT _INFILE_; after the INPUT
statement.
Example
** example writing input data to the LOG **;

DATA BOOKS;
INFILE 'C:\TEMP\HW4P3.DAT';
INPUT NAME $ DAYS SUBJECT $15.;
PUT _INFILE_;
RUN;
This example also illustrates use of the SAS instruction _INFILE_ . It refers specifically to data that is read
in during the data step.
week 10 10.5
These SAS instructions produced the following listing in the log.
1 DATA BOOKS;
2 INFILE 'C:\SPH\HW6#3.DAT';
3 INPUT NAME $ DAYS SUBJECT $15.;
4 PUT _INFILE_;
5 RUN;
BARBARA 2 ENGLISH
CAROL 2 SCIENCE
CAROL 3 ENGLISH
CAROL 4 MATH
DONALD 1 ART
JAMES 4 MATH
JOYCE 5 HOME ECONOMICS
JOYCE 6 SCIENCE
MARY 2 MECHANICS
PHILIP 1 HOME ECONOMICS
PHILIP 3 MATH
WILLIAM 2 SCIENCE
NOTE: The infile 'C:\TEMP\HW4P3.DAT' is file C:\TEMP\HW4P3.DAT.
NOTE: 12 records were read from the infile C:\TEMP\HW4P3.DAT.
The minimum record length was 15.
The maximum record length was 26.
NOTE: The data set WORK.BOOKS has 12 observations and 3 variables.
NOTE: The DATA statement used 6.00 seconds.
c. Data Quality Assessment – A Good Way to Write Out a Subset of Observations
Rationale
• You may wish to identify subjects who have missing or invalid data for a particular variable, or those
subjects with a particular combination of variable values.
• One way to get such a list is to make a subset of the data for observations with the characteristic of
interest, and then to list the data using PROC PRINT. This can require many different data steps
plus a procedure step.
• Alternatively, the list of problem observations can be written to the output file or to the log using PUT
statements in the course of a DATA step, as in the example below.
week 10 10.6
Example
• In this example, two uses of the PUT statement are illustrated.
• In the first, the variables to be written to the log are named with an equal sign ( PUT VARNAME= ).
When this form is used the ‘variable name = value’ is printed to the log (VARNAME=value).
• In the 2nd example, a mixture of text and variable values are used to write information to the log: the
PUT statement says to (1) write the value of the variable NAME , (2) write the text enclosed in
quotes, (3) the value of the variable WEIGHT, and (4) the text enclosed in quotes.
** use of PUT statement to write to the log **;

DATA CLASS;
INFILE 'C:\TEMP\PUT2.DAT';
INPUT NAME $ SEX $ AGE HEIGHT WEIGHT;
* list students with invalid weight *;

* and change weight to missing *;
IF WEIGHT > 200 THEN DO;
PUT NAME= WEIGHT= ;
PUT NAME 'has invalid weight of ' WEIGHT 'lbs.';
WEIGHT=.;
END;
RUN;
Following is the SAS log that was produced. Recall that the PUT statement writes to the LOG by default,
unless preceded by a FILE statement. The lines in the SAS log that were produced by the PUT statements
are highlighted here.
1 DATA CLASS;
2 INFILE 'C:\SPH\691F\PUT2.DAT';
3 INPUT NAME $ SEX $ AGE HEIGHT WEIGHT;
4
5 * list students with invalid weight *;
6 * and changes weight to missing *;
7 IF WEIGHT > 200 THEN DO;
8 PUT NAME= WEIGHT= ;
9 PUT NAME 'has invalid weight of ' WEIGHT 'lbs.';
10 WEIGHT=.;
11 END;
12 RUN;
week 10 10.7
NAME=JOHN WEIGHT=995
JOHN has invalid weight of 995 lbs.
NOTE: The infile 'C:\TEMP\PUT2.DAT' is file C:\TEMP\PUT2.DAT.
NOTE: 19 records were read from the infile C:\TEMP\PUT2.DAT.
The minimum record length was 21.
The maximum record length was 23.
NOTE: The data set WORK.CLASS has 19 observations and 5 variables.
week 10 10.8
d How to Output Data Using Column Format
Rationale
• It may be that you want to write a list of data to the SAS output file so that it accompanies
descriptions or analysis results.
• Such SAS output files might be descriptions, analysis results, or the reporting of feedback
information.
Guidelines
• The PUT statement is analogous to an INPUT statement in this case.
• All the options available for reading data with INPUT statements can be used for writing data with
PUT statements.
Example
** writing data to output window **;

DATA CLASS;
INFILE 'C:\TEMP\PUT2.DAT';
INPUT NAME $ SEX $ AGE HEIGHT WEIGHT;
* route result of PUT statement to output window *;

FILE PRINT;
* list name, age, sex for females only *;

IF SEX='F' THEN
PUT NAME 1-8 SEX 10 AGE 12-13;
TITLE1 ‘LISTING OF GIRLS IN CLASS:’;
RUN;
This would appear in the output window:
LISTING OF GIRLS IN CLASS:
ALICE F 13
BARBARA F 13
CAROL F 14
week 10 10.9
JANE F 12
JANET F 15
JOYCE F 11
JUDY F 14
LOUISE F 12
MARY F 15
week 10 10.10
e. How to Use the PUT Instruction to Assign Sequential Identification Values Plus Some Bells
and Whistles
Rationale
• A PUT statement can be used to create a header page for surveys or other data collection forms,
each pre-printed with a sequential ID number.
• Tip - Such pre-numbering of forms can prevent assignment of duplicate ID numbers.
Example
• In the example that follows a header page is created for a chart review study that requires reviewing
mother’s and infant’s charts.
• A sequential ID is printed on each form.
• The form is used to record the names and medical record numbers of the mother and infant, and a
place is given for recording the date of chart review, reviewer’s name, as well as date of data entry,
and enterer’s name.
This example also illustrates some bells and whistles:
• A DATA _NULL_ statement is used since no data set is to be created
• A DO loop is used to create the sequential ID numbers -- in this example one hundred forms are
created, numbered from 101 to 200.
• For each form a new page is started, using the PUT _PAGE_ command. Then instructions for the
form specifics are given.
• The @ symbol is used to define the starting column, just as in an INPUT statement.
week 10 10.11
• The # symbol is used to define a specific line number on the page.
• Text to be printed, including lines to be drawn must be enclosed in quotes. Use double
quotes if the text contains a single quote.
• The forward slash (/) indicates move to the next line.
• See the SAS Language Guide or SAS HELP, section on PUT statements for more details
on line and pointer controls.
Note that there is one very long PUT statement that ends with a semi-colon. This PUT statement is listed
within a DO loop that creates the sequential ID numbers.
**********************************************;
** example of creating sequential ID number **;
** and cover page for data collection form **;
**********************************************;
OPTIONS nodate nonumber nocenter ls=78 ps=60;
DATA _NULL_; * start data step-no file created;

FILE 'C:\temp\cover1.lst'; * names file to store results;
DO CASEID = 101 TO 200; * creates forms 101 to 200;

PUT _PAGE_; * starts new page for each new id;
PUT / / /
@10 ' _______________________________________ ' /
@10 '| |' /
@10 '| BAYSTATE MEDICAL CENTER |' /
@10 '| FETAL MACROSOMIA STUDY |' /
@10 '| |' /
@10 '|_______________________________________|' /
/
@10 'STUDY ID = ' @22 CASEID /
/
@10 "MOTHER'S NAME: __________________________________ " /
@26 '(last, first)' /
/
@10 "MOHTER'S MEDICAL RECORD NUMBER: __ __ __ __ __ __ "
/ /
@10 'INFANT NAME: ___________________________________ ' /
@26 '(last, first)' /
/
week 10 10.12
@10 'INFANT MEDICAL RECORD NUMBER: __ __ __ __ __ __ ' /

#45 @10 "MOTHER'S CHART REVIEWED BY: ____________________ on ___ / ___ / ___" /
/
@10 'INFANT CHART REVIEWED BY: ____________________ on ___ / ___ / ___' /
/
@10 'FORMS CODED BY: _________________________________ on ___ / ___ / ___' /
/
@10 'DATA ENTERED BY: ________________________________ on ___ / ___ / ___'
; *end of put statement;
END; *end of do loop;
RUN;
Following is the first page written by the above program:
_______________________________________
| |
| BAYSTATE MEDICAL CENTER |
| FETAL MACROSOMIA STUDY |
| |
|_______________________________________|
STUDY ID = 101
MOTHER'S NAME: __________________________________

(last, first)
MOHTER'S MEDICAL RECORD NUMBER: __ __ __ __ __ __
INFANT NAME: ___________________________________

(last, first)
INFANT MEDICAL RECORD NUMBER: __ __ __ __ __ __
week 10 10.13
MOTHER'S CHART REVIEWED BY: _________________ on ___ / ___ / ___
INFANT CHART REVIEWED BY: __________________ on ___ / ___ / ___
FORMS CODED BY: _____________________________ on ___ / ___ / ___
DATA ENTERED BY: ____________________________ on ___ / ___ / ___
Much more complicated and detailed forms can be generated with PUT statements. This can be
particularly useful in follow-up surveys, where a form for each subject in the baseline survey can be pre-
printed with name, address phone number, ID numbers, etc. Any useful information from the baseline
survey can be incorporated and printed using a PUT statement. To do this, the DATA _NULL_; statement
would be followed by a SET statement, naming the source data file containing the baseline data, before the
DO loop and PUT statement.
week 10 10.14
2 Dates and Times in SAS
SAS stores dates internally using numbers that are related to the date January 1, 1960
• In SAS, a date is recorded as the number of days from January 1, 1960, with Jan 1, 1960 having a
value of 0.
• Dates after 01/01/1960 have positive values, and dates before this date have negative values.
• If you fail to assign a format to a date using a format statement, the number printed will be the
number of days since January 1, 1960 – not a particularly useful piece of data to look at – so
assign formats to dates!.
• A record length of 5 is sufficient for all recent dates (dates after 778 BC) represented as month, day,
year since there are at most 5 significant digits. This record length also is sufficient for negative
SAS dates.
a. How to Use INFORMATS/FORMATS to Read/Write Dates
• We read dates recorded as month, day, and year. SAS doesn’t know how to do this.
• Therefore, special DATE informats exist to translate this specification of the date to the SAS numeric
representation. A
• n informat works in a manner similar to a format – it provides instruction for reading values, rather
than writing values.
• Similarly, since the SAS representation of a date as the number of days from 01/01/1960 is not
readily recognized as a month, day and year, formats exit to convert SAS numeric dates to month,
day, year format for printing.
week 10 10.15
Example
• Data in the following example includes information on birth date, date of surgery and date of post-
operative testing.
• The dates in this file are in MMDDYY format, in this case with a slash (/) to separate month, day, and
year.
• The subject birth dates use 10 columns (MMDDYY10. format), with year taking 4 columns
• The other dates use 8 columns with year taking 2 columns, or MMDDYY8. format.
• Note the INFORMAT statement used before the INPUT statement to define the date formats.
• Alternatively, without an informat statement the format for reading the dates can be indicated directly
on the input statement, as in
INPUT CASEID @5 DOB MMDDYY10. @16 SURDATE MMDDYY8.

@25 TDATE MMDDYY8.;
• In the program that follows, copies of each of the dates are made under new names so that several of
the different formats for printing dates can be illustrated.
• Note that date formats are always available for use in SAS, without running PROC FORMAT.
• However, formats must be assigned to date variables in a FORMAT statement in a data or proc step,
like any other formats.
• A format should always be assigned to DATE variables in SAS for printing, since looking at
the number of days since 01JAN60 is not usually very informative.
week 10 10.16
OPTIONS PAGESIZE=60 LINESIZE=128 nodate nonumber nocenter;

***************************************************;
** read in data using MMDDYY. format for dates **;
***************************************************;
DATA DATES;
INFILE 'C:\temp\DATE1.DAT';
* how to read dates *;
INFORMAT DOB MMDDYY10.
SURDATE TDATE MMDDYY8.;
INPUT CASEID DOB SURDATE TDATE;
** make copies of dates for demonstrating use of formats**;

DOB1=DOB;
SURDATE1=SURDATE;
TDATE1=TDATE;
** compute days from surgery to testing, **
** and patient age at surgery **;
DAYS = TDATE - SURDATE;
AGE = INT((SURDATE - DOB) / 365.25); /* the INT function rounds to integer */
** assign formats to dates **;

** no format for TDATE1 **;
FORMAT DOB MMDDYY10.
DOB1 DATE7.
SURDATE WORDDATE12.
SURDATE1 WEEKDATE23.
TDATE WORDDATX12.;
RUN;
* print results to illustrate formats *;

PROC PRINT DATA=DATES;
ID CASEID;
VAR DOB DOB1 SURDATE SURDATE1 AGE;
TITLE1 'LIST 1: EXAMPLE OF DATE FORMATS';
RUN;
PROC PRINT DATA=DATES;
ID CASEID;
VAR SURDATE TDATE TDATE1 DAYS;
TITLE1 'LIST 2: EXAMPLE OF DATE FORMATS';
TITLE2 'UNFORMATTED TDATE1 LISTED AS DAYS FROM JAN 1, 1960';
RUN;
week 10 10.17
The output follows.
LIST 1: EXAMPLE OF DATE FORMATS
CASEID DOB DOB1 SURDATE SURDATE1 AGE
101 12/03/1917 03DEC17 Jul 17, 1991 Wednesday, Jul 17, 1991 73
102 03/18/1910 18MAR10 Jul 19, 1991 Friday, Jul 19, 1991 81
103 09/21/1930 21SEP30 Jul 22, 1991 Monday, Jul 22, 1991 60
104 08/27/1924 27AUG24 Jul 26, 1991 Friday, Jul 26, 1991 66
105 10/11/1938 11OCT38 Jul 29, 1991 Monday, Jul 29, 1991 52
106 06/14/1919 14JUN19 Jul 30, 1991 Tuesday, Jul 30, 1991 72
LIST 2: EXAMPLE OF DATE FORMATS

UNFORMATTED TDATE1 LISTED AS DAYS FROM JAN 1, 1960
CASEID SURDATE TDATE TDATE1 DAYS
101 Jul 17, 1991 4 Aug 1991 11538 18

102 Jul 19, 1991 5 Aug 1991 11539 17
103 Jul 22, 1991 5 Aug 1991 11539 14
104 Jul 26, 1991 13 Aug 1991 11547 18
105 Jul 29, 1991 15 Aug 1991 11549 17
106 Jul 30, 1991 15 Aug 1991 11549 16
b. How to Compute Intervals between Dates
SAS’ system for internal storage of dates makes straightforward computations involving dates.
• The advantage of storing dates as a number of days from some reference point lies in the ease of
computation of differences, or durations.
• In the above example, the number of days between surgery and post-operative testing can be
computed as a simple difference:
DAYS = TDATE - SURDATE;
• The result of a difference between two dates is always in units of days.
• If we wanted the difference in weeks this could be computed as:
week 10 10.18
WEEKS = (TDATE - SURDATE) / 7;
• In the same manner, AGE (in years) at surgery can be computed as the number of days between
surgery date and birth date, divided by 365.25:
AGE = (SURDATE-DOB)/365.25;
• In the example in the program above the INT (integer) function was used to round down to the integer
part – this is discussed in the section on SAS functions below.
Occasionally, you may need to be mindful of the Y2K problem in working with dates
TIP - It’s a good idea to use 4 digits for year whenever possible when reading and writing dates to avoid
confusion. In the above example, the birth dates were listed with 4 digit years for this reason. When
working with birthdates of subjects in the early 1900’s, dates will be read incorrectly as early 2000’s, if only
the final 2 digits of the year are given. These can be corrected using SAS Functions (see following, 3.
Introduction to SAS Functions).
TIME and DATETIME formats and informats are available for use in SAS. These can be used when times,
and time differences are of interest. An example is given in the section on date and time functions, which
follows.
c How to Work with a Date as a Constant
• Constant values for dates, times, and datetimes can also be referred to using the DATE. , TIME. or
DATETIME. formats.
week 10 10.19
• For example given information on birth date (DOB), to compute a subject's age in years at a specific
date, for example Jan 1, 1992, you could use the following statement:
AGE1992 = ('01JAN1992'D - DOB)/365.25;
• The date constant must be in DATE9. format (days in 2 digits, month in 3 letter abbreviation,
and years in 4 digits) enclosed in single quotes, and followed by the letter d (d or D works),
no spaces.
Similar to dates, TIME and DATETIME constants can be defined. See the section on date, time and
datetime constants in the chapter on SAS expressions in the SAS Language Guide for more details.
d The TODAY and DATE Functions
The TODAY( ) Function
The statement:
TDATE = TODAY();
creates the variable TDATE with the current date on the computer as the value. The word today is followed
by empty parentheses.
Example – Computing a subject’s current age:
CURR_AGE = (TODAY() - DOB)/365.25;
week 10 10.20
The DATE( ) Function
The DATE( ) function gives the same result as the TODAY( ) function.
TDATE = DATE();
week 10 10.21
3. Introduction to SAS Functions
Many special functions for computing and manipulating numeric and character data are available for use in
the SAS system.
a. Overview of SAS Functions
The functions are divided into 13 categories:
Category Functions in SAS
Arithmetic ABSolute value, MIN, MAX, MODulus, SIGN, and others
Truncation CEIL (smallest integer), FLOOR (largest integer), ROUND,

INTeger and others
Mathematical EXP (power e), LOG (natural log), LOG10 (log base 10)
Trigonometric ARCOS, ARSIN, ATAN, COS, SIN, SINH, TAN, TANH
Probability POISSON, PROBBETA, PROBBNML, PROBCHI, PROBF,

PROBT, PROBNORM, and others
Quantiles CINV (chi-square), FINV, GAMINV, PROBIT (inverse of

normal), TINV
Sample Statistics N (# non-miss), MEAN, CV, KURTOSIS, NMISS, RANGE,

STD, VAR, STDERR, SUM and others
Random Number NORMAL, RANBINomial, RANEXP, RANNOR, RANPOI,

UNIFORM, RANUNI and others
Financial COMPOUND, MORT, SAVING, and others
Character BYTE, LEFT align, LENGTH, SCAN, TRIM, UPCASE, and

others
Date and time DATE (today's date), DATEJUL (Julian to SAS), MDY,
MONTH, DAY, YEAR, others
State and Zip code ZIPSTATE, FIPNAME and others
Special functions DIFn (difference in lags), INPUT, LAGn, PUT, SOUND,

SYMGET
For details, see SAS Language Guide.
week 10 10.22
• Functions can provide a simple way of computing complex formulae, such as a variance or probability
function.
• TIP: Many of the functions handle missing values differently than when computing with
expressions. It is important to pay attention to this detail when choosing to compute using SAS
functions. Examples of use of functions in a few key categories follow.
b. Using SAS to Manipulate Missing Values
Suppose you want to compute the mean of 3 scores for a subject (MEAN function) There’s more than one
approach.
• This can be done by defining an expression to sum the three scores and divide by 3.
• Alternatively, a mean can be computed by naming the three variables in the MEAN function.
• TIP - The result is not the same because missing values are handled differently in the two
approaches. Using the first method, the mean score has a missing value if any one of the three
scores is missing. Using the MEAN function, the mean score has a value if at least one of the three
scores has a non-missing value, as illustrated below:
** EXAMPLE TO SHOW HOW MEAN FUNCTION **;

** HANDLES MISSING VALUES **;
DATA EX1;
INPUT SCORE1 SCORE2 SCORE3;
M1 = (SCORE1 + SCORE2 + SCORE3)/3; * mean of all 3 scores;

M2 = MEAN(SCORE1,SCORE2,SCORE3); * mean of non-missing ;
CARDS;
1 2 3
. 2 3
1 . 3
week 10 10.23
. . 3
. . .
;
RUN;
PROC PRINT DATA=EX1;
VAR SCORE1-SCORE3 M1 M2;
TITLE1 'MEAN SCORES COMPUTED TWO WAYS';
RUN;
How to Use the MEAN Function.
• To use the mean function, the word MEAN is followed by a variable list enclosed in parentheses.
• The variable names must be separated by commas, no spaces.
• If the variables names all have the same prefix with sequential numbering, an alternative hyphenated
list form is available:
M2 = MEAN(SCORE1,SCORE2,SCORE3);
or
M2 = MEAN(OF SCORE1-SCORE3);
Results of the above program follow. Note that the expression only computes a mean when all values are
non-missing, while the MEAN function computes a mean if at least one value is present.
week 10 10.24
MEAN SCORES COMPUTED TWO WAYS
OBS SCORE1 SCORE2 SCORE3 M1 M2

1 1 2 3 2 2.0
2 . 2 3 . 2.5
3 1 . 3 . 2.0
4 . . 3 . 3.0
5 . . . . .
Tip – Have a look in the The SAS Language Guide on functions. You may find some kind of
computational shortcut by using functions.
Example - Computing a mean score for subjects with at least 11 of 14 subscores (the N function, MEAN
function)
• As part of a study on smoking cessation, subjects were given a 14-question form used to rate a
patient on self-efficacy in quitting.
• For each question a score of 1-7 was possible. High scores were expected to correlate well with
efficacy in quitting.
• Investigators were interested in using the mean of the 14 scores as a predictor of quit status at the
follow-up periods.
• However many subjects had skipped questions on the form so that their data were incomplete.
• The study investigators decided that if at least 11 of the 14 scores were available, the mean would be
a valid measure of efficacy.
• The following program computes a mean for subjects with at least 11 scores, creating a missing value
for the others. It makes use of the N function to find the number of non-missing values, and the
MEAN function.
week 10 10.25
DATA EFF2;
SET EFF1;
* find # of non-missing scores using N function *;
NSCORE = N(OF SCORE1-SCORE14);
* compute mean of available scores when have 11+ *;

* using MEAN function *;
IF NSCORE>=11 THEN MSCORE=MEAN(OF SCORE1-SCORE14);
RUN;
In this example two variables were created: NSCORE the number of non-missing scores, and MSCORE,
the mean of the non-missing scores, when 11 or more scores were present.
week 10 10.26
c How to Compute Age in Completed Years( the INT Function)
• Age is typically reported in completed years.
• When computing age at some event from the date of birth and the event date, it is common to want
whole numbers only for age in years.
• Rounding off, or formatting the age to 0 decimal places would result in rounding up for fractional parts
over .5, e.g., 8.67 rounds up to 9 years rather than 8 years, which is not how we usually think of
age.
• In this case the integer function INT, can be used, to select the integer part of the computed age:
AGE = INT( (SURGDATE - DOB)/365.25 );
d. How to Extract Date and Time Information
• Date and Time functions can be used to create DATE, TIME and DATETIME variables from other
values, or to abstract information from DATE and TIME values, such as the day, month or year from
a date value.
• For example in the Wesson Women's Time Study, dates and times were entered as separate values
for day, month, hour and minutes.
• The following program gives an example of how these can be converted into DATETIME values, so
that the time spent in a unit can be computed as the difference between two DATETIME values.
• The example makes use of the MDY function to create a date value from the month, day and year of
the study, 1991, and then uses the DHMS function to create a DATETIME value from the Date,
Hours, Minutes, and (zero) Seconds.
week 10 10.27
• This example computes the time in hours spent in WETU, the Women's Evaluation and Treatment
Unit.
• Note that when a time difference is computed from a stored DATETIME value, the result is in
units of seconds.
OPTIONS PAGESIZE=60 LINESIZE=78;

LIBNAME D 'D:\WWTIME';
*************************************;
** COMPUTE TIME SPENT IN WETU **;
** AS DIFFERENCE BETWEEN TRANSFER **;
** AND ARRIVAL TIME **;
** USING WWTIME DATA **;
*************************************;
DATA WETU(KEEP=WETUTIME W_TRANDT W_ARR_DT);
SET D.WWTIME1(OBS=10); * reads only first 10 obs;
* compute date from month,day,year *;

WARRDATE=MDY(W_ARRMON,W_ARRDAY,1991); * arrival date;
WTRADATE=MDY(W_TRAMON,W_TRADAY,1991); * transfer date;
* compute datetime from date, hours, mins, and secs=0 *;
* arrival datetime;
W_ARR_DT=DHMS(WARRDATE,W_ARRHRS,W_ARRMIN,0);
* transfer datetime;
W_TRANDT=DHMS(WTRADATE,W_TRAHRS,W_TRAMIN,0);
* compute time spent in WETU as difference adjusted to hrs *;

WETUTIME = (W_TRANDT - W_ARR_DT) / 3600;
RUN;
** print data assigning datetime formats to transfer **;

** and arrival times, and allowing 6 columns, 2 dec **;
** for printing the time difference in hours **;
PROC PRINT DATA=WETU;
FORMAT W_TRANDT W_ARR_DT DATETIME13.
WETUTIME 6.2;
TITLE1 'DURATION, IN HOURS, SPENT IN WETU';
RUN;
week 10 10.28
The resulting output follows.
DURATION, IN HOURS, SPENT IN WETU
OBS W_ARR_DT W_TRANDT WETUTIME
1 09FEB91:06:35 09FEB91:07:28 0.88

2 07FEB91:11:25 07FEB91:12:45 1.33
3 07FEB91:09:20 07FEB91:10:45 1.42
4 07FEB91:05:30 07FEB91:06:35 1.08
5 05FEB91:04:30 05FEB91:05:50 1.33
6 10FEB91:02:45 10FEB91:04:00 1.25
7 06FEB91:18:20 06FEB91:19:35 1.25
8 10FEB91:08:50 10FEB91:09:20 0.50
9 10FEB91:01:45 10FEB91:03:30 1.75
10 08FEB91:08:58 08FEB91:10:53 1.92
• Note that the time spent in the Women’s Evaluation and Treatment Unit, WETUTIME, is in units of
hours, with decimal fractions of hours.
• For example 1.25 means one-and-one-quarter hours, or 1 hour and 15 minutes, and 0.50 means a
half-hour, or 30 minutes.
• SAS TIME formats are meant for use with clock time representations, not durations. If you want
these durations printed in hours and minutes, you would need to compute the minutes from the
fractional hours, and print these as two variables. The following programming statements could be
used:
* complete hours as integer part;

WETUHRS = INT(WETUTIME);
* fractional hours times 60 min/hr ;

WETUMIN = (WETUTIME - WETUHRS) * 60;
week 10 10.29
4 How to Select a Simple Random Sample without Replacement: Using Random Number
Generating Functions
Rationale –
You may have used SAS to create a sampling frame (list of all members of population) and now wish to
select a simple random sample from the list. Random number generators enable the sample to be
selected. An illustration of a method for selecting a simple random sample without replacement from a
population of 10 subjects (with one record per subject) follows. Suppose the subjects are listed in a file as
shown below.
ID Age Sex
1 23 M
2 29 M
3 34 F
4 19 F
5 29 F
6 15 F
7 32 M
8 32 F
9 31 M
10 28 M
Using the Uniform ( ) Function to Select a Simple Random Sample Without Replacement
Assume we wish to select a simple random sample without replacement of 4 subjects from this list. We
can use the Uniform random number generator, which generates a random digit between 0 and 1, to do
the selection with the following program.
week 10 10.30
*** selecting a simple random sample ***;

*** without replacement ***;
libname old 'c:\temp\';
data frame1;
input sid age sex $1. @@;
cards;
1 23 M 2 29 M 3 34 F 4 19 F 5 29 F 6 15 F
7 32 M 8 32 F 9 31 M 10 28 M
;
run;
data frame2a;
set frame1;
retain n 10
k 4; *set starting pop=n and sample size=k ;
sample1=0; * set indicator of in sample to 0;
prob=k/n; * set probability of selection;
rn=uniform(0); * select random number in (0,1);
if rn < prob then do; * if rn in (0,rn) ;
sample1=1; * set indicator to 1 ;
k=k-1; * decrease number to select ;
end;
n=n-1; * decrease remaining pop size ;
run;
proc print data=frame2a;

var sid age sex sample1 rn prob n k;
title1 'Listing after Sample Selection in SAS data set FRAME2a';
run;
• Data are read in on one line using the trailing @@ to hold the line.
• The sample is selected in the data set FRAME2A.
• The RETAIN statement initializes the total size of the population (n) and the size of the sample (k)
that is to be selected.
• These variables must be named in a RETAIN statement to keep track of the number of observations
left in the population, along with the number of observations remaining to be selected into the
sample.
• As the sample is selected, the value from the previous observation is held unless a change is
specified.
week 10 10.31
• The variable SAMPLE1 is an indicator variable with value of "0" for observations not in the sample,
and value "1" for observations selected to be in the sample. When selecting a simple random
sample without replacement, the proportion of observations to be selected changes as each
observation is selected.
• This proportion is given by PROB, the remaining number of observations to be selected divided by
the remaining number of observations in the population.
• To determine whether the particular observation in the data step is to be selected in the sample, a
Uniform random number (RN) is selected and compared with PROB.
• If the random number is less than PROB, then the observation is selected, and the values of "k" and
"n" are both reduced by one. The indicator for the sample, SAMPLE1, is also set to one. If RN is
bigger than PROB, only the population size "n" is reduced by one and the sampling continues. A
listing of the resulting data follows:
Listing after Sample Selection in SAS data set FRAME2a
OBS SID AGE SEX SAMPLE1 RN PROB N K
1 1 23 M 0 0.75321 0.40000 9 4
2 2 29 M 1 0.34198 0.44444 8 3
3 3 34 F 0 0.50989 0.37500 7 3
4 4 19 F 1 0.41783 0.42857 6 2
5 5 29 F 0 0.36324 0.33333 5 2
6 6 15 F 0 0.51605 0.40000 4 2
7 7 32 M 1 0.02004 0.50000 3 1
8 8 32 F 1 0.32841 0.33333 2 0
9 9 31 M 0 0.62433 0.00000 1 0
10 10 28 M 0 0.59228 0.00000 0 0
week 10 10.32
How the Uniform Random Number Generator Uniform(0) Works
• The uniform random number generator is a SAS function that has "0" as an argument in the previous
program.
• The "0" is a special code that indicates that the uniform random number sequence should start based
on the date and time that the program is run.
• Re-running the program will produce a different set of Uniform random numbers.
• Although independent samples should be based on a different set of random numbers, it may be
desirable to be able to duplicate the sample selection. This can be done by specifying a 5-7
digit odd integer as the argument in the UNIFORM function.
• This is illustrated this below, with the selection of two samples, where a common starting point – seed
– for the random number sequence is specified.
*** select 2 samples using same SEED ***;

data frame2b;
retain n2 10 k2 4 seed 1823925;
set frame1;
sample2=0;
prob2=k2/n2;
rn2=uniform(seed);
if rn2 < prob2 then do;
sample2=1;
k2=k2-1;
end;
n2=n2-1;
run;
*** repeat with same SEED ***;

data frame3;
retain n3 10 k3 4 seed 1823925;
set frame1;
sample3=0;
prob3=k3/n3;
rn3=uniform(seed);
if rn3 < prob3 then do;
sample3=1;
week 10 10.33
k3=k3-1;
end;
n3=n3-1;
run;
** merge 2 samples by SID and print results **;

data all;
merge frame2 frame3;
by sid;
run;
proc print data=all;
id sid;
var sid age sex sample2 rn2 prob2 n2 k2
sample3 rn3 prob3 n3 k3;
title1 'Listing after Sample Selection FRAME2-3 SEED=1823925';
run;
week 10 10.34
Listing after Sample Selection FRAME2-3 SEED=1823925
ID AGE SEX SAMPLE2 RN2 PROB2 N2 K2 SAMPLE3 RN3 PROB3 N3 K3

1 23 M 0 0.85516 0.40000 9 4 0 0.85516 0.40000 9 4
2 29 M 1 0.06069 0.44444 8 3 1 0.06069 0.44444 8 3
3 34 F 0 0.54748 0.37500 7 3 0 0.54748 0.37500 7 3
4 19 F 0 0.67427 0.42857 6 3 0 0.67427 0.42857 6 3
5 29 F 0 0.98183 0.50000 5 3 0 0.98183 0.50000 5 3
6 15 F 1 0.19015 0.60000 4 2 1 0.19015 0.60000 4 2
7 32 M 1 0.28839 0.50000 3 1 1 0.28839 0.50000 3 1
8 32 F 0 0.82554 0.33333 2 1 0 0.82554 0.33333 2 1
9 31 M 0 0.56925 0.50000 1 1 0 0.56925 0.50000 1 1
10 28 M 1 0.39763 1.00000 0 0 1 0.39763 1.00000 0 0
Notice that the sample random numbers were generated (RN2 = RN3) when the same ‘seed’ or starting
value for the random number generator is used, resulting in selection of the same sample.
week 10 10.35
5. Introduction to SAS Character Functions
• There is an array of SAS functions designed for working with character data.
• Tip - Character data can be tricky to work with, since capitalization, spelling, spacing and punctuation
differences will all cause values to be read differently.
• Many of the character functions can be used to modify values, selectively read parts of character
values, or search values for certain character strings. A few helpful examples follow.
a How to Convert to Upper Case (UPCASE)
Rationale and Example –
• At the time of data entry, use of capitalization may have been inconsistent. As a result, some
observations that should be matching will not be read as having the same value.
• For example, if data for sex has inconsistently been entered as F, f, M, or m, then a frequency table
would indicate 4 different values for sex. Use of the statement:
SEX2 = UPCASE(SEX);
would convert all values to uppercase, so only 2 distinct values would appear in the new variable
SEX2, F and M. This could also be accomplished with the statements:
IF SEX =’f’ THEN SEX =’F’;

IF SEX =’m’ THEN SEX =’M’;
• However, when names, addresses or other character data with more than just a few possible values
are to be converted, the function would obviously be the preferred method.
week 10 10.36
b How to Remove Spaces and Special Characters (COMPRESS)
Rationale –
• When spacing or punctuation has been inconsistently used, values will not match.
• The Compress function allows you to remove spaces or specified characters.
• For example, when names have been entered in one place as ‘Last, First’ and in another as ‘Last
First’ without a comma, we might wish to remove the commas.
NAME2 = COMPRESS(NAME,’,’);
would create a new variable NAME2 with all commas removed.
Guidelines
• To use the Compress function, the form COMPRESS(arg1,arg2); is used, where arg1 is the variable to
be compressed, and arg2 specifies the characters to be removed.
• If you are listing specific characters, then these must be named with single quotes (or double quotes
if you want to remove a single quote character).
• Tip - Do not use delimiters (e.g., spaces or commas) between characters to remove, or the space or
comma will be considered as characters to remove. For example, NAME2 = COMPRESS(NAME,’
nd
.,’); will remove spaces, periods and commas. If no 2 argument (arg2) is specified, by default
all spaces will be removed from the original value.
week 10 10.37
c How to Align Character Data (LEFT and RIGHT)
Rationale -
• Inconsistent data entry, or data derived from different sources can leave you with values that match,
except for inconsistent spacing.
• The LEFT and RIGHT functions simply move values to be left or right aligned. LEFT removes blanks
from the start of a value, and RIGHT moves blanks from the end to the start. For example, if values
for sex have been entered as ′Female′ or ′Male ′ or ′ Male′, the last 2 values will be read as
different values, since the spacing or alignment differs. Using the statement:
SEX2 = LEFT(SEX);
would left align all values, or change ′ Male′ to ′Male ′.
d How to Extract a String of Characters (SUBSTR)
Rationale -
• This is a great function! When information is encoded within a variable, for example, with an
informative ID variable, reading a part of the value becomes of interest.
• The substring function (SUBSTR) can be used for this purpose.
• In the Anaconda Soil Ingestion Study, informative ID numbers for samples (SID), of the form ‘AnnSdd’
were used, where the first character (A) indicated the particular study, nn the child ID number, S the
sample type (F=fecal, N=food), and dd the study day, 01-07.
• Sample analysis results were output in electronic form from the laboratory equipment used for
analysis. It was necessary to extract information from the sample ID number to determine which
child, and which sample type. The following statement will do this:
week 10 10.38
CHILDID = SUBSTR(SID,1,3);
STYPE = SUBSTR(SID,4,1);
• The CHILDID is read from the sample ID (SID) starting in position 1, reading for 3 characters.
• The sample type (STYPE) is read from SID, starting in position 4, reading 1 character. For example,
if SID=’A64N02’ then CHILDID=’A64’ and STYPE=’N’.
Guidelines
To use SUBSTR(arg1,arg2,arg3),
• arg1 names the variable to be read,
• arg2 names the starting position, and
• arg3 names the number of characters to read.
e How to Validate a Character Value (VERIFY)
Rationale -
Verify can be used to check that all values in your data fit a limited set of acceptable values.
How to Use the VERIFY Function (Tip - It is NOT a logical operator)
For example, suppose you have data, stored as character codes (CODE), which should take on only the
values 0, 1, 2, 3 or 9, the following statements can be used to check for invalid data:
CHK = ‘01239’;
VCODE = VERIFY(CODE,CHK);
IF VCODE NE 0 THEN PUT ‘INVALID CODE: ‘ CODE= SID=;
week 10 10.39
• The value of VCODE will be 0 if the value of CODE contains only 0,1,2,3, or 9, but will return the first
column number where any other value is found.
• The IF-THEN statement will write a message to the log with the invalid code and subject ID, when
VCODE is not equal to 0.
week 10 10.40

LEARN SAS Within 7 Weeks: Part4 (More On Manipulating Data)

Diunggah oleh

Informasi Dokumen

Judul Asli

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

LEARN SAS Within 7 Weeks: Part4 (More On Manipulating Data)

Diunggah oleh

Hak Cipta:

Format Tersedia

Week 10 More on Manipulating Data

Week 10: More on Manipulating Data

Numerous examples illustrate the specification of the necessary code.

Goals of Week 10: More on Manipulating Data

1. to be competent in using SAS to verify data entry in an easy to read manner;

3. to know how to create sequential study identification numbers;

5. to be competent in the informatting and formatting of SAS dates and times;

6. to be aware of the variety of SAS functions that are available;

9. to be competent in manipulating character data.

Week 10 Outline – More on Manipulating Data

1. Using the PUT Statement ………………..…………….……. …………….. 3

a. How to Write to a File ………………………………………………………… 3

2. Dates and Time in SAS ……….……………………………………..……………. 15

a. How to Use Informats/Formats to Read/Write Dates ………………………. 15

3. Introduction to SAS Functions ………….……………………………………….... 22

a. Overview of SAS Functions ……………………………………. ………….. 22

4. How to Select a Simple Random Sample Without Replacement ..………….. 30

5. Introduction to SAS Character Functions ……………………………..……………. 36

a. How to Convert to Upper Case (UPCASE) …..……………………………. 36

1 Using the PUT Statement

Week 2; this is the Week 8 reading)

can then be saved as a “.lst” file.

a. How to Write to a File

statement indicates a full path address already.

Guide or refer to SAS help.

ASCII (text) file using a DATA step:

names, as well as text that you type in.

b How to Write Input Data to the Log

• This may aid in correcting data problems later.

** example writing input data to the LOG **;

in during the data step.

These SAS instructions produced the following listing in the log.

c. Data Quality Assessment – A Good Way to Write Out a Subset of Observations

subjects with a particular combination of variable values.

plus a procedure step.

statements in the course of a DATA step, as in the example below.

• In this example, two uses of the PUT statement are illustrated.

** use of PUT statement to write to the log **;

* list students with invalid weight *;

are highlighted here.

d How to Output Data Using Column Format

descriptions or analysis results.

• The PUT statement is analogous to an INPUT statement in this case.

** writing data to output window **;

* route result of PUT statement to output window *;

* list name, age, sex for females only *;

This would appear in the output window:

LISTING OF GIRLS IN CLASS:

each pre-printed with a sequential ID number.

• Tip - Such pre-numbering of forms can prevent assignment of duplicate ID numbers.

mother’s and infant’s charts.

• A sequential ID is printed on each form.

and enterer’s name.

This example also illustrates some bells and whistles:

• A DATA _NULL_ statement is used since no data set is to be created

created, numbered from 101 to 200.

form specifics are given.

• The # symbol is used to define a specific line number on the page.

quotes if the text contains a single quote.

• The forward slash (/) indicates move to the next line.

within a DO loop that creates the sequential ID numbers.

DATA _NULL_; * start data step-no file created;

DO CASEID = 101 TO 200; * creates forms 101 to 200;

example writing input data to the LOG ;

use of PUT statement to write to the log ;

writing data to output window ;

@10 'INFANT MEDICAL RECORD NUMBER: ' /

MOHTER'S MEDICAL RECORD NUMBER:

INFANT MEDICAL RECORD NUMBER:

MOTHER'S CHART REVIEWED BY: _______________ on _ / _ / _

INFANT CHART REVIEWED BY: ________________ on _ / _ / _

FORMS CODED BY: ___________________________ on _ / _ / _

DATA ENTERED BY: __________________________ on _ / _ / _

make copies of dates for demonstrating use of formats;

assign formats to dates ;

EXAMPLE TO SHOW HOW MEAN FUNCTION ;