Anda di halaman 1dari 40

Week 10 More on Manipulating Data

Unit 4
SAS for Data Management

Week 10: More on Manipulating Data

Welcome.

This reading is a wonderful reservoir of tools, such as: how to document data entry errors in the SAS log

using the PUT statement, how to select a simple random sample, and how to clean character data.

Numerous examples illustrate the specification of the necessary code.

Goals of Week 10: More on Manipulating Data

1. to be competent in using SAS to verify data entry in an easy to read manner;

2. to be aware of the multiple uses of the PUT statement (such as report writing, forms creation, etc)

3. to know how to create sequential study identification numbers;

4. to understand how SAS stores dates and how to do calculations involving dates;

5. to be competent in the informatting and formatting of SAS dates and times;

6. to be aware of the variety of SAS functions that are available;

7. to appreciate that SAS offers multiple ways of working with missing information;

8. to know how to select a simple random sample without replacement from a sampling frame; and

9. to be competent in manipulating character data.

week 10 10.1
Week 10 More on Manipulating Data

Week 10 Outline – More on Manipulating Data


Section Topic Page

1. Using the PUT Statement ………………..…………….……. …………….. 3

a. How to Write to a File ………………………………………………………… 3


b. How to Write Input Data to the Log .………………………………………... 5
c. Data Quality Assessment – A Good Way to Write Out a Subset of Data . 6
d. How to Output Data Using Column Format………………………………… 9
e. How to Use PUT Instruction to Assign Sequential Identification Values
Plus Some Bells and Whistles …………………………………………. … 11

2. Dates and Time in SAS ……….……………………………………..……………. 15

a. How to Use Informats/Formats to Read/Write Dates ………………………. 15


b. How to Compute Intervals Between Dates ……………………………… 18
c. How to Work with a Date as a Constant ………………………………… 19
d. The TODAY( ) and DATE( ) Functions …………………………………. 20

3. Introduction to SAS Functions ………….……………………………………….... 22

a. Overview of SAS Functions ……………………………………. ………….. 22


b. Using SAS to Manipulate Missing Values ……………………..…………... 23
c. How to Compute Age in Completed Years (INT) ………..………………… 27
d. How to Extract Date and Time Information (MDY,DHMS)… ..…………….. 27

4. How to Select a Simple Random Sample Without Replacement ..………….. 30

5. Introduction to SAS Character Functions ……………………………..……………. 36

a. How to Convert to Upper Case (UPCASE) …..……………………………. 36


b. How to Remove Spaces and Special Characters (COMPRESS) …..…. 37
c. How to Align Character Data (LEFT, RIGHT) …………………………… 38
d. How to Extract a String of Characters (SUBSTR)………………………… 38
e. How to Validate a Character Value (VERIFY) ……………………………. 39

week 10 10.2
Week 10 More on Manipulating Data

1 Using the PUT Statement

There is a variety of ways in SAS to code instructions that will result in the writing or outputting of

information

• A DATA statement together with a SET statement outputs data to a SAS data set. (See Unit 4

Week 2; this is the Week 8 reading)

• A PROC PRINT writes out information in a SAS data set to the results window. The results window

can then be saved as a “.lst” file.

• The PUT statement is another SAS instruction for writing or outputting of information.

a. How to Write to a File

Rationale

• It may be of interest to output data to an ASCII (text) file for some other use

General Guidelines

• Use a PUT statement to write information to the location indicated in a preceding FILE statement.

Note – A FILE statement does not require a companion libname statement. This is because a FILE

statement indicates a full path address already.

• If no FILE statement precedes, the PUT statement will write the information to the SAS log.

• Thus, the PUT statement with no preceding FILE statement is the way to write information to the

SAS log.

• Tip - Be aware that there are many options for use with the PUT statement. See the SAS Language

Guide or refer to SAS help.

week 10 10.3
Week 10 More on Manipulating Data

Example

The PUT statement was introduced in Unit 4, Week 2 as a method for writing/exporting SAS data to an

ASCII (text) file using a DATA step:

Libname old “z:\river valley\sasdata”; /* libname tells SAS the location of the source SAS data */
Data _null_; /* _null_ is a placeholder SAS data set that will not be saved */
set old.screen02; /* SAS is told to read source data from screen02.sas7bdat */
file “z:\river valley\txtdata\screen02.dta”; /* Specifies the name and destination of the ascii data set */
put medrec cd4; /* Specifies the variables that are to be written */
run;

In this example the two variables (MEDREC and CD4) would be listed in the file separated by a space.

Additional Tips

• The PUT statement can be used to write information to the log, to the output window or to files and,

as a choice of SAS instruction for purposes of writing, the PUT statement has certain advantages.

• Information “written” with a PUT statement can include variable values with or without variable

names, as well as text that you type in.

• By default, a PUT statement writes lines to the SAS LOG when no file specification is given.

week 10 10.4
Week 10 More on Manipulating Data

b How to Write Input Data to the Log

Rationale

• In some instances (especially when the data set size is reasonable), it may be good documentation

to have recorded in the SAS log the data that was actually read in.

• This may aid in correcting data problems later.

Guidelines

• To write the data to the log as it is read in, use the statement: PUT _INFILE_; after the INPUT

statement.

Example

** example writing input data to the LOG **;


DATA BOOKS;
INFILE 'C:\TEMP\HW4P3.DAT';
INPUT NAME $ DAYS SUBJECT $15.;
PUT _INFILE_;
RUN;

This example also illustrates use of the SAS instruction _INFILE_ . It refers specifically to data that is read

in during the data step.

week 10 10.5
Week 10 More on Manipulating Data

These SAS instructions produced the following listing in the log.

1 DATA BOOKS;
2 INFILE 'C:\SPH\HW6#3.DAT';
3 INPUT NAME $ DAYS SUBJECT $15.;
4 PUT _INFILE_;
5 RUN;
BARBARA 2 ENGLISH
CAROL 2 SCIENCE
CAROL 3 ENGLISH
CAROL 4 MATH
DONALD 1 ART
JAMES 4 MATH
JOYCE 5 HOME ECONOMICS
JOYCE 6 SCIENCE
MARY 2 MECHANICS
PHILIP 1 HOME ECONOMICS
PHILIP 3 MATH
WILLIAM 2 SCIENCE
NOTE: The infile 'C:\TEMP\HW4P3.DAT' is file C:\TEMP\HW4P3.DAT.
NOTE: 12 records were read from the infile C:\TEMP\HW4P3.DAT.
The minimum record length was 15.
The maximum record length was 26.
NOTE: The data set WORK.BOOKS has 12 observations and 3 variables.
NOTE: The DATA statement used 6.00 seconds.

c. Data Quality Assessment – A Good Way to Write Out a Subset of Observations

Rationale

• You may wish to identify subjects who have missing or invalid data for a particular variable, or those

subjects with a particular combination of variable values.

• One way to get such a list is to make a subset of the data for observations with the characteristic of

interest, and then to list the data using PROC PRINT. This can require many different data steps

plus a procedure step.

• Alternatively, the list of problem observations can be written to the output file or to the log using PUT

statements in the course of a DATA step, as in the example below.

week 10 10.6
Week 10 More on Manipulating Data

Example

• In this example, two uses of the PUT statement are illustrated.

• In the first, the variables to be written to the log are named with an equal sign ( PUT VARNAME= ).

When this form is used the ‘variable name = value’ is printed to the log (VARNAME=value).

• In the 2nd example, a mixture of text and variable values are used to write information to the log: the

PUT statement says to (1) write the value of the variable NAME , (2) write the text enclosed in

quotes, (3) the value of the variable WEIGHT, and (4) the text enclosed in quotes.

** use of PUT statement to write to the log **;


DATA CLASS;
INFILE 'C:\TEMP\PUT2.DAT';
INPUT NAME $ SEX $ AGE HEIGHT WEIGHT;

* list students with invalid weight *;


* and change weight to missing *;
IF WEIGHT > 200 THEN DO;
PUT NAME= WEIGHT= ;
PUT NAME 'has invalid weight of ' WEIGHT 'lbs.';
WEIGHT=.;
END;
RUN;

Following is the SAS log that was produced. Recall that the PUT statement writes to the LOG by default,

unless preceded by a FILE statement. The lines in the SAS log that were produced by the PUT statements

are highlighted here.

1 DATA CLASS;
2 INFILE 'C:\SPH\691F\PUT2.DAT';
3 INPUT NAME $ SEX $ AGE HEIGHT WEIGHT;
4
5 * list students with invalid weight *;
6 * and changes weight to missing *;
7 IF WEIGHT > 200 THEN DO;
8 PUT NAME= WEIGHT= ;
9 PUT NAME 'has invalid weight of ' WEIGHT 'lbs.';
10 WEIGHT=.;
11 END;
12 RUN;

week 10 10.7
Week 10 More on Manipulating Data

NAME=JOHN WEIGHT=995
JOHN has invalid weight of 995 lbs.
NOTE: The infile 'C:\TEMP\PUT2.DAT' is file C:\TEMP\PUT2.DAT.
NOTE: 19 records were read from the infile C:\TEMP\PUT2.DAT.
The minimum record length was 21.
The maximum record length was 23.
NOTE: The data set WORK.CLASS has 19 observations and 5 variables.

week 10 10.8
Week 10 More on Manipulating Data

d How to Output Data Using Column Format

Rationale

• It may be that you want to write a list of data to the SAS output file so that it accompanies

descriptions or analysis results.

• Such SAS output files might be descriptions, analysis results, or the reporting of feedback

information.

Guidelines

• The PUT statement is analogous to an INPUT statement in this case.

• All the options available for reading data with INPUT statements can be used for writing data with

PUT statements.

Example

** writing data to output window **;


DATA CLASS;
INFILE 'C:\TEMP\PUT2.DAT';
INPUT NAME $ SEX $ AGE HEIGHT WEIGHT;

* route result of PUT statement to output window *;


FILE PRINT;

* list name, age, sex for females only *;


IF SEX='F' THEN
PUT NAME 1-8 SEX 10 AGE 12-13;
TITLE1 ‘LISTING OF GIRLS IN CLASS:’;
RUN;

This would appear in the output window:

LISTING OF GIRLS IN CLASS:

ALICE F 13
BARBARA F 13
CAROL F 14

week 10 10.9
Week 10 More on Manipulating Data

JANE F 12
JANET F 15
JOYCE F 11
JUDY F 14
LOUISE F 12
MARY F 15

week 10 10.10
Week 10 More on Manipulating Data

e. How to Use the PUT Instruction to Assign Sequential Identification Values Plus Some Bells

and Whistles

Rationale

• A PUT statement can be used to create a header page for surveys or other data collection forms,

each pre-printed with a sequential ID number.

• Tip - Such pre-numbering of forms can prevent assignment of duplicate ID numbers.

Example

• In the example that follows a header page is created for a chart review study that requires reviewing

mother’s and infant’s charts.

• A sequential ID is printed on each form.

• The form is used to record the names and medical record numbers of the mother and infant, and a

place is given for recording the date of chart review, reviewer’s name, as well as date of data entry,

and enterer’s name.

This example also illustrates some bells and whistles:

• A DATA _NULL_ statement is used since no data set is to be created

• A DO loop is used to create the sequential ID numbers -- in this example one hundred forms are

created, numbered from 101 to 200.

• For each form a new page is started, using the PUT _PAGE_ command. Then instructions for the

form specifics are given.

• The @ symbol is used to define the starting column, just as in an INPUT statement.

week 10 10.11
Week 10 More on Manipulating Data

• The # symbol is used to define a specific line number on the page.

• Text to be printed, including lines to be drawn must be enclosed in quotes. Use double

quotes if the text contains a single quote.

• The forward slash (/) indicates move to the next line.

• See the SAS Language Guide or SAS HELP, section on PUT statements for more details
on line and pointer controls.

Note that there is one very long PUT statement that ends with a semi-colon. This PUT statement is listed

within a DO loop that creates the sequential ID numbers.

**********************************************;
** example of creating sequential ID number **;
** and cover page for data collection form **;
**********************************************;
OPTIONS nodate nonumber nocenter ls=78 ps=60;

DATA _NULL_; * start data step-no file created;


FILE 'C:\temp\cover1.lst'; * names file to store results;

DO CASEID = 101 TO 200; * creates forms 101 to 200;


PUT _PAGE_; * starts new page for each new id;
PUT / / /
@10 ' _______________________________________ ' /
@10 '| |' /
@10 '| BAYSTATE MEDICAL CENTER |' /
@10 '| FETAL MACROSOMIA STUDY |' /
@10 '| |' /
@10 '|_______________________________________|' /
/
@10 'STUDY ID = ' @22 CASEID /
/
@10 "MOTHER'S NAME: __________________________________ " /
@26 '(last, first)' /
/
@10 "MOHTER'S MEDICAL RECORD NUMBER: __ __ __ __ __ __ "
/ /
@10 'INFANT NAME: ___________________________________ ' /
@26 '(last, first)' /
/

week 10 10.12
Week 10 More on Manipulating Data

@10 'INFANT MEDICAL RECORD NUMBER: __ __ __ __ __ __ ' /


#45 @10 "MOTHER'S CHART REVIEWED BY: ____________________ on ___ / ___ / ___" /
/
@10 'INFANT CHART REVIEWED BY: ____________________ on ___ / ___ / ___' /
/
@10 'FORMS CODED BY: _________________________________ on ___ / ___ / ___' /
/
@10 'DATA ENTERED BY: ________________________________ on ___ / ___ / ___'
; *end of put statement;
END; *end of do loop;
RUN;

Following is the first page written by the above program:

_______________________________________
| |
| BAYSTATE MEDICAL CENTER |
| FETAL MACROSOMIA STUDY |
| |
|_______________________________________|

STUDY ID = 101

MOTHER'S NAME: __________________________________


(last, first)

MOHTER'S MEDICAL RECORD NUMBER: __ __ __ __ __ __

INFANT NAME: ___________________________________


(last, first)

INFANT MEDICAL RECORD NUMBER: __ __ __ __ __ __

week 10 10.13
Week 10 More on Manipulating Data

MOTHER'S CHART REVIEWED BY: _________________ on ___ / ___ / ___

INFANT CHART REVIEWED BY: __________________ on ___ / ___ / ___

FORMS CODED BY: _____________________________ on ___ / ___ / ___

DATA ENTERED BY: ____________________________ on ___ / ___ / ___

Much more complicated and detailed forms can be generated with PUT statements. This can be

particularly useful in follow-up surveys, where a form for each subject in the baseline survey can be pre-

printed with name, address phone number, ID numbers, etc. Any useful information from the baseline

survey can be incorporated and printed using a PUT statement. To do this, the DATA _NULL_; statement

would be followed by a SET statement, naming the source data file containing the baseline data, before the

DO loop and PUT statement.

week 10 10.14
Week 10 More on Manipulating Data

2 Dates and Times in SAS

SAS stores dates internally using numbers that are related to the date January 1, 1960

• In SAS, a date is recorded as the number of days from January 1, 1960, with Jan 1, 1960 having a

value of 0.

• Dates after 01/01/1960 have positive values, and dates before this date have negative values.

• If you fail to assign a format to a date using a format statement, the number printed will be the

number of days since January 1, 1960 – not a particularly useful piece of data to look at – so

assign formats to dates!.

• A record length of 5 is sufficient for all recent dates (dates after 778 BC) represented as month, day,

year since there are at most 5 significant digits. This record length also is sufficient for negative

SAS dates.

a. How to Use INFORMATS/FORMATS to Read/Write Dates

• We read dates recorded as month, day, and year. SAS doesn’t know how to do this.

• Therefore, special DATE informats exist to translate this specification of the date to the SAS numeric

representation. A

• n informat works in a manner similar to a format – it provides instruction for reading values, rather

than writing values.

• Similarly, since the SAS representation of a date as the number of days from 01/01/1960 is not

readily recognized as a month, day and year, formats exit to convert SAS numeric dates to month,

day, year format for printing.

week 10 10.15
Week 10 More on Manipulating Data

Example

• Data in the following example includes information on birth date, date of surgery and date of post-

operative testing.

• The dates in this file are in MMDDYY format, in this case with a slash (/) to separate month, day, and

year.

• The subject birth dates use 10 columns (MMDDYY10. format), with year taking 4 columns

• The other dates use 8 columns with year taking 2 columns, or MMDDYY8. format.

• Note the INFORMAT statement used before the INPUT statement to define the date formats.

• Alternatively, without an informat statement the format for reading the dates can be indicated directly

on the input statement, as in

INPUT CASEID @5 DOB MMDDYY10. @16 SURDATE MMDDYY8.


@25 TDATE MMDDYY8.;

• In the program that follows, copies of each of the dates are made under new names so that several of

the different formats for printing dates can be illustrated.

• Note that date formats are always available for use in SAS, without running PROC FORMAT.

• However, formats must be assigned to date variables in a FORMAT statement in a data or proc step,

like any other formats.

• A format should always be assigned to DATE variables in SAS for printing, since looking at

the number of days since 01JAN60 is not usually very informative.

week 10 10.16
Week 10 More on Manipulating Data

OPTIONS PAGESIZE=60 LINESIZE=128 nodate nonumber nocenter;


***************************************************;
** read in data using MMDDYY. format for dates **;
***************************************************;
DATA DATES;
INFILE 'C:\temp\DATE1.DAT';
* how to read dates *;
INFORMAT DOB MMDDYY10.
SURDATE TDATE MMDDYY8.;
INPUT CASEID DOB SURDATE TDATE;

** make copies of dates for demonstrating use of formats**;


DOB1=DOB;
SURDATE1=SURDATE;
TDATE1=TDATE;
** compute days from surgery to testing, **
** and patient age at surgery **;
DAYS = TDATE - SURDATE;
AGE = INT((SURDATE - DOB) / 365.25); /* the INT function rounds to integer */

** assign formats to dates **;


** no format for TDATE1 **;
FORMAT DOB MMDDYY10.
DOB1 DATE7.
SURDATE WORDDATE12.
SURDATE1 WEEKDATE23.
TDATE WORDDATX12.;
RUN;

* print results to illustrate formats *;


PROC PRINT DATA=DATES;
ID CASEID;
VAR DOB DOB1 SURDATE SURDATE1 AGE;
TITLE1 'LIST 1: EXAMPLE OF DATE FORMATS';
RUN;
PROC PRINT DATA=DATES;
ID CASEID;
VAR SURDATE TDATE TDATE1 DAYS;
TITLE1 'LIST 2: EXAMPLE OF DATE FORMATS';
TITLE2 'UNFORMATTED TDATE1 LISTED AS DAYS FROM JAN 1, 1960';
RUN;

week 10 10.17
Week 10 More on Manipulating Data

The output follows.

LIST 1: EXAMPLE OF DATE FORMATS

CASEID DOB DOB1 SURDATE SURDATE1 AGE

101 12/03/1917 03DEC17 Jul 17, 1991 Wednesday, Jul 17, 1991 73
102 03/18/1910 18MAR10 Jul 19, 1991 Friday, Jul 19, 1991 81
103 09/21/1930 21SEP30 Jul 22, 1991 Monday, Jul 22, 1991 60
104 08/27/1924 27AUG24 Jul 26, 1991 Friday, Jul 26, 1991 66
105 10/11/1938 11OCT38 Jul 29, 1991 Monday, Jul 29, 1991 52
106 06/14/1919 14JUN19 Jul 30, 1991 Tuesday, Jul 30, 1991 72

LIST 2: EXAMPLE OF DATE FORMATS


UNFORMATTED TDATE1 LISTED AS DAYS FROM JAN 1, 1960

CASEID SURDATE TDATE TDATE1 DAYS

101 Jul 17, 1991 4 Aug 1991 11538 18


102 Jul 19, 1991 5 Aug 1991 11539 17
103 Jul 22, 1991 5 Aug 1991 11539 14
104 Jul 26, 1991 13 Aug 1991 11547 18
105 Jul 29, 1991 15 Aug 1991 11549 17
106 Jul 30, 1991 15 Aug 1991 11549 16

b. How to Compute Intervals between Dates

SAS’ system for internal storage of dates makes straightforward computations involving dates.

• The advantage of storing dates as a number of days from some reference point lies in the ease of

computation of differences, or durations.

• In the above example, the number of days between surgery and post-operative testing can be

computed as a simple difference:

DAYS = TDATE - SURDATE;

• The result of a difference between two dates is always in units of days.

• If we wanted the difference in weeks this could be computed as:

week 10 10.18
Week 10 More on Manipulating Data

WEEKS = (TDATE - SURDATE) / 7;

• In the same manner, AGE (in years) at surgery can be computed as the number of days between

surgery date and birth date, divided by 365.25:

AGE = (SURDATE-DOB)/365.25;

• In the example in the program above the INT (integer) function was used to round down to the integer

part – this is discussed in the section on SAS functions below.

Occasionally, you may need to be mindful of the Y2K problem in working with dates

TIP - It’s a good idea to use 4 digits for year whenever possible when reading and writing dates to avoid

confusion. In the above example, the birth dates were listed with 4 digit years for this reason. When

working with birthdates of subjects in the early 1900’s, dates will be read incorrectly as early 2000’s, if only

the final 2 digits of the year are given. These can be corrected using SAS Functions (see following, 3.

Introduction to SAS Functions).

TIME and DATETIME formats and informats are available for use in SAS. These can be used when times,

and time differences are of interest. An example is given in the section on date and time functions, which

follows.

c How to Work with a Date as a Constant

• Constant values for dates, times, and datetimes can also be referred to using the DATE. , TIME. or

DATETIME. formats.

week 10 10.19
Week 10 More on Manipulating Data

• For example given information on birth date (DOB), to compute a subject's age in years at a specific

date, for example Jan 1, 1992, you could use the following statement:

AGE1992 = ('01JAN1992'D - DOB)/365.25;

• The date constant must be in DATE9. format (days in 2 digits, month in 3 letter abbreviation,

and years in 4 digits) enclosed in single quotes, and followed by the letter d (d or D works),

no spaces.

Similar to dates, TIME and DATETIME constants can be defined. See the section on date, time and

datetime constants in the chapter on SAS expressions in the SAS Language Guide for more details.

d The TODAY and DATE Functions

The TODAY( ) Function

The statement:

TDATE = TODAY();

creates the variable TDATE with the current date on the computer as the value. The word today is followed

by empty parentheses.

Example – Computing a subject’s current age:

CURR_AGE = (TODAY() - DOB)/365.25;

week 10 10.20
Week 10 More on Manipulating Data

The DATE( ) Function

The DATE( ) function gives the same result as the TODAY( ) function.

TDATE = DATE();

week 10 10.21
Week 10 More on Manipulating Data

3. Introduction to SAS Functions

Many special functions for computing and manipulating numeric and character data are available for use in

the SAS system.

a. Overview of SAS Functions

The functions are divided into 13 categories:

Category Functions in SAS

Arithmetic ABSolute value, MIN, MAX, MODulus, SIGN, and others

Truncation CEIL (smallest integer), FLOOR (largest integer), ROUND,


INTeger and others

Mathematical EXP (power e), LOG (natural log), LOG10 (log base 10)

Trigonometric ARCOS, ARSIN, ATAN, COS, SIN, SINH, TAN, TANH

Probability POISSON, PROBBETA, PROBBNML, PROBCHI, PROBF,


PROBT, PROBNORM, and others

Quantiles CINV (chi-square), FINV, GAMINV, PROBIT (inverse of


normal), TINV

Sample Statistics N (# non-miss), MEAN, CV, KURTOSIS, NMISS, RANGE,


STD, VAR, STDERR, SUM and others

Random Number NORMAL, RANBINomial, RANEXP, RANNOR, RANPOI,


UNIFORM, RANUNI and others

Financial COMPOUND, MORT, SAVING, and others

Character BYTE, LEFT align, LENGTH, SCAN, TRIM, UPCASE, and


others

Date and time DATE (today's date), DATEJUL (Julian to SAS), MDY,
MONTH, DAY, YEAR, others

State and Zip code ZIPSTATE, FIPNAME and others

Special functions DIFn (difference in lags), INPUT, LAGn, PUT, SOUND,


SYMGET
For details, see SAS Language Guide.

week 10 10.22
Week 10 More on Manipulating Data

• Functions can provide a simple way of computing complex formulae, such as a variance or probability

function.

• TIP: Many of the functions handle missing values differently than when computing with

expressions. It is important to pay attention to this detail when choosing to compute using SAS

functions. Examples of use of functions in a few key categories follow.

b. Using SAS to Manipulate Missing Values

Suppose you want to compute the mean of 3 scores for a subject (MEAN function) There’s more than one

approach.

• This can be done by defining an expression to sum the three scores and divide by 3.

• Alternatively, a mean can be computed by naming the three variables in the MEAN function.

• TIP - The result is not the same because missing values are handled differently in the two

approaches. Using the first method, the mean score has a missing value if any one of the three

scores is missing. Using the MEAN function, the mean score has a value if at least one of the three

scores has a non-missing value, as illustrated below:

** EXAMPLE TO SHOW HOW MEAN FUNCTION **;


** HANDLES MISSING VALUES **;
DATA EX1;
INPUT SCORE1 SCORE2 SCORE3;

M1 = (SCORE1 + SCORE2 + SCORE3)/3; * mean of all 3 scores;


M2 = MEAN(SCORE1,SCORE2,SCORE3); * mean of non-missing ;
CARDS;
1 2 3
. 2 3
1 . 3

week 10 10.23
Week 10 More on Manipulating Data

. . 3
. . .
;
RUN;
PROC PRINT DATA=EX1;
VAR SCORE1-SCORE3 M1 M2;
TITLE1 'MEAN SCORES COMPUTED TWO WAYS';
RUN;

How to Use the MEAN Function.

• To use the mean function, the word MEAN is followed by a variable list enclosed in parentheses.

• The variable names must be separated by commas, no spaces.

• If the variables names all have the same prefix with sequential numbering, an alternative hyphenated

list form is available:

M2 = MEAN(SCORE1,SCORE2,SCORE3);
or
M2 = MEAN(OF SCORE1-SCORE3);

Results of the above program follow. Note that the expression only computes a mean when all values are

non-missing, while the MEAN function computes a mean if at least one value is present.

week 10 10.24
Week 10 More on Manipulating Data

MEAN SCORES COMPUTED TWO WAYS

OBS SCORE1 SCORE2 SCORE3 M1 M2


1 1 2 3 2 2.0
2 . 2 3 . 2.5
3 1 . 3 . 2.0
4 . . 3 . 3.0
5 . . . . .

Tip – Have a look in the The SAS Language Guide on functions. You may find some kind of

computational shortcut by using functions.

Example - Computing a mean score for subjects with at least 11 of 14 subscores (the N function, MEAN

function)

• As part of a study on smoking cessation, subjects were given a 14-question form used to rate a

patient on self-efficacy in quitting.

• For each question a score of 1-7 was possible. High scores were expected to correlate well with

efficacy in quitting.

• Investigators were interested in using the mean of the 14 scores as a predictor of quit status at the

follow-up periods.

• However many subjects had skipped questions on the form so that their data were incomplete.

• The study investigators decided that if at least 11 of the 14 scores were available, the mean would be

a valid measure of efficacy.

• The following program computes a mean for subjects with at least 11 scores, creating a missing value

for the others. It makes use of the N function to find the number of non-missing values, and the

MEAN function.

week 10 10.25
Week 10 More on Manipulating Data

DATA EFF2;
SET EFF1;
* find # of non-missing scores using N function *;
NSCORE = N(OF SCORE1-SCORE14);

* compute mean of available scores when have 11+ *;


* using MEAN function *;
IF NSCORE>=11 THEN MSCORE=MEAN(OF SCORE1-SCORE14);
RUN;

In this example two variables were created: NSCORE the number of non-missing scores, and MSCORE,

the mean of the non-missing scores, when 11 or more scores were present.

week 10 10.26
Week 10 More on Manipulating Data

c How to Compute Age in Completed Years( the INT Function)

• Age is typically reported in completed years.

• When computing age at some event from the date of birth and the event date, it is common to want

whole numbers only for age in years.

• Rounding off, or formatting the age to 0 decimal places would result in rounding up for fractional parts

over .5, e.g., 8.67 rounds up to 9 years rather than 8 years, which is not how we usually think of

age.

• In this case the integer function INT, can be used, to select the integer part of the computed age:

AGE = INT( (SURGDATE - DOB)/365.25 );

d. How to Extract Date and Time Information

• Date and Time functions can be used to create DATE, TIME and DATETIME variables from other

values, or to abstract information from DATE and TIME values, such as the day, month or year from

a date value.

• For example in the Wesson Women's Time Study, dates and times were entered as separate values

for day, month, hour and minutes.

• The following program gives an example of how these can be converted into DATETIME values, so

that the time spent in a unit can be computed as the difference between two DATETIME values.

• The example makes use of the MDY function to create a date value from the month, day and year of

the study, 1991, and then uses the DHMS function to create a DATETIME value from the Date,

Hours, Minutes, and (zero) Seconds.

week 10 10.27
Week 10 More on Manipulating Data

• This example computes the time in hours spent in WETU, the Women's Evaluation and Treatment

Unit.

• Note that when a time difference is computed from a stored DATETIME value, the result is in

units of seconds.

OPTIONS PAGESIZE=60 LINESIZE=78;


LIBNAME D 'D:\WWTIME';
*************************************;
** COMPUTE TIME SPENT IN WETU **;
** AS DIFFERENCE BETWEEN TRANSFER **;
** AND ARRIVAL TIME **;
** USING WWTIME DATA **;
*************************************;
DATA WETU(KEEP=WETUTIME W_TRANDT W_ARR_DT);
SET D.WWTIME1(OBS=10); * reads only first 10 obs;

* compute date from month,day,year *;


WARRDATE=MDY(W_ARRMON,W_ARRDAY,1991); * arrival date;
WTRADATE=MDY(W_TRAMON,W_TRADAY,1991); * transfer date;

* compute datetime from date, hours, mins, and secs=0 *;

* arrival datetime;
W_ARR_DT=DHMS(WARRDATE,W_ARRHRS,W_ARRMIN,0);

* transfer datetime;
W_TRANDT=DHMS(WTRADATE,W_TRAHRS,W_TRAMIN,0);

* compute time spent in WETU as difference adjusted to hrs *;


WETUTIME = (W_TRANDT - W_ARR_DT) / 3600;
RUN;

** print data assigning datetime formats to transfer **;


** and arrival times, and allowing 6 columns, 2 dec **;
** for printing the time difference in hours **;
PROC PRINT DATA=WETU;
FORMAT W_TRANDT W_ARR_DT DATETIME13.
WETUTIME 6.2;
TITLE1 'DURATION, IN HOURS, SPENT IN WETU';
RUN;

week 10 10.28
Week 10 More on Manipulating Data

The resulting output follows.

DURATION, IN HOURS, SPENT IN WETU

OBS W_ARR_DT W_TRANDT WETUTIME

1 09FEB91:06:35 09FEB91:07:28 0.88


2 07FEB91:11:25 07FEB91:12:45 1.33
3 07FEB91:09:20 07FEB91:10:45 1.42
4 07FEB91:05:30 07FEB91:06:35 1.08
5 05FEB91:04:30 05FEB91:05:50 1.33
6 10FEB91:02:45 10FEB91:04:00 1.25
7 06FEB91:18:20 06FEB91:19:35 1.25
8 10FEB91:08:50 10FEB91:09:20 0.50
9 10FEB91:01:45 10FEB91:03:30 1.75
10 08FEB91:08:58 08FEB91:10:53 1.92

• Note that the time spent in the Women’s Evaluation and Treatment Unit, WETUTIME, is in units of

hours, with decimal fractions of hours.

• For example 1.25 means one-and-one-quarter hours, or 1 hour and 15 minutes, and 0.50 means a

half-hour, or 30 minutes.

• SAS TIME formats are meant for use with clock time representations, not durations. If you want

these durations printed in hours and minutes, you would need to compute the minutes from the

fractional hours, and print these as two variables. The following programming statements could be

used:

* complete hours as integer part;


WETUHRS = INT(WETUTIME);

* fractional hours times 60 min/hr ;


WETUMIN = (WETUTIME - WETUHRS) * 60;

week 10 10.29
Week 10 More on Manipulating Data

4 How to Select a Simple Random Sample without Replacement: Using Random Number
Generating Functions

Rationale –

You may have used SAS to create a sampling frame (list of all members of population) and now wish to

select a simple random sample from the list. Random number generators enable the sample to be

selected. An illustration of a method for selecting a simple random sample without replacement from a

population of 10 subjects (with one record per subject) follows. Suppose the subjects are listed in a file as

shown below.

ID Age Sex
1 23 M
2 29 M
3 34 F
4 19 F
5 29 F
6 15 F
7 32 M
8 32 F
9 31 M
10 28 M

Using the Uniform ( ) Function to Select a Simple Random Sample Without Replacement

Assume we wish to select a simple random sample without replacement of 4 subjects from this list. We

can use the Uniform random number generator, which generates a random digit between 0 and 1, to do

the selection with the following program.

week 10 10.30
Week 10 More on Manipulating Data

*** selecting a simple random sample ***;


*** without replacement ***;
libname old 'c:\temp\';
data frame1;
input sid age sex $1. @@;
cards;
1 23 M 2 29 M 3 34 F 4 19 F 5 29 F 6 15 F
7 32 M 8 32 F 9 31 M 10 28 M
;
run;
data frame2a;
set frame1;
retain n 10
k 4; *set starting pop=n and sample size=k ;
sample1=0; * set indicator of in sample to 0;
prob=k/n; * set probability of selection;
rn=uniform(0); * select random number in (0,1);
if rn < prob then do; * if rn in (0,rn) ;
sample1=1; * set indicator to 1 ;
k=k-1; * decrease number to select ;
end;
n=n-1; * decrease remaining pop size ;
run;

proc print data=frame2a;


var sid age sex sample1 rn prob n k;
title1 'Listing after Sample Selection in SAS data set FRAME2a';
run;

• Data are read in on one line using the trailing @@ to hold the line.

• The sample is selected in the data set FRAME2A.

• The RETAIN statement initializes the total size of the population (n) and the size of the sample (k)

that is to be selected.

• These variables must be named in a RETAIN statement to keep track of the number of observations

left in the population, along with the number of observations remaining to be selected into the

sample.

• As the sample is selected, the value from the previous observation is held unless a change is

specified.

week 10 10.31
Week 10 More on Manipulating Data

• The variable SAMPLE1 is an indicator variable with value of "0" for observations not in the sample,

and value "1" for observations selected to be in the sample. When selecting a simple random

sample without replacement, the proportion of observations to be selected changes as each

observation is selected.

• This proportion is given by PROB, the remaining number of observations to be selected divided by

the remaining number of observations in the population.

• To determine whether the particular observation in the data step is to be selected in the sample, a

Uniform random number (RN) is selected and compared with PROB.

• If the random number is less than PROB, then the observation is selected, and the values of "k" and

"n" are both reduced by one. The indicator for the sample, SAMPLE1, is also set to one. If RN is

bigger than PROB, only the population size "n" is reduced by one and the sampling continues. A

listing of the resulting data follows:

Listing after Sample Selection in SAS data set FRAME2a

OBS SID AGE SEX SAMPLE1 RN PROB N K

1 1 23 M 0 0.75321 0.40000 9 4
2 2 29 M 1 0.34198 0.44444 8 3
3 3 34 F 0 0.50989 0.37500 7 3
4 4 19 F 1 0.41783 0.42857 6 2
5 5 29 F 0 0.36324 0.33333 5 2
6 6 15 F 0 0.51605 0.40000 4 2
7 7 32 M 1 0.02004 0.50000 3 1
8 8 32 F 1 0.32841 0.33333 2 0
9 9 31 M 0 0.62433 0.00000 1 0
10 10 28 M 0 0.59228 0.00000 0 0

week 10 10.32
Week 10 More on Manipulating Data

How the Uniform Random Number Generator Uniform(0) Works

• The uniform random number generator is a SAS function that has "0" as an argument in the previous

program.

• The "0" is a special code that indicates that the uniform random number sequence should start based

on the date and time that the program is run.

• Re-running the program will produce a different set of Uniform random numbers.

• Although independent samples should be based on a different set of random numbers, it may be

desirable to be able to duplicate the sample selection. This can be done by specifying a 5-7

digit odd integer as the argument in the UNIFORM function.

• This is illustrated this below, with the selection of two samples, where a common starting point – seed

– for the random number sequence is specified.

*** select 2 samples using same SEED ***;


data frame2b;
retain n2 10 k2 4 seed 1823925;
set frame1;
sample2=0;
prob2=k2/n2;
rn2=uniform(seed);
if rn2 < prob2 then do;
sample2=1;
k2=k2-1;
end;
n2=n2-1;
run;

*** repeat with same SEED ***;


data frame3;
retain n3 10 k3 4 seed 1823925;
set frame1;
sample3=0;
prob3=k3/n3;
rn3=uniform(seed);
if rn3 < prob3 then do;
sample3=1;

week 10 10.33
Week 10 More on Manipulating Data

k3=k3-1;
end;
n3=n3-1;
run;

** merge 2 samples by SID and print results **;


data all;
merge frame2 frame3;
by sid;
run;
proc print data=all;
id sid;
var sid age sex sample2 rn2 prob2 n2 k2
sample3 rn3 prob3 n3 k3;
title1 'Listing after Sample Selection FRAME2-3 SEED=1823925';
run;

week 10 10.34
Week 10 More on Manipulating Data

Listing after Sample Selection FRAME2-3 SEED=1823925

ID AGE SEX SAMPLE2 RN2 PROB2 N2 K2 SAMPLE3 RN3 PROB3 N3 K3


1 23 M 0 0.85516 0.40000 9 4 0 0.85516 0.40000 9 4
2 29 M 1 0.06069 0.44444 8 3 1 0.06069 0.44444 8 3
3 34 F 0 0.54748 0.37500 7 3 0 0.54748 0.37500 7 3
4 19 F 0 0.67427 0.42857 6 3 0 0.67427 0.42857 6 3
5 29 F 0 0.98183 0.50000 5 3 0 0.98183 0.50000 5 3
6 15 F 1 0.19015 0.60000 4 2 1 0.19015 0.60000 4 2
7 32 M 1 0.28839 0.50000 3 1 1 0.28839 0.50000 3 1
8 32 F 0 0.82554 0.33333 2 1 0 0.82554 0.33333 2 1
9 31 M 0 0.56925 0.50000 1 1 0 0.56925 0.50000 1 1
10 28 M 1 0.39763 1.00000 0 0 1 0.39763 1.00000 0 0

Notice that the sample random numbers were generated (RN2 = RN3) when the same ‘seed’ or starting

value for the random number generator is used, resulting in selection of the same sample.

week 10 10.35
Week 10 More on Manipulating Data

5. Introduction to SAS Character Functions

• There is an array of SAS functions designed for working with character data.

• Tip - Character data can be tricky to work with, since capitalization, spelling, spacing and punctuation

differences will all cause values to be read differently.

• Many of the character functions can be used to modify values, selectively read parts of character

values, or search values for certain character strings. A few helpful examples follow.

a How to Convert to Upper Case (UPCASE)

Rationale and Example –

• At the time of data entry, use of capitalization may have been inconsistent. As a result, some

observations that should be matching will not be read as having the same value.

• For example, if data for sex has inconsistently been entered as F, f, M, or m, then a frequency table

would indicate 4 different values for sex. Use of the statement:

SEX2 = UPCASE(SEX);

would convert all values to uppercase, so only 2 distinct values would appear in the new variable

SEX2, F and M. This could also be accomplished with the statements:

IF SEX =’f’ THEN SEX =’F’;


IF SEX =’m’ THEN SEX =’M’;

• However, when names, addresses or other character data with more than just a few possible values

are to be converted, the function would obviously be the preferred method.

week 10 10.36
Week 10 More on Manipulating Data

b How to Remove Spaces and Special Characters (COMPRESS)

Rationale –

• When spacing or punctuation has been inconsistently used, values will not match.

• The Compress function allows you to remove spaces or specified characters.

• For example, when names have been entered in one place as ‘Last, First’ and in another as ‘Last

First’ without a comma, we might wish to remove the commas.

NAME2 = COMPRESS(NAME,’,’);

would create a new variable NAME2 with all commas removed.

Guidelines

• To use the Compress function, the form COMPRESS(arg1,arg2); is used, where arg1 is the variable to

be compressed, and arg2 specifies the characters to be removed.

• If you are listing specific characters, then these must be named with single quotes (or double quotes

if you want to remove a single quote character).

• Tip - Do not use delimiters (e.g., spaces or commas) between characters to remove, or the space or

comma will be considered as characters to remove. For example, NAME2 = COMPRESS(NAME,’

nd
.,’); will remove spaces, periods and commas. If no 2 argument (arg2) is specified, by default

all spaces will be removed from the original value.

week 10 10.37
Week 10 More on Manipulating Data

c How to Align Character Data (LEFT and RIGHT)

Rationale -

• Inconsistent data entry, or data derived from different sources can leave you with values that match,

except for inconsistent spacing.

• The LEFT and RIGHT functions simply move values to be left or right aligned. LEFT removes blanks

from the start of a value, and RIGHT moves blanks from the end to the start. For example, if values

for sex have been entered as ′Female′ or ′Male ′ or ′ Male′, the last 2 values will be read as

different values, since the spacing or alignment differs. Using the statement:

SEX2 = LEFT(SEX);

would left align all values, or change ′ Male′ to ′Male ′.

d How to Extract a String of Characters (SUBSTR)

Rationale -

• This is a great function! When information is encoded within a variable, for example, with an

informative ID variable, reading a part of the value becomes of interest.

• The substring function (SUBSTR) can be used for this purpose.

• In the Anaconda Soil Ingestion Study, informative ID numbers for samples (SID), of the form ‘AnnSdd’

were used, where the first character (A) indicated the particular study, nn the child ID number, S the

sample type (F=fecal, N=food), and dd the study day, 01-07.

• Sample analysis results were output in electronic form from the laboratory equipment used for

analysis. It was necessary to extract information from the sample ID number to determine which

child, and which sample type. The following statement will do this:

week 10 10.38
Week 10 More on Manipulating Data

CHILDID = SUBSTR(SID,1,3);
STYPE = SUBSTR(SID,4,1);

• The CHILDID is read from the sample ID (SID) starting in position 1, reading for 3 characters.

• The sample type (STYPE) is read from SID, starting in position 4, reading 1 character. For example,

if SID=’A64N02’ then CHILDID=’A64’ and STYPE=’N’.

Guidelines

To use SUBSTR(arg1,arg2,arg3),

• arg1 names the variable to be read,

• arg2 names the starting position, and

• arg3 names the number of characters to read.

e How to Validate a Character Value (VERIFY)

Rationale -

Verify can be used to check that all values in your data fit a limited set of acceptable values.

How to Use the VERIFY Function (Tip - It is NOT a logical operator)

For example, suppose you have data, stored as character codes (CODE), which should take on only the

values 0, 1, 2, 3 or 9, the following statements can be used to check for invalid data:

CHK = ‘01239’;
VCODE = VERIFY(CODE,CHK);
IF VCODE NE 0 THEN PUT ‘INVALID CODE: ‘ CODE= SID=;

week 10 10.39
Week 10 More on Manipulating Data

• The value of VCODE will be 0 if the value of CODE contains only 0,1,2,3, or 9, but will return the first

column number where any other value is found.

• The IF-THEN statement will write a message to the log with the invalid code and subject ID, when

VCODE is not equal to 0.

week 10 10.40

Anda mungkin juga menyukai