
Introduction to SAS Essentials

Mastering SAS for Data Analytics


Alan Elliott and Wayne Woodward

Intro to SAS
Chapter 2
Class 1, Session 2
Alan Elliott

LEARNING OBJECTIVES

To enter data using freeform list input


To enter data using the compact method
To enter data using column input
To enter data using formatted input
To enter data using the INFILE technique
To enter multiple-line data

2.1 Using SAS Data Steps

The DATA step is used to define a SAS data set, and
to manipulate it and prepare it for analysis.
In the SAS language, the DATA statement signals
the creation of a new data set. For example, a DATA
statement in SAS code may look like this:

DATA MYDATA;

DATA Statement

Signals the beginning of the DATA step.


Assigns a name (of your choice) to the data set
created in the DATA Step. The general form of the
DATA statement is:

DATA datasetname;

Temporary Data Set Names

The SAS datasetname in the DATA statement can
have several forms:
The SAS datasetname can be a single name (used for
temporary data sets, kept only during the current SAS
session). For example:

DATA EXAMPLE;
DATA OCT2007;

Permanent Data Set Names
Or, the SAS datasetname can be a two-part name.
The two-part name tells SAS that this permanent
data set will be stored on disk beyond the current
SAS session in a SAS library indicated by the
prefix name. For example:

DATA RESEARCH.JAN2016;
DATA CLASS.EX1;
DATA JONES.EXP1;
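
Before a two-part name can be used, the prefix has to be associated with a folder on disk. A minimal sketch of one way to do that with a LIBNAME statement follows. The folder C:\SASDATA and the two sample records are assumptions made for illustration; substitute a folder that exists and is writable on your system.

```sas
/* Assumption: C:\SASDATA is an existing, writable folder. */
LIBNAME CLASS 'C:\SASDATA';

/* The two-part name stores EX1 in the CLASS library on disk. */
DATA CLASS.EX1;
INPUT ID $ AGE;
DATALINES;
001 15
002 25
;
RUN;

/* The data set can be referenced by the same two-part name. */
PROC PRINT DATA=CLASS.EX1;
RUN;
```

If the LIBNAME statement succeeds, the data set remains in that folder after the SAS session ends.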

Windows Data Set Names
A SAS data set name can also refer directly to the
Windows name of a file on your hard disk. The file
name must be enclosed in quotes. For example:

DATA "C:\SASDATA\SOMEDATA";
DATA "C:\MYFILES\DECEMBER\MEASLES";

(Note: these refer to SAS data set files with .sas7bdat
extensions, thus SOMEDATA.SAS7BDAT and
MEASLES.SAS7BDAT.)

Tasks done within the DATA Step
DATA datasetname;
<code that defines the variables in the data set>;
<code to enter data>;
<code to create new variables>;
<code to assign missing values>;
<code to output data>;
<code to assign labels to variables>;
<and other data tasks>;

2.2 Understanding SAS data set structure

          ID   SBP  DBP  GENDER  AGE  WT
obs 1     001  120   80  M        15  115
obs 2     002  130   70  F        25  180
...       .    .    .    .       .    .
obs 100   100  125   80  F        20  110

Columns are data variables (each named) and rows are subjects,
observations, or records.

Columns and Rows

Each column represents a variable and is
designated with a variable name (ID, SBP, etc.).
Every SAS variable (column) must have a name,
and the names must follow certain naming rules.
Each row, marked here as obs 1, obs 2, etc.,
is an observation or record. An observation
consists of data observed from one subject or entity.

2.3 Rules for SAS variable names
SAS variable names
must be 1-32 characters long and must not include any
blanks.
must start with a letter (A through Z) or the _
(underscore).
may include numbers (but not as the first character).
may include upper- and lowercase characters (variable
names are case insensitive).
should be descriptive of the variable (optional but
recommended).

Correct & Incorrect variable names
Correct SAS variable names are
GENDER AGE_IN_1999
AGEin1999 _OUTCOME_HEIGHT_IN_CM
WT_IN_LBS

Incorrect SAS variable names are (WHY?)

AGE IN 2000
2000MeaslesCount
S S Number
Question 5
WEIGHT IN KG
AGE-In-2000

2.4 Understanding SAS Variable Types
Numeric Variables (Default): A numeric variable is
used to designate values that could be used in
arithmetic calculations or are grouping codes for
categorical variables.
For example, the variables SBP (systolic blood pressure),
AGE, and WEIGHT are numeric. However, an ID number,
phone number, or Social Security number should not be
designated as a numeric variable. For one thing, you
typically would not want to use them in a calculation.
Moreover, for ID numbers, if ID = 00012 were stored as a
number, it would lose the zeros and become 12.
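
The loss of leading zeros can be seen directly by reading the same value twice, once as numeric and once as character. This is a small illustrative sketch; the data set name IDDEMO and the variable names ID_NUM and ID_CHAR are made up for the example.

```sas
DATA IDDEMO;
/* ID_NUM is numeric (the default); ID_CHAR is character ($). */
INPUT ID_NUM ID_CHAR $;
DATALINES;
00012 00012
;
RUN;

/* ID_NUM is stored as the number 12; ID_CHAR keeps the text 00012. */
PROC PRINT DATA=IDDEMO;
RUN;
```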

Character (Text, String) Variables
Character (Text, String) Variables: Character variables
are used for values that are not used in arithmetic
calculations.
For example, a variable that uses M and F as codes for gender
would be a character variable. For character variables, case
matters, because to the computer a lowercase f is a different
character from an uppercase F. It is important to note that a
character variable may contain numerical digits.
As mentioned previously, a Social Security number (e.g.,
450-67-7823) or an ID number (e.g., 143212) should be designated
as a character variable because their values should never be
used in mathematical calculations.
When designating a character variable in SAS, you must
indicate to SAS that it is of character type. This is shown in
upcoming examples.

Date Variables

Date Variables: A date value may be entered into
SAS using a variety of formats, such as 10/15/09,
01/05/2010, JAN052010, and so on. As you will see
in upcoming examples, dates are handled in SAS
using format specifications that tell SAS how to read
or display the date values. For more information
about dates in SAS, see Appendix B.
Technically, dates are stored as integers. We'll learn more
about dates, and how to manipulate them, later.
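
To make the "dates are integers" point concrete, the sketch below reads one date with the MMDDYY10. informat and keeps an unformatted copy next to it. SAS stores a date as the number of days since January 1, 1960. The names DATEDEMO, VISIT, and RAWVALUE are hypothetical, chosen only for this illustration.

```sas
DATA DATEDEMO;
/* Read columns 1-10 as a date written like 01/05/2010. */
INPUT @1 VISIT MMDDYY10.;
/* Copy the stored value; this copy is left unformatted. */
RAWVALUE = VISIT;
/* Display VISIT in the DATE9. style, e.g. 05JAN2010. */
FORMAT VISIT DATE9.;
DATALINES;
01/05/2010
;
RUN;

PROC PRINT DATA=DATEDEMO;
RUN;
```

In the listing, VISIT appears as a readable date while RAWVALUE shows the underlying integer, even though both variables hold the same value.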

2.5 Methods of reading data into SAS

Reading data using freeform list input


Reading data using the compact method
Reading data using column input
Reading data using formatted input.

Freeform Data Entry
Open the file DFREEFORM.SAS.

DATA MYDATA;
INPUT ID $ SBP DBP GENDER $ AGE WT;
DATALINES;
001 120 80 M 15 115
002 130 70 F 25 180
003 140 100 M 89 170
004 120 80 F 30 150
005 125 80 F 20 110
;
PROC PRINT;
RUN;

The INPUT statement defines the variables (some character,
designated by $ after the name).
DATALINES indicates that data are listed next.
The data must match the INPUT statement: the same number of
values per line, separated with blanks.
The data end with a semicolon on its own line.

Data set created from code

Obs  ID   SBP  DBP  GENDER  AGE  WT
1    001  120   80  M        15  115
2    002  130   70  F        25  180
3    003  140  100  M        89  170
4    004  120   80  F        30  150
5    005  125   80  F        20  110

Advantages of freeform list input
Easy, very little to specify.
No rigid column positions, which makes data entry
easy.
If you have a data set where the data are separated
by blanks, this is the quickest way to get your data
into SAS.

Restrictions for freeform list input
Every variable on each data line must be in the order
specified by the INPUT statement.
Fields must be separated by at least one blank.
Blank spaces representing missing variables are not
allowed. Having a blank space in the data causes values
to be out of sync. If there are missing values in the data,
a dot (.) should be placed in the position of that variable
in the data line. For example, a data line with AGE
missing might read:
4 120 80 F . 150
No embedded blanks are allowed within the data value
for a character field, like MR ED.
A character field has a default maximum length of 8
characters in freeform input.
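
The missing-value rule can be tried with a two-line version of the earlier freeform example. This is an illustrative sketch; the data set name MISSDEMO is made up, and the second data line is the one quoted above with a dot in the AGE position.

```sas
DATA MISSDEMO;
INPUT ID $ SBP DBP GENDER $ AGE WT;
DATALINES;
001 120 80 M 15 115
4 120 80 F . 150
;
RUN;

/* AGE is missing for the second observation; WT is still read as 150. */
PROC PRINT DATA=MISSDEMO;
RUN;
```

Without the dot, SAS would read 150 as AGE and then look for WT on the next line.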

Compact Data Format

DATA WEIGHT;
INPUT TREATMENT LOSS @@;
DATALINES;
1 1.0 1 3.0 1 -1.0 1 1.5 1 0.5 1 3.5
2 4.5 2 6.0 2 3.5 2 7.5 2 7.0 2 6.0 2 5.5
3 1.5 3 -2.5 3 -0.5 3 1.0 3 .5
;
PROC PRINT;
RUN;

The @@ symbol in the INPUT statement tells SAS to allow
multiple rows of data on each line. You must be careful
that the data matches the input definition.

Hands-On Example p 27

1. Open the program file DFREEFORM.SAS. (The
code was shown above.) Run the program.
2. Observe that the output listing (shown here)
illustrates the same information as in Table 2.4.
Etc

Column Input
(This is DCOLUMN.SAS)

DATA MYDATA;
INPUT ID $ 1-3 SBP 4-6 DBP 7-9
GENDER $ 10 AGE 11-12 WT 13-15;
DATALINES;
001120 80M15115
002130 70F25180
003140100M89170
004120 80F30150
005125 80F20110
;
RUN;

Note how data are in specific columns. In the INPUT statement,
the columns are specified by the ranges following a variable
name. The general form is:

INPUT variable startcol-endcol ...;

Advantages of column input
Data fields can be defined and read in any order in the INPUT
statement and unneeded columns of data can be skipped.
Blanks are not needed to separate fields.
Character values can range from 1 to 200 characters. For
example:

INPUT DIAGNOSE $ 1-200;

For character data values, embedded blanks are no problem,
e.g., John Smith.
Input only the variables you need and skip the rest. This is handy
when your data set (perhaps downloaded from a large
database) contains variables you're not interested in using.

Rules and restrictions for column input
Data values must be in fixed column positions.
Blank fields are read as missing.
Character values may sit anywhere within their columns;
leading blanks in the field are ignored.
Column input has more specifications than list input.
You must specify the column ranges for each
variable.

How SAS interprets column data
INPUT GENDER $ 1-3;

(Figure: an M placed in column 1, column 2, or column 3 of the
data line. All are read as M.)

Numbers in Columns
INPUT X 1-6;

Data in columns 1-6    READ AS
230                    230
23.0                   23.0
2.3E1                  2.3E1 or 23
23                     23
-23                    -23

Hands-On Exercise p 31
(DCOLUMN.SAS)
DATA MYDATA;
INPUT ID $ 1-3 SBP 4-6 DBP 7-9 GENDER $ 10
AGE 11-12 WT 13-15;
DATALINES;
001120 80M15115
002130 70F25180
003140100M89170
004120 80F30150
005125 80F20110
;
RUN;
PROC PRINT DATA=MYDATA;
RUN;

Reading Data Using Formatted Input

DATA MYDATA;
INPUT @col variable1 format. @col
variable2 format. ...;

Example (note the difference in the INPUT statement):

DATA MYDATA;
INPUT @1 SBP 3. @4 DBP 3. @7 GENDER $1.
@8 WT 3. @12 OWE COMMA9.;

Three Components In Formatted Input

@1 SBP 3.

@1   The @ is the starting column pointer, so @1 means start in
     column 1.
SBP  The variable name.
3.   The informat defines what kind of data to input. In this case
     3. indicates an integer with at most 3 digits.

Input Formats Table 2.6, p 33

Informat    Meaning
5.          Five columns of data as numeric data.
$5.         Character variable with width 5, removing leading blanks.
$CHAR5.     Character variable with width 5, preserving leading blanks.
COMMA7.     Seven columns of numeric data; strips out any commas or dollar signs (i.e., $40,000 is read as 40000).
COMMA10.2   Ten columns of numeric data with 2 decimal places; strips commas and dollar signs ($19,020.22 is read as 19020.22).
MMDDYY8.    Date as 01/12/16. (Watch out for Y2K issue.)
MMDDYY10.   Date as 04/07/2016.
DATE7.      Date as 20JUL16.
DATE9.      Date as 12JAN2016. (No Y2K issue.)


Informats & Formats
In SAS
INFORMATS are used to read data into a SAS data set
FORMATS are used to specify how to output (write) data
values
Most Format specifications (such as MMDDYY10.) can
be used as EITHER an informat or a format.
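
The difference between the two roles can be sketched with a dollar amount: an informat on the INPUT statement controls how the raw text is read, and a FORMAT statement controls how the stored number is displayed. The data set name OWEDEMO is made up for this example, and DOLLAR10.2 is one possible choice of display format, not one taken from the slides.

```sas
DATA OWEDEMO;
/* Informat: COMMA9. strips the dollar sign and commas while reading. */
INPUT @1 OWE COMMA9.;
/* Format: DOLLAR10.2 writes the stored number with $ , and 2 decimals. */
FORMAT OWE DOLLAR10.2;
DATALINES;
$5,431.00
$12,122
;
RUN;

PROC PRINT DATA=OWEDEMO;
RUN;
```

The stored values are plain numbers (5431 and 12122); only their printed appearance depends on the FORMAT statement.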

More about formats
Formats must end with a dot (.) or a dot followed by
a number.
5. : A number up to five digits, no decimals, so could
take on values from -9999 to 99999.
5.2 : A number up to five characters wide, including the
decimal point and up to 2 decimals, so could take on values
from -9.99 to 99.99.
$5. : A character value of up to five characters, such as
ABCDE, abcde, 12345 or (*&6%

Advantages & restrictions for using formatted input

Advantages and restrictions are similar to those for
column input.
The primary difference is the ability to read in data
using INFORMAT specifications.
Formatted input is particularly handy for reading dates
and dollar values.

Hands-On Exercise, p 36
(DINFORMAT.SAS)
DATA MYDATA;
INPUT @1 SBP 3. @4 DBP 3. @7
GENDER $1. @8 WT 3. @12 OWE
COMMA9.;
DATALINES;
120 80M115 $5,431.00
130 70F180 $12,122
140100M170 7550
120 80F150 4,523.2
125 80F110 $1000.99
;
PROC PRINT DATA=MYDATA;
RUN;

Some Common Output Formats

Compare these to the INPUT formats listed earlier.

Using the SAS INFORMAT Statement
There is a SAS statement named INFORMAT that you
can use in the freeform data entry case (see p 37).
In this example, the INFORMAT statement specifies
that a freeform text value can be longer than
the default eight characters.
DATA PEOPLE;
INFORMAT LASTNAME FIRSTNAME $12. AGE 3. SCORE 4.2;
INPUT LASTNAME FIRSTNAME AGE SCORE;
DATALINES;
Lincoln George 35 3.45
Ryan Lacy 33 5.5
;
PROC PRINT DATA=PEOPLE;
RUN;

Reading External Data Using INFILE
Suppose you have text data in a file in this format

101 A 12 22.3 25.3 28.2 30.6 5 0


102 A 11 22.8 27.5 33.3 35.8 5 0
104 B 12 22.8 30.0 32.8 31.0 4 0
110 A 12 18.5 26.0 29.0 27.9 5 1

How can you read this data into SAS?

INFILE Statement
INSTEAD of the DATALINES statement (followed by
data), you use the INFILE statement.
INFILE replaces DATALINES, and defines
where the data are located on disk.

DATA MYDATA;
INFILE 'C:\SASDATA\EXAMPLE.DAT';
INPUT ID $ 1-3 GP $ 5 AGE 6-9
TIME1 10-14 TIME2 15-19 TIME3 20-24;
RUN;
PROC MEANS;
RUN;

NOTE: There is no DATALINES statement. The first RUN
statement is where the DATA step ends and the PROC step
begins.

DATALINES vs INFILE
When data are in the program code: Use the
DATALINES statement.
When data are read from an external source: Use the
INFILE statement.

ALSO NOTE: Do not confuse INFILE with the INPUT
statement.

Hands-On Example p 38 (DINFILE1.SAS)
DATA MYDATA;
INFILE 'C:\SASDATA\EXAMPLE.TXT';
INPUT ID $ 1-3 GP $ 5 AGE 6-9
TIME1 10-14 TIME2 15-19 TIME3 20-24;
PROC MEANS DATA=MYDATA;
RUN;

2.6 Going Deeper: More Techniques
Reading Multiple Records per Observation
If the data for each record extend over multiple lines:
You can use more than one INPUT statement:
INPUT ID $ SEX $ AGE WT;
INPUT SBP DBP BLDCHL;
INPUT OBS1 OBS2 OBS3;
Or you can read the data using the / (advance)
indicator:
INPUT ID $ SEX $ AGE WT / SBP DBP BLDCHL / OBS1
OBS2 OBS3;

Hands-On Example p 40.
(DMULTILINE.SAS)
DATA MYDATA;
INPUT ID $ SEX $ AGE WT/ SBP DBP BLDCHL/
OBS1 OBS2 OBS3;
DATALINES;
10011 M 15 115
120 80 254
15 65 102
10012 F 25 180
130 70 240
34 120 132
10013 M 89 170
140 100 279
19 89 111
;
PROC PRINT DATA=MYDATA;
RUN;

Input Pointer Controls

Using Advanced INFILE Options
DLM= : This option allows you to define a delimiter to
be something other than a blank. For example, if
data are separated by commas, include the option
DLM=',' in the INFILE statement.
FIRSTOBS=n : Tells SAS on what line you want it
to start reading your raw data file. This is handy if
your data file contains one or more header lines or if
you want to skip the first portion of the data lines.
OBS=n : Indicates which line in your raw data file
should be treated as the last record to be read by
SAS.
See Table 2.13 p 41 for more options
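
These options can be tried without an external file by pointing INFILE at the in-program data with INFILE DATALINES. The sketch below is an assumption-laden illustration: the data set name CSVDEMO, the header line, and the two records are made up, and only DLM= and FIRSTOBS= are shown.

```sas
DATA CSVDEMO;
/* Comma-delimited values; start reading at line 2 to skip the header. */
INFILE DATALINES DLM=',' FIRSTOBS=2;
INPUT GROUP $ AGE TIME1;
DATALINES;
GROUP,AGE,TIME1
A,12,22.3
B,11,22.8
;
RUN;

PROC PRINT DATA=CSVDEMO;
RUN;
```

Adding OBS=2 to the INFILE statement would stop reading after line 2, so only the first data record would be kept.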

Hands-On Example, p 42
(DINFILE2.SAS)
DATA MYDATA;
INFILE 'C:\SASDATA\EXAMPLE.CSV' DLM=','
FIRSTOBS=2 OBS=26;
INPUT GROUP $ AGE TIME1 TIME2 TIME3
Time4 SOCIO;
PROC MEANS;
RUN;

Hands On Exercise DINFILE3.SAS p 43
1. Open the program file DINFILE3.SAS. Note the strange delimiter.

DATA PLACES;
INFILE DATALINES DLMSTR='!!';
INPUT CITY $ STATE $ ZIP;
DATALINES;
DALLAS!!TEXAS!!75208
LIHUE!!HI!!96766
MALIBU!!CA!!90265
;
PROC PRINT;
RUN;

Etc

Input a data set where there are blanks
(Figure: data lines shown beneath a column ruler.)

Do Hands On Example p 44
(DINFILE4.SAS)
Note the TRUNCOVER option.
DATA TEST;
INFILE "C:\SASDATA\DINFILEDAT.TXT" TRUNCOVER;
INPUT LAST $ 1-21 FIRST $ 22-30
ID $ 31-36 ROLE $ 37-44;
RUN;
PROC PRINT DATA=TEST;RUN;

Summary
One of the most powerful features of SAS, and a
reason it is used in many research and corporate
environments, is that it is very adaptable for reading
in data from a number of sources. This chapter
showed only the tip of the iceberg. More information
about getting data into SAS is provided in the
following chapter. For more advanced techniques,
see the SAS documentation.

These slides are based on the book:

Introduction to SAS Essentials: Mastering SAS for Data Analytics, 2nd Edition
By Alan C. Elliott and Wayne A. Woodward
Paperback: 512 pages
Publisher: Wiley; 2 edition (August 3, 2015)
Language: English
ISBN-10: 111904216X
ISBN-13: 978-1119042167
These slides are provided for you to use to teach SAS using this book. Feel
free to modify them for your own needs. Please send comments about errors
in the slides (or suggestions for improvements) to acelliott@smu.edu.
Thanks.