Document K6.1

Introduction to SAS *
1 Introduction
1 Documentation
1 The SAS DATA step
2 Creating and naming a dataset - the DATA statement
2 Preparing data for SAS
3 Entering data to SAS and naming variables - the INPUT statement
5 Reading data from a file - the INFILE statement
5 Including data with SAS commands - the CARDS statement
5 Reading from a SAS dataset - the SET statement
6 Calculations - the assignment statement
7 Conditional evaluation and selection - the IF, THEN and ELSE statements
8 Giving labels to variables - the LABEL statement
8 Displaying data - the PUT statement
9 Controlling output format - the FORMAT statement
10 Executing instructions - the RUN statement
10 Recoding data
10 Example DATA steps
11 The SAS PROC step
11 Analysing data in subgroups - the BY statement
11 Controlling output format - the FORMAT statement
11 Printing titles on output from procedures - the TITLE statement
12 Printing footnotes - the FOOTNOTE statement
12 Descriptive statistics
12 Frequency tables and cross-tabulations - the FREQ procedure
14 Descriptive statistics - the MEANS procedure
15 Statistics about a single variable - the UNIVARIATE procedure
17 Correlations - the CORR procedure
19 Complex tables - the TABULATE procedure
21 Student’s T-Test
23 Drawing simple diagrams with SAS
24 Bar charts - the CHART procedure
26 Scattergrams - the PLOT procedure
28 High-quality graphics
28 Bar charts - the GCHART procedure
31 Scattergrams - the GPLOT procedure
33 Maps - the GMAP procedure
34 Regression - the REG procedure
35 The MODEL statement
36 The OUTPUT statement
36 Example
39 Analysis of variance - the ANOVA procedure
39 The CLASS statement
39 The MODEL statement
40 The MEANS statement
40 Example
44 General linear modelling - the GLM procedure
44 The CLASS statement
45 The MODEL statement
45 The ID statement
45 The MEANS statement
45 The RANDOM statement
45 The OUTPUT statement
46 Example
46 Miscellaneous useful procedures
46 Sorting a dataset - the SORT procedure
47 Defining your own formats - the FORMAT procedure
49 Printing the values of variables - the PRINT procedure
52 References and further information

* We are grateful to the University of Liverpool Computer Laboratory
for permitting us to use their original user guide as a basis for this
one.

Introduction
SAS (Statistical Analysis System, usually pronounced as a single
syllable, ‘sass’) provides facilities for the manipulation and analysis of both
numeric and character data using a variety of techniques. The statistical
facilities range from production of frequency tables and bar charts to
multivariate regression and analysis of variance. The facilities for data
input and manipulation give control over how data is to be read, and
provide for the sorting and merging of datasets and calculations on the
data. This variety of facilities makes SAS suitable for a wide range of
applications including survey analysis, data processing, and analysis of
designed experiments.
A series of SAS statements is often divided into DATA steps and PROC
steps. The division is between those statements which input data,
manipulate it, select subgroups, create new variables etc (the DATA
step), and those which analyse the data and produce output (the PROC
step). The section ‘The SAS DATA step’ describes some of the
functions possible in the DATA step, the section ‘The SAS PROC step’
describes some of the statements which can occur in any PROC step,
while the sections from ‘Descriptive statistics’ to the end describe
some specific procedures.
SAS is available locally on CMS and CSA. A PC version may be
purchased from the Computing Service.

Documentation
SAS is described in several manuals, amounting to many hundreds of
pages. For this document it has been necessary to omit much of the
material and select only a small portion of the facilities available. Even
when a facility is described here it will not be described in full and many
options may be ignored. This means that you should check with the full
manuals if there is something you want to do that is not described in the
following sections. Details of manuals may be found on page 52. On
CMS, type HELP SAS for details of the CMS version.

The SAS DATA step


A SAS DATA step is a set of statements which set up the data, for
example read the data, manipulate it, select subgroups or create new
variables. This section describes some of the functions possible in the
DATA step.
A DATA step starts with a DATA statement and ends with a RUN
statement (or with a further DATA or a PROC statement).

K6.1 (10.90) page 1


Comments can appear anywhere in a SAS statement provided they
appear within the delimiters /* (to start a comment) and */ (to end one).
For example:
DATA STOCK; INPUT CODE QUANTITY
PRICE /*in dollars*/ STOREBIN;

Creating and naming a dataset - the DATA statement
The DATA statement has the general form:
DATA dataset options ;
Note that, as with all SAS statements, the DATA statement ends with a
semicolon. The ‘dataset’ part names the dataset being created by this
DATA step, either by the input of data or by processing an existing
dataset. ‘dataset’ has two parts; a library name (which may be omitted)
and the dataset name. These names follow the same rules as other names
used in SAS, which are:
¤ The name must not be more than eight characters long.
¤ It may contain only the letters A to Z, the digits 0 to 9 and the
underline character.
¤ It must not start with a digit.
Examples of valid names are NEW_ONE, OLDDATA2 and _EX_A1.
If a library name is specified it is separated from the dataset name by a
full stop. If no library name is given, the dataset is placed in WORK,
which is a temporary library that is deleted when you leave SAS. Examples of
valid DATA statements are:
DATA KEEP.GNP82;
DATA PRICES;
DATA E_4_DATA;
If you miss out ‘dataset’ altogether, SAS gives the dataset the name
WORK.DATAn, where n is 1, 2, 3, . . . depending on how many such
datasets you have created so far.
If you specify a library name you can save the library and use it in a later
SAS session rather than having to set it up again. Note however that
only the library name SASUSER is set up for you by SAS. If you wish
to use any other library name then you must define it both to the system
you are using and to SAS (using a LIBNAME statement). This is
described in more detail in Reference A or B.

Preparing data for SAS
SAS can read either fixed format or free format data (or even data which
mixes fixed and free format in some cases). Fixed format requires that
values be in fixed positions in each line of the data, while free format
requires only that values be separated by a blank. Most people find it
easier to prepare their data in free format. SAS can read text data (for
example names) as well as numbers, and it has special provision for
times and dates. Free format data must follow certain rules:
¤ Each value must be separated from the next by at least one space or
by the end of a line.
¤ Decimal points must be included where they are required.
¤ Any missing data must be represented by a full stop (.) and not by
a blank (full stop is the standard symbol for a missing value in
SAS).
¤ If you are using text data then no item of text may be longer than
eight characters or contain a blank; it is typed without quotes.
If your data satisfies all these conditions, you can use the free format
method of reading data. If not, you must use fixed format, or possibly a
mixture of free and fixed format, as described below.
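As a minimal sketch of these rules (the dataset name, variable names and
values are invented for illustration), the following DATA step reads three
cases in free format. The second case has a missing AGE, shown by a full
stop, and the third case runs on to a second line:

```sas
/* Illustrative sketch only: names and values are invented */
DATA FREEFMT;
INPUT CASE_NO AGE HEIGHT WEIGHT;
CARDS;
101 31 1.88 82
102 . 1.6 60
103 24
1.9 75.5
RUN;
/* List the dataset to check that it was read as intended */
PROC PRINT DATA=FREEFMT;
RUN;
```

The INPUT, CARDS and RUN statements used here are described in the
following sections.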

Entering data to SAS and naming variables - the INPUT statement
A SAS dataset is made up of a number of cases or observations, each of
which contains a value (either a measurement or a calculation) for each
variable in the dataset.
If you are inputting new data, rather than processing an existing dataset,
use the INPUT statement to specify:
¤ the names of the variables
¤ the order of variables in each complete set of variables (or case or
observation)
¤ whether variables are numeric or contain characters
¤ how the variables are laid out on the input line
The names of the variables follow the same rules as those for dataset
names given on page 2, for example HEIGHT, TOT_WAGE, AVE3. A
variable is assumed to be numeric unless it is followed by a dollar sign
($) in the INPUT list. The dollar sign is not part of its name and is not
typed when using the variable in other SAS statements.
The order in which the variables appear in the INPUT statement is their
order in the dataset (although not necessarily their order in the input data,
as some forms of fixed format input can read the data in a different order
to the one in which it was typed).
If you have a list of variable names ending in numeric suffixes, for
example VAR1, VAR2, VAR3 etc, SAS allows an abbreviated form for
their specification:
INPUT CASENUM VAR1-VAR12 DETAILS;
This form may also be used to specify such a list in procedures. For
example:
PROC MEANS; VAR AGE1-AGE7; RUN;
For more details of PROC MEANS see page 14.
Where variable names do not follow such a pattern you can still use an
abbreviated form when specifying them in a procedure, but it is a little
different. If the INPUT statement for the dataset was:
INPUT CASE SEX AGE TESTA TESTB INCOME MARSTAT
REGION;
you could type:
PROC MEANS; VAR AGE--INCOME; RUN;
to obtain means on the variables AGE, TESTA, TESTB and INCOME.
The double hyphen indicates a list of variable names.

Free format - list input
To input data which satisfies the rules of free format input given on
page 2 you need only list the variables, for example:
INPUT CASE_NO SEX $ AGE HEIGHT WEIGHT;
SEX has been declared as a character variable so that M and F can be
used to denote male and female, rather than using numeric codes. As
with all SAS statements, the INPUT statement ends with a semicolon.

Fixed format - column input
There are two ways of specifying fixed format input. The easier is
column input. The columns containing the value for the variable are
specified after the variable name, for example:
INPUT REF 1-6 NAME $ 7-26 AGE 27-28 SEX 59
HEIGHT 64-66 .2;
This specifies the name and type of the variables and how they are laid
out:
REF numeric, columns 1 to 6
NAME character, columns 7 to 26
AGE numeric, columns 27 to 28
SEX numeric, column 59
HEIGHT numeric, columns 64 to 66, the last two columns (65 and
66) are assumed to be after a decimal point (which need
not be typed)
If you always type decimal points where they are needed then you have
no need to use the type of column format given for HEIGHT. Instead
you could specify simply HEIGHT 64-66 and type in all decimal points.
Anything typed in columns other than those mentioned in the INPUT
statement is ignored. If your data does not fit on one line you must
indicate when the second or third line begins. For example, if the data
has the name on one line and the address on the next:
INPUT NAME $ 1-20 #2 ADDRESS $ 1-60;
The #2 means that subsequent column numbers refer to line 2.
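A short sketch of a two-line layout is given below. The dataset name
PEOPLE and the names and addresses are invented for illustration; each
case occupies two data lines, and #2 moves to the second of them:

```sas
/* Hypothetical layout: name on line 1, address on line 2 of each case */
DATA PEOPLE;
INPUT NAME $ 1-20 #2 ADDRESS $ 1-60;
CARDS;
JOHN SMITH
12 HIGH STREET, ANYTOWN
MARY JONES
3 STATION ROAD, OTHERTON
RUN;
/* Two observations are created from the four data lines */
PROC PRINT DATA=PEOPLE;
RUN;
```

Because column input is used, the blanks inside each name and address are
read as part of the value.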

Fixed format - formatted input
Formatted input is the second way to input fixed format data. It is the
only way of inputting certain unusual types of data and of declaring that
data represents a time or date so that it can be printed appropriately later.
The data layout used above would be specified as follows using SAS
informats (an informat describes how a value is to be read; a format
describes how it is to be printed):
INPUT @1 REF 6. NAME $20. AGE 2. @59 SEX 1.
@64 HEIGHT 3.2;
¤ ‘@1’ specifies that REF starts in Column 1
¤ ‘6.’ means that it is 6 digits long with no figures after the decimal
point
¤ since NAME follows immediately after REF there is no need to
specify the column at which it starts (column 7 is assumed)
¤ ‘20.’ indicates that it is 20 characters long
¤ AGE is two digits long with no decimal point
¤ SEX does not immediately follow AGE, and so ‘@59’ is needed to
show where SEX starts, and ‘1.’ specifies that it takes up one
column
¤ HEIGHT takes up three columns, the last two of which are after the
decimal point, so its informat is ‘3.2’ rather than ‘3.’, but ‘3.’ would
be quite adequate if you had typed the decimal points in the data.
If there is more than one line of data for each case use the ‘#n’ notation,
for example:
INPUT @1 NAME $20. #2 @1 ADDRESS $60.;

Mixing free and fixed format
SAS allows the different styles of describing how the data is arranged to
be specified on the same INPUT statement. For example, suppose you
had data which consisted of a name of up to 20 letters and seven
measurements, and you wished to type the measurements in free format
but could not use free format for the name because it had more than 8
characters. If you put the name first you could describe this data quite
simply:
INPUT NAME $ 1-20 M1 M2 M3 M4 M5 M6 M7;
The name starts in column 1 and the measurements can be typed in free
format after column 20. If any name is 20 characters long, a space
between the end of the name and the first measurement is advisable.

Multiple cases on a line
If the list of variables ends with @@ SAS does not expect each case to
start on a new line. For example:
DATA LINES;
INPUT A B@@;
CARDS;
1 2 1 4 1 5 6 4 2 3 6 7 10 8 3 6
RUN;
The data for eight cases is typed on one line. For an explanation of
CARDS see page 5 and for RUN see page 10.

Reading data from a file - the INFILE statement
If data is to be read from a file you must include an INFILE statement to
specify the name of the file. There are different ways of specifying the
file according to which system you are using. See Reference A or B for
further details.
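As a rough sketch only, a DATA step that reads from a file might look like
the one below. The file name 'survey.dat' is hypothetical, and the way a
file is named in the INFILE statement differs between systems, so check
the references before relying on this form:

```sas
/* Sketch under stated assumptions: 'survey.dat' is an invented name
   and the form of the file specification is system-dependent */
DATA SURVEY2;
INFILE 'survey.dat';
INPUT CASE_NO AGE HEIGHT WEIGHT;
RUN;
```

The INFILE statement comes before the INPUT statement, so that SAS knows
where the data is before it starts to read it.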

Including data with SAS commands - the CARDS statement
If the data is to be included with the SAS statements the CARDS
statement appears just before the data. It has the format:
CARDS;
If CARDS is being used then only the data itself and a RUN statement
(see page 10) should follow it. Any transformations of the data must
appear before the CARDS statement.

Reading from a SAS dataset - the SET statement
If the data is to be read from an existing dataset the SET statement specifies
the dataset to be used. A DATA step should normally contain either a
SET statement or an INPUT statement. The format of the SET statement
is:
SET dataset;
For example:
SET GHS_DATA;
SET SASUSER.ANSWERS;

Calculations - the assignment statement
Assignment statements have no keyword to identify them. They are used
to perform calculations on the data and have the general form:
variable = expression;
The expression contains numbers, variable names and arithmetic
operators. The multiplication operator (*) must always appear between
two quantities which are to be multiplied together.
The symbols used for arithmetic operators are:
** exponentiation, for example (1 + R)**T
* multiplication, for example TAXABLE * RATE
/ division, for example C367 / C45
+ addition, for example A1 + B9
- subtraction, for example GROSS - TAX
Brackets may be used to make your meaning clear. Examples of
assignment statements are:
AVEINC = INCOME / NUMFAM;
B1 = (A7 + C9 - EXPENSES)*0.54;
GRPPAY = PAY / 1000;
SAS has rules for the order in which it evaluates expressions by giving
priorities (or precedence) to each operator, as follows:
1 Bracketed expressions
2 Exponentiation
3 Multiplication and Division
4 Addition and Subtraction
If there are operators of equal precedence SAS works from left to right.
This means that an expression like:
A+B*C
is evaluated by SAS as A + (B * C). If you wish to add A and B before
multiplying by C then you must use brackets:
(A + B) * C
If you are in doubt about how SAS will evaluate a complex expression
then either insert brackets or split it into simpler expressions and use
several assignment statements to build up the full expression.
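The precedence rules can be checked with a small DATA step such as the
sketch below. The dataset name CALC and the values are invented for
illustration; PUT (described on page 8) writes the results to the SAS log:

```sas
/* Illustrative values only */
DATA CALC;
A = 2; B = 3; C = 4;
X1 = A + B * C;    /* multiplication first: 2 + 12 */
X2 = (A + B) * C;  /* brackets first: 5 * 4 */
X3 = A ** B * C;   /* exponentiation first: 8 * 4 */
PUT X1= X2= X3=;   /* writes X1=14 X2=20 X3=32 */
RUN;
```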
SAS expressions can also include SAS functions. These provide many
facilities including square roots of numbers, logarithms, sines and
cosines, probabilities, etc. A list of the more commonly used functions is
given below.
ABS absolute value (that is, the value ignoring sign)
MAX maximum of a list of values
MIN minimum of a list of values
SQRT square root
INT gives the integer part of the value, that is it discards the
decimal part
ROUND rounds off to the nearest whole number
NORMAL gives a random number from a normal distribution with
mean 0 and standard deviation 1
UNIFORM gives a random number from a uniform distribution in the
range 0 to 1.
There are also functions for manipulating dates and times and for
character variables.
Functions are used by giving the value on which they are to operate in
brackets following the name, for example:
BIG = MAX(A,B,C);
S = SQRT(S2/N);
RX = ROUND(X) + C;
Z = M + S*NORMAL(0);
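A brief sketch of some of these functions in use is given below. The
dataset name FUNCS and the values are invented for illustration, and the
results are written to the SAS log by PUT (see page 8):

```sas
/* Illustrative values only */
DATA FUNCS;
A = 3; B = -7.6; C = 5;
BIG = MAX(A,B,C);        /* largest of the three values */
POS = ABS(B);            /* value ignoring sign */
WHOLE = INT(B);          /* decimal part discarded */
NEAR = ROUND(B);         /* nearest whole number */
ROOT = SQRT(A*A + 16);   /* square root of 25 */
PUT BIG= POS= WHOLE= NEAR= ROOT=;
/* writes BIG=5 POS=7.6 WHOLE=-7 NEAR=-8 ROOT=5 */
RUN;
```

Note the difference between INT, which simply drops the decimal part, and
ROUND, which moves to the nearest whole number.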

Conditional evaluation and selection - the IF, THEN and ELSE statements
The IF statement is used when an action is to be carried out on only
some of the cases being processed. For example, you may wish to take
special action if data is missing, or do calculations differently for people
in work and those unemployed, or you may wish to exclude certain
groups. The general form of the IF statement is:
IF condition THEN statement1; ELSE statement2;
‘statement1’ is acted upon if ‘condition’ is true, otherwise ‘statement2’ is
acted upon. The ‘condition’ is usually of the form:
expression comparator expression
where the expression is as described on page 6, and ‘comparator’ is one
of the following:
EQ or = equals
NE or ^= not equal to
GT or > greater than
NG or ^> not greater than
LT or < less than
NL or ^< not less than
GE or >= greater than or equal to
LE or <= less than or equal to
Note that NL and GE are equivalent and so are NG and LE.
These simple conditions can be linked with the words AND and OR.
NOT may be used to change the meaning of a condition.
When using a complex condition you should use brackets to make your
meaning clear. Examples of conditions are:
AGE LE 15 AND WORKSTAT EQ 1
NOT (SEX = 1 OR NISTAMP < 2)
MARSTAT EQ 1 OR MARSTAT >= 3
Examples of IF statements are:
IF A > B THEN X=A;
IF MONTH >= 3 AND MONTH LT 6 THEN
SEASON = 'SPRING';
IF EXPENSES GT (EARNINGS - TAX) THEN DEBT = 1;
IF A > B THEN X=A;
ELSE X=B;
The above examples have shown only assignment statements following
THEN and ELSE, but other statements can also be used, for example
DELETE (which omits the case from the dataset) or PUT (which can
print values; see page 8). For example, if you already had a dataset and
wished to set up another one which only contained men over retirement
age, then you might have
IF SEX NE 'M' OR AGE LT 65 THEN DELETE;
or
IF NOT (SEX EQ 'M' AND AGE GE 65) THEN DELETE;
A special value used to indicate missing values is the full stop (which
can also be used in your data for the same purpose). Suppose your data
had been prepared with ‘9’ indicating a missing value for the variable
MARSTAT (which is marital status) you could replace this with a
missing value symbol by:
IF MARSTAT = 9 THEN MARSTAT = . ;
To eliminate cases in which both INCOME and EXPENSES are missing:
IF INCOME = . AND EXPENSES = . THEN DELETE;
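Putting these statements together, a DATA step that recodes a missing
value code and then keeps only men of retirement age might be sketched as
follows. The dataset OLDDATA is hypothetical and is assumed to contain a
character variable SEX and numeric variables AGE and MARSTAT:

```sas
/* Sketch only: OLDDATA is an invented dataset, assumed to hold
   SEX (character), AGE and MARSTAT (numeric) */
DATA RETIRED;
SET OLDDATA;
IF MARSTAT = 9 THEN MARSTAT = . ;         /* 9 was used for 'missing' */
IF SEX NE 'M' OR AGE LT 65 THEN DELETE;   /* keep men aged 65 or over */
RUN;
```

A case whose AGE is missing is also deleted here, because SAS treats a
missing value as smaller than any number.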

Giving labels to variables - the LABEL statement
The LABEL statement allows you to define labels for variables, which
will be used by various procedures to document the output. A label may
be up to 40 characters long, for example:
LABEL INCOME='ANNUAL INCOME INCLUDING STATE BENEFITS';

Displaying data - the PUT statement
The PUT statement enables you to print out values as the DATA step is
being processed. It has equivalent styles to the three forms of the INPUT
statement (list, column, and formatted). Only simple ways of using
PUT are described here.

PUT - list style
This can be simply a list of the variable names whose values you wish to
be printed, for example:
PUT NAME AGE MARSTAT;
in which case the values are printed with one space between each value,
and each case starts on a new line.
If you follow the name of the variable with an equals sign, the value is
labelled with the name of the variable, for example:
PUT REF= AGE=;
The output from this statement is:
REF=103 AGE=56
If a value is missing a full stop is printed to represent it.
You may also print text with PUT. For example, to check the validity of
the data and print an error message if a mistake is found:
IF AGE LT 16 AND WORK EQ 1 THEN
PUT 'UNDER AGE ' NAME AGE= WORK=;

PUT - column style
To lay out the data in regular columns use column style. For example:
PUT CASE_NO 1-8 HEIGHT 11-14 .2 WEIGHT 16-19 .1;
prints the case number in columns 1 to 8, the height in columns 11 to 14
with a decimal point in column 12, and the weight in columns 16 to 19
with a decimal point in column 18.
You can print text by specifying how many blanks are to be left between
the last field printed and the text. For example:
IF NRUNS GE 100 THEN
PUT BATSMAN 1-20 +2 'SCORED A CENTURY';
prints the name in the first 20 columns, skips 2 columns and then prints
the text.

PUT - formatted style
In this style the name of each variable is followed by the name of the
format in which it is to be printed. This format may be a standard SAS
format or one which you have defined (see the description of PROC
FORMAT in the section on page 47). For example, using the standard
currency formats, the statement:
PUT DEBTS DOLLAR7.2 ASSETS DOLLAR9.2;
produces:
$130.00 $245.45

PUT - mixed style
In the same way that styles can be mixed with INPUT you can mix styles
in PUT statements. For example:
PUT COMPANY 5-30 +1 'HAS LOW ASSETS ' FUNDS
DOLLAR8.2;

Controlling output format - the FORMAT statement
The FORMAT statement associates a format with a variable for printing.
The association lasts until the session ends, not just for the DATA step.
If you use your own formats (declared using PROC FORMAT, see the
section on page 47) they must have been declared before they are used.
Examples of FORMAT statements are:
FORMAT HEIGHT 4.2;
FORMAT WEEK_PAY DOLLAR6.2;
FORMAT FILMYEAR ROMAN12.;
The last example will print FILMYEAR in Roman numerals allowing 12
spaces for the value.

Executing instructions - the RUN statement
The RUN statement is used to end both DATA steps and PROC steps,
and shows that the statements in the step are complete and should be
executed. The format of the statement is simply:
RUN;
It is not essential to end a DATA step with RUN because the step is
executed when SAS meets a DATA or PROC statement, but it is certainly
tidier to use RUN especially when typing commands at the terminal.

Recoding data
Sometimes you may wish to group data or do some relabelling of values.
This can be done by a series of IF statements, but can also be done quite
conveniently with a format and the PUT function. See page 48
for details.
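As a rough sketch of the idea (the format name AGEFMT, the groupings and
the new variable AGE_GRP are invented for illustration, and AGE is assumed
to hold whole numbers), a format is first defined with PROC FORMAT and
then applied with the PUT function in a DATA step:

```sas
/* Sketch only: AGEFMT and its groupings are hypothetical */
PROC FORMAT;
VALUE AGEFMT LOW-15  = 'CHILD'
             16-64   = 'ADULT'
             65-HIGH = 'RETIRED';
RUN;
DATA GROUPED;
SET SURVEY;                    /* assumed to contain numeric AGE */
AGE_GRP = PUT(AGE, AGEFMT.);   /* character variable holding the group */
RUN;
```

The same result could be obtained with three IF statements, but the format
keeps the grouping in one place and can be reused in later steps.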

Example DATA steps

Using data within SAS statements and list input
DATA ONE;
INPUT REF_NUM SEX $ AGE HEIGHT WEIGHT;
LABEL HEIGHT='HEIGHT IN METRES';
LABEL WEIGHT='WEIGHT IN KILOGRAMS';
CARDS;
101 M 31 1.88 82
102 F 26 1.6 60
103 M 24 1.9 75.5
...
150 M 38 1.87 76
RUN;

Using data within SAS statements and column input
DATA SURVEY;
INPUT CASENO 1-6 SEX $ 7 AGE 8-10 MARSTAT $ 11
INCOME 12-18 .2;
PAYPERYR = INCOME/AGE;
PUT CASENO PAYPERYR INCOME AGE;
LABEL MARSTAT='MARITAL STATUS';
CARDS;
000001M026S0741200
000002F056S2568000
...
100247M092M0403909
RUN;
For an example of reading data from a file, see Reference A or B.

The SAS PROC step
The PROC step starts with a PROC statement and ends with RUN (or by
meeting a DATA or PROC statement). There are many varieties of
PROC statement, each one providing a different SAS facility. The
following sections describe specific PROC statements, but some of the
statements which can occur in any PROC step are described in this
section.

Analysing data in subgroups - the BY statement
A procedure can produce analyses for subgroups rather than for the
whole data if a BY statement is included in the PROC step and the data is
sorted on the variable or list of variables specified (for details of how to
sort data see the description of SORT on page 46). For example, to
produce separate mean values of income for men and women use the
procedure MEANS, and include the statement:
BY SEX;
within the PROC step. To produce tables for men and women in
different age groups, use:
BY SEX AGE_GRP;
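A minimal sketch of the two steps together is shown below. It assumes a
dataset called SURVEY containing the variables SEX and INCOME; the data is
sorted first so that the BY statement in PROC MEANS can be used:

```sas
/* Sketch only: assumes SURVEY contains SEX and INCOME */
PROC SORT DATA=SURVEY;
BY SEX;
RUN;
PROC MEANS DATA=SURVEY;
VAR INCOME;
BY SEX;       /* one set of statistics for each value of SEX */
RUN;
```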

Controlling output format - the FORMAT statement
The FORMAT statement gives a format for printing to variables used in
the PROC step. It has the same layout as the FORMAT statement used in
a DATA step, but while the DATA step associates a format with a
variable for the whole SAS session, its use in a PROC step associates the
format with the variable only for the duration of that step. An example
is:
FORMAT INCOME DOLLAR7.2;

Printing titles on output from procedures - the TITLE statement
The TITLE statement prints a title on the output from a procedure. It can
appear anywhere but is most useful in a PROC step. The title can be
several lines long. The first line can be written as TITLE1 or simply as
TITLE, but any following lines must be numbered. For example, a one
line title could be either:
TITLE 'ANALYSIS OF ANTIGEN LEVELS';
or:
TITLE1 'ANALYSIS OF ANTIGEN LEVELS';
A title of several lines might be:
TITLE 'ATTITUDES TO OUT-PATIENT CARE';
TITLE3 'DELAY IN RECEIVING APPOINTMENTS';
This would give a title of three lines (the second line, TITLE2, is
assumed to be blank). You can redefine TITLE3 later without changing
TITLE1; a new TITLE statement replaces only that numbered line and
cancels any lines with higher numbers.
If you are using a graphics device for output then there are many extra
options for this statement.

Printing footnotes - the FOOTNOTE statement
The FOOTNOTE statement prints notes at the foot of the output page. It
can appear anywhere but is most useful in a PROC step. Like a title, a
footnote can be several lines long. The first line can be written as
FOOTNOTE1 or simply as FOOTNOTE, but any subsequent lines must be
numbered. For example, a one-line footnote could be either:
FOOTNOTE '1985 figures, Pounds Sterling';
or:
FOOTNOTE1 '1985 figures, Pounds Sterling';
A footnote of several lines might be:
FOOTNOTE 'Data obtained from official sources';
FOOTNOTE3 'Estimated 12% under-reporting';
This would give a footnote of three lines (the second line, FOOTNOTE2,
is assumed to be blank). You can redefine FOOTNOTE3 later without
changing FOOTNOTE1; a new FOOTNOTE statement replaces only
that numbered line and cancels any lines with higher numbers.
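As an illustration of titles and footnotes inside a PROC step, the sketch
below assumes a dataset called SURVEY containing a numeric variable
INCOME; the title and footnote texts are taken from the examples above:

```sas
/* Sketch only: assumes SURVEY contains a numeric variable INCOME */
PROC MEANS DATA=SURVEY;
VAR INCOME;
TITLE 'ATTITUDES TO OUT-PATIENT CARE';
TITLE3 'DELAY IN RECEIVING APPOINTMENTS';
FOOTNOTE 'Data obtained from official sources';
RUN;
```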

Descriptive statistics
This section describes some of the procedures in SAS for descriptive and
other statistics. These are FREQ, MEANS, UNIVARIATE, CORR and
TABULATE.

Frequency tables and cross-tabulations - the FREQ procedure
The FREQ procedure produces tables; one-way, two-way, three-way, etc.
One-way tables are normally called frequency tables, while two-way or
more are often called cross-tabulations. The PROC step for FREQ starts
with the statement:
PROC FREQ options;
The options may be omitted entirely:
PROC FREQ;
The option DATA= specifies which dataset is to be used (if omitted the
most recently created dataset is used). For example:
PROC FREQ DATA=LIB83.GNPFIGS;

The TABLES statement
The TABLES statement specifies which variables are to be analysed and
the sort of tables to be produced. It has the form:
TABLES tablerequests / options ;
If no options are specified the / is not required. For one-way tables, give
the name of the variable or variables required. For multi-way tables list
the variables required separated by asterisks, for example:
AGE_GRP*SEX
or
INC_GRP*MARSTAT*CITY
The TABLES statement has shorthand forms for specifying tables. For
example:
HEIGHT -- EXAMS
specifies all the variables from HEIGHT to EXAMS (inclusive) in the
dataset.
QN32*(QN01 QN02 QN03)
specifies the three tables QN32*QN01, QN32*QN02 and QN32*QN03.
QUALS*(LABVOTE -- SDPVOTE)
combines the two shorthand methods.
If no options are given, the content of a table is the frequency, the
percentage of the total number in the table, the percentage of the number
in the row, and the percentage of the number in the column (the last two
are only printed for cross-tabulations). The content can be changed by
specifying options in the TABLES statement. Some useful options are
NOPERCENT, which suppresses printing of overall percentages;
NOROW, which suppresses row percentages, and NOCOL, which
suppresses column percentages.

Examples
¤ PROC FREQ;
TABLES MARSTAT NUM_KIDS;
TABLES HOUSING*REGION / NOROW NOCOL;
TABLES INJURIES*(SHIFT MONTH);
RUN;
¤ PROC FREQ;
TABLES AREA HOUSING VAR01 - VAR10;
RUN;
¤ PROC FREQ DATA=RENTED;
TABLES AREA*HOUSING /NOPERCENT;
TABLES REPAIRS*TENURE;
RUN;
¤ To obtain frequencies for the number of children (NOOFCH) and a
cross-tabulation of SEX and marital status (MARSTAT), type:
PROC FREQ; TABLES NOOFCH SEX*MARSTAT; RUN;
This could produce the following output:

                              Cumulative  Cumulative
NOOFCH   Frequency   Percent   Frequency     Percent
-----------------------------------------------------
     .           1       .           .          .
     0           9      37.5         9        37.5
     1           2       8.3        11        45.8
     2           6      25.0        17        70.8
     3           5      20.8        22        91.7
     4           1       4.2        23        95.8
     7           1       4.2        24       100.0

TABLE OF SEX BY MARSTAT
SEX       MARSTAT
Frequency|
Percent  |
Row Pct  |
Col Pct  |       1|       2|       3|       4|  Total
---------+--------+--------+--------+--------+
       1 |      3 |      7 |      0 |      2 |     12
         |  12.50 |  29.17 |   0.00 |   8.33 |  50.00
         |  25.00 |  58.33 |   0.00 |  16.67 |
         |  42.86 |  63.64 |   0.00 |  50.00 |
---------+--------+--------+--------+--------+
       2 |      4 |      4 |      2 |      2 |     12
         |  16.67 |  16.67 |   8.33 |   8.33 |  50.00
         |  33.33 |  33.33 |  16.67 |  16.67 |
         |  57.14 |  36.36 | 100.00 |  50.00 |
---------+--------+--------+--------+--------+
Total           7       11        2        4       24
            29.17    45.83     8.33    16.67   100.00
Frequency Missing = 1

Descriptive statistics - the MEANS procedure
The procedure MEANS prints the mean, standard deviation, and
maximum and minimum values of a variable. The format of the PROC MEANS
statement is:
PROC MEANS options;
The option which is most likely to be required is DATA= to specify the
dataset to be used, for example:
PROC MEANS DATA=OLD_DATA;
As is usual, the most recently created dataset is used if no DATA=
option is specified.
The option MAXDEC= may also be useful as it specifies how many
decimal places (0 to 8) are to be printed in the results. For example:
PROC MEANS MAXDEC=4;
PROC MEANS DATA=INT_DATA MAXDEC=0;
You can also specify the statistics to be produced by MEANS.

The VAR statement
The VAR statement specifies the variables which are to be analysed, for
example:
VAR AGE INCOME HEIGHT WEIGHT;
If no VAR statement is used all the numeric variables are processed.

Examples
¤ PROC MEANS;
VAR HEIGHT WEIGHT;
RUN;
¤ PROC MEANS DATA=BPAIN MAXDEC=4;
VAR C7HGHT ILIAC CHEST;
RUN;
¤ To obtain the mean values for AGE and INCOME, type:
PROC MEANS;
VAR AGE INCOME;
RUN;
This could produce the following output:

N Obs  Variable    N    Minimum    Maximum       Mean    Std Dev
------------------------------------------------------------------
   20  AGE        17      18.00      39.00      29.00       6.67
       INCOME     19    3900.00    9800.00    6820.50    1773.43
------------------------------------------------------------------

There are three missing values for AGE (the complete dataset has
20 cases) and one for INCOME.

Statistics about a single variable - the UNIVARIATE procedure
The UNIVARIATE procedure can provide very detailed statistics on a
variable as well as plots to illustrate the distribution of values. The
statistics produced include the mean, sum, standard deviation, variance,
maximum, minimum, median, mode, quartiles, percentiles, and the five
highest and lowest values.
The PROC UNIVARIATE statement has the form:
PROC UNIVARIATE options;
Useful options are:
FREQ produces a frequency table giving the frequency,
percentage and cumulative percentage for each value.
NORMAL tells SAS to test if the distribution of the variable is close
to a Normal (Gaussian) distribution. It is sometimes
important to know whether a distribution is very different
from Normal, as several statistical techniques give
misleading results on such variables.
PLOT gives information on whether the variable is normally
distributed, by drawing a Normal probability plot and a
bar chart.
DATA= specifies which dataset is to be analysed if you do not
wish to use the most recently created one.
Example statements are:
PROC UNIVARIATE;
PROC UNIVARIATE DATA=ORIGDATA PLOT FREQ;
PROC UNIVARIATE FREQ NORMAL;

The VAR statement
The VAR statement specifies which variables are to be analysed. For
example:
VAR INCOME;
VAR MALE_POP FEML_POP OAP_POP RATEABLE AREA;
The variables must be numeric. If the VAR statement is omitted all
numeric variables in the dataset are analysed.

Examples
¤ PROC UNIVARIATE;
VAR INCOME;
BY SEX;
RUN;
¤ PROC UNIVARIATE DATA=UK_82 PLOT;
TITLE ANALYSIS OF MONTHLY FIGURES;
VAR IMPORTS EXPORTS EMIGRATE;
RUN;
¤ You can use UNIVARIATE to test whether the distribution of AGE
is normal.
PROC UNIVARIATE NORMAL;
VAR AGE;
RUN;
produces the output:

UNIVARIATE
Variable=AGE
                      Moments
N               20        Sum Wgts         20
Mean            29        Sum             580
Std Dev   6.672804        Variance   44.52632
Skewness  -.029524        Kurtosis   -1.07339
USS          17666        CSS             846
CV        23.00967        Std Mean   1.492084
T:Mean=0   19.4359        Prob>|T|     0.0001
Sgn Rank       105        Prob>|S|     0.0001
Num ^= 0        20
W:Normal  .9518337        Prob<W        0.407

UNIVARIATE
Variable=AGE
                  Quantiles(Def=5)
100% Max        39        99%              39
 75% Q3         35        95%              39
 50% Med        29        90%            38.5
 25% Q1         24        10%            19.5
  0% Min        18         5%            18.5
                           1%              18
Range           21
Q3-Q1           11
Mode            25
                      Extremes
Lowest    Obs             Highest    Obs
    18    (8)                  36    (1)
    19    (9)                  36    (3)
    20   (10)                  38   (14)
    22   (11)                  39   (19)
    23   (12)                  39   (20)

The distribution is not significantly different from Normal. The mean is
significantly different from zero.
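The first example above uses a BY statement, which expects the cases to be in order of the BY variable. A minimal sketch of sorting first, assuming a hypothetical dataset SURVEY that holds INCOME and SEX, might look like this:

```sas
/* SURVEY is a hypothetical dataset name used only for illustration. */
PROC SORT DATA=SURVEY;
  BY SEX;                 /* put the cases in SEX order */
RUN;
PROC UNIVARIATE DATA=SURVEY;
  VAR INCOME;
  BY SEX;                 /* one set of statistics per value of SEX */
RUN;
```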

Correlations - the CORR procedure
The CORR procedure calculates the correlation between variables. It can
use the product-moment (Pearson) definition of correlation, which is not
appropriate for some types of variable, or Spearman’s and Kendall’s
definitions, which are more suitable for rankings and positions. Basic
statistics like the mean are also printed for the variables used. The
PROC CORR statement has the form:
PROC CORR options;
If no options are specified the most recently created dataset is used and
Pearson correlations are calculated. To change the dataset used specify
the DATA= option. To request a different correlation coefficient use the
SPEARMAN or KENDALL option. These can be used in combination
with each other and with PEARSON. Examples of PROC CORR
statements are:
PROC CORR;
PROC CORR DATA=FRENCH KENDALL SPEARMAN;
PROC CORR PEARSON SPEARMAN KENDALL;

The VAR statement
There are different ways of specifying the correlations you wish to
calculate. If you use a VAR statement and not a WITH statement (see
below) coefficients are printed for all possible pairs of variables in the
list. For example:
VAR A B C;
gives the correlations between A and B, B and C and A and C.
Omitting the VAR statement is equivalent to including one with all the
numeric variables in the set specified.

The WITH statement
If the WITH statement is used it modifies the way in which the VAR
statement is obeyed. The variables in the VAR statement are treated as
one list and those in the WITH statement as another, and coefficients are
calculated for all pairs, taking one from each list. For example:
VAR AGE HEIGHT WEIGHT;
WITH LIFT1 LIFT2 LIFT3;
produces the correlation of AGE with LIFT1, LIFT2 and LIFT3; the
correlation of HEIGHT with LIFT1, LIFT2 and LIFT3; and the
correlation of WEIGHT with LIFT1, LIFT2 and LIFT3.

Examples
¤ PROC CORR;
VAR ENGLISH MATHS PHYSICS;
RUN;
¤ PROC CORR DATA=OPINION KENDALL SPEARMAN;
VAR SOCGROUP;
WITH THEATRE CINEMA FOOTBALL CONCERTS;
RUN;
¤ A correlation of AGE and INCOME can be obtained by typing:
PROC CORR;
VAR AGE INCOME;
RUN;
which would produce the following result:

VARIABLE    N        MEAN      STD DEV          SUM     MINIMUM     MAXIMUM
AGE        22     41.0909      18.0262       904.00     19.0000     80.0000
INCOME     24   5799.9583    2547.5247    139199.00   1750.0000   9754.0000

PEARSON CORRELATION COEFFICIENTS / PROB > |R| UNDER H0:RHO=0
/ NUMBER OF OBSERVATIONS

                AGE       INCOME
AGE         1.00000     -0.33609
             0.0000       0.1363
                 22           21
INCOME     -0.33609      1.00000
             0.1363       0.0000
                 21           24

Complex tables - the TABULATE procedure
The TABULATE procedure produces tables and gives far more control over
their layout than the FREQ procedure (see page 12). The entries in the
tables can be means, standard deviations etc, rather than just counts.
The options to the PROC TABULATE statement include the usual DATA=.
Another important option is FORMAT=, which defines how values are to be
printed in the tables. For example, the statement:
PROC TABULATE FORMAT=6.3;
allows two spaces before the decimal point, one for the decimal point
and three after it (making six in all). If no format is specified, it is
assumed to be 12.2, that is twelve spaces for each value, with nine
places before the decimal point, one for the decimal point and two after
it.
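The width.decimals notation used by FORMAT= is the ordinary SAS numeric format, so its effect can be tried out with a PUT statement in a DATA step. The following is only a sketch; the variable X and its value are made up for illustration.

```sas
DATA _NULL_;
  X = 12.34567;   /* hypothetical value */
  PUT X 6.3;      /* width 6, 3 decimal places: writes 12.346             */
  PUT X 12.2;     /* width 12, 2 decimal places: writes 12.35, padded on  */
                  /* the left with blanks to fill twelve columns          */
RUN;
```

A narrower format leaves room for more columns across the page, at the cost of truncating the space available for large values.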

The CLASS statement
The CLASS statement specifies the variables which will be used to define
the rows and columns of tables. For example:
CLASS SEX AGEGRP REGION;

The VAR statement
The VAR statement specifies the variables which will be used to form the
entries in the cells of the tables. For example:
VAR AGE INCOME;

The TABLE statement
The TABLE statement can be extremely complex, and only some of the
possible specifications are described here. Any variable appearing in a
TABLE statement must have appeared in a preceding CLASS or VAR
statement.
The simplest sort of table is like those produced by PROC FREQ. For
example:
TABLE SEX, REGION;
produces a two-way table showing the frequency of each combination of
SEX and REGION. Note the use of a comma rather than an asterisk.
TABLE SEX RACE, REGION;
produces a frequency table of SEX by REGION with a table of RACE by
REGION joined to the bottom.
TABLE REGION, SEX RACE;
produces tables of SEX and RACE side by side. By using the
FORMAT= option to reduce the width of the columns you can put
several cross-tabulations side by side.
To produce marginal totals use the keyword ALL. For example:
TABLE (REGION ALL), SEX RACE;
adds a row of totals over all the regions.
TABLE (REGION ALL), (SEX ALL RACE ALL);
adds totals for the rows and for each group of columns.
To produce percentages rather than the original counts use:
TABLE REGION, (SEX*PCTN RACE*PCTN);
A comma starts a new level of the table; an asterisk starts a nesting. The
statement:
TABLE REGION, SEX*RACE;
gives a table with each row representing a region. Each row contains a
count of the people of each sex, split into racial groups:

S1 S2
RG1 R1 R2 R3 ... R1 R2 R3 ...
RG2 R1 R2 R3 ... R1 R2 R3 ...
...

As well as arranging tables into a concise form, TABULATE can display
statistics. For example:
TABLE (AGE*MEAN INCOME*MEAN), REGION;
shows the means for AGE and INCOME for each region. Other statistics
which may be requested include:
STD for standard deviation
MIN for minimum
MAX for maximum
SUM for total
PCTSUM for the percentage of the sum of values
PCTN for percentages, as shown above

A table request like:
TABLE (INCOME*MAX), AGEGRP, SEX;
includes the highest income for all combinations of AGEGRP and SEX
in the output.

Examples
¤ PROC TABULATE;
CLASS REGION SEX MARSTAT;
TABLE (SEX ALL MARSTAT ALL),REGION;
RUN;
¤ PROC TABULATE FORMAT=6.2;
CLASS REGION SEX MARSTAT;
TABLE (SEX*PCTN MARSTAT*PCTN),REGION;
RUN;
¤ PROC TABULATE;
CLASS REGION;
VAR AGE INCOME;
TABLE (AGE*MEAN INCOME*MEAN), REGION;
RUN;
¤ PROC TABULATE;
CLASS SEX MARSTAT;
VAR AGE;
TABLE (AGE*MEAN), MARSTAT, SEX;
RUN;
This last example produces the following output:

MEAN OF AGE
+-----------------+-------------------------+
| | SEX |
| |-------------------------|
| | 1 | 2 |
|-----------------+------------+------------|
|MARSTAT | | |
|-----------------| | |
|1 | 24.33| 26.33|
|-----------------+------------+------------|
|2 | 37.17| 41.00|
|-----------------+------------+------------|
|3 | .| 42.00|
|-----------------+------------+------------|
|4 | 59.50| 75.00|
+-------------------------------------------+

Student’s T-Test
The TTEST procedure tests whether two groups have the same mean value for
a particular variable. The t-test was devised by an author who wrote
under the pseudonym Student. Note that the other use for Student’s
t-test, the comparison of the means of two variables (known as a paired
t-test), must be done in a different way (see page 23 for details).
The PROC TTEST statement has the form:
PROC TTEST options;
As usual the option DATA= specifies a dataset other than the one most
recently created.

The CLASS statement
The CLASS statement specifies the variable identifying the groups to be
compared. Since the procedure can only deal with two groups, the
variable must have only two values. You must specify a CLASS
statement.

The VAR statement
The VAR statement specifies the variables on which the test is to be
carried out. If you specify more than one variable a t-test is performed
on each. If this statement is omitted a t-test is performed on all the
numeric variables in the dataset except the one specified in the CLASS
statement.

Example
Suppose you are comparing the crop obtained from tomato plants, some
of which have been treated with a fertiliser. The yield of tomatoes is in a
variable CROP; the variable FERTIL contains 1 if no fertiliser was used
and 2 if it was. To perform a t-test on the two groups:
DATA TOMS;
INPUT CROP FERTIL;
CARDS;
12.3 1
11.6 1
...
...
14.5 2
RUN;
PROC TTEST;
CLASS FERTIL;
VAR CROP;
RUN;
SAS gives the mean and other information for each group as well as the t
value, the degrees of freedom, the significance assuming unequal
variances, and the significance assuming equal variances. In each case
the test is a two-sided one. Following the table in which these values
appear is the result of an F test on the equality of the variances. The
output from the above statements is:

TTEST PROCEDURE
VARIABLE: CROP
FERTIL   N          MEAN      STD DEV    STD ERROR   MINIMUM   MAXIMUM
1        6   12.35000000   0.63482281   0.25916533     10.11     14.20
2        5   14.36000000   0.53665631   0.24000000     11.22     15.31

VARIANCES         T      DF   PROB>|T|
UNEQUAL     -5.6905     9.0     0.0003
EQUAL       -5.5957     9.0     0.0003

FOR H0: VARIANCES ARE EQUAL, F’= 1.40   DF=(5,4)   PROB > F’= 0.7669

Five plants were treated with fertiliser out of the eleven used. The means
are significantly different whether equal or unequal variances are
assumed. The F test shows that the variances are not significantly
different.
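To see where the T and F’ figures in this output come from, the textbook two-sample formulas can be applied to the printed group summaries in a DATA step. This is only an illustrative sketch: the variable names are made up, and the formulas are the standard pooled-variance, separate-variance and folded F calculations rather than anything taken from the SAS documentation.

```sas
DATA _NULL_;
  /* Summary figures copied from the TTEST output above */
  N1 = 6;  M1 = 12.35;  S1 = 0.63482281;   /* FERTIL=1 */
  N2 = 5;  M2 = 14.36;  S2 = 0.53665631;   /* FERTIL=2 */
  /* Pooled variance, used when equal variances are assumed */
  SP2 = ((N1-1)*S1**2 + (N2-1)*S2**2) / (N1 + N2 - 2);
  T_EQUAL = (M1 - M2) / SQRT(SP2 * (1/N1 + 1/N2));
  /* Separate variances, used when they are not assumed equal */
  T_UNEQ = (M1 - M2) / SQRT(S1**2/N1 + S2**2/N2);
  /* Folded F statistic: larger variance divided by smaller */
  F = MAX(S1, S2)**2 / MIN(S1, S2)**2;
  PUT T_EQUAL= T_UNEQ= F=;
RUN;
```

The values written to the log should agree, to the precision printed, with the EQUAL and UNEQUAL rows and the F’ value in the output above.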

Paired T-Test
The TTEST procedure cannot test for two variables having the same mean
(a paired t-test). However, this test can be done using the MEANS
procedure, which can test whether the mean of a variable is zero. One
variable is subtracted from the other and the mean of the difference is
tested to see whether it is zero. If it is not significantly different
from zero, the two variables do not have significantly different means.
The following statements illustrate the procedure:
DATA TS;
INPUT TEST1 TEST2;
DIFF=TEST2 - TEST1;
CARDS;
34 45
36 44
...
...
57 62
RUN;
PROC MEANS MEAN T PRT;
VAR DIFF;
RUN;
The options MEAN, T and PRT print the mean, the t-test value for the
test of the mean being zero, and the corresponding probability. The
output is:

Analysis Variable : DIFF

N Obs          MEAN          T   PROB>|T|
-----------------------------------------
   20    6.85714286       8.45     0.0001
-----------------------------------------

This shows that the variables have means which are significantly
different.
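The method above reflects the SAS release this guide describes. Later releases of SAS added a PAIRED statement to the TTEST procedure; whether it is available is an assumption about your installation. If it is, a sketch of the same test on the dataset TS would be:

```sas
/* Assumes a SAS release whose TTEST procedure supports PAIRED. */
PROC TTEST DATA=TS;
  PAIRED TEST2*TEST1;   /* tests whether the mean of TEST2 - TEST1 is zero */
RUN;
```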

Drawing simple diagrams with SAS


Using the CHART and PLOT procedures, SAS can draw pictures on the screen
which can then be printed on a printer. The procedures
GCHART, GPLOT, GMAP, etc produce higher-quality pictures but
require special facilities in order to produce a copy on paper. They are
described on page 28.

Bar charts - the CHART procedure
The CHART procedure draws vertical or horizontal bar charts (histograms),
pie diagrams, block charts and star charts. They all give a visual
appreciation of your data which may help you understand it better. Only
the method of producing histograms and pie charts is described here. The
PROC CHART statement has the form:
PROC CHART options;
The options include DATA= to specify the dataset to be used if you do
not wish to use the most recently created dataset. Example statements
are:
PROC CHART;
PROC CHART DATA=OLDSTATS;
Any number of charts may be requested within the same CHART
procedure (see the examples on page 25).

The VBAR statement
To produce a vertical bar chart (a histogram with the bars drawn
vertically) specify the variables to be used with a VBAR statement,
which has the form:
VBAR variablelist / options;
If no options are specified the / is omitted. The options available include
DISCRETE, MIDPOINTS, SUMVAR, TYPE, MISSING and
NOZEROS.
DISCRETE draws a bar for each value of the variable. If
you do not use this option the range of values is
divided into groups by automatic choice of
midpoints or by your own choice (see
MIDPOINTS below) and a bar is drawn for each
sub-range.
MIDPOINTS=values specifies the midpoints of the ranges into which
the distribution is to be split. For example
MIDPOINTS=10 50 100 produces three bars: one for
values below 30, one for values from 30 up to 75
and one for values over 75 (30 and 75 being the
boundaries halfway between these midpoints). If
the MIDPOINTS option is not specified, SAS splits
the range of values into a number of intervals.
SUMVAR=variable means that the bars represent the sum of the
variable ‘variable’ for cases with that value of
the VBAR variable. For example, VBAR CITY/
SUMVAR=INCOME; gives a chart with bars
representing the total income for each city in the
data. If TYPE=MEAN is also specified the mean
value of ‘variable’ is used instead of the sum.
TYPE=type specifies what the bars are to represent and has
several different choices. The one of most
interest is MEAN when it is used with
SUMVAR.
NOZEROS omits entries for empty categories, avoiding
gaps in the chart.
MISSING treats missing values as a valid category and
draws a bar for them.
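The boundaries quoted for MIDPOINTS=10 50 100 lie halfway between neighbouring midpoints. As a small worked check of that arithmetic (a sketch only; the array and variable names are invented for illustration):

```sas
DATA _NULL_;
  ARRAY MID{3} _TEMPORARY_ (10 50 100);   /* midpoints from the text */
  DO I = 1 TO 2;
    BOUNDARY = (MID{I} + MID{I+1}) / 2;   /* halfway between neighbours */
    PUT BOUNDARY=;                        /* writes BOUNDARY=30, then BOUNDARY=75 */
  END;
RUN;
```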

Examples of VBAR statements are:
VBAR MARSTAT/MISSING NOZEROS;
VBAR INCOME/MIDPOINTS=2500 5000 7500 10000 15000 20000;
VBAR GNP EXPORTS IMPORTS;
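The SUMVAR= and TYPE= options described above are easiest to compare side by side. The following sketch assumes a hypothetical dataset TOWNS, with one case per person and the variables CITY and INCOME:

```sas
/* TOWNS, CITY and INCOME are hypothetical names used for illustration. */
PROC CHART DATA=TOWNS;
  VBAR CITY / SUMVAR=INCOME;             /* bars show total income per city */
  VBAR CITY / SUMVAR=INCOME TYPE=MEAN;   /* bars show mean income per city  */
RUN;
```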

The HBAR statement
The HBAR statement is just like the VBAR statement except that it
produces histograms with the bars horizontal rather than vertical. The
options DISCRETE, MISSING, MIDPOINTS, SUMVAR, TYPE and
NOZEROS all apply as described above. Examples are:
HBAR HEIGHT WEIGHT;
HBAR SEX/SUMVAR=ACCIDENT;

The PIE statement
The PIE statement draws a pie chart which illustrates the relative
frequency of values by presenting them as slices of a cake or pie. The
statement format is:
PIE variablelist / options;
If no options are requested the / is omitted. The options DISCRETE,
MIDPOINTS, SUMVAR, TYPE and MISSING all apply (as described
on page 24) but NOZEROS does not. An example is:
PIE DEPT AREA REGION;
If neither DISCRETE nor MIDPOINTS are specified the pie has three
slices.

Examples
¤ PROC CHART;
VBAR NOOFSONS / MIDPOINTS= 1 2 3 4;
PIE MARSTAT / MISSING DISCRETE;
RUN;
¤ PROC CHART DATA=SALES;
HBAR MONTH/SUMVAR=VALUE DISCRETE;
HBAR AGENCY REGION;
RUN;
¤ The following statements produce a horizontal bar chart for SEX and
a vertical bar chart for MARSTAT:
PROC CHART;
HBAR SEX/DISCRETE;
VBAR MARSTAT/DISCRETE;
RUN;
The output is:

FREQUENCY BAR CHART
SEX                                            FREQ   CUM.   PERCENT     CUM.
                                                      FREQ            PERCENT
   |
1  |****************************************     10     10     50.00    50.00
   |
2  |****************************************     10     20     50.00   100.00
   ----------+---------+---------+---------+
            2.5        5        7.5       10
                   FREQUENCY

FREQUENCY BAR CHART
FREQUENCY
10 +   *****
   |   *****
 8 +   *****
   |   *****
 6 +   *****
   |   *****      *****
 4 +   *****      *****
   |   *****      *****      *****
 2 +   *****      *****      *****
   |   *****      *****      *****
   +------------------------------------
         1          2          3
                 MARSTAT

Scattergrams - the PLOT procedure
The PLOT procedure produces a plot of one variable against another. Such
diagrams are known as scatterplots, scattergrams or scatter diagrams, as
they show the scatter of the cases in the sample. The PROC PLOT statement
has the form:
PROC PLOT options;
The most important option is DATA= which specifies a dataset other
than the one most recently created.

The PLOT statement
The PLOT statement specifies which variables are to be plotted against
each other. Its format is:
PLOT plotrequests / options;
If no options are specified the / is omitted.
A plot request can have several parts. The simplest form is ‘var*var’, for
example AGE*EXAM meaning a plot with AGE on the vertical axis and
EXAM on the horizontal axis.

A point is marked by a letter, which shows how many cases lie on that
point (to within the accuracy of the plot and given the size of the
character indicating the point). A indicates one case, B means two cases,
up to Z which indicates 26 or more cases at that point.
To specify a symbol to mark the points instead of letters, use
var*var='symbol'. For example:
Y*X='.'
causes each point to be marked by a full stop regardless of how many
cases are represented there.
You can produce a sort of three-dimensional plot by marking each point
with the values of another variable, by specifying var*var=var. For
example:
CONTENT*INCOME=MARSTAT
prints the value of MARSTAT (which may be numeric or character) at
each point on the plot of CONTENT by INCOME. If more than one
case is mapped to the same point the value of the first case is used. Note
that only the first character is used from the value of the variable, and so
if the values of MARSTAT were SINGLE, MARRIED and
SEPARATED, you could not distinguish between SINGLE and
SEPARATED.
Several plots can be specified in the same PLOT statement, for example:
PLOT AGE*INCOME HEIGHT*WEIGHT=SEX;
A useful option is OVERLAY, which causes several plots to be
produced on the same axes so that they can be compared, for example:
PLOT POP71*CITY='7' POP81*CITY='8' / OVERLAY;

Examples
¤ PROC PLOT;
PLOT MATHS*ENGLISH;
RUN;
¤ PROC PLOT DATA=SUNSPOTS;
PLOT NUM*MONTH='*';
PLOT NUM*MONTH='+' UHFPROBS*MONTH='-'
/OVERLAY;
RUN;
¤ PROC PLOT;
PLOT INCOME*AGE=SEX;
RUN;
gives the following plot of AGE and INCOME using a symbol for
SEX.

Plot of INCOME*AGE     Symbol is value of SEX
NOTE: 1 obs hidden
[Printer plot: INCOME on the vertical axis (0 to 10000) against AGE on the
horizontal axis (18 to 39), each point marked 1 or 2 according to the
value of SEX.]

The message ‘1 obs hidden’ means that two points coincided (to within
the accuracy of the graph). Normally this would be indicated by using a
different symbol, but here the symbol is the value of SEX.

High-quality graphics
As well as the simple graphics produced by CHART and PLOT, SAS can
draw high-quality graphics with smooth lines, colours and shading.
These facilities are documented in the ‘SAS/GRAPH Guide’. See
Reference A or B for producing graphics on output devices. This section
describes some of the procedures.

Bar charts - the GCHART procedure
The GCHART procedure draws the same sort of pictures as CHART but on
graphics devices. Many more options are available because of the extra
facilities on a graphics device, and so there are more parts to the
PROC step. The PROC GCHART statement has the form:
PROC GCHART DATA=dataset GOUT=dataset;
where the most recently created dataset will be used if DATA= is
omitted. The GOUT= option is also optional, and is used to save the
graphical output as a dataset which can be redrawn later by the
GREPLAY procedure.

The VBAR statement
The VBAR statement specifies the variables for which a vertical bar chart
is to be drawn. The general form is:
VBAR variablelist / options;
If no options are requested the / is omitted. The options include
DISCRETE, MIDPOINTS, SUMVAR, TYPE, MISSING, NOZEROS,
CAXIS and CTEXT.
DISCRETE draws a bar for each value of the variable. If
you do not use this option the range of values is
divided into groups by automatic choice of
midpoints or by your own choice (see
MIDPOINTS below) and a bar is drawn for each
sub-range.
MIDPOINTS=values specifies the midpoints of the ranges into which
the distribution is to be split. For example
MIDPOINTS=10 50 100 produces three bars: one for
values below 30, one for values from 30 up to 75
and one for values over 75 (30 and 75 being the
boundaries halfway between these midpoints). If
the MIDPOINTS option is not specified, SAS splits
the range of values into a number of intervals.
SUBGROUP=svar divides each bar into sections to show the
distribution of ‘svar’ within each category of the
variable named on the VBAR statement.
GROUP=gvar draws several bars for each value of the variable
mentioned in the VBAR statement, one for each
value of ‘gvar’.
SUMVAR=variable means that the bars represent the sum of the
variable ‘variable’ for cases with that value of
the VBAR variable. For example, VBAR CITY/
SUMVAR=INCOME; gives a chart with bars
representing the total income for each city in the
data. If TYPE=MEAN is also specified the mean
value of ‘variable’ is used instead of the sum.
TYPE=type specifies what the bars are to represent and has
several different choices. The one of most
interest is MEAN when it is used with
SUMVAR.
NOZEROS omits entries for empty categories, avoiding
gaps in the chart.
MISSING treats missing values as a valid category and
draws a bar for them.
CAXIS=colour draws the axis in the specified colour.
CTEXT=colour draws the text on the chart in the specified
colour.
Examples of VBAR statements are:
VBAR MARSTAT/MISSING NOZEROS SUBGROUP=SEX;
VBAR INCOME/MIDPOINTS=2500 5000 7500 10000 15000
GROUP=AGE;

The HBAR statement
The HBAR statement is just like the VBAR statement except that it
produces histograms with the bars horizontal rather than vertical. The
options DISCRETE, MISSING, MIDPOINTS, SUMVAR, TYPE,
GROUP, SUBGROUP, NOZEROS, CTEXT and CAXIS all apply as
described above. Examples are:
HBAR HEIGHT WEIGHT/GROUP=MARSTAT;
HBAR SEX/SUMVAR=ACCIDENT SUBGROUP=AGE;

The PATTERN statement
With HBAR and VBAR you can specify how the bars are to be coloured or
shaded using PATTERN statements. If SUBGROUP= is specified in a HBAR or
VBAR statement you may wish to specify the patterns of shading to be used
to distinguish the subgroups. Normally each bar is shaded by
cross-hatching, and if it is necessary to distinguish groups each is
coloured using the colours in order.
Suppose you have two groups and want the first to be a solid blue bar
and the second to be black (the background colour). Specify:
PATTERN1 V=S C=BLUE;
PATTERN2 V=E;
where V=S means use solid colour, V=E means empty, and C= specifies
the colour to be used. Other types of pattern which involve hatching of
various kinds are described and illustrated in the appropriate
SAS/GRAPH Guide.
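Putting the PATTERN statements together with SUBGROUP= gives a complete step. The sketch below assumes a hypothetical dataset SURVEY in which SEX takes two values, so that two PATTERN definitions are enough:

```sas
/* SURVEY is a hypothetical dataset; SEX is assumed to take two values. */
PATTERN1 V=S C=BLUE;    /* first subgroup: solid blue */
PATTERN2 V=E;           /* second subgroup: empty     */
PROC GCHART DATA=SURVEY;
  VBAR MARSTAT / DISCRETE SUBGROUP=SEX;   /* each bar split by SEX */
RUN;
```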

The PIE statement
The PIE statement draws a pie chart which illustrates the relative
frequency of values by presenting them as slices of a cake or pie. The
statement format is:
PIE variablelist / options;
If no options are requested the / is omitted. The options DISCRETE,
MIDPOINTS, SUMVAR, TYPE, MISSING and CTEXT all apply (as described on
page 29) but SUBGROUP, GROUP, NOZEROS and CAXIS do not. If you do not
specify DISCRETE or MIDPOINTS then SAS uses its own method to divide the
data. An example is:
PIE DEPT AREA REGION;
The FILL option is used only with the PIE statement. It may be set to X
or SOLID:
FILL=SOLID means the sectors of the pie-chart are to be filled
in with solid colour. SAS calculates how many
colours are needed and takes them in order from
the colours available on the graphics device. If
only one colour is available a uniformly shaded
disk is drawn.
FILL=X means the sectors are to be filled in by
cross-hatching. If colours are available the
slices will be coloured.

Examples
¤ PROC GCHART;
VBAR NOOFSONS / MIDPOINTS= 1 2 3 4;
PIE MARSTAT / MISSING DISCRETE FILL=X;
RUN;

¤ PROC GCHART DATA=SALES;
HBAR MONTH/SUMVAR=VALUE DISCRETE;
HBAR AGENCY REGION;
RUN;

Scattergrams - the GPLOT procedure
The GPLOT procedure plots scatter diagrams with a choice of patterns,
filling, and plot symbols, and the option of fitted regression lines of
various types. Only some of the facilities are described here. The
general form of the PROC GPLOT statement is:
PROC GPLOT DATA=dataset GOUT=dataset UNIFORM;
If DATA= is omitted the most recently created dataset is used. If
GOUT= is specified then the picture is saved and may be redrawn by
GREPLAY. The UNIFORM option may be useful if you are using the
BY statement to plot pictures for several subgroups, because it forces the
use of the same scale for all the plots, so that comparisons may be made.

The PLOT statement
The PLOT statement specifies which variables are to be plotted against
each other. Its format is:
PLOT plotrequests / options;
If no options are specified the / is omitted.
A plot request can have several parts. The simplest form is just var*var,
for example AGE*EXAM, meaning a plot with AGE on the vertical axis
and EXAM on the horizontal axis.
A point is marked by a plus sign. If several cases are plotted at the same
point (not necessarily identical but the same to within the accuracy of the
plot) it appears that cases are missing because there are fewer plus signs
on the plot than there are cases.
A SYMBOL statement (see below) can be used to specify another symbol
to mark the points instead of a plus sign.
You can produce a sort of three-dimensional plot by marking each point
with the values of another variable by specifying var*var=var. For
example:
CONTENT*INCOME=MARSTAT
prints the value of MARSTAT (which may be numeric or character) at
each point on the plot of CONTENT by INCOME. If more than one
case is mapped to the same point the value of the first case is used. Note
that only the first character is used from the value of the variable, and so
if the values of MARSTAT were SINGLE, MARRIED and
SEPARATED, you could not distinguish between SINGLE and
SEPARATED.
Several plots can be specified in the same PLOT statement, for example:
PLOT AGE*INCOME HEIGHT*WEIGHT=SEX;
A useful option is OVERLAY, which causes several plots to be
produced on the same axes so that they can be compared, for example:
PLOT POP71*CITY='7' POP81*CITY='8' / OVERLAY;
You may also specify CAXIS=colour to specify the colour of the axis,
and CTEXT=colour to specify the colour of the text.

In order to overlay plots where the vertical scales are very different you
can use PLOT and PLOT2. The horizontal scales must be the same but
the right-hand vertical scale is for the variable specified for PLOT2. For
example:
PLOT HEIGHT*AGE;
PLOT2 WEIGHT*AGE;

The SYMBOL statement
The SYMBOL statement defines the symbols to be used in the plot and
specifies whether any regression fitting is to be carried out. A different
SYMBOL statement can be included for each plot in the GPLOT procedure.
The SYMBOL statement has several optional parts, as described below. To
specify the colour of the symbol use C=colour, for example C=BLUE.
To specify the symbol use V=symbol, for example V=1. The letters A to
W and the digits 0 to 9 may be used as symbols. There are also special
symbols which are represented by such characters as * or <. A table
showing how they appear is given in the SAS/GRAPH Guide. If no
symbols are to be drawn for the points use V=NONE.
To specify the type of line, use the ‘L=number’ option. If L=1
a solid line is used. There are also 31 types of dotted or dashed lines,
which are shown in the SAS/GRAPH Guide.
The interpolation facilities to draw lines connecting the points include
the following:
I=JOIN connects the points by straight lines.
I=SPLINE uses a cubic spline method to fit a smooth line to the
points.
I=SMxx is used when the data is widely spread, so that a normal
cubic spline would look very jagged. It fits a smooth
curve through the points but the points may not all appear
on the line (as is the case with a normal spline curve). The
value ‘xx’ determines how closely the curve is to be fitted
to the points. A value of 1 makes it follow the points quite
closely, while a value of 99 produces a smooth curve
which may miss many of the points.
I=Rxxxxxxx is used when a regression line is to be fitted to the data.
The characters which follow the R are:
L linear regression is to be used
Q quadratic regression is to be used
C cubic regression is to be used
0 the regression line is forced through the origin.
If you want a constant term in the regression
omit this character. If it appears it is the
second character after the R.

CLMnn draws lines representing the confidence
limits on the regression for the mean predicted
values where the confidence limits (nn) may be
at 90%, 95% or 99%, for example CLM95.
These characters follow the type of regression
and the optional constant term. The style of line
used for the confidence lines is determined by
adding 1 to the line style used for the main
plot. For example if the line is drawn with line
style 2 (small dashes) the confidence lines are
drawn with style 3 (medium dashes).
CLInn draws lines representing the confidence
limits on the regression for the individual values,
where the confidence limits (nn) may be at 90%,
95% or 99%, for example CLI90. Other details
are as for CLMnn above.
Examples of complete specifications for interpolation are
I=RQ, I=RL0, I=RLCLI90 and I=RC0CLM95.

Examples
¤ PROC GPLOT;
PLOT TESTA*TESTB;
RUN;
¤ PROC GPLOT DATA=NWEST;
PLOT POLLUT1*TOWN INCID*TOWN/OVERLAY;
SYMBOL1 V=NONE L=2 C=RED I=SPLINE;
SYMBOL2 V=NONE L=2 C=BLUE I=SPLINE;
RUN;
¤ PROC GPLOT;
PLOT DOSE*DAYS;
SYMBOL V=* L=1 I=RLCLM95;
RUN;

Maps - the GMAP procedure
The GMAP procedure produces maps illustrating the values of variables for
the areas on the map. If the map dataset already exists you need to know
how the areas are identified. SAS provide maps of the United States and
Canada, which are described in the SAS/GRAPH Guide.
However, these are not installed automatically so you must check to see
if they are installed on the machine you are using. Maps of the counties
of the United Kingdom and Ireland are also available. If you do not
already have a map in a suitable form then you may need help in
converting your map into a dataset. Contact the Computing Service
Advisory Service for assistance.
The PROC GMAP statement specifies the map dataset as well as the
response dataset containing the data to be shown on the map, for
example:
PROC GMAP MAP=SASUSER.MERSEY DATA=POPLAN;
If no DATA= option appears the most recently created dataset is used.
The ALL option specifies that all areas in the map are to be drawn even
if there is no value for that area. Normally only areas for which data
exists are drawn, and the map is scaled to fill the space available.

Four types of map can be drawn: choropleth, surface, block and prism.
Examples of each are given in the SAS/GRAPH Guide. Only choropleth
maps are described below.

The CHORO statement
The CHORO statement specifies that a choropleth map is to be drawn, and
gives the name of the variable to be used. A choropleth map is one where
the areas of the map are shaded or coloured to indicate the value of the
variable for each area. The form of the CHORO statement is:
CHORO variablelist / options;
A choropleth map is drawn for each variable specified. If no options are
specified the / is omitted. Options available with the statement include
DISCRETE, LEVELS and MIDPOINTS.
DISCRETE means the data is a set of discrete values rather
than a continuous variable. Each value is
represented separately, unless you have a very
large number of values (or have also specified
LEVELS or MIDPOINTS).
LEVELS=n means SAS is to divide the data into n groups
of the same size, and shade the map accordingly.
MIDPOINTS=list means the data is to be divided at the values
specified. You do not have to list every value,
for example:
MIDPOINTS = 10 TO 100 BY 10

The ID statement

The ID statement specifies the variable which ties together the areas and
the values of the response variable. The variable must have the same
name in the response dataset as in the map dataset. The form is:
ID variable;
For example:
ID CNUM;
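
Putting these statements together, a complete choropleth request might look like the sketch below. The map dataset, response dataset and ID variable are the ones used in the examples above; the response variable POPDENS and the choice of five levels are assumptions made purely for illustration.

```sas
/* Sketch only: POPDENS is a hypothetical variable in POPLAN */
PROC GMAP MAP=SASUSER.MERSEY DATA=POPLAN ALL;
   ID CNUM;                    /* links map areas to response values */
   CHORO POPDENS / LEVELS=5;   /* LEVELS= controls the number of shading groups */
RUN;
QUIT;
```

The ALL option is included so that areas with no response value are still drawn in outline.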

Regression - the REG procedure


Given a situation in which one or more variables seem to control the
behaviour of another (for example, blood pressure given weight, age and
activity), it is possible to build an equation which expresses the relation
numerically. This relation is only approximate in any real situation, but
you can measure how closely it fits the data and decide whether or not it
is useful for prediction.
The variable whose behaviour you are trying to explain is called the
dependent variable, and those variables being used for the explanation
(or ‘model’) are called the independent variables. In mathematical
terms there is a dependent variable Y which you are trying to predict
using values of independent variables X1, X2, . . . Xn, using the
equation:
Y = B0 + B1*X1 + B2*X2 + ... + Bn*Xn + eps

where ‘eps’ is the error involved in using such a simple model. The
procedure calculates the values of B0, B1, B2 . . . Bn, so that ‘eps’ is as
small as possible over the known values of Y, X1, X2 . . . Xn.
The values B0, B1 etc are called the parameters of the regression
equation. The term B0 is known as the constant term or the intercept
and is the value of Y when X1, X2 . . . Xn are all zero.
The values of Y which were recorded when the survey was done or the
experiment was performed are known as the observed values. The
values you would obtain by putting the values of X1, X2 ... Xn into the
regression equation are called the predicted values. The difference
between the predicted value and the observed value is called the residual
value. The square of the correlation of the observed values and the
predicted values is called the coefficient of determination or just the
r-square value. It can be regarded as the fraction of the variability of Y
explained by the equation.
The PROC REG statement has the form:
PROC REG options;
The DATA= option specifies that a dataset other than the one most
recently created is to be used. The SIMPLE option gives simple
descriptive statistics on each of the variables used in the procedure.
Examples of PROC REG statements are:
PROC REG;
PROC REG DATA=SHELLS;
PROC REG SIMPLE;
PROC REG DATA=SASUSER.SAVED SIMPLE;

The MODEL statement

The variables to be used are specified with the MODEL statement. For
example, to express the cost of producing a motor car (variable
CARCOST), given the hourly wage rate of workers on the production
line (HOURLY), the cost of steel (STEEL) and the price of electricity
(ELECTRIC):
MODEL CARCOST = HOURLY STEEL ELECTRIC;
CARCOST is the dependent variable and HOURLY, STEEL and
ELECTRIC are the independent variables.
The MODEL statement has several options. For example:
MODEL CARCOST = HOURLY STEEL ELECTRIC /
NOINT;
forces the equation to have no constant term: the intercept is set to
zero, so CARCOST is zero when all the other variables are zero. This
is not entirely realistic for this model, as there are other costs involved
in building a car. However, if all the sources of expense were included in
the model you would expect the cost of the car to be zero when all the
factors contributing to the cost were zero.
To check how well the solution fits, you can print the values of the
dependent variable along with the value the regression equation predicts,
by specifying the P option, for example:
MODEL CARCOST = HOURLY STEEL ELECTRIC / P;

The R option prints extra information indicating whether the predicted
values are significantly different from the observed values. This can be
useful for spotting unusual cases in the data or for showing a pattern in
the residuals indicating that the model has systematic errors, for example
that a linear model is not appropriate and one with squares or cubes of
values should be used instead.
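
As a sketch of that last point, a squared term can be created in a DATA step and added to the model. The dataset names COSTS and COSTS2 and the variable STEELSQ are assumptions for illustration; only CARCOST, HOURLY, STEEL and ELECTRIC come from the example above.

```sas
/* Hypothetical input dataset COSTS holding the variables of the MODEL example */
DATA COSTS2;
   SET COSTS;
   STEELSQ = STEEL**2;     /* square of the steel cost */
RUN;
PROC REG DATA=COSTS2;
   /* P prints predicted values, R adds the analysis of residuals */
   MODEL CARCOST = HOURLY STEEL STEELSQ ELECTRIC / P R;
RUN;
```

Whether the extra term is worthwhile can then be judged from its parameter estimate and from the residuals of the new model.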

The OUTPUT statement

To analyse the predicted or residual values you must write them to a SAS
dataset. Having done that you can use any of the facilities of SAS,
especially the graphical ones, to examine them. The OUTPUT statement
allows you to write these and other values to a dataset.
The OUTPUT statement must specify the name of a dataset. This can be
a permanent dataset (for example SASUSER.PREDVALS) or a
temporary one (for example PREDVALS). The information to be
written follows the dataset name. The keywords PREDICTED and
RESIDUAL (which can be abbreviated to P and R) specify that these
values are to be written and give them names. For example:
PROC REG;
MODEL Y = X Z/NOINT;
OUTPUT OUT=SAVED P=PY R=RY;
The output dataset contains all the variables from the input dataset
(whether or not they were used to calculate the regression equation)
as well as the ones specified by P or R. If the regression has multiple
dependent variables you must specify predicted and residual variable
names for each dependent variable.
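
One common use of the saved values is a plot of residuals against predicted values, where any visible pattern suggests a systematic error in the model. The following is a minimal sketch; the input dataset TRIAL is hypothetical, and the variable names follow the example above.

```sas
/* TRIAL is a hypothetical dataset containing Y, X and Z */
PROC REG DATA=TRIAL;
   MODEL Y = X Z;
   OUTPUT OUT=SAVED P=PY R=RY;   /* PY = predicted values, RY = residuals */
RUN;
PROC PLOT DATA=SAVED;
   PLOT RY*PY / VREF=0;          /* reference line at a residual of zero */
RUN;
```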

Example

The following statements look at the relation of age to income, by saving
the predicted values and plotting them to compare with the observed
values:
PROC REG;
MODEL INCOME=AGE;
OUTPUT OUT=SAVE P=PINCOME;
RUN;
PROC PLOT;
PLOT(INCOME PINCOME)*AGE/OVERLAY;
RUN;
The output is:

DEP VARIABLE: INCOME
                      ANALYSIS OF VARIANCE
                       SUM OF           MEAN
SOURCE      DF        SQUARES         SQUARE    F VALUE    PROB>F
MODEL        1    15069036.69    15069036.69      2.419    0.1363
ERROR       19      118335771     6228198.45
C TOTAL     20      133404807
    ROOT MSE     2495.636     R-SQUARE     0.1130
    DEP MEAN     6034.476     ADJ R-SQ     0.0663
    C.V.          41.3563
                      PARAMETER ESTIMATES
                  PARAMETER       STANDARD     T FOR H0:
VARIABLE    DF     ESTIMATE          ERROR   PARAMETER=0   PROB > |T|
INTERCEP     1   7958.69044     1351.63099         5.888       0.0001
AGE          1 -47.42781593    30.49099533        -1.555       0.1363

The probability for the test that the coefficient for AGE is zero
(0.1363) is well above the usual 0.05 level, which shows that the
variable is not a very good predictor of INCOME. The plot also shows
that the fit is a very poor one:

[Line-printer plot: PLOT OF INCOME*AGE (SYMBOL USED IS O) overlaid with
PLOT OF PINCOME*AGE (SYMBOL USED IS P). Vertical axis: PREDICTED VALUE,
1000 to 10000. Horizontal axis: AGE, 19 to 75.
NOTE: 7 OBS HAD MISSING VALUES 3 OBS HIDDEN]
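
The R-SQUARE figure and the fitted line can be checked by hand from the printed output. The DATA step below is a sketch that simply redoes the arithmetic with the values shown in the tables above; the age of 40 is an arbitrary value chosen for illustration.

```sas
DATA _NULL_;
   /* R-square = model sum of squares / corrected total sum of squares */
   RSQ = 15069036.69 / 133404807;
   /* predicted income at AGE=40 from the parameter estimates */
   PRED40 = 7958.69044 + (-47.42781593) * 40;
   PUT RSQ= 6.4 PRED40= 8.2;   /* results are written to the SAS log */
RUN;
```

The log should show an R-square of about 0.113, in agreement with the table, and a predicted income of roughly 6060 at age 40.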

Analysis of variance - the ANOVA procedure

The ANOVA procedure is restricted to analysing balanced designs, that is
those experiments where there are the same number of replicate
observations for each combination of factors. If your data does not
satisfy this condition see page 44 for details of a general linear model.
ANOVA can deal with one or many response variables and so can do
multivariate analysis of variance.
As usual you may specify the dataset to be used with the DATA= option,
for example:
PROC ANOVA DATA=GRASSES;

The CLASS statement

The factors in the design are declared using the CLASS statement, for
example:
CLASS STRAIN HERBCIDE;
You must have a CLASS statement and it must precede the MODEL
statement described below.

The MODEL statement

The MODEL statement specifies the dependent variable (sometimes
called the response variable) and how it is thought to be related to the
independent variables (the factors). You can specify several dependent
variables, in which case SAS treats them together in a multivariate
analysis. The specification of the model is more complex than with the
REG procedure as you can include interaction effects between variables
as well as the variables themselves.
Suppose you have a dependent variable Y with factors A, B and C. To
fit only the factors with no interactions type:
MODEL Y = A B C;
To allow an interaction term between B and C, use:
MODEL Y = A B C B*C;
To include all possible interactions, type:
MODEL Y = A B C B*C A*B A*C A*B*C;
Since this is such a common model, SAS allows you to write this in the
shorter form:
MODEL Y = A|B|C;
You can specify a mixture of these, for example
MODEL Y = A B|C|D;
where only the main effect of A is used but the full interactions of B, C
and D are required.
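
For reference, the bar notation in this last example expands as shown in the sketch below, so the two MODEL statements specify the same model. Only one MODEL statement would be used in practice.

```sas
/* Short form */
MODEL Y = A B|C|D;
/* Equivalent long form: main effect of A, then B, C and D fully crossed */
MODEL Y = A B C B*C D B*D C*D B*C*D;
```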
If a factor B is nested within another factor A, type:
MODEL Y = A B(A);

This occurs when not all values of B are observed for each value of A,
and so you do not have a ‘crossed’ model. For example, if you were
comparing teaching methods in different schools then the teachers would
only teach in one school, and so any teacher effect would be nested
within the school effect.
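
The teaching example might be coded as in the following sketch. The dataset TEACHING and the variables SCHOOL, TEACHER and SCORE are hypothetical names, and the design is assumed to be balanced, as the ANOVA procedure requires.

```sas
/* Hypothetical dataset TEACHING with variables SCHOOL, TEACHER and SCORE */
PROC ANOVA DATA=TEACHING;
   CLASS SCHOOL TEACHER;
   MODEL SCORE = SCHOOL TEACHER(SCHOOL);   /* teachers nested within schools */
RUN;
```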

The MEANS statement

Having established that not all groups have the same mean, you might
like to know which groups are different from other groups. This can be
done using the MEANS statement. Suppose the MODEL is:
MODEL CROP = VARIETY FIELD VARIETY*FIELD;
To look at the effect of VARIETY in more detail type:
MEANS VARIETY;
The mean and standard deviation for each value of VARIETY are shown.
You can also specify various tests to investigate whether these means are
significantly different. These include the Scheffe test, Duncan’s test,
Tukey’s test, and Least Significant Difference (LSD). To request a
Scheffe multiple comparison test on VARIETY, type:
MEANS VARIETY/SCHEFFE;
This will then show in detail how the group means differ.
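
Several MEANS statements, each with its own test, can follow one MODEL statement. The sketch below assumes a hypothetical dataset TRIALS holding the CROP, VARIETY and FIELD variables of the example above.

```sas
PROC ANOVA DATA=TRIALS;      /* TRIALS is a hypothetical dataset name */
   CLASS VARIETY FIELD;
   MODEL CROP = VARIETY FIELD VARIETY*FIELD;
   MEANS VARIETY / TUKEY;    /* Tukey's test on the variety means */
   MEANS FIELD / DUNCAN;     /* Duncan's test on the field means */
RUN;
```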

Example

Suppose a biochemist is interested in the effect of a new herbicide on the
mortality of plants. Fifty plants were placed in each of twelve pots
containing nutrient solution. After ten days growth, three of the pots
were sprayed with herbicide and three were left as controls. After a
further ten days three more pots were sprayed and the remaining three
designated as controls. Thus two factors were considered - herbicide
treatment and age of plant - each treatment combination being replicated
three times.
The analysis produces an analysis of variance table assessing the
significance of the herbicide spraying, the age of plants and their
interaction. The experimenter was also interested in calculating least
significant differences for comparison of main effect means. The
following statements enter the data and carry out the analysis:

DATA HERB;
/* DATA INPUT SPECIFYING EACH FACTOR LEVEL
EXPLICITLY*/
INPUT AGE HERBICID SURVIVOR ;
CARDS;
1 1 20
1 1 18
1 1 23
1 2 11
1 2 12
1 2 15
2 1 40
2 1 43
2 1 39
2 2 35
2 2 37
2 2 32
PROC ANOVA;
CLASS AGE HERBICID;
MODEL SURVIVOR = AGE HERBICID AGE*HERBICID;
MEANS AGE HERBICID AGE*HERBICID / LSD;
RUN;
This produces the following output:

ANALYSIS OF VARIANCE PROCEDURE
CLASS LEVEL INFORMATION
CLASS        LEVELS    VALUES
AGE               2    1 2
HERBICID          2    1 2
NUMBER OF OBSERVATIONS IN DATA SET = 12

ANALYSIS OF VARIANCE PROCEDURE
DEPENDENT VARIABLE: SURVIVOR
SOURCE             DF   SUM OF SQUARES    MEAN SQUARE   F VALUE   PR > F
MODEL               3    1486.25000000   495.41666667     92.89   0.0001
ERROR               8      42.66666667     5.33333333
CORRECTED TOTAL    11    1528.91666667
R-SQUARE        C.V.      ROOT MSE    SURVIVOR MEAN
0.972094      8.5270    2.30940108      27.08333333
SOURCE             DF         ANOVA SS   F VALUE   PR > F
AGE                 1    1344.08333333    252.02   0.0001
HERBICID            1     140.08333333     26.27   0.0009
AGE*HERBICID        1       2.08333333      0.39   0.5494

ANALYSIS OF VARIANCE PROCEDURE
T TESTS (LSD) FOR VARIABLE: SURVIVOR
NOTE: THIS TEST CONTROLS THE TYPE I COMPARISONWISE ERROR RATE,
      NOT THE EXPERIMENTWISE ERROR RATE
ALPHA=0.05 DF=8 MSE=5.33333
CRITICAL VALUE OF T=2.30600
LEAST SIGNIFICANT DIFFERENCE=3.0747
MEANS WITH THE SAME LETTER ARE NOT SIGNIFICANTLY DIFFERENT.
T GROUPING      MEAN    N    AGE
         A    37.667    6    2
         B    16.500    6    1

ANALYSIS OF VARIANCE PROCEDURE
T TESTS (LSD) FOR VARIABLE: SURVIVOR
NOTE: THIS TEST CONTROLS THE TYPE I COMPARISONWISE ERROR RATE,
      NOT THE EXPERIMENTWISE ERROR RATE
ALPHA=0.05 DF=8 MSE=5.33333
CRITICAL VALUE OF T=2.30600
LEAST SIGNIFICANT DIFFERENCE=3.0747
MEANS WITH THE SAME LETTER ARE NOT SIGNIFICANTLY DIFFERENT.
T GROUPING      MEAN    N    HERBICID
         A    30.500    6    1
         B    23.667    6    2

ANALYSIS OF VARIANCE PROCEDURE
MEANS
AGE    HERBICID    N      SURVIVOR
  1           1    3    20.3333333
  1           2    3    12.6666667
  2           1    3    40.6666667
  2           2    3    34.6666667
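
The least significant difference printed above can be reproduced from the other figures in the output: it is the critical value of t multiplied by the standard error of the difference between two means, each based on six observations. The DATA step below is a sketch of that arithmetic; it assumes the TINV function is available in your release of SAS.

```sas
DATA _NULL_;
   MSE = 5.33333;               /* error mean square, on 8 degrees of freedom */
   N = 6;                       /* observations behind each main effect mean */
   T = TINV(0.975, 8);          /* two-sided critical value for ALPHA=0.05 */
   LSD = T * SQRT(2 * MSE / N); /* t times the standard error of a difference */
   PUT T= 8.5 LSD= 8.4;         /* results are written to the SAS log */
RUN;
```

The values written to the log should agree with the CRITICAL VALUE OF T and LEAST SIGNIFICANT DIFFERENCE lines. Both pairs of main effect means differ by more than this amount, which is why each is given a different letter.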

An alternative way of inputting the data using DO loops is shown below.
It saves specifying factor levels individually. The @@ symbol is used to
specify that several cases occur on one line. Note that each case requires
an explicit OUTPUT statement.
DATA HERB;
/* DATA INPUT SPECIFYING FACTORS AND
LEVELS THROUGH LOOPS */
DO AGE = 1 TO 2;
DO HERBICID = 1 TO 2;
DO REPLICAT = 1 TO 3;
INPUT SURVIVOR @@;
OUTPUT;
END;
END;
END;
CARDS;
20 18 23 11 12 15
40 43 39 35 37 32
RUN;
PROC ANOVA;
CLASS AGE HERBICID;
MODEL SURVIVOR = AGE HERBICID AGE*HERBICID ;
MEANS AGE HERBICID AGE*HERBICID / LSD;
RUN;

General linear modelling - the GLM procedure

The GLM procedure allows analysis of the general linear model. This
views analysis of variance, regression and several other techniques as
transformations of the simple model described above for regression.
This enables it to carry out many sophisticated analyses which are not
available elsewhere in SAS. Although the specification looks very much
like the REG or ANOVA procedures, the output is quite different. A very
important aspect of GLM is that it can perform an analysis of variance on
unbalanced designs. For balanced designs ANOVA is to be preferred.
As usual you may specify the dataset to be used with the DATA= option,
for example:
PROC GLM DATA=SASUSER.GRASSES;

The CLASS statement

If variables to be used in the model are to be regarded as factors, that is
having a small number of defined categories, they should be specified in
a CLASS statement. If they do not appear in a CLASS statement SAS
assumes that a regression type model is appropriate, which is not true for
factors. If the CLASS statement is used then it must precede the MODEL
statement.

If none of the variables in the model appear in a CLASS statement
regression is being used. If all the variables in the model appear in a
CLASS statement analysis of variance is being used. If some of the
variables in the model appear in a CLASS statement analysis of
covariance is being used.
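
The three cases can be sketched as follows. The dataset TRIAL and the variables are hypothetical: Y and X are taken to be continuous, A and B to be factors.

```sas
/* Regression: no CLASS statement */
PROC GLM DATA=TRIAL;
   MODEL Y = X;
RUN;
/* Analysis of variance: every variable in the model is a CLASS variable */
PROC GLM DATA=TRIAL;
   CLASS A B;
   MODEL Y = A B A*B;
RUN;
/* Analysis of covariance: factor A together with the covariate X */
PROC GLM DATA=TRIAL;
   CLASS A;
   MODEL Y = A X;
RUN;
```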

The MODEL statement

The MODEL statement for GLM is essentially the same as for ANOVA in
the way that it describes the model. However the options which can be
specified are not identical. As with REG the NOINT option specifies
that the model is to have no constant term. Similarly the P option asks
for information to be produced on the predicted and residual values. For
example:
MODEL X3 = X1 X2 / NOINT P;
There is no R option. You can specify several dependent variables to
carry out a multivariate analysis.

The ID statement

If the P option is specified in the MODEL statement you may wish to
label the cases. The ID statement specifies the variable name to be used
to label the observations. For example:
ID NAME;

The MEANS statement

As with ANOVA you can request multiple comparisons on the group
means, for example:
MEANS MAKER/DUNCAN;
This shows in detail how the group means differ.

The RANDOM statement

The RANDOM statement specifies that a factor in the model is to be
considered as a random effects factor rather than a fixed effects factor.
For example, the number of days on which it rained during a trial period
may be a very important factor, but it must be regarded as a random
effects factor rather than a fixed effects one. If used, the RANDOM
statement must appear after the MODEL statement, for example:
CLASS A B;
MODEL Y = A B;
RANDOM B;

The OUTPUT statement

To analyse the predicted or residual values you must write them to a SAS
dataset. Having done that you can use any of the facilities of SAS,
especially the graphical ones, to examine them. The OUTPUT statement
allows you to write these values.
The OUTPUT statement must specify the name of a dataset. This can be
a permanent dataset (for example SASUSER.PREDVALS) or a
temporary one. The information to be written follows the dataset name.
The keywords PREDICTED and RESIDUAL (which can be abbreviated
to P and R) specify that these values are to be output and give them
names. For example:
PROC GLM;
MODEL Y = X Z/NOINT;
OUTPUT OUT=SAVED P=PY R=RY;

The output dataset contains all the variables from the input dataset
(whether or not they were used in the model) as well as the names
specified by P or R. If the analysis has multiple dependent variables you
must specify predicted and residual variable names for each dependent
variable.

Example

DATA MILEAGE;
INPUT MPH MPG @@;
CARDS;
20 15.4 30 20.2 40 25.7 ... 60 24.8
RUN;
PROC GLM;
MODEL MPG=MPH /P CLM;
OUTPUT OUT=PP P=MPGPRED R=RESID;
RUN;
PROC PLOT DATA=PP;
PLOT MPG*MPH='A' MPGPRED*MPH='P'/OVERLAY;
RUN;

Miscellaneous useful procedures


This section describes some further procedures which you are likely to
need but which do not form a logical group. These are the procedures
for sorting data, for declaring formats, and for listing the contents of a
dataset.

Sorting a dataset - the SORT procedure

The SORT procedure is used to sort a dataset on one or more variables.
It is necessary to have the data sorted if the BY statement is to be used on
the dataset (see page 11). The general form of the SORT statement is:
PROC SORT options;
where the options include:
DATA=dataset specifies the dataset to be sorted. If it is omitted
the most recently created dataset will be used.
OUT=dataset specifies the output dataset. If it is omitted the
input dataset is overwritten by the sorted
version.
NODUPLICATES specifies that the data is checked after it has
been sorted and any exact duplicates are
dropped from the output data set. This checking
uses all the variables in the data, not just the
ones used for the sorting.
EQUALS specifies that the original order of cases is
preserved if they have identical sort key values.
If EQUALS is not specified then the order may
be changed.

The BY statement

The BY statement specifies the variable or variables to be used as a key
or keys to order the data. For example:
BY AGEGRP;
specifies that the data is to be sorted according to the values of the
variable AGEGRP. Unless otherwise specified, the values are arranged
with the lowest values first, that is in ascending order. To have data
sorted in descending order, that is with the high values first, insert
DESCENDING before the variable name, for example:
BY DESCENDING INCOME;
If more than one variable is specified the first one mentioned is the most
important, the next the second most important etc. For example:
BY AGEGRP DESCENDING GRADE REGION;
sorts the data so that the first case has the lowest value of REGION
within the highest value of GRADE within the lowest value of
AGEGRP. The last case has the highest value of REGION within the
lowest value of GRADE within the highest value of AGEGRP.

Examples

PROC SORT;
BY SEX;
RUN;

PROC SORT DATA=NATION OUT=REGION;
BY REGION CITY;
RUN;
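
Sorting is most often a preparation for BY-group processing in a later procedure. The following sketch sorts a dataset and then prints it in subgroups; the dataset names SURVEY and BYREGION and the variables REGION and INCOME are assumptions for illustration.

```sas
/* SURVEY is a hypothetical dataset containing REGION and INCOME */
PROC SORT DATA=SURVEY OUT=BYREGION;
   BY REGION DESCENDING INCOME;   /* highest incomes first within each region */
RUN;
PROC PRINT DATA=BYREGION;
   BY REGION;                     /* valid because BYREGION is sorted by REGION */
RUN;
```

Writing the sorted data to a new dataset with OUT= leaves the original order of SURVEY untouched.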

Defining your own formats - the FORMAT procedure

The FORMAT statement described on pages 9 and 11 associates a format
with a variable. This may be a standard SAS format or one constructed
by the user. These constructed formats are defined by the FORMAT
procedure. You must use FORMAT to give labels to individual values of
a variable, using the VALUE statement.
When a format is specified in the FORMAT statement it always includes
a full stop. This is how formats are recognised. Because of this, format
names end with a full stop when they appear in a FORMAT statement but
not when they are defined in the FORMAT procedure.

The VALUE statement

To label individual values of a variable or ranges of values use the
VALUE statement. The format is given a name and then the values and
corresponding labels are declared. For example:
VALUE YESNO 1='YES'
2='NO'
3='MISSING';
Ranges of values are specified using a hyphen, for example:
VALUE NATIONS 1, 3-16, 28='WESTERN BLOC'
2, 17-21='EASTERN BLOC'
22-27, 29='NON-ALIGNED';

The keyword OTHER may be used to catch any values not explicitly
mentioned, for example:
VALUE AGEFMT 18-30='YOUNG'
31-55='MIDDLE'
56-80='OLD'
OTHER='MISSING';
The keywords LOW and HIGH may be used to specify the ends of
ranges, that is the lowest and highest values.
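
As a sketch, the age format above could be rewritten with open-ended ranges so that no valid age is left out. The format name AGEBAND is hypothetical.

```sas
PROC FORMAT;
   VALUE AGEBAND LOW-30  = 'YOUNG'
                 31-55   = 'MIDDLE'
                 56-HIGH = 'OLD'
                 OTHER   = 'MISSING';
RUN;
```

For a numeric format LOW does not include missing values, so these are still picked up by OTHER. Any value falling between the listed ranges, such as 30.5, would also be labelled by OTHER.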

Using PUT with formats

The PUT function recodes a variable in accordance with a format. For
example:
PROC FORMAT;
VALUE AGEGRP 0-18='1'
19-30='2'
31-50='3'
51-65='4'
66-100='5';
DATA SURV;
INPUT AGE SEX INCOME;
AGEGRP=PUT(AGE,AGEGRP.);
CARDS;
...
...
RUN;
The variable AGEGRP is set to 1 whenever AGE is 0 to 18, 2 whenever
AGE is 19 to 30, etc. AGEGRP is a character variable even though all
its values are digits. Note that the PUT function is quite different to the
PUT Statement (see page 8).
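
If a numeric group code is needed, the character result of PUT can be converted with the INPUT function. The sketch below is self-contained; the dataset name AGES, the variables AGECHAR and AGENUM and the three data values are made up for illustration.

```sas
PROC FORMAT;
   VALUE AGEGRP 0-18='1' 19-30='2' 31-50='3' 51-65='4' 66-100='5';
RUN;
DATA AGES;
   INPUT AGE;
   AGECHAR = PUT(AGE, AGEGRP.);    /* character variable holding '1' to '5' */
   AGENUM  = INPUT(AGECHAR, 1.);   /* the same code as a numeric variable */
CARDS;
17
25
70
;
RUN;
PROC PRINT DATA=AGES;
RUN;
```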

Example

To produce a frequency table for income with the data grouped into a
small number of categories, a suitable format is declared with VALUE
and then the FORMAT statement is used within the FREQ procedure to
assign the format to the variable:
PROC FORMAT;
VALUE INCFMT LOW-1499='<1,500'
1500-2499='<2,500'
2500-5999='<6,000'
6000-9999='<10,000'
10000-HIGH='10,000+';
PROC FREQ;
TABLES INCOME;
FORMAT INCOME INCFMT.;
RUN;
This produces a frequency table with only five categories, which are
labelled as described. Note the full stop after the format name in the
FORMAT statement.

Printing the values of variables - the PRINT procedure

The PRINT procedure prints the contents of variables, and produces
simple reports by use of the BY, PAGEBY and SUM statements. The
PROC PRINT statement has the form:
PROC PRINT options;
The options include:
DATA=dataset specifies the dataset to be used. If it is missing
the most recently created dataset is used.
N outputs the number of cases at the end of the
data.
UNIFORM specifies that the same layout is to be used for
each page. If it is not included, SAS outputs as
many variables as possible on a page, which
may result in different numbers of variables on
different pages.
DOUBLE specifies double spacing in the output.
LABEL requests that the labels for the variables (see
page 8) be used to head the columns of output
rather than the names of the variables.

The VAR statement

The VAR statement specifies the variables whose values are to be printed,
for example:
VAR AGE GRADE;
If no VAR statement is used all the variables are printed.

The ID statement

Normally the number of the observation or case in the dataset is used to
identify it in the output. The ID statement means that the values of the
specified variable are to be used instead. For example:
ID NAME;

The BY and PAGEBY statements

The BY statement specifies that the procedure is to operate on the
subgroups defined by the values of a variable. The PAGEBY statement
may be used with the BY statement to start a new page when a BY
variable changes. For example:
BY A B C;
PAGEBY A;
causes a new page to be started when the value of A changes.
BY A B C;
PAGEBY B;
causes a new page to be started whenever A or B changes, because
PAGEBY triggers a new page for the variable specified and for any
earlier variables in the BY statement list.

The SUM statement

If a variable appears in a SUM statement the total of the values is also
produced. If a BY statement has also been used totals are printed for the
subgroups (provided there is more than one case in the subgroup).

Examples

PROC PRINT;
RUN;

PROC PRINT DOUBLE DATA=SICKNESS;
VAR GRADE AGE DAYSSICK;
ID NAME;
BY DEPT;
PAGEBY DEPT;
SUM DAYSSICK;
RUN;

The following statements list the data by sex with income summed
over men and women separately:
PROC SORT;
BY SEX;
RUN;
PROC PRINT;
VAR AGE MARSTAT NOOFCH INCOME;
BY SEX;
SUM INCOME;
ID CASENO;
RUN;
The output is as follows:

---------------------------- SEX=1 -----------------------
CASENO    AGE    MARSTAT    NOOFCH    INCOME
   101     35          2         4      9754
   103     53          .         0      7560
   104     39          4         2      8500
   106     38          2         3      8210
   107     49          2         7      9607
   108     27          1         0      8895
   110     21          1         0      2954
   114     25          1         0      5650
   115     80          4         .      1750
   116     43          2         3      7950
   118      .          2         3      6105
   121     31          2         2      7350
   123     27          2         3      5850
                                      ------
   SEX                                 90135
---------------------------- SEX=2 -----------------------
CASENO    AGE    MARSTAT    NOOFCH    INCOME
   102     23          2         2      6850
   105     32          3         1      7768
   109      .          2         2      2500
   111     79          4         2      1950
   112     52          2         1      5285
   113     19          1         0      2680
   117     48          2         2      4938
   119     52          3         3         .
   120      .          1         0      3870
   122     22          1         0      3538
   124     38          1         0      7035
   125     71          4         0      2650
                                      ------
   SEX                                 49064
                                      ======
                                      139199

References and further information


A. Using SAS on CMS
Document K6.2.CMS
B. Using SAS on PCs
Document K6.2.PC
The main manuals for use with the IBM CMS mainframe system are:
C. SAS Language and Procedures: Usage
D. SAS Language: Reference
E. SAS Procedures Guide
F. SAS/STAT User’s Guide: Volumes 1 and 2
G. SAS/GRAPH Software: Volumes 1 and 2
H. SAS/ETS Users Guide (Econometrics and Time Series Analysis)
Other manuals relevant to the mainframe version are:
I. SAS CMS Companion
J. SAS Introductory Guide
K. SAS Applications Guide
L. SAS Guide to Macro Processing
All the above manuals are available for reference in the Computing
Service Advisory Office. Any problems encountered in using SAS
which cannot be solved by consulting the relevant manual should be
referred to Advisory (ext 4530).
The main manuals for use with the PC version of SAS are:
M. SAS Introductory Guide for Personal Computers
N. SAS Procedures Guide for Personal Computers
O. SAS/STAT Guide for Personal Computers
P. SAS/GRAPH Guide for Personal Computers
Q. SAS Language Guide for Personal Computers
