Introduction
SAS programs consist of SAS statements. SAS statements can be broadly categorized into three groups:
1) System statements group
2) Data steps group
3) Procedure steps group (Chapter 2)

A SAS statement has two important characteristics:
- It usually begins with a SAS keyword.
- It ends with a semicolon (;).

Sample Program Code (each statement begins with a keyword)

options nodate;              /* OPTIONS statement (system statements group) */
data eg1;                    /* DATA statement */
    infile 'D:\abc.txt';     /* INFILE statement */
    input name $ id age;     /* INPUT statement */
run;                         /* RUN statement */
proc print data=eg1;         /* PROC PRINT statement */
run;                         /* RUN statement */
option2 <=v2>
Default: NUMBER
YEARCUTOFF = yyyy
Default: 1920
Warning: The settings of the OPTIONS statement remain in effect until you modify them, or until you end your SAS session.

Question: What does the following OPTIONS statement do?

options linesize=110 nodate;

a. suppresses the date and limits the horizontal page size for text output
b. suppresses the date and limits the vertical page size for text output
c. suppresses the date and limits the vertical page size for the log
d. suppresses the date and limits the horizontal page size for the log
Answer: a
LINESIZE= sets the line width, that is, the horizontal size of a page. The vertical size (number of lines per page) is set by PAGESIZE=.
Raw data exist in many forms, such as text files (*.txt) and Excel files (*.xls).
A typical raw data set
- The title row may or may not be part of a raw data file.
- Each field is either separated by at least one space (blank delimiter) or placed in fixed columns.
  Note: SAS can read files with other delimiters such as commas or tabs.
- There is one observation per row.
In a SAS data set, a missing value is represented (by default) by
- a period (.) for a numeric variable
- a blank string (' ') for a character variable
By default, values of character variables align on the left and values of numeric variables align on the right.
Question: What type of variable is the variable AcctNum in the data set below?
a. numeric
b. character
c. can be either character or numeric
d. can't tell from the data shown
Correct answer: b
It must be a character variable, because the values contain letters and underscores, which are not valid characters for numeric values. Besides, it aligns on the left.
Question: What type of variable is the variable Wear in the data set below?
a. numeric
b. character
c. can be either character or numeric
d. can't tell from the data shown
Correct answer: a
It must be a numeric variable, because the missing value is indicated by a period rather than by a blank. Besides, it aligns on the right.
Example
data ex1;                  /* (1) */
    input Brand $ Wear;    /* (2) */
    datalines;             /* (3) */
Acme 43
Ajax 34
Atlas .
;
run;

Description
(1) The DATA statement initiates the data input process and defines the name assigned to the created SAS data set. The DATA statement must be the first statement of a DATA step.

Rules for SAS data set names
i. A SAS name can contain from 1 to 32 characters
ii. The first character must be a letter or an underscore (_)
iii. Subsequent characters must be letters, numbers, or underscores
iv. Blanks cannot appear in SAS names

Question: Which of the following variable names is valid?
a. 4BirthDate
b. $Cost
c. _Items_
d. Tax-Rate
Correct answer: c
Variable names follow the same rules as SAS data set names. They can be 1 to 32 characters long, must begin with a letter (A-Z, either uppercase or lowercase) or an underscore, and can continue with any combination of numbers, letters, or underscores.

(2) The INPUT statement defines the list of variables contained in the data set. The $ after Brand indicates that it is a character variable, whereas Wear is a numeric variable.
Note: The rules for SAS variable names are the same as the rules for SAS data set names.

(3) Each line following the DATALINES statement is considered a data record until a line that contains only a semicolon (;) is reached. A missing value must be represented by a period (.).
Add labels to SAS variables
To improve readability, a label can be attached to a variable when the SAS data set is being created, using the LABEL statement:
LABEL var1 = 'Label1' var2 = 'Label2' ... ;
- A label can be any text up to 256 characters long.
- You may put as many variable names and labels as you like into one LABEL statement.
Example:
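As a stand-in illustration, here is a minimal sketch of the LABEL statement. The data set name label_demo, the label texts and the data values are made up for this sketch and are not taken from the notes. The value abcdefghi is deliberately nine characters long.

```sas
/* Hypothetical data set, labels and values, for illustration only */
data label_demo;
    input var1 $ var2;
    label var1 = 'First variable'
          var2 = 'Second variable';
    datalines;
abcdefghi 1
xyz 2
;
run;

/* The LABEL option asks PROC PRINT to use labels as column headings */
proc print data=label_demo label;
run;
```

Without the LABEL option on the PROC PRINT statement, the report shows the variable names instead of the labels.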
The default maximum length of the character variable var1 is 8, so only abcdefgh (the first 8 characters of abcdefghi) is shown.
Solution: LENGTH statement
LENGTH varname1 $ n1 varname2 $ n2 ... varnameX $ nX ;
- Put the LENGTH statement before the INPUT statement.
- nX sets the length of varnameX in the SAS data set (the default length is 8).
- By default, SAS still reads the raw data record from delimiter to delimiter regardless of the length.
- If you put the LENGTH statement after the INPUT statement, the length of a character variable has already been fixed by the INPUT statement: SAS writes a warning to the log and the requested length is not applied.
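A minimal sketch of the fix, assuming a hypothetical data set long_demo with made-up values: declaring the length before INPUT lets the nine-character value be stored in full.

```sas
/* Hypothetical data set and values, for illustration only */
data long_demo;
    length var1 $ 12;      /* declared before INPUT, so up to 12 characters are kept */
    input var1 $ var2;
    datalines;
abcdefghi 1
xyz 2
;
run;

proc print data=long_demo;
run;
```

Because LENGTH comes first, var1 also becomes the first variable in the data set, which is why it is printed in the first column.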
DATA data_set_name;
    INFILE 'source_file_name' ;
    INPUT varname1 <$> bc<-ec> varname2 <$> bc<-ec> . . . ;
RUN;
(bc = beginning column, ec = ending column of the variable in the raw data record)

List input method
- Each value is separated by a delimiter (at least one blank space by default)
- Specify each variable in the INPUT statement in the order that they appear in the records of raw data
- A missing value must be represented by a placeholder, such as a period, by default
- Character values cannot contain embedded blanks if a blank space is used as the delimiter
- The default maximum length of character variables is 8
- Raw data is not required to be within fixed columns
- Data must be in standard character or numeric format by default

Column input method
- No delimiter is required
- Can read variables in any order
- A placeholder is not required to indicate a missing value
- Allows embedded blanks in character values (since no delimiter is required)
- Character values longer than 8 characters are allowed
- Raw data must be contained within fixed columns
- Data must be in standard character or numeric format by default

Note: Both input methods have the same restriction: data must be in standard format by default.
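The column input rules above can be sketched as follows. The data set name staff, the variable names, the column positions and the records are assumptions made for this illustration only.

```sas
/* Hypothetical layout: name in columns 1-12, dept in 14-18, age in 20-21 */
data staff;
    input name $ 1-12 dept $ 14-18 age 20-21;
    datalines;
Mary Ann Lee SALES 34
John Smith   ADMIN
Wong Ka Yan        41
;
run;

proc print data=staff;
run;
```

The name values contain embedded blanks and are longer than 8 characters. The blank age in the second record and the blank dept in the third record are read as missing values without any placeholder.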
Example :
data school;
    infile datalines delimiter=',';
    length district $ 12;
    input district teachers_no students_no;
    datalines;
North Point,,,, 1000 , 30000
Central, 520 , 16000
Wan Chai, , 2500
;
run;

- A comma (,) is used as the delimiter
- The maximum length of district has to be set to 12
- DATALINES must be specified in the INFILE statement when the raw data are entered directly (DATALINES statement)
- Consecutive delimiters are treated as a single delimiter, and a character value may contain embedded spaces
- A blank field between two delimiters is read as a missing value
Output:
2. DSD
- Uses a comma as the delimiter (you can change the delimiter by using the DELIMITER= option)
- Treats consecutive non-blank delimiters as a missing value
- Removes quotation marks from character values
- Reads a character value that contains a delimiter within a quoted string
- The default maximum length of character variables is 8
Example :
data school;
    infile 'D:\district.txt' dsd delimiter='!';
    length district $ 12;
    input district teachers_no students_no;
run;
This example uses ! as the delimiter.
Output:
3. MISSOVER
Add the MISSOVER option if data may be missing at the end of a data line. It tells SAS that if it runs out of data on a line, it should not go to the next data line to continue reading; the remaining variables are set to missing.
Example :
data case4a; input var1 var2 var3; datalines; 1 2 4 5 7 1 8 9 ; run; data case4b; infile datalines missover; input var1 var2 var3; datalines; 1 2 4 5 7 1 8 9 ; run;
Other INFILE statement options

LRECL = logical-record-length
SAS assumes external files have a record length of 256 or less (this was the default in older releases; SAS 9.4 raised it to 32767). If your data lines are long, and it looks like SAS is not reading all your data, then use the LRECL= option in the INFILE statement to specify a record length at least as long as the longest record in your data file.
INFILE 'C:\MyRawData\President.txt' LRECL=2000;

FIRSTOBS = n
Tells SAS at which line to begin reading data.

OBS = n
Tells SAS to stop reading raw data records after line n of the raw data file.

Example :
DATA icecream;
    INFILE 'D:\Sales.txt' FIRSTOBS=3 OBS=5;
    INPUT Flavor $ 1-9 Location BoxesSold;
RUN;
With FIRSTOBS=3 and OBS=5, SAS reads records 3 through 5 of the file.
Note: If the colon (:) modifier is omitted, SAS reads exactly 9 columns of characters for var3, even if a delimiter is encountered.
Example:
data case6;
    infile datalines delimiter=',';
    input age profit : dollar12.;
    datalines;
21, $6 750.55
19,$1 0000 00
22, ($3 000)
;
run;
Informat DOLLARw.
Use SAS date informats to read date values:

Informat    Width           Date format in raw data
DATEw.      7 <= w <= 32    17Jan03, 17/Jan/03, 17-Jan-03, 17 Jan 03
MMDDYYw.    6 <= w <= 32    011703, 01/17/03, 01-17-03, 01 17 03
DDMMYYw.    6 <= w <= 32    170103, 17/01/03, 17-01-03, 17 01 03
YYMMDDw.    6 <= w <= 32    030117, 03/01/17, 03-01-17, 03 01 17
MONYYw.     6 <= w <= 32    Jan03, Jan/03, Jan-03, Jan 03
YYMMw.      4 <= w <= 6     0301

Example:
data date;
    infile datalines delimiter=',';
    input date1 : date9. date2 : ddmmyy8. date3 : mmddyy10.;
    datalines;
31Dec59, 01011960, 01172003
31Dec1959, 01-01-60, 01-17-03
31DEC59, 01/01/60, 01/17/03
31DEC1959, 01 01 60, 01 17 03
;
run;
How does SAS convert calendar dates to SAS date values?
- A SAS date value is the number of days between 1 Jan 1960 and the specified date.
- Dates before 1 Jan 1960 are negative values; dates after are positive values.
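The day counts can be checked with SAS date constants. This is a minimal sketch; the variable names are arbitrary, and the values are written to the SAS log with PUT.

```sas
/* DATA _NULL_ runs the step without creating a data set */
data _null_;
    d1959 = '01JAN1959'd;
    d1960 = '01JAN1960'd;
    d1961 = '01JAN1961'd;
    /* writes d1959=-365 d1960=0 d1961=366 to the log */
    put d1959= d1960= d1961=;
run;
```

1960 is a leap year, which is why 1 Jan 1961 is day 366 rather than day 365.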
Calendar date    SAS date value
1-Jan-1959       -365
1-Jan-1960       0
1-Jan-1961       366
1-Jan-1962       731

How does SAS know which century a two-digit year belongs to?
If you use two-digit year values in your data lines or external files, you should consider the YEARCUTOFF= option. This option specifies which 100-year span is used to interpret two-digit year values.
- The default value of YEARCUTOFF= is 1920
- Two-digit years 20-99 are assumed to be 1920-1999
- Two-digit years 00-19 are assumed to be 2000-2019

Date Expression    Interpreted As
12/07/41           12/07/1941
18Dec15            18Dec2015
04/15/30           04/15/1930
15Apr95            15Apr1995

To change the cutoff year value:
OPTIONS YEARCUTOFF = cutoffyear ;
For example, if you specify YEARCUTOFF=1950, then the 100-year span will be from 1950 to 2049.
options yearcutoff=1950;
Using YEARCUTOFF=1950, two-digit years 50-99 are interpreted as 1950-1999 and two-digit years 00-49 as 2000-2049.
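A short sketch of the effect of YEARCUTOFF=1950 on two-digit years. The data set name cutoff_demo, the variable names and the three dates are chosen only for illustration.

```sas
options yearcutoff=1950;

/* Hypothetical data: three dates with two-digit years */
data cutoff_demo;
    input rawdate : date7.;
    yr = year(rawdate);     /* four-digit year that SAS assigned */
    datalines;
15Apr95
18Dec15
04Mar49
;
run;

/* With the 1950-2049 span, yr is 1995, 2015 and 2049 respectively */
proc print data=cutoff_demo;
    format rawdate date9.;
run;
```

Since the OPTIONS setting stays in effect for the rest of the session, it may be worth resetting YEARCUTOFF= afterwards if other programs rely on a different span.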
Question: SAS date values are the number of days since which date?
a) January 1, 1901
b) January 1, 1950
c) January 1, 1960
d) January 1, 2001
Correct answer: c
A SAS date value is the number of days between 1 Jan 1960 and the specified date.

Question: In order for the date values 05May1955 and 04Mar2046 to be read correctly, what value must the YEARCUTOFF= option have?
a. a value between 1947 and 1954, inclusive
b. 1955 or higher
c. 1946 or higher
d. any value
Correct answer: d
As long as you specify an informat (e.g. date9.) with the correct field width for reading the entire date value, the YEARCUTOFF= option doesn't affect date values that have four-digit years.

Question: Which time span is used to interpret two-digit year values if the YEARCUTOFF= option is set to 1950?
a. 1950-2049
b. 1950-2050
c. 1949-2050
d. 1950-2000
Correct answer: a
The YEARCUTOFF= option specifies which 100-year span is used to interpret two-digit year values. The default value of YEARCUTOFF= is 1920. However, you can override the default and change the value of YEARCUTOFF= to the first year of another 100-year span. If you specify YEARCUTOFF=1950, then the 100-year span will be from 1950 to 2049.
Creating a SAS data library -- using LIBNAME statement
LIBNAME libref 'SAS-data-library';
where
- libref is 1 to 8 characters long, begins with a letter or underscore, and contains only letters, numbers, or underscores
- SAS-data-library is the name of a SAS data library in which SAS data files are stored
The LIBNAME statement below assigns the libref Mysaslib to the SAS data library D:\.
libname Mysaslib 'd:\';
Creating permanent SAS Data Set
Suppose a SAS library Mysaslib has been created. A permanent SAS data set is created by the DATA statement with a two-level name (two names separated by a period).
Syntax
DATA libref.data-set-name ;
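A minimal sketch that combines the two steps, reusing the libref Mysaslib and the Brand/Wear data shown earlier. It assumes that the folder d:\ exists and is writable on the machine running SAS.

```sas
/* Assumption: d:\ exists and SAS may write to it */
libname Mysaslib 'd:\';

/* The two-level name stores ex1 permanently in the Mysaslib library */
data Mysaslib.ex1;
    input Brand $ Wear;
    datalines;
Acme 43
Ajax 34
Atlas .
;
run;

proc print data=Mysaslib.ex1;
run;
```

A one-level name such as ex1 would instead create a temporary data set in the Work library, which is removed when the session ends.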
A SAS library can be deleted
- In Explorer, select the library icon, right click, and select Delete
- This only removes the connection between the physical storage location and SAS. All SAS data sets remain in the storage folder.

If a SAS data set is deleted from a SAS library, it is deleted permanently from the storage folder.
A SAS data set can be copied or moved from one SAS library to another.

Question. Which one of the following statements is false?
a. LIBNAME statements can be stored with a SAS program to reference the SAS library automatically when you submit the program.
b. When you delete a libref, SAS no longer has access to the files in the library. However, the contents of the library still exist on your operating system.
c. Librefs can last from one SAS session to another.
d. You can access files that were created with other vendors' software by submitting a LIBNAME statement.
Correct answer: c
The LIBNAME statement is global, which means that librefs remain in effect until you modify them, cancel them, or end your SAS session. Therefore, the LIBNAME statement assigns the libref for the current SAS session only. You must assign a libref before accessing SAS files that are stored in a permanent SAS data library.
Basic Report
proc print data=Q1; run;
Example : The following output shows only the 3rd to 8th observations and prints the total number of observations printed from the data set Q1.
proc print data=Q1 (firstobs=3 obs=8) n; run;
Example : The following output specifies Observation Number as a header for the Obs column.
proc print data=Q1 obs= 'Observation Number'; run;
Note: If you include both NOOBS and OBS='column header' in the PROC PRINT statement, SAS suppresses the Obs column (OBS='column header' does not take effect).
proc print data=Q1 obs='Observation Number' noobs; run;
Selected Variables
VAR statement
Use the VAR statement to choose which variables appear in your report, and in what order.
proc print data=Q1; var gender cost; run;
Selected Observations
WHERE statement
To print only the observations that meet certain conditions

Definition                  Operator
Equal to                    EQ
Not equal to                NE
Greater than                GT
Less than                   LT
Greater than or equal to    GE
Less than or equal to       LE
Equal to one of a list      IN
Specified substring         CONTAINS

Conditions can be combined with the logical operators AND, OR and NOT.
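A brief sketch of the WHERE statement with the data set Q1 and the variables gender and cost used in these notes. It assumes that gender is a character variable with values such as 'F' and 'M' and that cost is numeric; the cut-off values are made up.

```sas
/* Assumed: gender is character ('F'/'M'), cost is numeric */
proc print data=Q1;
    var gender cost;
    where gender eq 'F' and cost ge 100;
run;

/* IN tests membership in a list; NOT reverses a condition */
proc print data=Q1;
    where gender in ('F', 'M') and not (cost lt 50);
run;
```

Character constants in a WHERE expression are quoted and case sensitive, so 'F' and 'f' select different observations.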
Column Totals
You can produce column totals for numeric variables within your report by using the SUM statement.
Example :
proc print data=Q1;
    var gender cost;
    sum cost;
run;
Specifying TITLE
TITLE statement: To make your report more meaningful, you can specify up to 10 titles by using TITLE statements in your output.
TITLEn 'title' ;
- n is a number from 1 to 10 that specifies the line number of the title
- Skipping some values of n indicates those lines are blank
- Titles are centered by default
- SAS uses the same title for all subsequent outputs until you cancel it or define a new title
- To cancel a title, specify a blank TITLEn statement, e.g. TITLE1;
Example :
proc print data=Q1;
    var gender cost;
    sum cost;
    title 'ABC Company';
    title3 'Transaction records';
run;
proc print data=Q2;
run;
Note1: title is the same as title1
Note2: Skipping title2 indicates that the second line is blank
Note3: SAS uses the same titles for printing data set Q2
Note4: If you do not want the same titles to appear in the second output, specify a blank TITLE statement:
proc print data=Q2;
    title;
run;
procedure Example
proc print data=Q1 label;
    var gender cost;
    sum cost;
    title 'ABC Company';
    title3 'Transaction records';
    label cost='Transaction cost';
run;

proc print data=Mylib.year_sales label noobs;
    var units amountsold;
    where salesrep='Garcia' and quarter='1';
    sum amountsold;
    label units='Units sold'
          amountsold='Amount sold';
    format units comma7. amountsold dollar12.2;
    title1 'Sales in first quarter by Garcia';
run;
Example
To display Values as
06/05/03 1,234 5,678.90
$1,234.00 $5,678.90
Example
proc sort data=ex1 out=sort_ex1;
    by month year;
run;
proc print data=sort_ex1 n;
    var year telephone;
    by month;
run;
Question.
a. The PROC PRINT step runs successfully, printing observations in their sorted order.
b. The PROC SORT step permanently sorts the input data set.
c. The PROC SORT step generates errors and stops processing, but the PROC PRINT step runs successfully, printing observations in their original (unsorted) order.
d. The PROC SORT step runs successfully, but the PROC PRINT step generates errors and stops processing.
Correct answer: c
The BY statement is required in PROC SORT. Without it, the PROC SORT step fails. However, the PROC PRINT step prints the original data set as requested.

Question. What does PROC PRINT display by default?
a. PROC PRINT does not create a default report; you must specify the rows and columns to be displayed.
b. PROC PRINT displays all observations and variables in the data set. If you want an additional column for observation numbers, you can request it.
c. PROC PRINT displays columns in the following order: a column for observation numbers, all character variables, and all numeric variables.
d. PROC PRINT displays all observations and variables in the data set, a column for observation numbers on the far left, and variables in the order in which they occur in the data set.
Correct answer: d
2.
ORDER= : specifies the order for listing the variable values
- FREQ: orders values by descending frequency count
- FORMATTED: orders values by their formatted values, in ascending order
- INTERNAL (default): orders values by their unformatted values, in ascending order
- DATA: orders values according to their order in the data set
data eg1; input var1 $; datalines; d 1 f 2 e 3 a 4 b 5 b c6 e ; proc freq data=eg1 order=data; run;
TABLE variable-list < / options> ;
variable-list specifies the variables included in the report
- For one-way tables, specify the variable name
- For more than one variable, separate the variable names by a space
- For a two-way table, separate the paired variables by *
- For more than one pair of variables, separate each pair by a space
One-way tables
proc freq data=crew; table jobcode location; run;
Two-way table
proc freq data=crew; table location*jobcode; run;
In the first example, two one-way tables are produced. In the second example, a two-way table is produced.

PROC FREQ produces one-way tables with cells that contain
- frequency
- percent
- cumulative frequency
- cumulative percent

PROC FREQ produces two-way tables with cells that contain
- cell frequency
- cell percent of total frequency
- cell percent of row frequency
- cell percent of column frequency

Commonly used options in one-way tables:
- NOCUM: suppresses display of cumulative frequencies and percentages
- NOPERCENT: suppresses display of percentages
- OUTCUM: includes the cumulative frequency and cumulative percentage in the output data set

Commonly used options in two-way tables:
- LIST: prints cross-tabulations in list format rather than as a grid
- CROSSLIST: prints cross-tabulations in crosslist format
- NOCUM: suppresses display of cumulative frequencies and cumulative percentages in list format
- NOCOL: suppresses display of the column percentage for each cell
- NOROW: suppresses display of the row percentage for each cell
- NOFREQ: suppresses display of the frequency count for each cell
- OUTPCT: includes the percentage of column frequency, row frequency, and two-way table frequency in the output data set

Commonly used options in one-way tables:
Example
proc freq data=Mylib.car ; table size /nopercent; run;
Example
proc freq data=Mylib.car ; table size /nocum; run;
Example
proc freq data=Mylib.car ; table size/ noprint out=Q13 outcum; run;
Example
proc freq data=crew; table location*jobcode / crosslist; run;
Example
proc freq data=crew; table location*jobcode / out=result2 outpct; run;
BY statement
To obtain separate analyses on observations in groups defined by the BY variables.
If the data set is not sorted in ascending order of the BY variables, sort the data using the SORT procedure with a similar BY statement.
Example
proc sort data=crew;
    by location;
run;
proc freq data=crew;
    table jobcode / missing;
    by location;
run;
<VAR variable-list ;> <BY by-variables ;> <CLASS class-variables ;> <OUTPUT OUT = sas-data-set <output-statistic = output-label> ;> <WHERE where-expression ;> <TITLE 'title-text' ;> <LABEL variable-name1 = 'label-string1 ' ... ;> RUN;
VAR statement
VAR variable-list ;
- PROC MEANS reports on every numeric variable in sas-data-set if the VAR statement is not included
- For more than one variable, separate the variable names by a space
- Default reported statistics are N, MEAN, STD, MIN, MAX

Note: The above report shows all numeric variables (only mileage and reliability are numeric variables).
proc means data=Mylib.Car; var mileage; run;
Requested statistics
Other statistics include: RANGE, MEDIAN, SUM, NMISS, SKEWNESS, VAR, Q1, Q3, P1, P5, P10, P90, P95, P99, etc.
If you specify any statistics in requested-statistics, PROC MEANS no longer produces the default statistics; each statistic you want must be requested explicitly.
proc means data=Mylib.Car n mean; var MILEAGE; run;
Group processing
Group Processing Using the CLASS Statement Group Processing Using the BY Statement
PROC MEANS DATA = sas_data_set <requestedstatistics> <options>; <VAR variable-list ;> CLASS class-variables ; RUN;
PROC MEANS DATA = sas_data_set <requested-statistics>; <VAR variable-list ;> BY by-variables ; RUN;
Group Processing Using the CLASS Statement
CLASS statement options:
- ORDER= : specifies the order for listing the values of the CLASS variable
- MISSING: treats missing values as a valid value of the CLASS variable
Note: You do not need to use PROC SORT when using the CLASS statement.

Group Processing Using the BY Statement
Note: If the data set is not sorted in ascending order of the BY variables, sort the data using PROC SORT with a similar BY statement.
Example
proc means data=Mylib.Car mean median order=freq; var mileage; class size; run;
Example
proc sort data=Mylib.Car out=sort_Car; by size; run; proc means data=sort_Car mean median; var mileage; by size; run;
Example
Note: The values of _TYPE_ indicate which combinations of CLASS variables are used to compute the statistics.
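To see _TYPE_ in practice, the following sketch writes the statistics to an output data set. It reuses Mylib.Car with the variables size and mileage from the examples above; the names car_stats and avg_mileage are made up for this sketch.

```sas
/* NOPRINT suppresses the printed report; only the output data set is created */
proc means data=Mylib.Car noprint;
    class size;
    var mileage;
    output out=car_stats mean=avg_mileage;
run;

/* car_stats contains _TYPE_, _FREQ_, size and avg_mileage */
proc print data=car_stats;
run;
```

With a single CLASS variable, the row with _TYPE_=0 holds the overall mean, and the rows with _TYPE_=1 hold the mean for each value of size.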
Question.
The default statistics produced by the MEANS procedure are n-count, mean, minimum, maximum, and
a. median.
b. range.
c. standard deviation.
d. standard error of the mean.
Correct answer: c

Question.
Which statement will limit a PROC MEANS analysis to the variables Boarded, Transfer, and Deplane?
a. by boarded transfer deplane;
b. class boarded transfer deplane;
c. output boarded transfer deplane;
d. var boarded transfer deplane;
Correct answer: d
To specify the variables that PROC MEANS analyzes, add a VAR statement and list the variable names.

Question.
Which of the following statements is true regarding BY-group processing?
a. BY variables must be either indexed or sorted.
b. Summary statistics are computed for BY variables.
c. BY-group processing is preferred when you are categorizing data that contains few variables.
d. BY-group processing overwrites your data set with the newly grouped observations.
Correct answer: a
Unlike CLASS processing, BY-group processing requires that your data already be indexed or sorted in the order of the BY variables. You might need to run the SORT procedure before using PROC MEANS with a BY group.
Once defined, a custom format is used like a SAS system format.
PROC FORMAT <LIBRARY = libref>;
    VALUE <$>format-name1 range1a = 'label1a'
                          range2a = 'label2a' ;
    VALUE <$>format-name2 range1b = 'label1b'
                          range2b = 'label2b' ;
    ...
RUN ;

Temporary custom format (default)
- A custom format is stored in a format catalog under the WORK library (so the format is stored temporarily).
- You only need to submit the PROC FORMAT step once during a session, but you need to run it again when you re-open the SAS software (start a new session).

Permanent custom format
- The option LIBRARY = libref specifies the name of a permanent SAS data library in which the format catalog will be stored.
- You need to tell SAS where to find the defined format before using it, but you do not need to re-run the procedure.

format-name names the format that you are creating
- Must begin with a $ sign if the format applies to character values
- Cannot be longer than eight characters
- Cannot be the name of an existing SAS format
- Cannot end with a number
- Does not end in a period

range specifies one or more values to be grouped
- Values in different ranges should not overlap

label is a text string enclosed in quotation marks (' ')

Note: In a single PROC FORMAT step, you can use several VALUE statements to define several formats.

Ranges can describe, for example:
- 10 through the highest non-missing value
- First character of the data value matches any letter from a through g, case sensitive
- First character of the data value matches a or d, case sensitive
- Any first character of a non-missing value through g, case sensitive
- G through any first character of a non-missing value, case sensitive
- Any value not specified elsewhere
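The following sketch shows how formats named agegroup., gender. and $col. could be defined. The format names are the ones used in the PROC PRINT example in these notes, but the cut points and label texts are assumptions made for illustration and are not the original definitions.

```sas
/* Cut points and labels are hypothetical, for illustration only */
proc format;
    value agegroup
        low -< 30  = 'Under 30'
        30  -< 60  = '30 to 59'
        60 - high  = '60 and over'
        other      = 'Unknown';        /* also catches missing ages */
    value gender
        1 = 'Male'
        2 = 'Female';
    value $col                         /* $ because it applies to a character variable */
        'Y'   = 'Yellow'
        'G'   = 'Green'
        'B'   = 'Blue'
        other = 'Other colour';
run;
```

The notation -< excludes the upper end point, so the ranges do not overlap at 30 and 60. Because no LIBRARY= option is given, these formats are stored in Work.Formats and last only for the current session.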
proc print data=eg; format age agegroup. sex gender. colour $col. income dollar8.; run;
proc freq data=eg; table age/ missing; format age agegroup.; run;
Example - Creating a SAS data set using custom format Without using format
data eg;
    input age sex income colour $;
    datalines;
19 1 14000 Y
45 1 65000 G
72 2 35000 B
. 1 44000 Y
58 2 83000 W
;
run;
Note: The user defined format must be created before the DATA step using the format
income dollar8.;
Question.
If you don't specify the LIBRARY= option, your formats are stored in Work.Formats, and they exist
a. only for the current procedure.
b. only for the current DATA step.
c. only for the current SAS session.
d. permanently.
Correct answer: c
If you do not specify the LIBRARY= option, formats are stored in a default format catalog named Work.Formats. As the libref Work implies, any format that is stored in Work.Formats is a temporary format that exists only for the current SAS session.

Question.
Which of the following statements will store your formats in a permanent catalog?
a. libname library 'c:\sas\formats\lib'; proc format library=library ...;
b. libname library 'c:\sas\formats\lib'; format library=library ...;
c. library='c:\sas\formats\lib'; proc format library ...;
d. library='c:\sas\formats\lib'; proc library ...;
Correct answer: a
To store formats in a permanent catalog, you first write a LIBNAME statement to associate the libref with the SAS data library in which the catalog will be stored. Then add the LIBRARY= option to the PROC FORMAT statement, specifying the name of the catalog.

Question.
When creating a format with the VALUE statement, the new format's name cannot end with a number, cannot end with a period, cannot be the name of a SAS format, and
a. cannot be the name of a data set variable.
b. must be at least two characters long.
c. must be at least eight characters long.
d. must begin with a dollar sign ($) if used with a character variable.
Correct answer: d
The name of a format that is created with a VALUE statement must begin with a dollar sign ($) if it applies to a character variable.

Question.
Which of the following FORMAT procedures is written correctly?
a. proc format library=library value colorfmt; 1='Red' 2='Green' 3='Blue' run;
b. proc format library=library; value colorfmt 1='Red' 2='Green' 3='Blue'; run;
c. proc format library=library; value colorfmt; 1='Red' 2='Green' 3='Blue' run;
d. proc format library=library; value colorfmt 1='Red'; 2='Green'; 3='Blue'; run;
Correct answer: b
A semicolon is needed after the PROC FORMAT statement. The VALUE statement begins with the keyword VALUE and ends with a semicolon after all the labels have been defined.

Question.
Which of these is false? Ranges in the VALUE statement can specify
a. a single value, such as 24 or 'S'.
b. a range of numeric values, such as 0-1500.
c. a range of character values, such as 'A'-'M'.
d. a list of numeric and character values separated by commas, such as 90,'B',180,'D',270.
Correct answer: d
You can list values separated by commas, but the list must contain either all numeric values or all character values. Data set variables are either numeric or character.

Question.
How many characters can be used in a label?
a. 40
b. 96
c. 200
d. 256
Correct answer: d
When specifying a label, enclose it in quotation marks and limit the label to 256 characters.

Question.
Which keyword can be used to label missing values as well as any values that are not specified in a range?
a. LOW
b. MISS
c. MISSING
d. OTHER
Correct answer: d
MISS and MISSING are invalid keywords, and LOW does not include missing values. The keyword OTHER can be used in the VALUE statement to label missing values as well as any values that are not specifically included in a range.

Question.
You can place the FORMAT statement in either a DATA step or a PROC step. What happens when you place the FORMAT statement in a DATA step?
a. You temporarily associate the formats with variables.
b. You permanently associate the formats with variables.
c. You replace the original data with the format labels.
d. You make the formats available to other data sets.
Correct answer: b
By placing the FORMAT statement in a DATA step, you permanently associate the defined formats with variables.

Question.
The format JOBFMT was created in a FORMAT procedure. Which FORMAT statement will apply it to the variable JobTitle in the program output?
a. format jobtitle jobfmt;
b. format jobtitle jobfmt.;
c. format jobtitle=jobfmt;
d. format jobtitle='jobfmt';
Correct answer: b
To associate a user-defined format with a variable, place a period at the end of the format name when it is used in the FORMAT statement.
During the compilation phase, each statement is scanned for syntax errors. Most syntax errors prevent further processing of the DATA step. When the compilation phase is complete, the descriptor portion of the new data set is created. If the DATA step compiles successfully, then the execution phase begins. During the execution phase, the DATA step reads and processes the input data. The DATA step executes once for each record in the input file, unless otherwise directed.
Program data vector (PDV) automatic variables: _N_ and _ERROR_
_ERROR_ signals the occurrence of an error that is caused by the data during execution. The default value is 0, which means there is no error. _ERROR_ = 1 when one or more errors occur.
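The behaviour of _ERROR_ can be observed in the log with a small sketch. The data set err_demo and its values are made up; the second record deliberately contains non-numeric data for a numeric variable.

```sas
/* Hypothetical data: 'abc' is invalid for the numeric variable amount */
data err_demo;
    input id amount;
    put _n_= _error_=;     /* writes the automatic variables to the log */
    datalines;
1 100
2 abc
3 300
;
run;
```

For the second record SAS writes an invalid-data note to the log, sets amount to missing and shows _ERROR_=1. For the first and third records the log shows _ERROR_=0, because SAS sets _ERROR_ back to 0 at the start of each iteration.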
Question. Suppose you run a program that causes three DATA step errors. What is the value of the automatic variable _ERROR_ when the observation that contains the third error is processed?
a. 0
b. 1
c. 2
d. 3
Correct answer: b
3. Syntax Checking
During the compilation phase, SAS also scans each statement in the DATA step, looking for syntax errors. Syntax errors include
- missing or misspelled keywords
- invalid variable names
- missing or invalid punctuation
- invalid options.
PDV layout: _N_ | _ERROR_ | ID | Income | Expense | NetProfit

The descriptor portion of the data set contains
- the name of the data set
- the number of observations and variables
- the names and attributes of the variables.

At this point
- The example data set contains the four variables that are defined in the input data set and in the assignment statement.
- _N_ and _ERROR_ are not written to the data set.
- There are no observations because the DATA step has not yet executed.
Execution Phase
After the DATA step is compiled, it is ready for execution. During the execution phase, the data portion of the data set is created. The data portion contains the data values.
Example :
data profit;
    input ID $ Income Expense;
    NetProfit = Income - Expense;
    datalines;
001 1000 2000
002 300 150
003 888 777
;
run;

PDV layout: _N_ | _ERROR_ | ID | Income | Expense | NetProfit
1. Set variables in the PDV to missing and update _N_ and _ERROR_ in the PDV
At the beginning of the execution phase, the value of _N_ is 1. Because there are no data errors, the value of _ERROR_ is 0. The remaining variables are initialized to missing. Missing numeric values are represented by periods, and missing character values are represented by blanks.

PDV
_N_    _ERROR_    ID     Income    Expense    NetProfit
1      0                 .         .          .

2. Put a new record into the input buffer and read the data values into the PDV
Input Buffer
At the end of the DATA step, several actions occur. First, the values in the PDV are written to the output data set as the first observation.
SAS Data Set profit
Next, the value of _N_ is set to 2 and control returns to the top of the DATA step. Finally, the variable values in the program data vector are re-set to missing. Notice that the automatic variable _ERROR_ retains its value.

PDV
_N_    _ERROR_    ID     Income    Expense    NetProfit
2      0                 .         .          .

After the second record has been read and processed:

PDV
_N_    _ERROR_    ID     Income    Expense    NetProfit
2      0          002    300       150        150
6. End-of-File Marker
The execution phase continues until the end-of-file marker is reached in the raw data file. When there are no more records in the raw data file to be read, the data portion of the new data set is complete.
Final SAS Data Set profit
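One way to watch the program data vector change during execution is to add a PUT _ALL_ statement to the example DATA step. The sketch below is an illustration only and is not part of the original example: it writes every variable in the PDV, including _N_ and _ERROR_, to the SAS log once per iteration.

```sas
data profit;
  input ID $ Income Expense;
  NetProfit=Income-Expense;
  /* Illustration only: write all PDV variables, including the */
  /* automatic variables _N_ and _ERROR_, to the SAS log.      */
  put _all_;
  datalines;
001 1000 2000
002 300 150
003 888 777
;
run;
```

Each pass through the DATA step adds one line to the log, so the three data records should give three lines, with _N_ running from 1 to 3.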
Question: Which of the following is not created during the compilation phase?
a. the data set descriptor
b. the first observation
c. the program data vector
d. the _N_ and _ERROR_ automatic variables
Correct answer: b
Observations are not written until the execution phase.

Question: During the compilation phase, SAS scans each statement in the DATA step, looking for syntax errors. Which of the following is not considered a syntax error?
a. incorrect values and formats
b. invalid options or variable names
c. missing or invalid punctuation
d. missing or misspelled keywords
Correct answer: a

Question: Unless otherwise directed, the DATA step executes
a. once for each compilation phase.
b. once for each DATA step statement.
c. once for each record in the input file.
d. once for each variable in the input file.
Correct answer: c

Question: At the beginning of the execution phase, the value of _N_ is 1, the value of _ERROR_ is 0, and the values of the remaining variables are set to
a. 0
b. 1
c. undefined
d. missing
Correct answer: d

Question: Suppose you run a program that causes three DATA step errors. What is the value of the automatic variable _ERROR_ when the observation that contains the third error is processed?
a. 0
b. 1
c. 2
d. 3
Correct answer: b
The default value of _ERROR_ is 0, which means there is no error. When an error occurs, whether it is one error or multiple errors, the value is set to 1.

Question: Which of the following actions occurs at the end of the DATA step?
a. The automatic variables _N_ and _ERROR_ are incremented by one.
b. The DATA step stops execution.
c. The descriptor portion of the data set is written.
d. The values of variables created in programming statements are reset to missing in the program data vector.
Correct answer: d

Question: What usually happens when a syntax error is detected?
a. SAS continues processing the step.
b. SAS continues to process the step, and the SAS log displays messages about the error.
c. SAS stops processing the step in which the error occurred, and the SAS log displays messages about the error.
d. SAS stops processing the step in which the error occurred, and the Output window displays messages about the error.
Correct answer: c
Syntax errors generally cause SAS to stop processing the step in which the error occurred. When a program that contains an error is submitted, messages regarding the problem also appear in the SAS log. When a syntax error is detected, the SAS log displays the word ERROR, identifies the possible location of the error, and gives an explanation of the error.

Question: A syntax error occurs when
a. some data values are not appropriate for the SAS statements that are specified in a program.
b. the form of the elements in a SAS statement is correct, but the elements are not valid for that usage.
c. program statements do not conform to the rules of the SAS language.
d. none of the above.
Correct answer: c

Question: How can you tell whether you have specified an invalid option in a SAS program?
a. A log message indicates an error in a statement that seems to be valid.
b. A log message indicates that an option is not valid or not recognized.
c. The message "PROC running" or "DATA step running" appears at the top of the active window.
d. You can't tell until you view the output from the program.
Correct answer: b
When you submit a SAS statement that contains an invalid option, a log message notifies you that the option is not valid or not recognized. You should recall the program, remove or replace the invalid option, check your statement syntax as needed, and resubmit the corrected program.
The double trailing @ (@@) should not be used with the @ pointer control (discussed in a later section), with column input, or with the MISSOVER option.
Example:
data Q6;
  input X Y @@;
  datalines;
1 2 3 4 5 6 7 8
11 12 13 14
21 22 23 24 25 26 27 28
;
run;
Method 2: / line-pointer control
Forces a new record into the input buffer and starts reading from the beginning of that record.
Works when every observation spans the same number of records.
INPUT varname / varname / varname ;
Example
data Case8b; infile 'd:\temp\list7.txt'; input id 1-4 name $ 6-16 / gender $1 / weight_before 1-4 weight_after 6-9; run;
Method 3: #n line-pointer control
Puts multiple records into the input buffer and assigns the records to the PDV in any specified order.
INPUT #n1 varname #n2 varname #n3 varname ;
n1, n2 and n3 are the record numbers in the input buffer.
Example
data Case8c; infile 'd:\temp\list7.txt'; input #2 gender $1 #1 id 1-4 name $ 6-16 #3 weight_before 1-4 weight_after 6-9; run;
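Because the contents of list7.txt are not shown, the following self-contained sketch repeats the Method 2 INPUT statement with in-stream data. The data set name case8_inline and all data values are hypothetical; the column positions are assumed to match those used in Case8b.

```sas
data case8_inline;
  /* Record 1: id and name; record 2: gender; record 3: weights. */
  /* Each / moves the pointer to the start of the next record.   */
  input id 1-4 name $ 6-16
      / gender $ 1
      / weight_before 1-4 weight_after 6-9;
  datalines;
1001 Mary Chan
F
55.2 53.0
1002 Tom Ng
M
70.5 68.1
;
run;
```

Under these assumptions the six records should produce two observations, because every observation spans exactly three records.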
addition and subtraction
Parentheses can be used to override the order.
Example:
var1 = 10 * 4 + 3 ** 2 gives var1 = 49
var1 = 10 * (4 + 3) ** 2 gives var1 = 490
Example
data case1; infile datalines delimiter=','; input name $ tomato cucumber peas grapes; zone=14; type='Home'; cucumber=cucumber*10; total= tomato + cucumber + peas + grapes; tomato_percent = tomato / total*100; datalines; David,10,2,40,0 Mary,15,5,10,1000 Francis,50,10,15,50 Tom,20,0, . ,20 ; run;
Note: SAS executes each statement once during each iteration of the DATA step. If a variable has already been assigned a value in the PDV, SAS replaces the original value with the new one. The variable peas has a missing value in the last observation, so the variables calculated from peas (total and tomato_percent) are also set to missing.
Note: The order of the assignment statements and the INPUT statement affects the assigned values. For example, if total is computed before cucumber is multiplied by 10, as in the two statements that follow, total uses the original value of cucumber.
total= tomato + cucumber + peas + grapes; cucumber=cucumber*10;
Comparison Operators
Operator      Comparison Operation
= or eq       equal to
^= or ne      not equal to
> or gt       greater than
< or lt       less than
>= or ge      greater than or equal to
<= or le      less than or equal to
in            equal to one of a list
Example
if test<85 and time<=20 then Status='RETEST'; if region in ('NE','NW','SW') then Rate=fee-25; if target>300 or sales<50000 then Bonus=salary*.05;
Logical Operators
Operator      Logical Operation
&             and
|             or
^ or ~        not
Use the AND operator to execute the THEN statement if both expressions that are linked by AND are true.
Example if status='OK' and type=3 then Count+1; if (age^=agecheck or time^=3) & error=1 then Test=1;
Use the OR operator to execute the THEN statement if either expression that is linked by OR is true.
Example if status='S' or cond='E' then Control='Stop';
Use the NOT operator with other operators to reverse the logic of a comparison.
Example if not(loghours<7500) then Schedule='Quarterly'; if region not in ('NE','SE') then Bonus=200;
Character values must be specified in the same case in which they appear in the data set and must be enclosed in quotation marks.
Example if status='OK' and type=3 then Count+1; if status='S' or cond='E' then Control='Stop'; if not(loghours<7500) then Schedule='Quarterly'; if region not in ('NE','SE') then Bonus=200;
Logical comparisons that are enclosed in parentheses are evaluated as true or false before they are compared to other expressions. In the example below, the OR comparison in parentheses is evaluated before the first expression and the AND operator are evaluated.
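As a minimal sketch of this rule, the DATA step below reuses the parenthesized condition shown earlier in this section. The data set name case_paren, the ELSE branch and the data values are hypothetical and serve only to show how the grouping changes the result.

```sas
data case_paren;
  input age agecheck time error;
  /* The OR comparison inside the parentheses is evaluated first; */
  /* its true/false result is then combined with error=1 by AND.  */
  if (age^=agecheck or time^=3) and error=1 then Test=1;
  else Test=0;
  datalines;
30 30 3 1
30 31 3 1
30 30 5 0
;
run;
```

With these values only the second record should give Test=1: in the first record both comparisons inside the parentheses are false, and in the third record error is not 1. Without the parentheses, AND would be evaluated before OR, so the condition would be read as age^=agecheck or (time^=3 and error=1).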
if var1<11 then var2='Small';
datalines;
5 15 25
;
run;
data case2;
  input var1 @@;
  length var2 $8; /* sets the length of var2 to 8 */
  if var1>20 then var2='Big';
  if 11<=var1<=20 then var2='Medium';
  if var1<11 then var2='Small';
  datalines;
5 15 25
;
run;
A missing value of a numeric variable is treated as smaller than any specified numeric value, so a comparison such as age<=18 is true when age is missing. Example:
data case3; input age @@; if age <=18 then agroup='A'; if 18<age<30 then agroup ='B'; if 31<= age then agroup='C'; datalines; 14 . 25 19 ; run;
IF-THEN blocks
To execute more than one action when the condition is true:
IF condition THEN DO;
  statements;
  statements;
END;
Example:
data case4; input course $ @@; if course='MS3215' then do; lecturer = 'AB Chan'; class_size=45; end; if course='MS3216' then do; lecturer = 'CD Ma'; class_size=30; end; datalines; MS3215 MS3216 MS3217 ; run;
IF-THEN-ELSE statements / IF-THEN-ELSE blocks
To put a number of related IF-THEN statements / IF-THEN blocks together
IF-THEN-ELSE statement:
IF condition THEN statement;
ELSE IF condition THEN statement;
IF-THEN-ELSE block:
IF condition THEN DO;
  statements;
END;
ELSE IF condition THEN DO;
  statements;
END;
Example (IF-THEN-ELSE):
if var1=. then var2='Unknown'; else if var1 <11 then var2='Small'; else if 11<= var1<=20 then var2 ='Medium'; else if 45 >=var1 >20 then var2='Big'; else var2='Very Big';
data case4; input course $ @@; if course='MS3215' then do; lecturer = 'AB Chan'; class_size=45; end; else if course='MS3216' then do; lecturer = 'CD Ma'; class_size=30; end; else do; lecturer = 'Other'; class_size=.; end; datalines; MS3215 MS3216 MS3217 ; run;
n must be surrounded by either ( ), { }, or [ ].
$ is needed if the variables are character type and have not been defined before the ARRAY statement.
arrayname[n] in an assignment statement refers to the nth element of the array as defined in the ARRAY statement, n = 1, 2, ...
In the example below, newarray[1] is var1, newarray[2] is var2, and newarray[3] is var3.
Example: Use of array
data case5;
  array newarray[3] var1 var2 var3;
  newarray[1]=1; /* var1 */
  newarray[2]=2; /* var2 */
  newarray[3]=3; /* var3 */
run;
Question: Which statement is false regarding an ARRAY statement?
a. It is an executable statement.
b. It can be used to create variables.
c. It must contain either all numeric or all character elements.
d. It must be used to define an array before the array name can be referenced.
Correct answer: a

Question: What belongs within the braces of this ARRAY statement?
array contrib{?} qtr1-qtr4;
a. quarter
b. quarter*
c. 1-4
d. 4
Correct answer: d

Question: For the program below, select an iterative DO statement to process all elements in the contrib array.
data work.contrib;
array contrib{4} qtr1-qtr4;
...
contrib{i}=contrib{i}*1.25;
end;
run;
a. do i=4;
b. do i=1 to 4;
c. do until i=4;
d. do while i le 4;
Correct answer: b

Question: What is the value of the index variable that references Jul in the statements below?
array quarter{4} Jan Apr Jul Oct;
do i=1 to 4;
yeargoal=quarter{i}*1.2;
end;
a. 1
b. 2
c. 3
d. 4
Correct answer: c

DO loop - To process an array of variables iteratively
Syntax
DO index_variable = k TO m < BY increment_amount >;
  SAS statements
END;
index_variable is a variable that changes value at each iteration of the loop.
Iteration starts with the value k (k often equals 1).
increment_amount is a numeric variable or constant that controls how the value of index_variable changes. The default value is 1.
At END, index_variable changes by the amount of increment_amount.
Iteration continues until the value of index_variable > m.
Example:
data case5a; array newarray[3] var1 var2 var3; do i=1 to 3; newarray[i]=i; end; run;
Variable i can be dropped from the data set by including a DROP statement in the DATA step
data case5a; array newarray[3] var1 var2 var3; do i=1 to 3; newarray[i]=i; end; drop i; run;
Abbreviated list of variable names
Replaces a regular list of variable names.
Numbered range lists
Variables that start with the same characters and end with consecutive numbers.
The numbers can start and end anywhere as long as the number sequence in between is complete.
Example:
Regular variable list                      Abbreviated list
INPUT var6 var7 var8 var9;                 INPUT var6 - var9;
ARRAY narray(4) var6 var7 var8 var9;       ARRAY narray(4) var6 - var9;
PROC PRINT DATA = data1;                   PROC PRINT DATA = data1;
VAR var6 var7 var8 var9;                   VAR var6 - var9;

Example: Array ALL has 20 numeric elements. Write DO statements to refer to the following elements:
a. All elements
b. Even-numbered elements
c. Every third element, beginning with 1 (i.e. 1, 4, 7, ...)
(a) DO k = 1 to 20; ALL[k]=k; END;
(b) DO k = 2 to 20 BY 2; ALL[k]=k; END;
(c) DO k = 1 to 20 BY 3; ALL[k]=k; END;
Example:
data case6;
  infile datalines delimiter=',';
  input stud_id $ quiz1-quiz5;
  array quiz[5] quiz1-quiz5;
  quiz_sum=0;
  do i=1 to 5;
    quiz_sum=quiz_sum + quiz[i];
  end;
  quiz_mean=quiz_sum/5;
  keep stud_id quiz_mean; /* or: drop quiz1-quiz5 quiz_sum i; */
  datalines;
S1,45,33,60,75,80
S2,67,58,75,69,55
;
run;
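Selecting observations
By default, SAS writes an observation to the SAS data set at the end of each DATA step iteration. Using an OUTPUT statement in an IF-THEN statement makes SAS write an observation only when a condition is true. When an OUTPUT statement is present in the DATA step, SAS no longer writes an observation automatically at the end of each iteration.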
Example:
data case7; infile datalines delimiter=','; input age sex $ @@; if sex='f' then output; datalines; 25, m, 18, f, 19, m, 20, m, 21, f ; run;
Note: For the value assigned by an assignment statement to be kept in the SAS data set, the assignment statement must be placed before the OUTPUT statement. Example:
data case7; infile datalines delimiter=','; input age sex $ @@; if sex='f' then do; newvar=1; output; end; datalines; 25, m, 18, f, 19, m, 20, m, 21, f ; run;
Example:
data case7; infile datalines delimiter=','; input age sex $ @@; if sex='f' then do; output; newvar=1; end; datalines; 25, m, 18, f, 19, m, 20, m, 21, f ; run;
Writing observations to multiple data sets
To write observations to a selected SAS data set, specify the SAS data set name in the OUTPUT statement.
Any SAS data set name that appears in the OUTPUT statement must also appear in the DATA statement.
Example:
data Q45am Q45pm;
  input group :$10. class :$10. enclosure $ fedtime $;
  if fedtime='am' then output Q45am;
  else if fedtime='pm' then output Q45pm;
  else if fedtime='both' then output Q45am Q45pm;
  datalines;
bears Mammalia E2 both
elephants Mammalia W3 am
flamingos Aves W1 pm
frogs Amphibia S2 pm
kangaroos Mammalia N4 am
lions Mammalia W6 pm
snakes Reptilia S1 pm
tigers Mammalia W2 both
zebras Mammalia W2 am
;
run;
Retaining the Values of Variables
RETAIN statement - stops SAS from resetting the specified variables to missing in the PDV
RETAIN variable1 <init_value1> variable2 <init_value2> ;
A RETAIN statement can specify both numeric and character variables.
<init_valueN> is optional and specifies the starting value of each variable.
Example: Calculate the running total
data case8;
  input month $ sales @@;
  acc_sales= acc_sales + sales;
  retain acc_sales 0; /* starting value of acc_sales = 0 */
  datalines;
Jan 3500 Feb 2888 Mar 887 Apr 698 May 6789 Jun 906
;
run;
Write a SAS DATA step to create a data set which contains the name and sales date in every observation.
data Q50; infile 'F:\SAS\sas\ex3\Ex2_Data1.txt'; input name $1-15 @16 salesdate date11. salesamount 31-35; if name^=' ' then do; oldname=name; oldsalesdate=salesdate; end; else if name=' ' then do; name=oldname; salesdate=oldsalesdate; end; retain oldname oldsalesdate; drop oldname oldsalesdate; format salesdate date9.; run;
Effect of missing value on running totals Missing values will be generated from operations performed on missing values Example:
data case8;
  input month $ sales @@;
  acc_sales= acc_sales + sales;
  retain acc_sales 0; /* starting value of acc_sales = 0 */
  /* The sales value for April is missing in the data below */
  datalines;
Jan 3500 Feb 2888 Mar 887 Apr . May 6789 Jun 906
;
run;
Solution:
data case8;
  input month $ sales @@;
  if sales ^=. then acc_sales= acc_sales+sales;
  retain acc_sales 0; /* starting value of acc_sales = 0 */
  datalines;
Jan 3500 Feb 2888 Mar 887 Apr . May 6789 Jun 906
;
run;
Sum statement
Retains values from the previous iteration of the DATA step in order to cumulatively add the value of a variable across observations.
variable + expression;
variable specifies the name of the accumulator variable, which must be numeric. It is automatically set to 0 before the first observation is read, and its value is retained from one DATA step execution to the next.
expression contains the value to be added to the variable. It can be a variable or a constant.
The Sum statement adds the result of the expression that is on the right side of the plus sign (+) to the numeric variable that is on the left side of the plus sign. At the beginning of the DATA step, the value of the numeric variable is not set to missing as it usually is when reading raw data. Instead, the variable retains the new value in the program data vector for use in processing the next observation.
Note: The Sum statement is one of the few SAS statements that doesn't begin with a keyword.
Note: If the expression produces a missing value, the Sum statement treats it like a zero. (By contrast, in an assignment statement, a missing value is assigned if the expression produces a missing value.)
Example:
data case8; input month $ sales @@; acc_sales + sales; datalines; Jan 3500 Feb 2888 Mar 887 Apr . May 6789 Jun 906 ; run;
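For comparison, the Sum statement acc_sales + sales; behaves like a RETAIN statement with an initial value of 0 combined with the SUM function, which ignores missing arguments. The sketch below spells out that longer form; the data set name case8_equiv is hypothetical.

```sas
data case8_equiv;
  input month $ sales @@;
  /* Keep acc_sales across iterations, starting from 0 */
  retain acc_sales 0;
  /* SUM ignores a missing sales value, so the running total */
  /* is carried forward unchanged for April                  */
  acc_sales = sum(acc_sales, sales);
  datalines;
Jan 3500 Feb 2888 Mar 887 Apr . May 6789 Jun 906
;
run;
```

Both forms should give the same running total; the Sum statement is simply the shorter way to write it.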
@ (single trailing @) line-hold specifier
Holds the record in the input buffer until the last statement of the DATA step is executed, or until another INPUT statement without a line-hold specifier is encountered.
Note: The term trailing indicates that the @ must be the last item that is specified in the INPUT statement.
E.g. input salesid $ location $ @ ;
if location='EUR' then input saledate : date9. amount;
datalines;
101, USA, 1-20-2008,3445
433,EUR,30Mar2008,432.3
102,USA,4-12-2008,5320
444,EUR,26Apr2008,3433.3
;
run;
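Only the tail of this example survives, so the following is a hedged sketch of how a complete DATA step of this kind might look. The data set name sales_mixed, the INFILE statement, the USA branch and the mmddyy10. informat are assumptions made for illustration; only the EUR branch and the data values come from the original.

```sas
data sales_mixed;
  infile datalines delimiter=',';
  /* Read the fields common to every record and hold the record */
  input salesid $ location $ @;
  /* Assumed: USA records store the date as month-day-year */
  if location='USA' then input saledate : mmddyy10. amount;
  /* From the original: EUR records store the date as ddMONyyyy */
  else if location='EUR' then input saledate : date9. amount;
  format saledate date9.;
  datalines;
101,USA,1-20-2008,3445
433,EUR,30Mar2008,432.3
102,USA,4-12-2008,5320
444,EUR,26Apr2008,3433.3
;
run;
```

The trailing @ on the first INPUT statement keeps the record in the input buffer, so the second INPUT statement continues reading from the same record with the informat that fits its location.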
Each record in temp.txt consists of a group's ID and followed by three experimental results How to pair each group's ID with one result to a single observation so that three observations can be derived from each record?
data temp; infile 'D:\temp.txt'; input id $ @; input result @; output; input result @; output; input result @; output; run;
Alternative:
data temp; infile 'D:\temp.txt'; input id $ @; do i=1 to 3; input result @; output; end; drop i; run;
Example: In EXE2_DATA5.txt, the first field is the ID of the student and the second field is the number of examination scores in that record. Create a SAS data set that contains only two variables, namely the student ID and the examination score. Each student contributes as many observations to the SAS data set as the student has examination scores.
data Q64;
  infile 'F:\SAS\sas\ex3\Exe2_Data5.txt' missover;
  input id $ no score @;
  /* score is numeric, so compare it with the numeric missing value */
  do while (score ^= .);
    output;
    input score @;
  end;
  drop no;
run;
Alternative:
data Q64; infile 'F:\SAS\sas\ex3\Exe2_Data5.txt'; input id $ no @; do i= 1 to no; input score@; output; end; keep id score; run;
Raw Data File - LIST2_3.TXT
Employee,Adams,Cheung        (header record)
Dependent,Machael,C,15       (detail record)
Dependent,Machael,C,13       (detail record)
Employee,Thomas,Leung        (header record)
Dependent,Susan,S,26         (detail record)
Employee,Lewis,Chan          (header record)
Dependent,Richard,C,8        (detail record)
Employee,Dansky,Wong         (header record)
Employee,Nicholls,Tsang      (header record)
Dependent,Robert,C,12        (detail record)
Employee,Mary,Fong           (header record)
Dependent,John,S,40          (detail record)
You can build a SAS data set from a hierarchical file by creating one observation per detail record and storing each header record as part of the observation.
SAS data set one observation per detail record
You can also build a SAS data set from a hierarchical file by creating one observation per header record and combining the information from detail records into summary variables.
SAS data set one observation per header record
In this section, you learn how to read from a hierarchical file and create a SAS data set that contains either one observation for each detail record or one observation for each header record.
Step 2. Conditionally Executing SAS Statements You can use the value of type to identify each record. If type is Employee, execute an INPUT statement to read the values for first name (empfname) and last name (emplname). However, if type is Dependent, then execute an INPUT statement to read the values for first name (depfname), relation, and age. You can tell SAS to perform a given task based on a specific condition by using an IF-THEN statement.
data case12; infile 'd:\LIST2_3.TXT' delimiter=','; input type : $9. @; if type='Employee' then input empfname: $15. emplname : $15.; else if type='Dependent' then do; input depfname : $15. relation $ age; end;
Step 3. Reading a Detail Record
Now think about what needs to happen when a detail record is read. Remember, you want to write an observation to the data set only when the value of type is Dependent. Using an OUTPUT statement in an IF-THEN statement makes SAS write an observation only when the condition is true (i.e. type is Dependent).
data case12; infile 'd:\LIST2_3.TXT' delimiter=','; input type : $9. @; if type='Employee' then input empfname: $15. emplname : $15.; else if type='Dependent' then do; input depfname : $15. relation $ age; output; end; retain empfname emplname;
Step 4. Dropping Variables and Final SAS Data Set Because type is useful only for identifying a record's type, drop the variable from the data set.
data case12; infile 'd:\LIST2_3.TXT' delimiter=','; input type : $9. @; if type='Employee' then input empfname: $15. emplname : $15.; else if type='Dependent' then do; input depfname : $15. relation $ age; output; end; retain empfname emplname; drop type; run;
Step 2. DO Group Actions for Header Records To execute multiple SAS statements based on the value of a variable, you can use a simple DO group with an IF-THEN statement. When the condition type='Employee' is true, you need to execute several statements.
data case12; infile 'd:\LIST2_3.TXT' delimiter=','; input type : $9. @; if type='Employee' then do;
First, you need to determine whether this is the first header record in the external file. You do not want the first header record to be written as an observation until the related detail records have been read and summarized. _N_ is an automatic variable whose value is the number of times the DATA step has begun to execute. The expression _n_^= 1 defines a condition where the DATA step has executed more than once. Use this expression in conjunction with the previous IF-THEN statement to check for these two conditions: When the conditions type='Employee' and _n_^= 1 are true, an OUTPUT statement is executed. Thus, each header record except for the first one causes an observation to be written to the data set.
data case12; infile 'd:\LIST2_3.TXT' delimiter=','; input type : $9. @; if type='Employee' then do; if _n_^=1 then output; input empfname: $15. emplname : $15.;
insurance_cost=0;
end;
An INPUT statement reads the values of empfname and emplname. An assignment statement creates the summary variable insurance_cost and sets its value to 0.
Step 3. Reading Detail Records
When the value of type is not Employee, you need to define an alternative action. You can do this by adding an ELSE statement to the IF-THEN statement. If its value is 'Dependent', then continue to read the values of the first name, relation, and age. You want to account for each person who is represented by a detail record and store the accumulated value in the summary variable insurance_cost. You have already initialized the value of insurance_cost to 0 each time a header record is read. Now, as each detail record is read, you can increment the value of insurance_cost by using a Sum statement. If relation = 'S', add 100 to the cost of insurance. If relation = 'C', add 60 to the cost of insurance.
data case12; infile 'd:\LIST2_3.TXT' delimiter=','; input type : $9. @; if type='Employee' then do; if _n_^=1 then output; input empfname: $15. emplname : $15.;
insurance_cost=0; end; else if type='Dependent' then do; input depfname : $15. relation $ age; if relation='S' then insurance_cost+100; if relation='C' then insurance_cost+60; end; retain empfname emplname;
Step 4. Determining the End of the External File and Final SAS Data Set
Your program writes an observation to the data set only when another header record is read and the DATA step has executed more than once. But after the last detail record is read, there are no more header records to cause the last observation to be written to the data set. You need to determine when the last record in the file is read so that you can then execute another explicit OUTPUT statement. You can determine when the current record is the last record in an external file by specifying the END= option in the INFILE statement.
INFILE 'file-name' END = variable_name ;
variable_name
is any valid SAS variable name that is not included in the INPUT statement or other assignment statements in the same DATA step
equals 1 if the current record is the last record in the raw data file; 0 otherwise
remains 0 until SAS processes the last data record
appears in the PDV but is not written to the SAS data set
data case12;
  infile 'd:\LIST2_3.TXT' delimiter=',' end=eofile;
  input type : $9. @;
  if type='Employee' then do;
    if _n_^=1 then output;
    input empfname: $15. emplname : $15.;
    insurance_cost=0;
  end;
  else if type='Dependent' then do;
    input depfname : $15. relation $ age;
    if relation='S' then insurance_cost+100;
    if relation='C' then insurance_cost+60;
  end;
  if eofile=1 then output;
  retain empfname emplname;
  /* keep the summary variable as well as the employee name */
  keep empfname emplname insurance_cost;
run;
SAS data set: one observation per header record
SAS Functions
A SAS function performs a computation on one or more variables within the same observation and returns a value. SAS functions include mathematical functions, statistical functions, date functions, character functions, and others.
SAS function syntax
Function_name(<argument1> <, ..., argumentn>)
Function_name(OF abbreviated_variable_list)
Function_name(OF array_name[*])
Function_name must be followed by a pair of parentheses.
If used in an assignment statement, the function must be placed on the right-hand side.
The parentheses may contain one argument, more than one argument, or no argument (i.e. empty parentheses).
An argument can be a variable name, a constant, another SAS function, or a valid SAS expression.
Multiple arguments are separated by commas.
Mathematical functions
Function name       Description
ABS(argument)       Returns a nonnegative number that is equal in magnitude to that of the argument
EXP(argument)       Returns the value of the exponential function
LOG(argument)       Returns the natural (base e) logarithm
LOG10(argument)     Returns the logarithm to the base 10
SQRT(argument)      Returns the square root of a value
Example:
data test; input quantity @@; abs_quantity=abs(quantity); log_quantity=log(abs_quantity); sqrt_quantity=sqrt(abs_quantity); datalines; 1244 -1898 34232 10 242 ; run;
Function name                       Description
INT(argument)                       Returns the integer portion of the argument
ROUND(argument)                     Returns the nearest integer to the argument
ROUND(argument, rounding-unit)      Rounds the first argument to a value that is very close to a multiple of the second argument
Example:
data test; x1=int(10.499); x2=int(10.599); x3=round(10.49); x4=round(10.5); x5=round(10.51); x6=round(10.449,0.01); x7=round(10.501,0.01); x8=round(10.504,0.05);
x9=round(13,2); run;
Statistical functions
Function name                        Description
sum(argument, argument,...)          sum of values
mean(argument, argument,...)         average of nonmissing values
min(argument, argument,...)          minimum value
max(argument, argument,...)          maximum value
median(argument, argument,...)       median value
var(argument, argument,...)          variance of the values
std(argument, argument,...)          standard deviation of the values
N(argument, argument,...)            the number of nonmissing values
NMISS(argument, argument,...)        the number of missing values
Example: The following figure displays the first few records of a raw data set containing the student quiz scores. The first line is not part of the data set. If a student took all five quizzes, the lowest of the five quiz scores is dropped. Write a program that will compute the average quiz score based on this decision. If a student took fewer than five quizzes, compute the average of the non-missing quizzes.
data Q77; input ID $ Q1-Q5; nmiss=nmiss(of Q1-Q5); if nmiss=0 then average=(sum(of Q1-Q5)-min(of Q1-Q5))/4; /*if n=5 then average=(sum(of Q1-Q5)-min(of Q1-Q5))/4;*/ else average=mean(of Q1-Q5); drop nmiss; datalines; 1 85 76 79 80 85 2 . 56 65 72 81 3 44 49 . . 54 ; run;
Character functions
Function name                                 Description
CAT(string-1 <, ... string-n>)                Concatenates character strings without removing leading or trailing blanks
CATS(string-1 <, ... string-n>)               Concatenates character strings and removes leading and trailing blanks
CATT(string-1 <, ... string-n>)               Concatenates character strings and removes trailing blanks
CATX(separator, string-1 <, ... string-n>)    Concatenates character strings, removes leading and trailing blanks, and inserts separators
COMPBL(source)                                Compresses multiple blanks in a character string into a single blank
Example: Create a SAS data set that joins the two fields into a single variable for the full name in the form of firstname lastname such that there is only one blank space between the first name and the last name.
data Q78; infile datalines delimiter=','; input first $ last $; name=catx(' ', first, last); datalines; Mary,Leung John,Wong Jonathan,Ng ; run;
Character functions (continued)
Function name         Description
LEFT(argument)        Left aligns a SAS character expression
LENGTH(string)        Returns the length of a non-blank character string, excluding trailing blanks, and returns 1 for a blank character string
LENGTHC(string)       Returns the length of a character string, including trailing blanks
LENGTHN(string)       Returns the length of a non-blank character string, excluding trailing blanks, and returns 0 for a blank character string
LOWCASE(argument)     Converts all letters in an argument to lowercase
RIGHT(argument)       Right aligns a character expression
TRIM(argument)        Removes trailing blanks from character expressions and returns one blank if the expression is missing
TRIMN(argument)       Removes trailing blanks from character expressions and returns a null string (zero blanks) if the expression is missing
UPCASE(argument)      Converts all letters in an argument to uppercase
Function name                          Description
FIND(string, substring)                Searches for a specific substring of characters within a character string that you specify
string specifies a character constant, variable, or expression that will be searched for substrings. Tip: Enclose a literal string of characters in quotation marks.
substring is a character constant, variable, or expression that specifies the substring of characters to search for in string. Tip: Enclose a literal string of characters in quotation marks.
Function name                          Description
SUBSTR(string, position<,length>)      Extracts a substring from an argument
string specifies any SAS character expression.
position specifies a numeric expression that is the beginning character position.
length specifies a numeric expression that is the length of the substring to extract. Tip: If you omit length, SAS extracts the remainder of the expression.
Example:
data test; infile datalines delimiter=','; input name :$20. sex $; new_name = compbl(name); blank_pos=find(new_name,' '); name_len=length(new_name); last_name=substr(new_name,blank_pos); first_name=substr(name,1,name_len - length(last_name)); sex=upcase(sex); datalines; Mary Chan, f Tom Ng, M David Wong,m Betty Chung,F ; run;
Character functions (continued)
Function name                          Description
SCAN(string, n<, delimiter(s)>)        Selects a given word from a character expression
n specifies a numeric expression that produces the number of the word in the character string you want SCAN to select.
delimiter specifies a character expression that produces characters that you want SCAN to use as a word separator in the character string.
Note: If you omit delimiter, SAS uses the following characters by default: blank . < ( + & ! $ * ) ; ^ / , % |
Tip: If you specify delimiter, enclose it in quotation marks.
Example:
data test;
  input name $20.;
  surname=scan(name,1,' ');
  givenname1=scan(name,2,' ');
  givenname2=scan(name,3,' ');
  givenname=catx(' ',givenname1,givenname2);
  datalines;
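The data lines of the example above are not available, so here is a self-contained sketch of the same technique. The data set name scan_demo and the names in the data lines are hypothetical.

```sas
data scan_demo;
  input name $20.;
  /* Split the name on blanks: word 1 is the surname, */
  /* words 2 and 3 are the given names                */
  surname=scan(name,1,' ');
  givenname1=scan(name,2,' ');
  givenname2=scan(name,3,' ');
  /* CATX skips blank arguments, so a single given name */
  /* does not leave a trailing separator                */
  givenname=catx(' ',givenname1,givenname2);
  datalines;
Chan Tai Man
Wong Siu Ming
Lee Ann
;
run;
```

For the last record there is no third word, so SCAN returns a blank for givenname2 and givenname should contain only Ann.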
Example:
data Q79; infile datalines delimiter=' ' dsd; input age 1-2 @4 name:$50.; surname=scan(name,1,','); firstname=scan(name,2,','); drop name; datalines; 18 "HO, Chun Kit" 17 "LO, Yu Yin" 20 "SUM, On Man" ; run;
Character functions (continued)
Function name                                   Description
COMPRESS(<source><, chars><, modifiers>)        Removes specific characters from a character string
source specifies a source string that contains characters to remove.
chars specifies a character string that initializes a list of characters. By default, the characters in this list are removed from the source. If you specify the K modifier in the third argument, then only the characters in this list are kept in the result. Tip: You can add more characters to this list by using other modifiers in the third argument. Tip: Enclose a literal string of characters in quotation marks.
modifiers specifies a character string in which each character modifies the action of the COMPRESS function. Blanks are ignored. These are the characters that can be used as modifiers:
a or A - adds letters of the Latin alphabet (A - Z, a - z) to the list of characters.
d or D - adds numerals to the list of characters.
i or I - ignores the case of the characters to be kept or removed.
k or K - keeps the characters in the list instead of removing them.
p or P - adds punctuation marks to the list of characters.
Example:
data test;
  input productcode :$10.;
  product=compress(productcode, ,'ka');
  code=compress(productcode, ,'a');
  datalines;
Aa235
BXT3218
6798ZYV
316X
;
run;
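The two functions above can be tried on literal values before they are applied to a data set. The following is a minimal sketch, not part of the original notes: the variable names word, letters, and digits and the test strings are made up for illustration, and the results are written to the SAS log with PUT.

```sas
/* Sketch only: variable names and test strings are hypothetical. */
data _null_;
  length word letters digits $ 10;
  word    = scan('HO, Chun Kit', 1, ',');   /* first comma-delimited word */
  letters = compress('BXT3218', , 'ka');    /* K + A: keep letters only   */
  digits  = compress('BXT3218', , 'kd');    /* K + D: keep digits only    */
  put word= letters= digits=;               /* word=HO letters=BXT digits=3218 */
run;
```

Leaving the second argument of COMPRESS empty, as here, means the list of characters is built entirely from the modifiers.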
Function name / Description
DAY(date): Extracts the day value from a SAS date value.
MONTH(date): Extracts the month value from a SAS date value.
TODAY(): Returns the current date as a SAS date value. This function requires no arguments, but the function name must still be followed by parentheses.
WEEK(date): Returns the week number value.
WEEKDAY(date): Returns the day of the week from a SAS date value, where 1=Sunday, 2=Monday, ..., 7=Saturday.
YEAR(date): Extracts the year value from a SAS date value.
MDY(month,day,year): Returns a SAS date value from numeric expressions of month, day, and year values.
  month can be a variable that represents the month, or a number from 1-12
  day can be a variable that represents the day, or a number from 1-31
  year can be a variable that represents the year, or a number that has 2 or 4 digits
YRDIF(sdate,edate,'ACTUAL'): Returns the difference in years between two dates. 'ACTUAL' uses the actual number of days between the dates in calculating the number of years.
DATDIF(sdate,edate,'ACTUAL'): Returns the actual number of days between two dates.
Example:
data test;
  input id birthday birthmonth birthyear;
  birthdate=mdy(birthmonth,birthday,birthyear);
  birthweek=week(birthdate);
  birthweekday=weekday(birthdate);
  cutoffdate='1jan2004'd;
  day_diff=datdif(cutoffdate,birthdate,'actual');
  year_diff=yrdif(cutoffdate,birthdate,'actual');
  format birthdate cutoffdate;
  datalines;
1 31 12 2005
2 1 1 2006
3 28 2 2006
4 31 3 2006
;
run;
Function name: INTCK('interval',from,to)
Description: Returns the number of time intervals that occur in a given time span, where
  'interval' specifies a character constant or variable. The value must be one of the following: DAY, WEEKDAY, WEEK, MONTH, HOUR, QTR, YEAR
  from specifies a SAS date value that identifies the beginning of the time span
  to specifies a SAS date value that identifies the end of the time span

The INTCK function counts intervals from fixed interval beginnings, not in multiples of an interval unit from the from value. Partial intervals are not counted. For example, WEEK intervals are counted by Sundays rather than in seven-day multiples from the from argument. MONTH intervals are counted by day 1 of each month, and YEAR intervals are counted from 01JAN, not in 365-day multiples.

SAS Statement                                            Value
Weeks = intck('week','31 dec 2000'd,'01jan2001'd);       0
Months = intck('month','31 dec 2000'd,'01jan2001'd);     1
Years = intck('year','31 dec 2000'd,'01jan2001'd);       1

Because December 31, 2000, is a Sunday, no WEEK interval is crossed between that day and January 1, 2001. However, both MONTH and YEAR intervals are crossed.

Date functions (continued)
Function name: INTNX('interval',start-from,increment<,'alignment'>)
Description: Increments a date value by a given interval or intervals, and returns a date value, where
  'interval' specifies a character constant or variable. The value must be one of the following: DAY, WEEKDAY, WEEK, MONTH, HOUR, QTR, YEAR
  start-from specifies a starting SAS date value
  increment specifies a negative or positive integer that represents time intervals toward the past or future
  'alignment' (optional) forces the alignment of the returned date to the beginning, middle, or end of the interval

For example, the following statement creates the variable TargetYear and assigns it a SAS date value of 13515, which corresponds to January 1, 1997.
TargetYear=intnx('year','05feb94'd,3);

The optional alignment argument lets you specify whether the returned date value should be at the beginning, middle, or end of the interval. When specifying date alignment in the INTNX function, use the following arguments or their corresponding aliases:
  BEGINNING  B
  MIDDLE     M
  END        E
  SAMEDAY    S
The best way to understand the alignment argument is to see its effect on otherwise identical statements. The following table shows the results of three INTNX statements that differ only in the value of alignment.
SAS Statement                                 Date Value
MonthX=intnx('month','01jan95'd,5,'b');       12935 (June 1, 1995)
MonthX=intnx('month','01jan95'd,5,'m');       12949 (June 15, 1995)
MonthX=intnx('month','01jan95'd,5,'e');       12964 (June 30, 1995)
These statements count five months from January, but the returned value depends on whether alignment specifies the beginning, middle, or end day of the resulting month. If alignment is not specified, the beginning day is returned by default.
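The boundary-counting rule of INTCK and the alignment argument of INTNX are easiest to check in the log. The following is a minimal sketch under assumed names (weeks, months, m_begin, m_middle, m_last are hypothetical); it only reuses the date constants from the tables above and prints the INTNX results with the DATE9. format.

```sas
/* Sketch only: variable names are hypothetical. */
data _null_;
  /* INTCK counts boundaries crossed, not elapsed time */
  weeks  = intck('week',  '31dec2000'd, '01jan2001'd);  /* no Sunday crossed  */
  months = intck('month', '31dec2000'd, '01jan2001'd);  /* 01JAN2001 crossed  */

  /* INTNX: same statement, three alignments */
  m_begin  = intnx('month', '01jan95'd, 5, 'b');        /* first day of June 1995 */
  m_middle = intnx('month', '01jan95'd, 5, 'm');        /* middle of June 1995    */
  m_last   = intnx('month', '01jan95'd, 5, 'e');        /* last day of June 1995  */

  put weeks= months=;
  put m_begin= date9. m_middle= date9. m_last= date9.;
run;
```

Printing with a date format rather than as raw numbers makes it easier to see which day of the target month each alignment selects.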
Function name: INPUT(source, informat)
Description: Explicit character-to-numeric conversion
  source indicates the character variable, constant, or expression to be converted to a numeric value
  a numeric informat must also be specified, as in this example: input(payrate,2.)

Function name: PUT(source, format)
Description: Explicit numeric-to-character conversion
  source indicates the numeric variable, constant, or expression to be converted to a character value
  a format matching the data type of the source must also be specified, as in this example: put(site,2.)
Question A typical value for the character variable Target is 123,456. Which statement correctly converts the values of Target to numeric values when creating the variable TargetNo? a. TargetNo=input(target,comma6.); b. TargetNo=input(target,comma7.); c. TargetNo=put(target,comma6.); d. TargetNo=put(target,comma7.); Correct answer: b
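The two conversion functions can be compared side by side in a short DATA step. This is a sketch only: the value assigned to site and the result variable names sitechar and targetno are assumptions made for illustration.

```sas
/* Sketch only: the values and result variable names are hypothetical. */
data _null_;
  target   = '123,456';                /* character value, 7 characters wide     */
  targetno = input(target, comma7.);   /* character -> numeric, informat width 7 */

  site     = 26;                       /* numeric value                          */
  sitechar = put(site, 2.);            /* numeric -> character, format width 2   */

  put targetno= sitechar=;             /* targetno=123456 sitechar=26 */
run;
```

The informat width has to cover every character of the source value, including the comma, which is why a width of 7 is needed for 123,456.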
SET statement
DATA data_set_name <(data_set_options)>;
  <Other DATA step statements>
  SET sas_data_set <(data_set_options)> <options>;
  <Other DATA step statements>
RUN;
data_set_name is the name of the SAS data set to be created.
sas_data_set is the name of the SAS data set to be read.
Any DATA step statements can be placed before or after the SET statement.
How does it work?
1. Compilation phase
   No input buffer is created. A tracking pointer points to the first observation of the SAS data set to be read.
   The PDV is created as usual. All variables in the SAS data set to be read are included by default.
2. Execution phase
   When the SET statement executes, the values from the observation at the pointer are copied to the PDV.
   At the end of each iteration of the DATA step, the values in the PDV are written to the new data set.
   At the beginning of each iteration, the values of variables that were read with the SET statement, or that were created by a sum statement, are retained in the PDV. All other variables are set to missing.
Example: Suppose a SAS data set Scores exists in the Mylib library
SET statement - Dropping unwanted variables Suppose the variables score2 and score3 of SCORES are not wanted anymore DROP data set option : These variables (score2 and score3) are not kept in the PDV and cannot be used in the DATA step Example:
data case1; set Mylib.scores (drop=score2 score3); run;
DROP statement These variables are kept in the PDV but not output to the new data set, they can still be used in the DATA step Example:
data case2; set Mylib.scores; drop score2 score3; run;
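The practical difference between the DROP= data set option and the DROP statement shows up when a dropped variable is needed in a calculation. The following sketch uses a small made-up data set named scores in the WORK library as a stand-in for Mylib.scores; its values and the variable total are assumptions, not taken from the original data.

```sas
/* Sketch only: WORK.scores is a hypothetical stand-in for Mylib.scores. */
data scores;
  input StudentID $ score1 score2 score3;
  datalines;
A01 70 80 90
A02 60 65 75
;
run;

/* DROP statement: score2 and score3 are in the PDV, so total can use them */
data with_statement;
  set scores;
  total = score1 + score2 + score3;
  drop score2 score3;
run;

/* DROP= option: score2 and score3 never reach the PDV, so only score1 is
   available; referring to score2 here would create a new missing variable */
data with_option;
  set scores (drop=score2 score3);
  total = score1;
run;
```

Both output data sets contain StudentID, score1, and total, but only the first total includes all three scores.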
SET statement - Keeping selected variables only Suppose only the variables StudentID and score3 are wanted KEEP data set option Only these variables are kept in PDV and output to new data set Example:
data case3; set Mylib.scores (keep=StudentID score3); run;
KEEP statement All variables are kept in the PDV but only these variables are output to the new data set Example:
data case3; set Mylib.scores; keep StudentID score3; run;
SET statement - Rename variables Suppose variable StudentID would be renamed to SID and variable score3 would be renamed to quiz3. Example:
data case4; set Mylib.scores (rename=(StudentID=SID score3=quiz3)); run;
Note: The RENAME= data set option here affects only the PDV and the new data set. The input data set Mylib.scores is not changed.
Example: Suppose a SAS data set TEMP1 contains 500 observations. Write a SAS DATA step to create a SAS data set for each of the following:
a. The new data set contains only the first 100 observations of TEMP1.
b. The new data set contains only the last 100 observations of TEMP1.
c. The new data set contains the 101st to 300th observations of TEMP1.
data Qa; set Temp1 (obs=100); run;
data Qb; set Temp1 (firstobs=401); run;
data Qc; set Temp1 (firstobs=101 obs=300); run;
Example:
data Q16a Q16b; set Mylib.Fltattnd; IF JOBCODE='FLTAT1' THEN output Q16a; IF JOBCODE='FLTAT2' THEN output Q16b; run;
SET statement - Selecting an observation directly (direct access)
Use the POINT= option in the SET statement:
  point_variable = obs_number;
  SET data_set_name POINT=point_variable;
point_variable specifies a temporary numeric variable. It appears in the PDV but not in the final data set.
obs_number is the observation number of the observation to be read. It must be assigned to point_variable before the SET statement executes.
Example:
data case5; obsnum=102; set Mylib.booksales (keep =ID gender firstpurch) point=obsnum; output; stop; run;
Because the POINT= option reads only the specified observations, SAS never reads an end-of-file indicator, so the DATA step would loop indefinitely.
Use a STOP statement to make SAS stop processing the current DATA step immediately.
The DATA step normally writes an observation at the end of each iteration, but the STOP statement ends processing before that point, so no observation is written automatically.
Use an OUTPUT statement before the STOP statement to write the observation explicitly.
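The same OUTPUT-then-STOP pattern extends to reading several chosen observations in one DATA step. The following is a minimal sketch, assuming a made-up ten-observation data set named numbers; the observation numbers 2, 5, and 7 are arbitrary.

```sas
/* Sketch only: WORK.numbers and the chosen observation numbers are hypothetical. */
data numbers;
  do value = 1 to 10;
    square = value**2;
    output;
  end;
run;

data picked;
  do obsnum = 2, 5, 7;            /* observation numbers to read directly  */
    set numbers point=obsnum;
    output;                       /* explicit output for each one read     */
  end;
  stop;                           /* no end-of-file is reached with POINT= */
run;

proc print data=picked; run;
```

The data set picked holds three observations, and obsnum is not among its variables because a POINT= variable is temporary.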
SET statement - Selecting every kth observation
Example: Write a SAS DATA step to select a 1000-observation subset from the data set SALE2000 by reading every tenth observation, starting from observation number 10.
data case6; do obs=10 to 10000 by 10; set Mylib.sale2000 point=obs; output; end; stop; run;
Write a SAS DATA step to select every tenth observation in SALE2000. Suppose you do not know the total number of observations in SALE2000.SAS7BDAT.
Use the NOBS= option, which creates a temporary variable that contains the total number of observations in the input data set.
Note that the NOBS= variable can be used in executable statements that appear before the SET statement, because its value is assigned during compilation.
data case6q; do obs=10 to ttlobs by 10; set Mylib.sale2000 point=obs nobs=ttlobs; output; end; stop; run;
SET statement - Creating a random sample with replacement
With replacement: Observations can be selected more than once.
The major steps:
  First generate a random number, say k
  Read the kth observation directly
  Repeat the above two steps until the required number of observations is selected
Generate a random number Function RANUNI(seed) returns a value between 0 and 1 seed must be an integer seed = 0 uses the system clock time, resulting in different output each time To get an integer between 1 and M, use function CEIL( ) as follows: CEIL(RANUNI(seed) * M) CEIL( ) function returns the smallest integer that is greater than or equal to the argument Example:
data case7;
  samplesize=100;
  do i=1 to samplesize;
    sample_point=ceil(ranuni(0)*ttlobs);
    set Mylib.booksales (keep =ID gender firstpurch) point=sample_point nobs=ttlobs;
    output;
  end;
  stop;
run;
BY-group processing - To group observations for processing
DATA data_set_name ;
  SET sas_data_set <(data_set_options)> <options>;
  BY variable1 <variable2 >;
The data set in the SET statement must be sorted by the values of the BY variables.
Two temporary variables are created for each BY variable:
  First.variable1: equals 1 for the first observation in a BY group; 0 otherwise
  Last.variable1: equals 1 for the last observation in a BY group; 0 otherwise
Example: Suppose you want to compute the total amount of money spent (M) on books by each MCODE level in BOOKSALES.SAS7BDAT
proc sort data=Mylib.booksales out=sort_booksales; by mcode; run; data case8; set sort_booksales (keep= mcode m); by mcode; if first.mcode=1 then total_spent=0; total_spent+m; if last.mcode=1 then output; drop m; run;
Example: Suppose you want to compute the total amount of money spent (M) on books by each gender in each MCODE level
proc sort data=Mylib.booksales out=sort_booksales;
  by mcode gender;
run;
data case8;
  set sort_booksales;
  by mcode gender;
  if first.gender=1 then total_spent=0;
  total_spent+m;
  if last.gender=1 then output;
  keep mcode gender total_spent;
run;
In the BY statement, mcode is the primary BY variable and gender is the secondary BY variable.
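Because the First. and Last. variables are temporary, one way to inspect them is to copy them into ordinary variables and print the result. The following sketch uses a small made-up data set named purchases that is already sorted by mcode and gender; the data values and the names first_mcode, last_mcode, first_gender, and last_gender are assumptions for illustration.

```sas
/* Sketch only: WORK.purchases and the copied flag names are hypothetical. */
data purchases;
  input mcode gender $ m;
  datalines;
1 F 10
1 F 20
1 M 15
2 F 30
2 M 5
2 M 25
;
run;

data flags;
  set purchases;
  by mcode gender;
  first_mcode  = first.mcode;    /* 1 on the first row of each mcode group   */
  last_mcode   = last.mcode;     /* 1 on the last row of each mcode group    */
  first_gender = first.gender;   /* 1 when gender changes within an mcode,   */
  last_gender  = last.gender;    /* or when mcode itself changes             */
run;

proc print data=flags; run;
```

In the printed output, the flags for the secondary variable are also set whenever the primary variable starts or ends a group, which is why resetting total_spent on first.gender is enough to get one total per gender within each MCODE level.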
DATA data_set_name ;
  SET sas_data_set1 <(data_set_options)> sas_data_set2 <(data_set_options)> <options> ;
  <Other DATA step statements>
RUN;
Any number of SAS data sets can be read in one SET statement.
Common variables must have the same data type.
The new data set contains all of the variables and observations from all of the data sets listed in the SET statement.
How does it work? Similar to reading single SAS data set Observations from the first data set that is listed in the SET statement are read first Then the observations from the second data set that is listed, and so on Example:
data Jan;
  input name $ 1-20 sales;
  datalines;
Daivd Wong          4500
Francis Leung       6000
Joe Chan            3000
;
run;
data Feb;
  input name $ 1-20 sales;
  datalines;
Joe Chan            5000
Daivd Wong          6000
John Tai            4500
;
run;
data case9;
  set Jan Feb;
run;
Missing values will be generated if stacking data sets with different variable names Example:
data Jan;
  input name $ 1-20 sales1;
  datalines;
Daivd Wong          4500
Francis Leung       6000
Joe Chan            3000
;
run;
data Feb;
  input name $ 1-20 sales2;
  datalines;
Joe Chan            5000
Daivd Wong          6000
John Tai            4500
;
run;
data case10;
  set Jan Feb;
run;
Use IN= option to determine which data set contributed to the current observation SET sas_data_set (IN = in_variable) ; in_variable is a temporary numeric variable that equals 1 when the data set contributed to the current observation, 0 otherwise
data Jan;
  input name $ 1-20 sales;
  datalines;
Daivd Wong          4500
Francis Leung       6000
Joe Chan            3000
;
run;
data Feb;
  input name $ 1-20 sales;
  datalines;
Joe Chan            5000
Daivd Wong          6000
John Tai            4500
;
run;
data case11; set Jan (in=file1) Feb (in=file2); if file1=1 then month='Jan'; if file2=1 then month='Feb'; run;
DATA data_set_name ;
  MERGE sas_data_set1 <(data_set_options)> sas_data_set2 <(data_set_options)> <options>;
  BY variable1 <variable2 >;
  <Other DATA step statements>
RUN;
The data sets in the MERGE statement must be sorted by the values of the BY variables.
The available options are identical to those of the SET statement.
If variables that have the same name appear in more than one data set, the value of the variable is the value in the last data set that contains it.

How does it work?
Compilation phase - To prepare to merge data sets, SAS
1. reads the descriptor portions of the data sets that are listed in the MERGE statement
2. reads the rest of the DATA step program
3. creates the program data vector (PDV) for the merged data set
4. assigns a tracking pointer to each data set that is listed in the MERGE statement.
Execution phase
As the MERGE statement executes, SAS compares the observation at the pointer of each listed data set to see whether the BY values match.
  If yes, the observations are written to the PDV in the order in which the data sets appear in the MERGE statement.
  If no, SAS determines which of the BY values comes first and writes the observation that contains this value to the PDV.
At the end of each iteration, SAS writes the observation to the new data set and sets variables created by the DATA step to missing in the PDV.
If neither data set contains any more observations in the current BY group, variables that come from the listed data sets are also set to missing in the PDV. Otherwise, their values are retained in the PDV.

One-to-one with equal list matching
Example: Suppose the marks of MS1111 and MS1112 for each student are stored in the SAS data sets MS1111 and MS1112 respectively. To calculate the average mark for each student, the two data sets must be merged
data combinea; merge ms1111 ms1112; by id; run;
One-to-one with unequal list matching Some students took MS1111 but not MS1112, or vice versa
data combinec; merge ms1111 ms1112; by id; run;
Use IN= option to select observations that appear in both data sets
data combined; merge ms1111(in=ms1) ms1112(in=ms2); by id; if ms1=1 and ms2=1 then do; average_mark=(mark1+mark2)/2; output; end; run;
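An IN= variable only ever takes the values 0 and 1, so a matched-only merge can also be written with a subsetting IF. The following is a sketch under assumed data: the marks in ms1111 and ms1112 below are made up, and the output data set name both is hypothetical.

```sas
/* Sketch only: the marks and the data set name "both" are hypothetical. */
data ms1111;
  input id mark1;
  datalines;
1 70
2 80
3 90
;
run;

data ms1112;
  input id mark2;
  datalines;
2 60
3 50
4 40
;
run;

data both;
  merge ms1111 (in=ms1) ms1112 (in=ms2);
  by id;
  if ms1 and ms2;                        /* keep ids present in both data sets */
  average_mark = (mark1 + mark2) / 2;
run;

proc print data=both; run;
```

With these values only ids 2 and 3 appear in both data sets, so id 1 and id 4 are left out of the result.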
One-to-many / Many-to-one matching
The order of the data sets in the MERGE statement does not matter to SAS.
A one-to-many merge is the same as a many-to-one merge, although the order of the variables in the new data set is not the same.
Example: Suppose CUSTOMERID contains the profiles of customers and SALES contains the products purchased by each customer
data sale_profile; merge sales customerid; by id; run;
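To see how the single customer record is repeated across that customer's purchases, the merge can be run on a few made-up rows. In this sketch the contents of customerid and sales, including the variable product, are assumptions chosen only to make the example self-contained.

```sas
/* Sketch only: both input data sets and their values are hypothetical. */
data customerid;                 /* one observation per customer */
  input id gender $ age;
  datalines;
1 F 34
2 M 41
;
run;

data sales;                      /* possibly many observations per customer */
  input id product $;
  datalines;
1 Book
1 Pen
2 Lamp
;
run;

data sale_profile;
  merge sales customerid;
  by id;
run;

proc print data=sale_profile; run;
```

The result has one observation per row of sales. The gender and age values are carried along for every purchase by the same id because values read by MERGE are retained in the PDV within a BY group.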
Use IN= option to identify the non-matches Example: Suppose SALESA contains list of products purchased by some customers. You want to identify the group of customers who did not purchase any item at all
data sale_profile; merge customerid (in=file1) salesa(in=file2); by id; if file1=1 and file2=0 then output; keep id gender age; run;