Advanced Tutorials

Paper 9-26

MISSOVER, TRUNCOVER, and PAD, OH MY!!


or Making Sense of the INFILE and INPUT Statements.
Randall Cates, MPH, Technical Training Specialist

ABSTRACT
The SAS® System has many powerful tools to store, analyze and present
data. First, however, programmers need to get the data into SAS
datasets. This presentation delves into the intricacies of reading data
from sequential (text) files using the DATA step and the INFILE and
INPUT statements. Discussion focuses on the different options available
when reading different types of text files. For example, when should
you use the MISSOVER option, and when is the TRUNCOVER option more
appropriate? This paper assumes the audience has basic knowledge of
reading text files using the DATA step (Base SAS®) and is appropriate
for users on any operating system, although some options may be
restricted.

INTRODUCTION
Reading and understanding the SAS documentation can sometimes be a
challenge. This is evident in the INFILE statement, which has no fewer
than 34 different options. This can get very sticky when the data file
you need to read differs from the safe, easy columnar data files. So
how can we make sense of the plethora of options? This paper attempts
to clarify some of the confusion. Three situations are explored: first,
variable-length records, with both shortened values and missing data
points; next, reading in multiple files at once; finally, obtaining
data from remote operating systems and Web sites using the FILENAME
statement.

SO LITTLE TIME, SO MANY OPTIONS
When the data lines aren't complete, what option will read the data
correctly and completely? INFILE has a number of options available:

FLOWOVER    The default. Causes the INPUT statement to jump to the
            next record if it doesn't find values for all variables.
MISSOVER    Sets all empty variables to missing when reading a short
            line. However, it can also skip values.
STOPOVER    Stops the DATA step when it reads a short line.
TRUNCOVER   Forces the INPUT statement to stop reading when it gets to
            the end of a short line. This option will not skip
            information.
SCANOVER    Causes the INPUT statement to search the data lines for a
            character string specified in the INPUT statement.
PAD         Pads short lines with blanks to the length of the LRECL=
            option.

Note: SCANOVER and STOPOVER will not be discussed.

The following text file was created with MS-Notepad on Windows NT and
then read into a SAS dataset using INFILE and INPUT statements. Each
line should contain four data points: last and first names, employee ID
and job title. The grayed-out area denotes actual line lengths. (Note:
Most word processors on Windows and UNIX create variable-length lines,
whereas mainframe computers create files with lines of uniform length,
filled in by blanks.)

    LANGKAMM  SARAH    E0045  Mechanic
    TORRES    JAN      E0029  Pilot
    SMITH     MICHAEL  E0065
    LEISTNER  COLIN    E0116  Mechanic
    TOMAS     HARALD
    WADE      KIRSTEN  E0126  Pilot
    WAUGH     TIM      E0204  Pilot

Then two sets of code were submitted using different options on the
INFILE statement. First the lines were read in with Column Input:

    DATA test;
       INFILE "d:\infile\emplist.dat" <OPTIONS>;
       INPUT lastn $ 1-21 Firstn $ 22-31
             Empid $ 32-36 Jobcode $ 37-45;
    RUN;

Then List Input was used:

    DATA test;
       INFILE "d:\infile\emplist2.dat";
       INPUT lastn $ Firstn $ Empid $ Jobcode $;
    RUN;

FLOWOVER:
The FLOWOVER option is the default option on INFILE. Here, when the
INPUT statement reaches the end of non-blank characters without having
filled all variables, a new line is read into the Input Buffer and
INPUT attempts to fill the rest of the variables starting from column
one. The next time an INPUT statement is executed, a new line is
brought into the Input Buffer. The results (printed with PROC PRINT)
are below.

Column Input:

    Obs  Lastn     Firstn   Empid  Jobcode
      1  LANGKAMM  SARAH    E0045  Mechanic
      2  TORRES    JAN      E0029  SMITH
      3  LEISTNER  COLIN    E0116  Mechanic
      4  TOMAS     HARALD   WADE   WAUGH

In the second line, since the value PILOT did not extend to the
required number of columns for Jobcode (37-45), the INPUT statement
jumped to the next line to complete Jobcode. Similarly, for the fifth
line read in, the INPUT statement first jumped to the sixth line to
read Empid, then to the seventh line to read Jobcode.

List Input:

    Obs  Lastn     Firstn   Empid  Jobcode
      1  LANGKAMM  SARAH    E0045  Mechanic
      2  TORRES    JAN      E0029  Pilot
      3  SMITH     MICHAEL  E0065  LEISTNER
      4  TOMAS     HARALD   WADE   KIRSTEN
      5  WAUGH     TIM      E0204  Pilot

In this example the Pilot values are placed in the appropriate places,
but the INPUT statement still loops to the next line when unable to
fill all variables.

MISSOVER:
When the MISSOVER option is used on the INFILE statement, the INPUT
statement does not jump to the next line when reading a short line.
Instead, MISSOVER sets variables to missing.

Column Input:

    Obs  Lastn     Firstn   Empid  Jobcode
      1  LANGKAMM  SARAH    E0045  Mechanic
      2  TORRES    JAN      E0029
      3  SMITH     MICHAEL  E0065
      4  LEISTNER  COLIN    E0116  Mechanic
      5  TOMAS     HARALD
      6  WADE      KIRSTEN  E0126
      7  WAUGH     TIM      E0204

All lines are read in as separate records. Notice, however, that the
PILOT Jobcodes are still missing. When MISSOVER encounters the
End-Of-Line mark and has not read all required columns for a particular
variable, that variable is set to missing. This is better, but still
not perfect.
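
The contrast between the default FLOWOVER and MISSOVER under Column Input can be tried on a small scale. The following is a minimal sketch under stated assumptions, not the code that produced the tables above: the fileref emptmp, the dataset names flow and miss, and the column layout (Jobcode in columns 37-44, so that the Mechanic lines are full length) are chosen for illustration. It also assumes an operating system such as Windows or UNIX, where the FILE statement writes variable-length lines. The test data is written to a temporary file rather than placed after DATALINES, because SAS typically pads in-stream data lines with blanks, which can hide short-line behavior.

```sas
/* Hypothetical fileref for a temporary file (an assumption of this sketch). */
FILENAME emptmp TEMP;

/* Write seven variable-length lines. Each line ends at the last     */
/* character written, so lines 2, 3, 5, 6 and 7 are short.           */
DATA _NULL_;
   FILE emptmp;
   PUT @1 'LANGKAMM' @22 'SARAH'   @32 'E0045' @37 'Mechanic';
   PUT @1 'TORRES'   @22 'JAN'     @32 'E0029' @37 'Pilot';
   PUT @1 'SMITH'    @22 'MICHAEL' @32 'E0065';
   PUT @1 'LEISTNER' @22 'COLIN'   @32 'E0116' @37 'Mechanic';
   PUT @1 'TOMAS'    @22 'HARALD';
   PUT @1 'WADE'     @22 'KIRSTEN' @32 'E0126' @37 'Pilot';
   PUT @1 'WAUGH'    @22 'TIM'     @32 'E0204' @37 'Pilot';
RUN;

/* FLOWOVER is the default: on a short line, INPUT moves on to the   */
/* next line to look for the remaining values.                       */
DATA flow;
   INFILE emptmp;
   INPUT lastn $ 1-21 firstn $ 22-31 empid $ 32-36 jobcode $ 37-44;
RUN;

/* MISSOVER: INPUT stays on the current line, and a variable whose   */
/* columns are not all present is set to missing.                    */
DATA miss;
   INFILE emptmp MISSOVER;
   INPUT lastn $ 1-21 firstn $ 22-31 empid $ 32-36 jobcode $ 37-44;
RUN;

TITLE 'FLOWOVER (default)';
PROC PRINT DATA=flow;
RUN;

TITLE 'MISSOVER';
PROC PRINT DATA=miss;
RUN;

TITLE;
```

If the options behave as described in this section, the MISSOVER dataset should contain one observation per line with Jobcode missing on the Pilot lines, while the FLOWOVER dataset should contain fewer observations than there are lines in the file.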

List Input:

    Obs  Lastn     Firstn   Empid  Jobcode
      1  LANGKAMM  SARAH    E0045  Mechanic
      2  TORRES    JAN      E0029  Pilot
      3  SMITH     MICHAEL  E0065
      4  LEISTNER  COLIN    E0116  Mechanic
      5  TOMAS     HARALD
      6  WADE      KIRSTEN  E0126  Pilot
      7  WAUGH     TIM      E0204  Pilot

Since List Input doesn't specify explicit columns, these data lines can
be correctly read using the MISSOVER option.

TRUNCOVER:
The TRUNCOVER option acts similarly to MISSOVER and, in addition, will
take partial values to fill the first unfilled variable.

Column Input:

    Obs  Lastn     Firstn   Empid  Jobcode
      1  LANGKAMM  SARAH    E0045  Mechanic
      2  TORRES    JAN      E0029  Pilot
      3  SMITH     MICHAEL  E0065
      4  LEISTNER  COLIN    E0116  Mechanic
      5  TOMAS     HARALD
      6  WADE      KIRSTEN  E0126  Pilot
      7  WAUGH     TIM      E0204  Pilot

Here TRUNCOVER successfully reads the short lines, apportioning out the
values to the correct places. When the INPUT statement reaches a
foreshortened line, the TRUNCOVER option takes what's left (e.g. Pilot)
and assigns it to the appropriate variable. Other variables are set to
missing where necessary.

List Input:

    Obs  Lastn     Firstn   Empid  Jobcode
      1  LANGKAMM  SARAH    E0045  Mechanic
      2  TORRES    JAN      E0029  Pilot
      3  SMITH     MICHAEL  E0065
      4  LEISTNER  COLIN    E0116  Mechanic
      5  TOMAS     HARALD
      6  WADE      KIRSTEN  E0126  Pilot
      7  WAUGH     TIM      E0204  Pilot

Since List Input reads from delimiter to delimiter, TRUNCOVER can still
work.

PAD:
The PAD option does not replace the FLOWOVER option. Instead, the PAD
option adds blanks to short lines out to the logical record length
(LRECL). In this case, PAD takes the LRECL from the file information,
but you can specify LRECL= in the INFILE statement.

Column Input:

    Obs  Lastn     Firstn   Empid  Jobcode
      1  LANGKAMM  SARAH    E0045  Mechanic
      2  TORRES    JAN      E0029  Pilot
      3  SMITH     MICHAEL  E0065
      4  LEISTNER  COLIN    E0116  Mechanic
      5  TOMAS     HARALD
      6  WADE      KIRSTEN  E0126  Pilot
      7  WAUGH     TIM      E0204  Pilot

When reading in data with Column Input, SAS reads "just the columns,
Ma'am". Since the PAD option adds blanks, SAS can read the appropriate
columns without hitting the End-of-Line mark. So the data is read in
correctly.
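
TRUNCOVER and PAD can be compared side by side for Column Input. This is a minimal sketch under stated assumptions rather than the code behind the tables above: the fileref emptmp2, the dataset names trunc and padded, the four sample lines and the column layout (Jobcode in columns 37-44, with LRECL=44 to match) are all chosen for illustration. It assumes an operating system where the FILE statement writes variable-length lines.

```sas
/* Hypothetical fileref for a temporary test file (an assumption). */
FILENAME emptmp2 TEMP;

/* Write four lines; lines 2, 3 and 4 stop short of column 44. */
DATA _NULL_;
   FILE emptmp2;
   PUT @1 'LANGKAMM' @22 'SARAH'   @32 'E0045' @37 'Mechanic';
   PUT @1 'TORRES'   @22 'JAN'     @32 'E0029' @37 'Pilot';
   PUT @1 'SMITH'    @22 'MICHAEL' @32 'E0065';
   PUT @1 'TOMAS'    @22 'HARALD';
RUN;

/* TRUNCOVER: take whatever is left on a short line for the current */
/* variable and set any remaining variables to missing.             */
DATA trunc;
   INFILE emptmp2 TRUNCOVER;
   INPUT lastn $ 1-21 firstn $ 22-31 empid $ 32-36 jobcode $ 37-44;
RUN;

/* PAD: extend each short line with blanks out to LRECL=, so every  */
/* column range named in the INPUT statement exists on every line.  */
DATA padded;
   INFILE emptmp2 PAD LRECL=44;
   INPUT lastn $ 1-21 firstn $ 22-31 empid $ 32-36 jobcode $ 37-44;
RUN;

/* Compare the two results observation by observation. */
PROC COMPARE BASE=trunc COMPARE=padded;
RUN;
```

If both options read the short lines as described above, PROC COMPARE should report no differences between the two datasets. Choosing between them then comes down to the other considerations discussed in this paper, such as the extra work of padding every line.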

List Input:

    Obs  Lastn     Firstn   Empid  Jobcode
      1  LANGKAMM  SARAH    E0045  Mechanic
      2  TORRES    JAN      E0029  Pilot
      3  SMITH     MICHAEL  E0065  LEISTNER
      4  TOMAS     HARALD   WADE   KIRSTEN
      5  WAUGH     TIM      E0204  Pilot

List Input reads data from delimiter to delimiter. The default
delimiter character is a blank. Multiple delimiters are treated as one.
So with the PAD option in effect, and FLOWOVER still in effect, the
INPUT statement must look to the next line to fill the remaining
variables.

SYNOPSIS:
Reading files with variable line lengths can be frustrating, especially
when one doesn't fully understand how each option does, and doesn't,
work. The default option of FLOWOVER expects to fill all variables, and
uses multiple lines if necessary.

MISSOVER was originally created to be used in conjunction with PAD and
works effectively and well in most situations. However, this can be a
CPU-intensive process when reading an extremely large file.

STOPOVER is a good tool for checking code and raw data when dealing
with large, potentially messy files, since it forces the DATA step to
stop the first time it finds a short line.

TRUNCOVER was developed later than the MISSOVER and PAD options, and
deals admirably not only with short lines but also with short values.
TRUNCOVER is also more efficient since it doesn't require the extra
"padding".

One more point about variable-length files. It is possible to copy a
subset of any raw data file into the DATA step and run these options on
the subset. Use an INFILE DATALINES; statement, and add whichever
options are appropriate. For example:

    DATA test;
       INFILE DATALINES TRUNCOVER;
       INPUT lastn $ 1-20 firstn $ 21-30
             empid $ 31-35 jobcode $ 37-44;
    DATALINES;
    "add a number of data lines here sans semicolons"
    ;
    RUN;

ALL THE FILES, PLEASE
Another situation that might come up is where the raw data exists in
multiple files. Here the INFILE statement has a couple of options that
can help. The FILEVAR= option allows us to specify a variable, to be
filled during DATA step execution, that will contain the name of the
raw data file. The END= option allows us to set a variable that
registers, for each raw line read in, "is this the last line of the
file?". We can use these in a number of useful ways to input data from
multiple files to a SAS dataset. Here are a few possibilities.

First, just list the files in a series of DATALINES in the DATA step.

    DATA one;
       LENGTH fil2read $ 40;
       /* Point INPUT back at the in-stream list on every iteration; */
       /* otherwise it would read from the file opened below.        */
       INFILE DATALINES;
       INPUT fil2read $;
       INFILE dummy FILEVAR=fil2read END=done;
       DO WHILE (NOT done);
          INPUT lastn $ firstn $
                hiredate : mmddyy8.
                salary;
          OUTPUT;
       END;
    DATALINES;
    D:\Infile\emplist.dat
    D:\Infile\emplist1.dat
    D:\Infile\emplist2.dat
    D:\Infile\emplist3.dat
    D:\Infile\emplist4.dat
    ;
    RUN;

The first INPUT statement reads each line and saves the information to
a temporary variable. Add an INFILE statement with FILEVAR= set to the
variable just read in. Then set up an INPUT/OUTPUT loop to read each
file.

Be careful to set up the DO loop so that the DATA step never gets to
the End-Of-File marker on any file. Using the END= option on the second
INFILE statement sets up a temporary variable (done) which registers 0
(not the last record) or 1 (the last record) for each raw data line
read in from each file. This is necessary since, if SAS reads in any
End-of-File marker, the DATA step ends. By testing for DONE at the top
of the loop (DO WHILE), and exiting the DO loop after the last line of
every file, we ensure that we never hit the end-of-file for any of the
files read in. This remains true even for empty files.

A SAS dataset can also be used to store the names of the files; it is
read with a SET statement.

    DATA one;
       SET two;
       INFILE dummy FILEVAR=fil2read END=done;
       DO WHILE (NOT done);
          INPUT lastn $ firstn $
                hiredate : mmddyy8.
                salary;
          OUTPUT;
       END;
    RUN;

Finally, it's possible to read in filenames dynamically, using a
FILENAME statement with the PIPE option. This is useful when all of the
files are in the same directory. With the PIPE keyword, the FILENAME
statement can take an operating system command in quotes, and accept
the result as valid input. Unfortunately, this is not available on
mainframe operating systems.

    FILENAME indata PIPE "dir D:\Infile\*.dat /b";
    DATA test;
       LENGTH fil2read $40;
       INFILE indata MISSOVER;
       INPUT fil2read $;
       fil2read="d:\infile\"||fil2read;
       INFILE dummy FILEVAR=fil2read END=done;
       DO WHILE (NOT done);
          INPUT lastn $ firstn $
                hiredate : mmddyy8.
                salary;
          OUTPUT;
       END;
    RUN;

The information returned from the FILENAME statement is a list of all
files in D:\Infile with a .DAT type. One can specify all files, or (as
above) specific files. The DATA step can read this information with one
INFILE statement and then use it to read the files by applying it to a
FILEVAR= option on a second INFILE statement.

One limitation is that the Windows command (DIR) returns only the names
of the files without the pathnames. So the fil2read variable needs to
be augmented with the pathname in an assignment statement.

    fil2read="d:\infile\"||fil2read;

In UNIX, a similar FILENAME statement would read:

    FILENAME indata PIPE "ls /Infile/*.dat";

Given a full path as above, the UNIX ls command returns a fully
qualified path and filename.

THE FILES ARE WHERE??
This last topic is a little off the subject: you can use the FILENAME
statement with the FTP access method to access and read files on
another operating system. The FILENAME statement also has a URL access
method to read a file at a Web site.
Once a data source has been defined
by the FILENAME statement, a DATA step is able to access, open and read
the data using the usual INFILE/INPUT statements.

To access remote files using the FILENAME FTP access method, there are
a number of options to tell SAS how to get to the data. Fortunately, if
one is at all familiar with FTP, the options are relatively
straightforward.

This example prompts the user for a password, connects to a UNIX
server, moves to a particular directory (/Infile/Mydata), reads a file
named emplist.dat in the directory, and dumps each record into one
variable in the output dataset test.

    filename unix ftp 'emplist.dat'
       cd='Infile/Mydata'
       user='racate'
       host='test.unix.sas.com'
       prompt;
    data test;
       length name $ 300;
       infile unix truncover;
       /* $300. with TRUNCOVER takes the whole record, blanks included */
       input name $300.;
    run;

Other options are:

DEBUG    Writes information to the SAS log about the FTP process.
LRECL=   Logical record length of the remote file.
PASS=    Password to use on the remote server.
RECFM=   Record format: "F", "S" or "V".

Accessing Web pages is similar to the above code. Define a connection
to a Web page/site using FILENAME with the URL option, defining an http
web site as the pathname, with other options as necessary, then use
DATA step coding to read the file.

This example accesses a web page on the SAS Institute Inc. web site,
reads the first 15 lines of HTML code, and writes them to the log.

    filename foo url
       'http://www.sas.com/service/techsup/intro.html';
    data _null_;
       infile foo length=len;
       input record $varying200. len;
       put record $varying200. len;
       if _n_=15 then stop;
    run;

CONCLUSION
This paper has described some options of the INFILE and FILENAME
statements for different situations. There are many different types of
data files, and SAS can read in most, if not all. SAS can read
variable-length data files, delimited files, files with missing data,
multiple files per DATA step, files on other operating systems, even
HTML Web pages. With a broader knowledge of SAS's data reading
capabilities, programmers can accept data from multiple sources with
confidence.

REFERENCES
SAS Institute Inc., SAS® Language Reference, Version 8, Cary, NC: SAS
Institute Inc., 1999. 1256 pp.
SAS Institute Inc., SAS® Companion for the Microsoft Windows
Environment, Version 8, Cary, NC: SAS Institute Inc., 1999. 555 pp.
SAS Institute Inc., Technical Support Notes, TS-581, Using FILEVAR= to
read multiple external files in a DATA Step,
http://ftp.sas.com/techsup/download/technote/ts581.html, Cary, NC: SAS
Institute Inc., 2000. 5 pp.

CONTACT INFORMATION
Randall Cates, Technical Training Specialist II
SAS Institute Inc.
St. Louis Regional Office
MCI Building, Suite 550
100 South Fourth St.
St. Louis, MO 63102
(314)421-6364 ext. 8506
Randall.Cates@sas.com