REAP

THE RESTRICTION ENZYME ANALYSIS PACKAGE

VERSION 4.0
DOUG MCELROY
PAUL MORAN
*ELDREDGE BERMINGHAM
IRV KORNFIELD

DEPARTMENT OF ZOOLOGY
MIGRATORY FISH RESEARCH INSTITUTE
AND
CENTER FOR MARINE STUDIES
UNIVERSITY OF MAINE
ORONO, MAINE 04469

*SMITHSONIAN TROPICAL RESEARCH INSTITUTE
BALBOA, REPUBLIC OF PANAMA

ACKNOWLEDGEMENTS

REAP was developed under the auspices of the Department of Zoology, Migratory Fish Research
Institute and the Center for Marine Studies, University of Maine (DM, PM, IK) and the Smithsonian
Tropical Research Institute, Balboa, Republic of Panama (EB). Development of this package was
partially supported by grants from the National Science Foundation (BSR 84-16131) and NOAA Sea
Grant to IK.

We have benefitted greatly from the input of several people. Masatoshi Nei provided us with theoretical
and mathematical insight on multiple occasions and sent us a copy of Maxlike (written by F. Tajima and
L. Jin), which served as our basis for comparison when writing DA. R. Martin Ball supplied us with his
program Fragset, giving us a model for the flow of a distance estimation program. Joyce Miller discussed
with us alternative methods of computing nucleotide divergence and provided us with her RestSite
software to further evaluate DA. Robin Waples and Peter Grewe have independently developed a
program similar to Generate. We appreciate the contributions of these individuals, but any errors which
may exist in REAP are, of course, of our own derivation.

Doug McElroy
22 May 1991

TABLE OF CONTENTS

PART I. A REAP OVERVIEW

General Information
Contents of the Disk
A Note to Users
Installing REAP
Using REAP
Running in the Integrated Environment
Running Programs Independently
Batch Mode Processing
Comment Lines (Extremely Important !)
Input File Format
A Word about NEXUS
Getting Help

PART II. PROGRAM DESCRIPTIONS

Generate
Reduce
Group
D
DSE
DSize
DA
Monte
K

REFERENCES

PART I. A REAP OVERVIEW

GENERAL INFORMATION

The Restriction Enzyme Analysis Package (REAP) is designed to facilitate manipulation and phylogenetic
analysis of restriction fragment or restriction site data. REAP allows the user to (1) rapidly and reliably
create discrete character data sets for phenetic or cladistic analysis, (2) optimize those or any other
similar data sets by removing unnecessary characters or OTUs, (3) estimate evolutionary distance among
OTUs from fragment, site or sequence data, (4) estimate nucleotide divergence within and among
populations and (5) evaluate the level of heterogeneity in population frequency distributions via Monte
Carlo simulation. Input files suitable without modification for PAUP, PHYLIP and NTSYS can be
produced from a single pair of input files. Most programs can handle an unlimited number of OTUs and
up to 30,000 characters per OTU (see below for 2 exceptions).

Each of the REAP programs can run independently, as part of a batch process, or as a module in the
integrated environment. For programs running independently or as part of a batch, command line input
can be used to supply all necessary parameters.

Output files from REAP are suitable without modification as input for PAUP Version 2.4, PHYLIP
Version 3.1, and NTSYS Versions 1.40 and 1.60 (as well as any other versions compatible with these). In
addition, several REAP programs accept PAUP and PHYLIP files as input.

REAP was written in Turbo Pascal 5.0 and Turbo C 1.0 for IBM compatibles and should run on any
DOS-based machine. A numeric coprocessor is not required and will be emulated if not present. DOS
Version 3.0 or greater allows the most flexible use of the integrated environment.

A Macintosh version of REAP is in the final stages of development and, along with future IBM versions,
will support the NEXUS format for data input (see below).

CONTENTS OF THE DISK

The REAP disk contains all executable programs, example data files for each program, and example batch
files for batch mode processing. In addition, a sample batch file is provided which can be used to load the
REAP integrated environment. The file names and brief descriptions are as follows:

(0) Reap.exe       Integrated environment under which other programs may be run
(1) Generate.exe   Creates a D, PAUP or PHYLIP binary input file from a composite
                   haplotype file and its corresponding enzyme profile listing
(2) Reduce.exe     Removes extraneous characters from a binary file suitable for D,
                   PAUP or PHYLIP
(3) Group.exe      Collapses duplicate OTUs in a composite haplotype file or a binary
                   file suitable for D, PAUP or PHYLIP
(4) D.exe          Estimates d values from fragment or site data
(5) Dse.exe        Estimates d values and their standard errors from fragment or site
                   data; provides average number of sites and bases characterized by
                   each enzyme size class
(6) Dsize.exe      Estimates d values for a single enzyme class from fragment or site
                   data
(7) Da.exe         Estimates haplotype and nucleotide diversity within populations;
                   estimates nucleotide diversity between all pairs of populations
(8) Monte.exe      Conducts a Monte Carlo simulation test for heterogeneity among
                   population frequency distributions
(9) K.exe          Estimates Kimura distances (d values) from sequence data

(a) Haplotyp       Sample composite haplotype file
(b) Profiles       Sample enzyme profile listing
(c) Dadat          Sample population frequency/d value input file
(d) Kdat           Sample sequence data input file
(e) Montedat       Sample population frequency distribution matrix

(i)   Reap.bat     Sample batch file to load the integrated environment
(ii)  Create.bat   Sample batch file for creating and optimizing binary data sets
(iii) Analyze.bat  Sample batch file for creating, optimizing and analyzing binary
                   data sets

A NOTE TO USERS

REAP is distributed at no charge. All we ask is that you acknowledge REAP in any publications which
result from its use. Feel free to pass along this package to others; however, please ask the recipients to
notify us that they have received a copy. We will add them to our list of users. In this way, all users will
have access to updates and bug-fixes (which, hopefully, will not be necessary). We would like to avoid
having n different copies of the program floating around about which we are unaware, in the off chance
that a potentially significant bug is detected. The package has been thoroughly tested and debugged, and
we think it highly improbable that any computational errors remain, but saying there are absolutely no
bugs left (particularly for cosmetic-type problems) is like accepting a null hypothesis. Humor us by
registering any copies you pass on.

INSTALLING REAP

The version of DOS running on your machine will to some extent determine the way you install and use
REAP. DOS Version 3.0 or greater is infinitely better than earlier releases for working in the REAP
integrated environment, as it allows you to keep your data separate from the REAP programs (.EXE
files) on your hard disk. Under this scenario, the REAP programs reside in a subdirectory accessible as
part of the path, and data are kept in a separate (working) directory. The REAP programs are thus
effectively invisible once installed, and are accessed from within your data subdirectory. This should
prevent your applications subdirectories from getting cluttered with data files and minimize the chances of
deleting something important by mistake. Note that all of the REAP programs (including REAP.EXE)
must be in the same subdirectory.

You must somehow make your REAP subdirectory accessible from anywhere on your disk by including it
in the search path. The default search path is defined in your AUTOEXEC.BAT, and is in effect until it is
replaced by another PATH statement. We prefer not to load down the PATH statement in
AUTOEXEC.BAT with numerous subdirectories (such as REAP); rather, we recommend writing batch
files for each application, temporarily adding the necessary subdirectories to the path while a given
application is running. Such batch files are stored in a subdirectory which is part of the
AUTOEXEC.BAT PATH (e.g., C:\BIN). A sample batch file (REAP.BAT) for running the integrated
environment is included with the package. It is written on the assumption that you will store the REAP
programs in a directory called C:\REAP; you should change the first SET PATH (NOT SET TEMPATH)
statement accordingly if you prefer another name.

Without DOS 3.0 or greater, all programs have to be in the active (working) directory. This means that
either (1) programs and data must reside in the same directory or (2) you will have to redirect input and
output files by typing their entire path. Neither of these options is terribly elegant, and either could be tiresome
if your machine has a complex directory structure.

Once you have decided on a configuration for installation, simply copy all .EXE and .BAT files (except
REAP.BAT) from the REAP disk into the REAP subdirectory you have created. Example data files
should be copied into your data subdirectory. Batch files you create later can be stored in either the
REAP or the data subdirectory. REAP.BAT (if you choose to use it) should be copied to a subdirectory
in the path as set in AUTOEXEC.BAT. No other installation is necessary.

For users without a hard disk, no installation (other than making backups of the REAP disk) is required.

USING REAP

To use REAP to its fullest extent, you will need to construct (1) an OTU by composite haplotype file, (2)
a corresponding enzyme profile listing, and (3) a population by haplotype frequency distribution matrix
(with frequencies coded as absolute numbers observed). Refer to the documentation for GENERATE,
DA and MONTE for the particular structure of these files. From these 3 input files you should be able to
execute each of the REAP programs (except K, which requires an OTU by sequence matrix). Consider
the following diagram when developing your REAP strategy.

RUNNING IN THE INTEGRATED ENVIRONMENT

The REAP integrated environment is a shell under which all of the REAP modules (and in fact any other
executable file such as PAUP, PHYLIP or NTSYS) can be run. With your data subdirectory as the active
(working) directory, to enter the REAP environment (assuming the REAP subdirectory is part of the
path) simply type

REAP

This will open the environment module and allow you to select the modules (programs) you wish to run.
Modules are selected by number and you are returned to the environment once their execution is
completed. Since you are presumably operating from within the data subdirectory, all data files are visible
to the programs (that is, you need not supply their entire path).

Once you are familiar with the modules and their identification numbers in the REAP environment, you
may wish to enter the environment and immediately load a particular module. To facilitate this process,
REAP can be given a module number on the command line, as in

REAP 1

This will load the environment and immediately call module #1 (GENERATE). Once GENERATE
executes, you are returned to the environment.

RUNNING PROGRAMS INDEPENDENTLY

You need not load the integrated environment to use the REAP programs. Each is a stand-alone
application which behaves exactly as it does under the environment. The advantages of running programs
independently are two: (1) Command line input can be used to supply all input/output files and options
(you don't have to answer the prompts on the screen), and (2) Programs can be included as part of a
batch process (such as ANALYZE.BAT), which executes multiple programs without further prompting.
The integrated environment is hopefully not too tedious, but some purists may prefer the more intimate
connection with their machine provided by running programs independently.

To run a program independently, first assure yourself that the individual programs (not just REAP.BAT)
are in the path. A 'Bad command or file name' error when you try to invoke one of the programs is
indicative of a path problem. Then type the name of the program you wish to run, and any command line
parameters you wish to supply (such as input and output files). For example,

GENERATE haplotyp profiles data 2

loads GENERATE, using the files 'haplotyp' and 'profiles' as the composite haplotype and enzyme input
files, respectively. The output will be stored as 'data', and it will be in PAUP format. Note that an
equivalent statement would be
GENERATE haplotyp profiles data paup

Supplying parameters on the command line is optional; those not supplied will be requested from the
terminal just as they are when running in the environment. The following invocation is thus perfectly
valid:

GENERATE

Command line input is a sequential, not a random, process ! You cannot specify parameters at will, but
rather must supply them in the order in which they would be requested from the terminal. For the
example above, the first parameter supplied (haplotyp) is taken by the program as the composite
haplotype file, and so on. If you mix up the order or provide, for example, an output file name without
also giving both input file names, the program will crash.
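
As a rough illustration of the sequential rule (this is not part of REAP), the following Python sketch assembles a GENERATE command line and refuses any parameter whose predecessors are missing. The function name and argument names are hypothetical; only the order of the four parameters is taken from the text.

```python
def build_generate_command(haplotype=None, enzyme=None, output=None, fmt=None):
    """Build a GENERATE command line, enforcing sequential parameter order.

    A later parameter may only be given if every earlier one is present,
    mirroring the order in which GENERATE would prompt at the terminal.
    """
    names = ["haplotype file", "enzyme file", "output file", "format"]
    values = [haplotype, enzyme, output, fmt]
    parts = ["GENERATE"]
    missing = None  # name of the first omitted parameter, if any
    for name, value in zip(names, values):
        if value is None:
            if missing is None:
                missing = name
        elif missing is not None:
            raise ValueError(f"{name} given but {missing} omitted")
        else:
            parts.append(str(value))
    return " ".join(parts)


print(build_generate_command("haplotyp", "profiles", "data", "paup"))
print(build_generate_command("haplotyp"))
```

Checking the order before the program is invoked is simply a way to catch the mistake at the point where it is cheapest to fix; any parameters left off the end are still requested from the terminal.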

BATCH MODE PROCESSING

Since REAP programs can be run independent of the integrated environment, they can also be called from
within a batch file. Batch mode processing can be done either from within the environment (Option '10')
or directly from the command line (again assuming everything that needs to be is in the path). Batch
processing allows you to run multiple REAP programs without staying at the terminal to invoke each
program separately and supply parameters. Two example batch files (ANALYZE.BAT and
CREATE.BAT) are supplied with the package. These use the example data files to produce and analyze
binary matrices. All parameters are supplied within the batch file just as they would be if one were to
invoke the programs on the command line. Once started, each of the REAP programs is executed in
succession, using output from the previous program as input for the next. Some limitations on the nature
of command line input exist; refer to the discussion above on independent execution of
programs for specifics.

The advantage of batch processing is that it frees you from having to continually provide terminal input.
However, it does require you to alter the batch files in order to change the names of input and output
files.

User-created batch files should reside in the working subdirectory, or you will have to supply their entire
path (even when running Batch Mode from within the environment). In addition, within the batch file
itself, the entire path of any .EXE programs called must be supplied. The batch files provided
(ANALYZE.BAT and CREATE.BAT) assume that REAP programs are located in the C:\REAP
subdirectory. Of course, both problems of path above are moot if the REAP subdirectory is in the path.

COMMENT LINES (EXTREMELY IMPORTANT !)

Comment lines are accommodated by each of the programs in the REAP package. In fact, REDUCE and
GROUP require comments in the input file, for reasons discussed below. Comment lines occur in the first
n lines of a data file, and begin with a comment line marker in the first column. The number of lines is
unlimited, but each line can be no more than 255 characters in length. There are 4 types of comment line
markers, specific to particular file formats (PAUP, PHYLIP, D and HAPLOTYPE); these indicate to the
programs the format of the data files, and so it is essential that the particular comment line marker used in
a data file is consistent with the overall format of that file. That is, you can't put a PAUP comment
delimiter above an NTSYS-formatted data matrix.

REDUCE and GROUP require comments in order to properly process an input file. These programs,
rather than asking you for the file format of the input file, detect the file type and alter their execution
accordingly. An output file in the same file format is produced. As you will see below, input files without
comments are assumed to be PHYLIP files, which can be a problem if that assumption was not intended
by you.

As stated, there are 4 types of comment line markers, which occur in the first column of a comment line
and are specific to the format of a data file. The types and their associations are as follows:

(1) Open curly bracket ({) - HAPLOTYPE format. This symbol identifies a file as a composite
haplotype file suitable as input for GENERATE and GROUP. This symbol is only recognized by
these programs, as only they can operate on haplotype files.

(2) Double quote (") - D format. This symbol identifies a file as destined for phenetic analysis. Any
file which is to be used in NTSYS must have this format. Likewise, D, DSE, DSIZE, DA and
MONTE recognize only this type of comment. GENERATE, REDUCE and GROUP recognize this
symbol (in addition to others), as they are used to produce valid files for phenetic analysis.

(3) Exclamation (!) - PAUP format. This symbol identifies a file as a PAUP-suitable file. This symbol
is not compatible with any program requiring D format (" and ! are thus mutually exclusive). Only
GENERATE, REDUCE and GROUP recognize and use this symbol.

(4) Star (*) - PAUP format. This symbol is an alternative to the exclamation. While it makes no
difference in REAP, PAUP itself does somewhat different things with these two types of
comments. Again, only GENERATE, REDUCE and GROUP recognize this format.

Note that we have defined no comment symbol for PHYLIP files. PHYLIP does not allow comments,
and so there can be no symbol for them. But, for REDUCE and GROUP, which require comments in
order to determine what type of input file is being operated upon, a lack of any comments is taken to
indicate a PHYLIP file. This will obviously crash the program if you simply choose not to include
comments in a D file, for example, as the program will be trying to read something that isn't there. The
moral: USE COMMENTS AT ALL TIMES !! It will save you heartache and frustration.
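
The detection rule described above can be sketched in a few lines of Python. This is only an illustration of the logic as we have described it, not the code used by REDUCE or GROUP; the function name is hypothetical, and the sketch assumes that the first comment line determines the format.

```python
# Comment markers and the file formats they signal, as listed in the text.
MARKERS = {"{": "HAPLOTYPE", '"': "D", "!": "PAUP", "*": "PAUP"}


def detect_format(lines):
    """Return (format, comment_lines) for the text lines of a data file.

    Comments are the leading run of lines whose first column holds a marker.
    A file with no leading comments is taken to be PHYLIP, which is the
    assumption the text attributes to REDUCE and GROUP.
    """
    comments = []
    for line in lines:
        if line[:1] in MARKERS:
            comments.append(line)
        else:
            break
    if not comments:
        return "PHYLIP", comments
    return MARKERS[comments[0][0]], comments


print(detect_format(['"barnyard survey', "3 20", "#"]))
print(detect_format(["3 20", "#"]))
```

The second call shows the pitfall: a D file written without comments is reported as PHYLIP.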

This is not as bad as it seems. Comments are always a good thing, and since they are passed through all
of the procedures you carry out (except in the case of PHYLIP files), they can be of real value in
documenting the analytical path taken. In addition, each REAP program adds at least one line of
comments to the output file which results; as such, if you GENERATE your initial matrices using REAP,
there will be comments where comments should be for all of the subsequent programs. The only
limitation in this scheme comes if you choose to GROUP your haplotype file before you GENERATE the
binary matrix. Here, the composite haplotype file must have comments beginning with the curly bracket
({).

GENERATE requires two input files, each of which can have comments. Because a file does not really
have a 'type' (or format) until it comes out of GENERATE, all types of comment line markers are
acceptable. In addition, they need not be the same for both the haplotype and the enzyme file, nor do
they need to be the same as the type of file you wish to produce as output. For example, the haplotype
file could have comments delimited by the curly bracket (so that it could be GROUPed), and the enzyme
file could have double quote comments; if you choose PAUP format as output, these comments will be
converted to PAUP-type by GENERATE. If PHYLIP is the output format, the comments will not be
passed on.
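
The conversion of comment markers can be pictured with the following Python sketch. It is an illustration under stated assumptions rather than a description of GENERATE's internals: the function name is hypothetical, and the choice of '!' (rather than '*') as the PAUP marker written to the output is our assumption for the example.

```python
MARKERS = ("{", '"', "!", "*")  # comment markers described in the text

# ASSUMPTION: the marker written for each output format. The text says the
# comments become PAUP-type but does not say whether '!' or '*' is written.
OUTPUT_MARKER = {"D": '"', "PAUP": "!", "PHYLIP": None}


def convert_comments(comment_lines, output_format):
    """Re-mark comment lines for the chosen output format.

    Returns an empty list for PHYLIP, since PHYLIP files carry no comments.
    """
    marker = OUTPUT_MARKER[output_format]
    if marker is None:
        return []
    converted = []
    for line in comment_lines:
        body = line[1:] if line[:1] in MARKERS else line
        converted.append(marker + body)
    return converted


print(convert_comments(["{selected farm animals", '"for barnyard survey'], "PAUP"))
```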

To recap, the following is a list of the programs in the REAP package and the comment line symbols they
recognize. Any others (indicated by spaces) are not valid for a given program.

(1) GENERATE  { " ! *     (4) D      "     (7) DA     "
(2) REDUCE      " ! *     (5) DSE    "     (8) MONTE  "
(3) GROUP     { " ! *     (6) DSIZE  "     (9) K      "

INPUT FILE FORMAT

Each of the REAP programs has its own particular requirements for input files. These are described
explicitly in the formal documentation of the programs. However, we have tried to adhere to a very
generalized and consistent input file format which has several characteristics common to all of the
programs. The largest variation comes with programs such as GENERATE, REDUCE and GROUP
which deal with multiple types of file formats. For these, requirements of D, PAUP, and PHYLIP must be
followed; still, the requirements of REAP in general (as described below) underlie this level of structure.
Basically, there are three essential characteristics an input file must have:

(1) The number of OTUs, etc. must occur in the first line below the comments
(2) The # symbol (unless the file is already in PAUP or PHYLIP format) must occur in the first column
of the line just above the data matrix.
(3) OTU names must fill the first 8 spaces of each data stream (10 spaces for PHYLIP files, exactly
8 for MONTE, K and composite haplotype files for GENERATE).

Follow these restrictions to the letter. It should be relatively easy to create a valid input file for any of the
programs. Remember that, if you use REAP exclusively, much of the data set construction is obviated, as
REAP produces output files suitable as input for the next level of analysis. DA, MONTE, and K are
exceptions, as these programs require somewhat unique input files.
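
Before running a program, the three general rules above can be checked mechanically. The Python sketch below is a hypothetical helper (not supplied with REAP) aimed at matrix files such as the composite haplotype file; it checks only those three rules and knows nothing about the further requirements of any particular program.

```python
COMMENT_MARKERS = ("{", '"', "!", "*")


def check_layout(lines, label_width=8):
    """Return a list of problems found in a '#'-delimited input file.

    label_width is 8 for most files; the text gives 10 for PHYLIP files.
    """
    problems = []
    i = 0
    while i < len(lines) and lines[i][:1] in COMMENT_MARKERS:
        i += 1  # skip the leading comment lines
    body = lines[i:]
    first = body[0].split() if body else []
    if not first or not first[0].isdigit():
        problems.append("first line below the comments must start with the number of OTUs")
    marker_rows = [n for n, ln in enumerate(body) if ln.startswith("#")]
    if not marker_rows:
        problems.append("no '#' in the first column above the data matrix")
        return problems
    for ln in body[marker_rows[0] + 1:]:
        if ln.strip() and len(ln) <= label_width:
            problems.append(f"no data after the {label_width}-column label: {ln!r}")
    return problems


good = ['"demo', "3 5 S", "#", "COWS    AAAAA", "PIGS    AABAA", "HORSES  BAABA"]
print(check_layout(good))
```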

Of course, errors in data set construction are inevitable. Probably the most common mistake is to leave
out the '#' symbol above the matrix, or to place a character where an integer is expected. Run-time errors
such as these will cause the program to crash and will provide you with an error code number. Error
codes are reported in the following format:

Run-Time error nnn at xxxx:yyyy

where nnn is the number of interest. Knowing the error code makes correcting your mistakes relatively
simple; however, if you are in the integrated environment, this message will quickly disappear from sight
as the main menu reappears. As such, we recommend that, if you encounter a run-time error for which an
explanation does not immediately come to mind, you run the offending program independent of the
environment - this will prevent the environment from writing over the error message, allowing you to see
it. Below are the Turbo Pascal Run-Time error codes which you are likely to encounter and their
corresponding descriptions.

Error   Description              Interpretation

002     File not found           Self-explanatory.
003     Path not found           An invalid or non-existent subdirectory was specified.
004     Too many open files      There is no FILES=XX statement in CONFIG.SYS. At least
                                 20 files is a safe number.
005     File access denied       A read-only file was specified for output, or the disk
                                 is full.
106     Invalid numeric format   A character appears in the input file when a numeric
                                 value is expected.

These errors will cause the program to crash. However, it's possible to have mistakes in your input file
which will cause the program to either hang (sit there and do nothing) or output incorrect results; refer to
the formal program descriptions for insight into these types of problems.
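
If you capture the error line from the screen, the error code table above can be applied with a small lookup such as the Python sketch below. It is a convenience illustration only; the function name is hypothetical, and the address in the sample message is made up for the example.

```python
import re

# Run-time error codes and descriptions from the table in the text.
ERROR_CODES = {
    2: "File not found",
    3: "Path not found",
    4: "Too many open files",
    5: "File access denied",
    106: "Invalid numeric format",
}

# Matches messages of the form "Run-Time error nnn at xxxx:yyyy".
PATTERN = re.compile(r"run-?time error\s+(\d+)\s+at\s+\S+", re.IGNORECASE)


def explain(message):
    """Return 'code: description' for a run-time error line, else None."""
    match = PATTERN.search(message)
    if match is None:
        return None
    code = int(match.group(1))
    return f"{code:03d}: {ERROR_CODES.get(code, 'not listed in the table')}"


print(explain("Run-Time error 106 at 0000:1A2B"))
```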

A WORD ABOUT NEXUS

Recently David Swofford, Wayne Maddison and David Maddison have put forth the NEXUS format as a
potential standard input format for phylogenetic software packages; in fact, both Swofford's version of
PAUP for the Macintosh and the Maddisons' MacCLADE package fully support the NEXUS concept,
and thus allow complete portability of data files between programs. Clearly, the need for standardized
input styles is acute (which, after all, was the driving force behind the development of REAP), and
NEXUS represents a very good solution. Although its usage is currently restricted to Macintosh
applications, it seems likely that the new IBM version of PAUP will support the NEXUS standard as well
(at the very least, PAUP should be less rigid in its formatting requirements). We thus seriously considered
incorporating NEXUS as an option for data file construction in REAP. We decided, however, that it
would be best to wait for the new version of PAUP to arrive before embarking on this task. Future
versions of REAP (both IBM and Macintosh) will certainly support NEXUS; until then, the PHYLIP
option produces a data file which can easily be modified to meet NEXUS guidelines.

GETTING HELP

If you have questions about REAP or require some assistance in running the software, feel free to contact
us at the address below. Again, it is a good idea to register your copy with us, as then you will be sure to
receive any updates, corrections, etc. We would also appreciate any suggestions for improvement of the
package.

Doug McElroy
Department of Biology
Western Kentucky University
Bowling Green, Kentucky 42101
(502) 745-5996
FAX (502) 745-6856
EMail Doug.McElroy@wku.edu

PART II. PROGRAM DESCRIPTIONS

GENERATE
GENERATE produces a binary character state matrix from (1) a file of OTUs and their composite
restriction phenotypes from fragment or site data and (2) a corresponding file of the binary representation
of restriction phenotypes by enzyme. There are no limits on the number of OTUs; up to 100 enzymes,
each with 100 characters, can be processed; however, program limitations inherent to D, PAUP and
PHYLIP, for which GENERATE produces suitable input files, place effective limits on the size of a data
matrix. In the output file, alphanumeric restriction phenotype designators from the composite haplotype
file are replaced with the appropriate binary code for each enzyme, thus generating a rectangular
character state matrix. GENERATE can be used to create valid D, PAUP or PHYLIP input files (which
can then be optimized via REDUCE and GROUP), which are suitable without modification for use with
those programs.

Two input files are required. The composite haplotype file consists of a rectangular data matrix, with
alphanumeric characters corresponding to alternative restriction phenotypes across OTUs for a series of
restriction enzymes; the corresponding enzyme profile input file is a tabular matrix (by enzyme) of the
binary representations of those restriction phenotypes specified in the haplotype file. The enzyme file may
contain additional phenotypes not represented in the haplotype file (see below).

In addition to the haplotype matrix and its corresponding enzyme profile listing, several other parameters
are required in the input files, including :
(1) The number of OTUs
(2) The number of characters (both alphanumeric and binary)
(3) The type of data (Fragment or Site)
(4) The r value (4, 4.6, 5, 5.3 or 6) of each restriction enzyme employed
(5) A symbol # marking the beginning of the matrix for each input file
This is merely a list of the necessary information. Its appropriate formatting is discussed below.
Comment lines are optional; these are discussed below.

Four variables are either read from the terminal or supplied on the command line. For terminal input, the
program first prompts the user for the name of the composite haplotype input file. This name can occupy
up to 50 characters (including colons and periods). As a result, data can be supplied either from disk or
from any subdirectory. The program then asks for the name of the enzyme profile input file. Again, there
is a 50 character limit. The next prompt requests the name to be given the output file (50 characters
maximum). Finally, the user is asked to select a format for the output file. The options (1) D, (2) PAUP
or (3) PHYLIP may be specified by either number or name. The syntax for command line input is

GENERATE <haplotypefile> <enzymefile> <outputfile> <format>

Command line input is sequential; a given parameter may only be specified if all preceding parameters are
also provided. Terminal input will be requested to supply the remaining parameters.

Output consists of a rectangular binary character state matrix suitable without modification as input for
D, PAUP or PHYLIP.

FORMAT OF GENERATE INPUT FILES

COMPOSITE HAPLOTYPE FILE

LINE 1.   a. The number of OTUs.
          b. The number of enzymes employed (columns of the haplotype matrix).
             These are read as integers, and must be separated by at least one space.
          c. The type of data (Specify either 'F' or 'S').
             This must be separated by spaces from the number of enzymes.

LINE 2. IN THE FIRST COLUMN ! : # to indicate that the matrix begins on the next line.

FINALLY The data matrix. Data for each OTU should begin with a label. This label must
occupy 8 spaces, although blanks can be used. No composite haplotype data should
occur before COLUMN 9, as the program indiscriminately reads the first 8 characters as
the OTU label. Similarly, labels longer than 8 characters may cause an error. Composite
haplotype data are a series of alphanumeric characters (CASE SENSITIVE), which need not
be separated by spaces. Blanks or carriage returns will be ignored, although other
internal characters will cause an error, as they will be read as haplotype designators.
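
To make the layout concrete, here is a minimal Python sketch of a reader for the composite haplotype file as described above. It is not GENERATE's own code: the function name is hypothetical, and the sketch assumes one OTU per line even though carriage returns within the data are ignored by the program.

```python
COMMENT_MARKERS = ("{", '"', "!", "*")


def parse_haplotype_file(text):
    """Parse composite haplotype text into (n_otus, n_enzymes, kind, rows).

    rows is a list of (label, designators) pairs, one pair per OTU.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    while lines and lines[0][0] in COMMENT_MARKERS:
        lines.pop(0)  # skip leading comment lines
    n_otus, n_enzymes, kind = lines[0].split()
    n_otus, n_enzymes = int(n_otus), int(n_enzymes)
    if kind not in ("F", "S"):
        raise ValueError("data type must be 'F' or 'S'")
    if not lines[1].startswith("#"):
        raise ValueError("'#' must be in the first column of the line above the matrix")
    rows = []
    for line in lines[2:2 + n_otus]:
        label = line[:8].rstrip()                # columns 1-8 are always the label
        designators = line[8:].replace(" ", "")  # blanks are ignored
        if len(designators) != n_enzymes:
            raise ValueError(f"{label}: expected {n_enzymes} designators")
        rows.append((label, designators))
    if len(rows) != n_otus:
        raise ValueError("fewer OTU rows than declared on the first line")
    return n_otus, n_enzymes, kind, rows


sample = "{farm animals\n3 5 S\n#\nCOWS    AAAAA\nPIGS    AABAA\nHORSES  BA ABA\n"
print(parse_haplotype_file(sample))
```

Note that the label is taken by column position, not by splitting on blanks, which is why data must never start before column 9.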

ENZYME PROFILE FILE

LINE 1.   IN THE FIRST COLUMN ! : # to indicate that the enzyme data begins on the next
          line.

NEXT      FOR EACH ENZYME EMPLOYED. The enzyme data.

LINE A.   i.   The enzyme label.
               This should occupy the first 8 characters of this line.
          ii.  The number of different haplotypes (n below) generated by that enzyme.
          iii. The number of binary characters (m below) generated by that enzyme.
          iv.  The r value of that enzyme (4, 4.6, 5, 5.3 or 6).
               These three values are read as numbers, and must be separated by spaces.

NEXT n LINES. The haplotype designators followed by their corresponding binary
          representations. Designators can be any single alphanumeric character;
          designators are case-sensitive (that is, 'a' is not the same haplotype as 'A').
          Binary data are a string of m characters ('0' or '1') representing the
          presence and absence of particular characters in each haplotype. This string may
          include internal blanks (' '), and the structure of these n lines is free format.
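
The enzyme profile layout can be read in the same spirit. The Python sketch below is again only an illustration with a hypothetical function name; it assumes one haplotype per line after each enzyme header and that the n declared on the header matches the number of haplotype lines that follow.

```python
COMMENT_MARKERS = ("{", '"', "!", "*")


def parse_enzyme_file(text):
    """Parse enzyme profile text into a list of enzyme records.

    Each record is (label, r_value, {designator: binary_string}).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    while lines and lines[0][0] in COMMENT_MARKERS:
        lines.pop(0)  # skip leading comment lines
    if not lines or not lines[0].startswith("#"):
        raise ValueError("'#' must be in the first column above the enzyme data")
    enzymes, i = [], 1
    while i < len(lines):
        header = lines[i]
        label = header[:8].rstrip()               # first 8 columns hold the label
        n, m, r = header[8:].split()              # haplotypes, characters, r value
        n, m, r = int(n), int(m), float(r)
        codes = {}
        for line in lines[i + 1:i + 1 + n]:
            stripped = line.strip()
            designator = stripped[0]              # single, case-sensitive character
            bits = stripped[1:].replace(" ", "")  # internal blanks are allowed
            if len(bits) != m or set(bits) - {"0", "1"}:
                raise ValueError(f"{label} {designator}: need {m} binary characters")
            codes[designator] = bits
        if len(codes) != n:
            raise ValueError(f"{label}: expected {n} distinct haplotypes")
        enzymes.append((label, r, codes))
        i += 1 + n
    return enzymes


sample = '"demo\n#\nAvaII    1 2 4.6\nA 11\nAvaI     2 3 5.3\nA 1 1 1\nB 011\n'
print(parse_enzyme_file(sample))
```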

EXAMPLE INPUT FILES

3 5 S                    #
#                        EcoRI    2 3 6
COWS    AAAAA            A 110
PIGS    AABAA            B 100
HORSES  BAABA            C 011
                         AvaII    1 2 4.6
                         A 11
                         AluI     2 6 4
                         A 111101
                         B 011100
                         AvaI     2 3 5.3
                         A 111
                         B 011
                         HinfI    1 6 5
                         A 111111

The example file on the left represents the composite haplotype file. The first line of this file specifies the
number of OTUs and the number of characters. The final entry on this line identifies the type of data
contained in the enzyme file (right-hand example) as site data.

The symbol # occurs on the line just above the data matrix, and must be in the first column. The
program uses this symbol as a positional reference during processing.

The final N lines of the file represent the haplotype matrix. The first eight columns comprise the OTU
label; anything under eight characters must be filled in by spaces; anything over eight characters (except
spaces) will cause an error. Following the label, the composite haplotypes (across all enzymes) of each
OTU are coded. These are read in as characters, and so need not be separated by spaces. Any
alphanumeric character is valid; upper and lower case characters are taken as different. Spaces can be
included at any point for readability; these will be ignored.

The example file on the right represents the enzyme profile file. The symbol # occurs in the line just
above the data array and must be in the first column. The program uses this symbol as a positional
reference during processing.

Following the # symbol, data for each enzyme are arranged sequentially. THE RELATIONSHIP
BETWEEN HAPLOTYPE DESIGNATORS IN THE COMPOSITE HAPLOTYPE FILE AND THE
ENZYMES THEY WERE GENERATED BY IS POSITIONAL; that is, the first column of the
haplotype file corresponds to the first enzyme described in the enzyme file, and so on.

For each enzyme, the first line specifies the name of the enzyme, the number of haplotypes distinguished,
the number of characters generated, and the r value of the enzyme. The first eight columns of this line
comprise the enzyme LABEL; anything under eight characters must be filled in by spaces, and anything
over eight characters will cause an error. The next three values must be separated by spaces.
The next N lines of the enzyme file identify the particular alphanumeric haplotypes resulting from that
enzyme and their binary representations. Haplotype designators are a single character. These lines are
free format. Note that haplotypes not present in the composite haplotype file (for example the 'C'
haplotype of EcoRI) can occur in the enzyme file; in this way, the enzyme file can be quite generalized, as
uninformative or extraneous characters in the output file resulting from subsampling of the enzyme file
(e.g., character 3 of EcoRI) can be removed by REDUCE.

The output file resulting from these two input files (choosing the 'D' option for file formatting) is identical
to the example input file provided in the documentation for D. Valid PAUP and PHYLIP input files, with
program-specific format characteristics, can also be generated from these input files.
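
The positional lookup described above can be sketched in Python. This is a minimal illustration and not the GENERATE source: the function name expand_haplotypes, the in-memory data structures and the two-enzyme sample data (EnzA, EnzB) are assumptions made for the example.

```python
import warnings

def expand_haplotypes(otus, enzymes):
    """Expand composite haplotypes into a binary character matrix.

    otus    -- list of (label, composite) pairs; composite holds one
               designator per enzyme, in enzyme-file order
    enzymes -- list of (name, n_chars, profiles); profiles maps a
               single-character designator to its binary string
    """
    matrix = []
    for label, composite in otus:
        designators = composite.replace(" ", "")  # spaces are ignored
        row = []
        for (name, n_chars, profiles), code in zip(enzymes, designators):
            if code in profiles:  # lookup is case-sensitive
                row.append(profiles[code])
            else:
                warnings.warn(f"{label.strip()}: haplotype {code!r} "
                              f"is not defined for {name}")
                row.append(" " * n_chars)  # leave this portion blank
        matrix.append((label, "".join(row)))
    return matrix

# Hypothetical data set, not the example files shown above.
enzymes = [
    ("EnzA", 3, {"A": "110", "B": "100"}),
    ("EnzB", 2, {"A": "11", "B": "01"}),
]
otus = [("COWS    ", "AA"), ("HORSES  ", "B B")]
for label, row in expand_haplotypes(otus, enzymes):
    print(label + row)
```

The first designator of each OTU is resolved against the first enzyme, the second against the second enzyme, and so on, which is why the order of enzymes must agree between the two files.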

USE OF COMMENT LINES (OPTIONAL)

In order to facilitate record-keeping, comment lines are accommodated by GENERATE. These lines
occur at the beginning of either or both of the input files and can be prepared in any format compatible
with the REAP package (see ahead for types and their usage); the comment line types need not be the
same for both input files. However, comment lines can be no longer than 255 characters. The number of
lines is not constrained. Because of the input format, these lines will be passed through to the output
(except to PHYLIP-compatible files, which do not allow comments) and through any subsequent
procedures. The test data files presented above, if supplied with comment lines, would start:

{Biochemical systematics of            "Restriction site patterns
{selected farm animals                 "for barnyard survey
{Data from Orwell and MacDonald        #
{EcoRI, AvaII, AluI, AvaI, HinfI
3 5 S
#

TROUBLESHOOTING

The most critical aspect of input file preparation is to ensure that the number of OTUs and the number of
enzymes specified in the first line of the composite haplotype file (assuming no comment lines) is correct
with respect to the data matrix. A similar consistency is critical in the enzyme file, where the number of
haplotypes and number of binary characters identified for each enzyme must match the list of haplotypes
and their representations which follow. Finally, the number of enzymes defined in the haplotype file must
match the number of enzymes in the enzyme file. If the size of any of the data matrices is underestimated
by the lines which specify their size, the program will either (1) crash at some point or (2) output an
incorrect matrix because of a phase-shift error. In contrast, if the size of matrices is overestimated, the
program will get stuck in the READ procedure and spin its wheels indefinitely. The errors here can be
varied, but should be easy to correct.

Be sure the line specifying the number of OTUs, number of characters and type of data is in the first line
below the comments of the composite haplotype file.

Be sure the symbol # marking the beginning of the data matrix is in the first column of the line just above
the data matrix.

Be sure that any haplotype identified for an OTU is actually defined in the enzyme profile listing for that
enzyme. If the program fails to find a match for an alphanumeric designator given in the haplotype file, it
leaves that portion of the output matrix blank and issues a warning to the terminal. This error is most
likely to occur if one forgets that designators are case-sensitive, or if one makes a habit of deleting
haplotypes from the enzyme file. This latter situation should be completely unnecessary, as REDUCE
allows you to optimize your binary matrix prior to phylogenetic analysis.

SUMMARY OF LIMITATIONS

1. The OTU label should occupy exactly 8 spaces
2. The enzyme LABEL should occupy 8 spaces
3. Haplotype designators must be single alphanumeric characters and are case-sensitive
4. The number of OTUs and characters must occur in the first line below the comments
5. The symbol # must occur in the first column of the line just above the data matrix
6. Entries (enzymes described) in the haplotype and enzyme files must be positionally
related
7. All haplotypes in the composite haplotype file should be defined in the enzyme file

REDUCE

REDUCE removes uninformative characters from a binary character state matrix. The number of OTUs
is unlimited; a maximum of 30000 characters is allowed. The program is designed to operate on files
produced by GENERATE, but will work on any binary character data set suitable for D, PAUP or
PHYLIP. As such, requirements and limitations inherent to those programs are followed. All characters
with character state '0' across OTUs are eliminated automatically; removal of monomorphic (state '1' for
all OTUs) and/or autapomorphic characters is optional. Autapomorphic character removal assumes no
polarity of character states, and will eliminate characters for which the frequency of the '0' or '1' state is
(1/No. OTUs). The program detects the format of the input file (D, PAUP or PHYLIP) by the type of
comment line delimiter used (see ahead for types and their usage). Input files without comment lines are
taken to be of PHYLIP format. The same format is retained in the output file. The ability to readily
exclude uninformative characters has 2 advantages: (1) the user does not have to tailor files of restriction
enzyme profiles used as input for GENERATE in order to eliminate OTUs during analyses; (2) matrices
can be made minimally redundant for cladistic analysis, if one is not concerned with the accumulation of
autapomorphic state changes.
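
The removal rules above can be sketched as a short Python function. This is only an illustration under stated assumptions: the name reduce_matrix and the list-of-strings representation are hypothetical, and no file handling or format detection is attempted.

```python
def reduce_matrix(rows, drop_monomorphic=False, drop_autapomorphic=False):
    """Return (kept_column_indices, reduced_rows) for equal-length 0/1 strings."""
    n_otus = len(rows)
    keep = []
    for col in range(len(rows[0])):
        ones = sum(row[col] == "1" for row in rows)
        if ones == 0:
            continue  # state '0' in every OTU: always removed
        if drop_monomorphic and ones == n_otus:
            continue  # state '1' in every OTU
        if drop_autapomorphic and ones in (1, n_otus - 1):
            continue  # one of the two states occurs in exactly one OTU
        keep.append(col)
    return keep, ["".join(row[c] for c in keep) for row in rows]

# Hypothetical four-OTU matrix with six characters.
rows = ["110101", "110000", "101000", "101100"]
keep, reduced = reduce_matrix(rows, drop_monomorphic=True,
                              drop_autapomorphic=True)
print(keep)      # columns retained (0-based)
print(reduced)
```

In this sample the first character is monomorphic, the fifth is '0' throughout and the sixth is autapomorphic, so only the second, third and fourth characters are retained.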

Four variables are either read from the terminal or supplied on the command line. For terminal input, the
program first prompts the user for the name of the input file. This name can occupy up to 50 characters
(including colons and periods). As a result, data can be supplied either from disk or from any
subdirectory. The program then asks for the name to be given the output file. Again there is a 50
character limit. The next two variables indicate whether (1) monomorphic and/or (2) autapomorphic
characters are to be eliminated. Removal is indicated by 'Y' or 'y'; 'N' or 'n' will cause the character type
to be retained. The syntax for command line input is

REDUCE <infilename> <outfilename> <monomorphic> <autapomorphic>

Command line input is sequential; a given parameter may only be specified if all preceding parameters are
also provided. Terminal input will be requested to supply the remainder.
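
The sequential rule can be pictured with the following Python sketch. It is not the REAP implementation; the helper name collect_parameters and the file names cows.dat and cows.red are hypothetical.

```python
def collect_parameters(argv, names, ask=input):
    """Take leading parameters from argv and prompt for whatever is missing."""
    # Positional: the first argument fills the first name, and so on.
    values = dict(zip(names, argv))
    for name in names[len(argv):]:
        values[name] = ask(f"{name}: ")
    return values

names = ["infilename", "outfilename", "monomorphic", "autapomorphic"]

# Two parameters given on the command line; the other two are requested.
# Canned answers stand in for terminal input so the sketch runs unattended.
answers = iter(["Y", "n"])
params = collect_parameters(["cows.dat", "cows.red"], names,
                            ask=lambda prompt: next(answers))
print(params)
```

Because the arguments are matched by position, there is no way to supply the third parameter while omitting the second.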

TROUBLESHOOTING

The most critical aspect of input file preparation is to ensure that the structure of the file is consistent
with its format. Establishing this internal consistency has two components. First, the user must be
confident that the input file is suitable without modification for either D, PAUP or PHYLIP. Second, the
user must begin comment lines with symbols appropriate to the file format. The autodetection of input
file format requires the user to supply comments (except in the case of PHYLIP files, which do not allow
them). Files lacking comments are assumed to be of PHYLIP type, so a D or PAUP file whose comments
were simply omitted will cause an error. GENERATE will add its own comment lines, so input
files produced in this manner should be trouble-free.

SUMMARY OF LIMITATIONS

1. Program requirements and limitations follow those of D, PAUP and PHYLIP

PRACTICAL CONSIDERATIONS

One should consider the interaction of REDUCE and GROUP (see below) before commencing analysis
of data, as the sequence in which these programs are executed can affect the size of the final character
state matrix (although not its minimum information content). As an example, consider a matrix of six
OTUs versus the same matrix with duplicate OTUs collapsed into a single OTU via GROUP.

OTU1    111100001111        GROUP1  111100001111
OTU2    111100001111        GROUP2  001111111111
OTU3    001111111111        OTU5    000000001011
OTU4    001111111111        OTU6    000011111011
OTU5    000000001011
OTU6    000011111011

If one were to REDUCE these matrices, eliminating autapomorphic characters, one would be left with 2
fewer characters in the output from the GROUPed input file. The reason for this is that the first two
characters in the matrix, while synapomorphic and uniting OTU1 and OTU2 in the left-hand matrix,
collapse to autapomorphic characters in the right-hand matrix, when duplicate OTUs have been
eliminated. This type of result is of no real significance, as synapomorphic characters are only lost from
grouped files when the OTUs are identical; however, if one wishes to maximally optimize the size of the
data matrix and the efficiency of processing, one should GROUP matrices prior to running REDUCE.

A final point. The user should consider the value of eliminating characters and/or grouping OTUs in data
sets which are to be analyzed using a repetitive sampling routine such as the bootstrap. We offer no
advice on this matter, but only bring it to the attention of users.

GROUP

GROUP identifies those OTUs in a binary character state matrix (or its composite haplotype precursor)
having identical composite restriction phenotypes and collapses them into a single OTU. The number of
OTUs is unlimited; a maximum of 30000 characters is allowed. The program is designed to operate on
either composite haplotype files used as input for GENERATE or on GENERATE-produced binary files,
but will work on any binary character data file suitable for D, PAUP or PHYLIP. As such, requirements
and limitations inherent to those programs are followed. Any number of different groups of OTUs can be
created, and members of each group (OTU labels) are retained in comment lines (except for PHYLIP
data files, which do not allow comments). The program detects the format of the input file (D, PAUP,
PHYLIP or HAPLOTYPE) by the type of comment line delimiter used (see ahead for types and their
usage). Input files without comment lines are taken to be of PHYLIP format. The same format is
retained in the output file. The ability to readily collapse equivalent OTUs simplifies the data matrix and
should increase processing speed during clustering, as nodes of zero branch length are eliminated.
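
The collapsing step can be sketched as follows. This is an illustration only: the function name group_identical is hypothetical, and the GROUPn naming is an assumption taken from the labels used in the REDUCE example earlier in this document.

```python
def group_identical(otus):
    """Collapse OTUs with identical character strings.

    otus -- list of (label, characters) pairs.
    Returns (rows, members): rows holds one (label, characters) pair per
    distinct phenotype; members maps each new group label to the OTU
    labels it replaced (the information GROUP keeps in comment lines).
    """
    by_pattern = {}
    for label, chars in otus:
        key = chars.replace(" ", "")
        by_pattern.setdefault(key, []).append(label.strip())
    rows, members = [], {}
    for chars, labels in by_pattern.items():  # dicts keep insertion order
        if len(labels) == 1:
            rows.append((labels[0], chars))
        else:
            name = f"GROUP{len(members) + 1}"
            members[name] = labels
            rows.append((name, chars))
    return rows, members

# The six-OTU matrix from the REDUCE discussion.
otus = [
    ("OTU1    ", "111100001111"), ("OTU2    ", "111100001111"),
    ("OTU3    ", "001111111111"), ("OTU4    ", "001111111111"),
    ("OTU5    ", "000000001011"), ("OTU6    ", "000011111011"),
]
rows, members = group_identical(otus)
for label, chars in rows:
    print(f"{label:<8}{chars}")
print(members)
```

Grouping by the full character string means two OTUs are merged only when every character agrees, so no branch of non-zero length is lost.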

Two variables are either read from the terminal or supplied on the command line. For terminal input, the
program first prompts the user for the name of the input file. This name can occupy up to 50 characters
(including colons and periods). As a result, data can be supplied either from disk or from any
subdirectory. The program then asks for the name to be given the output file. Again there is a 50
character limit. The syntax for command line input is

GROUP <infilename> <outfilename>

The user may supply (1) no filenames, (2) infilename only, or (3) both infilename and outfilename when
invoking GROUP. Terminal input will be requested to supply the remainder.

TROUBLESHOOTING

The most critical aspect of input file preparation is to ensure that the structure of the file is consistent
with its format. Establishing this internal consistency has two components. First, the user must be
confident that the input file is suitable without modification for either D, PAUP, PHYLIP or GENERATE
(as a composite haplotype file). Second, the user must begin comment lines with symbols appropriate to
the file format. The autodetection of input file format requires the user to supply comments (except in
the case of PHYLIP files, which do not allow them). Files lacking comments are assumed to be of
PHYLIP type, so any other file whose comments were simply omitted will cause an error.
GENERATE will add its own comment lines, so input files produced in this manner should be
trouble-free.

SUMMARY OF LIMITATIONS

1. Program requirements and limitations follow those of D, PAUP, PHYLIP and GENERATE

D

D computes a nucleotide substitution matrix (d values) from restriction fragment or restriction site data.
The number of OTUs is unlimited; a maximum of 30000 characters is allowed. The presence of a
character (fragment or site) is specified by character state '1' in the input matrix. The proportion of
shared characters (state '1') for each class of restriction enzyme is determined for a given pair of taxa.
From this, dr is computed separately for each of 5 possible classes of restriction enzyme (r value = 4,
14/3, 5, 16/3 or 6 [Nei, 1987]); an overall weighted estimate of evolutionary divergence is then
generated. For fragment data, dr is estimated according to Nei and Li (1979; Nei, 1987, eq. 5.55). In the
case of site data, dr is generated as per Nei and Tajima (1981; Nei and Miller, 1990, eq. 4), which is
suitable for d < 0.25 and agrees well with results obtained via maximum likelihood estimation. Weighting
follows Nei and Tajima (1983) and is based on the proportion of fragments or sites generated by each
class of enzyme. Results are presented as a symmetric dissimilarity matrix; the output file is suitable
without modification as an input file for clustering via NTSYS.
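
As a rough guide to the per-class computation, the sketch below uses the estimators in the form usually attributed to Nei and Li (1979): S = 2m_xy/(m_x + m_y) with d = -ln(S)/r for sites, and the iterative G = [F(3 - 2G)]^(1/4) with d = -(2/r) ln G for fragments. That these are the exact forms of the equations cited above is an assumption; the function names are hypothetical, and neither weighting nor maximum likelihood refinement is attempted.

```python
import math

def shared_proportion(x, y):
    """2 * shared / (total in x + total in y) for two 0/1 strings."""
    shared = sum(a == "1" and b == "1" for a, b in zip(x, y))
    total = x.count("1") + y.count("1")
    return 2 * shared / total

def d_sites(s, r):
    """Site data: d = -ln(S) / r."""
    return -math.log(s) / r

def d_fragments(f, r, iterations=50):
    """Fragment data: iterate G = [F(3 - 2G)]^(1/4), then d = -(2/r) ln G."""
    g = f ** 0.25  # starting value
    for _ in range(iterations):
        g = (f * (3 - 2 * g)) ** 0.25
    return -(2 / r) * math.log(g)

# Two hypothetical profiles for a single class of 4 base enzymes.
x, y = "111101", "011100"
s = shared_proportion(x, y)   # 3 shared out of 5 + 3 present: S = 0.75
print(s, d_sites(s, 4), d_fragments(s, 4))
```

Both estimators are undefined when the two profiles share nothing (S or F equal to zero), which is the situation treated under PRACTICAL CONSIDERATIONS.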

Input largely consists of a rectangular data matrix, with character state '1' denoting the presence of a
particular fragment or site and '0' the absence of a fragment or site. All other alphanumeric characters
will be ignored by the READ statement (but see below for potential problems).

In addition to the matrix, several other parameters are required in the input file, including :
(1) The number of OTUs
(2) The number of characters
(3) The type of data (Fragment or Site)
(4) A string of reals (4, 4.6, 5, 5.3 or 6) which identify each character as
resulting from either a 4, 14/3, 5, 16/3 or 6 base restriction enzyme
(5) A symbol (#) marking the beginning of the binary matrix proper
This is merely a list of the necessary information. Its appropriate formatting is discussed below.
Comment lines are optional; these are discussed below.

Two variables are either read from the terminal or supplied on the command line. For terminal input, the
program first prompts the user for the name of the input file. This name can occupy up to 50 characters
(including colons and periods). As a result, data can be supplied either from disk or from any
subdirectory. The program then asks for the name to be given the output file. Again, there is a 50
character limit. The syntax for command line input is

D <infilename> <outfilename>

The user may supply (1) no filenames, (2) infilename only, or (3) both infilename and outfilename when
invoking D. Terminal input will be requested to supply the remainder. The user cannot specify only
outfilename on the command line.

Output consists of a symmetric dissimilarity matrix formatted to accommodate NTSYS. LABELS are
assigned by NTSYS to the ROWS of the matrix. This can of course be changed by editing the file prior
to running NTSYS.

FORMAT OF D INPUT FILES

LINE 1. a. The number of OTUs.
        b. The number of characters (Maximum of 30000).
           These are read as integers, and must be separated by at least one space.
        c. The type of data (Specify either 'F' or 'S').
           This must be separated by spaces from the number of characters.

LINE 2. A series of reals (4, 4.6, 5, 5.3 or 6) identifying the r value of the enzyme which
generated each fragment or site. These must be separated by spaces.

LINE 3. IN THE FIRST COLUMN ! : # to indicate that the matrix begins on the next line.

FINALLY The data matrix. Data for each OTU should begin with a label. This label must
occupy 8 spaces, although blanks can be used. No character state data should occur
before COLUMN 9, as the program indiscriminately reads the first 8 characters as the
OTU label. Character state data are a series of characters (0 or 1), which need not be
separated by spaces. Similarly, carriage returns or other characters will be ignored.

EXAMPLE INPUT FILE

3 20 S
6 6 6 4.6 4.6 4 4 4 4 4 4 5.3 5.3 5.3 5 5 5 5 5 5
#
COWS    110 11 111101 111 111111
PIGS    110 11 011100 111 111111
HORSES  100 11 111101 011 111111

The first line specifies the number of OTUs and the number of characters. The final entry on this line
identifies the matrix as representing site data.

The second line of the file categorizes each site by the r value of the enzyme which generated it. In
D, all enzymes are designated as either 4, 4.6, 5, 5.3 or 6 base cutters. All other entries will be ignored.
While this facilitates the inclusion of spaces or other notations to make the file more readable, it requires
a standardized coding of r values.

The symbol # occurs in the line just above the data matrix, and must be in the first column. The program
uses this symbol as a positional reference during processing.

The final N lines of the file constitute the rectangular data matrix. The first eight columns comprise the
OTU label; anything over eight characters will be truncated and anything under must be filled in by
spaces. Following the label, the character states (0 or 1) for each site are coded. Again, these are read in
as CHARACTERS, so they need not be separated by spaces. Any other characters may be included
internally for readability; these will be ignored.
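
The layout just described can be read with a few lines of Python. This is a sketch, not the D reader: the function name parse_d_input is hypothetical, each OTU is assumed to occupy a single line (D itself ignores carriage returns), and the r value line is assumed to contain only numbers.

```python
def parse_d_input(text):
    """Parse a D input file into (n_otus, n_chars, data_type, r_values, otus)."""
    # Drop blank lines and comment lines, which begin with a double quote.
    body = [ln for ln in text.splitlines()
            if ln.strip() and not ln.startswith('"')]
    n_otus, n_chars, data_type = body[0].split()[:3]
    n_otus, n_chars = int(n_otus), int(n_chars)
    marker = next(i for i, ln in enumerate(body) if ln.startswith("#"))
    r_values = [float(tok) for ln in body[1:marker] for tok in ln.split()]
    otus = []
    for ln in body[marker + 1 : marker + 1 + n_otus]:
        label = ln[:8]  # columns 1-8, blanks included
        states = "".join(c for c in ln[8:] if c in "01")
        otus.append((label, states))
    return n_otus, n_chars, data_type, r_values, otus

sample = """3 20 S
6 6 6 4.6 4.6 4 4 4 4 4 4 5.3 5.3 5.3 5 5 5 5 5 5
#
COWS    110 11 111101 111 111111
PIGS    110 11 011100 111 111111
HORSES  100 11 111101 011 111111
"""
n_otus, n_chars, data_type, r_values, otus = parse_d_input(sample)
print(n_otus, n_chars, data_type, len(r_values))
for label, states in otus:
    print(repr(label), states)
```

Slicing the label by column rather than splitting on whitespace is what makes the 8 column rule matter: a label shorter than 8 columns that is not padded would swallow the first character states.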

USE OF COMMENT LINES (OPTIONAL)

In order to facilitate record-keeping, comment lines are accommodated by D. These lines occur at the
beginning of the input file and should be prepared in the format of NTSYS; that is, comment lines must
begin with a double quotation mark (") in the first column and can be no longer than 255 characters. The
number of lines is not constrained. Because of the input format, these comment lines will be passed to
the output file and similarly through any NTSYS procedures. The test data set presented above, if
supplied with comment lines, would start :

"Biochemical systematics of selected farm animals
"Data from Orwell and MacDonald
3 20 S
6 6 6 4.6 4.6 4 4 4 4 4 4 5.3 5.3 5.3 5 5 5 5 5 5
#

TROUBLESHOOTING

The most critical aspect of input file preparation is to ensure that the number of OTUs and particularly
the number of characters specified in LINE 1 (assuming no comment lines) are correct with respect to the
data matrix (including LINE 2). If LINE 1 underestimates the total of either parameter, the extra entries
in the data matrix will not be read, although the program will proceed without acknowledgement of a
problem. This obviously invites erroneous results. In contrast, if LINE 1 overestimates the size of the
data matrix, the program will get stuck in the READ procedure and spin its wheels indefinitely.
Remember that the program screens the matrix for the characters it is expecting and ignores all others; as
such, EOLN and EOF are skipped over. If either value in LINE 1 is too large, the program will get out
of phase and eventually attempt to read past EOF. THIS CAN ALSO OCCUR IF A CHARACTER IS
ENTERED INCORRECTLY IN THE DATA MATRIX !! For example, if 'O' is entered instead of '0',
that character will be skipped and the count-controlled loop will become out of phase.
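
Because D does not report these inconsistencies itself, it can be worth screening a file before running it. The following Python sketch is one way to do so; the function name check_d_matrix is hypothetical, and it assumes one OTU per line. It is also stricter than D, in that it reports any non-blank character other than 0 or 1, even though D would silently skip it.

```python
def check_d_matrix(n_otus, n_chars, r_values, rows):
    """Return a list of problems that would put D's reader out of phase.

    rows -- the raw matrix lines, one OTU per line (an assumption)
    """
    problems = []
    if len(r_values) != n_chars:
        problems.append(f"LINE 2 has {len(r_values)} r values, expected {n_chars}")
    if len(rows) != n_otus:
        problems.append(f"matrix has {len(rows)} rows, expected {n_otus}")
    for row in rows:
        label, data = row[:8], row[8:]
        states = [c for c in data if c in "01"]
        stray = sorted({c for c in data if c not in "01" and not c.isspace()})
        if len(states) != n_chars:
            problems.append(f"{label.strip()}: {len(states)} states, expected {n_chars}")
        if stray:
            problems.append(f"{label.strip()}: unexpected characters {stray}")
    return problems

# The letter 'O' typed in place of the digit '0' in the second row.
rows = ["COWS    1101", "PIGS    11O1"]
for problem in check_d_matrix(2, 4, [6, 6, 4, 4], rows):
    print(problem)
```

An empty list means the counts in LINE 1 agree with LINE 2 and with every row of the matrix.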

Be sure the line specifying the number of OTUs, characters and the data type is in the first line below the
comments.

Be sure the symbol # marking the beginning of the data matrix is in the first column of the line just above
the data matrix.

The program indicates that processing is occurring by typing 'CALCULATING...' to the terminal. In
addition, it provides updates on its progress by typing '..' each time it completes a set of pairwise
comparisons between an OTU and those below it in the data matrix. These updates should occur every
one to two minutes (maximum). If the program stalls for an extended period, you can be fairly sure that
the data matrix contains an error and the program is attempting to read past EOF.

SUMMARY OF LIMITATIONS

1. The number of characters per OTU cannot exceed 30000
2. Characters must be classified as either 4, 4.6, 5, 5.3 or 6 base
3. The OTU label must occupy 8 spaces
4. Character states must be coded as either 0 or 1
5. The number of OTUs, etc. must occur in the first line below the comments
6. The symbol # must occur in the first column of the line just above the data matrix

PRACTICAL CONSIDERATIONS

While the program appears fairly robust, we have encountered one situation which can significantly affect
the results generated. If a PAIR of OTUs do not share ANY fragments or sites (character state 1) for a
given enzyme size class (e.g., if PIGS and COWS share no 5 base sites), the resulting estimate of
evolutionary distance between those two taxa may be artificially inflated. Such a situation may be
difficult to recognize, as the errors are generally not astronomical; in fact, one might see pairwise
distances among some members of the ingroup which are on the same order as an ingroup-outgroup
comparison. This type of situation usually results in a small number of pairwise comparisons having
distances larger than expected, such that select OTUs would cluster outside the group to which they are
thought to be closely related. One should consider this possibility before accepting clusters which are
counter-intuitive. The program provides a warning statement upon encountering such a situation;
however, the affected pairs of OTUs are not identified.

To understand the reason for this discrepancy, consider 2 OTUs (such as PIGS and COWS) which,
although they share some 4 and 6 base sites, share no 5 base sites. dr values are calculated separately for
each size class of enzyme; an overall d value is generated through weighted averaging of these
component distance estimates. The distance between the 2 OTUs based on 5 base enzymes is effectively
1; because evolutionary distance over time is an asymptotic function (Brown et al., 1979), a value of 0.35
is assigned to the 5 base estimate. Still, this value is probably much larger than the true distance between
most users' equivalent of COWS and PIGS, and would contribute disproportionately to the weighted
average.
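
A small numerical illustration may help. The class estimates and character counts below are invented for the purpose, and the weights are taken simply as the number of characters in each class, which is an assumption; the weighting actually applied by D follows Nei and Tajima (1983) and may differ in detail.

```python
def weighted_d(estimates):
    """estimates -- list of (d_r, weight) pairs, one per enzyme size class."""
    total = sum(weight for _, weight in estimates)
    return sum(d * weight for d, weight in estimates) / total

# Hypothetical pair of OTUs: 4, 5 and 6 base classes with 10, 6 and 8 characters.
typical = [(0.010, 10), (0.012, 6), (0.008, 8)]
no_shared_5_base = [(0.010, 10), (0.35, 6), (0.008, 8)]

print(weighted_d(typical))
print(weighted_d(no_shared_5_base))
```

Replacing a single class estimate with 0.35 raises the overall value in this example from about 0.0098 to about 0.094, although only a quarter of the characters belong to the affected class.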

This problem is eliminated if one ensures that all pairs of taxa share at least one site (or fragment) of each
size class. This, of course, may not always be the case. In such instances, the problem might be best
minimized by including a single character (for each affected size class), whose presence (character state
'1') is shared by all taxa. The error introduced by including such false characters is most likely far smaller
than the error resulting from a lack of shared characters.

DSE

DSE is a modification of D designed to generate a more descriptive output file for fragment or site data,
containing estimates of evolutionary distance (d values) as well as their associated standard errors. In
addition, the average number of characters generated and bases surveyed by each class of restriction
enzyme are reported. The program follows the requirements and procedures of D. Results are presented
as two symmetric matrices. Estimates of distance are presented below the major diagonal, and
corresponding standard errors above the major diagonal. The output file cannot be used as input for
NTSYS.

Terminal and/or command line input follows D. No filenames, the infilename only, or infilename and
outfilename may be supplied on the command line; terminal input will be requested to supply the
remainder. Outfilename alone cannot be specified on the command line.

d values for site data are computed according to Nei and Tajima (1981; Nei and Miller, 1990, eq. 4); for
fragment data, d is estimated using the methods of Nei and Li (1979; Nei, 1987, eq. 5.55). Individual dr
values are weighted as per Nei and Tajima (1983). Standard errors for site data are computed according
to Nei and Tajima (1983; Nei, 1987, eqs. 5.41, 5.44 and 5.51); as for d, weighting is based on the
proportion of fragments generated by each enzyme size class. Estimation of standard errors from
fragment data is difficult to achieve analytically, as the distribution of d is highly skewed (Nei, pers.
comm.), although reliable estimates can be achieved numerically via the jackknife (Nei and Miller, 1990).
Nei and Li (1979) do not provide a solution for the determination of standard errors around d; as such,
the equation provided by Upholt (1977, eq. 6b) is used. This quantity is weighted equivalently to d (Nei
and Li, 1979; Nei, 1987, eq. 5.55).

Standard errors are associated with each pairwise comparison, and CANNOT be used directly to generate
confidence intervals for intercluster branch points in all cases ! For a discussion of generating standard
errors on intercluster branch points see Nei et al. (1985).

Refer to the section on PRACTICAL CONSIDERATIONS in the D documentation for a discussion of
potential problems in data analysis and interpretation.

SUMMARY OF LIMITATIONS

1. The number of characters per OTU cannot exceed 30000
2. Characters must be classified as either 4, 4.6, 5, 5.3 or 6 base
3. The OTU label must occupy 8 spaces
4. Character states must be coded as either 0 or 1
5. The number of OTUs, etc. must occur on the first line below the comments
6. The symbol # must occur in the first column of the line just above the data matrix
7. Output is not suitable for NTSYS (To generate acceptable output, use D)
8. This program is designed to provide standard errors ONLY for estimates of
evolutionary distance between two OTUs

DSIZE

DSIZE is a modification of D which reports estimates of evolutionary distance (d values) based on a
specified class of restriction enzyme. We have noted that estimates of d from different enzyme classes
may differ significantly, so this program is provided as a means of examining the data more closely. Input
is a standard D input file; however, rather than generating a weighted average d based on all characters,
the program simply computes dr from that subset of characters resulting from r base enzymes (where r is
supplied by the user). The program follows the requirements and procedures of D. Results are presented
as a symmetric dissimilarity matrix, with the user-specified r value given in the comment lines. The
output is suitable without modification as an input file for clustering via NTSYS.

Terminal and/or command line input follows D. No filenames, the infilename only, or infilename and
outfilename may be supplied on the command line; terminal input will be requested to supply the
remainder. In addition, the program asks for the enzyme size class (r value) to consider when processing.
If the enzyme size class is supplied on the command line, it must occupy 3 characters and be one of the
following: 4.0, 4.6, 5.0, 5.3 or 6.0. Interactive terminal input of the enzyme size class is not quite so
strict, in that 4 = 4.0, etc. Command line input is sequential; that is, the enzyme size class may only be
specified if both infilename and outfilename precede it; similarly, outfilename alone cannot be provided on
the command line. Syntax for command line input is thus

DSIZE <infilename> <outfilename> <enzyme class>

d values for site data are computed according to Nei and Tajima (1981; Nei and Miller, 1990, eq. 3). For
fragment data, d is estimated using the methods of Nei and Li (1979; Nei, 1987, eq. 5.55).

Refer to the section on PRACTICAL CONSIDERATIONS in the D documentation for a discussion of
potential problems in data analysis and interpretation.

SUMMARY OF LIMITATIONS

1. The number of characters per OTU cannot exceed 30000
2. Characters must be classified as either 4, 4.6, 5, 5.3 or 6 base
3. The OTU label must occupy 8 spaces
4. Character states must be coded as either 0 or 1
5. The number of OTUs, etc. must occur in the first line below the comments
6. The symbol # must occur in the first column of the line just above the data matrix
7. The enzyme size class, when supplied on the command line, must occupy 3 characters
and be one of the following: 4.0, 4.6, 5.0, 5.3 or 6.0

DA

DA estimates haplotype and nucleotide diversity within populations, and computes the nucleotide
divergence between all pairs of populations. The number of populations is limited to 50; a maximum of
100 haplotypes per population is allowed. Haplotype frequency distributions for each population and the
associated d values among haplotypes are used to estimate haplotype and nucleotide diversity within
populations. For nucleotide divergence, total nucleotide diversity between two populations is estimated,
and the component of this diversity not explained by within-population polymorphism is extracted.
Haplotype diversity is estimated according to Nei (1987; eqs. 8.4, 8.5 and 8.12 for non-selfing and selfing
[i.e. mtDNA] populations). Nucleotide diversity and nucleotide divergence are estimated according to
Nei and Tajima (1981; Nei, 1987, eqs. 10.19, 10.7, 10.20, 10.21). Output consists of two files: (1) a
descriptive file which presents vectors of haplotype and nucleotide diversity and their standard errors for
all populations, and a matrix of nucleotide diversity and divergence between populations; and (2) a
symmetric matrix of nucleotide divergence among all pairs of populations. This latter output file is
suitable without modification as input for NTSYS.
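
The quantities described above can be sketched in Python as follows. The formulas are written in their commonly quoted forms: h = n(1 - sum x_i^2)/(n - 1) for selfing samples and 2n(1 - sum x_i^2)/(2n - 1) otherwise, within-population diversity n/(n - 1) times the sum of x_i x_j d_ij, and divergence as the between-population mean distance minus the average of the two within-population diversities. That these match the cited equations term for term is an assumption, the function names are hypothetical, and standard errors are not attempted.

```python
def haplotype_diversity(counts, selfing=True):
    """Haplotype diversity from absolute haplotype counts."""
    n = sum(counts)
    homozygosity = sum((c / n) ** 2 for c in counts)
    if selfing:  # e.g. mtDNA
        return n * (1 - homozygosity) / (n - 1)
    return 2 * n * (1 - homozygosity) / (2 * n - 1)

def mean_distance(counts_x, counts_y, d):
    """Sum over i, j of x_i * y_j * d[i][j]; d is a full symmetric matrix."""
    nx, ny = sum(counts_x), sum(counts_y)
    return sum((cx / nx) * (cy / ny) * d[i][j]
               for i, cx in enumerate(counts_x)
               for j, cy in enumerate(counts_y))

def nucleotide_diversity(counts, d):
    """Within-population diversity with the n/(n - 1) sample correction."""
    n = sum(counts)
    return n / (n - 1) * mean_distance(counts, counts, d)

def nucleotide_divergence(counts_x, counts_y, d):
    """Between-population distance minus the mean within-population diversity."""
    within = (nucleotide_diversity(counts_x, d)
              + nucleotide_diversity(counts_y, d)) / 2
    return mean_distance(counts_x, counts_y, d) - within

# Hypothetical data: two haplotypes that differ by d = 0.01.
d = [[0.0, 0.01], [0.01, 0.0]]
print(haplotype_diversity([1, 1]))
print(nucleotide_diversity([1, 1], d))
print(nucleotide_divergence([2, 0], [0, 2], d))
```

Two populations fixed for different haplotypes have no within-population diversity, so their divergence equals the distance between the two haplotypes.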

Input consists of (1) a row by column matrix representing frequency distributions for a number of
populations and (2) a symmetric dissimilarity matrix of d values for all pairs of haplotypes. Rows of the
frequency distribution matrix correspond to populations, and columns identify the particular haplotypes
characterized. Empty cells should be filled with zeros (0). In the d value matrix, values should occur
below the major diagonal, with values along the diagonal itself presented as '0.0'. D and DSIZE produce
suitable d value matrices. Only the matrix should be retained from such a file. This can be appended to
the frequency distribution matrix.

Aside from the matrices, several other parameters are required in the input file, including :
(1) The number of populations
(2) The number of haplotypes
(3) A symbol (#) marking the beginning of the matrices proper
This is merely a list of the necessary information. Its appropriate formatting is discussed below.
Comment lines are optional; these are discussed below.

Three variables are either read from the terminal or supplied on the command line. For terminal input,
the program first prompts the user for the name of the input file. This name can occupy up to 50
characters (including colons and periods). As a result, files can be supplied either from disk or from any
subdirectory. The program then requests the name for the haplotype/nucleotide diversity output file.
Again, there is a 50 character limit. Finally, the program requests the name for the nucleotide divergence
output file (<= 50 characters). Syntax for command line input is

DA <infilename> <diversityfilename> <divergencefilename>

Command line input is sequential; a given parameter may only be specified if all preceding parameters are
also provided. Terminal input will be requested to supply the remainder.

Output consists of a descriptive file of haplotype/nucleotide diversity and a file of nucleotide divergence
suitable without modification as input for NTSYS.

FORMAT OF DA INPUT FILES

LINE 1. a. The number of populations (Maximum of 50).
        b. The number of classes (Maximum of 100).

LINE 2. IN THE FIRST COLUMN ! : # to indicate that the matrix begins on the next line.

NEXT The frequency distribution matrix. Data for each population should begin with a label.
This label must occupy 8 spaces, although blanks can be used. No
frequency data should occur before COLUMN 9, as the program
indiscriminately reads the first 8 characters as the population label.
Frequency data are a series of integers representing absolute
occurrence which must be separated by one or more spaces.

FINALLY The d value matrix. This matrix should have no OTU labels, and should have values
below and along the major diagonal. Values should be carried to
as many significant digits as possible, to minimize the
compounded introduction of rounding error.

EXAMPLE INPUT FILE

3 5
#
NEW YORK 25 5 5 1 1
CHICAGO  25 5 5 0 1
L.A.     4 0 1 1 10
0.00000000000
0.01100000000 0.00000000000
0.00900000000 0.00800000000 0.00000000000
0.00700000000 0.01200000000 0.01100000000 0.00000000000
0.00300000000 0.00900000000 0.00600000000 0.00500000000 0.00000000000

The first line specifies the dimensions of the data matrix. As stated, the matrix cannot contain more than
50 populations, and there can be no more than 100 classes designated.

The next lines of the file (one per population) represent the row by column matrix. The first eight
columns comprise the population label; anything over eight characters will cause an error and anything
under must be filled in by spaces. Following the label, the frequencies of occurrence of each class of
individual are coded. Again, these must be INTEGERS.

Finally, the d value matrix is provided. The relationship between haplotypes in the frequency distribution
matrix and in the d value matrix is positional; that is, the first two haplotypes (columns) in the frequency
matrix represent OTUs 1 and 2, respectively, in the d value matrix. Accordingly, the estimated
evolutionary distance between these two haplotypes is presented in cell (2,1) of the d value matrix.
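
As a rough illustration of this layout, the following Python sketch reads a DA-style file into its parts. It is not DA's own reader: the function name parse_da_input is hypothetical, and the sketch assumes one matrix row per line and simply drops blank lines.

```python
def parse_da_input(text):
    """Split DA-style input into comments, labels, frequencies and d values."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Leading comment lines start with a double quotation mark in column 1.
    comments = []
    while lines and lines[0].startswith('"'):
        comments.append(lines.pop(0))
    n_pops, n_classes = (int(tok) for tok in lines.pop(0).split())
    if n_pops > 50 or n_classes > 100:
        raise ValueError("DA allows at most 50 populations and 100 classes")
    if not lines.pop(0).startswith("#"):
        raise ValueError("expected '#' in the first column before the matrix")
    labels, freqs = [], []
    for row in lines[:n_pops]:
        labels.append(row[:8].rstrip())  # columns 1-8 are always the label
        counts = [int(tok) for tok in row[8:].split()]
        if len(counts) != n_classes:
            raise ValueError(f"{labels[-1]}: expected {n_classes} counts")
        freqs.append(counts)
    # Lower triangle of the d matrix, diagonal included: row i holds i + 1 values.
    d = []
    for i, row in enumerate(lines[n_pops:n_pops + n_classes]):
        values = [float(tok) for tok in row.split()]
        if len(values) != i + 1:
            raise ValueError(f"d matrix row {i + 1}: expected {i + 1} values")
        d.append(values)
    if len(d) != n_classes:
        raise ValueError("d matrix has fewer rows than there are classes")
    return comments, labels, freqs, d


example = """3 5
#
NEW YORK 25 5 5 1 1
CHICAGO  25 5 5 0 1
L.A.      4 0 1 1 10
0.00000000000
0.01100000000 0.00000000000
0.00900000000 0.00800000000 0.00000000000
0.00700000000 0.01200000000 0.01100000000 0.00000000000
0.00300000000 0.00900000000 0.00600000000 0.00500000000 0.00000000000
"""
comments, labels, freqs, d = parse_da_input(example)
# Cell (2,1) of the d matrix: distance between the first two haplotypes.
print(labels, d[1][0])
```

Indexing is zero-based in the sketch, so cell (2,1) of the manual corresponds to d[1][0].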

USE OF COMMENT LINES (OPTIONAL)

In order to facilitate record-keeping, comment lines are accommodated by DA. These lines occur at the
beginning of the input file and should be prepared as follows : comment lines must begin with a double
quotation mark (") in the first column and can be no longer than 255 characters. The number of lines is
not constrained. No comment lines should occur between the frequency distribution matrix and the d
value matrix. Comment lines will be passed to the output file. The test data set presented above, if
supplied with comment lines, would start :

"Geographic heterogeneity in human mtDNA haplotype frequencies
"Initial survey of major cities
3 5
#

TROUBLESHOOTING

The most critical aspect of input file preparation is to ensure that the number of populations and
particularly the number of haplotypes specified in LINE 1 (assuming no comment lines) are correct with
respect to both data matrices. If LINE 1 underestimates the total of either parameter, the extra entries in
the data matrix will not be read. The program will proceed to completion; however, the problem will be
apparent, as the original matrix written to the output file will be incorrect. In contrast, if LINE 1
overestimates the size of the data matrix, the program will crash during the READ procedure, either
because it is attempting to read past EOF or because it encounters a CHAR variable when an INTEGER
is expected.

Be sure that the line specifying the number of populations and haplotypes occurs in the first line below
the comments.

Be sure the symbol # marking the beginning of the data matrix is in the first column of the line just above
the frequency distribution matrix.

Be sure that the d value matrix contains values along the major diagonal.

Once the data is read in, the program indicates that processing is occurring by typing 'CALCULATING...'
to the terminal. This should happen within one minute. In addition, it provides updates on its progress
by typing '..' each time it completes a set of pairwise comparisons between a population and those below
it in the frequency distribution matrix. These cycles may take some time; however, if the program reaches
the CALCULATING... stage, it will at some point run to completion (although note that some errors in
the data matrix are undetectable [see above]). If this does not occur, you can be fairly sure that the data
matrix contains an error and the program is attempting to read past EOF.

SUMMARY OF LIMITATIONS

1. The number of populations is limited to 50
2. The number of classes cannot exceed 100
3. The population label must occupy exactly 8 spaces
4. The number of populations, etc. must occur in the first line below the comments
5. The symbol # must occur in the first column of the line just above the data matrix

MONTE

MONTE analyzes the extent of geographic heterogeneity in population frequency distributions through a
Monte Carlo simulation as described by Roff and Bentzen (1989). The extent of heterogeneity (assessed
through chi-square analysis) in the original data matrix is compared to that estimated from repeated
randomizations of the original matrix. By repeatedly randomizing the data matrix, one can determine a
mean X2 value based on chance alone; the probability of encountering an X2 value as large as that
calculated for the original matrix can then be determined. This procedure is designed to minimize the
effect of large numbers of empty cells on the validity of the chi-square procedure. For a complete
theoretical treatment, see Roff and Bentzen (1989). In this program, up to 50 populations and 200
classes of individuals may be specified; the number of individuals in the entire matrix is limited to 5000.
As many as 10000 randomization procedures may be carried out.
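
The procedure can be sketched in Python as follows. This is a minimal illustration and not MONTE's source: the function names are hypothetical, and the randomization shown (shuffling individuals among populations while holding row and column totals fixed) is one simple way to randomize the matrix that we assume here for the sake of the example.

```python
import random


def chi_square(matrix):
    """Pearson X2 for a populations-by-classes table of counts."""
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    total = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(matrix):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # empty rows or columns contribute nothing
                x2 += (observed - expected) ** 2 / expected
    return x2


def randomize(matrix, rng):
    """Reassign individuals to populations, keeping all marginal totals."""
    n_classes = len(matrix[0])
    # One entry per individual, holding that individual's class index.
    individuals = [j for row in matrix for j, count in enumerate(row)
                   for _ in range(count)]
    rng.shuffle(individuals)
    shuffled, start = [], 0
    for row in matrix:
        size = sum(row)  # each population keeps its original sample size
        counts = [0] * n_classes
        for j in individuals[start:start + size]:
            counts[j] += 1
        shuffled.append(counts)
        start += size
    return shuffled


def monte_carlo(matrix, n_rand, seed=0):
    """Return the original X2, the fraction of randomizations exceeding it, and all X2 values."""
    rng = random.Random(seed)
    original = chi_square(matrix)
    sims = [chi_square(randomize(matrix, rng)) for _ in range(n_rand)]
    # Ties are not counted; the small tolerance guards against float rounding.
    exceed = sum(1 for x2 in sims if x2 - original > 1e-9)
    return original, exceed / n_rand, sims


cities = [[25, 5, 5, 1, 1], [25, 5, 5, 0, 1], [4, 0, 1, 1, 10]]
original, probability, sims = monte_carlo(cities, 1000)
print(original, probability)
```

The result varies with the seed, as expected of a simulation; more randomizations give a steadier estimate.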

Input is primarily a row by column matrix representing frequency distributions for a number of
populations. Rows of the matrix correspond to populations, and columns identify the particular classes
characterized. Empty cells should be filled with zeros (0).

Aside from the matrix, several other parameters are required in the input file, including :
(1) The number of populations
(2) The number of classes
(3) The number of randomizations to be carried out
(4) A symbol (#) marking the beginning of the matrix proper
This is merely a list of the necessary information. Its appropriate formatting is discussed below.
Comment lines are optional; these are discussed below.

Two variables are either read from the terminal or supplied on the command line. For terminal input, the
program first prompts the user for the name of the input file. This name can occupy up to 50 characters
(including colons and periods). As a result, files can be supplied either from disk or from any
subdirectory. The program then requests the name to be given the output file. Again, there is a 50
character limit. The syntax for command line input is

MONTE <infilename> <outfilename>

The user may supply (1) no filenames (for terminal input), (2) infilename only, or (3) both infilename and
outfilename when invoking MONTE. The user cannot specify only outfilename.

Output is largely self-explanatory. The original matrix is echoed, followed by the X2 value calculated
from that matrix. The results of the simulations are then presented in two ways. First, the likelihood of
generating an X2 value which exceeds that calculated from the original matrix by chance alone is reported
as PROBABILITY ± SE (see also Practical Considerations below). [NOTE : This is a stochastic variable,
dependent upon the particular conditions of each randomization. Obviously, the larger the number of
randomizations, the closer this value will approximate the true probability.] The second aspect of output
is the distribution of X2 values calculated from the randomizations. Average, minimum and maximum X2
values are provided, as is a cumulative frequency distribution (partitioned into categories representing
1/10 of the range) of the X2 values.
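
A summary of this kind could be computed as in the Python sketch below. The function name summarize is hypothetical, and the standard error is taken to be the binomial standard error of the estimated probability, which is an assumption; the manual does not state how MONTE computes it.

```python
import math


def summarize(original, simulated):
    """Summarize simulated X2 values relative to the original X2."""
    n = len(simulated)
    p = sum(1 for x2 in simulated if x2 > original) / n
    se = math.sqrt(p * (1 - p) / n)  # assumed: binomial standard error
    lowest, highest = min(simulated), max(simulated)
    width = (highest - lowest) / 10  # each category spans 1/10 of the range
    cumulative = []
    for k in range(1, 11):
        # Use the exact maximum for the last category to avoid rounding loss.
        upper = highest if k == 10 else lowest + k * width
        cumulative.append(sum(1 for x2 in simulated if x2 <= upper) / n)
    return {
        "probability": p,
        "se": se,
        "mean": sum(simulated) / n,
        "min": lowest,
        "max": highest,
        "cumulative": cumulative,
    }


report = summarize(7.5, [float(v) for v in range(11)])
print(report["probability"], report["cumulative"])
```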

FORMAT OF MONTE INPUT FILES

LINE 1    a. The number of populations (Maximum of 50).
          b. The number of classes (Maximum of 200).
          c. The number of randomizations to be carried out (Maximum of 10000).

LINE 2    IN THE FIRST COLUMN: # to indicate that the matrix begins on the next line.

NEXT      The data matrix. Data for each population should begin with a label. This label must
          occupy 8 spaces, although blanks can be used. No frequency data should occur
          before COLUMN 9, as the program indiscriminately reads the first 8 characters
          as the population label. Frequency data are a series of integers which must be
          separated by one or more spaces.

EXAMPLE INPUT FILE

3 5 1000
#
NEW YORK 25 5 5 1 1
CHICAGO  25 5 5 0 1
L.A.      4 0 1 1 10

The first line specifies the dimensions of the data matrix, as well as the number of randomization
procedures to be carried out. As stated, the matrix cannot contain more than 50 populations, and there
can be no more than 200 classes designated. Although the number of individuals in the sample is not
specified explicitly, the sum of the marginal totals cannot exceed 5000. The final entry on this line, the
number of randomizations, is limited to 10000.

The final N lines of the file (N being the number of populations) represent the row by column matrix.
The first eight columns comprise the population label; anything over eight characters will cause an error
and anything under must be filled in by spaces. Following the label, the frequencies of occurrence of each
class of individual are coded. Again, these must be INTEGERS.

USE OF COMMENT LINES (OPTIONAL)

In order to facilitate record-keeping, comment lines are accommodated by MONTE. These lines occur at
the beginning of the input file and should be prepared as follows : comment lines must begin with a
double quotation mark (") in the first column and can be no longer than 255 characters. The number of
lines is not constrained. These lines will be passed to the output file. The test data set presented above, if
supplied with comment lines, would start :

"Geographic heterogeneity in human mtDNA haplotype frequencies
"Initial survey of major cities
3 5 1000
#

TROUBLESHOOTING

The most critical aspect of input file preparation is to ensure that the number of populations and
particularly the number of classes specified in LINE 1 (assuming no comment lines) are correct with
respect to the data matrix. If LINE 1 underestimates the total of either parameter, the extra entries in the
data matrix will not be read. The program will proceed to completion; however, the problem will be
apparent, as the original matrix written to the output file will be incorrect. In contrast, if LINE 1
overestimates the size of the data matrix, the program will crash during the READ procedure, either
because it is attempting to read past EOF or because it encounters a CHAR variable when an INTEGER
is expected.

Be sure that the line specifying the number of populations, classes and replicates occurs in the first line
below the comments.

Be sure the symbol # marking the beginning of the data matrix is in the first column of the line just above
the data matrix.

Once the data is read in, the program indicates that processing is occurring by typing 'CALCULATING...'
to the terminal. This should happen within one minute. In addition, it provides updates on its progress
by typing '..' after every fifty randomizations. These cycles may take some time; however, if the program
reaches the CALCULATING... stage, it will at some point run to completion (although note that some
errors in the data matrix are undetectable [see above]). If this does not occur, you can be fairly sure that
the data matrix contains an error and the program is attempting to read past EOF.
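
Because several of these errors go undetected by the program, it may help to check a file before running it. The Python sketch below is one hypothetical way to do so (check_monte_file is not part of REAP); it assumes one population per line and reports mismatches between LINE 1 and the matrix, along with violations of the stated limits.

```python
def check_monte_file(text):
    """Return a list of problems found in MONTE-style input (empty if none)."""
    lines = text.splitlines()
    # Skip leading comment lines (double quotation mark in column 1) and blanks.
    start = 0
    while start < len(lines) and (lines[start].startswith('"')
                                  or not lines[start].strip()):
        start += 1
    body = [ln for ln in lines[start:] if ln.strip()]
    if len(body) < 2:
        return ["file ends before the header line and '#' marker"]
    try:
        n_pops, n_classes, n_rand = (int(tok) for tok in body[0].split())
    except ValueError:
        return ["LINE 1 must hold three integers"]
    problems = []
    if n_pops > 50:
        problems.append("more than 50 populations")
    if n_classes > 200:
        problems.append("more than 200 classes")
    if n_rand > 10000:
        problems.append("more than 10000 randomizations")
    if not body[1].startswith("#"):
        problems.append("'#' must be in the first column above the matrix")
    rows = body[2:]
    if len(rows) != n_pops:
        problems.append(f"LINE 1 gives {n_pops} populations, matrix has {len(rows)} rows")
    total = 0
    for row in rows:
        label = row[:8].strip()  # the first 8 columns are read as the label
        try:
            counts = [int(tok) for tok in row[8:].split()]
        except ValueError:
            problems.append(f"{label!r}: non-integer data after column 8")
            continue
        if len(counts) != n_classes:
            problems.append(f"{label!r}: {len(counts)} classes, expected {n_classes}")
        total += sum(counts)
    if total > 5000:
        problems.append("more than 5000 individuals in the matrix")
    return problems


good = "3 5 1000\n#\nNEW YORK 25 5 5 1 1\nCHICAGO  25 5 5 0 1\nL.A.      4 0 1 1 10\n"
print(check_monte_file(good))
```

A label that is not padded to eight columns shows up here as a wrong class count, since part of the frequency data is swallowed by the label.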

SUMMARY OF LIMITATIONS

1. The number of populations is limited to 50
2. The number of classes cannot exceed 200
3. There can be no more than 5000 individuals in the total matrix
4. The number of randomizations cannot exceed 10000
5. The population label must occupy exactly 8 spaces
6. The number of populations, etc. must occur in the first line below the comments
7. The symbol # must occur in the first column of the line just above the data matrix

PRACTICAL CONSIDERATIONS

It has come to our attention that the algorithm of Roff and Bentzen (1989) is the least conservative
method of handling ties (X2 values from randomizations equal to that derived from the original data
matrix); here, ties are not counted, and the probability derived from simulations is thus the likelihood of
EXCEEDING the original X2 value by chance. In most cases this will be more than adequate.
However, for very small data sets, where the number of permutations of the data matrix may be limited,
the potential for ties can become significant. As an example, consider the following data set, where every
individual bears a unique haplotype:

3 10 1000
#
A        1 1 1 0 0 0 0 0 1 0
B        0 0 0 1 1 1 0 0 0 0
C        0 0 0 0 0 0 1 1 0 1

In this case, all randomizations will produce the same result - one individual will be present in each
column, and row totals will be held constant. Thus, all X2 values will be the same as the original. Note
that the particular row in which each individual (= haplotype) occurs does not influence the resulting X2;
unique haplotypes in general contribute nothing to the test for heterogeneity, because the shape of their
distribution across populations cannot change. Given these data, we would determine based on the
original algorithm of Roff and Bentzen (1989) that NONE of the randomizations produce an X2 value
that exceeds the original (probability = 0.000), suggesting the existence of heterogeneity among
populations. However, it is equally true that none of the randomizations result in X2 values LESS than
the original - all produce exactly the same value. Thus, if ties are included, the probability changes to
1.000 (no heterogeneity). This is clearly a more conservative (and appropriate) way of interpreting the
data as, in this extreme situation, the power of the test is zero. An analogous situation may also arise,
however, for nominally more complex data sets, where relatively few common haplotypes are
represented by a small number of individuals.

To alleviate this problem, MONTE now reports the probability of exceeding the original X2 value, the
number of ties encountered during the simulation, and the probability which results from the inclusion
of ties (that is, the probability that a given X2 value EQUALS OR EXCEEDS the original by chance
alone). Users are encouraged to consider the impact of ties before drawing conclusions from their data.
A note discussing this problem more fully is in preparation (Bentzen et al., in prep.).
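
The effect of ties on the reported probability can be illustrated with a short Python sketch. The function name and the tolerance used to recognize a tie are assumptions made for the example, not a description of MONTE's internals.

```python
import math


def tie_aware_probabilities(original, simulated, rel_tol=1e-9):
    """Return (P exceeding, number of ties, P equalling or exceeding)."""
    # A simulated value within rounding error of the original counts as a tie.
    ties = sum(1 for x2 in simulated
               if math.isclose(x2, original, rel_tol=rel_tol))
    exceed = sum(1 for x2 in simulated
                 if x2 > original and not math.isclose(x2, original, rel_tol=rel_tol))
    n = len(simulated)
    return exceed / n, ties, (exceed + ties) / n


# Extreme case from the text: every randomization reproduces the original X2.
# For 10 unique haplotypes spread over 3 populations, that value is 20.
print(tie_aware_probabilities(20.0, [20.0] * 1000))  # (0.0, 1000, 1.0)
```

With ties excluded the probability is 0.000; with ties included it is 1.000, which is the contrast described above.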

K

K computes a nucleotide substitution matrix (d values) from DNA sequence data using the two-parameter
model of Kimura (1980). The number of OTUs is unlimited; a maximum of 30000 characters (bases) is
allowed. Character states are specified by either upper or lower case 'A', 'C', 'T', 'G' or 'U'. Missing bases
are allowed, and are represented as character state '.'. The probability of type I substitutions (transitions)
and type II substitutions (transversions) per site is determined for a given pair of taxa. A character is
excluded from pairwise consideration if one or both of its character states is '.'. From the estimates of
transition and transversion rates, K is estimated according to the formulation of Kimura (1980; eq. 10;
Nei, 1987, eq. 5.5), which assumes that the rate of transitions is different from the rate of transversions.
This value represents an estimate of the expected number of nucleotide substitutions per site (that is, d)
between a pair of OTUs. In practice, the true pattern of nucleotide substitution may be more complicated
than can be accounted for by a two-parameter model (Nei, 1987); as such, this estimate may only be
applicable over a relatively short evolutionary time (Aquadro et al., 1984). Results are presented as a
symmetric dissimilarity matrix; the output file is suitable without modification as an input file for
clustering via NTSYS.
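
The pairwise calculation can be sketched in Python as follows. This is an illustration built from the published two-parameter formula, not K's source code; the function name is hypothetical, and treating 'U' as equivalent to 'T' is an assumption made for the example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_distance(seq1, seq2):
    """Kimura two-parameter distance, or None when it is undefined."""
    s1 = seq1.upper().replace("U", "T")  # assumption: U is scored as T
    s2 = seq2.upper().replace("U", "T")
    sites = transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == "." or b == ".":
            continue  # site excluded if either state is missing
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1
    if sites == 0:
        return None
    p = transitions / sites    # proportion of transitions
    q = transversions / sites  # proportion of transversions
    first, second = 1 - 2 * p - q, 1 - 2 * q
    if first <= 0 or second <= 0:
        return None  # logarithm undefined for these proportions
    return -0.5 * math.log(first * math.sqrt(second))


cows = "ACTGACTTCCGGactgacct..aAT"
pigs = "ACTTGCTTCCGGaccgacctaa.TT"
print(kimura_distance(cows, pigs))
```

For the two example sequences, three sites are dropped because at least one OTU has a missing base, leaving 22 sites with two transitions and two transversions.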

Input largely consists of a rectangular data matrix, with the upper or lower case 'A', 'C', 'T', 'G' and 'U'
denoting the character state at a particular site, and '.' denoting an unknown character state for a
particular site. All other alphanumeric characters will be ignored by the READ statement (but see below
for potential problems).

In addition to the matrix, several other parameters are required in the input file, including:
(1) The number of OTUs
(2) The number of characters
(3) A symbol (#) marking the beginning of the matrix proper
This is merely a list of the necessary information. Its appropriate formatting is discussed below.
Comment lines are optional; these are discussed below.

Two variables are either read from the terminal or supplied on the command line. For terminal input, the
program first prompts the user for the name of the input file. This name can occupy up to 50 characters
(including colons and periods). As a result, data can be supplied either from disk or from any
subdirectory. The program then asks for the name to be given the output file. Again, there is a 50
character limit. The syntax for command line input is

K <infilename> <outfilename>

The user may supply (1) no filenames, (2) infilename only, or (3) both infilename and outfilename when
invoking K. Terminal input will be requested to supply the remainder. The user cannot specify only
outfilename on the command line.

Output consists of a symmetric dissimilarity matrix formatted to accommodate NTSYS. LABELS are
assigned by NTSYS to the ROWS. This can of course be changed by editing the file prior to running
NTSYS.

FORMAT OF K INPUT FILES

LINE 1    a. The number of OTUs.
          b. The number of characters (Maximum of 30000).
          These are read as integers, and must be separated by spaces.

LINE 2    IN THE FIRST COLUMN: # to indicate that the matrix begins on the next line.

FINALLY   The data matrix. Data for each OTU should begin with a label. This label must
          occupy 8 spaces, although blanks can be used. No character state data should occur
          before COLUMN 9, as the program indiscriminately reads the first 8 characters as the
          OTU label. Similarly, labels longer than 8 characters may cause an error. Character
          states are a series of characters which need not be separated by spaces. Carriage returns
          or other non-valid characters will be ignored.

EXAMPLE INPUT FILE

3 25
#
COWS    ACTGACTTCCGGactgacct..aAT
PIGS    ACTTGCTTCCGGaccgacctaa.TT
HORSES  ACTGAGTTCCGGatcgacctaaaTC

The first line specifies the number of OTUs and the number of characters.

The symbol # occurs in the line just above the data matrix, and must be in the first column. The program
uses this symbol as a positional reference during processing.

The final N lines of the file represent the rectangular data matrix. The first eight columns comprise
the OTU label; anything under eight characters must be filled in by spaces and anything over will cause an
error. Following the label, the character states (ACTGU or .) for each site are coded. These need not be
separated by spaces. Any other characters (including spaces) may be included internally for readability;
these will be ignored.
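
The reading rule described here (eight label columns, then valid character states with everything else skipped) can be sketched in Python. The function parse_k_input is hypothetical and assumes each OTU label starts on a new line; unlike the program itself, the sketch stops with an error when the data run out.

```python
VALID_STATES = set("ACGTU.")


def parse_k_input(text):
    """Read K-style input into a dict of label -> upper-case character states."""
    lines = text.splitlines()
    i = 0
    # Skip leading comment lines (double quotation mark in column 1) and blanks.
    while i < len(lines) and (lines[i].startswith('"') or not lines[i].strip()):
        i += 1
    n_otus, n_chars = (int(tok) for tok in lines[i].split())
    if n_chars > 30000:
        raise ValueError("no more than 30000 characters are allowed")
    if not lines[i + 1].startswith("#"):
        raise ValueError("expected '#' in the first column before the matrix")
    stream = "\n".join(lines[i + 2:])
    pos, otus = 0, {}
    for _ in range(n_otus):
        while pos < len(stream) and stream[pos] == "\n":
            pos += 1  # skip empty lines before a label
        label = stream[pos:pos + 8].rstrip()  # columns 1-8, read blindly
        pos += 8
        states = []
        while len(states) < n_chars:
            if pos >= len(stream):
                raise ValueError(f"ran out of data while reading {label!r}; check LINE 1")
            ch = stream[pos].upper()
            pos += 1
            if ch in VALID_STATES:  # spaces, line ends and the like are ignored
                states.append(ch)
        otus[label] = "".join(states)
        # The next label begins on the line after the one just finished.
        newline = stream.find("\n", pos)
        pos = len(stream) if newline == -1 else newline + 1
    return otus


farm = """3 25
#
COWS    ACTGACTTCCGGactgacct..aAT
PIGS    ACTTGCTTCCGGaccgacctaa.TT
HORSES  ACTGAGTTCCGGatcgacctaaaTC
"""
for label, states in parse_k_input(farm).items():
    print(label, states)
```

If LINE 1 claims 26 characters instead of 25, the reader runs on into the next OTU's line in search of another valid state, which is how the input gets out of phase.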

USE OF COMMENT LINES (OPTIONAL)

In order to facilitate record-keeping, comment lines are accommodated by K. These lines occur at the
beginning of the input file and should be prepared in the format of NTSYS; that is, comment lines must
begin with a double quotation mark (") in the first column and can be no longer than 255 characters. The
number of lines is not constrained. Because of the input format, these comment lines will be passed to
the output file and similarly through any NTSYS procedures. The test data set presented above, if
supplied with comment lines, would start :

"Biochemical systematics of selected farm animals
"Data from Orwell and MacDonald
3 25
#

TROUBLESHOOTING

The most critical aspect of input file preparation is to ensure that the number of OTUs and particularly
the number of characters specified in LINE 1 (assuming no comment lines) are correct with respect to the
data matrix. If LINE 1 underestimates the total of either parameter, the extra entries in the data matrix
will not be read, although the program will proceed without acknowledgement of a problem. This
obviously invites erroneous results. In contrast, if LINE 1 overestimates the size of the data matrix, the
program will get stuck in the READ procedure and spin its wheels indefinitely. Remember that the
program screens the matrix for the characters it is expecting and ignores all others; as such, EOLN and
EOF are skipped over. If either value in LINE 1 is too large, the program will get out of phase and
eventually attempt to read past EOF. THIS CAN ALSO OCCUR IF A CHARACTER IS ENTERED
INCORRECTLY IN THE DATA MATRIX!!

Be sure the line specifying number of OTUs and number of characters is in the first line below the
comments.

Be sure the symbol # marking the beginning of the data matrix is in the first column of the line just above
the matrix.

Do not extend OTU labels beyond the first eight characters of the data lines, as those extra characters
may be read as sequence data.

The program indicates that processing is occurring by typing 'CALCULATING...' to the terminal. In
addition, it provides updates on its progress by typing '..' each time it completes a set of pairwise
comparisons between an OTU and those below it in the data matrix. These updates should occur every
one to two minutes (maximum). If the program stalls for an extended period, you can be fairly sure that
the data matrix contains an error and the program is attempting to read past EOF.

SUMMARY OF LIMITATIONS

1. The number of characters per OTU cannot exceed 30000
2. The OTU label must occupy exactly eight spaces
3. Character states must be coded as either upper or lower case 'ACTGU' or '.'
4. The number of OTUs and characters must occur in the first line below the comments
5. The symbol # must occur in the first column of the line just above the data matrix

PRACTICAL CONSIDERATIONS

While the program appears fairly robust, we have encountered one situation which can significantly affect
the results generated. If the proportion of transitions or transversions is quite high (approaching 0.5), the
estimation of K will be an undefined quantity, and will thus cause an error. It seems extremely unlikely
that one would encounter this problem in practice (particularly if one accepts the caveats [Nei, 1987;
Aquadro et al., 1984] concerning the applicability of the K estimator for more divergent taxa), but it is
relatively simple to generate from a small test data set. If the estimated K for a given pair of OTUs is
undefined, the program reports its value as '-----'; this clearly will not slide through a clustering algorithm.
Again, this is probably only a theoretical anomaly, but users are advised to consider this possibility before
invoking other explanations for why their test data sets fail to run.
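
For reference, the two-parameter estimator of Kimura (1980) can be written as below, where P and Q denote the observed proportions of sites showing transitions and transversions, respectively (the symbols are the conventional ones and are used here for illustration). The logarithm exists only when both factors are positive, which is why the estimate becomes undefined as either proportion approaches 0.5.

```latex
% P = proportion of sites showing transitions
% Q = proportion of sites showing transversions
K = -\frac{1}{2}\,\ln\!\left[\,(1 - 2P - Q)\,\sqrt{1 - 2Q}\,\right]

% K is defined only when both factors are positive:
1 - 2P - Q > 0 \qquad \text{and} \qquad 1 - 2Q > 0

% Example of an undefined case: P = 0.45, Q = 0.10 gives
1 - 2(0.45) - 0.10 = 0 \quad\Rightarrow\quad \ln(0) \text{ is undefined}
```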

REFERENCES

Aquadro, C.F., N. Kaplan and K.J. Risko. 1984. An analysis of the dynamics of mammalian
mitochondrial DNA sequence evolution. Mol. Biol. Evol. 1: 423-434.

Brown, W.M., M. George, Jr. and A.C. Wilson. 1979. Rapid evolution of animal mitochondrial DNA.
Proc. Nat. Acad. Sci. 76: 1967-1971.

Felsenstein, J. 1988. PHYLIP: Phylogeny inference package. Version 3.1. University of Washington,
Seattle.

Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through
comparative studies of nucleotide sequences. J. Mol. Evol. 16: 111-120.

Lynch, M. and T.J. Crease. 1990. The analysis of population survey data of DNA sequence variation.
Mol. Biol. Evol. 7: 377-394.

Maddison, W. and D. Maddison. 1987. MacClade. Version 2.1. Harvard University, Cambridge.

Nei, M. 1987. Molecular Evolutionary Genetics. Columbia University Press. New York.

Nei, M. and L. Jin. 1989. Variances of the average numbers of nucleotide substitutions within and
between populations. Mol. Biol. Evol. 6: 290-300.

Nei, M. and W.-H. Li. 1979. Mathematical model for studying genetic variation in terms of restriction
endonucleases. Proc. Nat. Acad. Sci. 76: 5269-5273.

Nei, M. and J.C. Miller. 1990. A simple method for estimating average number of nucleotide
substitutions within and between populations from restriction data. Genetics 125: 873-879.

Nei, M., J.C. Stephens and N. Saitou. 1985. Methods for computing the standard errors of branching
points in an evolutionary tree and their application to molecular data from humans and apes. Mol. Biol.
Evol. 2: 66-85.

Nei, M. and F. Tajima. 1981. DNA polymorphism detectable by restriction endonucleases. Genetics 97:
145-163.

Nei, M. and F. Tajima. 1983. Maximum likelihood estimation of the number of nucleotide substitutions
from restriction sites data. Genetics 105: 207-217.

Rohlf, F.J. 1988. NTSYS-PC: Numerical taxonomy and multivariate analysis system. Version 1.40.
Exeter Publishing, Setauket.

Rohlf, F.J. 1990. NTSYS-PC: Numerical taxonomy and multivariate analysis system. Version 1.60.
Exeter Publishing, Setauket.

Roff, D.A. and P. Bentzen. 1989. The statistical analysis of mitochondrial DNA polymorphisms: X2 and
the problem of small samples. Mol. Biol. Evol. 6: 539-545.

Swofford, D.L. 1989. PAUP: Phylogenetic analysis using parsimony. Version 3.0. Illinois Natural
History Survey, Champaign.

Swofford, D.L. 1985. PAUP: Phylogenetic analysis using parsimony. Version 2.4. Illinois Natural
History Survey, Champaign.

Upholt, W.B. 1977. Estimation of DNA sequence divergence from comparison of restriction
endonuclease digests. Nuc. Acids Res. 4: 1257-1265.
