ROBERT SEDGEWICK
Program in Computer Science and Division of Applied Mathematics
Brown University, Providence, Rhode Island 02912
This paper surveys the numerous methods that have been proposed for permutation
enumeration by computer. The various algorithms which have been developed over the
years are described in detail, and implemented in a modern ALGOL-like language. All of
the algorithms are derived from one simple control structure.
The problems involved with implementing the best of the algorithms on real com-
puters are treated in detail. Assembly-language programs are derived and analyzed
fully.
The paper is intended not only as a survey of permutation generation methods, but
also as a tutorial on how to compare a number of different algorithms for the same task.
Key Words and Phrases: permutations, combinatorial algorithms, code optimization,
analysis of algorithms, lexicographic ordering, random permutations, recursion, cyclic
rotation.
CR Categories: 3.15, 4.6, 5.25, 5.30.
INTRODUCTION

Over thirty algorithms have been published during the past twenty years for generating by computer all N! permutations of N elements. This problem is a nontrivial example of the use of computers in combinatorial mathematics, and it is interesting to study because a number of different approaches can be compared. Surveys of the field have been published previously in 1960 by D. H. Lehmer [26] and in 1970-71 by R. J. Ord-Smith [29, 30]. A new look at the problem is appropriate at this time because several new algorithms have been proposed in the intervening years.

Permutation generation has a long and distinguished history. It was actually one of the first nontrivial nonnumeric problems to be attacked by computer. In 1956, C. Tompkins wrote a paper [44] describing a number of practical areas where permutation generation was being used to solve problems. Most of the problems that he described are now handled with more sophisticated techniques, but the paper stimulated interest in permutation generation by computer per se. The problem is simply stated, but not easily solved, and is often used as an example in programming and correctness. (See, for example, [6].)

The study of the various methods that have been proposed for permutation generation is still very instructive today because together they illustrate nicely the relationship between counting, recursion, and iteration. These are fundamental concepts in computer science, and it is useful to have a rather simple example which illustrates so well the relationships between them. We shall see that algorithms which seem to differ markedly have essentially the same structure when expressed in a modern language and subjected to simple program transformations. Many readers may find it surprising to discover that "top-down" (recursive) and "bottom-up"

* This work was supported by the National Science Foundation Grant No. MCS75-23738.
Copyright 1977, Association for Computing Machinery, Inc. General permission to republish, but not for
profit, all or part of this material is granted provided that ACM's copyright notice is given and that reference
is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by
permission of the Association for Computing Machinery.
    ABC   ABC   ABC
    BAC   ACB   CBA
    CAB   CAB   BCA
    CBA   BAC   BAC
    BCA   BCA   CAB
    ACB   CBA   ACB

FIGURE 1. Legal permutation networks for three elements, with the sequences that are generated written out explicitly on the right.

It is easy to see that for larger N there will be large numbers of legal networks. The methods that we shall now examine will show how to systematically construct networks for arbitrary N. Of course, we are most interested in networks with a sufficiently simple structure that their exchange sequences can be conveniently implemented on a computer.

Recursive Methods

We begin by studying a class of permutation generation methods that are very simple when expressed as recursive programs. To generate all permutations of P[1],...,P[N], we repeat N times the step: "first generate all permutations of P[1],...,P[N-1], then exchange P[N] with one of the elements P[1],...,P[N-1]". As this is repeated, a new value is put into P[N] each time. The various methods differ in their approaches to filling P[N] with the N original elements. The first and seventh networks in Fig. 1 illustrate two such methods; the network for five elements, shown in Diagram 4, is more complicated. (The empty boxes denote the network of Diagram 3 for four elements.) To get the desired decreasing sequence in P[5], we must exchange it successively with P[3], P[1], P[3], P[1] in-between generating all permutations of P[1],...,P[4].

(Diagrams 3 and 4, the exchange networks for four and five elements, appear here.)

In general, we can generate all permutations of N elements with the following recursive procedure:

Algorithm 1.

    procedure permutations (N);
    begin c := 1;
      loop:
        if N > 2 then permutations(N-1) endif;
      while c < N:
        P[B[N,c]] :=: P[N];
        c := c + 1
      repeat
    end;

This program uses the looping control construct loop ... while ... repeat which is described by D. E. Knuth [23]. Statements between loop and repeat are iterated: when the while condition fails, the loop is exited. If the while were placed immediately following the loop, then the statement would be like a normal ALGOL while. In fact, Algorithm 1 might be implemented with a simpler construct like for c := 1 until N do ... were it not for the need to test the control counter c within the loop. The array B[N,c] is an index table which tells where the desired value of P[N] is after P[1],...,P[N-1] have been run through all permutations for the cth time.

We still need to specify how to compute B[N,c]. For each value of N we could specify any one of (N-1)! sequences in which to fill P[N], so there are a total of (N-1)!(N-2)!(N-3)! ... 3!2!1! different tables B[N,c] which will cause Algorithm 1 to properly generate all N! permutations of P[1],...,P[N].

One possibility is to precompute B[N,c] by hand (since we know that N is small), continuing as in the example above. If we adopt the rule that P[N] should be filled with elements in decreasing order of their original index, then the network in Diagram 4 tells us that B[5,c] should be 3,1,3,1 for c = 1,2,3,4. For N = 6 we proceed in the same way: if we start with A B C D E F, then the first N = 5 subnetwork leaves the elements in the order C D E B A F, so that B[6,1] must be 3 to get the E into P[6], leaving C D F B A E. The second N = 5 subnetwork then leaves F B A D C E, so that B[6,2] must be 4 to get the D into P[6], etc. Table 2 is the full table for N <= 12 generated this way; we could generate permutations with Algorithm 1 by storing these N(N-1) indices.

TABLE 2. INDEX TABLE B[N,c] FOR ALGORITHM 1

    N =  2:  1
    N =  3:  1 1
    N =  4:  1 2 3
    N =  5:  3 1 3 1
    N =  6:  3 4 3 2 3
    N =  7:  5 3 1 5 3 1
    N =  8:  5 2 7 2 1 2 3
    N =  9:  7 1 5 5 3 3 7 1
    N = 10:  7 8 1 6 5 4 9 2 3
    N = 11:  9 7 5 3 1 9 7 5 3 1
    N = 12:  9 6 3 10 9 4 3 8 9 2 3

There is no reason to insist that P[N] should be filled with elements in decreasing order. We could proceed as above to build a table which fills P[N] in any order we choose. One reason for doing so would be to try to avoid having to store the table: there are at least two known versions of this method in which the indices can be easily computed and it is not necessary to precompute the index table.

The first of these methods was one of the earliest permutation generation algorithms to be published, by M. B. Wells in 1960 [47]. As modified by J. Boothroyd in 1965 [1, 2], Wells' algorithm amounts to using

    B[N,c] = N-c   if N is even and c > 2,
             N-1   otherwise,

or, in Algorithm 1, replacing P[B[N,c]] :=: P[N] by

    if (N even) and (c > 2)
      then P[N] :=: P[N-c]
      else P[N] :=: P[N-1] endif

It is rather remarkable that such a simple method should work properly. Wells gives a complete formal proof in his paper, but many readers may be content to check the method for all practical values of N by constructing the networks as shown in the example above. The complete networks for N = 2,3,4 are shown in Fig. 2.

In a short paper that has gone virtually unnoticed, B. R. Heap [16] pointed out several of the ideas above and described a method even simpler than Wells'. (It is not clear whether Heap was influenced by Wells or Boothroyd, since he gives no references.) Heap's method is to use

    B[N,c] = 1   if N is odd,
             c   if N is even,

or, in Algorithm 1, to replace P[B[N,c]] :=: P[N] by

    if N odd then P[N] :=: P[1] else P[N] :=: P[c] endif

Heap gave no formal proof that his method works, but a proof similar to Wells' will show that the method works for all N. (The reader may find it instructive to verify that the method works for practical values of N (as Heap did) by proceeding as we did when constructing the index table above.) Figure 3 shows that the networks for N = 2,3,4 are the same as for Algorithm 1 with the precomputed index table, but that the network for N = 5, shown in Diagram 5, differs. (The empty boxes denote the network for N = 4 from Fig. 3.)
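Both rules are compact enough to check mechanically. The following Python sketch is an adaptation for illustration, not code from the paper: Algorithm 1 is recast with 0-indexed arrays, the recursion is carried down to N = 1, and "process" is replaced by collecting each permutation; either index rule drives the same skeleton.

```python
def algorithm1(P, B):
    """Skeleton of Algorithm 1: generate all permutations of list P,
    driven by an index rule B(N, c) (both arguments 1-indexed, as in the paper)."""
    out = []
    def permutations(N):
        if N == 1:
            out.append(tuple(P))      # stand-in for "process"
            return
        c = 1
        while True:
            permutations(N - 1)
            if c >= N:
                break
            b = B(N, c)
            P[b - 1], P[N - 1] = P[N - 1], P[b - 1]   # P[B[N,c]] :=: P[N]
            c += 1
    permutations(len(P))
    return out

def wells(N, c):                      # Wells' rule, as modified by Boothroyd
    return N - c if N % 2 == 0 and c > 2 else N - 1

def heap(N, c):                       # Heap's rule
    return 1 if N % 2 == 1 else c
```

With either rule, algorithm1(list("ABCD"), wells) and algorithm1(list("ABCD"), heap) each produce the 24 permutations of four elements exactly once.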
The method is based on the natural idea that for every permutation of N-1 elements we can generate N permutations of N elements by inserting the new element into all possible positions. For example, for five elements, the first four exchange modules in the permutation network are as shown in Diagram 6. The next exchange is P[1] :=: P[2], which produces a new permutation of the elements originally in P[2], P[3], P[4], P[5] (and which are now in P[1], P[2], P[3], P[4]). Following this exchange, we bring A back in the other direction, as illustrated in Diagram 7. Now we exchange P[3] :=: P[4] to produce the next permutation of the last four elements, and continue in this manner until all 4! permutations of the elements originally in P[2], P[3], P[4], P[5] have been generated. The network makes five new permutations of the five elements for each of these (by putting the element originally in P[1] in all possible positions), so that it generates a total of 5! permutations.

(Diagrams 6 and 7, the first exchange modules of the network for five elements, appear here.)

Generalizing the description in the last paragraph, we can inductively build the network for N elements by taking the network for N-1 elements and inserting chains of N-1 exchange modules (to sweep the first element back and forth) in each space between exchange modules. The main complication is that the subnetwork for N-1 elements has to shift back and forth between the first N-1 lines and the last N-1 lines in between sweeps. Figure 4 shows the networks for N = 2,3,4. The modules in boxes identify the subnetwork: if, in the network for N, we connect the output lines of one box to the input lines of the next, we get the network for N-1.

FIGURE 4. Johnson-Trotter algorithm for N = 2, 3, 4.

    N = 2:  AB  BA
    N = 3:  ABC  BAC  BCA  CBA  CAB  ACB
    N = 4:  ABCD BACD BCAD BCDA  CBDA CBAD CABD ACBD
            ACDB CADB CDAB CDBA  DCBA DCAB DACB ADCB
            ADBC DABC DBAC DBCA  BDCA BDAC BADC ABDC

Continuing the example above, we get the full network for N = 5 shown in Figure 5. By connecting the boxes in this network, we get the network for N = 4.

To develop a program to exchange according to these networks, we could work down from a recursive formulation as in the preceding section, but instead we shall take a bottom-up approach. To begin, imagine that each exchange module is labelled with the number of the network in which it first appears. Thus, for N = 2 the module would be numbered 2; for N = 3 the five modules would be labelled 3 3 2 3 3; for N = 4 the 23 modules are numbered

    4 4 4 3 4 4 4 3 4 4 4 2 4 4 4 3 4 4 4 3 4 4 4;

for N = 5 we insert 5 5 5 5 between the numbers above, etc. To write a program to generate this sequence, we keep a set of incrementing counters c[i], 2 <= i <= N, which are all initially 1 and which satisfy
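The insertion idea of the preceding paragraphs translates directly into a short recursive generator. The sketch below is a Python rendering for illustration (it is not the paper's Algorithm 3): for each permutation of the tail, the head element is swept through every position, alternating direction, which reproduces the order shown in Figure 4.

```python
def plain_changes(elements):
    """Johnson-Trotter (plain changes) order via insertion: sweep the
    head element through each permutation of the tail, alternating
    the sweep direction so successive outputs differ by one adjacent
    transposition."""
    if len(elements) <= 1:
        yield list(elements)
        return
    head, tail = elements[0], list(elements[1:])
    rightward = True
    for rest in plain_changes(tail):
        idxs = range(len(elements)) if rightward else range(len(elements) - 1, -1, -1)
        for i in idxs:
            yield rest[:i] + [head] + rest[i:]
        rightward = not rightward
```

For three elements this yields exactly the Figure 4 sequence ABC BAC BCA CBA CAB ACB, and for four elements the 24 permutations beginning ABCD BACD BCAD BCDA.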
rithms 2 and 3. The similarities become striking when we consider an alternate implementation of the Johnson-Trotter method:

Algorithm 3a (Alternate Johnson-Trotter)

    i := N; x := 0;
    loop: c[i] := 1; d[i] := true while i > 1: i := i-1 repeat;
    process;
    loop:
      if c[i] < N+1-i
        then if d[i] then k := c[i] + x
                     else k := N+1-i - c[i] + x endif;
             P[k] :=: P[k+1];
             process;
             c[i] := c[i] + 1; i := 1; x := 0;
        else if not d[i] then x := x + 1 endif;
             c[i] := 1; d[i] := not d[i]; i := i + 1
      endif;
    while i <= N repeat;

This program is the result of two simple transformations on Algorithm 3. First, change i to N+1-i everywhere and redefine the c and d arrays so that c[N+1-i], d[N+1-i] in Algorithm 3 are the same as c[i], d[i] in Algorithm 3a. (Thus a reference to c[i] in Algorithm 3 becomes c[N+1-i] when i is changed to N+1-i, which becomes c[i] in Algorithm 3a.) Second, rearrange the control structure around a single outer loop. The condition c[i] < N+1-i in Algorithm 3a is equivalent to the condition c[i] < i in Algorithm 3, and both programs perform the exchange and process the permutation in this case. When the counter is exhausted (c[i] = N+1-i in Algorithm 3a; c[i] = i in Algorithm 3), both programs fix the offset, reset the counter, switch the direction, and move up a level.

If we ignore statements involving P, k and d, we find that this version of the Johnson-Trotter algorithm is identical to Heap's method, except that Algorithm 3a compares c[i] with N+1-i and Algorithm 2 compares it with i. (Notice that Algorithm 2 still works properly if in both its occurrences 2 is replaced by 1.)

To appreciate this similarity more fully, let us consider the problem of writing a program to generate all N-digit decimal numbers: to "count" from 0 to 99...9 = 10^N - 1. The algorithm that we learn in grade school is to increment the right-most digit which is not 9 and change all the nines to its right to zeros. If the digits are stored in reverse order in the array c[N], c[N-1], ..., c[2], c[1] (according to the way in which we customarily write numbers) we get the program

    i := N; loop: c[i] := 0 while i > 1: i := i-1 repeat;
    loop:
      if c[i] < 9 then c[i] := c[i] + 1; i := 1
      else c[i] := 0; i := i + 1
      endif;
    while i <= N repeat;

From this program, we see that our permutation generation algorithms are controlled by this simple counting process, but in a mixed-radix number system. Where in ordinary counting the digits satisfy 0 <= c[i] <= 9, in Algorithm 2 they satisfy 1 <= c[i] <= i and in Algorithm 3a they satisfy 1 <= c[i] <= N-i+1. Figure 6 shows the values of c[1],...,c[N] when process is encountered in Algorithms 2 and 3a for N = 2,3,4.

Virtually all of the permutation generation algorithms that have been proposed are based on such "factorial counting" schemes. Although they appear in the literature in a variety of disguises, they all have the same control structure as the elementary counting program above. We have called methods like Algorithm 2 recursive because they generate all sequences of c[1],...,c[i-1] in-between increments of c[i] for all i; we shall call methods like Algorithm 3 iterative because they iterate c[i] through all its values in-between increments of c[i+1],...,c[N].

Loopless Algorithms

An idea that has attracted a good deal of attention recently is that the Johnson-Trotter algorithm might be improved by removing the inner loop from Algorithm 3. This idea was introduced by G. Ehrlich [10, 11], and the implementation was refined by N. Dershowitz [5]. The method is also described in some detail by S. Even [12].

Ehrlich's original implementation was complex, but it is based on a few standard programming techniques. The inner loop
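As a concrete illustration (a Python sketch of the idea, not code from the paper), here is the grade-school counting program recast with the mixed radices 1 <= c[i] <= i that control Algorithm 2; reading each counter value in the order c[N],...,c[1] reproduces column (a) of Figure 6.

```python
def factorial_count(N):
    """Mixed-radix ("factorial") counting: digit c[i] runs over 1..i.
    Yields (c[1], ..., c[N]) each time a new counter value is reached,
    mirroring the moments when "process" is encountered in Algorithm 2."""
    c = [1] * (N + 1)              # c[1..N]; c[0] is unused padding
    yield tuple(c[1:])
    i = 2
    while i <= N:
        if c[i] < i:               # increment the lowest non-exhausted digit
            c[i] += 1
            i = 2
            yield tuple(c[1:])
        else:                      # "carry": reset this digit, move up a level
            c[i] = 1
            i += 1
```

For N = 3 the six values, printed as c[3]c[2]c[1], are 111 121 211 221 311 321, matching Figure 6(a).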
FIGURE 6. Factorial counting: c[N], ..., c[1]. (a) Using Algorithm 2 (recursive). (b) Using Algorithm 3a (iterative).

    (a) N = 2:  11 21
        N = 3:  111 121 211 221 311 321
        N = 4:  1111 1121 1211 1221 1311 1321 2111 2121 2211 2221 2311 2321
                3111 3121 3211 3221 3311 3321 4111 4121 4211 4221 4311 4321

    (b) N = 2:  11 12
        N = 3:  111 112 113 121 122 123
        N = 4:  1111 1112 1113 1114 1121 1122 1123 1124 1131 1132 1133 1134
                1211 1212 1213 1214 1221 1222 1223 1224 1231 1232 1233 1234
This method does not require that the elements being permuted be distinct, but it is slightly less efficient than Algorithm 4 because more counting has to be done.

Ives' algorithm is more efficient than the Johnson-Trotter method (compare Algorithm 4a with Algorithm 3a) since it does not have to maintain the array d or offset x. The alternate implementation bears a remarkable resemblance to Heap's method (Algorithm 2). Both of these algorithms do little more than factorial counting. We shall compare them in Section 3.

FIGURE 7. Ives' algorithm for N = 2, 3, 4.

2. OTHER TYPES OF ALGORITHMS

In this section we consider a variety of algorithms which are not based on simple exchanges between elements. These algorithms generally take longer to produce all permutations than the best of the methods already described, but they are worthy of study for several reasons. For example, in some situations it may not be necessary to generate all permutations, but only some "random" ones. Other algorithms may be of practical interest because they are based on elementary operations which could be as efficient as exchanges on some computers. Also, we consider algorithms that generate the permutations in a particular order which is of interest. All of the algorithms can be cast in terms of the basic
constructed. Remarkably, we can take Algorithm 5 and switch to the counter system upon which the iterative algorithms in Section 2 were based:

    i := N; loop: c[i] := 1 while i > 1: i := i-1 repeat;
    process;
    loop:
      rotate(N+1-i);
      if c[i] < N+1-i then c[i] := c[i] + 1; i := 1;
           process;
      else c[i] := 1; i := i + 1
      endif;
    while i <= N repeat;

Although fewer redundant permutations are generated, longer rotations are involved, and this method is less efficient than Algorithm 5. However, this method does lend itself to a significant simplification, similar to the one Ives used. The condition c[i] = N+1-i merely indicates that the elements in P[1], P[2], ..., P[N+1-i] have undergone a full rotation, that is, P[N+1-i] is back in its original position in the array. This means that if we initially set Q[i] = P[i] for 1 <= i <= N, then c[i] = N+1-i is equivalent to P[N+1-i] = Q[N+1-i]. But we have now removed the only test on c[i], and now it is not necessary to maintain the counter array at all! Making this simplification, and changing i to N+1-i, we have an algorithm proposed by G. Langdon in 1967 [25], shown in Fig. 10.

FIGURE 9. Tompkins-Paige algorithm for N = 2, 3, 4.
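To make the counter-driven rotation scheme concrete, here is a Python sketch (an adaptation for illustration: rotate(k) is realized as a slice rotation of the first k elements, and "process" is replaced by collecting tuples):

```python
def rotation_counter(P):
    """Counter-driven rotation scheme sketched above. rotate(N+1-i)
    cyclically shifts the first N+1-i elements left by one position;
    only then-branch states are "processed", so each permutation is
    collected exactly once."""
    N = len(P)
    out = [tuple(P)]               # process the initial arrangement
    c = [1] * (N + 1)              # counters c[1..N]
    i = 1
    while i <= N:
        k = N + 1 - i
        P[:k] = P[1:k] + P[:1]     # rotate(N+1-i)
        if c[i] < N + 1 - i:
            c[i] += 1
            i = 1
            out.append(tuple(P))   # process the new permutation
        else:                      # full rotation completed: reset, move up
            c[i] = 1
            i += 1
    return out
```

Rotations performed in the else branch restore a prefix to its earlier state and are not processed, which is the source of the extra work the text mentions.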
ent less difficulty than usual in the case of permutation generation. First, since all of the algorithms have the same control structure, comparisons between many of them are immediate, and we need only examine a few in detail. Second, the analysis involved in determining the total running time of the algorithms on real computers (by counting the total number of times each instruction is executed) is not difficult, because of the simple counting algorithms upon which the programs are based.

If we imagine that we have an important application where all N! permutations must be generated as fast as possible, it is easy to see that the programs must be carefully implemented. For example, if we are generating, say, every permutation of 12 elements, then every extraneous instruction in the inner loop of the program will make it run at least 8 minutes longer on most computers (see Table 1).

Evidently, from the discussion in Section 1, Heap's method (Algorithm 2) is the fastest of the recursive exchange algorithms examined, and Ives' method (Algorithm 4) is the fastest of the iterative exchange algorithms. All of the algorithms in Section 2 are clearly slower than these two, except possibly for Langdon's method (Algorithm 6) which may be competitive on machines offering a fast rotation capability. In order to draw conclusions comparing these three algorithms, we shall consider in detail how they can be implemented in assembly language on real computers, and we shall analyze exactly how long they can be expected to run.

As we have done with the high-level language, we shall use a mythical assembly language from which programs on real computers can be easily implemented. (Readers unfamiliar with assembly language should consult [21].) We shall use load (LD), store (ST), add (ADD), subtract (SUB), and compare (CMP) instructions which have the general form

    LABEL      OPCODE  REGISTER,OPERAND
    (optional)

The first operand will always be a symbolic register name, and the second operand may be a value, a symbolic register name, or an indexed memory reference. For example, ADD I,1 means "increment Register I by 1"; ADD I,J means "add the contents of Register J to Register I"; and ADD I,C(J) means "add to Register I the contents of the memory location whose address is found by adding the contents of Register J to C". In addition, we shall use control transfer instructions of the form

    OPCODE  LABEL

namely JMP (unconditional transfer); JL, JLE, JE, JGE, JG (conditional transfer according as whether the first operand in the last CMP instruction was <, <=, =, >=, > than the second); and CALL (subroutine call). Other conditional jump instructions are of the form

    OPCODE  REGISTER,LABEL

namely JN, JZ, JP (transfer if the specified register is negative, zero, positive). Most machines have capabilities similar to these, and readers should have no difficulty translating the programs given here to particular assembly languages.

Much of our effort will be directed towards what is commonly called code optimization: developing assembly language implementations which are as efficient as possible. This is, of course, a misnomer: while we can usually improve programs, we can rarely "optimize" them. A disadvantage of optimization is that it tends to greatly complicate a program. Although significant savings may be involved, it is dangerous to apply optimization techniques at too early a stage in the development of a program. In particular, we shall not consider optimizing until we have a good assembly language implementation which we have fully analyzed, so that we can tell where the improvements will do the most good. Knuth [24] presents a fuller discussion of these issues.

Many misleading conclusions have been drawn and reported in the literature based on empirical performance statistics comparing particular implementations of particular algorithms. Empirical testing can be valuable in some situations, but, as we have seen, the structures of permutation generation algorithms are so similar that the empirical tests which have been per-
          LD   Z,1
          LD   I,N           i := N;
    INIT  ST   Z,C(I)        loop: c[i] := 1
          CMP  I,2
          JLE  CALL          while i > 2:
          SUB  I,1           i := i - 1
          JMP  INIT          repeat;
    CALL  CALL PROCESS       process;
    LOOP  LD   J,C(I)        loop:
          CMP  J,I
          JE   ELSE          if c[i] < i
    THEN  LD   T,I           then
          AND  T,1
          JZ   T,EVEN        if i odd
          LD   K,1           then k := 1
          JMP  EXCH          else k := c[i]
    EVEN  LD   K,J           endif;
    EXCH  LD   T,P(I)
          LD   T1,P(K)
          ST   T1,P(I)
          ST   T,P(K)        P[i] :=: P[k];
          ADD  J,1
          ST   J,C(I)        c[i] := c[i] + 1;
          LD   I,2           i := 2;
          CALL PROCESS       process;
          JMP  WHILE
    ELSE  ST   Z,C(I)        else c[i] := 1;
          ADD  I,1           i := i + 1 endif;
    WHILE CMP  I,N           while i <= N
          JLE  LOOP          repeat;
This can be more simply expressed in terms of the base of the natural logarithms, e, which has the series expansion Σ_{k>=0} 1/k!: it is easily verified that

    B_N = [N!(e-2)]

That is, B_N is the integer part of the real number N!(e-2) (or B_N = N!(e-2) - ε with 0 <= ε < 1). The recurrences for A_N can be solved in a similar manner to yield the result

    A_N = N! Σ_{2<=k<=N} (-1)^k / k! = [N!/e].

tion loop in Algorithm 2.) In general, for any n > 1, we can replace i := 2 by i := n+1; process all permutations of P[1], ..., P[n]. This idea was first applied to the permutation enumeration problem by Boothroyd [2]. For small n, we can quite compactly write in-line code to generate all permutations of P[1], ..., P[n]. For example, taking n = 3 we may simply replace

    CALL LD   I,2
         LD   X,1
         CALL PROCESS

in Program 2 by the code in Program 3,
     N          N!        A_N        B_N
     1           1          0          0
     2           2          0          1           40          56+
     3           6          2          4          154         159+
     4          24          8         17          638         637+
     5         120         44         86         3194         3186
     6         720        264        517        19130        19116
     7        5040       1854       3620       133836       133812
     8       40320      14832      28961      1070550      1070496
     9      362880     133496     260650      9634750      9634454
    10     3628800    1334960    2606501     96347210     96344640
    11    39916800   14684570   28671512   1059818936   1059791040
    12   479001600  176214840  344058145  12717826742  12717492480
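The closed forms can be checked mechanically. The Python sketch below (written here for verification only, not from the paper) evaluates A_N = [N!/e] and B_N = [N!(e-2)] with exact rational arithmetic, truncating the series for 1/e and e-2 far enough that the truncation error cannot affect the integer part:

```python
from fractions import Fraction
from math import factorial

def A(N):
    """A_N = [N!/e], via the alternating series 1/e = sum_{k>=0} (-1)^k / k!."""
    inv_e = sum(Fraction((-1) ** k, factorial(k)) for k in range(N + 20))
    return int(factorial(N) * inv_e)        # int() truncates; the value is positive

def B(N):
    """B_N = [N!(e-2)], via e - 2 = sum_{k>=2} 1/k!."""
    e_minus_2 = sum(Fraction(1, factorial(k)) for k in range(2, N + 20))
    return int(factorial(N) * e_minus_2)
```

These reproduce the tabulated values, for instance A(5) = 44 and B(5) = 86.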
PROGRAM 4. IMPROVED IMPLEMENTATION OF IVES' METHOD

         LD   I,N          1
    INIT ST   I,C(I)       N-1
         LD   V,P(I)       N-1
         ST   V,Q(I)       N-1
         CMP  I,1          N-1
         JLE  CALL         N-1
         SUB  I,1          N-1
         JMP  INIT         N-1
    THEN LD   T,P(J)       N!-C_N-1
         LD   T1,P+1(J)    N!-C_N-1
         ST   T1,P(J)      N!-C_N-1
         ST   T,P+1(J)     N!-C_N-1
         ADD  J,1          N!-C_N-1
         ST   J,C(I)       N!-C_N-1
    CALL LD   I,1          N!-C_N
         LD   H,N          N!-C_N
         CALL PROCESS      N!
    LOOP LD   J,C(I)       N!+D_N-1
         CMP  J,H          N!+D_N-1
         JL   THEN         N!+D_N-1
    ELSE LD   T,P(I)       C_N+D_N
         LD   T1,P(H)      C_N+D_N
         ST   T1,P(I)      C_N+D_N
         ST   I,C(I)       C_N+D_N
         CMP  T,Q(H)       C_N+D_N
         JNE  CALL         C_N+D_N
         ADD  I,1          D_N
         SUB  H,1          D_N
         CMP  I,H          D_N
         JL   LOOP         D_N

tion than the improved implementation of Heap's method, mainly because it does less counter manipulation. Other iterative methods, like the Johnson-Trotter algorithm (or the version of Ives' method, Algorithm 4a, which does not require the elements to be distinct), are only slightly faster than Heap's method.

However, the iterative methods cannot be optimized quite as completely as we were able to improve Heap's method. In Algorithm 4 and Program 4, the most frequent operation is P[c[N]] :=: P[c[N]+1]; c[N] := c[N]+1; all but 1/N of the exchanges are of this type. Therefore, we should program this operation separately. (This idea was used by Ehrlich [10, 11].) Program 4 can be improved by inserting the code given in Program 5 directly after CALL PROCESS.

(As before, we shall write down only the new code, but make reference to the entire optimized program as "Program 5".) In this program, Pointer J is kept negative so that we can test it against zero, which can be done efficiently on many computers. Alternatively, we could sweep in the other direction, and have J range from N-1 to 0. Neither of these tricks may be necessary on computers with advanced loop control instructions.

To find the total running time of Program 5, it turns out that we need only replace N! by (N-2)! everywhere in the frequencies in Program 4, and then add the frequencies of the new instructions. The result is

    9N! + 2(N-1)! + 18(N-2)! + O((N-4)!),

not quite as fast as the "optimized" version of Heap's algorithm (Program 3). For a fixed value of N, we could improve the program further by completely unrolling the inner loop of Program 5. The second through eighth instructions of Program 5 could be replaced by

    LD   T,P+1
    ST   T,P
    ST   V,P+1
    CALL PROCESS
    LD   T,P+2
    ST   T,P+1
    ST   V,P+2
    CALL PROCESS
    LD   T,P+3
    ST   T,P+2
    ST   V,P+3
    CALL PROCESS

(This could be done, for example, by a macro generator.) This reduces the total running time to

    7N! + (N-1)! + 18(N-2)! + O((N-4)!)

which is not as fast as the comparable highly optimized version of Heap's method (with n = 4).

It is interesting to note that the optimization technique which is appropriate for the recursive programs (handling small cases separately) is much more effective than the optimization technique which is
It is interesting to study Langdon's cyclic method (Algorithm 6) in more detail, because it can be implemented with only a few instructions on many computers. In addition, it can be made to run very fast on computers with hardware rotation capabilities.

To implement Algorithm 6, we shall use a new instruction

    MOVE TO,FROM(I)

which, if Register I contains the number i, moves i words starting at Location FROM to Location TO. That is, the above instruction is equivalent to

         LD   J,0
    LOOP LD   T,FROM(J)
         ST   T,TO(J)
         ADD  J,1
         CMP  J,I
         JL   LOOP

We shall assume that memory references are overlapped, so that the instruction takes 2i time units. Many computers have "block transfer" instructions similar to this, although the details of implementation vary widely.

For simplicity, let us further suppose that P[1], ..., P[N] are initially the integers 0,1,...,N-1, so that we don't have to bother with the Q array of Algorithm 6.

PROGRAM 6. IMPLEMENTATION OF LANGDON'S METHOD

    THEN LD   I,N-1        N!
         CALL PROCESS      N!
    LOOP LD   T,P+1        N!+E_N
         MOVE P,P+1(I)     F_N
         ST   T,P+1(I)     N!+E_N
         CMP  T,I          N!+E_N
         JNE  THEN         N!+E_N
         SUB  I,1          E_N
         JNZ  LOOP         E_N

It is faster than Program 2 for N < 8 and faster than Program 4 for N < 4, but it is much slower for larger N.

By almost any measure, Program 6 is the simplest of the programs and algorithms that we have seen so far. Furthermore, on most computer systems it will run faster than any of the algorithms implemented in a high-level language. The algorithm fueled a controversy of sorts (see other references in [25]) when it was first introduced, based on just this issue.

Furthermore, if hardware rotation is available, Program 6 may be the method of choice. Since (N-1)/N of the rotations are of length N, the program may be optimized in the manner of Program 5 around a four-instruction inner loop (call, rotate, compare, conditional jump). On some ma-
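The control flow of Algorithm 6 as realized in Program 6 can be sketched in Python (an adaptation for illustration: the array holds the integers 0..N-1 as above, the MOVE-based rotation becomes a slice rotation, and permutations are collected rather than processed):

```python
def langdon(n):
    """Langdon's cyclic-rotation method for the integers 0..n-1:
    rotate a prefix left by one; whenever the rotated-out element
    lands back in its home position, shorten the prefix and rotate
    again, as in Program 6."""
    P = list(range(n))
    out = [tuple(P)]               # process the initial arrangement
    i = n                          # current rotation length
    while True:
        P[:i] = P[1:i] + P[:1]     # rotate P[0..i-1] left by one
        if P[i - 1] != i - 1:      # element not home: a new permutation
            out.append(tuple(P))
            i = n                  # resume full-length rotations
        else:
            i -= 1                 # full cycle completed at this length
            if i == 1:
                return out
```

Since the elements are 0..n-1, the home-position test P[i-1] = i-1 plays the role of the CMP T,I test in Program 6, and no Q array is needed.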