
Challenges of Computing the Fast Fourier Transform

J. R. Johnson and R. W. Johnson June 18, 1997

1 Introduction
The Fast Fourier Transform, the FFT, discovered by Cooley and Tukey in 1965, is one of the most important algorithms in scientific computation. (Actually Cooley and Tukey rediscovered the FFT. See [18] and [24] for the interesting history of the FFT, which can be traced back to Gauss.) In fact, Mickey Edwards of Cray Research, Inc. said in 1990 that in Cray's installed base of about 200 machines, at $25 million a shot, 40% of all CPU cycles were spent computing the FFT. To illustrate how the FFT is used, we briefly discuss a very typical application of the FFT. In crystallographic structure analysis, the FFT is used to derive the structure of a crystal from its x-ray diffraction pattern. The electron density function ρ(r) in a crystal determines its diffraction pattern and conversely. The function ρ(r) is a triply periodic function of the position vector r and, consequently, can be expanded in a Fourier series. The coefficients of the Fourier series are the structure factors, and their magnitudes are determined from the x-ray diffraction pattern. Since the structure factors are in general complex, the determination of the structure of a crystal is equivalent to finding the phases of the structure factors: the phase problem of x-ray crystallography. Hauptman [23] discusses the phase problem and shows that, due to the atomicity of crystal structures and the redundancy of observed magnitudes, the problem is, in principle, solvable. In practice, this means that for small structures the phase problem is directly solvable. However, for larger structures indirect methods are used. Indirect methods involve computing the Fourier transform and its inverse many times to calculate the structure factors from their magnitudes. The computational cost of these calculations is enormous.
It is in the folklore of scientific computation that the successful effort of Rossmann at Purdue in the mid-80's to determine the structure of one of the cold viruses used 1400 hours of CYBER 205 computer time at $2000 an hour. Most of this computer time was spent computing the FFT or its inverse. From this example, it can be seen that even very modest (5 or 10 percent) improvements in the performance of the FFT on a given computer are very significant. This observation suggests that it is important to provide optimized implementations of the FFT that are tuned to the specific computers that they are being run on. Writing machine-specific code goes against the desire to provide implementations of algorithms that are portable across many platforms. The problem of simultaneously achieving high performance and portability is made much more difficult with the advent of vector and parallel computers. There have been many approaches to providing high-performance portable implementations of scientific codes. The two basic approaches are: (1) implement algorithms using a set of standard primitives, for which efficient vendor-supplied implementations exist for many different computers; (2) implement algorithms in a high-level language and rely on optimizing compilers to provide efficiency. The second approach can be divided into two subcategories depending on the amount of architectural information in the high-level language that is used. Some languages, such as High Performance FORTRAN (HPF), may provide explicit vector and parallel instructions so that programs may explicitly deal with performance issues that arise on parallel and vector computers. Of course, portability will depend on the universality of the computational model that is available in the language. Alternatively, parallelizing or vectorizing compilers can be used to transform sequential programs (or programs without a specific computational model) into a parallel or vector program suited to a particular parallel or vector computer. We do not believe that either of these general approaches provides the performance, portability, and flexibility that is needed for important algorithms such as the FFT. However, the FFT has a sufficiently rich structure that an alternative approach to providing efficiency and portability is possible. It is possible to provide a concise mathematical description of the FFT that is general enough to describe "all" of the variants of the FFT. We argue that this mathematical description should be used for implementing FFT algorithms.
In this approach a formula becomes a program, and mathematics can be used to optimize and transform programs. Furthermore, using various interpretations of the mathematics, architectural meaning can be associated with a formula. In addition, mathematical properties of the finite Fourier transform can be used to automatically generate "all" FFT algorithms for a transform of a specific size. Thus optimization of an FFT can be reduced to a search process. Various formulas are generated, translated to programs, and timed, and the formula with the best performance is chosen. Using this framework, our approach to an optimized portable implementation of the FFT is to provide a special-purpose language for describing FFT algorithms along with a special-purpose compiler and a set of tools for generating, optimizing, and measuring the efficiency of FFT programs. A specific transform is given to the system and the system automatically searches for a good implementation. In order to obtain high performance it may be necessary to provide a machine-specific back end for the compiler; however, due to the limited scope of the language, this should be far easier than developing a general-purpose compiler. In the remainder of this paper we outline additional applications of the FFT, discuss some issues involved in high-performance implementations of the FFT, and elaborate on our methodology for implementing FFTs. The paper is organized so that technical details are separate from the general discussion. Section 7, which contains a detailed discussion of our approach to implementing FFTs, can be skipped by the reader who is not interested in the details of the FFT.

2 Why the FFT?


It is important and fast. It is first and foremost a way of computing the finite Fourier transform quickly and efficiently. Thus, wherever the finite Fourier transform occurs naturally, the FFT will be found. In fact, it is no overstatement to say that the discovery of the FFT created the field of digital signal processing. However, even where the use of the finite Fourier transform is somewhat artificial, it generally pays to cast the problem in these terms because of the availability of the FFT. This might be said of fast Poisson solvers in partial differential equations.

2.1 Uses

In this section we outline some of the application areas where the FFT is used. The list of applications we provide is meant to illustrate the types of applications that use the FFT. The list is not meant to be exhaustive, nor is it supposed to explain how the FFT arises or how it is used. The interested reader should consult the following books to learn more about the applications of the FFT [12, 13, 14, 43]. In a recent conversation, Charles Stirman, manager of Wavelet programs at Lockheed Martin, developed the following extensive list of applications of the FFT in areas of interest to the DoD. This list was developed with the help of Jim Hughen and Dick McCoy at Lockheed Martin. Steven Orszag, from Princeton University, also mentioned that the FFT is used in computational fluid dynamics and microlithography, two areas also of interest to the DoD.

1. RADAR Processing
   (a) Range compression in High Range Resolution Profile
   (b) Doppler processing for F14, F15, F18 radars and missiles (AMRAAM and Phoenix) that track target velocity
   (c) Moving Target Indication (MTI) radars that look for the presence of moving targets and need to tell if the target is approaching or receding
   (d) Frequency domain implementation of cross correlation as in pulse compression
   (e) Clutter cancellation
2. Digital Pulse Compression. Uses two transforms: one to convert the signals to frequency, then multiply by the pulse compression filter coefficients, then convert back to time. (Using the identity that multiplication in the frequency domain is equivalent to convolution in the time domain.)
3. Synthetic Aperture Radar (SAR)
   (a) Range compression
   (b) Azimuth compression
   (c) Fast convolution in stripmap SAR
4. Phased Array Antennas
   (a) Butler feed for phased arrays uses the Fourier transform
5. Digital Filtering
6. Spectrum Analysis
   (a) sensors such as accelerometers for vibration analysis
   (b) acoustics, as in underwater acoustics
   (c) passive sonar uses short time Fourier analysis
   (d) active sonar is analogous to radar
   (e) laser radars (LADARs) use the FFT for vibration analysis for identification
7. Optics
8. Short Time Fourier Analysis. Short time Fourier analysis is a sequence of relatively short FFTs, usually over a long sequence of data. Short time Fourier analysis is used in Jim Hughen's SAR autofocus algorithm.
9. Speech Analysis
10. Crystallography
11. Computational Fluid Dynamics
12. Microlithography

2.2 What does Fast (N log N) Mean?

If the finite Fourier transform of a sequence of N points is computed in the straightforward way, order N^2 operations are required. The FFT reduces the number of operations to order N log N. This implies a reduction in computing time of order N/log N, which for N = 2^10 is two orders of magnitude. The actual time reduction depends on the constants hidden in the "order" statement. Different programs implementing the FFT can easily have run times differing by an order of magnitude. Depending on how well the FFT is implemented, the improvement over straightforward methods may be more or less pronounced.
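To make the N^2-versus-N log N comparison concrete, the following sketch (ours, not from the paper) computes a 16-point transform both by direct evaluation and by a textbook recursive radix-2 FFT, using the convention ω_N = e^{2πi/N}, and checks that the two agree.

```python
import cmath

def dft_naive(x):
    """Direct evaluation of y(l) = sum_k w^(lk) x(k): order N^2 operations."""
    n = len(x)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(w ** (l * k) * x[k] for k in range(n)) for l in range(n)]

def fft_radix2(x):
    """Recursive radix-2 FFT: order N log N operations (N a power of 2)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # transform of the even-indexed points
    odd = fft_radix2(x[1::2])    # transform of the odd-indexed points
    w = cmath.exp(2j * cmath.pi / n)
    y = [0j] * n
    for k in range(n // 2):
        t = (w ** k) * odd[k]    # twiddle factor times the odd half
        y[k] = even[k] + t
        y[k + n // 2] = even[k] - t
    return y

x = [complex(k % 3, (-k) % 5) for k in range(16)]
y1, y2 = dft_naive(x), fft_radix2(x)
err = max(abs(a - b) for a, b in zip(y1, y2))
```

For N = 1024 the operation-count ratio is roughly 1024/10, the two orders of magnitude quoted above.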

3 Optimization
These introductory considerations suggest that optimization of an FFT running on a particular architecture is an important practical problem. In the Purdue example, a small improvement in running time, say 10%, would save $280,000. Historically, the problem of matching an FFT implementation to the underlying hardware reached something of a crisis with the introduction of the Cray machines and the CYBER 205 in the late 70's and early 80's. These machines were vector machines and required that code be "vectorized" to achieve high performance. Penalties for poorly vectorized code could certainly be an order of magnitude. This problem led to a flurry of activity re-examining the FFT [1, 2, 8, 9, 10, 15, 27, 38, 39, 40, 41].

4 Portability
The search for optimal implementations of the FFT led to the development of many variants of the original implementation. These variants were difficult to program, so it became very desirable to take, say, a good vector version and port it to another vector machine rather than implement it from scratch. However, this leads immediately to an engineering dilemma. Good portability means code that takes advantage only of common properties of the computing platform, but good optimization takes advantage of the special features of the underlying hardware to achieve good performance. In the folklore at Cray is the story that it took six man-years to port the FFT routines from the X-MP to the Y-MP in 1985, even though the two machines, as their names suggest, are "nearly" identical: both are vector machines with a few tightly coupled processors. The problem of porting the FFT to the Cray T3D/E, a massively parallel machine with many loosely coupled processors, is to this day essentially unsolved! That is, portability requires the lowest common denominator in computing features, and optimization requires a good fit to the special case. There have been many different proposals for resolving this dilemma.

5 Methodology
In this section we outline a methodology for producing optimized portable libraries that has grown out of our experience with automatic methods of implementing high-performance FFT algorithms. The underlying idea is to use the mathematical formulation of the algorithm, and mathematical tools, to aid in optimizing the implementation of the algorithm, in addition to computer science techniques. Our approach is based on three fundamental principles: (1) algorithms should be described in a formal language which can be mathematically manipulated and which is appropriate for the specific problem domain; (2) since one algorithm may not be optimal for all problem sizes and machine architectures, it is important to provide a toolkit that allows the user to adapt the algorithm to a particular problem and computer; (3) the language used to describe problem-specific algorithms should be sufficiently powerful that algorithms are adaptable and many different variations of an algorithm can be described and generated. The general approach is to create a formal language for describing algorithms for a restricted class of problems. A special-purpose compiler needs to be constructed for the language. This compiler should be easily adaptable so that it can produce different code for different computer architectures. In addition to the compiler, an algorithm generator and manipulator should be provided, so that different algorithms can be generated and tried for different problems and problem sizes. Finally, tools should be provided to assist the algorithm developer in the selection of an appropriate algorithm. These tools could provide expert assistance based on mathematical theorems and previous experience, or may simply help with the execution and analysis of runtime experiments to measure performance. Our basic optimization strategy can then be described in the following steps:

1. Use mathematics to systematically generate different, but mathematically equivalent, expressions in the language describing the same algorithm;
2. Use the compiler to translate each expression into a computer program;
3. Run the program on the target machine and measure the desired properties;
4. Repeat steps 1-3 until a sufficiently good implementation is found.

This basic implementation strategy sets up the problem of optimized portable application libraries as a search problem. As such it has a brute-force solution. And although this may be impractical, it sets the stage for further analysis which may lead to a practical automatic method of optimizing portable code on a specific machine. The formulation in Step 1 can be dealt with mathematically, with all the advantages that this usually implies. In this step we still have the mathematical knowledge of what the algorithm does to help us optimize it. We can use mathematics to reduce the number of expressions that need to be translated and run. In fact, if we can formulate architectural properties of the target computer hardware mathematically, then we can eliminate certain expressions on these general grounds. Indeed, given some reasonable cost function, we have an optimization problem in the domain of formal expressions, which is far simpler than that of running programs. The cost function introduced here can be continuously improved using feedback from the measurements taken at the computer run. In Step 2, we can use all the optimization techniques that have been developed by computer science for code generation. In Step 3, we need to use techniques of performance analysis sufficient not only to modify our cost function, but also to suggest methods of improving the code or the candidate expression. As an aside, this latter possibility might be exploited in reconfiguring the hardware.
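As a toy illustration of the generate-translate-measure loop (the names and the candidate set here are ours, not part of the system described), one can treat each candidate implementation as a black box, time it, and keep the fastest:

```python
import cmath
import time

def dft_naive(x):
    # Candidate 1: direct order N^2 evaluation of the transform.
    n = len(x)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(w ** (l * k) * x[k] for k in range(n)) for l in range(n)]

def fft_radix2(x):
    # Candidate 2: recursive order N log N radix-2 FFT (N a power of 2).
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft_radix2(x[0::2]), fft_radix2(x[1::2])
    w = cmath.exp(2j * cmath.pi / n)
    y = [0j] * n
    for k in range(n // 2):
        t = (w ** k) * odd[k]
        y[k], y[k + n // 2] = even[k] + t, even[k] - t
    return y

def measure(program, x, reps=3):
    # Step 3: run the candidate and record its best wall-clock time.
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        program(x)
        best = min(best, time.perf_counter() - t0)
    return best

x = [complex(k, 0) for k in range(256)]
candidates = {"naive": dft_naive, "radix2": fft_radix2}
timings = {name: measure(p, x) for name, p in candidates.items()}
winner = min(timings, key=timings.get)   # step 4: keep the best so far
```

In the real system the candidates would be formulas generated by rewrite rules and translated by the special-purpose compiler; only the search skeleton is shown here.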

This methodology enables us to distribute the optimized portable application library as a toolkit which separates portability from optimization. In fact, the algorithm is simply a mathematical expression, and the toolkit is the semiautomated optimization procedure. Thus, in the toolkit we need the expression generator, a special-purpose compiler, a performance analyzer, and a way of using the performance analysis to modify a general cost function. We note that this process need not be totally automatic to be practical. We believe that, despite the high-level description we are advocating, it is easier to produce optimal code for a particular problem domain than for a general-purpose computer language. Furthermore, problem-specific knowledge can be represented in a special-purpose language, and this knowledge can be used to help guide program optimization.

6 Comparison with other approaches


In this section we survey some alternative approaches to designing and implementing portable, high-performance algorithms. We also explain how our methodology differs from these approaches and discuss the potential benefits of our approach. This section is meant to be illustrative and is not exhaustive. Portability can be achieved by implementing algorithms in a high-level language such as FORTRAN, C, or C++. A collection of algorithms can be distributed as a collection of subroutines implemented in a high-level language. The efficiency of such programs depends on the compiler that translates them into object code for different machines. However, even with optimizing compilers, this approach may not produce code that is efficient enough for the demands of scientific computations. Hand-coded programs for important algorithms such as the FFT can outperform compiler-produced code by several orders of magnitude. In numerical linear algebra an alternative approach has been used to obtain high-performance libraries. The key idea is to implement library routines using a small set of primitive operations. Implementations of these primitive operations, optimized to a particular computer, can then be supplied by the computer vendor. The library LAPACK [3] was written this way using primitives from the BLAS (Basic Linear Algebra Subroutines) [21]. Additional difficulties are involved when the algorithms are to be used on parallel computers. The difficulty stems from the many different models of computation and the lack of parallel languages that effectively support the different models. One approach to providing portable libraries for distributed memory parallel computers extends the idea of the BLAS. The library ScaLAPACK [11] was built using the primitives PBLAS [17], which combine ideas from the BLAS with a collection of routines used for communication called the BLACS (Basic Linear Algebra Communication Subprograms). The BLACS provide an abstraction appropriate for the communications arising in linear algebra. The BLACS can be implemented on top of other communication libraries such as MPI (Message Passing Interface) [36]. Efficient implementations of MPI can be provided by different vendors of parallel computers.

An alternative approach, which does not rely on vendor-supplied primitives, uses vectorizing and/or parallelizing compilers. In this approach the compiler automatically transforms sequential code written in a high-level language into machine-specific code that inserts parallelism and vector instructions (sometimes program transformations are performed on high-level code prior to translation to machine code; this preprocessing allows the translation process to more effectively vectorize and parallelize the resulting code) [46]. This approach has been pretty effective for vectorizing certain types of scientific codes; however, it has not proven to be very effective for distributed memory computers. In order to remove some of the difficulties of automatic parallelization, computer languages with explicit constructs for parallel operations and communication have been developed. An extension to FORTRAN 90 called High Performance FORTRAN (HPF) [29] includes many constructs useful for describing data parallel algorithms. There are several difficulties with the approaches discussed: (1) performance is heavily dependent on the vendor supplying efficient implementations of the necessary primitives or optimizing compilers; (2) algorithm development is carried out in general-purpose languages and not in a language particularly suited to the problem; (3) for many problems there are many different algorithms and many choices for the implementation of a particular algorithm. In these situations the user would like guidance in the selection of an algorithm and would like the ability to create hybrid approaches which use different approaches in different situations. While very good implementations of the BLAS and MPI have been provided by various vendors, the development time and cost can be very high. Furthermore, by organizing programs around library calls, various design decisions have to be made a priori, and hence there is little flexibility to change primitives or the design choices.
Similarly, there are many good optimizing compilers; however, the construction of such compilers is costly, and hence compilers for certain architectures (especially special-purpose hardware) may be inadequate. Several systems have been developed to allow the user to develop algorithms in a language more suitable to the problem domain. For example, MATLAB [31] uses a higher-level notation, as compared to languages such as FORTRAN, for describing matrix operations. Computer algebra systems such as Maple [16], Mathematica [47], and AXIOM [26] provide mathematical notation and algorithms useful in many scientific problem domains. However, these systems are designed more for exploration than for implementing high-performance programs. This is largely due to the interpreted nature of the languages in these systems. Note, however, that with additional tools, MATLAB code can be translated to C or C++. Furthermore, systems such as Maple and Mathematica provide many tools for generating numeric code. An approach to providing high-performance libraries in a language that more closely resembles the underlying mathematics uses operator overloading and class libraries provided by object-oriented languages such as C++. The POOMA [44] project is an example where programs for solving PDEs on parallel computers are provided using operator overloading and classes to support array operations. Several expert systems have been developed to assist the user in the selection of an appropriate algorithm for a particular problem or a particular computer platform. For example, the IRENA [20] system provides a set of rules to help the user choose an appropriate library routine from the NAG [33] library to solve a particular problem, and John Strassner from the Advanced Product Lab at Hughes Aircraft Company has developed an expert system to help in the choice of an FFT algorithm for a particular computer architecture [37]. Our approach borrows some ideas from these approaches, but is fundamentally different. We propose to implement algorithms in a special-purpose language suited to their description. The language uses the same mathematics commonly used to derive the algorithms and hence can be used to verify that the programs are correct. A special-purpose compiler is provided along with a set of tools to assist in algorithm implementation and optimization. Since the language is limited, we do not need to produce a general-purpose optimizing compiler. Furthermore, since the language is based on well-understood mathematics, our compiler has much more information to use in program optimization and transformation than general-purpose compilers. This allows automatic optimization well beyond the capabilities of general-purpose optimizing compilers. On the other hand, the high-level nature of the language allows us to change the semantics, in the sense that we are not wed to a particular computational model. For example, by changing the translation rules, we can generate programs for different models of parallel computation (e.g. shared memory, vector, or distributed memory), or for that matter we can directly generate hardware descriptions that can be used for designing special-purpose ASICs or for describing algorithms for reconfigurable hardware.
Finally, our system will come with tools based on a feedback loop that will allow the system to automatically optimize for a given platform or environment. The feedback information can help guide the choice of rewrite rules used to generate alternative implementations. Our use of rewrite rules is very different from other attempts at automating algorithm selection in that the rules are based on an underlying mathematical description of the algorithms.

7 Methodology Applied to the FFT


This section presents details showing how the methodology outlined in Section 5 can be applied to the FFT. This section also illustrates how mathematics plays a fundamental role in the implementation of FFT algorithms. Section 7.1 defines the Fourier transform and presents a mathematical description of the FFT. Section 7.2 describes a language, called TPL, for describing FFT algorithms. Expressions in TPL are mathematical formulas encompassing the mathematical description of the FFT. Section 7.3 shows how to translate the mathematical formulas in TPL into programs suited to different models of computation. Section 7.4 shows how mathematical theorems, interpreted as rewrite rules, can be used to perform program transformations on TPL programs. These transformations can be used to generate variants of the FFT and are used as part of the optimization process. In Section 7.5, optimization of FFT algorithms is formally stated as a search process, and Section 7.6 uses the theory of finite Abelian groups to help classify the different FFT algorithms that are available in TPL.

7.1 Mathematical description of the FFT


The n-point Fourier transform is the linear computation

    y(l) = Σ_{k=0}^{n-1} ω_n^{lk} x(k),    0 ≤ l < n,

where ω_n = e^{2πi/n}. Computation of the Fourier transform can be represented by the matrix-vector product y = F_n x, where F_n is the n × n matrix [ω_n^{lk}]_{0≤l,k<n}. Computing the n-point Fourier transform as a matrix-vector product requires O(n^2) operations. In 1965, Cooley and Tukey [19] presented a divide and conquer algorithm for computing y = F_n x. Their algorithm is based on the following theorem.

Theorem. Let n = rs, 0 ≤ k_1, l_2 < r, 0 ≤ k_2, l_1 < s. Then

    y(l_1 + l_2 s) = Σ_{k_1} ω_n^{k_1 l_2 s} ω_n^{k_1 l_1} ( Σ_{k_2} x(k_1 + k_2 r) ω_n^{k_2 l_1 r} ).

Repeated application of this theorem, when n = 2^t, leads to an O(n log n) algorithm, called the Fast Fourier Transform (FFT), for computing y = F_n x. Many formulations of the FFT have appeared since the publication of Cooley and Tukey's paper. For our purposes, the most attractive formulation of the FFT uses a mathematical construction called the tensor product. Let A be an m × m matrix and B an n × n matrix; then the tensor product of A and B, written A ⊗ B, is the mn × mn matrix defined by the block matrix

    A ⊗ B = [a_{ij} B]_{1≤i,j≤m} = [ a_{1,1}B  ...  a_{1,m}B ]
                                   [    ...    ...     ...   ]
                                   [ a_{m,1}B  ...  a_{m,m}B ]

For example, if

    A = [ a11  a12 ]    and    B = [ b11  b12 ]
        [ a21  a22 ]               [ b21  b22 ]

then

    A ⊗ B = [ a11 b11  a11 b12  a12 b11  a12 b12 ]
            [ a11 b21  a11 b22  a12 b21  a12 b22 ]
            [ a21 b11  a21 b12  a22 b11  a22 b12 ]
            [ a21 b21  a21 b22  a22 b21  a22 b22 ]
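The index splitting in the theorem can be checked numerically; the sketch below (ours, using n = 6 with r = 2, s = 3) compares it against the direct O(n^2) evaluation of the transform:

```python
import cmath

n, r, s = 6, 2, 3
w = cmath.exp(2j * cmath.pi / n)   # omega_n
x = [complex(k + 1, k) for k in range(n)]

# Direct transform: y(l) = sum_k omega_n^(lk) x(k).
y_direct = [sum(w ** (l * k) * x[k] for k in range(n)) for l in range(n)]

# Splitting from the theorem: write k = k1 + k2*r and l = l1 + l2*s,
# with 0 <= k1, l2 < r and 0 <= k2, l1 < s.
y_split = [0j] * n
for l1 in range(s):
    for l2 in range(r):
        acc = 0j
        for k1 in range(r):
            inner = sum(x[k1 + k2 * r] * w ** (k2 * l1 * r) for k2 in range(s))
            acc += w ** (k1 * l2 * s) * w ** (k1 * l1) * inner
        y_split[l1 + l2 * s] = acc

err = max(abs(a - b) for a, b in zip(y_direct, y_split))
```

The inner sums are s-point transforms of the r subsequences, which is what makes the recursive divide and conquer possible.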

Using the tensor product we can reformulate the Cooley-Tukey theorem.

Theorem.

    F_{rs} = (F_r ⊗ I_s) T_s^{rs} (I_r ⊗ F_s) L_r^{rs},

where I_d is the d × d identity matrix, T_s^{rs} is a diagonal matrix, and L_r^{rs} is a special permutation called a stride permutation.

For example,

    F_4 = (F_2 ⊗ I_2) T_2^4 (I_2 ⊗ F_2) L_2^4

        = [ 1  0  1  0 ] [ 1 0 0 0 ] [ 1  1  0  0 ] [ 1 0 0 0 ]
          [ 0  1  0  1 ] [ 0 1 0 0 ] [ 1 -1  0  0 ] [ 0 0 1 0 ]
          [ 1  0 -1  0 ] [ 0 0 1 0 ] [ 0  0  1  1 ] [ 0 1 0 0 ]
          [ 0  1  0 -1 ] [ 0 0 0 i ] [ 0  0  1 -1 ] [ 0 0 0 1 ]
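The F_4 factorization can be reproduced numerically. The helper names F, T, and L below are ours; T(n, s) builds the diagonal twiddle matrix T_s^n with entries ω_n^{ij} (consistent with diag(1, 1, 1, i) for T_2^4), and L(n, r) builds the stride permutation L_r^n:

```python
import numpy as np

def F(n):
    """n-point Fourier matrix [omega_n^(lk)] with omega_n = exp(2*pi*i/n)."""
    l, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(2j * np.pi * l * k / n)

def T(n, s):
    """Twiddle matrix T_s^n for n = r*s: diagonal omega_n^(i*j), index i*s + j."""
    r = n // s
    return np.diag([np.exp(2j * np.pi * i * j / n)
                    for i in range(r) for j in range(s)])

def L(n, r):
    """Stride permutation L_r^n: output segment form (L x)[i*s + j] = x[j*r + i]."""
    s = n // r
    P = np.zeros((n, n), dtype=complex)
    for i in range(r):
        for j in range(s):
            P[i * s + j, j * r + i] = 1.0
    return P

F4_fact = (np.kron(F(2), np.eye(2)) @ T(4, 2)
           @ np.kron(np.eye(2), F(2)) @ L(4, 2))
err = np.max(np.abs(F4_fact - F(4)))
```

With these conventions the four factors printed above are exactly np.kron(F(2), I), T(4, 2), np.kron(I, F(2)), and L(4, 2).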

The importance of this formulation is that the Cooley-Tukey theorem can be interpreted as a rewrite rule. It says that the Fourier transform matrix can be replaced by the product of four matrices. The FFT can be derived by inductively applying this rewrite rule. For example, the 8-point FFT is obtained by applying Cooley-Tukey to F_8 with r = 2 and s = 4, and then applying Cooley-Tukey to F_4 with r = 2 and s = 2:

    F_8 = (F_2 ⊗ I_4) T_4^8 (I_2 ⊗ F_4) L_2^8
        = (F_2 ⊗ I_4) T_4^8 (I_2 ⊗ ((F_2 ⊗ I_2) T_2^4 (I_2 ⊗ F_2) L_2^4)) L_2^8.

7.2 A Language for Describing FFT Algorithms

In the previous section, the FFT was derived using a tensor product formulation of the Cooley-Tukey theorem. A consequence of this description of the FFT is that the FFT can be compactly represented as a mathematical formula involving constructs such as the tensor product and parameterized symbols such as I_r, T_s^{rs}, L_r^{rs}, and F_r representing special matrices (identity, twiddle factor, stride permutation, and Fourier transform matrices). While the resulting formula can be interpreted as a matrix factorization of the Fourier matrix, it can also be interpreted as a program to compute the Fourier transform. In previous work [6], Auslander, Johnson, and Johnson have designed a language, called TPL (Tensor Product Language), whose programs are mathematical formulas of this type. A compiler has been written which translates formulas to FORTRAN programs. The benefit of having a special-purpose language for FFT algorithms is that there are many variants of the FFT, each of which can be represented as a mathematical formula. With TPL, many different variants can be implemented and the computing times of the resulting programs can be compared. The following list represents several variants of the FFT, from the literature [42], as mathematical formulas using the tensor product. Each formula is given for a transform on 8 points and corresponds to a matrix factorization of the Fourier matrix F_8.

Apply Cooley-Tukey:
    F_8 = (F_2 ⊗ I_4) T_4^8 (I_2 ⊗ F_4) L_2^8

Bit reversal:
    R_8 = (I_2 ⊗ L_2^4) L_2^8

Recursive FFT (Cooley-Tukey applied inductively):
    F_8 = (F_2 ⊗ I_4) T_4^8 (I_2 ⊗ ((F_2 ⊗ I_2) T_2^4 (I_2 ⊗ F_2) L_2^4)) L_2^8

Iterative FFT (CT):
    F_8 = (F_2 ⊗ I_4) T_4^8 (I_2 ⊗ F_2 ⊗ I_2)(I_2 ⊗ T_2^4)(I_4 ⊗ F_2) R_8

Vector FFT (Stockham):
    F_8 = (F_2 ⊗ I_4) T_4^8 L_2^8 (F_2 ⊗ I_4)(T_2^4 ⊗ I_2)(L_2^4 ⊗ I_2)(F_2 ⊗ I_4)

Vector FFT (Korn-Lambiotte):
    F_8 = (F_2 ⊗ I_4) T_4^8 L_2^8 (F_2 ⊗ I_4)(T_2^4 ⊗ I_2) L_2^8 (F_2 ⊗ I_4) L_2^8 R_8

Parallel FFT (Pease):
    F_8 = L_2^8 (I_4 ⊗ F_2) L_4^8 T_4^8 L_2^8 L_2^8 (I_4 ⊗ F_2) L_4^8 (T_2^4 ⊗ I_2) L_2^8 L_2^8 (I_4 ⊗ F_2) R_8
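Several of these factorizations can be checked numerically. The sketch below (ours, with the twiddle and stride conventions chosen to match the F_4 example given earlier) verifies the recursive, iterative, and Stockham forms against the 8-point Fourier matrix:

```python
import numpy as np

def F(n):
    l, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(2j * np.pi * l * k / n)

def T(n, s):  # twiddle matrix T_s^n for n = r*s
    r = n // s
    return np.diag([np.exp(2j * np.pi * i * j / n)
                    for i in range(r) for j in range(s)])

def L(n, r):  # stride permutation L_r^n
    s = n // r
    P = np.zeros((n, n), dtype=complex)
    for i in range(r):
        for j in range(s):
            P[i * s + j, j * r + i] = 1.0
    return P

I = lambda n: np.eye(n, dtype=complex)
kron = np.kron

R8 = kron(I(2), L(4, 2)) @ L(8, 2)          # bit reversal permutation

recursive = (kron(F(2), I(4)) @ T(8, 4)
             @ kron(I(2), kron(F(2), I(2)) @ T(4, 2) @ kron(I(2), F(2)) @ L(4, 2))
             @ L(8, 2))
iterative = (kron(F(2), I(4)) @ T(8, 4) @ kron(I(2), kron(F(2), I(2)))
             @ kron(I(2), T(4, 2)) @ kron(I(4), F(2)) @ R8)
stockham = (kron(F(2), I(4)) @ T(8, 4) @ L(8, 2) @ kron(F(2), I(4))
            @ kron(T(4, 2), I(2)) @ kron(L(4, 2), I(2)) @ kron(F(2), I(4)))

errs = {name: float(np.max(np.abs(M - F(8))))
        for name, M in [("recursive", recursive),
                        ("iterative", iterative),
                        ("stockham", stockham)]}
```

The remaining variants can be assembled and checked the same way from the same primitives.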

The TPL language uses a prefix notation similar to Lisp to represent formulas. Built into the language are operators corresponding to algebraic operations such as the tensor product. For example, the expressions "(compose A B)" and "(tensor A B)" correspond to the matrix product AB and the tensor product A ⊗ B respectively. The language also includes special symbols such as (f n) and (i m) to indicate the Fourier transform matrix F_n and the identity matrix I_m. The following TPL expression corresponds to the formula for the recursive FFT on 8 points

(compose (compose (compose (tensor (f 2) (i 4)) (t 8 4)) (tensor (i 2) (compose (compose (compose (tensor (f 2) (i 2)) (t 4 2)) (tensor (i 2) (f 2))) (l 4 2)))) (l 8 2)),


and the formula for the iterative FFT on 8 points


(compose (tensor (f 2) (i 4)) (compose (t 8 4) (compose (tensor (i 2) (tensor (f 2) (i 2))) (compose (tensor (i 2) (t 4 2)) (compose (tensor (i 4) (f 2)) (compose (tensor (i 2) (l 4 2)) (l 8 2))))))).
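Such expressions can be given a direct matrix semantics. The following interpreter is entirely our own sketch (TPL's actual compiler emits FORTRAN, not matrices, and our twiddle and stride conventions are assumptions chosen to match the F_4 example); it parses an expression of this form and evaluates it to a matrix, confirming that the recursive-FFT expression above is a factorization of F_8:

```python
import numpy as np

def F(n):
    l, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(2j * np.pi * l * k / n)

def T(n, s):  # twiddle matrix T_s^n for n = r*s
    r = n // s
    return np.diag([np.exp(2j * np.pi * i * j / n)
                    for i in range(r) for j in range(s)])

def L(n, r):  # stride permutation L_r^n
    s = n // r
    P = np.zeros((n, n), dtype=complex)
    for i in range(r):
        for j in range(s):
            P[i * s + j, j * r + i] = 1.0
    return P

def tokenize(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)   # drop the closing ")"
        return expr
    return int(tok) if tok.isdigit() else tok

def evaluate(expr):
    op = expr[0]
    if op == "compose":
        return evaluate(expr[1]) @ evaluate(expr[2])
    if op == "tensor":
        return np.kron(evaluate(expr[1]), evaluate(expr[2]))
    if op == "f":
        return F(expr[1])
    if op == "i":
        return np.eye(expr[1], dtype=complex)
    if op == "t":
        return T(expr[1], expr[2])
    if op == "l":
        return L(expr[1], expr[2])
    raise ValueError(op)

recursive_fft8 = """
(compose (compose (compose (tensor (f 2) (i 4)) (t 8 4))
 (tensor (i 2) (compose (compose (compose (tensor (f 2) (i 2)) (t 4 2))
  (tensor (i 2) (f 2))) (l 4 2)))) (l 8 2))
"""
M = evaluate(parse(tokenize(recursive_fft8)))
err = np.max(np.abs(M - F(8)))
```

This matrix semantics is what makes the formulas verifiable: any candidate expression can be checked against F_n before it is ever compiled to code.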

The Cooley-Tukey theorem shows how to compute the Fourier transform using smaller Fourier transforms, diagonal matrices, and permutations. This theorem can be used to automatically construct FFT programs. If programs are given for the smaller Fourier transforms, it is possible to construct a program to compute the larger Fourier transform. The ideas behind this construction serve as the basis for the code generator in the TPL compiler. The formulas in TPL are built up from primitive symbols, such as F_2, I_n, T_n^{mn}, and L_m^{mn}, using algebraic operations such as matrix composition and the tensor product. Suppose that we are given programs to compute the linear transformations y = Ax and y = Bx; then the following program computes y = (AB)x:

    t ← Bx
    y ← At
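In executable form, the two-statement construction reads as follows (a sketch in Python with names of our own choosing): given callables for y = Ax and y = Bx, the composite program introduces the temporary t.

```python
def program_for(M):
    """Return a 'program' (a callable) computing the linear map y = Mx."""
    def run(x):
        return [sum(M[i][j] * x[j] for j in range(len(x)))
                for i in range(len(M))]
    return run

def compose_programs(prog_A, prog_B):
    """Program for y = (AB)x built from programs for y = Ax and y = Bx."""
    def run(x):
        t = prog_B(x)        # t <- Bx
        return prog_A(t)     # y <- At
    return run

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
y = compose_programs(program_for(A), program_for(B))([5, 7])
```

The composite never forms the product matrix AB; it only chains the two given programs, which is exactly how the code generator combines code sequences.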

7.3 Programming and Architectural Interpretation of the Tensor Product

Given programs to compute y = Ax and y = Bx it is also possible to construct a program to compute y = (A ⊗ B)x. To see how this is done we first give a programming interpretation of two special cases of the tensor product; namely, I_m ⊗ B and A ⊗ I_n. Let B be an n × n matrix and let x be a vector of size mn. If x^n_i denotes the i-th segment of x of size n, containing the elements (x_{in}, x_{in+1}, ..., x_{in+n-1}), and similarly y^n_i denotes the i-th segment of y, then the computation y = (I_m ⊗ B)x is given by

for i = 0, ..., m-1
    y^n_i ← B x^n_i
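The loop above can be sketched directly in Python (our helper, not code from the paper), with the result checked against an explicit Kronecker product:

```python
import numpy as np

def parallel_form(B, m, x):
    """Compute y = (I_m (x) B) x by applying B independently to each length-n segment."""
    n = B.shape[0]
    y = np.empty(m * n, dtype=np.result_type(B, x))
    for i in range(m):                     # iterations are independent: parallelizable
        y[i * n:(i + 1) * n] = B @ x[i * n:(i + 1) * n]
    return y

B = np.array([[1.0, 1.0], [1.0, -1.0]])    # the matrix F_2
x = np.arange(8.0)
y = parallel_form(B, 4, x)
print(np.allclose(y, np.kron(np.eye(4), B) @ x))   # True
```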


Even though this computation has been written as a sequential loop, the computations y^n_i = B x^n_i are independent and can be performed in parallel. For this reason I_m ⊗ B is called the parallel form of the tensor product. The computation y = (A ⊗ I_n)x can be viewed as a vector operation:

A ⊗ I_n = [ a_{0,0} I_n    ...  a_{0,m-1} I_n
                ...         ...      ...
            a_{m-1,0} I_n  ...  a_{m-1,m-1} I_n ]

and

(A ⊗ I_n)x = [ a_{0,0} x^n_0 + ... + a_{0,m-1} x^n_{m-1}
                   ...
               a_{m-1,0} x^n_0 + ... + a_{m-1,m-1} x^n_{m-1} ],

where a_{i,j} x^n_j is a scalar-vector product and + is a vector sum. A program to compute y = (A ⊗ I_n)x can be obtained from a program to compute y = Ax by replacing scalar operations by the appropriate vector operations. Because of this interpretation A ⊗ I_n is called the vector form of the tensor product. A program to compute y = (A ⊗ B)x can be obtained from programs to compute y = Ax and y = Bx using the constructions for the two special cases (the vector and parallel forms), together with the property of tensor products that states (A ⊗ B) = (A ⊗ I_n)(I_m ⊗ B). The constructions just presented are the crucial ingredients in the code generation phase of a compiler for TPL. The TPL compiler first parses a formula and then builds a program corresponding to the formula by replacing primitive symbols with primitive code sequences and combining code sequences using operations corresponding to the algebraic operations occurring in the formula. For example, if F_2, F_4, I_n, L^{rs}_s, and T^{rs}_s are considered primitive symbols, the parse tree associated with the formula (F_2 ⊗ I_4) T^8_4 (I_2 ⊗ F_4) L^8_2 is given in Figure 1. Code corresponding to the formula can be obtained by using the tensor product constructions to produce code for F_2 ⊗ I_4 and I_2 ⊗ F_4 and then composing the resulting code with T^8_4 and L^8_2. It is essential to note that different formulas for F_n, while mathematically equivalent, lead to different programs whose performance, even on workstations and personal computers, can be dramatically different [28].
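As an illustrative sketch of the vector form, a two-statement program for y = F_2 x turns into a program for (F_2 ⊗ I_n)x simply by feeding it vector segments instead of scalars; NumPy's elementwise arithmetic plays the role of the vector operations:

```python
import numpy as np

def f2_program(x0, x1):
    """y = F_2 x written as scalar operations; with length-n arrays the same
    two statements become the vector operations of (F_2 (x) I_n)."""
    y0 = x0 + x1
    y1 = x0 - x1
    return y0, y1

n = 4
x = np.arange(2.0 * n)
y0, y1 = f2_program(x[:n], x[n:])          # segments of size n in place of scalars
y = np.concatenate([y0, y1])
F2 = np.array([[1.0, 1.0], [1.0, -1.0]])
print(np.allclose(y, np.kron(F2, np.eye(n)) @ x))   # True
```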
This is true even when the number of arithmetic operations is the same; the important distinctions come from the differences in data flow [25]. The difference in performance due to data flow distinctions is even greater on parallel and vector machines [28]. For example, the vector form of F_8 in the previous section has three factors of the form F_2 ⊗ I_4, which correspond to vector operations on vectors of size 4. In contrast, the Cooley-Tukey FFT has factors of the form F_2 ⊗ I_4, I_2 ⊗ F_2 ⊗ I_2, and I_4 ⊗ F_2. The first factor corresponds to a vector operation on vectors of size 4, whereas the second corresponds to a vector operation on vectors of size 2, and the last factor is purely sequential. The varying vector lengths in the Cooley-Tukey FFT led to many of the initial difficulties in obtaining a good vectorized version of the FFT. The paper [27] presents many more details on the programming and hardware interpretations of tensor products. In addition to discussing vector and parallel interpretations, there is a discussion of the data flow corresponding to the different permutations that arise in FFT factorizations and of the amount of memory locality obtained by different FFT factorizations.

Figure 1: Parse Tree for (F_2 ⊗ I_4) T^8_4 (I_2 ⊗ F_4) L^8_2

The use of mathematical formulas to represent algorithms provides, in addition to compactness, two important benefits over conventional descriptions of algorithms: (1) mathematical properties of the constructs and symbols involved can be used to modify a given formula to obtain a new formula, and (2) mathematical theorems can be used to generate formulas for a given computation automatically (as in the derivation of the FFT presented earlier). The first point allows program optimization and transformation to be accomplished using mathematics. The second point allows us to formally define classes of algorithms for a given computation, and more importantly to generate "all" algorithms for a given computation.

7.4 Rewrite Rules and Algorithm Transformations

In this section we illustrate the notion of program or algorithm transformation using mathematical properties of the Fourier transform matrix and the tensor product. The properties are given as rewrite rules, which can be applied to a given program (formula) by replacing occurrences of symbols on the left-hand side of a rule by the corresponding symbols on the right-hand side. The following list contains some of the important rewrite rules that can be used with FFT algorithms.

Cooley-Tukey

F_{rs} → (F_r ⊗ I_s) T^{rs}_s (I_r ⊗ F_s) L^{rs}_r

Linear and Tensor Algebra

I_n^t                → I_n
I_{mn}               → I_m ⊗ I_n
F_n^t                → F_n
(T^{rs}_s)^t         → T^{rs}_s
(L^{rs}_s)^t         → L^{rs}_r
L^{rst}_s L^{rst}_t  → L^{rst}_{st}
(AB)^t               → B^t A^t
(A ⊗ B)^t            → A^t ⊗ B^t
(AB) ⊗ (CD)          → (A ⊗ C)(B ⊗ D)
A_m ⊗ B_n            → L^{mn}_m (B_n ⊗ A_m) L^{mn}_n
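These identities are easy to sanity-check numerically. The sketch below (with an assumed stride-permutation convention: L^N_r gathers the r stride-r subsequences) verifies three of the rules on random matrices:

```python
import numpy as np

def stride(N, r):                          # stride permutation L^N_r as a 0/1 matrix
    s = N // r
    P = np.zeros((N, N))
    for a in range(r):
        for b in range(s):
            P[a * s + b, b * r + a] = 1.0
    return P

rng = np.random.default_rng(0)
m, n = 3, 4
A, C = rng.standard_normal((2, m, m))
B, D = rng.standard_normal((2, n, n))

# (A (x) B)^t -> A^t (x) B^t
print(np.allclose(np.kron(A, B).T, np.kron(A.T, B.T)))
# (AC) (x) (BD) -> (A (x) B)(C (x) D)   (mixed-product property)
print(np.allclose(np.kron(A @ C, B @ D), np.kron(A, B) @ np.kron(C, D)))
# A_m (x) B_n -> L^{mn}_m (B_n (x) A_m) L^{mn}_n   (commutation theorem)
print(np.allclose(np.kron(A, B),
                  stride(m * n, m) @ np.kron(B, A) @ stride(m * n, n)))
```

All three checks print True under this convention.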

Starting with

F_8 = F_{2·4} → (F_2 ⊗ I_4) T^8_4 (I_2 ⊗ F_4) L^8_2,

we can obtain a variant of the Cooley-Tukey factorization, called decimation in frequency, by applying the rewrite rule corresponding to the fact that the Fourier matrix is symmetric (i.e. F_n^t = F_n):

F_8 = F_8^t → ((F_2 ⊗ I_4) T^8_4 (I_2 ⊗ F_4) L^8_2)^t → L^8_4 (I_2 ⊗ F_4) T^8_4 (F_2 ⊗ I_4).
As mentioned previously, the 8-point FFT can be derived from a sequence of applications of the Cooley-Tukey rewrite rule:

F_8 → (F_2 ⊗ I_4) T^8_4 (I_2 ⊗ F_4) L^8_2
    → (F_2 ⊗ I_4) T^8_4 (I_2 ⊗ ((F_2 ⊗ I_2) T^4_2 (I_2 ⊗ F_2) L^4_2)) L^8_2.

This form of the FFT corresponds to a recursive program, since the Cooley-Tukey theorem is applied recursively, in this case to the two computations of F_4 in (I_2 ⊗ F_4). In [19] an iterative form of the FFT was presented. We can derive this variant of the FFT using properties of the tensor product:

F_8 → (F_2 ⊗ I_4) T^8_4 (I_2 ⊗ ((F_2 ⊗ I_2) T^4_2 (I_2 ⊗ F_2) L^4_2)) L^8_2
    → (F_2 ⊗ I_4) T^8_4 (I_2 ⊗ F_2 ⊗ I_2)(I_2 ⊗ T^4_2)(I_4 ⊗ F_2)(I_2 ⊗ L^4_2) L^8_2.

Similar derivations are possible for all of the variants of the FFT listed in subsection 7.2.
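The recursive, iterative, and decimation-in-frequency factorizations can all be checked numerically. A sketch under our assumed conventions (ω = e^{-2πi/N}, stride permutation L^N_r gathering stride-r subsequences, twiddle T^N_s with diagonal entries ω_N^{ab}):

```python
import numpy as np

def F(n):                                  # Fourier matrix, w = exp(-2*pi*i/n)
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

def L(N, r):                               # stride permutation L^N_r (assumed convention)
    s = N // r
    P = np.zeros((N, N))
    for a in range(r):
        for b in range(s):
            P[a * s + b, b * r + a] = 1.0
    return P

def T(N, s):                               # twiddle (diagonal) matrix T^N_s
    r = N // s
    return np.diag([np.exp(-2j * np.pi * a * b / N)
                    for a in range(r) for b in range(s)])

I2, I4, kron = np.eye(2), np.eye(4), np.kron
F4 = kron(F(2), I2) @ T(4, 2) @ kron(I2, F(2)) @ L(4, 2)
recursive = kron(F(2), I4) @ T(8, 4) @ kron(I2, F4) @ L(8, 2)
iterative = (kron(F(2), I4) @ T(8, 4) @ kron(I2, kron(F(2), I2))
             @ kron(I2, T(4, 2)) @ kron(I4, F(2)) @ kron(I2, L(4, 2)) @ L(8, 2))
dif = L(8, 4) @ kron(I2, F4) @ T(8, 4) @ kron(F(2), I4)   # decimation in frequency
for M in (recursive, iterative, dif):
    print(np.allclose(M, F(8)))           # True for all three
```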

7.5 Searching through Algorithm Space



By viewing FFT algorithms as formulas, which can be derived by a sequence of applications of rewrite rules, we can define the set of "all" FFT algorithms for the computation of the n-point Fourier transform as the set of formulas generated from the symbol F_n using the allowable rewrite rules. The resulting set of formulas is called the algorithm space. Since all of the formulas in the algorithm space can be enumerated, we can systematically generate all FFT algorithms. Each formula can be translated into a program, which can be timed on a given platform. Thus by an exhaustive search we can find the optimal implementation (with respect to a fixed translation scheme). Additional FFT algorithms may result if additional rewrite rules are incorporated. Furthermore, different performance may be obtained if the formula translation scheme were to change. Nonetheless, this point of view formally defines the notion of an "optimal" FFT algorithm and implementation. Furthermore, algorithmic and coding optimizations have been clearly separated. The two tasks are to (1) find efficient translation schemes for the symbols and constructs that arise in FFT algorithms and (2) find the best formula with respect to a given translation scheme and computational model. Optimizations in (2) require domain-specific knowledge (given as mathematical rewrite rules) which is beyond the capabilities of an optimizing compiler for a traditional programming language.

A potential difficulty with this approach is that there may be too many formulas to translate and time. For example, if only the Cooley-Tukey rewrite rule is used, there are approximately 4^{k-1}/(√π (k-1)^{3/2}) formulas for F_{2^k}. If transposition is also allowed, so that we can apply the decimation in frequency variant of Cooley-Tukey, 2^k variants are possible for each of the formulas derived by Cooley-Tukey alone. As we allow additional rules and modifications we will add and multiply terms that are polynomials in 2^k and k! to the total number of formulas. While such a bound will remain polynomial in the size of the Fourier transform we wish to compute, there will be an enormous number of formulas to consider if an exhaustive search is used to find an optimal FFT. By associating a cost, other than empirical computing time, with each formula it is possible to refine the search using pruning techniques. To effectively prune the search into a more manageable task we must devise appropriate cost functions and search strategies.

7.6 Multidimensional FFTs and Abelian Groups

All of the TPL examples seen so far are for computing one-dimensional FFTs. However, TPL can also be used for multidimensional FFTs. Let X(r, s), 0 ≤ r < m and 0 ≤ s < n, be a function of two variables. The two-dimensional Fourier transform F(m,n) is defined by

Y(u, v) = Σ_{r=0}^{m-1} Σ_{s=0}^{n-1} X(r, s) ω_m^{ur} ω_n^{vs},

where ω_m = e^{2πi/m} and ω_n = e^{2πi/n}. Written as a nested sum,

Y(u, v) = Σ_{r=0}^{m-1} ( Σ_{s=0}^{n-1} X(r, s) ω_n^{vs} ) ω_m^{ur}

gives the standard row-column algorithm for computing two-dimensional Fourier transforms. In the row-column algorithm, the two-dimensional FT is computed by applying a one-dimensional FFT to the rows of the input and then applying a one-dimensional FFT to the columns. As a matrix equation the row-column algorithm is

Y = F_m (X F_n).

If the matrix X is stored by rows, then elements in a row are stored consecutively, whereas there is a stride of length n between elements in a column of X. For this reason, it is sometimes beneficial to perform an explicit transpose before applying F_m to the columns of X F_n. In order to return to the row-major representation of matrices, another transpose is required after applying F_m. If the rows of X and Y are placed one after the next in vectors x and y of size mn, then

y = (F_m ⊗ F_n) x,

and the multidimensional FT is just the tensor product of one-dimensional FTs. This generalizes to higher dimensions, where the t-dimensional FT on a parallelepiped of size n_1 × ··· × n_t becomes the t-fold tensor product F_{n_1} ⊗ ··· ⊗ F_{n_t}. In the tensor product setting, the row-column algorithm corresponds to the matrix factorization

(F_m ⊗ F_n) = (F_m ⊗ I_n)(I_m ⊗ F_n),

and the row-column algorithm with an explicit transpose becomes

(F_m ⊗ F_n) = L^{mn}_m (I_n ⊗ F_m) L^{mn}_n (I_m ⊗ F_n),

where the stride permutations L^{mn}_m and L^{mn}_n correspond to matrix transposition. On computer architectures where transposition can be performed quickly, the row-column algorithm is an efficient algorithm for computing two-dimensional FTs (optimized one-dimensional FFTs can be plugged in to obtain efficient two-dimensional FFTs). However, on some computers transposition can be very expensive (e.g. on a distributed-memory computer transposition requires an all-to-all permutation, which is a very costly communication pattern) and the row-column algorithm is not desirable.
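The row-column algorithm and its tensor-product reading can be checked numerically. A sketch (we use the e^{-2πi/n} sign so results match np.fft; the paper's ω has the opposite sign, which differs only by conjugation and does not affect the factorization structure):

```python
import numpy as np

def F(n):                                  # Fourier matrix F_n (symmetric)
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

m, n = 4, 8
X = np.random.default_rng(1).standard_normal((m, n))

Y = F(m) @ (X @ F(n))                      # row-column: F_n on rows, then F_m on columns
print(np.allclose(Y, np.fft.fft2(X)))      # True

x = X.ravel()                              # rows placed one after the next
print(np.allclose(np.kron(F(m), F(n)) @ x, Y.ravel()))   # y = (F_m (x) F_n) x
print(np.allclose(np.kron(F(m), F(n)),
                  np.kron(F(m), np.eye(n)) @ np.kron(np.eye(m), F(n))))  # factorization
```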
It is possible to improve the row-column algorithm by using faster transposition algorithms; however, the fast transposition algorithms can also be incorporated directly into the two-dimensional FFT. Such an algorithm has been called the vector-radix algorithm [22, 35]; however, we prefer to call it a block-recursive FFT. The block-recursive FFT can be derived using properties of the tensor product and hence can be described using TPL. It is obtained by taking the tensor product of two one-dimensional FFT algorithms. Let r = r_1 r_2 and s = s_1 s_2 and write

F_r = (F_{r_1} ⊗ I_{r_2}) T^{r_1 r_2}_{r_2} (I_{r_1} ⊗ F_{r_2}) L^{r_1 r_2}_{r_1}, and
F_s = (F_{s_1} ⊗ I_{s_2}) T^{s_1 s_2}_{s_2} (I_{s_1} ⊗ F_{s_2}) L^{s_1 s_2}_{s_1}.

Then the two-dimensional transform F_r ⊗ F_s is equal to

(F_{r_1} ⊗ I_{r_2} ⊗ F_{s_1} ⊗ I_{s_2}) (T^{r_1 r_2}_{r_2} ⊗ T^{s_1 s_2}_{s_2}) (I_{r_1} ⊗ F_{r_2} ⊗ I_{s_1} ⊗ F_{s_2}) (L^{r_1 r_2}_{r_1} ⊗ L^{s_1 s_2}_{s_1}).

Rewriting this equation as

B (F_{r_1} ⊗ F_{s_1} ⊗ I_{r_2 s_2}) B^{-1} (T^{r_1 r_2}_{r_2} ⊗ T^{s_1 s_2}_{s_2}) B (I_{r_1 s_1} ⊗ F_{r_2} ⊗ F_{s_2}) B^{-1} (L^{r_1 r_2}_{r_1} ⊗ L^{s_1 s_2}_{s_1}),

where B is a permutation called a block permutation, we see that the block-recursive algorithm computes the two-dimensional FT using smaller two-dimensional FTs. Since B^{-1} (T^{r_1 r_2}_{r_2} ⊗ T^{s_1 s_2}_{s_2}) B is a diagonal matrix, this factorization has the same form as the Cooley-Tukey theorem except for the initial permutation, and the block-recursive algorithm is a two-dimensional analog of the Cooley-Tukey theorem. In fact, there are many generalizations of the Cooley-Tukey theorem that apply to multidimensional FFTs, all of which can be stated in a single theorem about the Fourier transform of finite Abelian groups.

Theorem. Let A be a finite Abelian group and let F(A) be the Fourier transform of A. If B is a subgroup of A and C = A/B, then

F(A) = Q (F(B) ⊗ I_{|C|}) T (I_{|B|} ⊗ F(C)) P,

where Q and P are permutations and T is a diagonal matrix.

The permutations P and Q and the diagonal matrix T are determined by the subgroup B and the quotient group C, along with the choice of coset representatives used to represent A/B. For an explicit statement of this theorem along with a proof see [5]. The fundamental theorem of finite Abelian groups states that any finite Abelian group is the direct product of cyclic groups. In other words, every Abelian group is isomorphic to the group Z/n_1Z × ··· × Z/n_tZ for some n_1, ..., n_t, where the product n_1 ··· n_t is equal to the order of the Abelian group. The Fourier transform of Z/n_1Z × ··· × Z/n_tZ can be shown to be the multidimensional FT on n_1 × ··· × n_t points (see [5]). This connection provides the relationship between the abstract Cooley-Tukey theorem and multidimensional FFTs. The Abelian group approach to the FFT is important for two reasons. First, a single theorem can be used to encompass many different algorithms. We can view the generalized Cooley-Tukey theorem as a complicated rewrite rule that can be added to an extended version of TPL that incorporates statements about Abelian groups. Note that it can already be applied to the current version of TPL, provided the theorem is stated in terms of multidimensional FTs and the necessary permutation and twiddle matrices are added to TPL. The second reason is that it can be used to mathematically classify multidimensional FFT algorithms. If we start with an Abelian group A, choose a subgroup B, and choose a set of coset representatives for the quotient group C = A/B, we obtain a factorization of the Fourier transform F(A). The same process can

be used to factor F(B) and F(C). Continuing in this way we obtain an algorithm to compute F(A). Associated with this algorithm is a binary tree of subgroups and quotient groups (called a subgroup tree). Alternatively, given a subgroup tree for A we can derive an algorithm to compute F(A). To see how we can use subgroup trees to classify FFT algorithms, we begin by defining the set of FFT algorithms for a multidimensional FT as the closure of the allowed rewrite rules (including the Cooley-Tukey theorem) starting with the symbol for the given multidimensional FT. While such a definition is algorithmically appealing, since we can use the rewrite rules to systematically generate "all" algorithms, it is mathematically unappealing, since it is difficult to say what "all" the algorithms are. However, each formula in the set of algorithms corresponds to a subgroup tree. We believe that the subgroup tree can be recovered from the twiddle factors arising in the formula. Thus the classification problem is reduced to determining all of the formulas corresponding to the possible subgroup trees, which can be classified. Furthermore, the formulas that are equivalent to a given subgroup tree differ only in their data flow, as determined by the permutations in the formula. Therefore, the classification can be reduced to determining the types of permutations that can arise in the allowed factorizations of F(A). Finally, we remark that if we ignore the values occurring in the twiddle factors, it is impossible to determine the group structure. The only information that is retained is the order of the subgroups. This observation led us to the notion of a dimensionless FFT [5]. A dimensionless FFT is an algorithm that can be used to compute any FT on a fixed number of points, independent of dimension. For example, a dimensionless algorithm for 16 points can be used to compute a 16-point one-dimensional FFT, a (4 × 4)-point two-dimensional FFT, and a (2 × 2 × 4)-point three-dimensional FFT.
The only parts of the algorithm that change when the dimension is changed are the values of the twiddle factors and an initial permutation, or relabeling, of the input data. In [5] several dimensionless FFT algorithms are derived. It is an open problem whether all algorithms generated by the rewrite rules discussed here are dimensionless. If the answer is yes, then, at least for this class of algorithms, dimension is irrelevant and we can search for an optimal FFT within the space of one-dimensional algorithms.

8 Future Directions
In this paper we have presented a methodology for implementing high-performance FFTs. This methodology relies heavily on the ability to describe the FFT and its many variants as mathematical expressions that can be manipulated and translated into programs. In fact, the mathematical expressions form the basis of a special-purpose language, called TPL, that we have developed for implementing FFT algorithms. Given a particular scheme for translating the expressions in TPL into programs, program optimization can be stated as a search process, where mathematics can be used to help guide the search.

While this approach has been worked out for the FFT, it is not clear how applicable it may be to other algorithm and application domains. For the approach to work it is essential that algorithms in the problem domain can be described mathematically and that the mathematics has programming and architectural interpretations that can aid in the implementation and optimization of algorithms. In the near future we will look into extending TPL to include other FFT algorithms such as the Rader [34] and Winograd [45] algorithms. We will also look into extending TPL to deal with other transforms, such as the DCT, the DST, the Zak transform [7], and non-Abelian FFTs [30]. Preliminary discussions with R. Coifman from Yale University indicate that many of the techniques discussed in this paper will be applicable to wavelets and the wavelet transform. Preliminary work by Jose Moura from CMU [4, 32] suggests that the approach discussed in this paper might be applicable to many algorithms in signal processing and the solution of certain PDEs.

References
[1] R. C. Agarwal and J. W. Cooley. Fourier transform and convolution subroutines for the IBM 3090 vector facility. IBM J. Res. Develop., 30:145–162, 1986.
[2] R. C. Agarwal and J. W. Cooley. Vectorized mixed radix discrete Fourier transform algorithms. In Proc. IEEE 75, pages 1283–1292, 1987.
[3] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, second edition, 1995.
[4] A. Asif and J. M. F. Moura. Data assimilation in large time-varying multidimensional fields. Technical report, Carnegie Mellon University, Pittsburgh, PA, April 1996. Submitted for publication, 30 pages.
[5] L. Auslander, J. R. Johnson, and R. W. Johnson. Multidimensional Cooley-Tukey algorithms revisited. Adv. in Appl. Math., 19(9):297–301, April 1995.
[6] L. Auslander, J. R. Johnson, and R. W. Johnson. Automatic implementation of FFT algorithms. Technical report, Department of Mathematics and Computer Science, Drexel University, Philadelphia, PA, June 1996.
[7] L. Auslander and R. Tolimieri. Radar ambiguity function and group theory. SIAM J. Math. Anal., 16(3), 1985.
[8] D. H. Bailey. A high-performance fast Fourier transform algorithm for the Cray-2. J. Supercomputing, 1:43–60, 1987.
[9] D. H. Bailey. A high-performance FFT algorithm for vector supercomputers. International J. Supercomputer Applications, 2:82–87, 1988.

[10] D. H. Bailey. FFTs in external or hierarchical memory. J. Supercomputing, 4:23–35, 1990.
[11] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, 1997.
[12] R. N. Bracewell. The Fourier Transform and Its Applications. McGraw-Hill, New York, 1978.
[13] William L. Briggs and Van Emden Henson. The DFT: An Owner's Manual for the Discrete Fourier Transform. Society for Industrial and Applied Mathematics, Philadelphia, 1995.
[14] E. O. Brigham. The Fast Fourier Transform and Applications. Prentice-Hall, Englewood Cliffs, NJ, 1988.
[15] C. A. Carlson. Using local memory to boost the performance of FFT algorithms on the Cray-2 supercomputer. J. Supercomputing, 4:345–356, 1991.
[16] B. W. Char, K. O. Geddes, G. H. Gonnet, B. L. Leong, M. B. Monagan, and S. M. Watt. Maple V Library Reference Manual. Springer-Verlag, New York, 1991.
[17] J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R. C. Whaley. A proposal for a set of parallel basic linear algebra subprograms. Technical Report CS-95-292, University of Tennessee, 1995. LAPACK Working Note 100.
[18] J. W. Cooley. The re-discovery of the fast Fourier transform algorithm. Mikrochimica Acta, III:33–45, 1987.
[19] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19(90):297–301, April 1965.
[20] M. C. Dewar. IRENA, an integrated symbolic and numerical computation environment. In ISSAC '89, Proceedings of the International Symposium on Symbolic and Algebraic Computation, pages 163–170. ACM Press, 1989.
[21] J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1–17, 1990.
[22] D. B. Harris, J. H. McClelland, D. S. K. Chan, and H. Schuessler. Vector radix fast Fourier transform. In Proc. IEEE Int. Conf. on Acoust. Speech Signal Process., pages 548–551, 1977.

[23] H. Hauptman. The phase problem of x-ray crystallography. In F. A. Grunbaum, J. W. Helton, and P. Khargonekar, editors, Signal Processing Part II: Control Theory and Applications, volume 23 of The IMA Volumes in Mathematics and Its Applications, pages 257–273. Springer-Verlag, New York, 1990.
[24] Michael T. Heideman, Don H. Johnson, and C. Sidney Burrus. Gauss and the history of the fast Fourier transform. Arch. for History of Exact Sci., 34(3):265–277, 1985.
[25] C.-H. Huang, J. R. Johnson, and R. W. Johnson. A report on the performance of an implementation of Strassen's algorithm. Applied Mathematics Letters, 4(1):99–102, 1991.
[26] R. D. Jenks and R. S. Sutor. Axiom, The Scientific Computation System. Springer-Verlag, 1992.
[27] J. R. Johnson, R. W. Johnson, D. Rodriguez, and R. Tolimieri. A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures. Circuits, Systems, and Signal Processing, 9(4):449–500, 1990.
[28] R. W. Johnson, C.-H. Huang, and J. R. Johnson. Multilinear algebra and parallel programming. The Journal of Supercomputing, 5:189–217, 1991.
[29] Charles H. Koelbel, David B. Loveman, Robert S. Schreiber, Guy L. Steele Jr., and Mary E. Zosel. High Performance Fortran Handbook. MIT Press, 1995.
[30] D. Maslen and D. Rockmore. Generalized FFTs: a survey of some recent results. In L. Finkelstein and W. Kantor, editors, DIMACS Ser. Discrete Math. Theoret. Comput. Sci., Groups and Computation, II, pages 183–237. Amer. Math. Soc., Providence, RI, 1997. To appear.
[31] MathWorks, Inc. MATLAB Reference Guide.
[32] Jose M. F. Moura and M. Bruno. DCT/DST and Gauss-Markov fields: Conditions for equivalence. Technical report, Carnegie Mellon University, Pittsburgh, PA, September 1996. Submitted for publication, 30 pages.
[33] NAG Ltd. The NAG Fortran Library Manual, 1987.
[34] C. Rader. Discrete Fourier transforms when the number of sample points is prime. In Proc. IEEE 56, pages 1107–1108, 1968.
[35] G. E. Rivard. Direct fast Fourier transform of bivariate functions. IEEE Trans. Acoust. Speech Signal Process., ASSP-25:250–252, 1977.
[36] Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. MPI: The Complete Reference. MIT Press, 1995.

[37] John Strassner. An expert system for matching FFT algorithms to computer architectures. DARPA ACMP presentation, 1989.
[38] P. N. Swarztrauber. Vectorizing the FFTs. In G. Rodrigue, editor, Parallel Computations, pages 490–501. Academic Press, New York, 1982.
[39] P. N. Swarztrauber. FFT algorithms for vector computers. Parallel Comput., 1:45–63, 1984.
[40] C. Temperton. Fast Fourier transforms on the Cyber 205. In J. Kowalik, editor, High Speed Computations, pages 490–501. Springer-Verlag, Berlin, 1984.
[41] C. Temperton. Implementation of a prime factor FFT algorithm on the Cray-1. Parallel Comput., 6:99–108, 1988.
[42] C. Van Loan. Computational Frameworks for the Fast Fourier Transform, volume 10 of Frontiers in Applied Mathematics. Society for Industrial and Applied Mathematics, Philadelphia, 1992.
[43] James S. Walker. Fast Fourier Transforms. CRC Press, New York, second edition, 1996.
[44] T. Williams, J. Reynders, and W. Humphrey. POOMA User Guide. Advanced Computing Laboratory, Los Alamos National Laboratory, 1997. URL: http://www.acl.lanl.gov/pooma/doc/userguide.
[45] S. Winograd. On computing the discrete fast Fourier transform. Math. Comp., 32:175–199, 1978.
[46] M. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley Publishing Co., 1995.
[47] Stephen Wolfram. The Mathematica Book. Cambridge University Press, third edition, 1996.
