# Chapter 1. The Radix-2 Frameworks

1.1 Matrix Notation and Algorithms
1.2 The FFT Idea
1.4 Weight and Butterfly Computations
1.5 Bit Reversal and Transposition
1.6 The Cooley--Tukey Framework
1.7 The Stockham Autosort Frameworks
1.8 The Pease Framework
1.9 Decimation in Frequency and Inverse FFTs

A fast Fourier Transform (FFT) is a quick method for forming the matrix-vector product Fnx, where Fn is the discrete Fourier transform (DFT) matrix. Our examination of this area begins in the simplest setting: the case when n = 2t. This permits the orderly repetition of the central divide-and-conquer process that underlies all FFT work. Our approach is based upon the factorization of Fn into the product of t = log2 n sparse matrix factors. Different factorizations correspond to different FFT frameworks. Within each framework different implementations are possible.

To navigate this hierarchy of ideas, we rely heavily upon block matrix notation, which we detail in Section 1.1. This "language" revolves around the Kronecker product and is used in Section 1.2 to establish the "radix-2 splitting," a factorization that indicates how a 2m-point DFT can be rapidly determined from a pair of m-point DFTs. The repeated application of this process leads to our first complete FFT procedure, the famous Cooley--Tukey algorithm in Section 1.3.

Before fully detailing the Cooley--Tukey process, we devote Section 1.4 and Section 1.5 to a number of important calculations that surface in FFT work. These include the butterfly computation, the generation of weights, and various data transpositions. Using these developments, we proceed to detail the in-place Cooley--Tukey framework in (Section 1.6). In-place FFT procedures overwrite the input vector with its DFT without making use of an additional vector workspace. However, certain data permutations are involved that are sometimes awkward to compute. These permutations can be avoided at the expense of a vector workspace. The resulting in-order FFTs are discussed in Section 1.7, where a pair of autosort frameworks are presented. A fourth framework due to Pease is covered in Section 1.8.

Associated with each of these four methods is a different factorization of the DFT matrix. As we show in Section 1.9, four additional frameworks are obtained by transposing these four factorizations. A fast procedure for the inverse DFT can be derived by conjugating any of the eight "forward" DFT factorizations.

## Chapter 2. General Radix Frameworks

2.2 Index Reversal and Transposition

The frameworks of Chapter 1 revolve around the efficient synthesis of an even-order DFT from two half-length DFTs. However, if n is not a power of two, then the radix-2 divide-and-conquer process breaks down. The mission of this section is to show that it is possible to formulate FFT frameworks for general n, as long as n is highly composite. In some important cases, the resulting algorithms involve fewer flops and better memory traffic patterns than their radix-2 counterparts.

The first three sections cover the underlying mathematics and generalize the factorization notions developed in Chapter 1. As a case study, we detail the central calculations associated with radix-4 and radix-8 frameworks in Section 2.4. We conclude in Section 2.5 with a discussion of the split-radix algorithm, an interesting reduced-arithmetic rearrangement of the radix-2 Cooley--Tukey algorithm.

## Chapter 3. High-Performance Frameworks

3.1 The Multiple DFT Problem
3.2 Matrix Transposition
3.3 The Large Single-Vector FFT Problem
3.4 The Multidimensional FFT Problem
3.5 Distributed-Memory FFTs
3.6 Shared-Memory FFTs

Suppose X is n1-by-n2. The computations X = Fn1 X and X = XFn2 are referred to as multiple DFT problems. Applying an FFT algorithm to the columns or rows of a matrix is an important problem in many applications. However, there is more to the problem than just the repeated application of a single-vector technique, as we show in Section 3.1.

Single-vector DFT problems can be converted into a pair of multiple DFT problems with a scaling in between by exploiting the radix-p splitting. This amounts to a "blocking'' of the single-vector FFT problem and is very effective in many high-performance environments. The transpositions that are required by this framework are detailed in Section 3.2, and this is followed by a discussion of large single-vector FFTs in Section 3.3.

The multidimensional DFT problem is discussed in Section 3.4. Successful approaches in this area involve interesting combinations of matrix transposition and multiple DFT computation.

We conclude with two sections on parallel FFTs. Our aim is to highlight how one reasons about parallel computation in both distributed-memory and shared-memory environments.

## Chapter 4. Selected Topics

4.1 Prime Factor Frameworks
4.2 Convolution
4.3 FFTs of Real Data
4.4 Fast Trigonometric Transforms
4.5 Fast Poisson Solvers

Surprisingly, we have not yet exhausted the set of all possible factorizations of the DFT matrix Fn. In fact, a case can be made that we are but halfway along in our survey, for until now we have neglected the prime factor FFTs. This framework is based upon a number-theoretic splitting of the DFT matrix and a large fraction of the FFT literature is devoted to its implementation and various properties.

Elsewhere in this final chapter we examine two leading application areas that rely heavily upon FFTs: convolution and the numerical solution of the Poisson equation. Methods for one- and two-dimensional convolution are given in Section 4.2. The duality between fast convolution and FFTs is established and some new FFT frameworks emerge. Fast methods for the product of a Toeplitz matrix and a vector are also discussed.

A real-data FFT framework is derived in Section 4.3 and comparisons are made with the rival Hartley transform. By exploiting structure, it is possible to halve the amount of required arithmetic. Fast algorithms for various sine and cosine transforms are given in Section 4.4 and then used in Section 4.5 to solve the Poisson equation problem. The trigonometric transforms that we develop have many applications and are a fitting tribute to the FFT and its wide applicability.