% gsl-1.9/doc/fftalgorithms.tex
\documentclass[fleqn,12pt]{article}
%
\setlength{\oddsidemargin}{-0.25in}
\setlength{\textwidth}{7.0in}
\setlength{\topmargin}{-0.25in}
\setlength{\textheight}{9.5in}
%
\usepackage{algorithmic}
\newenvironment{algorithm}{\begin{quote} %\vspace{1em}
\begin{algorithmic}\samepage}{\end{algorithmic} %\vspace{1em}
\end{quote}}
\newcommand{\Real}{\mathop{\mathrm{Re}}}
\newcommand{\Imag}{\mathop{\mathrm{Im}}}

\begin{document}
\title{FFT Algorithms}
\author{Brian Gough, {\tt bjg@network-theory.co.uk}}
\date{May 1997}
\maketitle

\section{Introduction}

Fast Fourier Transforms (FFTs) are efficient algorithms for
calculating the discrete Fourier transform (DFT),
%
\begin{eqnarray}
h_a &=& \mathrm{DFT}(g_b) \\
 &=& \sum_{b=0}^{N-1} g_b \exp(-2\pi i a b /N) \qquad 0 \leq a \leq N-1\\
 &=& \sum_{b=0}^{N-1} g_b W_N^{ab} \qquad W_N= \exp(-2\pi i/N)
\end{eqnarray}
%
The DFT usually arises as an approximation to the continuous Fourier
transform when functions are sampled at discrete intervals in space or
time. The naive evaluation of the discrete Fourier transform is a
matrix-vector multiplication ${\mathbf W}\vec{g}$, which takes
$O(N^2)$ operations for $N$ data-points. The general principle of
Fast Fourier Transform algorithms is to use a divide-and-conquer
strategy to factorize the matrix $W$ into smaller sub-matrices,
typically reducing the operation count to $O(N \sum f_i)$ when $N$ can
be factorized into smaller integers, $N=f_1 f_2 \dots f_n$.

This chapter explains the algorithms used in the GSL FFT routines
and provides some information on how to extend them. To learn more about
the FFT you should read the review article {\em Fast Fourier
Transforms: A Tutorial Review and A State of the Art} by Duhamel and
Vetterli~\cite{duhamel90}.
There are several introductory books on the
FFT with example programs, such as {\em The Fast Fourier Transform} by
Brigham~\cite{brigham74} and {\em DFT/FFT and Convolution Algorithms}
by Burrus and Parks~\cite{burrus84}. In 1979 the IEEE published a
compendium of carefully-reviewed Fortran FFT programs in {\em Programs
for Digital Signal Processing}~\cite{committee79} which is a useful
reference for implementations of many different FFT algorithms. If you
are interested in using DSPs then the {\em Handbook of Real-Time Fast
Fourier Transforms}~\cite{smith95} provides detailed information on
the algorithms and hardware needed to design, build and test DSP
applications. Many FFT algorithms rely on results from number theory.
These results are covered in the books {\em Fast transforms:
algorithms, analyses, applications}, by Elliott and
Rao~\cite{elliott82}, {\em Fast Algorithms for Digital Signal
Processing} by Blahut~\cite{blahut} and {\em Number Theory in Digital
Signal Processing} by McClellan and Rader~\cite{mcclellan79}. There is
also an annotated bibliography of papers on the FFT and related topics
by Burrus~\cite{burrus-note}.

\section{Families of FFT algorithms}
%
There are two main families of FFT algorithms: the Cooley-Tukey
algorithm and the Prime Factor algorithm. These differ in the way they
map the full FFT into smaller sub-transforms. Of the Cooley-Tukey
algorithms there are two types of routine in common use: mixed-radix
(general-$N$) algorithms and radix-2 (power of 2) algorithms. Each
type of algorithm can be further classified by additional characteristics,
such as whether it operates in-place or uses additional scratch space,
whether its output is in a sorted or scrambled order, and whether it
uses decimation-in-time or -frequency iterations.

Mixed-radix algorithms work by factorizing the data vector into
shorter lengths. These can then be transformed by small-$N$ FFTs.
Typical programs include FFTs for small prime factors, such as 2, 3,
5, \dots, which are highly optimized. The small-$N$ FFT modules act as
building blocks and can be combined to make longer
transforms. By combining a reasonable set of modules it is possible to
compute FFTs of many different lengths. If the small-$N$ modules are
supplemented by an $O(N^2)$ general-$N$ module then an FFT of any
length can be computed, in principle. Of course, any length which
contains large prime factors would perform only as $O(N^2)$.

Radix-2 algorithms, or ``power of two'' algorithms, are simplified
versions of the mixed-radix algorithm. They are restricted to lengths
which are a power of two. The basic radix-2 FFT module only involves
addition and subtraction, so the algorithms are very simple. Radix-2
algorithms have been the subject of much research into optimizing the
FFT. Many of the most efficient radix-2 routines are based on the
``split-radix'' algorithm. This is actually a hybrid which combines
the best parts of both radix-2 and radix-4 (``power of 4'')
algorithms~\cite{sorenson86,duhamel86}.

The prime factor algorithm (PFA) is an alternative form of general-$N$
algorithm based on a different way of recombining small-$N$ FFT
modules~\cite{temperton83pfa,temperton85}. It has a very simple indexing
scheme which makes it attractive. However it only works in the case
where all factors are mutually prime. This requirement makes it more
suitable as a specialized algorithm for given lengths.


\begin{subsection}{FFTs of prime lengths}
%
 Large prime lengths cannot be handled efficiently by any of these
algorithms. However it may still be possible to compute a DFT, by using
results from number theory. Rader showed that it is possible to
convert a length-$p$ FFT (where $p$ is prime) into a convolution of
length-($p-1$).
There is a simple identity between the convolution of
length $N$ and the FFT of the same length, so if $p-1$ is easily
factorizable this allows the convolution to be computed efficiently
via the FFT. The idea is presented in the original paper by
Rader~\cite{raderprimes} (also reprinted in~\cite{mcclellan79}), but
for more details see the theoretical books mentioned earlier.
\end{subsection}

%\subsection{In-place algorithms vs algorithms using scratch space}
%%
%For simple algorithms, such as the Radix-2 algorithms, the FFT can be
%performed in-place, without additional memory. For this to be possible
%the index calculations must be simple enough that the calculation of
%the result to be stored in index $k$ can be carried out at the same
%time as the data in $k$ is used, so that no temporary storage is
%needed.

%The mixed-radix algorithm is too complicated for this. It requires an
%area of scratch space sufficient to hold intermediate copies of the
%input data.

%\subsection{Self-sorting algorithms vs scrambling algorithms}
%%
%A self-sorting algorithm includes internal permutations at each
%iteration of the FFT which leave the final iteration in its natural
%order.

%The mixed-radix algorithm can be made self-sorting. The additional
%scratch space helps here, although it is in fact possible to design
%self-sorting algorithms which do not require scratch
%space.

%The in-place radix-2 algorithm is not self-sorting. The data is
%permuted, and transformed into bit-reversed order. Note that
%bit-reversal only refers to the order of the array, not the values for
%each index which are of course not changed. More precisely, the data
%stored in location $[b_n \dots b_2 b_1 b_0]$ is moved to location
%$[b_0 b_1 b_2 \dots b_n]$, where $b_0 \dots b_n$ are the raw bits of a
%given index.
%Placing the data in bit-reversed order is easily done,
%using right shifts to extract the least significant bits of the index
%and left-shifts to feed them into a register for the bit-reversed
%location. Simple radix-2 FFTs usually recompute the bit-reversed
%ordering in the naive way for every call. For repeated calls it might
%be worthwhile to precompute the permutation and cache it. There are
%also more sophisticated algorithms which reduce the number of
%operations needed to perform bit-reversal~\cite{bit-reversal}.


%\begin{subsection}{Decimation-in-time (DIT) vs Decimation-in-Frequency (DIF)}

%\end{subsection}


\subsection{Optimization}
%
There is no such thing as the single fastest FFT {\em algorithm}. FFT
algorithms involve a mixture of floating point calculations, integer
arithmetic and memory access. Each of these operations will have
different relative speeds on different platforms. The performance of
an algorithm is a function of the hardware it is implemented on. The
goal of optimization is thus to choose the algorithm best suited to
the characteristics of a given hardware platform.

For example, the Winograd Fourier Transform Algorithm (WFTA) is
designed to reduce the number of floating point
multiplications in the FFT. However, it does this at the expense of
using many more additions and data transfers than other algorithms.
As a consequence the WFTA might be a good candidate algorithm for
machines where data transfers occupy a negligible time relative to
floating point arithmetic. However on most modern machines, where the
speed of data transfers is comparable to or slower than floating point
operations, it would be outperformed by an algorithm which used a
better mix of operations (i.e. more floating point operations but
fewer data transfers).

For a detailed study of this sort of effect, comparing the different
algorithms on different platforms, consult the paper {\it Effects of
Architecture Implementation on DFT Algorithm Performance} by Mehalic,
Rustan and Route~\cite{mehalic85}. The paper was written in the early
1980s and has data for super- and mini-computers which you are
unlikely to see today, except in a museum. However, the methodology is
still valid and it would be interesting to see similar results for
present-day computers.


\section{FFT Concepts}
%
Factorization is the key principle of the mixed-radix FFT divide-and-conquer
strategy. If $N$ can be factorized into a product of $n_f$ integers,
%
\begin{equation}
N = f_1 f_2 \dots f_{n_f} ,
\end{equation}
%
then the FFT itself can be divided into smaller FFTs for each factor.
More precisely, an FFT of length $N$ can be broken up into,
%
\begin{quote}
\begin{tabular}{l}
$(N/f_1)$ FFTs of length $f_1$, \\
$(N/f_2)$ FFTs of length $f_2$, \\
\dots \\
$(N/f_{n_f})$ FFTs of length $f_{n_f}$.
\end{tabular}
\end{quote}
%
The total number of operations for these sub-operations will be $O(
N(f_1 + f_2 + \dots + f_{n_f}))$. When the factors of $N$ are all small
integers this will be substantially less than $O(N^2)$. For example,
when $N$ is a power of 2 an FFT of length $N=2^m$ can be reduced to $m
N/2$ FFTs of length 2, or $O(N\log_2 N)$ operations. Here is a
demonstration which shows this.

We start with the full DFT,
%
\begin{eqnarray}
h_a &=& \sum_{b=0}^{N-1} g_b W_N^{ab} \qquad W_N=\exp(-2\pi i/N)
\end{eqnarray}
%
and split the sum into even and odd terms,
%
\begin{eqnarray}
\phantom{h_a}
% &=& \left(\sum_{b=0,2,4,\dots} + \sum_{b=1,3,5,\dots}\right) g_b W_N^{ab}\\
 &=& \sum_{b=0}^{N/2-1} g_{2b} W_N^{a(2b)}
 + \sum_{b=0}^{N/2-1} g_{2b+1} W_N^{a(2b+1)}.
\end{eqnarray}
%
This converts the original DFT of length $N$ into two DFTs of length
$N/2$,
%
\begin{equation}
h_a = \sum_{b=0}^{N/2-1} g_{2b} W_{(N/2)}^{ab}
 + W_N^a \sum_{b=0}^{N/2-1} g_{2b+1} W_{(N/2)}^{ab}
\end{equation}
%
The first term is a DFT of the even elements of $g$. The second term
is a DFT of the odd elements of $g$, premultiplied by the exponential
factor $W_N^a$ (known as a {\em twiddle factor}).
%
\begin{equation}
\mathrm{DFT}(h) = \mathrm{DFT}(g_{even}) + W_N^a \mathrm{DFT}(g_{odd})
\end{equation}
%
By splitting the DFT into its even and odd parts we have reduced the
operation count from $N^2$ (for a DFT of length $N$) to $2 (N/2)^2$
(for two DFTs of length $N/2$). The cost of the splitting is that we
need an additional $O(N)$ operations to multiply by the twiddle factor
$W_N^a$ and recombine the two sums.

We can repeat the splitting procedure recursively $\log_2 N$ times
until the full DFT is reduced to DFTs of single terms. The DFT of a
single value is just the identity operation, which costs nothing.
However, since $O(N)$ operations were needed at each stage to recombine
the even and odd parts, the total number of operations to obtain the
full DFT is $O(N \log_2 N)$. If we had used a length which was a
product of factors $f_1$, $f_2$, $\dots$ we could have split the sum
in a similar way. First we would split into terms corresponding to the
factor $f_1$, instead of the even and odd terms corresponding to a
factor of two. Then we would repeat this procedure for the subsequent
factors. This would lead to a final operation count of $O(N \sum
f_i)$.

This procedure gives some motivation for why the number of operations
in a DFT can in principle be reduced from $O(N^2)$ to $O(N \sum f_i)$.
It does not give a good explanation of how to implement the algorithm
in practice, which is what we shall do in the next section.
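The even/odd splitting translates directly into a recursive program.
Here is a minimal sketch in Python (GSL itself is written in C; the
names {\tt dft} and {\tt fft2} are purely illustrative), comparing the
naive $O(N^2)$ sum with the recursive $O(N \log_2 N)$ splitting,

```python
import cmath

def dft(g):
    """Naive O(N^2) evaluation of the DFT definition."""
    N = len(g)
    return [sum(g[b] * cmath.exp(-2j * cmath.pi * a * b / N)
                for b in range(N))
            for a in range(N)]

def fft2(g):
    """Radix-2 FFT by recursive splitting into even and odd terms."""
    N = len(g)
    if N == 1:
        return list(g)          # the DFT of a single value is the identity
    even = fft2(g[0::2])        # DFT of g_{2b},   length N/2
    odd  = fft2(g[1::2])        # DFT of g_{2b+1}, length N/2
    W = [cmath.exp(-2j * cmath.pi * a / N) for a in range(N // 2)]
    # recombine: h_a = even_a + W_N^a odd_a; for a >= N/2 the
    # periodicity W_N^{a+N/2} = -W_N^a gives the subtraction
    return ([e + w * o for e, w, o in zip(even, W, odd)] +
            [e - w * o for e, w, o in zip(even, W, odd)])
```

The second half of the recombination uses $W_N^{a+N/2} = -W_N^a$, which
is the same periodicity argument used throughout this chapter.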

\section{Radix-2 Algorithms}
%
For radix-2 FFTs it is natural to write array indices in binary form
because the length of the data is a power of two. This is nicely
explained in the article {\em The FFT: Fourier Transforming One Bit at
a Time} by P.B. Visscher~\cite{visscher96}. A binary representation
for indices is the key to deriving the simplest efficient radix-2
algorithms.

We can write an index $b$ ($0 \leq b < 2^{n}$) in binary
representation like this,
%
\begin{equation}
b = [b_{n-1} \dots b_1 b_0] = 2^{n-1}b_{n-1} + \dots + 2 b_1 + b_0 .
\end{equation}
%
Each of the $b_0, b_1, \dots, b_{n-1}$ are the bits (either 0 or 1) of
$b$.

Using this notation the original definition of the DFT
can be rewritten as a sum over the bits of $b$,
%
\begin{equation}
h(a) = \sum_{b=0}^{N-1} g_b \exp(-2\pi i a b /N)
\end{equation}
%
to give an equivalent summation like this,
%
\begin{equation}
h([a_{n-1} \dots a_1 a_0]) =
\sum_{b_0=0}^{1}
\sum_{b_1=0}^{1}
\dots
\sum_{b_{n-1}=0}^{1}
 g([b_{n-1} \dots b_1 b_0]) W_N^{ab}
\end{equation}
%
where the bits of $a$ are $a=[a_{n-1} \dots a_1 a_0]$.

To reduce the number of operations in the sum we will use the
periodicity of the exponential term,
%
\begin{equation}
W_N^{x+N} = W_N^{x}.
\end{equation}
%
Most of the products $ab$ in $W_N^{ab}$ are greater than $N$. By
making use of this periodicity they can all be collapsed down into the
range $0\dots N-1$. This allows us to reduce the number of operations
by combining common terms, modulo $N$. Using this idea we can derive
decimation-in-time or decimation-in-frequency algorithms, depending on
how we break the DFT summation down into common terms. We'll first
consider the decimation-in-time algorithm.
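The collapse of the products $ab$ modulo $N$ means that every factor
$W_N^{ab}$ in the sum is one of only $N$ distinct values, which can be
precomputed once. A small illustrative sketch in Python (not GSL
code),

```python
import cmath

N = 16
# table of the N distinct twiddle factors W_N^0 ... W_N^{N-1}
W = [cmath.exp(-2j * cmath.pi * k / N) for k in range(N)]

# by the periodicity W_N^{x+N} = W_N^x, every W_N^{ab} collapses
# onto a table entry indexed by (a*b) mod N
for a in range(N):
    for b in range(N):
        direct = cmath.exp(-2j * cmath.pi * a * b / N)
        assert abs(W[(a * b) % N] - direct) < 1e-9
```

In a real implementation the table would be shared between all passes
of the transform; the point here is only that the exponents never need
to leave the range $0 \dots N-1$.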

\subsection{Radix-2 Decimation-in-Time (DIT)}
%
To derive the decimation-in-time algorithm we start by separating
out the most significant bit of the index $b$,
%
\begin{equation}
[b_{n-1} \dots b_1 b_0] = 2^{n-1}b_{n-1} + [b_{n-2} \dots b_1 b_0]
\end{equation}
%
Now we can evaluate the innermost sum of the DFT without any
dependence on the remaining bits of $b$ in the exponential,
%
\begin{eqnarray}
h([a_{n-1} \dots a_1 a_0]) &=&
\sum_{b_0=0}^{1}
\sum_{b_1=0}^{1}
\dots
\sum_{b_{n-1}=0}^{1}
 g(b)
W_N^{a(2^{n-1}b_{n-1}+[b_{n-2} \dots b_1 b_0])} \\
 &=&
\sum_{b_0=0}^{1}
\sum_{b_1=0}^{1}
\dots
\sum_{b_{n-2}=0}^{1}
W_N^{a[b_{n-2} \dots b_1 b_0]}
\sum_{b_{n-1}=0}^{1}
 g(b)
W_N^{a(2^{n-1}b_{n-1})}
\end{eqnarray}
%
Looking at the term $W_N^{a(2^{n-1}b_{n-1})}$ we see that we can also
remove most of the dependence on $a$ as well, by using the periodicity of the
exponential,
%
\begin{eqnarray}
W_N^{a(2^{n-1}b_{n-1})} &=&
\exp(-2\pi i [a_{n-1} \dots a_1 a_0] 2^{n-1} b_{n-1}/ 2^n )\\
&=& \exp(-2\pi i [a_{n-1} \dots a_1 a_0] b_{n-1}/ 2 )\\
&=& \exp(-2\pi i ( 2^{n-2}a_{n-1} + \dots + a_1 + (a_0/2)) b_{n-1} )\\
&=& \exp(-2\pi i a_0 b_{n-1}/2 ) \\
&=& W_2^{a_0 b_{n-1}}
\end{eqnarray}
%
Thus the innermost exponential term simplifies so that it only
involves the highest order bit of $b$ and the lowest order bit of $a$,
and the sum can be reduced to,
%
\begin{equation}
h([a_{n-1} \dots a_1 a_0])
=
\sum_{b_0=0}^{1}
\sum_{b_1=0}^{1}
\dots
\sum_{b_{n-2}=0}^{1}
W_N^{a[b_{n-2} \dots b_1 b_0]}
\sum_{b_{n-1}=0}^{1}
 g(b)
W_2^{a_0 b_{n-1}}.
\end{equation}
%
We can repeat this procedure for the next most significant bit of
$b$, $b_{n-2}$, using a similar identity,
%
\begin{eqnarray}
W_N^{a(2^{n-2}b_{n-2})}
&=& \exp(-2\pi i [a_{n-1} \dots a_1 a_0] 2^{n-2} b_{n-2}/ 2^n )\\
&=& W_4^{[a_1 a_0] b_{n-2}},
\end{eqnarray}
%
to give a formula with even less dependence on the bits of $a$,
%
\begin{equation}
h([a_{n-1} \dots a_1 a_0])
=
\sum_{b_0=0}^{1}
\sum_{b_1=0}^{1}
\dots
\sum_{b_{n-3}=0}^{1}
W_N^{a[b_{n-3} \dots b_1 b_0]}
\sum_{b_{n-2}=0}^{1}
W_4^{[a_1 a_0] b_{n-2}}
\sum_{b_{n-1}=0}^{1}
 g(b)
W_2^{a_0 b_{n-1}}.
\end{equation}
%
If we repeat the process for all the remaining bits we obtain a
simplified DFT formula which is the basis of the radix-2
decimation-in-time algorithm,
%
\begin{eqnarray}
h([a_{n-1} \dots a_1 a_0]) &=&
\sum_{b_0=0}^{1}
W_{N}^{[a_{n-1} \dots a_1 a_0]b_0}
%\sum_{b_1=0}^{1}
%W_{N/2}^{[a_{n-1} \dots a_1 a_0]b_1}
\dots
\sum_{b_{n-2}=0}^{1}
W_4^{ [a_1 a_0] b_{n-2}}
\sum_{b_{n-1}=0}^{1}
W_2^{a_0 b_{n-1}}
g(b)
\end{eqnarray}
%
To convert the formula to an algorithm we expand out the sum
recursively, evaluating each of the intermediate summations, which we
denote by $g_1$, $g_2$, \dots, $g_n$,
%
\begin{eqnarray}
g_1(a_0, b_{n-2}, b_{n-3}, \dots, b_1, b_0)
&=&
\sum_{b_{n-1}=0}^{1}
W_2^{a_0 b_{n-1}}
g([b_{n-1} b_{n-2} b_{n-3} \dots b_1 b_0]) \\
g_2(a_0, {}_{\phantom{-2}} a_{1}, b_{n-3}, \dots, b_1, b_0)
&=&
\sum_{b_{n-2}=0}^{1}
W_4^{[a_1 a_0] b_{n-2}}
g_1(a_0, b_{n-2}, b_{n-3}, \dots, b_1, b_0) \\
g_3(a_0, {}_{\phantom{-2}} a_{1}, {}_{\phantom{-3}} a_{2}, \dots, b_1, b_0)
&=&
\sum_{b_{n-3}=0}^{1}
W_8^{[a_2 a_1 a_0] b_{n-3}}
g_2(a_0, a_1, b_{n-3}, \dots, b_1, b_0) \\
\dots &=& \dots \\
g_n(a_0, a_{1}, a_{2}, \dots, a_{n-2}, a_{n-1})
&=&
\sum_{b_{0}=0}^{1}
W_N^{[a_{n-1} \dots a_1 a_0]b_0}
g_{n-1}(a_0, a_1, a_2, \dots, a_{n-2}, b_0)
\end{eqnarray}
%
After the final sum, we can obtain the transform $h$ from $g_n$,
%
\begin{equation}
h([a_{n-1} \dots a_1 a_0])
=
g_n(a_0, a_1, \dots, a_{n-1})
\end{equation}
%
Note that we have left the storage arrangements of the intermediate sums
unspecified by using the bits as function arguments and not as an
index.
The storage of intermediate sums is different for the
decimation-in-time and decimation-in-frequency algorithms.

Before deciding on the best storage scheme we'll show that the results
of each stage, $g_1$, $g_2$, \dots, can be computed {\em
in-place}. For example, in the case of $g_1$, the inputs are,
%
\begin{equation}
g([\underline{b_{n-1}} b_{n-2} b_{n-3} \dots b_1 b_0])
\end{equation}
%
for $b_{n-1}=(0,1)$, and the corresponding outputs are,
%
\begin{equation}
g_1(\underline{a_0}, b_{n-2}, b_{n-3}, \dots, b_1, b_0)
\end{equation}
%
for $a_0=(0,1)$. It's clear that if we hold $b_{n-2}, b_{n-3}, \dots,
b_1, b_0$ fixed and compute the sum over $b_{n-1}$ for both
values of $a_0 = 0,1$ then we can store the result for $a_0=0$ in the
location which originally held the term with $b_{n-1}=0$, and the result
for $a_0=1$ in the location which originally held the term with
$b_{n-1}=1$. The two inputs and two
outputs are known as {\em dual node pairs}. At each stage of the
calculation the sums for each dual node pair are independent of the
others. It is this property which allows an in-place calculation.

So for an in-place pass our storage has to be arranged so that the two
outputs $g_1(a_0,\dots)$ overwrite the two input terms
$g([b_{n-1},\dots])$. Note that the order of $a$ is reversed from the
natural order of $b$, i.e. the least significant bit of $a$
replaces the most significant bit of $b$. This is inconvenient
because $a$ occurs in its natural order in all the exponentials,
$W^{ab}$. We could keep track of both $a$ and its bit-reverse,
$a^{\mathit{bit\mbox{-}reversed}}$, at all times but there is a neat trick
which avoids this: if we bit-reverse the order of the input data $g$
before we start the calculation we can also bit-reverse the order of
$a$ when storing intermediate results.
Since the storage involving $a$
was originally in bit-reversed order the switch in the input $g$ now
allows us to use normally ordered storage for $a$, the same ordering
that occurs in the exponential factors.

This is complicated to explain, so here is an example of the 4 passes
needed for an $N=16$ decimation-in-time FFT, with the initial data
stored in bit-reversed order,
%
\begin{eqnarray}
g_1([b_0 b_1 b_2 a_0])
&=&
\sum_{b_3=0}^{1} W_2^{a_0 b_3} g([b_0 b_1 b_2 b_3])
\\
g_2([b_0 b_1 a_1 a_0])
&=&
\sum_{b_2=0}^{1} W_4^{[a_1 a_0] b_2} g_1([b_0 b_1 b_2 a_0])
\\
g_3([b_0 a_2 a_1 a_0])
&=&
\sum_{b_1=0}^{1} W_8^{[a_2 a_1 a_0] b_1} g_2([b_0 b_1 a_1 a_0])
\\
h(a) = g_4([a_3 a_2 a_1 a_0])
&=&
\sum_{b_0=0}^{1} W_{16}^{[a_3 a_2 a_1 a_0] b_0} g_3([b_0 a_2 a_1 a_0])
\end{eqnarray}
%
We compensate for the bit reversal of the input data by accessing $g$
with the bit-reversed form of $b$ in the first stage. This ensures
that we are still carrying out the same calculation, using the same
data, and not accessing different values. Only single bits of $b$ ever
occur in the exponential so we never need the bit-reversed form of
$b$.

Let's examine the third pass in detail,
%
\begin{equation}
g_3([b_0 a_2 a_1 a_0])
=
\sum_{b_1=0}^{1} W_8^{[a_2 a_1 a_0] b_1} g_2([b_0 b_1 a_1 a_0])
\end{equation}
%
First note that only one bit, $b_1$, varies in each summation. The
other bits of $b$ ($b_0$) and of $a$ ($a_1 a_0$) are essentially
``spectators'' -- we must loop over all combinations of these bits and
carry out the same basic calculation for each, remembering to update
the exponentials involving $W_8$ appropriately. If we are storing the
results in-place (with $g_3$ overwriting $g_2$) we will need to compute
the sums involving $b_1=0,1$ and $a_2=0,1$ simultaneously.
%
\begin{equation}
\left(
\begin{array}{c}
g_3([b_0 0 a_1 a_0]) \vphantom{W_8^{[]}} \\
g_3([b_0 1 a_1 a_0]) \vphantom{W_8^{[]}}
\end{array}
\right)
=
\left(
\begin{array}{c}
g_2([b_0 0 a_1 a_0]) + W_8^{[0 a_1 a_0]} g_2([b_0 1 a_1 a_0]) \\
g_2([b_0 0 a_1 a_0]) + W_8^{[1 a_1 a_0]} g_2([b_0 1 a_1 a_0])
\end{array}
\right)
\end{equation}
%
We can write this in a more symmetric form by simplifying the exponential,
%
\begin{equation}
W_8^{[a_2 a_1 a_0]}
= W_8^{4 a_2 + [a_1 a_0]}
= (-1)^{a_2} W_8^{[a_1 a_0]}
\end{equation}
%
\begin{equation}
\left(
\begin{array}{c}
g_3([b_0 0 a_1 a_0]) \vphantom{W_8^{[]}} \\
g_3([b_0 1 a_1 a_0]) \vphantom{W_8^{[]}}
\end{array}
\right)
=
\left(
\begin{array}{c}
g_2([b_0 0 a_1 a_0]) + W_8^{[a_1 a_0]} g_2([b_0 1 a_1 a_0]) \\
g_2([b_0 0 a_1 a_0]) - W_8^{[a_1 a_0]} g_2([b_0 1 a_1 a_0])
\end{array}
\right)
\end{equation}
%
The exponentials $W_8^{[a_1 a_0]}$ are referred to as {\em twiddle
factors}. The form of this calculation, a symmetrical sum and
difference involving a twiddle factor, is called {\em a butterfly}.
It is often shown diagrammatically, and in the case $b_0=a_0=a_1=0$
would be drawn like this,
%
\begin{center}
\setlength{\unitlength}{1cm}
\begin{picture}(9,4)
%
%\put(0,0){\line(1,0){8}}
%\put(0,0){\line(0,1){4}}
%\put(8,4){\line(0,-1){4}}
%\put(8,4){\line(-1,0){8}}
%
\put(0,1){$g_2(4)$} \put(4.5,1){$g_3(4)=g_2(0) - W^a_8 g_2(4)$}
\put(0,3){$g_2(0)$} \put(4.5,3){$g_3(0)=g_2(0) + W^a_8 g_2(4)$}
\put(1,1){\vector(1,0){0.5}}
\put(1.5,1){\line(1,0){0.5}}
\put(1.5,0.5){$W^a_8$}
\put(1,3){\vector(1,0){0.5}}\put(1.5,3){\line(1,0){0.5}}
\put(2,1){\circle*{0.1}}
\put(2,3){\circle*{0.1}}
\put(2,1){\vector(1,1){2}}
\put(2,1){\vector(1,0){1}}
\put(3,1){\line(1,0){1}}
\put(3,0.5){$-1$}
\put(2,3){\vector(1,-1){2}}
\put(2,3){\vector(1,0){1}}
\put(3,3){\line(1,0){1}}
\put(4,1){\circle*{0.1}}
\put(4,3){\circle*{0.1}}
\end{picture}
\end{center}
%
The inputs are shown on the left and the outputs on the right. The
outputs are computed by multiplying the incoming lines by their
accompanying factors (shown next to the lines) and summing the results
at each node.

In general, denoting the bit for dual-node pairs by $\Delta$ and the
remaining bits of $a$ and $b$ by ${\hat a}$ and ${\hat b}$, the
butterfly is,
%
\begin{equation}
\left(
\begin{array}{c}
g({\hat b} + {\hat a}) \\
g({\hat b} + \Delta + {\hat a}) \\
\end{array}
\right)
\leftarrow
\left(
\begin{array}{c}
g({\hat b} + {\hat a}) + W_{2\Delta}^{\hat a} g({\hat b} + \Delta + {\hat a})\\
g({\hat b} + {\hat a}) - W_{2\Delta}^{\hat a} g({\hat b} + \Delta + {\hat a})
\end{array}
\right)
\end{equation}
%
where ${\hat a}$ runs from $0 \dots \Delta-1$ and ${\hat b}$ runs
through $0 \times 2\Delta$, $1\times 2\Delta$, $\dots$, $(N/2\Delta -
1)2\Delta$. The value of $\Delta$ is 1 on the first pass, 2 on the
second pass and $2^{n-1}$ on the $n$-th pass. Each pass requires
$N/2$ in-place computations, each involving two input locations and
two output locations.
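A single in-place pass of this butterfly, with the loops over
${\hat a}$ and ${\hat b}$ arranged as just described, can be sketched
in Python (illustrative code, not the GSL implementation; the name
{\tt dit\_pass} is hypothetical),

```python
import cmath

def dit_pass(g, delta):
    """One in-place decimation-in-time pass with dual-node spacing delta."""
    N = len(g)
    for b in range(0, N, 2 * delta):         # b-hat: multiples of 2*delta
        for a in range(delta):               # a-hat: 0 .. delta-1
            w = cmath.exp(-2j * cmath.pi * a / (2 * delta))  # W_{2 delta}^a
            t0 = g[b + a] + w * g[b + delta + a]
            t1 = g[b + a] - w * g[b + delta + a]
            g[b + a], g[b + delta + a] = t0, t1
```

Each call performs the $N/2$ dual-node updates of one pass; running the
passes with $\Delta = 1, 2, \dots, 2^{n-1}$ on bit-reversed input gives
the complete transform in natural order.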

In the example above $\Delta = [100] = 4$, ${\hat a} = [a_1 a_0]$ and
${\hat b} = [b_0 0 0 0]$.

This leads to the canonical radix-2 decimation-in-time FFT algorithm
for $2^n$ data points stored in the array $g(0) \dots g(2^n-1)$.
%
\begin{algorithm}
\STATE bit-reverse ordering of $g$
\STATE {$\Delta \Leftarrow 1$}
\FOR {$\mbox{pass} = 1 \dots n$}
  \STATE {$W \Leftarrow \exp(-2 \pi i / 2\Delta)$}
  \FOR {$(a = 0 ; a < \Delta ; a{++})$}
    \FOR{$(b = 0 ; b < N ; b {+=} 2*\Delta)$}
      \STATE{$t_0 \Leftarrow g(b+a) + W^a g(b+\Delta+a)$}
      \STATE{$t_1 \Leftarrow g(b+a) - W^a g(b+\Delta+a)$}
      \STATE{$g(b+a) \Leftarrow t_0$}
      \STATE{$g(b+\Delta+a) \Leftarrow t_1$}
    \ENDFOR
  \ENDFOR
  \STATE{$\Delta \Leftarrow 2\Delta$}
\ENDFOR
\end{algorithm}
%
%This algorithm appears in Numerical Recipes as the routine {\tt
%four1}, where the implementation is attributed to N.M. Brenner.
%
\subsection{Details of the Implementation}
It is straightforward to implement a simple radix-2 decimation-in-time
routine from the algorithm above. Some optimizations can be made by
pulling the special case of $a=0$ out of the loop over $a$, to avoid
unnecessary multiplications when $W^a=1$,
%
\begin{algorithm}
  \FOR{$(b = 0 ; b < N ; b {+=} 2*\Delta)$}
    \STATE{$t_0 \Leftarrow g(b) + g(b+\Delta)$}
    \STATE{$t_1 \Leftarrow g(b) - g(b+\Delta)$}
    \STATE{$g(b) \Leftarrow t_0$}
    \STATE{$g(b+\Delta) \Leftarrow t_1$}
  \ENDFOR
\end{algorithm}
%
There are several algorithms for doing fast bit-reversal.
We use the
Gold-Rader algorithm, which is simple and does not require any working
space,
%
\begin{algorithm}
\STATE {$j \Leftarrow 0$}
\FOR{$i = 0 \dots N - 2$}
  \STATE {$k \Leftarrow N / 2$}
  \IF {$i < j$}
    \STATE {swap $g(i)$ and $g(j)$}
  \ENDIF

  \WHILE {$k \leq j$}
    \STATE{$j \Leftarrow j - k$}
    \STATE{$k \Leftarrow k / 2$}
  \ENDWHILE

  \STATE{$j \Leftarrow j + k$}
\ENDFOR
\end{algorithm}
%
The Gold-Rader algorithm is typically twice as fast as a naive
bit-reversal algorithm (where the bit reversal is carried out by
left-shifts and right-shifts on the index). The library also has a
routine for the Rodriguez bit reversal algorithm, which also does not
require any working space~\cite{rodriguez89}. There are faster bit
reversal algorithms, but they all use additional scratch
space~\cite{rosel89}.

Within the loop for $a$ we can compute $W^a$ using a trigonometric
recursion relation,
%
\begin{eqnarray}
W^{a+1} &=& W W^a \\
 &=& (\cos(2\pi/2\Delta) - i \sin(2\pi/2\Delta)) W^a
\end{eqnarray}
%
This requires only $2 \log_2 N$ trigonometric calls, to compute the
initial values of $\exp(-2\pi i /2\Delta)$ for each pass.

\subsection{Radix-2 Decimation-in-Frequency (DIF)}
%
To derive the decimation-in-frequency algorithm we start by separating
out the lowest order bit of the index $a$. Here is an example for the
decimation-in-frequency $N=16$ DFT.
%
\begin{eqnarray}
W_{16}^{[a_3 a_2 a_1 a_0][b_3 b_2 b_1 b_0]}
&=&
W_{16}^{[a_3 a_2 a_1 a_0][b_2 b_1 b_0]} W_{16}^{[a_3 a_2 a_1 a_0] [b_3
0 0 0]} \\
&=&
W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]} W_{16}^{a_0 [b_2 b_1 b_0]} W_2^{a_0
b_3} \\
&=&
W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]} W_{16}^{a_0 [b_2 b_1 b_0]} (-1)^{a_0 b_3}
\end{eqnarray}
%
By repeating the same type of expansion on the term,
%
\begin{equation}
W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]}
\end{equation}
%
we can reduce the transform to an alternative simple form,
%
\begin{equation}
h(a) =
\sum_{b_0=0}^1 (-1)^{a_3 b_0} W_4^{a_2 b_0}
\sum_{b_1=0}^1 (-1)^{a_2 b_1} W_8^{a_1 [b_1 b_0]}
\sum_{b_2=0}^1 (-1)^{a_1 b_2} W_{16}^{a_0 [b_2 b_1 b_0]}
\sum_{b_3=0}^1 (-1)^{a_0 b_3} g(b)
\end{equation}
%
To implement this we can again write the sum recursively. In this case
we do not have the problem of the order of $a$ being bit-reversed --
the calculation can be done in-place using the natural ordering of
$a$ and $b$,
%
\begin{eqnarray}
g_1([a_0 b_2 b_1 b_0])
&=&
W_{16}^{a_0 [b_2 b_1 b_0]}
\sum_{b_3=0}^1 (-1)^{a_0 b_3} g([b_3 b_2 b_1 b_0]) \\
g_2([a_0 a_1 b_1 b_0])
&=&
W_{8}^{a_1 [b_1 b_0]}
\sum_{b_2=0}^1 (-1)^{a_1 b_2} g_1([a_0 b_2 b_1 b_0]) \\
g_3([a_0 a_1 a_2 b_0])
&=&
W_{4}^{a_2 b_0}
\sum_{b_1=0}^1 (-1)^{a_2 b_1} g_2([a_0 a_1 b_1 b_0]) \\
h(a)
=
g_4([a_0 a_1 a_2 a_3])
&=&
\sum_{b_0=0}^1 (-1)^{a_3 b_0} g_3([a_0 a_1 a_2 b_0])
\end{eqnarray}
%
The final pass leaves the data for $h(a)$ in bit-reversed order, but
this is easily fixed by a final bit-reversal of the ordering.
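The final reordering (and the initial reordering needed by the
decimation-in-time algorithm) can use the Gold-Rader loop given
earlier. A sketch in Python (illustrative, not the GSL source; the
name {\tt bit\_reverse\_order} is hypothetical),

```python
def bit_reverse_order(g):
    """In-place Gold-Rader bit-reversal permutation of a length 2^n list."""
    N = len(g)
    j = 0
    for i in range(N - 1):       # i = 0 .. N-2
        k = N // 2
        if i < j:
            g[i], g[j] = g[j], g[i]
        while k <= j:            # propagate the carry of bit-reversed counting
            j -= k
            k //= 2
        j += k
    return g
```

The variable $j$ simply counts through $0 \dots N-1$ in bit-reversed
order, so element $i$ is swapped with its bit-reversed partner exactly
once.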

The basic in-place calculation or butterfly for each pass is slightly
different from the decimation-in-time version,
%
\begin{equation}
\left(
\begin{array}{c}
g({\hat a} + {\hat b}) \\
g({\hat a} + \Delta + {\hat b}) \\
\end{array}
\right)
\leftarrow
\left(
\begin{array}{c}
g({\hat a} + {\hat b}) + g({\hat a} + \Delta + {\hat b})\\
W_{2\Delta}^{\hat b}
\left( g({\hat a} + {\hat b}) - g({\hat a} + \Delta + {\hat b}) \right)
\end{array}
\right)
\end{equation}
%
In each pass ${\hat b}$ runs from $0 \dots \Delta-1$ and ${\hat
a}$ runs through $0, 2\Delta, \dots, (N/2\Delta - 1)2\Delta$. On the first
pass we start with $\Delta=8$, and on subsequent passes $\Delta$ takes
the values $4, 2, 1$.

This leads to the canonical radix-2 decimation-in-frequency FFT
algorithm for $2^n$ data points stored in the array $g(0) \dots
g(2^n-1)$.
%
\begin{algorithm}
\STATE {$\Delta \Leftarrow 2^{n-1}$}
\FOR {$\mbox{pass} = 1 \dots n$}
  \STATE {$W \Leftarrow \exp(-2 \pi i / 2\Delta)$}
  \FOR {$(b = 0 ; b < \Delta ; b{++})$}
    \FOR{$(a = 0 ; a < N ; a {+=} 2*\Delta)$}
      \STATE{$t_0 \Leftarrow g(a+b) + g(a+\Delta+b)$}
      \STATE{$t_1 \Leftarrow W^b \left( g(a+b) - g(a+\Delta+b) \right)$}
      \STATE{$g(a+b) \Leftarrow t_0$}
      \STATE{$g(a+\Delta+b) \Leftarrow t_1$}
    \ENDFOR
  \ENDFOR
  \STATE{$\Delta \Leftarrow \Delta/2$}
\ENDFOR
\STATE bit-reverse ordering of $g$
\end{algorithm}
%

\section{Self-Sorting Mixed-Radix Complex FFTs}
%
This section is based on the review article {\em Self-sorting
Mixed-Radix Fast Fourier Transforms} by Clive
Temperton~\cite{temperton83}. You should consult his article for full
details of all the possible algorithms (there are many
variations). Here I have annotated the derivation of the simplest
mixed-radix decimation-in-frequency algorithm.

For general-$N$ FFT algorithms the simple binary notation of radix-2
algorithms is no longer useful.
The mixed-radix FFT has to be built
+up using products of matrices acting on a data vector. The aim is to
+take the full DFT matrix $W_N$ and factor it into a set of small,
+sparse matrices corresponding to each factor of $N$.
+
+
+We'll denote the components of matrices using either subscripts or
+function notation,
+%
+\begin{equation}
+M_{ij} = M(i,j)
+\end{equation}
+%
+with (C-like) indices running from 0 to $N-1$. Matrix products will be
+denoted using square brackets,
+%
+\begin{equation}
+[AB]_{ij} = \sum_{k} A_{ik} B_{kj}
+\end{equation}
+%
+%
+Three special matrices will be needed in the mixed-radix factorization
+of the DFT: the identity matrix, $I$, a permutation matrix, $P$, and a
+matrix of twiddle factors, $D$, as well as the normal DFT matrices
+$W_n$.
+
+We write the identity matrix of order $r$ as $I_r(n,m)$,
+%
+\begin{equation}
+I_r(n,m) = \delta_{nm}
+\end{equation}
+%
+for $0 \leq n,m \leq r-1$.
+
+We also need to define a permutation matrix $P^a_b$ that performs
+digit reversal of the ordering of a vector. If the index of a vector
+$j = 0 \dots ab-1$ is factorized into $j = la + m$, with $0 \leq l \leq
+b-1$ and $0 \leq m \leq a-1$ then the operation of the matrix $P$ will
+exchange positions $la+m$ and $mb+l$ in the vector (this sort of
+digit-reversal is the generalization of bit-reversal to a number
+system with bases $a$ and $b$).
+
+In mathematical terms $P$ is a square matrix of size $ab \times ab$
+with the property,
+%
+\begin{eqnarray}
+P^a_b(j,k) &=& 1 ~\mbox{if}~ j=ra+s ~\mbox{and}~ k=sb+r
+\quad (0 \leq r \leq b-1,~ 0 \leq s \leq a-1) \\
+ &=& 0 ~\mbox{otherwise}
+\end{eqnarray}
+%
+
+Finally the FFT algorithm needs a matrix of twiddle factors, $D^a_b$,
+for the trigonometric sums. $D^a_b$ is a diagonal square matrix of
+size $ab \times ab$ with the definition,
+%
+\begin{eqnarray}
+D^a_b(j,k) &=& \omega^{sr}_{ab} ~\mbox{if}~ j=k=sb+r
+\quad (0 \leq r \leq b-1,~ 0 \leq s \leq a-1) \\
+ &=& 0 ~\mbox{otherwise}
+\end{eqnarray}
+%
+where $\omega_{ab} = e^{-2\pi i/ab}$.
+
+
+\subsection{The Kronecker Product}
+The Kronecker matrix product plays an important role in all the
+algorithms for combining operations on different subspaces. The
+Kronecker product $A \otimes B$ of two square matrices $A$ and $B$, of
+sizes $a \times a$ and $b \times b$ respectively, is a square matrix
+of size $a b \times a b$, defined as,
+%
+\begin{equation}
+[A \otimes B] (tb+u, rb+s) = A(t,r) B(u,s)
+\end{equation}
+%
+where $0 \leq u,s < b$ and $0 \leq t,r < a$. Let's examine a specific
+example. If we take a $2 \times 2$ matrix and a $3
+\times 3$ matrix,
+%
+\begin{equation}
+\begin{array}{ll}
+A =
+\left(
+\begin{array}{cc}
+a_{11} & a_{12} \\
+a_{21} & a_{22}
+\end{array}
+\right)
+&
+B =
+\left(
+\begin{array}{ccc}
+b_{11} & b_{12} & b_{13} \\
+b_{21} & b_{22} & b_{23} \\
+b_{31} & b_{32} & b_{33}
+\end{array}
+\right)
+\end{array}
+\end{equation}
+%
+then the Kronecker product $A \otimes B$ is,
+%
+\begin{eqnarray}
+A \otimes B &= &
+\left(
+\begin{array}{cc}
+a_{11} B & a_{12} B \\
+a_{21} B & a_{22} B
+\end{array}
+\right) \\
+ &=&
+\left(
+\begin{array}{cccccc}
+a_{11} b_{11} & a_{11} b_{12} & a_{11} b_{13} &
+  a_{12} b_{11} & a_{12} b_{12} & a_{12} b_{13} \\
+a_{11} b_{21} & a_{11} b_{22} & a_{11} b_{23} &
+  a_{12} b_{21} & a_{12} b_{22} & a_{12} b_{23} \\
+a_{11} b_{31} & a_{11} b_{32} & a_{11} b_{33} &
+  a_{12} b_{31} & a_{12} b_{32} & a_{12} b_{33} \\
+a_{21} b_{11} & a_{21} b_{12} & a_{21} b_{13} &
+  a_{22} b_{11} & a_{22} b_{12} & a_{22} b_{13} \\
+a_{21} b_{21} & a_{21} b_{22} & a_{21} b_{23} &
+  a_{22} b_{21} & a_{22} b_{22} & a_{22} b_{23} \\
+a_{21} b_{31} & a_{21} b_{32} & a_{21} b_{33} &
+  a_{22} b_{31} & a_{22} b_{32} & a_{22} b_{33}
+\end{array}
+\right)
+\end{eqnarray}
+%
+When the Kronecker product $A \otimes B$ acts on a vector of length
+$ab$, each matrix operates on a different subspace of the vector.
+Writing the index $i$ as $i=t b + u$, with $0\leq u \leq b-1$
+and $0\leq t \leq a-1$, we can see this explicitly by looking at components,
+%
+\begin{eqnarray}
+[(A \otimes B) v]_{(tb+u)}
+& = & \sum_{t'=0}^{a-1} \sum_{u'=0}^{b-1}
+      [A \otimes B]_{(tb+u,t'b+u')} v_{t'b+u'} \\
+& = & \sum_{t'u'} A_{tt'} B_{uu'} v_{t'b+u'}
+\end{eqnarray}
+%
+The matrix $B$ operates on the ``index'' $u'$, for all values of $t'$, and
+the matrix $A$ operates on the ``index'' $t'$, for all values of $u'$.
+%
+The most important property needed for deriving the FFT factorization
+is that the matrix product of two Kronecker products is the Kronecker
+product of the two matrix products,
+%
+\begin{equation}
+(A \otimes B)(C \otimes D) = (AC \otimes BD)
+\end{equation}
+%
+This follows straightforwardly from the original definition of the
+Kronecker product.
+
+\subsection{Two factor case, $N=ab$}
+%
+First consider the simplest possibility, where the data length $N$ can
+be divided into two factors, $N=ab$. The aim is to reduce the DFT
+matrix $W_N$ into simpler matrices corresponding to each factor. To
+make the derivation easier we will start from the known factorization
+and verify it (the initial factorization can be guessed by
+generalizing from simple cases). Here is the factorization we are
+going to prove,
+%
+\begin{equation}
+W_{ab} = (W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b).
+\end{equation}
+%
+We can check it by expanding the product into components,
+%
+\begin{eqnarray}
+\lefteqn{[(W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b)](la+m,rb+s)} \\
+& = &
+\sum_{u=0}^{b-1} \sum_{t=0}^{a-1}
+[(W_b \otimes I_a)](la+m,ua+t) [P^a_b D^a_b (W_a \otimes I_b)](ua+t,rb+s)
+\end{eqnarray}
+%
+where we have split the indices to match the Kronecker product, $0 \leq
+m, r \leq a-1$, $0 \leq l, s \leq b-1$.
The first term in the sum can
+easily be reduced to its component form,
+%
+\begin{eqnarray}
+[(W_b \otimes I_a)](la+m,ua+t)
+&=& W_b(l,u) I_a(m,t) \\
+&=& \omega_b^{lu} \delta_{mt}
+\end{eqnarray}
+%
+The second term is more complicated. We can expand the Kronecker
+product like this,
+\begin{eqnarray}
+(W_a \otimes I_b)(tb+u,rb+s)
+&=& W_a(t,r) I_b(u,s) \\
+&=& \omega_a^{tr} \delta_{us}
+\end{eqnarray}
+%
+and use this term to build up the product, $P^a_b D^a_b (W_a \otimes
+I_b)$. We first multiply by $D^a_b$,
+%
+\begin{equation}
+[D^a_b (W_a \otimes I_b)](tb+u,rb+s)
+=
+\omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
+\end{equation}
+%
+and then apply the permutation matrix, $P^a_b$, which digit-reverses
+the ordering of the first index, to obtain,
+%
+\begin{equation}
+[P^a_b D^a_b (W_a \otimes I_b)](ua+t,rb+s)
+=
+\omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
+\end{equation}
+%
+Combining the two terms in the matrix product we can obtain the full
+expansion in terms of the exponential $\omega$,
+%
+\begin{eqnarray}
+[(W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b)](la+m,rb+s)
+&=&
+\sum_{u=0}^{b-1} \sum_{t=0}^{a-1}
+\omega_b^{lu} \delta_{mt} \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
+\end{eqnarray}
+%
+If we evaluate this sum explicitly we can make the connection between
+the product involving $W_a$ and $W_b$ (above) and the expansion of the
+full DFT matrix $W_{ab}$,
+%
+\begin{eqnarray}
+\sum_{u=0}^{b-1} \sum_{t=0}^{a-1}
+\omega_b^{lu} \delta_{mt} \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
+&=& \omega^{ls}_b \omega^{ms}_{ab} \omega^{mr}_a \\
+&=& \omega^{als + ms +bmr}_{ab} \\
+&=& \omega^{als + ms +bmr}_{ab} \omega^{lrab}_{ab} \quad\mbox{using~} \omega^{ab}_{ab} =1\\
+&=& \omega^{(la+m)(rb+s)}_{ab} \\
+&=& W_{ab}(la+m,rb+s)
+\end{eqnarray}
+%
+The final line shows that the matrix product given above is identical to
+the full two-factor DFT matrix, $W_{ab}$.
+%
+Thus the full DFT matrix $W_{ab}$ for two factors $a$, $b$ can be
+broken down into a product of sub-transforms, $W_a$ and $W_b$, plus
+permutations, $P$, and twiddle factors, $D$, according to the formula,
+%
+\begin{equation}
+W_{ab} = (W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b).
+\end{equation}
+%
+This relation is the foundation of the general-$N$ mixed-radix FFT algorithm.
+
+\subsection{Three factor case, $N=abc$}
+%
+The result for the two-factor expansion can easily be generalized to
+three factors. We first consider $abc$ as being a product of two
+factors $a$ and $(bc)$, and then further expand the product $(bc)$ into
+$b$ and $c$. The first step of the expansion looks like this,
+%
+\begin{eqnarray}
+W_{abc} &=& W_{a(bc)}\\
+&=& (W_{bc} \otimes I_a) P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) .
+\end{eqnarray}
+%
+And after using the two-factor result to expand out $W_{bc}$ we obtain
+the factorization of $W_{abc}$,
+%
+\begin{eqnarray}
+W_{abc} &=& (((W_c \otimes I_b) P^b_c D^b_c (W_b \otimes I_c)) \otimes I_a )
+P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) \\
+&=& (W_c \otimes I_{ab}) (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) P^a_{bc} D^a_{bc} (W_a \otimes I_{bc})
+\end{eqnarray}
+%
+We can write this factorization in a product form, with one term for
+each factor,
+%
+\begin{equation}
+W_{abc} = T_3 T_2 T_1
+\end{equation}
+%
+where we read off $T_1$, $T_2$ and $T_3$,
+%
+\begin{eqnarray}
+T_1 &=& P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) \\
+T_2 &=& (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) \\
+T_3 &=& (W_c \otimes I_{ab} )
+\end{eqnarray}
+%
+
+
+\subsection{General case, $N=f_1 f_2 \dots f_{n_f}$}
+%
+If we continue the procedure that we have used for two and three
+factors then a general pattern begins to emerge in the
+factorization of $W_{f_1 f_2 \dots f_{n_f}}$.
To see the beginning of +the pattern we can rewrite the three factor case as, +% +\begin{eqnarray} +T_1 &=& (P^a_{bc} D^a_{bc} \otimes I_1) (W_a \otimes I_{bc}) \\ +T_2 &=& (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) \\ +T_3 &=& (P^c_1 D^c_1 \otimes I_{ab}) (W_c \otimes I_{ab} ) +\end{eqnarray} +% +using the special cases $P^c_1 = D^c_1 = I_c$. +% +In general, we can write the factorization of $W_N$ for $N= f_1 f_2 +\dots f_{n_f}$ as, +% +\begin{equation} +W_N = T_{n_f} \dots T_2 T_1 +\end{equation} +% +where the matrix factors $T_i$ are, +% +\begin{equation} +T_i = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) ( W_{f_i} +\otimes I_{m_i}) +\end{equation} +% +We have defined the following three additional variables $p$, $q$ and +$m$ to denote different partial products of the factors, +% +\begin{eqnarray} +p_i &=& f_1 f_2 \dots f_i \quad (p_0 = 1) \\ +q_i &=& N / p_i \\ +m_i &=& N / f_i +\end{eqnarray} +% +Note that the FFT modules $W$ are applied before the permutations $P$, +which makes this a decimation-in-frequency algorithm. + +\subsection{Implementation} +% +Now to the implementation of the algorithm. We start with a vector of +data, $z$, as input and want to apply the transform, +% +\begin{eqnarray} +x &=& W_N z \\ + &=& T_{n_f} \dots T_2 T_1 z +\end{eqnarray} +% +where $T_i = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) ( +W_{f_i} \otimes I_{m_i})$. + +The outer structure of the implementation will be a loop over the +$n_f$ factors, applying each matrix $T_i$ to the vector in turn to +build up the complete transform. +% +\begin{algorithm} +\FOR{$(i = 1 \dots n_f)$} + \STATE{$v \Leftarrow T_i v $} +\ENDFOR +\end{algorithm} +% +The order of the factors is not important. Now we examine the iteration +$v \Leftarrow T_i v$, which we'll write as, +% +\begin{equation} +v' = +(P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) ~ +( W_{f_i} \otimes I_{m_i}) v +\end{equation} +% +There are two Kronecker product matrices in this iteration. 
The +rightmost matrix, which is the first to be applied, is a DFT of length +$f_i$ applied to $N/f_i$ subsets of the data. We'll call this $t$, +since it will be a temporary array, +% +\begin{equation} +t = (W_{f_i} \otimes I_{m_i}) v +\end{equation} +% +The second matrix applies a permutation and the exponential +twiddle-factors. We'll call this $v'$, since it is the result of the +full iteration on $v$, +% +\begin{equation} +v' = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) t +\end{equation} + +The effect of the matrix $(W_{f_i} \otimes I_{m_i})$ is best seen by +an example. Suppose the factor is $f_i = 3$, and the length of the FFT +is $N=6$, then the relevant Kronecker product is, +% +\begin{equation} +t = (W_3 \otimes I_2) v +\end{equation} +% +which expands out to, +% +\begin{equation} +\left( +\begin{array}{c} +t_0 \\ +t_1 \\ +t_2 \\ +t_3 \\ +t_4 \\ +t_5 +\end{array} +\right) += +\left( +\begin{array}{cccccc} +W_3(1,1) & 0 & W_3(1,2) & 0 & W_3(1,3) & 0 \\ +0 & W_3(1,1) & 0 & W_3(1,2) & 0 & W_3(1,3) \\ +W_3(2,1) & 0 & W_3(2,2) & 0 & W_3(2,3) & 0 \\ +0 & W_3(2,1) & 0 & W_3(2,2) & 0 & W_3(2,3) \\ +W_3(3,1) & 0 & W_3(3,2) & 0 & W_3(3,3) & 0 \\ +0 & W_3(3,1) & 0 & W_3(3,2) & 0 & W_3(3,3) +\end{array} +\right) +\left( +\begin{array}{c} +v_0 \\ +v_1 \\ +v_2 \\ +v_3 \\ +v_4 \\ +v_5 +\end{array} +\right) +\end{equation} +% +We can rearrange the components in a computationally convenient form, +\begin{equation} +\left( +\begin{array}{c} +t_0 \\ +t_2 \\ +t_4 \\ +t_1 \\ +t_3 \\ +t_5 +\end{array} +\right) += +\left( +\begin{array}{cccccc} +W_3(1,1) & W_3(1,2) & W_3(1,3) & 0 & 0 & 0 \\ +W_3(2,1) & W_3(2,2) & W_3(2,3) & 0 & 0 & 0 \\ +W_3(3,1) & W_3(3,2) & W_3(3,3) & 0 & 0 & 0 \\ +0 & 0 & 0 & W_3(1,1) & W_3(1,2) & W_3(1,3) \\ +0 & 0 & 0 & W_3(2,1) & W_3(2,2) & W_3(2,3) \\ +0 & 0 & 0 & W_3(3,1) & W_3(3,2) & W_3(3,3) +\end{array} +\right) +\left( +\begin{array}{c} +v_0 \\ +v_2 \\ +v_4 \\ +v_1 \\ +v_3 \\ +v_5 +\end{array} +\right) +\end{equation} +% +which clearly shows 
that we just need to apply the $3\times 3$ DFT
+matrix $W_3$ twice, once to the sub-vector of elements $(v_0, v_2, v_4)$,
+and independently to the remaining sub-vector $(v_1, v_3, v_5)$.
+
+In the general case, if we index $t$ as $t_k = t(\lambda,\mu) =
+t_{\lambda m + \mu}$ then $\lambda = 0 \dots f-1$ is an index within
+each transform of length $f$ and $\mu = 0 \dots m-1$ labels the
+independent subsets of data. We can see this by showing the
+calculation with all indices present,
+%
+\begin{equation}
+t = (W_f \otimes I_m) z
+\end{equation}
+%
+becomes,
+%
+\begin{eqnarray}
+t_{\lambda m + \mu} &=& \sum_{\lambda'=0}^{f-1} \sum_{\mu'=0}^{m-1}
+    (W_f \otimes I_m)_{(\lambda m + \mu)(\lambda' m + \mu')}
+    z_{\lambda'm + \mu'} \\
+&=& \sum_{\lambda'\mu'} (W_f)_{\lambda\lambda'} \delta_{\mu\mu'}
+    z_{\lambda'm+\mu'} \\
+&=& \sum_{\lambda'} (W_f)_{\lambda\lambda'} z_{\lambda'm+\mu}
+\end{eqnarray}
+%
+The DFTs on the index $\lambda$ will be computed using
+special optimized modules for each $f$.
+
+To calculate the next stage,
+%
+\begin{equation}
+v'=(P^f_q D^f_q \otimes I_{p_{i-1}}) t
+\end{equation}
+%
+we note that the Kronecker product performs
+$p_{i-1}$ independent multiplications of $PD$ on $p_{i-1}$ interleaved
+subsets of $t$. Since $m=p_{i-1}q$, the index $\mu$ of
+$t(\lambda,\mu)$, which runs from 0 to $m-1$, can be split further
+into $\mu = a p_{i-1} + b$, where $a = 0 \dots q-1$ labels the entries
+within each subset and $b=0 \dots p_{i-1}-1$ labels the subsets,
+%
+\begin{eqnarray}
+\lambda m + \mu &=& \lambda m + a p_{i-1} + b \\
+ &=& (\lambda q + a) p_{i-1} + b.
+\end{eqnarray} +% +Now we can expand the second stage, +% +\begin{eqnarray} +v'&=& (P^f_q D^f_q \otimes I_{p_{i-1}}) t \\ +v'_{\lambda m + \mu} &=& \sum_{\lambda' \mu'} + (P^f_q D^f_q \otimes I_{p_{i-1}})_{(\lambda m + \mu)(\lambda' m + \mu')} + t_{\lambda' m + \mu'} \\ +v'_{(\lambda q + a) p_{i-1} + b} &=& \sum_{\lambda' a' b'} +( +P^f_q D^f_q \otimes I_{p_{i-1}} +)_{((\lambda q+ a)p_{i-1} + b)((\lambda' q+ a')p_{i-1} + b')} +t_{(\lambda' q + a')p_{i-1} +b'} +\end{eqnarray} +% +The first step in removing redundant indices is to take advantage of +the identity matrix $I$ and separate the subspaces of the Kronecker +product, +% +\begin{equation} +( +P^f_q D^f_q \otimes I_{p_{i-1}} +)_{((\lambda q+ a)p_{i-1} + b)((\lambda' q+ a')p_{i-1} + b')} += +(P^f_q D^f_q)_{(\lambda q + a)(\lambda' q + a')} +\delta_{bb'} +\end{equation} +% +This eliminates one sum, leaving us with, +% +\begin{equation} +v'_{(\lambda q + a) p_{i-1} + b} += +\sum_{\lambda' a' } +(P^f_q D^f_q)_{(\lambda q + a)(\lambda' q + a')} t_{(\lambda'q+a')p_{i-1} + b} +\end{equation} +% +We can insert the definition of $D^f_q$ to give, +% +\begin{equation} +\phantom{v'_{(\lambda q + a) p_{i-1} + b}} += \sum_{\lambda'a'} (P^f_q)_{(\lambda q + a)(\lambda'q + a')} +\omega^{\lambda'a'}_{q_{i-1}} t_{(\lambda'q+a')p_{i-1}+b} +\end{equation} +% +Using the definition of $P^f_q$, which exchanges an index of $\lambda +q + a$ with $a f + \lambda$, we get a final result with no matrix +multiplication, +% +\begin{equation} +v'_{(a f + \lambda) p_{i-1} + b} += \omega^{\lambda a}_{q_{i-1}} t_{(\lambda q + a)p_{i-1} + b} +\end{equation} +% +All we have to do is premultiply each element of the temporary vector +$t$ by an exponential twiddle factor and store the result in another +index location, according to the digit reversal permutation of $P$. 
+
+Here is the algorithm to implement the mixed-radix FFT,
+%
+\begin{algorithm}
+\FOR{$i = 1 \dots n_f$}
+\FOR{$a = 0 \dots q-1$}
+\FOR{$b = 0 \dots p_{i-1} - 1$}
+\FOR{$\lambda = 0 \dots f-1$}
+\STATE{$t_\lambda \Leftarrow
+ \sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') v_{b+\lambda'm+ap_{i-1}}$}
+ \COMMENT{DFT matrix-multiply module}
+\ENDFOR
+\FOR{$\lambda = 0 \dots f-1$}
+\STATE{$v'_{(af+\lambda)p_{i-1}+b}
+ \Leftarrow \omega^{\lambda a}_{q_{i-1}} t_\lambda$}
+\ENDFOR
+\ENDFOR
+\ENDFOR
+\STATE{$v \Leftarrow v'$}
+\ENDFOR
+\end{algorithm}
+%
+\subsection{Details of the implementation}
+%
+First the function {\tt gsl\_fft\_complex\_wavetable\_alloc} allocates
+$n$ elements of scratch space (to hold the vector $v'$ for each
+iteration) and $n$ elements for a trigonometric lookup table of
+twiddle factors.
+
+Then the length $n$ must be factorized. There is a general
+factorization function {\tt gsl\_fft\_factorize} which takes a list of
+preferred factors. It first factors out the preferred factors and then
+removes general remaining prime factors.
+
+The algorithm used to generate the trigonometric lookup table is
+%
+\begin{algorithm}
+\FOR {$i = 1 \dots n_f$}
+\FOR {$b = 1 \dots f_i - 1$}
+\FOR {$c = 1 \dots q_i$}
+\STATE $\mbox{trig[k++]} = \exp(- 2\pi i b c p_{i-1}/N)$
+\ENDFOR
+\ENDFOR
+\ENDFOR
+\end{algorithm}
+%
+Note that $\sum_{i=1}^{n_f} \sum_{b=1}^{f_i-1} \sum_{c=1}^{q_i} 1 =
+\sum_{i=1}^{n_f} (f_i-1)q_i = \sum_{i=1}^{n_f} (q_{i-1}-q_i) = n - 1$,
+since the sum telescopes with $q_0 = n$ and $q_{n_f} = 1$, so $n$
+elements are always sufficient to store the lookup table. This layout
+is chosen because we need to compute
+$\omega_{q_{i-1}}^{\lambda a} t_\lambda$ in
+the FFT. In terms of the lookup table we can write this as,
+%
+\begin{eqnarray}
+\omega_{q_{i-1}}^{\lambda a} t_\lambda
+&=& \exp(-2\pi i \lambda a/q_{i-1}) t_\lambda \\
+&=& \exp(-2\pi i \lambda a p_{i-1}/N) t_\lambda \\
+&=& \left\{
+    \begin{array}{ll}
+    t_\lambda & a=0 ~\mbox{or}~ \lambda=0 \\
+    \mbox{trig}[\mbox{twiddle[i]}+(\lambda-1) q_i+(a-1)] t_\lambda & \mbox{otherwise}
+\end{array}
+\right.
+\end{eqnarray}
+%
+\begin{sloppypar}
+\noindent
+The array {\tt twiddle[i]} maintains a set of pointers into {\tt trig}
+for the starting points for the outer loop. The core of the
+implementation is {\tt gsl\_fft\_complex}. This function loops over
+the chosen factors of $N$, computing the iteration $v'=T_i v$ for each
+pass. When a factor has a dedicated small-$N$ module the iteration is
+handed off to that module, such as {\tt
+gsl\_fft\_complex\_pass\_3} or {\tt
+gsl\_fft\_complex\_pass\_5}. Unimplemented factors are handled
+by the general-$N$ routine {\tt gsl\_fft\_complex\_pass\_n}. The
+structure of one of the small-$N$ modules is a simple transcription of
+the basic algorithm given above. Here is an example for {\tt
+gsl\_fft\_complex\_pass\_3}. For a pass with a factor of 3 we have to
+calculate the following expression,
+\end{sloppypar}%
+\begin{equation}
+v'_{(a f + \lambda) p_{i-1} + b}
+=
+\sum_{\lambda' = 0,1,2}
+\omega^{\lambda a}_{q_{i-1}} W^{\lambda \lambda'}_{3}
+v_{b + \lambda' m + a p_{i-1}}
+\end{equation}
+%
+for $b = 0 \dots p_{i-1} - 1$, $a = 0 \dots q_{i} - 1$ and $\lambda =
+0, 1, 2.$
This is implemented as, +% +\begin{algorithm} +\FOR {$a = 0 \dots q-1$} +\FOR {$b = 0 \dots p_{i-1} - 1$} +\STATE {$ + \left( + \begin{array}{c} + t_0 \\ t_1 \\ t_2 + \end{array} + \right) + = + \left( + \begin{array}{ccc} + W^{0}_3 & W^{0}_3 & W^{0}_3 \\ + W^{0}_3 & W^{1}_3 & W^{2}_3 \\ + W^{0}_3 & W^{2}_3 & W^{4}_3 + \end{array} + \right) + \left( + \begin{array}{l} + v_{b + a p_{i-1}} \\ + v_{b + a p_{i-1} + m} \\ + v_{b + a p_{i-1} +2m} + \end{array} + \right)$} + \STATE {$ v'_{a p_{i} + b} = t_0$} + \STATE {$ v'_{a p_{i} + b + p_{i-1}} = \omega^{a}_{q_{i-1}} t_1$} + \STATE {$ v'_{a p_{i} + b + 2 p_{i-1}} = \omega^{2a}_{q_{i-1}} t_2$} +\ENDFOR +\ENDFOR +\end{algorithm} +% +In the code we use the variables {\tt from0}, {\tt from1}, {\tt from2} +to index the input locations, +% +\begin{eqnarray} +\mbox{\tt from0} &=& b + a p_{i-1} \\ +\mbox{\tt from1} &=& b + a p_{i-1} + m \\ +\mbox{\tt from2} &=& b + a p_{i-1} + 2m +\end{eqnarray} +% +and the variables {\tt to0}, {\tt to1}, {\tt to2} to index the output +locations in the scratch vector $v'$, +% +\begin{eqnarray} +\mbox{\tt to0} &=& b + a p_{i} \\ +\mbox{\tt to1} &=& b + a p_{i} + p_{i-1} \\ +\mbox{\tt to2} &=& b + a p_{i} + 2 p_{i-1} +\end{eqnarray} +% +The DFT matrix multiplication is computed using the optimized +sub-transform modules given in the next section. The twiddle factors +$\omega^a_{q_{i-1}}$ are taken out of the {\tt trig} array. + +To compute the inverse transform we go back to the definition of the +fourier transform and note that the inverse matrix is just the complex +conjugate of the forward matrix (with a factor of $1/N$), +% +\begin{equation} +W^{-1}_N = W^*_N / N +\end{equation} +% +Therefore we can easily compute the inverse transform by conjugating +all the complex elements of the DFT matrices and twiddle factors that +we apply. (An alternative strategy is to conjugate the input data, +take a forward transform, and then conjugate the output data). 
+ +\section{Fast Sub-transform Modules} +% +To implement the mixed-radix FFT we still need to compute the +small-$N$ DFTs for each factor. Fortunately many highly-optimized +small-$N$ modules are available, following the work of Winograd who +showed how to derive efficient small-$N$ sub-transforms by number +theoretic techniques. + +The algorithms in this section all compute, +% +\begin{equation} +x_a = \sum_{b=0}^{N-1} W_N^{ab} z_b +\end{equation} +% +The sub-transforms given here are the ones recommended by Temperton +and differ slightly from the canonical Winograd modules. According to +Temperton~\cite{temperton83} they are slightly more robust against +rounding errors and trade off some additions for multiplications. +% +For the $N=2$ DFT, +% +\begin{equation} +\begin{array}{ll} +x_0 = z_0 + z_1, & +x_1 = z_0 - z_1. +\end{array} +\end{equation} +% +For the $N=3$ DFT, +% +\begin{equation} +\begin{array}{lll} +t_1 = z_1 + z_2, & +t_2 = z_0 - t_1/2, & +t_3 = \sin(\pi/3) (z_1 - z_2), +\end{array} +\end{equation} +\begin{equation} +\begin{array}{lll} +x_0 = z_0 + t_1, & +x_1 = t_2 + i t_3, & +x_2 = t_2 - i t_3. +\end{array} +\end{equation} +% +The $N=4$ transform involves only additions and subtractions, +% +\begin{equation} +\begin{array}{llll} +t_1 = z_0 + z_2, & +t_2 = z_1 + z_3, & +t_3 = z_0 - z_2, & +t_4 = z_1 - z_3, +\end{array} +\end{equation} +\begin{equation} +\begin{array}{llll} +x_0 = t_1 + t_2, & +x_1 = t_3 + i t_4, & +x_2 = t_1 - t_2, & +x_3 = t_3 - i t_4. 
+\end{array} +\end{equation} +% +For the $N=5$ DFT, +% +\begin{equation} +\begin{array}{llll} +t_1 = z_1 + z_4, & +t_2 = z_2 + z_3, & +t_3 = z_1 - z_4, & +t_4 = z_2 - z_3, +\end{array} +\end{equation} +\begin{equation} +\begin{array}{llll} +t_5 = t_1 + t_2, & +t_6 = (\sqrt{5}/4) (t_1 - t_2), & +t_7 = z_0 - t_5/4, \\ +\end{array} +\end{equation} +\begin{equation} +\begin{array}{llll} +t_8 = t_7 + t_6, & +t_9 = t_7 - t_6, \\ +\end{array} +\end{equation} +\begin{equation} +\begin{array}{llll} +t_{10} = \sin(2\pi/5) t_3 + \sin(2\pi/10) t_4, & +t_{11} = \sin(2\pi/10) t_3 - \sin(2\pi/5) t_4, +\end{array} +\end{equation} +\begin{equation} +\begin{array}{llll} +x_0 = z_0 + t_5, +\end{array} +\end{equation} +\begin{equation} +\begin{array}{llll} +x_1 = t_8 + i t_{10}, & +x_2 = t_9 + i t_{11}, +\end{array} +\end{equation} +\begin{equation} +\begin{array}{llll} +x_3 = t_9 - i t_{11}, & +x_4 = t_8 - i t_{10}. +\end{array} +\end{equation} +% +The DFT matrix for $N=6$ can be written as a combination of $N=3$ and +$N=2$ transforms with index permutations, +% +\begin{equation} +\left( +\begin{array}{c} +x_0 \\ +x_4 \\ +x_2 \\ +\hline x_3 \\ +x_1 \\ +x_5 +\end{array} +\right) += +\left( +\begin{array}{ccc|ccc} + & & & & & \\ + &W_3& & &W_3& \\ + & & & & & \\ +\hline & & & & & \\ + &W_3& & &-W_3& \\ + & & & & & +\end{array} +\right) +\left( +\begin{array}{c} +z_0 \\ +z_2 \\ +z_4 \\ +\hline z_3 \\ +z_5 \\ +z_1 +\end{array} +\right) +\end{equation} +% +This simplification is an example of the Prime Factor Algorithm, which +can be used because the factors 2 and 3 are mutually prime. For more +details consult one of the books on number theory for +FFTs~\cite{elliott82,blahut}. 
We can take advantage of the simple
+indexing scheme of the PFA to write the $N=6$ DFT as,
+%
+\begin{equation}
+\begin{array}{lll}
+t_1 = z_2 + z_4, &
+t_2 = z_0 - t_1/2, &
+t_3 = \sin(\pi/3) (z_2 - z_4),
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{lll}
+t_4 = z_5 + z_1, &
+t_5 = z_3 - t_4/2, &
+t_6 = \sin(\pi/3) (z_5 - z_1),
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{lll}
+t_7 = z_0 + t_1, &
+t_8 = t_2 + i t_3, &
+t_9 = t_2 - i t_3,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{lll}
+t_{10} = z_3 + t_4, &
+t_{11} = t_5 + i t_6, &
+t_{12} = t_5 - i t_6,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{lll}
+x_0 = t_7 + t_{10}, &
+x_4 = t_8 + t_{11}, &
+x_2 = t_9 + t_{12},
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{lll}
+x_3 = t_7 - t_{10}, &
+x_1 = t_8 - t_{11}, &
+x_5 = t_9 - t_{12}.
+\end{array}
+\end{equation}
+
+For any remaining general factors we use Singleton's efficient method
+for computing a DFT~\cite{singleton}. Although it is an $O(N^2)$
+algorithm it does reduce the number of multiplications by a factor of
+4 compared with a naive evaluation of the DFT. If we look at the
+general structure of a DFT matrix, shown schematically below,
+%
+\begin{equation}
+\left(
+\begin{array}{c}
+h_0 \\
+h_1 \\
+h_2 \\
+\vdots \\
+h_{N-2} \\
+h_{N-1}
+\end{array}
+\right)
+=
+\left(
+\begin{array}{cccccc}
+1 & 1 & 1 & \cdots & 1 & 1 \\
+1 & W & W & \cdots & W & W \\
+1 & W & W & \cdots & W & W \\
+\vdots & \vdots & \vdots & \cdots & \vdots & \vdots \\
+1 & W & W & \cdots & W & W \\
+1 & W & W & \cdots & W & W
+\end{array}
+\right)
+\left(
+\begin{array}{c}
+g_0 \\
+g_1 \\
+g_2 \\
+\vdots \\
+g_{N-2} \\
+g_{N-1}
+\end{array}
+\right)
+\end{equation}
+%
+we see that the outer elements of the DFT matrix are all unity.
We can
+remove these trivial multiplications but we will still be left with an
+$(N-1) \times (N-1)$ sub-matrix of complex entries, which would appear
+to require $(N-1)^2$ complex multiplications. Singleton's method
+uses symmetries of the DFT matrix to convert the complex
+multiplications to an equivalent number of real multiplications. We
+start with the definition of the DFT in component form,
+%
+\begin{equation}
+a_k + i b_k = \sum_{j=0}^{f-1} (x_j+iy_j)(\cos(2\pi jk/f) - i\sin(2\pi jk/f))
+\end{equation}
+%
+The zeroth component can be computed using only additions,
+%
+\begin{equation}
+a_0 + i b_0 = \sum_{j=0}^{f-1} (x_j + i y_j)
+\end{equation}
+%
+We can rewrite the remaining components as,
+%
+\begin{eqnarray}
+a_k + i b_k &=& x_0 + i y_0
+  + \sum_{j=1}^{(f-1)/2} \left( (x_j + x_{f-j}) \cos(2\pi jk/f)
+  + (y_j - y_{f-j}) \sin(2\pi jk/f) \right) \nonumber \\
+& & {} + i \sum_{j=1}^{(f-1)/2} \left( (y_j + y_{f-j}) \cos(2\pi jk/f)
+  - (x_j - x_{f-j}) \sin(2\pi jk/f) \right)
+\end{eqnarray}
+%
+by using the following trigonometric identities,
+%
+\begin{eqnarray}
+  \cos(2\pi(f-j)k/f) &=& \phantom{-}\cos(2\pi jk/f) \\
+  \sin(2\pi(f-j)k/f) &=& -\sin(2\pi jk/f)
+\end{eqnarray}
+%
+These remaining components can all be computed using four partial
+sums,
+%
+\begin{eqnarray}
+a_k + i b_k & = & (a^+_k - a^-_k) + i (b^+_k + b^-_k) \\
+a_{f-k} + i b_{f-k} & = & (a^+_k + a^-_k) + i (b^+_k - b^-_k)
+\end{eqnarray}
+%
+for $k = 1, 2, \dots, (f-1)/2$, where,
+%
+\begin{eqnarray}
+a^+_k &=& x_0 + \sum_{j=1}^{(f-1)/2} (x_j + x_{f-j}) \cos(2\pi jk/f) \\
+a^-_k &=& \phantom{x_0} - \sum_{j=1}^{(f-1)/2} (y_j - y_{f-j}) \sin(2\pi jk/f) \\
+b^+_k &=& y_0 + \sum_{j=1}^{(f-1)/2} (y_j + y_{f-j}) \cos(2\pi jk/f) \\
+b^-_k &=& \phantom{y_0} - \sum_{j=1}^{(f-1)/2} (x_j - x_{f-j}) \sin(2\pi jk/f)
+\end{eqnarray}
+%
+Note that the higher components $k'=f-k$ can be obtained directly
+without further computation once $a^+$, $a^-$, $b^+$ and $b^-$ are
+known.
There are $4 \times (f-1)/2$ different sums, each involving
+$(f-1)/2$ real multiplications, giving a total of $(f-1)^2$ real
+multiplications instead of $(f-1)^2$ complex multiplications.
+
+To implement Singleton's method we make use of the input and output
+vectors $v$ and $v'$ as scratch space, copying data back and forth
+between them to obtain the final result. First we use $v'$ to store
+the terms of the symmetrized and anti-symmetrized vectors of the form
+$x_j + x_{f-j}$ and $x_j - x_{f-j}$. Then we multiply these by the
+appropriate trigonometric factors to compute the partial sums $a^+$,
+$a^-$, $b^+$ and $b^-$, storing the results $a_k + i b_k$ and $a_{f-k}
++ i b_{f-k}$ back in $v$. Finally we multiply the DFT output by any
+necessary twiddle factors and place the results in $v'$.
+
+\section{FFTs for real data}
+%
+This section is based on the articles {\em Fast Mixed-Radix Real
+  Fourier Transforms} by Clive Temperton~\cite{temperton83real} and
+{\em Real-Valued Fast Fourier Transform Algorithms} by Sorensen,
+Jones, Heideman and Burrus~\cite{burrus87real}. The DFT of a real
+sequence has a special symmetry, called a {\em conjugate-complex} or
+{\em half-complex} symmetry,
+%
+\begin{equation}
+h(a) = h(N-a)^*
+\end{equation}
+%
+The element $h(0)$ is real, and when $N$ is even $h(N/2)$ is also
+real. It is straightforward to prove the symmetry,
+%
+\begin{eqnarray}
+h(a) &=& \sum W^{ab}_N g(b) \\
+h(N-a)^* &=& \sum W^{-(N-a)b}_N g(b)^* \\
+ &=& \sum W^{-Nb}_N W^{ab}_N g(b) \qquad{(W^N_N=1)} \\
+ &=& \sum W^{ab}_N g(b)
+\end{eqnarray}
+%
+Real-valued data is very common in practice (perhaps more common than
+complex data) so it is worth having efficient FFT routines for real
+data. In principle an FFT for real data should need half the
+operations of an FFT on the equivalent complex data (where the
+imaginary parts are set to zero).
There are two different strategies
+for computing FFTs of real-valued data:
+
+One strategy is to ``pack'' the real data (of length $N$) into a
+complex array (of length $N/2$) by index transformations. A complex
+FFT routine can then be used to compute the transform of that array.
+By further index transformations the result can actually be
+``unpacked'' to the FFT of the original real data. It is also possible
+to do two real FFTs simultaneously by packing one in the real part and
+the other in the imaginary part of the complex array. These
+techniques have some disadvantages. The packing and unpacking
+procedures always add $O(N)$ operations, and packing a real array of
+length $N$ into a complex array of length $N/2$ is only possible if
+$N$ is even. In addition, if two unconnected datasets with very
+different magnitudes are packed together in the same FFT there could
+be ``cross-talk'' between them due to a loss of precision.
+
+A more straightforward strategy is to start with an FFT algorithm,
+such as the complex mixed-radix algorithm, and prune out all the
+operations involving the zero imaginary parts of the initial data. The
+FFT is linear so the imaginary part of the data can be decoupled from
+the real part. This procedure leads to a dedicated FFT for real-valued
+data which works for any length and does not perform any unnecessary
+operations. It also allows us to derive a corresponding inverse FFT
+routine which transforms a half-complex sequence back into real data.
+
+\subsection{Radix-2 FFTs for real data}
+%
+Before embarking on the full mixed-radix real FFT we'll start with the
+radix-2 case. It contains all the essential features of the
+general-$N$ algorithm. To make it easier to see the analogy between
+the two we will use the mixed-radix notation to describe the
+factors.
The factors are all 2,
%
\begin{equation}
f_1 = 2, f_2 = 2, \dots, f_{n_f} = 2
\end{equation}
%
and the products $p_i$ are powers of 2,
%
\begin{eqnarray}
p_0 & = & 1 \\
p_1 & = & f_1 = 2 \\
p_2 & = & f_1 f_2 = 4 \\
\dots &=& \dots \\
p_i & = & f_1 f_2 \dots f_i = 2^i
\end{eqnarray}
%
Using this notation we can rewrite the radix-2 decimation-in-time
algorithm as,
%
\begin{algorithm}
\STATE bit-reverse ordering of $g$
\FOR {$i = 1 \dots n$}
  \FOR {$a = 0 \dots p_{i-1} - 1$}
    \FOR{$b = 0 \dots q_i - 1$}
      \STATE{$
      \left(
      \begin{array}{c}
      g(b p_i + a) \\
      g(b p_i + p_{i-1} + a)
      \end{array}
      \right)
      =
      \left(
      \begin{array}{c}
      g(b p_i + a) + W^a_{p_i} g(b p_i + p_{i-1} + a) \\
      g(b p_i + a) - W^a_{p_i} g(b p_i + p_{i-1} + a)
      \end{array}
      \right) $}
    \ENDFOR
  \ENDFOR
\ENDFOR
\end{algorithm}
%
where we have used $p_i = 2 \Delta$, and factored $2 \Delta$ out of
the original definition of $b$ ($b \to b p_i$).

If we go back to the original recurrence relations we can see how to
write the intermediate results in a way which makes the
real/half-complex symmetries explicit at each step.
The first pass is
just a set of FFTs of length-2 on real values,
%
\begin{equation}
g_1([b_0 b_1 b_2 a_0]) = \sum_{b_3} W^{a_0 b_3}_2 g([b_0 b_1 b_2 b_3])
\end{equation}
%
Using the symmetry $FFT(x)_k = FFT(x)^*_{N-k}$ we have the reality
condition,
%
\begin{eqnarray}
g_1([b_0 b_1 b_2 0]) &=& \mbox{real} \\
g_1([b_0 b_1 b_2 1]) &=& \mbox{real'}
\end{eqnarray}
%
In the next pass we have a set of length-4 FFTs on the original data,
%
\begin{eqnarray}
g_2([b_0 b_1 a_1 a_0])
&=&
\sum_{b_2}\sum_{b_3}
W^{[a_1 a_0]b_2}_4 W^{a_0 b_3}_2
g([b_0 b_1 b_2 b_3]) \\
&=&
\sum_{b_2}\sum_{b_3}
W^{[a_1 a_0][b_3 b_2]}_4
g([b_0 b_1 b_2 b_3])
\end{eqnarray}
%
This time symmetry gives us the following conditions on the
transformed data,
%
\begin{eqnarray}
g_2([b_0 b_1 0 0]) &=& \mbox{real} \\
g_2([b_0 b_1 0 1]) &=& x + i y \\
g_2([b_0 b_1 1 0]) &=& \mbox{real'} \\
g_2([b_0 b_1 1 1]) &=& x - i y
\end{eqnarray}
%
We can see a pattern emerging here: the $i$-th pass computes a set of
independent length-$2^i$ FFTs on the original real data,
%
\begin{eqnarray}
g_i ( b p_i + a ) = \sum_{a' = 0}^{p_i-1} W_{p_i}^{aa'} g(b p_i + a')
\quad
\mbox{for $b = 0 \dots q_i - 1$}
\end{eqnarray}
%
As a consequence we can apply the symmetry for an FFT of real data
to all the intermediate results -- not just the final result.
In general after the $i$-th pass we will
have the symmetry,
%
\begin{eqnarray}
g_i(b p_i) &=& \mbox{real} \\
g_i(b p_i + a) &=& g_i(b p_i + p_i - a)^* \qquad a = 1 \dots p_{i}/2 - 1 \\
g_i(b p_i + p_{i}/2) &=& \mbox{real'}
\end{eqnarray}
%
In the next section we'll show that this is a general property of
decimation-in-time algorithms. The same is not true for the
decimation-in-frequency algorithm, which does not have any simple
symmetries in the intermediate results.
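This intermediate symmetry can be confirmed numerically. The following Python sketch (illustrative; it is not the GSL code) runs the radix-2 decimation-in-time iteration and checks the half-complex condition after every pass:

```python
import cmath

def dit_passes(g):
    # Radix-2 decimation-in-time FFT, yielding the array after each pass i.
    g = list(g)
    N = len(g)
    n = N.bit_length() - 1
    for j in range(N):                       # bit-reverse ordering of g
        r = int(format(j, '0{}b'.format(n))[::-1], 2)
        if r > j:
            g[j], g[r] = g[r], g[j]
    for i in range(1, n + 1):
        p, p1, q = 2 ** i, 2 ** (i - 1), 2 ** (n - i)   # p_i, p_{i-1}, q_i
        for a in range(p1):
            w = cmath.exp(-2j * cmath.pi * a / p)       # W_{p_i}^a
            for b in range(q):
                t0, t1 = g[b * p + a], w * g[b * p + p1 + a]
                g[b * p + a], g[b * p + p1 + a] = t0 + t1, t0 - t1
        yield p, q, list(g)

x = [1.0, -2.0, 0.5, 3.0, -1.5, 2.0, 0.0, 4.0]          # real data, N = 8
for p, q, g in dit_passes(x):
    for b in range(q):
        assert abs(g[b * p].imag) < 1e-12               # g_i(b p_i) is real
        for a in range(1, p):
            # g_i(b p_i + a) = g_i(b p_i + p_i - a)^* on every pass
            assert abs(g[b * p + a] - g[b * p + p - a].conjugate()) < 1e-12
```

The final pass ($p = N$) is the complete FFT of $x$, so the half-complex symmetry of the result is just the special case $i = n$.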
+ +Since we can obtain the values of $g_i(b p_i + a)$ for $a > p_i/2$ +from the values for $a < p_i/2$ we can cut our computation and +storage in half compared with the full-complex case. +% +We can easily rewrite the algorithm to show how the computation can be +halved, simply by limiting all terms to involve only values for $a +\leq p_{i}/2$. Whenever we encounter a term $g_i(b p_i + a)$ with $a > +p_{i}/2$ we rewrite it in terms of its complex symmetry partner, +$g_i(b p_i + a')^*$, where $a' = p_i - a$. The butterfly computes two +values for each value of $a$, $b p_i + a$ and $b p_i + p_{i-1} - a$, +so we actually only need to compute from $a = 0$ to $p_{i-1}/2$. This +gives the following algorithm, +% +\begin{algorithm} +\FOR {$a = 0 \dots p_{i-1}/2$} + \FOR{$b = 0 \dots q_i - 1$} + \STATE{$ + \left( + \begin{array}{c} + g(b p_i + a) \\ + g(b p_i + p_{i-1} - a)^* + \end{array} + \right) + = + \left( + \begin{array}{c} + g(b p_{i} + a) + W^a_{p_i} g(b p_i + p_{i-1} + a) \\ + g(b p_{i} + a) - W^a_{p_i} g(b p_i + p_{i-1} + a) + \end{array} + \right) $} + \ENDFOR + \ENDFOR +\end{algorithm} +% +Although we have halved the number of operations we also need a +storage arrangement which will halve the memory requirement. The +algorithm above is still formulated in terms of a complex array $g$, +but the input to our routine will naturally be an array of $N$ real +values which we want to use in-place. + +Therefore we need a storage scheme which lays out the real and +imaginary parts within the real array, in a natural way so that there +is no need for complicated index calculations. In the radix-2 +algorithm we do not have any additional scratch space. The storage +scheme has to be designed to accommodate the in-place calculation +taking account of dual node pairs. + +Here is a scheme which takes these restrictions into account: On the +$i$-th pass we store the real part of $g(b p_i + a)$ in location $b +p_i + a$. 
We store the imaginary part in location $b p_i + p_i -
a$. This is the redundant location which corresponds to the conjugate
term $g(b p_i + a)^* = g(b p_i + p_i -a)$, so it is not needed. When
the results are purely real (as in the cases $a = 0$ and $a = p_i/2$)
we store only the real part and drop the zero imaginary part.

This storage scheme has to work in-place, because the radix-2 routines
should not use any scratch space. We will now check the in-place
property for each butterfly. A crucial point is that the scheme is
pass-dependent. Namely, when we are computing the result for pass $i$
we are reading the results of pass $i-1$, and we must access them
using the scheme from the previous pass, i.e. we have to remember that
the results from the previous pass were stored using $b p_{i-1} + a$,
not $b p_i + a$, and the symmetry for these results will be $g_{i-1}(b
p_{i-1} + a) = g_{i-1}(b p_{i-1} + p_{i-1} - a)^*$. To take this into
account we'll write the right hand side of the iteration, $g_{i-1}$,
in terms of $p_{i-1}$. For example, instead of $b p_i$, which occurs
naturally in $g_i(b p_i + a)$ we will use $2 b p_{i-1}$, since $p_i =
2 p_{i-1}$.

Let's start with the butterfly for $a = 0$,
%
\begin{equation}
\left(
\begin{array}{c}
g(b p_i) \\
g(b p_i + p_{i-1})^*
\end{array}
\right)
=
\left(
\begin{array}{c}
g(2 b p_{i-1}) + g((2 b + 1) p_{i-1}) \\
g(2 b p_{i-1}) - g((2 b + 1) p_{i-1})
\end{array}
\right)
\end{equation}
%
By the symmetry $g_{i-1}(b p_{i-1} + a) = g_{i-1}(b p_{i-1} + p_{i-1}
- a)^*$ all the inputs are purely real. The input $g(2 b p_{i-1})$ is
read from location $2 b p_{i-1}$ and $g((2 b + 1) p_{i-1})$ is read
from the location $(2 b + 1) p_{i-1}$.
Here is the full breakdown,
%
\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabular}{|l|lll|}
\hline Term & & Location & \\
\hline
$g(2 b p_{i-1})$
 & real part & $2 b p_{i-1} $ &$= b p_i$ \\
 & imag part & --- & \\
$g((2 b+1) p_{i-1})$
 & real part & $(2 b + 1) p_{i-1} $&$= b p_i + p_{i-1} $ \\
 & imag part & --- & \\
\hline
$g(b p_{i})$
 & real part & $b p_i$ &\\
 & imag part & --- & \\
$g(b p_{i} + p_{i-1})$
 & real part & $b p_i + p_{i-1}$& \\
 & imag part & --- & \\
\hline
\end{tabular}
\end{center}
%
The conjugation of the output term $g(b p_i + p_{i-1})^*$ is
irrelevant here since the results are purely real. The real results
are stored in locations $b p_i$ and $b p_i + p_{i-1}$, which
overwrites the inputs in-place.

The general butterfly for $a = 1 \dots p_{i-1}/2 - 1$ is,
%
\begin{equation}
\left(
\begin{array}{c}
g(b p_i + a) \\
g(b p_i + p_{i-1} - a)^*
\end{array}
\right)
=
\left(
\begin{array}{c}
g(2 b p_{i-1} + a) + W^a_{p_i} g((2 b + 1) p_{i-1} + a) \\
g(2 b p_{i-1} + a) - W^a_{p_i} g((2 b + 1) p_{i-1} + a)
\end{array}
\right)
\end{equation}
%
All the terms are complex. To store a conjugated term like $g(b' p_i +
a')^*$ where $a > p_i/2$ (so that $a' = p_i - a < p_i/2$) we take the
real part and store it in
location $b' p_i + a'$ and then take the imaginary part, negate it, and
store the result in location $b' p_i + p_i - a'$.
+ +Here is the full breakdown of the inputs and outputs from the +butterfly, +% +\begin{center} +\renewcommand{\arraystretch}{1.5} +\begin{tabular}{|l|lll|} +\hline Term & & Location & \\ +\hline +$g(2 b p_{i-1} + a)$ + & real part & $2 b p_{i-1} + a $ &$= b p_i + a$ \\ + & imag part & $2 b p_{i-1} + p_{i-1} - a$ &$= b p_i + p_{i-1} - a$ \\ +$g((2 b+1) p_{i-1} + a)$ + & real part & $(2 b+1) p_{i-1} + a $&$= b p_i + p_{i-1} + a $ \\ + & imag part & $(2 b+1) p_{i-1} + p_{i-1} - a $&$= b p_i + p_i - a$\\ +\hline +$g(b p_{i} + a)$ + & real part & $b p_i + a$ &\\ + & imag part & $b p_i + p_i - a$& \\ +$g(b p_{i} + p_{i-1} - a)$ + & real part & $b p_i + p_{i-1} - a$& \\ + & imag part & $b p_i + p_{i-1} + a$& \\ +\hline +\end{tabular} +\end{center} +% +By comparing the input locations and output locations we can see +that the calculation is done in place. + +The final butterfly for $a = p_{i-1}/2$ is, +% +\begin{equation} +\left( +\begin{array}{c} +g(b p_i + p_{i-1}/2) \\ +g(b p_i + p_{i-1} - p_{i-1}/2)^* +\end{array} +\right) += +\left( +\begin{array}{c} +g(2 b p_{i-1} + p_{i-1}/2) - i g((2 b + 1) p_{i-1} + p_{i-1}/2) \\ +g(2 b p_{i-1} + p_{i-1}/2) + i g((2 b + 1) p_{i-1} + p_{i-1}/2) +\end{array} +\right) +\end{equation} +% +where we have substituted for the twiddle factor, $W^a_{p_i} = -i$, +% +\begin{eqnarray} +W^{p_{i-1}/2}_{p_i} &=& \exp(-2\pi i p_{i-1}/2 p_i) \\ + &=& \exp(-2\pi i /4) \\ + &=& -i +\end{eqnarray} +% +For this butterfly the second line is just the conjugate of the first, +because $p_{i-1} - p_{i-1}/2 = p_{i-1}/2$. Therefore we only need to +consider the first line. 
The breakdown of the inputs and outputs is,
%
\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabular}{|l|lll|}
\hline Term & & Location & \\
\hline
$g(2 b p_{i-1} + p_{i-1}/2)$
 & real part & $2 b p_{i-1} + p_{i-1}/2 $ &$= b p_i + p_{i-1}/2$ \\
 & imag part & --- & \\
$g((2 b + 1) p_{i-1} + p_{i-1}/2)$
 & real part & $(2 b + 1) p_{i-1} + p_{i-1}/2 $&$= b p_i + p_{i} - p_{i-1}/2 $ \\
 & imag part & --- & \\
\hline
$g(b p_{i} + p_{i-1}/2)$
 & real part & $b p_i + p_{i-1}/2$ &\\
 & imag part & $b p_i + p_i - p_{i-1}/2$& \\
\hline
\end{tabular}
\end{center}
%
By comparing the locations of the inputs and outputs with the
operations in the butterfly we find that this computation is very
simple: the effect of the butterfly is to negate the location $b p_i +
p_i - p_{i-1}/2$ and leave other locations unchanged. This is clearly
an in-place operation.

Here is the radix-2 algorithm for real data, in full, with the cases
of $a=0$, $a=1\dots p_{i-1}/2 - 1$ and $a = p_{i-1}/2$ in separate
blocks,
%
\begin{algorithm}
\STATE bit-reverse ordering of $g$
\FOR {$i = 1 \dots n$}
  \FOR{$b = 0 \dots q_i - 1$}
    \STATE{$\left(
      \begin{array}{c}
      g(b p_i) \\
      g(b p_i + p_{i-1})
      \end{array}
      \right)
      \Leftarrow
      \left(
      \begin{array}{c}
      g(b p_{i}) + g(b p_{i} + p_{i-1}) \\
      g(b p_{i}) - g(b p_{i} + p_{i-1})
      \end{array}
      \right)$}
  \ENDFOR

  \FOR {$a = 1 \dots p_{i-1}/2 - 1$}
    \FOR{$b = 0 \dots q_i - 1$}
      \STATE{$(\Real z_0, \Imag z_0) \Leftarrow
              (g(b p_i + a), g(b p_i + p_{i-1} - a))$}
      \STATE{$(\Real z_1, \Imag z_1) \Leftarrow
              (g(b p_i + p_{i-1} + a), g(b p_i + p_{i} - a))$}
      \STATE{$t_0 \Leftarrow z_0 + W^a_{p_i} z_1$}
      \STATE{$t_1 \Leftarrow z_0 - W^a_{p_i} z_1$}
      \STATE{$(g(b p_{i} + a), g(b p_{i} + p_i - a)) \Leftarrow
              (\Real t_0, \Imag t_0)$}
      \STATE{$(g(b p_{i} + p_{i-1} - a), g(b p_{i} + p_{i-1} + a))
              \Leftarrow
              (\Real t_1, -\Imag t_1)$}
    \ENDFOR
  \ENDFOR

  \FOR{$b = 0 \dots q_i - 1$}
    \STATE{$g(b
p_{i} + p_{i} - p_{i-1}/2) \Leftarrow -g(b p_{i} + p_{i} - p_{i-1}/2)$}
  \ENDFOR

\ENDFOR
\end{algorithm}
%
We split the loop over $a$ into three parts, $a=0$, $a=1\dots
p_{i-1}/2-1$ and $a = p_{i-1}/2$ for efficiency. When $a=0$ we have
$W^a_{p_i}=1$ so we can eliminate a complex multiplication within the
loop over $b$. When $a=p_{i-1}/2$ we have $W^a_{p_i} = -i$ which does
not require a full complex multiplication either.


\subsubsection{Calculating the Inverse}
%
The inverse FFT of complex data was easy to calculate, simply by
taking the complex conjugate of the DFT matrix. The input data and
output data were both complex and did not have any special
symmetry. For real data the inverse FFT is more complicated because
the half-complex symmetry of the transformed data is
different from the purely real input data.

We can compute an inverse by stepping backwards through the forward
transform. To simplify the inversion it's convenient to write the
forward algorithm with the butterfly in matrix form,
%
\begin{algorithm}
\FOR {$i = 1 \dots n$}
  \FOR {$a = 0 \dots p_{i-1}/2$}
    \FOR{$b = 0 \dots q_i - 1$}
      \STATE{$
      \left(
      \begin{array}{c}
      g(b p_i + a) \\
      g(b p_i + p_{i-1} + a)
      \end{array}
      \right)
      =
      \left(
      \begin{array}{cc}
      1 & W^a_{p_{i}} \\
      1 & -W^a_{p_{i}}
      \end{array}
      \right)
      \left(
      \begin{array}{c}
      g(2 b p_{i-1} + a) \\
      g((2 b + 1) p_{i-1} + a)
      \end{array}
      \right) $}
    \ENDFOR
  \ENDFOR
\ENDFOR
\end{algorithm}
%
To invert the algorithm we run the iterations backwards and invert the
matrix multiplication in the innermost loop,
%
\begin{algorithm}
\FOR {$i = n \dots 1$}
  \FOR {$a = 0 \dots p_{i-1}/2$}
    \FOR{$b = 0 \dots q_i - 1$}
      \STATE{$
      \left(
      \begin{array}{c}
      g(2 b p_{i-1} + a) \\
      g((2 b + 1) p_{i-1} + a)
      \end{array}
      \right)
      =
      \left(
      \begin{array}{cc}
      1 & W^a_{p_{i}} \\
      1 & -W^a_{p_{i}}
      \end{array}
      \right)^{-1}
      \left(
      \begin{array}{c}
      g(b p_i + a) \\
      g(b p_i + p_{i-1}
+ a)
      \end{array}
      \right) $}
    \ENDFOR
  \ENDFOR
\ENDFOR
\end{algorithm}
%
There is no need to reverse the loops over $a$ and $b$ because the
result is independent of their order. The inverse of the matrix that
appears is,
%
\begin{equation}
\left(
\begin{array}{cc}
1 & W^a_{p_{i}} \\
1 & -W^a_{p_{i}}
\end{array}
\right)^{-1}
=
{1 \over 2}
\left(
\begin{array}{cc}
1 & 1 \\
W^{-a}_{p_{i}} & -W^{-a}_{p_{i}}
\end{array}
\right)
\end{equation}
%
To save divisions we remove the factor of $1/2$ inside the loop. This
computes the unnormalized inverse, and the normalized inverse can be
retrieved by dividing the final result by $N = 2^n$.

Here is the radix-2 half-complex to real inverse FFT algorithm, taking
into account the radix-2 storage scheme,
%
\begin{algorithm}
\FOR {$i = n \dots 1$}
  \FOR{$b = 0 \dots q_i - 1$}
    \STATE{$\left(
      \begin{array}{c}
      g(b p_i) \\
      g(b p_i + p_{i-1})
      \end{array}
      \right)
      \Leftarrow
      \left(
      \begin{array}{c}
      g(b p_{i}) + g(b p_{i} + p_{i-1}) \\
      g(b p_{i}) - g(b p_{i} + p_{i-1})
      \end{array}
      \right)$}
  \ENDFOR

  \FOR {$a = 1 \dots p_{i-1}/2 - 1$}
    \FOR{$b = 0 \dots q_i - 1$}
      \STATE{$(\Real z_0, \Imag z_0)
              \Leftarrow
              (g(b p_i + a), g(b p_i + p_{i} - a))$}
      \STATE{$(\Real z_1, \Imag z_1)
              \Leftarrow
              (g(b p_i + p_{i-1} - a), -g(b p_i + p_{i-1} + a))$}
      \STATE{$t_0 \Leftarrow z_0 + z_1$}
      \STATE{$t_1 \Leftarrow z_0 - z_1$}
      \STATE{$(g(b p_{i} + a), g(b p_{i} + p_{i-1} - a))
              \Leftarrow
              (\Real t_0, \Imag t_0) $}
      \STATE{$(g(b p_{i} + p_{i-1} + a),g(b p_{i} + p_{i} - a))
              \Leftarrow
              (\Real(W^{-a}_{p_i}t_1), \Imag(W^{-a}_{p_i}t_1))$}
    \ENDFOR
  \ENDFOR

  \FOR{$b = 0 \dots q_i - 1$}
    \STATE{$g(b p_{i} + p_{i-1}/2) \Leftarrow 2 g(b p_{i} + p_{i-1}/2)$}
    \STATE{$g(b p_{i} + p_{i-1} + p_{i-1}/2) \Leftarrow -2 g(b p_{i} + p_{i-1} + p_{i-1}/2)$}
  \ENDFOR

\ENDFOR
\STATE bit-reverse ordering of $g$
\end{algorithm}



\subsection{Mixed-Radix FFTs for real data}
%
As
discussed earlier the radix-2 decimation-in-time algorithm had the +special property that its intermediate passes are interleaved fourier +transforms of the original data, and this generalizes to the +mixed-radix algorithm. The complex mixed-radix algorithm that we +derived earlier was a decimation-in-frequency algorithm, but we can +obtain a decimation-in-time version by taking the transpose of the +decimation-in-frequency DFT matrix like this, +% +\begin{eqnarray} +W_N &=& W_N^T \\ +&=& (T_{n_f} \dots T_2 T_1)^T \\ +&=& T_1^T T_2^T \dots T_{n_f}^T +\end{eqnarray} +% +with, +% +\begin{eqnarray} +T_i^T &=& \left( (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) + (W_{f_i} \otimes I_{m_i}) \right)^T \\ + &=& (W_{f_i} \otimes I_{m_i}) + ( D^{f_i}_{q_i} (P^{f_i}_{q_i})^T \otimes I_{p_{i-1}}). +\end{eqnarray} +% +We have used the fact that $W$, $D$ and $I$ are symmetric and that the +permutation matrix $P$ obeys, +% +\begin{equation} +(P^a_b)^T = P^b_a. +\end{equation} +% +From the definitions of $D$ and $P$ we can derive the following identity, +% +\begin{equation} +D^a_b P^b_a = P^b_a D^b_a. +\end{equation} +% +This allows us to put the transpose into a simple form, +% +\begin{equation} +T_i^T = (W_{f_i} \otimes I_{m_i}) + (P^{q_i}_{f_i} D^{q_i}_{f_i} \otimes I_{p_{i-1}}). +\end{equation} +% +The transposed matrix, $T^T$ applies the digit-reversal $P$ before the +DFT $W$, giving the required decimation-in-time algorithm. The +transpose reverses the order of the factors --- $T_{n_f}$ is applied +first and $T_1$ is applied last. For convenience we can reverse the +order of the factors, $f_1 \leftrightarrow f_{n_f}$, $f_2 +\leftrightarrow f_{n_f-1}$, \dots and make the corresponding +substitution $p_{i-1} \leftrightarrow q_i$. 
These substitutions give
us a decimation-in-time algorithm with the same ordering as the
decimation-in-frequency algorithm,
%
\begin{equation}
W_N = T_{n_f} \dots T_2 T_1
\end{equation}
%
\begin{equation}
T_i = (W_{f_i} \otimes I_{m_i})
      (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})
\end{equation}
%
where $p_i$, $q_i$ and $m_i$ now have the same meanings as before,
namely,
%
\begin{eqnarray}
p_i &=& f_1 f_2 \dots f_i \quad (p_0 = 1) \\
q_i &=& N / p_i \\
m_i &=& N / f_i
\end{eqnarray}
%
Now we would like to prove that the iteration for computing $x = W z =
T_{n_f} \dots T_2 T_1 z$ has the special interleaving property. If we
write the result of each intermediate pass as
$v^{(i)}$,
%
\begin{eqnarray}
v^{(0)} &=& z \\
v^{(1)} &=& T_1 v^{(0)} \\
v^{(2)} &=& T_2 v^{(1)} \\
\dots &=& \dots \\
v^{(i)} &=& T_i v^{(i-1)}
\end{eqnarray}
%
then we will show that the intermediate results $v^{(i)}$ on any pass
can be written as,
%
\begin{equation}
v^{(i)} = (W_{p_i} \otimes I_{q_i}) z
\end{equation}
%
Each intermediate stage will be a set of $q_i$ interleaved fourier
transforms, each of length $p_i$. We can prove this result by
induction. First we assume that the result is true for $v^{(i-1)}$,
%
\begin{equation}
v^{(i-1)} = (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \qquad \mbox{(assumption)}
\end{equation}
%
And then we examine the next iteration using this assumption,
%
\begin{eqnarray}
v^{(i)} &=& T_i v^{(i-1)} \\
        &=& T_i (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \\
        &=& (W_{f_i} \otimes I_{m_i})
            (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})
            (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \label{dit-induction}
\end{eqnarray}
%
Using the relation $m_i = p_{i-1} q_i$, we can write $I_{m_i}$ as
$I_{p_{i-1} q_i}$ and $I_{q_{i-1}}$ as $I_{f_i q_i}$.
By combining these +with the basic matrix identity, +% +\begin{equation} +I_{ab} = I_a \otimes I_b +\end{equation} +% +we can rewrite $v^{(i)}$ in the following form, +% +\begin{equation} +v^{(i)} = (((W_{f_i} \otimes I_{p_{i-1}}) + (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i}) + (W_{p_{i-1}} \otimes I_{f_i})) \otimes I_{q_i}) z . +\end{equation} +% +The first part of this matrix product is the two-factor expansion of +$W_{ab}$, for $a = p_{i-1}$ and $b = f_i$, +% +\begin{equation} +W_{p_{i-1} f_i} = ((W_{f_i} \otimes I_{p_{i-1}}) + (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i}) + (W_{p_{i-1}} \otimes I_{f_i})). +\end{equation} +% +If we substitute this result, remembering that $p_i = p_{i-1} f_i$, we +obtain, +% +\begin{equation} +v^{(i)} = (W_{p_i} \otimes I_{q_i}) z +\end{equation} +% +which is the desired result. The case $i=1$ can be verified +explicitly, and induction then shows that the result is true for all +$i$. As discussed for the radix-2 algorithm this result is important +because if the initial data $z$ is real then each intermediate pass is +a set of interleaved fourier transforms of $z$, having half-complex +symmetries (appropriately applied in the subspaces of the Kronecker +product). Consequently only $N$ real numbers are needed to store the +intermediate and final results. + +\subsection{Implementation} +% +The implementation of the mixed-radix real FFT algorithm follows the +same principles as the full complex transform. Some of the steps are +applied in the opposite order because we are dealing with a decimation +in time algorithm instead of a decimation in frequency algorithm, but +the basic outer structure of the algorithm is the same. 
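The two-factor expansion of $W_{ab}$ used in the induction step above can be checked numerically before any code is written around it. The following Python sketch (illustrative; GSL itself is written in C) builds $P^a_b$ and $D^a_b$ directly from their index definitions and verifies the identity for $a = 2$, $b = 3$:

```python
import cmath

def W(n):
    # DFT matrix, W_n[j][k] = exp(-2 pi i j k / n)
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)]
            for j in range(n)]

def eye(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def kron(A, B):
    # Kronecker product A (x) B
    rb, cb = len(B), len(B[0])
    return [[A[i // rb][j // cb] * B[i % rb][j % cb]
             for j in range(len(A[0]) * cb)] for i in range(len(A) * rb)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

a, b = 2, 3          # a plays the role of p_{i-1}, b the role of f_i
N = a * b

# P^a_b sends component mu*b + lambda to lambda*a + mu (digit reversal);
# D^a_b is diagonal with the twiddle factors omega_N^{mu*lambda}
P = [[0.0] * N for _ in range(N)]
D = [[0.0] * N for _ in range(N)]
for mu in range(a):
    for lam in range(b):
        P[lam * a + mu][mu * b + lam] = 1.0
        D[mu * b + lam][mu * b + lam] = cmath.exp(-2j * cmath.pi * mu * lam / N)

lhs = W(N)
rhs = matmul(kron(W(b), eye(a)), matmul(matmul(P, D), kron(W(a), eye(b))))
err = max(abs(lhs[i][j] - rhs[i][j]) for i in range(N) for j in range(N))
assert err < 1e-12   # W_{ab} = (W_b (x) I_a)(P^a_b D^a_b)(W_a (x) I_b)
```

The same check passes for any coprime or non-coprime pair $(a, b)$, since the expansion does not rely on a prime-factor decomposition.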
We want to +apply the factorized version of the DFT matrix $W_N$ to the input +vector $z$, +% +\begin{eqnarray} +x &=& W_N z \\ + &=& T_{n_f} \dots T_2 T_1 z +\end{eqnarray} +% +We loop over the $n_f$ factors, applying each matrix $T_i$ to the +vector in turn to build up the complete transform, +% +\begin{algorithm} +\FOR{$(i = 1 \dots n_f)$} + \STATE{$v \Leftarrow T_i v $} +\ENDFOR +\end{algorithm} +% +In this case the definition of $T_i$ is different because we have +taken the transpose, +% +\begin{equation} +T_i = + (W_{f_i} \otimes I_{m_i}) + (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i}). +\end{equation} +% +We'll define a temporary vector $t$ to denote the results of applying the +rightmost matrix, +% +\begin{equation} +t = (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i}) v +\end{equation} +% +If we expand this out into individual components, as before, we find a +similar simplification, +% +\begin{eqnarray} +t_{aq+b} +&=& +\sum_{a'b'} +(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})_{(aq+b)(a'q+b')} +v_{a'q+b'} \\ +&=& +\sum_{a'} +(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})_{a a'} +v_{a'q+b} +\end{eqnarray} +% +We have factorized the indices into the form $aq+b$, with $0 \leq a < +p_{i}$ and $0 \leq b < q$. Just as in the decimation in frequency +algorithm we can split the index $a$ to remove the matrix +multiplication completely. 
We have to write $a$ as $\mu f + \lambda$,
where $0 \leq \mu < p_{i-1}$ and $0 \leq \lambda < f$,
%
\begin{eqnarray}
t_{(\mu f + \lambda)q+b}
&=&
\sum_{\mu'\lambda'}
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})_{(\mu f + \lambda)(\mu' f + \lambda')}
v_{(\mu' f + \lambda')q+b}
\\
&=&
\sum_{\mu'\lambda'}
(P^{p_{i-1}}_{f_i})_{(\mu f + \lambda)(\mu' f + \lambda')}
\omega^{\mu'\lambda'}_{p_{i}}
v_{(\mu' f + \lambda')q+b}
\end{eqnarray}
%
The matrix $P^{p_{i-1}}_{f_i}$ exchanges an index of $(\mu f +
\lambda) q + b$ with $(\lambda p_{i-1} + \mu) q + b$, giving a final
result of,
%
\begin{eqnarray}
t_{(\lambda p_{i-1} + \mu) q + b}
=
\omega^{\mu\lambda}_{p_i} v_{(\mu f + \lambda)q +b}
\end{eqnarray}
%
To calculate the next stage,
%
\begin{equation}
v' = (W_{f_i} \otimes I_{m_i}) t,
\end{equation}
%
we temporarily rearrange the index of $t$ to separate the $m_{i}$
independent DFTs in the Kronecker product,
%
\begin{equation}
v'_{(\lambda p_{i-1} + \mu) q + b}
=
\sum_{\lambda' \mu' b'}
(W_{f_i} \otimes I_{m_i})_{
((\lambda p_{i-1} + \mu) q + b)
((\lambda' p_{i-1} + \mu') q + b')}
t_{(\lambda' p_{i-1} + \mu') q + b'}
\end{equation}
%
If we use the identity $m = p_{i-1} q$ to rewrite the index of $t$
like this,
%
\begin{equation}
t_{(\lambda p_{i-1} + \mu) q + b} = t_{\lambda m + (\mu q + b)}
\end{equation}
%
we can split the Kronecker product,
%
\begin{eqnarray}
v'_{(\lambda p_{i-1} + \mu) q + b}
&=&
\sum_{\lambda' \mu' b'}
(W_{f_i} \otimes I_{m_i})_{
((\lambda p_{i-1} + \mu) q + b)
((\lambda' p_{i-1} + \mu') q + b')}
t_{(\lambda' p_{i-1} + \mu') q + b'}\\
&=&
\sum_{\lambda'}
(W_{f_i})_{\lambda \lambda'}
t_{\lambda' m_i + (\mu q + b)}
\end{eqnarray}
%
If we switch back to the original form of the index in the last line we obtain,
%
\begin{eqnarray}
\phantom{v'_{(\lambda p_{i-1} + \mu) q + b}}
&=&
\sum_{\lambda'}
(W_{f_i})_{\lambda \lambda'}
t_{(\lambda' p_{i-1} + \mu) q + b}
\end{eqnarray}
%
which allows us to
substitute our previous result for $t$,
%
\begin{equation}
v'_{(\lambda p_{i-1} + \mu) q + b}
=
\sum_{\lambda'}
(W_{f_i})_{\lambda \lambda'}
\omega^{\mu\lambda'}_{p_i} v_{(\mu f + \lambda')q + b}
\end{equation}
%
This gives us the basic decimation-in-time mixed-radix algorithm for
complex data which we will be able to specialize to real data,
%
\begin{algorithm}
\FOR{$i = 1 \dots n_f$}
\FOR{$\mu = 0 \dots p_{i-1} - 1$}
\FOR{$b = 0 \dots q-1$}
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$t_\lambda \Leftarrow
  \omega^{\mu\lambda}_{p_{i}} v_{(\mu f + \lambda)q + b}$}
\ENDFOR
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$v'_{(\lambda p_{i-1} + \mu)q + b} =
  \sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') t_{\lambda'}$}
  \COMMENT{DFT matrix-multiply module}
\ENDFOR
\ENDFOR
\ENDFOR
\STATE{$v \Leftarrow v'$}
\ENDFOR
\end{algorithm}
%
We are now at the point where we can convert an algorithm formulated
in terms of complex variables to one in terms of real variables by
choosing a suitable storage scheme. We will adopt the FFTPACK storage
convention. FFTPACK uses a scheme where individual FFTs are
contiguous, not interleaved, and real-imaginary pairs are stored in
neighboring locations. This has better locality than was possible for
the radix-2 case.

The interleaving of the intermediate FFTs results from the Kronecker
product, $W_p \otimes I_q$. The FFTs can be made contiguous if we
reorder the Kronecker product on the intermediate passes,
%
\begin{equation}
W_p \otimes I_q \Rightarrow I_q \otimes W_p
\end{equation}
%
This can be implemented by a simple change in indexing. On pass-$i$
we store element $v_{a q_i + b}$ in location $v_{b p_i+a}$. We
compensate for this change by making the same transposition when
reading the data. Note that this only affects the indices of the
intermediate passes. On the zeroth iteration the transposition has no
effect because $p_0 = 1$.
Similarly there is no effect on the last
iteration, which has $q_{n_f} = 1$. This is how the algorithm above
looks after this index transformation,
%
\begin{algorithm}
\FOR{$i = 1 \dots n_f$}
\FOR{$\mu = 0 \dots p_{i-1} - 1$}
\FOR{$b = 0 \dots q-1$}
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$t_\lambda \Leftarrow
  \omega^{\mu\lambda}_{p_{i}} v_{(\lambda q + b)p_{i-1} + \mu}$}
\ENDFOR
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$v'_{b p + (\lambda p_{i-1} + \mu)} =
  \sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') t_{\lambda'}$}
  \COMMENT{DFT matrix-multiply module}
\ENDFOR
\ENDFOR
\ENDFOR
\STATE{$v \Leftarrow v'$}
\ENDFOR
\end{algorithm}
%
We transpose the input terms by writing the index in the form $a
q_{i-1} + b$, to take account of the pass-dependence of the scheme,
%
\begin{equation}
v_{(\mu f + \lambda')q + b} = v_{\mu q_{i-1} + \lambda'q + b}
\end{equation}
%
We used the identity $q_{i-1} = f q$ to split the index. Note that in
this form $\lambda'q + b$ runs from 0 to $q_{i-1} - 1$ as $b$ runs
from 0 to $q-1$ and $\lambda'$ runs from 0 to $f-1$. The transposition
for the input terms then gives,
%
\begin{equation}
v_{\mu q_{i-1} + \lambda'q + b} \Rightarrow v_{(\lambda'q + b) p_{i-1} + \mu}
\end{equation}
%
In the FFTPACK scheme the intermediate output data have the same
half-complex symmetry as the radix-2 example, namely,
%
\begin{equation}
v^{(i)}_{b p + a} = v^{(i)*}_{b p + (p - a)}
\end{equation}
%
and the input data from the previous pass have the symmetry,
%
\begin{equation}
v^{(i-1)}_{(\lambda' q + b) p_{i-1} + \mu} = v^{(i-1)*}_{(\lambda' q +
b) p_{i-1} + (p_{i-1} - \mu)}
\end{equation}
%
Using these symmetries we can halve the storage and computation
requirements for each pass. Compared with the radix-2 algorithm we
have more freedom because the computation does not have to be done in
place.
The storage scheme adopted by FFTPACK places elements +sequentially with real and imaginary parts in neighboring +locations. Imaginary parts which are known to be zero are not +stored. Here are the full details of the scheme, +% +\begin{center} +\renewcommand{\arraystretch}{1.5} +\begin{tabular}{|l|lll|} +\hline Term & & Location & \\ +\hline +$g(b p_i)$ + & real part & $b p_{i} $ & \\ + & imag part & --- & \\ +\hline +$g(b p_i + a)$ + & real part & $b p_{i} + 2a - 1 $& for $a = 1 \dots p_i/2 - 1$ \\ + & imag part & $b p_{i} + 2a$ & \\ +\hline +$g(b p_{i} + p_{i}/2)$ + & real part & $b p_i + p_{i} - 1$ & if $p_i$ is even\\ + & imag part & --- & \\ +\hline +\end{tabular} +\end{center} +% +The real element for $a=0$ is stored in location $bp$. The real parts +for $a = 1 \dots p/2 - 1$ are stored in locations $bp + 2a -1$ and the +imaginary parts are stored in locations $b p + 2 a$. When $p$ is even +the term for $a = p/2$ is purely real and we store it in location $bp ++ p - 1$. The zero imaginary part is not stored. + +When we compute the basic iteration, +% +\begin{equation} +v^{(i)}_{b p + (\lambda p_{i-1} + \mu)} = \sum_{\lambda'} +W^{\lambda \lambda'}_f +\omega^{\mu\lambda'}_{p_i} v^{(i-1)}_{(\lambda' q + b)p_{i-1} + \mu} +\end{equation} +% +we eliminate the redundant conjugate terms with $a > p_{i}/2$ as we +did in the radix-2 algorithm. Whenever we need to store a term with $a +> p_{i}/2$ we consider instead the corresponding conjugate term with +$a' = p - a$. Similarly when reading data we replace terms with $\mu > +p_{i-1}/2$ with the corresponding conjugate term for $\mu' = p_{i-1} - +\mu$. + +Since the input data on each stage has half-complex symmetry we only +need to consider the range $\mu=0 \dots p_{i-1}/2$. We can consider +the best ways to implement the basic iteration for each pass, $\mu = 0 +\dots p_{i-1}/2$. 
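The layout can be illustrated for a single contiguous block. The following Python sketch (using a naive DFT on illustrative data; it is not the GSL implementation) packs the transform of a real sequence into $N$ real numbers in the order just described:

```python
import cmath

def real_dft_packed(x):
    # Pack the DFT of real data x into len(x) real numbers, FFTPACK-style:
    # h_0 alone, then adjacent (Re h_a, Im h_a) pairs for a = 1 .. N/2-1,
    # then h_{N/2} alone if N is even. Zero imaginary parts and the
    # conjugate half of the spectrum are never stored.
    N = len(x)
    h = [sum(x[b] * cmath.exp(-2j * cmath.pi * a * b / N) for b in range(N))
         for a in range(N // 2 + 1)]
    out = [h[0].real]
    for a in range(1, (N + 1) // 2):
        out += [h[a].real, h[a].imag]
    if N % 2 == 0:
        out.append(h[N // 2].real)
    return out

packed = real_dft_packed([1.0, 2.0, -1.0, 0.5, 3.0, -2.0])
assert len(packed) == 6      # N real numbers hold the whole transform
```

Note that the scheme works for both even and odd $N$; only the trailing purely real element depends on the parity.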

On the first pass where $\mu=0$ we will be accessing elements which
are the zeroth components of the independent transforms $W_{p_{i-1}}
\otimes I_{q_{i-1}}$, and are purely real.
%
We can code the pass with $\mu=0$ as a special case for real input
data, and conjugate-complex output. When $\mu=0$ the twiddle factors
$\omega^{\mu\lambda'}_{p_i}$ are all unity, giving a further saving.
We can obtain small-$N$ real-data DFT modules by removing the
redundant operations from the complex modules.
%
For example the $N=3$ module was,
%
\begin{equation}
\begin{array}{lll}
t_1 = z_1 + z_2, &
t_2 = z_0 - t_1/2, &
t_3 = \sin(\pi/3) (z_1 - z_2),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
x_0 = z_0 + t_1, &
x_1 = t_2 + i t_3, &
x_2 = t_2 - i t_3.
\end{array}
\end{equation}
%
In the complex case all the operations were complex, for complex input
data $z_0$, $z_1$, $z_2$. In the real case $z_0$, $z_1$ and $z_2$ are
all real. Consequently $t_1$, $t_2$ and $t_3$ are also real, and the
symmetry $x_1 = t_2 + i t_3 = x^*_2$ means that we do not have to
compute $x_2$ once we have computed $x_1$.

For subsequent passes $\mu = 1 \dots p_{i-1}/2 - 1$ the input data is
complex and we have to compute full complex DFTs using the same
modules as in the complex case. Note that the inputs are all of the
form $v_{(\lambda q + b) p_{i-1} + \mu}$ with $\mu < p_{i-1}/2$ so we
never need to use the symmetry to access the conjugate elements with
$\mu > p_{i-1}/2$.

If $p_{i-1}$ is even then we reach the halfway point $\mu=p_{i-1}/2$,
which is another special case. The input data in this case is purely
real because $\mu = p_{i-1} - \mu$ for $\mu = p_{i-1}/2$. We can code
this as a special case, using real inputs and real-data DFT modules as
we did for $\mu=0$.
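As a numerical check, here is the $N=3$ real-input module in Python (illustrative; note that the sign of $t_3$ depends on the transform direction, and the sign used here is the one required by the forward convention $W_N = \exp(-2\pi i/N)$):

```python
import cmath
import math

def real_dft3(z0, z1, z2):
    # N = 3 real-input DFT module: real inputs, half-complex output
    t1 = z1 + z2
    t2 = z0 - t1 / 2
    t3 = -math.sin(math.pi / 3) * (z1 - z2)   # forward-transform sign
    x0 = z0 + t1                              # purely real
    x1 = complex(t2, t3)
    return [complex(x0, 0.0), x1, x1.conjugate()]   # x2 = x1* comes for free

z = [0.7, -1.3, 2.1]
ref = [sum(z[b] * cmath.exp(-2j * cmath.pi * a * b / 3) for b in range(3))
       for a in range(3)]
assert all(abs(u - v) < 1e-12 for u, v in zip(real_dft3(*z), ref))
```

The module uses one real division-free half, one sine multiplication and four additions, against the full complex module's operation count.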
However, for $\mu = p_{i-1}/2$ the twiddle factors +are not unity, +% +\begin{eqnarray} +\omega^{\mu\lambda'}_{p_i} &=& \omega^{(p_{i-1}/2)\lambda'}_{p_i} \\ +&=& \exp(-i\pi\lambda'/f_i) +\end{eqnarray} +% +These twiddle factors introduce an additional phase which modifies the +symmetry of the outputs. Instead of the conjugate-complex symmetry +which applied for $\mu=0$ there is a shifted conjugate-complex +symmetry, +% +\begin{equation} +t_\lambda = t^*_{f-(\lambda+1)} +\end{equation} +% +This is easily proved, +% +\begin{eqnarray} +t_\lambda +&=& +\sum e^{-2\pi i \lambda\lambda'/f_i} e^{-i\pi \lambda'/f_i} r_{\lambda'} \\ +t_{f - (\lambda + 1)} +&=& +\sum e^{-2\pi i (f-\lambda-1)\lambda'/f_i} e^{-i\pi \lambda'/f_i} r_{\lambda'} \\ +&=& +\sum e^{2\pi i \lambda\lambda'/f_i} e^{i\pi \lambda'/f_i} r_{\lambda'} \\ +&=& t^*_\lambda +\end{eqnarray} +% +The symmetry of the output means that we only need to compute half of +the output terms, the remaining terms being conjugates or zero +imaginary parts. For example, when $f=4$ the outputs are $(x_0 + i +y_0, x_1 + i y_1, x_1 - i y_1, x_0 - i y_0)$. For $f=5$ the outputs +are $(x_0 + i y_0, x_1 + i y_1, x_2, x_1 - i y_1, x_0 - i y_0)$. By +combining the twiddle factors and DFT matrix we can make a combined +module which applies both at the same time. By starting from the +complex DFT modules and bringing in twiddle factors we can derive +optimized modules. Here are the modules given by Temperton for $z = W +\Omega x$ where $x$ is real and $z$ has the shifted conjugate-complex +symmetry. The matrix of twiddle factors, $\Omega$, is given by, +% +\begin{equation} +\Omega = \mathrm{diag}(1, e^{-i\pi/f}, e^{-2\pi i/f}, \dots, e^{-i\pi(f-1)/f}) +\end{equation} +% +We write $z$ in terms of two real vectors $z = a + i b$. +% +For $N=2$, +% +\begin{equation} +\begin{array}{ll} +a_0 = x_0, & +b_0 = - x_1. 
\end{array}
\end{equation}
%
For $N=3$,
%
\begin{equation}
\begin{array}{l}
t_1 = x_1 - x_2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_0 = x_0 + t_1/2, & a_1 = x_0 - t_1,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{l}
b_0 = - \sin(\pi/3) (x_1 + x_2).
\end{array}
\end{equation}
%
For $N=4$,
%
\begin{equation}
\begin{array}{ll}
t_1 = (x_1 - x_3)/\sqrt{2}, & t_2 = (x_1 + x_3)/\sqrt{2},
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_0 = x_0 + t_1, & b_0 = -x_2 - t_2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_1 = x_0 - t_1, & b_1 = x_2 - t_2.
\end{array}
\end{equation}
%
For $N=5$,
%
\begin{equation}
\begin{array}{ll}
t_1 = x_1 - x_4, & t_2 = x_1 + x_4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_3 = x_2 - x_3, & t_4 = x_2 + x_3,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_5 = t_1 - t_3, & t_6 = x_0 + t_5 / 4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{l}
t_7 = (\sqrt{5}/4)(t_1 + t_3),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_0 = t_6 + t_7, & b_0 = -\sin(2\pi/10) t_2 - \sin(2\pi/5) t_4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_1 = t_6 - t_7, & b_1 = -\sin(2\pi/5) t_2 + \sin(2\pi/10) t_4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{l}
a_2 = x_0 - t_5.
\end{array}
\end{equation}
%
For $N=6$,
%
\begin{equation}
\begin{array}{ll}
t_1 = \sin(\pi/3)(x_5 - x_1), & t_2 = \sin(\pi/3) (x_2 + x_4),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_3 = x_2 - x_4, & t_4 = x_1 + x_5,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_5 = x_0 + t_3 / 2, & t_6 = -x_3 - t_4 / 2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_0 = t_5 - t_1, & b_0 = t_6 - t_2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_1 = x_0 - t_3, & b_1 = x_3 - t_4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_2 = t_5 + t_1, & b_2 = t_6 + t_2.
\end{array}
\end{equation}

\section{Computing the mixed-radix inverse for real data}
%
To compute the inverse of the mixed-radix FFT on real data we step
through the algorithm in reverse and invert each operation.

This gives the following algorithm, using FFTPACK indexing,
%
\begin{algorithm}
\FOR{$i = n_f \dots 1$}
\FOR{$\mu = 0 \dots p_{i-1} - 1$}
\FOR{$b = 0 \dots q-1$}
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$t_{\lambda} =
  \sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda')
  v_{b p + (\lambda' p_{i-1} + \mu)}$}
  \COMMENT{DFT matrix-multiply module}
\ENDFOR
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$v'_{(\lambda q + b)p_{i-1} + \mu} \Leftarrow
  \omega^{-\mu\lambda}_{p_{i}} t_\lambda$}
\ENDFOR

\ENDFOR
\ENDFOR
\STATE{$v \Leftarrow v'$}
\ENDFOR
\end{algorithm}
%
When $\mu=0$ we are applying an inverse DFT to half-complex data,
giving a real result. The twiddle factors are all unity. We can code
this as a special case, just as we did for the forward routine. We
start with the complex module and eliminate the redundant terms. In
this case it is the final result which has a zero imaginary part, and
we eliminate redundant terms by using the half-complex symmetry of the
input data.

When $\mu=1 \dots p_{i-1}/2 - 1$ we have full complex transforms on
complex data, so no simplification is possible.

When $\mu = p_{i-1}/2$ (which occurs only when $p_{i-1}$ is even) we
have a combination of twiddle factors and DFTs on data with the
shifted half-complex symmetry which gives a real result. We implement
this as a special module, essentially by inverting the system of
equations given for the forward case. We use the modules given by
Temperton, appropriately modified for our version of the algorithm. He
uses a slightly different convention which differs by factors of two
for some terms (consult his paper for details~\cite{temperton83real}).
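Before listing the modules, the forward/inverse pair can be checked numerically. The sketch below (Python; the function names are illustrative, not part of GSL) evaluates $z = W\Omega x$ directly for $N=5$, compares it with the forward module of the previous section, verifies the shifted conjugate-complex symmetry, and confirms that an inverse module obtained by inverting the forward equations returns $N x$, since the transforms are unnormalized:

```python
import cmath
import math

def wox(x):
    # Direct evaluation of z = W Omega x: the diagonal twiddle
    # exp(-i*pi*k/f) combines with the DFT to give
    # z_l = sum_k exp(-i*pi*(2l+1)*k/f) x_k.
    f = len(x)
    return [sum(x[k] * cmath.exp(-1j * math.pi * (2 * l + 1) * k / f)
                for k in range(f)) for l in range(f)]

S1, S2 = math.sin(math.pi / 5), math.sin(2 * math.pi / 5)

def fwd5(x0, x1, x2, x3, x4):
    # Forward N=5 module: returns (a0, b0, a1, b1, a2) packing
    # z = (a0+i*b0, a1+i*b1, a2, a1-i*b1, a0-i*b0).
    t1, t2 = x1 - x4, x1 + x4
    t3, t4 = x2 - x3, x2 + x3
    t5 = t1 - t3
    t6 = x0 + t5 / 4
    t7 = (math.sqrt(5) / 4) * (t1 + t3)
    return t6 + t7, -S1 * t2 - S2 * t4, t6 - t7, -S2 * t2 + S1 * t4, x0 - t5

def inv5(a0, b0, a1, b1, a2):
    # Inverse N=5 module, derived by inverting the forward equations;
    # unnormalized, so it returns 5x.
    t1 = 2 * (a0 + a1)
    t2 = t1 / 4 - a2
    t3 = (math.sqrt(5) / 2) * (a0 - a1)
    t4 = 2 * (S1 * b0 + S2 * b1)
    t5 = 2 * (S2 * b0 - S1 * b1)
    t6, t7 = t3 + t2, t3 - t2
    return t1 + a2, t6 - t4, t7 - t5, -t7 - t5, -t6 - t4

x = (1.0, -2.0, 0.5, 3.0, 0.25)
a0, b0, a1, b1, a2 = fwd5(*x)
z = wox(x)
assert abs(complex(a0, b0) - z[0]) < 1e-12
assert abs(complex(a1, b1) - z[1]) < 1e-12
assert abs(a2 - z[2]) < 1e-12 and abs(z[2].imag) < 1e-12
assert abs(z[1] - z[3].conjugate()) < 1e-12   # shifted symmetry
y = inv5(a0, b0, a1, b1, a2)
assert all(abs(yi - 5 * xi) < 1e-12 for xi, yi in zip(x, y))
```

The same pattern gives round-trip checks for the other module sizes.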

For $N=2$,
%
\begin{equation}
\begin{array}{ll}
x_0 = 2 a_0, & x_1 = -2 b_0.
\end{array}
\end{equation}
%
For $N=3$,
%
\begin{equation}
\begin{array}{ll}
t_1 = a_0 - a_1, & t_2 = \sqrt{3} b_0,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
x_0 = 2 a_0 + a_1, & x_1 = t_1 - t_2, & x_2 = - t_1 - t_2.
\end{array}
\end{equation}
%
For $N=4$,
\begin{equation}
\begin{array}{ll}
t_1 = \sqrt{2} (b_0 + b_1), & t_2 = \sqrt{2} (a_0 - a_1),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
x_0 = 2(a_0 + a_1), & x_1 = t_2 - t_1,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
x_2 = 2(b_1 - b_0), & x_3 = -(t_2 + t_1).
\end{array}
\end{equation}
%
For $N=5$,
%
\begin{equation}
\begin{array}{ll}
t_1 = 2 (a_0 + a_1), & t_2 = t_1 / 4 - a_2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{l}
t_3 = (\sqrt{5}/2) (a_0 - a_1),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{l}
t_4 = 2(\sin(2\pi/10) b_0 + \sin(2\pi/5) b_1),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{l}
t_5 = 2(\sin(2\pi/5) b_0 - \sin(2\pi/10) b_1),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_6 = t_3 + t_2, & t_7 = t_3 - t_2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
x_0 = t_1 + a_2, & x_1 = t_6 - t_4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
x_2 = t_7 - t_5, & x_3 = - t_7 - t_5,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{l}
x_4 = -t_6 - t_4.
\end{array}
\end{equation}

\section{Conclusions}
%
We have described the basic algorithms for one-dimensional radix-2 and
mixed-radix FFTs. It would be nice to have a pedagogical explanation
of the split-radix FFT algorithm, which is faster than the simple
radix-2 algorithm we used. We could also have a whole chapter on
multidimensional FFTs.

%
%\section{Multidimensional FFTs}
%\section{Testing FFTs, Numerical Analysis}

%\nocite{*}
\bibliographystyle{unsrt}
\bibliography{fftalgorithms}

\end{document}