diff --git a/gsl-1.9/doc/fftalgorithms.tex b/gsl-1.9/doc/fftalgorithms.tex
new file mode 100644
index 0000000..4f4a725
--- /dev/null
+++ b/gsl-1.9/doc/fftalgorithms.tex
@@ -0,0 +1,3296 @@
+\documentclass[fleqn,12pt]{article}
+%
+\setlength{\oddsidemargin}{-0.25in}
+\setlength{\textwidth}{7.0in}
+\setlength{\topmargin}{-0.25in}
+\setlength{\textheight}{9.5in}
+%
+\usepackage{algorithmic}
+\newenvironment{algorithm}{\begin{quote} %\vspace{1em}
+\begin{algorithmic}\samepage}{\end{algorithmic} %\vspace{1em}
+\end{quote}}
+\newcommand{\Real}{\mathop{\mathrm{Re}}}
+\newcommand{\Imag}{\mathop{\mathrm{Im}}}
+
+\begin{document}
+\title{FFT Algorithms}
+\author{Brian Gough, {\tt bjg@network-theory.co.uk}}
+\date{May 1997}
+\maketitle
+
+\section{Introduction}
+
+Fast Fourier Transforms (FFTs) are efficient algorithms for
+calculating the discrete Fourier transform (DFT),
+%
+\begin{eqnarray}
+h_a &=& \mathrm{DFT}(g_b) \\
+ &=& \sum_{b=0}^{N-1} g_b \exp(-2\pi i a b /N) \qquad 0 \leq a \leq N-1\\
+ &=& \sum_{b=0}^{N-1} g_b W_N^{ab} \qquad W_N= \exp(-2\pi i/N)
+\end{eqnarray}
+%
+The DFT usually arises as an approximation to the continuous Fourier
+transform when functions are sampled at discrete intervals in space or
+time. The naive evaluation of the discrete fourier transform is a
+matrix-vector multiplication ${\mathbf W}\vec{g}$, and would take
+$O(N^2)$ operations for $N$ data-points. The general principle of the
+Fast Fourier Transform algorithms is to use a divide-and-conquer
+strategy to factorize the matrix $W$ into smaller sub-matrices,
+typically reducing the operation count to $O(N \sum f_i)$ if $N$ can
+be factorized into smaller integers, $N=f_1 f_2 \dots f_n$.
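+
+For reference, here is a short transcription of the naive $O(N^2)$
+evaluation into C (our own illustration, not part of the library; the
+function name {\tt dft\_naive} is hypothetical), using C99 complex
+arithmetic,
+%
+\begin{verbatim}
+#include <complex.h>
+#include <math.h>
+
+/* Naive O(N^2) evaluation of h_a = sum_b g_b W_N^{ab},
+   with W_N = exp(-2 pi i / N). */
+void
+dft_naive (const double complex *g, double complex *h, int N)
+{
+  for (int a = 0; a < N; a++)
+    {
+      double complex sum = 0;
+      for (int b = 0; b < N; b++)
+        sum += g[b] * cexp (-2 * M_PI * I * a * b / N);
+      h[a] = sum;
+    }
+}
+\end{verbatim}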
+
+This chapter explains the algorithms used in the GSL FFT routines
+and provides some information on how to extend them. To learn more about
+the FFT you should read the review article {\em Fast Fourier
+Transforms: A Tutorial Review and A State of the Art} by Duhamel and
+Vetterli~\cite{duhamel90}. There are several introductory books on the
+FFT with example programs, such as {\em The Fast Fourier Transform} by
+Brigham~\cite{brigham74} and {\em DFT/FFT and Convolution Algorithms}
+by Burrus and Parks~\cite{burrus84}. In 1979 the IEEE published a
+compendium of carefully-reviewed Fortran FFT programs in {\em Programs
+for Digital Signal Processing}~\cite{committee79} which is a useful
+reference for implementations of many different FFT algorithms. If you
+are interested in using DSPs then the {\em Handbook of Real-Time Fast
+Fourier Transforms}~\cite{smith95} provides detailed information on
+the algorithms and hardware needed to design, build and test DSP
+applications. Many FFT algorithms rely on results from number theory.
+These results are covered in the books {\em Fast transforms:
+algorithms, analyses, applications}, by Elliott and
+Rao~\cite{elliott82}, {\em Fast Algorithms for Digital Signal
+Processing} by Blahut~\cite{blahut} and {\em Number Theory in Digital
+Signal Processing} by McClellan and Rader~\cite{mcclellan79}. There is
+also an annotated bibliography of papers on the FFT and related topics
+by Burrus~\cite{burrus-note}.
+
+\section{Families of FFT algorithms}
+%
+There are two main families of FFT algorithms: the Cooley-Tukey
+algorithm and the Prime Factor algorithm. These differ in the way they
+map the full FFT into smaller sub-transforms. Of the Cooley-Tukey
+algorithms there are two types of routine in common use: mixed-radix
+(general-$N$) algorithms and radix-2 (power of 2) algorithms. Each
+type of algorithm can be further classified by additional characteristics,
+such as whether it operates in-place or uses additional scratch space,
+whether its output is in a sorted or scrambled order, and whether it
+uses decimation-in-time or -frequency iterations.
+
+Mixed-radix algorithms work by factorizing the data vector into
+shorter lengths. These can then be transformed by small-$N$ FFTs.
+Typical programs include FFTs for small prime factors, such as 2, 3,
+5, \dots which are highly optimized. The small-$N$ FFT modules act as
+building blocks and can be multiplied together to make longer
+transforms. By combining a reasonable set of modules it is possible to
+compute FFTs of many different lengths. If the small-$N$ modules are
+supplemented by an $O(N^2)$ general-$N$ module then an FFT of any
+length can be computed, in principle. Of course, any length which
+contains large prime factors will still take $O(N^2)$ operations.
+
+Radix-2 algorithms, or ``power of two'' algorithms, are simplified
+versions of the mixed-radix algorithm. They are restricted to lengths
+which are a power of two. The basic radix-2 FFT module only involves
+addition and subtraction, so the algorithms are very simple. Radix-2
+algorithms have been the subject of much research into optimizing the
+FFT. Many of the most efficient radix-2 routines are based on the
+``split-radix'' algorithm. This is actually a hybrid which combines
+the best parts of both radix-2 and radix-4 (``power of 4'')
+algorithms~\cite{sorenson86,duhamel86}.
+
+The prime factor algorithm (PFA) is an alternative form of general-$N$
+algorithm based on a different way of recombining small-$N$ FFT
+modules~\cite{temperton83pfa,temperton85}. It has a very simple indexing
+scheme which makes it attractive. However it only works in the case
+where all factors are mutually prime. This requirement makes it more
+suitable as a specialized algorithm for given lengths.
+
+
+\begin{subsection}{FFTs of prime lengths}
+%
+ Large prime lengths cannot be handled efficiently by any of these
+algorithms. However it may still be possible to compute a DFT by using
+results from number theory. Rader showed that it is possible to
+convert a length-$p$ FFT (where $p$ is prime) into a convolution of
+length-($p-1$). There is a simple identity between the convolution of
+length $N$ and the FFT of the same length, so if $p-1$ is easily
+factorizable this allows the convolution to be computed efficiently
+via the FFT. The idea is presented in the original paper by
+Rader~\cite{raderprimes} (also reprinted in~\cite{mcclellan79}), but
+for more details see the theoretical books mentioned earlier.
+\end{subsection}
+
+%\subsection{In-place algorithms vs algorithms using scratch space}
+%%
+%For simple algorithms, such as the Radix-2 algorithms, the FFT can be
+%performed in-place, without additional memory. For this to be possible
+%the index calculations must be simple enough that the calculation of
+%the result to be stored in index $k$ can be carried out at the same
+%time as the data in $k$ is used, so that no temporary storage is
+%needed.
+
+%The mixed-radix algorithm is too complicated for this. It requires an
+%area of scratch space sufficient to hold intermediate copies of the
+%input data.
+
+%\subsection{Self-sorting algorithms vs scrambling algorithms}
+%%
+%A self-sorting algorithm At each iteration of the FFT internal permutations are included
+%which leave the final iteration in its natural order.
+
+
+%The mixed-radix algorithm can be made self-sorting. The additional
+%scratch space helps here, although it is in fact possible to design
+%self-sorting algorithms which do not require scratch
+%space.
+
+
+%The in-place radix-2 algorithm is not self-sorting. The data is
+%permuted, and transformed into bit-reversed order. Note that
+%bit-reversal only refers to the order of the array, not the values for
+%each index which are of course not changed. More precisely, the data
+%stored in location $[b_n \dots b_2 b_1 b_0]$ is moved to location
+%$[b_0 b_1 b_2 \dots b_n]$, where $b_0 \dots b_n$ are the raw bits of a
+%given index. Placing the data in bit reversed order is easily done,
+%using right shifts to extract the least significant bits of the index
+%and left-shifts to feed them into a register for the bit-reversed
+%location. Simple radix-2 FFT usually recompute the bit reverse
+%ordering in the naive way for every call. For repeated calls it might
+%be worthwhile to precompute the permutation and cache it. There are
+%also more sophisticated algorithms which reduce the number of
+%operations needed to perform bit-reversal~\cite{bit-reversal}.
+
+
+%\begin{subsection}{Decimation-in-time (DIT) vs Decimation-in-Frequency (DIF)}
+
+%\end{subsection}
+
+
+\subsection{Optimization}
+%
+There is no such thing as the single fastest FFT {\em algorithm}. FFT
+algorithms involve a mixture of floating point calculations, integer
+arithmetic and memory access. Each of these operations will have
+different relative speeds on different platforms. The performance of
+an algorithm is a function of the hardware it is implemented on. The
+goal of optimization is thus to choose the algorithm best suited to
+the characteristics of a given hardware platform.
+
+For example, the Winograd Fourier Transform (WFTA) is an algorithm
+which is designed to reduce the number of floating point
+multiplications in the FFT. However, it does this at the expense of
+using many more additions and data transfers than other algorithms.
+As a consequence the WFTA might be a good candidate algorithm for
+machines where data transfers occupy a negligible time relative to
+floating point arithmetic. However on most modern machines, where the
+speed of data transfers is comparable to or slower than floating point
+operations, it would be outperformed by an algorithm which used a
+better mix of operations (i.e. more floating point operations but
+fewer data transfers).
+
+For a study of this sort of effect in detail, comparing the different
+algorithms on different platforms consult the paper {\it Effects of
+Architecture Implementation on DFT Algorithm Performance} by Mehalic,
+Rustan and Route~\cite{mehalic85}. The paper was written in the early
+1980s and has data for super- and mini-computers which you are
+unlikely to see today, except in a museum. However, the methodology is
+still valid and it would be interesting to see similar results for
+present day computers.
+
+
+\section{FFT Concepts}
+%
+Factorization is the key principle of the mixed-radix FFT divide-and-conquer
+strategy. If $N$ can be factorized into a product of $n_f$ integers,
+%
+\begin{equation}
+N = f_1 f_2 ... f_{n_f} ,
+\end{equation}
+%
+then the FFT itself can be divided into smaller FFTs for each factor.
+More precisely, an FFT of length $N$ can be broken up into,
+%
+\begin{quote}
+\begin{tabular}{l}
+$(N/f_1)$ FFTs of length $f_1$, \\
+$(N/f_2)$ FFTs of length $f_2$, \\
+\dots \\
+$(N/f_{n_f})$ FFTs of length $f_{n_f}$.
+\end{tabular}
+\end{quote}
+%
+The total number of operations for these sub-operations will be $O(
+N(f_1 + f_2 + ... + f_{n_f}))$. When the factors of $N$ are all small
+integers this will be substantially less than $O(N^2)$. For example,
+when $N$ is a power of 2 an FFT of length $N=2^m$ can be reduced to $m
+N/2$ FFTs of length 2, or $O(N\log_2 N)$ operations. Here is a
+demonstration which shows this:
+
+We start with the full DFT,
+%
+\begin{eqnarray}
+h_a &=& \sum_{b=0}^{N-1} g_b W_N^{ab} \qquad W_N=\exp(-2\pi i/N)
+\end{eqnarray}
+%
+and split the sum into even and odd terms,
+%
+\begin{eqnarray}
+\phantom{h_a}
+% &=& \left(\sum_{b=0,2,4,\dots} + \sum_{b=1,3,5,\dots}\right) g_b W_N^{ab}\\
+ &=& \sum_{b=0}^{N/2-1} g_{2b} W_N^{a(2b)} +
+ \sum_{b=0}^{N/2-1} g_{2b+1} W_N^{a(2b+1)}.
+\end{eqnarray}
+%
+This converts the original DFT of length $N$ into two DFTs of length
+$N/2$,
+%
+\begin{equation}
+h_a = \sum_{b=0}^{N/2-1} g_{2b} W_{(N/2)}^{ab} +
+ W_N^a \sum_{b=0}^{N/2-1} g_{2b+1} W_{(N/2)}^{ab}
+\end{equation}
+%
+The first term is a DFT of the even elements of $g$. The second term
+is a DFT of the odd elements of $g$, premultiplied by an exponential
+factor $W_N^a$ (known as a {\em twiddle factor}).
+%
+\begin{equation}
+h = \mathrm{DFT}(g_{even}) + W_N^a \, \mathrm{DFT}(g_{odd})
+\end{equation}
+%
+By splitting the DFT into its even and odd parts we have reduced the
+operation count from $N^2$ (for a DFT of length $N$) to $2 (N/2)^2$
+(for two DFTs of length $N/2$). The cost of the splitting is that we
+need an additional $O(N)$ operations to multiply by the twiddle factor
+$W_N^a$ and recombine the two sums.
+
+We can repeat the splitting procedure recursively $\log_2 N$ times
+until the full DFT is reduced to DFTs of single terms. The DFT of a
+single value is just the identity operation, which costs nothing.
+However since $O(N)$ operations were needed at each stage to recombine
+the even and odd parts the total number of operations to obtain the
+full DFT is $O(N \log_2 N)$. If we had used a length which was a
+product of factors $f_1$, $f_2$, $\dots$ we could have split the sum
+in a similar way. First we would split terms corresponding to the
+factor $f_1$, instead of the even and odd terms corresponding to a
+factor of two. Then we would repeat this procedure for the subsequent
+factors. This would lead to a final operation count of $O(N \sum
+f_i)$.
+
+This procedure gives some motivation for why the number of operations
+in a DFT can in principle be reduced from $O(N^2)$ to $O(N \sum f_i)$.
+It does not give a good explanation of how to implement the algorithm
+in practice which is what we shall do in the next section.
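+
+Although the implementations described in the following sections work
+in-place, the recursive splitting above can be transcribed almost
+directly into code. Here is a minimal sketch in C for a power-of-two
+length (our own illustration; {\tt fft\_recursive} is a hypothetical
+name, and the scratch arrays are used for clarity only),
+%
+\begin{verbatim}
+#include <complex.h>
+#include <math.h>
+
+/* Recursive FFT illustrating DFT(g) = DFT(g_even) + W_N^a DFT(g_odd).
+   N must be a power of 2. */
+void
+fft_recursive (double complex *g, int N)
+{
+  if (N == 1)
+    return;                /* the DFT of a single value is the identity */
+
+  double complex even[N / 2], odd[N / 2];
+
+  for (int b = 0; b < N / 2; b++)
+    {
+      even[b] = g[2 * b];
+      odd[b] = g[2 * b + 1];
+    }
+
+  fft_recursive (even, N / 2);
+  fft_recursive (odd, N / 2);
+
+  for (int a = 0; a < N / 2; a++)
+    {
+      double complex w = cexp (-2 * M_PI * I * a / N);  /* twiddle W_N^a */
+      g[a] = even[a] + w * odd[a];
+      g[a + N / 2] = even[a] - w * odd[a];
+    }
+}
+\end{verbatim}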
+
+\section{Radix-2 Algorithms}
+%
+For radix-2 FFTs it is natural to write array indices in binary form
+because the length of the data is a power of two. This is nicely
+explained in the article {\em The FFT: Fourier Transforming One Bit at
+a Time} by P.B. Visscher~\cite{visscher96}. A binary representation
+for indices is the key to deriving the simplest efficient radix-2
+algorithms.
+
+We can write an index $b$ ($0 \leq b < 2^{n}$) in binary
+representation like this,
+%
+\begin{equation}
+b = [b_{n-1} \dots b_1 b_0] = 2^{n-1}b_{n-1} + \dots + 2 b_1 + b_0 .
+\end{equation}
+%
+Each of the $b_0, b_1, \dots, b_{n-1}$ are the bits (either 0 or 1) of
+$b$.
+
+Using this notation the original definition of the DFT
+can be rewritten as a sum over the bits of $b$,
+%
+\begin{equation}
+h(a) = \sum_{b=0}^{N-1} g_b \exp(-2\pi i a b /N)
+\end{equation}
+%
+to give an equivalent summation like this,
+%
+\begin{equation}
+h([a_{n-1} \dots a_1 a_0]) =
+\sum_{b_0=0}^{1}
+\sum_{b_1=0}^{1}
+\dots
+\sum_{b_{n-1}=0}^{1}
+ g([b_{n-1} \dots b_1 b_0]) W_N^{ab}
+\end{equation}
+%
+where the bits of $a$ are $a=[a_{n-1} \dots a_1 a_0]$.
+
+To reduce the number of operations in the sum we will use the
+periodicity of the exponential term,
+%
+\begin{equation}
+W_N^{x+N} = W_N^{x}.
+\end{equation}
+%
+Most of the products $ab$ in $W_N^{ab}$ are greater than $N$. By
+making use of this periodicity they can all be collapsed down into the
+range $0\dots N-1$. This allows us to reduce the number of operations
+by combining common terms, modulo $N$. Using this idea we can derive
+decimation-in-time or decimation-in-frequency algorithms, depending on
+how we break the DFT summation down into common terms. We'll first
+consider the decimation-in-time algorithm.
+
+\subsection{Radix-2 Decimation-in-Time (DIT)}
+%
+To derive the decimation-in-time algorithm we start by separating
+out the most significant bit of the index $b$,
+%
+\begin{equation}
+[b_{n-1} \dots b_1 b_0] = 2^{n-1}b_{n-1} + [b_{n-2} \dots b_1 b_0]
+\end{equation}
+%
+Now we can evaluate the innermost sum of the DFT without any
+dependence on the remaining bits of $b$ in the exponential,
+%
+\begin{eqnarray}
+h([a_{n-1} \dots a_1 a_0]) &=&
+\sum_{b_0=0}^{1}
+\sum_{b_1=0}^{1}
+\dots
+\sum_{b_{n-1}=0}^{1}
+ g(b)
+W_N^{a(2^{n-1}b_{n-1}+[b_{n-2} \dots b_1 b_0])} \\
+ &=&
+\sum_{b_0=0}^{1}
+\sum_{b_1=0}^{1}
+\dots
+\sum_{b_{n-2}=0}^{1}
+W_N^{a[b_{n-2} \dots b_1 b_0]}
+\sum_{b_{n-1}=0}^{1}
+ g(b)
+W_N^{a(2^{n-1}b_{n-1})}
+\end{eqnarray}
+%
+Looking at the term $W_N^{a(2^{n-1}b_{n-1})}$ we see that we can also
+remove most of the dependence on $a$ as well, by using the periodicity of the
+exponential,
+%
+\begin{eqnarray}
+W_N^{a(2^{n-1}b_{n-1})} &=&
+\exp(-2\pi i [a_{n-1} \dots a_1 a_0] 2^{n-1} b_{n-1}/ 2^n )\\
+&=& \exp(-2\pi i [a_{n-1} \dots a_1 a_0] b_{n-1}/ 2 )\\
+&=& \exp(-2\pi i ( 2^{n-2}a_{n-1} + \dots + a_1 + (a_0/2)) b_{n-1} )\\
+&=& \exp(-2\pi i a_0 b_{n-1}/2 ) \\
+&=& W_2^{a_0 b_{n-1}}
+\end{eqnarray}
+%
+Thus the innermost exponential term simplifies so that it only
+involves the highest order bit of $b$ and the lowest order bit of $a$,
+and the sum can be reduced to,
+%
+\begin{equation}
+h([a_{n-1} \dots a_1 a_0])
+=
+\sum_{b_0=0}^{1}
+\sum_{b_1=0}^{1}
+\dots
+\sum_{b_{n-2}=0}^{1}
+W_N^{a[b_{n-2} \dots b_1 b_0]}
+\sum_{b_{n-1}=0}^{1}
+ g(b)
+W_2^{a_0 b_{n-1}}.
+\end{equation}
+%
+We can repeat this procedure for the next most significant bit of
+$b$, $b_{n-2}$, using a similar identity,
+%
+\begin{eqnarray}
+W_N^{a(2^{n-2}b_{n-2})}
+&=& \exp(-2\pi i [a_{n-1} \dots a_1 a_0] 2^{n-2} b_{n-2}/ 2^n )\\
+&=& W_4^{ [a_1 a_0] b_{n-2}}.
+\end{eqnarray}
+%
+to give a formula with even less dependence on the bits of $a$,
+%
+\begin{equation}
+h([a_{n-1} \dots a_1 a_0])
+=
+\sum_{b_0=0}^{1}
+\sum_{b_1=0}^{1}
+\dots
+\sum_{b_{n-3}=0}^{1}
+W_N^{a[b_{n-3} \dots b_1 b_0]}
+\sum_{b_{n-2}=0}^{1}
+W_4^{[a_1 a_0] b_{n-2}}
+\sum_{b_{n-1}=0}^{1}
+ g(b)
+W_2^{a_0 b_{n-1}}.
+\end{equation}
+%
+If we repeat the process for all the remaining bits we obtain a
+simplified DFT formula which is the basis of the radix-2
+decimation-in-time algorithm,
+%
+\begin{eqnarray}
+h([a_{n-1} \dots a_1 a_0]) &=&
+\sum_{b_0=0}^{1}
+W_{N}^{[a_{n-1} \dots a_1 a_0]b_0}
+%\sum_{b_1=0}^{1}
+%W_{N/2}^{[a_{n-1} \dots a_1 a_0]b_1}
+\dots
+\sum_{b_{n-2}=0}^{1}
+W_4^{ [a_1 a_0] b_{n-2}}
+\sum_{b_{n-1}=0}^{1}
+W_2^{a_0 b_{n-1}}
+g(b)
+\end{eqnarray}
+%
+To convert the formula to an algorithm we expand out the sum
+recursively, evaluating each of the intermediate summations, which we
+denote by $g_1$, $g_2$, \dots, $g_n$,
+%
+\begin{eqnarray}
+g_1(a_0, b_{n-2}, b_{n-3}, \dots, b_1, b_0)
+&=&
+\sum_{b_{n-1}=0}^{1}
+W_2^{a_0 b_{n-1}}
+g([b_{n-1} b_{n-2} b_{n-3} \dots b_1 b_0]) \\
+g_2(a_0, {}_{\phantom{-2}} a_{1}, b_{n-3}, \dots, b_1, b_0)
+&=&
+\sum_{b_{n-2}=0}^{1}
+W_4^{[a_1 a_0] b_{n-2}}
+g_1(a_0, b_{n-2}, b_{n-3}, \dots, b_1, b_0) \\
+g_3(a_0, {}_{\phantom{-2}} a_{1}, {}_{\phantom{-3}} a_{2}, \dots, b_1, b_0)
+&=&
+\sum_{b_{n-3}=0}^{1}
+W_8^{[a_2 a_1 a_0] b_{n-3}}
+g_2(a_0, a_1, b_{n-3}, \dots, b_1, b_0) \\
+\dots &=& \dots \\
+g_n(a_0, a_{1}, a_{2}, \dots, a_{n-2}, a_{n-1})
+&=&
+\sum_{b_{0}=0}^{1}
+W_N^{[a_{n-1} \dots a_1 a_0]b_0}
+g_{n-1}(a_0, a_1, a_2, \dots, a_{n-2}, b_0)
+\end{eqnarray}
+%
+After the final sum, we can obtain the transform $h$ from $g_n$,
+%
+\begin{equation}
+h([a_{n-1} \dots a_1 a_0])
+=
+g_n(a_0, a_1, \dots, a_{n-1})
+\end{equation}
+%
+Note that we left the storage arrangements of the intermediate sums
+unspecified by using the bits as function arguments and not as an
+index. The storage of intermediate sums is different for the
+decimation-in-time and decimation-in-frequency algorithms.
+
+Before deciding on the best storage scheme we'll show that the
+computation of each stage, $g_1$, $g_2$, \dots, can be carried out {\em
+in-place}. For example, in the case of $g_1$, the inputs are,
+%
+\begin{equation}
+g([\underline{b_{n-1}} b_{n-2} b_{n-3} \dots b_1 b_0])
+\end{equation}
+%
+for $b_{n-1}=(0,1)$, and the corresponding outputs are,
+%
+\begin{equation}
+g_1(\underline{a_0},b_{n-2}, b_{n-3}, \dots, b_1, b_0)
+\end{equation}
+%
+for $a_0=(0,1)$. It's clear that if we hold $b_{n-2}, b_{n-3}, \dots,
+b_1, b_0$ fixed and compute the sum over $b_{n-1}$ in memory for both
+values of $a_0 = 0,1$ then we can store the result for $a_0=0$ in the
+location which originally had $b_{n-1}=0$ and the result for $a_0=1$ in
+the location which originally had $b_{n-1}=1$. The two inputs and two
+outputs are known as {\em dual node pairs}. At each stage of the
+calculation the sums for each dual node pair are independent of the
+others. It is this property which allows an in-place calculation.
+
+So for an in-place pass our storage has to be arranged so that the two
+outputs $g_1(a_0,\dots)$ overwrite the two input terms
+$g([b_{n-1},\dots])$. Note that the order of $a$ is reversed from the
+natural order of $b$, i.e.\ the least significant bit of $a$
+replaces the most significant bit of $b$. This is inconvenient
+because $a$ occurs in its natural order in all the exponentials,
+$W^{ab}$. We could keep track of both $a$ and its bit-reversed
+counterpart at all times, but there is a neat trick
+which avoids this: if we bit-reverse the order of the input data $g$
+before we start the calculation we can also bit-reverse the order of
+$a$ when storing intermediate results. Since the storage involving $a$
+was originally in bit-reversed order the switch in the input $g$ now
+allows us to use normal ordered storage for $a$, the same ordering
+that occurs in the exponential factors.
+
+This is complicated to explain, so here is an example of the 4 passes
+needed for an $N=16$ decimation-in-time FFT, with the initial data
+stored in bit-reversed order,
+%
+\begin{eqnarray}
+g_1([b_0 b_1 b_2 a_0])
+&=&
+\sum_{b_3=0}^{1} W_2^{a_0 b_3} g([b_0 b_1 b_2 b_3])
+\\
+g_2([b_0 b_1 a_1 a_0])
+&=&
+\sum_{b_2=0}^{1} W_4^{[a_1 a_0] b_2} g_1([b_0 b_1 b_2 a_0])
+\\
+g_3([b_0 a_2 a_1 a_0])
+&=&
+\sum_{b_1=0}^{1} W_8^{[a_2 a_1 a_0] b_1} g_2([b_0 b_1 a_1 a_0])
+\\
+h(a) = g_4([a_3 a_2 a_1 a_0])
+&=&
+\sum_{b_0=0}^{1} W_{16}^{[a_3 a_2 a_1 a_0] b_0} g_3([b_0 a_2 a_1 a_0])
+\end{eqnarray}
+%
+We compensate for the bit reversal of the input data by accessing $g$
+with the bit-reversed form of $b$ in the first stage. This ensures
+that we are still carrying out the same calculation, using the same
+data, and not accessing different values. Only single bits of $b$ ever
+occur in the exponential so we never need the bit-reversed form of
+$b$.
+
+Let's examine the third pass in detail,
+%
+\begin{equation}
+g_3([b_0 a_2 a_1 a_0])
+=
+\sum_{b_1=0}^{1} W_8^{[a_2 a_1 a_0] b_1} g_2([b_0 b_1 a_1 a_0])
+\end{equation}
+%
+First note that only one bit, $b_1$, varies in each summation. The
+other bits of $b$ ($b_0$) and of $a$ ($a_1 a_0$) are essentially
+``spectators'' -- we must loop over all combinations of these bits and
+carry out the same basic calculation for each, remembering to update
+the exponentials involving $W_8$ appropriately. If we are storing the
+results in-place (with $g_3$ overwriting $g_2$) we will need to compute
+the sums involving $b_1=0,1$ and $a_2=0,1$ simultaneously.
+%
+\begin{equation}
+\left(
+\begin{array}{c}
+g_3([b_0 0 a_1 a_0]) \vphantom{W_8^{[]}} \\
+g_3([b_0 1 a_1 a_0]) \vphantom{W_8^{[]}}
+\end{array}
+\right)
+=
+\left(
+\begin{array}{c}
+g_2([b_0 0 a_1 a_0]) + W_8^{[0 a_1 a_0]} g_2([b_0 1 a_1 a_0]) \\
+g_2([b_0 0 a_1 a_0]) + W_8^{[1 a_1 a_0]} g_2([b_0 1 a_1 a_0])
+\end{array}
+\right)
+\end{equation}
+%
+We can write this in a more symmetric form by simplifying the exponential,
+%
+\begin{equation}
+W_8^{[a_2 a_1 a_0]}
+= W_8^{4 a_2 + [a_1 a_0]}
+= (-1)^{a_2} W_8^{[a_1 a_0]}
+\end{equation}
+%
+\begin{equation}
+\left(
+\begin{array}{c}
+g_3([b_0 0 a_1 a_0]) \vphantom{W_8^{[]}} \\
+g_3([b_0 1 a_1 a_0]) \vphantom{W_8^{[]}}
+\end{array}
+\right)
+=
+\left(
+\begin{array}{c}
+g_2([b_0 0 a_1 a_0]) + W_8^{[a_1 a_0]} g_2([b_0 1 a_1 a_0]) \\
+g_2([b_0 0 a_1 a_0]) - W_8^{[a_1 a_0]} g_2([b_0 1 a_1 a_0])
+\end{array}
+\right)
+\end{equation}
+%
+The exponentials $W_8^{[a_1 a_0]}$ are referred to as {\em twiddle
+factors}. The form of this calculation, a symmetrical sum and
+difference involving a twiddle factor, is called a {\em butterfly}.
+It is often shown diagrammatically, and in the case $b_0=a_0=a_1=0$
+would be drawn like this,
+%
+\begin{center}
+\setlength{\unitlength}{1cm}
+\begin{picture}(9,4)
+%
+%\put(0,0){\line(1,0){8}}
+%\put(0,0){\line(0,1){4}}
+%\put(8,4){\line(0,-1){4}}
+%\put(8,4){\line(-1,0){8}}
+%
+\put(0,1){$g_2(4)$} \put(4.5,1){$g_3(4)=g_2(0) - W^a_8 g_2(4)$}
+\put(0,3){$g_2(0)$} \put(4.5,3){$g_3(0)=g_2(0) + W^a_8 g_2(4)$}
+\put(1,1){\vector(1,0){0.5}}
+\put(1.5,1){\line(1,0){0.5}}
+\put(1.5,0.5){$W^a_8$}
+\put(1,3){\vector(1,0){0.5}}\put(1.5,3){\line(1,0){0.5}}
+\put(2,1){\circle*{0.1}}
+\put(2,3){\circle*{0.1}}
+\put(2,1){\vector(1,1){2}}
+\put(2,1){\vector(1,0){1}}
+\put(3,1){\line(1,0){1}}
+\put(3,0.5){$-1$}
+\put(2,3){\vector(1,-1){2}}
+\put(2,3){\vector(1,0){1}}
+\put(3,3){\line(1,0){1}}
+\put(4,1){\circle*{0.1}}
+\put(4,3){\circle*{0.1}}
+\end{picture}
+\end{center}
+%
+The inputs are shown on the left and the outputs on the right. The
+outputs are computed by multiplying the incoming lines by their
+accompanying factors (shown next to the lines) and summing the results
+at each node.
+
+In general, denoting the bit for dual-node pairs by $\Delta$ and the
+remaining bits of $a$ and $b$ by ${\hat a}$ and ${\hat b}$, the
+butterfly is,
+%
+\begin{equation}
+\left(
+\begin{array}{c}
+g({\hat b} + {\hat a}) \\
+g({\hat b} + \Delta + {\hat a}) \\
+\end{array}
+\right)
+\leftarrow
+\left(
+\begin{array}{c}
+g({\hat b} + {\hat a}) + W_{2\Delta}^{\hat a} g({\hat b} + \Delta + {\hat a})\\
+g({\hat b} + {\hat a}) - W_{2\Delta}^{\hat a} g({\hat b} + \Delta + {\hat a})
+\end{array}
+\right)
+\end{equation}
+%
+where ${\hat a}$ runs from $0 \dots \Delta-1$ and ${\hat b}$ runs
+through $0 \times 2\Delta$, $1\times 2\Delta$, $\dots$, $(N/(2\Delta) -
+1)\,2\Delta$. The value of $\Delta$ is 1 on the first pass, 2 on the
+second pass and $2^{n-1}$ on the $n$-th pass. Each pass requires
+$N/2$ in-place computations, each involving two input locations and
+two output locations.
+
+In the example above $\Delta = [100] = 4$, ${\hat a} = [a_1 a_0]$ and
+${\hat b} = [b_0 0 0 0]$.
+
+This leads to the canonical radix-2 decimation-in-time FFT algorithm
+for $2^n$ data points stored in the array $g(0) \dots g(2^n-1)$.
+%
+\begin{algorithm}
+\STATE bit-reverse ordering of $g$
+\STATE {$\Delta \Leftarrow 1$}
+\FOR {$\mbox{pass} = 1 \dots n$}
+ \STATE {$W \Leftarrow \exp(-2 \pi i / 2\Delta)$}
+ \FOR {$(a = 0 ; a < \Delta ; a{++})$}
+ \FOR{$(b = 0 ; b < N ; b {+=} 2*\Delta)$}
+ \STATE{$t_0 \Leftarrow g(b+a) + W^a g(b+\Delta+a)$}
+ \STATE{$t_1 \Leftarrow g(b+a) - W^a g(b+\Delta+a)$}
+ \STATE{$g(b+a) \Leftarrow t_0$}
+ \STATE{$g(b+\Delta+a) \Leftarrow t_1$}
+ \ENDFOR
+ \ENDFOR
+ \STATE{$\Delta \Leftarrow 2\Delta$}
+\ENDFOR
+\end{algorithm}
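+
+As an illustration, a direct transcription of this algorithm into C
+might look as follows (a sketch only, not the organization used by the
+GSL routines). It assumes a {\tt bit\_reverse} routine such as the
+Gold-Rader algorithm given below, and computes $W^a$ by the
+trigonometric recursion described in the next section,
+%
+\begin{verbatim}
+#include <complex.h>
+#include <math.h>
+
+void bit_reverse (double complex *g, int N);   /* see below */
+
+/* In-place radix-2 decimation-in-time FFT, N = 2^n. */
+void
+fft_radix2_dit (double complex *g, int N)
+{
+  bit_reverse (g, N);
+
+  for (int delta = 1; delta < N; delta *= 2)   /* Delta = 1, 2, ..., N/2 */
+    {
+      double complex w = 1.0;
+      double complex ws = cexp (-2 * M_PI * I / (2 * delta));
+
+      for (int a = 0; a < delta; a++)
+        {
+          for (int b = 0; b < N; b += 2 * delta)
+            {
+              double complex t0 = g[b + a] + w * g[b + delta + a];
+              double complex t1 = g[b + a] - w * g[b + delta + a];
+              g[b + a] = t0;
+              g[b + delta + a] = t1;
+            }
+          w *= ws;                             /* W^{a+1} = W W^a */
+        }
+    }
+}
+\end{verbatim}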
+%
+%This algorithm appears in Numerical Recipes as the routine {\tt
+%four1}, where the implementation is attributed to N.M. Brenner.
+%
+\subsection{Details of the Implementation}
+It is straightforward to implement a simple radix-2 decimation-in-time
+routine from the algorithm above. Some optimizations can be made by
+pulling the special case of $a=0$ out of the loop over $a$, to avoid
+unnecessary multiplications when $W^a=1$,
+%
+\begin{algorithm}
+ \FOR{$(b = 0 ; b < N ; b {+=} 2*\Delta)$}
+ \STATE{$t_0 \Leftarrow g(b) + g(b+\Delta)$}
+ \STATE{$t_1 \Leftarrow g(b) - g(b+\Delta)$}
+ \STATE{$g(b) \Leftarrow t_0$}
+ \STATE{$g(b+\Delta) \Leftarrow t_1$}
+ \ENDFOR
+\end{algorithm}
+%
+There are several algorithms for doing fast bit-reversal. We use the
+Gold-Rader algorithm, which is simple and does not require any working
+space,
+%
+\begin{algorithm}
+\STATE {$j \Leftarrow 0$}
+\FOR{$i = 0 \dots N - 2$}
+  \STATE {$k \Leftarrow N / 2$}
+  \IF {$i < j$}
+  \STATE {swap $g(i)$ and $g(j)$}
+  \ENDIF
+
+  \WHILE {$k \leq j$}
+  \STATE{$j \Leftarrow j - k$}
+  \STATE{$k \Leftarrow k / 2$}
+  \ENDWHILE
+
+  \STATE{$j \Leftarrow j + k$}
+\ENDFOR
+\end{algorithm}
+%
+The Gold-Rader algorithm is typically twice as fast as a naive
+bit-reversal algorithm (where the bit reversal is carried out by
+left-shifts and right-shifts on the index). The library also has a
+routine for the Rodriguez bit reversal algorithm, which also does not
+require any working space~\cite{rodriguez89}. There are faster bit
+reversal algorithms, but they all use additional scratch
+space~\cite{rosel89}.
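+
+In C the Gold-Rader algorithm might be written like this (a sketch,
+operating on an array of C99 complex values),
+%
+\begin{verbatim}
+#include <complex.h>
+
+/* Gold-Rader bit-reversal permutation of g(0) ... g(N-1), in-place. */
+void
+bit_reverse (double complex *g, int N)
+{
+  int j = 0;
+
+  for (int i = 0; i < N - 1; i++)
+    {
+      int k = N / 2;
+
+      if (i < j)
+        {
+          double complex t = g[i];     /* swap g(i) and g(j) */
+          g[i] = g[j];
+          g[j] = t;
+        }
+
+      while (k <= j)           /* propagate the bit-reversed carry */
+        {
+          j -= k;
+          k /= 2;
+        }
+      j += k;
+    }
+}
+\end{verbatim}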
+
+Within the loop for $a$ we can compute $W^a$ using a trigonometric
+recursion relation,
+%
+\begin{eqnarray}
+W^{a+1} &=& W W^a \\
+ &=& (\cos(2\pi/2\Delta) - i \sin(2\pi/2\Delta)) W^a
+\end{eqnarray}
+%
+This requires only $2 \log_2 N$ trigonometric calls, to compute the
+initial values of $\exp(-2\pi i /2\Delta)$ for each pass.
+
+\subsection{Radix-2 Decimation-in-Frequency (DIF)}
+%
+To derive the decimation-in-frequency algorithm we start by separating
+out the lowest order bit of the index $a$. Here is an example for the
+decimation-in-frequency $N=16$ DFT.
+%
+\begin{eqnarray}
+W_{16}^{[a_3 a_2 a_1 a_0][b_3 b_2 b_1 b_0]}
+&=&
+W_{16}^{[a_3 a_2 a_1 a_0][b_2 b_1 b_0]} W_{16}^{[a_3 a_2 a_1 a_0] [b_3
+0 0 0]} \\
+&=&
+W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]} W_{16}^{a_0 [b_2 b_1 b_0]} W_2^{a_0
+b_3} \\
+&=&
+W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]} W_{16}^{a_0 [b_2 b_1 b_0]} (-1)^{a_0 b_3}
+\end{eqnarray}
+%
+By repeating the same type of expansion on the term,
+%
+\begin{equation}
+W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]}
+\end{equation}
+%
+we can reduce the transform to an alternative simple form,
+%
+\begin{equation}
+h(a) =
+\sum_{b_0=0}^1 (-1)^{a_3 b_0} W_4^{a_2 b_0}
+\sum_{b_1=0}^1 (-1)^{a_2 b_1} W_8^{a_1 [b_1 b_0]}
+\sum_{b_2=0}^1 (-1)^{a_1 b_2} W_{16}^{a_0 [b_2 b_1 b_0]}
+\sum_{b_3=0}^1 (-1)^{a_0 b_3} g(b)
+\end{equation}
+%
+To implement this we can again write the sum recursively. In this case
+we do not have the problem of the order of $a$ being bit reversed --
+the calculation can be done in-place using the natural ordering of
+$a$ and $b$,
+%
+\begin{eqnarray}
+g_1([a_0 b_2 b_1 b_0])
+&=&
+W_{16}^{a_0 [b_2 b_1 b_0]}
+\sum_{b_3=0}^1 (-1)^{a_0 b_3} g([b_3 b_2 b_1 b_0]) \\
+g_2([a_0 a_1 b_1 b_0])
+&=&
+W_{8}^{a_1 [b_1 b_0]}
+\sum_{b_2=0}^1 (-1)^{a_1 b_2} g_1([a_0 b_2 b_1 b_0]) \\
+g_3([a_0 a_1 a_2 b_0])
+&=&
+W_{4}^{a_2 b_0}
+\sum_{b_1=0}^1 (-1)^{a_2 b_1} g_2([a_0 a_1 b_1 b_0]) \\
+h(a)
+=
+g_4([a_0 a_1 a_2 a_3])
+&=&
+\sum_{b_0=0}^1 (-1)^{a_3 b_0} g_3([a_0 a_1 a_2 b_0])
+\end{eqnarray}
+%
+The final pass leaves the data for $h(a)$ in bit-reversed order, but
+this is easily fixed by a final bit-reversal of the ordering.
+
+The basic in-place calculation or butterfly for each pass is slightly
+different from the decimation-in-time version,
+%
+\begin{equation}
+\left(
+\begin{array}{c}
+g({\hat a} + {\hat b}) \\
+g({\hat a} + \Delta + {\hat b}) \\
+\end{array}
+\right)
+\leftarrow
+\left(
+\begin{array}{c}
+g({\hat a} + {\hat b}) + g({\hat a} + \Delta + {\hat b})\\
+W_{2\Delta}^{\hat b}
+\left( g({\hat a} + {\hat b}) - g({\hat a} + \Delta + {\hat b}) \right)
+\end{array}
+\right)
+\end{equation}
+%
+In each pass ${\hat b}$ runs from $0 \dots \Delta-1$ and ${\hat
+a}$ runs through $0, 2\Delta, \dots, (N/(2\Delta) -1)\,2\Delta$. For the
+$N=16$ example above the first pass has $\Delta=8$, and on subsequent
+passes $\Delta$ takes the values $4, 2, 1$.
+
+This leads to the canonical radix-2 decimation-in-frequency FFT
+algorithm for $2^n$ data points stored in the array $g(0) \dots
+g(2^n-1)$.
+%
+\begin{algorithm}
+\STATE {$\Delta \Leftarrow 2^{n-1}$}
+\FOR {$\mbox{pass} = 1 \dots n$}
+ \STATE {$W \Leftarrow \exp(-2 \pi i / 2\Delta)$}
+ \FOR {$(b = 0 ; b < \Delta ; b++)$}
+ \FOR{$(a = 0 ; a < N ; a += 2*\Delta)$}
+ \STATE{$t_0 \Leftarrow g(b+a) + g(a+\Delta+b)$}
+ \STATE{$t_1 \Leftarrow W^b \left( g(a+b) - g(a+\Delta+b) \right)$}
+ \STATE{$g(a+b) \Leftarrow t_0$}
+ \STATE{$g(a+\Delta+b) \Leftarrow t_1$}
+ \ENDFOR
+ \ENDFOR
+ \STATE{$\Delta \Leftarrow \Delta/2$}
+\ENDFOR
+\STATE bit-reverse ordering of $g$
+\end{algorithm}
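+
+A C transcription analogous to the decimation-in-time sketch might be
+(again an illustration only, reusing the {\tt bit\_reverse} routine
+given earlier),
+%
+\begin{verbatim}
+#include <complex.h>
+#include <math.h>
+
+void bit_reverse (double complex *g, int N);   /* Gold-Rader, above */
+
+/* In-place radix-2 decimation-in-frequency FFT, N = 2^n. */
+void
+fft_radix2_dif (double complex *g, int N)
+{
+  for (int delta = N / 2; delta >= 1; delta /= 2)   /* N/2, ..., 2, 1 */
+    {
+      double complex w = 1.0;
+      double complex ws = cexp (-2 * M_PI * I / (2 * delta));
+
+      for (int b = 0; b < delta; b++)
+        {
+          for (int a = 0; a < N; a += 2 * delta)
+            {
+              double complex t0 = g[a + b] + g[a + delta + b];
+              double complex t1 = w * (g[a + b] - g[a + delta + b]);
+              g[a + b] = t0;
+              g[a + delta + b] = t1;
+            }
+          w *= ws;                             /* W^{b+1} = W W^b */
+        }
+    }
+
+  bit_reverse (g, N);
+}
+\end{verbatim}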
+%
+
+\section{Self-Sorting Mixed-Radix Complex FFTs}
+%
+This section is based on the review article {\em Self-sorting
+Mixed-Radix Fast Fourier Transforms} by Clive
+Temperton~\cite{temperton83}. You should consult his article for full
+details of all the possible algorithms (there are many
+variations). Here I have annotated the derivation of the simplest
+mixed-radix decimation-in-frequency algorithm.
+
+For general-$N$ FFT algorithms the simple binary-notation of radix-2
+algorithms is no longer useful. The mixed-radix FFT has to be built
+up using products of matrices acting on a data vector. The aim is to
+take the full DFT matrix $W_N$ and factor it into a set of small,
+sparse matrices corresponding to each factor of $N$.
+
+
+We'll denote the components of matrices using either subscripts or
+function notation,
+%
+\begin{equation}
+M_{ij} = M(i,j)
+\end{equation}
+%
+with (C-like) indices running from 0 to $N-1$. Matrix products will be
+denoted using square brackets,
+%
+\begin{equation}
+[AB]_{ij} = \sum_{k} A_{ik} B_{kj}
+\end{equation}
+%
+%
+Three special matrices will be needed in the mixed-radix factorization
+of the DFT: the identity matrix, $I$, a permutation matrix, $P$ and a
+matrix of twiddle factors, $D$, as well as the normal DFT matrices
+$W_n$.
+
+We write the identity matrix of order $r$ as $I_r(n,m)$,
+%
+\begin{equation}
+I_r(n,m) = \delta_{nm}
+\end{equation}
+%
+for $0 \leq n,m \leq r-1$.
+
+We also need to define a permutation matrix $P^a_b$ that performs
+digit reversal of the ordering of a vector. If the index of a vector
+$j= 0\dots N-1$ is factorized into $j = la +m$, with $0 \leq l \leq
+b-1$ and $0 \leq m \leq a-1$ then the operation of the matrix $P$ will
+exchange positions $la+m$ and $mb+l$ in the vector (this sort of
+digit-reversal is the generalization of bit-reversal to a number
+system with bases $a$ and $b$).
+
+In mathematical terms $P$ is a square matrix of size $ab \times ab$
+with the property,
+%
+\begin{eqnarray}
+P^a_b(j,k) &=& 1 ~\mbox{if}~ j=ra+s ~\mbox{and}~ k=sb+r \\
+ &=& 0 ~\mbox{otherwise}
+\end{eqnarray}
+%
+
+Finally the FFT algorithm needs a matrix of twiddle factors, $D^a_b$,
+for the trigonometric sums. $D^a_b$ is a diagonal square matrix of
+size $ab \times ab$ with the definition,
+%
+\begin{eqnarray}
+D^a_b(j,k) &=& \omega^{sr}_{ab} ~\mbox{if}~ j=k=sb+r \\
+ &=& 0 ~\mbox{otherwise}
+\end{eqnarray}
+%
+where $\omega_{ab} = e^{-2\pi i/ab}$.
+
+
+\subsection{The Kronecker Product}
+The Kronecker matrix product plays an important role in all the
+algorithms for combining operations on different subspaces. The
+Kronecker product $A \otimes B$ of two square matrices $A$ and $B$, of
+sizes $a \times a$ and $b \times b$ respectively, is a square matrix
+of size $a b \times a b$, defined as,
+%
+\begin{equation}
+[A \otimes B] (tb+u, rb+s) = A(t,r) B(u,s)
+\end{equation}
+%
+where $0 \leq u,s < b$ and $0 \leq t,r < a$. Let's examine a specific
+example. If we take a $2 \times 2$ matrix and a $3
+\times 3$ matrix,
+%
+\begin{equation}
+\begin{array}{ll}
+A =
+\left(
+\begin{array}{cc}
+a_{11} & a_{12} \\
+a_{21} & a_{22}
+\end{array}
+\right)
+&
+B =
+\left(
+\begin{array}{ccc}
+b_{11} & b_{12} & b_{13} \\
+b_{21} & b_{22} & b_{23} \\
+b_{31} & b_{32} & b_{33}
+\end{array}
+\right)
+\end{array}
+\end{equation}
+%
+then the Kronecker product $A \otimes B$ is,
+%
+\begin{eqnarray}
+A \otimes B &= &
+\left(
+\begin{array}{cc}
+a_{11} B & a_{12} B \\
+a_{21} B & a_{22} B
+\end{array}
+\right) \\
+ &=&
+\left(
+\begin{array}{cccccc}
+a_{11} b_{11} & a_{11} b_{12} & a_{11} b_{13} &
+ a_{12} b_{11} & a_{12} b_{12} & a_{12} b_{13} \\
+a_{11} b_{21} & a_{11} b_{22} & a_{11} b_{23} &
+ a_{12} b_{21} & a_{12} b_{22} & a_{12} b_{23} \\
+a_{11} b_{31} & a_{11} b_{32} & a_{11} b_{33} &
+ a_{12} b_{31} & a_{12} b_{32} & a_{12} b_{33} \\
+a_{21} b_{11} & a_{21} b_{12} & a_{21} b_{13} &
+ a_{22} b_{11} & a_{22} b_{12} & a_{22} b_{13} \\
+a_{21} b_{21} & a_{21} b_{22} & a_{21} b_{23} &
+ a_{22} b_{21} & a_{22} b_{22} & a_{22} b_{23} \\
+a_{21} b_{31} & a_{21} b_{32} & a_{21} b_{33} &
+ a_{22} b_{31} & a_{22} b_{32} & a_{22} b_{33}
+\end{array}
+\right)
+\end{eqnarray}
+%
+When the Kronecker product $A \otimes B$ acts on a vector of length
+$ab$, each matrix operates on a different subspace of the vector.
+Writing the index $i$ as $i=t b + u$, with $0\leq u \leq b-1$
+and $0\leq t\leq a-1$, we can see this explicitly by looking at components,
+%
+\begin{eqnarray}
+[(A \otimes B) v]_{(tb+u)}
+& = & \sum_{t'=0}^{a-1} \sum_{u'=0}^{b-1}
+ [A \otimes B]_{(tb+u,t'b+u')} v_{t'b+u'} \\
+& = & \sum_{t'u'} A_{tt'} B_{uu'} v_{t'b+u'}
+\end{eqnarray}
+%
+The matrix $B$ operates on the ``index'' $u'$, for all values of $t'$, and
+the matrix $A$ operates on the ``index'' $t'$, for all values of $u'$.
+%
+The most important property needed for deriving the FFT factorization
+is that the matrix product of two Kronecker products is the Kronecker
+product of the two matrix products,
+%
+\begin{equation}
+(A \otimes B)(C \otimes D) = (AC \otimes BD)
+\end{equation}
+%
+This follows straightforwardly from the original definition of the
+Kronecker product.
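+
+The component definition translates directly into code. Here is a
+small sketch of a Kronecker product in C (our own helper, with
+matrices stored in row-major order),
+%
+\begin{verbatim}
+#include <complex.h>
+
+/* K = A (x) B for an a-by-a matrix A and a b-by-b matrix B.
+   K has size ab-by-ab with K(t*b+u, r*b+s) = A(t,r) B(u,s). */
+void
+kronecker (const double complex *A, int a,
+           const double complex *B, int b,
+           double complex *K)
+{
+  int n = a * b;
+
+  for (int t = 0; t < a; t++)
+    for (int r = 0; r < a; r++)
+      for (int u = 0; u < b; u++)
+        for (int s = 0; s < b; s++)
+          K[(t * b + u) * n + (r * b + s)] = A[t * a + r] * B[u * b + s];
+}
+\end{verbatim}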
+
+\subsection{Two factor case, $N=ab$}
+%
+First consider the simplest possibility, where the data length $N$ can
+be divided into two factors, $N=ab$. The aim is to reduce the DFT
+matrix $W_N$ into simpler matrices corresponding to each factor. To
+make the derivation easier we will start from the known factorization
+and verify it (the initial factorization can be guessed by
+generalizing from simple cases). Here is the factorization we are
+going to prove,
+%
+\begin{equation}
+W_{ab} = (W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b).
+\end{equation}
+%
+We can check it by expanding the product into components,
+%
+\begin{eqnarray}
+\lefteqn{[(W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b)](la+m,rb+s)} \\
+& = &
+\sum_{u=0}^{b-1} \sum_{t=0}^{a-1}
+[(W_b \otimes I_a)](la+m,ua+t) [P^a_b D^a_b (W_a \otimes I_b)](ua+t,rb+s)
+\end{eqnarray}
+%
+where we have split the indices to match the Kronecker product, $0 \leq
+m, r \leq a-1$, $0 \leq l, s \leq b-1$. The first term in the sum can
+easily be reduced to its component form,
+%
+\begin{eqnarray}
+[(W_b \otimes I_a)](la+m,ua+t)
+&=& W_b(l,u) I_a(m,t) \\
+&=& \omega_b^{lu} \delta_{mt}
+\end{eqnarray}
+%
+The second term is more complicated. We can expand the Kronecker
+product like this,
+\begin{eqnarray}
+(W_a \otimes I_b)(tb+u,rb+s)
+&=& W_a(t,r) I_b(u,s) \\
+&=& \omega_a^{tr} \delta_{us}
+\end{eqnarray}
+%
+and use this term to build up the product, $P^a_b D^a_b (W_a \otimes
+I_b)$. We first multiply by $D^a_b$,
+%
+\begin{equation}
+[D^a_b (W_a \otimes I_b)](tb+u,rb+s)
+=
+\omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
+\end{equation}
+%
+and then apply the permutation matrix, $P^a_b$, which digit-reverses
+the ordering of the first index, to obtain,
+%
+\begin{equation}
+[P^a_b D^a_b (W_a \otimes I_b)](ua+t,rb+s)
+=
+\omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
+\end{equation}
+%
+Combining the two terms in the matrix product we can obtain the full
+expansion in terms of the exponential $\omega$,
+%
+\begin{eqnarray}
+[(W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b)](la+m,rb+s)
+&=&
+\sum_{u=0}^{b-1} \sum_{t=0}^{a-1}
+\omega_b^{lu} \delta_{mt} \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
+\end{eqnarray}
+%
+If we evaluate this sum explicitly we can make the connection between
+the product involving $W_a$ and $W_b$ (above) and the expansion of the
+full DFT matrix $W_{ab}$,
+%
+\begin{eqnarray}
+\sum_{u=0}^{b-1} \sum_{t=0}^{a-1}
+\omega_b^{lu} \delta_{mt} \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
+&=& \omega^{ls}_b \omega^{ms}_{ab} \omega^{mr}_a \\
+&=& \omega^{als + ms +bmr}_{ab} \\
+&=& \omega^{als + ms +bmr}_{ab} \omega^{lrab}_{ab} \quad\mbox{using~} \omega^{ab}_{ab} =1\\
+&=& \omega^{(la+m)(rb+s)}_{ab} \\
+&=& W_{ab}(la+m,rb+s)
+\end{eqnarray}
+%
+The final line shows that the matrix product given above is identical to
+the full two-factor DFT matrix, $W_{ab}$.
+%
+Thus the full DFT matrix $W_{ab}$ for two factors $a$, $b$ can be
+broken down into a product of sub-transforms, $W_a$ and $W_b$, plus
+permutations, $P$, and twiddle factors, $D$, according to the formula,
+%
+\begin{equation}
+W_{ab} = (W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b).
+\end{equation}
+%
+This relation is the foundation of the general-$N$ mixed-radix FFT algorithm.
+
+\subsection{Three factor case, $N=abc$}
+%
+The result for the two-factor expansion can easily be generalized to
+three factors. We first consider $abc$ as being a product of two
+factors $a$ and $(bc)$, and then further expand the product $(bc)$ into
+$b$ and $c$. The first step of the expansion looks like this,
+%
+\begin{eqnarray}
+W_{abc} &=& W_{a(bc)}\\
+&=& (W_{bc} \otimes I_a) P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) .
+\end{eqnarray}
+%
+And after using the two-factor result to expand out $W_{bc}$ we obtain
+the factorization of $W_{abc}$,
+%
+\begin{eqnarray}
+W_{abc} &=& (((W_c \otimes I_b) P^b_c D^b_c (W_b \otimes I_c)) \otimes I_a )
+P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) \\
+&=& (W_c \otimes I_{ab}) (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) P^a_{bc} D^a_{bc} (W_a \otimes I_{bc})
+\end{eqnarray}
+%
+We can write this factorization in a product form, with one term for
+each factor,
+%
+\begin{equation}
+W_{abc} = T_3 T_2 T_1
+\end{equation}
+%
+where we read off $T_1$, $T_2$ and $T_3$,
+%
+\begin{eqnarray}
+T_1 &=& P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) \\
+T_2 &=& (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) \\
+T_3 &=& (W_c \otimes I_{ab} )
+\end{eqnarray}
+%
+
+
+\subsection{General case, $N=f_1 f_2 \dots f_{n_f}$}
+%
+If we continue the procedure that we have used for two- and
+three-factors then a general pattern begins to emerge in the
+factorization of $W_{f_1 f_2 \dots f_{n_f}}$. To see the beginning of
+the pattern we can rewrite the three factor case as,
+%
+\begin{eqnarray}
+T_1 &=& (P^a_{bc} D^a_{bc} \otimes I_1) (W_a \otimes I_{bc}) \\
+T_2 &=& (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) \\
+T_3 &=& (P^c_1 D^c_1 \otimes I_{ab}) (W_c \otimes I_{ab} )
+\end{eqnarray}
+%
+using the special cases $P^c_1 = D^c_1 = I_c$.
+%
+In general, we can write the factorization of $W_N$ for $N= f_1 f_2
+\dots f_{n_f}$ as,
+%
+\begin{equation}
+W_N = T_{n_f} \dots T_2 T_1
+\end{equation}
+%
+where the matrix factors $T_i$ are,
+%
+\begin{equation}
+T_i = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) ( W_{f_i}
+\otimes I_{m_i})
+\end{equation}
+%
+We have defined the following three additional variables $p$, $q$ and
+$m$ to denote different partial products of the factors,
+%
+\begin{eqnarray}
+p_i &=& f_1 f_2 \dots f_i \quad (p_0 = 1) \\
+q_i &=& N / p_i \\
+m_i &=& N / f_i
+\end{eqnarray}
+%
+Note that the FFT modules $W$ are applied before the permutations $P$,
+which makes this a decimation-in-frequency algorithm.
+
+\subsection{Implementation}
+%
+Now to the implementation of the algorithm. We start with a vector of
+data, $z$, as input and want to apply the transform,
+%
+\begin{eqnarray}
+x &=& W_N z \\
+ &=& T_{n_f} \dots T_2 T_1 z
+\end{eqnarray}
+%
+where $T_i = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) (
+W_{f_i} \otimes I_{m_i})$.
+
+The outer structure of the implementation will be a loop over the
+$n_f$ factors, applying each matrix $T_i$ to the vector in turn to
+build up the complete transform.
+%
+\begin{algorithm}
+\FOR{$(i = 1 \dots n_f)$}
+ \STATE{$v \Leftarrow T_i v $}
+\ENDFOR
+\end{algorithm}
+%
+The order of the factors is not important. Now we examine the iteration
+$v \Leftarrow T_i v$, which we'll write as,
+%
+\begin{equation}
+v' =
+(P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) ~
+( W_{f_i} \otimes I_{m_i}) v
+\end{equation}
+%
+There are two Kronecker product matrices in this iteration. The
+rightmost matrix, which is the first to be applied, is a DFT of length
+$f_i$ applied to $N/f_i$ subsets of the data. We'll call this $t$,
+since it will be a temporary array,
+%
+\begin{equation}
+t = (W_{f_i} \otimes I_{m_i}) v
+\end{equation}
+%
+The second matrix applies a permutation and the exponential
+twiddle-factors. We'll call this $v'$, since it is the result of the
+full iteration on $v$,
+%
+\begin{equation}
+v' = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) t
+\end{equation}
+
+The effect of the matrix $(W_{f_i} \otimes I_{m_i})$ is best seen by
+an example. Suppose the factor is $f_i = 3$, and the length of the FFT
+is $N=6$, then the relevant Kronecker product is,
+%
+\begin{equation}
+t = (W_3 \otimes I_2) v
+\end{equation}
+%
+which expands out to,
+%
+\begin{equation}
+\left(
+\begin{array}{c}
+t_0 \\
+t_1 \\
+t_2 \\
+t_3 \\
+t_4 \\
+t_5
+\end{array}
+\right)
+=
+\left(
+\begin{array}{cccccc}
+W_3(0,0) & 0 & W_3(0,1) & 0 & W_3(0,2) & 0 \\
+0 & W_3(0,0) & 0 & W_3(0,1) & 0 & W_3(0,2) \\
+W_3(1,0) & 0 & W_3(1,1) & 0 & W_3(1,2) & 0 \\
+0 & W_3(1,0) & 0 & W_3(1,1) & 0 & W_3(1,2) \\
+W_3(2,0) & 0 & W_3(2,1) & 0 & W_3(2,2) & 0 \\
+0 & W_3(2,0) & 0 & W_3(2,1) & 0 & W_3(2,2)
+\end{array}
+\right)
+\left(
+\begin{array}{c}
+v_0 \\
+v_1 \\
+v_2 \\
+v_3 \\
+v_4 \\
+v_5
+\end{array}
+\right)
+\end{equation}
+%
+We can rearrange the components in a computationally convenient form,
+\begin{equation}
+\left(
+\begin{array}{c}
+t_0 \\
+t_2 \\
+t_4 \\
+t_1 \\
+t_3 \\
+t_5
+\end{array}
+\right)
+=
+\left(
+\begin{array}{cccccc}
+W_3(0,0) & W_3(0,1) & W_3(0,2) & 0 & 0 & 0 \\
+W_3(1,0) & W_3(1,1) & W_3(1,2) & 0 & 0 & 0 \\
+W_3(2,0) & W_3(2,1) & W_3(2,2) & 0 & 0 & 0 \\
+0 & 0 & 0 & W_3(0,0) & W_3(0,1) & W_3(0,2) \\
+0 & 0 & 0 & W_3(1,0) & W_3(1,1) & W_3(1,2) \\
+0 & 0 & 0 & W_3(2,0) & W_3(2,1) & W_3(2,2)
+\end{array}
+\right)
+\left(
+\begin{array}{c}
+v_0 \\
+v_2 \\
+v_4 \\
+v_1 \\
+v_3 \\
+v_5
+\end{array}
+\right)
+\end{equation}
+%
+which clearly shows that we just need to apply the $3\times 3$ DFT
+matrix $W_3$ twice, once to the sub-vector of elements $(v_0, v_2, v_4)$,
+and independently to the remaining sub-vector $(v_1, v_3, v_5)$.
+
+In the general case, if we index $t$ as $t_k = t(\lambda,\mu) =
+t_{\lambda m + \mu}$ then $\lambda = 0 \dots f-1$ is an index within
+each transform of length $f$ and $\mu = 0 \dots m-1$ labels the
+independent subsets of data. We can see this by showing the
+calculation with all indices present,
+%
+\begin{equation}
+t = (W_f \otimes I_m) z
+\end{equation}
+%
+becomes,
+%
+\begin{eqnarray}
+t_{\lambda m + \mu} &=& \sum_{\lambda'=0}^{f-1} \sum_{\mu'=0}^{m-1}
+ (W_f \otimes I_m)_{(\lambda m + \mu)(\lambda' m + \mu')}
+ z_{\lambda'm + \mu'} \\
+&=& \sum_{\lambda'\mu'} (W_f)_{\lambda\lambda'} \delta_{\mu\mu'}
+ z_{\lambda'm+\mu'} \\
+&=& \sum_{\lambda'} (W_f)_{\lambda\lambda'} z_{\lambda'm+\mu}
+\end{eqnarray}
+%
+The DFTs on the index $\lambda$ will be computed using
+special optimized modules for each $f$.
+
+To calculate the next stage,
+%
+\begin{equation}
+v'=(P^f_q D^f_q \otimes I_{p_{i-1}}) t
+\end{equation}
+%
+we note that the Kronecker product has the property of performing
+$p_{i-1}$ independent multiplications of $PD$, each acting on a
+different subset of $t$ with $q_{i-1}$ elements. The index $\mu$ of
+$t(\lambda,\mu)$, which runs from 0 to $m-1$, can be split further
+into $\mu = a p_{i-1} + b$, where $a = 0 \dots q-1$ and $b=0 \dots
+p_{i-1}-1$, since $m=p_{i-1}q$,
+%
+\begin{eqnarray}
+\lambda m + \mu &=& \lambda m + a p_{i-1} + b \\
+ &=& (\lambda q + a) p_{i-1} + b.
+\end{eqnarray}
+%
+Now we can expand the second stage,
+%
+\begin{eqnarray}
+v'&=& (P^f_q D^f_q \otimes I_{p_{i-1}}) t \\
+v'_{\lambda m + \mu} &=& \sum_{\lambda' \mu'}
+ (P^f_q D^f_q \otimes I_{p_{i-1}})_{(\lambda m + \mu)(\lambda' m + \mu')}
+ t_{\lambda' m + \mu'} \\
+v'_{(\lambda q + a) p_{i-1} + b} &=& \sum_{\lambda' a' b'}
+(
+P^f_q D^f_q \otimes I_{p_{i-1}}
+)_{((\lambda q+ a)p_{i-1} + b)((\lambda' q+ a')p_{i-1} + b')}
+t_{(\lambda' q + a')p_{i-1} +b'}
+\end{eqnarray}
+%
+The first step in removing redundant indices is to take advantage of
+the identity matrix $I$ and separate the subspaces of the Kronecker
+product,
+%
+\begin{equation}
+(
+P^f_q D^f_q \otimes I_{p_{i-1}}
+)_{((\lambda q+ a)p_{i-1} + b)((\lambda' q+ a')p_{i-1} + b')}
+=
+(P^f_q D^f_q)_{(\lambda q + a)(\lambda' q + a')}
+\delta_{bb'}
+\end{equation}
+%
+This eliminates one sum, leaving us with,
+%
+\begin{equation}
+v'_{(\lambda q + a) p_{i-1} + b}
+=
+\sum_{\lambda' a' }
+(P^f_q D^f_q)_{(\lambda q + a)(\lambda' q + a')} t_{(\lambda'q+a')p_{i-1} + b}
+\end{equation}
+%
+We can insert the definition of $D^f_q$ to give,
+%
+\begin{equation}
+\phantom{v'_{(\lambda q + a) p_{i-1} + b}}
+= \sum_{\lambda'a'} (P^f_q)_{(\lambda q + a)(\lambda'q + a')}
+\omega^{\lambda'a'}_{q_{i-1}} t_{(\lambda'q+a')p_{i-1}+b}
+\end{equation}
+%
+Using the definition of $P^f_q$, which exchanges an index of $\lambda
+q + a$ with $a f + \lambda$, we get a final result with no matrix
+multiplication,
+%
+\begin{equation}
+v'_{(a f + \lambda) p_{i-1} + b}
+= \omega^{\lambda a}_{q_{i-1}} t_{(\lambda q + a)p_{i-1} + b}
+\end{equation}
+%
+All we have to do is premultiply each element of the temporary vector
+$t$ by an exponential twiddle factor and store the result in another
+index location, according to the digit reversal permutation of $P$.
+
+Here is the algorithm to implement the mixed-radix FFT,
+%
+\begin{algorithm}
+\FOR{$i = 1 \dots n_f$}
+\FOR{$a = 0 \dots q-1$}
+\FOR{$b = 0 \dots p_{i-1} - 1$}
+\FOR{$\lambda = 0 \dots f-1$}
+\STATE{$t_\lambda \Leftarrow
+ \sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') v_{b+\lambda'm+ap_{i-1}}$}
+ \COMMENT{DFT matrix-multiply module}
+\ENDFOR
+\FOR{$\lambda = 0 \dots f-1$}
+\STATE{$v'_{(af+\lambda)p_{i-1}+b}
+ \Leftarrow \omega^{\lambda a}_{q_{i-1}} t_\lambda$}
+\ENDFOR
+\ENDFOR
+\ENDFOR
+\STATE{$v \Leftarrow v'$}
+\ENDFOR
+\end{algorithm}
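+
+Schematically, a single pass of this algorithm might be written in C
+as follows (a simplified sketch: the small-$N$ DFT is evaluated by the
+naive sum rather than an optimized module, and the twiddle factors are
+computed directly instead of being read from a lookup table),
+%
+\begin{verbatim}
+#include <complex.h>
+#include <math.h>
+
+/* One pass v' = T_i v for a factor f, with p_prev = p_{i-1},
+   q = q_i = N/(p_prev f) and m = N/f. */
+void
+fft_complex_pass (const double complex *v, double complex *vout,
+                  int f, int p_prev, int q, int m, int N)
+{
+  for (int a = 0; a < q; a++)
+    {
+      for (int b = 0; b < p_prev; b++)
+        {
+          for (int lambda = 0; lambda < f; lambda++)
+            {
+              /* t_lambda = sum over lambda' of W_f(lambda,lambda') v[..] */
+              double complex t = 0;
+              for (int lp = 0; lp < f; lp++)
+                t += cexp (-2 * M_PI * I * lambda * lp / f)
+                     * v[b + lp * m + a * p_prev];
+
+              /* twiddle omega_{q_{i-1}}^{lambda a}, digit-reversed store */
+              double complex w =
+                cexp (-2 * M_PI * I * lambda * a * p_prev / N);
+              vout[(a * f + lambda) * p_prev + b] = w * t;
+            }
+        }
+    }
+}
+\end{verbatim}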
+%
+\subsection{Details of the implementation}
+%
+First the function {\tt gsl\_fft\_complex\_wavetable\_alloc} allocates
+$N$ elements of scratch space (to hold the vector $v'$ for each
+iteration) and $N$ elements for a trigonometric lookup table of
+twiddle factors.
+
+Then the length $N$ must be factorized. There is a general
+factorization function {\tt gsl\_fft\_factorize} which takes a list of
+preferred factors. It first factors out the preferred factors and then
+extracts any remaining prime factors.
+
+The algorithm used to generate the trigonometric lookup table is
+%
+\begin{algorithm}
+\FOR {$a = 1 \dots n_f$}
+\FOR {$b = 1 \dots f_a - 1$}
+\FOR {$c = 1 \dots q_a$}
+\STATE $\mbox{trig[k++]} = \exp(- 2\pi i b c p_{a-1}/N)$
+\ENDFOR
+\ENDFOR
+\ENDFOR
+\end{algorithm}
+%
+Note that $\sum_{1}^{n_f} \sum_{1}^{f_i-1} \sum_{1}^{q_i} =
+\sum_{1}^{n_f} (f_i-1)q_i = N - 1$ so $N$ elements are always
+sufficient to store the lookup table. This is chosen because we need
+to compute $\omega_{q_{i-1}}^{\lambda a} t_\lambda$ in
+the FFT. In terms of the lookup table we can write this as,
+%
+\begin{eqnarray}
+\omega_{q_{i-1}}^{\lambda a} t_\lambda
+&=& \exp(-2\pi i \lambda a/q_{i-1}) t_\lambda \\
+&=& \exp(-2\pi i \lambda a p_{i-1}/N) t_\lambda \\
+&=& \left\{
+ \begin{array}{ll}
+ t_\lambda & a=0 \\
+ \mbox{trig}[\mbox{twiddle[i]}+\lambda q+(a-1)] t_\lambda & a\not=0
+\end{array}
+\right.
+\end{eqnarray}
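+
+The generation of the table might be transcribed into C like this (a
+sketch; the arrays {\tt f} and {\tt p}, holding the factors $f_i$ and
+the partial products $p_{i-1}$, are assumptions about the surrounding
+code),
+%
+\begin{verbatim}
+#include <complex.h>
+#include <math.h>
+
+/* Fill trig with exp(-2 pi i b c p_{i-1}/N) for each factor f[i],
+   b = 1 .. f[i]-1, c = 1 .. q_i.  Exactly N-1 entries are written. */
+void
+make_trig_table (double complex *trig, const int *f, const int *p,
+                 int nf, int N)
+{
+  int k = 0;
+
+  for (int i = 0; i < nf; i++)
+    {
+      int q = N / (p[i] * f[i]);        /* q_i, with p[i] = p_{i-1} */
+
+      for (int b = 1; b < f[i]; b++)
+        for (int c = 1; c <= q; c++)
+          trig[k++] = cexp (-2 * M_PI * I * b * c * p[i] / N);
+    }
+}
+\end{verbatim}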
+%
+\begin{sloppypar}
+\noindent
+The array {\tt twiddle[i]} maintains a set of pointers into {\tt trig}
+for the starting points for the outer loop. The core of the
+implementation is {\tt gsl\_fft\_complex}. This function loops over
+the chosen factors of $N$, computing the iteration $v'=T_i v$ for each
+pass. When the DFT for a factor is implemented the iteration is
+handed off to a dedicated small-$N$ module, such as {\tt
+gsl\_fft\_complex\_pass\_3} or {\tt
+gsl\_fft\_complex\_pass\_5}. Unimplemented factors are handled
+by the general-$N$ routine {\tt gsl\_fft\_complex\_pass\_n}. The
+structure of one of the small-$N$ modules is a simple transcription of
+the basic algorithm given above. Here is an example for {\tt
+gsl\_fft\_complex\_pass\_3}. For a pass with a factor of 3 we have to
+calculate the following expression,
+\end{sloppypar}%
+\begin{equation}
+v'_{(a f + \lambda) p_{i-1} + b}
+=
+\sum_{\lambda' = 0,1,2}
+\omega^{\lambda a}_{q_{i-1}} W^{\lambda \lambda'}_{3}
+v_{b + \lambda' m + a p_{i-1}}
+\end{equation}
+%
+for $b = 0 \dots p_{i-1} - 1$, $a = 0 \dots q_{i} - 1$ and $\lambda =
+0, 1, 2$. This is implemented as,
+%
+\begin{algorithm}
+\FOR {$a = 0 \dots q-1$}
+\FOR {$b = 0 \dots p_{i-1} - 1$}
+\STATE {$
+ \left(
+ \begin{array}{c}
+ t_0 \\ t_1 \\ t_2
+ \end{array}
+ \right)
+ =
+ \left(
+ \begin{array}{ccc}
+ W^{0}_3 & W^{0}_3 & W^{0}_3 \\
+ W^{0}_3 & W^{1}_3 & W^{2}_3 \\
+ W^{0}_3 & W^{2}_3 & W^{4}_3
+ \end{array}
+ \right)
+ \left(
+ \begin{array}{l}
+ v_{b + a p_{i-1}} \\
+ v_{b + a p_{i-1} + m} \\
+ v_{b + a p_{i-1} +2m}
+ \end{array}
+ \right)$}
+ \STATE {$ v'_{a p_{i} + b} = t_0$}
+ \STATE {$ v'_{a p_{i} + b + p_{i-1}} = \omega^{a}_{q_{i-1}} t_1$}
+ \STATE {$ v'_{a p_{i} + b + 2 p_{i-1}} = \omega^{2a}_{q_{i-1}} t_2$}
+\ENDFOR
+\ENDFOR
+\end{algorithm}
+%
+In the code we use the variables {\tt from0}, {\tt from1}, {\tt from2}
+to index the input locations,
+%
+\begin{eqnarray}
+\mbox{\tt from0} &=& b + a p_{i-1} \\
+\mbox{\tt from1} &=& b + a p_{i-1} + m \\
+\mbox{\tt from2} &=& b + a p_{i-1} + 2m
+\end{eqnarray}
+%
+and the variables {\tt to0}, {\tt to1}, {\tt to2} to index the output
+locations in the scratch vector $v'$,
+%
+\begin{eqnarray}
+\mbox{\tt to0} &=& b + a p_{i} \\
+\mbox{\tt to1} &=& b + a p_{i} + p_{i-1} \\
+\mbox{\tt to2} &=& b + a p_{i} + 2 p_{i-1}
+\end{eqnarray}
+%
+The DFT matrix multiplication is computed using the optimized
+sub-transform modules given in the next section. The twiddle factors
+$\omega^a_{q_{i-1}}$ are taken out of the {\tt trig} array.
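+
+Putting these pieces together, a sketch of the whole factor-3 pass in
+C might look like this (an illustration only, not the actual {\tt
+gsl\_fft\_complex\_pass\_3}; the twiddle factors are computed directly
+rather than taken from {\tt trig}, and the signs follow the forward
+convention $W_N = \exp(-2\pi i/N)$),
+%
+\begin{verbatim}
+#include <complex.h>
+#include <math.h>
+
+/* Factor-3 pass: p_prev = p_{i-1}, q = q_i, m = N/3. */
+void
+fft_pass_3 (const double complex *v, double complex *vp,
+            int p_prev, int q, int m, int N)
+{
+  for (int a = 0; a < q; a++)
+    {
+      /* omega^a and omega^{2a} of order q_{i-1} = N/p_prev */
+      double complex w1 = cexp (-2 * M_PI * I * a * p_prev / N);
+      double complex w2 = w1 * w1;
+
+      for (int b = 0; b < p_prev; b++)
+        {
+          int from0 = b + a * p_prev;          /* input locations */
+          int from1 = from0 + m;
+          int from2 = from0 + 2 * m;
+
+          double complex z0 = v[from0], z1 = v[from1], z2 = v[from2];
+
+          /* N=3 module; for W_N = exp(-2 pi i/N) the rotated
+             outputs are t2 - i t3 and t2 + i t3 */
+          double complex t1 = z1 + z2;
+          double complex t2 = z0 - t1 / 2.0;
+          double complex t3 = sin (M_PI / 3.0) * (z1 - z2);
+
+          int to0 = b + a * 3 * p_prev;        /* p_i = 3 p_{i-1} */
+          int to1 = to0 + p_prev;
+          int to2 = to0 + 2 * p_prev;
+
+          vp[to0] = z0 + t1;
+          vp[to1] = w1 * (t2 - I * t3);
+          vp[to2] = w2 * (t2 + I * t3);
+        }
+    }
+}
+\end{verbatim}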
+
+To compute the inverse transform we go back to the definition of the
+Fourier transform and note that the inverse matrix is just the complex
+conjugate of the forward matrix (with a factor of $1/N$),
+%
+\begin{equation}
+W^{-1}_N = W^*_N / N
+\end{equation}
+%
+Therefore we can easily compute the inverse transform by conjugating
+all the complex elements of the DFT matrices and twiddle factors that
+we apply. (An alternative strategy is to conjugate the input data,
+take a forward transform, and then conjugate the output data).
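+
+In C the alternative strategy is a short wrapper around any forward
+transform (a sketch; {\tt fft\_forward} stands for whichever forward
+routine is in use),
+%
+\begin{verbatim}
+#include <complex.h>
+
+void fft_forward (double complex *v, int N);   /* any forward FFT */
+
+/* Inverse DFT via conjugation: v <- conj(forward(conj(v))) / N. */
+void
+fft_inverse (double complex *v, int N)
+{
+  for (int i = 0; i < N; i++)
+    v[i] = conj (v[i]);
+
+  fft_forward (v, N);
+
+  for (int i = 0; i < N; i++)
+    v[i] = conj (v[i]) / N;
+}
+\end{verbatim}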
+
+\section{Fast Sub-transform Modules}
+%
+To implement the mixed-radix FFT we still need to compute the
+small-$N$ DFTs for each factor. Fortunately many highly-optimized
+small-$N$ modules are available, following the work of Winograd, who
+showed how to derive efficient small-$N$ sub-transforms by number
+theoretic techniques.
+
+The algorithms in this section all compute,
+%
+\begin{equation}
+x_a = \sum_{b=0}^{N-1} W_N^{ab} z_b
+\end{equation}
+%
+The sub-transforms given here are the ones recommended by Temperton
+and differ slightly from the canonical Winograd modules. According to
+Temperton~\cite{temperton83} they are slightly more robust against
+rounding errors and trade off some additions for multiplications.
+%
+For the $N=2$ DFT,
+%
+\begin{equation}
+\begin{array}{ll}
+x_0 = z_0 + z_1, &
+x_1 = z_0 - z_1.
+\end{array}
+\end{equation}
+%
+For the $N=3$ DFT,
+%
+\begin{equation}
+\begin{array}{lll}
+t_1 = z_1 + z_2, &
+t_2 = z_0 - t_1/2, &
+t_3 = \sin(\pi/3) (z_1 - z_2),
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{lll}
+x_0 = z_0 + t_1, &
+x_1 = t_2 + i t_3, &
+x_2 = t_2 - i t_3.
+\end{array}
+\end{equation}
+%
+The $N=4$ transform involves only additions and subtractions,
+%
+\begin{equation}
+\begin{array}{llll}
+t_1 = z_0 + z_2, &
+t_2 = z_1 + z_3, &
+t_3 = z_0 - z_2, &
+t_4 = z_1 - z_3,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{llll}
+x_0 = t_1 + t_2, &
+x_1 = t_3 + i t_4, &
+x_2 = t_1 - t_2, &
+x_3 = t_3 - i t_4.
+\end{array}
+\end{equation}
+%
+For the $N=5$ DFT,
+%
+\begin{equation}
+\begin{array}{llll}
+t_1 = z_1 + z_4, &
+t_2 = z_2 + z_3, &
+t_3 = z_1 - z_4, &
+t_4 = z_2 - z_3,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{llll}
+t_5 = t_1 + t_2, &
+t_6 = (\sqrt{5}/4) (t_1 - t_2), &
+t_7 = z_0 - t_5/4, \\
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{llll}
+t_8 = t_7 + t_6, &
+t_9 = t_7 - t_6, \\
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{llll}
+t_{10} = \sin(2\pi/5) t_3 + \sin(2\pi/10) t_4, &
+t_{11} = \sin(2\pi/10) t_3 - \sin(2\pi/5) t_4,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{llll}
+x_0 = z_0 + t_5,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{llll}
+x_1 = t_8 + i t_{10}, &
+x_2 = t_9 + i t_{11},
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{llll}
+x_3 = t_9 - i t_{11}, &
+x_4 = t_8 - i t_{10}.
+\end{array}
+\end{equation}
+%
+The DFT matrix for $N=6$ can be written as a combination of $N=3$ and
+$N=2$ transforms with index permutations,
+%
+\begin{equation}
+\left(
+\begin{array}{c}
+x_0 \\
+x_4 \\
+x_2 \\
+\hline x_3 \\
+x_1 \\
+x_5
+\end{array}
+\right)
+=
+\left(
+\begin{array}{ccc|ccc}
+ & & & & & \\
+ &W_3& & &W_3& \\
+ & & & & & \\
+\hline & & & & & \\
+ &W_3& & &-W_3& \\
+ & & & & &
+\end{array}
+\right)
+\left(
+\begin{array}{c}
+z_0 \\
+z_2 \\
+z_4 \\
+\hline z_3 \\
+z_5 \\
+z_1
+\end{array}
+\right)
+\end{equation}
+%
+This simplification is an example of the Prime Factor Algorithm, which
+can be used because the factors 2 and 3 are mutually prime. For more
+details consult one of the books on number theory for
+FFTs~\cite{elliott82,blahut}. We can take advantage of the simple
+indexing scheme of the PFA to write the $N=6$ DFT as,
+%
+\begin{equation}
+\begin{array}{lll}
+t_1 = z_2 + z_4, &
+t_2 = z_0 - t_1/2, &
+t_3 = \sin(\pi/3) (z_2 - z_4),
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{lll}
+t_4 = z_5 + z_1, &
+t_5 = z_3 - t_4/2, &
+t_6 = \sin(\pi/3) (z_5 - z_1),
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{lll}
+t_7 = z_0 + t_1, &
+t_8 = t_2 + i t_3, &
+t_9 = t_2 - i t_3,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{lll}
+t_{10} = z_3 + t_4, &
+t_{11} = t_5 + i t_6, &
+t_{12} = t_5 - i t_6,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{lll}
+x_0 = t_7 + t_{10}, &
+x_4 = t_8 + t_{11}, &
+x_2 = t_9 + t_{12},
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{lll}
+x_3 = t_7 - t_{10}, &
+x_1 = t_8 - t_{11}, &
+x_5 = t_9 - t_{12}.
+\end{array}
+\end{equation}
+
+For any remaining general factors we use Singleton's efficient method
+for computing a DFT~\cite{singleton}. Although it is an $O(N^2)$
+algorithm it does reduce the number of multiplications by a factor of
+4 compared with a naive evaluation of the DFT. If we look at the
+general structure of a DFT matrix, shown schematically below,
+%
+\begin{equation}
+\left(
+\begin{array}{c}
+h_0 \\
+h_1 \\
+h_2 \\
+\vdots \\
+h_{N-2} \\
+h_{N-1}
+\end{array}
+\right)
+=
+\left(
+\begin{array}{cccccc}
+1 & 1 & 1 & \cdots & 1 & 1 \\
+1 & W & W & \cdots & W & W \\
+1 & W & W & \cdots & W & W \\
+\vdots & \vdots & \vdots & \cdots & \vdots & \vdots \\
+1 & W & W & \cdots & W & W \\
+1 & W & W & \cdots & W & W
+\end{array}
+\right)
+\left(
+\begin{array}{c}
+g_0 \\
+g_1 \\
+g_2 \\
+\vdots \\
+g_{N-2} \\
+g_{N-1}
+\end{array}
+\right)
+\end{equation}
+%
+we see that the outer elements of the DFT matrix are all unity. We can
+remove these trivial multiplications but we will still be left with an
+$(N-1) \times (N-1)$ sub-matrix of complex entries, which would appear
+to require $(N-1)^2$ complex multiplications. Singleton's method
+uses symmetries of the DFT matrix to convert the complex
+multiplications to an equivalent number of real multiplications. We
+start with the definition of the DFT in component form,
+%
+\begin{equation}
+a_k + i b_k = \sum_{j=0}^{f-1} (x_j+iy_j)(\cos(2\pi jk/f) - i\sin(2\pi jk/f))
+\end{equation}
+%
+The zeroth component can be computed using only additions,
+%
+\begin{equation}
+a_0 + i b_0 = \sum_{j=0}^{f-1} (x_j + i y_j)
+\end{equation}
+%
+We can rewrite the remaining components as,
+%
+\begin{eqnarray}
+a_k + i b_k & = x_0 + i y_0 & +
+ \sum_{j=1}^{(f-1)/2} (x_j + x_{f-j}) \cos(2\pi jk/f)
+ + (y_j - y_{f-j}) \sin(2\pi jk/f) \\
+& & + i\sum_{j=1}^{(f-1)/2} (y_j + y_{f-j}) \cos(2\pi jk/f)
+ - (x_j - x_{f-j}) \sin(2\pi jk/f)
+\end{eqnarray}
+%
+by using the following trigonometric identities,
+%
+\begin{eqnarray}
+ \cos(2\pi(f-j)k/f) &=& \phantom{-}\cos(2\pi jk/f) \\
+ \sin(2\pi(f-j)k/f) &=& -\sin(2\pi jk/f)
+\end{eqnarray}
+%
+These remaining components can all be computed using four partial
+sums,
+%
+\begin{eqnarray}
+a_k + i b_k & = & (a^+_k - a^-_k) + i (b^+_k + b^-_k) \\
+a_{f-k} + i b_{f-k} & = & (a^+_k + a^-_k) + i (b^+_k - b^-_k)
+\end{eqnarray}
+%
+for $k = 1, 2, \dots, (f-1)/2$, where,
+%
+\begin{eqnarray}
+a^+_k &=& x_0 + \sum_{j=1}^{(f-1)/2} (x_j + x_{f-j}) \cos(2\pi jk/f) \\
+a^-_k &=& \phantom{x_0} - \sum_{j=1}^{(f-1)/2} (y_j - y_{f-j}) \sin(2\pi jk/f) \\
+b^+_k &=& y_0 + \sum_{j=1}^{(f-1)/2} (y_j + y_{f-j}) \cos(2\pi jk/f) \\
+b^-_k &=& \phantom{y_0} - \sum_{j=1}^{(f-1)/2} (x_j - x_{f-j}) \sin(2\pi jk/f)
+\end{eqnarray}
+%
+Note that the higher components $k'=f-k$ can be obtained directly
+without further computation once $a^+$, $a^-$, $b^+$ and $b^-$ are
+known. There are $4 \times (f-1)/2$ different sums, each involving
+$(f-1)/2$ real multiplications, giving a total of $(f-1)^2$ real
+multiplications instead of $(f-1)^2$ complex multiplications. Since
+each complex multiplication costs four real multiplications this is
+the factor of four saving mentioned above.
+
+To implement Singleton's method we make use of the input and output
+vectors $v$ and $v'$ as scratch space, copying data back and forth
+between them to obtain the final result. First we use $v'$ to store
+the terms of the symmetrized and anti-symmetrized vectors of the form
+$x_j + x_{f-j}$ and $x_j - x_{f-j}$. Then we multiply these by the
+appropriate trigonometric factors to compute the partial sums $a^+$,
+$a^-$, $b^+$ and $b^-$, storing the results $a_k + i b_k$ and $a_{f-k}
++ i b_{f-k}$ back in $v$. Finally we multiply the DFT output by any
+necessary twiddle factors and place the results in $v'$.
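+In C the partial sums might be computed as follows (a sketch of the
+equations above for a single odd factor $f$: the trigonometric
+factors are computed directly with {\tt cos} and {\tt sin} rather
+than read from the {\tt trig} array, and the copying between $v$ and
+$v'$ is left out),
+%
+\begin{verbatim}
+#include <stddef.h>
+#include <math.h>
+
+/* Singleton's method for one odd factor f: x,y hold the real and
+   imaginary parts of the input, a,b receive the output */
+void
+dft_singleton (const double x[], const double y[],
+               double a[], double b[], size_t f)
+{
+  size_t j, k;
+
+  a[0] = 0.0; b[0] = 0.0;
+  for (j = 0; j < f; j++)       /* zeroth component, additions only */
+    {
+      a[0] += x[j];
+      b[0] += y[j];
+    }
+
+  for (k = 1; k <= (f - 1) / 2; k++)
+    {
+      double ap = x[0], am = 0.0, bp = y[0], bm = 0.0;
+
+      for (j = 1; j <= (f - 1) / 2; j++)
+        {
+          double c = cos (2 * M_PI * j * k / f);
+          double s = sin (2 * M_PI * j * k / f);
+          ap += (x[j] + x[f - j]) * c;    /* a+_k */
+          am -= (y[j] - y[f - j]) * s;    /* a-_k */
+          bp += (y[j] + y[f - j]) * c;    /* b+_k */
+          bm -= (x[j] - x[f - j]) * s;    /* b-_k */
+        }
+
+      a[k] = ap - am;      b[k] = bp + bm;
+      a[f - k] = ap + am;  b[f - k] = bp - bm;
+    }
+}
+\end{verbatim}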
+
+\section{FFTs for real data}
+%
+This section is based on the articles {\em Fast Mixed-Radix Real
+ Fourier Transforms} by Clive Temperton~\cite{temperton83real} and
+{\em Real-Valued Fast Fourier Transform Algorithms} by Sorensen,
+Jones, Heideman and Burrus~\cite{burrus87real}. The DFT of a real
+sequence has a special symmetry, called a {\em conjugate-complex} or
+{\em half-complex} symmetry,
+%
+\begin{equation}
+h(a) = h(N-a)^*
+\end{equation}
+%
+The element $h(0)$ is real, and when $N$ is even $h(N/2)$ is also
+real. It is straightforward to prove the symmetry,
+%
+\begin{eqnarray}
+h(a) &=& \sum W^{ab}_N g(b) \\
+h(N-a)^* &=& \sum W^{-(N-a)b}_N g(b)^* \\
+ &=& \sum W^{-Nb}_N W^{ab}_N g(b) \qquad{(W^N_N=1)} \\
+ &=& \sum W^{ab}_N g(b)
+\end{eqnarray}
+%
+Real-valued data is very common in practice (perhaps more common than
+complex data) so it is worth having efficient FFT routines for real
+data. In principle an FFT for real data should need half the
+operations of an FFT on the equivalent complex data (where the
+imaginary parts are set to zero). There are two different strategies
+for computing FFTs of real-valued data:
+
+One strategy is to ``pack'' the real data (of length $N$) into a
+complex array (of length $N/2$) by index transformations. A complex
+FFT routine can then be used to compute the transform of that array.
+By further index transformations the result can actually be
+``unpacked'' to the FFT of the original real data. It is also possible
+to do two real FFTs simultaneously by packing one in the real part and
+the other in the imaginary part of the complex array. These
+techniques have some disadvantages. The packing and unpacking
+procedures always add $O(N)$ operations, and packing a real array of
+length $N$ into a complex array of length $N/2$ is only possible if
+$N$ is even. In addition, if two unconnected datasets with very
+different magnitudes are packed together in the same FFT there could
+be ``cross-talk'' between them due to a loss of precision.
+
+A more straightforward strategy is to start with an FFT algorithm,
+such as the complex mixed-radix algorithm, and prune out all the
+operations involving the zero imaginary parts of the initial data. The
+FFT is linear so the imaginary part of the data can be decoupled from
+the real part. This procedure leads to a dedicated FFT for real-valued
+data which works for any length and does not perform any unnecessary
+operations. It also allows us to derive a corresponding inverse FFT
+routine which transforms a half-complex sequence back into real data.
+
+\subsection{Radix-2 FFTs for real data}
+%
+Before embarking on the full mixed-radix real FFT we'll start with the
+radix-2 case. It contains all the essential features of the
+general-$N$ algorithm. To make it easier to see the analogy between
+the two we will use the mixed-radix notation to describe the
+factors. The factors are all 2,
+%
+\begin{equation}
+f_1 = 2, f_2 = 2, \dots, f_{n_f} = 2
+\end{equation}
+%
+and the products $p_i$ are powers of 2,
+%
+\begin{eqnarray}
+p_0 & = & 1 \\
+p_1 & = & f_1 = 2 \\
+p_2 & = & f_1 f_2 = 4 \\
+\dots &=& \dots \\
+p_i & = & f_1 f_2 \dots f_i = 2^i
+\end{eqnarray}
+%
+Using this notation we can rewrite the radix-2 decimation-in-time
+algorithm as,
+%
+\begin{algorithm}
+\STATE bit-reverse ordering of $g$
+\FOR {$i = 1 \dots n$}
+ \FOR {$a = 0 \dots p_{i-1} - 1$}
+ \FOR{$b = 0 \dots q_i - 1$}
+ \STATE{$
+ \left(
+ \begin{array}{c}
+ g(b p_i + a) \\
+ g(b p_i + p_{i-1} + a)
+ \end{array}
+ \right)
+ =
+ \left(
+ \begin{array}{c}
+ g(b p_i + a) + W^a_{p_i} g(b p_i + p_{i-1} + a) \\
+ g(b p_i + a) - W^a_{p_i} g(b p_i + p_{i-1} + a)
+ \end{array}
+ \right) $}
+ \ENDFOR
+ \ENDFOR
+\ENDFOR
+\end{algorithm}
+%
+where we have used $p_i = 2 \Delta$, and factored $2 \Delta$ out of
+the original definition of $b$ ($b \to b p_i$).
+
+If we go back to the original recurrence relations we can see how to
+write the intermediate results in a way which make the
+real/half-complex symmetries explicit at each step. The first pass is
+just a set of FFTs of length-2 on real values,
+%
+\begin{equation}
+g_1([b_0 b_1 b_2 a_0]) = \sum_{b_3} W^{a_0 b_3}_2 g([b_0 b_1 b_2 b_3])
+\end{equation}
+%
+Using the symmetry $FFT(x)_k = FFT(x)^*_{N-k}$ we have the reality
+condition,
+%
+\begin{eqnarray}
+g_1([b_0 b_1 b_2 0]) &=& \mbox{real} \\
+g_1([b_0 b_1 b_2 1]) &=& \mbox{real'}
+\end{eqnarray}
+%
+In the next pass we have a set of length-4 FFTs on the original data,
+%
+\begin{eqnarray}
+g_2([b_0 b_1 a_1 a_0])
+&=&
+\sum_{b_2}\sum_{b_3}
+W^{[a_1 a_0]b_2}_4 W^{a_0 b_3}_2
+g([b_0 b_1 b_2 b_3]) \\
+&=&
+\sum_{b_2}\sum_{b_3}
+W^{[a_1 a_0][b_3 b_2]}_4
+g([b_0 b_1 b_2 b_3])
+\end{eqnarray}
+%
+This time symmetry gives us the following conditions on the
+transformed data,
+%
+\begin{eqnarray}
+g_2([b_0 b_1 0 0]) &=& \mbox{real} \\
+g_2([b_0 b_1 0 1]) &=& x + i y \\
+g_2([b_0 b_1 1 0]) &=& \mbox{real'} \\
+g_2([b_0 b_1 1 1]) &=& x - i y
+\end{eqnarray}
+%
+We can see a pattern emerging here: the $i$-th pass computes a set of
+independent length-$2^i$ FFTs on the original real data,
+%
+\begin{eqnarray}
+g_i ( b p_i + a ) = \sum_{a' = 0}^{p_i-1} W_{p_i}^{aa'} g(b p_i + a')
+\quad
+\mbox{for $b = 0 \dots q_i - 1$}
+\end{eqnarray}
+%
+As a consequence we can apply the symmetry for an FFT of real data
+to all the intermediate results -- not just the final result.
+In general after the $i$-th pass we will
+have the symmetry,
+%
+\begin{eqnarray}
+g_i(b p_i) &=& \mbox{real} \\
+g_i(b p_i + a) &=& g_i(b p_i + p_i - a)^* \qquad a = 1 \dots p_{i}/2 - 1 \\
+g_i(b p_i + p_{i}/2) &=& \mbox{real'}
+\end{eqnarray}
+%
+In the next section we'll show that this is a general property of
+decimation-in-time algorithms. The same is not true for the
+decimation-in-frequency algorithm, which does not have any simple
+symmetries in the intermediate results.
+
+Since we can obtain the values of $g_i(b p_i + a)$ for $a > p_i/2$
+from the values for $a < p_i/2$ we can cut our computation and
+storage in half compared with the full-complex case.
+%
+We can easily rewrite the algorithm to show how the computation can be
+halved, simply by limiting all terms to involve only values for $a
+\leq p_{i}/2$. Whenever we encounter a term $g_i(b p_i + a)$ with $a >
+p_{i}/2$ we rewrite it in terms of its complex symmetry partner,
+$g_i(b p_i + a')^*$, where $a' = p_i - a$. The butterfly computes two
+values for each value of $a$, $b p_i + a$ and $b p_i + p_{i-1} - a$,
+so we actually only need to compute from $a = 0$ to $p_{i-1}/2$. This
+gives the following algorithm,
+%
+\begin{algorithm}
+\FOR {$a = 0 \dots p_{i-1}/2$}
+ \FOR{$b = 0 \dots q_i - 1$}
+ \STATE{$
+ \left(
+ \begin{array}{c}
+ g(b p_i + a) \\
+ g(b p_i + p_{i-1} - a)^*
+ \end{array}
+ \right)
+ =
+ \left(
+ \begin{array}{c}
+ g(b p_{i} + a) + W^a_{p_i} g(b p_i + p_{i-1} + a) \\
+ g(b p_{i} + a) - W^a_{p_i} g(b p_i + p_{i-1} + a)
+ \end{array}
+ \right) $}
+ \ENDFOR
+ \ENDFOR
+\end{algorithm}
+%
+Although we have halved the number of operations we also need a
+storage arrangement which will halve the memory requirement. The
+algorithm above is still formulated in terms of a complex array $g$,
+but the input to our routine will naturally be an array of $N$ real
+values which we want to use in-place.
+
+Therefore we need a storage scheme which lays out the real and
+imaginary parts within the real array, in a natural way so that there
+is no need for complicated index calculations. In the radix-2
+algorithm we do not have any additional scratch space. The storage
+scheme has to be designed to accommodate the in-place calculation
+taking account of dual node pairs.
+
+Here is a scheme which takes these restrictions into account: On the
+$i$-th pass we store the real part of $g(b p_i + a)$ in location $b
+p_i + a$. We store the imaginary part in location $b p_i + p_i -
+a$. This is the redundant location which corresponds to the conjugate
+term $g(b p_i + a)^* = g(b p_i + p_i -a)$, so it is not needed. When
+the results are purely real (as in the cases $a = 0$ and $a = p_i/2$)
+we store only the real part and drop the zero imaginary part.
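+In C the index calculations of this scheme reduce to two trivial
+helpers (a sketch with illustrative names, valid for the pass with
+product $p$),
+%
+\begin{verbatim}
+#include <stddef.h>
+
+/* location of the real and imaginary parts of g(b*p + a),
+   for 0 < a < p/2; the purely real terms a = 0 and a = p/2
+   occupy the single locations b*p and b*p + p/2 */
+size_t real_loc (size_t b, size_t a, size_t p) { return b * p + a; }
+size_t imag_loc (size_t b, size_t a, size_t p) { return b * p + (p - a); }
+\end{verbatim}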
+
+This storage scheme has to work in-place, because the radix-2 routines
+should not use any scratch space. We will now check the in-place
+property for each butterfly. A crucial point is that the scheme is
+pass-dependent. Namely, when we are computing the result for pass $i$
+we are reading the results of pass $i-1$, and we must access them
+using the scheme from the previous pass, i.e. we have to remember that
+the results from the previous pass were stored using $b p_{i-1} + a$,
+not $b p_i + a$, and the symmetry for these results will be $g_{i-1}(b
+p_{i-1} + a) = g_{i-1}(b p_{i-1} + p_{i-1} - a)^*$. To take this into
+account we'll write the right hand side of the iteration, $g_{i-1}$,
+in terms of $p_{i-1}$. For example, instead of $b p_i$, which occurs
+naturally in $g_i(b p_i + a)$ we will use $2 b p_{i-1}$, since $p_i =
+2 p_{i-1}$.
+
+Let's start with the butterfly for $a = 0$,
+%
+\begin{equation}
+\left(
+\begin{array}{c}
+g(b p_i) \\
+g(b p_i + p_{i-1})^*
+\end{array}
+\right)
+=
+\left(
+\begin{array}{c}
+g(2 b p_{i-1}) + g((2 b + 1) p_{i-1}) \\
+g(2 b p_{i-1}) - g((2 b + 1) p_{i-1})
+\end{array}
+\right)
+\end{equation}
+%
+By the symmetry $g_{i-1}(b p_{i-1} + a) = g_{i-1}(b p_{i-1} + p_{i-1}
+- a)^*$ all the inputs are purely real. The input $g(2 b p_{i-1})$ is
+read from location $2 b p_{i-1}$ and $g((2 b + 1) p_{i-1})$ is read
+from the location $(2 b + 1) p_{i-1}$. Here is the full breakdown,
+%
+\begin{center}
+\renewcommand{\arraystretch}{1.5}
+\begin{tabular}{|l|lll|}
+\hline Term & & Location & \\
+\hline
+$g(2 b p_{i-1})$
+ & real part & $2 b p_{i-1} $ &$= b p_i$ \\
+ & imag part & --- & \\
+$g((2 b+1) p_{i-1})$
+ & real part & $(2 b + 1) p_{i-1} $&$= b p_i + p_{i-1} $ \\
+ & imag part & --- & \\
+\hline
+$g(b p_{i})$
+ & real part & $b p_i$ &\\
+ & imag part & --- & \\
+$g(b p_{i} + p_{i-1})$
+ & real part & $b p_i + p_{i-1}$& \\
+ & imag part & --- & \\
+\hline
+\end{tabular}
+\end{center}
+%
+The conjugation of the output term $g(b p_i + p_{i-1})^*$ is
+irrelevant here since the results are purely real. The real results
+are stored in locations $b p_i$ and $b p_i + p_{i-1}$, which
+overwrites the inputs in-place.
+
+The general butterfly for $a = 1 \dots p_{i-1}/2 - 1$ is,
+%
+\begin{equation}
+\left(
+\begin{array}{c}
+g(b p_i + a) \\
+g(b p_i + p_{i-1} - a)^*
+\end{array}
+\right)
+=
+\left(
+\begin{array}{c}
+g(2 b p_{i-1} + a) + W^a_{p_i} g((2 b + 1) p_{i-1} + a) \\
+g(2 b p_{i-1} + a) - W^a_{p_i} g((2 b + 1) p_{i-1} + a)
+\end{array}
+\right)
+\end{equation}
+%
+All the terms are complex. To store a conjugated term like $g(b' p_i +
+a')^*$ we take the real part and store it in location $b' p_i + a'$,
+then take the imaginary part, negate it, and store the result in
+location $b' p_i + p_i - a'$.
+
+Here is the full breakdown of the inputs and outputs from the
+butterfly,
+%
+\begin{center}
+\renewcommand{\arraystretch}{1.5}
+\begin{tabular}{|l|lll|}
+\hline Term & & Location & \\
+\hline
+$g(2 b p_{i-1} + a)$
+ & real part & $2 b p_{i-1} + a $ &$= b p_i + a$ \\
+ & imag part & $2 b p_{i-1} + p_{i-1} - a$ &$= b p_i + p_{i-1} - a$ \\
+$g((2 b+1) p_{i-1} + a)$
+ & real part & $(2 b+1) p_{i-1} + a $&$= b p_i + p_{i-1} + a $ \\
+ & imag part & $(2 b+1) p_{i-1} + p_{i-1} - a $&$= b p_i + p_i - a$\\
+\hline
+$g(b p_{i} + a)$
+ & real part & $b p_i + a$ &\\
+ & imag part & $b p_i + p_i - a$& \\
+$g(b p_{i} + p_{i-1} - a)$
+ & real part & $b p_i + p_{i-1} - a$& \\
+ & imag part & $b p_i + p_{i-1} + a$& \\
+\hline
+\end{tabular}
+\end{center}
+%
+By comparing the input locations and output locations we can see
+that the calculation is done in place.
+
+The final butterfly for $a = p_{i-1}/2$ is,
+%
+\begin{equation}
+\left(
+\begin{array}{c}
+g(b p_i + p_{i-1}/2) \\
+g(b p_i + p_{i-1} - p_{i-1}/2)^*
+\end{array}
+\right)
+=
+\left(
+\begin{array}{c}
+g(2 b p_{i-1} + p_{i-1}/2) - i g((2 b + 1) p_{i-1} + p_{i-1}/2) \\
+g(2 b p_{i-1} + p_{i-1}/2) + i g((2 b + 1) p_{i-1} + p_{i-1}/2)
+\end{array}
+\right)
+\end{equation}
+%
+where we have substituted for the twiddle factor, $W^a_{p_i} = -i$,
+%
+\begin{eqnarray}
+W^{p_{i-1}/2}_{p_i} &=& \exp(-2\pi i (p_{i-1}/2) / p_i) \\
+ &=& \exp(-2\pi i /4) \\
+ &=& -i
+\end{eqnarray}
+%
+For this butterfly the second line is just the conjugate of the first,
+because $p_{i-1} - p_{i-1}/2 = p_{i-1}/2$. Therefore we only need to
+consider the first line. The breakdown of the inputs and outputs is,
+%
+\begin{center}
+\renewcommand{\arraystretch}{1.5}
+\begin{tabular}{|l|lll|}
+\hline Term & & Location & \\
+\hline
+$g(2 b p_{i-1} + p_{i-1}/2)$
+ & real part & $2 b p_{i-1} + p_{i-1}/2 $ &$= b p_i + p_{i-1}/2$ \\
+ & imag part & --- & \\
+$g((2 b + 1) p_{i-1} + p_{i-1}/2)$
+ & real part & $(2 b + 1) p_{i-1} + p_{i-1}/2 $&$= b p_i + p_{i} - p_{i-1}/2 $ \\
+ & imag part & --- & \\
+\hline
+$g(b p_{i} + p_{i-1}/2)$
+ & real part & $b p_i + p_{i-1}/2$ &\\
+ & imag part & $b p_i + p_i - p_{i-1}/2$& \\
+\hline
+\end{tabular}
+\end{center}
+%
+By comparing the locations of the inputs and outputs with the
+operations in the butterfly we find that this computation is very
+simple: the effect of the butterfly is to negate the location $b p_i +
+p_i - p_{i-1}/2$ and leave other locations unchanged. This is clearly
+an in-place operation.
+
+Here is the radix-2 algorithm for real data, in full, with the cases
+of $a=0$, $a=1\dots p_{i-1}/2 - 1$ and $a = p_{i-1}/2$ in separate
+blocks,
+%
+\begin{algorithm}
+\STATE bit-reverse ordering of $g$
+\FOR {$i = 1 \dots n$}
+ \FOR{$b = 0 \dots q_i - 1$}
+ \STATE{$\left(
+ \begin{array}{c}
+ g(b p_i) \\
+ g(b p_i + p_{i-1})
+ \end{array}
+ \right)
+ \Leftarrow
+ \left(
+ \begin{array}{c}
+ g(b p_{i}) + g(b p_{i} + p_{i-1}) \\
+ g(b p_{i}) - g(b p_{i} + p_{i-1})
+ \end{array}
+ \right)$}
+ \ENDFOR
+
+ \FOR {$a = 1 \dots p_{i-1}/2 - 1$}
+ \FOR{$b = 0 \dots q_i - 1$}
+ \STATE{$(\Real z_0, \Imag z_0) \Leftarrow
+ (g(b p_i + a), g(b p_i + p_{i-1} - a))$}
+ \STATE{$(\Real z_1, \Imag z_1) \Leftarrow
+ (g(b p_i + p_{i-1} + a), g(b p_i + p_{i} - a))$}
+ \STATE{$t_0 \Leftarrow z_0 + W^a_{p_i} z_1$}
+ \STATE{$t_1 \Leftarrow z_0 - W^a_{p_i} z_1$}
+    \STATE{$(g(b p_{i} + a), g(b p_{i} + p_i - a)) \Leftarrow
+ (\Real t_0, \Imag t_0)$}
+ \STATE{$(g(b p_{i} + p_{i-1} - a), g(b p_{i} + p_{i-1} + a))
+ \Leftarrow
+ (\Real t_1, -\Imag t_1)$}
+ \ENDFOR
+ \ENDFOR
+
+ \FOR{$b = 0 \dots q_i - 1$}
+    \STATE{$g(b p_{i} + p_i - p_{i-1}/2) \Leftarrow -g(b p_{i} + p_i - p_{i-1}/2)$}
+ \ENDFOR
+
+\ENDFOR
+\end{algorithm}
+%
+We split the loop over $a$ into three parts, $a=0$, $a=1\dots
+p_{i-1}/2-1$ and $a = p_{i-1}/2$ for efficiency. When $a=0$ we have
+$W^a_{p_i}=1$ so we can eliminate a complex multiplication within the
+loop over $b$. When $a=p_{i-1}/2$ we have $W^a_{p_i} = -i$ which does
+not require a full complex multiplication either.
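+As an illustration, one pass of this algorithm might be written in C
+as follows (a sketch only: it assumes the bit-reversal has already
+been applied, and it computes the twiddle factors with {\tt cos} and
+{\tt sin} instead of reading them from a precomputed trig array),
+%
+\begin{verbatim}
+#include <stddef.h>
+#include <math.h>
+
+/* one pass of the radix-2 real FFT: p = 2^i, pm = p/2, q = N/p.
+   The array g[] holds N real values in the storage scheme above. */
+void
+real_radix2_pass (double g[], size_t p, size_t q)
+{
+  size_t pm = p / 2, a, b;
+
+  for (b = 0; b < q; b++)              /* a = 0 : real butterfly */
+    {
+      double g0 = g[b * p], g1 = g[b * p + pm];
+      g[b * p] = g0 + g1;
+      g[b * p + pm] = g0 - g1;
+    }
+
+  for (a = 1; a < pm / 2; a++)         /* general butterflies */
+    {
+      double w_r = cos (2 * M_PI * a / p);      /* W^a_p */
+      double w_i = -sin (2 * M_PI * a / p);
+
+      for (b = 0; b < q; b++)
+        {
+          double z0_r = g[b * p + a],      z0_i = g[b * p + pm - a];
+          double z1_r = g[b * p + pm + a], z1_i = g[b * p + p - a];
+          double wz_r = w_r * z1_r - w_i * z1_i;   /* W^a_p z1 */
+          double wz_i = w_r * z1_i + w_i * z1_r;
+
+          g[b * p + a]      = z0_r + wz_r;     /* t0 */
+          g[b * p + p - a]  = z0_i + wz_i;
+          g[b * p + pm - a] = z0_r - wz_r;     /* conjugate of t1 */
+          g[b * p + pm + a] = -(z0_i - wz_i);
+        }
+    }
+
+  if (pm >= 2)                         /* a = pm/2 : negate one term */
+    for (b = 0; b < q; b++)
+      g[b * p + p - pm / 2] = -g[b * p + p - pm / 2];
+}
+\end{verbatim}
+%
+The full transform applies this pass for $p = 2, 4, 8, \dots, N$,
+with $q = N/p$.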
+
+
+\subsubsection{Calculating the Inverse}
+%
+The inverse FFT of complex data was easy to calculate, simply by
+taking the complex conjugate of the DFT matrix. The input data and
+output data were both complex and did not have any special
+symmetry. For real data the inverse FFT is more complicated because
+the half-complex symmetry of the transformed data is
+different from the purely real input data.
+
+We can compute an inverse by stepping backwards through the forward
+transform. To simplify the inversion it's convenient to write the
+forward algorithm with the butterfly in matrix form,
+%
+\begin{algorithm}
+\FOR {$i = 1 \dots n$}
+ \FOR {$a = 0 \dots p_{i-1}/2$}
+ \FOR{$b = 0 \dots q_i - 1$}
+ \STATE{$
+ \left(
+ \begin{array}{c}
+ g(b p_i + a) \\
+ g(b p_i + p_{i-1} + a)
+ \end{array}
+ \right)
+ =
+ \left(
+ \begin{array}{cc}
+ 1 & W^a_{p_{i}} \\
+ 1 & -W^a_{p_{i}}
+ \end{array}
+ \right)
+ \left(
+ \begin{array}{c}
+ g(2 b p_{i-1} + a) \\
+ g((2 b + 1) p_{i-1} + a)
+ \end{array}
+ \right) $}
+ \ENDFOR
+ \ENDFOR
+\ENDFOR
+\end{algorithm}
+%
+To invert the algorithm we run the iterations backwards and invert the
+matrix multiplication in the innermost loop,
+%
+\begin{algorithm}
+\FOR {$i = n \dots 1$}
+ \FOR {$a = 0 \dots p_{i-1}/2$}
+ \FOR{$b = 0 \dots q_i - 1$}
+ \STATE{$
+ \left(
+ \begin{array}{c}
+ g(2 b p_{i-1} + a) \\
+ g((2 b + 1) p_{i-1} + a)
+ \end{array}
+ \right)
+ =
+ \left(
+ \begin{array}{cc}
+ 1 & W^a_{p_{i}} \\
+ 1 & -W^a_{p_{i}}
+ \end{array}
+ \right)^{-1}
+ \left(
+ \begin{array}{c}
+ g(b p_i + a) \\
+ g(b p_i + p_{i-1} + a)
+ \end{array}
+ \right) $}
+ \ENDFOR
+ \ENDFOR
+\ENDFOR
+\end{algorithm}
+%
+There is no need to reverse the loops over $a$ and $b$ because the
+result is independent of their order. The inverse of the matrix that
+appears is,
+%
+\begin{equation}
+\left(
+\begin{array}{cc}
+1 & W^a_{p_{i}} \\
+1 & -W^a_{p_{i}}
+\end{array}
+\right)^{-1}
+=
+{1 \over 2}
+\left(
+\begin{array}{cc}
+1 & 1 \\
+W^{-a}_{p_{i}} & -W^{-a}_{p_{i}}
+\end{array}
+\right)
+\end{equation}
+%
+To save divisions we remove the factor of $1/2$ inside the loop. This
+computes the unnormalized inverse, and the normalized inverse can be
+retrieved by dividing the final result by $N = 2^n$.
+
+Here is the radix-2 half-complex to real inverse FFT algorithm, taking
+into account the radix-2 storage scheme,
+%
+\begin{algorithm}
+\FOR {$i = n \dots 1$}
+ \FOR{$b = 0 \dots q_i - 1$}
+ \STATE{$\left(
+ \begin{array}{c}
+ g(b p_i) \\
+ g(b p_i + p_{i-1})
+ \end{array}
+ \right)
+ \Leftarrow
+ \left(
+ \begin{array}{c}
+ g(b p_{i}) + g(b p_{i} + p_{i-1}) \\
+ g(b p_{i}) - g(b p_{i} + p_{i-1})
+ \end{array}
+ \right)$}
+ \ENDFOR
+
+ \FOR {$a = 1 \dots p_{i-1}/2 - 1$}
+ \FOR{$b = 0 \dots q_i - 1$}
+ \STATE{$(\Real z_0, \Imag z_0)
+ \Leftarrow
+ (g(b p_i + a), g(b p_i + p_{i} - a))$}
+ \STATE{$(\Real z_1, \Imag z_1)
+ \Leftarrow
+ (g(b p_i + p_{i-1} - a), -g(b p_i + p_{i-1} + a))$}
+ \STATE{$t_0 \Leftarrow z_0 + z_1$}
+ \STATE{$t_1 \Leftarrow z_0 - z_1$}
+ \STATE{$(g(b p_{i} + a), g(b p_{i} + p_{i-1} - a))
+ \Leftarrow
+ (\Real t_0, \Imag t_0) $}
+ \STATE{$(g(b p_{i} + p_{i-1} + a),g(b p_{i} + p_{i} - a))
+ \Leftarrow
+                 (\Real(W^{-a}_{p_i}t_1), \Imag(W^{-a}_{p_i}t_1))$}
+ \ENDFOR
+ \ENDFOR
+
+ \FOR{$b = 0 \dots q_i - 1$}
+ \STATE{$g(b p_{i} + p_{i-1}/2) \Leftarrow 2 g(b p_{i} + p_{i-1}/2)$}
+ \STATE{$g(b p_{i} + p_{i-1} + p_{i-1}/2) \Leftarrow -2 g(b p_{i} + p_{i-1} + p_{i-1}/2)$}
+ \ENDFOR
+
+\ENDFOR
+\STATE bit-reverse ordering of $g$
+\end{algorithm}
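+%
+A matching sketch in C of one pass of the inverse (run for $p = N,
+N/2, \dots, 2$, with the bit-reversal applied at the end) is,
+%
+\begin{verbatim}
+#include <stddef.h>
+#include <math.h>
+
+/* one pass of the radix-2 half-complex to real inverse FFT */
+void
+real_radix2_inverse_pass (double g[], size_t p, size_t q)
+{
+  size_t pm = p / 2, a, b;
+
+  for (b = 0; b < q; b++)              /* a = 0 : real butterfly */
+    {
+      double g0 = g[b * p], g1 = g[b * p + pm];
+      g[b * p] = g0 + g1;
+      g[b * p + pm] = g0 - g1;
+    }
+
+  for (a = 1; a < pm / 2; a++)         /* general butterflies */
+    {
+      double w_r = cos (2 * M_PI * a / p);      /* W^{-a}_p */
+      double w_i = sin (2 * M_PI * a / p);
+
+      for (b = 0; b < q; b++)
+        {
+          double z0_r = g[b * p + a],      z0_i = g[b * p + p - a];
+          double z1_r = g[b * p + pm - a], z1_i = -g[b * p + pm + a];
+          double t0_r = z0_r + z1_r, t0_i = z0_i + z1_i;
+          double t1_r = z0_r - z1_r, t1_i = z0_i - z1_i;
+
+          g[b * p + a]      = t0_r;
+          g[b * p + pm - a] = t0_i;
+          g[b * p + pm + a] = w_r * t1_r - w_i * t1_i; /* Re(W^{-a} t1) */
+          g[b * p + p - a]  = w_r * t1_i + w_i * t1_r; /* Im(W^{-a} t1) */
+        }
+    }
+
+  if (pm >= 2)                         /* a = pm/2 : scale by +/- 2 */
+    for (b = 0; b < q; b++)
+      {
+        g[b * p + pm / 2] *= 2.0;
+        g[b * p + pm + pm / 2] *= -2.0;
+      }
+}
+\end{verbatim}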
+
+
+
+\subsection{Mixed-Radix FFTs for real data}
+%
+As discussed earlier the radix-2 decimation-in-time algorithm had the
+special property that its intermediate passes are interleaved fourier
+transforms of the original data, and this generalizes to the
+mixed-radix algorithm. The complex mixed-radix algorithm that we
+derived earlier was a decimation-in-frequency algorithm, but we can
+obtain a decimation-in-time version by taking the transpose of the
+decimation-in-frequency DFT matrix like this,
+%
+\begin{eqnarray}
+W_N &=& W_N^T \\
+&=& (T_{n_f} \dots T_2 T_1)^T \\
+&=& T_1^T T_2^T \dots T_{n_f}^T
+\end{eqnarray}
+%
+with,
+%
+\begin{eqnarray}
+T_i^T &=& \left( (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}})
+ (W_{f_i} \otimes I_{m_i}) \right)^T \\
+ &=& (W_{f_i} \otimes I_{m_i})
+ ( D^{f_i}_{q_i} (P^{f_i}_{q_i})^T \otimes I_{p_{i-1}}).
+\end{eqnarray}
+%
+We have used the fact that $W$, $D$ and $I$ are symmetric and that the
+permutation matrix $P$ obeys,
+%
+\begin{equation}
+(P^a_b)^T = P^b_a.
+\end{equation}
+%
+From the definitions of $D$ and $P$ we can derive the following identity,
+%
+\begin{equation}
+D^a_b P^b_a = P^b_a D^b_a.
+\end{equation}
+%
+This allows us to put the transpose into a simple form,
+%
+\begin{equation}
+T_i^T = (W_{f_i} \otimes I_{m_i})
+ (P^{q_i}_{f_i} D^{q_i}_{f_i} \otimes I_{p_{i-1}}).
+\end{equation}
+%
+The transposed matrix, $T^T$ applies the digit-reversal $P$ before the
+DFT $W$, giving the required decimation-in-time algorithm. The
+transpose reverses the order of the factors --- $T_{n_f}$ is applied
+first and $T_1$ is applied last. For convenience we can reverse the
+order of the factors, $f_1 \leftrightarrow f_{n_f}$, $f_2
+\leftrightarrow f_{n_f-1}$, \dots and make the corresponding
+substitution $p_{i-1} \leftrightarrow q_i$. These substitutions give
+us a decimation-in-time algorithm with the same ordering as the
+decimation-in-frequency algorithm,
+%
+\begin{equation}
+W_N = T_{n_f} \dots T_2 T_1
+\end{equation}
+%
+\begin{equation}
+T_i = (W_{f_i} \otimes I_{m_i})
+ (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})
+\end{equation}
+%
+where $p_i$, $q_i$ and $m_i$ now have the same meanings as before,
+namely,
+%
+\begin{eqnarray}
+p_i &=& f_1 f_2 \dots f_i \quad (p_0 = 1) \\
+q_i &=& N / p_i \\
+m_i &=& N / f_i
+\end{eqnarray}
+%
+Now we would like to prove that the iteration for computing $x = W z =
+T_{n_f} \dots T_2 T_1 z$ has the special interleaving property. If we
+write the result of each intermediate pass as
+$v^{(i)}$,
+%
+\begin{eqnarray}
+v^{(0)} &=& z \\
+v^{(1)} &=& T_1 v^{(0)} \\
+v^{(2)} &=& T_2 v^{(1)} \\
+\dots &=& \dots \\
+v^{(i)} &=& T_i v^{(i-1)}
+\end{eqnarray}
+%
+then we will show that the intermediate results $v^{(i)}$ on any pass
+can be written as,
+%
+\begin{equation}
+v^{(i)} = (W_{p_i} \otimes I_{q_i}) z
+\end{equation}
+%
+Each intermediate stage will be a set of $q_i$ interleaved fourier
+transforms, each of length $p_i$. We can prove this result by
+induction. First we assume that the result is true for $v^{(i-1)}$,
+%
+\begin{equation}
+v^{(i-1)} = (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \qquad \mbox{(assumption)}
+\end{equation}
+%
+And then we examine the next iteration using this assumption,
+%
+\begin{eqnarray}
+v^{(i)} &=& T_i v^{(i-1)} \\
+ &=& T_i (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \\
+ &=& (W_{f_i} \otimes I_{m_i})
+ (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})
+ (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \label{dit-induction}
+\end{eqnarray}
+%
+Using the relation $m_i = p_{i-1} q_i$, we can write $I_{m_i}$ as
+$I_{p_{i-1} q_i}$ and $I_{q_{i-1}}$ as $I_{f_i q_i}$. By combining these
+with the basic matrix identity,
+%
+\begin{equation}
+I_{ab} = I_a \otimes I_b
+\end{equation}
+%
+we can rewrite $v^{(i)}$ in the following form,
+%
+\begin{equation}
+v^{(i)} = (((W_{f_i} \otimes I_{p_{i-1}})
+ (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})
+ (W_{p_{i-1}} \otimes I_{f_i})) \otimes I_{q_i}) z .
+\end{equation}
+%
+The first part of this matrix product is the two-factor expansion of
+$W_{ab}$, for $a = p_{i-1}$ and $b = f_i$,
+%
+\begin{equation}
+W_{p_{i-1} f_i} = ((W_{f_i} \otimes I_{p_{i-1}})
+ (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})
+ (W_{p_{i-1}} \otimes I_{f_i})).
+\end{equation}
+%
+If we substitute this result, remembering that $p_i = p_{i-1} f_i$, we
+obtain,
+%
+\begin{equation}
+v^{(i)} = (W_{p_i} \otimes I_{q_i}) z
+\end{equation}
+%
+which is the desired result. The case $i=1$ can be verified
+explicitly, and induction then shows that the result is true for all
+$i$. As discussed for the radix-2 algorithm this result is important
+because if the initial data $z$ is real then each intermediate pass is
+a set of interleaved fourier transforms of $z$, having half-complex
+symmetries (appropriately applied in the subspaces of the Kronecker
+product). Consequently only $N$ real numbers are needed to store the
+intermediate and final results.
+
+\subsection{Implementation}
+%
+The implementation of the mixed-radix real FFT algorithm follows the
+same principles as the full complex transform. Some of the steps are
+applied in the opposite order because we are dealing with a decimation
+in time algorithm instead of a decimation in frequency algorithm, but
+the basic outer structure of the algorithm is the same. We want to
+apply the factorized version of the DFT matrix $W_N$ to the input
+vector $z$,
+%
+\begin{eqnarray}
+x &=& W_N z \\
+ &=& T_{n_f} \dots T_2 T_1 z
+\end{eqnarray}
+%
+We loop over the $n_f$ factors, applying each matrix $T_i$ to the
+vector in turn to build up the complete transform,
+%
+\begin{algorithm}
+\FOR{$(i = 1 \dots n_f)$}
+ \STATE{$v \Leftarrow T_i v $}
+\ENDFOR
+\end{algorithm}
+%
+In this case the definition of $T_i$ is different because we have
+taken the transpose,
+%
+\begin{equation}
+T_i =
+ (W_{f_i} \otimes I_{m_i})
+ (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i}).
+\end{equation}
+%
+We'll define a temporary vector $t$ to denote the results of applying the
+rightmost matrix,
+%
+\begin{equation}
+t = (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i}) v
+\end{equation}
+%
+If we expand this out into individual components, as before, we find a
+similar simplification,
+%
+\begin{eqnarray}
+t_{aq+b}
+&=&
+\sum_{a'b'}
+(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})_{(aq+b)(a'q+b')}
+v_{a'q+b'} \\
+&=&
+\sum_{a'}
+(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})_{a a'}
+v_{a'q+b}
+\end{eqnarray}
+%
+We have factorized the indices into the form $aq+b$, with $0 \leq a <
+p_{i}$ and $0 \leq b < q$. Just as in the decimation in frequency
+algorithm we can split the index $a$ to remove the matrix
+multiplication completely. We have to write $a$ as $\mu f + \lambda$,
+where $0 \leq \mu < p_{i-1}$ and $0 \leq \lambda < f$,
+%
+\begin{eqnarray}
+t_{(\mu f + \lambda)q+b}
+&=&
+\sum_{\mu'\lambda'}
+(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})_{(\mu f + \lambda)(\mu' f + \lambda')}
+v_{(\mu' f + \lambda')q+b}
+\\
+&=&
+\sum_{\mu'\lambda'}
+(P^{p_{i-1}}_{f_i})_{(\mu f + \lambda)(\mu' f + \lambda')}
+\omega^{\mu'\lambda'}_{p_{i}}
+v_{(\mu' f + \lambda')q+b}
+\end{eqnarray}
+%
+The matrix $P^{p_{i-1}}_{f_i}$ exchanges an index of $(\mu f +
+\lambda) q + b$ with $(\lambda p_{i-1} + \mu) q + b$, giving a final
+result of,
+%
+\begin{eqnarray}
+t_{(\lambda p_{i-1} + \mu) q + b}
+=
+\omega^{\mu\lambda}_{p_i} v_{(\mu f + \lambda)q + b}
+\end{eqnarray}
+%
+To calculate the next stage,
+%
+\begin{equation}
+v' = (W_{f_i} \otimes I_{m_i}) t,
+\end{equation}
+%
+we temporarily rearrange the index of $t$ to separate the $m_{i}$
+independent DFTs in the Kronecker product,
+%
+\begin{equation}
+v'_{(\lambda p_{i-1} + \mu) q + b}
+=
+\sum_{\lambda' \mu' b'}
+(W_{f_i} \otimes I_{m_i})_{
+((\lambda p_{i-1} + \mu) q + b)
+((\lambda' p_{i-1} + \mu') q + b')}
+t_{(\lambda' p_{i-1} + \mu') q + b'}
+\end{equation}
+%
+If we use the identity $m = p_{i-1} q$ to rewrite the index of $t$
+like this,
+%
+\begin{equation}
+t_{(\lambda p_{i-1} + \mu) q + b} = t_{\lambda m + (\mu q + b)}
+\end{equation}
+%
+we can split the Kronecker product,
+%
+\begin{eqnarray}
+v'_{(\lambda p_{i-1} + \mu) q + b}
+&=&
+\sum_{\lambda' \mu' b'}
+(W_{f_i} \otimes I_{m_i})_{
+((\lambda p_{i-1} + \mu) q + b)
+((\lambda' p_{i-1} + \mu') q + b')}
+t_{(\lambda' p_{i-1} + \mu') q + b'}\\
+&=&
+\sum_{\lambda'}
+(W_{f_i})_{\lambda \lambda'}
+t_{\lambda' m_i + (\mu q + b)}
+\end{eqnarray}
+%
+If we switch back to the original form of the index in the last line we obtain,
+%
+\begin{eqnarray}
+\phantom{v'_{(\lambda p_{i-1} + \mu) q + b}}
+&=&
+\sum_{\lambda'}
+(W_{f_i})_{\lambda \lambda'}
+t_{(\lambda' p_{i-1} + \mu) q + b}
+\end{eqnarray}
+%
+which allows us to substitute our previous result for $t$,
+%
+\begin{equation}
+v'_{(\lambda p_{i-1} + \mu) q + b}
+=
+\sum_{\lambda'}
+(W_{f_i})_{\lambda \lambda'}
+w^{\mu\lambda'}_{p_i} v_{(\mu f + \lambda')q + b}
+\end{equation}
+%
+This gives us the basic decimation-in-time mixed-radix algorithm for
+complex data which we will be able to specialize to real data,
+%
+\begin{algorithm}
+\FOR{$i = 1 \dots n_f$}
+\FOR{$\mu = 0 \dots p_{i-1} - 1$}
+\FOR{$b = 0 \dots q-1$}
+\FOR{$\lambda = 0 \dots f-1$}
+\STATE{$t_\lambda \Leftarrow
+    \omega^{\mu\lambda}_{p_{i}} v_{(\mu f + \lambda)q + b}$}
+\ENDFOR
+\FOR{$\lambda = 0 \dots f-1$}
+\STATE{$v'_{(\lambda p_{i-1} + \mu)q + b} =
+ \sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') t_{\lambda'}$}
+ \COMMENT{DFT matrix-multiply module}
+\ENDFOR
+\ENDFOR
+\ENDFOR
+\STATE{$v \Leftarrow v'$}
+\ENDFOR
+\end{algorithm}
+%
+We are now at the point where we can convert an algorithm formulated
+in terms of complex variables to one in terms of real variables by
+choosing a suitable storage scheme. We will adopt the FFTPACK storage
+convention. FFTPACK uses a scheme where individual FFTs are
+contiguous, not interleaved, and real-imaginary pairs are stored in
+neighboring locations. This has better locality than was possible for
+the radix-2 case.
+
+The interleaving of the intermediate FFTs results from the Kronecker
+product, $W_p \otimes I_q$. The FFTs can be made contiguous if we
+reorder the Kronecker product on the intermediate passes,
+%
+\begin{equation}
+W_p \otimes I_q \Rightarrow I_q \otimes W_p
+\end{equation}
+%
+This can be implemented by a simple change in indexing. On pass $i$
+we store element $v_{a q_i + b}$ in location $v_{b p_i+a}$. We
+compensate for this change by making the same transposition when
+reading the data. Note that this only affects the indices of the
+intermediate passes. On the zeroth iteration the transposition has no
+effect because $p_0 = 1$. Similarly there is no effect on the last
+iteration, which has $q_{n_f} = 1$. This is how the algorithm above
+looks after this index transformation,
+%
+\begin{algorithm}
+\FOR{$i = 1 \dots n_f$}
+\FOR{$\mu = 0 \dots p_{i-1} - 1$}
+\FOR{$b = 0 \dots q-1$}
+\FOR{$\lambda = 0 \dots f-1$}
+\STATE{$t_\lambda \Leftarrow
+    \omega^{\mu\lambda}_{p_{i}} v_{(\lambda q + b)p_{i-1} + \mu}$}
+\ENDFOR
+\FOR{$\lambda = 0 \dots f-1$}
+\STATE{$v'_{b p + (\lambda p_{i-1} + \mu)} =
+ \sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') t_{\lambda'}$}
+ \COMMENT{DFT matrix-multiply module}
+\ENDFOR
+\ENDFOR
+\ENDFOR
+\STATE{$v \Leftarrow v'$}
+\ENDFOR
+\end{algorithm}
+%
+We transpose the input terms by writing the index in the form $a
+q_{i-1} + b$, to take account of the pass-dependence of the scheme,
+%
+\begin{equation}
+v_{(\mu f + \lambda')q + b} = v_{\mu q_{i-1} + \lambda'q + b}
+\end{equation}
+%
+We used the identity $q_{i-1} = f q$ to split the index. Note that in
+this form $\lambda'q + b$ runs from 0 to $q_{i-1} - 1$ as $b$ runs
+from 0 to $q-1$ and $\lambda'$ runs from 0 to $f-1$. The transposition
+for the input terms then gives,
+%
+\begin{equation}
+v_{\mu q_{i-1} + \lambda'q + b} \Rightarrow v_{(\lambda'q + b) p_{i-1} + \mu}
+\end{equation}
+%
+In the FFTPACK scheme the intermediate output data have the same
+half-complex symmetry as the radix-2 example, namely,
+%
+\begin{equation}
+v^{(i)}_{b p + a} = v^{(i)*}_{b p + (p - a)}
+\end{equation}
+%
+and the input data from the previous pass has the symmetry,
+%
+\begin{equation}
+v^{(i-1)}_{(\lambda' q + b) p_{i-1} + \mu} = v^{(i-1)*}_{(\lambda' q +
+b) p_{i-1} + (p_{i-1} - \mu)}
+\end{equation}
+%
+Using these symmetries we can halve the storage and computation
+requirements for each pass. Compared with the radix-2 algorithm we
+have more freedom because the computation does not have to be done in
+place. The storage scheme adopted by FFTPACK places elements
+sequentially with real and imaginary parts in neighboring
+locations. Imaginary parts which are known to be zero are not
+stored. Here are the full details of the scheme,
+%
+\begin{center}
+\renewcommand{\arraystretch}{1.5}
+\begin{tabular}{|l|lll|}
+\hline Term & & Location & \\
+\hline
+$g(b p_i)$
+ & real part & $b p_{i} $ & \\
+ & imag part & --- & \\
+\hline
+$g(b p_i + a)$
+ & real part & $b p_{i} + 2a - 1 $& for $a = 1 \dots p_i/2 - 1$ \\
+ & imag part & $b p_{i} + 2a$ & \\
+\hline
+$g(b p_{i} + p_{i}/2)$
+ & real part & $b p_i + p_{i} - 1$ & if $p_i$ is even\\
+ & imag part & --- & \\
+\hline
+\end{tabular}
+\end{center}
+%
+The real element for $a=0$ is stored in location $bp$. The real parts
+for $a = 1 \dots p/2 - 1$ are stored in locations $bp + 2a -1$ and the
+imaginary parts are stored in locations $b p + 2 a$. When $p$ is even
+the term for $a = p/2$ is purely real and we store it in location $bp
++ p - 1$. The zero imaginary part is not stored.
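+In C the locations might be computed as follows (a sketch with
+illustrative function names),
+%
+\begin{verbatim}
+#include <stddef.h>
+
+/* index of the real and imaginary parts of the term g(b*p + a)
+   in the FFTPACK-style layout, for the pass with product p */
+size_t
+fftpack_real_loc (size_t b, size_t a, size_t p)
+{
+  if (a == 0)
+    return b * p;                 /* purely real, a = 0 */
+  if (2 * a == p)
+    return b * p + p - 1;         /* purely real, a = p/2 (p even) */
+  return b * p + 2 * a - 1;       /* general term, 0 < a < p/2 */
+}
+
+size_t
+fftpack_imag_loc (size_t b, size_t a, size_t p)
+{
+  return b * p + 2 * a;           /* only valid for 0 < a < p/2 */
+}
+\end{verbatim}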
+
+When we compute the basic iteration,
+%
+\begin{equation}
+v^{(i)}_{b p + (\lambda p_{i-1} + \mu)} = \sum_{\lambda'}
+W^{\lambda \lambda'}_f
+\omega^{\mu\lambda'}_{p_i} v^{(i-1)}_{(\lambda' q + b)p_{i-1} + \mu}
+\end{equation}
+%
+we eliminate the redundant conjugate terms with $a > p_{i}/2$ as we
+did in the radix-2 algorithm. Whenever we need to store a term with $a
+> p_{i}/2$ we consider instead the corresponding conjugate term with
+$a' = p - a$. Similarly when reading data we replace terms with $\mu >
+p_{i-1}/2$ with the corresponding conjugate term for $\mu' = p_{i-1} -
+\mu$.
+
+Since the input data on each stage has half-complex symmetry we only
+need to consider the range $\mu = 0 \dots p_{i-1}/2$. We now consider
+the best way to implement the basic iteration for each case of $\mu$
+in this range.
+
+When $\mu=0$ we are accessing elements which are the zeroth
+components of the independent transforms $W_{p_{i-1}} \otimes
+I_{q_{i-1}}$, and are therefore purely real.
+%
+We can code the pass with $\mu=0$ as a special case for real input
+data, and conjugate-complex output. When $\mu=0$ the twiddle factors
+$\omega^{\mu\lambda'}_{p_i}$ are all unity, giving a further saving.
+We can obtain small-$N$ real-data DFT modules by removing the
+redundant operations from the complex modules.
+%
+For example the $N=3$ module was,
+%
+\begin{equation}
+\begin{array}{lll}
+t_1 = z_1 + z_2, &
+t_2 = z_0 - t_1/2, &
+t_3 = \sin(\pi/3) (z_1 - z_2),
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{lll}
+x_0 = z_0 + t_1, &
+x_1 = t_2 + i t_3, &
+x_2 = t_2 - i t_3.
+\end{array}
+\end{equation}
+%
+In the complex case all the operations were complex, for complex input
+data $z_0$, $z_1$, $z_2$. In the real case $z_0$, $z_1$ and $z_2$ are
+all real. Consequently $t_1$, $t_2$ and $t_3$ are also real, and the
+symmetry $x_1 = t_2 + i t_3 = x^*_2$ means that we do not have to
+compute $x_2$ once we have computed $x_1$.
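+In C the specialized real-input module might look like this (a
+sketch; only the independent outputs $x_0$ and $x_1$ are returned),
+%
+\begin{verbatim}
+#include <math.h>
+
+/* N=3 module for real input: returns x0 and x1 = x1_r + i*x1_i;
+   x2 is the conjugate of x1 and is not computed */
+static void
+dft3_real (const double z[3], double *x0, double *x1_r, double *x1_i)
+{
+  double t1 = z[1] + z[2];
+  double t2 = z[0] - t1 / 2;
+  double t3 = sin (M_PI / 3.0) * (z[1] - z[2]);
+
+  *x0 = z[0] + t1;
+  *x1_r = t2;
+  *x1_i = t3;
+}
+\end{verbatim}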
+
+For subsequent passes $\mu = 1 \dots p_{i-1}/2 - 1$ the input data is
+complex and we have to compute full complex DFTs using the same
+modules as in the complex case. Note that the inputs are all of the
+form $v_{(\lambda q + b) p_{i-1} + \mu}$ with $\mu < p_{i-1}/2$ so we
+never need to use the symmetry to access the conjugate elements with
+$\mu > p_{i-1}/2$.
+
+If $p_{i-1}$ is even then we reach the halfway point $\mu=p_{i-1}/2$,
+which is another special case. The input data in this case is purely
+real because $\mu = p_{i-1} - \mu$ for $\mu = p_{i-1}/2$. We can code
+this as a special case, using real inputs and real-data DFT modules as
+we did for $\mu=0$. However, for $\mu = p_{i-1}/2$ the twiddle factors
+are not unity,
+%
+\begin{eqnarray}
+\omega^{\mu\lambda'}_{p_i} &=& \omega^{(p_{i-1}/2)\lambda'}_{p_i} \\
+&=& \exp(-i\pi\lambda'/f_i)
+\end{eqnarray}
+%
+These twiddle factors introduce an additional phase which modifies the
+symmetry of the outputs. Instead of the conjugate-complex symmetry
+which applied for $\mu=0$ there is a shifted conjugate-complex
+symmetry,
+%
+\begin{equation}
+t_\lambda = t^*_{f-(\lambda+1)}
+\end{equation}
+%
+This is easily proved,
+%
+\begin{eqnarray}
+t_\lambda
+&=&
+\sum e^{-2\pi i \lambda\lambda'/f_i} e^{-i\pi \lambda'/f_i} r_{\lambda'} \\
+t_{f - (\lambda + 1)}
+&=&
+\sum e^{-2\pi i (f-\lambda-1)\lambda'/f_i} e^{-i\pi \lambda'/f_i} r_{\lambda'} \\
+&=&
+\sum e^{2\pi i (\lambda+1)\lambda'/f_i} e^{-i\pi \lambda'/f_i} r_{\lambda'}
+\qquad (e^{-2\pi i \lambda'} = 1) \\
+&=&
+\sum e^{2\pi i \lambda\lambda'/f_i} e^{i\pi \lambda'/f_i} r_{\lambda'} \\
+&=& t^*_\lambda
+\end{eqnarray}
+%
+The symmetry of the output means that we only need to compute half of
+the output terms, the remaining terms being conjugates or zero
+imaginary parts. For example, when $f=4$ the outputs are $(x_0 + i
+y_0, x_1 + i y_1, x_1 - i y_1, x_0 - i y_0)$. For $f=5$ the outputs
+are $(x_0 + i y_0, x_1 + i y_1, x_2, x_1 - i y_1, x_0 - i y_0)$. By
+combining the twiddle factors and the DFT matrix into a single module
+we can apply both at the same time. Starting from the complex DFT
+modules and bringing in the twiddle factors gives the optimized
+modules below, due to Temperton, for $z = W
+\Omega x$ where $x$ is real and $z$ has the shifted conjugate-complex
+symmetry. The matrix of twiddle factors, $\Omega$, is given by,
+%
+\begin{equation}
+\Omega = \mathrm{diag}(1, e^{-i\pi/f}, e^{-2\pi i/f}, \dots, e^{-i\pi(f-1)/f})
+\end{equation}
+%
+We write $z$ in terms of two real vectors $z = a + i b$.
+%
+For $N=2$,
+%
+\begin{equation}
+\begin{array}{ll}
+a_0 = x_0, &
+b_0 = - x_1.
+\end{array}
+\end{equation}
+%
+For $N=3$,
+%
+\begin{equation}
+\begin{array}{l}
+t_1 = x_1 - x_2,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+a_0 = x_0 + t_1/2, & b_0 = - \sin(\pi/3) (x_1 + x_2),
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{l}
+a_1 = x_0 - t_1
+\end{array}
+\end{equation}
+%
+For $N=4$,
+%
+\begin{equation}
+\begin{array}{ll}
+t_1 = (x_1 - x_3)/\sqrt{2}, & t_2 = (x_1 + x_3)/\sqrt{2},
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+a_0 = x_0 + t_1, & b_0 = -x_2 - t_2,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+a_1 = x_0 - t_1, & b_1 = x_2 - t_2.
+\end{array}
+\end{equation}
+%
+For $N=5$,
+%
+\begin{equation}
+\begin{array}{ll}
+t_1 = x_1 - x_4, & t_2 = x_1 + x_4,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+t_3 = x_2 - x_3, & t_4 = x_2 + x_3,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+t_5 = t_1 - t_3, & t_6 = x_0 + t_5 / 4,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+t_7 = (\sqrt{5}/4)(t_1 + t_3) &
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+a_0 = t_6 + t_7, & b_0 = -\sin(2\pi/10) t_2 - \sin(2\pi/5) t_4,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+a_1 = t_6 - t_7, & b_1 = -\sin(2\pi/5) t_2 + \sin(2\pi/10) t_4,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+a_2 = x_0 - t_5 &
+\end{array}
+\end{equation}
+%
+For $N=6$,
+%
+\begin{equation}
+\begin{array}{ll}
+t_1 = \sin(\pi/3)(x_5 - x_1), & t_2 = \sin(\pi/3) (x_2 + x_4),
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+t_3 = x_2 - x_4, & t_4 = x_1 + x_5,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+t_5 = x_0 + t_3 / 2, & t_6 = -x_3 - t_4 / 2,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+a_0 = t_5 - t_1, & b_0 = t_6 - t_2,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+a_1 = x_0 - t_3, & b_1 = x_3 - t_4,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+a_2 = t_5 + t_1, & b_2 = t_6 + t_2
+\end{array}
+\end{equation}
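+%
+As an illustration, the $N=4$ module above transcribes into C as (a
+sketch with illustrative names),
+%
+\begin{verbatim}
+#include <math.h>
+
+/* twiddled N=4 module: z = W Omega x for real x, with output
+   (a0 + i b0, a1 + i b1, a1 - i b1, a0 - i b0) */
+static void
+dft4_twiddled_real (const double x[4], double a[2], double b[2])
+{
+  double t1 = (x[1] - x[3]) / M_SQRT2;
+  double t2 = (x[1] + x[3]) / M_SQRT2;
+
+  a[0] = x[0] + t1;  b[0] = -x[2] - t2;
+  a[1] = x[0] - t1;  b[1] =  x[2] - t2;
+}
+\end{verbatim}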
+
+\section{Computing the mixed-radix inverse for real data}
+%
+To compute the inverse of the mixed-radix FFT on real data we step
+through the algorithm in reverse and invert each operation.
+
+This gives the following algorithm using FFTPACK indexing,
+%
+\begin{algorithm}
+\FOR{$i = n_f \dots 1$}
+\FOR{$\mu = 0 \dots p_{i-1} - 1$}
+\FOR{$b = 0 \dots q-1$}
+\FOR{$\lambda = 0 \dots f-1$}
+\STATE{$t_{\lambda} \Leftarrow
+   \sum_{\lambda'=0}^{f-1} W^*_f(\lambda,\lambda')
+   v_{b p + (\lambda' p_{i-1} + \mu)}$}
+ \COMMENT{DFT matrix-multiply module}
+\ENDFOR
+\FOR{$\lambda = 0 \dots f-1$}
+\STATE{$v'_{(\lambda q + b)p_{i-1} + \mu} \Leftarrow
+    \omega^{-\mu\lambda}_{p_{i}} t_\lambda$}
+\ENDFOR
+
+\ENDFOR
+\ENDFOR
+\STATE{$v \Leftarrow v'$}
+\ENDFOR
+\end{algorithm}
+%
+When $\mu=0$ we are applying an inverse DFT to half-complex data,
+giving a real result. The twiddle factors are all unity. We can code
+this as a special case, just as we did for the forward routine. We
+start with complex module and eliminate the redundant terms. In this
+case it is the final result which has the zero imaginary part, and we
+eliminate redundant terms by using the half-complex symmetry of the
+input data.
+
+When $\mu=1 \dots p_{i-1}/2 - 1$ we have full complex transforms on
+complex data, so no simplification is possible.
+
+When $\mu = p_{i-1}/2$ (which occurs only when $p_{i-1}$ is even) we
+have a combination of twiddle factors and DFTs on data with the
+shifted half-complex symmetry which give a real result. We implement
+this as a special module, essentially by inverting the system of
+equations given for the forward case. We use the modules given by
+Temperton, appropriately modified for our version of the algorithm. He
+uses a slightly different convention which differs by factors of two
+for some terms (consult his paper for details~\cite{temperton83real}).
+
+For $N=2$,
+%
+\begin{equation}
+\begin{array}{ll}
+x_0 = 2 a_0, & x_1 = - 2 b_0 .
+\end{array}
+\end{equation}
+%
+For $N=3$,
+%
+\begin{equation}
+\begin{array}{ll}
+t_1 = a_0 - a_1, & t_2 = \sqrt{3} b_0, \\
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{lll}
+x_0 = 2 a_0 + a_1, & x_1 = t_1 - t_2, & x_2 = - t_1 - t_2
+\end{array}
+\end{equation}
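+%
+In C this module might be written as (a sketch; like the other
+inverse modules it computes the unnormalized result),
+%
+\begin{verbatim}
+#include <math.h>
+
+/* inverse twiddled N=3 module: recovers the (unnormalized) real
+   values x from the shifted half-complex terms a0 + i*b0 and a1 */
+static void
+inv_dft3_real (double a0, double b0, double a1, double x[3])
+{
+  double t1 = a0 - a1;
+  double t2 = sqrt (3.0) * b0;
+
+  x[0] = 2 * a0 + a1;
+  x[1] = t1 - t2;
+  x[2] = -t1 - t2;
+}
+\end{verbatim}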
+%
+For $N=4$,
+\begin{equation}
+\begin{array}{ll}
+t_1 = \sqrt{2} (b_0 + b_1), & t_2 = \sqrt{2} (a_0 - a_1),
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+x_0 = 2(a_0 + a_1), & x_1 = t_2 - t_1 ,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+x_2 = 2(b_1 - b_0), & x_3 = -(t_2 + t_1).
+\end{array}
+\end{equation}
+%
+For $N=5$,
+%
+\begin{equation}
+\begin{array}{ll}
+t_1 = 2 (a_0 + a_1), & t_2 = t_1 / 4 - a_2,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+t_3 = (\sqrt{5}/2) (a_0 - a_1),
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{l}
+t_4 = 2(\sin(2\pi/10) b_0 + \sin(2\pi/5) b_1),
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{l}
+t_5 = 2(\sin(2\pi/10) b_0 - \sin(2\pi/5) b_1),
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+t_6 = t_3 + t_2, & t_7 = t_3 - t_2,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+x_0 = t_1 + a_2, & x_1 = t_6 - t_4 ,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+x_2 = t_7 - t_5, & x_3 = - t_7 - t_5,
+\end{array}
+\end{equation}
+\begin{equation}
+\begin{array}{ll}
+x_4 = -t_6 - t_4.
+\end{array}
+\end{equation}
+
+\section{Conclusions}
+%
+We have described the basic algorithms for one-dimensional radix-2 and
+mixed-radix FFTs. It would be nice to have a pedagogical explanation
+of the split-radix FFT algorithm, which is faster than the simple
+radix-2 algorithm we used. We could also have a whole chapter on
+multidimensional FFTs.
+%
+%\section{Multidimensional FFTs}
+%\section{Testing FFTs, Numerical Analysis}
+
+%\nocite{*}
+\bibliographystyle{unsrt}
+\bibliography{fftalgorithms}
+
+\end{document}
+
+
+
+