Bottom     Previous     Contents

# Practical Considerations.

## Inverse Transforms.

As presented, both the DIF and DIT algorithms evaluate (an incorrectly scaled) DFT. They are 'forward' transforms. Of course the same basic algorithms may be used to evaluate the Inverse DFT (IDFT). There are three commonly used methods:

1 - Conjugate the twiddle factors:

The only significant difference between DFT and IDFT if the sign of the imaginary part of the coefficients used in the defining summations. (-j for DFT, +j for IDFT). This change of sign is simply reflected in the twiddle factors. So if your FFT routine uses table based twiddle factors, you can turn it into an inverse FFT just by using a conjugated twiddle factor table. (If your FFT routine has been optimised with 'hard wired' trivial butterflies things won't be quite so simple.)

2 - Conjugate the input and output:

Both sides of the IDFT definition can be conjugated:

The above expression is simply the DFT of F(k)* , so to perform an IFFT conjugate the input data, perform an FFT, then conjugate the result.

3 - Use DFT symmetry:

Suppose we define:

From periodicity, we have:

The DFT of f' is:

Substitute m=N-n:

Of course, the order in which the terms are summed makes no difference, and periodicity tells us that the term for m=N is exactly the same as the m=0 term would be (if it appeared above), so the expression may be re arranged:

The above expression is just the IDFT of f(n). So we can use an FFT routine to evaluate an Inverse FFT by re-arranging the the input samples. Note that this re-arrangement is not a simple mirror image refelection of the input data array. f[0] and f[N/2] stay where they are!

A similar method is to re-interpret (or re-arrange) the FFT output instead of the input:

Suppose we define:

Again, the above expression is simply the IDFT of f(n). The advantage of this is that (depending on what you're going to do with the output) you may not have to do an physical re-arranging of data in the output array.

Note:

Although we have loosely used the term 'inverse FFT', in reality it won't be an inverse FFT unless it includes the appropriate scaling factor somewhere. The FFT's as we've defined them, will have the property that:

## Scaling.

The DFT and IDFT (as defined at the start of this document) preserve the mean squared magnitudes:

To see this, we can expand the left side of the above equation (using different dummy summation indices):

We've seen the inner term before when proving that the IDFT really is an inverse DFT (see Annex B). The result of interest is:

So, in the double summation, the only non-zero terms are for m=n, and these have a constant multiplier of N. So this is the result we're looking for:

This result is called 'Parsevals Theorem'. Unfortunately, the 'raw' FFT algorithms presented so far do not have this property, because the scaling was omitted from the FFT definitions. If you're using floating point arithmetic this probably doesn't present any serious problems, you just need to be aware of it. However, if you're using fixed point arithmetic it is very important that the FFT is properly scaled to make full use of the restricted range of fixed point numbers.

So how should the FFT and IFFT be scaled? There is no easy answer to this question. One thing we can say for sure is that if we want the IFFT to be an inverse FFT then the product of the FFT and IFFT scaling factors should equal (1/N).

It's helpful to consider 3 different types of signal:

1. Signals with a White(ish) spectrum (broad band signals) whose energy uniformly distributed in the 'time domain'. Typically these will be random noise dominated signals, or coherent signals with greater than unity time-bandwidth product, such as the 'Chirps' often used in active Sonar or Radar. The DFT of such signals will contain energy spread roughly uniformly across the 'frequency domain' . So to maintain the same signal level within the fixed point range the FFT (and IFFT) should use the standard DFT scale factors.
2. Signals with a White(ish) spectrum (broad band signals) whose energy is localised in the time domain. These will be impulse like signals with close to unity time-bandwidth product. Again the energy of these signals will be spread roughly uniformly across the 'frequency domain', but at a significantly lower level (because of the reduced energy resulting from time domain localisation). In such cases the standard DFT scaling will result in numbers which are smaller than the ideal for the fixed point range available. It may be better to use no scaling for the FFT and (1/N) scaling for the IFFT.
3. Signals with a Narrow(ish) spectrum (narrow band signals) whose energy uniformly distributed in the 'time domain'. An example of such a signal is a pure tone, or the so called 'CW' pulses used Doppler Sonar and Radar. These will also have close to unity time-bandwidth product. The DFT will tend to concentrate the signals energy in very small number of frequency bins, so the standard DFT scaling may result in clipped results. It may be better to use (1/N) scaling for the FFT and no scaling for the IFFT.

Of course these examples are extremes. Real world applications probably need to deal with signals whose components share one or more of the above characteristics. It the responsibility of the system designer to undertake the necessary 'trace' calculations to ensure that signals will neither clip nor be swamped by quantisation noise.

In summary, we have identified 3 different scaling factors which may be used in FFT's and IFFT's:

The first of these just corresponds to a 'raw' FFT, so requires no implementation. For Radix 2 algorithms, the easiest (and the best) way to achieve the last is to simply shift the results of each pass right 1 place (divide by 2) after each pass. Most DSP's can be configured to do this automatically. A similar approach could be used for the second scaling factor (root N), but this would require a multiplication by root 2 after each pass. To save time, it is much more common to apply the 1/2 scaling (by shifting the results right) every other pass (first,third... or second,fourth...).

## Frequency Domain Rounding.

The computational efficiency of 'frequency domain' processing is very attractive for many applications (typically those that would otherwise require a very long 'time domain' convolution or correlation). However, there is an intrinsic hazard in doing this which is of particular significance if fixed point arithmetic is used. It is very important that neither the FFT process nor any frequency domain calculations introduce consistent biases in the frequency domain data. In particular, simply truncating results in finite precision calculations will introduce a consistent bias of -1/2 of a least significant bit (LSB) in the real and imaginary parts. This may seem insignificant until you consider what happens to this bias when an inverse FFT is performed. All the bias energy will be concentrated in 'time domain' sample f(0). For an unscaled IFFT, this will result in a spurious spike at n=0 of approximate magnitude -N/2 in the real and imaginary parts. In real systems this effect might manifest itself as periodic audible clicks, or visible (false) Sonar or Radar returns. To prevent this you need to ensure that some form of un-biased rounding is used, such as 'round to nearest' (most DSP's can be configured to do this automatically).

## DIF or DIT?

In terms of computational work load, there would appear to be very little to chose between the DIF and DIT algorithms. Both perform exactly the same number of butterflies. Each butterfly requires exactly one complex multiply and two complex adds. Both algorithms use exactly the same number if trivial (and 'semi-trivial') butterflies. So why chose one in preference to the other?

The most significant difference between simple DIF and DIT algorithms is that DIF starts with normal order input and generates bit reversed order output. In contrast, DIT starts with bit reversed order input and generates normal order output. So if both forward and inverse transforms are required and bit reversed addressing isn't available, then the choice would seem clear. Use DIF for the forward transform and DIT for the inverse transform. (For the inverse transform you will need to conjugate the twiddle factors.) Unfortunately, the issue isn't quite so simple. If your performing FFT's on pure real data it may be simpler to use a modified DIT for the forward transform and a modified DIF for the inverse transform (the section about real transforms explains why later).

It is worth noting that if your FFT (DIF or DIT) uses 'hard wired' -j trivial butterflies, then you will need seperate routines forward and inverse transforms in any case. You can easily conjugate a twiddle factor table, but the 'hard wired' -j butterflies can't use conjugated twiddle factors without re-writing the code.

If your processor supports bit reversed addressing (see below), then the physical ordering of the input or output data is not an important efficiency consideration. You should try coding both the DIF and DIT butterflies and see which is the most efficient. (In a DSP that supports single cycle adds, multiplies and multiply/accumulates, both should take about 8 clock cycles).

Just about all DSP's provide this in some form. Some documentation refers to the mode as 'reverse carry propagation' (for obvious reasons). All this really does is provide a mechanism to rearrange an array of samples into bit reversed order (as seen by the FFT routine) without a potentially time consuming physical re-arrangement of the array in memory.

By using bit reversed addressing, you can use the DIT algorithm on normal (physical) order input data. The consequence of doing this is that output data will be in bit reversed (physical) order.

Similarly, you could use bit reversed addressing to perform a DIF algorithm on bit reversed (physical) order input. In this case the output data will be in normal (physical) order.

## Cache Efficiency.

The presence of a cache can result in non-intutive effects regarding the performance of real code running on real processors. Most modern PC's and workstations use general purpose CPU's which have fast floating point arithmetic units and large caches. In contrast, the traditional methods of quantifying an FFT algorithms merit date from the early days of computing when a floating point multiply would have been an expensive operation, when compared to reading a word from memory. These days the converse is likely to be true. A floating point operation is likely to be cheap, when compared to reading a word from (external) memory.

This does not mean we have to abandon our favourite FFT algorithm, but we do need to reconsider how it's implemented. In particular, the efficiency of the traditional nested loop FFT is questionable. This is because each pass accesses and modifies the entire array being transformed. If this array won't fit inside the cache then the entire array will be cycled through the cache during each pass. In other words, the cache will 'thrash'. In contrast, the recursive (in place) DIF and DIT algorithms presented in this document tend to keep each sub-transform as a distinct computational 'lump' which is completed before performing the next sub-transform. Once the sub transforms have reached a sufficiently small size that the relevant portion of the array can fit in the cache, then a significantly enhanced performance should be observed. The recursive forms have the additional advantage of being easier to understand.

Of course, this all depends on a lot of factors such as cache size, line size and line replacement algorithm. So it's hard to make quantitative predictions. The best way to determine the relative performance merits of one algorithm over another on any particular machine is to try them both.

## Using a DSP's internal memory.

A similar problem to cache effeciency is presented by DSP's with a relatively small area of fast internal memory. Unlike general purpose CPU's, a DSP's internal memory serves as a software controllable cache. This has advantages for both performance and achieving deterministic behaviour, at the expense of increased code complexity. Given that DSP's tend to be 'good at loops' but 'bad at stacks', a nested loop FFT still looks attractive. However, it is still possible to design FFT routines which perform all sub-transforms below a given size entirely in internal memory.

Typically a DIF routine will perform the first few passes in external memory until the size of each block is sufficiently small to fit in internal memory. At this point, each block in turn will be transfered to internal memory, transformed and the result written back. (If the FFT is being used to perform a convolution or correlation, then the point by point multiplication of frequency domain data and a DIT IFFT can also be performed in internal memory.)

Similarly, a DIT routine will initially perform the first sub-transforms internally before completing the remaining passes in external memory.

In both cases, the transfer to and from internal memory can be combined with useful computation, rather than using an explicit block move.