Flops and the FFT

The Fast Fourier Transform is a wonderful family of algorithms that have made whole fields of study possible. They mostly operate by breaking a large problem into some number of equally-sized smaller problems, all the way down to problems of just a few points. As a result, the prime factorization of the number of points can make a large difference in the speed of the calculation. Traditionally, powers of two have been the fastest (or the only) sizes available, so the wisdom has been that if you have a data set whose length is not a power of two, you should pad it to the next larger power of two. It turns out that, at least in some cases, that is a mistake.

For context, I am still thinking about the problem of coherent harmonic summing in pulsar searches. As I worked out before, it's important to have an accurately-interpolated Fourier spectrum, and it turns out that the simplest way to get one is to pad the input array with zeros before taking its Fourier transform.
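To make that step concrete, here is a minimal sketch of zero-padding with FFTW. The array length of 1024, the padding factor, and the placeholder random signal are purely illustrative, not the values from the real search code.

    /* Zero-pad a real time series before the FFT so that the resulting
     * spectrum is sampled more finely in frequency. */
    #include <stdlib.h>
    #include <string.h>
    #include <fftw3.h>

    int main(void)
    {
        const int n = 1024;       /* original number of samples (illustrative) */
        const int pad = 8;        /* interpolation factor */
        const int npad = n * pad; /* padded transform length */

        double *in = fftw_malloc(sizeof(double) * npad);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (npad / 2 + 1));

        /* Plan before filling the arrays; some planner modes overwrite them. */
        fftw_plan plan = fftw_plan_dft_r2c_1d(npad, in, out, FFTW_ESTIMATE);

        /* The first n samples hold the data (a placeholder signal here); the
         * remaining npad - n samples are zero, and those zeros are what give
         * the finer frequency sampling. */
        for (int i = 0; i < n; i++)
            in[i] = (double)rand() / RAND_MAX - 0.5;
        memset(in + n, 0, sizeof(double) * (npad - n));

        fftw_execute(plan);
        /* out now holds npad/2 + 1 complex bins: the same underlying spectrum,
         * sampled pad times more finely in frequency. */

        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }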

For a two-minute observation, we are taking about twelve thousand power measurements per second (yes, this is basically an audio data rate; but it only appears once we have combined all 4096 frequency channels), so the raw data is about 1.48 million data points (the exact length of our observations varies slightly). We need to pad this to eight times its length to get good enough sampling in the frequency domain, so that's about 11.8 million data points. The question is how much more to pad this in order to make the FFT fast.

I'm using FFTW for the Fourier transform, and it has efficient algorithms for sizes of the form 2^a·3^b·5^c·7^d. So while I could pad my FFTs to the next power of two (about 16.8 million), I could also pad them to 12 million exactly, or to the smaller value of 11854080, which still factors nicely (2^8·3^3·5·7^3). I know that the factors of five and seven are not handled as efficiently as the powers of two, but on the other hand, flops are cheap, while increasing the data size by more than forty percent potentially means an awful lot more memory access (obviously these giant FFTs aren't going to fit in cache, though FFTW is smart enough to arrange the pieces for the best possible cache coherence). So I thought I'd do a comparison.
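For what it's worth, a small sketch of one way to hunt for such a size is below. The helper functions and the starting value are hypothetical; FFTW will happily transform any length, just more slowly when large prime factors are involved.

    /* Step upward from the minimum required length until the only prime
     * factors left are 2, 3, 5 and 7, the radices FFTW handles quickly. */
    #include <stdio.h>

    /* Return 1 if n has no prime factors other than 2, 3, 5 and 7. */
    static int is_7_smooth(long n)
    {
        const long radices[] = {2, 3, 5, 7};
        for (int i = 0; i < 4; i++)
            while (n % radices[i] == 0)
                n /= radices[i];
        return n == 1;
    }

    /* Smallest 7-smooth number that is at least n. */
    static long next_7_smooth(long n)
    {
        while (!is_7_smooth(n))
            n++;
        return n;
    }

    int main(void)
    {
        long minimum = 11800000;  /* roughly the padded length needed here */
        printf("padding %ld up to %ld\n", minimum, next_7_smooth(minimum));
        return 0;
    }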

The results are:

  • 16.8 million points: 1.7 s
  • 12 million points: 0.8 s
  • 11854080 points: 1.0 s

That is, the power of two is the slowest; in fact it took more than twice as long as the nice round decimal number. Trimming the value further, to 11854080, slowed things down again, presumably because the reduction in memory size was minor while the prime factorization was worse. The upshot of all this is that, at least for large FFTs, padding to the next power of two is not necessarily a good idea.
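For anyone who wants to repeat the comparison, a rough timing sketch along these lines is below. It is not the actual search code: it takes "16.8 million" to be 2^24, fills the input with zeros, and times only the execution, not the (potentially slow) FFTW_MEASURE planning.

    /* Plan each candidate size once, then time a single execution. */
    #include <stdio.h>
    #include <time.h>
    #include <fftw3.h>

    static double time_fft(long n)
    {
        double *in = fftw_malloc(sizeof(double) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (n / 2 + 1));

        /* FFTW_MEASURE may overwrite the arrays during planning, so plan first. */
        fftw_plan plan = fftw_plan_dft_r2c_1d((int)n, in, out, FFTW_MEASURE);

        for (long i = 0; i < n; i++)
            in[i] = 0.0;  /* the timing does not depend on the data values */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fftw_execute(plan);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
        return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    }

    int main(void)
    {
        long sizes[] = {16777216, 12000000, 11854080};
        for (int i = 0; i < 3; i++)
            printf("%ld points: %.2f s\n", sizes[i], time_fft(sizes[i]));
        return 0;
    }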

2 comments:

mike said...

Wow! Impressive-sized FFTs. Do you know of an open-source software solution for large FFTs (one that can do 1 million points or greater)?

Thanks,

Michael

Unknown said...

I just used FFTW (from C) for that. You don't need anything very special until your data size starts to exceed the RAM on your machine (so, gigapoint FFTs). At that point I'm not sure what there is in terms of out-of-core algorithms; you might even do okay with FFTW from one memory-mapped file to another.