For context, I am still thinking about the problem of coherent harmonic summing in pulsar searches. As I worked out before, it's important to have an accurately-interpolated Fourier spectrum, and it turns out that the simplest way to do this is to simply pad the input array before taking its Fourier transform.
For a two-minute observation, we are taking about twelve thousand power measurements per second (yes, this is basically an audio data rate; but it only appears once we have combined all 4096 frequency channels), so the raw data is about 1.48 million data points (the exact length of our observations varies slightly). We need to pad this to eight times its length to get good enough sampling in the frequency domain, so that's about 11.8 million data points. The question is how much more to pad this in order to make the FFT fast.
I'm using FFTW for the Fourier transform, and it has efficient algorithms for powers 2a3b5c7d. So while I could pad my FFTs to the next power of two (about 16.8 million), I could also pad them to 12 million exactly, or to the smaller value of 11854080, which still factors nicely (28335173). I know that those powers of five and seven are not as efficient as those powers of two, but on the other hand, flops are cheap, while increasing the data size by almost a factor of two potentially means an awful lot more memory access (obviously these giant FFTs aren't going to fit in cache, though FFTW is smart enough to arrange the pieces for the best possible cache coherence). So I thought I'd do a comparison.
The results are:
- 16.8 million points: 1.7 s
- 12 million points: 0.8 s
- 11854080 points: 1.0 s