DSPSR user documentation: dspsr

Optimizing FFT Performance

dspsr can use precomputed tables of FFT performance measurements to select both the optimal transform length and the fastest FFT library for phase-coherent dispersion removal and synthetic filterbank formation.

Producing FFT Benchmarks for Central Processing Units

First, ensure that hyperthreading is disabled on the system of interest. This feature makes most Linux kernels report twice as many processing cores as are physically available.
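
One way to check this on Linux is to read the "Thread(s) per core" line printed by lscpu (part of util-linux). The following is a minimal sketch and not part of dspsr or PSRCHIVE; the function names threads_per_core and smt_status are hypothetical helpers, and the sketch assumes that lscpu prints its labels in English.

```shell
# Hypothetical helpers (not part of dspsr or PSRCHIVE).

threads_per_core() {
    # Read lscpu-style text on stdin and print the number of threads per
    # core, taken from the first "Thread(s) per core:" line.
    awk -F: '/^Thread\(s\) per core/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

smt_status() {
    # Print "enabled" if more than one thread per core is reported,
    # "disabled" if exactly one, and "unknown" if the line is missing.
    n=$(threads_per_core)
    if [ -z "$n" ]; then
        echo "unknown"
    elif [ "$n" -gt 1 ]; then
        echo "enabled"
    else
        echo "disabled"
    fi
}

# Usage on a Linux system with lscpu installed:
#   lscpu | smt_status
```

If this reports "enabled", hyperthreading is typically switched off in the system firmware (BIOS/UEFI) settings before benchmarking.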

Next, check that PSRCHIVE detected all of the available FFT libraries on your system by running

psrchive_info

It should say something like

FTransform::report 2 available FFT libraries: FFTW3 IPP
FTransform::default FFT library: FFTW3

If you don't see all of the expected FFT libraries, you may need to ensure that LD_LIBRARY_PATH includes the path(s) to those libraries before re-running the psrchive configure script and recompiling psrchive.
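
Before re-running configure, it can help to confirm that the directory holding a given FFT library really is listed in LD_LIBRARY_PATH. A minimal sketch follows; path_contains is a hypothetical helper, and /opt/intel/ipp/lib is only a placeholder for wherever the library is installed on your system.

```shell
# Hypothetical helper (not part of PSRCHIVE).

path_contains() {
    # $1: colon-separated list of directories (e.g. "$LD_LIBRARY_PATH")
    # $2: directory to look for
    # Returns 0 if $2 is an exact entry of the list, 1 otherwise.
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage (the directory below is a placeholder, not a real default):
#   path_contains "$LD_LIBRARY_PATH" /opt/intel/ipp/lib \
#       || echo "library directory is missing from LD_LIBRARY_PATH"
```

Wrapping both strings in colons makes the match exact, so /opt/fft/lib does not falsely match an entry such as /opt/fft/lib64.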

Next, change to the directory in which PSRCHIVE was originally built and make the FFT benchmarks; e.g.

cd psrchive_build/
cd Util/fft
make bench

This step takes some time: it runs through a number of iterations to estimate FFT execution times as a function of transform size and number of processing threads. These benchmarks can later be plotted as in Figure 5 of van Straten & Bailes (2010), arXiv:1008.3973.

Producing FFT Benchmarks for Graphics Processing Units

Graphics Processing Units achieve optimal performance when multiple FFTs are executed in parallel (batched). Therefore, benchmarks of the CUDA FFT library are performed over two dimensions:
  1. nchan, the number of frequency channels into which the spectrum is divided by the convolving filterbank; and
  2. nfft, the size of the backward FFT performed in each resulting frequency channel.

Note that it is currently assumed that dspsr will perform phase-coherent dispersion removal using a convolving filterbank. If the data have already been divided into frequency channels (e.g. using a polyphase filterbank), then a convolving filterbank may not be necessary. As of August 2015, the code required to handle input data that have already been divided into multiple frequency channels is being implemented and tested, and a separate set of benchmarks will be developed to optimize performance in this case.

To measure the performance of the convolving filterbank as a function of nchan and nfft, change to the directory in which DSPSR was originally built, run the filterbank benchmark script and copy the resulting text file to the dspsr installation directory; e.g.

cd dspsr_build/
cd Benchmark
./filterbank_bench.csh
cp filterbank_bench.out dspsr_prefix/share/filterbank_bench_CUDA.dat 

Using the Benchmarks to Optimize dspsr Performance

To use the benchmarks created in either of the above steps, run dspsr with the -fft-bench command-line option.
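
As a sanity check before a GPU run, the sketch below confirms that the filterbank benchmark file is in place under the installation directory. have_gpu_bench is a hypothetical helper, /path/to/dspsr_prefix stands for your dspsr installation directory, and input_file stands for your data file; only the file name filterbank_bench_CUDA.dat and the -fft-bench option are taken from the instructions above.

```shell
# Hypothetical helper (not part of dspsr).

have_gpu_bench() {
    # $1: dspsr installation prefix (the "dspsr_prefix" placeholder above)
    # Returns 0 if the GPU filterbank benchmark file exists and is not empty.
    [ -s "$1/share/filterbank_bench_CUDA.dat" ]
}

# Usage (paths and file names below are placeholders):
#   if have_gpu_bench /path/to/dspsr_prefix; then
#       dspsr -fft-bench input_file
#   else
#       echo "filterbank_bench_CUDA.dat not found; run the benchmark first"
#   fi
```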