Thom Popovici, Tze-Meng Low and Franz Franchetti (Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS), IEEE, 2018)
Large Bandwidth-Efficient FFTs on Multicore and Multi-Socket Systems

Current microprocessor trends show a steady increase in the number of cores and/or threads present on the same CPU die. While this increase improves performance for compute-bound applications, the benefits for memory-bound applications are limited. The discrete Fourier transform (DFT) is an example of such a memory-bound application, where increasing the number of cores does not yield a corresponding increase in performance. In this paper, we present an alternative way of using the increased number of cores/threads available on a typical multicore system. We propose to repurpose some of the cores/threads as soft Direct Memory Access (DMA) engines so that data is moved on and off chip while computation is performed. Overlapping memory accesses with computation permits us to preload and reshape data so that computation is more efficient. We show that, despite using fewer cores/threads for computation, our approach improves performance relative to MKL and FFTW by 1.2x to 3x for large multi-dimensional DFTs of up to 2048^3 on one- and two-socket Intel and AMD systems.
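
The overlap of data movement and computation described above can be illustrated with a small double-buffered pipeline. This is only a structural sketch under stated assumptions, not the implementation from the paper: the function names (`naive_dft`, `pipelined_dfts`), the use of Python threads and queues, and the naive O(n^2) DFT kernel are all hypothetical choices made for illustration. Because of Python's global interpreter lock, the sketch shows the hand-off of buffers between a loader thread, a compute thread and a storer thread, but it will not reproduce any speedup.

```python
import cmath
import queue
import threading


def naive_dft(block):
    """O(n^2) DFT of one block; a stand-in for an optimized FFT kernel."""
    n = len(block)
    return [
        sum(block[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def pipelined_dfts(source, block_len, num_buffers=2):
    """Transform consecutive blocks of `source` while helper threads move data.

    Assumption: the loader and storer threads play the role of "soft DMA"
    engines. Trailing elements that do not fill a whole block are ignored.
    """
    n_blocks = len(source) // block_len
    result = [0j] * (n_blocks * block_len)

    free = queue.Queue()      # empty buffers the loader may fill
    loaded = queue.Queue()    # (block index, buffer) ready for computation
    computed = queue.Queue()  # (block index, output) ready to be written back
    for _ in range(num_buffers):
        free.put([0j] * block_len)

    def loader():
        # Copy the next block into a free buffer while another is being computed.
        for b in range(n_blocks):
            buf = free.get()
            buf[:] = source[b * block_len:(b + 1) * block_len]
            loaded.put((b, buf))
        loaded.put(None)  # signal that no more blocks will arrive

    def storer():
        # Write finished blocks back to the result array.
        while True:
            item = computed.get()
            if item is None:
                break
            b, out = item
            result[b * block_len:(b + 1) * block_len] = out

    t_load = threading.Thread(target=loader)
    t_store = threading.Thread(target=storer)
    t_load.start()
    t_store.start()

    # The calling thread acts as the compute thread.
    while True:
        item = loaded.get()
        if item is None:
            break
        b, buf = item
        out = naive_dft(buf)
        free.put(buf)            # hand the buffer back to the loader
        computed.put((b, out))
    computed.put(None)

    t_load.join()
    t_store.join()
    return result
```

Calling `pipelined_dfts(data, 4)` on a list of eight values would transform the two blocks of four independently. In a native implementation the loader would also be the natural place to reshape (for example, transpose or pack) data before the compute threads touch it, which is the role the abstract assigns to the soft DMA threads.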

Fast Fourier Transform, Bandwidth, Multicore Systems