Srinivas Chellappa (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2010)
Computer Generation of Fourier Transform Libraries for Distributed Memory Architectures
Preprint (1.7 MB)
Bibtex

High-performance discrete Fourier transform (DFT) libraries are an important requirement for many computing platforms. Unfortunately, developing and optimizing these libraries for modern, complex platforms has become extraordinarily difficult. To make things worse, performance often does not port, thus requiring permanent re-optimizations. Overcoming this problem has been the goal of Spiral, a library generation system that can produce highly optimal DFT code from a high level specification of algorithms and target platforms.

However, current techniques in Spiral cannot support all target platforms. In particular, several emerging target platforms incorporate a distributed memory parallel processing paradigm, where the cost of accessing non-local memories is relatively high, and handling data movement is exposed to the programmers. Traditionally used only in supercomputing environments, this paradigm is now also finding its way in the form of multicore processors into desktop computing.

The goal of this work is the computer generation of high-performance DFT libraries for a wide range of distributed memory parallel processing systems, given only a high-level description of a DFT algorithm and some platform parameters. The key challenges include generating code for multiple target programming paradigms that delivers load balanced parallelization across multiple layers of the compute hierarchy, orchestrates explicit memory management, and overlaps computation with communication.

We attack this problem by first developing a formal framework to describe parallelization, streaming, and data exchange in a domain-specific declarative mathematical language. Based on this framework, we develop a rewriting system that structurally manipulates DFT algorithms to "match" them to a distributed memory target architecture and hence extracts maximum performance. We implement this approach as a part of Spiral together with a backend that translates the derived algorithms into actual libraries. Experimental results show that the performance of our generated code is comparable to hand-tuned code in all cases, and exceeds the performance of hand-tuned code in some cases.

Keywords:
Discrete/fast Fourier transform, Cell BE Processor, Fast algorithms, Multibuffering, Distributed memory, Parallel processing