Copyrights to these papers may be held by the publishers. The download files are preprints. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
Srinivas Chellappa (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2010)
Computer Generation of Fourier Transform Libraries for Distributed Memory Architectures
Preprint (1.7 MB)
Bibtex
High-performance discrete Fourier transform (DFT) libraries are an important requirement for many computing platforms. Unfortunately, developing and optimizing these libraries for modern, complex platforms has become extraordinarily difficult. To make things worse, performance often does not port, thus requiring permanent re-optimizations. Overcoming this problem has been the goal of Spiral, a library generation system that can produce highly optimal DFT code from a high level specification of algorithms and target platforms.
However, current techniques in Spiral cannot support all target platforms. In particular, several emerging target platforms incorporate a distributed memory parallel processing paradigm, where the cost of accessing non-local memories is relatively high, and handling data movement is exposed to the programmers. Traditionally used only in supercomputing environments, this paradigm is now also finding its way in the form of multicore processors into desktop computing.
The goal of this work is the computer generation of high-performance DFT libraries for a wide range of distributed memory parallel processing systems, given only a high-level description of a DFT algorithm and some platform parameters. The key challenges include generating code for multiple target programming paradigms that delivers load balanced parallelization across multiple layers of the compute hierarchy, orchestrates explicit memory management, and overlaps computation with communication.
We attack this problem by first developing a formal framework to describe parallelization, streaming, and data exchange in a domain-specific declarative mathematical language. Based on this framework, we develop a rewriting system that structurally manipulates DFT algorithms to "match" them to a distributed memory target architecture and hence extracts maximum performance. We implement this approach as a part of Spiral together with a backend that translates the derived algorithms into actual libraries. Experimental results show that the performance of our generated code is comparable to hand-tuned code in all cases, and exceeds the performance of hand-tuned code in some cases.