Copyrights to these papers may be held by the publishers. The download files are preprints. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA), 2008)
FFT Program Generation for the Cell BE
Preprint (55 KB)
The Cell BE is designed for accelerating heavy numerical computation. The Cell features an innovative design that includes multiple vectorized cores with explicitly managed per-core local memory and intercore communication. These features allow for a theoretical peak performance of 204.8 Gflop/s (single precision, using just the SPEs) without breaking the memory, power, or frequency walls. However, the very same features that allow for high theoretical performance make it difficult and time consuming to develop multithreaded, vectorized, high-performance numerical libraries for the Cell BE. Our approach to solving this problem is to use the Spiral framework to automatically generate and optimize signal processing transform libraries for the Cell. Spiral is a program generation system for linear transforms such as the discrete Fourier transform, discrete cosine transforms, filters, and others. For a user-selected transform, Spiral autonomously generates different algorithms, represented in a declarative form as mathematical formulas, and their implementations to find the best match for the target platform. Besides using search techniques, Spiral performs deterministic optimizations at the formula level, effectively restructuring the code in ways impractical at the code level. To extend the Spiral framework to support the Cell architecture, we first focus on automatically generating high-performance vectorized DFT kernels that run on a single SPE. The performance of our single-threaded code is comparable to hand tuned code, and achieves 16-20 Gflop/s on a single SPE for input vectors resident in the local stores. Next, we produce optimized multithreaded code that runs on multiple SPEs. To maximize the achieved interconnect bandwidth, the algorithms generated use large DMA packets for all-to-all data exchanges between the SPEs.Keywords: Multithreading, SPIRAL program generation system for transforms, Discrete/fast Fourier transform, Cell BE Processor