Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2009)
High Performance Linear Transform Program Generation for the Cell BE
Preprint (84 KB)
Bibtex

The Cell BE is among a new generation of multicore processors including the Intel Larrabee and the Tilera TILE64 that provide an impressive peak fixed or floating point performance for scientific, signal processing, visualization, and other engineering applications. As shown in Fig. 1, the Cell uses simple in-order cores designed specifically for numerical computing, and requires explicit memory management to achieve maximal performance, which make programming and optimizing a challenge. In this paper, we extend Spiral [7], a program generation system, to generate highly optimized linear transform programs for the Cell BE. In doing so, as presented in [2], we extend Spiral’s architectural paradigms to include support for distributed memory architectures like the Cell that allow hiding memory costs using multibuffering techniques. We focus on fixed-size code for the 1D complex discrete Fourier transform (DFT), but also generate code for variants including transforms that work on real input, 2D input, and for other transforms including the discrete Cosine and Sine transforms. We generate code for various usage scenarios, including latency optimized and throughput optimized code, and our system can handle various complex data formats and data distribution formats. The performance of Spiral generated code for the Cell is comparable to, and in many cases better than existing implementations, where available.

Keywords:
Discrete/fast Fourier transform, Cell BE Processor, Multibuffering, Distributed memory, Parallel processing