Daniel McFarlin, Franz Franchetti and Markus Püschel (submitted for publication)
Automatic SIMD Vectorization of Fast Fourier Transforms for the Larrabee and AVX Instruction Sets
Bibtex

The well-known shift to parallelism in CPUs is often associated with multicores. However another trend is equally salient: the increasing parallelism in per-core single-instruction multiple-date (SIMD) vector units. Intel's SSE and IBM’s Altivec both offer 4-way (single precision) floating point, but for the near future both 8-way and 16-way have been announced with Intel's AVX and Larrabee (LRB) instruction sets. Compilation and optimization for vector instructions faces the same fundamental problems as automatic parallelization and is not well understood. Unfortunately, the complexity of the intrinsics interface used to write vector code by hand increases considerably with the vector length. In this paper, we present an early look at the kind of optimizations needed to produce efficient AVX and LRB code in the context of the Spiral program generation system for transforms. In particular, we show how to automatically obtain basic shuffle building blocks and present a peephole optimizer that performs efficient strength reduction. Inserted into Spiral we show automatically generated fast Fourier transform code with a vectorization efficiency (scalar floating point operations/total vector instructions) of 6 - 7.75 for 8-way AVX and 10 - 15.75 for 16-way LRB.

Keywords:
SIMD vectorization, Discrete/fast Fourier transform, Search/Learning for optimization, Numerical kernels we consider, Fast algorithms