Jeremy Johnson and Michael Andrews (Proc. International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA), 2008)
Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh-Hadamard Transform
Comment: Minisymposium on HPC Software: Tools, Libraries and Frameworks
Preprint (1.1 MB)
Published paper (link to publisher)

Short vector instructions (SIMD) can significantly increase performance, yet are difficult to use effectively. Recently, several efforts (ATLAS, FFTW, SPIRAL) have successfully used automated performance tuning and search to find good SIMD implementations of high-performance kernels such as matrix multiplication, the FFT and related transforms. In this paper, we review techniques, similar to those used by SPIRAL, for vectorizing sparse matrix factorizations, and incorporate them into a package for computing the Walsh–Hadamard Transform (WHT). These techniques along with search are used to discover algorithms with close to theoretical speedup. Not all algorithms provide the same amount of speedup and it is not the case that the vectorized version of the best sequential algorithm leads to the best algorithm with SIMD instructions. The goal of this paper is to better understand the search space and which algorithms lead to the best speedup and overall performance. Several SIMD specific algorithm features are used to predict speedup and performance. The resulting performance model is highly correlated with runtime performance and measured speedup, and can be used to partition the search space and expedite the search for good vector performance.

SIMD vectorization, Walsh-Hadamard transform, Performance model