Jeremy Johnson and Michael Andrews (Proc. Workshop on Performance Optimization for High-Level Languages and Libraries (POHLL), 2007 IEEE International Parallel and Distributed Processing Symposium, IEEE, pp. 450, 2007)
Performance Analysis of a Family of WHT Algorithms
Preprint (1.8 MB)
Published paper (link to publisher)
Bibtex

This paper explores the correlation of instruction counts and cache misses with runtime performance for a large family of divide-and-conquer algorithms for computing the Walsh-Hadamard transform (WHT). Previous work showed how to compute instruction counts and cache misses from a high-level description of the algorithm and proved theoretical results about their minimum, maximum, mean, and distribution. While the models themselves do not accurately predict performance, the paper shows that they are statistically correlated with performance and thus can be used to prune the search space for fast implementations. When the transform fits in cache, the instruction count alone is used; when the transform no longer fits in cache, a linear combination of instruction counts and cache misses is used. Thus, for small transforms it is safe to ignore algorithms with a high instruction count, and for large transforms it is safe to ignore algorithms with a high value of the combined instruction count/cache miss model. Since the models can be computed from a high-level description of the algorithms, they can be obtained without runtime measurement, and the previous theoretical results on the models can be applied to limit empirical search.
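As a rough illustration of how such a model could prune an empirical search, the sketch below ranks candidate WHT algorithms by instruction count when the transform fits in cache and by a linear combination of instruction count and cache misses otherwise, keeping only the cheapest fraction for timing. The weights, the keep fraction, and the Candidate structure are illustrative assumptions, not the paper's fitted model or implementation.

```python
# Hypothetical sketch: pruning candidate WHT algorithms with a combined
# instruction-count / cache-miss model. The weights (alpha, beta), the
# keep_fraction, and the Candidate fields are illustrative assumptions;
# in practice the coefficients would be fit to the target machine.
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str                # e.g. a textual encoding of the WHT split tree
    instruction_count: int   # derived from the high-level algorithm description
    cache_misses: int        # predicted by the cache model, not measured


def combined_cost(c: Candidate, alpha: float = 1.0, beta: float = 50.0) -> float:
    """Linear combination of instruction count and predicted cache misses."""
    return alpha * c.instruction_count + beta * c.cache_misses


def prune(candidates, fits_in_cache: bool, keep_fraction: float = 0.25):
    """Keep only the lowest-cost fraction of candidates.

    For transforms that fit in cache, rank by instruction count alone;
    otherwise rank by the combined model.
    """
    key = (lambda c: c.instruction_count) if fits_in_cache else combined_cost
    ranked = sorted(candidates, key=key)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


# Example: only the surviving candidates would be timed empirically.
pool = [
    Candidate("split A", 9_500_000, 120_000),
    Candidate("split B", 8_800_000, 400_000),
    Candidate("split C", 10_200_000, 90_000),
    Candidate("split D", 9_100_000, 110_000),
]
for c in prune(pool, fits_in_cache=False):
    print(c.name, combined_cost(c))
```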

Keywords:
Walsh-Hadamard transform, Performance model

More information:

Workshop web page