Copyrights to these papers may be held by the publishers. The download files are preprints. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
Benjamin Hess, Thomas Gross and Markus Püschel (Proc. International Conference on Generative Programming: Concepts & Experiences (GPCE), pp. 83-92, 2014)
Automatic Locality-Friendly Interface Extension of Numerical Functions
Preprint (509 KB)
Published paper (link to publisher)
Raising the level of abstraction is a key concern of software engineering, and libraries (either used directly or as a target of a program generation system) are a successful technique to raise programmer productivity and to improve software quality. Unfortunately successful libraries may contain functions that may not be general enough. E.g., many numeric performance libraries contain functions that work on one- or higher-dimensional arrays. A problem arises if a program wants to invoke such a function on a non-contiguous subarray (e.g., in C the column of a matrix or a subarray of an image). If the library developer did not foresee this scenario, the client program must include explicit copy steps before and after the library function call, incurring a possibly high performance penalty. A better solution would be an enhanced library function that allows for the desired access pattern. Exposing the access pattern allows the compiler to optimize for the intended usage scenario(s). As we do not want the library developer to generate all interesting versions manually, we present a tool that takes a library function written in C and generates such a customized function for typical accesses. We describe the approach, discuss limitations, and report on the performance. As example access patterns we consider those most common in numerical applications: permutations, striding and block striding, as well as scaling. We evaluate the tool on various library functions including filters, scans, reductions, sorting, FFTs and linear algebra operations. The automatically generated custom version is in most cases significantly faster than using individual steps, offering gains that are typically in the range of 20--80%.Keywords: SIMD vectorization, Locality optimization