Copyrights to these papers may be held by the publishers. The download files are preprints. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
Frédéric de Mesmay, Franz Franchetti, Yevgen Voronenko and Markus Püschel (Parallel Matrix Algorithms and Applications (PMAA), 2008, Presentation (Abstract reviewed))
Automatic Generation of Adaptive Libraries for Matrix-Multiplication
Bibtex
Developping high-performance matrix-multiplication libraries for current off-the-shelf microprocessors is a challenging task: it usually takes the combination of a peak-performance hand-scheduled vectorized kernel and multiple cache-aware multi-threaded blocking strategies to achieve maximum speed in all cases. Both these tasks depends on the underlying hardware and therefore need be rewritten or readapted for each and every platform. To alleviate this repetive effort, our work presents how to fully automate the development of such general matrix-multiplication libraries on various computing platforms. At the core of our code generator is a declarative domain-specific language that captures access patterns of blocked matrices. This language is manipulated using a symbolic rewritting system that is able to produce different alternatives. The rules in this language capture all the knowledge about the matrix-multiplication operation. Using this language, we generate all promising algorithms for computing small matrices products, produce the corresponding unrolled code and select the best kernels at compile time to include in our library. Then, the very same language is used to find and implement the set of mutually recursive functions that could be choosed from when deciding the blocking strategy. As this choice ultimately depends on the shape and size of the matrices and should adapt to the platform's caches, the final strategy is only decided at runtime, using dynamic programming. The final implementation of the library can be returned in C++ or Java, takes advantage of vector instructions, multi-processors and multi-cores where available, adapts each problem to the memory hierarchy, compares to vendor's performance and, amazingly, does not contain a single line of code written by a human.
Keywords: Beyond transforms, Linear algebra