Daniel McFarlin, Franz Franchetti and Markus Püschel (submitted for publication)
Automatic SIMD Vectorization of Fast Fourier Transforms for the Larrabee and AVX Instruction Sets
Bibtex

The well-known shift to parallelism in CPUs is often associated with multicores. However another trend is equally salient: the increasing parallelism in per-core single-instruction multiple-date (SIMD) vector units. Intel's SSE and IBM’s Altivec both offer 4-way (single precision) floating point, but for the near future both 8-way and 16-way have been announced with Intel's AVX and Larrabee (LRB) instruction sets. Compilation and optimization for vector instructions faces the same fundamental problems as automatic parallelization and is not well understood. Unfortunately, the complexity of the intrinsics interface used to write vector code by hand increases considerably with the vector length. In this paper, we present an early look at the kind of optimizations needed to produce efficient AVX and LRB code in the context of the Spiral program generation system for transforms. In particular, we show how to automatically obtain basic shuffle building blocks and present a peephole optimizer that performs efficient strength reduction. Inserted into Spiral we show automatically generated fast Fourier transform code with a vectorization efficiency (scalar floating point operations/total vector instructions) of 6 - 7.75 for 8-way AVX and 10 - 15.75 for 16-way LRB.

Frédéric de Mesmay, Srinivas Chellappa, Franz Franchetti and Markus Püschel (to appear in Proc. International Conference on High Performance Embedded Architectures & Compilers (HiPEAC), 2010)
Computer Generation of Efficient Software Viterbi Decoders
Bibtex

This paper presents a program generator for fast software Viterbi decoders for arbitrary convolutional codes. The input to the generator is a specification of the code, and single-instruction multiple-data (SIMD) vector length. The output is an optimized C implementation of the decoder that uses explicit Intel SSE vector instructions. Only the performance-critical forward pass is actually generated, and a generic infrastructure handles the traceback pass. At the heart of the generator is a small domain-specific language called VL to express the structure of the forward pass. Vectorization is done by rewriting VL expressions, which a compiler then translates into actual code in addition to performing further optimizations specific to the vector instruction set. Benchmarks show that the generated decoders match the performance of available expert hand-tuned implementations, while spanning the entire space of convolutional codes.

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (submitted for publication)
Hardware Implementation of the Discrete Fourier Transform with Non-Power-of-Two Problem Size
Bibtex

Frédéric de Mesmay, Yevgen Voronenko and Markus Püschel (submitted for publication)
Offline Library Adaptation Using Automatically Generated Heuristics
Bibtex

Automatic tuning has emerged as a solution to provide high-performance libraries for fast changing, increasingly complex computer architectures. We distinguish offline adaptation (e.g., in ATLAS) that is performed during installation without the full problem description from online adaptation (e.g., in FFTW) that is performed at run-time. Offline adaptive libraries are simpler to use, but, unfortunately, writing the adaptation heuristics that power them is a daunting task. The overhead of online adaptive libraries, on the other hand, makes them unsuitable for a number of applications. In this paper, we propose to automatically generate heuristics in the form of decision trees using a statistical classifier, effectively converting an online adaptive library into an offline one. As testbed we use Spiral-generated adaptive transform libraries for current multicores with vector extensions. We show that replacing the online search with generated decision trees maintains a performance competitive with vendor libraries while allowing for a simpler interface and reduced computation overhead.

Yannis Benlachtar, Philip M. Watts, Rachid Bouziane, Peter A. Milder, Robert Koutsoyannis, James C. Hoe, Markus Püschel, Madeleine Glick and Robert I. Killey (to appear in Proc. European Conference on Optical Communication (ECOC), 2009)
21.4 GS/s Real-Time DSP-Based Optical OFDM Signal Generation and Transmission Over 1600 km of Uncompensated Fibre
Preprint (279 KB)
Bibtex

We report a real-time optical OFDM transmitter with the highest sampling rate to date. Generation and transmission of an 8.36 Gb/s digitally up-converted single sideband OFDM signal over 1600km of uncompensated fiber with a BER < 10-3 was achieved.

Yevgen Voronenko and Markus Püschel (IEEE Transactions on Signal Processing, Vol. 57, No. 1, pp. 205-222, 2009)
Algebraic Signal Processing Theory: Cooley-Tukey Type Algorithms for Real DFTs
Preprint (324 KB)
Published paper (link to publisher)
Bibtex

In this paper we systematically derive a large class of fast general-radix algorithms for various types of real discrete Fourier transforms (real DFTs) including the discrete Hartley transform (DHT) based on the algebraic signal processing theory. This means that instead of manipulating the transform definition, we derive algorithms by manipulating the polynomial algebras underlying the transforms using one general method. The same method yields the well-known Cooley-Tukey fast Fourier transform (FFT) as well as general radix discrete cosine and sine transform algorithms. The algebraic approach makes the derivation concise, unifies and classifies many existing algorithms, yields new variants, enables structural optimization, and naturally produces a human-readable structural algorithm representation based on the Kronecker product formalism. We show, for the first time, that the general-radix Cooley-Tukey and the lesser known Bruun algorithms are instances of the same generic algorithm. Further, we show that this generic algorithm can be in stantiated for all four types of the real DFT and the DHT.

Peter A. Milder, James C. Hoe and Markus Püschel (Proc. Design, Automation and Test in Europe (DATE), pp. 1118-1123, 2009)
Automatic Generation of Streaming Datapaths for Arbitrary Fixed Permutations
Preprint (222 KB)
Bibtex

This paper presents a technique to perform arbitrary fixed permutations on streaming data. We describe a parameterized architecture that takes as input n data points streamed at a rate of w per cycle, performs a permutation over all n points, and outputs the result in the same streaming format. We describe the system and its requirements mathematically and use this mathematical description to show that the datapaths resulting from our technique can sustain a full throughput of w words per cycle without stalling. Additionally, we provide an algorithm to configure the datapath for a given permutation and streaming width. Using this technique, we have constructed a full synthesis system that takes as input a permutation and a streaming width and outputs a register-transfer level Verilog description of the datapath. We present an evaluation of our generated designs over varying problem sizes and streaming widths, synthesized for a Xilinx Virtex-5 FPGA.

Daniel McFarlin, Franz Franchetti and Markus Püschel (Proc. High Performance Embedded Computing (HPEC), 2009)
Automatic Generation of Vectorized Fast Fourier Transform Libraries for the Larrabee and AVX Instruction Set Extension
Preprint (205 KB)
Bibtex

Basilio B. Fraguela, Yevgen Voronenko and Markus Püschel (Proc. Parallel Architectures and Compilation Techniques (PACT), pp. 271-280, 2009)
Automatic Tuning of Discrete Fourier Transforms Driven by Analytical Modeling
Preprint (2.7 MB)
Bibtex

Analytical models have been used to estimate optimal values for parameters such as tile sizes in the context of loop nests. However, important algorithms such as fast Fourier transforms (FFTs) present a far more complex search space consisting of many thousands of different implementations with very different complex access patterns and nesting and code structures. As a results, some of the best available FFT implementations use heuristic search based on runtime measurements. In this paper we present the first analytical model that can successfully replace the measurement in this search on modern platforms. The model includes many details of the platform's memory system including the TLBs, and, for the first time, physically addressed caches and hardware prefetching. The effect, as we show, is a dramatically reduced search time to find the best FFT without significant loss in performance. Even though our model is adapted to the FFT in this paper, its underlying structure should be applicable for a much larger set of code structures and hence is a candidate for iterative compilation.

Frédéric de Mesmay, Arpad Rimmel, Yevgen Voronenko and Markus Püschel (Proc. International Conference on Machine Learning (ICML), pp. 729-736, 2009)
Bandit-Based Optimization on Graphs with Application to Library Performance Tuning
Preprint (309 KB)
Published paper (link to publisher)
Bibtex

The problem of choosing fast implementations for a class of recursive algorithms such as the fast Fourier transforms can be formulated as an optimization problem over the language generated by a suitably defined grammar. We propose a novel algorithm that solves this problem by reducing it to maximizing an objective function over the sinks of a directed acyclic graph. This algorithm valuates nodes using Monte-Carlo and grows a subgraph in the most promising directions by considering local maximum k-armed bandits. When used inside an adaptive linear transform library, it cuts down the search time by an order of magnitude to find a similar solution. Further, the search can be stopped at any time and still return a solution. In some cases, the performance of the implementations found is also increased by up to 10% which is of considerable practical importance since it consequently improves the performance of all applications using the performance library.

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. International Conference on Supercomputing (ICS), pp. 26-35, 2009)
Computer Generation of Fast Fourier Transforms for the Cell Broadband Engine
Preprint (238 KB)
Published paper (link to publisher)
Bibtex

The Cell BE is a multicore processor with eight vector accelerators (called SPEs) that implement explicit cache management through direct memory access engines. While the Cell has an impressive floating point peak performance, programming and optimizing for it is difficult as it requires explicit memory management, multithreading, streaming, and vectorization. We address this problem for the discrete Fourier transform (DFT) by extending Spiral, a program generation system, to automatically generate highly optimized implementations for the Cell. The extensions include multi-SPE parallelization and explicit memory streaming, both performed at a high abstraction level using rewriting systems operating on Spiral's internal domain-specific language. Further, we support latency and throughput optimizations, single and double precision, and different data formats. The performance of Spiral's computer generated code is comparable with and sometimes better than existing DFT implementations, where available.

Yevgen Voronenko, Frédéric de Mesmay and Markus Püschel (Proc. International Symposium on Code Generation and Optimization (CGO), pp. 102-113, 2009)
Computer Generation of General Size Linear Transform Libraries
Preprint (563 KB)
Published paper (link to publisher)
Bibtex

The development of high-performance libraries has become extraordinarily difficult due to multiple processor cores, vector instruction sets, and deep memory hierarchies. Often, the library has to be reimplemented and reoptimized, when a new platform is released. In this paper we show how to automatically generate general input-size libraries for the domain of linear transforms. The input to our generator is a formal specification of the transform and the recursive algorithms the library should use; the output is a library that supports general input size, is vectorized and multithreaded, provides an adaptation mechanism for the memory hierarchy, and has excellent performance, comparable to or better than the best human-written libraries. Further, we show that our library generator enables various customizations; one example is the generation of Java libraries.

Franz Franchetti, Markus Püschel, Yevgen Voronenko, Srinivas Chellappa and José M. F. Moura (IEEE Signal Processing Magazine, special issue on ``Signal Processing on Platforms with Multiple Cores'', Vol. 26, No. 6, pp. 90-102, 2009)
Discrete Fourier Transform on Multicore
Preprint (453 KB)
Published paper (link to publisher)
Bibtex

This paper gives an overview on the techniques needed to implement the discrete Fourier transform (DFT) efficiently on current multicore systems. The focus is on Intel compatible multicores but we also discuss the IBM Cell, and briefly, graphics processing units (GPUs). The performance optimization is broken down into three key challenges: parallelization, vectorization, and memory hierarchy optimization. In each case, we use the Kronecker product formalism to formally derive the necessary algorithmic transformations based on a few hardware parameters. Further code level optimizations are discussed. The rigorous nature of this framework enables the complete automation of the implementation task as shown by the program generator Spiral. Finally, we show and analyze DFT benchmarks of the fastest libraries available for the considered platforms.

Franz Franchetti and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 549-552, 2009)
Generating High-Performance Pruned FFT Implementations
Published paper (link to publisher)
Bibtex

We derive a recursive general-radix pruned Cooley-Tukey fast Fourier transform (FFT) algorithm in Kronecker product notation. The algorithm is compatible with vectorization and parallelization required on state-of-the-art multicore CPUs. We include the pruned FFT algorithm into the program generation system Spiral, and automatically generate optimized implementations of the pruned FFT for the Intel Core2Duo multicore processor. Experimental results show that using the pruned FFT can indeed speed up the fastest available FFT implementations by up to 30% when the problem size and the pattern of unused inputs and outputs are known in advance.

Yannis Benlachtar, Philip M. Watts, Rachid Bouziane, Peter A. Milder, Deepak Rangaraj, Anthony Cartolano, Robert Koutsoyannis, James C. Hoe, Markus Püschel, Madeleine Glick and Robert I. Killey (to appear in Optics Express, Vol. 17, No. 20, pp. 17658-17668, 2009)
Generation of Optical OFDM Signals Using 21.4 GS/s Real Time Digital Signal Processing
Published paper (link to publisher)
Bibtex

We demonstrate a field programmable gate array (FPGA) based optical orthogonal frequency division multiplexing (OFDM) transmitter implementing real time digital signal processing at a sample rate of 21.4 GS/s. The QPSK-OFDM signal is generated using an 8 bit, 128 point inverse fast Fourier transform (IFFT) core, performing one transform per clock cycle at a clock speed of 167.2 MHz and can be deployed with either a direct-detection or a coherent receiver. The hardware design and the main digital signal processing functions are described, and we show that the main performance limitation is due to the low (4-bit) resolution of the digital-to-analog converter (DAC) and the 8-bit resolution of the IFFT core used. We analyze the back-to-back performance of the transmitter generating an 8.36 Gb/s optical single sideband (SSB) OFDM signal using digital up-conversion, suitable for direct-detection. Additionally, we use the device to transmit 8.36 Gb/s SSB OFDM signals over 200 km of uncompensated standard single mode fiber achieving an overall BER<10^−3.

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. High Performance Embedded Computing (HPEC), 2009)
High Performance Linear Transform Program Generation for the Cell BE
Bibtex

Daniel McFarlin, Franz Franchetti, José M. F. Moura and Markus Püschel (Proc. SPIE Conference on Defense, Security, and Sensing, Proceedings of SPIE, Vol. 7337, pp. 733708, 2009)
High Performance Synthetic Aperture Radar Image Formation On Commodity Architectures
Published paper (link to publisher)
Bibtex

Synthetic Aperture Radar (SAR) image processing platforms are increasingly confronted with vast datasets and hard real-time deadlines. Platform developers are also finding themselves constrained by battlefield exigencies, time-to-market pressures and rapidly varying requirements. In response, developers are coupling high performance, general-purpose Commercial-Off-The-Shelf (COTS) architectures with software implementations of SAR algorithms. While this approach provides great flexibility, achieving the requisite performance on COTS architectures such as IBM's Cell BE and Intel's Core2 Quad is a potentially time-consuming and error-prone process. This is principally due to the highly parallel nature of modern COTS architectures. Developers must now grapple with architectures that exhibit parallelism at nearly every level of granularity; extracting high performance requires a complex interweaving of multiple forms of parallelism (e.g., SIMD vector extensions, multicore). In this paper, we first present an overview of SPIRAL, a program generator that reduces the burden of developing high performance code by automatically exploring many combinations of parallelism. We then extend SPIRAL's domain-specific language to expose high-level SAR constructs that a developer can combine into a parameterized scenario. We subsequently employ these extensions to generate code for the computationally dominant image formation component of Polar Format SAR processing. By fully leveraging automatic program generation, SPIRAL produces code for Intel's Core2 Quad that surpasses competing hand-tuned implementations on the Cell Blade, an architecture with twice as many cores and three times the memory bandwidth. We show an average performance of 34 Gigaflops/sec for 16-Megapixel and 100-Megapixel SAR images with runtimes of 0.6 and 4.45 seconds respectively.

Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin and Markus Püschel (Proc. IFIP Working Conference on Domain Specific Languages (DSL WC), Lecture Notes in Computer Science, Springer, Vol. 5658, pp. 385-410, 2009)
Operator Language: A Program Generation Framework for Fast Kernels
Preprint (511 KB)
Bibtex

We present the Operator Language (OL), a framework to automatically generate fast numerical kernels. OL provides the structure to extend the program generation system Spiral beyond the transform domain. Using OL, we show how to automatically generate library functionality for the fast Fourier transform and multiple non-transform kernels, including matrix-matrix multiplication, synthetic aperture radar (SAR), circular convolution, sorting networks, and Viterbi decoding. The control flow of the kernels is data-independent, which allows us to cast their algorithms as operator expressions. Using rewriting systems, a structural architecture model and empirical search, we automatically generate very fast C implementations for state-of-the-art multicore CPUs that rival hand-tuned implementations.

Markus Püschel, Peter A. Milder and James C. Hoe (Journal of the ACM, Vol. 56, No. 2, pp. 10:1-10:34, 2009)
Permuting Streaming Data Using RAMs
Published paper (link to publisher)
Bibtex

This paper presents a method for constructing hardware structures to perform permutations on streaming data. The method applies to permutations that can be represented as linear mappings on the bit-level representation of the data locations, a subclass that includes many important permutations such as stride permutations (corner turn, perfect shuffle, etc.), the bit reversal, and the Hadamard reordering. This problem has only been considered in the literature for the special case of stride permutations. We propose a family of parameterized datapaths that consist of several independent banks of memory and two interconnection networks. These structures are built for a given streaming width, i.e., number of inputs and outputs per cycle. The structures are able to operate at full throughput for a given streaming width. We provide an algorithm that completely specifies the datapath and control logic given a permutation and a number of input/output ports. Further, we provide lower bounds on the achievable cost of the solution. For an important subclass of considered permutations, the resulting design is proven optimal in both the cost of control and the number of switching elements required by the interconnection networks. We apply our algorithm to derive complete solutions for several important permutations, including a detailed example that carefully illustrates each aspect of the design process. Lastly, we compare our permutation structures to those of [Jarvinen et al. 2004], which are specialized for stride permutations.

Frédéric de Mesmay, Franz Franchetti, Yevgen Voronenko and Markus Püschel (Parallel Matrix Algorithms and Applications (PMAA), 2008, Presentation (Abstract reviewed))
Automatic Generation of Adaptive Libraries for Matrix-Multiplication
Bibtex

Developping high-performance matrix-multiplication libraries for current off-the-shelf microprocessors is a challenging task: it usually takes the combination of a peak-performance hand-scheduled vectorized kernel and multiple cache-aware multi-threaded blocking strategies to achieve maximum speed in all cases. Both these tasks depends on the underlying hardware and therefore need be rewritten or readapted for each and every platform. To alleviate this repetive effort, our work presents how to fully automate the development of such general matrix-multiplication libraries on various computing platforms. At the core of our code generator is a declarative domain-specific language that captures access patterns of blocked matrices. This language is manipulated using a symbolic rewritting system that is able to produce different alternatives. The rules in this language capture all the knowledge about the matrix-multiplication operation. Using this language, we generate all promising algorithms for computing small matrices products, produce the corresponding unrolled code and select the best kernels at compile time to include in our library. Then, the very same language is used to find and implement the set of mutually recursive functions that could be choosed from when deciding the blocking strategy. As this choice ultimately depends on the shape and size of the matrices and should adapt to the platform's caches, the final strategy is only decided at runtime, using dynamic programming. The final implementation of the library can be returned in C++ or Java, takes advantage of vector instructions, multi-processors and multi-cores where available, adapts each problem to the memory hierarchy, compares to vendor's performance and, amazingly, does not contain a single line of code written by a human.

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Supercomputing (SC), 2008, Poster (Abstract reviewed))
Automatic Linear Transform Program Generation for the Cell BE
Bibtex

The complexity of the Cell BE's architecture makes it difficult and time consuming to develop multithreaded, vectorized, high-performance numerical libraries for it. Our approach to solving this problem is to use Spiral, a program generation system, to automatically generate and optimize linear transform libraries for the Cell. We first extend the Spiral framework for the Cell architecture to automatically generate high-performance discrete Fourier, Cosine, and Sine transform kernels that run on a single SPE. The performance of our kernels is comparable to hand tuned code where available, and reaches 14--20 Gflop/s on a single SPE for input vectors resident in the local memories. Next, we produce optimized multithreaded code for the discrete Fourier transform that runs on multiple SPEs and obtains upto a 2x speed-up on 4 SPEs. Our poster will provide details of the formal methods used by our approach, and will present our latest performance results.

Franz Franchetti, Yevgen Voronenko, Peter A. Milder, Srinivas Chellappa, Marek Telgarsky, Hao Shen, Paolo D'Alberto, Frédéric de Mesmay, James C. Hoe, José M. F. Moura and Markus Püschel (Proc. NSF Next Generation Software Program Workshop (NSFNGS) colocated with IPDPS, 2008)
Domain-Specific Library Generation for Parallel Software and Hardware Platforms
Preprint (180 KB)
Published paper (link to publisher)
Bibtex

We overview a library generation framework called Spiral. For the domain of linear transforms, Spiral automatically generates implementations for parallel platforms including SIMD vector extensions, multicore processors, field-programmable gate arrays (FPGAs) and FPGA accelerated processors. The performance of the generated code is competitive with the best available hand-written libraries.

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA), 2008)
FFT Program Generation for the Cell BE
Preprint (55 KB)
Bibtex

The Cell BE is designed for accelerating heavy numerical computation. The Cell features an innovative design that includes multiple vectorized cores with explicitly managed per-core local memory and intercore communication. These features allow for a theoretical peak performance of 204.8 Gflop/s (single precision, using just the SPEs) without breaking the memory, power, or frequency walls. However, the very same features that allow for high theoretical performance make it difficult and time consuming to develop multithreaded, vectorized, high-performance numerical libraries for the Cell BE. Our approach to solving this problem is to use the Spiral framework to automatically generate and optimize signal processing transform libraries for the Cell. Spiral is a program generation system for linear transforms such as the discrete Fourier transform, discrete cosine transforms, filters, and others. For a user-selected transform, Spiral autonomously generates different algorithms, represented in a declarative form as mathematical formulas, and their implementations to find the best match for the target platform. Besides using search techniques, Spiral performs deterministic optimizations at the formula level, effectively restructuring the code in ways impractical at the code level. To extend the Spiral framework to support the Cell architecture, we first focus on automatically generating high-performance vectorized DFT kernels that run on a single SPE. The performance of our single-threaded code is comparable to hand tuned code, and achieves 16-20 Gflop/s on a single SPE for input vectors resident in the local stores. Next, we produce optimized multithreaded code that runs on multiple SPEs. To maximize the achieved interconnect bandwidth, the algorithms generated use large DMA packets for all-to-all data exchanges between the SPEs.

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (Proc. Design Automation Conference (DAC), pp. 385-390, 2008)
Formal Datapath Representation and Manipulation for Implementing DSP Transforms
Preprint (285 KB)
Published paper (link to publisher)
Bibtex

We present a domain-specific approach to representing datapaths for hardware implementations of linear signal transform algorithms. We extend the tensor structure for describing linear transform algorithms, adding the ability to explicitly characterize two important dimensions of datapath architecture. This representation allows both algorithm and datapath to be specified within a single formula, and gives the designer the ability to easily consider a wide space of possible datapaths at a high level of abstraction. We have constructed a formula manipulation system based upon this representation, and have written a compiler that can translate a formula into a hardware implementation. This enables an automatic “push button” compilation flow that produces a register transfer level hardware description from high-level datapath directives and an algorithm (written as a formula). In our experimental results, we demonstrate that this approach yields efficient designs over a large tradeoff space.

Yevgen Voronenko, Franz Franchetti, Frédéric de Mesmay and Markus Püschel (Proc. High Performance Embedded Computing (HPEC), 2008)
Generating High-Performance General Size Linear Transform Libraries Using Spiral
Preprint (191 KB)
Published paper (link to publisher)
Bibtex

Franz Franchetti and Markus Püschel (Proc. International Conference on Compiler Construction (CC), Lecture Notes in Computer Science, Springer, Vol. 4959, pp. 116-131, 2008)
Generating SIMD Vectorized Permutations
Preprint (244 KB)
Published paper (link to publisher)
Bibtex

This paper introduces a method to generate efficient vectorized implementations of small stride permutations using only vector load and vector shuffle instructions. These permutations are crucial for high-performance numerical kernels including the fast Fourier transform. Our generator takes as input only the specification of the target platform's SIMD vector ISA and the desired permutation. The basic idea underlying our generator is to model vector instructions as matrices and sequences of vector instructions as matrix formulas using the Kronecker product formalism. We design a rewriting system and a search mechanism that applies matrix identities to generate those matrix formulas that have vector structure and minimize a cost measure that we define. The formula is then translated into the actual vector program for the specified permutation. For three important classes of permutations, we show that our method yields a solution with the minimal number of vector shuffles. Inserting into a fast Fourier transform yields a significant speedup.

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. Summer School on Generative and Transformational Techniques in Software Engineering (GTTSE), Lecture Notes in Computer Science, Springer, Vol. 5235, pp. 196-259, 2008)
How To Write Fast Numerical Code: A Small Introduction
Preprint (556 KB)
Published paper (link to publisher)
Bibtex

The complexity of modern computing platforms has made it increasingly difficult to write numerical code that achieves the best possible performance. Straightforward implementations based on algorithms that minimize the operations count often fall short in performance by an order of magnitude. This tutorial introduces the reader to a set of general techniques to improve the performance of numerical code, focusing on optimizations for the computer's memory hierarchy. Two running examples are used to demonstrate these techniques: matrix-matrix multiplication and the discrete Fourier transform.

Yevgen Voronenko (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2008)
Library Generation for Linear Transforms
Preprint (2.2 MB)
Bibtex

The development of high-performance numeric libraries has become extraordinarily difficult due to multiple processor cores, vector instruction sets, and deep memory hierarchies. To make things worse, often each library has to be re-implemented and re-optimized, whenever a new platform is released. In this thesis we develop a library generator that completely automates the library development for one important numerical domain: linear transforms, which include the discrete Fourier transform, discrete cosine transforms, filters, and discrete wavelet transforms. The input to our generator is a specification of the transform and a set of recursive algorithms for the transform, represented in a high-level domain-specific language; the output is a C++ library that supports general input size, is vectorized and multithreaded, and provides an optional adaptation mechanism for the memory hierarchy. Further, as we show in extensive benchmarks, the runtime performance of our automatically generated libraries is comparable to and often even higher than the best existing human-written code, including the widely used library FFTW and the commercially developed and maintained Intel Integrated Performance Primitives (IPP) and AMD Performance Library (APL). Our generator automates all library development steps typically performed manually by programmers, such as analyzing the algorithm and finding the set of required recursive functions and base cases, the appropriate restructuring of the algorithm to parallelize for multiple threads and to vectorize for the available vector instruction set and vector length, and performing code level optimizations such as algebraic simplification and others. The key to achieving full automation as well as excellent performance is a proper set of abstraction layers in the form of domain-specific languages called SPL (Signal Processing Language), index-free \sspl, regular \sspl, and intermediate code representation, and the use of rewriting systems to perform all difficult optimizations at a suitable, high level of abstraction. In addition, we demonstrate that our automatic library generation framework enables various forms of customization that would be very costly to perform manually. As examples, we show generated trade-offs between code size and performance, generated Java libraries obtained by modifying the backend, and functional customization for important transform variants.

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (Proc. Workshop on High-Level Synthesis colocated with DAC, 2008)
Linear Transforms: From Math to Efficient Hardware
Preprint (70 KB)
Bibtex

Franz Franchetti, Daniel McFarlin, Frédéric de Mesmay, Hao Shen, Tomasz Wiktor Włodarczyk, Srinivas Chellappa, Marek Telgarsky, Peter A. Milder, Yevgen Voronenko, Qian Yu, James C. Hoe, José M. F. Moura and Markus Püschel (Proc. High Performance Embedded Computing (HPEC), 2008)
Program Generation with Spiral: Beyond Transforms
Published paper (link to publisher)
Bibtex

Jeremy Johnson, Tim Chagnon, Petya Vachranukunkiet, Prawat Nagvajara and Chika Nwankpa (Proc. International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA), 2008)
Sparse LU Decomposition using FPGA
Preprint (539 KB)
Published paper (link to publisher)
Bibtex

This paper reports on an FPGA implementation of sparse LU decomposition. The resulting special purpose hardware is geared towards power system problems - load flow computation - which are typically solved iteratively using Newton Raphson. The key step in this process, which takes approximately 85% of the computation time, is the solution of sparse linear systems arising from the Jacobian matrices that occur in each iteration of Newton Raphson. Current state-of-the-art software packages, such as UMFPACK and SuperLU, running on general purpose processors perform suboptimally on these problems due to poor utilization of the floating point hardware (typically 1 to 4% efficiency). Our LU hardware, using a special purpose data path and cache, designed to keep the floating point hardware busy, achieves an efficiency of 60% and higher. This improved efficiency provides an order of magnitude speedup when compared to a software solution using UMFPACK running on general purpose processors.

Jeremy Johnson and Michael Andrews (Proc. International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA), 2008)
Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh-Hadamard Transform
Preprint (1.1 MB)
Published paper (link to publisher)
Bibtex

Short vector instructions (SIMD) can significantly increase performance, yet are difficult to use effectively. Recently, several efforts (ATLAS, FFTW, SPIRAL) have successfully used automated performance tuning and search to find good SIMD implementations of high-performance kernels such as matrix multiplication, the FFT and related transforms. In this paper, we review techniques, similar to those used by SPIRAL, for vectorizing sparse matrix factorizations, and incorporate them into a package for computing the Walsh–Hadamard Transform (WHT). These techniques along with search are used to discover algorithms with close to theoretical speedup. Not all algorithms provide the same amount of speedup and it is not the case that the vectorized version of the best sequential algorithm leads to the best algorithm with SIMD instructions. The goal of this paper is to better understand the search space and which algorithms lead to the best speedup and overall performance. Several SIMD specific algorithm features are used to predict speedup and performance. The resulting performance model is highly correlated with runtime performance and measured speedup, and can be used to partition the search space and expedite the search for good vector performance.

Yevgen Voronenko, Franz Franchetti, Frédéric de Mesmay and Markus Püschel (Proc. Algebraic Methodology and Software Technology (AMAST), 2008)
System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries
Preprint (187 KB)
Bibtex

We demonstrate Spiral, a domain-specific library generation system. Spiral generates high performance source code for linear transforms (such as the discrete Fourier transform and many others) directly from a problem specification. The key idea underlying Spiral is to perform automatic reasoning and optimizations at a high abstraction level using the mathematical, declarative domain-specific languages SPL and $\Sigma$-SPL and a rigorous rewriting framework. Optimization includes various forms of parallelization. Even though Spiral provides complete automation, its generated libraries run often faster than any existing hand-written code.

Sung-Chul Han (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2007)
A Flexible Decoder and Performance Evaluation of Array-Structured LDPC Codes
Preprint (1.4 MB)
Bibtex

The low density parity check (LDPC) codes designed by a pseudorandom construction, as proposed in Gallager’s original work, have been shown to perform very close to the Shannon limit (when constructed as very long codes); however, the lack of structure in such codes makes them unsuitable for practical applications due to high encoding complexity and costly decoder implementations. These difficulties have lead to numerous works on the structured LDPC codes, especially array-structured codes with quasi-cyclic property. Among the array-structured codes, those with an array of cyclic permutation matrices have been of particular interest due to the balanced edge partitioning inherent in the structure that simplifies the implementation of highly parallel decoders. While many construction methods have been proposed for this circulant permutation array (CPA) structure, the performance of the codes has been reported to a very limited extent. Especially, the effect on the performance by the explicit control of graph parameters has not been provided despite the fact their importance is emphasized in the construction process. In the decoder design for quasi-cyclic LDPC codes, the primary concern is to exploit the array structure for efficient implementation. Fast hardware-based decoders on a medium-capacity FPGA are often faster than the software implementation by at least one or two orders of magnitude, and thus important for both actual deployment in practical systems and evaluation of error performance. As a large number of high-throughput decoders in the literature are designed for specific array dimensions and the bus and memory connections are simplified using the array structure of the code, the degree of parallelism in the decoders is dependent on the code parameters, making it difficult to parameterize the hardware to use a desired amount of hardware resource. Furthermore, such architectures cannot support a large class of array-structured codes with very different array dimensions. In this thesis, we present a generalized hardware decoder that supports any kind of quasi-cyclic LDPC codes including CPA-structured codes. The decoder has been designed with a priority on high flexibility. In the synthesis step, the degree of parallelism can be chosen independently from the code parameters. Thus, for FPGA implementation, the decoder can be parameterized to fully utilize a given amount of hardware resource. Also, it supports run-time reconfiguration of code parameters, i.e., different codes can be supported by changing register contents without a new synthesis. In wireless applications, such flexibility makes it possible to choose a channel code based on the varying channel condition. When used for performance evaluation purposes for a large set of codes, it saves a considerable amount of time by eliminating the need for re-synthesis for each code. Using the FPGA implementation of the proposed decoder, we characterize the performance of array-structured LDPC codes, with a primary focus on pseudorandomly constructed CPA-structured codes. Based on the obtained simulation results, we show the effect of combinatorial parameters (girth, diameter and column weight) on the error performance. The pseudorandom construction is also compared with algebraic construction, and with the codes specified in the IEEE 802.16e standards.

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (CSSI Technical Report #CSSI-07-01, Carnegie Mellon University, 2007)
Discrete Fourier Transform Compiler: From Mathematical Representation to Efficient Hardware
Preprint (259 KB)
Bibtex

A wide range of hardware implementations are possible for the discrete Fourier transform (DFT), offering different tradeoffs in throughput, latency and cost. The well-understood structure of DFT algorithms makes possible a fully automatic synthesis framework that can span the viable interesting design choices. In this paper, we present such a synthesis framework that starts from formal mathematical formulas of a general class of fast DFT algorithms and produces performance and cost efficient sequential hardware implementations, making design decisions and tradeoffs according to user specified high-level preferences. We present evaluations to demonstrate the variety of supported implementations and the cost/performance tradeoffs they allow.

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (Proc. IEEE International High Level Design Validation and Test Workshop (HLDVT), 2007)
FFT Compiler: From Math to Efficient Hardware
Preprint (158 KB)
Bibtex

This paper presents a high-level compiler that generates hardware implementations of the discrete Fourier transform (DFT) from mathematical specifications. The matrix formula input language captures not only the DFT calculation but also the implementation options at the algorithmic and architectural levels. By selecting the appropriate formula, the resulting hardware implementations (described in a synthesizable Verilog description) can achieve a wide range of tradeoffs between implementation cost and performance. The compiler is also parameterized for a set of technology-specific optimizations, to allow it to target specific implementation platforms. This paper gives a brief overview of the system and presents synthesis results.

Paolo D'Alberto, Peter A. Milder, Aliaksei Sandryhaila, Franz Franchetti, James C. Hoe, José M. F. Moura, Markus Püschel and Jeremy Johnson (Proc. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 173-184, 2007)
Generating FPGA Accelerated DFT Libraries
Preprint (347 KB)
Published paper (link to publisher)
Bibtex

We present a domain-specific approach to generate high performance hardware-software partitioned implementations of the discrete Fourier transform (DFT) in fixed point precision. The partitioning strategy is a heuristic based on the DFT’s divide-and-conquer algorithmic structure and fine tuned by the feedback-driven exploration of candidate designs. We have integrated this approach in the Spiral linear-transform code-generation framework to support push-button automatic implementation. We present evaluations of hardware–software DFT implementations running on the embedded PowerPC processor and the reconfigurable fabric of the Xilinx Virtex-II Pro FPGA. In our experiments, the 1D and 2D DFT’s FPGA accelerated libraries exhibit between 2 and 7.5 times higher performance (operations per second) and up to 2.5 times better energy efficiency (operations per Joule) than the software-only version.

Jeremy Johnson and Xu Xu (Proc. ACM International Symposium on Symbolic and Algebraic Computation (ISSAC), ACM, pp. 195-202, 2007)
Generating Symmetric DFTs and Equivariant FFT Algorithms
Published paper (link to publisher)
Bibtex

This paper presents a code generator which produces efficient implementations of multi-dimensional fast Fourier transform (FFT) algorithms which utilize symmetries in the input data to reduce memory usage and the number of arithmetic operations. The FFT algorithms are constructed using a group theoretic version of the divide and conquer step in the FFT that is compatible with the group of symmetries. The GAP compute algebra system is used to perform the necessary group computations and the generated algorithm is represented as a symbolic matrix factorization, which is translated into efficient code using the SPIRAL system. Performance data is given that shows that the resulting code is significantly faster than state-of-the-art FFT implementations that do not utilize the symmetries.

Yevgen Voronenko and Markus Püschel (IEEE Transactions on Signal Processing, Vol. 55, No. 9, pp. 4458-4473, 2007)
Mechanical Derivation of Fused Multiply-Add Algorithms for Linear Transforms
Preprint (751 KB)
Published paper (link to publisher)
Bibtex

Several computer architectures offer fused multiply-add (FMA), also called multiply-and-accumulate (MAC) instructions, that are as fast as a single addition or multiplication. For the efficient implementation of linear transforms, such as the discrete Fourier transform or discrete cosine transforms, this poses a challenge to algorithm developers as standard transform algorithms have to be manipulated into FMA algorithms that make optimal use of FMA instructions. We present a general method to convert any transform algorithm into an FMA algorithm. The method works with both algorithms given as directed acyclic graphs (DAGs) and algorithms given as structured matrix factorizations. We prove bounds on the efficiency of the method. In particular, we show that it removes all single multiplications except at most as many as the transform has outputs. We implemented the DAG-based version of the method and show that we can generate many of the best-known hand-derived FMA algorithms from the literature as well as a few novel FMA algorithms.

Yevgen Voronenko and Markus Püschel (ACM Transactions on Algorithms, Vol. 3, No. 2, 2007)
Multiplierless Multiple Constant Multiplication
Preprint (507 KB)
Published paper (link to publisher)
Bibtex

A variable can be multiplied by a given set of fixed-point constants using a multiplier block that consists exclusively of additions, subtractions, and shifts. The generation of a multiplier block from the set of constants is known as the multiple constant multiplication (MCM) problem. Finding the optimal solution, i.e., the one with the fewest number of additions and subtractions is known to be NP-complete. We propose a new algorithm for the MCM problem, which produces solutions that require up to 20% less additions and subtractions than the best previously known algorithm. At the same time our algorithm, in contrast to the closest competing algorithm, is not limited by the constant bitwidths. We present our algorithm using a unifying formal framework for the best, graph-based, MCM algorithms and provide a detailed runtime analysis and experimental evaluation. We show that our algorithm can handle problem sizes as large as 100 32-bit constants in a time acceptable for most applications. The implementation of the new algorithm is available at www.spiral.net.

Jeremy Johnson and Michael Andrews (Proc. Workshop on Performance Optimization for High-Level Languages and Libraries (POHLL), 2007 IEEE International Parallel and Distributed Processing Symposium , IEEE, pp. 450, 2007)
Performance Analysis of a Family of WHT Algorithms
Preprint (1.8 MB)
Published paper (link to publisher)
Bibtex

This paper explores the correlation of instruction counts and cache misses to runtime performance for a large family of divide and conquer algorithms to compute the Walsh–Hadamard transform (WHT). Previous work showed how to compute instruction counts and cache misses from a high–level description of the algorithm and proved theoretical results about their minimum, maximum, mean, and distribution. While the models themselves do not accurately predict performance, it is shown that they are statistically correlated to performance and thus can be used to prune the search space for fast implementations. When the size of the transform fits in cache the instruction count itself is used; however, when the transform no longer fits in cache, a linear combination of instruction counts and cache misses is used. Thus for small transforms it is safe to ignore algorithms which have a high instruction count and for large transforms it is safe to ignore algorithms with a high value in the combined instruction count/cache miss model. Since the models can be computed from a high–level description of the algorithms, they can be obtained without runtime measurement and the previous theoretical results on the models can be applied to limit empirical search.

Paolo D'Alberto, Markus Püschel and Franz Franchetti (Proc. International Conference on High Performance Embedded Architectures & Compilers (HiPEAC), 2007)
Performance/Energy Optimization of DSP Transforms on the XScale Processor
Preprint (646 KB)
Bibtex

The XScale processor family provides user-controllable independent configuration of CPU, bus, and memory frequencies. This feature introduces another handle for the code optimization with respect to energy consumption or runtime performance. We quantify the effect of frequency configurations on both performance and energy for three signal processing transforms: the discrete Fourier transform (DFT), finite impulse response (FIR) filters, and the Walsh-Hadamard Transform (WHT). To do this, we use SPIRAL, a program generation and optimization system for signal processing transforms. For a given transform to be implemented, SPIRAL searches over different algorithms to find the best match to the given platform with respect to the chosen performance metric (usually runtime). In this paper we use SPIRAL to generate implementations for different frequency configurations and optimize for runtime and physically measured energy consumption. In doing so we show that first, each transform achieves best performance/energy consumption for a different system configuration; second, the best code depends on the chosen configuration, problem size and algorithm; third, the fastest implementation is not always the most energy efficient; fourth, we introduce dynamic (i.e., during execution) reconfiguration in order to further improve performance/energy. Finally, we benchmark SPIRAL generated code against Intel’s vendor library routines. We show competitive results as well as 20% performance improvements or energy reduction for selected transforms and problem sizes.

Franz Franchetti and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. II-17-II-20, 2007)
SIMD Vectorization of Non-Two-Power Sized FFTs
Preprint (197 KB)
Published paper (link to publisher)
Bibtex

SIMD (single instruction multiple data) vector instructions, such as Intel’s SSE family, are available on most architectures, but are difficult to exploit for speed-up. In many cases, such as the fast Fourier transform (FFT), signal processing algorithms have to undergo major transformations to map efficiently. Using the Kronecker product formalism, we rigorously derive a novel variant of the general-radix Cooley-Tukey FFT that is structured to map efficiently for any vector length nu and radix. Then, we include the new FFT into the program generator Spiral to generate actual C implementations. Benchmarks on Intel’s SSE show that the new algorithms perform better on practically all sizes than the best available libraries Intel’s MKL and FFTW.

Peter Tummeltshammer, James C. Hoe and Markus Püschel (IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 26, No. 9, pp. 1551-1563, 2007)
Time-Multiplexed Multiple Constant Multiplication
Preprint (349 KB)
Published paper (link to publisher)
Bibtex

This paper studies area-efficient arithmetic circuits to multiply a fixed-point input value selectively by one of several preset fixed-point constants. We present an algorithm that generates a class of solutions to this time-multiplexed multiple-constant multiplication problem by “fusing” single-constant multiplication circuits for the required constants. Our evaluation compares our solution against a baseline implementation style that employs a full multiplier and a lookup table for the constants. The evaluation shows that we gain a significant area advantage, at the price of increased latency, for problem sizes (in terms of the number of constants) up to a threshold dependent on the bit-widths of the input and the constants. Our evaluation further shows that our solution is better suited for standard-cell application-specific integrated circuits than prior works on reconfigurable multiplier blocks.

Pranab Shenoy (Master thesis, Computer Science, Drexel University, 2007)
Universal FFT Core Generator
Published paper (link to publisher)
Bibtex

This thesis presents a special purpose processor for multi-dimensional discrete Fourier transform (DFT) computation. The processor, called a Universal fast Fourier transform (FFT) processor, is capable of computing an arbitrary multi-dimensional DFT with a fixed number of input points. The processor can be configured to compute different dimensional DFTs simply by rearranging the input data and using a different set of constants, called generalized twiddle factors, throughout the computation. The basic computational process remains the same independent of dimension. The processor does not rely on one-dimensional FFT computation as does the standard approach using the row-column algorithm, but instead utilizes an algorithm called the Dimensionless FFT, whose computation data flow is the same as the FFT algorithm due to Pease. Since the computation performed by the Universal FFT processor is the same as the Pease FFT, an implementation can be obtained by modifying a one-dimensional FFT processor based on the Pease algorithm. The implementation presented in this thesis is obtained in this way from the core generator by Nordin, Milder, Hoe, and P¨uschel. The core generator from Nordin, Milder, Hoe, and P¨uschel generates a variety of cores, with various space and time tradeoffs for implementation on field programmable gate arrays (FPGA). The performance obtained is comparable to vendor supplied cores but offers much greater flexibility of design choices with different area and throughput tradeoffs. The resulting Universal FFT core generator can be used to instantiate multi-dimensional FPGA cores with approximately the same performance and area obtained in the one-dimensional case by Nordin, Milder, Hoe, and P¨uschel.

Yevgen Voronenko and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 3, 2006)
Algebraic Derivation of General Radix Cooley-Tukey Algorithms for the Real Discrete Fourier Transform
Preprint (84 KB)
Published paper (link to publisher)
Bibtex

We first show that the real version of the discrete Fourier transform (called RDFT) can be characterized in the framework of polynomial algebras just as the DFT and the discrete cosine and sine transforms. Then, we use this connection to algebraically derive a general radix Cooley-Tukey type algorithm for the RDFT. The algorithm has a similar structure as its complex counterpart, but there are also important differences, which are exhibited by our Kronecker product style presentation. In particular, the RDFT is decomposed into smaller RDFTs but also other auxiliary transforms, which we then decompose by their own Cooley-Tukey type algorithms to obtain a full recursive algorithm for the RDFT.

Franz Franchetti, Yevgen Voronenko and Markus Püschel (Proc. High Performance Computing for Computational Science (VECPAR), Lecture Notes in Computer Science, Springer, Vol. 4395, pp. 363-377, 2006)
A Rewriting System for the Vectorization of Signal Transforms
Preprint (226 KB)
Published paper (link to publisher)
Bibtex

We present a rewriting system that automatically vectorizes signal transform algorithms at a high level of abstraction. The input to the system is a transform algorithm given as a formula in the well-known Kronecker product formalism. The output is a "vectorized" formula, which means it consists exclusively of constructs that can be directly mapped into short vector code. This approach obviates compiler vectorization, which is known to be limited in this domain. We included the formula vectorization into the Spiral program generator for signal transforms, which enables us to generate vectorized code and optimize through search over alternative algorithms. Benchmarks for the discrete Fourier transform (DFT) show that our generated floating-point code is competitive with and that our fixed-point code clearly outperforms the best available libraries.

Andreas Bonelli, Franz Franchetti, Juergen Lorenz, Markus Püschel and Christoph W. Ueberhuber (Proc. International Symposium on Parallel and Distributed Processing and Application (ISPA), Lecture Notes In Computer Science, Springer, Vol. 4330, pp. 818-832, 2006)
Automatic Performance Optimization of the Discrete Fourier Transform on Distributed Memory Computers
Preprint (146 KB)
Bibtex

This paper introduces a formal framework for automatically generating performance optimized implementations of the discrete Fourier transform (DFT) for distributed memory computers. The framework is implemented as part of the program generation and optimization system SPIRAL. DFT algorithms are represented as mathematical formulas in SPIRAL’s internal language SPL. Using a tagging mechanism and formula rewriting, we extend SPIRAL to automatically generate parallelized formulas. Using the same mechanism, we enable the generation of rescaling DFT algorithms, which redistribute the data in intermediate steps to fewer processors to reduce communication overhead. It is a novel feature of these methods that the redistribution steps are merged with the communication steps of the algorithm to avoid additional communication overhead. Among the possible alternative algorithms, SPIRAL’s search mechanism now determines the fastest for a given platform, effectively generating adapted code without human intervention. Experiments with DFT MPI programs generated by SPIRAL show performance gains of up to 30% due to rescaling. Further, our generated programs compare favorably with FFTW-MPI 2.1.5.

Paolo D'Alberto, Peter A. Milder, Franz Franchetti, James C. Hoe, Markus Püschel and José M. F. Moura (Proc. High Performance Embedded Computing (HPEC), 2006)
Discrete Fourier Transform Compiler for FPGA and CPU/FPGA Partitioned Implementations
Bibtex

Pawel Hitczenko, Jeremy Johnson and Hung-Jen Huang (Theoretical Computer Science, Vol. 352, pp. 8-30, 2006)
Distribution of a Class of Divide and Conquer Recurrences arising from the Computation of the Walsh-Hadamard Transform
Published paper (link to publisher)
Bibtex

Peter A. Milder, Mohammad Ahmad, James C. Hoe and Markus Püschel (Proc. FPGA, pp. 211-220, 2006)
Fast and Accurate Resource Estimation of Automatically Generated Custom DFT IP Cores
Preprint (145 KB)
Published paper (link to publisher)
Bibtex

This paper presents an equation-based resource utilization model for automatically generated discrete Fourier transform (DFT) soft core IPs. The parameterized DFT IP generator allows a user to make customized tradeoffs between cost and performance and between utilization of different resource classes. The equation-based resource model permits immediate and accurate estimation of resource requirements as the user considers the different generator options. Furthermore, the fast turnaround of the model allows it to be combined with a search algorithm such that the user could query automatically for an optimal design within the stated performance and resource constraints.Following a brief review of the DFT IP generator, this paper presents the development of the equation-based models for estimating slice and hard macro utilizations in the Xilinx Virtex-II Pro FPGA family. The evaluation section shows that an average error of 6.1% is achievable by a model of linear equations that can be evaluated in sub-microseconds. The paper further offers a demonstration of the automatic design exploration capability.

Franz Franchetti, Yevgen Voronenko and Markus Püschel (Proc. Supercomputing (SC), 2006)
FFT Program Generation for Shared Memory: SMP and Multicore
Preprint (161 KB)
Published paper (link to publisher)
Bibtex

The chip maker's response to the approaching end of CPU frequency scaling are multicore systems, which offer the same programming paradigm as traditional shared memory platforms but have different performance characteristics. This situation considerably increases the burden on library developers and strengthens the case for automatic performance tuning frameworks like Spiral, a program generator and optimizer for linear transforms such as the discrete Fourier transform (DFT). We present a shared memory extension of Spiral. The extension within Spiral consists of a rewriting system that manipulates the structure of transform algorithms to achieve load balancing and avoids false sharing, and of a backend to generate multithreaded code. Application to the DFT produces a novel class of algorithms suitable for multicore systems as validated by experimental results: we demonstrate a parallelization speed-up already for sizes that fit into L1 cache and compare favorably to other DFT libraries across all small and midsize DFTs and considered platforms.

Franz Franchetti, Andreas Bonelli, Ekapol Chuangsuwanich, Yu-Chiang Lee, Juergen Lorenz, Thomas Peter, Hao Shen, Marek Telgarsky, Yevgen Voronenko, Markus Püschel, José M. F. Moura and Christoph W. Ueberhuber (Proc. Workshop on Programming Models for Ubiquitous Parallelism (PMUP), 2006)
Parallelism in Spiral
Preprint (202 KB)
Bibtex

Spiral is a program generator for linear transforms such as the discrete Fourier transform. Spiral generates highly optimized code directly from a problem specification using a combination of techniques including optimization at a high level of abstraction using rewriting of mathematical expressions and heuristic search for platform adaptation. In this paper, we overview the generation of parallel programs using Spiral. This includes programs for vector architectures and programs for shared or distributed memory platforms.

Sung-Chul Han, Franz Franchetti and Markus Püschel (Proc. Parallel Architectures and Compilation Techniques (PACT), pp. 222-232, 2006)
Program Generation for the All-Pairs Shortest Path Problem
Preprint (258 KB)
Published paper (link to publisher)
Bibtex

A recent trend in computing are domain-specific program generators, designed to alleviate the effort of porting and re-optimizing libraries for fast-changing and increasingly complex computing platforms. Examples include ATLAS, SPIRAL, and the codelet generator in FFTW. Each of these generators produces highly optimized source code directly from a problem specification. In this paper, we extend this list by a program generator for the well-known Floyd-Warshall (FW) algorithm that solves the all-pairs shortest path problem, which is important in a wide range of engineering applications.

Marek Telgarsky, James C. Hoe and José M. F. Moura (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 3, 2006)
Spiral: Joint Runtime and Energy Optimization of Linear Transforms
Preprint (199 KB)
Published paper (link to publisher)
Bibtex

There is much interest into joint runtime and energy optimization of implementations of signal processing algorithms. Applications in domains such as embedded computing, sensor networks, and mobile communications often require processing of signals under simultaneous runtime, energy and/or power constraints. Hence, in addition to runtime, power and energy are first-order design considerations for both hardware and software developers in those domains. This paper studies the automatic generation of software implementations of digital signal processing (DSP) transforms that are optimized with respect to both runtime and energy. We explore the impact of algorithm selection (a software technique) and voltage-frequency scaling (a hardware technique) on the runtime and energy of computing fast linear transforms. We use SPIRAL, a code generation system, to enumerate automatically many alternative algorithms for the discrete Fourier transform. We measure the runtime and energy of these algorithms at different voltage-frequency settings of an Intel Pentium M microprocessor. We report experimental results supporting that algorithm selection and voltage-frequency scaling do achieve the following: (1) have large impact on the runtime and energy of computing the discrete Fourier transform on a microprocessor; and (2) enable the optimization of important joint runtime-energy objectives.

Roland Wunderlich, Markus Püschel and James C. Hoe (Proc. High Performance Embedded Computing (HPEC), 2005)
Accelerating Blocked Matrix-Matrix Multiplication using a Software-Managed Memory Hierarchy with DMA
Preprint (43 KB)
Bibtex

The optimization of matrix-matrix multiplication (MMM) performance has been well studied on general-purpose desktop and server processors. Classic solutions exploit common microarchitectural features including superscalar execution and the cache and TLB hierarchy to achieve near-peak performance. Typical digital signal processors (DSPs) do not have these features, and instead use in-order execution, configurable memory hierarchies, and pro-grammable I/O interfaces. We investigate the methods needed to achieve high per-formance MMM on the Texas Instruments C6713 floating-point DSP. This processor has two components that can be used to accelerate MMM: a software-managed memory hierarchy, and a direct memory access (DMA) engine that can perform block copies from main memory to into the memory hierarchy. Our MMM implementation overlaps computation with DMA block transfers. For matrices lar-ger than the data caches, we observed a 46% performance increase over a blocked MMM implementation, and a 190% increase over the Texas Instruments DSP library.

Grace Nordin, Peter A. Milder, James C. Hoe and Markus Püschel (Proc. Design Automation Conference (DAC), pp. 471-474, 2005)
Automatic Generation of Customized Discrete Fourier Transform IPs
Preprint (291 KB)
Published paper (link to publisher)
Bibtex

This paper presents a parameterized soft core generator for the discrete Fourier transform (DFT). Reusable IPs of digital signal processing (DSP) kernels are important time-saving resources in DSP hardware development. Unfortunately, reusable IPs, however optimized, can introduce inefficiencies because they cannot fit the exact requirements of every application context. Given the well-understood and regular computation in DSP kernels, an automatic tool can generate high-quality ready-to-use IPs customized to user-specified cost/performance tradeoffs (beyond basic parameters such as input size and data format). The paper shows that the generated DFT cores can match closely the performance and cost of DFT cores from the Xilinx LogiCore library. Furthermore, the generator can yield DFT cores over a range of different performance/ cost tradeoff points that are not available from the library.

Mihai Furis, Pawel Hitczenko and Jeremy Johnson (Proc. International Conference on Analysis of Algorithms, Discrete Mathematics and Theoretical Computer Science, pp. 115-124, 2005)
Cache Miss Analysis of WHT Algorithms
Published paper (link to publisher)
Bibtex

Franz Franchetti, Yevgen Voronenko and Markus Püschel (Proc. Programming Languages Design and Implementation (PLDI), pp. 315-326 , 2005)
Formal Loop Merging for Signal Transforms
Preprint (304 KB)
Published paper (link to publisher)
Bibtex

A critical optimization in the domain of linear signal transforms, such as the discrete Fourier transform (DFT), is loop merging, which increases data locality and reuse and thus performance. In particular, this includes the conversion of shuffle operations into array reindexings. To date, loop merging is well understood only for the DFT, and only for Cooley-Tukey FFT based algorithms, which excludes DFT sizes divisible by large primes. In this paper, we present a formal loop merging framework for general signal transforms and its implementation within the SPIRAL code generator. The framework consists of Sigma-SPL, a mathematical language to express loops and index mappings; a rewriting system to merge loops in Sigma-SPL; and a compiler that translates Sigma-SPL into code. We apply the framework to DFT sizes that cannot be handled using only the Cooley-Tukey FFT and compare our method to FFTW 3.0.1 and the vendor library Intel MKL 7.2.1. Compared to FFTW our generated code is a factor of 2-4 faster under equal implementation conditions (same algorithms, same unrolling threshold). For some sizes we show a speed-up of a factor of 9 using Bluestein's algorithm. Further, we give a detailed comparison against the Intel vendor library MKL; our generated code is between 2 times faster and 4.5 times slower.

Xiaoming Li, María J. Garzarán and David Padua (Proc. International Symposium on Code Generation and Optimization (CGO), pp. 99-110, 2005)
Optimizing Sorting with Genetic Algorithm
Preprint (268 KB)
Published paper (link to publisher)
Bibtex

The growing complexity of modern processors has made the generation of highly efficient code increasingly difficult. Manual code generation is very time consuming, but it is often the only choice since the code generated by today's compiler technology often has much lower performance than the best hand-tuned codes. A promising code generation strategy, implemented by systems like ATLAS, FFTW, and SPIRAL, uses empirical search to find the parameter values of the implementation, such as the tile size and instruction schedules, that deliver near-optimal performance for a particular machine. However, this approach has only proven successful on scientific codes whose performance does not depend on the input data. In this paper we study machine learning techniques to extend empirical search to the generation of sorting routines, whose performance depends on the input characteristics and the architecture of the target machine. We build on a previous study that selects a "pure" sorting algorithm at the outset of the computation as a function of the standard deviation. The approach discussed in this paper uses genetic algorithms and a classifier system to build hierarchically-organized hybrid sorting algorithms capable of adapting to the input data. Our results show that such algorithms generated using the approach presented in this paper are quite effective at taking into account the complex interactions between architectural and input data characteristics and that the resulting code performs significantly better than conventional sorting implementations and the code generated by our earlier study. In particular, the routines generated using our approach perform better than all the commercial libraries that we tried including IBM ESSL, INTEL MKL and the C++ STL The best algorithm we have been able to generate is on the average 26% and 62% faster than the IBM ESSL in an IBM Power 3 and IBM Power 4, respectively.

Thammanit Pipatsrisawat, Aca Gacic, Franz Franchetti, Markus Püschel and José M. F. Moura (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5, pp. 153-156, 2005)
Performance Analysis of the Filtered Backprojection Image Reconstruction Algorithms
Preprint (214 KB)
Published paper (link to publisher)
Bibtex

We investigate performance tradeoffs for a class of filtered backprojection (FBP) image reconstruction algorithms. The recently developed fast hierarchical backprojection asymptotically achieves the same O(N/sup 2/ log N) cost as Fourier-based methods while retaining many advantages of the FBP technique. In this paper, we provide a detailed cost and performance analysis of the algorithm on a general purpose platform. Based on carefully tuned implementations of both the direct and the hierarchical backprojection, we explore the tradeoffs between distortion and runtime by varying several algorithm and implementation choices. Experimental results show that, given the desired performance, the choice of algorithm parameters is not obvious and largely depends on the image properties and the underlying computer platform.

José M. F. Moura, Markus Püschel, David Padua and Jack Dongarra (Proceedings of the IEEE, special issue on ``Program Generation, Optimization, and Adaptation'', Vol. 93, No. 2, pp. 211-215, 2005)
Scanning the Issue: Special Issue on Program Generation, Optimization, and Platform Adaptation
Published paper (link to publisher)
Bibtex

Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson and Nicholas Rizzolo (Proceedings of the IEEE, special issue on ``Program Generation, Optimization, and Adaptation'', Vol. 93, No. 2, pp. 232- 275, 2005)
SPIRAL: Code Generation for DSP Transforms
Preprint (1.2 MB)
Published paper (link to publisher)
Bibtex

Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL, which considers this problem for the performance-critical domain of linear digital signal processing (DSP) transforms. For a specified transform, SPIRAL automatically generates high-performance code that is tuned to the given platform. SPIRAL formulates the tuning as an optimization problem and exploits the domain-specific mathematical structure of transform algorithms to implement a feedback-driven optimizer. Similar to a human expert, for a specified transform, SPIRAL "intelligently" generates and explores algorithmic and implementation choices to find the best match to the computer's microarchitecture. The "intelligence" is provided by search and learning techniques that exploit the structure of the algorithm and implementation space to guide the exploration and optimization. SPIRAL generates high-performance code for a broad set of DSP transforms, including the discrete Fourier transform, other trigonometric transforms, filter transforms, and discrete wavelet transforms. Experimental results show that the code generated by SPIRAL competes with, and sometimes outperforms, the best available human tuned transform library code.

Lawrence C. Chang, Inpyo Hong, Yevgen Voronenko and Markus Püschel (Proc. High Performance Embedded Computing (HPEC), 2004)
Adaptive Mapping of Linear DSP Algorithms to Fixed-Point Arithmetic
Bibtex

Xiaoming Li, María J. Garzarán and David Padua (Proc. International Symposium on Code Generation and Optimization (CGO), pp. 111-124, 2004)
A Dynamically Tuned Sorting Library
Preprint (418 KB)
Published paper (link to publisher)
Bibtex

Empirical search is a strategy used during the installation oflibrary generators such as ATLAS, FFTW, and SPIRAL to identify the algorithm or the version of an algorithm that delivers thebest performance. In the past, empirical search has been appliedalmost exclusively to scientific problems. In this paper, we discuss the application of empirical search to sorting, which is oneof the best understood symbolic computing problems. When contrasted with the dense numerical computations of ATLAS, FFTW,and SPIRAL, sorting presents a new challenge, namely that the relative performance of the algorithms depend not only on the characteristics of the target machine and the size of the input data but also on the distribution of values in the input data set.Empirical search is applied in the study reported here as partof a sorting library generator. The resulting routines dynamicallyadapt to the characteristics of the input data by selecting the bestsorting algorithm from a small set of alternatives. To generate therun time selection mechanism our generator makes use of machinelearning to predict the best algorithm as a function of the characteristics of the input data set and the performance of the differentalgorithms on the target machine. This prediction is based on thedata obtained through empirical search at installation time.Our results show that our approach is quite effective. Whensorting data inputs of 12M keys with various standard deviations,our adaptive approach selected the best algorithm for all the inputdata sets and all platforms that we tried in our experiments. Thewrong decision could have introduced a performance degradationof up to 133%, with an average value of 44%.

Jeremy Johnson and Kang Chen (Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 44-, 2004)
A Self-Adapting Distributed Memory Package for Fast Signal Transforms
Preprint (1.4 MB)
Published paper (link to publisher)
Bibtex

Aca Gacic, Markus Püschel and José M. F. Moura (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5, pp. V-69-V-72, 2004)
Automatically Generated High-Performance Code for Discrete Wavelet Transforms
Preprint (103 KB)
Published paper (link to publisher)
Bibtex

A growing number of performance-critical DSP application use the discrete wavelet transform (DWT), thus prompting the need for highly efficient DWT software implementations. Unfortunately, the rapid evolution of computing platforms and compiler technology makes carefully hand-tuned code obsolete almost as fast as it is written. In this paper we describe our work on the automatic generation of DWT implementations that are tuned to a given platform. Our approach captures the various DWT algorithms in a concise mathematical framework that enables the integration of DWTs into the SPIRAL code generation system. Experiments show the quality of our automatically generated code and provide interesting insights; for example, the fastest code differs between platforms and is usually based on a non-obvious combination of DWT algorithms.

Franz Franchetti, Stefan Kral, Juergen Lorenz, Markus Püschel, Christoph W. Ueberhuber and Peter Wurzinger (Proc. High Performance Computing for Computational Science (VECPAR), Lecture Notes in Computer Science, Springer, Vol. 3402, pp. 23-36, 2004)
Automatically Tuned FFTs for BlueGene/L’s Double FPU
Preprint (219 KB)
Bibtex

IBM's upcoming 360 Tflop/s supercomputer BlueGene/L featuring 65,536 processors is supposed to lead the Top 500 list when being installed in 2005. This paper presents one of the first numerical codes actually run on a small prototype of this machine. Formal vectorization techniques, the Vienna MAP vectorizer (both developed for generic short vector SIMD extensions), and the automatic performance tuning approach provided by Spiral are combined to generate automatically optimized FFT codes for the BlueGene/L machine targeting its two-way short vector SIMD ``double'' floating-point unit. The resulting FFT codes are 40% faster than the best scalar Spiral generated code and 5 times faster than the mixed-radix FFT implementation provided by the Gnu scientific library GSL.

Adam C. Zelinski, Markus Püschel, Smarahara Misra and James C. Hoe (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5, pp. V-221-V-224, 2004)
Automatic Cost Minimization for Multiplierless Implementations of Discrete Signal Transforms
Preprint (80 KB)
Published paper (link to publisher)
Bibtex

The computation of linear DSP transforms consists entirely of additions and multiplications by constants, which, in a hardware realization, can be implemented as a network of wired shifts and additions. Thus, a light weight fixed point implementation that approximates an exact transform can be built from only adders. This paper presents an automatic approach for minimizing the number of additions required for a given transform under the constraint of a particular quality measure. We present an evaluation of our approach. For example, one experiment shows that the IMDCT transform within an MP3 decoder can be reduced from 643 additions to 170 additions while maintaining Limited Accuracy as defined by the MP3 ISO standard.

Anthony F. Breitzman and Jeremy Johnson (Journal of High Performance Computing and Applications, special issue on ``Computer Algebra and Signal Processing'', Vol. 37, No. 2, pp. 157-186, 2004)
Automatic Derivation and Implementation of Fast Convolution Algorithms
Published paper (link to publisher)
Bibtex

Yevgen Voronenko and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5, pp. V-101-V-104, 2004)
Automatic Generation of Implementations for DSP Transforms on Fused Multiply-Add Architectures
Preprint (95 KB)
Published paper (link to publisher)
Bibtex

Many modern computer architectures feature fused multiply-add (FMA) instructions, which offer potentially faster performance for numerical applications. For DSP transforms, compilers can only generate FMA code to a very limited extent because optimal use of FMAs requires modifying the chosen algorithm. In this paper we present a framework for automatically generating FMA code for every linear DSP transform, which we implemented as an extension to the SPIRAL code generation system. We show that for many transforms and transform sizes, our generated FMA code matches the best-known hand-derived FMA algorithms in terms of arithmetic cost. Further, we present actual runtime results that show the speed-up obtained by using FMA instructions.

Aca Gacic (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2004)
Automatic Implementation and Platform Adaptation of Discrete Filtering and Wavelet Algorithms
Preprint (1.2 MB)
Bibtex

Moore's law, with the doubling of the transistor count every 18 months, poses serious challenges to high-performance numerical software designers: how to stay close to the maximum achievable performance on ever-changing and ever-faster hardware technologies? Up-to-date numerical libraries are usually maintained by large teams of expert programmers who hand-tune the code to a specific class of computer platforms, sacrificing portability for performance. Every new generation of processors reopens the cycle of implementing, tuning, and debugging.

The SPIRAL system addresses this problem by automatically generating and implementing algorithms for DSP numerical kernels and searching for the best solution on the platform of interest. Using search, SPIRAL adapts code to take optimal advantage of the available platform features, such as the architecture of the memory hierarchy and register banks. As a result, SPIRAL generates high-performance implementations for DSP transforms that are competitive with the best hand-coded numerical libraries provided by hardware vendors.

In this thesis, we focus on automatic implementation and platform adaptation of filtering and wavelet kernels, which are at the core of many performance-critical DSP applications. We formulate many well-known algorithms for FIR filters and discrete wavelet transforms (DWT) using a concise and flexible symbolic mathematical language and integrate it in the SPIRAL system. This enables automatic generation and search over the comprehensive space of competitive algorithms, often leading to complex solutions that are hardly ever considered by a human programmer.

Experimental results show that our automatically generated and tuned code for FIR filters and DWTs is competitive and sometimes even outperforms hand-coded numerical libraries provided by hardware vendors. This implies that the richness and the extent of the automatically generated search space can match human ingenuity in achieving high performance. Our system generates high-quality code for digital filtering and wavelet kernels across most current and compatible future computer platforms and frees software developers from tedious and time-consuming coding at the machine level.

Markus Püschel, Adam C. Zelinski and James C. Hoe (Proc. International Conference on Computer-Aided Design (ICCAD), pp. 175-182, 2004)
Custom-Optimized Multiplierless Implementations of DSP Algorithms
Preprint (137 KB)
Published paper (link to publisher)
Bibtex

Linear DSP kernels such as transforms and filters are comprised exclusively of additions and multiplications by constants. These multiplications may be realized as networks of additions and wired shifts in hardware. The cost of such a "multiplierless" implementation is determined by the number of additions, which in turn depends on the value and precision of these constants. For a given transform or filter, the set of constants and their required precision is affected by algorithmic and implementation choices and hence provides a degree of freedom for optimization. In this paper we present an automated method to generate, for a given linear transform, a minimum addition multiplierless implementation that satisfies a given quality constraint. The method combines automatic algorithm selection to improve numerical robustness and automatic search methods to minimize constant precisions in a chosen algorithm. We present experiments that show the trade-offs between cost and quality, including custom optimizations of the transforms used in JPEG image and MP3 audio decoders

Grace Nordin, James C. Hoe and Markus Püschel (Proc. High Performance Embedded Computing (HPEC), 2004)
Discrete Fourier Transform IP Generator
Bibtex

Nicholas Rizzolo and David Padua (Proc. International Workshop on Languages and Compilers for Parallel Computing (LCPC), 2004)
HiLO: High Level Optimization of FFTs
Preprint (181 KB)
Bibtex

As computing platforms become more and more complex, the task of optimizing performance critical codes becomes more challenging. Recently, more attention has been focused on automating this optimization process by making aggressive assumptions about the algorithm. Motivated by these trends, this paper presents HiLO, the high-level optimizer for FFT codes. HiLO blends traditional optimization techniques into an optimization strategy tailored to the needs of FFT codes and outputs C code ready to be further optimized by the native compiler. It has already been shown that such high-level transformations are important to coax the native compiler into doing its job well. HiLO provides a more appropriate platform for researching these phenomena, suggests an optimization strategy that improves on previous approaches, and shows that even software pipelining at the C level can improve the final binary's performance.

Peter Tummeltshammer, James C. Hoe and Markus Püschel (Proc. Design Automation Conference (DAC), pp. 826-829, 2004)
Multiple Constant Multiplication By Time-Multiplexed Mapping of Addition Chains
Preprint (122 KB)
Published paper (link to publisher)
Bibtex

An important primitive in the hardware implementations of linear DSP transforms is a circuit that can multiply an input value by one of several different preset constants. We propose a novel implementation of this circuit based on combining the addition chains of the constituent constants. We present an algorithm to automatically generate such a circuit for a given set of constants. The quality of the resulting circuits is evaluated after synthesis for a commercial 0.18mum standard cell library. We compare the area and latency efficiency of this addition chain based approach against a straightforward approach based on a constant table and a full multiplier.

Markus Püschel, Bryan Singer, Jianxin Xiong, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso and Robert W. Johnson (Journal of High Performance Computing and Applications, special issue on ``Automatic Performance Tuning'', Vol. 18, No. 1, pp. 21-45, 2004)
SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms
Published paper (link to publisher)
Bibtex

Xu Xu (Master thesis, Computer Science, Drexel University, 2003)
A Recursive Implementation of the Dimensionless FFT
Bibtex

Jeremy Johnson and Xu Xu (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2003)
A Recursive Implementation of the Dimensionless FFT
Preprint (62 KB)
Published paper (link to publisher)
Bibtex

A divide and conquer algorithm is presented for computing arbitrary multi-dimensional discrete Fourier transforms. In contrast to standard approaches such as the row-column algorithm, this algorithm allows an arbitrary decomposition, based solely on the size of the transform independent of the dimension of the transform. Only minor modifications are required to compute transforms with different dimension. These modifications were incorporated into the FFTW package so that the algorithm for computing one-dimensional transforms can be used to compute arbitrary dimensional transforms. This reduced the runtime of many multi-dimensional transforms.

Anthony F. Breitzman (PhD. thesis, Computer Science, Drexel University, 2003)
Automatic Derivation and Implementation of Fast Convolution Algorithms
Published paper (link to publisher)
Bibtex

Mihai Furis (Master thesis, Computer Science, Drexel University, 2003)
Cache Miss Analysis of Walsh-Hadamard Transform Algorithms
Bibtex

Smarahara Misra, Adam C. Zelinski, James C. Hoe and Markus Püschel (Proc. High Performance Embedded Computing (HPEC), 2003)
Custom Reduction of Arithmetic in Linear DSP Transforms
Bibtex

Smarahara Misra (Master thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2003)
Custom Reduction of Arithmetic in Multiplierless implementations of DSP Transforms
Bibtex

Aca Gacic, Markus Püschel and José M. F. Moura (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 541-544, 2003)
Fast Automatic Implementations of FIR Filters
Preprint (82 KB)
Bibtex

SPIRAL is a generator for platform-adapted libraries of DSP transform algorithms. SPIRAL represents and automatically generates fast algorithms as mathematical formulas and translates them into programs. Adaptation is achieved by searching in the space of algorithmic and coding alternatives for the fastest implementation. In this paper we extend SPIRAL to generate platform-adapted implementations of FIR filters. First, we present various filter algorithms and introduce the mathematical constructs needed to include them into SPIRAL's architecture. Then we use SPIRAL to find fast filter implementations. The results show runtime improvements to a standard loop implementation of up to 70% using different blocking techniques. Further, we show that the usefulness of frequency-domain methods is not determined by the number of operations.

Franz Franchetti and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 537-540, 2003)
Short Vector Code Generation and Adaptation for DSP Algorithms
Preprint (90 KB)
Published paper (link to publisher)
Bibtex

Most recent general purpose processors feature short vector SIMD instructions, like SSE on Pentium III/4. In this paper we automatically generate platform-adapted short vector code for DSP transform algorithms using SPIRAL. SPIRAL represents and generates fast algorithms as mathematical formulas, and translates them into code. Adaptation is achieved by searching in the space of algorithmic and coding alternatives for the fastest implementation on the given platform. We explain the mathematical foundation that relates formula constructs to vector code, and overview the vector code generator within SPIRAL. Experimental results show excellent speed-ups compared to ordinary C code for a variety of transforms and computing platforms. For the DFT on Pentium 4, our automatically generated code compares favorably with the hand-tuned Intel MKL vendor library.

Franz Franchetti and Markus Püschel (Proc. International Parallel and Distributed Processing Symposium (IPDPS), 2003)
Short Vector Code Generation for the Discrete Fourier Transform
Preprint (169 KB)
Bibtex

In this paper we use a mathematical approach to generate high performance short vector code for the discrete Fourier transform. We represent the well-known Cooley-Tukey FFT recursion in a mathematical notation and formally derive a ``short vector variant''. Using this recursion we generate for one DFT size a large number of different algorithms, represented as formulas, and translate them into short vector code. Finally, we present a vector code specific dynamic programming search that finds in the formula space a good match to the given architecture. We implemented this approach as part of the SPIRAL library generator. On Pentium III and 4, we automatically generate SSE and SSE2 vector code that compares favorably with the hand-tuned Intel vendor MKL.

Markus Püschel and José M. F. Moura (Proc. Workshop on Optimizations for DSP and Embedded Systems (ODES), 2003)
SPIRAL: An Overview
Bibtex

Fang Fang, Rob A. Rutenbar, Markus Püschel and Tsuhan Chen (Proc. Design Automation Conference (DAC), pp. 496-501, 2003)
Toward Efficient Static Analysis of Finite-Precision Effects in DSP Applications via Affine Arithmetic Modeling
Preprint (126 KB)
Published paper (link to publisher)
Bibtex

We introduce a static error analysis technique, based on smart interval methods from affine arithmetic, to help designers translate DSP codes from full-precision floating-point to smaller finite-precision formats. The technique gives results for numerical error estimation comparable to detailed simulation, but achieves speedups of three orders of magnitude by avoiding actual bit-level simulation. We show results for experiments mapping common DSP transform algorithms to implementations using small custom floating point formats.

Mike Balog (Master thesis, Electrical and Computer Engineering, Drexel University, 2002)
A Flexible Framework for Implementing FFT Processors
Bibtex

Kang Chen (Master thesis, Computer Science, Drexel University, 2002)
A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms
Published paper (link to publisher)
Bibtex

This thesis presents a self-adapting parallel package for computing the Walsh-Hadamard transform (WHT), a prototypical fast signal transform, similar to the fast Fourier transform. Using a search over a space of mathematical formulas representing different algorithms to compute the WHT, the package finds the best parallel implementation on a given shared-memory multiprocessor. The search automatically fins the best combination of sequential and parallel code leading to effective granularity, load balance, and cache utilization. Experimental data are presented in the thesis showing the performance of the package on four different architectures. Results are also presented showing the optimizations required to obtain nearly linear speedup on these sample symmetric multiprocessors.

Kang Chen and Jeremy Johnson (Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 58-63, 2002)
A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms
Preprint (262 KB)
Published paper (link to publisher)
Bibtex

This paper presents a self-adapting parallel package for computing the Walsh-Hadamard transform (WHT), a prototypical fast signal transform, similar to the fast Fourier transform. Using a search over a space of mathematical formulas representing different algorithms to compute the WHT the package finds the best parallel implementation on a given shared-memory multiprocessor. The search automatically finds the best combination of sequential and parallel code leading to the optimal granularity, load balance, and cache utilization. Experimental results are presented showing the optimizations required to obtain nearly linear speedup on a sample symmetric multiprocessor.

Markus Püschel, Sebastian Egner and Thomas Beth (in New Reference Book on Computer Algebra, Eds. J. Grabmeier, E. Kaltofen, J. Grabmeier, E. Kaltofen, Springer 2002)
AREP
Published paper (link to publisher)
Bibtex

Franz Franchetti and Markus Püschel (Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 20-26, 2002)
A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms
Preprint (142 KB)
Published paper (link to publisher)
Bibtex

Many applications require fast implementations of signal processing algorithms to analyze data in real time or to effectively process many large data sets. Fast implementations of a signal transform need to take advantage of structure in the transformation matrix to factor the transform into a product of structured matrices. These factorizations compute the transform with fewer operations than the naive implementation of matrix multiplication. Signal transforms can have a vast number of factorizations, with each factorization of a single transform represented by a unique but mathematically equivalent formula. Interestingly, when implemented in code, these formulas can have significantly different runtimes on the same processor, sometimes differing by an order of magnitude. Further, the optimal implementations differ significantly between processors. Therefore, determining which formula is the most efficient for a particular processor is of great interest.
This thesis contributes methods for automating the modeling and optimization of performance across a variety of signal processing algorithms. Modeling and understanding performance can greatly aid in intelligently pruning the huge search space when optimizing performance. Automation is vital considering the size of the search space, the variety of signal processing algorithms, and the constantly changing computer platform market.
To automate the optimization of signal transforms, we have developed and implemented a number of different search methods in the SPIRAL system. These search methods are capable of optimizing a variety of different signal transforms, including new user-specified transforms. We have developed a new search method for this domain, STEER, which uses an evolutionary stochastic algorithm to find fast implementations.
To enable computer modeling of signal processing performance, we have developed and analyzed a number of feature sets to describe formulas representing specific transforms. We have developed several different models of formula performance, including models that predict runtimes of formulas and models that predict the number of cache misses formulas incur.
Further, we have developed a method that uses these learned models to generate fast implementations. This method is able to construct fast formulas, allowing us to intelligently search through only the most promising formulas. While the learned model is trained on data from one transform size, our method is able to produce fast formulas across many transform sizes, including larger sizes, even though it has never timed a formula of those other sizes.

Bryan Singer and Manuela Veloso (IEEE Transactions on Signal Processing, Vol. 50, No. 8, pp. 2003-2014, 2002)
Automating the Modeling and Optimization of the Performance of Signal Transforms
Preprint (365 KB)
Published paper (link to publisher)
Bibtex

Fast implementations of discrete signal transforms, such as the discrete Fourier transform, the Walsh-Hadamard transform, and the discrete trigonometric transforms, can be viewed as factorizations of their corresponding transformation matrices. A given signal transform can have many different factorizations, with each factorization represented by a unique but mathematically equivalent formula. When implemented in code, these formulas can have significantly different running times on the same processor, sometimes differing by an order of magnitude. Further, the optimal implementations on various processors are often different. Given this complexity, a crucial problem is automating the modeling and optimization of the performance of signal transform implementations.
To enable computer modeling of signal processing performance, we have developed and analyzed more than 15 feature sets to describe formulas representing specific transforms. Using some of these features and a limited set of training data, we have successfully trained neural networks to learn to accurately predict performance of formulas, with error rates less than 5%. In the direction of optimization, we have developed a new stochastic evolutionary algorithm, STEER, for finding fast implementations of a variety of signal transforms. STEER is able to optimize completely new transforms specified by a user. We present results that show STEER can find discrete cosine transform formulas that are 10--20% faster than what a dynamic programming search finds.

Markus Püschel and José M. F. Moura (Proc. Digital Signal Processing Workshop, 2002)
Generation and Manipulation of DSP Transform Algorithms
Bibtex

Fang Fang, James C. Hoe, Markus Püschel and Smarahara Misra (Proc. High Performance Embedded Computing (HPEC), 2002)
Generation of Custom DSP Transform IP Cores: Case Study Walsh-Hadamard Transform
Bibtex

Hardware designers are increasingly relying on pre-designed DSP (digital signal processing) cores from IP libraries to improve their productivity and reduce design time. Unfortunately, static DSP cores cannot accommodate application-specific trade-offs. To overcome this problem, we are proposing to automatically generate customized DSP cores that can be tailored for specific design requirements. This enables a designer, with no specific background in DSP transform mathematics, to quickly create and evaluate different design choices and to determine the one that is most suitable for the application. In this paper, we present a generator for the Walsh-Hadamard Transform (WHT). The generator accepts as input the WHT's sample size and two implementation parameters that control the degree of hardware reuse. The output is an RTL-level Verilog description of the desired implementation. We present experimental results that compare our different generated designs for the same transform with respect to resource requirements, computation latency, and throughput.

Bryan Singer and Manuela Veloso (Journal of Machine Learning Research, special issue on ``the Eighteenth International Conference on Machine Learning (ICML 2001)'', Vol. 3, pp. 887-919, 2002)
Learning to Construct Fast Signal Processing Implementations
Preprint (6.4 MB)
Published paper (link to publisher)
Bibtex

A single signal processing algorithm can be represented by many mathematically equivalent formulas. However, when these formulas are implemented in code and run on real machines, they have very different runtimes. Unfortunately, it is extremely difficult to model this broad performance range. Further, the space of formulas for real signal transforms is so large that it is impossible to search it exhaustively for fast implementations. We approach this search question as a control learning problem. We present a new method for learning to generate fast formulas, allowing us to intelligently search through only the most promising formulas. Our approach incorporates signal processing knowledge, hardware features, and formula performance data to learn to construct fast formulas. Our method learns from performance data for a few formulas of one size and then can construct formulas that will have the fastest runtimes possible across many sizes.

Hung-Jen Huang (Master thesis, Computer Science, Drexel University, 2002)
Performance Analysis of an Adaptive Algorithm for the Walsh-Hadamard Transform
Bibtex

Franz Franchetti, Markus Püschel, José M. F. Moura and Christoph W. Ueberhuber (Proc. High Performance Embedded Computing (HPEC), 2002)
Short Vector SIMD Code Generation for DSP Algorithms
Bibtex

Peter Becker (Master thesis, Electrical and Computer Engineering, Drexel University, 2001)
A High Speed VLSI Architecture for the Discrete Haar Wavelet Transform
Bibtex

An alternative to strictly time- or frequency-domain analysis, wavelets provide a multiresolution analysis of signals for input to processing such as edge detection, compression, and pattern recognition. High speed VLSI architectures are necessary for real-time processing of large data sets. Presented here is such an architecture for the Haar wavelet transform, the simplest and easiest wavelet to understand. From this base model, a generalization of the architecture to other wavelets is discussed, as well.

Sebastian Egner, Jeremy Johnson, David Padua, Jianxin Xiong and Markus Püschel (ACM SIGSAM Bulletin Communications in Computer Algebra, Vol. 35, No. 2, pp. 1-19, 2001)
Automatic Derivation and Implementation of Signal Processing Algorithms
Preprint (249 KB)
Published paper (link to publisher)
Bibtex

We present a computer algebra framework to automatically derive and implement algorithms for digital signal processing. The two main parts of the framework are AREP, a library for symbolic manipulation of group representations and structured matrices, and SPL, a compiler turning matrix expressions into efficient imperative-style numerical programs.

Sebastian Egner and Markus Püschel (IEEE Transactions on Signal Processing, Vol. 49, No. 9, pp. 1992-2002, 2001)
Automatic Generation of Fast Discrete Signal Transforms
Preprint (217 KB)
Published paper (link to publisher)
Bibtex

This paper presents an algorithm that derives fast versions for a broad class of discrete signal transforms symbolically. The class includes but is not limited to the discrete Fourier and the discrete trigonometric transforms. This is achieved by finding fast sparse matrix factorizations for the matrix representations of these transforms. Unlike previous methods, the algorithm is entirely automatic and uses the defining matrix as its sole input. The sparse matrix factorization algorithm consists of two steps: first, the “symmetry” of the matrix is computed in the form of a pair of group representations; second, the representations are stepwise decomposed, giving rise to a sparse factorization of the original transform matrix. We have successfully demonstrated the method by computing automatically efficient transforms in several important cases: for the DFT, we obtain the Cooley-Tukey FFT; for a class of transforms including the DCT, type II, the number of arithmetic operations for our fast transforms is the same as for the best-known algorithms. Our approach provides new insights and interpretations for the structure of these signal transforms and the question of why fast algorithms exist. The sparse matrix factorization algorithm is implemented within the software package AREP

Jianxin Xiong (PhD. thesis, Computer Science, University of Illinois at Urbana-Champaign, 2001, Also as Technical Report UIUCDCS-R-2001-224, University of Illinois)
Automatic Optimization of DSP Algorithms
Bibtex

Since the advent of digital signal processing (DSP), there has been an enormous effort to obtain high-performance implementations of signal processing algorithms such as the fast Fourier transform (FFT). This effort has produced thousands of variants of fundamental algorithms and an equally large number of implementation techniques. Many of the implementations have been carefully hand-optimized. The cost of these hand optimizations and their long implementation times are a strong motivation to automate the implementation and optimization of signal processing algorithms.
In this dissertation, we present SPL, a domain specific language that represents fast signal processing algorithms as mathematical formulas, and the SPL compiler, which translates these formulas into efficient programs in high-level programming language such as FORTRAN and C. The SPL compiler was developed as part of SPIRAL, a system which utilizes formula transformations and intelligent search strategies to automatically create optimized digital signal processing libraries.
We also present Meta-SPL, a meta-language for extending the SPL language without modifying the SPL compiler. Meta-SPL provides the approach to declare new symbols that can be used as SPL keywords, as well as the mechanism to define the semantics of these new symbols.
After presenting the translation and optimization techniques utilized by the SPL compiler, empirical data is presented showing the efficiency of the resulting C/FORTRAN code. Timings are compared, using fast Fourier transform (FFT), Walsh-Hadamard transform (WHT), discrete cosine transform (DCT) and discrete sine transform (DST) as the benchmarks, to those obtained by highly optimized implementations including FFTW and the WHT package. The results of this comparison show that the SPL compiler produces code that is competitive and, in many cases, faster than the competitors.

Bryan Singer (PhD. thesis, Computer Science, Carnegie Mellon University, 2001)
Automating the Modeling and Optimization of the Performance of Signal Processing Algorithms
Preprint (60 KB)
Published paper (link to publisher)
Bibtex

Many applications require fast implementations of signal processing algorithms to analyze data in real time or to effectively process many large data sets. Fast implementations of a signal transform need to take advantage of structure in the transformation matrix to factor the transform into a product of structured matrices. These factorizations compute the transform with fewer operations than the naive implementation of matrix multiplication. Signal transforms can have a vast number of factorizations, with each factorization of a single transform represented by a unique but mathematically equivalent formula. Interestingly, when implemented in code, these formulas can have significantly different runtimes on the same processor, sometimes differing by an order of magnitude. Further, the optimal implementations differ significantly between processors. Therefore, determining which formula is the most efficient for a particular processor is of great interest.
This thesis contributes methods for automating the modeling and optimization of performance across a variety of signal processing algorithms. Modeling and understanding performance can greatly aid in intelligently pruning the huge search space when optimizing performance. Automation is vital considering the size of the search space, the variety of signal processing algorithms, and the constantly changing computer platform market.
To automate the optimization of signal transforms, we have developed and implemented a number of different search methods in the SPIRAL system. These search methods are capable of optimizing a variety of different signal transforms, including new user-specified transforms. We have developed a new search method for this domain, STEER, which uses an evolutionary stochastic algorithm to find fast implementations.
To enable computer modeling of signal processing performance, we have developed and analyzed a number of feature sets to describe formulas representing specific transforms. We have developed several different models of formula performance, including models that predict runtimes of formulas and models that predict the number of cache misses formulas incur.
Further, we have developed a method that uses these learned models to generate fast implementations. This method is able to construct fast formulas, allowing us to intelligently search through only the most promising formulas. While the learned model is trained on data from one transform size, our method is able to produce fast formulas across many transform sizes, including larger sizes, even though it has never timed a formula of those other sizes.

Neungsoo Park and Viktor K. Prasanna (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. II-1205-II-1208, 2001)
Cache Conscious Walsh-Hadamard Transform
Preprint (88 KB)
Published paper (link to publisher)
Bibtex

The Walsh-Hadamard Transform (WHT) is an important tool in signal processing because of its simplicity. One major problem with the existing packages is their poor scalability to larger problems. In computing large size WHT, non-unit stride accesses result in poor cache performance leading to severe degradation in performance. In this paper, we develop an efficient cache friendly technique that drastically enhances the performance of large size WHT. In our approach, data reorganization is performed between computation stages to reduce cache pollution. We develop an efficient search algorithm to determine the optimal factorization tree depending on the problem size and stride access in the decomposition. We have developed a package and demonstrated improved performance. For example, on Alpha 21264 and MIPS R10000, our technique results in upto 180% performance improvement over the state-of-the-art package. The proposed optimization is applicable to other signal transforms and is portable across various platforms.

Pinit Kumhom (PhD. thesis, Electrical and Computer Engineering, Drexel University, 2001, Also Tech. Report DU-MCS-01-01, Drexel University, 2001)
Design, Optimization, and Implementation of a Universal FFT Processor
Preprint (8.2 MB)
Bibtex

There exist Fast Fourier transform (FFT) algorithms, called dimensionless FFTs (L. Auslander, J. Johnson and R. Johnson, Dimensionless Fast Fourier Transform Method and Apparatus, Patent #US6003056, issued Dec. 14, 1999.), that work independent of dimension. These algorithms can be configured to compute different dimensional discrete Fourier transforms (DFTs) simply by relabeling the input data and by changing the values of the twiddle factors occurring in the butterfly operations. This observation allows the design of a universal FFT processor, which with minor reconfiguring, can compute one, two, and three dimensional DFTs. In this thesis a family of FFT processors, parameterized by the number of points, the dimension, the number of processors, and the internal dataflow is designed. Mathematical properties of the FFT are used systematically to simplify and optimize the processor design, and to explore different algorithms and design choices. Different dimensionless FFTs have different dataflows and consequently lead to different performance characteristics. A performance model is used to evaluate the different algorithmic choices and their resulting dataflow. Using the performance model, a search was conducted to find the optimal algorithm for the family of processors considered. The resulting algorithm and corresponding hardware design was implemented using FPGA.

Markus Püschel, Bryan Singer, Manuela Veloso and José M. F. Moura (Proc. International Conference on Computational Science (ICCS), Lecture Notes In Computer Science, Springer, Vol. 2073, pp. 97-106, 2001)
Fast Automatic Generation of DSP Algorithms
Published paper (link to publisher)
Bibtex

SPIRAL is a generator of optimized, platform-adapted libraries for digital signal processing algorithms. SPIRAL's strategy translates the implementation task into a search in an expanded space of alternatives. These result from the many degrees of freedom in the DSP algorithm itself and in the various coding choices. This paper describes the framework to represent and generate efficiently these alternatives: the formula generator module in SPIRAL. We also address the search module that works in tandem with the formula generator in a feedback loop to find optimal implementations. These modules are implemented using the computer algebra system GAP/AREP.

José M. F. Moura, Jeremy Johnson, Robert W. Johnson, David Padua, Viktor K. Prasanna, Markus Püschel, Bryan Singer, Manuela Veloso and Jianxin Xiong (Proc. High Performance Embedded Computing (HPEC), 2001)
Generating Platform-Adapted DSP Libraries using SPIRAL
Bibtex

Bryan Singer and Manuela Veloso (Proc. International Conference on Machine Learning (ICML), pp. 529-536, 2001)
Learning to Generate Fast Signal Processing Implementations
Preprint (450 KB)
Published paper (link to publisher)
Bibtex

A single signal processing algorithm can be represented by many mathematically equivalent formulas. However, when these formulas are implemented in code and run on real machines, they have very different running times. Unfortunately, it is extremely difficult to model this broad performance range. Further, the space of formulas for real signal transforms is so large that it is impossible to search it exhaustively for fast implementations. We approach this search question as a control learning problem. We present a new method for learning to generate fast formulas, allowing us to intelligently search through only the most promising formulas. Our approach incorporates signal processing knowledge, hardware features, and formula performance data to learn to construct fast formulas. Our method learns from performance data for a few formulas of one size and then can construct formulas that will have the fastest run times possible across many sizes.

Jianxin Xiong, Jeremy Johnson, Robert W. Johnson and David Padua (Proc. Programming Languages Design and Implementation (PLDI), pp. 298-308, 2001)
SPL: A Language and Compiler for DSP Algorithms
Preprint (123 KB)
Bibtex

In this paper, we discuss the design and implementation of the SPL compiler, a domain specific compiler that translates signal processing algorithms expressed as mathematical formulas in the SPL language into efficient programs in high-level programming language such as FORTRAN and C. The SPL compiler was developed as part of SPIRAL, a system which utilizes formula transformations and intelligent search strategies to automatically create optimized digital signal processing (DSP) libraries. After presenting the translation and optimization techniques utilized by the compiler, empirical data is presented showing the efficiency of the resulting C/FORTRAN code. Timings are compared, using the fast Fourier transform (FFT) as a benchmark, to those obtained by FFTW, one of the fastest FFT packages available. The results of this comparison show that the SPL compiler produces code that is competitive and, in many cases, faster than FFTW.

Bryan Singer and Manuela Veloso (Proc. Supercomputing (SC), pp. 22, 2001)
Stochastic Search for Signal Processing Algorithm Optimization
Preprint (167 KB)
Published paper (link to publisher)
Bibtex

Many difficult problems can be viewed as search problems. However, given a new task with an embedded search problem, it is challenging to state and find a truly effective search approach. In this paper, we address the complex task of signal processing optimization. We first introduce and discuss the complexities of this domain. In general, a single signal processing algorithm can be represented by a very large number of different but mathematically equivalent formulas. When these formulas are implemented in actual code, unfortunately their running times differ significantly. Signal processing algorithm optimization aims at finding the fastest formula. We present a new approach that successfully solves this problem, using an evolutionary stochastic search algorithm, STEER, to search through the very large space of formulas. We empirically compare STEER against other search methods, showing that it notably can find faster formulas while still only timing a very small portion of the search space.

Gavin Haentjens (Master thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2000)
An Investigation of Recursive FFT Implementations
Preprint (762 KB)
Bibtex

The goals of the research discussed in this report are to determine the impact of different Cooley-Tukey decompositions on the performance of computer programs that compute the FFT, evaluate different methods for finding the most efficient decompositions, and determine the characteristics of the most efficient decompositions. Experiments are conducted using three different FFT programs and three dynamic programming (DP) methods for searching for efficient decompositions. The results show that even for FFTs of sizes under 211, the runtime for the average decomposition may be up to three times the runtime for the optimal decomposition. The results also show that a basic implementation of DP performs as well as an exhaustive search at finding fast decompositions for FFTs of sizes up to 210, and that the simple DP performs as well as two more sophisticated versions of DP for FFT sizes up to 220. Furthermore, the results show that for an out-of-place implementation of the FFT, right-expanded decompositions are optimal because they require memory storage only for the input and output data arrays, whereas other decompositions require additional temporary storage. Moreover, the results show that for an in-place implementation of the FFT, balanced decompositions are optimal if the algorithm is iterative, and right-expanded decompositions are optimal if the algorithm is recursive.

Pinit Kumhom, Jeremy Johnson and Prawat Nagvajara (Proc. IEEE ASIC/SOC Conference, IEEE, pp. 182-186, 2000)
Design, optimization, and implementation of a universal FFT processor
Published paper (link to publisher)
Bibtex

There exist Fast Fourier transform (FFT) algorithms, called dimensionless FFTs, that work independent of dimension. These algorithms can be configured to compute different dimensional DFTs simply by relabeling the input data and by changing the values of the twiddle factors occurring in the butterfly operations. This observation allows us to design an FFT processor, which with minor reconfiguring, can compute one, two, and three dimensional DFTs. In this paper we design a family of FFT processors, parameterized by the number of points, the dimension, the number of processors, and the internal dataflow, and show how to map different dimensionless FFTs onto this hardware design. Different dimensionless FFTs have different dataflows and consequently lead to different performance characteristics. Using a performance model we search for the optimal algorithm for the family of processors we considered. The resulting algorithm and corresponding hardware design was implemented using FPGA

Neungsoo Park, Viktor K. Prasanna, Kiran Bondalapati and Dongsoo Kang (Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 693-701, 2000)
Dynamic Data Layouts for Cache-conscious Factorization of DFT
Preprint (245 KB)
Bibtex

Effective utilization of cache memories is a key factor in achieving high performance in computing the Discrete Fourier Transform (DFT). Most optimization techniques for computing the DFT rely on either modifying the computation and data access order or exploiting low level platform specific details, while keeping the data layout in memory static. In this paper we propose a high level optimization technique, dynamic data layout (DDL). In DDL, data reorganization is performed between computations to effectively utilize the cache. This cache-conscious factorization of the DFT including the data reorganization steps is automatically computed by using efficient techniques in our approach. An analytical model of the cache miss pattern is utilized to predict the performance and explore the search space of factorizations. Our technique results in up to a factor of 4 improvement over standard FFT implementations and up to 33% improvement over other optimization techniques such as copying on SUN UltraSPARC-II, DEC Alpha and Intel Pentium III dynamic data layout (DDL). In DDL, data reorganization is performed between computations to effectively utilize the cache. This cache-conscious factorization of the DFT including the data reorganization steps is automatically computed by using efficient techniques in our approach. An analytical model of the cache miss pattern is utilized to predict the performance and explore the search space of factorizations. Our technique results in up to a factor of 4 improvement over standard FFT implementations and up to 33% improvement over other optimization techniques such as copying on SUN UltraSPARC-II, DEC Alpha and Intel Pentium III.

Jeremy Johnson and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 6, pp. 3347-3350, 2000)
In Search of the Optimal Walsh-Hadamard Transform
Preprint (137 KB)
Published paper (link to publisher)
Bibtex

This paper describes an approach to implementing and optimizing fast signal transforms. Algorithms for computing signal transforms are expressed by symbolic expressions, which can be automatically generated and translated into programs. Optimizing an implementation involves searching for the fastest program obtained from one of the possible expressions. In this paper we apply this methodology to the implementation of the Walsh-Hadamard transform. An environment, accessible from MATLAB, is provided for generating and timing WHT algorithms. These tools are used to search for the fastest WHT algorithm. The fastest algorithm found is substantially faster than standard approaches to implementing the WHT. The work reported in this paper is part of the SPIRAL project, an ongoing project whose goal is to automate the implementation and optimization of signal processing algorithms.

Bryan Singer and Manuela Veloso (Proc. International Conference on Machine Learning (ICML), pp. 887-894, 2000)
Learning to Predict Performance from Formula Modeling and Training Data
Preprint (188 KB)
Published paper (link to publisher)
Bibtex

This paper reports on our work and results framing signal processing algorithm optimization as a machine learning task. A single signal processing algorithm can be represented by many different but mathematically equivalent formulas. When these formulas are implemented in actual code, they have very different running times. Signal processing optimization is concerned with finding a formula that implements the algorithm as efficiently as possible. Unfortunately, a correct mapping between a mathematical formula and its running time is unknown. However empirical performance data can be gathered for a variety of formulas. This data offers an interesting opportunity to learn to predict running time performance. In this paper we present two major results along this direction: (1) Different sets of features are identified for mathematical formulas that distinguish them into partitions with significantly different running times, and (2) A function approximator can learn to accurately predict the running time of a formula given a limited set of training data. Showing the impact of selecting different features to describe the input, this work contributes an extensive study on the role of learning for this novel task.

David Sepiashvili (Master thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2000)
Performance Models and Search Methods for Optimal FFT Implementations
Preprint (589 KB)
Bibtex

The central result of this thesis is deriving systematic methodologies for finding the fast Fourier transform (FFT) optimal implementations. By employing a divide-conquer procedure (decomposition), we break down the initial transform into combinations of different smalller size sub-transforms that can graphically be presented as breakdown trees. We apply the rewrite rules to the Cooley-Tukey formula and generate a set of algorithms and alternative codes for the FFT computation. We obtain a set of all possible implementations by pairing all possible breakdown trees with all code alternatives. For the runtime evaluation of these implementations, we develop the analytical and experimental perfomance models. Based on these models, we derive the optimal search methods -- the dynamic programming and exhaustive search -- and use them to search over the set of implementation runtimes for finding the implementation with the minimal runtime. The test results demonstrate that good algorithms and codes, accurate performance evaluation models, and effective search methods, combined together within the system framework, provide the best FFT implementations.

José M. F. Moura, Jeremy Johnson, Robert W. Johnson, David Padua, Viktor K. Prasanna, Markus Püschel and Manuela Veloso (Proc. High Performance Embedded Computing (HPEC), 2000)
SPIRAL: Automatic Implementation of Signal Processing Algorithms
Bibtex

Publication interface designed and implemented by Patra Pantupat, Aliaksei Sandryhaila, and Markus Püschel
Electrical and Computer Engineering, Carnegie Mellon University, 2007