Copyrights to these papers may be held by the publishers. The download files are preprints. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Matthias Bolten, Franz Franchetti, P. H. J. Kelly, Christian Lengauer and Marcus Mohr (Concurrency and Computation: Practice and Experience, 2017)

**Algebraic Description and Automatic Generation of Multigrid Methods in SPIRAL**

Bibtex

Tze-Meng Low and Franz Franchetti (Proc. IEEE International Symposium on High Assurance Systems Engineering (HASE), 2017)

**High Assurance Code Generation for Cyber-Physical Systems**

Bibtex

Franz Franchetti, Tze-Meng Low, Stefan Mitsch, Juan Pablo Mendoza, Liangyan Gui, Amarin Phaosawasdi, David Padua, Soummya Kar, José M. F. Moura, M. Franusich, Jeremy Johnson, Andre' Platzer and Manuela Veloso (IEEE Control Systems Magazine, 2017)

**High-Assurance SPIRAL: End-to-End Guarantees for Robot and Car Control**

Bibtex

H. V. Koops, Kashish Garg, Munsung (Bill) Kim, Jonathan Li, Anja Volk and Franz Franchetti (Technical report UU-CS-2017-001, Dept. of Information and Computing Sciences, Utrecht University, 2017)

**Prediction of Quadcopter State through Multi-Microphone Side-Channel Fusion**

Bibtex

F. Sadi, Lawrence Pileggi and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), pp. 1-1, 2016)

**3D DRAM Based Application Specific Hardware Accelerator for SpMV**

Bibtex

Daniele G. Spampinato and Markus Püschel (Proc. International Symposium on Code Generation and Optimization (CGO), pp. 117-127, 2016)

**A basic linear algebra compiler for structured matrices**

Preprint (1 MB)

Published paper (link to publisher)

Bibtex

Many problems in science and engineering are in practice modelled and solved through matrix computations. Often, the matrices involved have structure such as symmetric or triangular, which reduces the operations count needed to perform a computation. For example, dense linear systems of equations are solved by first converting to triangular form and optimization problems may yield matrices with any kind of structure. The well-known BLAS (basic linear algebra subprograms) interface provides a small set of structured matrix computations, chosen to serve a certain set of higher level functions (LAPACK). However, if a user encounters a computation or structure that is not supported, she has to forego the benefits of the structure and choose a generic functionality. In this paper, we address this problem by providing a compiler that translate a given basic linear algebra computation on structured matrices into optimized C code, optionally vectorized with intrinsics. Our work combines prior work on the Spiral-like LGen compiler with techniques from polyhedral compilation to mathematically capture matrix structures. In the paper we consider upper/lower triangular and symmetric matrices but the approach is extensible to a much larger set including blocked structures. We run experiments on a modern Intel platform against the Intel MKL library and a baseline implementation showing competitive performance results for both BLAS and non-BLAS functionalities.

Richard Veras, Tze-Meng Low and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), pp. 1-7, 2016)

**A Scale-Free Structure for Power-Law Graphs**

Bibtex

Richard Veras, Thom Popovici, Tze-Meng Low and Franz Franchetti (Proc. Workshop on Programming Models for SIMD/Vector Programming (WPMVP), 2016)

**Compilers, Hands-Off My Hands-On Optimizations**

Bibtex

Marcela Zuluaga, Andreas Krause and Markus Püschel (Journal of Machine Learning Research, Vol. 17, No. 104, pp. 1-32, 2016)

**e-PAL: An Active Learning Approach to the Multi-Objective Optimization Problem**

Published paper (link to publisher)

Bibtex

In many fields one encounters the challenge of identifying out of a pool of possible designs those that simultaneously optimize multiple objectives. In many applications an exhaustive search for the Pareto-optimal set is infeasible. To address this challenge, we propose the ϵ-Pareto Active Learning (ϵ-PAL) algorithm which adaptively samples the design space to predict a set of Pareto-optimal solutions that cover the true Pareto front of the design space with some granularity regulated by a parameter ϵ. Key features of ϵ-PAL include (1) modeling the objectives as draws from a Gaussian process distribution to capture structure and accommodate noisy evaluation; (2) a method to carefully choose the next design to evaluate to maximize progress; and (3) the ability to control prediction accuracy and sampling cost. We provide theoretical bounds on ϵ-PAL's sampling cost required to achieve a desired accuracy. Further, we perform an experimental evaluation on three real-world data sets that demonstrate ϵ-PAL's effectiveness; in comparison to the state-of-the-art active learning algorithm PAL, ϵ-PAL reduces the amount of computations and the number of samples from the design space required to meet the user's desired level of accuracy. In addition, we show that ϵ-PAL improves significantly over a state-of-the-art multi- objective optimization method, saving in most cases 30% to 70% evaluations to achieve the same accuracy.

J. Kepner, P. Aaltonen, D. Bader, A. Buluc, Franz Franchetti, J. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenke, S. McMillan, J. Moreira, J. D. Owens, C. Yang, M. Zalewski and T. Mattson (Proc. High Performance Extreme Computing (HPEC), 2016)

**Mathematical Foundations of the GraphBLAS**

Bibtex

Francois Serre and Markus Püschel (Proc. FPGA, pp. 215-223, 2016)

**Optimal Circuits for Streamed Linear Permutations using RAM**

Preprint (639 KB)

Published paper (link to publisher)

Bibtex

We propose a method to automatically derive hardware structures that perform a fixed linear permutation on streaming data. Linear permutations are permutations that map linearly the bit representation of the elements addresses. This set contains many of the most important permutations in media processing, communication, and other applications and includes perfect shuffles, stride permutations, and the bit reversal. Streaming means that the data to be permuted arrive as a sequence of chunks over several cycles. We solve this problem by mathematically decomposing a given permutation into a sequence of three permutations that are either temporal or spatial. The former are implemented as banks of RAM, the latter as switching networks. We prove optimality of our solution in terms of the number of switches in these networks.

Marcela Zuluaga, Peter A. Milder and Markus Püschel (ACM Transactions on Design Automation of Electronic Systems, Vol. 21, No. 4, pp. 55, 2016)

**Streaming Sorting Networks**

Preprint (2.5 MB)

Published paper (link to publisher)

Bibtex

Sorting is a fundamental problem in computer science and has been studied extensively. Thus, a large variety of sorting methods exists for both software and hardware implementations. For the latter, there is a tradeoff between the throughput achieved and the cost, i.e., the area or amount of logic and storage invested to sort n elements. Two popular solutions are bitonic sorting networks with O(n\log^2 n) logic and storage, which sort $n$ elements per cycle, and linear sorters with O(n) logic and storage, which sort $n$ elements per $n$ cycles. In this paper, we present new hardware structures that we call streaming sorting networks, which we derive through a mathematical formalism that we introduce. With the new networks we achieve novel and improved cost-performance tradeoffs. For example, assuming n is a two-power and w is any divisor of n, one class of these networks can sort in n/w cycles with O(w\log^2n) logic and O(n\log^2n) storage; the other class we present sorts in n\log^2n/w cycles with O(w) logic and O(n) storage. We carefully analyze the performance of these networks and their cost at three levels of abstraction: (1) asymptotically, (2) exactly in terms of the number of basic elements needed, and (3) in terms of the resources required by the actual circuit when mapped to a field-programmable gate array. We obtain the latter results through a domain-specific hardware generator that translates our formal mathematical description into synthesizable RTL Verilog. With this generator we explore the entire design space, identify the Pareto-optimal solutions, and show superior cost-performance tradeoffs compared to prior work.

Jiyuan Zhang, Tze-Meng Low, Qi Guo and Franz Franchetti (Proc. Workshop on Near Data Processing (WONDP), 2015)

**A 3D-Stacked Memory Manycore Stencil Accelerator System**

Bibtex

Nikolaos Kyrtatas, Daniele G. Spampinato and Markus Püschel (Proc. Design, Automation and Test in Europe (DATE), pp. 1054-1059, 2015)

**A Basic Linear Algebra Compiler for Embedded Processors**

Preprint (772 KB)

Published paper (link to publisher)

Bibtex

Many applications in signal processing, control, and graphics on embedded devices require efficient linear algebra computations. On general-purpose computers, program generators have proven useful to produce such code, or important building blocks, automatically. An example is LGen, a compiler for basic linear algebra computations of fixed size. In this work, we extend LGen towards the embedded domain using as example targets Intel Atom, ARM Cortex-A8, ARM Cortex-A9, and Raspberry Pi (ARM1176). To efficiently support these processors we introduce support for the NEON vector ISA and a methodology for domain-specific load/store optimizations. Our experimental evaluation shows that the new version of LGen produces code that performs better than well-established, commercial and non-commercial libraries (Intel MKL and IPP), software generators (Eigen and ATLAS), and compilers (icc, gcc, and clang).

Qi Guo, Tsuhan Chen, Y. Chen and Franz Franchetti (IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2015)

**Accelerating Architectural Simulation Via Statistical Techniques: A Survey**

Bibtex

H. V. Koops and Franz Franchetti (Proc. International Conference on Digital Signal Processing (DSP), 2015)

**An Ensemble Technique for Estimating Vehicle Speed and Geer Position from Acoustic Data**

Bibtex

H. E. Sumbul, K. Vaidyanathan, Qiuling Zhu, Franz Franchetti and Lawrence Pileggi (Proc. Design Automation Conference (DAC), 2015)

**A Synthesis Methodology for Application-Specific Logic-in-Memory Designs**

Bibtex

Berkin Akin, Franz Franchetti and James C. Hoe (Proc. International Symposium on Computer Architectur (ISCA), 2015)

**Data Reorganization in Memory Using 3D-stacked DRAM**

Bibtex

Qi Guo, Tze-Meng Low, N. Alachiotis, Berkin Akin, Lawrence Pileggi, James C. Hoe and Franz Franchetti (Proc. MICRO, 2015)

**Enabling Portable Energy Efficiency with Memory Accelerated Library**

Bibtex

M. M. Sabry Aly, M. Gao, G. Hills, C.-S. Lee, G. Pitner, M. M. Shulaker, T. F. Wu, M. Asheghi, J. Bokor, Franz Franchetti, K. E. Goodson, C. Kozyrakis, I. Markov, K. Olukoton, Lawrence Pileggi, E. Pop, J. Rabaey, C. Re, H.-S. Wong and S. Mitra (Computer, Vol. 48, No. 12, pp. 24-33, 2015)

**Energy-Efficient Abundant-Data Computing: The N3XT 1,000x**

Bibtex

Berkin Akin, Franz Franchetti and James C. Hoe (Journal of Signal Processing Systems, 2015)

**FFTs with Near-Optimal Memory Access Through Block Data Layouts: Algorithm, Architecture and Design Automation**

Bibtex

Thom Popovici, F. Russell, K. Wilkinson, C-K. Skylaris, P. H. J. Kelly and Franz Franchetti (Proc. Workshop on Compilers for Parallel Computing (CPC), 2015)

**Generating Optimized Fourier Interpolation Routines for Density Functional Theory Using SPIRAL**

Bibtex

Thom Popovici, F. Russell, K. Wilkinson, C-K. Skylaris, P. H. J. Kelly and Franz Franchetti (Proc. International Parallel and Distributed Processing Symposium (IPDPS), 2015)

**Generating Optimized Fourier Interpolation Routines for Density Functional Theory Using SPIRAL**

Bibtex

Tze-Meng Low, Qi Guo and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2015)

**Optimizing Space Time Adaptive Processing Through Accelerating Memory-Bounded Operations**

Bibtex

T. Ozturk, C. Stein, R. Pokharel, Thom Popovici, R. Suter, Franz Franchetti and Anthony Rollett (submitted for publication)

**Spectral Full-Field Deformation Modeling of Polycrystalline Materials**

Bibtex

Berkin Akin, Franz Franchetti and James C. Hoe (IEEE Micro, Vol. 1, No. 1, pp. 99, 2015)

**To Reshape or Not To Reshape: HAMLeT Architecture for Parallel Data Reorganization in Memory**

Bibtex

Qi Guo, N. Alachiotis, Berkin Akin, F. Sadi, G. Xu, Tze-Meng Low, Lawrence Pileggi, James C. Hoe and Franz Franchetti (Proc. Workshop on Near Data Processing (WONDP), 2014)

**3D-Stacked Memory-Side Acceleration: Accelerator and System Design**

Bibtex

Daniele G. Spampinato and Markus Püschel (Proc. International Symposium on Code Generation and Optimization (CGO), pp. 23-32, 2014)

**A Basic Linear Algebra Compiler**

Preprint (2.4 MB)

Published paper (link to publisher)

Bibtex

Many applications in media processing, control, graphics, and other domains require efficient small-scale linear algebra computations. However, most existing high performance libraries for linear algebra, such as ATLAS or Intel's MKL are more geared towards large-scale problems (matrix sizes in the hundreds and larger) and towards specific interfaces (e.g., BLAS). In this paper we present LGen: a compiler for small-scale, basic linear algebra computations. The input to LGen is a fixed-size linear algebra expression; the output is a corresponding C function optionally including intrinsics to efficiently use SIMD vector extensions. LGen generates code using two levels of mathematical domain-specific languages (DSLs). The DSLs are used to perform tiling, loop fusion, and vectorization at a high level of abstraction, before the final code is generated. In addition, search is used to select among alternative generated implementations. We show benchmarks of code generated by LGen against Intel's MKL and IPP as well as against alternative generators, such as the C++ template-based Eigen and the BTO compiler. The achieved speed-up is typically about a factor of two to three.

Nikolaos Kyrtatas (Master thesis, Computer Science, ETH Zurich, Switzerland, 2014)

**A Basic Linear Algebra Compiler for Embedded Processors**

Preprint (3.4 MB)

Published paper (link to publisher)

Bibtex

In many fields, such as signal processing, control, and graphics, there is a significant demand for efficient Dense Linear Algebra (DLA) code for embedded devices. At the same time, code generators have proved to be useful for generating fast DLA code for general-purpose computers. An example is LGen, a research compiler designed after Spiral for basic linear algebra computations of fixed size. In this thesis, we extend LGen towards four processors that are widely used in the embedded ecosystem: Intel Atom, ARM Cortex-A8, ARM Cortex-A9, and ARM1176. For this purpose, we introduce into the LGen methodology a set of optimizations that take into account the specific limitations and capabilities of these processors, aiming at generating highly optimized code for them. An extensive set of experiments run on our target platforms shows that the new version of LGen produces code that performs better than well-established, commercial and non-commercial libraries (Intel MKL and Intel IPP), software generators (Eigen and ATLAS), and compilers (icc, gcc, and clang). The large number of experiments that we conducted on the four investigated processors were executed using Mediator, a web-based middleware that was developed as part of this thesis. Mediator offers a RESTful interface that makes possible the simultaneous execution of experiments on multiple SSH-accessible devices by multiple users. More specifically, it guarantees mutual exclusion of the experiments running on a specific core, while at the same time applies load balancing over the cores of a device. On top of that, it also provides an easy-to-use mechanism for retrieving performance metrics on a variety of architectures with minimal user involvement.

Alen Stojanov, Georg Ofenbeck, Tiark Rompf and Markus Püschel (Proc. ACM International Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY), pp. 14, 2014)

**Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters**

Preprint (529 KB)

Published paper (link to publisher)

Bibtex

We present FGen, a program generator for high performance convolution operations (finite-impulse-response filters). The generator uses an internal mathematical DSL to enable structural optimization at a high level of abstraction. We use FGen as a testbed to demonstrate how to provide modular and extensible support for modern SIMD vector architectures in a DSL-based generator. Specifically, we show how to combine staging and generic programming with type classes to abstract over both the data type (real or complex) and the target architecture (e.g., SSE or AVX) when mapping DSL expressions to C code with explicit vector intrinsics. Benchmarks shows that the generated code is highly competitive with commercial libraries.

Tao Cui, R. Yang, Gabriela Hug and Franz Franchetti (Proc. IEEE Power and Energy Society General Meeting (PES-GM), 2014)

**Accelerated AC Contingency Calculation on Commodity Multi-core SIMD CPUs**

Bibtex

F. Sadi, Berkin Akin, Thom Popovici, James C. Hoe, Lawrence Pileggi and Franz Franchetti (Proc. IEEE High Performance Extreme Computing (HPEC), 2014)

**Algorithm/Hardware Co-optimized SAR Image Reconstruction with 3D-stacked Logic in Memory**

Bibtex

Georg Ofenbeck, Ruedi Steinmann, Victoria Caparrós Cabezas, Daniele G. Spampinato and Markus Püschel (Proc. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 76 - 85, 2014)

**Applying the Roofline Model**

Preprint (1.6 MB)

Published paper (link to publisher)

Bibtex

The recently introduced roofline model plots the performance of executed code against its operational intensity (operations count divided by memory traffic). It also includes two platform-specific performance ceilings: the processor's peak performance and a ceiling derived from the memory bandwidth, which is relevant for code with low operational intensity. The model thus makes more precise the notions of memory- and compute-bound and, despite its simplicity, can provide an insightful visualization of bottlenecks. As such it can be valuable to guide manual code optimization as well as in education. Unfortunately, to date the model has been used almost exclusively with back-of-the-envelope calculations and not with measured data. In this paper we show how to produce roofline plots with measured data on recent generations of Intel platforms. We show how to accurately measure the necessary quantities for a given program using performance counters, including threaded and vectorized code, and for warm and cold cache scenarios. We explain the measurement approach, its validation, and discuss limitations. Finally, we show, to this extent for the first time, a set of roofline plots with measured data for common numerical functions on a variety of platforms and discuss their possible uses.

B. Duff, J. Larkin, M. Franusich and Franz Franchetti (submitted for publication)

**Automatic Generation of 3-D FFTs**

Bibtex

Benjamin Hess, Thomas Gross and Markus Püschel (Proc. International Conference on Generative Programming: Concepts & Experiences (GPCE), pp. 83-92, 2014)

**Automatic Locality-Friendly Interface Extension of Numerical Functions**

Preprint (509 KB)

Published paper (link to publisher)

Bibtex

Raising the level of abstraction is a key concern of software engineering, and libraries (either used directly or as a target of a program generation system) are a successful technique to raise programmer productivity and to improve software quality. Unfortunately successful libraries may contain functions that may not be general enough. E.g., many numeric performance libraries contain functions that work on one- or higher-dimensional arrays. A problem arises if a program wants to invoke such a function on a non-contiguous subarray (e.g., in C the column of a matrix or a subarray of an image). If the library developer did not foresee this scenario, the client program must include explicit copy steps before and after the library function call, incurring a possibly high performance penalty. A better solution would be an enhanced library function that allows for the desired access pattern. Exposing the access pattern allows the compiler to optimize for the intended usage scenario(s). As we do not want the library developer to generate all interesting versions manually, we present a tool that takes a library function written in C and generates such a customized function for typical accesses. We describe the approach, discuss limitations, and report on the performance. As example access patterns we consider those most common in numerical applications: permutations, striding and block striding, as well as scaling. We evaluate the tool on various library functions including filters, scans, reductions, sorting, FFTs and linear algebra operations. The automatically generated custom version is in most cases significantly faster than using individual steps, offering gains that are typically in the range of 20--80%.

Vadim Zaliva and Franz Franchetti (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014)

**Barometric and GPS Altitude Sensor Fusion**

Bibtex

K. Vaidyanathan, R. Liu, H. E. Sumbul, Qiuling Zhu, Franz Franchetti and Lawrence Pileggi (Proc. IEEE International Symposium on Hardware-Oriented Security and Trust (HOST), 2014)

**Efficient and Secure Intellectual Property (IP) Design for Split Fabrication**

Bibtex

Victoria Caparrós Cabezas and Markus Püschel (Proc. IEEE International Symposium on Workload Characterization (IISWC), pp. 222-231, 2014)

**Extending the Roofline Model: Bottleneck Analysis with Microarchitectural Constraints**

Preprint (2.1 MB)

Published paper (link to publisher)

Bibtex

Software, even if carefully optimized, rarely reaches the peak performance of a processor. Understanding which hardware resource is the bottleneck is difficult but important as it can help with both further optimizing the code or deciding which hardware component to upgrade for higher performance. If the bottleneck is the memory bandwidth, the roofline model provides a simple but instructive analysis and visualization. In this paper, we take the roofline analysis further by including additional performance-relevant hardware features including latency, throughput, and capacity information for a multilevel cache hierarchy and out-of-order execution buffers. Two key ideas underlie our analysis. First, we estimate performance based on a scheduling of the computation DAG on a high-level model of the microarchitecture and extract data including utilization of resources and overlaps from a cycle-by-cycle analysis of the schedule. Second, we show how to use this data to create only one plot with multiple rooflines that visualize bottlenecks. We validate our model against performance data obtained from a real system, and then apply our bottleneck analysis to a number of prototypical floating point kernels to identify and interpret bottlenecks.

T. Ozturk, Thom Popovici, C. Stein, R. Pokharel, Franz Franchetti and Anthony Rollett (Proc. Materials Science & Technology, 2014)

**Fast Fourier Transform Based Mechanical Behavior Formulation: Optimized Implementation and Sensitivity Analysis of the Method**

Bibtex

Berkin Akin, Franz Franchetti and James C. Hoe (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014)

**FFTs with Near-Optimal Memory Access Through Block Data Layouts**

Bibtex

Berkin Akin, James C. Hoe and Franz Franchetti (Proc. IEEE High Performance Extreme Computing (HPEC), 2014)

**HAMLeT: Hardware Accelerated Memory Layout Transform within 3D-stacked DRAM**

Bibtex

Franz Franchetti, Aliaksei Sandryhaila and Jeremy Johnson (Proc. SPIE, Proceedings of SPIE 2014, 2014)

**High Assurance SPIRAL**

Published paper (link to publisher)

Bibtex

Jörn Schumacher and Markus Püschel (Proc. IEEE Workshop on Signal Processing Systems (SIPS), pp. 1-6, 2014)

**High-performance sparse fast Fourier transforms**

Preprint (1.2 MB)

Published paper (link to publisher)

Bibtex

The sparse fast Fourier transform (SFFT) is a recent novel algorithm to compute discrete Fourier transforms on signals with a sparse frequency domain with an improved asymptotic runtime. Reference implementations exist for different variants of the algorithm and were already shown to be faster than state-of-the-art FFT implementations in cases of sufficient sparsity. However, to date the SFFT has not been carefully optimized for modern processors. In this paper, we first analyze the performance of the existing SFFT implementations and discuss possible improvements. Then we present an optimized implementation. We achieve a speedup of 2-5 compared to the existing code and an efficiency that is competitive to high-performance FFT libraries.

Georg Ofenbeck, Tiark Rompf, Alen Stojanov, Martin Odersky and Markus Püschel (Proc. International Workshop on Adaptive Self-tuning Computing Systems (ADAPT), 2014)

**Language Support for the Construction of High Performance Code Generators**

Bibtex

The development of highest performance code on modern processors is extremely difficult due to deep memory hierarchies, vector instructions, multiple cores, and inherent limitations of compilers. The problem is particularly noticeable for library functions of mathematical nature (e.g., BLAS, FFT, filters, Viterbi decoders) that are performance-critical in areas such as multimedia processing, computer vision, graphics, machine learning, or scientific computing. Experience shows that a straightforward implementation often underperforms by one or two orders of magnitude compared to highly tuned code. The latter is often highly specialized to a platform which makes porting very costly. One appealing solution to the problem of optimizing and porting libraries are program generators that automatically produce highest performance libraries for a given platform from a high level description. Building such a generator is difficult, which is the reason that only very few exist to date. The difficulty comes from both the problem of designing an extensible approach to perform all the optimizations the compiler is unable to do and the actual implementation of the generator. To tackle the latter problem we asked the question: Which programming languages features support the development of generators for performance libraries?

T. Ozturk, C. Stein, R. Pokharel, Thom Popovici, Franz Franchetti, R. Suter and Anthony Rollett (submitted for publication)

**Performance Evaluation, Algorithm Optimization and Sensitivity Analysis of the Spectral Full-Field Deformation Modeling of Polycrystalline Materials**

Bibtex

Berkin Akin, Franz Franchetti and James C. Hoe (Proc. IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pp. 248-255, 2014)

**Understanding the Design Space of DRAM-optimized Hardware FFT Accelerators**

Bibtex

Qiuling Zhu, Berkin Akin, H. E. Sumbul, F. Sadi, James C. Hoe, Lawrence Pileggi and Franz Franchetti (Proc. IEEE International 3D Systems Integration Conference (3DIC), pp. 1-7, 2013)

**A 3D-Stacked Logic-in-Memory Accelerator for Application-Specific Data Intensive Computing**

Bibtex

Qiuling Zhu, H. E. Sumbul, F. Sadi, James C. Hoe, Lawrence Pileggi and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), pp. 1-6, 2013)

**Accelerating Sparse Matrix-Matrix Multiplication with 3D-Stacked Logic-in-Memory Hardware**

Bibtex

Marcela Zuluaga, Andreas Krause, Guillaume Sergent and Markus Püschel (Proc. International Conference on Machine Learning (ICML), pp. 462-470, 2013)

**Active Learning for Multi-Objective Optimization**

Preprint (4.7 MB)

Bibtex

In many fields one encounters the challenge of identifying, out of a pool of possible designs, those that simultaneously optimize multiple objectives. This means that usually there is not one optimal design but an entire set of Pareto-optimal ones with optimal trade-offs in the objectives. In many applications, evaluating one design is expensive; thus, an exhaustive search for the Pareto-optimal set is unfeasible. To address this challenge, we propose the Pareto Active Learning (PAL) algorithm which intelligently samples the design space to predict the Pareto-optimal set. Key features of PAL include (1) modeling the objectives as samples from a Gaussian process distribution to capture structure and accomodate noisy evaluation; (2) a method to carefully choose the next design to evaluate to maximize progress; and (3) the ability to control prediction accuracy and sampling cost. We provide theoretical bounds on PAL’s sampling cost required to achieve a desired accuracy. Further, we show an experimental evaluation on three real-world data sets. The results show PAL’s effectiveness; in particular it improves significantly over a state-of-the-art multi-objective optimization method, saving in many cases about 33% evaluations to achieve the same accuracy.

Q. Li, Tao Cui, Yang Weng, Rohit Negi, Franz Franchetti and Marija Ilic (IEEE Transactions on Smart Grid, Vol. 4, No. 1, pp. 446-456, 2013)

**An Information-Theoretic Approach to PMU Placement in Electric Power Systems**

Bibtex

Tao Cui and Franz Franchetti (Proc. IEEE Innovative Smart Grid Technology Conference, 2013)

**A Quasi-Monte Carlo Approach for Radial Distribution System Probabilistic Load Flow**

Bibtex

Qiuling Zhu, Lawrence Pileggi and Franz Franchetti (in FIP AICT 418, pp. 21-44 2013)

**A Smart Memory Accelerated Computed Tomography Parallel Backprojection**

Bibtex

Tao Cui and Franz Franchetti (IEEE PES General Meeting, 2013)

**A Software Performance Engineering Aproach to Fast Transmission Probabilistic Load Flow**

Bibtex

Tom Henretty, Richard Veras, Franz Franchetti, Louis-Noël Pouchet, J. Ramanujam and P. Sadayappan (Proc. ACM International Conference on Supercomputing , pp. 13-24, 2013)

**A Stencil Compiler for Short-Vector SIMD Architectures**

Bibtex

Benjamin Hess (Master thesis, Computer Science, ETH Zurich, Switzerland, 2013)

**Automatic Refactoring: Locality Friendly Interface Enhancements for Numerical Functions**

Preprint (1.2 MB)

Bibtex

Recent improvements in processor architectures such as multiple cores and larger single-instruction multiple-data (SIMD) vector units increased the discrepancy between processing speed and memory bandwidth. Today, the memory bandwidth has become the biggest bottleneck in high performance domains. Numerical functions are often target to heavy optimizations to get as much performance as possible. These functions presume a specific data layout and have a fixed domain and range. Adjusting unsuitable data according to these requirements can take up a significant amount of the runtime due to heavy memory operations. By integrating layout, domain and range adjustments into numerical functions, memory bandwidth is saved as the adjustments happen inplace during execution of the function. Four transformations are provided by the developed tool to refactor the most common adjustments into numerical functions. Restrictions on the to transformed function ensure the correctness of the transformations. These transformations enable the function to directly work on previously incompatible data which makes the manual adjustments superfluous and saves memory bandwidth by not needing to adjust the data manually before calling the function. The runtime of the transformed function is highly dependent on the used function and transformation type, ranging from 50% slower to up to 5 times faster compared to applying the adjustments manually before or after the function. The developed tool shows that automatic refactoring with a subset of C is possible and allows a developer to enhance numerical functions with minimal additional effort providing more flexible and high-performance versions of the functions with only having to maintain the original function.

Jörn Schumacher (Master thesis, Computer Science, ETH Zurich, Switzerland, 2013)

**High Performance Sparse Fast Fourier Transform**

Preprint (702 KB)

Bibtex

The Sparse Fast Fourier Transform is a recent algorithm developed by Hassanieh et al. at MIT for Discrete Fourier Transforms on signals with a sparse frequency domain. A reference implementation of the algorithm exists and proves that the Sparse Fast Fourier Transform can be faster than modern FFT libraries. However, the reference implementation does not take advantage of modern hardware features like vector instruction sets or multithreading. In this Master Thesis the reference implementation’s performance will be analyzed and evaluated. Several optimizations are proposed and implemented in a high-performance Sparse Fast Fourier Transform library. The optimized code is evaluated for performance and compared to the reference implementation as well as the FFTW library. The main result is that, depending on the input parameters, the optimized Sparse Fast Fourier Transform library is two to five times faster than the reference implementation.

Marcela Zuluaga, Andreas Krause and Markus Püschel (Proc. Workshop on High-Level Synthesis for High Performance Computing (HLS4HPC), 2013)

**Multi-Objective Optimization for High-Level Synthesis**

Preprint (2.1 MB)

Bibtex

Tao Cui and Franz Franchetti (Proc. High Performance Computing, Networking and Analytics for the Power Grid (HiPCNA-PG), 2013)

**Power System Probabilistic and Security Analysis on Commodity High Performance Computing Systems**

Bibtex

Cory Thoma, Tao Cui and Franz Franchetti (Proc. IEEE Power and Energy Society General Meeting (PES-GM), 2013)

**Privacy Preserving Smart Meter System Based Retail Level Electricity Market**

Bibtex

Georg Ofenbeck, Tiark Rompf, Alen Stojanov, Martin Odersky and Markus Püschel (Proc. International Conference on Generative Programming: Concepts & Experiences (GPCE), pp. 125-134, 2013)

**Spiral in Scala: Towards the Systematic Construction of Generators for Performance Libraries**

Preprint (886 KB)

Published paper (link to publisher)

Bibtex

Program generators for high performance libraries are an appealing solution to the recurring problem of porting and optimizing code with every new processor generation, but only few such generators exist to date. This is due to not only the difficulty of the design, but also of the actual implementation, which often results in an ad-hoc collection of standalone programs and scripts that are hard to extend, maintain, or reuse. In this paper we ask whether and which programming language concepts and features are needed to enable a more systematic construction of such generators. The systematic approach we advocate extrapolates from existing generators: a) describing the problem and algorithmic knowledge using one, or several, domain-specific languages (DSLs), b) expressing optimizations and choices as rewrite rules on DSL programs, c) designing data structures that can be configured to control the type of code that is generated and the data representation used, and d) using autotuning to select the best-performing alternative. As a case study, we implement a small, but representative subset of Spiral in Scala using the Lightweight Modular Staging (LMS) framework. The first main contribution of this paper is the realization of c) using type classes to abstract over staging decisions, i.e. which pieces of a computation are performed immediately and for which pieces code is generated. Specifically, we abstract over different complex data representations jointly with different code representations including generating loops versus unrolled code with scalar replacement---a crucial and usually tedious performance transformation. The second main contribution is to provide full support for a) and d) within the LMS framework: we extend LMS to support translation between different DSLs and autotuning through search.

H. E. Sumbul, A. Patterson, A. Tazzoli, G. Feeder, Franz Franchetti, G. Piazza and Lawrence Pileggi (Proc. Government Applications & Critical Technology Conference (GOMACTech), 2013)

**Trusted Split-Fabrication System-on-Chip Design Technology and Methodology**

Bibtex

Martin Kong, Richard Veras, Kevin Stock, Franz Franchetti, Louis-Noël Pouchet and P. Sadayappan (Proc. Programming Languages Design and Implementation (PLDI), Vol. 48, pp. 127-138, 2013)

**When Polyhedral Transformations Meet SIMD Code Generation**

Bibtex

Berkin Akin, Peter A. Milder, Franz Franchetti and James C. Hoe (ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2012)

**Algorithm and Architecture Optimization for Large Size Two Dimensional Discrete Fourier Transform**

Bibtex

Tao Cui and Franz Franchetti (Proc. IEEE Power and Energy Society General Meeting (PES-GM), pp. 1-6, 2012)

**A Multi-Core High Performance Computing Framework for Probabilistic Solutions of Distribution Systems**

Bibtex

C. Angelopoulos, Franz Franchetti and Markus Püschel (NVIDIA Research Summit at the GPU Technology Conference, 2012)

**Automatic Generation of FFT Libraries for GPUs**

Bibtex

Franz Franchetti, Yevgen Voronenko and G. Almasi (Proc. High Performance Computing for Computational Science (VECPAR), 2012)

**Automatic Generation of the HPC Challenges Global FFT Benchmark for BlueGene/P**

Bibtex

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (ACM Transactions on Design Automation of Electronic Systems, Vol. 17, No. 2, 2012)

**Computer Generation of Hardware for Linear Digital Signal Processing Transforms**

Preprint (2.2 MB)

Published paper (link to publisher)

Bibtex

Linear signal transforms such as the discrete Fourier transform (DFT) are very widely used in digital signal processing and other domains. Due to high performance or efficiency requirements, these transforms are often implemented in hardware. This implementation is challenging due to the large number of algorithmic options (e.g., fast Fourier transform algorithms or FFTs), the variety of ways that a fixed algorithm can be mapped to a sequential datapath, and the design of the components of this datapath. The best choices depend heavily on the resource budget and the performance goals of the target application. Thus, it is difficult for a designer to determine which set of options will best meet a given set of requirements.

In this article we introduce the Spiral hardware generation framework and system for linear transforms. The system takes a problem specification as input as well as directives that define characteristics of the desired datapath. Using a mathematical language to represent and explore transform algorithms and data- path characteristics, the system automatically generates an algorithm, maps it to a datapath, and outputs a synthesizable register transfer level Verilog description suitable for FPGA or ASIC implementation. The quality of the generated designs rivals the best available handwritten IP cores.

Marcela Zuluaga, Peter A. Milder and Markus Püschel (Proc. Design Automation Conference (DAC), pp. 1245-1253, 2012)

**Computer Generation of Streaming Sorting Networks**

Preprint (1.1 MB)

Published paper (link to publisher)

Bibtex

Sorting networks offer great performance but become prohibitively expensive for large data sets. We present a domain-specific language and compiler to automatically generate hardware implementations of sorting networks with reduced area and optimized for latency or throughput. Our results show that the generator produces a wide range of Pareto-optimal solutions that both compete with and outperform prior sorting hardware.

Qiuling Zhu, Lawrence Pileggi and Franz Franchetti (Proc. IFIP/IEEE Internationa Conference on Very Large Scale Integration, pp. 111-116, 2012)

**Cost-Effective Smart Memory Implementation for Parallel Backprojection in Computed Tomography**

Bibtex

Qiuling Zhu, K. Vaidyanathan, O. Shacham, M. Horowitz, Lawrence Pileggi and Franz Franchetti (Proc. IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pp. 125-132, 2012)

**Design Automation Framwork for Application-Specific Logic-in-Memory Blocks**

Bibtex

W. Yu, Franz Franchetti, James C. Hoe and Tsuhan Chen (Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 296-307, 2012)

**Highly Efficient Performance Portable Tracking of Evolving Surfaces**

Bibtex

Robert Koutsoyannis, Peter A. Milder, Christian Berger, Madeleine Glick, James C. Hoe and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012)

**Improving Fixed-Point Accuracy of FFT Cores in O-OFDM Systems**

Preprint (1018 KB)

Bibtex

Optical OFDM communication systems operating at data rates in the 40Gb/s (and higher) range require high-throughput/highly parallel fast Fourier transform (FFT) implementations. These consume a significant amount of chip resources; we aim to reduce costs by improving the system’s accuracy per chip-area. For OFDM signals, we characterize the growth of data within the FFT and explore several cost-conscious methods for improving the fixed-point format. Using ASIC synthesis and hardware accurate simulations, we evaluate the corresponding system error and stability of these methods. We introduce Directive Scaling, which provides an average increase in overall accuracy without additional runtime-adaptive mechanisms. ASIC synthesis results show minimal overhead, and we explicitly evaluate and explain the inherent tradeoffs. When applied to an 8-bit IFFT design, our technique improves precision by approximately two bits with just a 4% area overhead, as opposed to the additional 32% area overhead required using standard methods.

Qiuling Zhu, Christian Berger, E. L. Turner, Lawrence Pileggi and Franz Franchetti (Journal of Signal Processing Systems, 2012)

**Local Interpolation-based Polar Format SAR: Algorithm, Hardware Implementation and Design Automation**

Bibtex

Berkin Akin, Peter A. Milder, Franz Franchetti and James C. Hoe (Proc. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2012)

**Memory Bandwidth Efficient Two-Dimensional Fast Fourier Transform Algorithm and Implementation for Large Problem Sizes**

Preprint (695 KB)

Bibtex

Tao Cui and Franz Franchetti (Proc. IEEE High Performance Extreme Computing (HPEC), 2012)

**Optimized Parallel Distribution Load Flow Solver on Commodity Multi-core CPU**

Bibtex

Qiuling Zhu, Christian Berger, E. L. Turner, Lawrence Pileggi and Franz Franchetti (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 1557-1560, 2012)

**Polar Format Synthetic Aperture Radar in Energy Efficient Application-Specific Logic-in-Memory**

Bibtex

Cory Thoma, Tao Cui and Franz Franchetti (Proc. North American Power Symposium (NAPS), pp. 1-6, 2012)

**Secure Multiparty Computation Based Privacy Preserving Smart Metering System**

Bibtex

Marcela Zuluaga, Andreas Krause, Peter A. Milder and Markus Püschel (Proc. Languages, Compilers, Tools and Theory for Embedded Systems (LCTES), pp. 119-128 , 2012)

**"Smart" Design Space Sampling to Predict Pareto-Optimal Solutions**

Preprint (1.1 MB)

Published paper (link to publisher)

Bibtex

Many high-level synthesis tools offer degrees of freedom in mapping high-level specifications to Register-Transfer Level descriptions. These choices do not affect the functional behavior but span a design space of different cost-performance tradeoffs. In this paper we present a novel machine learning-based approach that efficiently determines the Pareto-optimal designs while only sampling and synthesizing a fraction of the design space. The approach combines three key components: (1) A regression model based on Gaussian processes to predict area and throughput based on synthesis training data. (2) A “smart” sampling strategy, GP-PUCB, to iteratively refine the model by carefully selecting the next design to synthesize to maximize progress. (3) A stopping criterion based on assessing the accuracy of the model without access to complete synthesis data. We demonstrate the effectiveness of our approach using IP generators for discrete Fourier transforms and sorting networks. However, our algorithm is not specific to this application and can be applied to a wide range of Pareto front prediction problems.

Qiuling Zhu, Lawrence Pileggi and Franz Franchetti (Proc. SRC TECHCON, 2012)

**Smart Memory Synthesis for Energy-Efficient Computed Tomography Reconstruction**

Bibtex

Tao Cui and Franz Franchetti (Carnegie Mellon Conference on the Electricity Industry, 2011)

**A Monte Carlo Framework for Probabilistic Distribution Power Flow**

Bibtex

Tao Cui and Franz Franchetti (Proc. North American Power Symposium (NAPS), 2011)

**A Multi-core High Performance Computing Framework for Distribution Power Flow**

Bibtex

Qiuling Zhu, E. L. Turner, Christian Berger, Lawrence Pileggi and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2011)

**Application-Specific Logic-in-Memory for Polar Format Synthetic Aperture Radar**

Bibtex

Daniel McFarlin, Volodymyr Arbatov, Franz Franchetti and Markus Püschel (Proc. International Conference on Supercomputing (ICS), 2011)

**Automatic SIMD Vectorization of Fast Fourier Transforms for the Larrabee and AVX Instruction Sets**

Preprint (2.5 MB)

Published paper (link to publisher)

Bibtex

The well-known shift to parallelism in CPUs is often associated with multicores. However another trend is equally salient: the increasing parallelism in per-core single-instruction multiple-date (SIMD) vector units. Intel's SSE and IBM's VMX (compatible to AltiVec) both offer 4-way (single precision) floating point, but the recent Intel instruction sets AVX and Larrabee (LRB) offer 8-way and 16-way, respectively. Compilation and optimization for vector extensions is hard, and often the achievable speed-up by using vectorizing compilers is small compared to hand-optimization using intrinsic function interfaces. Unfortunately, the complexity of these intrinsics interfaces increases considerably with the vector length, making hand-optimization a nightmare. In this paper, we present a peephole-based vectorization system that takes as input the vector instruction semantics and outputs a library of basic data reorganization blocks such as small transpositions and perfect shuffles that are needed in a variety of high performance computing applications. We evaluate the system by generating the blocks needed by the program generator Spiral for vectorized fast Fourier transforms (FFTs). With the generated FFTs we achieve a vectorization speed-up of 5.5-6.5 for 8-way AVX and 10-12.5 for 16-way LRB. For the latter instruction counts are used since no timing information is available. The combination of the proposed system and Spiral thus automates the production of high performance FFTs for current and future vector architectures.

Tao Cui and Franz Franchetti (Proc. International Workshop on Automatic Performance Tuning (iWAPT), 2011)

**Autotuning a Random Walk Boolean Satisfiability Solver**

Bibtex

Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J. Ramanujam and P. Sadayappan (Proc. International Conference on Compiler Construction (CC), 2011)

**Data Layout Transformation for Stencil Computations on Short SIMD Architectures**

Bibtex

Rachid Bouziane, Peter A. Milder, Robert Koutsoyannis, Yannis Benlachtar, James C. Hoe, Markus Püschel, Madeleine Glick and Robert I. Killey (Optics Express, Vol. 19, No. 21, pp. 20857-20864, 2011)

**Design studies for ASIC implementations of 28 GS/s optical QPSK- and 16-QAM-OFDM transceivers**

Published paper (link to publisher)

Bibtex

We designed at the register-transfer-level digital signal processing (DSP) circuits for 21.8 Gb/s and 43.7 Gb/s QPSK- and 16- QAM-encoded optical orthogonal frequency division multiplexing (OFDM) transceivers, and carried out synthesis and simulations assessing performance, power consumption and chip area. The aim of the study is to determine the suitability of OFDM technology for low-cost optical interconnects. Power calculations based on synthesis for a 65nm standard- cell library showed that the DSP components of the transceiver (FFTs, equalisation, (de)mapping and clipping/scaling circuits) consume 18.2 mW/Gb/s and 12.8 mW/Gb/s in the case of QPSK and 16-QAM respectively.

Franz Franchetti and Markus Püschel (in Encyclopedia of Parallel Computing, Eds. David Padua, Springer 2011)

**Fast Fourier Transform**

Preprint (874 KB)

Published paper (link to publisher)

Bibtex

A fast Fourier transform (FFT) is an efficient algorithm to compute the discrete Fourier transform (DFT) of an input vector. Efficient means that the FFT computes the DFT of an n-element vector in O(n log n) operations in contrast to the O(n^2) operations required for computing the DFT by definition. FFTs exist for any vector length $n$ and for real and higher-dimensional data. Parallel FFTs have been developed since the advent of parallel computing.

W. Yu, Franz Franchetti, James C. Hoe, José M. F. Moura and Tsuhan Chen (Proc. High Performance Extreme Computing (HPEC), 2011)

**Performance Portable Tracking of Evolving Surfaces**

Bibtex

Christian Berger, Volodymyr Arbatov, Yevgen Voronenko, Franz Franchetti and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 1693-1696 , 2011)

**Real-Time Software Implementation of an IEEE 802.11a Baseband Receiver on Intel Multicore**

Preprint (148 KB)

Published paper (link to publisher)

Bibtex

We present a software-only implementation of an IEEE~802.11a (WiFi) receiver optimized for Intel multicore platforms. The receiver is about 50 times faster than a straightforward C implementation, i.e., an implementation that has the same functionality, but leaves optimization completely to the compiler. Our hand-optimized implementation achieves real-time for all data rates up to the maximum of 54 Mbit/s on a Core i7, clocked at 3.3 GHz, and for up to 12 Mbit/s on an Atom, clocked at 1.6 GHz, using two cores in both cases. To achieve this performance we use up to two threads, up to 16-way vectorization using Intel's SSE, and various other optimizations.

Markus Püschel, Franz Franchetti and Yevgen Voronenko (in Encyclopedia of Parallel Computing, Eds. David Padua, Springer 2011)

**Spiral**

Preprint (362 KB)

Published paper (link to publisher)

Bibtex

Spiral is a program generation system (software that generates other software) for linear transforms and an increasing list of other mathematical functions. The goal of Spiral is to automate the development and porting of performance libraries. Linear transforms include the discrete Fourier transform (DFT), discrete cosine transforms, convolution, and the discrete wavelet transform. The input to Spiral consists of a high-level mathematical algorithm specification and selected architectural and microarchitectural parameters. The output is performance-optimized code in a high-level language such as C, possibly augmented with vector intrinsics and threading instructions.

Peter A. Milder (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2010)

**A Mathematical Approach for Compiling and Optimizing Hardware Implementations of DSP Transforms**

Preprint (4.4 MB)

Bibtex

Linear signal transforms (such as the discrete Fourier transform) are frequently used in digital signal processing and related fields. Algorithms for computing linear transforms are well-understood and typically have a high degree of regularity and parallelism, making them well-suited for implementation as sequential datapaths on field-programmable gate arrays or application-specific integrated circuits. Nonetheless, transforms are difficult to implement due to the large number of algorithmic options available and ways that algorithms can be mapped to sequential datapaths. Further, the best choices depend heavily on the resource budget and performance goals of the target application. Thus, it is difficult for a designer to determine which set of options will best meet a given set of requirements. This thesis proposes the Spiral hardware generation framework, a hardware compilation and optimization tool based on a mathematical formula language. A formula written in this language specifies a particular transform algorithm executed using a particular sequential datapath. This language allows high-level representation of sequential hardware structures that reuse datapath elements multiple times in the computation of a transform. The language accomplishes this by providing a formal connection between structure in the algorithm and sequential reuse in the datapath. This proposed language drives a hardware compilation system that takes as input a problem specification with directives that define characteristics of the desired datapath; the system then automatically generates an algorithm, maps the algorithm to a datapath, and outputs synthesizable register transfer level Verilog. This thesis evaluates the generated designs when synthesized for field-programmable gate array or application-specific integrated circuit. Its evaluations consider designs across multiple transforms, datatypes, and design goals, and its results show that Spiral is able to automatically providea wide tradeoff between cost (e.g., area, power) and performance. This tradeoff space compares well with existing benchmarks, but allows the designer much more flexibility to find the design best suited to his or her needs.

Frédéric de Mesmay, Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. International Conference on High Performance Embedded Architectures and Compilers (HiPEAC), Lecture Notes in Computer Science, Springer, Vol. 5952, pp. 353-368, 2010)

**Computer Generation of Efficient Software Viterbi Decoders**

Preprint (575 KB)

Published paper (link to publisher)

Bibtex

This paper presents a program generator for fast software Viterbi decoders for arbitrary convolutional codes. The input to the generator is a specification of the code, and single-instruction multiple-data (SIMD) vector length. The output is an optimized C implementation of the decoder that uses explicit Intel SSE vector instructions. Only the performance-critical forward pass is actually generated, and a generic infrastructure handles the traceback pass. At the heart of the generator is a small domain-specific language called VL to express the structure of the forward pass. Vectorization is done by rewriting VL expressions, which a compiler then translates into actual code in addition to performing further optimizations specific to the vector instruction set. Benchmarks show that the generated decoders match the performance of available expert hand-tuned implementations, while spanning the entire space of convolutional codes.

Srinivas Chellappa (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2010)

**Computer Generation of Fourier Transform Libraries for Distributed Memory Architectures**

Preprint (1.7 MB)

Bibtex

High-performance discrete Fourier transform (DFT) libraries are an important requirement for many computing platforms. Unfortunately, developing and optimizing these libraries for modern, complex platforms has become extraordinarily difficult. To make things worse, performance often does not port, thus requiring permanent re-optimizations. Overcoming this problem has been the goal of Spiral, a library generation system that can produce highly optimal DFT code from a high level specification of algorithms and target platforms.

However, current techniques in Spiral cannot support all target platforms. In particular, several emerging target platforms incorporate a distributed memory parallel processing paradigm, where the cost of accessing non-local memories is relatively high, and handling data movement is exposed to the programmers. Traditionally used only in supercomputing environments, this paradigm is now also finding its way in the form of multicore processors into desktop computing.

The goal of this work is the computer generation of high-performance DFT libraries for a wide range of distributed memory parallel processing systems, given only a high-level description of a DFT algorithm and some platform parameters. The key challenges include generating code for multiple target programming paradigms that delivers load balanced parallelization across multiple layers of the compute hierarchy, orchestrates explicit memory management, and overlaps computation with communication.

We attack this problem by first developing a formal framework to describe parallelization, streaming, and data exchange in a domain-specific declarative mathematical language. Based on this framework, we develop a rewriting system that structurally manipulates DFT algorithms to "match" them to a distributed memory target architecture and hence extracts maximum performance. We implement this approach as a part of Spiral together with a backend that translates the derived algorithms into actual libraries. Experimental results show that the performance of our generated code is comparable to hand-tuned code in all cases, and exceeds the performance of hand-tuned code in some cases.

Yevgen Voronenko, Volodymyr Arbatov, Christian Berger, Ronghui Peng, Markus Püschel and Franz Franchetti (Proc. Software Defined Radio (SDR), 2010)

**Computer Generation of Platform-Adapted Physical Layer Software**

Preprint (310 KB)

Bibtex

In this paper, we describe a program generator for physical layer (PHY) baseband processing in a software-defined radio implementation. The input of the generator is a very high-level platform-independent description of the transmitter and receiver PHY functionality, represented in a domain-specific declarative language called Operator Language (OL). The output is performance-optimized and platform-tuned C code with single-instruction multiple-data (SIMD) vector intrinsics and threading directives. The generator performs these optimizations by restructuring the algorithms for the individual components at the OL level before mapping to code. This way known compiler limitations are overcome. We demonstrate the approach and the excellent performance of the generated code on on the IEEE 802.11a (WiFi) receiver and transmitter PHY for all transmission modes.

Douglas F. Jones (Master thesis, Computer Science, Drexel University, 2010)

**Data Pump Architecture Simulator and Performance Model**

Preprint (1.1 MB)

Bibtex

The Data Pump Architecture (DPA) is a novel non-von-Neumann computer architecture emphasizing efficient use of on-chip SRAM and off-chip DRAM bandwidth. The DPA is parameterized by local memory size, memory bandwidth, vector length, and number of compute processors, allowing the architecture configuration to be balanced for a given set of computations. The Data Pump Architecture Simulator and Performance Model is a functional simulator providing an implementation of the DPA as a software library. Simulation at this level provides the user with the ability to test algorithms for correctness as well as performance. A hardware designer may then use the Simulator as a tool to experimentally determine the effects of architecture parameters and software algorithms through performance data provided by the Simulator's model. The benefit of the approach described here compared to alternatives such as logic/gate level or RTL/HDL software emulation is primarily a reduction in the time needed for 1) a designer to modify and compile a HDL design and 2) to perform the simulation. The Simulator serves as a bridge between the application specific hardware and software design. The result is the ability to use the simulator as part of a framework for rapidly investigating the construction of an optimal system architecture for a given set of algorithms. Investigations with the Walsh-Hadamard transform (WHT), a prototypical digital signal processing transform, are shown. The SPIRAL (www.spiral.net) code generation system is used to explore the space of WHT algorithms while the Simulator is used to explore the performance and trade-offs of dierent DPA configurations.

Rachid Bouziane, Peter A. Milder, Robert Koutsoyannis, Yannis Benlachtar, Christian Berger, James C. Hoe, Markus Püschel, Madeleine Glick and Robert I. Killey (Proc. European Conference on Optical Communication (ECOC), pp. 1-3, 2010)

**Design Studies for an ASIC Implementation of an Optical OFDM Transceiver**

Preprint (254 KB)

Published paper (link to publisher)

Bibtex

We designed at the register-transfer-level the DSP circuits for a 21.8 Gb/s QPSK-OFDM transceiver, and carried out synthesis and simulations assessing performance, power consumption and chip area, to determine their suitability for low-cost optical interconnects.

C. Angelopoulos, Franz Franchetti and Markus Püschel (NVIDIA Research Summit at the GPU Technology Conference, 2010)

**DFT Transform on the Fermi (GTX480): Automatic Program Generation**

Bibtex

W. Yu, Franz Franchetti, James C. Hoe and Tsuhan Chen (Proc. IEEE International Conference on Image Processing (ICIP), 2010)

**Fast and Robust Active Contours for Image Segmentation**

Bibtex

W. Yu, Franz Franchetti, James C. Hoe, Y.-J. Chang and Tsuhan Chen (Proc. IEEE International Conference on Image Processing (ICIP), 2010)

**Fast Bilateral Filtering By Adapting Block Size**

Bibtex

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2010)

**Hardware Implementation of the Discrete Fourier Transform with Non-Power-of-Two Problem Size**

Preprint (482 KB)

Bibtex

In this paper, we examine several algorithms suitable for the hardware implementation of the discrete Fourier transform (DFT) with non-power-of two problem size. We incorporate these algorithms into Spiral, a tool capable of automatically generating corresponding hardware implementations. We discuss how each algorithm can be used to generate different types of hardware structures, and we demonstrate that our tool is able to produce hardware implementations of non-power- of-two sized DFTs over a wide range of cost/performance tradeoff points.

W. Yu, Tsuhan Chen, Franz Franchetti and James C. Hoe (IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), Vol. 20, No. 11, pp. 1509-1519, 2010)

**High Performance Stereo Vision Designed for Massively Data Parallel Platforms**

Bibtex

Frédéric de Mesmay, Yevgen Voronenko and Markus Püschel (Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 1-10, 2010)

**Offline Library Adaptation Using Automatically Generated Heuristics**

Preprint (626 KB)

Published paper (link to publisher)

Bibtex

Automatic tuning has emerged as a solution to provide high-performance libraries for fast changing, increasingly complex computer architectures. We distinguish offline adaptation (e.g., in ATLAS) that is performed during installation without the full problem description from online adaptation (e.g., in FFTW) that is performed at run-time. Offline adaptive libraries are simpler to use, but, unfortunately, writing the adaptation heuristics that power them is a daunting task. The overhead of online adaptive libraries, on the other hand, makes them unsuitable for a number of applications. In this paper, we propose to automatically generate heuristics in the form of decision trees using a statistical classifier, effectively converting an online adaptive library into an offline one. As testbed we use Spiral-generated adaptive transform libraries for current multicores with vector extensions. We show that replacing the online search with generated decision trees maintains a performance competitive with vendor libraries while allowing for a simpler interface and reduced computation overhead.

Q. Li, Tao Cui, Rohit Negi, Franz Franchetti and Marija Ilic (, 2010)

**On-line Decentralized Charging of Plug-In Electric Vehicles in Power Systems**

Bibtex

Frédéric de Mesmay (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2010)

**On the Computer Generation of Adaptive Numerical Libraries**

Preprint (3.2 MB)

Bibtex

Very fast runtime is crucial in many applications in scientific computing, multimedia processing, communication, and control. Most of these applications spend the bulk of the computation in well known mathematical functions which are provided by highly optimized libraries. The development and maintenance of these libraries has become extraordinarily difficult. Optimal performance requires multiple-core balancing, careful use of vector instruction sets, and locality optimization. These optimizations require highly-skilled programmers and are often platform-specific, which means maintenance is a considerable effort given the short processor release cycles.

The Spiral system has successfully addressed these issues by automatically generating high performance libraries given only a high-level mathematical algorithm description in a language called SPL. Spiral produced high performance code using a number of techniques including SPL rewriting systems and a form iterative compilation. However, to date Spiral has been limited in two key aspects. First, Spiral could only generate libraries for the domain of linear transforms; second, all optimizations for a specific target platform are performed during the source code generation, that is, the produced libraries themselves had no dynamic platform-adaptation mechanism.

In this thesis we make progress on both fronts. We present a framework and its implementation for the computer generation of functionalities that are not transforms, specifically matrix multiplication and convolutional decoding. The framework builds on the operator language (OL) that we introduce and that extends SPL. Similar to prior work on transforms, we then develop OL rewriting system to explore algorithm choice, to vectorize and parallelize, and to derive the basic library structure called recursion step closure. The actual code is obtained through a backend that supports different target languages. The generated libraries exhibit a performance comparable to libraries that are hand-written for commodity workstations.

Further, we enable the generation of platform-adaptive libraries, through adaptation modules that can be inserted into our libraries, which are generated to support different ways to compute the same function. We distinguish between online adaptation and offline adaptation and provide mechanism for both. Online adaptation happens during the actual user function call when the input size is provided. Given this size, the library searches for the best computation strategy inside the library, which can then be used for subsequent computations of this size. We provide the dynamic programming strategy used in prior work and introduce a novel kind of Monte-Carlo search on graphs. Finally, we present a machine learning approach that performs offline (during installation time) adaptation with an online adaptive library. First a search is run to produce solutions for a set of sizes. Based on the result, a learning algorithm derives solutions for all sizes in the form of a set of decision trees that are then inserted into the library to render it deterministic. Experiments show the viability of both approaches.

Yannis Benlachtar, Rachid Bouziane, Robert I. Killey, Christian Berger, Peter A. Milder, Robert Koutsoyannis, James C. Hoe, Markus Püschel and Madeleine Glick (Proc. International Conference on Transparent Optical Networks (ICTON), pp. 1-4, 2010)

**Optical OFDM for the Data Center**

Preprint (242 KB)

Published paper (link to publisher)

Bibtex

We investigated the use of orthogonal frequency division multiplexing (OFDM) to increase the capacity of multimode fiber (MMF)-based optical interconnects for data center applications. This approach provides a solution to modulation bandwidth limitations of the lasers, and to the intermodal dispersion of the MMF which leads to frequency-dependent attenuation. Recent studies on adaptively modulated OFDM are reviewed, and new simulation results assessing the capacity of such links for lengths of up to 300 m are presented, assuming the use of 50/125 μm graded-index MMF at a wavelength of 850 nm. The use of coded OFDM as an approach to deal with intermodal dispersion is also discussed.

Yannis Benlachtar, Philip M. Watts, Rachid Bouziane, Peter A. Milder, Robert Koutsoyannis, James C. Hoe, Markus Püschel, Madeleine Glick and Robert I. Killey (IEEE Journal of Selected Topics in Quantum Electronics, Vol. 16, No. 5, pp. 1235-1244 , 2010)

**Real-Time Digital Signal Processing for the Generation of Optical Orthogonal Frequency Division Multiplexed Signals**

Preprint (836 KB)

Published paper (link to publisher)

Bibtex

We investigate the design of a field programmable gate array (FPGA) based optical orthogonal frequency division multiplexing (OFDM) transmitter implementing real time digital signal processing at 21.4 GS/s. The transmitter was utilized to generate 8.34 Gbit/s QPSK-OFDM signals for direct detection. We study the impact of the finite resolutions of the inverse fast Fourier transform (IFFT) cores and the digital-to-analog converters (DAC) on the system performance. Furthermore, we demonstrate a transmission experiment over 800 and 1600 km of uncompensated standard fiber with negligible OSNR penalties and BER < 10^-3.

Lingchuan Meng, Jeremy Johnson, Franz Franchetti, Yevgen Voronenko, Marc Moreno Maza and Yuzhen Xie (Proc. Parallel Symbolic Computation (PASCO), pp. 169-170, 2010)

**Spiral-Generated Modular FFT Algorithms**

Bibtex

Yannis Benlachtar, Philip M. Watts, Rachid Bouziane, Peter A. Milder, Robert Koutsoyannis, James C. Hoe, Markus Püschel, Madeleine Glick and Robert I. Killey (Proc. European Conference on Optical Communication (ECOC), pp. 1-2, 2009)

**21.4 GS/s Real-Time DSP-Based Optical OFDM Signal Generation and Transmission Over 1600 km of Uncompensated Fibre**

Preprint (279 KB)

Published paper (link to publisher)

Bibtex

We report a real-time optical OFDM transmitter with the highest sampling rate to date. Generation and transmission of an 8.36 Gb/s digitally up-converted single sideband OFDM signal over 1600km of uncompensated fiber with a BER < 10-3 was achieved.

Yevgen Voronenko and Markus Püschel (IEEE Transactions on Signal Processing, Vol. 57, No. 1, pp. 205-222, 2009)

**Algebraic Signal Processing Theory: Cooley-Tukey Type Algorithms for Real DFTs**

Preprint (324 KB)

Published paper (link to publisher)

Bibtex

In this paper we systematically derive a large class of fast general-radix algorithms for various types of real discrete Fourier transforms (real DFTs) including the discrete Hartley transform (DHT) based on the algebraic signal processing theory. This means that instead of manipulating the transform definition, we derive algorithms by manipulating the polynomial algebras underlying the transforms using one general method. The same method yields the well-known Cooley-Tukey fast Fourier transform (FFT) as well as general radix discrete cosine and sine transform algorithms. The algebraic approach makes the derivation concise, unifies and classifies many existing algorithms, yields new variants, enables structural optimization, and naturally produces a human-readable structural algorithm representation based on the Kronecker product formalism. We show, for the first time, that the general-radix Cooley-Tukey and the lesser known Bruun algorithms are instances of the same generic algorithm. Further, we show that this generic algorithm can be in stantiated for all four types of the real DFT and the DHT.

Peter A. Milder, James C. Hoe and Markus Püschel (Proc. Design, Automation and Test in Europe (DATE), pp. 1118-1123, 2009)

**Automatic Generation of Streaming Datapaths for Arbitrary Fixed Permutations**

Preprint (222 KB)

Bibtex

This paper presents a technique to perform arbitrary fixed permutations on streaming data. We describe a parameterized architecture that takes as input n data points streamed at a rate of w per cycle, performs a permutation over all n points, and outputs the result in the same streaming format. We describe the system and its requirements mathematically and use this mathematical description to show that the datapaths resulting from our technique can sustain a full throughput of w words per cycle without stalling. Additionally, we provide an algorithm to configure the datapath for a given permutation and streaming width. Using this technique, we have constructed a full synthesis system that takes as input a permutation and a streaming width and outputs a register-transfer level Verilog description of the datapath. We present an evaluation of our generated designs over varying problem sizes and streaming widths, synthesized for a Xilinx Virtex-5 FPGA.

Daniel McFarlin, Franz Franchetti and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2009)

**Automatic Generation of Vectorized Fast Fourier Transform Libraries for the Larrabee and AVX Instruction Set Extension**

Preprint (205 KB)

Bibtex

Basilio B. Fraguela, Yevgen Voronenko and Markus Püschel (Proc. Parallel Architectures and Compilation Techniques (PACT), pp. 271-280, 2009)

**Automatic Tuning of Discrete Fourier Transforms Driven by Analytical Modeling**

Preprint (2.7 MB)

Bibtex

Analytical models have been used to estimate optimal values for parameters such as tile sizes in the context of loop nests. However, important algorithms such as fast Fourier transforms (FFTs) present a far more complex search space consisting of many thousands of different implementations with very different complex access patterns and nesting and code structures. As a results, some of the best available FFT implementations use heuristic search based on runtime measurements. In this paper we present the first analytical model that can successfully replace the measurement in this search on modern platforms. The model includes many details of the platform's memory system including the TLBs, and, for the first time, physically addressed caches and hardware prefetching. The effect, as we show, is a dramatically reduced search time to find the best FFT without significant loss in performance. Even though our model is adapted to the FFT in this paper, its underlying structure should be applicable for a much larger set of code structures and hence is a candidate for iterative compilation.

Frédéric de Mesmay, Arpad Rimmel, Yevgen Voronenko and Markus Püschel (Proc. International Conference on Machine Learning (ICML), pp. 729-736, 2009)

**Bandit-Based Optimization on Graphs with Application to Library Performance Tuning**

Preprint (309 KB)

Published paper (link to publisher)

Bibtex

The problem of choosing fast implementations for a class of recursive algorithms such as the fast Fourier transforms can be formulated as an optimization problem over the language generated by a suitably defined grammar. We propose a novel algorithm that solves this problem by reducing it to maximizing an objective function over the sinks of a directed acyclic graph. This algorithm valuates nodes using Monte-Carlo and grows a subgraph in the most promising directions by considering local maximum k-armed bandits. When used inside an adaptive linear transform library, it cuts down the search time by an order of magnitude to find a similar solution. Further, the search can be stopped at any time and still return a solution. In some cases, the performance of the implementations found is also increased by up to 10% which is of considerable practical importance since it consequently improves the performance of all applications using the performance library.

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. International Conference on Supercomputing (ICS), pp. 26-35, 2009)

**Computer Generation of Fast Fourier Transforms for the Cell Broadband Engine**

Preprint (238 KB)

Published paper (link to publisher)

Bibtex

The Cell BE is a multicore processor with eight vector accelerators (called SPEs) that implement explicit cache management through direct memory access engines. While the Cell has an impressive floating point peak performance, programming and optimizing for it is difficult as it requires explicit memory management, multithreading, streaming, and vectorization. We address this problem for the discrete Fourier transform (DFT) by extending Spiral, a program generation system, to automatically generate highly optimized implementations for the Cell. The extensions include multi-SPE parallelization and explicit memory streaming, both performed at a high abstraction level using rewriting systems operating on Spiral's internal domain-specific language. Further, we support latency and throughput optimizations, single and double precision, and different data formats. The performance of Spiral's computer generated code is comparable with and sometimes better than existing DFT implementations, where available.

Yevgen Voronenko, Frédéric de Mesmay and Markus Püschel (Proc. International Symposium on Code Generation and Optimization (CGO), pp. 102-113, 2009)

**Computer Generation of General Size Linear Transform Libraries**

Preprint (563 KB)

Published paper (link to publisher)

Bibtex

The development of high-performance libraries has become extraordinarily difficult due to multiple processor cores, vector instruction sets, and deep memory hierarchies. Often, the library has to be reimplemented and reoptimized, when a new platform is released. In this paper we show how to automatically generate general input-size libraries for the domain of linear transforms. The input to our generator is a formal specification of the transform and the recursive algorithms the library should use; the output is a library that supports general input size, is vectorized and multithreaded, provides an adaptation mechanism for the memory hierarchy, and has excellent performance, comparable to or better than the best human-written libraries. Further, we show that our library generator enables various customizations; one example is the generation of Java libraries.

Franz Franchetti, Markus Püschel, Yevgen Voronenko, Srinivas Chellappa and José M. F. Moura (IEEE Signal Processing Magazine, special issue on ``Signal Processing on Platforms with Multiple Cores'', Vol. 26, No. 6, pp. 90-102, 2009)

**Discrete Fourier Transform on Multicores: Algorithms and Automatic Implementation**

Preprint (453 KB)

Published paper (link to publisher)

Bibtex

This paper gives an overview on the techniques needed to implement the discrete Fourier transform (DFT) efficiently on current multicore systems. The focus is on Intel compatible multicores but we also discuss the IBM Cell, and briefly, graphics processing units (GPUs). The performance optimization is broken down into three key challenges: parallelization, vectorization, and memory hierarchy optimization. In each case, we use the Kronecker product formalism to formally derive the necessary algorithmic transformations based on a few hardware parameters. Further code level optimizations are discussed. The rigorous nature of this framework enables the complete automation of the implementation task as shown by the program generator Spiral. Finally, we show and analyze DFT benchmarks of the fastest libraries available for the considered platforms.

Franz Franchetti and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 549-552, 2009)

**Generating High-Performance Pruned FFT Implementations**

Preprint (273 KB)

Published paper (link to publisher)

Bibtex

We derive a recursive general-radix pruned Cooley-Tukey fast Fourier transform (FFT) algorithm in Kronecker product notation. The algorithm is compatible with vectorization and parallelization required on state-of-the-art multicore CPUs. We include the pruned FFT algorithm into the program generation system Spiral, and automatically generate optimized implementations of the pruned FFT for the Intel Core2Duo multicore processor. Experimental results show that using the pruned FFT can indeed speed up the fastest available FFT implementations by up to 30% when the problem size and the pattern of unused inputs and outputs are known in advance.

Yannis Benlachtar, Philip M. Watts, Rachid Bouziane, Peter A. Milder, Deepak Rangaraj, Anthony Cartolano, Robert Koutsoyannis, James C. Hoe, Markus Püschel, Madeleine Glick and Robert I. Killey (Optics Express, Vol. 17, No. 20, pp. 17658-17668, 2009)

**Generation of Optical OFDM Signals Using 21.4 GS/s Real Time Digital Signal Processing**

Published paper (link to publisher)

Bibtex

We demonstrate a field programmable gate array (FPGA) based optical orthogonal frequency division multiplexing (OFDM) transmitter implementing real time digital signal processing at a sample rate of 21.4 GS/s. The QPSK-OFDM signal is generated using an 8 bit, 128 point inverse fast Fourier transform (IFFT) core, performing one transform per clock cycle at a clock speed of 167.2 MHz and can be deployed with either a direct-detection or a coherent receiver. The hardware design and the main digital signal processing functions are described, and we show that the main performance limitation is due to the low (4-bit) resolution of the digital-to-analog converter (DAC) and the 8-bit resolution of the IFFT core used. We analyze the back-to-back performance of the transmitter generating an 8.36 Gb/s optical single sideband (SSB) OFDM signal using digital up-conversion, suitable for direct-detection. Additionally, we use the device to transmit 8.36 Gb/s SSB OFDM signals over 200 km of uncompensated standard single mode fiber achieving an overall BER<10^−3.

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2009)

**High Performance Linear Transform Program Generation for the Cell BE**

Bibtex

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2009)

**High Performance Linear Transform Program Generation for the Cell BE**

Bibtex

Daniel McFarlin, Franz Franchetti, José M. F. Moura and Markus Püschel (Proc. SPIE Conference on Defense, Security, and Sensing, Proceedings of SPIE, Vol. 7337, pp. 733708, 2009)

**High Performance Synthetic Aperture Radar Image Formation On Commodity Architectures**

Published paper (link to publisher)

Bibtex

Synthetic Aperture Radar (SAR) image processing platforms are increasingly confronted with vast datasets and hard real-time deadlines. Platform developers are also finding themselves constrained by battlefield exigencies, time-to-market pressures and rapidly varying requirements. In response, developers are coupling high performance, general-purpose Commercial-Off-The-Shelf (COTS) architectures with software implementations of SAR algorithms. While this approach provides great flexibility, achieving the requisite performance on COTS architectures such as IBM's Cell BE and Intel's Core2 Quad is a potentially time-consuming and error-prone process. This is principally due to the highly parallel nature of modern COTS architectures. Developers must now grapple with architectures that exhibit parallelism at nearly every level of granularity; extracting high performance requires a complex interweaving of multiple forms of parallelism (e.g., SIMD vector extensions, multicore). In this paper, we first present an overview of SPIRAL, a program generator that reduces the burden of developing high performance code by automatically exploring many combinations of parallelism. We then extend SPIRAL's domain-specific language to expose high-level SAR constructs that a developer can combine into a parameterized scenario. We subsequently employ these extensions to generate code for the computationally dominant image formation component of Polar Format SAR processing. By fully leveraging automatic program generation, SPIRAL produces code for Intel's Core2 Quad that surpasses competing hand-tuned implementations on the Cell Blade, an architecture with twice as many cores and three times the memory bandwidth. We show an average performance of 34 Gigaflops/sec for 16-Megapixel and 100-Megapixel SAR images with runtimes of 0.6 and 4.45 seconds respectively.

Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin and Markus Püschel (Proc. IFIP Working Conference on Domain Specific Languages (DSL WC), Lecture Notes in Computer Science, Springer, Vol. 5658, pp. 385-410, 2009)

**Operator Language: A Program Generation Framework for Fast Kernels**

Preprint (511 KB)

Bibtex

We present the Operator Language (OL), a framework to automatically generate fast numerical kernels. OL provides the structure to extend the program generation system Spiral beyond the transform domain. Using OL, we show how to automatically generate library functionality for the fast Fourier transform and multiple non-transform kernels, including matrix-matrix multiplication, synthetic aperture radar (SAR), circular convolution, sorting networks, and Viterbi decoding. The control flow of the kernels is data-independent, which allows us to cast their algorithms as operator expressions. Using rewriting systems, a structural architecture model and empirical search, we automatically generate very fast C implementations for state-of-the-art multicore CPUs that rival hand-tuned implementations.

Markus Püschel, Peter A. Milder and James C. Hoe (Journal of the ACM, Vol. 56, No. 2, pp. 10:1-10:34, 2009)

**Permuting Streaming Data Using RAMs**

Published paper (link to publisher)

Bibtex

This paper presents a method for constructing hardware structures to perform permutations on streaming data. The method applies to permutations that can be represented as linear mappings on the bit-level representation of the data locations, a subclass that includes many important permutations such as stride permutations (corner turn, perfect shuffle, etc.), the bit reversal, and the Hadamard reordering. This problem has only been considered in the literature for the special case of stride permutations. We propose a family of parameterized datapaths that consist of several independent banks of memory and two interconnection networks. These structures are built for a given streaming width, i.e., number of inputs and outputs per cycle. The structures are able to operate at full throughput for a given streaming width. We provide an algorithm that completely specifies the datapath and control logic given a permutation and a number of input/output ports. Further, we provide lower bounds on the achievable cost of the solution. For an important subclass of considered permutations, the resulting design is proven optimal in both the cost of control and the number of switching elements required by the interconnection networks. We apply our algorithm to derive complete solutions for several important permutations, including a detailed example that carefully illustrates each aspect of the design process. Lastly, we compare our permutation structures to those of [Jarvinen et al. 2004], which are specialized for stride permutations.

Markus Püschel and José M. F. Moura (IEEE Transactions on Signal Processing, Vol. 56, No. 4, pp. 1502-1521, 2008)

**Algebraic Signal Processing Theory: Cooley-Tukey Type Algorithms for DCTs and DSTs**

Preprint (347 KB)

Published paper (link to publisher)

Bibtex

This paper presents a systematic methodology to derive and classify fast algorithms for linear transforms. The approach is based on the algebraic signal processing theory. This means that the algorithms are not derived by manipulating the entries of transform matrices, but by a stepwise decomposition of the associated signal models, or polynomial algebras. This decomposition is based on two generic methods or algebraic principles that generalize the well-known Cooley-Tukey FFT and make the algorithms' derivations concise and transparent. Application to the 16 discrete cosine and sine transforms yields a large class of fast general radix algorithms, many of which have not been found before.

Frédéric de Mesmay, Franz Franchetti, Yevgen Voronenko and Markus Püschel (Parallel Matrix Algorithms and Applications (PMAA), 2008, Presentation (Abstract reviewed))

**Automatic Generation of Adaptive Libraries for Matrix-Multiplication**

Bibtex

Developping high-performance matrix-multiplication libraries for current off-the-shelf microprocessors is a challenging task: it usually takes the combination of a peak-performance hand-scheduled vectorized kernel and multiple cache-aware multi-threaded blocking strategies to achieve maximum speed in all cases. Both these tasks depends on the underlying hardware and therefore need be rewritten or readapted for each and every platform. To alleviate this repetive effort, our work presents how to fully automate the development of such general matrix-multiplication libraries on various computing platforms. At the core of our code generator is a declarative domain-specific language that captures access patterns of blocked matrices. This language is manipulated using a symbolic rewritting system that is able to produce different alternatives. The rules in this language capture all the knowledge about the matrix-multiplication operation. Using this language, we generate all promising algorithms for computing small matrices products, produce the corresponding unrolled code and select the best kernels at compile time to include in our library. Then, the very same language is used to find and implement the set of mutually recursive functions that could be choosed from when deciding the blocking strategy. As this choice ultimately depends on the shape and size of the matrices and should adapt to the platform's caches, the final strategy is only decided at runtime, using dynamic programming. The final implementation of the library can be returned in C++ or Java, takes advantage of vector instructions, multi-processors and multi-cores where available, adapts each problem to the memory hierarchy, compares to vendor's performance and, amazingly, does not contain a single line of code written by a human.

Frédéric de Mesmay, Franz Franchetti, Yevgen Voronenko and Markus Püschel (Proc. International Workshop on Parallel Matrix Algorithms and Applications (PMAA), 2008)

**Automatic Generation of Multithreaded Vectorized Adaptive Libraries for Matrix Multiplication**

Bibtex

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. Supercomputing (SC), 2008, Poster (Abstract reviewed))

**Automatic Linear Transform Program Generation for the Cell BE**

Bibtex

The complexity of the Cell BE's architecture makes it difficult and time consuming to develop multithreaded, vectorized, high-performance numerical libraries for it. Our approach to solving this problem is to use Spiral, a program generation system, to automatically generate and optimize linear transform libraries for the Cell. We first extend the Spiral framework for the Cell architecture to automatically generate high-performance discrete Fourier, Cosine, and Sine transform kernels that run on a single SPE. The performance of our kernels is comparable to hand tuned code where available, and reaches 14--20 Gflop/s on a single SPE for input vectors resident in the local memories. Next, we produce optimized multithreaded code for the discrete Fourier transform that runs on multiple SPEs and obtains upto a 2x speed-up on 4 SPEs. Our poster will provide details of the formal methods used by our approach, and will present our latest performance results.

Franz Franchetti, Yevgen Voronenko, Peter A. Milder, Srinivas Chellappa, Marek Telgarsky, Hao Shen, Paolo D'Alberto, Frédéric de Mesmay, James C. Hoe, José M. F. Moura and Markus Püschel (Proc. NSF Next Generation Software Program Workshop (NSFNGS) colocated with IPDPS, 2008)

**Domain-Specific Library Generation for Parallel Software and Hardware Platforms**

Preprint (180 KB)

Published paper (link to publisher)

Bibtex

We overview a library generation framework called Spiral. For the domain of linear transforms, Spiral automatically generates implementations for parallel platforms including SIMD vector extensions, multicore processors, field-programmable gate arrays (FPGAs) and FPGA accelerated processors. The performance of the generated code is competitive with the best available hand-written libraries.

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA), 2008)

**FFT Program Generation for the Cell BE**

Preprint (55 KB)

Bibtex

The Cell BE is designed for accelerating heavy numerical computation. The Cell features an innovative design that includes multiple vectorized cores with explicitly managed per-core local memory and intercore communication. These features allow for a theoretical peak performance of 204.8 Gflop/s (single precision, using just the SPEs) without breaking the memory, power, or frequency walls. However, the very same features that allow for high theoretical performance make it difficult and time consuming to develop multithreaded, vectorized, high-performance numerical libraries for the Cell BE. Our approach to solving this problem is to use the Spiral framework to automatically generate and optimize signal processing transform libraries for the Cell. Spiral is a program generation system for linear transforms such as the discrete Fourier transform, discrete cosine transforms, filters, and others. For a user-selected transform, Spiral autonomously generates different algorithms, represented in a declarative form as mathematical formulas, and their implementations to find the best match for the target platform. Besides using search techniques, Spiral performs deterministic optimizations at the formula level, effectively restructuring the code in ways impractical at the code level. To extend the Spiral framework to support the Cell architecture, we first focus on automatically generating high-performance vectorized DFT kernels that run on a single SPE. The performance of our single-threaded code is comparable to hand tuned code, and achieves 16-20 Gflop/s on a single SPE for input vectors resident in the local stores. Next, we produce optimized multithreaded code that runs on multiple SPEs. To maximize the achieved interconnect bandwidth, the algorithms generated use large DMA packets for all-to-all data exchanges between the SPEs.

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (Proc. Design Automation Conference (DAC), pp. 385-390, 2008)

**Formal Datapath Representation and Manipulation for Implementing DSP Transforms**

Preprint (285 KB)

Published paper (link to publisher)

Bibtex

We present a domain-specific approach to representing datapaths for hardware implementations of linear signal transform algorithms. We extend the tensor structure for describing linear transform algorithms, adding the ability to explicitly characterize two important dimensions of datapath architecture. This representation allows both algorithm and datapath to be specified within a single formula, and gives the designer the ability to easily consider a wide space of possible datapaths at a high level of abstraction. We have constructed a formula manipulation system based upon this representation, and have written a compiler that can translate a formula into a hardware implementation. This enables an automatic “push button” compilation ﬂow that produces a register transfer level hardware description from high-level datapath directives and an algorithm (written as a formula). In our experimental results, we demonstrate that this approach yields efficient designs over a large tradeoff space.

Yevgen Voronenko, Franz Franchetti, Frédéric de Mesmay and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2008)

**Generating High-Performance General Size Linear Transform Libraries Using Spiral**

Preprint (191 KB)

Published paper (link to publisher)

Bibtex

Franz Franchetti and Markus Püschel (Proc. International Conference on Compiler Construction (CC), Lecture Notes in Computer Science, Springer, Vol. 4959, pp. 116-131, 2008)

**Generating SIMD Vectorized Permutations**

Preprint (244 KB)

Published paper (link to publisher)

Bibtex

This paper introduces a method to generate efficient vectorized implementations of small stride permutations using only vector load and vector shuffle instructions. These permutations are crucial for high-performance numerical kernels including the fast Fourier transform. Our generator takes as input only the specification of the target platform's SIMD vector ISA and the desired permutation. The basic idea underlying our generator is to model vector instructions as matrices and sequences of vector instructions as matrix formulas using the Kronecker product formalism. We design a rewriting system and a search mechanism that applies matrix identities to generate those matrix formulas that have vector structure and minimize a cost measure that we define. The formula is then translated into the actual vector program for the specified permutation. For three important classes of permutations, we show that our method yields a solution with the minimal number of vector shuffles. Inserting into a fast Fourier transform yields a significant speedup.

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. Summer School on Generative and Transformational Techniques in Software Engineering (GTTSE), Lecture Notes in Computer Science, Springer, Vol. 5235, pp. 196-259, 2008)

**How To Write Fast Numerical Code: A Small Introduction**

Preprint (556 KB)

Published paper (link to publisher)

Bibtex

The complexity of modern computing platforms has made it increasingly difficult to write numerical code that achieves the best possible performance. Straightforward implementations based on algorithms that minimize the operations count often fall short in performance by an order of magnitude. This tutorial introduces the reader to a set of general techniques to improve the performance of numerical code, focusing on optimizations for the computer's memory hierarchy. Two running examples are used to demonstrate these techniques: matrix-matrix multiplication and the discrete Fourier transform.

Yevgen Voronenko (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2008)

**Library Generation for Linear Transforms**

Preprint (2.2 MB)

Bibtex

The development of high-performance numeric libraries has become extraordinarily difficult due to multiple processor cores, vector instruction sets, and deep memory hierarchies. To make things worse, often each library has to be re-implemented and re-optimized, whenever a new platform is released. In this thesis we develop a library generator that completely automates the library development for one important numerical domain: linear transforms, which include the discrete Fourier transform, discrete cosine transforms, filters, and discrete wavelet transforms. The input to our generator is a specification of the transform and a set of recursive algorithms for the transform, represented in a high-level domain-specific language; the output is a C++ library that supports general input size, is vectorized and multithreaded, and provides an optional adaptation mechanism for the memory hierarchy. Further, as we show in extensive benchmarks, the runtime performance of our automatically generated libraries is comparable to and often even higher than the best existing human-written code, including the widely used library FFTW and the commercially developed and maintained Intel Integrated Performance Primitives (IPP) and AMD Performance Library (APL). Our generator automates all library development steps typically performed manually by programmers, such as analyzing the algorithm and finding the set of required recursive functions and base cases, the appropriate restructuring of the algorithm to parallelize for multiple threads and to vectorize for the available vector instruction set and vector length, and performing code level optimizations such as algebraic simplification and others. The key to achieving full automation as well as excellent performance is a proper set of abstraction layers in the form of domain-specific languages called SPL (Signal Processing Language), index-free \sspl, regular \sspl, and intermediate code representation, and the use of rewriting systems to perform all difficult optimizations at a suitable, high level of abstraction. In addition, we demonstrate that our automatic library generation framework enables various forms of customization that would be very costly to perform manually. As examples, we show generated trade-offs between code size and performance, generated Java libraries obtained by modifying the backend, and functional customization for important transform variants.

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (Proc. Workshop on High-Level Synthesis colocated with DAC, 2008)

**Linear Transforms: From Math to Efficient Hardware**

Preprint (70 KB)

Bibtex

Franz Franchetti, Daniel McFarlin, Frédéric de Mesmay, Hao Shen, Tomasz Wiktor Włodarczyk, Srinivas Chellappa, Marek Telgarsky, Peter A. Milder, Yevgen Voronenko, Qian Yu, James C. Hoe, José M. F. Moura and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2008)

**Program Generation with Spiral: Beyond Transforms**

Published paper (link to publisher)

Bibtex

Jeremy Johnson, Tim Chagnon, Petya Vachranukunkiet, Prawat Nagvajara and Chika Nwankpa (Proc. International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA), 2008)

**Sparse LU Decomposition using FPGA**

Preprint (539 KB)

Published paper (link to publisher)

Bibtex

This paper reports on an FPGA implementation of sparse LU decomposition. The resulting special purpose hardware is geared towards power system problems - load flow computation - which are typically solved iteratively using Newton Raphson. The key step in this process, which takes approximately 85% of the computation time, is the solution of sparse linear systems arising from the Jacobian matrices that occur in each iteration of Newton Raphson. Current state-of-the-art software packages, such as UMFPACK and SuperLU, running on general purpose processors perform suboptimally on these problems due to poor utilization of the floating point hardware (typically 1 to 4% efficiency). Our LU hardware, using a special purpose data path and cache, designed to keep the floating point hardware busy, achieves an efficiency of 60% and higher. This improved efficiency provides an order of magnitude speedup when compared to a software solution using UMFPACK running on general purpose processors.

Jeremy Johnson and Michael Andrews (Proc. International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA), 2008)

**Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh-Hadamard Transform**

Preprint (1.1 MB)

Published paper (link to publisher)

Bibtex

Short vector instructions (SIMD) can significantly increase performance, yet are difficult to use effectively. Recently, several efforts (ATLAS, FFTW, SPIRAL) have successfully used automated performance tuning and search to find good SIMD implementations of high-performance kernels such as matrix multiplication, the FFT and related transforms. In this paper, we review techniques, similar to those used by SPIRAL, for vectorizing sparse matrix factorizations, and incorporate them into a package for computing the Walsh–Hadamard Transform (WHT). These techniques along with search are used to discover algorithms with close to theoretical speedup. Not all algorithms provide the same amount of speedup and it is not the case that the vectorized version of the best sequential algorithm leads to the best algorithm with SIMD instructions. The goal of this paper is to better understand the search space and which algorithms lead to the best speedup and overall performance. Several SIMD specific algorithm features are used to predict speedup and performance. The resulting performance model is highly correlated with runtime performance and measured speedup, and can be used to partition the search space and expedite the search for good vector performance.

Yevgen Voronenko, Franz Franchetti, Frédéric de Mesmay and Markus Püschel (Proc. Algebraic Methodology and Software Technology (AMAST), 2008)

**System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries**

Preprint (187 KB)

Bibtex

We demonstrate Spiral, a domain-specific library generation system. Spiral generates high performance source code for linear transforms (such as the discrete Fourier transform and many others) directly from a problem specification. The key idea underlying Spiral is to perform automatic reasoning and optimizations at a high abstraction level using the mathematical, declarative domain-specific languages SPL and $\Sigma$-SPL and a rigorous rewriting framework. Optimization includes various forms of parallelization. Even though Spiral provides complete automation, its generated libraries run often faster than any existing hand-written code.

Sung-Chul Han (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2007)

**A Flexible Decoder and Performance Evaluation of Array-Structured LDPC Codes**

Preprint (1.4 MB)

Bibtex

The low density parity check (LDPC) codes designed by a pseudorandom construction, as proposed in Gallager’s original work, have been shown to perform very close to the Shannon limit (when constructed as very long codes); however, the lack of structure in such codes makes them unsuitable for practical applications due to high encoding complexity and costly decoder implementations. These difficulties have lead to numerous works on the structured LDPC codes, especially array-structured codes with quasi-cyclic property. Among the array-structured codes, those with an array of cyclic permutation matrices have been of particular interest due to the balanced edge partitioning inherent in the structure that simplifies the implementation of highly parallel decoders. While many construction methods have been proposed for this circulant permutation array (CPA) structure, the performance of the codes has been reported to a very limited extent. Especially, the effect on the performance by the explicit control of graph parameters has not been provided despite the fact their importance is emphasized in the construction process. In the decoder design for quasi-cyclic LDPC codes, the primary concern is to exploit the array structure for efficient implementation. Fast hardware-based decoders on a medium-capacity FPGA are often faster than the software implementation by at least one or two orders of magnitude, and thus important for both actual deployment in practical systems and evaluation of error performance. As a large number of high-throughput decoders in the literature are designed for specific array dimensions and the bus and memory connections are simplified using the array structure of the code, the degree of parallelism in the decoders is dependent on the code parameters, making it difficult to parameterize the hardware to use a desired amount of hardware resource. Furthermore, such architectures cannot support a large class of array-structured codes with very different array dimensions. In this thesis, we present a generalized hardware decoder that supports any kind of quasi-cyclic LDPC codes including CPA-structured codes. The decoder has been designed with a priority on high flexibility. In the synthesis step, the degree of parallelism can be chosen independently from the code parameters. Thus, for FPGA implementation, the decoder can be parameterized to fully utilize a given amount of hardware resource. Also, it supports run-time reconfiguration of code parameters, i.e., different codes can be supported by changing register contents without a new synthesis. In wireless applications, such flexibility makes it possible to choose a channel code based on the varying channel condition. When used for performance evaluation purposes for a large set of codes, it saves a considerable amount of time by eliminating the need for re-synthesis for each code. Using the FPGA implementation of the proposed decoder, we characterize the performance of array-structured LDPC codes, with a primary focus on pseudorandomly constructed CPA-structured codes. Based on the obtained simulation results, we show the effect of combinatorial parameters (girth, diameter and column weight) on the error performance. The pseudorandom construction is also compared with algebraic construction, and with the codes specified in the IEEE 802.16e standards.

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (CSSI Technical Report #CSSI-07-01, Carnegie Mellon University, 2007)

**Discrete Fourier Transform Compiler: From Mathematical Representation to Efficient Hardware**

Preprint (259 KB)

Bibtex

A wide range of hardware implementations are possible for the discrete Fourier transform (DFT), offering different tradeoffs in throughput, latency and cost. The well-understood structure of DFT algorithms makes possible a fully automatic synthesis framework that can span the viable interesting design choices. In this paper, we present such a synthesis framework that starts from formal mathematical formulas of a general class of fast DFT algorithms and produces performance and cost efficient sequential hardware implementations, making design decisions and tradeoffs according to user specified high-level preferences. We present evaluations to demonstrate the variety of supported implementations and the cost/performance tradeoffs they allow.

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (Proc. International Symposium on Field-Programmable Gate Arrays (FPGA), 2007)

**Fast Fourier Transform on FPGA: Design Choices and Evaluation**

Bibtex

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (Proc. IEEE International High Level Design Validation and Test Workshop (HLDVT), 2007)

**FFT Compiler: From Math to Efficient Hardware**

Preprint (158 KB)

Bibtex

This paper presents a high-level compiler that generates hardware implementations of the discrete Fourier transform (DFT) from mathematical specifications. The matrix formula input language captures not only the DFT calculation but also the implementation options at the algorithmic and architectural levels. By selecting the appropriate formula, the resulting hardware implementations (described in a synthesizable Verilog description) can achieve a wide range of tradeoffs between implementation cost and performance. The compiler is also parameterized for a set of technology-specific optimizations, to allow it to target specific implementation platforms. This paper gives a brief overview of the system and presents synthesis results.

Paolo D'Alberto, Peter A. Milder, Aliaksei Sandryhaila, Franz Franchetti, James C. Hoe, José M. F. Moura, Markus Püschel and Jeremy Johnson (Proc. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 173-184, 2007)

**Generating FPGA Accelerated DFT Libraries**

Preprint (347 KB)

Published paper (link to publisher)

Bibtex

We present a domain-specific approach to generate high performance hardware-software partitioned implementations of the discrete Fourier transform (DFT) in fixed point precision. The partitioning strategy is a heuristic based on the DFT’s divide-and-conquer algorithmic structure and fine tuned by the feedback-driven exploration of candidate designs. We have integrated this approach in the Spiral linear-transform code-generation framework to support push-button automatic implementation. We present evaluations of hardware–software DFT implementations running on the embedded PowerPC processor and the reconfigurable fabric of the Xilinx Virtex-II Pro FPGA. In our experiments, the 1D and 2D DFT’s FPGA accelerated libraries exhibit between 2 and 7.5 times higher performance (operations per second) and up to 2.5 times better energy efficiency (operations per Joule) than the software-only version.

Jeremy Johnson and Xu Xu (Proc. ACM International Symposium on Symbolic and Algebraic Computation (ISSAC), ACM, pp. 195-202, 2007)

**Generating Symmetric DFTs and Equivariant FFT Algorithms**

Published paper (link to publisher)

Bibtex

This paper presents a code generator which produces efficient implementations of multi-dimensional fast Fourier transform (FFT) algorithms which utilize symmetries in the input data to reduce memory usage and the number of arithmetic operations. The FFT algorithms are constructed using a group theoretic version of the divide and conquer step in the FFT that is compatible with the group of symmetries. The GAP compute algebra system is used to perform the necessary group computations and the generated algorithm is represented as a symbolic matrix factorization, which is translated into efficient code using the SPIRAL system. Performance data is given that shows that the resulting code is significantly faster than state-of-the-art FFT implementations that do not utilize the symmetries.

Yevgen Voronenko and Markus Püschel (IEEE Transactions on Signal Processing, Vol. 55, No. 9, pp. 4458-4473, 2007)

**Mechanical Derivation of Fused Multiply-Add Algorithms for Linear Transforms**

Preprint (751 KB)

Published paper (link to publisher)

Bibtex

Several computer architectures offer fused multiply-add (FMA), also called multiply-and-accumulate (MAC) instructions, that are as fast as a single addition or multiplication. For the efficient implementation of linear transforms, such as the discrete Fourier transform or discrete cosine transforms, this poses a challenge to algorithm developers as standard transform algorithms have to be manipulated into FMA algorithms that make optimal use of FMA instructions. We present a general method to convert any transform algorithm into an FMA algorithm. The method works with both algorithms given as directed acyclic graphs (DAGs) and algorithms given as structured matrix factorizations. We prove bounds on the efficiency of the method. In particular, we show that it removes all single multiplications except at most as many as the transform has outputs. We implemented the DAG-based version of the method and show that we can generate many of the best-known hand-derived FMA algorithms from the literature as well as a few novel FMA algorithms.

Yevgen Voronenko and Markus Püschel (ACM Transactions on Algorithms, Vol. 3, No. 2, 2007)

**Multiplierless Multiple Constant Multiplication**

Preprint (507 KB)

Published paper (link to publisher)

Bibtex

A variable can be multiplied by a given set of fixed-point constants using a multiplier block that consists exclusively of additions, subtractions, and shifts. The generation of a multiplier block from the set of constants is known as the multiple constant multiplication (MCM) problem. Finding the optimal solution, i.e., the one with the fewest number of additions and subtractions is known to be NP-complete. We propose a new algorithm for the MCM problem, which produces solutions that require up to 20% less additions and subtractions than the best previously known algorithm. At the same time our algorithm, in contrast to the closest competing algorithm, is not limited by the constant bitwidths. We present our algorithm using a unifying formal framework for the best, graph-based, MCM algorithms and provide a detailed runtime analysis and experimental evaluation. We show that our algorithm can handle problem sizes as large as 100 32-bit constants in a time acceptable for most applications. The implementation of the new algorithm is available at www.spiral.net.

Jeremy Johnson and Michael Andrews (Proc. Workshop on Performance Optimization for High-Level Languages and Libraries (POHLL), 2007 IEEE International Parallel and Distributed Processing Symposium , IEEE, pp. 450, 2007)

**Performance Analysis of a Family of WHT Algorithms**

Preprint (1.8 MB)

Published paper (link to publisher)

Bibtex

This paper explores the correlation of instruction counts and cache misses to runtime performance for a large family of divide and conquer algorithms to compute the Walsh–Hadamard transform (WHT). Previous work showed how to compute instruction counts and cache misses from a high–level description of the algorithm and proved theoretical results about their minimum, maximum, mean, and distribution. While the models themselves do not accurately predict performance, it is shown that they are statistically correlated to performance and thus can be used to prune the search space for fast implementations. When the size of the transform fits in cache the instruction count itself is used; however, when the transform no longer fits in cache, a linear combination of instruction counts and cache misses is used. Thus for small transforms it is safe to ignore algorithms which have a high instruction count and for large transforms it is safe to ignore algorithms with a high value in the combined instruction count/cache miss model. Since the models can be computed from a high–level description of the algorithms, they can be obtained without runtime measurement and the previous theoretical results on the models can be applied to limit empirical search.

Paolo D'Alberto, Markus Püschel and Franz Franchetti (Proc. International Conference on High Performance Embedded Architectures and Compilers (HiPEAC), Lecture Notes in Computer Science, Springer, Vol. 4367, pp. 201-214, 2007)

**Performance/Energy Optimization of DSP Transforms on the XScale Processor**

Preprint (646 KB)

Published paper (link to publisher)

Bibtex

The XScale processor family provides user-controllable independent configuration of CPU, bus, and memory frequencies. This feature introduces another handle for the code optimization with respect to energy consumption or runtime performance. We quantify the effect of frequency configurations on both performance and energy for three signal processing transforms: the discrete Fourier transform (DFT), finite impulse response (FIR) filters, and the Walsh-Hadamard Transform (WHT). To do this, we use SPIRAL, a program generation and optimization system for signal processing transforms. For a given transform to be implemented, SPIRAL searches over different algorithms to find the best match to the given platform with respect to the chosen performance metric (usually runtime). In this paper we use SPIRAL to generate implementations for different frequency configurations and optimize for runtime and physically measured energy consumption. In doing so we show that first, each transform achieves best performance/energy consumption for a different system configuration; second, the best code depends on the chosen configuration, problem size and algorithm; third, the fastest implementation is not always the most energy efficient; fourth, we introduce dynamic (i.e., during execution) reconfiguration in order to further improve performance/energy. Finally, we benchmark SPIRAL generated code against Intel’s vendor library routines. We show competitive results as well as 20% performance improvements or energy reduction for selected transforms and problem sizes.

Franz Franchetti and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. II-17-II-20, 2007)

**SIMD Vectorization of Non-Two-Power Sized FFTs**

Preprint (197 KB)

Published paper (link to publisher)

Bibtex

SIMD (single instruction multiple data) vector instructions, such as Intel’s SSE family, are available on most architectures, but are difficult to exploit for speed-up. In many cases, such as the fast Fourier transform (FFT), signal processing algorithms have to undergo major transformations to map efficiently. Using the Kronecker product formalism, we rigorously derive a novel variant of the general-radix Cooley-Tukey FFT that is structured to map efficiently for any vector length nu and radix. Then, we include the new FFT into the program generator Spiral to generate actual C implementations. Benchmarks on Intel’s SSE show that the new algorithms perform better on practically all sizes than the best available libraries Intel’s MKL and FFTW.

Peter Tummeltshammer, James C. Hoe and Markus Püschel (IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 26, No. 9, pp. 1551-1563, 2007)

**Time-Multiplexed Multiple Constant Multiplication**

Preprint (349 KB)

Published paper (link to publisher)

Bibtex

This paper studies area-efficient arithmetic circuits to multiply a fixed-point input value selectively by one of several preset fixed-point constants. We present an algorithm that generates a class of solutions to this time-multiplexed multiple-constant multiplication problem by “fusing” single-constant multiplication circuits for the required constants. Our evaluation compares our solution against a baseline implementation style that employs a full multiplier and a lookup table for the constants. The evaluation shows that we gain a significant area advantage, at the price of increased latency, for problem sizes (in terms of the number of constants) up to a threshold dependent on the bit-widths of the input and the constants. Our evaluation further shows that our solution is better suited for standard-cell application-specific integrated circuits than prior works on reconfigurable multiplier blocks.

Pranab Shenoy (Master thesis, Computer Science, Drexel University, 2007)

**Universal FFT Core Generator**

Published paper (link to publisher)

Bibtex

This thesis presents a special purpose processor for multi-dimensional discrete Fourier transform (DFT) computation. The processor, called a Universal fast Fourier transform (FFT) processor, is capable of computing an arbitrary multi-dimensional DFT with a fixed number of input points. The processor can be configured to compute different dimensional DFTs simply by rearranging the input data and using a different set of constants, called generalized twiddle factors, throughout the computation. The basic computational process remains the same independent of dimension. The processor does not rely on one-dimensional FFT computation as does the standard approach using the row-column algorithm, but instead utilizes an algorithm called the Dimensionless FFT, whose computation data flow is the same as the FFT algorithm due to Pease. Since the computation performed by the Universal FFT processor is the same as the Pease FFT, an implementation can be obtained by modifying a one-dimensional FFT processor based on the Pease algorithm. The implementation presented in this thesis is obtained in this way from the core generator by Nordin, Milder, Hoe, and P¨uschel. The core generator from Nordin, Milder, Hoe, and P¨uschel generates a variety of cores, with various space and time tradeoffs for implementation on field programmable gate arrays (FPGA). The performance obtained is comparable to vendor supplied cores but offers much greater flexibility of design choices with different area and throughput tradeoffs. The resulting Universal FFT core generator can be used to instantiate multi-dimensional FPGA cores with approximately the same performance and area obtained in the one-dimensional case by Nordin, Milder, Hoe, and P¨uschel.

Yevgen Voronenko and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 3, 2006)

**Algebraic Derivation of General Radix Cooley-Tukey Algorithms for the Real Discrete Fourier Transform**

Preprint (84 KB)

Published paper (link to publisher)

Bibtex

We first show that the real version of the discrete Fourier transform (called RDFT) can be characterized in the framework of polynomial algebras just as the DFT and the discrete cosine and sine transforms. Then, we use this connection to algebraically derive a general radix Cooley-Tukey type algorithm for the RDFT. The algorithm has a similar structure as its complex counterpart, but there are also important differences, which are exhibited by our Kronecker product style presentation. In particular, the RDFT is decomposed into smaller RDFTs but also other auxiliary transforms, which we then decompose by their own Cooley-Tukey type algorithms to obtain a full recursive algorithm for the RDFT.

Franz Franchetti, Yevgen Voronenko and Markus Püschel (Proc. High Performance Computing for Computational Science (VECPAR), Lecture Notes in Computer Science, Springer, Vol. 4395, pp. 363-377, 2006)

**A Rewriting System for the Vectorization of Signal Transforms**

Preprint (226 KB)

Published paper (link to publisher)

Bibtex

We present a rewriting system that automatically vectorizes signal transform algorithms at a high level of abstraction. The input to the system is a transform algorithm given as a formula in the well-known Kronecker product formalism. The output is a "vectorized" formula, which means it consists exclusively of constructs that can be directly mapped into short vector code. This approach obviates compiler vectorization, which is known to be limited in this domain. We included the formula vectorization into the Spiral program generator for signal transforms, which enables us to generate vectorized code and optimize through search over alternative algorithms. Benchmarks for the discrete Fourier transform (DFT) show that our generated floating-point code is competitive with and that our fixed-point code clearly outperforms the best available libraries.

Andreas Bonelli, Franz Franchetti, Juergen Lorenz, Markus Püschel and Christoph W. Ueberhuber (Proc. International Symposium on Parallel and Distributed Processing and Application (ISPA), Lecture Notes In Computer Science, Springer, Vol. 4330, pp. 818-832, 2006)

**Automatic Performance Optimization of the Discrete Fourier Transform on Distributed Memory Computers**

Preprint (146 KB)

Bibtex

This paper introduces a formal framework for automatically generating performance optimized implementations of the discrete Fourier transform (DFT) for distributed memory computers. The framework is implemented as part of the program generation and optimization system SPIRAL. DFT algorithms are represented as mathematical formulas in SPIRAL’s internal language SPL. Using a tagging mechanism and formula rewriting, we extend SPIRAL to automatically generate parallelized formulas. Using the same mechanism, we enable the generation of rescaling DFT algorithms, which redistribute the data in intermediate steps to fewer processors to reduce communication overhead. It is a novel feature of these methods that the redistribution steps are merged with the communication steps of the algorithm to avoid additional communication overhead. Among the possible alternative algorithms, SPIRAL’s search mechanism now determines the fastest for a given platform, effectively generating adapted code without human intervention. Experiments with DFT MPI programs generated by SPIRAL show performance gains of up to 30% due to rescaling. Further, our generated programs compare favorably with FFTW-MPI 2.1.5.

Paolo D'Alberto, Peter A. Milder, Franz Franchetti, James C. Hoe, Markus Püschel and José M. F. Moura (Proc. High Performance Extreme Computing (HPEC), 2006)

**Discrete Fourier Transform Compiler for FPGA and CPU/FPGA Partitioned Implementations**

Bibtex

Pawel Hitczenko, Jeremy Johnson and Hung-Jen Huang (Theoretical Computer Science, Vol. 352, pp. 8-30, 2006)

**Distribution of a Class of Divide and Conquer Recurrences arising from the Computation of the Walsh-Hadamard Transform**

Published paper (link to publisher)

Bibtex

Peter A. Milder, Mohammad Ahmad, James C. Hoe and Markus Püschel (Proc. FPGA, pp. 211-220, 2006)

**Fast and Accurate Resource Estimation of Automatically Generated Custom DFT IP Cores**

Preprint (145 KB)

Published paper (link to publisher)

Bibtex

This paper presents an equation-based resource utilization model for automatically generated discrete Fourier transform (DFT) soft core IPs. The parameterized DFT IP generator allows a user to make customized tradeoffs between cost and performance and between utilization of different resource classes. The equation-based resource model permits immediate and accurate estimation of resource requirements as the user considers the different generator options. Furthermore, the fast turnaround of the model allows it to be combined with a search algorithm such that the user could query automatically for an optimal design within the stated performance and resource constraints.Following a brief review of the DFT IP generator, this paper presents the development of the equation-based models for estimating slice and hard macro utilizations in the Xilinx Virtex-II Pro FPGA family. The evaluation section shows that an average error of 6.1% is achievable by a model of linear equations that can be evaluated in sub-microseconds. The paper further offers a demonstration of the automatic design exploration capability.

Franz Franchetti, Yevgen Voronenko and Markus Püschel (Proc. Supercomputing (SC), 2006)

**FFT Program Generation for Shared Memory: SMP and Multicore**

Preprint (161 KB)

Published paper (link to publisher)

Bibtex

The chip maker's response to the approaching end of CPU frequency scaling are multicore systems, which offer the same programming paradigm as traditional shared memory platforms but have different performance characteristics. This situation considerably increases the burden on library developers and strengthens the case for automatic performance tuning frameworks like Spiral, a program generator and optimizer for linear transforms such as the discrete Fourier transform (DFT). We present a shared memory extension of Spiral. The extension within Spiral consists of a rewriting system that manipulates the structure of transform algorithms to achieve load balancing and avoids false sharing, and of a backend to generate multithreaded code. Application to the DFT produces a novel class of algorithms suitable for multicore systems as validated by experimental results: we demonstrate a parallelization speed-up already for sizes that fit into L1 cache and compare favorably to other DFT libraries across all small and midsize DFTs and considered platforms.

F. Gygi, E. W. Draeger, M. Schulz, B. R. de Supinski, J. A. Gunnels, V. Austel, J. C. Sexton, Franz Franchetti, Stefan Kral, Christoph W. Ueberhuber and Juergen Lorenz (Proc. Supercomputing (SC), 2006)

**Large-Scale Electronic Structure Calculations of High-Z Metals on the BlueGene/L Platform**

Bibtex

Franz Franchetti, Andreas Bonelli, Ekapol Chuangsuwanich, Yu-Chiang Lee, Juergen Lorenz, Thomas Peter, Hao Shen, Marek Telgarsky, Yevgen Voronenko, Markus Püschel, José M. F. Moura and Christoph W. Ueberhuber (Proc. Workshop on Programming Models for Ubiquitous Parallelism (PMUP), 2006)

**Parallelism in Spiral**

Preprint (202 KB)

Bibtex

Spiral is a program generator for linear transforms such as the discrete Fourier transform. Spiral generates highly optimized code directly from a problem specification using a combination of techniques including optimization at a high level of abstraction using rewriting of mathematical expressions and heuristic search for platform adaptation. In this paper, we overview the generation of parallel programs using Spiral. This includes programs for vector architectures and programs for shared or distributed memory platforms.

Sung-Chul Han, Franz Franchetti and Markus Püschel (Proc. Parallel Architectures and Compilation Techniques (PACT), pp. 222-232, 2006)

**Program Generation for the All-Pairs Shortest Path Problem**

Preprint (258 KB)

Published paper (link to publisher)

Bibtex

A recent trend in computing are domain-specific program generators, designed to alleviate the effort of porting and re-optimizing libraries for fast-changing and increasingly complex computing platforms. Examples include ATLAS, SPIRAL, and the codelet generator in FFTW. Each of these generators produces highly optimized source code directly from a problem specification. In this paper, we extend this list by a program generator for the well-known Floyd-Warshall (FW) algorithm that solves the all-pairs shortest path problem, which is important in a wide range of engineering applications.

Franz Franchetti, Yevgen Voronenko and Markus Püschel (Proc. EDGE Workshop, pp. D49-D50, 2006)

**Spiral: Generating Signal Processing Kernels for New Commodity Architectures**

Bibtex

Marek Telgarsky, James C. Hoe and José M. F. Moura (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 3, 2006)

**Spiral: Joint Runtime and Energy Optimization of Linear Transforms**

Preprint (199 KB)

Published paper (link to publisher)

Bibtex

There is much interest into joint runtime and energy optimization of implementations of signal processing algorithms. Applications in domains such as embedded computing, sensor networks, and mobile communications often require processing of signals under simultaneous runtime, energy and/or power constraints. Hence, in addition to runtime, power and energy are first-order design considerations for both hardware and software developers in those domains. This paper studies the automatic generation of software implementations of digital signal processing (DSP) transforms that are optimized with respect to both runtime and energy. We explore the impact of algorithm selection (a software technique) and voltage-frequency scaling (a hardware technique) on the runtime and energy of computing fast linear transforms. We use SPIRAL, a code generation system, to enumerate automatically many alternative algorithms for the discrete Fourier transform. We measure the runtime and energy of these algorithms at different voltage-frequency settings of an Intel Pentium M microprocessor. We report experimental results supporting that algorithm selection and voltage-frequency scaling do achieve the following: (1) have large impact on the runtime and energy of computing the discrete Fourier transform on a microprocessor; and (2) enable the optimization of important joint runtime-energy objectives.

Roland Wunderlich, Markus Püschel and James C. Hoe (Proc. High Performance Extreme Computing (HPEC), 2005)

**Accelerating Blocked Matrix-Matrix Multiplication using a Software-Managed Memory Hierarchy with DMA**

Preprint (43 KB)

Bibtex

The optimization of matrix-matrix multiplication (MMM) performance has been well studied on general-purpose desktop and server processors. Classic solutions exploit common microarchitectural features including superscalar execution and the cache and TLB hierarchy to achieve near-peak performance. Typical digital signal processors (DSPs) do not have these features, and instead use in-order execution, configurable memory hierarchies, and pro-grammable I/O interfaces. We investigate the methods needed to achieve high per-formance MMM on the Texas Instruments C6713 floating-point DSP. This processor has two components that can be used to accelerate MMM: a software-managed memory hierarchy, and a direct memory access (DMA) engine that can perform block copies from main memory to into the memory hierarchy. Our MMM implementation overlaps computation with DMA block transfers. For matrices lar-ger than the data caches, we observed a 46% performance increase over a blocked MMM implementation, and a 190% increase over the Texas Instruments DSP library.

Grace Nordin, Peter A. Milder, James C. Hoe and Markus Püschel (Proc. Design Automation Conference (DAC), pp. 471-474, 2005)

**Automatic Generation of Customized Discrete Fourier Transform IPs**

Preprint (291 KB)

Published paper (link to publisher)

Bibtex

This paper presents a parameterized soft core generator for the discrete Fourier transform (DFT). Reusable IPs of digital signal processing (DSP) kernels are important time-saving resources in DSP hardware development. Unfortunately, reusable IPs, however optimized, can introduce inefficiencies because they cannot fit the exact requirements of every application context. Given the well-understood and regular computation in DSP kernels, an automatic tool can generate high-quality ready-to-use IPs customized to user-specified cost/performance tradeoffs (beyond basic parameters such as input size and data format). The paper shows that the generated DFT cores can match closely the performance and cost of DFT cores from the Xilinx LogiCore library. Furthermore, the generator can yield DFT cores over a range of different performance/ cost tradeoff points that are not available from the library.

Mihai Furis, Pawel Hitczenko and Jeremy Johnson (Proc. International Conference on Analysis of Algorithms, Discrete Mathematics and Theoretical Computer Science, pp. 115-124, 2005)

**Cache Miss Analysis of WHT Algorithms**

Published paper (link to publisher)

Bibtex

Franz Franchetti, Stefan Kral, Juergen Lorenz and Christoph W. Ueberhuber (Proceedings of the IEEE, special issue on ``Program Generation, Optimization, and Adaptation'', Vol. 93, No. 2, pp. 409-425, 2005)

**Efficient Utilization of SIMD Extensions**

Bibtex

Franz Franchetti, Yevgen Voronenko and Markus Püschel (Proc. Programming Languages Design and Implementation (PLDI), pp. 315-326 , 2005)

**Formal Loop Merging for Signal Transforms**

Preprint (304 KB)

Published paper (link to publisher)

Bibtex

A critical optimization in the domain of linear signal transforms, such as the discrete Fourier transform (DFT), is loop merging, which increases data locality and reuse and thus performance. In particular, this includes the conversion of shuffle operations into array reindexings. To date, loop merging is well understood only for the DFT, and only for Cooley-Tukey FFT based algorithms, which excludes DFT sizes divisible by large primes. In this paper, we present a formal loop merging framework for general signal transforms and its implementation within the SPIRAL code generator. The framework consists of Sigma-SPL, a mathematical language to express loops and index mappings; a rewriting system to merge loops in Sigma-SPL; and a compiler that translates Sigma-SPL into code. We apply the framework to DFT sizes that cannot be handled using only the Cooley-Tukey FFT and compare our method to FFTW 3.0.1 and the vendor library Intel MKL 7.2.1. Compared to FFTW our generated code is a factor of 2-4 faster under equal implementation conditions (same algorithms, same unrolling threshold). For some sizes we show a speed-up of a factor of 9 using Bluestein's algorithm. Further, we give a detailed comparison against the Intel vendor library MKL; our generated code is between 2 times faster and 4.5 times slower.

F. Gygi, E. W. Draeger, B. R. de Supinski, R. K. Yates, Franz Franchetti, Stefan Kral, Juergen Lorenz, Christoph W. Ueberhuber, J. A. Gunnels and J. C. Sexton (Proc. Supercomputing (SC), 2005)

**Large-Scale First-Principles Molecular Dynamics Simulations on the BlueGene/L Platform Using the Qbox Code**

Bibtex

Franz Franchetti, Yevgen Voronenko and Markus Püschel (Proc. Programming Languages Design and Implementation (PLDI), pp. 315-326, 2005)

**Loop Merging for Signal Transforms**

Bibtex

Xiaoming Li, María J. Garzarán and David Padua (Proc. International Symposium on Code Generation and Optimization (CGO), pp. 99-110, 2005)

**Optimizing Sorting with Genetic Algorithm**

Preprint (268 KB)

Published paper (link to publisher)

Bibtex

The growing complexity of modern processors has made the generation of highly efficient code increasingly difficult. Manual code generation is very time consuming, but it is often the only choice since the code generated by today's compiler technology often has much lower performance than the best hand-tuned codes. A promising code generation strategy, implemented by systems like ATLAS, FFTW, and SPIRAL, uses empirical search to find the parameter values of the implementation, such as the tile size and instruction schedules, that deliver near-optimal performance for a particular machine. However, this approach has only proven successful on scientific codes whose performance does not depend on the input data. In this paper we study machine learning techniques to extend empirical search to the generation of sorting routines, whose performance depends on the input characteristics and the architecture of the target machine. We build on a previous study that selects a "pure" sorting algorithm at the outset of the computation as a function of the standard deviation. The approach discussed in this paper uses genetic algorithms and a classifier system to build hierarchically-organized hybrid sorting algorithms capable of adapting to the input data. Our results show that such algorithms generated using the approach presented in this paper are quite effective at taking into account the complex interactions between architectural and input data characteristics and that the resulting code performs significantly better than conventional sorting implementations and the code generated by our earlier study. In particular, the routines generated using our approach perform better than all the commercial libraries that we tried including IBM ESSL, INTEL MKL and the C++ STL The best algorithm we have been able to generate is on the average 26% and 62% faster than the IBM ESSL in an IBM Power 3 and IBM Power 4, respectively.

Thammanit Pipatsrisawat, Aca Gacic, Franz Franchetti, Markus Püschel and José M. F. Moura (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5, pp. 153-156, 2005)

**Performance Analysis of the Filtered Backprojection Image Reconstruction Algorithms**

Preprint (214 KB)

Published paper (link to publisher)

Bibtex

We investigate performance tradeoffs for a class of filtered backprojection (FBP) image reconstruction algorithms. The recently developed fast hierarchical backprojection asymptotically achieves the same O(N/sup 2/ log N) cost as Fourier-based methods while retaining many advantages of the FBP technique. In this paper, we provide a detailed cost and performance analysis of the algorithm on a general purpose platform. Based on carefully tuned implementations of both the direct and the hierarchical backprojection, we explore the tradeoffs between distortion and runtime by varying several algorithm and implementation choices. Experimental results show that, given the desired performance, the choice of algorithm parameters is not obvious and largely depends on the image properties and the underlying computer platform.

José M. F. Moura, Markus Püschel, David Padua and Jack Dongarra (Proceedings of the IEEE, special issue on ``Program Generation, Optimization, and Adaptation'', Vol. 93, No. 2, pp. 211-215, 2005)

**Scanning the Issue: Special Issue on Program Generation, Optimization, and Platform Adaptation**

Published paper (link to publisher)

Bibtex

Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson and Nicholas Rizzolo (Proceedings of the IEEE, special issue on ``Program Generation, Optimization, and Adaptation'', Vol. 93, No. 2, pp. 232- 275, 2005)

**SPIRAL: Code Generation for DSP Transforms**

Preprint (1.2 MB)

Published paper (link to publisher)

Bibtex

Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL, which considers this problem for the performance-critical domain of linear digital signal processing (DSP) transforms. For a specified transform, SPIRAL automatically generates high-performance code that is tuned to the given platform. SPIRAL formulates the tuning as an optimization problem and exploits the domain-specific mathematical structure of transform algorithms to implement a feedback-driven optimizer. Similar to a human expert, for a specified transform, SPIRAL "intelligently" generates and explores algorithmic and implementation choices to find the best match to the computer's microarchitecture. The "intelligence" is provided by search and learning techniques that exploit the structure of the algorithm and implementation space to guide the exploration and optimization. SPIRAL generates high-performance code for a broad set of DSP transforms, including the discrete Fourier transform, other trigonometric transforms, filter transforms, and discrete wavelet transforms. Experimental results show that the code generated by SPIRAL competes with, and sometimes outperforms, the best available human tuned transform library code.

Juergen Lorenz, Stefan Kral, Franz Franchetti and Christoph W. Ueberhuber (IBM Journal of Research and Development, Vol. 49, No. 2/3, pp. 437-446, 2005)

**Vectorization Techniques for the BlueGene/L Double FPU**

Bibtex

Lawrence C. Chang, Inpyo Hong, Yevgen Voronenko and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2004)

**Adaptive Mapping of Linear DSP Algorithms to Fixed-Point Arithmetic**

Bibtex

Xiaoming Li, María J. Garzarán and David Padua (Proc. International Symposium on Code Generation and Optimization (CGO), pp. 111-124, 2004)

**A Dynamically Tuned Sorting Library**

Preprint (418 KB)

Published paper (link to publisher)

Bibtex

Empirical search is a strategy used during the installation oflibrary generators such as ATLAS, FFTW, and SPIRAL to identify the algorithm or the version of an algorithm that delivers thebest performance. In the past, empirical search has been appliedalmost exclusively to scientific problems. In this paper, we discuss the application of empirical search to sorting, which is oneof the best understood symbolic computing problems. When contrasted with the dense numerical computations of ATLAS, FFTW,and SPIRAL, sorting presents a new challenge, namely that the relative performance of the algorithms depend not only on the characteristics of the target machine and the size of the input data but also on the distribution of values in the input data set.Empirical search is applied in the study reported here as partof a sorting library generator. The resulting routines dynamicallyadapt to the characteristics of the input data by selecting the bestsorting algorithm from a small set of alternatives. To generate therun time selection mechanism our generator makes use of machinelearning to predict the best algorithm as a function of the characteristics of the input data set and the performance of the differentalgorithms on the target machine. This prediction is based on thedata obtained through empirical search at installation time.Our results show that our approach is quite effective. Whensorting data inputs of 12M keys with various standard deviations,our adaptive approach selected the best algorithm for all the inputdata sets and all platforms that we tried in our experiments. Thewrong decision could have introduced a performance degradationof up to 133%, with an average value of 44%.

Jeremy Johnson and Kang Chen (Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 44-, 2004)

**A Self-Adapting Distributed Memory Package for Fast Signal Transforms**

Preprint (1.4 MB)

Published paper (link to publisher)

Bibtex

Aca Gacic, Markus Püschel and José M. F. Moura (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5, pp. V-69-V-72, 2004)

**Automatically Generated High-Performance Code for Discrete Wavelet Transforms**

Preprint (103 KB)

Published paper (link to publisher)

Bibtex

A growing number of performance-critical DSP application use the discrete wavelet transform (DWT), thus prompting the need for highly efficient DWT software implementations. Unfortunately, the rapid evolution of computing platforms and compiler technology makes carefully hand-tuned code obsolete almost as fast as it is written. In this paper we describe our work on the automatic generation of DWT implementations that are tuned to a given platform. Our approach captures the various DWT algorithms in a concise mathematical framework that enables the integration of DWTs into the SPIRAL code generation system. Experiments show the quality of our automatically generated code and provide interesting insights; for example, the fastest code differs between platforms and is usually based on a non-obvious combination of DWT algorithms.

Franz Franchetti, Stefan Kral, Juergen Lorenz, Markus Püschel, Christoph W. Ueberhuber and Peter Wurzinger (Proc. High Performance Computing for Computational Science (VECPAR), Lecture Notes in Computer Science, Springer, Vol. 3402, pp. 23-36, 2004)

**Automatically Tuned FFTs for BlueGene/L’s Double FPU**

Preprint (219 KB)

Bibtex

IBM's upcoming 360 Tflop/s supercomputer BlueGene/L featuring 65,536 processors is supposed to lead the Top 500 list when being installed in 2005. This paper presents one of the first numerical codes actually run on a small prototype of this machine. Formal vectorization techniques, the Vienna MAP vectorizer (both developed for generic short vector SIMD extensions), and the automatic performance tuning approach provided by Spiral are combined to generate automatically optimized FFT codes for the BlueGene/L machine targeting its two-way short vector SIMD ``double'' floating-point unit. The resulting FFT codes are 40% faster than the best scalar Spiral generated code and 5 times faster than the mixed-radix FFT implementation provided by the Gnu scientific library GSL.

Adam C. Zelinski, Markus Püschel, Smarahara Misra and James C. Hoe (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5, pp. V-221-V-224, 2004)

**Automatic Cost Minimization for Multiplierless Implementations of Discrete Signal Transforms**

Preprint (80 KB)

Published paper (link to publisher)

Bibtex

The computation of linear DSP transforms consists entirely of additions and multiplications by constants, which, in a hardware realization, can be implemented as a network of wired shifts and additions. Thus, a light weight fixed point implementation that approximates an exact transform can be built from only adders. This paper presents an automatic approach for minimizing the number of additions required for a given transform under the constraint of a particular quality measure. We present an evaluation of our approach. For example, one experiment shows that the IMDCT transform within an MP3 decoder can be reduced from 643 additions to 170 additions while maintaining Limited Accuracy as defined by the MP3 ISO standard.

Anthony F. Breitzman and Jeremy Johnson (Journal of High Performance Computing and Applications, special issue on ``Computer Algebra and Signal Processing'', Vol. 37, No. 2, pp. 157-186, 2004)

**Automatic Derivation and Implementation of Fast Convolution Algorithms**

Published paper (link to publisher)

Bibtex

Yevgen Voronenko and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5, pp. V-101-V-104, 2004)

**Automatic Generation of Implementations for DSP Transforms on Fused Multiply-Add Architectures**

Preprint (95 KB)

Published paper (link to publisher)

Bibtex

Many modern computer architectures feature fused multiply-add (FMA) instructions, which offer potentially faster performance for numerical applications. For DSP transforms, compilers can only generate FMA code to a very limited extent because optimal use of FMAs requires modifying the chosen algorithm. In this paper we present a framework for automatically generating FMA code for every linear DSP transform, which we implemented as an extension to the SPIRAL code generation system. We show that for many transforms and transform sizes, our generated FMA code matches the best-known hand-derived FMA algorithms in terms of arithmetic cost. Further, we present actual runtime results that show the speed-up obtained by using FMA instructions.

Aca Gacic (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2004)

**Automatic Implementation and Platform Adaptation of Discrete Filtering and Wavelet Algorithms**

Preprint (1.2 MB)

Bibtex

Moore's law, with the doubling of the transistor count every 18 months, poses serious challenges to high-performance numerical software designers: how to stay close to the maximum achievable performance on ever-changing and ever-faster hardware technologies? Up-to-date numerical libraries are usually maintained by large teams of expert programmers who hand-tune the code to a specific class of computer platforms, sacrificing portability for performance. Every new generation of processors reopens the cycle of implementing, tuning, and debugging.

The SPIRAL system addresses this problem by automatically generating and implementing algorithms for DSP numerical kernels and searching for the best solution on the platform of interest. Using search, SPIRAL adapts code to take optimal advantage of the available platform features, such as the architecture of the memory hierarchy and register banks. As a result, SPIRAL generates high-performance implementations for DSP transforms that are competitive with the best hand-coded numerical libraries provided by hardware vendors.

In this thesis, we focus on automatic implementation and platform adaptation of filtering and wavelet kernels, which are at the core of many performance-critical DSP applications. We formulate many well-known algorithms for FIR filters and discrete wavelet transforms (DWT) using a concise and flexible symbolic mathematical language and integrate it in the SPIRAL system. This enables automatic generation and search over the comprehensive space of competitive algorithms, often leading to complex solutions that are hardly ever considered by a human programmer.

Experimental results show that our automatically generated and tuned code for FIR filters and DWTs is competitive and sometimes even outperforms hand-coded numerical libraries provided by hardware vendors. This implies that the richness and the extent of the automatically generated search space can match human ingenuity in achieving high performance. Our system generates high-quality code for digital filtering and wavelet kernels across most current and compatible future computer platforms and frees software developers from tedious and time-consuming coding at the machine level.

Markus Püschel, Adam C. Zelinski and James C. Hoe (Proc. International Conference on Computer-Aided Design (ICCAD), pp. 175-182, 2004)

**Custom-Optimized Multiplierless Implementations of DSP Algorithms**

Preprint (137 KB)

Published paper (link to publisher)

Bibtex

Linear DSP kernels such as transforms and filters are comprised exclusively of additions and multiplications by constants. These multiplications may be realized as networks of additions and wired shifts in hardware. The cost of such a "multiplierless" implementation is determined by the number of additions, which in turn depends on the value and precision of these constants. For a given transform or filter, the set of constants and their required precision is affected by algorithmic and implementation choices and hence provides a degree of freedom for optimization. In this paper we present an automated method to generate, for a given linear transform, a minimum addition multiplierless implementation that satisfies a given quality constraint. The method combines automatic algorithm selection to improve numerical robustness and automatic search methods to minimize constant precisions in a chosen algorithm. We present experiments that show the trade-offs between cost and quality, including custom optimizations of the transforms used in JPEG image and MP3 audio decoders

Grace Nordin, James C. Hoe and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2004)

**Discrete Fourier Transform IP Generator**

Bibtex

Nicholas Rizzolo and David Padua (Proc. International Workshop on Languages and Compilers for Parallel Computing (LCPC), 2004)

**HiLO: High Level Optimization of FFTs**

Preprint (181 KB)

Bibtex

As computing platforms become more and more complex, the task of optimizing performance critical codes becomes more challenging. Recently, more attention has been focused on automating this optimization process by making aggressive assumptions about the algorithm. Motivated by these trends, this paper presents HiLO, the high-level optimizer for FFT codes. HiLO blends traditional optimization techniques into an optimization strategy tailored to the needs of FFT codes and outputs C code ready to be further optimized by the native compiler. It has already been shown that such high-level transformations are important to coax the native compiler into doing its job well. HiLO provides a more appropriate platform for researching these phenomena, suggests an optimization strategy that improves on previous approaches, and shows that even software pipelining at the C level can improve the final binary's performance.

Peter Tummeltshammer, James C. Hoe and Markus Püschel (Proc. Design Automation Conference (DAC), pp. 826-829, 2004)

**Multiple Constant Multiplication By Time-Multiplexed Mapping of Addition Chains**

Preprint (122 KB)

Published paper (link to publisher)

Bibtex

An important primitive in the hardware implementations of linear DSP transforms is a circuit that can multiply an input value by one of several different preset constants. We propose a novel implementation of this circuit based on combining the *addition chains* of the constituent constants. We present an algorithm to automatically generate such a circuit for a given set of constants. The quality of the resulting circuits is evaluated after synthesis for a commercial 0.18mum standard cell library. We compare the area and latency efficiency of this addition chain based approach against a straightforward approach based on a constant table and a full multiplier.

Markus Püschel, Bryan Singer, Jianxin Xiong, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso and Robert W. Johnson (Journal of High Performance Computing and Applications, special issue on ``Automatic Performance Tuning'', Vol. 18, No. 1, pp. 21-45, 2004)

**SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms **

Published paper (link to publisher)

Bibtex

Franz Franchetti (Proc. IMACS Symposium on Mathematical Modelling (MATHMOD), Vol. 2, pp. 1539-1548, 2003)

**A Portable Short Vector Version of FFTW**

Bibtex

Xu Xu (Master thesis, Computer Science, Drexel University, 2003)

**A Recursive Implementation of the Dimensionless FFT**

Bibtex

Jeremy Johnson and Xu Xu (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2003)

**A Recursive Implementation of the Dimensionless FFT **

Preprint (62 KB)

Published paper (link to publisher)

Bibtex

A divide and conquer algorithm is presented for computing arbitrary multi-dimensional discrete Fourier transforms. In contrast to standard approaches such as the row-column algorithm, this algorithm allows an arbitrary decomposition, based solely on the size of the transform independent of the dimension of the transform. Only minor modifications are required to compute transforms with different dimension. These modifications were incorporated into the FFTW package so that the algorithm for computing one-dimensional transforms can be used to compute arbitrary dimensional transforms. This reduced the runtime of many multi-dimensional transforms.

Anthony F. Breitzman (PhD. thesis, Computer Science, Drexel University, 2003)

**Automatic Derivation and Implementation of Fast Convolution Algorithms**

Published paper (link to publisher)

Bibtex

Mihai Furis (Master thesis, Computer Science, Drexel University, 2003)

**Cache Miss Analysis of Walsh-Hadamard Transform Algorithms**

Bibtex

Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 501-504, 2003)

**Cooley-Tukey FFT like Algorithms for the DCT**

Preprint (71 KB)

Published paper (link to publisher)

Bibtex

The Cooley-Tukey FFT algorithm decomposes a discrete Fourier transform (DFT) of size n = km into smaller DFT of size k and m. In this paper we present a theorem that decomposes a polynomial transform into smaller polynomial transforms, and show that the FFT is obtained as a special case. Then we use this theorem to derive a new class of recursive algorithms for the discrete cosine transforms (DCT) of type II and type III. In contrast to other approaches, we manipulate polynomial algebras instead of transform matrix entries, which makes the derivation transparent, concise, and gives insight into the algorithms' structure. The derived algorithms have a regular structure and, for 2-power size, minimal arithmetic cost (among known DCT algorithms).

Smarahara Misra, Adam C. Zelinski, James C. Hoe and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2003)

**Custom Reduction of Arithmetic in Linear DSP Transforms**

Bibtex

Smarahara Misra (Master thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2003)

**Custom Reduction of Arithmetic in Multiplierless implementations of DSP Transforms**

Bibtex

Aca Gacic, Markus Püschel and José M. F. Moura (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 541-544, 2003)

**Fast Automatic Implementations of FIR Filters**

Preprint (82 KB)

Bibtex

SPIRAL is a generator for platform-adapted libraries of DSP transform algorithms. SPIRAL represents and automatically generates fast algorithms as mathematical formulas and translates them into programs. Adaptation is achieved by searching in the space of algorithmic and coding alternatives for the fastest implementation. In this paper we extend SPIRAL to generate platform-adapted implementations of FIR filters. First, we present various filter algorithms and introduce the mathematical constructs needed to include them into SPIRAL's architecture. Then we use SPIRAL to find fast filter implementations. The results show runtime improvements to a standard loop implementation of up to 70% using different blocking techniques. Further, we show that the usefulness of frequency-domain methods is not determined by the number of operations.

T. Fahringer, Franz Franchetti, M. Geissler, G. Madsen, H. Moritsch and R. Prodan (Proc. Euromicro Conference on Parallel Distributed and Network-based Processing (Euro PDP), pp. 185-192, 2003)

**On Using ZENTURIO for Performance and Parameter Studies on Clusters and Grids**

Bibtex

Franz Franchetti (PhD. thesis, Vienna University of Technology, 2003)

**Performance Portable Short Vector Transforms**

Bibtex

Franz Franchetti and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 537-540, 2003)

**Short Vector Code Generation and Adaptation for DSP Algorithms **

Preprint (90 KB)

Published paper (link to publisher)

Bibtex

Most recent general purpose processors feature short vector SIMD instructions, like SSE on Pentium III/4. In this paper we automatically generate platform-adapted short vector code for DSP transform algorithms using SPIRAL. SPIRAL represents and generates fast algorithms as mathematical formulas, and translates them into code. Adaptation is achieved by searching in the space of algorithmic and coding alternatives for the fastest implementation on the given platform. We explain the mathematical foundation that relates formula constructs to vector code, and overview the vector code generator within SPIRAL. Experimental results show excellent speed-ups compared to ordinary C code for a variety of transforms and computing platforms. For the DFT on Pentium 4, our automatically generated code compares favorably with the hand-tuned Intel MKL vendor library.

Franz Franchetti and Markus Püschel (Proc. International Parallel and Distributed Processing Symposium (IPDPS), 2003)

**Short Vector Code Generation for the Discrete Fourier Transform **

Preprint (169 KB)

Bibtex

In this paper we use a mathematical approach to generate high performance short vector code for the discrete Fourier transform. We represent the well-known Cooley-Tukey FFT recursion in a mathematical notation and formally derive a ``short vector variant''. Using this recursion we generate for one DFT size a large number of different algorithms, represented as formulas, and translate them into short vector code. Finally, we present a vector code specific dynamic programming search that finds in the formula space a good match to the given architecture. We implemented this approach as part of the SPIRAL library generator. On Pentium III and 4, we automatically generate SSE and SSE2 vector code that compares favorably with the hand-tuned Intel vendor MKL.

Stefan Kral, Franz Franchetti, Juergen Lorenz and Christoph W. Ueberhuber (Proc. Euro-Par Conference on Parallel and Distributed Computing, LNCS 2985, pp. 251-260, 2003)

**SIMD Vectorization of Straight Line Code**

Bibtex

Markus Püschel and José M. F. Moura (Proc. Workshop on Optimizations for DSP and Embedded Systems (ODES), 2003)

**SPIRAL: An Overview**

Bibtex

Markus Püschel and José M. F. Moura (SIAM Journal of Computing, Vol. 32, No. 5, pp. 1280-1316, 2003)

**The Algebraic Approach to the Discrete Cosine and Sine Transforms and their Fast Algorithms**

Preprint (327 KB)

Published paper (link to publisher)

Bibtex

It is known that the discrete Fourier transform (DFT) used in digital signal processing can be characterized in the framework of representation theory of algebras, namely as the decomposition matrix for the regular module C[Z_n] = C[x]/(x^n - 1). This characterization provides deep insight on the DFT and can be used to derive and understand the structure of its fast algorithms. In this paper we present an algebraic characterization of the important class of discrete cosine and sine transforms as decomposition matrices of certain regular modules associated to four series of Chebyshev polynomials. Then we derive most of their known algorithms by pure algebraic means. We identify the mathematical principle behind each algorithm and give insight into its structure. Our results show that the connection between algebra and digital signal processing is stronger than previously understood.

Franz Franchetti (Proc. International Workshop on Numerical and Symbolic Scientific Computing, 2003)

**Top Performance in Signal Processing**

Bibtex

Fang Fang, Rob A. Rutenbar, Markus Püschel and Tsuhan Chen (Proc. Design Automation Conference (DAC), pp. 496-501, 2003)

**Toward Efficient Static Analysis of Finite-Precision Effects in DSP Applications via Affine Arithmetic Modeling **

Preprint (126 KB)

Published paper (link to publisher)

Bibtex

We introduce a static error analysis technique, based on smart interval methods from affine arithmetic, to help designers translate DSP codes from full-precision floating-point to smaller finite-precision formats. The technique gives results for numerical error estimation comparable to detailed simulation, but achieves speedups of three orders of magnitude by avoiding actual bit-level simulation. We show results for experiments mapping common DSP transform algorithms to implementations using small custom floating point formats.

Mike Balog (Master thesis, Electrical and Computer Engineering, Drexel University, 2002)

**A Flexible Framework for Implementing FFT Processors**

Bibtex

Kang Chen (Master thesis, Computer Science, Drexel University, 2002)

**A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms**

Published paper (link to publisher)

Bibtex

This thesis presents a self-adapting parallel package for computing the Walsh-Hadamard transform (WHT), a prototypical fast signal transform, similar to the fast Fourier transform. Using a search over a space of mathematical formulas representing different algorithms to compute the WHT, the package finds the best parallel implementation on a given shared-memory multiprocessor. The search automatically fins the best combination of sequential and parallel code leading to effective granularity, load balance, and cache utilization. Experimental data are presented in the thesis showing the performance of the package on four different architectures. Results are also presented showing the optimizations required to obtain nearly linear speedup on these sample symmetric multiprocessors.

Kang Chen and Jeremy Johnson (Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 58-63, 2002)

**A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms **

Preprint (262 KB)

Published paper (link to publisher)

Bibtex

This paper presents a self-adapting parallel package for computing the Walsh-Hadamard transform (WHT), a prototypical fast signal transform, similar to the fast Fourier transform. Using a search over a space of mathematical formulas representing different algorithms to compute the WHT the package finds the best parallel implementation on a given shared-memory multiprocessor. The search automatically finds the best combination of sequential and parallel code leading to the optimal granularity, load balance, and cache utilization. Experimental results are presented showing the optimizations required to obtain nearly linear speedup on a sample symmetric multiprocessor.

Markus Püschel, Sebastian Egner and Thomas Beth (in New Reference Book on Computer Algebra, Eds. J. Grabmeier, E. Kaltofen, J. Grabmeier, E. Kaltofen, Springer 2002)

**AREP**

Published paper (link to publisher)

Bibtex

Franz Franchetti and Markus Püschel (Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 20-26, 2002)

**A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms **

Preprint (142 KB)

Published paper (link to publisher)

Bibtex

Many applications require fast implementations of signal processing algorithms to analyze data in real time or to effectively process many large data sets. Fast implementations of a signal transform need to take advantage of structure in the transformation matrix to factor the transform into a product of structured matrices. These factorizations compute the transform with fewer operations than the naive implementation of matrix multiplication. Signal transforms can have a vast number of factorizations, with each factorization of a single transform represented by a unique but mathematically equivalent formula. Interestingly, when implemented in code, these formulas can have significantly different runtimes on the same processor, sometimes differing by an order of magnitude. Further, the optimal implementations differ significantly between processors. Therefore, determining which formula is the most efficient for a particular processor is of great interest.

This thesis contributes methods for automating the modeling and optimization of performance across a variety of signal processing algorithms. Modeling and understanding performance can greatly aid in intelligently pruning the huge search space when optimizing performance. Automation is vital considering the size of the search space, the variety of signal processing algorithms, and the constantly changing computer platform market.

To automate the optimization of signal transforms, we have developed and implemented a number of different search methods in the SPIRAL system. These search methods are capable of optimizing a variety of different signal transforms, including new user-specified transforms. We have developed a new search method for this domain, STEER, which uses an evolutionary stochastic algorithm to find fast implementations.

To enable computer modeling of signal processing performance, we have developed and analyzed a number of feature sets to describe formulas representing specific transforms. We have developed several different models of formula performance, including models that predict runtimes of formulas and models that predict the number of cache misses formulas incur.

Further, we have developed a method that uses these learned models to generate fast implementations. This method is able to construct fast formulas, allowing us to intelligently search through only the most promising formulas. While the learned model is trained on data from one transform size, our method is able to produce fast formulas across many transform sizes, including larger sizes, even though it has never timed a formula of those other sizes.

Bryan Singer and Manuela Veloso (IEEE Transactions on Signal Processing, Vol. 50, No. 8, pp. 2003-2014, 2002)

**Automating the Modeling and Optimization of the Performance of Signal Transforms **

Preprint (365 KB)

Published paper (link to publisher)

Bibtex

Fast implementations of discrete signal transforms, such as the discrete Fourier transform, the Walsh-Hadamard transform, and the discrete trigonometric transforms, can be viewed as factorizations of their corresponding transformation matrices. A given signal transform can have many different factorizations, with each factorization represented by a unique but mathematically equivalent formula. When implemented in code, these formulas can have significantly different running times on the same processor, sometimes differing by an order of magnitude. Further, the optimal implementations on various processors are often different. Given this complexity, a crucial problem is automating the modeling and optimization of the performance of signal transform implementations.

To enable computer modeling of signal processing performance, we have developed and analyzed more than 15 feature sets to describe formulas representing specific transforms. Using some of these features and a limited set of training data, we have successfully trained neural networks to learn to accurately predict performance of formulas, with error rates less than 5%. In the direction of optimization, we have developed a new stochastic evolutionary algorithm, STEER, for finding fast implementations of a variety of signal transforms. STEER is able to optimize completely new transforms specified by a user. We present results that show STEER can find discrete cosine transform formulas that are 10--20% faster than what a dynamic programming search finds.

Franz Franchetti, F. Kaltenberger and Christoph W. Ueberhuber (Proc. APLIMAT Conference, pp. 333-339, 2002)

**FFT Kernels with FMA Utilization**

Bibtex

Markus Püschel and José M. F. Moura (Proc. Digital Signal Processing Workshop, 2002)

**Generation and Manipulation of DSP Transform Algorithms**

Bibtex

Fang Fang, James C. Hoe, Markus Püschel and Smarahara Misra (Proc. High Performance Extreme Computing (HPEC), 2002)

**Generation of Custom DSP Transform IP Cores: Case Study Walsh-Hadamard Transform **

Bibtex

Hardware designers are increasingly relying on pre-designed DSP (digital signal processing) cores from IP libraries to improve their productivity and reduce design time. Unfortunately, static DSP cores cannot accommodate application-specific trade-offs. To overcome this problem, we are proposing to automatically generate customized DSP cores that can be tailored for specific design requirements. This enables a designer, with no specific background in DSP transform mathematics, to quickly create and evaluate different design choices and to determine the one that is most suitable for the application. In this paper, we present a generator for the Walsh-Hadamard Transform (WHT). The generator accepts as input the WHT's sample size and two implementation parameters that control the degree of hardware reuse. The output is an RTL-level Verilog description of the desired implementation. We present experimental results that compare our different generated designs for the same transform with respect to resource requirements, computation latency, and throughput.

Bryan Singer and Manuela Veloso (Journal of Machine Learning Research, special issue on ``the Eighteenth International Conference on Machine Learning (ICML 2001)'', Vol. 3, pp. 887-919, 2002)

**Learning to Construct Fast Signal Processing Implementations **

Preprint (6.4 MB)

Published paper (link to publisher)

Bibtex

A single signal processing algorithm can be represented by many mathematically equivalent formulas. However, when these formulas are implemented in code and run on real machines, they have very different runtimes. Unfortunately, it is extremely difficult to model this broad performance range. Further, the space of formulas for real signal transforms is so large that it is impossible to search it exhaustively for fast implementations. We approach this search question as a control learning problem. We present a new method for learning to *generate fast formulas*, allowing us to intelligently search through only the most promising formulas. Our approach incorporates signal processing knowledge, hardware features, and formula performance data to learn to construct fast formulas. Our method learns from performance data for a few formulas of one size and then can construct formulas that will have the fastest runtimes possible across many sizes.

Hung-Jen Huang (Master thesis, Computer Science, Drexel University, 2002)

**Performance Analysis of an Adaptive Algorithm for the Walsh-Hadamard Transform**

Bibtex

Franz Franchetti, Markus Püschel, José M. F. Moura and Christoph W. Ueberhuber (Proc. High Performance Extreme Computing (HPEC), 2002)

**Short Vector SIMD Code Generation for DSP Algorithms**

Bibtex

Peter Becker (Master thesis, Electrical and Computer Engineering, Drexel University, 2001)

**A High Speed VLSI Architecture for the Discrete Haar Wavelet Transform**

Bibtex

An alternative to strictly time- or frequency-domain analysis, wavelets provide a multiresolution analysis of signals for input to processing such as edge detection, compression, and pattern recognition. High speed VLSI architectures are necessary for real-time processing of large data sets. Presented here is such an architecture for the Haar wavelet transform, the simplest and easiest wavelet to understand. From this base model, a generalization of the architecture to other wavelets is discussed, as well.

Franz Franchetti, H. Karner, Stefan Kral and Christoph W. Ueberhuber (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 1109-1112, 2001)

**Architecture Independent Short Vector FFTs**

Bibtex

Sebastian Egner, Jeremy Johnson, David Padua, Jianxin Xiong and Markus Püschel (ACM SIGSAM Bulletin Communications in Computer Algebra, Vol. 35, No. 2, pp. 1-19, 2001)

**Automatic Derivation and Implementation of Signal Processing Algorithms **

Preprint (249 KB)

Published paper (link to publisher)

Bibtex

We present a computer algebra framework to automatically derive and implement algorithms for digital signal processing. The two main parts of the framework are AREP, a library for symbolic manipulation of group representations and structured matrices, and SPL, a compiler turning matrix expressions into efficient imperative-style numerical programs.

Sebastian Egner and Markus Püschel (IEEE Transactions on Signal Processing, Vol. 49, No. 9, pp. 1992-2002, 2001)

**Automatic Generation of Fast Discrete Signal Transforms **

Preprint (217 KB)

Published paper (link to publisher)

Bibtex

This paper presents an algorithm that derives fast versions for a broad class of discrete signal transforms symbolically. The class includes but is not limited to the discrete Fourier and the discrete trigonometric transforms. This is achieved by finding fast sparse matrix factorizations for the matrix representations of these transforms. Unlike previous methods, the algorithm is entirely automatic and uses the defining matrix as its sole input. The sparse matrix factorization algorithm consists of two steps: first, the “symmetry” of the matrix is computed in the form of a pair of group representations; second, the representations are stepwise decomposed, giving rise to a sparse factorization of the original transform matrix. We have successfully demonstrated the method by computing automatically efficient transforms in several important cases: for the DFT, we obtain the Cooley-Tukey FFT; for a class of transforms including the DCT, type II, the number of arithmetic operations for our fast transforms is the same as for the best-known algorithms. Our approach provides new insights and interpretations for the structure of these signal transforms and the question of why fast algorithms exist. The sparse matrix factorization algorithm is implemented within the software package AREP

Jianxin Xiong (PhD. thesis, Computer Science, University of Illinois at Urbana-Champaign, 2001, Also as Technical Report UIUCDCS-R-2001-224, University of Illinois)

**Automatic Optimization of DSP Algorithms **

Bibtex

Since the advent of digital signal processing (DSP), there has been an enormous effort to obtain high-performance implementations of signal processing algorithms such as the fast Fourier transform (FFT). This effort has produced thousands of variants of fundamental algorithms and an equally large number of implementation techniques. Many of the implementations have been carefully hand-optimized. The cost of these hand optimizations and their long implementation times are a strong motivation to automate the implementation and optimization of signal processing algorithms.

In this dissertation, we present SPL, a domain specific language that represents fast signal processing algorithms as mathematical formulas, and the SPL compiler, which translates these formulas into efficient programs in high-level programming language such as FORTRAN and C. The SPL compiler was developed as part of SPIRAL, a system which utilizes formula transformations and intelligent search strategies to automatically create optimized digital signal processing libraries.

We also present Meta-SPL, a meta-language for extending the SPL language without modifying the SPL compiler. Meta-SPL provides the approach to declare new symbols that can be used as SPL keywords, as well as the mechanism to define the semantics of these new symbols.

After presenting the translation and optimization techniques utilized by the SPL compiler, empirical data is presented showing the efficiency of the resulting C/FORTRAN code. Timings are compared, using fast Fourier transform (FFT), Walsh-Hadamard transform (WHT), discrete cosine transform (DCT) and discrete sine transform (DST) as the benchmarks, to those obtained by highly optimized implementations including FFTW and the WHT package. The results of this comparison show that the SPL compiler produces code that is competitive and, in many cases, faster than the competitors.

Bryan Singer (PhD. thesis, Computer Science, Carnegie Mellon University, 2001)

**Automating the Modeling and Optimization of the Performance of Signal Processing Algorithms **

Preprint (60 KB)

Published paper (link to publisher)

Bibtex

Many applications require fast implementations of signal processing algorithms to analyze data in real time or to effectively process many large data sets. Fast implementations of a signal transform need to take advantage of structure in the transformation matrix to factor the transform into a product of structured matrices. These factorizations compute the transform with fewer operations than the naive implementation of matrix multiplication. Signal transforms can have a vast number of factorizations, with each factorization of a single transform represented by a unique but mathematically equivalent formula. Interestingly, when implemented in code, these formulas can have significantly different runtimes on the same processor, sometimes differing by an order of magnitude. Further, the optimal implementations differ significantly between processors. Therefore, determining which formula is the most efficient for a particular processor is of great interest.

This thesis contributes methods for automating the modeling and optimization of performance across a variety of signal processing algorithms. Modeling and understanding performance can greatly aid in intelligently pruning the huge search space when optimizing performance. Automation is vital considering the size of the search space, the variety of signal processing algorithms, and the constantly changing computer platform market.

To automate the optimization of signal transforms, we have developed and implemented a number of different search methods in the SPIRAL system. These search methods are capable of optimizing a variety of different signal transforms, including new user-specified transforms. We have developed a new search method for this domain, STEER, which uses an evolutionary stochastic algorithm to find fast implementations.

To enable computer modeling of signal processing performance, we have developed and analyzed a number of feature sets to describe formulas representing specific transforms. We have developed several different models of formula performance, including models that predict runtimes of formulas and models that predict the number of cache misses formulas incur.

Further, we have developed a method that uses these learned models to generate fast implementations. This method is able to construct fast formulas, allowing us to intelligently search through only the most promising formulas. While the learned model is trained on data from one transform size, our method is able to produce fast formulas across many transform sizes, including larger sizes, even though it has never timed a formula of those other sizes.

Neungsoo Park and Viktor K. Prasanna (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. II-1205-II-1208, 2001)

**Cache Conscious Walsh-Hadamard Transform **

Preprint (88 KB)

Published paper (link to publisher)

Bibtex

The Walsh-Hadamard Transform (WHT) is an important tool in signal processing because of its simplicity. One major problem with the existing packages is their poor scalability to larger problems. In computing large size WHT, non-unit stride accesses result in poor cache performance leading to severe degradation in performance. In this paper, we develop an efficient *cache friendly* technique that drastically enhances the performance of large size WHT. In our approach, data reorganization is performed between computation stages to reduce cache pollution. We develop an efficient search algorithm to determine the optimal factorization tree depending on the problem size and stride access in the decomposition. We have developed a package and demonstrated improved performance. For example, on Alpha 21264 and MIPS R10000, our technique results in upto 180% performance improvement over the state-of-the-art package. The proposed optimization is applicable to other signal transforms and is portable across various platforms.

Pinit Kumhom (PhD. thesis, Electrical and Computer Engineering, Drexel University, 2001, Also Tech. Report DU-MCS-01-01, Drexel University, 2001)

**Design, Optimization, and Implementation of a Universal FFT Processor **

Preprint (8.2 MB)

Bibtex

There exist Fast Fourier transform (FFT) algorithms, called dimensionless FFTs (L. Auslander, J. Johnson and R. Johnson, Dimensionless Fast Fourier Transform Method and Apparatus, Patent #US6003056, issued Dec. 14, 1999.), that work independent of dimension. These algorithms can be configured to compute different dimensional discrete Fourier transforms (DFTs) simply by relabeling the input data and by changing the values of the twiddle factors occurring in the butterfly operations. This observation allows the design of a universal FFT processor, which with minor reconfiguring, can compute one, two, and three dimensional DFTs. In this thesis a family of FFT processors, parameterized by the number of points, the dimension, the number of processors, and the internal dataflow is designed. Mathematical properties of the FFT are used systematically to simplify and optimize the processor design, and to explore different algorithms and design choices. Different dimensionless FFTs have different dataflows and consequently lead to different performance characteristics. A performance model is used to evaluate the different algorithmic choices and their resulting dataflow. Using the performance model, a search was conducted to find the optimal algorithm for the family of processors considered. The resulting algorithm and corresponding hardware design was implemented using FPGA.

Markus Püschel, Bryan Singer, Manuela Veloso and José M. F. Moura (Proc. International Conference on Computational Science (ICCS), Lecture Notes In Computer Science, Springer, Vol. 2073, pp. 97-106, 2001)

**Fast Automatic Generation of DSP Algorithms**

Preprint (144 KB)

Published paper (link to publisher)

Bibtex

SPIRAL is a generator of optimized, platform-adapted libraries for digital signal processing algorithms. SPIRAL's strategy translates the implementation task into a search in an expanded space of alternatives. These result from the many degrees of freedom in the DSP algorithm itself and in the various coding choices. This paper describes the framework to represent and generate efficiently these alternatives: the formula generator module in SPIRAL. We also address the search module that works in tandem with the formula generator in a feedback loop to find optimal implementations. These modules are implemented using the computer algebra system GAP/AREP.

José M. F. Moura, Jeremy Johnson, Robert W. Johnson, David Padua, Viktor K. Prasanna, Markus Püschel, Bryan Singer, Manuela Veloso and Jianxin Xiong (Proc. High Performance Extreme Computing (HPEC), 2001)

**Generating Platform-Adapted DSP Libraries using SPIRAL**

Bibtex

Bryan Singer and Manuela Veloso (Proc. International Conference on Machine Learning (ICML), pp. 529-536, 2001)

**Learning to Generate Fast Signal Processing Implementations **

Preprint (450 KB)

Published paper (link to publisher)

Bibtex

A single signal processing algorithm can be represented by many mathematically equivalent formulas. However, when these formulas are implemented in code and run on real machines, they have very different running times. Unfortunately, it is extremely difficult to model this broad performance range. Further, the space of formulas for real signal transforms is so large that it is impossible to search it exhaustively for fast implementations. We approach this search question as a control learning problem. We present a new method for learning to *generate fast formulas*, allowing us to intelligently search through only the most promising formulas. Our approach incorporates signal processing knowledge, hardware features, and formula performance data to learn to construct fast formulas. Our method learns from performance data for a few formulas of one size and then can construct formulas that will have the fastest run times possible across many sizes.

Jianxin Xiong, Jeremy Johnson, Robert W. Johnson and David Padua (Proc. Programming Languages Design and Implementation (PLDI), pp. 298-308, 2001)

**SPL: A Language and Compiler for DSP Algorithms **

Preprint (226 KB)

Published paper (link to publisher)

Bibtex

In this paper, we discuss the design and implementation of the SPL compiler, a domain specific compiler that translates signal processing algorithms expressed as mathematical formulas in the SPL language into efficient programs in high-level programming language such as FORTRAN and C. The SPL compiler was developed as part of SPIRAL, a system which utilizes formula transformations and intelligent search strategies to automatically create optimized digital signal processing (DSP) libraries. After presenting the translation and optimization techniques utilized by the compiler, empirical data is presented showing the efficiency of the resulting C/FORTRAN code. Timings are compared, using the fast Fourier transform (FFT) as a benchmark, to those obtained by FFTW, one of the fastest FFT packages available. The results of this comparison show that the SPL compiler produces code that is competitive and, in many cases, faster than FFTW.

Bryan Singer and Manuela Veloso (Proc. Supercomputing (SC), pp. 22, 2001)

**Stochastic Search for Signal Processing Algorithm Optimization **

Preprint (167 KB)

Published paper (link to publisher)

Bibtex

Many difficult problems can be viewed as search problems. However, given a new task with an embedded search problem, it is challenging to state and find a truly effective search approach. In this paper, we address the complex task of signal processing optimization. We first introduce and discuss the complexities of this domain. In general, a single signal processing algorithm can be represented by a very large number of different but mathematically equivalent formulas. When these formulas are implemented in actual code, unfortunately their running times differ significantly. Signal processing algorithm optimization aims at finding the fastest formula. We present a new approach that successfully solves this problem, using an evolutionary stochastic search algorithm, STEER, to search through the very large space of formulas. We empirically compare STEER against other search methods, showing that it notably can find faster formulas while still only timing a very small portion of the search space.

Gavin Haentjens (Master thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2000)

**An Investigation of Recursive FFT Implementations **

Preprint (762 KB)

Bibtex

The goals of the research discussed in this report are to determine the impact of different Cooley-Tukey decompositions on the performance of computer programs that compute the FFT, evaluate different methods for finding the most efficient decompositions, and determine the characteristics of the most efficient decompositions. Experiments are conducted using three different FFT programs and three dynamic programming (DP) methods for searching for efficient decompositions. The results show that even for FFTs of sizes under 2^{11}, the runtime for the average decomposition may be up to three times the runtime for the optimal decomposition. The results also show that a basic implementation of DP performs as well as an exhaustive search at finding fast decompositions for FFTs of sizes up to 2^{10}, and that the simple DP performs as well as two more sophisticated versions of DP for FFT sizes up to 2^{20}. Furthermore, the results show that for an out-of-place implementation of the FFT, right-expanded decompositions are optimal because they require memory storage only for the input and output data arrays, whereas other decompositions require additional temporary storage. Moreover, the results show that for an in-place implementation of the FFT, balanced decompositions are optimal if the algorithm is iterative, and right-expanded decompositions are optimal if the algorithm is recursive.

Pinit Kumhom, Jeremy Johnson and Prawat Nagvajara (Proc. IEEE ASIC/SOC Conference, IEEE, pp. 182-186, 2000)

**Design, optimization, and implementation of a universal FFT processor**

Published paper (link to publisher)

Bibtex

There exist Fast Fourier transform (FFT) algorithms, called dimensionless FFTs, that work independent of dimension. These algorithms can be configured to compute different dimensional DFTs simply by relabeling the input data and by changing the values of the twiddle factors occurring in the butterfly operations. This observation allows us to design an FFT processor, which with minor reconfiguring, can compute one, two, and three dimensional DFTs. In this paper we design a family of FFT processors, parameterized by the number of points, the dimension, the number of processors, and the internal dataflow, and show how to map different dimensionless FFTs onto this hardware design. Different dimensionless FFTs have different dataflows and consequently lead to different performance characteristics. Using a performance model we search for the optimal algorithm for the family of processors we considered. The resulting algorithm and corresponding hardware design was implemented using FPGA

Neungsoo Park, Viktor K. Prasanna, Kiran Bondalapati and Dongsoo Kang (Proc. International Parallel and Distributed Processing Symposium (IPDPS), pp. 693-701, 2000)

**Dynamic Data Layouts for Cache-conscious Factorization of DFT **

Preprint (245 KB)

Bibtex

Effective utilization of cache memories is a key factor in achieving high performance in computing the Discrete Fourier Transform (DFT). Most optimization techniques for computing the DFT rely on either modifying the computation and data access order or exploiting low level platform specific details, while keeping the data layout in memory static. In this paper we propose a high level optimization technique, dynamic data layout (DDL). In DDL, data reorganization is performed between computations to effectively utilize the cache. This cache-conscious factorization of the DFT including the data reorganization steps is automatically computed by using efficient techniques in our approach. An analytical model of the cache miss pattern is utilized to predict the performance and explore the search space of factorizations. Our technique results in up to a factor of 4 improvement over standard FFT implementations and up to 33% improvement over other optimization techniques such as copying on SUN UltraSPARC-II, DEC Alpha and Intel Pentium III
dynamic data layout
(DDL). In DDL, data reorganization is performed
*
between
*
computations to effectively utilize the cache. This cache-conscious factorization of the DFT including the data reorganization steps is automatically computed by using efficient techniques in our approach. An analytical model of the cache miss pattern is utilized to predict the performance and explore the search space of factorizations. Our technique results in up to a factor of 4 improvement over standard FFT implementations and up to 33% improvement over other optimization techniques such as copying on SUN UltraSPARC-II, DEC Alpha and Intel Pentium III.

Jeremy Johnson and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 6, pp. 3347-3350, 2000)

**In Search of the Optimal Walsh-Hadamard Transform **

Preprint (104 KB)

Published paper (link to publisher)

Bibtex

This paper describes an approach to implementing and optimizing fast signal transforms. Algorithms for computing signal transforms are expressed by symbolic expressions, which can be automatically generated and translated into programs. Optimizing an implementation involves searching for the fastest program obtained from one of the possible expressions. In this paper we apply this methodology to the implementation of the Walsh-Hadamard transform. An environment, accessible from MATLAB, is provided for generating and timing WHT algorithms. These tools are used to search for the fastest WHT algorithm. The fastest algorithm found is substantially faster than standard approaches to implementing the WHT. The work reported in this paper is part of the SPIRAL project, an ongoing project whose goal is to automate the implementation and optimization of signal processing algorithms.

Bryan Singer and Manuela Veloso (Proc. International Conference on Machine Learning (ICML), pp. 887-894, 2000)

**Learning to Predict Performance from Formula Modeling and Training Data **

Preprint (188 KB)

Published paper (link to publisher)

Bibtex

This paper reports on our work and results framing signal processing algorithm optimization as a machine learning task. A single signal processing algorithm can be represented by many different but mathematically equivalent formulas. When these formulas are implemented in actual code, they have very different running times. Signal processing optimization is concerned with finding a formula that implements the algorithm as efficiently as possible. Unfortunately, a correct mapping between a mathematical formula and its running time is unknown. However empirical performance data can be gathered for a variety of formulas. This data offers an interesting opportunity *to learn* to predict running time performance. In this paper we present two major results along this direction: (1) Different sets of features are identified for mathematical formulas that distinguish them into partitions with significantly different running times, and (2) A function approximator can learn to accurately predict the running time of a formula given a limited set of training data. Showing the impact of selecting different features to describe the input, this work contributes an extensive study on the role of learning for this novel task.

David Sepiashvili (Master thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2000)

**Performance Models and Search Methods for Optimal FFT Implementations **

Preprint (589 KB)

Bibtex

The central result of this thesis is deriving systematic methodologies for finding the fast Fourier transform (FFT) optimal implementations. By employing a divide-conquer procedure (decomposition), we break down the initial transform into combinations of different smalller size sub-transforms that can graphically be presented as breakdown trees. We apply the rewrite rules to the Cooley-Tukey formula and generate a set of algorithms and alternative codes for the FFT computation. We obtain a set of all possible implementations by pairing all possible breakdown trees with all code alternatives. For the runtime evaluation of these implementations, we develop the analytical and experimental perfomance models. Based on these models, we derive the optimal search methods -- the dynamic programming and exhaustive search -- and use them to search over the set of implementation runtimes for finding the implementation with the minimal runtime. The test results demonstrate that good algorithms and codes, accurate performance evaluation models, and effective search methods, combined together within the system framework, provide the best FFT implementations.

Franz Franchetti (Master thesis, Vienna University of Technology, 2000)

**Short Vector FFTs**

Bibtex

José M. F. Moura, Jeremy Johnson, Robert W. Johnson, David Padua, Viktor K. Prasanna, Markus Püschel and Manuela Veloso (Proc. High Performance Extreme Computing (HPEC), 2000)

**SPIRAL: Automatic Implementation of Signal Processing Algorithms**

Bibtex

Publication interface designed and implemented by Patra Pantupat, Aliaksei Sandryhaila,
and Markus Püschel

Electrical and Computer Engineering, Carnegie Mellon University, 2007

Electrical and Computer Engineering, Carnegie Mellon University, 2007