Copyrights to these papers may be held by the publishers. The download files are preprints. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

L. Tang, S. Chen, K. Harisrikanth, G. Xu, Franz Franchetti and K. Mai (Proc. Hot Chips Symposium (HCS), 2024)

**A 1.19GHz 9.52Gsamples/sec Radix-8 FFT Hardware Accelerator in 28nm**

Preprint (2.1 MB)

Bibtex

S. Fu, N. Zhang and Franz Franchetti (Proc. Parallel Architectures and Compilation Techniques (PACT), 2024)

**Accelerating High-Precision Number Theoretic Transforms using Intel AVX-512**

Preprint (850 KB)

Bibtex

Fully Homomorphic Encryption (FHE) allows various platforms to manipulate encrypted data offering ideal privacy protection. However, implementing FHE requires significant computing overhead, which can make it impractical in several applications. Researchers have previously addressed the computational overhead by developing accelerators specifically designed for FHE [6–8] or on server-level GPUs [5, 9]. While proven effective, we seek to accelerate FHE on CPUs, broadening the range of applications dramatically. In this work, we choose to focus on Number Theoretic Transform (NTT), which is the computational bottleneck of multiple state-of-the- art FHE schemes that account for over 90% of FHE execution time in practice [3]. The NTT algorithm primarily depends on three integer modulo arithmetic operations; namely modular addition, modular subtraction, and modular multiplication. Specifically, we seek to optimize these operations on large multi-word integers of 128 bits and utilize Intel’s Advanced Vector Extensions 512 (AVX- 512) to parallelize our code.We then use these vectorized operations to accelerate any 𝑛-point NTT algorithm on a single CPU core.

E. Tang, James C. Hoe and Franz Franchetti (Proc. SRC TECHCON, 2024)

**Magic Memory: A Memory-Centric Declarative Programming Paradigm to Enable High Productivity on Heterogeneous Systems**

Preprint (558 KB)

Bibtex

Heterogeneous computing systems involving CPUs, GPUs, FPGAs and even ASICs have already been shown to have immense computational power for various applications. However, programming these systems efficiently and effectively remains a significant challenge. Traditional imperative programming approaches result in complex programs that require expert-level understanding in order to optimize or maintain. The declarative programming model, Magic Memory, focuses on the behavior of the program and thus hides the nature of the underlying hardware that executes the program. In this model, computation is expressed using mathematical functions between arrays. This function can then be treated as an invariant that is always held true. During steady state, computation is performed by observing changes to these magic memory regions and then only recomputing the values that are affected by the change. As a first demonstration of this work, we utilize a CPU-FPGA system to compute the PageRank of a graph and recompute the PageRank after adding or removing an edge from the graph.

T. Zhang, E. Tang, F. Siddique, K. Skadron and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2024)

**Towards an End-to-End Processing-in-DRAM Acceleration of Spectral Library Search**

Preprint (871 KB)

Bibtex

This work explores accelerating spectral library searches, a key mass spectrometry (MS) workload, using processing-in-memory (PIM) architectures through an end-toend, co-designed approach. We apply signal processing and approximate computing techniques for pre-filtering MS data and implement a sum of absolute differences (SAD) algorithm optimized for PIM to compare spectral similarity. Our methodology is evaluated using a DRAM-based PIM simulator and compared against traditional CPU implementations. While initial results with small datasets favor CPUs, our analysis indicates potential benefits for PIM with larger, more realistic proteomics datasets. This work represents an initial step towards investigating PIM acceleration for MS applications.

Y. Eum, N. Zhang, L. Tang and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2024)

**Towards a RISC-V Instruction Set Extension for Multi-word Arithmetic**

Preprint (427 KB)

Bibtex

Multi-word arithmetic is a method of doing calculations with data which are bigger than what the machine registers can hold. For example, we could do 128-bit arithmetic with a 64-bit machine by using this technique. One application of multi-word arithmetic is in cryptography, since working with large numbers is a critical security feature [1]. Currently, many CPUs and GPUs support multi-word arithmetic by providing a single carry flag. This is useful for implementing simple 128-bit operations such as addition and subtraction, but is limited with more complicated operations such as multiplication or modulo. Contributions. Our key contributions are: • An extension to the RISC-V ISA to accelerate multi-word arithmetic incorporating multiple carry bits to enable modulo and multiplication. • An implementation of the ISA using the Chipyard framework [2] and the RoCC interface, seeing up to 1.5× speedup in clock cycles.

Sanil Rao, A. Prakash and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2024)

**Towards LibraryX: A Framework for Cross-Library Call Optimization**

Preprint (144 KB)

Bibtex

In scientific computing, key algorithms are implemented using domain specific libraries. In many cases these algorithms cannot be expressed using just one domain library, instead having to use library calls from various domains. Cross library implementations are productive but suffer in performance because library developers cannot optimize library calls outside their domain. Additionally, automated solutions like compilers struggle to optimize these cases as they cannot cross the library call boundary. This leaves manually implementing optimized algorithms for idealized performance, hindering readability and productivity.We propose LibraryX as a framework to express key algorithms using domain libraries while automatically providing high-performance implementations done by hand. LibraryX uses a combination of library call semantic capture, abstraction lifting/code generation, and runtime compilation to provide optimized implementations without modifying the source application. We demonstrate LibraryX using the example of FFT convolution.

G. Xu, Franz Franchetti and James C. Hoe (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2023)

**An Approach to Generating Customized Load-Store Architectures**

Published paper (link to publisher)

Bibtex

Sanil Rao, M. A. H. Monil, H. Mankad, J. Vetter and Franz Franchetti (Proc. Supercomputing (SC), 2023)

**FFTX-IRIS: Towards Performance Portability and Heterogeneity for SPIRAL Generated Code**

Published paper (link to publisher)

Bibtex

FFTX-IRIS is a dynamic system to efficiently utilize novel heterogeneous platforms. This system links two next-generation frameworks, FFTX and IRIS, to navigate the complexity of different hardware architectures. FFTX provides a runtime code generation framework for high-performance Fast Fourier Transform kernels. IRIS runtime provides portability and multi-device heterogeneity, allowing computation on any available compute resource. Together, FFTX-IRIS enables code generation, seamless portability, and performance without user involvement. We show the design of the FFTX-IRIS system along with an evaluation of various small FFT benchmarks. We also demonstrate multi-device heterogeneity of FFTX-IRIS with a larger stencil application.

B. Van Straalen, P. Colella, Franz Franchetti and P. Broderick (to appear in Proc. Exascale Computing Project Annual Meeting (ECPAM), 2023)

**FFTX: Next-Generation Open-Source Software for Fast Fourier Transforms**

Bibtex

N. Zhang, A. Ebel, N. Neda, P. Brinich, B. Reynwar, A. G. Schmidt, M. Franusich, Jeremy Johnson, B. Reagen and Franz Franchetti (Proc. IEEE High Performance Extreme Computing (HPEC), 2023)

**Generating High-Performance Number Theoretic Transform Implementations for Vector Architectures**

Preprint (505 KB)

Bibtex

Fully homomorphic encryption (FHE) offers the ability to perform computations directly on encrypted data by encoding numerical vectors onto mathematical structures. However, the adoption of FHE is hindered by substantial overheads that make it impractical for many applications. Number theoretic transforms (NTTs) are a key optimization technique for FHE by accelerating vector convolutions. Towards practical usage of FHE, we propose to use SPIRAL, a code generator renowned for generating efficient linear transform implementations, to generate high-performance NTT on vector architectures. We identify suitable NTT algorithms and translate the dataflow graphs of those algorithms into SPIRAL’s internal mathematical representations. We then implement the entire workflow required for generating efficient vectorized NTT code. In this work, we target the Ring Processing Unit (RPU), a multi-tile long vector accelerator designed for FHE computations. On average, the SPIRAL-generated NTT kernel achieves a 1.7× speedup over naive implementations on RPU, showcasing the effectiveness of our approach towards maximizing performance for NTT computations on vector architectures.

N. Zhang and Franz Franchetti (Proc. International Symposium on Code Generation and Optimization (CGO), 2023)

**Generating Number Theoretic Transforms for Multi-Word Integer Data Types**

Preprint (1.9 MB)

Bibtex

D. Sun, N. Zhang and Franz Franchetti (Proc. IEEE High Performance Extreme Computing (HPEC), 2023)

**Optimization and Performance Analysis of Shor’s Algorithm in Qiskit**

Preprint (617 KB)

Bibtex

Shor’s algorithm is widely renowned in the field of quantum computing due to its potential to efficiently break RSA encryption in polynomial time. In this paper, we optimized an end-to-end library-based implementation of Shor’s algorithm using the IBM Qiskit quantum library and derived a speed-oflight (i.e., theoretical peak) performance model that calculates the minimum runtime required for executing Shor’s algorithm with input size N on a certain machine by counting the total operations as a function of different numbers of gates. We evaluated our model by running Shor’s algorithm on CPUs and GPUs, and simulated the factorization of a number up to 4,757. Comparing the speed-of-light runtime to our real-world measurement, we are able to quantify the margin for future quantum library improvements.

D. Soni, N. Neda, N. Zhang, B. Reynwar, H. Gamil, B. Heyman, M. N. T. Moopan, A. Al Badawi, Y. Polyakov, K. Canida, M. Pedram, M. Maniatakos, D. B. Cousins, Franz Franchetti, M. French, A. G. Schmidt and B. Reagen (Cryptology ePrint Archive, 2023)

**RPU: The Ring Processing Unit**

Published paper (link to publisher)

Bibtex

Ring-Learning-with-Errors (RLWE) has emerged as the foundation of many important techniques for improving security and privacy, including homomorphic encryption and post-quantum cryptography. While promising, these techniques have received limited use due to their extreme overheads of running on general-purpose machines. In this paper, we present a novel vector Instruction Set Architecture (ISA) and microarchitecture for accelerating the ring-based computations of RLWE. The ISA, named B512, is developed to meet the needs of ring processing workloads while balancing high-performance and general-purpose programming support. Having an ISA rather than fixed hardware facilitates continued software improvement post-fabrication and the ability to support the evolving workloads. We then propose the ring processing unit (RPU), a high-performance, modular implementation of B512. The RPU has native large word modular arithmetic support, capabilities for very wide parallel processing, and a large capacity high-bandwidth scratchpad to meet the needs of ring processing. We address the challenges of programming the RPU using a newly developed SPIRAL backend. A configurable simulator is built to characterize design tradeoffs and quantify performance. The best performing design was implemented in RTL and used to validate simulator performance. In addition to our characterization, we show that a RPU using 20.5mm2 of GF 12nm can provide a speedup of 1485x over a CPU running a 64k, 128-bit NTT, a core RLWE workload

D. B. Cousins, Y. Polyakov, A. Al Badawi, M. French, A. G. Schmidt, A. Jacob, B. Reynwar, K. Canida, A. Jaiswal, C. Mathew, H. Gamil, N. Neda, D. Soni, M. Maniatakos, B. Reagen, N. Zhang, Franz Franchetti, P. Brinich, Jeremy Johnson, P. Broderick, M. Franusich, B. Zhang, Z. Cheng and M. Pedram (arXiv, 2023)

**TREBUCHET: Fully Homomorphic Encryption Accelerator for Deep Computation**

Published paper (link to publisher)

Bibtex

Secure computation is of critical importance to not only the DoD, but across financial institutions, healthcare, and anywhere personally identifiable information (PII) is accessed. Traditional security techniques require data to be decrypted before performing any computation. When processed on untrusted systems the decrypted data is vulnerable to attacks to extract the sensitive information. To address these vulnerabilities Fully Homomorphic Encryption (FHE) keeps the data encrypted during computation and secures the results, even in these untrusted environments. However, FHE requires a significant amount of computation to perform equivalent unencrypted operations. To be useful, FHE must significantly close the computation gap (within 10x) to make encrypted processing practical. To accomplish this ambitious goal the TREBUCHET project is leading research and development in FHE processing hardware to accelerate deep computations on encrypted data, as part of the DARPA MTO Data Privacy for Virtual Environments (DPRIVE) program. We accelerate the major secure standardized FHE schemes (BGV, BFV, CKKS, FHEW, etc.) at >=128-bit security while integrating with the open-source PALISADE and OpenFHE libraries currently used in the DoD and in industry. We utilize a novel tile-based chip design with highly parallel ALUs optimized for vectorized 128b modulo arithmetic. The TREBUCHET coprocessor design provides a highly modular, flexible, and extensible FHE accelerator for easy reconfiguration, deployment, integration and application on other hardware form factors, such as System-on-Chip or alternate chip areas.

P. Brinich, N. Zhang, A. Ebel, Franz Franchetti and Jeremy Johnson (Proc. IEEE High Performance Extreme Computing (HPEC), 2023)

**Twiddle Factor Generation for a Vectorized Number Theoretic Transform**

Preprint (222 KB)

Bibtex

Joao Rivera, Franz Franchetti and Markus Püschel (Proc. International Symposium on Code Generation and Optimization (CGO), 2022)

**A Compiler for Sound Floating-Point Computations using Affine Arithmetic**

Preprint (650 KB)

Bibtex

Floating-point arithmetic is extensively used in scientific and engineering applications to approximate real arithmetic. Unfortunately, floating-point arithmetic is not a sound implementation of real arithmetic, i.e., it may produce different results, does not provide error guarantees, and the errors can become arbitrarily large. In this paper, we introduce SafeGen, a source-to-source compiler that rewrites a given C program using floating-point arithmetic to an efficient C program performing the same computation soundly, i.e., it returns an error bound that is guaranteed to contain the correct result of the program if it had been executed in real arithmetic. Equivalently, it gives a precision certificate on the number of correct bits in the result. SafeGen uses affine arithmetic (AA) that keeps accuracy high compared to interval arithmetic by preserving linear correlations between variables. To mitigate its high cost, SafeGen combines a novel form of static analysis to identify these correlations with a flexible policy-based approach for their selection. SafeGen supports SIMD intrinsics in the input and can output SIMD-optimized code. Our results show that SafeGen-generated code is 30–70 times faster than manually rewritten code using AA libraries. Equivalently, SafeGen can offer many more bits of certified accuracy within a reduced time budget.

Anuva Kulkarni, Jelena Kovacevic and Franz Franchetti (Proc. International Conference on Parallel Processing (ICPP), 2022)

**A Framework for Low Communication Approaches for Large Scale 3D Convolution**

Preprint (1.5 MB)

Bibtex

Large-scale 3D convolutions computed using parallel Fast Fourier Transforms (FFTs) demand multiple all-to-all communication steps, which cause bottlenecks on computing clusters. Since data transfer speeds to/from memory have not increased proportionally to computational capacity (in terms of FLOPs), 3D FFTs become bounded by communication and are difficult to scale, especially on modern heterogeneous computing platforms consisting of accelerators like GPUs. Existing HPC frameworks focus on optimizing the isolated FFT algorithm or communication patterns, but still require multiple all-to-all communication steps during convolution. In this work, we present a strategy for scalable convolution such that it avoids multiple all-to-all exchanges, and also optimizes necessary communication. We provide proof-of-concept results under assumptions of a use case, the MASSIF Hooke’s law simulation convolution kernel. Our method localizes computation by exploiting properties of the data, and approximates the convolution result by data compression, resulting in increased scalability of 3D convolution. Our preliminary results show scalability of 8 times more than traditional methods in the same compute resources without adversely affecting result accuracy. Our method can be adapted for first-principle scientific simulations and leverages cross-disciplinary knowledge of the application, the data and computing to perform large-scale convolution while avoiding communication bottlenecks. In order to make our approach widely usable and adaptable for emerging challenges, we discuss the use of FFTX, a novel framework which can be used for platform-agnostic specification and optimization for algorithmic approaches similar to ours.

L. Tang, S. Chen, K. Harisrikanth, G. Xu, K. Mai and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2022)

**A High Throughput Hardware Accelerator for FFTW Codelets: A First Look**

Preprint (9.7 MB)

Bibtex

The Fast Fourier Transform (FFT) is a critical computation for numerous applications in science and engineering. Its implementation has been widely studied and optimized on various computing platforms, with the FFTW library becoming the standard interface in HPC. In this work, we propose hardware acceleration of the FFTW library by putting a software codelet into hardware. The hardware is exposed to the user through an FFTW-compatible software library while actual computation takes place behind the scenes on a custom accelerator. To demonstrate a first look at this idea, we design a high throughput accelerator for FFTW twiddle codelets. The FFT hardware is automatically generated using SPIRAL and a test chip is fabricated in a TSMC 28nm process. We provide measured results of the test chip and discuss many opportunities for future work.

P. Oostema and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2022)

**Approximating Manifolds and Geodesics with Curved Surfaces**

Preprint (968 KB)

Bibtex

J. Nguyen, Sanil Rao, J. Moreira and Franz Franchetti (Proc. IEEE High Performance Extreme Computing (HPEC), 2022)

**Automatic Generation of Matrix-Vector Code Using SPIRAL for the Power10 ISA**

Preprint (254 KB)

Bibtex

Z. Gong, N. Zhu, M. Ngaw, Joao Rivera, L. Tang, E. Tang, H. Mankad and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2022)

**Interval Arithmetic-based FFT for Large Integer Multiplication**

Preprint (227 KB)

Bibtex

J. Nguyen, M. Cai, Z. Zuo, L. Tang, K. Mai and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2022)

**LIMA: Hardware for FFT based Large Integer Multiplication**

Preprint (1.9 MB)

Bibtex

E. Tang and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2022)

**Magic Memory: A Programming Model For Big Data Analytics**

Preprint (203 KB)

Bibtex

P. Oostema and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2022)

**On Manifold Graph Embedding**

Preprint (710 KB)

Bibtex

H. Mankad, Sanil Rao, P. Colella, B. Van Straalen and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2022)

**ProtoX: A First Look**

Preprint (199 KB)

Bibtex

N. Zhang, H. Gamil, P. Brinich, B. Reynwar, A. Al Badawi, N. Neda, D. Soni, K. Canida, Y. Polyakov, P. Broderick, M. Maniatakos, A. G. Schmidt, M. Franusich, Jeremy Johnson, B. Reagen, D. B. Cousins and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2022)

**Towards Full-Stack Acceleration for Fully Homomorphic Encryption**

Preprint (540 KB)

Bibtex

N. Kitai, Daisuke Takahashi, Franz Franchetti, T. Katagiri, S. Ohshima and T. Nagai (Proc. High Performance Computing Conference (HPCC), Japan, 2021)

**Adaptation of A64 Scalable Vector Extension for Spiral**

Preprint (560 KB)

Bibtex

In this paper, we propose an adaptation of the A64 Scalable Vector Extension for SPIRAL to generate discrete Fourier transform (DFT) implementations. The performance of our method is evaluated, using the Supercomputer “Flow” at Nagoya University. The A64 scalable vector extension applied DFT codes are up to 1.98-times faster than scalar DFT codes and up to 3.63-times higher in terms of the SIMD instruction rate.

N. Kitai, Daisuke Takahashi, Franz Franchetti, T. Katagiri, S. Ohshima and T. Nagai (Proc. International Workshop on Automatic Performance Tuning (iWAPT), 2021)

**An Auto-tuning with Adaptation of A64 Scalable Vector Extension for SPIRAL**

Preprint (1 MB)

Bibtex

Joao Rivera, Franz Franchetti and Markus Püschel (Proc. International Symposium on Code Generation and Optimization (CGO), 2021)

**An Interval Compiler for Sound Floating Point Computations**

Preprint (646 KB)

Bibtex

Floating-point arithmetic is widely used by software developers but is unsound, i.e., there is no guarantee on the accuracy obtained, which can be imperative in safety-critical applications. We present IGen, a source-to-source compiler that translates a given C function using floating-point into an equivalent sound C function that uses interval arithmetic. IGen supports Intel SIMD intrinsic in the input function using a specially designed code generator and can produce SIMD-optimized output. To mitigate a possible loss of accuracy due to the increase of interval sizes, IGen can compile to double-double precision, again SIMD-optimized. Finally, IGen implements an accuracy optimization for the common reduction pattern. We benchmark our compiler on high-performance code in the domain of linear algebra and signal processing. The results show that the generated code delivers sound double precision results at high performance. In particular, we observe speed-ups of up to 9.8 when compared to commonly used interval libraries using double precision. When compiling to double-double, our compiler delivers intervals that keep error accumulation small enough to compute results with at most one bit of error in double precision, i.e., certified double precision results.

Mark Blanco, S. McMillan and Tze-Meng Low (Proc. High Performance Extreme Computing (HPEC), 2021)

**Delayed Asynchronous Iterative Graph Algorithms **

Preprint (472 KB)

Bibtex

Iterative graph algorithms often compute intermediate values and update them as computation progresses. Updated output values are used as inputs for computations in current or subsequent iterations; hence the number of iterations required for values to converge can potentially reduce if the newest values are asynchronously made available to other updates computed in the same iteration. In a multi-threaded shared memory system, the immediate propagation of updated values can cause memory contention that may offset the benefit of propagating updates sooner. In some cases, the benefit of a smaller number of iterations may be diminished by each iteration taking longer. Our key idea is to combine the low memory contention that synchronous approaches have with the faster information sharing of asynchronous approaches. Our hybrid approach buffers updates from threads locally before committing them to the global store to control how often threads may cause conflicts for others while still sharing data within one iteration and hence speeding convergence. On a 112-thread CPU system, our hybrid approach attains up to 4.5% - 19.4% speedup over an asynchronous approach for Pagerank and up to 1.9% - 17% speedup over asynchronous Bellman Ford SSSP. Further, our hybrid approach attains 2.56x better performance than the synchronous approach. Finally, we provide insights as to why delaying updates is not helpful on certain graphs where connectivity is clustered on the main diagonal of the adjacency matrix.

P. Oostema and Franz Franchetti (Proc. IEEE Workshop on Parallel / Distributed Combinatorics and Optimization (PDCO), 2021)

**Leveraging High Dimensional Spatial Graph Embedding as a Heuristic for Graph Algorithms**

Preprint (3.8 MB)

Bibtex

Yannick Zakowski, Calvin Beck, Irene Yoon, Ilia Zaichuk, Vadim Zaliva and Steve Zdancewic (Proc. International Conference on Functional Programming (ICFP), 2021)

**Modular, Compositional, and Executable Formal Semantics for LLVMIR**

Preprint (1.4 MB)

Bibtex

This paper presents a novel formal semantics, mechanized in Coq, for a large, sequential subset of the LLVM IR. In contrast to previous approaches, which use relationally-specified operational semantics, this new semantics is based on monadic interpretation of interaction trees, a structure that provides a more compositional approach to defining language semantics while retaining the ability to extract an executable interpreter. Our semantics handles many of the LLVM IR’s non-trivial language features and is constructed modularly in terms of event handlers, including those that deal with nondeterminism in the specification. We show how this semantics admits compositional reasoning principles derived from the interaction trees equational theory of weak bisimulation, which we extend here to better deal with nondeterminism, and we use them to prove that the extracted reference interpreter faithfully refines the semantic model. We validate the correctness of the semantics by evaluating it on unit tests and LLVM IR programs generated by HELIX.

Scott Mionis, Franz Franchetti and J. Larkin (Proc. High Performance Extreme Computing (HPEC), 2021)

**Optimized Quantum Circuit Generation with SPIRAL **

Preprint (400 KB)

Bibtex

Quantum computers have been at the bleeding edge of computing technology for nearly 40 years and research continues to progress due to their immense promise. However, despite hardware and algorithm breakthroughs, the software infrastructure that compiles programs for these devices requires further development and has traditionally been under-emphasized. In this work, we present a novel approach to compiling more efficient quantum programs. We capture the input algorithm as a high-level mathematical transform and generate a multitude of architecture-compliant programs directly from that specification; this is achieved by casting the problem as a sparse matrix factorization task and recursively searching over a host of applicable divide-and-conquer decomposition rules. This approach allows us to leverage high-level symmetries of the target transform to explore global rewrites and select the best factorization from the top-down, a task that is nearly impossible given only a program stream.We implement the proposed framework with SPIRAL [4], a code generation platform founded on the GAP [6] computer algebra system; we ultimately demonstrate that SPIRAL is a viable supplemental tool for future quantum frameworks.

Jiyuan Zhang (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2020)

**Accelerating the Motifs of Machine Learning Applications on Modern Processors**

Published paper (link to publisher)

Bibtex

Thom Popovici, Martin Schatz, Franz Franchetti and Tze-Meng Low (SIAM Journal on Scientific Computing (SISC), Software and High-Performance Computing, 2020)

**A Flexible Framework for Multi-Dimensional DFTs**

Preprint (1.5 MB)

Published paper (link to publisher)

Bibtex

Anuva Kulkarni (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2020)

**An Approach for Large-Scale Three-Dimensional FFT-Based Approximate Convolutions on GPUs**

Published paper (link to publisher)

Bibtex

Jiyuan Zhang, Yi Lu, Daniele G. Spampinato and Franz Franchetti (Proc. IEEE International Conference on Data Engineering (ICDE), 2020)

**FESIA: A Fast and Efficient Set Intersection Approach on Modern CPUs**

Published paper (link to publisher)

Bibtex

Set intersection is an important operation and widely used in both database and graph analytics applications. However, existing state-of-the-art set intersection methods only consider the size of input sets and fail to optimize for the case in which the intersection size is small. In real-world scenarios, the size of most intersections is usually orders of magnitude smaller than the size of the input sets, e.g., keyword search in databases and common neighbor search in graph analytics. In this paper, we present FESIA, a new set intersection approach on modern CPUs. The time complexity of our approach is O(n/√w+r), in which w is the SIMD width, and n and r are the size of input sets and intersection size, respectively. The key idea behind FESIA is that it first uses bitmaps to filter out unmatched elements from the input sets, and then selects suitable specialized kernels (i.e., small function blocks) at runtime to compute the final intersection on each pair of bitmap segments. In addition, all data structures in FESIA are designed to take advantage of SIMD instructions provided by vector ISAs with various SIMD widths, including SSE, AVX, and the latest AVX512. Our experiments on both real world and synthetic datasets show that our intersection method achieves more than an order of magnitude better performance than conventional scalar implementations, and up to 4x better performance than state-of-the-art SIMD implementations.

Jiyuan Zhang, Daniele G. Spampinato and Franz Franchetti (North East Database Day (NEDB), 2020, Poster)

**FESIA: A Fast and SIMD-Efficient Set Intersection Approach on Modern CPUs**

Published paper (link to publisher)

Bibtex

Franz Franchetti, Daniele G. Spampinato, Anuva Kulkarni, Tze-Meng Low, M. Franusich, Thom Popovici, A. Canning, P. McCorquodale, B. Van Straalen and P. Colella (Exascale Computing Project (ECP) Annual Meeting, 2020, Poster)

**FFT and Solver Libraries for Exascale: FFTX and SpectralPack**

Bibtex

Daisuke Takahashi and Franz Franchetti (Proc. International Conference on High Performance Computing in Asia-Pacific Region (HPCAsia), pp. 114-122, 2020)

**FFTE on SVE: SPIRAL-Generated Kernels**

Published paper (link to publisher)

Bibtex

In this paper we propose an implementation of the fast Fourier transform (FFT) targeting the ARM Scalable Vector Extension (SVE). We performed automatic vectorization via a compiler and an explicit vectorization through code generation by SPIRAL for FFT kernels, and compared the performance. We show that the explicit vectorization of SPIRAL generated code improves performance significantly. Performance results of FFTs on RIKEN's Fugaku processor simulator are reported. With the ARM compiler SPIRAL-generated FFT kernels written in SVE intrinsic are up to 3.16 times faster than FFT kernels of FFTE written in Fortran and up to 5.62 times faster than SPIRAL-generated FFT kernels written in C.

Sanil Rao, A. Kutuluru, Paul Brouwer, S. McMillan and Franz Franchetti (Proc. IEEE High Performance Extreme Computing (HPEC), 2020)

**GBTLX: A First Look**

Preprint (308 KB)

Bibtex

We provide a first look at GBTLX, a code generator that translates graph processing programs written using the GraphBLAS Template Library (GBTL)into high-performance C programs that match the performance of hand-tuned implementations. GBTLX refactors code written using GBTL into problems that capture the signature of algorithms and solvers that capture the semantics (input/output behavior of algorithms. Users provide classes that implement these two aspects using standard GBTL functions and encapsulate the targeted algorithm. GBTLX then performs a sequence of inspection, code generation and high performance execution. First, the user code is traced while running with the original GBTL. Then, the trace is used to define the semantics and signature of the algorithm to be produced in code generation. The SPIRAL system is used to generate high performance C code that implements the user-specified algorithm, specializing the code for algorithm and hardware-dependent optimizations. Finally, the user-provided GBTL-based implementation is replaced by the SPIRAL generated C code. For triangle counting and k-truss enumeration the resulting executables provide performance equivalent to hand-tuned implementations, while the source code is maintainable as it only uses the C++ GBTL library.

Vadim Zaliva (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2020)

**HELIX: From Math to Verified Code**

Published paper (link to publisher)

Bibtex

Anuva Kulkarni, Jelena Kovacevic and Franz Franchetti (Proc. Platform for Advanced Scientific Computing (PASC), Article 13, pp. 1 - 10, 2020)

**Massive Scaling of MASSIF: Algorithm Development and Analysis for Simulation on GPUs**

Published paper (link to publisher)

Bibtex

Micromechanical Analysis of Stress-Strain Inhomogeneities with Fourier transforms (MASSIF) is a large-scale Fortran-based differential equation solver used to study local stresses and strains in materials. Due to its prohibitive memory requirements, it is extremely difficult to port the code to GPUs with small on-device memory. In this work, we present an algorithm design that uses domain decomposition with approximate convolution, which reduces memory footprint to make the MASSIF simulation feasible on distributed GPU systems. A first-order performance model of our method estimates that compression and multi-resolution sampling strategies can enable domain computation within GPU memory constraints for 3D grids larger than those simulated by the current state-of-the-art Fortran MPI implementation. The model analysis also provides an insight into design requirements for further scalability. Lastly, we discuss the extension of our method to irregular domain decomposition and challenges to be tackled in the future.

Scott Mionis, Franz Franchetti and J. Larkin (Proc. Supercomputing (SC), 2020)

**Quantum Circuit Optimization with SPIRAL: A First Look**

Published paper (link to publisher)

Bibtex

Compilation and optimization of quantum circuits is an integral part of the quantum computing toolchain. In many Noisy Intermediate-Scale Quantum (NISQ) devices, only loose connectivity between qubits is maintained, meaning a valid quantum circuit often requires swapping physical qubits in order to satisfy adjacency requirements. Optimizing circuits to minimize such swaps, as well as additional metrics like gate count and circuit depth, is imperative towards utilizing the quantum hardware of both today and the near future. In this work, we leverage SPIRAL, a code generation system for linear transforms built on GAP’s computer algebra system, and present an application of such a system towards optimizing quantum circuits. SPIRAL natively understands tensor products, complex matrices and symbolic matrices, and the proven decomposition and rewriting capabilities are uniquely predisposed to optimize quantum circuits. Specifically, by defining the optimization problem in terms of SPIRAL’s breakdown and rewriting system, we construct a search problem that can be solved with techniquies like dynamic programming. The optimal circuit can then translated to QASM code, where it is executable on a real quantum device. We demonstrate that the power of SPIRAL could provide a valuable tool for future software frameworks.

Mark Blanco, S. McMillan and Tze-Meng Low (Proc. High Performance Extreme Computing (HPEC), 2020)

**Towards an Objective Metric for the Performance of Exact Triangle Count**

Preprint (665 KB)

Bibtex

The performance of graph algorithms is often measured in terms of the number of traversed edges per second (TEPS). However, this performance metric is inadequate for a graph operation such as exact triangle counting. In triangle counting, execution times on graphs with a similar number of edges can be distinctly different as demonstrated by results from the past Graph Challenge entries. We discuss the need for an objective performance metric for graph operations and the desired characteristics of such a metric such that it more accurately captures the interactions between the amount of work performed and the capabilities of the hardware on which the code is executed. Using exact triangle counting as an example, we derive a metric that captures how certain techniques employed in many implementations improve performance. We demonstrate that our proposed metric can be used to evaluate and compare multiple approaches for triangle counting, using a SIMD approach as a case study against a scalar baseline.

Vadim Zaliva, Ilia Zaichuk and Franz Franchetti (Proc. Working Conference on Verified Software: Theories, Tools, and Experiments (VSTTE), 2020)

**Verified Translation Between Purely Functional and Imperative Domain Specific Languages in HELIX**

Published paper (link to publisher)

Bibtex

HELIX is a formally verified language and rewriting engine for generation of high-performance implementation for a variety of numerical algorithms. Based on the existing SPIRAL system, HELIX adds the rigor of formal verification of its correctness using the Coq proof assistant. It formally defines a series of domain-specific languages starting with HCOL, which represents a computation data flow. HELIX works by transforming the original program through a series of intermediate languages, culminating in LLVM IR. In this paper, we will focus on three intermediate languages and the formally verified translation between them. Translation between these three languages is non-trivial, because each subsequent language introduces lower-level abstractions, compared to the previous one. During these steps, we switch from pure-functional language using mixed embedding to a deep-embedded imperative one, also introducing a memory model, lexical scoping, monadic error handling, and transition from abstract algebraic datatype to floatingpoint-numbers. We will demonstrate the design of these languages, the automatic reification between them, and automated proofs of semantic preservation, in Coq.

Thom Popovici, Martin Schatz, Franz Franchetti and Tze-Meng Low (arXiv (Technical Report), 2019)

**A Flexible Framework for Parallel Multi-Dimensional DFTs**

Published paper (link to publisher)

Bibtex

Multi-dimensional discrete Fourier transforms (DFT) are typically decomposed into multiple 1D transforms. Hence, parallel implementations of any multi-dimensional DFT focus on parallelizing within or across the 1D DFT. Existing DFT packages exploit the inherent parallelism across the 1D DFTs and offer rigid frameworks, that cannot be extended to incorporate both forms of parallelism and various data layouts to enable some of the parallelism. However, in the era of exascale, where systems have thousand of nodes and intricate network topologies, flexibility and parallel efficiency are key aspects all multi-dimensional DFT frameworks need to have in order to map and scale the computation appropriately. In this work, we present a flexible framework, built on the Redistribution Operations and Tensor Expressions (ROTE) framework, that facilitates the development of a family of parallel multi-dimensional DFT algorithms by 1) unifying the two parallelization schemes within a single framework, 2) exploiting the two different parallelization schemes to different degrees and 3) using different data layouts to distribute the data across the compute nodes. We demonstrate the need of a versatile framework and thus a need for a family of parallel multi-dimensional DFT algorithms on the K-Computer, where we show almost linear strong scaling results for problem sizes of 1024^3 on 32k compute nodes.

Anuva Kulkarni, Jelena Kovacevic and Franz Franchetti (SIAM Conference on Computational Science and Engineering (CSE), 2019)

**Algorithm Design at Scale: Porting Parallel FFT-based Fortran Simulations to GPUs**

Published paper (link to publisher)

Bibtex

Anuva Kulkarni, Daniele G. Spampinato and Franz Franchetti (Supercomputing, 2019)

**Design and Specification of Large-scale Simulations for GPUs using FFTX**

Published paper (link to publisher)

Bibtex

Large-scale scientific simulations can be ported to heterogeneous environments with GPUs using domain decomposition. However, Fast Fourier Transform (FFT) based simulations require all-to-all communication and large memory, which is beyond the capacity of on-chip GPU memory. To overcome this, domain decomposition solutions are combined with adaptive sampling or pruning around the domain to reduce storage. Expression of such operations is a challenge in existing FFT libraries like FFTW, and thus it is difficult to get a high performance implementation of such methods. We demonstrate algorithm specification for one such simulation (Hooke's law) using FFTX, an emerging API with a SPIRAL-based code generation back-end, and suggest future extensions useful for GPU-based scientific computing.

F. Sadi, Joe Sweeney, Tze-Meng Low, James C. Hoe, Lawrence Pileggi and Franz Franchetti (Proc. MICRO, 2019)

**Efficient SpMV Operation for Large and Highly Sparse Matrices Using Scalable Multi-way Merge Parallelization**

Published paper (link to publisher)

Bibtex

The importance of Sparse Matrix dense Vector multiplication (SpMV) operation in graph analytics and numerous scientific applications has led to development of custom accelerators that are intended to over-come the difficulties of sparse data operations on general purpose architectures. However, efficient SpMV operation on large problem (i.e. working set exceeds on-chip storage) is severely constrained due to strong dependence on limited amount of fast random access memory to scale. Additionally, unstructured matrix with high sparsity pose difficulties as most solutions rely on exploitation of data locality. This work presents an algorithm co-optimized scalable hardware architecture that can efficiently operate on very large (~billion nodes) and/or highly sparse (avg. degree <10) graphs with significantly less on-chip fast memory than existing solutions. A novel parallelization methodology for implementing large and high throughput multi-way merge network is the key enabler of this high performance SpMV accelerator. Additionally, a data compression scheme to reduce off-chip traffic and special computation for nodes with exceptionally large number of edges, commonly found in power-law graphs, are presented. This accelerator is demonstrated with 16-nm fabricated ASIC and Stratix® 10 FPGA platforms. Experimental results show more than an order of magnitude improvement over current custom hardware solutions and more than two orders of magnitude improvement over commercial off-the-shelf (COTS) architectures for both performance and energy efficiency.

Mark Blanco, Tze-Meng Low and Kyungjoo Kim (Proc. High Performance Extreme Computing (HPEC), 2019)

**Exploration of Fine-Grained Parallelism for Load Balancing Eager K-truss on GPU and CPU**

Preprint (735 KB)

Bibtex

In this work we present a performance exploration on Eager K-truss, a linear-algebraic formulation of the K-truss graph algorithm. We address performance issues related to load imbalance of parallel tasks in symmetric, triangular graphs by presenting a fine-grained parallel approach to executing the support computation. This approach also increases available parallelism, making it amenable to GPU execution. We demonstrate our fine-grained parallel approach using implementations in Kokkos and evaluate them on an Intel Skylake CPU and an Nvidia Tesla V100 GPU. Overall, we observe between a 1.26- 1.48x improvement on the CPU and a 9.97-16.92x improvement on the GPU due to our fine-grained parallel formulation.

Franz Franchetti, Daniele G. Spampinato, Anuva Kulkarni, Tze-Meng Low, M. Franusich, Thom Popovici, A. Canning, P. McCorquodale, B. Van Straalen and P. Colella (Exascale Computing Project (ECP) Annual Meeting, 2019)

**FFT and Solvers for Exascale: FFTX and SpectralPACK**

Preprint (1.7 MB)

Published paper (link to publisher)

Bibtex

Anuva Kulkarni, Daniele G. Spampinato and Franz Franchetti (IEEE High Performance Extreme Computing Conference (HPEC), 2019)

**FFTX for Micromechanical Stress-Strain Analysis**

Published paper (link to publisher)

Bibtex

Porting scientific simulations to heterogeneous platforms requires complex algorithmic and optimization strategies to overcome memory and communication bottlenecks. Such operations are inexpressible using traditional libraries (e.g., FFTW for spectral methods) and difficult to optimize by hand for various hardware platforms. In this work, we use our GPU-adapted stress-strain analysis method to show how FFTX, a new API that extends FFTW, can be used to express our algorithm without worrying about code optimization, which is handled by a backend code generator.

Vadim Zaliva and Matthieu Sozeau (Proc. International Workshop on Coq for Programming Languages (CoqPL), 2019)

**Reification of Shallow-Embedded DSLs in Coq with Automated Verification**

Bibtex

Yoko Franchetti, Thomas Nolin and Franz Franchetti (SIAM Conference on Computational Science and Engineering (CSE), 2019)

**Towards Precision Medicine: Simulation Based Parameter Estimation for Drug Metabolism**

Published paper (link to publisher)

Bibtex

F. Sadi (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2018)

**Accelerating Sparse Matrix Kernels with Co-optimized Architecture**

Published paper (link to publisher)

Bibtex

Anuva Kulkarni, Franz Franchetti and Jelena Kovacevic (, 2018)

**Algorithm Design for Large Scale FFT-Based Simulations on CPU-GPU Platforms**

Preprint (1.2 MB)

Published paper (link to publisher)

Bibtex

Extreme memory requirements and high communication overhead prevent scaling of large scale iterative simulations involving parallel Fast Fourier Transforms (FFTs) to higher grid sizes, which is necessary for high resolution analysis. To overcome these limitations, we propose an algorithm to run stress-strain simulations on CPU-GPU platforms for larger problem sizes using irregular domain decomposition and local FFTs. Early results show that our method lowers iteration cost without adversely impacting accuracy of the result.

Anuva Kulkarni, Franz Franchetti and Jelena Kovacevic (Proc. High Performance Extreme Computing (HPEC), 2018)

**Algorithm Design for Large Scale Parallel FFT-Based Simulations on Heterogeneous Platforms**

Preprint (237 KB)

Published paper (link to publisher)

Bibtex

Large scale iterative simulations involving parallel Fast Fourier Transforms (FFTs) have extreme memory requirements and high communication overhead. This prevents scaling to higher grid sizes, which is necessary for high resolution analysis. In this work, we describe an algorithm to overcome these limitations and run stress-strain simulations for larger problem sizes using irregular domain decomposition and local FFTs. Early results show that our method lowers iteration cost without adversely impacting accuracy of the result.

Thom Popovici (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2018)

**An Approach to Specifying and Automatically Optimizing Fourier Transform Based Operations **

Published paper (link to publisher)

Bibtex

Vit Ruzicka and Franz Franchetti (Proc. IEEE High Performance Extreme Computing (HPEC), 2018)

**Fast and Accurate Object Detection in High Resolution 4K and 8K Video Using GPUs**

Preprint (1.7 MB)

Published paper (link to publisher)

Bibtex

Machine learning has celebrated a lot of achievements on computer vision tasks such as object detection, but the traditionally used models work with relatively low resolution images. The resolution of recording devices is gradually increasing and there is a rising need for new methods of processing high resolution data. We propose an attention pipeline method which uses two staged evaluation of each image or video frame under rough and refined resolution to limit the total number of necessary evaluations. For both stages, we make use of the fast object detection model YOLO v2. We have implemented our model in code, which distributes the work across GPUs. We maintain high accuracy while reaching the average performance of 3-6 fps on 4K video and 2 fps on 8K video.

Franz Franchetti, Daniele G. Spampinato, Anuva Kulkarni, Thom Popovici, Tze-Meng Low, M. Franusich, A. Canning, P. McCorquodale, B. Van Straalen and P. Colella (Proc. IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 2018)

**FFTX and SpectralPack: A First Look**

Preprint (283 KB)

Published paper (link to publisher)

Bibtex

We propose FFTX, a new framework for building high-performance FFT-based applications on exascale machines. Complex node architectures lead to multiple levels of parallelism and demand efficient ways of data communication. The current FFTW interface falls short in maximizing performance in such scenarios. FFTX is designed to enable application developers to leverage expert-level, automatic optimizations while navigating a familiar interface. FFTX is backwards compatible to FFTW and extends the FFTW interface into an embedded Domain Specific Language (DSL) expressed as a library interface. By means of a SPIRAL-based back end, this enables build-time source-to-source translation and advanced performance optimizations, such as cross-library calls optimizations, targeting of accelerators through offloading, and inlining of user-provided kernels. We demonstrate the use of FFTX with the prototypical example of 1D and 3D pruned convolutions and discuss future extensions.

Vadim Zaliva and Franz Franchetti (Proc. Workshop on Functional High Performance Computing (FHPC), 2018)

**HELIX: A Case Study of a Formal Verification of High Performance Program Generation**

Preprint (732 KB)

Published paper (link to publisher)

Bibtex

In this paper, we present HELIX, a formally verified operator language and rewriting engine for generation of high performance implementation for a variety of linear algebra algorithms. Based on the existing SPIRAL system, HELIX adds the rigor of formal verification of its correctness using Coq proof assistant. It formally defines two domain-specific languages: HCOL, which represents a computation data flow and Σ-HCOL, which extends HCOL with iterative computations. A framework for automatically proving semantic preservation of expression rewriting for both languages is presented. The structural properties of the dataflow graph which allow efficient compilation are formalized, and a monadic approach to tracking them and to reasoning about structural correctness of Σ-HCOL expressions is presented.

Jiyuan Zhang, Franz Franchetti and Tze-Meng Low (Proc. International Conference on Machine Learning (ICML), 2018)

**High Performance Zero-Memory Overhead Direct Convolutions**

Preprint (1.7 MB)

Published paper (link to publisher)

Bibtex

The computation of convolution layers in deep neural networks typically rely on high performance routines that trade space for time by using additional memory (either for packing purposes or required as part of the algorithm) to improve performance. The problems with such an approach are two-fold. First, these routines incur additional memory overhead which reduces the overall size of the network that can fit on embedded devices with limited memory capacity. Second, these high performance routines were not optimized for performing convolution, which means that the performance obtained is usually less than conventionally expected. In this paper, we demonstrate that direct convolution, when implemented correctly, eliminates all memory overhead, and yields performance that is between 10% to 400% times better than existing high performance implementations of convolution layers on conventional and embedded CPU architectures. We also show that a high performance direct convolution exhibits better scaling performance, i.e. suffers less performance drop, when increasing the number of threads.

Thom Popovici, Tze-Meng Low and Franz Franchetti (Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS), IEEE, 2018)

**Large Bandwidth-Efficient FFTs on Multicore and Multi-Socket Systems**

Preprint (543 KB)

Published paper (link to publisher)

Bibtex

Current microprocessor trends show a steady increase in the number of cores and/or threads present on the same CPU die. While this increase improves performance for computebound applications, the benefits for memory-bound applications are limited. The discrete Fourier transform (DFT) is an example of such a memory-bound application, where increasing the number of cores does not yield a corresponding increase in performance. In this paper, we present an alternate solution for using the increased number of cores/threads available on a typical multicore system. We propose to repurpose some of the cores/threads as soft Direct Memory Access (DMA) engines so that data is moved on and off chip while computation is performed. Overlapping memory accesses with computation permits us to preload and reshape data so that computation is more efficient. We show that despite using fewer cores/threads for computation, our approach improves performance relative to MKL and FFTW by 1.2x to 3x for large multi-dimensional DFTs of up to 2048^3 on one and two-socket Intel and AMD systems.

Tze-Meng Low, Daniele G. Spampinato, A. Kutuluru, U. Sridhar, Thom Popovici, Franz Franchetti and S. McMillan (Proc. High Performance Extreme Computing (HPEC), 2018)

**Linear Algebraic Formulation of Edge-centric K-truss Algorithms with Adjacency Matrices**

Preprint (501 KB)

Published paper (link to publisher)

Bibtex

Edge-centric algorithms using the linear algebraic approach typically require the use of both the incidence and adjacency matrices. Due to the two different data formats, the information contained in the graph is replicated, thereby incurring both time and space for the replication. Using K-truss as an example, we demonstrate that an edge-centric K-truss algorithm, the Eager K-truss algorithm, can be identified from a linear algebraic formulation using only the adjacency matrix. In addition, we demonstrate that our implementation of the linear algebraic edge-centric K-truss algorithm out-performs a Galois’ K-truss implementation by an average (over 53 graphs) of more than 13 times, and up to 71 times.

F. Sadi, Joe Sweeney, S. McMillan, Tze-Meng Low, James C. Hoe, Lawrence Pileggi and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2018)

**PageRank Acceleration for Large Graphs with Scalable Hardware and Two-Step SpMV**

Preprint (1.7 MB)

Published paper (link to publisher)

Bibtex

PageRank is an important vertex ranking algorithm that suffers from poor performance and efficiency due to notorious memory access behavior. Furthermore, when graphs become bigger and sparser, PageRank applications are inhibited as most current solutions profoundly rely on large random access fast memory, which is not easily scalable. In this paper we present a 16nm ASIC based shared memory platform for PageRank implementation that fundamentally accelerates Sparse Matrix dense Vector multiplication (SpMV), the core kernel of PageRank. This accelerator is scalable, guarantees full DRAM streaming and reduces off-chip communication. More importantly, it is capable of handling very large graphs (2 billion vertices) despite using significantly less fast random access memory than current solutions. Experimental results show that our proposed accelerator is able to yield order of magnitude improvement in both energy efficiency and performance over state of the art shared memory commercial off-the-shelf (COTS) solutions.

Jiyuan Zhang, Daniele G. Spampinato, S. McMillan and Franz Franchetti (Proc. IEEE High Performance Extreme Computing (HPEC), 2018)

**Preliminary Exploration on Large-Scale Triangle Counting in Shared-Memory Multicore System**

Preprint (1 MB)

Published paper (link to publisher)

Bibtex

As triangle counting is becoming a widely-used building block for a large amount of graph analytics applications, there is a growing need to make it run fast and scalable on large parallel systems. In this work we conduct a preliminary exploration on the optimizations of triangle counting algorithms on shared-memory system with large dataset.

Daniele G. Spampinato, Diego Fabregat-Traver, Paolo Bientinesi and Markus Püschel (Proc. International Symposium on Code Generation and Optimization (CGO), pp. 327-339, 2018)

**Program Generation for Small-Scale Linear Algebra Applications**

Preprint (2.3 MB)

Published paper (link to publisher)

Bibtex

We present SLinGen, a program generation system for linear algebra. The input to SLinGen is an application expressed mathematically in a linear-algebra-inspired language (LA) that we define. LA provides basic scalar/vector/matrix additions/multiplications and higher level operations including linear systems solvers, Cholesky and LU factorizations. The output of SLingen is performance-optimized single-source C code, optionally vectorized with intrinsics. The target of SLinGen are small-scale computations on fixed-size operands, for which a straightforward implementation using optimized libraries (e.g., BLAS or LAPACK) is known to yield suboptimal performance (besides increasing code size and introducing dependencies), but which are crucial in control, signal processing, computer vision, and other domains. Internally, SLinGen uses synthesis and DSL-based techniques to optimize at a high level of abstraction. We benchmark our program generator on three prototypical applications: the Kalman filter, Gaussian process regression, and an L1-analysis convex solver, as well as basic routines including Cholesky factorization and solvers for the continuous-time Lyapunov and Sylvester equations. The results show significant speed-ups compared to straightforward C with Intel icc and clang with a polyhedral optimizer, as well as library-based and template-based implementations.

Franz Franchetti, Tze-Meng Low, Thom Popovici, Richard Veras, Daniele G. Spampinato, Jeremy Johnson, Markus Püschel, James C. Hoe and José M. F. Moura (Proceedings of the IEEE, special issue on ``From High Level Specification to High Performance Code'', Vol. 106, No. 11, 2018)

**SPIRAL: Extreme Performance Portability**

Preprint (6.6 MB)

Published paper (link to publisher)

Bibtex

In this paper, we address the question of how to automatically map computational kernels to highly efficient code for a wide range of computing platforms and establish the correctness of the synthesized code. More specifically, we focus on two fundamental problems that software developers are faced with: performance portability across the ever-changing landscape of parallel platforms and correctness guarantees for sophisticated floating-point code. The problem is approached as follows: We develop a formal framework to capture computational algorithms, computing platforms, and program transformations of interest, using a unifying mathematical formalism we call operator language (OL). Then we cast the problem of synthesizing highly optimized computational kernels for a given machine as a strongly constrained optimization problem that is solved by search and a multistage rewriting system. Since all rewrite steps are semantics preserving, our approach establishes equivalence between the kernel specification and the synthesized program. This approach is implemented in the SPIRAL system, and we demonstrate it with a selection of computational kernels from the signal and image processing domain, software-defined radio, and robotic vehicle control. Our target platforms range from mobile devices, desktops, and server multicore processors to large-scale high performance and supercomputing systems, and we demonstrate performance comparable to expertly hand-tuned code across kernels and platforms.

Matthias Bolten, Franz Franchetti, P. H. J. Kelly, Christian Lengauer and Marcus Mohr (Concurrency and Computation: Practice and Experience, 2017)

**Algebraic Description and Automatic Generation of Multigrid Methods in SPIRAL**

Preprint (345 KB)

Published paper (link to publisher)

Bibtex

SPIRAL is an autotuning, program generation and code synthesis system that offers a fully automatic generation of highly optimized target codes, customized for the specific execution platform at hand. Initially, SPIRAL was targeted at problem domains in digital signal processing, later also at basic linear algebra. We open SPIRAL up to a new, practically relevant and challenging domain: multigrid solvers. SPIRAL is driven by algebraic transformation rules. We specify a set of such rules for a simple multigrid solver with a Richardson smoother for a discretized square 2D Poisson equation with Dirichlet boundary conditions. We present the target code that SPIRAL generates in static single-assignment form and discuss its performance. While this example required no changes of or extensions to the SPIRAL system, more complex multigrid solvers may require small adaptations.

F. Sadi, Lawrence Pileggi and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), IEEE, pp. 1-7, 2017)

**Algorithm and Hardware Co-Optimized Solution for Large SpMV Problems**

Preprint (1.1 MB)

Published paper (link to publisher)

Bibtex

Sparse Matrix-Vector multiplication (SpMV) is a fundamental kernel for many scientific and engineering applications. However, SpMV performance and efficiency are poor on commercial of-the-shelf (COTS) architectures, specially when the data size exceeds on-chip memory or last level cache (LLC). In this work we present an algorithm co-optimized hardware accelerator for large SpMV problems. We start with exploring the basic difference in data transfer characteristics for various SpMV algorithms. We propose an algorithm that requires the least amount of data transfer while ensuring main memory streaming for all accesses. However, the proposed algorithm requires an efficient multi-way merge, which is difficult to achieve with COTS architectures. Hence, we propose a hardware accelerator model that includes an Application Specific Integrated Circuit (ASIC)for the muti-way merge operation. The proposed accelerator incorporates state of the art 3D stacked High Bandwidth Memory (HBM) in order to demonstrate the proposed algorithm’s capability coupled with the latest technologies. Simulation results using standard benchmarks show improvements of over 100x against COTS architectures with commercial libraries for both energy efficiency and performance.

Daniele G. Spampinato (PhD. thesis, Computer Science, ETH Zurich, Switzerland, 2017)

**A Linear Algebra Compiler for Small Problem Sizes**

Published paper (link to publisher)

Bibtex

Many applications in media processing, control, graphics, and other domains require efficient small-scale linear algebra computations. However, most existing high performance libraries for linear algebra, such as ATLAS or Intel MKL are more geared towards large-scale problems (matrix sizes in the hundreds and larger), specific interfaces (e.g., BLAS and LAPACK), and cannot specialize to problem instances. In other domains, program generators have proven effective to automatically produce specialized code for important building blocks. An example is the Spiral generator for linear transforms and its extension to support other functionalities, such as matrix-matrix multiplication and Viterbi decoding. Learning from both libraries and generators, in this dissertation we aim for the automatic generation of fast code for small scale, linear algebra computations. We introduce a program synthesis framework for small-scale, linear algebra computations of fixed size. These include computations on matrices, vectors, and scalars using basic operators as well as higher-level computations such as the Cholesky decomposition, solvers for the continuous-time Lyapunov and Sylvester equations, and the inversion of triangular matrices. The input to our framework is a linear algebra program composed of several fixed size basic and higher-level computations; the output is a corresponding C function optionally including intrinsics to efficiently use SIMD vector extensions. In our framework, support for for small-scale, basic linear algebra computations is provided by the LGen compiler. LGen generates code using two levels of mathematical domain-specific languages (DSLs). The DSLs are used to perform tiling, loop fusion, and vectorization at a high level of abstraction, before the final code is generated. In particular, LGen’s vectorization approach can easily extend to different vector ISAs. In addition, search is used to select among alternative generated implementations. LGen also supports computations whose matrices have structure, such as symmetric or triangular, that reduces the cost of the computation. For example, dense linear systems of equations are solved by first reducing to triangular form and problems in optimization may yield matrices with different kinds of structures. The BLAS interface provides a small set of structured matrix computations, chosen to serve a certain set of higher-level functions supported by LAPACK. However, if a given computation contains a structure that is not supported by standard interfaces then its computation using a more general version of it (e.g., multiplying two general matrices instead of two structured ones) would lose the benefits of the structure. We address this problem by combining the LGen compiler with techniques from polyhedral compilation to mathematiiii cally capture matrix structures. In this dissertation, we consider triangular and symmetric matrices and discuss the approach extensibility to a much larger set including blocked structures. Finally, support for higher-level linear algebra computations is included building on LGen and the FLAME-based algorithm synthesis tool Cl1ck. When a higher-level computation is encountered in an input linear algebra program, a set of basic computations is synthetized starting from one of its algorithmic definitions. If desired, the framework’s autotuner explores alternative algorithmic variants to select the one that yields the best performance. We evaluate the performance of the code generated with our framework for several linear algebra computations, including library compliant and noncompliant basic and higher-level computations and the computation of a Kalman filter. Experimental results show that our framework produces code that performs in many cases considerably better than well-established, commercial and non-commercial libraries, prior software generators, and compilers.

Richard Veras and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), IEEE, pp. 1-7, 2017)

**A Scale-free Structure for Real World Networks**

Preprint (911 KB)

Published paper (link to publisher)

Bibtex

The field of High Performance Computing (HPC) is defined by application in physics and engineering. These problems drove the development of libraries such as LAPACK, which cast their performance in terms of more specialized building block such as the BLAS. Now that we see a rise in simulation and computational analysis in fields such as biology and the social sciences, how do we leverage existing HPC approaches to these domains. The GraphBLAS project reconciles graph analytics with the machinery of linear algebra libraries. Like their Dense Linear Algebra (DLA) counterpart, the GraphBLAS expresses complex operations in terms of smaller primitives. This paper focuses on efficiently storing real world networks, such that for these graph primitives we can obtain the level of performance seen in DLA. We provide a hierarchical data structured called GERMV, which is an extension of our previous Recursive Matrix Vector (RMV). If the network in question exhibits a scale-free structure, namely hierarchical communities, then our data structure enables high performance. We demonstrate high performance for Sparse Matrix Vector (spMV) and PageRank on real world web graphs.

Tze-Meng Low, Varun Rao, Matthew Lee, Thom Popovici, Franz Franchetti and S. McMillan (Proc. High Performance Extreme Computing (HPEC), IEEE, pp. 1-6, 2017)

**First Look: Linear Algebra-based Triangle Counting Without Matrix Multiplication**

Preprint (731 KB)

Published paper (link to publisher)

Bibtex

Linear algebra-based approaches to exact triangle counting often require sparse matrix multiplication as a primitive operation. Non-linear algebra approaches to the same problem often assume that the adjacency matrix of the graph is not available. In this paper, we show that both approaches can be unified into a single approach that separates the data format from the algorithm design. By not casting the triangle counting algorithm into matrix multiplication, a different algorithm that counts each triangle exactly once can be identified. In addition, by choosing the appropriate sparse matrix format, we show that the same algorithm is equivalent to the compact-forward algorithm attained assuming that the adjacency matrix of the graph is not available. We show that our approach yields an initial implementation that is between 69 and more than 2000 times faster than the reference implementation. We also show that the initial implementation can be easily parallelized on shared memory systems.

Tze-Meng Low and Franz Franchetti (Proc. IEEE International Symposium on High Assurance Systems Engineering (HASE), 2017)

**High Assurance Code Generation for Cyber-Physical Systems**

Preprint (188 KB)

Published paper (link to publisher)

Bibtex

High Assurance SPIRAL (HA-SPIRAL) is a tool that synthesizes a faithful and high performance implementation from the mathematical specification of a given controller or monitor. At the heart of HA-SPIRAL is a mathematical identity rewrite engine based on a computer algebra system. The rewrite engine refines the mathematical expression provided by a control engineer, through mathematical identities, into an equivalent mathematical expression that can be implemented in code. In this paper, we discuss the use of HA-SPIRAL in generating provably correct and high-performance implementations for different controllers and monitors for autonomous land and air vehicles.

Franz Franchetti, Tze-Meng Low, Stefan Mitsch, Juan Pablo Mendoza, Liangyan Gui, Amarin Phaosawasdi, David Padua, Soummya Kar, José M. F. Moura, M. Franusich, Jeremy Johnson, Andre' Platzer and Manuela Veloso (IEEE Control Systems Magazine, 2017)

**High-Assurance SPIRAL: End-to-End Guarantees for Robot and Car Control**

Preprint (4.4 MB)

Published paper (link to publisher)

Bibtex

Cyber-physical systems (CPSs), ranging from critical infrastructures such as power plants, to modern (semi) autonomous vehicles, are systems that use software to control physical processes. CPSs are made up of many different computational components. Each component runs its own piece of software that implements its control algorithms, based on its model of the environment. Every component then interacts with other components through the signals and values it sends out. Collectively, these components, and the code they run, drive the complex behaviors modern society has come to expect and rely on. Due to these intricate interactions between components, managing the hundreds to millions of lines of software to ensure that the system, as a whole, performs as desired can often be unwieldy.

Thom Popovici, Franz Franchetti and Tze-Meng Low (Proc. High Performance Extreme Computing (HPEC), IEEE, pp. 1-7, 2017)

**Mixed Data Layout Kernels for Vectorized Complex Arithmetic**

Preprint (414 KB)

Published paper (link to publisher)

Bibtex

Implementing complex arithmetic routines with Single Instruction Multiple Data (SIMD) instructions requires the use of instructions that are usually not found in their real arithmetic counter-parts. These instructions, such as shuffles and addsub, are often bottlenecks for many complex arithmetic kernels as modern architectures usually can perform more real arithmetic operations than execute instructions for complex arithmetic. In this work, we focus on using a variety of data layouts (mixed format) for storing complex numbers at different stages of the computation so as to limit the use of these instructions. Using complex matrix multiplication and Fast Fourier Transforms (FFTs) as our examples, we demonstrate that performance improvements of up to 2x can be attained with mixed format within the computational routines. We also described how existing algorithms can be easily modified to implement the mixed format complex layout.

H. V. Koops, Kashish Garg, Munsung (Bill) Kim, Jonathan Li, Anja Volk and Franz Franchetti (Proc. Multisensor Fusion and Integration for Intelligent Systems (MFI), IEEE, pp. 15-21, 2017)

**Multirotor UAV State Prediction Through Multi-microphone Side-channel Fusion**

Preprint (1.5 MB)

Published paper (link to publisher)

Bibtex

Improving trust in the state of Cyber-Physical Systems becomes increasingly important as more Cyber-Physical Systems tasks become autonomous. Research into the sound of Cyber-Physical Systems has shown that audio side-channel information from a single microphone can be used to accurately model traditional primary state sensor measurements such as speed and gear position. Furthermore, data integration research has shown that information from multiple heterogeneous sources can be integrated to create improved and more reliable data. In this paper, we present a multi-microphone machine learning data fusion approach to accurately predict ascending/hovering/descending states of a multi-rotor UAV in ight. We show that data fusion of multiple audio classiers predicts these states with accuracies over 94%. Furthermore, we significantly improve the state predictions of single microphones, and outperform several other integration methods. These results add to a growing body of work showing that microphone sidechannel approaches can be used in Cyber-Physical Systems to accurately model and improve the assurance of primary sensors measurements.

G. Xu, Tze-Meng Low, James C. Hoe and Franz Franchetti (High Performance Extreme Computing Conference (HPEC), 2017)

**Optimizing FFT Resource Efficiency of FPGA using High-Level Synthesis**

Preprint (572 KB)

Bibtex

The practical use of High-level Synthesis (HLS) for FPGAs may compromise resource efﬁciency, which is a common design goal in FPGA programming. Speciﬁcally, when designing the iterative datapath for computing a FFT, combining naive HLS code with compiler optimizations can fail to saturate the BRAM ports and incur signiﬁcant penalty in clock frequency. To solve this problem, we suggest that the target datapath should be explicit in HLS code. Using this solution, our FFT cores reduce the FFT latency by N cycles for the radix-2 N-point FFT comparing with previous HLS-based work and the Xilinx LogiCORE FFT. Meanwhile, our designs achieve comparable peak frequency and consume similar resources to Xilinx.

H. V. Koops, Kashish Garg, Munsung (Bill) Kim, Jonathan Li, Anja Volk and Franz Franchetti (Technical report UU-CS-2017-001, Dept. of Information and Computing Sciences, Utrecht University, 2017)

**Prediction of Quadcopter State through Multi-Microphone Side-Channel Fusion**

Published paper (link to publisher)

Bibtex

Georg Ofenbeck, Tiark Rompf and Markus Püschel (Proc. International Conference on Generative Programming: Concepts & Experiences (GPCE), pp. 15-28, 2017)

**Staging for Generic Programming in Space and Time**

Preprint (1.3 MB)

Published paper (link to publisher)

Bibtex

Metaprogramming is among the most promising candidates to solve the abstraction vs performance trade-off that plagues software engineering through \emph{specialization}. Metaprogramming has been used to enable low-overhead generic programming for a long time, with C++ templates being one of the most prominent examples. But often a single, fixed pattern of specialization is not enough, and more flexibility is needed. Hence, this paper seeks to apply generic programming techniques to challenges in metaprogramming, in particular to abstract over the execution stage of individual program expressions. We thus extend the scope of generic programming into the dimension of time. The resulting notion of stage polymorphism enables novel abstractions in the design of program generators, which we develop and explore in this paper. We present one possible implementation, in Scala using the lightweight modular staging (LMS) framework, and apply it to two important case studies: convolution on images and the fast Fourier transform (FFT).

Richard Veras (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2017)

**The Automatic Generation of High-Performance Graph Analytic Code using Spiral**

Bibtex

F. Sadi, Lawrence Pileggi and Franz Franchetti (High Performance Extreme Computing Conference (HPEC), 2016)

**3D DRAM Based Application Specific Hardware Accelerator for SpMV**

Preprint (709 KB)

Published paper (link to publisher)

Bibtex

For numerous scientific applications Sparse Matrix-Vector multiplication (SpMV) is one of the most important kernels. Unfortunately, due to its very low ratio of computation to memory access SpMV is inherently a memory bound problem. On the other hand, the main memory bandwidth of commercial off-the-shelf (COTS) architectures is insufficient for available computation resources on these platforms, well known as the memory wall problem. As a result, COTS architectures are unsuitable for SpMV. Furthermore, SpMV requires random access into a memory space which is far too big for cache. Hence, it becomes difficult to utilize the main memory bandwidth which is already scarce. We propose an algorithm for large SpMV problems which is specially optimized to fully exploit the underlying micro-architecture and overall system capabilities. This algorithm is implemented in two steps. The key feature of the first step is that it converts all the main memory random access into streaming access. This reduces the overall data transfer volume significantly and ensures full utilization of the memory bandwidth. On top of that, we propose a meta-data compression technique, namely Variable Length Delta Index (VLDI), to decrease the data transfer volume even further. VLDI is particularly effective for sparse matrices where meta-data to payload ratio is high, e.g. sparse bit matrices.

Daniele G. Spampinato and Markus Püschel (Proc. International Symposium on Code Generation and Optimization (CGO), pp. 117-127, 2016)

**A Basic Linear Algebra Compiler for Structured Matrices**

Preprint (1 MB)

Published paper (link to publisher)

Bibtex

Many problems in science and engineering are in practice modelled and solved through matrix computations. Often, the matrices involved have structure such as symmetric or triangular, which reduces the operations count needed to perform a computation. For example, dense linear systems of equations are solved by first converting to triangular form and optimization problems may yield matrices with any kind of structure. The well-known BLAS (basic linear algebra subprograms) interface provides a small set of structured matrix computations, chosen to serve a certain set of higher level functions (LAPACK). However, if a user encounters a computation or structure that is not supported, she has to forego the benefits of the structure and choose a generic functionality. In this paper, we address this problem by providing a compiler that translate a given basic linear algebra computation on structured matrices into optimized C code, optionally vectorized with intrinsics. Our work combines prior work on the Spiral-like LGen compiler with techniques from polyhedral compilation to mathematically capture matrix structures. In the paper we consider upper/lower triangular and symmetric matrices but the approach is extensible to a much larger set including blocked structures. We run experiments on a modern Intel platform against the Intel MKL library and a baseline implementation showing competitive performance results for both BLAS and non-BLAS functionalities.

Qi Guo, Tianshi Chen, Y. Chen and Franz Franchetti (IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 35, No. 3, pp. 433-446, 2016)

**Accelerating Architectural Simulation Via Statistical Techniques: A Survey**

Preprint (1.1 MB)

Published paper (link to publisher)

Bibtex

In computer architecture research and development, simulation is a powerful way of acquiring and predicting processor behaviors. While architectural simulation has been extensively utilized for computer performance evaluation, design space exploration, and computer architecture assessment, it still suffers from the high computational costs in practice. Specifically, the total simulation time is determined by the simulator’s raw speed and the total number of simulated instructions. The simulator’s speed can be improved by enhanced simulation infrastructures (e.g., simulators with high-level abstraction, parallel simulators, and hardware-assisted simulators). Orthogonal to these work, recent studies also managed to significantly reduce the total number of simulated instructions with a slight loss of accuracy. Interestingly, we observe that most of these work are built upon statistical techniques. This survey presents a comprehensive review to such studies and proposes a taxonomy based on the sources of reduction. In addition to identifying the similarities and differences of state-of-the-art approaches, we further discuss insights gained from these studies as well as implications for future research.

Richard Veras, Tze-Meng Low and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), pp. 1-7, 2016)

**A Scale-Free Structure for Power-Law Graphs**

Preprint (1.5 MB)

Published paper (link to publisher)

Bibtex

Many real-world graphs, such as those that arise from the web, biology and transportation, appear random and without a structure that can be exploited for performance on modern computer architectures. However, these graphs have a scale-free graph topology that can be leveraged for locality. Existing sparse data formats are not designed to take advantage of this structure. They focus primarily on reducing storage requirements and improving the cost of certain matrix operations for these large data sets. Therefore, we propose a data structure for storing real-world scale-free graphs in a sparse and hierarchical fashion. By maintaining the structure of the graph, we preserve locality in the graph and in the cache. For synthetic scale-free graph data we outperform the state of the art for graphs with up to 10^7 non-zero edges.

Joya Deri, Franz Franchetti and José M. F. Moura (Proc. IEEE International Conference on Big Data (Big Data), IEEE, pp. 2616-2625, 2016)

**Big Data Computation of Taxi Movement in New York City**

Preprint (2.7 MB)

Published paper (link to publisher)

Bibtex

We seek to extract and explore statistics that characterize New York City traffic flows based on 700 million taxi trips in the 2010-2013 New York City taxi data. This paper presents a two-part solution for intensive computation: space and time design considerations for estimating taxi trajectories with Dijkstra’s algorithm, and job parallelization and scheduling with HTCondor. Our contribution is to present a solution that reduces execution time from 3,000 days to less than a day with detailed analysis of the necessary design decisions.

Richard Veras, Thom Popovici, Tze-Meng Low and Franz Franchetti (Proc. Workshop on Programming Models for SIMD/Vector Programming (WPMVP), 2016)

**Compilers, Hands-Off My Hands-On Optimizations**

Preprint (332 KB)

Published paper (link to publisher)

Bibtex

Achieving high performance for compute bounded numerical kernels typically requires an expert to hand select an appropriate set of Single-instruction multiple-data (SIMD) instructions, then statically scheduling them in order to hide their latency while avoiding register spilling in the process. Unfortunately, this level of control over the code forces the expert to trade programming abstraction for performance which is why many performance critical kernels are written in assembly language. An alternative is to either resort to auto-vectorization (see Figure 1) or to use intrinsic functions, both features oered by compilers. However, in both scenarios the expert loses control over which instructions are selected, which optimizations are applied to the code and moreover how the instructions are scheduled for a target architecture. Ideally, the expert would need assembly-like control over their SIMD instructions beyond what intrinsics provide while maintaining a C-level abstraction for the non-performance critical parts. In this paper, we bridge the gap between performance and abstraction for SIMD instructions through the use of custom macro intrinsics that provide the programmer control over the instruction selection, and scheduling, while leveraging the compiler to manage the registers. This provides the best of both assembly and vector intrinsics programming so that a programmer can obtain high performance implementations within the C programming language.

Marcela Zuluaga, Andreas Krause and Markus Püschel (Journal of Machine Learning Research, Vol. 17, No. 104, pp. 1-32, 2016)

**e-PAL: An Active Learning Approach to the Multi-Objective Optimization Problem**

Published paper (link to publisher)

Bibtex

In many fields one encounters the challenge of identifying out of a pool of possible designs those that simultaneously optimize multiple objectives. In many applications an exhaustive search for the Pareto-optimal set is infeasible. To address this challenge, we propose the ϵ-Pareto Active Learning (ϵ-PAL) algorithm which adaptively samples the design space to predict a set of Pareto-optimal solutions that cover the true Pareto front of the design space with some granularity regulated by a parameter ϵ. Key features of ϵ-PAL include (1) modeling the objectives as draws from a Gaussian process distribution to capture structure and accommodate noisy evaluation; (2) a method to carefully choose the next design to evaluate to maximize progress; and (3) the ability to control prediction accuracy and sampling cost. We provide theoretical bounds on ϵ-PAL's sampling cost required to achieve a desired accuracy. Further, we perform an experimental evaluation on three real-world data sets that demonstrate ϵ-PAL's effectiveness; in comparison to the state-of-the-art active learning algorithm PAL, ϵ-PAL reduces the amount of computations and the number of samples from the design space required to meet the user's desired level of accuracy. In addition, we show that ϵ-PAL improves significantly over a state-of-the-art multi- objective optimization method, saving in most cases 30% to 70% evaluations to achieve the same accuracy.

Berkin Akin, Franz Franchetti and James C. Hoe (IEEE Micro, Vol. 36, No. 1, pp. 14-23, 2016)

**HAMLeT Architecture for Parallel Data Reorganization in Memory**

Published paper (link to publisher)

Bibtex

3D-stacked integration of DRAM and logic layers using through-silicon via (TSV) technology has given rise to a new interpretation of near-data processing (NDP) concepts that were proposed decades ago. However, processing capability within the stack is limited by stringent power and thermal constraints. Simple processing mechanisms with intensive memory accesses, such as data reorganization, are an effective means of exploiting 3D stacking-based NDP. Data reorganization handled completely in memory improves the host processor's memory access performance. However, in-memory data reorganization performed in parallel with host memory accesses raises issues, including interference, bandwidth allocation, and coherence. Previous work has mainly focused on performing data reorganization while blocking host accesses. This article details data reorganization performed in parallel with host memory accesses, providing mechanisms to address host/NDP interference, flexible bandwidth allocation, and in-memory coherence.

J. Kepner, P. Aaltonen, D. Bader, A. Buluc, Franz Franchetti, J. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenke, S. McMillan, J. Moreira, J. D. Owens, C. Yang, M. Zalewski and T. Mattson (Proc. High Performance Extreme Computing (HPEC), 2016)

**Mathematical Foundations of the GraphBLAS**

Published paper (link to publisher)

Bibtex

The GraphBLAS standard (GraphBlas.org) is being developed to bring the potential of matrix-based graph algorithms to the broadest possible audience. Mathematically, the GraphBLAS defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments. This paper provides an introduction to the mathematics of the GraphBLAS. Graphs represent connections between vertices with edges. Matrices can represent a wide range of graphs using adjacency matrices or incidence matrices. Adjacency matrices are often easier to analyze while incidence matrices are often better for representing data. Fortunately, the two are easily connected by matrix multiplication. A key feature of matrix mathematics is that a very small number of matrix operations can be used to manipulate a very wide range of graphs. This composability of a small number of operations is the foundation of the GraphBLAS. A standard such as the GraphBLAS can only be effective if it has low performance overhead. Performance measurements of prototype GraphBLAS implementations indicate that the overhead is low.

Francois Serre and Markus Püschel (Proc. FPGA, pp. 215-223, 2016)

**Optimal Circuits for Streamed Linear Permutations using RAM**

Preprint (639 KB)

Published paper (link to publisher)

Bibtex

We propose a method to automatically derive hardware structures that perform a fixed linear permutation on streaming data. Linear permutations are permutations that map linearly the bit representation of the elements addresses. This set contains many of the most important permutations in media processing, communication, and other applications and includes perfect shuffles, stride permutations, and the bit reversal. Streaming means that the data to be permuted arrive as a sequence of chunks over several cycles. We solve this problem by mathematically decomposing a given permutation into a sequence of three permutations that are either temporal or spatial. The former are implemented as banks of RAM, the latter as switching networks. We prove optimality of our solution in terms of the number of switches in these networks.

Marcela Zuluaga, Peter A. Milder and Markus Püschel (ACM Transactions on Design Automation of Electronic Systems, Vol. 21, No. 4, pp. 55, 2016)

**Streaming Sorting Networks**

Preprint (2.5 MB)

Published paper (link to publisher)

Bibtex

Sorting is a fundamental problem in computer science and has been studied extensively. Thus, a large variety of sorting methods exists for both software and hardware implementations. For the latter, there is a tradeoff between the throughput achieved and the cost, i.e., the area or amount of logic and storage invested to sort n elements. Two popular solutions are bitonic sorting networks with O(n\log^2 n) logic and storage, which sort $n$ elements per cycle, and linear sorters with O(n) logic and storage, which sort $n$ elements per $n$ cycles. In this paper, we present new hardware structures that we call streaming sorting networks, which we derive through a mathematical formalism that we introduce. With the new networks we achieve novel and improved cost-performance tradeoffs. For example, assuming n is a two-power and w is any divisor of n, one class of these networks can sort in n/w cycles with O(w\log^2n) logic and O(n\log^2n) storage; the other class we present sorts in n\log^2n/w cycles with O(w) logic and O(n) storage. We carefully analyze the performance of these networks and their cost at three levels of abstraction: (1) asymptotically, (2) exactly in terms of the number of basic elements needed, and (3) in terms of the resources required by the actual circuit when mapped to a field-programmable gate array. We obtain the latter results through a domain-specific hardware generator that translates our formal mathematical description into synthesizable RTL Verilog. With this generator we explore the entire design space, identify the Pareto-optimal solutions, and show superior cost-performance tradeoffs compared to prior work.

Jiyuan Zhang, Tze-Meng Low, Qi Guo and Franz Franchetti (Proc. Workshop on Near Data Processing (WONDP), 2015)

**A 3D-Stacked Memory Manycore Stencil Accelerator System**

Preprint (706 KB)

Published paper (link to publisher)

Bibtex

Stencil operations are an important class of scientific computational kernels that are pervasive in scientific simulations as well as in image processing. A key characteristic of this class of computation is that they have a low operational intensity, i.e., the ratio of the number of memory accesses to the number of floating point operations it performs is high. As a result, the performance of stencil operations implemented on general purpose computing systems is bounded by the memory bandwidth. Technologies such as 3D stacked memory can provide substantially more bandwidth than conventional memory systems and can enhance the performance of memory intensive computations like stencil kernels. In this paper, we leverage this 3D stacked memory technology to design an accelerator for stencil computations. We show that for the best efficiency one needs to find the balance between computation and memory accesses to keep all components consistently busy. We achieve this by exploring how blocking and caching schemes to control the compute-to-memory ratio. Finally, we identify optimal design points that maximize performance.

Nikolaos Kyrtatas, Daniele G. Spampinato and Markus Püschel (Proc. Design, Automation and Test in Europe (DATE), pp. 1054-1059, 2015)

**A Basic Linear Algebra Compiler for Embedded Processors**

Preprint (772 KB)

Published paper (link to publisher)

Bibtex

Many applications in signal processing, control, and graphics on embedded devices require efficient linear algebra computations. On general-purpose computers, program generators have proven useful to produce such code, or important building blocks, automatically. An example is LGen, a compiler for basic linear algebra computations of fixed size. In this work, we extend LGen towards the embedded domain using as example targets Intel Atom, ARM Cortex-A8, ARM Cortex-A9, and Raspberry Pi (ARM1176). To efficiently support these processors we introduce support for the NEON vector ISA and a methodology for domain-specific load/store optimizations. Our experimental evaluation shows that the new version of LGen produces code that performs better than well-established, commercial and non-commercial libraries (Intel MKL and IPP), software generators (Eigen and ATLAS), and compilers (icc, gcc, and clang).

Berkin Akin (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2015)

**A Formal Approach to Memory Access Optimization: Data Layout, Reorganization, and Near-Data Processing**

Published paper (link to publisher)

Bibtex

H. V. Koops and Franz Franchetti (Proc. International Conference on Digital Signal Processing (DSP), 2015)

**An Ensemble Technique for Estimating Vehicle Speed and Gear Position from Acoustic Data**

Preprint (289 KB)

Published paper (link to publisher)

Bibtex

This paper presents a machine learning system that is capable of predicting the speed and gear position of a moving vehicle from the sound it makes. While audio classification is widely used in other research areas such as music information retrieval and bioacoustics, its application to vehicle sounds is rare. Therefore, we investigate predicting the state of a vehicle using audio features in a classification task. We improve the classification results using correlation matrices, calculated from signals correlating with the audio. In an experiment, the sound of a moving vehicle is classified into discretized speed intervals and gear positions. The experiment shows that the system is capable of predicting the vehicle speed and gear position with near-perfect accuracy over 99%. These results show that this system could be a valuable addition to vehicle anomaly detection and safety systems.

H. E. Sumbul, K. Vaidyanathan, Qiuling Zhu, Franz Franchetti and Lawrence Pileggi (Proc. Design Automation Conference (DAC), 2015)

**A Synthesis Methodology for Application-Specific Logic-in-Memory Designs**

Preprint (2.3 MB)

Published paper (link to publisher)

Bibtex

For deeply scaled digital integrated systems, the power required for transporting data between memory and logic can exceed the power needed for computation, thereby limiting the efficacy of synthesizing logic and compiling memory independently. Logic-in-Memory (LiM) architectures address this challenge by embedding logic within the memory block to perform basic operations on data locally for specific functions. While custom smart memories have been successfully constructed for various applications, a fully automated LiM synthesis flow enables architectural exploration that has heretofore not been possible. In this paper we present a tool and design methodology for LiM physical synthesis that performs co-design of algorithms and architectures to explore system level trade-offs. The resulting layouts and timing models can be incorporated within any physical synthesis tool. Silicon results shown in this paper demonstrate a 250x performance improvement and 310x energy savings for a data-intensive application example.

Berkin Akin, Franz Franchetti and James C. Hoe (Proc. International Symposium on Computer Architectur (ISCA), 2015)

**Data Reorganization in Memory Using 3D-stacked DRAM**

Preprint (2.4 MB)

Published paper (link to publisher)

Bibtex

In this paper we focus on common data reorganization operations such as shuffle, pack/unpack, swap, transpose, and layout transformations. Although these operations simply relocate the data in the memory, they are costly on conventional systems mainly due to inefficient access patterns, limited data reuse and roundtrip data traversal throughout the memory hierarchy. This paper presents a two pronged approach for efficient data reorganization, which combines (i) a proposed DRAM-aware reshape accelerator integrated within 3D-stacked DRAM, and (ii) a mathematical framework that is used to represent and optimize the reorganization operations. We evaluate our proposed system through two major use cases. First, we demonstrate the reshape accelerator in performing a physical address remapping via data layout transform to utilize the internal parallelism/locality of the 3Dstacked DRAM structure more efficiently for general purpose workloads. Then, we focus on offloading and accelerating commonly used data reorganization routines selected from the Intel Math Kernel Library package. We evaluate the energy and performance benefits of our approach by comparing it against existing optimized implementations on state-of-the-art GPUs and CPUs. For the various test cases, in-memory data reorganization provides orders of magnitude performance and energy efficiency improvements via low overhead hardware.

Qi Guo, Tze-Meng Low, N. Alachiotis, Berkin Akin, Lawrence Pileggi, James C. Hoe and Franz Franchetti (Proc. MICRO, 2015)

**Enabling Portable Energy Efficiency with Memory Accelerated Library**

Preprint (637 KB)

Published paper (link to publisher)

Bibtex

Over the last decade, the looming power wall has spurred a flurry of interest in developing heterogeneous systems with hardware accelerators. The questions, then, are what and how accelerators should be designed, and what software support is required. Our accelerator design approach stems from the observation that many efficient and portable software implementations rely on high performance software libraries with well-established application programming interfaces (APIs). We propose the integration of hardware accelerators on 3D-stacked memory that explicitly targets the memory-bounded operations within high performance libraries. The fixed APIs with limited configurability simplify the design of the accelerators, while ensuring that the accelerators have wide applicability. With our software support that automatically converts library APIs to accelerator invocations, an additional advantage of our approach is that library-based legacy code automatically gains the benefit of memory-side accelerators without requiring a reimplementation. On average, the legacy code using our proposed MEmory Accelerated Library (MEALib) improves performance and energy efficiency for individual operations in Intel's Math Kernel Library (MKL) by 38x and 75x, respectively. For a real-world signal processing application that employs Intel MKL, MEALib attains more than 10x better energy efficiency.

M. M. Sabry Aly, M. Gao, G. Hills, C.-S. Lee, G. Pitner, M. M. Shulaker, T. F. Wu, M. Asheghi, J. Bokor, Franz Franchetti, K. E. Goodson, C. Kozyrakis, I. Markov, K. Olukoton, Lawrence Pileggi, E. Pop, J. Rabaey, C. Re, H.-S. Wong and S. Mitra (Computer, Vol. 48, No. 12, pp. 24-33, 2015)

**Energy-Efficient Abundant-Data Computing: The N3XT 1,000x**

Published paper (link to publisher)

Bibtex

Next-generation information technologies will process unprecedented amounts of loosely structured data that overwhelm existing computing systems. N3XT improves the energy efficiency of abundant-data applications 1,000-fold by using new logic and memory technologies, 3D integration with fine-grained connectivity, and new architectures for computation immersed in memory.

Berkin Akin, Franz Franchetti and James C. Hoe (Journal of Signal Processing Systems, 2015)

**FFTs with Near-Optimal Memory Access Through Block Data Layouts: Algorithm, Architecture and Design Automation**

Published paper (link to publisher)

Bibtex

Fast Fourier transform algorithms on large data sets achieve poor performance on various platforms because of the inefficient strided memory access patterns. These inefficient access patterns need to be reshaped to achieve high performance implementations. In this paper we formally restructure 1D, 2D and 3D FFTs targeting a generic machine model with a two-level memory hierarchy requiring block data transfers, and derive memory access pattern efficient algorithms using custom block data layouts. These algorithms need to be carefully mapped to the targeted platform’s architecture, particularly the memory subsystem, to fully utilize performance and energy efficiency potentials. Using the Kronecker product formalism, we integrate our optimizations into Spiral framework and evaluate a family of DRAM-optimized FFT algorithms and their hardware implementation design space via automated techniques. In our evaluations, we demonstrate DRAM-optimized accelerator designs over a large tradeoff space given various problem (single/double precision 1D, 2D and 3D FFTs) and hardware platform (off-chip DRAM, 3D-stacked DRAM, ASIC, FPGA, etc.) parameters. We show that Spiral generated pareto optimal designs can achieve close to theoretical peak performance of the targeted platform offering 6x and 6.5x system performance and power efficiency improvements respectively over conventional row-column FFT algorithms.

Thom Popovici, F. Russell, K. Wilkinson, C-K. Skylaris, P. H. J. Kelly and Franz Franchetti (Proc. Workshop on Compilers for Parallel Computing (CPC), 2015)

**Generating Optimized Fourier Interpolation Routines for Density Functional Theory Using SPIRAL**

Published paper (link to publisher)

Bibtex

Upsampling of a multi-dimensional data-set is an operation with wide application in image processing and quantum mechanical calculations using density functional theory. For small up sampling factors as seen in the quantum chemistry code ONETEP, a time-shift based implementation that shifts samples by a fraction of the original grid spacing to fill in the intermediate values using a frequency domain Fourier property can be a good choice. Readily available highly optimized multidimensional FFT implementations are leveraged at the expense of extra passes through the entire working set. In this paper we present an optimized variant of the time-shift based up sampling. Since ONETEP handles threading, we address the memory hierarchy and SIMD vectorization, and focus on problem dimensions relevant for ONETEP. We present a formalization of this operation within the SPIRAL framework and demonstrate auto-generated and auto-tuned interpolation libraries. We compare the performance of our generated code against the previous best implementations using highly optimized FFT libraries (FFTW and MKL). We demonstrate speed-ups in isolation averaging 3x and within ONETEP of up to 15%.

Thom Popovici, F. Russell, K. Wilkinson, C-K. Skylaris, P. H. J. Kelly and Franz Franchetti (Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2015)

**Generating Optimized Fourier Interpolation Routines for Density Functional Theory Using SPIRAL**

Preprint (2.2 MB)

Published paper (link to publisher)

Bibtex

Upsampling of a multi-dimensional data-set is an operation with wide application in image processing and quantum mechanical calculations using density functional theory. For small upsampling factors as seen in the quantum chemistry code ONETEP, a time-shift based implementation that shifts samples by a fraction of the original grid spacing to fill in the intermediate values using a frequency domain Fourier property can be a good choice. Readily available highly optimized multidimensional FFT implementations are leveraged at the expense of extra passes through the entire working set. In this paper we present an optimized variant of the time-shift based upsampling. Since ONETEP handles threading, we address the memory hierarchy and SIMD vectorization, and focus on problem dimensions relevant for ONETEP. We present a formalization of this operation within the SPIRAL framework and demonstrate auto-generated and autotuned interpolation libraries. We compare the performance of our generated code against the previous best implementations using highly optimized FFT libraries (FFTW and MKL). We demonstrate speed-ups in isolation averaging 3x and within ONETEP of up to 15%.

Tze-Meng Low, Qi Guo and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2015)

**Optimizing Space Time Adaptive Processing Through Accelerating Memory-Bounded Operations**

Preprint (317 KB)

Published paper (link to publisher)

Bibtex

Space-Time Adaptive Processing (STAP) is a technique for processing signals from multiple antenna elements over multiple time periods for target detection. As STAP algorithms are typical run on airborne platforms, they need to be both high performance and energy-efficient. Due to the high rate of processing required, many existing algorithms focus on reducing the dimensionality of the data, or exploiting structure in the underlying mathematical formulation in order to reduce the total number of floating-point operations (FLOPs), and consequently, the time for computation. While such algorithms target the FLOPs-intensive operations within the STAP algorithm, a significant portion of the compute time for most STAP algorithms is actually spent in low-FLOPs, memory-bounded operations. In this paper, we address the computation of these memory-bounded operations within the STAP algorithm using a 3D stacked Logic-in-Memory system. The imminent arrival of 3D stacked memory makes available high memory bandwidth, which opens up a new and orthogonal dimension for optimizing STAP algorithms. We show that more than 11x improvement in time, and 77x improvement in energy efficiency can be expected when a 3D stack is used together with memory-side accelerators to target the memory-bounded operations within STAP.

T. Ozturk, C. Stein, R. Pokharel, Thom Popovici, R. Suter, Franz Franchetti and Anthony Rollett (submitted for publication)

**Spectral Full-Field Deformation Modeling of Polycrystalline Materials**

Bibtex

Qi Guo, N. Alachiotis, Berkin Akin, F. Sadi, G. Xu, Tze-Meng Low, Lawrence Pileggi, James C. Hoe and Franz Franchetti (Proc. Workshop on Near Data Processing (WONDP), 2014)

**3D-Stacked Memory-Side Acceleration: Accelerator and System Design**

Preprint (910 KB)

Published paper (link to publisher)

Bibtex

Specialized hardware acceleration is an effective technique to mitigate the dark silicon problems. A challenge in designing on-chip hardware accelerators for data-intensive applications is how to efficiently transfer data between the memory hierarchy and the accelerators. Although the Processingin-Memory (PIM) technique has the potential to reduce the overhead of data transfers, it is limited by the traditional process technology. Recent process technology advancements such as 3Ddie stacking enable efficient PIM architectures by integrating accelerators to the logic layer of 3D DRAM, thus leading to the concept of the 3D-stacked Memory-Side Accelerator (MSA). In this paper, we initially present the overall architecture of the 3D-stacked MSA, which relies on a configurable array of domain-specific accelerators. Thereafter, we describe a full-system prototype that is built upon a novel software stack and a hybrid evaluation methodology. Experimental results demonstrate that the 3D-stacked MSA achieves up to 179x and 96x better energy efficiency than the Intel Haswell processor for the FFT and matrix transposition algorithms, respectively.

Daniele G. Spampinato and Markus Püschel (Proc. International Symposium on Code Generation and Optimization (CGO), pp. 23-32, 2014)

**A Basic Linear Algebra Compiler**

Preprint (2.4 MB)

Published paper (link to publisher)

Bibtex

Many applications in media processing, control, graphics, and other domains require efficient small-scale linear algebra computations. However, most existing high performance libraries for linear algebra, such as ATLAS or Intel's MKL are more geared towards large-scale problems (matrix sizes in the hundreds and larger) and towards specific interfaces (e.g., BLAS). In this paper we present LGen: a compiler for small-scale, basic linear algebra computations. The input to LGen is a fixed-size linear algebra expression; the output is a corresponding C function optionally including intrinsics to efficiently use SIMD vector extensions. LGen generates code using two levels of mathematical domain-specific languages (DSLs). The DSLs are used to perform tiling, loop fusion, and vectorization at a high level of abstraction, before the final code is generated. In addition, search is used to select among alternative generated implementations. We show benchmarks of code generated by LGen against Intel's MKL and IPP as well as against alternative generators, such as the C++ template-based Eigen and the BTO compiler. The achieved speed-up is typically about a factor of two to three.

Nikolaos Kyrtatas (Master thesis, Computer Science, ETH Zurich, Switzerland, 2014)

**A Basic Linear Algebra Compiler for Embedded Processors**

Preprint (3.4 MB)

Published paper (link to publisher)

Bibtex

In many fields, such as signal processing, control, and graphics, there is a significant demand for efficient Dense Linear Algebra (DLA) code for embedded devices. At the same time, code generators have proved to be useful for generating fast DLA code for general-purpose computers. An example is LGen, a research compiler designed after Spiral for basic linear algebra computations of fixed size. In this thesis, we extend LGen towards four processors that are widely used in the embedded ecosystem: Intel Atom, ARM Cortex-A8, ARM Cortex-A9, and ARM1176. For this purpose, we introduce into the LGen methodology a set of optimizations that take into account the specific limitations and capabilities of these processors, aiming at generating highly optimized code for them. An extensive set of experiments run on our target platforms shows that the new version of LGen produces code that performs better than well-established, commercial and non-commercial libraries (Intel MKL and Intel IPP), software generators (Eigen and ATLAS), and compilers (icc, gcc, and clang). The large number of experiments that we conducted on the four investigated processors were executed using Mediator, a web-based middleware that was developed as part of this thesis. Mediator offers a RESTful interface that makes possible the simultaneous execution of experiments on multiple SSH-accessible devices by multiple users. More specifically, it guarantees mutual exclusion of the experiments running on a specific core, while at the same time applies load balancing over the cores of a device. On top of that, it also provides an easy-to-use mechanism for retrieving performance metrics on a variety of architectures with minimal user involvement.

Alen Stojanov, Georg Ofenbeck, Tiark Rompf and Markus Püschel (Proc. ACM International Workshop on Libraries, Languages and Compilers for Array Programming (ARRAY), pp. 14, 2014)

**Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters**

Preprint (529 KB)

Published paper (link to publisher)

Bibtex

We present FGen, a program generator for high performance convolution operations (finite-impulse-response filters). The generator uses an internal mathematical DSL to enable structural optimization at a high level of abstraction. We use FGen as a testbed to demonstrate how to provide modular and extensible support for modern SIMD vector architectures in a DSL-based generator. Specifically, we show how to combine staging and generic programming with type classes to abstract over both the data type (real or complex) and the target architecture (e.g., SSE or AVX) when mapping DSL expressions to C code with explicit vector intrinsics. Benchmarks shows that the generated code is highly competitive with commercial libraries.

Tao Cui, R. Yang, Gabriela Hug and Franz Franchetti (Proc. IEEE Power and Energy Society General Meeting (PES-GM), 2014)

**Accelerated AC Contingency Calculation on Commodity Multi-core SIMD CPUs**

Preprint (369 KB)

Published paper (link to publisher)

Bibtex

Multi-core CPUs with multiple levels of parallelism (i.e. data level, instruction level and task/core level) have become the mainstream CPUs for commodity computing systems. Based on the multi-core CPUs, in this paper we developed a high performance computing framework for AC contingency calculation (ACCC) to fully utilize the computing power of commodity systems for online and real time applications. Using Woodbury matrix identity based compensation method, we transform and pack multiple contingency cases of different outages into a fine grained vectorized data parallel programming model. We implement the data parallel programming model using SIMD instruction extension on x86 CPUs, therefore, fully taking advantages of the CPU core with SIMD floating point capability. We also implement a thread pool scheduler for ACCC on multi-core CPUs which automatically balances the computing loads across CPU cores to fully utilize the multi-core capability. We test the ACCC solver on the IEEE test systems and on the Polish 3000-bus system using a quad-core Intel Sandy Bridge CPU. The optimized ACCC solver achieves close to linear speedup (SIMD width multiply core numbers) comparing to scalar implementation and is able to solve a complete N-1 line outage AC contingency calculation of the Polish grid within one second on a commodity CPU. It enables the complete ACCC as a real-time application on commodity computing systems.

F. Sadi, Berkin Akin, Thom Popovici, James C. Hoe, Lawrence Pileggi and Franz Franchetti (Proc. IEEE High Performance Extreme Computing (HPEC), 2014)

**Algorithm/Hardware Co-optimized SAR Image Reconstruction with 3D-stacked Logic in Memory**

Preprint (2.9 MB)

Published paper (link to publisher)

Bibtex

Real-time system level implementations of complex Synthetic Aperture Radar (SAR) image reconstruction algorithms have always been challenging due to their data intensive characteristics. In this paper, we propose a basis vector transform based novel algorithm to alleviate the data intensity and a 3Dstacked logic in memory based hardware accelerator as the implementation platform. Experimental results indicate that this proposed algorithm/hardware co-optimized system can achieve an accuracy of 91 dB PSNR compared to a reference algorithm implemented in Matlab and energy efficiency of 72 GFLOPS/W for a 8k8k SAR image reconstruction.

Georg Ofenbeck, Ruedi Steinmann, Victoria Caparrós Cabezas, Daniele G. Spampinato and Markus Püschel (Proc. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 76 - 85, 2014)

**Applying the Roofline Model**

Preprint (1.6 MB)

Published paper (link to publisher)

Bibtex

The recently introduced roofline model plots the performance of executed code against its operational intensity (operations count divided by memory traffic). It also includes two platform-specific performance ceilings: the processor's peak performance and a ceiling derived from the memory bandwidth, which is relevant for code with low operational intensity. The model thus makes more precise the notions of memory- and compute-bound and, despite its simplicity, can provide an insightful visualization of bottlenecks. As such it can be valuable to guide manual code optimization as well as in education. Unfortunately, to date the model has been used almost exclusively with back-of-the-envelope calculations and not with measured data. In this paper we show how to produce roofline plots with measured data on recent generations of Intel platforms. We show how to accurately measure the necessary quantities for a given program using performance counters, including threaded and vectorized code, and for warm and cold cache scenarios. We explain the measurement approach, its validation, and discuss limitations. Finally, we show, to this extent for the first time, a set of roofline plots with measured data for common numerical functions on a variety of platforms and discuss their possible uses.

B. Duff, J. Larkin, M. Franusich and Franz Franchetti (submitted for publication)

**Automatic Generation of 3-D FFTs**

Bibtex

Benjamin Hess, Thomas Gross and Markus Püschel (Proc. International Conference on Generative Programming: Concepts & Experiences (GPCE), pp. 83-92, 2014)

**Automatic Locality-Friendly Interface Extension of Numerical Functions**

Preprint (509 KB)

Published paper (link to publisher)

Bibtex

Raising the level of abstraction is a key concern of software engineering, and libraries (either used directly or as a target of a program generation system) are a successful technique to raise programmer productivity and to improve software quality. Unfortunately successful libraries may contain functions that may not be general enough. E.g., many numeric performance libraries contain functions that work on one- or higher-dimensional arrays. A problem arises if a program wants to invoke such a function on a non-contiguous subarray (e.g., in C the column of a matrix or a subarray of an image). If the library developer did not foresee this scenario, the client program must include explicit copy steps before and after the library function call, incurring a possibly high performance penalty. A better solution would be an enhanced library function that allows for the desired access pattern. Exposing the access pattern allows the compiler to optimize for the intended usage scenario(s). As we do not want the library developer to generate all interesting versions manually, we present a tool that takes a library function written in C and generates such a customized function for typical accesses. We describe the approach, discuss limitations, and report on the performance. As example access patterns we consider those most common in numerical applications: permutations, striding and block striding, as well as scaling. We evaluate the tool on various library functions including filters, scans, reductions, sorting, FFTs and linear algebra operations. The automatically generated custom version is in most cases significantly faster than using individual steps, offering gains that are typically in the range of 20--80%.

Vadim Zaliva and Franz Franchetti (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014)

**Barometric and GPS Altitude Sensor Fusion**

Preprint (342 KB)

Published paper (link to publisher)

Bibtex

The altitude of a moving vehicle as reported by GPS suffers from intermittent errors caused by temporary obstruction of the satellites by buildings, mountains, etc. Additionally, it is affected by systematic errors caused by multipath effects, ionospheric and tropospheric effects, and other hardware design limitations and natural factors. Atmospheric pressure, measured by a portable barometric sensor, could also be used to determine altitude, is not susceptible to problems caused by obstruction of satellites, and can provide reliable measurements outdoors even in urban and mountainous regions. In this paper, we propose an algorithm which improves accuracy and provides tighter confidence bounds of altitude measurements from a mobile phone (or any device equipped with GPS and barometric sensors) by means of sensor fusion techniques without the need for calibration. Our experiments have shown that the proposed algorithm provides more accurate measurements with tighter confidence bounds compared to using either of the two sensors, barometric or GPS, alone.

K. Vaidyanathan, R. Liu, H. E. Sumbul, Qiuling Zhu, Franz Franchetti and Lawrence Pileggi (Proc. IEEE International Symposium on Hardware-Oriented Security and Trust (HOST), 2014)

**Efficient and Secure Intellectual Property (IP) Design for Split Fabrication**

Published paper (link to publisher)

Bibtex

Split fabrication, the process of splitting an IC into an untrusted and trusted tier, facilitates access to the most advanced semiconductor manufacturing capabilities available in the world without requiring disclosure of design intent. While researchers have investigated the security of logic blocks in the context of split fabrication, the security of IP blocks, another key component of an SoC, has not been examined. Our security analysis of IP block designs, specifically embedded memory and analog components, shows that they are vulnerable to “recognition attacks” at the untrusted foundry due to the use of standardized floorplans and leaf cell layouts. We propose methodologies to design these blocks efficiently and securely, and demonstrate their effectiveness using 130nm split fabricated test chips.

Victoria Caparrós Cabezas and Markus Püschel (Proc. IEEE International Symposium on Workload Characterization (IISWC), pp. 222-231, 2014)

**Extending the Roofline Model: Bottleneck Analysis with Microarchitectural Constraints**

Preprint (2.1 MB)

Published paper (link to publisher)

Bibtex

Software, even if carefully optimized, rarely reaches the peak performance of a processor. Understanding which hardware resource is the bottleneck is difficult but important as it can help with both further optimizing the code or deciding which hardware component to upgrade for higher performance. If the bottleneck is the memory bandwidth, the roofline model provides a simple but instructive analysis and visualization. In this paper, we take the roofline analysis further by including additional performance-relevant hardware features including latency, throughput, and capacity information for a multilevel cache hierarchy and out-of-order execution buffers. Two key ideas underlie our analysis. First, we estimate performance based on a scheduling of the computation DAG on a high-level model of the microarchitecture and extract data including utilization of resources and overlaps from a cycle-by-cycle analysis of the schedule. Second, we show how to use this data to create only one plot with multiple rooflines that visualize bottlenecks. We validate our model against performance data obtained from a real system, and then apply our bottleneck analysis to a number of prototypical floating point kernels to identify and interpret bottlenecks.

T. Ozturk, Thom Popovici, C. Stein, R. Pokharel, Franz Franchetti and Anthony Rollett (Proc. Materials Science & Technology, 2014)

**Fast Fourier Transform Based Mechanical Behavior Formulation: Optimized Implementation and Sensitivity Analysis of the Method**

Bibtex

Berkin Akin, Franz Franchetti and James C. Hoe (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014)

**FFTs with Near-Optimal Memory Access Through Block Data Layouts**

Preprint (1.8 MB)

Published paper (link to publisher)

Bibtex

Fast Fourier transform algorithms on large data sets achieve poor performance on various platforms because of the inefficient strided memory access patterns. These inefficient access patterns need to be reshaped to achieve high performance implementations. In this paper we formally restructure 1D, 2D and 3D FFTs targeting a generic machine model with a two-level memory hierarchy requiring block data transfers, and derive memory access pattern efficient algorithms using custom block data layouts. Using the Kronecker product formalism, we integrate our optimizations into Spiral framework. In our evaluations we demonstrate that Spiral generated hardware designs achieve close to theoretical peak performance of the targeted platform and offer significant speed-up (up to 6.5x) compared to naïve baseline algorithms.

Berkin Akin, James C. Hoe and Franz Franchetti (Proc. IEEE High Performance Extreme Computing (HPEC), 2014)

**HAMLeT: Hardware Accelerated Memory Layout Transform within 3D-stacked DRAM**

Preprint (1.8 MB)

Published paper (link to publisher)

Bibtex

Memory layout transformations via data reorganization are very common operations, which occur as a part of the computation or as a performance optimization in data-intensive applications. These operations require inefficient memory access patterns and roundtrip data movement through the memory hierarchy, failing to utilize the performance and energy-efficiency potentials of the memory subsystem. This paper proposes a highbandwidth and energy-efficient hardware accelerated memory layout transform (HAMLeT) system integrated within a 3Dstacked DRAM. HAMLeT uses a low-overhead hardware that exploits the existing infrastructure in the logic layer of 3Dstacked DRAMs, and does not require any changes to the DRAM layers, yet it can fully exploit the locality and parallelism within the stack by implementing efficient layout transform algorithms. We analyze matrix layout transform operations (such as matrix transpose, matrix blocking and 3D matrix rotation) and demonstrate that HAMLeT can achieve close to peak system utilization, offering up to an order of magnitude performance improvement compared to the CPU and GPU memory subsystems which does not employ HAMLeT.

Franz Franchetti, Aliaksei Sandryhaila and Jeremy Johnson (Proc. SPIE, Proceedings of SPIE 2014, 2014)

**High Assurance SPIRAL**

Preprint (275 KB)

Published paper (link to publisher)

Bibtex

In this paper we introduce High Assurance SPIRAL to solve the last mile problem for the synthesis of high assurance implementations of controllers for vehicular systems that are executed in today’s and future embedded and high performance embedded system processors. High Assurance SPIRAL is a scalable methodology to translate a high level specification of a high assurance controller into a highly resource-efficient, platform-adapted, verified control software implementation for a given platform in a language like C or C++. High Assurance SPIRAL proves that the implementation is equivalent to the specification written in the control engineer’s domain language. Our approach scales to problems involving floating-point calculations and provides highly optimized synthesized code. It is possible to estimate the available headroom to enable assurance/performance trade-offs under real-time constraints, and enables the synthesis of multiple implementation variants to make attacks harder. At the core of High Assurance SPIRAL is the Hybrid Control Operator Language (HCOL) that leverages advanced mathematical constructs expressing the controller specification to provide high quality translation capabilities. Combined with a verified/certified compiler, High Assurance SPIRAL provides a comprehensive complete solution to the efficient synthesis of verifiable high assurance controllers. We demonstrate High Assurance SPIRALs capability by co-synthesizing proofs and implementations for attack detection and sensor spoofing algorithms and deploy the code as ROS nodes on the Landshark unmanned ground vehicle and on a Synthetic Car in a real-time simulator.

Jörn Schumacher and Markus Püschel (Proc. IEEE Workshop on Signal Processing Systems (SIPS), pp. 1-6, 2014)

**High-performance sparse fast Fourier transforms**

Preprint (1.2 MB)

Published paper (link to publisher)

Bibtex

The sparse fast Fourier transform (SFFT) is a recent novel algorithm to compute discrete Fourier transforms on signals with a sparse frequency domain with an improved asymptotic runtime. Reference implementations exist for different variants of the algorithm and were already shown to be faster than state-of-the-art FFT implementations in cases of sufficient sparsity. However, to date the SFFT has not been carefully optimized for modern processors. In this paper, we first analyze the performance of the existing SFFT implementations and discuss possible improvements. Then we present an optimized implementation. We achieve a speedup of 2-5 compared to the existing code and an efficiency that is competitive to high-performance FFT libraries.

Georg Ofenbeck, Tiark Rompf, Alen Stojanov, Martin Odersky and Markus Püschel (Proc. International Workshop on Adaptive Self-tuning Computing Systems (ADAPT), 2014)

**Language Support for the Construction of High Performance Code Generators**

Bibtex

The development of highest performance code on modern processors is extremely difficult due to deep memory hierarchies, vector instructions, multiple cores, and inherent limitations of compilers. The problem is particularly noticeable for library functions of mathematical nature (e.g., BLAS, FFT, filters, Viterbi decoders) that are performance-critical in areas such as multimedia processing, computer vision, graphics, machine learning, or scientific computing. Experience shows that a straightforward implementation often underperforms by one or two orders of magnitude compared to highly tuned code. The latter is often highly specialized to a platform which makes porting very costly. One appealing solution to the problem of optimizing and porting libraries are program generators that automatically produce highest performance libraries for a given platform from a high level description. Building such a generator is difficult, which is the reason that only very few exist to date. The difficulty comes from both the problem of designing an extensible approach to perform all the optimizations the compiler is unable to do and the actual implementation of the generator. To tackle the latter problem we asked the question: Which programming languages features support the development of generators for performance libraries?

T. Ozturk, C. Stein, R. Pokharel, Thom Popovici, Franz Franchetti, R. Suter and Anthony Rollett (submitted for publication)

**Performance Evaluation, Algorithm Optimization and Sensitivity Analysis of the Spectral Full-Field Deformation Modeling of Polycrystalline Materials**

Bibtex

Lingchuan Meng and Jeremy Johnson (ACM Communications in Computer Algebra, 2014)

**Towards parallel general-size library generation for polynomial multiplication**

Published paper (link to publisher)

Bibtex

Berkin Akin, Franz Franchetti and James C. Hoe (Proc. IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pp. 248-255, 2014)

**Understanding the Design Space of DRAM-optimized Hardware FFT Accelerators**

Preprint (2.6 MB)

Published paper (link to publisher)

Bibtex

As technology scaling is reaching its limits, pointing to the well-known memory and power wall problems, achieving high-performance and energy-efficient systems is becoming a significant challenge. Especially for data-intensive computing, efficient utilization of the memory subsystem is the key to achieve high performance and energy efficiency.We address this challenge in DRAM-optimized hardware accelerators for 1D, 2D and 3D fast Fourier transforms (FFT) on large datasets. When the dataset has to be stored in external DRAM, the main challenge for FFT algorithm design lies in reshaping DRAM-unfriendly memory access patterns to eliminate excessive DRAM row buffer misses. More importantly, these algorithms need to be carefully mapped to the targeted platform’s architecture, particularly the memory subsystem, to fully utilize performance and energy efficiency potentials. We use automatic design generation techniques to consider a family of DRAM-optimized FFT algorithms and their hardware implementation design space. In our evaluations, we demonstrate DRAM-optimized accelerator designs over a large tradeoff space given various problem (single/double precision 1D, 2D and 3D FFTs) and hardware platform (off-chip DRAM, 3D-stacked DRAM, ASIC, FPGA, etc.) parameters. We show that generated pareto-optimal designs can yield up to 5.5x energy consumption and order of magnitude memory bandwidth utilization improvements in DRAM, which lead to overall system performance and power efficiency improvements of up to 6x and 6.5x respectively over conventional row-column FFT algorithms.

Qiuling Zhu, Berkin Akin, H. E. Sumbul, F. Sadi, James C. Hoe, Lawrence Pileggi and Franz Franchetti (Proc. IEEE International 3D Systems Integration Conference (3DIC), pp. 1-7, 2013)

**A 3D-Stacked Logic-in-Memory Accelerator for Application-Specific Data Intensive Computing**

Published paper (link to publisher)

Bibtex

This paper introduces a 3D-stacked logic-in-memory (LiM) system that integrates the 3D die-stacked DRAM architecture with the application-specific LiM IC to accelerate important data-intensive computing. The proposed system comprises a fine-grained rank-level 3D die-stacked DRAM device and extra LiM layers implementing logic-enhanced SRAM blocks that are dedicated to a particular application. Through silicon vias (TSVs) are used for vertical interconnections providing the required bandwidth to support the high performance LiM computing. We performed a comprehensive 3D DRAM design space exploration and exploit the efficient architectures to accelerate the computing that can balance the performance and power. Our experiments demonstrate orders of magnitude of performance and power efficiency improvements compared with the traditional multithreaded software implementation on modern CPU.

Qiuling Zhu, H. E. Sumbul, F. Sadi, James C. Hoe, Lawrence Pileggi and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), pp. 1-6, 2013)

**Accelerating Sparse Matrix-Matrix Multiplication with 3D-Stacked Logic-in-Memory Hardware**

Published paper (link to publisher)

Bibtex

This paper introduces a 3D-stacked logic-in-memory (LiM) system to accelerate the processing of sparse matrix data that is held in a 3D DRAM system. We build a customized content addressable memory (CAM) hardware structure to exploit the inherent sparse data patterns and model the LiM based hardware accelerator layers that are stacked in between DRAM dies for the efficient sparse matrix operations. Through silicon vias (TSVs) are used to provide the required high inter-layer bandwidth. Furthermore, we adapt the algorithm and data structure to fully leverage the underlying hardware capabilities, and develop the necessary design framework to facilitate the design space evaluation and LiM hardware synthesis. Our simulation demonstrates more than two orders of magnitude of performance and energy efficiency improvements compared with the traditional multithreaded software implementation on modern processors.

Marcela Zuluaga, Andreas Krause, Guillaume Sergent and Markus Püschel (Proc. International Conference on Machine Learning (ICML), pp. 462-470, 2013)

**Active Learning for Multi-Objective Optimization**

Preprint (4.7 MB)

Bibtex

In many fields one encounters the challenge of identifying, out of a pool of possible designs, those that simultaneously optimize multiple objectives. This means that usually there is not one optimal design but an entire set of Pareto-optimal ones with optimal trade-offs in the objectives. In many applications, evaluating one design is expensive; thus, an exhaustive search for the Pareto-optimal set is unfeasible. To address this challenge, we propose the Pareto Active Learning (PAL) algorithm which intelligently samples the design space to predict the Pareto-optimal set. Key features of PAL include (1) modeling the objectives as samples from a Gaussian process distribution to capture structure and accomodate noisy evaluation; (2) a method to carefully choose the next design to evaluate to maximize progress; and (3) the ability to control prediction accuracy and sampling cost. We provide theoretical bounds on PAL’s sampling cost required to achieve a desired accuracy. Further, we show an experimental evaluation on three real-world data sets. The results show PAL’s effectiveness; in particular it improves significantly over a state-of-the-art multi-objective optimization method, saving in many cases about 33% evaluations to achieve the same accuracy.

Q. Li, Tao Cui, Yang Weng, Rohit Negi, Franz Franchetti and Marija Ilic (IEEE Transactions on Smart Grid, Vol. 4, No. 1, pp. 446-456, 2013)

**An Information-Theoretic Approach to PMU Placement in Electric Power Systems**

Preprint (1.9 MB)

Published paper (link to publisher)

Bibtex

This paper presents an information-theoretic approach to address the phasor measurement unit (PMU) placement problem in electric power systems. Different from the conventional ‘topological observability’ based approaches, this paper advocates a much more refined, information-theoretic criterion, namely the mutual information (MI) between PMU measurements and power system states. The proposed MI criterion not only includes observability as a special case, but also rigorously models the uncertainty reduction on power system states from PMU measurements. Thus, it can generate highly informative PMU configurations. The MI criterion can also facilitate robust PMU placement by explicitly modeling probabilistic PMU outages. We propose a greedy PMU placement algorithm, and show that it achieves an approximation ratio of (1 - 1/e) for any PMU placement budget. We further show that the performance is the best that one can achieve, in the sense that it is NP-hard to achieve any approximation ratio beyond (1 - 1/e). Such performance guarantee makes the greedy algorithm very attractive in the practical scenario of multi-stage installations for utilities with limited budgets. Finally, simulation results demonstrate near-optimal performance of the proposed PMU placement algorithm.

Qiuling Zhu (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2013)

**Application Specific Logic in Memory**

Published paper (link to publisher)

Bibtex

Tao Cui and Franz Franchetti (Proc. IEEE Innovative Smart Grid Technology Conference, 2013)

**A Quasi-Monte Carlo Approach for Radial Distribution System Probabilistic Load Flow**

Preprint (305 KB)

Published paper (link to publisher)

Bibtex

Monte Carlo simulation (MCS) is a numerical method to solve the probabilistic load flow (PLF) problem. Comparing to analytical methods, MCS for PLF has advantages such as flexibility, general purpose, able to deal with large nonlinearity and large variances, and embarrassingly parallelizable. However, MCS also suffers from low convergence speed and high computational burden, especially for problems with multiple random variables. In this paper, we proposed a Quasi-Monte Carlo (QMC) based method to solve the PLF for radial distribution network. QMC uses samples from low-discrepancy sequence intended to cover the high dimension random sample space as uniformly as possible. The QMC method is particularly suitable for the high dimension problems with low effective dimensions, and has been successfully used to solve large scale problems in econometrics and statistical circuit design. In this paper, we showed that the PLF for radial distribution system has the similar properties and can be a good candidate for QMC method. The proposed method possesses the advantage of MCS method and significantly increases the convergence rate and overall speed. Numerical experiment results on IEEE test feeders have shown the effectiveness of the proposed method.

Qiuling Zhu, Lawrence Pileggi and Franz Franchetti (in From Algorithms to Circuits and System-on-Chip Design (VLSI-SoC 2012), Eds. A. Berg, A. Coskun, M. Guthaus, S. Katkoori, R. Reis, pp. 21-44, Springer, Berlin, Heidelberg 2013)

**A Smart Memory Accelerated Computed Tomography Parallel Backprojection**

Published paper (link to publisher)

Bibtex

Tao Cui and Franz Franchetti (IEEE PES General Meeting, 2013)

**A Software Performance Engineering Aproach to Fast Transmission Probabilistic Load Flow**

Preprint (2.3 MB)

Published paper (link to publisher)

Bibtex

Tom Henretty, Richard Veras, Franz Franchetti, Louis-Noël Pouchet, J. Ramanujam and P. Sadayappan (Proc. ACM International Conference on Supercomputing , pp. 13-24, 2013)

**A Stencil Compiler for Short-Vector SIMD Architectures**

Preprint (415 KB)

Published paper (link to publisher)

Bibtex

Stencil computations are an integral component of applications in a number of scientific computing domains. Short-vector SIMD instruction sets are ubiquitous on modern processors and can be used to significantly increase the performance of stencil computations. Traditional approaches to optimizing stencils on these platforms have focused on either short-vector SIMD or data locality optimizations. In this paper, we propose a domain-specific language and compiler for stencil computations that allows specification of stencils in a concise manner and automates both locality and short-vector SIMD optimizations, along with effective utilization of multi-core parallelism. Loop transformations to enhance data locality and enable load-balanced parallelism are combined with a data layout transformation to effectively increase the performance of stencil computations. Performance increases are demonstrated for a number of stencils on several modern SIMD architectures.

Lingchuan Meng and Jeremy Johnson (Proc. International Workshop on Computer Algebra in Scientific Computing, Springer, pp. 243-256, 2013)

**Automatic Parallel Library Generation for General-Size Modular FFT Algorithms**

Published paper (link to publisher)

Bibtex

Benjamin Hess (Master thesis, Computer Science, ETH Zurich, Switzerland, 2013)

**Automatic Refactoring: Locality Friendly Interface Enhancements for Numerical Functions**

Preprint (1.2 MB)

Bibtex

Recent improvements in processor architectures such as multiple cores and larger single-instruction multiple-data (SIMD) vector units increased the discrepancy between processing speed and memory bandwidth. Today, the memory bandwidth has become the biggest bottleneck in high performance domains. Numerical functions are often target to heavy optimizations to get as much performance as possible. These functions presume a specific data layout and have a fixed domain and range. Adjusting unsuitable data according to these requirements can take up a significant amount of the runtime due to heavy memory operations. By integrating layout, domain and range adjustments into numerical functions, memory bandwidth is saved as the adjustments happen inplace during execution of the function. Four transformations are provided by the developed tool to refactor the most common adjustments into numerical functions. Restrictions on the to transformed function ensure the correctness of the transformations. These transformations enable the function to directly work on previously incompatible data which makes the manual adjustments superfluous and saves memory bandwidth by not needing to adjust the data manually before calling the function. The runtime of the transformed function is highly dependent on the used function and transformation type, ranging from 50% slower to up to 5 times faster compared to applying the adjustments manually before or after the function. The developed tool shows that automatic refactoring with a subset of C is possible and allows a developer to enhance numerical functions with minimal additional effort providing more flexible and high-performance versions of the functions with only having to maintain the original function.

Jörn Schumacher (Master thesis, Computer Science, ETH Zurich, Switzerland, 2013)

**High Performance Sparse Fast Fourier Transform**

Preprint (702 KB)

Bibtex

The Sparse Fast Fourier Transform is a recent algorithm developed by Hassanieh et al. at MIT for Discrete Fourier Transforms on signals with a sparse frequency domain. A reference implementation of the algorithm exists and proves that the Sparse Fast Fourier Transform can be faster than modern FFT libraries. However, the reference implementation does not take advantage of modern hardware features like vector instruction sets or multithreading. In this Master Thesis the reference implementation’s performance will be analyzed and evaluated. Several optimizations are proposed and implemented in a high-performance Sparse Fast Fourier Transform library. The optimized code is evaluated for performance and compared to the reference implementation as well as the FFTW library. The main result is that, depending on the input parameters, the optimized Sparse Fast Fourier Transform library is two to five times faster than the reference implementation.

Marcela Zuluaga, Andreas Krause and Markus Püschel (Proc. Workshop on High-Level Synthesis for High Performance Computing (HLS4HPC), 2013)

**Multi-Objective Optimization for High-Level Synthesis**

Preprint (2.1 MB)

Bibtex

Tao Cui and Franz Franchetti (Proc. High Performance Computing, Networking and Analytics for the Power Grid (HiPCNA-PG), 2013)

**Power System Probabilistic and Security Analysis on Commodity High Performance Computing Systems**

Preprint (1.2 MB)

Published paper (link to publisher)

Bibtex

Large scale integration of stochastic energy resources in power systems requires probabilistic analysis approaches for comprehensive system analysis. The large-varying grid condition on the aging and stressed power system infrastructures also requires merging of offline security analyses into online operation. Meanwhile in computing, the recent rapid hardware performance growth comes from the more and more complicated architecture. Fully utilizing the computation power for specific applications becomes very difficult. Given the challenges and opportunities in both the power system and computing fields, this paper presents the unique high performance commodity computing system solution to the following fundamental tools for power system probabilistic and security analysis: 1) a high performance Monte Carlo simulation (MCS) based distribution probabilistic load flow solver for real time distribution feeder probabilistic solution. 2) A high performance MCS based transmission probabilistic load flow solver for transmission grid analysis. 3) A SIMD accelerated AC contingency calculation solver based on Woodbury matrix identity on multi-core CPUs. By aggressive algorithm level and computer architecture level performance optimizations including optimized data structures, optimization for superscalar out-of-order execution, SIMDization, and multi-core scheduling, our software fully utilizes the modern commodity computing systems, makes the critical and computational intensive power system probabilistic and security analysis problems solvable in real time on commodity computing systems.

Tao Cui (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2013)

**Power System Probabilistic and Security Analysis Using Commodity High Performance Computing Systems**

Published paper (link to publisher)

Bibtex

Cory Thoma, Tao Cui and Franz Franchetti (Proc. IEEE Power and Energy Society General Meeting (PES-GM), 2013)

**Privacy Preserving Smart Meter System Based Retail Level Electricity Market**

Published paper (link to publisher)

Bibtex

Smart metering systems in distribution networks provide near real-time, two-way information exchange between end users and utilities, enabling many advanced smart grid technologies. However, the fine grained real-time data as well as the various market functionalities also pose great risks to customer privacy. In this work we propose a secure multi-party computation (SMC) based privacy preserving smart metering system. Using the proposed SMC protocol, a utility is able to perform advanced market based demand management algorithms without knowing the actual values of private end user consumption and configuration data. Using homomorphic encryption, billing is secure and verifiable. We implemented a demonstration system that includes a graphical user interface and simulates real-world network communication of the proposed SMC-enabled smart meters. The demonstration shows the feasibility of our proposed privacy preserving protocol for advanced smart grid technologies which includes load management and retail level electricity market support.

Georg Ofenbeck, Tiark Rompf, Alen Stojanov, Martin Odersky and Markus Püschel (Proc. International Conference on Generative Programming: Concepts & Experiences (GPCE), pp. 125-134, 2013)

**Spiral in Scala: Towards the Systematic Construction of Generators for Performance Libraries**

Preprint (886 KB)

Published paper (link to publisher)

Bibtex

Program generators for high performance libraries are an appealing solution to the recurring problem of porting and optimizing code with every new processor generation, but only few such generators exist to date. This is due to not only the difficulty of the design, but also of the actual implementation, which often results in an ad-hoc collection of standalone programs and scripts that are hard to extend, maintain, or reuse. In this paper we ask whether and which programming language concepts and features are needed to enable a more systematic construction of such generators. The systematic approach we advocate extrapolates from existing generators: a) describing the problem and algorithmic knowledge using one, or several, domain-specific languages (DSLs), b) expressing optimizations and choices as rewrite rules on DSL programs, c) designing data structures that can be configured to control the type of code that is generated and the data representation used, and d) using autotuning to select the best-performing alternative. As a case study, we implement a small, but representative subset of Spiral in Scala using the Lightweight Modular Staging (LMS) framework. The first main contribution of this paper is the realization of c) using type classes to abstract over staging decisions, i.e. which pieces of a computation are performed immediately and for which pieces code is generated. Specifically, we abstract over different complex data representations jointly with different code representations including generating loops versus unrolled code with scalar replacement---a crucial and usually tedious performance transformation. The second main contribution is to provide full support for a) and d) within the LMS framework: we extend LMS to support translation between different DSLs and autotuning through search.

H. E. Sumbul, A. Patterson, A. Tazzoli, G. Feeder, Franz Franchetti, G. Piazza and Lawrence Pileggi (Proc. Government Applications & Critical Technology Conference (GOMACTech), 2013)

**Trusted Split-Fabrication System-on-Chip Design Technology and Methodology**

Published paper (link to publisher)

Bibtex

Martin Kong, Richard Veras, Kevin Stock, Franz Franchetti, Louis-Noël Pouchet and P. Sadayappan (Proc. Programming Languages Design and Implementation (PLDI), Vol. 48, pp. 127-138, 2013)

**When Polyhedral Transformations Meet SIMD Code Generation**

Preprint (192 KB)

Published paper (link to publisher)

Bibtex

Data locality and parallelism are critical optimization objectives for performance on modern multi-core machines. Both coarse-grain parallelism (e.g., multi-core) and fine-grain parallelism (e.g., vector SIMD) must be effectively exploited, but despite decades of progress at both ends, current compiler optimization schemes that attempt to address data locality and both kinds of parallelism often fail at one of the three objectives. We address this problem by proposing a 3-step framework, which aims for integrated data locality, multi-core parallelism and SIMD execution of programs. We define the concept of vectorizable codelets, with properties tailored to achieve effective SIMD code generation for the codelets. We leverage the power of a modern high-level transformation framework to restructure a program to expose good ISA-independent vectorizable codelets, exploiting multi-dimensional data reuse. Then, we generate ISA-specific customized code for the codelets, using a collection of lower-level SIMD-focused optimizations. We demonstrate our approach on a collection of numerical kernels that we automatically tile, parallelize and vectorize, exhibiting significant performance improvements over existing compilers.

Berkin Akin, Peter A. Milder, Franz Franchetti and James C. Hoe (ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2012)

**Algorithm and Architecture Optimization for Large Size Two Dimensional Discrete Fourier Transform**

Preprint (1.4 MB)

Published paper (link to publisher)

Bibtex

Tao Cui and Franz Franchetti (Proc. IEEE Power and Energy Society General Meeting (PES-GM), pp. 1-6, 2012)

**A Multi-Core High Performance Computing Framework for Probabilistic Solutions of Distribution Systems**

Preprint (1.9 MB)

Published paper (link to publisher)

Bibtex

Multi-core CPUs with multiple levels of parallelism and deep memory hierarchies have become the mainstream computing platform. In this paper we developed a generally applicable high performance computing framework for Monte Carlo simulation (MCS) type applications in distribution systems, taking advantage of performance-enhancing features of multi-core CPUs. The application in this paper is to solve the probabilistic load flow (PLF) in real time, in order to cope with the uncertainties caused by the integration of renewable energy resources. By applying various performance optimizations and multi-level parallelization, the optimized MCS solver is able to achieve more than 50% of a CPU’s theoretical peak performance and the performance is scalable with the hardware parallelism. We tested the MCS solver on the IEEE 37-bus test feeder using a new Intel Sandy Bridge multi-core CPU. The optimized MCS solver is able to solve millions of load flow cases within a second, enabling the real-time Monte Carlo solution of the PLF.

C. Angelopoulos, Franz Franchetti and Markus Püschel (NVIDIA Research Summit at the GPU Technology Conference, 2012)

**Automatic Generation of FFT Libraries for GPUs**

Preprint (1.7 MB)

Published paper (link to publisher)

Bibtex

Franz Franchetti, Yevgen Voronenko and G. Almasi (Proc. High Performance Computing for Computational Science (VECPAR), 2012)

**Automatic Generation of the HPC Challenges Global FFT Benchmark for BlueGene/P**

Preprint (151 KB)

Published paper (link to publisher)

Bibtex

We present the automatic synthesis of the HPC Challenge’s Global FFT, a large 1D FFT across a whole supercomputer system.We extend the Spiral system to synthesize specialized single-node FFT libraries that combine a data layout transformation with the actual on-node FFT computation to improve the network performance through enabling all-to-all collectives. We run our optimized Global FFT benchmark on up to 128k cores (32 racks) of ANL’s BlueGene/P “Intrepid” and achieved 6.4 Tflop/s, outperforming ANL’s 2008 HPC Challenge Class I Global FFT run (5 Tflop/s). Our code was part of IBM’s winning 2010 HPC Challenge Class II submission. Further, we show first single-thread results on BlueGene/Q.

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (ACM Transactions on Design Automation of Electronic Systems, Vol. 17, No. 2, 2012)

**Computer Generation of Hardware for Linear Digital Signal Processing Transforms**

Preprint (2.2 MB)

Published paper (link to publisher)

Bibtex

Linear signal transforms such as the discrete Fourier transform (DFT) are very widely used in digital signal processing and other domains. Due to high performance or efficiency requirements, these transforms are often implemented in hardware. This implementation is challenging due to the large number of algorithmic options (e.g., fast Fourier transform algorithms or FFTs), the variety of ways that a fixed algorithm can be mapped to a sequential datapath, and the design of the components of this datapath. The best choices depend heavily on the resource budget and the performance goals of the target application. Thus, it is difficult for a designer to determine which set of options will best meet a given set of requirements.

In this article we introduce the Spiral hardware generation framework and system for linear transforms. The system takes a problem specification as input as well as directives that define characteristics of the desired datapath. Using a mathematical language to represent and explore transform algorithms and data- path characteristics, the system automatically generates an algorithm, maps it to a datapath, and outputs a synthesizable register transfer level Verilog description suitable for FPGA or ASIC implementation. The quality of the generated designs rivals the best available handwritten IP cores.

Marcela Zuluaga, Peter A. Milder and Markus Püschel (Proc. Design Automation Conference (DAC), pp. 1245-1253, 2012)

**Computer Generation of Streaming Sorting Networks**

Preprint (1.1 MB)

Published paper (link to publisher)

Bibtex

Sorting networks offer great performance but become prohibitively expensive for large data sets. We present a domain-specific language and compiler to automatically generate hardware implementations of sorting networks with reduced area and optimized for latency or throughput. Our results show that the generator produces a wide range of Pareto-optimal solutions that both compete with and outperform prior sorting hardware.

Qiuling Zhu, Lawrence Pileggi and Franz Franchetti (Proc. IFIP/IEEE Internationa Conference on Very Large Scale Integration, pp. 111-116, 2012)

**Cost-Effective Smart Memory Implementation for Parallel Backprojection in Computed Tomography**

Preprint (1.7 MB)

Published paper (link to publisher)

Bibtex

As nanoscale lithography challenges mandate greater pattern regularity and commonality for logic and memory circuits, new opportunities are created to affordably synthesize more powerful smart memory blocks for specific applications. Leveraging the ability to embed logic inside the memory block boundary, we demonstrate the synthesis of smart memory architectures that exploits the inherent memory address patterns of the backprojection algorithm to enable efficient parallel image reconstruction at minimum hardware overhead. An end-toend design framework in sub-20nm CMOS technologies was constructed for the physical synthesis of smart memories and evaluation of the huge design space. Our experimental results show that customizing memory for the computerized tomography (CT) parallel backprojection can achieve more than 30% area and power savings while offering significant performance improvements with marginal sacrifice of image accuracy.

Qiuling Zhu, K. Vaidyanathan, O. Shacham, M. Horowitz, Lawrence Pileggi and Franz Franchetti (Proc. IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pp. 125-132, 2012)

**Design Automation Framwork for Application-Specific Logic-in-Memory Blocks**

Preprint (789 KB)

Published paper (link to publisher)

Bibtex

This paper presents a design methodology for hardware synthesis of application-specific logic-in-memory (LiM) blocks. Logic-in-memory designs tightly integrate specialized computation logic with embedded memory, enabling more localized computation, thus save energy consumption. As a demonstration, we present an end-to-end design framework to automatically synthesize an interpolation based logic-in-memory block named interpolation memory, which combines a seed table with simple arithmetic logic to efficiently evaluate functions. In order to support multiple consecutive seed data access that is required in the interpolation operation, we synthesize the physical memory into the novel rectangular access smart memory blocks. We evaluated a large design space of interpolation memories in sub-20 nm commercial CMOS technology by using the proposed design framework. Furthermore, we implemented a logic-in-memory based computed tomography (CT) medical image reconstruction system and our experimental results show that the logic-in-memory computing method achieves orders of magnitude of energy saving compared with the traditional in-processor computing.

W. Yu, Franz Franchetti, James C. Hoe and Tsuhan Chen (Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 296-307, 2012)

**Highly Efficient Performance Portable Tracking of Evolving Surfaces**

Preprint (488 KB)

Published paper (link to publisher)

Bibtex

In this paper we present a framework to obtain highly efficient implementations for the narrow band level set method on commercial off-the-shelf (COTS) multicore CPU systems with a cache-based memory hierarchy such as Intel Xeon and Atom processors. The narrow-band level set algorithm tracks wave-fronts in discretized volumes (for instance, explosion shock waves), and is computationally very demanding. At the core of our optimization framework is a novel projection-based approach to enhance data locality and enable reuse for sparse surfaces in dense discretized volumes. The method reduces stencil operations on sparse and changing sets of pixels belonging to an evolving surface into dense stencil operations on meta-pixels in a lower-dimensional projection of the pixel space. These meta-pixels are then amenable to standard techniques like time tiling. However, the complexity introduced by ever-changing meta-pixels requires us to revisit and adapt all other necessary optimizations. We apply adapted versions of SIMDization, multi-threading, DAG scheduling for basic tiles, and specialization through code generation to extract maximum performance. The system is implemented as highly parameterized code skeleton that is auto-tuned and uses program generation. We evaluated our framework on a dual-socket 2.8 GHz Xeon 5560 and a 1.6 GHz Atom N270. Our single-core performance reaches 26%–35% of the machine peak on the Xeon, and 12%– 20% on the Atom across a range of image sizes. We see up to 6.5x speedup on 8 cores of the dual-socket Xeon. For cache resident sizes our code outperforms the best available third party code (C pre-compiled into a DLL) by about 10x and for the largest out-of-cache sizes the speedup approaches around 200x. Experiments fully explain the high speedup numbers.

Robert Koutsoyannis, Peter A. Milder, Christian Berger, Madeleine Glick, James C. Hoe and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012)

**Improving Fixed-Point Accuracy of FFT Cores in O-OFDM Systems**

Preprint (1018 KB)

Bibtex

Optical OFDM communication systems operating at data rates in the 40Gb/s (and higher) range require high-throughput/highly parallel fast Fourier transform (FFT) implementations. These consume a significant amount of chip resources; we aim to reduce costs by improving the system’s accuracy per chip-area. For OFDM signals, we characterize the growth of data within the FFT and explore several cost-conscious methods for improving the fixed-point format. Using ASIC synthesis and hardware accurate simulations, we evaluate the corresponding system error and stability of these methods. We introduce Directive Scaling, which provides an average increase in overall accuracy without additional runtime-adaptive mechanisms. ASIC synthesis results show minimal overhead, and we explicitly evaluate and explain the inherent tradeoffs. When applied to an 8-bit IFFT design, our technique improves precision by approximately two bits with just a 4% area overhead, as opposed to the additional 32% area overhead required using standard methods.

Qiuling Zhu, Christian Berger, E. L. Turner, Lawrence Pileggi and Franz Franchetti (Journal of Signal Processing Systems, 2012)

**Local Interpolation-based Polar Format SAR: Algorithm, Hardware Implementation and Design Automation**

Published paper (link to publisher)

Bibtex

In this paper we present a local interpolation-based variant of the well-known polar format algorithm used for synthetic aperture radar (SAR) image formation. We develop the algorithm to match the capabilities of the application-specific logic-in-memory processing paradigm, which off-loads lightweight computation directly into the SRAM and DRAM. Our proposed algorithm performs filtering, an image perspective transformation, and a local 2D interpolation, and supports partial and low-resolution reconstruction. We implement our customized SAR grid interpolation logic-in-memory hardware in advanced 14 nm silicon technology. Our high-level design tools allow to instantiate various optimized design choices to fit image processing and hardware needs of application designers. Our simulation results show that the logic-in-memory approach has the potential to enable substantial improvements in energy efficiency without sacrificing image quality.

Berkin Akin, Peter A. Milder, Franz Franchetti and James C. Hoe (Proc. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2012)

**Memory Bandwidth Efficient Two-Dimensional Fast Fourier Transform Algorithm and Implementation for Large Problem Sizes**

Preprint (645 KB)

Published paper (link to publisher)

Bibtex

Prevailing VLSI trends point to a growing gap between the scaling of on-chip processing throughput and off-chip memory bandwidth. An efficient use of memory bandwidth must become a first-class design consideration in order to fully utilize the processing capability of highly concurrent processing platforms like FPGAs. In this paper, we present key aspects of this challenge in developing FPGA-based implementations of two-dimensional fast Fourier transform (2D-FFT) where the large datasets must reside off-chip in DRAM. Our scalable implementations address the memory bandwidth bottleneck through both (1) algorithm design to enable efficient DRAM access patterns and (2) datapath design to extract the maximum compute throughput for a given level of memory bandwidth. We present results for double-precision 2D-FFT up to size 2,048-by-2,048. On an Altera DE4 platform our implementation of the 2,048-by-2,048 2D-FFT can achieve over 19.2 Gflop/s from the 12 GByte/s maximum DRAM bandwidth available. The results also show that our FPGA-based implementations of 2D-FFT are more efficient than 2D-FFT running on state-of-the- art CPUs and GPUs in terms of the bandwidth and power efficiency.

Tao Cui and Franz Franchetti (Proc. IEEE High Performance Extreme Computing (HPEC), 2012)

**Optimized Parallel Distribution Load Flow Solver on Commodity Multi-core CPU**

Published paper (link to publisher)

Bibtex

Solving a large number of load flow problems quickly is required for Monte Carlo analysis and various power system problems, including long term steady state simulation, system benchmarking, among others. Due to the computational burden, such applications are considered to be time-consuming, and infeasible for online or realtime application. In this work we developed a high performance framework for high throughput distribution load flow computation, taking advantage of performance-enhancing features of multi-core CPUs and various code optimization techniques. We optimized data structures to better fit the memory hierarchy. We use the SPIRAL code generator to exploit inherent patterns of the load flow model through code specialization. We use SIMD instructions and multithreading to parallelize our solver. Finally, we designed a Monte Carlo thread scheduling infrastructure to enable real time operation. The optimized solver is able to achieve more than 50% of peak performance on a Intel Core i7 CPU, which translates to solving millions of load flow problems within a second for IEEE 37 test feeder.

Qiuling Zhu, Christian Berger, E. L. Turner, Lawrence Pileggi and Franz Franchetti (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 1557-1560, 2012)

**Polar Format Synthetic Aperture Radar in Energy Efficient Application-Specific Logic-in-Memory**

Preprint (489 KB)

Published paper (link to publisher)

Bibtex

In this paper we present a local interpolation-based variant of the well-known polar format algorithm used for synthetic aperture radar (SAR) image formation. We develop the algorithm to match the capabilities of the application-specific logic-in-memory processing paradigm, which off-loads lightweight computation directly into the SRAM and DRAM. Our proposed algorithm performs filtering, an image perspective transformation, and a local 2D interpolation and supports partial and low-resolution reconstruction. We implement our customized SAR grid interpolation logic-in-memory hardware in advanced 32nm silicon technology. Our high-level design tools allow to instantiate various optimized design choices to fit image processing and hardware needs of application designers. Our simulation results show that the logic-in-memory approach has the potential to enable substantial improvements in energy efficiency without sacrificing image quality.

Cory Thoma, Tao Cui and Franz Franchetti (Proc. North American Power Symposium (NAPS), pp. 1-6, 2012)

**Secure Multiparty Computation Based Privacy Preserving Smart Metering System**

Preprint (661 KB)

Published paper (link to publisher)

Bibtex

Smart metering systems provide high resolution, realtime end user power consumption data for utilities to better monitor and control the system, and for end users to better manage their energy usage and bills. However, the high resolution realtime power consumption data can also be used to extract end user activity details, which could pose a great threat to user privacy. In this work, we propose a secure multi-party computation (SMC) based privacy preserving protocol for smart meter based load management. Using SMC and a proper designed electricity plan, the utility is able to perform real time demand management with individual users, without knowing the actual value of each user’s consumption data. Using homomorphic encryption, the billing is secure and verifiable. We have further implemented a demonstration system which includes a graphical user interface and simulates network communication. The demonstration shows that the proposed privacy preserving protocol is feasible for implementation on commodity IT systems.

Marcela Zuluaga, Andreas Krause, Peter A. Milder and Markus Püschel (Proc. Languages, Compilers, Tools and Theory for Embedded Systems (LCTES), pp. 119-128 , 2012)

**"Smart" Design Space Sampling to Predict Pareto-Optimal Solutions**

Preprint (1.1 MB)

Published paper (link to publisher)

Bibtex

Many high-level synthesis tools offer degrees of freedom in mapping high-level specifications to Register-Transfer Level descriptions. These choices do not affect the functional behavior but span a design space of different cost-performance tradeoffs. In this paper we present a novel machine learning-based approach that efficiently determines the Pareto-optimal designs while only sampling and synthesizing a fraction of the design space. The approach combines three key components: (1) A regression model based on Gaussian processes to predict area and throughput based on synthesis training data. (2) A “smart” sampling strategy, GP-PUCB, to iteratively refine the model by carefully selecting the next design to synthesize to maximize progress. (3) A stopping criterion based on assessing the accuracy of the model without access to complete synthesis data. We demonstrate the effectiveness of our approach using IP generators for discrete Fourier transforms and sorting networks. However, our algorithm is not specific to this application and can be applied to a wide range of Pareto front prediction problems.

Qiuling Zhu, Lawrence Pileggi and Franz Franchetti (Proc. SRC TECHCON, 2012)

**Smart Memory Synthesis for Energy-Efficient Computed Tomography Reconstruction**

Preprint (3.4 MB)

Published paper (link to publisher)

Bibtex

As nanoscale lithography challenges mandate greater pattern regularity and commonality for logic and memory circuits, new opportunities are created to affordably synthesize more powerful smart memory blocks for specific applications. Leveraging the ability to embed logic inside the memory block boundary, we demonstrate the synthesis of smart memory architectures that exploits the inherent memory address patterns of the backprojection algorithm to enable efficient image reconstruction at minimum hardware overhead. An end-to-end design framework in sub-20nm CMOS technologies was constructed for the physical synthesis of smart memories and exploration of the huge design space. Our experimental results show that customizing memory for the computerized tomography parallel backprojection can achieve more than 30% area and power savings with marginal sacrifice of image accuracy.

Tao Cui and Franz Franchetti (Carnegie Mellon Conference on the Electricity Industry, 2011)

**A Monte Carlo Framework for Probabilistic Distribution Power Flow**

Bibtex

Tao Cui and Franz Franchetti (Proc. North American Power Symposium (NAPS), 2011)

**A Multi-core High Performance Computing Framework for Distribution Power Flow**

Preprint (1.9 MB)

Published paper (link to publisher)

Bibtex

Multi-core CPUs with multiple levels of parallelism and deep memory hierarchies have become the mainstream computing platform. In this paper we developed a generally applicable high performance computing framework for Monte Carlo simulation (MCS) type applications in distribution systems, taking advantage of performance-enhancing features of multi-core CPUs. The application in this paper is to solve the probabilistic load flow (PLF) in real time, in order to cope with the uncertainties caused by the integration of renewable energy resources. By applying various performance optimizations and multi-level parallelization, the optimized MCS solver is able to achieve more than 50% of a CPU’s theoretical peak performance and the performance is scalable with the hardware parallelism. We tested the MCS solver on the IEEE 37-bus test feeder using a new Intel Sandy Bridge multi-core CPU. The optimized MCS solver is able to solve millions of load flow cases within a second, enabling the real-time Monte Carlo solution of the PLF.

Qiuling Zhu, E. L. Turner, Christian Berger, Lawrence Pileggi and Franz Franchetti (Proc. High Performance Extreme Computing (HPEC), 2011)

**Application-Specific Logic-in-Memory for Polar Format Synthetic Aperture Radar**

Preprint (977 KB)

Published paper (link to publisher)

Bibtex

In the conventional von Neumann model, where computing systems are physically and logically split between memory and CPUs, the “memory wall” and “power wall” are well known bottlenecks that have severely limited the energy efficiency of many applications. Running today’s memory intensive applications, modern computers spend most of their time and energy moving data rather than on computation.

Daniel McFarlin, Volodymyr Arbatov, Franz Franchetti and Markus Püschel (Proc. International Conference on Supercomputing (ICS), 2011)

**Automatic SIMD Vectorization of Fast Fourier Transforms for the Larrabee and AVX Instruction Sets**

Preprint (2.5 MB)

Published paper (link to publisher)

Bibtex

The well-known shift to parallelism in CPUs is often associated with multicores. However another trend is equally salient: the increasing parallelism in per-core single-instruction multiple-date (SIMD) vector units. Intel's SSE and IBM's VMX (compatible to AltiVec) both offer 4-way (single precision) floating point, but the recent Intel instruction sets AVX and Larrabee (LRB) offer 8-way and 16-way, respectively. Compilation and optimization for vector extensions is hard, and often the achievable speed-up by using vectorizing compilers is small compared to hand-optimization using intrinsic function interfaces. Unfortunately, the complexity of these intrinsics interfaces increases considerably with the vector length, making hand-optimization a nightmare. In this paper, we present a peephole-based vectorization system that takes as input the vector instruction semantics and outputs a library of basic data reorganization blocks such as small transpositions and perfect shuffles that are needed in a variety of high performance computing applications. We evaluate the system by generating the blocks needed by the program generator Spiral for vectorized fast Fourier transforms (FFTs). With the generated FFTs we achieve a vectorization speed-up of 5.5-6.5 for 8-way AVX and 10-12.5 for 16-way LRB. For the latter instruction counts are used since no timing information is available. The combination of the proposed system and Spiral thus automates the production of high performance FFTs for current and future vector architectures.

Tao Cui and Franz Franchetti (Proc. International Workshop on Automatic Performance Tuning (iWAPT), 2011)

**Autotuning a Random Walk Boolean Satisfiability Solver**

Preprint (792 KB)

Published paper (link to publisher)

Bibtex

In this paper we present a performance optimization case study for a kernel with dynamic data structures, few instructions on boolean variables per node, and data-dependent control flow. This kernel serves as model for wide class of important algorithms operating on dynamic data structures. Our example is a simple random walk search algorithm to solve boolean satisfiability (SAT) and our goal is to gain better understanding on how such a kernel must be optimized, not to build a better SAT solver. We start from a very compact, reasonable C++ implementation using the STL vector class, and arrive at a fully optimized, parallelized, vectorized C implementation in which we use algorithm-level optimization, data structure and layout optimization, and program generation and autotuning. We evaluate our example on a 2.66 GHz quad-core Core 2 Extreme in 32-bit mode and a 3 GHz dualcore Xeon 5160 in 64-bit mode. On the Core2 Extreme quadcore processor for L1 resident synthetic benchmarks, the final version achieves up to 47% of machine peak (240 useful logical operations per cycle) and the realistic benchmarks 60–230 operations/cycle. The baseline C++ STL implementation requires between 7 and 50 cycles per useful logical operation for the synthetic benchmark; this translates into about 1,700–12,000x performance gain.

Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J. Ramanujam and P. Sadayappan (Proc. International Conference on Compiler Construction (CC), 2011)

**Data Layout Transformation for Stencil Computations on Short SIMD Architectures**

Preprint (430 KB)

Published paper (link to publisher)

Bibtex

Stencil computations are at the core of applications in many domains such as computational electromagnetics, image processing, and partial differential equation solvers used in a variety of scientific and engineering applications. Short-vector SIMD instruction sets such as SSE and VMX provide a promising and widely available avenue for enhancing performance on modern processors. However a fundamental memory stream alignment issue limits achieved performance with stencil computations on modern short SIMD architectures. In this paper, we propose a novel data layout transformation that avoids the stream alignment conflict, along with a static analysis technique for determining where this transformation is applicable. Significant performance increases are demonstrated for a variety of stencil codes on several modern processors with SIMD capabilities.

Rachid Bouziane, Peter A. Milder, Robert Koutsoyannis, Yannis Benlachtar, James C. Hoe, Markus Püschel, Madeleine Glick and Robert I. Killey (Optics Express, Vol. 19, No. 21, pp. 20857-20864, 2011)

**Design studies for ASIC implementations of 28 GS/s optical QPSK- and 16-QAM-OFDM transceivers**

Published paper (link to publisher)

Bibtex

We designed at the register-transfer-level digital signal processing (DSP) circuits for 21.8 Gb/s and 43.7 Gb/s QPSK- and 16- QAM-encoded optical orthogonal frequency division multiplexing (OFDM) transceivers, and carried out synthesis and simulations assessing performance, power consumption and chip area. The aim of the study is to determine the suitability of OFDM technology for low-cost optical interconnects. Power calculations based on synthesis for a 65nm standard- cell library showed that the DSP components of the transceiver (FFTs, equalisation, (de)mapping and clipping/scaling circuits) consume 18.2 mW/Gb/s and 12.8 mW/Gb/s in the case of QPSK and 16-QAM respectively.

Franz Franchetti and Markus Püschel (in Encyclopedia of Parallel Computing, Eds. David Padua, Springer 2011)

**Fast Fourier Transform**

Preprint (874 KB)

Published paper (link to publisher)

Bibtex

A fast Fourier transform (FFT) is an efficient algorithm to compute the discrete Fourier transform (DFT) of an input vector. Efficient means that the FFT computes the DFT of an n-element vector in O(n log n) operations in contrast to the O(n^2) operations required for computing the DFT by definition. FFTs exist for any vector length $n$ and for real and higher-dimensional data. Parallel FFTs have been developed since the advent of parallel computing.

W. Yu, Franz Franchetti, James C. Hoe, José M. F. Moura and Tsuhan Chen (Proc. High Performance Extreme Computing (HPEC), 2011)

**Performance Portable Tracking of Evolving Surfaces**

Preprint (525 KB)

Bibtex

Tracking the continuous evolution of surfaces such as a shock wavefront has a wide range of real world applications. The level set algorithm is a widely used tool for tracking evolving surfaces[4]. It embeds the surface into a higher dimensional function defined on a structural grid discretized volume, and performs numerical computation on the fixed Cartesian grid. The narrow band level set[4] is a variation of the level set method that can significantly reduce the computational cost without noticeable change in quality, by constraining computation to a neighborhood region around the interface which is called narrow band. The narrow band level set algorithm performs a computation similar to iteratively solving partial differential equations with stencils, but on a irregular shaped and dynamically evolving narrow band. Usually, the interface motion is tracked for a large number of iterations, thus having the potential for a high data reuse rate. However, extracting data reuse on the highly irregular computational pattern is difficult. The major contribution of the paper is that we develop a framework to generate highly efficient code for this algorithm on mainstream CPUs with multicore, deep memory hierarchies and SIMD instructions. Related work. This work is closely related to prior works on optimization of stencils [3, 2] and sparse linear algebra[5, 6]. Prior work on stencils has been primarily focusing on the dense structural grid. Sparse matrix solvers are usually memory-bound because of the low data reuse rate. We combined concepts from both areas, and develop a novel framework for the narrow band level set algorithm.

Christian Berger, Volodymyr Arbatov, Yevgen Voronenko, Franz Franchetti and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 1693-1696 , 2011)

**Real-Time Software Implementation of an IEEE 802.11a Baseband Receiver on Intel Multicore**

Preprint (148 KB)

Published paper (link to publisher)

Bibtex

We present a software-only implementation of an IEEE~802.11a (WiFi) receiver optimized for Intel multicore platforms. The receiver is about 50 times faster than a straightforward C implementation, i.e., an implementation that has the same functionality, but leaves optimization completely to the compiler. Our hand-optimized implementation achieves real-time for all data rates up to the maximum of 54 Mbit/s on a Core i7, clocked at 3.3 GHz, and for up to 12 Mbit/s on an Atom, clocked at 1.6 GHz, using two cores in both cases. To achieve this performance we use up to two threads, up to 16-way vectorization using Intel's SSE, and various other optimizations.

Markus Püschel, Franz Franchetti and Yevgen Voronenko (in Encyclopedia of Parallel Computing, Eds. David Padua, pp. 1920-1933, Springer 2011)

**Spiral**

Preprint (362 KB)

Published paper (link to publisher)

Bibtex

Spiral is a program generation system (software that generates other software) for linear transforms and an increasing list of other mathematical functions. The goal of Spiral is to automate the development and porting of performance libraries. Linear transforms include the discrete Fourier transform (DFT), discrete cosine transforms, convolution, and the discrete wavelet transform. The input to Spiral consists of a high-level mathematical algorithm specification and selected architectural and microarchitectural parameters. The output is performance-optimized code in a high-level language such as C, possibly augmented with vector intrinsics and threading instructions.

Peter A. Milder (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2010)

**A Mathematical Approach for Compiling and Optimizing Hardware Implementations of DSP Transforms**

Preprint (4.4 MB)

Bibtex

Linear signal transforms (such as the discrete Fourier transform) are frequently used in digital signal processing and related fields. Algorithms for computing linear transforms are well-understood and typically have a high degree of regularity and parallelism, making them well-suited for implementation as sequential datapaths on field-programmable gate arrays or application-specific integrated circuits. Nonetheless, transforms are difficult to implement due to the large number of algorithmic options available and ways that algorithms can be mapped to sequential datapaths. Further, the best choices depend heavily on the resource budget and performance goals of the target application. Thus, it is difficult for a designer to determine which set of options will best meet a given set of requirements. This thesis proposes the Spiral hardware generation framework, a hardware compilation and optimization tool based on a mathematical formula language. A formula written in this language specifies a particular transform algorithm executed using a particular sequential datapath. This language allows high-level representation of sequential hardware structures that reuse datapath elements multiple times in the computation of a transform. The language accomplishes this by providing a formal connection between structure in the algorithm and sequential reuse in the datapath. This proposed language drives a hardware compilation system that takes as input a problem specification with directives that define characteristics of the desired datapath; the system then automatically generates an algorithm, maps the algorithm to a datapath, and outputs synthesizable register transfer level Verilog. This thesis evaluates the generated designs when synthesized for field-programmable gate array or application-specific integrated circuit. Its evaluations consider designs across multiple transforms, datatypes, and design goals, and its results show that Spiral is able to automatically providea wide tradeoff between cost (e.g., area, power) and performance. This tradeoff space compares well with existing benchmarks, but allows the designer much more flexibility to find the design best suited to his or her needs.

Frédéric de Mesmay, Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. International Conference on High Performance Embedded Architectures and Compilers (HiPEAC), Lecture Notes in Computer Science, Springer, Vol. 5952, pp. 353-368, 2010)

**Computer Generation of Efficient Software Viterbi Decoders**

Preprint (575 KB)

Published paper (link to publisher)

Bibtex

This paper presents a program generator for fast software Viterbi decoders for arbitrary convolutional codes. The input to the generator is a specification of the code, and single-instruction multiple-data (SIMD) vector length. The output is an optimized C implementation of the decoder that uses explicit Intel SSE vector instructions. Only the performance-critical forward pass is actually generated, and a generic infrastructure handles the traceback pass. At the heart of the generator is a small domain-specific language called VL to express the structure of the forward pass. Vectorization is done by rewriting VL expressions, which a compiler then translates into actual code in addition to performing further optimizations specific to the vector instruction set. Benchmarks show that the generated decoders match the performance of available expert hand-tuned implementations, while spanning the entire space of convolutional codes.

Srinivas Chellappa (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2010)

**Computer Generation of Fourier Transform Libraries for Distributed Memory Architectures**

Preprint (1.7 MB)

Bibtex

High-performance discrete Fourier transform (DFT) libraries are an important requirement for many computing platforms. Unfortunately, developing and optimizing these libraries for modern, complex platforms has become extraordinarily difficult. To make things worse, performance often does not port, thus requiring permanent re-optimizations. Overcoming this problem has been the goal of Spiral, a library generation system that can produce highly optimal DFT code from a high level specification of algorithms and target platforms.

However, current techniques in Spiral cannot support all target platforms. In particular, several emerging target platforms incorporate a distributed memory parallel processing paradigm, where the cost of accessing non-local memories is relatively high, and handling data movement is exposed to the programmers. Traditionally used only in supercomputing environments, this paradigm is now also finding its way in the form of multicore processors into desktop computing.

The goal of this work is the computer generation of high-performance DFT libraries for a wide range of distributed memory parallel processing systems, given only a high-level description of a DFT algorithm and some platform parameters. The key challenges include generating code for multiple target programming paradigms that delivers load balanced parallelization across multiple layers of the compute hierarchy, orchestrates explicit memory management, and overlaps computation with communication.

We attack this problem by first developing a formal framework to describe parallelization, streaming, and data exchange in a domain-specific declarative mathematical language. Based on this framework, we develop a rewriting system that structurally manipulates DFT algorithms to "match" them to a distributed memory target architecture and hence extracts maximum performance. We implement this approach as a part of Spiral together with a backend that translates the derived algorithms into actual libraries. Experimental results show that the performance of our generated code is comparable to hand-tuned code in all cases, and exceeds the performance of hand-tuned code in some cases.

Yevgen Voronenko, Volodymyr Arbatov, Christian Berger, Ronghui Peng, Markus Püschel and Franz Franchetti (Proc. Software Defined Radio (SDR), 2010)

**Computer Generation of Platform-Adapted Physical Layer Software**

Preprint (310 KB)

Bibtex

In this paper, we describe a program generator for physical layer (PHY) baseband processing in a software-defined radio implementation. The input of the generator is a very high-level platform-independent description of the transmitter and receiver PHY functionality, represented in a domain-specific declarative language called Operator Language (OL). The output is performance-optimized and platform-tuned C code with single-instruction multiple-data (SIMD) vector intrinsics and threading directives. The generator performs these optimizations by restructuring the algorithms for the individual components at the OL level before mapping to code. This way known compiler limitations are overcome. We demonstrate the approach and the excellent performance of the generated code on on the IEEE 802.11a (WiFi) receiver and transmitter PHY for all transmission modes.

Douglas F. Jones (Master thesis, Computer Science, Drexel University, 2010)

**Data Pump Architecture Simulator and Performance Model**

Preprint (1.1 MB)

Bibtex

The Data Pump Architecture (DPA) is a novel non-von-Neumann computer architecture emphasizing efficient use of on-chip SRAM and off-chip DRAM bandwidth. The DPA is parameterized by local memory size, memory bandwidth, vector length, and number of compute processors, allowing the architecture configuration to be balanced for a given set of computations. The Data Pump Architecture Simulator and Performance Model is a functional simulator providing an implementation of the DPA as a software library. Simulation at this level provides the user with the ability to test algorithms for correctness as well as performance. A hardware designer may then use the Simulator as a tool to experimentally determine the effects of architecture parameters and software algorithms through performance data provided by the Simulator's model. The benefit of the approach described here compared to alternatives such as logic/gate level or RTL/HDL software emulation is primarily a reduction in the time needed for 1) a designer to modify and compile a HDL design and 2) to perform the simulation. The Simulator serves as a bridge between the application specific hardware and software design. The result is the ability to use the simulator as part of a framework for rapidly investigating the construction of an optimal system architecture for a given set of algorithms. Investigations with the Walsh-Hadamard transform (WHT), a prototypical digital signal processing transform, are shown. The SPIRAL (www.spiral.net) code generation system is used to explore the space of WHT algorithms while the Simulator is used to explore the performance and trade-offs of dierent DPA configurations.

Rachid Bouziane, Peter A. Milder, Robert Koutsoyannis, Yannis Benlachtar, Christian Berger, James C. Hoe, Markus Püschel, Madeleine Glick and Robert I. Killey (Proc. European Conference on Optical Communication (ECOC), pp. 1-3, 2010)

**Design Studies for an ASIC Implementation of an Optical OFDM Transceiver**

Preprint (254 KB)

Published paper (link to publisher)

Bibtex

We designed at the register-transfer-level the DSP circuits for a 21.8 Gb/s QPSK-OFDM transceiver, and carried out synthesis and simulations assessing performance, power consumption and chip area, to determine their suitability for low-cost optical interconnects.

C. Angelopoulos, Franz Franchetti and Markus Püschel (NVIDIA Research Summit at the GPU Technology Conference, 2010)

**DFT Transform on the Fermi (GTX480): Automatic Program Generation**

Bibtex

W. Yu, Franz Franchetti, James C. Hoe and Tsuhan Chen (Proc. IEEE International Conference on Image Processing (ICIP), 2010)

**Fast and Robust Active Contours for Image Segmentation**

Preprint (1.1 MB)

Published paper (link to publisher)

Bibtex

Active models are widely used in applications like image segmentation and tracking. Region-based active models are known for robustness to weak edges and high computational complexity. We found previous region-based models can easily get stuck in local minimums if initialization is far from the true object boundary. This is caused by an inherent ambiguity in evolution direction of the level set function when minimizing the energy. To solve this problem, we propose an intensity re-weighting (IR) model to bias the evolution process in certain direction. IR model can effectively avoid local minimums and enable much faster convergence of the evolution process. The proposed method is applied to both real and synthetic images with promising results.

W. Yu, Franz Franchetti, James C. Hoe, Y.-J. Chang and Tsuhan Chen (Proc. IEEE International Conference on Image Processing (ICIP), 2010)

**Fast Bilateral Filtering By Adapting Block Size**

Preprint (422 KB)

Published paper (link to publisher)

Bibtex

Direct implementations of bilateral filtering show O(r2) computational complexity per pixel, where r is the filter window radius. Several lower complexity methods have been developed. State-of-the-art low complexity algorithm is an O(1) bilateral filtering, in which computational cost per pixel is nearly constant for large image size. Although the overall computational complexity does not go up with the window radius, it is linearly proportional to the number of quantization levels of bilateral filtering computed per pixel in the algorithm. In this paper, we show that overall runtime depends on two factors, computing time per pixel per level and average number of levels per pixel. We explain a fundamental trade-off between these two factors, which can be controlled by adjusting block size. We establish a model to estimate run time and search for the optimal block size. Using this model, we demonstrate an average speedup of 1.2–26.0x over the pervious method for typical bilateral filtering parameters.

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2010)

**Hardware Implementation of the Discrete Fourier Transform with Non-Power-of-Two Problem Size**

Preprint (482 KB)

Bibtex

In this paper, we examine several algorithms suitable for the hardware implementation of the discrete Fourier transform (DFT) with non-power-of two problem size. We incorporate these algorithms into Spiral, a tool capable of automatically generating corresponding hardware implementations. We discuss how each algorithm can be used to generate different types of hardware structures, and we demonstrate that our tool is able to produce hardware implementations of non-power- of-two sized DFTs over a wide range of cost/performance tradeoff points.

W. Yu, Tsuhan Chen, Franz Franchetti and James C. Hoe (IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), Vol. 20, No. 11, pp. 1509-1519, 2010)

**High Performance Stereo Vision Designed for Massively Data Parallel Platforms**

Preprint (870 KB)

Published paper (link to publisher)

Bibtex

Real-time stereo vision is attractive in many applications like robot navigation and 3D scene reconstruction. Data parallel platforms, e.g. GPU, is often used for real-time stereo, because most stereo algorithms involve a large portion of data parallel computations. In this paper, we propose a stereo system on GPU which pushes the Pareto-efficiency frontline in the accuracy and speed trade-off space. Our system is based on hardware-aware algorithm design approach. The system consists of new algorithms and code optimization techniques. We emphasize on keeping the highly data parallel structure in the algorithm design process such that the algorithms can be effectively mapped to massively data parallel platforms. We propose two stereo algorithms: namely, exponential step size adaptive weight (ESAW), and exponential step size message propagation (ESMP). ESAW reduces computational complexity without sacrificing disparity accuracy. ESMP is an extension of ESAW, which incorporates the smoothness term to better model non-frontal planes. ESMP offers additional choice in the accuracy and speed trade-off space. We adopt code optimization methodologies from the performance tuning community, and apply them to this specific application. Such approach gives higher performance than optimizing the code in an ‘ad hoc’ manner, and helps understanding the code efficiency. Experiment results demonstrate a speed-up factor of 2.7 to 8.5 over state-ofthe- art stereo systems at comparable disparity accuracy.

Frédéric de Mesmay, Yevgen Voronenko and Markus Püschel (Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1-10, 2010)

**Offline Library Adaptation Using Automatically Generated Heuristics**

Preprint (626 KB)

Published paper (link to publisher)

Bibtex

Automatic tuning has emerged as a solution to provide high-performance libraries for fast changing, increasingly complex computer architectures. We distinguish offline adaptation (e.g., in ATLAS) that is performed during installation without the full problem description from online adaptation (e.g., in FFTW) that is performed at run-time. Offline adaptive libraries are simpler to use, but, unfortunately, writing the adaptation heuristics that power them is a daunting task. The overhead of online adaptive libraries, on the other hand, makes them unsuitable for a number of applications. In this paper, we propose to automatically generate heuristics in the form of decision trees using a statistical classifier, effectively converting an online adaptive library into an offline one. As testbed we use Spiral-generated adaptive transform libraries for current multicores with vector extensions. We show that replacing the online search with generated decision trees maintains a performance competitive with vendor libraries while allowing for a simpler interface and reduced computation overhead.

Q. Li, Tao Cui, Rohit Negi, Franz Franchetti and Marija Ilic (, 2010)

**On-line Decentralized Charging of Plug-In Electric Vehicles in Power Systems**

Published paper (link to publisher)

Bibtex

The concept of plug-in electric vehicles (PEV) are gaining increasing popularity in recent years, due to the growing societal awareness of reducing greenhouse gas (GHG) emissions, and gaining independence on foreign oil or petroleum. Large-scale deployment of PEVs currently faces many challenges. One particular concern is that the PEV charging can potentially cause significant impacts on the existing power distribution system, due to the increase in peak load. As such, this work tries to mitigate the impacts of PEV charging by proposing a decentralized smart PEV charging algorithm to minimize the distribution system load variance, so that a `flat' total load profile can be obtained. The charging algorithm is myopic, in that it controls the PEV charging processes in each time slot based entirely on the current power system states, without knowledge about future system dynamics. We provide theoretical guarantees on the asymptotic optimality of the proposed charging algorithm. Thus, compared to other forecast based smart charging approaches in the literature, the charging algorithm not only achieves optimality asymptotically in an on-line, and decentralized manner, but also is robust against various uncertainties in the power system, such as random PEV driving patterns and distributed generation (DG) with highly intermittent renewable energy sources.

Frédéric de Mesmay (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2010)

**On the Computer Generation of Adaptive Numerical Libraries**

Preprint (3.2 MB)

Bibtex

Very fast runtime is crucial in many applications in scientific computing, multimedia processing, communication, and control. Most of these applications spend the bulk of the computation in well known mathematical functions which are provided by highly optimized libraries. The development and maintenance of these libraries has become extraordinarily difficult. Optimal performance requires multiple-core balancing, careful use of vector instruction sets, and locality optimization. These optimizations require highly-skilled programmers and are often platform-specific, which means maintenance is a considerable effort given the short processor release cycles.

The Spiral system has successfully addressed these issues by automatically generating high performance libraries given only a high-level mathematical algorithm description in a language called SPL. Spiral produced high performance code using a number of techniques including SPL rewriting systems and a form iterative compilation. However, to date Spiral has been limited in two key aspects. First, Spiral could only generate libraries for the domain of linear transforms; second, all optimizations for a specific target platform are performed during the source code generation, that is, the produced libraries themselves had no dynamic platform-adaptation mechanism.

In this thesis we make progress on both fronts. We present a framework and its implementation for the computer generation of functionalities that are not transforms, specifically matrix multiplication and convolutional decoding. The framework builds on the operator language (OL) that we introduce and that extends SPL. Similar to prior work on transforms, we then develop OL rewriting system to explore algorithm choice, to vectorize and parallelize, and to derive the basic library structure called recursion step closure. The actual code is obtained through a backend that supports different target languages. The generated libraries exhibit a performance comparable to libraries that are hand-written for commodity workstations.

Further, we enable the generation of platform-adaptive libraries, through adaptation modules that can be inserted into our libraries, which are generated to support different ways to compute the same function. We distinguish between online adaptation and offline adaptation and provide mechanism for both. Online adaptation happens during the actual user function call when the input size is provided. Given this size, the library searches for the best computation strategy inside the library, which can then be used for subsequent computations of this size. We provide the dynamic programming strategy used in prior work and introduce a novel kind of Monte-Carlo search on graphs. Finally, we present a machine learning approach that performs offline (during installation time) adaptation with an online adaptive library. First a search is run to produce solutions for a set of sizes. Based on the result, a learning algorithm derives solutions for all sizes in the form of a set of decision trees that are then inserted into the library to render it deterministic. Experiments show the viability of both approaches.

Yannis Benlachtar, Rachid Bouziane, Robert I. Killey, Christian Berger, Peter A. Milder, Robert Koutsoyannis, James C. Hoe, Markus Püschel and Madeleine Glick (Proc. International Conference on Transparent Optical Networks (ICTON), pp. 1-4, 2010)

**Optical OFDM for the Data Center**

Preprint (242 KB)

Published paper (link to publisher)

Bibtex

We investigated the use of orthogonal frequency division multiplexing (OFDM) to increase the capacity of multimode fiber (MMF)-based optical interconnects for data center applications. This approach provides a solution to modulation bandwidth limitations of the lasers, and to the intermodal dispersion of the MMF which leads to frequency-dependent attenuation. Recent studies on adaptively modulated OFDM are reviewed, and new simulation results assessing the capacity of such links for lengths of up to 300 m are presented, assuming the use of 50/125 μm graded-index MMF at a wavelength of 850 nm. The use of coded OFDM as an approach to deal with intermodal dispersion is also discussed.

W. Yu (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2010)

**Performance Portable Tracking of Evolving Surfaces**

Published paper (link to publisher)

Bibtex

Yannis Benlachtar, Philip M. Watts, Rachid Bouziane, Peter A. Milder, Robert Koutsoyannis, James C. Hoe, Markus Püschel, Madeleine Glick and Robert I. Killey (IEEE Journal of Selected Topics in Quantum Electronics, Vol. 16, No. 5, pp. 1235-1244 , 2010)

**Real-Time Digital Signal Processing for the Generation of Optical Orthogonal Frequency Division Multiplexed Signals**

Preprint (836 KB)

Published paper (link to publisher)

Bibtex

We investigate the design of a field programmable gate array (FPGA) based optical orthogonal frequency division multiplexing (OFDM) transmitter implementing real time digital signal processing at 21.4 GS/s. The transmitter was utilized to generate 8.34 Gbit/s QPSK-OFDM signals for direct detection. We study the impact of the finite resolutions of the inverse fast Fourier transform (IFFT) cores and the digital-to-analog converters (DAC) on the system performance. Furthermore, we demonstrate a transmission experiment over 800 and 1600 km of uncompensated standard fiber with negligible OSNR penalties and BER < 10^-3.

Lingchuan Meng, Jeremy Johnson, Franz Franchetti, Yevgen Voronenko, Marc Moreno Maza and Yuzhen Xie (Proc. Parallel Symbolic Computation (PASCO), pp. 169-170, 2010)

**Spiral-Generated Modular FFT Algorithms**

Preprint (366 KB)

Published paper (link to publisher)

Bibtex

This paper presents an extension of the Spiral system to automatically generate and optimize FFT algorithms for the discrete Fourier transform over finite fields. The generated code is intended to support modular algorithms for multivariate polynomial computations in the modpn library used by Maple. The resulting code provides an order of magnitude speedup over the original implementations in the modpn library, and the Spiral system provides the ability to automatically tune the FFT code to different computing platforms.

Yannis Benlachtar, Philip M. Watts, Rachid Bouziane, Peter A. Milder, Robert Koutsoyannis, James C. Hoe, Markus Püschel, Madeleine Glick and Robert I. Killey (Proc. European Conference on Optical Communication (ECOC), pp. 1-2, 2009)

**21.4 GS/s Real-Time DSP-Based Optical OFDM Signal Generation and Transmission Over 1600 km of Uncompensated Fibre**

Preprint (279 KB)

Published paper (link to publisher)

Bibtex

We report a real-time optical OFDM transmitter with the highest sampling rate to date. Generation and transmission of an 8.36 Gb/s digitally up-converted single sideband OFDM signal over 1600km of uncompensated fiber with a BER < 10-3 was achieved.

Yevgen Voronenko and Markus Püschel (IEEE Transactions on Signal Processing, Vol. 57, No. 1, pp. 205-222, 2009)

**Algebraic Signal Processing Theory: Cooley-Tukey Type Algorithms for Real DFTs**

Preprint (324 KB)

Published paper (link to publisher)

Bibtex

In this paper we systematically derive a large class of fast general-radix algorithms for various types of real discrete Fourier transforms (real DFTs) including the discrete Hartley transform (DHT) based on the algebraic signal processing theory. This means that instead of manipulating the transform definition, we derive algorithms by manipulating the polynomial algebras underlying the transforms using one general method. The same method yields the well-known Cooley-Tukey fast Fourier transform (FFT) as well as general radix discrete cosine and sine transform algorithms. The algebraic approach makes the derivation concise, unifies and classifies many existing algorithms, yields new variants, enables structural optimization, and naturally produces a human-readable structural algorithm representation based on the Kronecker product formalism. We show, for the first time, that the general-radix Cooley-Tukey and the lesser known Bruun algorithms are instances of the same generic algorithm. Further, we show that this generic algorithm can be in stantiated for all four types of the real DFT and the DHT.

Peter A. Milder, James C. Hoe and Markus Püschel (Proc. Design, Automation and Test in Europe (DATE), pp. 1118-1123, 2009)

**Automatic Generation of Streaming Datapaths for Arbitrary Fixed Permutations**

Preprint (222 KB)

Bibtex

This paper presents a technique to perform arbitrary fixed permutations on streaming data. We describe a parameterized architecture that takes as input n data points streamed at a rate of w per cycle, performs a permutation over all n points, and outputs the result in the same streaming format. We describe the system and its requirements mathematically and use this mathematical description to show that the datapaths resulting from our technique can sustain a full throughput of w words per cycle without stalling. Additionally, we provide an algorithm to configure the datapath for a given permutation and streaming width. Using this technique, we have constructed a full synthesis system that takes as input a permutation and a streaming width and outputs a register-transfer level Verilog description of the datapath. We present an evaluation of our generated designs over varying problem sizes and streaming widths, synthesized for a Xilinx Virtex-5 FPGA.

Daniel McFarlin, Franz Franchetti and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2009)

**Automatic Generation of Vectorized Fast Fourier Transform Libraries for the Larrabee and AVX Instruction Set Extension**

Preprint (205 KB)

Bibtex

Basilio B. Fraguela, Yevgen Voronenko and Markus Püschel (Proc. Parallel Architectures and Compilation Techniques (PACT), pp. 271-280, 2009)

**Automatic Tuning of Discrete Fourier Transforms Driven by Analytical Modeling**

Preprint (2.7 MB)

Bibtex

Analytical models have been used to estimate optimal values for parameters such as tile sizes in the context of loop nests. However, important algorithms such as fast Fourier transforms (FFTs) present a far more complex search space consisting of many thousands of different implementations with very different complex access patterns and nesting and code structures. As a results, some of the best available FFT implementations use heuristic search based on runtime measurements. In this paper we present the first analytical model that can successfully replace the measurement in this search on modern platforms. The model includes many details of the platform's memory system including the TLBs, and, for the first time, physically addressed caches and hardware prefetching. The effect, as we show, is a dramatically reduced search time to find the best FFT without significant loss in performance. Even though our model is adapted to the FFT in this paper, its underlying structure should be applicable for a much larger set of code structures and hence is a candidate for iterative compilation.

Frédéric de Mesmay, Arpad Rimmel, Yevgen Voronenko and Markus Püschel (Proc. International Conference on Machine Learning (ICML), pp. 729-736, 2009)

**Bandit-Based Optimization on Graphs with Application to Library Performance Tuning**

Preprint (309 KB)

Published paper (link to publisher)

Bibtex

The problem of choosing fast implementations for a class of recursive algorithms such as the fast Fourier transforms can be formulated as an optimization problem over the language generated by a suitably defined grammar. We propose a novel algorithm that solves this problem by reducing it to maximizing an objective function over the sinks of a directed acyclic graph. This algorithm valuates nodes using Monte-Carlo and grows a subgraph in the most promising directions by considering local maximum k-armed bandits. When used inside an adaptive linear transform library, it cuts down the search time by an order of magnitude to find a similar solution. Further, the search can be stopped at any time and still return a solution. In some cases, the performance of the implementations found is also increased by up to 10% which is of considerable practical importance since it consequently improves the performance of all applications using the performance library.

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. International Conference on Supercomputing (ICS), pp. 26-35, 2009)

**Computer Generation of Fast Fourier Transforms for the Cell Broadband Engine**

Preprint (238 KB)

Published paper (link to publisher)

Bibtex

The Cell BE is a multicore processor with eight vector accelerators (called SPEs) that implement explicit cache management through direct memory access engines. While the Cell has an impressive floating point peak performance, programming and optimizing for it is difficult as it requires explicit memory management, multithreading, streaming, and vectorization. We address this problem for the discrete Fourier transform (DFT) by extending Spiral, a program generation system, to automatically generate highly optimized implementations for the Cell. The extensions include multi-SPE parallelization and explicit memory streaming, both performed at a high abstraction level using rewriting systems operating on Spiral's internal domain-specific language. Further, we support latency and throughput optimizations, single and double precision, and different data formats. The performance of Spiral's computer generated code is comparable with and sometimes better than existing DFT implementations, where available.

Yevgen Voronenko, Frédéric de Mesmay and Markus Püschel (Proc. International Symposium on Code Generation and Optimization (CGO), pp. 102-113, 2009)

**Computer Generation of General Size Linear Transform Libraries**

Preprint (563 KB)

Published paper (link to publisher)

Bibtex

The development of high-performance libraries has become extraordinarily difficult due to multiple processor cores, vector instruction sets, and deep memory hierarchies. Often, the library has to be reimplemented and reoptimized, when a new platform is released. In this paper we show how to automatically generate general input-size libraries for the domain of linear transforms. The input to our generator is a formal specification of the transform and the recursive algorithms the library should use; the output is a library that supports general input size, is vectorized and multithreaded, provides an adaptation mechanism for the memory hierarchy, and has excellent performance, comparable to or better than the best human-written libraries. Further, we show that our library generator enables various customizations; one example is the generation of Java libraries.

Franz Franchetti, Markus Püschel, Yevgen Voronenko, Srinivas Chellappa and José M. F. Moura (IEEE Signal Processing Magazine, special issue on ``Signal Processing on Platforms with Multiple Cores'', Vol. 26, No. 6, pp. 90-102, 2009)

**Discrete Fourier Transform on Multicores: Algorithms and Automatic Implementation**

Preprint (453 KB)

Published paper (link to publisher)

Bibtex

This paper gives an overview on the techniques needed to implement the discrete Fourier transform (DFT) efficiently on current multicore systems. The focus is on Intel compatible multicores but we also discuss the IBM Cell, and briefly, graphics processing units (GPUs). The performance optimization is broken down into three key challenges: parallelization, vectorization, and memory hierarchy optimization. In each case, we use the Kronecker product formalism to formally derive the necessary algorithmic transformations based on a few hardware parameters. Further code level optimizations are discussed. The rigorous nature of this framework enables the complete automation of the implementation task as shown by the program generator Spiral. Finally, we show and analyze DFT benchmarks of the fastest libraries available for the considered platforms.

Franz Franchetti and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 549-552, 2009)

**Generating High-Performance Pruned FFT Implementations**

Preprint (273 KB)

Published paper (link to publisher)

Bibtex

We derive a recursive general-radix pruned Cooley-Tukey fast Fourier transform (FFT) algorithm in Kronecker product notation. The algorithm is compatible with vectorization and parallelization required on state-of-the-art multicore CPUs. We include the pruned FFT algorithm into the program generation system Spiral, and automatically generate optimized implementations of the pruned FFT for the Intel Core2Duo multicore processor. Experimental results show that using the pruned FFT can indeed speed up the fastest available FFT implementations by up to 30% when the problem size and the pattern of unused inputs and outputs are known in advance.

Yannis Benlachtar, Philip M. Watts, Rachid Bouziane, Peter A. Milder, Deepak Rangaraj, Anthony Cartolano, Robert Koutsoyannis, James C. Hoe, Markus Püschel, Madeleine Glick and Robert I. Killey (Optics Express, Vol. 17, No. 20, pp. 17658-17668, 2009)

**Generation of Optical OFDM Signals Using 21.4 GS/s Real Time Digital Signal Processing**

Published paper (link to publisher)

Bibtex

We demonstrate a field programmable gate array (FPGA) based optical orthogonal frequency division multiplexing (OFDM) transmitter implementing real time digital signal processing at a sample rate of 21.4 GS/s. The QPSK-OFDM signal is generated using an 8 bit, 128 point inverse fast Fourier transform (IFFT) core, performing one transform per clock cycle at a clock speed of 167.2 MHz and can be deployed with either a direct-detection or a coherent receiver. The hardware design and the main digital signal processing functions are described, and we show that the main performance limitation is due to the low (4-bit) resolution of the digital-to-analog converter (DAC) and the 8-bit resolution of the IFFT core used. We analyze the back-to-back performance of the transmitter generating an 8.36 Gb/s optical single sideband (SSB) OFDM signal using digital up-conversion, suitable for direct-detection. Additionally, we use the device to transmit 8.36 Gb/s SSB OFDM signals over 200 km of uncompensated standard single mode fiber achieving an overall BER<10^−3.

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2009)

**High Performance Linear Transform Program Generation for the Cell BE**

Preprint (84 KB)

Bibtex

The Cell BE is among a new generation of multicore processors including the Intel Larrabee and the Tilera TILE64 that provide an impressive peak ﬁxed or ﬂoating point performance for scientiﬁc, signal processing, visualization, and other engineering applications. As shown in Fig. 1, the Cell uses simple in-order cores designed speciﬁcally for numerical computing, and requires explicit memory management to achieve maximal performance, which make programming and optimizing a challenge. In this paper, we extend Spiral [7], a program generation system, to generate highly optimized linear transform programs for the Cell BE. In doing so, as presented in [2], we extend Spiral’s architectural paradigms to include support for distributed memory architectures like the Cell that allow hiding memory costs using multibuffering techniques. We focus on ﬁxed-size code for the 1D complex discrete Fourier transform (DFT), but also generate code for variants including transforms that work on real input, 2D input, and for other transforms including the discrete Cosine and Sine transforms. We generate code for various usage scenarios, including latency optimized and throughput optimized code, and our system can handle various complex data formats and data distribution formats. The performance of Spiral generated code for the Cell is comparable to, and in many cases better than existing implementations, where available.

Daniel McFarlin, Franz Franchetti, José M. F. Moura and Markus Püschel (Proc. SPIE Conference on Defense, Security, and Sensing, Proceedings of SPIE, Vol. 7337, pp. 733708, 2009)

**High Performance Synthetic Aperture Radar Image Formation On Commodity Architectures**

Published paper (link to publisher)

Bibtex

Synthetic Aperture Radar (SAR) image processing platforms are increasingly confronted with vast datasets and hard real-time deadlines. Platform developers are also finding themselves constrained by battlefield exigencies, time-to-market pressures and rapidly varying requirements. In response, developers are coupling high performance, general-purpose Commercial-Off-The-Shelf (COTS) architectures with software implementations of SAR algorithms. While this approach provides great flexibility, achieving the requisite performance on COTS architectures such as IBM's Cell BE and Intel's Core2 Quad is a potentially time-consuming and error-prone process. This is principally due to the highly parallel nature of modern COTS architectures. Developers must now grapple with architectures that exhibit parallelism at nearly every level of granularity; extracting high performance requires a complex interweaving of multiple forms of parallelism (e.g., SIMD vector extensions, multicore). In this paper, we first present an overview of SPIRAL, a program generator that reduces the burden of developing high performance code by automatically exploring many combinations of parallelism. We then extend SPIRAL's domain-specific language to expose high-level SAR constructs that a developer can combine into a parameterized scenario. We subsequently employ these extensions to generate code for the computationally dominant image formation component of Polar Format SAR processing. By fully leveraging automatic program generation, SPIRAL produces code for Intel's Core2 Quad that surpasses competing hand-tuned implementations on the Cell Blade, an architecture with twice as many cores and three times the memory bandwidth. We show an average performance of 34 Gigaflops/sec for 16-Megapixel and 100-Megapixel SAR images with runtimes of 0.6 and 4.45 seconds respectively.

Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin and Markus Püschel (Proc. IFIP Working Conference on Domain Specific Languages (DSL WC), Lecture Notes in Computer Science, Springer, Vol. 5658, pp. 385-410, 2009)

**Operator Language: A Program Generation Framework for Fast Kernels**

Preprint (511 KB)

Bibtex

We present the Operator Language (OL), a framework to automatically generate fast numerical kernels. OL provides the structure to extend the program generation system Spiral beyond the transform domain. Using OL, we show how to automatically generate library functionality for the fast Fourier transform and multiple non-transform kernels, including matrix-matrix multiplication, synthetic aperture radar (SAR), circular convolution, sorting networks, and Viterbi decoding. The control flow of the kernels is data-independent, which allows us to cast their algorithms as operator expressions. Using rewriting systems, a structural architecture model and empirical search, we automatically generate very fast C implementations for state-of-the-art multicore CPUs that rival hand-tuned implementations.

Markus Püschel, Peter A. Milder and James C. Hoe (Journal of the ACM, Vol. 56, No. 2, pp. 10:1-10:34, 2009)

**Permuting Streaming Data Using RAMs**

Preprint (511 KB)

Published paper (link to publisher)

Bibtex

This paper presents a method for constructing hardware structures to perform permutations on streaming data. The method applies to permutations that can be represented as linear mappings on the bit-level representation of the data locations, a subclass that includes many important permutations such as stride permutations (corner turn, perfect shuffle, etc.), the bit reversal, and the Hadamard reordering. This problem has only been considered in the literature for the special case of stride permutations. We propose a family of parameterized datapaths that consist of several independent banks of memory and two interconnection networks. These structures are built for a given streaming width, i.e., number of inputs and outputs per cycle. The structures are able to operate at full throughput for a given streaming width. We provide an algorithm that completely specifies the datapath and control logic given a permutation and a number of input/output ports. Further, we provide lower bounds on the achievable cost of the solution. For an important subclass of considered permutations, the resulting design is proven optimal in both the cost of control and the number of switching elements required by the interconnection networks. We apply our algorithm to derive complete solutions for several important permutations, including a detailed example that carefully illustrates each aspect of the design process. Lastly, we compare our permutation structures to those of [Jarvinen et al. 2004], which are specialized for stride permutations.

Markus Püschel and José M. F. Moura (IEEE Transactions on Signal Processing, Vol. 56, No. 4, pp. 1502-1521, 2008)

**Algebraic Signal Processing Theory: Cooley-Tukey Type Algorithms for DCTs and DSTs**

Preprint (347 KB)

Published paper (link to publisher)

Bibtex

This paper presents a systematic methodology to derive and classify fast algorithms for linear transforms. The approach is based on the algebraic signal processing theory. This means that the algorithms are not derived by manipulating the entries of transform matrices, but by a stepwise decomposition of the associated signal models, or polynomial algebras. This decomposition is based on two generic methods or algebraic principles that generalize the well-known Cooley-Tukey FFT and make the algorithms' derivations concise and transparent. Application to the 16 discrete cosine and sine transforms yields a large class of fast general radix algorithms, many of which have not been found before.

Frédéric de Mesmay, Franz Franchetti, Yevgen Voronenko and Markus Püschel (Parallel Matrix Algorithms and Applications (PMAA), 2008, Presentation (Abstract reviewed))

**Automatic Generation of Adaptive Libraries for Matrix-Multiplication**

Bibtex

Developping high-performance matrix-multiplication libraries for current off-the-shelf microprocessors is a challenging task: it usually takes the combination of a peak-performance hand-scheduled vectorized kernel and multiple cache-aware multi-threaded blocking strategies to achieve maximum speed in all cases. Both these tasks depends on the underlying hardware and therefore need be rewritten or readapted for each and every platform. To alleviate this repetive effort, our work presents how to fully automate the development of such general matrix-multiplication libraries on various computing platforms. At the core of our code generator is a declarative domain-specific language that captures access patterns of blocked matrices. This language is manipulated using a symbolic rewritting system that is able to produce different alternatives. The rules in this language capture all the knowledge about the matrix-multiplication operation. Using this language, we generate all promising algorithms for computing small matrices products, produce the corresponding unrolled code and select the best kernels at compile time to include in our library. Then, the very same language is used to find and implement the set of mutually recursive functions that could be choosed from when deciding the blocking strategy. As this choice ultimately depends on the shape and size of the matrices and should adapt to the platform's caches, the final strategy is only decided at runtime, using dynamic programming. The final implementation of the library can be returned in C++ or Java, takes advantage of vector instructions, multi-processors and multi-cores where available, adapts each problem to the memory hierarchy, compares to vendor's performance and, amazingly, does not contain a single line of code written by a human.

Frédéric de Mesmay, Franz Franchetti, Yevgen Voronenko and Markus Püschel (Proc. International Workshop on Parallel Matrix Algorithms and Applications (PMAA), 2008)

**Automatic Generation of Multithreaded Vectorized Adaptive Libraries for Matrix Multiplication**

Bibtex

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. Supercomputing (SC), 2008, Poster (Abstract reviewed))

**Automatic Linear Transform Program Generation for the Cell BE**

Bibtex

The complexity of the Cell BE's architecture makes it difficult and time consuming to develop multithreaded, vectorized, high-performance numerical libraries for it. Our approach to solving this problem is to use Spiral, a program generation system, to automatically generate and optimize linear transform libraries for the Cell. We first extend the Spiral framework for the Cell architecture to automatically generate high-performance discrete Fourier, Cosine, and Sine transform kernels that run on a single SPE. The performance of our kernels is comparable to hand tuned code where available, and reaches 14--20 Gflop/s on a single SPE for input vectors resident in the local memories. Next, we produce optimized multithreaded code for the discrete Fourier transform that runs on multiple SPEs and obtains upto a 2x speed-up on 4 SPEs. Our poster will provide details of the formal methods used by our approach, and will present our latest performance results.

Franz Franchetti, Yevgen Voronenko, Peter A. Milder, Srinivas Chellappa, Marek Telgarsky, Hao Shen, Paolo D'Alberto, Frédéric de Mesmay, James C. Hoe, José M. F. Moura and Markus Püschel (Proc. NSF Next Generation Software Program Workshop (NSFNGS) colocated with IPDPS, 2008)

**Domain-Specific Library Generation for Parallel Software and Hardware Platforms**

Preprint (180 KB)

Published paper (link to publisher)

Bibtex

We overview a library generation framework called Spiral. For the domain of linear transforms, Spiral automatically generates implementations for parallel platforms including SIMD vector extensions, multicore processors, field-programmable gate arrays (FPGAs) and FPGA accelerated processors. The performance of the generated code is competitive with the best available hand-written libraries.

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA), 2008)

**FFT Program Generation for the Cell BE**

Preprint (55 KB)

Bibtex

The Cell BE is designed for accelerating heavy numerical computation. The Cell features an innovative design that includes multiple vectorized cores with explicitly managed per-core local memory and intercore communication. These features allow for a theoretical peak performance of 204.8 Gflop/s (single precision, using just the SPEs) without breaking the memory, power, or frequency walls. However, the very same features that allow for high theoretical performance make it difficult and time consuming to develop multithreaded, vectorized, high-performance numerical libraries for the Cell BE. Our approach to solving this problem is to use the Spiral framework to automatically generate and optimize signal processing transform libraries for the Cell. Spiral is a program generation system for linear transforms such as the discrete Fourier transform, discrete cosine transforms, filters, and others. For a user-selected transform, Spiral autonomously generates different algorithms, represented in a declarative form as mathematical formulas, and their implementations to find the best match for the target platform. Besides using search techniques, Spiral performs deterministic optimizations at the formula level, effectively restructuring the code in ways impractical at the code level. To extend the Spiral framework to support the Cell architecture, we first focus on automatically generating high-performance vectorized DFT kernels that run on a single SPE. The performance of our single-threaded code is comparable to hand tuned code, and achieves 16-20 Gflop/s on a single SPE for input vectors resident in the local stores. Next, we produce optimized multithreaded code that runs on multiple SPEs. To maximize the achieved interconnect bandwidth, the algorithms generated use large DMA packets for all-to-all data exchanges between the SPEs.

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (Proc. Design Automation Conference (DAC), pp. 385-390, 2008)

**Formal Datapath Representation and Manipulation for Implementing DSP Transforms**

Preprint (285 KB)

Published paper (link to publisher)

Bibtex

We present a domain-specific approach to representing datapaths for hardware implementations of linear signal transform algorithms. We extend the tensor structure for describing linear transform algorithms, adding the ability to explicitly characterize two important dimensions of datapath architecture. This representation allows both algorithm and datapath to be specified within a single formula, and gives the designer the ability to easily consider a wide space of possible datapaths at a high level of abstraction. We have constructed a formula manipulation system based upon this representation, and have written a compiler that can translate a formula into a hardware implementation. This enables an automatic “push button” compilation ﬂow that produces a register transfer level hardware description from high-level datapath directives and an algorithm (written as a formula). In our experimental results, we demonstrate that this approach yields efficient designs over a large tradeoff space.

Yevgen Voronenko, Franz Franchetti, Frédéric de Mesmay and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2008)

**Generating High-Performance General Size Linear Transform Libraries Using Spiral**

Preprint (191 KB)

Published paper (link to publisher)

Bibtex

Franz Franchetti and Markus Püschel (Proc. International Conference on Compiler Construction (CC), Lecture Notes in Computer Science, Springer, Vol. 4959, pp. 116-131, 2008)

**Generating SIMD Vectorized Permutations**

Preprint (244 KB)

Published paper (link to publisher)

Bibtex

This paper introduces a method to generate efficient vectorized implementations of small stride permutations using only vector load and vector shuffle instructions. These permutations are crucial for high-performance numerical kernels including the fast Fourier transform. Our generator takes as input only the specification of the target platform's SIMD vector ISA and the desired permutation. The basic idea underlying our generator is to model vector instructions as matrices and sequences of vector instructions as matrix formulas using the Kronecker product formalism. We design a rewriting system and a search mechanism that applies matrix identities to generate those matrix formulas that have vector structure and minimize a cost measure that we define. The formula is then translated into the actual vector program for the specified permutation. For three important classes of permutations, we show that our method yields a solution with the minimal number of vector shuffles. Inserting into a fast Fourier transform yields a significant speedup.

Srinivas Chellappa, Franz Franchetti and Markus Püschel (Proc. Summer School on Generative and Transformational Techniques in Software Engineering (GTTSE), Lecture Notes in Computer Science, Springer, Vol. 5235, pp. 196-259, 2008)

**How To Write Fast Numerical Code: A Small Introduction**

Preprint (556 KB)

Published paper (link to publisher)

Bibtex

The complexity of modern computing platforms has made it increasingly difficult to write numerical code that achieves the best possible performance. Straightforward implementations based on algorithms that minimize the operations count often fall short in performance by an order of magnitude. This tutorial introduces the reader to a set of general techniques to improve the performance of numerical code, focusing on optimizations for the computer's memory hierarchy. Two running examples are used to demonstrate these techniques: matrix-matrix multiplication and the discrete Fourier transform.

Yevgen Voronenko (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2008)

**Library Generation for Linear Transforms**

Preprint (2.2 MB)

Bibtex

The development of high-performance numeric libraries has become extraordinarily difficult due to multiple processor cores, vector instruction sets, and deep memory hierarchies. To make things worse, often each library has to be re-implemented and re-optimized, whenever a new platform is released. In this thesis we develop a library generator that completely automates the library development for one important numerical domain: linear transforms, which include the discrete Fourier transform, discrete cosine transforms, filters, and discrete wavelet transforms. The input to our generator is a specification of the transform and a set of recursive algorithms for the transform, represented in a high-level domain-specific language; the output is a C++ library that supports general input size, is vectorized and multithreaded, and provides an optional adaptation mechanism for the memory hierarchy. Further, as we show in extensive benchmarks, the runtime performance of our automatically generated libraries is comparable to and often even higher than the best existing human-written code, including the widely used library FFTW and the commercially developed and maintained Intel Integrated Performance Primitives (IPP) and AMD Performance Library (APL). Our generator automates all library development steps typically performed manually by programmers, such as analyzing the algorithm and finding the set of required recursive functions and base cases, the appropriate restructuring of the algorithm to parallelize for multiple threads and to vectorize for the available vector instruction set and vector length, and performing code level optimizations such as algebraic simplification and others. The key to achieving full automation as well as excellent performance is a proper set of abstraction layers in the form of domain-specific languages called SPL (Signal Processing Language), index-free \sspl, regular \sspl, and intermediate code representation, and the use of rewriting systems to perform all difficult optimizations at a suitable, high level of abstraction. In addition, we demonstrate that our automatic library generation framework enables various forms of customization that would be very costly to perform manually. As examples, we show generated trade-offs between code size and performance, generated Java libraries obtained by modifying the backend, and functional customization for important transform variants.

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (Proc. Workshop on High-Level Synthesis colocated with DAC, 2008)

**Linear Transforms: From Math to Efficient Hardware**

Preprint (70 KB)

Bibtex

Franz Franchetti, Daniel McFarlin, Frédéric de Mesmay, Hao Shen, Tomasz Wiktor Włodarczyk, Srinivas Chellappa, Marek Telgarsky, Peter A. Milder, Yevgen Voronenko, Qian Yu, James C. Hoe, José M. F. Moura and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2008)

**Program Generation with Spiral: Beyond Transforms**

Published paper (link to publisher)

Bibtex

Jeremy Johnson, Tim Chagnon, Petya Vachranukunkiet, Prawat Nagvajara and Chika Nwankpa (Proc. International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA), 2008)

**Sparse LU Decomposition using FPGA**

Preprint (539 KB)

Published paper (link to publisher)

Bibtex

This paper reports on an FPGA implementation of sparse LU decomposition. The resulting special purpose hardware is geared towards power system problems - load flow computation - which are typically solved iteratively using Newton Raphson. The key step in this process, which takes approximately 85% of the computation time, is the solution of sparse linear systems arising from the Jacobian matrices that occur in each iteration of Newton Raphson. Current state-of-the-art software packages, such as UMFPACK and SuperLU, running on general purpose processors perform suboptimally on these problems due to poor utilization of the floating point hardware (typically 1 to 4% efficiency). Our LU hardware, using a special purpose data path and cache, designed to keep the floating point hardware busy, achieves an efficiency of 60% and higher. This improved efficiency provides an order of magnitude speedup when compared to a software solution using UMFPACK running on general purpose processors.

Jeremy Johnson and Michael Andrews (Proc. International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA), 2008)

**Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh-Hadamard Transform**

Preprint (1.1 MB)

Published paper (link to publisher)

Bibtex

Short vector instructions (SIMD) can significantly increase performance, yet are difficult to use effectively. Recently, several efforts (ATLAS, FFTW, SPIRAL) have successfully used automated performance tuning and search to find good SIMD implementations of high-performance kernels such as matrix multiplication, the FFT and related transforms. In this paper, we review techniques, similar to those used by SPIRAL, for vectorizing sparse matrix factorizations, and incorporate them into a package for computing the Walsh–Hadamard Transform (WHT). These techniques along with search are used to discover algorithms with close to theoretical speedup. Not all algorithms provide the same amount of speedup and it is not the case that the vectorized version of the best sequential algorithm leads to the best algorithm with SIMD instructions. The goal of this paper is to better understand the search space and which algorithms lead to the best speedup and overall performance. Several SIMD specific algorithm features are used to predict speedup and performance. The resulting performance model is highly correlated with runtime performance and measured speedup, and can be used to partition the search space and expedite the search for good vector performance.

Yevgen Voronenko, Franz Franchetti, Frédéric de Mesmay and Markus Püschel (Proc. Algebraic Methodology and Software Technology (AMAST), 2008)

**System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries**

Preprint (187 KB)

Bibtex

We demonstrate Spiral, a domain-specific library generation system. Spiral generates high performance source code for linear transforms (such as the discrete Fourier transform and many others) directly from a problem specification. The key idea underlying Spiral is to perform automatic reasoning and optimizations at a high abstraction level using the mathematical, declarative domain-specific languages SPL and $\Sigma$-SPL and a rigorous rewriting framework. Optimization includes various forms of parallelization. Even though Spiral provides complete automation, its generated libraries run often faster than any existing hand-written code.

Sung-Chul Han (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2007)

**A Flexible Decoder and Performance Evaluation of Array-Structured LDPC Codes**

Preprint (1.4 MB)

Bibtex

The low density parity check (LDPC) codes designed by a pseudorandom construction, as proposed in Gallager’s original work, have been shown to perform very close to the Shannon limit (when constructed as very long codes); however, the lack of structure in such codes makes them unsuitable for practical applications due to high encoding complexity and costly decoder implementations. These difficulties have lead to numerous works on the structured LDPC codes, especially array-structured codes with quasi-cyclic property. Among the array-structured codes, those with an array of cyclic permutation matrices have been of particular interest due to the balanced edge partitioning inherent in the structure that simplifies the implementation of highly parallel decoders. While many construction methods have been proposed for this circulant permutation array (CPA) structure, the performance of the codes has been reported to a very limited extent. Especially, the effect on the performance by the explicit control of graph parameters has not been provided despite the fact their importance is emphasized in the construction process. In the decoder design for quasi-cyclic LDPC codes, the primary concern is to exploit the array structure for efficient implementation. Fast hardware-based decoders on a medium-capacity FPGA are often faster than the software implementation by at least one or two orders of magnitude, and thus important for both actual deployment in practical systems and evaluation of error performance. As a large number of high-throughput decoders in the literature are designed for specific array dimensions and the bus and memory connections are simplified using the array structure of the code, the degree of parallelism in the decoders is dependent on the code parameters, making it difficult to parameterize the hardware to use a desired amount of hardware resource. Furthermore, such architectures cannot support a large class of array-structured codes with very different array dimensions. In this thesis, we present a generalized hardware decoder that supports any kind of quasi-cyclic LDPC codes including CPA-structured codes. The decoder has been designed with a priority on high flexibility. In the synthesis step, the degree of parallelism can be chosen independently from the code parameters. Thus, for FPGA implementation, the decoder can be parameterized to fully utilize a given amount of hardware resource. Also, it supports run-time reconfiguration of code parameters, i.e., different codes can be supported by changing register contents without a new synthesis. In wireless applications, such flexibility makes it possible to choose a channel code based on the varying channel condition. When used for performance evaluation purposes for a large set of codes, it saves a considerable amount of time by eliminating the need for re-synthesis for each code. Using the FPGA implementation of the proposed decoder, we characterize the performance of array-structured LDPC codes, with a primary focus on pseudorandomly constructed CPA-structured codes. Based on the obtained simulation results, we show the effect of combinatorial parameters (girth, diameter and column weight) on the error performance. The pseudorandom construction is also compared with algebraic construction, and with the codes specified in the IEEE 802.16e standards.

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (CSSI Technical Report #CSSI-07-01, Carnegie Mellon University, 2007)

**Discrete Fourier Transform Compiler: From Mathematical Representation to Efficient Hardware**

Preprint (259 KB)

Bibtex

A wide range of hardware implementations are possible for the discrete Fourier transform (DFT), offering different tradeoffs in throughput, latency and cost. The well-understood structure of DFT algorithms makes possible a fully automatic synthesis framework that can span the viable interesting design choices. In this paper, we present such a synthesis framework that starts from formal mathematical formulas of a general class of fast DFT algorithms and produces performance and cost efficient sequential hardware implementations, making design decisions and tradeoffs according to user specified high-level preferences. We present evaluations to demonstrate the variety of supported implementations and the cost/performance tradeoffs they allow.

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (Proc. International Symposium on Field-Programmable Gate Arrays (FPGA), 2007)

**Fast Fourier Transform on FPGA: Design Choices and Evaluation**

Bibtex

Peter A. Milder, Franz Franchetti, James C. Hoe and Markus Püschel (Proc. IEEE International High Level Design Validation and Test Workshop (HLDVT), 2007)

**FFT Compiler: From Math to Efficient Hardware**

Preprint (158 KB)

Bibtex

This paper presents a high-level compiler that generates hardware implementations of the discrete Fourier transform (DFT) from mathematical specifications. The matrix formula input language captures not only the DFT calculation but also the implementation options at the algorithmic and architectural levels. By selecting the appropriate formula, the resulting hardware implementations (described in a synthesizable Verilog description) can achieve a wide range of tradeoffs between implementation cost and performance. The compiler is also parameterized for a set of technology-specific optimizations, to allow it to target specific implementation platforms. This paper gives a brief overview of the system and presents synthesis results.

Paolo D'Alberto, Peter A. Milder, Aliaksei Sandryhaila, Franz Franchetti, James C. Hoe, José M. F. Moura, Markus Püschel and Jeremy Johnson (Proc. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 173-184, 2007)

**Generating FPGA Accelerated DFT Libraries**

Preprint (347 KB)

Published paper (link to publisher)

Bibtex

We present a domain-specific approach to generate high performance hardware-software partitioned implementations of the discrete Fourier transform (DFT) in fixed point precision. The partitioning strategy is a heuristic based on the DFT’s divide-and-conquer algorithmic structure and fine tuned by the feedback-driven exploration of candidate designs. We have integrated this approach in the Spiral linear-transform code-generation framework to support push-button automatic implementation. We present evaluations of hardware–software DFT implementations running on the embedded PowerPC processor and the reconfigurable fabric of the Xilinx Virtex-II Pro FPGA. In our experiments, the 1D and 2D DFT’s FPGA accelerated libraries exhibit between 2 and 7.5 times higher performance (operations per second) and up to 2.5 times better energy efficiency (operations per Joule) than the software-only version.

Jeremy Johnson and Xu Xu (Proc. ACM International Symposium on Symbolic and Algebraic Computation (ISSAC), ACM, pp. 195-202, 2007)

**Generating Symmetric DFTs and Equivariant FFT Algorithms**

Published paper (link to publisher)

Bibtex

This paper presents a code generator which produces efficient implementations of multi-dimensional fast Fourier transform (FFT) algorithms which utilize symmetries in the input data to reduce memory usage and the number of arithmetic operations. The FFT algorithms are constructed using a group theoretic version of the divide and conquer step in the FFT that is compatible with the group of symmetries. The GAP compute algebra system is used to perform the necessary group computations and the generated algorithm is represented as a symbolic matrix factorization, which is translated into efficient code using the SPIRAL system. Performance data is given that shows that the resulting code is significantly faster than state-of-the-art FFT implementations that do not utilize the symmetries.

Yevgen Voronenko and Markus Püschel (IEEE Transactions on Signal Processing, Vol. 55, No. 9, pp. 4458-4473, 2007)

**Mechanical Derivation of Fused Multiply-Add Algorithms for Linear Transforms**

Preprint (751 KB)

Published paper (link to publisher)

Bibtex

Several computer architectures offer fused multiply-add (FMA), also called multiply-and-accumulate (MAC) instructions, that are as fast as a single addition or multiplication. For the efficient implementation of linear transforms, such as the discrete Fourier transform or discrete cosine transforms, this poses a challenge to algorithm developers as standard transform algorithms have to be manipulated into FMA algorithms that make optimal use of FMA instructions. We present a general method to convert any transform algorithm into an FMA algorithm. The method works with both algorithms given as directed acyclic graphs (DAGs) and algorithms given as structured matrix factorizations. We prove bounds on the efficiency of the method. In particular, we show that it removes all single multiplications except at most as many as the transform has outputs. We implemented the DAG-based version of the method and show that we can generate many of the best-known hand-derived FMA algorithms from the literature as well as a few novel FMA algorithms.

Yevgen Voronenko and Markus Püschel (ACM Transactions on Algorithms, Vol. 3, No. 2, 2007)

**Multiplierless Multiple Constant Multiplication**

Preprint (507 KB)

Published paper (link to publisher)

Bibtex

A variable can be multiplied by a given set of fixed-point constants using a multiplier block that consists exclusively of additions, subtractions, and shifts. The generation of a multiplier block from the set of constants is known as the multiple constant multiplication (MCM) problem. Finding the optimal solution, i.e., the one with the fewest number of additions and subtractions is known to be NP-complete. We propose a new algorithm for the MCM problem, which produces solutions that require up to 20% less additions and subtractions than the best previously known algorithm. At the same time our algorithm, in contrast to the closest competing algorithm, is not limited by the constant bitwidths. We present our algorithm using a unifying formal framework for the best, graph-based, MCM algorithms and provide a detailed runtime analysis and experimental evaluation. We show that our algorithm can handle problem sizes as large as 100 32-bit constants in a time acceptable for most applications. The implementation of the new algorithm is available at www.spiral.net.

Jeremy Johnson and Michael Andrews (Proc. Workshop on Performance Optimization for High-Level Languages and Libraries (POHLL), 2007 IEEE International Parallel and Distributed Processing Symposium , IEEE, pp. 450, 2007)

**Performance Analysis of a Family of WHT Algorithms**

Preprint (1.8 MB)

Published paper (link to publisher)

Bibtex

This paper explores the correlation of instruction counts and cache misses to runtime performance for a large family of divide and conquer algorithms to compute the Walsh–Hadamard transform (WHT). Previous work showed how to compute instruction counts and cache misses from a high–level description of the algorithm and proved theoretical results about their minimum, maximum, mean, and distribution. While the models themselves do not accurately predict performance, it is shown that they are statistically correlated to performance and thus can be used to prune the search space for fast implementations. When the size of the transform fits in cache the instruction count itself is used; however, when the transform no longer fits in cache, a linear combination of instruction counts and cache misses is used. Thus for small transforms it is safe to ignore algorithms which have a high instruction count and for large transforms it is safe to ignore algorithms with a high value in the combined instruction count/cache miss model. Since the models can be computed from a high–level description of the algorithms, they can be obtained without runtime measurement and the previous theoretical results on the models can be applied to limit empirical search.

Paolo D'Alberto, Markus Püschel and Franz Franchetti (Proc. International Conference on High Performance Embedded Architectures and Compilers (HiPEAC), Lecture Notes in Computer Science, Springer, Vol. 4367, pp. 201-214, 2007)

**Performance/Energy Optimization of DSP Transforms on the XScale Processor**

Preprint (646 KB)

Published paper (link to publisher)

Bibtex

The XScale processor family provides user-controllable independent configuration of CPU, bus, and memory frequencies. This feature introduces another handle for the code optimization with respect to energy consumption or runtime performance. We quantify the effect of frequency configurations on both performance and energy for three signal processing transforms: the discrete Fourier transform (DFT), finite impulse response (FIR) filters, and the Walsh-Hadamard Transform (WHT). To do this, we use SPIRAL, a program generation and optimization system for signal processing transforms. For a given transform to be implemented, SPIRAL searches over different algorithms to find the best match to the given platform with respect to the chosen performance metric (usually runtime). In this paper we use SPIRAL to generate implementations for different frequency configurations and optimize for runtime and physically measured energy consumption. In doing so we show that first, each transform achieves best performance/energy consumption for a different system configuration; second, the best code depends on the chosen configuration, problem size and algorithm; third, the fastest implementation is not always the most energy efficient; fourth, we introduce dynamic (i.e., during execution) reconfiguration in order to further improve performance/energy. Finally, we benchmark SPIRAL generated code against Intel’s vendor library routines. We show competitive results as well as 20% performance improvements or energy reduction for selected transforms and problem sizes.

Franz Franchetti and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. II-17-II-20, 2007)

**SIMD Vectorization of Non-Two-Power Sized FFTs**

Preprint (197 KB)

Published paper (link to publisher)

Bibtex

SIMD (single instruction multiple data) vector instructions, such as Intel’s SSE family, are available on most architectures, but are difficult to exploit for speed-up. In many cases, such as the fast Fourier transform (FFT), signal processing algorithms have to undergo major transformations to map efficiently. Using the Kronecker product formalism, we rigorously derive a novel variant of the general-radix Cooley-Tukey FFT that is structured to map efficiently for any vector length nu and radix. Then, we include the new FFT into the program generator Spiral to generate actual C implementations. Benchmarks on Intel’s SSE show that the new algorithms perform better on practically all sizes than the best available libraries Intel’s MKL and FFTW.

Peter Tummeltshammer, James C. Hoe and Markus Püschel (IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 26, No. 9, pp. 1551-1563, 2007)

**Time-Multiplexed Multiple Constant Multiplication**

Preprint (349 KB)

Published paper (link to publisher)

Bibtex

This paper studies area-efficient arithmetic circuits to multiply a fixed-point input value selectively by one of several preset fixed-point constants. We present an algorithm that generates a class of solutions to this time-multiplexed multiple-constant multiplication problem by “fusing” single-constant multiplication circuits for the required constants. Our evaluation compares our solution against a baseline implementation style that employs a full multiplier and a lookup table for the constants. The evaluation shows that we gain a significant area advantage, at the price of increased latency, for problem sizes (in terms of the number of constants) up to a threshold dependent on the bit-widths of the input and the constants. Our evaluation further shows that our solution is better suited for standard-cell application-specific integrated circuits than prior works on reconfigurable multiplier blocks.

Pranab Shenoy (Master thesis, Computer Science, Drexel University, 2007)

**Universal FFT Core Generator**

Published paper (link to publisher)

Bibtex

This thesis presents a special purpose processor for multi-dimensional discrete Fourier transform (DFT) computation. The processor, called a Universal fast Fourier transform (FFT) processor, is capable of computing an arbitrary multi-dimensional DFT with a fixed number of input points. The processor can be configured to compute different dimensional DFTs simply by rearranging the input data and using a different set of constants, called generalized twiddle factors, throughout the computation. The basic computational process remains the same independent of dimension. The processor does not rely on one-dimensional FFT computation as does the standard approach using the row-column algorithm, but instead utilizes an algorithm called the Dimensionless FFT, whose computation data flow is the same as the FFT algorithm due to Pease. Since the computation performed by the Universal FFT processor is the same as the Pease FFT, an implementation can be obtained by modifying a one-dimensional FFT processor based on the Pease algorithm. The implementation presented in this thesis is obtained in this way from the core generator by Nordin, Milder, Hoe, and P¨uschel. The core generator from Nordin, Milder, Hoe, and P¨uschel generates a variety of cores, with various space and time tradeoffs for implementation on field programmable gate arrays (FPGA). The performance obtained is comparable to vendor supplied cores but offers much greater flexibility of design choices with different area and throughput tradeoffs. The resulting Universal FFT core generator can be used to instantiate multi-dimensional FPGA cores with approximately the same performance and area obtained in the one-dimensional case by Nordin, Milder, Hoe, and P¨uschel.

Yevgen Voronenko and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 3, 2006)

**Algebraic Derivation of General Radix Cooley-Tukey Algorithms for the Real Discrete Fourier Transform**

Preprint (84 KB)

Published paper (link to publisher)

Bibtex

We first show that the real version of the discrete Fourier transform (called RDFT) can be characterized in the framework of polynomial algebras just as the DFT and the discrete cosine and sine transforms. Then, we use this connection to algebraically derive a general radix Cooley-Tukey type algorithm for the RDFT. The algorithm has a similar structure as its complex counterpart, but there are also important differences, which are exhibited by our Kronecker product style presentation. In particular, the RDFT is decomposed into smaller RDFTs but also other auxiliary transforms, which we then decompose by their own Cooley-Tukey type algorithms to obtain a full recursive algorithm for the RDFT.

Franz Franchetti, Yevgen Voronenko and Markus Püschel (Proc. High Performance Computing for Computational Science (VECPAR), Lecture Notes in Computer Science, Springer, Vol. 4395, pp. 363-377, 2006)

**A Rewriting System for the Vectorization of Signal Transforms**

Preprint (226 KB)

Published paper (link to publisher)

Bibtex

We present a rewriting system that automatically vectorizes signal transform algorithms at a high level of abstraction. The input to the system is a transform algorithm given as a formula in the well-known Kronecker product formalism. The output is a "vectorized" formula, which means it consists exclusively of constructs that can be directly mapped into short vector code. This approach obviates compiler vectorization, which is known to be limited in this domain. We included the formula vectorization into the Spiral program generator for signal transforms, which enables us to generate vectorized code and optimize through search over alternative algorithms. Benchmarks for the discrete Fourier transform (DFT) show that our generated floating-point code is competitive with and that our fixed-point code clearly outperforms the best available libraries.

Andreas Bonelli, Franz Franchetti, Juergen Lorenz, Markus Püschel and Christoph W. Ueberhuber (Proc. International Symposium on Parallel and Distributed Processing and Application (ISPA), Lecture Notes In Computer Science, Springer, Vol. 4330, pp. 818-832, 2006)

**Automatic Performance Optimization of the Discrete Fourier Transform on Distributed Memory Computers**

Preprint (146 KB)

Bibtex

This paper introduces a formal framework for automatically generating performance optimized implementations of the discrete Fourier transform (DFT) for distributed memory computers. The framework is implemented as part of the program generation and optimization system SPIRAL. DFT algorithms are represented as mathematical formulas in SPIRAL’s internal language SPL. Using a tagging mechanism and formula rewriting, we extend SPIRAL to automatically generate parallelized formulas. Using the same mechanism, we enable the generation of rescaling DFT algorithms, which redistribute the data in intermediate steps to fewer processors to reduce communication overhead. It is a novel feature of these methods that the redistribution steps are merged with the communication steps of the algorithm to avoid additional communication overhead. Among the possible alternative algorithms, SPIRAL’s search mechanism now determines the fastest for a given platform, effectively generating adapted code without human intervention. Experiments with DFT MPI programs generated by SPIRAL show performance gains of up to 30% due to rescaling. Further, our generated programs compare favorably with FFTW-MPI 2.1.5.

Paolo D'Alberto, Peter A. Milder, Franz Franchetti, James C. Hoe, Markus Püschel and José M. F. Moura (Proc. High Performance Extreme Computing (HPEC), 2006)

**Discrete Fourier Transform Compiler for FPGA and CPU/FPGA Partitioned Implementations**

Bibtex

Pawel Hitczenko, Jeremy Johnson and Hung-Jen Huang (Theoretical Computer Science, Vol. 352, pp. 8-30, 2006)

**Distribution of a Class of Divide and Conquer Recurrences arising from the Computation of the Walsh-Hadamard Transform**

Published paper (link to publisher)

Bibtex

Peter A. Milder, Mohammad Ahmad, James C. Hoe and Markus Püschel (Proc. FPGA, pp. 211-220, 2006)

**Fast and Accurate Resource Estimation of Automatically Generated Custom DFT IP Cores**

Preprint (145 KB)

Published paper (link to publisher)

Bibtex

This paper presents an equation-based resource utilization model for automatically generated discrete Fourier transform (DFT) soft core IPs. The parameterized DFT IP generator allows a user to make customized tradeoffs between cost and performance and between utilization of different resource classes. The equation-based resource model permits immediate and accurate estimation of resource requirements as the user considers the different generator options. Furthermore, the fast turnaround of the model allows it to be combined with a search algorithm such that the user could query automatically for an optimal design within the stated performance and resource constraints.Following a brief review of the DFT IP generator, this paper presents the development of the equation-based models for estimating slice and hard macro utilizations in the Xilinx Virtex-II Pro FPGA family. The evaluation section shows that an average error of 6.1% is achievable by a model of linear equations that can be evaluated in sub-microseconds. The paper further offers a demonstration of the automatic design exploration capability.

Franz Franchetti, Yevgen Voronenko and Markus Püschel (Proc. Supercomputing (SC), 2006)

**FFT Program Generation for Shared Memory: SMP and Multicore**

Preprint (161 KB)

Published paper (link to publisher)

Bibtex

The chip maker's response to the approaching end of CPU frequency scaling are multicore systems, which offer the same programming paradigm as traditional shared memory platforms but have different performance characteristics. This situation considerably increases the burden on library developers and strengthens the case for automatic performance tuning frameworks like Spiral, a program generator and optimizer for linear transforms such as the discrete Fourier transform (DFT). We present a shared memory extension of Spiral. The extension within Spiral consists of a rewriting system that manipulates the structure of transform algorithms to achieve load balancing and avoids false sharing, and of a backend to generate multithreaded code. Application to the DFT produces a novel class of algorithms suitable for multicore systems as validated by experimental results: we demonstrate a parallelization speed-up already for sizes that fit into L1 cache and compare favorably to other DFT libraries across all small and midsize DFTs and considered platforms.

F. Gygi, E. W. Draeger, M. Schulz, B. R. de Supinski, J. A. Gunnels, V. Austel, J. C. Sexton, Franz Franchetti, Stefan Kral, Christoph W. Ueberhuber and Juergen Lorenz (Proc. Supercomputing (SC), 2006)

**Large-Scale Electronic Structure Calculations of High-Z Metals on the BlueGene/L Platform**

Bibtex

Franz Franchetti, Andreas Bonelli, Ekapol Chuangsuwanich, Yu-Chiang Lee, Juergen Lorenz, Thomas Peter, Hao Shen, Marek Telgarsky, Yevgen Voronenko, Markus Püschel, José M. F. Moura and Christoph W. Ueberhuber (Proc. Workshop on Programming Models for Ubiquitous Parallelism (PMUP), 2006)

**Parallelism in Spiral**

Preprint (202 KB)

Bibtex

Spiral is a program generator for linear transforms such as the discrete Fourier transform. Spiral generates highly optimized code directly from a problem specification using a combination of techniques including optimization at a high level of abstraction using rewriting of mathematical expressions and heuristic search for platform adaptation. In this paper, we overview the generation of parallel programs using Spiral. This includes programs for vector architectures and programs for shared or distributed memory platforms.

Sung-Chul Han, Franz Franchetti and Markus Püschel (Proc. Parallel Architectures and Compilation Techniques (PACT), pp. 222-232, 2006)

**Program Generation for the All-Pairs Shortest Path Problem**

Preprint (258 KB)

Published paper (link to publisher)

Bibtex

A recent trend in computing are domain-specific program generators, designed to alleviate the effort of porting and re-optimizing libraries for fast-changing and increasingly complex computing platforms. Examples include ATLAS, SPIRAL, and the codelet generator in FFTW. Each of these generators produces highly optimized source code directly from a problem specification. In this paper, we extend this list by a program generator for the well-known Floyd-Warshall (FW) algorithm that solves the all-pairs shortest path problem, which is important in a wide range of engineering applications.

Franz Franchetti, Yevgen Voronenko and Markus Püschel (Proc. EDGE Workshop, pp. D49-D50, 2006)

**Spiral: Generating Signal Processing Kernels for New Commodity Architectures**

Bibtex

Marek Telgarsky, James C. Hoe and José M. F. Moura (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 3, 2006)

**Spiral: Joint Runtime and Energy Optimization of Linear Transforms**

Preprint (199 KB)

Published paper (link to publisher)

Bibtex

There is much interest into joint runtime and energy optimization of implementations of signal processing algorithms. Applications in domains such as embedded computing, sensor networks, and mobile communications often require processing of signals under simultaneous runtime, energy and/or power constraints. Hence, in addition to runtime, power and energy are first-order design considerations for both hardware and software developers in those domains. This paper studies the automatic generation of software implementations of digital signal processing (DSP) transforms that are optimized with respect to both runtime and energy. We explore the impact of algorithm selection (a software technique) and voltage-frequency scaling (a hardware technique) on the runtime and energy of computing fast linear transforms. We use SPIRAL, a code generation system, to enumerate automatically many alternative algorithms for the discrete Fourier transform. We measure the runtime and energy of these algorithms at different voltage-frequency settings of an Intel Pentium M microprocessor. We report experimental results supporting that algorithm selection and voltage-frequency scaling do achieve the following: (1) have large impact on the runtime and energy of computing the discrete Fourier transform on a microprocessor; and (2) enable the optimization of important joint runtime-energy objectives.

Roland Wunderlich, Markus Püschel and James C. Hoe (Proc. High Performance Extreme Computing (HPEC), 2005)

**Accelerating Blocked Matrix-Matrix Multiplication using a Software-Managed Memory Hierarchy with DMA**

Preprint (43 KB)

Bibtex

The optimization of matrix-matrix multiplication (MMM) performance has been well studied on general-purpose desktop and server processors. Classic solutions exploit common microarchitectural features including superscalar execution and the cache and TLB hierarchy to achieve near-peak performance. Typical digital signal processors (DSPs) do not have these features, and instead use in-order execution, configurable memory hierarchies, and pro-grammable I/O interfaces. We investigate the methods needed to achieve high per-formance MMM on the Texas Instruments C6713 floating-point DSP. This processor has two components that can be used to accelerate MMM: a software-managed memory hierarchy, and a direct memory access (DMA) engine that can perform block copies from main memory to into the memory hierarchy. Our MMM implementation overlaps computation with DMA block transfers. For matrices lar-ger than the data caches, we observed a 46% performance increase over a blocked MMM implementation, and a 190% increase over the Texas Instruments DSP library.

Grace Nordin, Peter A. Milder, James C. Hoe and Markus Püschel (Proc. Design Automation Conference (DAC), pp. 471-474, 2005)

**Automatic Generation of Customized Discrete Fourier Transform IPs**

Preprint (291 KB)

Published paper (link to publisher)

Bibtex

This paper presents a parameterized soft core generator for the discrete Fourier transform (DFT). Reusable IPs of digital signal processing (DSP) kernels are important time-saving resources in DSP hardware development. Unfortunately, reusable IPs, however optimized, can introduce inefficiencies because they cannot fit the exact requirements of every application context. Given the well-understood and regular computation in DSP kernels, an automatic tool can generate high-quality ready-to-use IPs customized to user-specified cost/performance tradeoffs (beyond basic parameters such as input size and data format). The paper shows that the generated DFT cores can match closely the performance and cost of DFT cores from the Xilinx LogiCore library. Furthermore, the generator can yield DFT cores over a range of different performance/ cost tradeoff points that are not available from the library.

Mihai Furis, Pawel Hitczenko and Jeremy Johnson (Proc. International Conference on Analysis of Algorithms, Discrete Mathematics and Theoretical Computer Science, pp. 115-124, 2005)

**Cache Miss Analysis of WHT Algorithms**

Published paper (link to publisher)

Bibtex

Franz Franchetti, Stefan Kral, Juergen Lorenz and Christoph W. Ueberhuber (Proceedings of the IEEE, special issue on ``Program Generation, Optimization, and Adaptation'', Vol. 93, No. 2, pp. 409-425, 2005)

**Efficient Utilization of SIMD Extensions**

Bibtex

Franz Franchetti, Yevgen Voronenko and Markus Püschel (Proc. Programming Languages Design and Implementation (PLDI), pp. 315-326 , 2005)

**Formal Loop Merging for Signal Transforms**

Preprint (304 KB)

Published paper (link to publisher)

Bibtex

A critical optimization in the domain of linear signal transforms, such as the discrete Fourier transform (DFT), is loop merging, which increases data locality and reuse and thus performance. In particular, this includes the conversion of shuffle operations into array reindexings. To date, loop merging is well understood only for the DFT, and only for Cooley-Tukey FFT based algorithms, which excludes DFT sizes divisible by large primes. In this paper, we present a formal loop merging framework for general signal transforms and its implementation within the SPIRAL code generator. The framework consists of Sigma-SPL, a mathematical language to express loops and index mappings; a rewriting system to merge loops in Sigma-SPL; and a compiler that translates Sigma-SPL into code. We apply the framework to DFT sizes that cannot be handled using only the Cooley-Tukey FFT and compare our method to FFTW 3.0.1 and the vendor library Intel MKL 7.2.1. Compared to FFTW our generated code is a factor of 2-4 faster under equal implementation conditions (same algorithms, same unrolling threshold). For some sizes we show a speed-up of a factor of 9 using Bluestein's algorithm. Further, we give a detailed comparison against the Intel vendor library MKL; our generated code is between 2 times faster and 4.5 times slower.

F. Gygi, E. W. Draeger, B. R. de Supinski, R. K. Yates, Franz Franchetti, Stefan Kral, Juergen Lorenz, Christoph W. Ueberhuber, J. A. Gunnels and J. C. Sexton (Proc. Supercomputing (SC), 2005)

**Large-Scale First-Principles Molecular Dynamics Simulations on the BlueGene/L Platform Using the Qbox Code**

Bibtex

Franz Franchetti, Yevgen Voronenko and Markus Püschel (Proc. Programming Languages Design and Implementation (PLDI), pp. 315-326, 2005)

**Loop Merging for Signal Transforms**

Bibtex

Xiaoming Li, María J. Garzarán and David Padua (Proc. International Symposium on Code Generation and Optimization (CGO), pp. 99-110, 2005)

**Optimizing Sorting with Genetic Algorithm**

Preprint (268 KB)

Published paper (link to publisher)

Bibtex

The growing complexity of modern processors has made the generation of highly efficient code increasingly difficult. Manual code generation is very time consuming, but it is often the only choice since the code generated by today's compiler technology often has much lower performance than the best hand-tuned codes. A promising code generation strategy, implemented by systems like ATLAS, FFTW, and SPIRAL, uses empirical search to find the parameter values of the implementation, such as the tile size and instruction schedules, that deliver near-optimal performance for a particular machine. However, this approach has only proven successful on scientific codes whose performance does not depend on the input data. In this paper we study machine learning techniques to extend empirical search to the generation of sorting routines, whose performance depends on the input characteristics and the architecture of the target machine. We build on a previous study that selects a "pure" sorting algorithm at the outset of the computation as a function of the standard deviation. The approach discussed in this paper uses genetic algorithms and a classifier system to build hierarchically-organized hybrid sorting algorithms capable of adapting to the input data. Our results show that such algorithms generated using the approach presented in this paper are quite effective at taking into account the complex interactions between architectural and input data characteristics and that the resulting code performs significantly better than conventional sorting implementations and the code generated by our earlier study. In particular, the routines generated using our approach perform better than all the commercial libraries that we tried including IBM ESSL, INTEL MKL and the C++ STL The best algorithm we have been able to generate is on the average 26% and 62% faster than the IBM ESSL in an IBM Power 3 and IBM Power 4, respectively.

Thammanit Pipatsrisawat, Aca Gacic, Franz Franchetti, Markus Püschel and José M. F. Moura (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5, pp. 153-156, 2005)

**Performance Analysis of the Filtered Backprojection Image Reconstruction Algorithms**

Preprint (214 KB)

Published paper (link to publisher)

Bibtex

We investigate performance tradeoffs for a class of filtered backprojection (FBP) image reconstruction algorithms. The recently developed fast hierarchical backprojection asymptotically achieves the same O(N/sup 2/ log N) cost as Fourier-based methods while retaining many advantages of the FBP technique. In this paper, we provide a detailed cost and performance analysis of the algorithm on a general purpose platform. Based on carefully tuned implementations of both the direct and the hierarchical backprojection, we explore the tradeoffs between distortion and runtime by varying several algorithm and implementation choices. Experimental results show that, given the desired performance, the choice of algorithm parameters is not obvious and largely depends on the image properties and the underlying computer platform.

José M. F. Moura, Markus Püschel, David Padua and Jack Dongarra (Proceedings of the IEEE, special issue on ``Program Generation, Optimization, and Adaptation'', Vol. 93, No. 2, pp. 211-215, 2005)

**Scanning the Issue: Special Issue on Program Generation, Optimization, and Platform Adaptation**

Published paper (link to publisher)

Bibtex

Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson and Nicholas Rizzolo (Proceedings of the IEEE, special issue on ``Program Generation, Optimization, and Adaptation'', Vol. 93, No. 2, pp. 232- 275, 2005)

**SPIRAL: Code Generation for DSP Transforms**

Preprint (1.2 MB)

Published paper (link to publisher)

Bibtex

Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL, which considers this problem for the performance-critical domain of linear digital signal processing (DSP) transforms. For a specified transform, SPIRAL automatically generates high-performance code that is tuned to the given platform. SPIRAL formulates the tuning as an optimization problem and exploits the domain-specific mathematical structure of transform algorithms to implement a feedback-driven optimizer. Similar to a human expert, for a specified transform, SPIRAL "intelligently" generates and explores algorithmic and implementation choices to find the best match to the computer's microarchitecture. The "intelligence" is provided by search and learning techniques that exploit the structure of the algorithm and implementation space to guide the exploration and optimization. SPIRAL generates high-performance code for a broad set of DSP transforms, including the discrete Fourier transform, other trigonometric transforms, filter transforms, and discrete wavelet transforms. Experimental results show that the code generated by SPIRAL competes with, and sometimes outperforms, the best available human tuned transform library code.

Juergen Lorenz, Stefan Kral, Franz Franchetti and Christoph W. Ueberhuber (IBM Journal of Research and Development, Vol. 49, No. 2/3, pp. 437-446, 2005)

**Vectorization Techniques for the BlueGene/L Double FPU**

Bibtex

Lawrence C. Chang, Inpyo Hong, Yevgen Voronenko and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2004)

**Adaptive Mapping of Linear DSP Algorithms to Fixed-Point Arithmetic**

Bibtex

Xiaoming Li, María J. Garzarán and David Padua (Proc. International Symposium on Code Generation and Optimization (CGO), pp. 111-124, 2004)

**A Dynamically Tuned Sorting Library**

Preprint (418 KB)

Published paper (link to publisher)

Bibtex

Empirical search is a strategy used during the installation oflibrary generators such as ATLAS, FFTW, and SPIRAL to identify the algorithm or the version of an algorithm that delivers thebest performance. In the past, empirical search has been appliedalmost exclusively to scientific problems. In this paper, we discuss the application of empirical search to sorting, which is oneof the best understood symbolic computing problems. When contrasted with the dense numerical computations of ATLAS, FFTW,and SPIRAL, sorting presents a new challenge, namely that the relative performance of the algorithms depend not only on the characteristics of the target machine and the size of the input data but also on the distribution of values in the input data set.Empirical search is applied in the study reported here as partof a sorting library generator. The resulting routines dynamicallyadapt to the characteristics of the input data by selecting the bestsorting algorithm from a small set of alternatives. To generate therun time selection mechanism our generator makes use of machinelearning to predict the best algorithm as a function of the characteristics of the input data set and the performance of the differentalgorithms on the target machine. This prediction is based on thedata obtained through empirical search at installation time.Our results show that our approach is quite effective. Whensorting data inputs of 12M keys with various standard deviations,our adaptive approach selected the best algorithm for all the inputdata sets and all platforms that we tried in our experiments. Thewrong decision could have introduced a performance degradationof up to 133%, with an average value of 44%.

Jeremy Johnson and Kang Chen (Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 44-, 2004)

**A Self-Adapting Distributed Memory Package for Fast Signal Transforms**

Preprint (1.4 MB)

Published paper (link to publisher)

Bibtex

Aca Gacic, Markus Püschel and José M. F. Moura (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5, pp. V-69-V-72, 2004)

**Automatically Generated High-Performance Code for Discrete Wavelet Transforms**

Preprint (103 KB)

Published paper (link to publisher)

Bibtex

A growing number of performance-critical DSP application use the discrete wavelet transform (DWT), thus prompting the need for highly efficient DWT software implementations. Unfortunately, the rapid evolution of computing platforms and compiler technology makes carefully hand-tuned code obsolete almost as fast as it is written. In this paper we describe our work on the automatic generation of DWT implementations that are tuned to a given platform. Our approach captures the various DWT algorithms in a concise mathematical framework that enables the integration of DWTs into the SPIRAL code generation system. Experiments show the quality of our automatically generated code and provide interesting insights; for example, the fastest code differs between platforms and is usually based on a non-obvious combination of DWT algorithms.

Franz Franchetti, Stefan Kral, Juergen Lorenz, Markus Püschel, Christoph W. Ueberhuber and Peter Wurzinger (Proc. High Performance Computing for Computational Science (VECPAR), Lecture Notes in Computer Science, Springer, Vol. 3402, pp. 23-36, 2004)

**Automatically Tuned FFTs for BlueGene/L’s Double FPU**

Preprint (219 KB)

Bibtex

IBM's upcoming 360 Tflop/s supercomputer BlueGene/L featuring 65,536 processors is supposed to lead the Top 500 list when being installed in 2005. This paper presents one of the first numerical codes actually run on a small prototype of this machine. Formal vectorization techniques, the Vienna MAP vectorizer (both developed for generic short vector SIMD extensions), and the automatic performance tuning approach provided by Spiral are combined to generate automatically optimized FFT codes for the BlueGene/L machine targeting its two-way short vector SIMD ``double'' floating-point unit. The resulting FFT codes are 40% faster than the best scalar Spiral generated code and 5 times faster than the mixed-radix FFT implementation provided by the Gnu scientific library GSL.

Adam C. Zelinski, Markus Püschel, Smarahara Misra and James C. Hoe (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5, pp. V-221-V-224, 2004)

**Automatic Cost Minimization for Multiplierless Implementations of Discrete Signal Transforms**

Preprint (80 KB)

Published paper (link to publisher)

Bibtex

The computation of linear DSP transforms consists entirely of additions and multiplications by constants, which, in a hardware realization, can be implemented as a network of wired shifts and additions. Thus, a light weight fixed point implementation that approximates an exact transform can be built from only adders. This paper presents an automatic approach for minimizing the number of additions required for a given transform under the constraint of a particular quality measure. We present an evaluation of our approach. For example, one experiment shows that the IMDCT transform within an MP3 decoder can be reduced from 643 additions to 170 additions while maintaining Limited Accuracy as defined by the MP3 ISO standard.

Anthony F. Breitzman and Jeremy Johnson (Journal of High Performance Computing and Applications, special issue on ``Computer Algebra and Signal Processing'', Vol. 37, No. 2, pp. 157-186, 2004)

**Automatic Derivation and Implementation of Fast Convolution Algorithms**

Published paper (link to publisher)

Bibtex

Yevgen Voronenko and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 5, pp. V-101-V-104, 2004)

**Automatic Generation of Implementations for DSP Transforms on Fused Multiply-Add Architectures**

Preprint (95 KB)

Published paper (link to publisher)

Bibtex

Many modern computer architectures feature fused multiply-add (FMA) instructions, which offer potentially faster performance for numerical applications. For DSP transforms, compilers can only generate FMA code to a very limited extent because optimal use of FMAs requires modifying the chosen algorithm. In this paper we present a framework for automatically generating FMA code for every linear DSP transform, which we implemented as an extension to the SPIRAL code generation system. We show that for many transforms and transform sizes, our generated FMA code matches the best-known hand-derived FMA algorithms in terms of arithmetic cost. Further, we present actual runtime results that show the speed-up obtained by using FMA instructions.

Aca Gacic (PhD. thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2004)

**Automatic Implementation and Platform Adaptation of Discrete Filtering and Wavelet Algorithms**

Preprint (1.2 MB)

Bibtex

Moore's law, with the doubling of the transistor count every 18 months, poses serious challenges to high-performance numerical software designers: how to stay close to the maximum achievable performance on ever-changing and ever-faster hardware technologies? Up-to-date numerical libraries are usually maintained by large teams of expert programmers who hand-tune the code to a specific class of computer platforms, sacrificing portability for performance. Every new generation of processors reopens the cycle of implementing, tuning, and debugging.

The SPIRAL system addresses this problem by automatically generating and implementing algorithms for DSP numerical kernels and searching for the best solution on the platform of interest. Using search, SPIRAL adapts code to take optimal advantage of the available platform features, such as the architecture of the memory hierarchy and register banks. As a result, SPIRAL generates high-performance implementations for DSP transforms that are competitive with the best hand-coded numerical libraries provided by hardware vendors.

In this thesis, we focus on automatic implementation and platform adaptation of filtering and wavelet kernels, which are at the core of many performance-critical DSP applications. We formulate many well-known algorithms for FIR filters and discrete wavelet transforms (DWT) using a concise and flexible symbolic mathematical language and integrate it in the SPIRAL system. This enables automatic generation and search over the comprehensive space of competitive algorithms, often leading to complex solutions that are hardly ever considered by a human programmer.

Experimental results show that our automatically generated and tuned code for FIR filters and DWTs is competitive and sometimes even outperforms hand-coded numerical libraries provided by hardware vendors. This implies that the richness and the extent of the automatically generated search space can match human ingenuity in achieving high performance. Our system generates high-quality code for digital filtering and wavelet kernels across most current and compatible future computer platforms and frees software developers from tedious and time-consuming coding at the machine level.

Markus Püschel, Adam C. Zelinski and James C. Hoe (Proc. International Conference on Computer-Aided Design (ICCAD), pp. 175-182, 2004)

**Custom-Optimized Multiplierless Implementations of DSP Algorithms**

Preprint (137 KB)

Published paper (link to publisher)

Bibtex

Linear DSP kernels such as transforms and filters are comprised exclusively of additions and multiplications by constants. These multiplications may be realized as networks of additions and wired shifts in hardware. The cost of such a "multiplierless" implementation is determined by the number of additions, which in turn depends on the value and precision of these constants. For a given transform or filter, the set of constants and their required precision is affected by algorithmic and implementation choices and hence provides a degree of freedom for optimization. In this paper we present an automated method to generate, for a given linear transform, a minimum addition multiplierless implementation that satisfies a given quality constraint. The method combines automatic algorithm selection to improve numerical robustness and automatic search methods to minimize constant precisions in a chosen algorithm. We present experiments that show the trade-offs between cost and quality, including custom optimizations of the transforms used in JPEG image and MP3 audio decoders

Grace Nordin, James C. Hoe and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2004)

**Discrete Fourier Transform IP Generator**

Bibtex

Nicholas Rizzolo and David Padua (Proc. International Workshop on Languages and Compilers for Parallel Computing (LCPC), 2004)

**HiLO: High Level Optimization of FFTs**

Preprint (181 KB)

Bibtex

As computing platforms become more and more complex, the task of optimizing performance critical codes becomes more challenging. Recently, more attention has been focused on automating this optimization process by making aggressive assumptions about the algorithm. Motivated by these trends, this paper presents HiLO, the high-level optimizer for FFT codes. HiLO blends traditional optimization techniques into an optimization strategy tailored to the needs of FFT codes and outputs C code ready to be further optimized by the native compiler. It has already been shown that such high-level transformations are important to coax the native compiler into doing its job well. HiLO provides a more appropriate platform for researching these phenomena, suggests an optimization strategy that improves on previous approaches, and shows that even software pipelining at the C level can improve the final binary's performance.

Peter Tummeltshammer, James C. Hoe and Markus Püschel (Proc. Design Automation Conference (DAC), pp. 826-829, 2004)

**Multiple Constant Multiplication By Time-Multiplexed Mapping of Addition Chains**

Preprint (122 KB)

Published paper (link to publisher)

Bibtex

An important primitive in the hardware implementations of linear DSP transforms is a circuit that can multiply an input value by one of several different preset constants. We propose a novel implementation of this circuit based on combining the *addition chains* of the constituent constants. We present an algorithm to automatically generate such a circuit for a given set of constants. The quality of the resulting circuits is evaluated after synthesis for a commercial 0.18mum standard cell library. We compare the area and latency efficiency of this addition chain based approach against a straightforward approach based on a constant table and a full multiplier.

Markus Püschel, Bryan Singer, Jianxin Xiong, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso and Robert W. Johnson (Journal of High Performance Computing and Applications, special issue on ``Automatic Performance Tuning'', Vol. 18, No. 1, pp. 21-45, 2004)

**SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms **

Published paper (link to publisher)

Bibtex

Franz Franchetti (Proc. IMACS Symposium on Mathematical Modelling (MATHMOD), Vol. 2, pp. 1539-1548, 2003)

**A Portable Short Vector Version of FFTW**

Bibtex

Xu Xu (Master thesis, Computer Science, Drexel University, 2003)

**A Recursive Implementation of the Dimensionless FFT**

Bibtex

Jeremy Johnson and Xu Xu (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2003)

**A Recursive Implementation of the Dimensionless FFT **

Preprint (62 KB)

Published paper (link to publisher)

Bibtex

A divide and conquer algorithm is presented for computing arbitrary multi-dimensional discrete Fourier transforms. In contrast to standard approaches such as the row-column algorithm, this algorithm allows an arbitrary decomposition, based solely on the size of the transform independent of the dimension of the transform. Only minor modifications are required to compute transforms with different dimension. These modifications were incorporated into the FFTW package so that the algorithm for computing one-dimensional transforms can be used to compute arbitrary dimensional transforms. This reduced the runtime of many multi-dimensional transforms.

Anthony F. Breitzman (PhD. thesis, Computer Science, Drexel University, 2003)

**Automatic Derivation and Implementation of Fast Convolution Algorithms**

Published paper (link to publisher)

Bibtex

Mihai Furis (Master thesis, Computer Science, Drexel University, 2003)

**Cache Miss Analysis of Walsh-Hadamard Transform Algorithms**

Bibtex

Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 501-504, 2003)

**Cooley-Tukey FFT like Algorithms for the DCT**

Preprint (71 KB)

Published paper (link to publisher)

Bibtex

The Cooley-Tukey FFT algorithm decomposes a discrete Fourier transform (DFT) of size n = km into smaller DFT of size k and m. In this paper we present a theorem that decomposes a polynomial transform into smaller polynomial transforms, and show that the FFT is obtained as a special case. Then we use this theorem to derive a new class of recursive algorithms for the discrete cosine transforms (DCT) of type II and type III. In contrast to other approaches, we manipulate polynomial algebras instead of transform matrix entries, which makes the derivation transparent, concise, and gives insight into the algorithms' structure. The derived algorithms have a regular structure and, for 2-power size, minimal arithmetic cost (among known DCT algorithms).

Smarahara Misra, Adam C. Zelinski, James C. Hoe and Markus Püschel (Proc. High Performance Extreme Computing (HPEC), 2003)

**Custom Reduction of Arithmetic in Linear DSP Transforms**

Bibtex

Smarahara Misra (Master thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2003)

**Custom Reduction of Arithmetic in Multiplierless implementations of DSP Transforms**

Bibtex

Aca Gacic, Markus Püschel and José M. F. Moura (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 541-544, 2003)

**Fast Automatic Implementations of FIR Filters**

Preprint (82 KB)

Bibtex

SPIRAL is a generator for platform-adapted libraries of DSP transform algorithms. SPIRAL represents and automatically generates fast algorithms as mathematical formulas and translates them into programs. Adaptation is achieved by searching in the space of algorithmic and coding alternatives for the fastest implementation. In this paper we extend SPIRAL to generate platform-adapted implementations of FIR filters. First, we present various filter algorithms and introduce the mathematical constructs needed to include them into SPIRAL's architecture. Then we use SPIRAL to find fast filter implementations. The results show runtime improvements to a standard loop implementation of up to 70% using different blocking techniques. Further, we show that the usefulness of frequency-domain methods is not determined by the number of operations.

T. Fahringer, Franz Franchetti, M. Geissler, G. Madsen, H. Moritsch and R. Prodan (Proc. Euromicro Conference on Parallel Distributed and Network-based Processing (Euro PDP), pp. 185-192, 2003)

**On Using ZENTURIO for Performance and Parameter Studies on Clusters and Grids**

Bibtex

Franz Franchetti (PhD. thesis, Vienna University of Technology, 2003)

**Performance Portable Short Vector Transforms**

Bibtex

Franz Franchetti and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 537-540, 2003)

**Short Vector Code Generation and Adaptation for DSP Algorithms **

Preprint (90 KB)

Published paper (link to publisher)

Bibtex

Most recent general purpose processors feature short vector SIMD instructions, like SSE on Pentium III/4. In this paper we automatically generate platform-adapted short vector code for DSP transform algorithms using SPIRAL. SPIRAL represents and generates fast algorithms as mathematical formulas, and translates them into code. Adaptation is achieved by searching in the space of algorithmic and coding alternatives for the fastest implementation on the given platform. We explain the mathematical foundation that relates formula constructs to vector code, and overview the vector code generator within SPIRAL. Experimental results show excellent speed-ups compared to ordinary C code for a variety of transforms and computing platforms. For the DFT on Pentium 4, our automatically generated code compares favorably with the hand-tuned Intel MKL vendor library.

Franz Franchetti and Markus Püschel (Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2003)

**Short Vector Code Generation for the Discrete Fourier Transform **

Preprint (169 KB)

Bibtex

In this paper we use a mathematical approach to generate high performance short vector code for the discrete Fourier transform. We represent the well-known Cooley-Tukey FFT recursion in a mathematical notation and formally derive a ``short vector variant''. Using this recursion we generate for one DFT size a large number of different algorithms, represented as formulas, and translate them into short vector code. Finally, we present a vector code specific dynamic programming search that finds in the formula space a good match to the given architecture. We implemented this approach as part of the SPIRAL library generator. On Pentium III and 4, we automatically generate SSE and SSE2 vector code that compares favorably with the hand-tuned Intel vendor MKL.

Stefan Kral, Franz Franchetti, Juergen Lorenz and Christoph W. Ueberhuber (Proc. Euro-Par Conference on Parallel and Distributed Computing, LNCS 2985, pp. 251-260, 2003)

**SIMD Vectorization of Straight Line Code**

Bibtex

Markus Püschel and José M. F. Moura (Proc. Workshop on Optimizations for DSP and Embedded Systems (ODES), 2003)

**SPIRAL: An Overview**

Bibtex

Markus Püschel and José M. F. Moura (SIAM Journal of Computing, Vol. 32, No. 5, pp. 1280-1316, 2003)

**The Algebraic Approach to the Discrete Cosine and Sine Transforms and their Fast Algorithms**

Preprint (327 KB)

Published paper (link to publisher)

Bibtex

It is known that the discrete Fourier transform (DFT) used in digital signal processing can be characterized in the framework of representation theory of algebras, namely as the decomposition matrix for the regular module C[Z_n] = C[x]/(x^n - 1). This characterization provides deep insight on the DFT and can be used to derive and understand the structure of its fast algorithms. In this paper we present an algebraic characterization of the important class of discrete cosine and sine transforms as decomposition matrices of certain regular modules associated to four series of Chebyshev polynomials. Then we derive most of their known algorithms by pure algebraic means. We identify the mathematical principle behind each algorithm and give insight into its structure. Our results show that the connection between algebra and digital signal processing is stronger than previously understood.

Franz Franchetti (Proc. International Workshop on Numerical and Symbolic Scientific Computing, 2003)

**Top Performance in Signal Processing**

Bibtex

Fang Fang, Rob A. Rutenbar, Markus Püschel and Tsuhan Chen (Proc. Design Automation Conference (DAC), pp. 496-501, 2003)

**Toward Efficient Static Analysis of Finite-Precision Effects in DSP Applications via Affine Arithmetic Modeling **

Preprint (126 KB)

Published paper (link to publisher)

Bibtex

We introduce a static error analysis technique, based on smart interval methods from affine arithmetic, to help designers translate DSP codes from full-precision floating-point to smaller finite-precision formats. The technique gives results for numerical error estimation comparable to detailed simulation, but achieves speedups of three orders of magnitude by avoiding actual bit-level simulation. We show results for experiments mapping common DSP transform algorithms to implementations using small custom floating point formats.

Mike Balog (Master thesis, Electrical and Computer Engineering, Drexel University, 2002)

**A Flexible Framework for Implementing FFT Processors**

Bibtex

Kang Chen (Master thesis, Computer Science, Drexel University, 2002)

**A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms**

Published paper (link to publisher)

Bibtex

This thesis presents a self-adapting parallel package for computing the Walsh-Hadamard transform (WHT), a prototypical fast signal transform, similar to the fast Fourier transform. Using a search over a space of mathematical formulas representing different algorithms to compute the WHT, the package finds the best parallel implementation on a given shared-memory multiprocessor. The search automatically fins the best combination of sequential and parallel code leading to effective granularity, load balance, and cache utilization. Experimental data are presented in the thesis showing the performance of the package on four different architectures. Results are also presented showing the optimizations required to obtain nearly linear speedup on these sample symmetric multiprocessors.

Kang Chen and Jeremy Johnson (Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 58-63, 2002)

**A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms **

Preprint (262 KB)

Published paper (link to publisher)

Bibtex

This paper presents a self-adapting parallel package for computing the Walsh-Hadamard transform (WHT), a prototypical fast signal transform, similar to the fast Fourier transform. Using a search over a space of mathematical formulas representing different algorithms to compute the WHT the package finds the best parallel implementation on a given shared-memory multiprocessor. The search automatically finds the best combination of sequential and parallel code leading to the optimal granularity, load balance, and cache utilization. Experimental results are presented showing the optimizations required to obtain nearly linear speedup on a sample symmetric multiprocessor.

Markus Püschel, Sebastian Egner and Thomas Beth (in New Reference Book on Computer Algebra, Eds. J. Grabmeier, E. Kaltofen, J. Grabmeier, E. Kaltofen, Springer 2002)

**AREP**

Published paper (link to publisher)

Bibtex

Franz Franchetti and Markus Püschel (Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 20-26, 2002)

**A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms **

Preprint (142 KB)

Published paper (link to publisher)

Bibtex

Many applications require fast implementations of signal processing algorithms to analyze data in real time or to effectively process many large data sets. Fast implementations of a signal transform need to take advantage of structure in the transformation matrix to factor the transform into a product of structured matrices. These factorizations compute the transform with fewer operations than the naive implementation of matrix multiplication. Signal transforms can have a vast number of factorizations, with each factorization of a single transform represented by a unique but mathematically equivalent formula. Interestingly, when implemented in code, these formulas can have significantly different runtimes on the same processor, sometimes differing by an order of magnitude. Further, the optimal implementations differ significantly between processors. Therefore, determining which formula is the most efficient for a particular processor is of great interest.

This thesis contributes methods for automating the modeling and optimization of performance across a variety of signal processing algorithms. Modeling and understanding performance can greatly aid in intelligently pruning the huge search space when optimizing performance. Automation is vital considering the size of the search space, the variety of signal processing algorithms, and the constantly changing computer platform market.

To automate the optimization of signal transforms, we have developed and implemented a number of different search methods in the SPIRAL system. These search methods are capable of optimizing a variety of different signal transforms, including new user-specified transforms. We have developed a new search method for this domain, STEER, which uses an evolutionary stochastic algorithm to find fast implementations.

To enable computer modeling of signal processing performance, we have developed and analyzed a number of feature sets to describe formulas representing specific transforms. We have developed several different models of formula performance, including models that predict runtimes of formulas and models that predict the number of cache misses formulas incur.

Further, we have developed a method that uses these learned models to generate fast implementations. This method is able to construct fast formulas, allowing us to intelligently search through only the most promising formulas. While the learned model is trained on data from one transform size, our method is able to produce fast formulas across many transform sizes, including larger sizes, even though it has never timed a formula of those other sizes.

Bryan Singer and Manuela Veloso (IEEE Transactions on Signal Processing, Vol. 50, No. 8, pp. 2003-2014, 2002)

**Automating the Modeling and Optimization of the Performance of Signal Transforms **

Preprint (365 KB)

Published paper (link to publisher)

Bibtex

Fast implementations of discrete signal transforms, such as the discrete Fourier transform, the Walsh-Hadamard transform, and the discrete trigonometric transforms, can be viewed as factorizations of their corresponding transformation matrices. A given signal transform can have many different factorizations, with each factorization represented by a unique but mathematically equivalent formula. When implemented in code, these formulas can have significantly different running times on the same processor, sometimes differing by an order of magnitude. Further, the optimal implementations on various processors are often different. Given this complexity, a crucial problem is automating the modeling and optimization of the performance of signal transform implementations.

To enable computer modeling of signal processing performance, we have developed and analyzed more than 15 feature sets to describe formulas representing specific transforms. Using some of these features and a limited set of training data, we have successfully trained neural networks to learn to accurately predict performance of formulas, with error rates less than 5%. In the direction of optimization, we have developed a new stochastic evolutionary algorithm, STEER, for finding fast implementations of a variety of signal transforms. STEER is able to optimize completely new transforms specified by a user. We present results that show STEER can find discrete cosine transform formulas that are 10--20% faster than what a dynamic programming search finds.

Franz Franchetti, F. Kaltenberger and Christoph W. Ueberhuber (Proc. APLIMAT Conference, pp. 333-339, 2002)

**FFT Kernels with FMA Utilization**

Bibtex

Markus Püschel and José M. F. Moura (Proc. Digital Signal Processing Workshop, 2002)

**Generation and Manipulation of DSP Transform Algorithms**

Bibtex

Fang Fang, James C. Hoe, Markus Püschel and Smarahara Misra (Proc. High Performance Extreme Computing (HPEC), 2002)

**Generation of Custom DSP Transform IP Cores: Case Study Walsh-Hadamard Transform **

Bibtex

Hardware designers are increasingly relying on pre-designed DSP (digital signal processing) cores from IP libraries to improve their productivity and reduce design time. Unfortunately, static DSP cores cannot accommodate application-specific trade-offs. To overcome this problem, we are proposing to automatically generate customized DSP cores that can be tailored for specific design requirements. This enables a designer, with no specific background in DSP transform mathematics, to quickly create and evaluate different design choices and to determine the one that is most suitable for the application. In this paper, we present a generator for the Walsh-Hadamard Transform (WHT). The generator accepts as input the WHT's sample size and two implementation parameters that control the degree of hardware reuse. The output is an RTL-level Verilog description of the desired implementation. We present experimental results that compare our different generated designs for the same transform with respect to resource requirements, computation latency, and throughput.

Bryan Singer and Manuela Veloso (Journal of Machine Learning Research, special issue on ``the Eighteenth International Conference on Machine Learning (ICML 2001)'', Vol. 3, pp. 887-919, 2002)

**Learning to Construct Fast Signal Processing Implementations **

Preprint (6.4 MB)

Published paper (link to publisher)

Bibtex

A single signal processing algorithm can be represented by many mathematically equivalent formulas. However, when these formulas are implemented in code and run on real machines, they have very different runtimes. Unfortunately, it is extremely difficult to model this broad performance range. Further, the space of formulas for real signal transforms is so large that it is impossible to search it exhaustively for fast implementations. We approach this search question as a control learning problem. We present a new method for learning to *generate fast formulas*, allowing us to intelligently search through only the most promising formulas. Our approach incorporates signal processing knowledge, hardware features, and formula performance data to learn to construct fast formulas. Our method learns from performance data for a few formulas of one size and then can construct formulas that will have the fastest runtimes possible across many sizes.

Hung-Jen Huang (Master thesis, Computer Science, Drexel University, 2002)

**Performance Analysis of an Adaptive Algorithm for the Walsh-Hadamard Transform**

Bibtex

Franz Franchetti, Markus Püschel, José M. F. Moura and Christoph W. Ueberhuber (Proc. High Performance Extreme Computing (HPEC), 2002)

**Short Vector SIMD Code Generation for DSP Algorithms**

Bibtex

Peter Becker (Master thesis, Electrical and Computer Engineering, Drexel University, 2001)

**A High Speed VLSI Architecture for the Discrete Haar Wavelet Transform**

Bibtex

An alternative to strictly time- or frequency-domain analysis, wavelets provide a multiresolution analysis of signals for input to processing such as edge detection, compression, and pattern recognition. High speed VLSI architectures are necessary for real-time processing of large data sets. Presented here is such an architecture for the Haar wavelet transform, the simplest and easiest wavelet to understand. From this base model, a generalization of the architecture to other wavelets is discussed, as well.

Franz Franchetti, H. Karner, Stefan Kral and Christoph W. Ueberhuber (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 1109-1112, 2001)

**Architecture Independent Short Vector FFTs**

Bibtex

Sebastian Egner, Jeremy Johnson, David Padua, Jianxin Xiong and Markus Püschel (ACM SIGSAM Bulletin Communications in Computer Algebra, Vol. 35, No. 2, pp. 1-19, 2001)

**Automatic Derivation and Implementation of Signal Processing Algorithms **

Preprint (249 KB)

Published paper (link to publisher)

Bibtex

We present a computer algebra framework to automatically derive and implement algorithms for digital signal processing. The two main parts of the framework are AREP, a library for symbolic manipulation of group representations and structured matrices, and SPL, a compiler turning matrix expressions into efficient imperative-style numerical programs.

Sebastian Egner and Markus Püschel (IEEE Transactions on Signal Processing, Vol. 49, No. 9, pp. 1992-2002, 2001)

**Automatic Generation of Fast Discrete Signal Transforms **

Preprint (217 KB)

Published paper (link to publisher)

Bibtex

This paper presents an algorithm that derives fast versions for a broad class of discrete signal transforms symbolically. The class includes but is not limited to the discrete Fourier and the discrete trigonometric transforms. This is achieved by finding fast sparse matrix factorizations for the matrix representations of these transforms. Unlike previous methods, the algorithm is entirely automatic and uses the defining matrix as its sole input. The sparse matrix factorization algorithm consists of two steps: first, the “symmetry” of the matrix is computed in the form of a pair of group representations; second, the representations are stepwise decomposed, giving rise to a sparse factorization of the original transform matrix. We have successfully demonstrated the method by computing automatically efficient transforms in several important cases: for the DFT, we obtain the Cooley-Tukey FFT; for a class of transforms including the DCT, type II, the number of arithmetic operations for our fast transforms is the same as for the best-known algorithms. Our approach provides new insights and interpretations for the structure of these signal transforms and the question of why fast algorithms exist. The sparse matrix factorization algorithm is implemented within the software package AREP

Jianxin Xiong (PhD. thesis, Computer Science, University of Illinois at Urbana-Champaign, 2001, Also as Technical Report UIUCDCS-R-2001-224, University of Illinois)

**Automatic Optimization of DSP Algorithms **

Bibtex

Since the advent of digital signal processing (DSP), there has been an enormous effort to obtain high-performance implementations of signal processing algorithms such as the fast Fourier transform (FFT). This effort has produced thousands of variants of fundamental algorithms and an equally large number of implementation techniques. Many of the implementations have been carefully hand-optimized. The cost of these hand optimizations and their long implementation times are a strong motivation to automate the implementation and optimization of signal processing algorithms.

In this dissertation, we present SPL, a domain specific language that represents fast signal processing algorithms as mathematical formulas, and the SPL compiler, which translates these formulas into efficient programs in high-level programming language such as FORTRAN and C. The SPL compiler was developed as part of SPIRAL, a system which utilizes formula transformations and intelligent search strategies to automatically create optimized digital signal processing libraries.

We also present Meta-SPL, a meta-language for extending the SPL language without modifying the SPL compiler. Meta-SPL provides the approach to declare new symbols that can be used as SPL keywords, as well as the mechanism to define the semantics of these new symbols.

After presenting the translation and optimization techniques utilized by the SPL compiler, empirical data is presented showing the efficiency of the resulting C/FORTRAN code. Timings are compared, using fast Fourier transform (FFT), Walsh-Hadamard transform (WHT), discrete cosine transform (DCT) and discrete sine transform (DST) as the benchmarks, to those obtained by highly optimized implementations including FFTW and the WHT package. The results of this comparison show that the SPL compiler produces code that is competitive and, in many cases, faster than the competitors.

Bryan Singer (PhD. thesis, Computer Science, Carnegie Mellon University, 2001)

**Automating the Modeling and Optimization of the Performance of Signal Processing Algorithms **

Preprint (60 KB)

Published paper (link to publisher)

Bibtex

Many applications require fast implementations of signal processing algorithms to analyze data in real time or to effectively process many large data sets. Fast implementations of a signal transform need to take advantage of structure in the transformation matrix to factor the transform into a product of structured matrices. These factorizations compute the transform with fewer operations than the naive implementation of matrix multiplication. Signal transforms can have a vast number of factorizations, with each factorization of a single transform represented by a unique but mathematically equivalent formula. Interestingly, when implemented in code, these formulas can have significantly different runtimes on the same processor, sometimes differing by an order of magnitude. Further, the optimal implementations differ significantly between processors. Therefore, determining which formula is the most efficient for a particular processor is of great interest.

This thesis contributes methods for automating the modeling and optimization of performance across a variety of signal processing algorithms. Modeling and understanding performance can greatly aid in intelligently pruning the huge search space when optimizing performance. Automation is vital considering the size of the search space, the variety of signal processing algorithms, and the constantly changing computer platform market.

To automate the optimization of signal transforms, we have developed and implemented a number of different search methods in the SPIRAL system. These search methods are capable of optimizing a variety of different signal transforms, including new user-specified transforms. We have developed a new search method for this domain, STEER, which uses an evolutionary stochastic algorithm to find fast implementations.

To enable computer modeling of signal processing performance, we have developed and analyzed a number of feature sets to describe formulas representing specific transforms. We have developed several different models of formula performance, including models that predict runtimes of formulas and models that predict the number of cache misses formulas incur.

Further, we have developed a method that uses these learned models to generate fast implementations. This method is able to construct fast formulas, allowing us to intelligently search through only the most promising formulas. While the learned model is trained on data from one transform size, our method is able to produce fast formulas across many transform sizes, including larger sizes, even though it has never timed a formula of those other sizes.

Neungsoo Park and Viktor K. Prasanna (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. II-1205-II-1208, 2001)

**Cache Conscious Walsh-Hadamard Transform **

Preprint (88 KB)

Published paper (link to publisher)

Bibtex

The Walsh-Hadamard Transform (WHT) is an important tool in signal processing because of its simplicity. One major problem with the existing packages is their poor scalability to larger problems. In computing large size WHT, non-unit stride accesses result in poor cache performance leading to severe degradation in performance. In this paper, we develop an efficient *cache friendly* technique that drastically enhances the performance of large size WHT. In our approach, data reorganization is performed between computation stages to reduce cache pollution. We develop an efficient search algorithm to determine the optimal factorization tree depending on the problem size and stride access in the decomposition. We have developed a package and demonstrated improved performance. For example, on Alpha 21264 and MIPS R10000, our technique results in upto 180% performance improvement over the state-of-the-art package. The proposed optimization is applicable to other signal transforms and is portable across various platforms.

Pinit Kumhom (PhD. thesis, Electrical and Computer Engineering, Drexel University, 2001, Also Tech. Report DU-MCS-01-01, Drexel University, 2001)

**Design, Optimization, and Implementation of a Universal FFT Processor **

Preprint (8.2 MB)

Bibtex

There exist Fast Fourier transform (FFT) algorithms, called dimensionless FFTs (L. Auslander, J. Johnson and R. Johnson, Dimensionless Fast Fourier Transform Method and Apparatus, Patent #US6003056, issued Dec. 14, 1999.), that work independent of dimension. These algorithms can be configured to compute different dimensional discrete Fourier transforms (DFTs) simply by relabeling the input data and by changing the values of the twiddle factors occurring in the butterfly operations. This observation allows the design of a universal FFT processor, which with minor reconfiguring, can compute one, two, and three dimensional DFTs. In this thesis a family of FFT processors, parameterized by the number of points, the dimension, the number of processors, and the internal dataflow is designed. Mathematical properties of the FFT are used systematically to simplify and optimize the processor design, and to explore different algorithms and design choices. Different dimensionless FFTs have different dataflows and consequently lead to different performance characteristics. A performance model is used to evaluate the different algorithmic choices and their resulting dataflow. Using the performance model, a search was conducted to find the optimal algorithm for the family of processors considered. The resulting algorithm and corresponding hardware design was implemented using FPGA.

Markus Püschel, Bryan Singer, Manuela Veloso and José M. F. Moura (Proc. International Conference on Computational Science (ICCS), Lecture Notes In Computer Science, Springer, Vol. 2073, pp. 97-106, 2001)

**Fast Automatic Generation of DSP Algorithms**

Preprint (144 KB)

Published paper (link to publisher)

Bibtex

SPIRAL is a generator of optimized, platform-adapted libraries for digital signal processing algorithms. SPIRAL's strategy translates the implementation task into a search in an expanded space of alternatives. These result from the many degrees of freedom in the DSP algorithm itself and in the various coding choices. This paper describes the framework to represent and generate efficiently these alternatives: the formula generator module in SPIRAL. We also address the search module that works in tandem with the formula generator in a feedback loop to find optimal implementations. These modules are implemented using the computer algebra system GAP/AREP.

José M. F. Moura, Jeremy Johnson, Robert W. Johnson, David Padua, Viktor K. Prasanna, Markus Püschel, Bryan Singer, Manuela Veloso and Jianxin Xiong (Proc. High Performance Extreme Computing (HPEC), 2001)

**Generating Platform-Adapted DSP Libraries using SPIRAL**

Bibtex

Bryan Singer and Manuela Veloso (Proc. International Conference on Machine Learning (ICML), pp. 529-536, 2001)

**Learning to Generate Fast Signal Processing Implementations **

Preprint (450 KB)

Published paper (link to publisher)

Bibtex

A single signal processing algorithm can be represented by many mathematically equivalent formulas. However, when these formulas are implemented in code and run on real machines, they have very different running times. Unfortunately, it is extremely difficult to model this broad performance range. Further, the space of formulas for real signal transforms is so large that it is impossible to search it exhaustively for fast implementations. We approach this search question as a control learning problem. We present a new method for learning to *generate fast formulas*, allowing us to intelligently search through only the most promising formulas. Our approach incorporates signal processing knowledge, hardware features, and formula performance data to learn to construct fast formulas. Our method learns from performance data for a few formulas of one size and then can construct formulas that will have the fastest run times possible across many sizes.

Jianxin Xiong, Jeremy Johnson, Robert W. Johnson and David Padua (Proc. Programming Languages Design and Implementation (PLDI), pp. 298-308, 2001)

**SPL: A Language and Compiler for DSP Algorithms **

Preprint (226 KB)

Published paper (link to publisher)

Bibtex

In this paper, we discuss the design and implementation of the SPL compiler, a domain specific compiler that translates signal processing algorithms expressed as mathematical formulas in the SPL language into efficient programs in high-level programming language such as FORTRAN and C. The SPL compiler was developed as part of SPIRAL, a system which utilizes formula transformations and intelligent search strategies to automatically create optimized digital signal processing (DSP) libraries. After presenting the translation and optimization techniques utilized by the compiler, empirical data is presented showing the efficiency of the resulting C/FORTRAN code. Timings are compared, using the fast Fourier transform (FFT) as a benchmark, to those obtained by FFTW, one of the fastest FFT packages available. The results of this comparison show that the SPL compiler produces code that is competitive and, in many cases, faster than FFTW.

Bryan Singer and Manuela Veloso (Proc. Supercomputing (SC), pp. 22, 2001)

**Stochastic Search for Signal Processing Algorithm Optimization **

Preprint (167 KB)

Published paper (link to publisher)

Bibtex

Many difficult problems can be viewed as search problems. However, given a new task with an embedded search problem, it is challenging to state and find a truly effective search approach. In this paper, we address the complex task of signal processing optimization. We first introduce and discuss the complexities of this domain. In general, a single signal processing algorithm can be represented by a very large number of different but mathematically equivalent formulas. When these formulas are implemented in actual code, unfortunately their running times differ significantly. Signal processing algorithm optimization aims at finding the fastest formula. We present a new approach that successfully solves this problem, using an evolutionary stochastic search algorithm, STEER, to search through the very large space of formulas. We empirically compare STEER against other search methods, showing that it notably can find faster formulas while still only timing a very small portion of the search space.

Gavin Haentjens (Master thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2000)

**An Investigation of Recursive FFT Implementations **

Preprint (762 KB)

Bibtex

The goals of the research discussed in this report are to determine the impact of different Cooley-Tukey decompositions on the performance of computer programs that compute the FFT, evaluate different methods for finding the most efficient decompositions, and determine the characteristics of the most efficient decompositions. Experiments are conducted using three different FFT programs and three dynamic programming (DP) methods for searching for efficient decompositions. The results show that even for FFTs of sizes under 2^{11}, the runtime for the average decomposition may be up to three times the runtime for the optimal decomposition. The results also show that a basic implementation of DP performs as well as an exhaustive search at finding fast decompositions for FFTs of sizes up to 2^{10}, and that the simple DP performs as well as two more sophisticated versions of DP for FFT sizes up to 2^{20}. Furthermore, the results show that for an out-of-place implementation of the FFT, right-expanded decompositions are optimal because they require memory storage only for the input and output data arrays, whereas other decompositions require additional temporary storage. Moreover, the results show that for an in-place implementation of the FFT, balanced decompositions are optimal if the algorithm is iterative, and right-expanded decompositions are optimal if the algorithm is recursive.

Pinit Kumhom, Jeremy Johnson and Prawat Nagvajara (Proc. IEEE ASIC/SOC Conference, IEEE, pp. 182-186, 2000)

**Design, optimization, and implementation of a universal FFT processor**

Published paper (link to publisher)

Bibtex

There exist Fast Fourier transform (FFT) algorithms, called dimensionless FFTs, that work independent of dimension. These algorithms can be configured to compute different dimensional DFTs simply by relabeling the input data and by changing the values of the twiddle factors occurring in the butterfly operations. This observation allows us to design an FFT processor, which with minor reconfiguring, can compute one, two, and three dimensional DFTs. In this paper we design a family of FFT processors, parameterized by the number of points, the dimension, the number of processors, and the internal dataflow, and show how to map different dimensionless FFTs onto this hardware design. Different dimensionless FFTs have different dataflows and consequently lead to different performance characteristics. Using a performance model we search for the optimal algorithm for the family of processors we considered. The resulting algorithm and corresponding hardware design was implemented using FPGA

Neungsoo Park, Viktor K. Prasanna, Kiran Bondalapati and Dongsoo Kang (Proc. IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 693-701, 2000)

**Dynamic Data Layouts for Cache-conscious Factorization of DFT **

Preprint (245 KB)

Bibtex

Effective utilization of cache memories is a key factor in achieving high performance in computing the Discrete Fourier Transform (DFT). Most optimization techniques for computing the DFT rely on either modifying the computation and data access order or exploiting low level platform specific details, while keeping the data layout in memory static. In this paper we propose a high level optimization technique, dynamic data layout (DDL). In DDL, data reorganization is performed between computations to effectively utilize the cache. This cache-conscious factorization of the DFT including the data reorganization steps is automatically computed by using efficient techniques in our approach. An analytical model of the cache miss pattern is utilized to predict the performance and explore the search space of factorizations. Our technique results in up to a factor of 4 improvement over standard FFT implementations and up to 33% improvement over other optimization techniques such as copying on SUN UltraSPARC-II, DEC Alpha and Intel Pentium III
dynamic data layout
(DDL). In DDL, data reorganization is performed
*
between
*
computations to effectively utilize the cache. This cache-conscious factorization of the DFT including the data reorganization steps is automatically computed by using efficient techniques in our approach. An analytical model of the cache miss pattern is utilized to predict the performance and explore the search space of factorizations. Our technique results in up to a factor of 4 improvement over standard FFT implementations and up to 33% improvement over other optimization techniques such as copying on SUN UltraSPARC-II, DEC Alpha and Intel Pentium III.

Jeremy Johnson and Markus Püschel (Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 6, pp. 3347-3350, 2000)

**In Search of the Optimal Walsh-Hadamard Transform **

Preprint (104 KB)

Published paper (link to publisher)

Bibtex

This paper describes an approach to implementing and optimizing fast signal transforms. Algorithms for computing signal transforms are expressed by symbolic expressions, which can be automatically generated and translated into programs. Optimizing an implementation involves searching for the fastest program obtained from one of the possible expressions. In this paper we apply this methodology to the implementation of the Walsh-Hadamard transform. An environment, accessible from MATLAB, is provided for generating and timing WHT algorithms. These tools are used to search for the fastest WHT algorithm. The fastest algorithm found is substantially faster than standard approaches to implementing the WHT. The work reported in this paper is part of the SPIRAL project, an ongoing project whose goal is to automate the implementation and optimization of signal processing algorithms.

Bryan Singer and Manuela Veloso (Proc. International Conference on Machine Learning (ICML), pp. 887-894, 2000)

**Learning to Predict Performance from Formula Modeling and Training Data **

Preprint (188 KB)

Published paper (link to publisher)

Bibtex

This paper reports on our work and results framing signal processing algorithm optimization as a machine learning task. A single signal processing algorithm can be represented by many different but mathematically equivalent formulas. When these formulas are implemented in actual code, they have very different running times. Signal processing optimization is concerned with finding a formula that implements the algorithm as efficiently as possible. Unfortunately, a correct mapping between a mathematical formula and its running time is unknown. However empirical performance data can be gathered for a variety of formulas. This data offers an interesting opportunity *to learn* to predict running time performance. In this paper we present two major results along this direction: (1) Different sets of features are identified for mathematical formulas that distinguish them into partitions with significantly different running times, and (2) A function approximator can learn to accurately predict the running time of a formula given a limited set of training data. Showing the impact of selecting different features to describe the input, this work contributes an extensive study on the role of learning for this novel task.

David Sepiashvili (Master thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2000)

**Performance Models and Search Methods for Optimal FFT Implementations **

Preprint (589 KB)

Bibtex

The central result of this thesis is deriving systematic methodologies for finding the fast Fourier transform (FFT) optimal implementations. By employing a divide-conquer procedure (decomposition), we break down the initial transform into combinations of different smalller size sub-transforms that can graphically be presented as breakdown trees. We apply the rewrite rules to the Cooley-Tukey formula and generate a set of algorithms and alternative codes for the FFT computation. We obtain a set of all possible implementations by pairing all possible breakdown trees with all code alternatives. For the runtime evaluation of these implementations, we develop the analytical and experimental perfomance models. Based on these models, we derive the optimal search methods -- the dynamic programming and exhaustive search -- and use them to search over the set of implementation runtimes for finding the implementation with the minimal runtime. The test results demonstrate that good algorithms and codes, accurate performance evaluation models, and effective search methods, combined together within the system framework, provide the best FFT implementations.

Franz Franchetti (Master thesis, Vienna University of Technology, 2000)

**Short Vector FFTs**

Bibtex

José M. F. Moura, Jeremy Johnson, Robert W. Johnson, David Padua, Viktor K. Prasanna, Markus Püschel and Manuela Veloso (Proc. High Performance Extreme Computing (HPEC), 2000)

**SPIRAL: Automatic Implementation of Signal Processing Algorithms**

Bibtex

Publication interface designed and implemented by Patra Pantupat, Aliaksei Sandryhaila,
and Markus Püschel

Electrical and Computer Engineering, Carnegie Mellon University, 2007

Electrical and Computer Engineering, Carnegie Mellon University, 2007