Jiyuan Zhang, Tze-Meng Low, Qi Guo and Franz Franchetti (Proc. Workshop on Near Data Processing (WONDP), 2015)
A 3D-Stacked Memory Manycore Stencil Accelerator System
Comment: in conjuction with MICRO
Preprint (706 KB)
Published paper (link to publisher)

Stencil operations are an important class of scientific computational kernels that are pervasive in scientific simulations as well as in image processing. A key characteristic of this class of computation is that they have a low operational intensity, i.e., the ratio of the number of memory accesses to the number of floating point operations it performs is high. As a result, the performance of stencil operations implemented on general purpose computing systems is bounded by the memory bandwidth. Technologies such as 3D stacked memory can provide substantially more bandwidth than conventional memory systems and can enhance the performance of memory intensive computations like stencil kernels. In this paper, we leverage this 3D stacked memory technology to design an accelerator for stencil computations. We show that for the best efficiency one needs to find the balance between computation and memory accesses to keep all components consistently busy. We achieve this by exploring how blocking and caching schemes to control the compute-to-memory ratio. Finally, we identify optimal design points that maximize performance.

Memory, Acceleration, 3D-stacked