Copyrights to these papers may be held by the publishers. The download files are preprints. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
3D-stacked integration of DRAM and logic layers using through-silicon via (TSV) technology has given rise to a new interpretation of near-data processing (NDP) concepts that were proposed decades ago. However, processing capability within the stack is limited by stringent power and thermal constraints. Simple processing mechanisms with intensive memory accesses, such as data reorganization, are an effective means of exploiting 3D stacking-based NDP. Data reorganization handled completely in memory improves the host processor's memory access performance. However, in-memory data reorganization performed in parallel with host memory accesses raises issues, including interference, bandwidth allocation, and coherence. Previous work has mainly focused on performing data reorganization while blocking host accesses. This article details data reorganization performed in parallel with host memory accesses, providing mechanisms to address host/NDP interference, flexible bandwidth allocation, and in-memory coherence.

Keywords: Parallel processing, Memory, Architecture