| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
|  |  |  | Turbocharge ANNS on Real Processing-in-Memory by Enabling Fine-Grained Per-PIM-Core Scheduling | per-PU scheduling; persistent PIM kernel; per-PU dispatching with selective replication | 3 | 4 | 4 |
| 2025 | HPCA | UC Davis | NOVA: A Novel Vertex Management Architecture for Scalable Graph Processing | message-driven processors capable of executing algorithms; a direct-mapped cache with a write-back policy; support both asynchronous and bulk synchronous parallel execution models |  |  |  |

Challenge: Host pages need to enable interleaving to improve concurrent throughput, while PIM pages need to disable it to maintain better locality, creating a conflict.

| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2023 | DAC | Georgia Tech | vPIM: Efficient Virtual Address Translation for Scalable Processing-in-Memory Architectures | network-contention-aware hashing to minimize cross-stack page table walks; pre-translation using repurposed PIM cores to move page table walks off the critical path | 4 | 4 | 3 |
| 2024 | ISCA | SJTU | UM-PIM: DRAM-based PIM with Uniform & Shared Memory Space | Uniform shared CPU-PIM memory; dual-track memory management; zero-copy data re-layout | 3 | 3 | 4 |

Challenge: Existing compilers are not optimized for locality-aware PIM architectures and require specialized programming models to fully utilize PIM capabilities.

| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2015 | ISCA | Seoul National | PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture | PIM-Enabled Instructions for ISA extension; PIM directory for atomicity and coherence; single-cache-block restriction | 3 | 4 | 4 |
| 2020 | ISCA | UCSB | iPIM: Programmable In-Memory Image Processing Accelerator Using Near-Bank Architecture |  |  |  |  |
|  |  |  | Piccolo: Large-Scale Graph Processing with Fine-Grained In-Memory Scatter-Gather | In-DRAM fine-grained scatter-gather via data bus offsets; fine-grained cache architecture using fg-tags; Standard DDR command interpretation for FIM control; Combined graph tiling with fine-grained memory access | 3 | 3 | 4 |
| 2025 | arXiv | ETHZ | PIMDAL: Mitigating the Memory Bottleneck in Data Analytics using a Real Processing-in-Memory System | PIMDAL library for DB operators; quicksort/mergesort/hashing on UPMEM PIM; scatter/gather/async transfers for PIM communication | 4 | 4 | 2 |
| 2024 | arXiv | Seoul National | PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices | Virtual hypercube PIM model; PE-assisted data reordering; in-register and cross-domain data modulation | 3 | 4 | 3 |
| 2025 | ISCA | KAIST | PIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM |  |  |  |  |
|  |  |  | DIMM-Link: Enabling Efficient Inter-DIMM Communication for Near-Memory Processing | high-speed hardware link bridges between DIMMs; direct intra-group P2P communication & broadcast; hybrid routing mechanism for inter-group communication |  |  |  |
| 2025 | HPCA | SJTU | AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing |  |  |  |  |
|  |  |  | Application-Transparent Near-Memory Processing Architecture with Memory Channel Network | integrates a processor on a buffered DIMM; application-transparent near-memory processing; leverages memory channels for high-bandwidth/low-latency inter-processor communication |  |  |  |
|  |  |  | ComPASS: A Compatible PIM Protocol Architecture and Scheduling Solution for Processor-PIM Collaboration | PIM-ACT new memory command for multi-bank PIM operations; PIM request generator to offload host processor; static and adaptive throughput balancers for PIM and non-PIM request scheduling |  |  |  |

Challenge: The original UPMEM API library is not well suited to all workloads, especially those with cross-bank communication.

| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2023 | arXiv | ETHZ | A Framework for High-throughput Sequence Alignment using Real Processing-in-Memory Systems | Alignment-in-Memory framework; hybrid WRAM-MRAM sketch data management for PIM | 2 | 3 | 4 |
| 2025 | arXiv | ETHZ | PIMDAL: Mitigating the Memory Bottleneck in Data Analytics using a Real Processing-in-Memory System | PIMDAL library on UPMEM PIM system for data analytics; scatter/gather-aware transfers for inter-PIM communication; Apache Arrow for host memory management |  |  |  |

Challenge: No direct physical connectivity between the banks in the DIMM-based NDP architecture. Limited number of DDR channels causing poor scalability.
Solution: Introduce CXL-based interconnects to enable direct communication between memory banks; Use CXL memory pools and CXL switches to enable scalable NDP architecture.

| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2022 | MICRO | UCSB | BEACON: Scalable Near-Data-Processing Accelerators for Genome Analysis near Memory Pool with the CXL Support | scalable hardware accelerator inside CXL switch or bank; lossless memory expansion for CXL memory pools |  |  |  |

Challenge: There are no direct physical interconnection paths in DIMM-based, bank-level uniform NDP such as UPMEM.
Solution: Put the logical, computational layer at the bottom of the die, and stack DRAM layers on top of it. Use TSVs to build thousands of physical paths between the logical and the DRAM layers.
Solution: Replace GPU's traditional DRAM-only HBM dies with PIM-enabled HBM dies to achieve higher memory bandwidth.

| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2021 | ISCA | Samsung | Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product | drop-in replacement for standard HBM2; bank-level parallelism using standard DRAM commands; address aligned mode to tolerate host-side command reordering | 3 | 5 | 3 |
| 2022 | Hot Chips | Samsung | Aquabolt-XL HBM2-PIM, LPDDR5-PIM With In-Memory Processing, and AXDIMM With Acceleration Buffer | HBM2-PIM with bank-level SIMD programmable computing units; Acceleration DIMM with acceleration buffers for rank-level parallelism |  |  |  |

Challenge: Different PIM architectures have different characteristics and performance trade-offs; communicating between different PIM architectures is challenging.

| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2025 | arXiv | NUS | LEAP: LLM Inference on Scalable PIM-NoC Architecture with Balanced Dataflow and Fine-Grained Parallelism | data dynamicity-aware task assignment to PIM or NoC; fine-grained model partitioning and heuristically optimized spatial mapping strategy | 3 | 4 | 3 |
| 2025 | arXiv | THU | CompAir: Synergizing Complementary PIMs and In-Transit NoC Computation for Efficient LLM Acceleration | heterogeneous DRAM-PIM and SRAM-PIM architecture with hybrid bonding; in-transit NoC computation with Curry ALU; hierarchical ISA for hybrid PIM systems |  |  |  |
|  |  |  | NeuroSim: A Circuit-Level Macro Model for Benchmarking Neuro-Inspired Architectures in Online Learning | estimates the circuit-level performance of neuro-inspired architectures; estimates the area, latency, dynamic energy, and leakage power; supports both SRAM and eNVM; tested on 2-layer MLP NN, MNIST |  |  |  |
| 2019 | IEDM | Georgia Tech | DNN+NeuroSim: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators with Versatile Device Technologies | a Python wrapper to interface with NeuroSim; for inference only |  |  |  |
| 2020 | TCAD | ZJU | Eva-CiM: A System-Level Performance and Energy Evaluation Framework for Computing-in-Memory Architectures | models for capturing memory access and dependency-aware ISA traces; models for quantifying interactions between the host CPU and the CiM module |  |  |  |
| 2022 | ICCAD | Purdue | Design Space and Memory Technology Co-Exploration for In-Memory Computing Based Machine Learning Accelerators | simulation framework to evaluate the system-level performance of IMC architecture; area-aware weight mapping strategy | 4 | 3 | 2 |
| 2024 | ISPASS | MIT | CiMLoop: A Flexible, Accurate, and Fast Compute-In-Memory Modeling Tool | flexible specification to describe CiM systems; accurate model/fast statistical model of data-value-dependent component energy |  |  |  |
| 2025 | ASPDAC | HKUST | MICSim: A Modular Simulator for Mixed-signal Compute-in-Memory based AI Accelerator | modularized NeuroSim; data-statistics-based average mode instead of trace-based mode |  |  |  |

Solution: Rather than placing logic units into DRAM, modify the physical structure of DRAM/eDRAM to enable in-memory computing.

| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2021 | ICCD | ASU | CIDAN: Computing in DRAM with Artificial Neurons | Threshold Logic Processing Element (TLPE) for in-memory computation; Four-bank activation window; Configurable threshold functions; Energy-efficient bitwise operations; Integration with DRAM architecture |  |  |  |
| 2022 | HPCA | UCSD | TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer | token-based dataflow for general Transformer-based models; ring-based data broadcast in modified HBM | 4 | 2 | 4 |
| 2024 | A-SSCC | UNIST | A 273.48 TOPS/W and 1.58 Mb/mm2 Analog-Digital Hybrid CIM Processor with Transpose Ternary-eDRAM Bitcell | analog DRAM CIM for partial sum and digital adder | 1 | 4 | 2 |
| 2025 | arXiv | KAIST | RED: Energy Optimization Framework for eDRAM-based PIM with Reconfigurable Voltage Swing and Retention-aware Scheduling | RED framework for energy optimization; reconfigurable eDRAM design; retention-aware scheduling; trade-off analysis between RBL voltage swing, sense amplifier power, and retention time; refresh skipping and sense amplifier power gating |  |  |  |
| 2025 | arXiv | UTokyo | MVDRAM: Enabling GeMV Execution in Unmodified DRAM for Low-Bit LLM Acceleration | GeMV operations for end-to-end low-bit LLM inference using unmodified DRAM; processor-DRAM co-design; on-the-fly vector encoding; horizontal matrix layout | 4 | 4 | 3 |
| 2025 | arXiv | Purdue | HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference |  |  |  |  |

Challenge: The memory wall causes high data-transfer latency between CPU and memory; DIMM-based NDP incurs high energy consumption, area overhead, and low performance efficiency.
Solution: Rather than placing logic units into SRAM, modify the physical structure of SRAM to enable in-memory computing.

| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
|  |  |  | CIMR-V: An End-to-End SRAM-based CIM Accelerator with RISC-V for AI Edge Device | incorporates CIM layer fusion, convolution/max pooling pipeline, and weight fusion; weight fusion: pipelining the CIM convolution and weight loading |  |  |  |
| 2018 | JSSC | MIT | CONV-SRAM: An Energy-Efficient SRAM With In-Memory Dot-Product Computation for Low-Power Convolutional Neural Networks | SRAM-embedded convolution (dot-product) computation architecture for BNN; support multi-bit input-output |  |  |  |
| 2024 | ESSCIRC | THU | A 65nm 8b-Activation 8b-Weight SRAM-Based Charge-Domain Computing-in-Memory Macro Using A Fully-Parallel Analog Adder Network and A Single-ADC Interface | SRAM-based CD-CiM architecture; charge-domain analog adder tree; ReLU-optimized ADC | 4 | 4 | 4 |
| 2021 | ISSCC | TSMC | An 89TOPS/W and 16.3TOPS/mm2 All-Digital SRAM-Based Full-Precision Compute-In Memory Macro in 22nm for Machine-Learning Edge Applications | programmable bit-widths for both input and weights; SRAM and CIM mode | 2 | 5 | 1 |
| 2021 | JSSC | KAIST | Z-PIM: A Sparsity-Aware Processing-in-Memory Architecture With Fully Variable Weight Bit-Precision for Energy-Efficient Deep Neural Networks | bit-serial operation to support variable weight bit-precision; data mapping and computation flow for sparsity handling |  |  |  |
|  |  |  | MemTorch: A Simulation Framework for Deep Memristive Cross-Bar Architectures | supports both GPUs and CPUs; integrates directly with PyTorch; simulates non-idealities of memristive devices within cross-bar, tested on VGG-16, CIFAR-10 |  |  |  |
| 2021 | TCAD | Georgia Tech | DNN+NeuroSim V2.0: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators for On-Chip Training | effect of non-ideal NVM device properties on on-chip training | 3 | 3 | 2 |
| 2025 | DAC | BUAA | CIMFlow: An Integrated Framework for Systematic Design and Evaluation of Digital CIM Architectures | workflow for implementing and evaluating DNN workloads on digital CIM architectures; CIM-specific ISA design; compilation flow built on the MLIR infrastructure |  |  |  |

Challenge: Transformer architecture is widely used in NLP and CV tasks. Existing SRAM CIM architectures are not suitable for transformer acceleration.

| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2025 | DATE | PKU | Leveraging Compute-in-Memory for Efficient Generative Model Inference in TPUs | architecture model and simulator for CIM-based TPUs; designed for LLM inference | 4 | 2 | 4 |
| 2023 | arXiv | Keio | An 818-TOPS/W CSNR-31dB SQNR-45dB 10-bit Capacitor-Reconfiguring Computing-in-Memory Macro with Software-Analog Co-Design for Transformers | Capacitor-Reconfiguring analog CIM architecture | 1 | 4 | 3 |
| 2025 | arXiv | Purdue | Hardware-Software Co-Design for Accelerating Transformer Inference Leveraging Compute-in-Memory | SRAM based softmax-friendly CIM architecture for transformer; finer-granularity pipelining strategy | 4 | 3 | 2 |
| 2025 | arXiv | PKU | Leveraging Compute-in-Memory for Efficient Generative Model Inference in TPUs | Energy-efficient CIM core integration in TPUs (replace the original MXU); CIM-MXU with systolic data path; Array dimension scaling for CIM-MXU; Area-efficient CIM macro design; Mapping engine for generative model inference |  |  |  |
| 2024 | JSSC | THU | MulTCIM: Digital Computing-in-Memory-Based Multimodal Transformer Accelerator With Attention-Token-Bit Hybrid Sparsity | long reuse elimination scheduler (LRES) to dynamically reshape the attention matrix; runtime token pruner (RTP) to remove insignificant tokens; modal-adaptive CIM network (MACN) to dynamically divide CIM cores into Pipeline; effective-bits-balanced CIM (EBBCIM) macro architecture |  |  |  |

Challenge: RRAM devices are non-volatile and have high density, making them suitable for CIM applications. However, RRAM devices have non-ideal effects that can cause significant performance degradation.

| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
|  |  |  | PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference | Programmable and general-purpose ReRAM based ML Accelerator; Supports an instruction set; Has potential for DNN training; Provides simulator that accepts model |  |  |  |
| 2018 | ICRC | Purdue & HP | Hardware-Software Co-Design for an Analog-Digital Accelerator for Machine Learning | compiler to translate model to ISA; ONNX interpreter to support models in common DL frameworks; simulator to evaluate performance |  |  |  |
| 2023 | NANOARCH | HUST | Heterogeneous Instruction Set Architecture for RRAM-enabled In-memory Computing | General ISA for RRAM CiM & digital heterogeneous architecture; a tile-processing unit-array three-level architecture |  |  |  |
| 2024 | VLSI-SoC | RWTH Aachen University | Architecture-Compiler Co-design for ReRAM-Based Multi-core CIM Architectures | inference latency predictions and analysis of the crossbar utilization for CNN |  |  |  |
| 2024 | arXiv | CAS | A Fully Hardware Implemented Accelerator Design in ReRAM Analog Computing without ADCs | Based on Stochastic Binary Neural Networks; Winner-Take-All (WTA) strategy; Hardware implemented sigmoid and softmax |  |  |  |
|  |  |  | DRCTL: A Disorder-Resistant Computation Translation Layer Enhancing the Lifetime and Performance of Memristive CIM Architecture | address conversion method for dynamic scheduling; hierarchical wear-leveling (HWL) strategy for reliability improvement; data layout-aware selective remapping (LASR) to improve communication locality and reduce latency |  |  |  |
| 2024 | DATE | RWTH Aachen University | CLSA-CIM: A Cross-Layer Scheduling Approach for Computing-in-Memory Architectures | algorithm to decide which parts of NN are duplicated to reduce inference latency; cross layer scheduling on tiled CIM architectures |  |  |  |
| 2024 | TC | SJTU | ERA-BS: Boosting the Efficiency of ReRAM-Based PIM Accelerator With Fine-Grained Bit-Level Sparsity | bit-level sparsity in both weights and activations; bit-flip scheme; dynamic activation sparsity exploitation scheme |  |  |  |
| 2023 | TETCI | TU Delft | Accurate and Energy-Efficient Bit-Slicing for RRAM-Based Neural Networks | unbalanced bit-slicing scheme for higher accuracy; holistic solution using 2's complement |  |  |  |
| 2024 | Science | USC | Programming memristor arrays with arbitrarily high precision for analog computing | represent high-precision numbers using multiple relatively low-precision analog devices; using RRAM CIM to solve PDEs |  |  |  |
|  |  |  | A Calibratable Model for Fast Energy Estimation of MVM Operations on RRAM Crossbars | system energy model for MVM on ReRAM crossbars; methodology to study the effect of the selection transistor and wire parasitics in 1T1R crossbar arrays |  |  |  |
| 2024 | arXiv | MIT | Modeling Analog-Digital-Converter Energy and Area for Compute-In-Memory Accelerator Design | architecture-level model that estimates ADC energy and area |  |  |  |
|  |  |  | ITT-RNA: Imperfection Tolerable Training for RRAM-Crossbar-Based Deep Neural-Network Accelerator | prevent the large-weight synapses from being mapped to the imperfect memristor cells; off-device training algorithm to alleviate the accumulation of errors across multiple layers; bit-wise mechanism to compensate the resistance variations | 3 | 3 | 2 |
| 2023 | arXiv | UND | U-SWIM: Universal Selective Write-Verify for Computing-in-Memory Neural Accelerators | only do write-verify for important weights; based on weight second derivatives as a guide | 3 | 3 | 3 |
| 2023 | Adv. Mater. | UMich | Bulk‐Switching Memristor‐Based Compute‐In‐Memory Module for Deep Neural Network Training | Bulk-ReRAM based digital-CIM hybrid architecture for training; CIM for forward, digital for backward | 4 | 4 | 1 |
| 2024 | APIN | SWU | Multi-optimization scheme for in-situ training of memristor neural network based on contrastive learning | optimizations to the deployment method, loss function and gradient calculation; compensation measures for non-ideal effects |  |  |  |
| 2025 | TNNLS | SNU | Efficient Hybrid Training Method for Neuromorphic Hardware Using Analog Nonvolatile Memory |  |  |  |  |

Challenge: Compilers for RRAM CIM are not well studied. Existing compilers either target a specific architecture or are inefficient.

| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2023 | TACO | HUST | A Compilation Tool for Computation Offloading in ReRAM-based CIM Architectures | compilation tool to migrate legacy programs to CPU/CIM heterogeneous architectures; a model to quantify the performance gain |  |  |  |
| 2023 | DAC | CAS | PIMCOMP: A Universal Compilation Framework for Crossbar-based PIM DNN Accelerators | compiler based on Crossbar/IMA/Tile/Chip hierarchy; low latency and high throughput mode; genetic algorithm to optimize weight replication and core mapping; scheduling algorithms for complex DNN |  |  |  |
| 2024 | ASPLOS | CAS | CIM-MLC: A Multi-level Compilation Stack for Computing-In-Memory Accelerators | compilation stack for various CIM accelerators; multi-level DNN scheduling approach |  |  |  |

Challenge: The convolutional layer is the most compute-intensive layer in CNNs. The RRAM CIM architecture is well suited to convolutional layer operations but faces challenges related to non-ideal effects and performance degradation.

| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
|  |  |  |  | fabrication of high-yield, high-performance and uniform memristor crossbar arrays; hybrid-training method; replication of multiple identical kernels for processing different inputs in parallel |  |  |  |
| 2019 | TED | PKU | Convolutional Neural Networks Based on RRAM Devices for Image Recognition and Online Learning Tasks | RRAM-based hardware implementation of CNN; expand kernel to the size of image |  |  |  |
| 2025 | TVLSI | NBU | A 578-TOPS/W RRAM-Based Binary Convolutional Neural Network Macro for Tiny AI Edge Devices | ReRAM XNOR cell; BCNN CIM macro with FPGA as the control core |  |  |  |
|  |  |  | Mapping of CNNs on multi-core RRAM-based CIM architectures | architecture optimized for communication; compiler algorithms for conv2D layer; cycle-accurate simulator |  |  |  |
| 2023 | TODAES | UCAS | Mathematical Framework for Optimizing Crossbar Allocation for ReRAM-based CNN Accelerators | formulate a crossbar allocation problem for ReRAM-based CNN accelerators; dynamic programming based solver; models the performance considering allocation problem |  |  |  |
| 2025 | IEEE Access | UTehran | SCiMA: A Systolic CiM-Based Accelerator With a New Weight Mapping for CNNs—A Virtual Framework Approach |  |  |  |  |
|  |  |  | Heterogeneous Integration of In-Memory Analog Computing Architectures with Tensor Processing Units | hybrid TPU-IMAC architecture; TPU for conv, CIM for fc |  |  |  |
| 2025 | ASPLOS | CAS | PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System | dynamic parallelism-aware task scheduling for LLM decoding; online kernel characterization for heterogeneous architectures; hybrid PIM units for compute-bound and memory-bound kernels |  |  |  |
|  |  |  | A mixed-precision memristor and SRAM compute-in-memory AI processor | layer based INT-FP hybrid architecture; kernel-based mix-CIM (SRAM/ReRAM/digital hybrid architecture) | 5 | 5 | 2 |
| 2025 | DAC | Chung-Ang Univ. | HH-PIM: Dynamic Optimization of Power and Performance with Heterogeneous-Hybrid PIM for Edge AI Devices | heterogeneous-hybrid PIM with HP/LP modules and MRAM/SRAM; dynamic data placement algorithm for energy optimization; dual PIM controller design | 3 | 4 | 2 |
| 2025 | arXiv | AaltoU | Acore-CIM: build accurate and reliable mixed-signal CIM cores with RISC-V controlled self-calibration | reliability-focused MAC cell; proof-of-concept SoC composed of a CIM core and a RISC-V control processor; automated Built-In Self-Calibration (BISC) routine |  |  |  |

Challenge: Limited by the precision, area, and power trade-off of the ADC, certain CIM devices such as RRAM are not suitable for high-precision computation (e.g. FP32). Quantization is needed to reduce the precision of the data.

| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
|  |  |  | Partial-Sum Quantization for Near ADC-Less Compute-In-Memory Accelerators | ADC-Less and near ADC-Less CiM accelerators; CiM hardware aware DNN quantization methodology |  |  |  |
| 2023 | AICAS | TU Delft | Mapping-aware Biased Training for Accurate Memristor-based Neural Networks | favorability constraint analysis to find important weight values; mapping-aware biased training to restrict weight values to low variance RRAM states | 3 | 4 | 2 |
| 2024 | TCAD | BUAA | CIMQ: A Hardware-Efficient Quantization Framework for Computing-In-Memory-Based Neural Network Accelerators | bit-level sparsity induced activation quantization; quantizing partial sums to decrease required resolution of ADCs; arraywise quantization granularity |  |  |  |
| 2024 | TCAD | BUAA | CIM²PQ: An Arraywise and Hardware-Friendly Mixed Precision Quantization Method for Analog Computing-In-Memory | mixed precision quantization method based on evolutionary algorithm; arraywise quantization granularity; evaluation method to obtain the performance of strategy on the CIM |  |  |  |
| 2024 | ICCAD | TU Delft | Hardware-Aware Quantization for Accurate Memristor-Based Neural Networks | analysis of fixed-point quantization impact on conductance variation; weight quantization tuning technique; approach to reduce the residual error |  |  |  |
|  |  |  | Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference | integer-only inference arithmetic; quantizes both weights and activations as 8-bit integers, bias 32-bit; provides both quantized inference framework and training framework |  |  |  |
| 2023 | ICCD | SJTU | PSQ: An Automatic Search Framework for Data-Free Quantization on PIM-based Architecture | post-training quantization framework without retraining; hardware-aware block reassembly |  |  |  |
| 2025 | arXiv | UHK | Binary Weight Multi-Bit Activation Quantization for Compute-in-Memory CNN Accelerators | a quantization framework that considers CIM's mixed-signal constraints; closed-form layer-specific weight binarization method; differentiable function for uniform multi-bit quantization |  |  |  |

Challenge: Speculative prefetch requests can cause undesirable effects on the system (e.g., increased memory bandwidth consumption, cache pollution, memory access interference).

| Year | Venue | Authors | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2021 | MICRO | ETHZ | Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning | formulating prefetching as a reinforcement learning problem; holistic learning from multiple program features and system feedback; customizable prefetching objective via configuration registers | 3 | 3 | 2 |
| 2025 | MICRO | NUDT | Elevating Temporal Prefetching Through Instruction Correlation | critical instruction detection based on miss contribution; coverage-based classification for metadata utility; adaptive metadata cache partitioning via controller |  |  |  |