Challenge: Existing compilers are not optimized for locality-aware PIM architectures, so specialized programming models are still required to fully utilize PIM capabilities.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2015 | ISCA | Seoul National | PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture | PIM-Enabled Instructions for ISA extension; PIM directory for atomicity and coherence; single-cache-block restriction | 3 | 4 | 4 |
| 2020 | ISCA | UCSB | iPIM: Programmable In-Memory Image Processing Accelerator Using Near-Bank Architecture | | | | |
| | | | Piccolo: Large-Scale Graph Processing with Fine-Grained In-Memory Scatter-Gather | In-DRAM fine-grained scatter-gather via data bus offsets; fine-grained cache architecture using fg-tags; Standard DDR command interpretation for FIM control; Combined graph tiling with fine-grained memory access | 3 | 3 | 4 |
| 2025 | arXiv | ETHZ | PIMDAL: Mitigating the Memory Bottleneck in Data Analytics using a Real Processing-in-Memory System | PIMDAL library for DB operators; quicksort/mergesort/hashing on UPMEM PIM; scatter/gather/async transfers for PIM communication | 4 | 4 | 2 |
| 2024 | arXiv | Seoul National | PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices | Virtual hypercube PIM model; PE-assisted data reordering; in-register and cross-domain data modulation (see the sketch after this table) | 3 | 4 | 3 |
| 2025 | ISCA | KAIST | PIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM | | | | |
| | | | DIMM-Link: Enabling Efficient Inter-DIMM Communication for Near-Memory Processing | high-speed hardware link bridges between DIMMs; direct intra-group P2P communication & broadcast; hybrid routing mechanism for inter-group communication | | | |
| 2025 | HPCA | SJTU | AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing | | | | |
| | | | Application-Transparent Near-Memory Processing Architecture with Memory Channel Network | integrates a processor on a buffered DIMM; application-transparent near-memory processing; leverages memory channels for high-bandwidth/low-latency inter-processor communication | | | |
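PID-Comm and PIMnet above both center on collective communication among PIM units, and PID-Comm models the DIMM-attached PEs as a virtual hypercube. As a reference point (not these papers' actual implementations), here is a minimal simulation of the classic recursive-doubling all-reduce over a hypercube of 2^d nodes; the node count and payloads are illustrative assumptions.

```python
# Minimal sketch (not from the papers above): recursive-doubling all-reduce
# over a virtual hypercube of 2**d nodes, the communication pattern that
# hypercube-based PIM collectives build on.
import numpy as np

def allreduce_hypercube(node_data):
    """node_data: list of equal-length vectors, one per node; node count must be 2**d."""
    n = len(node_data)
    d = n.bit_length() - 1
    assert n == 1 << d, "node count must be a power of two"
    data = [v.copy() for v in node_data]
    for step in range(d):                 # d exchange rounds
        partner_bit = 1 << step
        for node in range(n):
            partner = node ^ partner_bit  # neighbor along this hypercube dimension
            if node < partner:            # exchange and combine once per pair
                s = data[node] + data[partner]
                data[node], data[partner] = s, s.copy()
    return data                            # every node now holds the global sum

if __name__ == "__main__":
    vecs = [np.full(4, rank, dtype=np.int64) for rank in range(8)]
    out = allreduce_hypercube(vecs)
    print(out[0])   # [28 28 28 28] == sum(range(8)) in every slot
```

Each of the d rounds exchanges data along one hypercube dimension, which is why the real frameworks care so much about how DIMM-to-DIMM links map onto those dimensions.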
Challenge: There is no direct physical connectivity between banks in DIMM-based NDP architectures, and the limited number of DDR channels leads to poor scalability.
Solution: Introduce CXL-based interconnects to enable direct communication between memory banks; use CXL memory pools and CXL switches to build a scalable NDP architecture.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2022 | MICRO | UCSB | BEACON: Scalable Near-Data-Processing Accelerators for Genome Analysis near Memory Pool with the CXL Support | scalable hardware accelerator inside CXL switch or bank; lossless memory expansion for CXL memory pools | | | |
| | | | NeuroSim: A Circuit-Level Macro Model for Benchmarking Neuro-Inspired Architectures in Online Learning | estimates the circuit-level performance of neuro-inspired architectures; estimates the area, latency, dynamic energy, and leakage power; supports both SRAM and eNVM; tested on 2-layer MLP NN, MNIST (see the sketch after this table) | | | |
| 2019 | IEDM | Georgia Tech | DNN+NeuroSim: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators with Versatile Device Technologies | a Python wrapper to interface NeuroSim; for inference only | | | |
| 2020 | TCAD | ZJU | Eva-CiM: A System-Level Performance and Energy Evaluation Framework for Computing-in-Memory Architectures | models for capturing memory access and dependency-aware ISA traces; models for quantifying interactions between the host CPU and the CiM module | | | |
| 2024 | ISPASS | MIT | CiMLoop: A Flexible, Accurate, and Fast Compute-In-Memory Modeling Tool | flexible specification to describe CiM systems; accurate model/fast statistical model of data-value-dependent component energy | | | |
| 2025 | ASPDAC | HKUST | MICSim: A Modular Simulator for Mixed-signal Compute-in-Memory based AI Accelerator | modularized NeuroSim; data statistic-based average mode instead of trace-based mode | | | |
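NeuroSim, CiMLoop, and the related tools above all decompose macro energy into per-component contributions (DAC, array, ADC, digital post-processing) multiplied by activity counts. The toy sketch below illustrates only that accounting style; the component list and all per-operation energies are placeholder assumptions, not calibrated values from any of these tools.

```python
# Toy component-level energy estimate for one analog MVM, in the spirit of
# NeuroSim/CiMLoop-style accounting. All per-op energies are placeholder
# assumptions, not numbers from any tool or paper.
ENERGY_PJ = {              # assumed energy per activation of each component (pJ)
    "dac": 0.05,           # one input DAC conversion
    "array_column": 0.20,  # analog dot product on one bitline/column
    "adc": 1.50,           # one ADC conversion
    "shift_add": 0.10,     # digital shift-and-add of one column result
}

def mvm_energy_pj(rows, cols, input_bits=8):
    """Energy of one rows x cols analog MVM with bit-serial inputs."""
    counts = {
        "dac": rows * input_bits,           # each input row streamed bit-serially
        "array_column": cols * input_bits,  # every column evaluated per input bit
        "adc": cols * input_bits,           # one conversion per column per bit
        "shift_add": cols * input_bits,
    }
    return sum(counts[c] * ENERGY_PJ[c] for c in counts)

if __name__ == "__main__":
    print(f"128x128 MVM, 8-bit inputs: {mvm_energy_pj(128, 128):.1f} pJ")
```

Even this toy version makes the recurring conclusion of those tools visible: with bit-serial inputs, the ADC term dominates unless its per-conversion energy or activation count is reduced.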
Solution: Rather than placing logic units into DRAM, modify the physical structure of DRAM/eDRAM to enable in-memory computing.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2021 | ICCD | ASU | CIDAN: Computing in DRAM with Artificial Neurons | Threshold Logic Processing Element (TLPE) for in-memory computation; Four-bank activation window; Configurable threshold functions; Energy-efficient bitwise operations; Integration with DRAM architecture | | | |
| 2022 | HPCA | UCSD | TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer | token-based dataflow for general Transformer-based models; ring-based data broadcast in modified HBM | 4 | 2 | 4 |
| 2024 | A-SSCC | UNIST | A 273.48 TOPS/W and 1.58 Mb/mm2 Analog-Digital Hybrid CIM Processor with Transpose Ternary-eDRAM Bitcell | analog DRAM CIM for partial sum and digital adder | 1 | 4 | 2 |
| 2025 | arXiv | KAIST | RED: Energy Optimization Framework for eDRAM-based PIM with Reconfigurable Voltage Swing and Retention-aware Scheduling | RED framework for energy optimization; reconfigurable eDRAM design; retention-aware scheduling; trade-off analysis between RBL voltage swing, sense amplifier power, and retention time; refresh skipping and sense amplifier power gating | | | |
| 2025 | arXiv | UTokyo | MVDRAM: Enabling GeMV Execution in Unmodified DRAM for Low-Bit LLM Acceleration | GeMV operations for end-to-end low-bit LLM inference using unmodified DRAM; processor-DRAM co-design; on-the-fly vector encoding; horizontal matrix layout (see the sketch after this table) | | | |
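MVDRAM above executes GeMV for low-bit weights inside unmodified DRAM. Independent of its DRAM command sequences, the arithmetic that makes low-bit GeMV amenable to bulk bitwise hardware is the decomposition of the weight matrix into binary bit planes; the numpy sketch below illustrates only that decomposition, with matrix sizes and bit widths chosen arbitrarily, and does not model MVDRAM's layout or encoding.

```python
# Sketch: a GeMV with low-bit unsigned weights decomposed into binary bit
# planes, the kind of bulk bitwise formulation in-DRAM GeMV schemes exploit.
# This models only the arithmetic, not MVDRAM's DRAM layout or commands.
import numpy as np

def gemv_bitplanes(W, x, weight_bits=4):
    """W: (rows, cols) unsigned matrix with values < 2**weight_bits; x: (cols,) vector."""
    acc = np.zeros(W.shape[0], dtype=np.int64)
    for b in range(weight_bits):
        plane = (W >> b) & 1             # binary bit plane of the weights
        acc += (plane @ x) << b          # partial GeMV, weighted by 2**b
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.integers(0, 16, size=(8, 32), dtype=np.uint8)   # 4-bit weights
    x = rng.integers(0, 100, size=32, dtype=np.int64)
    assert np.array_equal(gemv_bitplanes(W, x), W.astype(np.int64) @ x)
    print("bit-plane GeMV matches dense GeMV")
```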
Challenge: The memory wall causes high latency for data transfers between the CPU and memory; DIMM-based NDP suffers from high energy consumption, area overhead, and low performance efficiency.
Solution: Generally, modify the physical structure of SRAM to enable in-memory computing rather than placing logic units into SRAM.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| | | | | sparsity algorithm designed for SRAM CiM; quantization algorithm with BN fusion | | | |
| 2024 | ESSCIRC | THU | A 65nm 8b-Activation 8b-Weight SRAM-Based Charge-Domain Computing-in-Memory Macro Using A Fully-Parallel Analog Adder Network and A Single-ADC Interface | SRAM-based CD-CiM architecture; charge-domain analog adder tree; ReLU-optimized ADC | 4 | 4 | 4 |
| 2021 | ISSCC | TSMC | An 89TOPS/W and 16.3TOPS/mm2 All-Digital SRAM-Based Full-Precision Compute-In Memory Macro in 22nm for Machine-Learning Edge Applications | programmable bit-widths for both input and weights; SRAM and CIM mode | | | |
| | | | MemTorch: A Simulation Framework for Deep Memristive Cross-Bar Architectures | supports both GPUs and CPUs; integrates directly with PyTorch; simulates non-idealities of memristive devices within the crossbar, tested on VGG-16, CIFAR-10 (see the sketch after this table) | | | |
| 2021 | TCAD | Georgia Tech | DNN+NeuroSim V2.0: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators for On-Chip Training | effect of non-ideal device properties of NVMs on on-chip training | | | |
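MemTorch in the table above patches non-idealities of memristive devices into crossbar-mapped layers. As a minimal, framework-agnostic illustration of one such non-ideality, the sketch below perturbs a weight matrix with device-to-device conductance variation; the lognormal model, the sigma value, and the direct multiplication onto signed weights are simplifying assumptions for illustration, not MemTorch's API or device models.

```python
# Sketch: injecting device-to-device conductance variation into crossbar
# weights before evaluating a layer. The lognormal model and sigma are
# illustrative assumptions; sign handling via differential conductance pairs
# is ignored here for brevity.
import numpy as np

def apply_conductance_variation(W, sigma=0.1, rng=None):
    """Multiply each mapped weight by a lognormal device variation factor."""
    rng = rng or np.random.default_rng()
    variation = rng.lognormal(mean=0.0, sigma=sigma, size=W.shape)
    return W * variation

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.standard_normal((64, 64)).astype(np.float32)
    x = rng.standard_normal(64).astype(np.float32)
    y_ideal = W @ x
    y_noisy = apply_conductance_variation(W, sigma=0.05, rng=rng) @ x
    rel_err = np.linalg.norm(y_noisy - y_ideal) / np.linalg.norm(y_ideal)
    print(f"relative output error from 5% device variation: {rel_err:.3f}")
```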
Challenge: The Transformer architecture is widely used in NLP and CV tasks, but existing SRAM CIM architectures are not well suited to Transformer acceleration.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2025 | DATE | PKU | Leveraging Compute-in-Memory for Efficient Generative Model Inference in TPUs | architecture model and simulator for CIM-based TPUs; designed for LLM inference | 4 | 2 | 4 |
| 2023 | arXiv | Keio | An 818-TOPS/W CSNR-31dB SQNR-45dB 10-bit Capacitor-Reconfiguring Computing-in-Memory Macro with Software-Analog Co-Design for Transformers | Capacitor-Reconfiguring analog CIM architecture | 1 | 4 | 3 |
| 2025 | arXiv | Purdue | Hardware-Software Co-Design for Accelerating Transformer Inference Leveraging Compute-in-Memory | SRAM based softmax-friendly CIM architecture for transformer; finer-granularity pipelining strategy | 4 | 3 | 2 |
| 2025 | arXiv | PKU | Leveraging Compute-in-Memory for Efficient Generative Model Inference in TPUs | Energy-efficient CIM core integration in TPUs (replaces the original MXU); CIM-MXU with systolic data path; Array dimension scaling for CIM-MXU; Area-efficient CIM macro design; Mapping engine for generative model inference | | | |
| 2024 | JSSC | THU | MulTCIM: Digital Computing-in-Memory-Based Multimodal Transformer Accelerator With Attention-Token-Bit Hybrid Sparsity | long reuse elimination scheduler (LRES) to dynamically reshape the attention matrix; runtime token pruner (RTP) to remove insignificant tokens; modal-adaptive CIM network (MACN) to dynamically divide CIM cores into pipelines; effective-bits-balanced CIM (EBBCIM) macro architecture (see the sketch after this table) | | | |
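MulTCIM's runtime token pruner removes insignificant tokens before they occupy CIM resources. The sketch below shows one generic way such a pruner can score and drop tokens (top-k by received attention mass); the scoring rule and keep ratio are illustrative assumptions, not MulTCIM's exact hardware criterion.

```python
# Sketch: runtime token pruning for attention. Tokens with the smallest
# accumulated attention mass are dropped before the next layer. The scoring
# rule (column sums of the attention matrix) and the keep ratio are
# illustrative assumptions, not MulTCIM's exact pruning criterion.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prune_tokens(Q, K, V, keep_ratio=0.5):
    """Return the kept token indices and their value vectors."""
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (n_tokens, n_tokens)
    importance = attn.sum(axis=0)                      # attention mass received per token
    n_keep = max(1, int(keep_ratio * len(importance)))
    kept = np.sort(np.argsort(importance)[-n_keep:])   # keep top-k, preserve order
    return kept, V[kept]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 16, 8
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    kept, V_kept = prune_tokens(Q, K, V, keep_ratio=0.25)
    print("kept token indices:", kept, "| kept value shape:", V_kept.shape)
```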
Challenge: RRAM devices are non-volatile and high-density, which makes them suitable for CIM applications. However, RRAM devices exhibit non-ideal effects that can cause significant performance degradation.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| | | | PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference | Programmable and general-purpose ReRAM based ML Accelerator; Supports an instruction set; Has potential for DNN training; Provides simulator that accepts model | | | |
| 2018 | ICRC | Purdue & HP | Hardware-Software Co-Design for an Analog-Digital Accelerator for Machine Learning | compiler to translate model to ISA; ONNX interpreter to support models in common DL frameworks; simulator to evaluate performance | | | |
| 2023 | NANOARCH | HUST | Heterogeneous Instruction Set Architecture for RRAM-enabled In-memory Computing | General ISA for RRAM CiM & digital heterogeneous architecture; a tile-processing unit-array three-level architecture | | | |
| 2024 | VLSI-SoC | RWTH Aachen University | Architecture-Compiler Co-design for ReRAM-Based Multi-core CIM Architectures | inference latency predictions and analysis of the crossbar utilization for CNN | | | |
| 2024 | arXiv | CAS | A Fully Hardware Implemented Accelerator Design in ReRAM Analog Computing without ADCs | Based on Stochastic Binary Neural Networks; Winner-Take-All (WTA) strategy; Hardware implemented sigmoid and softmax | | | |
| | | | DRCTL: A Disorder-Resistant Computation Translation Layer Enhancing the Lifetime and Performance of Memristive CIM Architecture | address conversion method for dynamic scheduling; hierarchical wear-leveling (HWL) strategy for reliability improvement; data layout-aware selective remapping (LASR) to improve communication locality and reduce latency | | | |
| 2024 | DATE | RWTH Aachen University | CLSA-CIM: A Cross-Layer Scheduling Approach for Computing-in-Memory Architectures | algorithm to decide which parts of NN are duplicated to reduce inference latency; cross-layer scheduling on tiled CIM architectures | | | |
| 2024 | TC | SJTU | ERA-BS: Boosting the Efficiency of ReRAM-Based PIM Accelerator With Fine-Grained Bit-Level Sparsity | bit-level sparsity in both weights and activations; bit-flip scheme; dynamic activation sparsity exploitation scheme | | | |
| 2023 | TETCI | TU Delft | Accurate and Energy-Efficient Bit-Slicing for RRAM-Based Neural Networks | unbalanced bit-slicing scheme for higher accuracy; holistic solution using 2's complement (see the sketch after this table) | | | |
| 2024 | Science | USC | Programming memristor arrays with arbitrarily high precision for analog computing | represent high-precision numbers using multiple relatively low-precision analog devices; using RRAM CIM to solve PDEs | | | |
| | | | A Calibratable Model for Fast Energy Estimation of MVM Operations on RRAM Crossbars | system energy model for MVM on ReRAM crossbars; methodology to study the effect of the selection transistor and wire parasitics in 1T1R crossbar arrays | | | |
| 2024 | arXiv | MIT | Modeling Analog-Digital-Converter Energy and Area for Compute-In-Memory Accelerator Design | architecture-level model that estimates ADC energy and area | | | |
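The TU Delft bit-slicing paper and the USC Science paper above rest on the same underlying idea: spread a higher-precision operand across several low-precision analog devices and recombine the slice results digitally. The sketch below shows the plain, balanced version of that arithmetic for unsigned weights; the slice width, device count, and unsigned assumption are simplifications, and the cited papers use more elaborate (unbalanced or 2's-complement-aware) schemes on top of this baseline.

```python
# Sketch: representing 8-bit weights with several low-precision "devices"
# (balanced 2-bit slices) and recombining their MVM results digitally.
# Unsigned weights and uniform slice widths are simplifying assumptions.
import numpy as np

def slice_weights(W, slice_bits=2, n_slices=4):
    """Split unsigned integer weights into n_slices slices of slice_bits each."""
    mask = (1 << slice_bits) - 1
    return [((W >> (slice_bits * s)) & mask) for s in range(n_slices)]

def mvm_bit_sliced(W, x, slice_bits=2, n_slices=4):
    slices = slice_weights(W, slice_bits, n_slices)        # each fits a low-precision device
    partial = [sl.astype(np.int64) @ x for sl in slices]   # one "analog" MVM per slice
    return sum(p << (slice_bits * s) for s, p in enumerate(partial))  # digital recombine

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.integers(0, 256, size=(16, 64), dtype=np.int64)   # 8-bit unsigned weights
    x = rng.integers(0, 128, size=64, dtype=np.int64)
    assert np.array_equal(mvm_bit_sliced(W, x), W @ x)
    print("bit-sliced MVM matches full-precision MVM")
```

The unbalanced scheme in the TETCI paper changes how many bits each slice carries so that slices feeding the most error-sensitive positions get more headroom; the recombination step stays the same shift-and-add.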
Challenge: Compilers for RRAM CIM are not well studied; existing compilers either target a specific architecture or are inefficient.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2023 | TACO | HUST | A Compilation Tool for Computation Offloading in ReRAM-based CIM Architectures | compilation tool to migrate legacy programs to CPU/CIM heterogeneous architectures; a model to quantify the performance gain | | | |
| 2023 | DAC | CAS | PIMCOMP: A Universal Compilation Framework for Crossbar-based PIM DNN Accelerators | compiler based on Crossbar/IMA/Tile/Chip hierarchy; low latency and high throughput mode; genetic algorithm to optimize weight replication and core mapping; scheduling algorithms for complex DNN (see the sketch after this table) | | | |
| 2024 | ASPLOS | CAS | CIM-MLC: A Multi-level Compilation Stack for Computing-In-Memory Accelerators | compilation stack for various CIM accelerators; multi-level DNN scheduling approach | | | |
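PIMCOMP above tunes weight replication and core mapping with a genetic algorithm. To make the underlying trade-off concrete, here is a much simpler greedy heuristic for the same decision: under a fixed crossbar budget, repeatedly replicate the current bottleneck layer, since pipeline throughput is set by the slowest stage. The latency numbers and budget below are invented inputs, and this greedy rule is only an illustration of the objective, not PIMCOMP's algorithm.

```python
# Sketch: greedy weight replication under a crossbar budget. Pipeline
# throughput is limited by the slowest layer, so each extra replica goes to
# the current bottleneck. Inputs are invented; PIMCOMP itself uses a genetic
# algorithm over replication and core mapping rather than this greedy rule.
import heapq

def replicate_greedy(layer_latency, layer_crossbars, budget):
    """Return replica counts per layer and the resulting bottleneck latency."""
    replicas = [1] * len(layer_latency)
    heap = [(-lat, i) for i, lat in enumerate(layer_latency)]  # max-heap on effective latency
    heapq.heapify(heap)
    spent = sum(layer_crossbars)
    while heap:
        _, i = heapq.heappop(heap)
        if spent + layer_crossbars[i] > budget:
            continue                      # cannot afford another replica of this layer
        replicas[i] += 1
        spent += layer_crossbars[i]
        heapq.heappush(heap, (-layer_latency[i] / replicas[i], i))
    bottleneck = max(lat / r for lat, r in zip(layer_latency, replicas))
    return replicas, bottleneck

if __name__ == "__main__":
    latency   = [4.0, 1.0, 8.0, 2.0]   # per-layer latency per input (arbitrary units)
    crossbars = [2, 1, 4, 1]           # crossbars needed by one copy of each layer
    reps, bottleneck = replicate_greedy(latency, crossbars, budget=20)
    print("replicas:", reps, "| pipeline bottleneck latency:", bottleneck)
```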
Challenge: Convolutional layers are the most compute-intensive layers in CNNs. RRAM CIM architectures are well suited to convolutional operations but face challenges from non-ideal effects and the resulting performance degradation.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| | | | | fabrication of high-yield, high-performance and uniform memristor crossbar arrays; hybrid-training method; replication of multiple identical kernels for processing different inputs in parallel | | | |
| 2020 | TCAS-I | Georgia Tech | Optimizing Weight Mapping and Data Flow for Convolutional Neural Networks on Processing-in-Memory Architectures | weight mapping to avoid multiple access to input; pipeline architecture for conv layer calculation | | | |
| 2019 | TED | PKU | Convolutional Neural Networks Based on RRAM Devices for Image Recognition and Online Learning Tasks | RRAM-based hardware implementation of CNN; expand kernel to the size of image | | | |
| 2021 | TCAD | SJTU | Efficient and Robust RRAM-Based Convolutional Weight Mapping With Shifted and Duplicated Kernel | | | | |
| | | | Mapping of CNNs on multi-core RRAM-based CIM architectures | architecture optimized for communication; compiler algorithms for conv2D layer; cycle-accurate simulator | | | |
| 2023 | TODAES | UCAS | Mathematical Framework for Optimizing Crossbar Allocation for ReRAM-based CNN Accelerators | formulate a crossbar allocation problem for ReRAM-based CNN accelerators; dynamic programming based solver; models the performance considering allocation problem (see the sketch after this table) | | | |
| 2025 | TVLSI | NBU | A 578-TOPS/W RRAM-Based Binary Convolutional Neural Network Macro for Tiny AI Edge Devices | ReRAM XNOR cell; BCNN CIM macro with FPGA as the control core | | | |
| | | | Heterogeneous Integration of In-Memory Analog Computing Architectures with Tensor Processing Units | hybrid TPU-IMAC architecture; TPU for conv, CIM for fc | | | |
| 2025 | ASPLOS | CAS | PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System | dynamic parallelism-aware task scheduling for LLM decoding; online kernel characterization for heterogeneous architectures; hybrid PIM units for compute-bound and memory-bound kernels | | | |
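Several entries above (the Georgia Tech weight-mapping paper and the UCAS crossbar-allocation framework) start from the same counting exercise: how many crossbars does one convolutional layer occupy once its kernels are unrolled into columns? The sketch below shows the standard unrolled-kernel estimate; the crossbar size, bit widths, and cells-per-weight value are illustrative assumptions, and the smarter mappings in those papers (shifted/duplicated kernels, duplication for parallelism) change the count.

```python
# Sketch: counting crossbars for a conv layer under the standard mapping where
# each kernel is unrolled into one column group (rows = C*R*S unrolled inputs,
# columns = K output channels). Crossbar size and cells-per-weight are
# illustrative assumptions; optimized mappings in the papers above differ.
import math

def crossbars_for_conv(in_ch, k_h, k_w, out_ch,
                       xbar_rows=128, xbar_cols=128, cells_per_weight=2):
    rows_needed = in_ch * k_h * k_w          # one row per unrolled weight input
    cols_needed = out_ch * cells_per_weight  # e.g. 2 cells/weight for 8-bit weights on 4-bit devices
    tiles = math.ceil(rows_needed / xbar_rows) * math.ceil(cols_needed / xbar_cols)
    return rows_needed, cols_needed, tiles

if __name__ == "__main__":
    # e.g. a VGG-style conv layer: 128 input channels, 3x3 kernels, 256 output channels
    rows, cols, tiles = crossbars_for_conv(128, 3, 3, 256)
    print(f"unrolled to {rows} rows x {cols} columns -> {tiles} crossbars of 128x128")
```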
Challenge: Limited by the precision/area/power trade-off of the ADC, certain CIM devices such as RRAM are not suitable for high-precision computation (e.g., FP32), so quantization is needed to reduce the precision of the data.
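The "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference" entry in the table below quantizes weights and activations to 8-bit integers with a scale and zero-point, keeps biases and accumulators in 32-bit, and rescales once at the end. A minimal numpy rendition of that affine quantization for a single matrix-vector product is shown here; per-tensor granularity and the final rescale done in floating point are simplifications of the full integer-only pipeline.

```python
# Sketch: affine (scale + zero-point) 8-bit quantization of a matrix-vector
# product with a 32-bit integer accumulator, in the spirit of
# integer-arithmetic-only inference. Per-tensor scales and the float rescale
# at the end are simplifications.
import numpy as np

def quantize(t, n_bits=8):
    qmin, qmax = 0, 2**n_bits - 1
    scale = (t.max() - t.min()) / (qmax - qmin)
    zero_point = int(round(qmin - t.min() / scale))
    q = np.clip(np.round(t / scale) + zero_point, qmin, qmax).astype(np.int32)
    return q, scale, zero_point

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, x = rng.standard_normal((16, 64)), rng.standard_normal(64)

    qW, sW, zW = quantize(W)
    qx, sx, zx = quantize(x)
    # int32 accumulation of (qW - zW) @ (qx - zx), then one rescale by sW*sx
    acc = (qW - zW).astype(np.int32) @ (qx - zx).astype(np.int32)
    y_q = sW * sx * acc

    y_fp = W @ x
    print("max abs error vs float:", np.abs(y_q - y_fp).max())
```

In a CIM macro the same idea shows up one level lower: the ADC digitizes low-precision partial sums, which is why several papers in the table quantize partial sums specifically to relax the required ADC resolution.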
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| | | | Partial-Sum Quantization for Near ADC-Less Compute-In-Memory Accelerators | ADC-Less and near ADC-Less CiM accelerators; CiM hardware aware DNN quantization methodology | | | |
| 2024 | TCAD | BUAA | CIMQ: A Hardware-Efficient Quantization Framework for Computing-In-Memory-Based Neural Network Accelerators | bit-level sparsity induced activation quantization; quantizing partial sums to decrease required resolution of ADCs; arraywise quantization granularity | | | |
| 2024 | TCAD | BUAA | CIM²PQ: An Arraywise and Hardware-Friendly Mixed Precision Quantization Method for Analog Computing-In-Memory | mixed precision quantization method based on evolutionary algorithm; arraywise quantization granularity; evaluation method to obtain the performance of strategy on the CIM | | | |
| 2024 | ICCAD | TU Delft | Hardware-Aware Quantization for Accurate Memristor-Based Neural Networks | analysis of fixed-point quantization impact on conductance variation; weight quantization tuning technique; approach to reduce the residual error | | | |
| | | | Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference | integer-only inference arithmetic; quantizes both weights and activations as 8-bit integers, bias 32-bit; provides both a quantized inference framework and a training framework | | | |
| 2023 | ICCD | SJTU | PSQ: An Automatic Search Framework for Data-Free Quantization on PIM-based Architecture | post-training quantization framework without retraining; hardware-aware block reassembly | | | |