Challenge: Existing compilers are not optimized for locality-aware PIM architectures, so specialized programming models are still required to fully utilize PIM capabilities.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2015 | ISCA | Seoul National | PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture | PIM-Enabled Instructions for ISA extension; PIM directory for atomicity and coherence; single-cache-block restriction | 3 | 4 | 4 |
| 2020 | ISCA | UCSB | iPIM: Programmable In-Memory Image Processing Accelerator Using Near-Bank Architecture | | | | |
| | | | Piccolo: Large-Scale Graph Processing with Fine-Grained In-Memory Scatter-Gather | In-DRAM fine-grained scatter-gather via data bus offsets; fine-grained cache architecture using fg-tags; Standard DDR command interpretation for FIM control; Combined graph tiling with fine-grained memory access | 3 | 3 | 4 |
| 2025 | arXiv | ETHZ | PIMDAL: Mitigating the Memory Bottleneck in Data Analytics using a Real Processing-in-Memory System | PIMDAL library for DB operators; quicksort/mergesort/hashing on UPMEM PIM; scatter/gather/async transfers for PIM communication | 4 | 4 | 2 |
| 2024 | arXiv | Seoul National | PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices | Virtual hypercube PIM model; PE-assisted data reordering; in-register and cross-domain data modulation (see the sketch after this table) | 3 | 4 | 3 |
| 2025 | ISCA | KAIST | PIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM | | | | |
| | | | DIMM-Link: Enabling Efficient Inter-DIMM Communication for Near-Memory Processing | high-speed hardware link bridges between DIMMs; direct intra-group P2P communication & broadcast; hybrid routing mechanism for inter-group communication | | | |
| 2025 | HPCA | SJTU | AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing | | | | |
| | | | Application-Transparent Near-Memory Processing Architecture with Memory Channel Network | integrates a processor on a buffered DIMM; application-transparent near-memory processing; leverages memory channels for high-bandwidth/low-latency inter-processor communication | | | |
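PID-Comm and PIMnet above both center on collective communication among PIM units, and PID-Comm models the DIMM-attached PEs as a virtual hypercube. As a reference point (not these papers' actual implementations), here is a minimal simulation of the classic recursive-doubling all-reduce over a hypercube of 2^d nodes; the node count and payloads are illustrative assumptions.

```python
# Minimal sketch (not from the papers above): recursive-doubling all-reduce
# over a virtual hypercube of 2**d nodes, the communication pattern that
# hypercube-based PIM collectives build on.
import numpy as np

def allreduce_hypercube(node_data):
    """node_data: list of equal-length vectors, one per node; node count must be 2**d."""
    n = len(node_data)
    d = n.bit_length() - 1
    assert n == 1 << d, "node count must be a power of two"
    data = [v.copy() for v in node_data]
    for step in range(d):                 # d exchange rounds
        partner_bit = 1 << step
        for node in range(n):
            partner = node ^ partner_bit  # neighbor along this hypercube dimension
            if node < partner:            # exchange and combine once per pair
                s = data[node] + data[partner]
                data[node], data[partner] = s, s.copy()
    return data                            # every node now holds the global sum

if __name__ == "__main__":
    vecs = [np.full(4, rank, dtype=np.int64) for rank in range(8)]
    out = allreduce_hypercube(vecs)
    print(out[0])   # [28 28 28 28] == sum(range(8)) in every slot
```

Each of the d rounds exchanges data along one hypercube dimension, which is why the real frameworks care so much about how DIMM-to-DIMM links map onto those dimensions.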
Challenge: There is no direct physical connectivity between banks in DIMM-based NDP architectures, and the limited number of DDR channels leads to poor scalability.
Solution: Introduce CXL-based interconnects to enable direct communication between memory banks; use CXL memory pools and CXL switches to build a scalable NDP architecture.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2022 | MICRO | UCSB | BEACON: Scalable Near-Data-Processing Accelerators for Genome Analysis near Memory Pool with the CXL Support | scalable hardware accelerator inside CXL switch or bank; lossless memory expansion for CXL memory pools | | | |
| | | | NeuroSim: A Circuit-Level Macro Model for Benchmarking Neuro-Inspired Architectures in Online Learning | estimates the circuit-level performance of neuro-inspired architectures; estimates the area, latency, dynamic energy, and leakage power; supports both SRAM and eNVM; tested on 2-layer MLP NN, MNIST (see the sketch after this table) | | | |
| 2019 | IEDM | Georgia Tech | DNN+NeuroSim: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators with Versatile Device Technologies | a Python wrapper to interface NeuroSim; for inference only | | | |
| 2020 | TCAD | ZJU | Eva-CiM: A System-Level Performance and Energy Evaluation Framework for Computing-in-Memory Architectures | models for capturing memory access and dependency-aware ISA traces; models for quantifying interactions between the host CPU and the CiM module | | | |
| 2024 | ISPASS | MIT | CiMLoop: A Flexible, Accurate, and Fast Compute-In-Memory Modeling Tool | flexible specification to describe CiM systems; accurate model/fast statistical model of data-value-dependent component energy | | | |
| 2025 | ASPDAC | HKUST | MICSim: A Modular Simulator for Mixed-signal Compute-in-Memory based AI Accelerator | modularized NeuroSim; data statistic-based average mode instead of trace-based mode | | | |
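NeuroSim, CiMLoop, and the related tools above all decompose macro energy into per-component contributions (DAC, array, ADC, digital post-processing) multiplied by activity counts. The toy sketch below illustrates only that accounting style; the component list and all per-operation energies are placeholder assumptions, not calibrated values from any of these tools.

```python
# Toy component-level energy estimate for one analog MVM, in the spirit of
# NeuroSim/CiMLoop-style accounting. All per-op energies are placeholder
# assumptions, not numbers from any tool or paper.
ENERGY_PJ = {              # assumed energy per activation of each component (pJ)
    "dac": 0.05,           # one input DAC conversion
    "array_column": 0.20,  # analog dot product on one bitline/column
    "adc": 1.50,           # one ADC conversion
    "shift_add": 0.10,     # digital shift-and-add of one column result
}

def mvm_energy_pj(rows, cols, input_bits=8):
    """Energy of one rows x cols analog MVM with bit-serial inputs."""
    counts = {
        "dac": rows * input_bits,           # each input row streamed bit-serially
        "array_column": cols * input_bits,  # every column evaluated per input bit
        "adc": cols * input_bits,           # one conversion per column per bit
        "shift_add": cols * input_bits,
    }
    return sum(counts[c] * ENERGY_PJ[c] for c in counts)

if __name__ == "__main__":
    print(f"128x128 MVM, 8-bit inputs: {mvm_energy_pj(128, 128):.1f} pJ")
```

Even this toy version makes the recurring conclusion of those tools visible: with bit-serial inputs, the ADC term dominates unless its per-conversion energy or activation count is reduced.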
Solution: Rather than placing logic units into DRAM, modify the physical structure of DRAM/eDRAM to enable in-memory computing.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2021 | ICCD | ASU | CIDAN: Computing in DRAM with Artificial Neurons | Threshold Logic Processing Element (TLPE) for in-memory computation; Four-bank activation window; Configurable threshold functions; Energy-efficient bitwise operations; Integration with DRAM architecture | | | |
| 2022 | HPCA | UCSD | TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer | token-based dataflow for general Transformer-based models; ring-based data broadcast in modified HBM | 4 | 2 | 4 |
| 2024 | A-SSCC | UNIST | A 273.48 TOPS/W and 1.58 Mb/mm2 Analog-Digital Hybrid CIM Processor with Transpose Ternary-eDRAM Bitcell | analog DRAM CIM for partial sum and digital adder | 1 | 4 | 2 |
| 2025 | arXiv | KAIST | RED: Energy Optimization Framework for eDRAM-based PIM with Reconfigurable Voltage Swing and Retention-aware Scheduling | RED framework for energy optimization; reconfigurable eDRAM design; retention-aware scheduling; trade-off analysis between RBL voltage swing, sense amplifier power, and retention time; refresh skipping and sense amplifier power gating | | | |
| 2025 | arXiv | UTokyo | MVDRAM: Enabling GeMV Execution in Unmodified DRAM for Low-Bit LLM Acceleration | GeMV operations for end-to-end low-bit LLM inference using unmodified DRAM; processor-DRAM co-design; on-the-fly vector encoding; horizontal matrix layout (see the sketch after this table) | | | |
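MVDRAM above executes GeMV for low-bit weights inside unmodified DRAM. Independent of its DRAM command sequences, the arithmetic that makes low-bit GeMV amenable to bulk bitwise hardware is the decomposition of the weight matrix into binary bit planes; the numpy sketch below illustrates only that decomposition, with matrix sizes and bit widths chosen arbitrarily, and does not model MVDRAM's layout or encoding.

```python
# Sketch: a GeMV with low-bit unsigned weights decomposed into binary bit
# planes, the kind of bulk bitwise formulation in-DRAM GeMV schemes exploit.
# This models only the arithmetic, not MVDRAM's DRAM layout or commands.
import numpy as np

def gemv_bitplanes(W, x, weight_bits=4):
    """W: (rows, cols) unsigned matrix with values < 2**weight_bits; x: (cols,) vector."""
    acc = np.zeros(W.shape[0], dtype=np.int64)
    for b in range(weight_bits):
        plane = (W >> b) & 1             # binary bit plane of the weights
        acc += (plane @ x) << b          # partial GeMV, weighted by 2**b
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.integers(0, 16, size=(8, 32), dtype=np.uint8)   # 4-bit weights
    x = rng.integers(0, 100, size=32, dtype=np.int64)
    assert np.array_equal(gemv_bitplanes(W, x), W.astype(np.int64) @ x)
    print("bit-plane GeMV matches dense GeMV")
```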
Challenge: The memory wall causes high latency for data transfers between the CPU and memory; DIMM-based NDP suffers from high energy consumption, area overhead, and low performance efficiency.
Solution: Generally, modify the physical structure of SRAM to enable in-memory computing rather than placing logic units into SRAM.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| | | | | sparsity algorithm designed for SRAM CiM; quantization algorithm with BN fusion | | | |
| 2024 | ESSCIRC | THU | A 65nm 8b-Activation 8b-Weight SRAM-Based Charge-Domain Computing-in-Memory Macro Using A Fully-Parallel Analog Adder Network and A Single-ADC Interface | SRAM-based CD-CiM architecture; charge-domain analog adder tree; ReLU-optimized ADC | 4 | 4 | 4 |
| 2021 | ISSCC | TSMC | An 89TOPS/W and 16.3TOPS/mm2 All-Digital SRAM-Based Full-Precision Compute-In Memory Macro in 22nm for Machine-Learning Edge Applications | programmable bit-widths for both input and weights; SRAM and CIM mode | | | |
| | | | MemTorch: A Simulation Framework for Deep Memristive Cross-Bar Architectures | supports both GPUs and CPUs; integrates directly with PyTorch; simulates non-idealities of memristive devices within the crossbar, tested on VGG-16, CIFAR-10 (see the sketch after this table) | | | |
| 2021 | TCAD | Georgia Tech | DNN+NeuroSim V2.0: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators for On-Chip Training | effect of non-ideal device properties of NVMs on on-chip training | | | |
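MemTorch in the table above patches non-idealities of memristive devices into crossbar-mapped layers. As a minimal, framework-agnostic illustration of one such non-ideality, the sketch below perturbs a weight matrix with device-to-device conductance variation; the lognormal model, the sigma value, and the direct multiplication onto signed weights are simplifying assumptions for illustration, not MemTorch's API or device models.

```python
# Sketch: injecting device-to-device conductance variation into crossbar
# weights before evaluating a layer. The lognormal model and sigma are
# illustrative assumptions; sign handling via differential conductance pairs
# is ignored here for brevity.
import numpy as np

def apply_conductance_variation(W, sigma=0.1, rng=None):
    """Multiply each mapped weight by a lognormal device variation factor."""
    rng = rng or np.random.default_rng()
    variation = rng.lognormal(mean=0.0, sigma=sigma, size=W.shape)
    return W * variation

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.standard_normal((64, 64)).astype(np.float32)
    x = rng.standard_normal(64).astype(np.float32)
    y_ideal = W @ x
    y_noisy = apply_conductance_variation(W, sigma=0.05, rng=rng) @ x
    rel_err = np.linalg.norm(y_noisy - y_ideal) / np.linalg.norm(y_ideal)
    print(f"relative output error from 5% device variation: {rel_err:.3f}")
```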
Challenge: The Transformer architecture is widely used in NLP and CV tasks, but existing SRAM CIM architectures are not well suited to Transformer acceleration.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2025 | DATE | PKU | Leveraging Compute-in-Memory for Efficient Generative Model Inference in TPUs | architecture model and simulator for CIM-based TPUs; designed for LLM inference | 4 | 2 | 4 |
| 2023 | arXiv | Keio | An 818-TOPS/W CSNR-31dB SQNR-45dB 10-bit Capacitor-Reconfiguring Computing-in-Memory Macro with Software-Analog Co-Design for Transformers | Capacitor-Reconfiguring analog CIM architecture | 1 | 4 | 3 |
| 2025 | arXiv | Purdue | Hardware-Software Co-Design for Accelerating Transformer Inference Leveraging Compute-in-Memory | SRAM based softmax-friendly CIM architecture for transformer; finer-granularity pipelining strategy | 4 | 3 | 2 |
| 2025 | arXiv | PKU | Leveraging Compute-in-Memory for Efficient Generative Model Inference in TPUs | Energy-efficient CIM core integration in TPUs (replaces the original MXU); CIM-MXU with systolic data path; Array dimension scaling for CIM-MXU; Area-efficient CIM macro design; Mapping engine for generative model inference | | | |
| 2024 | JSSC | THU | MulTCIM: Digital Computing-in-Memory-Based Multimodal Transformer Accelerator With Attention-Token-Bit Hybrid Sparsity | long reuse elimination scheduler (LRES) to dynamically reshape the attention matrix; runtime token pruner (RTP) to remove insignificant tokens; modal-adaptive CIM network (MACN) to dynamically divide CIM cores into pipelines; effective-bits-balanced CIM (EBBCIM) macro architecture (see the sketch after this table) | | | |
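MulTCIM's runtime token pruner removes insignificant tokens before they occupy CIM resources. The sketch below shows one generic way such a pruner can score and drop tokens (top-k by received attention mass); the scoring rule and keep ratio are illustrative assumptions, not MulTCIM's exact hardware criterion.

```python
# Sketch: runtime token pruning for attention. Tokens with the smallest
# accumulated attention mass are dropped before the next layer. The scoring
# rule (column sums of the attention matrix) and the keep ratio are
# illustrative assumptions, not MulTCIM's exact pruning criterion.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prune_tokens(Q, K, V, keep_ratio=0.5):
    """Return the kept token indices and their value vectors."""
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (n_tokens, n_tokens)
    importance = attn.sum(axis=0)                      # attention mass received per token
    n_keep = max(1, int(keep_ratio * len(importance)))
    kept = np.sort(np.argsort(importance)[-n_keep:])   # keep top-k, preserve order
    return kept, V[kept]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 16, 8
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    kept, V_kept = prune_tokens(Q, K, V, keep_ratio=0.25)
    print("kept token indices:", kept, "| kept value shape:", V_kept.shape)
```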
Challenge: RRAM devices are non-volatile and high-density, which makes them suitable for CIM applications. However, RRAM devices exhibit non-ideal effects that can cause significant performance degradation.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| | | | PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference | Programmable and general-purpose ReRAM based ML Accelerator; Supports an instruction set; Has potential for DNN training; Provides simulator that accepts model | | | |
| 2018 | ICRC | Purdue & HP | Hardware-Software Co-Design for an Analog-Digital Accelerator for Machine Learning | compiler to translate model to ISA; ONNX interpreter to support models in common DL frameworks; simulator to evaluate performance | | | |
| 2023 | NANOARCH | HUST | Heterogeneous Instruction Set Architecture for RRAM-enabled In-memory Computing | General ISA for RRAM CiM & digital heterogeneous architecture; a tile-processing unit-array three-level architecture | | | |
| 2024 | VLSI-SoC | RWTH Aachen University | Architecture-Compiler Co-design for ReRAM-Based Multi-core CIM Architectures | inference latency predictions and analysis of the crossbar utilization for CNN | | | |
| 2024 | arXiv | CAS | A Fully Hardware Implemented Accelerator Design in ReRAM Analog Computing without ADCs | Based on Stochastic Binary Neural Networks; Winner-Take-All (WTA) strategy; Hardware implemented sigmoid and softmax | | | |
| | | | DRCTL: A Disorder-Resistant Computation Translation Layer Enhancing the Lifetime and Performance of Memristive CIM Architecture | address conversion method for dynamic scheduling; hierarchical wear-leveling (HWL) strategy for reliability improvement; data layout-aware selective remapping (LASR) to improve communication locality and reduce latency | | | |
| 2024 | DATE | RWTH Aachen University | CLSA-CIM: A Cross-Layer Scheduling Approach for Computing-in-Memory Architectures | algorithm to decide which parts of NN are duplicated to reduce inference latency; cross-layer scheduling on tiled CIM architectures | | | |
| 2024 | TC | SJTU | ERA-BS: Boosting the Efficiency of ReRAM-Based PIM Accelerator With Fine-Grained Bit-Level Sparsity | bit-level sparsity in both weights and activations; bit-flip scheme; dynamic activation sparsity exploitation scheme | | | |
| 2023 | TETCI | TU Delft | Accurate and Energy-Efficient Bit-Slicing for RRAM-Based Neural Networks | unbalanced bit-slicing scheme for higher accuracy; holistic solution using 2's complement (see the sketch after this table) | | | |
| 2024 | Science | USC | Programming memristor arrays with arbitrarily high precision for analog computing | represent high-precision numbers using multiple relatively low-precision analog devices; using RRAM CIM to solve PDEs | | | |
| | | | A Calibratable Model for Fast Energy Estimation of MVM Operations on RRAM Crossbars | system energy model for MVM on ReRAM crossbars; methodology to study the effect of the selection transistor and wire parasitics in 1T1R crossbar arrays | | | |
| 2024 | arXiv | MIT | Modeling Analog-Digital-Converter Energy and Area for Compute-In-Memory Accelerator Design | architecture-level model that estimates ADC energy and area | | | |
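The TU Delft bit-slicing paper and the USC Science paper above rest on the same underlying idea: spread a higher-precision operand across several low-precision analog devices and recombine the slice results digitally. The sketch below shows the plain, balanced version of that arithmetic for unsigned weights; the slice width, device count, and unsigned assumption are simplifications, and the cited papers use more elaborate (unbalanced or 2's-complement-aware) schemes on top of this baseline.

```python
# Sketch: representing 8-bit weights with several low-precision "devices"
# (balanced 2-bit slices) and recombining their MVM results digitally.
# Unsigned weights and uniform slice widths are simplifying assumptions.
import numpy as np

def slice_weights(W, slice_bits=2, n_slices=4):
    """Split unsigned integer weights into n_slices slices of slice_bits each."""
    mask = (1 << slice_bits) - 1
    return [((W >> (slice_bits * s)) & mask) for s in range(n_slices)]

def mvm_bit_sliced(W, x, slice_bits=2, n_slices=4):
    slices = slice_weights(W, slice_bits, n_slices)        # each fits a low-precision device
    partial = [sl.astype(np.int64) @ x for sl in slices]   # one "analog" MVM per slice
    return sum(p << (slice_bits * s) for s, p in enumerate(partial))  # digital recombine

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.integers(0, 256, size=(16, 64), dtype=np.int64)   # 8-bit unsigned weights
    x = rng.integers(0, 128, size=64, dtype=np.int64)
    assert np.array_equal(mvm_bit_sliced(W, x), W @ x)
    print("bit-sliced MVM matches full-precision MVM")
```

The unbalanced scheme in the TETCI paper changes how many bits each slice carries so that slices feeding the most error-sensitive positions get more headroom; the recombination step stays the same shift-and-add.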
Challenge: Compilers for RRAM CIM are not well studied; existing compilers either target a specific architecture or are inefficient.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2023 | TACO | HUST | A Compilation Tool for Computation Offloading in ReRAM-based CIM Architectures | compilation tool to migrate legacy programs to CPU/CIM heterogeneous architectures; a model to quantify the performance gain | | | |
| 2023 | DAC | CAS | PIMCOMP: A Universal Compilation Framework for Crossbar-based PIM DNN Accelerators | compiler based on Crossbar/IMA/Tile/Chip hierarchy; low latency and high throughput mode; genetic algorithm to optimize weight replication and core mapping; scheduling algorithms for complex DNN (see the sketch after this table) | | | |
| 2024 | ASPLOS | CAS | CIM-MLC: A Multi-level Compilation Stack for Computing-In-Memory Accelerators | compilation stack for various CIM accelerators; multi-level DNN scheduling approach | | | |
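PIMCOMP above tunes weight replication and core mapping with a genetic algorithm. To make the underlying trade-off concrete, here is a much simpler greedy heuristic for the same decision: under a fixed crossbar budget, repeatedly replicate the current bottleneck layer, since pipeline throughput is set by the slowest stage. The latency numbers and budget below are invented inputs, and this greedy rule is only an illustration of the objective, not PIMCOMP's algorithm.

```python
# Sketch: greedy weight replication under a crossbar budget. Pipeline
# throughput is limited by the slowest layer, so each extra replica goes to
# the current bottleneck. Inputs are invented; PIMCOMP itself uses a genetic
# algorithm over replication and core mapping rather than this greedy rule.
import heapq

def replicate_greedy(layer_latency, layer_crossbars, budget):
    """Return replica counts per layer and the resulting bottleneck latency."""
    replicas = [1] * len(layer_latency)
    heap = [(-lat, i) for i, lat in enumerate(layer_latency)]  # max-heap on effective latency
    heapq.heapify(heap)
    spent = sum(layer_crossbars)
    while heap:
        _, i = heapq.heappop(heap)
        if spent + layer_crossbars[i] > budget:
            continue                      # cannot afford another replica of this layer
        replicas[i] += 1
        spent += layer_crossbars[i]
        heapq.heappush(heap, (-layer_latency[i] / replicas[i], i))
    bottleneck = max(lat / r for lat, r in zip(layer_latency, replicas))
    return replicas, bottleneck

if __name__ == "__main__":
    latency   = [4.0, 1.0, 8.0, 2.0]   # per-layer latency per input (arbitrary units)
    crossbars = [2, 1, 4, 1]           # crossbars needed by one copy of each layer
    reps, bottleneck = replicate_greedy(latency, crossbars, budget=20)
    print("replicas:", reps, "| pipeline bottleneck latency:", bottleneck)
```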
Challenge: Convolutional layers are the most compute-intensive layers in CNNs. RRAM CIM architectures are well suited to convolutional operations but face challenges from non-ideal effects and the resulting performance degradation.
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| | | | | fabrication of high-yield, high-performance and uniform memristor crossbar arrays; hybrid-training method; replication of multiple identical kernels for processing different inputs in parallel | | | |
| 2020 | TCAS-I | Georgia Tech | Optimizing Weight Mapping and Data Flow for Convolutional Neural Networks on Processing-in-Memory Architectures | weight mapping to avoid multiple access to input; pipeline architecture for conv layer calculation | | | |
| 2019 | TED | PKU | Convolutional Neural Networks Based on RRAM Devices for Image Recognition and Online Learning Tasks | RRAM-based hardware implementation of CNN; expand kernel to the size of image | | | |
| 2021 | TCAD | SJTU | Efficient and Robust RRAM-Based Convolutional Weight Mapping With Shifted and Duplicated Kernel | | | | |
| | | | Mapping of CNNs on multi-core RRAM-based CIM architectures | architecture optimized for communication; compiler algorithms for conv2D layer; cycle-accurate simulator | | | |
| 2023 | TODAES | UCAS | Mathematical Framework for Optimizing Crossbar Allocation for ReRAM-based CNN Accelerators | formulate a crossbar allocation problem for ReRAM-based CNN accelerators; dynamic programming based solver; models the performance considering allocation problem (see the sketch after this table) | | | |
| 2025 | TVLSI | NBU | A 578-TOPS/W RRAM-Based Binary Convolutional Neural Network Macro for Tiny AI Edge Devices | ReRAM XNOR cell; BCNN CIM macro with FPGA as the control core | | | |
| | | | Heterogeneous Integration of In-Memory Analog Computing Architectures with Tensor Processing Units | hybrid TPU-IMAC architecture; TPU for conv, CIM for fc | | | |
| 2025 | ASPLOS | CAS | PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System | dynamic parallelism-aware task scheduling for LLM decoding; online kernel characterization for heterogeneous architectures; hybrid PIM units for compute-bound and memory-bound kernels | | | |
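Several entries above (the Georgia Tech weight-mapping paper and the UCAS crossbar-allocation framework) start from the same counting exercise: how many crossbars does one convolutional layer occupy once its kernels are unrolled into columns? The sketch below shows the standard unrolled-kernel estimate; the crossbar size, bit widths, and cells-per-weight value are illustrative assumptions, and the smarter mappings in those papers (shifted/duplicated kernels, duplication for parallelism) change the count.

```python
# Sketch: counting crossbars for a conv layer under the standard mapping where
# each kernel is unrolled into one column group (rows = C*R*S unrolled inputs,
# columns = K output channels). Crossbar size and cells-per-weight are
# illustrative assumptions; optimized mappings in the papers above differ.
import math

def crossbars_for_conv(in_ch, k_h, k_w, out_ch,
                       xbar_rows=128, xbar_cols=128, cells_per_weight=2):
    rows_needed = in_ch * k_h * k_w          # one row per unrolled weight input
    cols_needed = out_ch * cells_per_weight  # e.g. 2 cells/weight for 8-bit weights on 4-bit devices
    tiles = math.ceil(rows_needed / xbar_rows) * math.ceil(cols_needed / xbar_cols)
    return rows_needed, cols_needed, tiles

if __name__ == "__main__":
    # e.g. a VGG-style conv layer: 128 input channels, 3x3 kernels, 256 output channels
    rows, cols, tiles = crossbars_for_conv(128, 3, 3, 256)
    print(f"unrolled to {rows} rows x {cols} columns -> {tiles} crossbars of 128x128")
```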
Challenge: Limited by the precision/area/power trade-off of the ADC, certain CIM devices such as RRAM are not suitable for high-precision computation (e.g., FP32), so quantization is needed to reduce the precision of the data.
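The "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference" entry in the table below quantizes weights and activations to 8-bit integers with a scale and zero-point, keeps biases and accumulators in 32-bit, and rescales once at the end. A minimal numpy rendition of that affine quantization for a single matrix-vector product is shown here; per-tensor granularity and the final rescale done in floating point are simplifications of the full integer-only pipeline.

```python
# Sketch: affine (scale + zero-point) 8-bit quantization of a matrix-vector
# product with a 32-bit integer accumulator, in the spirit of
# integer-arithmetic-only inference. Per-tensor scales and the float rescale
# at the end are simplifications.
import numpy as np

def quantize(t, n_bits=8):
    qmin, qmax = 0, 2**n_bits - 1
    scale = (t.max() - t.min()) / (qmax - qmin)
    zero_point = int(round(qmin - t.min() / scale))
    q = np.clip(np.round(t / scale) + zero_point, qmin, qmax).astype(np.int32)
    return q, scale, zero_point

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, x = rng.standard_normal((16, 64)), rng.standard_normal(64)

    qW, sW, zW = quantize(W)
    qx, sx, zx = quantize(x)
    # int32 accumulation of (qW - zW) @ (qx - zx), then one rescale by sW*sx
    acc = (qW - zW).astype(np.int32) @ (qx - zx).astype(np.int32)
    y_q = sW * sx * acc

    y_fp = W @ x
    print("max abs error vs float:", np.abs(y_q - y_fp).max())
```

In a CIM macro the same idea shows up one level lower: the ADC digitizes low-precision partial sums, which is why several papers in the table quantize partial sums specifically to relax the required ADC resolution.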
| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| | | | Partial-Sum Quantization for Near ADC-Less Compute-In-Memory Accelerators | ADC-Less and near ADC-Less CiM accelerators; CiM hardware aware DNN quantization methodology | | | |
| 2024 | TCAD | BUAA | CIMQ: A Hardware-Efficient Quantization Framework for Computing-In-Memory-Based Neural Network Accelerators | bit-level sparsity induced activation quantization; quantizing partial sums to decrease required resolution of ADCs; arraywise quantization granularity | | | |
| 2024 | TCAD | BUAA | CIM²PQ: An Arraywise and Hardware-Friendly Mixed Precision Quantization Method for Analog Computing-In-Memory | mixed precision quantization method based on evolutionary algorithm; arraywise quantization granularity; evaluation method to obtain the performance of strategy on the CIM | | | |
| 2024 | ICCAD | TU Delft | Hardware-Aware Quantization for Accurate Memristor-Based Neural Networks | analysis of fixed-point quantization impact on conductance variation; weight quantization tuning technique; approach to reduce the residual error | | | |
| | | | Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference | integer-only inference arithmetic; quantizes both weights and activations as 8-bit integers, bias 32-bit; provides both a quantized inference framework and a training framework | | | |
| 2023 | ICCD | SJTU | PSQ: An Automatic Search Framework for Data-Free Quantization on PIM-based Architecture | post-training quantization framework without retraining; hardware-aware block reassembly | | | |