Parallel and Multi-Processor Architecture¶

Heterogeneous Architecture¶

Challenge: Classic heterogeneous architectures struggle with data movement and memory access patterns, which leads to performance bottlenecks.

Year	Venue	Authors	Title	Tags
2017	TACO	Intel	HAShCache: Heterogeneity-Aware Shared DRAMCache for Integrated Heterogeneous Systems	heterogeneity-aware DRAMCache scheduling PrIS; temporal bypass ByE; spatial occupancy control chaining
2018	ICS	NC State	ProfDP: A Lightweight Profiler to Guide Data Placement in Heterogeneous Memory Systems	latency sensitivity; bandwidth sensitivity; moving factor based data placement
2023	HPCA	THU	Baryon: Efficient Hybrid Memory Management with Compression and Sub-Blocking	stage area and selective commit for stable block; dual-format metadata scheme; cacheline-aligned compression and two-level replacements

Multiple Domain Specific Accelerator¶

Year	Venue	Authors	Title	Tags	P	E	N
2024	HPCA	UCSD && Univ. of Kansas	Data Motion Acceleration: Chaining Cross-Domain Multi Accelerators	Data Motion Acceleration (DMX); Data Restructuring Accelerator (DRX)	3	4	3
2025	ISCA	HyperAccel	Hybe: GPU-NPU Hybrid System for Efficient LLM Inference with Million-Token Context Window	GPU-NPU hybrid system for LLM; prefill-decode stage separation; fine-grained KV cache transmission; stage-wise pipelining	4	3	2

CPU-GPU System¶

Year	Venue	Authors	Title	Tags	P	E	N
2024	arXiv	KTH	Harnessing Integrated CPU-GPU System Memory for HPC: a first look into Grace Hopper	Grace Hopper system memory characterization; integrated CPU-GPU page table analysis; first-touch policy impact study; system page size impact study; access-counter page migration evaluation	2	4	3
2025	ATC	THU	HYPERECA: Distributed Heterogeneous In-Memory Embedding Database for Training Recommender Models	embedding database in host memory and GPU memory; 2-Fold Parallel strategy; contention-free ring schedule	2	3	3

GPU System¶

Year	Venue	Authors	Title	Tags	P	E	N
2022	MLSys	MIT	TorchSparse: Efficient Point Cloud Inference Engine	3D Sparse Convolution; optimize Gather-Matmul-Scatter dataflow; Adaptive Matmul Grouping; Quantized and Vectorized Memory access	4	3	4
2023	MLSys	THU && SJTU	Exploiting Hardware Utilization and Adaptive Dataflow for Efficient Sparse Convolution in 3D Point Clouds	3D Sparse Convolution; optimize Gather-Matmul-Scatter and fetch-on-demand dataflow; Dynamic dataflow changing; coded-CSR mapping; Parallel Processing of different workloads without padding; Pointer	4	3	3
2023	MICRO	MIT	TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs	3D Sparse Convolution; optimize Implicit Gather-Matmul-Scatter; CUDA Sparse Kernel; Sparse Autotuner by detailed workload	4	3	4
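
The Gather-Matmul-Scatter dataflow named in the tags above can be illustrated with a minimal pure-Python sketch. It is not taken from any of the listed papers: the function names (`build_kernel_maps`, `sparse_conv_gms`), the list-based data layout, and the convention that an output at coordinate `c` reads the input at `c + offset` are all assumptions made for illustration.

```python
# Hypothetical sketch of Gather-Matmul-Scatter for sparse convolution.
# Real engines run each step as batched GPU kernels; this only shows the dataflow.
from collections import defaultdict

def build_kernel_maps(in_coords, out_coords, offsets):
    """For each kernel offset k, list (input_idx, output_idx) pairs such that
    in_coord == out_coord + offsets[k] (assumed convention)."""
    index_of = {c: i for i, c in enumerate(in_coords)}
    maps = defaultdict(list)
    for o_idx, oc in enumerate(out_coords):
        for k, off in enumerate(offsets):
            src = tuple(a + b for a, b in zip(oc, off))
            if src in index_of:
                maps[k].append((index_of[src], o_idx))
    return maps

def matmul(rows, weight):
    """Dense (n x c_in) @ (c_in x c_out) on plain Python lists."""
    return [[sum(r[i] * weight[i][j] for i in range(len(weight)))
             for j in range(len(weight[0]))] for r in rows]

def sparse_conv_gms(in_feats, in_coords, out_coords, offsets, weights):
    """weights[k] is the (c_in x c_out) matrix for kernel offset k."""
    c_out = len(weights[0][0])
    out = [[0.0] * c_out for _ in out_coords]
    maps = build_kernel_maps(in_coords, out_coords, offsets)
    for k, pairs in maps.items():
        gathered = [in_feats[i] for i, _ in pairs]   # gather input rows
        partial = matmul(gathered, weights[k])       # one dense matmul per offset
        for (_, o), row in zip(pairs, partial):      # scatter-accumulate to outputs
            for j, v in enumerate(row):
                out[o][j] += v
    return out

# 1D example with three active points and a 3-tap kernel.
coords = [(0,), (1,), (3,)]
feats = [[1.0], [2.0], [3.0]]
offsets = [(-1,), (0,), (1,)]
weights = [[[1.0]], [[10.0]], [[100.0]]]
result = sparse_conv_gms(feats, coords, coords, offsets, weights)
```

Only the active coordinates are visited, so the work scales with the number of matched (input, output) pairs rather than with the dense volume.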

Disaggregated Memory¶

Challenge: CXL and NVM offer byte-level access with higher speed and bandwidth than storage devices. Memory disaggregation that combines DRAM (high speed/bandwidth, small capacity) with NVM (low speed/bandwidth, large capacity) faces latency, bandwidth, and consistency challenges.

Year	Venue	Authors	Title	Tags	P	E	N
2025	ATC	THU	DSA-2LM: A CPU-Free Tiered Memory Architecture with Intel DSA	CPU-free page migration in tiered memory via data streaming accelerator; adaptable migration algorithm for mixed 4KB/2MB pages; direct in-kernel DSA integration bypassing DMA	3	3	4
2025	ASPLOS	Purdue	EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation	Ethernet PHY network stack; PHY in-network scheduler; PHY intra-frame preemption	4	4	4
2025	HotOS	MSR	Storage Class Memory is Dead, All Hail Managed-Retention Memory: Rethinking Memory for the AI Era	Managed-Retention Memory class; relaxed retention non-volatile memory; dynamically configurable memory	3	3	2

CXL-based Disaggregated Memory¶

Year	Venue	Authors	Title	Tags	P	E	N
2024	MICRO	PKU	NeoMem: Hardware-Software Co-Design for CXL-Native Memory Tiering	device-side memory profiling unit; sketch-based hot page detector with error-bound estimation; dynamic hotness threshold adjustment based on statistics	2	2	3
2025	ASPLOS	Yale	PULSE: Accelerating Distributed Pointer-Traversals on Disaggregated Memory	iterator-based programming model; disaggregated accelerator architecture; in-network routing for distributed traversal	3	4	3
2025	arXiv	Micron	Architectural and System Implications of CXL-enabled Tiered Memory	CXL parallelism bottleneck analysis; MIKU dynamic request control; ToR-based service time estimation	4	4	3
2025	arXiv	PKU	Enabling Efficient Transaction Processing on CXL-Based Memory Sharing	hybrid coherence primitive for transactional data; hardware-assisted loose coherence	3	2	2
2025	ASPLOS	PKU	CTXNL: A Software-Hardware Co-designed Solution for Efficient CXL-Based Transaction Processing	decouple coherence from memory access; software synchronization primitives at transaction commit	4	2	3
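
To make the "sketch-based hot page detector" tag above concrete, here is a minimal count-min sketch in Python. It is a generic illustration, not the design from any listed paper: the class and function names, the BLAKE2b hashing, and the fixed threshold are assumptions. A count-min sketch never underestimates a count, so every page whose true access count reaches the threshold is reported; collisions can only add false positives.

```python
# Hypothetical hot-page detector built on a count-min sketch.
# With width w = ceil(e / eps) and depth d = ceil(ln(1 / delta)), the estimate
# exceeds the true count by more than eps * N with probability at most delta.
import hashlib

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, page):
        # One independent-looking hash per row, derived by salting with the row id.
        for d in range(self.depth):
            digest = hashlib.blake2b(f"{d}:{page}".encode(), digest_size=8).digest()
            yield d, int.from_bytes(digest, "little") % self.width

    def record(self, page):
        for d, s in self._slots(page):
            self.rows[d][s] += 1

    def estimate(self, page):
        # The minimum over rows is the tightest upper bound on the true count.
        return min(self.rows[d][s] for d, s in self._slots(page))

def detect_hot_pages(trace, threshold, width=1024, depth=4):
    """Return pages whose estimated access count reached `threshold`."""
    sketch = CountMinSketch(width, depth)
    hot = set()
    for page in trace:
        sketch.record(page)
        if sketch.estimate(page) >= threshold:
            hot.add(page)
    return hot

trace = [0x1000] * 50 + list(range(200)) + [0x2000] * 5
hot = detect_hot_pages(trace, threshold=10)
```

The memory footprint is fixed at `width * depth` counters regardless of how many distinct pages appear, which is why sketches suit device-side profiling with limited on-chip storage.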

Survey¶

Year	Venue	Authors	Title	Tags	P	E	N
2025	arXiv	SJTU	Survey of Disaggregated Memory: Cross-layer Technique Insights for Next-Generation Datacenters	Cross-layer classification of DM techniques; hardware-level categories; architectural-level classifications; system and runtime-level groupings; application-level optimizations such as general-purpose and domain-specific approaches

Chiplets¶

Challenge: Current chip designs are often monolithic and inflexible, leading to high costs and limited opportunities for performance optimization.

Solution: Chiplets enable more flexible and cost-effective system designs by integrating specialized dies, each manufactured on the process best suited to it, which improves performance and yield.
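
The yield argument can be made concrete with a rough sketch based on the standard Poisson yield model, Y = exp(-A * D0). The defect density, die areas, and function names below are illustrative assumptions, and the model ignores packaging cost, die-to-die interface area, and bonding yield, all of which reduce the chiplet advantage in practice.

```python
# Hypothetical comparison of silicon cost per good system:
# one monolithic die versus several smaller chiplets (known-good-die testing assumed).
import math

def poisson_yield(area_mm2, defects_per_mm2):
    """Fraction of dies with zero defects under the Poisson yield model."""
    return math.exp(-area_mm2 * defects_per_mm2)

def silicon_cost_per_good_system(die_area_mm2, dies_per_system,
                                 defects_per_mm2, cost_per_mm2=1.0):
    """Silicon area paid for, per working system, in units of cost_per_mm2."""
    y = poisson_yield(die_area_mm2, defects_per_mm2)
    return dies_per_system * die_area_mm2 * cost_per_mm2 / y

D0 = 0.001  # assumed defect density: 0.001 defects/mm^2 (0.1 defects/cm^2)
monolithic = silicon_cost_per_good_system(800, 1, D0)
chiplets = silicon_cost_per_good_system(200, 4, D0)
print(f"monolithic 800 mm^2: {monolithic:.0f}")
print(f"4 x 200 mm^2 chiplets: {chiplets:.0f}")
```

Under these assumptions the same total silicon area costs noticeably less as four chiplets, because a defect discards 200 mm² instead of 800 mm².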

Survey¶

Year	Venue	Authors	Title	Tags
2020	Electronics	NUDT	Chiplet Heterogeneous Integration Technology—Status and Challenges	heterogeneous integration technology; interconnect interfaces and protocols; packaging technology
2022	CCF THPC	ICT	Survey on chiplets: interface, interconnect and integration methodology	development history; interfaces and protocols; packaging technology; EDA tool; standardization of chiplet technology
2024	IEEE CASS	THU	Wafer-Scale Computing: Advancements, Challenges, and Future Perspectives	wafer-scale chip architecture; compiler tool chain; integration technology; wafer-scale system; fault tolerance

Multimodal AI chiplets¶

Year	Venue	Authors	Title	Tags	P	E	N
2024	MICRO	CA	SCAR: Scheduling Multi-Model AI Workloads on Heterogeneous Multi-Chiplet Module Accelerators	Hierarchical Scheduling Framework; Time Windowing; Chiplet-level Scheduling	3	3	3

Cost Analysis¶

Year	Venue	Authors	Title	Tags	P	E	N
2025	arXiv	ASU	CATCH: a Cost Analysis Tool for Co-optimization of chiplet-based Heterogeneous systems	heterogeneous chiplet system modeling; DSE over chiplet size, I/O, and connections

3D IC¶

Solution: 3D IC technology enables higher integration density, shorter interconnects, and improved performance by stacking multiple active layers in a single device.

General 3D IC¶

Year	Venue	Authors	Title	Tags
2019	GLSVLSI	Boston University	An Overview of Thermal Challenges and Opportunities for Monolithic 3D ICs	TSV-based 3D integration; Mono3D integration with nanoscale monolithic inter-tier vias; influence of lateral heat flow and inter-connection
2019	ECTC	TSMC	System on Integrated Chips (SoIC) for 3D Heterogeneous Integration	system on integrated chips; SoIC package integration; reliability of SoIC bond, TSV, and TDV
2020	DATE	Georgia Tech	Macro-3D: A Physical Design Methodology for Face-to-Face-Stacked Heterogeneous 3D ICs	face-to-face stack; separate 2D floorplans generation; memory-on-logic projection
2022	IEEE Micro	Cerebras	Cerebras Architecture Deep Dive: First Look Inside the Hardware/Software Co-Design for Deep Learning	fine-grained dataflow scheduling; high-bandwidth, low-latency fabric design; weight streaming

Interconnection¶

Year	Venue	Authors	Title	Tags	P	E	N
2025	HPCA	Fudan	EIGEN: Enabling Efficient 3DIC Interconnect with Heterogeneous Dual-Layer Network-on-Active-Interposer	Dual-layer interconnect architecture; Reinforcement learning routing; Switch-programmable interconnect	3	2	3

Design Space Exploration¶

Year	Venue	Authors	Title	Tags	P	E	N
2025	arXiv	SJTU	Cool-3D: An End-to-End Thermal-Aware Framework for Early-Phase Design Space Exploration of Microfluidic-Cooled 3DICs	end-to-end thermal-aware framework; microfluidic cooling integration; Pre-RTL design space exploration; floorplan designer; microfluidic cooling strategy generator

Benchmarks¶

Year	Venue	Authors	Title	Tags	P	E	N
2025	arXiv	NJU	Open3DBench: Open-Source Benchmark for 3D-IC Backend Implementation and PPA Evaluation	open-source 3D-IC benchmark; modular 3D partitioning and placement; Open3D-DMP algorithm for cross-die co-placement; comprehensive PPA evaluation with thermal simulation

SpMM, SpGEMM, SDDMM hardware accelerator¶

Tiling hardware¶

Year	Venue	Authors	Title	Tags	P	E	N
2023	ASPLOS	UC && UIUC && NVIDIA	Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling (DRT)	Dynamic Reflexive Tiling (DRT) algorithm; dynamically adjust tile shapes at runtime based on sparsity of tensors; assembling uniform micro tiles into non-uniform macro tiles	3	3	2
2023	MICRO	MIT && NVIDIA	Tailors: Accelerating Sparse Tensor Algebra by Overbooking Buffer Capacity	overbooking tiling strategy; Swiftiles statistical sampling method; low-overhead hardware mechanism Tailors	3	3	3
2025	HPCA	THU	HYTE: Flexible Tiling for Sparse Accelerators via Hybrid Static-Dynamic Approaches	hybrid static-dynamic framework; selecting a near-optimal initial tiling scheme; dynamic fine-tuning of tile shapes; coordinated, efficient management of both data and metadata in on-chip/off-chip buffers	3	3	3

Dataflow hardware¶

Year	Venue	Authors	Title	Tags	P	E	N
2023	ASPLOS	THU && DAMO && Northwestern University	SPADA: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow	highly diverse sparsity patterns; Window-based Adaptive Dataflow; dynamically select the optimal window shape configuration based on the similarity of sparse patterns	3	2	3
2023	ASPLOS	Universidad de Murcia && Georgia Tech && NVIDIA	Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing	dynamically adaptable multi-dataflow SpMSpM accelerator; Merger-Reduction Network; configurable tree-based topology; customized L1 memory hierarchy comprising a read-only FIFO, a low-power cache, and a PSRAM for partial sums	3	3	3
2025	MICRO	University of Maryland	Misam: Machine Learning Assisted Dataflow Selection in Accelerators for Sparse Matrix Multiplication	FPGA; machine learning-based framework for dynamic dataflow selection; intelligent hardware reconfiguration; decision tree; intelligent reconfiguration engine	3	2	4
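
For readers new to the "dataflow" vocabulary used in this table, the sketch below shows the three classic SpGEMM loop orders (inner product, outer product, and row-wise or Gustavson) in plain Python. The dict-of-dicts matrix format and all function names are assumptions for illustration and do not reflect how any listed accelerator stores or schedules data.

```python
# Hypothetical reference implementations of three SpGEMM dataflows.
# A sparse matrix is stored as {row: {col: value}}.
from collections import defaultdict

def transpose(m):
    t = defaultdict(dict)
    for i, row in m.items():
        for j, v in row.items():
            t[j][i] = v
    return dict(t)

def spgemm_inner(a, b):
    """Inner product: each C[i][j] is a dot product of A's row i and B's column j."""
    b_cols = transpose(b)
    c = {}
    for i, a_row in a.items():
        for j, b_col in b_cols.items():
            common = a_row.keys() & b_col.keys()   # index intersection
            if common:
                c.setdefault(i, {})[j] = sum(a_row[k] * b_col[k] for k in common)
    return c

def spgemm_outer(a, b):
    """Outer product: column k of A times row k of B yields partial matrices to merge."""
    a_cols = transpose(a)
    c = defaultdict(lambda: defaultdict(float))
    for k, a_col in a_cols.items():
        for i, av in a_col.items():
            for j, bv in b.get(k, {}).items():
                c[i][j] += av * bv
    return {i: dict(row) for i, row in c.items()}

def spgemm_rowwise(a, b):
    """Row-wise (Gustavson): row i of C is a scaled sum of the rows of B selected by A's row i."""
    c = {}
    for i, a_row in a.items():
        acc = defaultdict(float)
        for k, av in a_row.items():
            for j, bv in b.get(k, {}).items():
                acc[j] += av * bv
        if acc:
            c[i] = dict(acc)
    return c

A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
C = spgemm_rowwise(A, B)
```

All three produce the same product but differ in reuse and merging cost: inner product spends effort on index intersection, outer product on merging partial matrices, and row-wise on irregular accesses to rows of B. That trade-off is why the best choice depends on the sparsity pattern.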