Solution: Use layer fusion to process multiple layers of a neural network as a single fused stage so that intermediate feature maps stay on chip. This reduces the off-chip memory accesses and data movement required during inference, leading to faster execution and lower power consumption.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2016 | MICRO | SBU | Fused-Layer CNN Accelerators | fuse the processing of multiple CNN layers by modifying the order in which the input data are brought on chip | | | |
| 2025 | TC | KU Leuven | Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators | fine-grain mapping paradigm; mapping of layer-fused DNNs on heterogeneous dataflow accelerator architectures; memory- and communication-aware latency analysis; constraint optimization | | | |
| 2024 | SOCC | IIT Hyderabad | Hardware-Aware Network Adaptation using Width and Depth Shrinking including Convolutional and Fully Connected Layer Merging | Width Shrinking: reduces the number of feature maps in CNN layers; Depth Shrinking: merges convolutional and fully connected layers | | | |
| 2024 | ICSAI | MIT | LoopTree: Exploring the Fused-Layer Dataflow Accelerator Design Space | design space that supports a set of tiling, recomputation, and retention choices and their combinations; model that validates the design space | | | |
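
The sketch below is a minimal, plain-Python illustration of the fusion idea (1-D convolutions and a hypothetical tile size, not any paper's exact dataflow): two layers are evaluated tile by tile, so the intermediate feature map only ever exists as a small on-chip-sized buffer instead of being written to and re-read from DRAM.

```python
# Minimal layer-fusion sketch: compute conv1d(conv1d(x, w1), w2) one output
# tile at a time, materializing only the slice of the intermediate feature
# map that the second layer needs (the sizes here are illustrative).

def conv1d(x, w):
    """Valid 1-D convolution: len(out) = len(x) - len(w) + 1."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]

def fused_two_layers(x, w1, w2, tile=4):
    k1, k2 = len(w1), len(w2)
    n_inter = len(x) - k1 + 1          # full intermediate length (never stored)
    n_out = n_inter - k2 + 1           # final output length
    out = []
    for t0 in range(0, n_out, tile):
        t1 = min(t0 + tile, n_out)
        # This output tile needs intermediate elements t0 .. t1 + k2 - 2,
        # which in turn need input elements t0 .. t1 + k1 + k2 - 3.
        inter_tile = conv1d(x[t0:t1 + k1 + k2 - 2], w1)   # small on-chip buffer
        out.extend(conv1d(inter_tile, w2))
    return out

if __name__ == "__main__":
    x = list(range(16))
    w1, w2 = [1, 2, 1], [1, -1]
    assert fused_two_layers(x, w1, w2) == conv1d(conv1d(x, w1), w2)
    print("fused, tiled execution matches layer-by-layer execution")
```
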
Challenge: LLM accelerators face challenges in memory bandwidth, power consumption, and the need for efficient data movement.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2024 | DATE | NTU | ViTA: A Highly Efficient Dataflow and Architecture for Vision Transformers | highly efficient memory-centric dataflow; fused special function module for non-linear functions; a comprehensive DSE of ViTA kernels and VMUs | | | |
| 2025 | arXiv | SJTU | ROMA: A Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM | hybrid ROM-SRAM architecture for on-device LLM; B-ROM design for area-efficient ROM; fused cell integration of ROM and compute unit; QLoRA rank adaptation for task-specific tuning; on-chip storage optimization for quantized models | | | |
| 2025 | ISCA | Duke | Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression | | | | |
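
To see why memory bandwidth dominates, a back-of-envelope roofline check is useful: during single-batch decoding, every generated token must stream essentially all weights from DRAM. The model size, peak throughput, and bandwidth below are assumed round numbers for illustration, not figures from the listed papers.

```python
# Back-of-envelope check (all numbers assumed) for why single-batch LLM
# decoding is memory-bandwidth bound rather than compute bound.

params = 7e9                      # assumed 7B-parameter model
bytes_per_weight = 2              # FP16 weights
flops_per_token = 2 * params      # one multiply-accumulate (2 FLOPs) per weight
bytes_per_token = params * bytes_per_weight

peak_flops = 200e12               # assumed 200 TFLOP/s accelerator peak
dram_bw = 1e12                    # assumed 1 TB/s DRAM/HBM bandwidth

t_compute = flops_per_token / peak_flops   # ~0.07 ms
t_memory = bytes_per_token / dram_bw       # ~14 ms
print(f"compute-bound time per token: {t_compute * 1e3:.2f} ms")
print(f"memory-bound time per token : {t_memory * 1e3:.2f} ms")
# Streaming the weights takes ~200x longer than the arithmetic, which is why
# compression, quantization, and smarter data movement are the main levers.
```
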
Solution: Quantized DNN accelerators are designed to efficiently execute quantized neural networks, which use lower-precision representations for weights and activations.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2018 | ISCA | SNU | Energy-Efficient Neural Network Accelerator Based on Outlier-Aware Low-Precision Computation | accelerator architecture for outlier-aware quantized models; outlier-aware low-precision computation; separate outlier MAC unit | 4 | 3 | 2 |
| 2024 | DAC | ASU | Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference | composite data type Logarithmic Posits (LP); automated post-training LP Quantization (LPQ) framework based on genetic algorithms; mixed-precision LP Accelerator (LPA) | 3 | 3 | 2 |
| 2023 | HPCA | UPC | Mix-GEMM: An efficient HW-SW Architecture for Mixed-Precision Quantized Deep Neural Networks Inference on Edge Devices | complete mixed-precision flexibility; hardware accelerator & BLIS-based library with custom RISC-V ISA extensions | | | |
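
As a concrete illustration of what such an accelerator executes, here is a minimal sketch of a symmetric per-tensor INT8 scheme (the scheme and values are assumptions, not a specific paper's method): operands are stored as 8-bit integers, the MAC array works purely on integers, and one rescale at the end recovers an approximate real-valued result.

```python
# Minimal symmetric per-tensor INT8 quantization sketch (assumed scheme):
# quantize, run an integer-only dot product, then rescale the accumulator.

def quantize(x, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    scale = (max(abs(v) for v in x) / qmax) or 1.0      # avoid divide-by-zero
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in x]
    return q, scale

def int_dot(qa, qb):
    # what the PE array computes: pure integer multiply-accumulate (int32 acc)
    return sum(a * b for a, b in zip(qa, qb))

acts = [0.4, -1.2, 0.05, 0.9]
wts = [0.7, 0.1, -0.3, 0.5]
qa, sa = quantize(acts)
qw, sw = quantize(wts)
approx = int_dot(qa, qw) * sa * sw      # dequantize once, after accumulation
exact = sum(a * w for a, w in zip(acts, wts))
print(f"int8 result {approx:.4f} vs fp32 result {exact:.4f}")
```
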
Solution: Bit-sliced DNN accelerators decompose operands into smaller bit-slices, enabling more efficient processing and reducing memory and compute requirements.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2018 | ISCA | Georgia Tech | Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network | accelerator for layer-aware quantized DNNs; bit-flexible computation unit; block-structured instruction set architecture | 4 | 3 | 3 |
| 2023 | HPCA | KAIST | Sibia: Signed Bit-slice Architecture for Dense DNN Acceleration with Slice-level Sparsity Exploitation | signed bit-slice representation; flexible zero-skipping processing element | 3 | 3 | 4 |
| 2024 | HPCA | KU Leuven | BitWave: Exploiting Column-Based Bit-Level Sparsity for Deep Learning Acceleration | bit-column sparsity for both computation reduction and data compression; Single-shot Bit-Flip post-training | 3 | 3 | 3 |
| 2025 | HPCA | POSTECH | Panacea: Novel DNN Accelerator using Accuracy-Preserving Asymmetric Quantization and Energy-Saving Bit-Slice Sparsity | Asymmetrically-Quantized bit-Slice GEMM; Zero-Point Manipulation and Distribution-based Bit-Slicing to increase sparsity | 3 | 3 | 4 |
| 2025 | HPCA | Yonsei | Bit-slice Architecture for DNN Acceleration with Slice-level Sparsity Enhancement and Exploitation | both input and weight sparsity at the bit-slice level; 8-bit data processing with 4-bit multipliers; scale regularization during training to enhance sparsity | | | |
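
The arithmetic behind bit-slicing is easy to show in isolation. The sketch below (in the spirit of Bit Fusion-style composable units, not any single paper's hardware) rebuilds an unsigned 8x8-bit multiply from four 4x4-bit slice products; a sparsity-aware design simply skips the slice products whose operands are zero.

```python
# Bit-sliced multiply sketch: split unsigned 8-bit operands into 4-bit slices,
# compute small 4x4 partial products, shift each to its weight, and accumulate.

SLICE = 4                     # bits per slice (assumed)

def slices(x, n_slices=2):
    """Little-endian 4-bit slices of an unsigned integer."""
    mask = (1 << SLICE) - 1
    return [(x >> (SLICE * i)) & mask for i in range(n_slices)]

def bit_sliced_mul(a, b):
    total = 0
    for i, sa in enumerate(slices(a)):
        for j, sb in enumerate(slices(b)):
            if sa == 0 or sb == 0:
                continue                      # slice-level sparsity: skip work
            total += (sa * sb) << (SLICE * (i + j))
    return total

assert bit_sliced_mul(173, 92) == 173 * 92
print("8-bit multiply reconstructed from 4-bit slice products")
```
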
Solution: Reconfigurable accelerators not only break the trade-off between flexibility and performance, but also let hardware adapt to algorithm changes as quickly as software while maintaining high energy efficiency.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2018 | ASPLOS | Georgia Tech | MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects | augmented reduction tree (ART) to avoid link conflicts; chubby distribution tree for bandwidth optimization; ART-based virtual neuron construction | 4 | 3 | 2 |
| 2019 | JETCAS | MIT | Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices | hierarchical mesh NoC for multiple transmission modes; sparse PE architecture | 5 | 4 | 2 |
| 2023 | ASPLOS | UM & Georgia Tech | Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing | merger-reduction network for area efficiency; compression format conversion without a dedicated hardware module; dedicated L1 memory architecture for different access patterns | 4 | 3 | 2 |
| 2023 | MICRO | MIT | HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity | | | | |
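
The toy example below illustrates the kind of flexibility these designs target: the same matrix multiplication executed under two different dataflows (loop orders), selected by a configuration flag, much as a reconfigurable interconnect selects a mapping per layer. The dataflow names and loop orders are illustrative simplifications.

```python
# Same matmul, two dataflows: which loop is outermost determines which operand
# stays "stationary" in the PEs while the others stream past.

def matmul(A, B, dataflow="output_stationary"):
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]
    if dataflow == "output_stationary":
        for m in range(M):                 # each C[m][n] accumulates in place
            for n in range(N):
                for k in range(K):
                    C[m][n] += A[m][k] * B[k][n]
    elif dataflow == "weight_stationary":
        for k in range(K):                 # each B[k][n] is reused across rows
            for n in range(N):
                for m in range(M):
                    C[m][n] += A[m][k] * B[k][n]
    else:
        raise ValueError(f"unknown dataflow: {dataflow}")
    return C

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
assert matmul(A, B, "output_stationary") == matmul(A, B, "weight_stationary")
```
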
Solution: Dataflow architectures execute instructions based on the availability of data rather than a predetermined sequence, leading to more efficient use of resources and better performance in parallel processing and real-time systems.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2019 | ASPLOS | THU | Tangram: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators | | | | |
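
A minimal simulation of the firing rule makes the contrast with program-order execution clear; the graph and node operations below are purely illustrative.

```python
# Dataflow firing sketch: a node executes as soon as all of its operands are
# available, not when a program counter reaches it.

graph = {                       # node -> (operation, producer nodes)
    "a": (lambda: 3, []),
    "b": (lambda: 4, []),
    "c": (lambda x, y: x + y, ["a", "b"]),
    "d": (lambda x: x * 2, ["a"]),
    "e": (lambda x, y: x * y, ["c", "d"]),
}

values = {}                     # tokens produced so far
pending = set(graph)
while pending:
    # all nodes whose inputs are ready can fire in the same step (in parallel)
    ready = [n for n in pending if all(p in values for p in graph[n][1])]
    for n in ready:
        op, inputs = graph[n]
        values[n] = op(*(values[p] for p in inputs))
        pending.discard(n)

print(values["e"])              # 42 = (3 + 4) * (3 * 2)
```
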
Solution: Optimally map an application's dataflow graph onto the hardware fabric by simultaneously solving the tightly coupled problems of scheduling, placement, and routing under strict spatial and temporal resource constraints.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2017 | ASAP | Toronto | CGRA-ME: A Unified Framework for CGRA Modelling and Exploration | | | | |
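
The toy search below shows only the placement-and-routing slice of that problem (scheduling is omitted): nodes of a small dataflow graph are assigned to PEs of a 2x2 mesh so that every data edge lands on neighbouring PEs. Frameworks such as CGRA-ME solve the joint scheduling/placement/routing problem with far richer constraint models; the grid, graph, and adjacency rule here are assumptions for illustration.

```python
# Brute-force placement of a 4-node dataflow graph onto a 2x2 PE mesh, with
# "routing" reduced to a nearest-neighbour constraint on every data edge.

from itertools import permutations

dfg_nodes = ["load", "mul", "add", "store"]
dfg_edges = [("load", "mul"), ("mul", "add"), ("add", "store")]
pes = [(0, 0), (0, 1), (1, 0), (1, 1)]            # 2x2 processing-element grid

def adjacent(p, q):
    # mesh routing constraint: producer and consumer must be neighbours
    return abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1

def find_placement():
    for assignment in permutations(pes, len(dfg_nodes)):
        place = dict(zip(dfg_nodes, assignment))
        if all(adjacent(place[u], place[v]) for u, v in dfg_edges):
            return place
    return None                                   # no feasible mapping

print(find_placement())
# e.g. {'load': (0, 0), 'mul': (0, 1), 'add': (1, 1), 'store': (1, 0)}
```
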
Challenge: Many-core architectures integrate a large number of cores, but they face challenges in power consumption, performance scaling, and resource allocation.

Challenge: Cores share resources with one another, so achieving high performance requires coordinating access among cores to prevent conflicts and ensure data consistency.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2015 | HPCA | Cornell | Increasing Multicore System Efficiency through Intelligent Bandwidth Shifting | | | | |
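
As a hedged sketch of the bandwidth-shifting idea (a simplified policy, not the paper's exact algorithm): per-core prefetch usefulness is sampled periodically, and prefetch aggressiveness is shifted toward cores whose prefetches are accurate and away from cores that waste shared DRAM bandwidth. The threshold and degree levels are assumptions.

```python
# Simplified bandwidth-shifting policy (illustrative thresholds and levels):
# raise the prefetch degree of cores with accurate prefetches, lower it for
# cores whose prefetches mostly waste shared memory bandwidth.

ACCURACY_THRESHOLD = 0.5          # assumed cut-off for "useful" prefetching
DEGREES = [0, 1, 2, 4]            # allowed prefetch degrees (assumed)

def rebalance(cores):
    """cores: {core_id: {'useful': int, 'issued': int, 'degree': int}}"""
    for stats in cores.values():
        accuracy = stats["useful"] / stats["issued"] if stats["issued"] else 0.0
        idx = DEGREES.index(stats["degree"])
        if accuracy >= ACCURACY_THRESHOLD and idx < len(DEGREES) - 1:
            stats["degree"] = DEGREES[idx + 1]    # shift bandwidth toward it
        elif accuracy < ACCURACY_THRESHOLD and idx > 0:
            stats["degree"] = DEGREES[idx - 1]    # shift bandwidth away from it
    return cores

sample = {0: {"useful": 90, "issued": 100, "degree": 2},
          1: {"useful": 10, "issued": 100, "degree": 2}}
print(rebalance(sample))   # core 0 promoted to degree 4, core 1 demoted to 1
```
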
Challenge: The built-in crossbar of HBM FPGAs suffers from contention and low bandwidth during many-to-many unicast access, and standard HLS lacks support for efficient burst buffering.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2021 | FPGA | UCLA | HBM Connect: High-Performance HLS Interconnect for FPGA HBM | | | | |
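
An illustrative sketch of the burst-buffering idea this challenge points at (plain Python standing in for HLS logic; the burst length and channel mapping are assumptions): element-wise requests are buffered per destination pseudo-channel and issued as long bursts, which reduces per-transaction overhead and contention on the crossbar.

```python
# Burst-buffering sketch: group requests by destination HBM pseudo-channel and
# flush each group as a long burst instead of one transaction per element.

BURST_LEN = 8                      # assumed burst length in elements

def issue_bursts(requests, n_channels=4):
    """requests: list of (address, data); channel = address % n_channels."""
    buffers = {ch: [] for ch in range(n_channels)}
    bursts = []
    for addr, data in requests:
        ch = addr % n_channels
        buffers[ch].append((addr, data))
        if len(buffers[ch]) == BURST_LEN:          # full burst: issue it
            bursts.append((ch, buffers[ch]))
            buffers[ch] = []
    for ch, buf in buffers.items():                # flush any partial bursts
        if buf:
            bursts.append((ch, buf))
    return bursts

requests = [(i, i * i) for i in range(32)]
for ch, burst in issue_bursts(requests):
    print(f"channel {ch}: burst of {len(burst)} beats")
```
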