Solution: Use layer fusion to process multiple layers of a neural network as a single fused stage so that intermediate feature maps stay on chip. This reduces the off-chip memory accesses and data movement required during inference, leading to faster execution and lower power consumption.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2016 | MICRO | SBU | Fused-Layer CNN Accelerators | fuse the processing of multiple CNN layers by modifying the order in which the input data are brought on chip | | | |
| 2025 | TC | KU Leuven | Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators | fine-grain mapping paradigm; mapping of layer-fused DNNs on heterogeneous dataflow accelerator architectures; memory- and communication-aware latency analysis; constraint optimization | | | |
| 2024 | SOCC | IIT Hyderabad | Hardware-Aware Network Adaptation using Width and Depth Shrinking including Convolutional and Fully Connected Layer Merging | Width Shrinking: reduces the number of feature maps in CNN layers; Depth Shrinking: merges convolutional and fully connected layers | | | |
| 2024 | ICSAI | MIT | LoopTree: Exploring the Fused-Layer Dataflow Accelerator Design Space | design space that supports a set of tiling, recomputation, and retention choices and their combinations; model that validates the design space | | | |
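
The sketch below is a minimal, plain-Python illustration of the fusion idea (1-D convolutions and a hypothetical tile size, not any paper's exact dataflow): two layers are evaluated tile by tile, so the intermediate feature map only ever exists as a small on-chip-sized buffer instead of being written to and re-read from DRAM.

```python
# Minimal layer-fusion sketch: compute conv1d(conv1d(x, w1), w2) one output
# tile at a time, materializing only the slice of the intermediate feature
# map that the second layer needs (the sizes here are illustrative).

def conv1d(x, w):
    """Valid 1-D convolution: len(out) = len(x) - len(w) + 1."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]

def fused_two_layers(x, w1, w2, tile=4):
    k1, k2 = len(w1), len(w2)
    n_inter = len(x) - k1 + 1          # full intermediate length (never stored)
    n_out = n_inter - k2 + 1           # final output length
    out = []
    for t0 in range(0, n_out, tile):
        t1 = min(t0 + tile, n_out)
        # This output tile needs intermediate elements t0 .. t1 + k2 - 2,
        # which in turn need input elements t0 .. t1 + k1 + k2 - 3.
        inter_tile = conv1d(x[t0:t1 + k1 + k2 - 2], w1)   # small on-chip buffer
        out.extend(conv1d(inter_tile, w2))
    return out

if __name__ == "__main__":
    x = list(range(16))
    w1, w2 = [1, 2, 1], [1, -1]
    assert fused_two_layers(x, w1, w2) == conv1d(conv1d(x, w1), w2)
    print("fused, tiled execution matches layer-by-layer execution")
```
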
Challenge: LLM accelerators face challenges in memory bandwidth, power consumption, and the need for efficient data movement.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2024 | DATE | NTU | ViTA: A Highly Efficient Dataflow and Architecture for Vision Transformers | highly efficient memory-centric dataflow; fused special function module for non-linear functions; a comprehensive DSE of ViTA kernels and VMUs | | | |
| 2025 | arXiv | SJTU | ROMA: A Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM | hybrid ROM-SRAM architecture for on-device LLM; B-ROM design for area-efficient ROM; fused cell integration of ROM and compute unit; QLoRA rank adaptation for task-specific tuning; on-chip storage optimization for quantized models | | | |
| 2025 | ISCA | Duke | Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression | | | | |
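
To see why memory bandwidth dominates, a back-of-envelope roofline check is useful: during single-batch decoding, every generated token must stream essentially all weights from DRAM. The model size, peak throughput, and bandwidth below are assumed round numbers for illustration, not figures from the listed papers.

```python
# Back-of-envelope check (all numbers assumed) for why single-batch LLM
# decoding is memory-bandwidth bound rather than compute bound.

params = 7e9                      # assumed 7B-parameter model
bytes_per_weight = 2              # FP16 weights
flops_per_token = 2 * params      # one multiply-accumulate (2 FLOPs) per weight
bytes_per_token = params * bytes_per_weight

peak_flops = 200e12               # assumed 200 TFLOP/s accelerator peak
dram_bw = 1e12                    # assumed 1 TB/s DRAM/HBM bandwidth

t_compute = flops_per_token / peak_flops   # ~0.07 ms
t_memory = bytes_per_token / dram_bw       # ~14 ms
print(f"compute-bound time per token: {t_compute * 1e3:.2f} ms")
print(f"memory-bound time per token : {t_memory * 1e3:.2f} ms")
# Streaming the weights takes ~200x longer than the arithmetic, which is why
# compression, quantization, and smarter data movement are the main levers.
```
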
Solution: Quantized DNN accelerators are designed to efficiently execute quantized neural networks, which use lower-precision representations for weights and activations.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2018 | ISCA | SNU | Energy-Efficient Neural Network Accelerator Based on Outlier-Aware Low-Precision Computation | accelerator architecture for outlier-aware quantized models; outlier-aware low-precision computation; separate outlier MAC unit | 4 | 3 | 2 |
| 2024 | DAC | ASU | Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference | composite data type Logarithmic Posits (LP); automated post-training LP Quantization (LPQ) framework based on genetic algorithms; mixed-precision LP Accelerator (LPA) | 3 | 3 | 2 |
| 2023 | HPCA | UPC | Mix-GEMM: An efficient HW-SW Architecture for Mixed-Precision Quantized Deep Neural Networks Inference on Edge Devices | complete mixed-precision flexibility; hardware accelerator & BLIS-based library with custom RISC-V ISA extensions | | | |
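
As a concrete illustration of what such an accelerator executes, here is a minimal sketch of a symmetric per-tensor INT8 scheme (the scheme and values are assumptions, not a specific paper's method): operands are stored as 8-bit integers, the MAC array works purely on integers, and one rescale at the end recovers an approximate real-valued result.

```python
# Minimal symmetric per-tensor INT8 quantization sketch (assumed scheme):
# quantize, run an integer-only dot product, then rescale the accumulator.

def quantize(x, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    scale = (max(abs(v) for v in x) / qmax) or 1.0      # avoid divide-by-zero
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in x]
    return q, scale

def int_dot(qa, qb):
    # what the PE array computes: pure integer multiply-accumulate (int32 acc)
    return sum(a * b for a, b in zip(qa, qb))

acts = [0.4, -1.2, 0.05, 0.9]
wts = [0.7, 0.1, -0.3, 0.5]
qa, sa = quantize(acts)
qw, sw = quantize(wts)
approx = int_dot(qa, qw) * sa * sw      # dequantize once, after accumulation
exact = sum(a * w for a, w in zip(acts, wts))
print(f"int8 result {approx:.4f} vs fp32 result {exact:.4f}")
```
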
Solution: Bit-sliced DNN accelerators decompose operands into smaller bit-slices, enabling more efficient processing and reducing memory and compute requirements.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2018 | ISCA | Georgia Tech | Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network | accelerator for layer-aware quantized DNNs; bit-flexible computation unit; block-structured instruction set architecture | 4 | 3 | 3 |
| 2023 | HPCA | KAIST | Sibia: Signed Bit-slice Architecture for Dense DNN Acceleration with Slice-level Sparsity Exploitation | signed bit-slice representation; flexible zero-skipping processing element | 3 | 3 | 4 |
| 2024 | HPCA | KU Leuven | BitWave: Exploiting Column-Based Bit-Level Sparsity for Deep Learning Acceleration | bit-column sparsity for both computation reduction and data compression; Single-shot Bit-Flip post-training | 3 | 3 | 3 |
| 2025 | HPCA | POSTECH | Panacea: Novel DNN Accelerator using Accuracy-Preserving Asymmetric Quantization and Energy-Saving Bit-Slice Sparsity | Asymmetrically-Quantized bit-Slice GEMM; Zero-Point Manipulation and Distribution-based Bit-Slicing to increase sparsity | 3 | 3 | 4 |
| 2025 | HPCA | Yonsei | Bit-slice Architecture for DNN Acceleration with Slice-level Sparsity Enhancement and Exploitation | both input and weight sparsity at the bit-slice level; 8-bit data processing with 4-bit multipliers; scale regularization during training to enhance sparsity | | | |
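
The arithmetic behind bit-slicing is easy to show in isolation. The sketch below (in the spirit of Bit Fusion-style composable units, not any single paper's hardware) rebuilds an unsigned 8x8-bit multiply from four 4x4-bit slice products; a sparsity-aware design simply skips the slice products whose operands are zero.

```python
# Bit-sliced multiply sketch: split unsigned 8-bit operands into 4-bit slices,
# compute small 4x4 partial products, shift each to its weight, and accumulate.

SLICE = 4                     # bits per slice (assumed)

def slices(x, n_slices=2):
    """Little-endian 4-bit slices of an unsigned integer."""
    mask = (1 << SLICE) - 1
    return [(x >> (SLICE * i)) & mask for i in range(n_slices)]

def bit_sliced_mul(a, b):
    total = 0
    for i, sa in enumerate(slices(a)):
        for j, sb in enumerate(slices(b)):
            if sa == 0 or sb == 0:
                continue                      # slice-level sparsity: skip work
            total += (sa * sb) << (SLICE * (i + j))
    return total

assert bit_sliced_mul(173, 92) == 173 * 92
print("8-bit multiply reconstructed from 4-bit slice products")
```
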
Solution: Reconfigurable accelerators not only break the trade-off between flexibility and performance, but also let hardware adapt to algorithm changes as quickly as software while maintaining high energy efficiency.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2018 | ASPLOS | Georgia Tech | MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects | augmented reduction tree (ART) to avoid link conflicts; chubby distribution tree for bandwidth optimization; ART-based virtual neuron construction | 4 | 3 | 2 |
| 2019 | JETCAS | MIT | Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices | hierarchical mesh NoC for multiple transmission modes; sparse PE architecture | 5 | 4 | 2 |
| 2023 | ASPLOS | UM & Georgia Tech | Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing | merger-reduction network for area efficiency; compression format conversion without a dedicated hardware module; dedicated L1 memory architecture for different access patterns | 4 | 3 | 2 |
| 2023 | MICRO | MIT | HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity | | | | |
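
The toy example below illustrates the kind of flexibility these designs target: the same matrix multiplication executed under two different dataflows (loop orders), selected by a configuration flag, much as a reconfigurable interconnect selects a mapping per layer. The dataflow names and loop orders are illustrative simplifications.

```python
# Same matmul, two dataflows: which loop is outermost determines which operand
# stays "stationary" in the PEs while the others stream past.

def matmul(A, B, dataflow="output_stationary"):
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]
    if dataflow == "output_stationary":
        for m in range(M):                 # each C[m][n] accumulates in place
            for n in range(N):
                for k in range(K):
                    C[m][n] += A[m][k] * B[k][n]
    elif dataflow == "weight_stationary":
        for k in range(K):                 # each B[k][n] is reused across rows
            for n in range(N):
                for m in range(M):
                    C[m][n] += A[m][k] * B[k][n]
    else:
        raise ValueError(f"unknown dataflow: {dataflow}")
    return C

A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
assert matmul(A, B, "output_stationary") == matmul(A, B, "weight_stationary")
```
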
Solution: Dataflow architectures execute instructions based on the availability of data rather than a predetermined sequence, leading to more efficient use of resources and better performance in parallel processing and real-time systems.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2019 | ASPLOS | THU | Tangram: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators | | | | |
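
A minimal simulation of the firing rule makes the contrast with program-order execution clear; the graph and node operations below are purely illustrative.

```python
# Dataflow firing sketch: a node executes as soon as all of its operands are
# available, not when a program counter reaches it.

graph = {                       # node -> (operation, producer nodes)
    "a": (lambda: 3, []),
    "b": (lambda: 4, []),
    "c": (lambda x, y: x + y, ["a", "b"]),
    "d": (lambda x: x * 2, ["a"]),
    "e": (lambda x, y: x * y, ["c", "d"]),
}

values = {}                     # tokens produced so far
pending = set(graph)
while pending:
    # all nodes whose inputs are ready can fire in the same step (in parallel)
    ready = [n for n in pending if all(p in values for p in graph[n][1])]
    for n in ready:
        op, inputs = graph[n]
        values[n] = op(*(values[p] for p in inputs))
        pending.discard(n)

print(values["e"])              # 42 = (3 + 4) * (3 * 2)
```
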
Solution: Optimally map an application's dataflow graph onto the hardware fabric by simultaneously solving the tightly coupled problems of scheduling, placement, and routing under strict spatial and temporal resource constraints.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2017 | ASAP | Toronto | CGRA-ME: A Unified Framework for CGRA Modelling and Exploration | | | | |
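
The toy search below shows only the placement-and-routing slice of that problem (scheduling is omitted): nodes of a small dataflow graph are assigned to PEs of a 2x2 mesh so that every data edge lands on neighbouring PEs. Frameworks such as CGRA-ME solve the joint scheduling/placement/routing problem with far richer constraint models; the grid, graph, and adjacency rule here are assumptions for illustration.

```python
# Brute-force placement of a 4-node dataflow graph onto a 2x2 PE mesh, with
# "routing" reduced to a nearest-neighbour constraint on every data edge.

from itertools import permutations

dfg_nodes = ["load", "mul", "add", "store"]
dfg_edges = [("load", "mul"), ("mul", "add"), ("add", "store")]
pes = [(0, 0), (0, 1), (1, 0), (1, 1)]            # 2x2 processing-element grid

def adjacent(p, q):
    # mesh routing constraint: producer and consumer must be neighbours
    return abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1

def find_placement():
    for assignment in permutations(pes, len(dfg_nodes)):
        place = dict(zip(dfg_nodes, assignment))
        if all(adjacent(place[u], place[v]) for u, v in dfg_edges):
            return place
    return None                                   # no feasible mapping

print(find_placement())
# e.g. {'load': (0, 0), 'mul': (0, 1), 'add': (1, 1), 'store': (1, 0)}
```
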
Challenge: Many-core architectures integrate a large number of cores, but they face challenges in power consumption, performance scaling, and resource allocation.

Challenge: Cores share resources with one another, so achieving high performance requires coordinating access among cores to prevent conflicts and ensure data consistency.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2015 | HPCA | Cornell | Increasing Multicore System Efficiency through Intelligent Bandwidth Shifting | | | | |
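
As a hedged sketch of the bandwidth-shifting idea (a simplified policy, not the paper's exact algorithm): per-core prefetch usefulness is sampled periodically, and prefetch aggressiveness is shifted toward cores whose prefetches are accurate and away from cores that waste shared DRAM bandwidth. The threshold and degree levels are assumptions.

```python
# Simplified bandwidth-shifting policy (illustrative thresholds and levels):
# raise the prefetch degree of cores with accurate prefetches, lower it for
# cores whose prefetches mostly waste shared memory bandwidth.

ACCURACY_THRESHOLD = 0.5          # assumed cut-off for "useful" prefetching
DEGREES = [0, 1, 2, 4]            # allowed prefetch degrees (assumed)

def rebalance(cores):
    """cores: {core_id: {'useful': int, 'issued': int, 'degree': int}}"""
    for stats in cores.values():
        accuracy = stats["useful"] / stats["issued"] if stats["issued"] else 0.0
        idx = DEGREES.index(stats["degree"])
        if accuracy >= ACCURACY_THRESHOLD and idx < len(DEGREES) - 1:
            stats["degree"] = DEGREES[idx + 1]    # shift bandwidth toward it
        elif accuracy < ACCURACY_THRESHOLD and idx > 0:
            stats["degree"] = DEGREES[idx - 1]    # shift bandwidth away from it
    return cores

sample = {0: {"useful": 90, "issued": 100, "degree": 2},
          1: {"useful": 10, "issued": 100, "degree": 2}}
print(rebalance(sample))   # core 0 promoted to degree 4, core 1 demoted to 1
```
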
Challenge: The built-in crossbar of HBM FPGAs suffers from contention and low bandwidth during many-to-many unicast access, and standard HLS lacks support for efficient burst buffering.

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2021 | FPGA | UCLA | HBM Connect: High-Performance HLS Interconnect for FPGA HBM | | | | |
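
An illustrative sketch of the burst-buffering idea this challenge points at (plain Python standing in for HLS logic; the burst length and channel mapping are assumptions): element-wise requests are buffered per destination pseudo-channel and issued as long bursts, which reduces per-transaction overhead and contention on the crossbar.

```python
# Burst-buffering sketch: group requests by destination HBM pseudo-channel and
# flush each group as a long burst instead of one transaction per element.

BURST_LEN = 8                      # assumed burst length in elements

def issue_bursts(requests, n_channels=4):
    """requests: list of (address, data); channel = address % n_channels."""
    buffers = {ch: [] for ch in range(n_channels)}
    bursts = []
    for addr, data in requests:
        ch = addr % n_channels
        buffers[ch].append((addr, data))
        if len(buffers[ch]) == BURST_LEN:          # full burst: issue it
            bursts.append((ch, buffers[ch]))
            buffers[ch] = []
    for ch, buf in buffers.items():                # flush any partial bursts
        if buf:
            bursts.append((ch, buf))
    return bursts

requests = [(i, i * i) for i in range(32)]
for ch, burst in issue_bursts(requests):
    print(f"channel {ch}: burst of {len(burst)} beats")
```
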