TorchSparse: Efficient Point Cloud Inference Engine
3D Sparse Convolution; optimize Gather-Matmul-Scatter dataflow; Adaptive Matmul Grouping; Quantized and Vectorized Memory access
4
3
4
2023
MLSys
THU && SJTU
Exploiting Hardware Utilization and Adaptive Dataflow for Efficient Sparse Convolution in 3D Point Clouds
3D Sparse Convolution; optimize Gather-Matmul-Scatter and fetch-on-demand dataflow; Dynamic dataflow changing; coded-CSR mapping; Parallel Processing of different workloads without padding; Pointer
4
3
3
2023
MICRO
MIT
TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs
3D Sparse Convolution; optimize Implicit Gather-Matmul-Scatter; CUDA Sparse Kernel; Sparse Autotuner by detailed workload
Challenge: CXL and NVM offer byte-level access with higher speed and bandwidth than storage devices. Memory disaggregation that combines DRAM (high speed/BW, small capacity) with NVM (low speed/BW, large capacity) faces latency, bandwidth, and consistency challenges.
| Year | Venue | Authors | Title | Tags | P | E | N |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2025 | ATC | THU | DSA-2LM: A CPU-Free Tiered Memory Architecture with Intel DSA | CPU-free page migration in tiered memory via data streaming accelerator; adaptable migration algorithm for mixed 4KB/2MB pages; direct in-kernel DSA integration bypassing DMA | 3 | 3 | 4 |
| 2025 | ASPLOS | Purdue | EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation | | | | |
| | | | NeoMem: Hardware-Software Co-Design for CXL-Native Memory Tiering | device-side memory profiling unit; sketch-based hot page detector with error-bound estimation; dynamic hotness threshold adjustment based on statistics | 2 | 2 | 3 |
| 2025 | ASPLOS | Yale | PULSE: Accelerating Distributed Pointer-Traversals on Disaggregated Memory | | | | |
| | | | Survey of Disaggregated Memory: Cross-layer Technique Insights for Next-Generation Datacenters | Cross-layer classification of DM techniques; hardware-level categories; architectural-level classifications; system and runtime-level groupings; application-level optimizations such as general-purpose and domain-specific approaches | | | |
Challenge: Current chip designs are often monolithic and inflexible, leading to high costs and limited performance optimization opportunities.
Solution: Use chiplets to enable more flexible and cost-effective system designs by allowing the integration of specialized dies manufactured using optimal processes, leading to improved performance and yield.
Solution: 3D-IC technology enables higher integration density, shorter interconnects, and improved performance by stacking multiple active layers in a single device.
Open3DBench: Open-Source Benchmark for 3D-IC Backend Implementation and PPA Evaluation
open-source 3D-IC benchmark; modular 3D partitioning and placement; Open3D-DMP algorithm for cross-die co-placement; comprehensive PPA evaluation with thermal simulation
Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling (DRT)
Dynamic Reflexive Tiling (DRT) algorithm; dynamically adjust tile shapes at runtime based on sparsity of tensors; assembling uniform micro tiles into non-uniform macro tiles
3
3
2
2023
MICRO
MIT && NVIDIA
Tailors: Accelerating Sparse Tensor Algebra by Overbooking Buffer Capacity
HYTE: Flexible Tiling for Sparse Accelerators via Hybrid Static-Dynamic Approaches
hybrid static-dynamic framework; selecting a near-optimal initial tiling scheme; dynamic fine-tuning of tile shapes; coordinates efficient management of both data and metadata in on-chip/off-chip buffers
SPADA: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow
highly diverse sparsity patterns; Window-based Adaptive Dataflow; dynamically select the optimal window shape configuration based on the similarity of sparse patterns
3
2
3
2023
ASPLOS
Universidad de Murcia && Georgia Tech && NVIDIA
Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing
dynamically adaptable multi-dataflow SpMSpM accelerator; Merger-Reduction Network; configurable tree-based topology; a customized L1 memory hierarchy comprising a read-only FIFO, a low-power cache, and a PSRAM for partial sums
3
3
3
2025
MICRO
University of Maryland
Misam: Machine Learning Assisted Dataflow Selection in Accelerators for Sparse Matrix Multiplication