TORCHSPARSE: EFFICIENT POINT CLOUD INFERENCE ENGINE
3D Sparse Convolution; optimize Gather-Matmul-Scatter dataflow; Adaptive Matmul Grouping; Quantized and Vectorized Memory access
4
3
4
2023
MLSys
THU & SJTU
EXPLOITING HARDWARE UTILIZATION AND ADAPTIVE DATAFLOW FOR EFFICIENT SPARSE CONVOLUTION IN 3D POINT CLOUDS
3D Sparse Convolution; optimize Gather-Matmul-Scatter and fetch-on-demand dataflow; Dynamic dataflow changing; coded-CSR mapping; Parallel Processing of different workloads without padding; Pointer
4
3
3
2023
MICRO
MIT
TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs
3D Sparse Convolution; optimize Implicit Gather-Matmul-Scatter; CUDA sparse kernels; Sparse Autotuner tuned to the detailed workload
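Note: the Gather-Matmul-Scatter dataflow named in the three entries above can be sketched as follows. This is a minimal pure-Python illustration of the baseline dataflow only, not the optimized kernels of any of these papers. It assumes a submanifold convolution (output coordinates equal input coordinates), and the names `matmul`, `build_kernel_maps`, and `sparse_conv` are hypothetical helpers introduced for this sketch.

```python
from itertools import product


def matmul(a, b):
    """Dense (n x k) @ (k x m) product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def build_kernel_maps(coords, kernel_size=3):
    """For each kernel offset, list the (input_idx, output_idx) pairs.

    Assumption: output coordinates equal input coordinates.
    """
    index_of = {c: i for i, c in enumerate(coords)}
    r = kernel_size // 2
    maps = {}
    for offset in product(range(-r, r + 1), repeat=3):
        pairs = []
        for out_idx, c in enumerate(coords):
            neighbor = tuple(ci + oi for ci, oi in zip(c, offset))
            in_idx = index_of.get(neighbor)
            if in_idx is not None:
                pairs.append((in_idx, out_idx))
        if pairs:
            maps[offset] = pairs
    return maps


def sparse_conv(coords, feats, weights, kernel_size=3):
    """Sparse convolution: weights maps each offset to a C_in x C_out matrix."""
    c_out = len(next(iter(weights.values()))[0])
    out = [[0.0] * c_out for _ in coords]
    for offset, pairs in build_kernel_maps(coords, kernel_size).items():
        gathered = [feats[i] for i, _ in pairs]        # gather input rows
        partial = matmul(gathered, weights[offset])    # one matmul per offset
        for (_, o), row in zip(pairs, partial):        # scatter-add to outputs
            for ch, v in enumerate(row):
                out[o][ch] += v
    return out
```

Each kernel offset yields a matmul of a different size, because the number of (input, output) pairs varies per offset. That irregularity is what the grouping, padding-free, and autotuning keywords in the entries above refer to.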
Challenge: CXL and NVM offer byte-level access with higher speed and bandwidth than storage devices. Memory disaggregation that combines DRAM (high speed/bandwidth, small capacity) with NVM (low speed/bandwidth, large capacity) faces latency, bandwidth, and consistency challenges.
DSA-2LM: A CPU-Free Tiered Memory Architecture with Intel DSA
CPU-free page migration in tiered memory via data streaming accelerator; adaptable migration algorithm for mixed 4KB/2MB pages; direct in-kernel DSA integration bypassing DMA
3
3
4
2025
ASPLOS
Yale
PULSE: Accelerating Distributed Pointer-Traversals on Disaggregated Memory
Survey of Disaggregated Memory: Cross-layer Technique Insights for Next-Generation Datacenters
Cross-layer classification of DM techniques; hardware-level categories; architectural-level classifications; system and runtime-level groupings; application-level optimizations such as general-purpose and domain-specific approaches
Challenge: Current chip designs are often monolithic and inflexible, leading to high costs and limited performance optimization opportunities.
Solution: Use chiplets to enable more flexible and cost-effective system designs by integrating specialized dies, each manufactured in its optimal process, leading to improved performance and yield.
Solution: 3D-IC technology enables higher integration density, shorter interconnects, and improved performance by stacking multiple active layers in a single device.
Open3DBench: Open-Source Benchmark for 3D-IC Backend Implementation and PPA Evaluation
open-source 3D-IC benchmark; modular 3D partitioning and placement; Open3D-DMP algorithm for cross-die co-placement; comprehensive PPA evaluation with thermal simulation