# High-Performance Computing

## Parallel programming models

- Challenge: the complexity of managing a great number of cores and the heterogeneity of modern hardware

## Scientific computing applications and libraries

- Solution: enable researchers to model physical phenomena, process large datasets, and accelerate discovery across fields on HPC systems

## Parallel algorithms

- Solution: develop parallel algorithms for HPC systems, enabling scalable performance
- Challenge: balancing computation and memory-access overhead
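
To make the computation/memory-balance challenge concrete, below is a minimal baseline CSR SpMM CUDA kernel (an illustrative sketch; all names are ours, not taken from any paper in the tables below). Each thread computes one element of C = A·B, so every nonzero of a row of A is fetched from global memory once per column of B; the tiling and reordering techniques surveyed next exist largely to amortize exactly this traffic.

```cpp
// Baseline CSR SpMM sketch (illustrative only): C = A * B, where A is
// sparse in CSR and B, C are dense row-major. One thread per C element.
#include <cuda_runtime.h>

__global__ void csr_spmm_baseline(int M, int N,
                                  const int*   __restrict__ rowPtr,  // M+1 entries
                                  const int*   __restrict__ colIdx,  // nnz entries
                                  const float* __restrict__ vals,    // nnz entries
                                  const float* __restrict__ B,       // K x N
                                  float*       __restrict__ C) {     // M x N
    int row = blockIdx.y;                             // one A row per block row
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // one B column per thread
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    for (int p = rowPtr[row]; p < rowPtr[row + 1]; ++p)
        acc += vals[p] * B[colIdx[p] * N + col];      // irregular reads of B
    C[row * N + col] = acc;                           // coalesced write
}
```
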
### SpMM, SpGEMM, and SDDMM algorithms

#### Tiling Algorithms

| Year | Venue | Affiliation | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2019 | PPoPP | The Ohio State University | Adaptive Sparse Tiling for Sparse Matrix Multiplication (ASpT) | Adaptive Sparse Tiling; SpMM && SDDMM; 2D tiling; splits the matrix into row panels and classifies column segments as either "dense" or "sparse"; reordering to group dense columns contiguously (see the classification sketch after this table) | 3 | 3 | 2 |
| 2022 | PPoPP | China University of Petroleum-Beijing | TileSpGEMM: A Tiled Algorithm for Parallel Sparse General Matrix-Matrix Multiplication on GPUs | divide sparse matrices into fixed-size sparse tiles; SpGEMM; determining the tile structure of the result matrix via symbolic SpGEMM; generating the nonzero structure and row pointers for each tile using bitmask operations and binary search; performing numeric computation with an adaptive sparse or dense accumulator based on tile density | 4 | 3 | 3 |
| 2024 | SC | Indiana University && UIUC | Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication | Tall-and-Skinny Matrix Multiplication; SpGEMM; distributed-memory algorithm; 1D row partitioning combined with a virtual 2D tiling strategy; adaptively chooses between local and remote computation modes based on the sparsity pattern of each tile | 3 | 2 | 3 |
| 2024 | PPoPP | THU | A Row Decomposition-based Approach for Sparse Matrix Multiplication on GPUs (RoDe) | Row Decomposition; SpMM && SDDMM; decomposes each row into a regular part (containing a multiple of 32 nonzeros) and a residual part; block-splitting technique to achieve load balancing; sub-block pipelining technique to overlap computation and memory access | 3 | 3 | 2 |
| 2025 | TACO | HUST | ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel | heterogeneous collaboration methods; SpGEMM && SpMM && SDDMM; lightweight analysis to extract matrix features; dispatches panels of varying sparsity to either the CPU or the GPU using core-affinity analysis; overlaps synchronous computation with data transfer | 2 | 4 | 3 |
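
As a concrete illustration of the dense/sparse split used by ASpT-style adaptive tiling, here is a host-side sketch of the column-classification step (our own simplification, not the paper's implementation; the threshold is an assumed tuning parameter): within each row panel, columns holding enough nonzeros are flagged "dense" so they can be reordered to the front of the panel and processed with shared-memory reuse.

```cpp
// Host-side sketch (illustrative, not ASpT's code): classify the columns of
// one CSR row panel [rowBeg, rowEnd) as dense or sparse by nonzero count.
#include <vector>

std::vector<bool> classify_panel_columns(int numCols, int rowBeg, int rowEnd,
                                         const std::vector<int>& rowPtr,
                                         const std::vector<int>& colIdx,
                                         int threshold) {  // e.g. half the panel height
    std::vector<int> count(numCols, 0);
    for (int r = rowBeg; r < rowEnd; ++r)          // count nonzeros per column
        for (int p = rowPtr[r]; p < rowPtr[r + 1]; ++p)
            ++count[colIdx[p]];
    std::vector<bool> dense(numCols);
    for (int c = 0; c < numCols; ++c)              // dense columns get reordered
        dense[c] = (count[c] >= threshold);        // to the front of the panel
    return dense;
}
```
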
#### Reordering and Clustering Algorithms

| Year | Venue | Affiliation | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2025 | SC | Georgia Institute of Technology && University of Delaware | Improving SpGEMM Performance Through Matrix-Reordering and Cluster-wise Computation | hierarchical clustering; SpGEMM; new format called CSR_Cluster; identifies similar rows via a single SpGEMM operation A×A^T (see the row-overlap sketch after this table); processes clusters collectively to improve data reuse of B | 3 | 3 | 2 |
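
The quantity this reordering clusters on can be sketched directly: the column-pattern overlap between two CSR rows, which is what a single SpGEMM A×A^T produces for all row pairs at once. This is our own minimal illustration, not the paper's CSR_Cluster code; rows with large overlap read similar rows of B, so grouping them improves reuse.

```cpp
// Illustrative sketch: column-pattern overlap of two CSR rows, i.e. the
// nonzero-structure inner product that A x A^T computes for every row pair.
#include <vector>

int row_pattern_overlap(const std::vector<int>& rowPtr,
                        const std::vector<int>& colIdx, int r1, int r2) {
    int i = rowPtr[r1], iEnd = rowPtr[r1 + 1];
    int j = rowPtr[r2], jEnd = rowPtr[r2 + 1];
    int overlap = 0;
    while (i < iEnd && j < jEnd) {              // CSR column indices are sorted
        if      (colIdx[i] < colIdx[j]) ++i;
        else if (colIdx[i] > colIdx[j]) ++j;
        else { ++overlap; ++i; ++j; }           // shared column -> shared row of B
    }
    return overlap;
}
```
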
#### Using Shared Memory

| Year | Venue | Affiliation | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2020 | PPoPP | Graz University of Technology | spECK: Accelerating GPU Sparse Matrix-Matrix Multiplication through Lightweight Analysis | SpGEMM; a lightweight, multi-level analysis framework that dynamically selects and tunes the best algorithm; chooses between hashing, dense accumulation, and direct referencing for each row based on real-time matrix characteristics (see the dense-accumulator sketch after this table) | 3 | 3 | 2 |
| 2025 | SC | China University of Petroleum-Beijing | KAMI: Communication-Avoiding General Matrix Multiplication within a Single GPU | tensor cores as compute units; registers for local storage; shared memory for communication; 1D & 2D & 3D partitioning strategies to optimize data locality and reduce communication overhead | 2 | 2 | 2 |
| 2025 | HPCA | Hunan University && Arizona State University | HSMU-SpGEMM: Achieving High Shared Memory Utilization for Parallel Sparse General Matrix-Matrix Multiplication on Modern GPUs | a binary-search-based accumulator design; pre-generates a sorted column index array during the symbolic stage; tailored symbolic processing for matrices of different scales | 3 | 3 | 1 |
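
For intuition on the hash-versus-dense accumulator choice that spECK makes per row, here is a deliberately simplified CUDA kernel using only the dense path (an assumption-laden sketch of ours: the column count of B must fit a fixed shared-memory array, and C is written densely; real implementations fall back to hashing when it does not, and compact the result to CSR).

```cpp
// Simplified SpGEMM numeric-phase sketch with a dense shared-memory
// accumulator: one thread block computes one row of C = A * B.
// Assumes numCols <= MAX_COLS; spECK-style kernels switch to hashing
// (or direct referencing) when a dense accumulator does not fit.
#include <cuda_runtime.h>

#define MAX_COLS 4096   // assumed accumulator width (16 KB of shared memory)

__global__ void spgemm_row_dense_acc(const int*   __restrict__ aRowPtr,
                                     const int*   __restrict__ aColIdx,
                                     const float* __restrict__ aVals,
                                     const int*   __restrict__ bRowPtr,
                                     const int*   __restrict__ bColIdx,
                                     const float* __restrict__ bVals,
                                     float*       __restrict__ cDense,
                                     int numCols) {
    __shared__ float acc[MAX_COLS];
    int row = blockIdx.x;
    for (int c = threadIdx.x; c < numCols; c += blockDim.x) acc[c] = 0.0f;
    __syncthreads();
    // Each thread expands a slice of A's row; collisions land in fast SMEM.
    for (int p = aRowPtr[row] + threadIdx.x; p < aRowPtr[row + 1]; p += blockDim.x) {
        int   k    = aColIdx[p];
        float aVal = aVals[p];
        for (int q = bRowPtr[k]; q < bRowPtr[k + 1]; ++q)
            atomicAdd(&acc[bColIdx[q]], aVal * bVals[q]);
    }
    __syncthreads();
    for (int c = threadIdx.x; c < numCols; c += blockDim.x)
        cDense[row * numCols + c] = acc[c];   // compaction to CSR omitted here
}
```
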
#### Using Tensor Cores

| Year | Venue | Affiliation | Title | Tags | P | E | N |
|---|---|---|---|---|---|---|---|
| 2021 | SC | University of California | Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision | fine-grained sparsity; column-vector sparse encoding; Tensor-Core-based 1D Octet Tiling; reduced precision | 2 | 3 | 3 |
| 2023 | ATC | University of California | TC-GNN: Bridging Sparse GNN Computation and Dense Tensor Cores on GPUs | highly sparse and irregular graph operations; Sparse Graph Translation technique to condense nonzero columns into dense tiles; collaborative execution strategy between CUDA cores and TCUs (see the MMA sketch after this table) | 3 | 3 | 3 |
| 2023 | SC | Universidade da Coruña && ETH Zürich | VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores | vectorized grouping and two-level pruning; selecting important columns through vector-wise pruning within blocks; applying N:M sparsity per row | 2 | 3 | 3 |
| 2024 | ASPLOS | HKUST | DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores | memory-efficient storage format called ME-TCF; two-level TCU-Cache-Aware reordering method; runtime kernel optimizations; simulation-based adaptive selector | 3 | 4 | 4 |
| 2025 | IPDPS | HUST | BRP-SpMM: Block-Row Partition Based Sparse Matrix Multiplication with Tensor and CUDA Cores | Block-Row Partition; "Fixed Non-zero" strategy; customized storage format; two lightweight GPU kernels, one for the TC block part and one for the residual row part | 3 | 3 | 3 |
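
All of the formats above ultimately funnel work into the dense MMA primitive that tensor cores expose. As a reference point, here is the standard CUDA wmma tile multiply (requires sm_70 or newer); this is the vanilla API, not any paper's kernel: the papers' contribution lies in packing sparse data into such dense 16×16 tiles efficiently.

```cpp
// Plain wmma sketch: one warp multiplies a dense 16x16x16 half-precision
// tile pair on the tensor cores (the primitive the papers above target).
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void tc_tile_mma(const half* __restrict__ aTile,   // 16x16 row-major
                            const half* __restrict__ bTile,   // 16x16 row-major
                            float*      __restrict__ cTile) { // 16x16 row-major
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);
    wmma::load_matrix_sync(aFrag, aTile, 16);   // leading dimension = 16
    wmma::load_matrix_sync(bFrag, bTile, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag); // C += A * B on tensor cores
    wmma::store_matrix_sync(cTile, cFrag, 16, wmma::mem_row_major);
}
```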