
# High-Performance Computing

- **Parallel programming models.** Challenge: the complexity of a large number of cores and the heterogeneity of the hardware.
- **Scientific computing applications and libraries.** Solution: enable researchers to model physical phenomena, process large datasets, and accelerate discoveries across fields on HPC systems.
- **Parallel algorithms.** Solution: develop parallel algorithms for HPC systems, enabling scalable performance. Challenge: balancing computation and memory-access overhead.

## SpMM, SpGEMM, and SDDMM Algorithms
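
Before the tables, a quick reminder of what the three kernels compute. The naive loops below are a minimal reference sketch (CSR storage assumed), not any paper's method; SpGEMM appears only as a comment because its sparse output makes even the naive version an accumulator-design problem, which is exactly what the papers below optimize.

```cuda
// Reference semantics only: naive host-side loops over a CSR matrix A
// (rowPtr / colIdx / vals). Dense matrices are row-major. These define the
// math that the papers below optimize; none of this is performant code.

// SpMM: C = A * B with A sparse (M x N), B dense (N x K), C dense (M x K).
void spmm_ref(int M, int K, const int* rowPtr, const int* colIdx,
              const float* vals, const float* B, float* C) {
    for (int i = 0; i < M; ++i)
        for (int p = rowPtr[i]; p < rowPtr[i + 1]; ++p)
            for (int k = 0; k < K; ++k)
                C[i * K + k] += vals[p] * B[colIdx[p] * K + k];
}

// SDDMM: C = A .* (X * Y^T); a dense GEMM masked by A's sparsity pattern.
// X is M x K, Y is N x K; C reuses A's nonzero layout.
void sddmm_ref(int M, int K, const int* rowPtr, const int* colIdx,
               const float* vals, const float* X, const float* Y,
               float* Cvals) {
    for (int i = 0; i < M; ++i)
        for (int p = rowPtr[i]; p < rowPtr[i + 1]; ++p) {
            int j = colIdx[p];
            float dot = 0.f;
            for (int k = 0; k < K; ++k)
                dot += X[i * K + k] * Y[j * K + k];
            Cvals[p] = vals[p] * dot;   // scale the dot product by a_ij
        }
}

// SpGEMM: C = A * B with A, B, and C all sparse. C's nonzero structure is
// unknown up front, so practical algorithms pair a symbolic phase (size and
// structure of C) with a numeric phase built around an accumulator (hash,
// dense, sort-based, ...), as in the papers below.
```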

### Tiling Algorithms

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2019 | PPoPP | The Ohio State University | Adaptive Sparse Tiling for Sparse Matrix Multiplication (ASpT) | adaptive sparse tiling; SpMM & SDDMM; 2D tiling; splits the matrix into row panels and classifies column segments as either "dense" or "sparse"; reorders columns to group dense segments contiguously | 3 | 3 | 2 |
| 2022 | PPoPP | China University of Petroleum-Beijing | TileSpGEMM: A Tiled Algorithm for Parallel Sparse General Matrix-Matrix Multiplication on GPUs | divides sparse matrices into fixed-size sparse tiles; SpGEMM; determines the tile structure of the result matrix via symbolic SpGEMM; generates the nonzero structure and row pointers of each tile with bitmask operations and binary search; performs the numeric phase with an adaptive sparse or dense accumulator chosen by tile density | 4 | 3 | 3 |
| 2024 | SC | Indiana University & UIUC | Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication | tall-and-skinny matrix multiplication; SpGEMM; distributed-memory algorithm; 1D row partitioning combined with a virtual 2D tiling strategy; adaptively chooses between local and remote computation modes based on the sparsity pattern of each tile | 3 | 2 | 3 |
| 2024 | PPoPP | THU | A Row Decomposition-based Approach for Sparse Matrix Multiplication on GPUs (RoDe) | row decomposition; SpMM & SDDMM; decomposes each row into a regular part (a multiple of 32 nonzeros) and a residual part; block-splitting technique for load balancing; sub-block pipelining to overlap computation and memory access | 3 | 3 | 2 |
| 2025 | TACO | HUST | ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel | heterogeneous CPU-GPU collaboration; SpGEMM & SpMM & SDDMM; lightweight analysis to extract matrix features; dispatches panels of varying sparsity to CPU or GPU via core-affinity analysis; overlaps computation with data transfer | 2 | 4 | 3 |
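
A minimal sketch of the row-panel / dense-column idea that ASpT popularized and the other rows refine. All names, the layout (panelPtr, dense-first nonzero ordering), and the tile sizes are illustrative assumptions, not any paper's actual format; the point is that staging the B rows of frequently hit ("dense") column segments in shared memory converts repeated global loads into shared-memory reuse.

```cuda
// Illustrative sketch of row-panel tiling with dense/sparse column
// segments (ASpT-style). Layout assumption: each panel's nonzeros are
// stored dense-segment-first; for those, colPos indexes into the panel's
// denseCols list, while sparse nonzeros keep their global column id.
#define PANEL_ROWS 32   // rows per panel (assumed)
#define TILE_K     32   // B/C columns per thread block; assumes K % TILE_K == 0
#define MAX_DENSE  64   // cached dense column segments per panel (assumed)

__global__ void spmm_aspt_sketch(
    int K,
    const int*   panelPtr,   // nonzero range of each row panel
    const int*   denseEnd,   // end of each panel's dense-segment nonzeros
    const int*   denseCols,  // per panel: global ids of its dense columns
    const int*   denseCnt,   // per panel: how many dense columns
    const int*   rowIdx,     // panel-local row of each nonzero
    const int*   colPos,     // dense nnz: index into denseCols; else col id
    const float* vals,
    const float* B, float* C)        // row-major N x K and M x K
{
    __shared__ float Bs[MAX_DENSE][TILE_K];
    int panel = blockIdx.x;
    int k = blockIdx.y * TILE_K + threadIdx.x;   // this thread's B/C column

    // Stage the B rows of the panel's dense columns once per panel.
    for (int d = threadIdx.y; d < denseCnt[panel]; d += blockDim.y)
        Bs[d][threadIdx.x] = B[denseCols[panel * MAX_DENSE + d] * K + k];
    __syncthreads();

    // Walk the panel's nonzeros; dense ones hit shared memory.
    for (int p = panelPtr[panel] + threadIdx.y; p < panelPtr[panel + 1];
         p += blockDim.y) {
        float b = (p < denseEnd[panel])
                      ? Bs[colPos[p]][threadIdx.x]   // reused from staging
                      : B[colPos[p] * K + k];        // one-off global load
        int r = panel * PANEL_ROWS + rowIdx[p];
        atomicAdd(&C[r * K + k], vals[p] * b);  // rows shared across threadIdx.y
    }
}
// Assumed launch: dim3 block(TILE_K, 8), grid(numPanels, K / TILE_K).
```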

### Reordering Algorithms (Clustering Algorithms)

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2025 | SC | Georgia Institute of Technology & University of Delaware | Improving SpGEMM Performance Through Matrix-Reordering and Cluster-wise Computation | hierarchical clustering; SpGEMM; new CSR_Cluster format; identifies similar rows via a single SpGEMM operation (A × Aᵀ); processes each cluster collectively to improve data reuse of B | 3 | 3 | 2 |
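
The A × Aᵀ trick is easy to see at the level of a single row pair: for a binary A, entry (i, j) of A × Aᵀ counts the columns the two rows share, which are exactly the rows of B both would read during SpGEMM. The host-side sketch below (assumed names; sorted CSR column indices) computes that overlap directly; the paper's contribution is doing this at scale with one symbolic SpGEMM and packing high-overlap rows into its CSR_Cluster format.

```cuda
// Hedged host-side sketch of the row-similarity measure behind A x A^T.
// Counts the shared column indices of rows i and j of a CSR matrix whose
// per-row column lists are sorted, via a two-pointer merge.
int row_overlap(const int* rowPtr, const int* colIdx, int i, int j) {
    int a = rowPtr[i], aEnd = rowPtr[i + 1];
    int b = rowPtr[j], bEnd = rowPtr[j + 1];
    int shared = 0;
    while (a < aEnd && b < bEnd) {           // merge two sorted index lists
        if      (colIdx[a] < colIdx[b]) ++a;
        else if (colIdx[a] > colIdx[b]) ++b;
        else { ++shared; ++a; ++b; }
    }
    return shared;  // = (A * A^T)(i, j) for a 0/1 matrix A
}
// Rows with high overlap touch the same rows of B, so scheduling a cluster
// of them on one thread block lets the loaded B rows be reused cluster-wide.
```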

### Using Shared Memory

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2020 | PPoPP | Graz University of Technology | spECK: Accelerating GPU Sparse Matrix-Matrix Multiplication through Lightweight Analysis | a lightweight, multi-level analysis framework that dynamically selects and tunes the best algorithm; chooses among hashing, dense accumulation, and direct referencing for each row based on run-time matrix characteristics | 3 | 3 | 2 |
| 2025 | SC | China University of Petroleum-Beijing | KAMI: Communication-Avoiding General Matrix Multiplication within a Single GPU | tensor cores as compute units; registers as local storage; shared memory for communication; 1D, 2D, and 3D partitioning strategies to optimize data locality and reduce communication overhead | 2 | 2 | 2 |
| 2025 | HPCA | Hunan University & Arizona State University | HSMU-SpGEMM: Achieving High Shared Memory Utilization for Parallel Sparse General Matrix-Matrix Multiplication on Modern GPUs | a binary-search-based accumulator design; pre-generates a sorted column-index array during the symbolic stage; tailored symbolic processing for matrices of different scales | 3 | 3 | 1 |
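
A minimal sketch of the shared-memory hash accumulator this line of work builds on: one thread block owns one row of C, hashes every partial product on its output column, and compacts the occupied slots at the end. Table size, the hash function, and the output buffering are assumptions; spECK's analysis stage is what decides, per row, whether such a table fits in shared memory (falling back to a dense accumulator otherwise), and HSMU-SpGEMM replaces the hashing with binary search over a pre-sorted column-index array.

```cuda
// Hedged sketch of a per-row shared-memory hash accumulator for SpGEMM.
// Assumes the row's distinct output columns fit in TABLE slots (the kind of
// fact a symbolic/analysis stage establishes), cCnt is zero-initialized,
// and each row's output buffer holds TABLE entries.
#define TABLE 512

__global__ void spgemm_row_hash(
    const int* aRowPtr, const int* aColIdx, const float* aVals,
    const int* bRowPtr, const int* bColIdx, const float* bVals,
    int* cCols, float* cVals, int* cCnt)     // per-row output buffers
{
    __shared__ int   keys[TABLE];
    __shared__ float acc[TABLE];
    int row = blockIdx.x;

    for (int s = threadIdx.x; s < TABLE; s += blockDim.x) {
        keys[s] = -1; acc[s] = 0.f;          // -1 marks an empty slot
    }
    __syncthreads();

    // Expand: hash every partial product a(row,k) * b(k,j) on column j.
    for (int pa = aRowPtr[row]; pa < aRowPtr[row + 1]; ++pa) {
        int   k = aColIdx[pa];
        float a = aVals[pa];
        for (int pb = bRowPtr[k] + threadIdx.x; pb < bRowPtr[k + 1];
             pb += blockDim.x) {
            int j = bColIdx[pb];
            int h = (j * 2654435761u) % TABLE;
            while (true) {                   // linear probing
                int old = atomicCAS(&keys[h], -1, j);
                if (old == -1 || old == j) { // slot claimed or matched
                    atomicAdd(&acc[h], a * bVals[pb]);
                    break;
                }
                h = (h + 1) % TABLE;
            }
        }
    }
    __syncthreads();

    // Compact: write the occupied slots out (unsorted) to the row's buffer.
    for (int s = threadIdx.x; s < TABLE; s += blockDim.x)
        if (keys[s] != -1) {
            int pos = atomicAdd(&cCnt[row], 1);
            cCols[row * TABLE + pos] = keys[s];
            cVals[row * TABLE + pos] = acc[s];
        }
}
```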

### Using Tensor Cores

| Year | Venue | Authors | Title | Tags | P | E | N |
|------|-------|---------|-------|------|---|---|---|
| 2021 | SC | University of California | Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision | fine-grained sparsity; column-vector sparse encoding; tensor-core-based 1D octet tiling; reduced precision | 2 | 3 | 3 |
| 2023 | ATC | University of California | TC-GNN: Bridging Sparse GNN Computation and Dense Tensor Cores on GPUs | highly sparse and irregular graph operations; sparse graph translation technique; collaborative execution strategy between CUDA cores and TCUs | 3 | 3 | 3 |
| 2023 | SC | Universidade da Coruña & ETH Zürich | VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores | vectorized grouping and two-level pruning; selects important columns via vector-wise pruning within blocks; applies N:M sparsity per row | 2 | 3 | 3 |
| 2024 | ASPLOS | HKUST | DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores | memory-efficient storage format (ME-TCF); two-level TCU-cache-aware reordering; runtime kernel optimizations; simulation-based adaptive selector | 3 | 4 | 4 |
| 2025 | IPDPS | HUST | BRP-SpMM: Block-Row Partition Based Sparse Matrix Multiplication with Tensor and CUDA Cores | block-row partition; "fixed non-zero" strategy; customized storage format; two lightweight GPU kernels, one for the TC blocks and one for the residual rows | 3 | 3 | 3 |
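
All five rows share one recipe: compact the nonzero columns of a 16-row panel into dense 16x16 tiles, then drive those tiles through the tensor-core MMA pipeline. The sketch below shows that recipe with the portable WMMA API; the tile metadata (tilePtr, tileCols, tileA) is an assumed layout, since each paper derives its tiles differently (sparse graph translation in TC-GNN, ME-TCF in DTC-SpMM, fixed-non-zero blocks in BRP-SpMM).

```cuda
// Hedged sketch of tensor-core SpMM over compacted 16x16 tiles (sm_70+).
// One warp handles one 16-row panel and one 16-column slice of B/C.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void tc_panel_spmm(
    int K,                       // columns of B/C; assumes K % 16 == 0
    const int*  tilePtr,         // tile range of each 16-row panel
    const half* tileA,           // per tile: dense 16x16 values, row-major
    const int*  tileCols,        // per tile: the 16 compacted column ids
    const half* B, float* C)     // B: N x K half, C: M x K float, row-major
{
    int panel = blockIdx.x;
    int kOff  = blockIdx.y * 16;             // this warp's 16-column slice
    __shared__ half Bs[16 * 16];

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
    wmma::fill_fragment(c, 0.0f);

    for (int t = tilePtr[panel]; t < tilePtr[panel + 1]; ++t) {
        // Gather the 16 rows of B named by this tile's compacted columns.
        for (int e = threadIdx.x; e < 256; e += 32)
            Bs[e] = B[tileCols[t * 16 + e / 16] * K + kOff + e % 16];
        __syncwarp();
        wmma::load_matrix_sync(a, tileA + t * 256, 16);
        wmma::load_matrix_sync(b, Bs, 16);
        wmma::mma_sync(c, a, b, c);          // accumulate across the tiles
        __syncwarp();
    }
    wmma::store_matrix_sync(C + panel * 16 * K + kOff, c, K,
                            wmma::mem_row_major);
}
// Assumed launch: dim3 grid(numPanels, K / 16), 32 threads per block.
```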