April 4, 2025
Conference Paper

Accelerating GNNs on GPU Sparse Tensor Cores through N:M Sparsity-Oriented Graph Reordering

Abstract

Recent advancements in GPU hardware have introduced support for N:M sparse patterns, which can yield substantial performance gains. Graphs in Graph Neural Networks (GNNs) are typically sparse, but the sparsity is often irregular and does not conform to such patterns. In this paper, we propose a novel graph reordering algorithm, the first of its kind, that reshapes irregular graph data into the N:M structured sparse pattern at the tile level, allowing linear-algebra-based graph operations in GNNs to benefit from N:M sparse hardware. The optimization is lossless, maintaining the accuracy of the GNN. It removes 98-100% of the violations of the N:M sparse patterns at the vector level, and increases the proportion of conforming graphs in the SuiteSparse collection from 5-9% to 88.7-93.5%. On A100 GPUs, the optimization accelerates Sparse Matrix-Matrix Multiplication (SpMM) by up to 43X (2.3X-7.5X on average) and speeds up the key graph operations in GNNs on real graphs by as much as 8.6X (3.5X on average).
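For readers unfamiliar with the N:M constraint, the following minimal NumPy sketch illustrates the 2:4 case: every group of four consecutive elements along a row may hold at most two nonzeros. It is an illustration only, not the paper's reordering algorithm; the matrix and the column permutation are made up for this toy example.

import numpy as np

# Count the length-m row vectors that violate the N:M constraint
# (more than n nonzeros among m consecutive elements); n=2, m=4 is
# the 2:4 pattern accelerated by sparse tensor cores on A100 GPUs.
def nm_violations(mat, n=2, m=4):
    rows, cols = mat.shape
    assert cols % m == 0, "column count must be a multiple of m"
    nnz_per_group = (mat != 0).reshape(rows, cols // m, m).sum(axis=2)
    return int((nnz_per_group > n).sum())

# Toy irregular sparse matrix; the second row's first group of four
# holds three nonzeros, violating 2:4.
A = np.array([
    [1, 0, 2, 0,  0, 3, 0, 0],
    [4, 5, 6, 0,  0, 0, 0, 7],
    [0, 0, 8, 9,  1, 0, 0, 2],
])
print(nm_violations(A))  # 1

# A hand-picked column permutation (hypothetical, chosen for this toy
# case) removes the violation without dropping any entry -- the kind of
# lossless restructuring the paper performs at the tile level.
perm = [0, 1, 3, 4, 2, 5, 6, 7]
print(nm_violations(A[:, perm]))  # 0

Because no entries are dropped, such a restructuring preserves the exact result of the sparse computation, which is why the optimization leaves GNN accuracy unchanged.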


Citation

Chen, J., H. Sung, R. Zhang, A. Li, and X. Shen. 2025. Accelerating GNNs on GPU Sparse Tensor Cores through N:M Sparsity-Oriented Graph Reordering. In Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP 2025), March 1-5, 2025, Las Vegas, NV, 16-28. New York, NY: Association for Computing Machinery. PNNL-SA-193030. doi:10.1145/3710848.3710881