December 3, 2024
Conference Paper

Understanding Mixed Precision GEMM with MPGemmFI: Insights into Fault Resilience

Abstract

Emerging deep learning workloads urgently demand fast general matrix multiplication (GEMM). Consequently, one of the critical features of machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google TPUs is support for mixed-precision GEMM. For DNN models, lower-precision floating-point data formats and computation offer acceptable accuracy with significant improvements in performance, area, and memory footprint. While promising, the impact of mixed-precision computation on error resilience remains unexplored. To this end, we develop a fault injection framework that systematically injects faults into mixed-precision computation results and investigate how these faults affect the accuracy of machine learning applications. Based on the observed error-resilience characteristics, we offer lightweight error detection and correction solutions that improve overall model accuracy by 75% when the models experience hardware faults. The solutions can be efficiently integrated into the accelerator's pipelines.
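The kind of output-level fault injection the abstract describes can be sketched as follows. This is a minimal illustration assuming single-bit flips in the FP16 output of an FP16-input/FP32-accumulate GEMM; the function names, injection granularity, and bit-flip model are assumptions for illustration, not the actual MPGemmFI implementation.

```python
import numpy as np

def flip_bit_fp16(value, bit):
    """Flip one bit (0..15) of an FP16 value's binary representation.

    Bit 15 is the sign bit, bits 10-14 the exponent, bits 0-9 the mantissa.
    """
    raw = np.float16(value).view(np.uint16)
    flipped = np.uint16(raw ^ (np.uint16(1) << np.uint16(bit)))
    return flipped.view(np.float16)

def inject_fault(c, row, col, bit):
    """Inject a single bit flip into one element of a GEMM result matrix C."""
    faulty = c.copy()
    faulty[row, col] = flip_bit_fp16(faulty[row, col], bit)
    return faulty

# Mixed-precision GEMM: FP16 inputs, FP32 accumulation, FP16 output
# (the scheme used by, e.g., Tensor Core HMMA instructions).
rng = np.random.default_rng(0)
a = rng.random((4, 4)).astype(np.float16)
b = rng.random((4, 4)).astype(np.float16)
c = (a.astype(np.float32) @ b.astype(np.float32)).astype(np.float16)

# Flip the sign bit (bit 15) of element (0, 0) of the result.
c_faulty = inject_fault(c, 0, 0, 15)
```

A campaign would repeat such injections across bit positions and output elements, then measure the resulting change in downstream model accuracy.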


Citation

Fang, B., X. Li, H. Dam, C. Tan, S. Hari, T. Tsai, I. Laguna, et al. 2024. Understanding Mixed Precision GEMM with MPGemmFI: Insights into Fault Resilience. In IEEE International Conference on Cluster Computing (CLUSTER 2024), September 24-27, 2024, Kobe, Japan, 166-178. Piscataway, New Jersey: IEEE. PNNL-SA-183954. doi:10.1109/CLUSTER59578.2024.00022
