FlashFuser Leverages Inter-Core Connectivity to Advance Kernel Fusion, Delivering 58% Performance Gains

Quantum Zeitgeist

The growth in computational power increasingly outpaces memory bandwidth, creating a bottleneck for many modern deep learning applications. Ziyu Huang, Yangjie Zhou, and Zihan Liu, alongside colleagues, address this challenge with FlashFuser, a novel compiler framework that expands the possibilities of kernel fusion on contemporary GPUs.

The team’s work represents a significant step forward by exploiting the inter-core connection, specifically Distributed Shared Memory (DSM), available in modern GPUs such as the H100, a resource previously underutilised by software. FlashFuser achieves substantial performance gains by formalising complex data exchange patterns and optimising data movement across the distributed memory hierarchy, reducing memory accesses and delivering a 1.24× improvement in end-to-end performance over existing libraries and compilers.

Exploiting Distributed Memory for Kernel Fusion

The increasing demands of deep learning computations are now limited more by memory bandwidth than by processing power. Kernel fusion, a technique that combines adjacent operations into a single GPU kernel so intermediates avoid round-trips through global memory, offers a solution, but current systems are restricted by the capacity of local on-chip memory.
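As a baseline illustration of kernel fusion itself (independent of FlashFuser), the hypothetical sketch below contrasts two separate element-wise kernels, whose intermediate round-trips through global memory, with a fused kernel that keeps it in a register:

```cuda
#include <cuda_runtime.h>

// Unfused: two launches; the intermediate t = a*b travels to global memory and back.
__global__ void mul(const float* a, const float* b, float* t, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) t[i] = a[i] * b[i];
}
__global__ void add_one(const float* t, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = t[i] + 1.0f;
}

// Fused: one launch; the intermediate lives in a register, eliminating its
// global-memory traffic and saving a kernel launch.
__global__ void mul_add_fused(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a[i] * b[i];   // never leaves on-chip storage
        out[i] = t + 1.0f;
    }
}
```

For element-wise chains like this, registers suffice. The hard case this research targets is compute-intensive operators whose intermediates are too large for a single SM's on-chip storage, which is where Distributed Shared Memory comes in.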

This research investigates how to overcome this limitation by utilising Distributed Shared Memory (DSM), a feature of modern GPUs such as the NVIDIA H100 that lets Streaming Multiprocessors access each other's shared memory, forming a larger, faster on-chip pool. By effectively using DSM, the team demonstrates significant reductions in memory access latency and improvements in performance, even for models with large intermediate results.

The researchers developed a new fusion strategy that dynamically allocates intermediate results across the DSM, effectively extending available capacity beyond the limits of a single SM's on-chip storage. This approach enables the fusion of larger kernels and reduces costly data transfers between global memory and on-chip storage, resulting in substantial performance gains.
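To make the mechanism concrete, the minimal sketch below shows how a kernel on Hopper-class GPUs (sm_90) can read a neighbouring block's shared memory through the thread block cluster API, the hardware capability that DSM-based fusion builds on. This illustrates the underlying CUDA feature only, not FlashFuser's generated code; the two-block cluster shape and the stand-in operations are assumptions for the example.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two thread blocks form a cluster and exchange an intermediate tile through
// each other's shared memory (DSM) instead of spilling it to global memory.
// Assumes blockDim.x == 256; compile with: nvcc -arch=sm_90
__global__ void __cluster_dims__(2, 1, 1) fused_pair(const float* in, float* out) {
    __shared__ float tile[256];                 // this block's on-chip tile
    cg::cluster_group cluster = cg::this_cluster();
    unsigned rank    = cluster.block_rank();    // 0 or 1 within the cluster
    unsigned partner = rank ^ 1;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx] * 2.0f;         // stage 1: stand-in for the first fused op

    cluster.sync();                             // make every block's tile visible cluster-wide

    // Stage 2 reads the partner block's tile over the SM-to-SM link, so the
    // intermediate never round-trips through global memory.
    float* remote = cluster.map_shared_rank(tile, partner);
    out[idx] = tile[threadIdx.x] + remote[threadIdx.x];

    cluster.sync();                             // keep remote shared memory valid until all reads finish
}
```

FlashFuser, as described in this article, generalises this pattern: its communication abstraction decides which blocks exchange which data, and its scheduler picks tile sizes so that intermediates fit in the pooled shared memory.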

Tensor Program Fusion and Optimisation

A significant body of research focuses on improving how deep learning models are compiled and optimised for execution on various hardware platforms. A dominant theme is tensor program transformation, with numerous studies dedicated to automatically reshaping programs for better performance. Key techniques include kernel fusion, tiling, loop transformation, and algebraic simplification to enhance parallelism and efficiency. Auto-tuning, often driven by Bayesian optimisation, automatically searches for the best optimisation strategies. Researchers also explore schedule exploration, graph optimisation, and memory optimisation to reduce memory footprint and improve access patterns, while dataflow analysis reveals how data moves through a computation and where improvements are possible. Some research targets specific model types, such as Transformers and Graph Neural Networks, while other work explores advanced compilation techniques such as virtual tensor management and monolithic optimisation. Emerging trends include AI-driven compilation and composable infrastructure.

FlashFuser Optimises GPU Memory for Deep Learning

The research team developed FlashFuser, a novel compiler framework that significantly improves performance in deep learning workloads by effectively utilising Distributed Shared Memory on modern GPUs such as the NVIDIA H100. Recognising that growth in computational power is outpacing improvements in memory bandwidth, the team addressed the widening bottleneck in memory access, particularly within demanding layers such as Feed-Forward Networks. FlashFuser overcomes the limitations of existing techniques by extending kernel fusion to exploit the larger, faster on-chip memory pool provided by inter-core connections. The work introduces a powerful communication abstraction that formalises complex data exchange patterns, enabling efficient data movement across multiple Streaming Multiprocessors. A dataflow analyser generalises loop scheduling, resource mapping, and tile selection to the distributed memory hierarchy, determining the optimal execution order and tile sizes by quantifying data movement across memory levels. Through analytical cost modelling and DSM-aware pruning strategies, FlashFuser efficiently discovers the optimal execution plan, minimising data transfer and maximising performance (a simplified sketch of this style of cost model appears after the results below).

FlashFuser Achieves 3.3× GPU Kernel Speedup

This study introduces FlashFuser, a new compiler framework designed to overcome key limitations of existing GPU compilation systems by fully exploiting the inter-core connectivity of modern GPUs. FlashFuser extends kernel fusion across distributed shared memory, enabling more efficient use of hardware resources beyond traditional on-chip memory constraints. The framework integrates a novel communication abstraction, a dataflow analyser, and an efficient search engine to systematically generate highly optimised fused kernels. Evaluations on an NVIDIA H100 GPU demonstrate substantial performance improvements: fused kernels achieve up to 3.3× speedups over highly optimised libraries and 4.1× gains compared to state-of-the-art compilers. These benefits are driven by a 58% reduction in memory accesses, yielding an overall 1.24× end-to-end speedup.
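To illustrate the kind of analytical cost model the dataflow analyser relies on, here is a deliberately simplified, hypothetical sketch: it scores candidate tile sizes by estimating bytes moved through global memory versus the inter-core link, weighting each by an illustrative bandwidth. All names, constants, and traffic formulas here are assumptions for the example, not FlashFuser's actual model.

```cuda
// Host-side sketch of a tile-size search driven by an analytical cost model.
// Compile with nvcc (or any C++11 compiler); no device code is involved.
#include <cstdio>
#include <cfloat>
#include <initializer_list>

struct TileCost {
    double globalBytes;   // estimated traffic to/from HBM
    double dsmBytes;      // estimated traffic over the SM-to-SM (DSM) link
};

// Rough traffic estimate for fusing two GEMM-like stages (M x K) * (K x N),
// assuming the M x N intermediate stays on chip and is exchanged once via DSM.
TileCost estimate(double M, double N, double K, double tile) {
    double tilesM = M / tile, tilesN = N / tile;
    TileCost c;
    c.globalBytes = 4.0 * (M * K * tilesN   // A re-read once per column strip
                         + K * N * tilesM   // B re-read once per row strip
                         + M * N);          // final fp32 store
    c.dsmBytes    = 4.0 * (M * N);          // intermediate never touches HBM
    return c;
}

int main() {
    const double hbmGBs = 3350.0, dsmGBs = 9000.0;  // illustrative H100-class bandwidths
    double best = DBL_MAX, bestTile = 0.0;
    for (double tile : {64.0, 128.0, 256.0}) {      // candidate tile edges
        TileCost c = estimate(4096, 4096, 4096, tile);
        double t = c.globalBytes / hbmGBs + c.dsmBytes / dsmGBs;  // time proxy
        if (t < best) { best = t; bestTile = tile; }
    }
    printf("best tile edge: %.0f\n", bestTile);
    return 0;
}
```

Larger tiles cut re-reads of the inputs but need more on-chip capacity; pooling shared memory across SMs via DSM is what lets a fused kernel afford those larger tiles in the first place.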
While the results highlight FlashFuser’s effectiveness on cutting-edge hardware, the authors note that future work will focus on adapting the framework to other GPU architectures and expanding support for a broader range of deep learning models.

More information: FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection. arXiv: https://arxiv.org/abs/2512.12949
