Chopper Characterizes LLaMA-3-8B Training, Revealing Multi-Level GPU Inefficiencies

Understanding the efficiency of large language model training demands detailed insight into modern GPU systems, yet comprehensive characterization of multi-GPU workloads has remained a significant challenge. Marco Kurzynski from the University of Central Florida, Shaizeen Aga from Advanced Micro Devices, Inc., and Di Wu address this gap by introducing Chopper, a novel profiling and analysis framework. Chopper collects and aligns GPU data across multiple levels of detail, from individual kernels to entire training iterations, providing a holistic view of performance. Their analysis of Llama 3 8B training on AMD Instinct MI300X GPUs reveals previously overlooked bottlenecks, notably the substantial impact of frequency regulation on overall performance, which exceeds the effects of other factors such as communication overhead. This work delivers actionable insights for optimizing training frameworks, improving power management, and informing the design of future GPU architectures, representing a substantial step towards more efficient large language model training.

GPU Architectures and Large Language Models

This document provides a comprehensive overview of hardware and software optimization for large-scale machine learning, focusing on training and inference of large language models (LLMs) and other deep learning workloads. Key areas of focus include GPU architecture, software stacks, parallelism strategies, and optimization techniques. The document details various GPU architectures, including NVIDIA's Hopper and AMD's CDNA 2 and 3, examining their matrix cores, memory hierarchies, and interconnects. It emphasizes the ROCm software platform as AMD's alternative to NVIDIA's CUDA, highlighting libraries such as rocBLAS and tools such as rocprofiler for performance analysis and PyTorch integration.
Researchers explore different parallelism strategies, such as data, tensor, pipeline, and fully sharded data parallelism, along with techniques for scaling training across multiple GPUs and nodes. A wide range of optimization techniques are discussed, including FP8 quantization, optimized attention mechanisms such as FlashAttention and GQA, kernel batching, and CUDA graphs. The document also covers profiling and analysis tools for identifying bottlenecks and improving performance. While NVIDIA currently maintains the more mature ecosystem, AMD is actively building its own with improved software and developer support. Ultimately, the document concludes that competition in the AI hardware space is intensifying, that software is crucial for success, that optimization is key to achieving peak performance, and that increasingly sophisticated parallelism strategies are necessary for future advancements.

Multi-GPU Profiling of Llama 3 Training

Scientists developed Chopper, a novel profiling and analysis framework, to comprehensively characterize large language model (LLM) training on modern GPU systems. Recognizing the limitations of prior work, which often focused on kernel-level performance or single-GPU microbenchmarks, the team engineered a system capable of collecting, aligning, and visualizing GPU kernel traces and hardware performance counters across multiple granularities. This multi-granularity approach enables detailed analysis of performance bottlenecks and behaviors previously overlooked in LLM training. The methodology involved a comprehensive assessment of GPU resources at multiple hardware levels, facilitating a holistic understanding of the entire system. The scientists investigated throughput through the lens of operation efficiency, operation overlap, power-management decisions, and launch-overhead effects. They quantified the gap between actual and theoretical operation duration, providing a breakdown of operation time to pinpoint inefficiencies.
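The gap-quantification step described above can be sketched in a few lines: split each kernel's measured time into a theoretical lower bound plus categorized overheads. This is a minimal illustration, not Chopper's actual API; the `KernelRecord` fields, category names, and numbers are all assumptions for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class KernelRecord:
    name: str
    actual_us: float        # measured wall-clock duration (hypothetical)
    theoretical_us: float   # lower-bound duration at peak throughput (hypothetical)
    launch_gap_us: float    # idle gap before the kernel, i.e. launch overhead

def breakdown(records):
    """Split total time into useful work vs. categorized inefficiency."""
    useful = sum(r.theoretical_us for r in records)
    launch = sum(r.launch_gap_us for r in records)
    # Residual gap: measured minus theoretical execution time, which a
    # deeper analysis would further attribute (e.g. to frequency effects).
    exec_gap = sum(r.actual_us - r.theoretical_us for r in records)
    total = useful + launch + exec_gap
    return {"useful": useful, "launch_overhead": launch,
            "exec_gap": exec_gap, "total": total}

records = [
    KernelRecord("gemm_fwd", actual_us=120.0, theoretical_us=90.0, launch_gap_us=5.0),
    KernelRecord("all_gather", actual_us=80.0, theoretical_us=80.0, launch_gap_us=2.0),
]
print(breakdown(records))
```

A real tool would populate these records from kernel traces and hardware counters rather than literals, but the accounting identity (total time = useful time + attributed overheads) is the same.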
Through this detailed analysis, the team identified frequency overhead, resulting from dynamic voltage and frequency scaling (DVFS) effects, as the single largest contributor to the gap between theoretical and observed performance, exceeding the impact of MFMA utilization loss, communication/computation overlap, and kernel launch overheads. This finding addresses a critical gap in understanding how GPUs behave under the complex demands of distributed LLM training, moving beyond the kernel-level performance studies and single-GPU microbenchmarks of prior work.
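One simple way to estimate such a DVFS component, under the assumption that a compute-bound kernel's duration scales inversely with engine clock, is to rescale the observed duration by the ratio of achieved to peak frequency. The peak clock and sample values below are hypothetical and chosen only to make the arithmetic visible; this is a sketch of the idea, not the paper's methodology.

```python
PEAK_MHZ = 2100.0  # assumed peak engine clock, for illustration only

def frequency_overhead_us(observed_us, avg_clock_mhz, peak_mhz=PEAK_MHZ):
    """Extra time attributable to running below peak clock.

    Assumes duration scales inversely with frequency (compute-bound),
    so the ideal duration at peak clock is observed * (avg / peak).
    The returned value is the portion of observed time lost to DVFS.
    """
    ideal_us = observed_us * (avg_clock_mhz / peak_mhz)
    return observed_us - ideal_us

# A kernel observed to take 210 us while averaging 1400 MHz instead of
# the assumed 2100 MHz peak: roughly a third of its time is DVFS loss.
print(frequency_overhead_us(210.0, 1400.0))
```

Memory-bound kernels would need a different model (duration tied to memory rather than engine clock), which is one reason aligning traces with per-kernel counters matters.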
The team’s measurements show that Chopper provides full characterization coverage across the application stack, profiling at the GPU kernel level while enabling characterization at varying granularities. This multi-level approach examines GPU resources in detail, spanning the multi-GPU node, the microarchitecture of individual GPUs, and the CPU, facilitating a comprehensive analysis of the full system. By examining both software and hardware aspects, Chopper delivers actionable insights for optimizing training frameworks, improving power-management strategies, and guiding future GPU architecture and system design. It automates the collection and analysis of performance data across multiple granularities, from individual kernels to entire training phases, providing detailed insight into the complex interactions between hardware and software.
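The kernel-to-phase aggregation this kind of multi-granularity analysis relies on can be sketched as a simple rollup over aligned trace rows. The row layout, field names, and durations below are illustrative assumptions, not Chopper's trace format.

```python
from collections import defaultdict

# (iteration, phase, kernel_name, duration_us) rows, as an aligned
# kernel trace might provide them; the values are made up.
trace = [
    (0, "forward",  "gemm",      120.0),
    (0, "forward",  "attention",  95.0),
    (0, "backward", "gemm_grad", 210.0),
    (1, "forward",  "gemm",      118.0),
]

def rollup(trace, level):
    """Aggregate kernel durations at a coarser granularity.

    level = number of leading key fields to keep:
    1 -> per-iteration totals, 2 -> per-(iteration, phase) totals.
    """
    totals = defaultdict(float)
    for row in trace:
        totals[row[:level]] += row[-1]
    return dict(totals)

print(rollup(trace, 1))  # per-iteration totals
print(rollup(trace, 2))  # per-(iteration, phase) totals
```

The same rows thus answer both fine-grained questions ("which kernel is slow?") and coarse ones ("which phase of which iteration regressed?") without re-profiling.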
The team successfully applied Chopper to training the Llama 3 8B model using fully sharded data parallelism on an eight-GPU system, revealing previously unexplored bottlenecks and behaviors. The analysis demonstrates that frequency overhead, stemming from dynamic voltage and frequency scaling effects, is the largest contributor to performance gaps, exceeding the impact of factors such as MFMA utilization loss and communication delays. Notably, the research highlights improvements in frequency management between different versions of the fully sharded data parallelism framework. These findings underscore the importance of treating frequency stability as a key component of profiling and optimization efforts, particularly when comparing training frameworks and tuning power-management strategies. The authors acknowledge that the study focuses on a specific hardware and software configuration, limiting the generalizability of the findings to other systems.

👉 More information
🗞 Chopper: A Multi-Level GPU Characterization Tool & Derived Insights Into LLM Training Inefficiency
🧠 ArXiv: https://arxiv.org/abs/2512.08242
