Multi-GPU Framework Enables Encrypted Large-Model Inference, Addressing Terabyte-Scale Memory Challenges

Protecting data privacy during artificial intelligence tasks remains a significant challenge, and researchers are now demonstrating a scalable solution for encrypted AI inference. Siddharth Jayashankar from Carnegie Mellon University, Joshua Kim from UT Austin, Michael B. Sullivan from NVIDIA, Wenting Zheng from Carnegie Mellon University, and Dimitrios Skarlatos from Carnegie Mellon University present Cerium, a new framework that overcomes the performance limitations that have so far prevented practical deployment of fully homomorphic encryption (FHE). Cerium leverages the accessibility of multi-GPU systems to achieve performance competitive with specialized FHE ASICs, matching the speed of prior ASIC designs such as CraterLake. The system introduces new compiler techniques and memory management strategies that enable encrypted inference for large language models such as Llama3-8B, a substantial step towards practical, privacy-preserving AI. Notably, Cerium performs bootstrapping in under 10 milliseconds and completes encrypted inference for BERT-Base and Llama3-8B in 8 and 134 seconds respectively, the first time these feats have been accomplished on a GPU system.

GPU Acceleration of Programmable Homomorphic Encryption

Cerium accelerates fully homomorphic encryption using the parallelism of GPUs. FHE allows computations to be performed directly on encrypted data, preserving privacy, but it is so computationally intensive that widespread adoption has been impractical. Cerium addresses this by introducing a programming model based on micro-operations (µops): small, independent units of FHE computation that can be executed in parallel. A dynamic scheduler manages these µops to maximize GPU utilization, while a hybrid memory architecture balances performance and capacity. The result is a substantial speedup over traditional CPU-based FHE implementations and other GPU-based approaches, reaching up to 100× faster than CPUs and 2.5× faster than existing GPU systems. The µop-based programming model supports a wide range of FHE schemes and applications, the design scales with increasingly complex computations, and the researchers evaluated the system on real-world workloads including image classification and privacy-preserving machine learning.
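To make the µop idea concrete, here is a minimal, illustrative sketch of a dependency-driven scheduler: at each step it launches whichever micro-operations have all their inputs ready, standing in for how a dynamic scheduler can keep multiple GPU streams busy at once. This is not Cerium's actual API; the names Uop, schedule, and num_streams, and the example operation graph, are hypothetical.

```python
# Illustrative sketch (not the paper's actual API): a toy dependency-driven
# scheduler that dispatches independent FHE micro-operations (uops) as soon
# as their inputs are ready, mimicking how a dynamic uop scheduler can keep
# several GPU streams occupied in parallel.
from dataclasses import dataclass, field

@dataclass
class Uop:
    name: str                                  # e.g. "ntt", "modmul", "keyswitch"
    deps: list = field(default_factory=list)   # uops that must finish first
    done: bool = False

def schedule(uops, num_streams=4):
    """Greedy list scheduler: each step, launch up to `num_streams` uops
    whose dependencies have all completed (stand-ins for GPU streams)."""
    pending = list(uops)
    timeline = []
    while pending:
        ready = [u for u in pending if all(d.done for d in u.deps)]
        if not ready:
            raise RuntimeError("dependency cycle in uop graph")
        batch = ready[:num_streams]            # fill the available streams
        for u in batch:
            u.done = True
            pending.remove(u)
        timeline.append([u.name for u in batch])
    return timeline

# Toy graph: two independent NTTs feed a modular multiply, then a key switch.
ntt_a = Uop("ntt_a")
ntt_b = Uop("ntt_b")
mul   = Uop("modmul", deps=[ntt_a, ntt_b])
ks    = Uop("keyswitch", deps=[mul])
print(schedule([ntt_a, ntt_b, mul, ks]))
# [['ntt_a', 'ntt_b'], ['modmul'], ['keyswitch']]
```

In this toy run the two NTTs execute in the same step because they are independent, while the multiply and key switch wait for their inputs, which is the behavior a µop-level scheduler exploits to fill a GPU.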
Automated GPU Acceleration of Homomorphic Encryption

Cerium, a multi-GPU framework, significantly improves the performance of FHE inference, particularly for large models. The system automatically generates optimized GPU kernels, manages terabyte-scale memory footprints, and parallelizes computation across multiple GPUs. To achieve this, the researchers developed a domain-specific language, an optimizing compiler, and a runtime system that together produce high-performance GPU kernels without manual intervention. With multi-GPU scaling, the framework outperforms expert-written, hand-optimized GPU libraries by up to 2.25× on smaller models, matches the performance of the state-of-the-art CraterLake FHE ASIC, and exceeds the performance of ARK and Cinnamon.
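As a rough illustration of what managing a terabyte-scale encrypted footprint across GPUs can involve, the sketch below keeps a model's ciphertext blocks in host memory, spreads each layer's blocks across the GPUs round-robin, and stages the next layer's blocks while the current layer computes. This is an assumption-laden toy, not Cerium's runtime; the functions plan_layer, prefetch, and compute, and all sizes, are hypothetical.

```python
# Toy sketch, not Cerium's runtime: encrypted weights exceed GPU memory, so
# ciphertext blocks live in host DRAM and are staged to the GPUs just before
# the layer that consumes them, overlapping copies with ongoing compute.
NUM_GPUS = 8  # assumed multi-GPU node

def plan_layer(layer_blocks, gpus=NUM_GPUS):
    """Round-robin a layer's ciphertext blocks across the GPUs."""
    return {blk: blk % gpus for blk in layer_blocks}

def prefetch(layer_blocks):
    # Stand-in for asynchronous host-to-device copies of encrypted blocks.
    print(f"prefetching {len(layer_blocks)} ciphertext blocks to GPU memory")

def compute(layer_blocks, placement):
    # Stand-in for launching the FHE kernels of this layer on every GPU.
    print(f"computing layer on {len(layer_blocks)} blocks across {NUM_GPUS} GPUs")

def run_model(layers):
    """While layer i computes, stage layer i+1's blocks so the GPUs never stall."""
    for i, layer in enumerate(layers):
        placement = plan_layer(layer)
        if i + 1 < len(layers):
            prefetch(layers[i + 1])
        compute(layer, placement)

# Toy model: 4 layers of 512 encrypted weight blocks each.
run_model([list(range(512)) for _ in range(4)])
```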
The team broke the 10 millisecond barrier for bootstrapping, achieving a time of 7.5 milliseconds on commercially available hardware. The framework also accelerates BERT-Base inference by a factor of 9.12 over previous GPU-based FHE systems. These results are achieved through the domain-specific language and optimizing compiler, which eliminate the need for manual kernel creation.

GPU Acceleration of Fully Homomorphic Encryption

Cerium represents a significant advance in fully homomorphic encryption, delivering a multi-GPU framework that substantially improves performance. The system automatically generates high-performance GPU kernels, manages extensive memory requirements, and distributes computation across multiple GPUs, achieving performance comparable to state-of-the-art FHE ASICs, matching CraterLake, and demonstrating the first GPU-based bootstrapping in under 10 milliseconds. Cerium introduces new techniques for kernel generation, optimization, and scheduling, alongside methods that reduce memory demands by over 100× and enable the processing of terabyte-scale encrypted data.

👉 More information 🗞 A Scalable Multi-GPU Framework for Encrypted Large-Model Inference 🧠 ArXiv: https://arxiv.org/abs/2512.11269
