Sparse Autoencoders Enhance Faithful Retrieval-Augmented Generation by Disentangling Internal Activations

Retrieval-Augmented Generation (RAG) represents a significant step towards more reliable large language models, yet these systems still struggle with ‘hallucinations’, generating content that contradicts or goes beyond the provided source material. Guangzhi Xiong, Zhenghao He, and Bohan Liu, along with colleagues from the University of Virginia, address this critical challenge with a novel approach that moves beyond expensive external checks or extensive data labelling.
The team introduces RAGLens, a lightweight hallucination detector that dissects the internal workings of language models using sparse autoencoders to pinpoint the specific features responsible for unfaithful outputs. This method not only surpasses existing detection techniques in accuracy, but also offers valuable insights into why these errors occur, paving the way for more effective post-hoc correction and a deeper understanding of language model behaviour.
Sparse Autoencoders Detect LLM Hallucinations

The study introduces a novel approach to detecting inaccuracies in Retrieval-Augmented Generation (RAG) systems by employing sparse autoencoders (SAEs) to analyze the internal workings of large language models (LLMs). The researchers hypothesized that specific features within the LLM’s hidden states are uniquely activated when the model “hallucinates,” generating text that contradicts or extends beyond the provided source material. To test this, the team engineered a systematic pipeline that uses SAEs to disentangle these internal activations, isolating the features linked to unfaithful generation.

The core of the methodology involves training SAEs on LLM hidden states, enforcing sparsity so that individual features correspond to semantically meaningful concepts, a property known as monosemanticity. Following SAE training, an information-based feature selection step identifies the features most relevant to hallucination detection, keeping those that maximize information gain. Building on these selected features, the team constructed RAGLens, a lightweight detector based on an additive feature model: it combines the selected SAE features to predict the likelihood of hallucination, providing both accurate detection and interpretable rationales for its decisions. Experiments demonstrate that RAGLens achieves superior detection performance compared to existing methods and identifies features highly relevant to RAG hallucinations, advancing both the accuracy and the interpretability of hallucination detection in RAG systems.

Hallucination Detection via Sparse Autoencoders

RAGLens addresses a critical challenge for RAG systems, which combine large language models with external knowledge sources: ensuring that generated responses are faithful to the retrieved evidence and do not contain invented details or contradictions. Experiments demonstrate that RAGLens accurately identifies instances of “hallucination,” where the model generates unfaithful content, by analyzing internal representations within the language model itself.
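As a concrete illustration of the SAE training step in the pipeline described above, the following is a minimal PyTorch-style sketch rather than the authors’ implementation: the class name, dictionary size, and L1 sparsity coefficient are illustrative assumptions, and the paper’s exact architecture and training objective may differ.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Illustrative SAE over LLM hidden states (not the paper's exact architecture)."""
    def __init__(self, d_model: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, dict_size)  # maps a hidden state to feature activations
        self.decoder = nn.Linear(dict_size, d_model)  # dictionary used to reconstruct the hidden state

    def forward(self, h: torch.Tensor):
        f = torch.relu(self.encoder(h))  # sparse, non-negative feature activations
        h_hat = self.decoder(f)          # reconstruction of the original hidden state
        return h_hat, f

def sae_loss(h, h_hat, f, sparsity_coef: float = 1e-3):
    # Reconstruction error plus an L1 penalty; the penalty drives most feature
    # activations to zero, which is what encourages monosemantic features.
    reconstruction = torch.mean((h - h_hat) ** 2)
    l1_penalty = torch.mean(f.abs().sum(dim=-1))
    return reconstruction + sparsity_coef * l1_penalty

# Usage (hypothetical sizes): h is a batch of hidden states collected from one LLM layer.
# sae = SparseAutoencoder(d_model=4096, dict_size=32768)
# h_hat, f = sae(h)
# loss = sae_loss(h, h_hat, f)
# loss.backward()
```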
The team employed sparse autoencoders (SAEs) to dissect the complex internal activations of the language model, successfully isolating features specifically triggered during instances of hallucination. These SAEs learn a dictionary of features from the model’s hidden states, effectively capturing nuanced dynamics related to unfaithful generation. By examining these features, researchers can pinpoint the specific internal signals associated with inaccuracies, establishing a strong foundation for detecting unfaithfulness without requiring extensive labeled data or costly external evaluations. Results show that RAGLens outperforms existing hallucination detection methods in accuracy while also providing interpretable feedback to aid in mitigating inaccuracies. The additive model structure and transparent input features enable a clear understanding of why the system flags certain outputs as unfaithful. Detailed analyses reveal that mid-layer SAE features, exhibiting high mutual information with the hallucination labels, are most informative for detection.
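The feature-selection and additive-modeling steps can be sketched in a similarly hedged way. The snippet below assumes scikit-learn, a matrix F of pooled SAE feature activations per generated response, and binary hallucination labels y; the top-k mutual-information selection and the logistic-regression combiner are illustrative stand-ins for RAGLens’s information-based selection and additive model, not the paper’s exact formulation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression

def select_and_fit(F: np.ndarray, y: np.ndarray, top_k: int = 64):
    """F: (n_samples, n_sae_features) pooled SAE activations; y: 1 = hallucinated, 0 = faithful."""
    # Score each SAE feature by its estimated mutual information with the hallucination label.
    mi = mutual_info_classif(F, y, random_state=0)
    selected = np.argsort(mi)[::-1][:top_k]  # keep the most informative features

    # Additive detector: a linear (logistic) combination of the selected features,
    # so each feature's weight can be inspected as part of the rationale for a flag.
    detector = LogisticRegression(max_iter=1000).fit(F[:, selected], y)
    return selected, detector

# Usage (hypothetical):
# selected, detector = select_and_fit(F_train, y_train)
# p_hallucination = detector.predict_proba(F_test[:, selected])[:, 1]
```

Because the detector is linear in the selected features, each feature’s learned weight can be read off directly, which mirrors the interpretability claims described above.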
This research establishes the effectiveness of SAE features for detecting hallucinations in RAG systems and provides valuable insight into how hallucination-related signals are distributed within large language models.

The work demonstrates the power of sparse autoencoders for detecting inaccuracies in retrieval-augmented generation. RAGLens identifies instances where generated text contradicts or extends beyond the provided source material, achieving state-of-the-art performance on multiple benchmarks, and it not only flags these ‘hallucinations’ but also provides interpretable explanations for its decisions, offering insight into the internal workings of large language models. By leveraging the models’ internal representations, the team created a lightweight detector that improves the reliability of RAG systems and enables actionable feedback for mitigating inaccuracies. The findings highlight the broader potential of sparse representation probing for enhancing model faithfulness and point to future work on integrating interpretable detectors into real-world applications where trust and accuracy are paramount.

More information: Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders. arXiv: https://arxiv.org/abs/2512.08892
