Predictive Concept Decoders Achieve Scalable Interpretability for Neural Network Behavior

Understanding how neural networks reach decisions remains a significant challenge, yet interpreting their internal workings is crucial for building trustworthy artificial intelligence. Vincent Huang, Dami Choi, and Daniel D. Johnson, alongside colleagues, address this problem by training ‘interpretability assistants’ to predict a network’s behaviour directly from its internal activations, rather than relying on observations of external behaviour alone. Their work turns interpretation into an end-to-end training objective built around a communication bottleneck: an encoder compresses complex activation data into a sparse, concise list of concepts, and a decoder reads this list and answers natural-language questions about the network’s reasoning. The resulting ‘Predictive Concept Decoder’ improves with increasing data and, importantly, identifies hidden vulnerabilities such as jailbreaks, detects implanted information, and accurately reveals latent user attributes within the network.
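Concretely, the data flow can be pictured as an encoder that keeps only a handful of concept activations per example and a decoder that reads them alongside a question. The sketch below is a minimal, illustrative PyTorch rendering under the assumption of a top-k sparsity rule; the class names, pooling scheme, and dimensions are not taken from the paper.

```python
import torch
import torch.nn as nn

class ConceptEncoder(nn.Module):
    """Maps subject-model activations to a sparse list of concept activations."""
    def __init__(self, d_act: int, n_concepts: int, k: int = 16):
        super().__init__()
        self.proj = nn.Linear(d_act, n_concepts)
        self.k = k  # concepts kept per example: the communication bottleneck

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        scores = self.proj(activations)               # (batch, n_concepts)
        topk = scores.topk(self.k, dim=-1)            # keep only the k strongest concepts
        sparse = torch.zeros_like(scores)
        return sparse.scatter(-1, topk.indices, topk.values)

class ConceptDecoder(nn.Module):
    """Reads the sparse concept list plus a question and scores answer tokens."""
    def __init__(self, n_concepts: int, d_model: int, vocab_size: int):
        super().__init__()
        self.concept_embed = nn.Linear(n_concepts, d_model)
        self.question_embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, concepts: torch.Tensor, question_ids: torch.Tensor) -> torch.Tensor:
        ctx = self.concept_embed(concepts).unsqueeze(1)   # (batch, 1, d_model)
        q = self.question_embed(question_ids)             # (batch, T, d_model)
        pooled = torch.cat([ctx, q], dim=1).mean(dim=1)   # crude pooling for illustration
        return self.head(pooled)                          # answer-token logits
```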
The team demonstrates how to pre-train this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, termed a Predictive Concept Decoder, exhibits favourable scaling properties, as the auto-interp score of the bottleneck concepts improves with increasing data.
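One plausible way to organise those two training stages, reusing the toy interfaces sketched above, is shown below; the data iterators and single-token losses are deliberate simplifications and not the authors' training code.

```python
import torch
import torch.nn.functional as F

def pretrain(encoder, decoder, activation_text_pairs, optimizer):
    """Stage 1: predict the next text segment through the concept bottleneck."""
    for activations, next_token_ids in activation_text_pairs:
        concepts = encoder(activations)                        # sparse concept list
        logits = decoder(concepts, next_token_ids[:, :-1])     # condition on the prefix
        loss = F.cross_entropy(logits, next_token_ids[:, -1])  # predict the final token
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def finetune(encoder, decoder, qa_examples, optimizer):
    """Stage 2: answer natural-language questions about the subject model."""
    for activations, question_ids, answer_id in qa_examples:
        concepts = encoder(activations)
        logits = decoder(concepts, question_ids)
        loss = F.cross_entropy(logits, answer_id)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```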

LLM Safety, Introspection and Concept Testing
This research details experiments designed to evaluate and understand the inner workings of large language models (LLMs), specifically focusing on safety, introspection, concept representation, and prompt engineering. The team rigorously tested the models’ resistance to harmful prompts, attempting to elicit undesirable responses through carefully crafted inputs. They also investigated how the models internally represent concepts, aiming to understand which ideas activate within the network during processing. This involved creating a list of concepts and generating passages that either hinted at these concepts or presented neutral information, then analyzing the model’s responses. The experiments relied on precisely defined prompts, ensuring reproducibility and allowing for detailed analysis of model behaviour.
The team classified responses to assess their relevance to the target concepts, often employing a second LLM as an automated judge of the first. These systematic experiments provide valuable insights into the capabilities and limitations of LLMs.
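The shape of that evaluation loop might look roughly like the sketch below; `query_llm`, the concept list, and the prompt wording are all hypothetical placeholders for whatever models and prompts the experiments actually used.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for the subject or judge model's inference API."""
    raise NotImplementedError("plug in a model client here")

CONCEPTS = ["secrecy", "flattery", "urgency"]  # illustrative concept list

def build_passages(concept: str) -> dict:
    """One prompt that hints at the concept and one neutral control."""
    return {
        "hinted": f"Write a short note that subtly evokes {concept}.",
        "neutral": "Write a short note about everyday errands.",
    }

def judge(concept: str, response: str) -> bool:
    """A second LLM classifies whether a response reflects the target concept."""
    verdict = query_llm(
        f"Does the following text relate to the concept '{concept}'? "
        f"Answer yes or no.\n\n{response}"
    )
    return verdict.strip().lower().startswith("yes")

def run_concept_test(concept: str) -> dict:
    """Collect the subject model's responses and the judge's verdicts."""
    return {
        condition: judge(concept, query_llm(prompt))
        for condition, prompt in build_passages(concept).items()
    }
```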

Predictive Concept Decoding Reveals Network Representations
The research team developed a Predictive Concept Decoder (PCD), a novel architecture designed to interpret the internal workings of neural networks by identifying and decoding key concepts within their activations. This work addresses the challenge of understanding how these networks process information, moving beyond simply observing their outputs to analyzing their internal states. The PCD functions by compressing complex activation patterns into a sparse list of concepts, then reconstructing information from this compressed representation, effectively creating a bottleneck that forces the system to prioritize essential features. Experiments involved training the PCD on passages of text, specifically utilizing data from FineWeb, and assessing its ability to predict subsequent text segments.
The team measured the decoder’s loss during training and observed a steady decrease, indicating that the encoder successfully learned to transmit increasingly useful information through the concept bottleneck. To maintain the effectiveness of these concepts over extended training runs, the researchers introduced an auxiliary loss function that revives inactive concepts, ensuring they remain in use throughout training.
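A revival term of this kind is often implemented along the lines of the sketch below, in the spirit of the dead-feature auxiliary losses used for sparse autoencoders; the paper's exact formulation may differ, and the staleness threshold and target firing rate here are illustrative.

```python
import torch

class DeadConceptTracker:
    """Tracks how long each concept has gone without firing."""
    def __init__(self, n_concepts: int, dead_after: int = 1000):
        self.steps_since_fired = torch.zeros(n_concepts, dtype=torch.long)
        self.dead_after = dead_after

    def update(self, sparse_concepts: torch.Tensor) -> torch.Tensor:
        fired = (sparse_concepts != 0).any(dim=0)           # (n_concepts,) bool
        self.steps_since_fired = torch.where(
            fired,
            torch.zeros_like(self.steps_since_fired),
            self.steps_since_fired + 1,
        )
        return self.steps_since_fired >= self.dead_after    # mask of "dead" concepts

def auxiliary_loss(scores: torch.Tensor, dead_mask: torch.Tensor,
                   target_rate: float = 1e-3) -> torch.Tensor:
    """Nudge dead concepts back into use by pushing their mean firing
    probability toward a small target rate."""
    if not dead_mask.any():
        return scores.new_zeros(())
    dead_probs = torch.sigmoid(scores[:, dead_mask]).mean(dim=0)
    return ((dead_probs - target_rate) ** 2).mean()
```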
Results demonstrate that this auxiliary loss improves both the auto-interpretability score and concept recall. Further evaluation focused on the interpretability of the learned concepts, assessed through automated metrics and user modeling, and showed that the PCD scales comparably to standard methods and offers a promising approach to understanding neural network behaviour.
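Auto-interpretability scores of this kind are typically computed in the style of automated neuron explanations: an explainer model describes a concept from its top-activating texts, and a judge then predicts which held-out texts should activate it. The sketch below shows that general pattern with a hypothetical `query_llm` placeholder; it is not the paper's exact metric.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for the explainer/judge model's inference API."""
    raise NotImplementedError("plug in a model client here")

def auto_interp_score(top_examples, held_out, true_labels) -> float:
    """Fraction of held-out texts whose activation the judge predicts correctly
    from a natural-language explanation of the concept."""
    explanation = query_llm(
        "These texts all strongly activate one concept. Describe the concept "
        "in one sentence:\n" + "\n".join(top_examples)
    )
    predictions = [
        query_llm(
            f"Concept: {explanation}\nDoes this text express the concept? "
            f"Answer yes or no.\n{text}"
        ).strip().lower().startswith("yes")
        for text in held_out
    ]
    correct = sum(p == bool(y) for p, y in zip(predictions, true_labels))
    return correct / len(held_out)
```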

Predictive Concepts Decode Neural Network Function
This research introduces a new approach to understanding how neural networks function internally, moving beyond methods that rely on manually designed probes and towards a system trained to predict network behaviour from its internal activations.
The team developed a Predictive Concept Decoder (PCD), an architecture where an encoder compresses internal network activity into a concise list of concepts, and a decoder then uses this list to answer questions about the network’s actions. This end-to-end training process allows the system to learn which concepts are most relevant for predicting behaviour, resulting in a scalable method for interpretability. The PCD demonstrates an ability to detect vulnerabilities like jailbreaks, identify hidden information embedded within the network, and accurately determine latent user attributes. Importantly, the research highlights that the specific design of the encoder is less critical than its ability to create a comprehensible bottleneck of concepts for the decoder. This suggests that focusing on creating a clear and informative summary of internal activity is key to successful interpretability.
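As an illustration of how such an assistant might be queried once trained, the sketch below asks the decoder a question about a batch of activations, reusing the toy interfaces from the earlier sketch; the tokenizer, question text, and greedy answer decoding are assumptions, not details from the paper.

```python
import torch

@torch.no_grad()
def ask_about_activations(encoder, decoder, activations, question, tokenizer):
    """Query the finetuned assistant with a natural-language question."""
    concepts = encoder(activations)             # sparse concept bottleneck
    question_ids = tokenizer(question)          # (1, T) tensor of token ids
    logits = decoder(concepts, question_ids)    # (1, vocab_size) answer logits
    return tokenizer.decode(logits.argmax(dim=-1))

# e.g. ask_about_activations(enc, dec, acts,
#          "Is this prompt attempting to jailbreak the model?", tok)
```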
The team acknowledges that their current encoder is relatively simple, and future work could explore more complex architectures. This scalable and automated approach to interpretability will be crucial for maintaining our understanding of increasingly complex neural networks.

👉 More information
🗞 Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants
🧠 ArXiv: https://arxiv.org/abs/2512.15712
