AI Assigns Reliability, Abstains with 41.18% Accuracy

Quantum Zeitgeist
⚡ Quantum Brief
Peking University researchers developed the Multimodal Memory Agent (MMA), a novel AI framework that dynamically evaluates memory reliability by combining source credibility, temporal decay, and conflict-aware consensus to reduce hallucinations. MMA introduces a selective abstention mechanism, achieving 41.18% accuracy in identifying unreliable information on the new MMA-Bench benchmark—far surpassing baseline systems that scored 0%. The team uncovered the "Visual Placebo Effect," revealing how AI agents inherit biases from foundation models, prioritizing visual cues over contradictory textual evidence in multimodal reasoning tasks. Tests on FEVER and LoCoMo benchmarks showed MMA matched baseline accuracy while reducing response variance by 35.2% and cutting incorrect answers, demonstrating more consistent and safer performance. This work advances trustworthy AI by enabling agents to critically assess information quality, abstain when uncertain, and adapt to evolving data—a critical step for high-stakes applications like medicine and law.
Researchers are tackling the challenge of building reliable long-horizon multimodal agents, recognising that dependence on external memory can introduce inaccuracies through stale, low-credibility, or conflicting information. Yihao Lu, Wanru Cheng and Zeyu Zhang from the School of Computer Science, Peking University, working with Hao Tang and colleagues, present a novel approach called the Multimodal Memory Agent (MMA), which dynamically assesses the reliability of retrieved memories. This assessment combines source credibility, temporal decay, and conflict-aware network consensus to reweight evidence and abstain from answering when support is weak.

Significantly, the team also introduce MMA-Bench, a new benchmark designed to rigorously test belief dynamics under controlled conditions. Through this work, they reveal the “Visual Placebo Effect”, demonstrating how agents can inherit biases from foundation models, and achieve substantial improvements in accuracy and reduced variance across multiple established benchmarks, paving the way for more trustworthy artificial intelligence systems.

Artificial intelligence systems will soon reason more like humans, learning from past experiences and admitting when they don’t know something. This advance tackles a key weakness in current AI: confidently presenting false information as fact. The new agent dynamically assesses the trustworthiness of its memories, avoiding errors caused by unreliable data.

Scientists are increasingly reliant on memory-augmented large language model (LLM) agents for complex, long-horizon tasks demanding sustained contextual awareness. Recent advances in memory architectures have focused on structuring and controlling how these agents manage information, yielding gains on conversational benchmarks. However, the reliability of information retrieved from external memory remains a significant bottleneck.
Similarity-based retrieval, a common technique, frequently returns stale, untrustworthy, or even contradictory items, potentially leading to overconfident errors in agent responses. Without assessing the quality of retrieved information, these low-quality memories can amplify errors during multi-step reasoning, resulting in fluent but inaccurate outputs, known as hallucinations, and creating safety risks in applications where mistakes carry substantial consequences. Agents often provide answers even when evidence is lacking or inconsistent, exhibiting unwarranted confidence.

To address this, the framework assigns each retrieved memory item a reliability score, factoring in source credibility, how recently the information was acquired (temporal decay), and consistency with other stored memories. This signal is then used to reweight evidence, allowing the agent to abstain from answering when support is insufficient. The scientists also introduced MMA-Bench, a novel benchmark designed to rigorously test an agent’s belief-revision capabilities under controlled conditions of source reliability and text-vision contradictions. Using this benchmark, they identified a “Visual Placebo Effect”, revealing how retrieval-augmented generation (RAG) agents can inherit unintended visual biases from the foundation models they utilise.

On the FEVER dataset, MMA achieved accuracy comparable to existing systems, while reducing result variance by 35.2% and improving selective utility. A safety-focused configuration of MMA, tested on the LoCoMo benchmark, improved actionable accuracy and reduced incorrect responses. Most strikingly, on MMA-Bench, MMA attained 41.18% Type-B accuracy in Vision mode, a performance level that sharply contrasts with the baseline system’s complete failure (0.0%) under the same testing protocol. In short, this work proposes an active confidence-scoring framework that assesses memory reliability through source credibility, temporal decay, and cross-memory consistency.
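The reweight-and-abstain step described above can be sketched as follows. The threshold value and the (answer, score) representation are illustrative assumptions for this article, not details taken from the paper:

```python
def answer_or_abstain(candidates, threshold=0.55):
    """Pick the best-supported answer, or abstain.

    candidates: list of (answer, reliability) pairs, with reliability
    in [0, 1]. Returns the highest-scoring answer, or None (abstain)
    when even the strongest evidence falls below the threshold.
    The threshold of 0.55 is an illustrative assumption.
    """
    if not candidates:
        return None  # no evidence at all: abstain
    answer, score = max(candidates, key=lambda c: c[1])
    return answer if score >= threshold else None
```

For example, `answer_or_abstain([("Paris", 0.7), ("Lyon", 0.3)])` returns `"Paris"`, while `answer_or_abstain([("Paris", 0.4)])` abstains by returning `None`.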
Active reliability scoring via source, time and network consensus

Retrieved memory items are first assigned an active reliability score, combining three distinct components to assess trustworthiness. Source credibility was established by mapping the origin of each memory to a predefined trustworthiness prior, ensuring higher-quality sources received initial preference. Temporal decay then modelled information aging using an exponential function with a half-life, effectively diminishing the influence of older memories over time. Acknowledging the importance of corroboration, network consensus measured semantic support within the retrieved neighbourhood of each memory, acting as a consistency filter that reinforces alignment and penalises contradictions. Once calculated, these three components, source, time, and consensus, were combined into a self-normalising weighted sum, generating a scalar confidence score for each memory item.

Evaluating such a system necessitated a dedicated benchmark, leading to the creation of MMA-Bench. This programmatically generated benchmark simulates active social environments, specifically designed to stress-test belief revision under conditions of drifting evidence quality and conflicting modalities. Each case within MMA-Bench consists of a dialogue stream spanning ten temporal sessions, approximating six months, involving a reliable and an unreliable speaker. The benchmark’s design incorporates a four-phase generation pipeline: phase one implicitly establishes source reliability priors through verifiable events, while phase two introduces adversarial noise in the form of high-volume, irrelevant conversation. The framework evaluates performance across both Text Mode, utilising oracle captions, and Vision Mode, employing raw images, to assess cross-modal consistency.
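A minimal sketch of the three-component score described above, assuming illustrative weights, source priors, and a 30-day half-life (the paper’s actual values are not given in this article):

```python
def temporal_decay(age_days, half_life_days=30.0):
    """Exponential aging: the weight halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)

def network_consensus(neighbour_sims):
    """Mean semantic agreement with retrieved neighbours, where each
    similarity lies in [-1, 1]; mapped to [0, 1] so that contradictions
    (negative similarity) are penalised."""
    if not neighbour_sims:
        return 0.5  # no neighbours: neutral consensus
    return (sum(neighbour_sims) / len(neighbour_sims) + 1.0) / 2.0

# Hypothetical trustworthiness priors per source type.
SOURCE_PRIOR = {"verified": 0.9, "unknown": 0.5, "unreliable": 0.2}

def reliability_score(source, age_days, neighbour_sims, w=(0.4, 0.3, 0.3)):
    """Self-normalising weighted sum of source, time, and consensus."""
    parts = (SOURCE_PRIOR.get(source, 0.5),
             temporal_decay(age_days),
             network_consensus(neighbour_sims))
    return sum(wi * p for wi, p in zip(w, parts)) / sum(w)
```

Under these example weights, a fresh item from a verified source whose neighbours fully agree scores approximately 0.96, while the same claim from an unreliable, two-month-old source with contradicting neighbours scores far lower, which is exactly the reweighting behaviour the paper describes.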
On FEVER, MMA attained a raw accuracy of 59.93%, matching the MIRIX baseline’s 59.87%, but with a 35.2% reduction in standard deviation across three independent experimental seeds. This translates to a decrease from ±2.50% variability in the baseline to just ±1.62% for MMA, indicating more consistent results. MMA also achieved a selective utility score of 0.6484 under an abstention reward parameter of 0.2, exceeding the baseline’s 0.6468.

Improvements extended to safety-oriented performance on the LoCoMo benchmark. A specific MMA configuration, excluding network consensus, yielded an actionable accuracy of 79.64%, a measurable increase over the 78.96% achieved by the baseline. Simultaneously, the number of incorrect answers produced by MMA decreased from 317 to 298.

The most striking results emerged from the newly introduced MMA-Bench, designed to rigorously test belief dynamics under challenging conditions. Here, MMA reached 41.18% Type-B accuracy in Vision mode, a metric assessing the agent’s ability to correctly identify reliability inversions. By contrast, the baseline agent failed completely, registering 0.0% accuracy under the identical evaluation protocol. The improvements stem from MMA’s active reliability scoring, which combines source credibility, temporal decay, and conflict-aware network consensus to reweight retrieved memory items. The system prioritises evidence from credible sources and discounts stale or weakly supported information. This approach effectively mitigates retrieval traps, as demonstrated in case studies where MMA correctly identified the most reliable memory item while the baseline selected an irrelevant one.

Evaluating information trustworthiness improves artificial intelligence agent reliability

Scientists are increasingly reliant on systems that ‘remember’ information to tackle complex tasks, yet these memories are often flawed and can lead to confident but incorrect conclusions.

This research addresses a long-standing problem: how to build agents that not only access vast stores of knowledge but also critically assess its trustworthiness. Instead of blindly accepting everything it finds, MMA reweights evidence and, crucially, knows when to abstain from answering if support is insufficient. Establishing source credibility isn’t simply a technical fix; it’s a fundamental requirement for intelligent behaviour. Experiments reveal that removing the source assessment module entirely renders the agent incapable of forming any positive conclusions, highlighting its importance. The system isn’t perfect, and a trade-off exists between confidently answering questions and avoiding potentially misleading visual information.
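The selective utility score reported earlier (0.6484 under an abstention reward of 0.2) quantifies exactly this trade-off between answering and abstaining. One plausible form of such a metric, assuming a correct answer scores 1, an abstention scores the reward value, and a wrong answer scores 0 (the paper’s exact definition is not given in this article):

```python
def selective_utility(n_correct, n_wrong, n_abstain, abstention_reward=0.2):
    """Average per-question utility: 1 for a correct answer,
    `abstention_reward` for an abstention, 0 for a wrong answer.
    This exact formula is an assumption for illustration."""
    total = n_correct + n_wrong + n_abstain
    if total == 0:
        raise ValueError("no questions evaluated")
    return (n_correct + abstention_reward * n_abstain) / total
```

Under this definition, an agent that answers 6 of 10 questions correctly, gets 2 wrong, and abstains on 2 scores (6 + 0.2 × 2) / 10 = 0.64, slightly better than guessing on all 10 with only 6 correct (0.60), which is why a well-calibrated abstention policy can raise utility without raising raw accuracy.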

The team uncovered a “Visual Placebo Effect”, where agents inherit biases from the foundation models used to process images, accepting visual cues even when they contradict textual evidence. Future work could focus on refining the temporal decay mechanism to better handle rapidly changing information. Beyond this specific implementation, the broader effort will likely see a move towards agents that can actively seek out more reliable sources and continuously update their trust models. For applications ranging from medical diagnosis to legal reasoning, building AI that knows what it doesn’t know is as important as building AI that knows a lot.

👉 More information
🗞 MMA: Multimodal Memory Agent
🧠 ArXiv: https://arxiv.org/abs/2602.16493
