Large Language Models Memorize Supreme Court Cases, Demonstrating Complex Classification Abilities

The ability of large language models to accurately classify complex information remains a significant challenge, often resulting in unexpected or inaccurate responses. John E. Ortega, Dhruv D. Joshi, and Matt P. Borkowski, from Pace University, investigate how these models perform when tasked with classifying United States Supreme Court decisions, a particularly demanding test given the length, complexity, and specialized language of legal texts. Their research examines the memorization strategies these models employ, using advanced fine-tuning and retrieval techniques to improve classification accuracy.
The team demonstrates that prompt-based models with memory functions, such as DeepSeek, achieve notably more robust performance than previous methods, exceeding prior scores by approximately two points on the standard Supreme Court case classification task. This work advances understanding of how large language models process and retain information, paving the way for more reliable and accurate applications in legal and other complex domains. The SCOTUS corpus presents a significant challenge for assessing language model accuracy due to its extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. The core aim of this work is to evaluate the impact of memorization, scaling, domain adaptation, and prompt-based inference on the SCOTUS classification task. The researchers implemented a comparative framework, evaluating each model with accuracy, precision, recall, and F1 score, which allowed a detailed assessment of performance across both broad and fine-grained categorization tasks.
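To make the evaluation protocol concrete, here is a minimal sketch of scoring one model's predicted issue-area labels against gold labels with these four metrics. The label set and predictions are hypothetical, and scikit-learn is assumed as the metrics library; the paper does not specify its tooling.

```python
# Minimal sketch: score one model's predicted SCOTUS issue-area labels
# against gold labels with accuracy, precision, recall, and macro F1.
# Labels and predictions below are hypothetical illustrations.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical gold labels and model predictions over a tiny test set.
gold = ["Criminal Procedure", "Civil Rights", "Economic Activity", "Civil Rights"]
pred = ["Criminal Procedure", "Civil Rights", "Civil Rights", "Civil Rights"]

accuracy = accuracy_score(gold, pred)
# Macro averaging weights every category equally, which matters when
# fine-grained legal categories are imbalanced.
precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, average="macro", zero_division=0
)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Running the same scoring function over each model's outputs is what enables the side-by-side comparison across broad and fine-grained categories described above.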
The team’s approach extends beyond simple performance measurement, focusing on how these models respond to the complexities of legal language and whether memorization plays a significant role in their accuracy. The researchers rigorously evaluated both prompt-based and non-prompt-based approaches across the four models on this same demanding corpus. Results confirm that DeepSeek, combined with effective prompting strategies, performs best at capturing legal nuance and generalizing across categories. The findings demonstrate the potential of LLMs to improve the accuracy and efficiency of legal document classification, and by focusing on the corpus’s distinctive characteristics, the researchers identified strategies for improving model performance in this domain. While acknowledging the advances made possible by larger parameter counts and architectural innovations in recent models, the team emphasizes the need for continued research into how these models “memorize” and reason about legal information.
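For readers curious what a prompt-based classification setup looks like in practice, here is a minimal sketch of classifying a single opinion through an OpenAI-compatible chat endpoint such as DeepSeek's. The prompt wording, model name, length cap, and label set are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of prompt-based SCOTUS classification through an
# OpenAI-compatible chat API (DeepSeek exposes one). The prompt text,
# model name, and label set are illustrative assumptions and do not
# reproduce the paper's actual setup.
from openai import OpenAI

# DeepSeek's API is OpenAI-compatible; point the client at its base URL.
client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

LABELS = ["Criminal Procedure", "Civil Rights", "First Amendment",
          "Economic Activity", "Judicial Power"]

def classify_opinion(opinion_text: str) -> str:
    """Ask the model to pick exactly one issue-area label for an opinion."""
    response = client.chat.completions.create(
        model="deepseek-chat",
        temperature=0,  # deterministic output suits classification
        messages=[
            {"role": "system",
             "content": "You classify U.S. Supreme Court opinions. "
                        f"Answer with exactly one label from: {', '.join(LABELS)}."},
            {"role": "user", "content": opinion_text[:8000]},  # crude length cap
        ],
    )
    return response.choices[0].message.content.strip()

# Hypothetical usage:
# print(classify_opinion(open("scotus_opinion.txt").read()))
```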
👉 More information
🗞 Large-Language Memorization During the Classification of United States Supreme Court Cases
🧠 ArXiv: https://arxiv.org/abs/2512.13654