
Do Generalisation Results Generalise? Study Reveals Correlation of Out-of-Distribution Performance across Multiple Testsets

Quantum Zeitgeist

The ability of large language models to perform reliably on data they have not encountered before is fundamental to their practical use, yet current methods for assessing this ‘out-of-distribution’ generalisation often rely on evaluating performance on just one new dataset. Matteo Boglioni, Andrea Sgobbi, Gabriel Tavernini, Francesco Rita, and Tiago Pimentel from ETH Zürich, together with Marius Mosbach from Mila, Quebec Artificial Intelligence Institute, now challenge whether results from a single test accurately reflect a model’s true generalisation ability. Their research investigates whether performance on different out-of-distribution datasets correlates, revealing that a model’s success on one new dataset does not necessarily predict success on another. This finding, demonstrated using OLMo2 and OPT models, highlights the complexity of evaluating generalisation and suggests that a more nuanced approach is needed to ensure the reliability of these systems.

Current out-of-distribution (OOD) evaluation typically focuses on a single OOD dataset, which may not accurately reflect a model’s capabilities because the data shifts encountered after deployment are often diverse. This work asks whether OOD generalisation results themselves generalise: a model’s performance is evaluated on multiple OOD testsets throughout a finetuning run, and the researchers then measure the partial correlation of performance across these testsets, controlling for in-domain performance so that the analysis isolates generalisation rather than overall proficiency. Experiments with OLMo2 and OPT models revealed a complex and often unpredictable relationship between results on different OOD testsets.
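The article only summarises the method at a high level; as a minimal sketch of the underlying statistic, the Python snippet below correlates two OOD accuracy series after regressing out in-domain accuracy, which is one standard way of obtaining a partial correlation. The per-checkpoint numbers, variable names, and residual-based formulation are illustrative assumptions, not the authors’ code.

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, control):
    """Pearson correlation between x and y after regressing out `control` from both."""
    design = np.column_stack([np.ones_like(control), control])
    x_resid = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    y_resid = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return stats.pearsonr(x_resid, y_resid)

# Synthetic per-checkpoint accuracies (illustrative only, not the paper's data):
# in-domain accuracy plus two OOD testsets, tracked across a finetuning run.
rng = np.random.default_rng(0)
in_domain = np.linspace(0.60, 0.90, 50) + rng.normal(0.0, 0.01, 50)
ood_a = 0.8 * in_domain + rng.normal(0.0, 0.02, 50)
ood_b = 0.8 * in_domain + rng.normal(0.0, 0.02, 50)

r, p = partial_corr(ood_a, ood_b, in_domain)
print(f"partial correlation of OOD testsets, given in-domain accuracy: {r:.3f} (p = {p:.3f})")
```

With in-domain accuracy controlled for, a correlation near zero would mean that improvements on one OOD testset say little about the other, which is exactly the kind of outcome the study probes.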

Language Model Correlation Across Sizes and Datasets

The study examines correlation matrices of out-of-distribution performance for OPT and OLMo2 models of varying sizes (3B, 7B, 13B, 30B) finetuned on different datasets (MNLI and SNLI). These matrices show how closely performance on one OOD testset tracks performance on another, with higher correlation indicating that gains tend to transfer between testsets. The model sizes reflect different capacities to learn complex relationships, while MNLI and SNLI, both natural language inference datasets, require models to determine the relationship between pairs of sentences. The analysis relies on partial correlation, which removes the influence of in-domain performance so that the remaining correlation reflects generalisation rather than overall proficiency.
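A full matrix of this kind could plausibly be assembled, for one model, by repeating the pairwise computation over every pair of OOD testsets; a hedged sketch is below. The pingouin library is just one convenient tool for partial correlation, and the testset names and synthetic numbers are assumptions for illustration.

```python
import itertools
import numpy as np
import pandas as pd
import pingouin as pg  # assumed choice of library for partial correlation

# Synthetic per-checkpoint accuracies for a single model (illustrative only).
rng = np.random.default_rng(1)
n_ckpts = 40
in_domain = np.linspace(0.55, 0.90, n_ckpts) + rng.normal(0.0, 0.01, n_ckpts)
df = pd.DataFrame({
    "in_domain": in_domain,
    "ood_1": 0.7 * in_domain + rng.normal(0.0, 0.02, n_ckpts),
    "ood_2": 0.6 * in_domain + rng.normal(0.0, 0.02, n_ckpts),
    "ood_3": 0.5 * in_domain + rng.normal(0.0, 0.02, n_ckpts),
})

ood_sets = ["ood_1", "ood_2", "ood_3"]
matrix = pd.DataFrame(np.eye(len(ood_sets)), index=ood_sets, columns=ood_sets)
for a, b in itertools.combinations(ood_sets, 2):
    # Partial correlation of each OOD pair, controlling for in-domain accuracy.
    r = pg.partial_corr(data=df, x=a, y=b, covar="in_domain")["r"].iloc[0]
    matrix.loc[a, b] = matrix.loc[b, a] = r

print(matrix.round(2))
```

In the study, one such matrix would presumably exist per model size, architecture, and finetuning dataset, which is what makes the comparisons across sizes and datasets possible.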

Generalised additive model (GAM) regressors are used to model the potentially non-linear relationship between model size and correlation, allowing a more flexible fit than simple linear regression. The resulting patterns provide insight into the effects of model size, architecture, and training data. Visual inspection of the correlation matrices shows the strength and structure of the correlations, while summaries report average partial correlations for each model size, broken down by finetuning dataset. Examining trends with model size reveals whether larger models tend to generalise more or less consistently across testsets, comparing MNLI and SNLI highlights the influence of the finetuning dataset, and comparing OPT and OLMo2 models of the same size isolates the effect of architecture. Several interpretations are possible: correlations that strengthen with size would suggest that larger models converge towards a more universal representation of language; systematic differences between OPT and OLMo2 at the same size would point to architectural effects; and dataset bias, where the finetuning data shapes what models learn, remains a possibility. Higher correlations might also indicate better generalisation, with models learning more robust and transferable representations.
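As a rough sketch of what such a size-trend fit might look like (not the authors’ code), the snippet below fits a GAM with a single smooth term on model size to average partial correlations. The pygam library and every number here are assumptions chosen purely for illustration.

```python
import numpy as np
from pygam import LinearGAM, s  # assumed GAM implementation

# Illustrative inputs only: model sizes in billions of parameters, paired with the
# average partial correlation of OOD performance observed for a model of that size
# finetuned on one of the two NLI datasets (two entries per size).
sizes = np.array([3, 3, 7, 7, 13, 13, 30, 30], dtype=float).reshape(-1, 1)
avg_partial_corr = np.array([0.12, 0.08, 0.03, -0.01, -0.05, 0.02, 0.09, 0.06])

# A smooth term on size allows a non-monotonic trend that a straight line would miss.
gam = LinearGAM(s(0, n_splines=4, spline_order=2)).fit(sizes, avg_partial_corr)

grid = np.linspace(3, 30, 50).reshape(-1, 1)
print(gam.predict(grid)[:5])  # fitted trend of average partial correlation vs. size
```

Plotting the fitted curve against the raw points is then a direct way to see whether correlations rise, fall, or wobble as models grow.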

OOD Generalisation Reveals Unpredictable Transfer Patterns

This research delivers a detailed analysis of out-of-distribution generalisation in large language models, moving beyond evaluations on single datasets to investigate how performance correlates across multiple OOD testsets. The scientists evaluated model performance throughout a finetuning run, specifically examining the partial correlation of results across different OOD testsets after accounting for in-domain performance. This approach isolates how strongly generalisation transfers between different data shifts, beyond simply measuring overall in-domain proficiency.

The team discovered no overarching trend indicating consistent transfer of performance; instead, the correlation between any two OOD testsets depended heavily on the specific model analysed. For example, a positive correlation between generalisation performance on two testsets under one model might become negative under a different model, demonstrating substantial variance. Measurements confirm that whether any two OOD testsets are positively or negatively correlated is strongly influenced by the model being evaluated. The study highlights that fair evaluation of generalisation requires spanning multiple OOD testsets, as performance on a single dataset cannot reliably predict performance on others. Assessing OOD generalisation is therefore not simply a matter of achieving a high score on a single benchmark, but of understanding the complex interplay between model architecture, training data, and the specific characteristics of the OOD data itself.

Generalisation Depends on Model and Data

This research demonstrates that evaluating a language model’s ability to generalise to new data requires assessing performance across multiple, diverse test sets, rather than relying on single benchmarks.

The team investigated whether improvements on one out-of-distribution test set reliably translate to gains on others, analysing models during a finetuning process. Results reveal a complex picture, showing that correlations between performance on different test sets are not consistent and depend heavily on the specific model and training data used. The study found no overarching trends suggesting that a model excelling on one challenging dataset will necessarily perform well on others, or that larger models consistently exhibit more stable generalisation abilities. Importantly, the team demonstrated that apparent robustness observed when comparing models can be misleading, as both model size and initial in-domain performance strongly influence results. Partial correlation analysis confirmed this complexity, revealing that the relationship between performance on any two out-of-distribution test sets is not an inherent property of those tests themselves. The authors acknowledge that computational limitations prevented the inclusion of models larger than 30 billion parameters in their analysis, and future work could explore whether the observed inconsistencies persist in larger models or different model families.

This research underscores the need for comprehensive evaluation strategies when assessing a language model’s true generalisation capabilities, an aspect often overlooked in current research practices.

More information: Do Generalisation Results Generalise? (arXiv: https://arxiv.org/abs/2512.07832)

Source: Quantum Zeitgeist