Document Parser Benchmarking Achieves 0.78 Correlation with Human Evaluation of Mathematical Formula Extraction from PDFs

Extracting mathematical formulas accurately from scientific PDFs is a significant challenge for accessing and utilising the knowledge within academic literature. Pius Horn from Offenburg University, Janis Keuper from the University of Mannheim, and their colleagues have addressed this problem with a new benchmarking framework. Their work introduces a systematic method for evaluating the performance of document parsers, using synthetically generated PDFs with known mathematical content, and, importantly, a novel approach that employs large language models to assess the semantic correctness of extracted formulas.
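
To make the LLM-as-judge idea concrete, the following is a minimal sketch of how an extracted formula might be compared against its ground-truth LaTeX. The prompt wording, the OpenAI client, the "gpt-4o" model name, and the binary scoring scale are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of LLM-as-judge scoring for extracted formulas.
# Assumption: the OpenAI chat API and "gpt-4o" are stand-ins; the paper
# does not necessarily use this provider, prompt, or scoring scale.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are given a ground-truth LaTeX formula and a formula
extracted by a PDF parser. Decide whether the extracted formula is
semantically equivalent to the ground truth (ignoring whitespace,
bracket style, and equivalent notation). Answer with a single number:
1 for equivalent, 0 for not equivalent.

Ground truth: {gt}
Extracted:    {pred}
"""

def judge_formula(gt_latex: str, pred_latex: str) -> int:
    """Ask the LLM judge whether two formulas mean the same thing."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable chat model would do
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(gt=gt_latex, pred=pred_latex)}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip()
    return 1 if answer.startswith("1") else 0

# Example: notational variants that plain string comparison would reject.
print(judge_formula(r"\frac{1}{2}x^{2}", r"x^2/2"))
```

The point of the semantic check is exactly the kind of case in the example: a string or character-level metric penalises `x^2/2` against `\frac{1}{2}x^{2}`, even though the two are mathematically identical.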
This research demonstrates that the LLM-based evaluation aligns closely with human judgement, offering a substantial improvement over existing methods, and reveals considerable variation in performance among contemporary PDF parsers when tested on a large dataset of formulas. The findings provide valuable guidance for researchers and practitioners selecting appropriate tools for automated knowledge extraction from scientific documents, and establish a reproducible methodology for future evaluation in this critical area.

OmniOCR Benchmarks Mathematical Formula Extraction

Scientists rigorously evaluated document parsers, focusing on their ability to accurately extract mathematical formulas from documents, including images and PDFs. The evaluation leveraged document-layout datasets such as the OmniOCR Benchmark, DocLayNet, and PubLayNet, alongside the ICDAR 2023 CROHME competition on handwritten mathematical expressions. Over 20 models were assessed, including Qwen3-VL, Deepseek-OCR, Mistral OCR, dots.ocr, Nanonets-OCR-s, OpenAI's GPT models, EMERS, MinerU, and Grobid. The research highlighted challenges including document diversity, notation complexity, handwritten formula recognition, and the importance of accurate layout analysis. Researchers employed metrics such as BLEU and character detection matching, alongside tree matching and syntax-aware networks, to assess formula recognition accuracy.

Synthetic PDFs and LLM-Based Formula Matching

Scientists engineered a novel benchmarking framework to rigorously evaluate PDF parsers for mathematical formula extraction, addressing a critical gap in existing evaluation methods. The study pioneered a synthetic PDF generation approach, creating documents with precise LaTeX ground truth to overcome the limitations of manually annotated benchmarks. This method allows systematic control over layout, formulas, and content characteristics, enabling targeted assessment of parser performance. To address inconsistencies in parser outputs, the researchers developed a robust two-stage matching pipeline that uses a large language model as a judge for semantic formula assessment, moving beyond simple text comparison. Validation with human evaluators demonstrated that the LLM-based evaluation achieves a substantially higher correlation with human judgment (r=0.78) than traditional methods such as character detection matching (r=0.34). Evaluating over 20 contemporary PDF parsers across a dataset of 100 synthetic documents containing over 2,000 formulas revealed significant performance disparities, providing crucial insights for practitioners.
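
The synthetic-generation idea can be sketched as follows: sample formulas whose LaTeX source is known, render them into a PDF, and keep that source as ground truth. The helper names, the use of pdflatex, the formula pool, and the layout parameters below are assumptions for illustration, not the authors' actual pipeline.

```python
# Sketch of synthetic-PDF generation with exact LaTeX ground truth.
# Assumptions: pdflatex is installed; the formula pool, layout choices,
# and file names are illustrative, not the benchmark's configuration.
import json
import random
import subprocess
from pathlib import Path

FORMULA_POOL = [                      # stand-in for the Wikipedia-derived corpus
    r"e^{i\pi} + 1 = 0",
    r"\int_0^\infty e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2}",
    r"\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}",
]

def build_document(formulas, font_size="11pt", columns="onecolumn"):
    """Assemble a LaTeX source string with the sampled formulas embedded."""
    body = "\n\n".join(
        f"Filler paragraph {i}.\n\\begin{{equation}}\n{f}\n\\end{{equation}}"
        for i, f in enumerate(formulas)
    )
    return (f"\\documentclass[{font_size},{columns}]{{article}}\n"
            "\\usepackage{amsmath}\n\\begin{document}\n"
            f"{body}\n\\end{{document}}\n")

def generate(doc_id: int, n_formulas: int = 3, out_dir: str = "bench"):
    out = Path(out_dir); out.mkdir(exist_ok=True)
    formulas = random.sample(FORMULA_POOL, k=n_formulas)
    tex_path = out / f"doc_{doc_id}.tex"
    tex_path.write_text(build_document(formulas), encoding="utf-8")
    # Compile to PDF; the ground-truth JSON records exactly what was rendered.
    subprocess.run(["pdflatex", "-interaction=nonstopmode",
                    "-output-directory", str(out), str(tex_path)], check=True)
    (out / f"doc_{doc_id}.json").write_text(
        json.dumps({"formulas": formulas}, indent=2), encoding="utf-8")

generate(doc_id=0)
```

Because the document is assembled programmatically, the ground truth is exact by construction, and parameters such as document class, font size, margins, and column count can be varied systematically, which is the control the paper emphasises.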
The team established a public leaderboard and released both the benchmark dataset and code, fostering reproducibility and enabling ongoing evaluation of PDF formula extraction quality.

Synthetic PDF Benchmark for Formula Extraction

Scientists developed a novel benchmarking framework to rigorously evaluate the performance of PDF parsers on extracting mathematical formulas, a critical task for building scientific knowledge bases. The work centers on generating synthetic PDFs with precisely defined LaTeX ground truth, allowing systematic control over layout, formula complexity, and content characteristics. A dataset of 319,000 LaTeX formulas was sourced from Wikipedia and filtered for visual complexity. The generated benchmark PDFs incorporate variations in document class, font size, margins, and column layout, and contain content in four languages: English, German, French, and Spanish. This process resulted in 100 synthetic documents containing over 2,000 formulas in total, providing a comprehensive testbed for parser evaluation. Researchers pioneered a two-stage matching pipeline leveraging a large language model as a judge to address the challenges of matching parser output with the ground truth. Experiments demonstrated that this LLM-based evaluation achieved a Pearson correlation coefficient of 0.78 with human judgment, significantly outperforming traditional methods such as character detection matching (r=0.34).

Semantic Benchmarking of Formula Extraction Accuracy

The research team developed a new benchmarking framework for evaluating the accuracy of methods that extract mathematical formulas from PDF documents, a critical step in building scientific knowledge bases. Existing benchmarks lack methods for assessing semantic correctness, meaning they cannot reliably determine whether an extracted formula preserves the meaning of the original. This work addresses that gap by generating synthetic PDF documents with precisely known ground-truth LaTeX code for every formula, allowing systematic control over document layout and content. A key achievement is the use of a large language model as a judge to evaluate the semantic correctness of extracted formulas, combined with a robust matching pipeline. Validation with human evaluators demonstrated that this LLM-based evaluation correlates strongly with human judgment.
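
The validation against human judgement comes down to computing a Pearson correlation between the automatic scores and the human ratings over the same set of formulas. Below is a minimal sketch, assuming scipy and toy score vectors; the numbers are made up purely to show the shape of the comparison, not the paper's data.

```python
# Sketch of the human-agreement check: correlate automatic scores with
# human ratings. The score vectors here are toy data, not the paper's results.
from scipy.stats import pearsonr

human_scores = [1, 0, 1, 1, 0, 1, 0, 1]                   # hypothetical human judgements
llm_judge    = [1, 0, 1, 1, 0, 1, 1, 1]                   # hypothetical LLM-judge scores
cdm_scores   = [0.9, 0.7, 0.4, 0.8, 0.6, 0.5, 0.3, 0.9]   # hypothetical CDM values

r_llm, _ = pearsonr(human_scores, llm_judge)
r_cdm, _ = pearsonr(human_scores, cdm_scores)
print(f"LLM judge vs. human: r = {r_llm:.2f}")
print(f"CDM vs. human:       r = {r_cdm:.2f}")
```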
The team has made the code and benchmark data publicly available to facilitate further research and development in this area.

More information: Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs, arXiv: https://arxiv.org/abs/2512.09874
