Why GPT Alone Won’t Cut It for Real Document Extraction

A practical look into the limits of LLMs in Document Intelligence

At Anyformat, our mission is simple: turn any file into the data you need.

Large Language Models like GPT have transformed how we interact with text. They summarize, translate, and even generate content with impressive fluency. But when it comes to turning real-world documents — invoices, delivery notes, project plans — into reliable, structured data, the approach of “just using GPT” quickly falls short.

🧩 Real Documents Are Complex — And That Matters

Real documents are not just plain text. They combine diverse components: structured paragraphs, different font hierarchies, images, stamps, signatures, tables, diagrams, figures, etc.

Trying to extract all this information in a single model pass (one-shot extraction) can fail in unexpected and subtle ways. Why?

Because models must simultaneously handle layout understanding, semantic interpretation, and structure preservation. Minor layout noise or ambiguous formatting can derail the output entirely.

🧪 Can GPT Do OCR? Yes. Can You Rely on It? Sometimes.

Every reliable document intelligence pipeline starts with one essential step: OCR (Optical Character Recognition).

That means converting PDFs or scanned images into a machine-readable format like Markdown or HTML — the foundation for any further analysis.
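
To make that baseline concrete, here is a minimal sketch of what “just using GPT” for this step typically looks like: a single multimodal call asking the model to transcribe a page image into Markdown, with tables rendered as HTML. The model name, prompt wording, and file path are illustrative assumptions, not a recommendation.

```python
# Hypothetical one-shot OCR call via the OpenAI Python SDK (>= 1.x).
# The model, prompt, and file path are placeholders for illustration only.
import base64

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("invoice_page1.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Transcribe this page as Markdown. Render any tables as HTML, "
                     "preserving merged cells, header rows, and reading order."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```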

GPT and similar multimodal LLMs can technically perform OCR. But here’s what we’ve consistently seen in practice:

  • Minor layout noise leads to major misinterpretations
  • Low-contrast scans produce partial or merged text blocks
  • Tables and figures are often flattened or misread as prose
  • Visual cues such as stamps or highlights are ignored

And perhaps more importantly:

🔄 If the OCR is flawed, every downstream task, such as data extraction, is compromised.

A table misread as a paragraph? Impossible to parse.

A header dropped in OCR? Lost semantics.

An incorrect date or number? Broken business logic.

These are not just technical annoyances — they’re product-killers for companies trying to automate document workflows with LLMs alone.

📉 Tables: The Silent Breaker of Document Pipelines

Among all document elements, tables are uniquely difficult to handle.

They compress multi-dimensional information into dense, layout-dependent formats. A table’s meaning relies not just on the text it contains, but on how that text is arranged — across rows, columns, headers, merged cells, and implicit structures.

While GPT models and other LLMs can sometimes generate passable tables, they consistently struggle to:

  • 🔄 Maintain structure — small OCR errors often cascade into misaligned rows or lost headers
  • 🧱 Preserve layout — formatting ambiguities are misinterpreted or ignored
  • ⚠️ Ensure consistency — output varies wildly across documents with only minor visual differences

📊 Benchmark: Table Accuracy

We used a dataset of ~1,130 tables, extracted from over 1,000 real documents, based on an open-source dataset [1]. It’s a valuable foundation for the community, but like any dataset and evaluation framework, it comes with limitations.

Here’s what we’ve observed:

  • Repetitive layouts: Many files share the same visual structure with only minor content changes. This reduces layout diversity and can artificially boost model performance.
  • Inconsistent table formats: The dataset mixes Markdown and HTML representations without clear criteria. This inconsistency makes evaluation harder to interpret and replicate.

In practice, we’ve found that many extraction failures originate in the parsing step to Markdown: Markdown simply can’t represent complex structures like merged cells or nested headers accurately. To ensure clean, reliable inputs for evaluation, we manually reviewed and corrected the Markdown-to-HTML conversions and validated the resulting HTML ground truths, especially for edge cases and complex layouts.

To evaluate model performance, we computed the normalized edit distance between the predicted and ground-truth HTML [2]. This metric allowed us to quantitatively compare results and determine which models offered the most accurate HTML table generation for further data extraction.
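
The snippet below is a simplified illustration of such a metric: a normalized Levenshtein distance between a predicted and a ground-truth HTML table, where 0.0 means identical strings and 1.0 means completely different. The sample tables are made up; note the colspan in the ground truth, which is exactly the kind of structure a plain Markdown table cannot express.

```python
# Simplified sketch of the evaluation metric: normalized edit (Levenshtein)
# distance between predicted and ground-truth HTML table strings.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, truth: str) -> float:
    """0.0 = identical, 1.0 = completely different."""
    if not pred and not truth:
        return 0.0
    return levenshtein(pred, truth) / max(len(pred), len(truth))

# Illustrative example: the model flattened a merged header cell.
ground_truth = '<table><tr><th colspan="2">Q1</th></tr><tr><td>A</td><td>120</td></tr></table>'
prediction   = '<table><tr><th>Q1</th><th></th></tr><tr><td>A</td><td>120</td></tr></table>'
print(f"normalized edit distance: {normalized_edit_distance(prediction, ground_truth):.3f}")
```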

With one-shot extraction, i.e., converting the full page to Markdown with tables in HTML format, we observe the following results.


Figure 1: One-shot extraction table OCR accuracy across various LLMs, measured as the average normalized edit distance between extracted HTML tables and ground truth.

Key takeaways:

  • Accuracy varies sharply across models.
  • Smaller models tend to perform poorly.
  • Some large models (like Gemini Pro) stand out, but may be overkill for simple layouts.

So how do you balance precision and cost-efficiency?

🛠️ Our Approach: Decompose Before You Extract

At Anyformat, we don’t rely on brittle, one-shot model passes.

Instead, our AI agent actively orchestrates the extraction process, breaking down each document into its meaningful parts before deciding how to handle them.

We’ve built a structure-aware, multi-stage pipeline (sketched in code after the list below) where the agent:

  1. Enhances input quality through pre-OCR adjustments (contrast, denoising, DPI boost).
  2. Segments content semantically, identifying zones like paragraphs, tables, figures, headers, and footnotes.
  3. Detects and isolates critical elements such as stamps, signatures, and tabular data.
  4. Routes each element to specialized extraction routines using tailored LLM prompts.
  5. Validates and reconciles the output through automatic consistency checks and structured postprocessing.
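
The sketch below illustrates this decompose-then-extract flow in a simplified form. The zone types, prompts, and the call_llm stub are hypothetical stand-ins for illustration, not our production code.

```python
# Simplified, hypothetical sketch of a decompose-then-extract pipeline.
# Zone types, prompts, and the `call_llm` stub are illustrative placeholders.
from dataclasses import dataclass

from PIL import Image, ImageEnhance, ImageFilter

@dataclass
class Zone:
    kind: str           # e.g. "paragraph", "table", "figure", "stamp"
    image: Image.Image  # cropped region of the page

PROMPTS = {
    "table": "Transcribe this table as HTML. Preserve merged cells and header rows.",
    "paragraph": "Transcribe this text block as Markdown.",
    "stamp": "Describe the stamp or signature and transcribe any legible text.",
}

def call_llm(prompt: str, image: Image.Image) -> str:
    """Stub for a multimodal LLM call (see the one-shot OCR snippet above)."""
    raise NotImplementedError

def enhance(page: Image.Image) -> Image.Image:
    """Step 1: pre-OCR cleanup: grayscale, contrast boost, denoising, 2x upscale."""
    page = page.convert("L")
    page = ImageEnhance.Contrast(page).enhance(1.5)
    page = page.filter(ImageFilter.MedianFilter(size=3))
    return page.resize((page.width * 2, page.height * 2), Image.Resampling.LANCZOS)

def segment(page: Image.Image) -> list[Zone]:
    """Steps 2-3: identify and isolate semantic zones such as tables or stamps (stubbed)."""
    return [Zone("table", page)]

def extract(zone: Zone) -> str:
    """Step 4: route each zone to a specialized prompt instead of one whole-page pass."""
    prompt = PROMPTS.get(zone.kind, "Transcribe this region faithfully.")
    return call_llm(prompt, zone.image)

def validate(results: list[str]) -> list[str]:
    """Step 5: consistency checks and structured post-processing would go here."""
    return results

def process_page(page: Image.Image) -> list[str]:
    return validate([extract(z) for z in segment(enhance(page))])
```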

This intelligent decomposition means:

  • Noise and layout artifacts are filtered early.
  • Hallucinations are reduced by narrowing the model’s scope.
  • We don’t rely on model guesswork; our agent shapes the task so that even general-purpose models can perform with precision.


Figure 2: Example of a complex real-world document layout, where multiple elements such as section headers, charts, and multiple tables coexist. This highlights the importance of accurate element segmentation before attempting structured data extraction.

🛠️ From DIY to Done-for-You

Some teams try to build their own pipelines using general-purpose LLMs. A few succeed, but only after months of engineering, trial-and-error prompt tuning, and QA overhead.

Most businesses don’t have that luxury.

Anyformat isn’t a service. It’s a product.

One that delivers production-grade extraction from day one: no fragile scripts, no brittle pipelines.

You don’t need to build your own document intelligence engine.

You need one that works.

🚀 Ready to See It in Action?

Let us test Anyformat on your documents.

See what accurate extraction actually looks like.

📧 info@anyformat.ai

🌐 www.anyformat.ai

🧾 Methodology Summary

  • Documents analyzed: 1,000+ real-world files, including invoices, delivery notes, reports, and more [1].
  • Tables evaluated: ~1,130, manually reviewed and rewritten in HTML to preserve complex structures like nested headers and merged cells.
  • Evaluation metric: Normalized edit distance between the model’s output and the ground truth [2].
  • Benchmarked models: Claude 3.7 Sonnet, Claude 3.5 Sonnet, Claude 3 Haiku, Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 2.0 Flash, Gemini 2.5 Pro, GPT-4o, GPT-4o mini, and GPT-4.1.

📚 Bibliography

[1] getOmni.ai. (2024). OCR Benchmark Dataset [Dataset]. Hugging Face. https://huggingface.co/datasets/getomni-ai/ocr-benchmark

[2] Zhong, X., ShafieiBavani, E., & Jimeno Yepes, A. (2020). Image-based table recognition: Data, model, and evaluation. In European Conference on Computer Vision (ECCV). Cham: Springer International Publishing.