Evaluating AI Models for Data Table Extraction
- Mar 2
- 7 min read
TL;DR A significant part of accurately observing and analysing climate information involves extracting structured data from tables published in documents like PDFs.
We benchmarked 8+ multimodal and OCR models on real-world PDF table extraction. 58% of results were immediately disqualified on accuracy alone (a sobering baseline!). Mathematical notation and complex layouts are where most models fall apart. For high-volume production use, gemini-3-flash-preview hits the sweet spot: 95%+ accuracy at an average of 5.7 seconds per table. If accuracy is everything and time isn’t, kimi-k2.5 leads the pack at 99.7% accuracy.
Spoiler alert: image resizing probably won’t save you the tokens you think it will.
Note: Since this article was authored, several new model versions have been released. We do not think they will affect the results significantly. We'll do some more testing, and if anything changes we'll publish an update.
The landscape of multimodal models capable of reading and transcribing table images has exploded recently, and with it comes a genuinely difficult question: does the model you choose actually matter?
Speed, accuracy, and cost don’t always move in the same direction, and when you’re processing thousands of documents, small inefficiencies compound real fast.
So, we did what any data-obsessed team would do: we benchmarked the options. We selected a representative set of real table images from our target corpus, spanning a range of sizes and structural complexity, ran them through a suite of multimodal and OCR models, and let the numbers speak for themselves.
Here’s what we found.
A benchmark of multimodal OCR for tables
(The graphs in this article were produced by Gemini in Google Sheets)
One of our tools is heavy on OCR (optical character recognition); specifically, converting tables embedded in PDFs back into their constituent data. This requires multimodal large language models or dedicated OCR models. With the many options available these days, this was a perfect opportunity to learn how much model choice actually matters.
When you’re processing a large number of documents, time is material. On the other hand, accuracy is also important, and then cost pops in to make itself known. Let’s see if we can find a sweet spot.
Disclaimer: This is not the world’s most rigorous benchmark, and every situation is different. This is not financial advice and we’re not lawyers… etc. 🙂 We’re committed to sharing what we’re learning in the open. Expect GitHub links in the future.
Methodology
First, we selected a random sample of data tables (which are stored as images in PDFs) to extract from the PDF reports we’ve been studying recently (corporate climate disclosures, if you’re interested). They vary in size (pixel dimensions), volume (number of rows/columns) and complexity (cell contents, including graphics, plus spanned rows/columns).
We then selected a range of multimodal and OCR models hosted as APIs from various providers, including a self-hosted model from Bytedance, to be our candidates for benchmarking.
Then we whipped up a Python script to do the following:
1. Submit each image to gemini-3-pro-preview to convert the table to markdown. Based on our previous accuracy assessments, we decided this would be the baseline. Gemini is our default model group of choice because we’re big fans of Google’s efforts to mitigate data centre harm1.
2. Submit each image to each candidate model in turn to have the table converted to markdown. We kept the prompt and all settings the same across models to keep it fair. For the technically inclined, this was done with LiteLLM to keep things simple.
3. On a second pass, resize each image and submit it again. Gemini specifically tiles the image as part of processing, and we wanted to see whether resizing would make a difference. More on tiling below.
4. Finally, take the output from each image/model pair, along with the baseline transcription from gemini-3-pro-preview, and submit both back to gemini-3-pro-preview to compare them, producing a score and feedback on any markdown differences.
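For the curious, the submission step looks roughly like this. It’s a minimal sketch, not our production script: the model list, prompt wording and helper names are illustrative, and we assume LiteLLM’s OpenAI-style multimodal message format with API keys configured in the environment.

```python
import base64
from pathlib import Path

# Illustrative model identifiers and prompt -- not our exact benchmark config.
MODELS = ["gemini/gemini-3-flash-preview", "openai/gpt-5.2"]
PROMPT = "Transcribe this table to GitHub-flavoured markdown. Output only the table."

def image_message(image_path: str, prompt: str = PROMPT) -> list[dict]:
    """Build an OpenAI/LiteLLM-style multimodal message for one table image."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }]

def transcribe(image_path: str, model: str) -> str:
    """Submit one image to one model and return the markdown transcription."""
    import litellm  # assumes litellm is installed and API keys are set
    resp = litellm.completion(model=model, messages=image_message(image_path))
    return resp.choices[0].message.content
```

Keeping the prompt and message construction identical across models is what makes the per-model timings and scores comparable.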
The results are in... which raises more questions than answers
Let’s start from the assumption that errors are bad. Any accuracy score below 95% should be discarded - and even that threshold is arguably generous.
Applying it immediately eliminated around 58% of results. That single figure tells you most of what you need to know about where multimodal models currently stand with tabular data.
Does image complexity matter?
Data tables vary enormously in structure and content.
To isolate the effect of model choice, we looked at whether the same images received consistent transcription scores across models (the score measures how closely the output matched the benchmark transcription). If a given image scores much the same no matter which model reads it, then image complexity, not model capability, is the dominant performance factor.
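As a back-of-envelope sketch, the check amounts to measuring per-image score spread across models. The scores below are made-up illustrative values, not our benchmark data:

```python
from statistics import pstdev

# Hypothetical transcription scores, keyed by image, one score per model.
scores = {
    "plain_numeric_table": [99, 98, 99],   # models agree: the image is easy
    "formula_heavy_table": [60, 95, 40],   # models diverge: model choice matters
}

# Small spread => the image itself drives the result;
# large spread => model capability is the differentiator.
spread = {image: pstdev(vals) for image, vals in scores.items()}
```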

Alas, it isn't that simple. Visual complexity - specifically structural layout and the presence of mathematical notation - is the primary driver of failure. Most models handle standard tables with regular rows, columns, and numeric content competently. But a subset of structurally complex images reveals significant differences in model robustness.
The clearest failure point is mathematical content. Images containing LaTeX or formulas caused transcription failures across many models, ranging from stripped operators and misread superscripts to cells being omitted entirely. Even GPT-5.2 - one of the stronger performers - clocked only a 60% success rate on formula-heavy images. For documents where precision matters (such as auditable climate disclosures), that's a hard limitation.
Beyond formulas, the most common failure modes were row fragmentation (a single text-heavy row incorrectly split into multiple cells) and header misclassification.
Speed vs Accuracy
The big one. Kimi-k2.5 leads in accuracy, with an average score across the test images of 99.7%. However, it’s the slowest model in the ‘high accuracy’ group, taking more than 2 minutes per request at the 95th percentile of processing time. If you’ve got time to spare, then this is the way to go.
Next up were opus-4-5 and gpt-5.2, showing good accuracy (98.71% and 98.39% respectively) with processing times between 10 and 15 seconds.
But the winner was gemini-3-flash-preview, which delivered a large share of results at 95%+ accuracy while also being the fastest model, averaging just 5.7 seconds per table.

What about image size?
Gemini models are coming out on top, but what about image size: is there any penalty in speed or accuracy depending on the size of the image itself?
The good news: image resolution has almost no effect on transcription accuracy - bigger isn't better, and smaller isn't worse, right up until the image becomes too small to read clearly.
However, speed is where resolution starts to matter. Some models (glm-4.6v and claude-opus-4-5) get noticeably slower as image size increases, while others (gpt-5.2 and gemini-3-flash-preview) process images in roughly the same time regardless of size.

Does resizing images make a difference?
On accuracy, no - results stay consistent. On speed, it depends on the model. Resizing to a standardised 1536px format delivered meaningful gains for two model families: glm-4.6v saw a 25% speed improvement, and Gemini models were 16.5% faster.
Importantly, these gains came without any drop in accuracy - this comparison only includes results that already scored 95% or above.
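As a reference point, resizing to that standardised format takes only a few lines. This is a sketch assuming Pillow is installed; `target_size` and `resize_for_ocr` are our own illustrative helpers, and the 1536px default matches the target we used:

```python
def target_size(width: int, height: int, longest: int = 1536) -> tuple[int, int]:
    """New dimensions with the longest side capped at `longest`, aspect ratio
    preserved. Images already at or below the cap are returned unchanged."""
    scale = min(1.0, longest / max(width, height))
    return round(width * scale), round(height * scale)

def resize_for_ocr(path_in: str, path_out: str) -> None:
    """Downscale a table image before submitting it for transcription."""
    from PIL import Image  # assumes Pillow is installed
    with Image.open(path_in) as im:
        im.resize(target_size(*im.size), Image.LANCZOS).save(path_out)
```

Preserving the aspect ratio matters here: as the tiling section below shows, it is the proportions of the image, not its absolute size, that determine Gemini’s token charge.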
Tiling: Why did we choose 1536px as the resizing target?
We found that when you send an image to a Gemini model, it doesn't read the whole thing at once. It slices the image into a grid of 768×768 pixel tiles and charges 258 tokens per tile. That means image size has a direct effect on cost - and it's why we chose 1536px as our resizing target.
Here's how the token maths works. Take an image at 960×540px. Gemini calculates a "crop unit" for that image (in this case 360px), then divides each dimension by that unit to get the tile grid: 3×2 = 6 tiles. At 258 tokens each, that image costs 1,548 tokens just to process visually.
So what happens when you take a large image — say 7,192×3,639px — and progressively scale it down? That's what we tested with gemini-3-flash-preview, reducing the size by 20% on each pass.
| Scale | Dims | Crop unit | Tiles (x) | Tiles (y) | Tiles | Tokens |
|-------|-----------|------|----------------|----------------|---|------|
| 100%  | 7192×3639 | 2426 | ceil(2.97) = 3 | ceil(1.50) = 2 | 6 | 1548 |
| 56%   | 4045×2046 | 1364 | ceil(2.97) = 3 | ceil(1.50) = 2 | 6 | 1548 |
| 13%   | 960×485   | 323  | ceil(2.97) = 3 | ceil(1.50) = 2 | 6 | 1548 |
| 8%    | 540×273   | 182  | ceil(2.97) = 3 | ceil(1.50) = 2 | 6 | 1548 |
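The arithmetic in that table can be sketched in a few lines of Python. Note that the crop-unit rule here (shorter side ÷ 1.5) is inferred from our measurements, which it reproduces exactly; it isn’t documented behaviour, and the flat 258-token charge for very small images isn’t modelled:

```python
from math import ceil

def gemini_image_tokens(width: int, height: int, tokens_per_tile: int = 258) -> int:
    """Estimate Gemini's visual token charge for an image by counting tiles."""
    crop_unit = min(width, height) / 1.5   # inferred rule: matches our observations
    tiles = ceil(width / crop_unit) * ceil(height / crop_unit)
    return tiles * tokens_per_tile
```

Because the crop unit scales with the image, uniform resizing leaves the tile count, and therefore the token cost, unchanged.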
We can see this play out when we submit multiple image sizes to gemini-3-flash-preview, shrinking the image on each pass.
| Dims | Tokens | Time (s) | Accuracy |
|-----------|------|------|-----|
| 7192×3639 | 2478 | 11.0 | 100 |
| 5394×2729 | 2477 | 10.3 | 99  |
| 4045×2046 | 2477 | 8.9  | 99  |
| 3034×1535 | 2478 | 8.8  | 100 |
| 2275×1151 | 2479 | 8.7  | 100 |
| 1706×863  | 2479 | 7.9  | 100 |
| 1280×647  | 2478 | 8.7  | 99  |
| 960×485   | 2466 | 8.6  | 99  |
| 720×364   | 2481 | 8.5  | 98  |
| 540×273   | 2482 | 8.5  | 75  |
You can see that even though the crop unit changes, we always end up with the same number of tiles, because scaling preserves the image’s aspect ratio - so we pay 1548 image tokens regardless of size. Each request uses around 2475 tokens in total: 1548 image tokens, with the remainder being prompt/completion tokens. And even our smallest test image is still well above the 348px threshold at which an image would be charged a flat 258 tokens - and at that size, quality has already fallen off a cliff (note the accuracy drop at 540×273).
So, your two options for reducing token usage on image transcription are either to change the aspect ratio (which is rarely possible without distorting the table) or to scale down below the tiny 348px threshold (at which point the image is unreadable). It’s a trap that isn’t clear in the documentation - hence why we’re sharing what we learned with you.
Conclusion
The ultimate decision in selecting an LLM for table extraction is, unsurprisingly, a trade-off between Accuracy and Speed. Our findings suggest a clear hierarchy:
For Maximum Accuracy (Time Permitting): Models like kimi-k2.5 offer the highest accuracy but at a significant time disadvantage. Consider kimi-k2.5 if you’re doing offline processing or handling a smaller number of PDFs, so time isn’t such a big penalty.
For The Sweet Spot (Best Value): gemini-3-flash-preview emerged as the strongest all-around performer. Its combination of high-tier accuracy (95%+ success rate in our focused subset) and speed (average 5.7 seconds) makes it the most efficient choice for high-volume, high-throughput table processing. The moderate cost sensitivity to image size also makes its usage predictable.
Cost is basically equivalent for both kimi-k2.5 and gemini-3-flash-preview: NZD$3 per 1M tokens.
A Note on Image Optimisation
While some models (like those from ZAI and Anthropic) show efficiency gains from resizing, the token cost mechanism for Gemini models means simple scaling down does not save on tokens unless the aspect ratio is drastically altered or the image is reduced to an impractical size.
This confirms that for Gemini, the quality of the transcription is primarily linked to the image's inherent information density, not minor pixel-count differences.
Therefore the real cost savings come from selecting the right model, not image manipulation.
In conclusion, for a production environment where both speed and accuracy matter, we think the current generation of fast multimodal models, spearheaded by gemini-3-flash-preview, offers the most pragmatic solution for turning complex table images back into clean, structured data.
At DataLoom, we’re working with some of the most information-dense documents in circulation - corporate sustainability reports, climate disclosures, and regulatory filings. Document intelligence is one piece of a larger puzzle we’re assembling, and this benchmark study came directly out of that work.
Sources
1. What Google's Environmental Report Says About Data Centres (2025): https://datacentremagazine.com/news/google-environmental-report-2025-the-data-centre-impact

