Can you use logprobs to improve direct multimodal OCR?

Short answer: we can, for OpenAI models! Not so much for Gemini. Additionally, GPT does better at lower temperatures, Gemini does better at higher temperatures, and Gemini Flash makes significantly fewer mistakes than gpt-4o.

The core question is simple: do model confidences correlate with accuracy in OCR?

Gemini now provides token log probabilities (along with multiple candidates) in the response through Vertex AI, so let’s give it a shot.
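To make "confidence" concrete: a token's log probability turns into a probability by exponentiating it, and anything below some cutoff can be flagged for review. Here is a minimal sketch of that step. The (token, logprob) pair format, the function name, and the 0.9 threshold are all assumptions for illustration, not the shape of the Vertex AI or OpenAI response.

```python
import math


def token_confidences(tokens, threshold=0.9):
    """Turn (token, logprob) pairs into probabilities and flag weak ones.

    `tokens` is a hypothetical, already-flattened list of pairs; the real
    API responses nest this data and need unpacking first.
    """
    results = []
    for token, logprob in tokens:
        prob = math.exp(logprob)  # logprob is a natural log, so exp() recovers p
        results.append({"token": token, "prob": prob, "low": prob < threshold})
    return results


if __name__ == "__main__":
    # Made-up values, purely to show the output shape.
    sample = [("Total", -0.01), ("1", -0.2), ("8", -1.2)]
    for item in token_confidences(sample):
        marker = "  <-- check this" if item["low"] else ""
        print(f'{item["token"]!r}: {item["prob"]:.3f}{marker}')
```

If confidences do track accuracy, the flagged tokens should be where the OCR mistakes cluster; if they don't, the flags will look random against the ground truth.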

Data

We’re just going to use this one page from our previous work replacing OCR with multimodal LLMs:

Page in use

This prompt is used to extract the table:

Convert the first table in this image ${imagePath} into a 2d array. Feel free to ignore formatting fluff or random things. Keep the headers. The number of columns and rows should match - there are no merged columns.
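The prompt asks for a 2d array whose rows all have the same number of columns, which is easy to verify once the response comes back. A small sketch of that check, assuming the model's reply is a JSON string; `parse_table` is a hypothetical helper name, not something from the original script.

```python
import json


def parse_table(raw):
    """Parse a JSON string into a 2d array and verify it is rectangular."""
    table = json.loads(raw)
    if not isinstance(table, list) or not table:
        raise ValueError("expected a non-empty list of rows")
    if not all(isinstance(row, list) for row in table):
        raise ValueError("every row must be a list")
    width = len(table[0])  # the header row sets the expected column count
    for index, row in enumerate(table):
        if len(row) != width:
            raise ValueError(
                f"row {index} has {len(row)} cells, expected {width}"
            )
    return table


if __name__ == "__main__":
    print(parse_table('[["Item", "Qty"], ["Widget", "3"]]'))
```

A ragged result usually means the model merged or dropped a cell, which is worth catching before looking at token confidences at all.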

Unfortunately, we can only use gemini-flash (both 001 and 002 work as of this writing). We press on!

Approach

The approach was simple but took an hour or two to get working. The idea is to force JSON output through a typespec and then visualize the result.
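For the visualization half, one simple option is to color each token by its probability so the shaky ones stand out. The sketch below is not the script used for this post; it assumes a flat list of (token, logprob) pairs and a hypothetical `render_tokens` helper, and emits HTML spans shaded from red (low) to green (high).

```python
import html
import math


def render_tokens(tokens):
    """Render (token, logprob) pairs as HTML spans colored by confidence."""
    spans = []
    for token, logprob in tokens:
        # Clamp to [0, 1] in case of tiny floating-point overshoot.
        prob = min(1.0, max(0.0, math.exp(logprob)))
        hue = round(120 * prob)  # 0 = red, 120 = green on the HSL wheel
        spans.append(
            f'<span style="background:hsl({hue},80%,80%)" '
            f'title="p={prob:.3f}">{html.escape(token)}</span>'
        )
    return "".join(spans)


if __name__ == "__main__":
    # Made-up values, purely to show the output shape.
    print(render_tokens([("Total", -0.01), ("18", -1.2)]))
```

Hovering over a span shows the exact probability, which makes it quick to compare a suspicious cell against the source image.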

This script was used to get the output (there’s a half-useful Vertex adapter in there as well):

Code

We get this output for our image (at temp 0 with 002):

2024-12-09T07-00-21-027Z-gemini-1.5-flash-002-polling.png.json

This is the output for 002 at temp 1.0:

2024-12-09T07-25-29-558Z-gemini-1.5-flash-002-temp-1.0-polling.png.json

This is the output for flash 001:

2024-12-09T07-26-18-299Z-gemini-1.5-flash-001-polling.png.json