This is a problem that has stumped DocumentIntelligence (85%), Sonnet 3.5 (~90%) and other models. It’s pretty simple:

CMSFileView.pdf

Can you OCR this table (just the first page is hard) and get back to structured data?

With GOT and Sonnet-3.5, we can build a more generic pipeline with just rendered images that achieves 98.79%.

Here’s the approach:

  1. Use GOT to OCR the page.

Here’s what we get:

Raw OCR

  1. Plug that into Claude along with the table and this prompt:
Here is a table image and the OCR output. Can you use the OCR output and put it into a proper markdown table without any mistakes?
  1. Here’s what we get:

Certainly. I'll create a markdown table using the OCR output, correcting any obvious errors and formatting it properly. Here's the markdown table:

Sr. No. Station Name Anurag Singh Thakur Satpal Raizada Hem Raj Arun Ankesh Syal Subedar Major Kulwant Singh Jagdeep Kumar Sumit Rana Garib Dass Katoch Gopi Chand Attari Nand Lal Ramesh Chand Sarthi Surender Kumar Gautam Total valid votes Rejected votes NOTA Total votes No. of Tendered Votes
1 1-SAYAR 393 213 9 0 2 12 0 0 0 0 2 1 632 0 8 640 0
2 2-KAROT 313 172 20 2 1 1 0 0 1 1 1 0 512 0 3 515 0
3 3-RANIKOTLA 640 281 0 1 4 17 0 0 0 0 1 3 947 0 2 949 0
4 4-SAI BRAHMA 364 249 1 0 1 10 0 0 1 1 0 6 633 0 6 639 0
5 5-BATHOH 331 250 3 1 0 21 0 2 0 0 1 3 612 0 2 614 0
6 6-BHOLI 364 313 12 0 3 4 1 0 1 2 4 3 707 0 6 713 0
7 7-KHARSI KNAITTA 316 96 2 0 1 4 0 0 0 1 0 0 420 0 3 423 0
8 8-SOHRA BYUS 254 116 0 0 2 0 0 0 0 2 1 0 375 0 2 377 0
9 9-SUI 457 240 4 1 3 3 0 1 2 1 0 1 713 0 5 718 0
10 10-MALOKHAR 494 206 0 0 0 2 0 0 0 1 1 1 705 0 7 712 0
11 11-SOLDHA 445 275 5 0 3 6 0 0 0 2 2 1 739 0 5 744 0
12 12-CHANDPUR 254 167 1 0 0 7 1 1 0 2 1 5 439 0 6 445 0
13 13-KYARAN 301 243 2 0 0 7 0 0 0 3 0 1 557 0 2 559 0
14 14-CHHAKOH 542 261 4 1 1 1 1 0 0 0 0 0 811 0 5 816 0
15 15-GALAU 317 360 4 0 1 2 1 0 0 1 2 1 689 0 3 692 0
16 16-SAI NOWDAN 147 115 0 0 0 13 0 0 0 0 0 0 275 0 3 278 0
17 17-JANLAGHAT 296 336 2 0 0 17 0 0 0 1 0 0 652 0 1 653 0
18 18-SOHRI 263 182 1 1 3 8 2 0 0 0 0 2 462 0 4 466 0
19 19-BANDU 199 167 2 0 0 1 0 0 0 0 0 0 369 0 1 370 0
20 20-ASHA MAJARI 224 157 0 0 1 4 0 0 1 0 0 1 388 0 0 388 0
21 21-JUKHALA-1 195 198 2 0 1 17 1 0 0 1 1 2 418 0 4 422 0
22 22-GASSAUR 322 262 1 0 1 6 1 0 0 1 0 4 598 0 3 601 0
23 23-JUKHALA-2 281 238 0 0 0 7 0 0 0 0 1 0 527 0 2 529 0
24 24-SYCHLA 450 359 1 1 1 48 5 0 0 2 0 0 867 0 9 876 1
25 25-KOTLA 510 318 3 0 1 12 1 0 1 1 0 0 847 0 8 855 0
26 26-HIHARKHAN BASLA 309 320 4 0 0 18 1 0 1 0 0 2 655 0 4 659 0
27 27-PALOG 67 139 0 0 0 0 0 0 0 0 1 2 209 0 1 210 0
28 28-BAGKHURD 210 235 1 1 9 22 2 0 0 0 1 0 481 0 0 481 0
29 29-SOSAN 273 323 4 0 5 1 1 0 0 1 1 3 612 0 4 616 0
30 30-GHYAL 186 440 2 1 0 3 0 0 0 0 1 1 634 0 7 641 0
31 31-TEPARA 76 166 0 0 2 1 0 0 0 0 0 0 245 0 1 246 0
32 32-DABAR 199 315 0 0 1 4 0 0 0 0 1 1 521 0 0 521 0
33 33-KOTHI 139 127 0 0 0 4 0 0 0 0 0 1 271 0 1 272 0
34 34-LUHARDA 169 136 0 0 1 6 0 0 0 0 0 0 312 0 2 314 0
35 35-KAHLI 113 59 4 0 1 0 0 0 0 0 0 0 177 0 1 178 0

This table represents the voting results for various polling stations, including the number of votes received by each candidate and other relevant information.

This is the image we’re using:

CleanShot 2024-09-30 at 23.36.05@2x.png