SOTA OCR (state-of-the-art optical character recognition)

rdmpage · March 25, 2026, 4:29pm

I’m hoping to win the prize for most acronyms in a title! This post is a placeholder for tracking the current state of the art (SOTA) in OCR. If you come across a tool that might be good for BHL, or one that has already been used on BHL content, please add it here.

https://www.datalab.to

Pigsonthewing · March 27, 2026, 9:00pm

I’ve heard good things about Transkribus, especially for non-Western scripts and handwriting.

rdmpage · March 27, 2026, 9:29pm

Yes, although I seem to recall reading somewhere that mainstream LLMs are getting so good that the trained models of Transkribus may be obsolete. Perhaps a version of the “Bitter Lesson”: http://www.incompleteideas.net/IncIdeas/BitterLesson.html

rdmpage · March 27, 2026, 10:18pm

Perhaps what we should do is assemble a collection of BHL pages that represents all the sorts of pages BHL has (e.g., handwritten text, old fonts, non-Latin scripts, through to modern journals) and use it to benchmark new OCR tools as they come out. Maybe the tech team has such a benchmark already (@cajunjoel?). Given that we are starting to see LLMs embedded in web browsers, maybe OCR becomes something users do for themselves? The OCR built into my Mac is often better than what BHL provides.
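
To make the benchmark idea concrete, here is a rough sketch of how such a page collection could be scored. It assumes each benchmark page has a hand-checked transcription and reports character error rate (CER, edit distance divided by the length of the reference text), averaged per page category. The names `BenchmarkPage`, `score_tool`, the page IDs, and the sample text are all made up for illustration; this is not an existing BHL tool or dataset, and CER is just one common choice of metric.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of OCR output against a trusted transcription."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


@dataclass
class BenchmarkPage:
    # Hypothetical record: one page of the benchmark set.
    page_id: str
    category: str      # e.g. "handwritten", "non-Latin", "modern journal"
    ground_truth: str  # hand-checked transcription


def score_tool(pages: list[BenchmarkPage],
               ocr_output: dict[str, str]) -> dict[str, float]:
    """Mean CER per page category for one OCR tool's output."""
    by_category: dict[str, list[float]] = {}
    for page in pages:
        # A page the tool skipped counts as empty output.
        hypothesis = ocr_output.get(page.page_id, "")
        by_category.setdefault(page.category, []).append(
            cer(page.ground_truth, hypothesis))
    return {cat: sum(vals) / len(vals) for cat, vals in by_category.items()}


# Invented sample data, only to show the shape of the inputs.
pages = [
    BenchmarkPage("p1", "handwritten", "Quercus robur"),
    BenchmarkPage("p2", "modern journal", "Homo sapiens"),
]
ocr_output = {"p1": "Quercus rohur", "p2": "Homo sapiens"}
scores = score_tool(pages, ocr_output)
```

Reporting per category rather than one overall number would show whether a new tool is only better on clean modern print or also on the hard cases (handwriting, old fonts, non-Latin scripts).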
