SOTA OCR (state of the art optical character recognition)


I’m hoping to win the prize for most acronyms in a title! This post is a placeholder for tracking the current state of the art (SOTA) in OCR. If you come across a tool that might be good for BHL, or one that has already been used on BHL content, please add it here.

https://www.datalab.to

I’ve heard good things about Transkribus, especially for non-Western scripts and handwriting.

Yes, although I seem to recall reading somewhere that mainstream LLMs are getting so good that the trained models of Transkribus may be obsolete :man_shrugging:. Perhaps this is a version of the “Bitter Lesson”: http://www.incompleteideas.net/IncIdeas/BitterLesson.html

Perhaps what we should do is assemble a collection of BHL pages that represents all the sorts of pages BHL has (e.g., handwritten text, old fonts, non-Latin scripts, through to modern journals) and use it to benchmark new OCR tools as they come out. Maybe the tech team has such a benchmark already (@cajunjoel?). Given that we are starting to see LLMs embedded in web browsers, maybe OCR becomes something users do for themselves? The OCR built into my Mac is often better than what BHL provides.
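To make the benchmark idea a bit more concrete, here is a minimal sketch of how such a comparison might be scored. It assumes each benchmark page has a hand-corrected transcription and a category label, and it uses character error rate (CER, edit distance divided by the length of the reference text) as the metric. The field names (`id`, `category`, `truth`) and the `benchmark` function are hypothetical, invented for illustration; nothing here reflects an existing BHL data set or tool.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance relative to the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def benchmark(pages, ocr_output):
    """Mean CER per page category.

    pages: list of dicts with hypothetical keys "id", "category", "truth".
    ocr_output: dict mapping page id to the text an OCR tool produced.
    A page with no output is scored against an empty string.
    """
    scores = defaultdict(list)
    for page in pages:
        text = ocr_output.get(page["id"], "")
        scores[page["category"]].append(cer(page["truth"], text))
    return {category: sum(v) / len(v) for category, v in scores.items()}


if __name__ == "__main__":
    # Toy data only; real pages would need hand-corrected ground truth.
    pages = [
        {"id": "p1", "category": "handwritten", "truth": "Quercus robur"},
        {"id": "p2", "category": "modern journal", "truth": "Quercus robur"},
    ]
    tool_output = {"p1": "Quercas rohur", "p2": "Quercus robur"}
    for category, score in benchmark(pages, tool_output).items():
        print(f"{category}: mean CER = {score:.3f}")
```

Running each new tool over the same page set and comparing the per-category numbers would show, for example, whether a tool that does well on modern journals falls apart on handwriting. CER is only one possible metric; word error rate or a check on taxonomic names specifically might matter more for BHL.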