A presentation I’ve been sharing frequently is one by Briana Giasullo, a Cataloging and Digital Resources Librarian from the Academy of Natural Sciences, on her solution to the problem of handwritten texts/fieldnotes in BHL having pretty much useless OCR. Briana explains how she used Amazon Textract to generate plain text files, then used Zooniverse to find volunteers to check the transcriptions that Textract generated. She then uploaded the corrected text files to BHL, replacing the useless OCR. BHL could then extract species names, making the handwritten text more findable. https://www.youtube.com/watch?v=PXQDWqoB8Xg&t=229s
Interesting example. This raises the question of who gets to upload corrected OCR text. Is there a general mechanism for this, or do you have to be a BHL member? What happens if a user spots errors? For example, I think there are several mistakes in the transcription of https://www.biodiversitylibrary.org/page/59782282 (image below). How do I fix those? From my perspective, “agency” is a big issue with the BHL platform. There is no obvious way for people to contribute to improving the content.
Yes, we can upload corrected OCR into BHL. The workflow is the same as for uploading transcriptions. At present this is only possible via the BHL Dashboard, which you need a login for (and training).
However, as with adding article data to BHL, the time-consuming part is gathering, checking, and formatting the content/data for upload. Or it was. Most of this work can now be done by AI.
The only data BHL requires for both transcriptions and corrected OCR are PageID, SequenceNumber, and Text. Perhaps, like @rdmpage did for article data, we can explore other pathways? A BioStor for transcriptions and corrected OCR?
| PageID | SequenceNumber | Text |
|---|---|---|
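
To make the three-column layout concrete, here is a rough Python sketch of how corrected page text might be packaged for upload. It is only an illustration: the CSV format, the 1-based sequence numbering, the helper names `PageText`, `build_rows` and `to_csv`, and the page IDs are all assumptions of mine, not something the BHL Dashboard documents here.

```python
import csv
import io
from typing import Iterable, List, NamedTuple, Tuple


class PageText(NamedTuple):
    page_id: int          # BHL PageID
    sequence_number: int  # assumed: 1-based position of the page in the upload
    text: str             # transcription or corrected OCR for that page


def build_rows(pages: Iterable[Tuple[int, str]]) -> List[PageText]:
    """Number pages in the order given and trim surrounding whitespace."""
    return [
        PageText(page_id, seq, text.strip())
        for seq, (page_id, text) in enumerate(pages, start=1)
    ]


def to_csv(rows: Iterable[PageText]) -> str:
    """Serialise rows under the PageID / SequenceNumber / Text header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["PageID", "SequenceNumber", "Text"])
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical page IDs and placeholder text, for illustration only.
    corrected = [
        (1001, "First page of corrected text\nsecond line\n"),
        (1002, "Second page of corrected text"),
    ]
    print(to_csv(build_rows(corrected)))
```

The `csv` module quotes any field containing commas, quotes or line breaks, so multi-line page text survives a round trip. Whatever format the Dashboard actually expects would need checking before using anything like this.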
