Handling special characters when searching for taxonomic names

rdmpage · March 24, 2026, 3:02pm

This is a request that Doug Yanega emailed to me recently:

It’s an odd quirk of the algorithm that uses OCR to locate names in BHL’s literature that it cannot recognize “special characters” like the fused AE and OE. There are hundreds upon hundreds of scientific names for which a BHL search will come up empty, not because the literature is not there, but because the OCR does not see those names in those pieces of literature. I personally have to expend a lot of extra time doing manual searches for such names, most often by finding other genera I know were described in that same work, looking THOSE up in BHL, and then manually scrolling through the entire work until I locate the page containing the offending “AE” name buried there and invisible to BHL.

If I knew that there was somewhere else I could go to find a link to that page in that PDF without this circuitous and time-consuming search trick, I would be overjoyed.

Don’t underestimate how big a problem this is with BHL. The thing I can’t say is how many people there are like me who need to find names that BHL can’t find.

Many taxonomic names in the older literature use characters such as æ and œ, and these names are hard to find in BHL. For example, a scientific-name search for “Cerambyeidæ” yields no hits: https://www.biodiversitylibrary.org/search?searchTerm=Cerambyeidæ&stype=F#/names Yet a full-text search in BHL for the same word yields 357 hits: https://www.biodiversitylibrary.org/search?stype=F&searchTerm=Cerambyeidæ#/titles
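As a rough illustration of one workaround on the searcher's side, here is a minimal Python sketch that expands a name into the spellings worth trying, with each ligature either kept or written out as two letters. Unicode normalization (NFKC/NFKD) does not decompose æ or œ, so the mapping has to be explicit. The table and the function names `fold_ligatures` and `search_variants` are my own assumptions for illustration; nothing here reflects how BHL's name finding actually works.

```python
from itertools import product

# Illustrative, not exhaustive: ligatures and their two-letter expansions.
# Capitals expand to "Ae"/"Oe", which suits capitalised names such as genera;
# an all-caps heading would need "AE"/"OE" instead.
LIGATURES = {"æ": "ae", "œ": "oe", "Æ": "Ae", "Œ": "Oe"}


def fold_ligatures(name: str) -> str:
    """Replace every ligature in the name with its two-letter expansion."""
    return "".join(LIGATURES.get(ch, ch) for ch in name)


def search_variants(name: str) -> list[str]:
    """Return the name as given plus every kept/expanded ligature combination."""
    # For each character, the spellings to try at that position.
    options = [(ch, LIGATURES[ch]) if ch in LIGATURES else (ch,) for ch in name]
    variants: list[str] = []
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate not in variants:  # keep order, drop duplicates
            variants.append(candidate)
    return variants


if __name__ == "__main__":
    for variant in search_variants("Cerambyeidæ"):
        print(variant)
```

Each variant could then be tried in turn as a search term. The number of variants doubles with every ligature in the name, which is fine for taxonomic names since they rarely contain more than one or two.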

RichardLitt · March 25, 2026, 11:46am

It’s also true that humans have difficulty telling the difference between oe and æ. It would be great to be able to get one of them, anyway.

rdmpage · March 25, 2026, 11:59am

A related question is how well OCR handles these characters (for example, does “æ” get interpreted as “ae”, or as something else entirely?). There are some new OCR tools, such as Mistral OCR (Mistral AI) and PaddleOCR (PaddlePaddle/PaddleOCR on GitHub), that may handle these characters better; it will be interesting to evaluate them on BHL content.
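One way such an evaluation might be set up, as a sketch only: given a name known to be printed on a page and the text an OCR engine produced for that page, record whether the ligature survived, was written out as two letters, or came out as something else. The category labels, the function names and the sample strings below are all hypothetical; no real OCR output or tool API is used here.

```python
from collections import Counter

# Illustrative, not exhaustive: ligatures and their two-letter expansions.
LIGATURES = {"æ": "ae", "œ": "oe", "Æ": "Ae", "Œ": "Oe"}


def fold_ligatures(text: str) -> str:
    """Replace every ligature with its two-letter expansion."""
    return "".join(LIGATURES.get(ch, ch) for ch in text)


def classify_rendering(expected: str, ocr_text: str) -> str:
    """Say how the OCR text rendered a name that is known to be on the page."""
    if expected in ocr_text:
        return "preserved"  # ligature survived as a single character
    if fold_ligatures(expected) in ocr_text:
        return "expanded"   # ligature was written out as two letters
    return "other"          # misread, split, or missing altogether


def tally(samples: list[tuple[str, str]]) -> Counter:
    """Count outcomes over (expected name, OCR text) pairs."""
    return Counter(classify_rendering(name, text) for name, text in samples)


if __name__ == "__main__":
    # Invented strings standing in for OCR output, for illustration only.
    samples = [
        ("Cerambyeidæ", "Fam. Cerambyeidæ."),
        ("Cerambyeidæ", "Fam. Cerambyeidae."),
        ("Cerambyeidæ", "Fam. Cerambyeidce."),
    ]
    print(tally(samples))
```

Running the same set of pages through each OCR tool and comparing the tallies would give a first indication of which one handles these characters best. The "other" cases would still need to be inspected by hand.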
