Handling special characters when searching for taxonomic names

This is a request that Doug Yanega emailed to me recently:

It’s an odd quirk of the algorithm that uses OCR to locate names in BHL’s literature that it cannot recognize “special characters” like the fused AE and OE. There are hundreds upon hundreds of scientific names for which a BHL search will come up empty, not because the literature is not there, but because the OCR does not see those names in those pieces of literature. I personally have to expend a lot of extra time doing manual searches for such names, most often by finding other genera I know were described in that same work, looking THOSE up in BHL, and then manually scrolling through the entire work until I locate the page containing the offending “AE” name buried there and invisible to BHL.

If I knew that there was somewhere else I could go to find a link to that page in that PDF without this circuitous and time-consuming search trick, I would be overjoyed.

Don’t underestimate how big a problem this is with BHL. The thing I can’t say is how many people there are like me who need to find names that BHL can’t find.

A lot of taxonomic names in the older literature use characters such as æ and œ, and these names are hard to find in BHL. For example, “Cerambyeidæ” yields no hits when searching scientific names: https://www.biodiversitylibrary.org/search?searchTerm=Cerambyeidæ&stype=F#/names

By contrast, the same word yields 357 hits in a BHL full-text search: https://www.biodiversitylibrary.org/search?stype=F&searchTerm=Cerambyeidæ#/titles
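One possible workaround on the searching side is to try every spelling of a name, with and without the ligatures. The following is only a minimal sketch of that idea, not how BHL's name finding works. The function names `expand_ligatures`, `contract_digraphs` and `search_variants` are made up for illustration. The mapping is explicit because Unicode normalization (NFKD) does not split æ or œ into two letters.

```python
# Hypothetical helpers: generate spelling variants of a taxonomic name
# so that a search can be tried with and without ligatures.

LIGATURES = {"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE"}

# Digraph -> ligature, including the capitalised form at the start of a name.
DIGRAPHS = [
    ("ae", "æ"), ("AE", "Æ"), ("Ae", "Æ"),
    ("oe", "œ"), ("OE", "Œ"), ("Oe", "Œ"),
]


def expand_ligatures(name: str) -> str:
    """Replace each ligature with its two-letter spelling."""
    return "".join(LIGATURES.get(ch, ch) for ch in name)


def contract_digraphs(name: str) -> str:
    """Replace every ae/oe digraph with the matching ligature."""
    for digraph, ligature in DIGRAPHS:
        name = name.replace(digraph, ligature)
    return name


def search_variants(name: str) -> list[str]:
    """Return the name as typed plus its expanded and contracted forms,
    without duplicates and in a stable order."""
    candidates = [name, expand_ligatures(name), contract_digraphs(name)]
    return list(dict.fromkeys(candidates))


if __name__ == "__main__":
    for query in ("Cerambyeidæ", "Cerambycidae"):
        print(query, "->", search_variants(query))
```

Note that `contract_digraphs` is deliberately crude: it converts every "ae" or "oe" in a name, including ones that were never printed as a ligature, so the variants are best treated as extra search terms rather than corrections.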

It’s also true that humans have difficulty telling the difference between œ and æ. It would be great to be able to find at least one of them, anyway.

A related question is how well OCR handles these characters: for example, does “æ” get interpreted as “ae”, or as something else entirely? There are some new OCR tools, such as Mistral OCR (Mistral AI) and PaddleOCR (PaddlePaddle/PaddleOCR on GitHub), that may handle these characters better. It will be interesting to evaluate them on BHL content.
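One way to put numbers on that question would be to compare a hand-checked transcription of a page with the OCR output and tally what each ligature turned into. The sketch below uses Python's standard `difflib` for the alignment. The function name `ligature_renderings` and the sample strings are invented for illustration, not taken from any BHL page or from the output of any particular OCR tool.

```python
from collections import Counter
from difflib import SequenceMatcher


def ligature_renderings(truth: str, ocr: str, ligatures: str = "æœÆŒ") -> Counter:
    """Count how ligatures in the hand-checked text appear in the OCR text.

    Keys are (truth_fragment, ocr_fragment) pairs. A ligature that survived
    OCR unchanged is counted as (ligature, ligature).
    """
    counts: Counter = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        fragment = truth[i1:i2]
        if tag == "equal":
            # Text matched exactly, so any ligature here was read correctly.
            for ch in fragment:
                if ch in ligatures:
                    counts[(ch, ch)] += 1
        elif tag in ("replace", "delete"):
            # OCR produced something different (or nothing) for this stretch.
            if any(ch in ligatures for ch in fragment):
                counts[(fragment, ocr[j1:j2])] += 1
    return counts


if __name__ == "__main__":
    # Invented example: the OCR read "æ" as "ae".
    print(ligature_renderings("Cerambycidæ", "Cerambycidae"))
```

When the OCR differs from the transcription in the characters next to a ligature as well, the reported fragment will include those neighbours, so the tallies need a quick manual check before drawing conclusions.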