#82: OCR

Good morning, Fun Fact Folk!

Just the other day, as I was panhandling in the gutter for some post ideas, my debonair roommate Ted happened to be walking by (likely on the way to an upper-crust party or a lucrative business deal, as I assume he is wont to do).

“Ah, poor Matt, down on your luck again?” he said, stepping carefully to avoid the city street squalor I had gathered around myself to keep warm. He fished in his pocket and tossed me something that sparkled as it spun in the sheer winter sunlight. “Try this. A gentleman always carries a spare Fun Fact on him,” he added, silently judging my lack of original topic ideas as he walked off.

I caught it and stared down at the gift in my fingerless-gloved hands. I… I could work with this. I looked up to thank him (and ask him when we would ask the landlord to renew the lease), but he was already gone. A gentleman doesn’t dally.

And in that sentiment of getting on with it, this week I’ll dive into the trivial and the sometimes vulgar outcomes of OCR, Optical Character Recognition¹.

That’s a Kind of Weird/Obscure Topic, Matt
Weird and obscure it might be, but keep reading and you’ll see why I chose it. Also, one tends to get weird and obscure topics when grubbing for them on the street. Donating a single idea to the Fun Fact Fund is enough to keep a writer going for a whole week.

Anyway, for the wholly unfamiliar, OCR is the way that computers can ‘read’ printed text. It’s the process of taking an image and parsing out the letters and words to make it digital (and therefore manipulatable by programs). It’s crucial to PDFs, historic archive scanning, and of course: cheating at trivia.

Several enterprising software engineers have made entire projects out of taking OCR technology and applying it in real-time to the questions posed by the once-more-popular app, HQ Trivia. With a phone’s display mirrored to a beefier computer, a program can OCR the text displayed on screen and google the question faster than any human could type it, let alone comprehend it. The power of OCR. From there, there are several different strategies for discerning the ‘correct’ answer, which I’ll let you read more about if you’re curious.

The part I wanted to focus on is the actual OCR process (at a high level²). At no point does the program ‘understand’ what it’s translating or the meanings of these words. It’s merely been given strategies to identify that a certain pixel pattern corresponds with a certain letter/word stored in memory.

First it has to be able to say definitively what in the image is and is not part of a letter. And like everything that a computer does at the deepest level, this determination is binary: yes or no, black or white. In this case it’s literally black or white, as the image is converted to greyscale and then the contrast is hiked way, way up. From there it can be compared directly to letters it knows, and those black/white boundaries are matched to the edges that the stored letters have.

If you’ve played Mario Party, you already get it.
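
For the curious, here is a rough sketch of that greyscale, crank-the-contrast, compare-to-known-letters routine. Everything in it is an assumption for illustration: the 5x5 "scan", the cutoff of 128, and the names `THRESHOLD`, `TEMPLATES`, `binarize`, and `best_match` are all made up, and real OCR software is far more sophisticated than this.

```python
# Sketch only: tiny 5x5 "images" stand in for real scans, and THRESHOLD,
# TEMPLATES, binarize, and best_match are made-up names for illustration.

THRESHOLD = 128  # assumed cutoff: anything darker than this counts as ink

# Stored letter shapes: 1 = ink (black), 0 = paper (white).
TEMPLATES = {
    "L": ["10000",
          "10000",
          "10000",
          "10000",
          "11111"],
    "T": ["11111",
          "00100",
          "00100",
          "00100",
          "00100"],
    "O": ["01110",
          "10001",
          "10001",
          "10001",
          "01110"],
}

def binarize(grey_rows):
    """Hike the contrast all the way: each 0-255 grey value becomes 1 or 0."""
    return ["".join("1" if value < THRESHOLD else "0" for value in row)
            for row in grey_rows]

def best_match(bitmap):
    """Return the template letter that agrees with the bitmap on the most pixels."""
    def score(template):
        return sum(a == b
                   for t_row, b_row in zip(template, bitmap)
                   for a, b in zip(t_row, b_row))
    return max(TEMPLATES, key=lambda letter: score(TEMPLATES[letter]))

# A smudgy greyscale scan of a "T": dark strokes, pale paper,
# and a couple of murky pixels thrown in.
scan = [
    [ 40,  25,  30,  35,  60],
    [210, 200,  20, 230, 215],
    [220, 140,  35, 225, 210],
    [205, 215,  50, 120, 220],
    [230, 220,  45, 210, 225],
]

bitmap = binarize(scan)
print("\n".join(bitmap))
print(best_match(bitmap))  # prints T
```

One murky pixel ends up on the wrong side of the cutoff, but the "T" template still agrees on 24 of 25 pixels, so the smudge doesn't change the verdict. Note that the program never knows what a T is; it just counts matching squares.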

Alternatively, the computer can attempt to break down the shape into its component patterns. This was the way that the original Palm Pilots translated a user’s stylus-scratching into text. A certain number of loops or connections in a specific order was a certain letter. However, without being able to watch a user progress through the stages of writing a letter out, this process becomes harder (though not even close to impossible).

A relic from a simpler time.
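
To make the "count the loops" idea concrete, here is a small sketch that counts the enclosed loops in a blocky letter using a flood fill. This is my own toy illustration, not how any particular handheld or OCR engine did it: `count_holes` and the `GLYPHS` bitmaps are hypothetical, and a real system would look at many more features than loops alone.

```python
# Sketch only: count_holes and the GLYPHS bitmaps are hypothetical stand-ins,
# not how any particular OCR engine or handheld actually did it.

def count_holes(rows):
    """Count enclosed patches of paper (loops) in a glyph; '#' is ink, '.' is paper."""
    # Pad with a border of paper so all the outside background is one region.
    width = len(rows[0]) + 2
    grid = ["." * width] + ["." + row + "." for row in rows] + ["." * width]
    seen = set()

    def flood(start):
        """Mark every paper cell reachable from start (up/down/left/right only)."""
        stack = [start]
        seen.add(start)
        while stack:
            r, c = stack.pop()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if (0 <= nr < len(grid) and 0 <= nc < width
                        and grid[nr][nc] == "." and (nr, nc) not in seen):
                    seen.add((nr, nc))
                    stack.append((nr, nc))

    flood((0, 0))  # the outside background is not a loop
    holes = 0
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell == "." and (r, c) not in seen:
                holes += 1  # a fresh patch of paper fenced in by ink
                flood((r, c))
    return holes

GLYPHS = {
    "C": [".###",
          "#...",
          "#...",
          "#...",
          ".###"],
    "D": ["###.",
          "#..#",
          "#..#",
          "#..#",
          "###."],
    "B": ["###.",
          "#..#",
          "###.",
          "#..#",
          "###."],
}

for letter, rows in GLYPHS.items():
    print(letter, count_holes(rows))  # C 0, D 1, B 2
```

Loops alone won't tell a D from an O, which is why the paragraph above talks about loops and connections together, in a specific order.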

Into The Arms of a Stranger
There are more OCR strategies and ways the technology can go right, but it’s just so satisfying to watch it go wrong³.

Pretty much no one OCRs more than Google, which is transcribing as much of the printed word as possible. But much of that transcription is done with little manual proofreading (who has the time to read through literally millions of OCR’d documents?), so errors squeak through.

Sometimes these letters of ours look too similar, especially when cramped together and especially when you’re a machine that doesn’t understand the context of the writing. Sometimes the letters ‘rm’ can look a bit too much like ‘nu’ and we get errors like transcribing the word ‘arms’ to ‘anus’. So you get the following snippets in digital transcriptions of books⁴:

The Atlantic Monthly‎ – Page 166
“… with the child in her anus, she followed her husband down-stairs, across the back-yard, hitting lier feet against stones and logs in the darkness, …”

The works of Daniel Defoe: with a memoir of his life and writings‎ – Page 46
by Daniel Defoe, William Hazlitt – 1840
“But when the man who had the child in his anus, had been told by signs that this was the mother, he beckoned to have her come to him,”
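
If you wanted to hunt for words at risk of this kind of mix-up, a sketch might look like the following. It is purely illustrative: the `CONFUSABLE` pairs, the toy `WORDS` dictionary, and the `lookalikes` helper are all my own assumptions, and a real proofreading tool would need a full dictionary plus some sense of context.

```python
# Sketch only: CONFUSABLE, WORDS, and lookalikes are made up for illustration;
# a real system would lean on a full dictionary and a lot more context.

CONFUSABLE = [("rm", "nu"), ("h", "li")]  # assumed pairs of lookalike shapes

WORDS = {"arms", "anus", "her", "charm", "form"}  # toy dictionary

def lookalikes(word):
    """Return other dictionary words reachable by swapping one lookalike pair."""
    found = set()
    for left, right in CONFUSABLE:
        for old, new in ((left, right), (right, left)):
            start = word.find(old)
            while start != -1:
                candidate = word[:start] + new + word[start + len(old):]
                if candidate in WORDS and candidate != word:
                    found.add(candidate)
                start = word.find(old, start + 1)
    return sorted(found)

for word in ("arms", "charm", "lier"):
    print(word, "->", lookalikes(word))
# arms -> ['anus']
# charm -> []
# lier -> ['her']
```

The awkward part is that ‘arms’ and ‘anus’ are both perfectly good dictionary words, so a spell-checker alone can't tell which one the page actually said. That takes context, which is exactly what the machine lacks.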


Really, that’s it. The whole post. This was all a ploy to cram the word ‘anus’ into a post without actually using the word. Mission: Accomplished.

Until next week, be extra careful with your right to bare arms!

¹ I realize that it can also stand for Obstacle Course Racing, but look at me, do I really strike you as that kind of OCR guy? Hmmm, now that I think about it, maybe…
² AKA, an explanation you can understand when smoking.
³ It’s what Bob Saget and America’s Funniest Home Videos staked (stook?) their livelihood on.
⁴ This fact was so deeply pulled from another blog that I can’t help but link there.