For a long time now, I’ve been searching for a middle ground when it comes to Optical Character Recognition (OCR) in .NET/C#. You might have heard of a little C++ library called “Tesseract”, which many have tried to write wrappers around or call via interop from their C# code. I myself have followed tutorials and guides on how to do this and it’s always ended in pain. Most notably, when you are working with C++ libraries in C#, you have to be extremely careful about how memory is allocated, otherwise there is a surefire chance you’re going to end up with a memory leak somewhere along the way. Almost without fail when I’ve tried to use Tesseract from C#, I’ve ended up leaking memory all over the place and having my screen turn into a jigsaw requiring a PC restart.
The alternative has always been “enterprise” type OCR libraries with their hidden pricing that they then foist on you at the last minute (I really have a distaste for these sorts of tactics if I’m being honest), and even then, they usually have some sort of limited feature set but you pay for it anyway just so you don’t end up losing sleep over memory issues.
Well, then in comes IronOCR. An OCR library that takes the headaches out of C++ interoperability, but with upfront (and very, very reasonable) pricing. I’ve been playing around with this little OCR library for some time now and I’ve got to say, the ease with which this thing gets up and running is really a dream. Let’s get started!
What Are We Looking For In An OCR Library?
Before we jump too deep into the code, let me map out my thought process of what I wanted to get out of any OCR or computer vision library.
- I know that there are API services out there that do this (For example for Azure OCR, there is Azure Cognitive Services which is essentially a computer vision API), but I wanted to make sure that I could run this without making API calls, and without having to pay more as I scale up. IronOCR is a one time fee and that’s it.
- Support for multiple languages. Many libraries I looked at supported English only.
- Is there some level of “cleanup” smarts there? If the scanned document or image is a bit scratchy, does the library come with a way to clean things up?
- Does this work for both printed and handwritten text? This is more a nice to have, but it’s also a huge feature to have!
- Can I use this in the cloud (Specifically, will this work in Azure)? Usually this means no “installs” that need to be made because I may not be using a VM at all and instead be entirely serverless.
IronOCR ticks all of these boxes, but let’s take a dive into how it might look in code.
Simple OCR Example
Let’s start off with something really easy. I took this screenshot from the Google Books page on Frankenstein by Mary Shelley.
I then took my C#/.NET Console Application, and ran the following in the NuGet Package Manager Console to install IronOCR
Install-Package IronOcr
And then onto the code. I literally OCR’d this image to extract text, including line breaks and everything, using 4 lines of code.
    // Needs "using System;" and "using IronOcr;" at the top of the file.
    var ocr = new IronTesseract();
    using (var input = new OcrInput("Frankenstein.PNG"))
    {
        var result = ocr.Read(input);
        Console.WriteLine(result.Text);
    }
And the output?
Frankenstein Annotated for Scientists, Engineers, and Creators of All Kinds By Mary Wollstonecraft Shelley - 2017
Literally perfect character recognition in just a few lines of code! Using IronOCR really is like computer vision on “easy mode”.
Running IronOCR In An Azure Function
I know this might seem like an obvious thing to say, but there have been countless times where I’ve used libraries that require some sort of dedicated VM, either through an installation on the machine or because of licensing “per machine”. In this day and age, you should not be using any library that can’t work in a serverless environment.
Since in most of my recent projects, we are using Azure Functions in a microservices architecture, let’s create a really simple function that can take a parameter of an image, OCR it, and return the text.
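    using System.Net.Http;
    using System.Threading.Tasks;
    using IronOcr;
    using Microsoft.AspNetCore.Http;
    using Microsoft.AspNetCore.Mvc;
    using Microsoft.Azure.WebJobs;

    public static class OCRFunction
    {
        // Reuse a single HttpClient across invocations.
        private static readonly HttpClient _httpClient = new HttpClient();

        [FunctionName("OCRFunction")]
        public static async Task<IActionResult> Run([HttpTrigger] HttpRequest req, ExecutionContext context)
        {
            // The image URL arrives as the "image" query parameter.
            string imageUrl = req.Query["image"];
            if (string.IsNullOrEmpty(imageUrl))
            {
                return new BadRequestObjectResult("Missing 'image' query parameter.");
            }

            // Download the image and OCR it straight from the stream.
            using (var imageStream = await _httpClient.GetStreamAsync(imageUrl))
            using (var input = new OcrInput(imageStream))
            {
                var ocr = new IronTesseract();
                var result = ocr.Read(input);
                return new OkObjectResult(result.Text);
            }
        }
    }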
Nice and simple! We take the query parameter of image, download it and OCR it immediately. The great thing about doing this directly inside an Azure Function is that immediately it can service different parts of our application in a microservice architecture without us having to copy and paste the code everywhere.
If we run the above code on our Frankenstein image above :
Super easy!
Another thing I want to point out about this approach applies if you’re currently paying for some service that charges a per-OCR fee. Things can appear cheap, but at scale the monthly fee can quickly spiral out of control. Compare this to a one-time fee with IronOCR, and you’re getting what is essentially a callable API all hosted in the Azure Cloud, without the ongoing costs.
Non-English Support
One thing I noticed with even the “Enterprise” level OCR libraries is that they often supported English only. It would come with some caveat like “But you can train it yourself on any language you want”. But that’s not really good enough when you are paying through the nose already.
However, IronOCR supports 125 languages currently, and you can add as many or as few as you like by simply installing the applicable NuGet language pack.
I was going to write more on the language availability in IronOCR, but it just works, and it’s all right there in a nifty package!
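For a rough idea of what that looks like in practice, here is a minimal sketch rather than anything from a real project. It assumes the `IronOcr.Languages.Arabic` language pack has been installed from NuGet next to the main package, and that the `Language` property and `OcrLanguage` enum are available in the IronOCR version you have; the file name `ArabicScan.png` is made up purely for illustration.

```csharp
using System;
using IronOcr;

class LanguageExample
{
    static void Main()
    {
        var ocr = new IronTesseract();

        // Assumption: the IronOcr.Languages.Arabic NuGet language pack is installed.
        ocr.Language = OcrLanguage.Arabic;

        // Hypothetical file name, used only for illustration.
        using (var input = new OcrInput("ArabicScan.png"))
        {
            var result = ocr.Read(input);
            Console.WriteLine(result.Text);
        }
    }
}
```

Nothing else in the reading code should need to change; the language is a setting on the `IronTesseract` instance rather than on the input.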
Cleaning Up Skewed Scans For OCR
The thing is, most of the time when you need OCR, it’s because of scanned documents. It’s very rare that you’re going to be using OCR for some pixel perfect screenshot from a website. Some OCR libraries shy away from this and sort of “avoid” the topic. IronOCR jumps right into the deep end and gives us some out of the box options for fixing up poor scans.
Let’s use this as an example. A scanned page from the book Harry Potter.
There’s a bit of noise here but more importantly the text is heavily skewed. Two issues that are very very common when scanning in paper. If I run this through the OCR with no “fixes” in play, the only things I get back are :
Chapter Eight The Deathday Party
That’s because the page is just too skewed and noisy to make out smaller characters correctly. All we have to do is add the ability to correct the skew of the scan. We can do that with a single line of code :
    var ocr = new IronTesseract();
    using (var input = new OcrInput("HarryPotter.png"))
    {
        input.Deskew(); //Deskew the image
        var result = ocr.Read(input);
        Console.WriteLine(result.Text);
    }
And instantly, with no other changes, we get the whole page back, very nearly character for character (the only slip I can see is “castie” for “castle”) :
Chapter Eight The Deathday Party October arrived, spreading a damp chill over the grounds and into the castie. Madam Pomfrey, the nurse, was kept busy by a sudden spate of colds among the staff and students. Her Pepperup potion worked instantly, though it left the drinker smoking at the ears for several hours afterward. Ginny Weasley, who had been looking pale, was bullied into taking some by Percy. The steam pouring from under her vivid hair gave the impression that her whole head was on fire.
There are actually a tonne of other options for cleaning up images/documents too including :
- DeNoise
- Rotating images a set number of degrees
- Manually controlling contrast, greyscale, or simply turning the image black and white
- Enhancing the resolution/image sharpening
- Erode and Dilate images
- And even more like color inversion and deep cleaning of background noise.
The thing is, if I’m being honest, I did play around with these but I just never really needed to. De-skewing my documents generally was enough to get everything coming out literally character perfect, but it’s great that IronOCR gives you even more knobs to play with to really fine tune your OCR requirements.
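If you do want to stack a few of these filters, a sketch might look like the following. It assumes the filter methods are exposed on `OcrInput` under these names (`DeNoise`, `Rotate`, `ToGrayScale`, `Binarize`, `Invert`) in your IronOCR version, so check them against the version you install. Which filters help, and in what order, depends entirely on the scan.

```csharp
using System;
using IronOcr;

class CleanupExample
{
    static void Main()
    {
        var ocr = new IronTesseract();
        using (var input = new OcrInput("HarryPotter.png"))
        {
            input.DeNoise(); // Remove speckle noise from the scan
            input.Deskew();  // Straighten the skewed text lines

            // Other filters, left commented out because each one costs
            // processing time and may not help every image (method names
            // are assumptions to verify against your IronOCR version):
            // input.Rotate(90);
            // input.ToGrayScale();
            // input.Binarize();
            // input.Invert();

            var result = ocr.Read(input);
            Console.WriteLine(result.Text);
        }
    }
}
```

Adding filters one at a time and comparing the output is probably the sanest approach, since every extra pass adds processing time.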
Advanced Text Results
It might surprise you that other OCR libraries I tested simply output text and that was it. There was no structure to it, and you essentially had to work out based on counting line breaks or whitespace how each paragraph worked. IronOCR however not only can read text from your documents, but can work out the structure too!
For example, let’s use our Harry Potter image and instead use the following code :
    var ocr = new IronTesseract();
    using (var input = new OcrInput("HarryPotter.png"))
    {
        input.Deskew();
        var result = ocr.Read(input);
        foreach (var paragraph in result.Paragraphs)
        {
            Console.WriteLine($"Paragraph : {paragraph.ParagraphNumber}");
            Console.WriteLine(paragraph.Text);
        }
    }
Notice how instead of simply spitting out the text, I want to go paragraph by paragraph, to really understand the blocks of text I’m working with. And the result?
Paragraph : 1
Chapter Eight
Paragraph : 2
The Deathday Party
Paragraph : 3
October arrived, spreading a damp chill over the grounds and into the castie. Madam Pomfrey, the nurse, was kept busy by a sudden spate of colds among the staff and students. Her Pepperup potion worked instantly, though it left the drinker smoking at the ears for several hours afterward. Ginny Weasley, who had been looking pale, was bullied into taking some by Percy. The steam pouring from under her vivid hair gave the impression that her whole head was on fire.
Again, near character-perfect recognition split into the correct blocks. There’s a tonne of options around this too, including reading line by line, or even reading only certain sections of the text at a time by drawing a rectangle over the document. The latter is extremely helpful when you only need to use computer vision on a particular section of the document, and don’t need to worry about the rest.
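As a sketch of the line-by-line option only, something like the following should work, assuming the result object exposes a `Lines` collection alongside `Paragraphs` and that each line has a `Text` property. The counter is kept by hand here rather than relying on any numbering property of the line object. The rectangle-based region reading is left out because the type it expects appears to differ between IronOCR versions.

```csharp
using System;
using IronOcr;

class LineByLineExample
{
    static void Main()
    {
        var ocr = new IronTesseract();
        using (var input = new OcrInput("HarryPotter.png"))
        {
            input.Deskew();
            var result = ocr.Read(input);

            // Assumption: result.Lines exists in the same way result.Paragraphs does.
            var lineNumber = 1;
            foreach (var line in result.Lines)
            {
                Console.WriteLine($"Line {lineNumber} : {line.Text}");
                lineNumber++;
            }
        }
    }
}
```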
Who Is This Library For?
As always, when I look at these sorts of libraries I try and think about who this is actually aimed at. Is it a hobbyist library, or is it for enterprises only? And honestly, I struggle to place this one. Computer vision and optical character recognition are on the rise, and in the past couple of years, I’ve been asked about libraries to extract text from images more than all previous years combined. Azure obviously has their own offering, but it’s on a per call basis and over time, that all adds up. Add to that the fact that you really don’t have control over how it’s trained, and it’s not an easy sell.
However, going with IronOCR you have all of the control, with a single one-time price tag. Add to that the fact that you can download this library today and test to your heart’s content before buying, and it really is a no brainer if you are looking for any sort of text extraction/OCR features.
This is a sponsored post; however, all opinions are mine and mine alone.