Reading And Writing PDFs In C# With Datalogics PDF SDK

Some time back I wrote a post specifically on writing PDFs in C#/.NET Core. At the time, I had very specific requirements for turning an HTML file into a PDF and was really looking for the cheapest library that fit the bill, regardless of how much effort it took to get it working. I think I probably underestimated how bad support for generating PDFs actually was in C#. It turned into a maze of open source libraries that “kinda” worked, complete with GitHub issue trackers that hadn’t been answered in years, all the way to paid libraries with seemingly black hole support.

Well, a recurring thing that came out of that post was people asking me “But.. doesn’t Adobe do this for you?”. At the time, as I understood it, Adobe did not have a programming interface for PDFs at all; they only dealt in desktop software. But as it turns out, that’s because they partnered with a company called Datalogics, who are their official vendor for a C# PDF SDK.

Better yet, Datalogics just announced their support for .NET Core and actual support for non-Windows environments, e.g. Linux. I’m emphasizing that because people left comments on my previous post saying that even though a company said they supported Linux, they were actually just guessing and didn’t provide any real support for it. So naturally, with my past post giving a few open source options and third party companies selling their own SDKs, it makes sense to try the real deal.

Getting Started

You can head over to Datalogics and grab a free trial here : https://www.datalogics.com/products/pdf-sdks/adobe-pdf-library/. It looks scary filling in the contact form but don’t worry, you get an automated email back with all the details on how to grab the library immediately without having someone calling you 15 seconds after you hit submit.

As part of the download, you are given a package with over 70 sample projects that demonstrate how the library works! It’s actually somewhat refreshing to be given that many sample apps that work right out of the box without having to muck about working out how all the pieces fit together. Not only are all the sample applications there, but all the sample PDFs with their own features (for example, a PDF with images to show image extraction) are there too, ready for you to just run and step through the code.

Better yet, chances are one of these sample apps does exactly what you want a PDF library for. Just an example of the sort of samples you get :

  • A WinForms application that shows opening, viewing and editing PDFs inside a WinForms control.
  • A simple console application that showcases true text redaction (not just dark rectangles drawn over the top of a PDF)
  • An application to extract text from PDFs that is among the best I’ve seen
  • An OCR application that reads text from images and adds it to a PDF file (To be honest, forget the PDF writing, this was easily the most impressive sample here!)

If you are interested, you can even view the sample code/applications here : https://dev.datalogics.com/adobe-pdf-library/sample-program-descriptions/net-sample-programs/working-with-actions-in-pdf-files/ to see if it does what you need before even jumping in.

While Datalogics do have a library for turning an HTML file into a PDF, I want to look at maybe some of the more complex sample applications and see just how good they are.

PDF Visual Editor

Yes OK, this is .NET Framework (for now), as most WinForms apps are, but I do want to touch on it because I found it pretty impressive to use, and it was probably the one feature I hadn’t seen before in a PDF library for C#. I think most SDKs focus on manipulating PDFs from code, and any samples given are all in the form of console applications. So it was a bit of a change of pace to fire up an actual GUI application that showcases PDF editing functionality.

(Ignore the actual PDF in the screenshot, my partner has been getting a bit too much into cross stitch lately!)

This is the sample application that comes with the library, built in .NET, and it can both read PDFs and edit them on the fly. See what I mean when I say that the samples are pretty good? This is a fully fledged application, so even if you need just one of these features, be it viewing a PDF in a WinForms application, editing a PDF, printing, whatever it may be, the code is right there in front of you to debug and actually use.

Redacting Text Inside A PDF

A really impressive example app in the sample suite is a small console application that redacts text. I actually went back and checked other PDF libraries for C#, both paid and free, and I couldn’t find any tools for redacting text inside a PDF. More importantly, redacting text is *not* simply drawing a rectangle over a word and calling it a day.

Quite famously, in the Mueller investigation vs Paul Manafort, a “redacted” legal response was released to the public where you could literally just copy and paste the “black box” into a notepad doc and read all of the “redacted” text. (Further reading here on that fail : https://www.vice.com/en_us/article/8xpye3/paul-manafort-russia-case-redaction-fail).
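To make the difference concrete, here is a deliberately tiny toy model in C#. None of these types come from the Datalogics SDK; `ToyPage`, `TextRun`, `Box` and the fixed box size are all made up for illustration. The only point is that text extraction reads the text stored in the page, so a box drawn on top changes nothing unless the text itself is removed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only: hypothetical types, NOT the Datalogics API.
public record TextRun(string Text, double X, double Y);
public record Box(double X, double Y, double Width, double Height);

public class ToyPage
{
    public List<TextRun> Runs { get; } = new();
    public List<Box> Boxes { get; } = new();

    // What copy and paste (or any text extractor) sees: the text runs.
    // Boxes drawn on top are just graphics and hide nothing here.
    public string ExtractText() => string.Join(" ", Runs.Select(r => r.Text));
}

public static class RedactionDemo
{
    static bool Matches(TextRun run, string word) =>
        string.Equals(run.Text, word, StringComparison.OrdinalIgnoreCase);

    // Fake redaction: draw a box over each match, leave the text in the page.
    public static void CoverWithBox(ToyPage page, string word)
    {
        foreach (var run in page.Runs.Where(r => Matches(r, word)))
            page.Boxes.Add(new Box(run.X, run.Y, 40, 12)); // arbitrary box size
    }

    // Real redaction: remove the matching text itself, then mark where it was.
    public static void Redact(ToyPage page, string word)
    {
        foreach (var run in page.Runs.Where(r => Matches(r, word)).ToList())
        {
            page.Runs.Remove(run);
            page.Boxes.Add(new Box(run.X, run.Y, 40, 12));
        }
    }

    public static ToyPage Sample()
    {
        var page = new ToyPage();
        page.Runs.Add(new TextRun("Tomorrow", 0, 700));
        page.Runs.Add(new TextRun("will", 60, 700));
        page.Runs.Add(new TextRun("be", 90, 700));
        page.Runs.Add(new TextRun("cloudy", 110, 700));
        return page;
    }

    public static void Main()
    {
        var covered = Sample();
        CoverWithBox(covered, "cloudy");
        Console.WriteLine(covered.ExtractText());  // prints "Tomorrow will be cloudy"

        var redacted = Sample();
        Redact(redacted, "cloudy");
        Console.WriteLine(redacted.ExtractText()); // prints "Tomorrow will be"
    }
}
```

A real PDF is far messier than this (text can be split across runs, embedded in forms, or duplicated in metadata), which is exactly why I’d rather lean on a library that handles it properly.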

The sample application uses a simple weather PDF and redacts all instances of the word “cloudy”. Impressively, it actually outputs two sample documents: the first shows every instance of the word it found with a box drawn around it, and the second is the completely redacted version.

Before

After Identifying

After Redaction

And again, this is not simply drawing a rectangle over it MSPaint style, this is truly identifying the text and redacting it completely.

Extracting Text From A PDF

I guess this is basically a staple for all PDF SDK’s, but I was really impressed with the quality from the Datalogics SDK. Again, they have a great sample application to show you what it can do.

Take this PDF section for example, it’s on Page 5 of a sample PDF.

And the text output :

Now I kinda get that all this is doing is reading text from a PDF and that doesn’t sound that impressive. But I found other libraries (especially free ones) basically read the text as one big long string and gave you that. The entire structure of the page was lost, especially the line breaks.

On a previous project of mine I was tasked with building an application that, given a page and line number in a PDF, had to extract the text on that line. Can you imagine the headaches when the entire structure of the PDF is lost and I have to come up with all sorts of crazy ways of reverse engineering the line number? I love a library that does what it says on the tin.
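For a taste of what that reverse engineering looks like, here is a rough sketch of the usual trick: group words into lines by how close their vertical positions are. It assumes your extraction library can hand you each word with a position; the `Word` record, the `LineGrouper` helper and the tolerance value are all hypothetical and are not part of any SDK mentioned here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical input type: a word plus the position of its baseline on the page.
public record Word(string Text, double X, double Y);

public static class LineGrouper
{
    // Groups words into lines. A word joins the current line when its Y is
    // within `tolerance` of that line's first word; otherwise it starts a new line.
    public static List<string> ToLines(IEnumerable<Word> words, double tolerance = 2.0)
    {
        var lines = new List<List<Word>>();

        // PDF user space has its origin at the bottom-left by default,
        // so a larger Y means higher up the page. Walk from top to bottom.
        foreach (var word in words.OrderByDescending(w => w.Y))
        {
            var current = lines.LastOrDefault();
            if (current != null && Math.Abs(current[0].Y - word.Y) <= tolerance)
                current.Add(word);
            else
                lines.Add(new List<Word> { word });
        }

        // Within each line, read left to right.
        return lines
            .Select(l => string.Join(" ", l.OrderBy(w => w.X).Select(w => w.Text)))
            .ToList();
    }

    public static void Main()
    {
        var words = new[]
        {
            new Word("tin", 80, 688),
            new Word("Does", 10, 700.5),
            new Word("what", 45, 700),
            new Word("on", 10, 688),
            new Word("the", 40, 688.4),
        };

        foreach (var line in ToLines(words))
            Console.WriteLine(line);
        // prints:
        // Does what
        // on the tin
    }
}
```

Even this falls over on multi-column layouts, superscripts and rotated text, so having the library preserve line structure for you saves a lot of pain.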

Image PDF OCR

Is it weird that a PDF library impresses me the most by showcasing its optical character recognition (OCR) powers on images? This thing actually blew my mind.

In the sample, they give you a PDF that is one big image, including tables :

Again, this is an image inside a PDF. You cannot highlight this text or copy it out. Nor can you use any old PDF reader to actually “read” the text because the text isn’t actually there. It’s an image. Then the sample application reads all text inside the PDF, re-does the entire PDF, and resaves it including the table, but with the text now OCR’d and input as actual text and not an image.

So you can now copy out the text just like you could if it was text inside a PDF in the first place. There’s just something crazy about the fact that it can not only OCR the text, but also recognize that the text is inside a table etc and basically redraw everything pixel perfect, but now with the text perfectly selectable.

In the screenshot above, it does look like the text is sort of disjointed, but I can assure you when you copy that text out it’s a complete sentence in order :

Required)The type of annotation that this dictionary describes;
must be Redact for a redaction ann

Part of the reason this is so impressive to me is because I’ve actually been part of teams that have attempted to OCR billing invoices, where there are tables of charges as an image inside a PDF, and we were trying to read the line items from the invoices to input them into a database. It took months of work and in some cases, we just said “can’t be done”. But I would love to give it another try with this library.

Who Is This Library For?

So here’s the thing. It’s not free. Let’s just get that out of the way. But this is easily one of the most comprehensive, if not *the* most comprehensive, PDF libraries I’ve used. For extremely simple applications that turn a pretty plain HTML file into a PDF, yeah, maybe it’s not for you. But there were features in this library that I hadn’t seen anywhere else, and they came with actual working examples for you to just copy and paste into your own application.

The thing that most surprised me about my previous post on PDF support in C#/.NET Core is just the sheer number of libraries that didn’t even do what they said on the tin. It was this sort of mish-mash of partly working features and zero support. If you’re working on a business critical piece of functionality that involves PDFs (e.g. generating thousands of invoices that cannot be delayed while you wait for a response on your GitHub issue), then the Datalogics C# SDK might just be right for you.


This is a sponsored post; however, all opinions are mine and mine alone.
