Loading and Parsing A Web Page In .NET Core

Many years back, I actually started programming so that I could cheat at an online browser-based game (I know, I know). I started in Pascal/Delphi, but quickly moved on to C#. Back then, I was using things like HttpWebRequest/Response to load the webpage, and using raw regex to parse it. Looking back, some of the earlier apps I wrote were just a complete mess. Things have changed, and it’s now easier than ever to automate web requests using C#/.NET Core. Let’s get started.

Loading The Webpage

Let’s start by creating a simple console application. Immediately, let’s go ahead and change this to an async entry point. So from this :
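For reference, here is a minimal sketch of the synchronous entry point that the console template typically generates. The exact template output varies by SDK version, so treat the details as an assumption.

```csharp
using System;

class Program
{
    // Synchronous entry point, roughly as the console template generates it.
    static void Main(string[] args)
    {
        Console.WriteLine("Hello World!");
    }
}
```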

We just make it async and return Task :
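A rough sketch of the async version follows. The Task.Delay call is only a placeholder I've added so the method has something to await; it is not part of the tutorial's real work.

```csharp
using System;
using System.Threading.Tasks;

class Program
{
    // Async entry point: the signature changes from "void" to "async Task".
    static async Task Main(string[] args)
    {
        // Placeholder asynchronous work (an assumption, purely for illustration).
        await Task.Delay(100);
        Console.WriteLine("Hello World!");
    }
}
```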

However, if you’re using a version of C# older than 7.1, you will get an error message that there is no valid entry point. Something like “CS5001: Program does not contain a static 'Main' method suitable for an entry point”.

In which case you need to do something a little more complex. You need to turn your main method to look like the following :
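One common shape for this workaround is sketched below. It assumes a helper named MainAsync (the name used later in this post) and blocks on it with GetAwaiter().GetResult(); calling .Wait() is another option, though it wraps any exception in an AggregateException.

```csharp
using System;
using System.Threading.Tasks;

class Program
{
    // Synchronous entry point that blocks until the async work has finished.
    static void Main(string[] args)
    {
        MainAsync(args).GetAwaiter().GetResult();
    }

    // The rest of the tutorial code would go in here.
    static async Task MainAsync(string[] args)
    {
        // Placeholder asynchronous work (an assumption, purely for illustration).
        await Task.Delay(100);
        Console.WriteLine("Hello World!");
    }
}
```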

What this essentially does is allow you to use async methods in a “sync” application. I’ll be using a proper async entry point, but if you want to go along with this second method, then you need to put the rest of this tutorial into your “MainAsync”.

For this guide I’m actually going to be loading my local news site, and pulling out the “Headline” article title and showing it on a console screen. The site itself is http://www.nzherald.co.nz/.

Loading the webpage is actually rather easy. I jam the following into my main method :
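A minimal sketch of the loading step, assuming HttpClient and its GetStringAsync method are used (calling GetAsync and then reading the response content would work just as well). The variable name pageContents is my own choice for illustration.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main(string[] args)
    {
        // Fine for a short-lived console app; longer-running apps usually share one HttpClient.
        using (var client = new HttpClient())
        {
            // Download the whole page body as a string.
            string pageContents = await client.GetStringAsync("http://www.nzherald.co.nz/");

            // Dump the raw HTML to the console.
            Console.WriteLine(pageContents);
        }
    }
}
```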

And run it to check the results :

Great! So we loaded the page via code, and we just spurted it all out into the window. That’s not very interesting so far, so we need a way to extract some content from the page.

Extracting HTML Content

As I said earlier, in the early days I just used raw regex and with a bit of trial and error got things working. But there are much easier ways to read content these days. One of these is to use HtmlAgilityPack (HAP). HAP has been around for years, which means that while it’s battle tested and pretty bug free, its usage can be a bit “archaic”. Let’s give it a go.

To use HAP, we first need to install the NuGet package. Run the following from your Package Manager Console :

Install-Package HtmlAgilityPack

The first thing we need to do is feed our HTML into an HtmlDocument. We do that like so :
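As a small sketch of this step, the snippet below feeds a hard-coded HTML string (a stand-in for the downloaded page) into an HtmlDocument using LoadHtml. The helper name ParseHtml is hypothetical.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    // Hypothetical helper: parses a raw HTML string into a navigable document.
    public static HtmlDocument ParseHtml(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);
        return document;
    }

    static void Main(string[] args)
    {
        // Stand-in for the HTML downloaded in the previous step.
        string pageContents = "<html><body><h3>Example headline</h3></body></html>";

        HtmlDocument pageDocument = ParseHtml(pageContents);

        // Print the parsed document back out to confirm it loaded.
        Console.WriteLine(pageDocument.DocumentNode.OuterHtml);
    }
}
```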

Now given I wanted to try and read the title of the “headline” article on NZHerald, how do we go about finding that piece of text?

HAP (and most other parsers) works using a thing called “XPath“. It’s a query language that describes how to traverse an XML document. Because HTML is pretty close to XML, it’s also pretty handy here. In fact, a big part of what HAP does is try and clean up the HTML so it’s able to be parsed more or less as an XML document.

Now the thing with XPath within an HTML document is that it can be a bit trial and error. Your starting point should be pulling up your browser’s dev tools, and just inspecting the divs/headers/containers around the data you want to extract.

In my case, the first thing I notice is that the headline article is contained within a div that has a class of pb-f-homepage-hero. As I scroll down the page, this class is actually used on a few other containers, but the first one in the HTML document is the one I want. On top of that, our text is then inside a nice h3 tag within this container.

So given these facts (and that I know a bit of XPath off the top of my head), the XPath we are going to use is :

(//div[contains(@class,'pb-f-homepage-hero')]//h3)[1]

Let’s break it down a bit.

//  says that we can start anywhere within the document (Not just the first line).
div  says that we are looking for a div that…
[contains(@class,'pb-f-homepage-hero')]  contains a “class” attribute that has the text ‘pb-f-homepage-hero’ somewhere within it.
//h3  says that within this div element, somewhere in there (Doesn’t have to be the first child, could be deeper), there will be an h3
()[1]  says just get me the first one.
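To see those pieces working together without hitting the live site, here is a small sketch that runs the assembled expression against some made-up markup. The HTML snippet and the helper name ExtractFirstHeadline are assumptions for illustration; only the class name and the XPath pieces come from the breakdown above.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    // Hypothetical helper: returns the text of the first matching h3, or null if nothing matches.
    public static string ExtractFirstHeadline(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // The parentheses group every matching h3 in the document; [1] then takes the first of them.
        var node = document.DocumentNode.SelectSingleNode(
            "(//div[contains(@class,'pb-f-homepage-hero')]//h3)[1]");

        return node?.InnerText;
    }

    static void Main(string[] args)
    {
        // Made-up markup: two containers share the class, and the h3 is nested a level down.
        string html =
            "<div class='pb-f-homepage-hero'><div><h3>First headline</h3></div></div>" +
            "<div class='pb-f-homepage-hero'><h3>Second headline</h3></div>";

        Console.WriteLine(ExtractFirstHeadline(html));
    }
}
```

Without the wrapping parentheses, //h3[1] would mean "every h3 that is the first h3 child of its parent", which can match more than one node.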

We can actually test this by going here : http://videlibri.sourceforge.net/cgi-bin/xidelcgi. We can paste our HTML into the first window, our xpath into the second, and the results of our expression will come up below. There are a few XPath testers out there, but I like this one in particular because it allows for some pretty busted up HTML (Others try and validate that it’s valid XML).

So we have our XPath, how about we actually use it?
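Putting the steps together, a sketch of the full program might look like the following. Variable names are my own, and the null check is an addition: SelectSingleNode returns null when nothing matches, which will happen if the site's markup has changed since this was written.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Program
{
    static async Task Main(string[] args)
    {
        using (var client = new HttpClient())
        {
            // Step 1: download the raw HTML.
            string pageContents = await client.GetStringAsync("http://www.nzherald.co.nz/");

            // Step 2: feed it into an HtmlDocument.
            var pageDocument = new HtmlDocument();
            pageDocument.LoadHtml(pageContents);

            // Step 3: select the first h3 inside a hero container.
            var headlineNode = pageDocument.DocumentNode.SelectSingleNode(
                "(//div[contains(@class,'pb-f-homepage-hero')]//h3)[1]");

            // InnerText gives the text content only, without the surrounding HTML.
            Console.WriteLine(headlineNode != null
                ? headlineNode.InnerText.Trim()
                : "Headline not found.");
        }
    }
}
```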

So really we can paste our entire XPath into a “SelectSingleNode” statement, and we use the “InnerText” property to get just the textual content within that node (Rather than the HTML).

If we let this run :

Done! We have successfully grabbed the headline news title! (Which is currently that a political party within NZ has nominated Simon Bridges as its new leader).
