Many years back, I actually started programming so that I could cheat at an online browser-based game (I know, I know). I started in Pascal/Delphi, but quickly moved on to C#. Back then, I was using things like HttpWebRequest/HttpWebResponse to load the webpage, and raw regex to parse it. Looking back, some of the earlier apps I wrote were just a complete mess. Things have changed, and it’s now easier than ever to automate web requests using C#/.NET Core. Let’s get started.
Loading The Webpage
Let’s start by creating a simple console application. Immediately, let’s go ahead and change this to an async entry point. So from this :
static void Main(string[] args) { }
We just make it async and return Task :
async static Task Main(string[] args) { }
However, if you’re using a version of C# older than 7.1 (or your project’s language version is set lower than that), you will get an error message that there is no valid entry point. Something like :
Program does not contain a static 'Main' method suitable for an entry point
In which case you need to do something a little more complex. You need to turn your main method to look like the following :
static void Main(string[] args)
{
    MainAsync(args).ConfigureAwait(false).GetAwaiter().GetResult();
}

async static Task MainAsync(string[] args)
{
}
What this essentially does is allow you to use async methods in a “sync” application. I’ll be using a proper async entry point, but if you want to go along with this second method, then you need to put the rest of this tutorial into your “MainAsync”.
For this guide I’m actually going to be loading my local news site, and pulling out the “Headline” article title and showing it on a console screen. The site itself is http://www.nzherald.co.nz/.
Loading the webpage is actually rather easy. I jam the following into my main method :
HttpClient client = new HttpClient();
var response = await client.GetAsync("http://www.nzherald.co.nz/");
var pageContents = await response.Content.ReadAsStringAsync();
Console.WriteLine(pageContents);
Console.ReadLine();
And run it to check the results :
Great! So we loaded the page via code, and we just spurted it all out into the window. Not that interesting so far, we probably need a way to extract some content from the page.
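The snippet above assumes the request always succeeds. As a slightly more defensive sketch, something like the following checks the status code before reading the body. The PageLoader class and LoadPageAsync method are names I've made up for illustration, they aren't part of any library.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class PageLoader
{
    // Hypothetical helper: downloads a page and returns its HTML as a string.
    // The HttpClient is passed in so that one instance can be reused across requests.
    public static async Task<string> LoadPageAsync(HttpClient client, string url)
    {
        using (var response = await client.GetAsync(url))
        {
            // Throws HttpRequestException if the status code is not a success code.
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}

public static class Program
{
    public static async Task Main(string[] args)
    {
        using (var client = new HttpClient())
        {
            try
            {
                var pageContents = await PageLoader.LoadPageAsync(client, "http://www.nzherald.co.nz/");
                Console.WriteLine(pageContents.Length + " characters downloaded");
            }
            catch (HttpRequestException ex)
            {
                // Covers network failures as well as non-success status codes.
                Console.WriteLine("Request failed: " + ex.Message);
            }
        }
    }
}
```

Passing the client in rather than creating one per call is a design choice on my part. It makes it easier to share a single HttpClient if you end up loading many pages.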
Extracting HTML Content
As I said earlier, in the early days I just used raw regex and with a bit of trial and error got things working. But there are much, much easier ways to read content these days. One of these is to use HtmlAgilityPack (HAP). HAP has been around for years, which means that while it’s battle tested and pretty bug free, its usage can be a bit “archaic”. Let’s give it a go.
To use HAP, we first need to install the NuGet package. Run the following from your Package Manager Console :
Install-Package HtmlAgilityPack
The first thing we need to do is feed our HTML into an HtmlDocument. We do that like so :
....
var pageContents = await response.Content.ReadAsStringAsync();
HtmlDocument pageDocument = new HtmlDocument();
pageDocument.LoadHtml(pageContents);
Now given I wanted to try and read the title of the “headline” article on NZHerald, how do we go about finding that piece of text?
HAP (and most other parsers) works using a thing called “XPath”. It’s a query language that describes how to traverse an XML document. Because HTML is pretty close to XML, it’s also pretty handy here. In fact, a big part of what HAP does is try and clean up the HTML so it’s able to be parsed more or less as an XML document.
Now the thing with XPath within an HTML document is that it can be a bit trial and error. Your starting point should be pulling up your browser’s dev tools, and just inspecting the divs/headers/containers around the data you want to extract.
In my case, the first thing I notice is that the headline article is contained within a div that has a class of pb-f-homepage-hero . As I scroll down the page, this class is actually used on a few other containers, but the first one in the HTML document is the one I want. On top of that, our text is inside a nice h3 tag within this container.
So given these facts, (And that I know a bit of XPath off the top of my head), the XPath we are going to use is :
(//div[contains(@class,'pb-f-homepage-hero')]//h3)[1]
Let’s break it down a bit.
// says that we can start anywhere within the document (Not just the first line).
div says that we are looking for a div that…
[contains(@class,'pb-f-homepage-hero')] contains a “class” attribute that has the text 'pb-f-homepage-hero' somewhere within it.
//h3 says that within this div element, somewhere in there (Doesn’t have to be the first child, could be deeper), there will be an h3.
()[1] says just get me the first one (XPath indexes start at 1, not 0).
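To see how each piece of the expression narrows things down, here is a rough sketch that runs the XPath one step at a time and counts the matches. It assumes the HtmlAgilityPack package is installed. The markup is made up by me to mimic the structure described above (it is not the real NZ Herald page), and XPathBreakdown and CountMatches are hypothetical names.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathBreakdown
{
    // Made-up markup that only mimics the structure described above;
    // it is not the real NZ Herald page.
    public const string SampleHtml =
        "<html><body>" +
        "<div class=\"pb-f-homepage-hero lead\"><div><h3>First headline</h3></div></div>" +
        "<div class=\"pb-f-homepage-hero\"><h3>Second headline</h3></div>" +
        "<div class=\"sidebar\"><h3>Not a hero</h3></div>" +
        "</body></html>";

    // Hypothetical helper: how many nodes does this XPath match in the sample?
    public static int CountMatches(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        // SelectNodes can hand back null when nothing matches, so treat that as zero.
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main(string[] args)
    {
        // Each entry adds one more piece of the final expression.
        string[] steps =
        {
            "//div",
            "//div[contains(@class,'pb-f-homepage-hero')]",
            "//div[contains(@class,'pb-f-homepage-hero')]//h3",
            "(//div[contains(@class,'pb-f-homepage-hero')]//h3)[1]"
        };

        foreach (var xpath in steps)
        {
            Console.WriteLine(CountMatches(xpath) + " match(es) for " + xpath);
        }
    }
}
```

Building the expression up like this is also how I'd debug an XPath that returns nothing on a real page: find the step where the count drops to zero.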
We can actually test this by going here : http://videlibri.sourceforge.net/cgi-bin/xidelcgi. We can paste our HTML into the first window, our XPath into the second, and the results of our expression will come up below. There are a few XPath testers out there, but I like this one in particular because it allows for some pretty busted up HTML (others try and validate that it’s valid XML).
So we have our XPath, how about we actually use it?
....
var pageContents = await response.Content.ReadAsStringAsync();
HtmlDocument pageDocument = new HtmlDocument();
pageDocument.LoadHtml(pageContents);

var headlineText = pageDocument.DocumentNode.SelectSingleNode("(//div[contains(@class,'pb-f-homepage-hero')]//h3)[1]").InnerText;

Console.WriteLine(headlineText);
Console.ReadLine();
So really we can paste our entire XPath into a “SelectSingleNode” statement, and we use the “InnerText” property to get just the textual content within that node (Rather than the HTML).
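One thing to watch out for: SelectSingleNode returns null when nothing matches, so if the site changes its markup, calling InnerText directly will throw a NullReferenceException. A minimal sketch of a safer version is below. HeadlineReader and GetHeadline are names I've invented, and the sample HTML is made up rather than taken from the real page. I've also run the text through HtmlEntity.DeEntitize, since InnerText can still contain encoded entities such as &amp;amp;.

```csharp
using System;
using HtmlAgilityPack;

public static class HeadlineReader
{
    // Hypothetical helper: returns the decoded, trimmed text of the first node
    // matching the XPath, or null if nothing in the document matches.
    public static string GetHeadline(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return null;
        }

        // Decode entities such as &amp; and strip surrounding whitespace.
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    public static void Main(string[] args)
    {
        // Made-up sample markup, not the real NZ Herald page.
        var html = "<div class=\"pb-f-homepage-hero\"><h3> Cats &amp; dogs </h3></div>";
        var xpath = "(//div[contains(@class,'pb-f-homepage-hero')]//h3)[1]";

        var headline = GetHeadline(html, xpath);
        Console.WriteLine(headline ?? "No headline found, has the page layout changed?");
    }
}
```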
If we let this run :
Done! We have successfully grabbed the headline news title! (Which is currently that a political party within NZ has nominated Simon Bridges as its new leader).
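If you want every matching element rather than just the first, HAP also has SelectNodes, which you can loop over with a foreach. The following is only a sketch under my own assumptions: HeadlineList and GetAllText are hypothetical names, and the sample markup is invented for the example. In the HAP versions I've used, SelectNodes returns null rather than an empty collection when nothing matches, so the code guards against that.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HeadlineList
{
    // Hypothetical helper: returns the text of every node matching the XPath.
    public static List<string> GetAllText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            // Nothing matched; return an empty list instead of null.
            return results;
        }

        foreach (var node in nodes)
        {
            results.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return results;
    }

    public static void Main(string[] args)
    {
        // Made-up sample markup, not the real NZ Herald page.
        var html =
            "<div class=\"pb-f-homepage-hero\"><h3>First headline</h3></div>" +
            "<div class=\"pb-f-homepage-hero\"><h3>Second headline</h3></div>";

        // Same expression as before, minus the ()[1] wrapper.
        var xpath = "//div[contains(@class,'pb-f-homepage-hero')]//h3";

        foreach (var headline in GetAllText(html, xpath))
        {
            Console.WriteLine(headline);
        }
    }
}
```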
Nice article. This is what I’m looking for.
I have a project that will fetch/read data from Wikipedia. I know there is a Wikimedia API, but it’s too complicated and the JSON response contains a bunch of data I don’t need. I hope this method works. Thank you very much
Wow, just now I came across this article and this is amazing, it helped me a lot after all the searching! Any idea how can this be incorporated if I have more entities, stocks for example, and I am supposed to show price for each one of them. So this should be some kind of foreach loop, I suppose…but do I have to insert the data for each stock manually?