February 24th, 2009
Posted in ASP.NET, C#
I’ve got a fair share of websites that draw data from an event promoter’s website. I originally researched using their web services model, but was severely disappointed with its implementation. That left me with two essential problems to solve:
1) I needed to know when certain events were updated so I could keep my websites up to date.
2) I needed to parse the raw HTML and extract the relevant data to generate an XML file that my GridView can pull from to display that data to my visitors.
The first task was to build a Console App that I could set up as a Scheduled Task to run nightly and tell me which pages had updated. So up Visual Studio goes. I’ve chosen not to hard-code the page values so I can easily add more pages to check in the future. I’ve stored these values in a text file called parsePages.txt as follows:
yahoo,http://www.yahoo.com
google,http://www.google.com
These values could just as easily be stored in a database or an XML document. Next, I pull this data into the program so I know which pages I need to parse. I set up a regular expression to split the values from parsePages.txt into an ArrayList so I can start making my checks.
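As a rough illustration of that loading step, here is a minimal sketch and not the original listing. The class name PageList, the method Load, and the exact pattern are my own assumptions, and it uses a generic List in place of the ArrayList mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper: reads "name,url" lines from a file such as parsePages.txt.
public static class PageList
{
    // Assumed line format: a short name, a comma, then the URL.
    private static readonly Regex LinePattern =
        new Regex(@"^\s*([^,\s]+)\s*,\s*(\S+)\s*$");

    // Returns one (name, url) pair per well-formed line.
    // Blank or malformed lines are skipped silently.
    public static List<KeyValuePair<string, string>> Load(string path)
    {
        var pages = new List<KeyValuePair<string, string>>();
        foreach (string line in File.ReadAllLines(path))
        {
            Match m = LinePattern.Match(line);
            if (m.Success)
            {
                pages.Add(new KeyValuePair<string, string>(
                    m.Groups[1].Value, m.Groups[2].Value));
            }
        }
        return pages;
    }
}
```

Calling PageList.Load("parsePages.txt") on the two sample lines above should give two pairs, with the name in Key and the URL in Value.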
Now that I have my list of pages to parse through, I’m going to loop through each one, checking to see if its files exist and then executing the parse routine.
Here is where we do our file verification. We use two text files for each value: one to house our current content and one for the previous day’s content. This way we can see what changes, if any, have been made since the previous day. If for some reason either the main file or the alt file doesn’t exist, we want to create it. This is handy for new events we’ve just added to the list.
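The verification step might look something like the sketch below. The file naming scheme ("name.txt" and "name_alt.txt") and the SnapshotFiles class are assumptions I've made for illustration; the original naming isn't shown here.

```csharp
using System;
using System.IO;

// Hypothetical helper: makes sure both snapshot files exist for a page.
public static class SnapshotFiles
{
    // Assumed naming: "<name>.txt" for the main file.
    public static string MainPath(string dir, string name)
    {
        return Path.Combine(dir, name + ".txt");
    }

    // Assumed naming: "<name>_alt.txt" for the alternate file.
    public static string AltPath(string dir, string name)
    {
        return Path.Combine(dir, name + "_alt.txt");
    }

    // Creates any missing file as an empty file.
    // Returns true if at least one file had to be created.
    public static bool EnsureExist(string dir, string name)
    {
        bool created = false;
        foreach (string path in new[] { MainPath(dir, name), AltPath(dir, name) })
        {
            if (!File.Exists(path))
            {
                File.WriteAllText(path, string.Empty);
                created = true;
            }
        }
        return created;
    }
}
```

Inside the loop over the page list, a call to SnapshotFiles.EnsureExist for each name would run before the parse routine, so a newly added event always has both files to compare.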
So far we’ve done a lot of work, but we still haven’t gotten to the meat of what we’re trying to do here. Never fear, because here comes the parse routine. We pass the reference and the URL into the ScreenScrape method. We use System.Net.WebRequest to create a request for the URL and WebResponse to hold the response it returns.
We read the HTML contents of the page through a StreamReader, then do a little cleanup to strip off the header and footer HTML that we aren’t concerned with.
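Here is a hedged sketch of what the fetch and cleanup could look like. The Scraper class, the TrimToContent helper, and the idea of cutting between two marker strings are my assumptions; the actual markers depend on the page being scraped. Note also that WebRequest is marked obsolete in newer .NET versions in favor of HttpClient, though it matches what this post describes.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper: downloads a page and trims it to the part we care about.
public static class Scraper
{
    // Downloads the raw HTML of a page using WebRequest/WebResponse.
    public static string Fetch(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Keeps only the text between startMarker and endMarker.
    // If either marker is missing, the HTML is returned unchanged.
    public static string TrimToContent(string html, string startMarker, string endMarker)
    {
        int start = html.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return html;
        start += startMarker.Length;

        int end = html.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end < 0) return html;

        return html.Substring(start, end - start);
    }
}
```

Keeping the trimming separate from the download makes it easy to try the cleanup against a saved copy of the page without hitting the site every time.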
Next, we need to compare the main text file with the alt text file to identify which one was written to most recently, so we can overwrite the other, older file with the HTML we just read through the StreamReader.
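One way to sketch this rotation is to compare the files' last-write times. The SnapshotRotation class and its method names are hypothetical, and using timestamps is my assumption about how "written to most recently" gets decided.

```csharp
using System;
using System.IO;

// Hypothetical helper: always writes new HTML over the older of the two files.
public static class SnapshotRotation
{
    // Returns whichever path was written to least recently.
    // On a tie, the main file is treated as the older one.
    public static string OlderOf(string mainPath, string altPath)
    {
        DateTime mainTime = File.GetLastWriteTimeUtc(mainPath);
        DateTime altTime = File.GetLastWriteTimeUtc(altPath);
        return mainTime <= altTime ? mainPath : altPath;
    }

    // Overwrites the older file with the new HTML and returns the path written.
    public static string SaveSnapshot(string mainPath, string altPath, string html)
    {
        string target = OlderOf(mainPath, altPath);
        File.WriteAllText(target, html);
        return target;
    }
}
```

After each run, the file that was just written holds today's content and the other one holds the previous run's content.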
Now if we found any change between the old and the new file, we need to know about it. If the file lengths don't match, we write a line out to the console saying an update was found.
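The length check itself can be sketched in a few lines. ChangeCheck and the wording of the console message are assumptions of mine, not the original output.

```csharp
using System;
using System.IO;

// Hypothetical helper: flags a page as updated when its two snapshots differ in size.
public static class ChangeCheck
{
    // True when the two files have different lengths in bytes.
    public static bool HasChanged(string mainPath, string altPath)
    {
        return new FileInfo(mainPath).Length != new FileInfo(altPath).Length;
    }

    // Writes a line to the console when an update is detected.
    public static void Report(string name, string mainPath, string altPath)
    {
        if (HasChanged(mainPath, altPath))
        {
            Console.WriteLine("Update found: " + name);
        }
    }
}
```

Comparing lengths is cheap, but it will miss an edit that happens to leave the page the same size; comparing the file contents or a hash of them would catch those cases too.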
Tags: ASP.NET, c#, HTML, page parsing, visual studio, XML