David J Rice

The blog of freelance Designer & Developer, David Rice.

16 Jan 2008

So I’m working on a little project involving movie showing times (more on that soon) where I need to scrape a web page for a couple of nuggets of data. That website is stormcinems.co.uk. Only joking, it’s actually www.stormcinemas.co.uk. Anyway, during testing I saved a copy of the HTML from Firefox to use as a test fixture, and after a little while I had finished everything and wanted to give it a live run. Job done, I thought. Nope: the first run gave all sorts of errors and it took me ages to figure out why.

In this situation the HTML being sent back to the browser is so invalid that Firefox cries out in pain and does some work to fix things up behind the scenes. So the page I saved from Firefox wasn’t the HTML you get from a request via Ruby’s Net::HTTP library. The main problem is on this page, where there are no <tr> elements in the showing times table, and simply trying to parse it with Hpricot is a no-go.
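If you want to catch this sort of thing before it bites, it helps to build your fixture from the raw response rather than from what the browser saves. Here’s a rough sketch of that idea. The method names fetch_raw_html and cells_without_rows? are my own inventions for illustration, and the regex check is a crude heuristic, not a proper HTML validator.

```ruby
require 'net/http'
require 'uri'

# Fetch the markup exactly as the server sends it, with no browser repairs.
# (Hypothetical helper name; uses the standard Net::HTTP.get call.)
def fetch_raw_html(url)
  Net::HTTP.get(URI.parse(url))
end

# Crude heuristic: true when the markup has table cells but no table rows,
# which is the kind of breakage described above.
def cells_without_rows?(html)
  html.match?(/<td[\s>]/i) && !html.match?(/<tr[\s>]/i)
end
```

Saving the result of fetch_raw_html as the fixture means the tests see the same mess the live run will, and cells_without_rows? gives a quick warning that the table needs cleaning first.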

Looking around the web for a solution I found a neat little Unix program called tidy and, to my surprise, it was already on my Mac. Whoop! So here’s a little Ruby method you can use to give that fugly HTML a spring clean before running it through your parsers.

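require 'open3'

# Pipe HTML through the tidy command-line program and return the cleaned-up markup.
def tidy(html)
  # --force-output makes tidy emit markup even when it reports errors.
  # capture3 feeds stdin and drains stdout and stderr together, so a long
  # run of warnings on stderr cannot stall the pipe.
  tidied_html, _warnings, _status = Open3.capture3("tidy --force-output true", stdin_data: html)
  tidied_html
end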
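One thing to watch: tidy happened to be on my machine, but it might not be on yours or on your server. Below is a minimal sketch of a more forgiving variant, assuming you would rather fall back to the untouched HTML than blow up. The name tidy_or_original is made up for this example.

```ruby
require 'open3'

# Hypothetical variant: run the HTML through tidy when it is available,
# otherwise hand back the original markup unchanged.
def tidy_or_original(html)
  out, _warnings, _status = Open3.capture3("tidy", "--force-output", "true", stdin_data: html)
  # If tidy produced nothing at all, keep the original rather than return an empty string.
  out.empty? ? html : out
rescue Errno::ENOENT
  # Raised when the tidy executable cannot be found on the PATH.
  html
end
```

Whether a silent fallback is the right call depends on your parser. If it can’t cope with the raw markup anyway, raising an error might be the kinder option.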

After going through all this extra work I was wondering who actually built the site. A wee look at the page itself shows no evidence, but checking the source I can see that it was developed by Tibus (at least they got their own DNS records sorted). I wonder if my table row problem has anything to do with the lack of accreditation?

Oh, last one, honest. This really made my day when I saw it. Whatever developer wrote this line of code, I heart you. Lesson: if you don’t want your work to be known, don’t do this.

#tibus-strapline { display: none; ... }

Peace.

David Rice
