The blog of freelance Designer & Developer, David Rice.
So I’m working on a little project involving movie showing times (more on that soon) where I need to scrape a web page for a couple of nuggets of data. That website is stormcinems.co.uk. Only joking, it’s actually www.stormcinemas.co.uk. Anyway, during testing I saved a copy of the HTML from Firefox to use as a test fixture. After a little while I had finished everything and wanted to give it a live run. Job done, I thought. Nope: the first run gave all sorts of errors and it took me ages to figure out why.
In this situation the HTML being sent back to the browser is so invalid that Firefox cries out in pain and does some work to fix things up behind the scenes. So the page I saved from Firefox wasn’t the HTML you get from a request via Ruby’s Net::HTTP library. The main problem is on this page, where there are no <tr> elements in the showing times table, and simply trying to parse it with Hpricot is a no-go.
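One way to catch this kind of mismatch early is to check the raw response for the markup the parser relies on. The sketch below is only an illustration: the helper name `cells_without_rows?` and the sample strings are made up, and the regular expressions are a rough heuristic, not a real HTML parser. It flags markup that has table cells but no table rows.

```ruby
# Hypothetical helper: true when the markup contains <td> cells but no <tr> rows.
# A rough regex heuristic for spotting the problem, not a real HTML parser.
def cells_without_rows?(html)
  has_cells = html.match?(/<td[\s>]/i)
  has_rows  = html.match?(/<tr[\s>]/i)
  has_cells && !has_rows
end

# Made-up sample markup standing in for the raw response and the browser-fixed copy.
raw_html   = "<table><td>18:00</td><td>20:30</td></table>"
fixed_html = "<table><tr><td>18:00</td><td>20:30</td></tr></table>"

puts cells_without_rows?(raw_html)    # prints true
puts cells_without_rows?(fixed_html)  # prints false
```

Running a check like this against both the saved fixture and the live response would show straight away that the two are not the same document.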
Looking around the web for a solution I found a neat little Unix program, tidy, and to my surprise it was already on my Mac, whoop! So here’s a little Ruby method you can use to give that fugly HTML a spring clean before running it through your parsers.
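require 'open3'
# Pipe HTML through the tidy command-line tool and return the cleaned-up markup.
def tidy(html)
  # --force-output makes tidy emit markup even when it reports errors.
  # capture3 drains stdout and stderr together, so a flood of warnings
  # on stderr can't fill the pipe and stall the read.
  tidied_html, _warnings, _status = Open3.capture3("tidy --force-output true", stdin_data: html)
  tidied_html
end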
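The method above assumes tidy is installed and always prints something. A minimal sketch of a more defensive variant follows, under stated assumptions: the name `tidy_or_original` and the idea of passing the command in as an array are hypothetical, not part of the original method. It falls back to the untouched HTML when the command is missing or produces no output.

```ruby
require 'open3'

# Hypothetical wrapper: run HTML through an external cleaner command and
# return the original input when the command is missing or prints nothing.
def tidy_or_original(html, command = ["tidy", "--force-output", "true"])
  # Passing the command as separate arguments skips the shell, so a missing
  # executable raises Errno::ENOENT instead of failing silently.
  output, _warnings, _status = Open3.capture3(*command, stdin_data: html)
  output.strip.empty? ? html : output
rescue Errno::ENOENT
  # The command is not installed on this machine.
  html
end
```

Taking the command as a parameter also makes the wrapper easy to exercise in tests with a stand-in such as `cat`, without needing tidy on the test machine.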
After going through all this extra work I was wondering who actually built the site. A wee look at the page itself shows no evidence, but checking the source I can see that it was developed by Tibus (at least they got their own DNS records sorted). I wonder if my table row problem has anything to do with the lack of credit?
Oh, last one, honest. This really made my day when I saw it. Whatever developer wrote this line of code, I heart you. Lesson: if you don’t want your work to be known, don’t do this.
#tibus-strapline { display: none; ... }
Peace.