Straightening Out Hpricot and Case-Sensitive Meta Parsing

May 6th, 2009  |  Published in ruby

When confronted with the following on a page:

<META name="authors" content="John Q. Public">

or

<META NAME="authors" CONTENT="John Q. Public">

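Hpricot believes there’s a difference: looking for the lowercase “name” attribute won’t find the uppercase NAME. According to everyone who’s written about this behavior, XML (and so XHTML) treats attribute names as case-sensitive, so by that standard Hpricot’s behavior is correct.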

Most of the time it probably wouldn’t be a big deal. When you’re scraping a site, chances are good that it uses some sort of standardized template, so the worst you have to deal with is getting it wrong the first time by assuming Hpricot isn’t case-sensitive, then figuring it out and being happy again.

My CMS-scraping class started failing while looking for the “authors” meta tag, but only once in a while. I’d go in and compare the HTML source the browser was showing with what I was telling Hpricot to look for, and they’d seem to jibe, so I’d be all “Why, _why, why?” Then one day they didn’t jibe and I was all “Oh … they changed!” Then the fix would re-break, and so on.

Eventually I realized that one of the load balancers in use had a different set of templates. Most of the time I’d be getting meta tags with “name” and “content” attributes, but occasionally I’d get the one that had meta tags with NAME and CONTENT attributes.

I did a little poking around and eventually settled on a nice method that lowercases attribute names when Hpricot objects are passed through, and that’s that.
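
A minimal sketch of that idea, not necessarily the exact method: it assumes each meta tag’s attributes have already been pulled out of the parser as a plain Hash, and the helper names `downcase_keys` and `meta_content` are made up for illustration.

```ruby
# Hypothetical helper: returns a copy of an attribute hash with every
# attribute name lowercased, so "NAME" and "name" end up as the same key.
def downcase_keys(attrs)
  attrs.each_with_object({}) do |(key, value), out|
    out[key.to_s.downcase] = value
  end
end

# Hypothetical helper: returns the content of the first meta tag whose
# name attribute equals `wanted`, regardless of attribute-name case.
# `metas` is assumed to be an array of attribute hashes, one per meta tag.
def meta_content(metas, wanted)
  metas.each do |attrs|
    normalized = downcase_keys(attrs)
    return normalized["content"] if normalized["name"] == wanted
  end
  nil
end

# The two template variants, as attribute hashes.
lower = { "name" => "authors", "content" => "John Q. Public" }
upper = { "NAME" => "authors", "CONTENT" => "John Q. Public" }

puts meta_content([lower], "authors")
puts meta_content([upper], "authors")
```

Both calls print John Q. Public, so it no longer matters which template answered the request. Only the attribute names are normalized here; the value “authors” is still compared exactly.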


© Michael Hall, licensed under a Creative Commons Attribution-ShareAlike 3.0 United States license.