Nerdery Watch

April 3rd, 2007  |  Published in etc  |  5 Comments

Scrubyt is a web-scraping toolkit for Ruby. It is awesome. Here’s a sample scrape:

    #!/usr/bin/env ruby
    require 'rubygems'
    require 'scrubyt'

    du_data = Scrubyt::Extractor.define do

      fetch 'http://mph.puddingbowl.org'
      story do
        title 'Does nothing escape the filthy clutches of these vi-rmin?'
        cats 'software'
      end
    end

    du_data.export(__FILE__)
    du_data.to_xml.write($stdout, 1)
    Scrubyt::ResultDumper.print_statistics(du_data)

The nutshell version: it goes out and grabs mph.puddingbowl.org, takes a pair of examples for the things you want to scrape (title and categories), then generalizes those items by their DOM characteristics. The first time you scrape a site, you base the examples on the first item of each type you want to scrape on a given page. The “__FILE__” line exports the scrape as a generalization of your example items, producing a scrubyt script you can use in the future without having to provide any examples:

    require 'rubygems'
    require 'scrubyt'

    du_data = Scrubyt::Extractor.define do

      fetch 'http://mph.puddingbowl.org'
      story "/html/body/div/div/div" do
        title "/h2[1]/a[1]"
        cats "/p[1]/a[1]"
      end
    end

    du_data.to_xml.write($stdout, 1)
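
To get a feel for what those exported paths point at, here is a rough sketch that applies the same XPaths by hand using REXML, against a made-up and suspiciously tidy page. The sample markup and the `scrape` helper are hypothetical, and this is not a claim about how scrubyt works internally. The leading slash is dropped from the child paths so REXML reads them as relative to each story node.

```ruby
require 'rexml/document'

# Made-up, well-formed stand-in for a page; real HTML is rarely this tidy.
PAGE = <<~HTML
  <html><body><div><div>
    <div><h2><a href="/1">First post</a></h2><p><a href="/c/a">software</a></p></div>
    <div><h2><a href="/2">Second post</a></h2><p><a href="/c/b">etc</a></p></div>
  </div></div></body></html>
HTML

# Hypothetical helper: find each story node, then read the title and
# category links relative to that node.
def scrape(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '/html/body/div/div/div').map do |story|
    {
      title: REXML::XPath.first(story, 'h2[1]/a[1]').text,
      cats:  REXML::XPath.first(story, 'p[1]/a[1]').text
    }
  end
end

scrape(PAGE).each { |story| puts "#{story[:title]} (#{story[:cats]})" }
```

The story path matches every repeating container, and the two child paths pick the first link in the heading and the first link in the paragraph under it.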

The output is XML, which really doesn’t make me wince anymore: as long as it’s consistent, you don’t really have to work with it as XML if you’re of a mind to just munge the tags into something else.
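
As a sketch of that kind of munging, the snippet below flattens scraped XML into plain Ruby hashes with REXML. The sample document is an assumption: it guesses at a `root` wrapper around `story` elements with `title` and `cats` children, and the second story is invented filler. The `stories_from` helper is hypothetical, too.

```ruby
require 'rexml/document'

# Assumed shape of the scrape output; the real element layout depends on
# the extractor definition. The second story is invented filler.
SAMPLE_XML = <<~XML
  <root>
    <story>
      <title>Does nothing escape the filthy clutches of these vi-rmin?</title>
      <cats>software</cats>
    </story>
    <story>
      <title>Another post</title>
      <cats>etc</cats>
    </story>
  </root>
XML

# Hypothetical helper: turn each <story> element into a hash keyed by
# the names of its child tags.
def stories_from(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements('//story').map do |story|
    story.elements.each_with_object({}) do |child, row|
      row[child.name] = child.text.to_s.strip
    end
  end
end

stories = stories_from(SAMPLE_XML)
stories.each { |s| puts "#{s['title']} [#{s['cats']}]" }
```

Once the data is sitting in hashes, it can go out as CSV, a report, or whatever else, without touching the XML again.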

The thing that blows me away is that it takes so little effort to do this. Creating objects and using them is so simple it freaks me out a little. Sixteen lines of code, three of which are reader-friendly whitespace, and you get an easy-to-read scraping script that pretty much wrote itself as far as the heavy lifting is concerned. Nice.

What’s not nice? One page I’d love to scrape has been built with an apparent desire to make HTML Tidy wish it had been born as a third-year CS student’s porn spider.

Note: If you want to try it, get the 0.2.3 version of scrubyt. The 0.2.6 version is busted up for some things. It also seems to pay to just go with the MacPort, because Tiger’s default install is sorta busted up, too.

Second Note: Some time between last night and today, all the stuff I was doing that broke 0.2.6 started mysteriously working. Read any information I provide as if I am a blithering n00b and you’ll probably be fine.

Responses

  1. gl. says:

    April 3rd, 2007 at 11:42 am

    i don’t understand: what do you want to scrape and what do you want to do with it once it has been scrup?

  2. mph says:

    April 3rd, 2007 at 11:55 am

    Well, in the case of the work sites, I’m looking for a way to take some of the gruntwork out of weekly reports I have to generate. It’s not hard to do those by hand, but if I had a way to automate it, I could reuse that data in other ways that would eliminate even more gruntwork. That’s really just a warmup, though …

    I’ve got to start processing a LOT of pages and getting some basic characteristics out of them so I can build a keyword web out of them. All that is going toward generating an unofficial navigational flow that bypasses the horrific site navigation we had inflicted on us years ago.

    Scrubyt can help me automate that. I think, fortunately, that there’s a lot less cruft on our inside pages, so when I get down to processing out the actual data from individual pages, it won’t be so bad. Getting a coherent automated record of what’s appeared on the index page, however, will take three distinct steps instead of scrubyt’s normally elegant one.

  3. Peter Szinek says:

    April 3rd, 2007 at 1:27 pm

    Hi mph,

    Could you please elaborate a bit more on what got screwed up in 0.2.6? I am just working to fix some bugs as well as to add new features (0.2.8 is due this week), so it would definitely help if you could tell me about your problems. The best place to do so is the scRUBYt! forum, http://agora.scrubyt.org. I am very eager to fix bugs and provide support – of course provided that I know about the problems :)

  4. mph says:

    April 3rd, 2007 at 2:04 pm

    Hey, Peter. Absolutely! I just reverted to an older version in the heat of trying to get things to work. Now that the joy of discovery is 18 hours old, I’ll be happy to reinstall the newer gem and document what breaks. Thanks for stopping by, and thanks for an awesome tool.

  5. Peter says:

    April 4th, 2007 at 2:13 am

    mph,

    OK, please do your worst :)

    See you at the forum…


© Michael Hall, licensed under a Creative Commons Attribution-ShareAlike 3.0 United States license.