In Which Mike Internalizes the Whole Class Thing

February 2nd, 2009  |  Published in ruby

Last year I wrote a class in Ruby that I use to pull information from stories published on sites I manage. Initially I was just looking for a way to write reports on what I’d published over the past week (title and publication date), but it eventually grew into something that can provide more data:

  • publication date

  • author

  • content

  • teaser

  • headline

And it can also take the article content and pass back reformatted versions I can use for some frequent tasks: updating a page with links to news stories, or reformatting a story from a company site I don’t manage. I can feed it this URL:

http://www.internetnews.com/infra/article.php/3800236/

and get back this:

  <p><a href="http://www.internetnews.com/infra/article.php/3800236/Juniper+Breaks+into+New+Terabit+Territory.htm"><b>Juniper Breaks into New Terabit Territory</b></a><br>Juniper Rolls New Center Stage Core Routing System -new platform could scale to 25 Tbps for core routing.  &mdash; 02/02/09</p>

I hooked the code up to a Web front-end/bookmarklet, and I probably get back 15 minutes a day using that instead of doing everything by hand (loading a page, getting its source, reformatting to suit my site’s style, changing links as needed, etc.).
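The reformatting step can be sketched roughly like this. It is a minimal illustration, not the actual CDEV code: the Story struct and the news_link method are hypothetical names standing in for whatever CDEV does internally, and the publication date is passed in by hand rather than scraped from the page.

```ruby
require 'date'

# Hypothetical stand-in for the data CDEV returns about a story.
Story = Struct.new(:url, :headline, :teaser, :pubdate)

# Build a news-link paragraph in the same shape as the sample output:
# linked bold headline, line break, teaser, then the date as mm/dd/yy.
def news_link(story)
  date = story.pubdate.strftime("%m/%d/%y")
  "<p><a href=\"#{story.url}\"><b>#{story.headline}</b></a><br>" \
    "#{story.teaser} &mdash; #{date}</p>"
end

story = Story.new(
  "http://www.internetnews.com/infra/article.php/3800236/Juniper+Breaks+into+New+Terabit+Territory.htm",
  "Juniper Breaks into New Terabit Territory",
  "Juniper Rolls New Center Stage Core Routing System -new platform could scale to 25 Tbps for core routing.",
  Date.new(2009, 2, 2)
)

puts news_link(story)
```

Keeping the markup in one small method means a change in the site’s link style only has to be made in one place.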

The ability to tap Google Analytics via Ruby opens up some new territory for my class, especially since there are some peculiarities in the way Analytics interacts with some of the articles I track.

First, the CMS does some odd things with the URL, depending on a number of factors. New articles have URLs like the one above; old ones don’t have the search-engine-friendly URLs. Some pages that show up in Analytics reports are continuation pages, and they get their own URLs (something like the above, only with the numeric id after “article.php” wrapped in underscores and another identifier). Making it even more difficult, if you visit a continuation page, the link back to the first page of the article is in yet another format from the link you’d have followed from the front page. Printer-friendly pages have a different URL structure, too. From Analytics’ perspective, those are all different pages.

When I’m reading Analytics reports I don’t always get the best sense of how a multi-page story did, because it’s not easy to see the traffic a given article pulled down across all its possible pages: printer-friendly, continuation, and the front page as visited by someone following a link from a continuation page.

Mercifully, every single article has one thing in common: an i.d. number is always somewhere in the URL, and it’s always in a predictable place. So my class returns the i.d. of the article along with all the other data, which makes it nice for gathering the actual traffic information on articles for every URL they can appear under.
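Here is a small sketch of the idea, using the same two patterns as the script at the end of this post. The URLs and pageview counts are made up for illustration (the continuation-page URL in particular is my guess at the general shape, not a real one), and article_id and total_by_id are hypothetical helper names.

```ruby
# Pull the article id out of a URL, or return nil if there isn't one.
# Continuation pages put a number and an underscore in front of the id.
def article_id(url)
  if url.include?("_")
    url[/article\.php\/\d+_(\d+)/, 1]
  else
    url[/article\.php\/(\d{3,})/, 1]
  end
end

# Sum pageviews per article id across every URL the article appears under.
def total_by_id(rows)
  totals = Hash.new(0)
  rows.each do |url, views|
    id = article_id(url)
    totals[id] += views if id
  end
  totals
end

# Made-up report rows: [url, pageviews]
rows = [
  ["/infra/article.php/3800236/Juniper+Breaks+into+New+Terabit+Territory.htm", 120],
  ["/infra/article.php/11_3800236_2", 45],
  ["/infra/article.php/3800236", 10],
  ["/about/index.html", 7]
]

total_by_id(rows).each do |id, views|
  puts "#{id}: #{views}"
end
```

The three article URLs collapse into one entry for id 3800236 with 175 views, and the non-article page is ignored.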

Which brings us to today’s thing, which is marrying my class (CDEV) to Analytics mining with rugalytics so I can calculate how well a story is doing in terms of the views it has garnered vs. what it cost.

I’ve been doing something like this in the context of a Rails app for a few months, but there’s a new requirement to provide a spreadsheet instead of the HTML table. I wouldn’t mind being able to help coworkers who have to file the same report for their sites, either.

One reason I like writing this stuff up:

Since I’d already done a lot of this in Rails, I already had the simple loops in place to get pageview counts and all that. I didn’t want to make the spreadsheet-generating version dependent on ActiveRecord, though, so I found myself needing a new place to stow article objects.

Ed clued me in to ostruct (Ruby’s OpenStruct library), which let me create the story objects I use in the report without the brain-fog hashes induce. The script loops through each of the items in the report, adding the pageviews for each reported instance of the unique id to the related story object.
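Stripped of the spreadsheet and Analytics parts, that loop looks something like the sketch below. The report items are made-up [id, pageviews] pairs rather than real rugalytics objects, and tally_pageviews is a hypothetical helper name.

```ruby
require 'ostruct'

# Add each reported pageview count to the story with the matching id.
# Items whose id is not in the articles hash are ignored.
def tally_pageviews(articles, items)
  items.each do |id, views|
    articles[id].pageviews += views if articles.include?(id)
  end
  articles
end

# An OpenStruct grows attributes as you assign them, no class needed.
story = OpenStruct.new
story.id = "3800236"
story.title = "Juniper Breaks into New Terabit Territory"
story.pageviews = 0

articles = { story.id => story }

# Made-up report items: [article id, pageviews]
items = [["3800236", 120], ["3800236", 45], ["9999999", 3]]

tally_pageviews(articles, items)
puts "#{story.title}: #{story.pageviews}"
```

Compared with a hash of hashes, story.pageviews reads a lot more naturally than articles[id][:pageviews].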

In the version below, I used my CDEV class to create new article objects, then created yet another object using ostruct, which was sort of goofy of me because I already had story objects, courtesy of CDEV. The only thing they were missing was the author’s base rate and the number of pageviews a story has. So I extended my class by adding two lines:

  attr_accessor :pageviews
  attr_accessor :cost

which let me cut out some clutter, drop the ostruct requirement, save a few bytes of RAM and internalize a bit more of the Ruby way. If CDEV were not simple, or if it were something I didn’t want to modify directly, I could also create a subclass and add the needed attributes:

  class Article < CDEV
    attr_accessor :pageviews
    attr_accessor :cost
  end
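To make the subclass route concrete, here is a rough sketch. The CDEV class in it is a stand-in I made up so the example runs on its own: the real one takes only a URL and fetches the story itself, while this one is simply handed an author name. The author name, rate and pageview numbers are invented too.

```ruby
# Stand-in for the real CDEV class, which is not shown in this post.
# The real one fetches and parses the story; this one only stores
# what it is given.
class CDEV
  attr_reader :url, :author

  def initialize(url, author)
    @url = url
    @author = author
  end
end

# The subclass adds the two attributes the report needs.
class Article < CDEV
  attr_accessor :pageviews, :cost
end

# Made-up base rates, keyed by author name.
authors = { "Jane Writer" => 200 }

article = Article.new("http://www.internetnews.com/infra/article.php/3800236/", "Jane Writer")
article.cost = authors[article.author].to_i
article.pageviews = 0

# Made-up pageview counts for the article's various URLs.
[120, 45, 10].each { |views| article.pageviews += views }

views_per_cost = article.pageviews.to_f / article.cost
puts "#{article.pageviews} views / #{article.cost} = #{views_per_cost}"
```

The story object now carries its own pageviews and cost, so there is no second object to keep in sync.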

So anyhow, another triumph of Rubygems married to really basic logic, though unlike this morning’s entry I actually contributed 200+ lines of my own code, which you can’t see here, by way of the CDEV class:


  #!/usr/bin/ruby

  require 'rubygems'
  require 'spreadsheet'
  require 'rugalytics'
  require 'ostruct'
  require '/Users/mph/lib/ruby/CDEV'

  google_account = 'GOOGLE ACCOUNT NAME'
  google_password = 'GOOGLE ACCOUNT PASSWORD'

  # Set up the spreadsheet
  book = Spreadsheet.open 'LOCATION OF SPREADSHEET'
  log_sheet = book.worksheet 0
  author_sheet = book.worksheet 1
  report_sheet = book.worksheet 2

  # Set up the Analytics pull
  Rugalytics.login google_account, google_password
  analytics_channel = 'YOUR ANALYTICS ACCOUNT'
  analytics_site = 'YOUR ANALYTICS REPORT'
  profile = Rugalytics.find_profile(analytics_channel, analytics_site)
  report = profile.top_content_report :from => Date.today - 90.days, :to => Date.today, :rows => 500

  # Story objects, keyed by article id
  articles = {}

  # Is this URL an article page at all?
  def article?(url)
    url.include?("article.php")
  end

  # Continuation pages have underscores around the id
  def extra_page?(url)
    url.include?("_")
  end

  # Pull the article id out of a URL as a string (nil if there isn't one)
  def id(url)
    if extra_page?(url)
      url[/article\.php\/\d+_(\d+)/, 1]
    else
      url[/article\.php\/(\d{3,})/, 1]
    end
  end

  # Read in the authors: name in the first column, base rate in the second
  authors = {}
  author_sheet.each do |row|
    next if row[0].nil?
    authors[row[0]] = row[1]
  end

  # Build a story object for each URL in the log sheet, skipping the first row
  log_sheet.each 1 do |row|
    next if row[0].nil?
    puts row[0]
    url = row[0].to_s
    cdev = CDEV.new(url)
    # no need for all this now:
    article = OpenStruct.new
    article.id = id(url)
    article.title = cdev.title
    article.author = cdev.author
    article.pubdate = cdev.pubdate
    article.cost = authors[cdev.author].to_i
    article.url = url
    article.pageviews = 0
    articles[article.id] = article
  end

  # Add the pageviews for every reported URL that carries a known id
  report.items.each do |item|
    url = item.url.to_s
    next unless article?(url)
    cdevid = id(url)
    articles[cdevid].pageviews += item.pageviews.to_i if articles.include?(cdevid)
  end

  # Write one row per story, oldest first, starting at row 1
  i = 1
  articles.sort { |a, b| a[1].pubdate <=> b[1].pubdate }.each do |k, v|
    row = report_sheet.row(i)
    row[0] = v.title.to_s
    row[1] = v.pubdate.strftime("%m/%d/%y")
    row[2] = v.author.to_s
    row[3] = v.cost.to_i
    row[4] = v.pageviews.to_i
    row[5] = v.pageviews.to_f / v.cost.to_f   # pageviews per unit of cost
    i += 1
  end

  book.write 'LOCATION OF OUTPUT FILE'


© Michael Hall, licensed under a Creative Commons Attribution-ShareAlike 3.0 United States license.