From Legacy CMS to Joomla with Ruby

April 18th, 2010  |  Published in etc, ruby  |  3 Comments

This weekend’s project was sort of interesting:

Given a spreadsheet that represents a 642-article data dump from a legacy, proprietary CMS, import all the content to Joomla and come up with a way that the legacy URL structure can be redirected from the old URL scheme to the new Joomla scheme without breaking links and killing the site’s Google juice.

Fortunately, I wrote a bunch of importing code last year to handle an even more primitive setup (no CMS at all … just a collection of hundreds of static files), so I solved a lot of the problems that come from importing to Joomla then, so for the import phase of the project I just needed to come up with the code to read the spreadsheet.

Here’s the process I used in a nutshell:

1. Reading the Content

The spreadsheet was pretty minimal. Each line included the publication date, title, teaser and content, but the article ids weren’t included as data: they corresponded to the line of the spreadsheet the article was on. I had to have the article i.d. because I’d need it later to map the URLs correctly.

The roo gem makes reading an Excel spreadsheet super easy.

2. Massaging the Content

Part of the spec for the migration includes converting to human-readable (a.k.a. search engine friendly (SEF)) URLs. The old URL structure looks something like this:

 http://foo.com/category/index.html?article_id=123

We want to end up with something like this:

 http://foo.com/category/section/lorem_ipsum_dolor_amet

And we want to make sure we control the URL such that we can map all those legacy “article_id” URLs to their SEF equivalents in a few steps. Since Joomla allows you to set your own article aliases and use them as part of the URL, I just took each article title and did some basic substitutions and changes:

article_alias = article_title.gsub(/\'|\"|\?|\./,"")

article_alias.gsub!(/\W/,"-")

article_alias.downcase!

article_alias.gsub!(/\-{2,}/,"-")

That had the effect of giving an article with a title like “Can This ‘Article’ Possibly End Well?” an alias of can-this-article-possibly-end-well.

3. Storing the Content

I hooked up to the Joomla backend with ActiveRecord. As roo read each line of the spreadsheet it created a new article in the jos_content table in the Joomla database.

A number of lines in the spreadsheet were completely blank. I’m guessing it’s because articles were deleted, or created but never completed. When I came across cases like that, I added some filler text and made sure to set the state field in the database to “0” to keep them from appearing in Joomla once the new site goes live until the site owner can decide what to do with them.

To keep an eye on how all this was going over on the target database, I used Sequel Pro, which provides a very clean and easy-to-use GUI for MySQL. It was easy to do a quick run, catch any problems, quickly remove the records then re-run my scripts.

It’s a comment on how nice Ruby, roo and ActiveRecord were to work with that this part of the process took less than two hours from opening the spreadsheet and scratching out some ideas on a legal pad to having all the content into the Joomla database, ready for the next phase.

4. Mapping the Legacy URLs

The next step in the process was figuring out how to redirect 640 legacy URLs from this:

 http://foo.com/category/index.html?article_id=123

to this:

 http://foo.com/category/section/lorem-ipsum-dolor-amet

There are a few pieces to the puzzle:

First off, you need to have some sort of SEF URL scheme configured in Joomla. The default method is to toggle them in the system-wide control panel, which will take you from something like this:

http://foo.com/index.php?option=com_content&Itemid=506&catid=117&id=1654&lang=en&view=article

to something like this:

 http://foo.com/index.php/category/section/12345-lorem-ipsum-dolor-amet

There are two things I don’t like about that:

First, the “index.php” part is pointless and ugly.

Second, the “12345-” (the article i.d.) at the tail end is also pointless and ugly, and it threatened to make life hard when it came time to do the redirects, because Joomla’s article i.d. would have to be mapped in relationship to the legacy CMS’s article i.d. In other words, the article with the Joomla id “12345” was the article with the legacy i.d. of “321,” “12346” was the legacy i.d. “322”, etc.

I didn’t want to have to figure out the offset, and it just didn’t seem very elegant to have to do so, plus I have my doubts that’s a very future-proof solution.

So Joomla’s SEF functionality needed a little help, which I got in the form of the sh404SEF module for Joomla. I’ve been using sh404sef since before Joomla even had SEF URLs of its own. It provides a lot of really useful functionality both in terms of controlling how URLs work and in manipulating how the site’s section/category structure presents to visitors. It has moved to a subscription model in the past year, but it’s worth it.

With sh404sef, I was able to go from this:

 http://foo.com/index.php/category/section/12345-lorem-ipsum-dolor-amet

to this:

 http://foo.com/category/section/lorem-ipsum-dolor-amet

5. Managing the Redirects

Everything up to this point was pretty easy. The spec called for converting the data and adding SEF URLs to enhance the site’s SEO. The spec also called for preserving the site’s PageRank by making sure we cut down on the number of 404s.

This is the point where I briefly hit a wall. The old legacy URLs looked like this:

     http://foo.com/category/index.html?article_id=123

Since sh404SEF allows you to create your own custom redirects in a file and import them, my thought was “just generate a quick list of needed redirects, mapping the “article_id” parameter to the correct SEF URL, call it day.

Sadly, sh404SEF (and mod_redirect) don’t really understand query parameters if you try to use them as a generic URL, so none of the redirects would work. I ended up having to generate the redirects as a series of rules that started with a condition (grab the article id parameter) and then went to the proper mapping, e.g.:

RewriteCond %{QUERY_STRING} ^article_id=7$

RewriteRule ^(.*)$ /archives/news/lorem-ipsum-dolor-amet? [R=301,L]

I went back to Ruby and roo to read the spreadsheet, map the article ids to the proper alias (and place in the directory structure), and spit out my .htaccess with all the rules in place.

I could have gone another direction with this, by simply converting the query strings into a path. From this:

     http://foo.com/category/index.html?article_id=123

to this:

     http://foo.com/category/section/123

Then handled all the mapping with sh404SEF, which would be able to understand that sort of URL.

That would have had the benefit of requiring just a few lines in .htaccess, but intuitively I don’t think it makes more sense to have a bunch of custom redirect rules down in a plugin within the CMS. I haven’t done any testing, but hitting a PHP process every time a redirect is needed seemed worse than letting Apache handle it.

Responses

  1. gl. says:

    April 18th, 2010 at 8:32 pm (#)

    “First, the “index.php” part is pointless and ugly.”

    oh, god, yes! when i see that in URLs it takes all my willpower not to scratch at the screen.

  2. Live and Learn :: dot unplanned says:

    April 19th, 2010 at 9:53 am (#)

    […] Well, from yesterday’s entry on Ruby and Joomla and such: […]

  3. Relaunch Day :: dot unplanned says:

    May 26th, 2010 at 5:11 pm (#)

    […] from the previous CMS to Joomla. I wrote some notes about managing the transition from the legacy CMS to Joomla with Ruby last month, but a few more things are worth calling out that made the job […]

Leave a Response

© Michael Hall, licensed under a Creative Commons Attribution-ShareAlike 3.0 United States license.