The new Ballpark Digest went live today.
There were over 2,500 legacy articles that needed to be imported this time around, and preserving their search engine placement was a little more important because the site was pretty well indexed. I had to pick up one new trick, too:
Not having any legacy i.d. numbers to work with during the import, I ended up having to figure out the legacy URLs on my own. I knew the articles were, at least, in the proper order, so at the beginning of the import data, the second article in the list had an i.d. of “2”, the tenth in the list had an i.d. of “10”, etc. Unfortunately, that one-to-one mapping broke down the first time an article was published then taken down, because my import data didn’t note missing articles. By the time I got to the 2,000th article, the relationship between import row and legacy i.d. was off by a pretty substantial amount.
I grabbed the RBing gem and automated the process of searching by article title and using the URL I got back to figure out the article’s old i.d. and URL. That didn’t work perfectly, because there were some gaps in Bing’s indexation of the site. So I had to write a second script that ran down the list of articles and looked at each i.d., applying the following algorithm:
If the article i.d. was one greater than the i.d. of the article before it, and one less than the i.d. of the article after it, I assumed it was o.k.
If the article i.d. didn’t match the above criteria, but the i.d. of the article after it was two greater than the i.d. of the article before it, I assumed the real page for that article had failed to be indexed, and I assigned it the i.d. that fell between those of the articles on either side of it.
If the article i.d. didn’t meet either of those criteria, I flagged it for review.
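The three rules above can be sketched in Ruby roughly as follows. This is a minimal illustration, not the original script: the method name `check_ids`, the status symbols, and the choice to flag the first and last rows (which have only one neighbour) are all assumptions. It reads each neighbour’s i.d. from the list as it came back from the search, not from already-corrected values.

```ruby
# ids: legacy i.d.s in import order, as recovered from the search results
# (nil where the search returned nothing usable).
# Returns one hash per row: the (possibly corrected) i.d. and a status.
def check_ids(ids)
  ids.each_with_index.map do |id, i|
    prev_id = i.zero? ? nil : ids[i - 1]
    next_id = ids[i + 1] # nil past the end of the list

    if prev_id.nil? || next_id.nil?
      # Assumption: rows missing a usable neighbour can't be checked
      # by the rules, so they are flagged for review.
      { id: id, status: :review }
    elsif id == prev_id + 1 && id == next_id - 1
      # Rule 1: fits exactly between its neighbours.
      { id: id, status: :ok }
    elsif next_id == prev_id + 2
      # Rule 2: the neighbours leave exactly one slot, so fill it.
      { id: prev_id + 1, status: :filled }
    else
      # Rule 3: anything else needs a human.
      { id: id, status: :review }
    end
  end
end

results = check_ids([7, 8, 200, 10, 11, 12])
results.each { |r| puts "#{r[:id].inspect}\t#{r[:status]}" }
```

Read literally, the rules also flag the rows on either side of a bad i.d. (8 and 10 in the example), because each of them has a neighbour that doesn’t fit. In this example, running the check a second time over the corrected i.d.s clears those.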
Most of the time, the ones that were flagged for review were part of a streak of articles that hadn’t been indexed properly to begin with, so the best result Bing could produce was an easily recognizable archive page URL. It was easy to consult the list and see sequences like this:
Clearly the third through sixth articles in the list had to be 455, 456, 457 and 458. I felt a little guilty for not taking the time to work out a way to do that programmatically, but there were only three or four sequences like that so I sucked it up. There were also a few sequences where there was no discerning the proper order, but that list totaled fewer than 15.
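For what it’s worth, one way the fill-in could be done programmatically is sketched below. It is not something that was run on this import. It assumes the unusable results (the archive page URLs) have already been turned into `nil`, and it only fills a run when the known i.d.s on either side leave exactly as many slots as there are missing rows. The method name `fill_runs` and the sample numbers are made up for illustration.

```ruby
# Fill runs of unknown (nil) i.d.s when the known i.d.s on either side
# leave exactly the right number of slots. Other runs are left as nil.
def fill_runs(ids)
  result = ids.dup
  i = 0
  while i < result.length
    if result[i].nil?
      run_start = i
      i += 1 while i < result.length && result[i].nil?
      run_end = i - 1 # index of the last nil in this run
      before = run_start.zero? ? nil : result[run_start - 1]
      after = result[i] # nil if the run reaches the end of the list
      run_len = run_end - run_start + 1

      # Only fill when the gap between the anchors matches the run length.
      if before && after && after - before == run_len + 1
        (run_start..run_end).each_with_index do |pos, k|
          result[pos] = before + 1 + k
        end
      end
    else
      i += 1
    end
  end
  result
end

p fill_runs([453, 454, nil, nil, nil, nil, 459])
```

Runs where the anchors don’t line up with the number of missing rows are left alone, so they would still need to be sorted out by hand.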
Once all the i.d.’s were straightened out, I wrote a script to generate the redirects, and plopped them into the site’s .htaccess.
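A redirect generator along those lines might look like the sketch below. The legacy path pattern (`/articles/ID.html`), the `Article` struct, and the sample paths are hypothetical stand-ins, since the real URL formats aren’t given here. It emits Apache `Redirect 301` lines (mod_alias), which match on the URL path only; legacy URLs that carried the i.d. in a query string would need mod_rewrite rules instead.

```ruby
# Hypothetical record pairing a recovered legacy i.d. with the new URL path.
Article = Struct.new(:legacy_id, :new_path)

# Build one permanent-redirect line per article for .htaccess.
# Assumption: legacy URLs look like /articles/<id>.html.
def redirect_lines(articles)
  articles.map do |a|
    "Redirect 301 /articles/#{a.legacy_id}.html #{a.new_path}"
  end
end

articles = [
  Article.new(2, "/news/second-story"),
  Article.new(10, "/news/tenth-story")
]

puts redirect_lines(articles)
```

The output can be pasted into .htaccess as is, or appended with something like `File.write(".htaccess", lines.join("\n") + "\n", mode: "a")`.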