December 13th, 2012  |  Published in drupal, tech

I'm clearing out some old notes from my last job and figured this one was worth preserving. The table below shows how many times tags were used on a collection of posts generated and tagged almost exclusively by the user community of a site I used to work on.

For instance, row 1 shows that 12,602 tags were used only once; row 2 shows that 2,735 tags were used 2-5 times, etc. So out of a total of 16,169 tags, about 78 percent were used only once, and about 95 percent were used no more than five times.

Times Used     Number of Tags
1              12,602
2-5             2,735
6-10              416
11-20             200
21-50             118
51-100             54
101-500            27
501-1,000           7
1,000-2,000         5
2,000-5,000         5
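
The 78 and 95 percent figures are just running totals over that table. Here is a minimal Python sketch of the arithmetic; the bucket counts are copied from the table, while the function and variable names are made up for illustration:

```python
# Bucket counts copied from the table: (times used, number of tags).
buckets = [
    ("1", 12602),
    ("2-5", 2735),
    ("6-10", 416),
    ("11-20", 200),
    ("21-50", 118),
    ("51-100", 54),
    ("101-500", 27),
    ("501-1,000", 7),
    ("1,000-2,000", 5),
    ("2,000-5,000", 5),
]


def cumulative_shares(rows):
    """Return (label, count, cumulative percent of all tags) per row."""
    total = sum(count for _, count in rows)
    running = 0
    result = []
    for label, count in rows:
        running += count
        result.append((label, count, 100.0 * running / total))
    return result


total_tags = sum(count for _, count in buckets)
shares = cumulative_shares(buckets)

# Print each bucket with the share of tags used that often or less.
for label, count, cum_pct in shares:
    print(f"{label:>12}  {count:>6,}  {cum_pct:6.2f}%")
```

The first two rows of the output are the ones the percentages in the text come from.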

I wasn’t prepared to worry too much about the whole thing until the site (which ran on Drupal) started to crawl and the slow query log showed us that taxonomy-related queries were killing us. I even took to Ask Metafilter to see what everyone else had to say, and got an answer from the guy who coined the term “folksonomy.”

The real shame was that a lot of those 12,602 single-use tags were just variations on each other:

  • social networking

  • social_networking

  • socialnetworking

  • SocialNetworking

  • Social Networking

  • Social_Networking

  • Social_networking

  • Social networking

  • Socialnetworking
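
Variants like these differ only in case, spaces, and underscores, so a simple normalization key would lump them together. This is a hypothetical Python sketch, not something that ran on the site; `tag_key` and `group_variants` are invented names, and the rule (lowercase, drop whitespace and underscores) is an assumption that only covers this kind of variation:

```python
import re
from collections import defaultdict


def tag_key(tag):
    """Collapse case, whitespace, and underscores into one comparison key."""
    return re.sub(r"[\s_]+", "", tag).lower()


def group_variants(tags):
    """Map each normalized key to the raw tags that produce it."""
    groups = defaultdict(list)
    for tag in tags:
        groups[tag_key(tag)].append(tag)
    return dict(groups)


# The nine variants from the list.
variants = [
    "social networking", "social_networking", "socialnetworking",
    "SocialNetworking", "Social Networking", "Social_Networking",
    "Social_networking", "Social networking", "Socialnetworking",
]

groups = group_variants(variants)
for key, raw_tags in groups.items():
    print(key, len(raw_tags))
```

A key this aggressive can also merge tags that really are different (dropping spaces is the risky part), and it does nothing for misspellings or plurals, so anything it groups would still want a human look before merging.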

With no discipline at all and no overarching style guide for tagging, no real patterns emerged to make the tags useful. Search engine indexation sucked because we had 12,000+ tag index pages with only a single post each. Those thousands of tag pages netted well under 0.5 percent of site traffic, and crawl times were ridiculous. You really should not have almost as many tag indexes as you have actual posts.

It wasn’t deemed a wise use of time to try to automate normalization. In the end, I wrote a VBO (Views Bulk Operations) action that allowed us to delete the 12,602 tags that were used only once (provided they were at least a month old, so we didn’t arbitrarily blow up a trend before it blossomed). We also locked the users at large out of tagging altogether, leaving it to the curators on staff. Yes, it helped performance.
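
The selection rule itself is simple enough to show outside Drupal. This Python sketch is not the VBO code; the `Tag` fields, the 30-day reading of "a month," and the sample data are all assumptions made up for illustration:

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Tag(NamedTuple):
    """Hypothetical tag record; not Drupal's taxonomy schema."""
    name: str
    use_count: int
    created: datetime


def tags_to_delete(tags, now, min_age=timedelta(days=30)):
    """Select tags used exactly once that are at least `min_age` old."""
    cutoff = now - min_age
    return [t for t in tags if t.use_count == 1 and t.created <= cutoff]


# Invented sample data covering the three cases.
now = datetime(2012, 12, 13)
sample = [
    Tag("socialnetworking", 1, datetime(2011, 5, 1)),  # old, used once
    Tag("new trend", 1, datetime(2012, 12, 1)),        # used once, recent
    Tag("drupal", 240, datetime(2010, 1, 1)),          # widely used
]

doomed = tags_to_delete(sample, now)
print([t.name for t in doomed])
```

Only the old single-use tag is selected; the recent one survives long enough to find out whether it turns into a trend.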

Dark side of tag normalization: At the job I held before this one, they just gave an editor a spreadsheet with the thousands of non-normalized tags and invited her to correct them by hand. I do believe I would have gone mad.


  1. Sam says:

    December 16th, 2012 at 1:26 pm (#)

    I wrote a fuzzy string matcher recently, it’s easy. I should write my own social-bookmarking app and have it auto-suggest fuzzy matches. Maybe the world needs a new DMR. :P


© Michael Hall, licensed under a Creative Commons Attribution-ShareAlike 3.0 United States license.