Let’s Hear It for Lossy

August 20th, 2010  |  Published in mac and iphone, ruby

So, I’ve got this Excel spreadsheet that represents a dump of over 2,000 articles from an old CMS. Using the same general workflow I used before, I’m importing the spreadsheet into a local MySQL database using the Ruby spreadsheet gem. The importer I wrote just reads the sheet row by row, creates a new ActiveRecord object, then saves it.
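
For reference, the importer is roughly this shape; the Article model, column order, and connection details below are placeholders rather than the actual schema:

    # Rough sketch of the importer. The model, columns, and connection
    # settings are placeholders, not the real schema.
    require 'spreadsheet'
    require 'active_record'

    ActiveRecord::Base.establish_connection(
      :adapter  => 'mysql2',
      :database => 'cms_dump',
      :username => 'root',
      :password => ''
    )

    class Article < ActiveRecord::Base; end

    book  = Spreadsheet.open('articles.xls')
    sheet = book.worksheet(0)

    # Read row by row, skipping the header, and save each article.
    sheet.each(1) do |row|
      Article.create!(
        :title     => row[0],
        :author    => row[1],
        :published => row[2],
        :body      => row[3]
      )
    end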

Two thousand lines of spreadsheet — two thousand lines, each carrying news-length content — is enough to make Excel (on the Mac) fall to its knees whining. Worse, after a few changes and saves, the spreadsheet library couldn’t even read the content. I tried saving it back and forth between .xls and .xlsx a few times and the problem persisted. I thought about pasting just the data into a blank spreadsheet to see if that would shake whatever rot had settled into the spreadsheet’s soul, but getting Excel to copy and paste from a spreadsheet that big just made it beachball and crash.

So it was in a state of agitation that I decided to try one last thing: I opened the spreadsheet in Numbers, re-saved it as an Excel file, and tried reading that into my importer script. It not only worked, but the execution time of the script itself went from somewhere around 10 seconds to about 1.5. It was so much faster that I rechecked my script, thinking I’d commented out the line that actually saved the record.

I have no idea what goes on inside an Excel file, but if I had to guess, “a lot of crap” would probably be a big part of it. Some of that crap is, no doubt, useful to someone somewhere who’d just die if they couldn’t have this or that feature that’s enabled by this or that bit of crap. I’d also bet that there are people who are bitterly disappointed that Excel spreadsheets they’ve opened or saved in Numbers have quietly stripped out some essential bit of crap they need to do their work. As near as I can tell, Numbers stripped out a bit of crap that kept me from doing mine. It’s almost like discovering that high-compression JPEGs make bad pictures look better.

Responses

  1. David Ernst says:

    August 21st, 2010 at 7:05 am

    I always use csv for such things because I worry about crap in excel. Would that work here?

  2. mph says:

    August 21st, 2010 at 9:14 am

    Yep … for more data-y data, I usually use csv (or ^sv on one project). In this case, the last column is a multi-paragraph article full of punctuation and potentially nutty character set stuff. A straight csv import would involve escaping the commas and hoping there weren’t weird linebreak things. But sure … if I took more time to set it all up up front, it’d probably work (there’s a rough sketch of what that might look like below). I just hate the whole iterative process of catching one glitchy thing, accounting for it, trying again, catching another glitchy thing, etc. etc.

    The nice thing about Excel, I guess, is that the spreadsheet-parsing library I use does a pretty good job of just grabbing a cell and sloshing its contents into the ActiveRecord object without me needing to worry about any of that. The apparent tradeoff is that I’ll have to do one round trip to Numbers and back to scrub off the barnacles for those cases where Excel is feeling horsey.
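
    For the curious, a CSV version would look something like this, assuming the export were properly quoted CSV and reusing the Article model and connection from the importer sketch above (the header names here are made up). Ruby’s CSV library handles quoted fields itself, so embedded commas and linebreaks in the body should come through without any hand-rolled escaping:

        require 'csv'

        # Assumes the same Article model and database connection as the
        # importer script above; the header names are hypothetical.
        CSV.foreach('articles.csv', :headers => true) do |row|
          # Quoted fields keep embedded commas and newlines intact.
          Article.create!(
            :title  => row['title'],
            :author => row['author'],
            :body   => row['body']
          )
        end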

© Michael Hall, licensed under a Creative Commons Attribution-ShareAlike 3.0 United States license.