Today’s author, Chris Rae, a Program Manager on the Excel team, talks about using Perl code to read Excel’s new file format.
The OOXML Format
As most readers will know, we spent a lot of time during the development of Office 2007 in creating and documenting a new XML-based file format (Office Open XML) to replace the much more complex binary formats the various Office applications have used for the last twenty years. A lot of time. Many of us dream in XML now. OOXML encompasses an XML file format for each of the largest Office applications, these formats being WordprocessingML, PresentationML and SpreadsheetML. It became an ECMA standard (ECMA-376) in 2006 and was approved with modifications as an ISO/IEC standard (IS 29500) in early 2008, though it has yet to be released by ISO/IEC. I think it would be fair to say that this Office feature has had more press coverage, both good and bad, than all other Office 2007 features combined. Political positions aside, one thing that can be said for definite is that the move made Office files easier to understand, and has created new ways for our customers to work with their data. In this post I’m going to walk through the solution to an imaginary customer’s interoperability problem, but first I’d like to cover a couple of OOXML basics.
The standardisation process means that the file format itself now belongs to ECMA and ISO/IEC, rather than to Microsoft. All of the documentation, therefore, lives on ECMA International’s web site (and will live on the ISO/IEC site once they have completed their editing of IS 29500). These documents are a huge help to anyone interested in cracking open their spreadsheet files – the documents are really quite readable and, hey, they’re free. The main ECMA-376 page is at http://www.ecma-international.org/publications/standards/Ecma-376.htm – the documents are available for download here as DOCXs or PDFs. There’s a lot in these file formats, so don’t click “print” without first observing the page count. Aside from the format documentation itself there’s also a good XML in Office Developers’ Portal on MSDN.
Several parts of the OOXML standard are going to be particularly useful for looking at spreadsheet files. These are:
Parts 1 and 2 (how the OOXML format generally fits together)
Part 3 (how the various parts of SpreadsheetML interrelate)
Part 4 (detailed information about specific elements and attributes)
I’ll make reference occasionally during this article to these parts.
OOXML Files and Excel
Any Excel file with the extension “xlsx”, “xlsm” or “xlam” is an OOXML file. It’s the default file format for Office 2007 and can be saved by Office 2003 or Office XP if you download and install the Compatibility Pack (http://office.microsoft.com/en-us/products/HA101686761033.aspx). An OOXML file (or “package”) is simply a zip archive – if you don’t believe me, rename one to .zip and unpack it. Inside the zip file are some .xml files (inside which the worksheets, charts et cetera are defined), some binary files to hold things that don’t lend themselves to being stored in XML (such as images) and some .rels files which define the relationships between them all. The packaging model is covered in much more detail in Part 2 of ECMA-376, section 8.
I used to work for a derivatives trading desk, writing spreadsheets. As anyone in investment banking knows, Excel is like the Outlook of the trading floor – traders use it to price absolutely everything; keep the departmental vacation calendar; plan their weddings and name their babies. At the same time, the more prudent parts of the organisation have enormous back-end systems to manage the organisation’s risk and monitor their positions. These systems are almost always on a wide range of platforms – Windows Server, Solaris, Linux – and have been in place for very long periods of time. One thing you can say with a lot of confidence is that these machines don’t have Excel installed.
The need for agility in trading means that almost all trades start life in Excel before migrating to whatever risk systems the bank has in place. There is a constant need to take numbers from spreadsheets and port them into back-end risk management systems, and this is done in various different ways today. One popular method is to have a “click here before you go home” macro on your spreadsheet, which connects to the database and uploads trades. This works fine until you end up with different versions of the macro smattered around the place in random trading sheets, or the trader goes on vacation and forgets to tell his stand-in about it, or anyone from audit finds out about it. Another option is to write some code on a centrally-managed machine which will load up all of the trading sheets, recalculate them all and write all the data. This works fine until a VBA bug appears in the middle of the night, or the market data permissions aren’t right, or the IT guy who was supposed to run this is out sick.
So let’s say we have a derivatives trader, Sally, who is experimenting with some new trades. In this case, Sally is experimenting merely with holding large quantities of her favourite stocks and keeping her fingers crossed, so she’s probably due to be fired soon. The Office 2003 spreadsheet on which these positions live is an enormous beast – it’s full of buggy VBA macros and littered with sheets and sheets of old data and notes from previous trades. Sally knows that somehow she has to get this into the back-end risk systems but she’s terrified of messing with her pride and joy and doesn’t want the responsibility of having to upload them every day. Her list of positions looks like this: