Chemistry Add-in for Word Helps Bridge the Gap between Science and Technology


Could a semantic, chemical authoring tool be developed for Microsoft Word? The paper and PDF formats that are the standard vehicles for scholarly communication are great at presenting natural language for people to read, but they are not as good at carrying the machine-interpretable semantic data that is becoming increasingly important for making sense of today’s “data deluge.” Tony Hey, Savas Parastatidis, and Lee Dirks from Microsoft Research first discussed this possibility with Dr. Peter Murray-Rust of the Unilever Centre at Cambridge University back in 2007. Peter is considered the “father” of Chemical Markup Language (CML), a semantic XML representation of chemical entities, and he explained that a large percentage of chemists use Microsoft Word to write their research papers. He hoped that by incorporating CML into Word, he could expedite his idea of developing a semantic, chemical authoring tool. We wondered if Peter would partner with us in this endeavor. Suffice it to say that when he agreed to sign on, a project was soon underway.

I first met Peter and Joe Townsend in early 2008, just after I joined the Microsoft External Research team. Peter, Joe, and Jim Downing were all visiting Redmond to discuss making this idea a reality through a joint development project between our team in Redmond and Peter and his team in Cambridge. I was asked to serve as the program manager for this adventure.

We all had a common vision for what we wanted to achieve, but we faced many obstacles. Multiple time zones, varying degrees of programming language familiarity, different project management styles, and a total lack of chemistry knowledge on my part made the first six months a little slow going.

Still, we made progress: Embedding the CML files for each molecule referenced in the document was fairly straightforward. One of the nice features of the new DOCX format is that each file is basically a ZIP file—a container into which we could park each bit of chemistry as its own XML file. And we could also anchor these to bits of text in the document itself (in other words, the document.xml file).
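
To make the container idea concrete, here is a minimal Python sketch that copies a package and adds one extra XML file to it. It is an illustration only, not the Chem4Word implementation: the part name `customXml/chem1.xml`, the helper names, and the stand-in package are all assumptions, and a real add-in would also register the part through the package's relationship and content-type files, which this sketch skips.

```python
import io
import zipfile

# A tiny CML description of water, used as the payload to embed.
CML_WATER = """<?xml version="1.0" encoding="UTF-8"?>
<cml xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="O"/>
      <atom id="a2" elementType="H"/>
      <atom id="a3" elementType="H"/>
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1"/>
      <bond atomRefs2="a1 a3" order="1"/>
    </bondArray>
  </molecule>
</cml>
"""


def make_stand_in_package():
    """Build a minimal ZIP that stands in for a DOCX (assumption: not a valid Word file)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("word/document.xml", "<document/>")
    return buf.getvalue()


def add_cml_part(package_bytes, part_name, cml_text):
    """Return a copy of the package with one extra XML part added."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        # Copy every existing part unchanged, then append the chemistry.
        for item in src.infolist():
            dst.writestr(item, src.read(item.filename))
        dst.writestr(part_name, cml_text)
    return out.getvalue()


def list_parts(package_bytes):
    """List the file names stored inside the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as z:
        return z.namelist()
```

Calling `add_cml_part(make_stand_in_package(), "customXml/chem1.xml", CML_WATER)` yields a package whose part list contains both the original `word/document.xml` and the new chemistry file.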

So far, so good—as long as the thing in the document was text. Or an image. In fact, we got by for a while by having a handy PNG graphic for each of the molecules that we had in CML, so that when we imported a CML file we could also slip the pre-fabricated graphic into the document. It couldn’t be edited, but it made the point to the casual observer that a human could see a two-dimensional (2-D) representation of the molecule, or the label in the document. More importantly, it demonstrated that a machine can understand the underlying semantics of the chemistry by reading the CML representation.
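
As a rough illustration of what "a machine can understand the chemistry" means, the Python sketch below reads a small CML fragment and derives a molecular formula from it. The ethanol fragment and the function names are assumptions made for this example, and the formula ordering (carbon, then hydrogen, then the rest alphabetically) is the common Hill convention rather than anything the project is known to use.

```python
import xml.etree.ElementTree as ET
from collections import Counter

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hypothetical CML for ethanol; hydrogens are given as counts on each heavy atom.
ETHANOL_CML = """<cml xmlns="http://www.xml-cml.org/schema">
  <molecule id="ethanol">
    <atomArray>
      <atom id="a1" elementType="C" hydrogenCount="3"/>
      <atom id="a2" elementType="C" hydrogenCount="2"/>
      <atom id="a3" elementType="O" hydrogenCount="1"/>
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1"/>
      <bond atomRefs2="a2 a3" order="1"/>
    </bondArray>
  </molecule>
</cml>"""


def element_counts(cml_text):
    """Count atoms per element, including hydrogens listed via hydrogenCount."""
    root = ET.fromstring(cml_text)
    counts = Counter()
    for atom in root.iterfind(".//cml:atom", CML_NS):
        counts[atom.get("elementType")] += 1
        implicit_h = int(atom.get("hydrogenCount", "0"))
        if implicit_h:
            counts["H"] += implicit_h
    return counts


def hill_formula(counts):
    """Format counts as a formula: C first, then H, then other elements A-Z."""
    symbols = sorted(s for s, n in counts.items() if n > 0)
    if "C" in symbols:
        rest = [s for s in symbols if s not in ("C", "H")]
        symbols = ["C"] + (["H"] if counts.get("H", 0) > 0 else []) + rest
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in symbols)


print(hill_formula(element_counts(ETHANOL_CML)))  # prints C2H6O
```

A PNG of the same molecule carries none of this: the picture shows a structure to a person, while the CML lets software count atoms, follow bonds, and compute derived values.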

But what about all the fancy subscripts and superscripts, including the pre-sub, pre-super, sub-super, and super-super scripts required for charges, electrons, isotopes, hydrogen dots, labels, and so forth? For this, we looked to Murray Sargent, the guru of all things mathematical and the driving force behind the great equation editing features in recent releases of Word. After reviewing our options, we decided to build upon Word’s math zone features. This would allow us to take advantage of the work already done to support the complex and flexible layout required for mathematical equations.
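
To show the kind of layout involved, here are a few chemical labels written in LaTeX math notation. This is only a notational illustration chosen for this post; it is not the format that Word's math zones or Chem4Word use internally.

```latex
\documentclass{article}
\begin{document}
% Isotope: mass number as a pre-superscript, atomic number as a pre-subscript
${}^{14}_{6}\mathrm{C}$

% Ion: atom count as a subscript, charge as a superscript on the same base
$\mathrm{SO}_{4}^{2-}$

% Radical: an unpaired electron drawn as a raised dot
$\mathrm{CH}_{3}^{\bullet}$
\end{document}
```

Each label stacks scripts before or after a base symbol, which is the same problem an equation editor already solves for mathematics.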

Meanwhile, our team was spending a good deal of time reviewing options for our 2-D chemical editor. We ended up launching a separate Windows Presentation Foundation (WPF) pane from within Word, which reads in the CML, renders it, and allows the user to perform various editing functions, all while preserving a certain amount of “chemical intelligence.”

This was not just characters and lines on a drawing board. When you select a particular atom, the editing options that you get depend on what is chemically viable in that particular structure. And when you save an edited structure, the Chemistry Add-in for Word (Chem4Word) writes the modified CML back into the DOCX package, creates a PNG (for viewing in the document), updates the chemical formula, and prompts the user to update any of the other labels from the CML file.
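
One simple way to picture this kind of "chemical intelligence" is a valence check that decides which new bonds an editor should offer for a selected atom. The Python sketch below is a toy model under stated assumptions: the valence table and function names are hypothetical, it ignores charges, radicals, and hypervalent atoms, and it does not describe the rules Chem4Word actually applies.

```python
# Typical neutral valences for a few common elements (a deliberate simplification).
TYPICAL_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


def free_valence(element, bond_orders):
    """Valence still unused after the atom's existing bonds are counted."""
    return TYPICAL_VALENCE[element] - sum(bond_orders)


def allowed_new_bond_orders(element, bond_orders):
    """Bond orders (single, double, triple) that would not exceed the valence."""
    free = free_valence(element, bond_orders)
    return [order for order in (1, 2, 3) if order <= free]


# A carbon with two single bonds could take a new single or double bond.
print(allowed_new_bond_orders("C", [1, 1]))  # prints [1, 2]
# A carbonyl oxygen (one double bond) has no room for another bond.
print(allowed_new_bond_orders("O", [2]))     # prints []
```

An editor built on a check like this would hide or disable the options that return nothing, so the user is only offered edits that keep the structure plausible.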

Once this initial work was established, we brought the chemical intelligence developers and the WPF developers closer together—in the U.K.—so they could meet in person more frequently. This helped move the project along at a good pace, culminating in our beta release at the American Chemical Society Annual Meeting in March 2010.

Screen capture of Chemistry Add-in for Word

Since the beta release, most of the work has been on Joe’s shoulders. He has done significant clean-up, fixing bugs and taking in a lot of usability feedback (especially from his students). Most importantly, he has added the ability to look up molecular structures through existing web services: the NCBI’s PubChem and the Unilever Centre’s OPSIN. These lookups are available from the Chem4Word version 1.0 ribbon, for example via load from PubChem.
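
For readers curious what a structure lookup against a web service looks like, here is a short Python sketch that asks PubChem for a compound's molecular formula by name. It uses PubChem's present-day PUG REST interface, which is an assumption on my part: this post does not say which PubChem endpoint Chem4Word calls, and the function names and the response layout read in `fetch_formula` are likewise assumptions for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def formula_url(name):
    """Build a PUG REST URL that requests the molecular formula for a compound name."""
    encoded = quote(name, safe="")  # escape spaces and slashes in the name
    return f"{PUG_REST}/compound/name/{encoded}/property/MolecularFormula/JSON"


def fetch_formula(name, timeout=10):
    """Query PubChem over the network and return the first formula found.

    Assumes the JSON reply has the shape PropertyTable -> Properties -> [record].
    """
    with urlopen(formula_url(name), timeout=timeout) as resp:
        data = json.load(resp)
    return data["PropertyTable"]["Properties"][0]["MolecularFormula"]


if __name__ == "__main__":
    # Requires network access; a name with no match raises an HTTP error.
    print(fetch_formula("aspirin"))
```

The same pattern, a name in and a structure or property out, is what lets an authoring tool pull a known molecule into a document instead of asking the author to redraw it.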

I am extremely proud of this project and I am thrilled to finally see version 1.0 released to the world. We have so much more to do, however. A colleague in the U.K. recently helped explain the potential directions:

“The future of research will be powered not only by ever more rapid dissemination of ever larger quantities of data, but also by software tools that ‘understand’ something about science. These tools will behave intelligently with respect to the information they process, and will free their human users to spend more time doing the things that humans do best: generating ideas, designing experiments, and making discoveries,” said Timo Hannay, Managing Director for Digital Science at Macmillan Publishers Ltd. “Chem4Word is one of the best examples so far of this important new development at the interface between science and technology.”

The Chem4Word project was one of our team’s first open source releases. Just after the beta release last March, we launched the source code project on CodePlex under an Apache 2.0 license. And today, we are announcing that the project has joined the Research Accelerators gallery as a part of the Outercurve Foundation.

Here’s to a long and happy future for the Chem4Word project—we hope it will offer the community a method for better facilitating and enabling semantic chemistry.

—Alex Wade, Director of Scholarly Communication, Microsoft Research

For more information, check out the Chemistry Add-in for Word press release.