Enhancing Multilingual Content in Wikipedia
By Douglas Gantenbein, Senior Writer, Microsoft News Center
Wikipedia has become one of the world’s largest and perhaps most powerful information repositories. But it is heavily English-centric.
Making Wikipedia more multilingual inspired a Microsoft Research India team to develop a tool called WikiBhasha, which was launched Oct. 18. WikiBhasha—“Wiki,” signifying its community-oriented approach; “Bhasha,” a Sanskrit word meaning “language”—features a content-creation platform that combines linguistic services, such as machine translation, with a Wikipedia-friendly content editor. Everyday users in countries around the world, as well as language enthusiasts, can use WikiBhasha to adapt English-language Wikipedia articles for local languages. Along the way, they can create new local content to expand the article they have translated.
WikiBhasha users also can create new articles from scratch. And in time, the tool could help convert articles in languages other than English into local languages.
The team behind WikiBhasha is led by A Kumaran, a multilingual-technologies and -systems researcher whose research interests include multilingual and cross-language information processing, machine translation and transliteration, and methods for creating data for computational linguistic research. He and his team started work on WikiBhasha four and a half years ago.
WikiBhasha is designed to solve several problems. Foremost, of course, is broadening the reach and language adoption of Wikipedia.
“While English, the most prevalent Wikipedia, has 3.4 million articles, even the second most-popular language, German, has only one-third as many articles,” Kumaran says. “And there is a huge tail of more than 200 languages that have fewer than 100,000 articles each. It would be great to help people expand that number—in lots of languages.”
WikiBhasha also offers a way to sharpen the abilities of current machine translators. That, in fact, was one of the driving forces behind WikiBhasha. It takes about 4 million sentence pairs, matched between two languages, to develop a machine translator robust enough to create effective translations. In many languages, collecting that much data is a nearly insurmountable task. But if a machine translator can at least start a translation using a smaller data set, a wiki-style community can build on that output and correct the machine translator, literally “teaching” it to be more effective.
Kumaran also says that WikiBhasha could be used to take a machine translator that is effective with one type of content—news articles, for instance—and, through community participation, train the translator to handle other content, such as medical articles or other documents that use more specialized language, more effectively.
WikiBhasha is a browser-based tool with an easy-to-use interface that can be invoked atop a Wikipedia page. In a three-step process, a user identifies an appropriate set of English-language articles to use as a source of information for a contribution to the Wikipedia in their own language. The user is then guided through composing and editing a translation and adding content as desired, and can finally submit the completed document to Wikipedia. New articles also can be created using WikiBhasha.
But WikiBhasha is much more than a translator.
“You can do whatever you want with the content,” Kumaran says. “Creating the translation is only the first step. On the other hand, you can choose to create an article from scratch and not use the translation service at all.”
Once an article has been translated and edited, it becomes part of the Wikipedia corpus and can be amended by other users. Any changes a user makes to the translations also are made available to other WikiBhasha users through a cloud-based service, the Collaborative Translation Framework, developed by the Microsoft Machine Translation Incubation Team.
“It’s very collaborative that way,” Kumaran says. “If a user takes a sentence and corrects a mistake that the translator has made, it gets recorded in a cloud-based repository. The next time someone uses the service, the machine-translated version—and all the changes people have made to it—also will be available.”
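The sharing behavior Kumaran describes can be pictured with a small sketch. This is not the Collaborative Translation Framework's actual API, which the article does not detail; the `CorrectionStore` class, its method names, and the `toy_translate` function are hypothetical stand-ins, and an in-memory dictionary plays the role of the cloud-based repository.

```python
class CorrectionStore:
    """Hypothetical in-memory stand-in for a shared correction repository."""

    def __init__(self):
        # Maps (source sentence, target language) to the list of
        # user-corrected translations, oldest first.
        self._corrections = {}

    def record(self, source, target_lang, corrected):
        """Save a user's corrected translation, skipping exact duplicates."""
        saved = self._corrections.setdefault((source, target_lang), [])
        if corrected not in saved:
            saved.append(corrected)

    def lookup(self, source, target_lang, machine_translate):
        """Return the machine translation followed by every saved correction."""
        machine = machine_translate(source, target_lang)
        return [machine] + list(self._corrections.get((source, target_lang), []))


def toy_translate(text, target_lang):
    """Placeholder translator (an assumption): it only tags the text."""
    return f"[{target_lang}] {text}"


store = CorrectionStore()
# One user fixes a sentence; the fix is recorded in the shared store.
store.record("Hello", "ta", "வணக்கம்")
# The next user sees the machine output plus the earlier correction.
suggestions = store.lookup("Hello", "ta", toy_translate)
```

In this sketch the machine output always comes first and community corrections follow, so a later user can compare them; how a real service would rank or merge the alternatives is left open.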
Kumaran says WikiBhasha has been developed to suit both everyday users and experienced Wikipedians. He notes that although Wikipedia has a small percentage of highly active contributors, a large portion of the content is created by a huge number of casual users who make only occasional contributions. WikiBhasha is designed to appeal to both groups. It enables sophisticated editing and content sourcing, but also is easy to use when a user wants to make small changes.
“Some people will want to create a great deal of content, while most others may just want to work on a sentence or two,” Kumaran says. “The key is to make the user experience simple and intuitive, to attract the casual users repeatedly, and, at the same time, not hamper the productivity of active contributors.”
WikiBhasha has gone through several iterations. It started as a research prototype with a text-based interface. The first externally published version appeared in 2008 as a hosted solution, with editing enabled only on a sentence-by-sentence basis. Though Wikipedia was an early object of the work, the tool could have been used to translate content from news sources such as The New York Times.
When this early version was shown to the Wikipedia organization in Germany in 2008, the reaction was underwhelming.
“The response was not good,” Kumaran says candidly. “The philosophy behind the tool was not right for them. They wanted more free-form content creation in a local language and didn’t just want a mirror image of an English-language article.”
The Wikipedians also preferred a solution that let them stay on Wikipedia itself rather than work on Wikipedia content in a different domain.
A second version, which stayed on-site on Wikipedia, took care of many of the hosting and technical issues flagged by the Wikipedians. But, in the process, it became too complex for casual users.
For the version now being released, Kumaran and his team worked to create an intuitive Wikipedia experience, focus users more on the final content creation and less on the original translated document, and ensure that any user’s work in WikiBhasha becomes part of a collaborative experience. That last element was key to winning over the Wikimedia Foundation, the parent organization to the global Wikipedias.
Kumaran’s team was able to do so, in part, by committing the project entirely to a Wikipedia-centric approach. The tool was redesigned to integrate tightly with Wikipedia and to be “part of” the Wikipedia experience during the time a user is translating, adding, and creating content. When WikiBhasha was shown again to Wikipedia representatives, “the response was very positive,” Kumaran says, “a near 180-degree turn from what we had encountered before.”
WikiBhasha will be made available to the Wikipedia community as a MediaWiki extension. Soon, it will be available as a user gadget on Wikipedia, as well as an installable bookmark at the WikiBhasha web site, which is hosted on the Windows Azure platform.
The current WikiBhasha release supports the 31 languages offered by Microsoft Translator and will be able to handle any new language added to Microsoft Translator. And although the first WikiBhasha release is based on the notion that English-language articles will be the focus, further iterations could enable the use of articles in other languages as source material—German, for instance, or Spanish or Japanese.
Over the next few months, the Wikimedia Foundation and Microsoft Research will conduct joint workshops and community-interaction sessions in four countries: Brazil, Egypt, India, and Mexico. The WikiBhasha team will work closely with users in those communities to study and encourage the adoption and use of WikiBhasha for enhancing content in the respective Wikipedias. The goal is to improve Wikipedia content and increase the availability of multilingual material, while gathering direct user feedback to refine the WikiBhasha tool.
Kumaran hopes the release of WikiBhasha will give him several ways to expand his study of language. For one, he is eager simply to study how people use WikiBhasha.
“I am really interested in seeing how to use crowd sourcing as a method for gathering linguistic data,” he says. “We’d also like to understand which features help or hinder adoption, what are the specific needs of individual demographics, and so on. If the adoption is not up to our expectation, then we would like to know why. This is a fantastic opportunity to do real-world research on what works and what doesn’t work with crowds.”
In addition, WikiBhasha represents a significant open-source contribution from Microsoft Research, as well as its initial engagement with the Wikimedia Foundation and Wikipedia communities.
WikiBhasha is a collaborative project in which multiple individuals from Microsoft Research India have participated. In addition to Kumaran, contributors include K Saravanan from the Multilingual Systems Group; Naren Datha, Anil Ande, and B. Ashok from the Advanced Development Group; and Ashwani Sharma, Sridhar Vedantham, and Vidya Natampally from the External Research team.
In addition, a significant contribution to WikiBhasha in terms of design, development, and liaison with the Wikimedia Foundation was made by members of the Machine Translation team from Microsoft Research Redmond, particularly Vikram Dendi and Sandor Maurice. WikiBhasha also relies critically on several pieces built and deployed by the machine-translation service and the Collaborative Translation Framework.
“They are not just service providers for WikiBhasha,” Kumaran says, “but a part of the WikiBhasha team.”
He says WikiBhasha might open the doors to a whole new world of content translation into languages that machine translators now ignore because it simply takes too much data—and, consequently, too much time—to create a useful translator.
“Take my mother tongue, Tamil, where no translators are available now,” Kumaran says. “Maybe we could use WikiBhasha to bootstrap a machine translator from the ground up. We could start with a rudimentary machine translator based on a small amount of parallel data, deploy WikiBhasha based on it to produce Tamil Wikipedia content and parallel data, which in turn may produce a bit better translator, and so on. This may be the only practical way through which translators in many languages of the world will get created—by community participation.”
For now, though, Kumaran is delighted to see his work come to fruition.
“We’re very excited about WikiBhasha,” he says. “WikiBhasha is really a very nice idea, and we are hoping it will prove to be useful and successful with the Wikipedians, too.”