When reading ancient Chinese poetry, we often marvel at the vivid language ancient writers used to describe people, events, objects, and scenes. This is a splendid cultural treasure that has been handed down to us. However, much like Shakespeare’s verse in English, the literary Chinese used by these poets is often difficult for modern-day readers to understand, and the meanings and subtleties embedded within it are frequently lost.
To solve this problem, researchers at Microsoft Research Asia adopted the latest neural machine translation techniques to train direct translation models between literary Chinese and modern Chinese. This in turn enables translation between literary Chinese and more than 90 other languages and dialects in Microsoft Translator. Literary Chinese translation has now been integrated into the Microsoft Translator app, Azure Cognitive Services Translator, and a number of Microsoft products that are supported by Microsoft Translator services.
Image: From the painting “West Mountain in Misty Rain” by Shen Zhou, Ming Dynasty. The ancient Chinese poem on the painting is by Yong Liu, Northern Song Dynasty. The poem depicts the spring scenery of southern China during the Qingming Festival and the prosperity of social life.
Enabling more people to appreciate the charm of traditional Chinese culture
Literary Chinese is an important carrier of traditional Chinese culture. Voluminous books and texts from ancient times record China’s rich and profound culture over the past five thousand years. The thoughts and wisdom accumulated in them are worthy of continued exploration and reflection.
With the help of machine translation, tourists can now understand ancient Chinese texts and poems written on historic buildings and monuments, students now have an extra tool to help them learn Chinese, and researchers who are engaged in collating and translating ancient texts can be more productive.
Dongdong Zhang, a principal researcher at Microsoft Research Asia, said, “From a technical perspective, literary Chinese can be regarded as a separate language. Once translation between literary Chinese and modern Chinese is realized, the translation between literary Chinese and other languages such as English, French, and German becomes a matter of course.”
The biggest difficulty in training a literary Chinese translation model: scarce training data
The most critical element of AI model training is data. Only when the data volume is large enough and its quality high enough can an accurate model be trained. In machine translation, training requires bilingual data: text in the source language paired with its counterpart in the target language. Literary Chinese is a special case because it is not a language used in daily life. As a result, compared with other language pairs, very little training data is available for literary Chinese translation, which makes it hard to train machine translation models.
Although Microsoft Research Asia researchers collected a large amount of publicly available literary and modern Chinese data in the early stages, the raw data could not be used directly. It first had to be cleaned: data from different sources and in various formats had to be normalized, including full-width and half-width punctuation, to minimize the interference of invalid data with model training. After cleaning, the amount of usable high-quality data was reduced even further.
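To make the cleaning step more concrete, here is a minimal Python sketch of punctuation normalization and de-duplication for sentence pairs. It is an illustration only, not the researchers' pipeline: the mapping table, the decision to normalize toward full-width punctuation, the removal of all whitespace, and the function names `normalize_line` and `clean_pairs` are all assumptions made for this example.

```python
import re

# Assumed mapping from half-width ASCII punctuation to full-width forms.
# The period is left out on purpose because it is ambiguous (e.g. decimals).
HALF_TO_FULL = str.maketrans({
    ",": "，",
    "?": "？",
    "!": "！",
    ":": "：",
    ";": "；",
    "(": "（",
    ")": "）",
})


def normalize_line(text: str) -> str:
    """Unify punctuation width and drop whitespace (an assumption that
    suits pure Chinese text, not text mixed with English)."""
    text = text.translate(HALF_TO_FULL)
    return re.sub(r"\s+", "", text)


def clean_pairs(pairs):
    """Normalize (literary, modern) pairs, dropping empty and duplicate ones."""
    seen = set()
    cleaned = []
    for literary, modern in pairs:
        pair = (normalize_line(literary), normalize_line(modern))
        if not pair[0] or not pair[1] or pair in seen:
            continue
        seen.add(pair)
        cleaned.append(pair)
    return cleaned
```

With this sketch, two copies of the same pair that differ only in punctuation width collapse into one entry, which shows how cleaning can shrink an already small data set.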
According to Shuming Ma, a researcher at Microsoft Research Asia, researchers carried out a great deal of data synthesis and augmentation work to mitigate the data sparsity problem, including:
First, align and expand on common characters to increase training data size. Unlike translation between Chinese and other languages such as English, French, or Russian, literary Chinese and modern Chinese use the same character set. Taking advantage of this feature, researchers at Microsoft Research Asia used innovative algorithms that let the system recall characters common to both sides, align them naturally, and then expand the alignment to words, phrases, and short sentences, thereby synthesizing a large amount of usable data.
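The idea of using shared characters as alignment anchors can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the algorithm the researchers used, which the article does not describe in detail. It assumes that both sentences split into the same number of clauses at punctuation marks, and the names `split_clauses`, `shared_chars`, and `expand_pair` are hypothetical.

```python
import re

# Assumed clause delimiters for this example.
CLAUSE_BREAK = re.compile(r"[，。；！？]")


def split_clauses(sentence):
    """Split a sentence into clauses at punctuation, dropping empty pieces."""
    return [c for c in CLAUSE_BREAK.split(sentence) if c]


def shared_chars(a, b):
    """Characters that occur in both strings; these act as alignment anchors."""
    return set(a) & set(b)


def expand_pair(literary, modern, min_shared=1):
    """Derive shorter clause-level pairs from one sentence-level pair.

    A clause pair is kept only if the two clauses share at least
    `min_shared` characters, used here as a crude alignment signal.
    """
    lit_clauses = split_clauses(literary)
    mod_clauses = split_clauses(modern)
    if len(lit_clauses) != len(mod_clauses):
        return []  # this sketch does not attempt uneven alignments
    return [
        (lit, mod)
        for lit, mod in zip(lit_clauses, mod_clauses)
        if len(shared_chars(lit, mod)) >= min_shared
    ]


pairs = expand_pair("学而时习之，不亦说乎？", "学了又时常温习，不是很愉快吗？")
```

Here the first clauses share 学, 时, and 习, and the second clauses share 不, so one sentence pair yields two additional short pairs. A production system would need a far more careful alignment than positional matching of clauses.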
Second, vary sentence structure to improve the robustness of machine translation. Researchers added a number of variants of the line breaks found in texts and poems so that the model learns ancient poems more comprehensively. When people see a sentence with an unusual structure, such as a poem segmented into lines by rhythm rather than by full sentences, they can still put the parts together and understand it. A translation model that has never seen such segmentation, however, is likely to be confused. Transforming the data format therefore not only expands the amount of training data but also makes the trained translation model more robust.
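One way to picture this kind of augmentation is to generate several layouts of the same poem and pair each of them with the same modern Chinese translation. The Python sketch below assumes three layouts (the original line, one clause per line, and one clause per line without punctuation). The actual variants used by the researchers are not listed in the article, and the function names are hypothetical.

```python
import re

PUNCT = re.compile(r"[，。；！？]")


def segmentation_variants(poem):
    """Return different layouts of the same text, without duplicates."""
    # Split after each punctuation mark while keeping the mark (Python 3.7+).
    clauses = [c for c in re.split(r"(?<=[，。；！？])", poem) if c]
    candidates = [
        poem,                                          # original single line
        "\n".join(clauses),                            # one clause per line
        "\n".join(PUNCT.sub("", c) for c in clauses),  # lines without punctuation
    ]
    variants = []
    for candidate in candidates:
        if candidate not in variants:
            variants.append(candidate)
    return variants


def augment_pair(literary, modern):
    """Pair every layout of the literary text with the same translation."""
    return [(variant, modern) for variant in segmentation_variants(literary)]
```

Each variant carries the same meaning, so the model sees that line breaks and missing punctuation should not change the translation.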
Third, train on both traditional and simplified characters to increase model adaptability. Traditional characters appear in both literary and modern Chinese. To improve the model's adaptability, researchers trained it not only on data in simplified Chinese but also on data in traditional Chinese and on data that mixes traditional and simplified characters. The model can therefore understand both traditional and simplified content, which leads to more accurate translations.
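A rough Python sketch of how simplified, traditional, and mixed-script training examples might be produced from one simplified source sentence is shown below. The four-entry conversion table is a toy assumption for illustration; a real pipeline would rely on a complete conversion resource. Applying the variation only to the source side, and the names `to_traditional`, `to_mixed`, and `script_variants`, are also assumptions.

```python
import random

# Toy simplified-to-traditional table, for illustration only.
S2T = {"学": "學", "时": "時", "习": "習", "说": "說"}


def to_traditional(text):
    """Convert every character that has an entry in the toy table."""
    return "".join(S2T.get(ch, ch) for ch in text)


def to_mixed(text, ratio=0.5, seed=0):
    """Convert each character with probability `ratio`, giving mixed script."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    return "".join(
        S2T.get(ch, ch) if rng.random() < ratio else ch for ch in text
    )


def script_variants(source, target):
    """Pair simplified, traditional, and mixed sources with one target."""
    sources = [source, to_traditional(source), to_mixed(source)]
    unique = list(dict.fromkeys(sources))  # drop duplicates, keep order
    return [(s, target) for s in unique]
```

Setting `ratio` to 0 leaves the text unchanged and setting it to 1 converts every character covered by the table, so the mixed variant sits between the two pure scripts.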
Fourth, add training for foreign-language words to improve translation accuracy. Modern Chinese text being translated into literary Chinese often contains words derived from foreign languages and new words that never appeared in ancient Chinese, such as “Microsoft”, “computer”, “high-speed rail”, and many others. To deal with this issue, researchers trained a small model to recognize such entities. The system first translates the text around the entity and then fills the entity back in, which ensures that these words are handled accurately.
Image: The literary Chinese translation process
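The entity handling described for foreign-language and newly coined words can be sketched as a mask, translate, and restore sequence. In the Python sketch below, a fixed word list stands in for the small entity recognition model, and the `translate` argument is a placeholder for the real translation system. The placeholder format `<ENT0>` and all function names are assumptions; the article does not specify them.

```python
# A fixed list stands in for the small entity recognition model.
KNOWN_ENTITIES = ["Microsoft", "微软", "高铁"]


def mask_entities(text, entities=KNOWN_ENTITIES):
    """Replace each recognized entity with a placeholder token."""
    slots = {}
    for entity in entities:
        if entity in text:
            token = f"<ENT{len(slots)}>"
            text = text.replace(entity, token)
            slots[token] = entity
    return text, slots


def unmask_entities(text, slots):
    """Put the original entities back in place of the placeholders."""
    for token, entity in slots.items():
        text = text.replace(token, entity)
    return text


def translate_with_entities(text, translate):
    """Translate the text around the entities, then restore the entities.

    `translate` is any callable mapping a string to a string. It is
    assumed to copy placeholder tokens through to its output unchanged.
    """
    masked, slots = mask_entities(text)
    return unmask_entities(translate(masked), slots)


# Usage with a trivial stand-in "translator" that rewrites a single word.
result = translate_with_entities("我乘高铁去微软", lambda s: s.replace("我", "吾"))
```

The sketch ignores overlapping entities and assumes the translation step preserves the placeholders, which a real model would have to be trained to do.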
In addition, the machine translation model has been trained specifically on informal writing, such as the text found on blogs, forums, and Weibo, to further improve the robustness of translation between modern and literary Chinese.
Dongdong Zhang said, “Based on the current translation system, we will continue to enrich the data set and improve the model training method to make it more robust and versatile. In the future, the method may not only be used for literary Chinese translation, but can also be extended to other application scenarios.”