{"id":546372,"date":"2018-11-01T08:06:47","date_gmt":"2018-11-01T15:06:47","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=546372"},"modified":"2018-11-01T08:06:47","modified_gmt":"2018-11-01T15:06:47","slug":"cracking-code-mixing-an-important-step-in-making-human-computer-interaction-more-engaging","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/cracking-code-mixing-an-important-step-in-making-human-computer-interaction-more-engaging\/","title":{"rendered":"Cracking code-mixing \u2014 an important step in making human-computer interaction more engaging"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-546384 aligncenter\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/ENMLP_Site_10_2018_1400x788-1024x642.png\" alt=\"ENMLP\" width=\"1024\" height=\"642\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/ENMLP_Site_10_2018_1400x788-1024x642.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/ENMLP_Site_10_2018_1400x788-300x188.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/ENMLP_Site_10_2018_1400x788-768x481.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/ENMLP_Site_10_2018_1400x788.png 1401w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>Communication is a large part of who we are as human beings, and today, technology has allowed us to communicate in new ways and to audiences much larger and wider than ever before. That technology has assumed single-language speech, which \u2014 quite often \u2014 does not reflect the way people naturally speak. India, like many other parts of the world, is multilingual on a societal level with most people speaking two or more languages. I speak Bengali, English, and Hindi, as do a lot of my friends and colleagues. 
When we talk, we move fluidly between these languages without much thought.<\/p>\n<p>This mixing of words and phrases is referred to as code-mixing or code-switching, and from it, we\u2019ve gained such combinations as <em>Hinglish<\/em> and <em>Spanglish<\/em>. <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.psychologytoday.com\/us\/blog\/life-bilingual\/201011\/bilingualisms-best-kept-secret\">More than half of the world\u2019s population speaks two or more languages<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, so with that many people potentially code-switching, creating technology that can process it is important not only for creating useful translation and speech recognition tools, but also for building engaging user interfaces. Microsoft is progressing on that front in exciting ways.<\/p>\n<p>In <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/melange\/\">Project M\u00e9lange<\/a>, we at Microsoft Research India have been building technologies for processing code-mixed speech and text. Through large-scale computational studies, we are also exploring some fascinating linguistic and behavioral questions around code-mixing, such as why and when people code-mix, that are helping us build technology people can relate to. 
At the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/emnlp2018.org\/\">2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, my colleagues and I have the opportunity to share some of our recent research with our paper \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/word_embeddings_cm.pdf\">Word Embeddings for Code-Mixed Language Processing.<\/a>\u201d<\/p>\n<h3>A data shortage in code-mixed language<\/h3>\n<p>Word embeddings \u2014 multidimensional vector representations in which words that are similar in meaning or used in similar contexts lie closer to each other \u2014 are learnt using deep learning from large language corpora and are valuable in solving a variety of natural language processing tasks using neural techniques. For processing code-mixed language \u2014 say, Hinglish \u2014 one would ideally need an embedding of words from both Hindi and English in the same space. There are standard methods for obtaining multilingual word embeddings; however, these techniques typically try to map translation equivalents from the two languages (e.g., <em>school<\/em> and <em>vidyalay<\/em>) close to each other. This helps in cross-lingual transfer of models. For instance, a sentiment analysis system trained for English can be appropriately transferred to work for Hindi using multilingual embeddings. But it\u2019s not ideal for code-mixed language processing. While <em>school<\/em> and <em>vidyalay<\/em> are translation equivalents, in Hinglish, <em>school<\/em> is far more commonly used than <em>vidyalay<\/em>; also, these words are used in slightly different contexts. Further, there are grammatical constraints on code-mixing that disallow certain types of direct word substitutions, most notably for verbs in Hinglish. 
For processing code-mixed language, the word embeddings should ideally be learnt from a corpus of code-mixed text.<\/p>\n<div id=\"attachment_546669\" style=\"width: 635px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-546669\" class=\"wp-image-546669 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/pie-chart.png\" alt=\"\" width=\"625\" height=\"304\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/pie-chart.png 625w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/pie-chart-300x146.png 300w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><p id=\"caption-attachment-546669\" class=\"wp-caption-text\">It is difficult to estimate the amount of code-mixing that happens in the world. One good proxy is the code-mixing patterns on social media. Approximately 3.5 percent of the tweets on Twitter are code-mixed. The above pie charts show the distribution of monolingual and code-mixed, or code-switched (cs), tweets in seven major European languages: Dutch (nl), English (en), French (fr), German (de), Portuguese (pt), Spanish (es), and Turkish (tr).<\/p><\/div>\n<div id=\"attachment_546660\" style=\"width: 1008px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-546660\" class=\"wp-image-546660 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/Map-for-EMNLP.png\" alt=\"\" width=\"998\" height=\"460\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/Map-for-EMNLP.png 998w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/Map-for-EMNLP-300x138.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/Map-for-EMNLP-768x354.png 768w\" sizes=\"auto, (max-width: 998px) 100vw, 998px\" \/><p 
id=\"caption-attachment-546660\" class=\"wp-caption-text\">The chart above shows the distributions of monolingual and code-mixed tweets for 12 major cities in Europe and the Americas that were found to have very large or very small fractions of code-mixed tweets, represented in the larger pies by the missing white wedge. The smaller pies show the top two code-mixed language pairs, the size being proportionate to their usage. The Microsoft Research India team found that code-mixing is more prevalent in cities where English is not the major language used to tweet.<\/p><\/div>\n<p>Even though code-mixing is extremely common in multilingual societies, it happens mostly in casual speech and rarely in text, so we\u2019re limited in the amount of text data available for code-mixed language. What little we do have is from informal, speech-like conversations, such as interactions on social media, where people write almost exactly how they speak. To address this challenge, we developed a technique to generate natural-looking code-mixed data from monolingual text data. Our method is based on a linguistic model known as the equivalence constraint theory of code-mixing, which imposes several syntactic constraints on code-mixing. In building the Spanglish corpus, for example, we used Bing Microsoft Translator to first translate an English sentence into Spanish. Then we aligned the words, identifying which English word corresponded to which Spanish word, and, in a process called parsing, identified the phrases in each sentence and how they\u2019re related. Then, using the equivalence constraint theory, we systematically generated all possible valid Spanglish versions of the input English sentence. A small number of the generated sentences were randomly sampled based on certain criteria that indicated how close they were to natural Spanglish data, and these sentences comprise our artificial Spanglish corpus. 
Since there is no dearth of monolingual English and Spanish sentences, using this fully automated technique, we can generate as large a Spanglish corpus as we want.<\/p>\n<h3>Solving NLP tasks with an artificially generated corpus<\/h3>\n<p>Through experiments on part-of-speech tagging and sentiment classification, we showed that word embeddings learnt from the artificially generated Spanglish corpus were more effective in solving these NLP tasks for code-mixed language than the standard cross-lingual embedding techniques.<\/p>\n<p>The linguistic theory\u2013based generation of code-mixed text has applications beyond word embeddings. For instance, in one of our <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/language_modeling_cm.pdf\">previous studies published earlier this year<\/a>, we showed that this technique helps in learning better language models, which in turn can help us build better speech recognition systems for code-mixed speech. We are exploring its application in machine translation to improve the accuracy of mixed-language requests. And imagine a multilingual chatbot that can code-mix depending on who you are, the context of the conversation, and what topic is being discussed, and switch in a natural and appropriate way. That would be true engagement.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Communication is a large part of who we are as human beings, and today, technology has allowed us to communicate in new ways and to audiences much larger and wider than ever before. That technology has assumed single-language speech, which \u2014 quite often \u2014 does not reflect the way people naturally speak. 
India, like many [&hellip;]<\/p>\n","protected":false},"author":37074,"featured_media":546384,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Monojit Choudhury","user_id":"32996"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[241770,194455],"tags":[],"research-area":[13556,13545,13555],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-546372","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence","category-machine-learning","msr-research-area-artificial-intelligence","msr-research-area-human-language-technologies","msr-research-area-search-information-retrieval","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[144733,144940],"related-projects":[],"related-events":[511391],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"862\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/ENMLP_Site_10_2018_1400x788.png\" class=\"img-object-cover\" alt=\"ENMLP\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/ENMLP_Site_10_2018_1400x788.png 1401w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/ENMLP_Site_10_2018_1400x788-300x188.png 300w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/ENMLP_Site_10_2018_1400x788-768x481.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/10\/ENMLP_Site_10_2018_1400x788-1024x642.png 1024w\" sizes=\"auto, (max-width: 862px) 100vw, 862px\" \/>","byline":"Monojit Choudhury","formattedDate":"November 1, 2018","formattedExcerpt":"Communication is a large part of who we are as human beings, and today, technology has allowed us to communicate in new ways and to audiences much larger and wider than ever before. That technology has assumed single-language speech, which \u2014 quite often \u2014 does&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/546372","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/37074"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=546372"}],"version-history":[{"count":25,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/546372\/revisions"}],"predecessor-version":[{"id":547035,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/546372\/revisions\/547035"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/546384"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=546372"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=546372"},{"taxonomy":"post_tag","embe
ddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=546372"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=546372"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=546372"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=546372"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=546372"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=546372"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=546372"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=546372"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=546372"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}