ICML 2020 highlights: A Transformer-based RL agent, causal ML for increased privacy, and more


With over 50 papers from Microsoft accepted at this year’s International Conference on Machine Learning (ICML 2020), a number of which were presented in virtual workshops, Microsoft researchers are in full summer swing when it comes to advancing machine learning in accessibility, privacy, healthcare, and other areas. As Microsoft Partner Research Manager and ICML President John Langford puts it, “ICML is a very broad conference, so its specialty is in some sense ‘all of the above.’” But Langford goes on to add that one of the topics that ICML has a long track record on is currently trending: reinforcement learning. A brief glance through the sessions and workshops presented by Microsoft researchers shows the wide influence reinforcement learning has in our world today, from natural language to robotics to infrastructure considerations like transportation.

Beyond the research contributions, Microsoft was also a sponsor of and recruiter at the conference. Additionally, the company sponsored two events co-located with the conference, the first Women in Machine Learning Un-Workshop and the fourth Queer in AI Workshop. The impact of the conference—now and in the future—is multifaceted, according to Langford. “ICML is ‘the’ summer machine learning conference. As such, it’s critically important to the academic discovery, review, and dissemination process, a great way to meet fellow researchers, and a natural recruiting point for the field,” he says.

Below are five selections of research presented by Microsoft. These projects highlight how broadly researchers are thinking about ML and its implications for society. But this diverse group of papers represents only a small slice of the advancements presented by Microsoft researchers. Explore the Microsoft at ICML 2020 accepted papers list to learn about further research contributions. 

See sections on: ICML 2020 Overview | How AI models reason | Utility and privacy with causal machine learning | Using Transformers to create RL agents | Pretraining for bidirectional language models | Identifying layer normalization location

Understanding how AI models reason about what they see 

Bottom line: “We propose a principled approach to isolate, analyze, and interpret how visual question-answering models reason about what they see.” 
—Machine Learning Scientist Saeed Amizadeh

Quick glance: In “Neuro-Symbolic Visual Reasoning: Disentangling ‘Visual’ from ‘Reasoning,’” researchers from the Microsoft Applied Sciences Lab and MSR AI collaborated to combine visual understanding and neuro-symbolic reasoning with natural language processing and program synthesis. “We develop a novel way to perform differentiable logical inference over visual scenes, which allows us to disentangle the processes of reasoning and perception in visual question answering (VQA) models,” explains Amizadeh. The work also led to creating a methodology for evaluating state-of-the-art VQA models, and the researchers propose expanding beyond pure probabilistic logical reasoning to incorporate other contextual signals and improve visual perception of the models.

Areas of impact: This research lies at the intersection of natural language and visual perception, which makes it a good candidate for systems using AI for accessibility. Key is the work’s focus on interpretability. The user should understand throughout the whole process how the neural reasoning is connecting what it “sees” to language, building trust and reliability in AI. 

Fun Fact: This project began in the Microsoft AI Residency Program. 

The research team: Saeed Amizadeh, Hamid Palangi, Alex Polozov, Yichen Huang, and Kazuhito Koishida

Additional Resources:  
Applied Sciences homepage 
MSR AI homepage  
Microsoft AI Residency Program

Improving utility and privacy with causal machine learning

Bottom line: “What if you can build machine learning models that are both accurate and preserve privacy of individuals? Try causal predictive models: We show that they are more robust to privacy attacks like membership inference and have higher accuracy on new domains than typical ML models.” 
—Microsoft Senior Researchers Amit Sharma and Shruti Tople

Quick glance: Privacy is paramount for institutions like hospitals and governments, which handle sensitive datasets and use ML models. Standard ML privacy approaches add noise to a model or data to protect information, but this can have the undesired effect of reducing accuracy or utility of the model. This work shows that causal learning, by which ML models are trained based on domain knowledge about causal relationships between features and outcomes, can increase both privacy and utility when compared to associational ML models with the same amount of noise. Researchers from Microsoft Research India provided knowledge of causal ML for this project, while researchers from Microsoft Research Cambridge brought expertise on privacy and security. Their paper is called “Alleviating Privacy Attacks via Causal Learning.”
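
To make the membership inference threat concrete, here is a minimal sketch of one common form of the attack: guessing that a record was in the training set when the model's loss on it is unusually low. This is not the paper's method or evaluation code. The function name, the threshold, and all loss values are hypothetical and made up for illustration. The sketch also assumes that a model with a small gap between training and test loss gives the attacker less to work with.

```python
def membership_attack_accuracy(member_losses, nonmember_losses, threshold):
    """Guess 'member' when the loss is below the threshold.

    Returns the fraction of records the attacker labels correctly.
    """
    correct = sum(1 for loss in member_losses if loss < threshold)
    correct += sum(1 for loss in nonmember_losses if loss >= threshold)
    return correct / (len(member_losses) + len(nonmember_losses))


# Made-up per-record losses, NOT results from the paper.
# A model that memorizes its training data: very low loss on members.
overfit_train = [0.05, 0.10, 0.08, 0.12]
overfit_test = [0.90, 1.10, 0.70, 0.95]

# A model whose loss looks similar on seen and unseen records.
robust_train = [0.30, 0.45, 0.35, 0.50]
robust_test = [0.32, 0.48, 0.33, 0.52]

acc_overfit = membership_attack_accuracy(overfit_train, overfit_test, threshold=0.4)
acc_robust = membership_attack_accuracy(robust_train, robust_test, threshold=0.4)

print(acc_overfit)  # 1.0: every guess is correct
print(acc_robust)   # 0.5: no better than a coin flip
```

In this toy setup the attacker succeeds only when training and test losses are easy to tell apart, which is the intuition behind pairing better generalization with better privacy.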

Areas of impact: This work aims to use causal ML to improve privacy protections for institutions that handle sensitive data. This direction also makes it easier to share models across institutions and lets individuals voluntarily share their own data without risk of information being leaked by an ML model.  

New tools: The researchers have released an open-source toolkit, RobustDG, for evaluating causal ML models on privacy, robustness, and out-of-distribution accuracy.

The research team: Amit Sharma, Shruti Tople, and Aditya Nori

Additional Resources: 
GitHub repository including open-source toolkit
Microsoft Research India homepage 

Microsoft Research Cambridge homepage  

Using Transformers to create RL agents suited for real-world tasks

Bottom line: “Transformers for RL!” 
—Senior Research Software Engineer Ricky Loynd

Quick glance: “Working Memory Graphs” presents a new reinforcement learning agent “that accelerates learning on challenging tasks by leveraging the power of Transformers in three ways,” explains Loynd. These three approaches apply Transformer attention to past observations, recurrent state vectors, and factored observations, respectively. “By leveraging the power of Transformers in these ways, our Working Memory Graph (WMG) agent accelerates learning on several challenging tasks: BabyAI, Pathfinding, and Sokoban. In BabyAI, WMG achieves drastic improvements in sample efficiency when observations are factored into more succinct representations,” says Loynd. The team includes members from the reinforcement learning and deep learning groups within MSR AI.
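
The building block behind all three uses is attention over a set of vectors. The sketch below shows single-head scaled dot-product attention over a small "working memory" made of observation factors and saved state vectors. It is a simplification under stated assumptions, not the WMG implementation: the vectors are hypothetical toy values, keys and values are the raw vectors with no learned projections, and there is a single head.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over a set of vectors."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return output, weights


# Hypothetical working memory: factors of the current observation plus
# state vectors carried over from earlier time steps.
factors = [[1.0, 0.0], [0.0, 1.0]]
state_vectors = [[1.0, 1.0]]
memory = factors + state_vectors

query = [2.0, 0.0]
# Simplification: the memory vectors serve as both keys and values.
output, weights = attend(query, memory, memory)
```

Because attention treats its inputs as a set, the same function handles any number of factors or saved vectors, which is one reason it suits observations that are already broken into separate pieces.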

Areas of impact: This work shows that WMG is effective in handling the structured, factored observations used in today’s real-world applications of RL and accelerates RL so that AI agents will eventually be able to accomplish previously unattainable real-world tasks.

Performance and novel features: WMG outperforms a GRU (Gated Recurrent Unit) baseline agent at complex reasoning over past observations, and WMG has a new form of “shortcut recurrence” that proves to be more effective than standard gated recurrence. Sokoban results demonstrate that WMG performs better on this complex domain than the state-of-the-art Deep Repeated ConvLSTM (DRC) agent (by Google DeepMind) throughout 20 million steps of training.

The research team: Ricky Loynd, Roland Fernandez, Asli Celikyilmaz, Adith Swaminathan, and Matthew Hausknecht.

Additional Resources: 
Working Memory Graph GitHub repository 

MSR AI homepage

Efficient pretraining for bidirectional language models in one forward pass

Bottom line: “Our work efficiently realizes unified pretraining of bidirectional language models (via autoencoding) and sequence-to-sequence language models (via partially autoregressive) with a pseudo-masked language model for language understanding and generation.”  
—Senior Principal Research Manager Furu Wei

Quick glance: This research introduces pseudo-masked language models, allowing for efficient pretraining of bidirectional language models in natural language understanding and sequence-to-sequence language models in natural language generation in one forward pass. This work is a collaboration between Microsoft Research Asia, Microsoft Research Redmond, and both the DeepSpeed and Project Turing teams, who help scale up the pretraining to larger models and are working to implement those models in Microsoft products in an initiative called AI at Scale. The paper is titled “UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training.”

Areas of impact: This novel language model improves techniques for natural language generation, including document summarization and dialog generation. It also builds on techniques for natural language understanding, which includes text classification, question answering, and information extraction.  

New state of the art: Results show this model achieves state of the art on various natural language generation and understanding tasks across numerous benchmarks.

The research team: 

MSR Asia: Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Ming Zhou, Hsiao-Wuen Hon, and collaborators Hangbo Bao and Songhao Piao

MSR Redmond: Xiaodong Liu, Yu Wang, and Jianfeng Gao 

Additional resources:

GitHub repository of UniLM 

AI at Scale homepage 

Microsoft Research Asia homepage 

Microsoft Research Redmond homepage 

DeepSpeed homepage 

Project Turing homepage

Correctly identifying layer normalization location for better Transformer optimization

Bottom line: “Use Pre-LN Transformer to remove the annoying warm-up stage and save greatly on converge time.”
— Microsoft Researchers Di He and Shuxin Zheng

Quick glance: This research explores a known optimization issue with the original Transformer architecture (as used in BERT) that slows training and requires hyperparameter tuning. The researchers offer a theoretical proof that the issue stems from the location of layer normalization. They propose a Pre-LN variant of the Transformer whose layer normalization placement makes optimization easier and convergence faster. This work was done by researchers affiliated with Microsoft Research Asia, the Chinese Academy of Sciences, and Peking University. The research is detailed in the paper “On Layer Normalization in the Transformer Architecture.”  
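
The difference in placement is small enough to show in a few lines. The sketch below contrasts the original ordering, where layer normalization follows the residual addition, with the Pre-LN ordering, where it is applied to the sublayer input and the residual path is left untouched. It is an illustration rather than the authors' code: `toy_sublayer` is a hypothetical stand-in for attention or a feed-forward network, the normalization has no learned scale or shift, and the final layer normalization that Pre-LN models usually add after the last block is omitted.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    """Element-wise sum, used for the residual connection."""
    return [u + v for u, v in zip(a, b)]


def post_ln_block(x, sublayer):
    """Original ordering: normalize after the residual addition."""
    return layer_norm(add(x, sublayer(x)))


def pre_ln_block(x, sublayer):
    """Pre-LN ordering: normalize the sublayer input only."""
    return add(x, sublayer(layer_norm(x)))


def toy_sublayer(x):
    # Hypothetical stand-in for attention or a feed-forward network.
    return [0.5 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln_block(x, toy_sublayer)
pre_out = pre_ln_block(x, toy_sublayer)
```

In the Pre-LN block the input `x` reaches the output through a plain addition, so `pre_out` still carries the original values, while `post_out` has been renormalized.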

Areas of impact: Many projects already use the Pre-LN Transformer to train large-scale models because of its optimization stability, including NVIDIA’s Megatron and OpenAI’s GPT-2 and GPT-3.

Added benefits: Because of the way this variant operates, it requires no additional hyperparameter tuning. Combined with faster convergence, this improves energy efficiency.  

The research team: Di He, Shuxin Zheng, Huishuai Zhang, and Tie-Yan Liu  

Additional Resources:  
AI at Scale homepage 

Microsoft Research Asia homepage 

Neural Machine Translation