
Microsoft Research Lab – Asia

ICML 2022 highlights from MSR Asia: From reinforcement learning to graph neural networks


ICML is regarded as one of the top international conferences in artificial intelligence and machine learning and enjoys high prestige in the computer science community. ICML 2022 was held from July 17 to 23. In this article, we present seven papers selected from those published by MSR Asia at the conference, covering topics such as reinforcement learning, graph neural networks, knowledge graph embedding, and more.

Branching Reinforcement Learning


Paper link: https://arxiv.org/pdf/2202.07995.pdf

Reinforcement Learning (RL) models a fundamental sequential decision-making problem, where an agent interacts with the environment over time in order to maximize the obtained rewards. Standard RL considers taking only a single action in a state and formulates a single H-step path model. However, in many real-world applications such as recommendation systems and online advertising, users often select multiple options at once, and each option can trigger a corresponding successor state. For example, in category-based shopping recommendation, the recommendation system often displays a list of main categories in the first step, with each one having a probability of being clicked. When one of the main categories is clicked, the system provides a list of sub-categories in the second step. Then in the last step, the system provides a list of items according to the chosen category path. In this process, users can select (trigger) more than one category-item path; for example, they may trigger “IT accessories -> printers -> laser printers” and “IT accessories -> scanners -> document scanners” at the same time.

To handle such scenarios involving multiple actions and successor states, the researchers proposed a novel Branching Reinforcement Learning (Branching RL) framework consisting of an episodic tree-structured forward model. In each episode, an agent starts from an initial state and takes a super action that contains multiple base actions, where each base action in the state has a probability of being triggered. For each state-base action pair, if it is triggered successfully, the agent receives a reward and transitions to a next state; if it is not triggered, the agent receives zero reward and transitions to a zero-reward absorbing state. Thus, the transitions branch out to multiple successor states. In the second step, for each branched-out state, the agent again selects a super action that contains multiple base actions with trigger probabilities. The agent only obtains rewards from the triggered state-base action pairs, and each triggered pair transitions to its corresponding next state, so the transitions in the second step branch out to even more successor states. In the end, the agent generates an H-layer tree-structured trajectory.
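To make this forward model concrete, below is a minimal Python simulator of one episode. It is a schematic sketch of the process described above, not the authors' code; `trigger_prob` and `reward_fn` are hypothetical functions supplied by the caller.

```python
import random

def branching_episode(H, num_base_actions, trigger_prob, reward_fn, rng=random):
    """Simulate one H-layer tree-structured trajectory of branching RL.

    Every active state takes a super action made of `num_base_actions` base
    actions. Each (state, base action) pair triggers independently with
    probability trigger_prob(s, a); a triggered pair yields reward_fn(s, a)
    and a successor state, while an untriggered pair goes to a zero-reward
    absorbing state, so the trajectory branches out layer by layer.
    """
    layer = [0]            # ids of active states; start from initial state 0
    total_reward = 0.0
    for _ in range(H):
        next_layer = []
        for s in layer:
            for a in range(num_base_actions):
                if rng.random() < trigger_prob(s, a):   # pair is triggered
                    total_reward += reward_fn(s, a)
                    next_layer.append(len(next_layer))  # new successor state
        layer = next_layer
    return total_reward

# Example with a constant trigger probability and unit rewards:
print(branching_episode(H=3, num_base_actions=2,
                        trigger_prob=lambda s, a: 0.5,
                        reward_fn=lambda s, a: 1.0))
```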

Figure 1: Branching reinforcement learning (when the number of base actions taken in a state is two)

For this branching RL framework, the researchers established new analytical techniques, including the branching Bellman equations, the branching value difference lemma, and the branching law of total variance. Furthermore, they proposed two computationally efficient algorithms, BranchRM and BranchRFE, and provided nearly matching regret/sample complexity upper and lower bounds that are polynomial in all problem parameters despite the exponentially large trajectories. The researchers also conducted empirical simulations to validate the efficiency of the algorithms in both sampling and computation.
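For reference, one plausible form of the branching Bellman equation implied by the setup above (our notation and an assumption on our part, not an equation copied from the paper): untriggered state-base action pairs contribute zero reward and lead to a zero-reward absorbing state, so with trigger probability q(s,a), reward r(s,a), and transition kernel p(·|s,a), the optimal value at step h would satisfy

```latex
V_h(s) = \max_{A \in \mathcal{A}} \sum_{a \in A} q(s,a) \Big( r(s,a) + \sum_{s'} p(s' \mid s, a)\, V_{h+1}(s') \Big), \qquad V_{H+1}(\cdot) = 0.
```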

Supervised Off-Policy Ranking


Paper link: https://arxiv.org/pdf/2107.01360.pdf
Code link: https://github.com/SOPR-T/SOPR-T

Off-Policy Evaluation (OPE) aims to evaluate the performance of a target policy using data generated by other policies. OPE is critical in many real-world applications such as transactions, advertising, autonomous driving, and drug trials. In these applications, evaluating policies online by interacting with the real environment can be costly.

Existing OPE methods are mainly based on distribution correction, model estimation, and Q-estimation. They focus on accurately estimating the return of a policy and use unsupervised estimation methods. In this paper, the researchers argue that these methods cannot adequately meet real-world needs and conditions. First, in many applications, the ultimate goal of OPE is to select better policies from candidate policies rather than to accurately estimate their returns. Second, people can usually assess the performance of policies that have already been deployed in real environments, but this type of information is not well exploited. Therefore, the researchers defined two new problems, Supervised Off-Policy Evaluation (SOPE) and Supervised Off-Policy Ranking (SOPR), which estimate the performance or performance ranking of target policies using off-policy datasets together with a set of deployed policies whose performance or performance ranking is known. Because SOPR does not need to accurately estimate policy performance, it is simpler than SOPE and has greater practical value.

Figure 2: Architecture of the hierarchical Transformer encoder-based policy representation and scoring model

To solve the SOPR problem, researchers proposed a policy ranking algorithm based on supervised learning that uses policy representations and policy ranking labels to learn a scoring model and then rank policies based on their scores. For policy representation, since different policies have different forms and may not have parameters, it is infeasible to use policy parameters to represent policies. Therefore, researchers proposed to use state-action data and a hierarchical Transformer encoder to learn policy representations, where states come from the off-policy dataset, and actions are generated from the target policy on these states. The algorithm is named SOPR-T, where T stands for Transformer.
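The sketch below illustrates this recipe in PyTorch under simplifying assumptions: a flat (rather than hierarchical) Transformer encoder with mean pooling, hypothetical `state_dim` and `action_dim` parameters, and a margin-based pairwise loss standing in for the paper's ranking objective. It is a rough rendering of the idea, not the SOPR-T implementation.

```python
import torch
import torch.nn as nn

class PolicyScorer(nn.Module):
    """Schematic SOPR-style scorer (not the exact SOPR-T architecture).

    Represents a policy by the set of (state, action) pairs obtained by
    running the target policy on states from the off-policy dataset,
    encodes the set with a Transformer encoder, and maps the pooled
    embedding to a scalar score used for ranking.
    """
    def __init__(self, state_dim, action_dim, d_model=64):
        super().__init__()
        self.embed = nn.Linear(state_dim + action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, states, actions):  # (B, N, state_dim), (B, N, action_dim)
        tokens = self.embed(torch.cat([states, actions], dim=-1))
        pooled = self.encoder(tokens).mean(dim=1)   # mean-pool over pairs
        return self.head(pooled).squeeze(-1)        # (B,) policy scores

def pairwise_ranking_loss(score_hi, score_lo, margin=1.0):
    # The policy known (from deployment) to perform better should score higher.
    return torch.clamp(margin - (score_hi - score_lo), min=0).mean()
```

Pairs of deployed policies with known relative performance supply the labels for the pairwise loss; at test time, target policies are simply ranked by their scores.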

Extensive experiments were conducted on a public dataset of MuJoCo control tasks. The results demonstrate that SOPR-T outperforms OPE baseline algorithms in both rank correlation and regret.

Going Deeper into Permutation-Sensitive Graph Neural Networks


Paper link: https://proceedings.mlr.press/v162/huang22l.html
Code link: https://github.com/zhongyu1998/PG-GNN
Demo link: https://github.com/zhongyu1998/PG-GNN/blob/main/demo.mp4

Invariance to permutations of the adjacency matrix is a key inductive bias for Graph Neural Networks (GNNs). Conventionally, this prerequisite is satisfied by carrying out permutation-invariant operations over the nodes when aggregating messages. However, these strongly symmetric permutation-invariant aggregators presume equal status for all neighboring nodes, sometimes ignoring the relationships among neighboring nodes and thus hindering the expressivity of GNNs.

Unlike permutation-invariant aggregators, permutation-sensitive aggregators are sensitive to node orderings and can be regarded as a "symmetry-breaking" mechanism that breaks the equal status of neighboring nodes. In this way, the relationships among neighboring nodes, e.g., the 2-ary dependency between each pair of neighboring nodes, are explicitly modeled in the permutation-sensitive paradigm. These 2-ary dependencies help capture whether two neighboring nodes are connected, thereby exploiting local graph substructures to improve the expressive power of GNNs.

However, the permutation-sensitive aggregators need to explicitly consider all n! possible permutations to guarantee generalization capability, leading to intractable computational complexity. To address the complexity challenge, the researchers suggested sampling a small number of representative permutations. Specifically, they designed a permutation sampling strategy via permutation groups and then applied a permutation-sensitive aggregator to model the sampled permutations while covering all 2-ary dependencies between neighboring nodes. The proposed approach is then capable of significantly reducing computational complexity while guaranteeing expressive power. They demonstrated that their approach is strictly more powerful than the two-dimensional Weisfeiler-Lehman (2-WL) graph isomorphism test and can distinguish between certain non-isomorphic graph pairs that the 3-WL test fails to distinguish. Moreover, this approach achieves linear permutation sampling complexity and does not require considering all n! possible permutations.
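As a rough illustration of this paradigm (a schematic sketch only, not the paper's PG-GNN architecture or its group-theoretic sampling strategy), the PyTorch snippet below runs a GRU over a few sampled orderings of a node's neighbors: the recurrent summary depends on neighbor order, while using cyclic shifts keeps the number of sampled permutations linear rather than factorial.

```python
import torch
import torch.nn as nn

class PermutationSensitiveAggregator(nn.Module):
    """Schematic permutation-sensitive neighbor aggregation.

    Instead of summing neighbor features (permutation-invariant), run a
    recurrent unit over a few sampled orderings of the neighbors and
    average the resulting summaries. Cyclic shifts stand in for the
    paper's group-based permutation sampling.
    """
    def __init__(self, dim, num_orderings=3):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.num_orderings = num_orderings

    def forward(self, neighbor_feats):              # (n_neighbors, dim)
        n = neighbor_feats.size(0)
        summaries = []
        for k in range(min(self.num_orderings, n)):
            order = torch.roll(torch.arange(n), shifts=k)  # cyclic shift
            seq = neighbor_feats[order].unsqueeze(0)       # (1, n, dim)
            _, h = self.gru(seq)                           # final hidden state
            summaries.append(h.squeeze(0).squeeze(0))
        return torch.stack(summaries).mean(dim=0)          # (dim,)
```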

In conclusion, the researchers designed a powerful yet efficient GNN via the permutation-sensitive aggregation mechanism and took an important step toward better understanding permutation-sensitive GNNs. Leveraging the natural expressive-power advantage of permutation-sensitive aggregation while balancing expressiveness against computational complexity is a promising direction for future research.

Figure 3: Illustration of the proposed model. Five neighboring nodes are considered for simplicity, and central node v is ignored for clarity.

HousE: Knowledge Graph Embedding with Householder Parameterization


Paper link: https://arxiv.org/abs/2202.07919

Knowledge Graph Embedding (KGE), which learns low-dimensional representations for entities and relations, is an effective tool to alleviate the problem of incompleteness in knowledge graphs. This paper analyzes the modeling capability of existing KGE methods and points out that: (1) The relational rotations used in existing methods are fixed in low-dimensional spaces, greatly limiting modeling capacity; (2) Existing methods cannot comprehensively model the crucial relation patterns and mapping properties in knowledge graphs.

To address these limitations, this paper introduces the Householder reflection as its basic mathematical tool and designs two linear transformations based on it to model relations in knowledge graphs: (1) Householder rotation (composed of multiple Householder reflections), which naturally extends to high-dimensional spaces to achieve powerful modeling capacity; and (2) Householder projection (modified from the original Householder reflection), which can handle sophisticated relation mapping properties without losing the capability to model relation patterns.

Using the Householder framework, this paper proposes a more powerful and general KGE model named HousE, which represents each relation as a two-stage transformation between entities. As shown in Figure 4, for a given triple, HousE first uses relational Householder projections to generate relation-specific representations for the head and tail entities, and then models relational Householder rotations between the projected entities.
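The underlying linear algebra is easy to state: a Householder reflection is I − 2vvᵀ/(vᵀv), and composing an even number of reflections yields a rotation in any dimension, which is what lets rotations scale to high-dimensional embedding spaces. The NumPy sketch below illustrates the math only, not the HousE model itself.

```python
import numpy as np

def householder_reflection(v):
    """Householder matrix H = I - 2 v v^T / (v^T v): reflects across the
    hyperplane orthogonal to v. H is orthogonal with det(H) = -1."""
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

# Composing an even number of reflections yields a rotation (det = +1),
# which is the idea behind the Householder rotation used by HousE.
u, w = np.random.randn(3), np.random.randn(3)
R = householder_reflection(u) @ householder_reflection(w)
print(np.allclose(R @ R.T, np.eye(3)))          # orthogonal
print(np.isclose(np.linalg.det(R), 1.0))        # proper rotation
```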

Figure 4: (a) Householder reflection in 2-dimensional space; (b) Householder rotation in 2-dimensional space; (c) Householder projections under different τ values in 2-dimensional space; (d) Illustration of HousE. To model the triples (h, r, t_1) and (h, r, t_2), HousE first uses Householder projections (Pro-H1 and Pro-H2) to adjust the relative distance between entities, and then performs a Householder rotation (Rot-H) on the projected head entity S_(h,r) to bring it closer to the projected tail entity.

Theoretically, HousE is capable of modeling important relation patterns and complex mapping properties in knowledge graphs. It is also a generalization of existing rotation-based models that extends rotations to high-dimensional spaces. Empirically, HousE achieved new state-of-the-art performance on five benchmark datasets, and further experimental results, such as fine-grained performance analysis, verified the powerful modeling capabilities brought by the Householder framework.

ClofNet: SE(3) Equivariant Graph Neural Networks with Complete Local Frames


Paper link: https://arxiv.org/abs/2110.14811

Group equivariance (e.g., SE(3) equivariance) is a critical physical symmetry of many 3D many-body systems, such as molecular dynamical systems. Great effort has been put into encoding this symmetry into deep neural networks, i.e., equivariant neural networks, which have been shown to improve generalization performance and data efficiency on downstream tasks such as molecular property prediction and conformation generation. A neural network ϕ is equivariant to the SE(3) group G if its input and output transform consistently under translations and rotations, i.e., ϕ(T_g(x)) = S_g(ϕ(x)), where T_g and S_g are group representations of the group action g∈G. Some equivariant networks (e.g., EGNN, SchNet) directly implement equivariant operations in the original space (e.g., applying nonlinear functions to the distances or linear transformations along radial directions between nodes), providing an efficient way to preserve equivariance without complex equivariant embeddings. However, in some scenarios, using incomplete frames with only the radial direction causes a direction degeneration problem and is insufficient for expressing complex geometric information. In light of this, the researchers proposed ClofNet, a framework for constructing an SE(3) equivariant network with complete local frames that can faithfully and efficiently approximate geometric quantities.

Figure 5: An overview of ClofNet

Specifically, the researchers represent a given 3D many-body system as a spatial graph. First, they move the centroid of the system to the origin to preserve translation equivariance. Then, a complete local frame F_ij=(a_ij, b_ij, c_ij) is built for any given particle pair (x_i, x_j), where a_ij=(x_i−x_j)/||x_i−x_j||, b_ij=x_i×x_j/||x_i×x_j||, and c_ij=a_ij×b_ij. If F_ij is non-degenerate, it is complete in the sense that it forms a local orthonormal basis. After that, ClofNet projects all input vectors of the particle pair onto the local frame and obtains SE(3) invariant scalars s_ij=Scalarize(X_i, X_j, F_ij). For example, the projected scalar vector of the position vector x_i is (〈x_i, a_ij〉, 〈x_i, b_ij〉, 〈x_i, c_ij〉). These invariant scalars are fed into a message-passing neural network to learn edgewise message embeddings m_ij. Finally, the vectorization block of ClofNet leverages the edge embeddings to predict the coefficients o_ij of the corresponding local frame and represents the output vectors as linear combinations of the local frame vectors with the predicted coefficients. The expressiveness of ClofNet over the SE(3) equivariant function space is also demonstrated and discussed in the paper.
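Here is a minimal NumPy sketch of this frame construction and scalarization (our illustration, assuming the non-degenerate case). The final check verifies that the scalarized features are unchanged when a proper rotation is applied to the whole system.

```python
import numpy as np

def local_frame(x_i, x_j):
    """Complete local frame for a particle pair, following the construction
    described above (non-degenerate case assumed)."""
    a = (x_i - x_j) / np.linalg.norm(x_i - x_j)
    b = np.cross(x_i, x_j) / np.linalg.norm(np.cross(x_i, x_j))
    c = np.cross(a, b)
    return a, b, c

def scalarize(vec, frame):
    # Project a vector onto the frame -> three SE(3)-invariant scalars.
    return np.array([np.dot(vec, axis) for axis in frame])

rng = np.random.default_rng(0)
x_i, x_j = rng.standard_normal(3), rng.standard_normal(3)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthogonal matrix
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1                                  # ensure a proper rotation

s1 = scalarize(x_i, local_frame(x_i, x_j))
s2 = scalarize(Q @ x_i, local_frame(Q @ x_i, Q @ x_j))
print(np.allclose(s1, s2))  # True: scalars are rotation-invariant
```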

The researchers validated the effectiveness of ClofNet on two many-body tasks: (1) trajectory prediction for Newtonian many-body systems and (2) 3D molecular conformation generation. Experimental results show that ClofNet can considerably reduce sample complexity, increase prediction accuracy on complex dynamic trajectories, and improve the quality of generated molecular conformations.

Figure 6: MSE of different methods when sweeping over different amounts of training data on the G+ES dataset
Table 1: Generation quality of different methods on the GEOM-QM9 and GEOM-Drugs datasets
Analyzing and Mitigating Interference in Neural Architecture Search

Paper link: https://arxiv.org/abs/2108.12821

Weight sharing is a popular approach for reducing the cost of neural architecture search (NAS) by reusing the weights of shared operators from previously trained sub-models. However, the rank correlation between the estimated accuracy and the ground-truth accuracy of these sub-models is low due to interference among different sub-models caused by weight sharing, as shown in Figures 7 and 8, and this hinders practical usage.


In this paper, researchers investigated the interference issue by sampling different sub-models and calculating the gradient similarity of shared operators. They observed that: 1) the interference on a shared operator between two sub-models is positively correlated with the number of operators that differ between them; 2) the interference is smaller when the inputs and outputs of the shared operator are more similar.

Inspired by these two observations, researchers proposed two approaches to mitigate interference:

  1. MAGIC-T: rather than randomly sampling sub-models for optimization, the researchers proposed a gradual modification scheme that changes only one operator between adjacent optimization steps, minimizing interference on the shared operators (see the sketch after this list);
  2. MAGIC-A: forcing the inputs and outputs of a shared operator to be similar across all sub-models to reduce interference.
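As a rough illustration of MAGIC-T's sampling idea (our simplified sketch, with a hypothetical layer-wise search space and operator list), consecutive child architectures below differ in exactly one operator, so the shared weights experience minimal interference between adjacent optimization steps:

```python
import random

def magic_t_schedule(num_layers, ops, num_steps, rng=random):
    """Yield one child architecture per optimization step, changing exactly
    one layer's operator between adjacent steps (MAGIC-T-style gradual
    modification), so consecutive child models share all but one operator."""
    arch = [rng.choice(ops) for _ in range(num_layers)]  # random starting child
    for _ in range(num_steps):
        yield list(arch)
        layer = rng.randrange(num_layers)                # pick one layer ...
        alternatives = [op for op in ops if op != arch[layer]]
        arch[layer] = rng.choice(alternatives)           # ... and swap its op

# Example: a 4-layer space with three candidate operators per layer.
for child in magic_t_schedule(4, ["conv3", "conv5", "identity"], num_steps=3):
    print(child)
```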

The researchers first conducted experiments on a BERT search space to verify that mitigating interference via each of the proposed methods improves the rank correlation of the super-net, and that combining the two methods yields even better results. They then leveraged MAGIC-AT to search for top-performing architectures on BERT, SQuAD, and ImageNet tasks. The discovered architectures consistently and significantly outperformed previous work, demonstrating the effectiveness and generality of the proposed methods.

Figure 8: Gradient cosine similarity of different child models on the shared operators
Table 2: Results of architectures found by MAGIC-AT on the GLUE benchmark.

Finding Global Homophily in Graph Neural Networks When Meeting Heterophily


Paper link: https://arxiv.org/abs/2205.07308

In graphs with heterophily, linked nodes are more likely to have different labels. Generally, nodes with the same label are said to be homophilous, while nodes with different labels are heterophilous. When traditional GNN methods (GCN, GAT, etc.) are used to learn node representations in heterophilous graphs, the representations are easily misled by heterophilous neighbors, which can lead to poor representation learning. To address this problem, researchers have previously attempted to enlarge the node neighborhood size, which essentially increases the number of homophilous neighbors of a node. However, a challenge remains: how large should a node neighborhood be? In this paper, researchers present a new solution, which is to use the global neighborhood, i.e., the whole graph!

This paper proposes a novel GNN model called GloGNN, whose architecture is shown in the figure below. The model input includes node features and an adjacency matrix, which are integrated to generate the initial node representation matrix. In each layer, GloGNN updates the node representation matrix based on a coefficient matrix. The coefficient matrix describes the correlation between all the nodes in the graph and is calculated from an optimization objective that incorporates node features and graph topology. Further, the time complexity of the updating process is reduced to linear by accelerating the matrix inversion with the Woodbury formula and adjusting the order of matrix multiplications. Additionally, an upgraded model, GloGNN++, is presented, which considers not only node correlations but also the importance of each dimension of a node's features. The researchers demonstrated the effectiveness of GloGNN and GloGNN++ through both theoretical analysis and empirical studies.

Figure 10: Architecture of GloGNN
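The linear-time claim rests on the Woodbury identity: for a low-rank matrix Z ∈ R^(n×d) with d ≪ n, (λI_n + ZZᵀ)⁻¹ = (1/λ)(I_n − Z(λI_d + ZᵀZ)⁻¹Zᵀ), so only a d×d matrix is ever inverted. The NumPy sketch below verifies the identity in isolation; it illustrates the trick itself, not GloGNN's actual update rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 500, 16, 0.1
Z = rng.standard_normal((n, d))

# Direct O(n^3) inverse of (lam * I_n + Z Z^T) ...
direct = np.linalg.inv(lam * np.eye(n) + Z @ Z.T)

# ... versus Woodbury: only a d x d matrix is inverted, O(n d^2) overall.
small = np.linalg.inv(lam * np.eye(d) + Z.T @ Z)      # (d, d) system
woodbury = (np.eye(n) - Z @ small @ Z.T) / lam

print(np.allclose(direct, woodbury))  # True: the two inverses agree
```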

On the theoretical side, the rationality of the methods was verified by analyzing the grouping effect of the coefficient matrix and the node representation matrix. On the experimental side, the researchers compared GloGNN and GloGNN++ with 11 other representative GNN models on 15 datasets spanning various domains, scales, and levels of heterophily. They also performed in-depth analysis of model efficiency and interpretability. The results show that GloGNN and GloGNN++ can capture homophilous nodes from the whole graph both effectively and efficiently.