July 22, 2022 – January 24, 2024

MSR Asia Theory Lecture Series

Location: Virtual / MSRA

MSR Asia Theory Lecture Series is a forum where we invite researchers around the world to share the latest theoretical advances in big data, artificial intelligence, and related areas. The Lecture series are broadcast live over Teams. If you would like to receive the information about the upcoming talks, please send email “Subscribe to the Lecture Series” to MSRA.TheoryCenter@outlook.com or subscribe to Wechat official account “微软学术合作“.

Lectures

9/25/2024: Revealing the Reasoning Mechanisms of Large Language Models—Level-2 Reasoning Skill Beyond Human, Tian Ye

Abstract: The latest language models have demonstrated near-perfect accuracy on elementary math test sets (such as GSM8K), indicating their capability to solve mathematical reasoning problems. To investigate how language models address these issues, we designed a series of controlled variable experiments and explored the following questions: First, have language models truly learned genuine reasoning abilities, or do they merely rely on memorizing answer templates? Second, what is the nature of the internal reasoning process within the models? Third, do the models employ human-like techniques to solve mathematical problems? Fourth, can models trained on datasets like GSM8K learn reasoning skills beyond what is necessary to solve GSM8K problems? Fifth, what causes the models to make reasoning errors? Sixth, how large or deep must a model be to effectively solve math problems at the GSM8K level? Our research reveals many hidden mechanisms by which language models solve mathematical problems and provides insights that surpass the current understanding of large language models.

Bio: Tian Ye, a PhD student at Carnegie Mellon University, majoring in Machine Learning, currently serves as a Research Scientist Intern at Meta. His research interests primarily focus on the reasoning mechanisms of large language models. He has published research papers at top conferences such as NeurIPS. Additionally, he has twice qualified for the Chinese Mathematical Olympiad training team; he earned his bachelor’s degree from the Yao Class at Tsinghua University.

8/21/2024: Regularization and Optimal Multiclass Learning, Shang-Hua Teng

Abstract: The quintessential learning algorithm of empirical risk minimization (ERM) is known to fail in various settings for which uniform convergence does not characterize learning. Relatedly, the practice of machine learning is rife with considerably richer algorithmic techniques, perhaps the most notable of which is regularization. Nevertheless, no such technique or principle has broken away from the pack to characterize optimal learning in these more general settings. The purpose of this work is to precisely characterize the role of regularization in perhaps the simplest setting for which ERM fails: multiclass learning with arbitrary label sets. Using one-inclusion graphs (OIGs), we exhibit optimal learning algorithms that dovetail with tried-and-true algorithmic principles: Occam’s Razor as embodied by structural risk minimization (SRM), the principle of maximum entropy, and Bayesian inference. We also extract from OIGs a combinatorial sequence we term the Hall complexity, which is the first to characterize a problem’s transductive error rate exactly. Lastly, we introduce a generalization of OIGs and the transductive learning setting to the agnostic case, where we show that optimal orientations of Hamming graphs – judged using nodes’ outdegrees minus a system of node-dependent credits – characterize optimal learners exactly. We demonstrate that an agnostic version of the Hall complexity again characterizes error rates exactly, and exhibit an optimal learner using maximum entropy programs.

Bio: Shang-Hua Teng is a University Professor and Seely G. Mudd Professor of Computer Science and Mathematics at USC. He is a fellow of SIAM, ACM, and Alfred P. Sloan Foundation, and has twice won the Gödel Prize, first in 2008, for developing smoothed analysis, and then in 2015, for designing the breakthrough scalable Laplacian solver. Citing him as, “one of the most original theoretical computer scientists in the world”, the Simons Foundation named him a 2014 Simons Investigator to pursue long-term curiosity-driven fundamental research. He also received the 2009 Fulkerson Prize, 2023 Science & Technology Award for Overseas Chinese from the China Computer Federation, 2022 ACM SIGecom Test of Time Award (for settling the complexity of computing a Nash equilibrium), 2021 ACM STOC Test of Time Award (for smoothed analysis), 2020 Phi Kappa Phi Faculty Recognition Award (2020) for his book Scalable Algorithms for Data and Network Analysis, 2011 ACM STOC Best Paper Award (for improving maximum-flow minimum-cut algorithms). In addition, he and collaborators developed the first optimal well-shaped Delaunay mesh generation algorithms for arbitrary three-dimensional domains, settled the Rousseeuw-Hubert regression-depth conjecture in robust statistics, and resolved two long-standing complexity-theoretical questions regarding the Sprague-Grundy theorem in combinatorial game theory. For his industry work with Xerox, NASA, Intel, IBM, Akamai, and Microsoft, he received fifteen patents in areas including compiler optimization, Internet technology, and social networks. Dedicated to teaching his daughter to speak Chinese as the sole Chinese-speaking parent in an otherwise English-speaking family and environment, he has also become fascinated with children’s bilingual learning.

slides

4/25/2024: Toward Demystifying Grokking, Wei Hu

Abstract: Grokking is a surprising phenomenon in which a neural network first memorizes the training set, resulting in perfect training accuracy but near-random test accuracy, and after training for sufficiently longer, it suddenly transitions to perfect test accuracy. I will talk about our recent work toward theoretically explaining the grokking phenomenon. First, we show that a dichotomy of early and late-phase implicit biases can provably induce grokking, and exemplify it in simplified settings such as sparse linear models and matrix completion. Second, we show that in a simple non-linear classification task, grokking also provably occurs and coincides with another intriguing phenomenon known as benign overfitting.

Bio: Wei Hu is an Assistant Professor in Computer Science and Engineering at the University of Michigan. He obtained his Ph.D. degree from Princeton University, where he was advised by Sanjeev Arora, and his Bachelor’s degree from Tsinghua University, where he was a member of Yao Class. His research interest is in the theoretical and scientific foundations of deep learning. He is a recipient of the Google Research Scholar award and the Siebel Scholarship.

Slides

2/28/2024: Feature learning of neural network by mean field Langevin dynamics: Optimization and generalization, Taiji Suzuki

Abstract: In this talk, I will discuss the feature learning ability of neural networks from statistical and optimization perspectives. In particular, I will present recent developments of theory of the mean-field Langevin dynamics (MFLD) and its application to neural network training. MFLD is a nonlinear generalization of the gradient Langevin dynamics (GLD) that minimizes an entropy regularized convex function defined on the space of probability distributions, and it naturally arises from the optimization of two-layer neural networks via (noisy) gradient descent. In the first half, I will present the convergence result of MFLD and explain how the convergence of MFLD is connected to the duality gap through the log-Sobolev inequality of the so-called proximal Gibbs measure. In addition to that, the time-space discretization of MFLD will be addressed. It can be shown that the discretization error can be bounded uniformly in time unlike existing work. In the latter half, I will discuss the generalization error analysis of neural networks trained by MFLD. Addressing a binary classification problem, we have a general form of a test classification error bound that provides a fast learning rate based on a local Rademacher complexity analysis. By applying this general framework to the k-sparse parity problem, we demonstrate how the feature learning helps its sample complexity compared with the kernel methods. Finally, we also discuss how anisotropic structure of input will affect the sample complexity and computational complexity. If the data is well aligned to the target function, both sample and computational complexities are significantly mitigated.

Bio: Taiji Suzuki is currently an Associate Professor in the Department of Mathematical Informatics at the University of Tokyo. He also serves as the team leader of “Deep learning theory” team in AIP-RIKEN. He received his Ph.D. degree in information science and technology from the University of Tokyo in 2009. He worked as an assistant professor in the department of mathematical informatics, the University of Tokyo between 2009 and 2013, and then he was an associate professor in the department of mathematical and computing science, Tokyo Institute of Technology between 2013 and 2017. He served as area chairs of premier conferences such as NeurIPS, ICML, ICLR and AISTATS, a program chair of ACML2019, and an action editor of the Annals of Statistics. He received the Outstanding Paper Award at ICLR in 2021, the MEXT Young Scientists’ Prize, and Outstanding Achievement Award in 2017 from the Japan Statistical Society. He is interested in deep learning theory, nonparametric statistics, high dimensional statistics, and stochastic optimization. In particular, he is mainly working on deep learning theory from several aspects such as representation ability, generalization ability and optimization ability. He also has devoted stochastic optimization to accelerate large scale machine learning problems including variance reduction methods, Nesterov’s acceleration, federated learning and non-convex noisy optimization.

1/25/2024: Recent Advances in Coresets for Clustering, Shaofeng Jiang

Abstract: Coreset is a popular data reduction technique. Roughly, a coreset is a tiny proxy of the dataset, such that the objective function evaluated on the coreset for every feasible solution approximates that on the original dataset. Coresets are particularly useful for dealing with big data since they can usually be constructed in sublinear models efficiently, including streaming and parallel computing.

The study of coresets for clustering is very fruitful, and nearly tight bounds have recently been obtained for well-known problems such as k-median and k-means and their variants. In this talk, I will introduce the recent advances in coresets for clustering, with a focus on presenting several fundamental sampling techniques, including importance sampling and hierarchical uniform sampling, for the construction of coresets. I will conclude the talk by discussing future directions for the study of coreset (and beyond).

Bio: Shaofeng Jiang is an assistant professor at Peking University. He obtained his PhD at the University of Hong Kong, and before he joined PKU, he worked as a postdoctoral researcher at the Weizmann Institute of Science, and an assistant professor at Aalto University. His research interest generally lies in theoretical computer science, with a focus on sublinear algorithms.

Slides

11/28/2023: Textbooks Are All You Need, Yin Tat Lee

Abstract: Many believed that training large language models (LLMs) required using a vast dataset and an immense number of parameters. This is computationally demanding, requiring significant GPU resources. GPT-4 exemplified this belief, being a colossal model trained on a vast corpus.

In light of this, we sought to discern if such impressive results could be achieved with smaller models and limited data for code generation. We demonstrate that with high-quality data, the demand for expansive datasets and a multitude of parameters lessens. The outcome was a few billion-size model, which not only met or exceeded the performance of existing open-source models but did so utilizing a mere 1/1000th of compute in training. Moreover, we will discuss specific emergent properties observed in the model after its fine-tuning on coding exercises.

Bio: Yin Tat Lee is a Principal Researcher at MSR and an Associate Professor of Paul G. Allen School of Computer Science & Engineering at the University of Washington. His research interests are convex optimization, convex geometry, graph algorithms, online algorithms, and differential privacy. During his career, he has received a variety of awards, including Best Paper Awards at FOCS, SODA and NeurIPS, Sprowls Award, NSF CAREER Award, A.W. Tucker Prize, Microsoft Research Faculty Fellowship, Sloan Research Fellowship, and Packard Fellowships.

10/30/2023: Intelligent Heuristics Are the Future of Computing, Shang-Hua Teng

Abstract: Back in 1988, the partial game trees explored by computer chess programs were among the largest search structures in real-world computing. Because the game tree is too large to be fully evaluated, chess programs must make heuristic strategic decisions based on partial information, making it an illustrative subject for teaching AI search. In one of his lectures that year on AI search for games and puzzles, Professor Hans Berliner — a pioneer of computer chess programs — stated: “Intelligent heuristics are the future of computing.”

As a student in the field of the theory of computation, I was naturally perplexed but fascinated by this perspective. I had been trained to believe that “Algorithms and computational complexity theory are the foundation of computer science.” However, as it happens, my attempts to understand heuristics in computing have subsequently played a significant role in my career as a theoretical computer scientist. I have come to realize that Berliner’s postulation is a far-reaching worldview, particularly in the age of big, rich, complex, and multifaceted data and models, when computing has ubiquitous interactions with science, engineering, humanity, and society.

In this talk, I will share some of my experiences on the subject of heuristics in computing, presenting examples of theoretical attempts to understand the behavior of heuristics on real data, as well as efforts to design practical heuristics with desirable theoretical characterizations. My hope is that these theoretical insights from past heuristics — such as spectral partitioning, multilevel methods, evolutionary algorithms, and simplex methods — can shed light on and further inspire a deeper understanding of the current and future techniques in AI and data mining.

Bio: Shang-Hua Teng is a University Professor and Seely G. Mudd Professor of Computer Science and Mathematics at USC. He is a fellow of SIAM, ACM, and Alfred P. Sloan Foundation, and has twice won the Gödel Prize, first in 2008, for developing smoothed analysis, and then in 2015, for designing the breakthrough scalable Laplacian solver. Citing him as, “one of the most original theoretical computer scientists in the world”, the Simons Foundation named him a 2014 Simons Investigator to pursue long-term curiosity-driven fundamental research. He also received the 2009 Fulkerson Prize, 2021 ACM STOC Test of Time Award (for smoothed analysis), 2022 ACM SIGecom Test of Time Award (for settling the complexity of computing a Nash equilibrium), 2011 ACM STOC Best Paper Award (for improving maximum-flow minimum-cut algorithms), and 2023 Science & Technology Award for Overseas Chinese (opens in new tab) from the China Computer Federation. In addition, he and collaborators developed the first optimal well-shaped Delaunay mesh generation algorithms for arbitrary three-dimensional domains, settled the Rousseeuw-Hubert regression-depth conjecture in robust statistics, and resolved two long-standing complexity-theoretical questions regarding the Sprague-Grundy theorem in combinatorial game theory. For his industry work with Xerox, NASA, Intel, IBM, Akamai, and Microsoft, he received fifteen patents in areas including compiler optimization, Internet technology, and social networks. Dedicated to teaching his daughter to speak Chinese as the sole Chinese-speaking parent in an otherwise English-speaking family and environment, he has also become fascinated with children’s bilingual learning.

10/23/2023: The mathematics of complex streamed data, Terry Lyons

Abstract: Complex streams of evolving data are better understood by their effects on nonlinear systems than by their values at times. The question of which nonlinear systems would seem to be context dependent, but it is not. Core to rough path theory is a simple universal nonlinear system that captures all the information needed to predict any response to any nonlinear system. This idealized mathematical feature set is known as the signature of the stream. Its abstract simplicity opens the possibilities for understanding and working with streams in the same context free way that calculators work with numbers. Signature-based techniques offer simple to apply universal numerical methods that are robust to irregular data and efficient at representing the order of events and complex oscillatory data. Specific software can be developed and then applied across many contexts. Signatures underpin prize winning contributions in recognizing Chinese handwriting, in detecting sepsis, and in generating financial data, and most recently in the ability to score streams as outliers against a corpus of normal streams. This principled outlier technology has emerged as a powerful unifying technique; it identifies radio frequency interference in astronomical data, brain injury from meg data…. The underpinning theoretical contributions span a range from abstract algebra and non-commutative analysis to questions of organization of efficient numerical calculation. See www.datasig.ac.uk/ (opens in new tab). New hyperbolic partial differential equations have been developed that compute the “signature kernel” trick without ever having to introduce signatures. Neural controlled differential equations can directly harness approaches such as the log ode method and consume the control as a rough path.

Bio:

Professor Terry Lyons is Wallis Professor Emeritus and Professor of Mathematics at the University of Oxford. He is currently PI of the DataSıg program (primarily funded by EPSRC), and of the complementary research programme CIMDA-Oxford (under the support of InnoHK and the HKSAR). He was a founding member (2007) of, and then Director (2011-2015) of, the Oxford Man Institute of Quantitative Finance; he was the Director of the Wales Institute of Mathematical and Computational Sciences (WIMCS; 2008-2011). He came to Oxford in 2000 having previously been Professor of Mathematics at Imperial College London (1993-2000), and before that he held the Colin Maclaurin Chair at Edinburgh (1985-93). He was President of the London Mathematical Society (2013-15).

Professor Lyons’s long-term research interests are focused on the mathematics of streamed data and building strong applications from these mathematical insights. His current goal is to use rough path theory to develop innovative and truly generic tools for working with streamed data and make these widely accessible through the python package RoughPy. One example of this synergy comes from the signature of a stream. Signatures underpin prize winning contributions in recognizing Chinese handwriting, in detecting sepsis, and in generating financial data, and most recently in the ability to score streams as outliers against a corpus of normal streams. This principled outlier technology has emerged as a powerful unifying technique; it identifies radio frequency interference in astronomical data, brain injury from meg data…. The underpinning theoretical contributions span a range from abstract algebra and non-commutative analysis to questions of organization of efficient numerical calculation. See www.datasig.ac.uk/ (opens in new tab)

10/18/2023: Is RLHF More Difficult than Standard RL?, Chi Jin

Abstract: Reinforcement learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals. Preferences arguably contain less information than rewards, which makes preference-based RL seemingly more difficult. This work theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs. Specifically, (1) for preferences that are drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences where the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL which finds Nash equilibria for factored Markov games under a restricted set of policies. The latter case can be further reduced to adversarial MDP when preferences only depend on the final state. We instantiate all reward-based RL subroutines by concrete provable algorithms and apply our theory to a large class of models including tabular MDPs and MDPs with generic function approximation. We further provide guarantees when K-wise comparisons are available.

Bio: Chi Jin is an assistant professor at the Electrical and Computer Engineering department of Princeton University. He obtained his PhD degree in Computer Science at University of California, Berkeley, advised by Michael I. Jordan. His research mainly focuses on theoretical machine learning, with special emphasis on nonconvex optimization and Reinforcement Learning (RL). In nonconvex optimization, he provided the first proof showing that first-order algorithm (stochastic gradient descent) is capable of escaping saddle points efficiently. In RL, he provided the first efficient learning guarantees for Q-learning and least-squares value iteration algorithms when exploration is necessary. His works also lay the theoretical foundation for RL with function approximation, multiagent RL and partially observable RL

Video (opens in new tab)

8/24/2023: On the Power of Foundation Models, Yang Yuan

Abstract: With infinitely many high-quality data points, infinite computational power, an infinitely large foundation model with a perfect training algorithm and guaranteed zero generalization error on the pretext task, can the model be used for everything? This question cannot be answered by the existing theory of representation, optimization or generalization, because the issues they mainly investigate are assumed to be nonexistent here. In this paper, we show that category theory provides powerful machinery to answer this question. We have proved three results. The first one limits the power of prompt-based learning, saying that the model can solve a downstream task with prompts if and only if the task is representable. The second one says fine tuning does not have this limit, as a foundation model with the minimum required power (up to symmetry) can theoretically solve downstream tasks for the category defined by pretext task, with fine tuning and enough resources. Our final result can be seen as a new type of generalization theorem, showing that the foundation model can generate unseen objects from the target category (e.g., images) using the structural information from the source category (e.g., texts). Along the way, we provide a categorical framework for supervised and self-supervised learning, which might be of independent interest.

Bio: Yang Yuan is now an assistant professor at IIIS, Tsinghua. He finished his undergraduate study at Peking University in 2012. Afterwards, he received his PhD at Cornell University in 2018, advised by Professor Robert Kleinberg. Before joining Tsinghua, he spent one year at MIT Institute for Foundations of Data Science (MIFODS) as a postdoc researcher. He works on AI+Healthcare, AI Theory and Applied Category Theory.

Video (opens in new tab)

4/27/2023: Understanding Adam and AdamW through proximal updates, scale-freeness, and relaxed smoothness, Francesco Orabona

Abstract: Adam and AdamW are the most commonly used algorithms for training deep neural networks due to their remarkable performance. However, despite a massive amount of research, it is fair to say that we are still far from understanding the true reasons why they work so well. In this talk, I’ll show you some recent results on unique characteristics of Adam and AdamW.
First, I’ll show how AdamW can be easily understood as an approximation of a proximal update on the squared L2 regularizer. Next, I’ll show that, contrary to Adam, AdamW’s update is “scale-free”, i.e., its update is invariant to component-wise rescaling of the gradients. I’ll show how scale-freeness provides an automatic preconditioning and how it correlates with the better performance of AdamW over Adam on deep learning experiments. Finally, I’ll show the first analysis of a (minor) variant of Adam, that has a provably advantage over SGD for functions that satisfy a relaxed smoothness assumption, like the objective functions of Transformers.

Bio: Francesco Orabona is an Associate Professor of Electrical & Computer Engineering at Boston University. His research interests lie in online learning, optimization, and statistical learning theory. He obtained his Ph.D. from the University of Genova in 2007. He previously was an Assistant Professor of Computer Science at Stony Brook University, a Senior Research Scientist at Yahoo Labs, and a Research Assistant Professor at the Toyota Technological Institute at Chicago. He received a Faculty Early Career Development (CAREER) from NSF in 2021 and a Google Research Award in 2017.

Video (opens in new tab)

3/10/2023: Modeling Multiagent Game Dynamics: Approaches to Equilibrium Computation and Incentive Analysis, Xiaotie Deng

Abstract: This talk explores various research approaches to modeling the computation of equilibria and analysis of incentives in game dynamics. We discuss computational complexity, sequential and interactive optimization, and equilibrium analysis in multiagent systems.

Bio: Xiaotie Deng is a Chair Professor at Peking University with a Ph.D. from Stanford University. His research focuses on algorithmic game theory, particularly in the context of the Internet and Blockchain Economics. Deng has taught at several universities and is a fellow of the ACM, IEEE, and CSIAM. He is a foreign member of Academia Europaea and received the 2022 Test of Time Award from ACM SIGecom.

Video (opens in new tab)

1/12/2023: Passive and Active Multi-Task Representation Learning, Simon (Shaolei) Du

Abstract: Representation learning has been widely used in many applications. In this talk, I will present our work which uncovers when and why representation learning provably improves the sample efficiency, from a statistical learning point of view. Furthermore, I will talk about how to actively select the most relevant task to boost the performance.

Bio: Simon Shaolei Du is an assistant professor in the Paul G. Allen School of Computer Science & Engineering at the Universityof Washington. His research interests are broadly in machine learning, such as deep learning, representation learning, and reinforcement learning. Prior to starting as faculty, he was a postdoc at the Institute for Advanced Study of Princeton. He completed his Ph.D. in Machine Learning at Carnegie Mellon University. Simon’s research has been recognized by a Samsung AI Researcher of the Year Award, an NSF CAREER award, an Nvidia Pioneer Award, a AAAI New Faculty Highlights, and a Distinguished Dissertation Award honorable mention from CMU.

Video (opens in new tab)

12/22/2022: Reward-free Reinforcement Learning via Sample-Efficient Representation Learning, Yingbin Liang

Abstract: As reward-free reinforcement learning (RL) becomes a powerful framework for a variety of multi-objective applications, representation learning arises as an effective technique to deal with the curse of dimensionality in reward-free RL. However, the existing algorithms of representation learning in reward-free RL still suffers seriously from high sample complexity, although they are polynomially efficient. In this talk, I will first present a novel representation learning algorithm that we propose for reward-free RL. We show that such an algorithm provably finds near-optimal policy as well as attaining near-accurate system identification via reward-free exploration, with significantly improved sample complexity compared to the best-known result before. I will then present our characterization of the benefit of representation learning in reward-free multitask (a.k.a. meta) RL as well as the benefit of employing the learned representation from upstream to downstream tasks. I will conclude my talk with remarks of future directions. The work to be presented was jointly with Yuan Cheng (USTC), Ruiquan Huang (PSU), Dr. Songtao Feng (OSU), Prof. Jing Yang (PSU), and Prof. Hong Zhang (USTC).

Bio: Dr. Yingbin Liang is currently a Professor at the Department of Electrical and Computer Engineering at the Ohio State University (OSU), and a core faculty of the Ohio State Translational Data Analytics Institute (TDAI). She also serves as the Deputy Director of the AI-Edge Institute at OSU. Dr. Liang received the Ph.D. degree in Electrical Engineering from the University of Illinois at Urbana-Champaign in 2005, and served on the faculty of University of Hawaii and Syracuse University before she joined OSU. Dr. Liang’s research interests include machine learning, optimization, information theory, and statistical signal processing. Dr. Liang received the National Science Foundation CAREER Award and the State of Hawaii Governor Innovation Award in 2009. She also received EURASIP Best Paper Award in 2014.

Video (opens in new tab)

11/04/2022: Player-optimal Stable Regret for Bandit Learning in Matching Markets, Shuai Li

Abstract: The problem of matching markets has been studied for a long history in the literature due to its wide range of applications. Finding a stable matching is a common equilibrium objective in this problem. Since market participants are usually uncertain of their preferences, a rich line of recent works study the online setting where one-side participants (players) learn their unknown preferences from iterative interactions with the other side (arms). Most previous works in this line are only able to derive theoretical guarantees for player-pessimal stable regret, which is defined compared with the players’ least-preferred stable matching.

However, under the pessimal stable matching, players only obtain the least reward among all stable matchings. To maximize players’ profits, the player-optimal stable matching would be the most desirable Though Basu et al. [2021] successfully bring an upper bound for player-optimal stable regret, their result can be exponentially large if players’ preference gap is small. Whether a polynomial guarantee for this regret exists is a significant but still open problem. In this work, we provide a new algorithm and show that the optimal stable regret of each player can be upper bounded by O(K log T / ∆^2) where K is the number of arms, T is the horizon and ∆ is the players’ minimum preference gap. This result significantly improves previous works which either has a weaker player-pessimal stable matching objective or applies only for markets with special assumptions. When the preferences of participants satisfy some special conditions, our regret upper bound also matches the previously derived lower bound This work is accepted at SODA 2023.

Bio: Shuai Li is currently an Assistant Professor in the John Hopcroft Center of Shanghai Jiao Tong University. She received PhD degree in Computer Science from the Chinese University of Hong Kong, master degree in Mathematics from University of the Chinese Academy of Sciences and bachelor degree in Mathematics from Zhejiang University. Her research interests include machine learning theory, bandit algorithms and reinforcement learning algorithms. She has published 40+ papers in top machine learning conferences like ICML/NeurIPS/AAAI/IJCAI/KDD and serves as reviewers in these conferences. She is a recipient of Shanghai Sailing Program 2020 and Google PhD fellowship 2018.

Video

10/13/2022: What Should a Good Deep Neural Network Look Like? Insights from a Layer-Peeled Model and the Law of Equi-Separation, Weijie Su

Abstract: In this talk, we will investigate the emergence of geometric patterns in well-trained deep learning models by making use of a layer-peeled model and the law of equi-separation. The former is a nonconvex optimization program that models the last-layer features and weights. We use the model to shed light on the neural collapse phenomenon of Papyan, Han, and Donoho, and to predict a hitherto-unknown phenomenon that we term minority collapse in imbalanced training. This is based on joint work with Cong Fang, Hangfeng He, and Qi Long.

The law of equi-separation is a pervasive empirical phenomenon that describes how data are separated according to their class membership from the bottom to the top layer in a well-trained neural network. We will show that, through extensive computational experiments, neural networks improve data separation through layers in a simple exponential manner. This law leads to roughly equal ratios of separation that a single layer is able to improve, thereby showing that all layers are created equal. We will conclude the talk by discussing the implications of this law on the interpretation, robustness, and generalization of deep learning, as well as on the inadequacy of some existing approaches toward demystifying deep learning. This is based on joint work with Hangfeng He.

Bio: Weijie Su is an Associate Professor in the Wharton Statistics and Data Science Department and, by courtesy, in the Department of Computer and Information Science, at the University of Pennsylvania. He is a co-director of Penn Research in Machine Learning. Prior to joining Penn, he received his Ph.D. from Stanford University in 2016 and his bachelor’s degree from Peking University in 2011. His research interests span privacy-preserving data analysis, deep learning theory, optimization, high-dimensional statistics, and mechanism design. He is a recipient of the Stanford Theodore Anderson Dissertation Award in 2016, an NSF CAREER Award in 2019, an Alfred Sloan Research Fellowship in 2020, the SIAM Early Career Prize in Data Science in 2022, and the IMS Peter Gavin Hall Prize in 2022.

Video (opens in new tab)

09/22/2022: On the (Non)smoothness of Neural Network Training, Jingzhao Zhang

Abstract: In this talk, we will discuss the following question―why is neural network training non-smooth from an optimization perspective, and how should we analyze convergence for non smooth problems. We start by showing that the non-smoothness is essential to standard neural network training procedures, and that network training converges in an unstable manner. We then provide theoretical models for understanding why optimization in neural network is unstable, and how new definitions of convergence can reconcile theory with practice.

Bio: Jingzhao Zhang is an assistant professor at Tsinghua, IIIS. He graduated from MIT EECS under the supervision of Prof Ali Jadbabaie and Prof Suvrit Sra. His research focuses on providing theoretical justifications and analyses to practical large-scale optimization algorithms. He is also interested in machine learning applications, especially those involving dynamical system formulations.

Video (opens in new tab)

08/25/2022: Local Elasticity of Neural Networks and Its Inspired Theory, Zhun Deng

Abstract: In this talk, I will briefly review local elasticity of neural networks proposed by He et al. Then, based on that, I will introduce a new type of stability notion, which can improve over classical stability notions with respect to generalization behavior in certain situations. Specifically, among different notions of stability, uniform stability is arguably the most popular one, which yields exponential generalization bounds. However, uniform stability only considers the worst-case loss change (or so-called sensitivity) by removing a single data point, which is distribution-independent and therefore undesirable. There are many cases that the worst-case sensitivity of the loss is much larger than the average sensitivity taken over the single data point that is removed, especially in some advanced models such as random feature models or neural networks. Many previous works try to mitigate the distribution independent issue by proposing weaker notions of stability, however, they either only yield polynomial bounds or the bounds derived do not vanish as sample size goes to infinity. Given that, we propose locally elastic stability as a weaker and distribution-dependent stability notion, which still yields exponential generalization bounds. We further demonstrate that locally elastic stability implies tighter generalization bounds than those derived based on uniform stability in many situations by revisiting the examples of bounded support vector machines, regularized least square regressions, and stochastic gradient descent.

Bio: Zhun is a postdoctoral researcher with Toniann Pitassi (opens in new tab) and Richard Zemel (opens in new tab) at Columbia University, and also part of Simons Collaboration on the Theory of Algorithmic Fairness (opens in new tab). Previously, Zhun got his Ph.D. in Computer Science at Harvard University, advised by Cynthia Dwork (opens in new tab). His research interests lie at the intersection of theoretical computer science, machine learning, and social science. His work aims to make data science more trustworthy, statistically rigorous, and aligned with societal values. Here is the website: https://www.zhundeng.org (opens in new tab).

Video (opens in new tab)

08/04/2022: Toward Understanding Self-Supervised Pre-training, Tengyu Ma

Abstract: AI is undergoing a paradigm shift the rise of models that are pretrained with self-supervised learning and then adapted to a wide range of downstream tasks. Despite the unprecedented empirical success, why and how pretrained models work still largely remains a mystery. This talk will discuss recent works on analyzing contrastive learning, a family of popular self-supervised pretraining methods that learn visual representations/embeddings of images from unlabeled data. We will develop a framework that views contrastive learning as a parametric version of spectral clustering on a so-called population positive-pair graph. We will also analyze the adaptability of the representations and provide sample complexity bounds. Finally, I will briefly discuss two follow-up works that study self-supervised representations’ performance under imbalanced pretraining datasets and for shifting test distributions. Joint works with Jeff Z. Haochen, Colin Wei, Kendrick Shen, Robbie Jones, Ananya Kumar, Sang Michael Xie, Adrien Gaidon, and Percy Liang.

Bio: Tengyu Ma is an assistant professor of Computer Science and Statistics at Stanford University. He received his Ph.D. from Princeton University and B.E. from Tsinghua University. His research interests include topics in machine learning and algorithms, such as deep learning and its theory, non-convex optimization, deep reinforcement learning, representation learning, and high-dimensional statistics. He is a recipient of the ACM Doctoral Dissertation Award Honorable Mention, the Sloan Fellowship, and NSF CAREER Award.

Video (opens in new tab) | Slides (opens in new tab)

07/22/2022: Unveiling Transformers with LEGO, Sebastien Bubeck

Abstract:

The discovery of the transformer architecture was a paradigm shifting event for deep learning. However, these architectures are arguably even harder to understand than say convolutional neural networks. In this work we propose a synthetic task, called LEGO, to probe the inner workings of transformers. We obtain some insights on multi-head attention, the effect of pretraining, as well as overfitting issues. Joint work with Yi Zhang, Arturs Backurs, Ronen Eldan, Suriya Gunasekar, and Tal Wagner.

Bio: Sebastien Bubeck manages the Machine Learning Foundations team in MSR Redmond. He has worked on multi-armed bandits, convex optimization, online algorithms, and adversarial examples, winning best papers at COLT (2009 and 2016), ALT (2018), and NeurIPS (2018 and 2021). At the moment he is trying to understand Transformers.

Video (opens in new tab)