Jonathan Mace

Senior Researcher

About

I am a Senior Researcher in the Cloud Reliability Group at Microsoft Research, Redmond. I joined MSR in 2023; before that I was faculty at the Max Planck Institute for Software Systems (2018-2022), and a PhD student at Brown University (2011-2018).

For the most up-to-date information, see my personal website at jonathanmace.github.io.


Our group is hiring research interns! Feel free to drop me an e-mail if you are a PhD student and your research interests align with those of our group!


My research focuses on designing and building reliable, observable, self-managing cloud systems. A central goal for me is to make it easier to operate large, complicated software systems and to understand their behavior at runtime. Currently, I am working at the intersection of observability, semantic modeling, and agentic AI.

A few recent project highlights include:

  • Telemeta extracts and indexes semantic models from large-scale observability data, enabling accurate and reliable AI agents for cloud operations. This is an ongoing project I lead at Microsoft Research, so get in touch if you’re interested in internships or collaborations!
  • Blueprint is an extensible compiler and benchmark suite for microservice applications. It simplifies prototyping by making it easy to reconfigure infrastructure choices without rewriting application code. Check out the project on GitHub.
  • Hindsight is a distributed tracing framework for edge-case tracing, i.e., capturing detailed traces for rare and outlier requests without the data loss of sampling-based systems. It combines per-node telemetry history, programmatic symptom detection, and rapid distributed retrieval. Hindsight appeared at NSDI 2023; code is on GitLab.
  • Clockwork is a DNN serving system designed for predictable performance. By eliminating sources of variability and centralizing scheduling and admission control, Clockwork achieves extremely tight tail latency. It received the Distinguished Artifact Award at OSDI 2020; code is on GitLab.
  • Pivot Tracing is a cross-component monitoring framework for distributed systems. Troubleshooting cross-component problems often requires information that is inaccessible due to a lack of end-to-end visibility. Pivot Tracing addresses this by combining causal metadata propagation with dynamic instrumentation, enabling operators to define, measure, and aggregate metrics across component boundaries using a simple SQL-like interface. It received the Best Paper Award at SOSP 2015; code is on GitHub.