Experimentation Platform

Metric computation for multiple backends

If I have data from an A/B test and a favorite metric, it is easy to run a statistical test to see if the metric has changed significantly. If I need this metric in the future, it’s also easy to re-compute: I can save my code and re-use it later. But what if I need 100 metrics, with variance estimation, confidence intervals, and other stats? Well, this is getting challenging to manage. Still, with some dedication and manual work, I believe I can do this.

But what if I need to compare these metrics over data slices like day of week, device type, and who knows what else? What if I need to update these slices regularly? How about if people with different backgrounds (data scientists, product managers, developers) want to contribute to metric development? What if I need to maintain the same level of trustworthiness with every change? What if I work with several products owned by different teams, with data living in different fabrics like Azure Data Lake, Azure Blob Storage, or a SQL database? And on top of it all, what if these teams migrate their data from one fabric to another every now and again?

At Microsoft, this is exactly the challenge that Experimentation Platform (ExP) is facing. ExP works with dozens of products across Microsoft (Bing, Office, Xbox, MS Teams, VS Code, Microsoft Support to name just a few). There are hundreds of practitioners implementing and maintaining metrics. Some teams have thousands of metrics, their data is in different fabrics, and these teams run hundreds of A/B tests a month that all require detailed and trustworthy statistical analyses.

The questions above are fundamentally hard, and there are many approaches an A/B testing team can take to address them. In this blog post, we will describe one of the key engineering components that we use to address these questions. ExP’s A/B testing tools have been developed and iterated on for fourteen years. We will share three key learnings from those years and iterations, and we hope the reader will find them useful, too.

Overview of our approach

If we had to distill our engineering answer to the challenges described above into just two words, it would be “code generation.” What does this mean? Let us unpack.

Compute pipeline overview

Over the years, the ExP team has developed a Domain Specific Language (DSL) for defining metrics (see also [1]). We will give some examples below, but for now let’s just say that this language allows users to define metrics regardless of where their data lives, or what compute fabric they are using. In other words, we can say this language is fabric-agnostic. Treating the DSL as a “black box” for now, the overall workflow is as follows.

First, platform users create a metric set (really a git repo) with the DSL code defining the metrics. These metrics are reused between A/B tests, so eventually users don’t have to change them often. Behind the scenes, the code is compiled into artifacts (“metrics plans”) and published to a “Metric Set Service”.

Overview of ExP compute pipeline

What happens when a user requests an analysis? Each request describes which fabric should be used, which metrics to compute, what’s the time range for the analysis, which data slices should be computed, etc. This request is sent to the compute pipeline, which then fetches the correct metrics plan from the Metric Set Service and sends it together with the request information to the Code Gen. From the request, the Code Gen knows which fabric to produce a script for, among other things listed above. Remember that metrics plans are fabric-agnostic, so the same metrics plan can be used for any data fabric. It all depends on the request.

Once the script is produced, the pipeline runs the corresponding job, fetches the result, and displays it to the user.
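
The request flow described above can be sketched roughly as follows. This is a minimal illustration, not ExP's implementation: every name in it (AnalysisRequest, MetricSetService, run_analysis, the "sql" fabric label) is hypothetical, and the Code Gen and job runner are replaced by stubs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AnalysisRequest:
    """Hypothetical analysis request; all field names are assumptions."""
    metric_set: str
    fabric: str
    metrics: List[str]
    slices: List[str]
    start_date: str
    end_date: str


class MetricSetService:
    """Toy in-memory stand-in for the Metric Set Service."""

    def __init__(self) -> None:
        self._plans: Dict[str, dict] = {}

    def publish(self, metric_set: str, plan: dict) -> None:
        self._plans[metric_set] = plan

    def fetch(self, metric_set: str) -> dict:
        return self._plans[metric_set]


def run_analysis(
    request: AnalysisRequest,
    service: MetricSetService,
    generate_script: Callable[[dict, AnalysisRequest], str],
    run_job: Callable[[str, str], dict],
) -> dict:
    plan = service.fetch(request.metric_set)   # fabric-agnostic metrics plan
    script = generate_script(plan, request)    # the fabric is chosen by the request
    return run_job(request.fabric, script)     # run the job and fetch the result


# Stubs standing in for the real Code Gen and job runner.
service = MetricSetService()
service.publish("website", {"ClicksPerUser": "<parsed metric tree>"})
request = AnalysisRequest("website", "sql", ["ClicksPerUser"], ["Variant"],
                          "2021-01-01", "2021-01-07")
result = run_analysis(
    request,
    service,
    generate_script=lambda plan, req: f"-- {req.fabric} script for {', '.join(req.metrics)}",
    run_job=lambda fabric, script: {"fabric": fabric, "script": script},
)
```

The point of the sketch is that the stored plan never mentions a fabric; only the request does.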

Key learnings

Now that we know how the system fits together, let’s explore the learnings we mentioned in the introduction:

1. Use a DSL for metric definition. This makes it easier to implement and maintain even a large number of metrics. It also simplifies and democratizes the metric definition process: you don’t need to be a data scientist or a developer to create metrics.
2. Automate code generation. Since the DSL is designed specifically for metric definitions, it is possible to automate statistical analysis, data slicing, etc. Automation means that it is easy to re-configure analyses: changing data slices, metrics or filters is just a few clicks away. Even more importantly, the logic can be robustly tested to ensure trustworthiness. As a result, even people with no stats background will reliably get trustworthy metrics.
3. Design components to be fabric-agnostic. This separates the concern of what to compute from the concern of exactly how to compute it. This helps with products whose data is in different compute fabrics, or when a data migration is needed. Indeed, you can just re-use your metrics “for free” instead of re-writing them all in a new language.

In the following sections we will discuss these three learnings in a bit more depth (see also [2] for a deeper discussion of the motivation).

Metrics definition language

In our DSL, metrics are defined at levels. These levels are just columns in the data whose values can be used as units of analysis. Let’s take an example of a website, and let’s say we have three aggregation levels: Page, Session, and User, represented by columns in the data. Each user can have several sessions on the site, during each session the user can view several pages, and on each of the pages the user can perform several events (e.g., clicks, scrolls, mouse hovers). However, each page view belongs to a single session, and each session belongs to a single user. In our DSL, metrics could be “per page”, “per session” or “per user” in this case.

Let us consider a simple example: average latency per user. There are better ways to measure latency (e.g., as percentiles across all latency values, not per user). We chose this metric just to illustrate the DSL. In the DSL it could be written as

LatencyPerUser = Avg(Sum<User>(Latency))

Here both User and Latency are columns in the data, and User has been marked as one of the aggregation levels. In the metric, we first sum all the Latency values from all the data rows for each user (via Sum<User>). Then, we compute the average value across all users. Whenever there is no “<…>” near an aggregation, we assume that this aggregation is done across all values. Equivalent SQL code could look something like this:

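-- Extract only the columns the metric needs.
Events =
SELECT Latency, User
FROM Data;

-- One row per user: total latency for that user.
UserLevel =
SELECT SUM(Latency) AS UserLatency
FROM Events
GROUP BY User;

-- Average of the per-user totals across all users.
OutputLevel =
SELECT AVG(UserLatency) AS LatencyPerUser
FROM UserLevel;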
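
To make the two-step semantics concrete (sum per user first, then average across users), here is a small Python sketch that evaluates the same metric on a few made-up rows. The data values and the function name latency_per_user are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Toy rows; the values are invented for illustration.
rows = [
    {"User": "u1", "Latency": 100.0},
    {"User": "u1", "Latency": 300.0},
    {"User": "u2", "Latency": 200.0},
]


def latency_per_user(rows):
    """Inner aggregation: sum per user. Outer aggregation: average over users."""
    per_user = defaultdict(float)
    for row in rows:
        per_user[row["User"]] += row["Latency"]
    return mean(per_user.values())


print(latency_per_user(rows))  # 300.0: u1 totals 400.0, u2 totals 200.0
```

Note that the result differs from the plain average over rows (200.0 here), which is why the aggregation level matters.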

What are some key takeaways from this? First, having a dedicated language saves time; it’s more concise than SQL, for example. Second, it makes metrics implementation simpler: it’s (hopefully) easier to pick up and is also less error prone. But what’s maybe even more important is that it allows us to automate a lot of the analysis. The rest of the blog post will describe exactly that.

Object model for the language

Metrics written in the DSL get parsed into syntax trees. Once again, we will give an overview and consider an example to illustrate the main points.

Metrics as trees

So how are metrics represented as trees? There are two kinds of objects: expressions and tables. Both have a collection of “parents” (expressions or tables, respectively), and each expression belongs to a table. Roughly speaking, expressions correspond to the expressions in the metric computation (e.g., arithmetic operations, aggregations, string operations etc.) while tables determine the overall flow of the computation (think of a sequence of SELECT statements with GROUP BY’s, JOINs, UNIONs, etc.).

Let’s consider another example. A metric computing “average number of clicks per user” on a website could look in our DSL something like:

ClicksPerUser = Avg(Sum<User>(Event == "Click" ? 1 : 0))

with Event and User being columns in the data. Let’s walk through this metric. First, for each data row we assign a binary value based on whether the Event in that row was a click or not (via Event == "Click" ? 1 : 0). Then, we sum those values across all rows for each user. This counts the rows where Event was a click. Finally, we average that count across all users (via the outer Avg operation).

Suppose we also have a column in the data called Variant containing values “Treatment” and “Control”, and we want to use it as a data slice to compute the value of the metric for T and C separately.
The parsed metrics plan containing the ClicksPerUser metric and Variant data slice would look as follows:

Figure: Example of a metrics plan containing a single metric, ClicksPerUser.

There are four tables. The first extracts the data. The second (we called that table “Base”) adds a new column computing Event == "Click" ? 1 : 0. The third holds a single value per user (their click count). Finally, the Output table contains the average count across users. Every data slice and every part of the metric definition is a kind of expression. For example, Event and Variant are data source column expressions, == is a binary operation expression, "Click", 1 and 0 are constants, and Sum and Avg are aggregation expressions.
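
One way to picture this object model is with two small Python classes. This is only a sketch under assumptions: the class layout, the "conditional" kind, and the table names Extract and UserLevel are invented here (only “Base” and Output are named in the text), and the real metrics plan surely carries more information.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Expression:
    kind: str            # "column", "constant", "binary", "conditional", "aggregation"
    value: object        # column name, constant value, or operator name
    parents: List["Expression"] = field(default_factory=list)


@dataclass
class Table:
    name: str
    parents: List["Table"] = field(default_factory=list)
    expressions: Dict[str, Expression] = field(default_factory=dict)  # alias -> expression
    group_by: List[str] = field(default_factory=list)


def build_clicks_per_user_plan() -> Table:
    """Build the four tables for ClicksPerUser, sliced by Variant."""
    extract = Table("Extract", expressions={
        name: Expression("column", name) for name in ("Event", "User", "Variant")
    })
    is_click = Expression("conditional", "?:", [
        Expression("binary", "==",
                   [extract.expressions["Event"], Expression("constant", "Click")]),
        Expression("constant", 1),
        Expression("constant", 0),
    ])
    base = Table("Base", [extract], {"IsClick": is_click})
    clicks = Expression("aggregation", "Sum", [is_click])
    user_level = Table("UserLevel", [base], {"Clicks": clicks},
                       group_by=["User", "Variant"])
    average = Expression("aggregation", "Avg", [clicks])
    return Table("Output", [user_level], {"ClicksPerUser": average},
                 group_by=["Variant"])


def lineage(table: Table) -> List[str]:
    """Walk from the output table back to the data extraction."""
    names = [table.name]
    while table.parents:
        table = table.parents[0]
        names.append(table.name)
    return names[::-1]


print(lineage(build_clicks_per_user_plan()))
```

Nothing in these objects says how the computation runs; they only record what depends on what.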

How would these objects convert into, say, SQL code? Tables would roughly correspond to SELECT statements, and expressions are just lines in these statements. The data slice Variant would be part of the GROUP BY clause. We hope the example below is self-explanatory.

Figure: Generated code for the ClicksPerUser metric, illustrating how it could be translated into SQL.
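
As a rough illustration of that translation, the sketch below turns a simplified version of the plan into SQL text in the same "Name = SELECT ...;" style used earlier. The tuple encoding of expressions, the table names, and the emit function are assumptions made for this example; dialect details (for instance, how AVG treats integers) are ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def render(expr) -> str:
    """Render one expression tree: a column name, an int, or (operator, *operands)."""
    if isinstance(expr, int):
        return str(expr)
    if isinstance(expr, str):
        return expr  # column reference
    op, *args = expr
    if op == "const":
        return "'" + args[0].replace("'", "''") + "'"
    if op == "==":
        return f"{render(args[0])} = {render(args[1])}"
    if op == "?:":
        cond, if_true, if_false = (render(a) for a in args)
        return f"CASE WHEN {cond} THEN {if_true} ELSE {if_false} END"
    if op in ("SUM", "AVG"):
        return f"{op}({render(args[0])})"
    raise ValueError(f"unsupported operator: {op}")


@dataclass
class Table:
    name: str
    source: str
    select: Dict[str, object]            # alias -> expression
    group_by: List[str] = field(default_factory=list)


def emit(tables: List[Table]) -> str:
    """Each table becomes one SELECT statement; slices end up in GROUP BY."""
    statements = []
    for t in tables:
        columns = list(t.group_by)
        columns += [f"{render(e)} AS {alias}" for alias, e in t.select.items()]
        sql = f"{t.name} =\nSELECT {', '.join(columns)}\nFROM {t.source}"
        if t.group_by:
            sql += f"\nGROUP BY {', '.join(t.group_by)}"
        statements.append(sql + ";")
    return "\n\n".join(statements)


plan = [
    Table("Extract", "Data", {"Event": "Event", "User": "User", "Variant": "Variant"}),
    Table("Base", "Extract", {
        "User": "User",
        "Variant": "Variant",
        "IsClick": ("?:", ("==", "Event", ("const", "Click")), 1, 0),
    }),
    Table("UserLevel", "Base", {"Clicks": ("SUM", "IsClick")}, ["User", "Variant"]),
    Table("Output", "UserLevel", {"ClicksPerUser": ("AVG", "Clicks")}, ["Variant"]),
]
print(emit(plan))
```

Adding another data slice in this sketch means adding a column name to the group_by lists; the metric definition itself stays untouched.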

How is this useful

What’s the point of it all? Well, the key point is that these trees describe a general computation without any specifics of how this computation will be carried out!

Since the metric computation is just a tree now, we can programmatically manipulate the tree to modify the computation. This might seem trivial, but it’s actually very powerful! For example, note that the metric definition in the DSL didn’t have any stats, e.g., variance calculation. This is deliberate: since we can change the trees representing the computations, we can automatically add all the necessary expressions to compute the metric’s variance! This means that a person implementing the metrics doesn’t need to think about (or even know about!) variance estimation. This also means that the stats logic is all in one place: inside the code gen. It can thus be thoroughly tested, and the tests are universal for all the customers of ExP. Similar tree manipulations provide automatic data slicing for computing metrics over only a subset of users, or days, or web pages, etc.
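
The idea of enriching a computation automatically can be sketched with a tiny transformer. For every average in an output table, it adds a sample variance and a count next to it, so the metric author never writes them. This is a simplified stand-in, assuming a plain sample variance over per-user values; the post does not describe ExP's actual variance estimation, and the names add_variance and evaluate are hypothetical.

```python
import statistics
from typing import Dict, List, Tuple

Select = Dict[str, Tuple[str, str]]  # alias -> (aggregation, input column)


def add_variance(select: Select) -> Select:
    """Return a new select list with variance and count added for each average."""
    enriched = dict(select)
    for alias, (op, operand) in select.items():
        if op == "AVG":
            enriched[f"{alias}_Var"] = ("VAR", operand)
            enriched[f"{alias}_N"] = ("COUNT", operand)
    return enriched


# Toy evaluator so the effect of the transformation can be checked.
AGGREGATIONS = {"AVG": statistics.mean, "VAR": statistics.variance, "COUNT": len}


def evaluate(select: Select, columns: Dict[str, List[float]]) -> Dict[str, float]:
    return {alias: AGGREGATIONS[op](columns[operand])
            for alias, (op, operand) in select.items()}


user_defined = {"ClicksPerUser": ("AVG", "Clicks")}      # what the metric author wrote
enriched = add_variance(user_defined)                     # what actually gets computed
result = evaluate(enriched, {"Clicks": [1.0, 3.0, 2.0, 2.0]})
print(result)
```

Because the enrichment lives in one function, it can be tested once and then applied to every metric of every customer.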

Generating code for many fabrics

As we established above, the code gen is generating scripts given two pieces of information: the metrics plan and the analysis request (also called config). Metrics plans consist of syntax trees describing all possible metrics, while configs describe the specifics of the analysis: time range of the analysis, target compute fabric, which of the metrics to compute, which data slices to use, etc. So, what’s happening inside of the code gen?

Key components of Code Gen

There are three key structural components:

1. Transformers. They take a metrics plan and return a new modified metrics plan.
2. Factories. They take the config, and return a list of transformers based on what they see in the config.
3. Emitter. It takes a metrics plan and generates code from it.

The general flow is as follows. At the core of the flow is the “master coordinator”. It knows the logical transformations that should be applied to the metrics plan before emitting the code, but it doesn’t know exactly which transformers should be used at each step. Logically, the “master coordinator” has three main parts:

1. General transformations, e.g., keeping only the requested metrics, removing un-used data sources.
2. Smarts, e.g., adding stats and data slicing.
3. Fabric-specific stuff, e.g., in U-SQL one can take max of bool but in Azure Data Explorer one needs to convert bool to int first.

Each part can have many transformers. The exact set of transformers can differ depending on the target fabric and other information in the config. To separate the concern of knowing the logical flow from the concern of knowing the exact transformers, the “master coordinator” really describes a list of factories. Each factory is responsible for a single logical step. Given the config, each factory creates the correct chain of transformers to be applied to the metrics plan for this analysis. The last factory is responsible for producing the correct code emitter.
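
The split between factories and transformers can be sketched as follows. Everything here is hypothetical: the toy plan encoding, the fabric labels "usql" and "kusto", the generic ToInt node, and all function names. The emitter factory is left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Config:
    fabric: str
    metrics: List[str]


Plan = Dict[str, tuple]                       # metric name -> (aggregation, operand)
Transformer = Callable[[Plan], Plan]
Factory = Callable[[Config], List[Transformer]]


def keep_requested(metrics: List[str]) -> Transformer:
    """General transformation: drop metrics that were not requested."""
    def transform(plan: Plan) -> Plan:
        return {name: expr for name, expr in plan.items() if name in metrics}
    return transform


def cast_bool_before_max(plan: Plan) -> Plan:
    """Fabric-specific transformation: wrap bool operands of Max in a ToInt node."""
    def fix(expr: tuple) -> tuple:
        op, operand = expr
        if op == "Max" and operand[0] == "Col" and operand[2] == "bool":
            return ("Max", ("ToInt", operand))
        return expr
    return {name: fix(expr) for name, expr in plan.items()}


def general_factory(config: Config) -> List[Transformer]:
    return [keep_requested(config.metrics)]


def fabric_factory(config: Config) -> List[Transformer]:
    return [cast_bool_before_max] if config.fabric == "kusto" else []


# The coordinator only lists the logical steps; factories pick the transformers.
MASTER_COORDINATOR: List[Factory] = [general_factory, fabric_factory]


def prepare(plan: Plan, config: Config) -> Plan:
    for factory in MASTER_COORDINATOR:
        for transformer in factory(config):
            plan = transformer(plan)
    return plan


plan = {
    "AnyActive": ("Max", ("Col", "IsActive", "bool")),
    "TotalClicks": ("Sum", ("Col", "Clicks", "int")),
}
print(prepare(plan, Config("kusto", ["AnyActive"])))
print(prepare(plan, Config("usql", ["AnyActive"])))
```

The same input plan yields different prepared plans for the two configs, while the coordinator's list of steps stays the same.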

Figure: Overview of the general structure of the Code Generation service.

The code emitter is always the last step. Emitters are deliberately naïve, only describing how to write operations in the given fabric. For example, the ToString operation in U-SQL is column.ToString() but in Azure Data Explorer it’s tostring(column).
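
A naïve emitter of this kind could look like the sketch below, which covers only the ToString example from the text. The class names, the tuple encoding of expressions, and the emit function are assumptions for illustration.

```python
class Emitter:
    """Knows only how to spell each operation in one fabric."""

    def to_string(self, operand: str) -> str:
        raise NotImplementedError


class USqlEmitter(Emitter):
    def to_string(self, operand: str) -> str:
        return f"{operand}.ToString()"


class AzureDataExplorerEmitter(Emitter):
    def to_string(self, operand: str) -> str:
        return f"tostring({operand})"


def emit(expr: tuple, emitter: Emitter) -> str:
    """Walk a tiny expression tree of the form ("Col", name) or ("ToString", child)."""
    op, arg = expr
    if op == "Col":
        return arg
    if op == "ToString":
        return emitter.to_string(emit(arg, emitter))
    raise ValueError(f"unsupported operation: {op}")


expr = ("ToString", ("Col", "UserId"))
print(emit(expr, USqlEmitter()))               # UserId.ToString()
print(emit(expr, AzureDataExplorerEmitter()))  # tostring(UserId)
```

Supporting another fabric in this sketch means adding one more small Emitter subclass; the tree and the walk stay unchanged.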

Key observations about the design:

1. Only factories and emitters are fabric-aware. Everything else is generic.
2. Master coordinator describes the logical flow with factories, separating what to do from how to do it.
3. Emitters are naïve, which makes it easy to add support for new fabrics if needed.


It took the ExP team many years to arrive at the current state of the system. Not every team is ready to go full “automatic code gen for many fabrics”, or even needs to. Yet we believe that some of the key learnings from ExP would be useful for other teams, too. To emphasize the three key learnings again:

1. Separate metric writing from metric computation, e.g., via a DSL, interfaces, etc. This helps to democratize metric creation to all employees.
2. Automatically enrich the computations with stats, data slicing, and other “smarts”. This simplifies and democratizes the process of defining metrics, as well as makes it more testable and less error prone.
3. When creating such automated systems for metric computations, try to make them as fabric-agnostic as possible, and separate fabric-specific components from generic components. This makes the systems flexible and easier to re-use.

– Craig Boucher, Ulf Knoblich, Dan Miller, Sasha Patotski, Amin Saied, Microsoft Experimentation Platform


[1] S. Gupta, L. Ulanova, S. Bhardwaj, P. Dmitriev, P. Raff, and A. Fabijan, “The Anatomy of a Large-Scale Experimentation Platform,” in 2018 IEEE International Conference on Software Architecture (ICSA), Apr. 2018, no. May, pp. 1–109, doi: 10.1109/ICSA.2018.00009.
[2] C. Boucher, U. Knoblich, D. Miller, S. Patotski, A. Saied, and V. Venkateshaiah, “Automated metrics calculation in a dynamic heterogeneous environment,” extended abstract for 2019 MIT CODE, Nov. 2019, arXiv: https://arxiv.org/abs/1912.00913