A/A’/B Testing: Evaluating Microsoft Teams across Build Releases
Microsoft Teams is a communication platform that integrates meetings, chat, calls, and collaboration in one place. The application updates multiple times a month, shipping new features and iterative improvements to existing ones. To ensure a high-quality user experience across such frequent updates, the team needs to actively monitor the quality of each new build release.
A/B testing is the gold standard for comparing product variants. As the Microsoft Teams Experimentation team, we have run hundreds of A/B tests. The best practice we always follow is to test one feature, or a combination of interacting features, at a time. In that sense, A/B testing is like a ‘unit-testing’ tool. In practice, A/B testing is rarely used to compare whole builds, because each build integrates multiple feature changes and it is hard to figure out which features cause regressions, if any. However, we can attempt to use A/B testing as an integration-testing tool for build comparison.
In this scenario, each user is randomly presented with either the current or the next build release. We then evaluate whether the variants produce statistically significantly different results in key metrics. During our analysis, we identified two factors that introduce bias, making the comparison invalid and the insights unusable. In this blog post, we explain why the issue exists and introduce an A/A’/B testing framework that successfully enables valid build comparison in Microsoft Teams.
Why is comparing builds through A/B testing insufficient?
We started by running an A/A test via the existing A/B testing framework. The test is a sanity check to determine whether testing between builds would provide useful insights. Users in the control variant continue using the current build. Users in treatment receive a request to update to the next build, which is identical to the current build except for the build version number. With no treatment effect introduced, we expected to see no differences between the results of the two variants. Yet we observed many statistically significant metric movements. Why did those false positives show up? After investigation, we identified two factors that can introduce bias: penetration difference and update effect (the reinstall-and-restart effect).
It takes time for the next build to penetrate across the treatment users. While an A/B test is running, the overall traffic volumes of the two variants are close, but their compositions are quite different. Let’s look at the example in Figure 2. Assume v is the current build version for A (the control variant), and v+1 is the next build released to B (the treatment variant). On day 0, one hundred percent of users in both A and B are on build v. From day 1 onward, users in B consist of two groups: those still on build v and those on build v+1. The share of the latter group increases as the test runs longer and eventually approaches 100%, but the time to reach that point depends on how quickly build v+1 penetrates across users in B. If most users are daily users, a high enough share may be reached quickly; otherwise, it can take weeks or even months.
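To make this dynamic concrete, here is a small back-of-the-envelope sketch of how penetration might grow over time. The per-day update probabilities and the daily/weekly user split are illustrative assumptions, not measured Teams numbers.

```python
# Hypothetical sketch of build v+1 penetration in the treatment variant.
# Assumptions (illustrative only): daily users update with probability 0.9
# per day, weekly users with probability 0.2 per day.

def penetration(days, p_update, share):
    """Fraction of all treatment users, from one segment, on build v+1."""
    return share * (1 - (1 - p_update) ** days)

def total_penetration(days, daily_share=0.6, weekly_share=0.4):
    """Overall fraction of treatment users on build v+1 after `days` days."""
    return (penetration(days, 0.9, daily_share)
            + penetration(days, 0.2, weekly_share))

for d in (1, 7, 30):
    # Penetration climbs quickly for daily users, slowly for weekly users.
    print(f"day {d}: {total_penetration(d):.0%} of treatment on v+1")
```

Under these assumed rates, penetration is dominated by daily users early on and only approaches 100% once the slower-updating segments catch up.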
Impact on analysis
We have two options for performing the comparison: filtered analysis and standard analysis. Filtered analysis drills down to user activities on the target builds; in the figure, those are the activities covered by the blue boxes in A and the grey boxes in B. Standard analysis includes all traffic in both variants, comparing the activities covered by the blue boxes in A with those covered by both the grey AND blue boxes in B.
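The two options can be sketched in code. The activity records and field names below are assumptions for illustration, not the actual Teams telemetry schema.

```python
# Illustrative activity records: variant, build version, and a metric value.
# Field names ("variant", "build", "value") are assumed for this sketch.
activities = [
    {"variant": "A", "build": "v",   "value": 1.0},
    {"variant": "B", "build": "v",   "value": 1.1},  # B user not yet updated
    {"variant": "B", "build": "v+1", "value": 1.3},  # B user on the next build
]

def standard_analysis(rows):
    """Compare ALL traffic in A against ALL traffic in B."""
    a = [r["value"] for r in rows if r["variant"] == "A"]
    b = [r["value"] for r in rows if r["variant"] == "B"]
    return sum(a) / len(a), sum(b) / len(b)

def filtered_analysis(rows):
    """Compare only the target builds: build v in A vs build v+1 in B."""
    a = [r["value"] for r in rows if r["variant"] == "A" and r["build"] == "v"]
    b = [r["value"] for r in rows if r["variant"] == "B" and r["build"] == "v+1"]
    return sum(a) / len(a), sum(b) / len(b)
```

Filtered analysis isolates the builds of interest but conditions on who has updated; standard analysis keeps the randomized populations intact but mixes builds inside B.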
Filtered analysis is a direct and intuitive way to compare builds, but selection bias exists between current-build users in A and next-build users in B, so we should not directly compare those user groups. For example, daily users are very likely to update within 24 hours, while weekly users may take up to a week to upgrade. This means that on day 1, the average next-build user will be more active than the average current-build user. Instead of measuring outcome differences between builds, the comparison can be dominated by the characteristic differences between more engaged and less engaged users.
To resolve the issue, we can use standard analysis instead. Since we don’t filter out any users, the user populations are identical on average, so selection bias is no longer a concern. But because the analysis includes non-targeted users, it dilutes the treatment effect.
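The dilution can be approximated with a simple calculation: if the not-yet-updated users in B behave like control users, the observed delta under standard analysis shrinks roughly in proportion to penetration. The numbers below are illustrative, not Teams data.

```python
# Back-of-the-envelope sketch of treatment-effect dilution under standard
# analysis, assuming not-yet-updated users in B behave like control users.

def observed_delta(true_delta, penetration):
    """Diluted effect when only `penetration` of B is on the next build."""
    return penetration * true_delta

# A true 2% regression looks like only 1% when half of B has updated.
print(observed_delta(0.02, 0.5))  # 0.01
```

This is why low penetration makes regressions harder to detect under standard analysis even when randomization is intact.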
To wrap up, penetration difference introduces bias into filtered analysis but not standard analysis. However, standard analysis still does not work, because the update effect is another key factor introducing bias.
Users must reinstall and restart the Microsoft Teams application to update to a new build version. After reinstallation and restart, the application’s memory usage and performance profile are reset. For users in the control variant, memory usage has been accumulating since the application was launched, which can increase the memory consumed by the application. In contrast, memory usage is significantly reduced after the reinstall and restart in treatment. That difference can in turn lead to secondary effects on application performance and user engagement. Therefore, the build comparison measures not only the differences between the results of the two builds, but also the impact of reinstallation and restart.
In the A/A test mentioned earlier, we observed statistically significant metric movements even when performing standard analysis. This indicates that the update effect was the main cause of the gap between builds.
We proposed several methods to mitigate the impact of penetration difference and update effect. The key idea is to only include users who have experienced an application restart or update in the analysis.
Triggered analysis
Triggered analysis drills down to the activities of users who have restarted, regardless of whether they received the update. This alleviates the impact of the update effect, though it ignores the impact of reinstalling, which plays a less essential role. Triggered analysis requires no modification to the A/B testing framework, but we may still encounter selection bias: because the treatment variant proactively sends update requests to its users, their probability of restarting the application is higher than in control.
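A rough sketch of triggered analysis, restricting both variants to users who restarted during the test; the `restarted` flag and record layout are assumed telemetry fields for illustration.

```python
# Illustrative user records; the "restarted" flag is an assumed field
# marking users who restarted the app during the test window.
users = [
    {"variant": "A", "restarted": True,  "value": 1.0},
    {"variant": "A", "restarted": False, "value": 0.8},  # excluded
    {"variant": "B", "restarted": True,  "value": 1.2},
]

def triggered_means(rows):
    """Per-variant means over restarted users only, in both variants."""
    out = {}
    for v in ("A", "B"):
        vals = [r["value"] for r in rows
                if r["variant"] == v and r["restarted"]]
        out[v] = sum(vals) / len(vals)
    return out
```

The caveat from the text applies: if treatment users restart more often (because they are prompted to update), conditioning on `restarted` reintroduces a selection bias between the two filtered populations.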
Forced update in control variant
Alternatively, we can induce a counterfactual reinstall and restart in the control variant. This forces users in control to undergo the same process as those in treatment. The downside of this method is that we need extra logic to track the counterfactual update; otherwise, users could get stuck in an infinite reinstall loop.
Forced update in additional control variant (A/A’/B testing)
Building on the previous option, we can introduce a Custom Control variant with a forced update. In this way, we run A/A’/B tests instead of traditional A/B tests. Figure 3 shows how it works. In an A/A’/B test, A is the Standard Control variant with the build in production (version v). A’ is the Custom Control variant with the exact same build as A, except that the build version is bumped to v’. B is the Treatment variant asking users to update to the next build (version v+1). The comparison between A and A’ is mainly used as a sanity check, whereas the comparison between A’ and B is used to derive user-experience insights that feed into ship decisions.
To avoid diluting the treatment effect, we can perform filtered analysis drilling down to user activities on versions v’ and v+1. Let’s use the traffic composition in Figure 4 to illustrate. For the A/A’ comparison, we perform standard analysis, comparing the blue boxes in A with the grey AND blue boxes in A’. For the A’/B comparison, we perform filtered analysis, comparing ONLY the grey boxes in A’ and B.
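The two comparisons can be sketched as selections over activity records. Build labels follow the post (A serves v, A’ serves the duplicated build v’, B serves v+1); the record layout is an assumption for illustration.

```python
# Illustrative traffic in an A/A'/B test; each record is one activity.
activities = [
    {"variant": "A",  "build": "v"},
    {"variant": "A'", "build": "v"},    # A' user not yet updated to v'
    {"variant": "A'", "build": "v'"},
    {"variant": "B",  "build": "v"},    # B user not yet updated to v+1
    {"variant": "B",  "build": "v+1"},
]

def a_vs_a_prime(rows):
    """Standard analysis (sanity check): ALL of A vs ALL of A'."""
    return ([r for r in rows if r["variant"] == "A"],
            [r for r in rows if r["variant"] == "A'"])

def a_prime_vs_b(rows):
    """Filtered analysis: only v' activities in A' vs only v+1 in B."""
    return ([r for r in rows if r["variant"] == "A'" and r["build"] == "v'"],
            [r for r in rows if r["variant"] == "B" and r["build"] == "v+1"])
```

Because both filtered groups (v’ in A’ and v+1 in B) have gone through the same reinstall-and-restart process, the update effect and the penetration-driven selection bias largely cancel between them.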
We selected A/A’/B testing
We selected the A/A’/B testing proposal due to its simplicity of implementation and analysis.
Let’s revisit the A/A test mentioned at the beginning, for which we had observed statistically significant differences during analysis. We ran the test again using the proposed framework. In the A versus A’ comparison, we performed standard analysis to avoid selection bias. About 30% of metrics showed highly statistically significant movements at a significance level of 0.001 (much stricter than the commonly used 0.05), so those movements were likely true positives; such a big gap was mainly caused by the update effect. In the A’ versus B comparison, the proportion of moved metrics was close to the false positive rate (the significance level). This A/A test validated that the framework does work for build comparison.
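The sanity-check logic above can be illustrated with synthetic data: under a true A/A comparison, the share of metrics moving at significance level alpha should be close to alpha. This sketch uses a large-sample z-test and simulated Gaussian metrics; it does not model real Teams telemetry.

```python
import random
from statistics import NormalDist, mean, stdev

def two_sample_p(x, y):
    """Approximate two-sample z-test p-value (large samples assumed)."""
    se = (stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y)) ** 0.5
    z = (mean(x) - mean(y)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_metrics=500, n_users=200, alpha=0.001, seed=7):
    """Fraction of simulated A/A metrics flagged significant at `alpha`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_metrics):
        # Both groups draw from the same distribution: a true A/A setup.
        a = [rng.gauss(0, 1) for _ in range(n_users)]
        b = [rng.gauss(0, 1) for _ in range(n_users)]
        if two_sample_p(a, b) < alpha:
            hits += 1
    return hits / n_metrics
```

With alpha = 0.001, almost no metrics should move in a clean A/A comparison, which is why the ~30% movement rate in A versus A’ pointed to a real systematic difference (the update effect) rather than noise.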
How did we deploy it?
We have adopted the framework at scale and use it to compare builds regularly. When deploying it in production, we made one change: we keep only the A’ and B variants. The reason is that the A versus A’ comparison provides limited benefit: if we treat the difference between A and A’ as the baseline, we can only detect an issue in A’ when metric movements are far from that baseline. Instead, we implemented an automatic process that creates a duplicated, identical build with a new version number whenever there is a new build release; whenever we start an A’/B test for a new build, we send that duplicated build to variant A’. This process ensures that we don’t introduce any issues into A’. A further benefit of dropping variant A is that we maximize the traffic allocated to A’ and B, increasing metric sensitivity as much as possible.
The framework has helped the team release builds safely. In a recent A’/B test for a real build release, we successfully detected a number of statistically significant regressions, which led the team to halt and investigate before moving forward.
We set out to use A/B testing to compare build releases for Microsoft Teams and identified that penetration difference and update effect can bias the A/B analysis. To mitigate this, we introduced an A/A’/B testing framework. The framework enables us to regularly perform trustworthy build comparisons and serves as a gate for the safe release of each new build.
Special thanks to Microsoft Teams Experimentation team, Microsoft Experimentation Platform team, Microsoft Teams Client Release team, Paola Mejia Minaya, Ketan Lamba, Eduardo Giordano, Peter Wang, Pedro DeRose, Seena Menon, Ulf Knoblich.
– Robert Kyle, Punit Kishor, Microsoft Teams Experimentation Team
– Wen Qin, Experimentation Platform
References
“Microsoft Teams.” https://www.microsoft.com/en-us/microsoft-teams/group-chat-software
 “Teams update process.” https://docs.microsoft.com/en-us/microsoftteams/teams-client-update
 R. Kohavi and S. Thomke, “The Surprising Power of Online Experiments.” https://hbr.org/2017/09/the-surprising-power-of-online-experiments
 R. Kohavi, R. M. Henne, and D. Sommerfield, “Practical guide to controlled experiments on the web: listen to your customers not to the hippo,” in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining – KDD ’07, San Jose, California, USA, 2007, p. 959. doi: 10.1145/1281192.1281295.
 T. Crook, B. Frasca, R. Kohavi, and R. Longbotham, “Seven pitfalls to avoid when running controlled experiments on the web,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining – KDD ’09, Paris, France, 2009, p. 1105. doi: 10.1145/1557019.1557139.
 “Selection bias.” https://en.wikipedia.org/wiki/Selection_bias
 N. Chen, M. Liu, and Y. Xu, “How A/B Tests Could Go Wrong: Automatic Diagnosis of Invalid Online Experiments,” p. 9, 2019.
 W. Machmouchi, S. Gupta, R. Zhang, and A. Fabijan, “Patterns of Trustworthy Experimentation: Pre-Experiment Stage.” https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/patterns-of-trustworthy-experimentation-pre-experiment-stage/