

Event-based A/B tests: why they are special


An “event-based” A/B test is a method used to compare two or more variants over a limited duration. We can use what we learn to increase user engagement, satisfaction, or retention of a product, while also applying our insights to future events and product scenarios. We often use A/B testing when launching a new feature; it allows the product team to try different messaging and determine which content maximizes user engagement.

Unlike classic A/B testing, where a feature is developed, incrementally tested, gradually rolled out, and then becomes a permanent part of the product, an event-based feature has limited time for experimentation. The window can be as short as a day or a handful of days; for example, Olympics-related headlines on a news app are relevant only for the duration of the Tokyo Olympic Games. In this blog post, we explore some of the challenges of running event-based A/B tests and share some insights on the set-up and analysis of such experiments.

What are the challenges of running event-based A/B tests?

Feature Testing

As a best practice, product teams perform manual or unit testing before exposing features to end users. This helps detect and fix bugs and minimizes user harm. However, not every bug can be detected at this stage. Feature teams often run A/B tests to discover issues that were overlooked or cannot be caught in manual or unit testing. Teams expose the feature to a small percentage of traffic, measure user engagement, identify issues, remedy them, and then run another round of experimentation to verify the improvement [1]. Yet in the case of event-based A/B tests, it is almost impossible to test and iterate on the feature during the experiment, given the time constraints.

Rotating traffic

For global events like International Women’s Day, which span multiple regions, we may want to run an A/B test at the same local time in each region. Depending on the experimentation system’s capability, this could mean setting up multiple A/B tests, each targeting a specific region and starting at a different UTC time. If many regions need to be covered, the experiment set-up requires quite a bit of effort. It might be tempting to use a single A/B test for all regions that starts at exactly the same time. However, if there is an issue specific to one region, the feature team can do nothing about it except stop the entire experiment. In contrast, having one experiment per region allows us to turn off the feature for the affected region alone. This approach manages risk while adding only a small amount of overhead to experiment management.
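
To make this concrete, below is a minimal sketch of how the per-region UTC start times could be computed so that each regional A/B test begins at the same local time. The event date, target local time, and region-to-time-zone mapping are illustrative assumptions.

```python
# A minimal sketch (Python 3.9+): compute the UTC start time of each regional
# A/B test so that every region starts at the same local time.
# The event date, local start time, and region/time-zone map are illustrative.
from datetime import datetime
from zoneinfo import ZoneInfo

EVENT_DATE = "2024-03-08"      # e.g., International Women's Day
LOCAL_START = "09:00"          # desired local start time in every region

REGION_TIMEZONES = {           # hypothetical region -> IANA time zone mapping
    "US-East": "America/New_York",
    "UK": "Europe/London",
    "Japan": "Asia/Tokyo",
}

def utc_start_times(event_date: str, local_start: str) -> dict:
    """Return the UTC start time for each region's experiment."""
    starts = {}
    for region, tz_name in REGION_TIMEZONES.items():
        local_dt = datetime.fromisoformat(f"{event_date}T{local_start}")
        local_dt = local_dt.replace(tzinfo=ZoneInfo(tz_name))
        starts[region] = local_dt.astimezone(ZoneInfo("UTC"))
    return starts

for region, start in utc_start_times(EVENT_DATE, LOCAL_START).items():
    print(f"{region}: start experiment at {start.isoformat()}")
```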

Analysis latency

Metric results help us understand the impact of a feature. Data becomes available once the:

  1. Telemetry for the experiment is collected;
  2. Data gets transformed into an analyzable format;
  3. Analysis job is submitted;
  4. Job is queued; and
  5. Job is completed. 

Depending on the product scenario, it can take some time to complete steps 1 and 2. If an A/B test targets a broad audience, or very large data sets need to be consumed, it can take hours for results to be ready. In one example we observed recently, experimenters could not assess feature performance until roughly the eighth hour after the experiment had started, and even then only using the first four hours of data. If there is an issue with the feature and metrics are the only way to learn about it, a significant amount of time will have elapsed by the time the feature team discovers the problem.

Experiment debugging

It is possible to encounter issues during an A/B test. For example, Sample Ratio Mismatch (a.k.a. SRM) has been found to happen relatively frequently in A/B tests. It is a symptom of a variety of experiment quality issues spanning variant assignment, execution, log processing, analysis, interference between variants, and telemetry [2]. Debugging such issues takes time. For a classic A/B test with an SRM, it can take days, weeks, or even months to identify the root cause. Given the time limit and data latency, it may not be feasible for the feature team to identify the root cause before the event ends.
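
As an illustration, a common way to flag an SRM is a chi-squared goodness-of-fit test on the observed assignment counts. The sketch below assumes a 50/50 design; the counts and alert threshold are made-up examples, and detecting an SRM is only the first step, since root-causing it is where the time goes.

```python
# A minimal SRM check: a chi-squared goodness-of-fit test on assignment counts.
# The 50/50 design ratio, alpha, and the example counts are illustrative.
from scipy.stats import chisquare

def srm_detected(treatment_count: int, control_count: int,
                 expected_ratio: float = 0.5, alpha: float = 0.001) -> bool:
    """Return True if the observed split deviates from the designed ratio."""
    total = treatment_count + control_count
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    _, p_value = chisquare([treatment_count, control_count], f_exp=expected)
    return p_value < alpha  # a tiny p-value signals a likely SRM

# Example: a 50/50 test that collected 50,600 treatment vs. 49,400 control users.
print(srm_detected(50_600, 49_400))  # True -> investigate before trusting metrics
```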

Experiment learning

Displaying an event-based feature often comes at the cost of not showing another feature. Let’s return for a moment to our Olympic Games carousel. The space used for Olympics headlines could also be used for other types of content. A noticeable increase in user engagement with the carousel could mean a missed ad engagement opportunity. How should we make a tradeoff between different types of engagement? How do we quantify and understand the impact in both the short and long term?

Treatment – Olympics carousel is displayed.

Control – No Olympics carousel (the Shopping carousel is displayed by default).

Figure 1. An event-based feature comes at the cost of not showing another feature.

There are multiple variables in event-based A/B tests. In the Olympics carousel example, its format, its content, and where it appears could all be things we want to test. We may also want to test it for different geolocations. Moreover, event-based experiments introduce a unique dimension of variability: the event itself. What would you say if the result showed a stat-sig decrease in content interaction for users located in the US? Does that mean the carousel is bad? What if the result showed a stat-sig increase for users in Asian countries? What if you saw the opposite results on a similar carousel but for a different event, e.g., the Super Bowl?

| Feature for A/B testing | Region | Metric movement | Immediate reaction | Actual reason |
| --- | --- | --- | --- | --- |
| A carousel for the Olympics event | US | -0.8% | Carousel does not work well in the US -> may need to change the format or source of content in the carousel | CJK users are more interested in the Olympics event than US users |
| A carousel for the Olympics event | CJK (China, Japan, Korea) | +1% | Carousel works well in CJK | CJK users are more interested in the Olympics event than US users |
Figure 2. The user engagement metric moves in different directions for users in different regions.

In classic A/B testing, we can have two different UX treatments for a feature. Through testing, we can see which one works better and use what we learn in other experiments. In event-based experiments, however, insights from one event may not be transferrable to others. For instance, the Olympics is different from International Women’s Day, which is quite different from the Super Bowl. Thus, it can be difficult to draw conclusions from one experiment and define the right execution for the next event.

Recommendations on experimentation infrastructure and analysis

We recommend the following to address the challenges of event-based A/B tests.

Experiment tooling

Provide the option to schedule A/B tests in batch and run experiments and metric analyses automatically. This is especially helpful when traffic rotation needs to happen and multiple A/B tests need to be created for the same feature, each targeting a different region. The feature team should be able to schedule A/B tests for different time zones and have the tests start automatically at preset times. Short-period metric analyses should be kicked off as soon as data becomes available so that experimenters can see results before the event ends.
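
As a rough sketch, the kind of automation we have in mind could look like the code below. The `client` object, its `create_experiment` and `schedule_analysis` methods, and the configuration fields are hypothetical and do not correspond to a real ExP API; they simply illustrate batch set-up of one experiment per region plus early, short-period analyses.

```python
# A minimal sketch of batch-scheduling per-region A/B tests through a
# hypothetical experimentation-system client. The client methods and the
# configuration fields are assumptions, not a real API.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class RegionalTest:
    region: str
    start_utc: datetime   # e.g., computed so all regions start at the same local time
    duration: timedelta

def schedule_event_tests(client, feature_flag: str, tests: list) -> None:
    """Create one experiment per region and schedule early analyses for each."""
    for test in tests:
        experiment_id = client.create_experiment(
            feature_flag=feature_flag,
            audience={"region": test.region},
            start_time=test.start_utc,
            end_time=test.start_utc + test.duration,
        )
        # Kick off short-period analyses so results arrive before the event ends.
        for hours in (4, 8, 24):
            client.schedule_analysis(experiment_id,
                                     analysis_window=timedelta(hours=hours))
```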

Ideally, this option should be an integral part of the experimentation system. Depending on how often event-based A/B tests are expected to run and the return on the engineering investment, it might be sufficient to build a plug-in or tooling that leverages an existing system’s APIs (application programming interfaces) to set up, run, and analyze event-based A/B tests automatically.

Near-real-time monitoring

Establish a near-real-time pipeline to monitor, detect issues in, and debug A/B tests. Feature teams need to react quickly when something is off, and waiting hours for metrics to be calculated puts users and teams at risk of adverse impacts. With a near-real-time pipeline, experiment data is aggregated every couple of minutes and key guardrail metrics are computed. These metrics help detect egregious effects, especially in the first few hours after the experiment starts. Although they may be only a subset of all the metrics the feature team cares about, they allow the team to closely monitor event-based A/B tests, debug issues quickly, and shut down experiments when needed. At Microsoft, we have established a near-real-time pipeline for live site monitoring of online services (details to be shared in a future blog post). It allows us to quickly detect experiments with bad outcomes.

Note that having near-real-time data can tempt experimenters to check scorecards repeatedly before the fixed time horizon is reached, a practice known as “p-hacking” that inflates the Type I error rate and causes experimenters to see more false positives. Traditional “fixed-horizon” statistics no longer work in this setting; sequential testing is better suited for continuous monitoring of the experiment. To develop a sequential probability ratio test, it is advisable to understand the metric distributions beforehand and verify that the independence assumption holds for the test to be applicable [3].
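
For continuous monitoring of a binary guardrail metric (for example, an error rate), a simple sequential procedure such as Wald’s sequential probability ratio test can be applied. The sketch below assumes independent Bernoulli observations; the baseline rate, the degraded rate to detect, and the error targets are illustrative assumptions.

```python
# A minimal sketch of a sequential probability ratio test (SPRT) for a binary
# guardrail metric. Assumes independent Bernoulli observations; p0, p1, alpha,
# and beta are illustrative choices.
import math

def sprt_decision(failures: int, n: int,
                  p0: float = 0.01,   # healthy failure rate (H0)
                  p1: float = 0.02,   # degraded failure rate to detect (H1)
                  alpha: float = 0.05, beta: float = 0.05) -> str:
    """Return 'alert', 'healthy', or 'keep monitoring' after n observations."""
    llr = (failures * math.log(p1 / p0)
           + (n - failures) * math.log((1 - p1) / (1 - p0)))
    upper = math.log((1 - beta) / alpha)   # crossing it favors H1 (degradation)
    lower = math.log(beta / (1 - alpha))   # crossing it favors H0 (healthy)
    if llr >= upper:
        return "alert"           # guardrail likely degraded; consider shutting down
    if llr <= lower:
        return "healthy"         # no evidence of degradation so far
    return "keep monitoring"     # not enough evidence yet; keep collecting data

# Example: 180 failed requests out of 12,000 in the treatment so far.
print(sprt_decision(failures=180, n=12_000))  # -> "alert"
```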

Triggered analysis

Use triggered analysis to increase the sensitivity of metrics. When an event-based feature is displayed, not every product user necessarily sees it. For example, a component may require the user to scroll the page before it becomes visible; if the user does not scroll, the component is never discovered. Sometimes the feature is enabled only when certain conditions are met; for instance, we might show sports-related features only to users who have previously visited sports articles or websites. Using the full user population for analysis can dilute the results. It is valuable to do a triggered analysis, such as analyzing only those users who saw the experience [4]. From our observations on A/B tests run at Microsoft, the more targeted the audience and metrics for analysis, the more likely we are to get results with stat-sig metric movements.
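
As a small illustration, a triggered analysis can be as simple as filtering the analysis population to users who actually fired the trigger event before computing metrics. The log schema below (`user_id`, `variant`, `saw_carousel`, `clicks`) is a hypothetical example.

```python
# A minimal sketch of triggered analysis with pandas: metrics are computed only
# over users who actually saw the event-based feature. Column names are
# illustrative assumptions about the log schema.
import pandas as pd

def triggered_engagement(logs: pd.DataFrame) -> pd.DataFrame:
    """Compare per-user engagement between variants, restricted to triggered users."""
    triggered = logs[logs["saw_carousel"]]  # drop users who never saw the feature
    per_user = triggered.groupby(["variant", "user_id"], as_index=False)["clicks"].sum()
    return per_user.groupby("variant")["clicks"].agg(["mean", "count"])

# Example with a tiny hypothetical log.
logs = pd.DataFrame({
    "user_id":      [1, 2, 3, 4, 5, 6],
    "variant":      ["T", "T", "T", "C", "C", "C"],
    "saw_carousel": [True, True, False, True, False, True],
    "clicks":       [3, 1, 0, 2, 0, 1],
})
print(triggered_engagement(logs))
```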

Post-experiment analysis

Conduct post-experiment analysis to understand impact that is not reflected in the A/B test results. These analyses help establish a more complete picture of the experiment and the event itself. For example, an event-based carousel may cause a drop in revenue because fewer ads are displayed (as shown in Figure 1). However, if users like the carousel, there might be a lingering effect that makes them revisit the app more frequently. Conducting a post-experiment retention analysis helps quantify impact that is not observed during the A/B test itself. By comparing the retention of the treatment and control cohorts after the experiment, we may find that the feature leads to an increase in user retention over the long term.
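
A post-experiment retention comparison could look like the sketch below, which compares the share of treatment and control users who returned within 14 days after the event. The table schema, the 14-day window, and the two-proportion z-test are illustrative choices.

```python
# A minimal sketch of post-experiment retention analysis: compare the share of
# treatment vs. control users who returned within 14 days after the event.
# The table schema and retention window are illustrative assumptions.
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

def retention_lift(cohorts: pd.DataFrame):
    """Return (absolute lift in retention, p-value) for treatment vs. control."""
    counts = cohorts.groupby("variant")["returned_within_14d"].agg(["sum", "count"])
    returned = [counts.loc["T", "sum"], counts.loc["C", "sum"]]
    totals = [counts.loc["T", "count"], counts.loc["C", "count"]]
    _, p_value = proportions_ztest(returned, totals)
    lift = returned[0] / totals[0] - returned[1] / totals[1]
    return lift, p_value

# Example with a tiny hypothetical cohort table.
cohorts = pd.DataFrame({
    "variant": ["T"] * 4 + ["C"] * 4,
    "returned_within_14d": [True, True, True, False, True, False, False, True],
})
print(retention_lift(cohorts))
```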

We can also dig deeper to uncover other insights. For instance, if the overall difference in retention is small, could it be prominent for some subset of users? Could there be a shift from “active users” to “enthusiasts”, or from “random users” to “active users”, among those who saw the treatment experience? Could there be a more observable difference if we look at cohorts that have been exposed to multiple event-based features on a cumulative basis?

Because the event itself is a variable, cross-experiment analysis helps shed light on the differences between events. This requires keeping a repository of historical A/B tests and metrics data. By comparing the metric movements between different events, or by applying machine learning techniques, we can find out how the event, region, feature format, content, and other variables contribute to the metric movement. The value of such analysis depends on the data accumulated over time: by testing more event-based features and collecting more data, we can derive more value out of the cross-experiment analysis.
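
As a sketch, a cross-experiment repository can start as a simple table of historical metric movements keyed by event, region, and feature format. The schema is a hypothetical assumption; the Olympics numbers echo Figure 2, and the Super Bowl row is made up for illustration.

```python
# A minimal sketch of cross-experiment analysis over a hypothetical repository
# of historical event-based A/B test results. The schema is illustrative; the
# Olympics rows echo Figure 2 and the Super Bowl row is made up.
import pandas as pd

repository = pd.DataFrame([
    {"event": "Olympics",   "region": "US",  "feature_format": "carousel", "metric_delta_pct": -0.8},
    {"event": "Olympics",   "region": "CJK", "feature_format": "carousel", "metric_delta_pct": 1.0},
    {"event": "Super Bowl", "region": "US",  "feature_format": "carousel", "metric_delta_pct": 0.6},
])

# How does the same feature format move engagement across events and regions?
summary = (repository
           .groupby(["event", "feature_format", "region"])["metric_delta_pct"]
           .mean()
           .unstack("region"))
print(summary)
```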

Summary

Event-based experiments are a unique type of A/B testing due to their time sensitivity and limited duration. They face distinct challenges throughout the experimentation lifecycle, including feature testing, experiment set-up, analysis, debugging, and learning. Event-based testing requires specific tooling, monitoring, and analysis approaches to address these and other challenges. As we embrace diversity and inclusion across the world, we expect to see more event-based A/B tests across various products. In this blog post, we shared our thoughts and recommendations on this type of experiment, and we hope they are helpful if you are running, or considering running, event-based A/B tests in the future.

– Li Jiang (Microsoft ExP),

– Ben Goldfein, Erik Johnston, John Henrikson, Liping Chen (Microsoft Start)

References

[1]  T. Xia, S. Bhardwaj, P. Dmitriev, and A. Fabijan, “Safe Velocity: A Practical Guide to Software Deployment at Scale using Controlled Rollout,” in 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2019, pp. 11–20.

[2]  A. Fabijan et al., “Diagnosing Sample Ratio Mismatch in Online Controlled Experiments,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining – KDD ’19, 2019, pp. 2156–2164.

[3]  A. Tartakovsky, I. Nikiforov, M. Basseville, Sequential Analysis: Hypothesis Testing and Changepoint Detection. Chapman and Hall/CRC Press.

[4]  R. Kohavi, D. Tang, and Y. Xu, Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.