{"id":774094,"date":"2021-11-09T10:00:06","date_gmt":"2021-11-09T18:00:06","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=774094"},"modified":"2021-11-09T11:04:50","modified_gmt":"2021-11-09T19:04:50","slug":"data-quality-fundamental-building-blocks-for-trustworthy-a-b-testing-analysis","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/data-quality-fundamental-building-blocks-for-trustworthy-a-b-testing-analysis\/","title":{"rendered":"Data Quality: Fundamental Building Blocks for Trustworthy A\/B testing Analysis"},"content":{"rendered":"<h2>Introduction<\/h2>\n<p><span data-contrast=\"auto\">Data\u00a0plays a vital role in\u00a0the lifecycle of\u00a0every single product in the technology industry.\u00a0With\u00a0data,\u00a0we can generate insights\u00a0to improve products\u00a0and provide a superior customer experience. Yet insights are only actionable if they are derived from trustworthy data. For example, a navigation app relies on accurate location detection data to provide directions to users.\u00a0<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Data quality is especially important when it comes to A\/B testing. A\/B testing allows us to evaluate a feature from different perspectives and make informed decisions. Yet unreliable data can lead to erroneous analysis and poor decision-making that ultimately harms the user experience or other key metrics. Fortunately, there are tools and methodologies that we can use to ensure that we are relying on quality data to produce trustworthy A\/B testing analyses. 
<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">In this blog post, we\u00a0will\u00a0explore these resources in more detail and answer\u00a0the following\u00a0questions:<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<ul>\n<li data-leveltext=\"\uf0b7\" data-font=\"Symbol\" data-listid=\"11\" data-aria-posinset=\"1\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">How can data quality impact our A\/B testing results?<\/span><\/b><span data-ccp-props=\"{\"134233279\":true,\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/li>\n<li data-leveltext=\"\uf0b7\" data-font=\"Symbol\" data-listid=\"11\" data-aria-posinset=\"2\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">What are the critical data quality requirements for\u00a0A\/B testing?<\/span><\/b><span data-ccp-props=\"{\"134233279\":true,\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/li>\n<li data-leveltext=\"\uf0b7\" data-font=\"Symbol\" data-listid=\"11\" data-aria-posinset=\"3\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">How\u00a0can we\u00a0continuously\u00a0monitor data\u00a0quality?<\/span><\/b><span data-ccp-props=\"{\"134233279\":true,\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/li>\n<\/ul>\n<h2>How can data quality impact our A\/B testing results?<\/h2>\n<p><span data-contrast=\"auto\">A\/B testing results are only reliable and actionable if they are based on complete and accurate data. But how do we know that the data that we are using is reliable? The following are several signs of data quality issues in A\/B testing to watch out for. 
<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<h3>Sample Ratio Mismatch<\/h3>\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/diagnosing-sample-ratio-mismatch-in-a-b-testing\/\"><span data-contrast=\"none\">SRM (Sample Ratio Mismatch)<\/span><\/a><span data-contrast=\"auto\">\u00a0means that the observed traffic split between the treatment and control variants\u00a0differs\u00a0from\u00a0the\u00a0configured allocation.\u00a0When\u00a0this occurs, it can lead to\u00a0selection bias.\u00a0<\/span><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.cambridge.org\/us\/academic\/subjects\/computer-science\/knowledge-management-databases-and-data-mining\/trustworthy-online-controlled-experiments-practical-guide-b-testing?format=AR\"><span data-contrast=\"none\">Analyses that have SRMs are generally viewed as untrustworthy and should not be used to\u00a0make a decision<\/span><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><span data-contrast=\"auto\">.\u00a0There are many scenarios\u00a0where\u00a0data quality issues\u00a0can\u00a0lead to an SRM. 
Examples include\u00a0incomplete logging of treatment\/control assignment information in telemetry and\u00a0unbalanced telemetry loss between treatment\u00a0and\u00a0control.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<h3>STEDI\u00a0of\u00a0metrics<\/h3>\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/patterns-of-trustworthy-experimentation-during-experiment-stage\/\"><span data-contrast=\"none\">A\/B metrics should be Sensitive, Trustworthy, Efficient, Debuggable, and Interpretable (STEDI)<\/span><\/a><span data-contrast=\"auto\">\u00a0to\u00a0help\u00a0the experimenter understand the treatment effect and interpret the metric movements\u00a0that are\u00a0observed.\u00a0Poor data quality can\u00a0negatively\u00a0impact\u00a0metric sensitivity and\u00a0trustworthiness.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><b><span data-contrast=\"auto\">Sensitivity:<\/span><\/b><span data-contrast=\"auto\">\u00a0If a data field has a very high missing rate, then metrics\u00a0that\u00a0are computed conditionally on this\u00a0data field\u00a0will have a very low sample size. 
In this scenario, the hypothesis tests on these metrics will\u00a0lack power.\u00a0<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Outliers\u00a0are\u00a0another\u00a0common\u00a0issue\u00a0that can negatively\u00a0impact\u00a0sensitivity.\u00a0Outliers\u00a0inflate the variance, and even a small number of severe outliers can\u00a0introduce enough noise\u00a0to\u00a0move metrics.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><b><span data-contrast=\"auto\">Trustworthiness:<\/span><\/b><span data-contrast=\"auto\"> Metrics computed from data with missing values may have incorrect statistics and untrustworthy hypothesis test results. This is even more concerning when the missing rate is not balanced between treatment and control. Data may be missing for several reasons, such as failed processing or incomplete data extraction. In the context of retention analysis, missing data can make it look as if a customer churned when they did not, causing the retention rate metric to appear lower than it actually is.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/patterns-of-trustworthy-experimentation-during-experiment-stage\/\"><span data-contrast=\"none\">When\u00a0an in-depth\u00a0analysis\u00a0is needed\u00a0for\u00a0various\u00a0segments\u00a0of the population<\/span><\/a><span data-contrast=\"auto\">\u00a0and the fields\u00a0used for segmentation\u00a0are\u00a0inaccurate\u00a0or\u00a0have a high\u00a0missing\u00a0rate, results based on these segments can be misleading.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<h3>Latency in analysis result delivery\u00a0and 
delayed\u00a0decision-making<\/h3>\n<p><span data-contrast=\"auto\">Statistical analysis relies on\u00a0complete data.<\/span><span data-contrast=\"auto\">\u00a0In software products,\u00a0it is common to observe delays between data generation and data readiness for analysis. For example, if\u00a0we need\u00a0input from multiple data sources to test a hypothesis,\u00a0the wait for all the data to become available could be substantial. Delayed analysis may lead to delayed\u00a0decision-making.\u00a0Consequently,\u00a0we may not detect or rectify\u00a0unexpected regressions in product experience as promptly as desired. To mitigate this, well-designed data pipelines that meet the required Service Level Agreements for data availability are critical.\u00a0<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-774100 aligncenter\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/09\/Data-quality-dimensions-1024x712.png\" alt=\"diagram, schematic\" width=\"1024\" height=\"712\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/09\/Data-quality-dimensions-1024x712.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/09\/Data-quality-dimensions-300x209.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/09\/Data-quality-dimensions-768x534.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/09\/Data-quality-dimensions-1536x1068.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/09\/Data-quality-dimensions-240x167.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/09\/Data-quality-dimensions.png 2035w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p style=\"text-align: center\">Figure 1: Common data 
quality dimensions<\/p>\n<h2>What are the critical data quality requirements for A\/B testing?<\/h2>\n<p><span data-contrast=\"auto\">As shown in\u00a0Figure\u00a01\u00a0above, common data quality dimensions include completeness, uniqueness, timeliness, validity, integrity, consistency, relevance, compliance,\u00a0and\u00a0retention.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">We have seen many instances\u00a0where\u00a0data quality\u00a0varies across A\/B testing scenarios. For instance,\u00a0a bug in a new feature\u00a0can increase the number of\u00a0invalid values. These instances highlight the importance of\u00a0data quality\u00a0metrics\u00a0in the A\/B testing process.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">We\u00a0can use the\u00a0following\u00a0checklist\u00a0to\u00a0set\u00a0up additional metrics to measure data quality\u00a0in\u00a0our\u00a0analysis:<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<ol>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"7\" data-aria-posinset=\"1\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">Missing\u00a0rates:<\/span><\/b><span data-contrast=\"auto\">\u00a0How much data is missing from each column of interest? 
Is there any pattern, such as\u00a0a dummy value\u00a0representing\u00a0missing values?<\/span><span data-ccp-props=\"{\"134233279\":true,\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/li>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"7\" data-aria-posinset=\"1\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">Invalid values:<\/span><\/b><span data-contrast=\"auto\">\u00a0Do the values follow the proper format?\u00a0Are the values valid for this column?<\/span><span data-ccp-props=\"{\"134233279\":true,\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/li>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"7\" data-aria-posinset=\"1\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">Join rates:<\/span><\/b><span data-contrast=\"auto\">\u00a0Are the join rates high enough\u00a0if\u00a0we\u00a0are joining data from different sources?<\/span><span data-ccp-props=\"{\"134233279\":true,\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/li>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"7\" data-aria-posinset=\"1\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">Uniqueness:<\/span><\/b><span data-contrast=\"auto\">\u00a0Are there duplicates\u00a0of the same row?\u00a0Is\u00a0there more than one column for the same signal?<\/span><span data-ccp-props=\"{\"134233279\":true,\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/li>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"7\" data-aria-posinset=\"1\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">Data delays:<\/span><\/b><span data-contrast=\"auto\">\u00a0What\u2019s the proportion of\u00a0data available\u00a0at the time of computation?\u00a0How long does it take between when the events were logged and\u00a0when the\u00a0data is available for analysis?<\/span><span 
data-ccp-props=\"{\"134233279\":true,\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/li>\n<\/ol>\n<p><span data-contrast=\"auto\">In addition to the above list,\u00a0we\u00a0should of course ensure that\u00a0we\u00a0are complying with data policies, including retention periods and privacy requirements.\u00a0<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Now let\u2019s take a look at several data fields we should pay extra attention to before analyzing A\/B testing results.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<h3>Randomization\u00a0unit<\/h3>\n<p><span data-contrast=\"auto\">The randomization\u00a0unit is the unit to which we randomly assign treatment or control. Common randomization units include users, sessions, pages, impressions, devices, cookies, and documents.\u00a0A data quality check\u00a0must\u00a0help\u00a0ensure that\u00a0data\u00a0belonging to\u00a0the same randomization unit\u00a0is\u00a0linked\u00a0in an accurate and compliant\u00a0manner.\u00a0<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">It is important to always check the missing rates and missing-value patterns of the columns that we use as randomization units. Missing data can appear as nulls, empty strings, zeros, infinity, or default values. Missing data can come from a specific data source due to logging issues or may be introduced in the data cooking pipeline. For example, we may detect missing data from a certain build version or from one data source that has partial delays. In some instances, we may analyze A\/B testing data at multiple aggregation levels, such as sessions and users (a user can have multiple sessions). These levels are usually represented by the same columns as the randomization units. 
We should then check whether all the aggregation level columns are present.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<h3>Treatment\u00a0assignment<\/h3>\n<p><span data-contrast=\"auto\">Treatment assignment information is necessary to distinguish treatment data from control data so that we can evaluate the efficacy of the tested feature.\u00a0A column containing treatment assignment information should always be present in the data\u00a0that is\u00a0used for analysis.\u00a0It is also best\u00a0to\u00a0ensure\u00a0that\u00a0the treatment assignment information\u00a0follows\u00a0a consistent format, both within the same data source and across multiple data sources if there is\u00a0a\u00a0need to join them together.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Sometimes\u00a0we can attach\u00a0additional information to the treatment assignment\u00a0information and\u00a0use\u00a0it\u00a0to separate different kinds of traffic.\u00a0For example,\u00a0when\u00a0we expand an\u00a0experiment to another audience,\u00a0we can record\u00a0the audience alongside the assignment so that the analysis can filter the data by audience.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">In addition, treatment assignment should be consistent\u00a0for the randomization unit (e.g.,\u00a0users)\u00a0throughout the\u00a0analysis.\u00a0For\u00a0instance,\u00a0a user\u00a0assigned to the\u00a0treatment\u00a0group\u00a0should not also appear in\u00a0the\u00a0control group.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<h3>Timestamp<\/h3>\n<p><span data-contrast=\"auto\">When running an A\/B test, we need to keep track of the time when an event of interest occurs. 
So, there should always be a column representing timestamps of the events. Timestamps play an important role in filtering events that are included in an analysis and in deciding the sequence of events for a triggered analysis (A\/B tests that are analyzed on a subset of the assigned population \u2013\u00a0<\/span><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/patterns-of-trustworthy-experimentation-during-experiment-stage\/\"><span data-contrast=\"none\">a triggered A\/B analysis<\/span><\/a><span data-contrast=\"auto\">).\u00a0We\u00a0should also check missing rates and filter out invalid values, such as a default datetime (for example, 1900-01-01) or a datetime in the future.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">It is important to\u00a0choose the timestamp that best supports these use\u00a0cases, preferably in the context of the\u00a0product. Typically, there are at least two timestamps to choose from:\u00a0the time the activity occurred and the time\u00a0when\u00a0the\u00a0corresponding log\u00a0is received. For web-based server-side\u00a0A\/B tests, the time difference is likely to be insignificant and the activity ordering remains unchanged.\u00a0However, for\u00a0client-side\u00a0A\/B tests, these two timestamps can be quite\u00a0different.\u00a0Deciding which one to use is a trade-off that the product team has to make.\u00a0In our experience,\u00a0client-returned timestamps are often noisy. 
We\u00a0highly\u00a0recommend\u00a0using\u00a0the time at which\u00a0the log is received rather than the client-logged timestamp.\u00a0Below is an example\u00a0of how device-recorded\u00a0event time (client time) can differ\u00a0drastically\u00a0from the\u00a0correct event time.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-793658\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Device-Count-Distribution-300x209.png\" alt=\"Distribution of Event t\" width=\"600\" height=\"419\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Device-Count-Distribution-300x209.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Device-Count-Distribution-768x536.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Device-Count-Distribution-240x167.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/Device-Count-Distribution.png 860w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/p>\n<p style=\"text-align: center\">Figure 2: Distribution of the event time with device count<\/p>\n<h3>Segments<\/h3>\n<p><span data-contrast=\"auto\">We\u00a0frequently slice\u00a0data\u00a0into subgroups for analysis.\u00a0We call these\u00a0subgroups\u00a0segments\u00a0and\u00a0we\u00a0define them at\u00a0various levels of aggregation.\u00a0Segments can be defined at the\u00a0row level or,\u00a0if applicable\u00a0and\u00a0desired,\u00a0at the\u00a0randomization unit\u00a0level.\u00a0For example,\u00a0if our\u00a0A\/B test is randomized at the user level,\u00a0we\u00a0can have a date segment at the\u00a0row level\u00a0as well as\u00a0a\u00a0user age group segment at the user level.\u00a0Common segments include\u00a0date, market, browser, and application version.<\/span><span 
data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Columns used as segments should have\u00a0the right format,\u00a0acceptable missing rates,\u00a0and reasonable cardinality.\u00a0A heuristic range\u00a0for reasonable cardinality\u00a0is 2 to 10. A segment with a constant value provides no additional insight. A segment with hundreds of distinct values\u00a0may\u00a0cause\u00a0a broad range of issues, such as\u00a0slow computation, low statistical power, and loading or responsiveness issues in visualizations.\u00a0Such\u00a0segments\u00a0can\u00a0create\u00a0an even larger\u00a0number of subgroups,\u00a0especially if\u00a0we pair them\u00a0with another high-cardinality segment. If the segment column naturally comes with high cardinality (e.g., language), a commonly adopted practice is to keep the N most frequent values and group the rest under \u201cOthers\u201d.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<h2>How can we continuously monitor data quality?<\/h2>\n<p><span data-contrast=\"auto\">Data quality changes as the product and data sources evolve over time. Adding new logs, accommodating new data sources, and following new data governance requirements can all impact data quality. There are multiple tools that we can use to help keep track of data quality measurements.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<ol>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"5\" data-aria-posinset=\"1\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">Dashboard\u00a0for\u00a0data quality metrics<\/span><\/b><span data-contrast=\"auto\">. This requires periodic ingestion of data. The ingestion can run hourly, daily, weekly, or at a customized frequency. 
It\u00a0depends\u00a0on how frequently the data quality is likely to change and\u00a0the\u00a0computation and storage cost budget.<\/span><span data-ccp-props=\"{\"134233279\":true,\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/li>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"5\" data-aria-posinset=\"1\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">Alerting on anomalies<\/span><\/b><span data-contrast=\"auto\">.\u00a0Setting\u00a0up alerts on anomalies can help detect abnormal changes in data quality metrics. This can indicate\u00a0abnormal activities in the source data or processing pipelines.<\/span><span data-ccp-props=\"{\"134233279\":true,\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/li>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"5\" data-aria-posinset=\"1\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">Segment data quality metrics.<\/span><\/b><span data-contrast=\"auto\">\u00a0Data quality metrics should be calculated based on\u00a0the\u00a0same segments just like regular metrics. 
Sometimes\u00a0a\u00a0significant data quality drop can\u00a0occur\u00a0within a segment while the overall metric\u00a0shows\u00a0little difference.<\/span><span data-ccp-props=\"{\"134233279\":true,\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/li>\n<li data-leveltext=\"%1.\" data-font=\"Calibri\" data-listid=\"5\" data-aria-posinset=\"1\" data-aria-level=\"1\"><b><span data-contrast=\"auto\">A\/A\u00a0test<\/span><\/b><span data-contrast=\"auto\">.\u00a0An\u00a0<\/span><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/dl.acm.org\/doi\/10.1145\/1557019.1557139\"><span data-contrast=\"none\">A\/A test<\/span><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><span data-contrast=\"auto\">\u00a0is an A\/B test with\u00a0an\u00a0identical\u00a0experience in\u00a0control and treatment\u00a0variants.\u00a0Running A\/A\u00a0tests\u00a0is a recommended approach\u00a0for an\u00a0end-to-end test of\u00a0an\u00a0A\/B testing\u00a0system, and often it can also help uncover data quality issues.\u00a0We expect\u00a0an A\/A\u00a0test\u00a0to have a balanced\u00a0traffic\u00a0split\u00a0between\u00a0variants\u00a0and\u00a0no\u00a0statistically significant\u00a0metric movements beyond the expected false-positive rate. When an A\/A test shows an SRM or unexpected metric movements, the likely cause is an issue in\u00a0the telemetry\u00a0or data pipeline implementation.<\/span><span data-ccp-props=\"{\"134233279\":true,\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/li>\n<\/ol>\n<p><span data-contrast=\"auto\">Below is a\u00a0real example\u00a0that illustrates how\u00a0the team\u00a0uses the first three tools\u00a0for monitoring.\u00a0The\u00a0dashboard\u00a0(the blue line)\u00a0shows how the\u00a0value\u00a0of a data quality metric\u00a0changes over time. 
And\u00a0the\u00a0red points highlight the anomalies\u00a0that\u00a0will trigger\u00a0the alerts\u00a0to the product team.\u00a0To ensure\u00a0the data quality of subgroups, the team also\u00a0sets\u00a0up the\u00a0same\u00a0dashboard\u00a0for segments such as\u00a0market\u00a0and\u00a0OS version.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-793454 aligncenter\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/data-quality-monitoring.png\" alt=\"Data quality monitoring\" width=\"1432\" height=\"321\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/data-quality-monitoring.png 1432w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/data-quality-monitoring-300x67.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/data-quality-monitoring-1024x230.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/data-quality-monitoring-768x172.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/11\/data-quality-monitoring-240x54.png 240w\" sizes=\"auto, (max-width: 1432px) 100vw, 1432px\" \/><\/p>\n<p style=\"text-align: center\">Figure 3 : Continuous data quality monitoring<\/p>\n<h2>Summary<\/h2>\n<p><span class=\"TextRun SCXW6076107 BCX8\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"auto\"><span class=\"NormalTextRun SCXW6076107 BCX8\">High-<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">q<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">uality data is<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0a<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0fundamental building b<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">lock<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0for 
trustworthy\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">A\/B testing<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0results.\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">Data quality issues\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">can\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">produce<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\"> unreliable A\/B testing <\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">results<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">. This can\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">negatively impact\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">A\/B testing<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0analysis<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0and decision<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">-<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">making.<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">By utilizing\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">t<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">he checklists and the\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">requirements for critical data fields<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">,\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">we<\/span><\/span><span class=\"TrackChangeTextInsertion TrackedChange SCXW6076107 BCX8\"><span class=\"TextRun SCXW6076107 BCX8\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"auto\"><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0<\/span><\/span><\/span><span class=\"TextRun SCXW6076107 BCX8\" lang=\"EN-US\" xml:lang=\"EN-US\" data-contrast=\"auto\"><span class=\"NormalTextRun SCXW6076107 BCX8\">are ready<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0to implement checks on critical components 
of\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">A\/B testing<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0data.\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">When\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">we\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">set up the data quality check monitors,\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">we\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">always\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">adopt\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">the four\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">tools for continuous monitoring<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0described above<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">.<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0We hope\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">this blog post\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">can\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">help\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">you\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">produce<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">high-<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">quality<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0build<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">ing<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0b<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">locks<\/span><span class=\"NormalTextRun SCXW6076107 BCX8\">\u00a0for your A\/B testing.\u00a0<\/span><\/span><span class=\"EOP SCXW6076107 BCX8\" data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><em>&#8211; Platina Liu, Wen Qin, Hao Ai, 
Jing Jin, Microsoft Experimentation Platform<\/em><\/p>\n<p>&nbsp;<\/p>\n<h2>References:<\/h2>\n<p><span data-contrast=\"auto\">[1] A. Fabijan et al., \u201cDiagnosing Sample Ratio Mismatch in A\/B Testing.\u201d https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/diagnosing-sample-ratio-mismatch-in-a-b-testing\/<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">[2] R.\u00a0Kohavi, D. Tang, and Y. Xu, Trustworthy Online Controlled Experiments: A Practical Guide to A\/B Testing. Cambridge University Press, 2020.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">[3] W.\u00a0Machmouchi, S. Gupta, and R. Zhang, \u201cPatterns of Trustworthy Experimentation: During-Experiment Stage.\u201d https:\/\/www.microsoft.com\/en-us\/research\/group\/experimentation-platform-exp\/articles\/patterns-of-trustworthy-experimentation-during-experiment-stage\/<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">[4]\u00a0Janhavie, \u201cData Quality.\u201d https:\/\/medium.com\/datacrat\/data-quality-dc4018fc443<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">[5] Lean-Data, \u201cHere is how to start with data quality.\u201d https:\/\/www.lean-data.nl\/data-quality\/here-is-how-to-start-with-data-quality\/<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">[6] T. Crook, B. Frasca, R.\u00a0Kohavi, and R. 
Longbotham, \u201cSeven pitfalls to avoid when running controlled experiments on the web,\u201d in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining &#8211; KDD \u201909, Paris, France, 2009, p. 1105.\u00a0doi: 10.1145\/1557019.1557139.<\/span><span data-ccp-props=\"{\"201341983\":0,\"335559739\":160,\"335559740\":259}\">\u00a0<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Data\u00a0plays a vital role in\u00a0the lifecycle of\u00a0every single product in the technology industry.\u00a0With\u00a0data,\u00a0we can generate insights\u00a0to improve products\u00a0and provide a superior customer experience. Yet insights are only actionable if they are derived from trustworthy data. For example, a navigation app relies on accurate location detection data to provide directions to users.\u00a0\u00a0 Data quality is [&hellip;]<\/p>\n","protected":false},"author":39168,"featured_media":793637,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":651963,"msr_hide_image_in_river":0,"footnotes":""},"research-area":[],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-774094","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-locale-en_us"],"msr_assoc_parent":{"id":651963,"type":"group"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/774094","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/39168"}],"version-history":[{"count":20,"href":"https:\/\/www.microsoft.com\/en-u
s\/research\/wp-json\/wp\/v2\/msr-blog-post\/774094\/revisions"}],"predecessor-version":[{"id":794012,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/774094\/revisions\/794012"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/793637"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=774094"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=774094"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=774094"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=774094"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}