{"id":917721,"date":"2023-02-27T09:01:13","date_gmt":"2023-02-27T17:01:13","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=917721"},"modified":"2023-02-27T09:21:46","modified_gmt":"2023-02-27T17:21:46","slug":"responsible-ai-mitigations-and-tracker-new-open-source-tools-for-guiding-mitigations-in-responsible-ai-2","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/responsible-ai-mitigations-and-tracker-new-open-source-tools-for-guiding-mitigations-in-responsible-ai-2\/","title":{"rendered":"Responsible AI Mitigations and Tracker: New open-source tools for guiding mitigations in Responsible AI"},"content":{"rendered":"\n<p><strong>Responsible AI Mitigations<\/strong>: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-mitigations\">https:\/\/github.com\/microsoft\/responsible-ai-toolbox-mitigations<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n\n\n\n<p><strong>Responsible AI Tracker<\/strong>: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-tracker\">https:\/\/github.com\/microsoft\/responsible-ai-toolbox-tracker<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n\n\n\n<p><strong>Authors<\/strong>: <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/benushi\/\">Besmira Nushi<\/a> (Principal Researcher) and <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/raghosh\/\">Rahee Ghosh Peshawaria<\/a> (Senior Program Manager)<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p>The goal of responsible AI is to create trustworthy AI systems that benefit people while mitigating harms, which can occur when AI systems fail to perform with fair, 
reliable, or safe outputs for various stakeholders. Practitioner-oriented tools in this space help with accelerating the model improvement lifecycle from identification to diagnosis and then mitigation of responsible AI concerns. This blog describes two new open-source tools in this space developed at Microsoft Research as part of the larger <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox\">Responsible AI Toolbox<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> effort in collaboration with Azure Machine Learning and Aether, the Microsoft advisory body for AI ethics and effects in engineering and research:<\/p>\n\n\n\n<ol class=\"wp-block-list\" type=\"1\">\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-mitigations\">Responsible AI Mitigations library<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> \u2013 Python library for implementing and exploring mitigations for Responsible AI.<\/li>\n\n\n\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-tracker\">Responsible AI Tracker<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> \u2013 JupyterLab extension for tracking, comparing, and validating Responsible AI mitigations and experiments.<\/li>\n<\/ol>\n\n\n\n<p>Both new additions to the toolbox currently support structured tabular data.<\/p>\n\n\n\n<p>Throughout the blog, you will learn how these tools fit in the everyday job of a data scientist, how they connect to other tools in the Responsible AI ecosystem, and how to use them for concrete problems in data science and machine 
learning. We will also use a concrete prediction scenario to illustrate the main functionalities of both tools and tie all insights together.<\/p>\n\n\n\n<h2 id=\"targeted-model-improvement\">Targeted model improvement<\/h2>\n\n\n\n<p>Traditional methods of addressing failures can rely too heavily on a single metric for measuring model effectiveness and tackle the problems that do arise by adding more data, more compute, bigger models, or better parameters. While adding more data or compute as a blanket approach is beneficial, addressing particular problems that negatively impact subsets of the data or cohorts requires a more systematic and cost-effective approach. Targeted model improvement encourages a systematic process of:<\/p>\n\n\n\n<ol class=\"wp-block-list\" type=\"1\">\n<li><a href=\"#identification\">Carefully identifying failure modes during model evaluation.<\/a><\/li>\n\n\n\n<li><a href=\"#diagnosis\">Diagnosing factors behind them during model debugging.<\/a><\/li>\n\n\n\n<li><a href=\"#mitigation\">Taking informed mitigation actions during and before model retraining.<\/a><\/li>\n\n\n\n<li><a href=\"#tracking\">Tracking, comparing, and validating the different mitigation choices during model selection.<\/a><\/li>\n<\/ol>\n\n\n\n<p>In this big picture, the Responsible AI Mitigations library helps data scientists not only implement but also customize mitigation steps according to failure modes and issues they might have found during identification and diagnosis. Responsible AI Tracker then helps with interactively tracking and comparing mitigation experiments, enabling data scientists to see where the model has improved and whether there are variations in performance for different data cohorts. 
Both tools are meant to be used in combination with already available tools such as the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox\">Responsible AI Dashboard<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> from the same toolbox, which supports failure mode identification and diagnosis.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"697\" height=\"409\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e2fa4673bf4.png\" alt=\"Figure 1 \u2013 Targeted model improvement encourages a systematic process of identifying, diagnosing, and then comparing and validating failure modes. \nThe Responsible AI Toolbox facilitates the process through tools that support each stage.\" class=\"wp-image-917766\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e2fa4673bf4.png 697w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e2fa4673bf4-300x176.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e2fa4673bf4-240x141.png 240w\" sizes=\"auto, (max-width: 697px) 100vw, 697px\" \/><figcaption class=\"wp-element-caption\">Figure 1 \u2013 Targeted model improvement encourages a systematic process of identifying, diagnosing, and then comparing and validating failure modes. <br>The Responsible AI Toolbox facilitates the process through tools that support each stage.<\/figcaption><\/figure>\n\n\n\n<p>Next, we show through a data science case study how each of these tools can be used to perform targeted model improvement. 
For each step we provide code snippets and Jupyter Notebooks that can be used together with the tools.<\/p>\n\n\n\n<h2 id=\"case-study\">Case study<\/h2>\n\n\n\n<p><strong>Dataset<\/strong>: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Census+Income\">UCI Income dataset<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n\n\n\n<p><strong>Dataset features<\/strong>: age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country<\/p>\n\n\n\n<p><strong>Prediction task<\/strong>: Binary classification. Predicting whether an individual earns more or less than 50K. The positive class in this case is >50K.<\/p>\n\n\n\n<p><strong>Tools needed to run this case study<\/strong>: raimitigations, Responsible AI Tracker extension on Jupyter Lab, raiwidgets<\/p>\n\n\n\n<p><strong>Other libraries<\/strong>: lightgbm, scikit-learn, pandas<\/p>\n\n\n\n<p><strong>Tour notebooks <\/strong>available <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-tracker\/tree\/main\/tour\">here<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n\n\n\n<p><strong>Tour video<\/strong>: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/youtu.be\/jN6LWFzSLaU\">https:\/\/youtu.be\/jN6LWFzSLaU<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<div class=\"yt-consent-placeholder\" role=\"region\" 
aria-label=\"Video playback requires cookie consent\" data-video-id=\"jN6LWFzSLaU\" data-poster=\"https:\/\/img.youtube.com\/vi\/jN6LWFzSLaU\/maxresdefault.jpg\"><iframe aria-hidden=\"true\" tabindex=\"-1\" title=\"Responsible AI Tracker Tour\" width=\"500\" height=\"281\" data-src=\"https:\/\/www.youtube-nocookie.com\/embed\/jN6LWFzSLaU?feature=oembed&rel=0&enablejsapi=1\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><div class=\"yt-consent-placeholder__overlay\"><button class=\"yt-consent-placeholder__play\"><svg width=\"42\" height=\"42\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" aria-hidden=\"true\" focusable=\"false\"><g fill=\"none\" fill-rule=\"evenodd\"><circle fill=\"#000\" opacity=\".556\" cx=\"21\" cy=\"21\" r=\"21\"\/><path stroke=\"#FFF\" d=\"M27.5 22l-12 8.5v-17z\"\/><\/g><\/svg><span class=\"yt-consent-placeholder__label\">Video playback requires cookie consent<\/span><\/button><\/div><\/div>\n<\/div><\/figure>\n\n\n\n<a name=\"identification\"><\/a>\n\n\n\n<h2 id=\"part-1-identification\">Part 1: Identification<\/h2>\n\n\n\n<p>Imagine you train a gradient boosting classifier using the LightGBM library and the UCI income dataset. 
Full notebook available <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-tracker\/blob\/main\/tour\/5%20estimators.ipynb\">here<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def split_label(dataset, target_feature):\n    X = dataset.drop(&#091;target_feature], axis=1)\n    y = dataset&#091;&#091;target_feature]]\n    return X, y\n\ntarget_feature = 'income'\ncategorical_features = &#091;'workclass', 'education', 'marital-status',\n                        'occupation', 'relationship', 'race', 'gender', 'native-country']\n\ntrain_data = pd.read_csv('adult-train.csv', skipinitialspace=True, header=0)\ntest_data = pd.read_csv('adult-test-sample.csv', skipinitialspace=True, header=0)\n\nX_train_original, y_train = split_label(train_data, target_feature)\nX_test_original, y_test = split_label(test_data, target_feature)\n\nestimator = LGBMClassifier(random_state=0, n_estimators=5)\n\npipe = Pipeline(&#091;\n    (\"imputer\", SimpleImputer(strategy='constant', fill_value='?')),\n    (\"encoder\", OneHotEncoder(handle_unknown='ignore', sparse=False)),\n    (\"estimator\", estimator)\n])\npipe.fit(X_train_original, y_train)<\/code><\/pre>\n\n\n\n<p>The overall evaluation of the model shows that the model is 78.9% accurate in the given sample test data, or in other words it makes mistakes for 21.1% of the test data. 
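These summary numbers come directly from comparing the pipeline's predictions against the labels of the sample test set. Below is a minimal, self-contained sketch of that evaluation step; synthetic data and a scikit-learn classifier stand in here for the UCI income data and the LightGBM pipeline trained above, so the printed numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the income dataset and the trained pipeline.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Overall accuracy on the held-out sample; the error rate is its complement.
accuracy = accuracy_score(y_test, clf.predict(X_test))
error_rate = 1 - accuracy
print(f"accuracy: {accuracy:.3f}, error rate: {error_rate:.3f}")
```

On the actual case study, the same `accuracy_score` call applied to the fitted pipeline's predictions on `X_test_original` is what yields the 78.9% accuracy (21.1% error) quoted above.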
Before taking any improvement steps, the natural questions for a data scientist would be \u201cWhere are most errors concentrated?\u201d and \u201cAre there any cohorts in the data that have a considerably higher error and if so how do these errors map to data or model problems?\u201d Responsible AI Dashboard brings in these insights through <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.youtube.com\/watch?v=NYXRrLzGiFk&ab_channel=MicrosoftResearch\">disaggregated model evaluation<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and error analysis, fairness assessment, interpretability, and data exploration. Check out <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/aka.ms\/raidashboardblog\">this blog<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> for a full tour on how to use Responsible AI Dashboard.<\/p>\n\n\n\n<p>For example, this is how you can run the dashboard on this case study:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from raiwidgets import ResponsibleAIDashboard\nfrom responsibleai import RAIInsights\nrai_insights = RAIInsights(pipe, train_data, test_data, target_feature, 'classification',\n                           categorical_features=categorical_features)\n# Interpretability\nrai_insights.explainer.add()\n# Error Analysis\nrai_insights.error_analysis.add()\nrai_insights.compute()\nResponsibleAIDashboard(rai_insights)<\/code><\/pre>\n\n\n\n<p>The dashboard digests the model or predictive pipeline, the training data, and the sample test data. Then it generates insights about input conditions that are main drivers of model errors. 
Through the dashboard you can find out two major failure modes:<\/p>\n\n\n\n<ol class=\"wp-block-list\" type=\"1\">\n<li>The error rate increases from 21.1% to 38.9% for individuals who are married.<\/li>\n\n\n\n<li>The error rate increases from 21.1% to 58.1% for individuals who are married and have a number of years of education higher than 11 (Figure 2). At the same time, we also see that the error rate increases with the number of education years (Figure 3).<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e2fdc2ed39c.png\" alt=\"Figure 2 \u2013 Error Analysis through Responsible AI Dashboard highlighting a higher error rate for individuals who are married and have a higher number of education years.\" class=\"wp-image-917784\" width=\"900\" height=\"427\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e2fdc2ed39c.png 975w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e2fdc2ed39c-300x142.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e2fdc2ed39c-768x365.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e2fdc2ed39c-240x114.png 240w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><figcaption class=\"wp-element-caption\">Figure 2 \u2013 Error Analysis through Responsible AI Dashboard highlighting a higher error rate for individuals who are married and have a higher number of education years.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1251\" height=\"701\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Responsible-AI-Dashboard-2.png\" alt=\"Figure 3 - Error Analysis through Responsible AI 
Dashboard shows that the error rate increases with the number of education years.\" class=\"wp-image-918150\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Responsible-AI-Dashboard-2.png 1251w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Responsible-AI-Dashboard-2-300x168.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Responsible-AI-Dashboard-2-1024x574.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Responsible-AI-Dashboard-2-768x430.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Responsible-AI-Dashboard-2-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Responsible-AI-Dashboard-2-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Responsible-AI-Dashboard-2-240x134.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Responsible-AI-Dashboard-2-640x360.png 640w\" sizes=\"auto, (max-width: 1251px) 100vw, 1251px\" \/><figcaption class=\"wp-element-caption\">Figure 3 &#8211; Error Analysis through Responsible AI Dashboard shows that the error rate increases with the number of education years.<\/figcaption><\/figure>\n\n\n\n<a name=\"diagnosis\"><\/a>\n\n\n\n<h2 id=\"part-2-diagnosis\">Part 2: Diagnosis<\/h2>\n\n\n\n<p>Since these cohorts (subsets of the data) seem to have a higher number of errors, we will then see how to further diagnose the underlying problems, mitigate them through the raimitigations library, and continue to track them in Responsible AI Tracker. More specifically, we will track the following cohorts for debugging purposes:<\/p>\n\n\n\n<ol class=\"wp-block-list\" type=\"A\">\n<li>Married individuals (i.e. 
<code>relationship == \u2018Wife\u2019 or &nbsp;relationship == \u2018Husband\u2019<\/code>)<\/li>\n\n\n\n<li>Not married individuals (i.e. <code>relationship <> \u2018Wife\u2019 and &nbsp;relationship <> \u2018Husband\u2019<\/code>)<\/li>\n\n\n\n<li>Education > 11 (i.e. <code>education-num > 11<\/code>)<\/li>\n\n\n\n<li>Married individuals with Education > 11 (i.e. <code>(relationship == \u2018Wife\u2019 or &nbsp;relationship == \u2018Husband\u2019) and (education-num > 11)<\/code>)<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1920\" height=\"624\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Data-Analysis.gif\" alt=\"Figure 4 - Data Analysis through Responsible AI Dashboard shows an opposite class imbalance for married individuals with a number of education years higher than 11.\" class=\"wp-image-917796\" \/><figcaption class=\"wp-element-caption\">Figure 4 &#8211; Data Analysis through Responsible AI Dashboard shows an opposite class imbalance for married individuals with a number of education years higher than 11.<\/figcaption><\/figure>\n\n\n\n<p>When looking at the overall data distribution, we see that the ground truth label distribution is skewed towards the negative class ( <= 50K). However, the cohorts with a higher error have an almost balanced class label distribution; in the extreme case of married individuals with a number of education years > 11, the skew even shifts in the opposite direction, with more data where the class label is positive ( > 50K). Given this observation, we can now form the hypothesis that this very different class label distribution and imbalance is the reason why the model performs worse for these cohorts.<\/p>\n\n\n\n<p>Based on these diagnostics, the natural next step for a data scientist is to investigate data balancing opportunities. 
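Since the cohort definitions above are plain boolean filters, the per-cohort error rates and label distributions behind this diagnosis can also be checked directly with pandas. Below is a minimal sketch on a hypothetical toy frame; the column names follow the UCI income schema, but the rows and the <code>predicted<\/code> column are invented for illustration (on the real data you would filter the test sample and the model's predictions the same way):

```python
import pandas as pd

# Tiny invented stand-in for the test set, with model predictions attached.
df = pd.DataFrame({
    "relationship": ["Wife", "Husband", "Own-child", "Unmarried", "Husband"],
    "education-num": [13, 9, 12, 10, 14],
    "income": [">50K", "<=50K", "<=50K", "<=50K", ">50K"],
    "predicted": [">50K", ">50K", "<=50K", "<=50K", "<=50K"],
})

# The four cohorts from the list above, expressed as boolean masks.
married = df["relationship"].isin(["Wife", "Husband"])
cohorts = {
    "married": married,
    "not_married": ~married,
    "edu_gt_11": df["education-num"] > 11,
    "married_edu_gt_11": married & (df["education-num"] > 11),
}

for name, mask in cohorts.items():
    sub = df[mask]
    err = (sub["income"] != sub["predicted"]).mean()   # cohort error rate
    pos = (sub["income"] == ">50K").mean()             # positive-class share
    print(f"{name}: error rate={err:.2f}, positive-class share={pos:.2f}")
```

Comparing each cohort's positive-class share against the overall share is exactly the kind of disaggregated check that surfaces the imbalance shift shown in Figure 4.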
However, there may be different ways of balancing the data:<\/p>\n\n\n\n<ol class=\"wp-block-list\" type=\"1\">\n<li><strong>Balancing all data<\/strong> [blanket mitigation]: This would entail sampling more data for the positive class but not necessarily from the cohorts of interest.<\/li>\n\n\n\n<li><strong>Balancing each cohort separately<\/strong> [separate but not targeted mitigation]: This would entail sampling more data from the positive class for all disjoint cohorts separately, including those that do not have a high error. For simplicity, we will look at two disjoint cohorts: \u201cMarried\u201d and \u201cNot married\u201d.<\/li>\n\n\n\n<li><strong>Balancing only the cohort with higher error<\/strong> [targeted mitigation]: This would entail sampling more data from the positive class only for the cohort with higher error (\u201cMarried\u201d) and leaving the other disjoint cohort as is (\u201cNot married\u201d).<\/li>\n<\/ol>\n\n\n\n<p>Of course, you could imagine other types of mitigations that are even more customized and targeted, but in this case study we will show that targeting mitigations only to the larger \u201cMarried\u201d cohort also helps other cohorts that intersect with it and that may fail for similar reasons (i.e., a different class imbalance than the overall data).<\/p>\n\n\n\n<p>Let\u2019s see how we can implement these scenarios with the help of the Responsible AI Mitigations library.<\/p>\n\n\n\n<a name=\"mitigation\"><\/a>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 id=\"part-3-mitigations\">Part 3: Mitigations<\/h2>\n\n\n\n<p>Formulating and experimenting with targeted model improvements can be complex and time-consuming. 
The <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-mitigations\">Responsible AI Mitigations library<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> streamlines this process by bringing together well-known machine learning techniques and adapting them to target model errors that occur in specific cohorts or across all data.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1258\" height=\"777\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Model-Improvement-Lifecycle.png\" alt=\"Figure 5 - Different cohorts may have different reasons to why errors occur, requiring potentially different and targeted mitigation actions.\" class=\"wp-image-918156\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Model-Improvement-Lifecycle.png 1258w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Model-Improvement-Lifecycle-300x185.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Model-Improvement-Lifecycle-1024x632.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Model-Improvement-Lifecycle-768x474.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/Model-Improvement-Lifecycle-240x148.png 240w\" sizes=\"auto, (max-width: 1258px) 100vw, 1258px\" \/><figcaption class=\"wp-element-caption\">Figure 5 &#8211; Different cohorts may have different reasons to why errors occur, requiring potentially different and targeted mitigation actions.<\/figcaption><\/figure>\n\n\n\n<p>The library surfaces error mitigation methods that can be applied at different stages of model development:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>Data mitigation<\/em> 
methods such as data balancing, scaling, and missing value imputation enable transformations of the training data that can improve the resulting model. You will find these mitigations in the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/responsible-ai-toolbox-mitigations.readthedocs.io\/en\/latest\/dataprocessing\/intro.html\">DataProcessing<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> class of the library. Since different cohorts may have different reasons why errors happen in the first place (see Figure 5), it is possible to assign different mitigation steps to each cohort, using the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/responsible-ai-toolbox-mitigations.readthedocs.io\/en\/latest\/cohort\/cohort_manager.html#cohort-manager\">CohortManager<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> class of the library.<\/li>\n\n\n\n<li><em>Model mitigation <\/em>methods are applied at the time of model training and involve training models separately on data from cohorts where the model is underperforming. 
This can be achieved by either using the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/responsible-ai-toolbox-mitigations.readthedocs.io\/en\/latest\/cohort\/cohort_manager.html#cohort-manager\">CohortManager<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> or the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/responsible-ai-toolbox-mitigations.readthedocs.io\/en\/latest\/cohort\/decoupled_class.html\">DecoupledClass<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> classes in the library.<\/li>\n\n\n\n<li><em>Post-training mitigation<\/em> methods adjust the final model predictions based on custom, automatically found thresholds for each cohort. This can be achieved using the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/responsible-ai-toolbox-mitigations.readthedocs.io\/en\/latest\/cohort\/decoupled_class.html\">DecoupledClass<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> class in the library. The model builds upon <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/proceedings.mlr.press\/v81\/dwork18a\/dwork18a.pdf\">prior research work<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> proposing post-processing decoupled techniques to improve model fairness.<\/li>\n<\/ul>\n\n\n\n<p>Using the library, these mitigations can be combined as needed, and applied or customized to specific cohorts for more targeted improvement given different cohorts may have separate issues degrading model performance and benefit from individual mitigations. 
Figure 6 shows a summary of possible scenarios that can be implemented for targeted mitigations and this <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-mitigations\/blob\/main\/notebooks\/cohort\/cohort_manager_scenarios.ipynb\">notebook<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> shows how to implement each of the scenarios.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1100\" height=\"680\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/mitigationtypes.jpg\" alt=\"Figure 6 \u2013 A summary of possible targeted mitigation scenarios that can be configured through the Responsible AI Mitigations library.\" class=\"wp-image-918159\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/mitigationtypes.jpg 1100w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/mitigationtypes-300x185.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/mitigationtypes-1024x633.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/mitigationtypes-768x475.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/mitigationtypes-240x148.jpg 240w\" sizes=\"auto, (max-width: 1100px) 100vw, 1100px\" \/><figcaption class=\"wp-element-caption\">Figure 6 \u2013 A summary of possible targeted mitigation scenarios that can be configured through the Responsible AI Mitigations library.<\/figcaption><\/figure>\n\n\n\n<p>Beyond surfacing relevant error mitigation methods, the Responsible AI Mitigations library saves time in model development by condensing the code required to explore and test different approaches for improving model performance. 
For example, without the library support, practitioners would have to split the data manually, apply different mitigation techniques, and then re-merge the data, making sure that the merge is consistent. The library takes away this complexity and handles all the data splitting and targeted mitigations in the background. By reducing the need for coding custom infrastructure, we aim to allow ML practitioners to focus their time on the modeling problem they are tackling and encourage responsible AI development practices.<\/p>\n\n\n\n<p>Revisiting our case study, we will use the mitigations library to try the mitigations discussed above that may improve the model\u2019s performance. First, you will need to install the raimitigations library from PyPI:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install raimitigations<\/code><\/pre>\n\n\n\n<h3 id=\"balancing-all-data\">Balancing all data<\/h3>\n\n\n\n<p>This error mitigation technique involves oversampling from the positive class for all training data. Rebalancing will result in a training dataset with an equal number of instances of the positive and negative class. Retraining with rebalanced data may help address errors from the original model, but this doesn\u2019t guarantee any specific distribution within different cohorts, as can be seen in Figure 7. The following code demonstrates the implementation, also available in this notebook: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-tracker\/blob\/main\/tour\/balance%20all%20data.ipynb\">balance all data.ipynb<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
Note that we are leaving out parts of the code that read the data and define the model.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># 24720 is the size of the majority class\nrebalance = dp.Rebalance(X=X_train_original, y=y_train, \nverbose=False, strategy_over={0:24720, 1:24720})\nnew_X, new_y = rebalance.fit_resample()\n\npipe = Pipeline(&#091;\n    (\"imputer\", dp.BasicImputer(verbose=False)),\n\n    (\"encoder\", dp.EncoderOHE(unknown_err=True)),\n    (\"model\", estimator)\n])\n\npipe.fit(new_X, new_y)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"975\" height=\"524\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e569c7f3823.png\" alt=\"Figure 7 \u2013 Rebalancing the whole data gives equal frequencies of the positive and negative classes in the overall data but does not guarantee data balance in each disjoint cohort of the training data.\" class=\"wp-image-918309\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e569c7f3823.png 975w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e569c7f3823-300x161.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e569c7f3823-768x413.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e569c7f3823-240x129.png 240w\" sizes=\"auto, (max-width: 975px) 100vw, 975px\" \/><figcaption class=\"wp-element-caption\">Figure 7 \u2013 Rebalancing the whole data gives equal frequencies of the positive and negative classes in the overall data but does not guarantee data balance in each disjoint cohort of the training data.<\/figcaption><\/figure>\n\n\n\n<h3 id=\"balancing-each-cohort-separately\">Balancing each cohort separately<\/h3>\n\n\n\n<p>A second mitigation strategy we will try is to rebalance the training data again, but do so for each 
disjoint cohort so that the rebalancing data is sampled from each cohort separately. The assumption here is that resampling only within each cohort will help the model better understand the nuances of class imbalance depending on the cohort. Within each cohort (\u201cMarried\u201d and \u201cNot Married\u201d in this case) the rebalanced data will now have the same frequency of positive and negative class examples, as can be seen in Figure 8. The library accepts a transformation pipeline to be specified for each cohort. The following code demonstrates the implementation, also available in this notebook: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-tracker\/blob\/main\/tour\/balance%20per%20cohort%20both.ipynb\">balance per cohort both.ipynb<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. The implementation uses the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/responsible-ai-toolbox-mitigations.readthedocs.io\/en\/latest\/cohort\/cohort_manager.html\" target=\"_blank\" rel=\"noopener noreferrer\">CohortManager<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> class in the raimitigations library, which manages the complexity behind slicing the data, applying mitigations separately, and merging the data again prior to model retraining.&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Define the cohorts\nc1 = &#091; &#091; &#091;'relationship', '==', 'Wife'], 'or', &#091;'relationship', '==', 'Husband']]]\nc2 = None\n\nc1_pipe = &#091;dp.Rebalance(verbose=False,strategy_over={0:24463, 1:24463})]\nc2_pipe = &#091;dp.Rebalance(verbose=False,strategy_over={0:16622, 1:16622})]\n\nrebalance_cohort = CohortManager(\n    transform_pipe=&#091;c1_pipe, c2_pipe],\n    cohort_def={\"married\":c1, \"not_married\":c2}\n    )\n    \nnew_X, new_y 
= rebalance_cohort.fit_resample(X_train_original, y_train)\n\n#Create a pipeline that uses the cohort manager\npipe = Pipeline(&#091;\n    (\"imputer\", dp.BasicImputer(verbose=False)),\n    (\"encoder\", dp.EncoderOHE(unknown_err=True)),\n    (\"model\", estimator)\n    ])\n\npipe.fit(new_X, new_y)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"975\" height=\"523\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56b9f7dcfb.png\" alt=\"Figure 8 \u2013 Rebalancing per cohort gives equal frequencies of the positive and negative classes in each disjoint cohort of the training data.\" class=\"wp-image-918321\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56b9f7dcfb.png 975w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56b9f7dcfb-300x161.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56b9f7dcfb-768x412.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56b9f7dcfb-710x380.png 710w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56b9f7dcfb-240x129.png 240w\" sizes=\"auto, (max-width: 975px) 100vw, 975px\" \/><figcaption class=\"wp-element-caption\">Figure 8 \u2013 Rebalancing per cohort gives equal frequencies of the positive and negative classes in each disjoint cohort of the training data.<\/figcaption><\/figure>\n\n\n\n<h3 id=\"balancing-only-the-cohort-with-higher-error\">Balancing only the cohort with higher error<\/h3>\n\n\n\n<p>The last approach we will evaluate is to perform the rebalance only on the cohort with a high error rate. This targeted mitigation will leave data in other cohorts unchanged while sampling more data from the positive class for the \u201cMarried\u201d cohort, as shown in Figure 9. 
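<\/p>\n\n\n\n<p>Conceptually, a targeted mitigation follows a split, transform, and re-merge pattern: slice out the targeted cohort, apply the mitigation only there, and put the rows back together. The helper below is a hypothetical plain-Python illustration of that pattern, not the library\u2019s <code>CohortManager<\/code>.<\/p>\n\n\n\n

```python
# Hypothetical helper mirroring the split / transform / re-merge pattern;
# it is NOT the raimitigations CohortManager, just a plain-Python sketch.
def apply_to_cohort(rows, in_cohort, transform):
    """Transform only the rows matching the cohort predicate, keep the rest."""
    cohort = [r for r in rows if in_cohort(r)]
    rest = [r for r in rows if not in_cohort(r)]
    return transform(cohort) + rest

rows = [{'relationship': 'Wife', 'label': 1},
        {'relationship': 'Husband', 'label': 0},
        {'relationship': 'Unmarried', 'label': 0}]

def is_married(row):
    return row['relationship'] in ('Wife', 'Husband')

# Example transform: naively duplicate every row in the targeted cohort.
result = apply_to_cohort(rows, is_married, lambda cohort: cohort * 2)
# The two "married" rows are duplicated; the "not married" row is untouched.
```

<p>A real implementation also has to keep the merge consistent (for example, preserving row order and indices), which is the bookkeeping <code>CohortManager<\/code> takes care of.<\/p>\n\n\n\n<p>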
Similar to the previous mitigation, we will specify a transform pipe to rebalance the \u201cMarried\u201d cohort, but this time we will give an empty pipeline for the rest of the data. The following code demonstrates the implementation, also available in this notebook: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-tracker\/blob\/main\/tour\/target%20balance%20per%20cohort.ipynb\">target balance per cohort.ipynb<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Define the cohorts\nc1 = &#091; &#091; &#091;'relationship', '==', 'Wife'], 'or', &#091;'relationship', '==', 'Husband']]]\nc2 = None\n\nc1_pipe = &#091;dp.Rebalance(verbose=False,strategy_over={0:24463, 1:24463})]\nc2_pipe = &#091;]\n\nrebalance_cohort = CohortManager(\n    transform_pipe=&#091;c1_pipe, c2_pipe],\n    cohort_def={\"married\":c1, \"not_married\":c2}\n    )\n    \nnew_X, new_y = rebalance_cohort.fit_resample(X_train_original, y_train)\n\n#Create a pipeline that uses the cohort manager\npipe = Pipeline(&#091;\n    (\"imputer\", dp.BasicImputer(verbose=False)),\n    (\"encoder\", dp.EncoderOHE(unknown_err=True)),\n    (\"model\", estimator)\n    ])\n\npipe.fit(new_X, new_y)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"975\" height=\"520\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56be83696f.png\" alt=\"Figure 9 \u2013 Targeting rebalancing only for the cohort with higher errors (\u201cMarried\u201d in this case) ensures equal frequencies of positive and negative labels for the \u201cMarried\u201d cohort but does not affect the rest of the data.\" class=\"wp-image-918327\" 
srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56be83696f.png 975w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56be83696f-300x160.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56be83696f-768x410.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56be83696f-710x380.png 710w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56be83696f-240x128.png 240w\" sizes=\"auto, (max-width: 975px) 100vw, 975px\" \/><figcaption class=\"wp-element-caption\">Figure 9 \u2013 Targeting rebalancing only for the cohort with higher errors (\u201cMarried\u201d in this case) ensures equal frequencies of positive and negative labels for the \u201cMarried\u201d cohort but does not affect the rest of the data.<\/figcaption><\/figure>\n\n\n\n<p>With the three mitigation strategies implemented, we have three new models trained to compare against the original. Next, we\u2019ll discuss how the Responsible AI Tracker can be used to run the comparison and determine which model yields the best results.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<a name=\"tracking\"><\/a>\n\n\n\n<h2 id=\"part-4-tracking-comparing-and-validating-mitigations\">Part 4: Tracking, comparing, and validating mitigations<\/h2>\n\n\n\n<p>As we saw in the walkthrough for the Responsible AI Mitigations library, during the model improvement lifecycle, data scientists can create several ways of improving the data and the model itself. While some of these improvements yield different rates of overall accuracy improvements, improvement for particular cohorts is not always guaranteed. 
Performance drops for parts of the data are often referred to as <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/creating-better-ai-partners-a-case-for-backward-compatibility\/\">backward incompatibility issues<\/a> in machine learning. Therefore, it becomes important, prior to deployment, for practitioners to not only track and compare the different mitigation outcomes, but also to validate whether the issues they set out to address in the first place are indeed tackled by the model they would select for deployment.<\/p>\n\n\n\n<p>This is exactly where <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-tracker\">Responsible AI Tracker<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> comes into action. Not only does it enable model comparison across several models and metrics, but it also disaggregates model comparison across cohorts. 
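<\/p>\n\n\n\n<p>To see why disaggregation matters, the following plain-Python example (with made-up predictions, not the Tracker\u2019s API) constructs a candidate model whose overall accuracy improves while one cohort silently regresses:<\/p>\n\n\n\n

```python
# Hypothetical data and helper illustrating disaggregated (per-cohort) model
# comparison; this is plain Python, not the Tracker's API.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true   = [1, 0, 1, 1, 0, 0]
baseline = [0, 0, 0, 1, 0, 0]   # baseline model predictions
improved = [1, 0, 1, 1, 0, 1]   # candidate model predictions
married  = [True, True, True, False, False, False]  # cohort membership mask

deltas = {}
for name, mask in [('married', married), ('not_married', [not m for m in married])]:
    idx = [i for i, m in enumerate(mask) if m]
    base = accuracy([y_true[i] for i in idx], [baseline[i] for i in idx])
    cand = accuracy([y_true[i] for i in idx], [improved[i] for i in idx])
    deltas[name] = round(cand - base, 2)

# Overall accuracy goes up (0.67 -> 0.83), yet the not_married cohort regresses:
# deltas == {'married': 0.67, 'not_married': -0.33}
```

<p>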
This fills in a large gap in practice, as this type of functionality is currently not available in most visualizations and the lack of cohort-based comparisons may hide important performance drops.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"975\" height=\"515\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56cb79e014.png\" alt=\"Figure 10 - Responsible AI Tracker brings together several artifacts needed for data science and machine learning practice: code, models, and comparative visualizations.\" class=\"wp-image-918336\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56cb79e014.png 975w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56cb79e014-300x158.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56cb79e014-768x406.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56cb79e014-240x127.png 240w\" sizes=\"auto, (max-width: 975px) 100vw, 975px\" \/><figcaption class=\"wp-element-caption\">Figure 10 &#8211; Responsible AI Tracker brings together several artifacts needed for data science and machine learning practice: code, models, and comparative visualizations.<\/figcaption><\/figure>\n\n\n\n<p>Responsible AI Tracker is built as an open-source and downloadable extension to Jupyter Lab, the latest web-based interactive development environment for notebooks, code, and data for project <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/jupyter.org\/\">Jupyter<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. Jupyter Lab has a modular design that invites extensions to expand and enrich functionality. 
The design has given rise to <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/jupyterlab-contrib.github.io\/extensions.html\">several fast-growing projects<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> in data science that enrich the environment with much-needed functionality.<\/p>\n\n\n\n<p>In comparison to Jupyter Notebooks, Jupyter Lab enables practitioners to work with more than one notebook at the same time to better organize their work. Responsible AI Tracker takes this a step further by bringing together notebooks, models, and visualization reports on model comparison within the same interface. Practitioners can map notebooks to models such that the relationship between code and models is persisted and tracked easily. The mapping encourages clean data science practices while also accommodating the flexibility that experimental data science needs to iterate fast through the use of Jupyter notebooks.<\/p>\n\n\n\n<p>Once a model is registered to a notebook, it appears in the model comparison table side by side with other models. The model comparison table offers two forms of visualization: absolute and comparative. Generally, it is recommended to use a stronger shade for desirable performance. The absolute view shows raw absolute score metrics and is shaded using a single color. The comparative view also shows the corresponding differences between model performance and the baseline performance, either for the overall dataset or for the given cohort. For example, if the accuracy of the baseline is 0.8 in the overall dataset and the accuracy of a mitigated model is 0.85 for the overall dataset, the respective cell in the table will show 0.85 (0.05 \u2191), indicating that there is a 0.05 improvement for the overall dataset. 
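<\/p>\n\n\n\n<p>The arithmetic behind such a cell can be sketched with a small hypothetical helper (not part of the Tracker\u2019s code):<\/p>\n\n\n\n

```python
# Hypothetical formatter reproducing the comparative-view cell arithmetic
# described above; it is not part of the Tracker's code.
def comparative_cell(baseline, score):
    arrow = '\u2191' if score >= baseline else '\u2193'  # up / down arrow
    return f'{score:.2f} ({abs(score - baseline):.2f} {arrow})'

cell = comparative_cell(0.80, 0.85)
# comparative_cell(0.80, 0.85) returns '0.85 (0.05 \u2191)'
```

<p>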
Similarly, if the accuracy of the same baseline is instead 0.9 for cohort A, but it is 0.87 for the newly mitigated model, the respective cell for the model and cohort A will show 0.87 (0.03 \u2193), indicating a 0.03 decline in accuracy for cohort A. This enables a 1:1 comparison across cohorts over several models. The shading in the comparative view is based on two colors: one for performance improvement (dark green by default) and one for performance decline (dark red by default).<\/p>\n\n\n\n<p>Practitioners can choose which cohorts they want to compare by creating cohorts via the cohort management functionalities. The cohort creation process entails adding one or more filters to the data and saving that cohort for later use in model comparison.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63eb1ef549b17.png\" alt=\"Figure 11 - Cohort creation in Responsible AI Tracker.\" class=\"wp-image-919020\" width=\"257\" height=\"471\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63eb1ef549b17.png 317w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63eb1ef549b17-164x300.png 164w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63eb1ef549b17-98x180.png 98w\" sizes=\"auto, (max-width: 257px) 100vw, 257px\" \/><figcaption class=\"wp-element-caption\">Figure 11 &#8211; Cohort creation in Responsible AI Tracker.<\/figcaption><\/figure>\n\n\n\n<p>Let\u2019s now go back to our case study and see how the different mitigation ideas we discussed earlier can be compared and validated.<\/p>\n\n\n\n<p>First, let\u2019s import the four notebooks containing the initial baseline model, along with the notebooks with the data balancing mitigations:<\/p>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li><strong>Initial baseline notebook<\/strong> [only missing value imputation and encoding]: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-tracker\/blob\/main\/tour\/5%20estimators.ipynb\">5 estimators.ipynb<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n\n\n\n<li><strong>Balancing all data<\/strong> [blanket mitigation]: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-tracker\/blob\/main\/tour\/balance%20all%20data.ipynb\">balance all data.ipynb<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n\n\n\n<li><strong>Balancing each cohort separately<\/strong> [separate but not targeted mitigation]: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-tracker\/blob\/main\/tour\/balance%20per%20cohort%20both.ipynb\">balance per cohort both.ipynb<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n\n\n\n<li><strong>Balancing only the cohort with higher error<\/strong> [targeted mitigation]: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-tracker\/blob\/main\/tour\/target%20balance%20per%20cohort.ipynb\">target balance per cohort.ipynb<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<\/ul>\n\n\n\n<p>Next, let\u2019s run these notebooks and register the corresponding models to the notebooks. 
The registration process will require the model itself as well as the test dataset (in our case, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-tracker\/blob\/main\/tour\/adult-test-sample.csv\">adult-test-sample.csv<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>).<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63eb1e21b4890.png\" alt=\"Figure 12 - Registering a model in Responsible AI Tracker\" class=\"wp-image-919011\" width=\"299\" height=\"319\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63eb1e21b4890.png 531w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63eb1e21b4890-281x300.png 281w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63eb1e21b4890-169x180.png 169w\" sizes=\"auto, (max-width: 299px) 100vw, 299px\" \/><figcaption class=\"wp-element-caption\">Figure 12 &#8211; Registering a model in Responsible AI Tracker<\/figcaption><\/figure>\n\n\n\n<p>Then, we can create a few corresponding cohorts of interest to track. In this case, it would be interesting to track the cohorts we had identified at the beginning for comparison:<\/p>\n\n\n\n<ol class=\"wp-block-list\" type=\"A\">\n<li>Married individuals (i.e. <code>relationship == \u2018Wife\u2019 or relationship == \u2018Husband\u2019<\/code>)<\/li>\n\n\n\n<li>Not married individuals (i.e. <code>relationship <> \u2018Wife\u2019 and relationship <> \u2018Husband\u2019<\/code>)<\/li>\n\n\n\n<li>Education > 11 (i.e. <code>education-num > 11<\/code>)<\/li>\n\n\n\n<li>Married individuals with Education > 11 (i.e. 
(<code>relationship == \u2018Wife\u2019 or relationship == \u2018Husband\u2019) and (education-num > 11)<\/code>)<\/li>\n<\/ol>\n\n\n\n<p>Figure 13 shows how the model comparison table would look for the &#8220;Married&#8221; and &#8220;Not married&#8221; cohorts.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>Note on usability<\/strong>: In order to filter and focus on one set of insights at a time, you can also filter the table by notebook, cohort, and metric. Filtering will not only readjust the color shading according to the table content but will also enable you to put relevant numbers side by side, when needed. For example, if you only want to compare model performance on a single cohort, removing all other cohorts from the filter will show all relevant metrics next to each other vertically.<\/p>\n<\/blockquote>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"979\" height=\"541\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56cf5015c8.png\" alt=\"Figure 13 - Model comparison for the overall data and for the &quot;Married&quot; and &quot;Not married&quot; cohorts.\" class=\"wp-image-918339\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56cf5015c8.png 979w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56cf5015c8-300x166.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56cf5015c8-768x424.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56cf5015c8-240x133.png 240w\" sizes=\"auto, (max-width: 979px) 100vw, 979px\" \/><figcaption class=\"wp-element-caption\">Figure 13 &#8211; Model comparison for the overall data and for the &#8220;Married&#8221; and &#8220;Not married&#8221; 
cohorts.<\/figcaption><\/figure>\n\n\n\n<p>What are the main observations here?<\/p>\n\n\n\n<ol class=\"wp-block-list\" type=\"1\">\n<li><strong>Balancing all data<\/strong> [blanket mitigation] improves overall model accuracy by ~4% and benefits both the \u201cMarried\u201d and \u201cNot married\u201d cohorts.<\/li>\n\n\n\n<li><strong>Balancing each cohort separately<\/strong> [separate but not targeted mitigation] improves overall model accuracy by only 0.4% and only benefits the \u201cMarried\u201d cohort. On the contrary, accuracy for the \u201cNot married\u201d cohort drops by 9%. To understand what is happening here, let\u2019s take a look at the training data distribution with respect to class balance before and after each mitigation, as shown in Figure 14. Initially, we see that for the \u201cNot married\u201d cohort, the class imbalance is skewed towards the negative class. Rebalancing the data for this cohort, albeit separately (meaning samples are drawn only from this cohort), distorts the original prior on this cohort and therefore leads the new model to be less accurate for it.<\/li>\n\n\n\n<li><strong>Balancing only the cohort with higher error<\/strong> [targeted mitigation] is more effective than the blanket strategy and at the same time does not introduce performance drops for the \u201cNot married\u201d cohort. The approach samples more data with a positive label for the \u201cMarried\u201d cohort, without affecting the rest of the data. 
This allows the model to improve recall for the \u201cMarried\u201d cohort and yet keep the prior on more negative labels for the \u201cNot married\u201d one.<\/li>\n<\/ol>\n\n\n\n<p>In general, we also see that all models sacrifice some of the initial precision for a better recall, which is expected from all such rebalancing strategies.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"975\" height=\"509\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e351c81de48.png\" alt=\"Figure 14 \u00a0- Training data distribution of the class label before and after each mitigation.\" class=\"wp-image-917844\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e351c81de48.png 975w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e351c81de48-300x157.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e351c81de48-768x401.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e351c81de48-240x125.png 240w\" sizes=\"auto, (max-width: 975px) 100vw, 975px\" \/><figcaption class=\"wp-element-caption\">Figure 14 &nbsp;&#8211; Training data distribution of the class label before and after each mitigation.<\/figcaption><\/figure>\n\n\n\n<p>Finally, Figure 15 also shows how the different models compare with respect to the other cohorts we identified with higher errors at the identification stage. 
We see that indeed the most problematic cohort (&#8220;Married and Education > 11&#8221;) is the one that is improved the most, by at least 28%.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/image-63e56d225b487.png\" alt=\"Figure 15 - Model comparison for the overall data and for the Education > 11 and Married and Education > 11 cohorts.\" class=\"wp-image-918342\" \/><figcaption class=\"wp-element-caption\">Figure 15 &#8211; Model comparison for the overall data and for the &#8220;Education > 11&#8221; and &#8220;Married and Education > 11&#8221; cohorts.<\/figcaption><\/figure>\n\n\n\n<h2 id=\"summary\">Summary<\/h2>\n\n\n\n<p>In this extended blog, we saw how a targeted model improvement approach can provide immediate benefits for improving model performance in parts of the data where the model fails the most. The approach is enabled by a set of existing and new tools for Responsible AI: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox#introducing-responsible-ai-dashboard\">Responsible AI Dashboard<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and most recently <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-mitigations\">Responsible AI Mitigations<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/responsible-ai-toolbox-tracker\">Tracker<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
Looking forward, we hope that such tools will accelerate the process of model improvement and help data scientists and domain experts make informed decisions in the machine learning lifecycle. As ML systems and models continue to be deployed in user-facing applications, taking a rigorous and yet accelerated approach to how we build and evaluate machine learning will help us create applications that most benefit people and society.<\/p>\n\n\n\n<p>If you have feedback on any of the tools or ideas presented in this blog or would like to propose an open-source collaboration, reach us at <a href=\"mailto:rai-toolbox@microsoft.com\">rai-toolbox@microsoft.com<\/a>. All tools described in this blog are open-source and welcome community contributors.<\/p>\n\n\n\n<h2 id=\"acknowledgements\">Acknowledgements<\/h2>\n\n\n\n<p>This work was made possible through the collaboration of several teams and amazing folks passionate about operationalizing Responsible AI. We are an interdisciplinary team consisting of machine learning and front-end engineers, designers, UX researchers, and machine learning researchers. 
If you would like to learn more about the history and journey of this and other work from the team read our <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/responsible-ai-the-research-collaboration-behind-new-open-source-tools-offered-by-microsoft\/\">hero blog<\/a>.<\/p>\n\n\n\n<p><strong>Microsoft Research<\/strong>: <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/danyr\/\">Dany Rouhana<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/mmendonca\/\">Matheus Mendon\u00e7a<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/maabdin\/\">Marah Abdin<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/thuvanp\/\">ThuVan Pham<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/irinasp\/\">Irina Spiridonova<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/markenc\/\">Mark Encarnaci\u00f3n<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/raghosh\/publications\/\">Rahee Ghosh Peshawaria<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/samershi\/\">Saleema Amershi<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/eckamar\/\">Ece Kamar<\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/benushi\/\">Besmira Nushi<\/a><\/p>\n\n\n\n<p><strong>Microsoft Aether<\/strong>: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/in\/jingya-chen-61745b154\/\">Jingya Chen<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/mivorvor\/\">Mihaela Vorvoreanu<\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/in\/kathleenedits\/\">Kathleen Walker<span class=\"sr-only\"> (opens 
in new tab)<\/span><\/a>, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/horvitz\/\">Eric Horvitz<\/a><\/p>\n\n\n\n<p><strong>Azure Machine Learning<\/strong>: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/in\/gug121\/\">Gaurav Gupta<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/in\/imatiach\/\">Ilya Matiach<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/in\/romanlutz\/\">Roman Lutz<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/in\/ke-xu-004253119\/\">Ke Xu<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/in\/minsoothigpen\/\">Minsoo Thigpen<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/in\/mehrnoosh-sameki-a2a02245\/\">Mehrnoosh Sameki<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.linkedin.com\/in\/ssweetman\/\">Steve Sweetman<span class=\"sr-only\"> 
(opens in new tab)<\/span><\/a><\/p>\n\n\n\n<p>Big thanks and congratulations to everyone who made this possible!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Responsible AI Mitigations: https:\/\/github.com\/microsoft\/responsible-ai-toolbox-mitigations (opens in new tab) Responsible AI Tracker: https:\/\/github.com\/microsoft\/responsible-ai-toolbox-tracker (opens in new tab) Authors: Besmira Nushi (Principal Researcher) and Rahee Ghosh Peshawaria (Senior Program Manager) The goal of responsible AI is to create trustworthy AI systems that benefit people while mitigating harms, which can occur when AI systems fail to perform with [&hellip;]<\/p>\n","protected":false},"author":36864,"featured_media":898554,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":917364,"msr_hide_image_in_river":0,"footnotes":""},"research-area":[],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-917721","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-locale-en_us"],"msr_assoc_parent":{"id":917364,"type":"project"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/917721","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/36864"}],"version-history":[{"count":54,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/917721\/revisions"}],"predecessor-version":[{"id":923061,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/917721\/revisions\/923061"}],"wp:featuredmedia":[{"embedda
ble":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/898554"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=917721"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=917721"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=917721"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=917721"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}