{"id":630264,"date":"2020-01-14T11:31:43","date_gmt":"2020-01-14T19:31:43","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=630264"},"modified":"2020-01-16T18:14:17","modified_gmt":"2020-01-17T02:14:17","slug":"are-all-samples-created-equal-boosting-generative-models-via-importance-weighting","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/are-all-samples-created-equal-boosting-generative-models-via-importance-weighting\/","title":{"rendered":"Are all samples created equal?: Boosting generative models via importance weighting"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-631143 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1.png\" alt=\"many faces and colorful squares\" width=\"1401\" height=\"788\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1.png 1401w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-1024x576.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-768x432.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-343x193.png 343w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-640x360.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-960x540.png 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-1280x720.png 1280w\" sizes=\"auto, (max-width: 1401px) 100vw, 1401px\" \/><\/p>\n<p>There is a growing interest in the use of deep generative models for sampling high-dimensional data; examples include high-resolution natural images, long-form text generation, designing pharmaceutical drugs, and creating new materials at the molecular level. Training these models is, however, an arduous task. Even state-of-the-art models have noticeable deficiencies in some of the generated samples: image models of faces have <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/medium.com\/@kcimc\/how-to-recognize-fake-ai-generated-images-4d1f6f9a2842\">artifacts in the hair textures and makeup<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and text models often require repeated attempts to generate coherent completions of sentences or paragraphs. 
In these cases, cherry-picking good samples is not a scalable alternative.<\/p>\n<p>In a paper presented last month <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/neurips.cc\/Conferences\/2019\">at the thirty-third Conference on Neural Information Processing Systems<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (NeurIPS 2019), titled \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/bias-correction-of-learned-generative-models-using-likelihood-free-importance-weighting\/\">Bias Correction of Learned Generative Models using Likelihood-Free Importance Weighting,<\/a>\u201d our team of researchers at Microsoft and Stanford University proposes a scalable algorithmic approach to characterize and mitigate the imperfections of generative models. Our technique consistently improves sample quality metrics for state-of-the-art generative models while also benefiting downstream use cases of generative models for data augmentation and off-policy policy evaluation.<\/p>\n<h3>Importance weighting induces an energy-based generative model<\/h3>\n<p>Let\u2019s say we are given a generative model \\(p_\\theta\\) (such as a variational autoencoder, generative adversarial network, or other model) that has been trained to learn a data distribution \\(p_{\\text{data}}\\). Our goal is to characterize and mitigate the imperfections of this model. 
To do this, we consider any non-negative weighting function \\(w_\\phi\\) and combine it with our base model to induce an energy-based model with density:<\/p>\n<p style=\"text-align: center;\">\\( p_{\\theta,\\phi}\\left(x\\right)\\propto\\ p_\\theta\\left(x\\right)w_\\phi\\left(x\\right)\\)<\/p>\n<p style=\"text-align: left;\">The above model is an instantiation of a product-of-experts (PoE) model, as it boosts a base (normalized) model \\(p_\\theta\\) multiplicatively using a weighting function \\(w_\\phi\\).<\/p>\n<h4>What\u2019s the ideal weighting function?<\/h4>\n<p>If the weighting function equals the ratio of the data density to the model density (that is, \\(w_\\phi\\left(x\\right) = p_{\\text{data}}\\left(x\\right)\/p_\\theta\\left(x\\right)\\) for all \\(x\\)), then the energy-based model exactly recovers the data distribution (that is, \\(p_{\\theta,\\phi}\\left(x\\right) = p_{\\text{data}}\\left(x\\right)\\)). In such a scenario, \\(w_\\phi\\left(x\\right)\\) is the importance weighting function for debiasing expectations under the data distribution (also known as the \u201ctarget\u201d in Monte Carlo terminology) given access to only the model distribution (or \u201cproposal\u201d).<\/p>\n<h4>How do we estimate the importance weights?<\/h4>\n<p>Computing this density ratio directly is difficult: the data density (the numerator) is unknown, and the model density (the denominator) is often intractable in practice, as is the case for variational autoencoders, generative adversarial networks, and many other generative models. To sidestep this, we use probabilistic binary classifiers to estimate the density ratio; in particular, the estimator is the odds ratio of a classifier trained to distinguish data samples from generated samples. If the classifier is Bayes optimal, the importance weights are exact. Appealingly, this procedure is \u201clikelihood-free\u201d, as it does not involve knowing the model or the data density. 
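As a minimal sketch of this estimator (the function and variable names here are ours, not from the paper), the likelihood-free importance weight for a sample is the odds ratio of the classifier's predicted probability that the sample is real data rather than a model generation:

```python
import numpy as np

def importance_weights(prob_real):
    """Likelihood-free importance-weight estimates.

    prob_real: classifier probabilities that each sample is real data
    (vs. generated). For a Bayes-optimal classifier, the odds ratio
    prob_real / (1 - prob_real) equals p_data(x) / p_theta(x) exactly.
    """
    prob_real = np.clip(prob_real, 1e-6, 1 - 1e-6)  # numerical safety near 0 and 1
    return prob_real / (1.0 - prob_real)
```

A probability of 0.5 (the classifier cannot tell the sample apart from real data) yields a weight of 1, while samples the classifier confidently flags as generated are down-weighted toward 0.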
A toy example is shown below.<\/p>\n<div id=\"attachment_630834\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-630834\" class=\"wp-image-630834 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Diagram-1.png\" alt=\"diagram\" width=\"640\" height=\"480\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Diagram-1.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Diagram-1-300x225.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Diagram-1-80x60.png 80w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Diagram-1-240x180.png 240w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><p id=\"caption-attachment-630834\" class=\"wp-caption-text\">Figure 1: univariate Gaussian (green) is fit to a mixture of two Gaussians (blue).<\/p><\/div>\n<div id=\"attachment_630831\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-630831\" class=\"wp-image-630831 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Diagram-2-.png\" alt=\"diagram\" width=\"640\" height=\"480\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Diagram-2-.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Diagram-2--300x225.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Diagram-2--80x60.png 80w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Diagram-2--240x180.png 240w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><p id=\"caption-attachment-630831\" class=\"wp-caption-text\">Figure 2: estimated (red) and Bayes optimal (black, BayesOpt) class probabilities (with 
95% confidence intervals based on 1,000 bootstraps) for a classifier trained to distinguish 1,000 true data and generated data samples.<\/p><\/div>\n<h4 style=\"text-align: left;\">How do we sample from the induced model?<\/h4>\n<p>Exact sampling from the induced energy-based model is computationally intractable. However, we can leverage a resampling technique called <strong>Sampling Importance Resampling<\/strong> (SIR) to sample from an approximation to the energy-based model. Given a positive integer parameter <em>k<\/em>, SIR prescribes a three-step procedure:<\/p>\n<p>(1) Generate <em>k<\/em> independent <strong>samples<\/strong> from the base model \\(p_\\theta\\).<br \/>\n(2) Estimate <strong>importance<\/strong> weights for the <em>k<\/em> samples.<br \/>\n(3) <strong>Resample<\/strong> from these <em>k<\/em> samples in proportion to the importance weights.<\/p>\n<p>In the limit of <em>k<\/em> going to infinity, we sample exactly from the energy-based model. For any finite budget <em>k<\/em>, we can therefore trade accuracy for computational efficiency, or vice versa.<\/p>\n<h3>Application use cases<\/h3>\n<p>We evaluate several standard sample quality metrics on the CIFAR-10 dataset for state-of-the-art likelihood-based and likelihood-free models, with and without our proposed debiasing technique (denoted likelihood-free importance weighting, or LFIW). The weights here were estimated using a neural network performing binary classification. 
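The three SIR steps above can be sketched as follows (an illustration under our own naming; `base_sampler` and `weight_fn` stand in for the trained generative model and the classifier-based weight estimator):

```python
import numpy as np

def sir_sample(base_sampler, weight_fn, k, n, rng=None):
    """Draw n approximate samples from the importance-weighted model
    p_{theta,phi}(x) ∝ p_theta(x) w_phi(x) via Sampling Importance Resampling.

    base_sampler(k) -> array of k samples from the base model p_theta
    weight_fn(x)    -> non-negative importance-weight estimates for x
    """
    rng = np.random.default_rng() if rng is None else rng
    xs = base_sampler(k)                        # (1) k independent base samples
    w = np.asarray(weight_fn(xs), dtype=float)  # (2) estimate importance weights
    p = w / w.sum()                             # normalize into a distribution
    idx = rng.choice(k, size=n, p=p)            # (3) resample in proportion to weights
    return xs[idx]
```

Increasing `k` makes the resampled draws a closer approximation to the energy-based model at the cost of more base-model evaluations.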
Our technique consistently improves on these metrics, suggesting reduced bias in evaluation.<\/p>\n<div id=\"attachment_630819\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-630819\" class=\"wp-image-630819 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Table1-5e1ce2507795a-1024x253.png\" alt=\"table\" width=\"1024\" height=\"253\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Table1-5e1ce2507795a-1024x253.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Table1-5e1ce2507795a-300x74.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Table1-5e1ce2507795a-768x190.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Table1-5e1ce2507795a-1536x380.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/Table1-5e1ce2507795a-2048x507.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-630819\" class=\"wp-caption-text\">Table 1: Goodness-of-fit evaluation on CIFAR-10 dataset for PixelCNN++ and SNGAN. Standard errors computed over 10 runs. Higher Inception Scores (IS) are better. 
Lower Frechet Inception Distance (FID) and Kernel Inception Distance (KID) scores are better.<\/p><\/div>\n<p>Besides improved sample-quality metrics, we show the benefits of our approach for:<\/p>\n<p>\u2022 data augmentation on the Omniglot dataset using generative adversarial networks: weighting the contributions of the good and bad generations in the training loss improves classification accuracy.<br \/>\n\u2022 model-based off-policy policy evaluation on MuJoCo environments: weighting the contributions of simulated trajectories under the dynamics model (learned using off-policy data) leads to better value estimates for the policy of interest.<\/p>\n<p>In summary, we present a simple yet highly effective technique based on importance weighting that corrects for the imperfections of generative models by inducing a boosted energy-based model. While the proposed technique can correct for <em>model bias<\/em>, the datasets used for training could also be biased (as is the case when the training dataset is scraped from Internet sites, such as Reddit), and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1910.12008\">our follow-up<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> work uses similar techniques to mitigate <em>dataset bias<\/em> for achieving fairness in generative modeling.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>There is a growing interest in the use of deep generative models for sampling high-dimensional data; examples include high-resolution natural images, long-form text generation, designing pharmaceutical drugs, and creating new materials at the molecular level. Training these models is, however, an arduous task. 
Even state-of-the-art models have noticeable deficiencies in some of the generated samples: [&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":631149,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-630264","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[{"type":"guest","value":"aditya-grover","user_id":"631506","display_name":"Aditya Grover","author_link":"<a href=\"http:\/\/aditya-grover.github.io\/\" aria-label=\"Visit the profile page for Aditya Grover\">Aditya Grover<\/a>","is_active":true,"last_first":"Grover, Aditya","people_section":0,"alias":"aditya-grover"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-960x540.png\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-960x540.png 960w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-1024x576.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-768x432.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-640x360.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1-1280x720.png 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/01\/MSResearch_20200114_NeurIPS_Aditya_1400x788-1.png 1401w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"<a href=\"http:\/\/aditya-grover.github.io\/\" title=\"Go to researcher profile for Aditya Grover\" aria-label=\"Go to researcher profile for Aditya Grover\" data-bi-type=\"byline author\" data-bi-cN=\"Aditya Grover\">Aditya Grover<\/a>","formattedDate":"January 14, 2020","formattedExcerpt":"There is a growing interest in the use of deep generative models for sampling high-dimensional data; examples include high-resolution natural images, long-form text generation, designing pharmaceutical drugs, and creating new materials at the molecular level. Training these models is, however, an arduous task. 
Even state-of-the-art&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/630264","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=630264"}],"version-history":[{"count":53,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/630264\/revisions"}],"predecessor-version":[{"id":648537,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/630264\/revisions\/648537"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/631149"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=630264"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=630264"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=630264"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=630264"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=630264"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=630264"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/
wp-json\/wp\/v2\/msr-locale?post=630264"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=630264"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=630264"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=630264"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=630264"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}