The need

Sharing sensitive datasets poses both utility and privacy challenges. Anonymization methods may not capture accurate statistics or protect against all kinds of privacy leaks, while differential privacy doesn’t make datasets available for unrestricted analysis with group-level privacy protection.

The idea

We believe that in many cases, the most practical method for sharing sensitive information isn’t sharing the actual datasets, but instead sharing interfaces that showcase derived datasets that are inherently anonymous.

The solution

We developed and open-sourced Synthetic Data Showcase, an automated pipeline for generating both synthetic and aggregate datasets that conserve the utility of the original, along with dashboards for visualizing and exploring these derived datasets.

Combatting trafficking with synthetic data

This challenge arose through Microsoft’s participation in Tech Against Trafficking (TAT) – a coalition of technology companies working to combat trafficking with technology. We worked together to support the Counter Trafficking Data Collaborative (CTDC) – an initiative run by IOM to create the world’s largest database on identified victims of trafficking that would serve to inform evidence-based policy against human trafficking.

Explore how we worked to enable policy makers to view and explore this data in a way that preserves the privacy of groups of data subjects, preserves the utility of datasets, and is accessible to all data stakeholders.


Technical details for Synthetic Data Showcase

Synthetic datasets are produced using our concept of, and algorithm for, k-synthetic anonymity. The algorithm constructs synthetic records in which every combination of attribute values appears at least a pre-determined number of times, k, in the original, sensitive dataset. Attribute combinations that do not meet this privacy resolution aren’t disclosed, which prevents singling out individual data subjects or linking small groups of subjects to known individuals in the real world.
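To make the k threshold concrete, here is a minimal Python sketch of the check behind it. This is not the Synthetic Data Showcase implementation: the function names `count_combinations` and `is_k_safe`, the `max_len` cap on combination length, and the sample records are all assumptions made for illustration. It counts attribute combinations in a small sensitive dataset and tests whether a candidate synthetic record contains only combinations seen at least k times.

```python
from collections import Counter
from itertools import combinations


def count_combinations(records, max_len):
    """Count how many records contain each attribute combination up to max_len."""
    counts = Counter()
    for record in records:
        items = sorted(record.items())  # canonical order so combos match across records
        for n in range(1, max_len + 1):
            for combo in combinations(items, n):
                counts[combo] += 1
    return counts


def is_k_safe(record, sensitive_counts, k, max_len):
    """True if every combination in the record occurs >= k times in the sensitive data."""
    items = sorted(record.items())
    return all(
        sensitive_counts[combo] >= k
        for n in range(1, max_len + 1)
        for combo in combinations(items, n)
    )


# Hypothetical sensitive records, for illustration only.
sensitive = [
    {"age": "18-25", "gender": "F"},
    {"age": "18-25", "gender": "F"},
    {"age": "18-25", "gender": "M"},
    {"age": "26-35", "gender": "M"},
]
sensitive_counts = count_combinations(sensitive, max_len=2)

common_record = {"age": "18-25", "gender": "F"}  # pair appears twice
rare_record = {"age": "18-25", "gender": "M"}    # pair appears only once
common_ok = is_k_safe(common_record, sensitive_counts, k=2, max_len=2)
rare_ok = is_k_safe(rare_record, sensitive_counts, k=2, max_len=2)
```

With k = 2, the first candidate passes and the second is rejected, because its age and gender pair occurs in only one sensitive record even though each value on its own is common.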

The synthetic data is complemented with precomputed aggregate data for reportable, short attribute combinations that appear in the sensitive dataset. We enable the selection of a privacy resolution k that provides both a minimum reporting threshold and rounding precision to prevent disclosing small counts that can pose privacy risks.
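The dual role of k for aggregate data can be sketched as follows. This is an illustrative Python example under stated assumptions, not the project's code: the function name `reportable_aggregates`, the `max_len` limit on combination length, the sample records, and the choice to round counts down to the nearest multiple of k are all assumptions.

```python
from collections import Counter
from itertools import combinations


def reportable_aggregates(records, k, max_len):
    """Return counts of attribute combinations, thresholded and rounded by k.

    Combinations seen fewer than k times are dropped; the remaining counts
    are rounded down to a multiple of k (an assumed rounding rule).
    """
    counts = Counter()
    for record in records:
        items = sorted(record.items())  # canonical order for each combination
        for n in range(1, max_len + 1):
            for combo in combinations(items, n):
                counts[combo] += 1
    return {
        combo: (count // k) * k
        for combo, count in counts.items()
        if count >= k
    }


# Hypothetical sensitive records, for illustration only.
sensitive = [
    {"age": "18-25", "gender": "F"},
    {"age": "18-25", "gender": "F"},
    {"age": "18-25", "gender": "M"},
    {"age": "26-35", "gender": "M"},
]
aggregates = reportable_aggregates(sensitive, k=2, max_len=2)
```

Here the true count of 3 for the age group 18-25 is reported as 2, and every combination that occurs only once is left out of the result entirely.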

The synthetic and aggregate data are automatically loaded into a Power BI interface for interactive, privacy-preserving data exploration. Capable of being easily customized to meet specific visualization goals, these dashboards enable rich and code-free analysis independent of data science expertise.

Synthetic Data Showcase started as a project within our Tech Against Trafficking initiative, and we believe that its ability to improve the representation of at-risk groups can help us solve pressing societal problems and build a more resilient world.

We are pleased to announce that Synthetic Data Showcase has been adopted by the UN International Organization for Migration (IOM).
