Synthetic Data Showcase
Sharing data from sensitive sources is critical to research but can put vulnerable data subjects at risk of being identified. We created an open-source pipeline that generates synthetic data to preserve privacy when sharing and analyzing sensitive datasets.
Sharing sensitive datasets poses both utility and privacy challenges. Anonymization methods may not capture accurate statistics or protect against all kinds of privacy leaks, while differential privacy doesn’t make datasets available for unrestricted analysis with group-level privacy protection.
We believe that in many cases, the most practical method for sharing sensitive information isn’t sharing the actual datasets, but instead sharing interfaces that showcase derived datasets that are inherently anonymous.
We developed and open-sourced Synthetic Data Showcase, an automated pipeline for generating both synthetic and aggregate datasets that conserve the utility of the original, along with dashboards for visualizing and exploring these derived datasets.
Protecting vulnerable groups in datasets
While a dataset may be de-identified, the selection of specific attributes within that dataset could offer opportunities for traffickers to “drill down” to a single individual or small, identifiable group within the dataset. A trafficker’s knowledge of their victims could then allow them to infer who those de-identified individuals are based on their distinctive combination of attributes.
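To make the drill-down risk concrete, here is a minimal Python sketch that filters a small, made-up table by progressively more attributes. The records, column names, and the helper `matching_records` are hypothetical and are not taken from the project.

```python
from typing import Dict, List

Record = Dict[str, str]


def matching_records(records: List[Record], known: Record) -> List[Record]:
    """Return the records that agree with every known attribute value."""
    return [r for r in records if all(r.get(c) == v for c, v in known.items())]


# Hypothetical de-identified records: no names, no identifiers.
records = [
    {"age": "18-24", "region": "North", "sector": "Agriculture"},
    {"age": "18-24", "region": "North", "sector": "Domestic"},
    {"age": "18-24", "region": "South", "sector": "Agriculture"},
    {"age": "25-34", "region": "North", "sector": "Agriculture"},
    {"age": "25-34", "region": "South", "sector": "Domestic"},
]

# Each extra attribute the attacker already knows narrows the group.
known_steps = [
    {"age": "18-24"},
    {"age": "18-24", "region": "North"},
    {"age": "18-24", "region": "North", "sector": "Domestic"},
]
drill_down_sizes = [len(matching_records(records, known)) for known in known_steps]
print(drill_down_sizes)
```

In this toy table the matching group shrinks from three records to two and then to one, even though no single column identifies anyone on its own.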
Synthesizing data with full k-anonymity
Our approach addresses privacy risks by treating each column as potentially identifying, protecting all combinations of attributes. By using synthetic data, we provide a level of indirection: any combination of attributes that appears in the synthetic data, even in a single synthetic record, corresponds to at least k records in the sensitive dataset. Singling-out attacks are prevented because groups smaller than k are masked, and groups of k or more are protected by safety in numbers.
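The property described above can be checked by brute force on small tables. The sketch below is an illustration only, not part of the pipeline: `count_matches` and `leaking_combinations` are hypothetical helpers, and the data is made up. It enumerates every attribute combination in each synthetic record, so it is exponential in the number of columns and only suitable for toy inputs.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]
Combo = Tuple[Tuple[str, str], ...]


def count_matches(records: List[Record], combo: Combo) -> int:
    """Count records that contain every (column, value) pair in combo."""
    return sum(all(r.get(c) == v for c, v in combo) for r in records)


def leaking_combinations(synthetic: List[Record], sensitive: List[Record], k: int) -> Set[Combo]:
    """Combinations found in synthetic records that match fewer than k sensitive records."""
    leaks: Set[Combo] = set()
    for record in synthetic:
        items = sorted(record.items())
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                if count_matches(sensitive, combo) < k:
                    leaks.add(combo)
    return leaks


# Hypothetical sensitive table.
sensitive = [
    {"age": "18-24", "region": "North", "sector": "Agriculture"},
    {"age": "18-24", "region": "North", "sector": "Domestic"},
    {"age": "18-24", "region": "South", "sector": "Agriculture"},
    {"age": "25-34", "region": "North", "sector": "Agriculture"},
    {"age": "25-34", "region": "South", "sector": "Domestic"},
]

safe = [{"age": "18-24", "region": "North"}]
unsafe = [{"age": "18-24", "region": "North", "sector": "Domestic"}]

safe_leaks = leaking_combinations(safe, sensitive, k=2)
unsafe_leaks = leaking_combinations(unsafe, sensitive, k=2)
print(len(safe_leaks), len(unsafe_leaks))
```

With k = 2, the first synthetic record passes because each of its combinations matches at least two sensitive records. The second fails because adding the sector narrows several combinations to a single sensitive record.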
Safeguarding privacy while preserving utility
We replace sensitive data with synthetic data that preserves privacy while still ensuring utility. Because common attribute combinations are unaffected, the resulting synthetic data accurately captures the counts of these combinations, which are often important for both sensemaking and decision making.
Technical details for Synthetic Data Showcase
Synthetic datasets are produced using our concept of, and algorithm for, k-synthetic anonymity. The algorithm constructs synthetic records in which every combination of attribute values appears at least a predetermined number of times, k, in the original, sensitive dataset. Attribute combinations that do not meet this privacy resolution are not disclosed, which prevents singling out individual data subjects or linking small groups of subjects to known individuals in the real world.
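As a rough illustration of the constraint, and not of the project's actual synthesis algorithm, the following Python sketch builds records greedily: it walks the columns of each sensitive record and keeps an attribute only while the combination kept so far still matches at least k sensitive records. The function names and data are hypothetical, and this simplified, suppression-style approach is an assumption made for the example.

```python
from typing import Dict, List

Record = Dict[str, str]


def count_matches(records: List[Record], partial: Record) -> int:
    """Count records that agree with every attribute in partial."""
    return sum(all(r.get(c) == v for c, v in partial.items()) for r in records)


def synthesize(sensitive: List[Record], k: int) -> List[Record]:
    """Greedy sketch: emit records whose full attribute set matches >= k sensitive records."""
    synthetic: List[Record] = []
    for record in sensitive:
        kept: Record = {}
        for column, value in record.items():
            candidate = {**kept, column: value}
            # Keep the attribute only if the grown combination is still common enough.
            if count_matches(sensitive, candidate) >= k:
                kept = candidate
        if kept:  # drop records where nothing could be kept
            synthetic.append(kept)
    return synthetic


# Hypothetical sensitive table.
sensitive = [
    {"age": "18-24", "region": "North", "sector": "Agriculture"},
    {"age": "18-24", "region": "North", "sector": "Domestic"},
    {"age": "18-24", "region": "South", "sector": "Agriculture"},
    {"age": "25-34", "region": "North", "sector": "Agriculture"},
    {"age": "25-34", "region": "South", "sector": "Domestic"},
]

synthetic = synthesize(sensitive, k=2)
for row in synthetic:
    print(row)
```

Because every kept record matches at least k sensitive records as a whole, each smaller combination of its attributes does too: removing a condition can only increase the number of matches.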
The synthetic data is complemented with precomputed aggregate data for reportable, short attribute combinations that appear in the sensitive dataset. We enable the selection of a privacy resolution k that provides both a minimum reporting threshold and rounding precision to prevent disclosing small counts that can pose privacy risks.
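The reporting rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper `reportable_aggregates`, the `max_len` parameter for "short" combinations, and the choice to round counts down to a multiple of k are all assumptions for the example, and the data is made up.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Record = Dict[str, str]
Combo = Tuple[Tuple[str, str], ...]


def reportable_aggregates(records: List[Record], k: int, max_len: int) -> Dict[Combo, int]:
    """Count combinations of up to max_len attributes, then threshold and round by k."""
    counts: Counter = Counter()
    for record in records:
        items = sorted(record.items())
        for size in range(1, max_len + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    # Drop counts below k; round the rest down to a multiple of k (assumed rounding rule).
    return {combo: (n // k) * k for combo, n in counts.items() if n >= k}


# Hypothetical sensitive table.
sensitive = [
    {"age": "18-24", "region": "North", "sector": "Agriculture"},
    {"age": "18-24", "region": "North", "sector": "Domestic"},
    {"age": "18-24", "region": "South", "sector": "Agriculture"},
    {"age": "25-34", "region": "North", "sector": "Agriculture"},
    {"age": "25-34", "region": "South", "sector": "Domestic"},
]

aggregates = reportable_aggregates(sensitive, k=2, max_len=2)
for combo, count in sorted(aggregates.items()):
    print(combo, count)
```

Here the true count of three for the 18-24 age group is reported as two, and any pair of attributes that matches a single record is omitted entirely.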
The synthetic and aggregate data are automatically loaded into a Power BI interface for interactive, privacy-preserving data exploration. Capable of being easily customized to meet specific visualization goals, these dashboards enable rich and code-free analysis independent of data science expertise.
Synthetic Data Showcase started as a project within our Tech Against Trafficking initiative, and we believe that its ability to improve the representation of at-risk groups can help us solve pressing societal problems and build a more resilient world.