Synthetic Data Showcase
Sharing data from sensitive sources is critical to research but can put vulnerable data subjects at risk of being identified. We created an open-source pipeline that generates synthetic data to preserve privacy when sharing and analysing sensitive datasets.
Sharing sensitive datasets poses both utility and privacy challenges. Anonymisation methods may not capture accurate statistics or protect against all kinds of privacy leaks, while differential privacy doesn’t make datasets available for unrestricted analysis with group-level privacy protection.
We believe that in many cases, the most practical method for sharing sensitive information isn’t sharing the actual datasets, but instead sharing interfaces that showcase derived datasets that are inherently anonymous.
We developed Synthetic Data Showcase, an automated, open-source pipeline for generating both synthetic and aggregate datasets that conserve the utility of the original, along with dashboards for visualising and exploring these derived datasets.
Protecting vulnerable groups in datasets
While a dataset may be de-identified, the selection of specific attributes within that dataset could offer opportunities for traffickers to ‘drill down’ to a single individual or small, identifiable group within the dataset. A trafficker’s knowledge of their victims could then allow them to infer who those de-identified individuals are based on their distinctive combination of attributes.
Synthesising data with full k-anonymity
Our approach addresses privacy risks by treating each column as potentially identifying and protecting all combinations of attributes. Synthetic data provides a level of indirection: any combination of attributes in the synthetic data, even if it appears only once there, corresponds to at least k records in the sensitive dataset. Singling-out attacks are prevented because groups smaller than k are masked and groups of k or more are protected by safety in numbers.
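To make the drill-down risk concrete, the following minimal sketch counts how many records share the attributes an observer already knows, and treats a combination as safe only when at least k records share it. The column names, the toy records and the helper names group_size and is_protected are hypothetical and are not taken from the Synthetic Data Showcase code.

```python
from typing import Dict, List

Record = Dict[str, str]


def group_size(records: List[Record], known: Record) -> int:
    """Count the records that match every attribute the observer knows."""
    return sum(
        all(record.get(col) == val for col, val in known.items())
        for record in records
    )


def is_protected(records: List[Record], known: Record, k: int) -> bool:
    """A combination is safe to reveal only if at least k records share it."""
    return group_size(records, known) >= k


# Hypothetical toy dataset, for illustration only.
records = [
    {"age": "18-25", "region": "North", "sector": "Farming"},
    {"age": "18-25", "region": "North", "sector": "Farming"},
    {"age": "18-25", "region": "North", "sector": "Fishing"},
    {"age": "26-35", "region": "South", "sector": "Farming"},
]
k = 2

# Each step adds one more known attribute, narrowing the matching group.
drill_down = [
    {"age": "18-25"},
    {"age": "18-25", "region": "North"},
    {"age": "18-25", "region": "North", "sector": "Fishing"},
]
sizes = [group_size(records, known) for known in drill_down]
flags = [is_protected(records, known, k) for known in drill_down]

for known, size, ok in zip(drill_down, sizes, flags):
    print(known, "->", size, "record(s),", "safe" if ok else "must be masked")
```

In this toy data the last step narrows the group to a single record, which is exactly the kind of combination that has to be masked.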
Safeguarding privacy while preserving utility
We replace sensitive data with synthetic data that preserves privacy while still ensuring utility. Because common attribute combinations are unaffected, the resulting synthetic data accurately captures the counts of these combinations, which are often important for both sense-making and decision-making.
Technical details for Synthetic Data Showcase
Synthetic datasets are produced using our concept of, and algorithm for, k-synthetic anonymity. The algorithm constructs synthetic records whose attribute combinations each appear at least a pre-determined number of times, k, in the original, sensitive dataset. Attribute combinations that do not meet this privacy resolution aren’t disclosed, which prevents singling out individual data subjects or linking small groups of subjects to known individuals in the real world.
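The constraint described above can be illustrated with a deliberately simplified sketch. It is not the project's actual algorithm: it builds each synthetic record one attribute at a time and keeps an attribute only if the combination built so far still matches at least k sensitive records. Because every subset of a combination matches at least as many records as the combination itself, all sub-combinations of the result also meet the threshold. The function names, the toy data and the sampling strategy are assumptions made for illustration, and the sketch assumes every record has a value for every column.

```python
import random
from typing import Dict, List

Record = Dict[str, str]


def synthesise_record(sensitive: List[Record], columns: List[str],
                      k: int, rng: random.Random) -> Record:
    """Build one synthetic record whose full attribute combination
    matches at least k sensitive records (columns may be left out)."""
    combo: Record = {}
    candidates = list(sensitive)  # sensitive records consistent with combo
    if not candidates:
        return combo
    order = list(columns)
    rng.shuffle(order)  # visit columns in random order
    for col in order:
        # Sample a value in proportion to its frequency among candidates.
        value = rng.choice(candidates)[col]
        narrowed = [r for r in candidates if r[col] == value]
        if len(narrowed) >= k:  # keep the attribute only if still >= k matches
            combo[col] = value
            candidates = narrowed
    return combo


def synthesise(sensitive: List[Record], k: int, n: int, seed: int = 0) -> List[Record]:
    """Generate n synthetic records under the k-match constraint."""
    rng = random.Random(seed)
    columns = sorted(sensitive[0]) if sensitive else []
    return [synthesise_record(sensitive, columns, k, rng) for _ in range(n)]


def support(sensitive: List[Record], combo: Record) -> int:
    """Number of sensitive records matching every attribute in combo."""
    return sum(all(r.get(c) == v for c, v in combo.items()) for r in sensitive)


# Hypothetical toy dataset, for illustration only.
sensitive = (
    [{"age": "18-25", "region": "North", "sector": "Farming"}] * 3
    + [{"age": "18-25", "region": "South", "sector": "Fishing"}] * 2
    + [{"age": "26-35", "region": "South", "sector": "Mining"}]
)
synthetic = synthesise(sensitive, k=2, n=20, seed=42)
```

A synthetic record produced this way may omit some columns. That is the masking effect: the rare value is simply not disclosed. This sketch makes no attempt to reproduce the original counts accurately, which a production algorithm would also need to address.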
The synthetic data is complemented with precomputed aggregate data for reportable, short attribute combinations that appear in the sensitive dataset. We enable the selection of a privacy resolution k that provides both a minimum reporting threshold and rounding precision to prevent disclosing small counts that can pose privacy risks.
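As a rough sketch of how a single privacy resolution k can act as both a reporting threshold and a rounding precision, the code below counts attribute combinations up to a chosen length, drops any count below k, and rounds the remaining counts down to a multiple of k. Rounding down rather than to the nearest multiple is an assumption here, as are the function name reportable_aggregates, the max_length parameter and the toy data.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Record = Dict[str, str]
Combo = Tuple[Tuple[str, str], ...]  # e.g. (("age", "18-25"), ("region", "North"))


def reportable_aggregates(records: List[Record], k: int,
                          max_length: int) -> Dict[Combo, int]:
    """Count attribute combinations of up to max_length attributes,
    suppress counts below k, and round the rest down to a multiple of k."""
    counts: Counter = Counter()
    for record in records:
        items = sorted(record.items())  # stable (column, value) order
        for length in range(1, max_length + 1):
            for combo in combinations(items, length):
                counts[combo] += 1
    return {
        combo: (count // k) * k
        for combo, count in counts.items()
        if count >= k
    }


# Hypothetical toy dataset, for illustration only.
records = (
    [{"age": "18-25", "region": "North"}] * 5
    + [{"age": "26-35", "region": "North"}] * 2
    + [{"age": "26-35", "region": "South"}]
)
aggregates = reportable_aggregates(records, k=3, max_length=2)

for combo, count in sorted(aggregates.items()):
    print(dict(combo), "->", count)
```

With k = 3, the seven records in the North region are reported as 6, while the single record in the South region is not reported at all.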
The synthetic and aggregate data are automatically loaded into a Power BI interface for interactive, privacy-preserving data exploration. Capable of being easily customised to meet specific visualisation goals, these dashboards enable rich and code-free analysis independent of data science expertise.
Synthetic Data Showcase started as a project within our Tech Against Trafficking initiative, and we believe that its ability to improve the representation of at-risk groups can help us solve pressing societal problems and build a more resilient world.
Homomorphic encryption (HE)
HE technology allows computations to be performed directly on encrypted data. Using state-of-the-art cryptology, you can run machine learning on anonymised datasets without losing context.
Differential privacy is crucial for machine learning and data privacy. Safeguarding the privacy of people while enabling deeper analysis will empower health research innovations.