Cryptography and Privacy Research

Overview

The Cryptography and Privacy Research Group studies privacy problems in real-world systems and creates practical solutions using modern cryptography and privacy technologies.

Our Objectives

Fundamental research in privacy and cryptography

We conduct fundamental research in privacy and cryptography. Almost all of our research is public and we regularly publish in top conferences in the field. We also collaborate with researchers and students in academia, and host several interns every year through Microsoft’s internship program.

Design and analysis of privacy systems

We design and analyze privacy systems for Microsoft product groups. Sometimes this means providing consultation, but at other times we may end up designing entirely new protocols, addressing new kinds of performance and scalability issues, and building missing tools or libraries.

Threat modeling and privacy analysis in machine learning

Machine learning pipelines leak information about the training data. We explore new attacks and mitigation strategies in real-world scenarios, with the ultimate goal of being able to quantify this leakage and provide practical threat models.

Making privacy technologies available and usable for all developers

We aim to build production-quality libraries and release them under permissive open-source licenses (typically MIT). We believe modern privacy technologies should be broadly available so that every developer can deliver exactly the privacy guarantees they want.

New business opportunities from modern privacy technologies

We believe modern privacy technologies can not only provide regulatory compliance, but also enable new data collaborations across distrusting parties, enable better insights by tearing down data silos, and enable new business opportunities that would otherwise be impossible or require costly legal procedures.

Colloquium

Every summer we invite researchers from around the world to visit, either in person or virtually, and speak at the Cryptography and Privacy Colloquium.

Machine Learning

Privacy in ML pipelines is a topic of great interest and importance to Microsoft. The most important concerns are training on private siloed data, leakage of training data through released models or word embeddings, and enabling predictions on private data. Our approach to privacy-preserving ML has three elements:

Map. We study and create privacy attacks against real-world ML pipelines in both enterprise and consumer scenarios. The reason to focus on the offensive angle is that it is still poorly understood and creating effective defenses is hard or impossible until the scope of the problem is better known. For example, it is clear that ML models, even when exposed only in a black box setting, leak information about the training data, but this leakage is difficult to quantify. The situation gets worse when auxiliary information about the training data is available, or when the attacker can poison the training data with malicious samples, for example, in a federated learning setting.

Measure. Given an ML pipeline, we need to be able to easily tell whether it is vulnerable to privacy attacks. Creating effective metrics for privacy leakage is a step in this direction. Techniques such as differential privacy can in some cases provide strong privacy guarantees, but are not straightforward to connect to practical metrics.
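
To make the reference to differential privacy concrete, the following is a minimal sketch of the Laplace mechanism applied to a counting query. The function names, the example data, and the choice of epsilon are assumptions made for illustration and do not come from any Microsoft library. The sampling uses ordinary floating-point arithmetic and is not hardened for production use.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng=None):
    """Return an epsilon-differentially-private count of matching records.

    Adding or removing one record changes the true count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    rng = rng or random.SystemRandom()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical data: ages of individuals in a private dataset.
ages = [23, 35, 41, 29, 52, 67, 38, 45]
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5)
```

Smaller values of epsilon add more noise and give a stronger formal guarantee. Relating a given epsilon to the success rate of a concrete attack is the kind of gap described above.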

Mitigate. Once we understand potential weaknesses and can effectively identify them in ML pipelines, we need to design and implement mitigations. This process must be generalizable and scalable so that large classes of ML pipelines can be protected without undue effort. While provable security may, in many cases, be out of reach or impractical, creating realistic threat models and heuristic security arguments is necessary.

Secure Computing

Secure computing protects data while it is in use, in addition to protecting it in transit and at rest. Secure computing comes in numerous flavors, such as outsourced computing on encrypted data and collaborative computation on private data of two or more data providers. Such functionalities can be enabled primarily through three technologies: (1) homomorphic encryption (HE); (2) cryptographic protocols known collectively as secure multi-party computation (MPC); (3) hardware-based trusted execution environments (TEE), such as Intel SGX.

The different secure computing techniques have very different performance characteristics and privacy goals. For example, HE and MPC operate on encrypted data; the computations are typically represented as circuits, which can be difficult and cumbersome. TEEs decrypt the data before computing on it, but rely on the hardware design for restricting the operating system’s or hypervisor’s access to the decrypted data; regular code can run in a TEE, but various side-channels have been shown to leak information about the private data.

Homomorphic Encryption

(Fully) Homomorphic Encryption (HE or FHE) refers to a class of encryption methods that allow computing on encrypted data without decryption. It keeps data private from the party who performs the computation, i.e., it enables privacy-preserving (outsourced) computation. Like all modern encryption schemes, homomorphic encryption is randomized: encrypting the same value multiple times yields a different ciphertext each time, except with negligible probability. The encryption can be reversed only with the secret key. “Homomorphic” refers to a mathematical concept, indicating that certain operations on ciphertexts correspond to operations on the underlying plaintexts. For example, the user can encrypt two numbers, compute a sum-like operation on the ciphertexts, and decrypt the result to obtain the sum of the plaintext numbers.
Computing on encrypted data, followed by decryption, yields the same result as computing on unencrypted data.
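
The sum example can be illustrated with a toy implementation of the Paillier cryptosystem, a classic additively homomorphic scheme. Paillier is used here only because it fits in a few lines; it is not one of the schemes implemented in Microsoft SEAL, and the sketch says nothing about how SEAL works. The fixed primes are far too small to be secure, and all function names are illustrative assumptions.

```python
import math
import secrets

def keygen(p=2**31 - 1, q=2**61 - 1):
    """Build a toy key pair from two fixed Mersenne primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) using fresh randomness r."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # (n + 1)^m mod n^2 equals 1 + m*n, so g^m needs no exponentiation.
    return (1 + m * n) * pow(r, n, n2) % n2

def decrypt(n, secret_key, c):
    """Recover the plaintext from ciphertext c with the secret key."""
    lam, mu = secret_key
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n

def add(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds plaintexts mod n."""
    return c1 * c2 % (n * n)

n, sk = keygen()
c_sum = add(n, encrypt(n, 20), encrypt(n, 22))
print(decrypt(n, sk, c_sum))  # prints 42
```

The party calling `add` never sees 20, 22, or 42; it only handles ciphertexts. Encrypting 20 twice gives two different ciphertexts because a fresh `r` is drawn each time.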

There have been many instantiations of homomorphic encryption, varying in types of supported computations, run-time performance, and application development effort. The most popular ones are Brakerski-Fan-Vercauteren (BFV), Brakerski-Gentry-Vaikuntanathan (BGV), and Cheon-Kim-Kim-Song (CKKS). BFV and BGV preserve operations exactly and are suitable for database retrieval or string matching, while CKKS preserves operations approximately and is preferred when processing floating-point data. All of these schemes base their security on a hard mathematical problem known as Learning with Errors (LWE), which is also the security foundation of many post-quantum cryptographic schemes currently being considered for recommendation by NIST.
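
To give a feel for how LWE-style encryption hides a message behind noise, here is a toy secret-key scheme that encrypts single bits. It is a teaching sketch under stated assumptions, not BFV, BGV, or CKKS: real schemes work over polynomial rings with carefully chosen noise distributions and much larger parameters. The constants `Q`, `N`, and `NOISE` and all function names are made up for this example and offer no security.

```python
import secrets

Q = 2**15   # ciphertext modulus (toy size)
N = 16      # dimension of the secret vector (toy size)
NOISE = 4   # noise is drawn uniformly from [-NOISE, NOISE]

def keygen():
    """Secret key: a random vector in Z_Q^N."""
    return [secrets.randbelow(Q) for _ in range(N)]

def encrypt(s, bit):
    """Ciphertext (a, b) with b = <a, s> + noise + bit * Q/2 (mod Q)."""
    a = [secrets.randbelow(Q) for _ in range(N)]
    e = secrets.randbelow(2 * NOISE + 1) - NOISE
    b = (sum(ai * si for ai, si in zip(a, s)) + e + bit * (Q // 2)) % Q
    return a, b

def decrypt(s, ct):
    """Remove <a, s>, then decide whether the remainder is nearer 0 or Q/2."""
    a, b = ct
    noisy = (b - sum(ai * si for ai, si in zip(a, s))) % Q
    return 1 if Q // 4 <= noisy < 3 * Q // 4 else 0

def add(ct1, ct2):
    """Component-wise addition; the plaintext bits are XORed, the noise adds up."""
    (a1, b1), (a2, b2) = ct1, ct2
    return [(x + y) % Q for x, y in zip(a1, a2)], (b1 + b2) % Q

s = keygen()
ct = add(encrypt(s, 1), encrypt(s, 1))
print(decrypt(s, ct))  # prints 0, since 1 XOR 1 = 0
```

Each homomorphic addition increases the noise term. Decryption stays correct only while the accumulated noise remains below Q/4, which is why practical schemes need careful parameter selection.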

Microsoft SEAL

Microsoft SEAL is a production-quality homomorphic encryption library that the Cryptography and Privacy Research Group has been building since 2015. It is actively maintained, supports multiple platforms, and currently implements the BFV and CKKS schemes.

Homomorphic encryption can be conceptually difficult to learn, and writing applications using the CKKS scheme API in Microsoft SEAL can be very challenging. To address this, we have created EVA, a language and compiler for the CKKS scheme.

Standardization

In 2017, together with partners from industry, academia, and government, we launched a community standardization effort that is today known as HomomorphicEncryption.org. The community standard sets recommendations for secure parameters that are now used by multiple implementations. Recently ISO/IEC JTC1 SC27 WG2 has recommended fully homomorphic encryption for standardization.

Private Set Intersection

Private Set Intersection (PSI) often refers to a group of multiple related functionalities. In its simplest form, PSI allows two distrusting parties to find the intersection of their private sets of “items” (strings, multi-precision integers, documents) without revealing any other information to each other. Variants of PSI target scenarios with roughly equally sized sets (symmetric PSI), very unequally sized sets (asymmetric PSI), retrieval of data corresponding to items in the intersection (labeled PSI), or post-processing on the intersection, e.g., returning the sum of values in the intersection (PSI with post-processing).

PSI can be realized by orchestrating multiple cryptographic primitives to achieve different variants with performance/communication trade-offs. For example, homomorphic encryption can provide communication-efficient solutions in the asymmetric case where one set is very small while the other is very large, even up to billions of elements, whereas such a solution would be completely impractical when both sets have size in the millions.
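
The basic PSI functionality can be sketched with the classic Diffie-Hellman-style construction, simulated here inside a single process. This is a different technique from the homomorphic-encryption-based approach mentioned above, and it is only a semi-honest toy: the modulus is far too small, the hash-to-group step is simplistic, and the function names are illustrative assumptions.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy-sized modulus, not secure

def hash_to_group(item):
    """Map a string to a nonzero element modulo P."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def random_exponent():
    """Pick a secret exponent coprime to P - 1, so blinding is a permutation."""
    while True:
        k = secrets.randbelow(P - 2) + 1
        if math.gcd(k, P - 1) == 1:
            return k

def blind(values, k):
    """Raise every group element to the secret power k."""
    return [pow(v, k, P) for v in values]

def psi(client_items, server_items):
    """Return the items the client learns to be in both sets."""
    client_items = list(client_items)
    a = random_exponent()  # known only to the client
    b = random_exponent()  # known only to the server

    # Client -> server: H(x)^a for each client item.
    client_blinded = blind([hash_to_group(x) for x in client_items], a)

    # Server -> client: H(x)^(ab) in the same order, plus shuffled H(y)^b.
    double_blinded = blind(client_blinded, b)
    server_blinded = blind([hash_to_group(y) for y in server_items], b)
    secrets.SystemRandom().shuffle(server_blinded)

    # Client: compute H(y)^(ab) and compare with its own double-blinded values.
    server_double = set(blind(server_blinded, a))
    return {x for x, v in zip(client_items, double_blinded) if v in server_double}

print(sorted(psi(["alice", "bob", "carol"], ["bob", "dave", "carol", "erin"])))
# prints ['bob', 'carol']
```

In this construction both parties send one group element per item, so communication grows with the size of the larger set. That cost is what motivates specialized protocols for the asymmetric case.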

APSI

APSI is a library we built for asymmetric (labeled) PSI. It allows a client to look up the presence of relatively few (tens of thousands of) items in a much larger database held by a server and, for those items that are found, optionally retrieve data the server has associated with them. The entire operation reveals nothing to the server, except that a query was performed, and the client learns nothing beyond what the functionality indicates. Edge Password Monitor uses APSI to provide an efficient and secure way to detect whether Edge users’ credentials are found in Microsoft’s database of breached credentials. APSI is implemented using Microsoft SEAL.

Digital Identities

Digital identities are foundational to the modern web and to the many services enabled by Azure and other cloud providers. A few large identity-as-a-service providers have significant control of the market, but recently new decentralized approaches have started gaining traction. Our goal is to enable a privacy-preserving and transparent digital identity ecosystem, where users have more control over their identity and visibility into its usage.

Privacy-Preserving Credentials

Consider systems where an identity owner interacts with a credential issuer, who can attest to some attributes of the identity owner (is registered for a service, has solved a captcha, has held an account for a certain length of time, has a given age, place of residence, etc.). Later the identity owner visits a relying party (e.g., another website), and needs to demonstrate its ownership of some of these attributes. Currently most credential/authentication services provide very little privacy against the issuer or the relying party: either the issuer drops a cookie storing this information, which then becomes a way for various parties to track the user’s actions across the web, or the relying party must call back to the issuer to verify the attributes each time, which lets the issuer track all of the identity owner’s actions. In either approach the identity owner has little control and often little knowledge of when/where their information is being released.

We aim to enable similar functionality with privacy-preserving credentials, where presenting the credentials results in minimal privacy loss. An identity owner obtains a credential from an issuer, and then can present this credential to relying parties in a way where presentations are not linkable (the relying party cannot tell whether two presentations were from the same person), leak only the necessary attributes, and reveal nothing else to the issuer.

Auditable Server-Managed Identities

One common way to manage identity is to use an identity-as-a-service provider who manages the user’s identity on their behalf. For any changes, the user must authenticate to the identity provider, who will record the desired changes. The service provider will also respond to queries from external parties regarding a particular user’s identity, assuming appropriate permissions have been given. This is the approach taken in Single Sign-On (SSO) services and in most systems that provide end-to-end encryption. It has the advantage of being easy for users to manage and does not assume any long-term cryptographic key storage by users: a user can easily switch devices and can use a standard password recovery mechanism in case of a forgotten password. The downside of this type of approach is that the service provider has complete control over their customers’ identities. If the service provider is compromised, or simply has an error, the user’s identity may be compromised without them knowing.

We hope to allow users to audit the server to get cryptographic guarantees that the server is behaving correctly, without introducing any additional privacy leakage into the system. One concrete example of this is the SEEMless auditable public key directory, which allows users to verify that their public keys are correctly delivered to relying parties, e.g., in end-to-end encrypted messaging or video conferencing applications. Competing proposals assume that people do things like compare digital fingerprints with all their contacts, which may be impractical and unrealistic, whereas SEEMless replaces this with an asynchronous process that can run entirely in the background.
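
As a rough illustration of what it means to audit a directory cryptographically, the sketch below shows a Merkle-tree inclusion proof: the server publishes a single root hash, and a client can check that a specific key entry is covered by that root. This is not the SEEMless construction, which this page does not detail, and it does not provide the privacy properties discussed above. The entry format, the padding rule, and the function names are assumptions made for the example.

```python
import hashlib

def leaf_hash(entry):
    return hashlib.sha256(b"\x00" + entry).digest()

def node_hash(left, right):
    return hashlib.sha256(b"\x01" + left + right).digest()

def pad(level):
    """Toy padding rule: duplicate the last node of an odd-sized level."""
    return level + [level[-1]] if len(level) % 2 else level

def build_levels(entries):
    """Return all tree levels, from leaf hashes up to the single root."""
    level = [leaf_hash(e) for e in entries]
    levels = [level]
    while len(level) > 1:
        level = pad(level)
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def prove(levels, index):
    """Collect the sibling hashes on the path from leaf `index` to the root."""
    proof = []
    for level in levels[:-1]:
        level = pad(level)
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof

def verify(root, entry, proof):
    """Recompute the root from one entry and its proof, then compare."""
    acc = leaf_hash(entry)
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == root

# Hypothetical directory of (user, public key) entries.
directory = [b"alice:pk_a", b"bob:pk_b", b"carol:pk_c", b"dave:pk_d", b"erin:pk_e"]
levels = build_levels(directory)
root = levels[-1][0]
print(verify(root, b"carol:pk_c", prove(levels, 2)))   # prints True
print(verify(root, b"carol:forged", prove(levels, 2)))  # prints False
```

The proof is logarithmic in the directory size, so a client can run the check automatically without user involvement.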

Authentication

Decentralized systems must fundamentally rely on user-controlled, user-custodied cryptographic keys. All current implementations of cryptocurrencies, “smart contract” systems, and decentralized identifier (DID) protocols require users to maintain secure control over cryptographically random secrets composed of long strings (e.g., 32-byte private keys) that humans are generally unable to remember. The decentralized systems and applied cryptography communities have generally employed two mechanisms for aiding in the recovery of these unmemorable secrets: (1) randomly selected mnemonic phrases, and (2) Shamir-based secret-sharing schemes. Both schemes provide various advantages over simply attempting to remember or store long, unmemorable secret strings of random numbers and letters, but the approaches come with almost diametrically opposed trade-offs. With mnemonic phrases, inputs are words, which increases memorability, but users must correctly reproduce all words in the exact order in which they were generated. Secret-sharing schemes, on the other hand, only require the user to reproduce a threshold subset of shares, in any order. Unlike mnemonic phrases, however, the shares are long, random strings of letters and numbers, rather than words or other human-friendly inputs.
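
The threshold property of Shamir secret sharing described above can be sketched in a few lines. This is a minimal illustration over a prime field, with the prime, the function names, and the 3-of-5 parameters chosen as assumptions for the example; it is not the Fuzzy Vault scheme and omits share encoding and integrity checks.

```python
import secrets

PRIME = 2**521 - 1  # a Mersenne prime, comfortably larger than a 32-byte secret

def split(secret, threshold, num_shares):
    """Split `secret` into points on a random polynomial of degree threshold - 1."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def recover(shares):
    """Lagrange-interpolate the polynomial at x = 0 to recover the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

key = int.from_bytes(secrets.token_bytes(32), "big")  # a 32-byte private key
shares = split(key, threshold=3, num_shares=5)

# Any 3 of the 5 shares recover the key, in any order.
print(recover([shares[4], shares[0], shares[2]]) == key)  # prints True
```

Each share is itself a long random-looking number, which illustrates the usability problem noted above: the shares are no easier to remember than the key they protect.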

A person’s digital identity can be of enormous value and should be recoverable in the case of unpredictable events, such as loss of token devices, memory, or notes. One step in this direction is the Fuzzy Vault project, where we developed a new scheme that makes it easier for users to reproduce cryptographically secure passphrases for secret recovery.

Internships

Our group hosts multiple research interns each year through Microsoft’s internship program. The internships usually take place in the summer, but sometimes also earlier in the spring. They always last for exactly 12 weeks and typically involve a combination of research and implementation work, and often lead to at least one publication. Internships are awesome networking opportunities, as the interns get an opportunity to meet many other interns and researchers at Microsoft. If you have any questions about the internship program, email us at msrcrypt@microsoft.com.

Open positions

Currently we have no open internship positions. Future positions will be listed here and can be found at MSR Careers.

People