Research Tools: code, datasets, & models

Tool

ToxiGen

Toxic language detection systems often falsely flag text that contains minority group mentions as toxic, as those groups are often the targets of online hate. Such over-reliance on spurious correlations also causes systems to struggle…

GitHub Publication

Tool

Hybrid Hiring

We introduce our full experimental data as Hybrid Hiring, a large-scale dataset for studying human-AI decision-making that is collected and evaluated on real-world candidates. Comprising 38,400 human judgements and over 9,600 unique prediction…

Access Publication

Tool

AdaTest

Find and fix bugs in natural language machine learning models using adaptive testing.

GitHub Publication

Tool

Learning to Detect Scene Landmarks for Camera Localization

Source code and data for the CVPR 2022 paper “Learning to Detect Scene Landmarks for Camera Localization”.

GitHub Publication

Tool

Data for Society catalog

Microsoft is working to make data that is relevant to important social problems as open as possible, including by contributing open data ourselves. The Data for Society resource center provides access to Microsoft’s open datasets,…

GitHub

Tool

Admin-Torch

Here, we provide a plug-and-play implementation of Admin, which stabilizes previously diverged Transformer training and achieves better performance, without introducing additional hyper-parameters. The design of Admin is half-precision friendly and can be reparameterized into the original…

GitHub

Tool

XtremeDistil

XtremeDistil is a framework for distilling (compressing) massive multilingual neural network models into tiny, efficient models for AI at scale.

GitHub Publication Publication

Tool

LoRA

This repo contains the source code of the Python package loralib and several examples of how to integrate it with PyTorch models, such as those in HuggingFace. We only support PyTorch for now. See our…

GitHub Publication Publication
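
As a rough illustration of the idea behind low-rank adaptation (not the loralib API itself, which is documented in the repo), the sketch below adapts a frozen weight matrix W by adding the scaled product of two small trainable matrices B and A. It is written in plain Python so it runs without PyTorch. All names and values here (matmul, lora_forward, W, A, B, alpha) are illustrative assumptions and do not come from the repo.

```python
# Minimal sketch of a low-rank adapted linear map:
#   y = (W + (alpha / r) * B @ A) x
# Names and values are hypothetical; this is not the loralib API.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_forward(W, A, B, x, alpha=1.0):
    """Apply the frozen weight W plus the scaled low-rank update B @ A to vector x."""
    r = len(A)              # rank of the update = number of rows of A
    delta = matmul(B, A)    # (out x r) @ (r x in) -> (out x in)
    scale = alpha / r
    W_eff = [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
             for i in range(len(W))]
    return [sum(W_eff[i][j] * x[j] for j in range(len(x)))
            for i in range(len(W_eff))]

# Frozen 2x3 weight with a rank-1 update.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]   # r x in  (1 x 3), trainable
B = [[0.0], [0.0]]      # out x r (2 x 1), trainable, starts at zero

x = [1.0, 2.0, 3.0]
y_initial = lora_forward(W, A, B, x)   # B is zero, so this equals W applied to x

B = [[0.5], [-1.0]]                    # pretend training has updated B
y_adapted = lora_forward(W, A, B, x)
```

Only A and B would be trained in this setup, so the number of trainable values is r * (in + out) rather than in * out; W itself is never modified.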

Tool

MoLeR: A Model for Molecule Generation

Implementation of MoLeR: a generative model of molecular graphs which supports scaffold-constrained generation. This open-source code accompanies our paper “Learning to Extend Molecular Scaffolds with Structural Motifs”, which has been accepted at the ICLR 2022…

GitHub Publication

Tool

Project Iris

GitHub link to Iris: pretrained summarization models for structured datasets and cardinality estimation.

Access