Research Tools: code, datasets, & models

Discover an index of datasets, SDKs, APIs, and open-source tools developed by Microsoft researchers and shared with the global academic community. These experimental technologies, available through Azure AI Foundry Labs, offer a glimpse into the future of AI innovation.

Dataset Source Code

BitBLAS

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.

GitHub Publication

Dataset Source Code

MetaOpt: Towards efficient heuristic design with quantifiable and confident performance

MetaOpt is the first general-purpose and scalable tool that enables users to analyze a broad class of practical heuristics through easy-to-use abstractions. For more information, check out MetaOpt’s project webpage and…

GitHub Project Publication

Dataset Source Code

VisEval

VisEval is an NL2VIS benchmark designed to evaluate visualization generation methods. This repository provides both the toolkit that supports the benchmarking and the data used for the benchmarks.

GitHub Publication

Dataset Source Code

MicroCode

Microsoft MicroCode is an icon-based programming language and editor for young learners to code with the BBC micro:bit V2. MicroCode allows you to program the micro:bit V2 with only an Arcade shield accessory – no…

GitHub Project Publication

Dataset Source Code

DOSA

A dataset of social artifacts from different Indian geographical subcultures. This repo hosts the code to run experiments on the DOSA dataset.

GitHub Publication

Dataset Source Code

MunTTS: A Text-to-Speech System for Mundari

We present MunTTS, an end-to-end text-to-speech (TTS) system specifically for Mundari, a low-resource Indian language of the Austro-Asiatic family. Our work addresses the gap in linguistic technology for underrepresented languages by collecting and processing data to…

GitHub Project Publication

Dataset Source Code

TE-CCL

TE-CCL is a tool to generate collective communication schedules for large topologies using a Traffic Engineering-based solver. TE-CCL takes in a topology and a collective (e.g., AllGather) and outputs a schedule (in JSON) detailing data transfer…

GitHub Publication

Dataset Source Code

Private Benchmarking

A platform that enables users to perform private benchmarking of machine learning models. The platform facilitates the evaluation of models based on different trust levels between the model owners and the dataset owners.

GitHub Publication

Dataset Source Code

Cookie-doh

Cookie-doh is a repository template for creating single Python package projects.

GitHub

Dataset Source Code

MInference: Accelerating Pre-filling for Long-context LLMs via Dynamic Sparse Attention

MInference 1.0 leverages the dynamic sparse nature of LLMs’ attention, which exhibits some static patterns, to speed up the pre-filling for long-context LLMs. It first determines offline which sparse pattern each head belongs to, then…

GitHub Project Publication Publication Publication