Research Tools: code, datasets, & models

Below is an index of datasets, SDKs, APIs, and open-source tools developed by Microsoft researchers and shared with the global academic community. These experimental technologies, available through Azure AI Foundry Labs, offer a glimpse into the future of AI innovation.

Dataset Source Code

MarkItDown

MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to textract, but with a focus on…

GitHub

Dataset Source Code

Code Release for Reprompting: Automated Chain-of-Thought Prompt Inference Through Gibbs Sampling

We introduce Reprompting, an iterative sampling algorithm that automatically learns the Chain-of-Thought (CoT) recipes for a given task without human intervention. Through Gibbs sampling, Reprompting infers the CoT recipes that work consistently well for a…

GitHub Publication

Dataset Source Code

fLSA: Learning Semantic Structures in Document Collections Using Foundation Models

Humans can learn to solve new tasks by inducing high-level strategies from example solutions to similar problems and then adapting these strategies to solve unseen problems. Can we use large language models to induce such…

GitHub Publication

Tool

Magma

Magma is a multimodal foundation model designed to both understand and act in digital and physical environments. Magma builds on the foundation models paradigm that pretraining on a larger amount of more diverse datasets allows…

Access Video Project Publication

Dataset Source Code

OG-RAG

OG-RAG enhances Large Language Models (LLMs) with domain-specific ontologies for improved factual accuracy and contextually relevant responses in fields with specialized workflows like agriculture, healthcare, knowledge work, and more. Paper: OG-RAG: Ontology-Grounded Retrieval-Augmented Generation For…

GitHub Publication

Dataset Source Code

CollabLLM: teaching LLMs to be more effective collaborators

Large Language Models are typically trained with next-turn rewards, limiting their ability to optimize for long-term interaction. As a result, they often respond passively to ambiguous or open-ended user requests, failing to help users reach…

GitHub Publication

Dataset Source Code

DSP+: Reviving DSP for Advanced Theorem Proving in the Era of Reasoning Models

Recent advancements, such as DeepSeek-Prover-V2-671B and Kimina-Prover-Preview-72B, demonstrate a prevailing trend in leveraging reinforcement learning (RL)-based large-scale training for automated theorem proving. Surprisingly, we discover that even without any training, careful neuro-symbolic coordination of existing…

GitHub

Dataset Source Code

Open-Source Consent Package

A collection of packages for building consent management systems with audit trails, granular permissions, and flexible storage backends. Designed for transparency, compliance with privacy regulations, and easy integration into existing applications.

GitHub

Dataset Source Code

SOC Fine-tuning Stable Diffusion

Stochastic Optimal Control Fine-Tuning of Stable Diffusion. This repository provides an implementation of reward fine-tuning methods for Stable Diffusion 1.5 based on stochastic optimal control (SOC), focusing on Adjoint Matching. It adds specialized trainers, custom…

GitHub

Dataset Source Code

Data from Thomas et al., SIGIR 2025: multi- and cross-lingual relevance labelling with LLMs

These are the prompts and qrels used for the experiments in Thomas et al., “System Comparison using Automated Generation of Relevance Judgements in Multiple Languages”, SIGIR 2025.

GitHub