4/15/2025

OpenAI transforms AI model development with Azure Blob Storage

As the size and complexity of AI models grow, so do the requirements for storage, especially for throughput, scale, and reliability. For that reason, OpenAI, renowned for developing conversational AI products like ChatGPT, relies on Microsoft Azure Blob Storage and its newly enhanced scaled accounts to underpin its vast data and training workflows.

From storing user and AI-generated content to managing complex pipelines for AI model training, OpenAI uses Azure Blob Storage to scale seamlessly while addressing challenges in data ingestion, processing, and distribution, focusing on its core mission: advancing AI technology.

This collaboration has improved storage efficiency, enhanced disaster recovery capabilities, and optimized throughput, allowing OpenAI to push the boundaries of AI research and innovation with confidence.


Boosting storage scale and training reliability for AI models and data

Best known for its powerful language models and conversational AI products like ChatGPT, OpenAI is a world leader in AI research, development, and advancement, driven by a commitment to continuous technological innovation and industry impact. It has progressed through successive generations of high-performance supercomputers to train cutting-edge AI models, including its latest release, GPT-4.5. Each model surpasses the one before it in overall scale and engineering effort.

The company’s storage needs, including the size of the datasets it must store, have grown over time, demanding an efficient and expansive storage solution. As its AI models have grown not only in size but also in complexity, checkpointing model training and development has become an essential step in helping ensure reliability and recovery. This process spans huge, distributed training clusters of graphics processing units (GPUs).

Every GPU pauses during checkpointing, causing a large influx of data to be pushed to storage in a very short time frame. Additionally, lost training time can result in an opportunity cost of millions of dollars a day, and the massive total storage footprint only further complicates the process. OpenAI solved its data storage and migration challenges by using Microsoft Azure Blob Storage to read training data and write checkpoints for model training. In particular, the engineering teams at OpenAI and Microsoft collaborated on the adoption of the new scaled accounts feature. Compared with traditional storage offerings, Azure Blob Storage scaled accounts push the leading edge of throughput and hit a higher scale point than previously possible. Sam Hopewell, Manager of OpenAI’s Supercomputing Storage Team, emphasizes, “Azure Blob Storage scaled accounts let us operate at exabyte scale without adding operational complexity. This enables our engineers to focus on advancing AI research.”

“Azure Blob Storage scaled accounts let us operate at exabyte scale without adding operational complexity. This enables our engineers to focus on advancing AI research.”

Sam Hopewell, Manager – Supercomputing Storage Team, OpenAI

Streamlining and optimizing complex training, checkpointing, and storing

The AI pipeline at OpenAI begins with transforming raw data into training-ready formats. After raw data is ingested from diverse public and private sources and stored in highly secure Azure Blob Storage, data cleaning and preparation steps run on analytics engines such as Apache Spark and custom frameworks. Vectorized training data is then made available for immediate use in GPU clusters. “Our data preparation milestones are critical for successful model training,” explains Hopewell. “Azure allows us to store and process data in one unified environment, eliminating friction and ensuring consistency across the pipeline.”
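To make this stage concrete, here is a minimal PySpark sketch of the kind of cleaning step described above. The storage account, container paths, and column names are hypothetical placeholders, not OpenAI’s actual pipeline.

```python
# Illustrative PySpark sketch of a data-cleaning step like the one described
# above. The storage account, containers, paths, and column names are
# hypothetical assumptions for this example.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prep-training-data").getOrCreate()

# Read raw ingested records from Blob Storage (abfss:// addresses a container
# in a hierarchical-namespace storage account).
raw = spark.read.json("abfss://raw@exampleaccount.dfs.core.windows.net/ingest/")

# Basic cleaning: drop empty documents and deduplicate on a text field.
cleaned = (
    raw.filter(F.col("text").isNotNull() & (F.length("text") > 0))
       .dropDuplicates(["text"])
)

# Write training-ready output back to Blob Storage for the GPU clusters.
cleaned.write.mode("overwrite").parquet(
    "abfss://prepared@exampleaccount.dfs.core.windows.net/training/v1/"
)
```

Keeping both the raw and prepared datasets in the same storage environment, as the article notes, avoids a separate staging copy between preparation and training.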

OpenAI takes advantage of Azure high-performance computing (HPC) virtual machines to process and train its AI models. These virtual machines run a streamlined version of Ubuntu Linux on Azure that’s optimized to maximize GPU performance. The data processing workflow includes aggregation, filtering, and transformation to refine the data for upcoming training sessions. After the initial data is staged on the GPU host and the GPUs initiate the training process, OpenAI continuously retrieves additional data from Azure Blob Storage.
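A sketch of that continuous-retrieval pattern, using the azure-storage-blob SDK, might look like the following. The account, container, and prefix names are placeholders, and a real pipeline would feed shards into a GPU input queue rather than printing sizes.

```python
# Illustrative sketch of continuously pulling training shards from Azure Blob
# Storage onto a GPU host. Account, container, and prefix are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import ContainerClient

container = ContainerClient(
    account_url="https://exampleaccount.blob.core.windows.net",
    container_name="prepared",
    credential=DefaultAzureCredential(),
)

def iter_shards(prefix: str):
    """Yield each training shard under the given prefix as bytes."""
    for blob in container.list_blobs(name_starts_with=prefix):
        # download_blob streams the shard from storage; readall buffers it.
        yield container.download_blob(blob.name).readall()

# In a real pipeline each shard would feed the GPU input queue; here we just
# report sizes to show the access pattern.
for shard in iter_shards("training/v1/"):
    print(f"fetched shard of {len(shard)} bytes")
```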

This process, however, doesn’t occur overnight. Training OpenAI’s models requires consistent, high-bandwidth data access to thousands of GPUs over an extended period. Every several minutes, the present state of the training process is checkpointed to Azure Blob Storage. As needed, OpenAI’s research team can simply read and restore to the latest checkpoint and quickly get back to experimentation and development. “The predictable throughput of Azure Blob Storage is crucial for checkpointing,” says Hopewell. “It ensures our models stay on track, even in the face of hardware or software challenges.” In addition to recovering from a technological failure or other disruption, OpenAI can choose to roll back to a certain point in training development, change certain parameters, roll forward, and then evaluate its outcomes. The reliability and consistent storage performance of Azure Blob Storage scaled accounts help ensure training continues without interruption, boosting overall resilience.
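A minimal sketch of this checkpoint pattern appears below: periodically upload training state to Blob Storage, then restore from the newest checkpoint after a disruption. The run names, paths, and pickle serialization are assumptions for illustration, not OpenAI’s actual format.

```python
# Sketch of the checkpoint/restore pattern described above. Run names, paths,
# and pickle serialization are illustrative assumptions.
import pickle
from azure.identity import DefaultAzureCredential
from azure.storage.blob import ContainerClient

container = ContainerClient(
    account_url="https://exampleaccount.blob.core.windows.net",
    container_name="checkpoints",
    credential=DefaultAzureCredential(),
)

def save_checkpoint(step: int, state: dict) -> None:
    # Zero-padded steps keep lexicographic order equal to training order.
    name = f"run-001/step-{step:012d}.ckpt"
    container.upload_blob(name, pickle.dumps(state), overwrite=True)

def restore_latest(run_prefix: str = "run-001/") -> dict:
    # List every checkpoint for the run and pick the newest by name.
    names = sorted(b.name for b in container.list_blobs(name_starts_with=run_prefix))
    if not names:
        raise FileNotFoundError(f"no checkpoints under {run_prefix}")
    return pickle.loads(container.download_blob(names[-1]).readall())
```

Because every rank writes at once when training pauses, the storage target must absorb the burst quickly; that short, intense write window is exactly where consistent throughput matters most.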

Since adopting Azure Blob Storage scaled accounts, OpenAI has seamlessly integrated new storage clusters into its existing infrastructure, absorbing training data and checkpoints as they accumulate over time. This approach eliminates the need to stand up and manage new infrastructure that could disrupt critical training workflows or strain capacity management, allowing the company to organically grow storage access and throughput without interruption.

Notably, data growth has exploded with the most recent generation of supercomputers. Having amassed several supercomputers and smaller GPU deployments across multiple Azure regions, OpenAI tackles its crucial data migration needs with the Put Blob From URL API in Azure Blob Storage, which gives it seamless storage-to-storage copying across Azure regions or tenants. OpenAI then taps into global data movement with Azure Virtual WAN to automate and streamline the entire migration process.
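The Put Blob From URL operation is exposed in the Python SDK as BlobClient.upload_blob_from_url; a sketch of a cross-region, storage-to-storage copy follows. The accounts, container, blob names, and SAS token are placeholders.

```python
# Sketch of a server-side, cross-region copy using Put Blob From URL
# (BlobClient.upload_blob_from_url). All names and credentials are placeholders.
from azure.storage.blob import BlobClient

# Destination blob in the target region's storage account.
dest = BlobClient(
    account_url="https://destaccount.blob.core.windows.net",
    container_name="training-data",
    blob_name="shard-0001.parquet",
    credential="<dest-credential>",  # placeholder account key or SAS token
)

# Source blob URL in another region or tenant; it must be readable by the
# service, for example via a SAS token appended to the URL.
source_url = (
    "https://sourceaccount.blob.core.windows.net/"
    "training-data/shard-0001.parquet?<sas-token>"
)

# The copy happens server side: the data never transits the client machine.
dest.upload_blob_from_url(source_url, overwrite=True)
```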

Scaling for the future and focusing on AI innovation

Training state-of-the-art AI models requires moving immense datasets across global infrastructure and ensuring consistent throughput to GPUs. Traditional storage systems often struggle to scale linearly, introducing bottlenecks and increasing operational overhead. OpenAI addressed these challenges with Azure Blob Storage scaled accounts, scaling across regions to support datasets at exabyte scale. OpenAI also gained seamless storage-to-storage replication with tools like AzCopy and CopyBlobFromURL, plus high-throughput capabilities that help ensure data flows to GPUs uninterrupted. “Having scaled accounts that grow with our demands means we don’t have to split our data into smaller pools or worry about complex management,” says Hopewell. “With Azure, we can focus on AI innovation instead of infrastructure limitations.”

OpenAI now benefits from a consistency of throughput it hasn’t experienced before, which has given it the confidence to use Azure Blob Storage scaled accounts for all of its large storage scenarios. From a compute and storage perspective, the combination of Azure Blob Storage and scaled accounts makes operations significantly smoother. Of particular benefit to OpenAI’s research teams is the ability to manage data more efficiently. Before scaled accounts, the company might have 1,000 petabytes of data but space for only 100 petabytes in a given storage account. That meant a complex, lengthy process of partitioning the data into 100-petabyte chunks across 10 storage accounts, writing to each account independently, and then tracking where specific data was stored. Now, OpenAI can keep an exabyte of data in a single storage account, which has already proven to be a great differentiator.
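To see what that partitioning burden looks like in practice, here is a hypothetical sketch of the account-mapping layer that a 100-petabyte-per-account cap forces on every reader and writer; it is illustrative only, not OpenAI’s code.

```python
# Hypothetical illustration of the bookkeeping a 100 PB per-account cap forces:
# data is hashed across N accounts and every reader must recompute the mapping.
# With a single exabyte-scale account, this entire layer disappears.
import hashlib

ACCOUNTS = [f"trainingdata{i:02d}" for i in range(10)]  # 10 x 100 PB accounts

def account_for(dataset_path: str) -> str:
    """Deterministically map a dataset path to one of the partitioned accounts."""
    digest = hashlib.sha256(dataset_path.encode()).digest()
    return ACCOUNTS[digest[0] % len(ACCOUNTS)]

# Before scaled accounts: every access needs this indirection.
url = f"https://{account_for('runs/gpt/shard-17')}.blob.core.windows.net"

# After scaled accounts: one account holds the full exabyte.
url = "https://trainingdata.blob.core.windows.net"
```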

“Having scaled accounts that grow with our demands means we don’t have to split our data into smaller pools or worry about complex management. With Azure, we can focus on AI innovation instead of infrastructure limitations.”

Sam Hopewell, Manager – Supercomputing Storage Team, OpenAI

Supporting global AI workloads with private backbones

Tapping into the global presence of more than 70 Azure regions, OpenAI efficiently moves datasets around the planet to maximize GPU availability. Microsoft’s private WAN and advanced replication tools ensure high-speed data transfers, while the Azure private backbone moves data at terabit speeds, allowing researchers to access data wherever GPUs are available. Data is transferred in a highly secure and efficient manner, enabling OpenAI to scale operations worldwide. “Taking advantage of the Azure private backbone means we can move data to any region and fully utilize our compute resources,” explains Hopewell. “This infrastructure allows us to keep pace with the demands of cutting-edge AI research.”

Strengthening its security stance, OpenAI safeguards its blob data through role-based access control and access management. It also double-encrypts data, pairing Azure Storage’s automatic encryption at rest with its own self-managed encryption.
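As a minimal sketch of what role-based access looks like from client code, the snippet below authenticates with Microsoft Entra ID instead of account keys; the account name is a placeholder, and encryption at rest is applied automatically by Azure Storage with no client code required.

```python
# Sketch of RBAC-based access to blob data: the client authenticates with
# Microsoft Entra ID rather than a shared account key. Account name is a
# placeholder; encryption at rest happens automatically server side.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://exampleaccount.blob.core.windows.net",
    credential=DefaultAzureCredential(),  # resolves managed identity, CLI login, etc.
)

# This call succeeds only if the caller holds a data-plane role such as
# Storage Blob Data Reader on the account or container.
for container in service.list_containers():
    print(container.name)
```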

“Taking advantage of the Azure private backbone means we can move data to any region and fully utilize our compute resources. This infrastructure allows us to keep pace with the demands of cutting-edge AI research.”

Sam Hopewell, Manager – Supercomputing Storage Team, OpenAI

Embracing generational improvements and being an early adopter

On a day-to-day level, OpenAI’s staff benefit from being able to manage their critical datasets with less effort. And with improved performance and scalability, OpenAI can reliably and seamlessly train larger models on larger datasets, with 10 terabits per second of throughput over exabytes of data: a 20-fold improvement in storage limits over earlier solutions. The consistent, guaranteed throughput lets the company minimize costly idle GPU time, keep those GPUs fed with data, and avoid wasted resources.

OpenAI continues to build on generational improvements in its supercomputers and total storage capacity. The company also plans to be an early adopter of the next generation of supercomputers, at unprecedented orders of scale, to support its continued AI research and development.

Discover more about OpenAI on Facebook, Instagram, LinkedIn, X/Twitter, and YouTube.
