Miniature: Fast AI Supercomputer Networks Simulation on FPGAs
- Yicheng Qian,
- Ran Shu,
- Rui Ma,
- Yang Wang,
- Derek Chiou,
- Nadeen Gebara,
- Luca Piccolboni,
- Miriam Leeser,
- Yongqiang Xiong
APNet | Published by ACM
Best Student Paper
AI supercomputer network performance has become a blocker to training larger AI models. Analytically modeling such networks is virtually impossible, and the cost and time of building prototypes is prohibitive, making simulation the only practical option. Simulating an 8,192-node AI cluster for one second with existing network simulators, however, requires over a week for a single run. In this paper, we propose Miniature, an FPGA-based network simulator that simulates network components using specialized circuits, achieving both fast simulation and efficient resource utilization. Miniature further provides an efficient multiplexing architecture to automatically fit large AI supercomputer networks onto a given number of FPGAs. We implement a Miniature prototype running AI traffic. Our measurements demonstrate that Miniature can simulate a 65,536-node AI cluster 4,332× faster than state-of-the-art software-based network simulators on a single FPGA.
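To put the reported numbers in perspective, a rough back-of-the-envelope sketch: the abstract states that a software simulator needs over a week for one simulated second, and that Miniature achieves a 4,332× speedup. Note the two figures come from different cluster sizes (8,192 nodes for the week-long baseline, 65,536 nodes for the measured speedup), so combining them is only indicative, not a claim from the paper.

```python
# Indicative arithmetic only: the one-week baseline (8,192 nodes) and the
# 4,332x speedup (65,536 nodes, single FPGA) come from different setups.
baseline_seconds = 7 * 24 * 3600   # one week of wall-clock time, in seconds
speedup = 4332

accelerated = baseline_seconds / speedup
print(f"software baseline : {baseline_seconds} s")
print(f"at 4,332x speedup : {accelerated:.0f} s (~{accelerated / 60:.1f} min)")
```

At that ratio, a run that previously took a week would complete in roughly two and a half minutes.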