Miniature: Fast AI Supercomputer Networks Simulation on FPGAs

APNet | Published by ACM

AI supercomputer network performance has become a bottleneck in training larger AI models. Analytically modeling such networks is virtually impossible, and the cost and time of building hardware prototypes are prohibitive, making simulation the only practical option. However, simulating one second of an 8,192-node AI cluster with existing network simulators takes over a week for a single run. In this paper, we propose Miniature, an FPGA-based network simulator that models network components with specialized circuits, achieving both fast simulation and efficient resource utilization. Miniature further provides an efficient multiplexing architecture that automatically fits large AI supercomputer networks onto a given number of FPGAs. We implement a Miniature prototype and drive it with AI traffic. Our measurements show that, on a single FPGA, Miniature simulates a 65,536-node AI cluster 4,332× faster than state-of-the-art software-based network simulators.
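As a purely illustrative back-of-envelope check (assuming, hypothetically, that the reported 4,332× speedup applies to a workload that would occupy a software simulator for one week; the abstract's one-week and 4,332× figures come from different cluster sizes), the wall-clock time would shrink roughly as follows:

\[
t_{\text{Miniature}} \approx \frac{t_{\text{software}}}{4332}
= \frac{7 \times 86{,}400\ \text{s}}{4332}
\approx 140\ \text{s} \approx 2.3\ \text{min}.
\]

In other words, a week-long software run would shrink to a few minutes at this speedup, which is what makes iterative design-space exploration of large AI clusters feasible.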