Abstract

Network reliability is critical for large clouds and online service providers like Microsoft. Our network is large, heterogeneous, complex and undergoes constant churns. In such an environment even small issues triggered by device failures, buggy device software, configuration errors, unproven management tools and unavoidable human errors can quickly cause large outages. A promising way to minimize such network outages is to proactively validate all network operations in a high-fidelity network emulator, before they are carried out in production. To this end, we present CrystalNet, a cloud-scale, high-fidelity network emulator. It runs real network device firmwares in a network of containers and virtual machines, loaded with production configurations. Network engineers can use the same management tools and methods to interact with the emulated network as they do with a production network. CrystalNet can handle heterogeneous device firmwares and can scale to emulate thousands of network devices in a matter of minutes. To reduce resource consumption, it carefully selects a boundary of emulations, while ensuring correctness of propagation of network changes. Microsoft’s network engineers use CrystalNet on a daily basis to test planned network operations. Our experience shows that CrystalNet enables operators to detect many issues that could trigger significant outages.