Admin-Torch
Here, we provide a plug-in-and-play implementation of Admin, which stabilizes previously-diverged Transformer training and achieves better performance, without introducing additional hyper-parameters. The design of Admin is half-precision friendly and can be reparameterized into the original Transformer.