Consensus-Robust Transfer Attacks via Parameter and Representation Perturbations

NeurIPS 2025

Adversarial attacks threaten the reliability of deep neural networks, particularly in black-box settings where transferability is essential. However, existing transfer-based attacks often fail when the target model’s architecture or training diverges from the surrogate, due to decision-boundary variation and representation drift. We introduce CORTA, a consensus-robust transfer attack that explicitly models these two sources of transfer failure as parameter and representation perturbations on the surrogate model. We formalize transferability as a distributionally robust optimization (DRO) problem over an uncertainty set of plausible targets, and provide efficient first-order approximations with theoretical guarantees. CORTA enforces consensus misclassification by jointly regularizing parameter sensitivity and promoting robustness to feature blending on the surrogate. Extensive experiments on ImageNet and CIFAR-100 show that CORTA consistently outperforms state-of-the-art transfer-based black-box attacks, including ensemble methods, across both convolutional and transformer architectures. For example, when transferring from ResNet-18 to Swin-B on CIFAR-100, CORTA achieves a 19.1% higher transfer success rate than the strongest baseline. Our approach establishes a new benchmark for robust black-box adversarial evaluation.
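To make the high-level idea concrete, the following is a minimal, self-contained sketch (not the authors' implementation) of a first-order consensus attack in the spirit described above: input-gradients are averaged over randomly perturbed copies of the surrogate's parameters (a sampled approximation of the DRO uncertainty set) and over lightly blended inputs (a stand-in for representation perturbations), then used for signed ascent steps under an L-infinity budget. The toy surrogate here is a single logistic-regression layer, and all function names, hyperparameters, and the Gaussian uncertainty set are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy surrogate: a logistic-regression "network" with weight vector w.
# (Illustrative stand-in for a deep surrogate model.)
w = rng.normal(size=5)

def loss_and_grad(x, y, w):
    """Binary cross-entropy loss and its gradient w.r.t. the input x
    for the linear logit z = w @ x."""
    z = w @ x
    p = 1.0 / (1.0 + np.exp(-z))
    loss = -np.log(p) if y == 1 else -np.log(1.0 - p)
    return loss, (p - y) * w  # d(loss)/dx for a linear logit

def consensus_attack(x, y, w, eps=0.5, steps=20, lr=0.1,
                     n_models=8, sigma=0.1, blend=0.1, x_ref=None):
    """Sketch of a consensus-robust transfer attack: average the
    input-gradient over n_models surrogates drawn from a Gaussian
    parameter-uncertainty set (and optionally feature-blended inputs),
    then take signed ascent steps projected onto the L-inf ball."""
    x0, x_adv = x.copy(), x.copy()
    for _ in range(steps):
        g = np.zeros_like(x)
        for _ in range(n_models):
            # Parameter perturbation: sample a plausible "target" model.
            w_pert = w + sigma * rng.normal(size=w.shape)
            # Representation perturbation: blend with a reference input.
            x_in = x_adv if x_ref is None else (1 - blend) * x_adv + blend * x_ref
            _, gi = loss_and_grad(x_in, y, w_pert)
            g += gi
        x_adv = x_adv + lr * np.sign(g)              # ascend the consensus loss
        x_adv = x0 + np.clip(x_adv - x0, -eps, eps)  # project onto the eps-ball
    return x_adv
```

Because the gradient is averaged over many perturbed surrogates, the resulting perturbation must raise the loss on the whole sampled neighborhood rather than on a single model, which is the intuition behind consensus misclassification.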