Disarming Strategic Text: Span-Aware Counterfactuals for Robust Content Moderation
- Hardik Meisheri,
- Muhammad Zaid Hassan,
- Swati Tiwari,
- Puneet Mangla,
- Samarth Bharadwaj,
- Karthik Sankaranarayanan,
- Amit Singh
NeurIPS 2025 Workshop: Reliable ML from Unreliable Data
Machine learning systems deployed in the wild must operate reliably despite unreliable inputs, whether arising from distribution shifts, adversarial manipulation, or strategic behavior by users. Content moderation is a prime example: violators deliberately exploit euphemisms, obfuscations, or benign co-occurrence patterns to evade detection, creating unreliable supervision signals for classifiers. We present a span-aware augmentation framework that generates high-quality counterfactual hard negatives to improve robustness under such conditions. Our pipeline combines (i) multi-LLM agreement to extract causal violation spans, (ii) policy-guided rewrites of those spans into compliant alternatives, and (iii) validation via re-inference to ensure that only genuine label-flipping counterfactuals are retained. Across real-world ad moderation and toxic comment datasets, this approach consistently reduces spurious correlations and improves robustness to adversarial triggers, with PR-AUC gains of up to +6.3 points. We further show that augmentation benefits peak at task-dependent ratios, underscoring the importance of balance in reliable learning. These findings highlight span-aware counterfactual augmentation as a practical path toward reliable ML from strategically manipulated and unreliable text data.
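To make the three-stage pipeline concrete, here is a minimal Python sketch of the flow the abstract describes: multi-model agreement on a violation span, a policy-guided rewrite of that span, and re-inference validation that keeps only label-flipping counterfactuals. All function names, labels, and the majority-vote threshold are illustrative assumptions rather than the paper's actual implementation; the extractors, rewriter, and classifier are stand-in callables where real LLM and model calls would go.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

def extract_agreed_span(text: str,
                        span_extractors: List[Callable[[str], str]],
                        min_votes: int = 2) -> Optional[str]:
    """Step (i): keep a span only if enough extractors independently agree on it."""
    votes = Counter(extract(text) for extract in span_extractors)
    span, count = votes.most_common(1)[0]
    return span if count >= min_votes else None

def make_counterfactual(text: str,
                        span_extractors: List[Callable[[str], str]],
                        rewriter: Callable[[str], str],
                        classifier: Callable[[str], str]) -> Optional[Tuple[str, str]]:
    """Steps (ii)-(iii): rewrite the agreed span, then validate by re-inference."""
    span = extract_agreed_span(text, span_extractors)
    if span is None or span not in text:
        return None
    # Step (ii): policy-guided rewrite of the violating span only,
    # leaving the rest of the text untouched.
    counterfactual = text.replace(span, rewriter(span))
    # Step (iii): retain the pair only if the rewrite genuinely flips the
    # label from violation to compliant (a counterfactual hard negative).
    if classifier(text) == "violation" and classifier(counterfactual) == "compliant":
        return counterfactual, "compliant"
    return None

# Toy usage with stand-in functions (real ones would wrap LLM / model calls):
extractors = [lambda t: "miracle cure", lambda t: "miracle cure", lambda t: "cure"]
rewriter = lambda s: "wellness product"
classifier = lambda t: "violation" if "miracle cure" in t else "compliant"
print(make_counterfactual("Buy our miracle cure today!", extractors, rewriter, classifier))
# -> ('Buy our wellness product today!', 'compliant')
```

Under these assumptions, the validation step is what filters out rewrites that fail to change the model's decision, so only genuine label-flipping examples enter the augmented training set.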