Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent work shows that attack success increases with repeated sampling, principled methods for predicting large-scale adversarial risk remain limited. We propose SABER, a Scaling-Aware Best-of-N Estimation of Risk framework for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities with a Beta distribution, the conjugate prior of the Bernoulli distribution, and derive an analytic scaling law that enables reliable extrapolation of large-N attack success rates from small-budget measurements. Using only n=100 samples, our anchored estimator predicts ASR@1000 with a mean absolute error of 1.66, compared to 12.04 for the baseline, an 86.2% reduction in estimation error. Our results reveal heterogeneous risk-scaling profiles and show that models appearing robust under standard evaluation can experience rapid, nonlinear risk amplification under parallel adversarial pressure. This work provides a low-cost, scalable methodology for realistic LLM safety assessment. We will release our code and evaluation scripts upon publication to support future research.
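The core Beta-Bernoulli extrapolation idea from the abstract can be sketched in a few lines of standard Python. The method-of-moments fit below is illustrative only (the paper's anchored estimator is more involved), and all variable names are our own; the exact identity ASR@N = 1 - B(α, β+N)/B(α, β) follows directly from p ~ Beta(α, β).

```python
import math

def asr_at_n(alpha, beta, n_attempts):
    """Exact expected ASR@N when per-query success probabilities follow
    Beta(alpha, beta): 1 - E[(1-p)^N] = 1 - B(alpha, beta+N) / B(alpha, beta)."""
    def ln_beta(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return 1.0 - math.exp(ln_beta(alpha, beta + n_attempts) - ln_beta(alpha, beta))

def fit_beta_moments(successes, trials):
    """Illustrative method-of-moments fit of Beta(alpha, beta) to per-query
    success rates k_i / n (NOT the paper's anchored estimator)."""
    rates = [k / trials for k in successes]
    mean = sum(rates) / len(rates)
    var = sum((r - mean) ** 2 for r in rates) / (len(rates) - 1)
    scale = mean * (1.0 - mean) / var - 1.0  # assumes var < mean * (1 - mean)
    return mean * scale, (1.0 - mean) * scale

# 10 queries, 100 jailbreak attempts each; a small budget extrapolated to N=1000
successes = [3, 5, 0, 2, 8, 1, 4, 0, 6, 2]
alpha, beta = fit_beta_moments(successes, trials=100)
print(f"ASR@1    = {asr_at_n(alpha, beta, 1):.1%}")
print(f"ASR@1000 = {asr_at_n(alpha, beta, 1000):.1%}")
```

Because the Beta mean α/(α+β) matches the empirical mean rate, ASR@1 reproduces the observed single-shot rate exactly, while ASR@N for large N is driven mainly by the right tail of the fitted distribution.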

Related Tools

SABER: Scaling-Aware Best-of-N Estimation of Risk

SABER is a Python package for predicting large-scale adversarial risk in Large Language Models under Best-of-N sampling.

Paper: https://arxiv.org/pdf/2601.22636

Overview

Standard LLM safety evaluations use single-shot (ASR@1) metrics, but real attackers can exploit parallel sampling to repeatedly probe models. SABER provides a principled statistical framework to:

- Predict ASR@N at large budgets from small measurements
- Estimate how many attempts are needed to reach a target success rate
- Quantify uncertainty in adversarial risk predictions

Key Insight

Attack success rates scale according to a power law governed by the Beta distribution of per-query vulnerabilities:

    ASR@N ≈ 1 - Γ(α+β)/Γ(β) · N^(-α)

The parameter α controls how fast risk amplifies with more attempts.

Installation

```bash
pip install saber-risk
```

Or from source:

```bash
git clone https://github.com/microsoft/saber
cd saber
pip install -e .
```

Quick Start

```python
import numpy as np
from saber import SABER

# Your jailbreak evaluation data:
# k[i] = number of successful jailbreaks for query i
k = np.array([3, 5, 0, 2, 8, 1, 4, 0, 6, 2])
n = 100  # 100 attempts per query

# Fit and predict
model = SABER()
model.fit(k, n)

# Predict ASR at N=1000 attempts
result = model.predict(N=1000)
print(f"ASR@1000 = {result.asr:.2%}")

# With confidence interval
result = model.predict(N=1000, confidence=0.95)
print(f"ASR@1000 = {result.asr:.2%} [{result.ci_lower:.2%}, {result.ci_upper:.2%}]")
```

Core Usage

```python
from saber import SABER

# 1. Collect jailbreak data: run n attempts per query, count successes k
k = [...]  # successes per query
n = 100    # trials per query (or an array for heterogeneous budgets)

# 2. Fit the model
model = SABER()
model.fit(k, n)

# 3. Predict ASR at the target budget
asr_1000 = model.predict(N=1000).asr

# Budget estimation
result = model.budget_for_asr(target=0.95)
print(f"Need {result.budget:.0f} attempts for 95% ASR")

# Fluent API
asr = SABER().fit(k, n).predict(1000).asr
```

Documentation

Full documentation is available in the docs/ directory. To build:

```bash
cd docs
pip install -r requirements.txt
make html
```

- Quick Start - getting started guide
- API Reference - complete API documentation
- Advanced Usage - model selection, scaling curves, low-level API

Citation

If you use SABER in your research, please cite:

```bibtex
@misc{feng2026statisticalestimationadversarialrisk,
  title={Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling},
  author={Mingqian Feng and Xiaodong Liu and Weiwei Yang and Chenliang Xu and Christopher White and Jianfeng Gao},
  year={2026},
  eprint={2601.22636},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.22636},
}
```

Contact

For any questions about the package or paper, feel free to reach out:

- Mingqian Feng - mfeng7@ur.rochester.edu
- Xiaodong Liu - xiaodl@microsoft.com
- Weiwei Yang - weiwei.yang@microsoft.com

License

MIT License - see LICENSE for details.
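As a sanity check on the Key Insight scaling law, the exact Beta-model ASR@N can be compared against a direct Monte Carlo simulation of the Best-of-N process. This standalone sketch does not use the saber package, and the α, β values are chosen purely for illustration.

```python
import math
import random

random.seed(0)

def asr_formula(alpha, beta, n_attempts):
    """Exact ASR@N under the Beta model: 1 - B(alpha, beta+N) / B(alpha, beta)."""
    def ln_beta(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return 1.0 - math.exp(ln_beta(alpha, beta + n_attempts) - ln_beta(alpha, beta))

def asr_monte_carlo(alpha, beta, n_attempts, queries=20000):
    """Simulate: draw a per-query success probability p ~ Beta(alpha, beta);
    the query is jailbroken within N attempts with probability 1 - (1-p)^N."""
    hits = 0
    for _ in range(queries):
        p = random.betavariate(alpha, beta)
        if random.random() < 1.0 - (1.0 - p) ** n_attempts:
            hits += 1
    return hits / queries

alpha, beta = 0.5, 20.0  # illustrative vulnerability profile
for n_attempts in (10, 100, 1000):
    print(n_attempts,
          round(asr_formula(alpha, beta, n_attempts), 3),
          round(asr_monte_carlo(alpha, beta, n_attempts), 3))
```

With a heavy-tailed profile like α = 0.5, the simulated ASR climbs steeply with N even though the single-shot rate α/(α+β) is small, which is the nonlinear risk amplification the package is designed to forecast.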