Benchmarking robustness of automated CT pancreas segmentation: achieving human-level reliability through human-in-the-loop optimization
Background: Deep learning–based pancreas segmentation in CT has advanced rapidly, yet it is still evaluated primarily with mean overlap metrics that fail to capture robustness, defined here as the proportion of cases reaching human-level performance. Models that perform well on mean Dice or surface metrics can still fail unpredictably across…
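To make the robustness definition concrete, the following is a minimal Python sketch, not the authors' implementation. It computes robustness as the fraction of cases whose Dice score reaches a human-level threshold and compares it with mean Dice. The function name `robustness`, the threshold `HUMAN_LEVEL = 0.85`, and the per-case scores are all hypothetical values chosen for illustration; none of them come from the paper.

```python
from statistics import mean


def robustness(dice_scores, human_level):
    """Fraction of cases whose Dice score reaches the human-level threshold."""
    if not dice_scores:
        raise ValueError("need at least one case")
    return sum(score >= human_level for score in dice_scores) / len(dice_scores)


# Hypothetical per-case Dice scores for two models (illustrative only).
model_a = [0.90, 0.90, 0.90, 0.90]  # consistent across all cases
model_b = [0.98, 0.98, 0.98, 0.66]  # excellent on three cases, fails on one

# Assumed human-level threshold; the source does not state a value here.
HUMAN_LEVEL = 0.85

for name, scores in (("A", model_a), ("B", model_b)):
    print(
        f"model {name}: mean Dice = {mean(scores):.2f}, "
        f"robustness = {robustness(scores, HUMAN_LEVEL):.2f}"
    )
```

In this toy example both models have a mean Dice of about 0.90, but model B reaches the assumed human-level threshold on only three of four cases, which is the kind of failure a mean overlap metric hides.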