Benchmarking robustness of automated CT pancreas segmentation: achieving human-level reliability through human-in-the-loop optimization

Radiology

Background

Deep learning–based pancreas segmentation in CT has advanced rapidly yet remains evaluated primarily with mean overlap metrics that fail to capture robustness—defined as the proportion of cases reaching human-level performance. Models performing well on mean Dice or surface metrics can still fail unpredictably across scanners or anatomies. Because early detection and quantitative biomarkers rely on consistent segmentation, robustness is critical for clinical deployment.

Purpose

To systematically evaluate the robustness of deep learning models for pancreas segmentation relative to human readers and to investigate an active learning strategy to improve reliability.

Materials and Methods

We retrospectively assembled 903 venous-phase CT scans from patients with presumed normal-appearing pancreases and without known pancreatic disease (2005–2023), split into 803 scans for training/validation and 100 healthy test cases. Each test case had 4 independent human segmentations. Inter-reader variability on this healthy-only test set defined the empirical human distribution, providing an upper-bound estimate of robustness. We introduced a Fractional Threshold (FT) metric, defined as the proportion of model predictions meeting or exceeding the minimum human performance. Robustness was assessed across models trained from scratch, fine-tuned, or pretrained on datasets containing both normal and abnormal cases. An active learning approach identified high-uncertainty predictions for human revision. Statistical comparisons were performed using the Wilcoxon signed-rank test and the two-proportion z-test.
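The FT metric described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the array layout (one model score per case, four reader scores per case), and the convention that "minimum human performance" means the per-case worst reader score are all assumptions.

```python
import numpy as np

def fractional_threshold(model_scores, human_scores):
    """Fractional Threshold (FT): proportion of cases in which the model's
    score (e.g., Dice or NSD) meets or exceeds the worst human reader's
    score on that case.

    model_scores: shape (n_cases,), one score per model prediction.
    human_scores: shape (n_cases, n_readers), reader scores per case.
    """
    model_scores = np.asarray(model_scores, dtype=float)
    # Per-case minimum across readers serves as the human "floor".
    human_floor = np.asarray(human_scores, dtype=float).min(axis=1)
    return float(np.mean(model_scores >= human_floor))

# Toy example: 2 cases, 2 readers each (illustrative values only).
ft = fractional_threshold([0.90, 0.70], [[0.85, 0.88], [0.80, 0.75]])
# The model beats the worst reader on case 1 but not case 2, so FT = 0.5.
```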

Results

The best model, a three-dimensional U-Net trained from scratch, achieved a Dice Similarity Coefficient (DSC) of 0.88 ± 0.04 and a Normalized Surface Dice (NSD) of 0.77 ± 0.09, approaching human-level segmentation (DSC = 0.89 ± 0.03; NSD = 0.75 ± 0.07). However, FT values for both DSC and NSD remained below the human level in most cases, indicating persistent model variability. Human-in-the-loop revision of outliers flagged by active learning increased FT to 0.99, with an average revision time of 1.54 minutes per case, corresponding to a 23-fold workload reduction.

Conclusion

Automated pancreas segmentation reduces workload but remains constrained by tail-case failures. Active learning enhances model reliability, bridging the gap between artificial intelligence and human-level performance.