Rad-Phi4-Vision-CXR: A Compact Multimodal Assistant for Versatile Radiology Workflows

Mercy Ranjit; Tanuja Ganu

Machine Learning for Health Symposium 2025 | November 2025

Published by Proceedings of Machine Learning Research (PMLR) | Organized by Microsoft

The integration of artificial intelligence into radiology underscores the need for efficient models capable of supporting a wide range of clinical tasks. We introduce Rad-Phi4-Vision-CXR, a compact multimodal vision-language model designed to integrate seamlessly into radiology workflows for chest X-rays. It supports radiology report generation, fine-grained visual question answering (VQA) for abnormalities and tubes/lines (including presence and placement), and grounding of anatomies, pathologies, and medical devices. Beyond these tasks, we propose a capability for findings generation with causal exploration of radiology findings and differential diagnosis, enabling the model to affirm findings or rule out conditions, thereby enhancing its utility in clinical decision-making. Rad-Phi4-Vision-CXR achieves state-of-the-art performance on multiple benchmarks for report generation, VQA, and grounding, including the ReXrank benchmark for Visual Question Answering. Its compact architecture provides a scalable, high-performance solution for AI-driven radiology.