🩺 Medical Multimodal Large Language Model
A high-performance multimodal large language model for medical understanding and analysis, supporting medical text, 2D medical images, and 3D medical volumes.
For research use only. HealthGPT-Pro should not be used as a substitute for professional clinical judgment, diagnosis, or treatment.
HealthGPT-Pro preserves broad instruction-following ability while extending Qwen3-VL to diverse medical modalities and tasks.
- Processes medical text, 2D medical images, and 3D volumetric data in a unified model interface.
- Uses a two-stage training recipe with 3M alignment samples and 10M SFT samples.
- Retains a substantial proportion of general-domain data to maintain general instruction-following ability.
- Trained on diverse medical and general tasks for strong text-based and vision-language performance.
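The unified interface above can be exercised through a chat-style message format. The sketch below assembles a single user turn mixing text with 2D and 3D inputs; the `image`/`text` entry types follow the common Qwen-VL message convention, while the `volume` entry type for 3D data is an illustrative assumption, not a documented API.

```python
# Sketch of one multimodal chat turn mixing text, a 2D image, and a 3D volume.
# The "image"/"text" entry types follow the Qwen-VL message convention; the
# "volume" type for 3D data is an ASSUMPTION for illustration only.

def build_medical_message(question, image_path=None, volume_path=None):
    """Assemble one user turn with optional 2D and 3D attachments."""
    content = []
    if image_path is not None:
        # 2D medical image (e.g. X-ray, pathology slide)
        content.append({"type": "image", "image": image_path})
    if volume_path is not None:
        # 3D volume (e.g. CT/MRI NIfTI file) -- entry type is assumed
        content.append({"type": "volume", "volume": volume_path})
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

messages = build_medical_message(
    "Describe any abnormality in this scan.",
    image_path="chest_xray.png",
    volume_path="abdomen_ct.nii.gz",
)
```

The resulting `messages` list would then be handed to the model's chat template and processor in the usual way; the file names above are placeholders.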
HealthGPT-Pro-8B achieves an average score of 61.3 on text benchmarks and 69.0 on multimodal benchmarks.
Bold marks the best result in each benchmark; underlined values mark the second-best result.
Text benchmarks:

| Model | MMLU-Med | MMLU-Pro-Med | MMedBench | MedBullets | MedMCQA | MedQA | MedXpertQA-Text | PubMedQA | SuperGPQA-Medical | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-4B | 74.3 | 50.7 | 60.5 | 46.4 | 56.0 | 60.5 | 12.6 | 75.6 | 29.6 | 51.8 |
| Qwen3-VL-8B | 79.8 | 57.4 | 65.9 | 51.3 | 61.1 | 65.9 | 12.8 | 76.2 | 30.2 | 55.6 |
| Lingshu-7B | 75.8 | 53.5 | 64.5 | 57.8 | 56.6 | 64.4 | 16.9 | 76.8 | 29.9 | 55.1 |
| HealthGPT-14B | 80.2 | <u>63.4</u> | 63.2 | 39.8 | 63.4 | 66.2 | 11.3 | 68.0 | 25.7 | 53.5 |
| HuatuoGPT-V-34B | 74.7 | 51.8 | 60.7 | 42.7 | 54.7 | 58.8 | 11.4 | 54.7 | 26.5 | 48.4 |
| Hulu-Med-4B | 78.6 | 58.6 | 66.7 | 59.4 | 64.8 | <u>71.9</u> | 16.8 | 77.6 | 29.5 | 58.2 |
| Hulu-Med-7B | 79.5 | 60.6 | **72.8** | **61.5** | <u>67.6</u> | **73.5** | **19.6** | 77.4 | 31.1 | <u>60.4</u> |
| HealthGPT-Pro-4B | <u>80.4</u> | 58.4 | <u>71.6</u> | 58.0 | 64.4 | 71.5 | 16.2 | <u>78.4</u> | <u>31.4</u> | 58.9 |
| HealthGPT-Pro-8B | **83.1** | **64.1** | 71.4 | <u>60.6</u> | **68.5** | 71.3 | <u>18.3</u> | **79.2** | **35.4** | **61.3** |
Multimodal benchmarks:

| Model | MMMU-Med | VQA-RAD | SLAKE | PathVQA | MedXpertQA-Multimodal | MedFrameQA | OmniMedVQA-Mini | PMC-VQA | M3D-MCQ | CT-RATE-MCQ | AMOS-MM-MCQ | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-4B | 44.3 | 59.9 | 77.0 | 53.0 | 13.4 | 40.6 | 74.7 | 53.0 | 57.2 | 58.8 | 49.2 | 52.8 |
| Qwen3-VL-8B | 46.5 | 63.4 | 80.2 | 58.3 | 18.7 | 46.4 | 73.0 | 55.6 | 59.5 | 61.6 | 51.2 | 55.9 |
| Lingshu-7B | 47.3 | 66.7 | 81.9 | 61.0 | <u>25.5</u> | 52.6 | **82.4** | 57.2 | 64.1 | 68.3 | 62.7 | 60.9 |
| HealthGPT-14B | 45.5 | 62.6 | 64.2 | 56.0 | 24.1 | 45.3 | 70.2 | 56.4 | 55.2 | 57.3 | 46.5 | 53.0 |
| HuatuoGPT-V-34B | 50.1 | 60.3 | 68.3 | 47.7 | 21.5 | 49.6 | 69.7 | 56.6 | 50.1 | 54.9 | 48.7 | 52.5 |
| Hulu-Med-4B | 45.8 | 72.6 | 81.7 | 59.7 | 24.6 | 54.2 | 75.1 | 53.1 | 76.0 | 70.1 | 69.1 | 62.0 |
| Hulu-Med-7B | 50.5 | <u>77.2</u> | **85.8** | 64.2 | **28.3** | 57.4 | 77.7 | 57.3 | 80.4 | 76.2 | 70.5 | 66.0 |
| HealthGPT-Pro-4B | <u>52.0</u> | 76.6 | 83.9 | <u>66.7</u> | 20.8 | <u>61.4</u> | 78.2 | <u>60.0</u> | <u>81.0</u> | **86.2** | <u>71.1</u> | <u>67.1</u> |
| HealthGPT-Pro-8B | **54.7** | **78.4** | <u>85.0</u> | **70.7** | 25.3 | **63.6** | <u>80.2</u> | **61.1** | **81.6** | <u>86.0</u> | **72.2** | **69.0** |
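The reported averages are plain means of the per-benchmark scores. As a quick sanity check, a minimal sketch recomputing the HealthGPT-Pro-8B rows from the two tables above:

```python
# Recompute the HealthGPT-Pro-8B averages from the per-benchmark scores above.
text_scores = [83.1, 64.1, 71.4, 60.6, 68.5, 71.3, 18.3, 79.2, 35.4]
multimodal_scores = [54.7, 78.4, 85.0, 70.7, 25.3, 63.6, 80.2, 61.1, 81.6, 86.0, 72.2]

text_avg = round(sum(text_scores) / len(text_scores), 1)                    # -> 61.3
multimodal_avg = round(sum(multimodal_scores) / len(multimodal_scores), 1)  # -> 69.0
```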
If you find this model useful for your research, please cite:
```bibtex
@misc{lin2025healthgptmedicallargevisionlanguage,
  title={HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation},
  author={Tianwei Lin and Wenqiao Zhang and Sijing Li and Yuqian Yuan and Binhe Yu and Haoyuan Li and Wanggui He and Hao Jiang and Mengze Li and Xiaohui Song and Siliang Tang and Jun Xiao and Hui Lin and Yueting Zhuang and Beng Chin Ooi},
  year={2025},
  eprint={2502.09838},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.09838},
}
```