AUEB NLP Group at ImageCLEFmedical Caption 2025
Anna Chatzipapadopoulou, Ippokratis Pantelidis, Foivos Charalampakos, and 5 more authors
In Conference and Labs of the Evaluation Forum (Working Notes), Madrid, Spain, Sep 2025
This article presents the methodology and results of the AUEB NLP Group's and Archimedes Unit's participation in the 9th edition of the ImageCLEFmedical Caption evaluation campaign, addressing the Concept Detection, Caption Prediction, and Explainability tasks. The Concept Detection task involves automatically associating biomedical images with relevant medical concepts, while the Caption Prediction task focuses on generating clinically meaningful diagnostic captions from the content of these images. Building upon our previous work, we experimented extensively with image encoders based on Convolutional Neural Networks (CNNs) combined with Feed-Forward Neural Network (FFNN) classifiers and ensemble approaches. To improve robustness and generalization, we developed diverse ensemble strategies that combine predictions across multiple architectures. Additionally, we applied a per-label thresholding method during inference, allowing the system to tune the decision boundary for each concept individually. For the Caption Prediction task, we used InstructBLIP as the backbone of our pipeline to generate initial captions, which we then refined using a series of enhancement strategies: a retrieval-augmented Synthesizer that incorporates information from similar training images, a Multisynthesizer that additionally integrates concept predictions, and LM-Fuser, a lightweight model trained to combine multiple caption hypotheses. Furthermore, we applied the Distance from Median Maximum Concept Similarity (DMMCS) method to guide decoding toward concept-aware captions and used MedCLIP-based re-ranking to further improve visual-textual alignment. We also experimented with reinforcement learning via a mixed training objective that combines cross-entropy and task-specific rewards. For the Explainability task, we generated visual explanations by identifying and localizing key medical entities in the images using a structured prompting approach with GPT-4o.
This involved extracting medical terms from generated captions and drawing bounding boxes to connect these terms to visual regions, thereby enhancing clinical decision transparency. Overall, our group ranked 1st in Concept Detection, 5th in Caption Prediction, and 1st in the Explainability task.
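The per-label thresholding mentioned above can be illustrated with a minimal sketch: for each concept, scan a grid of candidate thresholds and keep the one that maximizes F1 on a held-out set. The function name `tune_per_label_thresholds` and the fixed threshold grid are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def tune_per_label_thresholds(probs, labels, grid=np.linspace(0.05, 0.95, 19)):
    """For each concept, pick the decision threshold that maximizes F1
    on a held-out set. `probs` and `labels` are (n_samples, n_concepts)."""
    n_concepts = probs.shape[1]
    thresholds = np.full(n_concepts, 0.5)
    for c in range(n_concepts):
        best_f1 = -1.0
        for t in grid:
            pred = probs[:, c] >= t
            tp = np.sum(pred & (labels[:, c] == 1))
            fp = np.sum(pred & (labels[:, c] == 0))
            fn = np.sum(~pred & (labels[:, c] == 1))
            denom = 2 * tp + fp + fn
            f1 = 2 * tp / denom if denom > 0 else 0.0
            if f1 > best_f1:
                best_f1, thresholds[c] = f1, t
    return thresholds
```

At inference time, a concept is then assigned whenever its predicted probability exceeds its own tuned threshold, rather than a single global cutoff.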
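The similarity-based re-ranking of candidate captions can likewise be sketched in a few lines, assuming image and caption embeddings from a shared encoder such as MedCLIP; here plain NumPy vectors stand in for real embeddings, and `rerank_captions` is a hypothetical name rather than the paper's code.

```python
import numpy as np

def rerank_captions(image_emb, caption_embs):
    """Rank candidate captions by cosine similarity to the image embedding.
    `image_emb` is (d,), `caption_embs` is (n_candidates, d)."""
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    scores = caps @ img          # cosine similarity per candidate
    order = np.argsort(-scores)  # indices of candidates, best first
    return order, scores
```

The best-aligned caption (`order[0]`) would be kept as the final prediction.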