AUEB NLP Group at ImageCLEFmedical Caption 2025
Anna Chatzipapadopoulou, Ippokratis Pantelidis, Foivos Charalampakos, and 5 more authors
In Conference and Labs of the Evaluation Forum (Working Notes), Madrid, Spain, Sep 2025
This article presents the methodology and results of the AUEB NLP Group's and Archimedes Unit's participation in the 9th edition of the ImageCLEFmedical Caption evaluation campaign, addressing the Concept Detection, Caption Prediction, and Explainability tasks. The Concept Detection task involves automatically associating biomedical images with relevant medical concepts, while the Caption Prediction task focuses on generating clinically meaningful diagnostic captions from the content of these images. Building upon our previous work, we experimented extensively with image encoders based on Convolutional Neural Networks (CNNs) combined with Feed-Forward Neural Network (FFNN) classifiers and ensemble approaches. To improve robustness and generalization, we developed diverse ensemble strategies that combine predictions across multiple architectures. Additionally, we applied a per-label thresholding method during inference, allowing the system to tune the decision boundary for each concept individually. For the Caption Prediction task, we used InstructBLIP as the backbone of our pipeline to generate initial captions, which we then refined using a series of enhancement strategies: a retrieval-augmented Synthesizer that incorporates information from similar training images, a Multisynthesizer that additionally integrates concept predictions, and LM-Fuser, a lightweight model trained to combine multiple caption hypotheses. Furthermore, we applied the Distance from Median Maximum Concept Similarity (DMMCS) method to guide decoding toward concept-aware captions and used MedCLIP-based re-ranking to further improve visual-textual alignment. We also experimented with reinforcement learning via a mixed training objective that combines cross-entropy and task-specific rewards. For the Explainability task, we generated visual explanations by identifying and localizing key medical entities in the images using a structured prompting approach with GPT-4o.
This involved extracting medical terms from generated captions and drawing bounding boxes to connect these terms to visual regions, thereby enhancing clinical decision transparency. Overall, our group ranked 1st in Concept Detection, 5th in Caption Prediction, and 1st in the Explainability task.
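The per-label thresholding mentioned above can be illustrated with a minimal sketch: for each concept, scan a grid of candidate thresholds and keep the one that maximizes F1 on a held-out set. The function name `tune_per_label_thresholds` and the fixed threshold grid are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def tune_per_label_thresholds(probs, labels, grid=np.linspace(0.05, 0.95, 19)):
    """For each concept, pick the decision threshold that maximizes F1
    on a held-out set. `probs` and `labels` are (n_samples, n_concepts)."""
    n_concepts = probs.shape[1]
    thresholds = np.full(n_concepts, 0.5)
    for c in range(n_concepts):
        best_f1 = -1.0
        for t in grid:
            pred = probs[:, c] >= t
            tp = np.sum(pred & (labels[:, c] == 1))
            fp = np.sum(pred & (labels[:, c] == 0))
            fn = np.sum(~pred & (labels[:, c] == 1))
            denom = 2 * tp + fp + fn
            f1 = 2 * tp / denom if denom > 0 else 0.0
            if f1 > best_f1:
                best_f1, thresholds[c] = f1, t
    return thresholds
```

At inference time, a concept is then assigned whenever its predicted probability exceeds its own tuned threshold, rather than a single global cutoff.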
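The similarity-based re-ranking of candidate captions can likewise be sketched in a few lines, assuming image and caption embeddings from a shared encoder such as MedCLIP; here plain NumPy vectors stand in for real embeddings, and `rerank_captions` is a hypothetical name rather than the paper's code.

```python
import numpy as np

def rerank_captions(image_emb, caption_embs):
    """Rank candidate captions by cosine similarity to the image embedding.
    `image_emb` is (d,), `caption_embs` is (n_candidates, d)."""
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    scores = caps @ img          # cosine similarity per candidate
    order = np.argsort(-scores)  # indices of candidates, best first
    return order, scores
```

The best-aligned caption (`order[0]`) would be kept as the final prediction.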