Article
Version 1
Preserved in Portico This version is not peer-reviewed
Text-to-Image Segmentation with Open-Vocabulary and Multitasking
Received: 8 April 2024 / Approved: 9 April 2024 / Online: 9 April 2024 (11:43:57 CEST)
How to cite: Pan, L.; Yang, Y.; Wang, Z.; Zhang, R. Text-to-Image Segmentation with Open-Vocabulary and Multitasking. Preprints 2024, 2024040631. https://doi.org/10.20944/preprints202404.0631.v1
Abstract
Open-vocabulary learning has recently gained prominence as a means to enable image segmentation for arbitrary categories based on textual descriptions. This advancement extends the applicability of segmentation systems to a broader range of general-purpose scenarios. However, current methods often revolve around specialized architectures and parameters tailored to specific segmentation tasks, resulting in a fragmented landscape of segmentation models. In response to these challenges, we introduce OVAMTSeg, a versatile framework for Open-Vocabulary and Multitask Image Segmentation. OVAMTSeg harnesses adaptive prompt learning to enable the model to capture category-sensitive concepts, improving its robustness across diverse tasks and scenarios. Text prompts capture the semantic and contextual features of the text, while cross-attention and cross-modal interactions fuse image and text features. A transformer-based decoder is then used for dense prediction. Extensive experimental results underscore the effectiveness of OVAMTSeg, showing state-of-the-art performance and strong generalization across three segmentation tasks. Notable results include 47.5 mIoU in referring expression segmentation; 51.6 mIoU on Pascal-VOC with four unseen classes and 46.6 mIoU on Pascal-Context in zero-shot segmentation; and 65.9 mIoU on Pascal-5i and 35.7 mIoU on COCO-20i in one-shot segmentation.
Keywords
image segmentation; open vocabulary; multitask; multi-modal interaction
Subject
Computer Science and Mathematics, Computer Vision and Graphics
Copyright: This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.