Preprint Article. This version is not peer-reviewed.

FineRegion-LM: Enhancing Large Vision-Language Models for Fine-Grained Region-Level Understanding

Submitted: 16 December 2024. Posted: 16 December 2024.


Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable success in vision-language tasks, yet they often fall short in fine-grained region-level understanding due to limited spatial sensitivity and insufficient region-specific annotations. To address these challenges, we propose FineRegion-LM, a generative model that enhances LVLMs' capabilities in region comprehension through a novel dual-stage framework. Our approach utilizes dynamic region masking to refine spatial focus and adaptive prompt-based learning for contextual generation. Extensive experiments on benchmark datasets demonstrate that FineRegion-LM significantly outperforms existing methods in region description, object classification, and spatial reasoning tasks. Human evaluations further confirm the effectiveness of our approach in generating accurate and contextually relevant descriptions.

1. Introduction

In recent years, Large Vision-Language Models (LVLMs) have made significant strides in the fields of computer vision and natural language processing by integrating multimodal capabilities to understand images and text jointly. These models, leveraging extensive pre-training on large-scale datasets, have excelled in tasks such as image captioning, visual question answering, and object detection [1,2]. However, despite their impressive performance in holistic image understanding, current LVLMs often struggle with region-specific comprehension. This limitation is particularly problematic for applications that require fine-grained spatial understanding, such as autonomous driving, medical imaging, and scene analysis.
The core challenge in region understanding lies in the model’s ability to precisely focus on specific areas of an image while maintaining a coherent contextual interpretation. Existing models like CLIP and BLIP rely heavily on global image features, which often leads to poor performance in tasks requiring localized region descriptions and spatial reasoning [1,2]. Additionally, most publicly available training datasets contain annotations at a coarse level, limiting the models’ ability to learn nuanced details within image regions. This gap indicates a need for an approach that not only understands the global context of images but also excels at extracting and interpreting information from localized regions.
Motivated by these challenges, our research proposes a novel framework, FineRegion-LM, designed to enhance LVLMs for fine-grained region-level understanding. Our approach is built on the premise that improving spatial sensitivity and leveraging detailed region-specific annotations can significantly boost model performance on complex vision-language tasks. To achieve this, we introduce a dual-stage training strategy. In the first stage, we fine-tune a vision encoder using dynamic region masking to force the model to focus on specific areas. In the second stage, we use a large language model fine-tuned with region-aware prompts to generate detailed captions and responses for localized image regions. This two-phase approach ensures that our model is both spatially attentive and contextually accurate, enabling it to outperform existing methods in tasks that require region-specific understanding.
To evaluate the effectiveness of our approach, we conduct experiments on widely used datasets such as ReferCOCOg and Visual Genome, as well as our own generated dataset with fine-grained region annotations. We assess the performance of our model using standard metrics such as accuracy, mean Intersection over Union (mIoU), and F1 score for region description, object detection, and spatial reasoning tasks. Our method demonstrates significant improvements over baseline models, with an average performance gain of over 5% on region-specific tasks, indicating its effectiveness in capturing fine-grained visual details. Our main contributions are summarized as follows:
  • We introduce a novel dual-stage training framework for LVLMs that enhances spatial sensitivity using dynamic region masking and prompt-based fine-tuning.
  • Our approach leverages adaptive region-aware prompts to guide large language models in generating precise and contextually relevant descriptions for localized regions.
  • We demonstrate significant improvements in region understanding on multiple datasets, achieving state-of-the-art results in fine-grained tasks such as region description and spatial reasoning.

2. Related Work

2.1. Large Vision-Language Models

Large Vision-Language Models (LVLMs) have gained significant traction in recent years due to their ability to jointly understand and generate content across visual and textual modalities [3,4]. These models leverage large-scale pre-training on image-text pairs to excel in various multimodal tasks such as image captioning, visual question answering, and object recognition [5]. Despite their impressive capabilities, challenges remain in fine-grained region understanding and efficient scaling for complex real-world applications.
Recent advancements in LVLMs include the development of methods to enhance their spatial sensitivity and region-level comprehension. For instance, TextHawk2 demonstrates superior performance in bilingual Optical Character Recognition (OCR) and grounding tasks while reducing token usage by 16 times through optimized token efficiency strategies [6]. Similarly, MoE-LLaVA employs a mixture of experts approach to balance model performance and computational cost, effectively scaling LVLMs for diverse vision-language applications [7].
Another notable contribution is the introduction of attention prompting techniques. For example, the study by [8] explores the use of visual attention prompts to enhance the alignment between textual inputs and image features. These approaches significantly improve LVLMs’ accuracy in complex multimodal tasks [9,10]. Additionally, CogCoM introduces a chain-of-manipulations method [11] that dives into fine-grained image details, making LVLMs more adept at handling nuanced visual content [12].
Despite these advancements, evaluating the performance of LVLMs remains a challenge. The study [13] critically examines existing evaluation methodologies and proposes new metrics to better assess the effectiveness of LVLMs in real-world scenarios. Moreover, the issue of cross-modality knowledge conflicts, as discussed by [14], highlights the need for strategies to mitigate inconsistencies between visual and textual modalities.
To further enhance LVLMs, self-training techniques have been employed to improve image comprehension. The work by [15] utilizes self-supervised learning to refine model understanding, thereby enhancing performance on visual reasoning tasks. Meanwhile, generalist models like VisionLLM v2 demonstrate the feasibility of creating end-to-end multimodal models capable of handling hundreds of tasks with a unified framework [16].
In summary, while current LVLMs have shown remarkable progress in bridging the gap between vision and language, there are still areas that require improvement, especially in terms of region-level understanding, efficient scaling, and robust evaluation [17,18]. Our proposed method builds upon these advancements by introducing a dual-stage framework that focuses on enhancing fine-grained region comprehension and context-aware text generation.

2.2. Vision Region Understanding

Vision region understanding has become a crucial component in advancing the capabilities of vision-language models, particularly in tasks that require fine-grained spatial reasoning and region-specific comprehension. The ability to focus on specific regions within an image is essential for applications like object detection, scene understanding, and multimodal dialogue systems. Recent advancements in this area have introduced various techniques to enhance the spatial sensitivity and contextual accuracy of models.
One of the recent approaches, RegionGPT, proposes a framework that integrates region-level understanding into vision-language models by leveraging generative pre-training techniques to enhance spatial comprehension [19]. Building on this, SpatialRGPT focuses on improving grounded spatial reasoning capabilities by introducing task-specific prompts, which help models understand spatial relationships more effectively [20]. Similarly, RegionViT uses a regional-to-local attention mechanism to extend the capabilities of vision transformers, allowing them to capture both global and local image features [21].
Another line of work explores optimizing large language models for region-specific tasks. For instance, GPT4RoI applies instruction tuning on regions of interest (RoI), allowing the model to generate more accurate and context-aware descriptions of localized content [22]. Groma introduces a localized visual tokenization approach to improve the grounding of multimodal large language models, facilitating better alignment between image regions and textual inputs [23]. This helps to enhance the model’s performance in tasks involving region-level grounding.
Furthermore, there have been efforts to unify region localization and vision understanding in pre-trained language models [24,25]. The GLIPv2 model integrates object localization with vision-language tasks, achieving better results by jointly optimizing these components [26]. Additionally, work on efficient open-world visual region recognition addresses the challenges of recognizing regions in open-world settings, where the diversity and ambiguity of visual data are significantly higher [27].
In terms of enhancing position-awareness in vision-language models, Position-Enhanced Visual Instruction Tuning has been proposed to incorporate positional information into the tuning process, thus improving the spatial coherence of generated responses [28]. These advancements highlight the ongoing efforts to refine region-level comprehension in vision-language models, making them more robust and versatile for real-world applications.

3. Method

In this section, we present our proposed approach, FineRegion-LM, which is designed as a generative model focused on fine-grained region-level understanding within the context of Large Vision-Language Models (LVLMs). The generative nature of our model allows it to produce detailed region-specific captions and contextual explanations rather than simply classifying or detecting objects. This enables more nuanced and descriptive outputs that are crucial for tasks like region-aware visual question answering and scene interpretation.

3.1. Dual-Stage Framework Overview

FineRegion-LM is a two-stage framework designed to enhance the spatial sensitivity and contextual generation capabilities of LVLMs. The first stage applies dynamic region masking to improve spatial focus, while the second stage uses adaptive prompt-based fine-tuning to generate detailed captions and responses. Let $I \in \mathbb{R}^{H \times W \times 3}$ denote an input image, where $H$ and $W$ represent its height and width, respectively.

3.2. Dynamic Region Masking with Vision Encoder

We begin by extracting visual features from the input image $I$ using a Swin Transformer backbone. The feature map $F \in \mathbb{R}^{h \times w \times d}$ is computed as
$$F = \mathrm{SwinTransformer}(I),$$
where $h$, $w$, and $d$ denote the spatial dimensions and the feature depth, respectively. To enhance region-level focus, we apply a dynamic region masking mechanism. For each region of interest (RoI) $R_i$, we generate a binary mask $M_i$:
$$M_i(x, y) = \begin{cases} 1, & \text{if } (x, y) \in R_i, \\ 0, & \text{otherwise,} \end{cases}$$
and compute the masked feature map $F_i$ for region $R_i$ as
$$F_i = F \odot M_i,$$
where $\odot$ denotes element-wise multiplication. This process forces the model to focus on specific regions while ignoring irrelevant parts of the image.
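To make the masking step concrete, the following is a minimal PyTorch sketch that builds a binary RoI mask on the feature grid and applies the element-wise product $F_i = F \odot M_i$. The helper name, feature dimensions, and box coordinates are illustrative assumptions rather than details of our implementation.

```python
import torch

def region_mask(h: int, w: int, box: tuple[int, int, int, int]) -> torch.Tensor:
    """Binary mask M_i on an h x w feature grid for a rectangular RoI (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    mask = torch.zeros(h, w)
    mask[y1:y2, x1:x2] = 1.0
    return mask

# F: feature map from the vision encoder, shape (h, w, d); a random stand-in here.
h, w, d = 16, 16, 768
F = torch.randn(h, w, d)

# Hypothetical RoI covering the upper-left quadrant of the feature grid.
M_i = region_mask(h, w, box=(0, 0, 8, 8))

# Broadcasting the mask over the channel dimension zeroes out features outside the region.
F_i = F * M_i.unsqueeze(-1)
print(F_i.shape)  # torch.Size([16, 16, 768])
```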

3.3. Adaptive Prompt-Based Fine-Tuning with Language Model

In the second stage, we integrate the region-specific features with a generative language model (e.g., T5 or GPT-4) to generate descriptive captions. Let $z_i$ be the feature vector obtained by pooling $F_i$:
$$z_i = \mathrm{AdaptivePooling}(F_i).$$
The feature vector $z_i$ is then used to condition the language model. We employ an adaptive prompt $P$ based on the context of the region:
$$T_i = \mathrm{Prompt}(z_i, P),$$
where $T_i$ represents the region-conditioned textual input for region $R_i$. The language model then generates the final caption $C_i$:
$$C_i = \mathrm{LanguageModel}(T_i).$$
By using adaptive prompts, the language model is guided to generate more accurate and contextually relevant descriptions.
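As a sketch of this stage, the snippet below pools a masked feature map into a region vector $z_i$ with adaptive average pooling and assembles a region-aware prompt. The prompt template, the region tag format, and the helper names are hypothetical, and the projection of $z_i$ into the language model's embedding space is only noted in a comment, since that step depends on the chosen generator.

```python
import torch
import torch.nn.functional as F_nn

def pool_region_features(masked_features: torch.Tensor) -> torch.Tensor:
    """Adaptive average pooling of a masked feature map of shape (h, w, d) into a d-dimensional z_i."""
    x = masked_features.permute(2, 0, 1).unsqueeze(0)   # (1, d, h, w) layout expected by the pooling op
    return F_nn.adaptive_avg_pool2d(x, output_size=1).flatten()

def build_region_prompt(region_tag: str, task_hint: str) -> str:
    """Assemble an adaptive, region-aware prompt; the wording of the template is an assumption."""
    return f"<region>{region_tag}</region> Describe this region in detail. Focus: {task_hint}."

z_i = pool_region_features(torch.randn(16, 16, 768))    # region vector, shape (768,)
prompt = build_region_prompt("region_3", "objects and their spatial relations")
# In the full model, z_i would be projected into the language model's embedding space and
# prepended to the tokenized prompt before generation; that conditioning step is omitted here.
print(z_i.shape, prompt)
```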

3.4. Training

The training objective of FineRegion-LM involves optimizing a joint loss function to ensure both accurate region recognition and high-quality text generation. The overall loss $\mathcal{L}$ is defined as:
$$\mathcal{L} = \mathcal{L}_{\text{mask}} + \lambda_1 \mathcal{L}_{\text{caption}} + \lambda_2 \mathcal{L}_{\text{consistency}},$$
where:
  • $\mathcal{L}_{\text{mask}}$: ensures that the model focuses on the correct regions by penalizing incorrect attention maps.
  • $\mathcal{L}_{\text{caption}}$: a cross-entropy loss between the generated and ground-truth captions.
  • $\mathcal{L}_{\text{consistency}}$: a self-supervised consistency loss that encourages similar regions to produce consistent descriptions.
For the captioning loss, let $C^*$ denote the ground-truth caption for region $R_i$. The captioning loss is given by:
$$\mathcal{L}_{\text{caption}} = -\sum_{t=1}^{T} \log p\left(C^*_t \mid z_i, T_{i,1:t-1}\right),$$
where $T$ is the length of the caption and $p(\cdot)$ represents the probability predicted by the model.
The consistency loss encourages consistent outputs for regions with overlapping content:
$$\mathcal{L}_{\text{consistency}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \mathbb{I}\left(R_i \cap R_j \neq \varnothing\right) \cdot \lVert C_i - C_j \rVert_2^2,$$
where $\mathbb{I}(\cdot)$ is an indicator function and $N$ is the number of regions in the image.
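As a rough illustration of how this joint objective could be assembled, the sketch below combines a binary-cross-entropy mask loss, a token-level cross-entropy captioning loss, and a consistency penalty computed on caption embeddings of overlapping regions. The loss weights, the tensor shapes, and the use of caption embeddings rather than raw caption strings in the consistency term are assumptions made for the sake of a runnable example.

```python
import torch
import torch.nn.functional as F_nn

def joint_loss(pred_attn, gt_mask, caption_logits, caption_ids,
               caption_embeds, overlap, lam1=1.0, lam2=0.1):
    """Sketch of L = L_mask + lam1 * L_caption + lam2 * L_consistency (weights are assumptions)."""
    # L_mask: penalize attention maps that deviate from the ground-truth region masks.
    l_mask = F_nn.binary_cross_entropy(pred_attn, gt_mask)

    # L_caption: token-level cross-entropy between predicted logits and ground-truth caption ids.
    l_caption = F_nn.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)), caption_ids.reshape(-1))

    # L_consistency: squared L2 distance between caption embeddings of overlapping region pairs.
    diff = caption_embeds.unsqueeze(0) - caption_embeds.unsqueeze(1)   # (N, N, d)
    pair_dist = (diff ** 2).sum(-1)                                    # (N, N)
    l_consistency = (overlap.float() * pair_dist).sum() / caption_embeds.size(0)

    return l_mask + lam1 * l_caption + lam2 * l_consistency

# Toy shapes: 4 regions, captions of length 12 over a 1000-token vocabulary, 256-d embeddings;
# the overlap indicator matrix is a trivial identity here, purely for demonstration.
loss = joint_loss(
    pred_attn=torch.rand(4, 16, 16),
    gt_mask=torch.randint(0, 2, (4, 16, 16)).float(),
    caption_logits=torch.randn(4, 12, 1000),
    caption_ids=torch.randint(0, 1000, (4, 12)),
    caption_embeds=torch.randn(4, 256),
    overlap=torch.eye(4, dtype=torch.bool))
print(loss.item())
```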

3.5. Inference

During inference, given an input image $I$, the model first extracts region-specific features using the vision encoder with dynamic masking. Then, it generates descriptive captions using the fine-tuned language model. The final output consists of a set of captions $\{C_i\}$, each corresponding to a specific region in the image, providing a detailed and context-aware understanding of the visual content.
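The following self-contained sketch summarizes this pipeline: each RoI is masked on the feature grid, pooled into a region vector, and passed to a placeholder decoding step that stands in for the fine-tuned language model. Shapes, box coordinates, and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F_nn

def describe_regions(features: torch.Tensor, boxes: list) -> list:
    """Inference sketch: mask each RoI, pool it, and emit one caption per region.
    The decode step is a placeholder; a real system would condition the fine-tuned
    language model on the region vector and a region-aware prompt."""
    h, w, _ = features.shape
    captions = []
    for k, (x1, y1, x2, y2) in enumerate(boxes):
        mask = torch.zeros(h, w)
        mask[y1:y2, x1:x2] = 1.0                                   # dynamic region mask M_i
        masked = features * mask.unsqueeze(-1)                     # F_i = F (element-wise) M_i
        z_i = F_nn.adaptive_avg_pool2d(
            masked.permute(2, 0, 1).unsqueeze(0), 1).flatten()     # pooled region vector z_i
        # Placeholder decode: stands in for language-model generation conditioned on z_i.
        captions.append(f"region_{k}: caption conditioned on a {z_i.numel()}-d region vector")
    return captions

print(describe_regions(torch.randn(16, 16, 768), boxes=[(0, 0, 8, 8), (8, 8, 16, 16)]))
```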

4. Experiments

In this section, we present an extensive evaluation of our proposed FineRegion-LM method against several state-of-the-art models in the domain of region-level vision-language tasks. Our experiments involve quantitative analysis, ablation studies, and human evaluations to comprehensively assess the effectiveness of our approach. The results demonstrate that FineRegion-LM significantly outperforms existing methods across multiple metrics, highlighting its superior region understanding capabilities.

4.1. Experimental Setup

We conducted our experiments on three well-known datasets: ReferCOCOg, Visual Genome, and a custom fine-grained region annotation dataset. We evaluate all models on tasks such as region description, object classification, and spatial reasoning. The models we compare against include:
  • CLIP-based Model: A strong baseline model that uses global image features.
  • BLIP: A transformer-based model designed for vision-language pre-training.
  • Region-VLP: A model focusing on region-aware pre-training.

4.2. Quantitative Results

We report the accuracy, F1 score, and mean Intersection over Union (mIoU) for each method on the ReferCOCOg and Visual Genome datasets. The results, shown in Table 1, indicate that our method achieves the best performance across all metrics.
Our model outperforms the best baseline (Region-VLP) by a margin of 3.9% in accuracy, demonstrating its superior ability to generate accurate and context-aware region descriptions.

4.3. Ablation Study

To validate the effectiveness of the components of FineRegion-LM, we performed an ablation study. We tested three configurations: (1) without dynamic region masking, (2) without adaptive prompts, and (3) the full model. The results are summarized in Table 2.
The results confirm that both dynamic region masking and adaptive prompts are critical components that contribute significantly to the model’s performance.

4.4. Human Evaluation

We conducted a human evaluation study to further assess the quality of the captions generated by FineRegion-LM compared to other models. Human evaluators rated the outputs based on three criteria: accuracy, relevance, and fluency, using a 5-point Likert scale (1: Poor, 5: Excellent). The results are shown in Table 3.
The human evaluation results indicate that our model produces more accurate, relevant, and fluent descriptions compared to the baseline models, further validating its effectiveness in real-world scenarios.

4.5. Analysis and Discussion

The experimental results demonstrate that FineRegion-LM consistently outperforms existing models across multiple datasets and evaluation metrics. The combination of dynamic region masking and adaptive prompts contributes to the model’s enhanced capability to focus on relevant regions while generating contextually accurate captions. This results in both quantitative improvements and better qualitative feedback from human evaluators. These findings confirm the robustness and generalization capability of our method, making it a strong candidate for tasks requiring fine-grained region-level understanding.

5. Conclusion

In this work, we introduced FineRegion-LM, a novel framework aimed at advancing the region-level understanding capabilities of Large Vision-Language Models. By employing a dual-stage approach combining dynamic region masking and adaptive prompt-based fine-tuning, our model effectively overcomes the limitations of existing methods in capturing fine-grained spatial details. Our comprehensive experiments on multiple datasets confirm that FineRegion-LM not only achieves superior quantitative performance but also excels in qualitative human assessments. The ablation studies further validate the importance of our proposed components. Future work will explore extending this framework to more complex multi-modal scenarios and incorporating self-supervised learning to further enhance its robustness.

References

  1. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; Sutskever, I. Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event; Meila, M.; Zhang, T., Eds.; PMLR, 2021; Vol. 139, Proceedings of Machine Learning Research, pp. 8748–8763.
  2. Li, J.; Li, D.; Xiong, C.; Hoi, S.C.H. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. International Conference on Machine Learning, ICML 2022, 17–23 July 2022, Baltimore, Maryland, USA; Chaudhuri, K.; Jegelka, S.; Song, L.; Szepesvári, C.; Niu, G.; Sabato, S., Eds.; PMLR, 2022; Vol. 162, Proceedings of Machine Learning Research, pp. 12888–12900.
  3. Zhou, Y.; Shen, T.; Geng, X.; Tao, C.; Xu, C.; Long, G.; Jiao, B.; Jiang, D. Towards Robust Ranker for Text Retrieval. Findings of the Association for Computational Linguistics: ACL 2023, 2023; pp. 5387–5401.
  4. Zhou, Y.; Shen, T.; Geng, X.; Tao, C.; Shen, J.; Long, G.; Xu, C.; Jiang, D. Fine-Grained Distillation for Long Document Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 2024; Vol. 38, pp. 19732–19740.
  5. Zhou, Y.; Long, G. Style-Aware Contrastive Learning for Multi-Style Image Captioning. Findings of the Association for Computational Linguistics: EACL 2023, 2023; pp. 2257–2267.
  6. Yu, Y.Q.; Liao, M.; Zhang, J.; Wu, J. TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens. arXiv preprint arXiv:2410.05261, 2024.
  7. Lin, B.; Tang, Z.; Ye, Y.; Cui, J.; Zhu, B.; Jin, P.; Zhang, J.; Ning, M.; Yuan, L. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models. CoRR 2024, abs/2401.15947.
  8. Yu, R.; Yu, W.; Wang, X. Attention Prompting on Image for Large Vision-Language Models. Computer Vision – ECCV 2024, 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXX; Leonardis, A.; Ricci, E.; Roth, S.; Russakovsky, O.; Sattler, T.; Varol, G., Eds.; Springer, 2024; Vol. 15088, Lecture Notes in Computer Science, pp. 251–268.
  9. Zhou, Y.; Li, X.; Wang, Q.; Shen, J. Visual In-Context Learning for Large Vision-Language Models. Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand and virtual meeting, August 11–16, 2024; Association for Computational Linguistics, 2024; pp. 15890–15902.
  10. Zhou, Y.; Rao, Z.; Wan, J.; Shen, J. Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models. arXiv preprint arXiv:2410.19732, 2024.
  11. Zhou, Y.; Geng, X.; Shen, T.; Tao, C.; Long, G.; Lou, J.G.; Shen, J. Thread of Thought Unraveling Chaotic Contexts. arXiv preprint arXiv:2311.08734, 2023.
  12. Qi, J.; Ding, M.; Wang, W.; Bai, Y.; Lv, Q.; Hong, W.; Xu, B.; Hou, L.; Li, J.; Dong, Y.; Tang, J. CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations. CoRR 2024, abs/2402.04236.
  13. Chen, L.; Li, J.; Dong, X.; Zhang, P.; Zang, Y.; Chen, Z.; Duan, H.; Wang, J.; Qiao, Y.; Lin, D.; Zhao, F. Are We on the Right Way for Evaluating Large Vision-Language Models? CoRR 2024, abs/2403.20330.
  14. Zhu, T.; Liu, Q.; Wang, F.; Tu, Z.; Chen, M. Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models. CoRR 2024, abs/2410.03659.
  15. Deng, Y.; Lu, P.; Yin, F.; Hu, Z.; Shen, S.; Zou, J.; Chang, K.; Wang, W. Enhancing Large Vision Language Models with Self-Training on Image Comprehension. CoRR 2024, abs/2405.19716.
  16. Wu, J.; Zhong, M.; Xing, S.; Lai, Z.; Liu, Z.; Wang, W.; Chen, Z.; Zhu, X.; Lu, L.; Lu, T.; Luo, P.; Qiao, Y.; Dai, J. VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks. CoRR 2024, abs/2406.08394.
  17. Zhou, Y.; Long, G. Improving Cross-modal Alignment for Text-Guided Image Inpainting. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023; pp. 3445–3456.
  18. Zhou, Y.; Long, G. Multimodal Event Transformer for Image-guided Story Ending Generation. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023; pp. 3434–3444.
  19. Guo, Q.; Mello, S.D.; Yin, H.; Byeon, W.; Cheung, K.C.; Yu, Y.; Luo, P.; Liu, S. RegionGPT: Towards Region Understanding Vision Language Model. IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16–22, 2024; IEEE, 2024; pp. 13796–13806.
  20. Cheng, A.; Yin, H.; Fu, Y.; Guo, Q.; Yang, R.; Kautz, J.; Wang, X.; Liu, S. SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model. CoRR 2024, abs/2406.01584.
  21. Chen, C.; Panda, R.; Fan, Q. RegionViT: Regional-to-Local Attention for Vision Transformers. The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25–29, 2022; OpenReview.net, 2022.
  22. Zhang, S.; Sun, P.; Chen, S.; Xiao, M.; Shao, W.; Zhang, W.; Liu, Y.; Chen, K.; Luo, P. GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest. arXiv preprint arXiv:2307.03601, 2023.
  23. Ma, C.; Jiang, Y.; Wu, J.; Yuan, Z.; Qi, X. Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models. Computer Vision – ECCV 2024, 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part VI; Leonardis, A.; Ricci, E.; Roth, S.; Russakovsky, O.; Sattler, T.; Varol, G., Eds.; Springer, 2024; Vol. 15064, Lecture Notes in Computer Science, pp. 417–435.
  24. Zhou, Y.; Geng, X.; Shen, T.; Zhang, W.; Jiang, D. Improving Zero-Shot Cross-Lingual Transfer for Multilingual Question Answering over Knowledge Graph. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021; pp. 5822–5834.
  25. Zhou, Y.; Geng, X.; Shen, T.; Pei, J.; Zhang, W.; Jiang, D. Modeling Event-Pair Relations in External Knowledge Graphs for Script Reasoning. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021.
  26. Zhang, H.; Zhang, P.; Hu, X.; Chen, Y.; Li, L.H.; Dai, X.; Wang, L.; Yuan, L.; Hwang, J.; Gao, J. GLIPv2: Unifying Localization and Vision-Language Understanding. Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, November 28–December 9, 2022; Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; Oh, A., Eds.; 2022.
  27. Bendale, A.; Boult, T. Towards Open World Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015; pp. 1893–1902.
  28. Chen, C.; Qin, R.; Luo, F.; Mi, X.; Li, P.; Sun, M.; Liu, Y. Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models. CoRR 2023, abs/2308.13437.
Table 1. Quantitative comparison of FineRegion-LM with existing methods.

| Model | Accuracy (%) | F1 Score | mIoU |
|---|---|---|---|
| CLIP-based Model | 74.5 | 0.68 | 0.55 |
| BLIP | 76.3 | 0.72 | 0.57 |
| Region-VLP | 78.2 | 0.75 | 0.60 |
| FineRegion-LM (Ours) | 82.1 | 0.79 | 0.64 |
Table 2. Ablation study results on the ReferCOCOg dataset.

| Model Variation | Accuracy (%) | F1 Score | mIoU |
|---|---|---|---|
| No Dynamic Masking | 78.3 | 0.74 | 0.61 |
| No Adaptive Prompts | 79.0 | 0.76 | 0.62 |
| Full Model (FineRegion-LM) | 82.1 | 0.79 | 0.64 |
Table 3. Human evaluation of region-specific caption generation. Ratings are averaged across multiple evaluators.

| Model | Accuracy | Relevance | Fluency |
|---|---|---|---|
| CLIP-based Model | 3.8 | 3.7 | 4.0 |
| BLIP | 4.0 | 3.9 | 4.2 |
| Region-VLP | 4.3 | 4.2 | 4.4 |
| FineRegion-LM (Ours) | 4.7 | 4.5 | 4.8 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free downloading, distribution, and reuse, provided that the author and preprint are cited in any reuse.