Submitted: 06 January 2025
Posted: 08 January 2025
Abstract
Keywords:
1. Introduction
- We introduce a new setting for semantic segmentation, open vocabulary domain generalization (OVDG), an important yet unstudied problem. In addition, we propose FreeMix, an effective framework for OVDG that learns a generalized model by integrating entity masks to enhance the diversity and completeness of masks for both base and novel classes.
- We propose a dual-branch universal segmentation module that unifies the base segmentation branch (BSB) and the entity segmentation branch (ESB) in an end-to-end trainable framework, where the BSB leverages a self-supervised pre-trained model, CMID, to extract domain-agnostic visual features for decoding masks and semantic logits (a minimal sketch of this dual-branch decoding follows this list).
- To integrate and leverage information from multiple source domains, we propose a simple yet effective training strategy called dataset-aware sampling (DAS), also sketched after this list. Extensive experiments on four benchmark datasets show that the proposed method outperforms state-of-the-art methods on both the open vocabulary learning (OVL) and OVDG benchmarks.
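Since the paper body is not reproduced on this page, the following is only a minimal PyTorch-style sketch, not the released implementation, of how such a dual-branch module could combine proposals: the BSB and the class-agnostic ESB each emit mask proposals with embeddings, and the pooled proposals are classified against text embeddings so that novel classes can be named at test time. All module and tensor names (`base_branch`, `entity_branch`, `text_emb`, the proposal counts) are illustrative assumptions.

```python
# Minimal sketch (assumed interface, not the authors' code): a dual-branch
# segmentation module that pools mask proposals from a base branch (BSB,
# built on domain-agnostic self-supervised features such as CMID) and an
# entity branch (ESB, class-agnostic entity masks).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchSegmenter(nn.Module):
    def __init__(self, base_branch: nn.Module, entity_branch: nn.Module):
        super().__init__()
        self.base_branch = base_branch      # BSB: masks + mask embeddings
        self.entity_branch = entity_branch  # ESB: class-agnostic entity masks

    def forward(self, image: torch.Tensor, text_emb: torch.Tensor):
        base_masks, base_emb = self.base_branch(image)  # (B, Nb, H, W), (B, Nb, D)
        ent_masks, ent_emb = self.entity_branch(image)  # (B, Ne, H, W), (B, Ne, D)
        # Pooling both proposal sets improves mask diversity and completeness.
        masks = torch.cat([base_masks, ent_masks], dim=1)          # (B, Nb+Ne, H, W)
        emb = F.normalize(torch.cat([base_emb, ent_emb], dim=1), dim=-1)
        text = F.normalize(text_emb, dim=-1)                       # (C, D), one row per class name
        logits = emb @ text.t()                                    # (B, Nb+Ne, C)
        # Combine mask shapes and class scores into a per-pixel semantic map.
        sem = torch.einsum("bnc,bnhw->bchw", logits.softmax(-1), masks.sigmoid())
        return sem                                                 # (B, C, H, W)
```

Because classification happens against text embeddings rather than a fixed classifier head, swapping in new class-name embeddings at test time is all that is required to segment novel categories.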
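The definition of DAS is likewise not reproduced here, so the sketch below only illustrates one plausible reading: each training batch is drawn from a single source dataset chosen by a per-dataset weight, so label spaces and domain statistics never mix within a batch. The square-root-of-size weighting is purely an illustrative choice.

```python
# Plausible sketch of dataset-aware sampling (DAS), under the assumption
# that each batch comes from exactly one source domain.
import random
from torch.utils.data import DataLoader, Dataset

def dataset_aware_batches(datasets: "dict[str, Dataset]", batch_size: int = 8):
    """Yield (dataset_name, batch) pairs, one source domain per batch."""
    loaders = {name: iter(DataLoader(ds, batch_size=batch_size, shuffle=True))
               for name, ds in datasets.items()}
    names = list(datasets)
    # Illustrative weighting: sqrt of dataset size softens source imbalance.
    weights = [len(datasets[n]) ** 0.5 for n in names]
    while True:
        name = random.choices(names, weights=weights, k=1)[0]
        try:
            batch = next(loaders[name])
        except StopIteration:  # restart an exhausted domain
            loaders[name] = iter(DataLoader(datasets[name],
                                            batch_size=batch_size, shuffle=True))
            batch = next(loaders[name])
        yield name, batch
```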
2. Related Works
2.1. Open Vocabulary Semantic Segmentation
2.2. Domain Generalization
2.3. Self-Supervised Learning in Remote Sensing
3. Proposed Method
3.1. Problem Definition
3.2. Overview
3.3. Universal Segmentation Module
3.4. Training Tactics: Dataset-Aware Sampling
4. Experiments
4.1. Experimental Datasets and Processing
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. Comparison with SOTA Methods
4.5. Experiments on Multi-Source Domain
4.6. Ablation Experiments
4.7. Additional Experimental Results
5. Conclusion
Acknowledgments
References
- Rottensteiner, F.; Sohn, G.; Gerke, M.; Wegner, J.D. ISPRS semantic labeling contest. ISPRS: Leopoldshöhe, Germany, 2014, 1, 4.
- Hong, J.; Li, W.; Han, J.; Zheng, J.; Fang, P.; Harandi, M.; Petersson, L. GOSS: Towards generalized open-set semantic segmentation. The Visual Computer 2024, 40, 2391–2404.
- Nunes, I.; Laranjeira, C.; Oliveira, H.; dos Santos, J.A. A systematic review on open-set segmentation. Computers & Graphics 2023.
- Nunes, I.M.; Poggi, M.; Oliveira, H.; Pereira, M.B.; dos Santos, J.A. Deep open-set segmentation in visual learning. In Proceedings of the 2022 35th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI); IEEE, 2022; Volume 1, pp. 314–319.
- Joseph, K.; Khan, S.; Khan, F.S.; Balasubramanian, V.N. Towards open world object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021; pp. 5830–5840.
- Bendale, A.; Boult, T. Towards open world recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015; pp. 1893–1902.
- Yang, J.; Zhou, K.; Li, Y.; Liu, Z. Generalized out-of-distribution detection: A survey. arXiv 2021, arXiv:2110.11334.
- Liu, J.; Shen, Z.; He, Y.; Zhang, X.; Xu, R.; Yu, H.; Cui, P. Towards out-of-distribution generalization: A survey. arXiv 2021, arXiv:2108.13624.
- Zhang, H.; Ding, H. Prototypical matching and open set rejection for zero-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021; pp. 6974–6983.
- He, S.; Ding, H.; Jiang, W. Semantic-promoted debiasing and background disambiguation for zero-shot instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023; pp. 19498–19507.
- Baek, D.; Oh, Y.; Ham, B. Exploiting a joint embedding space for generalized zero-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021; pp. 9536–9545.
- Gu, Z.; Zhou, S.; Niu, L.; Zhao, Z.; Zhang, L. Context-aware feature generation for zero-shot semantic segmentation. In Proceedings of the 28th ACM International Conference on Multimedia; 2020; pp. 1921–1929.
- Zheng, Y.; Wu, J.; Qin, Y.; Zhang, F.; Cui, L. Zero-shot instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021; pp. 2593–2602.
- He, S.; Ding, H.; Jiang, W. Primitive generation and semantic-related alignment for universal zero-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023; pp. 11238–11247.
- Bucher, M.; Vu, T.H.; Cord, M.; Pérez, P. Zero-shot semantic segmentation. Advances in Neural Information Processing Systems 2019, 32.
- Ding, J.; Xue, N.; Xia, G.S.; Dai, D. Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 11583–11592.
- Ma, C.; Yang, Y.; Wang, Y.; Zhang, Y.; Xie, W. Open-vocabulary semantic segmentation with frozen vision-language models. arXiv 2022, arXiv:2210.15138.
- Chen, X.; Li, S.; Lim, S.N.; Torralba, A.; Zhao, H. Open-vocabulary panoptic segmentation with embedding modulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023; pp. 1141–1150.
- Ding, Z.; Wang, J.; Tu, Z. Open-vocabulary panoptic segmentation with MaskCLIP. arXiv 2022, arXiv:2208.08984.
- Ghiasi, G.; Gu, X.; Cui, Y.; Lin, T.Y. Scaling open-vocabulary image segmentation with image-level labels. In Proceedings of the European Conference on Computer Vision; Springer, 2022; pp. 540–557.
- Zhou, C.; Loy, C.C.; Dai, B. Extract free dense labels from CLIP. In Proceedings of the European Conference on Computer Vision; Springer, 2022; pp. 696–712.
- Huynh, D.; Kuen, J.; Lin, Z.; Gu, J.; Elhamifar, E. Open-vocabulary instance segmentation via robust cross-modal pseudo-labeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 7020–7031.
- Liang, F.; Wu, B.; Dai, X.; Li, K.; Zhao, Y.; Zhang, H.; Zhang, P.; Vajda, P.; Marculescu, D. Open-vocabulary semantic segmentation with mask-adapted CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023; pp. 7061–7070.
- Qin, J.; Wu, J.; Yan, P.; Li, M.; Yuxi, R.; Xiao, X.; Wang, Y.; Wang, R.; Wen, S.; Pan, X.; et al. FreeSeg: Unified, universal and open-vocabulary image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023; pp. 19446–19455.
- Ren, S.; Zhang, A.; Zhu, Y.; Zhang, S.; Zheng, S.; Li, M.; Smola, A.J.; Sun, X. Prompt pre-training with twenty-thousand classes for open-vocabulary visual recognition. Advances in Neural Information Processing Systems 2024, 36.
- Zhang, H.; Li, F.; Zou, X.; Liu, S.; Li, C.; Yang, J.; Zhang, L. A simple framework for open-vocabulary segmentation and detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023; pp. 1020–1031.
- Wu, J.; Li, X.; Xu, S.; Yuan, H.; Ding, H.; Yang, Y.; Li, X.; Zhang, J.; Tong, Y.; Jiang, X.; et al. Towards open vocabulary learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Ding, H.; Cohen, S.; Price, B.; Jiang, X. PhraseClick: Toward achieving flexible interactive segmentation by phrase and click. In Proceedings of Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Part III; Springer, 2020; pp. 417–435.
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 2013, 26.
- Zhu, C.; Chen, L. A survey on open-vocabulary detection and segmentation: Past, present, and future. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024.
- You, K.; Long, M.; Cao, Z.; Wang, J.; Jordan, M.I. Universal domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019; pp. 2720–2729.
- Saito, K.; Kim, D.; Sclaroff, S.; Saenko, K. Universal domain adaptation through self supervision. Advances in Neural Information Processing Systems 2020, 33, 16282–16292.
- Kundu, J.N.; Venkat, N.; Babu, R.V.; et al. Universal source-free domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020; pp. 4544–4553.
- Li, R.; Zheng, S.; Duan, C.; Su, J.; Zhang, C. Multistage attention ResU-Net for semantic segmentation of fine-resolution remote sensing images. IEEE Geoscience and Remote Sensing Letters 2021, 19, 1–5.
- Niu, X.; Zeng, Q.; Luo, X.; Chen, L. FCAU-Net for the semantic segmentation of fine-resolution remotely sensed images. Remote Sensing 2022, 14, 215.
- Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer meets convolution: A bilateral awareness network for semantic segmentation of very fine resolution urban scene images. Remote Sensing 2021, 13, 3065.
- He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Transactions on Geoscience and Remote Sensing 2022, 60, 1–15.
- Gui, R.; Xu, X.; Wang, L.; Yang, R.; Pu, F. A generalized zero-shot learning framework for PolSAR land cover classification. Remote Sensing 2018, 10, 1307.
- Jia, X.; Khandelwal, A.; Nayak, G.; Gerber, J.; Carlson, K.; West, P.; Kumar, V. Incremental dual-memory LSTM in land cover prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2017; pp. 867–876.
- Li, A.; Lu, Z.; Wang, L.; Xiang, T.; Wen, J.R. Zero-shot scene classification for high spatial resolution remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 2017, 55, 4157–4167.
- Sumbul, G.; Cinbis, R.G.; Aksoy, S. Fine-grained object recognition and zero-shot learning in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 2017, 56, 770–779.
- Luo, C.; Li, Z.; Huang, K.; Feng, J.; Wang, M. Zero-shot learning via attribute regression and class prototype rectification. IEEE Transactions on Image Processing 2017, 27, 637–648.
- Long, Y.; Shao, L. Describing unseen classes by exemplars: Zero-shot learning using grouped simile ensemble. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV); IEEE, 2017; pp. 907–915.
- Tsai, Y.H.; Hung, W.C.; Schulter, S.; Sohn, K.; Yang, M.H.; Chandraker, M. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018; pp. 7472–7481.
- Zheng, Z.; Yang, Y. Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. International Journal of Computer Vision 2021, 129, 1106–1120.
- Muhtar, D.; Zhang, X.; Xiao, P.; Li, Z.; Gu, F. CMID: A unified self-supervised learning framework for remote sensing image understanding. IEEE Transactions on Geoscience and Remote Sensing 2023.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning; PMLR, 2021; pp. 8748–8763.
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning; PMLR, 2021; pp. 4904–4916.
- Chen, Y.; Bruzzone, L. Toward open-world semantic segmentation of remote sensing images. In Proceedings of the IGARSS 2023 IEEE International Geoscience and Remote Sensing Symposium; IEEE, 2023; pp. 5045–5048.
- Xu, M.; Zhang, Z.; Wei, F.; Lin, Y.; Cao, Y.; Hu, H.; Bai, X. A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In Proceedings of the European Conference on Computer Vision; Springer, 2022; pp. 736–753.
- Xu, M.; Zhang, Z.; Wei, F.; Hu, H.; Bai, X. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023; pp. 2945–2954.
- Zhang, Z.; Zhao, T.; Guo, Y.; Yin, J. RS5M: A large scale vision-language dataset for remote sensing vision-language foundation model. arXiv 2023, arXiv:2306.11300.
- Wang, Z.; Prabha, R.; Huang, T.; Wu, J.; Rajagopal, R. SkyScript: A large and semantically diverse vision-language dataset for remote sensing. In Proceedings of the AAAI Conference on Artificial Intelligence; 2024; Volume 38, pp. 5805–5813.
- Zhang, W.; Cai, M.; Zhang, T.; Zhuang, Y.; Mao, X. EarthGPT: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain. arXiv 2024, arXiv:2401.16822.
- Mall, U.; Phoo, C.P.; Liu, M.K.; Vondrick, C.; Hariharan, B.; Bala, K. Remote sensing vision-language foundation models without annotations via ground remote alignment. arXiv 2023, arXiv:2312.06960.
- Liang, C.; Li, W.; Dong, Y.; Fu, W. Single domain generalization method for remote sensing image segmentation via category consistency on domain randomization. IEEE Transactions on Geoscience and Remote Sensing 2024.
- Wang, M.; Liu, J.; Luo, G.; Wang, S.; Wang, W.; Lan, L.; Wang, Y.; Nie, F. Smooth-guided implicit data augmentation for domain generalization. IEEE Transactions on Neural Networks and Learning Systems 2024.
- Iizuka, R.; Xia, J.; Yokoya, N. Frequency-based optimal style mix for domain generalization in semantic segmentation of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 2023.
- Zheng, J.; Wu, W.; Yuan, S.; Fu, H.; Li, W.; Yu, L. Multisource-domain generalization-based oil palm tree detection using very-high-resolution (VHR) satellite images. IEEE Geoscience and Remote Sensing Letters 2021, 19, 1–5.
- Zhang, Y.; Zhang, M.; Li, W.; Wang, S.; Tao, R. Language-aware domain generalization network for cross-scene hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 2023, 61, 1–12.
- Li, D.; Yang, Y.; Song, Y.Z.; Hospedales, T. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence; 2018; Volume 32.
- Balaji, Y.; Sankaranarayanan, S.; Chellappa, R. MetaReg: Towards domain generalization using meta-regularization. Advances in Neural Information Processing Systems 2018, 31.
- Li, Y.; Yang, Y.; Zhou, W.; Hospedales, T. Feature-critic networks for heterogeneous domain generalization. In Proceedings of the International Conference on Machine Learning; PMLR, 2019; pp. 3915–3924.
- Shankar, S.; Piratla, V.; Chakrabarti, S.; Chaudhuri, S.; Jyothi, P.; Sarawagi, S. Generalizing across domains via cross-gradient training. arXiv 2018, arXiv:1804.10745.
- Wang, Y.; Li, H.; Kot, A.C. Heterogeneous domain generalization via domain mixup. In Proceedings of the ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE, 2020; pp. 3622–3626.
- Shu, Y.; Cao, Z.; Wang, C.; Wang, J.; Long, M. Open domain generalization with domain-augmented meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021; pp. 9624–9633.
- Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412.
- Segu, M.; Tonioni, A.; Tombari, F. Batch normalization embeddings for deep domain generalization. Pattern Recognition 2023, 135, 109115.
- Bhattacharya, A.; Singha, M.; Jha, A.; Banerjee, B. C-SAW: Self-supervised prompt learning for image generalization in remote sensing. In Proceedings of the Fourteenth Indian Conference on Computer Vision, Graphics and Image Processing; 2023; pp. 1–10.
- Kang, J.; Fernandez-Beltran, R.; Duan, P.; Liu, S.; Plaza, A.J. Deep unsupervised embedding for remotely sensed images based on spatially augmented momentum contrast. IEEE Transactions on Geoscience and Remote Sensing 2020, 59, 2598–2610.
- Ayush, K.; Uzkent, B.; Meng, C.; Tanmay, K.; Burke, M.; Lobell, D.; Ermon, S. Geography-aware self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021; pp. 10181–10190.
- Mañas, O.; Lacoste, A.; Giró-i-Nieto, X.; Vazquez, D.; Rodriguez, P. Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021; pp. 9414–9423.
- Muhtar, D.; Zhang, X.; Xiao, P. Index your position: A novel self-supervised learning method for remote sensing images semantic segmentation. IEEE Transactions on Geoscience and Remote Sensing 2022, 60, 1–11.
- Sun, X.; Wang, P.; Lu, W.; Zhu, Z.; Lu, X.; He, Q.; Li, J.; Rong, X.; Yang, Z.; Chang, H.; et al. RingMo: A remote sensing foundation model with masked image modeling. IEEE Transactions on Geoscience and Remote Sensing 2022.
- Reed, C.J.; Gupta, R.; Li, S.; Brockman, S.; Funk, C.; Clipp, B.; Keutzer, K.; Candido, S.; Uyttendaele, M.; Darrell, T. Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023; pp. 4088–4099.
- Wang, D.; Zhang, Q.; Xu, Y.; Zhang, J.; Du, B.; Tao, D.; Zhang, L. Advancing plain vision transformer toward remote sensing foundation model. IEEE Transactions on Geoscience and Remote Sensing 2022, 61, 1–15.
- Jakubik, J.; Roy, S.; Phillips, C.; Fraccaro, P.; Godwin, D.; Zadrozny, B.; Szwarcman, D.; Gomes, C.; Nyirjesy, G.; Edwards, B.; et al. Foundation models for generalist geospatial artificial intelligence. arXiv 2023, arXiv:2310.18660.
- Dong, Z.; Gu, Y.; Liu, T. Generative ConvNet foundation model with sparse modeling and low-frequency reconstruction for remote sensing image interpretation. IEEE Transactions on Geoscience and Remote Sensing 2024.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition; IEEE, 2009; pp. 248–255.
- Qi, L.; Kuen, J.; Shen, T.; Gu, J.; Guo, W.; Jia, J.; Lin, Z.; Yang, M.H. High quality entity segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023; pp. 4024–4033.
- Qi, L.; Kuen, J.; Wang, Y.; Gu, J.; Zhao, H.; Torr, P.; Lin, Z.; Jia, J. Open world entity segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022.
- Hendrycks, D.; Gimpel, K. Gaussian error linear units (GELUs). arXiv 2016, arXiv:1606.08415.
- Shi, B.; Zhang, X.; Xu, H.; Dai, W.; Zou, J.; Xiong, H.; Tian, Q. Multi-dataset pretraining: A unified model for semantic segmentation. arXiv 2021, arXiv:2106.04121.
- Chen, Y.; Wang, M.; Mittal, A.; Xu, Z.; Favaro, P.; Tighe, J.; Modolo, D. ScaleDet: A scalable multi-dataset object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023; pp. 7288–7297.
- Tong, X.Y.; Xia, G.S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sensing of Environment 2020, 237, 111322.
- Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. DeepGlobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; 2018; pp. 172–181.
- Ji, D.; Zhao, F.; Lu, H.; Tao, M.; Ye, J. Ultra-high resolution segmentation with ultra-rich context: A novel benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023; pp. 23621–23630.
- Shi, J.X.; Wei, T.; Xiang, Y.; Li, Y.F. How re-sampling helps for long-tail learning? Advances in Neural Information Processing Systems 2023, 36.
- Zhou, X.; Koltun, V.; Krähenbühl, P. Simple multi-dataset detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 7571–7580.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016; pp. 770–778.
- Yu, Q.; He, J.; Deng, X.; Shen, X.; Chen, L.C. Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional CLIP. Advances in Neural Information Processing Systems 2024, 36.
Base and novel classes of the four datasets, with the original labels and their mapping into the unified vocabulary:

| Dataset | Type | Base classes | Novel classes |
|---|---|---|---|
| Potsdam | Original | impervious surface, building, car | low vegetation, tree |
| Potsdam | Mapped | impervious surface, building, car | meadow, tree |
| GID5 | Original | built up, farmland, forest | meadow, water |
| GID5 | Mapped | building, farmland, forest land | meadow, water |
| DeepGlobe | Original | urban land, agriculture land, range land, forest land | water, barren land |
| DeepGlobe | Mapped | building, farmland, range land, forest land | water, bare land |
| URUR | Original | building, farmland, greenhouse, wood land | bare land, water, road |
| URUR | Mapped | building, farmland, greenhouse, forest land | bare land, water, road |
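Expressed as data, the mapping above might look as follows; this is a hypothetical helper, not code from the paper, showing how dataset-specific annotations could be relabelled into the unified vocabulary before multi-source training:

```python
# Per-dataset mapping from original class names to the unified label space,
# transcribed from the table above (helper names are illustrative).
CLASS_MAP = {
    "Potsdam": {"impervious surface": "impervious surface", "building": "building",
                "car": "car", "low vegetation": "meadow", "tree": "tree"},
    "GID5": {"built up": "building", "farmland": "farmland",
             "forest": "forest land", "meadow": "meadow", "water": "water"},
    "DeepGlobe": {"urban land": "building", "agriculture land": "farmland",
                  "range land": "range land", "forest land": "forest land",
                  "water": "water", "barren land": "bare land"},
    "URUR": {"building": "building", "farmland": "farmland",
             "greenhouse": "greenhouse", "wood land": "forest land",
             "bare land": "bare land", "water": "water", "road": "road"},
}

def to_unified(dataset: str, name: str) -> str:
    """Map a dataset-specific class name to the unified label space."""
    return CLASS_MAP[dataset][name]
```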
Comparison with state-of-the-art methods on Potsdam. The last five columns report per-class IoU; imper. (impervious surface), building, and car are base classes, while tree and meadow are novel classes:

| Method | Year | Image encoder | VLM | mIoU | IoU (base) | IoU (novel) | FWIoU | mAcc | OA | Acc (base) | Acc (novel) | imper. | building | car | tree | meadow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ZSSeg | 2021 | ResNet50 | CLIP-B/16 | 54.27 | 78.49 | 17.94 | 51.02 | 66.71 | 66.98 | 88.05 | 34.75 | 59.74 | 85.31 | 90.42 | 0.00 | 35.88 |
| ZegFormer | 2022 | ResNet50 | CLIP-B/16 | 49.20 | 71.99 | 15.01 | 45.24 | 61.73 | 62.27 | 84.67 | 27.99 | 53.93 | 75.52 | 86.51 | 0.00 | 30.03 |
| MaskCLIP | 2023 | ResNet50 | CLIP-L/16 | 15.58 | 21.84 | 6.19 | 21.50 | 28.54 | 39.46 | 60.16 | 7.78 | 32.24 | 33.29 | 0.00 | 11.23 | 1.16 |
| SAN | 2023 | ResNet50 | CLIP-B/16 | 38.56 | 60.25 | 6.02 | 38.82 | 59.80 | 60.71 | 96.01 | 6.70 | 52.94 | 69.04 | 58.77 | 2.12 | 9.92 |
| OVSeg | 2023 | ResNet101 | CLIP-B/16 | 31.56 | 50.43 | 3.25 | 35.07 | 43.49 | 54.28 | 87.44 | 3.54 | 41.72 | 74.62 | 34.96 | 0.21 | 6.28 |
| FC-CLIP | 2023 | ConvNeXt-L | CLIP-RN50 | 44.78 | 73.74 | 1.32 | 39.03 | 59.12 | 59.76 | 97.85 | 1.48 | 48.12 | 81.32 | 91.79 | 0.05 | 2.60 |
| FreeSeg | 2023 | ResNet50 | CLIP-B/16 | 51.25 | 75.89 | 14.29 | 46.57 | 64.12 | 65.10 | 95.89 | 18.00 | 53.99 | 81.86 | 91.82 | 3.54 | 25.05 |
| FreeMix (ours) | 2024 | ResNet50 | CLIP-B/16 | 63.44 | 86.46 | 28.92 | 64.45 | 73.87 | 75.87 | 89.92 | 54.37 | 83.89 | 90.16 | 85.32 | 11.11 | 46.73 |
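For reference, the aggregates in these tables follow the standard per-class IoU definition, and the overall mIoU is the class-count-weighted combination of the base and novel means; e.g., for FreeMix above, (3 × 86.46 + 2 × 28.92) / 5 = 63.44:

```latex
% Per-class IoU and base/novel aggregation
% (\mathcal{B}: base classes, \mathcal{N}: novel classes)
\mathrm{IoU}_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c},
\qquad
\mathrm{mIoU}_{\mathcal{S}} = \frac{1}{|\mathcal{S}|} \sum_{c \in \mathcal{S}} \mathrm{IoU}_c
\quad (\mathcal{S} \in \{\mathcal{B}, \mathcal{N}\}),
\qquad
\mathrm{mIoU} = \frac{|\mathcal{B}|\,\mathrm{mIoU}_{\mathcal{B}}
                      + |\mathcal{N}|\,\mathrm{mIoU}_{\mathcal{N}}}
                     {|\mathcal{B}| + |\mathcal{N}|}.
```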
Single-source OVDG results: each model is trained on one dataset and tested on all four (P = Potsdam, G = GID5, D = DeepGlobe, U = URUR; IoU_b / IoU_n = mIoU over base / novel classes):

| Training dataset | Model | Testing type | P-IoU_b | P-IoU_n | P-mIoU | P-mAcc | G-IoU_b | G-IoU_n | G-mIoU | G-mAcc | D-IoU_b | D-IoU_n | D-mIoU | D-mAcc | U-IoU_b | U-IoU_n | U-mIoU | U-mAcc | avg. mIoU | avg. mAcc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Potsdam | ZSSeg | MS | 78.49 | 17.94 | 54.27 | 66.71 | 1.73 | 13.66 | 6.50 | 34.53 | 0.15 | 10.83 | 3.71 | 18.85 | 0.15 | 3.13 | 1.43 | 14.99 | 16.47 | 33.77 |
| Potsdam | ZegFormer | MS | 71.99 | 15.01 | 49.20 | 61.73 | 0.58 | 4.87 | 2.30 | 20.33 | 12.56 | 2.13 | 9.08 | 15.95 | 14.17 | 1.61 | 8.79 | 13.85 | 17.34 | 27.96 |
| Potsdam | MaskCLIP | MS | 21.84 | 6.19 | 15.58 | 28.54 | 13.85 | 0.35 | 8.45 | 23.68 | 7.07 | 0.00 | 4.71 | 14.68 | 7.08 | 0.00 | 4.04 | 12.12 | 8.19 | 19.75 |
| Potsdam | FC-CLIP | MS | 73.74 | 1.32 | 44.78 | 59.12 | 12.51 | 0.00 | 7.51 | 16.64 | 5.55 | 0.00 | 3.70 | 13.44 | 7.87 | 0.00 | 4.50 | 8.39 | 15.12 | 24.39 |
| Potsdam | FreeSeg | MS | 75.89 | 14.29 | 51.25 | 64.12 | 2.99 | 18.80 | 9.31 | 33.35 | 2.96 | 11.87 | 5.93 | 22.65 | 1.56 | 10.34 | 5.32 | 21.59 | 17.95 | 35.42 |
| Potsdam | FreeMix (ours) | SS | 86.46 | 28.92 | 63.44 | 73.87 | 15.73 | 16.31 | 15.96 | 43.90 | 3.41 | 8.93 | 5.25 | 19.95 | 3.53 | 3.30 | 3.43 | 15.76 | 22.02 | 38.37 |
| GID5 | ZSSeg | MS | 0.00 | 10.77 | 4.31 | 20.00 | 33.15 | 0.63 | 20.14 | 37.60 | 3.03 | 5.85 | 3.97 | 18.61 | 9.45 | 3.13 | 6.74 | 19.43 | 8.79 | 23.91 |
| GID5 | ZegFormer | MS | 6.35 | 12.30 | 8.73 | 23.93 | 28.16 | 4.18 | 18.57 | 38.21 | 28.46 | 0.27 | 19.06 | 29.75 | 6.35 | 12.30 | 8.73 | 23.93 | 13.77 | 28.95 |
| GID5 | MaskCLIP | MS | 21.06 | 8.60 | 16.08 | 29.19 | 16.45 | 0.66 | 10.13 | 20.78 | 10.71 | 0.00 | 7.14 | 16.49 | 9.89 | 0.00 | 5.65 | 9.98 | 9.75 | 19.11 |
| GID5 | FC-CLIP | MS | 22.78 | 10.00 | 17.67 | 36.48 | 6.12 | 0.13 | 3.72 | 19.66 | 3.59 | 0.00 | 2.40 | 16.54 | 3.87 | 0.01 | 2.21 | 10.55 | 6.50 | 20.80 |
| GID5 | FreeSeg | MS | 3.32 | 17.59 | 9.02 | 23.46 | 73.36 | 22.22 | 52.91 | 61.88 | 19.05 | 8.81 | 15.64 | 26.30 | 15.86 | 1.72 | 9.80 | 15.51 | 21.84 | 31.78 |
| GID5 | FreeMix (ours) | SS | 8.33 | 15.83 | 11.33 | 26.36 | 76.47 | 22.55 | 54.90 | 65.44 | 23.01 | 12.38 | 19.47 | 35.81 | 20.95 | 9.79 | 16.17 | 26.48 | 25.46 | 38.52 |
| DeepGlobe | ZSSeg | MS | 5.43 | 11.51 | 7.86 | 23.29 | 14.59 | 9.85 | 12.69 | 32.53 | 0.85 | 5.60 | 2.44 | 17.18 | 0.93 | 5.32 | 2.81 | 20.31 | 6.45 | 23.32 |
| DeepGlobe | ZegFormer | MS | 0.00 | 12.27 | 4.91 | 20.95 | 0.17 | 0.25 | 0.20 | 20.10 | 7.20 | 5.70 | 6.70 | 19.71 | 0.01 | 1.13 | 0.49 | 14.28 | 3.07 | 18.76 |
| DeepGlobe | MaskCLIP | MS | 16.43 | 7.51 | 12.86 | 23.85 | 6.51 | 0.00 | 3.90 | 20.36 | 9.95 | 0.00 | 6.63 | 26.26 | 5.59 | 0.00 | 3.19 | 14.53 | 6.64 | 21.25 |
| DeepGlobe | FC-CLIP | MS | 24.89 | 5.92 | 17.30 | 37.53 | 5.77 | 0.00 | 3.46 | 19.72 | 2.71 | 0.00 | 1.80 | 14.25 | 3.74 | 0.00 | 2.14 | 8.73 | 6.17 | 20.05 |
| DeepGlobe | FreeSeg | MS | 17.62 | 22.61 | 19.61 | 37.10 | 41.40 | 15.96 | 31.22 | 44.04 | 9.44 | 7.03 | 8.63 | 23.16 | 8.44 | 2.71 | 5.99 | 17.81 | 16.36 | 30.52 |
| DeepGlobe | FreeMix (ours) | SS | 17.89 | 17.14 | 17.59 | 39.37 | 32.89 | 20.26 | 27.84 | 49.37 | 24.97 | 9.35 | 19.76 | 33.88 | 19.12 | 6.49 | 13.71 | 24.97 | 19.72 | 36.89 |
| URUR | ZSSeg | MS | 2.92 | 11.12 | 6.20 | 21.64 | 7.53 | 8.36 | 7.86 | 33.10 | 5.52 | 6.39 | 5.81 | 20.30 | 5.18 | 1.87 | 3.76 | 16.37 | 5.90 | 22.85 |
| URUR | ZegFormer | MS | 0.59 | 5.34 | 2.49 | 22.72 | 0.76 | 0.25 | 0.56 | 20.47 | 10.56 | 0.00 | 7.04 | 22.31 | 0.02 | 1.13 | 0.50 | 14.30 | 2.64 | 19.95 |
| URUR | MaskCLIP | MS | 12.94 | 12.63 | 12.82 | 28.19 | 15.39 | 0.44 | 9.41 | 21.48 | 10.39 | 0.00 | 6.93 | 17.01 | 12.24 | 0.00 | 6.99 | 13.17 | 9.03 | 19.96 |
| URUR | FC-CLIP | MS | 26.37 | 8.89 | 19.38 | 35.17 | 5.74 | 0.00 | 3.44 | 19.89 | 2.97 | 0.77 | 2.24 | 16.36 | 3.78 | 0.00 | 2.16 | 10.19 | 6.80 | 20.40 |
| URUR | FreeSeg | MS | 12.95 | 22.56 | 16.79 | 32.22 | 43.92 | 21.93 | 35.12 | 57.84 | 21.25 | 8.05 | 16.85 | 32.12 | 21.00 | 5.81 | 14.49 | 24.71 | 20.81 | 36.72 |
| URUR | FreeMix (ours) | SS | 15.70 | 21.73 | 18.11 | 36.12 | 33.06 | 16.97 | 26.62 | 54.02 | 28.39 | 13.08 | 23.28 | 37.33 | 33.29 | 9.80 | 23.22 | 32.45 | 22.80 | 39.98 |
Multi-source OVDG results: all models are trained on the union of the four datasets (GPDU) and tested on each. Each dataset block reports mIoU / IoU_b / IoU_n; FreeMix† is the variant without DAS:

| Model | Training dataset | Testing type | P-mIoU | P-IoU_b | P-IoU_n | G-mIoU | G-IoU_b | G-IoU_n | D-mIoU | D-IoU_b | D-IoU_n | U-mIoU | U-IoU_b | U-IoU_n | avg. mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ZSSeg | GPDU | MS | 9.84 | 10.85 | 8.32 | 3.66 | 0.02 | 9.13 | 2.38 | 0.00 | 7.16 | 1.94 | 1.65 | 2.32 | 4.46 |
| ZegFormer | GPDU | MS | 4.31 | 0.00 | 10.77 | 0.50 | 0.83 | 0.00 | 8.11 | 11.59 | 1.15 | 0.48 | 0.00 | 1.13 | 3.35 |
| MaskCLIP | GPDU | MS | 11.72 | 18.78 | 1.13 | 13.62 | 22.70 | 0.00 | 9.20 | 13.80 | 0.00 | 6.87 | 12.02 | 0.00 | 10.35 |
| SAN | GPDU | MS | 23.84 | 23.99 | 23.61 | 35.10 | 57.67 | 1.25 | 29.38 | 42.86 | 2.41 | 30.44 | 48.39 | 6.49 | 29.69 |
| OVSeg | GPDU | MS | 9.32 | 14.96 | 0.86 | 15.58 | 22.15 | 5.73 | 25.87 | 37.57 | 2.47 | 19.56 | 31.86 | 3.16 | 17.58 |
| FC-CLIP | GPDU | MS | 21.02 | 28.76 | 9.40 | 2.87 | 4.79 | 0.00 | 1.81 | 2.72 | 0.00 | 1.48 | 2.60 | 0.00 | 6.80 |
| FreeSeg | GPDU | MS | 17.58 | 14.75 | 21.83 | 25.26 | 33.94 | 12.24 | 24.55 | 31.58 | 10.49 | 23.71 | 38.99 | 3.34 | 22.78 |
| FreeMix† (ours) | GPDU | SS | 19.98 | 17.25 | 24.06 | 57.26 | 75.91 | 29.27 | 32.03 | 41.17 | 13.75 | 29.30 | 42.41 | 11.83 | 34.64 |
| FreeMix (ours) | GPDU | SS | 47.03 | 69.54 | 13.26 | 43.13 | 67.91 | 5.97 | 35.14 | 52.69 | 0.04 | 35.72 | 60.85 | 2.22 | 40.26 |
Ablation with multi-source training (GPDU). RS_SSL = remote sensing self-supervised pre-training, ESB = entity segmentation branch, DAS = dataset-aware sampling; each dataset block reports mIoU / mAcc, and the Δ columns give the average gain over the previous row:

| Training dataset | RS_SSL | ESB | DAS | P-mIoU | P-mAcc | G-mIoU | G-mAcc | D-mIoU | D-mAcc | U-mIoU | U-mAcc | avg. mIoU | avg. mAcc | Δ mIoU | Δ mAcc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPDU | | | | 17.58 | 32.58 | 25.26 | 32.92 | 24.55 | 39.75 | 23.71 | 30.98 | 22.78 | 34.05 | | |
| GPDU | ✓ | | | 19.73 | 37.32 | 38.53 | 57.22 | 31.97 | 47.06 | 26.46 | 34.49 | 29.17 | 44.02 | +6.39 | +9.96 |
| GPDU | ✓ | ✓ | | 19.98 | 35.92 | 57.26 | 71.10 | 32.03 | 46.49 | 29.30 | 37.85 | 34.64 | 47.84 | +5.47 | +3.81 |
| GPDU | ✓ | ✓ | ✓ | 47.03 | 62.56 | 43.13 | 58.44 | 35.14 | 47.37 | 35.72 | 45.75 | 40.26 | 53.53 | +5.62 | +5.69 |
Effect of the BSB backbone and its pre-training, under random sampling vs. DAS (multi-source training). Each dataset block reports mIoU / mAcc:

| Backbone | Pre-train type | Pre-train dataset | Training tactic | P-mIoU | P-mAcc | G-mIoU | G-mAcc | D-mIoU | D-mAcc | U-mIoU | U-mAcc | avg. mIoU | avg. mAcc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet50 | Supervised | In1K | random | 19.73 | 43.66 | 38.53 | 76.23 | 31.97 | 63.66 | 26.46 | 70.45 | 29.17 | 63.50 |
| ResNet50 | Self-Supervised | MillionAID | random | 19.98 | 40.43 | 57.26 | 87.65 | 32.03 | 63.24 | 29.30 | 71.70 | 34.64 | 65.75 |
| ResNet50 | Supervised | In1K | DAS | 39.49 | 61.35 | 41.18 | 78.04 | 7.81 | 22.71 | 6.39 | 19.02 | 23.71 | 45.28 |
| ResNet50 | Self-Supervised | MillionAID | DAS | 47.03 | 62.47 | 43.13 | 81.90 | 35.14 | 69.38 | 35.72 | 78.45 | 40.25 | 73.05 |
| Swin-B | – | – | random | 11.79 | 30.92 | 33.76 | 73.99 | 22.20 | 62.06 | 26.05 | 71.96 | 23.45 | 59.73 |
| Swin-B | Self-Supervised | MillionAID | random | 11.54 | 31.09 | 39.31 | 78.63 | 23.22 | 61.91 | 27.71 | 72.38 | 25.44 | 61.00 |
| Swin-B | – | – | DAS | 36.26 | 60.70 | 34.29 | 72.58 | 19.78 | 48.39 | 15.05 | 47.18 | 26.34 | 57.21 |
| Swin-B | Self-Supervised | MillionAID | DAS | 43.85 | 66.57 | 40.11 | 79.49 | 23.81 | 56.59 | 18.44 | 54.53 | 31.55 | 64.29 |
Effect of the ESB backbone under random sampling vs. DAS (multi-source training). Each dataset block reports mIoU / mAcc:

| Backbone of ESB | Training tactic | P-mIoU | P-mAcc | G-mIoU | G-mAcc | D-mIoU | D-mAcc | U-mIoU | U-mAcc | avg. mIoU | avg. mAcc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Swin-T | random | 19.98 | 40.43 | 57.26 | 87.65 | 32.03 | 63.24 | 29.30 | 71.70 | 34.64 | 65.75 |
| Swin-L | random | 23.31 | 45.90 | 55.07 | 88.71 | 24.85 | 50.92 | 23.40 | 59.67 | 31.65 | 61.30 |
| HorNet-L | random | 21.86 | 45.65 | 53.14 | 88.63 | 30.40 | 64.15 | 27.75 | 72.03 | 33.28 | 67.61 |
| Swin-T | DAS | 47.03 | 62.47 | 43.13 | 81.90 | 35.14 | 69.38 | 35.72 | 78.45 | 40.25 | 73.05 |
| Swin-L | DAS | 47.39 | 57.09 | 57.11 | 85.90 | 10.89 | 31.00 | 14.66 | 24.70 | 32.51 | 49.67 |
| HorNet-L | DAS | 53.68 | 66.17 | 53.40 | 82.55 | 12.91 | 36.81 | 15.67 | 30.10 | 33.91 | 53.90 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
