The fusion of global contextual information with local cropped-patch details is crucial for segmenting ultra-high-resolution images. In this study, we introduce a novel fusion mechanism, Global-Local Deep Fusion (GL-Deep Fusion), based on an enhanced transformer architecture that efficiently integrates global context and local detail. Specifically, we propose the Global-Local Synthesis Networks (GLSNet), a dual-branch network in which one branch processes the entire original image while the other takes cropped local patches as input. The features of the two branches are fused through GL-Deep Fusion, significantly improving the accuracy of ultra-high-resolution image segmentation; GLSNet is particularly effective at identifying small overlapping objects. To optimize GPU memory utilization, we carefully design the dual-branch architecture so that the features it extracts integrate seamlessly into the enhanced transformer framework of GL-Deep Fusion. Extensive experiments on challenging benchmarks, including the DeepGlobe and Vaihingen datasets, demonstrate that GLSNet achieves a new state-of-the-art tradeoff between GPU memory usage and segmentation accuracy.
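A transformer-based fusion of a global branch and a local branch can be pictured as cross-attention in which local-patch tokens query global-context tokens. The sketch below is a minimal NumPy illustration of that idea, not the authors' implementation: the function name `cross_attention_fusion`, the random projection matrices, and all shapes are assumptions chosen only to show the mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(local_tokens, global_tokens, d_k, seed=0):
    """Fuse local-patch tokens (queries) with global-context tokens
    (keys/values) via scaled dot-product cross-attention.

    local_tokens:  (N_local, d)  features from the cropped-patch branch
    global_tokens: (N_global, d) features from the whole-image branch
    Returns fused local tokens of shape (N_local, d_k).
    Projection weights are random here; a real model would learn them.
    """
    rng = np.random.default_rng(seed)
    d = local_tokens.shape[1]
    W_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_v = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Q = local_tokens @ W_q            # queries from the local branch
    K = global_tokens @ W_k           # keys from the global branch
    V = global_tokens @ W_v           # values from the global branch
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (N_local, N_global)
    # Each local token becomes a context-weighted mix of global features.
    return attn @ V

# Toy usage: 16 local-patch tokens attend over 64 global tokens.
local = np.random.default_rng(1).standard_normal((16, 32))
glob = np.random.default_rng(2).standard_normal((64, 32))
fused = cross_attention_fusion(local, glob, d_k=32)
print(fused.shape)  # (16, 32)
```

Because the local branch only ever queries a fixed-size set of global tokens, the attention cost stays bounded regardless of the full image resolution, which is one way a dual-branch design can keep GPU memory in check.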