The overpopulation of feral pigeons in Hong Kong has significantly disrupted the urban ecosystem, highlighting the urgent need for effective strategies to control their population. In general, control measures should be implemented and periodically re-evaluated on the basis of accurate estimates of the feral pigeon population in the regions concerned. Obtaining such estimates is very difficult in urban environments, however, because pigeons are mobile and easily concealed within complex building structures. With advances in deep learning, computer vision is a promising tool for pigeon monitoring and population estimation, but it has not been well investigated so far. We therefore propose an improved deep learning model based on Mask R-CNN (Swin-Mask R-CNN) for feral pigeon detection using computer vision techniques. Specifically, our model consists of a Swin transformer network (STN) as the backbone, a feature pyramid network (FPN) as the neck, and three decoupled detection heads. The STN extracts deep feature information of feral pigeons through local and cross-window attention mechanisms. The FPN fuses multi-scale features and enhances the multi-scale learning ability. The heads in the three branches are responsible for classification, bounding-box prediction, and segmentation of feral pigeons, respectively. During the prediction phase, the Slicing Aided Hyper Inference (SAHI) tool is employed to zoom in on the feature information of small feral pigeon targets, and the segmentation head is frozen to expedite inference on large images. Experiments were conducted on a feral pigeon dataset to evaluate model performance. The results reveal that our model is well suited to detecting small targets in high-resolution images and achieves excellent recognition performance for feral pigeons, with a mean average precision (mAP) of 0.74 and an average precision at 50% intersection over union (AP50) of 0.93.
For small feral pigeon targets, the AP50 at small scale (AP50s) improved by 10% compared with Mask R-CNN (AP50s of 0.75), demonstrating the model's potential for dynamic pigeon detection and population estimation in the future.
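The slicing step used at prediction time can be illustrated with a minimal sketch. It is not SAHI's API and not the authors' implementation: the function names `slice_windows` and `to_full_image`, the tile size of 512 pixels, and the overlap ratio of 0.2 are assumptions chosen for illustration. The sketch only shows how a large image can be divided into overlapping windows, so that small targets occupy a larger share of each model input, and how a box detected inside a window is mapped back to full-image coordinates. Merging duplicate detections from overlapping windows is omitted.

```python
from typing import List, Tuple

# A box or window is (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


def slice_windows(width: int, height: int,
                  tile: int = 512, overlap: float = 0.2) -> List[Box]:
    """Return overlapping windows that together cover a width x height image.

    `tile` and `overlap` are illustrative defaults, not values from the paper.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # Distance between the origins of two neighbouring windows.
    step = max(1, int(tile * (1.0 - overlap)))

    def starts(size: int) -> List[int]:
        # One window is enough when the image is no larger than the tile.
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, step))
        # Align the last window with the image border so nothing is missed.
        positions.append(size - tile)
        return positions

    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in starts(height)
            for x in starts(width)]


def to_full_image(box: Box, window: Box) -> Box:
    """Shift a box from window-local coordinates to full-image coordinates."""
    x0, y0 = window[0], window[1]
    return (box[0] + x0, box[1] + y0, box[2] + x0, box[3] + y0)


# Example: windows for a 1024 x 768 image with the default settings.
windows = slice_windows(1024, 768)
# A hypothetical detection at (10, 20, 30, 40) inside the last window.
shifted = to_full_image((10, 20, 30, 40), windows[-1])
```

In a full pipeline each window would be cropped from the image and passed to the detector separately, and the shifted boxes from all windows would then be combined, typically with a duplicate-suppression step such as non-maximum suppression.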