Generally, tumors are difficult to segment because they often lie in regions of low contrast, which makes accurate boundary delineation harder. Ge et al. [11] introduced the Multi-input Dilated (MD) U-net to segment bladder tumors. They note that the traditional U-net down-samples the original features to learn global features, which corrupts the local features of small objects. They therefore replaced the max-pooling down-sampling with dilated convolutions, which enlarge the receptive field. Furthermore, they fed scaled versions of the input at multiple levels of the network to improve the contextual information.
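As a rough illustration of this design choice (not the authors' exact implementation), a dilated convolution can replace a pooling step so that the receptive field grows while the spatial resolution, and hence the detail of small structures, is preserved; the PyTorch sketch below uses hypothetical channel sizes.

```python
import torch
import torch.nn as nn

class DilatedDownBlock(nn.Module):
    """Keeps the spatial resolution but enlarges the receptive field,
    in contrast to a max-pooling down-sampling step."""
    def __init__(self, in_ch: int, out_ch: int, dilation: int = 2):
        super().__init__()
        self.block = nn.Sequential(
            # padding = dilation keeps the output the same size as the input
            nn.Conv2d(in_ch, out_ch, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

# A 3x3 kernel with dilation 2 "sees" a 5x5 region, so stacking such
# blocks widens the context without shrinking small structures.
x = torch.randn(1, 32, 128, 128)
y = DilatedDownBlock(32, 64, dilation=2)(x)  # shape stays 1 x 64 x 128 x 128
```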
Wang et al. [12] proposed A-net, a network for the semantic segmentation of tumors intended for Adaptive Radiotherapy (ART). Their model is a deep network that takes 3x3 cm patches as input, which helps it cope with a small dataset. Patch-based input emphasizes local features rather than global features. Notably, they used the MRI volumes of the initial weeks for training and the last MRI of the same patients for testing. As medical data are usually limited in terms of annotations and sample size, the use of multiple modalities is becoming more popular.
Wang et al. [12] used CT and MRI to build a two-stage network that employs a CycleGAN to learn important tumor features. In the first stage, a cycle-consistency loss between the two domains is combined with an additional structure loss that accounts for the shape and size of the tumor generated by the GAN. In the second stage, the pseudo-MRI images are pooled with the few available expert-annotated MRI scans to train the segmentation network.
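To make these loss terms concrete, the sketch below pairs a generic cycle-consistency loss with a simple Dice-style structure term on the generated tumor region; the function names, weights, and the exact form of the structure term are assumptions for illustration, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(real_ct, rec_ct, real_mri, rec_mri):
    """L1 reconstruction after a full CT -> MRI -> CT (and MRI -> CT -> MRI) cycle."""
    return F.l1_loss(rec_ct, real_ct) + F.l1_loss(rec_mri, real_mri)

def structure_loss(pred_tumor_mask, ref_tumor_mask, eps=1e-6):
    """Soft-Dice-style penalty that discourages the GAN from changing
    the tumor's shape and size during translation (illustrative only)."""
    inter = (pred_tumor_mask * ref_tumor_mask).sum()
    union = pred_tumor_mask.sum() + ref_tumor_mask.sum()
    return 1.0 - (2.0 * inter + eps) / (union + eps)

def first_stage_loss(real_ct, rec_ct, real_mri, rec_mri,
                     pred_tumor_mask, ref_tumor_mask,
                     lambda_cyc=10.0, lambda_struct=1.0):
    # total first-stage objective; the weights are hypothetical
    return (lambda_cyc * cycle_consistency_loss(real_ct, rec_ct, real_mri, rec_mri)
            + lambda_struct * structure_loss(pred_tumor_mask, ref_tumor_mask))
```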
Li et al. [13] introduced a multi-modal network that uses CT and PET images to segment the tumor. PET images acquired with F-FDG (fluorodeoxyglucose) show a clearer contrast at the tumor boundaries; since they have low spatial resolution, fusing PET with CT is an interesting approach. They generate a tumor probability map from the CT using an FCN and then fuse it with the PET intensity values via a fuzzy variational model. Zhao et al. [14] also used CT-PET images for segmentation. They introduced two networks: a multi-task network that extracts feature maps from the CT and PET images separately, followed by a network of cascaded convolution operations that produces the segmentation map.
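A minimal sketch of such two-branch feature extraction followed by cascaded convolutions is given below; the layer counts, channel sizes, and volume shapes are hypothetical and only illustrate the overall pattern.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class TwoBranchCTPET(nn.Module):
    """Separate feature extractors for CT and PET, followed by a
    cascade of convolutions on the concatenated features."""
    def __init__(self, feat_ch=16, num_classes=2):
        super().__init__()
        self.ct_branch = conv_block(1, feat_ch)
        self.pet_branch = conv_block(1, feat_ch)
        self.cascade = nn.Sequential(
            conv_block(2 * feat_ch, feat_ch),
            conv_block(feat_ch, feat_ch),
            nn.Conv3d(feat_ch, num_classes, kernel_size=1),  # per-voxel logits
        )

    def forward(self, ct, pet):
        fused = torch.cat([self.ct_branch(ct), self.pet_branch(pet)], dim=1)
        return self.cascade(fused)

# ct and pet are assumed to be co-registered single-channel volumes
logits = TwoBranchCTPET()(torch.randn(1, 1, 32, 64, 64),
                          torch.randn(1, 1, 32, 64, 64))
```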
Jin et al. [15] introduced the DeepTarget network, which delineates the GTV and CTV in CT guided by PET. They used a two-stream 3D fusion PSNN based on U-net and PHNN [16]. They first perform a deformable registration between the CT and the PET and segment the GTV and the organs at risk, which are then used by the final network for CTV delineation.
Wang et al. [17] used multi-view fusion for GTV segmentation of brain gliomas on CT images. They employed a U-net-like encoder-decoder that takes three 2D CT slices (current, previous and next) whose features are fused at the decoder (Dense-Decoder). They note that this kind of input covers a larger spatial region than a 2D CNN while having fewer parameters than a 3D CNN.
Ma et al. [18] proposed a registration-guided deep learning architecture that uses CBCT images and registered CT masks to delineate organs at risk. They applied two types of registration to the CT masks, rigid and deformable, and show that deformable registration performs better than rigid registration. Segmentation of cone-beam CT is difficult because of its lower soft-tissue contrast and imaging artifacts. Fu et al. [19] used a cross-modality attention pyramid network to automatically segment the bladder, prostate, rectum, and left/right femoral heads in CBCT. The network consists of two U-nets, one taking the CBCT and the other a synthetic MRI as input. The training loss combines the losses of the two U-nets with a loss computed on the late fusion of the features of the two decoders via an attention gate.
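An additive attention gate of the kind popularized by Attention U-Net can implement such a late fusion of decoder features; the sketch below is a generic 2D version with hypothetical channel sizes, not the specific module of [19].

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Weights features x from one decoder using a gating signal g
    from the other decoder before the two streams are fused."""
    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.theta_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)
        self.phi_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, g):
        # additive attention: alpha = sigmoid(psi(relu(theta(x) + phi(g))))
        alpha = self.sigmoid(self.psi(self.relu(self.theta_x(x) + self.phi_g(g))))
        return x * alpha  # attended features, same shape as x

# fuse CBCT-decoder features with synthetic-MRI-decoder features (shapes assumed equal)
cbct_feat = torch.randn(1, 64, 56, 56)
smri_feat = torch.randn(1, 64, 56, 56)
fused = torch.cat([AttentionGate(64, 64, 32)(cbct_feat, smri_feat), smri_feat], dim=1)
```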
The synthetic MRI is produced by a CycleGAN trained to translate between CBCT and MRI [20]. For this purpose, they performed a rigid registration between the two images, and they note that registration errors can degrade the segmentation performance of the network.
Jia et al. [21] used a CycleGAN to translate CT (with contours) into synthetic CBCT (without contours) and applied domain adaptation with adversarial feature learning to train the CBCT segmentation network without any CBCT annotations. They observed that, with adversarial learning, the network achieved a higher DSC than a network trained directly on the sCBCT produced by the CycleGAN. They first trained the domain discriminator until a threshold was reached and then trained the CBCT/sCBCT segmentation network, which uses the sCBCT contours to compute the Dice loss.
Brion et al. [22] also used an adversarial network for unsupervised domain adaptation between annotated CTs and non-annotated CBCTs. They used a 3D U-net trained to segment CT images and added a gradient reversal layer (GRL) at the decoder to reduce the domain shift between CT and CBCT. The GRL is a custom layer that acts as the identity in the forward pass but reverses the sign of the gradients during back-propagation. They also introduced several strategies for intensity-based data augmentation. This improves the generalization of CT-trained models to CBCT data without explicitly training on CBCT contours.
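For reference, a GRL can be written as a small autograd function that is the identity in the forward pass and negates (and optionally scales) the gradient in the backward pass; the PyTorch sketch below follows the standard DANN-style formulation rather than the specific code of [22].

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies gradients by -lambda going backward."""
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # reversed (and scaled) gradient flows back into the feature extractor
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# typical use: decoder features pass through the GRL before a domain classifier,
# so minimizing the domain loss pushes the features to become domain-invariant
feats = torch.randn(2, 64, requires_grad=True)
domain_logits = torch.nn.Linear(64, 2)(grad_reverse(feats, lambd=0.5))
```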
We identified the need for a segmentation network that uses the data produced during the planning phase to delineate the GTV in CBCT. In general, multi-modality approaches to automatic segmentation require extra data to be acquired, which can be considered burdensome because it involves an additional imaging modality (MRI or PET). Since the tumor is manually annotated during the planning phase, this information is valuable and should be reused for the further delineation of the GTV in the CBCT. Our approach is similar to that of Ma et al. [18] in terms of the input, but it differs in that we use the simplest form of registration, namely translation. We further analyze different fusion strategies under different types of inaccuracies in the CT mask. As mentioned in [18], both the organs at risk and the tumor volume need to be delineated; since they segment the organs at risk, we go on to segment the GTV using this approach.