In recent years, DL models have transformed medical image segmentation by significantly improving segmentation accuracy and performance [3]. Convolutional operations have long been recognized for their efficacy in computer vision tasks owing to their robust representational capabilities. CNNs, which are built from convolutional layers, have become the predominant choice across diverse computer vision applications [4] and, in particular, have advanced the state of the art in medical image segmentation [5]. Among the notable architectures, U-Net stands out as a cornerstone of medical image segmentation [6]. Its distinctive U-shaped design uses descending (downsampling) convolutions to extract spatial information and compose a feature map representing low-level features. Ascending (upsampling) convolutions then upscale the feature map to the original image size while preserving detailed object boundaries.
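The descending and ascending paths described above can be illustrated with a minimal NumPy sketch. It is not the implementation from [6]: the weights are random and untrained, the channel counts (8, 16, 32) are arbitrary, and the helper names (`conv3x3`, `downsample`, `upsample`, `tiny_unet`) are hypothetical. It assumes max pooling for downsampling, nearest-neighbour upsampling, and U-Net-style skip connections that concatenate encoder features into the decoder. Only the data flow and the pixel-wise output shape are meaningful here.

```python
import numpy as np

def conv3x3(x, out_channels, rng):
    """'Same'-padded 3x3 convolution followed by ReLU; x has shape (C, H, W)."""
    c, h, w = x.shape
    weights = rng.standard_normal((out_channels, c, 3, 3)) * 0.1  # random, untrained
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((out_channels, h, w))
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + w]  # shifted view, shape (C, H, W)
            out += np.einsum("oc,chw->ohw", weights[:, :, dy, dx], patch)
    return np.maximum(out, 0.0)

def downsample(x):
    """2x2 max pooling: halves height and width (both must be even)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample(x):
    """Nearest-neighbour upsampling: doubles height and width."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def tiny_unet(image, num_classes=2, seed=0):
    """Two-level encoder/decoder; image has shape (C, H, W) with H, W divisible by 4."""
    rng = np.random.default_rng(seed)
    # Descending path: convolve, then shrink the spatial resolution.
    enc1 = conv3x3(image, 8, rng)                     # (8,  H,   W)
    enc2 = conv3x3(downsample(enc1), 16, rng)         # (16, H/2, W/2)
    bottleneck = conv3x3(downsample(enc2), 32, rng)   # (32, H/4, W/4)
    # Ascending path: upsample and concatenate the matching encoder features.
    dec2 = conv3x3(np.concatenate([upsample(bottleneck), enc2]), 16, rng)
    dec1 = conv3x3(np.concatenate([upsample(dec2), enc1]), 8, rng)
    # 1x1 convolution gives one score per class per pixel; argmax picks the label.
    head = rng.standard_normal((num_classes, dec1.shape[0])) * 0.1
    logits = np.einsum("oc,chw->ohw", head, dec1)
    return logits.argmax(axis=0)

image = np.random.default_rng(1).standard_normal((1, 32, 32))
mask = tiny_unet(image)
print(mask.shape)  # (32, 32)
```

The output mask has the same height and width as the input, with one class label per pixel. A trained model would learn the convolution weights and would typically be written in a DL framework rather than plain NumPy.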
Image segmentation partitions an image into distinct, non-overlapping regions based on specific criteria. Segmenting the image into these regions, or pixel sets, reduces the area to be analyzed and streamlines the search for pertinent features according to predefined criteria [7]. The process yields a collection of image segments that together represent the original image. The emphasis lies on highlighting relevant image attributes to enhance the interpretability and utility of image analysis. Segmentation algorithms aim to achieve pixel-wise prediction [8]. Within DL, medical image segmentation faces numerous challenges, which necessitates the formulation of multiple hypotheses. A prominent issue in medical imaging is the scarcity of annotated training data [9].
MRI provides comprehensive information about kidney function and structure. It allows precise visualization of the kidney's condition and its components, such as the cortex, medulla, and pelvis. This imaging modality facilitates the detection of renal lesions, tumors, and small masses, although it is not suitable for identifying calcifications, including stones [10]. To acquire an MRI image, a magnetic field is used to align the free water protons in the patient's body along the axis of the field. A Radio Frequency (RF) antenna positioned over the targeted area emits energy pulses. These pulses tip the protons away from the axis of the magnetic field and cause them to spin in synchronization, creating resonance. After the RF pulse burst, the nuclei gradually return to their resting alignment through various relaxation processes, releasing RF energy. MRI captures this released energy to generate an image [11]. The Fourier transform converts the frequency data in the signal from each point on the image plane into corresponding intensity levels, displayed as shades of gray within a pixel matrix.
Different images can be generated by altering the RF pulse sequences that are applied or collected. These alterations are governed by the Repetition Time (TR), the duration between successive pulse sequences on the same slice, and the Echo Time (TE), the interval between the emission of the RF pulse and the detection of the echo signal [12]. Several MRI sequences are available; the most common are T1-weighted and T2-weighted, which differ in their TR and TE. T1-weighted images are produced with short TE and TR, while T2-weighted images use longer TE and TR. T1, the longitudinal relaxation time, determines the rate at which excited protons return to their original state and realign with the external magnetic field. T2, the transverse relaxation time, determines the rate at which excited protons return to equilibrium or lose coherence with each other, leading to a loss of phase alignment perpendicular to the primary magnetic field. T1-weighted images typically emphasize adipose tissue, whereas T2-weighted images highlight both adipose tissue and water within the body.
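The effect of TR and TE on contrast can be illustrated with the commonly used simplified spin-echo signal model, S = PD · (1 − exp(−TR/T1)) · exp(−TE/T2), where PD is proton density. The sketch below is an illustration under stated assumptions, not a description of any scanner or of the sequences used in this work: the relaxation times for fat and water and the TR/TE values are rough, illustrative numbers in milliseconds, and the names `spin_echo_signal`, `TISSUES`, `SEQUENCES`, and `relative_signals` are hypothetical.

```python
import math

def spin_echo_signal(t1_ms, t2_ms, tr_ms, te_ms, proton_density=1.0):
    """Simplified spin-echo signal: PD * (1 - exp(-TR/T1)) * exp(-TE/T2)."""
    recovery = 1.0 - math.exp(-tr_ms / t1_ms)  # longitudinal (T1) recovery during TR
    decay = math.exp(-te_ms / t2_ms)           # transverse (T2) decay during TE
    return proton_density * recovery * decay

# Assumed, illustrative relaxation times (T1, T2) in milliseconds; not measured values.
TISSUES = {"fat": (250.0, 70.0), "water": (4000.0, 2000.0)}

# Assumed, illustrative timing parameters (TR, TE) in milliseconds.
SEQUENCES = {"T1-weighted": (500.0, 15.0), "T2-weighted": (4000.0, 100.0)}

def relative_signals():
    """Return the modelled signal for every sequence/tissue combination."""
    return {
        seq: {
            tissue: spin_echo_signal(t1, t2, tr, te)
            for tissue, (t1, t2) in TISSUES.items()
        }
        for seq, (tr, te) in SEQUENCES.items()
    }

for seq, signals in relative_signals().items():
    for tissue, value in signals.items():
        print(f"{seq:12s} {tissue:6s} {value:.3f}")
```

With these assumed values, fat gives a much stronger signal than water at short TR and TE, while water gives the strongest signal at long TR and TE. The model ignores sequence-specific effects, so it indicates only the general direction of the contrast, not the appearance of clinical images.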