1. Introduction
Ultrasound (US) is a versatile imaging technique that has become a fundamental resource in medical diagnosis and screening. Its wide acceptance by both physicians and radiologists underscores its importance and reliability. US is widely utilized because it is safe, affordable, non-invasive, capable of real-time visualization, and comfortable during the procedure. It stands out among other imaging techniques such as X-ray, MRI, and CT because of several significant benefits, including the absence of ionizing radiation, portability, ease of access, and cost-efficiency [
1]. US is applied in various medical fields, such as breast US, echocardiography, transrectal US, intravascular US (IVUS), prenatal diagnostic US, and abdominal US. It is particularly prevalent in obstetrics [
2]. However, despite its numerous benefits, US also presents certain challenges. These include lower image quality due to noise and artifacts, a high dependence on the operator or diagnostician's experience, and significant variations in performance across different institutions and manufacturers' US systems [
3].
Artificial intelligence (AI) methods, particularly deep learning models, have revolutionized ultrasound imaging by automating image processing and enabling automated disease diagnosis and detection of abnormalities. Convolutional neural networks (CNNs) have played a vital role in this transformation, demonstrating remarkable improvements in various medical imaging modalities [
4,
5,
6,
7,
8,
9]. However, the limitations of CNNs in capturing long-range dependencies and contextual information led to the development of vision transformers (ViTs) [
10] in image processing. The self-attention mechanism, a key component of the Transformer, can relate elements of a sequence to one another, thus facilitating the learning of long-range interactions.
Significant strides have been made in the vision community to integrate attention mechanisms into architectures inspired by CNNs. Recent research has shown that transformer modules can substitute for standard convolutions in deep neural networks by operating on a sequence of image patches, culminating in the development of ViTs.
In recent years, the integration of Vision Transformers into medical US analysis has encompassed a diverse range of methodological tasks. These tasks include traditional diagnostic functions like segmentation, classification, biometric measurements, detection, quality assessment, and registration, as well as innovative applications like image-guided interventions and therapy.
Notably, segmentation, detection, and classification stand out as the fundamental tasks, with widespread utilization across various anatomical structures in medical US analysis.
Figure 1 illustrates the distribution of methodological tasks and anatomical application categories in the context of medical ultrasound image analysis. The figure provides insights into the prevalence of these tasks across different anatomical structures, including but not limited to heart [
11], prostate [
12], liver [
13], breast [
14], brain [
15], lymph node [
16], lung [
17], pancreas [
18], carotid [
19], thyroid [
11], intravascular [
20], fetus [
21], urinary bladder [
22], gallbladder, other structures [
23].
The review in [
24] offers a broad perspective on the applications of vision transformers in medical imaging, encompassing various modalities and imaging techniques. Nonetheless, it lacks a specific focus on ultrasound imaging applications, which are crucial for understanding the unique challenges and opportunities within this specialized field.
This review seeks to fill the knowledge gap in the field of ultrasound imaging by providing a comprehensive examination of Vision Transformer-based AI methods that are specifically designed for this application. Given the unique attributes and diagnostic needs of ultrasound imaging, such an overview can offer invaluable insights for those working in this specialized area. The purpose of this review is to offer a thorough analysis of Transformer models developed specifically for ultrasound imaging and its associated analysis applications.
Figure 2 depicts the trend of published papers related to the intersection of ultrasound and vision transformers since 2017, based on the search query "Ultrasound AND vision transformers" on Google Scholar. The plot clearly demonstrates a substantial increase in the number of publications in this field over the specified period. The rising trend suggests a growing interest and research activity in the application of vision transformers in ultrasound imaging.
This survey paper offers an exhaustive examination of the utilization of transformers in the field of ultrasound imaging. As the first of its kind, it serves to connect the vision and ultrasound imaging communities in this rapidly advancing domain. This analysis involves an extensive review of 231 relevant papers to summarize recent advancements and identify the most pertinent ones for this topic.
Our review is organized by organ and provides an in-depth analysis of different tasks such as classification, segmentation, object detection, and image enhancement. In
Figure 3, the pie chart represents the distribution of the number of papers for each organ considered in the review, providing a comprehensive overview of the focus areas within the medical ultrasound analysis literature.
Finally, we offer a comprehensive critique of the current state of the field, pinpointing major challenges, emphasizing unresolved issues, and suggesting potential future paths. The structure of the paper is as follows.
Section 2 provides foundational information on the field, with an emphasis on the key principles that underpin transformers. Our review is then segmented by organ, covered in Sections 3.1 to 3.14. Lastly, we engage in a thorough discussion of the overall state of the field, identifying significant challenges, spotlighting open problems, and charting promising directions for the future.
2. Background
2.1. Fundamentals of Transformer
Transformers have revolutionized the field of Natural Language Processing (NLP) by providing a fundamental framework for processing and understanding language. Originally introduced by Vaswani et al. in 2017 [
25], transformers have since become a cornerstone in NLP, particularly with the development of models such as BERT, GPT-3, and T5.
Fundamentally, transformers are a neural network architecture that relies on attention rather than recurrence or convolutions. They excel at identifying long-range dependencies and relationships in sequential data, which makes them especially effective for language-related tasks. The groundbreaking aspect of transformers is their attention mechanism, which enables them to weigh the significance of different words in a sentence. This capability allows them to process and comprehend context more effectively than earlier NLP models.
In addition to their significant impact on NLP, transformers have also shown promise in the field of computer vision. Vision transformers (ViTs) have emerged as a novel approach for image recognition tasks, challenging the traditional convolutional neural network (CNN) architectures.
By applying the self-attention mechanism to image patches, vision transformers can effectively capture global dependencies in images, enabling them to understand context and relationships between different parts of an image. This has led to impressive results in tasks such as image classification, object detection, and image segmentation.
The introduction of vision transformers has also opened up opportunities for cross-modal learning, where transformers can be applied to tasks that involve both text and images, such as image captioning and visual question-answering. This demonstrates the versatility of transformers in handling multimodal data and their potential to drive innovation at the intersection of NLP and computer vision.
Overall, the application of transformers in computer vision showcases their adaptability and potential to revolutionize not only NLP but also other domains of artificial intelligence, paving the way for new advancements in multimodal learning and understanding of complex data.
2.1.1. Self-attention
In transformers, self-attention is a critical element of the attention mechanism that allows the model to focus on various segments of the input sequence and identify dependencies among them [
25]. This process involves converting the input sequence into three vectors: queries, keys, and values. The queries are employed to extract pertinent data from the keys, while the values are utilized to generate the output. The attention weights are determined based on the correlation between the queries and keys. The final output is produced by summing the weighted values.
This mechanism is especially potent in detecting long-term dependencies within the input sequence, thereby making it a valuable instrument for natural language processing and similar sequence-based tasks.
To elaborate, before the input sentence is fed into the self-attention block, it is first converted into embedding vectors. This process is known as "word embedding" or "sentence embedding," and it forms the basis for many Natural Language Processing (NLP) tasks. Positional information is then added to each embedding, because the position of a word can alter the meaning of the word or sentence; this allows the model to keep track of the position of each vector or word. Once the vectors are prepared, the next step is to calculate the similarity between any two vectors. The dot product is commonly used for this purpose because it is computationally efficient and yields a scalar similarity score. The similarity scores are then scaled and passed through the softmax function to obtain the attention weights. These weights are applied to the value vectors, adjusting the values according to the weights produced by the softmax function.
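To make the computation above concrete, the following minimal PyTorch sketch (with arbitrary, illustrative dimensions rather than values from any reviewed model) implements single-head scaled dot-product self-attention: query-key similarities are scaled, normalized with a softmax, and used to weight the values.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    q = x @ w_q          # queries, shape (batch, seq_len, d_k)
    k = x @ w_k          # keys,    shape (batch, seq_len, d_k)
    v = x @ w_v          # values,  shape (batch, seq_len, d_v)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # pairwise dot-product similarities
    weights = F.softmax(scores, dim=-1)             # attention weights, each row sums to 1
    return weights @ v                              # weighted sum of the values

# Example: a batch of 2 sequences, 16 tokens each, embedding dimension 64
x = torch.randn(2, 16, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = scaled_dot_product_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([2, 16, 64])
```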
2.1.2. Multi-head self-attention
Multi-head self-attention is a strategy used in transformers to boost the model's capacity to grasp a variety of relationships and dependencies within the input sequence [
25]. This methodology involves executing self-attention multiple times concurrently, each time with different sets of learned queries, keys, and values. Each set originates from a linear projection of the initial input, offering multiple unique viewpoints on the input sequence (
Figure 4).
Utilizing multiple attention heads allows the model to pay attention to different portions of the input sequence and collect various types of information simultaneously. Once the self-attention process is independently carried out for each head, the outcomes are amalgamated and subjected to a linear transformation to yield the final output. This methodology empowers the model to effectively identify intricate patterns and relationships within the input data, thereby enhancing its overall representational capability.
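As a brief sketch of this idea (the head count and dimensions below are illustrative assumptions), PyTorch's built-in multi-head attention layer runs several heads in parallel over learned projections of the same input and concatenates their outputs:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 8            # 8 heads, each operating on 64 / 8 = 8 dimensions
mhsa = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 16, embed_dim)       # (batch, sequence length, embedding dim)
# Self-attention: the same tensor supplies queries, keys, and values;
# each head applies its own learned projections before attending.
out, attn_weights = mhsa(x, x, x)
print(out.shape)           # torch.Size([2, 16, 64]) -- concatenated heads after the output projection
print(attn_weights.shape)  # torch.Size([2, 16, 16]) -- attention weights averaged over heads
```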
Multi-head self-attention is a key innovation in transformers, contributing to their effectiveness in handling diverse and intricate sequences of data, such as those encountered in natural language processing and other sequence-based tasks.
2.2. Transformer architecture
The architecture of transformers consists of both encoder and decoder blocks (
Figure 5), which are fundamental components in sequence-to-sequence models, particularly in tasks such as machine translation [
25].
Encoder: The encoder is responsible for processing the input sequence. It typically comprises multiple layers, each containing self-attention mechanisms and feedforward neural networks. In each layer, the input sequence is transformed through self-attention, allowing the model to capture dependencies and relationships within the sequence. The outputs from the self-attention are then passed through position-wise feedforward networks to further process the information. The encoder's role is to create a rich representation of the input sequence, capturing its semantic and contextual information effectively.
Decoder: The decoder, on the other hand, is tasked with generating the output sequence based on the processed input. Similar to the encoder, it consists of multiple layers, each containing self-attention mechanisms and feedforward neural networks. However, the decoder also includes an additional cross-attention mechanism that allows it to focus on the input sequence (encoded representation) while generating the output. This enables the decoder to leverage the information from the input sequence to produce a meaningful output sequence.
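A minimal sketch of this encoder-decoder flow, using PyTorch's stock nn.Transformer with illustrative dimensions, is given below; the decoder attends to its own (masked) target sequence and, via cross-attention, to the encoder's memory.

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=8,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(2, 20, 64)   # source sequence embeddings (e.g., the input sentence)
tgt = torch.randn(2, 15, 64)   # target sequence embeddings generated so far

# Causal mask so each target position can only attend to earlier target positions.
tgt_mask = model.generate_square_subsequent_mask(15)

out = model(src, tgt, tgt_mask=tgt_mask)   # encoder encodes src; decoder cross-attends to it
print(out.shape)  # torch.Size([2, 15, 64])
```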
The encoder-decoder architecture in transformers enables the model to effectively handle sequence-to-sequence tasks, such as machine translation and text summarization. It allows for capturing complex dependencies within the input sequence and leveraging that information to generate accurate and coherent output sequences.
2.3. Vision transformers
The achievements of transformers in natural language processing have influenced the computer vision research community, leading to numerous endeavors to modify transformers for vision-related tasks. Transformer-based models specifically designed for vision applications have been rapidly developed, with notable examples including the detection transformer (DETR) [
26], Vision Transformer (ViT), data-efficient image transformer (DeiT) [
27], and Swin-Transformer [
28]. These models represent significant advancements in leveraging transformers for computer vision and have gained recognition for their contributions to tasks such as object detection, image classification, and efficient image comprehension.
DETR: DETR, standing for DEtection TRansformer, has brought a major breakthrough in the realm of computer vision, particularly in the area of object detection tasks. Created by Carion et al. [
26], DETR represents a departure from conventional methods that depended heavily on manual design processes, and demonstrates the potential of transformers to revolutionize object detection in computer vision. It replaces the complex, hand-crafted object detection pipeline with a simpler, Transformer-based one.
DETR uses a transformer encoder to model the relationships between image features extracted by a Convolutional Neural Network (CNN) backbone. The transformer decoder processes a fixed small set of learned object queries, and a feed-forward network assigns class labels and bounding boxes to the resulting outputs. A set-based global loss enforces unique predictions through bipartite matching. By reasoning about the relationships between objects and the global image context, DETR directly produces the final set of predictions in parallel.
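To make the set-based matching step concrete, the sketch below (a simplified, hypothetical cost using only class probabilities, whereas DETR also includes box-distance and generalized-IoU terms) pairs predicted queries with ground-truth objects via the Hungarian algorithm, so that each ground-truth object is matched to exactly one prediction.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Class probabilities for 5 object queries over 3 classes (toy numbers).
pred_probs = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.3, 0.3, 0.4],
                       [0.6, 0.3, 0.1],
                       [0.2, 0.1, 0.7]])
gt_classes = [1, 2]            # two ground-truth objects with class labels 1 and 2

# Matching cost: negative probability of the ground-truth class
# (DETR additionally adds L1 and generalized-IoU box costs).
cost = -pred_probs[:, gt_classes]          # shape (num_queries, num_gt)
query_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(query_idx, gt_idx)))        # unique one-to-one assignment: [(1, 0), (4, 1)]
```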
ViT: Following the introduction of DETR, Dosovitskiy et al. [
10] introduced the Vision Transformer (ViT), a model that employs the fundamental architecture of the traditional transformer for image classification tasks. As depicted in
Figure 6, ViT operates similarly to a BERT-like encoder-only transformer, utilizing a series of vector representations to classify images.
The process begins with the input image being converted into a sequence of patches. Each patch is paired with a positional encoding, which encodes its spatial position to provide spatial information. These patches, along with a class token, are then fed into the transformer, which computes Multi-Head Self-Attention (MHSA) and generates learned embeddings of the patches. The class token's state at the ViT's output serves as the image representation. Lastly, a multi-layer perceptron (MLP) head is used to classify the learned image representation.
Moreover, ViT can also accept feature maps from CNNs as input for relational mapping, in addition to raw images. This flexibility allows for more nuanced and complex image analyses.
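The following sketch (patch size, depth, and dimensions are illustrative assumptions, not ViT's exact configuration) shows the core pipeline described above: patchify the image, add a class token and positional encodings, run a transformer encoder, and classify from the class-token output.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3, num_classes=3):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # Patch embedding as a strided convolution: each 16x16 patch becomes one token.
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)   # classification head (single layer here)

    def forward(self, x):
        tokens = self.to_patches(x).flatten(2).transpose(1, 2)    # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed  # prepend class token, add positions
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])                            # classify from the class-token state

model = TinyViT()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 3])
```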
DeiT: To address ViT's requirement for vast amounts of training data, Touvron et al. [27] introduced the Data-efficient Image Transformer (DeiT), which achieves high performance on small-scale data.
In the context of knowledge distillation, a teacher-student framework was implemented, incorporating a distillation token that is appended to the input sequence and enables the student model to learn from the output of the teacher model. The authors hypothesized that using a Convolutional Neural Network (CNN) as the teacher could assist in training the transformer as the student network, allowing the student to inherit the teacher's inductive bias.
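A simplified sketch of this hard-label distillation idea is given below; the teacher choice, loss weights, and token handling are illustrative assumptions rather than DeiT's exact recipe, and the ResNet-50 teacher is left untrained here purely for demonstration (DeiT uses a strong pre-trained CNN).

```python
import torch
import torch.nn.functional as F
import torchvision

# Pretend student outputs: in DeiT, the class token and the distillation token each feed a head.
student_cls_logits = torch.randn(8, 1000, requires_grad=True)   # from the class-token head
student_dist_logits = torch.randn(8, 1000, requires_grad=True)  # from the distillation-token head
labels = torch.randint(0, 1000, (8,))
images = torch.randn(8, 3, 224, 224)

# CNN teacher (frozen) provides the targets for the distillation token.
teacher = torchvision.models.resnet50(weights=None).eval()
with torch.no_grad():
    teacher_labels = teacher(images).argmax(dim=1)   # "hard" pseudo-labels from the teacher

# Class token learns from the ground truth; distillation token learns from the teacher.
loss = 0.5 * F.cross_entropy(student_cls_logits, labels) \
     + 0.5 * F.cross_entropy(student_dist_logits, teacher_labels)
loss.backward()
print(float(loss))
```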
Swin Transformer: Introduced by Liu et al. in 2021 [
28], the Swin Transformer is a transformer architecture known for its ability to generate a hierarchical feature representation. This architecture exhibits linear computational complexity relative to the size of the input image. It's particularly useful in various computer vision tasks due to its ability to serve as a versatile backbone. These tasks include instance segmentation, semantic segmentation, image classification, and object detection.
The Swin Transformer is based on the standard transformer architecture but computes self-attention within shifted windows, allowing it to process images at different scales. This design makes it more efficient than other transformer architectures, such as ViT, particularly when training data are limited.
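A brief sketch of the window partitioning and cyclic shift at the heart of this design (window size and feature dimensions below are arbitrary choices for illustration) is shown here; attention is computed independently within each window, and the shift lets neighboring windows exchange information in the next block.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping square windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

feat = torch.randn(2, 56, 56, 96)          # feature map: batch 2, 56x56 tokens, 96 channels

# Regular windows: self-attention would run inside each 7x7 window.
windows = window_partition(feat, window_size=7)
print(windows.shape)                        # torch.Size([128, 49, 96]) = 2 * 8 * 8 windows

# Shifted windows: cyclically shift the map by half a window before partitioning,
# so the next block's windows straddle the previous window boundaries.
shifted = torch.roll(feat, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted, window_size=7)
print(shifted_windows.shape)                # torch.Size([128, 49, 96])
```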
PVT: The Pyramid Vision Transformer (PVT) [
29] is a transformer variant adept at handling dense prediction tasks. It employs a pyramid structure, enabling detailed inputs (4 x 4 pixels per patch) and reducing the sequence length of the Transformer as it deepens, thus lowering computational cost. PVT comprises several key components: Dense Connections for learning complex patterns, Feedforward Networks for data processing, Layer Normalization for stabilizing learning, Residual Connections (Skip Connections) for mitigating the vanishing gradients problem, and Scaled Dot-Product Attention for calculating input data relevance.
CvT: The Convolutional Vision Transformer (CvT) [
30] is an innovative architecture that enhances the Vision Transformer (ViT) by integrating convolutions. This enhancement is realized through two primary alterations: a hierarchical structure of Transformers with a new convolutional token embedding, and a convolutional Transformer block that employs a convolutional projection. The convolutional token embedding layer provides the ability to modify the token feature dimension and the quantity of tokens at each level, allowing the tokens to depict progressively intricate visual patterns across wider spatial areas, similar to feature layers in Convolutional Neural Networks (CNNs). The convolutional Transformer block replaces the linear projection in the Transformer module with a convolutional projection, capturing local spatial context and reducing semantic ambiguity in the attention mechanism.
The CvT architecture has been found to exhibit superior performance compared to other Vision Transformers and ResNets on ImageNet-1k. Interestingly, the CvT model demonstrates that positional encoding, a crucial element in existing Vision Transformers, can be safely discarded. This simplification allows the model to handle higher resolution vision tasks more effectively.
HVT: The Hybrid Vision Transformer (HVT) [
31] is a unique architecture that merges the advantages of convolutional neural networks (CNNs) and transformers for image processing. It capitalizes on transformers' ability to concentrate on global relationships in images and CNNs' capacity to model local correlations, resulting in superior performance across various computer vision tasks. HVTs typically blend the convolution operation and the self-attention mechanism, enabling the exploitation of both local and global image representations. They have demonstrated impressive results in vision applications, providing a viable alternative to traditional CNNs, and have been successfully deployed in tasks such as image segmentation, object detection, and surveillance anomaly detection. However, the specific implementation of an HVT can vary significantly depending on the task and requirements, with some HVTs incorporating additional components or modifications to further boost performance. In essence, the Hybrid Vision Transformer is a potent tool for image processing tasks, amalgamating the strengths of both CNNs and transformers to achieve high performance.
3. Organs
In this section, we review the use of transformer-based methods for ultrasound images, organized by the most commonly studied organs.
3.1. Breast
Breast cancer is the most common cancer. Breast cancer screening programs and new treatment methods introduced since the 1990s have reduced its mortality [
32]. The gold-standard imaging method for breast cancer detection is mammography, but mammography uses ionizing radiation and is not well suited to detecting cancer in dense breasts. Sonography is another imaging modality routinely used for breast screening; it is harmless, inexpensive, portable, and provides better results in dense breasts.
Table 1 provides a detailed comparison of transformer-based models for breast ultrasound image analysis.
Two-dimensional ultrasound images are most commonly used, but automated breast ultrasound systems (ABUS) have recently become available that produce 3-D scans of the breast consisting of many 2-D slices. Owing to speckle noise, artifacts, and the varied shapes of breast nodules, analyzing these images is challenging and requires considerable experience.
There have been many attempts to use artificial intelligence to analyze breast ultrasound images, and in recent years vision transformers have been applied to this problem.
In [
33], a 3D U-Net with attention mechanisms and transformer layers is used to segment 3-D ABUS images. The transformer layers are inserted to capture long-range relations.
In [
14], a semi-supervised vision transformer is used for breast cancer classification in 2-D breast ultrasound images. The semi-supervised approach tackles the scarcity of labeled images in breast cancer databases, and an adaptive token sampler selects informative tokens to reduce computational cost.
A vision transformer with various augmentation strategies is used for breast nodule classification in [
34]. Three classes are considered: benign, malignant, and normal. The authors showed that transfer learning with a vision transformer performs comparably to or better than transfer learning with a CNN; they also found that augmentation contributes little when fine-tuning pre-trained vision transformers.
Transfer learning of a vision transformer pre-trained on ImageNet and histology datasets was used for classifying breast ultrasound images in [
35]. A Vision Transformer, a Vision Transformer with transfer learning, and a CNN with transfer learning were compared with their model, which obtained the best results. Patch sizes of 16×16 and 32×32 were compared, and the smaller patch size provided better results.
A Transformer combined with information bottlenecks based on the U-Net model (IB-TransUNet) is used in [
36] for breast ultrasound image segmentation. The bottlenecks remove redundant features and prevent overfitting.
[
37] provides a relatively large breast ultrasound dataset of 2405 images. Exploiting the fact that most benign tumors grow horizontally whereas most malignant tumors expand vertically into deeper tissue, the authors applied horizontal and vertical transformers to distinguish them without a predefined region of interest.
[
38] considered cross-image modelling and a cross-image dependency loss to take into account the common features of tumors across different images for segmentation. A CNN-based encoder and a transformer-based encoder are combined to capture near and far dependencies. The authors suggest that this idea could also be used to combine different types of information, such as elastography and attenuation.
A relatively large dataset consisting of 21,332 images is used in [
39] with a vision transformer for localization and BI-RADS classification of nodules. Its outputs are compared with the diagnoses of six experienced radiologists.
A single Transformer layer and multiple information bottleneck (IB) blocks are used in [
40] for segmenting breast ultrasound images, instead of stacking many transformer layers, which increases complexity and makes the model vulnerable to overfitting. Better results were obtained in comparison with TransUNet, which uses 12 transformer layers.
A deep supervised transformer full-resolution residual network was presented in [
41]. Its feature fusion is improved and irrelevant features are suppressed, while the deep supervision mechanism reduces the vanishing gradient problem. Augmentation is applied to the training dataset. Segmenting each image took 33 ms on a GPU, which is an acceptable time.
Supervised learning and unsupervised learning are used together in [
42] for segmentation and classification of breast ultrasound images.
3.2. Urinary bladder
Bacterial infection can cause cystitis in the urinary bladder, and ultrasound imaging is one of its diagnostic methods. The urinary bladder wall thickness (BWT) is estimated in ultrasound (US) images to diagnose cystitis [
22]. In this study, deep learning is used to segment the urinary bladder wall, and feature extraction followed by classification is used to detect cystitis; a CNN is compared with a vision transformer. Data from 250 subjects were used, half of whom had cystitis, and image augmentation was applied to increase the dataset to 1000 images.
A U-Net is used to segment the urinary bladder wall, eight features are derived from the segmented area, and the five most informative features are selected. A CNN model is applied to classify normal and cystitis cases, and the result is compared with those of a pre-trained CNN and a vision transformer.
The best CNN model achieved 95% precision, recall, F1 score, and accuracy, while the vision transformer achieved 94% precision, 89% recall, 93% F1 score, and 92.1% accuracy.
3.3. Pancreas
Pancreatic cancer is among the most difficult cancers to diagnose. Endoscopic ultrasound (EUS) is the best diagnostic method for this cancer, but EUS images are difficult to analyze. In [
18], a ViT-based dual self-supervised network (DSN) is presented for classifying EUS images into pancreatic and non-pancreatic cancer. First, an ROI is selected by a multi-operator transformation, and then the DSN transforms the unlabeled images into features. The study also gathered a large EUS-based pancreas image dataset (LEPset), comprising 3500 pathologically confirmed, labeled EUS images and 8000 unlabeled EUS images, which is publicly available [
43]. The method is also applied to BUID [
44] dataset.
3.4. Prostate
Prostate cancer is a common cancer among men, and early diagnosis aids treatment and reduces mortality. A common method for its initial diagnosis is transrectal sonography, but these images are difficult to label because of their low resolution, noise, and artifacts. Therefore, in [
45], an unsupervised network is designed to extract features from these images.
The ROI of the ultrasound image and the biopsy core image are used together to improve cancer detection [
12].
An "ROI-scale" network trained with self-supervised learning extracts features from small ROIs, and a "core-scale" transformer model derives a series of features from several ROIs along the needle trace of the prostate biopsy tissue to predict the tissue type. Attention maps are used to localize the cancer within the ROI.
A micro-ultrasound dataset gathered from 578 patients who underwent prostate biopsy is used to evaluate the method. The model outperforms ROI-scale-only models, achieving 80.3% AUROC, a statistically significant improvement over ROI-scale classification. The method is also compared with other studies on prostate cancer detection using various imaging modalities.
Table 2 illustrates an extensive comparison of transformer-based models applied to the analysis of prostate ultrasound images.
3.5. Thyroid
The prevalence of thyroid nodules in adults is between 19% and 67%, and there have been many attempts to segment and classify these nodules.
Table 3 presents a comprehensive comparison of transformer-based models used in the analysis of thyroid ultrasound images.
In [
46], a boundary attention transformer net (BTNet) is proposed; by incorporating both CNN and transformer branches, short- and long-range features are fused at different scales. A boundary attention block is designed to improve the learning of edge information.
In [
47], ultrasound images and infrared thermal images are used simultaneously. Features are extracted separately by a CNN encoder and a transformer encoder to capture local and global information, respectively, and are then fused using a vision transformer. Ultrasound images provide anatomical information, while infrared thermal images provide thermodynamic information about the nodules.
A hybrid model of CNN and ViT for diagnosing thyroid nodules is presented in [
48]. A GAN is used for data augmentation to overcome the shortage of data. The authors showed that the hybrid model, which combines ResNet50 and ViT_B16, performs better than a CNN or a ViT used independently.
To protect the parathyroid glands during thyroid surgery using ultrasound images, a network that includes a transformer to capture long-range dependencies is introduced in [
49]. It consists of two encoding branches and one decoding network for parathyroid gland segmentation; the two branches extract local and global features.
Contrast-enhanced ultrasound (CEUS) can be used for monitoring microvascular perfusion. [
50] provides a segmentation and classification method for thyroid nodules using CEUS images, based on a spatiotemporal transformer for CEUS analysis.
In [
51], shallow and deep features are fused for the classification of thyroid nodules. The ROI of the nodule is fed to a CNN to extract shape and texture features, while the whole image is fed to a Swin Transformer to derive deep features. These two groups of features are then combined and fed to a fully connected layer to classify the nodule.
3.6. Heart
Understanding the heart's complexity and dynamism presents considerable challenges due to its intricate and constantly changing characteristics. These characteristics include detailed structures such as chambers, valves, and vessels that undergo transformations throughout the cardiac cycle.
Despite issues such as speckle noise, shadows, and changes in patient anatomy, several deep learning models have been designed for applications like single image classification [63]. However, these models often fail to consider the dynamic nature of the heart and struggle with signal loss in ultrasound images. Transformers play a crucial role in ultrasound heart imaging, particularly in the analysis of complex temporal dependencies in patient data, which can enhance the prediction of various tasks. Their successful application in heart imaging is evident in various ways, including the detection of End-Systolic (ES) and End-Diastolic (ED) frames in ultrasound videos, heart chamber segmentation, predicting left ventricular ejection fraction (LVEF), aortic stenosis (AS) detection and severity classification, and assessing the size and function of the right ventricle (RV) in cardiovascular patients.
In cardiac imaging, transformers are employed to handle patient histories, which are structured as time series encompassing diverse clinical events; this enables the models to comprehend progressively intricate temporal relationships. The slow changes of the heart's structure and background in echo imaging, along with the resemblance of consecutive frames, emphasize the need to understand the local temporal context and minor spatial changes in the heart's chambers, valves, and walls for a thorough diagnosis.
Table 4 summarizes a comprehensive comparison of transformer-based models utilized in the analysis of heart ultrasound images.
Zeng et al. [
52] developed the Multi-Attention Efficient Feature Fusion Network (MAEF-Net), a system that automatically detects ES and ED frames and segments the left ventricle across all cardiac cycles to calculate the LVEF. The system employs a multi-attention mechanism for effective heartbeat feature capture and noise suppression, and integrates a deep supervision mechanism and spatial pyramid feature fusion for improved feature extraction. The method was tested on the publicly accessible EchoNet-Dynamic dataset and a private clinical dataset, showing promising results. The mean absolute error (MAE) for detecting ED and ES frames, as well as for predicting LVEF on the public EchoNet-Dynamic dataset, was particularly noteworthy.
Ahmadi et al. [
53] examined aortic stenosis severity by focusing on the temporal localization of the opening and closing of the valve, and the shape and mobility of the aortic valve. They applied Temporal Deformable Attention (TDA) in frame-level embedding to enhance the transformers’ understanding of locality and a temporal coherent loss to increase sensitivity to minor aortic valve movements. Finally, they adopted attention weights to identify echo frames with significant clinical importance, prioritizing these frames in the weighted aggregation for the final classification.
Vafeezadeh and his colleagues [
54] introduced the CarpNet network to account for the temporal information of all echocardiography video frames. This was accomplished by combining a transformer network with the Inception_Resnet_V2 convolutional network as a feature extractor. As a result, the performance of mitral valve classification based on the Carpentier criteria was enhanced, surpassing that obtained from single images.
A novel model is proposed by Qurri [
55] that merges the advantages of Convolutional Neural Networks (CNNs) and Transformers in the U-Net framework for segmenting the heart in ultrasound images from the CAMUS dataset. A Transformer is placed at the U-Net bottleneck to connect the encoder and the decoder and to capture long-range contextual information. The model introduces a new attention module named Three-Level Attention (TLA) on the decoder side, which consists of an Attention Gate (AG), channel attention, and a spatial normalization technique. The TLA module enriches the feature maps derived from the skip connections. On the encoder side, Squeeze-and-Excitation (SE) is applied to the skip connections leaving the encoder as another type of attention.
Luo et al. [
56] present a new method for segmenting the heart in images that utilizes multi-scale features and a position-aware attention mechanism. Their approach, based on an inverted pyramid structure, is aimed at extracting contextual information from low-resolution ultrasound images. The network is trained with images at different scales and combines prediction results to improve its contextual awareness. An attention module enhanced with positional encoding information is presented to help the network learn important positional clues, thereby increasing segmentation accuracy. This method is able to capture contextual information at various resolutions, which is especially helpful in comprehending the complexities of the heart’s structure and its varying changes throughout the cardiac cycle. The method is verified through rigorous experiments on the EchoNet-Dynamic dataset.
A Co-Attention Spatial Transformer Network (STN) that exploits interframe correlations to improve left ventricle motion tracking between ED and ES frames and strain analysis in noisy 3D echocardiography is introduced by Ahn et al. [
57]. This method enhances feature extraction through the utilization of feature cross-correlations, drawing inspiration from speckle tracking techniques. The team introduces an innovative temporal constraint aimed at normalizing the motion field, thereby facilitating the generation of smooth and realistic paths of cardiac displacement over time, all without the need for preconceived notions about cardiac motion. This objective is accomplished by integrating a temporal consistency regularization component into the loss function. Both a synthetic echocardiography dataset and an in vivo porcine 3D+time echocardiography dataset were utilized for thorough performance evaluations.
Tang et al. [
58] proposed a novel approach that merges a deformable model with a medical transformer neural network for image segmentation, addressing the challenge of data scarcity in medical imaging. Axial attention and a dual-scale training strategy are applied to mine long-range feature information, and an image augmentation strategy is used to further enhance the performance of deep neural networks in medical image processing.
Liao et al. [
59] suggested two different Transformer models for LV segmentation in echocardiography. One model utilizes Segformer, while the other combines Swin Transformer and K-Net. The performance of the models on challenging samples that were not easily segmented was also examined. The results confirmed the superiority of the proposed Transformer models over CNN models, even for samples that were not easily segmented by the CNN model. To achieve precise segmentation results, post-processing such as filtering out unnecessary parts is applied.
Zhao et al. [
60] have developed an Interactive Fusion Transformer Network (IFT-Net) for the quantitative analysis of pediatric echocardiography. This network constructs a dual-attention pyramid transformer (DPT) branch that models long-range dependencies across spatial and channel dimensions, thereby enhancing the learning of global context information. The IFT-Net also incorporates a bidirectional interactive fusion (BIF) unit that merges local and global features interactively, maximizing their preservation and refining the segmentation process. The BIF consists of two independent modules: the group feature learning (GFL) module and the channel squeeze-excitation (CSE) unit. The anatomical structures are segmented through the decoder network, and the clinical anatomical parameters are measured through key point positioning.
Fazry and his team [
61] introduced a new deep learning method for estimating the ejection fraction from echocardiogram videos, eliminating the need for left ventricle segmentation. This approach, known as UltraSwin, leverages hierarchical vision Transformers and Swin Transformers to extract spatio-temporal features. UltraSwin comprises two primary modules: the Transformers Encoder (TE), which serves as a feature extractor, and the EF Regressor, which functions as a regressor head. The method was evaluated on the EchoNet-Dynamic dataset.
Hagberg et al. [
62] created a deep learning model that employs Natural Language Processing (NLP) to evaluate the size and functionality of the right ventricle (RV) from echocardiographic images. They established a pipeline for automatic annotation of video loops, which formed the basis for constructing two image classification models. These models were trained on labels generated through a combination of manual annotation and NLP models. The models were then employed to assess RV function and size. The RV size and function models were 12-layer BERT models, which were pre-trained on a large Swedish dataset.
Ultrasound videos, which can have varying lengths and cardiac cycles of different durations, often require more sophisticated processing than traditional frame-by-frame approaches, because such approaches can overlook the temporal information encoded within the videos. To incorporate spatio-temporal support within deep convolutional networks, heuristic frame sampling methods are typically applied to create a stack of chosen frames from videos. In research conducted by Reynaud et al. [
11], a transformer architecture known as the Residual Auto-Encoder Network was utilized along with a BERT model to automatically identify the End-Systolic (ES) and End-Diastolic (ED) frames in ultrasound videos in order to compute the Left Ventricular Ejection Fraction (LVEF).
3.7. Fetal
Researchers have developed innovative methods for analyzing fetal obstetric ultrasound imagery, leveraging the power of transformer and Convolutional Neural Network (CNN) architectures. Rahman and his team [64] have enhanced the precision of identifying fetal planes from ultrasound images by training the Swin Transformer. They have also improved image quality through the use of Histogram Equalization and Fuzzy Logic-based contrast enhancement.
Table 5 provides a thorough evaluation of transformer-based models employed in the analysis of fetal obstetric ultrasound images.
A transformer-based image classification approach using a newly designed residual cross-variance attention (R-XCA) block named COMFormer was introduced for categorizing maternal-fetal and brain anatomical structures within 2D fetal US images [65]. The structures are divided into two primary categories: maternal-fetal (which includes the brain, abdomen, thorax, femur, and the mother's cervix among others), and brain anatomical structures (such as trans-ventricular, trans-cerebellum, trans-thalamic, and non-brain structures). A significant feature of the R-XCA block is the use of residual connections, which help mitigate the vanishing gradients problem and enhance the learning process of COMFormer. The performance of this architecture was assessed using a widely accessible dataset known as "BCNatal" for two separate classification tasks.
In another study, Arora et al. [66] explored the application of the vision transformer as a machine learning method to analyze the texture of placental ultrasound images during the first, second, and third trimesters of pregnancy. This was achieved through a prospective observational study that involved the collection of 2D placental US images at different stages of pregnancy.
Chen and his team [67] have introduced the Children Intussusception Diagnosis Network (CIDNet), a comprehensive artificial intelligence algorithm designed for swift diagnosis of intussusception in children using ultrasound images. The system utilizes a transformer-based approach and a Multi-Instance Deformable Transformer Classification (MI-DTC) module, which includes a pre-processing component. This module is engineered to precisely identify and locate abnormal regions related to intussusception in ultrasound images. The team also incorporated several Convolutional Neural Network (CNN)-based algorithms as the backbone networks.
Qiao and colleagues [68] proposed a dual-path chain multi-scale gated axial-transformer network (DPC-MSGATNet) that models both global dependencies and local visual cues in fetal US four-chamber (FC) views to segment the heart chambers. The model enables precise segmentation of the four chambers, assisting clinicians in analyzing cardiac morphology and aiding in the diagnosis of fetal congenital heart defects (CHD). The DPC-MSGATNet consists of a local and a global branch that operate concurrently on an entire FC view and on image patches to learn multi-scale representations. To enhance the interactions between these branches, an interactive dual-path chain gated axial-transformer (IDPCGAT) module has been designed.
In a landmark study [69], Płotka et al. introduced an innovative system for predicting fetal birth weight (FBW) known as BabyNet. This system leverages multimodal data and a visual data processing component, effectively integrating Transformers and CNNs. The hybrid model enhances the 3D ResNet-18 architecture by incorporating a Residual Transformer Module (RTM), which refines features through a global self-attention mechanism and residual connections and facilitates both local and global feature representation. The architecture of BabyNet was further developed in their subsequent research [71]. The convolutional component identifies local image patterns and interactions, while the transformer component models long-term dependencies and relationships. A module implemented in the deeper layers of BabyNet conditionally shifts feature maps based on non-imaging data, such as gestational age. Following up on their initial work, Płotka et al. unveiled BabyNet++ [70], a network specifically engineered for FBW prediction using multimodal data. This network uses a custom RTM and incorporates Dynamic Affine Feature Transform Maps (DAFT) to efficiently integrate clinical data within the model structure. The approach evaluates 2D+t spatio-temporal features in fetal US videos together with tabular clinical data.
Yang et al. [
21] proposed a one-stage network for automatic measurement of fetal head circumference (HC) using ultrasound images, without any post-processing. This system detects the fetal head position and ellipse parameters utilizing an anchor-free method. Their network combines a simple transformer with a CNN to extract global and local features, and uses a soft stage-wise regression (SSR) strategy and an IOU loss term to improve the accuracy of rotating elliptic object detection. The network is the first of its kind to directly measure fetal HC, marking a significant advancement in the field.
Finally, Zhao et al. [72] designed a landmark retrieval-based method for guiding US-probe movement, which constructs a set of landmarks around a virtual 3D fetal model and compares the current ultrasound image to the landmarks’ global descriptors using a deep neural network (DNN) model. Their method uses a Transformer-VLAD network to learn the global descriptors, and avoids human annotation by using a KD-tree search of 3D probe positions to generate training data in a self-supervised way. This approach is intuitive and suitable for human operators, and it avoids costly human annotation.
3.8. Carotid
Atherosclerosis, a common cause of ischemic heart disease and stroke, is typically monitored by physicians through the analysis of various anatomical and biomechanical properties of carotid plaques over several cardiac cycles. Ultrasound (US) imaging plays a crucial role in this process, providing a non-invasive method for visualizing, evaluating, and screening carotid atherosclerotic plaque. It enables radiologists to accurately segment these plaques and extract key features such as size, shape, and echo strength, thereby significantly improving early diagnosis and treatment strategies for carotid atherosclerosis. Despite the computational challenges associated with using transformers for analyzing carotid US videos, the advent of several transformer-based networks for ultrasound medical video analysis signals a promising advancement in this field.
Lin et al. [73] developed a model called the U-shaped CSWin Transformer (U-CSWT) for automatically segmenting the lumen-intima boundary (LIB) and media-adventitia boundary (MAB) of the carotid artery (CA) in 3-D ultrasound images. The U-CSWT, which is composed of hierarchical CSWT modules in both its encoder and decoder, is designed to extract comprehensive global context information from the 3-D image. Its U-shaped structure and the inclusion of the CSWin transformer in the encoder and decoder allow it to model long-range dependence while reducing the model's computational complexity. This process involved descriptor learning via contrastive learning, using self-constructed anchor-positive-negative ultrasound image pairs.
Li et al. [74] proposed a new video analysis transformer-based network, known as BP-Net, which is guided by target boundary and perfusion features and is designed to assess the integrity of the fibrous cap using B-mode US and contrast-enhanced US (CEUS) videos. Building on their previously proposed plaque auto-tracking network, they introduced a plaque edge attention module and a reverse mechanism to focus the dual video analysis on the fibrous cap of plaques. To extract the most valuable features from the fibrous cap, they proposed a feature fusion module. Finally, they integrated multi-head convolution attention into a transformer-based network to evaluate the integrity of fibrous caps accurately. This approach captures both semantic features and global context information.
Lastly, Hu et al. developed the RMFG_Net [
19], a network designed for automatic segmentation of atherosclerotic carotid plaques in ultrasound videos. This network uses a transformer-based algorithm for stable plaque positioning, extracts spatial and temporal features across video frames for high-quality segmentation, integrates a spatial-temporal feature filter to suppress noise and enhance target area detail, applies multi-layer gated computing for feature fusion and adequate feature map aggregation, and is trained end-to-end, eliminating the need for additional operations. Furthermore, it can process at a speed of 68 frames per second.
Table 6 illustrates a detailed assessment of transformer-based models used in analyzing carotid ultrasound images.
3.10. Lung
Various studies have explored the use of transformers for the lung, particularly in relation to COVID-19 data. Nehary et al. [75] discuss the application of deep learning and hand-crafted features for classifying lung ultrasound images to detect COVID-19. Their proposed method fuses Histogram of Oriented Gradients (HOG) features with abstract features from deep learning models such as VGG16 and the Vision Transformer (ViT) to enhance detection accuracy. The effectiveness of this fusion technique is demonstrated on a public COVID-19 dataset, showing improved classification accuracy when HOG features are fused with abstract features from VGG16 and ViT.
Perera et al. introduced POCFormer [
17], a lightweight transformer architecture designed for COVID-19 detection using point-of-care ultrasound. The architecture, consisting of a vision transformer and a linear transformer, is compact with around 2 million parameters, making it suitable for deployment on low-power devices like smartphones. It can run in real-time and has potential for use in rural and underserved areas. POCFormer outperforms other architectures in binary and multiclass classification experiments, demonstrating high accuracy in distinguishing between COVID-19 and healthy patients, as well as COVID-19 and bacterial pneumonia.
Xing et al. [76] proposed a semi-supervised, frame-to-video-based lung ultrasound (LUS) scoring model for diagnosing respiratory diseases. The model consists of two components: a frame-level (FL) scoring model and a video-level (VL) scoring model. The FL model uses a dual attention vision transformer (DaViT) to extract local and global features from LUS frames, which are manually scored by clinicians. The VL model employs a frame-to-video approach, using a 40-channel input with a patch embedding layer and transferring DaViT parameters from the FL model to each channel. It uses a long-short term memory (LSTM) module for correlation analysis of the 40-channel output and a final MLP head for video scoring. The model achieves high accuracy in both FL and VL scoring, with 95.08% and 92.59% accuracy, respectively.
Table 7 provides a comprehensive evaluation of transformer-based models that have been applied in the analysis of lung ultrasound images.
3.11. Liver
Transformer models have been used for tasks related to liver ultrasound, specifically for the classification of liver lesions. One notable example is the TransLiver model, a hybrid Transformer designed for multi-phase liver lesion classification.
Zhang et al. [77] discussed the use of deep learning techniques, specifically a Vision Transformer (ViT)-based classification method, for the automatic recognition of standard liver sections in ultrasound images. The research aims to address subjective errors in traditional manual scanning and to standardize the medical examination of the liver in adults. The authors collected 12 common liver ultrasound standard sections and trained the ViT model on these, achieving an accuracy of 92.9% on the available ultrasound dataset. The ViT model outperforms other deep learning frameworks and shows promising results for the recognition of standard liver sections. The research contributes to the study of adult organs, as previous research has mainly focused on fetal organs. The dataset used in this research was partitioned into a 4:1 ratio for model training, validation, prediction, and testing. The study also uses visual attention mechanisms and targeted histogram equalization to enhance the recognition and contour information in the ultrasound images.
Table 8 gives a comprehensive overview of how transformer-based models have been utilized in the analysis of liver ultrasound images.
Dadoun et al. [
13] conducted a study on the use of deep learning networks, specifically Faster R-CNN and DETR, for detecting, localizing, and characterizing focal liver lesions (FLLs) on abdominal ultrasound images. The networks were trained on a dataset of 1026 patients and tested on 48 additional patients. DETR outperformed Faster R-CNN and was comparable to or exceeded the performance of three caregivers in detecting FLLs, localizing lesions, and characterizing FLLs as benign or malignant. The study suggests that these networks, particularly DETR, could assist non-expert caregivers in screening patients at high risk of malignancy, potentially improving early detection of hepatocellular carcinoma. However, the study had limitations, including a limited number of images in the test set and its retrospective nature. Further research is needed to validate these findings and explore the integration of clinical information into the screening process.
Zhang et al. [78] introduced the use of an Ultra-Attention structured perception strategy for the automatic recognition of standard liver sections in ultrasound imaging. This deep learning approach, inspired by natural language processing attention mechanisms, amplifies small features in ultrasound images that may otherwise be overlooked. The Ultra-Attention model, guided by a convolutional neural network, addresses the challenge of accurately identifying standard sections by considering the coupling of anatomic structures within the images. Rather than focusing solely on local areas like traditional convolutional neural networks, it uses a modularized approach in which each local piece of information contributes to the final decision. The Ultra-Attention structure consists of multiple encoder layers, each performing attention operations on the ultrasound images. The model incorporates dropout mechanisms and Part-Transfer Learning to enhance robustness and convergence. With a classification accuracy of 93.2%, the Ultra-Attention model outperforms traditional convolutional neural network methods, offering a promising solution for improving the accuracy and efficiency of ultrasound diagnosis.
3.12. IVUS
Transformer models have also found applications in the analysis of intravascular ultrasound (IVUS) images.
Huang et al. [79] proposed a framework, POST-IVUS, for automated segmentation of lumen and external elastic membrane (EEM) boundaries in intravascular ultrasound (IVUS) images. This framework addresses the challenges of IVUS segmentation, such as inter-observer variability and the presence of artifacts, by combining Fully Convolutional Networks (FCNs) with temporal context-based feature encoders, a selective transformer module, and a temporal constraining and fusion module. The POST-IVUS framework has shown superior performance compared to state-of-the-art methods, with a Jaccard measure of 0.92 for lumen and 0.94 for EEM segmentation. It has been integrated into a software called QCU-CMS for user-friendly automated IVUS image segmentation, demonstrating its potential for practical applications.
The proposed framework for IVUS segmentation includes two temporal context-based feature encoders, the rotational alignment encoder and the visual persistence encoder, which focus on relevant vessel movement and encode residual visual features, respectively. The Selective Transformer module in the STR U-Net enhances the inference ability of the segmentation model, particularly in regions with little visual information, by mimicking the perceptual organization property of human vision and capturing long-range dependencies and global context. The Swin Transformer, a key component of the framework, is used as the backbone of the inference branch in the STR U-Net. It introduces connections between areas by dividing images into different patches and calculating hierarchical representations, thereby improving the accuracy of boundary prediction in challenging areas.
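As a rough sketch of a Swin-backbone segmentation model (not the POST-IVUS implementation, which additionally uses the temporal encoders and the selective transformer module described above), the following example extracts Swin features with torchvision and decodes them into a segmentation map; the decoder and class count are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class SwinSeg(nn.Module):
    """Schematic Swin-backbone segmentation head (illustrative, not the published STR U-Net)."""
    def __init__(self, num_classes=2):
        super().__init__()
        backbone = models.swin_t(weights=models.Swin_T_Weights.IMAGENET1K_V1)
        self.encoder = backbone.features          # outputs (B, H/32, W/32, 768), channels-last
        self.decoder = nn.Sequential(
            nn.Conv2d(768, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, x):                          # x: (B, 3, 224, 224)
        feats = self.encoder(x)                    # (B, H/32, W/32, 768)
        feats = feats.permute(0, 3, 1, 2)          # to (B, 768, H/32, W/32)
        logits = self.decoder(feats)
        return nn.functional.interpolate(logits, size=x.shape[-2:],
                                         mode="bilinear", align_corners=False)
```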
The Multilevel Structure-Preserved Generative Adversarial Network (MSP-GAN) [20] was proposed for domain adaptation in intravascular ultrasound (IVUS) analysis. It addresses the poor generalizability of IVUS analysis methods across diverse IVUS datasets by integrating a vision transformer, a superpixel-wise multiscale contrastive (SMC) constraint, and an uncertainty-aware teacher-student consistency (TSC) constraint, which together preserve structures at the global, local, and fine levels. The vision transformer, incorporated into the generator, maintains global pathology information during image translation by capturing long-range dependencies and the global context of the images; this increases the structural similarity between synthetic and source images and improves the accuracy of downstream IVUS analysis tasks such as vessel and lumen segmentation and stenosis-related parameter quantification. The transformer-incorporated generator combines the strengths of convolutional networks and transformers, capturing both local interactions and long-range dependencies in IVUS images: a convolution-based encoder learns visual features efficiently, while a vision transformer models the complex relations among feature components and extracts global information. The outputs of the encoder and transformer are fused into context-rich features, which are then decoded into the synthetic image.
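The sketch below illustrates the general idea of such a hybrid generator: a convolutional encoder for local features, a transformer over the flattened feature tokens for global context, and a decoder that maps the fused representation back to an image. Layer sizes and module names are assumptions and do not correspond to the published MSP-GAN.

```python
import torch
import torch.nn as nn

class HybridGenerator(nn.Module):
    """Schematic CNN-encoder + transformer generator (illustrative, not MSP-GAN)."""
    def __init__(self, channels=1, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.encoder = nn.Sequential(                       # local feature learning
            nn.Conv2d(channels, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, d_model, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)   # global context
        self.fuse = nn.Conv2d(2 * d_model, d_model, 1)                # merge local + global
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(d_model, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, channels, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, x):                                    # x: (B, 1, H, W), H and W divisible by 8
        local = self.encoder(x)                              # (B, C, H/8, W/8)
        b, c, h, w = local.shape
        tokens = local.flatten(2).transpose(1, 2)            # (B, HW, C)
        global_feats = self.transformer(tokens)              # long-range dependencies
        global_feats = global_feats.transpose(1, 2).reshape(b, c, h, w)
        fused = self.fuse(torch.cat([local, global_feats], dim=1))
        return self.decoder(fused)                           # synthetic image
```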
Table 9 provides an exhaustive summary of the application of transformer-based models in the analysis of IVUS ultrasound images.
3.13. Gallbladder
Basu et al. proposed RadFormer [23], a novel deep neural network architecture for accurate and interpretable detection of gallbladder cancer (GBC) from ultrasound (USG) images. RadFormer combines global and local attention mechanisms using a transformer-based approach. It outperforms human radiologists in detection accuracy and provides interpretable explanations for its decisions. These explanations are based on visual bag-of-words style feature embeddings that can be mapped to radiological features used in the medical literature. The model demonstrates high sensitivity and specificity in detecting GBC from USG images and allows for the discovery of new visual features relevant to GBC diagnosis.
RadFormer uses a global branch to extract deep features from the entire ultrasound image and a local branch to generate a region of interest (ROI) and extract deep features using a bag-of-features (BOF) technique. These features are fused using a transformer-based architecture, enhancing GBC detection performance. RadFormer’s performance is evaluated against several baseline models, demonstrating superior accuracy. By mapping the neural features to radiological lexicons, RadFormer provides precise and interpretable explanations for GBC detection. The architecture addresses challenges of ultrasound images, such as sensor noise, artifacts, and visual similarities between non-malignant regions and cancerous gallbladder. Overall, RadFormer presents a significant advancement in the field of medical imaging and cancer detection.
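A simplified global-plus-local fusion classifier in this spirit is sketched below; the ResNet-18 branches, the two-token fusion, and the three-class head are illustrative stand-ins, and RadFormer's ROI generation and bag-of-features embedding are not reproduced.

```python
import torch
import torch.nn as nn
from torchvision import models

class GlobalLocalFusion(nn.Module):
    """Schematic global + local branch fusion with a transformer (illustrative, not RadFormer)."""
    def __init__(self, d_model=512, num_classes=3):
        super().__init__()
        resnet_g = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        resnet_l = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        self.global_branch = nn.Sequential(*list(resnet_g.children())[:-1])  # whole image
        self.local_branch = nn.Sequential(*list(resnet_l.children())[:-1])   # ROI crop
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, num_classes)   # e.g., normal / benign / malignant

    def forward(self, image, roi_crop):
        g = self.global_branch(image).flatten(1)             # (B, 512) global features
        l = self.local_branch(roi_crop).flatten(1)            # (B, 512) local ROI features
        tokens = torch.stack([g, l], dim=1)                    # (B, 2, 512) token sequence
        fused = self.fusion(tokens).mean(dim=1)                # attention-based fusion
        return self.classifier(fused)
```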
3.14. Other-Synthetic
This section delves into the wider application of transformer technology beyond the organ-specific ultrasound analyses discussed in the previous sections. The field of imaging and tracking has seen substantial advances through the use of transformer networks. For example, Qu et al. [87] developed the Complex Transformer Network (CTN), which integrates complex self-attention (CSA) and complex convolution modules for beamforming in zero-degree single-angle plane-wave imaging (PWI). This technique maps delayed in-phase and quadrature (IQ) data directly to an image, with the CSA module assigning dynamic weights to reconstruction features based on their coherence.
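One simple way to realize attention over complex-valued IQ tokens is sketched below, using conjugate-transpose similarity scores and a softmax over their magnitudes; this is an illustrative formulation without learned projections, not the CSA module of [87].

```python
import torch

def complex_self_attention(x):
    """Simple complex-valued self-attention variant (illustrative, not the CSA of [87]).
    x: complex tensor of shape (B, N, D) holding delayed IQ tokens."""
    d = x.shape[-1]
    scores = torch.matmul(x, x.conj().transpose(-2, -1)) / d ** 0.5   # (B, N, N), complex
    weights = torch.softmax(scores.abs(), dim=-1)                     # real-valued attention weights
    return torch.matmul(weights.to(x.dtype), x)                       # reweighted complex tokens

# Example: batch of 2 sequences of 64 IQ tokens with 32 channels each (assumed shapes)
iq = torch.randn(2, 64, 32, dtype=torch.complex64)
out = complex_self_attention(iq)
```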
Table 10 gives a thorough review of the utilization of transformer-based models in the examination of Other-Synthetic ultrasound images.
In the realm of microbubble (MB) localization, Liu et al. [15] introduced a Swin-transformer-based neural network for end-to-end mapping of MBs. They further refined this method with a Super-Resolution Modified Transformer (SR-MT), improving MB localization and scaling up the input dimension. The proposed transformer-based network replaces the conventional MB localization step in the generation of ultrasound super-resolution (US-SR) images.
Another transformer-based approach, the Depthwise Separable Convolutional Swin Transformer, was introduced by Liu et al. [16] for cervical lymph-node-level classification in ultrasound images. The network adds a depthwise separable convolution branch to the self-attention mechanism to capture discriminative local features. To tackle data imbalance, a new loss function was proposed to enhance the performance of the classification network.
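The two ingredients can be illustrated as follows: a depthwise separable convolution block of the kind used to inject local features, and, as a stand-in for the paper's imbalance-aware objective (whose exact form is not reproduced here), a class-weighted cross-entropy with assumed class counts.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise + pointwise convolution, the kind of branch used to inject
    local features alongside self-attention (illustrative dimensions)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Illustrative stand-in for an imbalance-aware objective: class-weighted
# cross-entropy with weights inversely proportional to assumed class frequencies.
class_counts = torch.tensor([800.0, 150.0, 50.0])       # hypothetical counts per lymph-node level
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)
```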
Yan et al. [80] utilized a transformer-based network for motion prediction in their needle tip tracking system. This approach helped them estimate the target's current position from its past position data, addressing the issue of the target's temporary disappearance. The transformer network processes the entire data sequence at each instance, capturing both long- and short-term dependencies to fully understand the internal relationships within the input data sequence.
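A minimal version of such a sequence-to-position predictor is sketched below; the history length, model width, and coordinate format are illustrative assumptions rather than the configuration used in [80].

```python
import torch
import torch.nn as nn

class TipMotionPredictor(nn.Module):
    """Schematic transformer that predicts the next needle-tip position
    from a short history of past (x, y) positions; dimensions are illustrative."""
    def __init__(self, d_model=64, nhead=4, num_layers=2, history_len=16):
        super().__init__()
        self.embed = nn.Linear(2, d_model)                      # (x, y) -> token
        self.pos = nn.Parameter(torch.zeros(1, history_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 2)                       # predicted next (x, y)

    def forward(self, past_xy):                                 # past_xy: (B, history_len, 2)
        tokens = self.embed(past_xy) + self.pos
        encoded = self.encoder(tokens)                          # attends over the whole history
        return self.head(encoded[:, -1])                        # read off the last token
```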
Zhao et al. [81] trained an automatic segmentation network based on the Medical Transformer (MedT) for ultrasound images of the distal humeral cartilage. This work represents the first application of multiple deep learning algorithms to dynamic, volumetric ultrasound images for distal humeral cartilage segmentation, which is critical for minimally invasive surgeries.
Zhou et al. introduced the Lightweight Attention Encoder–Decoder Network (LAEDNet) [82], an innovative and efficient asymmetrical encoder–decoder network, for segmentation on the Head Circumference Ultrasound Images Dataset (HCUS).
Katakis et al. [83] evaluated the potential of vision transformers for the automated segmentation of a muscle's cross-sectional area (CSA) and its mean grey-level value, aiming to estimate the echogenicity of the muscle architecture.
Lo [84] employed a pre-trained Vision Transformer (ViT) model to extract image features for the purpose of diagnosing septic arthritis from gray-scale and Power Doppler ultrasound images. Leveraging the deep learning capabilities of the ViT, the system autonomously and efficiently gathers significant image features for classification purposes.
Zhang et al. [85] have introduced a novel Pyramid Convolutional Transformer (PCT) architecture for the segmentation of parotid gland tumors. This architecture employs a shrinking pyramid framework to capture dense pixel features effectively and leverages multi-scale image dependencies. A Fusion Attention Transformer CNN (FTC) block is also incorporated to manage the complex and variable contour characteristics of parotid gland tumors. This block merges the Transformer with CNN, forming a dual branch structure to extract both global and local image features.
Finally, Manzari et al. [86] suggested an innovative hybrid model that integrates the strengths of CNNs and Transformers, mitigating the high quadratic complexity of the self-attention mechanism. They use an efficient convolution operation to attend to information across various representation spaces. Additionally, they aim to enhance the model's resistance to adversarial attacks by learning smoother decision boundaries. The hybrid model, known as Medical Vision Transformer (MedViT), combines local representations and global features using robust components. A novel patch moment changer augmentation has also been developed to add diversity and affinity to the training data.
4. Discussion
Transformers, a convolution-free neural network architecture, are designed to excel at capturing long-range dependencies within sequential data, making them well suited to language and computer vision tasks. They rely on an attention mechanism, specifically self-attention, which allows the model to focus on different parts of the input sequence and identify relationships between them; this makes self-attention a powerful tool for tasks involving sequences, such as video processing. Transformers also employ multi-head self-attention, which enhances the model's ability to discern varied relationships and dependencies within the input sequence: it provides multiple distinct viewpoints on the input, enabling the model to attend to different segments of the sequence and capture a variety of information simultaneously.
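For reference, the standard formulations of scaled dot-product self-attention and multi-head self-attention, as introduced in the original Transformer, are:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)\,W^{O},\qquad \mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},\,KW_i^{K},\,VW_i^{V})$$

where $Q$, $K$, and $V$ are the query, key, and value matrices obtained by linearly projecting the input tokens, $d_k$ is the key dimension, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learned projection matrices.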
Transformers are surpassing conventional methods in medical image analysis. Despite their promising results, there are still challenges due to the high computational demand of transformers. Therefore, improvements to the transformer architecture are needed to make it more lightweight and efficient.
Despite their potential, transformers face challenges in ultrasound imaging due to data sparsity and the complexity of medical data. However, they offer new possibilities for accurate and efficient analysis, emphasizing the need for future research to address these challenges.
Research using transformers is rapidly growing in the field of ultrasound medical image analysis. Most current transformer-based methods can be easily applied to ultrasound imaging problems without significant changes.
Current transformer-based models for ultrasound diagnosis only use 2-D cross-sectional images for making predictions. However, 3-D ultrasound data or spatiotemporal data could potentially improve diagnostic accuracy. Despite the promising outcomes demonstrated by transformer methods for ultrasound, the advancement of AI-powered ultrasound lags behind that of AI-powered CT and MRI. This is primarily due to the significant intra- and inter-reader variability encountered during the acquisition and interpretation of ultrasound images.
5. Conclusions
Given the unique characteristics and diagnostic needs of ultrasound imaging, a comprehensive review of AI methods based on vision transformers, specifically designed for ultrasound imaging, can offer crucial insights for researchers and practitioners in this particular field. Hence, this review seeks to fill this void by providing an in-depth examination of Transformer models that have been specifically developed for ultrasound imaging and its related image analysis applications.
In conclusion, this review paper has provided a comprehensive overview of the application of vision transformers in medical ultrasound. It includes a detailed analysis of more than 230 relevant studies, highlighting the most recent developments in this field.
We began by elucidating the fundamental structures of transformers, followed by an introduction to the most significant architectures of vision transformers. Subsequently, we categorized the paper based on different organs, providing a structured approach to understanding the diverse applications of these technologies. We reviewed the most pivotal papers that have utilized vision transformers in medical ultrasound, highlighting the transformative impact of these methodologies in the field. As the landscape of medical ultrasound continues to evolve, the role of vision transformers is anticipated to become increasingly prominent, paving the way for more sophisticated and precise diagnostic tools. This review underscores the potential of vision transformers to revolutionize medical ultrasound, marking a significant stride towards the future of healthcare.