1. Introduction
The AI research landscape has been transformed by foundation models trained on vast data [1, 2, 3, 4]. Among these, Segment Anything (SAM) [5] stands out as a highly successful image segmentation model with demonstrated effectiveness in diverse scenarios. However, in our previous study, we found that SAM’s performance was limited in certain challenging low-level structural segmentation tasks, such as camouflaged object detection and shadow detection. To address this, in 2023, within two weeks of SAM’s release, we proposed SAM-Adapter [6, 7], which leverages the power of the SAM model to deliver better performance on these challenging downstream tasks. SAM-Adapter, with its training and evaluation code and checkpoints made publicly available, has become a valuable resource that many researchers in the community have experimented with and built upon, demonstrating its effectiveness on a variety of downstream tasks.
Now, the research community has pushed the boundaries further with the introduction of an even more capable and versatile successor to SAM, known as Segment Anything 2 (SAM2). With further enhancements to its network architecture and training on an even larger visual corpus, SAM2 has certainly piqued our interest. This naturally leads to two questions:
Do the challenges faced by SAM in downstream tasks persist in SAM2?
Can we replicate the success of SAM-Adapter and leverage SAM2’s more powerful pre-trained encoder and decoder to achieve new state-of-the-art (SOTA) results on these tasks?
In this paper, we answer both questions with a resounding "Yes." Our experiments confirm that the challenges SAM encountered in downstream tasks do persist in SAM2, owing to the inherent limitations of foundation models: training data cannot cover every domain, and working scenarios vary [1]. However, we have devised a solution to address this challenge. By introducing SAM2-Adapter, we create a multi-adapter configuration that leverages SAM2’s enhanced components to achieve new SOTA results in tasks including medical image segmentation, camouflaged object detection, and shadow detection.
Like SAM-Adapter [6, 7], this work is the first attempt to adapt the large pre-trained segmentation model SAM2 to specific downstream tasks and achieve new SOTA performance. SAM2-Adapter builds on the strengths of the original SAM-Adapter while introducing significant advancements.
SAM2-Adapter inherits the core advantages of SAM-Adapter, including:
Generalizability: SAM2-Adapter can be directly applied to customized datasets of various tasks, enhancing performance with minimal additional data. This flexibility ensures that the model can adapt to a wide range of applications, from medical imaging to environmental monitoring.
Composability: SAM2-Adapter supports the easy integration of multiple conditions to fine-tune SAM2, improving task-specific outcomes. This composability allows for the combination of different adaptation strategies to meet the specific requirements of diverse downstream tasks.
SAM2-Adapter enhances these benefits by adapting to SAM2’s multi-resolution hierarchical Transformer architecture. By employing multiple adapters working in tandem, SAM2-Adapter effectively leverages SAM2’s multi-resolution, hierarchical features for more precise and robust segmentation, maximizing the potential of the already-powerful SAM2.
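To make this design concrete, below is a minimal PyTorch-style sketch of the multi-adapter idea under simplifying assumptions: the class names (PromptAdapter, MultiAdapterEncoder) and dimensions are illustrative placeholders rather than our released implementation, and the stand-in stages do not downsample as SAM2’s hierarchical encoder does. Each frozen encoder stage is paired with its own small trainable adapter that injects a task-specific prompt at that stage’s feature resolution.

import torch
import torch.nn as nn


class PromptAdapter(nn.Module):
    """Lightweight MLP adapter: projects task-specific prompt features
    (e.g., high-frequency cues) into one encoder stage's feature space."""

    def __init__(self, prompt_dim: int, stage_dim: int, hidden_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(prompt_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, stage_dim),
        )

    def forward(self, prompt: torch.Tensor) -> torch.Tensor:
        return self.net(prompt)


class MultiAdapterEncoder(nn.Module):
    """A frozen hierarchical encoder (stand-in for SAM2's image encoder)
    with one trainable adapter per stage."""

    def __init__(self, stages: nn.ModuleList, stage_dims, prompt_dim: int):
        super().__init__()
        self.stages = stages
        for p in self.stages.parameters():
            p.requires_grad = False  # pre-trained weights stay frozen
        self.adapters = nn.ModuleList(
            PromptAdapter(prompt_dim, d) for d in stage_dims
        )

    def forward(self, x: torch.Tensor, prompt: torch.Tensor):
        # prompt: (B, 1, prompt_dim), broadcast over spatial tokens, so the
        # task-specific signal is injected at every feature resolution.
        multi_res_feats = []
        for stage, adapter in zip(self.stages, self.adapters):
            x = stage(x)              # frozen pre-trained stage
            x = x + adapter(prompt)   # adapter injection at this resolution
            multi_res_feats.append(x)
        return multi_res_feats        # multi-resolution features for decoding


# Toy usage (these stages keep token count and width fixed for simplicity):
stages = nn.ModuleList(
    nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(4)
)
encoder = MultiAdapterEncoder(stages, stage_dims=[64] * 4, prompt_dim=16)
feats = encoder(torch.randn(2, 196, 64), torch.randn(2, 1, 16))

In practice, the prompt can be derived from task-specific cues such as the high-frequency components discussed in Section 2, and keeping the pre-trained encoder frozen keeps fine-tuning lightweight.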
We perform extensive experiments on multiple tasks and datasets, including ISTD [8] for shadow detection; COD10K [9], CHAMELEON [10], and CAMO [11] for camouflaged object detection; and Kvasir-SEG [12] for polyp segmentation (a medical image segmentation task). Benefiting from the capability of SAM2 and our SAM2-Adapter, our method achieves state-of-the-art (SOTA) performance on these tasks. The contributions of this work can be summarized as follows:
First, we are the first to identify and analyze the limitations of the Segment Anything 2 (SAM2) model in specific downstream tasks, continuing our earlier research on SAM.
Second, we propose the first adaptation approach, SAM2-Adapter, for adapting SAM2 to downstream tasks with enhanced performance. This method effectively integrates task-specific knowledge with the general knowledge learned by the large model.
Third, despite SAM2’s backbone being a plain model lacking specialized structures tailored to specific downstream tasks, our extensive experiments demonstrate that SAM2-Adapter achieves SOTA results on challenging segmentation tasks, setting new benchmarks and proving its effectiveness in diverse applications.
By further building upon the success of SAM-Adapter, SAM2-Adapter inherits its advantages and demonstrates the exceptional ability of the SAM2 model to transfer its knowledge to specific data domains, pushing the boundaries of what is possible in downstream segmentation tasks. We encourage the research community to adopt SAM2 as the backbone, in conjunction with our SAM2-Adapter, to achieve even better segmentation results in various research fields and industrial applications. We are releasing our code, pre-trained models, and data processing protocols at http://tianrun-chen.github.io/SAM-Adaptor/.
2. Related Work
Semantic Segmentation. In recent years, semantic segmentation has made significant progress, primarily due to remarkable advancements in deep-learning-based methods such as fully convolutional networks (FCN) [13], encoder-decoder structures [14, 15, 16, 17, 18, 19], dilated convolutions [20, 21, 22, 23, 24, 25], pyramid structures [22, 23, 26, 27, 28, 29], attention modules [30, 31, 32, 33, 34], and transformers [2, 35, 36, 37, 38]. Recent advancements have also improved SAM’s performance: [39] introduces a High-Quality output token and trains the model on fine-grained masks, while other efforts have focused on enhancing SAM’s efficiency for broader real-world and mobile use [40, 41, 42]. The widespread success of SAM has led to its adoption in various fields, including medical imaging [43, 44, 45, 46], remote sensing [47, 48], motion segmentation [49], and camouflaged object detection [50]. Notably, our previous work SAM-Adapter [6, 7] tested camouflaged object detection, polyp segmentation, and shadow segmentation, and provided the first adapter-based method for bringing SAM’s exceptional capability to these downstream tasks.
Adapters. The concept of Adapters was first introduced in the NLP community [51] as a tool to fine-tune a large pre-trained model for each downstream task with a compact and scalable set of parameters. In [52], multi-task learning was explored with a single BERT model shared among a few task-specific parameters. In the computer vision community, [53] suggested fine-tuning the ViT [54] for object detection with minimal modifications. Recently, ViT-Adapter [55] leveraged Adapters to enable a plain ViT to perform various downstream tasks. [56] introduced an Explicit Visual Prompting (EVP) technique that incorporates explicit visual cues into the Adapter. However, no prior work had applied Adapters to leverage a pretrained image segmentation model such as SAM, trained on a large image corpus; here, we bridge this research gap.
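For intuition, here is a minimal sketch of the kind of explicit cue such prompting can use: extracting the high-frequency components of an image by masking low frequencies in the Fourier domain. The function name and mask ratio are our own illustrative choices, not the exact EVP implementation, which derives high-frequency cues in a similar frequency-masking spirit.

import torch

def high_frequency_components(img: torch.Tensor, mask_ratio: float = 0.25) -> torch.Tensor:
    """Return the high-frequency part of `img` (B, C, H, W) by zeroing a
    centered low-frequency square in the shifted 2D FFT spectrum."""
    freq = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    _, _, H, W = img.shape
    h, w = int(H * mask_ratio) // 2, int(W * mask_ratio) // 2
    cy, cx = H // 2, W // 2
    freq[..., cy - h:cy + h, cx - w:cx + w] = 0  # suppress low frequencies
    return torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1))).real

# Example: hf = high_frequency_components(torch.randn(1, 3, 224, 224))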
Polyp Segmentation. In recent years, there has been notable progress in polyp segmentation [57] due to deep-learning approaches. These techniques employ deep neural networks to derive more discriminative features from endoscopic polyp images. Nonetheless, the use of bounding-box detectors often leads to inaccurate polyp boundary localization. To resolve this, [58] leveraged fully convolutional networks (FCN) with pre-trained models to identify and segment polyps, and [59] introduced a technique utilizing fully convolutional neural networks (FCNNs) to predict 2D Gaussian shapes. Subsequently, the U-Net [60] architecture, featuring a contracting path for context capture and a symmetric expanding path for precise localization, achieved favorable segmentation results. However, these strategies focus primarily on entire polyp regions, neglecting boundary constraints. Therefore, Psi-Net [61] incorporated both region and boundary constraints for polyp segmentation, yet the interplay between regions and boundaries remained underexplored. [62] introduced PolypSegNet, an enhanced encoder-decoder architecture designed for the automated segmentation of polyps in colonoscopy images. To address the issue of non-equivalent images and pixels, [63] proposed a confidence-aware resampling method for polyp segmentation tasks. Specifically for polyp segmentation, [64] and [6] present promising results using an unprompted SAM and a domain-adapted SAM, respectively. Additionally, Polyp-SAM [65] used SAM for the same task, and [66] evaluated the zero-shot capabilities of SAM on the organ segmentation task.
Camouflaged Object Detection (COD). Camouflaged object detection, or concealed object detection, is a challenging but useful task that identifies objects that blend in with their surroundings. COD has wide applications in medicine, agriculture, and art. Initially, research on camouflage detection relied on low-level features like texture, brightness, and color [67, 68, 69, 70] to distinguish foreground from background. It is worth noting that some of this prior knowledge is critical for identifying the objects and is used to guide the neural network in this paper.
Le et al. [11] first proposed an end-to-end network consisting of a classification branch and a segmentation branch. Recent advances in deep-learning-based methods have shown a superior ability to detect complex camouflaged objects [9, 71, 72]. In this work, we leverage an advanced neural network backbone (a foundation model, SAM2) with the input of task-specific prior knowledge to achieve state-of-the-art (SOTA) performance.
Shadow Detection. Shadows occur when an object surface is not directly exposed to light. They offer hints about light-source direction and scene illumination that can aid scene comprehension [73, 74], but they can also negatively impact the performance of computer vision tasks [75, 76]. Early methods use hand-crafted heuristic cues such as chromaticity, intensity, and texture [74, 77, 78]. Deep-learning approaches leverage knowledge learnt from data and use carefully designed neural network structures to capture shadow information (e.g., learned attention modules) [79, 80, 81]. This work leverages such heuristic priors together with large neural network models to achieve state-of-the-art (SOTA) performance.
5. Conclusion and Future Work
In this paper, we introduced SAM2-Adapter, a novel adaptation method designed to leverage the advanced capabilities of the Segment Anything 2 (SAM2) model for specific downstream segmentation tasks. Building on the success of the original SAM-Adapter, SAM2-Adapter utilizes a multi-adapter configuration that is specifically tailored to SAM2’s multi-resolution hierarchical Transformer architecture. This approach effectively addresses the limitations encountered with SAM, enabling the achievement of new state-of-the-art (SOTA) performance in challenging segmentation tasks such as camouflaged object detection, shadow detection, and polyp segmentation.
Our experiments demonstrate that SAM2-Adapter not only retains the beneficial features of its predecessor, including generalizability and composability, but also enhances these capabilities by integrating seamlessly with SAM2’s advanced architecture. This integration allows SAM2-Adapter to outperform previous methods and set new benchmarks across various datasets and tasks.
The continued presence of challenges from SAM in SAM2 highlights the inherent complexities of applying foundation models to diverse real-world scenarios. Nevertheless, SAM2-Adapter effectively addresses these issues, showcasing its potential as a robust tool for high-quality segmentation in a range of applications.
We encourage researchers and engineers to adopt SAM2 as the backbone for their segmentation tasks, coupled with SAM2-Adapter, to realize improved performance and advance the field of image segmentation. Our work not only extends the capabilities of SAM2 but also paves the way for future innovations in adapting large pre-trained models for specialized applications.