1. Introduction
Image segmentation is a crucial component in many systems that aim to understand visual data. It divides images or video frames into multiple segments or regions [
1]. Image segmentation may refer to assigning semantic labels to pixels, known as semantic segmentation, or to separating distinct object instances, known as instance segmentation [
2]. Semantic segmentation labels each pixel in an image with a specific object category, e.g., human, car, tree, or sky, and is therefore generally a more difficult task than image classification, which predicts a single label for the entire image [
3]. Instance segmentation extends semantic segmentation by detecting multiple objects in an image and precisely delineating each individual instance of interest, e.g., discriminating between different individuals in the same image [
4]. Initial efforts have also been made to combine the two segmentation tasks for a more thorough understanding of scenes [
5,
6,
7,
8]. Segmentation plays a vital role in a wide range of applications, including driverless vehicles, medical image analysis, augmented reality, and video surveillance, among others [
9]. A multitude of image segmentation algorithms have been devised in the literature, ranging from basic techniques, e.g., histogram-based bundling, thresholding [
10], k-means clustering [
11], region-growing [
12], and watersheds [
13], to more sophisticated methods such as graph cuts [
14], active contours [
15], conditional and Markov random fields [
16], and sparsity-based approaches [
17,
18]. Deep Learning (DL) models, widely regarded as the next generation of image segmentation algorithms, have delivered significant performance gains in recent years. These models (e.g., U-Net [
19], High-Resolution Network (HRNet) [
20], Mask Region-based Convolutional Neural Network (Mask R-CNN) [
21], Fully Convolutional Networks (FCN) [
22], SegNet [
23], Segment Anything Model (SAM) [
24]) often achieve the highest levels of accuracy on well-known benchmarks, resulting in a significant shift in the field.
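To make the distinction between image classification and semantic segmentation concrete, the following sketch contrasts the outputs of two pre-trained models on a single image: a classifier that predicts one label for the whole image, and a semantic segmentation network that predicts one label per pixel. It is an illustrative sketch only, assuming a recent torchvision installation; the specific models (ResNet-50 and FCN-ResNet50) and the file example.jpg are arbitrary placeholders rather than part of the surveyed work.

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.segmentation import fcn_resnet50, FCN_ResNet50_Weights

# Load an RGB image and normalize it with the standard ImageNet statistics.
image = Image.open("example.jpg").convert("RGB")
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
batch = preprocess(image).unsqueeze(0)  # shape: (1, 3, H, W)

with torch.no_grad():
    # Image classification: a single label for the whole image.
    classifier = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
    class_id = classifier(batch).argmax(dim=1)        # shape: (1,)

    # Semantic segmentation: one label per pixel (21 Pascal VOC classes).
    segmenter = fcn_resnet50(weights=FCN_ResNet50_Weights.DEFAULT).eval()
    logits = segmenter(batch)["out"]                  # shape: (1, 21, H, W)
    label_map = logits.argmax(dim=1)                  # shape: (1, H, W)

print("classification output:", class_id.shape)   # a single label
print("segmentation output:  ", label_map.shape)  # a dense per-pixel label map
```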
Video segmentation, which involves identifying the key objects in a video scene based on their specific properties or semantics, is a fundamental and difficult problem in Computer Vision (CV). It has numerous potential applications, including robotics, autonomous driving, social media, automated surveillance, augmented reality, movie creation, and video conferencing [
25]. The problem has been tackled using conventional CV and Machine Learning (ML) methods. These techniques include hand-crafted features, e.g., optical flow and histogram statistics, heuristic prior knowledge, e.g., motion boundaries [
26] and visual attention mechanisms [
27], low/mid-level visual representations, e.g., super-voxel [
28], trajectory [
29], and object proposal [
30], as well as classical ML models, e.g., graph models [
31], clustering [
32], support vector machines (SVM) [
33], random decision forests [
34], random walks [
35], Markov random fields [
36], and conditional random fields [
37]. Recently, DL models such as FCN [
22], You Only Look Once (YOLO) v5, v7, and v8 models [
38,
39,
40], Mask R-CNN [
21], and SAM [
24] have driven significant progress in video segmentation. DL-based video segmentation algorithms, e.g., SAM [
24] and Mask R-CNN [
21], exhibit significantly higher precision and, at times, greater efficiency than traditional approaches.
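As a simple point of reference for the DL-based approaches above, the sketch below applies a pre-trained Mask R-CNN independently to each frame of a video. It is a naive per-frame baseline under stated assumptions (torchvision and OpenCV installed, a local file example_video.mp4 as a placeholder); it carries no memory between frames, which is precisely the gap that the video-oriented foundation models discussed next aim to fill.

```python
import cv2
import torch
from torchvision.models.detection import (
    maskrcnn_resnet50_fpn,
    MaskRCNN_ResNet50_FPN_Weights,
)

# Pre-trained instance segmentation model (COCO categories).
model = maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT).eval()

capture = cv2.VideoCapture("example_video.mp4")
frame_masks = []

with torch.no_grad():
    while True:
        ok, frame_bgr = capture.read()
        if not ok:
            break
        # OpenCV returns BGR uint8; convert to an RGB float tensor in [0, 1].
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        tensor = torch.from_numpy(frame_rgb).permute(2, 0, 1).float() / 255.0

        # One independent forward pass per frame: boxes, labels, scores, masks.
        output = model([tensor])[0]
        keep = output["scores"] > 0.5              # simple confidence filter
        frame_masks.append(output["masks"][keep])  # (num_objects, 1, H, W)

capture.release()
print(f"segmented {len(frame_masks)} frames independently")
```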
The rise of foundation models has brought about substantial changes in various fields, e.g., Natural Language Processing (NLP), CV, Reinforcement Learning (RL), etc. These models achieve remarkable results thanks to their extensive pre-training on large datasets and their exceptional ability to transfer knowledge to a wide range of specific tasks, e.g., Machine Translation, Image Segmentation, Autonomous Driving, Healthcare, etc. [
41]. The Generative Pre-trained Transformer (GPT) [
42], developed by OpenAI, has achieved significant advances in various language tasks within the field of NLP. It has also facilitated the development of successful commercial applications such as ChatGPT [
43], renowned for its ability to generate coherent language in real time and engage in meaningful interactions with users. In the field of CV, by contrast, researchers are still pursuing foundation models that are both powerful and adaptable, a pursuit driven by the distinct obstacles and complexities of the visual domain.
Contrastive Language-Image Pre-training (CLIP) [
44], a model that successfully integrates image and text modalities, has demonstrated the ability to generalize to new visual concepts without prior exposure. Despite this, the generalization capacity of vision models remains limited by the scarcity of comprehensive training data, especially in comparison with NLP models. In 2023, Meta AI Research unveiled the Segment Anything Model (SAM) [
24], a highly adaptable and responsive model that can accurately segment any object in images or videos without requiring additional training, an ability referred to in CV as zero-shot transfer. SAM was trained on the SA-1B dataset [
24], which includes more than 11 million images and one billion masks, making SAM the first foundation model of its kind. SAM is designed to produce precise segmentation results from several types of prompts, including points, boxes, or a combination of both, and it has consistently demonstrated excellent generalization across diverse images and objects. Despite these achievements, SAM has several limitations. It was designed primarily for static image segmentation, which makes it inefficient for video analysis [
45,
46]. Its processing speed and efficiency are not optimized for handling massive datasets, making it less suitable for real-time applications [
24]. In addition, SAM lacks the ability to retain memory across frames, which limits its capacity to track the movement of objects over time [
45,
46]. SAM's training on static image datasets also limits its effectiveness and applicability in various video scenarios [
47], e.g., object tracking, action recognition, lighting variations, etc.
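To illustrate the prompt-driven interface described above, the sketch below queries SAM with a single foreground point and, alternatively, with a bounding box. It assumes Meta's publicly released segment-anything package and a downloaded ViT-H checkpoint; the checkpoint filename, image path, and prompt coordinates are placeholders chosen for illustration.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (ViT-H backbone) and wrap it in a predictor.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB uint8 image; the image embedding is computed once here.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt 1: a single foreground point (label 1 = foreground, 0 = background).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,     # return several candidate masks
)

# Prompt 2: a bounding box around the object of interest (x0, y0, x1, y1).
box_masks, box_scores, _ = predictor.predict(
    box=np.array([400, 300, 700, 600]),
    multimask_output=False,
)

print("point prompt masks:", masks.shape)      # (3, H, W) binary masks
print("box prompt mask:   ", box_masks.shape)  # (1, H, W)
```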
In light of the limitations of the SAM model, Meta AI Research recently introduced a new model, Segment Anything Model 2 (SAM 2) [
47]. It is designed for promptable object segmentation in both images and videos and runs in real time using a transformer-based architecture with streaming memory. SAM 2 builds on the success of the original SAM, which was designed to facilitate flexible, promptable image segmentation. With the rapid growth of multimedia content and the increasing demand for video analysis, there is a need for a model that can handle images and videos seamlessly. SAM 2 meets this need by incorporating advanced features such as streaming memory and a large-scale training dataset, SA-V. By extending segmentation capabilities from static images to videos, SAM 2 addresses the limitations of the original SAM in handling video data and accommodates the complex spatio-temporal dynamics involved. The model provides efficient real-time video segmentation with quick and accurate frame processing, and it enhances accuracy while requiring fewer user interactions, making it more user-friendly and effective in practical applications. Additionally, SAM 2 improves the ability to segment objects in dynamic and cluttered environments, ensuring robust performance even in complex visual scenes. The objective of this study is to provide an overview of SAM 2.
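The sketch below illustrates, at a high level, how the promptable video workflow just described can be driven: a single click on one frame is propagated through the rest of the video by the predictor's streaming memory. It follows the usage pattern of Meta's public sam2 reference repository, but the module path, function names (build_sam2_video_predictor, init_state, add_new_points_or_box, propagate_in_video), checkpoint and config filenames, and the frame directory are assumptions that may differ across releases; it is a sketch, not a definitive API reference.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Checkpoint and config names follow Meta's reference release and are
# placeholders here; substitute the files actually downloaded.
checkpoint = "./checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode():
    # The inference state holds the streaming memory for the whole video
    # (here assumed to be a directory of JPEG frames).
    state = predictor.init_state(video_path="./example_frames")

    # Prompt one object with a single foreground click on frame 0.
    frame_idx, object_ids, mask_logits = predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210.0, 350.0]], dtype=np.float32),  # (x, y) pixels
        labels=np.array([1], dtype=np.int32),                 # 1 = foreground
    )

    # Propagate the prompt through the remaining frames via the memory bank.
    video_masks = {}
    for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(state):
        video_masks[frame_idx] = (mask_logits > 0.0).cpu()

print(f"tracked the prompted object across {len(video_masks)} frames")
```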