1. Introduction
In recent years, rapid technological advances have integrated cameras into a wide range of devices. As the number of cameras grows, so does the number of recorded videos [1], leading to a huge increase in videos uploaded to the Internet every day [2]. Daily activities and special moments are routinely recorded, producing large amounts of video [3]. Easy access to multimedia creation, mainly through mobile devices, combined with the tendency to share content on social networks [3], has caused an explosion of videos available on the Web. Searching for specific video content and categorizing it are generally time-consuming. The traditional representation of a video file as a sequence of consecutive frames, each corresponding to a constant time interval, is adequate for playback but presents a number of limitations for emerging multimedia services such as content-based search, retrieval, navigation, and video browsing [4]. The need for rational time management led to the development of automatic video summarization [2] and indexing/clustering [1] techniques that facilitate access, content search, automatic categorization (tagging/labeling), action recognition [5], and common action detection in videos [6]. The number of papers per year whose title contains the phrase "video summarization", according to Google Scholar, is depicted in Figure 1.
Many studies have been devoted to designing tools that create videos shorter than the original while reflecting its most important visual and semantic content [4,7]. As the user base grows, so does the diversity among users, and a summary that is uniform for all users may not suit everyone’s needs. Each user may consider different sections important according to their interests, needs, and the time they are willing to spend. The focus should therefore shift from a general video summary to a personalized one [8]. User preferences, which often shape the expected result, should be taken into account when summarizing video [9]. It is therefore important to adapt the video summary to the user’s interests and preferences, creating a personalized video summary while retaining the important semantic content of the original video [10].
A personalized video summary should reflect how the individual user understands the video content, in a way that is intuitive and based on individual needs. In other words, a personalized summary is tailored to the individual user’s understanding of the video content, whereas a non-personalized summary is not customized to the user in any way. Summaries can be personalized in several ways, for example based on the user’s behavior while summarizing content, the user’s movements when capturing the video, or user access patterns and individual browsing and filtering of the video content before it is presented, captured in a user profile [11].
Several studies on personalized video summarization have appeared in the literature. Tseng et al. (2001) [12] introduced personalized video summarization for mobile devices. Two years later, Tseng et al. (2003) [13] used context information to generate hierarchical video summaries. Lie and Hsu (2008) [14] proposed a personalized video summarization framework based on semantic features extracted from each frame. Shafeian and Bhanu (2012) [15] proposed a personalized system for video summarization and retrieval. Zhang et al. (2013) [16] proposed an interactive, sketch-based personalized video summarization. Panagiotakis et al. (2020) [9] provided personalized video summarization through a recommender system whose output is a personalized ranking of video segments. Hereafter, a rough classification of the current research is presented.
Several works produce a personalized video summary in real time. Valdés and Martínez (2010) [17] introduced an application for interactive, on-the-fly video summarization. Chen and Vleeschouwer (2010) [18] produced personalized basketball video summaries in real time from multiple data streams.
Several works are based on queries. Kannan et al. (2015) [19] proposed a system to create personalized movie summaries with a query interface. Given a semantic query, Garcia (2016) [20] proposed a system that finds relevant digital memories and performs personalized summarization. Huang and Worring (2020) [21] created a dataset of query-video pairs.
In a few works, information is extracted from humans using sensors. Katti et al. (2011) [22] used eye gaze and pupillary dilation to generate personalized storyboard summaries. Qayyum et al. (2019) [23] used electroencephalography to detect viewer emotion.
A few studies focus on egocentric personalized video summarization. Varini et al. (2015) [24] proposed a personalized egocentric video summarization framework. The most recent study, by Nagar et al. (2021) [25], presented an unsupervised reinforcement learning framework for day-long egocentric videos.
Many studies have used machine learning to develop video summarization techniques. Park and Cho (2011) [26] proposed the summarization of personalized live video logs from a multi-camera system using machine learning. Peng et al. (2011) [27] proposed personalized video summarization by supervised machine learning using a classifier. Ul et al. (2019) [28] used a deep CNN model to recognize facial expressions. Zhou et al. (2019) [29] proposed a character-oriented video summarization framework. Fei et al. (2021) [30] proposed a triplet deep-ranking model for personalized video summarization. Mujtaba et al. (2022) [31] proposed a framework for personalized video summarization using a 2D CNN. Köprü and Erzin (2022) [32] used affective visual information for human-centric video summarization. Ul et al. (2022) [33] presented a personalized video summarization framework based on the Object of Interest (OoI).
Figure 2 presents milestone works in personalized video summarization. The figure also shows the first published paper on personalized video summarization and the long period that passed before such works began to appear in large numbers.
The rest of this paper is structured as follows.
Section 2 introduces the main applications of the personalized video summary.
Section 3 presents personalized video summarization approaches distinguished by the audiovisual cues they use.
Section 4 presents the classification of personalized video summary techniques into six categories according to the type of personalized summary, criteria, video domain, source of information, time of summarization and machine learning technique.
Section 5 classifies the techniques into five main categories according to the type of methodology used by the personalized video summarization techniques.
Section 6 describes the video datasets suitable for personalized summarization.
Section 7 describes the evaluation of personalized video summarization methods.
Section 8 presents a quantitative comparison of personalized video summarization approaches on the most prevalent datasets.
Section 9 concludes this work by briefly describing its main findings and analyzing opportunities, challenges, and suggested research lines in the field.
4. Classification of Personalized Video Summarization Techniques
Many techniques have emerged to create personalized video summaries while preserving the essential content of the original video. Based on their characteristics and properties, the techniques are classified into the categories depicted in Figure 4, which are based on:
4.1. Type of Personalized Summary
Video summarization is a technique used to generate a concise overview of a video, which can consist of a sequence of still images (keyframes) or moving images (video skims) [51]. Personalized video summarization creates such a summary for a specific user. The desired type of summary also serves as a basis for categorizing personalized summarization techniques. Several types of output can be considered:
Static storyboards are also called static summaries. Some personalized video summaries consist of a set of still images that capture the highlights and are presented like a photo album [15,16,21,22,23,29,33,34,42,43,47,51,52,53,54,55,56,57,58]. These summaries are generated by extracting keyframes according to the user’s preferences and the summary criteria.
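To illustrate the keyframe-extraction step, the following is a minimal sketch (not tied to any specific work above) that clusters pre-computed frame features and keeps the frame closest to each cluster centroid; the feature extractor and the number of keyframes are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_keyframes(frame_features: np.ndarray, n_keyframes: int = 10):
    """Pick one representative frame per cluster of visually similar frames.

    frame_features: (num_frames, feature_dim) array, e.g. CNN descriptors.
    Assumes num_frames >= n_keyframes. Returns keyframe indices in temporal order.
    """
    kmeans = KMeans(n_clusters=n_keyframes, n_init=10, random_state=0)
    labels = kmeans.fit_predict(frame_features)

    keyframe_ids = []
    for c in range(n_keyframes):
        members = np.where(labels == c)[0]
        # choose the frame closest to the cluster centroid as the representative
        dists = np.linalg.norm(frame_features[members] - kmeans.cluster_centers_[c], axis=1)
        keyframe_ids.append(int(members[np.argmin(dists)]))
    return sorted(keyframe_ids)
```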
Dynamic summaries are also called video skims. The personalized summary is formed by selecting a subset of consecutive frames (sub-shots) from the original video that best represent it according to the user’s preferences and the summary criteria [12,14,17,18,19,20,25,27,28,30,31,35,37,38,44,45,47,49,54,59,60,61,62,63,64,65,66,67,68,69].
A hierarchical summary is a multilevel and scalable summary. It consists of a few abstraction layers, where the lowest layer contains the largest number of keyframes and the most detail, while the highest layer contains the fewest keyframes. Hierarchical video summaries provide the user with several levels of summary, which makes it easier to determine which level is appropriate [48]. Based on each user’s preferences, the summary presents an appropriate overview of the original video. Sachan and Keshaveni [8] proposed a hierarchical mechanism to classify images in order to accurately identify the concept related to a frame; the hierarchical classification system performs a deep categorization of a frame against a predefined taxonomy. Tseng and Smith [13] suggested a summarization algorithm in which server metadata descriptions, contextual information, user preferences, and user interface statements are used to produce hierarchical video summaries.
Multiview summaries are created from videos recorded simultaneously by multiple cameras. They are useful for sports videos, which are typically captured by many cameras. Challenges arise in multiview summarization due to overlapping and redundant content, as well as differing lighting conditions and fields of view across views. As a result, keyframe extraction for static summaries and shot-boundary detection for video skims become difficult, so conventional single-camera techniques for video skimming and keyframe extraction cannot be applied directly [48]. In personalized video summarization, the multiview summary is determined by the preferences of each user and the summary criteria [18,26,38,70].
Fast forward summaries: when a video is not informative or interesting, users often play it at higher speed or skip forward [71]. In a personalized video summary, the user therefore wants to fast forward through segments that do not interest them. Yin et al. [50] play the less important parts in fast forward mode in order to keep users aware of the context. Chen et al. [40] proposed an adaptive fast-forward personalized summarization framework that performs clip-level fast forwarding, choosing playback speeds from discrete options, which include as a special case the removal of content at an infinite playback speed. Fast forwarding is used in sports videos, where each viewer wants to watch the main phases of the match that interest them according to their preferences, while getting a brief overview of the remaining, less interesting phases. Chen and Vleeschouwer [41] proposed a personalized summarization framework for broadcast soccer video with adaptive fast forwarding, in which a resource allocation procedure selects the optimal combination of candidate summaries.
Figure 5 shows the distribution of the papers listed in Table 2 according to the type of personalized summary. Around 59% of the personalized summaries are video skims, followed by storyboard-type summaries at around 27%. Multiview summaries account for around 8% of the works, hierarchical summaries for about half of that, around 4%, and fast forward summaries for the smallest share, around 2%.
Table 1. Classification of domains.

Domain | Papers
Cultural heritage | [61]
Movies/Serial clips | [19,28,29,35,49]
Office videos | [26]
Personal videos | [52,53]
Sports videos | [18,35,37,38,39,40,41,42,43,44]
Zoo videos | [8]
4.2. Features
Each video contains several features that are useful for creating a personalized summary that correctly represents the content of the original video. Users focus on one or more features, such as events, activities, or objects, and summarization techniques are selected accordingly based on the user’s preferences. Techniques based on these features are described below.
4.2.1. Object Based
Object-based techniques focus on specific objects present in the video. They are useful for detecting one or more objects, such as a table, a dog, or a person. Some object-based summaries may include graphics or text to simplify the selection of video segments or keyframes, or to represent the objects contained in the summary based on their detection [11]. Summarization can be performed by collecting all frames that contain the desired object; if the video contains no object of the desired type, the method still runs but will not be effective [36].
In the literature, several studies create a personalized video summary by taking the user’s query into consideration [12,13,19,30,47,51,54,63,64,67,68,69,73]. Sachan and Keshaveni [8] proposed a personalized video summarization approach using classification-based entity identification; they created personalized summaries of desired lengths thanks to the compression, recognition, and specialized classification of the macros present in the video. After video segmentation, the method of Otani et al. [72] selects video segments based on the objects they contain; each segment is represented by object tags that capture its meaning. Gunawardena et al. [58] define the Object of Interest (OOI) as the features the user is interested in and suggested an algorithm that summarizes a video focused on a given OOI, with keyframe selection oriented to the user’s interest. Ul et al. [33] proposed a framework in which frames containing OOIs are detected using deep learning and combined to produce the output summary; the framework can detect one or more objects present in the video. Nagar et al. [25] proposed an egocentric video summarization framework that uses unsupervised reinforcement learning and 3D convolutional neural networks (CNNs) and incorporates user preferences, such as places, faces, and interactive user choices, to include or exclude that type of content. Zhang et al. [16] proposed a method for generating personalized video summaries based on sketch selection, which relies on building a graph from keyframes; the generated graph represents the features of the sketches. The proposed algorithm interactively selects the frames related to the same person, so that the user interacts directly with the video content, and, depending on each user’s requirements, the interaction and sketching are performed with gestures.
Tejero-de-Pablos et al. [44] presented a body-based method that focuses on body-joint features; compared with appearance features, these have little variability and represent human movement well. The neural-network-based method uses the players’ actions, combining holistic and body-joint features, to extract the most important moments from the original video. Two types of action-related features are extracted and then classified as interesting or uninteresting. The personalized video summaries presented in [14,17,26,43,45,52,53,57,59,61,65,66] are also object based.
4.2.2. Area Based
Area-based techniques are useful for detecting one or more areas, such as indoor or outdoor settings, as well as specific scenes within an area, such as mountains, nature, or a skyline. Lie and Hsu [14] proposed a video summarization system that takes into account user preferences based on events, objects, and area; through a user-friendly interface, the user can adjust the number of frames or the time limit. Kannan et al. [66] proposed a personalized video summarization system that includes 25 detectors for semantic concepts covering objects, events, activities, and areas. Each video is assigned shot relevance scores according to the set of semantic concepts, and the summary is generated from the time budget and the shots most relevant to the user’s preferences. Tseng et al. [59] proposed a three-layer server-middleware-client system that produces a video summary according to user preferences based on objects, events, and area scenes. Several other studies [52,53,61,65,67] also base personalized video summarization on area.
4.2.3. Perception Based
Perception-based video summaries refer to the ways, represented by high-level concepts, in which the user can perceive the content of a video. Relevant examples include the level of importance the user associates with the content, the degree of excitement the user experiences while watching, how distracting the user finds the video, or the kind and intensity of the emotions the content evokes. Other research applies theories such as semiotic theory, human perception theory, and user-attention theories to interpret the semantics of video content and achieve high-level abstraction. Thus, unlike event- and object-based summaries, which identify tangible events and objects present in the video content, perception-based summaries focus on how the user perceives, or can perceive, the content [11]. Perception-based methods are presented hereafter.
Miniakhmetova and Zymbler [74] proposed a framework for generating a personalized video summary that takes into account the user’s ratings ("like", "neutral", or "dislike") of previously watched videos; the scenes that affect the user the most are collected, and their sequence forms the personalized summary. Varini et al. [61] presented a behavior pattern classifier based on 3D Convolutional Neural Networks (3D CNNs) that uses visual evaluation features and 3D motion; elements are selected according to narrative, visual, and semantic perspective together with the user’s preferences and attention behavior. Dong et al. [52,53] proposed a video summarization system based on the user’s intentions, a set of predefined idioms, and the expected duration. Ul et al. [28] proposed a movie summarization framework based on people’s emotional moments obtained through facial expression recognition (FER); the emotions considered are disgust, surprise, happiness, sadness, neutral, anger, and fear. Köprü and Erzin [32] proposed a human-centered video summarization framework based on emotional information extracted from visual data: recurrent convolutional neural networks first extract the emotional information, and attention mechanisms then exploit it to enrich the summary. Katti et al. [22] presented a semi-automated eye-gaze-based method for affective video analysis. The behavioral input signal is pupillary dilation (PD), used to assess user engagement and arousal, and the method combines gaze analysis with content-based features to discover Regions of Interest (ROIs) and affective video segments. The Interest Meter (IM) proposed by Peng et al. [27] measures user interest through spontaneous reactions: emotion and attention features are combined with a fuzzy fusion scheme, viewing behaviors are converted into quantitative interest scores, and the summary is produced by joining the parts of the video considered interesting. Yoshitaka and Sawada [71] proposed a summarization framework based on the observer’s behavior while watching, detected from remote-control operations and eye movement. Olsen and Moon [39] proposed the DOI function to obtain a set of features for each video based on viewers’ interactive behavior.
4.2.4. Motion and Color Based
Producing a video summary based on motion is difficult, especially when camera motion is involved [36]. Sukhwani and Kothari [43] proposed a framework for creating a personalized video summary in which color cues identify football clips as event segments. To isolate football events from the video, they modeled player activity and movement; the footballers’ actions are described with dense trajectory feature descriptors. For direct identification of players in moving frames they used a deep learning player-identification method, and for modeling the football field they used a Gaussian mixture model (GMM) to perform background removal. Darabi et al. [34] used AVCutty to detect scene boundaries, i.e., to detect when a scene change occurs, through the motion and color features of the frames. The study by Varini et al. [61] is also motion based.
4.2.5. Event Based
Event-based approaches are useful for detecting normal and abnormal events present in videos. Examples include terrorism, hijacking, robbery scenes, and the recognition and monitoring of sudden changes in the environment, where anomalous or suspicious features are observed using detection models. To produce the summary, the frames with abnormal scenes are joined by a summarization algorithm [36]. Such event-based approaches are presented hereafter.
Valdés and Martínez [17] described a method for creating video skims using an algorithm that dynamically builds a skimming tree, with an implementation that obtains different features through online analysis. Park and Cho [26] presented a study in which a single event sequence is generated from multiple sequences and the personalized summary is produced using fuzzy TSK rules. Chen et al. [38] described a study in which metadata are acquired by detecting events and objects in the video; taking the metadata into account, the video is divided into segments so that each segment covers a self-contained period of a basketball game. Fei et al. [35] proposed two methods for event detection: one combines a support vector machine with a convolutional neural network (CNN-SVM), and the other uses an optimized summarization network (SumNet). Lei et al. [60] proposed an unsupervised method for producing a personalized video summary based on interesting events. The unsupervised Multi-Graph Fusion (MGF) method was proposed by Ji et al. [70] to automatically find events relevant to the query. The LTC-SUM method of Mujtaba et al. [31] uses a 2D CNN model to detect individual events from thumbnails; it addresses privacy and computation problems on resource-constrained end-user devices and, by significantly reducing computational complexity, improves storage and communication efficiency. PASSEV was developed by Hannon et al. [37] to create personalized real-time video summaries by detecting events from Web data. Chung et al. [49] proposed a PLSA-based model for video summarization based on user preferences and events. Chen and Vleeschouwer [18] proposed a video segmentation method based on clock events; more specifically, it relies on the basketball rule that each team has 24 seconds to attack by taking a shot, so events such as fouls, shots, and interceptions are directly tracked through the restart, stop, and start of the clock. Many studies [13,14,16,19,25,45,52,53,57,59,61,65,66] also base personalized video summarization on events.
4.2.6. Activity Based
Activity-based techniques focus on specific activities present in the video, such as lunch at the office, a sunset at the beach, drinks after work, friends playing tennis, bowling, or cooking. Ghinea et al. [65] proposed a summarization system that can be used with any kind of video in any domain. From each keyframe, 25 semantic concepts are detected for visual scene categories covering people, areas, settings, objects, events, and activities, which are sufficient to represent a keyframe in a semantic space. Each video is assigned relevance scores for the set of semantic concepts; the score reflects the relevance between a particular semantic concept and a shot. Garcia [20] proposed a system that accepts an image, text, or video query from the user, retrieves short subshots from the memories stored in the database, and produces a summary according to the user’s activity preferences. The studies in [17,19,57,66] are also activity based.
The distribution of the papers listed in Table 2 according to the type of features used in the personalized summary is shown in Figure 6. According to the distribution, most works perform object-based personalized summarization, at around 39%. They are followed, at around 29%, by works whose personalized summary is event based. Next are works whose summary is area based, at around 10%. Perception-based and activity-based summaries share fourth place with the same number of works, around 9% each. The smallest group, at around 4%, consists of works whose summary is based on motion and color.
4.3. Video Domain
The techniques can be divided into two categories: those whose analysis is domain specific and those whose analysis is not.
In domain specific analysis, video summarization is performed within a particular domain. Common content domains include home videos, news, music, and sports. Focusing on a specific domain reduces ambiguity when analyzing video content [11]. The summary must be suited to the domain, and the criteria for producing a good summary vary dramatically across domains [53].
In contrast, non-domain specific techniques perform video summarization for any domain, so there is no restriction on the choice of video. The system proposed by Kannan et al. [66] can generate a video summary without relying on any specific domain. The summaries presented in [12,13,14,17,20,22,25,27,30,31,33,34,45,47,51,54,55,57,58,59,60,63,64,65,67,68,69,70,71,72] are not domain specific. The domain types found in the literature are personal videos [52,53], movies/serial clips [19,28,29,35,49], sports videos [18,35,37,38,39,40,41,42,43,44], cultural heritage [61], zoo videos [8], and office videos [26]. Table 1 provides a classification of works by domain type.
Figure 7 presents the distribution of papers according to the domain of the personalized summary. It can be seen that most works produce non-domain specific personalized summaries, around 67%, compared with works that produce domain specific summaries, around 33%.
4.4. Source of Information
The video content life cycle includes three stages. The first is the capture stage, during which the video is recorded. The second is the production stage, during which the video is edited into a format that conveys a message or a story. The third is the viewing stage, during which the video is shown to a target audience. Through these stages, the video content is desemanticized and the various audiovisual elements are extracted [11]. Based on the source of information, personalized video summarization techniques can be divided into three categories: internal, external, and hybrid personalized summarization techniques.
4.4.1. Internal Personalized Summarization Techniques
Internal personalized summarization techniques analyze and use information internal to the content of the video stream. They apply to the second stage of the video life cycle, i.e., to the produced video content. Through these techniques, low-level text, audio, and image features are automatically parsed from the video stream into abstract semantics suitable for summarization [11].
Image features capture changes in the motion, shape, texture, and color of objects in the video image stream. These changes can be used to segment the video into shots by identifying fades or abrupt boundaries: fades are identified by slow changes in image characteristics, whereas abrupt boundaries, such as cuts, are identified by sharp changes in image features. Analysis of image features also allows specific objects to be detected and improves the depth of summarization for videos with a known structure. Sports video is well suited to event detection because of its rigid structure, and event and object detection can likewise be achieved in other content domains with a rigid structure, such as news videos, which start with an overview of the headlines, then show a series of reports, and finally return to the anchor [11].
Audio features appear in the audio stream associated with the video and take different forms, including music, speech, silence, and other sounds. They can help identify candidate segments for inclusion in a summary, and the depth of the summary can be improved using domain specific knowledge [11].
Text features appear in the video in the form of captions or subtitles. Rather than forming a separate stream, captions are "burned" into, i.e., integrated with, the video’s image stream. Text can contain detailed information about the content of the video stream and is therefore an important source of information [11]. For example, in a live football broadcast, captions show the team names, the current score, the possession percentage, the shots on target, and so on. As with image and audio features, events can also be identified from text. Otani et al. [72] proposed a text-based method in which the supporting texts of video blog posts are exploited for summarization: the video is first segmented, each segment is assigned a priority according to its relevance to the input text, and a subset of segments whose content is similar to the input text is selected. A different summary is therefore produced for each input text.
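As a rough illustration of ranking segments by textual relevance (not the implementation of [72]), the sketch below assumes each segment already has an associated text description (e.g., tags or a transcript) and scores it against the user-provided text with TF-IDF cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_segments_by_text(segment_texts, query_text, top_k=5):
    """Score each segment description against the input text and return
    the indices of the most relevant segments (highest first)."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(segment_texts + [query_text])
    seg_vecs, query_vec = tfidf[: len(segment_texts)], tfidf[len(segment_texts):]
    scores = cosine_similarity(query_vec, seg_vecs).ravel()
    order = scores.argsort()[::-1][:top_k]
    return list(order), scores[order]

# Example: segments described by tags, summary driven by the viewer's text
segments = ["chopping vegetables in kitchen", "goal celebration crowd",
            "interview with the coach", "slicing onions and frying"]
idx, s = rank_segments_by_text(segments, "cooking a meal", top_k=2)
```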
4.4.2. External Personalized Summarization Techniques
External personalized summarization techniques analyze and use information that is external to the content of the video stream, in the form of metadata; such information can be analyzed at every stage of the video life cycle [11]. One external source of information is contextual information, which is additional information that comes neither from the video stream nor from the user [11]. The method presented by Katti et al. [22] is based on gaze analysis combined with video content features to discover regions of interest (ROIs) and affective segments. First, eye tracking records pupillary dilation and eye movement; then, after each peak of pupillary dilation, the first fixation is determined and the corresponding video frames are marked as keyframes. Linking the keyframes creates a storyboard sequence.
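A minimal sketch of this gaze-driven keyframe marking (an illustration, not the authors' code) could look as follows, assuming a pupil-diameter signal sampled in sync with the video and a list of fixation start frames; the peak threshold is an assumption:

```python
import numpy as np
from scipy.signal import find_peaks

def gaze_keyframes(pupil_diameter: np.ndarray, fixation_frames: np.ndarray,
                   video_fps: float, signal_hz: float):
    """Mark keyframes at the first fixation following each pupillary-dilation peak.

    pupil_diameter: 1-D pupil size signal sampled at signal_hz.
    fixation_frames: sorted video-frame indices where fixations start.
    """
    # peaks that clearly exceed the baseline signal level (illustrative threshold)
    threshold = pupil_diameter.mean() + pupil_diameter.std()
    peaks, _ = find_peaks(pupil_diameter, height=threshold, distance=int(signal_hz))

    keyframes = []
    for p in peaks:
        peak_frame = int(p / signal_hz * video_fps)   # map sample index to frame index
        later = fixation_frames[fixation_frames >= peak_frame]
        if later.size:                                 # first fixation after the peak
            keyframes.append(int(later[0]))
    return sorted(set(keyframes))
```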
4.4.3. Hybrid Personalized Summarization Techniques
Hybrid personalized summarization techniques analyze and use information that is both internal and external to the content of the video stream; such information can be analyzed at every stage of the video life cycle. Any combination of internal and external summarization techniques can form a hybrid technique. Each approach tries to capitalize on its strengths while minimizing its weaknesses, so as to make video summaries as effective as possible [11]. Combining text metadata with image-level video frame features can help improve summarization performance [46], and hybrid approaches have proven particularly useful for non-domain specific techniques [11].
4.5. Based on Time of Summarization
The techniques can be divided into two categories depending on whether the personalized summary is produced live or from a pre-recorded video: real time techniques and static techniques. Both are presented below.
In real time techniques, the personalized summary is produced while the video stream is being played. Because the output must be delivered very quickly, real time production is difficult; in real time systems, a delayed output is considered incorrect [46]. The Sequential and Hierarchical Determinantal Point Process (SH-DPP) is a probabilistic model developed by Sharghi et al. [47] to produce extractive summaries from streaming or long video. The personalized video summaries presented in [12,13,17,18,24,31,37,38,45,49,59] are produced in real time.
In static techniques, the personalized summary is produced from a recorded video. Most studies are static in time [8,14,16,19,20,22,25,26,27,28,33,34,35,43,44,51,52,53,54,55,57,58,60,61,63,64,65,66,67,68,69,70,71,72].
Figure 8 presents the distribution of the papers listed in Table 2 by time of summarization. Papers producing static personalized summaries dominate, at around 76%, compared with papers producing real time summaries, at around 24%.
4.6. Based on Machine Learning
Machine learning techniques are developed to identify objects, areas, events, and so on, and various learning algorithms are applied to generate personalized video summaries. Based on the algorithm applied, the techniques are divided into supervised, weakly supervised, and unsupervised categories. The methods in these categories use Convolutional Neural Networks (CNNs).
CNNs have been very successful in large-scale video and image recognition. They support learning high-level features progressively and obtaining the best representation of the original image [35]. Features for holistic action recognition extracted by CNNs are more generalizable and reliable than hand-crafted features, so CNNs have surpassed traditional methods [44]. The Faster R-CNN model was used by Nagar et al. [25] for face detection. Varini et al. [61] proposed a 3D CNN with 10 layers, trained on frame features and visual motion evaluation. The success of deep neural networks (DNNs) in learning video representations and complex frame features has paved the way for both unsupervised and supervised summarization techniques [25]. In neural networks, memory networks are used to flexibly model the attention scheme, as well as to handle visual question answering and question answering [5].
Through deep learning based on artificial neural networks, computers learn to process data much as a human would: using large amounts of data and trained models, they learn features on their own, with room for optimization. Learned features are those acquired through a deep learning system [46]. Based on the algorithm applied, the techniques are divided into the following categories.
4.6.1. Supervised
In supervised techniques, a model is first trained on labeled data, and the video summary is then produced by that model. Supervised approaches are presented hereafter. Sukhwani and Kothari [43] proposed deep learning for player detection. To address the multiscale matching problem in person search, Zhou et al. [29] used the Cross-Level Semantic Alignment (CLSA) model, which learns the most discriminative representations from identity features in an end-to-end manner. Namitha et al. [73] used the Deep-SORT algorithm to track objects. Huang and Worring [21] proposed an end-to-end deep learning method for query-based video summarization in a visual-text embedding space, consisting of a summary generator, a summary controller, and a summary output module. Ul et al. [33] performed OoI detection with YOLOv3: a single neural network is applied to the full video, the frames are divided into regions, and the network predicts bounding boxes and probabilities. Choi et al. [56] trained a deep architecture to efficiently learn semantic embeddings of video frames; learning proceeds by progressively exploiting image caption data, and the algorithm selects the segments of the original video stream that are semantically relevant to the sentence or text description provided by the user. The studies in [5,7,13,14,19,24,27,32,33,35,43,47,59,61] use supervised techniques to produce a summary.
Active Video Summarization (AVS) was suggested in [67]: the system continually asks the user questions to obtain their preferences about the video and updates the summary online. Object recognition is performed on each extracted frame using a neural network, and the summary is generated with a line search algorithm. Fei et al. [30] presented a video summarization framework that analyzes large-scale Flickr photos of the user and trains a deep-ranking model to rank video frames by importance, so that the major sections are selected for the summary; the framework does not simply use hand-crafted features to compute the similarity between a set of web images and a frame.
Tejero-de-Pablos et al. [44] proposed a method that uses two separate neural networks to process two input types, the joint positions of the human body and the RGB frames of the video stream. To identify the highlights, the two streams are merged into a single representation, and the network is trained on the UGSV dataset from the lower to the upper layers. Depending on the type of summary, the personalized summarization process can be modeled in many ways, as described below.
Keyframe selection: the goal is to identify the most representative and varied content (frames) of a video for a brief summary. Keyframes represent the significant information contained in the video [51]. Keyframe-based approaches choose a limited set of images from the original video to provide an approximate visual representation [31]. Baghel et al. [51] proposed a method in which the user preference is given as an image query; the method is based on object recognition with automatic keyframe detection. Important frames are selected from the input video, and the output video is produced from these frames. A keyframe is selected based on the similarity score between an input video frame and the query image: a threshold is applied to the selection score, and a frame whose score exceeds the threshold is kept as a keyframe, otherwise it is discarded (see the sketch below).
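The thresholding step described above can be sketched as follows, assuming the query image and the video frames are embedded with the same pre-trained image encoder; the threshold value is illustrative:

```python
import numpy as np

def select_keyframes_by_query(frame_feats: np.ndarray, query_feat: np.ndarray,
                              threshold: float = 0.6):
    """Keep frames whose cosine similarity to the query image exceeds a threshold.

    frame_feats: (num_frames, d) feature vectors of the video frames.
    query_feat:  (d,) feature vector of the user's query image.
    Returns indices of the selected keyframes in temporal order, plus all scores.
    """
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    scores = f @ q                      # cosine similarity per frame
    return np.where(scores > threshold)[0].tolist(), scores
```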
Keyshot selection: keyshots are standard contiguous video segments extracted from the full-length video, each shorter than the original. Keyshot-driven summarization techniques are used to generate excerpts from short videos (such as user-created TikToks and news clips) or long videos (such as full-length movies and soccer games) [31]. Mujtaba et al. [31] presented the LTC-SUM method to produce personalized keyshot summaries by minimizing the distance between side semantic information and the selected video frames; the importance of the frame sequence is measured using a supervised encoder-decoder network. Zhang et al. [64] proposed a mapping network (MapNet) to express the degree of association of a shot with a given query by mapping visual information into the query space; using deep reinforcement learning, they built a summarization network (SummNet) that integrates diversity, representativeness, and relevance to produce personalized summaries. Jiang and Han [68] proposed a scoring mechanism based on the scene layer and the shot layer in a hierarchical structure; each shot is scored through this mechanism, and the highest-rated shots are selected as key shots.
Event-based selection: the personalized summarization process detects events from a video based on the user’s preferences. In the method above [31], a two-dimensional convolutional neural network (2D CNN) model was implemented to identify non-domain specific events from thumbnails.
Learning shot-level features: this involves learning high-level semantic information from a video segment. The Convolutional Hierarchical Attention Network (CHAN) was proposed by Xiao et al. [54]. After dividing the video into segments, visual features are extracted with a pre-trained network and sent to a feature encoding network that performs shot-level feature learning. A local self-attention module learns high-level semantic information within each video segment, a query-aware global attention module manages the semantic relationship between the given query and all segments, and a fully convolutional block reduces the length of the shot sequence and the dimension of the visual features.
Transfer learning: transfer learning adapts knowledge gained in one area (the source domain) to address challenges in a separate but related area (the target domain). It is rooted in the idea that, when tackling a problem, we generally rely on knowledge and experience gained from similar problems in the past [75]. Ul et al. [28] proposed a framework that uses transfer learning to perform facial expression recognition (FER). The learning process comprises two steps: first, a CNN model is trained for face recognition; second, transfer learning adapts the same model to FER.
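A minimal transfer-learning sketch of this two-step idea is shown below; it is not the authors' architecture (a torchvision ResNet-18 stands in for the pre-trained face model), and only the new 7-way FER head is trained:

```python
import torch
import torch.nn as nn
from torchvision import models

# Step 1 stand-in: a backbone pre-trained on another task
# (here torchvision's ImageNet ResNet-18 is used purely as a placeholder).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained feature extractor.
for param in backbone.parameters():
    param.requires_grad = False

# Step 2: replace the classification head with a 7-way FER head
# (disgust, surprise, happy, sad, neutral, anger, fear) and train only it.
backbone.fc = nn.Linear(backbone.fc.in_features, 7)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One fine-tuning step on a batch of face crops and emotion labels."""
    optimizer.zero_grad()
    loss = criterion(backbone(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```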
Adversarial learning: a technique in machine learning in which a model is tricked or confused by harmful input, which can be used to attack a system or to make training more robust. Zhang et al. [63] proposed a competitive three-player network. The generator learns the video content together with the representation of the user query. The discriminator receives three query-conditioned summaries so that it can distinguish the real summary from a random one and a generated one. The generator and the discriminator are trained with a three-player loss; this training avoids the generation of trivial random summaries, since the generator learns the summary results better.
Vision-language: a vision-language model is an artificial intelligence model that integrates natural language processing and computer vision to comprehend and produce textual descriptions of images, connecting visual information with natural language. Plummer et al. [76] used a two-branch network to learn a vision-language embedding model: one branch receives text features and the other visual features, and the network is trained with a margin-based triplet loss that combines a neighborhood-preserving term with two-way ranking terms.
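The bidirectional ranking part of such a loss can be sketched as follows (the neighborhood-preserving term of [76] is omitted); matched image/text embeddings are assumed to be produced by the two branches:

```python
import torch

def bidirectional_triplet_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                               margin: float = 0.2) -> torch.Tensor:
    """Two-way margin ranking loss over a batch of matched image/text embeddings.

    img_emb, txt_emb: (batch, d) L2-normalized embeddings; row i of each matches.
    Non-matching rows in the batch act as negatives for both directions.
    """
    sim = img_emb @ txt_emb.t()                 # cosine similarities (batch x batch)
    pos = sim.diag().unsqueeze(1)               # similarity of matching pairs

    # image -> text: every other text in the batch is a negative
    cost_i2t = (margin + sim - pos).clamp(min=0)
    # text -> image: every other image in the batch is a negative
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)

    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.mean() + cost_t2i.mean()
```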
Hierarchical self-attentive network: the hierarchical self-attentive network (HSAN) captures the relationship between video content and its associated semantic data at both the frame and segment levels, which enables the generation of a comprehensive summary [77]. Xiao et al. [77] presented such a network: the original video is first divided into segments, and a pre-trained deep convolutional network extracts a visual feature from each frame. A local and a global self-attention module capture the semantic relationship at the segment level and at the context level, respectively. The self-attention results are fed to an enhanced caption generator to learn the relationship between visual content and captions, and an importance score is produced for each frame or segment to form the summary.
Weight learning: Dong et al. [53] proposed a weight-learning approach that uses maximum-margin learning to automatically learn the weights of different objects. Learning can be carried out for different editing styles or product types, since these videos contain annotations that are highly relevant to the domain expert’s decisions. Different editing styles or product categories may lead to different weightings of the audio annotations associated with domain specific editing decisions, and these weights can be stored as defaults for efficient user exploration of the design space.
Gaussian mixture model: a Gaussian mixture model is a clustering method used to estimate the likelihood that a data point belongs to a cluster. Niu et al. [45] proposed a user preference learning algorithm in which the most representative of the extracted frames are initially selected as temporary keyframes and displayed to the user to indicate scene changes. If the user is not satisfied with the selected temporary keyframes, they can interact by manually selecting keyframes. User preferences are modeled with a Gaussian Mixture Model (GMM) whose parameters are automatically updated from the user’s manual keyframe selections. The personalized summary is produced in real time as the personalized frames replace the temporarily selected ones; the personalized keyframes represent the user’s preferences and taste.
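A minimal sketch of such a preference model (an illustration, not the algorithm of [45]) fits a GMM to the features of the keyframes the user has picked and ranks candidate frames by likelihood; the feature extractor and component count are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class PreferenceModel:
    """Model user taste as a GMM over the features of user-selected keyframes."""

    def __init__(self, n_components: int = 3):
        self.gmm = GaussianMixture(n_components=n_components,
                                   covariance_type="diag", random_state=0)

    def update(self, selected_frame_feats: np.ndarray):
        """Refit the mixture whenever the user manually picks new keyframes.
        Assumes at least n_components keyframes have been selected."""
        self.gmm.fit(selected_frame_feats)

    def rank_candidates(self, candidate_feats: np.ndarray, top_k: int = 10):
        """Return the candidate frames the current preference model likes most."""
        log_likelihood = self.gmm.score_samples(candidate_feats)
        return np.argsort(log_likelihood)[::-1][:top_k]
```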
4.6.2. Unsupervised
In unsupervised techniques, clusters of frames are first formed based on the quality of their content, and the video summary is then created by concatenating the keyframes of each cluster in chronological order. Lei et al. [60] introduced an unsupervised method called FrameRank: they construct a graph in which vertices correspond to video frames and edge weights measure frame similarity, and apply a graph ranking technique to measure the relative importance of each segment and each video frame. Depending on the type of summary, the personalized summarization process can be modeled in several ways, described hereafter.
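As a simplified illustration of graph-based frame ranking (FrameRank itself uses a KL-divergence-based affinity), the sketch below builds a cosine-similarity graph over frame features and ranks frames with PageRank:

```python
import numpy as np
import networkx as nx

def rank_frames_by_graph(frame_feats: np.ndarray, top_k: int = 10):
    """Rank frames with PageRank over a similarity graph (vertices = frames)."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sim = np.clip(f @ f.T, 0.0, 1.0)            # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)                  # no self-loops

    graph = nx.from_numpy_array(sim)            # weighted undirected graph
    scores = nx.pagerank(graph, weight="weight")
    order = sorted(scores, key=scores.get, reverse=True)
    return order[:top_k]                        # indices of the most central frames
```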
Contrastive learning: in this self-supervised approach, a model is pre-trained on a pretext task. The model learns to pull together representations that should be close (positives) and to push apart representations that should be far away (negatives), in order to discriminate between different objects [55].
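A minimal InfoNCE-style contrastive loss, shown here as a generic sketch rather than the exact formulation of [55], pulls matched pairs together and pushes apart the other items in the batch:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """Contrastive (InfoNCE) loss: row i of `positive` is the positive for row i
    of `anchor`; all other rows in the batch serve as negatives."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)     # pull diagonal pairs together
```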
Reinforcement learning: Nagar et al. [25] proposed a framework for creating unsupervised personalized summaries of egocentric video. Spatio-temporal features are first captured with 3D convolutional neural networks (3D CNNs), the video is split into non-overlapping shots, and the features are extracted. A reinforcement learning agent then consumes these features; the agent uses a bidirectional long short-term memory network (BiLSTM), whose forward and backward passes encapsulate future and past information for each subshot.
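A strongly simplified sketch of this idea is given below (not the authors' agent): a BiLSTM policy assigns selection probabilities to subshot features and is updated with REINFORCE using only a diversity reward; a representativeness term and the user-preference constraints would be added in practice:

```python
import torch
import torch.nn as nn

class SubshotPolicy(nn.Module):
    """BiLSTM policy that outputs a selection probability for each subshot."""

    def __init__(self, feat_dim: int = 512, hidden: int = 128):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, shot_feats: torch.Tensor) -> torch.Tensor:
        # shot_feats: (1, num_shots, feat_dim) -> (num_shots,) probabilities
        h, _ = self.bilstm(shot_feats)
        return torch.sigmoid(self.head(h)).squeeze(-1).squeeze(0)

def diversity_reward(feats: torch.Tensor, picked: torch.Tensor) -> torch.Tensor:
    """Reward summaries whose selected subshots are dissimilar to each other."""
    sel = feats[picked.bool()]
    if sel.size(0) < 2:
        return torch.tensor(0.0)
    sel = sel / sel.norm(dim=1, keepdim=True)
    off_diag = sel @ sel.t() - torch.eye(sel.size(0))
    return 1.0 - off_diag.sum() / (sel.size(0) * (sel.size(0) - 1))

policy = SubshotPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def reinforce_step(shot_feats: torch.Tensor) -> float:
    """One REINFORCE update: sample a summary, score it, reinforce the log-probs."""
    probs = policy(shot_feats.unsqueeze(0))
    dist = torch.distributions.Bernoulli(probs)
    picked = dist.sample()
    reward = diversity_reward(shot_feats, picked)
    loss = -(dist.log_prob(picked).sum() * reward)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.item()
```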
Event-based keyframe selection (EKP): EKP was developed by Ji et al. [70] so that keyframes can be presented in groups, separated according to specific events relevant to the query. The Multi-Graph Fusion (MGF) method automatically finds the events relevant to the query, and the keyframes in the different event categories are then separated based on the correspondence between videos and keyframes. The summary is represented by a two-level structure: event descriptions form the first layer and keyframes the second.
Fuzzy rule-based: a fuzzy system is a method for representing human knowledge that includes fuzziness and uncertainty; it is a representative and important application of fuzzy set theory [26]. Park and Cho [26] used a system based on fuzzy TSK rules to evaluate video event shots. In this rule-based system, the consequent is a function of the input variables rather than a linguistic variable, so the time-consuming defuzzification process can be avoided. The summaries in [49,50,55,58,72] use unsupervised techniques.
4.6.3. Supervised and Unsupervised
Narasimhan et al. [57] presented a model that can be trained either with or without supervision. The supervised setting uses reconstruction, diversity, and classification losses, whereas the unsupervised setting uses only reconstruction and diversity losses.
4.6.4. Weakly Supervised
Weakly supervised video summarization methods use less expensive labels instead of ground-truth data; these labels are imperfect compared with complete human annotations, but they can still lead to effective training of summarization models [78]. Compared with supervised approaches, a weakly supervised approach needs a smaller training set. Cizmeciler et al. [69] suggested a weakly supervised personalized summarization approach in which weak supervision takes the form of semantic confidence maps obtained from the predictions of pre-trained action/attribute classifiers.
The distribution of the papers listed in Table 2 according to the type of summarization method is shown in Figure 9. In most papers, around 73%, the proposed method for producing the personalized summary is supervised. Unsupervised methods come second, at around 24%, and weakly supervised methods last, at around 3%.
8. Quantitative Comparisons
In this section, we present quantitative comparisons of personalized video summarization approaches on the most prevalent datasets in the literature, TVSum and SumMe. The F-score evaluation of unsupervised personalized video summarization methods on the SumMe dataset is presented in Table 5, and Table 6 shows the corresponding F-score performance on the TVSum dataset. Based on the F-score results in Table 5 and Table 6, the following observations are worth mentioning:
The FrameRank (KL divergence based) approach performs better on the TVSum dataset than on the SumMe dataset. This unbalanced performance shows that the technique is better suited to TVSum. By contrast, the CSUM-MSVA and CLIP-It approaches have balanced performance, showing high scores on both datasets.
A very good choice for unsupervised personalized video summarization is a contrastive learning framework that accounts for diversity and representativeness: the CSUM-MSVA method, which is built on such a framework, is the most effective and top-performing method on both datasets.
The advanced CLIP-It method (CLIP-Image + Video Caption + Transformer) provides a high F-score on both datasets. The comparison with other configurations (GoogleNet + bi-LSTM, ResNet + bi-LSTM, CLIP-Image + bi-LSTM, CLIP-Image + Video Caption + bi-LSTM, GoogleNet + Transformer, ResNet + Transformer, CLIP-Image + Transformer) in [57] shows its benefits, as it obtains better results in all three settings. In CLIP-It, language and image embeddings from pre-trained networks are fused under language guidance using multiple attention heads. All frames are attended jointly by a frame-scoring transformer, which predicts frame relevance scores, and the knapsack algorithm then converts the frame scores into a selection of high-scoring shots. It is therefore one of the top methods for unsupervised personalized video summarization.
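The final shot-selection step can be illustrated with a standard 0/1 knapsack sketch (a generic illustration, not the CLIP-It code): each shot carries a score (e.g., summed frame scores) and a length, and the summary must fit a duration budget:

```python
def knapsack_select(scores, lengths, budget):
    """0/1 knapsack: pick the set of shots with maximum total score whose
    total length does not exceed `budget` (lengths and budget as integers)."""
    n = len(scores)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            dp[i][b] = dp[i - 1][b]
            if lengths[i - 1] <= b:
                dp[i][b] = max(dp[i][b], dp[i - 1][b - lengths[i - 1]] + scores[i - 1])

    # backtrack to recover the selected shot indices
    picked, b = [], budget
    for i in range(n, 0, -1):
        if dp[i][b] != dp[i - 1][b]:
            picked.append(i - 1)
            b -= lengths[i - 1]
    return sorted(picked)

# Example: shot scores from a frame-scoring model, budget e.g. 15% of video length
shots = knapsack_select(scores=[3.2, 1.1, 4.5, 2.0], lengths=[40, 25, 60, 30], budget=90)
```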
Figure 11 compares the reference summary with the results obtained from CLIP-Image + Transformer and from the complete CLIP-It model (CLIP-Image + Video Caption + Transformer). The content is a video showing a cooking procedure. Without captions, the model gives high scores to some irrelevant frames, such as those showing the woman talking or eating, which hurts precision. With captions, the cross-attention mechanism ensures that frames featuring significant actions and objects receive high scores.
The multimodal Multi-Graph Fusion (MGF) method, introduced in the QUASC approach to automatically identify query-related events from tag information in the video, such as descriptions and titles, does not produce a good summary. As a result, the QUASC approach is not competitive with any other approach on the TVSum dataset.
The Actor-Critic method produces a satisfactory personalized video summary, as shown by its F-score on the two datasets. It uses 3D convolutional neural networks (3D CNNs) to capture spatio-temporal features and a bidirectional long short-term memory network (BiLSTM) to learn from the extracted features. However, it is not competitive with the leading CSUM-MSVA and CLIP-It methods.
Table 7 shows the F-score performance of the supervised personalized video summarization methods on the SumMe dataset, and Table 8 reports the corresponding performance on the TVSum dataset. Based on the F-score results in Table 7 and Table 8, the following observations are worth mentioning:
Based on the F-score results in Table 5, Table 6, Table 7, and Table 8, we draw the following conclusions. The CLIP-It method is a good choice for creating an unsupervised personalized video summary and an even better choice for a supervised one: on each dataset, its F-score is higher in the supervised setting than in the unsupervised one. This superiority is due to the use of three loss functions (reconstruction, diversity, and classification) in training, compared with only two (reconstruction and diversity) in the unsupervised setting. Finally, contrastive learning outperforms the multimodal language-guided transformer in unsupervised video summarization, so CSUM-MSVA appears to be the better choice for unsupervised personalized summarization, while CLIP-It still provides a good summary.