Preprint Article Version 1 This version is not peer-reviewed

Unifying Video Self-Supervised Learning across Families of Tasks: A Survey

Version 1 : Received: 1 August 2024 / Approved: 2 August 2024 / Online: 2 August 2024 (06:04:36 CEST)

How to cite: Dave, I.; Gunawardhana, M.; Sadith, L.; Zhou, H.; David, L.; Harari, D.; Shah, M.; Khan, M. Unifying Video Self-Supervised Learning across Families of Tasks: A Survey. Preprints 2024, 2024080133. https://doi.org/10.20944/preprints202408.0133.v1

Abstract

Video self-supervised learning (VideoSSL) offers significant potential for reducing annotation costs and enhancing a wide range of downstream tasks in video understanding. The ultimate goal of VideoSSL is to achieve human-level video intelligence across a spectrum of tasks, from low-level tasks such as pixel temporal correspondence to high-level, complex spatio-temporal tasks such as action recognition. However, most existing VideoSSL methods focus on isolated aspects of this spectrum and fail to integrate different levels of task complexity. Our study presents the first comprehensive survey that connects all families of VideoSSL methods. We review the full spectrum of VideoSSL, from low to high levels, by conceptually linking the self-supervised learning objectives of these families and providing a comprehensive categorization. Our extensive evaluation highlights the strengths and limitations of each SSL objective across various downstream task families. We also detail the challenges in current VideoSSL research, such as data curation, interpretability, deployment, and privacy concerns, which previous surveys have not thoroughly explored. We identify where existing methods already help address these challenges and outline future directions for research.

Keywords

Video Understanding; Self-Supervised Learning; Representation Learning

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
