Preprint Article

RVTAF: Residual Vision Transformer and Adaptive Fusion Autoencoders for Monocular Depth Estimation

This version is not peer-reviewed

Submitted: 18 November 2024
Posted: 19 November 2024

Abstract
Precise depth estimation plays a key role in many applications, including 3D scene reconstruction, virtual reality, autonomous driving, and human-computer interaction. With recent advancements in deep learning, monocular depth estimation has surpassed traditional stereo camera systems, opening new possibilities in 3D sensing. In this paper, we propose an end-to-end supervised monocular depth estimation autoencoder that obtains high-precision depth maps from a single camera; it comprises a CNN-ViT encoder and an adaptive fusion decoder. In the CNN-ViT encoder, we construct a multi-scale feature extractor that mixes residual configurations of vision transformers to enhance both local and global information. In the adaptive fusion decoder, we introduce adaptive fusion modules that effectively merge encoder and decoder features. Finally, the model is trained with a loss function aligned with human perception so that it focuses on the depth values of foreground objects. Experimental results demonstrate that the proposed RVTAF autoencoder effectively predicts depth maps from a single-view color image.
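
To make the described architecture concrete, the following is a minimal PyTorch-style sketch of an autoencoder with a CNN-ViT encoder (residual vision transformer blocks over multi-scale CNN features) and an adaptive fusion decoder, as outlined in the abstract. All module names, channel widths, and internal designs (e.g., the gating used in AdaptiveFusionModule) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the RVTAF autoencoder; structure and
# hyperparameters are assumptions, since the abstract gives no details.
import torch
import torch.nn as nn


class ResidualViTBlock(nn.Module):
    """Transformer layer applied to a CNN feature map with a residual
    connection, mixing local CNN features with global attention (assumed)."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True
        )

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        tokens = self.attn(tokens)
        out = tokens.transpose(1, 2).reshape(b, c, h, w)
        return x + out                         # residual ViT configuration


class AdaptiveFusionModule(nn.Module):
    """Learns per-channel weights to merge an encoder skip feature with
    the corresponding decoder feature (one plausible fusion design)."""

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, enc_feat, dec_feat):
        w = self.gate(torch.cat([enc_feat, dec_feat], dim=1))
        return w * enc_feat + (1 - w) * dec_feat


class RVTAF(nn.Module):
    """CNN-ViT encoder + adaptive fusion decoder (skeleton only)."""

    def __init__(self, channels=(32, 64, 128)):
        super().__init__()
        self.stem = nn.Conv2d(3, channels[0], 3, stride=2, padding=1)
        self.enc_stages = nn.ModuleList()
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            self.enc_stages.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                nn.ReLU(inplace=True),
                ResidualViTBlock(c_out),       # multi-scale global context
            ))
        self.dec_stages = nn.ModuleList()
        self.fusions = nn.ModuleList()
        for c_in, c_out in zip(channels[::-1][:-1], channels[::-1][1:]):
            self.dec_stages.append(nn.ConvTranspose2d(c_in, c_out, 2, stride=2))
            self.fusions.append(AdaptiveFusionModule(c_out))
        self.head = nn.Sequential(
            nn.ConvTranspose2d(channels[0], 1, 2, stride=2),
            nn.Sigmoid(),                      # normalized depth map
        )

    def forward(self, x):
        skips = []
        x = self.stem(x)
        skips.append(x)
        for stage in self.enc_stages:
            x = stage(x)
            skips.append(x)
        skips.pop()                            # deepest feature feeds the decoder
        for up, fuse in zip(self.dec_stages, self.fusions):
            x = fuse(skips.pop(), up(x))
        return self.head(x)                    # (B, 1, H, W) depth


depth = RVTAF()(torch.randn(1, 3, 64, 64))
print(depth.shape)  # torch.Size([1, 1, 64, 64])
```

The sketch follows the usual U-shaped autoencoder layout: encoder features are saved as skips, and each decoder stage fuses its upsampled feature with the matching skip through the adaptive fusion module rather than by plain concatenation.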
Keywords: 
Subject: Engineering - Electrical and Electronic Engineering
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.