Article
Version 1
This version is not peer-reviewed
The Optimal Choice of the Encoder-Decoder Model Components for Image Captioning
Received: 13 August 2024 / Approved: 14 August 2024 / Online: 14 August 2024 (16:36:18 CEST)
How to cite: Bartosiewicz, M.; Iwanowski, M. The Optimal Choice of the Encoder-Decoder Model Components for Image Captioning. Preprints 2024, 2024081045. https://doi.org/10.20944/preprints202408.1045.v1
Abstract
Image captioning aims to generate meaningful verbal descriptions of a digital image. Our paper focuses on the classic encoder-decoder deep learning model, which consists of several component sub-networks, each performing a separate task; combined, they form an effective caption generator. We investigate image feature extractors, recurrent neural networks, word embedding models, and word generation layers, and discuss how each component influences the captioning model's overall performance. Our experiments are performed on the MS COCO 2014 dataset. The results help design efficient models with optimal combinations of their components.
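To make the component roles concrete, the following is a minimal NumPy sketch of the encoder-decoder pipeline the abstract describes: an image feature vector (as produced by a feature extractor) initializes a recurrent decoder, which combines a word embedding lookup with a word generation (output) layer to emit tokens. All dimensions, weight names, and the use of a plain tanh RNN with untrained random weights and greedy decoding are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical, untrained toy dimensions -- the paper compares real
# extractors (e.g. CNN backbones), RNN variants, and embedding models.
rng = np.random.default_rng(0)
feat_dim, embed_dim, hidden_dim, vocab_size = 64, 32, 48, 100

W_feat = rng.standard_normal((feat_dim, hidden_dim)) * 0.1   # encoder projection
E = rng.standard_normal((vocab_size, embed_dim)) * 0.1       # word embedding table
W_xh = rng.standard_normal((embed_dim, hidden_dim)) * 0.1    # input-to-hidden
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1   # hidden-to-hidden
W_out = rng.standard_normal((hidden_dim, vocab_size)) * 0.1  # word generation layer

def decode_step(h, token_id):
    """One recurrent decoding step: embed token, update state, score vocab."""
    x = E[token_id]                      # word embedding lookup
    h_new = np.tanh(x @ W_xh + h @ W_hh)  # simplified RNN cell (stands in for LSTM/GRU)
    logits = h_new @ W_out               # scores over the vocabulary
    return h_new, logits

# Encoder output conditions the decoder's initial hidden state.
image_features = rng.standard_normal(feat_dim)  # stand-in for CNN features
h = np.tanh(image_features @ W_feat)

START = 1  # hypothetical <start> token id
token, caption = START, []
for _ in range(5):                       # greedy decoding for a few steps
    h, logits = decode_step(h, token)
    token = int(np.argmax(logits))
    caption.append(token)
```

With trained weights, the loop would run until an end-of-sequence token; swapping the feature extractor, recurrent cell, embedding table, or output layer is exactly the component-level comparison the paper performs.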
Keywords
image captioning; image processing; image analysis; computer vision; recurrent neural network
Subject
Computer Science and Mathematics, Computer Vision and Graphics
Copyright: This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.