Preprint Article, Version 1 (this version is not peer-reviewed)

The Optimal Choice of the Encoder-Decoder Model Components for Image Captioning

Version 1: Received: 13 August 2024 / Approved: 14 August 2024 / Online: 14 August 2024 (16:36:18 CEST)

How to cite: Bartosiewicz, M.; Iwanowski, M. The Optimal Choice of the Encoder-Decoder Model Components for Image Captioning. Preprints 2024, 2024081045. https://doi.org/10.20944/preprints202408.1045.v1

Abstract

Image captioning aims to generate meaningful verbal descriptions of a digital image. Our paper focuses on the classic encoder-decoder deep learning model, which consists of several sub-networks, each performing a separate task; combined, they form an effective caption generator. We investigate image feature extractors, recurrent neural networks, word embedding models, and word generation layers, and discuss how each component influences the captioning model's overall performance. Our experiments are performed on the MS COCO 2014 dataset. The results inform the design of efficient models with optimal combinations of their components.
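
To make the studied architecture concrete, the sketch below shows how the four component types named in the abstract typically fit together in a classic encoder-decoder captioner. It is a minimal illustration under our own assumptions (a frozen ResNet-50 feature extractor, an LSTM decoder, and the class and variable names used here), not the authors' implementation; the paper's point is precisely that each of these slots can be filled by different choices.

```python
# Minimal sketch of a classic CNN-RNN encoder-decoder captioner.
# Illustrative only: the ResNet-50 backbone, LSTM decoder, and all
# names here are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderDecoderCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Image feature extractor: a pretrained CNN with its
        # classification head removed (one of several possible choices).
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep the extractor frozen
        self.project = nn.Linear(backbone.fc.in_features, embed_dim)

        # Decoder components: word embedding model, recurrent network
        # (LSTM here; GRU is a common alternative), and the
        # word-generation layer that maps hidden states to vocabulary logits.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.word_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Extract and project image features, then feed them as the
        # first input step before the embedded caption tokens.
        feats = self.encoder(images).flatten(1)     # (B, 2048)
        feats = self.project(feats).unsqueeze(1)    # (B, 1, E)
        tokens = self.embed(captions)               # (B, T, E)
        inputs = torch.cat([feats, tokens], dim=1)  # (B, T+1, E)
        hidden, _ = self.rnn(inputs)
        return self.word_out(hidden)                # logits over the vocabulary
```

Swapping any single slot (e.g., a different CNN backbone, a GRU instead of an LSTM, or pretrained rather than learned embeddings) while holding the others fixed is the kind of component-wise comparison the paper's experiments perform.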

Keywords

image captioning; image processing; image analysis; computer vision; recurrent neural network

Subject

Computer Science and Mathematics, Computer Vision and Graphics
