Preprint
Article

Adaptive Kernel-Attention Framework for Multimodal Representation Learning

This version is not peer-reviewed

Submitted: 20 November 2024

Posted: 22 November 2024

Abstract
Retrieval over multimodal data, such as text accompanied by images on platforms like Wikipedia, has become an important area of research. The central challenge is to represent such data effectively so that retrieval is efficient. Traditional deep multimodal learning methods generally follow a two-step process: (1) separate deep models independently extract intermediate features for each modality, and (2) these intermediate features are then fused into a unified representation. Because the intermediate features are extracted without mutual awareness, these approaches cannot fully exploit inter-modal information. In this work, we introduce the Adaptive Kernel-Attention Framework (AKAF) to address this limitation. AKAF uses a dynamic modal-aware operation as its core building block to capture complex inter-modal dependencies during the intermediate feature learning stage. This operation consists of a kernel network that models non-linear inter-modal relationships and an attention network that focuses on salient regions within the data, optimizing the representations for binary hash code generation. By introducing mutual awareness across modalities at an early stage, the framework improves the quality of the joint representation. In experiments on three benchmark datasets, AKAF achieves substantial improvements in retrieval performance over state-of-the-art methods. These results underscore the potential of modal-aware learning for advancing multimodal retrieval systems.
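The modal-aware operation described above can be illustrated with a minimal NumPy sketch. It is not the AKAF implementation: the abstract does not specify the architecture, so the RBF kernel standing in for the kernel network, the softmax pooling standing in for the attention network, the random projection used for hashing, and all names (`rbf_kernel`, `modal_aware_op`, `to_hash`, `gamma`, `proj`) are assumptions chosen only to show how kernel-based cross-modal similarity can steer attention before binary codes are produced.

```python
import numpy as np


def rbf_kernel(a, b, gamma=0.5):
    """Pairwise RBF similarity between rows of a and rows of b (assumed kernel)."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq, 0.0))


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def modal_aware_op(img, txt, gamma=0.5):
    """Pool each modality with attention weights derived from the other one.

    img: (n_regions, d) image region features
    txt: (n_tokens, d) text token features
    """
    k = rbf_kernel(img, txt, gamma)      # (n_regions, n_tokens)
    attn_img = softmax(k.sum(axis=1))    # regions that match the text get more weight
    attn_txt = softmax(k.sum(axis=0))    # tokens that match the image get more weight
    img_vec = attn_img @ img             # (d,)
    txt_vec = attn_txt @ txt             # (d,)
    return np.concatenate([img_vec, txt_vec]), attn_img, attn_txt


def to_hash(joint, proj):
    """Binarize a joint representation into a code with entries in {-1, +1}."""
    return np.where(proj @ joint >= 0, 1, -1).astype(np.int8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(6, 8))        # 6 hypothetical image regions
    txt = rng.normal(size=(4, 8))        # 4 hypothetical text tokens
    proj = rng.normal(size=(16, 16))     # random projection to a 16-bit code
    joint, attn_img, attn_txt = modal_aware_op(img, txt)
    print(to_hash(joint, proj))
```

In a trained model the kernel parameters, attention weights, and hashing projection would be learned jointly; here they are fixed or random, so the sketch only demonstrates the data flow in which each modality's features are weighted using information from the other before fusion.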
Keywords: 
Subject: Computer Science and Mathematics  -   Artificial Intelligence and Machine Learning
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2024 MDPI (Basel, Switzerland) unless otherwise stated