Lawson, Z.; Martinez, A.; Quinn, S. SensorySync: Multimodal Integration Framework for Unified Perceptual Understanding. Preprints 2024, 2024091969. https://doi.org/10.20944/preprints202409.1969.v1
APA Style
Lawson, Z., Martinez, A., & Quinn, S. (2024). SensorySync: Multimodal Integration Framework for Unified Perceptual Understanding. Preprints. https://doi.org/10.20944/preprints202409.1969.v1
Chicago/Turabian Style
Lawson, Z., Ava Martinez and Seraphina Quinn. 2024. "SensorySync: Multimodal Integration Framework for Unified Perceptual Understanding." Preprints. https://doi.org/10.20944/preprints202409.1969.v1
Abstract
Generic text embeddings have demonstrated considerable success across a wide range of applications. However, these embeddings are typically derived by modeling co-occurrence patterns within text-only corpora, which can limit their ability to generalize across diverse contexts. In this study, we investigate methodologies that incorporate visual information into textual representations to overcome these limitations. Informed by extensive ablation studies, we introduce a simple novel architecture named VisualText Fusion Network (VTFN). This architecture not only surpasses existing multimodal approaches on a range of well-established benchmark datasets but also achieves state-of-the-art performance on image-related textual datasets while utilizing significantly less training data. Our findings underscore the potential of integrating visual modalities to substantially enhance the robustness and applicability of text embeddings, paving the way for more nuanced and contextually rich semantic representations.
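The abstract does not describe VTFN's internals, so as a rough illustration of the general idea it gestures at (injecting visual information into a text representation), the following is a minimal concatenate-then-project late-fusion sketch in plain Python. All names, dimensions, and the fusion scheme here are hypothetical assumptions for illustration, not the authors' actual architecture.

```python
def fuse_embeddings(text_emb, image_emb, weights):
    """Hypothetical late-fusion sketch: concatenate a text embedding with an
    image embedding, then apply a linear projection.

    weights is a list of rows (out_dim x (len(text_emb) + len(image_emb))).
    This illustrates the general concatenate-then-project idea only; it is
    not the VTFN model described in the paper.
    """
    joint = text_emb + image_emb  # list concatenation -> joint vector
    return [sum(w * x for w, x in zip(row, joint)) for row in weights]


# Toy example: 2-d text and 2-d image embeddings projected to 3 dimensions.
text_emb = [0.5, -0.2]
image_emb = [0.1, 0.9]
weights = [
    [1, 0, 0, 0],  # passes through the first text dimension
    [0, 1, 0, 0],  # passes through the second text dimension
    [0, 0, 1, 1],  # sums the two image dimensions
]
fused = fuse_embeddings(text_emb, image_emb, weights)
print(fused)
```

In a trained system the projection weights would be learned jointly with the rest of the model; the fixed weights above merely make the toy output easy to inspect.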
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.