MC-Net: A Multi-Path Contextual Reasoning Framework for Multimodal Conversations

Ethan Parker; Nia Harper; Jannat Roy

doi:10.20944/preprints202411.1637.v1

Submitted:

20 November 2024

Posted:

21 November 2024

You are already at the latest version

Abstract

Multimodal Conversation is a sophisticated vision-language task where an AI agent must engage in meaningful dialogues grounded in visual content. This requires a deep understanding of not only the presented question but also the dialog history and the associated image context. However, existing methods primarily focus on single-hop or single-path reasoning, which often fall short in capturing the nuanced multimodal relationships essential for generating accurate and contextually relevant responses. In this paper, we propose a novel and powerful model, the Multi-path Contextual Reasoning Model (MC-Net), which employs multi-path reasoning and multi-hop mechanisms to process complex multimodal information comprehensively. MC-Net integrates dialog history and image context in parallel, iteratively enriching the semantic representation of the input question through both paths. Specifically, MC-Net adopts a multi-path framework to simultaneously derive question-aware image features and question-enhanced dialog history features, effectively leveraging iterative reasoning processes within each path. Furthermore, we design an enhanced multimodal attention mechanism to optimize the decoder, enabling it to generate highly precise responses. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate that MC-Net significantly outperforms existing methods, showcasing its efficacy in advancing multimodal conversational AI.

Keywords:

Multimodal Conversation

;

Multi-path Reasoning

;

Contextual Representation

;

Multi-hop Reasoning

;

Multimodal Attention

;

Vision-Language Integration

Subject:

Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

The convergence of computer vision and natural language processing has catalyzed rapid advancements in multimodal tasks such as image captioning [2,3,26] and visual question answering (VQA) [3,8,15,21]. These tasks have inspired the development of intelligent systems capable of integrating visual and linguistic modalities. However, while tasks like image captioning and VQA primarily address single-turn interactions, human communication is inherently multi-turn, dynamic, and context-dependent. To bridge this gap, the task of Multimodal Conversation has been proposed by [5], emphasizing the need for visually-grounded, multi-round dialogues.

The rapid evolution of artificial intelligence has increasingly emphasized the integration of multiple modalities to enable systems to understand and respond to complex scenarios. Traditional vision-language tasks, such as image captioning [2,26] and visual question answering (VQA) [15,21], have laid a foundation for research in this domain. However, these tasks typically operate in single-turn settings, where an agent responds to isolated queries or generates captions for individual images. Such single-turn paradigms fail to capture the dynamic and interactive nature of real-world communication, where conversations are inherently multi-turn and context-dependent. Addressing this gap, Multimodal Conversation tasks aim to foster research on AI systems capable of engaging in visually-grounded dialogues, bridging the divide between static understanding and dynamic conversational reasoning.

A key challenge in Multimodal Conversation lies in the ability to model interactions between textual and visual information effectively. Unlike single-turn tasks, where simple feature fusion often suffices, multi-turn interactions require iterative reasoning to extract relevant details from both the visual content and the dialog history. For instance, an agent must infer implicit contextual cues, such as which previous question-answer pairs or image regions are most pertinent to the current question. However, existing approaches often fall short by relying on single-path reasoning [5,16], where one modality—typically the dialog history or the image—is treated as secondary in the reasoning process. Such limitations highlight the need for models that can concurrently process and integrate insights from multiple modalities, ensuring a richer and more holistic understanding of the dialog context.

Another significant challenge is the necessity for iterative refinement, where an agent must revisit and update its understanding of the dialog context and visual features over multiple reasoning steps. Human cognition relies heavily on such iterative processes, using feedback from both visual and textual sources to build more accurate representations. While multi-hop reasoning approaches [11,25] have made progress in this direction, they often focus on sequential attention mechanisms that prioritize one modality at a time. In contrast, parallel processing of multimodal inputs, where both dialog history and visual context are simultaneously reasoned upon, remains underexplored. This gap serves as a primary motivation for our work, as we seek to design a framework that captures the dynamic interplay between modalities through a multi-path reasoning structure, enabling robust and contextually-aware response generation.

In Multimodal Conversation, an AI agent must generate coherent and contextually appropriate responses to questions by comprehensively understanding the dialog history and visual context. This necessitates sophisticated reasoning capabilities that go beyond merely aligning features from different modalities. Early models [5,16] designed for this task focused on feature fusion. For instance, [5] introduced the Late Fusion (LF) model, which concatenates representations of the question, dialog history, and image before applying a linear transformation. Similarly, [16] proposed a history-conditioned attention mechanism to generate joint representations by combining question-aware dialog history and history-conditioned image features.

While these methods demonstrated the feasibility of integrating visual and textual modalities, they rely on single-hop reasoning, which limits their ability to capture complex interactions between modalities. To address this, multi-hop reasoning approaches [10,11] have been introduced to iteratively refine representations across multiple steps. For example, [25] proposed a sequential co-attention mechanism that alternates between image and dialog history features to iteratively enrich the question representation. Similarly, [18] developed a recursive attention model that repeatedly revisits the dialog history to locate relevant contextual cues before extracting image features.

Despite these advancements, existing methods often adopt single-path reasoning strategies, where one modality is prioritized over the other during the reasoning process. However, humans inherently process visual and textual information concurrently, synthesizing insights from both sources simultaneously. This highlights the need for a multi-path reasoning framework that enables parallel exploration of both visual and textual modalities.

To this end, we propose the Multi-path Contextual Reasoning Model (MC-Net), a novel framework designed to address the challenges of multimodal reasoning in conversations. MC-Net employs a multi-path structure to concurrently derive insights from the dialog history and image context, enhancing the semantic representation of the question through iterative reasoning in each path. Additionally, we introduce a robust multimodal attention mechanism to optimize response generation, ensuring that the decoder effectively utilizes the enriched representations.

In summary, this paper presents MC-Net, a framework that advances the state of the art in Multimodal Conversation by leveraging multi-path reasoning and multimodal attention mechanisms. Our contributions are as follows:

We propose a multi-path reasoning framework that integrates dialog history and image context in parallel, enabling comprehensive understanding and representation enrichment for the input question.
We introduce an enhanced multimodal attention mechanism tailored for decoding in multimodal conversational tasks, demonstrating its efficacy in improving response accuracy.
We validate our approach on the VisDial v0.9 and v1.0 datasets, achieving significant performance improvements and providing thorough ablation studies and human evaluations to substantiate our claims.

2. Related Work

Advancements in Vision-Language Tasks

Vision-language tasks, which aim to bridge visual and linguistic modalities, have experienced significant progress in the past decade. Prominent tasks such as image captioning [6,8,13,21,23] and visual question answering (VQA) [1,3,4,24,27] have gained substantial attention due to their relevance in real-world applications such as accessibility tools and human-computer interaction systems. Image captioning involves generating coherent textual descriptions of image content, enabling machines to interpret and communicate visual information. Conversely, VQA extends this capability by requiring models to answer open-ended questions about images, demanding a deeper understanding of both the visual and textual inputs.

While these tasks have catalyzed advancements in multimodal reasoning, their single-turn nature imposes limitations on practical applications. Real-world scenarios often involve multi-turn interactions where context evolves dynamically over time. Recognizing this gap, researchers have shifted focus to tasks like Multimodal Conversation, which requires models to engage in multi-round interactions grounded in visual content. This transition highlights the need for models capable of retaining contextual understanding across multiple dialog turns while effectively integrating information from vision and language.

Multimodal Conversation and Its Challenges

Multimodal Conversation, as an extension of vision-language tasks, focuses on developing AI systems that can sustain visually grounded multi-turn dialogues. Early attempts to address this challenge include dialog-RNN models [5], which process the current question, visual content, and a limited dialog history to generate responses. Although these models introduce mechanisms for multi-turn interactions, their reliance on only the immediate history restricts their capacity to understand long-term contextual dependencies.

Other methods, such as those proposed by [7], employ sequential reasoning frameworks. For example, their multi-step reasoning model utilizes RNNs to sequentially attend to images, dialog history, and queries. This iterative process enables models to refine their understanding of each modality progressively. However, sequential attention mechanisms often fail to capture the simultaneous and parallel interplay between visual and textual information, which is essential for generating accurate and contextually relevant responses.

Dynamic Attention Mechanisms in Multimodal Learning

Attention mechanisms have emerged as a cornerstone of modern multimodal learning. From simple co-attention frameworks [3,15] to more sophisticated self-attention mechanisms [26], attention has proven effective in aligning information across modalities. In the context of Multimodal Conversation, models such as the Memory-Attended Encoder [22] have incorporated attention modules to retrieve relevant dialog context dynamically. However, these methods often employ single-path reasoning, where attention is applied either to the image or to the dialog history in isolation. This approach overlooks the potential of dual-modality interactions, where visual and textual contexts are reasoned upon simultaneously.

Recent advances have explored cross-modal attention, where interactions between modalities are modeled directly [7,40]. While these techniques improve upon isolated attention methods, they are primarily designed for single-turn tasks and are less effective in handling the complexities of multi-turn dialogues. This gap underscores the need for innovative attention mechanisms that can operate effectively within multi-path frameworks, enabling simultaneous reasoning across modalities.

Multi-Hop Reasoning in Multimodal Contexts

Multi-hop reasoning, initially introduced in tasks such as VQA [10,11], involves iterative processing of input modalities to refine understanding over multiple steps. This technique has been adapted for Multimodal Conversation by models such as Recursive Visual Attention [18], which iteratively revisits dialog history to extract relevant contextual cues. Similarly, sequential co-attention encoders [25] alternate between attending to visual and textual inputs, progressively enhancing question representations.

Despite these advancements, current multi-hop reasoning methods primarily adopt single-path strategies, where modalities are processed sequentially. While effective to some extent, these methods fail to leverage the potential of parallel reasoning, where both dialog history and image context are processed simultaneously. Parallel reasoning not only enriches the semantic representation of questions but also enhances the model’s ability to capture latent dependencies between modalities.

Cross-Domain Inspiration: Knowledge Graphs and Transformers

In addition to vision-language-specific research, techniques from other domains have inspired advancements in Multimodal Conversation. Knowledge graph-based reasoning [11,52] offers insights into structured, interpretable reasoning, which can be adapted to model relationships within dialog history and visual context. Similarly, the Transformer architecture [26], originally designed for NLP, has been extended to multimodal tasks [38], enabling models to capture long-range dependencies across heterogeneous inputs. These cross-domain techniques provide valuable tools for addressing the challenges of Multimodal Conversation, particularly in designing models like MC-Net that require robust reasoning and attention mechanisms.

Motivation for MC-Net

Building on the limitations of prior work, we introduce MC-Net, a Multi-path Contextual Reasoning framework specifically designed for Multimodal Conversation. Unlike existing models, which often adopt sequential reasoning strategies, MC-Net employs a Track Module and a Locate Module to enable parallel processing of visual and textual modalities. The Track Module focuses on extracting detailed representations from images, while the Locate Module synthesizes insights from dialog history. By integrating these modules through iterative reasoning hops, MC-Net achieves a more comprehensive and contextually aware representation of the input question. Additionally, MC-Net incorporates an advanced multimodal attention mechanism to optimize the decoding process, ensuring that responses are both accurate and contextually relevant. Through these innovations, MC-Net sets a new standard for multimodal conversational AI.

3. Methodology

In this section, we present the methodology for Multimodal Conversations using our proposed MC-Net (Multi-path Contextual Reasoning Network). The task setup follows the definitions from [5], where the inputs to a conversational agent include an image I, a caption C describing the image, a dialog history composed of question-answer pairs up to the current round t:

H = ({\underset{︸}{C}}_{H_{0}}, {\underset{︸}{(Q_{1}, A_{1})}}_{H_{1}}, \dots, {\underset{︸}{(Q_{t - 1}, A_{t - 1})}}_{H_{t - 1}})

, and the current question

Q_{t}

. The objective of the agent is to generate a response

A_{t}

grounded in both the visual and textual inputs.

MC-Net is designed with four primary components: (1) Input Representation, responsible for encoding both visual and textual features into compatible embeddings; (2) Multi-path Contextual Reasoning, where iterative reasoning is applied through distinct reasoning pathways to refine question understanding; (3) Multimodal Fusion, integrating information from both reasoning pathways into a unified representation; and (4) Generative Decoder, which utilizes an attention mechanism to generate contextually accurate responses. The reasoning modules, referred to as the Track Module and Locate Module, are central to the multi-path reasoning architecture.

In the following sections, we detail each component, introduce necessary notations, and describe the reasoning and fusion mechanisms.

3.1. Input Representation

Visual Features.

We extract object-level image features using a pre-trained Faster R-CNN [20], capturing region-level embeddings. For an image I, the feature matrix is represented as:

v = Faster R - CNN (I) \in R^{K \times V},

(1)

where K represents the number of detected objects, and V denotes the feature dimension of each region. These features provide a compact representation of the image’s visual context, emphasizing object-centric semantics.

Textual Features.

For language embeddings, we first tokenize the question

Q_{t}

into word vectors using pre-trained GloVe embeddings [19]. A bidirectional LSTM (BiLSTM) processes these tokens to generate a sequence of hidden states:

\begin{matrix} {\vec{x}}_{t, j} & = {LSTM}_{f} (w_{t, j}, x_{t, j - 1}), \end{matrix}

(2)

\begin{matrix} {\overset{\leftarrow}{x}}_{t, j} & = {LSTM}_{b} (w_{t, j}, x_{t, j + 1}), \end{matrix}

(3)

where j indexes tokens in

Q_{t}

. The final question representation

q_{t}

is obtained by concatenating the forward and backward hidden states:

q_{t} = [{\vec{x}}_{t, L}, {\overset{\leftarrow}{x}}_{t, 1}],

(4)

where L is the total number of tokens in

Q_{t}

.

Dialog history H is processed similarly, with individual question-answer pairs encoded into feature vectors

u_{i}

. The textual embedding pipeline ensures that

Q_{t}

, H, and

A_{t}

are aligned in the same feature space.

3.2. Multi-path Contextual Reasoning

The multi-path reasoning framework consists of two complementary modules: Track Module and Locate Module. The Track Module enriches question semantics using visual features, while the Locate Module leverages dialog history for the same purpose. Multi-hop reasoning is employed within each module to iteratively refine representations, ensuring thorough integration of multimodal information.

Track Module.

The Track Module focuses on deriving question-aware representations of image features. Given the question feature

q_{t r a c k}

and image features v, attention weights are computed as:

\begin{matrix} S & = f_{t r a c k}^{q} (q_{t r a c k}) \circ f_{t r a c k}^{v} (v), \end{matrix}

(5)

\begin{matrix} α & = softmax (W^{S} S + b^{S}), \end{matrix}

(6)

where

f_{t r a c k}^{q} (\cdot)

and

f_{t r a c k}^{v} (\cdot)

are neural projections, and

α

denotes attention scores for image regions. The weighted image representation is then computed as:

q_{t r a c k}^{o u t} = \sum_{i = 1}^{K} α_{i} v_{i} .

(7)

Locate Module.

The Locate Module enriches question understanding using dialog history. Similar to the Track Module, attention is computed between the question feature

q_{l o c a t e}

and history features u:

\begin{matrix} Z & = f_{l o c a t e}^{q} (q_{l o c a t e}) \circ f_{l o c a t e}^{u} (u), \end{matrix}

(8)

\begin{matrix} η & = softmax (W^{Z} Z + b^{Z}), \end{matrix}

(9)

yielding history-aware attention weights

η

. The attended dialog history representation is:

q_{l o c a t e}^{o u t} = \sum_{i = 1}^{T} η_{i} u_{i} .

(10)

Multi-hop Reasoning.

Both modules are extended to multi-hop reasoning, alternating between visual and textual contexts. For example, reasoning pathways are defined as:

\begin{matrix} Path 1 : & q \to Track (q, v) \to Locate (q_{t r a c k}, u) \to \dots, \\ Path 2 : & q \to Locate (q, u) \to Track (q_{l o c a t e}, v) \to \dots . \end{matrix}

This iterative process ensures comprehensive representation refinement.

3.3. Multimodal Fusion

To unify the outputs of the reasoning pathways, a multimodal fusion mechanism is employed. Features from the Track and Locate Modules are enhanced using the original question feature:

\begin{matrix} {\hat{q}}_{t r a c k} & = f_{e n h a n c e}^{q} (q) \circ f_{e n h a n c e}^{v} (q_{t r a c k}^{o u t}), \end{matrix}

(11)

\begin{matrix} {\hat{q}}_{l o c a t e} & = f_{e n h a n c e}^{q} (q) \circ f_{e n h a n c e}^{u} (q_{l o c a t e}^{o u t}) . \end{matrix}

(12)

The fused representation is then computed as:

e = W^{e} [{\hat{q}}_{t r a c k}, {\hat{q}}_{l o c a t e}] + b^{e},

(13)

where

[\cdot]

denotes concatenation.

3.4. Generative Decoder

The generative decoder builds on a multimodal attention framework to predict responses. Given fused features e, the decoder initializes the LSTM as:

h_{0} = LSTM (e, s_{q}),

(14)

where

s_{q}

is the final encoder state.

At each decoding step t, attention weights are computed over question, history, and image features:

\begin{matrix} α^{q} & = softmax (W_{q} h_{t}), \end{matrix}

(15)

\begin{matrix} α^{u} & = softmax (W_{u} h_{t}), \end{matrix}

(16)

\begin{matrix} α^{v} & = softmax (W_{v} h_{t}) . \end{matrix}

(17)

The attended multimodal context vector is:

c_{t} = W_{c} [α^{q}, α^{u}, α^{v}] .

(18)

Finally, the decoder predicts the next token as:

p (y_{t} | y_{< t}, q, u, v) = softmax (W_{d} [h_{t}, c_{t}]) .

(19)

4. Experiments

4.1. Datasets

We evaluate our proposed approach, MC-Net, on the VisDial v0.9 and v1.0 datasets [5], which are standard benchmarks for multimodal conversational tasks. VisDial v0.9 consists of 83k dialogs on COCO-train images [16] and 40k dialogs on COCO-val images, amounting to 1.23M dialog question-answer pairs. VisDial v1.0 extends VisDial v0.9 by incorporating an additional 10k COCO-like images sourced from Flickr. In total, VisDial v1.0 comprises 123k training images, 2k validation images, and 8k test images. Each dialog in these datasets contains 10 rounds of question-answer pairs, offering a comprehensive setting for evaluating multi-turn, visually grounded conversations.

4.2. Evaluation Metrics

Following [5], we adopt a retrieval-based evaluation setup. At test time, the model is provided with an image, a dialog history, a question, and a list of 100 candidate answers. The model’s performance is measured using the following metrics:

Mean Reciprocal Rank (MRR): The average of the reciprocal ranks of the ground truth answer across all questions.
Recall@k (R@1, R@5, R@10): The percentage of ground truth answers present in the top-k ranked predictions.
Mean Rank: The average rank position of the ground truth answer among the candidates (lower is better).

These metrics comprehensively capture the model’s ability to rank the correct answers in a multimodal conversational context.

4.3. Implementation Details

The preprocessing of textual data involves lowercasing, tokenization, and removal of contractions. Captions, questions, and answers are truncated to 24, 16, and 8 tokens, respectively. A vocabulary is constructed from words appearing at least five times in the training set, resulting in 8,958 unique tokens for VisDial v0.9 and 10,366 for VisDial v1.0.

Our model employs 1-layer BiLSTMs with 512 hidden units for text encoding. Image features are extracted using Faster R-CNN [20]. We use the Adam optimizer [14] with an initial learning rate of

10^{- 3}

, which is gradually decreased to

10^{- 5}

during training. Training is conducted with a batch size of 64, and gradient clipping is applied to stabilize the optimization process.

4.4. Quantitative Results

Table 1 and Table 2 provide a comprehensive comparison of MC-Net against existing state-of-the-art models on the VisDial v0.9 and v1.0 datasets. Our results highlight the superiority of MC-Net in addressing the complexities of multimodal conversational reasoning.

The performance on VisDial v0.9 (Table 1) shows that MC-Net consistently outperforms previous models across all major metrics, including MRR, R@1, R@5, R@10, and Mean Rank. Specifically, our model achieves an MRR of 56.12, surpassing the closest competitor, ReDAN [7], by 0.69 points. Similarly, our R@5 and Mean Rank scores demonstrate significant improvements, reflecting MC-Net’s ability to rank ground truth answers higher and more consistently.

For VisDial v1.0 (Table 2), MC-Net achieves an MRR of 50.16, with particularly notable improvements in R@5 and R@10, where it scores 60.02 and 67.21, respectively. These metrics indicate that MC-Net effectively generalizes to larger datasets and more diverse conversational scenarios. The Mean Rank of 15.19 further emphasizes the efficiency of MC-Net in ranking the correct answer options closer to the top, a critical aspect for practical applications.

The consistent improvements across datasets can be attributed to the unique architecture of MC-Net. By leveraging multi-path reasoning and multimodal attention mechanisms, the model captures intricate relationships between visual and textual modalities. Unlike sequential reasoning frameworks, MC-Net processes information from both dialog history and visual context concurrently, enabling a deeper understanding of the input question.

An additional advantage of MC-Net is its robustness to variations in dataset complexity. The VisDial v1.0 dataset introduces more diverse visual contexts and dialog histories, yet MC-Net maintains its performance edge. This suggests that the multi-hop reasoning mechanism effectively adapts to different input distributions, ensuring scalability and reliability in real-world scenarios.

The significance of these results extends beyond numerical metrics. Higher R@5 and R@10 scores imply that MC-Net can generate top-quality responses that are likely to align with human expectations. Such capabilities are crucial for applications like virtual assistants and interactive AI, where user satisfaction depends on the relevance and accuracy of the system’s responses.

4.5. Ablation Studies

To elucidate the individual contributions of each component in MC-Net, we conducted detailed ablation experiments as summarized in Table 3. Each variation of the model highlights the importance of key design choices in achieving state-of-the-art performance.

The first set of experiments examines the impact of multi-hop reasoning. Models with 1-hop and 2-hop reasoning exhibit significant drops in MRR, R@1, and R@5 compared to the 3-hop configuration. For example, reducing reasoning depth to a single hop results in an MRR decrease of over 1.5 points, underscoring the importance of iterative refinement in capturing complex multimodal relationships. The 3-hop design of MC-Net ensures that sufficient interactions occur between dialog history and visual context, leading to richer semantic representations.

The second set of experiments focuses on the dual-path reasoning modules: Track Module and Locate Module. Removing the Track Module leads to a 2.9-point drop in MRR, while excluding the Locate Module reduces MRR by 1.7 points. These results confirm that both pathways contribute uniquely to the model’s reasoning capabilities. The Track Module excels at capturing visual details relevant to the question, while the Locate Module ensures contextual understanding from dialog history.

The multimodal attention decoder is another critical component evaluated in our ablation study. Without this decoder, the model suffers a 1.2-point decline in MRR and reduced performance across all other metrics. This emphasizes the role of attention mechanisms in aligning textual and visual features during response generation. By dynamically attending to relevant regions of the image and dialog history, the decoder generates responses that are both accurate and contextually grounded.

In addition to these experiments, we analyzed the performance of MC-Net with varying numbers of reasoning hops. Interestingly, while 3-hop reasoning consistently outperformed other configurations, increasing the number of hops beyond three led to marginal improvements, suggesting that excessive reasoning steps may introduce noise or redundancy.

Overall, the ablation results validate the architectural design of MC-Net, demonstrating how each component synergistically enhances its reasoning and response generation capabilities.

The importance of multi-hop reasoning: removing multi-hop capabilities reduces model performance significantly.
The effectiveness of the dual-path reasoning mechanism: excluding either the Track Module or the Locate Module negatively impacts performance.
The contribution of multimodal attention: incorporating multimodal attention in the decoder improves response quality.

4.6. Human Evaluation

In addition to quantitative metrics, we conducted human evaluations to assess the practical quality of responses generated by MC-Net. A total of 100 randomly sampled dialogs were presented to human evaluators, who compared MC-Net’s outputs with those of HCIAE [16] based on fluency and relevance.

Fluency refers to the grammatical correctness and linguistic naturalness of the responses. MC-Net achieved a fluency approval rate of 72%, significantly outperforming HCIAE’s 61%. This improvement reflects the ability of MC-Net to generate well-formed sentences that align with human language conventions. The use of a multimodal attention decoder contributes significantly to this result, as it ensures that the generated responses are syntactically coherent.

Relevance measures the alignment of the response with the input context, including both the question and visual content. Here, MC-Net achieved an impressive 78% approval rate, compared to 64% for HCIAE. This improvement highlights the strength of MC-Net’s dual-path reasoning framework, which effectively integrates dialog history and visual features to produce contextually appropriate answers.

Evaluators also noted that MC-Net’s responses were more precise and informative. For example, when asked about specific attributes of objects in the image, MC-Net consistently referred to the correct regions and provided detailed answers. In contrast, HCIAE often generated vague or generic responses, failing to utilize the full context of the dialog and image.

To quantify these observations, we conducted statistical significance tests. Both t-tests and ANOVA confirmed that MC-Net’s improvements in fluency and relevance are statistically significant, with p-values below 0.01. These results provide strong evidence that MC-Net offers a substantial qualitative advantage over existing models.

In practical applications, such as virtual assistants or customer support systems, the qualitative improvements demonstrated by MC-Net are crucial. Fluent and contextually accurate responses enhance user satisfaction and trust in AI systems, making MC-Net a promising solution for real-world deployment.

5. Conclusions and Future Directions

In this paper, we present MC-Net (Multi-path Contextual Reasoning Network), a novel framework designed to advance the field of multimodal conversations. By leveraging a multi-path architecture, MC-Net effectively integrates information from both dialog history and image context, enriching the semantic representation of questions through iterative reasoning. The multi-hop reasoning mechanism enables a more nuanced understanding of complex interactions between visual and textual modalities, addressing challenges that prior models fail to overcome.

MC-Net introduces two key modules, the Track Module and the Locate Module, which work in tandem to perform parallel reasoning on visual and textual inputs. This design ensures that the model captures intricate relationships between modalities, resulting in a robust and contextually aware response generation process. Moreover, the inclusion of a multimodal attention-based decoder further enhances the model’s ability to synthesize information dynamically, enabling accurate and coherent responses.

Comprehensive experiments on the VisDial v0.9 and v1.0 datasets demonstrate the superiority of MC-Net compared to existing state-of-the-art methods. MC-Net consistently achieves higher scores across major metrics, including MRR, R@1, and R@5, validating its effectiveness in handling diverse and complex dialog scenarios. Ablation studies confirm the contributions of individual components, such as multi-hop reasoning and multimodal attention, to the overall performance. Additionally, qualitative and human evaluations highlight MC-Net’s ability to generate fluent and contextually relevant responses, making it a strong candidate for real-world applications.

While MC-Net represents a significant step forward, there are several directions for future research:

Exploring Additional Modalities: Incorporating audio or video modalities could further enhance the model’s ability to handle richer dialog scenarios, such as conversations grounded in dynamic scenes or multimedia contexts.
Scaling to Open-Domain Conversations: Future work can focus on adapting MC-Net to open-domain conversations, where the topics and contexts are less constrained, requiring more generalized reasoning capabilities.
Improving Computational Efficiency: Although multi-hop reasoning provides significant benefits, it also introduces computational overhead. Optimizing the architecture for faster inference while maintaining accuracy could broaden its applicability.
Integrating External Knowledge Bases: Incorporating knowledge graphs or external databases into the reasoning process could allow MC-Net to handle questions that require domain-specific expertise or factual information not present in the visual or dialog inputs.
Human-Like Interaction Patterns: Developing mechanisms to simulate human-like reasoning, such as counterfactual thinking or subjective response generation, could make MC-Net more engaging and versatile in real-world human-machine interactions.

In conclusion, MC-Net sets a new benchmark for multimodal conversational reasoning by addressing key limitations in existing approaches. Its innovative architecture and strong empirical results position it as a foundation for future advancements in the domain, paving the way for more intelligent and interactive conversational AI systems.

References

Alberti, C.; Ling, J.; Collins, M.; and Reitter, D. 2019. In Fusion of detected objects in text for visual question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing; 2131–2140. [Google Scholar]
Anderson, P.; Fernando, B.; Johnson, M.; and Gould, S. 2016. SPICE: Semantic propositional image caption evaluation. Adaptive Behavior.
Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. In Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 6077–6086. [Google Scholar]
Cadene, R.; Ben-Younes, H.; Cord, M.; and Thome, N. 2019. In Murel: Multimodal relational reasoning for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 1989–1998. [Google Scholar]
Das, A.; Kottur, S.; Gupta, K.; Singh, A.; Yadav, D.; Moura, J. M.; Parikh, D.; and Batra, D. 2017. In Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 326–335. [Google Scholar]
Ding, S.; Qu, S.; Xi, Y.; Sangaiah, A. K.; and Wan, S. 2019. Image caption generation with high-level image features. Pattern Recognition Letters.
Gan, Z.; Cheng, Y.; Kholy, A. E.; Li, L.; Liu, J.; and Gao, J. 2019. In Multi-step reasoning via recurrent dual attention for visual dialog. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; 6463–6474. [Google Scholar]
Gao, H.; Mao, J.; Zhou, J.; Huang, Z.; Wang, L.; and Xu, W. 2015. Are you talking to a machine? dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, 2296–2304.
Guo, D.; Xu, C.; and Tao, D. 2019. In Image-question-answer synergistic network for visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 10434–10443. [Google Scholar]
Hu, R.; Andreas, J.; Darrell, T.; and Saenko, K. 2018. In Explainable neural computation via stack neural module networks. In Proceedings of the European Conference on Computer Vision; 53–69. [Google Scholar]
Hudson, D. A. , and Manning, C. D. 2018. Compositional attention networks for machine reasoning. In International Conference on Learning Representations.
Kang, G.-C.; Lim, J.; and Zhang, B.-T. 2019. In Dual attention networks for visual reference resolution in visual dialog. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing; 2024–2033. [Google Scholar]
Kinghorn, P.; Zhang, L.; and Shao, L. 2018. A region-based image caption generator with refined descriptions. Neurocomputing.
Kingma, D. P. , and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, arXiv:1412.6980.
Lu, J.; Yang, J.; Batra, D.; and Parikh, D. 2016. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, 289–297.
Lu, J.; Kannan, A.; Yang, J.; Parikh, D.; and Batra, D. 2017a. Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model. In Advances in Neural Information Processing Systems, 314–324.
Lu, J.; Xiong, C.; Parikh, D.; and Socher, R. 2017b. In Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 375–383. [Google Scholar]
Niu, Y.; Zhang, H.; Zhang, M.; Zhang, J.; Lu, Z.; and Wen, J.-R. 2019. In Recursive visual attention in visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 6679–6688. [Google Scholar]
Pennington, J.; Socher, R.; and Manning, C. 2014. In Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing; 1532–1543. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 91–99.
Ren, M.; Kiros, R.; and Zemel, R. 2015. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems, 2953–2961.
Seo, P. H.; Lehrmann, A.; Han, B.; and Sigal, L. 2017. Visual reference resolution using attention memory for visual dialog. In Advances in Neural Information Processing Systems, 3719–3729.
Tan, Y. H. , and Chan, C. S. 2019. Phrase-based image caption generator with hierarchical lstm network. Neurocomputing.
Vedantam, R.; Desai, K.; Lee, S.; Rohrbach, M.; Batra, D.; and Parikh, D. 2019. Probabilistic neural symbolic models for interpretable visual question answering. In Proceedings of International Conference on Machine Learning; 6428–6437. [Google Scholar]
Wu, Q.; Wang, P.; Shen, C.; Reid, I.; and van den Hengel, A. 2018. In Are you talking to me? reasoned visual dialog generation through adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 6106–6115. [Google Scholar]
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, 2048–2057., attend and tell: Neural image caption generation with visual attention. In Proceedings of International Conference on Machine Learning.
Yang, Z.; He, X.; Gao, J.; Deng, L.; and Smola, A. 2016. In Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 21–29. [Google Scholar]
authorpersonAnson Bastos, personAbhishek Nadgeri, personKuldeep Singh, personIsaiah Onando Mulang, personSaeedeh Shekarpour, personJohannes Hoffart, and personManohar Kaul. year2021. RECON: Relation Extraction using Knowledge Graph Context in a Graph Neural Network. In booktitleProceedings of the Web Conference 2021. pages1673–1685.
authorpersonPhilipp Christmann, personRishiraj Saha Roy, personAbdalghani Abujabal, personJyotsna Singh, and personGerhard Weikum. year2019. Look before You Hop: Conversational Question Answering over Knowledge Graphs Using Judicious Context Expansion. In booktitleProceedings of the 28th ACM International Conference on Information and Knowledge Management CIKM. pages729–738.
authorpersonJacob Devlin, personMing-Wei Chang, personKenton Lee, and personKristina Toutanova. year2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In booktitleProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). publisherAssociation for Computational Linguistics, pages4171–4186.
authorpersonEndri Kacupaj, personKuldeep Singh, personMaria Maleshkova, and personJens Lehmann. year2022. An Answer Verbalization Dataset for Conversational Question Answerings over Knowledge Graphs. arXiv preprint arXiv:2208.06734, arXiv:2208.06734 (year2022).
authorpersonMagdalena Kaiser, personRishiraj Saha Roy, and personGerhard Weikum. year2021. Reinforcement Learning from Reformulations In Conversational Question Answering over Knowledge Graphs. In booktitleProceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. pages459–469.
authorpersonYunshi Lan, personGaole He, personJinhao Jiang, personJing Jiang, personWayne Xin Zhao, and personJi-Rong Wen. year2021. A Survey on Complex Knowledge Base Question Answering: Methods, Challenges and Solutions. In booktitleProceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21. publisherInternational Joint Conferences on Artificial Intelligence Organization, pages4483–4491.
authorpersonYunshi Lan and personJing Jiang. year2021. Modeling transitions of focal entities for conversational knowledge base question answering. In booktitleProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
authorpersonMike Lewis, personYinhan Liu, personNaman Goyal, personMarjan Ghazvininejad, personAbdelrahman Mohamed, personOmer Levy, personVeselin Stoyanov, and personLuke Zettlemoyer. year2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In booktitleProceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pages7871–7880.
authorpersonIlya Loshchilov and personFrank Hutter. year2019. Decoupled Weight Decay Regularization. In booktitleInternational Conference on Learning Representations.
authorpersonPierre Marion, personPaweł Krzysztof Nowak, and personFrancesco Piccinno. year2021. Structured Context and High-Coverage Grammar for Conversational Question Answering over Knowledge Graphs. journalProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) (year2021).
Pradeep, K. Atrey, M. Anwar Hossain, Abdulmotaleb El Saddik, and Mohan S. Kankanhalli. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 20 April 0942. [Google Scholar]
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 2015. [CrossRef]
Dong Yu Li Deng. Deep Learning: Methods and Applications, /: 14. URL https, 20 May.
Eric Makita and Artem Lenskiy. A movie genre prediction based on Multivariate Bernoulli model and genre correlations. (May), mar 2016a. URL http://arxiv.org/abs/1604.08608.
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, arXiv:1410.1090, 2014.
Deli Pei, Huaping Liu, Yulong Liu, and Fuchun Sun. Unsupervised multimodal feature learning for semantic image segmentation. In The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–6. IEEE, aug 2013. ISBN 978-1-4673-6129-3. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6706748. [CrossRef]
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, arXiv:1409.1556, 2014.
Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-Shot Learning Through Cross-Modal Transfer. In C J C Burges, L Bottou, M Welling, Z Ghahramani, and K Q Weinberger (eds.), Advances in Neural Information Processing Systems 26, pp. 935–943. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/5027-zero-shot-learning-through-cross-modal-transfer.pdf.
Hao Fei, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat-Seng Chua, and Shuicheng Yan. Enhancing video-language representations with structural spatio-temporal alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Hao Fei, Yafeng Ren, and Donghong Ji. Retrofitting structure-aware transformer language model for end tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2151–2161, 2020a.
Shengqiong Wu, Hao Fei, Fei Li, Meishan Zhang, Yijiang Liu, Chong Teng, and Donghong Ji. Mastering the explicit opinion-role interaction: Syntax-aided neural transition system for unified opinion role labeling. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, pages 11513–11521, 2022.
Wenxuan Shi, Fei Li, Jingye Li, Hao Fei, and Donghong Ji. Effective token graph modeling using a novel labeling strategy for structured sentiment analysis. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4232–4241, 2022.
Hao Fei, Yue Zhang, Yafeng Ren, and Donghong Ji. Latent emotion memory for multi-label emotion classification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7692–7699, 2020b.
Fengqi Wang, Fei Li, Hao Fei, Jingye Li, Shengqiong Wu, Fangfang Su, Wenxuan Shi, Donghong Ji, and Bo Cai. Entity-centered cross-document relation extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9871–9881, 2022.
Ling Zhuang, Hao Fei, and Po Hu. Knowledge-enhanced event relation extraction via event ontology prompt. Inf. Fusion, 0191.
Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. Qanet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541, arXiv:1804.09541, 2018.
Shengqiong Wu, Hao Fei, Yixin Cao, Lidong Bing, and Tat-Seng Chua. Information screening whilst exploiting! multimodal relation extraction with feature denoising and multimodal topic modeling. arXiv preprint arXiv:2305.11719, arXiv:2305.11719, 2023a.
Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu. Faithful logical reasoning via symbolic chain-of-thought. arXiv preprint arXiv:2405.18357, arXiv:2405.18357, 2024.
Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179, arXiv:1704.05179, 2017.
Hao Fei, Shengqiong Wu, Jingye Li, Bobo Li, Fei Li, Libo Qin, Meishan Zhang, Min Zhang, and Tat-Seng Chua. Lasuie: Unifying information extraction with latent adaptive structure-aware generative language model. In Proceedings of the Advances in Neural Information Processing Systems, NeurIPS 2022, pages 15460–15475, 2022a.
Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. Opinion word expansion and target extraction through double propagation. Computational linguistics, 2011; 27.
Hao Fei, Yafeng Ren, Yue Zhang, Donghong Ji, and Xiaohui Liang. Enriching contextualized language model from knowledge graph for biomedical information extraction. Briefings in Bioinformatics, 2021.
Shengqiong Wu, Hao Fei, Wei Ji, and Tat-Seng Chua. Cross2StrA: Unpaired cross-lingual image captioning with cross-lingual cross-modal structure-pivoted alignment. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2593–2608, 2023b.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, arXiv:1606.05250, 2016.
Hao Fei, Fei Li, Bobo Li, and Donghong Ji. Encoder-decoder based unified semantic role labeling with label-aware syntax. In Proceedings of the AAAI conference on artificial intelligence, pages 12794–12802, 2021a.
Hao Fei, Shengqiong Wu, Yafeng Ren, Fei Li, and Donghong Ji. Better combine them together! integrating syntactic constituency and dependency representations for semantic role labeling. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 549–559, 2021b.
Hao Fei, Bobo Li, Qian Liu, Lidong Bing, Fei Li, and Tat-Seng Chua. Reasoning implicit sentiment with chain-of-thought prompting. arXiv preprint arXiv:2305.11255, arXiv:2305.11255, 2023a.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, 19. Association for Computational Linguistics. 20 June. [CrossRef]
Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. CoRR, 0551; .9.
Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. Video-of-thought: Step-by-step video reasoning from perception to cognition. In Proceedings of the International Conference on Machine Learning, 2024b.
Naman Jain, Pranjali Jain, Pratik Kayal, Jayakrishna Sahit, Soham Pachpande, Jayesh Choudhari, et al. Agribot: agriculture-specific question answer system. IndiaRxiv.
Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, and Tat-Seng Chua. Dysen-vdm: Empowering dynamics-aware text-to-video diffusion with llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7641–7653, 2024c.
Mihir Momaya, Anjnya Khanna, Jessica Sadavarte, and Manoj Sankhe. Krushi–the farmer chatbot. In 2021 International Conference on Communication information and Computing Technology (ICCICT), pages 1–6. IEEE, 2021.
Hao Fei, Fei Li, Chenliang Li, Shengqiong Wu, Jingye Li, and Donghong Ji. Inheriting the wisdom of predecessors: A multiplex cascade framework for unified aspect-based sentiment analysis. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI, pages 4096–4103, 2022b.
Shengqiong Wu, Hao Fei, Yafeng Ren, Donghong Ji, and Jingye Li. Learn from syntax: Improving pair-wise aspect and opinion terms extraction with rich syntactic knowledge. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pages 3957–3963, 2021.
Bobo Li, Hao Fei, Lizi Liao, Yu Zhao, Chong Teng, Tat-Seng Chua, Donghong Ji, and Fei Li. Revisiting disentanglement and fusion on modality and context in conversational multimodal emotion recognition. In Proceedings of the 31st ACM International Conference on Multimedia, MM, pages 5923–5934, 2023.
Hao Fei, Qian Liu, Meishan Zhang, Min Zhang, and Tat-Seng Chua. Scene graph as pivoting: Inference-time image-free unsupervised multimodal machine translation with visual scene hallucination. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5980–5994, 2023b.
Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Vitron: A unified pixel-level vision llm for understanding, generating, segmenting, editing. In Proceedings of the Advances in Neural Information Processing Systems, NeurIPS 2024,, 2024d.
Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sentence embeddings. In ICLR, 2017.
Abbott Chen and Chai Liu. Intelligent commerce facilitates education technology: The platform and chatbot for the taiwan agriculture service. International Journal of e-Education, e-Business, e-Management and e-Learning, 2021.
Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Towards semantic equivalence of tokenization in multimodal llm. arXiv preprint arXiv:2406.05127, arXiv:2406.05127, 2024.
Jingye Li, Kang Xu, Fei Li, Hao Fei, Yafeng Ren, and Donghong Ji. MRN: A locally and globally mention-based reasoning network for document-level relation extraction. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1359–1370, 2021.
Hao Fei, Shengqiong Wu, Yafeng Ren, and Meishan Zhang. Matching structure for dual learning. In Proceedings of the International Conference on Machine Learning, ICML, pages 6373–6391, 2022c.
Hu Cao, Jingye Li, Fangfang Su, Fei Li, Hao Fei, Shengqiong Wu, Bobo Li, Liang Zhao, and Donghong Ji. OneEE: A one-stage framework for fast overlapping and nested event extraction. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1953–1964, 2022.
Isakwisa Gaddy Tende, Kentaro Aburada, Hisaaki Yamaba, Tetsuro Katayama, and Naonobu Okazaki. Proposal for a crop protection information system for rural farmers in tanzania. Agronomy, 2411.
Hao Fei, Yafeng Ren, and Donghong Ji. Boundaries and edges rethinking: An end-to-end neural model for overlapping entity relation extraction. Information Processing & Management, 2020.
Jingye Li, Hao Fei, Jiang Liu, Shengqiong Wu, Meishan Zhang, Chong Teng, Donghong Ji, and Fei Li. Unified named entity recognition as word-word relation classification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10965–10973, 2022.
Mohit Jain, Pratyush Kumar, Ishita Bhansali, Q Vera Liao, Khai Truong, and Shwetak Patel. Farmchat: a conversational agent to answer farmer queries. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2018; 22.
Shengqiong Wu, Hao Fei, Hanwang Zhang, and Tat-Seng Chua. Imagine that! abstract-to-intricate text-to-image synthesis with scene graph hallucination diffusion. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 79240–79259, 2023d.
Hao Fei, Tat-Seng Chua, Chenliang Li, Donghong Ji, Meishan Zhang, and Yafeng Ren. On the robustness of aspect-based sentiment analysis: Rethinking model, data, and training. ACM Transactions on Information Systems, 2023; :32.
Yu Zhao, Hao Fei, Yixin Cao, Bobo Li, Meishan Zhang, Jianguo Wei, Min Zhang, and Tat-Seng Chua. Constructing holistic spatio-temporal scene graph for video semantic role labeling. In Proceedings of the 31st ACM International Conference on Multimedia, MM, pages 5281–5291, 2023a.
Shengqiong Wu, Hao Fei, Yixin Cao, Lidong Bing, and Tat-Seng Chua. Information screening whilst exploiting! multimodal relation extraction with feature denoising and multimodal topic modeling. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14734–14751, 2023e.
Hao Fei, Yafeng Ren, Yue Zhang, and Donghong Ji. Nonautoregressive encoder-decoder neural framework for end-to-end aspect-based sentiment triplet extraction. IEEE Transactions on Neural Networks and Learning Systems, 5544.
Yu Zhao, Hao Fei, Wei Ji, Jianguo Wei, Meishan Zhang, Min Zhang, and Tat-Seng Chua. Generating visual spatial description via holistic 3D scene understanding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7960–7977, 2023b.

Table 1. Performance comparison on VisDial val v0.9. Higher scores are better for MRR, R@1, R@5, and R@10, while lower scores are better for Mean Rank.

Model	MRR	R@1	R@5	R@10	Mean
LF [5]	51.99	41.83	61.78	67.59	17.07
HCIAE [16]	53.86	44.06	63.55	69.24	16.01
CoAtt [25]	54.11	44.32	63.82	69.75	16.47
ReDAN [7]	55.43	45.37	65.27	72.97	13.72
MC-Net (Ours)	56.12	46.20	66.08	72.43	12.84

Table 2. Performance comparison on VisDial val v1.0. The metrics demonstrate the robustness of MC-Net across datasets.

Model	MRR	R@1	R@5	R@10	Mean
LF [5]	47.99	38.18	57.54	64.32	18.60
HCIAE [16]	49.10	39.35	58.49	64.70	18.46
CoAtt [25]	49.25	39.66	58.83	65.38	18.15
ReDAN [7]	49.69	40.19	59.35	66.06	17.92
MC-Net (Ours)	50.16	40.15	60.02	67.21	15.19

Table 3. Ablation study on VisDial val v0.9 dataset. Removing any component reduces overall performance.

Variant	MRR	R@1	R@5	R@10	Mean
MC-Net w/o Multi-hop	54.07	43.89	64.08	70.12	16.01
MC-Net w/o Track Module	53.22	42.75	63.22	69.75	16.87
MC-Net w/o Locate Module	54.22	43.88	64.56	70.44	15.34
MC-Net w/o Attention	54.90	44.56	65.08	71.12	14.87
MC-Net (Full)	56.12	46.20	66.08	72.43	12.84

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

MC-Net: A Multi-Path Contextual Reasoning Framework for Multimodal Conversations

Abstract

Keywords:

Subject:

1. Introduction

2. Related Work

Advancements in Vision-Language Tasks

Multimodal Conversation and Its Challenges

Dynamic Attention Mechanisms in Multimodal Learning

Multi-Hop Reasoning in Multimodal Contexts

Cross-Domain Inspiration: Knowledge Graphs and Transformers

Motivation for MC-Net

3. Methodology

3.1. Input Representation

Visual Features.

Textual Features.

3.2. Multi-path Contextual Reasoning

Track Module.

Locate Module.

Multi-hop Reasoning.

3.3. Multimodal Fusion

3.4. Generative Decoder

4. Experiments

4.1. Datasets

4.2. Evaluation Metrics

4.3. Implementation Details

4.4. Quantitative Results

4.5. Ablation Studies

4.6. Human Evaluation

5. Conclusions and Future Directions

References

MDPI Initiatives

Important Links

Subscribe