A multi-modal transformer-based model for generative visual dialog system

Elshamy, Ghada; Alfonse, Marco; Hegazy, Islam; Aref, M.

Abstract


Recent advancements in generative artificial intelligence have spurred significant interest in conversational agents. The visual dialog task, a synthesis of visual question answering and dialog systems, requires agents capable of both seeing and chatting in natural-language interactions. These agents must effectively understand cross-modal contextual information and generate coherent, human-like responses to a sequence of questions about a given visual scene. Despite progress, previous approaches often required complex architectures and substantial resources. This paper introduces a generative dialog agent that addresses these challenges while maintaining relatively simple architecture, dataset, and resource requirements. The proposed model employs an encoder-decoder architecture, incorporating ViLBERT for cross-modal information grounding and GPT-2 for autoregressive answer generation. This is the first visual dialog agent to rely solely on an autoregressive decoder for text generation. Evaluated on the VisDial dataset, the model achieves promising results, with scores of 64.05, 62.67, 70.17, and 15.37 on normalized discounted cumulative gain (NDCG), rank@5, rank@10, and mean rank, respectively. These outcomes underscore the effectiveness of this approach, particularly considering its efficiency in terms of dataset size, architecture complexity, and generation process. The code and dataset are available at https://github.com/GhadaElshamy/MS-GPT-visdial.git, complete with usage instructions to facilitate replication of these experiments.
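As a rough illustration of the encoder-decoder pipeline described in the abstract (this is not the authors' released code; the cross-modal encoder interface, feature dimensions, and projection layer are assumptions), the sketch below shows how grounded features from a ViLBERT-style encoder could be fed to GPT-2 as prefix embeddings for autoregressive answer generation.

```python
# Minimal sketch of a ViLBERT-encoder / GPT-2-decoder visual dialog agent.
# The cross-modal encoder is a stand-in (ViLBERT is not bundled with the
# Hugging Face transformers library); only the GPT-2 side uses real library calls.

import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class VisualDialogAgent(nn.Module):
    def __init__(self, cross_modal_encoder, encoder_dim=1024):
        super().__init__()
        self.encoder = cross_modal_encoder              # ViLBERT-style encoder (assumed interface)
        self.decoder = GPT2LMHeadModel.from_pretrained("gpt2")
        # Project grounded cross-modal features into GPT-2's embedding space.
        self.proj = nn.Linear(encoder_dim, self.decoder.config.n_embd)

    def forward(self, image_feats, dialog_tokens, answer_ids):
        # 1) Ground the question and dialog history in the image
        #    (assumed encoder output shape: [batch, seq_len, encoder_dim]).
        grounded = self.encoder(image_feats, dialog_tokens)
        prefix = self.proj(grounded)
        # 2) Embed the gold answer tokens and append them after the prefix.
        ans_embeds = self.decoder.transformer.wte(answer_ids)
        inputs_embeds = torch.cat([prefix, ans_embeds], dim=1)
        # 3) Mask the prefix positions out of the language-modelling loss.
        ignore = torch.full(prefix.shape[:2], -100,
                            dtype=torch.long, device=answer_ids.device)
        labels = torch.cat([ignore, answer_ids], dim=1)
        return self.decoder(inputs_embeds=inputs_embeds, labels=labels)
```

At inference time, the same projected prefix could be passed to `decoder.generate(inputs_embeds=prefix, ...)` (supported for decoder-only models in recent transformers releases) to produce the answer token by token, mirroring the autoregressive generation step described in the abstract.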


Other data

Title A multi-modal transformer-based model for generative visual dialog system
Authors Elshamy, Ghada; Alfonse, Marco; Hegazy, Islam; Aref, M.
Keywords Visual Dialog; transformers; ViLBERT; GPT; answer generation
Issue Date 31-Mar-2025
Publisher Polish Association for Knowledge Promotion
Journal Applied Computer Science 
Volume 21
Issue 1
Start page 1
End page 17
ISSN 2353-6977
DOI 10.35784/acs_6856




Items in Ain Shams Scholar are protected by copyright, with all rights reserved, unless otherwise indicated.