A multi-modal transformer-based model for generative visual dialog system

Elshamy, Ghada; Alfonse, Marco; Hegazy, Islam; Aref, M.

Abstract


Recent advancements in generative artificial intelligence have spurred significant interest in conversational agents. The visual dialog task, a synthesis of visual question answering and dialog systems, requires agents capable of both seeing and chatting in natural-language interactions. These agents must effectively understand cross-modal contextual information and generate coherent, human-like responses to a sequence of questions about a given visual scene. Despite progress, previous approaches often required complex architectures and substantial resources. This paper introduces a generative dialog agent that addresses these challenges while maintaining relatively simple architecture, dataset, and resource requirements. The proposed model employs an encoder-decoder architecture, incorporating ViLBERT for cross-modal information grounding and GPT-2 for autoregressive answer generation. This is the first visual dialog agent to rely solely on an autoregressive decoder for text generation. Evaluated on the VisDial dataset, the model achieves promising results, with scores of 64.05, 62.67, 70.17, and 15.37 on normalized discounted cumulative gain (NDCG), rank@5, rank@10, and mean rank, respectively. These outcomes underscore the effectiveness of this approach, particularly considering its efficiency in terms of dataset size, architecture complexity, and generation process. The code and dataset are available at https://github.com/GhadaElshamy/MS-GPT-visdial.git, complete with usage instructions to facilitate replication of these experiments.
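As a rough illustration of the encoder-decoder pipeline described in the abstract (this is not the authors' released code; the cross-modal encoder interface, feature dimensions, and projection layer are assumptions), the sketch below shows how grounded features from a ViLBERT-style encoder could be fed to GPT-2 as prefix embeddings for autoregressive answer generation.

```python
# Minimal sketch of a ViLBERT-encoder / GPT-2-decoder visual dialog agent.
# The cross-modal encoder is a stand-in (ViLBERT is not bundled with the
# Hugging Face transformers library); only the GPT-2 side uses real library calls.

import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class VisualDialogAgent(nn.Module):
    def __init__(self, cross_modal_encoder, encoder_dim=1024):
        super().__init__()
        self.encoder = cross_modal_encoder              # ViLBERT-style encoder (assumed interface)
        self.decoder = GPT2LMHeadModel.from_pretrained("gpt2")
        # Project grounded cross-modal features into GPT-2's embedding space.
        self.proj = nn.Linear(encoder_dim, self.decoder.config.n_embd)

    def forward(self, image_feats, dialog_tokens, answer_ids):
        # 1) Ground the question and dialog history in the image
        #    (assumed encoder output shape: [batch, seq_len, encoder_dim]).
        grounded = self.encoder(image_feats, dialog_tokens)
        prefix = self.proj(grounded)
        # 2) Embed the gold answer tokens and append them after the prefix.
        ans_embeds = self.decoder.transformer.wte(answer_ids)
        inputs_embeds = torch.cat([prefix, ans_embeds], dim=1)
        # 3) Mask the prefix positions out of the language-modelling loss.
        ignore = torch.full(prefix.shape[:2], -100,
                            dtype=torch.long, device=answer_ids.device)
        labels = torch.cat([ignore, answer_ids], dim=1)
        return self.decoder(inputs_embeds=inputs_embeds, labels=labels)
```

At inference time, the same projected prefix could be passed to `decoder.generate(inputs_embeds=prefix, ...)` (supported for decoder-only models in recent transformers releases) to produce the answer token by token, mirroring the autoregressive generation step described in the abstract.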


Other data

Title A multi-modal transformer-based model for generative visual dialog system
Authors Elshamy, Ghada; Alfonse, Marco; Hegazy, Islam; Aref, M.
Keywords Visual Dialog; transformers; ViLBERT; GPT; answer generation
Issue Date 31-Mar-2025
Publisher Polish Association for Knowledge Promotion
Journal Applied Computer Science 
Volume 21
Issue 1
Start page 1
End page 17
ISSN 2353-6977
DOI 10.35784/acs_6856




Items in Ain Shams Scholar are protected by copyright, with all rights reserved, unless otherwise indicated.