A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

Ali, Marwa; et al.

Abstract


As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce PALM, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, PALM offers a broad, inclusive perspective. We use PALM to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available. More information about PALM is available at our project page: https://github.com/UBC-NLP/palm.


Other data

Title A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs
Authors Ali, Marwa; et al.
Issue Date Jul-2025
Journal arxiv.org/abs/ 
DOI https://arxiv.org/abs/2503.00151


Items in Ain Shams Scholar are protected by copyright, with all rights reserved, unless otherwise indicated.