A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs
Ali, Marwa; et al.
Abstract
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce PALM, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, PALM offers a broad, inclusive perspective. We use PALM to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available. More information about PALM is available at our project page: https://github.com/UBC-NLP/palm.
Other data
| Field | Value |
| --- | --- |
| Title | A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs |
| Authors | Ali, Marwa; et al. |
| Issue Date | Jul-2025 |
| Journal | arxiv.org/abs/ |
| DOI | https://arxiv.org/abs/2503.00151 |