MBZUAI Publishes Pioneering Studies on Hindi-Speaking AI, Cultural Knowledge in LLMs, and Human Judgment Alignment

Advancing Inclusive AI at EACL 2026


MBZUAI Spearheads Advances in Multilingual and Culturally Aware AI

Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), the UAE's pioneering graduate research university focused exclusively on artificial intelligence, has made significant strides in natural language processing (NLP) with a series of groundbreaking studies presented at the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026) in Rabat, Morocco. These publications address critical challenges in multilingual AI, cultural representation in large language models (LLMs), and aligning AI outputs with diverse human judgments, positioning MBZUAI at the forefront of inclusive AI development in the United Arab Emirates and beyond.

Founded in 2019, MBZUAI has rapidly established itself as a global hub for AI innovation, attracting top talent and fostering collaborations with industry leaders like G42 and Cerebras. Its Institute for Foundation Models (IFM) drives much of this work, emphasizing practical, culturally sensitive AI solutions that resonate with the UAE's diverse population and its vision for technological leadership under the UAE Centennial 2071 strategy.


Revolutionizing Hindi-Speaking AI with Nanda Models

One standout contribution is the development of Nanda-10B and Nanda-87B, a pair of open-weight bilingual models optimized for Hindi and English. Unlike traditional LLMs that bolt multilingual support onto English-centric bases, Nanda integrates Hindi's linguistic and cultural nuances from the ground up. Hindi, spoken by over half a billion people worldwide, is underrepresented in part because it appears in several forms: formal Devanagari script, Romanized transliterations common online, and code-mixed Hindi-English prevalent in social media and daily communication.

The creation process unfolds in three deliberate steps:

Tokenizer Extension: Llama's vocabulary was expanded with Hindi-specific tokens, halving Hindi tokenization fertility (the average number of tokens per word) while preserving English efficiency.
Hindi-First Continual Pretraining: The model was further pretrained on a 65 billion-token corpus spanning Devanagari, Romanized, and code-mixed text to mirror real-world usage.
Bilingual Alignment: The model was fine-tuned with bilingual instructions and a culturally grounded safety dataset to produce contextually appropriate and safe responses.
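
Fertility, as defined in the first step, can be measured directly. The sketch below is illustrative only: the two tokenizers are toy stand-ins (not Llama's or Nanda's actual tokenizers), the vocabulary is invented, and the Romanized Hindi sample was chosen for this example.

```python
from typing import Callable, Iterable, List, Set


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("no words to measure")
    return n_tokens / n_words


def char_tokenizer(word: str) -> List[str]:
    # Hypothetical "before" tokenizer: every character is its own token.
    return list(word)


def make_vocab_tokenizer(vocab: Set[str]) -> Callable[[str], List[str]]:
    # Hypothetical "after" tokenizer: greedy longest match against `vocab`,
    # falling back to single characters when nothing in the vocabulary fits.
    def tokenize(word: str) -> List[str]:
        tokens: List[str] = []
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    tokens.append(word[i:j])
                    i = j
                    break
        return tokens

    return tokenize


sample = ["namaste duniya"]  # invented Romanized Hindi sample
before = fertility(sample, char_tokenizer)
after = fertility(sample, make_vocab_tokenizer({"nama", "ste", "duniya"}))
print(f"before: {before:.2f} tokens/word, after: {after:.2f} tokens/word")
```

Lower fertility means the same text fits in fewer tokens, which generally reduces inference cost and leaves more room in the context window.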

These models outperform peers in generative tasks like summarization, translation, transliteration, and instruction following. Nanda-87B leads benchmarks, while the compact Nanda-10B stands out in safety and cultural knowledge among models of similar size. In GPT-4o-judged pairwise comparisons, they surpass Llama instruction-tuned variants on Hindi tasks. Custom Hindi safety benchmarks, crafted by native speakers, confirm state-of-the-art performance on culturally sensitive prompts.
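
A pairwise comparison of this kind is usually summarized as a win rate. The article does not say how the verdicts were aggregated, so the sketch below is an assumption: it takes a list of judge verdicts ("A", "B", or "tie") and counts a tie as half a win. The labels and the function name are hypothetical.

```python
from collections import Counter
from typing import Iterable


def win_rate(verdicts: Iterable[str], model: str = "A") -> float:
    """Share of comparisons won by `model`, counting each tie as half a win."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to aggregate")
    return (counts[model] + 0.5 * counts["tie"]) / total


# Invented verdicts from a judge comparing model A against model B.
verdicts = ["A", "A", "B", "tie"]
print(f"A wins {win_rate(verdicts, 'A'):.1%} of comparisons")
```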

In cultural evaluations covering traditional medicine, finance, farming, and legal domains, Nanda excels within its size class, demonstrating improved handling of everyday background knowledge. This Hindi-first philosophy challenges the 'scale-first' paradigm, advocating integrated design for low-resource languages to achieve true fluency and cultural fit.

LLMs as Cultural Archives: Uneven Encoding Across Languages

Another pivotal study explores LLMs as 'cultural archives,' extracting procedural cultural commonsense knowledge graphs that capture societal patterns like event sequences, preconditions, and social effects. By prompting models to generate 'if-then' assertions and expand them into multi-step paths, researchers reveal how LLMs store culturally situated inferences beyond rote facts—for instance, the steps in preparing an Indonesian breakfast or planning an Egyptian wedding.
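
The idea of chaining 'if-then' assertions into multi-step paths can be sketched with a small directed graph. This is a minimal illustration, not the authors' pipeline: the assertions are invented placeholders rather than model outputs, and `build_graph` and `expand_paths` are hypothetical helper names.

```python
from typing import Dict, List, Tuple


def build_graph(assertions: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Turn (if, then) pairs into an adjacency list."""
    graph: Dict[str, List[str]] = {}
    for condition, consequence in assertions:
        graph.setdefault(condition, []).append(consequence)
    return graph


def expand_paths(graph: Dict[str, List[str]], start: str, max_steps: int = 4) -> List[List[str]]:
    """Follow assertions from `start`, returning every path that ends at a
    node with no unvisited successors or at the `max_steps` limit."""
    paths: List[List[str]] = []

    def walk(path: List[str]) -> None:
        # Skip nodes already on the path so cycles cannot loop forever.
        successors = [n for n in graph.get(path[-1], []) if n not in path]
        if not successors or len(path) - 1 >= max_steps:
            paths.append(path)
            return
        for node in successors:
            walk(path + [node])

    walk([start])
    return paths


# Invented placeholder assertions, not taken from the study.
assertions = [
    ("guests arrive", "host serves tea"),
    ("host serves tea", "guests sit down"),
    ("host serves tea", "guests thank host"),
]
for path in expand_paths(build_graph(assertions), "guests arrive"):
    print(" -> ".join(path))
```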

Graphs were constructed in English and native languages (Chinese, Arabic, Japanese, Bahasa Indonesia) for five countries: China, Indonesia, Japan, England, and Egypt. Native annotators validated correctness, relevance, and coherence. Adding these graphs boosted smaller models' performance on Arab-world and Indonesian cultural benchmarks, particularly in native-language question answering and culturally coherent story generation. Chain-of-thought prompting often underperformed, which suggests that this kind of cultural knowledge is applied intuitively rather than derived step by step.

Key finding: Cultural expression is uneven, with English yielding cleaner paths despite multilingual training. This English-centric organization absorbs cultures asymmetrically, raising concerns over biases from dominant training data. Led by PhD researcher Junior Cedric Tonga, with Chen Cecilia Liu, Iryna Gurevych, and advisor Fajri Koto, the work (arXiv: 2601.17971) highlights LLMs' potential and pitfalls as cultural repositories.

Aligning AI with Human Judgment Variation

Addressing AI alignment, the paper 'Training and Evaluating with Human Label Variation' treats judgment disagreements as informative signals rather than noise. In ambiguous tasks like content moderation or moral reasoning, multiple valid labels exist due to context or expertise. Using fuzzy set theory, human labels become partial memberships, enabling 'soft' metrics like soft micro F1 that measure distribution alignment over forced consensus.
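
To make the soft metric concrete, here is a minimal sketch of one fuzzy-set formulation of soft micro F1. It assumes the standard fuzzy intersection (the element-wise minimum) and treats each row as per-class membership degrees for one item; the paper's exact definition may differ, and the numbers are invented.

```python
from typing import Sequence


def soft_micro_f1(human: Sequence[Sequence[float]], pred: Sequence[Sequence[float]]) -> float:
    """Soft micro F1 over membership degrees in [0, 1].

    Overlap is the fuzzy intersection (element-wise min), pooled over all
    items and classes before the F1 ratio is taken (hence "micro").
    """
    if len(human) != len(pred):
        raise ValueError("human and pred must cover the same items")
    overlap = 0.0
    human_mass = 0.0
    pred_mass = 0.0
    for h_row, p_row in zip(human, pred):
        if len(h_row) != len(p_row):
            raise ValueError("rows must have one value per class")
        for h, p in zip(h_row, p_row):
            overlap += min(h, p)
            human_mass += h
            pred_mass += p
    denom = human_mass + pred_mass
    return 2 * overlap / denom if denom > 0 else 0.0


# Invented example: two items, two classes.
human = [[0.7, 0.3], [1.0, 0.0]]  # share of annotators choosing each class
pred = [[0.5, 0.5], [1.0, 0.0]]   # model's predicted memberships
print(round(soft_micro_f1(human, pred), 3))
```

With hard 0/1 labels this reduces to ordinary micro F1, so the soft version rewards a model for matching how annotators split rather than only their majority choice.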

Experiments on six datasets (spanning English and Arabic; binary, multiclass, and multilabel tasks; and crowd and expert annotators) show that simple methods, namely disaggregated training (one training example per annotator label) or soft labels, outperform more complex objectives. A novel legal dataset (TAG), annotated by lawyers, underscores pluralism in interpretations. A meta-evaluation in which lawyers ranked model outputs validates soft micro F1. Coauthored with University of Melbourne researchers (arXiv: 2502.01891), the paper advocates preserving variation for robust AI in interpretive domains.
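
The two simple methods differ only in how annotations are prepared before training. The sketch below shows that data-preparation step under the usual reading of the terms; the helper names, class names, and annotations are hypothetical, and no training loop is shown.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def soft_label(annotations: Sequence[str], classes: Sequence[str]) -> List[float]:
    """Soft label: the fraction of annotators who chose each class."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    return [counts[c] / len(annotations) for c in classes]


def disaggregate(text: str, annotations: Sequence[str]) -> List[Tuple[str, str]]:
    """Disaggregated training data: one (text, label) example per annotator."""
    return [(text, label) for label in annotations]


# Invented example: four annotators label one comment.
classes = ["ok", "toxic"]
annotations = ["toxic", "ok", "toxic", "toxic"]
print(soft_label(annotations, classes))
print(disaggregate("example comment", annotations))
```

Both keep the minority judgment in the training signal, whereas a majority vote would reduce this item to a single "toxic" label.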

JEEM Benchmark: Probing Cultural Nuances in Visual AI

Complementing textual work, the JEEM benchmark tests vision-language models on culturally nuanced images across Arabic dialects (Jordan, UAE, Egypt, Morocco). While AI fluently describes visuals, it falters on cultural inferences, revealing gaps in dialectal and regional understanding. Accepted at EACL 2026, this effort by Karima Kadaoui, Hanin Atwany, and Hamdan Al-Ali advances culturally attuned multimodal AI.

Implications for UAE's AI Ecosystem

These studies underscore MBZUAI's role in UAE's National AI Strategy 2031, promoting Arabic, Hindi, and other underrepresented languages amid the nation's 200+ nationalities. By tackling cultural biases and judgment pluralism, they enhance AI trustworthiness for sectors like education, healthcare, and governance. Collaborations with G42 amplify deployment, while open models like Nanda democratize access.

In UAE higher education, MBZUAI's MSc/PhD programs in NLP attract global talent, with scholarships drawing diverse cohorts. Graduates contribute to local firms, aligning with UAE's push for 20,000 AI specialists by 2031.

Global Impact and Broader Challenges

Globally, the work challenges English-dominant AI and advocates holistic multilingual design. Nanda serves India's vast population of Hindi speakers; cultural knowledge graphs can support cross-cultural applications; and training with human label variation (HLV) improves AI in contested domains like ethics. Yet challenges persist: scarce authentic data for low-resource languages, bias risks in LLMs used as cultural archives, and the difficulty of scaling soft evaluations.

Future directions: Larger Hindi corpora, multilingual cultural benchmarks, hybrid human-AI judgment systems.
Stakeholder views: Experts praise integrated approaches but urge diverse annotators to mitigate stereotypes.

Expert Perspectives and Real-World Applications

Fajri Koto notes, "Multilingual AI improves through deliberate cultural and linguistic choices." Tonga emphasizes that LLMs encode procedural cultural knowledge, not only facts. Applications span chatbots for UAE's multicultural workforce, legal AI for Sharia interpretations, and educational tools preserving Emirati heritage.

Statistics: Hindi's roughly 600 million speakers remain underserved; LLMs show cultural accuracy gaps of 20-30% across languages; and accounting for human label variation improves alignment by 10-15% on ambiguous tasks.

Future Outlook for AI Research in UAE

MBZUAI plans expanded NLP initiatives, including Arabic dialect models and ethics labs. Backed by the UAE's $20B AI investment, these studies help chart a path toward ethical, inclusive AI. For aspiring researchers, MBZUAI offers world-class facilities and industry ties.

As AI integrates deeper into UAE society, MBZUAI's focus on cultural and human-centric models ensures technology serves humanity equitably.


Frequently Asked Questions

What are Nanda models from MBZUAI?

Nanda-10B and Nanda-87B are open-weight bilingual LLMs for Hindi-English, featuring extended tokenizers and Hindi-first pretraining on 65B tokens including code-mixed text.

How do LLMs encode cultural knowledge?

LLMs encode procedural commonsense, such as event sequences and social inferences, which researchers can extract as knowledge graphs. This encoding is uneven across languages, with English often yielding cleaner results than native languages such as Arabic.

What is Human Label Variation (HLV)?

HLV treats disagreement among annotators as an informative signal rather than noise. The study models human labels as fuzzy sets and uses soft metrics such as soft micro F1 to train and evaluate AI on ambiguous tasks, instead of forcing a single consensus label.