EIT Researcher Co-Authors Nature Paper Exposing AI's Expert Limits with Humanity's Last Exam

Q: Can students access HLE?

Yes. HLE is publicly available at lastexam.ai, which makes it useful both for learning about AI's limits and for contributing questions.

Breakthrough Benchmark Reveals Gaps in Frontier AI Capabilities


In a landmark achievement for New Zealand's higher education sector, Dr. Syed M. Shahid, a Senior Postgraduate Lecturer in Health and Sport Science at the Eastern Institute of Technology (EIT) Auckland campus, has co-authored a groundbreaking paper published in the prestigious journal Nature. The study, titled "A benchmark of expert-level academic questions to assess AI capabilities," introduces Humanity's Last Exam (HLE), a rigorous new benchmark designed to push the boundaries of artificial intelligence (AI) testing and reveal its current limitations in handling expert-level knowledge.

This collaboration involves over 1,000 global experts and highlights EIT's growing role in international AI research. As AI systems like large language models (LLMs) dominate headlines with their impressive feats, HLE provides a sobering reality check, showing that even the most advanced models still fall short of human expert performance on complex, verifiable academic tasks. For New Zealand's tertiary institutions, this underscores the value of polytechnics like EIT contributing to cutting-edge global science.

Dr. Syed M. Shahid: From Biochemistry to AI Frontiers

Dr. Shahid brings a unique perspective to the project. Holding a PhD in Basic Health Science with a focus on Medical Biochemistry from the University of Karachi, he has over 20 years of experience in health research, including roles at the University of Auckland's Faculty of Medical and Health Sciences and Aspire2 International. At EIT, he supervises postgraduate research and lectures on topics like nutrition, digital health, and health promotion.

His involvement in HLE stems from his expertise in health sciences, where he contributed challenging questions that test deep domain knowledge. "Participating in this global effort was an honor," Dr. Shahid noted in EIT announcements. "It allows researchers from institutions like EIT to influence how we measure AI progress, ensuring benchmarks reflect real-world expert challenges." This marks a significant milestone for EIT, demonstrating how applied research institutes in New Zealand are making waves in theoretical AI evaluation.

Dr. Shahid's career trajectory—from publishing 60+ papers and supervising dozens of theses to now co-authoring in Nature—exemplifies the interdisciplinary paths available in NZ higher education. His work bridges health inequities in ethnic communities with emerging tech like AI-assisted diagnostics, positioning EIT as a hub for practical innovation.

Humanity's Last Exam: The Ultimate AI Stress Test

At its core, HLE comprises 2,500 multi-modal questions spanning dozens of subjects, from advanced mathematics and physics to humanities, biology, and niche areas like local customs or historical trivia. Unlike standard benchmarks, these are crafted to be unambiguous, verifiable, and resistant to simple internet lookups—requiring genuine reasoning and expert insight.

The benchmark's name reflects its ambition: as AI saturates easier tests, HLE aims to be the "last" comprehensive closed-ended academic exam before AI matches or exceeds human experts across the board. Questions use multiple-choice and short-answer formats so that responses can be graded automatically. The reported expert disagreement rate on the reference answers is about 15%, which gives a sense of how much noise remains in the answer key.

Figure: Humanity's Last Exam benchmark diagram showing the gap between AI and human performance.

The Saturation Crisis in AI Benchmarks

Traditional benchmarks like Massive Multitask Language Understanding (MMLU) have become saturated. Frontier LLMs now score over 90% on them, so further gains barely register and real differences between models are masked. This saturation leads to unreliable comparisons and overhyped capabilities.

HLE addresses this by targeting graduate-level expertise. Developers filtered out questions that LLMs already answer correctly, so the remaining set measures the "expert human frontier." Early tests showed models like GPT-4o at just 2.7% accuracy, while human experts hit around 90% in their own domains, a gap of roughly 87 percentage points.
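
To make the arithmetic behind that gap explicit, here is a minimal sketch using only the two figures quoted above (about 90% for human experts, 2.7% for GPT-4o). The function names are hypothetical and chosen for illustration. It separates the absolute gap, measured in percentage points, from the relative ratio between the two accuracies.

```python
def gap_in_points(human_acc: float, model_acc: float) -> float:
    """Absolute gap between two accuracies, in percentage points."""
    return round(human_acc - model_acc, 1)


def relative_ratio(human_acc: float, model_acc: float) -> float:
    """How many times higher the human accuracy is than the model accuracy."""
    if model_acc <= 0:
        raise ValueError("model accuracy must be positive")
    return round(human_acc / model_acc, 1)


# Figures quoted in the article (both in percent).
HUMAN_EXPERT_ACC = 90.0
GPT_4O_ACC = 2.7

print(gap_in_points(HUMAN_EXPERT_ACC, GPT_4O_ACC))   # 87.3
print(relative_ratio(HUMAN_EXPERT_ACC, GPT_4O_ACC))  # 33.3
```

The distinction matters when reading leaderboards: a gap of 87 percentage points is not the same as being "87% worse".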

Crowdsourcing Expertise: Building HLE Globally

Over 1,000 subject-matter experts worldwide contributed, including several from New Zealand: Dr. Shahid (EIT), Mohinder Maheshbhai Naiya (Auckland University of Technology), Jennifer Zampese (University of Canterbury), and Gaël Gendron (University of Auckland). Questions underwent rigorous validation to confirm difficulty and verifiability.

The process involved crowdsourcing coordinated by the Center for AI Safety and Scale AI, with an ongoing "HLE-Rolling" variant for fresh challenges. This collaborative model democratizes benchmark creation, allowing contributions from diverse institutions like EIT.

Expert vetting for unambiguous solutions
Rejection of retrievable or easy AI-solvable questions
Broad coverage: STEM-heavy but including humanities
Multi-modal: text, images for comprehensive testing
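
The selection criteria listed above can be sketched as a simple filter. This is an illustration under stated assumptions, not the pipeline the HLE team actually ran: the `Candidate` record, its field names, and the exact-match comparison are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical record for one submitted question."""
    qid: str
    reference_answer: str
    model_answers: dict = field(default_factory=dict)  # model name -> answer
    expert_approved: bool = False       # passed expert vetting
    retrievable_by_search: bool = False  # answer found by a simple lookup


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.strip().lower().split())


def solved_by_any_model(c: Candidate) -> bool:
    """True if at least one model reproduced the reference answer."""
    ref = normalize(c.reference_answer)
    return any(normalize(a) == ref for a in c.model_answers.values())


def keep(c: Candidate) -> bool:
    """Keep only vetted, non-retrievable questions that no model solved."""
    return (
        c.expert_approved
        and not c.retrievable_by_search
        and not solved_by_any_model(c)
    )


candidates = [
    Candidate("q1", "42", {"model_a": "41", "model_b": "7"}, expert_approved=True),
    Candidate("q2", "Paris", {"model_a": " paris "}, expert_approved=True),
    Candidate("q3", "x = 3", {"model_a": "x = 5"}, expert_approved=False),
]
print([c.qid for c in candidates if keep(c)])  # ['q1']
```

Here `q2` is dropped because a model already answers it and `q3` because it never passed expert vetting, which mirrors the rejection rules in the list.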

AI's Stumbling Blocks: Low Scores and Overconfidence

Initial results were humbling, and later ones remain well short of expert level. As of early 2026 leaderboards, top models like Gemini 3.1 Pro Preview score ~45%, GPT-5 variants ~40-44%, and Claude models ~30-35%, all still far from human levels. Reported calibration errors run from roughly 50% to 70%, meaning the models often give wrong answers with high stated confidence.
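
For readers unfamiliar with calibration, the sketch below shows one common way to quantify it: a binned expected calibration error, which compares a model's stated confidence with how often it is actually right. This is an assumption for illustration only. The article does not say which calibration metric the leaderboard uses, and the function name and sample data are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned calibration error: weighted mean of |avg confidence - accuracy|.

    confidences: stated confidence per answer, each in [0, 1]
    correct:     whether each answer was right (same length)
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one answer")

    # Group answers into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))

    error = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        error += (len(members) / n) * abs(avg_conf - accuracy)
    return error


# A model that claims 90% confidence but is right only one time in four.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
print(round(overconfident, 2))  # 0.65
```

A well-calibrated model would score close to zero here. A value of 0.65 means stated confidence overshoots actual accuracy by 65 percentage points on average.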

The hardest areas are world-class mathematics (which demands deep reasoning), specialized STEM, and trivia requiring precise recall. Multiple-choice questions are slightly easier, but exact-answer questions expose the models' true limits.

Figure: Leaderboard of AI models on Humanity's Last Exam, showing low accuracies.

Implications for AI Development and Governance

HLE makes clear that AI isn't yet "expert-level" on structured academic tasks, which informs policy on risks like overreliance in academia or healthcare. It highlights gaps in reasoning rather than in memorization.

For developers, it's a roadmap: improving calibration and reasoning could close the gap. Policymakers gain a metric for safe deployment. The paper calls for transparent evaluation to guide research.

New Zealand's Emerging Role in AI Research

With contributors from EIT, AUT, Canterbury, and Auckland, NZ punches above its weight. EIT's involvement showcases polytechnics' research prowess, complementing universities.

Government initiatives such as the national AI strategy support this work. Institutions like EIT foster interdisciplinary talent, which is vital as AI integrates into health, education, and sustainability.

For students, it highlights opportunities in AI ethics and benchmarking, fields where human insight still leads.

Transforming Higher Education in Aotearoa

In NZ colleges and universities, HLE prompts reflection on AI tools. Lecturers like Dr. Shahid integrate AI ethically, teaching limits alongside strengths.

The benefits include augmented research and personalized learning. The risks include plagiarism and reduced critical thinking. EIT's health programs now emphasize AI literacy, preparing graduates for digital health roles.

Training on benchmark creation
Ethical AI curricula
Interdisciplinary projects

Career Pathways in AI and Research

This breakthrough opens doors. NZ needs AI researchers, ethicists, and health data specialists, and EIT graduates go on to PhDs and industry roles.

The key skills are domain expertise combined with technical fluency. Institutions offer research assistantships and lecturer positions that make contributions like this one possible.


The Road Ahead: Evolving Benchmarks and AI

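HLE isn't final: dynamic updates are meant to keep it relevant. As scores rise (about 45% now versus about 3% initially), watch for the 50% mark, a milestone that some will read as approaching expert parity even though it remains well below the roughly 90% that human experts reach in their own domains.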

For NZ higher education, it's a call to invest in talent. Dr. Shahid's success shows that polytechnics can contribute to research with global impact.

Explore the full Nature paper or arXiv preprint for details. Leaderboards at lastexam.ai track progress.


Frequently Asked Questions

📚What is Humanity's Last Exam (HLE)?

HLE is a 2,500-question benchmark of expert-level academic problems across mathematics, the sciences, and the humanities. It is designed to test AI beyond saturated benchmarks like MMLU.

👨‍🏫Who is Dr. Syed M. Shahid from EIT?

Dr. Shahid is a Senior Postgraduate Lecturer in Health and Sport Science at EIT's Auckland campus and holds a PhD with a focus on Medical Biochemistry. He contributed health science questions to HLE, bridging digital health and AI evaluation.