EIT Researcher Co-Authors Nature Paper Exposing AI's Expert Limits with Humanity's Last Exam

In a landmark achievement for New Zealand's higher education sector, Dr. Syed M. Shahid, a Senior Postgraduate Lecturer in Health and Sport Science at the Eastern Institute of Technology (EIT) Auckland campus, has co-authored a groundbreaking paper published in the prestigious journal Nature. The study, titled "A benchmark of expert-level academic questions to assess AI capabilities," introduces Humanity's Last Exam (HLE), a rigorous new benchmark designed to push the boundaries of artificial intelligence (AI) testing and reveal its current limitations in handling expert-level knowledge.

This collaboration involves over 1,000 global experts and highlights EIT's growing role in international AI research. As AI systems like large language models (LLMs) dominate headlines with their impressive feats, HLE provides a sobering reality check, showing that even the most advanced models still fall short of human expert performance on complex, verifiable academic tasks. For New Zealand's tertiary institutions, this underscores the value of polytechnics like EIT contributing to cutting-edge global science.

Dr. Syed M. Shahid: From Biochemistry to AI Frontiers

Dr. Shahid brings a unique perspective to the project. Holding a PhD in Basic Health Science with a focus on Medical Biochemistry from the University of Karachi, he has over 20 years of experience in health research, including roles at the University of Auckland's Faculty of Medical and Health Sciences and Aspire2 International. At EIT, he supervises postgraduate research and lectures on topics like nutrition, digital health, and health promotion.

His involvement in HLE stems from his expertise in health sciences, where he contributed challenging questions that test deep domain knowledge. "Participating in this global effort was an honor," Dr. Shahid noted in EIT announcements. "It allows researchers from institutions like EIT to influence how we measure AI progress, ensuring benchmarks reflect real-world expert challenges." This marks a significant milestone for EIT, demonstrating how applied research institutes in New Zealand are making waves in theoretical AI evaluation.

Dr. Shahid's career trajectory—from publishing 60+ papers and supervising dozens of theses to now co-authoring in Nature—exemplifies the interdisciplinary paths available in NZ higher education. His work bridges health inequities in ethnic communities with emerging tech like AI-assisted diagnostics, positioning EIT as a hub for practical innovation.

Humanity's Last Exam: The Ultimate AI Stress Test

At its core, HLE comprises 2,500 multi-modal questions spanning dozens of subjects, from advanced mathematics and physics to humanities, biology, and niche areas like local customs or historical trivia. Unlike standard benchmarks, these are crafted to be unambiguous, verifiable, and resistant to simple internet lookups—requiring genuine reasoning and expert insight.

The benchmark's name reflects its ambition: as AI saturates easier tests, HLE aims to be the "last" comprehensive closed-ended academic exam before AI matches or exceeds human experts across the board. Questions use multiple-choice and short-answer formats so they can be graded automatically, and the reported expert disagreement rate is about 15%.

[Figure: Humanity's Last Exam benchmark diagram showing the AI vs human performance gap]

The Saturation Crisis in AI Benchmarks

Traditional benchmarks like Massive Multitask Language Understanding (MMLU) have become obsolete. Frontier LLMs now score over 90% on them, which masks true progress. This saturation leads to unreliable comparisons and overhyped capabilities.

HLE addresses this by targeting graduate-level expertise. Developers filtered out questions that LLMs already answer well, so that what remains measures the "expert human frontier." Early tests showed models like GPT-4o at just 2.7% accuracy, while human experts reach around 90% in their own domains, a gap of roughly 87 percentage points.
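The gap quoted above is a difference between two accuracy figures, so it is most naturally read in percentage points. A small illustrative sketch of that arithmetic, using only the 2.7% and roughly 90% figures from this article (the function name is hypothetical):

```python
def percentage_point_gap(human_accuracy: float, model_accuracy: float) -> float:
    """Absolute difference between two accuracies, both given in percent."""
    return human_accuracy - model_accuracy


human = 90.0  # approximate expert accuracy quoted in the article
model = 2.7   # early GPT-4o accuracy quoted in the article

gap = percentage_point_gap(human, model)
ratio = human / model  # how many times more accurate experts are on these figures

print(f"Gap: {gap:.1f} percentage points")
print(f"Experts are roughly {ratio:.0f} times as accurate on these figures")
```

Reporting the absolute difference avoids the ambiguity of a bare "87%", which could otherwise be read as a relative change.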

Crowdsourcing Expertise: Building HLE Globally

Over 1,000 subject-matter experts worldwide contributed, including several from New Zealand: Dr. Shahid (EIT), Mohinder Maheshbhai Naiya (Auckland University of Technology), Jennifer Zampese (University of Canterbury), and Gaël Gendron (University of Auckland). Questions underwent rigorous validation to confirm difficulty and verifiability.

The process was crowdsourced through organizations such as Scale AI and the Center for AI Safety, with an ongoing "HLE-Rolling" effort for fresh challenges. This collaborative model democratizes benchmark creation, allowing contributions from diverse institutions like EIT.

  • Expert vetting for unambiguous solutions
  • Rejection of questions that are easily retrievable or already solvable by AI
  • Broad coverage: STEM-heavy but including humanities
  • Multi-modal: text and images for comprehensive testing
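
The rejection step in the list above can be pictured as a simple filter. The sketch below is illustrative only: the `Candidate` class, the `normalize` helper, the model names, and the exact-match comparison are all assumptions made for this example, not the procedure the HLE team actually used.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    question: str
    reference_answer: str
    # Hypothetical: answers collected from existing models, keyed by model name.
    model_answers: dict[str, str] = field(default_factory=dict)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def survives_filter(candidate: Candidate) -> bool:
    """Keep a question only if no model already matches the reference answer."""
    target = normalize(candidate.reference_answer)
    return all(normalize(ans) != target for ans in candidate.model_answers.values())


candidates = [
    Candidate("Easy question", "Paris", {"model_a": "paris", "model_b": "Lyon"}),
    Candidate("Hard question", "42", {"model_a": "41", "model_b": "unknown"}),
]

kept = [c for c in candidates if survives_filter(c)]
print([c.question for c in kept])
```

A real pipeline would need more forgiving answer matching and human review on top of this, which is where the expert vetting in the list comes in.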

AI's Stumbling Blocks: Low Scores and Overconfidence

Initial results were humbling, and scores remain far from human levels. On early 2026 leaderboards, top models like Gemini 3.1 Pro Preview score ~45%, GPT-5 variants ~40-44%, and Claude models ~30-35%. Calibration errors are in the 50-70% range, meaning models often state high confidence in answers that are wrong.
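
To make "calibration error" concrete, here is a minimal sketch of one common formulation, a binned root-mean-square calibration error. It assumes each answer comes with a stated confidence between 0 and 1; the function name, the choice of ten bins, and the sample data are assumptions for illustration, and the paper's exact procedure may differ.

```python
import math


def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS gap between stated confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Place each answer in an equal-width confidence bin; 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    weighted_sq = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        # Weight each bin's squared gap by its share of all answers.
        weighted_sq += (len(members) / total) * (mean_conf - accuracy) ** 2
    return math.sqrt(weighted_sq)


# Hypothetical model: claims 90% confidence on four answers, gets one right.
overconfident = rms_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
print(f"Calibration error: {overconfident:.2f}")
```

A well-calibrated model that says "90% sure" should be right about nine times in ten, which would give an error near zero; a large value means stated confidence tells you little about correctness.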

The hardest areas are world-class mathematics (deep reasoning), specialized STEM, and trivia requiring precise recall. Multiple-choice questions are slightly easier, but exact-answer questions expose the true limits.

[Figure: Leaderboard of AI models on Humanity's Last Exam showing low accuracies]

Implications for AI Development and Governance

HLE clarifies AI isn't yet "expert-level" on structured tasks, informing policy on risks like overreliance in academia or healthcare. It emphasizes reasoning gaps over memorization.

For developers, it's a roadmap: improving calibration and reasoning could close the gap. Policymakers gain a metric for safe deployment. The paper calls for transparent evaluation to guide research.

New Zealand's Emerging Role in AI Research

With contributors from EIT, AUT, Canterbury, and Auckland, NZ punches above its weight. EIT's involvement showcases polytechnics' research prowess, complementing universities.

Government initiatives such as New Zealand's AI strategy support this work. Institutions like EIT foster interdisciplinary talent, which is vital as AI integrates into health, education, and sustainability.

For students, it highlights opportunities in AI ethics and benchmarking, fields where human insight remains superior.

Transforming Higher Education in Aotearoa

In NZ colleges and universities, HLE prompts reflection on AI tools. Lecturers like Dr. Shahid integrate AI ethically, teaching limits alongside strengths.

The benefits include augmented research and personalized learning; the risks include plagiarism and reduced critical thinking. EIT's health programs now emphasize AI literacy, preparing graduates for digital health roles.

  • Training on benchmark creation
  • Ethical AI curricula
  • Interdisciplinary projects

Career Pathways in AI and Research

This breakthrough opens doors. NZ needs AI researchers, ethicists, and health data specialists, and EIT graduates go on to PhDs and industry roles.

The key skills are domain expertise combined with technical fluency. Institutions offer research assistantships and lecturer positions that fuel contributions like this one.

The Road Ahead: Evolving Benchmarks and AI

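HLE isn't final: dynamic updates keep it relevant. As scores rise (about 45% now versus about 3% initially), the 50% mark is the next milestone to watch, though it is still well below the roughly 90% that human experts reach in their own domains.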

For NZ higher ed, it's a call to invest in talent. Dr. Shahid's success inspires: polytechs drive global impact.

Explore the full Nature paper or arXiv preprint for details. Leaderboards at lastexam.ai track progress.

About the author

Jarrod Kanizay

Academic Jobs In House Author

Frequently Asked Questions

What is Humanity's Last Exam (HLE)?

HLE is a 2,500-question benchmark of expert-level academic problems across mathematics, the sciences, and the humanities. It is designed to test AI beyond saturated benchmarks such as MMLU.

Who is Dr. Syed M. Shahid from EIT?

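Dr. Shahid is a Senior Postgraduate Lecturer in Health and Sport Science at EIT's Auckland campus and holds a PhD in Basic Health Science with a focus on Medical Biochemistry. He contributed health science questions to HLE, bridging digital health and AI evaluation.

Why was HLE created?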

Existing AI benchmarks have become saturated, with frontier LLMs scoring above 90%. HLE provides hard, verifiable questions that measure progress toward expert human performance more accurately.

How do AIs perform on HLE?

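Top models such as Gemini 3.1 score about 45% and GPT-5 about 44%, far below human experts (about 90% in their own domains). Models are also highly overconfident, with calibration errors of 50-70%. The leaderboard at lastexam.ai tracks current results.

What subjects are hardest for AI on HLE?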

Advanced mathematics, specialized STEM fields, and trivia that depends on precise recall. These require deep reasoning rather than simple retrieval.

How does EIT contribute to global AI research?

Through experts like Dr. Shahid, EIT shows that polytechnics can contribute to benchmark design and to work at the intersection of health and AI, which boosts New Zealand's research profile.

What are the implications for NZ higher education?

HLE highlights the need for AI literacy and ethics training, and points to research job opportunities at universities and polytechnics.

Can students access HLE?

Yes, it is publicly available at lastexam.ai. It is useful for learning about the limits of AI and for contributing questions.

What is the future of AI benchmarks after HLE?

Expect dynamic updates and new challenges. Benchmarks like HLE track the path toward expert-level AI and inform policy.

What career tips follow from EIT's involvement?

Build domain expertise alongside AI skills, pursue research at EIT or a university, and check NZ research job listings.

How can I read the Nature paper?

An open-access summary is available at Nature, and full details are on arXiv.