Please use this identifier to cite or link to this item:
http://dx.doi.org/10.25673/122702

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Kalinin, Mikhail | - |
| dc.contributor.author | [and many others] | - |
| dc.date.accessioned | 2026-03-18T12:48:28Z | - |
| dc.date.available | 2026-03-18T12:48:28Z | - |
| dc.date.issued | 2026 | - |
| dc.identifier.uri | https://opendata.uni-halle.de//handle/1981185920/124647 | - |
| dc.identifier.uri | http://dx.doi.org/10.25673/122702 | - |
| dc.description.abstract | Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding (MMLU), limiting informed measurement of state-of-the-art LLM capabilities. Here, in response, we introduce Humanity’s Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable but cannot be quickly answered by internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a marked gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai. | eng |
| dc.language.iso | eng | - |
| dc.rights.uri | https://creativecommons.org/licenses/by-nc-nd/4.0/ | - |
| dc.subject.ddc | 540 | - |
| dc.title | A benchmark of expert-level academic questions to assess AI capabilities | eng |
| dc.type | Article | - |
| local.versionType | publishedVersion | - |
| local.bibliographicCitation.journaltitle | Nature | - |
| local.bibliographicCitation.volume | 649 | - |
| local.bibliographicCitation.pagestart | 1139 | - |
| local.bibliographicCitation.pageend | 1146 | - |
| local.bibliographicCitation.publishername | Nature Publ. Group | - |
| local.bibliographicCitation.publisherplace | London [etc.] | - |
| local.bibliographicCitation.doi | 10.1038/s41586-025-09962-4 | - |
| local.openaccess | true | - |
| dc.identifier.ppn | 1965721567 | - |
| cbs.publication.displayform | 2026 | - |
| local.bibliographicCitation.year | 2026 | - |
| cbs.sru.importDate | 2026-03-18T12:47:47Z | - |
| local.bibliographicCitation | Contained in: Nature - London [etc.] : Nature Publ. Group, 1869 | - |
| local.accessrights.dnb | free | - |
Appears in Collections: Open Access Publikationen der MLU
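The abstract above reports that state-of-the-art LLMs show low accuracy and poor calibration on HLE. As a minimal sketch of how such metrics can be computed from automatically graded answers (this is not the authors' evaluation code; the `Prediction` record, its field names, and the binning scheme are illustrative assumptions), here is one common formulation in Python:

```python
# Hedged sketch: computes benchmark accuracy and a binned RMS calibration
# error from model answers paired with self-reported confidences.
# NOT the HLE authors' grading code; field names are illustrative.
from dataclasses import dataclass


@dataclass
class Prediction:
    correct: bool      # did the model's answer match the known solution?
    confidence: float  # model's self-reported confidence in [0, 1]


def accuracy(preds: list[Prediction]) -> float:
    """Fraction of questions answered correctly."""
    return sum(p.correct for p in preds) / len(preds)


def rms_calibration_error(preds: list[Prediction], n_bins: int = 10) -> float:
    """Bin predictions by confidence, compare each bin's mean confidence
    to its empirical accuracy, and return the bin-mass-weighted RMS gap
    (one common formulation of calibration error)."""
    bins: list[list[Prediction]] = [[] for _ in range(n_bins)]
    for p in preds:
        idx = min(int(p.confidence * n_bins), n_bins - 1)
        bins[idx].append(p)
    total, weighted_sq_gap = len(preds), 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(p.confidence for p in b) / len(b)
        bin_acc = sum(p.correct for p in b) / len(b)
        weighted_sq_gap += (len(b) / total) * (mean_conf - bin_acc) ** 2
    return weighted_sq_gap ** 0.5


# Toy usage: a model that is confidently wrong scores poorly on both metrics.
preds = [Prediction(correct=False, confidence=0.95),
         Prediction(correct=True, confidence=0.60),
         Prediction(correct=False, confidence=0.80)]
print(f"accuracy={accuracy(preds):.2f}, "
      f"rms_calibration_error={rms_calibration_error(preds):.2f}")
```

A calibration error of this form penalizes models whose stated confidence diverges from their observed accuracy, which is the gap the abstract highlights alongside raw accuracy.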
Files in This Item:
| File | Size | Format |
|---|---|---|
| s41586-025-09962-4.pdf | 3.39 MB | Adobe PDF |