Please use this identifier to cite or link to this item:
http://dx.doi.org/10.25673/122702

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Kalinin, Mikhail | - |
| dc.contributor.author | [and many others] | - |
| dc.date.accessioned | 2026-03-18T12:48:28Z | - |
| dc.date.available | 2026-03-18T12:48:28Z | - |
| dc.date.issued | 2026 | - |
| dc.identifier.uri | https://opendata.uni-halle.de//handle/1981185920/124647 | - |
| dc.identifier.uri | http://dx.doi.org/10.25673/122702 | - |
| dc.description.abstract | Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding (MMLU), limiting informed measurement of state-of-the-art LLM capabilities. Here, in response, we introduce Humanity’s Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable but cannot be quickly answered by internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a marked gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai. | eng |
| dc.language.iso | eng | - |
| dc.rights.uri | https://creativecommons.org/licenses/by-nc-nd/4.0/ | - |
| dc.subject.ddc | 540 | - |
| dc.title | A benchmark of expert-level academic questions to assess AI capabilities | eng |
| dc.type | Article | - |
| local.versionType | publishedVersion | - |
| local.bibliographicCitation.journaltitle | Nature | - |
| local.bibliographicCitation.volume | 649 | - |
| local.bibliographicCitation.pagestart | 1139 | - |
| local.bibliographicCitation.pageend | 1146 | - |
| local.bibliographicCitation.publishername | Nature Publ. Group | - |
| local.bibliographicCitation.publisherplace | London [etc.] | - |
| local.bibliographicCitation.doi | 10.1038/s41586-025-09962-4 | - |
| local.openaccess | true | - |
| dc.identifier.ppn | 1965721567 | - |
| cbs.publication.displayform | 2026 | - |
| local.bibliographicCitation.year | 2026 | - |
| cbs.sru.importDate | 2026-03-18T12:47:47Z | - |
| local.bibliographicCitation | Contained in: Nature - London [etc.] : Nature Publ. Group, 1869 | - |
| local.accessrights.dnb | free | - |
Appears in Collections: Open Access Publikationen der MLU
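The abstract above reports that state-of-the-art LLMs show low accuracy and poor calibration on HLE. As a minimal sketch of how such metrics can be computed from automatically graded answers (this is not the authors' evaluation code; the `Prediction` record, its field names, and the binning scheme are illustrative assumptions), here is one common formulation in Python:

```python
# Hedged sketch: computes benchmark accuracy and a binned RMS calibration
# error from model answers paired with self-reported confidences.
# NOT the HLE authors' grading code; field names are illustrative.
from dataclasses import dataclass


@dataclass
class Prediction:
    correct: bool      # did the model's answer match the known solution?
    confidence: float  # model's self-reported confidence in [0, 1]


def accuracy(preds: list[Prediction]) -> float:
    """Fraction of questions answered correctly."""
    return sum(p.correct for p in preds) / len(preds)


def rms_calibration_error(preds: list[Prediction], n_bins: int = 10) -> float:
    """Bin predictions by confidence, compare each bin's mean confidence
    to its empirical accuracy, and return the bin-mass-weighted RMS gap
    (one common formulation of calibration error)."""
    bins: list[list[Prediction]] = [[] for _ in range(n_bins)]
    for p in preds:
        idx = min(int(p.confidence * n_bins), n_bins - 1)
        bins[idx].append(p)
    total, weighted_sq_gap = len(preds), 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(p.confidence for p in b) / len(b)
        bin_acc = sum(p.correct for p in b) / len(b)
        weighted_sq_gap += (len(b) / total) * (mean_conf - bin_acc) ** 2
    return weighted_sq_gap ** 0.5


# Toy usage: a model that is confidently wrong scores poorly on both metrics.
preds = [Prediction(correct=False, confidence=0.95),
         Prediction(correct=True, confidence=0.60),
         Prediction(correct=False, confidence=0.80)]
print(f"accuracy={accuracy(preds):.2f}, "
      f"rms_calibration_error={rms_calibration_error(preds):.2f}")
```

A calibration error of this form penalizes models whose stated confidence diverges from their observed accuracy, which is the gap the abstract highlights alongside raw accuracy.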
Files in This Item:
| File | Size | Format |
|---|---|---|
| s41586-025-09962-4.pdf | 3.39 MB | Adobe PDF |