Humanity's Last Exam

Humanity's Last Exam (HLE) is a language model benchmark consisting of over 2,500 expert-level questions across a broad range of subjects. It was created jointly by the Center for AI Safety and Scale AI, and was designed to test reasoning abilities and human-like intelligence, as opposed to just pattern recognition.

Organization	Model	Accuracy (%) ↑	Calibration Error (%) ↓
Google DeepMind	Gemini 3 Pro Preview	37.52	57
OpenAI	GPT-5 Pro	31.64	49
Anthropic	Claude Opus 4.5 (Thinking)	25.20	55
Z.ai	GLM 4.5	8.32	79
Meta AI	Llama 4 Maverick	5.68	83
Mistral AI	Mistral Medium 3	4.52	77
Amazon Web Services	Nova Pro	4.40	80

Organization	Model	Accuracy (%) ↑	Calibration Error (%) ↓
OpenAI	gpt-oss-120b	15.48	76
Alibaba Cloud	Qwen3-235B-A22B-Thinking-2507	15.43	78
DeepSeek	DeepSeek-R1-0528	14.04	78
Moonshot AI	Kimi-K2-Instruct	4.68	82
Amazon Web Services	Nova Micro	4.41	84

Humanity's Last Exam

History

Creation

Composition

Results

References

External links

Wikiwand - on