Abstract
A multimodal benchmark called Classroom Final Exam (CFE) is presented for evaluating large language models' reasoning abilities across 20+ STEM domains using authentic exam problems and instructor solutions.
We introduce CFE (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE is curated from repeatedly used, authentic university homework and exam problems, together with reference solutions provided by course instructors. CFE presents a significant challenge even for frontier models: the newly released Gemini-3.1-pro-preview achieves an overall accuracy of 59.69%, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving considerable room for improvement. Beyond leaderboard results, we perform a diagnostic analysis by decomposing reference solutions into reasoning flows. We find that although frontier models can often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically have more reasoning steps than those provided by the instructor, indicating suboptimal step efficiency and a higher risk of error accumulation. The data and code are available at https://github.com/Analogy-AI/CFE_Bench.
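As a rough illustration of the step-efficiency observation above, the following Python sketch shows how one might compare the number of reasoning steps in model-generated solutions against instructor reference solutions. It is not part of the released code, and the field names (`reference_steps`, `model_steps`) are hypothetical placeholders for however the parsed reasoning flows are stored.

```python
from statistics import mean

def step_efficiency(samples):
    """Compare reasoning-step counts of model solutions vs. instructor references.

    Each sample is assumed to carry two lists of step strings:
    'reference_steps' (instructor solution) and 'model_steps' (model solution).
    Returns the mean ratio of model steps to reference steps; values above 1.0
    indicate the model uses more steps than the instructor.
    """
    ratios = [
        len(s["model_steps"]) / len(s["reference_steps"])
        for s in samples
        if s["reference_steps"]  # skip samples without a parsed reference flow
    ]
    return mean(ratios) if ratios else float("nan")

# Toy usage with made-up step lists
samples = [
    {
        "reference_steps": ["set up integral", "evaluate"],
        "model_steps": ["restate problem", "set up integral", "substitute", "evaluate"],
    },
]
print(f"mean model/reference step ratio: {step_efficiency(samples):.2f}")  # 2.00
```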
Community
CFE-Bench (Classroom Final Exam) is a text-only and multimodal reasoning benchmark built from authentic, repeatedly used university homework and exam problems sourced from instructor-maintained course materials and verified by professors. It contains 305 text-only and 144 multimodal samples spanning 20+ subjects across physics, mathematics, and other STEM domains.
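For context, here is a minimal sketch of how one might load the benchmark files from a local checkout of the repository and split them by modality. The file layout (`CFE_Bench/data/*.json`) and the `modality` field are assumptions for illustration, not the repository's documented interface; adjust to the actual schema.

```python
import json
from pathlib import Path

def load_cfe_samples(data_dir="CFE_Bench/data"):
    """Load benchmark samples from JSON files and bucket them by modality.

    Assumes each JSON file holds a list of problem records with a 'modality'
    field set to either 'text' or 'multimodal'.
    """
    buckets = {"text": [], "multimodal": []}
    for path in Path(data_dir).glob("*.json"):
        for record in json.loads(path.read_text(encoding="utf-8")):
            buckets.setdefault(record.get("modality", "text"), []).append(record)
    return buckets

buckets = load_cfe_samples()
print({k: len(v) for k, v in buckets.items()})  # e.g. {'text': 305, 'multimodal': 144}
```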
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- SketchJudge: A Diagnostic Benchmark for Grading Hand-drawn Diagrams with Multimodal Large Language Models (2026)
- Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education (2026)
- From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics (2026)
- BabyVision: Visual Reasoning Beyond Language (2026)
- TeachBench: A Syllabus-Grounded Framework for Evaluating Teaching Ability in Large Language Models (2026)
- MathDoc: Benchmarking Structured Extraction and Active Refusal on Noisy Mathematics Exam Papers (2026)
- Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math (2026)