Papers: Evaluation
• Skill-Mix: a Flexible and Expandable Family of Evaluations for AI Models (arXiv:2310.17567)
• This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models (arXiv:2310.15941)
• Holistic Evaluation of Language Models (arXiv:2211.09110)
• INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models (arXiv:2306.04757)
• EleutherAI: Going Beyond "Open Science" to "Science in the Open" (arXiv:2210.06413)
• Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models (arXiv:2310.20499)
• MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks (arXiv:2311.07463)
• Judging LLM-as-a-judge with MT-Bench and Chatbot Arena (arXiv:2306.05685)
• TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization (arXiv:2402.13249)