AI benchmarks

From GISAXS

Revision as of 09:11, 8 September 2025

Contents

1 General
- 1.1 Lists of Benchmarks
- 1.2 Analysis of Methods
2 Methods
- 2.1 Task Length
3 Assess Specific Attributes

General

Models Table (lifearchitect.ai)
Artificial Analysis
Epoch AI
- Notable AI models
- AI benchmarking dashboard

Lists of Benchmarks

2025-05: Lisan al Gaib: The Ultimate LLM Benchmark list
- Average across 28 benchmarks

Analysis of Methods

2025-04: The Leaderboard Illusion

Methods

AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions (code)
- Leaderboard
- Suggestion to use Borda count
- 2025-04: add Quasar Alpha, Optimus Alpha, Llama-4 Scout and Llama-4 Maverick
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. Assess reasoning using puzzles of tunable complexity.
2025-08: Reinforcement Learning with Rubric Anchors (model)
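
The Borda count suggested for the AidanBench leaderboard above is a rank-aggregation rule: each ranking awards a model points according to its position, and the points are summed across rankings. The sketch below is only an illustration under assumptions. The linked suggestion does not specify here how the rankings would be formed, so the function name `borda_count`, the placeholder model names, and the choice of one ranking per question are all hypothetical.

```python
from collections import defaultdict


def borda_count(rankings):
    """Aggregate several best-first rankings into Borda scores.

    rankings: list of lists of model names, each ordered best-first.
    A model at position i in a ranking of n models earns n - 1 - i points.
    Returns (model, score) pairs sorted by score, highest first.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, model in enumerate(ranking):
            scores[model] += n - 1 - position
    # Break ties alphabetically so the output order is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical per-question rankings (placeholder names, not real results).
rankings = [
    ["model_a", "model_b", "model_c"],
    ["model_b", "model_a", "model_c"],
    ["model_a", "model_c", "model_b"],
]

for model, score in borda_count(rankings):
    print(model, score)
```

Because only rank positions are used, a model cannot dominate the aggregate through an unusually large raw score on a single question, which is one plausible motivation for the suggestion.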

Task Length

2020-09: Ajeya Cotra: Draft report on AI timelines
2025-03: Measuring AI Ability to Complete Long Tasks

Assess Specific Attributes

Various

LMSYS: Human preference ranking leaderboard
Tracking AI: "IQ" leaderboard
LiveBench: A Challenging, Contamination-Free LLM Benchmark
LLM Thematic Generalization Benchmark

Hallucination

Vectara Hallucination Leaderboard
LLM Confabulation (Hallucination) Leaderboard for RAG
2025-09: OpenAI: Why Language Models Hallucinate

Software/Coding

2025-02: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (code)

Math

AIME Benchmark

Science

2025-07: SciArena: A New Platform for Evaluating Foundation Models in Scientific Literature Tasks (vote, data, code)

Visual

2024-06: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs (preprint, leaderboard)
2025-03: Can Large Vision Language Models Read Maps Like a Human? MapBench

Conversation

2025-01: MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs (project, code, leaderboard)

Creativity

See also: AI creativity
2024-10: AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text
2024-11: AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions (code)
2024-12: LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea Generation with Minimal Context
LLM Creative Story-Writing Benchmark

Reasoning

ENIGMAEVAL: "reasoning" leaderboard (paper)
Sober Reasoning Leaderboard
- 2025-04: A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility

Assistant/Agentic

See: AI Agents: Optimization

GAIA: a benchmark for General AI Assistants
Galileo AI Agent Leaderboard
Smolagents LLM Leaderboard: LLMs powering agents
OpenAI PaperBench: Evaluating AI’s Ability to Replicate AI Research (paper, code)
2025-06: The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements
HAL: Holistic Agent Leaderboard: the standardized, cost-aware, third-party leaderboard for evaluating agents

Science

See: Science Benchmarks

Retrieved from "http://gisaxs.com/index.php?title=AI_benchmarks&oldid=8217"