AI benchmarks

From GISAXS

Revision as of 16:26, 14 April 2025

Contents

1 General
2 Methods
- 2.1 Task Length
3 Assess Specific Attributes

General

Methods

AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions (code)
- Leaderboard
- Suggestion to use Borda count
- 2025-04: add Quasar Alpha, Optimus Alpha, Llama-4 Scout and Llama-4 Maverick
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. Assesses reasoning using puzzles of tunable complexity.
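
For readers unfamiliar with the Borda count mentioned under AidanBench, the following is a minimal sketch of the general voting rule, not the leaderboard's actual scoring code and not the linked suggestion itself. The function name `borda_count`, the placeholder model names, and the idea of treating each question as one ranking are assumptions made purely for illustration.

```python
from collections import defaultdict


def borda_count(rankings):
    """Aggregate several rankings (best first) into one Borda score per candidate."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            # The top of an n-item ranking earns n-1 points, the last earns 0.
            scores[candidate] += n - 1 - position
    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical per-question rankings of three placeholder models.
rankings = [
    ["model_a", "model_b", "model_c"],
    ["model_b", "model_a", "model_c"],
    ["model_a", "model_c", "model_b"],
]
result = borda_count(rankings)
print(result)
```

Because the Borda count uses only rank positions, a single question with an unusually large raw score cannot dominate the aggregate, which is one common reason for preferring it over summing raw scores.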

Task Length

2020-09: Ajeya Cotra: Draft report on AI timelines
2025-03: Measuring AI Ability to Complete Long Tasks

Assess Specific Attributes

Various

LMSYS: Human preference ranking leaderboard
Tracking AI: "IQ" leaderboard
LiveBench: A Challenging, Contamination-Free LLM Benchmark
LLM Thematic Generalization Benchmark

Hallucination

Software/Coding

2025-02: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (code)

Visual

2025-03: Can Large Vision Language Models Read Maps Like a Human? MapBench

Conversation

2025-01: MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs (project, code)

Creativity

Reasoning

ENIGMAEVAL: "reasoning" leaderboard (paper)
Sober Reasoning Leaderboard
- 2025-04: A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility

Assistant/Agentic

GAIA: a benchmark for General AI Assistants
Galileo AI Agent Leaderboard
Smolagents LLM Leaderboard: LLMs powering agents
OpenAI PaperBench: Evaluating AI’s Ability to Replicate AI Research (paper, code)

Science

See: Science Benchmarks