AI benchmarks

From GISAXS

Revision as of 09:41, 11 April 2025

Contents

1 General
2 Methods
- 2.1 Task Length
3 Assess Specific Attributes

General

Methods

AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions (code)
- Leaderboard
- Suggestion to use Borda count
- 2025-04: add Quasar Alpha, Optimus Alpha, Llama-4 Scout and Llama-4 Maverick
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. Assesses reasoning using logic puzzles of tunable complexity.
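
Note on the Borda count suggested for the AidanBench leaderboard above: a Borda count aggregates several rankings by awarding each entry points according to its position in each ranking, then summing. The sketch below is a minimal illustration only. The function name, the model names, and the idea of treating each question as one ranking are assumptions made for the example, not details taken from AidanBench or from the linked suggestion.

```python
from collections import defaultdict


def borda_count(rankings):
    """Aggregate rankings (each a list ordered best-first) into Borda scores.

    In a ranking of n entries, the entry at 0-based position p earns
    n - 1 - p points. Returns (entry, score) pairs sorted by descending
    score, with ties broken alphabetically by entry name.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, entry in enumerate(ranking):
            scores[entry] += n - 1 - position
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical per-question rankings of three placeholder models.
example_rankings = [
    ["model_a", "model_b", "model_c"],
    ["model_b", "model_a", "model_c"],
    ["model_a", "model_c", "model_b"],
]
example_result = borda_count(example_rankings)
```

Here model_a earns 2 + 1 + 2 = 5 points, model_b earns 1 + 2 + 0 = 3, and model_c earns 0 + 0 + 1 = 1. Because only rank positions are used, the aggregate is insensitive to the scale of the underlying raw scores.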

Task Length

2020-09: Ajeya Cotra: Draft report on AI timelines
2025-03: Measuring AI Ability to Complete Long Tasks

Assess Specific Attributes

Various

LMSYS: Human preference ranking leaderboard
Tracking AI: "IQ" leaderboard
LiveBench: A Challenging, Contamination-Free LLM Benchmark
LLM Thematic Generalization Benchmark

Hallucination

Software/Coding

2025-02: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (code)

Visual

2025-03: Can Large Vision Language Models Read Maps Like a Human? MapBench

Creativity

Reasoning

ENIGMAEVAL: "reasoning" leaderboard (paper)

Assistant/Agentic

GAIA: a benchmark for General AI Assistants
Galileo AI Agent Leaderboard
Smolagents LLM Leaderboard: LLMs powering agents
OpenAI PaperBench: Evaluating AI’s Ability to Replicate AI Research (paper, code)

Science

See: Science Benchmarks
