AI benchmarks

From GISAXS

Revision as of 10:43, 20 March 2025

Contents

1 Methods
- 1.1 Task Length
2 Assess Specific Attributes

Methods

AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions (code)
- Leaderboard
- Suggestion to use Borda count
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. Assess reasoning using puzzles of tunable complexity.
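
The Borda count suggested above for AidanBench is a standard way to merge several per-question rankings into one leaderboard: in each ranking of n entries, the top entry earns n - 1 points, the next n - 2, and so on down to 0, and the points are summed. The sketch below is a minimal illustration only. The model names and the sample rankings are hypothetical, and how the suggestion would actually be applied to AidanBench scores is not specified here.

```python
from collections import defaultdict


def borda_count(rankings):
    """Aggregate several rankings (each listed best first) into Borda scores.

    Returns (candidate, score) pairs sorted by descending score,
    with ties broken alphabetically by candidate name.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            # Top place earns n - 1 points, last place earns 0.
            scores[candidate] += n - 1 - position
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical per-question rankings of three placeholder models.
rankings = [
    ["model_a", "model_b", "model_c"],
    ["model_b", "model_a", "model_c"],
    ["model_a", "model_c", "model_b"],
]

leaderboard = borda_count(rankings)
for name, score in leaderboard:
    print(f"{name}: {score}")
```

Because it uses only rank positions, this aggregation ignores the size of score gaps between models, which makes it less sensitive to a single question with an outlier score.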

Task Length

2020-09: Ajeya Cotra: Draft report on AI timelines
2025-03: Measuring AI Ability to Complete Long Tasks

Assess Specific Attributes

Various

LMSYS: Human preference ranking leaderboard
Tracking AI: "IQ" leaderboard
Vectara Hallucination Leaderboard
LiveBench: A Challenging, Contamination-Free LLM Benchmark

Software/Coding

2025-02: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (code)

Creativity

2024-10: AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text

Reasoning

ENIGMAEVAL: "reasoning" leaderboard (paper)

Assistant/Agentic

GAIA: a benchmark for General AI Assistants
Galileo AI Agent Leaderboard
Smolagents LLM Leaderboard: LLMs powering agents

Science

See: Science Benchmarks
