AI benchmarks

From GISAXS


Revision as of 13:52, 27 March 2025

Contents

1 General
2 Methods
- 2.1 Task Length
3 Assess Specific Attributes

General

Models Table (lifearchitect.ai)

Methods

AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions (code)
- Leaderboard
- Suggestion to use Borda count
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. Assess reasoning using puzzles of tunable complexity.
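
The Borda count suggested above for AidanBench is a rank-aggregation rule: each per-question ranking awards a model one point for every model it beats, and the points are summed across questions. The following is a minimal sketch of that rule only, not AidanBench's actual scoring code; the function names and the model names are hypothetical, and it assumes every ranking lists models from best to worst without ties.

```python
from collections import defaultdict


def borda_count(rankings):
    """Sum Borda points over several rankings (each ordered best first).

    In a ranking of n candidates, the candidate at 0-based position i
    receives n - 1 - i points, i.e. one point per candidate ranked below it.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return dict(scores)


def borda_order(rankings):
    """Return candidates sorted by total Borda score, highest first.

    Ties are broken alphabetically so the result is deterministic.
    """
    scores = borda_count(rankings)
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical per-question rankings of three placeholder models.
example_rankings = [
    ["model_a", "model_b", "model_c"],
    ["model_b", "model_a", "model_c"],
    ["model_a", "model_c", "model_b"],
]

if __name__ == "__main__":
    print(borda_count(example_rankings))
    print(borda_order(example_rankings))
```

Because the rule uses only positions, it does not depend on the scale of the underlying per-question scores, which is one common motivation for using it to combine results across heterogeneous questions.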

Task Length

2020-09: Ajeya Cotra: Draft report on AI timelines
2025-03: Measuring AI Ability to Complete Long Tasks

Assess Specific Attributes

Various

LMSYS: Human preference ranking leaderboard
Tracking AI: "IQ" leaderboard
Vectara Hallucination Leaderboard
LiveBench: A Challenging, Contamination-Free LLM Benchmark

Hallucination

LLM Confabulation (Hallucination) Leaderboard for RAG

Software/Coding

2025-02: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (code)

Visual

2025-03: Can Large Vision Language Models Read Maps Like a Human? MapBench

Creativity

Reasoning

ENIGMAEVAL: "reasoning" leaderboard (paper)

Assistant/Agentic

GAIA: a benchmark for General AI Assistants
Galileo AI Agent Leaderboard
Smolagents LLM Leaderboard: LLMs powering agents

Science

See: Science Benchmarks
