AI benchmarks

From GISAXS

Revision as of 15:21, 11 March 2025 by KevinYager (talk | contribs) (→‎Leaderboards)

Contents

1 Leaderboards
- 1.1 Assistant/Agentic
2 Methods
3 Assess Specific Attributes
- 3.1 Software/Coding
- 3.2 Creativity

Leaderboards

LMSYS: Human preference ranking
Tracking AI ("IQ")
Vectara Hallucination Leaderboard
LiveBench: A Challenging, Contamination-Free LLM Benchmark
ENIGMAEVAL (paper, "reasoning")

Assistant/Agentic

GAIA: a benchmark for General AI Assistants
Galileo AI Agent Leaderboard
Smolagents LLM Leaderboard: LLMs powering agents

Methods

AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions (code)
- Leaderboard
- Suggestion to use Borda count
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. Assess reasoning using puzzles of tunable complexity.
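The Borda count mentioned under AidanBench is a general rank-aggregation rule: each ranking of n items awards n-1 points to its first-place item, n-2 to the second, and so on down to 0, and the points are summed across rankings. The following is a minimal sketch of that rule only, not the procedure from the linked suggestion; the function name `borda_count` and the model names are hypothetical, and the sketch assumes every ranking lists items best-first with no ties.

```python
from collections import defaultdict


def borda_count(rankings):
    """Aggregate several best-first rankings into summed Borda scores.

    Illustrative helper (hypothetical name); assumes no ties within a ranking.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            # First place earns n-1 points, last place earns 0.
            scores[item] += n - 1 - position
    # Highest score first; break score ties by name so output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical per-question rankings of three models.
rankings = [
    ["model_a", "model_b", "model_c"],
    ["model_b", "model_a", "model_c"],
    ["model_a", "model_c", "model_b"],
]

for name, score in borda_count(rankings):
    print(name, score)
```

Because only rank positions enter the sum, the aggregate does not depend on the scale of the raw per-question scores, which is one common reason to prefer it over averaging raw scores.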

Assess Specific Attributes

Software/Coding

2025-02: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (code)

Creativity

2024-10: AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text
