AI benchmarks

From GISAXS

Revision as of 20:07, 10 March 2025

Contents

1 Leaderboards
2 Methods
3 Assess Specific Attributes
- 3.1 Software/Coding
- 3.2 Creativity

Leaderboards

LMSYS: Human preference ranking
Tracking AI
Vectara Hallucination Leaderboard
LiveBench: A Challenging, Contamination-Free LLM Benchmark
ENIGMAEVAL (paper)
Galileo AI Agent Leaderboard
Smolagents LLM Leaderboard: LLMs powering agents

Methods

AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions (code)
- Leaderboard
- Suggestion to use Borda count
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. Assesses reasoning using logic puzzles of tunable complexity.
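
The Borda count suggested for AidanBench above is a rank-aggregation rule: in each ranking of n candidates, the top entry earns n-1 points, the next n-2, and so on down to 0, and the points are summed across rankings. The sketch below is only an illustration of that rule. It assumes each benchmark question yields one best-first ranking of models; the model names and the `borda_count` helper are hypothetical and are not taken from the AidanBench code.

```python
from collections import defaultdict


def borda_count(rankings):
    """Aggregate best-first rankings into Borda scores.

    Each ranking is a list of candidate names, best first. A candidate in
    position p (0-based) of a ranking of length n earns n - 1 - p points.
    Returns (candidate, score) pairs sorted by descending score, with ties
    broken alphabetically by name.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical per-question rankings of three placeholder models.
example_rankings = [
    ["model_a", "model_b", "model_c"],
    ["model_b", "model_a", "model_c"],
    ["model_a", "model_c", "model_b"],
]

for name, score in borda_count(example_rankings):
    print(name, score)
```

Because only rank positions are summed, a model's aggregate score does not depend on the scale of the raw per-question scores, which is one reason a Borda count is sometimes preferred over averaging raw scores.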

Assess Specific Attributes

Software/Coding

2025-02: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (code)

Creativity

2024-10: AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text