Difference between revisions of "AI benchmarks"

From GISAXS
 
** [https://aidanbench.com/ Leaderboard]
 
 
** [https://x.com/scaling01/status/1897301054431064391 Suggestion to use] [https://en.wikipedia.org/wiki/Borda_count Borda count]
 
** 2025-04: [https://x.com/scaling01/status/1910499781601874008 Added] Quasar Alpha, Optimus Alpha, Llama-4 Scout, and Llama-4 Maverick
 
* [https://arxiv.org/abs/2502.01100 ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning]. Assesses reasoning using puzzles of tunable complexity.
 
  

Revision as of 09:41, 11 April 2025

General

Methods

Task Length


Assess Specific Attributes

Various

Hallucination

Software/Coding

Visual

Creativity

Reasoning

Assistant/Agentic

Science

See: Science Benchmarks