Difference between revisions of "AI benchmarks"

From GISAXS
(Assistant/Agentic)
(Reasoning)
(One intermediate revision by the same user not shown)
Line 9: Line 9:
 
** [https://aidanbench.com/ Leaderboard]
 
 
** [https://x.com/scaling01/status/1897301054431064391 Suggestion to use] [https://en.wikipedia.org/wiki/Borda_count Borda count]
 
+ ** 2025-04: [https://x.com/scaling01/status/1910499781601874008 add] Quasar Alpha, Optimus Alpha, Llama-4 Scout and Llama-4 Maverick
 
* [https://arxiv.org/abs/2502.01100 ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning]. Assess reasoning using puzzles of tunable complexity.
 
  
Line 41: Line 42:
 
==Reasoning==
 
 
* [https://scale.com/leaderboard/enigma_eval ENIGMAEVAL]: "reasoning" leaderboard ([https://static.scale.com/uploads/654197dc94d34f66c0f5184e/EnigmaEval%20v4.pdf paper])
 
+ * [https://bethgelab.github.io/sober-reasoning/ Sober Reasoning Leaderboard]
+ ** 2025-04: [https://arxiv.org/abs/2504.07086 A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility]
  
 
==Assistant/Agentic==
 

Revision as of 12:01, 13 April 2025

General

Methods

Task Length


Assess Specific Attributes

Various

Hallucination

Software/Coding

Visual

Creativity

Reasoning

Assistant/Agentic

Science

See: Science Benchmarks