Difference between revisions of "AI benchmarks"

From GISAXS
(Methods)
(Reasoning)
Line 42:
 
==Reasoning==
 
* [https://scale.com/leaderboard/enigma_eval ENIGMAEVAL]: "reasoning" leaderboard ([https://static.scale.com/uploads/654197dc94d34f66c0f5184e/EnigmaEval%20v4.pdf paper])
 
+ * [https://bethgelab.github.io/sober-reasoning/ Sober Reasoning Leaderboard]
+ ** 2025-04: [https://arxiv.org/abs/2504.07086 A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility]
  
 
==Assistant/Agentic==
 

Revision as of 12:01, 13 April 2025

General

Methods

Task Length

Assess Specific Attributes

Various

Hallucination

Software/Coding

Visual

Creativity

Reasoning

Assistant/Agentic

Science

See: Science Benchmarks