Difference between revisions of "AI benchmarks"

From GISAXS
(Assess Specific Attributes)
(Methods)
Line 4:
 
** [https://x.com/scaling01/status/1897301054431064391 Suggestion to use] [https://en.wikipedia.org/wiki/Borda_count Borda count]
 
* [https://arxiv.org/abs/2502.01100 ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning]. Assess reasoning using puzzles of tunable complexity.
 
+
+ ==Task Length==
+ * 2025-03: [Measuring AI Ability to Complete Long Tasks Measuring AI Ability to Complete Long Tasks]
+ [[Image:GmZHL8xWQAAtFlF.jpeg|450px]]
  
 
=Assess Specific Attributes=
 

Revision as of 17:59, 19 March 2025