Difference between revisions of "AI benchmarks"

From GISAXS
(Science)
Line 45:
  
 
==Science==
* 2025-07: [https://allenai.org/blog/sciarena SciArena: A New Platform for Evaluating Foundation Models in Scientific Literature Tasks] ([https://sciarena.allen.ai/ vote], [https://huggingface.co/datasets/yale-nlp/SciArena data], [https://github.com/yale-nlp/SciArena code])
+ See [[Science_Agents|Science Agents]] > [[Science_Agents#Science_Benchmarks|Science Benchmarks]]
* 2026-04: [https://arxiv.org/abs/2604.14140 LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning] ([https://longcot.ai/ site], [https://github.com/LongHorizonReasoning/longcot code])
 
  
 
==Visual==

Latest revision as of 12:44, 16 April 2026

General

Lists of Benchmarks

Analysis of Methods

Methods

Task Length

Assess Specific Attributes

Various

Hallucination

Software/Coding

Math

Science

See Science Agents > Science Benchmarks

Visual

Conversation

Creativity

Reasoning

Assistant/Agentic

See: AI Agents: Optimization

Science

See: Science Benchmarks