Difference between revisions of "AI benchmarks"

From GISAXS
 
==Science==
 
* 2025-07: [https://allenai.org/blog/sciarena SciArena: A New Platform for Evaluating Foundation Models in Scientific Literature Tasks] ([https://sciarena.allen.ai/ vote], [https://huggingface.co/datasets/yale-nlp/SciArena data], [https://github.com/yale-nlp/SciArena code])
* 2026-04: [https://arxiv.org/abs/2604.14140 LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning] ([https://longcot.ai/ site], [https://github.com/LongHorizonReasoning/longcot code])
 
==Visual==
 

Revision as of 12:42, 16 April 2026

=General=

==Lists of Benchmarks==

==Analysis of Methods==

=Methods=

==Task Length==

[[File:GmZHL8xWQAAtFlF.jpeg]]

=Assess Specific Attributes=

==Various==

==Hallucination==

==Software/Coding==

==Math==

==Science==

==Visual==

==Conversation==

==Creativity==

==Reasoning==

=Assistant/Agentic=

See: AI Agents: Optimization

=Science=

See: Science Benchmarks