Difference between revisions of "AI benchmarks"

From GISAXS
(General)
(Assistant/Agentic)
Line 67:
 
* [https://huggingface.co/spaces/smolagents/smolagents-leaderboard Smolagents LLM Leaderboard]: LLMs powering agents
 
 
* OpenAI [https://openai.com/index/paperbench/ PaperBench: Evaluating AI’s Ability to Replicate AI Research] ([https://cdn.openai.com/papers/22265bac-3191-44e5-b057-7aaacd8e90cd/paperbench.pdf paper], [https://github.com/openai/preparedness/tree/main/project/paperbench code])
 
+ * 2025-06: [https://arxiv.org/abs/2506.22419 The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements]
  
 
==Science==
 
 
See: [[Science_Agents#Science_Benchmarks|Science Benchmarks]]
 

Revision as of 12:41, 30 June 2025

General

Lists of Benchmarks

Analysis of Methods

Methods

Task Length


Assess Specific Attributes

Various

Hallucination

Software/Coding

Math

Visual

Conversation

Creativity

Reasoning

Assistant/Agentic

See: AI Agents: Optimization

Science

See: Science Benchmarks