Difference between revisions of "AI benchmarks"
KevinYager
KevinYager (→Methods)
Line 6:
=Methods=
* [https://openreview.net/pdf?id=fz969ahcvJ AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions] ([https://github.com/aidanmclaughlin/AidanBench code])
+ * [https://arxiv.org/abs/2502.01100 ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning]. Assess reasoning using puzzles of tunable complexity.
Latest revision as of 13:33, 4 February 2025
Leaderboards
- LMSYS: Human preference ranking
- Tracking AI
- Vectara Hallucination Leaderboard
Methods
- AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions (code)
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. Assess reasoning using puzzles of tunable complexity.
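
As a rough illustration of what "tunable complexity" can mean for the ZebraLogic entry above, the sketch below counts candidate solutions for an Einstein-style logic-grid puzzle. It assumes the puzzle has N houses and M attributes, with each attribute's values assigned to houses as a permutation, so the count is (N!)^M. The function name `search_space_size` and the example sizes are hypothetical and not taken from the benchmark's code.

```python
from math import factorial


def search_space_size(num_houses: int, num_attributes: int) -> int:
    """Count candidate assignments for a logic-grid puzzle.

    Assumption: each attribute places its values on the houses as a
    permutation, giving (num_houses!) ** num_attributes candidates.
    """
    if num_houses < 1 or num_attributes < 1:
        raise ValueError("both sizes must be positive")
    return factorial(num_houses) ** num_attributes


# Illustrative sizes only: growing either dimension makes the puzzle harder.
for n, m in [(2, 2), (3, 3), (5, 5)]:
    print(f"{n} houses x {m} attributes: {search_space_size(n, m)} candidates")
```

Under this assumption, raising either the number of houses or the number of attributes grows the candidate space very quickly, which is one way a puzzle generator could dial difficulty up or down.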