Difference between revisions of "Increasing AI Intelligence"

From GISAXS
((Optimal) Usage of Reasoning Compute)
(Inference-time Sampling)
 
(43 intermediate revisions by the same user not shown)
Line 7: Line 7:
 
* 2025-02: [https://arxiv.org/abs/2502.21321 LLM Post-Training: A Deep Dive into Reasoning Large Language Models]
 
 
* 2025-03: [https://arxiv.org/abs/2503.24377 Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models]
 
 +
* 2025-04: [https://arxiv.org/abs/2504.09037 A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems]
 +
* 2025-05: [https://lilianweng.github.io/posts/2025-05-01-thinking/ Why We Think]
 
* Links to papers: [https://github.com/hijkzzz/Awesome-LLM-Strawberry Awesome LLM Strawberry (OpenAI o1)]
 
  
Line 34: Line 36:
 
===Reinforcement Learning===
 
 
* 2025-04: DeepSeek: [https://arxiv.org/abs/2504.02495 Inference-Time Scaling for Generalist Reward Modeling]
 
 +
* 2025-04: [https://arxiv.org/abs/2504.13837 Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?]
 +
* 2025-04: [https://arxiv.org/abs/2504.13941 NEMOTRON-CROSSTHINK: Scaling Self-Learning beyond Math Reasoning]
 +
* 2025-04: [https://arxiv.org/abs/2504.16084 TTRL: Test-Time Reinforcement Learning] ([https://github.com/PRIME-RL/TTRL code])
 +
* 2025-04: [https://arxiv.org/abs/2504.20571 Reinforcement Learning for Reasoning in Large Language Models with One Training Example]
 +
* 2025-05: [https://arxiv.org/abs/2505.03335 Absolute Zero: Reinforced Self-play Reasoning with Zero Data]
 +
* 2025-09: [https://www.nature.com/articles/s41586-025-09422-z DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning]
 +
* 2025-09: [https://github.com/NVlabs/RLP/blob/main/pdf/RLP_Reinforcement_as_a_Pretraining_Objective.pdf RLP: Reinforcement as a Pretraining Objective] (Nvidia)
 +
* 2025-10: [https://arxiv.org/abs/2510.13786 The Art of Scaling Reinforcement Learning Compute for LLMs]
 +
 +
====Optimize Confidence/Entropy====
 +
* Cf. 2025-02: [https://arxiv.org/abs/2502.06233 Confidence Improves Self-Consistency in LLMs]
 +
* [https://x.com/xuandongzhao/status/1927270931874910259 2025-05]: [https://arxiv.org/abs/2505.19590 Learning to Reason without External Rewards] ([https://github.com/sunblaze-ucb/Intuitor code]): Reinforcement Learning from Internal Feedback, RLIF
 +
* [https://x.com/mihirp98/status/1927767453490172277 2025-05]: [https://rent-rl.github.io/ Maximizing Confidence Alone Improves Reasoning] ([https://github.com/satrams/rent-rl code]); a.k.a. RENT: Reinforcement Learning via Entropy Minimization.
 +
* 2025-05: [https://arxiv.org/abs/2505.22617 The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models]
 +
* 2025-06: [https://arxiv.org/abs/2506.01347 The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning]
 +
 +
====Exceed humans, using human-level data====
 +
* 2024-06: [https://arxiv.org/abs/2406.11741v1 Transcendence: Generative Models Can Outperform The Experts That Train Them]
 +
* 2025-03: [https://tecunningham.github.io/posts/2023-09-05-model-of-ai-imitation.html An AI Which Imitates Humans Can Beat Humans]
 +
* 2025-08: [https://arxiv.org/abs/2508.17669 A Taxonomy of Transcendence]
 +
 +
====Self-play====
 +
* 2025-09: [https://arxiv.org/abs/2509.07414 Language Self-Play For Data-Free Training]
  
 
===Training Data (Data Refinement, Synthetic Data)===
 
Line 51: Line 76:
 
* 2023-10: [https://arxiv.org/abs/2310.16656 A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation]
 
 
* 2024-07: [https://arxiv.org/abs/2407.06723 Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions]
 
 +
 +
===Pre-generate material===
 +
* 2020-03: [https://research.google/blog/introducing-dreamer-scalable-reinforcement-learning-using-world-models/ Introducing Dreamer: Scalable Reinforcement Learning Using World Models]
 +
* 2025-03: [https://arxiv.org/abs/2503.18866 Reasoning to Learn from Latent Thoughts]
 +
* 2025-04: [https://arxiv.org/abs/2504.13171 Sleep-time Compute: Beyond Inference Scaling at Test-time]
  
 
===Generate consistent plans/thoughts===
 
Line 104: Line 134:
 
* 2025-02: [https://arxiv.org/pdf/2502.20339 Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners]
 
 
* 2025-03: [https://arxiv.org/abs/2503.10061 Compute Optimal Scaling of Skills: Knowledge vs Reasoning]
 
 +
* 2025-10: [https://arxiv.org/abs/2510.13786 The Art of Scaling Reinforcement Learning Compute for LLMs]
  
 
=Inference Time Compute=
 
Line 126: Line 157:
 
* 2024-11: [https://openreview.net/forum?id=FBkpCyujtS Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs]
 
 
* 2024-12: [https://arxiv.org/abs/2412.06822 Guidance is All You Need: Temperature-Guided Reasoning in Large Language Models]
 
 +
* 2025-08: [https://arxiv.org/abs/2508.15260 Deep Think with Confidence] ([https://jiaweizzhao.github.io/deepconf/ project])
 +
* 2025-10: [https://arxiv.org/abs/2510.14901 Reasoning with Sampling: Your Base Model is Smarter Than You Think]
  
===Inference-time Gradient===
+
===Inference-time Gradient/Updating/RL/etc.===
 
* 2024-11: [https://ekinakyurek.github.io/papers/ttt.pdf The Surprising Effectiveness of Test-Time Training for Abstract Reasoning] ([https://github.com/ekinakyurek/marc code])
 
 +
* 2025-04: [https://arxiv.org/abs/2504.16084 TTRL: Test-Time Reinforcement Learning] ([https://github.com/PRIME-RL/TTRL code])
  
 
===Self-prompting===
 
Line 167: Line 201:
 
* 2024-11: [https://arxiv.org/abs/2411.10440 LLaVA-o1: Let Vision Language Models Reason Step-by-Step] ([https://github.com/PKU-YuanGroup/LLaVA-o1 code])
 
 
* 2025-04: [https://arxiv.org/abs/2504.07081 Self-Steering Language Models]: Planner generates program, Followers accomplish sub-tasks
 
 +
* 2025-09: [https://arxiv.org/abs/2508.21184 BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design]
 +
* 2025-09: [https://arxiv.org/abs/2509.03918 MTQA: Matrix of Thought for Enhanced Reasoning in Complex Question Answering]
  
 
===Iteration (e.g. neural-like layered blocks)===
 
Line 184: Line 220:
 
* 2024-10: [https://arxiv.org/abs/2410.01707 Interpretable Contrastive Monte Carlo Tree Search Reasoning]
 
 
* 2024-12: [https://arxiv.org/abs/2412.18319 Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search]
 
 +
 +
===Pathfinding===
 +
* 2024-08: [https://arxiv.org/abs/2408.08152 DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search]
 +
* 2025-06: [https://arxiv.org/abs/2506.01939 Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning]
 +
* 2025-09: [https://arxiv.org/abs/2509.09284 Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning]
 +
* 2025-09: [https://arxiv.org/abs/2509.06160v1 Reverse-Engineered Reasoning for Open-Ended Generation]
  
 
===Other Search===
 
Line 200: Line 242:
 
* 2025-03: [https://arxiv.org/abs/2503.19877 Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators]
 
 
* 2025-03: [https://arxiv.org/abs/2503.23513 RARE: Retrieval-Augmented Reasoning Modeling]
 
 +
* 2025-07: [https://arxiv.org/abs/2501.18858 BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning]
 +
* 2025-09: [https://arxiv.org/abs/2509.13351 Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning]
 +
 +
===Inner Monologue===
 +
* 2022-07: [https://arxiv.org/abs/2207.05608 Inner Monologue: Embodied Reasoning through Planning with Language Models]
 +
* 2025-06: [https://nicolehsing.com/mirror-paper.pdf MIRROR: Cognitive Inner Monologue Between Conversational Turns for Persistent Reflection and Reasoning in Conversational LLMs] ([https://github.com/nicolehsing/MIRROR code])
  
 
===Model Merging===
 
Line 222: Line 270:
 
* 2025-03: [https://arxiv.org/abs/2503.10061 Compute Optimal Scaling of Skills: Knowledge vs Reasoning]
 
 
* 2025-03: [https://arxiv.org/abs/2504.00294 Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead]
 
* 2025-05: [https://arxiv.org/abs/2504.03635 Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning]: Model size can improve things, but can also lead to overparametrization (memorization instead of reasoning)
+
* 2025-04: [https://arxiv.org/abs/2504.03635 Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning]: Model size can improve things, but can also lead to overparametrization (memorization instead of reasoning)
 +
* 2025-04: [https://arxiv.org/abs/2504.14047 Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods]: Reasoning models outperform non-reasoning models that use inference-time compute; majority voting always helps, and is hard to beat
  
 
====(Optimal) Usage of Reasoning Compute====
 
Line 234: Line 283:
 
* 2025-04: [https://arxiv.org/abs/2504.05185 Concise Reasoning via Reinforcement Learning]
 
 
* 2025-04: [https://arxiv.org/abs/2504.05419 Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification]
 
 +
* 2025-04: [https://arxiv.org/abs/2504.15895 Dynamic Early Exit in Reasoning Models]
 +
* 2025-07: [https://arxiv.org/abs/2507.14417 Inverse Scaling in Test-Time Compute]
 +
* 2025-08: [https://arxiv.org/abs/2508.17627 Stop Spinning Wheels: Mitigating LLM Overthinking via Mining Patterns for Early Reasoning Exit]
  
 
====Usage of Training Data====
 
Line 248: Line 300:
 
* 2024-09-16: [https://www.oneusefulthing.org/p/scaling-the-state-of-play-in-ai Scaling: The State of Play in AI]
 
 
* 2025-02-03: [https://arxiv.org/abs/2502.06807 Competitive Programming with Large Reasoning Models]
 
 +
* 2025-10: [https://arxiv.org/abs/2510.14232 Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models]
  
 
==Pragmatics==
 
Line 253: Line 306:
 
* [https://github.com/codelion/optillm optillm]: Inference proxy which implements state-of-the-art techniques to improve accuracy and performance of LLMs (improve reasoning over coding, logical and mathematical queries)
 
  
=Interact with Environment=
+
=Interact with Environment (Experiential Learning)=
 
* 2025-01: [https://arxiv.org/abs/2501.10893 Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments]
 
 +
* 2025-09: [https://arxiv.org/abs/2509.24527 Training Agents Inside of Scalable World Models]
 +
* 2025-10: [https://arxiv.org/abs/2510.08558 Agent Learning via Early Experience]
  
 
=Memory=
 
Line 275: Line 330:
 
* 2025-01: [https://arxiv.org/abs/2501.13946 Hallucination Mitigation using Agentic AI Natural Language-Based Frameworks]
 
 
* 2025-02: [https://arxiv.org/abs/2502.16111 PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving]
 
 +
* 2025-09: [https://arxiv.org/abs/2509.15172 Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment]
 +
 +
==Competition==
 +
* 2025-06: [https://arxiv.org/abs/2506.04721 SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat]
  
 
=ML-like Optimization of LLM Setup=
 
Line 282: Line 341:
 
* 2024-06: [https://arxiv.org/abs/2406.18532 Symbolic Learning Enables Self-Evolving Agents] (optimize LLM frameworks)
 
 
* 2025-03: [https://www.nature.com/articles/s41586-025-08661-4 Optimizing generative AI by backpropagating language model feedback]
 
 +
 +
=Self-modification=
 +
* 2025-06: [https://arxiv.org/abs/2506.10943 Self-Adapting Language Models]
  
 
=Limitations/Requirements=
 

Latest revision as of 12:50, 20 October 2025

Contents

Reviews

World Model

Prompt Engineering

Thought Templates

Automatic Prompt Optimization

Fine Tuning

Proactive Search

Compute expended after training, but before inference.

Reinforcement Learning

Optimize Confidence/Entropy

Exceed humans, using human-level data

Self-play

Training Data (Data Refinement, Synthetic Data)

Re-captioning

Pre-generate material

Generate consistent plans/thoughts

  • 2024-08: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers (code)
    • (Microsoft) rStar is a self-play mutual reasoning approach. A small model expands an MCTS search using a defined set of reasoning heuristics. Trajectories that are mutually consistent can then be emphasized.
  • 2024-09: Self-Harmonized Chain of Thought
    • Produce refined chain-of-thought style solutions/prompts for diverse problems. Given a large set of problems/questions, first aggregate them semantically, then apply zero-shot chain-of-thought to each problem. Then cross-pollinate between proposed solutions to similar problems, looking for refined and generalized solutions.
  • 2024-11: LLMs Do Not Think Step-by-step In Implicit Reasoning
    • They argue that models trained to reproduce CoT outputs do not, internally, perform stepwise reasoning (with intermediate representations); this suggests that explicit CoT could be superior to implicit CoT.
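
The idea of emphasizing mutually consistent trajectories (the rStar item above) can be illustrated with a toy filter. This is only a sketch under stated assumptions, not the rStar implementation: there is no MCTS here, a trajectory is assumed to be a list of step strings whose last element is the final answer, and `mutually_consistent`, `pick_answer`, and the `complete` callback (a stand-in for a second model that finishes a partial trajectory) are hypothetical names introduced for illustration.

```python
from collections import Counter
from typing import Callable, List, Optional


def mutually_consistent(
    trajectories: List[List[str]],
    complete: Callable[[List[str]], str],
    split: float = 0.5,
) -> List[List[str]]:
    """Keep trajectories whose final answer a second model reproduces.

    `complete` is a hypothetical callback: given a prefix of reasoning
    steps, it returns the final answer the second model arrives at.
    """
    kept = []
    for steps in trajectories:
        if len(steps) < 2:
            continue  # need at least one step plus a final answer
        # Hide the tail of the trajectory (always including the answer).
        cut = min(max(1, int(len(steps) * split)), len(steps) - 1)
        if complete(steps[:cut]) == steps[-1]:
            kept.append(steps)
    return kept


def pick_answer(kept: List[List[str]]) -> Optional[str]:
    """Majority vote over the final answers of the surviving trajectories."""
    counts = Counter(steps[-1] for steps in kept)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    # Toy data: two trajectories agree on "10", one drifts to "12".
    candidates = [
        ["2+3=5", "5*2=10", "10"],
        ["2+3=6", "6*2=12", "12"],
        ["2+3=5", "5*2=10", "10"],
    ]

    def toy_second_model(prefix: List[str]) -> str:
        # Stand-in for a real model call; assumption for illustration only.
        return "10" if "2+3=5" in prefix else "11"

    survivors = mutually_consistent(candidates, toy_second_model)
    print(pick_answer(survivors))
```

In a real setup the callback would wrap a call to a second language model, and agreement would probably need a looser check than exact string equality.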

Sampling

Automated prompt generation

Distill inference-time-compute into model

CoT reasoning model

See also: AI tools > LLM > Open-weights LLM > Reasoning

Scaling

Inference Time Compute

Methods

Review

In context learning (ICL), search, and other inference-time methods

Inference-time Sampling

Inference-time Gradient/Updating/RL/etc.

Self-prompting

Retrieval or Memory

In-context thought

Naive multi-LLM (verification, self-critique, majority voting, best-of-N, etc.)

Multi-LLM (multiple comparisons, branching, etc.)

Iteration (e.g. neural-like layered blocks)

Iterative reasoning via graphs

Monte Carlo Tree Search (MCTS)

Pathfinding

Other Search

Chain-of-Thought Reasoning

Inner Monologue

Model Merging

Meta-methods

Analysis

Scaling

(Optimal) Usage of Reasoning Compute

Usage of Training Data

  • 2025-02: LIMO: Less is More for Reasoning (surprisingly easy generalization, from very few reasoning training examples; model can go from knowledge-retrieval to diverse reasoning using curated examples)

Theory

Expending compute works

[Figure: Compute.png]

Pragmatics

Code for Inference-time Compute

  • optillm: Inference proxy which implements state-of-the-art techniques to improve accuracy and performance of LLMs (improve reasoning over coding, logical and mathematical queries)

Interact with Environment (Experiential Learning)

Memory

Tool Use

Integrated

Multi-agent Effort (and Emergent Intelligence)

Competition

ML-like Optimization of LLM Setup

Self-modification

Limitations/Requirements

Creativity

See: AI creativity

See Also