Increasing AI Intelligence

=Reviews=
 
* 2025-02: [https://arxiv.org/abs/2502.03671 Advancing Reasoning in Large Language Models: Promising Methods and Approaches]
* 2025-02: [https://arxiv.org/abs/2502.09100 Logical Reasoning in Large Language Models: A Survey]
* 2025-02: [https://arxiv.org/abs/2502.21321 LLM Post-Training: A Deep Dive into Reasoning Large Language Models]
* 2025-03: [https://arxiv.org/abs/2503.24377 Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models]
* 2025-04: [https://arxiv.org/abs/2504.09037 A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems]
* Links to papers: [https://github.com/hijkzzz/Awesome-LLM-Strawberry Awesome LLM Strawberry (OpenAI o1)]

===World Model===
* 2025-03: [https://arxiv.org/abs/2503.04641 Simulating the Real World: A Unified Survey of Multimodal Generative Models]
 
=Prompt Engineering=

==Automatic Prompt Optimization==
* 2025-02: [https://arxiv.org/abs/2502.16923 A Systematic Survey of Automatic Prompt Optimization Techniques]
* 2025-02: [https://arxiv.org/abs/2502.18746 Automatic Prompt Optimization via Heuristic Search: A Survey]
 
 
=Fine Tuning=

=Proactive Search=
Compute expended after training, but before inference.

===Reinforcement Learning===
* 2025-04: DeepSeek: [https://arxiv.org/abs/2504.02495 Inference-Time Scaling for Generalist Reward Modeling]
* 2025-04: [https://arxiv.org/abs/2504.13837 Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?]
* 2025-04: [https://arxiv.org/abs/2504.13941 NEMOTRON-CROSSTHINK: Scaling Self-Learning beyond Math Reasoning]

====Exceed humans, using human-level data====
* 2024-06: [https://arxiv.org/abs/2406.11741v1 Transcendence: Generative Models Can Outperform The Experts That Train Them] (a toy calculation of this effect appears after this list)
* 2025-03: [https://tecunningham.github.io/posts/2023-09-05-model-of-ai-imitation.html An AI Which Imitates Humans Can Beat Humans]
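The intuition can be made quantitative with the classic jury-style calculation (a toy illustration, not the analysis from the papers above): if each sampled answer is independently correct with probability p > 0.5, then the majority over n answers is correct with a probability that approaches 1 as n grows, so an aggregate can exceed the individuals it is built from.

<syntaxhighlight lang="python">
# Condorcet-style toy calculation: majority voting over independent answers,
# each correct with probability p > 0.5, beats any individual answer.
from math import comb

def majority_correct(p: float, n: int) -> float:
    """P(majority of n independent votes is correct), for odd n."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 5, 25, 101):
    print(n, round(majority_correct(0.6, n), 3))
# approx: 0.6, 0.68, 0.85, 0.98 -- accuracy climbs toward 1 as n grows
</syntaxhighlight>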
  
 
===Training Data (Data Refinement, Synthetic Data)===
* 2025-02: [https://arxiv.org/abs/2502.01718 ACECODER: Acing Coder RL via Automated Test-Case Synthesis]
* 2025-02: [https://arxiv.org/abs/2502.15588 Improving the Scaling Laws of Synthetic Data with Deliberate Practice]
* 2025-03: [https://arxiv.org/abs/2503.19551 Scaling Laws of Synthetic Data for Language Models]
* 2025-03: [https://arxiv.org/abs/2503.18866 Reasoning to Learn from Latent Thoughts]: infer the (latent) thoughts that would have led to the training documents, so that one can pretrain on text plus thoughts (see the sketch after this list)
* Updating list of links: [https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data Synthetic Data of LLMs, by LLMs, for LLMs]
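A minimal sketch of the latent-thoughts idea flagged above: use a model to infer the reasoning that could have produced each document, then pretrain on the reasoning followed by the text. The names llm() and pretrain() are hypothetical stand-ins for a completion API and a training loop; the paper's actual procedure is more elaborate (e.g. iteratively improving the inferred thoughts).

<syntaxhighlight lang="python">
# Sketch of pretraining on text plus inferred latent thoughts.
# llm() and pretrain() are hypothetical stubs, not a specific API.

def llm(prompt: str) -> str:
    raise NotImplementedError  # any completion API

def pretrain(corpus: list[str]) -> None:
    raise NotImplementedError  # any LM training loop

def with_latent_thoughts(doc: str) -> str:
    thought = llm(
        "Infer the step-by-step reasoning an author might have used "
        f"to produce the following text:\n\n{doc}"
    )
    # Prepend the inferred reasoning, so the model learns to reach the
    # observed text via intermediate thoughts rather than directly.
    return f"<thought>\n{thought}\n</thought>\n{doc}"

# usage: pretrain([with_latent_thoughts(d) for d in documents])
</syntaxhighlight>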

====Re-captioning====
* 2023-10: [https://arxiv.org/abs/2310.16656 A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation]
* 2024-07: [https://arxiv.org/abs/2407.06723 Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions]

===Pre-generate material===
* 2020-03: [https://research.google/blog/introducing-dreamer-scalable-reinforcement-learning-using-world-models/ Introducing Dreamer: Scalable Reinforcement Learning Using World Models]
* 2025-03: [https://arxiv.org/abs/2503.18866 Reasoning to Learn from Latent Thoughts]
* 2025-04: [https://arxiv.org/abs/2504.13171 Sleep-time Compute: Beyond Inference Scaling at Test-time]
  
 
===Generate consistent plans/thoughts===
* 2024-08: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers (Microsoft): rStar is a self-play mutual-reasoning approach, in which a small model augments MCTS using defined reasoning heuristics, and mutually-consistent trajectories are emphasized
* 2024-09: Self-Harmonized Chain of Thought: produces refined chain-of-thought-style solutions/prompts for diverse problems; a large set of problems/questions is first aggregated semantically, zero-shot chain-of-thought is applied to each problem, and proposed solutions to similar problems are then cross-pollinated in search of refined, generalized solutions
* 2024-11: LLMs Do Not Think Step-by-step In Implicit Reasoning: argues that models trained to reproduce CoT outputs do not internally perform stepwise reasoning (with intermediate representations), suggesting that explicit CoT could be superior to implicit CoT

===CoT reasoning model===
See also: AI tools > LLM > Open-weights LLM > Reasoning
* 2025-02: [https://arxiv.org/abs/2502.03373 Demystifying Long Chain-of-Thought Reasoning in LLMs]
* 2025-02: [https://arxiv.org/abs/2502.05171 Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach] ([https://huggingface.co/tomg-group-umd/huginn-0125 Huginn-0125])
* 2025-02: [https://arxiv.org/pdf/2502.20339 Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners]
  
 
===Scaling===
* 2024-08: [https://arxiv.org/abs/2408.16737 Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling] (Google DeepMind)
* 2024-11: [https://arxiv.org/abs/2411.04434 Scaling Laws for Pre-training Agents and World Models]
* 2025-02: [https://arxiv.org/pdf/2502.20339 Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners]
* 2025-03: [https://arxiv.org/abs/2503.10061 Compute Optimal Scaling of Skills: Knowledge vs Reasoning]
  
 
=Inference Time Compute=

===In-context thought===
* 2023-01/2024-10: [https://arxiv.org/abs/2301.00234 A Survey on In-context Learning]
* 2025-01: [https://arxiv.org/abs/2501.04682 Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought]
* 2025-03: [https://optimal-test-time.vercel.app/papers/accuracy-efficiency-tradeoffs Interruption is All You Need: Improving Reasoning Model Refusal Rates through measuring Parallel Reasoning Diversity] ([https://x.com/dav1d_bai/status/1904057766593138841 announcement]): an approach to reducing hallucinations by running parallel reasoning traces and measuring their diversity
  
===Naive multi-LLM (verification, self-critique, majority voting, best-of-N, etc.)===
* 2023-06: [https://arxiv.org/abs/2306.02561 LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion] ([https://github.com/yuchenlin/LLM-Blender?tab=readme-ov-file code])
* 2023-12: [https://aclanthology.org/2023.findings-emnlp.203/ Dynamic Voting for Efficient Reasoning in Large Language Models]
* 2024-11: [https://arxiv.org/abs/2411.00492 Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models]
* 2024-12: [https://github.com/irthomasthomas/llm-consortium llm-consortium]: multiple LLMs collaboratively solve problems through structured dialogue, evaluation, and arbitration
* 2025-02: [https://arxiv.org/abs/2502.04506 When One LLM Drools, Multi-LLM Collaboration Rules]
* 2025-03: [https://arxiv.org/abs/2502.01839 Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification]
* 2025-03: [https://arxiv.org/abs/2503.17363 Dancing with Critiques: Enhancing LLM Reasoning with Stepwise Natural Language Self-Critique]
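For reference, the two simplest strategies in this family can be sketched in a few lines; sample() and score() below are hypothetical stand-ins for a model-sampling call and a verifier or reward model, and real systems add answer normalization and tie-breaking on top.

<syntaxhighlight lang="python">
# Minimal sketches of majority voting (self-consistency) and best-of-N.
# sample() and score() are hypothetical stubs, not a specific API.
from collections import Counter

def sample(prompt: str) -> str:
    raise NotImplementedError  # one sampled answer from the model

def score(prompt: str, answer: str) -> float:
    raise NotImplementedError  # verifier / reward model

def majority_vote(prompt: str, n: int = 16) -> str:
    answers = [sample(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]  # most frequent answer wins

def best_of_n(prompt: str, n: int = 16) -> str:
    answers = [sample(prompt) for _ in range(n)]
    return max(answers, key=lambda a: score(prompt, a))  # keep highest-scored
</syntaxhighlight>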
  
 
===Multi-LLM (multiple comparisons, branching, etc.)===
* 2024-11: [https://arxiv.org/abs/2411.02830 Mixtures of In-Context Learners]: multiple "experts", each with a different set of in-context examples; combine outputs at the level of next-token prediction (see the sketch after this list)
* 2024-11: [https://arxiv.org/abs/2411.10440 LLaVA-o1: Let Vision Language Models Reason Step-by-Step] ([https://github.com/PKU-YuanGroup/LLaVA-o1 code])
* 2025-04: [https://arxiv.org/abs/2504.07081 Self-Steering Language Models]: a Planner generates a program, and Followers accomplish the sub-tasks
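The next-token-level combination noted above can be sketched as follows: the same model is conditioned on k different demonstration sets, and the k next-token distributions are mixed with uniform or learned weights. Here next_token_probs() is a hypothetical stand-in for a model call, not the paper's actual interface.

<syntaxhighlight lang="python">
# Sketch of combining "experts" at the next-token level: each expert is the
# same model conditioned on a different set of in-context examples.
import numpy as np

def next_token_probs(context: str) -> np.ndarray:
    raise NotImplementedError  # hypothetical: vocab-sized distribution

def mixture_next_token(demo_sets: list[str], query: str,
                       weights: list[float] | None = None) -> int:
    k = len(demo_sets)
    w = np.asarray(weights if weights is not None else [1.0 / k] * k)
    dists = np.stack([next_token_probs(demos + query) for demos in demo_sets])
    mixed = w @ dists             # convex combination of k distributions
    return int(np.argmax(mixed))  # greedy next token under the mixture
</syntaxhighlight>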
  
 
===Iteration (e.g. neural-like layered blocks)===

===Chain-of-Thought Reasoning===
* 2025-02: [https://arxiv.org/abs/2502.06807 Competitive Programming with Large Reasoning Models]
* 2025-02: [https://arxiv.org/abs/2502.18600 Chain of Draft: Thinking Faster by Writing Less]
* 2025-03: [https://arxiv.org/abs/2503.17352 OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement] ([https://github.com/yihedeng9/OpenVLThinker code])
* 2025-03: [https://arxiv.org/abs/2503.19877 Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators]
* 2025-03: [https://arxiv.org/abs/2503.23513 RARE: Retrieval-Augmented Reasoning Modeling]

===Model Merging===
* 2025-01: [https://arxiv.org/abs/2501.12599 Kimi k1.5: Scaling Reinforcement Learning with LLMs]
* 2025-03: [https://arxiv.org/abs/2503.20641 Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging] ([https://github.com/hahahawu/Long-to-Short-via-Model-Merging code])
* 2025-03: [https://www.nature.com/articles/s41524-025-01564-y Fine-tuning large language models for domain adaptation: exploration of training strategies, scaling, model merging and synergistic capabilities]
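The base operation underlying this family is elementwise interpolation of checkpoint weights; the papers above study more refined schemes (task-arithmetic variants, which layers to merge, when merging helps), but a minimal sketch under that assumption looks like this.

<syntaxhighlight lang="python">
# Simplest form of model merging: elementwise linear interpolation between
# two fine-tuned checkpoints of the same base architecture. This is only
# the base operation, not any specific paper's full method.
import torch

def merge_state_dicts(sd_a: dict[str, torch.Tensor],
                      sd_b: dict[str, torch.Tensor],
                      alpha: float = 0.5) -> dict[str, torch.Tensor]:
    assert sd_a.keys() == sd_b.keys(), "checkpoints must share an architecture"
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# e.g. blend a long-reasoning model with a concise model:
# merged = merge_state_dicts(long_model.state_dict(), short_model.state_dict(), 0.3)
# long_model.load_state_dict(merged)
</syntaxhighlight>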
  
 
===Meta-methods===

===Analysis===

====Scaling====
* 2024-11: [https://arxiv.org/abs/2411.17501 Inference Scaling FLaws: The Limits of LLM Resampling with Imperfect Verifiers] (a back-of-envelope version of this limit appears after this list)
* 2025-02: [https://www.arxiv.org/abs/2502.08606 Distillation Scaling Laws]
* 2025-03: [https://arxiv.org/abs/2503.10061 Compute Optimal Scaling of Skills: Knowledge vs Reasoning]
* 2025-03: [https://arxiv.org/abs/2504.00294 Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead]
* 2025-04: [https://arxiv.org/abs/2504.03635 Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning]: model size can improve reasoning, but can also lead to overparametrization (memorization instead of reasoning)
* 2025-04: [https://arxiv.org/abs/2504.14047 Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods]: reasoning models outperform inference-time scaling of non-reasoning models; majority voting always helps, and is hard to beat
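The resampling limit flagged in the first bullet above admits a back-of-envelope version: if each sample is correct with probability p, and the verifier accepts correct answers at rate tpr but also accepts incorrect ones at rate fpr, then "sample until accepted" converges to a fixed precision ceiling no matter how many samples are drawn. This is a simplification of the paper's analysis, not its exact model.

<syntaxhighlight lang="python">
# Toy model of resampling with an imperfect verifier: the probability that
# an accepted answer is correct is capped by the verifier's error rates,
# independent of how many samples you are willing to draw.
def precision_ceiling(p: float, tpr: float, fpr: float) -> float:
    """P(correct | accepted), for per-sample accuracy p and verifier rates."""
    return (p * tpr) / (p * tpr + (1 - p) * fpr)

print(precision_ceiling(p=0.3, tpr=0.95, fpr=0.10))  # ~0.80, regardless of N
</syntaxhighlight>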
  
====(Optimal) Usage of Reasoning Compute====
* 2024-10: [https://arxiv.org/abs/2410.21333 Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse]
* 2024-12: [https://arxiv.org/abs/2412.21187 Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs]
* 2025-01: [https://arxiv.org/abs/2501.18585 Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs]
* 2025-02: [https://www.arxiv.org/abs/2502.04463 Training Language Models to Reason Efficiently]
* 2025-02: [https://arxiv.org/abs/2502.08235 The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks]
* 2025-03: [https://arxiv.org/abs/2503.01141 How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach]
* 2025-03: [https://arxiv.org/abs/2503.16419 Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models]
* 2025-04: [https://arxiv.org/abs/2504.05185 Concise Reasoning via Reinforcement Learning]
* 2025-04: [https://arxiv.org/abs/2504.05419 Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification]
* 2025-04: [https://arxiv.org/abs/2504.15895 Dynamic Early Exit in Reasoning Models] (a toy version of early exit is sketched after this list)
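Mechanically, the simplest interventions here amount to a token budget plus an early-exit check during decoding; the papers above largely learn such policies rather than hard-coding them. A toy sketch, with stream_tokens() as a hypothetical streaming API and a fixed end-of-thought marker as an assumption:

<syntaxhighlight lang="python">
# Toy budgeted-reasoning loop: stream "thinking" tokens, stopping at an
# end-of-thought marker or when the token budget is exhausted.
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    raise NotImplementedError  # hypothetical: yields tokens one at a time

def bounded_reasoning(prompt: str, budget: int = 2048,
                      stop_marker: str = "</think>") -> str:
    thought: list[str] = []
    for tok in stream_tokens(prompt):
        thought.append(tok)
        if tok == stop_marker:
            break  # model signalled that it is done thinking
        if len(thought) >= budget:
            break  # budget exhausted: force an early exit
    return "".join(thought)
</syntaxhighlight>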
  
 
====Usage of Training Data====
* 2025-02: LIMO: Less is More for Reasoning (surprisingly easy generalization from very few reasoning training examples; a model can go from knowledge-retrieval to diverse reasoning using curated examples)

=Pragmatics=

==Code for Inference-time Compute==
* optillm: inference proxy which implements state-of-the-art techniques to improve the accuracy and performance of LLMs (improved reasoning on coding, logical, and mathematical queries)

=ML-like Optimization of LLM Setup=
* 2024-06: [https://arxiv.org/abs/2406.07496 TextGrad: Automatic "Differentiation" via Text] (gradient backpropagation through text, analogous to gradient descent; a minimal version of this loop is sketched below)
* 2024-06: [https://arxiv.org/abs/2406.18532 Symbolic Learning Enables Self-Evolving Agents] (optimize LLM frameworks)
* 2025-03: [https://www.nature.com/articles/s41586-025-08661-4 Optimizing generative AI by backpropagating language model feedback]
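These systems share a common loop: one model call produces a textual "gradient" (a critique of the current prompt or program), and a second call applies it as an "update". A minimal sketch of that loop, with llm() as a hypothetical completion function; this is not the actual TextGrad API, and real frameworks add structure, batching, and validation.

<syntaxhighlight lang="python">
# Minimal "textual gradient" loop in the spirit of the systems above.
# llm() is a hypothetical completion stub, not a specific library.
def llm(prompt: str) -> str:
    raise NotImplementedError  # any completion API

def optimize_prompt(prompt: str, failures: str, steps: int = 5) -> str:
    # A real loop would re-evaluate failures on each iteration; this sketch
    # keeps them fixed for brevity.
    for _ in range(steps):
        critique = llm(  # "backward pass": produce a textual gradient
            f"Prompt:\n{prompt}\n\nObserved failures:\n{failures}\n\n"
            "Explain precisely how the prompt should change."
        )
        prompt = llm(    # "update step": apply the critique
            f"Prompt:\n{prompt}\n\nCritique:\n{critique}\n\n"
            "Rewrite the prompt to address the critique. Output only the new prompt."
        )
    return prompt
</syntaxhighlight>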
  
 
=Limitations/Requirements=

==Creativity==
See: [[AI creativity]]

=See Also=
