Increasing AI Intelligence

===Reviews===
* 2025-02: [https://arxiv.org/abs/2502.09100 Logical Reasoning in Large Language Models: A Survey]
* 2025-02: [https://arxiv.org/abs/2502.21321 LLM Post-Training: A Deep Dive into Reasoning Large Language Models]
* 2025-03: [https://arxiv.org/abs/2503.24377 Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models]
* Links to papers: [https://github.com/hijkzzz/Awesome-LLM-Strawberry Awesome LLM Strawberry (OpenAI o1)]

===World Model===
* 2025-03: [https://arxiv.org/abs/2503.04641 Simulating the Real World: A Unified Survey of Multimodal Generative Models]

=Prompt Engineering=

===Thought Templates===

===Automatic Prompt Optimization===

=Fine Tuning=

=Proactive Search=
Compute expended after training, but before inference.

===Reinforcement Learning===
* 2025-04: DeepSeek: [https://arxiv.org/abs/2504.02495 Inference-Time Scaling for Generalist Reward Modeling]

===Training Data (Data Refinement, Synthetic Data)===
* 2025-02: [https://arxiv.org/abs/2502.01718 ACECODER: Acing Coder RL via Automated Test-Case Synthesis]
* 2025-02: [https://arxiv.org/abs/2502.15588 Improving the Scaling Laws of Synthetic Data with Deliberate Practice]
* 2025-03: [https://arxiv.org/abs/2503.19551 Scaling Laws of Synthetic Data for Language Models]
* 2025-03: [https://arxiv.org/abs/2503.18866 Reasoning to Learn from Latent Thoughts]: infer the (latent) thoughts that would have led to the training documents, so that one can pretrain on text+thoughts
* Updating list of links: [https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data Synthetic Data of LLMs, by LLMs, for LLMs]

====Re-captioning====

====Generate consistent plans/thoughts====
* 2024-08: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
** (Microsoft) rStar is a self-play mutual-reasoning approach: a small model augments MCTS using defined reasoning heuristics, and mutually consistent trajectories are emphasized (see the sketch after this list).
* 2024-09: Self-Harmonized Chain of Thought
** Produces refined chain-of-thought solutions/prompts for diverse problems: given a large set of problems/questions, first aggregate them semantically, then apply zero-shot chain-of-thought to each problem, and finally cross-pollinate between proposed solutions to similar problems, looking for refined and generalized solutions.
* 2024-11: LLMs Do Not Think Step-by-step In Implicit Reasoning
** Argues that models trained to reproduce CoT outputs do not internally perform stepwise reasoning (with intermediate representations); this suggests that explicit CoT may be superior to implicit CoT.
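
The common thread above is enforcing agreement across independently sampled reasoning trajectories. Below is a minimal, hypothetical sketch of that idea (plain self-consistency voting; the generate() placeholder stands in for any LLM call, and this is not the full rStar algorithm, which adds MCTS and a separate discriminator model):

<syntaxhighlight lang="python">
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder for an LLM call that returns a reasoning trace
    ending in a line of the form 'Answer: <value>'."""
    raise NotImplementedError  # wire up to an actual LLM client

def extract_answer(trace: str) -> str:
    # Take the final 'Answer:' line as the candidate answer.
    lines = [ln for ln in trace.splitlines() if ln.startswith("Answer:")]
    return lines[-1].removeprefix("Answer:").strip() if lines else ""

def self_consistent_answer(question: str, n: int = 16) -> str:
    """Sample n independent reasoning trajectories and return the
    answer that the largest number of trajectories agree on."""
    traces = [generate(f"{question}\nThink step by step.") for _ in range(n)]
    counts = Counter(a for a in map(extract_answer, traces) if a)
    return counts.most_common(1)[0][0] if counts else ""
</syntaxhighlight>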
  
===Sampling===

===Automated prompt generation===

===Distill inference-time-compute into model===

===CoT reasoning model===
See also: AI tools > LLM > Open-weights LLM > Reasoning

===Scaling===

=Inference Time Compute=

==Methods==

===Review===

===In context learning (ICL), search, and other inference-time methods===

===Inference-time Sampling===

===Inference-time Gradient===

===Self-prompting===

===Retrieval or Memory===

===In-context thought===
 
* 2023-01/2024-10: [https://arxiv.org/abs/2301.00234 A Survey on In-context Learning]
* 2025-01: [https://arxiv.org/abs/2501.04682 Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought]
* [https://x.com/dav1d_bai/status/1904057766593138841 2025-03]: [https://optimal-test-time.vercel.app/papers/accuracy-efficiency-tradeoffs Interruption is All You Need: Improving Reasoning Model Refusal Rates through measuring Parallel Reasoning Diversity]: reduces hallucinations by sampling parallel reasoning traces and measuring their diversity
  
===Naive multi-LLM (verification, self-critique, majority voting, best-of-N, etc.)===
* 2023-06: [https://arxiv.org/abs/2306.02561 LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion] ([https://github.com/yuchenlin/LLM-Blender?tab=readme-ov-file code])
* 2023-12: [https://aclanthology.org/2023.findings-emnlp.203/ Dynamic Voting for Efficient Reasoning in Large Language Models]
* 2025-03: [https://arxiv.org/abs/2502.01839 Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification]
* 2025-02: [https://arxiv.org/abs/2502.04506 When One LLM Drools, Multi-LLM Collaboration Rules]
* 2025-03: [https://arxiv.org/abs/2503.17363 Dancing with Critiques: Enhancing LLM Reasoning with Stepwise Natural Language Self-Critique]
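
A minimal sketch of the best-of-N pattern these papers build on, assuming generate() and verify() placeholders for the generator and verifier/judge models (not any specific paper's method):

<syntaxhighlight lang="python">
def generate(prompt: str) -> str:
    """Placeholder: one sampled completion from the generator model."""
    raise NotImplementedError

def verify(prompt: str, candidate: str) -> float:
    """Placeholder: a verifier (LLM judge or reward model) scores the
    candidate in [0, 1]; verification is often easier than generation."""
    raise NotImplementedError

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n candidate answers and keep the one the verifier scores
    highest: accuracy is bought with roughly n-fold inference compute."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verify(prompt, c))
</syntaxhighlight>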
  
 
===Multi-LLM (multiple comparisons, branching, etc.)===

===Iteration (e.g. neural-like layered blocks)===

===Iterative reasoning via graphs===

===Monte Carlo Tree Search (MCTS)===

===Other Search===

===Chain-of-Thought Reasoning===
 
* 2025-02: [https://arxiv.org/abs/2502.06807 Competitive Programming with Large Reasoning Models]
* 2025-02: [https://arxiv.org/abs/2502.18600 Chain of Draft: Thinking Faster by Writing Less] (a prompt-level sketch follows this list)
* 2025-03: [https://arxiv.org/abs/2503.17352 OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement] ([https://github.com/yihedeng9/OpenVLThinker code])
* 2025-03: [https://arxiv.org/abs/2503.19877 Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators]
* 2025-03: [https://arxiv.org/abs/2503.23513 RARE: Retrieval-Augmented Reasoning Modeling]
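
Chain of Draft is a prompting-level intervention: the model is asked to keep each intermediate reasoning step to a terse draft rather than full prose. A sketch of a CoD-style system prompt (paraphrasing the paper's idea; the exact wording is given in the paper):

<syntaxhighlight lang="python">
# A Chain-of-Draft-style system prompt: terse drafts per step instead of
# verbose chain-of-thought, with the final answer after a separator.
COD_SYSTEM_PROMPT = (
    "Think step by step, but keep only a minimal draft for each thinking "
    "step, with at most five words per step. "
    "Return the final answer after the separator ####."
)

messages = [
    {"role": "system", "content": COD_SYSTEM_PROMPT},
    {"role": "user", "content": "A bat and a ball cost $1.10 in total ..."},  # example query
]
</syntaxhighlight>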
===Model Merging===
* 2025-01: [https://arxiv.org/abs/2501.12599 Kimi k1.5: Scaling Reinforcement Learning with LLMs]
* 2025-03: [https://arxiv.org/abs/2503.20641 Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging] ([https://github.com/hahahawu/Long-to-Short-via-Model-Merging code]); a generic weight-interpolation sketch follows this list
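
A sketch of the simplest merging baseline such work starts from: linear interpolation of two fine-tuned checkpoints sharing an architecture (PyTorch; the file paths and alpha value are illustrative, and the long-to-short paper explores more refined schemes):

<syntaxhighlight lang="python">
import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Element-wise interpolation: merged = alpha * A + (1 - alpha) * B.
    Both checkpoints must have identical parameter names and shapes."""
    assert sd_a.keys() == sd_b.keys(), "checkpoints must share parameter names"
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# Hypothetical usage: blend a long-CoT reasoning model with a short-answer model.
# sd_long = torch.load("long_cot_model.pt")
# sd_short = torch.load("short_model.pt")
# merged = merge_state_dicts(sd_long, sd_short, alpha=0.3)
# torch.save(merged, "merged_model.pt")
</syntaxhighlight>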
  
 
===Meta-methods===

==Analysis==

===Scaling===
 
* 2025-02: [https://www.arxiv.org/abs/2502.08606 Distillation Scaling Laws]
* 2025-03: [https://arxiv.org/abs/2503.10061 Compute Optimal Scaling of Skills: Knowledge vs Reasoning]
* 2025-03: [https://arxiv.org/abs/2504.00294 Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead]
  
====(Optimal) Usage of Reasoning Compute====
 
* 2024-12: [https://arxiv.org/abs/2412.21187 Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs]
* 2025-01: [https://arxiv.org/abs/2501.18585 Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs]
* 2025-02: [https://arxiv.org/abs/2502.08235 The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks]
* 2025-03: [https://arxiv.org/abs/2503.01141 How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach]
* 2025-03: [https://arxiv.org/abs/2503.16419 Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models]
  
 
====Usage of Training Data====
* 2025-02: LIMO: Less is More for Reasoning (surprisingly easy generalization from very few reasoning training examples; a model can go from knowledge retrieval to diverse reasoning using curated examples)

====Theory====

====Expending compute works====
[[File:Compute.png]]

==Pragmatics==

===Code for Inference-time Compute===
* optillm: Inference proxy which implements state-of-the-art techniques to improve the accuracy and performance of LLMs (improved reasoning for coding, logical, and mathematical queries); see the usage sketch below
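
A hypothetical invocation, assuming an optillm proxy is already running locally with its OpenAI-compatible endpoint and using its convention of selecting a technique via a model-name prefix (the port, prefix slug, and model name below are illustrative; see the optillm README for the actual options):

<syntaxhighlight lang="python">
from openai import OpenAI

# Point a standard OpenAI client at the local optillm proxy.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="optillm")

response = client.chat.completions.create(
    model="moa-gpt-4o-mini",  # technique prefix + underlying model (illustrative)
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)
print(response.choices[0].message.content)
</syntaxhighlight>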
=Interact with Environment=

==Memory==

==Tool Use==

==Integrated==

=Multi-agent Effort (and Emergent Intelligence)=

=ML-like Optimization of LLM Setup=

=Limitations/Requirements=
  
 
==Creativity==
See: [[AI creativity]]

=See Also=
