AI Agents

From GISAXS
 
* [https://github.com/lmnr-ai/flow Flow: A lightweight task engine for building AI agents that prioritizes simplicity and flexibility]

* [https://llama-stack.readthedocs.io/en/latest/index.html llama-stack]

* [https://huggingface.co/blog/smolagents Huggingface] [https://github.com/huggingface/smolagents smolagents]

* [https://github.com/elizaOS/eliza Eliza] (includes multi-agent, interaction with docs, Discord, Twitter, etc.)
  
 
===Information Retrieval===
  
 
=Increasing AI Agent Intelligence=
 
See: [[Increasing AI Intelligence]]

==Reviews==
 
* 2024-12: [https://arxiv.org/abs/2412.11936 A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges]
 
 
 
==Prompt Engineering==
 
* 2024-11: [https://arxiv.org/abs/2411.05778 LLMs as Method Actors: A Model for Prompt Engineering and Architecture]
 
 
 
==Proactive Search==
 
Compute expended after training, but before inference.
 
 
 
===Training Data (Data Refinement, Synthetic Data)===
 
* C.f. image datasets:
 
** 2023-06: [https://arxiv.org/abs/2306.00984 StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners]
 
** 2023-11: [https://arxiv.org/abs/2311.17946 DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback]
 
* 2024-09: [https://arxiv.org/abs/2409.17115 Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale]
 
* 2024-10: [https://arxiv.org/abs/2410.15547 Data Cleaning Using Large Language Models]
 
* Updating list of links: [https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data Synthetic Data of LLMs, by LLMs, for LLMs]
 
 
 
===Generate consistent plans/thoughts===
 
* 2024-08: [https://arxiv.org/abs/2408.06195 Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers] ([https://github.com/zhentingqi/rStar code])
 
** (Microsoft) rStar is a self-play mutual reasoning approach: a small model augments MCTS with a set of defined reasoning-action heuristics, and mutually consistent reasoning trajectories are emphasized.
 
* 2024-09: [https://www.arxiv.org/abs/2409.04057 Self-Harmonized Chain of Thought]
 
** Produces refined chain-of-thought-style solutions/prompts for diverse problems. Given a large set of problems/questions, first aggregate them semantically, then apply zero-shot chain-of-thought to each problem. Then cross-pollinate between proposed solutions to similar problems, looking for refined and generalized solutions.
 
* 2024-11: [https://arxiv.org/abs/2411.15862 LLMs Do Not Think Step-by-step In Implicit Reasoning]
 
** They argue that models trained to reproduce CoT outputs do not, internally, perform stepwise reasoning (with intermediate representations); this suggests that explicit CoT could be superior to implicit CoT.
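The self-play tree-search idea described for rStar above can be illustrated with a generic MCTS loop over reasoning steps. This is a minimal sketch, not the rStar implementation: `propose_steps` and `score_trajectory` are hypothetical stand-ins for the small model's reasoning-action heuristics and the mutual-consistency check.

```python
import math
import random

random.seed(0)  # reproducible toy run

def propose_steps(state):
    """Hypothetical stand-in for the small model's reasoning-action
    heuristics: propose candidate next reasoning steps."""
    return [state + [a] for a in ("decompose", "compute", "verify")]

def score_trajectory(state):
    """Hypothetical stand-in for a mutual-consistency check: reward
    trajectories that end with an explicit verification step."""
    return 1.0 if state and state[-1] == "verify" else 0.5 * random.random()

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def uct(self, c=1.4):
        """Upper-confidence bound used to balance exploration/exploitation."""
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root_state, iterations=200, max_depth=4):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        while node.children:                      # selection
            node = max(node.children, key=Node.uct)
        if len(node.state) < max_depth:           # expansion
            node.children = [Node(s, node) for s in propose_steps(node.state)]
            node = random.choice(node.children)
        reward = score_trajectory(node.state)     # evaluation
        while node:                               # backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    # The most-visited first step is the "emphasized" trajectory start.
    return max(root.children, key=lambda n: n.visits).state

best = mcts([])
```

With the toy scorer above, trajectories that reach a `verify` step accumulate more visits, loosely mirroring how rStar emphasizes mutually consistent rollouts.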
 
 
 
===Sampling===
 
* 2024-11: [https://arxiv.org/abs/2411.04282 Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding] ([https://github.com/SalesforceAIResearch/LaTRO code])
 
 
 
===Automated prompt generation===
 
* 2024-09: [https://arxiv.org/abs/2409.13449 Minstrel: Structural Prompt Generation with Multi-Agents Coordination for Non-AI Experts]
 
 
 
===Distill inference-time-compute into model===
 
* 2023-10: [https://arxiv.org/abs/2310.11716 Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning] (U. Maryland, Adobe)
 
* 2023-11: [https://arxiv.org/abs/2311.01460 Implicit Chain of Thought Reasoning via Knowledge Distillation] (Harvard, Microsoft, Hopkins)
 
* 2024-02: [https://arxiv.org/abs/2402.04494 Grandmaster-Level Chess Without Search] (Google DeepMind)
 
* 2024-07: [https://arxiv.org/abs/2407.03181 Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models]
 
* 2024-07: [https://arxiv.org/abs/2407.06023 Distilling System 2 into System 1]
 
* 2024-07: [https://arxiv.org/abs/2407.14622 BOND: Aligning LLMs with Best-of-N Distillation]
 
* 2024-09: [https://arxiv.org/abs/2409.12917 Training Language Models to Self-Correct via Reinforcement Learning] (Google DeepMind)
 
* 2024-10: [https://arxiv.org/abs/2410.10630 Thinking LLMs: General Instruction Following with Thought Generation]
 
* 2024-10: [https://arxiv.org/abs/2410.09918 Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces]
 
* 2024-12: [https://arxiv.org/abs/2412.06769 Training Large Language Models to Reason in a Continuous Latent Space]
 
 
 
====CoT reasoning model====
 
* 2024-09: [https://openai.com/o1/ OpenAI o1]
 
* 2024-10: [https://github.com/GAIR-NLP/O1-Journey/blob/main/resource/report.pdf O1 Replication Journey: A Strategic Progress Report – Part 1] ([https://github.com/GAIR-NLP/O1-Journey code]): Attempt by [https://gair-nlp.github.io/walnut-plan/ Walnut Plan] to reproduce o1-like in-context reasoning
 
* 2024-11: [https://x.com/deepseek_ai/status/1859200141355536422 DeepSeek-R1-Lite-Preview reasoning model]
 
* 2024-11: [https://arxiv.org/abs/2411.14405 Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions]
 
* 2024-11: [https://huggingface.co/papers/2411.16489 O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?]
 
* 2024-12: [https://arxiv.org/abs/2412.00154 o1-Coder: an o1 Replication for Coding] ([https://github.com/ADaM-BJTU/O1-CODER code])
 
 
 
===Scaling===
 
* 2024-08: [https://arxiv.org/abs/2408.16737 Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling] (Google DeepMind)
 
* 2024-11: [https://arxiv.org/abs/2411.04434 Scaling Laws for Pre-training Agents and World Models]
 
 
 
==Inference Time Compute==
 
===Methods===
 
* 2024-03: [https://arxiv.org/abs/2403.09629 Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking]
 
* 2024-11: [https://arxiv.org/pdf/2411.19865 Reverse Thinking Makes LLMs Stronger Reasoners]
 
* 2024-12: [https://arxiv.org/abs/2412.06769 Training Large Language Models to Reason in a Continuous Latent Space] (Chain of Continuous Thought, COCONUT)
 
'''Review'''
 
* 2024-06: [https://arxiv.org/abs/2406.16838 From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models]
 
 
 
===In context learning (ICL), search, and other inference-time methods===
 
* 2023-03: [https://arxiv.org/abs/2303.11366 Reflexion: Language Agents with Verbal Reinforcement Learning]
 
* 2023-05: [https://arxiv.org/abs/2305.16291 VOYAGER: An Open-Ended Embodied Agent with Large Language Models]
 
* 2024-04: [https://arxiv.org/abs/2404.11018 Many-Shot In-Context Learning]
 
* 2024-08: [https://arxiv.org/abs/2408.08435 Automated Design of Agentic Systems]
 
* 2024-09: [https://arxiv.org/abs/2409.03733 Planning In Natural Language Improves LLM Search For Code Generation]
 
 
 
===Inference-time Sampling===
 
* 2024-10: [https://github.com/xjdr-alt/entropix entropix: Entropy Based Sampling and Parallel CoT Decoding]
 
* 2024-10: [https://arxiv.org/abs/2410.16033 TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling]
 
* 2024-11: [https://openreview.net/forum?id=FBkpCyujtS Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs]
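The entropy-based idea behind entropix can be sketched as: measure the entropy of the model's next-token distribution and change decoding behavior when uncertainty is high. A toy illustration; the threshold value and the branch policy here are assumptions, not entropix's actual defaults.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_strategy(probs, threshold=1.0):
    """Hypothetical policy: decode greedily when the model is confident,
    sample or branch into parallel CoT paths when it is uncertain."""
    return "branch" if entropy(probs) > threshold else "greedy"

confident = [0.97, 0.01, 0.01, 0.01]   # low entropy -> greedy decoding
uncertain = [0.25, 0.25, 0.25, 0.25]   # high entropy -> sample/branch
```

The uniform distribution has entropy ln(4) ≈ 1.39 nats, crossing the (assumed) threshold, while the peaked distribution stays well below it.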
 
 
 
===Inference-time Gradient===
 
* 2024-11: [https://ekinakyurek.github.io/papers/ttt.pdf The Surprising Effectiveness of Test-Time Training for Abstract Reasoning] ([https://github.com/ekinakyurek/marc code])
 
 
 
===Self-prompting===
 
* 2023-05: [https://arxiv.org/abs/2305.09993 Reprompting: Automated Chain-of-Thought Prompt Inference Through Gibbs Sampling]
 
* 2023-11: [https://arxiv.org/abs/2311.04205 Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves]
 
 
 
===In-context thought===
 
* 2022-01: [https://arxiv.org/abs/2201.11903 Chain-of-Thought Prompting Elicits Reasoning in Large Language Models] (Google Brain)
 
* 2023-05: [https://arxiv.org/abs/2305.10601 Tree of Thoughts: Deliberate Problem Solving with Large Language Models] (Google DeepMind)
 
* 2024-05: [https://arxiv.org/abs/2405.18357 Faithful Logical Reasoning via Symbolic Chain-of-Thought]
 
* 2024-06: [https://aclanthology.org/2024.findings-naacl.78/ A Tree-of-Thoughts to Broaden Multi-step Reasoning across Languages]
 
* 2024-09: [https://arxiv.org/abs/2409.12183 To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning]
 
* 2024-09: [https://arxiv.org/abs/2409.12618 Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning] ([https://agnostiq.ai/ Agnostiq], Toronto)
 
* 2024-09: [https://arxiv.org/abs/2409.17539 Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models]
 
* 2024-10: [https://arxiv.org/abs/2410.16540 A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration] (failed reasoning traces can improve CoT)
 
* 2024-10: [https://arxiv.org/abs/2410.06634 Tree of Problems: Improving structured problem solving with compositionality]
 
* 2023-01/2024-10: [https://arxiv.org/abs/2301.00234 A Survey on In-context Learning]
 
 
 
===Naive multi-LLM (verification, majority voting, best-of-N, etc.)===
 
* 2023-06: [https://arxiv.org/abs/2306.02561 LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion] ([https://github.com/yuchenlin/LLM-Blender?tab=readme-ov-file code])
 
* 2023-12: [https://aclanthology.org/2023.findings-emnlp.203/ Dynamic Voting for Efficient Reasoning in Large Language Models]
 
* 2024-04: [https://arxiv.org/abs/2404.01054 Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment]
 
* 2024-08: [https://arxiv.org/abs/2408.17017 Dynamic Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling]
 
* 2024-11: [https://arxiv.org/abs/2411.00492 Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models]
 
* 2024-12: [https://github.com/irthomasthomas/llm-consortium llm-consortium]: Multiple LLMs collaboratively solve problems through structured dialogue, evaluation and arbitration
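The naive strategies named in this section's heading reduce to a few lines each. A minimal sketch; in practice the answers would come from repeated LLM samples, and the best-of-N scorer would be a verifier or reward model (both assumed here).

```python
from collections import Counter

def majority_vote(answers):
    """Self-consistency / majority voting: most common answer among N samples."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, score):
    """Best-of-N: return the candidate the scorer (e.g. a verifier) ranks highest."""
    return max(candidates, key=score)

# Toy usage with hypothetical sampled answers:
samples = ["42", "41", "42", "42", "40"]
assert majority_vote(samples) == "42"
assert best_of_n(samples, score=lambda a: samples.count(a)) == "42"
```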
 
 
 
===Multi-LLM (multiple comparisons, branching, etc.)===
 
* 2024-10: [https://arxiv.org/abs/2410.10630 Thinking LLMs: General Instruction Following with Thought Generation]
 
* 2024-11: [https://arxiv.org/abs/2411.02830 Mixtures of In-Context Learners]: Multiple "experts", each with a different set of in-context examples; combine outputs at the level of next-token-prediction
 
* 2024-11: [https://arxiv.org/abs/2411.10440 LLaVA-o1: Let Vision Language Models Reason Step-by-Step] ([https://github.com/PKU-YuanGroup/LLaVA-o1 code])
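The Mixtures of In-Context Learners entry above combines experts at the level of next-token prediction, which amounts to a weighted average of the per-expert next-token distributions. A minimal sketch with toy numbers; in practice each row would be an expert's softmax output:

```python
import numpy as np

def combine_next_token(expert_probs, weights):
    """Weighted mixture of per-expert next-token distributions.
    expert_probs: (n_experts, vocab_size), each row summing to 1."""
    mix = np.asarray(weights) @ np.asarray(expert_probs)
    return mix / mix.sum()  # renormalize against rounding error

# Two hypothetical experts over a 3-token vocabulary:
experts = [[0.7, 0.2, 0.1],
           [0.2, 0.7, 0.1]]
mix = combine_next_token(experts, weights=[0.5, 0.5])
```

Decoding then samples from `mix` instead of any single expert's distribution.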
 
 
 
===Iteration (e.g. neural-like layered blocks)===
 
* 2024-06: [https://arxiv.org/abs/2406.04692 Mixture-of-Agents Enhances Large Language Model Capabilities]
 
 
 
===Iterative reasoning via graphs===
 
* 2023-08: [https://arxiv.org/abs/2308.09687 Graph of Thoughts: Solving Elaborate Problems with Large Language Models]
 
* 2023-10: [https://arxiv.org/abs/2310.04363 Amortizing intractable inference in large language models] ([https://github.com/GFNOrg/gfn-lm-tuning code])
 
* 2024-09: [https://arxiv.org/abs/2409.10038 On the Diagram of Thought]: Iterative reasoning as a directed acyclic graph (DAG)
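The DAG framing in the Diagram of Thought entry above can be sketched as thoughts evaluated in topological order, each derived from its parents; `merge_thoughts` is a hypothetical stand-in for the LLM call that combines parent thoughts.

```python
from graphlib import TopologicalSorter

def merge_thoughts(parents):
    """Hypothetical stand-in for an LLM call that merges parent thoughts."""
    return " + ".join(parents) if parents else "premise"

def run_dag(edges):
    """edges maps node -> set of parent (predecessor) nodes.
    Evaluate every thought only after all of its parents."""
    thoughts = {}
    for node in TopologicalSorter(edges).static_order():
        parents = sorted(edges.get(node, ()))
        thoughts[node] = merge_thoughts([thoughts[p] for p in parents])
    return thoughts

# Toy DAG: two facts support a claim, which supports the answer.
dag = {"claim": {"fact1", "fact2"}, "answer": {"claim"}}
thoughts = run_dag(dag)
```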
 
 
 
===Monte Carlo Tree Search (MCTS)===
 
* 2024-05: [https://arxiv.org/abs/2405.03553 AlphaMath Almost Zero: process Supervision without process]
 
* 2024-06: [https://arxiv.org/abs/2406.03816 ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search]
 
* 2024-06: [https://arxiv.org/abs/2406.06592 Improve Mathematical Reasoning in Language Models by Automated Process Supervision]
 
* 2024-06: [https://arxiv.org/abs/2406.07394 Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B]
 
* 2024-07: [https://arxiv.org/abs/2407.01476 Tree Search for Language Model Agents]
 
* 2024-10: [https://arxiv.org/abs/2410.01707 Interpretable Contrastive Monte Carlo Tree Search Reasoning]
 
 
 
===Other Search===
 
* 2024-11: [https://arxiv.org/abs/2411.05010 Scattered Forest Search: Smarter Code Space Exploration with LLMs]
 
 
 
===Chain-of-Thought Reasoning===
 
* 2017-05: [https://arxiv.org/abs/1705.04146 Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems]
 
* 2021-10: [https://arxiv.org/abs/2110.14168 Training Verifiers to Solve Math Word Problems]
 
* 2024-02: [https://arxiv.org/abs/2402.10200 Chain-of-Thought Reasoning Without Prompting]
 
 
 
===Scaling===
 
* 2021-04: [https://arxiv.org/abs/2104.03113 Scaling Scaling Laws with Board Games]
 
* 2024-03: [https://arxiv.org/abs/2403.02419 Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems]
 
* 2024-04: [https://arxiv.org/abs/2404.00725 The Larger the Better? Improved LLM Code-Generation via Budget Reallocation]
 
* 2024-07: [https://arxiv.org/abs/2407.21787 Large Language Monkeys: Scaling Inference Compute with Repeated Sampling]
 
* 2024-08: [https://arxiv.org/abs/2408.00724 An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models]
 
* 2024-08: [https://arxiv.org/abs/2408.03314 Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters]
 
* 2024-10: (comparing fine-tuning to in-context learning) [https://arxiv.org/abs/2405.19874 Is In-Context Learning Sufficient for Instruction Following in LLMs?]
 
* 2024-11: [https://arxiv.org/abs/2411.17501 Inference Scaling FLaws: The Limits of LLM Resampling with Imperfect Verifiers]
 
 
 
===Theory===
 
* 2024-02: [https://arxiv.org/abs/2402.12875 Chain of Thought Empowers Transformers to Solve Inherently Serial Problems]
 
 
 
===Expending compute works===
 
* 2024-06-10: Blog post (opinion): [https://yellow-apartment-148.notion.site/AI-Search-The-Bitter-er-Lesson-44c11acd27294f4495c3de778cd09c8d AI Search: The Bitter-er Lesson]
 
* 2024-07-17: Blog post (test): [https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt Getting 50% (SoTA) on ARC-AGI with GPT-4o]
 
* 2024-09-12: [https://openai.com/o1/ OpenAI o1]: [https://openai.com/index/learning-to-reason-with-llms/ Learning to Reason with LLMs]
 
[[Image:Compute.png|600px]]
 
* 2024-09-16: [https://www.oneusefulthing.org/p/scaling-the-state-of-play-in-ai Scaling: The State of Play in AI]
 
 
 
===Code for Inference-time Compute===
 
* [https://github.com/codelion/optillm optillm]: Inference proxy which implements state-of-the-art techniques to improve accuracy and performance of LLMs (improve reasoning over coding, logical and mathematical queries)
 
 
 
==Memory==
 
* 2024-10: [https://arxiv.org/abs/2410.08821 Retriever-and-Memory: Towards Adaptive Note-Enhanced Retrieval-Augmented Generation]
 
 
 
==Tool Use==
 
* 2024-11: [https://arxiv.org/abs/2411.01747 DynaSaur: Large Language Agents Beyond Predefined Actions]: writes functions/code to increase capabilities
 
 
 
==Multi-agent Effort (and Emergent Intelligence)==
 
* 2024-10: [https://arxiv.org/abs/2410.11163 Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence]
 
* 2024-10: [https://arxiv.org/abs/2410.10934 Agent-as-a-Judge: Evaluate Agents with Agents]
 
* 2024-10: [https://arxiv.org/abs/2410.19318 Two are better than one: Context window extension with multi-grained self-injection]
 
* 2024-11: [https://arxiv.org/abs/2411.00114 Project Sid: Many-agent simulations toward AI civilization]
 
 
 
==ML-like Optimization of LLM Setup==
 
* 2023-10: [https://arxiv.org/abs/2310.03714 DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines] ([https://github.com/stanfordnlp/dspy code]: Programming—not prompting—Foundation Models)
 
* 2023-05: [https://arxiv.org/abs/2305.03495 Automatic Prompt Optimization with "Gradient Descent" and Beam Search]
 
* 2024-06: [https://arxiv.org/abs/2406.07496 TextGrad: Automatic "Differentiation" via Text] (gradient backpropagation through text)
 
* 2024-06: [https://arxiv.org/abs/2406.18532 Symbolic Learning Enables Self-Evolving Agents] (optimize LLM frameworks)
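The frameworks above share one loop: evaluate the current prompt/program, ask an LLM for a critique (TextGrad's "textual gradient"), and rewrite. A generic sketch under toy stand-ins; `evaluate`, `llm_critique`, and `llm_rewrite` are all assumptions, not any framework's API.

```python
def optimize_prompt(prompt, examples, evaluate, llm_critique, llm_rewrite,
                    steps=3):
    """Text-space 'gradient descent': critique as gradient, rewrite as update,
    keep the best-scoring prompt seen so far."""
    best, best_score = prompt, evaluate(prompt, examples)
    for _ in range(steps):
        critique = llm_critique(best, examples)   # the "textual gradient"
        candidate = llm_rewrite(best, critique)   # the "update step"
        score = evaluate(candidate, examples)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Toy stand-ins: pretend longer, more specific prompts score higher.
best, score = optimize_prompt(
    "Summarize.", [],
    evaluate=lambda p, ex: len(p),
    llm_critique=lambda p, ex: "too vague; ask for specifics",
    llm_rewrite=lambda p, c: p + " (be specific)",
    steps=2,
)
```

Real systems replace the toy scorer with task metrics over a training set, which is what makes the loop "ML-like".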
 
  
 
=Multi-agent orchestration=
 
* Amazon AWS [https://github.com/awslabs/multi-agent-orchestrator Multi-Agent Orchestrator]

* [https://github.com/kaiban-ai/KaibanJS KaibanJS]: Kanban for AI Agents? (Takes inspiration from [https://en.wikipedia.org/wiki/Kanban Kanban] visual [https://www.atlassian.com/agile/kanban work management].)

* [https://github.com/Thytu/Agentarium Agentarium]
  
 
==Open Source Systems==

==Metrics, Benchmarks==
 
* 2024-12: [https://arxiv.org/abs/2412.14161 TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks] ([https://github.com/TheAgentCompany/TheAgentCompany code], [https://the-agent-company.com/ project], [https://the-agent-company.com/#/leaderboard leaderboard])

* 2024-12: [https://arxiv.org/abs/2412.11936 A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges]

* 2025-01: [https://codeelo-bench.github.io/ CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings] ([https://arxiv.org/abs/2501.01257 preprint], [https://codeelo-bench.github.io/#leaderboard-table leaderboard])
===Evaluation Schemes===
* 2024-12: [https://arxiv.org/abs/2412.10424 LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation]

* 2025-01: [https://github.com/marquisdepolis/LLMRank LLMRank ("SlopRank")]: LLMs evaluate each other, allowing the top model (for a given prompt/problem) to be inferred from a large number of recommendations.
  
 
===Multi-agent===

Latest revision as of 09:31, 3 January 2025

===Science Agents===

See [[Science Agents]].

=Advanced Workflows=

==Software Development Workflows==

Several paradigms of AI-assisted coding have arisen:

# Manual, human-driven
# AI-aided through chat/dialogue, where the human asks for code and then copies it into the project:
## OpenAI ChatGPT
## Anthropic Claude
# API calls to an LLM, which generates code and inserts the file into the project
# LLM integration into the IDE:
## Copilot
## Qodo (Codium) & AlphaCodium (preprint, code)
## Cursor
## Codeium Windsurf (with "Cascade" AI Agent)
# AI-assisted IDE, where the AI generates and manages the dev environment:
## Replit
## Aider (code): pair programming on the command line
## Pythagora
## StackBlitz bolt.new
## Cline (formerly Claude Dev)
# Prompt-to-product:
## GitHub Spark (demo video)
# Semi-autonomous software-engineer agents:
## Devin (Cognition AI)
## Amazon Q
## Honeycomb
For a review of the current state of software-engineering agentic approaches, see:

==Agent Challenges==

* Aidan-Bench: Tests creativity by having a particular LLM generate a long sequence of outputs (meant to be different), and measuring how long it can go before duplicates appear.
* Pictionary: An LLM suggests a prompt, multiple LLMs generate outputs, and an LLM judges; this allows ranking of generation abilities.
* MC-bench: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges.
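The Aidan-Bench idea above reduces to counting novel generations before the first repeat. A sketch using an exact-match duplicate test; the real benchmark uses embedding similarity, and `generate` is a hypothetical stand-in for the model call.

```python
def novelty_run(generate, max_outputs, is_duplicate):
    """Count how many outputs the model produces before repeating itself."""
    seen = []
    for i in range(max_outputs):
        out = generate(i)
        if any(is_duplicate(out, prev) for prev in seen):
            return i  # novel outputs produced before the first duplicate
        seen.append(out)
    return max_outputs  # never repeated within the budget

# Toy model that starts repeating after five distinct outputs:
outputs = ["a", "b", "c", "d", "e", "a", "b"]
novel = novelty_run(lambda i: outputs[i], max_outputs=len(outputs),
                    is_duplicate=lambda x, y: x == y)
```

Higher counts indicate a model that keeps producing genuinely different outputs for longer.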
