AI Agents
  
 
==Components of AI Assistants==

===Agent Internal Workflow Management===
* [https://github.com/langchain-ai/langchain LangChain]
* [https://github.com/pydantic/pydantic-ai Pydantic: Agent Framework / shim to use Pydantic with LLMs]
* [https://github.com/lmnr-ai/flow Flow: A lightweight task engine for building AI agents that prioritizes simplicity and flexibility]
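These frameworks mainly manage an agent's internal loop: prompt the model, parse a proposed action, execute a tool, and feed the observation back into the context. The following is a minimal, framework-agnostic sketch of that loop (the llm() stub and toy tool registry are hypothetical, not the API of LangChain, Pydantic, or Flow):

<syntaxhighlight lang="python">
# Minimal, framework-agnostic agent loop (illustrative sketch only).
# llm() and the tool registry are hypothetical stubs, not any library's API.
import json

def llm(prompt: str) -> str:
    # Stub: a real implementation would call a model API here.
    return json.dumps({"action": "final_answer", "input": "42"})

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy tool
}

def run_agent(task: str, max_steps: int = 5) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # Ask the model for its next action as JSON: {"action": ..., "input": ...}
        reply = json.loads(llm("\n".join(history) + "\nRespond with JSON {action, input}."))
        if reply["action"] == "final_answer":
            return reply["input"]
        observation = TOOLS[reply["action"]](reply["input"])   # execute the chosen tool
        history.append(f"Observation: {observation}")          # feed the result back to the model
    return "No answer within step budget."

if __name__ == "__main__":
    print(run_agent("What is 6 * 7?"))
</syntaxhighlight>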
  
 
===Information Retrieval===
* See also [[AI_tools#Retrieval_Augmented_Generation_.28RAG.29|RAG]].
* 2024-10: [https://arxiv.org/abs/2410.09713 Agentic Information Retrieval]

===Control (tool-use, computer use, etc.)===
* Anthropic [https://www.anthropic.com/news/model-context-protocol Model Context Protocol] (MCP)
** [https://github.com/jlowin/fastmcp FastMCP]: The fast, Pythonic way to build MCP servers
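MCP standardizes how agents discover and call external tools via a client/server protocol. As a rough illustration, a FastMCP-style tool server can be as small as the sketch below (this follows FastMCP's decorator-based quickstart pattern; exact names and run options may differ between versions):

<syntaxhighlight lang="python">
# Minimal MCP tool server sketch using FastMCP (check the FastMCP docs for the
# current API; this follows its quickstart pattern of registering tools via decorators).
from fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers (exposed to MCP clients as a callable tool)."""
    return a + b

if __name__ == "__main__":
    mcp.run()  # serves the tool over MCP (stdio transport by default)
</syntaxhighlight>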
  
 
===Open-source===

==Specific Uses for AI Assistants==

===Computer Use===
* 2024-11: [https://arxiv.org/abs/2411.10323 The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use] ([https://github.com/showlab/computer_use_ootb code])
===Software Engineering===
* 2024-11: [https://github.com/MLSysOps/MLE-agent MLE-Agent: Your intelligent companion for seamless AI engineering and research]

===Science Agents===
* See [[Science Agents]].

===LLM-as-judge===
* 2024-10: [https://arxiv.org/abs/2410.10934 Agent-as-a-Judge: Evaluate Agents with Agents]
* 2024-11: [https://arxiv.org/abs/2411.15594 A Survey on LLM-as-a-Judge]
* 2024-12: [https://arxiv.org/abs/2412.05579 LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods]
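For context on the entries above: LLM-as-judge means prompting a (typically stronger) model to grade or compare candidate outputs in place of human raters. A schematic sketch, with a hypothetical llm() stub rather than any specific paper's protocol:

<syntaxhighlight lang="python">
# Schematic LLM-as-judge pairwise comparison (illustrative only; llm() is a stub).
# Real setups typically swap the A/B order and average verdicts to reduce position bias.
def llm(prompt: str) -> str:
    return "A"   # stub verdict; a real judge model would return its reasoned choice

JUDGE_TEMPLATE = """You are an impartial judge.
Question: {question}
Answer A: {a}
Answer B: {b}
Reply with exactly one letter, A or B, for the better answer."""

def judge_pair(question: str, a: str, b: str) -> str:
    verdict = llm(JUDGE_TEMPLATE.format(question=question, a=a, b=b)).strip()
    return a if verdict.upper().startswith("A") else b

if __name__ == "__main__":
    print(judge_pair("What is the capital of France?", "Paris.", "Lyon."))
</syntaxhighlight>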
  
 
=Advanced Workflows=

==Software Development Workflows==
Several paradigms of AI-assisted coding have arisen:
# Manual, human-driven
# AI-aided through chat/dialogue, where the human asks for code and then copies it into the project
## OpenAI ChatGPT
## Anthropic Claude
# API calls to an LLM, which generates code and inserts the file into the project
# LLM integration into the IDE
## Copilot
## Qodo (Codium) & AlphaCodium (preprint, code)
## Cursor
## Codeium Windsurf (with "Cascade" AI Agent)
# AI-assisted IDE, where the AI generates and manages the dev environment
## Replit
## Aider (code): pair programming on the command line
## Pythagora
## StackBlitz bolt.new
## Cline (formerly Claude Dev)
# Prompt-to-product
## GitHub Spark (demo video)
# Semi-autonomous software-engineer agents
## Devin (Cognition AI)
## Amazon Q
## Honeycomb

For a review of the current state of software-engineering agentic approaches, see:
  
 
=Increasing AI Agent Intelligence=

==Reviews==
* 2024-12: [https://arxiv.org/abs/2412.11936 A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges]

==Prompt Engineering==
* 2024-11: [https://arxiv.org/abs/2411.05778 LLMs as Method Actors: A Model for Prompt Engineering and Architecture]
  
 
==Proactive Search==
Compute expended after training, but before inference.

==Training Data (Data Refinement, Synthetic Data)==

===Generate consistent plans/thoughts===
* 2024-08: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers (code)
** (Microsoft) rStar is a self-play mutual reasoning approach: a small model augments MCTS using a set of defined reasoning heuristics, and mutually consistent trajectories are emphasized.
* 2024-09: [https://www.arxiv.org/abs/2409.04057 Self-Harmonized Chain of Thought]
** Produces refined chain-of-thought-style solutions/prompts for diverse problems: given a large set of problems/questions, first aggregate them semantically, then apply zero-shot chain-of-thought to each problem, then cross-pollinate between proposed solutions to similar problems to obtain refined and generalized solutions (see the sketch below).
* 2024-11: [https://arxiv.org/abs/2411.15862 LLMs Do Not Think Step-by-step In Implicit Reasoning]
** They argue that models trained to reproduce CoT outputs do not, internally, perform stepwise reasoning (with intermediate representations); this suggests that explicit CoT could be superior to implicit CoT.
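Returning to the Self-Harmonized Chain of Thought entry above, here is a rough, hypothetical sketch of the cluster-then-cross-pollinate idea it describes (stub llm()/embed() helpers and toy greedy clustering; not the paper's implementation):

<syntaxhighlight lang="python">
# Hypothetical sketch of a Self-Harmonized-CoT-style pipeline (not the paper's code).
# llm() and embed() are placeholder stubs; swap in a real model and embedding API.
import math

def llm(prompt: str) -> str:
    return f"[stubbed LLM answer to: {prompt[:50]}...]"   # placeholder model call

def embed(text: str) -> list[float]:
    v = [0.0] * 26                                        # toy bag-of-letters embedding
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - 97] += 1.0
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def cluster(questions, threshold=0.8):
    """Greedy semantic grouping: assign each question to the first similar centroid."""
    clusters = []
    for q in questions:
        e = embed(q)
        for c in clusters:
            if cosine(e, c["centroid"]) >= threshold:
                c["items"].append(q)
                break
        else:
            clusters.append({"centroid": e, "items": [q]})
    return clusters

def self_harmonize(questions):
    # 1) zero-shot CoT for every question
    cot = {q: llm(f"{q}\nLet's think step by step.") for q in questions}
    # 2) within each semantic cluster, cross-pollinate solutions into a refined, general one
    refined = {}
    for c in cluster(questions):
        peers = "\n---\n".join(cot[q] for q in c["items"])
        for q in c["items"]:
            refined[q] = llm(
                f"Question: {q}\nHere are step-by-step solutions to similar questions:\n"
                f"{peers}\nWrite a single refined, generalized step-by-step solution."
            )
    return refined

if __name__ == "__main__":
    print(self_harmonize(["What is 17 * 24?", "What is 13 * 31?", "Name the capital of France."]))
</syntaxhighlight>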
  
 
===Sampling===
  
 
===Automated prompt generation===
* 2024-09: [https://arxiv.org/abs/2409.13449 Minstrel: Structural Prompt Generation with Multi-Agents Coordination for Non-AI Experts]
  
 
===Distill inference-time-compute into model===
* 2024-02: [https://arxiv.org/abs/2402.04494 Grandmaster-Level Chess Without Search] (Google DeepMind)
* 2024-07: [https://arxiv.org/abs/2407.03181 Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models]
* 2024-07: [https://arxiv.org/abs/2407.06023 Distilling System 2 into System 1]
* 2024-07: [https://arxiv.org/abs/2407.14622 BOND: Aligning LLMs with Best-of-N Distillation]
* 2024-09: [https://arxiv.org/abs/2409.12917 Training Language Models to Self-Correct via Reinforcement Learning] (Google DeepMind)
* 2024-10: [https://arxiv.org/abs/2410.10630 Thinking LLMs: General Instruction Following with Thought Generation]
* 2024-10: [https://arxiv.org/abs/2410.09918 Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces]
* 2024-12: [https://arxiv.org/abs/2412.06769 Training Large Language Models to Reason in a Continuous Latent Space]
  
 
====CoT reasoning model====
* 2024-11: [https://arxiv.org/abs/2411.14405 Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions]
* 2024-11: [https://huggingface.co/papers/2411.16489 O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?]
* 2024-12: [https://arxiv.org/abs/2412.00154 o1-Coder: an o1 Replication for Coding] ([https://github.com/ADaM-BJTU/O1-CODER code])
  
 
===Scaling===

==Inference Time Compute==

===Methods===
* 2024-03: [https://arxiv.org/abs/2403.09629 Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking]
* 2024-11: [https://arxiv.org/pdf/2411.19865 Reverse Thinking Makes LLMs Stronger Reasoners]
* 2024-12: [https://arxiv.org/abs/2412.06769 Training Large Language Models to Reason in a Continuous Latent Space] (Chain of Continuous Thought, COCONUT)

'''Review'''
* 2024-06: [https://arxiv.org/abs/2406.16838 From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models]
  
 
===In context learning (ICL), search, and other inference-time methods===
* 2024-08: [https://arxiv.org/abs/2408.03314 Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters]
* 2024-10: (comparing fine-tuning to in-context learning) [https://arxiv.org/abs/2405.19874 Is In-Context Learning Sufficient for Instruction Following in LLMs?]
* 2024-11: [https://arxiv.org/abs/2411.17501 Inference Scaling FLaws: The Limits of LLM Resampling with Imperfect Verifiers]
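A common test-time-compute pattern underlying the entries above is best-of-N resampling against a verifier; the "Inference Scaling FLaws" paper stresses that gains are limited when the verifier is imperfect. A schematic sketch (hypothetical generate()/verify() stubs, not any paper's setup):

<syntaxhighlight lang="python">
# Hypothetical best-of-N resampling sketch (illustrative only).
# generate() and verify() are stubs standing in for a sampled LLM call and an
# imperfect verifier (e.g. unit tests, a reward model, or an LLM judge).
import random

def generate(prompt: str, temperature: float = 0.8) -> str:
    return f"candidate-{random.randint(0, 9)} for: {prompt}"      # stub sample

def verify(prompt: str, candidate: str) -> float:
    return random.random()       # stub score in [0, 1]; real verifiers are imperfect

def best_of_n(prompt: str, n: int = 16) -> str:
    """Sample n candidates and keep the one the verifier scores highest.
    With an imperfect verifier, gains saturate (or reverse) as n grows --
    the 'inference scaling flaw' discussed in the entry above."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verify(prompt, c))

if __name__ == "__main__":
    print(best_of_n("Write a function that reverses a string."))
</syntaxhighlight>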
  
 
===Theory===

===Expending compute works===
[[File:Compute.png]]

===Code for Inference-time Compute===
* optillm: Inference proxy which implements state-of-the-art techniques to improve accuracy and performance of LLMs (improved reasoning for coding, logical, and mathematical queries)

==Multi-agent Effort (and Emergent Intelligence)==
* 2024-10: [https://arxiv.org/abs/2410.11163 Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence]
* 2024-10: [https://arxiv.org/abs/2410.10934 Agent-as-a-Judge: Evaluate Agents with Agents]
* 2024-10: [https://arxiv.org/abs/2410.19318 Two are better than one: Context window extension with multi-grained self-injection]
* 2024-11: [https://arxiv.org/abs/2411.00114 Project Sid: Many-agent simulations toward AI civilization]
  
=Multi-agent orchestration=

==Research==

===Societies and Communities of AI agents===
* 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]
 
==Research demos==
* [https://github.com/camel-ai/camel Camel]
* 2024-06: [https://arxiv.org/abs/2406.11638 MASAI: Modular Architecture for Software-engineering AI Agents]
* 2024-10: [https://arxiv.org/abs/2410.08164 Agent S: An Open Agentic Framework that Uses Computers Like a Human] ([https://github.com/simular-ai/Agent-S code])
* 2024-10: [https://arxiv.org/abs/2410.20424 AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions]
  
 
===Related work===

===Metrics, Benchmarks===
* 2022-06: [https://arxiv.org/abs/2206.10498 PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change]
* 2023-06: [https://arxiv.org/abs/2306.05836 Can Large Language Models Infer Causation from Correlation?] (challenging Corr2Cause task)
* 2024-01: [https://microsoft.github.io/autogen/0.2/blog/2024/01/25/AutoGenBench/ AutoGenBench -- A Tool for Measuring and Evaluating AutoGen Agents]
* 2024-04: AutoRace ([https://github.com/maitrix-org/llm-reasoners code]): [https://arxiv.org/abs/2404.05221 LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models]
* 2024-04: [https://arxiv.org/abs/2404.07972 OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments] ([https://os-world.github.io/ github])
* 2024-10: SimpleQA: [https://cdn.openai.com/papers/simpleqa.pdf Measuring short-form factuality in large language models] ([https://openai.com/index/introducing-simpleqa/ announcement], [https://github.com/openai/simple-evals code])
* 2024-11: [https://metr.org/AI_R_D_Evaluation_Report.pdf RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts] ([https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/ blog], [https://github.com/METR/ai-rd-tasks/tree/main code])
* 2024-11: [https://arxiv.org/abs/2411.10323 The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use] ([https://github.com/showlab/computer_use_ootb code])
* 2024-11: [https://arxiv.org/abs/2411.13543 BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games]
* 2024-12: [https://arxiv.org/abs/2412.14161 TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks] ([https://github.com/TheAgentCompany/TheAgentCompany code], [https://the-agent-company.com/ project], [https://the-agent-company.com/#/leaderboard leaderboard])
* 2024-12: [https://arxiv.org/abs/2412.11936 A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges]

===Multi-agent===
* 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]
  
 
===Agent Challenges===
* [https://github.com/aidanmclaughlin/Aidan-Bench Aidan-Bench]: Tests creativity by having a particular LLM generate a long sequence of outputs (meant to be different from one another) and measuring how long it can go before duplications appear (see the sketch below).
** NeurIPS 2024 paper/poster: [https://openreview.net/pdf?id=fz969ahcvJ AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions]
* [https://x.com/paul_cal/status/1850262678712856764 Pictionary]: An LLM suggests a prompt, multiple LLMs generate outputs, and an LLM judges the results; this allows ranking of generation abilities.
* [https://github.com/mc-bench/orchestrator MC-bench]: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges.
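As a rough illustration of the Aidan-Bench idea (not the benchmark's actual code), the loop below keeps asking for new answers and stops once an answer is too similar to a previous one; llm() and the similarity measure are toy stubs:

<syntaxhighlight lang="python">
# Rough illustration of an Aidan-Bench-style novelty loop (not the benchmark's code).
# llm() and similarity() are stubs; the real benchmark's prompts, models, and
# similarity measure differ.
import math, random

def llm(prompt: str) -> str:
    return random.choice(["idea A", "idea B", "idea C", "idea D"])   # stub model call

def similarity(a: str, b: str) -> float:
    # toy character-overlap similarity standing in for embedding cosine similarity
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / math.sqrt((len(sa) * len(sb)) or 1)

def novelty_score(question: str, max_turns: int = 50, threshold: float = 0.95) -> int:
    """Count how many consecutive answers stay novel before a near-duplicate appears."""
    previous: list[str] = []
    for turn in range(max_turns):
        prompt = f"{question}\nGive an answer different from all of these:\n" + "\n".join(previous)
        answer = llm(prompt)
        if any(similarity(answer, p) >= threshold for p in previous):
            return turn          # stopped: the model repeated itself
        previous.append(answer)
    return max_turns

if __name__ == "__main__":
    print(novelty_score("Name a creative use for a brick."))
</syntaxhighlight>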
