Difference between revisions of "AI Agents"

From GISAXS
  
 
==Components of AI Assistants==

===Agent Internal Workflow Management===

* [https://github.com/langchain-ai/langchain LangChain]
* [https://github.com/pydantic/pydantic-ai Pydantic: Agent Framework / shim to use Pydantic with LLMs]
* [https://github.com/lmnr-ai/flow Flow: A lightweight task engine for building AI agents that prioritizes simplicity and flexibility]
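These frameworks differ in API, but all manage some variant of a plan/act/observe loop. Below is a minimal, library-agnostic sketch of that loop; the <code>toy_llm()</code> stub, the JSON tool-call convention, and the <code>search_docs</code> tool are illustrative assumptions, not any listed framework's interface.

<syntaxhighlight lang="python">
import json

def toy_llm(messages: list[dict]) -> str:
    """Stand-in for a chat-completion call: returns a JSON tool request once,
    then a plain-text final answer. Replace with a real model call."""
    if not any(m["role"] == "tool" for m in messages):
        return json.dumps({"tool": "search_docs",
                           "args": {"query": messages[0]["content"]}})
    return "Final answer, based on: " + messages[-1]["content"]

def search_docs(query: str) -> str:
    """Illustrative tool; a real agent might call a search API here."""
    return f"(top documents for {query!r})"

TOOLS = {"search_docs": search_docs}

def run_agent(task: str, max_steps: int = 5) -> str:
    """Ask the model; if it requests a tool, run it and feed the result back;
    if it replies with plain text, treat that as the final answer."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = toy_llm(messages)
        try:
            call = json.loads(reply)
            result = TOOLS[call["tool"]](**call["args"])
            messages.append({"role": "tool", "content": result})
        except (json.JSONDecodeError, KeyError, TypeError):
            return reply  # not a tool request: final answer
    return "(step limit reached)"

print(run_agent("Summarize recent work on agentic RAG."))
</syntaxhighlight>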
  
 
===Information Retrieval===

* See also [[AI_tools#Retrieval_Augmented_Generation_.28RAG.29|RAG]].
* 2024-10: [https://arxiv.org/abs/2410.09713 Agentic Information Retrieval]
===Control (tool-use, computer use, etc.)===

* Anthropic [https://www.anthropic.com/news/model-context-protocol Model Context Protocol] (MCP)
** [https://github.com/jlowin/fastmcp FastMCP]: The fast, Pythonic way to build MCP servers
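As a concrete example of exposing a tool over MCP, here is a minimal server sketch in the style of the FastMCP README; decorator and method names follow that README, and exact details may differ between FastMCP versions.

<syntaxhighlight lang="python">
# Minimal MCP server sketch in the style of the FastMCP README.
from fastmcp import FastMCP

mcp = FastMCP("Demo Server")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers (exposed to MCP clients as a callable tool)."""
    return a + b

if __name__ == "__main__":
    mcp.run()  # serve over stdio so an MCP client can connect
</syntaxhighlight>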
  
 
===Open-source===

==LLM-as-judge==
 
* 2024-10: [https://arxiv.org/abs/2410.10934 Agent-as-a-Judge: Evaluate Agents with Agents]
* 2024-11: [https://arxiv.org/abs/2411.15594 A Survey on LLM-as-a-Judge]
* 2024-12: [https://arxiv.org/abs/2412.05579 LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods]
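For orientation, a minimal sketch of pairwise LLM-as-judge evaluation with position-swapping and majority voting is shown below. The judge prompt and the toy <code>judge_llm()</code> stub are illustrative assumptions, not the protocol of any one paper above.

<syntaxhighlight lang="python">
from collections import Counter

JUDGE_PROMPT = """You are an impartial judge. Given a question and two answers,
reply with exactly "A" or "B" for the better answer.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

def judge_llm(prompt: str) -> str:
    """Toy stand-in for a judge model: prefers the longer answer.
    Replace with a real chat-completion call."""
    a = prompt.split("Answer A: ")[1].split("\nAnswer B: ")[0]
    b = prompt.split("Answer B: ")[1]
    return "A" if len(a) >= len(b) else "B"

def judge_pair(question: str, answer_a: str, answer_b: str, n_trials: int = 3) -> str:
    """Query the judge several times, swapping answer order on alternate
    trials to reduce position bias, then majority-vote the verdicts."""
    votes = []
    for i in range(n_trials):
        swapped = i % 2 == 1
        a, b = (answer_b, answer_a) if swapped else (answer_a, answer_b)
        verdict = judge_llm(JUDGE_PROMPT.format(question=question,
                                                answer_a=a, answer_b=b)).strip()
        if swapped and verdict in ("A", "B"):
            verdict = "B" if verdict == "A" else "A"  # map back to original labels
        votes.append(verdict)
    return Counter(votes).most_common(1)[0][0]

print(judge_pair("What is 2+2?", "4", "The answer is 4 because 2+2=4."))
</syntaxhighlight>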
  
 
=Advanced Workflows=
  
 
=Increasing AI Agent Intelligence=

==Reviews==

* 2024-12: [https://arxiv.org/abs/2412.11936 A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges]
  
 
==Prompt Engineering==

===Distill inference-time-compute into model===
 
* 2024-02: [https://arxiv.org/abs/2402.04494 Grandmaster-Level Chess Without Search] (Google DeepMind)
* 2024-07: [https://arxiv.org/abs/2407.03181 Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models]
* 2024-07: [https://arxiv.org/abs/2407.06023 Distilling System 2 into System 1]
* 2024-07: [https://arxiv.org/abs/2407.14622 BOND: Aligning LLMs with Best-of-N Distillation]
* 2024-09: [https://arxiv.org/abs/2409.12917 Training Language Models to Self-Correct via Reinforcement Learning] (Google DeepMind)
* 2024-10: [https://arxiv.org/abs/2410.10630 Thinking LLMs: General Instruction Following with Thought Generation]
* 2024-10: [https://arxiv.org/abs/2410.09918 Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces]
* 2024-12: [https://arxiv.org/abs/2412.06769 Training Large Language Models to Reason in a Continuous Latent Space]
  
 
====CoT reasoning model====
 
* 2024-11: [https://arxiv.org/abs/2411.14405 Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions]
* 2024-11: [https://huggingface.co/papers/2411.16489 O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?]
* 2024-12: [https://arxiv.org/abs/2412.00154 o1-Coder: an o1 Replication for Coding] ([https://github.com/ADaM-BJTU/O1-CODER code])
  
 
===Scaling===

===Inference Time Compute===
 
* 2024-03: [https://arxiv.org/abs/2403.09629 Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking]
* 2024-11: [https://arxiv.org/pdf/2411.19865 Reverse Thinking Makes LLMs Stronger Reasoners]
* 2024-12: [https://arxiv.org/abs/2412.06769 Training Large Language Models to Reason in a Continuous Latent Space] (Chain of Continuous Thought, COCONUT)

'''Review'''
* 2024-06: [https://arxiv.org/abs/2406.16838 From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models]
  
 
===In context learning (ICL), search, and other inference-time methods===
  
 
=Multi-agent orchestration=

==Research==

===Societies and Communities of AI agents===

* 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]
 
==Research demos==

* [https://github.com/camel-ai/camel Camel]

==Metrics, Benchmarks==
 
* 2022-06: [https://arxiv.org/abs/2206.10498 PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change]
* 2023-06: [https://arxiv.org/abs/2306.05836 Can Large Language Models Infer Causation from Correlation?] (challenging Corr2Cause task)
* 2024-01: [https://microsoft.github.io/autogen/0.2/blog/2024/01/25/AutoGenBench/ AutoGenBench -- A Tool for Measuring and Evaluating AutoGen Agents]
* 2024-04: AutoRace ([https://github.com/maitrix-org/llm-reasoners code]): [https://arxiv.org/abs/2404.05221 LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models]
* 2024-04: [https://arxiv.org/abs/2404.07972 OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments] ([https://os-world.github.io/ github])
 
* 2024-11: [https://metr.org/AI_R_D_Evaluation_Report.pdf RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts] ([https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/ blog], [https://github.com/METR/ai-rd-tasks/tree/main code])
* 2024-11: [https://arxiv.org/abs/2411.10323 The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use] ([https://github.com/showlab/computer_use_ootb code])
* 2024-11: [https://arxiv.org/abs/2411.13543 BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games]
* 2024-12: [https://arxiv.org/abs/2412.14161 TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks] ([https://github.com/TheAgentCompany/TheAgentCompany code], [https://the-agent-company.com/ project], [https://the-agent-company.com/#/leaderboard leaderboard])
* 2024-12: [https://arxiv.org/abs/2412.11936 A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges]

===Multi-agent===

* 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]
  
 
===Agent Challenges===

* [https://github.com/aidanmclaughlin/Aidan-Bench Aidan-Bench]: Tests creativity by having a particular LLM generate a long sequence of outputs (meant to be distinct from one another), and measuring how long it can go before duplicates appear (a minimal sketch of this loop follows this list).
** NeurIPS 2024 paper/poster: [https://openreview.net/pdf?id=fz969ahcvJ AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions]
* [https://x.com/paul_cal/status/1850262678712856764 Pictionary]: An LLM suggests a prompt, multiple LLMs generate outputs, and an LLM judges the results; this allows ranking of the models' generation abilities.
* [https://github.com/mc-bench/orchestrator MC-bench]: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges.
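As an illustration of the Aidan-Bench idea (not the benchmark's actual implementation, which scores novelty with embeddings and an LLM judge), here is a dependency-free sketch of the generate-until-duplicate loop; the string-similarity check and the toy idea generator are assumptions made only to keep the sketch self-contained.

<syntaxhighlight lang="python">
from difflib import SequenceMatcher

def too_similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """Crude string-similarity check; AidanBench itself uses embedding and
    LLM-based novelty scores, but this keeps the sketch dependency-free."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def novelty_score(generate, question: str, max_turns: int = 100) -> int:
    """generate(question, previous_answers) -> str is a model call.
    Keep asking for new answers; stop at the first near-duplicate.
    The score is how many novel answers were produced before that."""
    answers: list[str] = []
    for _ in range(max_turns):
        candidate = generate(question, answers)
        if any(too_similar(candidate, prev) for prev in answers):
            break
        answers.append(candidate)
    return len(answers)

# Toy model that runs out of ideas after three distinct answers:
ideas = ["use it as a paperweight", "prop open a door", "build a tiny shelf",
         "use it as a paperweight"]
print(novelty_score(lambda q, prev: ideas[min(len(prev), len(ideas) - 1)],
                    "Name uses for a brick."))
</syntaxhighlight>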

Latest revision as of 12:28, 20 December 2024

Contents

Reviews & Perspectives

Published

Continually updating

Analysis/Opinions

AI Assistants

Components of AI Assistants

Agent Internal Workflow Management

Information Retrieval

Control (tool-use, computer use, etc.)

Open-source

Personalities/Personas

Specific Uses for AI Assistants

Computer Use

Software Engineering

Science Agents

See Science Agents.

LLM-as-judge

Advanced Workflows

Software Development Workflows

Several paradigms of AI-assisted coding have arisen:

  1. Manual, human driven
  2. AI-aided through chat/dialogue, where the human asks for code and then copies it into the project
    1. OpenAI ChatGPT
    2. Anthropic Claude
  3. API calls to an LLM, which generates code and inserts the file into the project
  4. LLM-integration into the IDE
    1. Copilot
    2. Qodo (Codium) & AlphaCodium (preprint, code)
    3. Cursor
    4. Codeium Windsurf (with "Cascade" AI Agent)
  5. AI-assisted IDE, where the AI generates and manages the dev environment
    1. Replit
    2. Aider (code): Pair programming on the command line
    3. Pythagora
    4. StackBlitz bolt.new
    5. Cline (formerly Claude Dev)
  6. Prompt-to-product
    1. Github Spark (demo video)
  7. Semi-autonomous software engineer agents
    1. Devin (Cognition AI)
    2. Amazon Q
    3. Honeycomb

For a review of the current state of software-engineering agentic approaches, see:

Corporate AI Agent Ventures

Mundane Workflows and Capabilities

Inference-compute Reasoning

Agentic Systems

Increasing AI Agent Intelligence

Reviews

Prompt Engineering

Proactive Search

Compute expended after training, but before inference.

Training Data (Data Refinement, Synthetic Data)

Generate consistent plans/thoughts

  • 2024-08: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers (code)
    • (Microsoft) rStar is a self-play mutual reasoning approach: a small model augments MCTS with a set of defined reasoning heuristics, and mutually consistent trajectories are then emphasized (a minimal consistency-voting sketch follows this list).
  • 2024-09: Self-Harmonized Chain of Thought
    • Produces refined chain-of-thought style solutions/prompts for diverse problems. Given a large set of problems/questions, first aggregate them semantically, then apply zero-shot chain-of-thought to each problem. Finally, cross-pollinate between proposed solutions to similar problems, looking for refined and generalized solutions.
  • 2024-11: LLMs Do Not Think Step-by-step In Implicit Reasoning
    • They argue that models trained to reproduce CoT outputs do not, internally, perform stepwise reasoning (with intermediate representations); this suggests that explicit CoT could be superior to implicit CoT.
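Below is a minimal sketch of the "emphasize mutually consistent trajectories" idea, reduced to self-consistency-style majority voting over sampled chains of thought. This is an illustration only, not rStar's MCTS or Self-Harmonized CoT; the sample_llm() stub and the "Answer:" convention are assumptions.

<syntaxhighlight lang="python">
from collections import Counter
import random

def sample_llm(prompt: str) -> str:
    """Toy stand-in for sampling a reasoning trajectory at temperature > 0.
    Replace with a real model call."""
    answer = random.choice(["4", "4", "4", "5"])  # mostly-consistent toy outputs
    return f"2 + 2 means adding two and two.\nAnswer: {answer}"

def final_answer(trajectory: str) -> str:
    """Extract the final answer from a 'reasoning ... Answer: X' trajectory."""
    return trajectory.rpartition("Answer:")[2].strip()

def consistent_answer(question: str, n_samples: int = 8) -> str:
    """Sample several trajectories and keep the answer that the most
    trajectories agree on, i.e. emphasize mutual consistency."""
    prompt = f"Think step by step, then finish with 'Answer: ...'\n{question}"
    answers = [final_answer(sample_llm(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(consistent_answer("What is 2 + 2?"))
</syntaxhighlight>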

Sampling

Automated prompt generation

Distill inference-time-compute into model

CoT reasoning model

Scaling

Inference Time Compute

Methods

Review

In context learning (ICL), search, and other inference-time methods

Inference-time Sampling

Inference-time Gradient

Self-prompting

In-context thought

Naive multi-LLM (verification, majority voting, best-of-N, etc.)

Multi-LLM (multiple comparisons, branching, etc.)

Iteration (e.g. neural-like layered blocks)

Iterative reasoning via graphs

Monte Carlo Tree Search (MCTS)

Other Search

Scaling

Theory

Expending compute works

[Figure: Compute.png]

Code for Inference-time Compute

  • optillm: An inference proxy that implements state-of-the-art inference-time techniques to improve the accuracy and performance of LLMs (improving reasoning on coding, logical, and mathematical queries); a minimal proxy sketch follows.
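The proxy pattern itself is simple to sketch: wrap the base completion call and transparently apply an inference-time technique, such as best-of-N re-ranking, before returning. This is illustrative only, not optillm's actual code; complete() and score() are hypothetical caller-supplied functions.

<syntaxhighlight lang="python">
import random

def best_of_n_proxy(complete, score, n: int = 4):
    """Wrap a base completion function so callers transparently get the best
    of n sampled candidates, as ranked by a scoring function."""
    def proxied_complete(prompt: str) -> str:
        candidates = [complete(prompt) for _ in range(n)]
        return max(candidates, key=lambda answer: score(prompt, answer))
    return proxied_complete

# Toy usage: 'complete' samples noisy answers, 'score' prefers shorter ones.
complete = lambda prompt: "answer " + "very " * random.randint(0, 5) + "short"
score = lambda prompt, answer: -len(answer)
ask = best_of_n_proxy(complete, score)
print(ask("Explain inference-time compute in one line."))
</syntaxhighlight>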

Memory

Tool Use

Multi-agent Effort (and Emergent Intelligence)

ML-like Optimization of LLM Setup

Multi-agent orchestration

Research

Societies and Communities of AI agents

Research demos

Related work

Inter-agent communications

Architectures

Open Source Frameworks

Open Source Systems

Commercial Automation Frameworks

Spreadsheet

Cloud solutions

Frameworks

Optimization

Metrics, Benchmarks

Multi-agent

Agent Challenges


Automated Improvement

See Also