Latest revision as of 12:28, 20 December 2024

Reviews & Perspectives

Published

2024-04: LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models (code)
2024-08: From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future
2024-09: Towards a Science Exocortex
2024-09: Large Language Model-Based Agents for Software Engineering: A Survey
2024-09: Agents in Software Engineering: Survey, Landscape, and Vision

Continually updating

Analysis/Opinions

AI Assistants

Components of AI Assistants

Agent Internal Workflow Management

Information Retrieval

See also RAG.
2024-10: Agentic Information Retrieval

Control (tool-use, computer use, etc.)

Anthropic Model Context Protocol (MCP)
- FastMCP: The fast, Pythonic way to build MCP servers

Open-source

Khoj (code): self-hostable AI assistant
RAGapp: Agentic RAG for enterprise
STORM: Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking
- Can write (e.g.) Wikipedia-style articles
- code
- Preprint: Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models

Personalities/Personas

2023-10: Generative Agents: Interactive Simulacra of Human Behavior
2024-11: Microsoft TinyTroupe 🤠🤓🥸🧐: LLM-powered multiagent persona simulation for imagination enhancement and business insights
2024-11: Generative Agent Simulations of 1,000 People (code)

Specific Uses for AI Assistants

LLM-as-judge

Advanced Workflows

Salesforce DEI: meta-system that leverages a diversity of SWE agents
- Preprint: Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents
Sakana AI: AI Scientist
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
- code
SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning
- code

Software Development Workflows

Several paradigms of AI-assisted coding have arisen:

Manual, human driven
AI-aided through chat/dialogue, where the human asks for code and then copies it into the project
1. OpenAI ChatGPT
2. Anthropic Claude
API calls to an LLM, which generates code and inserts the file into the project
LLM-integration into the IDE
1. Copilot
2. Qodo (Codium) & AlphaCodium (preprint, code)
3. Cursor
4. Codeium Windsurf (with "Cascade" AI Agent)
AI-assisted IDE, where the AI generates and manages the dev environment
1. Replit
2. Aider (code): Pair programming on the command line
3. Pythagora
4. StackBlitz bolt.new
5. Cline (formerly Claude Dev)
Prompt-to-product
1. Github Spark (demo video)
Semi-autonomous software engineer agents
1. Devin (Cognition AI)
2. Amazon Q
3. Honeycomb

For a review of the current state of software-engineering agentic approaches, see:

Corporate AI Agent Ventures

Mundane Workflows and Capabilities

Payman AI: AI to Human platform that allows AI to pay people for what it needs
VoiceFlow: Build customer experiences with AI
Mistral AI: genAI applications
Taskade: Task/milestone software with AI agent workflows
Covalent: Building a Multi-Agent Prompt Refining Application

Inference-compute Reasoning

Nous Research: Forge Reasoning API Beta

Agentic Systems

Topology AI
Cognition AI: Devin software engineer (14% on SWE-bench)
Honeycomb (22% on SWE-bench)

Increasing AI Agent Intelligence

Reviews

2024-12: A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges

Prompt Engineering

2024-11: LLMs as Method Actors: A Model for Prompt Engineering and Architecture

Proactive Search

Compute expended after training, but before inference.

Training Data (Data Refinement, Synthetic Data)

Cf. image datasets:
- 2023-06: StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners
- 2023-11: DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback
2024-09: Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
2024-10: Data Cleaning Using Large Language Models
Updating list of links: Synthetic Data of LLMs, by LLMs, for LLMs

Generate consistent plans/thoughts

2024-08: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers (code)
- (Microsoft) rStar is a self-play mutual reasoning approach. A small model adds to MCTS using some defined reasoning heuristics. Mutually consistent trajectories can be emphasized.
2024-09: Self-Harmonized Chain of Thought
- Produces refined chain-of-thought style solutions/prompts for diverse problems. Given a large set of problems/questions, first aggregate them semantically, then apply zero-shot chain-of-thought to each problem. Then cross-pollinate between proposed solutions to similar problems, looking for refined and generalized solutions.
2024-11: LLMs Do Not Think Step-by-step In Implicit Reasoning
- They argue that models trained to reproduce CoT outputs do not, internally, perform stepwise reasoning (with intermediate representations); this suggests that explicit CoT could be superior to implicit CoT.

Sampling

2024-11: Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding (code)

Automated prompt generation

2024-09: Minstrel: Structural Prompt Generation with Multi-Agents Coordination for Non-AI Experts

Distill inference-time-compute into model

2023-10: Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning (U. Maryland, Adobe)
2023-11: Implicit Chain of Thought Reasoning via Knowledge Distillation (Harvard, Microsoft, Hopkins)
2024-02: Grandmaster-Level Chess Without Search (Google DeepMind)
2024-07: Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models
2024-07: Distilling System 2 into System 1
2024-07: BOND: Aligning LLMs with Best-of-N Distillation
2024-09: Training Language Models to Self-Correct via Reinforcement Learning (Google DeepMind)
2024-10: Thinking LLMs: General Instruction Following with Thought Generation
2024-10: Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces
2024-12: Training Large Language Models to Reason in a Continuous Latent Space

CoT reasoning model

2024-09: OpenAI o1
2024-10: O1 Replication Journey: A Strategic Progress Report – Part 1 (code): Attempt by Walnut Plan to reproduce o1-like in-context reasoning
2024-11: DeepSeek-R1-Lite-Preview reasoning model
2024-11: Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
2024-11: O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
2024-12: o1-Coder: an o1 Replication for Coding (code)

Scaling

2024-08: Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling (Google DeepMind)
2024-11: Scaling Laws for Pre-training Agents and World Models

Inference Time Compute

Methods

2024-03: Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
2024-11: Reverse Thinking Makes LLMs Stronger Reasoners
2024-12: Training Large Language Models to Reason in a Continuous Latent Space (Chain of Continuous Thought, COCONUT)

Review

2024-06: From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

In context learning (ICL), search, and other inference-time methods

2023-03: Reflexion: Language Agents with Verbal Reinforcement Learning
2023-05: VOYAGER: An Open-Ended Embodied Agent with Large Language Models
2024-04: Many-Shot In-Context Learning
2024-08: Automated Design of Agentic Systems
2024-09: Planning In Natural Language Improves LLM Search For Code Generation

Inference-time Sampling

Inference-time Gradient

2024-11: The Surprising Effectiveness of Test-Time Training for Abstract Reasoning (code)

Self-prompting

In-context thought

2022-01: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Google Brain)
2023-05: Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Google DeepMind)
2024-05: Faithful Logical Reasoning via Symbolic Chain-of-Thought
2024-06: A Tree-of-Thoughts to Broaden Multi-step Reasoning across Languages
2024-09: To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
2024-09: Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning (Agnostiq, Toronto)
2024-09: Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models
2024-10: A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration (failed reasoning traces can improve CoT)
2024-10: Tree of Problems: Improving structured problem solving with compositionality
2023-01/2024-10: A Survey on In-context Learning

Naive multi-LLM (verification, majority voting, best-of-N, etc.)
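
The two simplest aggregation schemes named in this heading can be sketched in a few lines. This is a minimal illustration, not taken from any of the listed papers: the sampled answers are plain strings, and `score` is a hypothetical verifier or reward function supplied by the caller.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("answers must be non-empty")
    counts = Counter(answers)
    top = max(counts.values())
    # Scan in the original order so that ties are broken deterministically.
    for answer in answers:
        if counts[answer] == top:
            return answer
    raise AssertionError("unreachable")


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate that the (hypothetical) scorer rates highest."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=score)


if __name__ == "__main__":
    # Stand-ins for several independent samples from one or more LLMs.
    samples = ["42", "41", "42", "7"]
    print(majority_vote(samples))
    # Toy scorer (an assumption): prefers answers numerically close to 40.
    print(best_of_n(samples, lambda s: -abs(int(s) - 40)))
```

Majority voting needs no extra model but only works when answers can be compared for equality; best-of-N handles free-form outputs but is only as good as the scorer.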

Multi-LLM (multiple comparisons, branching, etc.)

2024-10: Thinking LLMs: General Instruction Following with Thought Generation
2024-11: Mixtures of In-Context Learners: Multiple "experts", each with a different set of in-context examples; combine outputs at the level of next-token-prediction
2024-11: LLaVA-o1: Let Vision Language Models Reason Step-by-Step (code)
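
The idea of combining several in-context "experts" at the level of next-token prediction can be illustrated with a toy mixture. This sketch assumes each expert has already produced a next-token probability distribution (here a plain dict) and that the combination is a linear mixture with fixed weights. Both are illustrative assumptions; the listed paper's actual weighting scheme is not described on this page.

```python
from typing import Dict, Sequence


def mix_next_token(
    expert_probs: Sequence[Dict[str, float]],
    weights: Sequence[float],
) -> Dict[str, float]:
    """Weighted linear mixture of per-expert next-token distributions."""
    if not expert_probs or len(expert_probs) != len(weights):
        raise ValueError("need one weight per expert and at least one expert")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    mixed: Dict[str, float] = {}
    for probs, weight in zip(expert_probs, weights):
        for token, p in probs.items():
            # Normalize weights so the result is still a distribution.
            mixed[token] = mixed.get(token, 0.0) + (weight / total) * p
    return mixed


if __name__ == "__main__":
    # Hypothetical distributions from two experts that saw different demos.
    expert_a = {"yes": 0.9, "no": 0.1}
    expert_b = {"yes": 0.2, "no": 0.8}
    mixed = mix_next_token([expert_a, expert_b], weights=[3.0, 1.0])
    # Greedy decoding step over the mixed distribution.
    print(max(mixed, key=mixed.get))
```

In a real system the distributions would come from separate forward passes, one per set of in-context examples, and the mixture would be recomputed at every decoding step.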

Iteration (e.g. neural-like layered blocks)

2024-06: Mixture-of-Agents Enhances Large Language Model Capabilities

Iterative reasoning via graphs

2023-08: Graph of Thoughts: Solving Elaborate Problems with Large Language Models
2023-10: Amortizing intractable inference in large language models (code)
2024-09: On the Diagram of Thought: Iterative reasoning as a directed acyclic graph (DAG)

Monte Carlo Tree Search (MCTS)

Other Search

2024-11: Scattered Forest Search: Smarter Code Space Exploration with LLMs

Scaling

2021-04: Scaling Scaling Laws with Board Games
2024-03: Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems
2024-04: The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
2024-07: Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
2024-08: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
2024-08: Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
2024-10: (comparing fine-tuning to in-context learning) Is In-Context Learning Sufficient for Instruction Following in LLMs?
2024-11: Inference Scaling FLaws: The Limits of LLM Resampling with Imperfect Verifiers

Theory

2024-02: Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

Expending compute works

2024-06-10: Blog post (opinion): AI Search: The Bitter-er Lesson
2024-07-17: Blog post (test): Getting 50% (SoTA) on ARC-AGI with GPT-4o
2024-09-12: OpenAI o1: Learning to Reason with LLMs

2024-09-16: Scaling: The State of Play in AI

Code for Inference-time Compute

optillm: Inference proxy which implements state-of-the-art techniques to improve accuracy and performance of LLMs (improve reasoning over coding, logical and mathematical queries)

Memory

2024-10: Retriever-and-Memory: Towards Adaptive Note-Enhanced Retrieval-Augmented Generation

Tool Use

2024-11: DynaSaur: Large Language Agents Beyond Predefined Actions: writes functions/code to increase capabilities

Multi-agent Effort (and Emergent Intelligence)

ML-like Optimization of LLM Setup

2023-03: DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines (code: Programming—not prompting—Foundation Models)
2024-05: Automatic Prompt Optimization with "Gradient Descent" and Beam Search
2024-06: TextGrad: Automatic "Differentiation" via Text (gradient backpropagation through text)
2024-06: Symbolic Learning Enables Self-Evolving Agents (optimize LLM frameworks)

Multi-agent orchestration

Research

Societies and Communities of AI agents

2024-12: Cultural Evolution of Cooperation among LLM Agents

Research demos

Camel
LoopGPT
JARVIS
OpenAGI
AutoGen
- preprint: AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
- Agent-E: Browser (eventually computer) automation (code, preprint, demo video)
- AutoGen Studio: GUI for agent workflows (code)
- Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
AG2 (previously AutoGen) (code, docs, Discord)
TaskWeaver
MetaGPT
AutoGPT (code); and AutoGPT Platform
Optima
- preprint: Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System
- code
2024-04: LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models (code)
2024-06: MASAI: Modular Architecture for Software-engineering AI Agents
2024-10: Agent S: An Open Agentic Framework that Uses Computers Like a Human (code)
2024-10: AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

Related work

2024-07: PersonaGym: Evaluating Persona Agents and LLMs

Inter-agent communications

2024-10: Agora: A Scalable Communication Protocol for Networks of Large Language Models (preprint): disparate agents auto-negotiate communication protocol
2024-11: DroidSpeak: Enhancing Cross-LLM Communication: Exploits caches of embeddings and key-values, to allow context to be more easily transferred between AIs (without consuming context window)
2024-11: Anthropic describes Model Context Protocol: an open standard for secure, two-way connections between data sources and AI (intro, quickstart, code)

Architectures

Open Source Frameworks

LangChain
ell (code, docs)
AgentOps AI AgentStack
Agent UI
kyegomez swarms
OpenAI Swarm (cookbook)
Amazon AWS Multi-Agent Orchestrator
KaibanJS: Kanban for AI Agents? (Takes inspiration from Kanban visual work management.)

Open Source Systems

ControlFlow
- documentation
- code
OpenHands (formerly OpenDevin)
- code: platform for autonomous software engineers, powered by AI and LLMs
- Report: OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

Commercial Automation Frameworks

Lutra: Automation and integration with various web systems.
Gumloop
TextQL: Enterprise Virtual Data Analyst
Athena intelligence: Analytics platform
Nexus GPT: Business co-pilot
Multi-On: AI agent that acts on your behalf
Firecrawl: Turn websites into LLM-ready data
Reworkd: End-to-end data extraction
Lindy: Custom AI Assistants to automate business workflows
- E.g. use Slack
Bardeen: Automate workflows
Abacus: AI Agents
LlamaIndex: (𝕏, code, docs, Discord)
MultiOn AI: Agent Q (paper) automated planning and execution

Spreadsheet

Cloud solutions

Numbers Station Meadow: agentic framework for data workflows (code).
CrewAI says they provide multi-agent automations (code).
LangChain introduced LangGraph to help build agents, and LangGraph Cloud as a service for running those agents.
- LangGraph Studio is an IDE for agent workflows
C3 AI enterprise platform
Deepset AI Haystack (docs, code)

Frameworks

Google Project Oscar
- Agent: Gaby (for "Go AI bot") (code, documentation) helps with issue tracking.
OpenPlexity-Pages: Data-aggregator implementation (like Perplexity) based on CrewAI

Optimization

Metrics, Benchmarks

2022-06: PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change
2023-06: Can Large Language Models Infer Causation from Correlation? (challenging Corr2Cause task)
2024-01: AutoGenBench -- A Tool for Measuring and Evaluating AutoGen Agents
2024-04: AutoRace (code): LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models
2024-04: OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (github)
2024-07: AI Agents That Matter
2024-09: CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark (leaderboard)
2024-09: LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
2024-09: On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability
2024-10: MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
2024-10: WorFBench: Benchmarking Agentic Workflow Generation
2024-10: VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models
2024-10: SimpleQA: Measuring short-form factuality in large language models (announcement, code)
2024-11: RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts (blog, code)
2024-11: The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use (code)
2024-11: BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
2024-12: TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks (code, project, leaderboard)
2024-12: A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges

Multi-agent

2024-12: Cultural Evolution of Cooperation among LLM Agents

Agent Challenges

Aidan-Bench: Test creativity by having a particular LLM generate a long sequence of outputs (meant to be different), and measuring how long it can go before duplications appear.
- NeurIPS 2024 paper/poster: AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions
Pictionary: LLM suggests prompt, multiple LLMs generate outputs, LLM judges; allows ranking of the generation abilities.
MC-bench: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges.
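
The "how long before duplications appear" measurement described for Aidan-Bench can be sketched as a novelty streak. This is a rough illustration only: it uses character-level similarity from Python's standard `difflib` as a stand-in for whatever similarity measure the benchmark actually uses, and the `threshold` value is an arbitrary assumption.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


def novelty_streak(outputs: Iterable[str], threshold: float = 0.9) -> int:
    """Count outputs produced before one is too similar to an earlier one."""
    seen: List[str] = []
    for text in outputs:
        # Normalize case and whitespace so trivial variations count as repeats.
        norm = " ".join(text.lower().split())
        if any(SequenceMatcher(None, norm, prev).ratio() >= threshold for prev in seen):
            break
        seen.append(norm)
    return len(seen)


if __name__ == "__main__":
    # Stand-ins for successive LLM answers to one open-ended question.
    answers = ["use a kite", "build a dam", "Use a  kite", "plant trees"]
    print(novelty_streak(answers))  # prints 2: the third answer repeats the first
```

A longer streak suggests the model can keep producing distinct ideas; the score is sensitive to the similarity measure and threshold, so both should be reported alongside it.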

Automated Improvement

2024-06: EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms
2024-06: Symbolic Learning Enables Self-Evolving Agents
2024-08: Automated Design of Agentic Systems (ADAS code)
2024-08: Self-Taught Evaluators: Iterative self-improvement through generation of synthetic data and evaluation

@@ Line 19: / Line 19: @@
 ==Components of AI Assistants==
+===Agent Internal Workflow Management===
+* [https://github.com/langchain-ai/langchain LangChain]
+* [https://github.com/pydantic/pydantic-ai Pydantic: Agent Framework / shim to use Pydantic with LLMs]
+* [https://github.com/lmnr-ai/flow Flow: A lightweight task engine for building AI agents that prioritizes simplicity and flexibility]
 ===Information Retrieval===
 * See also [[AI_tools#Retrieval_Augmented_Generation_.28RAG.29|RAG]].
 * 2024-10: [https://arxiv.org/abs/2410.09713 Agentic Information Retrieval]
+===Control (tool-use, computer use, etc.)===
+* Anthropic [https://www.anthropic.com/news/model-context-protocol Model Context Protocol] (MCP)
+** [https://github.com/jlowin/fastmcp FastMCP]: The fast, Pythonic way to build MCP servers
 ===Open-source===
@@ Line 41: / Line 50: @@
 ===Computer Use===
 * 2024-11: [https://arxiv.org/abs/2411.10323 The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use] ([https://github.com/showlab/computer_use_ootb code])
+===Software Engineering===
+* 2024-11: [https://github.com/MLSysOps/MLE-agent MLE-Agent: Your intelligent companion for seamless AI engineering and research]
 ===Science Agents===
@@ Line 50: / Line 62: @@
 * [https://eugeneyan.com/writing/llm-evaluators/ Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)]
 * 2024-10: [https://arxiv.org/abs/2410.10934 Agent-as-a-Judge: Evaluate Agents with Agents]
+* 2024-11: [https://arxiv.org/abs/2411.15594 A Survey on LLM-as-a-Judge]
+* 2024-12: [https://arxiv.org/abs/2412.05579 LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods]
 =Advanced Workflows=
@@ Line 107: / Line 121: @@
 =Increasing AI Agent Intelligence=
+==Reviews==
+* 2024-12: [https://arxiv.org/abs/2412.11936 A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges]
+==Prompt Engineering==
+* 2024-11: [https://arxiv.org/abs/2411.05778 LLMs as Method Actors: A Model for Prompt Engineering and Architecture]
 ==Proactive Search==
@@ Line 124: / Line 144: @@
 * 2024-09: [https://www.arxiv.org/abs/2409.04057 Self-Harmonized Chain of Thought]
 ** Produces refined chain-of-thought style solutions/prompts for diverse problems. Given a large set of problems/questions, first aggregate them semantically, then apply zero-shot chain-of-thought to each problem. Then cross-pollinate between proposed solutions to similar problems, looking for refined and generalized solutions.
+* 2024-11: [https://arxiv.org/abs/2411.15862 LLMs Do Not Think Step-by-step In Implicit Reasoning]
+** They argue that models trained to reproduce CoT outputs do not, internally, perform stepwise reasoning (with intermediate representations); this suggests that explicit CoT could be superior to implicit CoT.
 ===Sampling===
@@ Line 129: / Line 151: @@
 ===Automated prompt generation===
-* 2024-09: [https://arxiv.org/abs/2409.13449 Minstrel: Structural Prompt Generation with Multi-Agents Coordination for Non-AI Experts] (
+* 2024-09: [https://arxiv.org/abs/2409.13449 Minstrel: Structural Prompt Generation with Multi-Agents Coordination for Non-AI Experts]
 ===Distill inference-time-compute into model===
@@ Line 136: / Line 158: @@
 * 2024-02: [https://arxiv.org/abs/2402.04494 Grandmaster-Level Chess Without Search] (Google DeepMind)
 * 2024-07: [https://arxiv.org/abs/2407.03181 Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models]
+* 2024-07: [https://arxiv.org/abs/2407.06023 Distilling System 2 into System 1]
 * 2024-07: [https://arxiv.org/abs/2407.14622 BOND: Aligning LLMs with Best-of-N Distillation]
 * 2024-09: [https://arxiv.org/abs/2409.12917 Training Language Models to Self-Correct via Reinforcement Learning] (Google DeepMind)
 * 2024-10: [https://arxiv.org/abs/2410.10630 Thinking LLMs: General Instruction Following with Thought Generation]
 * 2024-10: [https://arxiv.org/abs/2410.09918 Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces]
+* 2024-12: [https://arxiv.org/abs/2412.06769 Training Large Language Models to Reason in a Continuous Latent Space]
 ====CoT reasoning model====
@@ Line 146: / Line 170: @@
 * 2024-11: [https://x.com/deepseek_ai/status/1859200141355536422 DeepSeek-R1-Lite-Preview reasoning model]
 * 2024-11: [https://arxiv.org/abs/2411.14405 Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions]
+* 2024-11: [https://huggingface.co/papers/2411.16489 O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?]
+* 2024-12: [https://arxiv.org/abs/2412.00154 o1-Coder: an o1 Replication for Coding] ([https://github.com/ADaM-BJTU/O1-CODER code])
 ===Scaling===
@@ Line 154: / Line 180: @@
 ===Methods===
 * 2024-03: [https://arxiv.org/abs/2403.09629 Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking]
+* 2024-11: [https://arxiv.org/pdf/2411.19865 Reverse Thinking Makes LLMs Stronger Reasoners]
+* 2024-12: [https://arxiv.org/abs/2412.06769 Training Large Language Models to Reason in a Continuous Latent Space] (Chain of Continuous Thought, COCONUT)
+'''Review'''
+* 2024-06: [https://arxiv.org/abs/2406.16838 From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models]
 ===In context learning (ICL), search, and other inference-time methods===
@@ Line 203: / Line 233: @@
 ===Iterative reasoning via graphs===
 * 2023-08: [https://arxiv.org/abs/2308.09687 Graph of Thoughts: Solving Elaborate Problems with Large Language Models]
+* 2023-10: [https://arxiv.org/abs/2310.04363 Amortizing intractable inference in large language models] ([https://github.com/GFNOrg/gfn-lm-tuning code])
 * 2024-09: [https://arxiv.org/abs/2409.10038 On the Diagram of Thought]: Iterative reasoning as a directed acyclic graph (DAG)
@@ Line 224: / Line 255: @@
 * 2024-08: [https://arxiv.org/abs/2408.03314 Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters]
 * 2024-10: (comparing fine-tuning to in-context learning) [https://arxiv.org/abs/2405.19874 Is In-Context Learning Sufficient for Instruction Following in LLMs?]
+* 2024-11: [https://arxiv.org/abs/2411.17501 Inference Scaling FLaws: The Limits of LLM Resampling with Imperfect Verifiers]
 ===Theory===
@@ Line 247: / Line 279: @@
 * 2024-10: [https://arxiv.org/abs/2410.11163 Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence]
 * 2024-10: [https://arxiv.org/abs/2410.10934 Agent-as-a-Judge: Evaluate Agents with Agents]
+* 2024-10: [https://arxiv.org/abs/2410.19318 Two are better than one: Context window extension with multi-grained self-injection]
 * 2024-11: [https://arxiv.org/abs/2411.00114 Project Sid: Many-agent simulations toward AI civilization]
@@ Line 256: / Line 289: @@
 =Multi-agent orchestration=
+==Research==
+===Societies and Communities of AI agents===
+* 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]
 ==Research demos==
 * [https://github.com/camel-ai/camel Camel]
@@ Line 276: / Line 313: @@
 * 2024-06: [https://arxiv.org/abs/2406.11638 MASAI: Modular Architecture for Software-engineering AI Agents]
 * 2024-10: [https://arxiv.org/abs/2410.08164 Agent S: An Open Agentic Framework that Uses Computers Like a Human] ([https://github.com/simular-ai/Agent-S code])
+* 2024-10: [https://arxiv.org/abs/2410.20424 AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions]
 ===Related work===
@@ Line 283: / Line 321: @@
 * 2024-10: Agora: [https://agoraprotocol.org/ A Scalable Communication Protocol for Networks of Large Language Models] ([https://arxiv.org/abs/2410.11905 preprint]): disparate agents auto-negotiate communication protocol
 * 2024-11: [https://arxiv.org/abs/2411.02820 DroidSpeak: Enhancing Cross-LLM Communication]: Exploits caches of embeddings and key-values, to allow context to be more easily transferred between AIs (without consuming context window)
+* 2024-11: Anthropic describes [https://www.anthropic.com/news/model-context-protocol Model Context Protocol]: an open standard for secure, two-way connections between data sources and AI ([https://modelcontextprotocol.io/introduction intro], [https://modelcontextprotocol.io/quickstart quickstart], [https://github.com/modelcontextprotocol code])
 ==Architectures==
@@ Line 343: / Line 382: @@
 ===Metrics, Benchmarks===
 * 2022-06: [https://arxiv.org/abs/2206.10498 PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change]
+* 2023-06: [https://arxiv.org/abs/2306.05836 Can Large Language Models Infer Causation from Correlation?] (challenging Corr2Cause task)
+* 2024-01: [https://microsoft.github.io/autogen/0.2/blog/2024/01/25/AutoGenBench/ AutoGenBench -- A Tool for Measuring and Evaluating AutoGen Agents]
 * 2024-04: AutoRace ([https://github.com/maitrix-org/llm-reasoners code]): [https://arxiv.org/abs/2404.05221 LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models]
 * 2024-04: [https://arxiv.org/abs/2404.07972 OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments] ([https://os-world.github.io/ github])
@@ Line 354: / Line 395: @@
 * 2024-10: SimpleQA: [https://cdn.openai.com/papers/simpleqa.pdf Measuring short-form factuality in large language models] ([https://openai.com/index/introducing-simpleqa/ announcement], [https://github.com/openai/simple-evals code])
 * 2024-11: [https://metr.org/AI_R_D_Evaluation_Report.pdf RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts] ([https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/ blog], [https://github.com/METR/ai-rd-tasks/tree/main code])
+* 2024-11: [https://arxiv.org/abs/2411.10323 The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use] ([https://github.com/showlab/computer_use_ootb code])
+* 2024-11: [https://arxiv.org/abs/2411.13543 BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games]
+* 2024-12: [https://arxiv.org/abs/2412.14161 TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks] ([https://github.com/TheAgentCompany/TheAgentCompany code], [https://the-agent-company.com/ project], [https://the-agent-company.com/#/leaderboard leaderboard])
+* 2024-12: [https://arxiv.org/abs/2412.11936 A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges]
+===Multi-agent===
+* 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]
 ===Agent Challenges===
 * [https://github.com/aidanmclaughlin/Aidan-Bench Aidan-Bench]: Test creativity by having a particular LLM generate a long sequence of outputs (meant to be different), and measuring how long it can go before duplications appear.
+** NeurIPS 2024 paper/poster: [https://openreview.net/pdf?id=fz969ahcvJ AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions]
 * [https://x.com/paul_cal/status/1850262678712856764 Pictionary]: LLM suggests prompt, multiple LLMs generate outputs, LLM judges; allows ranking of the generation abilities.
 * [https://github.com/mc-bench/orchestrator MC-bench]: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges.

Difference between revisions of "AI Agents"

See Also
