Revision as of 10:31, 3 January 2025

Reviews & Perspectives

Published

2024-04: LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models (code)
2024-08: From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future
2024-09: Towards a Science Exocortex
2024-09: Large Language Model-Based Agents for Software Engineering: A Survey
2024-09: Agents in Software Engineering: Survey, Landscape, and Vision

Continually updating

Analysis/Opinions

Guides

Anthropic: Building Effective Agents

AI Assistants

Components of AI Assistants

Agent Internal Workflow Management

LangChain
Pydantic: Agent Framework / shim to use Pydantic with LLMs
Flow: A lightweight task engine for building AI agents that prioritizes simplicity and flexibility
llama-stack
Huggingface smolagents
Eliza (includes multi-agent, interaction with docs, Discord, Twitter, etc.)

Information Retrieval

See also RAG.
2024-09: PaperQA2: Language Models Achieve Superhuman Synthesis of Scientific Knowledge (𝕏 post, code)
2024-10: Agentic Information Retrieval

Control (tool-use, computer use, etc.)

Anthropic Model Context Protocol (MCP)
- FastMCP: The fast, Pythonic way to build MCP servers

Open-source

Khoj (code): self-hostable AI assistant
RAGapp: Agentic RAG for enterprise
STORM: Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking
- Can write (e.g.) Wikipedia-style articles
- code
- Preprint: Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models

Personalities/Personas

2023-10: Generative Agents: Interactive Simulacra of Human Behavior
2024-11: Microsoft TinyTroupe 🤠🤓🥸🧐: LLM-powered multiagent persona simulation for imagination enhancement and business insights
2024-11: Generative Agent Simulations of 1,000 People (code)

Specific Uses for AI Assistants

Advanced Workflows

Salesforce DEI: meta-system that leverages a diversity of SWE agents
- Preprint: Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents
Sakana AI: AI Scientist
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
- code
SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning
- code

Software Development Workflows

Several paradigms of AI-assisted coding have arisen:

1. Manual, human driven
2. AI-aided through chat/dialogue, where the human asks for code and then copies it into the project
   1. OpenAI ChatGPT
   2. Anthropic Claude
3. API calls to an LLM, which generates code and inserts the file into the project
4. LLM-integration into the IDE
   1. Copilot
   2. Qodo (Codium) & AlphaCodium (preprint, code)
   3. Cursor
   4. Codeium Windsurf (with "Cascade" AI Agent)
5. AI-assisted IDE, where the AI generates and manages the dev environment
   1. Replit
   2. Aider (code): Pair programming on the command line
   3. Pythagora
   4. StackBlitz bolt.new
   5. Cline (formerly Claude Dev)
6. Prompt-to-product
   1. Github Spark (demo video)
7. Semi-autonomous software engineer agents
   1. Devin (Cognition AI)
   2. Amazon Q
   3. Honeycomb

For a review of the current state of software-engineering agentic approaches, see the software-engineering surveys listed under Reviews & Perspectives (Published).

Corporate AI Agent Ventures

Mundane Workflows and Capabilities

Payman AI: AI to Human platform that allows AI to pay people for what it needs
VoiceFlow: Build customer experiences with AI
Mistral AI: genAI applications
Taskade: Task/milestone software with AI agent workflows
Covalent: Building a Multi-Agent Prompt Refining Application

Inference-compute Reasoning

Nous Research: Forge Reasoning API Beta

Agentic Systems

Topology AI
Cognition AI: Devin software engineer (14% on SWE-bench)
Honeycomb (22% on SWE-bench)

Increasing AI Agent Intelligence

See: Increasing AI Intelligence

Multi-agent orchestration

Research

Societies and Communities of AI agents

2024-12: Cultural Evolution of Cooperation among LLM Agents

Research demos

Camel
LoopGPT
JARVIS
OpenAGI
AutoGen
- preprint: AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
- Agent-E: Browser (eventually computer) automation (code, preprint, demo video)
- AutoGen Studio: GUI for agent workflows (code)
- Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
AG2 (previously AutoGen) (code, docs, Discord)
TaskWeaver
MetaGPT
AutoGPT (code); and AutoGPT Platform
Optima
- preprint: Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System
- code
2024-04: LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models (code)
2024-06: MASAI: Modular Architecture for Software-engineering AI Agents
2024-10: Agent S: An Open Agentic Framework that Uses Computers Like a Human (code)
2024-10: AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

Related work

2024-07: PersonaGym: Evaluating Persona Agents and LLMs

Inter-agent communications

2024-10: Agora: A Scalable Communication Protocol for Networks of Large Language Models (preprint): disparate agents auto-negotiate a communication protocol
2024-11: DroidSpeak: Enhancing Cross-LLM Communication: Exploits caches of embeddings and key-values so that context can be transferred more easily between AIs (without consuming context window)
2024-11: Anthropic describes Model Context Protocol: an open standard for secure, two-way connections between data sources and AI (intro, quickstart, code)

Architectures

Open Source Frameworks

LangChain
ell (code, docs)
AgentOps AI AgentStack
Agent UI
kyegomez swarms
OpenAI Swarm (cookbook)
Amazon AWS Multi-Agent Orchestrator
KaibanJS: Kanban for AI Agents? (Takes inspiration from Kanban visual work management.)
Agentarium

Open Source Systems

ControlFlow
- documentation
- code
OpenHands (formerly OpenDevin)
- code: platform for autonomous software engineers, powered by AI and LLMs
- Report: OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

Commercial Automation Frameworks

Lutra: Automation and integration with various web systems.
Gumloop
TextQL: Enterprise Virtual Data Analyst
Athena intelligence: Analytics platform
Nexus GPT: Business co-pilot
Multi-On: AI agent that acts on your behalf
Firecrawl: Turn websites into LLM-ready data
Reworkd: End-to-end data extraction
Lindy: Custom AI Assistants to automate business workflows
- E.g. use Slack
Bardeen: Automate workflows
Abacus: AI Agents
LlamaIndex: (𝕏, code, docs, Discord)
MultiOn AI: Agent Q (paper) automated planning and execution

Spreadsheet

Cloud solutions

Numbers Station Meadow: agentic framework for data workflows (code).
CrewAI says they provide multi-agent automations (code).
LangChain introduced LangGraph to help build agents, and LangGraph Cloud as a service for running those agents.
- LangGraph Studio is an IDE for agent workflows
C3 AI enterprise platform
Deepset AI Haystack (docs, code)

Frameworks

Google Project Oscar
- Agent: Gaby (for "Go AI bot") (code, documentation) helps with issue tracking.
OpenPlexity-Pages: Data-aggregator implementation (like Perplexity) based on CrewAI

Optimization

Metrics, Benchmarks

2022-06: PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change
2023-06: Can Large Language Models Infer Causation from Correlation? (challenging Corr2Cause task)
2024-01: AutoGenBench -- A Tool for Measuring and Evaluating AutoGen Agents
2024-04: AutoRace (code): LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models
2024-04: OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (github)
2024-07: AI Agents That Matter
2024-09: CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark (leaderboard)
2024-09: LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
2024-09: On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability
2024-10: MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
2024-10: WorFBench: Benchmarking Agentic Workflow Generation
2024-10: VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models
2024-10: SimpleQA: Measuring short-form factuality in large language models (announcement, code)
2024-11: RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts (blog, code)
2024-11: The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use (code)
2024-11: BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
2024-12: TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks (code, project, leaderboard)
2024-12: A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges
2025-01: CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings (preprint, leaderboard)

Evaluation Schemes

2024-12: LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
2025-01: LLMRank ("SlopRank"): LLMs evaluate each other, allowing the top model (for a given prompt/problem) to be inferred from a large number of recommendations.
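
The peer-evaluation idea behind LLMRank can be illustrated with a small sketch. The aggregation method is not specified in this list; the code below assumes a PageRank-style iteration over peer scores, which is one plausible way to infer a top model from many recommendations. The function `peer_rank`, the model names, and the scores are all hypothetical.

```python
from typing import Dict


def peer_rank(votes: Dict[str, Dict[str, float]],
              damping: float = 0.85,
              iterations: int = 100) -> Dict[str, float]:
    """PageRank-style aggregation of peer scores (an assumed method).

    votes[judge][candidate] is the non-negative score that `judge`
    gave to `candidate`. Self-votes are ignored.
    """
    models = sorted(set(votes) | {c for row in votes.values() for c in row})
    n = len(models)
    rank = {m: 1.0 / n for m in models}
    for _ in range(iterations):
        # Every model receives a small baseline share.
        new_rank = {m: (1.0 - damping) / n for m in models}
        for judge in models:
            row = {c: s for c, s in votes.get(judge, {}).items()
                   if c != judge and s > 0}
            total = sum(row.values())
            if total == 0:
                # A judge with no usable votes spreads its weight evenly.
                for m in models:
                    new_rank[m] += damping * rank[judge] / n
            else:
                # Otherwise its weight is split in proportion to its scores.
                for cand, score in row.items():
                    new_rank[cand] += damping * rank[judge] * score / total
        rank = new_rank
    return rank


# Hypothetical scores: demo_votes[judge][candidate]
demo_votes = {
    "model_a": {"model_b": 8, "model_c": 3},
    "model_b": {"model_a": 4, "model_c": 2},
    "model_c": {"model_a": 5, "model_b": 9},
}
demo_rank = peer_rank(demo_votes)
best = max(demo_rank, key=demo_rank.get)
print(best, demo_rank)
```

In this toy example `model_b` comes out on top, because the other two models both give it most of their weight. A judge's opinion counts for more when that judge is itself highly ranked, which is the main difference from simply averaging scores.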

Multi-agent

2024-12: Cultural Evolution of Cooperation among LLM Agents

Agent Challenges

Aidan-Bench: Tests creativity by having a particular LLM generate a long sequence of outputs (meant to be different), and measuring how long it can go before duplications appear.
- NeurIPS 2024 paper/poster: AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions
Pictionary: An LLM suggests a prompt, multiple LLMs generate outputs, and an LLM judges; this allows ranking of the generation abilities.
MC-bench: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges.
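
The "how long before duplications appear" measurement described for Aidan-Bench can be sketched as a simple loop. This is a minimal illustration, not the benchmark's actual scoring: it assumes token-set Jaccard similarity as a stand-in duplicate detector, and `novelty_run_length`, the threshold, and the stub generator are hypothetical. A real run would call an LLM in place of the stub.

```python
import re
from typing import Callable, List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity in [0, 1]."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def novelty_run_length(generate: Callable[[List[str]], str],
                       threshold: float = 0.8,
                       max_turns: int = 50) -> int:
    """Count the answers produced before one near-duplicates an earlier one.

    `generate` receives the answers so far and returns the next answer.
    """
    history: List[str] = []
    for _ in range(max_turns):
        answer = generate(history)
        if any(jaccard(answer, prev) >= threshold for prev in history):
            break  # Too similar to an earlier answer: the run ends here.
        history.append(answer)
    return len(history)


# Stub generator standing in for an LLM call (hypothetical canned answers).
CANNED = ["Use a bicycle.", "Take the train.",
          "Walk along the river.", "use a bicycle"]


def stub_generate(history: List[str]) -> str:
    return CANNED[len(history) % len(CANNED)]


score = novelty_run_length(stub_generate)
print(score)
```

With the stub, the fourth answer repeats the first one up to case and punctuation, so the run stops after three distinct answers. Swapping the Jaccard check for embedding similarity would catch paraphrases as well as literal repeats.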

Automated Improvement

2024-06: EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms
2024-06: Symbolic Learning Enables Self-Evolving Agents
2024-08: Automated Design of Agentic Systems (ADAS code)
2024-08: Self-Taught Evaluators: Iterative self-improvement through generation of synthetic data and evaluation

Difference between revisions of "AI Agents"
