Difference between revisions of "AI Agents"
KevinYager (talk | contribs) (→Inference-time Sampling) |
KevinYager (talk | contribs) (→Metrics, Benchmarks) |
||
(10 intermediate revisions by the same user not shown) | |||
Line 28: | Line 28: | ||
* [https://github.com/lmnr-ai/flow Flow: A lightweight task engine for building AI agents that prioritizes simplicity and flexibility] | * [https://github.com/lmnr-ai/flow Flow: A lightweight task engine for building AI agents that prioritizes simplicity and flexibility] | ||
* [https://llama-stack.readthedocs.io/en/latest/index.html llama-stack] | * [https://llama-stack.readthedocs.io/en/latest/index.html llama-stack] | ||
+ | * [https://huggingface.co/blog/smolagents Huggingface] [https://github.com/huggingface/smolagents smolagents] | ||
+ | * [https://github.com/elizaOS/eliza Eliza] (includes multi-agent, interaction with docs, Discord, Twitter, etc.) | ||
===Information Retrieval=== | ===Information Retrieval=== | ||
Line 127: | Line 129: | ||
=Increasing AI Agent Intelligence= | =Increasing AI Agent Intelligence= | ||
− | + | See: [[Increasing AI Intelligence]] | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | [[ | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
=Multi-agent orchestration= | =Multi-agent orchestration= | ||
Line 352: | Line 179: | ||
* Amazon AWS [https://github.com/awslabs/multi-agent-orchestrator Multi-Agent Orchestrator] | * Amazon AWS [https://github.com/awslabs/multi-agent-orchestrator Multi-Agent Orchestrator] | ||
* [https://github.com/kaiban-ai/KaibanJS KaibanJS]: Kanban for AI Agents? (Takes inspiration from [https://en.wikipedia.org/wiki/Kanban Kanban] visual [https://www.atlassian.com/agile/kanban work management].) | * [https://github.com/kaiban-ai/KaibanJS KaibanJS]: Kanban for AI Agents? (Takes inspiration from [https://en.wikipedia.org/wiki/Kanban Kanban] visual [https://www.atlassian.com/agile/kanban work management].) | ||
+ | * [https://github.com/Thytu/Agentarium Agentarium] | ||
==Open Source Systems== | ==Open Source Systems== | ||
Line 415: | Line 243: | ||
* 2024-12: [https://arxiv.org/abs/2412.14161 TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks] ([https://github.com/TheAgentCompany/TheAgentCompany code], [https://the-agent-company.com/ project], [https://the-agent-company.com/#/leaderboard leaderboard]) | * 2024-12: [https://arxiv.org/abs/2412.14161 TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks] ([https://github.com/TheAgentCompany/TheAgentCompany code], [https://the-agent-company.com/ project], [https://the-agent-company.com/#/leaderboard leaderboard]) | ||
* 2024-12: [https://arxiv.org/abs/2412.11936 A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges] | * 2024-12: [https://arxiv.org/abs/2412.11936 A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges] | ||
+ | * 2025-01: [https://codeelo-bench.github.io/ CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings] ([https://arxiv.org/abs/2501.01257 preprint], [https://codeelo-bench.github.io/#leaderboard-table leaderboard]) | ||
+ | |||
+ | ===Evaluation Schemes=== | ||
+ | * 2024-12: [https://arxiv.org/abs/2412.10424 LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation] | ||
+ | * 2025-01: [https://github.com/marquisdepolis/LLMRank LLMRank ("SlopRank")]: LLMs evaluate each other, allowing top model (for a given prompt/problem) to be inferred from a large number of recommendations. | ||
===Multi-agent=== | ===Multi-agent=== |
Latest revision as of 09:31, 3 January 2025
Reviews & Perspectives
Published
- 2024-04: LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models (code)
- 2024-08: From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future
- 2024-09: Towards a Science Exocortex
- 2024-09: Large Language Model-Based Agents for Software Engineering: A Survey
- 2024-09: Agents in Software Engineering: Survey, Landscape, and Vision
Continually updating
- OpenThought - System 2 Research Links
- Awesome LLM Strawberry (OpenAI o1): Collection of research papers & blogs for OpenAI Strawberry(o1) and Reasoning
Analysis/Opinions
- LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
- Cutting AI Assistant Costs by Up to 77.8%: The Power of Enhancing LLMs with Business Logic
Guides
- Anthropic: Building Effective Agents
AI Assistants
Components of AI Assistants
Agent Internal Workflow Management
- LangChain
- Pydantic: Agent Framework / shim to use Pydantic with LLMs
- Flow: A lightweight task engine for building AI agents that prioritizes simplicity and flexibility
- llama-stack
- Huggingface smolagents
- Eliza (includes multi-agent, interaction with docs, Discord, Twitter, etc.)
Information Retrieval
- See also RAG.
- 2024-09: PaperQA2: Language Models Achieve Superhuman Synthesis of Scientific Knowledge (𝕏 post, code)
- 2024-10: Agentic Information Retrieval
Control (tool-use, computer use, etc.)
- Anthropic Model Context Protocol (MCP)
- FastMCP: The fast, Pythonic way to build MCP servers
Open-source
- Khoj (code): self-hostable AI assistant
- RAGapp: Agentic RAG for enterprise
- STORM: Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking
- Can write (e.g.) Wikipedia-style articles
- code
- Preprint: Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models
Personalities/Personas
- 2023-10: Generative Agents: Interactive Simulacra of Human Behavior
- 2024-11: Microsoft TinyTroupe 🤠🤓🥸🧐: LLM-powered multiagent persona simulation for imagination enhancement and business insights
- 2024-11: Generative Agent Simulations of 1,000 People (code)
Specific Uses for AI Assistants
Computer Use
Software Engineering
Science Agents
See Science Agents.
LLM-as-judge
- List of papers.
- LLM Evaluation doesn't need to be complicated
- Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)
- [https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge Awesome-LLM-as-a-judge Survey]
- 2024-10: Agent-as-a-Judge: Evaluate Agents with Agents
- 2024-11: A Survey on LLM-as-a-Judge
- 2024-12: LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
Advanced Workflows
- Salesforce DEI: meta-system that leverages a diversity of SWE agents
- Sakana AI: AI Scientist
- SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning
Software Development Workflows
Several paradigms of AI-assisted coding have arisen:
- Manual, human driven
- AI-aided through chat/dialogue, where the human asks for code and then copies it into the project
- API calls to an LLM, which generates code and inserts the file into the project
- LLM-integration into the IDE
- AI-assisted IDE, where the AI generates and manages the dev environment
- Replit
- Aider (code): Pair programming on the command line
- Pythagora
- StackBlitz bolt.new
- Cline (formerly Claude Dev)
- Prompt-to-product
- Semi-autonomous software engineer agents
For a review of the current state of software-engineering agentic approaches, see:
- 2024-08: From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future
- 2024-09: Large Language Model-Based Agents for Software Engineering: A Survey
- 2024-09: Agents in Software Engineering: Survey, Landscape, and Vision
Corporate AI Agent Ventures
Mundane Workflows and Capabilities
- Payman AI: AI to Human platform that allows AI to pay people for what it needs
- VoiceFlow: Build customer experiences with AI
- Mistral AI: genAI applications
- Taskade: Task/milestone software with AI agent workflows
- Covalent: Building a Multi-Agent Prompt Refining Application
Inference-compute Reasoning
Agentic Systems
- Topology AI
- Cognition AI: Devin software engineer (14% on SWE-bench)
- Honeycomb (22% on SWE-bench)
Increasing AI Agent Intelligence
See: Increasing AI Intelligence
Multi-agent orchestration
Research
Societies and Communities of AI agents
Research demos
- Camel
- LoopGPT
- JARVIS
- OpenAGI
- AutoGen
- preprint: AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
- Agent-E: Browser (eventually computer) automation (code, preprint, demo video)
- AutoGen Studio: GUI for agent workflows (code)
- Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
- AG2 (previously AutoGen) (code, docs, Discord)
- TaskWeaver
- MetaGPT
- AutoGPT (code); and AutoGPT Platform
- Optima
- 2024-04: LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models (code)
- 2024-06: MASAI: Modular Architecture for Software-engineering AI Agents
- 2024-10: Agent S: An Open Agentic Framework that Uses Computers Like a Human (code)
- 2024-10: AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions
Related work
Inter-agent communications
- 2024-10: Agora: A Scalable Communication Protocol for Networks of Large Language Models (preprint): disparate agents auto-negotiate communication protocol
- 2024-11: DroidSpeak: Enhancing Cross-LLM Communication: Exploits caches of embeddings and key-values, to allow context to be more easily transferred between AIs (without consuming context window)
- 2024-11: Anthropic describes Model Context Protocol: an open standard for secure, two-way connections between data sources and AI (intro, quickstart, code)
Architectures
Open Source Frameworks
- LangChain
- ell (code, docs)
- AgentOps AI AgentStack
- Agent UI
- kyegomez swarms
- OpenAI Swarm (cookbook)
- Amazon AWS Multi-Agent Orchestrator
- KaibanJS: Kanban for AI Agents? (Takes inspiration from Kanban visual work management.)
- Agentarium
Open Source Systems
- ControlFlow
- OpenHands (formerly OpenDevin)
- code: platform for autonomous software engineers, powered by AI and LLMs
- Report: OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
Commercial Automation Frameworks
- Lutra: Automation and integration with various web systems.
- Gumloop
- TextQL: Enterprise Virtual Data Analyst
- Athena intelligence: Analytics platform
- Nexus GPT: Business co-pilot
- Multi-On: AI agent that acts on your behalf
- Firecrawl: Turn websites into LLM-ready data
- Reworkd: End-to-end data extraction
- Lindy: Custom AI Assistants to automate business workflows
- E.g. use Slack
- Bardeen: Automate workflows
- Abacus: AI Agents
- LlamaIndex: (𝕏, code, docs, Discord)
- MultiOn AI: Agent Q (paper) automated planning and execution
Spreadsheet
Cloud solutions
- Numbers Station Meadow: agentic framework for data workflows (code).
- CrewAI says they provide multi-agent automations (code).
- LangChain introduced LangGraph to help build agents, and LangGraph Cloud as a service for running those agents.
- LangGraph Studio is an IDE for agent workflows
- C3 AI enterprise platform
- Deepset AI Haystack (docs, code)
Frameworks
- Google Project Oscar
- Agent: Gaby (for "Go AI bot") (code, documentation) helps with issue tracking.
- OpenPlexity-Pages: Data-aggregator implementation (like Perplexity) based on CrewAI
Optimization
Metrics, Benchmarks
- 2022-06: PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change
- 2023-06: Can Large Language Models Infer Causation from Correlation? (challenging Corr2Cause task)
- 2024-01: AutoGenBench -- A Tool for Measuring and Evaluating AutoGen Agents
- 2024-04: AutoRace (code): LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models
- 2024-04: OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (github)
- 2024-07: AI Agents That Matter
- 2024-09: CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark (leaderboard)
- 2024-09: LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
- 2024-09: On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability
- 2024-10: MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
- 2024-10: WorFBench: Benchmarking Agentic Workflow Generation
- 2024-10: VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models
- 2024-10: SimpleQA: Measuring short-form factuality in large language models (announcement, code)
- 2024-11: RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts (blog, code)
- 2024-11: The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use (code)
- 2024-11: BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
- 2024-12: TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks (code, project, leaderboard)
- 2024-12: A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges
- 2025-01: CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings (preprint, leaderboard)
Evaluation Schemes
- 2024-12: LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
- 2025-01: LLMRank ("SlopRank"): LLMs evaluate each other, allowing the top model (for a given prompt/problem) to be inferred from a large number of recommendations.
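The peer-evaluation idea behind the LLMRank entry can be illustrated with a PageRank-style aggregation. This is a minimal sketch under stated assumptions, not the project's actual implementation: the function name `peer_rank`, the `endorsements` data format, and the damping value are all hypothetical choices made for illustration.

```python
def peer_rank(endorsements, damping=0.85, iterations=100):
    """Rank models from peer endorsements via PageRank-style power iteration.

    endorsements[judge][candidate] is a non-negative weight saying how
    strongly `judge` recommends `candidate` (hypothetical data format).
    Self-endorsements are ignored.
    """
    models = sorted(endorsements)
    n = len(models)
    scores = {m: 1.0 / n for m in models}
    for _ in range(iterations):
        # Every model receives a small baseline share each round.
        new_scores = {m: (1.0 - damping) / n for m in models}
        for judge in models:
            votes = {
                c: w
                for c, w in endorsements[judge].items()
                if c != judge and c in new_scores and w > 0
            }
            total = sum(votes.values())
            if total == 0:
                # A judge that endorses nobody spreads its weight evenly.
                for m in models:
                    new_scores[m] += damping * scores[judge] / n
                continue
            # A judge's influence is proportional to its own current score.
            for candidate, weight in votes.items():
                new_scores[candidate] += damping * scores[judge] * weight / total
        scores = new_scores
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative (made-up) endorsement weights for three placeholder models.
example = {
    "model_a": {"model_b": 1, "model_c": 3},
    "model_b": {"model_a": 1, "model_c": 2},
    "model_c": {"model_a": 1, "model_b": 1},
}
ranking = peer_rank(example)
```

In this sketch an endorsement from a highly ranked model counts for more than one from a low-ranked model, which is one plausible way to infer a top model from many mutual recommendations.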
Multi-agent
Agent Challenges
- Aidan-Bench: Test creativity by having a particular LLM generate a long sequence of outputs (meant to be different), and measuring how long it can go before duplications appear.
- NeurIPS 2024 paper/poster: AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions
- Pictionary: LLM suggests prompt, multiple LLMs generate outputs, LLM judges; allows ranking of the generation abilities.
- MC-bench: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges.
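The "generate until duplicates appear" scoring described for Aidan-Bench above can be sketched roughly as follows. This is an illustrative sketch only: `generate` is a hypothetical callable standing in for an LLM call, and plain string similarity from the standard library is an assumed stand-in for whatever novelty measure the benchmark actually uses.

```python
from difflib import SequenceMatcher


def count_novel_outputs(generate, prompt, similarity_threshold=0.9, max_outputs=100):
    """Count how many outputs are produced before a near-duplicate appears.

    `generate(prompt, previous)` is a hypothetical function returning the
    model's next answer, given the answers it has already produced.
    """
    previous = []
    for _ in range(max_outputs):
        candidate = generate(prompt, previous)
        # Treat the candidate as a duplicate if it is too similar to any
        # earlier answer (string similarity is an assumed proxy for novelty).
        is_duplicate = any(
            SequenceMatcher(None, candidate.lower(), earlier.lower()).ratio()
            >= similarity_threshold
            for earlier in previous
        )
        if is_duplicate:
            break
        previous.append(candidate)
    return len(previous)


# Stub "model" that repeats itself on the fourth answer.
_answers = iter(["use a kite", "build a raft", "dig a tunnel", "use a kite", "swim"])


def stub_generate(prompt, previous):
    return next(_answers)


novel_count = count_novel_outputs(stub_generate, "How could one cross a river?")
```

A higher count means the model sustained distinct answers for longer; the threshold and the cap on outputs are tunable assumptions.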
Automated Improvement
- 2024-06: EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms
- 2024-06: Symbolic Learning Enables Self-Evolving Agents
- 2024-08: Automated Design of Agentic Systems (ADAS code)
- 2024-08: Self-Taught Evaluators: Iterative self-improvement through generation of synthetic data and evaluation