Difference between revisions of "AI Agents"
| KevinYager (talk | contribs)  (→Related work) | KevinYager (talk | contribs)   (→Information Retrieval (Memory)) | ||
| (37 intermediate revisions by the same user not shown) | |||
| Line 9: | Line 9: | ||
| * 2025-04: [https://arxiv.org/abs/2504.01990 Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems] | * 2025-04: [https://arxiv.org/abs/2504.01990 Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems] | ||
| * 2025-04: [https://arxiv.org/abs/2503.19213 A Survey of Large Language Model Agents for Question Answering] | * 2025-04: [https://arxiv.org/abs/2503.19213 A Survey of Large Language Model Agents for Question Answering] | ||
| + | * 2025-04: [https://arxiv.org/abs/2504.09037 A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems] | ||
| + | * 2025-04: [https://arxiv.org/abs/2504.01990 Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems] | ||
| ===Continually updating=== | ===Continually updating=== | ||
| Line 18: | Line 20: | ||
| * [https://arxiv.org/abs/2402.01817v3 LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks] | * [https://arxiv.org/abs/2402.01817v3 LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks] | ||
| * [https://rasa.com/blog/cutting-ai-assistant-costs-the-power-of-enhancing-llms-with-business/ Cutting AI Assistant Costs by Up to 77.8%: The Power of Enhancing LLMs with Business Logic] | * [https://rasa.com/blog/cutting-ai-assistant-costs-the-power-of-enhancing-llms-with-business/ Cutting AI Assistant Costs by Up to 77.8%: The Power of Enhancing LLMs with Business Logic] | ||
| + | * 2025-05: [https://arxiv.org/abs/2505.10468 AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges] | ||
| ===Guides=== | ===Guides=== | ||
| * Anthropic: [https://www.anthropic.com/research/building-effective-agents Building Effective Agents] | * Anthropic: [https://www.anthropic.com/research/building-effective-agents Building Effective Agents] | ||
| − | * Google: [https://www.kaggle.com/whitepaper-agents Agents] | + | * Google: [https://www.kaggle.com/whitepaper-agents Agents] and [https://www.kaggle.com/whitepaper-agent-companion Agents Companion] | 
| + | * OpenAI: [https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf A practical guide to building agents] | ||
| + | * Anthropic: [https://www.anthropic.com/engineering/claude-code-best-practices Claude Code: Best practices for agentic coding] | ||
| + | * Anthropic: [https://www.anthropic.com/engineering/built-multi-agent-research-system How we built our multi-agent research system] | ||
| =AI Assistants= | =AI Assistants= | ||
| Line 35: | Line 41: | ||
| * [https://github.com/elizaOS/eliza Eliza] (includes multi-agent, interaction with docs, Discord, Twitter, etc.) | * [https://github.com/elizaOS/eliza Eliza] (includes multi-agent, interaction with docs, Discord, Twitter, etc.) | ||
| * [https://github.com/The-Pocket/PocketFlow Pocket Flow]: LLM Framework in 100 Lines | * [https://github.com/The-Pocket/PocketFlow Pocket Flow]: LLM Framework in 100 Lines | ||
| + | * [https://github.com/coze-dev/coze-studio Coze]: All-in-one AI agent development tool | ||
| ===Information Retrieval (Memory)=== | ===Information Retrieval (Memory)=== | ||
| Line 42: | Line 49: | ||
| * 2025-02: [https://arxiv.org/abs/2502.01142 DeepRAG: Thinking to Retrieval Step by Step for Large Language Models] | * 2025-02: [https://arxiv.org/abs/2502.01142 DeepRAG: Thinking to Retrieval Step by Step for Large Language Models] | ||
| * [https://mem0.ai/ Mem0 AI]: Memory Layer for AI Agents; self-improving memory layer for LLM applications, enabling personalization. | * [https://mem0.ai/ Mem0 AI]: Memory Layer for AI Agents; self-improving memory layer for LLM applications, enabling personalization. | ||
| + | * 2025-08: [https://arxiv.org/abs/2508.16153 Memento: Fine-tuning LLM Agents without Fine-tuning LLMs] | ||
| ===Contextual Memory=== | ===Contextual Memory=== | ||
| Line 104: | Line 112: | ||
| * 2025-04: [https://www.nature.com/articles/s41586-025-08866-7?linkId=13898052 Towards conversational diagnostic artificial intelligence] | * 2025-04: [https://www.nature.com/articles/s41586-025-08866-7?linkId=13898052 Towards conversational diagnostic artificial intelligence] | ||
| * 2025-04: [https://www.nature.com/articles/s41586-025-08869-4?linkId=13898054 Towards accurate differential diagnosis with large language models] | * 2025-04: [https://www.nature.com/articles/s41586-025-08869-4?linkId=13898054 Towards accurate differential diagnosis with large language models] | ||
| + | * 2025-08: [https://arxiv.org/abs/2508.20148 The Anatomy of a Personal Health Agent] | ||
| ===LLM-as-judge=== | ===LLM-as-judge=== | ||
| Line 116: | Line 125: | ||
| * 2024-12: [https://arxiv.org/abs/2412.05579 LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods] | * 2024-12: [https://arxiv.org/abs/2412.05579 LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods] | ||
| * 2025-03: [https://arxiv.org/abs/2503.19877 Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators] | * 2025-03: [https://arxiv.org/abs/2503.19877 Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators] | ||
| + | * 2025-04: [https://arxiv.org/abs/2504.00050 JudgeLRM: Large Reasoning Models as a Judge] | ||
| ===Deep Research=== | ===Deep Research=== | ||
| Line 138: | Line 148: | ||
| * [https://convergence.ai/welcome Convergence AI] Deep Work (swarms for web-based tasks) | * [https://convergence.ai/welcome Convergence AI] Deep Work (swarms for web-based tasks) | ||
| * 2025-04: [https://arxiv.org/abs/2504.03160 DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments] | * 2025-04: [https://arxiv.org/abs/2504.03160 DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments] | ||
| + | * 2025-04: Anthropic [https://x.com/AnthropicAI/status/1912192384588271771 Research] | ||
| + | * 2025-04: [https://arxiv.org/abs/2504.21776 WebThinker: Empowering Large Reasoning Models with Deep Research Capability] | ||
| + | * 2025-09: [https://arxiv.org/abs/2509.06283 SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents] | ||
| =Advanced Workflows= | =Advanced Workflows= | ||
| Line 147: | Line 160: | ||
| * [https://arxiv.org/abs/2409.05556 SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning] | * [https://arxiv.org/abs/2409.05556 SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning] | ||
| ** [https://github.com/lamm-mit/SciAgentsDiscovery code] | ** [https://github.com/lamm-mit/SciAgentsDiscovery code] | ||
| + | * [https://skywork.ai/home Skywork] [https://skywork.ai/home?inviter=el.cine&shortlink_id=1919604877427924992&utm_source=X Super Agent] | ||
| ===Streamline Administrative Tasks=== | ===Streamline Administrative Tasks=== | ||
| Line 160: | Line 174: | ||
| ## OpenAI [https://chatgpt.com/ ChatGPT] | ## OpenAI [https://chatgpt.com/ ChatGPT] | ||
| ## Anthropic [https://claude.ai/ Claude] | ## Anthropic [https://claude.ai/ Claude] | ||
| + | ## Google [https://gemini.google.com/app Gemini] | ||
| # API calls to an LLM, which generates code and inserts the file into the project | # API calls to an LLM, which generates code and inserts the file into the project | ||
| # LLM-integration into the IDE | # LLM-integration into the IDE | ||
| ## [https://github.com/features/copilot Copilot] | ## [https://github.com/features/copilot Copilot] | ||
| ## [https://www.qodo.ai/ Qodo] (Codium) & [https://www.qodo.ai/products/alphacodium/ AlphaCodium] ([https://arxiv.org/abs/2401.08500 preprint], [https://github.com/Codium-ai/AlphaCodium code]) | ## [https://www.qodo.ai/ Qodo] (Codium) & [https://www.qodo.ai/products/alphacodium/ AlphaCodium] ([https://arxiv.org/abs/2401.08500 preprint], [https://github.com/Codium-ai/AlphaCodium code]) | ||
| − | ## [https://www.cursor.com/ Cursor] | + | ## '''[https://www.cursor.com/ Cursor]''' | 
| ## [https://codeium.com/ Codeium] [https://codeium.com/windsurf Windsurf] (with "Cascade" AI Agent) | ## [https://codeium.com/ Codeium] [https://codeium.com/windsurf Windsurf] (with "Cascade" AI Agent) | ||
| ## ByteDance [https://www.trae.ai/ Trae AI] | ## ByteDance [https://www.trae.ai/ Trae AI] | ||
| Line 178: | Line 193: | ||
| ## [https://app.devin.ai/ Devin 2.0] ([https://cognition.ai/ Cognition AI]) | ## [https://app.devin.ai/ Devin 2.0] ([https://cognition.ai/ Cognition AI]) | ||
| ## Google [https://firebase.google.com/docs/studio Firebase Studio] | ## Google [https://firebase.google.com/docs/studio Firebase Studio] | ||
| + | ## [https://github.com/rowboatlabs/rowboat rowboat] (for building multi-agent workflows) | ||
| + | ## [https://www.trae.ai/ Trae IDE]: The Real AI Engineer | ||
| # AI-assisted IDE, where the AI generates and manages the dev environment | # AI-assisted IDE, where the AI generates and manages the dev environment | ||
| ## [https://replit.com/ Replit] | ## [https://replit.com/ Replit] | ||
| − | |||
| ## [https://www.pythagora.ai/ Pythagora] | ## [https://www.pythagora.ai/ Pythagora] | ||
| ## [https://stackblitz.com/ StackBlitz] [https://bolt.new/ bolt.new] | ## [https://stackblitz.com/ StackBlitz] [https://bolt.new/ bolt.new] | ||
| ## [https://github.com/clinebot/cline Cline] (formerly [https://generativeai.pub/meet-claude-dev-an-open-source-autonomous-ai-programmer-in-vs-code-f457f9821b7b Claude Dev]) | ## [https://github.com/clinebot/cline Cline] (formerly [https://generativeai.pub/meet-claude-dev-an-open-source-autonomous-ai-programmer-in-vs-code-f457f9821b7b Claude Dev]) | ||
| + | ## [https://www.all-hands.dev/ All Hands] | ||
| + | # AI Agent on Commandline | ||
| + | ## [https://aider.chat/ Aider] ([https://github.com/Aider-AI/aider code]): Pair programming on commandline | ||
| + | ## [https://docs.anthropic.com/en/docs/claude-code/overview Claude Code] | ||
| + | ## [https://openai.com/codex/ OpenAI Codex] | ||
| + | ## [https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/ Gemini CLI] | ||
| # Prompt-to-product | # Prompt-to-product | ||
| ## [https://githubnext.com/projects/github-spark Github Spark] ([https://x.com/ashtom/status/1851333075374051725 demo video]) | ## [https://githubnext.com/projects/github-spark Github Spark] ([https://x.com/ashtom/status/1851333075374051725 demo video]) | ||
| Line 198: | Line 220: | ||
| ## [https://base44.com/ base44]: no-code dashboard apps | ## [https://base44.com/ base44]: no-code dashboard apps | ||
| ## [https://www.theorigin.ai/ Origin AI] | ## [https://www.theorigin.ai/ Origin AI] | ||
| + | ## [https://app.emergent.sh/ Emergent AI] | ||
| # Semi-autonomous software engineer agents | # Semi-autonomous software engineer agents | ||
| ## [https://www.cognition.ai/blog/introducing-devin Devin] (Cognition AI) | ## [https://www.cognition.ai/blog/introducing-devin Devin] (Cognition AI) | ||
| ## [https://aws.amazon.com/q/ Amazon Q] (and CodeWhisperer) | ## [https://aws.amazon.com/q/ Amazon Q] (and CodeWhisperer) | ||
| ## [https://honeycomb.sh/ Honeycomb] | ## [https://honeycomb.sh/ Honeycomb] | ||
| + | ## [https://www.blackbox.ai/ Agent IDE] | ||
| ## [https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview Claude Code] | ## [https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview Claude Code] | ||
| − | + | ## OpenAI [https://help.openai.com/en/articles/11096431-openai-codex-cli-getting-started Codex CLI] and [https://openai.com/index/introducing-codex/ Codex] cloud | |
| + | ## [https://www.factory.ai/ Factory AI] [https://x.com/FactoryAI/status/1927754706014630357 Droids] | ||
| For a review of the current state of software-engineering agentic approaches, see: | For a review of the current state of software-engineering agentic approaches, see: | ||
| * 2024-08: [https://arxiv.org/abs/2408.02479 From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future] | * 2024-08: [https://arxiv.org/abs/2408.02479 From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future] | ||
| Line 231: | Line 256: | ||
| * [https://convergence.ai/welcome Convergence AI] Deep Work (swarms for web-based tasks) | * [https://convergence.ai/welcome Convergence AI] Deep Work (swarms for web-based tasks) | ||
| * [https://agents.cloudflare.com/ Cloudflare Agents] | * [https://agents.cloudflare.com/ Cloudflare Agents] | ||
| + | * [https://www.maskara.ai/ Maskara AI] | ||
| =Increasing AI Agent Intelligence= | =Increasing AI Agent Intelligence= | ||
| Line 237: | Line 263: | ||
| =Multi-agent orchestration= | =Multi-agent orchestration= | ||
| ==Research== | ==Research== | ||
| + | * 2025-02: [https://arxiv.org/abs/2502.02533 Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies] | ||
| * 2025-03: [https://arxiv.org/abs/2503.13657 Why Do Multi-Agent LLM Systems Fail?] | * 2025-03: [https://arxiv.org/abs/2503.13657 Why Do Multi-Agent LLM Systems Fail?] | ||
| * 2025-03: [https://arxiv.org/abs/2503.15478 SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks] | * 2025-03: [https://arxiv.org/abs/2503.15478 SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks] | ||
| + | * 2025-09: [https://arxiv.org/abs/2509.20175 Federation of Agents: A Semantics-Aware Communication Fabric for Large-Scale Agentic AI] | ||
| ===Organization Schemes=== | ===Organization Schemes=== | ||
| Line 245: | Line 273: | ||
| ===Societies and Communities of AI agents=== | ===Societies and Communities of AI agents=== | ||
| * 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents] | * 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents] | ||
| + | * 2025-04: [https://arxiv.org/abs/2504.10157 SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users] | ||
| + | * 2025-05: [https://www.science.org/doi/10.1126/sciadv.adu9368 Emergent social conventions and collective bias in LLM populations] | ||
| + | * 2025-09: [https://arxiv.org/abs/2509.10147 Virtual Agent Economies] | ||
| ===Domain-specific=== | ===Domain-specific=== | ||
| Line 281: | Line 312: | ||
| * 2024-11: [https://arxiv.org/abs/2411.02820 DroidSpeak: Enhancing Cross-LLM Communication]: Exploits caches of embeddings and key-values, to allow context to be more easily transferred between AIs (without consuming context window) | * 2024-11: [https://arxiv.org/abs/2411.02820 DroidSpeak: Enhancing Cross-LLM Communication]: Exploits caches of embeddings and key-values, to allow context to be more easily transferred between AIs (without consuming context window) | ||
| * 2024-11: Anthropic describes [https://www.anthropic.com/news/model-context-protocol Model Context Protocol]: an open standard for secure, two-way connections between data sources and AI ([https://modelcontextprotocol.io/introduction intro], [https://modelcontextprotocol.io/quickstart quickstart], [https://github.com/modelcontextprotocol code]) | * 2024-11: Anthropic describes [https://www.anthropic.com/news/model-context-protocol Model Context Protocol]: an open standard for secure, two-way connections between data sources and AI ([https://modelcontextprotocol.io/introduction intro], [https://modelcontextprotocol.io/quickstart quickstart], [https://github.com/modelcontextprotocol code]) | ||
| + | * 2025-09: [https://arxiv.org/abs/2509.20175 Federation of Agents: A Semantics-Aware Communication Fabric for Large-Scale Agentic AI] | ||
| ==Architectures== | ==Architectures== | ||
| Line 326: | Line 358: | ||
| * [https://www.bardeen.ai/ Bardeen]: Automate workflows | * [https://www.bardeen.ai/ Bardeen]: Automate workflows | ||
| * [https://abacus.ai/ Abacus]: [https://abacus.ai/ai_agents AI Agents] | * [https://abacus.ai/ Abacus]: [https://abacus.ai/ai_agents AI Agents] | ||
| + | ** [https://abacus.ai/help/howTo HowTo] | ||
| * [https://www.llamaindex.ai/ LlamaIndex]: ([https://x.com/llama_index 𝕏], [https://github.com/run-llama/llama_index code], [https://docs.llamaindex.ai/en/stable/ docs], [https://discord.com/invite/dGcwcsnxhU Discord]) | * [https://www.llamaindex.ai/ LlamaIndex]: ([https://x.com/llama_index 𝕏], [https://github.com/run-llama/llama_index code], [https://docs.llamaindex.ai/en/stable/ docs], [https://discord.com/invite/dGcwcsnxhU Discord]) | ||
| * [https://www.multion.ai/ MultiOn AI]: [https://www.multion.ai/blog/introducing-agent-q-research-breakthrough-for-the-next-generation-of-ai-agents-with-planning-and-self-healing-capabilities Agent Q] ([https://multion-research.s3.us-east-2.amazonaws.com/AgentQ.pdf paper]) automated planning and execution | * [https://www.multion.ai/ MultiOn AI]: [https://www.multion.ai/blog/introducing-agent-q-research-breakthrough-for-the-next-generation-of-ai-agents-with-planning-and-self-healing-capabilities Agent Q] ([https://multion-research.s3.us-east-2.amazonaws.com/AgentQ.pdf paper]) automated planning and execution | ||
| * Google [https://cloud.google.com/products/agentspace Agentspace] | * Google [https://cloud.google.com/products/agentspace Agentspace] | ||
| + | * [https://try.flowith.io/ Flowith] | ||
| ===Multi-agent Handoff/Collaboration=== | ===Multi-agent Handoff/Collaboration=== | ||
| Line 338: | Line 372: | ||
| * [https://www.paradigmai.com/ Paradigm] | * [https://www.paradigmai.com/ Paradigm] | ||
| * [https://www.superworker.ai/ Superworker AI] | * [https://www.superworker.ai/ Superworker AI] | ||
| + | * [https://www.genspark.ai/ Genspark] | ||
| ==Cloud solutions== | ==Cloud solutions== | ||
| Line 358: | Line 393: | ||
| ===Metrics, Benchmarks=== | ===Metrics, Benchmarks=== | ||
| + | See also: [[AI benchmarks]] | ||
| * 2019-11: [https://arxiv.org/abs/1911.01547 On the Measure of Intelligence] | * 2019-11: [https://arxiv.org/abs/1911.01547 On the Measure of Intelligence] | ||
| * 2022-06: [https://arxiv.org/abs/2206.10498 PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change] | * 2022-06: [https://arxiv.org/abs/2206.10498 PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change] | ||
| Line 381: | Line 417: | ||
| * 2025-02: [https://arxiv.org/abs/2502.18356 WebGames: Challenging General-Purpose Web-Browsing AI Agents] | * 2025-02: [https://arxiv.org/abs/2502.18356 WebGames: Challenging General-Purpose Web-Browsing AI Agents] | ||
| * 2025-03: ColBench: [https://arxiv.org/abs/2503.15478 SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks] | * 2025-03: ColBench: [https://arxiv.org/abs/2503.15478 SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks] | ||
| + | * 2025-04: OpenAI [https://openai.com/index/browsecomp/ BrowseComp: a benchmark for browsing agents] | ||
| + | * 2025-04: [https://arxiv.org/abs/2504.11844 Evaluating the Goal-Directedness of Large Language Models] | ||
| ===Evaluation Schemes=== | ===Evaluation Schemes=== | ||
| Line 394: | Line 432: | ||
| ** NeurIPS 2024 paper/poster: [https://openreview.net/pdf?id=fz969ahcvJ AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions] | ** NeurIPS 2024 paper/poster: [https://openreview.net/pdf?id=fz969ahcvJ AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions] | ||
| * [https://x.com/paul_cal/status/1850262678712856764 Pictionary]: LLM suggests prompt, multiple LLMs generate outputs, LLM judges; allows ranking of the generation abilities. | * [https://x.com/paul_cal/status/1850262678712856764 Pictionary]: LLM suggests prompt, multiple LLMs generate outputs, LLM judges; allows ranking of the generation abilities. | ||
| − | * [https:// | + | * [https://mcbench.ai/ MC-bench]: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges ([https://github.com/mc-bench/orchestrator code]). | 
| ===Automated Improvement=== | ===Automated Improvement=== | ||
| Line 401: | Line 439: | ||
| * 2024-08: [https://arxiv.org/abs/2408.08435 Automated Design of Agentic Systems] ([https://github.com/ShengranHu/ADAS ADAS code]) | * 2024-08: [https://arxiv.org/abs/2408.08435 Automated Design of Agentic Systems] ([https://github.com/ShengranHu/ADAS ADAS code]) | ||
| * 2024-08: [https://arxiv.org/abs/2408.02666 Self-Taught Evaluators]: Iterative self-improvement through generation of synthetic data and evaluation | * 2024-08: [https://arxiv.org/abs/2408.02666 Self-Taught Evaluators]: Iterative self-improvement through generation of synthetic data and evaluation | ||
| + | * 2025-05: [https://arxiv.org/abs/2505.22954 Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents] ([https://github.com/jennyzzt/dgm code], [https://sakana.ai/dgm/ project]) | ||
| =See Also= | =See Also= | ||
Latest revision as of 11:56, 23 October 2025
Contents
- 1 Reviews & Perspectives
- 2 AI Assistants
- 3 Advanced Workflows
- 4 Corporate AI Agent Ventures
- 5 Increasing AI Agent Intelligence
- 6 Multi-agent orchestration
- 7 Optimization
- 8 See Also
Reviews & Perspectives
Published
- 2024-04: LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models (code)
- 2024-08: From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future
- 2024-09: Towards a Science Exocortex
- 2024-09: Large Language Model-Based Agents for Software Engineering: A Survey
- 2024-09: Agents in Software Engineering: Survey, Landscape, and Vision
- 2025-04: Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
- 2025-04: A Survey of Large Language Model Agents for Question Answering
- 2025-04: A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems
Continually updating
- OpenThought - System 2 Research Links
- Awesome LLM Strawberry (OpenAI o1): Collection of research papers & blogs for OpenAI Strawberry(o1) and Reasoning
- Awesome AI Agents
Analysis/Opinions
- LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
- Cutting AI Assistant Costs by Up to 77.8%: The Power of Enhancing LLMs with Business Logic
- 2025-05: AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges
Guides
- Anthropic: Building Effective Agents
- Google: Agents and Agents Companion
- OpenAI: A practical guide to building agents
- Anthropic: Claude Code: Best practices for agentic coding
- Anthropic: How we built our multi-agent research system
AI Assistants
Components of AI Assistants
Agent Internal Workflow Management
- LangChain
- Pydantic: Agent Framework / shim to use Pydantic with LLMs
- Flow: A lightweight task engine for building AI agents that prioritizes simplicity and flexibility
- llama-stack
- Huggingface smolagents
- Eliza (includes multi-agent, interaction with docs, Discord, Twitter, etc.)
- Pocket Flow: LLM Framework in 100 Lines
- Coze: All-in-one AI agent development tool
Information Retrieval (Memory)
- See also RAG.
- 2024-09: PaperQA2: Language Models Achieve Superhuman Synthesis of Scientific Knowledge (𝕏 post, code)
- 2024-10: Agentic Information Retrieval
- 2025-02: DeepRAG: Thinking to Retrieval Step by Step for Large Language Models
- Mem0 AI: Memory Layer for AI Agents; self-improving memory layer for LLM applications, enabling personalization.
- 2025-08: Memento: Fine-tuning LLM Agents without Fine-tuning LLMs
Contextual Memory
- Memobase: user profile-based memory (long-term user memory for genAI applications)
Control (tool-use, computer use, etc.)
- See also: Human_Computer_Interaction#AI_Computer_Use
- Tavily: Connect Your LLM to the Web: Empowering your AI applications with real-time, accurate search results tailored for LLMs and RAG
Model Context Protocol (MCP)
- Standards:
- Anthropic Model Context Protocol (MCP)
- OpenAI Agents SDK
 
- Tools:
- Servers:
- Lists:
- Noteworthy:
- Official Github MCP server
- Unofficial Github MCP server
- Puppeteer
- Google Maps MCP Server
- Slack MCP Server
- Zapier MCP Servers (Slack, Google Sheets, Notion, etc.)
- AWS MCP Servers
- ElevenLabs
 
 
Agent2Agent Protocol (A2A)
- Google announcement
Open-source
- Khoj (code): self-hostable AI assistant
- RAGapp: Agentic RAG for enterprise
- STORM: Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking
- Can write (e.g.) Wikipedia-style articles
- code
- Preprint: Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models
 
Personalities/Personas
- 2023-10: Generative Agents: Interactive Simulacra of Human Behavior
- 2024-11: Microsoft TinyTroupe 🤠🤓🥸🧐: LLM-powered multiagent persona simulation for imagination enhancement and business insights
- 2024-11: Generative Agent Simulations of 1,000 People (code)
Specific Uses for AI Assistants
Computer Use
Software Engineering
- 2024-11: MLE-Agent: Your intelligent companion for seamless AI engineering and research
- Agentless: agentless approach to automatically solve software development problems
Science Agents
See Science Agents.
Medicine
- 2025-03: Microsoft Dragon Copilot: streamline clinical workflows and paperwork
- 2025-04: Training state-of-the-art pathology foundation models with orders of magnitude less data
- 2025-04: Towards conversational diagnostic artificial intelligence
- 2025-04: Towards accurate differential diagnosis with large language models
- 2025-08: The Anatomy of a Personal Health Agent
LLM-as-judge
- List of papers.
- LLM Evaluation doesn't need to be complicated
- Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)
- Awesome-LLM-as-a-judge Survey
- haizelabs Awesome LLM Judges
- 2024-08: Self-Taught Evaluators
- 2024-10: Agent-as-a-Judge: Evaluate Agents with Agents
- 2024-11: A Survey on LLM-as-a-Judge
- 2024-12: LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
- 2025-03: Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators
- 2025-04: JudgeLRM: Large Reasoning Models as a Judge
Deep Research
- Google Deep Research
- OpenAI Deep Research
- Perplexity:
- Exa AI:
- Websets: Web research agent
- Web-search agent powered by DeepSeek (code) or o3-mini (code)
 
- Firecrawl wip
- Matt Shumer OpenDeepResearcher
- DeepSearcher (operate on local data)
- nickscamara open-deep-research
- dzhng deep-research
- huggingface open-Deep-research ([https://github.com/huggingface/smolagents/tree/main/examples/open_deep_research code])
- xAI Grok 3 Deep Search
- Liner Deep Research
- Allen AI (AI2) Paper Finder
- 2025-03: Open Deep Search: Democratizing Search with Open-source Reasoning Agents (code)
- Convergence AI Deep Work (swarms for web-based tasks)
- 2025-04: DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
- 2025-04: Anthropic Research
- 2025-04: WebThinker: Empowering Large Reasoning Models with Deep Research Capability
- 2025-09: SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
Advanced Workflows
- Salesforce DEI: meta-system that leverages a diversity of SWE agents
- Sakana AI: AI Scientist
- SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning
- Skywork Super Agent
Streamline Administrative Tasks
Author Research Articles
- 2024-02: STORM: Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models (discussion/analysis)
Software Development Workflows
Several paradigms of AI-assisted coding have arisen:
- Manual, human driven
- AI-aided through chat/dialogue, where the human asks for code and then copies it into the project
- API calls to an LLM, which generates code and inserts the file into the project
- LLM-integration into the IDE
- Copilot
- Qodo (Codium) & AlphaCodium (preprint, code)
- Cursor
- Codeium Windsurf (with "Cascade" AI Agent)
- ByteDance Trae AI
- Tabnine
- Traycer
- IDX: free
- Aide: open-source AI-native code editor (fork of VS Code)
- continue.dev: open-source code assistant
- Pear AI: open-source code editor
- Haystack Editor: canvas UI
- Onlook: for designers
- All Hands AI
- Devin 2.0 (Cognition AI)
- Google Firebase Studio
- rowboat (for building multi-agent workflows)
- Trae IDE: The Real AI Engineer
 
- AI-assisted IDE, where the AI generates and manages the dev environment
- Replit
- Pythagora
- StackBlitz bolt.new
- Cline (formerly Claude Dev)
- All Hands
 
- AI Agent on Commandline
- Aider (code): Pair programming on commandline
- Claude Code
- OpenAI Codex
- Gemini CLI
 
- Prompt-to-product
- Github Spark (demo video)
- Create.xyz: text-to-app, replicate product from link
- a0.dev: generate mobile apps (from your phone)
- Softgen: web app developer
- wrapifai: build form-based apps
- Lovable: web app (from text, screenshot, etc.)
- Vercel v0
- MarsX (John Rush): SaaS builder
- Webdraw: turn sketches into web apps
- Tempo Labs: build React apps
- Databutton: no-code software development
- base44: no-code dashboard apps
- Origin AI
- Emergent AI
 
- Semi-autonomous software engineer agents
- Devin (Cognition AI)
- Amazon Q (and CodeWhisperer)
- Honeycomb
- Agent IDE
- Claude Code
- OpenAI Codex CLI and Codex cloud
- Factory AI Droids
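
As a rough illustration of the "API calls to an LLM, which generates code and inserts the file into the project" paradigm listed above, here is a minimal sketch. The names `fake_llm` and `generate_file` are hypothetical and do not come from any of the listed tools; `fake_llm` is a stand-in that returns fixed text so the example runs offline, and a real setup would presumably replace it with a call to an actual LLM API.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call; returns fixed code."""
    return 'def greet(name):\n    return f"Hello, {name}!"\n'


def generate_file(
    project_dir: Path,
    rel_path: str,
    prompt: str,
    llm: Callable[[str], str] = fake_llm,
) -> Path:
    """Ask the LLM for code and write the result into the project tree."""
    root = project_dir.resolve()
    target = (root / rel_path).resolve()
    # Refuse target paths that would land outside the project directory.
    if root not in target.parents:
        raise ValueError(f"{rel_path!r} escapes the project directory")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(llm(prompt), encoding="utf-8")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = generate_file(Path(tmp), "src/greet.py", "Write a greet function.")
        print(path.read_text(encoding="utf-8"))
```

The path check is a design choice for this sketch, not a feature of any particular tool: because the generated file is written without a human copying it by hand, it seems prudent to constrain where the file can land.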
 
For a review of the current state of software-engineering agentic approaches, see:
- 2024-08: From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future
- 2024-09: Large Language Model-Based Agents for Software Engineering: A Survey
- 2024-09: Agents in Software Engineering: Survey, Landscape, and Vision
Corporate AI Agent Ventures
Mundane Workflows and Capabilities
- Payman AI: AI to Human platform that allows AI to pay people for what it needs
- VoiceFlow: Build customer experiences with AI
- Mistral AI: genAI applications
- Taskade: Task/milestone software with AI agent workflows
- Covalent: Building a Multi-Agent Prompt Refining Application
Inference-compute Reasoning
AI Assistant
- Convergence Proxy
- Shortwave AI Assistant (organize, write, search, schedule, etc.)
Agentic Systems
- Topology AI
- Cognition AI: Devin software engineer (14% SWE-Agent)
- Honeycomb (22% SWE-Agent)
- Factory AI
- Convergence AI Deep Work (swarms for web-based tasks)
- Cloudflare Agents
- Maskara AI
Increasing AI Agent Intelligence
See: Increasing AI Intelligence
Multi-agent orchestration
Research
- 2025-02: Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies
- 2025-03: Why Do Multi-Agent LLM Systems Fail?
- 2025-03: SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
- 2025-09: Federation of Agents: A Semantics-Aware Communication Fabric for Large-Scale Agentic AI
Organization Schemes
Societies and Communities of AI agents
- 2024-12: Cultural Evolution of Cooperation among LLM Agents
- 2025-04: SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users
- 2025-05: Emergent social conventions and collective bias in LLM populations
- 2025-09: Virtual Agent Economies
Domain-specific
- 2024-12: TradingAgents: Multi-Agents LLM Financial Trading Framework
- 2025-01: Agent Laboratory: Using LLM Agents as Research Assistants
Research demos
- Camel
- LoopGPT
- JARVIS
- OpenAGI
- AutoGen
- preprint: AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
- Agent-E: Browser (eventually computer) automation (code, preprint, demo video)
- AutoGen Studio: GUI for agent workflows (code)
- Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
 
- AG2 (previously AutoGen) (code, docs, Discord)
- TaskWeaver
- MetaGPT
- AutoGPT (code); and AutoGPT Platform
- Optima
- 2024-04: LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models (code)
- 2024-06: MASAI: Modular Architecture for Software-engineering AI Agents
- 2024-10: Agent S: An Open Agentic Framework that Uses Computers Like a Human (code)
- 2024-10: AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions
- 2025-02: PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving
Related work
- 2024-07: PersonaGym: Evaluating Persona Agents and LLMs
- 2025-01: Hallucination Mitigation using Agentic AI Natural Language-Based Frameworks
Inter-agent communications
- 2024-10: Agora: A Scalable Communication Protocol for Networks of Large Language Models (preprint): disparate agents auto-negotiate a communication protocol
- 2024-11: DroidSpeak: Enhancing Cross-LLM Communication: exploits caches of embeddings and key-values so that context can be transferred between AIs more easily (without consuming context window)
- 2024-11: Anthropic describes Model Context Protocol: an open standard for secure, two-way connections between data sources and AI (intro, quickstart, code)
- 2025-09: Federation of Agents: A Semantics-Aware Communication Fabric for Large-Scale Agentic AI
Architectures
Open Source Frameworks
- LangChain
- ell (code, docs)
- AgentOps AI AgentStack
- Agent UI
- kyegomez swarms
- OpenAI Swarm (cookbook)
- Amazon AWS Multi-Agent Orchestrator
- KaibanJS: Kanban for AI Agents? (Takes inspiration from Kanban visual work management.)
- Agentarium
- Orchestra (docs, code)
- AutoAgent: Fully-Automated & Zero-Code LLM Agent Framework
- Mastra (github): opinionated TypeScript framework for AI applications (primitives for workflows, agents, RAG, integrations and evals)
- Orra: multi-agent applications with complex real-world interactions
- GenSX
- Cloudflare agents-sdk (info, code)
- OpenAI responses API and agents SDK
- Google Agent Development Kit
Open Source Systems
- ControlFlow
- OpenHands (formerly OpenDevin)
- code: platform for autonomous software engineers, powered by AI and LLMs
- Report: OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
 
Commercial Automation Frameworks
- Lutra: Automation and integration with various web systems.
- Gumloop
- TextQL: Enterprise Virtual Data Analyst
- Athena intelligence: Analytics platform
- Nexus GPT: Business co-pilot
- Multi-On: AI agent that acts on your behalf
- Firecrawl: Turn websites into LLM-ready data
- Reworkd: End-to-end data extraction
- Lindy: Custom AI Assistants to automate business workflows
- E.g. use Slack
 
- Bardeen: Automate workflows
- Abacus: AI Agents
- LlamaIndex: (𝕏, code, docs, Discord)
- MultiOn AI: Agent Q (paper) automated planning and execution
- Google Agentspace
- Flowith
Multi-agent Handoff/Collaboration
Spreadsheet
Cloud solutions
- Numbers Station Meadow: agentic framework for data workflows (code).
- CrewAI says they provide multi-agent automations (code).
- LangChain introduced LangGraph to help build agents, and LangGraph Cloud as a service for running those agents.
- LangGraph Studio is an IDE for agent workflows
 
- C3 AI enterprise platform
- Deepset AI Haystack (docs, code)
Frameworks
- Google Project Oscar
- Agent: Gaby (for "Go AI bot") (code, documentation) helps with issue tracking.
 
- OpenPlexity-Pages: Data-aggregator implementation (like Perplexity) based on CrewAI
Optimization
Reviews
- 2024-12: A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges
- 2025-03: Survey on Evaluation of LLM-based Agents
Metrics, Benchmarks
See also: AI benchmarks
- 2019-11: On the Measure of Intelligence
- 2022-06: PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change
- 2023-06: Can Large Language Models Infer Causation from Correlation? (challenging Corr2Cause task)
- 2024-01: AutoGenBench -- A Tool for Measuring and Evaluating AutoGen Agents
- 2024-04: AutoRace (code): LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models
- 2024-04: OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (github)
- 2024-07: AI Agents That Matter
- 2024-09: CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark (leaderboard)
- 2024-09: LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
- 2024-09: On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability
- 2024-10: MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
- 2024-10: WorFBench: Benchmarking Agentic Workflow Generation
- 2024-10: VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models
- 2024-10: SimpleQA: Measuring short-form factuality in large language models (announcement, code)
- 2024-11: RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts (blog, code)
- 2024-11: The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use (code)
- 2024-11: BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
- 2024-12: TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks (code, project, leaderboard)
- 2025-01: CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings (preprint, leaderboard)
- 2025-02: ENIGMAEVAL: A Benchmark of Long Multimodal Reasoning Challenges (leaderboard)
- 2025-02: MLGym: A New Framework and Benchmark for Advancing AI Research Agents (paper, code)
- 2025-02: WebGames: Challenging General-Purpose Web-Browsing AI Agents
- 2025-03: ColBench: SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
- 2025-04: OpenAI BrowseComp: a benchmark for browsing agents
- 2025-04: Evaluating the Goal-Directedness of Large Language Models
Evaluation Schemes
- 2024-12: LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
- 2025-01: LLMRank ("SlopRank"): LLMs evaluate each other, allowing the top model (for a given prompt/problem) to be inferred from a large number of recommendations.
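The peer-evaluation idea behind the LLMRank entry can be illustrated with a PageRank-style iteration over an endorsement matrix. This is only a minimal sketch under stated assumptions, not the project's actual implementation: the function name `peer_rank`, the damping factor, the model names, and the vote counts are all hypothetical.

```python
def peer_rank(endorsements, damping=0.85, iterations=100):
    """PageRank-style scores from peer endorsements (illustrative only).

    endorsements[i][j] is how many times model i recommended model j's answer.
    """
    n = len(endorsements)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Every model keeps a small baseline score.
        new_scores = [(1.0 - damping) / n] * n
        for i, row in enumerate(endorsements):
            total = sum(row)
            for j in range(n):
                # A model that endorsed nobody spreads its weight evenly.
                share = row[j] / total if total > 0 else 1.0 / n
                new_scores[j] += damping * scores[i] * share
        scores = new_scores
    return scores


# Hypothetical vote counts for three hypothetical models.
models = ["model_a", "model_b", "model_c"]
votes = [
    [0, 4, 1],  # model_a mostly recommends model_b
    [2, 0, 2],  # model_b splits its recommendations
    [1, 4, 0],  # model_c mostly recommends model_b
]
scores = peer_rank(votes)
best = models[max(range(len(models)), key=scores.__getitem__)]
```

Weighting each recommendation by the recommender's own score means that endorsements from highly rated models count for more, which is one plausible way to infer a top model from many mutual evaluations.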
Multi-agent
- 2024-12: Cultural Evolution of Cooperation among LLM Agents
- Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure
Agent Challenges
- Aidan-Bench: Test creativity by having a particular LLM generate a long sequence of outputs (meant to be different), and measuring how long it can go before duplicates appear.
- NeurIPS 2024 paper/poster: AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions
 
- Pictionary: LLM suggests prompt, multiple LLMs generate outputs, LLM judges; allows ranking of the generation abilities.
- MC-bench: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges (code).
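The duplicate-counting idea described in the Aidan-Bench entry above can be sketched as follows. This is a rough illustration, not the benchmark's actual scoring: word-set Jaccard similarity is used here as a simple stand-in for whatever similarity measure the benchmark applies, and the function names, threshold, and sample answers are hypothetical.

```python
def jaccard(a, b):
    """Word-set overlap between two strings (stand-in similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def novel_streak(outputs, threshold=0.8):
    """Count outputs produced before the first near-duplicate appears."""
    seen = []
    for text in outputs:
        # Stop at the first output that is too similar to an earlier one.
        if any(jaccard(text, prev) >= threshold for prev in seen):
            break
        seen.append(text)
    return len(seen)


# Hypothetical model answers to one open-ended question.
answers = [
    "Use a kite to lift the antenna",
    "Attach the antenna to a tall tree",
    "Mount it on a weather balloon",
    "use a kite to lift the antenna",  # repeats the first answer
]
streak = novel_streak(answers)
```

A longer streak indicates that the model kept producing distinct answers for longer before repeating itself.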
Automated Improvement
- 2024-06: EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms
- 2024-06: Symbolic Learning Enables Self-Evolving Agents
- 2024-08: Automated Design of Agentic Systems (ADAS code)
- 2024-08: Self-Taught Evaluators: Iterative self-improvement through generation of synthetic data and evaluation
- 2025-05: Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents (code, project)

