Difference between revisions of "AI Agents"

From GISAXS
(Metrics, Benchmarks)
(Research)
 
(22 intermediate revisions by the same user not shown)
Line 9: Line 9:
 
* 2025-04: [https://arxiv.org/abs/2504.01990 Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems]
 
* 2025-04: [https://arxiv.org/abs/2504.01990 Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems]
 
* 2025-04: [https://arxiv.org/abs/2503.19213 A Survey of Large Language Model Agents for Question Answering]
 
* 2025-04: [https://arxiv.org/abs/2503.19213 A Survey of Large Language Model Agents for Question Answering]
 +
* 2025-04: [https://arxiv.org/abs/2504.09037 A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems]
 +
* 2025-04: [https://arxiv.org/abs/2504.01990 Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems]
  
 
===Continually updating===
 
===Continually updating===
Line 21: Line 23:
 
===Guides===
 
===Guides===
 
* Anthropic: [https://www.anthropic.com/research/building-effective-agents Building Effective Agents]
 
* Anthropic: [https://www.anthropic.com/research/building-effective-agents Building Effective Agents]
* Google: [https://www.kaggle.com/whitepaper-agents Agents]
+
* Google: [https://www.kaggle.com/whitepaper-agents Agents] and [https://www.kaggle.com/whitepaper-agent-companion Agents Companion]
 +
* OpenAI: [https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf A practical guide to building agents]
 +
* Anthropic: [https://www.anthropic.com/engineering/claude-code-best-practices Claude Code: Best practices for agentic coding]
  
 
=AI Assistants=
 
=AI Assistants=
Line 139: Line 143:
 
* [https://convergence.ai/welcome Convergence AI] Deep Work (swarms for web-based tasks)
 
* [https://convergence.ai/welcome Convergence AI] Deep Work (swarms for web-based tasks)
 
* 2025-04: [https://arxiv.org/abs/2504.03160 DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments]
 
* 2025-04: [https://arxiv.org/abs/2504.03160 DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments]
 +
* 2025-04: Anthropic [https://x.com/AnthropicAI/status/1912192384588271771 Research]
 +
* 2025-04: [https://arxiv.org/abs/2504.21776 WebThinker: Empowering Large Reasoning Models with Deep Research Capability]
  
 
=Advanced Workflows=
 
=Advanced Workflows=
Line 148: Line 154:
 
* [https://arxiv.org/abs/2409.05556 SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning]
 
* [https://arxiv.org/abs/2409.05556 SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning]
 
** [https://github.com/lamm-mit/SciAgentsDiscovery code]
 
** [https://github.com/lamm-mit/SciAgentsDiscovery code]
 +
* [https://skywork.ai/home Skywork] [https://skywork.ai/home?inviter=el.cine&shortlink_id=1919604877427924992&utm_source=X Super Agent]
  
 
===Streamline Administrative Tasks===
 
===Streamline Administrative Tasks===
Line 179: Line 186:
 
## [https://app.devin.ai/ Devin 2.0] ([https://cognition.ai/ Cognition AI])
 
## [https://app.devin.ai/ Devin 2.0] ([https://cognition.ai/ Cognition AI])
 
## Google [https://firebase.google.com/docs/studio Firebase Studio]
 
## Google [https://firebase.google.com/docs/studio Firebase Studio]
 +
## [https://github.com/rowboatlabs/rowboat rowboat] (for building multi-agent workflows)
 +
## [https://www.trae.ai/ Trae IDE]: The Real AI Engineer
 
# AI-assisted IDE, where the AI generates and manages the dev environment
 
# AI-assisted IDE, where the AI generates and manages the dev environment
 
## [https://replit.com/ Replit]
 
## [https://replit.com/ Replit]
Line 199: Line 208:
 
## [https://base44.com/ base44]: no-code dashboard apps
 
## [https://base44.com/ base44]: no-code dashboard apps
 
## [https://www.theorigin.ai/ Origin AI]
 
## [https://www.theorigin.ai/ Origin AI]
 +
## [https://app.emergent.sh/ Emergent AI]
 
# Semi-autonomous software engineer agents
 
# Semi-autonomous software engineer agents
 
## [https://www.cognition.ai/blog/introducing-devin Devin] (Cognition AI)
 
## [https://www.cognition.ai/blog/introducing-devin Devin] (Cognition AI)
 
## [https://aws.amazon.com/q/ Amazon Q] (and CodeWhisperer)
 
## [https://aws.amazon.com/q/ Amazon Q] (and CodeWhisperer)
 
## [https://honeycomb.sh/ Honeycomb]
 
## [https://honeycomb.sh/ Honeycomb]
 +
## [https://www.blackbox.ai/ Agent IDE]
 
## [https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview Claude Code]
 
## [https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview Claude Code]
 
+
## OpenAI [https://help.openai.com/en/articles/11096431-openai-codex-cli-getting-started Codex CLI] and [https://openai.com/index/introducing-codex/ Codex] cloud
 +
## [https://www.factory.ai/ Factory AI] [https://x.com/FactoryAI/status/1927754706014630357 Droids]
 
For a review of the current state of software-engineering agentic approaches, see:
 
For a review of the current state of software-engineering agentic approaches, see:
 
* 2024-08: [https://arxiv.org/abs/2408.02479 From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future]
 
* 2024-08: [https://arxiv.org/abs/2408.02479 From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future]
Line 232: Line 244:
 
* [https://convergence.ai/welcome Convergence AI] Deep Work (swarms for web-based tasks)
 
* [https://convergence.ai/welcome Convergence AI] Deep Work (swarms for web-based tasks)
 
* [https://agents.cloudflare.com/ Cloudflare Agents]
 
* [https://agents.cloudflare.com/ Cloudflare Agents]
 +
* [https://www.maskara.ai/ Maskara AI]
  
 
=Increasing AI Agent Intelligence=
 
=Increasing AI Agent Intelligence=
Line 238: Line 251:
 
=Multi-agent orchestration=
 
=Multi-agent orchestration=
 
==Research==
 
==Research==
 +
* 2025-02: [https://arxiv.org/abs/2502.02533 Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies]
 
* 2025-03: [https://arxiv.org/abs/2503.13657 Why Do Multi-Agent LLM Systems Fail?]
 
* 2025-03: [https://arxiv.org/abs/2503.13657 Why Do Multi-Agent LLM Systems Fail?]
 
* 2025-03: [https://arxiv.org/abs/2503.15478 SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks]
 
* 2025-03: [https://arxiv.org/abs/2503.15478 SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks]
Line 246: Line 260:
 
===Societies and Communities of AI agents===
 
===Societies and Communities of AI agents===
 
* 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]
 
* 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]
 +
* 2025-04: [https://arxiv.org/abs/2504.10157 SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users]
 +
* 2025-05: [https://www.science.org/doi/10.1126/sciadv.adu9368 Emergent social conventions and collective bias in LLM populations]
  
 
===Domain-specific===
 
===Domain-specific===
Line 327: Line 343:
 
* [https://www.bardeen.ai/ Bardeen]: Automate workflows
 
* [https://www.bardeen.ai/ Bardeen]: Automate workflows
 
* [https://abacus.ai/ Abacus]: [https://abacus.ai/ai_agents AI Agents]
 
* [https://abacus.ai/ Abacus]: [https://abacus.ai/ai_agents AI Agents]
 +
** [https://abacus.ai/help/howTo HowTo]
 
* [https://www.llamaindex.ai/ LlamaIndex]: ([https://x.com/llama_index 𝕏], [https://github.com/run-llama/llama_index code], [https://docs.llamaindex.ai/en/stable/ docs], [https://discord.com/invite/dGcwcsnxhU Discord])
 
* [https://www.llamaindex.ai/ LlamaIndex]: ([https://x.com/llama_index 𝕏], [https://github.com/run-llama/llama_index code], [https://docs.llamaindex.ai/en/stable/ docs], [https://discord.com/invite/dGcwcsnxhU Discord])
 
* [https://www.multion.ai/ MultiOn AI]: [https://www.multion.ai/blog/introducing-agent-q-research-breakthrough-for-the-next-generation-of-ai-agents-with-planning-and-self-healing-capabilities Agent Q] ([https://multion-research.s3.us-east-2.amazonaws.com/AgentQ.pdf paper]) automated planning and execution
 
* [https://www.multion.ai/ MultiOn AI]: [https://www.multion.ai/blog/introducing-agent-q-research-breakthrough-for-the-next-generation-of-ai-agents-with-planning-and-self-healing-capabilities Agent Q] ([https://multion-research.s3.us-east-2.amazonaws.com/AgentQ.pdf paper]) automated planning and execution
 
* Google [https://cloud.google.com/products/agentspace Agentspace]
 
* Google [https://cloud.google.com/products/agentspace Agentspace]
 +
* [https://try.flowith.io/ Flowith]
  
 
===Multi-agent Handoff/Collaboration===
 
===Multi-agent Handoff/Collaboration===
Line 339: Line 357:
 
* [https://www.paradigmai.com/ Paradigm]
 
* [https://www.paradigmai.com/ Paradigm]
 
* [https://www.superworker.ai/ Superworker AI]
 
* [https://www.superworker.ai/ Superworker AI]
 +
* [https://www.genspark.ai/ Genspark]
  
 
==Cloud solutions==
 
==Cloud solutions==
Line 384: Line 403:
 
* 2025-03: ColBench: [https://arxiv.org/abs/2503.15478 SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks]
 
* 2025-03: ColBench: [https://arxiv.org/abs/2503.15478 SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks]
 
* 2025-04: OpenAI [https://openai.com/index/browsecomp/ BrowseComp: a benchmark for browsing agents]

* 2025-04: OpenAI [https://openai.com/index/browsecomp/ BrowseComp: a benchmark for browsing agents]
 +
* 2025-04: [https://arxiv.org/abs/2504.11844 Evaluating the Goal-Directedness of Large Language Models]
  
 
===Evaluation Schemes===
 
===Evaluation Schemes===
Line 397: Line 417:
 
** NeurIPS 2024 paper/poster: [https://openreview.net/pdf?id=fz969ahcvJ AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions]
 
** NeurIPS 2024 paper/poster: [https://openreview.net/pdf?id=fz969ahcvJ AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions]
 
* [https://x.com/paul_cal/status/1850262678712856764 Pictionary]: LLM suggests prompt, multiple LLMs generate outputs, LLM judges; allows ranking of the generation abilities.

* [https://x.com/paul_cal/status/1850262678712856764 Pictionary]: LLM suggests prompt, multiple LLMs generate outputs, LLM judges; allows ranking of the generation abilities.
* [https://github.com/mc-bench/orchestrator MC-bench]: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges.
+
* [https://mcbench.ai/ MC-bench]: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges ([https://github.com/mc-bench/orchestrator code]).
  
 
===Automated Improvement===
 
===Automated Improvement===
Line 404: Line 424:
 
* 2024-08: [https://arxiv.org/abs/2408.08435 Automated Design of Agentic Systems] ([https://github.com/ShengranHu/ADAS ADAS code])
 
* 2024-08: [https://arxiv.org/abs/2408.08435 Automated Design of Agentic Systems] ([https://github.com/ShengranHu/ADAS ADAS code])
 
* 2024-08: [https://arxiv.org/abs/2408.02666 Self-Taught Evaluators]: Iterative self-improvement through generation of synthetic data and evaluation
 
* 2024-08: [https://arxiv.org/abs/2408.02666 Self-Taught Evaluators]: Iterative self-improvement through generation of synthetic data and evaluation
 +
* 2025-05: [https://arxiv.org/abs/2505.22954 Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents] ([https://github.com/jennyzzt/dgm code], [https://sakana.ai/dgm/ project])
  
 
=See Also=
 
=See Also=

Latest revision as of 12:17, 3 June 2025

Reviews & Perspectives

Published

Continually updating

Analysis/Opinions

Guides

AI Assistants

Components of AI Assistants

Agent Internal Workflow Management

Information Retrieval (Memory)

Contextual Memory

  • Memobase: user profile-based memory (long-term user memory for genAI applications)

Control (tool-use, computer use, etc.)

Model Context Protocol (MCP)

Agent2Agent Protocol (A2A)

Open-source

Personalities/Personas

Specific Uses for AI Assistants

Computer Use

Software Engineering

Science Agents

See Science Agents.

Medicine

LLM-as-judge

Deep Research

Advanced Workflows

Streamline Administrative Tasks

Author Research Articles

Software Development Workflows

Several paradigms of AI-assisted coding have arisen:

  1. Manual, human driven
  2. AI-aided through chat/dialogue, where the human asks for code and then copies it into the project
    1. OpenAI ChatGPT
    2. Anthropic Claude
  3. API calls to an LLM, which generates code and inserts the file into the project
  4. LLM-integration into the IDE
    1. Copilot
    2. Qodo (Codium) & AlphaCodium (preprint, code)
    3. Cursor
    4. Codeium Windsurf (with "Cascade" AI Agent)
    5. ByteDance Trae AI
    6. Tabnine
    7. Traycer
    8. IDX: free
    9. Aide: open-source AI-native code editor (fork of VS Code)
    10. continue.dev: open-source code assistant
    11. Pear AI: open-source code editor
    12. Haystack Editor: canvas UI
    13. Onlook: for designers
    14. All Hands AI
    15. Devin 2.0 (Cognition AI)
    16. Google Firebase Studio
    17. rowboat (for building multi-agent workflows)
    18. Trae IDE: The Real AI Engineer
  5. AI-assisted IDE, where the AI generates and manages the dev environment
    1. Replit
    2. Aider (code): Pair programming on commandline
    3. Pythagora
    4. StackBlitz bolt.new
    5. Cline (formerly Claude Dev)
  6. Prompt-to-product
    1. Github Spark (demo video)
    2. Create.xyz: text-to-app, replicate product from link
    3. a0.dev: generate mobile apps (from your phone)
    4. Softgen: web app developer
    5. wrapifai: build form-based apps
    6. Lovable: web app (from text, screenshot, etc.)
    7. Vercel v0
    8. MarsX (John Rush): SaaS builder
    9. Webdraw: turn sketches into web apps
    10. Tempo Labs: build React apps
    11. Databutton: no-code software development
    12. base44: no-code dashboard apps
    13. Origin AI
    14. Emergent AI
  7. Semi-autonomous software engineer agents
    1. Devin (Cognition AI)
    2. Amazon Q (and CodeWhisperer)
    3. Honeycomb
    4. Agent IDE
    5. Claude Code
    6. OpenAI Codex CLI and Codex cloud
    7. Factory AI Droids

For a review of the current state of software-engineering agentic approaches, see:

Corporate AI Agent Ventures

Mundane Workflows and Capabilities

Inference-compute Reasoning

AI Assistant

Agentic Systems

Increasing AI Agent Intelligence

See: Increasing AI Intelligence

Multi-agent orchestration

Research

Organization Schemes

Societies and Communities of AI agents

Domain-specific

Research demos

Related work

Inter-agent communications

Architectures

Open Source Frameworks

Open Source Systems

Commercial Automation Frameworks

Multi-agent Handoff/Collaboration

Spreadsheet

Cloud solutions

Frameworks

Optimization

Reviews

Metrics, Benchmarks

See also: AI benchmarks

Evaluation Schemes

Multi-agent

Agent Challenges

  • Aidan-Bench: Test creativity by having a particular LLM generate a long sequence of outputs (meant to be different), and measuring how long it can go before duplications appear.
  • Pictionary: LLM suggests prompt, multiple LLMs generate outputs, LLM judges; allows ranking of the generation abilities.
  • MC-bench: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges (code).
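
The Aidan-Bench idea above (count how many outputs a model produces before it starts repeating itself) can be illustrated with a minimal sketch. This is not the benchmark's actual scoring code: the function name `outputs_before_duplicate` is hypothetical, and treating a "duplication" as an exact match after lowercasing and whitespace normalization is an assumption made here for simplicity; the real benchmark may well use a softer similarity measure.

```python
from typing import Iterable


def outputs_before_duplicate(outputs: Iterable[str]) -> int:
    """Return how many outputs were produced before the first repeat.

    Assumption (illustrative only): two outputs count as duplicates if
    they are identical after lowercasing and collapsing whitespace.
    """
    seen = set()
    count = 0
    for text in outputs:
        # Normalize so trivial formatting differences do not hide a repeat.
        key = " ".join(text.lower().split())
        if key in seen:
            break  # first duplication: stop counting
        seen.add(key)
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical model outputs for one open-ended question.
    answers = ["Use a kite", "Build a raft", "use a  kite", "Dig a tunnel"]
    print(outputs_before_duplicate(answers))
```

Under this scoring, a higher count means the model sustained distinct answers for longer; swapping the exact-match check for a similarity threshold would change only the duplicate test inside the loop.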

Automated Improvement

See Also