Difference between revisions of "AI Agents"

From GISAXS
 
* 2024-09: [https://www.arxiv.org/abs/2409.02977 Large Language Model-Based Agents for Software Engineering: A Survey]
* 2024-09: [https://arxiv.org/abs/2409.09030 Agents in Software Engineering: Survey, Landscape, and Vision]
* 2025-04: [https://arxiv.org/abs/2504.01990 Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems]
* 2025-04: [https://arxiv.org/abs/2503.19213 A Survey of Large Language Model Agents for Question Answering]
* 2025-04: [https://arxiv.org/abs/2504.09037 A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems]
 
===Continually updating===
 
===Guides===

* Anthropic: [https://www.anthropic.com/research/building-effective-agents Building Effective Agents]
* Google: [https://www.kaggle.com/whitepaper-agents Agents] and [https://www.kaggle.com/whitepaper-agent-companion Agents Companion]
* OpenAI: [https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf A practical guide to building agents]
* Anthropic: [https://www.anthropic.com/engineering/claude-code-best-practices Claude Code: Best practices for agentic coding]
 
=AI Assistants=
 
* See also: [[Human_Computer_Interaction#AI_Computer_Use]]
* [https://tavily.com/ Tavily]: Connect Your LLM to the Web: Empowering your AI applications with real-time, accurate search results tailored for LLMs and RAG

===Model Context Protocol (MCP)===

* '''Standards:'''
*# Anthropic [https://www.anthropic.com/news/model-context-protocol Model Context Protocol] (MCP)
*# [https://openai.github.io/openai-agents-python/mcp/ OpenAI Agents SDK]
* '''Tools:'''
** [https://github.com/jlowin/fastmcp FastMCP]: The fast, Pythonic way to build MCP servers
** [https://github.com/fleuristes/fleur/ Fleur]: A desktop app marketplace for Claude Desktop
* '''Servers:'''
** '''Lists:'''
**# [https://github.com/modelcontextprotocol/servers Model Context Protocol servers]
**# [https://www.mcpt.com/ MCP Servers, One Managed Registry]
**# [https://github.com/punkpeye/awesome-mcp-servers Awesome MCP Servers]
** '''Noteworthy:'''
**# Official [https://github.com/github/github-mcp-server Github MCP server]
**# Unofficial [https://github.com/modelcontextprotocol/servers/tree/main/src/github Github MCP server]
**# [https://github.com/modelcontextprotocol/servers/tree/main/src/puppeteer Puppeteer]
**# [https://github.com/modelcontextprotocol/servers/tree/main/src/google-maps Google Maps MCP Server]
**# [https://github.com/modelcontextprotocol/servers/tree/main/src/slack Slack MCP Server]
**# [https://zapier.com/mcp Zapier MCP Servers] (Slack, Google Sheets, Notion, etc.)
**# [https://github.com/awslabs/mcp AWS MCP Servers]
**# [https://x.com/elevenlabsio/status/1909300782673101265 ElevenLabs]
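Under the hood, MCP is a JSON-RPC 2.0 protocol: a client invokes a server-exposed tool with a <code>tools/call</code> request. A minimal sketch of the message shape (the <code>get_weather</code> tool and its arguments are hypothetical, not part of any real server):

```python
import json

# Hypothetical example of the JSON-RPC 2.0 message an MCP client sends
# to invoke a server-exposed tool via the "tools/call" method.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_weather",            # hypothetical tool name
        "arguments": {"city": "Boston"},  # tool-specific arguments
    },
}
print(json.dumps(request, indent=2))
```

The server replies with a matching JSON-RPC response carrying the tool's result content.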
===Agent2Agent Protocol (A2A)===
* Google [https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/ announcement]
  
 
===Open-source===
 
===Medicine===

* 2025-03: [https://news.microsoft.com/2025/03/03/microsoft-dragon-copilot-provides-the-healthcare-industrys-first-unified-voice-ai-assistant-that-enables-clinicians-to-streamline-clinical-documentation-surface-information-and-automate-task/ Microsoft Dragon Copilot]: streamline clinical workflows and paperwork
* 2025-04: [https://arxiv.org/abs/2504.05186 Training state-of-the-art pathology foundation models with orders of magnitude less data]
* 2025-04: [https://www.nature.com/articles/s41586-025-08866-7?linkId=13898052 Towards conversational diagnostic artificial intelligence]
* 2025-04: [https://www.nature.com/articles/s41586-025-08869-4?linkId=13898054 Towards accurate differential diagnosis with large language models]
 
===LLM-as-judge===

* [https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge Awesome-LLM-as-a-judge Survey]
* [https://github.com/haizelabs/Awesome-LLM-Judges haizelabs Awesome LLM Judges]
* 2024-08: [https://arxiv.org/abs/2408.02666 Self-Taught Evaluators]
* 2024-10: [https://arxiv.org/abs/2410.10934 Agent-as-a-Judge: Evaluate Agents with Agents]
* 2024-11: [https://arxiv.org/abs/2411.15594 A Survey on LLM-as-a-Judge]
* 2024-12: [https://arxiv.org/abs/2412.05579 LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods]
* 2025-03: [https://arxiv.org/abs/2503.19877 Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators]
* 2025-04: [https://arxiv.org/abs/2504.00050 JudgeLRM: Large Reasoning Models as a Judge]
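In practice, the LLM-as-judge pattern surveyed above reduces to a rubric prompt plus parsing of a structured verdict. A minimal sketch; the prompt wording and 1-5 scale are illustrative, and <code>call_llm</code> is a placeholder for any chat-completion call (a stub is used below):

```python
import json
import re

JUDGE_PROMPT = """You are an impartial judge. Rate the ASSISTANT answer
to the QUESTION on a 1-5 scale for correctness and helpfulness.
Reply with JSON: {{"score": <1-5>, "reason": "<one sentence>"}}

QUESTION: {question}
ASSISTANT: {answer}"""

def judge(question: str, answer: str, call_llm) -> dict:
    """Ask a judge model for a structured verdict and parse it."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # tolerate extra prose
    return json.loads(match.group(0))

# Stub standing in for a real model call:
verdict = judge("What is 2+2?", "4",
                lambda prompt: '{"score": 5, "reason": "Correct."}')
print(verdict["score"])  # → 5
```

Agent-as-a-Judge extends the same idea by letting an agent (with tools) produce the verdict instead of a single completion.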
  
 
===Deep Research===

* Google [https://blog.google/products/gemini/google-gemini-deep-research/ Deep Research]
* OpenAI [https://openai.com/index/introducing-deep-research/ Deep Research]
* Perplexity:
** [https://www.perplexity.ai/ Search]
** [https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research Deep Research]
* [https://exa.ai/ Exa AI]:
** [https://exa.ai/websets Websets]: Web research agent
** [https://demo.exa.ai/deepseekchat Web-search agent] powered by DeepSeek ([https://github.com/exa-labs/exa-deepseek-chat code]) or [https://o3minichat.exa.ai/ o3-mini] ([https://github.com/exa-labs/exa-o3mini-chat code])
* [https://www.firecrawl.dev/ Firecrawl] [https://x.com/nickscamara_/status/1886287956291338689 wip]
* [https://x.com/mattshumer_ Matt Shumer] [https://github.com/mshumer/OpenDeepResearcher OpenDeepResearcher]
* [https://huggingface.co/ huggingface] [https://huggingface.co/blog/open-deep-research open-Deep-research] ([https://github.com/huggingface/smolagents/tree/main/examples/open_deep_research code])
* xAI Grok 3 Deep Search
* [https://liner.com/news/introducing-deepresearch Liner Deep Research]
* [https://allenai.org/ Allen AI] (AI2) [https://paperfinder.allen.ai/chat Paper Finder]
* 2025-03: [https://arxiv.org/abs/2503.20201 Open Deep Search: Democratizing Search with Open-source Reasoning Agents] ([https://github.com/sentient-agi/OpenDeepSearch code])
* [https://convergence.ai/welcome Convergence AI] Deep Work (swarms for web-based tasks)
* 2025-04: [https://arxiv.org/abs/2504.03160 DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments]
* 2025-04: Anthropic [https://x.com/AnthropicAI/status/1912192384588271771 Research]
* 2025-04: [https://arxiv.org/abs/2504.21776 WebThinker: Empowering Large Reasoning Models with Deep Research Capability]
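The "deep research" products above share a common control loop: search, summarize the sources, then decide whether to dig further with a follow-up query. A generic sketch of that loop; all callbacks are stubs standing in for real LLM and search-API calls, not any vendor's actual implementation:

```python
# Generic "deep research" loop: iteratively search, read, and refine
# until the agent judges the question answered (or a budget runs out).
def deep_research(question, search, summarize, need_more, max_rounds=3):
    notes = []
    query = question
    for _ in range(max_rounds):
        results = search(query)                 # e.g. a web-search API
        notes.append(summarize(results))        # LLM condenses the sources
        follow_up = need_more(question, notes)  # LLM proposes next query
        if follow_up is None:                   # agent is satisfied
            break
        query = follow_up
    return " ".join(notes)

# Toy stubs to illustrate the control flow:
report = deep_research(
    "Who wrote X?",
    search=lambda q: ["doc about " + q],
    summarize=lambda docs: docs[0],
    need_more=lambda q, notes: None,  # stop after one round
)
print(report)  # → "doc about Who wrote X?"
```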
  
 
=Advanced Workflows=
 
## [https://haystackeditor.com/ Haystack Editor]: canvas UI
## [https://onlook.com/ Onlook]: for designers
## [https://www.all-hands.dev/ All Hands AI]
## [https://app.devin.ai/ Devin 2.0] ([https://cognition.ai/ Cognition AI])
## Google [https://firebase.google.com/docs/studio Firebase Studio]
## [https://github.com/rowboatlabs/rowboat rowboat] (for building multi-agent workflows)
# AI-assisted IDE, where the AI generates and manages the dev environment
## [https://replit.com/ Replit]
==AI Assistant==

* [https://convergence.ai/ Convergence] [https://proxy.convergence.ai/ Proxy]
* [https://www.shortwave.com/ Shortwave] [https://www.shortwave.com/docs/guides/ai-assistant/ AI Assistant] (organize, write, search, schedule, etc.)
  
 
==Agentic Systems==

* [https://honeycomb.sh/ Honeycomb] ([https://honeycomb.sh/blog/swe-bench-technical-report 22% SWE-Agent])
* [https://www.factory.ai/ Factory AI]
* [https://convergence.ai/welcome Convergence AI] Deep Work (swarms for web-based tasks)
* [https://agents.cloudflare.com/ Cloudflare Agents]
* [https://www.maskara.ai/ Maskara AI]
  
 
=Increasing AI Agent Intelligence=
 
=Multi-agent orchestration=

==Research==

* 2025-03: [https://arxiv.org/abs/2503.13657 Why Do Multi-Agent LLM Systems Fail?]
* 2025-03: [https://arxiv.org/abs/2503.15478 SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks]
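A common orchestration baseline studied in this literature is a router agent that dispatches each turn to a worker agent over a shared message history. A minimal sketch with plain functions standing in for LLM-backed agents (the router policy, worker names, and history format are all illustrative):

```python
# Minimal round-based orchestrator: a router picks which worker handles
# each step; workers read and extend a shared history. In a real system
# each callable would wrap an LLM call.
def orchestrate(task, router, workers, max_steps=5):
    history = [("user", task)]
    for _ in range(max_steps):
        name = router(history)          # router chooses a worker (or "done")
        if name == "done":
            break
        reply = workers[name](history)  # worker acts on shared history
        history.append((name, reply))
    return history

workers = {
    "researcher": lambda h: "facts about " + h[0][1],
    "writer": lambda h: "summary of " + h[-1][1],
}
# Toy router: research first, then write, then stop.
order = iter(["researcher", "writer", "done"])
history = orchestrate("topic X", lambda h: next(order), workers)
print(history[-1])  # → ('writer', 'summary of facts about topic X')
```

Failure modes catalogued in "Why Do Multi-Agent LLM Systems Fail?" largely concern this router/worker boundary: bad handoffs, lost context, and premature "done" decisions.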
 
===Organization Schemes===

* 2025-03: [https://arxiv.org/abs/2503.02390 ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks]
 
===Societies and Communities of AI agents===

* 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]
* 2025-04: [https://arxiv.org/abs/2504.10157 SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users]
  
 
===Domain-specific===
 
===Related work===

* 2024-07: [https://arxiv.org/abs/2407.18416 PersonaGym: Evaluating Persona Agents and LLMs]
* 2025-01: [https://arxiv.org/abs/2501.13946 Hallucination Mitigation using Agentic AI Natural Language-Based Frameworks]
  
 
===Inter-agent communications===
 
* [https://github.com/gensx-inc/gensx/blob/main/README.md GenSX]
* Cloudflare [https://developers.cloudflare.com/agents/ agents-sdk] ([https://blog.cloudflare.com/build-ai-agents-on-cloudflare/ info], [https://github.com/cloudflare/agents code])
* OpenAI [https://platform.openai.com/docs/api-reference/responses responses API] and [https://platform.openai.com/docs/guides/agents agents SDK]
* Google [https://google.github.io/adk-docs/ Agent Development Kit]
  
 
==Open Source Systems==
 
* [https://www.llamaindex.ai/ LlamaIndex]: ([https://x.com/llama_index 𝕏], [https://github.com/run-llama/llama_index code], [https://docs.llamaindex.ai/en/stable/ docs], [https://discord.com/invite/dGcwcsnxhU Discord])
* [https://www.multion.ai/ MultiOn AI]: [https://www.multion.ai/blog/introducing-agent-q-research-breakthrough-for-the-next-generation-of-ai-agents-with-planning-and-self-healing-capabilities Agent Q] ([https://multion-research.s3.us-east-2.amazonaws.com/AgentQ.pdf paper]) automated planning and execution
* Google [https://cloud.google.com/products/agentspace Agentspace]

===Multi-agent Handoff/Collaboration===

* [https://www.maskara.ai/ Maskara AI]
  
 
===Spreadsheet===

* [https://ottogrid.ai/ Otto Grid]
* [https://www.paradigmai.com/ Paradigm]
* [https://www.superworker.ai/ Superworker AI]
  
 
==Cloud solutions==
  
 
=Optimization=

===Reviews===

* 2024-12: [https://arxiv.org/abs/2412.11936 A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges]
* 2025-03: [https://arxiv.org/abs/2503.16416 Survey on Evaluation of LLM-based Agents]
 
===Metrics, Benchmarks===

See also: [[AI benchmarks]]

* 2019-11: [https://arxiv.org/abs/1911.01547 On the Measure of Intelligence]
* 2022-06: [https://arxiv.org/abs/2206.10498 PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change]
* 2024-11: [https://arxiv.org/abs/2411.13543 BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games]
* 2024-12: [https://arxiv.org/abs/2412.14161 TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks] ([https://github.com/TheAgentCompany/TheAgentCompany code], [https://the-agent-company.com/ project], [https://the-agent-company.com/#/leaderboard leaderboard])
* 2025-01: [https://codeelo-bench.github.io/ CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings] ([https://arxiv.org/abs/2501.01257 preprint], [https://codeelo-bench.github.io/#leaderboard-table leaderboard])
* 2025-02: [https://static.scale.com/uploads/654197dc94d34f66c0f5184e/EnigmaEval%20v4.pdf ENIGMAEVAL: A Benchmark of Long Multimodal Reasoning Challenges] ([https://scale.com/leaderboard/enigma_eval leaderboard])
* 2025-02: [https://sites.google.com/view/mlgym MLGym: A New Framework and Benchmark for Advancing AI Research Agents] ([https://arxiv.org/abs/2502.14499 paper], [https://github.com/facebookresearch/MLGym code])
* 2025-02: [https://arxiv.org/abs/2502.18356 WebGames: Challenging General-Purpose Web-Browsing AI Agents]
* 2025-03: ColBench: [https://arxiv.org/abs/2503.15478 SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks]
* 2025-04: OpenAI [https://openai.com/index/browsecomp/ BrowseComp: a benchmark for browsing agents]
* 2025-04: [https://arxiv.org/abs/2504.11844 Evaluating the Goal-Directedness of Large Language Models]
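Several of these leaderboards (e.g. CodeElo) rank models with Elo ratings computed from pairwise comparisons. The standard Elo update rule, as a sketch (K = 32 is a common but arbitrary step-size choice):

```python
# Elo update for a single pairwise comparison: each rating moves toward
# the observed result by K times the surprise (score minus expectation).
def elo_update(r_a, r_b, score_a, k=32):
    """score_a: 1.0 if A wins, 0.0 if A loses, 0.5 for a draw."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

a, b = elo_update(1000, 1000, 1.0)  # equal ratings, A wins
print(round(a), round(b))  # → 1016 984
```

An upset (a low-rated model beating a high-rated one) produces a larger delta, which is what lets these rankings converge from relatively few head-to-head judgments.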
  
 
===Evaluation Schemes===
 
===Multi-agent===

* 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]
* [https://github.com/lechmazur/step_game/ Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure]
  
 
===Agent Challenges===

* Aidan-Bench: Test creativity by having a particular LLM generate a long sequence of outputs (meant to be different), and measuring how long it can go before duplications appear.
** NeurIPS 2024 paper/poster: [https://openreview.net/pdf?id=fz969ahcvJ AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions]
* [https://x.com/paul_cal/status/1850262678712856764 Pictionary]: LLM suggests prompt, multiple LLMs generate outputs, LLM judges; allows ranking of the generation abilities.
* [https://mcbench.ai/ MC-bench]: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges ([https://github.com/mc-bench/orchestrator code]).
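The Aidan-Bench-style creativity measure above can be sketched as a loop that counts answers until the first repeat. The real benchmark scores novelty with embedding similarity and model-specific prompting; exact string matching and the `generate` stub here are deliberate simplifications:

```python
# Keep asking a model for new answers and count how many it produces
# before repeating itself; generate(i) stands in for an LLM call.
def novelty_streak(generate, max_tries=100):
    seen = set()
    for i in range(max_tries):
        answer = generate(i)
        if answer in seen:   # a duplication ends the streak
            return i
        seen.add(answer)
    return max_tries

outputs = ["red", "blue", "green", "blue"]
print(novelty_streak(lambda i: outputs[i]))  # → 3
```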
  
 
===Automated Improvement===
Latest revision as of 12:26, 8 May 2025

===Contextual Memory===

* Memobase: user profile-based memory (long-term user memory for genAI applications)
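Profile-based memory of the kind Memobase targets can be illustrated with a toy store: facts extracted from each conversation are merged into a per-user profile that is prepended to future prompts. A generic sketch, not Memobase's actual API:

```python
# Toy profile-based long-term memory: a per-user dict of extracted facts,
# rendered as context for the next prompt.
memory = {}  # user_id -> profile dict

def remember(user_id, key, value):
    """Merge one extracted fact into the user's profile."""
    memory.setdefault(user_id, {})[key] = value

def build_context(user_id):
    """Render the profile as lines to prepend to a prompt."""
    profile = memory.get(user_id, {})
    return "\n".join(f"{k}: {v}" for k, v in profile.items())

remember("u1", "name", "Ada")
remember("u1", "language", "Python")
print(build_context("u1"))  # → name: Ada\nlanguage: Python
```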

===Science Agents===

See [[Science Agents]].


==Software Development Workflows==

Several paradigms of AI-assisted coding have arisen:

# Manual, human driven
# AI-aided through chat/dialogue, where the human asks for code and then copies it into the project
## OpenAI ChatGPT
## Anthropic Claude
# API calls to an LLM, which generates code and inserts the file into the project
# LLM-integration into the IDE
## Copilot
## Qodo (Codium) & AlphaCodium (preprint, code)
## Cursor
## Codeium Windsurf (with "Cascade" AI Agent)
## ByteDance Trae AI
## Tabnine
## Traycer
## IDX: free
## Aide: open-source AI-native code editor (fork of VS Code)
## continue.dev: open-source code assistant
## Pear AI: open-source code editor
## [https://haystackeditor.com/ Haystack Editor]: canvas UI
## [https://onlook.com/ Onlook]: for designers
## [https://www.all-hands.dev/ All Hands AI]
## [https://app.devin.ai/ Devin 2.0] ([https://cognition.ai/ Cognition AI])
## Google [https://firebase.google.com/docs/studio Firebase Studio]
## [https://github.com/rowboatlabs/rowboat rowboat] (for building multi-agent workflows)
# AI-assisted IDE, where the AI generates and manages the dev environment
## [https://replit.com/ Replit]
## Aider (code): Pair programming on the command line
## Pythagora
## StackBlitz bolt.new
## Cline (formerly Claude Dev)
# Prompt-to-product
## Github Spark (demo video)
## Create.xyz: text-to-app, replicate product from link
## a0.dev: generate mobile apps (from your phone)
## Softgen: web app developer
## wrapifai: build form-based apps
## Lovable: web app (from text, screenshot, etc.)
## Vercel v0
## MarsX (John Rush): SaaS builder
## Webdraw: turn sketches into web apps
## Tempo Labs: build React apps
## Databutton: no-code software development
## base44: no-code dashboard apps
## Origin AI
# Semi-autonomous software engineer agents
## Devin (Cognition AI)
## Amazon Q (and CodeWhisperer)
## Honeycomb
## Claude Code

For a review of the current state of software-engineering agentic approaches, see:

=Increasing AI Agent Intelligence=

See: [[Increasing AI Intelligence]]


