* 2024-09: [https://www.arxiv.org/abs/2409.02977 Large Language Model-Based Agents for Software Engineering: A Survey]
* 2024-09: [https://arxiv.org/abs/2409.09030 Agents in Software Engineering: Survey, Landscape, and Vision]
* 2025-04: [https://arxiv.org/abs/2504.01990 Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems]
* 2025-04: [https://arxiv.org/abs/2503.19213 A Survey of Large Language Model Agents for Question Answering]
* 2025-04: [https://arxiv.org/abs/2504.09037 A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems]
===Continually updating===
* [https://github.com/open-thought/system-2-research OpenThought - System 2 Research Links]
* [https://github.com/hijkzzz/Awesome-LLM-Strawberry Awesome LLM Strawberry (OpenAI o1): Collection of research papers & blogs for OpenAI Strawberry(o1) and Reasoning]
* [https://github.com/e2b-dev/awesome-ai-agents Awesome AI Agents]
===Analysis/Opinions===
* [https://arxiv.org/abs/2402.01817v3 LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks]
* [https://rasa.com/blog/cutting-ai-assistant-costs-the-power-of-enhancing-llms-with-business/ Cutting AI Assistant Costs by Up to 77.8%: The Power of Enhancing LLMs with Business Logic]
* 2025-05: [https://arxiv.org/abs/2505.10468 AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges]
===Guides===
* Anthropic: [https://www.anthropic.com/research/building-effective-agents Building Effective Agents]
* Google: [https://www.kaggle.com/whitepaper-agents Agents] and [https://www.kaggle.com/whitepaper-agent-companion Agents Companion]
* OpenAI: [https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf A practical guide to building agents]
* Anthropic: [https://www.anthropic.com/engineering/claude-code-best-practices Claude Code: Best practices for agentic coding]
* Anthropic: [https://www.anthropic.com/engineering/built-multi-agent-research-system How we built our multi-agent research system]
=AI Assistants=
* [https://github.com/elizaOS/eliza Eliza] (includes multi-agent, interaction with docs, Discord, Twitter, etc.)
* [https://github.com/The-Pocket/PocketFlow Pocket Flow]: LLM Framework in 100 Lines
* [https://github.com/coze-dev/coze-studio Coze]: All-in-one AI agent development tool
===Information Retrieval (Memory)===
* 2025-02: [https://arxiv.org/abs/2502.01142 DeepRAG: Thinking to Retrieval Step by Step for Large Language Models]
* [https://mem0.ai/ Mem0 AI]: Memory Layer for AI Agents; a self-improving memory layer for LLM applications, enabling personalized AI interactions.
* 2025-08: [https://arxiv.org/abs/2508.16153 Memento: Fine-tuning LLM Agents without Fine-tuning LLMs]

===Contextual Memory===
* [https://github.com/memodb-io/memobase Memobase]: user profile-based memory (long-term user memory for genAI applications)
===Control (tool-use, computer use, etc.)===
* See also: [[Human_Computer_Interaction#AI_Computer_Use]]
* [https://tavily.com/ Tavily]: Connect Your LLM to the Web: Empowering your AI applications with real-time, accurate search results tailored for LLMs and RAG
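The core of agentic tool-use is a dispatch loop: the model emits a structured tool request, the harness executes the named tool, and the result is fed back into the model's context. A minimal sketch (the JSON shape and tool names here are hypothetical, for illustration; real frameworks such as OpenAI tool-calling or MCP each define their own schema):

```python
import json

# Illustrative tool registry mapping tool names to plain Python callables.
TOOLS = {
    "web_search": lambda query: f"Top result for {query!r}",  # stub for a real search API call
    "add": lambda a, b: a + b,
}

def dispatch(tool_call_json: str) -> str:
    """Execute one model-issued tool call of the form
    {"tool": <name>, "arguments": {...}} and return the result as text
    that can be fed back into the model's context."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["tool"])
    if fn is None:
        return f"Error: unknown tool {call['tool']!r}"
    return str(fn(**call["arguments"]))

# Simulate the model requesting a tool:
print(dispatch('{"tool": "add", "arguments": {"a": 2, "b": 3}}'))  # 5
```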
===Model Context Protocol (MCP)===
* '''Standards:'''
*# Anthropic [https://www.anthropic.com/news/model-context-protocol Model Context Protocol] (MCP)
*# [https://openai.github.io/openai-agents-python/mcp/ OpenAI Agents SDK]
* '''Tools:'''
** [https://github.com/jlowin/fastmcp FastMCP]: The fast, Pythonic way to build MCP servers
** [https://github.com/fleuristes/fleur/ Fleur]: A desktop app marketplace for Claude Desktop
* '''Servers:'''
** '''Lists:'''
**# [https://github.com/modelcontextprotocol/servers Model Context Protocol servers]
**# [https://www.mcpt.com/ MCP Servers, One Managed Registry]
**# [https://github.com/punkpeye/awesome-mcp-servers Awesome MCP Servers]
** '''Noteworthy:'''
**# Official [https://github.com/github/github-mcp-server GitHub MCP server]
**# Unofficial [https://github.com/modelcontextprotocol/servers/tree/main/src/github GitHub MCP server]
**# [https://github.com/modelcontextprotocol/servers/tree/main/src/puppeteer Puppeteer]
**# [https://github.com/modelcontextprotocol/servers/tree/main/src/google-maps Google Maps MCP Server]
**# [https://github.com/modelcontextprotocol/servers/tree/main/src/slack Slack MCP Server]
**# [https://zapier.com/mcp Zapier MCP Servers] (Slack, Google Sheets, Notion, etc.)
**# [https://github.com/awslabs/mcp AWS MCP Servers]
**# [https://x.com/elevenlabsio/status/1909300782673101265 ElevenLabs]
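MCP messages are carried over JSON-RPC 2.0; a client invokes a server-exposed tool with a <code>tools/call</code> request. A minimal sketch of building such a message (field names match the published schema as of this writing, but consult modelcontextprotocol.io for the authoritative shape):

```python
import json

def mcp_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP 'tools/call' request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

# Hypothetical tool name, for illustration only:
print(mcp_tool_call(1, "get_weather", {"city": "Tokyo"}))
```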
| + | |||
| + | ===Agent2Agent Protocol (A2A)===  | ||
| + | * Google [https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/ announcement]  | ||
===Open-source===

===Science Agents===
See [[Science Agents]].

===Medicine===
* 2025-03: [https://news.microsoft.com/2025/03/03/microsoft-dragon-copilot-provides-the-healthcare-industrys-first-unified-voice-ai-assistant-that-enables-clinicians-to-streamline-clinical-documentation-surface-information-and-automate-task/ Microsoft Dragon Copilot]: streamline clinical workflows and paperwork
* 2025-04: [https://arxiv.org/abs/2504.05186 Training state-of-the-art pathology foundation models with orders of magnitude less data]
* 2025-04: [https://www.nature.com/articles/s41586-025-08866-7?linkId=13898052 Towards conversational diagnostic artificial intelligence]
* 2025-04: [https://www.nature.com/articles/s41586-025-08869-4?linkId=13898054 Towards accurate differential diagnosis with large language models]
* 2025-08: [https://arxiv.org/abs/2508.20148 The Anatomy of a Personal Health Agent]
===LLM-as-judge===
* [https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge Awesome-LLM-as-a-judge Survey]
* [https://github.com/haizelabs/Awesome-LLM-Judges haizelabs Awesome LLM Judges]
* 2024-08: [https://arxiv.org/abs/2408.02666 Self-Taught Evaluators]
* 2024-10: [https://arxiv.org/abs/2410.10934 Agent-as-a-Judge: Evaluate Agents with Agents]
* 2024-11: [https://arxiv.org/abs/2411.15594 A Survey on LLM-as-a-Judge]
* 2024-12: [https://arxiv.org/abs/2412.05579 LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods]
* 2025-03: [https://arxiv.org/abs/2503.19877 Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators]
* 2025-04: [https://arxiv.org/abs/2504.00050 JudgeLRM: Large Reasoning Models as a Judge]
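In practice, an LLM-as-judge harness reduces to two pieces: a rubric prompt sent to the judge model, and a parser that extracts its verdict. A minimal sketch (the prompt wording and the <code>Score: N</code> convention are illustrative assumptions, not a fixed standard):

```python
import re

# Rubric prompt for the judge model; wording is an illustrative assumption.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate the answer below for factual accuracy "
    "and helpfulness on a scale of 1-10.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with 'Score: N' followed by a one-sentence justification."
)

def build_judge_prompt(question: str, answer: str) -> str:
    return JUDGE_TEMPLATE.format(question=question, answer=answer)

def parse_score(judge_reply: str):
    """Extract the integer score from the judge model's reply, or None."""
    m = re.search(r"Score:\s*(\d+)", judge_reply)
    return int(m.group(1)) if m else None

print(parse_score("Score: 8 - mostly accurate, minor omissions."))  # 8
```

Production setups typically average several judge samples, randomize answer order to counter position bias, and validate the judge against a small human-labeled set.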
===Deep Research===
* Google [https://blog.google/products/gemini/google-gemini-deep-research/ Deep Research]
* OpenAI [https://openai.com/index/introducing-deep-research/ Deep Research]
* Perplexity:
** [https://www.perplexity.ai/ Search]
** [https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research Deep Research]
* [https://exa.ai/ Exa AI]:
** [https://exa.ai/websets Websets]: Web research agent
** [https://demo.exa.ai/deepseekchat Web-search agent] powered by DeepSeek ([https://github.com/exa-labs/exa-deepseek-chat code]) or [https://o3minichat.exa.ai/ o3-mini] ([https://github.com/exa-labs/exa-o3mini-chat code])
* [https://www.firecrawl.dev/ Firecrawl] [https://x.com/nickscamara_/status/1886287956291338689 wip]
* [https://x.com/mattshumer_ Matt Shumer] [https://github.com/mshumer/OpenDeepResearcher OpenDeepResearcher]
* [https://x.com/dzhng dzhng] [https://github.com/dzhng/deep-research deep-research]
* [https://huggingface.co/ huggingface] [https://huggingface.co/blog/open-deep-research Open Deep Research] ([https://github.com/huggingface/smolagents/tree/main/examples/open_deep_research code])
* xAI Grok 3 Deep Search
* [https://liner.com/news/introducing-deepresearch Liner Deep Research]
* [https://allenai.org/ Allen AI] (AI2) [https://paperfinder.allen.ai/chat Paper Finder]
* 2025-03: [https://arxiv.org/abs/2503.20201 Open Deep Search: Democratizing Search with Open-source Reasoning Agents] ([https://github.com/sentient-agi/OpenDeepSearch code])
* [https://convergence.ai/welcome Convergence AI] Deep Work (swarms for web-based tasks)
* 2025-04: [https://arxiv.org/abs/2504.03160 DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments]
* 2025-04: Anthropic [https://x.com/AnthropicAI/status/1912192384588271771 Research]
* 2025-04: [https://arxiv.org/abs/2504.21776 WebThinker: Empowering Large Reasoning Models with Deep Research Capability]
* 2025-09: [https://arxiv.org/abs/2509.06283 SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents]
=Advanced Workflows=
* [https://arxiv.org/abs/2409.05556 SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning]
** [https://github.com/lamm-mit/SciAgentsDiscovery code]
* [https://skywork.ai/home Skywork] [https://skywork.ai/home?inviter=el.cine&shortlink_id=1919604877427924992&utm_source=X Super Agent]

===Streamline Administrative Tasks===
* 2025-02: [https://er.educause.edu/articles/2025/2/ushering-in-a-new-era-of-ai-driven-data-insights-at-uc-san-diego Ushering in a New Era of AI-Driven Data Insights at UC San Diego]
===Author Research Articles===
## OpenAI [https://chatgpt.com/ ChatGPT]
## Anthropic [https://claude.ai/ Claude]
## Google [https://gemini.google.com/app Gemini]
# API calls to an LLM, which generates code and inserts the file into the project
# LLM-integration into the IDE
## [https://github.com/features/copilot Copilot]
## [https://www.qodo.ai/ Qodo] (Codium) & [https://www.qodo.ai/products/alphacodium/ AlphaCodium] ([https://arxiv.org/abs/2401.08500 preprint], [https://github.com/Codium-ai/AlphaCodium code])
## '''[https://www.cursor.com/ Cursor]'''
## [https://codeium.com/ Codeium] [https://codeium.com/windsurf Windsurf] (with "Cascade" AI Agent)
## ByteDance [https://www.trae.ai/ Trae AI]
## [https://www.tabnine.com/ Tabnine]
## [https://marketplace.visualstudio.com/items?itemName=Traycer.traycer-vscode Traycer]
## [https://idx.dev/ IDX]: free
## [https://github.com/codestoryai/aide Aide]: open-source AI-native code editor (fork of VS Code)
## [https://www.continue.dev/ continue.dev]: open-source code assistant
## [https://trypear.ai/ Pear AI]: open-source code editor
## [https://haystackeditor.com/ Haystack Editor]: canvas UI
## [https://onlook.com/ Onlook]: for designers
## [https://www.all-hands.dev/ All Hands AI]
## [https://app.devin.ai/ Devin 2.0] ([https://cognition.ai/ Cognition AI])
## Google [https://firebase.google.com/docs/studio Firebase Studio]
## [https://github.com/rowboatlabs/rowboat rowboat] (for building multi-agent workflows)
## [https://www.trae.ai/ Trae IDE]: The Real AI Engineer
# AI-assisted IDE, where the AI generates and manages the dev environment
## [https://replit.com/ Replit]
## [https://www.pythagora.ai/ Pythagora]
## [https://stackblitz.com/ StackBlitz] [https://bolt.new/ bolt.new]
## [https://github.com/clinebot/cline Cline] (formerly [https://generativeai.pub/meet-claude-dev-an-open-source-autonomous-ai-programmer-in-vs-code-f457f9821b7b Claude Dev])
## [https://www.all-hands.dev/ All Hands]
# AI agent on the command line
## [https://aider.chat/ Aider] ([https://github.com/Aider-AI/aider code]): Pair programming on the command line
## [https://docs.anthropic.com/en/docs/claude-code/overview Claude Code]
## [https://openai.com/codex/ OpenAI Codex]
## [https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/ Gemini CLI]
# Prompt-to-product
## [https://githubnext.com/projects/github-spark Github Spark] ([https://x.com/ashtom/status/1851333075374051725 demo video])
## [https://www.create.xyz/ Create.xyz]: text-to-app; can replicate a product from a link
## [https://a0.dev/ a0.dev]: generate mobile apps (from your phone)
## [https://softgen.ai/ Softgen]: web app developer
## [https://wrapifai.com/ wrapifai]: build form-based apps
## [https://lovable.dev/ Lovable]: web apps (from text, screenshot, etc.)
## [https://v0.dev/ Vercel v0]
## [https://x.com/johnrushx/status/1625179509728198665 MarsX] ([https://x.com/johnrushx John Rush]): SaaS builder
## [https://webdraw.com/ Webdraw]: turn sketches into web apps
## [https://www.tempo.new/ Tempo Labs]: build React apps
## [https://databutton.com/ Databutton]: no-code software development
## [https://base44.com/ base44]: no-code dashboard apps
## [https://www.theorigin.ai/ Origin AI]
## [https://app.emergent.sh/ Emergent AI]
# Semi-autonomous software engineer agents
## [https://www.cognition.ai/blog/introducing-devin Devin] (Cognition AI)
## [https://aws.amazon.com/q/ Amazon Q] (and CodeWhisperer)
## [https://honeycomb.sh/ Honeycomb]
## [https://www.blackbox.ai/ Agent IDE]
## [https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview Claude Code]
## OpenAI [https://help.openai.com/en/articles/11096431-openai-codex-cli-getting-started Codex CLI] and [https://openai.com/index/introducing-codex/ Codex] cloud
## [https://www.factory.ai/ Factory AI] [https://x.com/FactoryAI/status/1927754706014630357 Droids]
For a review of the current state of software-engineering agentic approaches, see:
* 2024-08: [https://arxiv.org/abs/2408.02479 From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future]
==Inference-compute Reasoning==
* [https://nousresearch.com/#popup-menu-anchor Nous Research]: [https://nousresearch.com/introducing-the-forge-reasoning-api-beta-and-nous-chat-an-evolution-in-llm-inference/ Forge Reasoning API Beta]

==AI Assistant==
* [https://convergence.ai/ Convergence] [https://proxy.convergence.ai/ Proxy]
* [https://www.shortwave.com/ Shortwave] [https://www.shortwave.com/docs/guides/ai-assistant/ AI Assistant] (organize, write, search, schedule, etc.)
==Agentic Systems==
* [https://www.cognition.ai/ Cognition AI]: [https://www.cognition.ai/blog/introducing-devin Devin] software engineer (14% on SWE-bench)
* [https://honeycomb.sh/ Honeycomb] ([https://honeycomb.sh/blog/swe-bench-technical-report 22% on SWE-bench])
* [https://www.factory.ai/ Factory AI]
* [https://convergence.ai/welcome Convergence AI] Deep Work (swarms for web-based tasks)
* [https://agents.cloudflare.com/ Cloudflare Agents]
* [https://www.maskara.ai/ Maskara AI]
=Increasing AI Agent Intelligence=

=Multi-agent orchestration=
==Research==
* 2025-02: [https://arxiv.org/abs/2502.02533 Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies]
* 2025-03: [https://arxiv.org/abs/2503.13657 Why Do Multi-Agent LLM Systems Fail?]
* 2025-03: [https://arxiv.org/abs/2503.15478 SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks]
* 2025-09: [https://arxiv.org/abs/2509.20175 Federation of Agents: A Semantics-Aware Communication Fabric for Large-Scale Agentic AI]
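At its simplest, multi-agent orchestration is a planner routing sub-tasks to specialist agents and aggregating their outputs. A minimal sketch, where each "agent" is a stub function standing in for an LLM call with its own role prompt, and the router is a keyword heuristic standing in for an LLM planning step:

```python
# Specialist agents (stubs; in practice each would be an LLM call).
AGENTS = {
    "research": lambda task: f"[research] findings for: {task}",
    "code":     lambda task: f"[code] patch for: {task}",
    "write":    lambda task: f"[write] draft for: {task}",
}

def route(task: str) -> str:
    # Stand-in for the planner: keyword routing instead of an LLM decision.
    if "bug" in task or "implement" in task:
        return "code"
    if "draft" in task or "summarize" in task:
        return "write"
    return "research"

def orchestrate(tasks):
    """Dispatch each sub-task to its specialist agent and collect outputs."""
    return [AGENTS[route(t)](t) for t in tasks]

print(orchestrate(["fix the login bug", "draft the changelog"]))
```

The papers above study exactly the parts this sketch elides: how to choose the topology, how agents should communicate, and why such systems fail.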
| + | |||
| + | ===Organization Schemes===  | ||
| + | * 2025-03: [https://arxiv.org/abs/2503.02390 ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks]  | ||
| + | |||
===Societies and Communities of AI agents===  | ===Societies and Communities of AI agents===  | ||
* 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]  | * 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]  | ||
| + | * 2025-04: [https://arxiv.org/abs/2504.10157 SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users]  | ||
| + | * 2025-05: [https://www.science.org/doi/10.1126/sciadv.adu9368 Emergent social conventions and collective bias in LLM populations]  | ||
| + | * 2025-09: [https://arxiv.org/abs/2509.10147 Virtual Agent Economies]  | ||
===Domain-specific===
* 2024-10: [https://arxiv.org/abs/2410.08164 Agent S: An Open Agentic Framework that Uses Computers Like a Human] ([https://github.com/simular-ai/Agent-S code])
* 2024-10: [https://arxiv.org/abs/2410.20424 AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions]
* 2025-02: [https://arxiv.org/abs/2502.16111 PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving]

===Related work===
* 2024-07: [https://arxiv.org/abs/2407.18416 PersonaGym: Evaluating Persona Agents and LLMs]
* 2025-01: [https://arxiv.org/abs/2501.13946 Hallucination Mitigation using Agentic AI Natural Language-Based Frameworks]
===Inter-agent communications===
* 2024-11: [https://arxiv.org/abs/2411.02820 DroidSpeak: Enhancing Cross-LLM Communication]: Exploits caches of embeddings and key-values to allow context to be more easily transferred between AIs (without consuming context window)
* 2024-11: Anthropic describes [https://www.anthropic.com/news/model-context-protocol Model Context Protocol]: an open standard for secure, two-way connections between data sources and AI ([https://modelcontextprotocol.io/introduction intro], [https://modelcontextprotocol.io/quickstart quickstart], [https://github.com/modelcontextprotocol code])
* 2025-09: [https://arxiv.org/abs/2509.20175 Federation of Agents: A Semantics-Aware Communication Fabric for Large-Scale Agentic AI]
==Architectures==
* [https://github.com/Thytu/Agentarium Agentarium]
* [https://orchestra.org/ Orchestra] ([https://docs.orchestra.org/orchestra/introduction docs], [https://docs.orchestra.org/orchestra/introduction code])
* [https://github.com/HKUDS/AutoAgent AutoAgent]: Fully-Automated & Zero-Code LLM Agent Framework
* [https://mastra.ai/ Mastra] ([https://github.com/mastra-ai/mastra github]): opinionated TypeScript framework for AI applications (primitives for workflows, agents, RAG, integrations, and evals)
* [https://github.com/orra-dev/orra Orra]: multi-agent applications with complex real-world interactions
* [https://github.com/gensx-inc/gensx/blob/main/README.md GenSX]
* Cloudflare [https://developers.cloudflare.com/agents/ agents-sdk] ([https://blog.cloudflare.com/build-ai-agents-on-cloudflare/ info], [https://github.com/cloudflare/agents code])
* OpenAI [https://platform.openai.com/docs/api-reference/responses responses API] and [https://platform.openai.com/docs/guides/agents agents SDK]
* Google [https://google.github.io/adk-docs/ Agent Development Kit]
==Open Source Systems==
* [https://www.bardeen.ai/ Bardeen]: Automate workflows
* [https://abacus.ai/ Abacus]: [https://abacus.ai/ai_agents AI Agents]
** [https://abacus.ai/help/howTo HowTo]
* [https://www.llamaindex.ai/ LlamaIndex]: ([https://x.com/llama_index 𝕏], [https://github.com/run-llama/llama_index code], [https://docs.llamaindex.ai/en/stable/ docs], [https://discord.com/invite/dGcwcsnxhU Discord])
* [https://www.multion.ai/ MultiOn AI]: [https://www.multion.ai/blog/introducing-agent-q-research-breakthrough-for-the-next-generation-of-ai-agents-with-planning-and-self-healing-capabilities Agent Q] ([https://multion-research.s3.us-east-2.amazonaws.com/AgentQ.pdf paper]) automated planning and execution
* Google [https://cloud.google.com/products/agentspace Agentspace]
* [https://try.flowith.io/ Flowith]

===Multi-agent Handoff/Collaboration===
* [https://www.maskara.ai/ Maskara AI]
===Spreadsheet===
* [https://ottogrid.ai/ Otto Grid]
* [https://www.paradigmai.com/ Paradigm]
* [https://www.superworker.ai/ Superworker AI]
* [https://www.genspark.ai/ Genspark]
==Cloud solutions==

=Optimization=
===Reviews===
* 2024-12: [https://arxiv.org/abs/2412.11936 A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges]
* 2025-03: [https://arxiv.org/abs/2503.16416 Survey on Evaluation of LLM-based Agents]

===Metrics, Benchmarks===
See also: [[AI benchmarks]]
* 2019-11: [https://arxiv.org/abs/1911.01547 On the Measure of Intelligence]
* 2022-06: [https://arxiv.org/abs/2206.10498 PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change]
* 2024-11: [https://arxiv.org/abs/2411.13543 BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games]
* 2024-12: [https://arxiv.org/abs/2412.14161 TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks] ([https://github.com/TheAgentCompany/TheAgentCompany code], [https://the-agent-company.com/ project], [https://the-agent-company.com/#/leaderboard leaderboard])
* 2025-01: [https://codeelo-bench.github.io/ CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings] ([https://arxiv.org/abs/2501.01257 preprint], [https://codeelo-bench.github.io/#leaderboard-table leaderboard])
* 2025-02: [https://static.scale.com/uploads/654197dc94d34f66c0f5184e/EnigmaEval%20v4.pdf EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges] ([https://scale.com/leaderboard/enigma_eval leaderboard])
* 2025-02: [https://sites.google.com/view/mlgym MLGym: A New Framework and Benchmark for Advancing AI Research Agents] ([https://arxiv.org/abs/2502.14499 paper], [https://github.com/facebookresearch/MLGym code])
* 2025-02: [https://arxiv.org/abs/2502.18356 WebGames: Challenging General-Purpose Web-Browsing AI Agents]
* 2025-03: ColBench: [https://arxiv.org/abs/2503.15478 SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks]
* 2025-04: OpenAI [https://openai.com/index/browsecomp/ BrowseComp: a benchmark for browsing agents]
* 2025-04: [https://arxiv.org/abs/2504.11844 Evaluating the Goal-Directedness of Large Language Models]
===Evaluation Schemes===  | ===Evaluation Schemes===  | ||
| Line 282: | Line 426: | ||
===Multi-agent===  | ===Multi-agent===  | ||
* 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]  | * 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]  | ||
| + | * [https://github.com/lechmazur/step_game/ Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure]  | ||
===Agent Challenges===  | ===Agent Challenges===  | ||
| Line 287: | Line 432: | ||
** NeurIPS 2024 paper/poster: [https://openreview.net/pdf?id=fz969ahcvJ AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions]  | ** NeurIPS 2024 paper/poster: [https://openreview.net/pdf?id=fz969ahcvJ AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions]  | ||
* [https://x.com/paul_cal/status/1850262678712856764 Pictionary]: LLM suggests prompt, multiple LLMs generate outputs, LLM judges; allows raking of the generation abilities.  | * [https://x.com/paul_cal/status/1850262678712856764 Pictionary]: LLM suggests prompt, multiple LLMs generate outputs, LLM judges; allows raking of the generation abilities.  | ||
| − | * [https://  | + | * [https://mcbench.ai/ MC-bench]: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges ([https://github.com/mc-bench/orchestrator code]).  | 
===Automated Improvement===  | ===Automated Improvement===  | ||
| Line 294: | Line 439: | ||
* 2024-08: [https://arxiv.org/abs/2408.08435 Automated Design of Agentic Systems] ([https://github.com/ShengranHu/ADAS ADAS code])  | * 2024-08: [https://arxiv.org/abs/2408.08435 Automated Design of Agentic Systems] ([https://github.com/ShengranHu/ADAS ADAS code])  | ||
* 2024-08: [https://arxiv.org/abs/2408.02666 Self-Taught Evaluators]: Iterative self-improvement through generation of synthetic data and evaluation  | * 2024-08: [https://arxiv.org/abs/2408.02666 Self-Taught Evaluators]: Iterative self-improvement through generation of synthetic data and evaluation  | ||
| + | * 2025-05: [https://arxiv.org/abs/2505.22954 Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents] ([https://github.com/jennyzzt/dgm code], [https://sakana.ai/dgm/ project])  | ||
=See Also=
* [[Science Agents]]
* [[Increasing AI Intelligence]]
* [[AI tools]]
* [[AI understanding]]
* [[Robots]]
* [[Exocortex]]
Latest revision as of 10:56, 23 October 2025
Contents
- 1 Reviews & Perspectives
 - 2 AI Assistants
 - 3 Advanced Workflows
 - 4 Corporate AI Agent Ventures
 - 5 Increasing AI Agent Intelligence
 - 6 Multi-agent orchestration
 - 7 Optimization
 - 8 See Also
 
Reviews & Perspectives
Published
- 2024-04: LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models (code)
 - 2024-08: From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future
 - 2024-09: Towards a Science Exocortex
 - 2024-09: Large Language Model-Based Agents for Software Engineering: A Survey
 - 2024-09: Agents in Software Engineering: Survey, Landscape, and Vision
 - 2025-04: Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
 - 2025-04: A Survey of Large Language Model Agents for Question Answering
 - 2025-04: A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems
 
Continually updating
- OpenThought - System 2 Research Links
 - Awesome LLM Strawberry (OpenAI o1): Collection of research papers & blogs for OpenAI Strawberry(o1) and Reasoning
 - Awesome AI Agents
 
Analysis/Opinions
- LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
 - Cutting AI Assistant Costs by Up to 77.8%: The Power of Enhancing LLMs with Business Logic
 - 2025-05: AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges
 
Guides
- Anthropic: Building Effective Agents
 - Google: Agents and Agents Companion
 - OpenAI: A practical guide to building agents
 - Anthropic: Claude Code: Best practices for agentic coding
 - Anthropic: How we built our multi-agent research system
 
AI Assistants
Components of AI Assistants
Agent Internal Workflow Management
- LangChain
 - Pydantic: Agent Framework / shim to use Pydantic with LLMs
 - Flow: A lightweight task engine for building AI agents that prioritizes simplicity and flexibility
 - llama-stack
 - Huggingface smolagents
 - Eliza (includes multi-agent, interaction with docs, Discord, Twitter, etc.)
 - Pocket Flow: LLM Framework in 100 Lines
 - Coze: All-in-one AI agent development tool
 
Information Retrieval (Memory)
- See also RAG.
 - 2024-09: PaperQA2: Language Models Achieve Superhuman Synthesis of Scientific Knowledge (𝕏 post, code)
 - 2024-10: Agentic Information Retrieval
 - 2025-02: DeepRAG: Thinking to Retrieval Step by Step for Large Language Models
 - Mem0 AI: Memory Layer for AI Agents; self-improving memory layer for LLM applications, enabling personalized AI experiences.
 - 2025-08: Memento: Fine-tuning LLM Agents without Fine-tuning LLMs
 
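The memory-layer tools above share a common pattern: store snippets alongside embeddings, retrieve the most similar snippets at query time, and fold them into the prompt. A minimal sketch of that pattern, using a toy bag-of-words similarity in place of a real embedding model (all names here are illustrative, not from any of the projects listed):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class MemoryStore:
    def __init__(self):
        self.items = []  # (embedding, text) pairs

    def add(self, text):
        self.items.append((embed(text), text))

    def retrieve(self, query, k=2):
        """Return the k stored memories most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [text for _, text in ranked[:k]]

mem = MemoryStore()
mem.add("User prefers concise answers.")
mem.add("User is working on a protein-folding project.")
mem.add("User's favorite language is Python.")
print(mem.retrieve("What programming language should I use?", k=1))
```

Real memory layers add persistence, decay/consolidation, and write policies on top of this retrieval core.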
Contextual Memory
- Memobase: user profile-based memory (long-term user memory for genAI applications)
 
Control (tool-use, computer use, etc.)
- See also: Human_Computer_Interaction#AI_Computer_Use
 - Tavily: Connect Your LLM to the Web: Empowering your AI applications with real-time, accurate search results tailored for LLMs and RAG
 
Model Context Protocol (MCP)
- Standards:
  - Anthropic Model Context Protocol (MCP)
  - OpenAI Agents SDK
- Tools:
- Servers:
  - Lists:
  - Noteworthy:
    - Official Github MCP server
    - Unofficial Github MCP server
    - Puppeteer
    - Google Maps MCP Server
    - Slack MCP Server
    - Zapier MCP Servers (Slack, Google Sheets, Notion, etc.)
    - AWS MCP Servers
    - ElevenLabs

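MCP is layered on JSON-RPC 2.0, with tool invocations sent as "tools/call" requests. A sketch of what a client-side tool-call message looks like (the method name and params shape follow the published MCP spec; the tool name and arguments are made up, and the current spec revision should be checked before relying on this):

```python
import json

def mcp_tool_call(request_id, tool_name, arguments):
    """Build an MCP tool-invocation request (JSON-RPC 2.0 envelope)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical server exposing a "get_forecast" tool:
msg = mcp_tool_call(1, "get_forecast", {"city": "Boston"})
print(json.dumps(msg))
```

A real client would send this over one of the spec's transports (stdio or HTTP) after an initialization handshake, and would first list the server's tools via a "tools/list" request.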
Agent2Agent Protocol (A2A)
- Google announcement
 
Open-source
- Khoj (code): self-hostable AI assistant
 - RAGapp: Agentic RAG for enterprise
 - STORM: Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking
- Can write (e.g.) Wikipedia-style articles
 - code
 - Preprint: Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models
 
 
Personalities/Personas
- 2023-10: Generative Agents: Interactive Simulacra of Human Behavior
 - 2024-11: Microsoft TinyTroupe 🤠🤓🥸🧐: LLM-powered multiagent persona simulation for imagination enhancement and business insights
 - 2024-11: Generative Agent Simulations of 1,000 People (code)
 
Specific Uses for AI Assistants
Computer Use
Software Engineering
- 2024-11: MLE-Agent: Your intelligent companion for seamless AI engineering and research
 - Agentless: agentless approach to automatically solve software development problems
 
Science Agents
See Science Agents.
Medicine
- 2025-03: Microsoft Dragon Copilot: streamline clinical workflows and paperwork
 - 2025-04: Training state-of-the-art pathology foundation models with orders of magnitude less data
 - 2025-04: Towards conversational diagnostic artificial intelligence
 - 2025-04: Towards accurate differential diagnosis with large language models
 - 2025-08: The Anatomy of a Personal Health Agent
 
LLM-as-judge
- List of papers.
 - LLM Evaluation doesn't need to be complicated
 - Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)
 - Awesome-LLM-as-a-judge Survey
 - haizelabs Awesome LLM Judges
 - 2024-08: Self-Taught Evaluators
 - 2024-10: Agent-as-a-Judge: Evaluate Agents with Agents
 - 2024-11: A Survey on LLM-as-a-Judge
 - 2024-12: LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
 - 2025-03: Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators
 - 2025-04: JudgeLRM: Large Reasoning Models as a Judge
 
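Common to the LLM-as-judge work above is a simple harness: build a pairwise comparison prompt, send it to a strong model, and strictly parse the verdict. A minimal sketch (the prompt wording and helper names are illustrative, not taken from any specific paper):

```python
import re

JUDGE_TEMPLATE = """You are an impartial judge. Compare the two answers to the question
and decide which is better. Reply with exactly one line: "Verdict: A" or "Verdict: B".

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""

def build_judge_prompt(question, answer_a, answer_b):
    return JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b)

def parse_verdict(reply):
    """Extract 'A' or 'B' from the judge's reply; fail loudly on anything else."""
    m = re.search(r"Verdict:\s*([AB])", reply)
    if not m:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return m.group(1)

# In a real harness the prompt goes to a strong model; here we just
# exercise the parsing on a canned reply.
print(parse_verdict("Some reasoning...\nVerdict: B"))
```

Production judges typically also run each comparison twice with the answer order swapped, to cancel the position bias documented in the surveys above.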
Deep Research
- Google Deep Research
 - OpenAI Deep Research
 - Perplexity:
- Exa AI:
  - Websets: Web research agent
  - Web-search agent powered by DeepSeek (code) or o3-mini (code)
 
 - Firecrawl wip
 - Matt Shumer OpenDeepResearcher
 - DeepSearcher (operate on local data)
 - nickscamara open-deep-research
 - dzhng deep-research
- huggingface open-Deep-research (code)
 - xAI Grok 3 Deep Search
 - Liner Deep Research
 - Allen AI (AI2) Paper Finder
 - 2025-03: Open Deep Search: Democratizing Search with Open-source Reasoning Agents (code)
 - Convergence AI Deep Work (swarms for web-based tasks)
 - 2025-04: DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
 - 2025-04: Anthropic Research
 - 2025-04: WebThinker: Empowering Large Reasoning Models with Deep Research Capability
 - 2025-09: SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
 
Advanced Workflows
- Salesforce DEI: meta-system that leverages a diversity of SWE agents
 - Sakana AI: AI Scientist
 - SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning
 - Skywork Super Agent
 
Streamline Administrative Tasks
Author Research Articles
- 2024-02: STORM: Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models (discussion/analysis)
 
Software Development Workflows
Several paradigms of AI-assisted coding have arisen:
- Manual, human driven
 - AI-aided through chat/dialogue, where the human asks for code and then copies it into the project
 - API calls to an LLM, which generates code and inserts the file into the project
 - LLM-integration into the IDE
- Copilot
 - Qodo (Codium) & AlphaCodium (preprint, code)
 - Cursor
 - Codeium Windsurf (with "Cascade" AI Agent)
 - ByteDance Trae AI
 - Tabnine
 - Traycer
 - IDX: free
 - Aide: open-source AI-native code editor (fork of VS Code)
 - continue.dev: open-source code assistant
 - Pear AI: open-source code editor
 - Haystack Editor: canvas UI
 - Onlook: for designers
 - All Hands AI
 - Devin 2.0 (Cognition AI)
 - Google Firebase Studio
 - rowboat (for building multi-agent workflows)
 - Trae IDE: The Real AI Engineer
 
 - AI-assisted IDE, where the AI generates and manages the dev environment
- Replit
 - Pythagora
 - StackBlitz bolt.new
 - Cline (formerly Claude Dev)
 - All Hands
 
 - AI Agent on Commandline
- Aider (code): Pair programming on commandline
 - Claude Code
 - OpenAI Codex
 - Gemini CLI
 
 - Prompt-to-product
- Github Spark (demo video)
 - Create.xyz: text-to-app, replicate product from link
- a0.dev: generate mobile apps (from your phone)
 - Softgen: web app developer
 - wrapifai: build form-based apps
 - Lovable: web app (from text, screenshot, etc.)
 - Vercel v0
 - MarsX (John Rush): SaaS builder
 - Webdraw: turn sketches into web apps
 - Tempo Labs: build React apps
 - Databutton: no-code software development
 - base44: no-code dashboard apps
 - Origin AI
 - Emergent AI
 
 - Semi-autonomous software engineer agents
- Devin (Cognition AI)
 - Amazon Q (and CodeWhisperer)
 - Honeycomb
 - Agent IDE
 - Claude Code
 - OpenAI Codex CLI and Codex cloud
 - Factory AI Droids
 
 
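The "API calls to an LLM" paradigm above reduces to: send a prompt, extract the returned code block from the reply, and write it into the project tree. A sketch of that loop's plumbing, with the model call itself omitted (names hypothetical):

```python
FENCE = "`" * 3  # markdown code-fence marker, built here to avoid a literal fence

def extract_code_block(reply):
    """Pull the contents of the first fenced code block out of a model reply."""
    inside, lines = False, []
    for line in reply.splitlines():
        if line.startswith(FENCE):
            if inside:
                break  # closing fence: stop collecting
            inside = True
            continue
        if inside:
            lines.append(line)
    return "\n".join(lines)

# A canned model reply, standing in for an actual API response:
reply = f"Here you go:\n{FENCE}python\ndef add(a, b):\n    return a + b\n{FENCE}\nDone."
code = extract_code_block(reply)
# A real tool would now write the file, e.g. pathlib.Path("utils.py").write_text(code)
print(code)
```

The IDE-integrated and agentic paradigms further down the list wrap this same extraction step in diff application, test execution, and retry loops.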
For a review of the current state of software-engineering agentic approaches, see:
- 2024-08: From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future
 - 2024-09: Large Language Model-Based Agents for Software Engineering: A Survey
 - 2024-09: Agents in Software Engineering: Survey, Landscape, and Vision
 
Corporate AI Agent Ventures
Mundane Workflows and Capabilities
- Payman AI: AI to Human platform that allows AI to pay people for what it needs
 - VoiceFlow: Build customer experiences with AI
 - Mistral AI: genAI applications
 - Taskade: Task/milestone software with AI agent workflows
 - Covalent: Building a Multi-Agent Prompt Refining Application
 
Inference-compute Reasoning
AI Assistant
- Convergence Proxy
 - Shortwave AI Assistant (organize, write, search, schedule, etc.)
 
Agentic Systems
- Topology AI
- Cognition AI: Devin software engineer (14% on SWE-bench)
 - Honeycomb (22% on SWE-bench)
 - Factory AI
 - Convergence AI Deep Work (swarms for web-based tasks)
 - Cloudflare Agents
 - Maskara AI
 
Increasing AI Agent Intelligence
See: Increasing AI Intelligence
Multi-agent orchestration
Research
- 2025-02: Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies
 - 2025-03: Why Do Multi-Agent LLM Systems Fail?
 - 2025-03: SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
 - 2025-09: Federation of Agents: A Semantics-Aware Communication Fabric for Large-Scale Agentic AI
 
Organization Schemes
Societies and Communities of AI agents
- 2024-12: Cultural Evolution of Cooperation among LLM Agents
 - 2025-04: SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users
 - 2025-05: Emergent social conventions and collective bias in LLM populations
 - 2025-09: Virtual Agent Economies
 
Domain-specific
- 2024-12: TradingAgents: Multi-Agents LLM Financial Trading Framework
 - 2025-01: Agent Laboratory: Using LLM Agents as Research Assistants
 
Research demos
- Camel
 - LoopGPT
 - JARVIS
 - OpenAGI
 - AutoGen
- preprint: AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
 - Agent-E: Browser (eventually computer) automation (code, preprint, demo video)
 - AutoGen Studio: GUI for agent workflows (code)
 - Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
 
 - AG2 (previously AutoGen) (code, docs, Discord)
 - TaskWeaver
 - MetaGPT
 - AutoGPT (code); and AutoGPT Platform
 - Optima
 - 2024-04: LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models (code)
 - 2024-06: MASAI: Modular Architecture for Software-engineering AI Agents
 - 2024-10: Agent S: An Open Agentic Framework that Uses Computers Like a Human (code)
 - 2024-10: AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions
 - 2025-02: PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving
 
Related work
- 2024-07: PersonaGym: Evaluating Persona Agents and LLMs
 - 2025-01: Hallucination Mitigation using Agentic AI Natural Language-Based Frameworks
 
Inter-agent communications
- 2024-10: Agora: A Scalable Communication Protocol for Networks of Large Language Models (preprint): disparate agents auto-negotiate communication protocol
 - 2024-11: DroidSpeak: Enhancing Cross-LLM Communication: Exploits caches of embeddings and key-values, to allow context to be more easily transferred between AIs (without consuming context window)
 - 2024-11: Anthropic describes Model Context Protocol: an open standard for secure, two-way connections between data sources and AI (intro, quickstart, code)
 - 2025-09: Federation of Agents: A Semantics-Aware Communication Fabric for Large-Scale Agentic AI
 
Architectures
Open Source Frameworks
- LangChain
 - ell (code, docs)
 - AgentOps AI AgentStack
 - Agent UI
 - kyegomez swarms
 - OpenAI Swarm (cookbook)
 - Amazon AWS Multi-Agent Orchestrator
 - KaibanJS: Kanban for AI Agents? (Takes inspiration from Kanban visual work management.)
 - Agentarium
 - Orchestra (docs, code)
 - AutoAgent: Fully-Automated & Zero-Code LLM Agent Framework
 - Mastra (github): opinionated Typescript framework for AI applications (primitives for workflows, agents, RAG, integrations and evals)
 - Orra: multi-agent applications with complex real-world interactions
 - GenSX
 - Cloudflare agents-sdk (info, code)
 - OpenAI responses API and agents SDK
 - Google Agent Development Kit
 
Open Source Systems
- ControlFlow
 - OpenHands (formerly OpenDevin)
- code: platform for autonomous software engineers, powered by AI and LLMs
 - Report: OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
 
 
Commercial Automation Frameworks
- Lutra: Automation and integration with various web systems.
 - Gumloop
 - TextQL: Enterprise Virtual Data Analyst
 - Athena intelligence: Analytics platform
 - Nexus GPT: Business co-pilot
 - Multi-On: AI agent that acts on your behalf
 - Firecrawl: Turn websites into LLM-ready data
 - Reworkd: End-to-end data extraction
 - Lindy: Custom AI Assistants to automate business workflows
- E.g. use Slack
 
 - Bardeen: Automate workflows
 - Abacus: AI Agents
 - LlamaIndex: (𝕏, code, docs, Discord)
 - MultiOn AI: Agent Q (paper) automated planning and execution
 - Google Agentspace
 - Flowith
 
Multi-agent Handoff/Collaboration
Spreadsheet
Cloud solutions
- Numbers Station Meadow: agentic framework for data workflows (code).
 - CrewAI says they provide multi-agent automations (code).
 - LangChain introduced LangGraph to help build agents, and LangGraph Cloud as a service for running those agents.
- LangGraph Studio is an IDE for agent workflows
 
 - C3 AI enterprise platform
 - Deepset AI Haystack (docs, code)
 
Frameworks
- Google Project Oscar
- Agent: Gaby (for "Go AI bot") (code, documentation) helps with issue tracking.
 
 - OpenPlexity-Pages: Data-aggregator implementation (like Perplexity) based on CrewAI
 
Optimization
Reviews
- 2024-12: A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges
 - 2025-03: Survey on Evaluation of LLM-based Agents
 
Metrics, Benchmarks
See also: AI benchmarks
- 2019-11: On the Measure of Intelligence
 - 2022-06: PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change
 - 2023-06: Can Large Language Models Infer Causation from Correlation? (challenging Corr2Cause task)
 - 2024-01: AutoGenBench -- A Tool for Measuring and Evaluating AutoGen Agents
 - 2024-04: AutoRace (code): LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models
 - 2024-04: OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (github)
 - 2024-07: AI Agents That Matter
 - 2024-09: CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark (leaderboard)
 - 2024-09: LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
 - 2024-09: On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability
 - 2024-10: MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
 - 2024-10: WorFBench: Benchmarking Agentic Workflow Generation
 - 2024-10: VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models
 - 2024-10: SimpleQA: Measuring short-form factuality in large language models (announcement, code)
 - 2024-11: RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts (blog, code)
 - 2024-11: The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use (code)
 - 2024-11: BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
 - 2024-12: TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks (code, project, leaderboard)
 - 2025-01: CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings (preprint, leaderboard)
 - 2025-02: ENIGMAEVAL: A Benchmark of Long Multimodal Reasoning Challenges (leaderboard)
 - 2025-02: MLGym: A New Framework and Benchmark for Advancing AI Research Agents (paper, code)
 - 2025-02: WebGames: Challenging General-Purpose Web-Browsing AI Agents
 - 2025-03: ColBench: SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
 - 2025-04: OpenAI BrowseComp: a benchmark for browsing agents
 - 2025-04: Evaluating the Goal-Directedness of Large Language Models
 
Evaluation Schemes
- 2024-12: LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
 - 2025-01: LLMRank ("SlopRank"): LLMs evaluate each other, allowing top model (for a given prompt/problem) to be inferred from a large number of recommendations.
 
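The LLMRank/"SlopRank" scheme above can be reduced to its aggregation step: collect each model's preference over the other models' answers, then tally cross-endorsements. A toy sketch of that aggregation (the actual scheme may use graph-based scoring; the self-vote filtering here is an assumption, not necessarily how LLMRank handles it):

```python
from collections import Counter

def rank_models(endorsements):
    """endorsements: list of (judge, winner) pairs, where each judge model names
    the model whose answer it preferred. Self-votes are discarded to reduce
    self-preference bias (an assumption for this sketch)."""
    tally = Counter(winner for judge, winner in endorsements if judge != winner)
    return [model for model, _ in tally.most_common()]

# Hypothetical votes among three models on one prompt:
votes = [("m1", "m2"), ("m3", "m2"), ("m2", "m3"), ("m1", "m1")]
print(rank_models(votes))  # m2 has 2 cross-votes, m3 has 1
```

With many prompts and many judges, the leader of this tally converges toward the model the population collectively prefers, without any single ground-truth evaluator.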
Multi-agent
- 2024-12: Cultural Evolution of Cooperation among LLM Agents
 - Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure
 
Agent Challenges
- Aidan-Bench: Test creativity by having a particular LLM generate a long sequence of outputs (meant to be different), and measuring how long it can go before duplications appear.
- NeurIPS 2024 paper/poster: AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions
 
 - Pictionary: LLM suggests a prompt, multiple LLMs generate outputs, an LLM judges; allows ranking of the generation abilities.
 - MC-bench: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges (code).
 
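Aidan-Bench's core measurement can be sketched as a run-length-until-duplicate loop. Here a toy word-overlap similarity stands in for the embedding-based novelty scoring the benchmark actually uses (threshold and function names are illustrative):

```python
def jaccard(a, b):
    """Toy word-overlap similarity; a real scorer would compare embeddings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def novelty_run_length(outputs, similarity, threshold=0.8):
    """Count how many outputs appear before one is too similar to an earlier one."""
    seen = []
    for i, out in enumerate(outputs):
        if any(similarity(out, prev) >= threshold for prev in seen):
            return i  # i novel outputs before the first near-duplicate
        seen.append(out)
    return len(outputs)

# Hypothetical answers to "how would you cross a river unnoticed?":
outs = ["use a drone", "train a pigeon", "use a small drone", "dig a tunnel"]
print(novelty_run_length(outs, jaccard, threshold=0.6))
```

The score is simply the run length: a model that keeps producing genuinely distinct ideas scores higher than one that rephrases itself early.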
Automated Improvement
- 2024-06: EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms
 - 2024-06: Symbolic Learning Enables Self-Evolving Agents
 - 2024-08: Automated Design of Agentic Systems (ADAS code)
 - 2024-08: Self-Taught Evaluators: Iterative self-improvement through generation of synthetic data and evaluation
 - 2025-05: Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents (code, project)