AI Agents

From GISAXS
 
* [https://github.com/open-thought/system-2-research OpenThought - System 2 Research Links]
* [https://github.com/hijkzzz/Awesome-LLM-Strawberry Awesome LLM Strawberry (OpenAI o1): Collection of research papers & blogs for OpenAI Strawberry (o1) and Reasoning]
* [https://github.com/e2b-dev/awesome-ai-agents Awesome AI Agents]
===Analysis/Opinions===
* [https://arxiv.org/abs/2402.01817v3 LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks]
* [https://rasa.com/blog/cutting-ai-assistant-costs-the-power-of-enhancing-llms-with-business/ Cutting AI Assistant Costs by Up to 77.8%: The Power of Enhancing LLMs with Business Logic]
===Guides===
* Anthropic: [https://www.anthropic.com/research/building-effective-agents Building Effective Agents]
* Google: [https://www.kaggle.com/whitepaper-agents Agents]
=AI Assistants=
 
==Components of AI Assistants==
===Agent Internal Workflow Management===
* [https://github.com/langchain-ai/langchain LangChain]
* [https://github.com/pydantic/pydantic-ai Pydantic AI]: agent framework / shim to use Pydantic with LLMs
* [https://github.com/lmnr-ai/flow Flow]: a lightweight task engine for building AI agents that prioritizes simplicity and flexibility
* [https://llama-stack.readthedocs.io/en/latest/index.html llama-stack]
* [https://huggingface.co/blog/smolagents Hugging Face] [https://github.com/huggingface/smolagents smolagents]
* [https://github.com/elizaOS/eliza Eliza] (includes multi-agent, interaction with docs, Discord, Twitter, etc.)
* [https://github.com/The-Pocket/PocketFlow Pocket Flow]: LLM framework in 100 lines
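The frameworks above differ in their abstractions, but each ultimately manages some variant of the same core loop: call the model, execute any requested tools, feed observations back, repeat. A framework-free sketch of that loop, with `call_llm` stubbed (hypothetical; a real system would call an actual model API):

```python
# Hypothetical stand-in for a real LLM API call; any framework above would
# supply this part. The stub requests one calculator call, then answers.
def call_llm(messages):
    if any(m["role"] == "tool" for m in messages):
        return {"type": "answer", "content": "The result is 8."}
    return {"type": "tool_call", "tool": "calculator",
            "arguments": {"expression": "3 + 5"}}

# Tool registry: name -> callable. Restricted eval keeps the toy calculator contained.
TOOLS = {"calculator": lambda args: str(eval(args["expression"], {"__builtins__": {}}))}

def run_agent(user_query, max_steps=5):
    """Core agent loop: call model, execute requested tools, feed back observations."""
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        if reply["type"] == "answer":
            return reply["content"]
        observation = TOOLS[reply["tool"]](reply["arguments"])
        messages.append({"role": "tool", "content": observation})
    return "Step limit reached."

print(run_agent("What is 3 + 5?"))  # → The result is 8.
```

What the frameworks add on top of this skeleton is state management, retries, tracing, and multi-agent routing.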
===Information Retrieval (Memory)===
* See also [[AI_tools#Retrieval_Augmented_Generation_.28RAG.29|RAG]].
* 2024-09: PaperQA2: [https://paper.wikicrow.ai/ Language Models Achieve Superhuman Synthesis of Scientific Knowledge] ([https://x.com/SGRodriques/status/1833908643856818443 𝕏 post], [https://github.com/Future-House/paper-qa code])
* 2024-10: [https://arxiv.org/abs/2410.09713 Agentic Information Retrieval]
* 2025-02: [https://arxiv.org/abs/2502.01142 DeepRAG: Thinking to Retrieval Step by Step for Large Language Models]
* [https://mem0.ai/ Mem0]: self-improving memory layer for LLM applications, enabling personalized AI agents
===Contextual Memory===
* [https://github.com/memodb-io/memobase Memobase]: user profile-based memory (long-term user memory for genAI applications)
===Control (tool-use, computer use, etc.)===
* See also: [[Human_Computer_Interaction#AI_Computer_Use]]
* [https://tavily.com/ Tavily]: connect your LLM to the web; real-time, accurate search results tailored for LLMs and RAG
===Model Context Protocol (MCP)===
* '''Standards:'''
*# Anthropic [https://www.anthropic.com/news/model-context-protocol Model Context Protocol] (MCP)
*# [https://openai.github.io/openai-agents-python/mcp/ OpenAI Agents SDK]
* '''Tools:'''
** [https://github.com/jlowin/fastmcp FastMCP]: the fast, Pythonic way to build MCP servers
** [https://github.com/fleuristes/fleur/ Fleur]: a desktop app marketplace for Claude Desktop
* '''Servers:'''
** '''Lists:'''
**# [https://github.com/modelcontextprotocol/servers Model Context Protocol servers]
**# [https://www.mcpt.com/ MCP Servers: One Managed Registry]
**# [https://github.com/punkpeye/awesome-mcp-servers Awesome MCP Servers]
** '''Noteworthy:'''
**# [https://github.com/modelcontextprotocol/servers/tree/main/src/github GitHub MCP server]
**# [https://github.com/modelcontextprotocol/servers/tree/main/src/puppeteer Puppeteer]
**# [https://github.com/modelcontextprotocol/servers/tree/main/src/google-maps Google Maps MCP Server]
**# [https://github.com/modelcontextprotocol/servers/tree/main/src/slack Slack MCP Server]
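Whichever SDK or server is used, MCP traffic is JSON-RPC 2.0: the client invokes methods such as <code>initialize</code>, <code>tools/list</code>, and <code>tools/call</code>. A sketch of what the client-side messages look like (the <code>get_weather</code> tool and its arguments are hypothetical, not from any server listed above):

```python
import json

def mcp_request(request_id, method, params):
    """MCP messages use JSON-RPC 2.0 framing."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}

# Discover the server's tools...
list_req = mcp_request(1, "tools/list", {})

# ...then invoke one. The tool name and arguments here are hypothetical;
# real names come back in the tools/list response.
call_req = mcp_request(2, "tools/call",
                       {"name": "get_weather", "arguments": {"city": "Berlin"}})

print(json.dumps(call_req, indent=2))
```

Libraries like FastMCP hide this framing entirely; the sketch just shows what goes over the wire.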
  
 
===Open-source===
  
 
===Computer Use===
* See: [[Human_Computer_Interaction#AI_Computer_Use]]
===Software Engineering===
* 2024-11: [https://github.com/MLSysOps/MLE-agent MLE-Agent: Your intelligent companion for seamless AI engineering and research]
* [https://github.com/OpenAutoCoder/Agentless Agentless]: agentless approach to automatically solve software development problems
  
 
===Science Agents===
See [[Science Agents]].

===Medicine===
* 2025-03: [https://news.microsoft.com/2025/03/03/microsoft-dragon-copilot-provides-the-healthcare-industrys-first-unified-voice-ai-assistant-that-enables-clinicians-to-streamline-clinical-documentation-surface-information-and-automate-task/ Microsoft Dragon Copilot]: streamline clinical workflows and paperwork
  
 
===LLM-as-judge===
* [https://www.philschmid.de/llm-evaluation LLM Evaluation doesn't need to be complicated]
* [https://eugeneyan.com/writing/llm-evaluators/ Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)]
* [https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge Awesome-LLM-as-a-judge Survey]
* [https://github.com/haizelabs/Awesome-LLM-Judges Haize Labs Awesome LLM Judges]
* 2024-10: [https://arxiv.org/abs/2410.10934 Agent-as-a-Judge: Evaluate Agents with Agents]
* 2024-11: [https://arxiv.org/abs/2411.15594 A Survey on LLM-as-a-Judge]
* 2024-12: [https://arxiv.org/abs/2412.05579 LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods]
* 2025-03: [https://arxiv.org/abs/2503.19877 Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators]
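A recurring pattern in the work above is pairwise judging with randomized answer order, since LLM judges are known to exhibit position bias. A minimal sketch with a stubbed judge (the stub simply prefers the longer answer so the example runs deterministically; a real judge is a call to a strong LLM):

```python
import random

JUDGE_TEMPLATE = """You are an impartial judge. Question: {question}
Answer A: {a}
Answer B: {b}
Reply with the single letter of the better answer."""

# Stub standing in for a call to a strong judge model.
def judge_llm(prompt):
    a = prompt.split("Answer A: ")[1].split("\n")[0]
    b = prompt.split("Answer B: ")[1].split("\n")[0]
    return "A" if len(a) >= len(b) else "B"

def pairwise_judge(question, ans1, ans2, rng=random.Random(0)):
    # Randomize presentation order to mitigate the position bias
    # commonly reported for LLM judges.
    swapped = rng.random() < 0.5
    a, b = (ans2, ans1) if swapped else (ans1, ans2)
    verdict = judge_llm(JUDGE_TEMPLATE.format(question=question, a=a, b=b))
    winner_is_first = (verdict == "A") != swapped
    return "answer 1" if winner_is_first else "answer 2"

print(pairwise_judge("What causes tides?",
                     "The Moon's gravity (and, to a lesser extent, the Sun's).",
                     "Water."))  # → answer 1
```

Production judge harnesses typically also average over both orderings and multiple judge samples.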
===Deep Research===
* Google [https://blog.google/products/gemini/google-gemini-deep-research/ Deep Research]
* OpenAI [https://openai.com/index/introducing-deep-research/ Deep Research]
* Perplexity:
** [https://www.perplexity.ai/ Search]
** [https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research Deep Research]
* [https://exa.ai/ Exa AI]:
** [https://exa.ai/websets Websets]: web research agent
** [https://demo.exa.ai/deepseekchat Web-search agent] powered by DeepSeek ([https://github.com/exa-labs/exa-deepseek-chat code]) or [https://o3minichat.exa.ai/ o3-mini] ([https://github.com/exa-labs/exa-o3mini-chat code])
* [https://www.firecrawl.dev/ Firecrawl] ([https://x.com/nickscamara_/status/1886287956291338689 work in progress])
* [https://x.com/mattshumer_ Matt Shumer] [https://github.com/mshumer/OpenDeepResearcher OpenDeepResearcher]
* [https://github.com/zilliztech/deep-searcher DeepSearcher] (operates on local data)
* [https://github.com/nickscamara nickscamara] [https://github.com/nickscamara/open-deep-research open-deep-research]
* [https://x.com/dzhng dzhng] [https://github.com/dzhng/deep-research deep-research]
* [https://huggingface.co/ Hugging Face] [https://huggingface.co/blog/open-deep-research open-Deep-Research] ([https://github.com/huggingface/smolagents/tree/main/examples/open_deep_research code])
* xAI Grok 3 Deep Search
* [https://liner.com/news/introducing-deepresearch Liner Deep Research]
* [https://allenai.org/ Allen AI] (AI2) [https://paperfinder.allen.ai/chat Paper Finder]
* 2025-03: [https://arxiv.org/abs/2503.20201 Open Deep Search: Democratizing Search with Open-source Reasoning Agents] ([https://github.com/sentient-agi/OpenDeepSearch code])
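The products above vary widely, but the underlying loop is similar: plan queries, search, collect notes, and synthesize a report. A deliberately tiny sketch with a stubbed corpus and a fixed query plan (nothing here reflects any specific product's API):

```python
# Stubbed "web": a real agent would call a search API here.
CORPUS = {
    "perovskite efficiency": "Lab cells have exceeded 26% efficiency.",
    "perovskite stability": "Encapsulation extends operating lifetime.",
}

def search(query):
    return CORPUS.get(query, "")

def deep_research(planner_queries):
    """Gather notes per query; a real agent would also generate the queries
    with an LLM and synthesize a cited report at the end."""
    notes = []
    for q in planner_queries:
        hit = search(q)
        if hit:
            notes.append(f"[{q}] {hit}")
    return "\n".join(notes)

report = deep_research(["perovskite efficiency", "perovskite stability"])
print(report)
```

Real deep-research agents iterate this loop, letting intermediate findings spawn follow-up queries.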
  
 
=Advanced Workflows=
 
* [https://arxiv.org/abs/2409.05556 SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning]
** [https://github.com/lamm-mit/SciAgentsDiscovery code]

===Streamline Administrative Tasks===
* 2025-02: [https://er.educause.edu/articles/2025/2/ushering-in-a-new-era-of-ai-driven-data-insights-at-uc-san-diego Ushering in a New Era of AI-Driven Data Insights at UC San Diego]

===Author Research Articles===
* 2024-02: STORM: [https://arxiv.org/abs/2402.14207 Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models] ([https://www.aihero.dev/storm-generate-high-quality-articles-based-on-real-research discussion/analysis])
  
 
===Software Development Workflows===
 
## [https://www.cursor.com/ Cursor]
## [https://codeium.com/ Codeium] [https://codeium.com/windsurf Windsurf] (with "Cascade" AI Agent)
## ByteDance [https://www.trae.ai/ Trae AI]
## [https://www.tabnine.com/ Tabnine]
## [https://marketplace.visualstudio.com/items?itemName=Traycer.traycer-vscode Traycer]
## [https://idx.dev/ IDX]: free
## [https://github.com/codestoryai/aide Aide]: open-source AI-native code editor (fork of VS Code)
## [https://www.continue.dev/ continue.dev]: open-source code assistant
## [https://trypear.ai/ Pear AI]: open-source code editor
## [https://haystackeditor.com/ Haystack Editor]: canvas UI
## [https://onlook.com/ Onlook]: for designers
# AI-assisted IDE, where the AI generates and manages the dev environment
## [https://replit.com/ Replit]
# Prompt-to-product
## [https://githubnext.com/projects/github-spark Github Spark] ([https://x.com/ashtom/status/1851333075374051725 demo video])
## [https://www.create.xyz/ Create.xyz]: text-to-app; replicate a product from a link
## [https://a0.dev/ a0.dev]: generate mobile apps (from your phone)
## [https://softgen.ai/ Softgen]: web app developer
## [https://wrapifai.com/ wrapifai]: build form-based apps
## [https://lovable.dev/ Lovable]: web app (from text, screenshot, etc.)
## [https://v0.dev/ Vercel v0]
## [https://x.com/johnrushx/status/1625179509728198665 MarsX] ([https://x.com/johnrushx John Rush]): SaaS builder
## [https://webdraw.com/ Webdraw]: turn sketches into web apps
## [https://www.tempo.new/ Tempo Labs]: build React apps
## [https://databutton.com/ Databutton]: no-code software development
## [https://base44.com/ base44]: no-code dashboard apps
## [https://www.theorigin.ai/ Origin AI]
# Semi-autonomous software engineer agents
## [https://www.cognition.ai/blog/introducing-devin Devin] (Cognition AI)
## [https://aws.amazon.com/q/ Amazon Q] (and CodeWhisperer)
## [https://honeycomb.sh/ Honeycomb]
## [https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview Claude Code]
  
 
For a review of the current state of software-engineering agentic approaches, see:
 
==Inference-compute Reasoning==
* [https://nousresearch.com/#popup-menu-anchor Nous Research]: [https://nousresearch.com/introducing-the-forge-reasoning-api-beta-and-nous-chat-an-evolution-in-llm-inference/ Forge Reasoning API Beta]

==AI Assistant==
* [https://convergence.ai/ Convergence] [https://proxy.convergence.ai/ Proxy]
  
 
==Agentic Systems==
* [https://www.cognition.ai/ Cognition AI]: [https://www.cognition.ai/blog/introducing-devin Devin] software engineer (14% on SWE-bench)
* [https://honeycomb.sh/ Honeycomb] ([https://honeycomb.sh/blog/swe-bench-technical-report 22% on SWE-bench])
* [https://www.factory.ai/ Factory AI]
  
 
=Increasing AI Agent Intelligence=

See: [[Increasing AI Intelligence]]

==Proactive Search==

Compute expended after training, but before inference.

===Training Data (Data Refinement, Synthetic Data)===
* C.f. image datasets:
** 2023-06: [https://arxiv.org/abs/2306.00984 StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners]
** 2023-11: [https://arxiv.org/abs/2311.17946 DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback]
* 2024-09: [https://arxiv.org/abs/2409.17115 Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale]
* 2024-10: [https://arxiv.org/abs/2410.15547 Data Cleaning Using Large Language Models]
* Updating list of links: [https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data Synthetic Data of LLMs, by LLMs, for LLMs]

===Generate consistent plans/thoughts===
* 2024-08: [https://arxiv.org/abs/2408.06195 Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers] ([https://github.com/zhentingqi/rStar code])
** (Microsoft) rStar is a self-play mutual reasoning approach. A small model augments MCTS using defined reasoning heuristics; mutually consistent trajectories can be emphasized.
* 2024-09: [https://www.arxiv.org/abs/2409.04057 Self-Harmonized Chain of Thought]
** Produces refined chain-of-thought style solutions/prompts for diverse problems. Given a large set of problems/questions, first aggregate them semantically, then apply zero-shot chain-of-thought to each. Then cross-pollinate between proposed solutions to similar problems, looking for refined and generalized solutions.

===Sampling===
* 2024-11: [https://arxiv.org/abs/2411.04282 Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding] ([https://github.com/SalesforceAIResearch/LaTRO code])

===Automated prompt generation===
* 2024-09: [https://arxiv.org/abs/2409.13449 Minstrel: Structural Prompt Generation with Multi-Agents Coordination for Non-AI Experts]

===Distill inference-time-compute into model===
* 2023-10: [https://arxiv.org/abs/2310.11716 Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning] (U. Maryland, Adobe)
* 2023-11: [https://arxiv.org/abs/2311.01460 Implicit Chain of Thought Reasoning via Knowledge Distillation] (Harvard, Microsoft, Hopkins)
* 2024-02: [https://arxiv.org/abs/2402.04494 Grandmaster-Level Chess Without Search] (Google DeepMind)
* 2024-07: [https://arxiv.org/abs/2407.03181 Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models]
* 2024-07: [https://arxiv.org/abs/2407.14622 BOND: Aligning LLMs with Best-of-N Distillation]
* 2024-09: [https://arxiv.org/abs/2409.12917 Training Language Models to Self-Correct via Reinforcement Learning] (Google DeepMind)
* 2024-10: [https://arxiv.org/abs/2410.10630 Thinking LLMs: General Instruction Following with Thought Generation]
* 2024-10: [https://arxiv.org/abs/2410.09918 Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces]
 
  
====CoT reasoning model====
 
* 2024-09: [https://openai.com/o1/ OpenAI o1]
 
* 2024-10: [https://github.com/GAIR-NLP/O1-Journey/blob/main/resource/report.pdf O1 Replication Journey: A Strategic Progress Report – Part 1] ([https://github.com/GAIR-NLP/O1-Journey code]): Attempt by [https://gair-nlp.github.io/walnut-plan/ Walnut Plan] to reproduce o1-like in-context reasoning
 
* 2024-11: [https://x.com/deepseek_ai/status/1859200141355536422 DeepSeek-R1-Lite-Preview reasoning model]
 
* 2024-11: [https://arxiv.org/abs/2411.14405 Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions]
 
 
===Scaling===
 
* 2024-08: [https://arxiv.org/abs/2408.16737 Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling] (Google DeepMind)
 
* 2024-11: [https://arxiv.org/abs/2411.04434 Scaling Laws for Pre-training Agents and World Models]
 
 
==Inference Time Compute==
 
===Methods===
 
* 2024-03: [https://arxiv.org/abs/2403.09629 Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking]
 
 
===In context learning (ICL), search, and other inference-time methods===
 
* 2023-03: [https://arxiv.org/abs/2303.11366 Reflexion: Language Agents with Verbal Reinforcement Learning]
 
* 2023-05: [https://arxiv.org/abs/2305.16291 VOYAGER: An Open-Ended Embodied Agent with Large Language Models]
 
* 2024-04: [https://arxiv.org/abs/2404.11018 Many-Shot In-Context Learning]
 
* 2024-08: [https://arxiv.org/abs/2408.08435 Automated Design of Agentic Systems]
 
* 2024-09: [https://arxiv.org/abs/2409.03733 Planning In Natural Language Improves LLM Search For Code Generation]
 
 
===Inference-time Sampling===
 
* 2024-10: [https://github.com/xjdr-alt/entropix entropix: Entropy Based Sampling and Parallel CoT Decoding]
 
* 2024-10: [https://arxiv.org/abs/2410.16033 TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling]
 
* 2024-11: [https://openreview.net/forum?id=FBkpCyujtS Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs]
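As a concrete example of the min-p idea from the last entry: keep only tokens whose probability is at least a fraction <code>p_base</code> of the top token's probability, then renormalize. A toy sketch (the vocabulary and logit values are made up):

```python
import math

def min_p_filter(logits, p_base=0.1):
    """Min-p truncation: keep tokens with prob >= p_base * (top token's prob)."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}  # stable softmax
    z = sum(exps.values())
    probs = {t: e / z for t, e in exps.items()}
    threshold = p_base * max(probs.values())
    kept = {t: p for t, p in probs.items() if p >= threshold}
    s = sum(kept.values())
    return {t: p / s for t, p in kept.items()}  # renormalize survivors

# Toy four-token vocabulary with made-up logits:
dist = min_p_filter({"the": 5.0, "a": 4.0, "cat": 2.0, "zzz": -3.0}, p_base=0.1)
print(sorted(dist))  # → ['a', 'the'] (low-probability tokens pruned)
```

Because the threshold scales with the model's confidence, min-p prunes aggressively when one token dominates and permissively when the distribution is flat.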
 
 
===Inference-time Gradient===
 
* 2024-11: [https://ekinakyurek.github.io/papers/ttt.pdf The Surprising Effectiveness of Test-Time Training for Abstract Reasoning] ([https://github.com/ekinakyurek/marc code])
 
 
===Self-prompting===
 
* 2023-05: [https://arxiv.org/abs/2305.09993 Reprompting: Automated Chain-of-Thought Prompt Inference Through Gibbs Sampling]
 
* 2023-11: [https://arxiv.org/abs/2311.04205 Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves]
 
 
===In-context thought===
 
* 2022-01: [https://arxiv.org/abs/2201.11903 Chain-of-Thought Prompting Elicits Reasoning in Large Language Models] (Google Brain)
 
* 2023-05: [https://arxiv.org/abs/2305.10601 Tree of Thoughts: Deliberate Problem Solving with Large Language Models] (Google DeepMind)
 
* 2024-05: [https://arxiv.org/abs/2405.18357 Faithful Logical Reasoning via Symbolic Chain-of-Thought]
 
* 2024-06: [https://aclanthology.org/2024.findings-naacl.78/ A Tree-of-Thoughts to Broaden Multi-step Reasoning across Languages]
 
* 2024-09: [https://arxiv.org/abs/2409.12183 To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning]
 
* 2024-09: [https://arxiv.org/abs/2409.12618 Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning] ([https://agnostiq.ai/ Agnostiq], Toronto)
 
* 2024-09: [https://arxiv.org/abs/2409.17539 Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models]
 
* 2024-10: [https://arxiv.org/abs/2410.16540 A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration] (failed reasoning traces can improve CoT)
 
* 2024-10: [https://arxiv.org/abs/2410.06634 Tree of Problems: Improving structured problem solving with compositionality]
 
* 2023-01/2024-10: [https://arxiv.org/abs/2301.00234 A Survey on In-context Learning]
 
 
===Naive multi-LLM (verification, majority voting, best-of-N, etc.)===
 
* 2023-06: [https://arxiv.org/abs/2306.02561 LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion] ([https://github.com/yuchenlin/LLM-Blender?tab=readme-ov-file code])
 
* 2023-12: [https://aclanthology.org/2023.findings-emnlp.203/ Dynamic Voting for Efficient Reasoning in Large Language Models]
 
* 2024-04: [https://arxiv.org/abs/2404.01054 Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment]
 
* 2024-08: [https://arxiv.org/abs/2408.17017 Dynamic Self-Consistency: Leveraging Reasoning Paths for Efficient LLM Sampling]
 
* 2024-11: [https://arxiv.org/abs/2411.00492 Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models]
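The simplest of these schemes, self-consistency via majority voting, fits in a few lines. A sketch with canned answers standing in for N temperature-sampled reasoning paths:

```python
from collections import Counter

# Canned final answers standing in for N temperature-sampled LLM runs.
SAMPLES = ["42", "42", "41", "42", "17"]

def majority_vote(answers):
    """Self-consistency: the most common final answer wins."""
    answer, _ = Counter(answers).most_common(1)[0]
    return answer

print(majority_vote(SAMPLES))  # → 42
```

Best-of-N variants replace the vote with a scorer (reward model or verifier) and return the top-scoring sample instead.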
 
 
===Multi-LLM (multiple comparisons, branching, etc.)===
 
* 2024-10: [https://arxiv.org/abs/2410.10630 Thinking LLMs: General Instruction Following with Thought Generation]
 
* 2024-11: [https://arxiv.org/abs/2411.02830 Mixtures of In-Context Learners]: Multiple "experts", each with a different set of in-context examples; combine outputs at the level of next-token-prediction
 
* 2024-11: [https://arxiv.org/abs/2411.10440 LLaVA-o1: Let Vision Language Models Reason Step-by-Step] ([https://github.com/PKU-YuanGroup/LLaVA-o1 code])
 
 
===Iteration (e.g. neural-like layered blocks)===
 
* 2024-06: [https://arxiv.org/abs/2406.04692 Mixture-of-Agents Enhances Large Language Model Capabilities]
 
 
===Iterative reasoning via graphs===
 
* 2023-08: [https://arxiv.org/abs/2308.09687 Graph of Thoughts: Solving Elaborate Problems with Large Language Models]
 
* 2024-09: [https://arxiv.org/abs/2409.10038 On the Diagram of Thought]: Iterative reasoning as a directed acyclic graph (DAG)
 
 
===Monte Carlo Tree Search (MCTS)===
 
* 2024-05: [https://arxiv.org/abs/2405.03553 AlphaMath Almost Zero: process Supervision without process]
 
* 2024-06: [https://arxiv.org/abs/2406.03816 ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search]
 
* 2024-06: [https://arxiv.org/abs/2406.06592 Improve Mathematical Reasoning in Language Models by Automated Process Supervision]
 
* 2024-06: [https://arxiv.org/abs/2406.07394 Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B]
 
* 2024-07: [https://arxiv.org/abs/2407.01476 Tree Search for Language Model Agents]
 
* 2024-10: [https://arxiv.org/abs/2410.01707 Interpretable Contrastive Monte Carlo Tree Search Reasoning]
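These papers differ in how rewards and process supervision are obtained, but most share the UCT selection rule for deciding which reasoning branch to expand next. A minimal sketch (the branch statistics are made up):

```python
import math

def uct_score(child_value, child_visits, parent_visits, c=1.4):
    """UCT: balance exploitation (mean value) against exploration (visit counts)."""
    if child_visits == 0:
        return float("inf")  # unvisited branches are tried first
    exploit = child_value / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

# Made-up statistics for three candidate reasoning steps: (total value, visits).
branches = {"step A": (3.0, 5), "step B": (1.0, 1), "step C": (0.0, 0)}
parent_visits = 6
best = max(branches,
           key=lambda k: uct_score(branches[k][0], branches[k][1], parent_visits))
print(best)  # → step C (unvisited, so selected for expansion)
```

In the LLM setting, "value" typically comes from a process reward model or self-evaluation rather than game outcomes.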
 
 
===Other Search===
 
* 2024-11: [https://arxiv.org/abs/2411.05010 Scattered Forest Search: Smarter Code Space Exploration with LLMs]
 
 
===Scaling===
 
* 2021-04: [https://arxiv.org/abs/2104.03113 Scaling Scaling Laws with Board Games]
 
* 2024-03: [https://arxiv.org/abs/2403.02419 Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems]
 
* 2024-04: [https://arxiv.org/abs/2404.00725 The Larger the Better? Improved LLM Code-Generation via Budget Reallocation]
 
* 2024-07: [https://arxiv.org/abs/2407.21787 Large Language Monkeys: Scaling Inference Compute with Repeated Sampling]
 
* 2024-08: [https://arxiv.org/abs/2408.00724 An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models]
 
* 2024-08: [https://arxiv.org/abs/2408.03314 Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters]
 
* 2024-10: (comparing fine-tuning to in-context learning) [https://arxiv.org/abs/2405.19874 Is In-Context Learning Sufficient for Instruction Following in LLMs?]
 
 
===Theory===
 
* 2024-02: [https://arxiv.org/abs/2402.12875 Chain of Thought Empowers Transformers to Solve Inherently Serial Problems]
 
 
===Expending compute works===
 
* 2024-06-10: Blog post (opinion): [https://yellow-apartment-148.notion.site/AI-Search-The-Bitter-er-Lesson-44c11acd27294f4495c3de778cd09c8d AI Search: The Bitter-er Lesson]
 
* 2024-07-17: Blog post (test): [https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt Getting 50% (SoTA) on ARC-AGI with GPT-4o]
 
* 2024-09-12: [https://openai.com/o1/ OpenAI o1]: [https://openai.com/index/learning-to-reason-with-llms/ Learning to Reason with LLMs]
 
[[Image:Compute.png|600px]]
 
* 2024-09-16: [https://www.oneusefulthing.org/p/scaling-the-state-of-play-in-ai Scaling: The State of Play in AI]
 
 
===Code for Inference-time Compute===
 
* [https://github.com/codelion/optillm optillm]: Inference proxy which implements state-of-the-art techniques to improve accuracy and performance of LLMs (improve reasoning over coding, logical and mathematical queries)
 
 
==Memory==
 
* 2024-10: [https://arxiv.org/abs/2410.08821 Retriever-and-Memory: Towards Adaptive Note-Enhanced Retrieval-Augmented Generation]
 
 
==Tool Use==
 
* 2024-11: [https://arxiv.org/abs/2411.01747 DynaSaur: Large Language Agents Beyond Predefined Actions]: writes functions/code to increase capabilities
 
 
==Multi-agent Effort (and Emergent Intelligence)==
 
* 2024-10: [https://arxiv.org/abs/2410.11163 Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence]
 
* 2024-10: [https://arxiv.org/abs/2410.10934 Agent-as-a-Judge: Evaluate Agents with Agents]
 
* 2024-11: [https://arxiv.org/abs/2411.00114 Project Sid: Many-agent simulations toward AI civilization]
 
 
==ML-like Optimization of LLM Setup==
 
* 2023-10: [https://arxiv.org/abs/2310.03714 DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines] ([https://github.com/stanfordnlp/dspy code]; programming, not prompting, foundation models)
 
* 2024-05: [https://arxiv.org/abs/2305.03495 Automatic Prompt Optimization with "Gradient Descent" and Beam Search]
 
* 2024-06: [https://arxiv.org/abs/2406.07496 TextGrad: Automatic "Differentiation" via Text] (gradient backpropagation through text)
 
* 2024-06: [https://arxiv.org/abs/2406.18532 Symbolic Learning Enables Self-Evolving Agents] (optimize LLM frameworks)
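A toy illustration of the shared idea behind these systems: treat the prompt as a parameter and keep whichever variant scores best on an eval set. Both the candidate edits and the scorer below are stubs (real systems such as those above use an LLM for both):

```python
# Candidate prompt edits; a real optimizer would have an LLM propose these.
CANDIDATE_EDITS = [
    "",                                # keep the prompt as-is
    " Think step by step.",
    " Answer with a single number.",
]

def score(prompt):
    # Stubbed evaluator: stands in for accuracy on a held-out eval set.
    # (Here: pretend longer, more specific prompts do better.)
    return len(prompt)

def optimize(base_prompt):
    """One round of hill-climbing over prompt variants."""
    candidates = [base_prompt + edit for edit in CANDIDATE_EDITS]
    return max(candidates, key=score)

print(optimize("Solve the problem."))  # → Solve the problem. Answer with a single number.
```

DSPy and TextGrad generalize this loop: the "edit proposals" become LLM-generated textual gradients, and the search runs over whole pipelines rather than a single prompt string.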
 
 
=Multi-agent orchestration=

==Research==
* 2025-03: [https://arxiv.org/abs/2503.13657 Why Do Multi-Agent LLM Systems Fail?]
* 2025-03: [https://arxiv.org/abs/2503.15478 SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks]

===Organization Schemes===
* 2025-03: [https://arxiv.org/abs/2503.02390 ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks]

===Societies and Communities of AI agents===
* 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]

===Domain-specific===
* 2024-12: [https://arxiv.org/abs/2412.20138 TradingAgents: Multi-Agents LLM Financial Trading Framework]
* 2025-01: [https://arxiv.org/abs/2501.04227 Agent Laboratory: Using LLM Agents as Research Assistants]

==Research demos==
* [https://github.com/camel-ai/camel Camel]
 
* 2024-06: [https://arxiv.org/abs/2406.11638 MASAI: Modular Architecture for Software-engineering AI Agents]
* 2024-10: [https://arxiv.org/abs/2410.08164 Agent S: An Open Agentic Framework that Uses Computers Like a Human] ([https://github.com/simular-ai/Agent-S code])
* 2024-10: [https://arxiv.org/abs/2410.20424 AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions]
* 2025-02: [https://arxiv.org/abs/2502.16111 PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving]
  
 
===Related work===
 
* Amazon AWS [https://github.com/awslabs/multi-agent-orchestrator Multi-Agent Orchestrator]
* [https://github.com/kaiban-ai/KaibanJS KaibanJS]: Kanban for AI agents (takes inspiration from [https://en.wikipedia.org/wiki/Kanban Kanban] visual [https://www.atlassian.com/agile/kanban work management])
* [https://github.com/Thytu/Agentarium Agentarium]
* [https://orchestra.org/ Orchestra] ([https://docs.orchestra.org/orchestra/introduction docs], [https://docs.orchestra.org/orchestra/introduction code])
* [https://github.com/HKUDS/AutoAgent AutoAgent]: fully-automated, zero-code LLM agent framework
* [https://mastra.ai/ Mastra] ([https://github.com/mastra-ai/mastra github]): opinionated TypeScript framework for AI applications (primitives for workflows, agents, RAG, integrations, and evals)
* [https://github.com/orra-dev/orra Orra]: multi-agent applications with complex real-world interactions
* [https://github.com/gensx-inc/gensx/blob/main/README.md GenSX]
* Cloudflare [https://developers.cloudflare.com/agents/ agents-sdk] ([https://blog.cloudflare.com/build-ai-agents-on-cloudflare/ info], [https://github.com/cloudflare/agents code])
* OpenAI [https://platform.openai.com/docs/api-reference/responses responses API] and [https://platform.openai.com/docs/guides/agents agents SDK]
  
 
==Open Source Systems==
 
* [https://ottogrid.ai/ Otto Grid]
* [https://www.paradigmai.com/ Paradigm]
* [https://www.superworker.ai/ Superworker AI]
  
 
==Cloud solutions==
  
 
=Optimization=

===Reviews===
* 2024-12: [https://arxiv.org/abs/2412.11936 A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges]
* 2025-03: [https://arxiv.org/abs/2503.16416 Survey on Evaluation of LLM-based Agents]
 
===Metrics, Benchmarks===
 
===Metrics, Benchmarks===
 +
* 2019-11: [https://arxiv.org/abs/1911.01547 On the Measure of Intelligence]
* 2022-06: [https://arxiv.org/abs/2206.10498 PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change]
* 2023-06: [https://arxiv.org/abs/2306.05836 Can Large Language Models Infer Causation from Correlation?] (challenging Corr2Cause task)
* 2024-01: [https://microsoft.github.io/autogen/0.2/blog/2024/01/25/AutoGenBench/ AutoGenBench -- A Tool for Measuring and Evaluating AutoGen Agents]
* 2024-04: AutoRace ([https://github.com/maitrix-org/llm-reasoners code]): [https://arxiv.org/abs/2404.05221 LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models]
* 2024-04: [https://arxiv.org/abs/2404.07972 OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments] ([https://os-world.github.io/ github])
* 2024-10: SimpleQA: [https://cdn.openai.com/papers/simpleqa.pdf Measuring short-form factuality in large language models] ([https://openai.com/index/introducing-simpleqa/ announcement], [https://github.com/openai/simple-evals code])
* 2024-11: [https://metr.org/AI_R_D_Evaluation_Report.pdf RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts] ([https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/ blog], [https://github.com/METR/ai-rd-tasks/tree/main code])
* 2024-11: [https://arxiv.org/abs/2411.10323 The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use] ([https://github.com/showlab/computer_use_ootb code])
* 2024-11: [https://arxiv.org/abs/2411.13543 BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games]
* 2024-12: [https://arxiv.org/abs/2412.14161 TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks] ([https://github.com/TheAgentCompany/TheAgentCompany code], [https://the-agent-company.com/ project], [https://the-agent-company.com/#/leaderboard leaderboard])
* 2025-01: [https://codeelo-bench.github.io/ CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings] ([https://arxiv.org/abs/2501.01257 preprint], [https://codeelo-bench.github.io/#leaderboard-table leaderboard])
* 2025-02: [https://static.scale.com/uploads/654197dc94d34f66c0f5184e/EnigmaEval%20v4.pdf EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges] ([https://scale.com/leaderboard/enigma_eval leaderboard])
* 2025-02: [https://sites.google.com/view/mlgym MLGym: A New Framework and Benchmark for Advancing AI Research Agents] ([https://arxiv.org/abs/2502.14499 paper], [https://github.com/facebookresearch/MLGym code])
* 2025-02: [https://arxiv.org/abs/2502.18356 WebGames: Challenging General-Purpose Web-Browsing AI Agents]
* 2025-03: ColBench: [https://arxiv.org/abs/2503.15478 SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks]
===Evaluation Schemes===
* 2024-12: [https://arxiv.org/abs/2412.10424 LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation]
* [https://github.com/marquisdepolis/LLMRank LLMRank ("SlopRank")]: LLMs evaluate each other, allowing the top model (for a given prompt/problem) to be inferred from a large number of recommendations.
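As a sketch of this peer-recommendation idea (a generic illustration with made-up model names, not SlopRank's actual code), the top model for a prompt can be inferred by counting votes:

```python
from collections import Counter

def top_model(recommendations):
    """Infer the best model from peer recommendations.

    recommendations: (judge, recommended) pairs, where each judge LLM
    names the model whose answer it preferred; self-votes are ignored.
    """
    votes = Counter(rec for judge, rec in recommendations if judge != rec)
    # The most-recommended model wins for this prompt/problem.
    return votes.most_common(1)[0][0]

# Hypothetical example: three judges review each other's answers.
prefs = [("gpt", "claude"), ("claude", "gemini"), ("gemini", "claude")]
print(top_model(prefs))  # claude
```

Aggregating these per-prompt winners over many prompts then yields an overall ranking.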
 +
 +
===Multi-agent===
* 2024-12: [https://arxiv.org/abs/2412.10270 Cultural Evolution of Cooperation among LLM Agents]
* [https://github.com/lechmazur/step_game/ Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure]
===Agent Challenges===
* [https://github.com/aidanmclaughlin/Aidan-Bench Aidan-Bench]: Test creativity by having a particular LLM generate a long sequence of outputs (meant to all be different), measuring how long it can go before duplications appear.
** NeurIPS 2024 paper/poster: [https://openreview.net/pdf?id=fz969ahcvJ AidanBench: Evaluating Novel Idea Generation on Open-Ended Questions]
* [https://x.com/paul_cal/status/1850262678712856764 Pictionary]: LLM suggests prompt, multiple LLMs generate outputs, LLM judges; allows ranking of generation abilities.
* [https://github.com/mc-bench/orchestrator MC-bench]: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges.
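The duplication-counting idea behind Aidan-Bench can be sketched as below (a toy version: exact string matching with a hypothetical `generate` callable standing in for repeated LLM calls, whereas the real benchmark scores semantic novelty):

```python
def novelty_run_length(generate, n_max=100):
    """Count how many distinct outputs appear before the first repeat."""
    seen = set()
    for i in range(n_max):
        out = generate()
        if out in seen:      # duplication detected: the run ends here
            return i
        seen.add(out)
    return n_max             # no repeat within the attempt budget

# Toy generator that repeats after three distinct outputs.
outputs = iter(["a", "b", "c", "a"])
print(novelty_run_length(lambda: next(outputs), n_max=4))  # 3
```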
 
=See Also=
* [[Science Agents]]
* [[Increasing AI Intelligence]]
* [[AI tools]]
* [[AI understanding]]
* [[Robots]]
* [[Exocortex]]

Latest revision as of 16:53, 31 March 2025

===Contextual Memory===
* Memobase: user profile-based memory (long-term user memory for genAI applications)
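As a generic sketch of this kind of profile-based memory (illustrative only, not Memobase's actual API), an assistant might persist user facts across sessions like this:

```python
class ProfileMemory:
    """Minimal long-term user-profile store for an AI assistant (sketch)."""

    def __init__(self):
        self.profiles = {}  # user_id -> {attribute: value}

    def remember(self, user_id, attribute, value):
        # Persist a fact about the user across conversations.
        self.profiles.setdefault(user_id, {})[attribute] = value

    def recall(self, user_id):
        # Retrieve the stored profile, e.g. to prepend to the LLM context.
        return self.profiles.get(user_id, {})

mem = ProfileMemory()
mem.remember("u1", "preferred_language", "Python")
print(mem.recall("u1"))  # {'preferred_language': 'Python'}
```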

===Science Agents===
See [[Science Agents]].

===Software Development Workflows===
Several paradigms of AI-assisted coding have arisen:
# Manual, human-driven
# AI-aided through chat/dialogue, where the human asks for code and then copies it into the project
## OpenAI ChatGPT
## Anthropic Claude
# API calls to an LLM, which generates code and inserts the file into the project
# LLM integration into the IDE
## Copilot
## Qodo (Codium) & AlphaCodium (preprint, code)
## Cursor
## Codeium Windsurf (with "Cascade" AI Agent)
## ByteDance Trae AI
## Tabnine
## Traycer
## IDX: free
## Aide: open-source AI-native code editor (fork of VS Code)
## continue.dev: open-source code assistant
## Pear AI: open-source code editor
## Haystack Editor: canvas UI
## Onlook: for designers
# AI-assisted IDE, where the AI generates and manages the dev environment
## Replit
## Aider (code): pair programming on the command line
## Pythagora
## StackBlitz bolt.new
## Cline (formerly Claude Dev)
# Prompt-to-product
## GitHub Spark (demo video)
## Create.xyz: text-to-app; replicate product from link
## a0.dev: generate mobile apps (from your phone)
## Softgen: web app developer
## wrapifai: build form-based apps
## Lovable: web app (from text, screenshot, etc.)
## Vercel v0
## MarsX (John Rush): SaaS builder
## Webdraw: turn sketches into web apps
## Tempo Labs: build React apps
## Databutton: no-code software development
## base44: no-code dashboard apps
## Origin AI
# Semi-autonomous software engineer agents
## Devin (Cognition AI)
## Amazon Q (and CodeWhisperer)
## Honeycomb
## Claude Code

For a review of the current state of software-engineering agentic approaches, see:

=Increasing AI Agent Intelligence=
See: [[Increasing AI Intelligence]]
