AI Agents

From GISAXS

===Information Retrieval (Memory)===

* 2025-02: [https://arxiv.org/abs/2502.01142 DeepRAG: Thinking to Retrieval Step by Step for Large Language Models]
* [https://mem0.ai/ Mem0 AI]: Memory Layer for AI Agents; self-improving memory layer for LLM applications, enabling personalized AI experiences.

===Contextual Memory===

* [https://github.com/memodb-io/memobase Memobase]: user profile-based memory (long-term user memory for genAI applications); a generic sketch of such a memory layer is shown below.

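To illustrate the general idea behind such memory layers (this is a generic sketch, not the actual Mem0 or Memobase API; all class and function names below are hypothetical), a per-user memory can be as simple as storing facts and retrieving the most relevant ones to prepend to each prompt:

<syntaxhighlight lang="python">
# Hypothetical illustration of a per-user memory layer for LLM applications.
# This is NOT the Mem0 or Memobase API; names and the scoring are placeholders.
from collections import defaultdict


class UserMemory:
    """Stores free-text facts per user and retrieves the most relevant ones."""

    def __init__(self):
        self._facts = defaultdict(list)  # user_id -> list of fact strings

    def add(self, user_id: str, fact: str) -> None:
        """Record a new fact about a user (e.g. extracted from a conversation)."""
        self._facts[user_id].append(fact)

    def search(self, user_id: str, query: str, k: int = 3) -> list[str]:
        """Return up to k stored facts sharing the most words with the query."""
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(fact.lower().split())), fact)
            for fact in self._facts[user_id]
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for score, fact in scored[:k] if score > 0]


# Usage: retrieved facts are prepended to the prompt to personalize the answer.
memory = UserMemory()
memory.add("alice", "Prefers Python examples over pseudocode.")
memory.add("alice", "Works on small-angle X-ray scattering analysis.")
facts = memory.search("alice", "Show me a Python example for scattering data")
prompt = "Known about this user:\n" + "\n".join(facts) + "\n\nUser question: ..."
</syntaxhighlight>
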
===Control (tool-use, computer use, etc.)===

===Computer Use===
* 2024-11: [https://arxiv.org/abs/2411.10323 The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use] ([https://github.com/showlab/computer_use_ootb code])
* See: [[Human_Computer_Interaction#AI_Computer_Use]]
* 2025-01: [https://arxiv.org/abs/2501.10893 Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments]
 
  
 
===Software Engineering===

===LLM-as-judge===

* 2024-11: [https://arxiv.org/abs/2411.15594 A Survey on LLM-as-a-Judge]
* 2024-12: [https://arxiv.org/abs/2412.05579 LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods]

===Deep Research===

* Google [https://blog.google/products/gemini/google-gemini-deep-research/ Deep Research]
* OpenAI [https://openai.com/index/introducing-deep-research/ Deep Research]
* Perplexity [https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research Deep Research]
* [https://exa.ai/ Exa AI] web-search agent, powered by [https://demo.exa.ai/deepseekchat DeepSeek] ([https://github.com/exa-labs/exa-deepseek-chat code]) or [https://o3minichat.exa.ai/ o3-mini] ([https://github.com/exa-labs/exa-o3mini-chat code])
* [https://www.firecrawl.dev/ Firecrawl] [https://x.com/nickscamara_/status/1886287956291338689 wip]
* [https://x.com/mattshumer_ Matt Shumer] [https://github.com/mshumer/OpenDeepResearcher OpenDeepResearcher]
* [https://github.com/zilliztech/deep-searcher DeepSearcher] (operates on local data)
* [https://github.com/nickscamara nickscamara] [https://github.com/nickscamara/open-deep-research open-deep-research]
* [https://x.com/dzhng dzhng] [https://github.com/dzhng/deep-research deep-research]
* [https://huggingface.co/ huggingface] [https://huggingface.co/blog/open-deep-research open-Deep-research] ([https://github.com/huggingface/smolagents/tree/main/examples/open_deep_research code])
* xAI Grok 3 Deep Search
  
 
=Advanced Workflows=

 
* [https://arxiv.org/abs/2409.05556 SciAgents: Automating scientific discovery through multi-agent intelligent graph reasoning]
** [https://github.com/lamm-mit/SciAgentsDiscovery code]

===Streamline Administrative Tasks===

* 2025-02: [https://er.educause.edu/articles/2025/2/ushering-in-a-new-era-of-ai-driven-data-insights-at-uc-san-diego Ushering in a New Era of AI-Driven Data Insights at UC San Diego]
 
===Author Research Articles===

 
## [https://www.cursor.com/ Cursor]
## [https://codeium.com/ Codeium] [https://codeium.com/windsurf Windsurf] (with "Cascade" AI Agent)
## ByteDance [https://www.trae.ai/ Trae AI]
# AI-assisted IDE, where the AI generates and manages the dev environment
## [https://replit.com/ Replit]
 
==Open Source Frameworks==

* [https://github.com/Thytu/Agentarium Agentarium]
* [https://orchestra.org/ Orchestra] ([https://docs.orchestra.org/orchestra/introduction docs], [https://docs.orchestra.org/orchestra/introduction code])
* [https://github.com/HKUDS/AutoAgent AutoAgent]: Fully-Automated & Zero-Code LLM Agent Framework
  
 
==Open Source Systems==

 
===Metrics, Benchmarks===

* 2024-12: [https://arxiv.org/abs/2412.11936 A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges]
* 2025-01: [https://codeelo-bench.github.io/ CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings] ([https://arxiv.org/abs/2501.01257 preprint], [https://codeelo-bench.github.io/#leaderboard-table leaderboard])
* 2025-02: [https://static.scale.com/uploads/654197dc94d34f66c0f5184e/EnigmaEval%20v4.pdf ENIGMAEVAL: A Benchmark of Long Multimodal Reasoning Challenges] ([https://scale.com/leaderboard/enigma_eval leaderboard])
* 2025-02: [https://sites.google.com/view/mlgym MLGym: A New Framework and Benchmark for Advancing AI Research Agents] ([https://arxiv.org/abs/2502.14499 paper], [https://github.com/facebookresearch/MLGym code])

 
===Evaluation Schemes===

Latest revision as of 13:05, 23 February 2025

Reviews & Perspectives

Published

Continually updating

Analysis/Opinions

Guides

AI Assistants

Components of AI Assistants

Agent Internal Workflow Management

Information Retrieval (Memory)

Contextual Memory

  • Memobase: user profile-based memory (long-term user memory for genAI applications)

Control (tool-use, computer use, etc.)

Open-source

Personalities/Personas

Specific Uses for AI Assistants

Computer Use

Software Engineering

Science Agents

See Science Agents.

LLM-as-judge

Deep Research

Advanced Workflows

Streamline Administrative Tasks

Author Research Articles

Software Development Workflows

Several paradigms of AI-assisted coding have arisen:

  1. Manual, human-driven
  2. AI-aided through chat/dialogue, where the human asks for code and then copies it into the project
    1. OpenAI ChatGPT
    2. Anthropic Claude
  3. API calls to an LLM, which generates code and inserts the file into the project (a minimal sketch of this pattern appears after this list)
  4. LLM-integration into the IDE
    1. Copilot
    2. Qodo (Codium) & AlphaCodium (preprint, code)
    3. Cursor
    4. Codeium Windsurf (with "Cascade" AI Agent)
    5. ByteDance Trae AI
  5. AI-assisted IDE, where the AI generates and manages the dev environment
    1. Replit
    2. Aider (code): pair programming on the command line
    3. Pythagora
    4. StackBlitz bolt.new
    5. Cline (formerly Claude Dev)
  6. Prompt-to-product
    1. GitHub Spark (demo video)
  7. Semi-autonomous software engineer agents
    1. Devin (Cognition AI)
    2. Amazon Q
    3. Honeycomb
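As a minimal sketch of paradigm 3 above (an API call to an LLM whose output is inserted into the project as a file), the snippet below uses the OpenAI Python client purely as an example; the model name, prompt, and output path are placeholder assumptions, not a recommendation of a specific tool:

<syntaxhighlight lang="python">
# Sketch of "API calls to an LLM, which generates code and inserts the file
# into the project". Model name, prompt, and output path are placeholders.
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

task = "Write a Python module with a function radius_of_gyration(coords) using numpy."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any code-capable model works
    messages=[
        {"role": "system", "content": "Return only the contents of a single Python file."},
        {"role": "user", "content": task},
    ],
)

generated_code = response.choices[0].message.content

# Insert the generated file into the project tree.
out_path = Path("project/analysis/gyration.py")
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(generated_code)
</syntaxhighlight>

The later paradigms (4-7) differ mainly in how much of this loop (gathering context, editing files, running code, reviewing results) moves from the human into the tool.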

For a review of the current state of software-engineering agentic approaches, see:

Corporate AI Agent Ventures

Mundane Workflows and Capabilities

Inference-compute Reasoning

Agentic Systems

Increasing AI Agent Intelligence

See: Increasing AI Intelligence

Multi-agent orchestration

Research

Societies and Communities of AI agents

Domain-specific

Research demos

Related work

Inter-agent communications

Architectures

Open Source Frameworks

Open Source Systems

Commercial Automation Frameworks

Spreadsheet

Cloud solutions

Frameworks

Optimization

Metrics, Benchmarks

Evaluation Schemes

Multi-agent

Agent Challenges

  • Aidan-Bench: Tests creativity by having a particular LLM generate a long sequence of outputs (meant to all be different) and measuring how long it can go before duplications appear (a sketch of this loop follows this list).
  • Pictionary: An LLM suggests a prompt, multiple LLMs generate outputs, and an LLM judges the results; this allows ranking of the generation abilities.
  • MC-bench: Request LLMs to build an elaborate structure in Minecraft; outputs can be A/B tested by human judges.
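
The Aidan-Bench loop referenced above can be sketched roughly as follows; the duplicate test here is a simple word-overlap heuristic rather than the benchmark's actual novelty criterion, and ask_model is a placeholder for any LLM call:

<syntaxhighlight lang="python">
# Rough sketch of an Aidan-Bench-style creativity measurement: keep asking the
# same open-ended question and count distinct answers before the first repeat.
# The duplicate check below is a word-overlap placeholder, not the real criterion.

def word_set(text: str) -> set[str]:
    return set(text.lower().split())


def is_duplicate(answer: str, previous: list[str], threshold: float = 0.8) -> bool:
    """Treat an answer as a duplicate if it overlaps heavily with an earlier one."""
    words = word_set(answer)
    for prior in previous:
        overlap = len(words & word_set(prior)) / max(len(words | word_set(prior)), 1)
        if overlap >= threshold:
            return True
    return False


def creativity_score(ask_model, question: str, max_turns: int = 50) -> int:
    """Count how many distinct answers the model produces before duplicating.

    ask_model(prompt) is any callable returning the model's text answer;
    wiring it to a specific LLM API is left to the caller.
    """
    answers: list[str] = []
    for _ in range(max_turns):
        prompt = (
            f"{question}\n"
            "Give an answer different from all previous ones:\n" + "\n".join(answers)
        )
        answer = ask_model(prompt)
        if is_duplicate(answer, answers):
            break
        answers.append(answer)
    return len(answers)
</syntaxhighlight>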

Automated Improvement

See Also