Latest revision as of 16:03, 26 March 2026

LLM

Open-weights LLM

2025-03Mar-11: Mellow: a small audio language model for reasoning, 167M (paper)
2025-03Mar-12: Audio Flamingo 2 0.5B, 1.5B, 3B (paper, code)

Cloud LLM

Groq cloud (very fast inference)

Triage

Retrieval Augmented Generation (RAG)

kotaemon: An open-source clean & customizable RAG UI for chatting with your documents.
LlamaIndex (code, docs, voice chat code)
Nvidia ChatRTX with RAG
Anthropic Customer Support Agent example
LangChain and LangGraph (tutorial)
- RAGBuilder: Automatically tunes RAG hyperparameters
WikiChat
- WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia
Chonkie: No-nonsense RAG chunking library (open-source, lightweight, fast)
autoflow: open source GraphRAG (Knowledge Graph), including conversational search page
RAGLite
nano-graphrag: A simple, easy-to-hack GraphRAG implementation
Dabarqus

SciSpace Chat with PDF (also available as a GPT).

LLM for scoring/ranking

LLM Agents

Interfaces

Chatbot Frontend

AnythingLLM (docs, code): includes chat-with-docs, selection of LLM and vector db, etc.

Alternative Text Chatbot UI

Loom provides a tree-like structure for exploring branching LLM-generated writing.
The Pantheon Interface is a new idea for how to interact with LLMs (live instance, code). In a traditional interaction, you prompt the bot and it replies in a turn-by-turn manner. Pantheon instead invites you to type out your thoughts, and various agents will asynchronously add comments or questions to spur along your brainstorming.

Conversational Audio Chatbot

Swift is a fast AI voice assistant (code, live demo) that uses:
- Groq cloud running OpenAI Whisper for fast speech transcription.
- Cartesia Sonic for fast speech synthesis
- VAD to detect when user is talking
- Vercel for app deployment
RTVI-AI (code, demo), uses:
- Groq
- Llama 3.1
- Daily
- RTVI
June: Local Voice Chatbot
- Ollama
- Hugging Face Transformers (for speech recognition)
- Coqui TTS Toolkit
kyutai Moshi chatbot (demo)
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming (model, code, demo)
2024-09Sep-11: Llama-3.1-8B-Omni (code), enabling end-to-end speech.
2024-10Oct-18: Meta Spirit LM: open source multimodal language model that freely mixes text and speech
2025-02Feb-28: Sesame (demo)

2025-03: Smart Turn: Open-source

Speech Recognition (ASR) and Transcription

Lists

Open Source

In Browser

Whisper Timestamped: Multilingual speech recognition with word-level timestamps, running locally in browser

Phrase Endpointing and Voice Activity Detection (VAD)

I.e., how to determine when the user is done talking and the bot should respond.
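
As a rough illustration of the endpointing question, the sketch below uses the simplest baseline: treat a frame as speech when its RMS energy exceeds a threshold, and declare the phrase finished once speech has been heard and a run of quiet frames follows. This is only an assumption-laden baseline, not how any tool listed on this page works; the function names, the threshold, and the frame counts are all hypothetical and would need tuning for a real microphone and frame size.

```python
import math
from typing import Iterable, List, Optional


def frame_rms(frame: List[float]) -> float:
    """Root-mean-square energy of one frame of audio samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def find_endpoint(
    frames: Iterable[List[float]],
    energy_threshold: float = 0.02,  # hypothetical value; tune per setup
    min_silence_frames: int = 25,    # e.g. 25 frames of 20 ms = 500 ms
) -> Optional[int]:
    """Return the index of the frame at which the phrase is judged complete.

    The phrase ends once speech has been heard and at least
    `min_silence_frames` consecutive low-energy frames follow it.
    Returns None if no endpoint is found in the given frames.
    """
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= energy_threshold:
            # Loud frame: the user is (still) talking, so reset the timer.
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Quiet frame after speech: count toward the silence timeout.
            silent_run += 1
            if silent_run >= min_silence_frames:
                return i
    return None
```

A fixed silence timeout like this trades latency against the risk of cutting off a speaker who pauses mid-sentence, which is the motivation for learned turn-detection models.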

Audio Cleanup

Krisp AI: Noise cancellation, meeting summary, etc.

Auto Video Transcription

TranslateMom
Voice-Pro: YouTube downloader, speech separation, transcription, translation, TTS, and voice cloning toolkit for creators

Text-to-speech (TTS)

Open Source

Cloud

Elevenlabs ($50/million characters)
- voice isolator
Cartesia Sonic
Neets AI ($1/million characters)
Hailuo AI T2A-01-HD (try, API)
Hume (can set emotion, give acting directions, etc.)

Text-to-audio

Vision

Langfun library as a means of converting images into structured output.
See also: Multimodal open-weights models

Visual Models

CLIP
SigLIP
Supervision
Florence-2
Nvidia MambaVision
Meta Sapiens: Foundation for Human Vision Models (video input, can infer segmentation, pose, depth-map, and surface normals)

Depth

Superresolution

Embedding

Text Embedding

Image Embedding

Time Series

Control

Forecasting

Meta Kats (code): Forecasting (ARIMA, Prophet, Holt Winters, VAR), detection, feature extraction, simulation
Context is Key: A Benchmark for Forecasting with Essential Textual Information

Anomaly Detection

Data

Vector Database

milvus (open source with paid cloud option)
Qdrant (open source with paid cloud option)
Vespa (open source with paid cloud option)
chroma
LlamaIndex
sqlite-vec

MySQL does not traditionally support vector search, but:
- PlanetScale is working on it
- mysql_vss (discussion)
- TiDB (discussion)

Database with Search

@@ Line 29: / Line 29: @@
 * [https://x.com/MiniMax__AI/status/1879226391352549451 2025-01Jan-14]: [https://www.minimaxi.com/en/news/minimax-01-series-2 MiniMax-01], MiniMax-Text-01 and MiniMax-VL-01; 4M context length ([https://www.minimaxi.com/en/news/minimax-01-series-2 paper])
 * 2025-01Jan-27: [https://qwenlm.github.io/blog/qwen2.5-1m/ Qwen2.5-1M] ([https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-1M/Qwen2_5_1M_Technical_Report.pdf report])
+* 2025-01Jan-27: DeepSeek [https://huggingface.co/deepseek-ai/Janus-Pro-7B Janus-Pro-7B] (with image capabilities)
+* [https://x.com/cohere/status/1900170005519753365 2025-03Mar-14]: Cohere [https://cohere.com/blog/command-a Command A] ([https://huggingface.co/CohereForAI/c4ai-command-a-03-2025?ref=cohere-ai.ghost.io weights])
+* [https://x.com/MistralAI/status/1901668499832918151 2025-03Mar-17]: [https://mistral.ai/news/mistral-small-3-1 Mistral Small 3.1] 24B ([https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503 weights])
+* [https://x.com/deepseek_ai/status/1904526863604883661 2025-03Mar-24]: [https://huggingface.co/deepseek-ai/DeepSeek-V3-0324 DeepSeek-V3-0324] 685B
+* 2025-04Apr-05: Meta [https://ai.meta.com/blog/llama-4-multimodal-intelligence/ Llama 4] (109B, 400B, 2T)
+* [https://x.com/kuchaev/status/1909444566379573646 2025-04Apr-08]: Nvidia [https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 Llama-3_1-Nemotron-Ultra-253B-v1]
+* [https://x.com/MistralAI/status/1920119463430500541 2025-05May-07]: Mistral [https://mistral.ai/news/mistral-medium-3 Medium 3]
+* [https://x.com/googleaidevs/status/1938279967026274383 2025-06Jun-26]: Google [https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/ Gemma 3n] (on-device multimodal)
+* [https://x.com/Alibaba_Qwen/status/1953128028047102241 2025-08Aug-06]: [https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 Qwen3-4B-Instruct-2507]
+* [https://x.com/GoogleDeepMind/status/1956393664248271082 2025-08Aug-15]: Google [https://developers.googleblog.com/en/introducing-gemma-3-270m/ Gemma 3 270M]
+* [https://x.com/arcee_ai/status/2016278017572495505?s=20 2026-01Jan-28]: [https://www.arcee.ai/ Arcee AI] [https://docs.arcee.ai/get-started/models-overview Trinity Large] [https://huggingface.co/arcee-ai 400B]
-===For Coding===
+===Coding===
 Rankings: [https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard bigcode-models-leaderboard] and [https://codeelo-bench.github.io/#leaderboard-table CodeElo leaderboard]
 * 2024-10Oct-06: [https://abacus.ai/ Abacus AI] [https://huggingface.co/abacusai/Dracarys2-72B-Instruct Dracarys2-72B-Instruct] (optimized for coding, fine-tune of [https://huggingface.co/Qwen/Qwen2.5-72B-Instruct Qwen2.5-72B-Instruct])
 * 2024-11Nov-09: [https://opencoder-llm.github.io/ OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models] ([https://huggingface.co/collections/infly/opencoder-672cec44bbb86c39910fb55e weights], [https://arxiv.org/abs/2411.04905 preprint])
 * 2024-11Nov-13: [https://qwenlm.github.io/blog/qwen2.5-coder-family/ Qwen2.5-Coder]
+* [https://x.com/Agentica_/status/1909700115755061374 2025-04Apr-08]: [https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51 DeepCoder-14B-Preview] ([https://github.com/agentica-project/rllm code], [https://huggingface.co/agentica-org/DeepCoder-14B-Preview hf])
+* [https://x.com/GeZhang86038849/status/1921147887871742329 2025-05May-10]: ByteDance [https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base SeedCoder] 8B
+* [https://x.com/Kimi_Moonshot/status/1943687594560332025 2025-07Jul-11]: [https://moonshotai.github.io/Kimi-K2/ Kimi-K2] 1T ([https://github.com/MoonshotAI/Kimi-K2 code], [https://huggingface.co/moonshotai weights])
+* [https://x.com/Alibaba_Qwen/status/1947766835023335516 2025-07Jul-23]: [https://qwenlm.github.io/blog/qwen3-coder/ Qwen3-Coder-480B-A35B-Instruct] ([https://github.com/QwenLM/qwen-code code], [https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct weights])
+* [https://x.com/MiniMax_AI/status/2021980761210134808?s=20 2026-02Feb-12]: [https://www.minimax.io/news/minimax-m25 MiniMax M2.5] 230B
 ===Reasoning===
+See also: [[Increasing_AI_Intelligence|Increasing AI Intelligence]] > Proactive Search > [[Increasing_AI_Intelligence#CoT_reasoning_model|CoT reasoning model]]
 * [https://x.com/deepseek_ai/status/1859200141355536422 2024-11Nov-20]: DeepSeek-R1-Lite-Preview ([https://x.com/deepseek_ai/status/1859200145037869485 results], [https://x.com/teortaxesTex/status/1859259359630356955 CoT])
 * 2024-11Nov-23: [https://arxiv.org/abs/2411.14405 Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions]
@@ Line 44: / Line 61: @@
 * 2025-01Jan-10: [https://mbzuai-oryx.github.io/LlamaV-o1/ LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs] ([https://arxiv.org/abs/2501.06186 preprint], [https://github.com/mbzuai-oryx/LlamaV-o1 code], [https://huggingface.co/omkarthawakar/LlamaV-o1 weights])
 * [https://x.com/deepseek_ai/status/1881318130334814301 2025-01Jan-20]: [https://huggingface.co/deepseek-ai/DeepSeek-R1 DeepSeek-R1], [https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B DeepSeek-R1-Distill-Llama-70B], DeepSeek-R1-Distill-Qwen-32B, ... ([https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf paper])
+* 2025-02Feb-10: [https://huggingface.co/tomg-group-umd/huginn-0125 Huginn-0125]: [https://arxiv.org/abs/2502.05171 Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach] ([https://github.com/seal-rg/recurrent-pretraining code], [https://huggingface.co/tomg-group-umd/huginn-0125 model])
+* [https://x.com/NousResearch/status/1890148000204485088 2025-02Feb-14]: [https://huggingface.co/NousResearch/DeepHermes-3-Llama-3-8B-Preview DeepHermes 3 - Llama-3.1 8B]
+* [https://x.com/Alibaba_Qwen/status/1894130603513319842 2025-02Feb-24]: Qwen [https://qwenlm.github.io/blog/qwq-max-preview/ QwQ-Max-Preview] ([https://chat.qwen.ai/ online demo])
+* [https://x.com/Alibaba_Qwen/status/1897361654763151544 2025-03Mar-05]: Qwen [https://qwenlm.github.io/blog/qwq-32b/ QwQ-32B] ([https://huggingface.co/spaces/Qwen/QwQ-32B-Demo demo])
+* [https://x.com/BlinkDL_AI/status/1898579674575552558 2025-03Mar-05]: [https://github.com/BlinkDL/RWKV-LM RWKV7-G1] "GooseOne" 0.1B ([https://huggingface.co/BlinkDL/rwkv7-g1 weights], [https://arxiv.org/abs/2305.13048 preprint])
+* [https://x.com/LG_AI_Research/status/1901803002052436323 2025-03Mar-17]: LG AI Research [https://www.lgresearch.ai/blog/view?seq=543 EXAONE Deep] 2.4B, 7.8B, 32B ([https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-32B weights])
+* [https://x.com/kuchaev/status/1902078122792775771 2025-03Mar-18]: Nvidia [https://huggingface.co/collections/nvidia/llama-nemotron-67d92346030a2691293f200b Llama Nemotron] 8B, 49B ([https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1 demo])
+* [https://x.com/Agentica_/status/1909700115755061374 2025-04Apr-08]: [https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51 DeepCoder-14B-Preview] ([https://github.com/agentica-project/rllm code], [https://huggingface.co/agentica-org/DeepCoder-14B-Preview hf])
+* 2025-04Apr-10: ByteDance [https://github.com/ByteDance-Seed/Seed-Thinking-v1.5 Seed-Thinking-v1.5] 200B
+* [https://x.com/ZyphraAI/status/1910362745423425966 2025-04Apr-11]: [https://www.zyphra.com/ Zyphra] [https://www.zyphra.com/post/introducing-zr1-1-5b-a-small-but-powerful-math-code-reasoning-model ZR1-1.5B] ([https://huggingface.co/Zyphra/ZR1-1.5B weights], [https://playground.zyphra.com/sign-in use])
+* [https://x.com/Alibaba_Qwen/status/1916962087676612998 2025-04Apr-29]: [https://qwenlm.github.io/blog/qwen3/ Qwen3] 0.6B to 235B ([https://github.com/QwenLM/Qwen3 code], [https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f weights], [https://modelscope.cn/home modelscope])
+* [https://x.com/DimitrisPapail/status/1917731614899028190 2025-04Apr-30]: [https://huggingface.co/microsoft/Phi-4-reasoning Phi-4 Reasoning] 14B ([https://www.microsoft.com/en-us/research/wp-content/uploads/2025/04/phi_4_reasoning.pdf tech report])
+* [https://x.com/deepseek_ai/status/1928061589107900779 2025-05May-28]: [https://huggingface.co/deepseek-ai/DeepSeek-R1-0528 DeepSeek-R1-0528]
+* [https://x.com/MistralAI/status/1932441507262259564 2025-06Jun-10]: Mistral [https://mistral.ai/static/research/magistral.pdf Magistral] 24B ([https://huggingface.co/mistralai/Magistral-Small-2506 weights])
+* [https://x.com/LoubnaBenAllal1/status/1942614508549333211 2025-07Jul-08]: [https://huggingface.co/blog/smollm3 SmolLM3]: smol, multilingual, long-context reasoner
+* [https://x.com/OpenAI/status/1952776916517404876 2025-08Aug-05]: [https://openai.com/open-models/ OpenAI] gpt-oss-120b, gpt-oss-20b
+* [https://x.com/Alibaba_Qwen/status/1953128028047102241 2025-08Aug-06]: [https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507 Qwen3-4B-Thinking-2507]
+* 2025-09Sep: [https://huggingface.co/LLM360/K2-Think K2-Think] 32B
+* [https://x.com/Kimi_Moonshot/status/1986449512538513505 2025-11Nov]: [https://moonshotai.github.io/Kimi-K2/thinking.html Kimi K2 Thinking] 1T (32B active)
+* [https://x.com/deepseek_ai/status/1995452641430651132?s=20 2025-12Dec]: [https://huggingface.co/deepseek-ai/DeepSeek-V3.2 DeepSeek-V3.2] and [https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale DeepSeek-V3.2-Speciale]
+===Agentic===
+* 2025-02Feb-18: Microsoft [https://huggingface.co/microsoft/Magma-8B Magma-8B] ([https://www.arxiv.org/abs/2502.13130 preprint])
+* 2025-02Feb-26: [https://convergence.ai/ Convergence] [https://github.com/convergence-ai/proxy-lite Proxy Lite]
+* [https://x.com/MiniMax_AI/status/2021980761210134808?s=20 2026-02Feb-12]: [https://www.minimax.io/news/minimax-m25 MiniMax M2.5] 230B
+===Multimodal===
+====Language/Vision====
+* [https://arxiv.org/abs/2407.07895 LLaVA-NeXT-Interleave] ([https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19 models], [https://huggingface.co/spaces/merve/llava-interleave demo])
+* [https://huggingface.co/papers/2407.15841 SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models]
+* Nvidia [https://huggingface.co/collections/merve/nveagle-66d0705108582d73bb235c26 NVEagle] 13B, 7B ([https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat demo], [https://arxiv.org/abs/2408.15998 preprint])
+* 2024-08Aug-29: [https://qwenlm.github.io/blog/qwen2-vl/ Qwen2-VL] 7B, 2B ([https://github.com/QwenLM/Qwen2-VL code], [https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d models]): Can process videos up to 20 minutes in length
+* 2024-09Sep-11: Mistral [https://huggingface.co/mistral-community/pixtral-12b-240910 Pixtral 12B]
+* 2024-09Sep-17: [https://nvlm-project.github.io/ NVLM 1.0]
+* 2024-12Dec-06: Nvidia [https://arxiv.org/abs/2412.04468 NVILA: Efficient Frontier Visual Language Models]
+* [https://x.com/Alibaba_Qwen/status/1883954247743725963 2025-01Jan-28]: [https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5 Qwen2.5-VL]
+* 2025-02Feb-18: Microsoft [https://huggingface.co/microsoft/Magma-8B Magma-8B] ([https://www.arxiv.org/abs/2502.13130 preprint])
+* [https://x.com/CohereForAI/status/1896923657470886234 2025-03Mar-05]: Cohere [https://cohere.com/research/aya Aya] 8B, 32B
+* 2025-03Mar-12: Google [https://developers.googleblog.com/en/introducing-gemma3/ Gemma 3] 1B, 4B, 12B, 27B ([https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf technical report])
+* [https://x.com/DeepLearningAI/status/1903295570527002729 2025-03Mar-23]: Cohere [https://cohere.com/blog/aya-vision Aya Vision] 8B, 32B ([https://huggingface.co/collections/CohereForAI/c4ai-aya-vision-67c4ccd395ca064308ee1484?ref=cohere-ai.ghost.io weights])
+* [https://x.com/Alibaba_Qwen/status/1904227859616641534 2025-03Mar-24]: Alibaba [https://qwenlm.github.io/blog/qwen2.5-vl-32b/ Qwen2.5-VL-32B-Instruct] ([https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct weights])
+* 2025-05May-20: ByteDance [https://bagel-ai.org/ BAGEL: Unified Model for Multimodal Understanding and Generation] 7B ([https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT weights], [https://github.com/bytedance-seed/BAGEL code], [https://demo.bagel-ai.org/ demo])
+====Language/Vision/Speech====
+* 2025-02Feb-27: Microsoft [https://huggingface.co/microsoft/Phi-4-multimodal-instruct Phi-4-multimodal-instruct] (language, vision, speech)
+* [https://x.com/kyutai_labs/status/1903082848547906011 2025-03Mar-21]: kyutai [https://kyutai.org/moshivis MoshiVis] ([https://vis.moshi.chat/ demo])
+* [https://x.com/Alibaba_Qwen/status/1904944923159445914 2025-03Mar-26]: [https://qwenlm.github.io/blog/qwen2.5-omni/ Qwen2.5-Omni-7B] ([https://github.com/QwenLM/Qwen2.5-Omni/blob/main/assets/Qwen2.5_Omni.pdf tech report], [https://github.com/QwenLM/Qwen2.5-Omni code], [https://huggingface.co/Qwen/Qwen2.5-Omni-7B weights])
+====Language/Audio====
+* 2025-03Mar-11: [https://github.com/soham97/mellow Mellow]: a small audio language model for reasoning, 167M ([https://arxiv.org/abs/2503.08540 paper])
+* 2025-03Mar-12: [https://research.nvidia.com/labs/adlr/AF2/ Audio Flamingo 2] 0.5B, 1.5B, 3B ([https://arxiv.org/abs/2503.03983 paper], [https://github.com/NVIDIA/audio-flamingo code])
+===RAG===
+* 2025-04: [https://huggingface.co/collections/PleIAs/pleias-rag-680a0d78b058fffe4c16724d Pleias-RAG] 350M, 1.2B
+** Paper: [http://ragpdf.pleias.fr/ Even Small Reasoners Should Quote Their Sources: Introducing Pleias-RAG Model Family]
+* 2025-04: Meta ReasonIR 8B: [https://arxiv.org/abs/2504.20595 ReasonIR: Training Retrievers for Reasoning Tasks]
 ==Cloud LLM==
@@ Line 55: / Line 128: @@
 ==Retrieval Augmented Generation (RAG)==
+* See Also: [[AI_tools#Document_Parsing|Document Parsing]]
 ===Reviews===
 * 2024-08: [https://arxiv.org/abs/2408.08921 Graph Retrieval-Augmented Generation: A Survey]
@@ Line 60: / Line 135: @@
 * 2024-12: [https://arxiv.org/abs/2412.17558 A Survey of Query Optimization in Large Language Models]
 * 2025-01: [https://arxiv.org/abs/2501.07391 Enhancing Retrieval-Augmented Generation: A Study of Best Practices]
+* 2025-01: [https://arxiv.org/abs/2501.09136 Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG] ([https://github.com/asinghcsu/AgenticRAG-Survey github])
 * List of [https://github.com/NirDiamant/RAG_Techniques RAG techniques]
 * [https://github.com/athina-ai/rag-cookbooks Advanced RAG Cookbooks👨🏻‍💻]
+* [https://github.com/DEEP-PolyU/Awesome-GraphRAG Awesome-GraphRAG (GraphRAG Survey)]
 ===Measuring RAG performance===
@@ Line 74: / Line 151: @@
 * AutoMetaRAG ([https://github.com/darshil3011/AutoMetaRAG/tree/main code])
 * [https://verba.weaviate.io/ Verba]: RAG for [https://weaviate.io/ Weaviate] vector database ([https://github.com/weaviate/verba code], [https://www.youtube.com/watch?v=UoowC-hsaf0 video])
+* Microsoft: [https://github.com/microsoft/PIKE-RAG PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation]
 * 2024-10: Google [https://arxiv.org/abs/2410.07176 Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models]
 * 2024-10: [https://arxiv.org/abs/2410.08815 StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization]: Reformats retrieved data into task-appropriate structures (table, graph, tree).
@@ Line 83: / Line 161: @@
 * 2025-01: [https://github.com/Marker-Inc-Korea/AutoRAG AutoRAG: RAG AutoML tool for automatically finding an optimal RAG pipeline for your data]
 * 2025-01: [https://arxiv.org/abs/2501.05874 VideoRAG: Retrieval-Augmented Generation over Video Corpus]
+* 2025-02: [https://arxiv.org/abs/2502.01142 DeepRAG: Thinking to Retrieval Step by Step for Large Language Models]
+* 2025-02: [https://weaviate.io/developers/weaviate/tutorials/multi-vector-embeddings Multi-vector embeddings]
+* 2025-03: [https://arxiv.org/abs/2503.23513 RARE: Retrieval-Augmented Reasoning Modeling]
 ===Open-source Implementations===
@@ Line 111: / Line 192: @@
 * [https://platform.vectorize.io/ Vectorize]
 * [https://www.voyageai.com/ Voyage AI]
+* [https://abacus.ai/ Abacus AI]
-===Document Parsing===
+* [https://www.cloudflare.com/ Cloudflare] [https://blog.cloudflare.com/introducing-autorag-on-cloudflare/ AutoRAG]
-* [https://github.com/DS4SD/docling Docling]: converts multiple formats (PDF, DOCX, PPTX, Images, HTML) into Markdown and JSON
-* [https://github.com/microsoft/markitdown Microsoft Markitdown]: converts various formats (PDF, Word, Excel, PPT) to Markdown (available via [https://msftmd.replit.app/ web interface on replit])
-* [https://github.com/wisupai/e2m e2m: Everything to Markdown] (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a)
-* Nvidia [https://docs.nvidia.com/nv-ingest/user-guide/index.html NV-ingest] ([https://github.com/NVIDIA/nv-ingest code]) scalable, performance-oriented document content and metadata extraction microservice
-* [https://github.com/QuivrHQ/MegaParse MegaParse]: Your Parser for every type of documents (pdf, powerpoint, word)
-====PDF Conversion====
-* [https://github.com/kermitt2/grobid Grobid]
-* [https://chunkr.ai/ Chunkr] ([https://github.com/lumina-ai-inc/chunkr code])
-==Automatic Optimization==
-===Analogous to Gradient Descent===
-* [https://arxiv.org/abs/2406.07496 TextGrad: Automatic "Differentiation" via Text]
-* [https://arxiv.org/abs/2406.18532 Symbolic Learning Enables Self-Evolving Agents]
 ==LLM for scoring/ranking==
@@ Line 175: / Line 242: @@
 * 2024-09Sep-11: [https://huggingface.co/ICTNLP/Llama-3.1-8B-Omni Llama-3.1-8B-Omni] ([https://github.com/ictnlp/LLaMA-Omni code]), enabling end-to-end speech.
 * [https://x.com/AIatMeta/status/1847383580269510670 2024-10Oct-18]: Meta [https://speechbot.github.io/spiritlm/ Spirit LM]: open source multimodal language model that freely mixes text and speech
+* 2025-02Feb-28: [https://www.sesame.com/ Sesame] ([https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice#demo demo])
+===Turn Detection===
+* 2025-03: [https://github.com/pipecat-ai/smart-turn Smart Turn]: Open-source
 ===Related Research===
@@ Line 185: / Line 256: @@
 * [https://www.bland.ai Bland AI]
 * [https://deepgram.com/ DeepGram Voice AI]
+* [https://www.sesame.com/ Sesame] ([https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice#demo demo])
 =Speech Recognition (ASR) and Transcription=
@@ Line 205: / Line 277: @@
 * 2024-10: [https://www.rev.ai/ Rev AI] [https://huggingface.co/Revai models] for [https://huggingface.co/Revai/reverb-asr transcription] and [https://huggingface.co/Revai/reverb-diarization-v2 diarization]
 * 2024-10: [https://github.com/usefulsensors/moonshine Moonshine] (optimized for resource-constrained devices)
+* 2025-05: [https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2 Parakeet TDT 0.6B V2]
+* [https://x.com/kyutai_labs/status/1925840420187025892 2025-05]: [https://kyutai.org/ Kyutai] [https://unmute.sh/ Unmute]
+* [https://x.com/cohere/status/2037159129345614174?s=20 2026-03]: [https://cohere.com/blog/transcribe Cohere Transcribe]
 ==In Browser==
@@ Line 218: / Line 293: @@
 ==Audio Cleanup==
 * [https://krisp.ai/ Krisp AI]: Noise cancellation, meeting summary, etc.
+==Auto Video Transcription==
+* [https://www.translate.mom/ TranslateMom]
+* [https://github.com/abus-aikorea/voice-pro Voice-Pro]: YouTube downloader, speech separation, transcription, translation, TTS, and voice cloning toolkit for creators
 =Text-to-speech (TTS)=
@@ Line 231: / Line 310: @@
 * [https://huggingface.co/amphion/MaskGCT MaskGCT] ([https://huggingface.co/spaces/amphion/maskgct demo])
 * [https://arxiv.org/abs/2312.09911 Amphion: An Open-Source Audio, Music and Speech Generation Toolkit] ([https://github.com/open-mmlab/Amphion code])
+* [https://www.zyphra.com/ Zyphra] [https://huggingface.co/Zyphra/Zonos-v0.1-hybrid Zonos]
+* [https://github.com/fishaudio/fish-speech Fish Speech] (includes voice cloning)
+* [https://canopylabs.ai/ Canopy] [https://huggingface.co/collections/canopylabs/orpheus-tts-67d9ea3f6c05a941c06ad9d2 Orpheus] 3B
+* Canopy [https://canopylabs.ai/releases/orpheus_can_speak_any_language Orpheus Multilingual]
+* [https://narilabs.org/ Nari Labs] [https://github.com/nari-labs/dia Dia]
+* [https://kyutai.org/ Kyutai] [https://kyutai.org/next/tts TTS] [https://unmute.sh/ Unmute]
+* [https://github.com/resemble-ai/chatterbox Chatterbox TTS] ([https://huggingface.co/spaces/ResembleAI/Chatterbox try])
+* [https://play.ai/ Play AI] [https://github.com/playht/PlayDiffusion PlayDiffusion] ([https://huggingface.co/spaces/PlayHT/PlayDiffusion demo], [https://x.com/_mfelfel/status/1929586464125239589 example])
+* Mistral [https://mistral.ai/news/voxtral Voxtral]
+* Kitten TTS ([https://github.com/KittenML/KittenTTS github], [https://huggingface.co/KittenML/kitten-tts-nano-0.1 hf]) 15M (fast, light-weight)
+* Microsoft [https://microsoft.github.io/VibeVoice/ VibeVoice] 1.5B
+* [https://x.com/hume_ai/status/2031401003078062578?s=20 2026-03]: Hume AI [https://huggingface.co/collections/HumeAI/tada TADA]
+* [https://x.com/FishAudio/status/2031411140820152560?s=20 2026-03]: [https://huggingface.co/fishaudio/s2-pro Fish Audio S2]
 ==Cloud==
@@ Line 238: / Line 330: @@
 * [https://neets.ai/ Neets AI] ($1/million characters)
 * Hailuo AI T2A-01-HD ([https://www.hailuo.ai/audio try], [https://intl.minimaxi.com/document/platform%20introduction?key=66701c8e1d57f38758d58198 API])
+* [https://www.hume.ai/ Hume] (can set emotion, give acting directions, etc.)
 =Text-to-audio=
 * 2024-12: [https://tangoflux.github.io/ TangoFlux]: [https://arxiv.org/abs/2412.21037 Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization] ([https://github.com/declare-lab/TangoFlux code])
+* 2025-03: [https://arxiv.org/abs/2503.10522 AudioX: Diffusion Transformer for Anything-to-Audio Generation]
 =Vision=
+* [https://github.com/google/langfun Langfun] library as a means of converting images into structured output.
+* See also: [[AI_tools#Multimodal| Multimodal open-weights models]]
 ==Visual Models==
 * [https://openai.com/index/clip/ CLIP]
@@ Line 251: / Line 348: @@
 * Meta [https://about.meta.com/realitylabs/codecavatars/sapiens Sapiens: Foundation for Human Vision Models] (video input, can infer segmentation, pose, depth-map, and surface normals)
-==Multi-modal Models (language-vision/video)==
+==Depth==
-* [https://arxiv.org/abs/2407.07895 LLaVA-NeXT-Interleave] ([https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19 models], [https://huggingface.co/spaces/merve/llava-interleave demo])
+* 2024-06: [https://arxiv.org/abs/2406.09414 Depth Anything V2] ([https://github.com/DepthAnything/Depth-Anything-V2 code])
-* [https://huggingface.co/papers/2407.15841 SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models]
-* Nvidia [https://huggingface.co/collections/merve/nveagle-66d0705108582d73bb235c26 NVEagle] 13B, 7B ([https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat demo], [https://arxiv.org/abs/2408.15998 preprint])
-* 2024-08Aug-29: [https://qwenlm.github.io/blog/qwen2-vl/ Qwen2-VL] 7B, 2B ([https://github.com/QwenLM/Qwen2-VL code], [https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d models]): Can process videos up to 20 minutes in length
-* 2024-09Sep-11: Mistral [https://huggingface.co/mistral-community/pixtral-12b-240910 Pixtral 12B]
-* 2024-09Sep-17: [https://nvlm-project.github.io/ NVLM 1.0]
-* 2024-12Dec-06: Nvidia [https://arxiv.org/abs/2412.04468 NVILA: Efficient Frontier Visual Language Models]
-==Optical character recognition (OCR)==
+==Superresolution==
-* [https://arxiv.org/abs/2409.01704 General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model] ([https://huggingface.co/stepfun-ai/GOT-OCR2_0 project], [https://github.com/Ucas-HaoranWei/GOT-OCR2.0/ code], [https://huggingface.co/spaces/stepfun-ai/GOT_official_online_demo demo])
+* 2025-03: [https://arxiv.org/abs/2311.17643 Thera: Aliasing-Free Arbitrary-Scale Super-Resolution with Neural Heat Fields] ([https://github.com/prs-eth/thera code], [https://huggingface.co/spaces/prs-eth/thera use])
-* [https://github.com/yigitkonur/swift-ocr-llm-powered-pdf-to-markdown Swift OCR: LLM Powered Fast OCR]
 ==Related==
@@ Line 269: / Line 359: @@
 =Embedding=
 * [https://www.marktechpost.com/2024/07/28/a-comparison-of-top-embedding-libraries-for-generative-ai/ A Comparison of Top Embedding Libraries for Generative AI]
+* [https://x.com/OfficialLoganK/status/2031411916489298156?s=20 2026-03]: [https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/ Gemini Embedding 2]
+* [https://x.com/mixedbreadai/status/2032127466081567106?s=20 2026-03]: [https://www.mixedbread.com/ Mixedbread] Wholembed v3
+==Text Embedding==
 * 2024-12: [https://huggingface.co/blog/modernbert modernBERT]
+* 2025-02: [https://huggingface.co/chandar-lab/NeoBERT NeoBERT] ([https://arxiv.org/abs/2502.19587 preprint])
+* 2025-03: [https://developers.googleblog.com/en/gemini-embedding-text-model-now-available-gemini-api/ gemini-embedding-exp-03-07]
+==Image Embedding==
+* 2025-01: [https://arxiv.org/abs/2501.18593 Diffusion Autoencoders are Scalable Image Tokenizers] ([https://yinboc.github.io/dito/ project], [https://github.com/yinboc/dito code])
 =Time Series=
@@ Line 283: / Line 382: @@
 * Salesforce: [https://arxiv.org/abs/2410.10469 Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts] ([https://github.com/SalesforceAIResearch/uni2ts/tree/main/project/moirai-moe-1 code], [https://huggingface.co/collections/Salesforce/moirai-r-models-65c8d3a94c51428c300e0742 weights], [https://www.salesforce.com/blog/time-series-morai-moe/ blog])
 * IBM [https://huggingface.co/docs/transformers/en/model_doc/patchtsmixer PatchTSMixer] and [https://huggingface.co/docs/transformers/en/model_doc/patchtst PatchTST] (being [https://research.ibm.com/blog/time-series-AI-transformers used] for particle accelerators)
+* 2026-02: Google [https://research.google/blog/a-decoder-only-foundation-model-for-time-series-forecasting/ TimesFM]
 ==Control==
@@ Line 291: / Line 390: @@
 * Meta [https://facebookresearch.github.io/Kats/ Kats] ([https://github.com/facebookresearch/Kats code]): Forecasting (ARIMA, Prophet, Holt Winters, VAR), detection, feature extraction, simulation
 * [https://arxiv.org/abs/2410.18959 Context is Key: A Benchmark for Forecasting with Essential Textual Information]
+==Anomaly Detection==
+* 2024-10: [https://arxiv.org/abs/2410.05440 Can LLMs Understand Time Series Anomalies?] ([https://github.com/rose-stl-lab/anomllm code])
 =Data=
+* See also: [[Data_Extraction#Data_Scraping| Data Scraping]] and [[Data_Extraction#Document_Parsing| Document Parsing]]
 ==Vector Database==
 ===Open Source===
@@ Line 314: / Line 417: @@
 ==Database with Search==
 * [https://typesense.org/ Typesense] ([https://github.com/typesense/typesense code])
-==Web Scraping==
-* [https://github.com/mendableai/firecrawl Firecrawl]
-* [https://github.com/unclecode/crawl4ai Crawl4AI: Crawl Smarter, Faster, Freely. For AI.]
-* [https://github.com/ScrapeGraphAI/Scrapegraph-ai ScrapeGraphAI: You Only Scrape Once]: web scraping python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.)
-===Headless Browser (scrape & automate)===
-* [https://github.com/lightpanda-io/browser Lightpanda Browser]
-===Github===
-* [https://github.com/cyclotruc/gitingest GitIngest]: Turn any GitHub repository into a prompt-friendly text file, for inclusion in LLM's context. Available at: [https://gitingest.com/ gitingest.com]
-* [https://github.gg/ github.gg]: For analyzing GitHub repositories and providing valuable insights about code quality, dependencies, and more
-* [https://github.com/mattmireles/Flatty Flatty - Codebase-to-Text for LLMs]
 =See Also=
+* [[AI]]
+** [[Data Extraction]]
+** [[AI compute]]
 * [[AI agents]]
 * [[AI understanding]]
-* [[AI compute]]
 * [[Robots]]

Difference between revisions of "AI tools"

Contents

LLM

Open-weights LLM

Coding

Reasoning

Agentic

Multimodal

Language/Vision

Language/Vision/Speech

Language/Audio

RAG

Cloud LLM

Multi-modal: Audio

Triage

Retrieval Augmented Generation (RAG)

Reviews

Measuring RAG performance

Analysis of RAG overall

Approaches

Open-source Implementations

Web-based Tools

Commercial Cloud Offerings

LLM for scoring/ranking

LLM Agents

Interfaces

Chatbot Frontend

Web (code)

Web (product)

Desktop GUI

Alternative Text Chatbot UI

Conversational Audio Chatbot

Turn Detection

Related Research

Commercial Systems

Speech Recognition (ASR) and Transcription

Lists

Open Source

In Browser

Phrase Endpointing and Voice Activity Detection (VAD)

Audio Cleanup

Auto Video Transcription

Text-to-speech (TTS)

Open Source

Cloud

Text-to-audio

Vision

Visual Models

Depth

Superresolution

Related

Embedding

Text Embedding

Image Embedding

Time Series

Control

Forecasting

Anomaly Detection

Data

Vector Database

Open Source

Commercial cloud

MySQL

Database with Search

See Also
