AI tools
LLM
Open-weights LLM
- 2023-07Jul-18: Llama2 7B, 13B, 70B
 - 2024-04Apr-18: Llama3 8B, 70B
 - 2024-06Jun-14: Nemotron-4 340B
 - 2024-07Jul-23: Llama 3.1 8B, 70B, 405B
 - 2024-07Jul-24: Mistral Large 2 128B
 - 2024-07Jul-31: Gemma 2 2B
 - 2024-08Aug-08: Qwen2-Math (hf, github) 1.5B, 7B, 72B
 - 2024-08Aug-14: Nous research Hermes 3 (technical report) 8B, 70B, 405B
 - 2024-08Aug-19: Salesforce AI xGen-MM (BLIP-3): A Family of Open Large Multimodal Models (preprint, code)
- 2024-09Sep-04: OLMoE: Open Mixture-of-Experts Language Models (code): 7B-parameter MoE model (~1B active parameters per input token)
 - 2024-09Sep-05: Reflection 70B (demo): Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes.
 - 2024-09Sep-06: DeepSeek-V2.5 238B mixture-of-experts (160 experts, 16B active params)
 - 2024-09Sep-19: Microsoft GRadient-INformed (GRIN) MoE (demo, model, github) 6.6B
 - 2024-09Sep-23: Nvidia Llama-3_1-Nemotron-51B-instruct 51B
 - 2024-09Sep-25: Meta Llama 3.2 with visual and voice modalities 1B, 3B, 11B, 90B
 - 2024-09Sep-25: Ai2 Molmo multi-modal models 1B, 7B, 72B
 - 2024-10Oct-01: Nvidia NVLM-D-72B (includes vision)
 - 2024-10Oct-16: Mistral Ministral-8B-Instruct-2410
 - 2024-10Oct-16: Nvidia Llama-3.1-Nemotron-70B-Reward
 - 2024-11Nov-04: Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent 389B (code, weights)
- 2024-11Nov-18: Mistral-Large-Instruct-2411 123B; and Pixtral Large multimodal model 124B (weights)
 - 2024-11Nov-22: Nvidia Hymba (blog): small and high-performance
 - 2024-12Dec-06: Meta Llama 3.3 70B
 - 2024-12Dec-26: DeepSeek-V3-Base 671B
 - 2025-01Jan-02: SmallThinker-3B-Preview (fine-tune of Qwen2.5-3b-Instruct)
 - 2025-01Jan-08: Microsoft phi-4 15B
 - 2025-01Jan-14: MiniMax-01, MiniMax-Text-01 and MiniMax-VL-01; 4M context length (paper)
 - 2025-01Jan-27: Qwen2.5-1M (report)
 - 2025-01Jan-27: DeepSeek Janus-Pro-7B (with image capabilities)
 - 2025-03Mar-14: Cohere Command A (weights)
 - 2025-03Mar-17: Mistral Small 3.1 24B (weights)
 - 2025-03Mar-24: DeepSeek-V3-0324 685B
 - 2025-04Apr-05: Meta Llama 4 (109B, 400B, 2T)
 - 2025-04Apr-08: Nvidia Llama-3_1-Nemotron-Ultra-253B-v1
 - 2025-05May-07: Mistral Medium 3
 - 2025-06Jun-26: Google Gemma 3n (on-device multimodal)
 - 2025-08Aug-06: Qwen3-4B-Instruct-2507
 - 2025-08Aug-15: Google Gemma 3 270M
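
Most of the open-weights checkpoints above can be run locally with the Hugging Face transformers library. A minimal sketch, assuming the accelerate package for device_map="auto"; the model id is one of the entries above and is purely illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B-Instruct-2507"  # any chat-tuned checkpoint from the list above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Give a one-paragraph summary of mixture-of-experts models."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```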
 
Coding
Rankings: bigcode-models-leaderboard and CodeElo leaderboard
- 2024-10Oct-06: Abacus AI Dracarys2-72B-Instruct (optimized for coding, fine-tune of Qwen2.5-72B-Instruct)
 - 2024-11Nov-09: OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models (weights, preprint)
 - 2024-11Nov-13: Qwen2.5-Coder
 - 2025-04Apr-08: DeepCoder-14B-Preview (code, hf)
 - 2025-05May-10: ByteDance SeedCoder 8B
 - 2025-07Jul-11: Kimi-K2 1T (code, weights)
 - 2025-07Jul-23: Qwen3-Coder-480B-A35B-Instruct (code, weights)
 
Reasoning
See also: Increasing AI Intelligence > Proactive Search > CoT reasoning model
- 2024-11Nov-20: DeepSeek-R1-Lite-Preview (results, CoT)
 - 2024-11Nov-23: Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
 - 2024-11Nov-27: Alibaba Qwen QwQ 32B (model, demo)
 - 2024-12Dec-04: Ruliad Deepthought 8B (demo)
 - 2024-12Dec-24: Qwen QvQ-72B-preview (visual reasoning)
 - 2025-01Jan-10: LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs (preprint, code, weights)
 - 2025-01Jan-20: DeepSeek-R1, DeepSeek-R1-Distill-Llama-70B, DeepSeek-R1-Distill-Qwen-32B, ... (paper)
 - 2025-02Feb-10: Huginn-0125: Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (code, model)
 - 2025-02Feb-14: DeepHermes 3 - Llama-3.1 8B
 - 2025-02Feb-24: Qwen QwQ-Max-Preview (online demo)
 - 2025-03Mar-05: Qwen QwQ-32B (demo)
 - 2025-03Mar-05: RWKV7-G1 "GooseOne" 0.1B (weights, preprint)
 - 2025-03Mar-17: LG AI Research EXAONE Deep 2.4B, 7.8B, 32B (weights)
 - 2025-03Mar-18: Nvidia Llama Nemotron 8B, 49B (demo)
 - 2025-04Apr-08: DeepCoder-14B-Preview (code, hf)
- 2025-04Apr-10: ByteDance Seed-Thinking-v1.5 200B
 - 2025-04Apr-11: Zyphra ZR1-1.5B (weights, use)
 - 2025-04Apr-29: Qwen3 0.6B to 235B (code, weights, modelscope)
 - 2025-04Apr-30: Phi-4 Reasoning 14B (tech report)
 - 2025-05May-28: DeepSeek-R1-0528
 - 2025-06Jun-10: Mistral Magistral 24B (weights)
 - 2025-07Jul-08: SmolLM3: smol, multilingual, long-context reasoner
 - 2025-08Aug-05: OpenAI gpt-oss-120b, gpt-oss-20b
 - 2025-08Aug-06: Qwen3-4B-Thinking-2507
 - 2025-09Sep: K2-Think 32B
 
Agentic
- 2025-02Feb-18: Microsoft Magma-8B (preprint)
 - 2025-02Feb-26: Convergence Proxy Lite
 
Multimodal
Language/Vision
- LLaVA-NeXT-Interleave (models, demo)
 - SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
 - Nvidia NVEagle 13B, 7B (demo, preprint)
 - 2024-08Aug-29: Qwen2-VL 7B, 2B (code, models): Can process videos up to 20 minutes in length
 - 2024-09Sep-11: Mistral Pixtral 12B
 - 2024-09Sep-17: NVLM 1.0
 - 2024-12Dec-06: Nvidia NVILA: Efficient Frontier Visual Language Models
 - 2025-01Jan-28: Qwen2.5-VL
 - 2025-02Feb-18: Microsoft Magma-8B (preprint)
 - 2025-03Mar-05: Cohere Aya 8B, 32B
- 2025-03Mar-12: Google Gemma 3 1B, 4B, 12B, 27B (technical report)
 - 2025-03Mar-23: Cohere Aya Vision 8B, 32B (weights)
 - 2025-03Mar-24: Alibaba Qwen2.5-VL-32B-Instruct (weights)
 - 2025-05May-20: ByteDance BAGEL: Unified Model for Multimodal Understanding and Generation 7B (weights, code, demo)
 
Language/Vision/Speech
- 2025-02Feb-27: Microsoft Phi-4-multimodal-instruct (language, vision, speech)
 - 2025-03Mar-21: kyutai MoshiVis (demo)
- 2025-03Mar-26: Qwen2.5-Omni-7B (tech report, code, weights)
 
Language/Audio
- 2025-03Mar-11: Mellow: a small audio language model for reasoning, 167M (paper)
- 2025-03Mar-12: Audio Flamingo 2 0.5B, 1.5B, 3B (paper, code)
 
RAG
- 2025-04: Pleias-RAG 350M, 1.2B (paper: Even Small Reasoners Should Quote Their Sources: Introducing Pleias-RAG Model Family)
 - 2025-04: Meta ReasonIR 8B: ReasonIR: Training Retrievers for Reasoning Tasks
 
Cloud LLM
Multi-modal: Audio
- kyutai Open Science AI Lab chatbot moshi
 
Triage
Retrieval Augmented Generation (RAG)
- See Also: Document Parsing
 
Reviews
- 2024-08: Graph Retrieval-Augmented Generation: A Survey
 - 2024-09: Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely
 - 2024-12: A Survey of Query Optimization in Large Language Models
 - 2025-01: Enhancing Retrieval-Augmented Generation: A Study of Best Practices
 - 2025-01: Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG (github)
 - List of RAG techniques
- Advanced RAG Cookbooks
 - Awesome-GraphRAG (GraphRAG Survey)
 
Measuring RAG performance
- 2025-01: The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input
 
Analysis of RAG overall
Approaches
- RAGFlow (code)
 - GraphRAG (preprint, code, GraphRAG Accelerator for easy deployment on Azure)
 - AutoMetaRAG (code)
 - Verba: RAG for Weaviate vector database (code, video)
 - Microsoft: PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation
 - 2024-10: Google Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models
 - 2024-10: StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization: Reformats retrieved data into task-appropriate structures (table, graph, tree).
 - 2024-10: Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval
 - 2024-11: FastRAG: Retrieval Augmented Generation for Semi-structured Data
 - 2024-11: Microsoft LazyGraphRAG: Setting a new standard for quality and cost
 - 2024-11: Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models
 - 2025-01: Search-o1: Agentic Search-Enhanced Large Reasoning Models (project, code)
 - 2025-01: AutoRAG: RAG AutoML tool for automatically finding an optimal RAG pipeline for your data
 - 2025-01: VideoRAG: Retrieval-Augmented Generation over Video Corpus
 - 2025-02: DeepRAG: Thinking to Retrieval Step by Step for Large Language Models
 - 2025-02: Multi-vector embeddings
 - 2025-03: RARE: Retrieval-Augmented Reasoning Modeling
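
The common thread in the approaches above is retrieve-then-generate: embed the corpus, find the chunks nearest the query, and hand them to the model as context. A minimal sketch of that baseline, assuming the sentence-transformers package; the documents, query, and model choice are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

docs = [
    "The melting point of gallium is 29.76 C.",
    "GraphRAG builds a knowledge graph over the corpus before retrieval.",
    "Qwen2.5-VL is a vision-language model from Alibaba.",
]
query = "At what temperature does gallium melt?"

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = embedder.encode(docs, convert_to_tensor=True)
q_emb = embedder.encode(query, convert_to_tensor=True)

# Rank chunks by cosine similarity and keep the top hits as context
hits = util.semantic_search(q_emb, doc_emb, top_k=2)[0]
context = "\n".join(docs[h["corpus_id"]] for h in hits)

# Prepend the retrieved context to the question and send the prompt to any LLM
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)
```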
 
Open-source Implementations
- kotaemon: An open-source clean & customizable RAG UI for chatting with your documents.
 - LlamaIndex (code, docs, voice chat code)
 - Nvidia ChatRTX with RAG
 - Anthropic Customer Support Agent example
 - LangChain and LangGraph (tutorial)
- RAGBuilder: Automatically tunes RAG hyperparams
- WikiChat
 - Chonkie: No-nonsense RAG chunking library (open-source, lightweight, fast)
 - autoflow: open source GraphRAG (Knowledge Graph), including conversational search page
 - RAGLite
 - nano-graphrag: A simple, easy-to-hack GraphRAG implementation
 - Dabarqus
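
LlamaIndex (listed above) wraps this whole ingest-index-query flow. A hedged sketch, assuming the llama-index >= 0.10 package layout and its default (OpenAI-backed) embedding and LLM settings, both of which are swappable:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()   # ingest the files in ./data
index = VectorStoreIndex.from_documents(documents)      # chunk, embed, and index them
query_engine = index.as_query_engine()                  # retrieval + generation in one object
print(query_engine.query("What do these documents say about chunking strategies?"))
```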
 
Web-based Tools
- SciSpace Chat with PDF (also available as a GPT).
 
Commercial Cloud Offerings
- Graphlit
 - ColiVara
 - nhost
 - Vespa RAG
 - Unstructured
 - Fivetran
 - Vectorize
 - Voyage AI
 - Abacus AI
 - Cloudflare AutoRAG
 
LLM for scoring/ranking
- GPTScore: Evaluate as You Desire
 - Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting
 - Domain-specific chatbots for science using embeddings
 - Large Language Models as Evaluators for Scientific Synthesis
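
As a concrete illustration of the pairwise ranking prompting idea above, a minimal sketch; llm() is a stand-in for any text-completion call, not a specific API:

```python
def pairwise_prompt(query: str, passage_a: str, passage_b: str) -> str:
    return (
        f"Query: {query}\n\n"
        f"Passage A: {passage_a}\n\n"
        f"Passage B: {passage_b}\n\n"
        "Which passage is more relevant to the query? Answer with exactly 'A' or 'B'."
    )

def rank(query, passages, llm):
    # Count pairwise wins per passage (O(n^2) comparisons; fine for small candidate sets)
    wins = [0] * len(passages)
    for i in range(len(passages)):
        for j in range(i + 1, len(passages)):
            answer = llm(pairwise_prompt(query, passages[i], passages[j])).strip().upper()
            if answer.startswith("A"):
                wins[i] += 1
            else:
                wins[j] += 1
    return [p for _, p in sorted(zip(wins, passages), reverse=True)]
```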
 
LLM Agents
- See AI Agents.
 
Interfaces
Chatbot Frontend
Web (code)
Web (product)
Desktop GUI
- AnythingLLM (docs, code): includes chat-with-docs, selection of LLM and vector db, etc.
 
Alternative Text Chatbot UI
- Loom provides a tree-like interface for exploring multiple branched LLM continuations of a piece of writing.
 - The Pantheon Interface is a new idea for how to interact with LLMs (live instance, code). In a traditional interaction, you prompt the bot and it replies in a turn-by-turn manner. Pantheon instead invites you to type out your thoughts, and various agents will asynchronously add comments or questions to spur along your brainstorming.
 
Conversational Audio Chatbot
- Swift is a fast AI voice assistant (code, live demo), which uses:
- RTVI-AI (code, demo), which uses:
- June: Local Voice Chatbot, which uses:
  - Ollama
  - Hugging Face Transformers (for speech recognition)
  - Coqui TTS Toolkit
 - kyutai Moshi chatbot (demo)
 - Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming (model, code, demo)
 - 2024-09Sep-11: Llama-3.1-8B-Omni (code), enabling end-to-end speech.
 - 2024-10Oct-18: Meta Spirit LM: open source multimodal language model that freely mixes text and speech
 - 2025-02Feb-28: Sesame (demo)
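
These assistants generally follow a cascade: speech recognition, then a text LLM, then speech synthesis. A hedged sketch of that loop; the model choices are illustrative, and synthesize() stands in for any engine from the Text-to-speech section below:

```python
import whisper                      # openai-whisper, for speech recognition
from transformers import pipeline   # any local instruction-tuned LLM for the reply

asr = whisper.load_model("base")
llm = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def respond(audio_path: str, synthesize) -> str:
    user_text = asr.transcribe(audio_path)["text"]                       # 1. speech -> text
    out = llm(user_text, max_new_tokens=100, return_full_text=False)     # 2. text -> reply
    reply = out[0]["generated_text"]
    synthesize(reply)                                                    # 3. reply -> speech
    return reply
```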
 
Turn Detection
- 2025-03: Smart Turn: open-source audio turn-detection model
 
Related Research
Commercial Systems
- Bland AI
- DeepGram Voice AI
- Sesame (demo)
Speech Recognition (ASR) and Transcription
Lists
Open Source
- DeepSpeech
 - speechbrain
 - Kaldi
 - wav2vec 2.0
- Whisper
  - Whisper medium.en
  - WhisperX (includes word-level timestamps and speaker diarization)
  - Distil Large v3 with MLX
  - 2024-10: whisper-large-v3-turbo distillation (demo, code)
 - Nvidia Canary 1B
 - 2024-09: Nvidia NeMo
 - 2024-10: Rev AI models for transcription and diarization
 - 2024-10: Moonshine (optimized for resource-constrained devices)
 - 2025-05: Parakeet TDT 0.6B V2
 - 2025-05: Kyutai Unmute
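
For the Whisper-family models above, the openai-whisper package gives local transcription in a few lines (the audio path is a placeholder):

```python
import whisper

model = whisper.load_model("base")          # tiny / base / small / medium / large
result = model.transcribe("audio.mp3")
print(result["text"])
for seg in result["segments"]:              # per-segment timestamps
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")
```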
 
In Browser
- Whisper Timestamped: Multilingual speech recognition with word-level timestamps, running locally in browser
 
Phrase Endpointing and Voice Activity Detection (VAD)
I.e., how to determine when the user is done talking, so the bot knows when to respond.
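
A common baseline is frame-level voice activity detection plus a silence timeout. A hedged sketch using the webrtcvad package (an assumption; it is not one of the tools referenced on this page):

```python
import webrtcvad

vad = webrtcvad.Vad(2)          # aggressiveness 0-3
SAMPLE_RATE = 16000
FRAME_MS = 30                   # webrtcvad accepts 10, 20, or 30 ms frames

def user_is_done(frames, silence_ms=700):
    """frames: iterable of 30 ms chunks of 16 kHz, 16-bit mono PCM."""
    silent_run = 0
    for frame in frames:
        if vad.is_speech(frame, SAMPLE_RATE):
            silent_run = 0
        else:
            silent_run += FRAME_MS
        if silent_run >= silence_ms:
            return True                     # enough trailing silence: end of turn
    return False
```
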
Audio Cleanup
- Krisp AI: Noise cancellation, meeting summary, etc.
 
Auto Video Transcription
- TranslateMom
 - Voice-Pro: YouTube downloader, speech separation, transcription, translation, TTS, and voice cloning toolkit for creators
 
Text-to-speech (TTS)
Open Source
- Parler TTS (demo)
 - Toucan (demo)
 - MetaVoice (github)
 - ChatTTS
 - Camb.ai MARS5-TTS
 - Coqui TTS Toolkit
 - Fish Speech 1.4: multi-lingual, can clone voices (video, weights, demo)
 - F5-TTS (demo): cloning, emotion, etc.
 - MaskGCT (demo)
 - Amphion: An Open-Source Audio, Music and Speech Generation Toolkit (code)
 - Zyphra Zonos
 - Fish Speech (includes voice cloning)
 - Canopy Orpheus 3B
 - Canopy Orpheus Multilingual
 - Nari Labs Dia
 - Kyutai TTS Unmute
 - Chatterbox TTS (try)
 - Play AI PlayDiffusion (demo, example)
 - Mistral Voxtral
 - Kitten TTS (github, hf) 15M (fast, light-weight)
 - Microsoft VibeVoice 1.5B
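
For the Coqui TTS Toolkit listed above, local synthesis is only a couple of lines; the model id is one of its stock English voices and is illustrative:

```python
from TTS.api import TTS

tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")       # any Coqui model id works
tts.tts_to_file(text="Text to speech, running locally.", file_path="out.wav")
```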
 
Cloud
- Elevenlabs ($50/million characters)
 - Cartesia Sonic
 - Neets AI ($1/million characters)
 - Hailuo AI T2A-01-HD (try, API)
 - Hume (can set emotion, give acting directions, etc.)
 
Text-to-audio
- 2024-12: TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization (code)
 - 2025-03: AudioX: Diffusion Transformer for Anything-to-Audio Generation
 
Vision
- Langfun library as a means of converting images into structured output.
 - See also: Multimodal open-weights models
 
Visual Models
- CLIP
 - Siglip
 - Supervision
 - Florence-2
 - Nvidia MambaVision
 - Meta Sapiens: Foundation for Human Vision Models (video input, can infer segmentation, pose, depth-map, and surface normals)
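
CLIP and Siglip above are typically used for zero-shot classification or image-text similarity. A hedged sketch via Hugging Face transformers; the image path and label set are placeholders:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a diagram"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)    # similarity of the image to each label
print(dict(zip(labels, probs[0].tolist())))
```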
 
Depth
- 2024-06: Depth Anything V2 (code)
 
Superresolution
- 2025-03: Thera: Aliasing-Free Arbitrary-Scale Super-Resolution with Neural Heat Fields (code, use)
Related
Embedding
- A Comparison of Top Embedding Libraries for Generative AI
Text Embedding
- 2024-12: modernBERT
 - 2025-02: NeoBERT (preprint)
 - 2025-03: gemini-embedding-exp-03-07
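
Encoder models like those above emit one vector per token; a text embedding is usually obtained by pooling them. A hedged mean-pooling sketch (the ModernBERT repo id is an assumption; any BERT-style encoder works the same way):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "answerdotai/ModernBERT-base"    # assumed repo id; swap in any encoder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["Retrieval-augmented generation", "Vector databases store embeddings"]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # (batch, tokens, dim)
mask = batch["attention_mask"].unsqueeze(-1)            # zero out padding positions
embeddings = (hidden * mask).sum(1) / mask.sum(1)       # mean pooling over real tokens
print(embeddings.shape)
```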
 
Image Embedding
- 2025-01: Diffusion Autoencoders are Scalable Image Tokenizers (project, code)
Time Series
- Stumpy: Python library, uses near-match subsequences for similarity and forecasting
 - Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting
 - From latent dynamics to meaningful representations
 - Review of Time Series Forecasting Methods and Their Applications to Particle Accelerators
 - Time-LLM: Time Series Forecasting by Reprogramming Large Language Models
 - A decoder-only foundation model for time-series forecasting
 - TimeGPT-1
 - Unified Training of Universal Time Series Forecasting Transformers
 - xLSTMTime : Long-term Time Series Forecasting With xLSTM
 - Salesforce: Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts (code, weights, blog)
 - IBM PatchTSMixer and PatchTST (being used for particle accelerators)
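
As an example of the subsequence-matching approach, a minimal sketch with Stumpy (listed above): the matrix profile exposes the most repeated (motif) and most anomalous (discord) windows of a series. The input series here is synthetic:

```python
import numpy as np
import stumpy

ts = np.sin(np.linspace(0, 40, 1000)) + 0.1 * np.random.randn(1000)
m = 50                                   # subsequence window length
mp = stumpy.stump(ts, m)                 # matrix profile: nearest-neighbor distance per window

motif_idx = int(np.argmin(mp[:, 0]))     # best-matching (most repeated) subsequence
discord_idx = int(np.argmax(mp[:, 0]))   # most anomalous subsequence
print(f"motif at index {motif_idx}, discord at index {discord_idx}")
```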
 
Control
Forecasting
- Meta Kats (code): Forecasting (ARIMA, Prophet, Holt Winters, VAR), detection, feature extraction, simulation
 - Context is Key: A Benchmark for Forecasting with Essential Textual Information
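
Prophet is one of the methods wrapped by Kats above. A hedged sketch calling Prophet directly; the input series is a placeholder, and Prophet expects a dataframe with 'ds' (datestamp) and 'y' (value) columns:

```python
import pandas as pd
from prophet import Prophet

df = pd.DataFrame({
    "ds": pd.date_range("2024-01-01", periods=120, freq="D"),
    "y": range(120),                              # placeholder series
})
m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=30)      # extend 30 days past the data
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```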
 
Anomaly Detection
- 2024-10: Can LLMs Understand Time Series Anomalies? (code)
 
Data
- See also: Data Scraping and Document Parsing
 
Vector Database
Open Source
- milvus (open source with paid cloud option)
 - Qdrant (open source with paid cloud option)
 - Vespa (open source with paid cloud option)
 - chroma
 - LlamaIndex
 - sqlite-vec
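
As a usage sketch, chroma (listed above) embeds documents with a built-in default model and handles the nearest-neighbor search; the documents and query are illustrative:

```python
import chromadb

client = chromadb.Client()      # in-memory; use chromadb.PersistentClient(path=...) to persist
collection = client.create_collection("notes")
collection.add(
    ids=["n1", "n2"],
    documents=[
        "GraphRAG builds a knowledge graph before retrieval.",
        "sqlite-vec adds vector search to SQLite.",
    ],
)
results = collection.query(query_texts=["How can I add vectors to SQLite?"], n_results=1)
print(results["documents"][0])
```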
 
Commercial cloud
MySQL
- MySQL does not traditionally have vector-search support, but:
  - PlanetScale is working on it
  - mysql_vss (discussion)
  - TiDB (discussion)

Database with Search
- Typesense (code)

See Also
- AI
  - Data Extraction
  - AI compute
- AI agents
- AI understanding
- Robots