=LLM=
==Open-weights LLM==
* [https://about.fb.com/news/2023/07/llama-2/ 2023-07Jul-18]: [https://llama.meta.com/llama2/ Llama2] 7B, 13B, 70B
* [https://ai.meta.com/blog/meta-llama-3/ 2024-04Apr-18]: [https://llama.meta.com/llama3/ Llama3] 8B, 70B
* [https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/ 2024-06Jun-14]: [https://research.nvidia.com/publication/2024-06_nemotron-4-340b Nemotron-4] 340B
* 2024-07Jul-23: [https://llama.meta.com/ Llama 3.1] 8B, 70B, 405B
* [https://mistral.ai/news/mistral-large-2407/ 2024-07Jul-24]: [https://huggingface.co/mistralai/Mistral-Large-Instruct-2407 Mistral Large 2] 123B
* [https://developers.googleblog.com/en/smaller-safer-more-transparent-advancing-responsible-ai-with-gemma/ 2024-07Jul-31]: [https://huggingface.co/collections/google/gemma-2-2b-release-66a20f3796a2ff2a7c76f98f Gemma 2] 2B
* [https://qwenlm.github.io/blog/qwen2-math/ 2024-08Aug-08]: Qwen2-Math ([https://huggingface.co/collections/Qwen/qwen2-math-66b4c9e072eda65b5ec7534d hf], [https://github.com/QwenLM/Qwen2-Math github]) 1.5B, 7B, 72B
* [https://nousresearch.com/releases/ 2024-08Aug-14]: [https://nousresearch.com/ Nous Research] [https://nousresearch.com/hermes3/ Hermes 3] ([https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf technical report]) 8B, 70B, 405B
* 2024-08Aug-19: [https://www.salesforceairesearch.com/ Salesforce AI] [https://huggingface.co/papers/2408.08872 xGen-MM (BLIP-3)]: A Family of Open Large Multimodal Models ([https://www.arxiv.org/abs/2408.08872 preprint], [https://github.com/salesforce/LAVIS/tree/xgen-mm code])
* 2024-09Sep-04: [https://arxiv.org/abs/2409.02060 OLMoE: Open Mixture-of-Experts Language Models] ([https://github.com/allenai/OLMoE code]) 7B total parameters (1B active per input token)
* 2024-09Sep-05: [https://huggingface.co/mattshumer/Reflection-70B Reflection 70B] ([https://reflection-playground-production.up.railway.app/ demo]): [https://x.com/mattshumer_/status/1831767014341538166 Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes.]
* 2024-09Sep-06: [https://huggingface.co/deepseek-ai/DeepSeek-V2.5 DeepSeek-V2.5] 236B mixture-of-experts (160 experts, 21B active params)
* 2024-09Sep-19: Microsoft GRadient-INformed (GRIN) MoE ([https://huggingface.co/spaces/GRIN-MoE-Demo/GRIN-MoE demo], [https://huggingface.co/microsoft/GRIN-MoE model], [https://github.com/microsoft/GRIN-MoE github]) 6.6B active params
* 2024-09Sep-23: Nvidia [https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct Llama-3_1-Nemotron-51B-instruct] 51B
* 2024-09Sep-25: Meta [https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/ Llama 3.2] with visual and voice modalities 1B, 3B, 11B, 90B
* 2024-09Sep-25: [https://allenai.org/ Ai2] [https://molmo.allenai.org/ Molmo] [https://molmo.allenai.org/blog multi-modal models] 1B, 7B, 72B
* 2024-10Oct-01: Nvidia [https://huggingface.co/nvidia/NVLM-D-72B NVLM-D-72B] (includes vision)
* [https://mistral.ai/news/ministraux/ 2024-10Oct-16]: Mistral [https://huggingface.co/mistralai/Ministral-8B-Instruct-2410 Ministral-8B-Instruct-2410]
* 2024-10Oct-16: Nvidia [https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward-HF Llama-3.1-Nemotron-70B-Reward]
* 2024-11Nov-04: [https://arxiv.org/abs/2411.02265 Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent] 389B ([https://github.com/Tencent/Tencent-Hunyuan-Large code], [https://huggingface.co/tencent/Tencent-Hunyuan-Large weights])
* 2024-11Nov-18: [https://huggingface.co/mistralai/Mistral-Large-Instruct-2411 Mistral-Large-Instruct-2411] 123B; and [https://mistral.ai/news/pixtral-large/ Pixtral Large] multimodal model 124B ([https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411 weights])
* 2024-11Nov-22: Nvidia [https://github.com/NVlabs/hymba Hymba] ([https://developer.nvidia.com/blog/hymba-hybrid-head-architecture-boosts-small-language-model-performance/ blog]): small, high-performance hybrid-head models

===For Coding===
Cf. the [https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard BigCode Models Leaderboard].
* 2024-10Oct-06: [https://abacus.ai/ Abacus AI] [https://huggingface.co/abacusai/Dracarys2-72B-Instruct Dracarys2-72B-Instruct] (optimized for coding, fine-tune of [https://huggingface.co/Qwen/Qwen2.5-72B-Instruct Qwen2.5-72B-Instruct])
* 2024-11Nov-09: [https://opencoder-llm.github.io/ OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models] ([https://huggingface.co/collections/infly/opencoder-672cec44bbb86c39910fb55e weights], [https://arxiv.org/abs/2411.04905 preprint])
* 2024-11Nov-13: [https://qwenlm.github.io/blog/qwen2.5-coder-family/ Qwen2.5-Coder]

==Cloud LLM==
* [https://groq.com/ Groq] [https://wow.groq.com/ cloud] (very fast inference)

===Multi-modal: Audio===
* [https://kyutai.org/ kyutai Open Science AI Lab] chatbot [https://www.us.moshi.chat/?queue_id=talktomoshi moshi]

==Triage==
* [https://arxiv.org/abs/2406.18665 RouteLLM: Learning to Route LLMs with Preference Data]

==Retrieval Augmented Generation (RAG)==
===Reviews===
* 2024-08: [https://arxiv.org/abs/2408.08921 Graph Retrieval-Augmented Generation: A Survey]
* 2024-09: [https://arxiv.org/abs/2409.14924 Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely]
* List of [https://github.com/NirDiamant/RAG_Techniques RAG techniques]

===Analysis of RAG overall===
* 2024-10: [https://arxiv.org/abs/2410.13070 Is Semantic Chunking Worth the Computational Cost?] (the fixed-size baseline it compares against is sketched below)

===Approaches===
* RAGFlow ([https://github.com/infiniflow/ragflow code])
* GraphRAG ([https://arxiv.org/abs/2404.16130 preprint], [https://github.com/microsoft/graphrag code])
** [https://github.com/Azure-Samples/graphrag-accelerator GraphRAG Accelerator] for easy deployment on Azure
* AutoMetaRAG ([https://github.com/darshil3011/AutoMetaRAG/tree/main code])
* [https://verba.weaviate.io/ Verba]: RAG for [https://weaviate.io/ Weaviate] vector database
** [https://github.com/weaviate/verba code]
** [https://www.youtube.com/watch?v=UoowC-hsaf0 video]
* Google Astute RAG
** Preprint: [https://arxiv.org/abs/2410.07176 Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models]
* 2024-10: [https://arxiv.org/abs/2410.08815 StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization]: Reformats retrieved data into task-appropriate structures (table, graph, tree).
* 2024-10: [https://arxiv.org/abs/2410.13765 Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval]
* 2024-11: [https://www.arxiv.org/abs/2411.13773 FastRAG: Retrieval Augmented Generation for Semi-structured Data]

===Open-source Implementations===
* [https://github.com/Cinnamon/kotaemon kotaemon]: An open-source clean & customizable RAG UI for chatting with your documents.
* [https://www.llamaindex.ai/ LlamaIndex] ([https://github.com/run-llama/llama_index code], [https://docs.llamaindex.ai/en/stable/ docs], [https://github.com/run-llama/voice-chat-pdf voice chat code])
* Nvidia [https://www.nvidia.com/en-us/ai-on-rtx/chatrtx/ ChatRTX] with [https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/ RAG]
* Anthropic [https://github.com/anthropics/anthropic-quickstarts/tree/main/customer-support-agent Customer Support Agent example]
* [https://www.langchain.com/ LangChain] and [https://www.langchain.com/langgraph LangGraph] ([https://www.metadocs.co/2024/08/20/simple-agentic-rag-for-multi-vector-stores-with-langchain-and-langgraph/ tutorial])
** [https://github.com/KruxAI/ragbuilder RAGBuilder]: Automatically tunes RAG hyperparams
* [https://github.com/stanford-oval/WikiChat WikiChat]
** [https://arxiv.org/abs/2305.14292 WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia]
* [https://github.com/bhavnicksm/chonkie Chonkie]: No-nonsense RAG chunking library (open-source, lightweight, fast)
* [https://github.com/pingcap/autoflow autoflow]: open source GraphRAG (Knowledge Graph), including conversational search page

===Web-based Tools===
* [https://typeset.io/ SciSpace] Chat with PDF (also available as a GPT).

===PDF Conversion===
* [https://github.com/kermitt2/grobid Grobid]
* [https://chunkr.ai/ Chunkr] ([https://github.com/lumina-ai-inc/chunkr code])
* [https://github.com/DS4SD/docling Docling]: converts multiple formats (PDF, DOCX, PPTX, Images, HTML) into Markdown and JSON (see the sketch below)

==Automatic Optimization==
===Analogous to Gradient Descent===
* [https://arxiv.org/abs/2406.07496 TextGrad: Automatic "Differentiation" via Text]
* [https://arxiv.org/abs/2406.18532 Symbolic Learning Enables Self-Evolving Agents]

==LLM for scoring/ranking==
* [https://arxiv.org/abs/2302.04166 GPTScore: Evaluate as You Desire]
* [https://arxiv.org/abs/2306.17563 Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting] (toy sketch after this list)
* [https://doi.org/10.1039/D3DD00112A Domain-specific chatbots for science using embeddings]
* [https://arxiv.org/abs/2407.02977 Large Language Models as Evaluators for Scientific Synthesis]

=LLM Agents=
* See [[AI agents]].

=Interfaces=
==Chatbot Frontend==
===Web===
* [https://docs.streamlit.io/develop/tutorials/llms/build-conversational-apps Streamlit] (minimal chat skeleton sketched after this list)
* [https://docs.cohere.com/v2/docs/cohere-toolkit Cohere Toolkit] ([https://github.com/cohere-ai/cohere-toolkit code])
* [https://www.librechat.ai/ LibreChat]
* [https://github.com/open-webui/open-webui open-webui]
* [https://github.com/xjdr-alt/entropix/tree/main/ui entropix frontend UI]
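A minimal Streamlit chat skeleton, following the chat-elements pattern in the tutorial linked above; the echo reply is a placeholder for a real model call:
<syntaxhighlight lang="python">
import streamlit as st  # pip install streamlit; run with: streamlit run app.py

st.title("Minimal chatbot")

if "messages" not in st.session_state:
    st.session_state.messages = []  # chat history survives reruns

for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

if prompt := st.chat_input("Say something"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)
    reply = f"Echo: {prompt}"  # placeholder: call your LLM here
    st.session_state.messages.append({"role": "assistant", "content": reply})
    with st.chat_message("assistant"):
        st.write(reply)
</syntaxhighlight>
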
===Desktop GUI===
* [https://anythingllm.com/ AnythingLLM] ([https://docs.anythingllm.com/ docs], [https://github.com/Mintplex-Labs/anything-llm code]): includes chat-with-docs, selection of LLM and vector db, etc.

==Alternative Text Chatbot UI==
* [https://generative.ink/posts/loom-interface-to-the-multiverse/ Loom] provides a tree-like interface for exploring branched LLM writing.
* [https://www.lesswrong.com/posts/JHsfMWtwxBGGTmb8A/pantheon-interface The Pantheon Interface] is a new idea for how to interact with LLMs ([https://pantheon.chat/ live instance], [https://github.com/nickkeesG/Pantheon code]). In a traditional interaction, you prompt the bot and it replies in a turn-by-turn manner. Pantheon instead invites you to type out your thoughts, and various agents will asynchronously add comments or questions to spur along your brainstorming.

==Conversational Audio Chatbot==
* Swift is a fast AI voice assistant ([https://github.com/ai-ng/swift code], [https://swift-ai.vercel.app/ live demo]) that uses the following (a sketch of this kind of loop appears after the list):
** [https://groq.com/ Groq] cloud running [https://github.com/openai/whisper OpenAI Whisper] for fast speech transcription
** [https://cartesia.ai/ Cartesia] [https://cartesia.ai/sonic Sonic] for fast speech synthesis
** [https://www.vad.ricky0123.com/ VAD] to detect when the user is talking
** [https://vercel.com/ Vercel] for app deployment
* [https://github.com/rtvi-ai RTVI-AI] ([https://github.com/rtvi-ai/rtvi-web-demo code], [https://demo-gpu.rtvi.ai/ demo]), which uses:
** [https://groq.com/ Groq]
** [https://llama.meta.com/ Llama 3.1]
** [https://www.daily.co/ai/ Daily]
** [https://github.com/rtvi-ai RTVI]
* [https://github.com/mezbaul-h/june June]: Local Voice Chatbot
** [https://ollama.com/ Ollama]
** [https://huggingface.co/docs/transformers/en/tasks/asr Hugging Face Transformers] (for speech recognition)
** [https://github.com/coqui-ai/TTS Coqui TTS Toolkit]
* [https://kyutai.org/ kyutai] Moshi chatbot ([https://us.moshi.chat/ demo])
* [https://arxiv.org/abs/2408.16725 Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming] ([https://huggingface.co/gpt-omni/mini-omni model], [https://github.com/gpt-omni/mini-omni code], [https://huggingface.co/spaces/gradio/omni-mini demo])
* 2024-09Sep-11: [https://huggingface.co/ICTNLP/Llama-3.1-8B-Omni Llama-3.1-8B-Omni] ([https://github.com/ictnlp/LLaMA-Omni code]), enabling end-to-end speech interaction
* [https://x.com/AIatMeta/status/1847383580269510670 2024-10Oct-18]: Meta [https://speechbot.github.io/spiritlm/ Spirit LM]: open source multimodal language model that freely mixes text and speech

===Related Research===
* [https://arxiv.org/abs/2408.02622 Language Model Can Listen While Speaking]

===Commercial Systems===
* [https://heypi.com/talk HeyPi Talk]
* [https://vapi.ai/ Vapi]
* [https://callannie.ai/ Call Annie]
* [https://www.bland.ai Bland AI]
* [https://deepgram.com/ Deepgram Voice AI]

=Speech Recognition (ASR) and Transcription=
==Lists==
* [https://huggingface.co/spaces/hf-audio/open_asr_leaderboard Open ASR Leaderboard]

==Open Source==
* [https://github.com/mozilla/DeepSpeech DeepSpeech]
* [https://github.com/speechbrain/speechbrain speechbrain]
* [https://github.com/kaldi-asr/kaldi/blob/master/README.md Kaldi]
* wav2vec 2.0
** [https://arxiv.org/abs/2104.01027 Paper: Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training]
* Whisper (basic usage sketched after this list)
** [https://huggingface.co/openai/whisper-medium.en Whisper medium.en]
** [https://github.com/m-bain/whisperX WhisperX] (includes word-level timestamps and speaker diarization)
** [https://huggingface.co/mlx-community/distil-whisper-large-v3 Distil Large v3 with MLX]
** 2024-10: [https://huggingface.co/ylacombe/whisper-large-v3-turbo whisper-large-v3-turbo] distillation ([https://huggingface.co/spaces/hf-audio/whisper-large-v3-turbo demo], [https://github.com/openai/whisper/actions/runs/11111568226 code])
* [https://huggingface.co/spaces/hf-audio/open_asr_leaderboard Nvidia Canary 1B]
* [https://developer.nvidia.com/blog/accelerating-leaderboard-topping-asr-models-10x-with-nvidia-nemo/ 2024-09]: Nvidia [https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html NeMo]
* 2024-10: [https://www.rev.ai/ Rev AI] [https://huggingface.co/Revai models] for [https://huggingface.co/Revai/reverb-asr transcription] and [https://huggingface.co/Revai/reverb-diarization-v2 diarization]
* 2024-10: [https://github.com/usefulsensors/moonshine Moonshine] (optimized for resource-constrained devices)

==In Browser==
* [https://huggingface.co/spaces/Xenova/whisper-word-level-timestamps Whisper Timestamped]: Multilingual speech recognition with word-level timestamps, running locally in browser

==Phrase Endpointing and Voice Activity Detection (VAD)==
I.e., how to determine when the user is done talking and the bot should respond (a Silero VAD sketch follows the list).
* [https://x.com/kwindla/status/1831364419261268017 Notes]
** [https://demo.dailybots.ai/ Test settings]
** [https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/vad/vad_analyzer.py code]
** [https://github.com/snakers4/silero-vad Silero VAD repo]

==Audio Cleanup==
* [https://krisp.ai/ Krisp AI]: Noise cancellation, meeting summary, etc.

=Text-to-speech (TTS)=
==Open Source==
* [https://github.com/huggingface/parler-tts Parler TTS] ([https://huggingface.co/spaces/parler-tts/parler_tts demo])
* [https://github.com/DigitalPhonetics/IMS-Toucan Toucan] ([https://huggingface.co/spaces/Flux9665/MassivelyMultilingualTTS demo])
* [https://tts.themetavoice.xyz/ MetaVoice] ([https://github.com/metavoiceio/metavoice-src github])
* [https://github.com/2noise/ChatTTS ChatTTS]
* [https://www.camb.ai/ Camb.ai] [https://github.com/Camb-ai/MARS5-TTS MARS5-TTS]
* [https://github.com/coqui-ai/TTS Coqui TTS Toolkit] (usage sketched after this list)
* Fish Speech 1.4: multi-lingual, can clone voices ([https://x.com/reach_vb/status/1833801060659372071 video], [https://huggingface.co/fishaudio/fish-speech-1.4 weights], [https://huggingface.co/spaces/fishaudio/fish-speech-1 demo])
* [https://huggingface.co/SWivid/F5-TTS F5-TTS] ([https://huggingface.co/spaces/mrfakename/E2-F5-TTS demo]): cloning, emotion, etc.
* [https://huggingface.co/amphion/MaskGCT MaskGCT] ([https://huggingface.co/spaces/amphion/maskgct demo])
* [https://arxiv.org/abs/2312.09911 Amphion: An Open-Source Audio, Music and Speech Generation Toolkit] ([https://github.com/open-mmlab/Amphion code])

==Cloud==
* [https://elevenlabs.io/ Elevenlabs] ($50/million characters)
** [https://elevenlabs.io/voice-isolator voice isolator]
* [https://cartesia.ai/ Cartesia] [https://cartesia.ai/sonic Sonic]
* [https://neets.ai/ Neets AI] ($1/million characters)

=Vision=
==Visual Models==
* [https://openai.com/index/clip/ CLIP]
* [https://arxiv.org/abs/2303.15343 SigLIP]
* [https://github.com/roboflow/supervision Supervision]
* [https://arxiv.org/abs/2311.06242 Florence-2]
* Nvidia [https://github.com/NVlabs/MambaVision MambaVision]
* Meta [https://about.meta.com/realitylabs/codecavatars/sapiens Sapiens: Foundation for Human Vision Models] (video input, can infer segmentation, pose, depth-map, and surface normals)

==Multi-modal Models (language-vision/video)==
* [https://arxiv.org/abs/2407.07895 LLaVA-NeXT-Interleave] ([https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19 models], [https://huggingface.co/spaces/merve/llava-interleave demo])
* [https://huggingface.co/papers/2407.15841 SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models]
* Nvidia [https://huggingface.co/collections/merve/nveagle-66d0705108582d73bb235c26 NVEagle] 13B, 7B ([https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat demo], [https://arxiv.org/abs/2408.15998 preprint])
* 2024-08Aug-29: [https://qwenlm.github.io/blog/qwen2-vl/ Qwen2-VL] 7B, 2B ([https://github.com/QwenLM/Qwen2-VL code], [https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d models]): Can process videos up to 20 minutes in length
* 2024-09Sep-11: Mistral [https://huggingface.co/mistral-community/pixtral-12b-240910 Pixtral 12B]
* 2024-09Sep-17: [https://nvlm-project.github.io/ NVLM 1.0]

==Optical character recognition (OCR)==
* [https://arxiv.org/abs/2409.01704 General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model] ([https://huggingface.co/stepfun-ai/GOT-OCR2_0 project], [https://github.com/Ucas-HaoranWei/GOT-OCR2.0/ code], [https://huggingface.co/spaces/stepfun-ai/GOT_official_online_demo demo])

=Embedding=
* [https://www.marktechpost.com/2024/07/28/a-comparison-of-top-embedding-libraries-for-generative-ai/ A Comparison of Top Embedding Libraries for Generative AI]

=Time Series=
* [https://github.com/TDAmeritrade/stumpy Stumpy]: Python library that uses near-match subsequences (the matrix profile) for similarity search and forecasting (see the sketch after this list)
* [https://arxiv.org/abs/1912.09363 Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting]
* [https://arxiv.org/abs/2209.00905 From latent dynamics to meaningful representations]
* [https://arxiv.org/abs/2310.01728 Time-LLM: Time Series Forecasting by Reprogramming Large Language Models]
* [https://arxiv.org/abs/2310.10688 A decoder-only foundation model for time-series forecasting]
* [https://arxiv.org/abs/2310.03589 TimeGPT-1]
* [https://arxiv.org/abs/2402.02592 Unified Training of Universal Time Series Forecasting Transformers]
* [https://arxiv.org/abs/2407.10240 xLSTMTime: Long-term Time Series Forecasting With xLSTM]
* Salesforce: [https://arxiv.org/abs/2410.10469 Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts] ([https://github.com/SalesforceAIResearch/uni2ts/tree/main/project/moirai-moe-1 code], [https://huggingface.co/collections/Salesforce/moirai-r-models-65c8d3a94c51428c300e0742 weights], [https://www.salesforce.com/blog/time-series-morai-moe/ blog])

==Control==
* [https://arxiv.org/abs/2402.15989 PIDformer: Transformer Meets Control Theory]

==Forecasting==
* Meta [https://facebookresearch.github.io/Kats/ Kats] ([https://github.com/facebookresearch/Kats code]): Forecasting (ARIMA, Prophet, Holt Winters, VAR), detection, feature extraction, simulation

=Data=
==Vector Database==
===Open Source===
* [https://milvus.io/ milvus] (open source with paid cloud option)
* [https://qdrant.tech/ Qdrant] (open source with paid cloud option)
* [https://vespa.ai/ Vespa] (open source with paid cloud option)
* [https://www.trychroma.com/ chroma] (basic usage sketched after this list)
* [https://www.llamaindex.ai/ LlamaIndex]
* [https://github.com/asg017/sqlite-vec/tree/main sqlite-vec]

===Commercial cloud===
* [https://archive.pinecone.io/lp/vector-database/ pinecone]
* [https://weaviate.io/products weaviate]

===MySQL===
* MySQL does not traditionally have vector support, but:
** [https://planetscale.com/blog/planetscale-is-bringing-vector-search-and-storage-to-mysql PlanetScale] is working on it
** [https://github.com/stephenc222/mysql_vss mysql_vss] ([https://medium.com/@stephenc211/enhancing-mysql-searches-with-vector-embeddings-11f183932851 discussion])
** [https://www.pingcap.com/tidb-serverless/ TiDB] ([https://www.pingcap.com/article/mysql-vector-search-powering-the-future-of-ai-applications/ discussion])

==Database with Search==
* [https://typesense.org/ Typesense] ([https://github.com/typesense/typesense code])

==Web Scraping==
* [https://github.com/mendableai/firecrawl Firecrawl]

=Hardware=
==AI Acceleration Hardware==
* Nvidia GPUs
* [https://en.wikipedia.org/wiki/Tensor_Processing_Unit Google TPU]
* [https://en.wikipedia.org/wiki/Tesla_Dojo Tesla Dojo]
* [https://www.cerebras.net/ Cerebras]
* [https://www.graphcore.ai/ Graphcore]
* [https://www.untether.ai/ Untether AI]
* [https://sambanova.ai/ SambaNova Systems]
* [https://groq.com/ Groq]
* [https://deepsilicon.com/ Deep Silicon]: Combined hardware/software solution for accelerated AI ([https://x.com/sdianahu/status/1833186687369023550 e.g.] ternary math)
* [https://www.etched.com/ Etched]: Transformer ASICs

==Cloud Training Compute==
* [https://nebius.ai/ Nebius AI]
* [https://glaive.ai/ Glaive AI]

=See Also=
* [[AI agents]]
* [[AI understanding]]
* [[Robots]]