=LLM=
==Open-weights LLM==
* [https://about.fb.com/news/2023/07/llama-2/ 2023-07Jul-18]: [https://llama.meta.com/llama2/ Llama2] 7B, 13B, 70B
* [https://ai.meta.com/blog/meta-llama-3/ 2024-04Apr-18]: [https://llama.meta.com/llama3/ Llama3] 8B, 70B
* [https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/ 2024-06Jun-14]: [https://research.nvidia.com/publication/2024-06_nemotron-4-340b Nemotron-4] 340B
* 2024-07Jul-23: [https://llama.meta.com/ Llama 3.1] 8B, 70B, 405B
* [https://mistral.ai/news/mistral-large-2407/ 2024-07Jul-24]: [https://huggingface.co/mistralai/Mistral-Large-Instruct-2407 Mistral Large 2] 123B
* [https://developers.googleblog.com/en/smaller-safer-more-transparent-advancing-responsible-ai-with-gemma/ 2024-07Jul-31]: [https://huggingface.co/collections/google/gemma-2-2b-release-66a20f3796a2ff2a7c76f98f Gemma 2] 2B
* [https://qwenlm.github.io/blog/qwen2-math/ 2024-08Aug-08]: Qwen2-Math ([https://huggingface.co/collections/Qwen/qwen2-math-66b4c9e072eda65b5ec7534d hf], [https://github.com/QwenLM/Qwen2-Math github]) 1.5B, 7B, 72B
* [https://nousresearch.com/releases/ 2024-08Aug-14]: [https://nousresearch.com/ Nous Research] [https://nousresearch.com/hermes3/ Hermes 3] ([https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf technical report]) 8B, 70B, 405B
* 2024-08Aug-19: [https://www.salesforceairesearch.com/ Salesforce AI] [https://huggingface.co/papers/2408.08872 xGen-MM (BLIP-3)]: A Family of Open Large Multimodal Models ([https://www.arxiv.org/abs/2408.08872 preprint], [https://github.com/salesforce/LAVIS/tree/xgen-mm code])
* 2024-09Sep-04: [https://arxiv.org/abs/2409.02060 OLMoE: Open Mixture-of-Experts Language Models] ([https://github.com/allenai/OLMoE code]) 7B total parameters (1B active per token)
* 2024-09Sep-05: [https://huggingface.co/mattshumer/Reflection-70B Reflection 70B] ([https://reflection-playground-production.up.railway.app/ demo]): [https://x.com/mattshumer_/status/1831767014341538166 trained using "Reflection-Tuning", a technique developed to enable LLMs to fix their own mistakes]
* 2024-09Sep-06: [https://huggingface.co/deepseek-ai/DeepSeek-V2.5 DeepSeek-V2.5] 236B mixture-of-experts (160 experts, 21B active params)
* 2024-09Sep-19: Microsoft GRadient-INformed (GRIN) MoE ([https://huggingface.co/spaces/GRIN-MoE-Demo/GRIN-MoE demo], [https://huggingface.co/microsoft/GRIN-MoE model], [https://github.com/microsoft/GRIN-MoE github]) 6.6B active params
* 2024-09Sep-23: Nvidia [https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct Llama-3_1-Nemotron-51B-instruct] 51B
* 2024-09Sep-25: Meta [https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/ Llama 3.2] with visual and voice modalities 1B, 3B, 11B, 90B
* 2024-09Sep-25: [https://allenai.org/ Ai2] [https://molmo.allenai.org/ Molmo] [https://molmo.allenai.org/blog multi-modal models] 1B, 7B, 72B
* 2024-10Oct-01: Nvidia [https://huggingface.co/nvidia/NVLM-D-72B NVLM-D-72B] (includes vision)
* [https://mistral.ai/news/ministraux/ 2024-10Oct-16]: Mistral [https://huggingface.co/mistralai/Ministral-8B-Instruct-2410 Ministral-8B-Instruct-2410]
* 2024-10Oct-16: Nvidia [https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward-HF Llama-3.1-Nemotron-70B-Reward]
* 2024-11Nov-04: [https://arxiv.org/abs/2411.02265 Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent] 389B ([https://github.com/Tencent/Tencent-Hunyuan-Large code], [https://huggingface.co/tencent/Tencent-Hunyuan-Large weights])
* 2024-11Nov-18: [https://huggingface.co/mistralai/Mistral-Large-Instruct-2411 Mistral-Large-Instruct-2411] 123B; and [https://mistral.ai/news/pixtral-large/ Pixtral Large] multimodal model 124B ([https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411 weights])
* 2024-11Nov-22: Nvidia [https://github.com/NVlabs/hymba Hymba] ([https://developer.nvidia.com/blog/hymba-hybrid-head-architecture-boosts-small-language-model-performance/ blog]): small and high-performance
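Most of the open-weights models above can be run locally through Hugging Face <code>transformers</code>. A minimal inference sketch (the model name is an example; gated models such as Llama require accepting the license and authenticating with a Hugging Face token):
<syntaxhighlight lang="python">
# Minimal sketch: local inference for an open-weights chat model.
# Assumes `pip install transformers torch` and enough GPU/CPU memory.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # example; gated, needs HF access approval
    device_map="auto",                          # spread weights across available devices
)

messages = [{"role": "user", "content": "Briefly, what is a mixture-of-experts model?"}]
result = pipe(messages, max_new_tokens=200)
# With recent transformers versions, generated_text is the full chat,
# ending with the assistant's reply:
print(result[0]["generated_text"][-1]["content"])
</syntaxhighlight>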
+ | |||
+ | ===For Coding=== | ||
+ | C.f. [https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard] | ||
+ | * 2024-10Oct-06: [https://abacus.ai/ Abacus AI] [https://huggingface.co/abacusai/Dracarys2-72B-Instruct Dracarys2-72B-Instruct] (optimized for coding, fine-tune of [https://huggingface.co/Qwen/Qwen2.5-72B-Instruct Qwen2.5-72B-Instruct]) | ||
+ | * 2024-11Nov-09: [https://opencoder-llm.github.io/ OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models] ([https://huggingface.co/collections/infly/opencoder-672cec44bbb86c39910fb55e weights], [https://arxiv.org/abs/2411.04905 preprint]) | ||
+ | * 2024-11Nov-13: [https://qwenlm.github.io/blog/qwen2.5-coder-family/ Qwen2.5-Coder] | ||
+ | |||
+ | ==Cloud LLM== | ||
+ | * [https://groq.com/ Groq] [https://wow.groq.com/ cloud] (very fast inference) | ||
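Groq exposes an OpenAI-style chat-completions API. A minimal sketch, assuming <code>pip install groq</code>, a <code>GROQ_API_KEY</code> environment variable, and a currently-hosted model name (check Groq's model list, which changes over time):
<syntaxhighlight lang="python">
# Minimal sketch of the Groq cloud API (OpenAI-compatible chat completions).
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])
resp = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # assumption: verify against Groq's current models
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
</syntaxhighlight>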
+ | |||
+ | ===Multi-modal: Audio=== | ||
+ | * [https://kyutai.org/ kyutai Open Science AI Lab] chatbot [https://www.us.moshi.chat/?queue_id=talktomoshi moshi] | ||
+ | |||
+ | ==Triage== | ||
+ | * [https://arxiv.org/abs/2406.18665 RouteLLM: Learning to Route LLMs with Preference Data] | ||
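The core triage idea is to send easy queries to a cheap model and hard ones to a strong model. The sketch below uses a trivial heuristic score purely to illustrate the control flow; it is not RouteLLM's learned preference-based router, and the two model calls are stubs:
<syntaxhighlight lang="python">
# Illustrative router: NOT RouteLLM's learned router, just the routing pattern.
def call_cheap_model(q: str) -> str:
    return f"[cheap model answer to: {q}]"    # stub: e.g. a small local model

def call_strong_model(q: str) -> str:
    return f"[strong model answer to: {q}]"   # stub: e.g. a frontier API model

def difficulty_score(query: str) -> float:
    """Toy heuristic standing in for a learned router score in [0, 1]."""
    hard_markers = ("prove", "derive", "debug", "multi-step", "why")
    score = 0.2 + 0.15 * sum(m in query.lower() for m in hard_markers)
    return min(score + len(query) / 2000, 1.0)

def route(query: str, threshold: float = 0.5) -> str:
    """Route to the strong model only when the query looks hard."""
    if difficulty_score(query) < threshold:
        return call_cheap_model(query)
    return call_strong_model(query)

print(route("What is 2+2?"))
print(route("Prove that the algorithm terminates, then debug the edge case."))
</syntaxhighlight>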
+ | |||
+ | ==Retrieval Augmented Generation (RAG)== | ||
+ | ===Reviews=== | ||
+ | * 2024-08: [https://arxiv.org/abs/2408.08921 Graph Retrieval-Augmented Generation: A Survey] | ||
+ | * 2024-09: [https://arxiv.org/abs/2409.14924 Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely] | ||
+ | * List of [https://github.com/NirDiamant/RAG_Techniques RAG techniques] | ||
+ | |||
+ | ===Analysis of RAG overall=== | ||
+ | * 2024-10: [https://arxiv.org/abs/2410.13070 Is Semantic Chunking Worth the Computational Cost?] | ||
+ | |||
+ | ===Approaches=== | ||
+ | * RAGFlow ([https://github.com/infiniflow/ragflow code]) | ||
+ | * GraphRAG ([https://arxiv.org/abs/2404.16130 preprint], [https://github.com/microsoft/graphrag code]) | ||
+ | ** [https://github.com/Azure-Samples/graphrag-accelerator GraphRAG Accelerator] for easy deployment on Azure | ||
+ | * AutoMetaRAG ([https://github.com/darshil3011/AutoMetaRAG/tree/main code]) | ||
+ | * [https://verba.weaviate.io/ Verba]: RAG for [https://weaviate.io/ Weaviate] vector database | ||
+ | ** [https://github.com/weaviate/verba code] | ||
+ | ** [https://www.youtube.com/watch?v=UoowC-hsaf0 video] | ||
+ | * Google Astute RAG | ||
+ | ** Preprint: [https://arxiv.org/abs/2410.07176 Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models] | ||
+ | * 2024-10: [https://arxiv.org/abs/2410.08815 StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization]: Reformats retrieved data into task-appropriate structures (table, graph, tree). | ||
+ | * 2024-10: [https://arxiv.org/abs/2410.13765 Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval] | ||
+ | * 2024-11: [https://www.arxiv.org/abs/2411.13773 FastRAG: Retrieval Augmented Generation for Semi-structured Data] | ||
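Whatever the specific approach, the common core of RAG is: embed a corpus of chunks, retrieve the chunks nearest the query, and prepend them to the prompt. A minimal embedding-retrieval sketch using <code>sentence-transformers</code> (the model name is a common small default, not a recommendation from this page):
<syntaxhighlight lang="python">
# Minimal RAG retrieval core: embed chunks, retrieve nearest to the query.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder

chunks = [
    "GraphRAG builds a knowledge graph over the corpus before retrieval.",
    "Semantic chunking splits documents at topic boundaries.",
    "StructRAG reformats retrieved data into tables, graphs, or trees.",
]
chunk_emb = model.encode(chunks, normalize_embeddings=True)

query = "How does GraphRAG differ from plain RAG?"
q_emb = model.encode([query], normalize_embeddings=True)

scores = chunk_emb @ q_emb.T             # cosine similarity (vectors are unit-normalized)
top = np.argsort(-scores.ravel())[:2]    # indices of the 2 best-matching chunks
context = "\n".join(chunks[i] for i in top)
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to the LLM
</syntaxhighlight>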
+ | |||
+ | ===Open-source Implementations=== | ||
+ | * [https://github.com/Cinnamon/kotaemon kotaemon]: An open-source clean & customizable RAG UI for chatting with your documents. | ||
+ | * [https://www.llamaindex.ai/ LlamaIndex] ([https://github.com/run-llama/llama_index code], [https://docs.llamaindex.ai/en/stable/ docs], [https://github.com/run-llama/voice-chat-pdf voice chat code]) | ||
+ | * Nvidia [https://www.nvidia.com/en-us/ai-on-rtx/chatrtx/ ChatRTX] with [https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/ RAG] | ||
+ | * Anthropic [https://github.com/anthropics/anthropic-quickstarts/tree/main/customer-support-agent Customer Support Agent example] | ||
+ | * [https://www.langchain.com/ LangChain] and [https://www.langchain.com/langgraph LangGraph] ([https://www.metadocs.co/2024/08/20/simple-agentic-rag-for-multi-vector-stores-with-langchain-and-langgraph/ tutorial]) | ||
+ | ** [https://github.com/KruxAI/ragbuilder RAGBuilder]: Automatically tunes RAG hyperparams | ||
+ | * [https://github.com/stanford-oval/WikiChat WikiChat] | ||
+ | ** [https://arxiv.org/abs/2305.14292 WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia] | ||
+ | * [https://github.com/bhavnicksm/chonkie Chonkie]: No-nonsense RAG chunking library (open-source, lightweight, fast) | ||
+ | * [https://github.com/pingcap/autoflow autoflow]: open source GraphRAG (Knowledge Graph), including conversational search page | ||
+ | |||
+ | ===Web-based Tools=== | ||
+ | * [https://typeset.io/ SciSpace] Chat with PDF (also available as a GPT). | ||
+ | |||
+ | ===PDF Conversion=== | ||
+ | * [https://github.com/kermitt2/grobid Grobid] | ||
+ | * [https://chunkr.ai/ Chunkr] ([https://github.com/lumina-ai-inc/chunkr code]) | ||
+ | * [https://github.com/DS4SD/docling Docling]: converts multiple formats (PDF, DOCX, PPTX, Images, HTML) into Markdown and JSON | ||
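A minimal Docling conversion sketch, following its documented quick-start (assuming <code>pip install docling</code>; the input path is a placeholder):
<syntaxhighlight lang="python">
# Minimal sketch: convert a PDF to Markdown with Docling.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("paper.pdf")        # placeholder path; URLs also work
print(result.document.export_to_markdown())    # structured JSON-like export also available
</syntaxhighlight>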
+ | |||
+ | ==Automatic Optimization== | ||
+ | ===Analogous to Gradient Descent=== | ||
+ | * [https://arxiv.org/abs/2406.07496 TextGrad: Automatic "Differentiation" via Text] | ||
+ | * [https://arxiv.org/abs/2406.18532 Symbolic Learning Enables Self-Evolving Agents] | ||
+ | |||
+ | ==LLM for scoring/ranking== | ||
+ | * [https://arxiv.org/abs/2302.04166 GPTScore: Evaluate as You Desire] | ||
+ | * [https://arxiv.org/abs/2306.17563 Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting] | ||
+ | * [https://doi.org/10.1039/D3DD00112A Domain-specific chatbots for science using embeddings] | ||
+ | * [https://arxiv.org/abs/2407.02977 Large Language Models as Evaluators for Scientific Synthesis] | ||
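The pairwise ranking idea above reduces to: ask the LLM which of two candidates is better, repeat over pairs, and rank by wins. A structural sketch of that loop, with <code>llm_judge</code> as a hypothetical stand-in for an actual LLM call:
<syntaxhighlight lang="python">
# Illustrative pairwise ranking: rank candidates by LLM-judged head-to-head wins.
from itertools import combinations

def llm_judge(query: str, a: str, b: str) -> str:
    """Hypothetical stand-in: should prompt an LLM with both candidates and
    return 'A' or 'B'. Here, a trivial stub that prefers the longer text."""
    return "A" if len(a) >= len(b) else "B"

def pairwise_rank(query: str, candidates: list[str]) -> list[str]:
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        winner = a if llm_judge(query, a, b) == "A" else b
        wins[winner] += 1
    return sorted(candidates, key=wins.get, reverse=True)

print(pairwise_rank("best summary?", ["short", "a medium answer", "a long detailed answer"]))
</syntaxhighlight>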
+ | |||
+ | =LLM Agents= | ||
+ | * See [[AI Agents]]. | ||
+ | |||
+ | =Interfaces= | ||
+ | ==Chatbot Frontend== | ||
+ | ===Web=== | ||
+ | * [https://docs.streamlit.io/develop/tutorials/llms/build-conversational-apps Steamlit] | ||
+ | * [https://docs.cohere.com/v2/docs/cohere-toolkit Cohere Toolkit] ([https://github.com/cohere-ai/cohere-toolkit code]) | ||
+ | * [https://www.librechat.ai/ LibreChat] | ||
+ | * [https://github.com/open-webui/open-webui open-webui] | ||
+ | * [https://github.com/xjdr-alt/entropix/tree/main/ui entropix frontend UI] | ||
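Streamlit's chat primitives make a minimal chatbot frontend very short. A sketch, run with <code>streamlit run app.py</code> (the placeholder reply stands in for a real model call):
<syntaxhighlight lang="python">
# Minimal Streamlit chat UI sketch.
import streamlit as st

if "history" not in st.session_state:
    st.session_state.history = []          # list of (role, text) tuples

for role, text in st.session_state.history:  # replay prior turns
    with st.chat_message(role):
        st.write(text)

if prompt := st.chat_input("Say something"):
    st.session_state.history.append(("user", prompt))
    with st.chat_message("user"):
        st.write(prompt)
    reply = f"(placeholder reply to: {prompt})"  # swap in a real LLM call here
    st.session_state.history.append(("assistant", reply))
    with st.chat_message("assistant"):
        st.write(reply)
</syntaxhighlight>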
===Desktop GUI===
* [https://anythingllm.com/ AnythingLLM] ([https://docs.anythingllm.com/ docs], [https://github.com/Mintplex-Labs/anything-llm code]): includes chat-with-docs, selection of LLM and vector db, etc.

==Alternative Text Chatbot UI==
* [https://generative.ink/posts/loom-interface-to-the-multiverse/ Loom] provides a tree-like interface for exploring the many branched writings an LLM can generate from a given prompt.
* [https://www.lesswrong.com/posts/JHsfMWtwxBGGTmb8A/pantheon-interface The Pantheon Interface] is a new idea for how to interact with LLMs ([https://pantheon.chat/ live instance], [https://github.com/nickkeesG/Pantheon code]). In a traditional interaction, you prompt the bot and it replies in a turn-by-turn manner. Pantheon instead invites you to type out your thoughts, and various agents will asynchronously add comments or questions to spur along your brainstorming.
+ | |||
+ | ==Conversational Audio Chatbot== | ||
+ | * Swift is a fast AI voice assistant ([https://github.com/ai-ng/swift code], [https://swift-ai.vercel.app/ live demo]) uses: | ||
+ | ** [https://groq.com/ Groq] cloud running [https://github.com/openai/whisper OpenAI Whisper] for fast speech transcription. | ||
+ | ** [https://cartesia.ai/ Cartesia] [https://cartesia.ai/sonic Sonic] for fast speech synthesis | ||
+ | ** [https://www.vad.ricky0123.com/ VAD] to detect when user is talking | ||
+ | ** [https://vercel.com/ Vercel] for app deployment | ||
+ | * [https://github.com/rtvi-ai RTVI-AI] ([https://github.com/rtvi-ai/rtvi-web-demo code], [https://demo-gpu.rtvi.ai/ demo]), uses: | ||
+ | ** [https://groq.com/ Groq] | ||
+ | ** [https://llama.meta.com/ Llama 3.1] | ||
+ | ** [https://www.daily.co/ai/ Daily] | ||
+ | ** [https://github.com/rtvi-ai RTVI ] | ||
+ | * [https://github.com/mezbaul-h/june June]: Local Voice Chatbot | ||
+ | ** [https://ollama.com/ Ollama] | ||
+ | ** [https://huggingface.co/docs/transformers/en/tasks/asr Hugging Face Transformers] (for speech recognition) | ||
+ | ** [https://github.com/coqui-ai/TTS Coqui TTS Toolkit] | ||
+ | * [https://kyutai.org/ kyutai] Moshi chatbot ([https://us.moshi.chat/ demo]) | ||
+ | * [https://arxiv.org/abs/2408.16725 Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming] ([https://huggingface.co/gpt-omni/mini-omni model], [https://github.com/gpt-omni/mini-omni code], [https://huggingface.co/spaces/gradio/omni-mini demo]) | ||
+ | * 2024-09Sep-11: [https://huggingface.co/ICTNLP/Llama-3.1-8B-Omni Llama-3.1-8B-Omni] ([https://github.com/ictnlp/LLaMA-Omni code]), enabling end-to-end speech. | ||
+ | * [https://x.com/AIatMeta/status/1847383580269510670 2024-10Oct-18]: Meta [https://speechbot.github.io/spiritlm/ Spirit LM]: open source multimodal language model that freely mixes text and speech | ||
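Most of the cascaded assistants above share the same loop: voice-activity detection to find an utterance, speech-to-text, an LLM reply, then text-to-speech. A structural sketch with hypothetical component functions (each stub would be backed by, e.g., Whisper for transcription, a chat LLM, and a TTS engine):
<syntaxhighlight lang="python">
# Structural sketch of a cascaded voice-assistant loop; all components are stubs.

def record_utterance() -> bytes:
    """Hypothetical: block until VAD detects end of speech, return audio."""
    return b""  # stub

def transcribe(audio: bytes) -> str:
    """Hypothetical: e.g. Whisper (possibly hosted on Groq for low latency)."""
    return "hello there"  # stub

def generate_reply(text: str, history: list[dict]) -> str:
    """Hypothetical: chat LLM call with conversation history."""
    return f"(reply to: {text})"  # stub

def speak(text: str) -> None:
    """Hypothetical: e.g. Cartesia Sonic or a local TTS engine."""
    print(f"[speaking] {text}")  # stub

history: list[dict] = []
for _ in range(1):  # in a real assistant this would be `while True`
    user_text = transcribe(record_utterance())
    history.append({"role": "user", "content": user_text})
    reply = generate_reply(user_text, history)
    history.append({"role": "assistant", "content": reply})
    speak(reply)
</syntaxhighlight>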
+ | |||
+ | ===Related Research=== | ||
+ | * [https://arxiv.org/abs/2408.02622 Language Model Can Listen While Speaking] | ||
+ | |||
+ | ===Commercial Systems=== | ||
+ | * [https://heypi.com/talk HeyPi Talk] | ||
+ | * [https://vapi.ai/ Vapi] | ||
+ | * [https://callannie.ai/ Call Annie] | ||
+ | * [https://www.bland.ai Bland AI] | ||
+ | * [https://deepgram.com/ DeepGram Voice AI] | ||
+ | |||
=Speech Recognition (ASR) and Transcription=
==Lists==
* [https://huggingface.co/spaces/hf-audio/open_asr_leaderboard Open ASR Leaderboard]

==Open Source==
* [https://github.com/mozilla/DeepSpeech DeepSpeech]
* [https://github.com/speechbrain/speechbrain speechbrain]
* [https://github.com/kaldi-asr/kaldi/blob/master/README.md Kaldi]
* wav2vec 2.0
** [https://arxiv.org/abs/2104.01027 Paper: Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training]
* [https://github.com/openai/whisper Whisper] (see the transcription sketch after this list)
** [https://huggingface.co/openai/whisper-medium.en Whisper medium.en]
** [https://github.com/m-bain/whisperX WhisperX] (includes word-level timestamps and speaker diarization)
** [https://huggingface.co/mlx-community/distil-whisper-large-v3 Distil Large v3 with MLX]
** 2024-10: [https://huggingface.co/ylacombe/whisper-large-v3-turbo whisper-large-v3-turbo] distillation ([https://huggingface.co/spaces/hf-audio/whisper-large-v3-turbo demo], [https://github.com/openai/whisper/actions/runs/11111568226 code])
* [https://huggingface.co/nvidia/canary-1b Nvidia Canary 1B]
* [https://developer.nvidia.com/blog/accelerating-leaderboard-topping-asr-models-10x-with-nvidia-nemo/ 2024-09]: Nvidia [https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html NeMo]
* 2024-10: [https://www.rev.ai/ Rev AI] [https://huggingface.co/Revai models] for [https://huggingface.co/Revai/reverb-asr transcription] and [https://huggingface.co/Revai/reverb-diarization-v2 diarization]
* 2024-10: [https://github.com/usefulsensors/moonshine Moonshine] (optimized for resource-constrained devices)
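A minimal transcription sketch with the reference <code>openai-whisper</code> package (assumes <code>pip install openai-whisper</code> and ffmpeg on the path; the audio filename is a placeholder):
<syntaxhighlight lang="python">
# Minimal sketch: transcribe an audio file with OpenAI Whisper.
import whisper

model = whisper.load_model("medium.en")    # English-only medium model
result = model.transcribe("meeting.mp3")   # placeholder filename
print(result["text"])                      # full transcript

for seg in result["segments"]:             # per-segment timestamps
    print(f'{seg["start"]:7.2f}s  {seg["text"]}')
</syntaxhighlight>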
+ | |||
+ | ==In Browser== | ||
+ | * [https://huggingface.co/spaces/Xenova/whisper-word-level-timestamps Whisper Timestamped]: Multilingual speech recognition with word-level timestamps, running locally in browser | ||
+ | |||
+ | ==Phrase Endpointing and Voice Activity Detection (VAD)== | ||
+ | I.e. how to determine when user is done talking, and bot should respond? | ||
* [https://x.com/kwindla/status/1831364419261268017 Notes]
** [https://demo.dailybots.ai/ Test settings]
** [https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/vad/vad_analyzer.py code]
** [https://github.com/snakers4/silero-vad Silero VAD repo]
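A minimal Silero VAD sketch via <code>torch.hub</code>, following the repo's README (details hedged; check the repo for the current API and utility ordering):
<syntaxhighlight lang="python">
# Minimal sketch: find speech segments in a WAV file with Silero VAD.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, _, _) = utils

wav = read_audio("speech.wav", sampling_rate=16000)  # placeholder filename
timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(timestamps)  # e.g. [{'start': 13000, 'end': 54000}, ...] in samples
</syntaxhighlight>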
+ | |||
+ | ==Audio Cleanup== | ||
+ | * [https://krisp.ai/ Krisp AI]: Noise cancellation, meeting summary, etc. | ||
+ | |||
+ | =Text-to-speech (TTS)= | ||
+ | ==Open Source== | ||
+ | * [https://github.com/huggingface/parler-tts Parler TTS] ([https://huggingface.co/spaces/parler-tts/parler_tts demo]) | ||
+ | * [https://github.com/DigitalPhonetics/IMS-Toucan Toucan] ([https://huggingface.co/spaces/Flux9665/MassivelyMultilingualTTS demo]) | ||
+ | * [https://tts.themetavoice.xyz/ MetaVoice] ([https://github.com/metavoiceio/metavoice-src github]) | ||
+ | * [https://github.com/2noise/ChatTTS ChatTTS] | ||
+ | * [https://www.camb.ai/ Camb.ai] [https://github.com/Camb-ai/MARS5-TTS MARS5-TTS] | ||
+ | * [https://github.com/coqui-ai/TTS Coqui TTS Toolkit] | ||
+ | * Fish Speech 1.4: multi-lingual, can clone voices ([https://x.com/reach_vb/status/1833801060659372071 video], [https://huggingface.co/fishaudio/fish-speech-1.4 weights], [https://huggingface.co/spaces/fishaudio/fish-speech-1 demo]) | ||
+ | * [https://huggingface.co/SWivid/F5-TTS F5-TTS] ([https://huggingface.co/spaces/mrfakename/E2-F5-TTS demo]): cloning, emotion, etc. | ||
+ | * [https://huggingface.co/amphion/MaskGCT MaskGCT] ([https://huggingface.co/spaces/amphion/maskgct demo]) | ||
+ | * [https://arxiv.org/abs/2312.09911 Amphion: An Open-Source Audio, Music and Speech Generation Toolkit] ([https://github.com/open-mmlab/Amphion code]) | ||
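A minimal synthesis sketch with the Coqui TTS toolkit listed above (assuming <code>pip install TTS</code>; the model name is one of its stock English models):
<syntaxhighlight lang="python">
# Minimal sketch: synthesize speech to a WAV file with Coqui TTS.
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(
    text="Text-to-speech from an open-source toolkit.",
    file_path="out.wav",
)
</syntaxhighlight>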
+ | |||
+ | ==Cloud== | ||
+ | * [https://elevenlabs.io/ Elevenlabs] ($50/million characters) | ||
+ | ** [https://elevenlabs.io/voice-isolator voice isolator] | ||
+ | * [https://cartesia.ai/ Cartesia] [https://cartesia.ai/sonic Sonic] | ||
+ | * [https://neets.ai/ Neets AI] ($1/million characters) | ||
+ | |||
+ | =Vision= | ||
+ | ==Visual Models== | ||
+ | * [https://openai.com/index/clip/ CLIP] | ||
+ | * [https://arxiv.org/abs/2303.15343 Siglip] | ||
+ | * [https://github.com/roboflow/supervision Supervision] | ||
+ | * [https://arxiv.org/abs/2311.06242 Florence-2] | ||
+ | * Nvidia [https://github.com/NVlabs/MambaVision MambaVision] | ||
+ | * Meta [https://about.meta.com/realitylabs/codecavatars/sapiens Sapiens: Foundation for Human Vision Models] (video input, can infer segmentation, pose, depth-map, and surface normals) | ||
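A minimal CLIP zero-shot classification sketch using the Hugging Face <code>transformers</code> port (the image path is a placeholder):
<syntaxhighlight lang="python">
# Minimal sketch: zero-shot image classification with CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder path
labels = ["a photo of a cat", "a photo of a dog", "a diagram"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

# Image-text similarity logits, softmaxed into label probabilities:
probs = model(**inputs).logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
</syntaxhighlight>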
+ | |||
+ | ==Multi-modal Models (language-vision/video)== | ||
+ | * [https://arxiv.org/abs/2407.07895 LLaVA-NeXT-Interleave] ([https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19 models], [https://huggingface.co/spaces/merve/llava-interleave demo]) | ||
+ | * [https://huggingface.co/papers/2407.15841 SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models] | ||
+ | * Nvidia [https://huggingface.co/collections/merve/nveagle-66d0705108582d73bb235c26 NVEagle] 13B, 7B ([https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat demo], [https://arxiv.org/abs/2408.15998 preprint]) | ||
+ | * 2024-08Aug-29: [https://qwenlm.github.io/blog/qwen2-vl/ Qwen2-VL] 7B, 2B ([https://github.com/QwenLM/Qwen2-VL code], [https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d models]): Can process videos up to 20 minutes in length | ||
+ | * 2024-09Sep-11: Mistral [https://huggingface.co/mistral-community/pixtral-12b-240910 Pixtral 12B] | ||
+ | * 2024-09Sep-17: [https://nvlm-project.github.io/ NVLM 1.0] | ||
+ | |||
+ | ==Optical character recognition (OCR)== | ||
+ | * [https://arxiv.org/abs/2409.01704 General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model] ([https://huggingface.co/stepfun-ai/GOT-OCR2_0 project], [https://github.com/Ucas-HaoranWei/GOT-OCR2.0/ code], [https://huggingface.co/spaces/stepfun-ai/GOT_official_online_demo demo]) | ||
+ | |||
+ | =Embedding= | ||
+ | * [https://www.marktechpost.com/2024/07/28/a-comparison-of-top-embedding-libraries-for-generative-ai/ A Comparison of Top Embedding Libraries for Generative AI] | ||
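Regardless of library, the embedding workflow is the same: encode texts to vectors, then compare with cosine similarity. A sketch with <code>sentence-transformers</code> (the model name is a common small default):
<syntaxhighlight lang="python">
# Minimal sketch: text embeddings and cosine similarity.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Stock markets fell sharply today.",
]
emb = model.encode(sentences, normalize_embeddings=True)

sim = emb @ emb.T    # cosine similarities (vectors are unit-normalized)
print(sim.round(2))  # the first two sentences should score much higher together
</syntaxhighlight>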
+ | |||
+ | =Time Series= | ||
+ | * [https://github.com/TDAmeritrade/stumpy Stumpy]: Python library, uses near-match subsequences for similarity and forecasting | ||
+ | * [https://arxiv.org/abs/1912.09363 Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting] | ||
+ | * [https://arxiv.org/abs/2209.00905 From latent dynamics to meaningful representations] | ||
+ | * [https://arxiv.org/abs/2310.01728 Time-LLM: Time Series Forecasting by Reprogramming Large Language Models] | ||
+ | * [https://arxiv.org/abs/2310.10688 A decoder-only foundation model for time-series forecasting] | ||
+ | * [https://arxiv.org/abs/2310.03589 TimeGPT-1] | ||
+ | * [https://arxiv.org/abs/2402.02592 Unified Training of Universal Time Series Forecasting Transformers] | ||
+ | * [https://arxiv.org/abs/2407.10240 xLSTMTime : Long-term Time Series Forecasting With xLSTM] | ||
+ | * Salesforce: [https://arxiv.org/abs/2410.10469 Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts] ([https://github.com/SalesforceAIResearch/uni2ts/tree/main/project/moirai-moe-1 code], [https://huggingface.co/collections/Salesforce/moirai-r-models-65c8d3a94c51428c300e0742 weights], [https://www.salesforce.com/blog/time-series-morai-moe/ blog]) | ||
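Stumpy (first item above) centers on the matrix profile: the distance from each subsequence to its nearest neighbor elsewhere in the series, which exposes motifs (repeated patterns) and discords (anomalies). A minimal sketch on synthetic data:
<syntaxhighlight lang="python">
# Minimal sketch: matrix profile of a time series with Stumpy.
import numpy as np
import stumpy

ts = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.1 * np.random.randn(2000)
m = 50                     # subsequence window length
mp = stumpy.stump(ts, m)   # matrix profile: column 0 = distance, column 1 = index

motif_idx = int(np.argmin(mp[:, 0]))    # best-matching (motif) subsequence
discord_idx = int(np.argmax(mp[:, 0]))  # most anomalous (discord) subsequence
print(f"motif at {motif_idx}, discord at {discord_idx}")
</syntaxhighlight>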
+ | |||
+ | ==Control== | ||
+ | * [https://arxiv.org/abs/2402.15989 PIDformer: Transformer Meets Control Theory] | ||
+ | |||
+ | ==Forecasting== | ||
+ | * Meta [https://facebookresearch.github.io/Kats/ Kats] ([https://github.com/facebookresearch/Kats code]): Forecasting (ARIMA, Prophet, Holt Winters, VAR), detection, feature extraction, simulation | ||
+ | |||
+ | =Data= | ||
+ | ==Vector Database== | ||
+ | ===Open Source=== | ||
+ | * [https://milvus.io/ milvus] (open source with paid cloud option) | ||
+ | * [https://qdrant.tech/ Qdrant] (open source with paid cloud option) | ||
+ | * [https://vespa.ai/ Vespa] (open source with paid cloud option) | ||
+ | * [https://www.trychroma.com/ chroma] | ||
+ | * [https://www.llamaindex.ai/ LlamaIndex] | ||
+ | * [https://github.com/asg017/sqlite-vec/tree/main sqlite-vec] | ||
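A minimal Chroma sketch (its default build embeds documents automatically with a built-in model; assuming <code>pip install chromadb</code>):
<syntaxhighlight lang="python">
# Minimal sketch: store and query documents in a Chroma vector DB.
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to persist
collection = client.create_collection("docs")

collection.add(
    ids=["1", "2", "3"],
    documents=[
        "Milvus is a vector database.",
        "Qdrant supports payload filtering.",
        "SQLite can store vectors via sqlite-vec.",
    ],
)

hits = collection.query(query_texts=["lightweight embedded vector store"], n_results=2)
print(hits["documents"])
</syntaxhighlight>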
+ | |||
+ | ===Commercial cloud=== | ||
+ | * [https://archive.pinecone.io/lp/vector-database/ pinecone] | ||
+ | * [https://weaviate.io/products weaviate] | ||
+ | |||
+ | ===MySQL=== | ||
+ | * MySQL does not traditionally have support, but: | ||
+ | ** [https://planetscale.com/blog/planetscale-is-bringing-vector-search-and-storage-to-mysql PlanetScale] is working on it | ||
+ | ** [https://github.com/stephenc222/mysql_vss mysql_vss] ([https://medium.com/@stephenc211/enhancing-mysql-searches-with-vector-embeddings-11f183932851 discussion]) | ||
+ | ** [https://www.pingcap.com/tidb-serverless/ tibd] ([https://www.pingcap.com/article/mysql-vector-search-powering-the-future-of-ai-applications/ discussion]) | ||
+ | |||
+ | ==Database with Search== | ||
+ | * [https://typesense.org/ Typesense] ([https://github.com/typesense/typesense code]) | ||
+ | |||
+ | ==Web Scraping== | ||
+ | * [https://github.com/mendableai/firecrawl Firecrawl] | ||
+ | |||
=Hardware=
==AI Acceleration Hardware==
* Nvidia GPUs
* [https://en.wikipedia.org/wiki/Tensor_Processing_Unit Google TPU]
* [https://en.wikipedia.org/wiki/Tesla_Dojo Tesla Dojo]
* [https://www.cerebras.net/ Cerebras]
* [https://www.graphcore.ai/ Graphcore]
* [https://www.untether.ai/ Untether AI]
* [https://sambanova.ai/ SambaNova Systems]
* [https://groq.com/ Groq]
* [https://deepsilicon.com/ Deep Silicon]: combined hardware/software solution for accelerated AI ([https://x.com/sdianahu/status/1833186687369023550 e.g.] ternary math)
* [https://www.etched.com/ Etched]: Transformer ASICs

==Cloud Training Compute==
* [https://nebius.ai/ Nebius AI]
* [https://glaive.ai/ Glaive AI]

=See Also=
* [[AI agents]]
* [[AI understanding]]
* [[Robots]]