Difference between revisions of "AI tools"

From GISAXS
=LLM=

==Open-weights LLM==

===Reasoning===
 
* 2025-02Feb-10: [https://huggingface.co/tomg-group-umd/huginn-0125 Huginn-0125]: [https://arxiv.org/abs/2502.05171 Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach] ([https://github.com/seal-rg/recurrent-pretraining code], [https://huggingface.co/tomg-group-umd/huginn-0125 model])

* [https://x.com/NousResearch/status/1890148000204485088 2025-02Feb-14]: [https://huggingface.co/NousResearch/DeepHermes-3-Llama-3-8B-Preview DeepHermes 3 - Llama-3.1 8B]

* [https://x.com/Alibaba_Qwen/status/1894130603513319842 2025-02Feb-24]: Qwen [https://qwenlm.github.io/blog/qwq-max-preview/ QwQ-Max-Preview] ([https://chat.qwen.ai/ online demo])

* [https://x.com/Alibaba_Qwen/status/1897361654763151544 2025-03Mar-05]: Qwen [https://qwenlm.github.io/blog/qwq-32b/ QwQ-32B] ([https://huggingface.co/spaces/Qwen/QwQ-32B-Demo demo])

* [https://x.com/BlinkDL_AI/status/1898579674575552558 2025-03Mar-05]: [https://github.com/BlinkDL/RWKV-LM RWKV7-G1] "GooseOne" 0.1B ([https://huggingface.co/BlinkDL/rwkv7-g1 weights], [https://arxiv.org/abs/2305.13048 preprint])


===Agentic===

* 2025-02Feb-18: Microsoft [https://huggingface.co/microsoft/Magma-8B Magma-8B] ([https://www.arxiv.org/abs/2502.13130 preprint])

* 2025-02Feb-26: [https://convergence.ai/ Convergence] [https://github.com/convergence-ai/proxy-lite Proxy Lite]

===Multimodal===

====Language/Vision====

* [https://arxiv.org/abs/2407.07895 LLaVA-NeXT-Interleave] ([https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19 models], [https://huggingface.co/spaces/merve/llava-interleave demo])

* [https://huggingface.co/papers/2407.15841 SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models]

* Nvidia [https://huggingface.co/collections/merve/nveagle-66d0705108582d73bb235c26 NVEagle] 13B, 7B ([https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat demo], [https://arxiv.org/abs/2408.15998 preprint])

* 2024-08Aug-29: [https://qwenlm.github.io/blog/qwen2-vl/ Qwen2-VL] 7B, 2B ([https://github.com/QwenLM/Qwen2-VL code], [https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d models]): can process videos up to 20 minutes in length (see the usage sketch at the end of this list)

* 2024-09Sep-11: Mistral [https://huggingface.co/mistral-community/pixtral-12b-240910 Pixtral 12B]

* 2024-09Sep-17: [https://nvlm-project.github.io/ NVLM 1.0]

* 2024-12Dec-06: Nvidia [https://arxiv.org/abs/2412.04468 NVILA: Efficient Frontier Visual Language Models]

* [https://x.com/Alibaba_Qwen/status/1883954247743725963 2025-01Jan-28]: [https://huggingface.co/collections/Qwen/qwen25-vl-6795ffac22b334a837c0f9a5 Qwen2.5-VL]

* 2025-02Feb-18: Microsoft [https://huggingface.co/microsoft/Magma-8B Magma-8B] ([https://www.arxiv.org/abs/2502.13130 preprint])

* [https://x.com/CohereForAI/status/1896923657470886234 2025-03Mar-05]: Cohere [https://cohere.com/research/aya Aya] 8B, 32B

* 2025-03Mar-12: Google [https://developers.googleblog.com/en/introducing-gemma3/ Gemma 3] 1B, 4B, 12B, 27B ([https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf technical report])
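
As a usage reference for the Qwen2-VL entry above, here is a minimal single-image inference sketch via Hugging Face transformers. The model ID matches the official release; the message format and processor calls follow the Hugging Face model card, but class names and defaults may differ across transformers versions, so treat this as an assumption-laden sketch rather than a definitive recipe.

<syntaxhighlight lang="python">
# Sketch: query Qwen2-VL with one image via Hugging Face transformers.
# Assumes a recent transformers release (Qwen2VLForConditionalGeneration) and a local image file.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # any local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in one sentence."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
</syntaxhighlight>
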
====Language/Vision/Speech====

* 2025-02Feb-27: Microsoft [https://huggingface.co/microsoft/Phi-4-multimodal-instruct Phi-4-multimodal-instruct] (language, vision, speech)


====Language/Audio====

* 2025-03Mar-11: [https://github.com/soham97/mellow Mellow]: a small audio language model for reasoning, 167M ([https://arxiv.org/abs/2503.08540 paper])

* 2025-03Mar-12: [https://research.nvidia.com/labs/adlr/AF2/ Audio Flamingo 2] 0.5B, 1.5B, 3B ([https://arxiv.org/abs/2503.03983 paper], [https://github.com/NVIDIA/audio-flamingo code])
  
 
==Cloud LLM==
 
==Retrieval Augmented Generation (RAG)==

===Reviews===

* List of [https://github.com/NirDiamant/RAG_Techniques RAG techniques] (a minimal retrieve-then-generate sketch follows this list)

* [https://github.com/athina-ai/rag-cookbooks Advanced RAG Cookbooks👨🏻‍💻]

* [https://github.com/DEEP-PolyU/Awesome-GraphRAG Awesome-GraphRAG (GraphRAG Survey)]
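
The surveys above cover many RAG variants; as a baseline reference point, here is a minimal retrieve-then-generate sketch. It assumes sentence-transformers is installed; <code>call_llm</code> is a hypothetical placeholder for whatever LLM completion API is in use, not a real library function.

<syntaxhighlight lang="python">
# Minimal RAG loop: embed a small corpus, retrieve top-k by cosine similarity,
# and stuff the retrieved passages into the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

documents = [
    "GISAXS probes nanoscale surface structure with grazing-incidence X-ray scattering.",
    "Retrieval-augmented generation injects retrieved passages into the LLM prompt.",
    "Vector databases store embeddings for fast nearest-neighbor search.",
]
doc_vecs = embedder.encode(documents, normalize_embeddings=True)

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in your actual LLM client call.
    raise NotImplementedError

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                       # cosine similarity (vectors are normalized)
    return [documents[i] for i in np.argsort(-scores)[:k]]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
</syntaxhighlight>
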
  
 
===Measuring RAG performance===
 
===Approaches===
 
* 2025-01: [https://arxiv.org/abs/2501.05874 VideoRAG: Retrieval-Augmented Generation over Video Corpus]
 
* 2025-02: [https://arxiv.org/abs/2502.01142 DeepRAG: Thinking to Retrieval Step by Step for Large Language Models]

* 2025-02: [https://weaviate.io/developers/weaviate/tutorials/multi-vector-embeddings Multi-vector embeddings]
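
To make the multi-vector idea concrete, below is a small sketch of late-interaction (ColBERT-style) scoring: each query token keeps its own embedding, and the document score sums, over query tokens, the best-matching document-token similarity (MaxSim). The random vectors are stand-ins for real per-token encoder outputs.

<syntaxhighlight lang="python">
# MaxSim scoring over per-token ("multi-vector") embeddings, ColBERT-style.
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """query_vecs: (n_query_tokens, d); doc_vecs: (n_doc_tokens, d); both L2-normalized."""
    sim = query_vecs @ doc_vecs.T            # (n_query_tokens, n_doc_tokens) cosine similarities
    return float(sim.max(axis=1).sum())      # best doc token per query token, summed

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
query = l2_normalize(rng.normal(size=(5, 128)))                    # 5 query-token embeddings
docs = [l2_normalize(rng.normal(size=(n, 128))) for n in (40, 80)]  # two "documents"
print([maxsim_score(query, d) for d in docs])                      # higher = better match
</syntaxhighlight>
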
  
 
===Open-source Implementations===
 
===Commercial Cloud Offerings===
 
* [https://www.voyageai.com/ Voyage AI]
 
* [https://abacus.ai/ Abacus AI]
 
==Automatic Optimization==
 
===Analogous to Gradient Descent===
 
* [https://arxiv.org/abs/2406.07496 TextGrad: Automatic "Differentiation" via Text]
 
* [https://arxiv.org/abs/2406.18532 Symbolic Learning Enables Self-Evolving Agents]
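
The two papers above treat natural-language feedback as a "gradient". The sketch below is a conceptual illustration of that loop, not the TextGrad API: one LLM call critiques the current prompt (the "gradient"), a second call applies the critique (the "update"). <code>call_llm</code> is a hypothetical placeholder.

<syntaxhighlight lang="python">
# Conceptual "textual gradient descent": critique-then-rewrite loop over a prompt.
def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: plug in any LLM completion client here.
    raise NotImplementedError

def evaluate(candidate_prompt: str) -> str:
    # Forward pass: run the task with the candidate prompt and report failures.
    return call_llm(f"Run this prompt on the test cases and report failures:\n{candidate_prompt}")

def optimize(prompt: str, steps: int = 3) -> str:
    for _ in range(steps):
        feedback = evaluate(prompt)
        critique = call_llm(                      # the textual "gradient"
            f"Prompt:\n{prompt}\n\nFeedback:\n{feedback}\n\n"
            "Explain concretely how the prompt should change.")
        prompt = call_llm(                        # the "update" step
            f"Rewrite the prompt to apply this critique.\n\n"
            f"Critique:\n{critique}\n\nOriginal prompt:\n{prompt}")
    return prompt
</syntaxhighlight>
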
 
  
 
==LLM for scoring/ranking==
 
==Conversational Audio Chatbot==
 
* 2024-09Sep-11: [https://huggingface.co/ICTNLP/Llama-3.1-8B-Omni Llama-3.1-8B-Omni] ([https://github.com/ictnlp/LLaMA-Omni code]), enabling end-to-end speech.
 
* [https://x.com/AIatMeta/status/1847383580269510670 2024-10Oct-18]: Meta [https://speechbot.github.io/spiritlm/ Spirit LM]: open source multimodal language model that freely mixes text and speech

* 2025-02Feb-28: [https://www.sesame.com/ Sesame] ([https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice#demo demo])


===Turn Detection===

* 2025-03: [https://github.com/pipecat-ai/smart-turn Smart Turn]: open-source turn-detection model
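
Alongside learned approaches like Smart Turn, a common baseline is simple VAD-based endpointing: declare end-of-turn after a stretch of trailing non-speech. A minimal sketch using the webrtcvad package (assuming 16 kHz, 16-bit mono PCM audio):

<syntaxhighlight lang="python">
# VAD-based endpointing: end-of-turn after ~800 ms of trailing non-speech.
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30                                       # webrtcvad accepts 10/20/30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2    # 16-bit mono samples
END_SILENCE_MS = 800

def end_of_turn(pcm: bytes, aggressiveness: int = 2) -> bool:
    """Return True if the audio ends with at least END_SILENCE_MS of non-speech."""
    vad = webrtcvad.Vad(aggressiveness)
    trailing_silence_ms = 0
    for start in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            trailing_silence_ms = 0
        else:
            trailing_silence_ms += FRAME_MS
    return trailing_silence_ms >= END_SILENCE_MS
</syntaxhighlight>
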
  
 
===Related Research===
 
===Commercial Systems===
 
* [https://www.bland.ai Bland AI]
 
* [https://deepgram.com/ DeepGram Voice AI]

* [https://www.sesame.com/ Sesame] ([https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice#demo demo])
  
 
=Speech Recognition (ASR) and Transcription=
 
=Text-to-speech (TTS)=

==Cloud==
 
* [https://neets.ai/ Neets AI] ($1/million characters)
 
* Hailuo AI T2A-01-HD ([https://www.hailuo.ai/audio try], [https://intl.minimaxi.com/document/platform%20introduction?key=66701c8e1d57f38758d58198 API])

* [https://www.hume.ai/ Hume] (can set emotion, give acting directions, etc.)
  
 
=Text-to-audio=
 
 
=Vision=
 
* The [https://github.com/google/langfun Langfun] library can be used to convert images into structured output.

* See also: [[AI_tools#Multimodal| Multimodal open-weights models]]
  
 
==Visual Models==
 
 
* Nvidia [https://github.com/NVlabs/MambaVision MambaVision]
 
* Meta [https://about.meta.com/realitylabs/codecavatars/sapiens Sapiens: Foundation for Human Vision Models] (video input, can infer segmentation, pose, depth-map, and surface normals)
 
 
==Related==
 
 
=Embedding=
 
* [https://www.marktechpost.com/2024/07/28/a-comparison-of-top-embedding-libraries-for-generative-ai/ A Comparison of Top Embedding Libraries for Generative AI]

==Text Embedding==
 
* 2024-12: [https://huggingface.co/blog/modernbert ModernBERT]
 
* 2025-02: [https://huggingface.co/chandar-lab/NeoBERT NeoBERT]

* 2025-03: [https://developers.googleblog.com/en/gemini-embedding-text-model-now-available-gemini-api/ gemini-embedding-exp-03-07]
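
As a usage reference for text embeddings, a minimal sketch with the sentence-transformers library; the model name is a common small baseline used as a stand-in, not one of the encoders listed above.

<syntaxhighlight lang="python">
# Text embeddings + cosine similarity with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "Grazing-incidence small-angle X-ray scattering.",
    "GISAXS probes nanoscale surface structure.",
    "The weather is nice today.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)   # rows are unit-normalized
similarity = embeddings @ embeddings.T                            # cosine similarity matrix
print(similarity.round(2))
</syntaxhighlight>
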
 
==Image Embedding==
 
* 2025-01: [https://arxiv.org/abs/2501.18593 Diffusion Autoencoders are Scalable Image Tokenizers] ([https://yinboc.github.io/dito/ project], [https://github.com/yinboc/dito code])
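
For a baseline image-embedding recipe (CLIP via Hugging Face transformers, a common baseline rather than the diffusion-autoencoder tokenizer linked above), a minimal sketch:

<syntaxhighlight lang="python">
# Image embedding with CLIP (a common baseline, not the linked paper's method).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")                            # any local image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    features = model.get_image_features(**inputs)            # shape (1, 512)
embedding = features / features.norm(dim=-1, keepdim=True)   # unit-normalize for cosine search
print(embedding.shape)
</syntaxhighlight>
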
  
 
=Time Series=
 

Latest revision as of 14:04, 12 March 2025

LLM

Open-weights LLM

For Coding

Rankings: bigcode-models-leaderboard and CodeElo leaderboard

Reasoning

See also: Increasing AI Intelligence > Proactive Search > CoT reasoning model

Agentic

Multimodal

Language/Vision

Language/Vision/Speech

Language/Audio

Cloud LLM

Multi-modal: Audio

Triage

Retrieval Augmented Generation (RAG)

Reviews

Measuring RAG performance

Analysis of RAG overall

Approaches

Open-source Implementations

Web-based Tools

  • SciSpace Chat with PDF (also available as a GPT).

Commercial Cloud Offerings

LLM for scoring/ranking

LLM Agents

Interfaces

Chatbot Frontend

Web (code)

Web (product)

Desktop GUI

Alternative Text Chatbot UI

  • Loom provides a tree-like interface for exploring branched LLM-generated writing.
  • The Pantheon Interface is a new idea for how to interact with LLMs (live instance, code). In a traditional interaction, you prompt the bot and it replies in a turn-by-turn manner. Pantheon instead invites you to type out your thoughts, and various agents will asynchronously add comments or questions to spur along your brainstorming.

Conversational Audio Chatbot

Turn Detection

Related Research

Commercial Systems

Speech Recognition (ASR) and Transcription

Lists

Open Source

In Browser

  • Whisper Timestamped: Multilingual speech recognition with word-level timestamps, running locally in browser

Phrase Endpointing and Voice Activity Detection (VAD)

I.e., how to determine when the user has finished talking and the bot should respond.

Audio Cleanup

  • Krisp AI: Noise cancellation, meeting summary, etc.

Text-to-speech (TTS)

Open Source

Cloud

Text-to-audio

Vision

Visual Models

Related

Embedding

Text Embedding

Image Embedding

Time Series


Control

Forecasting

Data

Vector Database

Open Source

Commercial cloud

MySQL

Database with Search

See Also