=Interpretability=

==Mechanistic Interpretability==

* 2025-01: [https://arxiv.org/abs/2501.14926 Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition] ([https://www.alignmentforum.org/posts/EPefYWjuHNcNH4C7E/attribution-based-parameter-decomposition blog post])
* 2025-01: Review: [https://arxiv.org/abs/2501.16496 Open Problems in Mechanistic Interpretability]
* 2025-03: Anthropic: [https://www.anthropic.com/research/tracing-thoughts-language-model Tracing the thoughts of a large language model]
** [https://transformer-circuits.pub/2025/attribution-graphs/methods.html Circuit Tracing: Revealing Computational Graphs in Language Models]
** [https://transformer-circuits.pub/2025/attribution-graphs/biology.html On the Biology of a Large Language Model]
 
==Semanticity==

* 2024-10: [https://arxiv.org/abs/2410.14670 Decomposing The Dark Matter of Sparse Autoencoders] ([https://github.com/JoshEngels/SAE-Dark-Matter code]): shows that SAE errors are predictable; the basic SAE recipe is sketched after this list
* 2024-10: [https://arxiv.org/abs/2410.13928 Automatically Interpreting Millions of Features in Large Language Models]
* 2024-10: [https://arxiv.org/abs/2410.21331 Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness]
* 2024-12: [https://arxiv.org/abs/2412.04139 Monet: Mixture of Monosemantic Experts for Transformers]
* 2024-12: [https://www.lesswrong.com/posts/zbebxYCqsryPALh8C/matryoshka-sparse-autoencoders Matryoshka Sparse Autoencoders]
* 2025-01: [https://arxiv.org/abs/2501.19406 Low-Rank Adapting Models for Sparse Autoencoders]
* 2025-02: [https://arxiv.org/abs/2502.03714 Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment]
* 2025-02: [https://arxiv.org/abs/2502.06755 Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models]
* 2025-03: [https://arxiv.org/abs/2503.00177 Steering Large Language Model Activations in Sparse Spaces]
* 2025-03: [https://arxiv.org/abs/2503.01776 Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation]
* 2025-03: [https://arxiv.org/abs/2503.01824 From superposition to sparse codes: interpretable representations in neural networks]
* 2025-03: [https://arxiv.org/abs/2503.18878 I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders]
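
The entries above largely concern sparse autoencoders (SAEs) trained on model activations. For orientation, here is a minimal sketch of the generic recipe (illustrative dimensions and hyperparameters, not any specific paper's implementation): a one-hidden-layer autoencoder with an L1 sparsity penalty reconstructs activation vectors, and the reconstruction residual is the "error" term studied in the Dark Matter entry.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: activation vector -> sparse feature coefficients -> reconstruction."""
    def __init__(self, d_model: int = 768, d_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(f)           # reconstruction from the learned dictionary
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()   # reconstruction error (the "dark matter" residual)
    sparsity = f.abs().sum(dim=-1).mean()           # L1 penalty pushes most features to zero
    return recon + l1_coeff * sparsity

# Illustrative training loop on random stand-in "activations"; a real run would use
# residual-stream activations collected from a language model.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for step in range(100):
    x = torch.randn(256, 768)             # stand-in batch of activation vectors
    x_hat, f = sae(x)
    loss = sae_loss(x, x_hat, f)
    opt.zero_grad()
    loss.backward()
    opt.step()
</syntaxhighlight>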
  
 
===Counter-Results===

* 2025-01: [https://arxiv.org/abs/2501.17727 Sparse Autoencoders Can Interpret Randomly Initialized Transformers]
* 2025-02: [https://arxiv.org/abs/2502.04878 Sparse Autoencoders Do Not Find Canonical Units of Analysis]
* 2025-03: [https://www.alignmentforum.org/posts/4uXCAJNuPKtKBsi28/ Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research]
==Coding Models==

* '''Sparse Auto Encoders''': See Semanticity.
* [https://github.com/saprmarks/dictionary_learning dictionary_learning]
* [https://transformer-circuits.pub/2024/jan-update/index.html#predict-future Predicting Future Activations]
* 2024-06: [https://arxiv.org/abs/2406.11944 Transcoders Find Interpretable LLM Feature Circuits]; the transcoder setup is sketched after this list
* 2024-10: [https://transformer-circuits.pub/2024/crosscoders/index.html Sparse Crosscoders for Cross-Layer Features and Model Diffing]
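
For the transcoder entry above: a transcoder is trained like an SAE, except that it maps an MLP's input to that MLP's output through a sparse bottleneck (rather than reconstructing its own input), so the sparse features describe what the MLP computes. A minimal sketch under illustrative assumptions (random stand-in activations, a stand-in MLP, arbitrary dimensions):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Sparse approximation of an MLP: pre-MLP activations -> sparse features -> MLP output."""
    def __init__(self, d_model: int = 768, d_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x_pre):
        f = torch.relu(self.encoder(x_pre))   # sparse feature activations
        return self.decoder(f), f

# The training target is the output of the original (frozen) MLP on the same inputs,
# so the features explain what the MLP writes out, not just what it reads in.
mlp = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768)).eval()  # stand-in MLP
tc = Transcoder()
opt = torch.optim.Adam(tc.parameters(), lr=1e-4)
for step in range(100):
    x_pre = torch.randn(256, 768)             # stand-in pre-MLP activations
    with torch.no_grad():
        y = mlp(x_pre)                        # frozen MLP output to imitate
    y_hat, f = tc(x_pre)
    loss = (y - y_hat).pow(2).sum(-1).mean() + 1e-3 * f.abs().sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
</syntaxhighlight>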
  
 
==Reward Functions==

==Semantic Vectors==

* 2024-06: [https://arxiv.org/abs/2406.11717 Refusal in Language Models Is Mediated by a Single Direction]; the generic direction-extraction and ablation recipe is sketched after this list
* 2025-02: [https://martins1612.github.io/emergent_misalignment_betley.pdf Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs] ([https://x.com/OwainEvans_UK/status/1894436637054214509 demonstrates] [https://x.com/ESYudkowsky/status/1894453376215388644 entangling] of concepts into a single preference vector)
* 2025-03: [https://arxiv.org/abs/2503.03666 Analogical Reasoning Inside Large Language Models: Concept Vectors and the Limits of Abstraction]
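
The "single direction" results above typically rest on two ingredients: a difference-of-means direction computed from two sets of activations, and directional ablation that removes the component along that direction. A generic sketch with random stand-in activations (not the papers' code; the injected "concept" coordinate is purely illustrative):

<syntaxhighlight lang="python">
import torch

def difference_of_means_direction(acts_a, acts_b):
    """Candidate concept direction: mean activation on set A minus mean on set B, normalized."""
    v = acts_a.mean(dim=0) - acts_b.mean(dim=0)
    return v / v.norm()

def ablate_direction(x, v):
    """Directional ablation: remove each activation's component along the unit vector v."""
    return x - (x @ v).unsqueeze(-1) * v

# Stand-in data: pretend coordinate 0 of the residual stream carries a "refusal" signal.
torch.manual_seed(0)
concept = torch.zeros(768)
concept[0] = 1.0
acts_refuse = torch.randn(200, 768) + 3.0 * concept   # e.g. activations on refusal-inducing prompts
acts_benign = torch.randn(200, 768)                    # e.g. activations on benign prompts

v = difference_of_means_direction(acts_refuse, acts_benign)
x = torch.randn(8, 768)
x_ablated = ablate_direction(x, v)
print("mean |component along v| before:", (x @ v).abs().mean().item())
print("mean |component along v| after: ", (x_ablated @ v).abs().mean().item())   # ~0
</syntaxhighlight>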
  
 
==Other==

==Scaling Laws==

* 2021-02: [https://arxiv.org/abs/2102.06701 Explaining Neural Scaling Laws] (Google DeepMind)
* 2022-03: [https://arxiv.org/abs/2203.15556 Training Compute-Optimal Large Language Models] (Chinchilla, Google DeepMind); the compute-optimal rule of thumb is sketched after this list
* 2025-03: [https://arxiv.org/abs/2503.04715 Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining]
* 2025-03: [https://arxiv.org/abs/2503.10061 Compute Optimal Scaling of Skills: Knowledge vs Reasoning]
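
As a reminder of what "compute-optimal" means in the Chinchilla entry: with the standard approximation C ≈ 6·N·D (training FLOPs from parameter count N and token count D) and the paper's headline rule of thumb of roughly 20 training tokens per parameter, a fixed FLOP budget fixes both N and D. A small sketch under those rule-of-thumb assumptions (the paper's fitted constants differ slightly):

<syntaxhighlight lang="python">
def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training FLOP budget between parameters N and tokens D,
    using C ~= 6*N*D and the rule of thumb D ~= 20*N (both approximations)."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e23 FLOP budget -> roughly 29B parameters and 0.6T tokens.
# (At Chinchilla's ~5.8e23 FLOPs this gives ~70B parameters and ~1.4T tokens,
#  matching the model actually trained in the paper.)
n, d = chinchilla_allocation(1e23)
print(f"~{n/1e9:.0f}B parameters, ~{d/1e12:.2f}T tokens")
</syntaxhighlight>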
  
 
=Information Processing/Storage=

* 2020-02: [https://arxiv.org/abs/2002.10689 A Theory of Usable Information Under Computational Constraints]
 
* "A transformer's depth affects its reasoning capabilities, whilst model size affects its knowledge capacity" ([https://x.com/danielhanchen/status/1835684061475655967 c.f.])
 
* "A transformer's depth affects its reasoning capabilities, whilst model size affects its knowledge capacity" ([https://x.com/danielhanchen/status/1835684061475655967 c.f.])
 
** 2024-02: [https://arxiv.org/abs/2402.14905 MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases]
 
** 2024-02: [https://arxiv.org/abs/2402.14905 MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases]
Line 178: Line 199:
 
* 2024-10: [https://arxiv.org/abs/2407.01687 Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning]: CoT involves both memorization and (probabilistic) reasoning
* 2024-11: [https://arxiv.org/abs/2411.16679 Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?]
* 2025-03: [https://www.arxiv.org/abs/2503.03961 A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers]
  
 
==Tokenization==

==Learning/Training==

* 2024-12: [https://arxiv.org/abs/2412.11521 On the Ability of Deep Networks to Learn Symmetries from Data: A Neural Kernel Theory]
* 2025-01: [https://arxiv.org/abs/2501.12391 Physics of Skill Learning]

===Cross-modal knowledge transfer===

* 2022-03: [https://arxiv.org/abs/2203.07519 Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer]
* 2023-05: [https://arxiv.org/abs/2305.07358 Towards Versatile and Efficient Visual Knowledge Integration into Pre-trained Language Models with Cross-Modal Adapters]
* 2025-02: [https://arxiv.org/abs/2502.06755 Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models]: CLIP learns a richer set of aggregated representations (e.g. for a culture or country) than a vision-only model.
  
 
==Hidden State==
 
* 2025-02: [https://arxiv.org/abs/2502.06258 Emergent Response Planning in LLM]: They show that the latent representation contains information beyond that needed for the next token (i.e. the model learns to "plan ahead" and encode information relevant to future tokens); a generic probing sketch follows this list
* 2025-03: [https://arxiv.org/abs/2503.02854 (How) Do Language Models Track State?]
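
A generic version of the probing methodology behind the Emergent Response Planning result, sketched with random stand-in data rather than real hidden states: collect hidden states at the end of the prompt and train a simple probe to predict a property of the response that only materializes later; accuracy well above chance suggests the hidden state already encodes forward-looking information.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: in a real experiment, `hidden` would be hidden states taken at the last
# prompt token, and `future_label` a property of the *later* continuation (e.g. an
# eventual-response-length bucket). Here a synthetic "planning direction" is injected.
rng = np.random.default_rng(0)
n, d = 2000, 768
future_label = rng.integers(0, 2, size=n)
planning_direction = rng.normal(size=d)
hidden = rng.normal(size=(n, d)) + 0.5 * future_label[:, None] * planning_direction

X_train, X_test, y_train, y_test = train_test_split(hidden, future_label, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy on held-out states:", probe.score(X_test, y_test))
# Accuracy well above 0.5 indicates the hidden state carries information about the
# future response, beyond what is needed to emit the immediate next token.
</syntaxhighlight>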
==Function Approximation==

* 2022-08: [https://arxiv.org/abs/2208.01066 What Can Transformers Learn In-Context? A Case Study of Simple Function Classes]: can learn linear functions in-context (equivalent to the least-squares estimator); the standard synthetic setup is sketched after this list
* 2022-11: [https://arxiv.org/abs/2211.09066 Teaching Algorithmic Reasoning via In-context Learning]: simple arithmetic
* 2022-11: [https://arxiv.org/abs/2211.15661 What learning algorithm is in-context learning? Investigations with linear models] ([https://github.com/ekinakyurek/google-research/tree/master/incontext code]): can learn linear regression
* 2022-12: [https://arxiv.org/abs/2212.07677 Transformers learn in-context by gradient descent]
* 2023-06: [https://arxiv.org/abs/2306.00297 Transformers learn to implement preconditioned gradient descent for in-context learning]
* 2023-07: [https://arxiv.org/abs/2307.03576 One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention]
* 2024-04: [https://arxiv.org/abs/2404.02893 ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline]
* 2025-02: [https://arxiv.org/abs/2502.20545 SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers]
* 2025-02: [https://arxiv.org/abs/2502.21212 Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought]
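
The in-context regression line of work above shares a common synthetic setup: sample a random linear function, show the model k (x, y) pairs in-context, and ask for the prediction at a query point. A minimal sketch of that setup and of the two reference solutions those papers compare against (ordinary least squares, and a single gradient-descent step from zero, which the 2023-07 entry relates to one layer of linear self-attention); the transformer itself is omitted:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 32                      # input dimension, number of in-context examples

# One random linear-regression "task": y = w^T x
w = rng.normal(size=d)
X = rng.normal(size=(k, d))
y = X @ w
x_query = rng.normal(size=d)

# Reference solution 1: ordinary least squares on the in-context examples.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Reference solution 2: one step of gradient descent from w=0 on the squared loss,
# i.e. w_gd = (eta/k) * X^T y.
eta = 1.0
w_gd = (eta / k) * X.T @ y

print("true y at query:     ", x_query @ w)
print("OLS prediction:      ", x_query @ w_ols)
print("one-step-GD estimate:", x_query @ w_gd)
</syntaxhighlight>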
  
 
=Failure Modes=

 
* 2023-09: [https://arxiv.org/abs/2309.13638 Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve] (biases towards "common" numbers, in-context CoT can reduce performance by incorrectly priming, etc.)
* 2023-11: [https://arxiv.org/abs/2311.16093 Visual cognition in multimodal large language models] (models lack human-like visual understanding)

==Jagged Frontier==

* 2024-07: [https://arxiv.org/abs/2407.03211 How Does Quantization Affect Multilingual LLMs?]: Quantization degrades different languages by differing amounts
* 2025-03: [https://arxiv.org/abs/2503.10061v1 Compute Optimal Scaling of Skills: Knowledge vs Reasoning]: Scaling laws are skill-dependent
  
 
=Psychology=

==Semantic Directions==

Directions, e.g.: f(king)-f(man)+f(woman)=f(queen) or f(sushi)-f(Japan)+f(Italy)=f(pizza)
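
This analogy arithmetic is just vector addition in embedding space followed by a nearest-neighbor lookup under cosine similarity. A self-contained toy sketch (hand-built embeddings purely for illustration; real experiments use learned word or token embeddings):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def toy_vec(royalty: float, female: float) -> np.ndarray:
    """Toy embedding: two interpretable components plus a little word-specific noise."""
    return np.concatenate([[royalty, female], 0.05 * rng.normal(size=3)])

emb = {
    "king":   toy_vec(1.0, 0.0),
    "queen":  toy_vec(1.0, 1.0),
    "man":    toy_vec(0.0, 0.0),
    "woman":  toy_vec(0.0, 1.0),
    "apple":  toy_vec(0.0, 0.0),   # unrelated distractor words
    "banana": toy_vec(0.0, 0.0),
}

def nearest(v, exclude=()):
    """Vocabulary word whose embedding has the highest cosine similarity to v."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], v))

v = emb["king"] - emb["man"] + emb["woman"]
print(nearest(v, exclude={"king", "man", "woman"}))   # expected: "queen"
</syntaxhighlight>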

==Tests of Resilience to Dropouts/etc.==

* 2024-02: Explorations of Self-Repair in Language Models
* 2024-06: What Matters in Transformers? Not All Attention is Needed
** Removing entire transformer blocks leads to significant performance degradation
** Removing MLP layers results in significant performance degradation
** Removing attention layers causes almost no performance degradation
** E.g. deleting half of the attention layers (a 48% speed-up) leads to only a 2.4% decrease on benchmarks
* 2024-06: The Remarkable Robustness of LLMs: Stages of Inference? (see the sketch after this list)
** They intentionally break the network (e.g. by swapping layers), yet it continues to work remarkably well. This suggests LLMs are quite robust, and it allows the authors to identify distinct stages of processing.
** They also use these interventions to infer what different layers are doing, breaking the transformer layers into four stages:
*** Detokenization: Raw tokens are converted into meaningful entities that take into account local context (especially nearby tokens).
*** Feature engineering: Features are progressively refined; factual knowledge is leveraged.
*** Prediction ensembling: Predictions (for the ultimately-selected next token) emerge. A sort of consensus voting is used, with "prediction neurons" and "suppression neurons" playing a major role in upvoting/downvoting.
*** Residual sharpening: The semantic representations are collapsed into specific next-token predictions. There is a strong emphasis on suppression neurons eliminating options. The confidence is calibrated.
** This structure can be thought of as two halves (roughly dual to each other): the first half broadens (goes from distinct tokens to a rich/elaborate concept space) and the second half collapses (goes from rich concepts to concrete token predictions).
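
The interventions described above are simple enough to sketch directly: permute or delete transformer blocks in a pretrained model and see how much the language-modeling loss degrades. A rough illustration using GPT-2 via Hugging Face transformers (assumes the gpt2 weights can be downloaded; the layer indices and evaluation text are arbitrary choices, not those of the papers):

<syntaxhighlight lang="python">
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "The quick brown fox jumps over the lazy dog. " * 20
ids = tok(text, return_tensors="pt").input_ids

def lm_loss():
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

print("baseline loss:", lm_loss())

# Intervention 1: swap a pair of adjacent transformer blocks
# (GPT-2 keeps its blocks in the ModuleList model.transformer.h).
blocks = model.transformer.h
blocks[5], blocks[6] = blocks[6], blocks[5]
print("loss after swapping blocks 5 and 6:", lm_loss())

# Intervention 2: delete a middle block entirely
# (per the notes above, removing whole blocks hurts more than benign swaps).
model.transformer.h = torch.nn.ModuleList(b for i, b in enumerate(blocks) if i != 6)
model.config.n_layer = len(model.transformer.h)
print("loss after also deleting one block:", lm_loss())
</syntaxhighlight>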
