Revision as of 11:36, 18 February 2025

Interpretability

2017-01: Learning to Generate Reviews and Discovering Sentiment

Mechanistic Interpretability

2020-03: OpenAI: Zoom In: An Introduction to Circuits
2021-12: Anthropic: A Mathematical Framework for Transformer Circuits
2022-09: Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
2023-01: Tracr: Compiled Transformers as a Laboratory for Interpretability (code)
2024-07: Anthropic: Circuits Update
2025-01: Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition (blog post)
2025-01: Review: Open Problems in Mechanistic Interpretability

Semanticity

2023-09: Sparse Autoencoders Find Highly Interpretable Features in Language Models
Anthropic monosemanticity interpretation of LLM features:
- 2023-10: Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- 2024-05: Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
2024-06: OpenAI: Scaling and evaluating sparse autoencoders
2024-08: Showing SAE Latents Are Not Atomic Using Meta-SAEs (demo)
2024-10: Efficient Dictionary Learning with Switch Sparse Autoencoders (code) More efficient SAE generation
2024-10: Decomposing The Dark Matter of Sparse Autoencoders (code) Shows that SAE errors are predictable
2024-10: Automatically Interpreting Millions of Features in Large Language Models
2024-12: Monet: Mixture of Monosemantic Experts for Transformers
2024-12: Matryoshka Sparse Autoencoders
2024-12: Learning Multi-Level Features with Matryoshka SAEs
2025-01: Low-Rank Adapting Models for Sparse Autoencoders
2025-02: Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment

Counter-Results

Reward Functions

2024-10: Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL

Symbolic and Notation

A Mathematical Framework for Transformer Circuits
Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures
2024-07: On the Anatomy of Attention: Introduces category-theoretic diagrammatic formalism for DL architectures
2024-11: diagrams to represent algorithms
2024-12: FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness

Mathematical

2024-06: Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis

Geometric

2023-11: The Linear Representation Hypothesis and the Geometry of Large Language Models
2024-06: The Geometry of Categorical and Hierarchical Concepts in Large Language Models
- Natural hierarchies of concepts---which occur throughout natural language and especially in scientific ontologies---are represented in the model's internal vectorial space as polytopes that can be decomposed into simplexes of mutually-exclusive categories.
2024-07: Reasoning in Large Language Models: A Geometric Perspective
2024-09: Deep Manifold Part 1: Anatomy of Neural Network Manifold
2024-10: The Geometry of Concepts: Sparse Autoencoder Feature Structure
- Tegmark et al. report multi-scale structure: 1) “atomic” small-scale, 2) “brain” intermediate-scale, and 3) “galaxy” large-scale
2025-02: The Geometry of Prompting: Unveiling Distinct Mechanisms of Task Adaptation in Language Models

Topography

2025-01: TopoNets: High Performing Vision and Language Models with Brain-Like Topography

Challenges

2023-07: Measuring Faithfulness in Chain-of-Thought Reasoning (provides evidence that larger models generate CoT that less faithfully reflects their internal reasoning)

Heuristic Understanding

Emergent Internal Model Building

Semantic Directions

Directions, e.g.: f(king)-f(man)+f(woman)=f(queen) or f(sushi)-f(Japan)+f(Italy)=f(pizza)
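A minimal sketch of the direction arithmetic above, assuming hand-picked 3-dimensional toy vectors (the dictionary `f`, its axes, and the helper `analogy` are hypothetical and for illustration only; real embeddings are learned and much higher-dimensional). It computes f(king)-f(man)+f(woman) and looks up the nearest remaining word by cosine similarity:

```python
import math

# Hypothetical toy embeddings (axes loosely: royalty, maleness, femaleness).
# These values are invented for illustration, not taken from any model.
f = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vocab):
    """Return the word closest to f(a) - f(b) + f(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vocab[w]))

result = analogy("king", "man", "woman", f)
print(result)
```

Excluding the three input words from the candidates is a common convention in analogy lookups; without it, the nearest vector is often one of the inputs.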

Task vectors:

Feature Geometry Reproduces Problem-space

Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task (Othello)
Emergent linear representations in world models of self-supervised sequence models (Othello)
What learning algorithm is in-context learning? Investigations with linear models
Emergent analogical reasoning in large language models
Language Models Represent Space and Time (Maps of world, US)
Not All Language Model Features Are Linear (Days of week form ring, etc.)
Evaluating the World Model Implicit in a Generative Model (Map of Manhattan)
Reliable precipitation nowcasting using probabilistic diffusion models. Generation of precipitation map imagery is predictive of actual future weather; implies model is learning scientifically-relevant modeling.
The Platonic Representation Hypothesis: Different models (including across modalities) are converging to a consistent world model.
ICLR: In-Context Learning of Representations
Language Models Use Trigonometry to Do Addition: Numbers arranged in helix to enable addition
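As a loose illustration of how a circular arrangement of numbers can support addition, the sketch below encodes an integer as a point on a circle and adds two numbers with the angle-sum identities. This is an assumption-laden toy, not the paper's method: it uses a single period `T = 10` (so it only recovers the sum modulo 10), and the names `encode`, `add_on_circle`, and `decode` are hypothetical.

```python
import math

T = 10  # assumed period: this single circle tracks a number modulo 10

def encode(n, period=T):
    """Place integer n on the unit circle at angle 2*pi*n/period."""
    angle = 2 * math.pi * n / period
    return (math.cos(angle), math.sin(angle))

def add_on_circle(u, v):
    """Combine two encoded numbers using the angle-sum identities:
    cos(a+b) = cos a cos b - sin a sin b
    sin(a+b) = sin a cos b + cos a sin b
    """
    (ca, sa), (cb, sb) = u, v
    return (ca * cb - sa * sb, sa * cb + ca * sb)

def decode(p, period=T):
    """Recover the integer (modulo period) from a point on the circle."""
    angle = math.atan2(p[1], p[0])
    return round(angle * period / (2 * math.pi)) % period

total = decode(add_on_circle(encode(7), encode(8)))  # (7 + 8) mod 10
print(total)
```

A single circle loses the magnitude of the sum; combining several periods (or a linear component, as a helix does) would be needed to recover the full value.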

Capturing Physics

2025-02: Fair at Meta: Intuitive physics understanding emerges from self-supervised pretraining on natural videos

Theory of Mind

Skeptical

Do generative video models learn physical principles from watching videos? (project, code)

Information Processing

What's the Magic Word? A Control Theory of LLM Prompting
Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process
- Models learn reasoning skills (they are not merely memorizing solution templates). They can mentally generate simple short plans (like humans).
- When presented facts, models develop internal understanding of what parameters (recursively) depend on each other. This occurs even before an explicit question is asked (i.e. before the task is defined). This appears to be different from human reasoning.
- Model depth matters for reasoning. This cannot be mitigated by chain-of-thought prompting (which allows models to develop and then execute plans) since even a single CoT step may require deep, multi-step reasoning/planning.
Why think step by step? Reasoning emerges from the locality of experience
2024-02: Chain of Thought Empowers Transformers to Solve Inherently Serial Problems: Proves that constant-depth transformers can solve any problem solvable by boolean circuits of size T, provided they can generate T intermediate (chain-of-thought) tokens
2024-11: Ask, and it shall be given: Turing completeness of prompting

Generalization

2024-06: Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

Grokking

Tests of Resilience to Dropouts/etc.

2024-02: Explorations of Self-Repair in Language Models
2024-06: What Matters in Transformers? Not All Attention is Needed
- Removing entire transformer blocks leads to significant performance degradation
- Removing MLP layers results in significant performance degradation
- Removing attention layers causes almost no performance degradation
- E.g. deleting half of the attention layers (48% speed-up) leads to only a 2.4% decrease on the benchmarks
2024-06: The Remarkable Robustness of LLMs: Stages of Inference?
- They intentionally break the network (swapping layers), yet it continues to work remarkably well. This suggests LLMs are quite robust, and allows them to identify different stages in processing.
- They also use these interventions to infer what different layers are doing. They break apart the LLM transformer layers into four stages:
  - Detokenization: Raw tokens are converted into meaningful entities that take into account local context (especially using nearby tokens).
  - Feature engineering: Features are progressively refined. Factual knowledge is leveraged.
  - Prediction ensembling: Predictions (for the ultimately-selected next-token) emerge. A sort of consensus voting is used, with “prediction neurons” and “suppression neurons” playing a major role in upvoting/downvoting.
  - Residual sharpening: The semantic representations are collapsed into specific next-token predictions. There is a strong emphasis on suppression neurons eliminating options. The confidence is calibrated.
- This structure can be thought of as two halves (being roughly dual to each other): the first half broadens (goes from distinct tokens to a rich/elaborate concept-space) and the second half collapses (goes from rich concepts to concrete token predictions).

Other

2024-11: Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond
2024-11: Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding (code)
2024-11: Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models: LLMs learn reasoning by extracting procedures from training data, not by memorizing specific answers
2024-11: LLMs Do Not Think Step-by-step In Implicit Reasoning
2024-12: The Complexity Dynamics of Grokking

Scaling Laws

2017-12: Deep Learning Scaling is Predictable, Empirically (Baidu)
2019-03: The Bitter Lesson (Rich Sutton)
2020-01: Scaling Laws for Neural Language Models (OpenAI)
2020-05: The Scaling Hypothesis (Gwern)
2020-10: Scaling Laws for Autoregressive Generative Modeling (OpenAI)
2021-02: Explaining Neural Scaling Laws (Google DeepMind)
2021-08: Scaling Laws for Deep Learning
2022-03: Training Compute-Optimal Large Language Models (Chinchilla, Google DeepMind)

Information Processing/Storage

"A transformer's depth affects its reasoning capabilities, whilst model size affects its knowledge capacity" (c.f.)
- 2024-02: MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
- 2024-04: The Illusion of State in State-Space Models (figure 3)
- 2024-08: Gemma 2: Improving Open Language Models at a Practical Size (table 9)
2024-09: Schrodinger's Memory: Large Language Models
2024-10: Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning. CoT involves both memorization and (probabilistic) reasoning
2024-11: Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?

Tokenization

For numbers/math

2024-02: Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs: L2R vs. R2L yields different performance on math
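To make the L2R vs. R2L distinction concrete, here is a small sketch that splits a digit string into fixed-size chunks from either end. The chunk size of 3 and the function name `chunk_digits` are assumptions for illustration; actual tokenizers differ in how they group digits.

```python
def chunk_digits(number: str, size: int = 3, direction: str = "L2R") -> list[str]:
    """Split a digit string into chunks of `size`, starting from the left
    (L2R) or aligned to the right end (R2L)."""
    if direction == "L2R":
        # Any short leftover chunk ends up at the right end.
        return [number[i:i + size] for i in range(0, len(number), size)]
    if direction == "R2L":
        # Any short leftover chunk ends up at the left end, so chunk
        # boundaries line up with thousands separators.
        first = len(number) % size
        head = [number[:first]] if first else []
        return head + [number[i:i + size] for i in range(first, len(number), size)]
    raise ValueError("direction must be 'L2R' or 'R2L'")

l2r = chunk_digits("1234567", direction="L2R")
r2l = chunk_digits("1234567", direction="R2L")
print(l2r, r2l)
```

With R2L chunking, digits of equal place value land in the same position within a chunk across numbers of different lengths, which is one plausible reason the two schemes could behave differently on arithmetic.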

Learning/Training

2018-03: The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks: Sparse neural networks are optimal, but it is difficult to identify the right architecture and train it. Deep learning typically consists of training a dense neural network, which makes it easier to learn an internal sparse circuit optimal to a particular problem.
2024-12: On the Ability of Deep Networks to Learn Symmetries from Data: A Neural Kernel Theory
2025-01: Physics of Skill Learning

Failure Modes

2023-06: Can Large Language Models Infer Causation from Correlation?: Poor causal inference
2023-09: The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
2023-09: Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve (biases towards "common" numbers, in-context CoT can reduce performance by incorrectly priming, etc.)
2023-11: Visual cognition in multimodal large language models (models lack human-like visual understanding)

Psychology

2023-04: Inducing anxiety in large language models can induce bias

Allow LLM to think

2024-12: Let your LLM generate a few tokens and you will reduce the need for retrieval

@@ Line 98: / Line 98: @@
 * [https://arxiv.org/abs/2502.00873 Language Models Use Trigonometry to Do Addition]: Numbers arranged in helix to enable addition
-===Physical Model===
+===Capturing Physics===
 * 2025-02: Fair at Meta: [https://arxiv.org/abs/2502.11831 Intuitive physics understanding emerges from self-supervised pretraining on natural videos]

Difference between revisions of "AI understanding"
