AI understanding
Interpretability
Mechanistic Interpretability
- 2020-03: OpenAI: Zoom In: An Introduction to Circuits
- 2021-12: Anthropic: A Mathematical Framework for Transformer Circuits
- 2022-09: Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
- 2024-07: Anthropic: Circuits Update
Semanticity
- 2023-09: Sparse Autoencoders Find Highly Interpretable Features in Language Models (the basic SAE recipe is sketched at the end of this list)
- Anthropic monosemanticity interpretation of LLM features
- 2024-06: OpenAI: Scaling and evaluating sparse autoencoders
- 2024-08: Showing SAE Latents Are Not Atomic Using Meta-SAEs (demo)
- 2024-10: Efficient Dictionary Learning with Switch Sparse Autoencoders (code): more efficient SAE training
- 2024-10: Decomposing The Dark Matter of Sparse Autoencoders (code): shows that SAE errors are predictable
- 2024-10: Automatically Interpreting Millions of Features in Large Language Models
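The sparse-autoencoder (SAE) work above follows a common recipe: train an overcomplete autoencoder on a model's internal activations, with a sparsity penalty so that each activation is reconstructed from a small number of (hopefully interpretable) features. A minimal sketch of that recipe, assuming PyTorch and using random placeholder data in place of real residual-stream activations (the dimensions and L1 coefficient are illustrative, not values from the papers):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 sparsity penalty on its latents."""
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(f)           # reconstruction of the original activation
        return x_hat, f

# Placeholder data: in practice these would be residual-stream activations
# collected from a language model over a large corpus.
acts = torch.randn(10_000, 512)

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity strength (illustrative)

for step in range(1_000):
    batch = acts[torch.randint(0, len(acts), (256,))]
    x_hat, f = sae(batch)
    loss = ((x_hat - batch) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Each column of the learned decoder then serves as a candidate feature direction, which can be interpreted by inspecting the dataset examples that most strongly activate it.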
Reward Functions
Symbolic and Notation
- A Mathematical Framework for Transformer Circuits
- Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures
- 2024-07: On the Anatomy of Attention: introduces a category-theoretic diagrammatic formalism for deep-learning architectures
- 2024-11: diagrams to represent algorithms
Mathematical
Geometric
- 2023-11: The Linear Representation Hypothesis and the Geometry of Large Language Models
- 2024-06: The Geometry of Categorical and Hierarchical Concepts in Large Language Models
- Natural hierarchies of concepts (which occur throughout natural language, and especially in scientific ontologies) are represented in the model's internal vector space as polytopes that decompose into simplices of mutually exclusive categories (a minimal concept-direction sketch follows this list).
- 2024-07: Reasoning in Large Language Models: A Geometric Perspective
- 2024-09: Deep Manifold Part 1: Anatomy of Neural Network Manifold
- 2024-10: The Geometry of Concepts: Sparse Autoencoder Feature Structure
- Tegmark et al. report multi-scale structure: 1) “atomic” small-scale, 2) “brain” intermediate-scale, and 3) “galaxy” large-scale
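Much of this work rests on the linear representation hypothesis: a binary concept corresponds (approximately) to a direction in the model's activation space. A minimal sketch of the standard difference-of-means estimate of such a direction, using synthetic placeholder activations (a real study would extract activations from a chosen layer of the model for contrasting prompts):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Placeholder "activations" for two contrasting classes of prompts
# (e.g. statements expressing one side of a binary concept vs. the other).
true_acts  = rng.normal(0.0, 1.0, size=(500, d)) + 2.0 * np.eye(d)[0]
false_acts = rng.normal(0.0, 1.0, size=(500, d)) - 2.0 * np.eye(d)[0]

# Difference-of-means concept direction
direction = true_acts.mean(axis=0) - false_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Score held-out activations by projecting onto the direction
test = np.vstack([true_acts[:50], false_acts[:50]])
labels = np.array([1] * 50 + [0] * 50)
scores = test @ direction
accuracy = ((scores > 0) == labels).mean()
print(f"separation accuracy along concept direction: {accuracy:.2f}")
```

Here the separation is guaranteed by construction; the empirical finding of the papers above is that real model activations behave similarly for many natural concepts.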
Challenges
- 2023-07: Measuring Faithfulness in Chain-of-Thought Reasoning (finds that, for sufficiently large models, the generated CoT does not faithfully capture the model's internal reasoning)
Heuristic Understanding
Emergent Internal Model Building
Semantic Directions
Directions, e.g.: f(king)-f(man)+f(woman)≈f(queen) or f(sushi)-f(Japan)+f(Italy)≈f(pizza) (a minimal vector-arithmetic sketch follows the reference list below)
- Efficient Estimation of Word Representations in Vector Space
- Linguistic Regularities in Continuous Space Word Representations
- Word Embeddings, Analogies, and Machine Learning: Beyond king - man + woman = queen
- GloVe: Global Vectors for Word Representation
- Using Word2Vec to process big text data
- The geometry of truth: Emergent linear structure in large language model representations of true/false datasets (true/false)
- Monotonic Representation of Numeric Properties in Language Models (numeric directions)
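These analogy results reduce to simple vector arithmetic plus nearest-neighbor search in embedding space. A minimal sketch, using random placeholder vectors so only the mechanics are illustrated (with pretrained embeddings such as word2vec or GloVe the famous analogies approximately hold):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings; in practice, load pretrained vectors (e.g. GloVe).
vocab = ["king", "queen", "man", "woman", "sushi", "pizza", "japan", "italy"]
emb = {w: rng.normal(size=300) for w in vocab}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    """Return the word w maximizing cos(emb[w], emb[b] - emb[a] + emb[c])."""
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in vocab if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# With real embeddings this tends to return "queen".
print(analogy("man", "king", "woman"))
```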
Task vectors:
- Function Vectors in Large Language Models
- In-context learning creates task vectors
- Extracting SAE task features for in-context learning
Feature Geometry Reproduces Problem-space
- Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task (Othello)
- Emergent linear representations in world models of self-supervised sequence models (Othello)
- What learning algorithm is in-context learning? Investigations with linear models
- Emergent analogical reasoning in large language models
- Language Models Represent Space and Time (Maps of world, US)
- Not All Language Model Features Are Linear (Days of week form ring, etc.)
- Evaluating the World Model Implicit in a Generative Model (Map of Manhattan)
- Reliable precipitation nowcasting using probabilistic diffusion models. The generated precipitation-map imagery is predictive of actual future weather, implying the model has learned scientifically relevant structure.
- The Platonic Representation Hypothesis: Different models (including across modalities) are converging to a consistent world model.
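The world-model results above (Othello board state, maps of space and time, etc.) are typically established with linear probes: a simple linear classifier or regressor trained to read a property of the world state out of frozen intermediate activations; if a linear probe succeeds, the property is (approximately) linearly encoded. A minimal sketch with synthetic placeholder activations (scikit-learn logistic regression standing in for the probes used in these papers):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 256

# Placeholder: hidden-state vectors and a binary world-state property
# (e.g. "is this board square occupied?"). Real studies extract the
# activations from a frozen model while it processes game transcripts.
hidden_states = rng.normal(size=(n, d))
world_property = (hidden_states @ rng.normal(size=d) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, world_property, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```

In this synthetic example the property is linearly decodable by construction; the substantive claim of the papers above is that real model activations pass the same test for board states, geography, time, and related properties.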
Theory of Mind
- Evaluating Large Language Models in Theory of Mind Tasks
- Looking Inward: Language Models Can Learn About Themselves by Introspection
Information Processing
- What's the Magic Word? A Control Theory of LLM Prompting
- Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process
- Models learn reasoning skills (they are not merely memorizing solution templates), and can mentally generate simple short plans before answering (much as humans do).
- When presented with facts, models develop an internal understanding of which parameters (recursively) depend on each other. This occurs even before an explicit question is asked (i.e. before the task is defined), and appears to differ from human reasoning.
- Model depth matters for reasoning. This cannot be mitigated by chain-of-thought prompting (which allows models to develop and then execute plans), since even a single CoT step may require deep, multi-step reasoning/planning.
- Why think step by step? Reasoning emerges from the locality of experience
- 2024-02: Chain of Thought Empowers Transformers to Solve Inherently Serial Problems: proves that transformers can, in principle, solve inherently serial problems, provided they can generate sufficiently many intermediate (chain-of-thought) tokens
Tests of Resilience to Dropouts/etc.
- 2024-02: Explorations of Self-Repair in Language Models
- 2024-06: What Matters in Transformers? Not All Attention is Needed
- Removing entire transformer blocks leads to significant performance degradation
- Removing MLP layers results in significant performance degradation
- Removing attention layers causes almost no performance degradation
- E.g. deleting half of the attention layers (a 48% speed-up) leads to only a 2.4% decrease in benchmark performance
- 2024-06: The Remarkable Robustness of LLMs: Stages of Inference?
- They intentionally break the network (swapping layers), yet it continues to work remarkably well. This suggests LLMs are quite robust, and allows the authors to identify distinct stages of processing.
- They also use these interventions to infer what different layers are doing, grouping the transformer's layers into four stages:
- Detokenization: Raw tokens are converted into meaningful entities that take into account local context (especially using nearby tokens).
- Feature engineering: Features are progressively refined. Factual knowledge is leveraged.
- Prediction ensembling: Predictions (for the ultimately-selected next-token) emerge. A sort of consensus voting is used, with “prediction neurons” and "suppression neurons" playing a major role in upvoting/downvoting.
- Residual sharpening: The semantic representations are collapsed into specific next-token predictions. There is a strong emphasis on suppression neurons eliminating options. The confidence is calibrated.
- This structure can be thought of as two halves (being roughly dual to each other): the first half broadens (goes from distinct tokens to a rich/elaborate concept-space) and the second half collapses (goes from rich concepts to concrete token predictions).
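The ablation studies above amount to zeroing out (or skipping) chosen sublayers of a pretrained model and re-measuring loss or benchmark performance. A minimal sketch, assuming the Hugging Face transformers GPT-2 implementation (where each block exposes an attn submodule whose first output is the attention contribution to the residual stream); the model choice, layer selection, and evaluation text are illustrative, not those used in the papers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; the papers study larger models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "The quick brown fox jumps over the lazy dog. " * 20
batch = tok(text, return_tensors="pt")

def lm_loss():
    with torch.no_grad():
        return model(**batch, labels=batch["input_ids"]).loss.item()

print("baseline loss:", lm_loss())

def zero_attn(module, inputs, output):
    # Replace the attention sublayer's contribution with zeros, so the
    # residual stream passes through the block unchanged by attention.
    if isinstance(output, tuple):
        return (torch.zeros_like(output[0]),) + output[1:]
    return torch.zeros_like(output)

# Ablate the attention sublayers in the second half of the network.
handles = [blk.attn.register_forward_hook(zero_attn)
           for blk in model.transformer.h[len(model.transformer.h) // 2:]]
print("loss with late attention ablated:", lm_loss())
for h in handles:
    h.remove()
```

The same hook pattern can be applied to the MLP sublayers or to whole blocks, to reproduce the comparison of which components the model can tolerate losing.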
Other
- 2024-11: Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond
- 2024-11: Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding (code)
- 2024-11: Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models: LLMs learn reasoning by extracting procedures from training data, not by memorizing specific answers
- 2024-11: LLMs Do Not Think Step-by-step In Implicit Reasoning
Scaling Laws
- 2017-12: Deep Learning Scaling is Predictable, Empirically (Baidu)
- 2019-03: The Bitter Lesson (Rich Sutton)
- 2020-01: Scaling Laws for Neural Language Models (OpenAI)
- 2020-05: The Scaling Hypothesis (Gwern)
- 2020-10: Scaling Laws for Autoregressive Generative Modeling (OpenAI)
- 2021-02: Explaining Neural Scaling Laws (Google DeepMind)
- 2021-08: Scaling Laws for Deep Learning
- 2022-03: Training Compute-Optimal Large Language Models (Chinchilla, Google DeepMind)
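These papers fit empirical loss measurements to simple power-law forms; the Chinchilla analysis, for example, models loss as L(N, D) = E + A/N^α + B/D^β for parameter count N and training-token count D. A minimal sketch of fitting that functional form with scipy, using synthetic placeholder measurements (the generating constants are arbitrary, and this simple least-squares fit stands in for the paper's more careful fitting procedure):

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(X, E, A, B, alpha, beta):
    """L(N, D) = E + A / N**alpha + B / D**beta"""
    N, D = X
    return E + A / N**alpha + B / D**beta

# Synthetic placeholder "experiments": (params N, tokens D, measured loss).
rng = np.random.default_rng(0)
N = np.logspace(7, 10, 30)   # model sizes
D = np.logspace(9, 12, 30)   # training-token counts
true = dict(E=1.7, A=400.0, B=1800.0, alpha=0.34, beta=0.28)
L = chinchilla_loss((N, D), **true) + rng.normal(0, 0.01, size=N.size)

popt, _ = curve_fit(chinchilla_loss, (N, D), L,
                    p0=[1.0, 100.0, 1000.0, 0.3, 0.3], maxfev=20000)
E, A, B, alpha, beta = popt
print(f"fit: E={E:.2f}, alpha={alpha:.2f}, beta={beta:.2f}")
```

The fitted exponents are what determine compute-optimal trade-offs between model size and data.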
Information Processing/Storage
- "A transformer's depth affects its reasoning capabilities, whilst model size affects its knowledge capacity" (c.f.)
- 2024-02: MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
- 2024-04: The Illusion of State in State-Space Models (figure 3)
- 2024-08: Gemma 2: Improving Open Language Models at a Practical Size (table 9)
- 2024-09: Schrodinger's Memory: Large Language Models
- 2024-10: Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning. CoT involves both memorization and (probabilistic) reasoning
Tokenization
For numbers/math
- 2024-02: Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs: left-to-right (L2R) vs. right-to-left (R2L) digit grouping yields different performance on arithmetic
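The effect comes down to how a digit string is split into chunks: left-to-right and right-to-left grouping produce different token boundaries for the same number, and hence different arithmetic behavior. A small illustration of the two schemes with 3-digit chunks (a simplified stand-in for an actual tokenizer):

```python
def chunk_l2r(digits: str, size: int = 3) -> list[str]:
    """Group digits left-to-right: '1234567' -> ['123', '456', '7']."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]

def chunk_r2l(digits: str, size: int = 3) -> list[str]:
    """Group digits right-to-left: '1234567' -> ['1', '234', '567']."""
    rem = len(digits) % size
    head = [digits[:rem]] if rem else []
    return head + [digits[i:i + size] for i in range(rem, len(digits), size)]

n = "1234567"
print(chunk_l2r(n))  # ['123', '456', '7']
print(chunk_r2l(n))  # ['1', '234', '567']  (matches the 1,234,567 convention)
```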
Failure Modes
- 2023-06: Can Large Language Models Infer Causation from Correlation?: Poor causal inference