AI understanding
Interpretability
Mechanistic Interpretability
- 2020-03Mar: OpenAI: Zoom In: An Introduction to Circuits
- 2021-12Dec: Anthropic: A Mathematical Framework for Transformer Circuits
- 2022-09Sep: Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
- 2024-07Jul: Anthropic: Circuits Update
Semanticity
- 2023-09Sep: Sparse Autoencoders Find Highly Interpretable Features in Language Models
- Anthropic monosemanticity interpretation of LLM features:
- 2024-06Jun: OpenAI: Scaling and evaluating sparse autoencoders
- 2024-08Aug: Showing SAE Latents Are Not Atomic Using Meta-SAEs (demo)
- 2024-10Oct: Efficient Dictionary Learning with Switch Sparse Autoencoders (code) More efficient SAE generation
- 2024-10Oct: Decomposing The Dark Matter of Sparse Autoencoders (code) Shows that SAE errors are predictable
- 2024-10Oct: Automatically Interpreting Millions of Features in Large Language Models
- 2024-12Dec: Monet: Mixture of Monosemantic Experts for Transformers
- 2024-12Dec: Matryoshka Sparse Autoencoders
- 2024-12Dec: Learning Multi-Level Features with Matryoshka SAEs
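As a rough illustration of the sparse autoencoder (SAE) idea that the papers above build on, the following is a minimal NumPy sketch of a single forward pass with a ReLU encoder, a linear decoder, and an L1 sparsity penalty. The dimensions, the random weights, the `sae_forward` helper, and the `l1_coeff` value are all assumptions chosen for illustration; none of the listed papers is reproduced exactly, and several use variants of this basic recipe.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """One forward pass of a basic sparse autoencoder (illustrative only)."""
    # Encode: ReLU keeps latent activations non-negative.
    z = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Decode: reconstruct x as a weighted sum of dictionary directions (rows of W_dec).
    x_hat = z @ W_dec + b_dec
    # Loss: mean squared reconstruction error plus an L1 penalty on the latents.
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(z), axis=-1))
    return z, x_hat, recon + l1_coeff * sparsity

rng = np.random.default_rng(0)
d_model, d_latent, batch = 16, 64, 8      # hypothetical sizes
x = rng.normal(size=(batch, d_model))     # stand-in for model activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
b_enc = np.zeros(d_latent)
b_dec = np.zeros(d_model)

z, x_hat, loss = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
print(z.shape, x_hat.shape, float(loss))
```

In practice the weights would be trained to minimize this loss over many activations; the latent dimension is usually much larger than the model dimension so that individual latents can specialize.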
Reward Functions
Symbolic and Notation
- A Mathematical Framework for Transformer Circuits
- Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures
- 2024-07: On the Anatomy of Attention: Introduces category-theoretic diagrammatic formalism for DL architectures
- 2024-11: diagrams to represent algorithms
- 2024-12: FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness
Mathematical
Geometric
- 2023-11: The Linear Representation Hypothesis and the Geometry of Large Language Models
- 2024-06: The Geometry of Categorical and Hierarchical Concepts in Large Language Models
- Natural hierarchies of concepts (which occur throughout natural language and especially in scientific ontologies) are represented in the model's internal vector space as polytopes that can be decomposed into simplexes of mutually-exclusive categories.
- 2024-07: Reasoning in Large Language Models: A Geometric Perspective
- 2024-09: Deep Manifold Part 1: Anatomy of Neural Network Manifold
- 2024-10: The Geometry of Concepts: Sparse Autoencoder Feature Structure
- Tegmark et al. report multi-scale structure: 1) “atomic” small-scale, 2) “brain” intermediate-scale, and 3) “galaxy” large-scale
Challenges
- 2023-07Jul: Measuring Faithfulness in Chain-of-Thought Reasoning (provides evidence that sufficiently large models often generate CoT that does not faithfully capture their internal reasoning)
Heuristic Understanding
Emergent Internal Model Building
- 2023-07: A Theory for Emergence of Complex Skills in Language Models
- 2024-06: Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space
Semantic Directions
Directions, e.g.: f(king)-f(man)+f(woman)=f(queen) or f(sushi)-f(Japan)+f(Italy)=f(pizza)
- Efficient Estimation of Word Representations in Vector Space
- Linguistic Regularities in Continuous Space Word Representations
- Word Embeddings, Analogies, and Machine Learning: Beyond king - man + woman = queen
- GloVe: Global Vectors for Word Representation
- Using Word2Vec to process big text data
- The geometry of truth: Emergent linear structure in large language model representations of true/false datasets (true/false)
- Monotonic Representation of Numeric Properties in Language Models (numeric directions)
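The vector arithmetic behind analogies such as f(king)-f(man)+f(woman)=f(queen) can be sketched as follows. The four-dimensional embeddings are hypothetical toy values constructed so that the analogy holds exactly; real embeddings are learned and only satisfy such relations approximately. The `cosine` and `analogy` helpers are illustrative names, not part of any library.

```python
import numpy as np

# Hypothetical toy embeddings; axes loosely mean (royalty, male, female, food).
emb = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, emb):
    """Return the word closest to f(a) - f(b) + f(c), excluding the inputs."""
    target = emb[a] - emb[b] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

result = analogy("king", "man", "woman", emb)
print(result)  # queen
```

Excluding the three query words from the candidates follows common practice in analogy evaluation; without it, the nearest neighbor is often one of the inputs.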
Task vectors:
- Function Vectors in Large Language Models
- In-context learning creates task vectors
- Extracting SAE Task Features for In-Context Learning
- Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
Feature Geometry Reproduces Problem-space
- Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task (Othello)
- Emergent linear representations in world models of self-supervised sequence models (Othello)
- What learning algorithm is in-context learning? Investigations with linear models
- Emergent analogical reasoning in large language models
- Language Models Represent Space and Time (Maps of world, US)
- Not All Language Model Features Are Linear (Days of week form ring, etc.)
- Evaluating the World Model Implicit in a Generative Model (Map of Manhattan)
- Reliable precipitation nowcasting using probabilistic diffusion models. Generation of precipitation map imagery is predictive of actual future weather; implies model is learning scientifically-relevant modeling.
- The Platonic Representation Hypothesis: Different models (including across modalities) are converging to a consistent world model.
Theory of Mind
- Evaluating Large Language Models in Theory of Mind Tasks
- Looking Inward: Language Models Can Learn About Themselves by Introspection
Information Processing
- What's the Magic Word? A Control Theory of LLM Prompting
- Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process
- Models learn reasoning skills (they are not merely memorizing solution templates). They can mentally generate simple short plans (like humans).
- When presented with facts, models develop an internal understanding of which parameters (recursively) depend on each other. This occurs even before an explicit question is asked (i.e. before the task is defined). This appears to be different from human reasoning.
- Model depth matters for reasoning. This cannot be mitigated by chain-of-thought prompting (which allows models to develop and then execute plans) since even a single CoT step may require deep, multi-step reasoning/planning.
- Why think step by step? Reasoning emerges from the locality of experience
- 2024-02: Chain of Thought Empowers Transformers to Solve Inherently Serial Problems: Proves that constant-depth transformers can solve any problem solvable by boolean circuits of size T, provided they can generate T intermediate (chain-of-thought) tokens
- 2024-11: Ask, and it shall be given: Turing completeness of prompting
Generalization
- 2024-06: Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
Tests of Resilience to Dropouts/etc.
- 2024-02: Explorations of Self-Repair in Language Models
- 2024-06: What Matters in Transformers? Not All Attention is Needed
- Removing entire transformer blocks leads to significant performance degradation
- Removing MLP layers results in significant performance degradation
- Removing attention layers causes almost no performance degradation
- E.g. deleting half of the attention layers (a 48% speed-up) leads to only a 2.4% decrease on the benchmarks
- 2024-06: The Remarkable Robustness of LLMs: Stages of Inference?
- They intentionally break the network (swapping layers), yet it continues to work remarkably well. This suggests LLMs are quite robust, and allows them to identify different stages in processing.
- They also use these interventions to infer what different layers are doing. They break apart the LLM transformer layers into four stages:
- Detokenization: Raw tokens are converted into meaningful entities that take into account local context (especially using nearby tokens).
- Feature engineering: Features are progressively refined. Factual knowledge is leveraged.
- Prediction ensembling: Predictions (for the ultimately-selected next-token) emerge. A sort of consensus voting is used, with “prediction neurons” and “suppression neurons” playing a major role in upvoting/downvoting.
- Residual sharpening: The semantic representations are collapsed into specific next-token predictions. There is a strong emphasis on suppression neurons eliminating options. The confidence is calibrated.
- This structure can be thought of as two halves (being roughly dual to each other): the first half broadens (goes from distinct tokens to a rich/elaborate concept-space) and the second half collapses (goes from rich concepts to concrete token predictions).
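The interventions described above (deleting attention layers, swapping layers) are easy to express in a residual architecture: skipping a sublayer simply means its update is not added, so later layers still receive a well-formed input. The sketch below shows only the mechanics on a hypothetical toy residual stack. The random `layers`, the `run` helper with its `skip_attn` and `order` parameters, and `rel_change` are all assumptions; no real attention is computed, and the printed numbers say nothing about actual LLMs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 8, 6  # hypothetical sizes

# Toy "layers": each holds an attention-like and an MLP-like residual update.
layers = [
    {"attn": rng.normal(scale=0.05, size=(d, d)),
     "mlp": rng.normal(scale=0.05, size=(d, d))}
    for _ in range(n_layers)
]

def run(x, layers, skip_attn=(), order=None):
    """Run the residual stack, optionally skipping attention-like sublayers
    or visiting the layers in a permuted order."""
    order = range(len(layers)) if order is None else order
    for i in order:
        if i not in skip_attn:
            x = x + np.tanh(x @ layers[i]["attn"])
        x = x + np.tanh(x @ layers[i]["mlp"])
    return x

def rel_change(a, b):
    """Relative L2 difference between two outputs."""
    return float(np.linalg.norm(a - b) / np.linalg.norm(b))

x = rng.normal(size=(4, d))
baseline = run(x, layers)
ablated = run(x, layers, skip_attn={1, 3, 5})        # drop half the attention-like sublayers
swapped = run(x, layers, order=[0, 1, 3, 2, 4, 5])   # swap two adjacent layers

print(rel_change(ablated, baseline), rel_change(swapped, baseline))
```

With a real model, the same pattern would be applied to the actual sublayers and the effect measured on benchmark performance rather than on raw output vectors.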
Other
- 2024-11: Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond
- 2024-11: Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding (code)
- 2024-11: Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models: LLMs learn reasoning by extracting procedures from training data, not by memorizing specific answers
- 2024-11: LLMs Do Not Think Step-by-step In Implicit Reasoning
Scaling Laws
- 2017-12: Deep Learning Scaling is Predictable, Empirically (Baidu)
- 2019-03: The Bitter Lesson (Rich Sutton)
- 2020-01: Scaling Laws for Neural Language Models (OpenAI)
- 2020-05: The Scaling Hypothesis (Gwern)
- 2020-10: Scaling Laws for Autoregressive Generative Modeling (OpenAI)
- 2021-02: Explaining Neural Scaling Laws (Google DeepMind)
- 2021-08: Scaling Laws for Deep Learning
- 2022-03: Training Compute-Optimal Large Language Models (Chinchilla, Google DeepMind)
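Much of the scaling-law literature above describes loss falling as an approximate power law in model size, data, or compute. A power law is a straight line in log-log space, so it can be fitted with ordinary linear regression, as in the sketch below. The data are synthetic and noise-free, and the constants `true_a` and `true_alpha` are made up for illustration; they are not values reported by any of the listed papers.

```python
import numpy as np

# Synthetic (hypothetical) data: loss = a * N**(-alpha), with made-up constants.
true_a, true_alpha = 10.0, 0.08
n_params = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
loss = true_a * n_params ** (-true_alpha)

# In log-log space: log L = log a - alpha * log N, i.e. a straight line.
slope, intercept = np.polyfit(np.log(n_params), np.log(loss), deg=1)
alpha_hat, a_hat = -slope, np.exp(intercept)

# Extrapolate the fitted law to a larger (unseen) model size.
predicted = a_hat * (1e11) ** (-alpha_hat)
print(alpha_hat, a_hat, predicted)
```

Real measurements are noisy and often include an irreducible-loss offset, in which case a plain log-log fit is only a first approximation.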
Information Processing/Storage
- "A transformer's depth affects its reasoning capabilities, whilst model size affects its knowledge capacity" (c.f.)
- 2024-02: MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
- 2024-04: The Illusion of State in State-Space Models (figure 3)
- 2024-08: Gemma 2: Improving Open Language Models at a Practical Size (table 9)
- 2024-09: Schrodinger's Memory: Large Language Models
- 2024-10: Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning. CoT involves both memorization and (probabilistic) reasoning
- 2024-11: Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
Tokenization
For numbers/math
- 2024-02: Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs: L2R vs. R2L yields different performance on math
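To make the L2R vs. R2L distinction concrete, the sketch below splits a digit string into groups of up to three digits from either direction. Fixed three-digit grouping is an assumption standing in for a real tokenizer, and `chunk_l2r` and `chunk_r2l` are hypothetical helper names.

```python
def chunk_l2r(digits, size=3):
    """Group digits left-to-right: '12345' -> ['123', '45']."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]

def chunk_r2l(digits, size=3):
    """Group digits right-to-left: '12345' -> ['12', '345']."""
    head = len(digits) % size  # length of the short leading group, if any
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return chunks

l2r = chunk_l2r("12345")
r2l = chunk_r2l("12345")
print(l2r, r2l)  # ['123', '45'] ['12', '345']
```

Right-to-left grouping keeps each chunk aligned with place value (ones, thousands, millions), in the same way that thousands separators do, whereas left-to-right grouping shifts the alignment whenever the number of digits is not a multiple of three.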
Learning/Training
- 2018-03: The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks: Dense, randomly-initialized networks contain sparse subnetworks (“winning tickets”) that can match the full network's accuracy when trained in isolation, but it is difficult to identify the right sparse architecture in advance and train it directly. Deep learning typically consists of training a dense neural network, which makes it easier to learn an internal sparse circuit suited to a particular problem.
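One way to search for such a sparse subnetwork is magnitude pruning: train the dense network, keep only the largest-magnitude weights, and reset the survivors to their initial values before retraining. The sketch below shows a single pruning step on one weight matrix. The `magnitude_mask` helper is a hypothetical name, and `w_trained` is a random stand-in for trained weights, since no training is performed here.

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Boolean mask that drops the `sparsity` fraction of smallest-magnitude weights."""
    k = int(round(sparsity * weights.size))  # number of weights to prune
    if k == 0:
        return np.ones_like(weights, dtype=bool)
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return np.abs(weights) > threshold

rng = np.random.default_rng(0)
w_init = rng.normal(size=(4, 5))                          # weights at initialization
w_trained = w_init + rng.normal(scale=0.5, size=(4, 5))   # stand-in for trained weights

mask = magnitude_mask(w_trained, sparsity=0.8)
ticket = w_init * mask   # sparse subnetwork, reset to its initial values
print(int(mask.sum()), ticket.shape)
```

In a full experiment this step would be repeated over several train-prune-reset rounds, and the resulting sparse network retrained to check whether it matches the dense network's accuracy.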
Failure Modes
- 2023-06: Can Large Language Models Infer Causation from Correlation?: Poor causal inference
- 2023-09: The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
- 2023-09: Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve (biases towards "common" numbers, in-context CoT can reduce performance by incorrectly priming, etc.)
- 2023-11: Visual cognition in multimodal large language models (models lack human-like visual understanding)
Psychology
Allow LLM to think
In-context Learning
- 2021-10: MetaICL: Learning to Learn In Context
- 2022-02: Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
- 2022-08: What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
- 2022-11: What learning algorithm is in-context learning? Investigations with linear models
- 2022-12: Transformers learn in-context by gradient descent