Difference between revisions of "AI understanding"

 
==Mechanistic Interpretability==

** [https://transformer-circuits.pub/2025/attribution-graphs/biology.html On the Biology of a Large Language Model]

* 2025-11: OpenAI: [https://cdn.openai.com/pdf/41df8f28-d4ef-43e9-aed2-823f9393e470/circuit-sparsity-paper.pdf Weight-sparse transformers have interpretable circuits] ([https://openai.com/index/understanding-neural-networks-through-sparse-circuits/ blog])

* 2026-01: [https://arxiv.org/abs/2601.13548 Patterning: The Dual of Interpretability]
  
 
==Semanticity==
 
 
==Meta-cognition==

* 2025-05: [https://arxiv.org/abs/2505.13763 Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations]

* 2025-12: [https://arxiv.org/abs/2512.15674 Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers]
  
 
==Coding Models==
 
 
==Information Processing/Storage==

* 2024-11: [https://arxiv.org/abs/2411.16679 Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?]

* 2025-03: [https://www.arxiv.org/abs/2503.03961 A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers]

* 2025-12: [https://arxiv.org/abs/2512.22471 The Bayesian Geometry of Transformer Attention]

* 2026-01: [https://arxiv.org/abs/2601.03220 From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence]

==Statistics/Math==
 
==Convergent Representation==

* 2015-11: [https://arxiv.org/abs/1511.07543 Convergent Learning: Do different neural networks learn the same representations?]

* 2025-05: [https://arxiv.org/abs/2505.12540 Harnessing the Universal Geometry of Embeddings]: Evidence for [https://x.com/jxmnop/status/1925224620166128039 The Strong Platonic Representation Hypothesis]; models converge to a single consensus reality

* 2025-12: [https://arxiv.org/abs/2512.03750 Universally Converging Representations of Matter Across Scientific Foundation Models]

==Function Approximation==
 
* 2025-02: [https://arxiv.org/abs/2502.20545 SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers]

* 2025-02: [https://arxiv.org/abs/2502.21212 Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought]

=Physics Based=

* 2014-01: [https://arxiv.org/abs/1401.1219 Consciousness as a State of Matter]

* 2016-08: [https://arxiv.org/abs/1608.08225 Why does deep and cheap learning work so well?]

* 2025-05: [https://arxiv.org/abs/2505.23489 SGD as Free Energy Minimization: A Thermodynamic View on Neural Network Training]

* 2025-12: [https://www.pnas.org/doi/full/10.1073/pnas.2523012122 Heavy-tailed update distributions arise from information-driven self-organization in nonequilibrium learning]

=Failure Modes=
 
==Jagged Frontier==

===See also===

* [[AI_understanding|AI Understanding]] > [[AI_understanding#Psychology|Psychology]] > [[AI_understanding#LLM_personalities|LLM personalities]]

* [[AI tricks]] > [[AI_tricks#Prompt_Engineering|Prompt Engineering]] > [[AI_tricks#Brittleness|Brittleness]]

===Conversely (AI models converge)===

* 2025-12: [https://www.arxiv.org/abs/2512.03750 Universally Converging Representations of Matter Across Scientific Foundation Models]

* 2025-12: [https://arxiv.org/abs/2512.05117 The Universal Weight Subspace Hypothesis]

* 2026-01: [https://avikrishna.substack.com/p/eliciting-frontier-model-character Eliciting Frontier Model Character Training: A study of personality convergence across language models]

==Model Collapse==
 
* 2024-04: [https://arxiv.org/abs/2404.03502 AI and the Problem of Knowledge Collapse]

* 2024-07: [https://www.nature.com/articles/s41586-024-07566-y AI models collapse when trained on recursively generated data]

* 2026-01: [https://arxiv.org/abs/2601.05280 On the Limits of Self-Improving in LLMs and Why AGI, ASI and the Singularity Are Not Near Without Symbolic Model Synthesis]

===Analysis===
 
=Psychology=

* 2025-05: [https://arxiv.org/abs/2505.17117 From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning]

* 2025-07: [https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5357179 Call Me A Jerk: Persuading AI to Comply with Objectionable Requests]

* 2026-01: [https://arxiv.org/abs/2601.06047 "They parted illusions -- they parted disclaim marinade": Misalignment as structural fidelity in LLMs]

==Allow LLM to think==
 
==Self-Awareness and Self-Recognition and Introspection==

* 2025-05: [https://arxiv.org/abs/2505.13763 Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations]

* 2025-10: [https://transformer-circuits.pub/2025/introspection/index.html Emergent Introspective Awareness in Large Language Models] (Anthropic, [https://www.anthropic.com/research/introspection blog])

* 2025-12: [https://www.arxiv.org/abs/2512.24661 Do Large Language Models Know What They Are Capable Of?]

==LLM personalities==

* 2025-07: [https://arxiv.org/abs/2507.02618 Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory]

* 2025-09: [https://arxiv.org/abs/2509.04343 Psychologically Enhanced AI Agents]

* 2026-01: [https://arxiv.org/abs/2601.10387 The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models]

==Quirks & Biases==

Latest revision as of 19:03, 28 January 2026

Interpretability

Concepts

Mechanistic Interpretability

Semanticity

Counter-Results

Meta-cognition

Coding Models

Reward Functions

Symbolic and Notation

Mathematical

Geometric

Topography

Challenges


Heuristic Understanding

Emergent Internal Model Building

Semantic Directions

Directions, e.g.: f(king)-f(man)+f(woman)=f(queen) or f(sushi)-f(Japan)+f(Italy)=f(pizza) (a minimal vector-arithmetic sketch follows below)

Task vectors:

Reasoning:
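The direction arithmetic above amounts to a nearest-neighbor lookup after vector addition and subtraction. A minimal sketch follows; the 4-dimensional vectors and the embedding dictionary are invented purely for illustration, whereas real experiments use learned embeddings such as word2vec vectors or an LLM's token-embedding matrix.

<syntaxhighlight lang="python">
# Toy demonstration of semantic-direction arithmetic: f(king) - f(man) + f(woman) ~ f(queen).
# The 4-d vectors below are invented for illustration; real setups use learned embeddings.
import numpy as np

embedding = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.8, 0.9, 0.0]),
    "man":   np.array([0.1, 0.2, 0.1, 0.0]),
    "woman": np.array([0.1, 0.2, 0.9, 0.0]),
    "sushi": np.array([0.0, 0.1, 0.2, 0.9]),
}

def nearest(vec, exclude):
    """Return the vocabulary word whose embedding is most cosine-similar to vec."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {w: cos(vec, v) for w, v in embedding.items() if w not in exclude}
    return max(scores, key=scores.get)

# king - man + woman should land nearest to "queen"
query = embedding["king"] - embedding["man"] + embedding["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))  # -> queen
</syntaxhighlight>

With real embeddings the equality is only approximate: the analogy is recovered because the offset (e.g. "royalty" or "cuisine-of") is roughly parallel across word pairs, and the answer is taken as the nearest neighbor after excluding the query words.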

Feature Geometry Reproduces Problem-space

Capturing Physics

Theory of Mind

Skeptical

Information Processing

Generalization

Grokking

Tests of Resilience to Dropouts/etc.

* 2024-02: Explorations of Self-Repair in Language Models
* 2024-06: What Matters in Transformers? Not All Attention is Needed
** Removing entire transformer blocks leads to significant performance degradation
** Removing MLP layers results in significant performance degradation
** Removing attention layers causes almost no performance degradation
** E.g., deleting half of the attention layers (a 48% speed-up) leads to only a 2.4% decrease on the benchmarks (a minimal ablation sketch follows this list)
* 2024-06: The Remarkable Robustness of LLMs: Stages of Inference?
** They intentionally break the network (e.g. by swapping layers), yet it continues to work remarkably well. This suggests LLMs are quite robust, and the interventions let the authors identify distinct stages of processing.
** They also use these interventions to infer what different layers are doing, breaking the transformer layers into four stages:
*** Detokenization: Raw tokens are converted into meaningful entities that take into account local context (especially nearby tokens).
*** Feature engineering: Features are progressively refined. Factual knowledge is leveraged.
*** Prediction ensembling: Predictions (for the ultimately-selected next token) emerge. A sort of consensus voting is used, with "prediction neurons" and "suppression neurons" playing a major role in upvoting/downvoting.
*** Residual sharpening: The semantic representations are collapsed into specific next-token predictions. There is a strong emphasis on suppression neurons eliminating options. The confidence is calibrated.
** This structure can be thought of as two halves (roughly dual to each other): the first half broadens (goes from distinct tokens to a rich, elaborate concept-space), and the second half collapses (goes from rich concepts to concrete token predictions).
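The attention-ablation findings above lend themselves to a quick experiment. Below is a minimal sketch, not the papers' exact protocol: GPT-2 small and the one-sentence probe text are stand-ins chosen only for illustration. Forward hooks zero out the attention sublayer's output in the deeper half of the blocks, so the residual stream bypasses those attention layers, and the language-modeling loss is compared against the intact model.

<syntaxhighlight lang="python">
# Sketch of an attention-ablation experiment: zero the attention sublayer's contribution
# in selected GPT-2 blocks and compare language-modeling loss against the intact model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

text = "The robustness of language models to deleted layers is surprisingly high."
batch = tok(text, return_tensors="pt")

def lm_loss():
    with torch.no_grad():
        return model(**batch, labels=batch["input_ids"]).loss.item()

def zero_attn_output(module, inputs, output):
    # Forward hook: replace the attention output with zeros, so the residual
    # connection carries the hidden state past this block's attention unchanged.
    if isinstance(output, tuple):
        return (torch.zeros_like(output[0]),) + output[1:]
    return torch.zeros_like(output)

baseline = lm_loss()

# Ablate attention in the second half of the blocks (an arbitrary illustrative choice).
blocks = model.transformer.h
handles = [blk.attn.register_forward_hook(zero_attn_output)
           for blk in blocks[len(blocks) // 2:]]
ablated = lm_loss()
for h in handles:
    h.remove()

print(f"loss, intact model:             {baseline:.3f}")
print(f"loss, half of attention zeroed: {ablated:.3f}")
</syntaxhighlight>

On a one-sentence probe the loss gap is only suggestive; the cited papers evaluate on full benchmark suites and also ablate MLP layers and whole blocks for comparison.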

Semantic Vectors

Other

Scaling Laws

Information Processing/Storage

Statistics/Math

Tokenization

For numbers/math

Data Storage

Reverse-Engineering Training Data

Compression

Learning/Training

Cross-modal knowledge transfer

Hidden State

Convergent Representation

Function Approximation

Physics Based

Failure Modes

Fracture Representation

Jagged Frontier

See also

Conversely (AI models converge)

Model Collapse

Analysis

Mitigation

Psychology

Allow LLM to think

In-context Learning

Reasoning (CoT, etc.)

Pathfinding

Skeptical

Self-Awareness and Self-Recognition and Introspection

LLM personalities

Quirks & Biases

Vision Models

See Also