Difference between revisions of "AI understanding"

 
==Mechanistic Interpretability==

** [https://transformer-circuits.pub/2025/attribution-graphs/biology.html On the Biology of a Large Language Model]

* 2025-11: OpenAI: [https://cdn.openai.com/pdf/41df8f28-d4ef-43e9-aed2-823f9393e470/circuit-sparsity-paper.pdf Weight-sparse transformers have interpretable circuits] ([https://openai.com/index/understanding-neural-networks-through-sparse-circuits/ blog])

* 2026-01: [https://arxiv.org/abs/2601.13548 Patterning: The Dual of Interpretability]
  
 
==Semanticity==
 
 
==Meta-cognition==

* 2025-05: [https://arxiv.org/abs/2505.13763 Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations]

* 2025-12: [https://arxiv.org/abs/2512.15674 Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers]
  
 
==Coding Models==
 
 
==Information Processing/Storage==

* 2024-11: [https://arxiv.org/abs/2411.16679 Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?]

* 2025-03: [https://www.arxiv.org/abs/2503.03961 A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers]

* 2025-12: [https://arxiv.org/abs/2512.22471 The Bayesian Geometry of Transformer Attention]

* 2026-01: [https://arxiv.org/abs/2601.03220 From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence]

==Statistics/Math==
 
==Convergent Representation==

* 2015-11: [https://arxiv.org/abs/1511.07543 Convergent Learning: Do different neural networks learn the same representations?]

* 2025-05: [https://arxiv.org/abs/2505.12540 Harnessing the Universal Geometry of Embeddings]: Evidence for [https://x.com/jxmnop/status/1925224620166128039 The Strong Platonic Representation Hypothesis]; models converge to a single consensus reality

* 2025-12: [https://arxiv.org/abs/2512.03750 Universally Converging Representations of Matter Across Scientific Foundation Models]

==Function Approximation==
 
* 2025-02: [https://arxiv.org/abs/2502.20545 SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers]

* 2025-02: [https://arxiv.org/abs/2502.21212 Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought]

=Physics Based=

* 2014-01: [https://arxiv.org/abs/1401.1219 Consciousness as a State of Matter]

* 2016-08: [https://arxiv.org/abs/1608.08225 Why does deep and cheap learning work so well?]

* 2025-05: [https://arxiv.org/abs/2505.23489 SGD as Free Energy Minimization: A Thermodynamic View on Neural Network Training]

* 2025-12: [https://www.pnas.org/doi/full/10.1073/pnas.2523012122 Heavy-tailed update distributions arise from information-driven self-organization in nonequilibrium learning]

=Failure Modes=
 
==Jagged Frontier==

===See also===

* [[AI_understanding|AI Understanding]] > [[AI_understanding#Psychology|Psychology]] > [[AI_understanding#LLM_personalities|LLM personalities]]

* [[AI tricks]] > [[AI_tricks#Prompt_Engineering|Prompt Engineering]] > [[AI_tricks#Brittleness|Brittleness]]

===Conversely (AI models converge)===

* 2025-12: [https://www.arxiv.org/abs/2512.03750 Universally Converging Representations of Matter Across Scientific Foundation Models]

* 2025-12: [https://arxiv.org/abs/2512.05117 The Universal Weight Subspace Hypothesis]

* 2026-01: [https://avikrishna.substack.com/p/eliciting-frontier-model-character Eliciting Frontier Model Character Training: A study of personality convergence across language models]

==Model Collapse==
 
* 2024-04: [https://arxiv.org/abs/2404.03502 AI and the Problem of Knowledge Collapse]

* 2024-07: [https://www.nature.com/articles/s41586-024-07566-y AI models collapse when trained on recursively generated data]

* 2026-01: [https://arxiv.org/abs/2601.05280 On the Limits of Self-Improving in LLMs and Why AGI, ASI and the Singularity Are Not Near Without Symbolic Model Synthesis]

===Analysis===
 
=Psychology=

* 2025-05: [https://arxiv.org/abs/2505.17117 From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning]

* 2025-07: [https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5357179 Call Me A Jerk: Persuading AI to Comply with Objectionable Requests]

* 2026-01: [https://arxiv.org/abs/2601.06047 "They parted illusions -- they parted disclaim marinade": Misalignment as structural fidelity in LLMs]

==Allow LLM to think==
 
==Self-Awareness and Self-Recognition and Introspection==

* 2025-05: [https://arxiv.org/abs/2505.13763 Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations]

* 2025-10: [https://transformer-circuits.pub/2025/introspection/index.html Emergent Introspective Awareness in Large Language Models] (Anthropic, [https://www.anthropic.com/research/introspection blog])

* 2025-12: [https://www.arxiv.org/abs/2512.24661 Do Large Language Models Know What They Are Capable Of?]

==LLM personalities==

* 2025-07: [https://arxiv.org/abs/2507.02618 Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory]

* 2025-09: [https://arxiv.org/abs/2509.04343 Psychologically Enhanced AI Agents]

* 2026-01: [https://arxiv.org/abs/2601.10387 The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models]

==Quirks & Biases==

Latest revision as of 19:03, 28 January 2026

Interpretability

Concepts

Mechanistic Interpretability

Semanticity

Counter-Results

Meta-cognition

Coding Models

Reward Functions

Symbolic and Notation

Mathematical

Geometric

Topography

Challenges


Heuristic Understanding

Emergent Internal Model Building

Semantic Directions

Directions, e.g.: f(king)-f(man)+f(woman)=f(queen) or f(sushi)-f(Japan)+f(Italy)=f(pizza) (a minimal vector-arithmetic sketch follows below)

Task vectors:

Reasoning:
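The direction arithmetic above amounts to a nearest-neighbor lookup after vector addition and subtraction. A minimal sketch follows; the 4-dimensional vectors and the embedding dictionary are invented purely for illustration, whereas real experiments use learned embeddings such as word2vec vectors or an LLM's token-embedding matrix.

<syntaxhighlight lang="python">
# Toy demonstration of semantic-direction arithmetic: f(king) - f(man) + f(woman) ~ f(queen).
# The 4-d vectors below are invented for illustration; real setups use learned embeddings.
import numpy as np

embedding = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.8, 0.9, 0.0]),
    "man":   np.array([0.1, 0.2, 0.1, 0.0]),
    "woman": np.array([0.1, 0.2, 0.9, 0.0]),
    "sushi": np.array([0.0, 0.1, 0.2, 0.9]),
}

def nearest(vec, exclude):
    """Return the vocabulary word whose embedding is most cosine-similar to vec."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {w: cos(vec, v) for w, v in embedding.items() if w not in exclude}
    return max(scores, key=scores.get)

# king - man + woman should land nearest to "queen"
query = embedding["king"] - embedding["man"] + embedding["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))  # -> queen
</syntaxhighlight>

With real embeddings the equality is only approximate: the analogy is recovered because the offset (e.g. "royalty" or "cuisine-of") is roughly parallel across word pairs, and the answer is taken as the nearest neighbor after excluding the query words.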

Feature Geometry Reproduces Problem-space

Capturing Physics

Theory of Mind

Skeptical

Information Processing

Generalization

Grokking

Tests of Resilience to Dropouts/etc.

* 2024-02: Explorations of Self-Repair in Language Models
* 2024-06: What Matters in Transformers? Not All Attention is Needed
** Removing entire transformer blocks leads to significant performance degradation
** Removing MLP layers results in significant performance degradation
** Removing attention layers causes almost no performance degradation
** E.g., deleting half of the attention layers (a 48% speed-up) leads to only a 2.4% decrease on the benchmarks (a minimal ablation sketch follows this list)
* 2024-06: The Remarkable Robustness of LLMs: Stages of Inference?
** They intentionally break the network (e.g. by swapping layers), yet it continues to work remarkably well. This suggests LLMs are quite robust, and the interventions let the authors identify distinct stages of processing.
** They also use these interventions to infer what different layers are doing, breaking the transformer layers into four stages:
*** Detokenization: Raw tokens are converted into meaningful entities that take into account local context (especially nearby tokens).
*** Feature engineering: Features are progressively refined. Factual knowledge is leveraged.
*** Prediction ensembling: Predictions (for the ultimately-selected next token) emerge. A sort of consensus voting is used, with "prediction neurons" and "suppression neurons" playing a major role in upvoting/downvoting.
*** Residual sharpening: The semantic representations are collapsed into specific next-token predictions. There is a strong emphasis on suppression neurons eliminating options. The confidence is calibrated.
** This structure can be thought of as two halves (roughly dual to each other): the first half broadens (goes from distinct tokens to a rich, elaborate concept-space), and the second half collapses (goes from rich concepts to concrete token predictions).
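The attention-ablation findings above lend themselves to a quick experiment. Below is a minimal sketch, not the papers' exact protocol: GPT-2 small and the one-sentence probe text are stand-ins chosen only for illustration. Forward hooks zero out the attention sublayer's output in the deeper half of the blocks, so the residual stream bypasses those attention layers, and the language-modeling loss is compared against the intact model.

<syntaxhighlight lang="python">
# Sketch of an attention-ablation experiment: zero the attention sublayer's contribution
# in selected GPT-2 blocks and compare language-modeling loss against the intact model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

text = "The robustness of language models to deleted layers is surprisingly high."
batch = tok(text, return_tensors="pt")

def lm_loss():
    with torch.no_grad():
        return model(**batch, labels=batch["input_ids"]).loss.item()

def zero_attn_output(module, inputs, output):
    # Forward hook: replace the attention output with zeros, so the residual
    # connection carries the hidden state past this block's attention unchanged.
    if isinstance(output, tuple):
        return (torch.zeros_like(output[0]),) + output[1:]
    return torch.zeros_like(output)

baseline = lm_loss()

# Ablate attention in the second half of the blocks (an arbitrary illustrative choice).
blocks = model.transformer.h
handles = [blk.attn.register_forward_hook(zero_attn_output)
           for blk in blocks[len(blocks) // 2:]]
ablated = lm_loss()
for h in handles:
    h.remove()

print(f"loss, intact model:             {baseline:.3f}")
print(f"loss, half of attention zeroed: {ablated:.3f}")
</syntaxhighlight>

On a one-sentence probe the loss gap is only suggestive; the cited papers evaluate on full benchmark suites and also ablate MLP layers and whole blocks for comparison.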

Semantic Vectors

Other

Scaling Laws

Information Processing/Storage

Statistics/Math

Tokenization

For numbers/math

Data Storage

Reverse-Engineering Training Data

Compression

Learning/Training

Cross-modal knowledge transfer

Hidden State

Convergent Representation

Function Approximation

Physics Based

Failure Modes

Fracture Representation

Jagged Frontier

See also

Conversely (AI models converge)

Model Collapse

Analysis

Mitigation

Psychology

Allow LLM to think

In-context Learning

Reasoning (CoT, etc.)

Pathfinding

Skeptical

Self-Awareness and Self-Recognition and Introspection

LLM personalities

Quirks & Biases

Vision Models

See Also