Difference between revisions of "AI understanding"

From GISAXS
(Model Collapse)
(Geometric)
 
(62 intermediate revisions by the same user not shown)
Line 2: Line 2:
 
* 2017-04: [https://arxiv.org/abs/1704.01444 Learning to Generate Reviews and Discovering Sentiment]
 
* 2017-04: [https://arxiv.org/abs/1704.01444 Learning to Generate Reviews and Discovering Sentiment]
 
* 2025-02: [https://arxiv.org/abs/2502.11639 Neural Interpretable Reasoning]
 
* 2025-02: [https://arxiv.org/abs/2502.11639 Neural Interpretable Reasoning]
 +
 +
==Concepts==
 +
* 2025-04: [https://arxiv.org/abs/2504.20938 Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition] ([https://github.com/OpenMOSS/Lorsa code])
 +
* 2025-08: [https://transformer-circuits.pub/2025/attention-qk/index.html Tracing Attention Computation Through Feature Interactions]
  
 
==Mechanistic Interpretability==
 
==Mechanistic Interpretability==
Line 36: Line 40:
 
* 2025-03: [https://arxiv.org/abs/2503.01824 From superposition to sparse codes: interpretable representations in neural networks]
 
* 2025-03: [https://arxiv.org/abs/2503.01824 From superposition to sparse codes: interpretable representations in neural networks]
 
* 2025-03: [https://arxiv.org/abs/2503.18878 I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders]
 
* 2025-03: [https://arxiv.org/abs/2503.18878 I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders]
 +
* 2025-05: [https://arxiv.org/abs/2505.20063 SAEs Are Good for Steering -- If You Select the Right Features]
 +
* 2025-06: [https://arxiv.org/abs/2506.15679 Dense SAE Latents Are Features, Not Bugs]
 +
* 2025-06: [https://arxiv.org/abs/2506.20790 Stochastic Parameter Decomposition] ([https://github.com/goodfire-ai/spd code], [https://www.goodfire.ai/papers/stochastic-param-decomp blog])
 +
* 2025-08: [https://arxiv.org/abs/2508.10003 Semantic Structure in Large Language Model Embeddings]
  
 
===Counter-Results===
 
===Counter-Results===
Line 44: Line 52:
 
* 2025-02: [https://arxiv.org/abs/2502.04878 Sparse Autoencoders Do Not Find Canonical Units of Analysis]
 
* 2025-02: [https://arxiv.org/abs/2502.04878 Sparse Autoencoders Do Not Find Canonical Units of Analysis]
 
* 2025-03: [https://www.alignmentforum.org/posts/4uXCAJNuPKtKBsi28/ Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research]
 
* 2025-03: [https://www.alignmentforum.org/posts/4uXCAJNuPKtKBsi28/ Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research]
 +
 +
==Meta-cognition==
 +
* 2025-05: [https://arxiv.org/abs/2505.13763 Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations]
  
 
==Coding Models==
 
==Coding Models==
Line 74: Line 85:
 
** Tegmark et al. report multi-scale structure: 1) “atomic” small-scale, 2) “brain” intermediate-scale, and 3) “galaxy” large-scale
 
** Tegmark et al. report multi-scale structure: 1) “atomic” small-scale, 2) “brain” intermediate-scale, and 3) “galaxy” large-scale
 
* 2025-02: [https://arxiv.org/abs/2502.08009 The Geometry of Prompting: Unveiling Distinct Mechanisms of Task Adaptation in Language Models]
 
* 2025-02: [https://arxiv.org/abs/2502.08009 The Geometry of Prompting: Unveiling Distinct Mechanisms of Task Adaptation in Language Models]
 +
* 2025-08: [https://arxiv.org/abs/2508.10003 Semantic Structure in Large Language Model Embeddings]
 +
* 2025-10: [https://arxiv.org/abs/2510.09782 The Geometry of Reasoning: Flowing Logics in Representation Space]
  
 
==Topography==
 
==Topography==
Line 89: Line 102:
 
* 2023-07: [https://arxiv.org/abs/2307.15936 A Theory for Emergence of Complex Skills in Language Models]
 
* 2023-07: [https://arxiv.org/abs/2307.15936 A Theory for Emergence of Complex Skills in Language Models]
 
* 2024-06: [https://arxiv.org/abs/2406.19370v1 Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space]
 
* 2024-06: [https://arxiv.org/abs/2406.19370v1 Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space]
 +
* 2025-09: [https://arxiv.org/abs/2509.20328 Video models are zero-shot learners and reasoners]
  
 
===Semantic Directions===
 
===Semantic Directions===
Line 104: Line 118:
 
* [https://www.alignmentforum.org/posts/5FGXmJ3wqgGRcbyH7/extracting-sae-task-features-for-in-context-learning Extracting SAE task features for in-context learning]
 
* [https://www.alignmentforum.org/posts/5FGXmJ3wqgGRcbyH7/extracting-sae-task-features-for-in-context-learning Extracting SAE task features for in-context learning]
 
* [https://arxiv.org/abs/2412.12276 Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers]
 
* [https://arxiv.org/abs/2412.12276 Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers]
 +
Reasoning:
 +
* [https://openreview.net/forum?id=OwhVWNOBcz Understanding Reasoning in Thinking Language Models via Steering Vectors]
  
 
===Feature Geometry Reproduces Problem-space===
 
===Feature Geometry Reproduces Problem-space===
Line 127: Line 143:
 
* [https://arxiv.org/abs/2410.13787 Looking Inward: Language Models Can Learn About Themselves by Introspection]
 
* [https://arxiv.org/abs/2410.13787 Looking Inward: Language Models Can Learn About Themselves by Introspection]
 
* [https://arxiv.org/abs/2501.11120 Tell me about yourself: LLMs are aware of their learned behaviors]
 
* [https://arxiv.org/abs/2501.11120 Tell me about yourself: LLMs are aware of their learned behaviors]
 +
* 2025-10: [https://arxiv.org/abs/2509.22887 Infusing Theory of Mind into Socially Intelligent LLM Agents]
  
 
===Skeptical===
 
===Skeptical===
* [https://www.arxiv.org/abs/2501.09038 Do generative video models learn physical principles from watching videos?] ([https://physics-iq.github.io/ project], [https://github.com/google-deepmind/physics-IQ-benchmark code])
+
* 2025-01: [https://www.arxiv.org/abs/2501.09038 Do generative video models learn physical principles from watching videos?] ([https://physics-iq.github.io/ project], [https://github.com/google-deepmind/physics-IQ-benchmark code])
 +
* 2025-06: [https://machinelearning.apple.com/research/illusion-of-thinking The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity]
 +
* 2025-06: [https://arxiv.org/abs/2506.21521 Potemkin Understanding in Large Language Models]
 +
* 2025-06: [https://arxiv.org/abs/2506.21876 Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation]
  
 
==Information Processing==
 
==Information Processing==
 +
* 2019-03: [https://arxiv.org/abs/1903.05789 Diagnosing and Enhancing VAE Models]
 
* 2021-03: [https://arxiv.org/abs/2103.05247 Pretrained Transformers as Universal Computation Engines]
 
* 2021-03: [https://arxiv.org/abs/2103.05247 Pretrained Transformers as Universal Computation Engines]
 +
* 2022-10: [https://arxiv.org/abs/2210.08344 How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders]
 
* 2023-04: [https://arxiv.org/abs/2304.03843 Why think step by step? Reasoning emerges from the locality of experience]
 
* 2023-04: [https://arxiv.org/abs/2304.03843 Why think step by step? Reasoning emerges from the locality of experience]
 
* 2023-10: [https://arxiv.org/abs/2310.04444 What's the Magic Word? A Control Theory of LLM Prompting]
 
* 2023-10: [https://arxiv.org/abs/2310.04444 What's the Magic Word? A Control Theory of LLM Prompting]
Line 141: Line 163:
 
** Model depth matters for reasoning. This cannot be mitigated by chain-of-thought prompting (which allows models to develop and then execute plans), since even a single CoT step may require deep, multi-step reasoning/planning.
 
** Model depth matters for reasoning. This cannot be mitigated by chain-of-thought prompting (which allows models to develop and then execute plans), since even a single CoT step may require deep, multi-step reasoning/planning.
 
* 2024-11: [https://arxiv.org/abs/2411.01992 Ask, and it shall be given: Turing completeness of prompting]
 
* 2024-11: [https://arxiv.org/abs/2411.01992 Ask, and it shall be given: Turing completeness of prompting]
 +
* 2025-04: [https://arxiv.org/abs/2504.08775 Layers at Similar Depths Generate Similar Activations Across LLM Architectures]
  
 
===Generalization===
 
===Generalization===
Line 151: Line 174:
 
* 2024-02: [https://arxiv.org/abs/2402.15175 Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition]
 
* 2024-02: [https://arxiv.org/abs/2402.15175 Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition]
 
* 2024-12: [https://arxiv.org/abs/2412.18624 How to explain grokking]
 
* 2024-12: [https://arxiv.org/abs/2412.18624 How to explain grokking]
 +
* 2024-12: [https://arxiv.org/abs/2412.09810 The Complexity Dynamics of Grokking]
 +
* 2025-09: [https://arxiv.org/abs/2509.21519 Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking]
  
 
===Tests of Resilience to Dropouts/etc.===
 
===Tests of Resilience to Dropouts/etc.===
Line 181: Line 206:
  
 
===Scaling Laws===
 
===Scaling Laws===
 +
* 1993: [https://proceedings.neurips.cc/paper/1993/file/1aa48fc4880bb0c9b8a3bf979d3b917e-Paper.pdf Learning Curves: Asymptotic Values and Rate of Convergence]
 
* 2017-12: [https://arxiv.org/abs/1712.00409 Deep Learning Scaling is Predictable, Empirically] (Baidu)
 
* 2017-12: [https://arxiv.org/abs/1712.00409 Deep Learning Scaling is Predictable, Empirically] (Baidu)
 
* 2019-03: [http://www.incompleteideas.net/IncIdeas/BitterLesson.html The Bitter Lesson] (Rich Sutton)
 
* 2019-03: [http://www.incompleteideas.net/IncIdeas/BitterLesson.html The Bitter Lesson] (Rich Sutton)
Line 192: Line 218:
 
* 2025-03: [https://arxiv.org/abs/2503.10061 Compute Optimal Scaling of Skills: Knowledge vs Reasoning]
 
* 2025-03: [https://arxiv.org/abs/2503.10061 Compute Optimal Scaling of Skills: Knowledge vs Reasoning]
 
* 2025-04: [https://arxiv.org/abs/2504.07951 Scaling Laws for Native Multimodal Models]
 
* 2025-04: [https://arxiv.org/abs/2504.07951 Scaling Laws for Native Multimodal Models]
 +
* 2025-05: [https://brendel-group.github.io/llm-line/ LLMs on the Line: Data Determines Loss-To-Loss Scaling Laws]
 +
* 2025-10: [https://arxiv.org/abs/2510.13786 The Art of Scaling Reinforcement Learning Compute for LLMs]
  
 
=Information Processing/Storage=
 
=Information Processing/Storage=
 
* 2020-02: [https://arxiv.org/abs/2002.10689 A Theory of Usable Information Under Computational Constraints]
 
* 2020-02: [https://arxiv.org/abs/2002.10689 A Theory of Usable Information Under Computational Constraints]
 +
* 2021-04: [https://arxiv.org/abs/2104.00008 Why is AI hard and Physics simple?]
 +
* 2021-06: [https://arxiv.org/abs/2106.06981 Thinking Like Transformers]
 
* "A transformer's depth affects its reasoning capabilities, whilst model size affects its knowledge capacity" ([https://x.com/danielhanchen/status/1835684061475655967 cf.])
 
* "A transformer's depth affects its reasoning capabilities, whilst model size affects its knowledge capacity" ([https://x.com/danielhanchen/status/1835684061475655967 cf.])
 
** 2024-02: [https://arxiv.org/abs/2402.14905 MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases]
 
** 2024-02: [https://arxiv.org/abs/2402.14905 MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases]
Line 203: Line 233:
 
* 2024-11: [https://arxiv.org/abs/2411.16679 Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?]
 
* 2024-11: [https://arxiv.org/abs/2411.16679 Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?]
 
* 2025-03: [https://www.arxiv.org/abs/2503.03961 A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers]
 
* 2025-03: [https://www.arxiv.org/abs/2503.03961 A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers]
 +
 +
==Statistics/Math==
 +
* 2023-05: [https://arxiv.org/abs/2305.05465 The emergence of clusters in self-attention dynamics]
 +
* 2023-12: [https://arxiv.org/abs/2312.10794 A mathematical perspective on Transformers]
 +
* 2024-07: [https://arxiv.org/abs/2407.12034 Understanding Transformers via N-gram Statistics]
 +
* 2024-10: [https://arxiv.org/abs/2410.06833 Dynamic metastability in the self-attention model]
 +
* 2024-11: [https://arxiv.org/abs/2411.04551 Measure-to-measure interpolation using Transformers]
 +
* 2025-04: [https://arxiv.org/abs/2504.14697 Quantitative Clustering in Mean-Field Transformer Models]
  
 
==Tokenization==
 
==Tokenization==
 
===For numbers/math===
 
===For numbers/math===
 
* 2024-02: [https://arxiv.org/abs/2402.14903 Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs]: L2R vs. R2L yields different performance on math
 
* 2024-02: [https://arxiv.org/abs/2402.14903 Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs]: L2R vs. R2L yields different performance on math
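
A minimal sketch of what L2R vs. R2L means for number tokenization. The 3-digit chunk size and the function names <code>chunk_l2r</code> and <code>chunk_r2l</code> are assumptions for illustration, not taken from the paper. The point is only that the same digit string splits differently depending on the direction of chunking, so place values line up with token boundaries in one case and not the other.

```python
def chunk_l2r(digits, size=3):
    """Split a digit string into chunks starting from the left."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_r2l(digits, size=3):
    """Split a digit string into chunks aligned to the right (like thousands separators)."""
    head = len(digits) % size  # length of the short leading chunk, if any
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return chunks


if __name__ == "__main__":
    number = "1234567"
    print("L2R:", chunk_l2r(number))
    print("R2L:", chunk_r2l(number))
```

With R2L chunking the final chunk always holds the units, tens, and hundreds, which is the alignment used in column-wise arithmetic; with L2R chunking the alignment depends on the length of the number.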
 +
 +
==Data Storage==
 +
* 1988-09: [https://www.sciencedirect.com/science/article/pii/0885064X88900209 On the capabilities of multilayer perceptrons]
 +
* 2006-12: [https://ieeexplore.ieee.org/document/4038449 Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition] (single-layer perceptron stores >2 bits/parameter; MLP ~ 2*N<sup>2</sup> bits w/ N<sup>2</sup> params)
 +
* 2016-11: [https://arxiv.org/abs/1611.09913 Capacity and Trainability in Recurrent Neural Networks] (5 bits/param)
 +
* 2018-02: [https://arxiv.org/abs/1802.08232 The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks]
 +
* 2019-05: [https://ieeexplore.ieee.org/document/8682462 Memorization Capacity of Deep Neural Networks under Parameter Quantization]
 +
* 2020-02: [https://arxiv.org/abs/2002.08910 How Much Knowledge Can You Pack Into the Parameters of a Language Model?]
 +
* 2020-08: [https://arxiv.org/abs/2008.09036 Language Models as Knowledge Bases: On Entity Representations, Storage Capacity, and Paraphrased Queries] (capacity scales linearly with parameters; more training samples leads to less memorization)
 +
* 2020-12: [https://arxiv.org/abs/2012.06421 When is Memorization of Irrelevant Training Data Necessary for High-Accuracy Learning?]
 +
* 2024-04: [https://arxiv.org/abs/2404.05405 Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws] (2 bits/param)
 +
* 2024-06: [https://arxiv.org/abs/2406.15720 Scaling Laws for Fact Memorization of Large Language Models] (1T params needed to memorize Wikipedia)
 +
* 2024-12: [https://arxiv.org/abs/2412.09810 The Complexity Dynamics of Grokking]
 +
* 2025-05: [https://arxiv.org/abs/2505.24832 How much do language models memorize?] (3.6 bits/parameter)
 +
* 2025-06: [https://arxiv.org/abs/2506.01855 Trade-offs in Data Memorization via Strong Data Processing Inequalities]
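
As a rough worked example of the bits/parameter figures quoted above, the sketch below converts a per-parameter estimate into total storage. The 7-billion-parameter model size is a hypothetical chosen for illustration, and the quoted estimates come from different architectures and experimental setups, so the comparison is only indicative.

```python
def capacity_gigabytes(num_params, bits_per_param):
    """Total stored information in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Per-parameter estimates as quoted in the list above.
ESTIMATES = {
    "2 bits/param": 2.0,
    "3.6 bits/param": 3.6,
    "5 bits/param": 5.0,
}

if __name__ == "__main__":
    num_params = 7e9  # hypothetical model size, not from any cited paper
    for label, bits in ESTIMATES.items():
        print(f"{label}: {capacity_gigabytes(num_params, bits):.2f} GB")
```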
 +
 +
===Reverse-Engineering Training Data===
 +
* 2025-06: [https://arxiv.org/abs/2506.10364 Can We Infer Confidential Properties of Training Data from LLMs?]
 +
* 2025-06: [https://arxiv.org/abs/2506.15553 Approximating Language Model Training Data from Weights]
 +
 +
===Compression===
 +
* 2022-12: [https://arxiv.org/abs/2212.09410 Less is More: Parameter-Free Text Classification with Gzip]
 +
* 2023-06: [https://arxiv.org/abs/2306.04050 LLMZip: Lossless Text Compression using Large Language Models]
 +
* 2023-07: [https://aclanthology.org/2023.findings-acl.426/ “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors]
 +
* 2023-09: [https://arxiv.org/abs/2309.10668 Language Modeling Is Compression]
 +
* 2024-06: [https://arxiv.org/abs/2406.07550 An Image is Worth 32 Tokens for Reconstruction and Generation]
  
 
==Learning/Training==
 
==Learning/Training==
Line 212: Line 276:
 
* 2024-12: [https://arxiv.org/abs/2412.11521 On the Ability of Deep Networks to Learn Symmetries from Data: A Neural Kernel Theory]
 
* 2024-12: [https://arxiv.org/abs/2412.11521 On the Ability of Deep Networks to Learn Symmetries from Data: A Neural Kernel Theory]
 
* 2025-01: [https://arxiv.org/abs/2501.12391 Physics of Skill Learning]
 
* 2025-01: [https://arxiv.org/abs/2501.12391 Physics of Skill Learning]
 +
* 2025-05: [https://arxiv.org/abs/2505.24864 ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models]
  
 
===Cross-modal knowledge transfer===
 
===Cross-modal knowledge transfer===
Line 221: Line 286:
 
* 2025-02: [https://arxiv.org/abs/2502.06258 Emergent Response Planning in LLM]: They show that the latent representation contains information beyond that needed for the next token (i.e. the model learns to "plan ahead" and encode information relevant to future tokens)
 
* 2025-02: [https://arxiv.org/abs/2502.06258 Emergent Response Planning in LLM]: They show that the latent representation contains information beyond that needed for the next token (i.e. the model learns to "plan ahead" and encode information relevant to future tokens)
 
* 2025-03: [https://arxiv.org/abs/2503.02854 (How) Do Language Models Track State?]
 
* 2025-03: [https://arxiv.org/abs/2503.02854 (How) Do Language Models Track State?]
 +
===Convergent Representation===
 +
* 2015-11: [https://arxiv.org/abs/1511.07543 Convergent Learning: Do different neural networks learn the same representations?]
 +
* 2025-05: [https://arxiv.org/abs/2505.12540 Harnessing the Universal Geometry of Embeddings]: Evidence for [https://x.com/jxmnop/status/1925224620166128039 The Strong Platonic Representation Hypothesis]; models converge to a single consensus reality
  
 
==Function Approximation==
 
==Function Approximation==
Line 238: Line 306:
 
* 2023-09: [https://arxiv.org/abs/2309.13638 Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve] (biases towards "common" numbers, in-context CoT can reduce performance by incorrectly priming, etc.)
 
* 2023-09: [https://arxiv.org/abs/2309.13638 Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve] (biases towards "common" numbers, in-context CoT can reduce performance by incorrectly priming, etc.)
 
* 2023-11: [https://arxiv.org/abs/2311.16093 Visual cognition in multimodal large language models] (models lack human-like visual understanding)
 
* 2023-11: [https://arxiv.org/abs/2311.16093 Visual cognition in multimodal large language models] (models lack human-like visual understanding)
 +
 +
==Fracture Representation==
 +
* 2025-05: [https://arxiv.org/abs/2505.11581 Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis] ([https://github.com/akarshkumar0101/fer code])
  
 
==Jagged Frontier==
 
==Jagged Frontier==
 +
* 2023-09: [https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321 Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality]
 
* 2024-07: [https://arxiv.org/abs/2407.03211 How Does Quantization Affect Multilingual LLMs?]: Quantization degrades different languages by differing amounts
 
* 2024-07: [https://arxiv.org/abs/2407.03211 How Does Quantization Affect Multilingual LLMs?]: Quantization degrades different languages by differing amounts
 
* 2025-03: [https://arxiv.org/abs/2503.10061v1 Compute Optimal Scaling of Skills: Knowledge vs Reasoning]: Scaling laws are skill-dependent
 
* 2025-03: [https://arxiv.org/abs/2503.10061v1 Compute Optimal Scaling of Skills: Knowledge vs Reasoning]: Scaling laws are skill-dependent
 +
 +
===See also===
 +
* [[AI_understanding|AI Understanding]] > [[AI_understanding#Psychology|Psychology]] > [[AI_understanding#LLM_personalities|LLM personalities]]
 +
* [[AI tricks]] > [[AI_tricks#Prompt_Engineering|Prompt Engineering]] > [[AI_tricks#Brittleness|Brittleness]]
  
 
==Model Collapse==
 
==Model Collapse==
Line 266: Line 342:
 
=Psychology=
 
=Psychology=
 
* 2023-04: [https://arxiv.org/abs/2304.11111 Inducing anxiety in large language models can induce bias]
 
* 2023-04: [https://arxiv.org/abs/2304.11111 Inducing anxiety in large language models can induce bias]
 +
* 2025-05: [https://arxiv.org/abs/2505.17117 From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning]
 +
* 2025-07: [https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5357179 Call Me A Jerk: Persuading AI to Comply with Objectionable Requests]
  
 
==Allow LLM to think==
 
==Allow LLM to think==
Line 276: Line 354:
 
* 2022-11: [https://arxiv.org/abs/2211.15661 What learning algorithm is in-context learning? Investigations with linear models]
 
* 2022-11: [https://arxiv.org/abs/2211.15661 What learning algorithm is in-context learning? Investigations with linear models]
 
* 2022-12: [https://arxiv.org/abs/2212.07677 Transformers learn in-context by gradient descent]
 
* 2022-12: [https://arxiv.org/abs/2212.07677 Transformers learn in-context by gradient descent]
 +
* 2025-07: [https://arxiv.org/abs/2507.16003 Learning without training: The implicit dynamics of in-context learning]
  
 
==Reasoning (CoT, etc.)==
 
==Reasoning (CoT, etc.)==
Line 281: Line 360:
 
* 2025-01: [https://arxiv.org/abs/2501.18585 Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs]
 
* 2025-01: [https://arxiv.org/abs/2501.18585 Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs]
 
* 2025-01: [https://arxiv.org/abs/2501.08156 Are DeepSeek R1 And Other Reasoning Models More Faithful?]: reasoning models can provide faithful explanations for why their reasoning is correct
 
* 2025-01: [https://arxiv.org/abs/2501.08156 Are DeepSeek R1 And Other Reasoning Models More Faithful?]: reasoning models can provide faithful explanations for why their reasoning is correct
 +
* 2025-03: [https://arxiv.org/abs/2503.08679 Chain-of-Thought Reasoning In The Wild Is Not Always Faithful]
 
* 2025-04: [https://arxiv.org/abs/2504.04022 Rethinking Reflection in Pre-Training]: pre-training alone already provides some amount of reflection/reasoning
 
* 2025-04: [https://arxiv.org/abs/2504.04022 Rethinking Reflection in Pre-Training]: pre-training alone already provides some amount of reflection/reasoning
 +
* 2025-07: [https://arxiv.org/abs/2501.18858 BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning]
 +
 +
===Pathfinding===
 +
* 2024-08: [https://arxiv.org/abs/2408.08152 DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search]
 +
* 2025-06: [https://arxiv.org/abs/2506.01939 Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning]
 +
* 2025-09: [https://arxiv.org/abs/2509.09284 Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning]
 +
* 2025-09: [https://arxiv.org/abs/2509.06160v1 Reverse-Engineered Reasoning for Open-Ended Generation]
 +
 +
===Skeptical===
 +
* 2025-06: [https://arxiv.org/abs/2506.06941 The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity]
 +
* 2025-08: [https://www.arxiv.org/abs/2508.01191 Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens]
  
 
==Self-Awareness and Self-Recognition==
 
==Self-Awareness and Self-Recognition==
Line 287: Line 378:
 
* 2024-12: [https://theaidigest.org/self-awareness AIs are becoming more self-aware. Here's why that matters]
 
* 2024-12: [https://theaidigest.org/self-awareness AIs are becoming more self-aware. Here's why that matters]
 
* 2025-04: [https://x.com/Josikinz/status/1907923319866716629 LLMs can guess which comic strip was generated by themselves (vs. other LLM)]
 
* 2025-04: [https://x.com/Josikinz/status/1907923319866716629 LLMs can guess which comic strip was generated by themselves (vs. other LLM)]
 +
* 2025-05: [https://arxiv.org/abs/2505.13763 Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations]
 +
 +
==LLM personalities==
 +
* 2025-07: [https://arxiv.org/abs/2507.02618 Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory]
 +
* 2025-09: [https://arxiv.org/abs/2509.04343 Psychologically Enhanced AI Agents]
 +
 +
==Quirks & Biases==
 +
* 2025-04: [https://www.cambridge.org/core/journals/judgment-and-decision-making/article/artificial-intelligence-and-dichotomania/0421D2310727D73FAB47069FD1620AA1 Artificial intelligence and dichotomania]
 +
* 2025-09: [https://arxiv.org/abs/2509.22818 Can Large Language Models Develop Gambling Addiction?]
 +
 +
=Vision Models=
 +
* 2017-11: Distill: [https://distill.pub/2017/feature-visualization/ Feature Visualization: How neural networks build up their understanding of images]
 +
* 2021-01: [https://arxiv.org/abs/2101.12322 Position, Padding and Predictions: A Deeper Look at Position Information in CNNs]
 +
* 2025-04: [https://arxiv.org/abs/2504.13181 Perception Encoder: The best visual embeddings are not at the output of the network] ([https://github.com/facebookresearch/perception_models code])
  
 
=See Also=
 
=See Also=
 +
* [[AI]]
 
* [[AI tools]]
 
* [[AI tools]]
 
* [[AI agents]]
 
* [[AI agents]]
 
* [[Robots]]
 
* [[Robots]]

Latest revision as of 12:48, 17 October 2025

Interpretability

Concepts

Mechanistic Interpretability

Semanticity

Counter-Results

Meta-cognition

Coding Models

Reward Functions

Symbolic and Notation

Mathematical

Geometric

Topography

Challenges


Heuristic Understanding

Emergent Internal Model Building

Semantic Directions

Directions, e.g.: f(king)-f(man)+f(woman)=f(queen) or f(sushi)-f(Japan)+f(Italy)=f(pizza)
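
The direction arithmetic above can be illustrated with a toy example. The 3-dimensional vectors below are hand-built assumptions (axes loosely meaning royalty, maleness, femaleness), not embeddings from any real model; learned embeddings are high-dimensional and the analogy only holds approximately.

```python
import math

# Hypothetical toy embeddings, hand-built for illustration only.
EMB = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c, emb=EMB):
    """Return the word whose vector is closest to f(a) - f(b) + f(c)."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    # Exclude the query words themselves, as is conventional for analogy tests.
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


if __name__ == "__main__":
    print(analogy("king", "man", "woman"))
```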

Task vectors:

Reasoning:

Feature Geometry Reproduces Problem-space

Capturing Physics

Theory of Mind

Skeptical

Information Processing

Generalization

Grokking

Tests of Resilience to Dropouts/etc.

  • 2024-02: Explorations of Self-Repair in Language Models
  • 2024-06: What Matters in Transformers? Not All Attention is Needed
    • Removing entire transformer blocks leads to significant performance degradation
    • Removing MLP layers results in significant performance degradation
    • Removing attention layers causes almost no performance degradation
    • E.g. deleting half of the attention layers (48% speed-up) leads to only a 2.4% decrease on the benchmarks
  • 2024-06: The Remarkable Robustness of LLMs: Stages of Inference?
    • They intentionally break the network (swapping layers), yet it continues to work remarkably well. This suggests LLMs are quite robust, and allows the authors to identify different stages in processing.
    • They also use these interventions to infer what different layers are doing. They break apart the LLM transformer layers into four stages:
      • Detokenization: Raw tokens are converted into meaningful entities that take into account local context (especially using nearby tokens).
      • Feature engineering: Features are progressively refined. Factual knowledge is leveraged.
      • Prediction ensembling: Predictions (for the ultimately-selected next-token) emerge. A sort of consensus voting is used, with “prediction neurons” and “suppression neurons” playing a major role in upvoting/downvoting.
      • Residual sharpening: The semantic representations are collapsed into specific next-token predictions. There is a strong emphasis on suppression neurons eliminating options. The confidence is calibrated.
    • This structure can be thought of as two halves (being roughly dual to each other): the first half broadens (goes from distinct tokens to a rich/elaborate concept-space) and the second half collapses (goes from rich concepts to concrete token predictions).
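
The interventions described above (deleting a layer, swapping two layers) can be sketched on a toy residual stack. Everything here is an assumption for illustration: the scalar "layers" stand in for transformer blocks, and the helper names <code>run</code>, <code>skip_layer</code>, and <code>swap_layers</code> are hypothetical. The sketch shows the mechanics of the interventions only and says nothing about how robust a real LLM is.

```python
def run(layers, x):
    """Apply residual layers in order: x <- x + f(x)."""
    for f in layers:
        x = x + f(x)
    return x


def skip_layer(layers, i):
    """Return a copy of the stack with layer i deleted."""
    return layers[:i] + layers[i + 1:]


def swap_layers(layers, i, j):
    """Return a copy of the stack with layers i and j exchanged."""
    out = list(layers)
    out[i], out[j] = out[j], out[i]
    return out


# Toy scalar layers; each contributes a small update to the residual stream.
LAYERS = [
    lambda x: 0.10 * x,
    lambda x: 0.05 * (1 - x),
    lambda x: 0.02 * x * x,
    lambda x: -0.01 * x,
]

if __name__ == "__main__":
    baseline = run(LAYERS, 1.0)
    for i in range(len(LAYERS)):
        ablated = run(skip_layer(LAYERS, i), 1.0)
        print(f"skip layer {i}: relative change {abs(ablated - baseline) / abs(baseline):.4f}")
    swapped = run(swap_layers(LAYERS, 1, 2), 1.0)
    print(f"swap layers 1 and 2: relative change {abs(swapped - baseline) / abs(baseline):.4f}")
```

Because each layer only adds to the running value, removing or reordering a layer perturbs the output rather than breaking the computation outright, which is the structural property that makes such interventions possible in residual architectures.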

Semantic Vectors

Other

Scaling Laws

Information Processing/Storage

Statistics/Math

Tokenization

For numbers/math

Data Storage

Reverse-Engineering Training Data

Compression

Learning/Training

Cross-modal knowledge transfer

Hidden State

Convergent Representation

Function Approximation

Failure Modes

Fracture Representation

Jagged Frontier

See also

Model Collapse

Analysis

Mitigation

Psychology

Allow LLM to think

In-context Learning

Reasoning (CoT, etc.)

Pathfinding

Skeptical

Self-Awareness and Self-Recognition

LLM personalities

Quirks & Biases

Vision Models

See Also