Difference between revisions of "AI understanding"

From GISAXS
 
(7 intermediate revisions by the same user not shown)
Line 12: Line 12:
  
 
==Semanticity==
 
* 2023-09: [https://arxiv.org/abs/2309.08600 Sparse Autoencoders Find Highly Interpretable Features in Language Models]
 
* Anthropic monosemanticity interpretation of LLM features:
** 2023-10: [https://transformer-circuits.pub/2023/monosemantic-features/index.html Towards Monosemanticity: Decomposing Language Models With Dictionary Learning]
** 2024-05: [https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet]
* 2024-06: OpenAI: [https://arxiv.org/abs/2406.04093 Scaling and evaluating sparse autoencoders]
* 2024-08: [https://www.alignmentforum.org/posts/TMAmHh4DdMr4nCSr5/showing-sae-latents-are-not-atomic-using-meta-saes Showing SAE Latents Are Not Atomic Using Meta-SAEs] ([https://metasae.streamlit.app/?page=Feature+Explorer&feature=11329 demo])
* 2024-10: [https://arxiv.org/abs/2410.08201 Efficient Dictionary Learning with Switch Sparse Autoencoders] ([https://github.com/amudide/switch_sae code]) More efficient SAE generation
* 2024-10: [https://arxiv.org/abs/2410.14670 Decomposing The Dark Matter of Sparse Autoencoders] ([https://github.com/JoshEngels/SAE-Dark-Matter code]) Shows that SAE errors are predictable
* 2024-10: [https://arxiv.org/abs/2410.13928 Automatically Interpreting Millions of Features in Large Language Models]
* 2024-12: [https://arxiv.org/abs/2412.04139 Monet: Mixture of Monosemantic Experts for Transformers]
* 2024-12: [https://www.lesswrong.com/posts/zbebxYCqsryPALh8C/matryoshka-sparse-autoencoders Matryoshka Sparse Autoencoders]
* 2024-12: [https://www.alignmentforum.org/posts/rKM9b6B2LqwSB5ToN/learning-multi-level-features-with-matryoshka-saes Learning Multi-Level Features with Matryoshka SAEs]
* 2025-01: [https://arxiv.org/abs/2501.19406 Low-Rank Adapting Models for Sparse Autoencoders]
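A minimal sketch of the sparse-autoencoder (dictionary-learning) setup these papers build on: an overcomplete autoencoder with an L1 sparsity penalty is trained to reconstruct model activations, and its decoder columns are read off as candidate features. The dimensions, hyperparameters, and random stand-in "activations" below are illustrative assumptions, not any specific paper's recipe.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary: d_model activations -> n_features sparse codes."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))      # sparse, non-negative feature activations
        return self.decoder(f), f            # reconstruction and codes

# Illustrative stand-ins; real runs cache residual-stream activations from an LLM.
d_model, n_features, l1_coeff = 512, 4096, 1e-3
acts = torch.randn(8192, d_model)

sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

for step in range(200):
    batch = acts[torch.randint(0, len(acts), (256,))]
    x_hat, f = sae(batch)
    loss = ((x_hat - batch) ** 2).mean() + l1_coeff * f.abs().mean()  # reconstruction + L1 sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each decoder column is a candidate interpretable direction in activation space.
feature_directions = sae.decoder.weight.T.detach()   # (n_features, d_model)
</syntaxhighlight>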
 
===Counter-Results===
* 2020-10: [https://arxiv.org/abs/2010.12016 Towards falsifiable interpretability research]
* 2025-01: [https://arxiv.org/abs/2501.16615 Sparse Autoencoders Trained on the Same Data Learn Different Features]
* 2025-01: [https://arxiv.org/abs/2501.17148 AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders]
* 2025-01: [https://arxiv.org/abs/2501.17727 Sparse Autoencoders Can Interpret Randomly Initialized Transformers]
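One simple way to probe the claim that SAEs trained on the same data learn different features is to match the two decoder dictionaries by cosine similarity and count the features that have a close counterpart. A hedged sketch; the 0.9 threshold and the random matrices (standing in for two independently trained SAE dictionaries) are arbitrary choices.

<syntaxhighlight lang="python">
import numpy as np

def dictionary_overlap(D1, D2, threshold=0.9):
    """Fraction of features in D1 whose best cosine match in D2 exceeds `threshold`.

    D1, D2: (n_features, d_model) decoder direction matrices from two SAEs.
    """
    D1 = D1 / np.linalg.norm(D1, axis=1, keepdims=True)
    D2 = D2 / np.linalg.norm(D2, axis=1, keepdims=True)
    sims = D1 @ D2.T                     # pairwise cosine similarities
    best = sims.max(axis=1)              # best match in D2 for each D1 feature
    return float((best > threshold).mean())

# Placeholder dictionaries; in practice these come from two SAE training runs
# (different seeds) on the same activation dataset.
rng = np.random.default_rng(0)
D_run_a = rng.normal(size=(4096, 512))
D_run_b = rng.normal(size=(4096, 512))
print("fraction of shared features:", dictionary_overlap(D_run_a, D_run_b))
</syntaxhighlight>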
  
 
==Reward Functions==
 
Line 46: Line 53:
 
* 2024-10: [https://arxiv.org/abs/2410.19750 The Geometry of Concepts: Sparse Autoencoder Feature Structure]
** Tegmark et al. report multi-scale structure: 1) “atomic” small-scale, 2) “brain” intermediate-scale, and 3) “galaxy” large-scale
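A rough sketch of the kind of multi-scale analysis applied to an SAE's decoder directions: k-means clustering probes intermediate-scale ("brain") structure, while the covariance eigenvalue spectrum summarizes the large-scale ("galaxy") shape of the feature cloud. The random matrix stands in for a real SAE dictionary and the cluster counts are arbitrary, so this only illustrates the procedure, not the reported findings.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(4096, 512))   # placeholder for SAE decoder directions
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Intermediate ("brain") scale: do features group into functional clusters/lobes?
for k in (8, 64):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    sizes = np.bincount(labels)
    print(f"k={k}: cluster sizes range {sizes.min()}..{sizes.max()}")

# Large ("galaxy") scale: eigenvalue spectrum of the feature point cloud.
eigvals = np.linalg.eigvalsh(np.cov(features, rowvar=False))[::-1]
print("top-5 covariance eigenvalues:", np.round(eigvals[:5], 4))
</syntaxhighlight>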
 
==Topography==
* 2025-01: [https://arxiv.org/abs/2501.16396 TopoNets: High Performing Vision and Language Models with Brain-Like Topography]
  
 
==Challenges==
 
Line 83: Line 93:
 
* [https://arxiv.org/abs/2405.07987 The Platonic Representation Hypothesis]: Different models (including across modalities) are converging to a consistent world model.
* [https://arxiv.org/abs/2501.00070 ICLR: In-Context Learning of Representations]
* [https://arxiv.org/abs/2502.00873 Language Models Use Trigonometry to Do Addition]: Numbers are represented on a helix, and addition is carried out by rotating along it
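A toy illustration (not the paper's actual probe) of why a helical/trigonometric number representation makes addition easy: if n is stored as a phase exp(2πin/T), adding two numbers amounts to multiplying (rotating) their phases and reading the angle back off. The period T is an arbitrary choice here.

<syntaxhighlight lang="python">
import numpy as np

T = 100                                   # period of the (assumed) circular encoding

def encode(n: int) -> complex:
    """Represent n as a point on a circle (one full turn every T integers)."""
    return np.exp(2j * np.pi * n / T)

def decode(z: complex) -> int:
    """Read the phase back off as an integer mod T."""
    return int(round(np.angle(z) * T / (2 * np.pi))) % T

a, b = 27, 45
z_sum = encode(a) * encode(b)             # rotating by a, then by b, equals rotating by a+b
print(decode(z_sum), (a + b) % T)         # -> 72 72
</syntaxhighlight>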
  
 
===Theory of Mind===
 
* [https://arxiv.org/abs/2302.02083 Evaluating Large Language Models in Theory of Mind Tasks]
* [https://arxiv.org/abs/2410.13787 Looking Inward: Language Models Can Learn About Themselves by Introspection]
* [https://arxiv.org/abs/2501.11120 Tell me about yourself: LLMs are aware of their learned behaviors]
  
 
===Skeptical===
 
Line 173: Line 185:
 
* 2022-11: [https://arxiv.org/abs/2211.15661 What learning algorithm is in-context learning? Investigations with linear models]
* 2022-12: [https://arxiv.org/abs/2212.07677 Transformers learn in-context by gradient descent]
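A minimal numerical check of the 2022-12 claim, in the simplified setting of von Oswald et al.: for linear regression, one gradient-descent step from W = 0 on the in-context examples gives exactly the same prediction as one un-normalized linear self-attention layer whose keys and values are built from those examples. The dimensions and data below are arbitrary stand-ins.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 32                                  # input dim, number of in-context examples
W_true = rng.normal(size=(1, d))
X = rng.normal(size=(n, d))                   # in-context inputs  x_i
y = X @ W_true.T                              # in-context targets y_i  (n, 1)
x_query = rng.normal(size=(d, 1))
lr = 0.1

# (1) One gradient-descent step on 0.5*sum_i ||y_i - W x_i||^2, starting from W = 0:
#     W_1 = lr * sum_i y_i x_i^T, so the prediction is lr * sum_i y_i (x_i . x_query).
W_1 = lr * (y.T @ X)                          # (1, d)
pred_gd = W_1 @ x_query

# (2) Linear (un-normalized) self-attention with keys = x_i, values = lr * y_i,
#     query = x_query: output = sum_i values_i * (keys_i . query).
attn_scores = X @ x_query                     # (n, 1): x_i . x_query
pred_attn = (lr * y).T @ attn_scores          # (1, 1)

print(pred_gd.item(), pred_attn.item())       # identical up to floating point
</syntaxhighlight>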
 
==Reasoning (CoT, etc.)==
* 2025-01: [https://arxiv.org/abs/2501.18009 Large Language Models Think Too Fast To Explore Effectively]
* 2025-01: [https://arxiv.org/abs/2501.18585 Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs]
  
 
=See Also=
 

Latest revision as of 12:41, 4 February 2025

=Interpretability=

==Mechanistic Interpretability==

==Semanticity==

===Counter-Results===

==Reward Functions==

==Symbolic and Notation==

==Mathematical==

==Geometric==

==Topography==

==Challenges==


=Heuristic Understanding=

==Emergent Internal Model Building==

===Semantic Directions===

* Directions, e.g.: f(king)-f(man)+f(woman)=f(queen) or f(sushi)-f(Japan)+f(Italy)=f(pizza)
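Such directions are typically tested with plain vector arithmetic plus a nearest-neighbour lookup. A sketch with a hypothetical embedding table standing in for real word or model embeddings (with word2vec/GloVe vectors, gensim's most_similar(positive=..., negative=...) performs the same query).

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical embedding table; in practice these are model or word2vec/GloVe vectors.
rng = np.random.default_rng(0)
vocab = ["king", "man", "woman", "queen", "sushi", "japan", "italy", "pizza"]
emb = {w: rng.normal(size=300) for w in vocab}

def analogy(a: str, b: str, c: str) -> str:
    """Return the vocab word closest (by cosine) to f(a) - f(b) + f(c), excluding a, b, c."""
    target = emb[a] - emb[b] + emb[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = float(v @ target / np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("king", "man", "woman"))     # ideally "queen" with real embeddings
print(analogy("sushi", "japan", "italy"))  # ideally "pizza" with real embeddings
</syntaxhighlight>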

* Task vectors:

===Feature Geometry Reproduces Problem-space===

===Theory of Mind===

===Skeptical===

=Information Processing=

==Generalization==

===Tests of Resilience to Dropouts/etc.===

* 2024-02: Explorations of Self-Repair in Language Models
* 2024-06: What Matters in Transformers? Not All Attention is Needed
** Removing entire transformer blocks leads to significant performance degradation
** Removing MLP layers results in significant performance degradation
** Removing attention layers causes almost no performance degradation
** E.g. deleting half of the attention layers (a 48% speed-up) leads to only a 2.4% decrease on benchmarks
* 2024-06: The Remarkable Robustness of LLMs: Stages of Inference?
** They intentionally break the network (e.g. by swapping adjacent layers), yet it continues to work remarkably well; this suggests LLMs are quite robust, and it lets the authors identify distinct stages of processing (see the sketch after this list)
** They also use these interventions to infer what different layers are doing, splitting the transformer's layers into four stages:
*** Detokenization: raw tokens are converted into meaningful entities that take local context (especially nearby tokens) into account
*** Feature engineering: features are progressively refined, and factual knowledge is leveraged
*** Prediction ensembling: candidate next-token predictions emerge; a sort of consensus voting is used, with “prediction neurons” and “suppression neurons” playing a major role in upvoting/downvoting
*** Residual sharpening: semantic representations are collapsed into specific next-token predictions, with a strong emphasis on suppression neurons eliminating options; confidence is calibrated
** This structure can be thought of as two roughly dual halves: the first half broadens (from distinct tokens to a rich, elaborate concept-space), and the second half collapses (from rich concepts to concrete token predictions)
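A hedged sketch of the kind of ablation these papers run, simplified to whole transformer blocks: drop or swap blocks in a small Hugging Face GPT-2 model and compare perplexity on a short text sample. The module path model.transformer.h assumes the Hugging Face GPT-2 implementation; the papers themselves ablate attention/MLP sub-layers in larger models and evaluate on full benchmark suites.

<syntaxhighlight lang="python">
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def perplexity(model, enc):
    """Perplexity of `model` on an already-tokenized batch `enc`."""
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return float(torch.exp(out.loss))

tok = GPT2TokenizerFast.from_pretrained("gpt2")
enc = tok("The quick brown fox jumps over the lazy dog. " * 20, return_tensors="pt")

baseline = GPT2LMHeadModel.from_pretrained("gpt2").eval()
print("full model:", perplexity(baseline, enc))

# Drop every other transformer block, then re-measure.
pruned = GPT2LMHeadModel.from_pretrained("gpt2").eval()
kept = [b for i, b in enumerate(pruned.transformer.h) if i % 2 == 0]
pruned.transformer.h = torch.nn.ModuleList(kept)
pruned.config.n_layer = len(kept)                 # keep the config consistent
print("half the blocks:", perplexity(pruned, enc))

# Swap two adjacent blocks (the layer-swapping style of intervention).
swapped = GPT2LMHeadModel.from_pretrained("gpt2").eval()
h = swapped.transformer.h
h[4], h[5] = h[5], h[4]
print("blocks 4 and 5 swapped:", perplexity(swapped, enc))
</syntaxhighlight>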

==Other==

==Scaling Laws==

==Information Processing/Storage==

==Tokenization==

===For numbers/math===

==Learning/Training==

==Failure Modes==

=Psychology=

==Allow LLM to think==

==In-context Learning==

==Reasoning (CoT, etc.)==

=See Also=