Difference between revisions of "AI understanding"

From GISAXS
(Semanticity)
(Semanticity)
 
(5 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
=Interpretability=
 +
* 2017-04: [https://arxiv.org/abs/1704.01444 Learning to Generate Reviews and Discovering Sentiment]
 +
 
==Mechanistic Interpretability==
 
* 2020-03Mar: OpenAI: [https://distill.pub/2020/circuits/zoom-in/ Zoom In: An Introduction to Circuits]
Line 17: Line 19:
 
* 2024-10Oct: [https://arxiv.org/abs/2410.13928 Automatically Interpreting Millions of Features in Large Language Models]
 
* 2024-12Dec: [https://arxiv.org/abs/2412.04139 Monet: Mixture of Monosemantic Experts for Transformers]
 +
* 2024-12Dec: [https://www.lesswrong.com/posts/zbebxYCqsryPALh8C/matryoshka-sparse-autoencoders Matryoshka Sparse Autoencoders]
 +
* 2024-12Dec: [https://www.alignmentforum.org/posts/rKM9b6B2LqwSB5ToN/learning-multi-level-features-with-matryoshka-saes Learning Multi-Level Features with Matryoshka SAEs]
  
 
==Reward Functions==
Line 47: Line 51:
 
=Heuristic Understanding=
 
==Emergent Internal Model Building==
 +
* 2023-07: [https://arxiv.org/abs/2307.15936 A Theory for Emergence of Complex Skills in Language Models]
 
* 2024-06: [https://arxiv.org/abs/2406.19370v1 Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space]
  
Line 62: Line 67:
 
* [https://arxiv.org/abs/2310.15916 In-context learning creates task vectors]
 
* [https://www.alignmentforum.org/posts/5FGXmJ3wqgGRcbyH7/extracting-sae-task-features-for-in-context-learning Extracting SAE task features for in-context learning]
 +
* [https://arxiv.org/abs/2412.12276 Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers]
  
 
===Feature Geometry Reproduces Problem-space===
Line 138: Line 144:
 
* 2023-06: [https://arxiv.org/abs/2306.05836 Can Large Language Models Infer Causation from Correlation?]: Poor causal inference
 
* 2023-09: [https://arxiv.org/abs/2309.12288 The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"]
 +
* 2023-09: [https://arxiv.org/abs/2309.13638 Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve] (biases towards "common" numbers; in-context CoT can reduce performance by priming the model incorrectly; etc.)
 +
* 2023-11: [https://arxiv.org/abs/2311.16093 Visual cognition in multimodal large language models] (models lack human-like visual understanding)
  
 
=Psychology=

Latest revision as of 12:32, 19 December 2024

Interpretability

Mechanistic Interpretability

Semanticity

Reward Functions

Symbolic and Notation

Mathematical

Geometric

Challenges

Heuristic Understanding

Emergent Internal Model Building

Semantic Directions

Directions, e.g.: f(king)-f(man)+f(woman)≈f(queen) or f(sushi)-f(Japan)+f(Italy)≈f(pizza)
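
The direction arithmetic above can be sketched with a toy example. The 3-dimensional vectors below are hypothetical and hand-picked for illustration (they do not come from any real model), and the helper names cosine and analogy are likewise assumptions. In real embedding spaces the relation holds only approximately, which is why the answer is taken as the nearest neighbour of the computed vector rather than an exact match.

```python
import math

# Hypothetical toy embeddings f(word); axes loosely mean (royalty, male, female).
# These numbers are invented for illustration, not taken from a trained model.
EMB = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
    "apple": [0.0, 0.3, 0.3],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, emb=EMB):
    """Return the word whose vector is nearest to f(a) - f(b) + f(c).

    The three query words are excluded from the candidates, as is usual
    for analogy tests.
    """
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman"))  # prints "queen" for these toy vectors
```

Cosine similarity is used here because it compares directions and ignores vector length, which matches the idea of a "semantic direction".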

Task vectors:

Feature Geometry Reproduces Problem-space

Theory of Mind

Information Processing

Tests of Resilience to Dropouts/etc.

  • 2024-02: Explorations of Self-Repair in Language Models
  • 2024-06: What Matters in Transformers? Not All Attention is Needed
    • Removing entire transformer blocks leads to significant performance degradation
    • Removing MLP layers results in significant performance degradation
    • Removing attention layers causes almost no performance degradation
    • E.g. deleting half of the attention layers (a 48% speed-up) leads to only a 2.4% decrease on the benchmarks
  • 2024-06: The Remarkable Robustness of LLMs: Stages of Inference?
    • The authors intentionally break the network (swapping layers), yet it continues to work remarkably well. This suggests that LLMs are quite robust, and it lets the authors identify distinct stages of processing.
    • They also use these interventions to infer what different layers are doing. They break apart the LLM transformer layers into four stages:
      • Detokenization: Raw tokens are converted into meaningful entities that take into account local context (especially using nearby tokens).
      • Feature engineering: Features are progressively refined. Factual knowledge is leveraged.
      • Prediction ensembling: Predictions (for the ultimately-selected next-token) emerge. A sort of consensus voting is used, with "prediction neurons" and "suppression neurons" playing a major role in upvoting/downvoting.
      • Residual sharpening: The semantic representations are collapsed into specific next-token predictions. There is a strong emphasis on suppression neurons eliminating options. The confidence is calibrated.
    • This structure can be thought of as two halves (being roughly dual to each other): the first half broadens (goes from distinct tokens to a rich/elaborate concept-space) and the second half collapses (goes from rich concepts to concrete token predictions).
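
The layer-deletion and layer-swapping interventions described above can be illustrated on a toy residual stack. This is only a sketch under stated assumptions, not the setup used in the cited papers: the "layers" are small random tanh maps, and DIM, N_LAYERS, SCALE, make_layer and forward are all hypothetical names chosen for illustration. It shows the kind of measurement involved (compare the output of the intact stack with the output after an intervention), and why a residual stream in which each layer adds a small update might tolerate such edits.

```python
import math
import random

random.seed(0)
DIM, N_LAYERS, SCALE = 8, 12, 0.03  # toy sizes; SCALE keeps each update small

def make_layer():
    """Build a hypothetical 'layer': a small random linear map followed by tanh."""
    w = [[random.gauss(0, SCALE) for _ in range(DIM)] for _ in range(DIM)]
    def layer(x):
        return [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in w]
    return layer

def forward(x, layers):
    """Residual stream: every layer adds its update to x instead of replacing x."""
    for layer in layers:
        update = layer(x)
        x = [xi + ui for xi, ui in zip(x, update)]
    return x

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

LAYERS = [make_layer() for _ in range(N_LAYERS)]
x0 = [random.gauss(0, 1) for _ in range(DIM)]
baseline = forward(x0, LAYERS)

# Intervention 1: delete one middle layer.
deleted = LAYERS[:5] + LAYERS[6:]
# Intervention 2: swap two adjacent middle layers.
swapped = LAYERS[:5] + [LAYERS[6], LAYERS[5]] + LAYERS[7:]

sim_deleted = cosine(baseline, forward(x0, deleted))
sim_swapped = cosine(baseline, forward(x0, swapped))
print(f"cosine to baseline after deletion: {sim_deleted:.4f}")
print(f"cosine to baseline after swap:     {sim_swapped:.4f}")
```

In a real study the comparison would be made on model quality (e.g. benchmark scores or next-token loss) rather than on the cosine similarity of a toy output vector.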

Other

Scaling Laws

Information Processing/Storage

Tokenization

For numbers/math

Learning/Training

Failure Modes

Psychology

See Also