AI safety
Description of Safety Concerns
Key Concepts
- Instrumental Convergence
- Orthogonality Thesis
- Inner/outer alignment
- Mesa-optimization
- Overhang
- Reward is not the optimization target (Alex Turner)
Medium-term Risks
- 2023-04: The A.I. Dilemma – Tristan Harris and Aza Raskin (video, podcast transcript): raises concerns about humanity's ability to handle these transformations
- 2023-04: Daniel Schmachtenberger and Liv Boeree (video): AI could accelerate perverse social dynamics
- 2023-10: Sociotechnical Safety Evaluation of Generative AI Systems (Google DeepMind)
Long-term (x-risk)
- AGI Ruin: A List of Lethalities (Eliezer Yudkowsky)
Learning Resources
Research
- 2022-12: Discovering Latent Knowledge in Language Models Without Supervision (CCS; see the sketch after this list)
- 2023-02: Pretraining Language Models with Human Preferences
- 2023-04: Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
- 2023-05: Model evaluation for extreme risks (DeepMind)
- 2023-05: Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
- 2023-06: Preference Ranking Optimization for Human Alignment
- 2023-08: Self-Alignment with Instruction Backtranslation
- 2023-11: Debate Helps Supervise Unreliable Experts
- 2023-12: Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision (OpenAI, blog; see the sketch after this list)
- 2023-12: Practices for Governing Agentic AI Systems (OpenAI, blog)
- 2024-01: Sleeper Agents: Training Deceptive LLMs that Persist through Safety Training (Anthropic)
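As a concrete illustration of the 2022-12 entry, below is a minimal sketch of the Contrast-Consistent Search (CCS) objective: an unsupervised probe is fit to a model's hidden states for contrast pairs (a statement and its negation) so that its truth probabilities are consistent (p(x+) ≈ 1 − p(x−)) and not degenerately uncertain. The linear probe, optimizer settings, and random stand-in activations here are illustrative assumptions, not the paper's exact setup.

import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a probability the statement is true."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: the probe should satisfy p(x+) ~= 1 - p(x-).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p(x+) = p(x-) = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# Toy usage: random tensors stand in for LLM activations of contrast pairs.
hidden_dim = 768
probe = CCSProbe(hidden_dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
h_pos = torch.randn(64, hidden_dim)  # activations for the affirmative phrasing
h_neg = torch.randn(64, hidden_dim)  # activations for the negated phrasing
for _ in range(100):
    loss = ccs_loss(probe(h_pos), probe(h_neg))
    opt.zero_grad()
    loss.backward()
    opt.step()

No labels are used; the probe is steered only by the logical consistency of the contrast pair, which is what lets the method discover a truth-like direction without supervision.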
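Similarly, the 2023-12 weak-to-strong entry can be summarized by its auxiliary-confidence idea: the strong student is trained on the weak supervisor's labels, mixed with a cross-entropy term against its own hardened predictions so it can confidently disagree with weak-label errors. The binary-classification framing, fixed mixing weight, and 0.5 threshold below are simplifying assumptions; the paper anneals the weight and applies the loss to language-model finetuning.

import torch
import torch.nn.functional as F

def weak_to_strong_loss(student_logits: torch.Tensor,
                        weak_labels: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    """Toy binary-classification version of the auxiliary-confidence loss.

    student_logits: (batch,) raw logits from the strong student model.
    weak_labels:    (batch,) 0/1 labels produced by the weak supervisor.
    alpha:          weight on the student's own hardened predictions.
    """
    # Harden the student's current predictions into pseudo-labels (no gradient).
    hard_preds = (torch.sigmoid(student_logits) > 0.5).float().detach()
    # Fit the weak labels, but also reinforce the student's own confident
    # beliefs, letting it overrule weak-label noise it disagrees with.
    loss_weak = F.binary_cross_entropy_with_logits(student_logits, weak_labels.float())
    loss_self = F.binary_cross_entropy_with_logits(student_logits, hard_preds)
    return (1.0 - alpha) * loss_weak + alpha * loss_self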