AI safety
Description of Safety Concerns
Key Concepts
- Instrumental Convergence
- Orthogonality Thesis
- Inner/outer alignment
- Mesa-optimization
- Overhang
- Reward is not the optimization target (Alex Turner)
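The last concept above is easiest to see in a toy experiment. The sketch below is a minimal illustration, not taken from any of the linked resources (the corridor environment, hyperparameters, and reward placement are all invented for this example): a tabular Q-learning agent is trained with reward at one end of a corridor, the reward is then silently moved to the other end, and the frozen policy still walks to the old location. The trained agent executes the behavior its learned values encode; it does not pursue reward as an explicit goal.

```python
# Toy illustration of "reward is not the optimization target".
# Setup (invented for this sketch): a 1-D corridor, tabular Q-learning,
# reward at the right end during training. After training, the reward is
# moved to the left end; the frozen policy still heads right, because it
# executes its learned value estimates rather than seeking reward per se.
import random

N_STATES = 8            # corridor cells 0..7
ACTIONS = (-1, +1)      # step left / step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

def run_episode(q, reward_state, learn=True, max_steps=50):
    """Run one episode; update Q-values in place when learn=True."""
    state = N_STATES // 2
    for _ in range(max_steps):
        if learn and random.random() < EPSILON:
            a = random.randrange(len(ACTIONS))                        # explore
        else:
            a = max(range(len(ACTIONS)), key=lambda i: q[state][i])   # greedy
        nxt = min(max(state + ACTIONS[a], 0), N_STATES - 1)
        reward = 1.0 if nxt == reward_state else 0.0
        if learn:
            q[state][a] += ALPHA * (reward + GAMMA * max(q[nxt]) - q[state][a])
        state = nxt
        if reward > 0:                      # episode ends at the rewarded cell
            return state, True
    return state, False

random.seed(0)
q_table = [[0.0, 0.0] for _ in range(N_STATES)]

# Training phase: reward sits at the right end of the corridor.
for _ in range(500):
    run_episode(q_table, reward_state=N_STATES - 1, learn=True)

# Evaluation: the reward has been moved to cell 0, but the frozen policy
# still loiters near the old rewarded cell and never collects the new reward.
cell, collected = run_episode(q_table, reward_state=0, learn=False)
print(f"Reward moved to cell 0. Collected? {collected}. Policy ends near cell {cell}.")
```

With this seed the evaluation prints Collected? False, which is the point of the slogan: training pressure shaped a policy, and at deployment it is that policy, not the reward signal, that does the acting.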
Medium-term Risks
- 2023-04: A.I. Dilemma – Tristan Harris and Aza Raskin (video; podcast transcript available): raises concerns about humanity's ability to handle these transformations
- 2023-04: Daniel Schmachtenberger and Liv Boeree (video): AI could accelerate perverse social dynamics
- 2023-10: Sociotechnical Safety Evaluation of Generative AI Systems (Google DeepMind)
- 2024-02: Towards a Cautious Scientist AI with Convergent Safety Bounds (Yoshua Bengio)
- 2024-07: Reasoning through arguments against taking AI safety seriously (Yoshua Bengio)
Long-term (x-risk)
- AGI Ruin: A List of Lethalities (Eliezer Yudkowsky)
Learning Resources
- Introduction to AI Safety, Ethics, and Society (Dan Hendrycks, Center for AI Safety)
- AI Safety FAQ
- Writing Doom (video): 27-minute short film on superintelligence (2024)
- DeepMind short course on AGI safety
Status
Policy
- 2024-07: On the Limitations of Compute Thresholds as a Governance Strategy (Sara Hooker); the back-of-envelope sketch after this list shows what such a threshold measures
- 2024-07: Framework Convention on Global AI Challenges (CIGI)
- 2024-08: NIST guidelines: Managing Misuse Risk for Dual-Use Foundation Models
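For a concrete sense of what a "compute threshold" refers to in the Hooker paper above, the sketch below works through the standard back-of-envelope estimate that training compute is roughly 6 × parameters × training tokens, and compares two hypothetical training runs against commonly cited reporting triggers (on the order of 10^26 operations in the 2023 US Executive Order and 10^25 FLOPs in the EU AI Act). The model sizes are invented for illustration and the threshold figures should be verified against the actual legal texts; the paper's point is that such single-number triggers have important limitations as a governance tool.

```python
# Back-of-envelope compute-threshold check (illustrative sketch only: the
# training runs below are hypothetical, and the threshold values are the
# commonly cited figures, to be verified against the actual legal texts).
# Training compute for a dense transformer is commonly approximated as
#   FLOPs ~= 6 * N_parameters * N_training_tokens.

US_EO_THRESHOLD = 1e26      # operations; 2023 US Executive Order reporting trigger
EU_AI_ACT_THRESHOLD = 1e25  # FLOPs; EU AI Act "systemic risk" presumption

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard estimate: ~6 floating-point operations per parameter per token."""
    return 6.0 * n_params * n_tokens

# Hypothetical training runs: (parameter count, training tokens).
runs = {
    "mid-size run  (7e9 params, 2e12 tokens)": (7e9, 2e12),
    "frontier run  (1e12 params, 2e13 tokens)": (1e12, 2e13),
}

for name, (params, tokens) in runs.items():
    flops = training_flops(params, tokens)
    print(f"{name}: ~{flops:.1e} FLOPs"
          f" | over 1e25 (EU)? {flops > EU_AI_ACT_THRESHOLD}"
          f" | over 1e26 (US)? {flops > US_EO_THRESHOLD}")
```

The arithmetic only shows what quantity such thresholds gate; whether that quantity tracks actual risk is the open governance question the paper examines.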
Research
- 2022-12: Discovering Latent Knowledge in Language Models Without Supervision
- 2023-02: Pretraining Language Models with Human Preferences
- 2023-04: Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
- 2023-05: Model evaluation for extreme risks (DeepMind)
- 2023-05: Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
- 2023-06: Preference Ranking Optimization for Human Alignment
- 2023-08: Self-Alignment with Instruction Backtranslation
- 2023-11: Debate Helps Supervise Unreliable Experts
- 2023-12: Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision (OpenAI, blog)
- 2023-12: Practices for Governing Agentic AI Systems (OpenAI, blog)
- 2024-01: Sleeper Agents: Training Deceptive LLMs that Persist through Safety Training (Anthropic)
- 2024-04: The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions (OpenAI)
- 2024-07: On scalable oversight with weak LLMs judging strong LLMs
- 2024-07: Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? (Dan Hendrycks et al.)
- 2024-08: Tamper-Resistant Safeguards for Open-Weight LLMs (project, code)
- 2024-08: Better Alignment with Instruction Back-and-Forth Translation
- 2024-10: First-Person Fairness in Chatbots (OpenAI, blog)
- 2024-10: Sabotage evaluations for frontier models (Anthropic, blog)
- 2024-12: Alignment Faking in Large Language Models (Anthropic)
- 2024-12: Best-of-N Jailbreaking (code)
- 2024-12: Deliberative Alignment: Reasoning Enables Safer Language Models (OpenAI)
- 2025-01: Trading Inference-Time Compute for Adversarial Robustness (OpenAI, blog)
- 2025-01: Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming (Anthropic, blog)
- 2025-02: Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs (site, github)
- 2025-02: Auditing Prompt Caching in Language Model APIs

See Also
- AI predictions