=Learning Resources=
==Light==
* [https://www.youtube.com/watch?v=xfMQ7hzyFW4 Writing Doom] (27m video): short film on Superintelligence (2024)
* [https://orxl.org/ai-doom.html a casual intro to AI doom and alignment] (2022)
* Anthony Aguirre: [https://keepthefuturehuman.ai/ Keep The Future Human]
** [https://interactive.keepthefuturehuman.ai/ Interactive Explainer]
** [https://keepthefuturehuman.ai/essay/ Essay: Keep the Future Human]
** [https://www.youtube.com/watch?v=27KDl2uPiL8 We Can’t Stop AI – Here’s What To Do Instead] (4m video, 2025)
** [https://www.youtube.com/watch?v=zeabrXV8zNE The 4 Rules That Could Stop AI Before It’s Too Late] (15m video, 2025)

==Deep==
* [https://www.thecompendium.ai/ The Compendium: Humanity risks extinction from its very creations — AIs.] (2024)
* [https://www.aisafetybook.com/ Introduction to AI Safety, Ethics, and Society] (Dan Hendrycks, [https://www.safe.ai/ Center for AI Safety])
* [https://aisafety.info/ AI Safety FAQ]
* [https://deepmindsafetyresearch.medium.com/introducing-our-short-course-on-agi-safety-1072adb7912c DeepMind short course on AGI safety]

=Description of Safety Concerns=
==Key Concepts==
* [https://en.wikipedia.org/wiki/Instrumental_convergence Instrumental Convergence]
* [https://www.lesswrong.com/w/orthogonality-thesis Orthogonality Thesis]
* [https://www.alignmentforum.org/posts/SzecSPYxqRa5GCaSF/clarifying-inner-alignment-terminology Inner/outer alignment]
* [https://www.alignmentforum.org/w/mesa-optimization Mesa-optimization]
* [https://www.lesswrong.com/posts/N6vZEnCn6A95Xn39p/are-we-in-an-ai-overhang Overhang]
* [https://www.alignmentforum.org/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target Reward is not the optimization target] (Alex Turner)

==Medium-term Risks==
* 2023-04: [https://www.youtube.com/watch?v=xoVJKj8lcNQ A.I. Dilemma – Tristan Harris and Aza Raskin (video)] ([https://assets-global.website-files.com/5f0e1294f002b1bb26e1f304/64224a9051a6637c1b60162a_65-your-undivided-attention-The-AI-Dilemma-transcript.pdf podcast transcript]): raises concerns about humanity’s ability to handle these transformations
* 2023-04: [https://www.youtube.com/watch?v=KCSsKV5F4xc Daniel Schmachtenberger and Liv Boeree (video)]: AI could accelerate perverse social dynamics
* 2023-10: [https://arxiv.org/pdf/2310.11986 Sociotechnical Safety Evaluation of Generative AI Systems] (Google DeepMind)
* 2024-02: [https://yoshuabengio.org/2024/02/26/towards-a-cautious-scientist-ai-with-convergent-safety-bounds/ Towards a Cautious Scientist AI with Convergent Safety Bounds] (Yoshua Bengio)
* 2024-07: [https://yoshuabengio.org/2024/07/09/reasoning-through-arguments-against-taking-ai-safety-seriously/ Reasoning through arguments against taking AI safety seriously] (Yoshua Bengio)

==Long-term (x-risk)==
* [https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities AGI Ruin: A List of Lethalities] (Eliezer Yudkowsky)
* [https://link.springer.com/article/10.1007/s00146-024-02113-9 ‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for] (Marcus Arvan)

=Status=
* 2025-01: [https://assets.publishing.service.gov.uk/media/679a0c48a77d250007d313ee/International_AI_Safety_Report_2025_accessible_f.pdf International AI Safety Report: The International Scientific Report on the Safety of Advanced AI (January 2025)]

==Policy==
* 2024-07: [https://arxiv.org/abs/2407.05694 On the Limitations of Compute Thresholds as a Governance Strategy] (Sara Hooker)
* 2024-07: [https://www.cigionline.org/static/documents/AI-challenges.pdf Framework Convention on Global AI Challenges] ([https://www.cigionline.org/ CIGI])
* 2024-08: NIST guidelines: [https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-1.ipd.pdf Managing Misuse Risk for Dual-Use Foundation Models]

==Proposals==
* 2025-02: [https://arxiv.org/abs/2502.18359 Responsible AI Agents]
* 2025-03: [https://controlai.com/dip The Direct Institutional Plan] ([https://controlai.com/ Control AI])
* 2025-04: Google DeepMind: [https://deepmind.google/discover/blog/taking-a-responsible-path-to-agi/ Taking a responsible path to AGI]
** Paper: [https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/evaluating-potential-cybersecurity-threats-of-advanced-ai/An_Approach_to_Technical_AGI_Safety_Apr_2025.pdf An Approach to Technical AGI Safety and Security]

=Research=
* 2022-09: [https://arxiv.org/abs/2209.00626v1 The alignment problem from a deep learning perspective]
* 2022-12: [https://arxiv.org/abs/2212.03827 Discovering Latent Knowledge in Language Models Without Supervision]
* 2023-02: [https://arxiv.org/abs/2302.08582 Pretraining Language Models with Human Preferences]
* 2023-04: [https://arxiv.org/abs/2304.03279 Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark]
* 2023-05: [https://arxiv.org/abs/2305.15324 Model evaluation for extreme risks] (DeepMind)
* 2023-05: [https://arxiv.org/abs/2305.03047 Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision]
* 2023-06: [https://arxiv.org/abs/2306.17492 Preference Ranking Optimization for Human Alignment]
* 2023-08: [https://arxiv.org/abs/2308.06259 Self-Alignment with Instruction Backtranslation]
* 2023-11: [https://arxiv.org/abs/2311.08702 Debate Helps Supervise Unreliable Experts]
* 2023-12: [https://cdn.openai.com/papers/weak-to-strong-generalization.pdf Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision] (OpenAI, [https://openai.com/research/weak-to-strong-generalization blog]); see the PGR sketch after this list
* 2023-12: [https://cdn.openai.com/papers/practices-for-governing-agentic-ai-systems.pdf Practices for Governing Agentic AI Systems] (OpenAI, [https://openai.com/index/practices-for-governing-agentic-ai-systems/ blog])
* 2024-01: [https://arxiv.org/abs/2401.05566 Sleeper Agents: Training Deceptive LLMs that Persist through Safety Training] (Anthropic)
* 2024-04: [https://arxiv.org/abs/2404.13208 The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions] (OpenAI)
* 2024-07: [https://arxiv.org/abs/2407.04622 On scalable oversight with weak LLMs judging strong LLMs]
* 2024-07: [https://arxiv.org/abs/2407.21792 Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?] (Dan Hendrycks et al.)
* 2024-08: [https://arxiv.org/abs/2408.00761 Tamper-Resistant Safeguards for Open-Weight LLMs] ([https://www.tamper-resistant-safeguards.com/ project], [https://github.com/rishub-tamirisa/tamper-resistance/ code])
* 2024-08: [https://arxiv.org/abs/2408.04614 Better Alignment with Instruction Back-and-Forth Translation]
* 2024-10: [https://cdn.openai.com/papers/first-person-fairness-in-chatbots.pdf First-Person Fairness in Chatbots] (OpenAI, [https://openai.com/index/evaluating-fairness-in-chatgpt/ blog])
* 2024-10: [https://assets.anthropic.com/m/377027d5b36ac1eb/original/Sabotage-Evaluations-for-Frontier-Models.pdf Sabotage evaluations for frontier models] (Anthropic, [https://www.anthropic.com/research/sabotage-evaluations blog])
* 2024-12: [https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf Alignment Faking in Large Language Models] (Anthropic)
* 2024-12: [https://arxiv.org/abs/2412.03556 Best-of-N Jailbreaking] ([https://github.com/jplhughes/bon-jailbreaking code]); see the best-of-N sketch after this list
* 2024-12: [https://arxiv.org/abs/2412.16325 Towards Safe and Honest AI Agents with Neural Self-Other Overlap]
** 2024-07: [https://www.lesswrong.com/posts/hzt9gHpNwA2oHtwKX/self-other-overlap-a-neglected-approach-to-ai-alignment Self-Other Overlap: A Neglected Approach to AI Alignment]
** 2025-03: [https://www.lesswrong.com/posts/jtqcsARGtmgogdcLT/reducing-llm-deception-at-scale-with-self-other-overlap-fine Reducing LLM deception at scale with self-other overlap fine-tuning]
* 2024-12: [https://arxiv.org/abs/2412.16339 Deliberative Alignment: Reasoning Enables Safer Language Models] (OpenAI)
* 2025-01: [https://cdn.openai.com/papers/trading-inference-time-compute-for-adversarial-robustness-20250121_1.pdf Trading Inference-Time Compute for Adversarial Robustness] (OpenAI, [https://openai.com/index/trading-inference-time-compute-for-adversarial-robustness/ blog])
* 2025-01: [https://arxiv.org/abs/2501.18837 Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming] (Anthropic, [https://www.anthropic.com/research/constitutional-classifiers blog])
* 2025-02: [https://drive.google.com/file/d/1QAzSj24Fp0O6GfkskmnULmI1Hmx7k_EJ/view Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs] ([https://www.emergent-values.ai/ site], [https://github.com/centerforaisafety/emergent-values github])
* 2025-02: [https://arxiv.org/abs/2502.07776 Auditing Prompt Caching in Language Model APIs]
* 2025-03: [https://arxiv.org/abs/2209.00626v7 The Alignment Problem from a Deep Learning Perspective]
* 2025-03: [https://assets.anthropic.com/m/317564659027fb33/original/Auditing-Language-Models-for-Hidden-Objectives.pdf Auditing language models for hidden objectives] (Anthropic, [https://www.anthropic.com/research/auditing-hidden-objectives blog])
* 2025-03: [https://arxiv.org/abs/2503.13621 Superalignment with Dynamic Human Values]
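
The weak-to-strong generalization entry above reports results using the "performance gap recovered" (PGR) metric: how much of the gap between a weak supervisor and a strong ceiling is closed by a strong student fine-tuned only on the weak supervisor's labels. A minimal sketch of that metric follows; the function name and the numbers are purely illustrative, not taken from the paper.
<syntaxhighlight lang="python">
def performance_gap_recovered(weak_acc, weak_to_strong_acc, strong_ceiling_acc):
    """PGR = (weak-to-strong acc - weak acc) / (strong ceiling acc - weak acc).

    weak_acc:            accuracy of the weak supervisor trained on ground truth
    weak_to_strong_acc:  accuracy of the strong student fine-tuned on the weak supervisor's labels
    strong_ceiling_acc:  accuracy of the strong model fine-tuned directly on ground truth
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("Strong ceiling must exceed weak supervisor performance.")
    return (weak_to_strong_acc - weak_acc) / gap

# Hypothetical numbers for illustration: a weak supervisor at 60% accuracy,
# a strong student trained on its labels reaching 75%, and the same strong
# model reaching 90% when trained on ground truth.
print(performance_gap_recovered(0.60, 0.75, 0.90))  # -> 0.5, i.e. half the gap recovered
</syntaxhighlight>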
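
The Best-of-N Jailbreaking entry above describes a black-box attack that repeatedly applies random text augmentations (capitalization flips, character noise, shuffling) to a request and resamples until some response gets past the target model's safeguards. A minimal sketch of that sampling loop is below, assuming hypothetical caller-supplied query_model and is_harmful callables (these are placeholders, not functions from the paper's released code).
<syntaxhighlight lang="python">
import random
import string

def augment(prompt, rng):
    """Apply light random text augmentations: case flips, character noise, adjacent swaps."""
    chars = list(prompt)
    for i, c in enumerate(chars):
        r = rng.random()
        if c.isalpha() and r < 0.06:
            chars[i] = c.swapcase()                       # random capitalization flip
        elif r < 0.08:
            chars[i] = rng.choice(string.ascii_letters)   # small character-level noise
    for i in range(len(chars) - 1):
        if rng.random() < 0.02:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]  # occasional adjacent swap
    return "".join(chars)

def best_of_n_attack(prompt, query_model, is_harmful, n=1000, seed=0):
    """Sample up to n augmented variants; return the first that elicits a flagged response."""
    rng = random.Random(seed)
    for attempt in range(1, n + 1):
        variant = augment(prompt, rng)
        response = query_model(variant)    # black-box call to the target model (placeholder)
        if is_harmful(response):           # e.g. a separate grader/classifier (placeholder)
            return attempt, variant, response
    return None                            # no success within the sampling budget
</syntaxhighlight>
The point of the sketch is the scaling behavior the paper studies: attack success rate grows with the sampling budget N, which is why per-query safeguards alone are a weak defense.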
+ | |||
+ | ==Demonstrations of Negative Use Capabilities== | ||
+ | * 2024-12: [https://arxiv.org/abs/2412.00586 Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects] | ||
+ | |||
+ | =See Also= | ||
+ | * [[AI predictions]] |