Latest revision as of 15:52, 14 April 2026

Learning Resources

Light

a casual intro to AI doom and alignment (2022)
Anthony Aguirre: Keep The Future Human
- Interactive Explainer
- Essay: Keep the Future Human
- We Can’t Stop AI – Here’s What To Do Instead (4m video, 2025)
- The 4 Rules That Could Stop AI Before It’s Too Late (15m video, 2025)
Tristan Harris TED talk (15m): Why AI is our ultimate test and greatest invitation
- Text version: Center for Humane Technology: The Narrow Path: Why AI is Our Ultimate Test and Greatest Invitation
Fable about Transformative AI
2024-10: Writing Doom: short film on Superintelligence (27m video)
2026-03: The AI book that's freaking out national security advisors (44m video)

Deep

Description of Safety Concerns

Key Concepts

Instrumental Convergence
Orthogonality Thesis
Inner/outer alignment
Mesa-optimization
Overhang
Reward is not the optimization target (Alex Turner)
80,000 Hours:
- Risks from power-seeking AI systems
- Gradual disempowerment
- Catastrophic AI misuse

Medium-term Risks

2023-04: A.I. Dilemma – Tristan Harris and Aza Raskin (video) (podcast transcript): raises concern about human ability to handle these transformations
2023-04: Daniel Schmachtenberger and Liv Boeree (video): AI could accelerate perverse social dynamics
2023-10: Sociotechnical Safety Evaluation of Generative AI Systems (Google DeepMind)
2024-02: Towards a Cautious Scientist AI with Convergent Safety Bounds (Yoshua Bengio)
2024-07: Reasoning through arguments against taking AI safety seriously (Yoshua Bengio)
2025-04: AI-Enabled Coups: How a Small Group Could Use AI to Seize Power
2025-06: The Singapore Consensus on Global AI Safety Research Priorities
2026-01: How malicious AI swarms can threaten democracy: The fusion of agentic AI and LLMs marks a new frontier in information warfare (Science Magazine, preprint)
2026-01: The Adolescence of Technology: Confronting and Overcoming the Risks of Powerful AI (Dario Amodei)
2026-02: Updated thoughts on AI risk: Things have gotten scarier since 2023 (Noah Smith)

Long-term (x-risk)

2015-02: Sam Altman: Machine intelligence, part 1
2019-03: Daniel Kokotajlo and Wei Dai: The Main Sources of AI Risk?
2022-06: Eliezer Yudkowsky: AGI Ruin: A List of Lethalities
2024-11: Marcus Arvan: ‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for
2025-04: ASI existential risk: reconsidering alignment as a goal
2025-12: Philip Trammell and Leopold Aschenbrenner: Existential Risk and Growth

Status

2025-01: International AI Safety Report: The International Scientific Report on the Safety of Advanced AI (January 2025)
AI Lab Watch (safety scorecard)
2026-03: The state of AI safety in four fake graphs

Assessment

AI Assessment Scale (AIAS): A practical framework to guide the appropriate and ethical use of generative AI in assessment design, empowering educators to make purposeful, evidence-based decisions
2025-07: Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report

Policy

2024-07: On the Limitations of Compute Thresholds as a Governance Strategy (Sara Hooker)
2024-07: Framework Convention on Global AI Challenges (CIGI)
2024-08: NIST guidelines: Managing Misuse Risk for Dual-Use Foundation Models

Proposals

2025-02: Responsible AI Agents
2025-03: Control AI: The Direct Institutional Plan
2025-04: Google DeepMind: Taking a responsible path to AGI
- Paper: An Approach to Technical AGI Safety and Security
2026-04: Joe Carlsmith: Writing AI constitutions

Research

2008: The Basic AI Drives
2022-09: The alignment problem from a deep learning perspective
2022-12: Discovering Latent Knowledge in Language Models Without Supervision
2023-02: Pretraining Language Models with Human Preferences
2023-04: Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
2023-05: Model evaluation for extreme risks (DeepMind)
2023-05: Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
2023-06: Preference Ranking Optimization for Human Alignment
2023-08: Self-Alignment with Instruction Backtranslation
2023-11: Debate Helps Supervise Unreliable Experts
2023-12: Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision (OpenAI, blog)
2023-12: Practices for Governing Agentic AI Systems (OpenAI, blog)
2024-01: Sleeper Agents: Training Deceptive LLMs that Persist through Safety Training (Anthropic)
2024-04: The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions (OpenAI)
2024-07: On scalable oversight with weak LLMs judging strong LLMs
2024-07: Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? (Dan Hendrycks et al.)
2024-08: Tamper-Resistant Safeguards for Open-Weight LLMs (project, code)
2024-08: Better Alignment with Instruction Back-and-Forth Translation
2024-10: First-Person Fairness in Chatbots (OpenAI, blog)
2024-10: Sabotage evaluations for frontier models (Anthropic, blog)
2024-12: Alignment Faking in Large Language Models (Anthropic)
2024-12: Best-of-N Jailbreaking (code)
2024-12: Towards Safe and Honest AI Agents with Neural Self-Other Overlap
- 2024-07: Self-Other Overlap: A Neglected Approach to AI Alignment
- 2025-03: Reducing LLM deception at scale with self-other overlap fine-tuning
2024-12: Deliberative Alignment: Reasoning Enables Safer Language Models (OpenAI)
2025-01: Trading Inference-Time Compute for Adversarial Robustness (OpenAI, blog)
2025-01: Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming (Anthropic, blog,
2025-02: Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs (site, github)
2025-02: Auditing Prompt Caching in Language Model APIs
2025-02: Multi-Agent Risks from Advanced AI
2025-03: The Alignment Problem from a Deep Learning Perspective
2025-03: Auditing language models for hidden objectives (Anthropic, blog)
2025-03: Superalignment with Dynamic Human Values
2025-04: Contemplative Wisdom for Superalignment
2025-04: Scaling Laws for Scalable Oversight (preprint, code)
2025-06: SHADE-Arena: Evaluating sabotage and monitoring in LLM agents (Anthropic, blog)
2025-06: Avoiding Obfuscation with Prover-Estimator Debate
2025-06: Persona Features Control Emergent Misalignment (OpenAI, blog)
2025-07: Why Do Some Language Models Fake Alignment While Others Don't? (Anthropic, code)
2025-07: Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
2025-09: Detecting and reducing scheming in AI models
2025-11: Natural Emergent Misalignment from Reward Hacking in Production RL (Anthropic, blog)
2025-12: Distributional AGI Safety
2025-12: Difficulties with Evaluating a Deception Detector for AIs
2025-12: Monitoring Monitorability (OpenAI)
2026-01: Training large language models on narrow tasks can lead to broad misalignment
- 2025-02: Preprint: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
2026-02: The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity? (Anthropic blog)
2026-03: Reasoning Models Struggle to Control their Chains of Thought (OpenAI blog)
2026-03: The Consciousness Cluster: Preferences of Models that Claim to be Conscious

@@ Line 1: / Line 1: @@
 =Learning Resources=
 ==Light==
-* [https://www.youtube.com/watch?v=xfMQ7hzyFW4 Writing Doom] (27m video): short film on Superintelligence (2024)
 * [https://orxl.org/ai-doom.html a casual intro to AI doom and alignment] (2022)
 * Anthony Aguirre: [https://keepthefuturehuman.ai/ Keep The Future Human]
@@ Line 8: / Line 7: @@
 ** [https://www.youtube.com/watch?v=27KDl2uPiL8 We Can’t Stop AI – Here’s What To Do Instead] (4m video, 2025)
 ** [https://www.youtube.com/watch?v=zeabrXV8zNE The 4 Rules That Could Stop AI Before It’s Too Late] (15m video, 2025)
+* Tristan Harris TED talk (15m): [https://www.ted.com/talks/tristan_harris_why_ai_is_our_ultimate_test_and_greatest_invitation Why AI is our ultimate test and greatest invitation]
+** Text version: Center for Humane Technology: [https://centerforhumanetechnology.substack.com/p/the-narrow-path-why-ai-is-our-ultimate The Narrow Path: Why AI is Our Ultimate Test and Greatest Invitation]
+* [https://x.com/KeiranJHarris/status/1935429439476887594 Fable about Transformative AI]
+* 2024-10: [https://www.youtube.com/watch?v=xfMQ7hzyFW4 Writing Doom]: short film on Superintelligence (27m video)
+* 2026-03: [https://www.youtube.com/watch?v=Nl7-bRFSZBs The AI book that's freaking out national security advisors] (44m video)
 ==Deep==
@@ Line 23: / Line 27: @@
 * [https://www.lesswrong.com/posts/N6vZEnCn6A95Xn39p/are-we-in-an-ai-overhang Overhang]
 * [https://www.alignmentforum.org/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target Reward is not the optimization target] (Alex Turner)
+* 80,000 hours:
+** [https://80000hours.org/problem-profiles/risks-from-power-seeking-ai/ Risks from power-seeking AI systems]
+** [https://80000hours.org/problem-profiles/gradual-disempowerment/ Gradual disempowerment]
+** [https://80000hours.org/problem-profiles/catastrophic-ai-misuse/ Catastrophic AI misuse]
 ==Medium-term Risks==
-* 2023-04: [https://www.youtube.com/watch?v=xoVJKj8lcNQ A.I. Dilemma – Tristan Harris and Aza Raskin” (video)] ([https://assets-global .website-files.com/5f0e1294f002b1bb26e1f304/64224a9051a6637c1b60162a_65-your-undivided-attention-The-AI-Dilemma-transcript.pdf podcast transcript]): raises concern about human ability to handle these transformations
+* 2023-04: [https://www.youtube.com/watch?v=xoVJKj8lcNQ A.I. Dilemma – Tristan Harris and Aza Raskin (video)] ([https://assets-global.website-files.com/5f0e1294f002b1bb26e1f304/64224a9051a6637c1b60162a_65-your-undivided-attention-The-AI-Dilemma-transcript.pdf podcast transcript]): raises concern about human ability to handle these transformations
 * 2023-04: [https://www.youtube.com/watch?v=KCSsKV5F4xc Daniel Schmachtenberger and Liv Boeree (video)]: AI could accelerate perverse social dynamics
 * 2023-10: [https://arxiv.org/pdf/2310.11986 Sociotechnical Safety Evaluation of Generative AI Systems] (Google DeepMind)
 * 2024-02: [https://yoshuabengio.org/2024/02/26/towards-a-cautious-scientist-ai-with-convergent-safety-bounds/ Towards a Cautious Scientist AI with Convergent Safety Bounds] (Yoshua Bengio)
 * 2024-07: [https://yoshuabengio.org/2024/07/09/reasoning-through-arguments-against-taking-ai-safety-seriously/ Reasoning through arguments against taking AI safety seriously] (Yoshua Bengio)
+* 2025-04: [https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power AI-Enabled Coups: How a Small Group Could Use AI to Seize Power]
+* 2025-06: [https://arxiv.org/abs/2506.20702 The Singapore Consensus on Global AI Safety Research Priorities]
+* 2026-01: [https://www.science.org/doi/10.1126/science.adz1697 How malicious AI swarms can threaten democracy: The fusion of agentic AI and LLMs marks a new frontier in information warfare] (Science Magazine, [https://arxiv.org/abs/2506.06299 preprint])
+* 2026-01: [https://www.darioamodei.com/essay/the-adolescence-of-technology The Adolescence of Technology: Confronting and Overcoming the Risks of Powerful AI] (Dario Amodei)
+* 2026-02: [https://www.noahpinion.blog/p/updated-thoughts-on-ai-risk Updated thoughts on AI risk: Things have gotten scarier since 2023] ([https://x.com/Noahpinion Noah Smith])
 ==Long-term  (x-risk)==
-* [https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities List AGI Ruin: A List of Lethalities] (Eliezer Yudkowsky)
+* 2015-02: Sam Altman: [https://blog.samaltman.com/machine-intelligence-part-1 Machine intelligence, part 1]
+* 2019-03: Daniel Kokotajlo and Wei Dai: [https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk The Main Sources of AI Risk?]
+* 2022-06: Eliezer Yudkowsky: [https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities AGI Ruin: A List of Lethalities]
+* 2024-11: Marcus Arvan: [https://link.springer.com/article/10.1007/s00146-024-02113-9 ‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for]
+* 2025-04: [https://michaelnotebook.com/xriskbrief/index.html ASI existential risk: reconsidering alignment as a goal]
+* 2025-12: Philip Trammell and Leopold Aschenbrenner: [https://philiptrammell.com/static/Existential_Risk_and_Growth.pdf Existential Risk and Growth]
 =Status=
 * 2025-01: [https://assets.publishing.service.gov.uk/media/679a0c48a77d250007d313ee/International_AI_Safety_Report_2025_accessible_f.pdf International Safety Report: The International Scientific Report on the Safety of Advanced AI (January 2025)]
+* [https://ailabwatch.org/ AI Lab Watch] (safety scorecard)
+* 2026-03: [https://windowsontheory.org/2026/03/30/the-state-of-ai-safety-in-four-fake-graphs/ The state of AI safety in four fake graphs]
+==Assessment==
+* [https://aiassessmentscale.com/ AI Assessment Scale (AIAS)]: A practical framework to guide the appropriate and ethical use of generative AI in assessment design, empowering educators to make purposeful, evidence-based decisions
+* 2025-07: [https://arxiv.org/abs/2507.16534 Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report]
 ==Policy==
@@ Line 47: / Line 71: @@
 * 2025-04: Google DeepMind: [https://deepmind.google/discover/blog/taking-a-responsible-path-to-agi/ Taking a responsible path to AGI]
 ** Paper: [https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/evaluating-potential-cybersecurity-threats-of-advanced-ai/An_Approach_to_Technical_AGI_Safety_Apr_2025.pdf An Approach to Technical AGI Safety and Security]
+* 2026-04: Joe Carlsmith: [https://joecarlsmith.substack.com/p/video-and-transcript-of-talk-on-writing Writing AI constitutions]
 =Research=
+* 2008: [https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf The Basic AI Drives]
 * 2022-09: [https://arxiv.org/abs/2209.00626v1 The alignment problem from a deep learning perspective]
 * 2022-12: [https://arxiv.org/abs/2212.03827 Discovering Latent Knowledge in Language Models Without Supervision]
@@ Line 78: / Line 104: @@
 * 2025-02: [https://drive.google.com/file/d/1QAzSj24Fp0O6GfkskmnULmI1Hmx7k_EJ/view Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs] ([https://www.emergent-values.ai/ site], [https://github.com/centerforaisafety/emergent-values github])
 * 2025-02: [https://arxiv.org/abs/2502.07776 Auditing Prompt Caching in Language Model APIs]
+* 2025-02: [https://arxiv.org/abs/2502.14143 Multi-Agent Risks from Advanced AI]
 * 2025-03: [https://arxiv.org/abs/2209.00626v7 The Alignment Problem from a Deep Learning Perspective]
 * 2025-03: [https://assets.anthropic.com/m/317564659027fb33/original/Auditing-Language-Models-for-Hidden-Objectives.pdf Auditing language models for hidden objectives] (Anthropic, [https://www.anthropic.com/research/auditing-hidden-objectives blog])
 * 2025-03: [https://arxiv.org/abs/2503.13621 Superalignment with Dynamic Human Values]
+* 2025-04: [https://arxiv.org/abs/2504.15125 Contemplative Wisdom for Superalignment]
+* 2025-04: [https://www.lesswrong.com/posts/x59FhzuM9yuvZHAHW/untitled-draft-yhra Scaling Laws for Scalable Oversight] ([https://arxiv.org/abs/2504.18530 preprint], [https://github.com/subhashk01/oversight-scaling-laws code])
+* 2025-06: [https://assets.anthropic.com/m/4fb35becb0cd87e1/original/SHADE-Arena-Paper.pdf SHADE-Arena: Evaluating sabotage and monitoring in LLM agents] (Anthropic, [https://www.anthropic.com/research/shade-arena-sabotage-monitoring blog])
+* 2025-06: [https://arxiv.org/abs/2506.13609 Avoiding Obfuscation with Prover-Estimator Debate]
+* 2025-06: [https://cdn.openai.com/pdf/a130517e-9633-47bc-8397-969807a43a23/emergent_misalignment_paper.pdf Persona Features Control Emergent Misalignment] (OpenAI, [https://openai.com/index/emergent-misalignment/ blog])
+* 2025-07: [https://arxiv.org/abs/2506.18032 Why Do Some Language Models Fake Alignment While Others Don't?] (Anthropic, [https://github.com/safety-research/open-source-alignment-faking code])
+* 2025-07: [https://arxiv.org/abs/2507.11473 Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety]
+* 2025-09: [https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/ Detecting and reducing scheming in AI models]
+* 2025-11: [https://assets.anthropic.com/m/74342f2c96095771/original/Natural-emergent-misalignment-from-reward-hacking-paper.pdf Natural Emergent Misalignment from Reward Hacking in Production RL] (Anthropic, [https://www.anthropic.com/research/emergent-misalignment-reward-hacking blog])
+* 2025-12: [https://arxiv.org/abs/2512.16856 Distributional AGI Safety]
+* 2025-12: [https://arxiv.org/abs/2511.22662 Difficulties with Evaluating a Deception Detector for AIs]
+* 2025-12: [https://cdn.openai.com/pdf/d57827c6-10bc-47fe-91aa-0fde55bd3901/monitoring-monitorability.pdf Monitoring Monitorability] (OpenAI)
+* 2026-01: [https://www.nature.com/articles/s41586-025-09937-5 Training large language models on narrow tasks can lead to broad misalignment]
+** 2025-02: Preprint: [https://martins1612.github.io/emergent_misalignment_betley.pdf Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs]
+* 2026-02: [https://arxiv.org/pdf/2601.23045 The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity?] (Anthropic [https://alignment.anthropic.com/2026/hot-mess-of-ai/ blog])
+* 2026-03: [https://cdn.openai.com/pdf/a21c39c1-fa07-41db-9078-973a12620117/cot_controllability.pdf Reasoning Models Struggle to Control their Chains of Thought] (OpenAI [https://openai.com/index/reasoning-models-chain-of-thought-controllability/ blog])
+* 2026-03: [https://truthful.ai/consciousness_cluster.pdf The Consciousness Cluster: Preferences of Models that Claim to be Conscious]
 ==Demonstrations of Negative Use Capabilities==
 * 2024-12: [https://arxiv.org/abs/2412.00586 Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects]
+* 2025-04: [https://www.nathanlabenz.com/ Nathan Labenz] ([https://www.cognitiverevolution.ai/ The Cognitive Revolution]): [https://docs.google.com/presentation/d/1mvkpg1mtAvGzTiiwYPc6bKOGsQXDIwMb-ytQECb3i7I/edit#slide=id.g252d9e67d86_0_16 AI Bad Behavior]
+==Threat Vectors==
+* 2024-01: [https://arxiv.org/abs/2401.05566 Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training]
+* 2025-10: [https://arxiv.org/abs/2510.07192 Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples]
 =See Also=
 * [[AI predictions]]

Difference between revisions of "AI safety"
