==Medium-term Risks==
 
* 2023-04: [https://www.youtube.com/watch?v=xoVJKj8lcNQ A.I. Dilemma – Tristan Harris and Aza Raskin (video)] ([https://assets-global.website-files.com/5f0e1294f002b1bb26e1f304/64224a9051a6637c1b60162a_65-your-undivided-attention-The-AI-Dilemma-transcript.pdf podcast transcript]): raises concerns about humanity's ability to handle these transformations
 
* 2023-04: [https://www.youtube.com/watch?v=KCSsKV5F4xc Daniel Schmachtenberger and Liv Boeree (video)]: AI could accelerate perverse social dynamics
 
* 2023-10: [https://arxiv.org/pdf/2310.11986 Sociotechnical Safety Evaluation of Generative AI Systems] (Google DeepMind)
 
* 2025-04: [https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power AI-Enabled Coups: How a Small Group Could Use AI to Seize Power]
 
* 2025-06: [https://arxiv.org/abs/2506.20702 The Singapore Consensus on Global AI Safety Research Priorities]
 
* 2026-01: [https://www.science.org/doi/10.1126/science.adz1697 How malicious AI swarms can threaten democracy: The fusion of agentic AI and LLMs marks a new frontier in information warfare] (Science Magazine, [https://arxiv.org/abs/2506.06299 preprint])

* 2026-01: [https://www.darioamodei.com/essay/the-adolescence-of-technology The Adolescence of Technology: Confronting and Overcoming the Risks of Powerful AI] (Dario Amodei)
  
 
==Long-term (x-risk)==
 
* 2024-11: Marcus Arvan: [https://link.springer.com/article/10.1007/s00146-024-02113-9 ‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for]
 
* 2025-04: [https://michaelnotebook.com/xriskbrief/index.html ASI existential risk: reconsidering alignment as a goal]
 
* 2025-12: Philip Trammell and Leopold Aschenbrenner: [https://philiptrammell.com/static/Existential_Risk_and_Growth.pdf Existential Risk and Growth]
  
 
=Status=
 
  
 
=Research=
 
* 2008: [https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf The Basic AI Drives] (Stephen Omohundro)
 
* 2022-09: [https://arxiv.org/abs/2209.00626v1 The alignment problem from a deep learning perspective]
 
* 2022-12: [https://arxiv.org/abs/2212.03827 Discovering Latent Knowledge in Language Models Without Supervision]
 
* 2025-02: [https://drive.google.com/file/d/1QAzSj24Fp0O6GfkskmnULmI1Hmx7k_EJ/view Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs] ([https://www.emergent-values.ai/ site], [https://github.com/centerforaisafety/emergent-values github])
 
* 2025-02: [https://arxiv.org/abs/2502.07776 Auditing Prompt Caching in Language Model APIs]
 
* 2025-02: [https://arxiv.org/abs/2502.14143 Multi-Agent Risks from Advanced AI]
 
* 2025-03: [https://arxiv.org/abs/2209.00626v7 The Alignment Problem from a Deep Learning Perspective]
 
* 2025-03: [https://assets.anthropic.com/m/317564659027fb33/original/Auditing-Language-Models-for-Hidden-Objectives.pdf Auditing language models for hidden objectives] (Anthropic, [https://www.anthropic.com/research/auditing-hidden-objectives blog])
 
* 2025-07: [https://arxiv.org/abs/2506.18032 Why Do Some Language Models Fake Alignment While Others Don't?] (Anthropic, [https://github.com/safety-research/open-source-alignment-faking code])
 
* 2025-07: [https://arxiv.org/abs/2507.11473 Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety]
 
* 2025-09: [https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/ Detecting and reducing scheming in AI models]

* 2025-11: [https://assets.anthropic.com/m/74342f2c96095771/original/Natural-emergent-misalignment-from-reward-hacking-paper.pdf Natural Emergent Misalignment from Reward Hacking in Production RL] (Anthropic, [https://www.anthropic.com/research/emergent-misalignment-reward-hacking blog])

* 2025-12: [https://arxiv.org/abs/2512.16856 Distributional AGI Safety]

* 2025-12: [https://arxiv.org/abs/2511.22662 Difficulties with Evaluating a Deception Detector for AIs]

* 2025-12: [https://cdn.openai.com/pdf/d57827c6-10bc-47fe-91aa-0fde55bd3901/monitoring-monitorability.pdf Monitoring Monitorability] (OpenAI)

* 2026-01: [https://www.nature.com/articles/s41586-025-09937-5 Training large language models on narrow tasks can lead to broad misalignment]
** 2025-02: Preprint: [https://martins1612.github.io/emergent_misalignment_betley.pdf Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs]

* 2026-02: [https://arxiv.org/pdf/2601.23045 The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity?] (Anthropic, [https://alignment.anthropic.com/2026/hot-mess-of-ai/ blog])
  
 
==Demonstrations of Negative Use Capabilities==
 
* 2024-12: [https://arxiv.org/abs/2412.00586 Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects]
 
* 2025-04: [https://www.nathanlabenz.com/ Nathan Labenz] ([https://www.cognitiverevolution.ai/ The Cognitive Revolution]): [https://docs.google.com/presentation/d/1mvkpg1mtAvGzTiiwYPc6bKOGsQXDIwMb-ytQECb3i7I/edit#slide=id.g252d9e67d86_0_16 AI Bad Behavior]
 
==Threat Vectors==

* 2024-01: [https://arxiv.org/abs/2401.05566 Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training]

* 2025-10: [https://arxiv.org/abs/2510.07192 Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples]
  
 
=See Also=
 
* [[AI predictions]]
 