==Research==

* 2025-07: [https://arxiv.org/abs/2506.18032 Why Do Some Language Models Fake Alignment While Others Don't?] (Anthropic, [https://github.com/safety-research/open-source-alignment-faking code])
 
* 2025-07: [https://arxiv.org/abs/2507.11473 Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety]
 
* 2025-09: [https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/ Detecting and reducing scheming in AI models]
  
 
==Demonstrations of Negative Use Capabilities==
 
* 2024-12: [https://arxiv.org/abs/2412.00586 Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects]
 
* 2025-04: [https://www.nathanlabenz.com/ Nathan Labenz] ([https://www.cognitiverevolution.ai/ The Cognitive Revolution]): [https://docs.google.com/presentation/d/1mvkpg1mtAvGzTiiwYPc6bKOGsQXDIwMb-ytQECb3i7I/edit#slide=id.g252d9e67d86_0_16 AI Bad Behavior]
 
==Threat Vectors==

* 2024-01: [https://arxiv.org/abs/2401.05566 Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training]

* 2025-10: [https://arxiv.org/abs/2510.07192 Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples]
  
 
=See Also=
 
* [[AI predictions]]
 