AI safety

From GISAXS
Research
 
* 2025-02: [https://drive.google.com/file/d/1QAzSj24Fp0O6GfkskmnULmI1Hmx7k_EJ/view Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs] ([https://www.emergent-values.ai/ site], [https://github.com/centerforaisafety/emergent-values github])
 
* 2025-02: [https://arxiv.org/abs/2502.07776 Auditing Prompt Caching in Language Model APIs]
 
* 2025-02: [https://arxiv.org/abs/2502.14143 Multi-Agent Risks from Advanced AI]
 
* 2025-03: [https://arxiv.org/abs/2209.00626v7 The Alignment Problem from a Deep Learning Perspective]
 
* 2025-03: [https://assets.anthropic.com/m/317564659027fb33/original/Auditing-Language-Models-for-Hidden-Objectives.pdf Auditing language models for hidden objectives] (Anthropic, [https://www.anthropic.com/research/auditing-hidden-objectives blog])
 
* 2025-09: [https://openai.com/index/detecting-and-reducing-scheming-in-ai-models/ Detecting and reducing scheming in AI models]
 
* 2025-11: [https://assets.anthropic.com/m/74342f2c96095771/original/Natural-emergent-misalignment-from-reward-hacking-paper.pdf Natural Emergent Misalignment from Reward Hacking in Production RL] (Anthropic, [https://www.anthropic.com/research/emergent-misalignment-reward-hacking blog])
 
* 2025-12: [https://arxiv.org/abs/2512.16856 Distributional AGI Safety]
 
* 2025-12: [https://arxiv.org/abs/2511.22662 Difficulties with Evaluating a Deception Detector for AIs]
 
* 2025-12: [https://cdn.openai.com/pdf/d57827c6-10bc-47fe-91aa-0fde55bd3901/monitoring-monitorability.pdf Monitoring Monitorability] (OpenAI)
 

Contents

Learning Resources

Light

Deep

Description of Safety Concerns

Key Concepts

Medium-term Risks

Long-term (x-risk)

Status

Assessment

Policy

Proposals

Research

Demonstrations of Negative Use Capabilities

Threat Vectors

See Also