Difference between revisions of "AI safety"

From GISAXS
 
* 2025-03: [https://arxiv.org/abs/2209.00626v7 The Alignment Problem from a Deep Learning Perspective]
 
* 2025-03: [https://assets.anthropic.com/m/317564659027fb33/original/Auditing-Language-Models-for-Hidden-Objectives.pdf Auditing language models for hidden objectives] (Anthropic, [https://www.anthropic.com/research/auditing-hidden-objectives blog])
 
* 2025-03: [https://arxiv.org/abs/2503.13621 Superalignment with Dynamic Human Values]
 
==Demonstrations of Negative Use Capabilities==
* 2024-12: [https://arxiv.org/abs/2412.00586 Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects]
  
 
=See Also=
 
* [[AI predictions]]

Revision as of 13:03, 24 March 2025
