Difference between revisions of "AI safety"

From GISAXS
 
* 2025-03: [https://assets.anthropic.com/m/317564659027fb33/original/Auditing-Language-Models-for-Hidden-Objectives.pdf Auditing language models for hidden objectives] (Anthropic, [https://www.anthropic.com/research/auditing-hidden-objectives blog])
 
* 2025-03: [https://arxiv.org/abs/2503.13621 Superalignment with Dynamic Human Values]
 
* 2025-04: [https://arxiv.org/abs/2504.15125 Contemplative Wisdom for Superalignment]
  
 
==Demonstrations of Negative Use Capabilities==
 

Revision as of 15:30, 28 April 2025
