AI safety

==Learning Resources==

===Light===

===Deep===

==Description of Safety Concerns==

===Key Concepts===

===Medium-term Risks===

===Long-term (x-risk)===

==Status==

===Assessment===

* AI Assessment Scale (AIAS): A practical framework to guide the appropriate and ethical use of generative AI in assessment design, empowering educators to make purposeful, evidence-based decisions.

==Policy==

===Proposals===

==Research==

* 2025-06: [https://arxiv.org/abs/2506.13609 Avoiding Obfuscation with Prover-Estimator Debate]
 
* 2025-06: [https://cdn.openai.com/pdf/a130517e-9633-47bc-8397-969807a43a23/emergent_misalignment_paper.pdf Persona Features Control Emergent Misalignment] (OpenAI, [https://openai.com/index/emergent-misalignment/ blog])
 
* 2025-07: [https://arxiv.org/abs/2506.18032 Why Do Some Language Models Fake Alignment While Others Don't?] (Anthropic, [https://github.com/safety-research/open-source-alignment-faking code])
  
 
==Demonstrations of Negative Use Capabilities==
 
==See Also==