* 2024-11: Marcus Arvan: [https://link.springer.com/article/10.1007/s00146-024-02113-9 ‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for]
 
* 2025-04: [https://michaelnotebook.com/xriskbrief/index.html ASI existential risk: reconsidering alignment as a goal]
 
* 2025-12: Philip Trammell and Leopold Aschenbrenner: [https://philiptrammell.com/static/Existential_Risk_and_Growth.pdf Existential Risk and Growth]
  
 
=Status=
 
* 2025-12: [https://arxiv.org/abs/2511.22662 Difficulties with Evaluating a Deception Detector for AIs]
 
* 2025-12: [https://cdn.openai.com/pdf/d57827c6-10bc-47fe-91aa-0fde55bd3901/monitoring-monitorability.pdf Monitoring Monitorability] (OpenAI)
 
* 2026-01: [https://www.nature.com/articles/s41586-025-09937-5 Training large language models on narrow tasks can lead to broad misalignment]
** 2025-02: Preprint: [https://martins1612.github.io/emergent_misalignment_betley.pdf Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs]
  
 
==Demonstrations of Negative Use Capabilities==
