==Long-term (x-risk)==

* 2015-02: Sam Altman: [https://blog.samaltman.com/machine-intelligence-part-1 Machine intelligence, part 1]
* 2019-03: Daniel Kokotajlo and Wei Dai: [https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk The Main Sources of AI Risk?]
* 2022-06: Eliezer Yudkowsky: [https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities AGI Ruin: A List of Lethalities]
* 2024-11: Marcus Arvan: [https://link.springer.com/article/10.1007/s00146-024-02113-9 ‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for]
* 2025-04: [https://michaelnotebook.com/xriskbrief/index.html ASI existential risk: reconsidering alignment as a goal]
  
 
=Status=
 
* 2025-01: [https://assets.publishing.service.gov.uk/media/679a0c48a77d250007d313ee/International_AI_Safety_Report_2025_accessible_f.pdf International AI Safety Report: The International Scientific Report on the Safety of Advanced AI (January 2025)]

==Assessment==

* [https://aiassessmentscale.com/ AI Assessment Scale (AIAS)]: A practical framework to guide the appropriate and ethical use of generative AI in assessment design, empowering educators to make purposeful, evidence-based decisions
  
 
==Policy==
 
==Research==
 
* 2025-04: [https://arxiv.org/abs/2504.15125 Contemplative Wisdom for Superalignment]
* 2025-04: [https://www.lesswrong.com/posts/x59FhzuM9yuvZHAHW/untitled-draft-yhra Scaling Laws for Scalable Oversight] ([https://arxiv.org/abs/2504.18530 preprint], [https://github.com/subhashk01/oversight-scaling-laws code])
* 2025-06: [https://assets.anthropic.com/m/4fb35becb0cd87e1/original/SHADE-Arena-Paper.pdf SHADE-Arena: Evaluating sabotage and monitoring in LLM agents] (Anthropic, [https://www.anthropic.com/research/shade-arena-sabotage-monitoring blog])
* 2025-06: [https://arxiv.org/abs/2506.13609 Avoiding Obfuscation with Prover-Estimator Debate]

==Demonstrations of Negative Use Capabilities==
