=Learning Resources=

==Light==

* [https://www.youtube.com/watch?v=xfMQ7hzyFW4 Writing Doom] (27m video): short film on Superintelligence (2024)

* [https://orxl.org/ai-doom.html a casual intro to AI doom and alignment] (2022)

* Anthony Aguirre: [https://keepthefuturehuman.ai/ Keep The Future Human]
** [https://interactive.keepthefuturehuman.ai/ Interactive Explainer]
** [https://keepthefuturehuman.ai/essay/ Essay: Keep the Future Human]
** [https://www.youtube.com/watch?v=27KDl2uPiL8 We Can’t Stop AI – Here’s What To Do Instead] (4m video, 2025)
** [https://www.youtube.com/watch?v=zeabrXV8zNE The 4 Rules That Could Stop AI Before It’s Too Late] (15m video, 2025)

* Tristan Harris TED talk (15m): [https://www.ted.com/talks/tristan_harris_why_ai_is_our_ultimate_test_and_greatest_invitation Why AI is our ultimate test and greatest invitation]
** Text version: Center for Humane Technology: [https://centerforhumanetechnology.substack.com/p/the-narrow-path-why-ai-is-our-ultimate The Narrow Path: Why AI is Our Ultimate Test and Greatest Invitation]

==Deep==

* [https://www.thecompendium.ai/ The Compendium: Humanity risks extinction from its very creations — AIs.] (2024)

* [https://www.aisafetybook.com/ Introduction to AI Safety, Ethics, and Society] (Dan Hendrycks, [https://www.safe.ai/ Center for AI Safety])

* [https://aisafety.info/ AI Safety FAQ]

* [https://deepmindsafetyresearch.medium.com/introducing-our-short-course-on-agi-safety-1072adb7912c DeepMind short course on AGI safety]
  
 
=Description of Safety Concerns=
 
* 2024-02: [https://yoshuabengio.org/2024/02/26/towards-a-cautious-scientist-ai-with-convergent-safety-bounds/ Towards a Cautious Scientist AI with Convergent Safety Bounds] (Yoshua Bengio)

* 2024-07: [https://yoshuabengio.org/2024/07/09/reasoning-through-arguments-against-taking-ai-safety-seriously/ Reasoning through arguments against taking AI safety seriously] (Yoshua Bengio)

* 2025-04: [https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power AI-Enabled Coups: How a Small Group Could Use AI to Seize Power]
  
 
==Long-term (x-risk)==
* 2015-02: Sam Altman: [https://blog.samaltman.com/machine-intelligence-part-1 Machine intelligence, part 1]

* 2022-06: [https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities AGI Ruin: A List of Lethalities] (Eliezer Yudkowsky)

* 2024-11: [https://link.springer.com/article/10.1007/s00146-024-02113-9 ‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for] (Marcus Arvan)

* 2025-04: [https://michaelnotebook.com/xriskbrief/index.html ASI existential risk: reconsidering alignment as a goal]
  
 
=Status=
 
* 2025-01: [https://assets.publishing.service.gov.uk/media/679a0c48a77d250007d313ee/International_AI_Safety_Report_2025_accessible_f.pdf International Safety Report: The International Scientific Report on the Safety of Advanced AI (January 2025)]
 
==Assessment==
* [https://aiassessmentscale.com/ AI Assessment Scale (AIAS)]: A practical framework to guide the appropriate and ethical use of generative AI in assessment design, empowering educators to make purposeful, evidence-based decisions
  
 
==Policy==
 
* 2025-02: [https://arxiv.org/abs/2502.18359 Responsible AI Agents]
 
 
* 2025-03: [https://controlai.com/ Control AI] [https://controlai.com/dip The Direct Institutional Plan]
 
* 2025-04: Google DeepMind: [https://deepmind.google/discover/blog/taking-a-responsible-path-to-agi/ Taking a responsible path to AGI]
** Paper: [https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/evaluating-potential-cybersecurity-threats-of-advanced-ai/An_Approach_to_Technical_AGI_Safety_Apr_2025.pdf An Approach to Technical AGI Safety and Security]
  
 
=Research=
 
* 2025-03: [https://assets.anthropic.com/m/317564659027fb33/original/Auditing-Language-Models-for-Hidden-Objectives.pdf Auditing language models for hidden objectives] (Anthropic, [https://www.anthropic.com/research/auditing-hidden-objectives blog])
 
 
* 2025-03: [https://arxiv.org/abs/2503.13621 Superalignment with Dynamic Human Values]
 
* 2025-04: [https://arxiv.org/abs/2504.15125 Contemplative Wisdom for Superalignment]
* 2025-04: [https://www.lesswrong.com/posts/x59FhzuM9yuvZHAHW/untitled-draft-yhra Scaling Laws for Scalable Oversight] ([https://arxiv.org/abs/2504.18530 preprint], [https://github.com/subhashk01/oversight-scaling-laws code])
  
 
==Demonstrations of Negative Use Capabilities==
 
* 2024-12: [https://arxiv.org/abs/2412.00586 Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects]
 
* 2025-04: [https://www.nathanlabenz.com/ Nathan Labenz] ([https://www.cognitiverevolution.ai/ The Cognitive Revolution]): [https://docs.google.com/presentation/d/1mvkpg1mtAvGzTiiwYPc6bKOGsQXDIwMb-ytQECb3i7I/edit#slide=id.g252d9e67d86_0_16 AI Bad Behavior]
  
 
=See Also=
 
* [[AI predictions]]
 
