Difference between revisions of "AI safety"

* Tristan Harris TED talk (15m): [https://www.ted.com/talks/tristan_harris_why_ai_is_our_ultimate_test_and_greatest_invitation Why AI is our ultimate test and greatest invitation]
** Text version: Center for Humane Technology: [https://centerforhumanetechnology.substack.com/p/the-narrow-path-why-ai-is-our-ultimate The Narrow Path: Why AI is Our Ultimate Test and Greatest Invitation]
* [https://x.com/KeiranJHarris/status/1935429439476887594 Fable about Transformative AI]
  
 
==Deep==
 
* 2024-07: [https://yoshuabengio.org/2024/07/09/reasoning-through-arguments-against-taking-ai-safety-seriously/ Reasoning through arguments against taking AI safety seriously] (Yoshua Bengio)
* 2025-04: [https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power AI-Enabled Coups: How a Small Group Could Use AI to Seize Power]
* 2025-06: [https://arxiv.org/abs/2506.20702 The Singapore Consensus on Global AI Safety Research Priorities]
  
 
==Long-term (x-risk)==
 
=Status=

* 2025-01: [https://assets.publishing.service.gov.uk/media/679a0c48a77d250007d313ee/International_AI_Safety_Report_2025_accessible_f.pdf International AI Safety Report: The International Scientific Report on the Safety of Advanced AI (January 2025)]
* [https://ailabwatch.org/ AI Lab Watch] (safety scorecard)
  
 
==Assessment==

* AI Assessment Scale (AIAS): A practical framework to guide the appropriate and ethical use of generative AI in assessment design, empowering educators to make purposeful, evidence-based decisions
 
* 2025-06: [https://assets.anthropic.com/m/4fb35becb0cd87e1/original/SHADE-Arena-Paper.pdf SHADE-Arena: Evaluating sabotage and monitoring in LLM agents] (Anthropic, [https://www.anthropic.com/research/shade-arena-sabotage-monitoring blog])
* 2025-06: [https://arxiv.org/abs/2506.13609 Avoiding Obfuscation with Prover-Estimator Debate]
* 2025-06: [https://cdn.openai.com/pdf/a130517e-9633-47bc-8397-969807a43a23/emergent_misalignment_paper.pdf Persona Features Control Emergent Misalignment] (OpenAI, [https://openai.com/index/emergent-misalignment/ blog])
  
 
==Demonstrations of Negative Use Capabilities==
