AI safety

==Medium-term Risks==
 
* 2023-04: [https://www.youtube.com/watch?v=KCSsKV5F4xc Daniel Schmachtenberger and Liv Boeree (video)]: AI could accelerate perverse social dynamics
 
* 2023-10: [https://arxiv.org/pdf/2310.11986 Sociotechnical Safety Evaluation of Generative AI Systems] (Google DeepMind)
 
* 2024-02: [https://yoshuabengio.org/2024/02/26/towards-a-cautious-scientist-ai-with-convergent-safety-bounds/ Towards a Cautious Scientist AI with Convergent Safety Bounds] (Yoshua Bengio)

* 2024-07: [https://yoshuabengio.org/2024/07/09/reasoning-through-arguments-against-taking-ai-safety-seriously/ Reasoning through arguments against taking AI safety seriously] (Yoshua Bengio)
  
 
==Long-term (x-risk)==
 
  
 
=Learning Resources=
 
* [https://www.aisafetybook.com/ Introduction to AI Safety, Ethics, and Society] (Dan Hendrycks, [https://www.safe.ai/ Center for AI Safety])
 
* [https://aisafety.info/ AI Safety FAQ]
 
* [https://www.youtube.com/watch?v=xfMQ7hzyFW4 Writing Doom (video)] 27m short film on Superintelligence (2024)
 
* [https://deepmindsafetyresearch.medium.com/introducing-our-short-course-on-agi-safety-1072adb7912c DeepMind short course on AGI safety]

=Status=

* 2025-01: [https://assets.publishing.service.gov.uk/media/679a0c48a77d250007d313ee/International_AI_Safety_Report_2025_accessible_f.pdf International AI Safety Report: The International Scientific Report on the Safety of Advanced AI (January 2025)]

==Policy==

* 2024-07: [https://arxiv.org/abs/2407.05694 On the Limitations of Compute Thresholds as a Governance Strategy] (Sara Hooker)

* 2024-07: [https://www.cigionline.org/static/documents/AI-challenges.pdf Framework Convention on Global AI Challenges] ([https://www.cigionline.org/ CIGI])

* 2024-08: NIST guidelines: [https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-1.ipd.pdf Managing Misuse Risk for Dual-Use Foundation Models]
  
 
=Research=
 
* 2024-01: [https://arxiv.org/abs/2401.05566 Sleeper Agents: Training Deceptive LLMs that Persist through Safety Training] (Anthropic)
 
* 2024-04: [https://arxiv.org/abs/2404.13208 The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions] (OpenAI)
 
* 2024-07: [https://arxiv.org/abs/2407.04622 On scalable oversight with weak LLMs judging strong LLMs]

* 2024-07: [https://arxiv.org/abs/2407.21792 Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?] (Dan Hendrycks et al.)

* 2024-08: [https://arxiv.org/abs/2408.00761 Tamper-Resistant Safeguards for Open-Weight LLMs] ([https://www.tamper-resistant-safeguards.com/ project], [https://github.com/rishub-tamirisa/tamper-resistance/ code])

* 2024-08: [https://arxiv.org/abs/2408.04614 Better Alignment with Instruction Back-and-Forth Translation]

* 2024-10: [https://cdn.openai.com/papers/first-person-fairness-in-chatbots.pdf First-Person Fairness in Chatbots] (OpenAI, [https://openai.com/index/evaluating-fairness-in-chatgpt/ blog])

* 2024-10: [https://assets.anthropic.com/m/377027d5b36ac1eb/original/Sabotage-Evaluations-for-Frontier-Models.pdf Sabotage evaluations for frontier models] (Anthropic, [https://www.anthropic.com/research/sabotage-evaluations blog])

* 2024-12: [https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf Alignment Faking in Large Language Models] (Anthropic)

* 2024-12: [https://arxiv.org/abs/2412.03556 Best-of-N Jailbreaking] ([https://github.com/jplhughes/bon-jailbreaking code])

* 2024-12: [https://arxiv.org/abs/2412.16339 Deliberative Alignment: Reasoning Enables Safer Language Models] (OpenAI)

* 2025-01: [https://cdn.openai.com/papers/trading-inference-time-compute-for-adversarial-robustness-20250121_1.pdf Trading Inference-Time Compute for Adversarial Robustness] (OpenAI, [https://openai.com/index/trading-inference-time-compute-for-adversarial-robustness/ blog])
 
* 2025-01: [https://arxiv.org/abs/2501.18837 Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming] (Anthropic, [https://www.anthropic.com/research/constitutional-classifiers blog])
 
* 2025-02: [https://drive.google.com/file/d/1QAzSj24Fp0O6GfkskmnULmI1Hmx7k_EJ/view Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs] ([https://www.emergent-values.ai/ site], [https://github.com/centerforaisafety/emergent-values github])
 
* 2025-02: [https://arxiv.org/abs/2502.07776 Auditing Prompt Caching in Language Model APIs]
 
=See Also=

* [[AI predictions]]
