AI safety

==Medium-term Risks==
* 2023-04: [https://www.youtube.com/watch?v=KCSsKV5F4xc Daniel Schmachtenberger and Liv Boeree (video)]: AI could accelerate perverse social dynamics
* 2023-10: [https://arxiv.org/pdf/2310.11986 Sociotechnical Safety Evaluation of Generative AI Systems] (Google DeepMind)
* 2024-02: [https://yoshuabengio.org/2024/02/26/towards-a-cautious-scientist-ai-with-convergent-safety-bounds/ Towards a Cautious Scientist AI with Convergent Safety Bounds] (Yoshua Bengio)
* 2024-07: [https://yoshuabengio.org/2024/07/09/reasoning-through-arguments-against-taking-ai-safety-seriously/ Reasoning through arguments against taking AI safety seriously] (Yoshua Bengio)
 
==Long-term (x-risk)==
 
==Policy==
 
* 2024-07: [https://arxiv.org/abs/2407.05694 On the Limitations of Compute Thresholds as a Governance Strategy] (Sara Hooker)
* 2024-07: [https://www.cigionline.org/static/documents/AI-challenges.pdf Framework Convention on Global AI Challenges] ([https://www.cigionline.org/ CIGI])
* 2024-08: NIST guidelines: [https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.800-1.ipd.pdf Managing Misuse Risk for Dual-Use Foundation Models]
  
==Research==
 
* 2024-01: [https://arxiv.org/abs/2401.05566 Sleeper Agents: Training Deceptive LLMs that Persist through Safety Training] (Anthropic)
* 2024-04: [https://arxiv.org/abs/2404.13208 The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions] (OpenAI)
* 2024-07: [https://arxiv.org/abs/2407.04622 On scalable oversight with weak LLMs judging strong LLMs]
* 2024-07: [https://arxiv.org/abs/2407.21792 Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?] (Dan Hendrycks et al.)
* 2025-02: [https://drive.google.com/file/d/1QAzSj24Fp0O6GfkskmnULmI1Hmx7k_EJ/view Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs] ([https://www.emergent-values.ai/ site], [https://github.com/centerforaisafety/emergent-values github])
* 2025-02: [https://arxiv.org/abs/2502.07776 Auditing Prompt Caching in Language Model APIs]

==See Also==
* [[AI predictions]]
