Revision as of 15:33, 14 February 2025

Description of Safety Concerns

Key Concepts

Instrumental Convergence
Orthogonality Thesis
Inner/outer alignment
Mesa-optimization
Overhang
Reward is not the optimization target (Alex Turner)

Medium-term Risks

2023-04: A.I. Dilemma – Tristan Harris and Aza Raskin” (video) (.website-files.com/5f0e1294f002b1bb26e1f304/64224a9051a6637c1b60162a_65-your-undivided-attention-The-AI-Dilemma-transcript.pdf podcast transcript): raises concern about human ability to handle these transformations
2023-04: Daniel Schmachtenberger and Liv Boeree (video): AI could accelerate perverse social dynamics
2023-10: Sociotechnical Safety Evaluation of Generative AI Systems (Google DeepMind)
2024-02: Towards a Cautious Scientist AI with Convergent Safety Bounds (Yoshua Bengio)
2024-07: Reasoning through arguments against taking AI safety seriously (Yoshua Bengio)

Long-term (x-risk)

List AGI Ruin: A List of Lethalities (Eliezer Yudkowsky)

Learning Resources

Introduction to AI Safety, Ethics, and Society (Dan Hendrycks, Center for AI Safety)
AI Safety FAQ
Writing Doom (video) 27m short film on Superintelligence (2024)
DeepMind short course on AGI safety

Status

2025-01:International Safety Report: The International Scientific Report on the Safety of Advanced AI (January 2025)

Policy

2024-07: On the Limitations of Compute Thresholds as a Governance Strategy Sara Hooker
2024-07: Framework Convention on Global AI Challenges (CIGI)
2024-08: NIST guidelines: Managing Misuse Risk for Dual-Use Foundation Models

Research

2022-12: Discovering Latent Knowledge in Language Models Without Supervision
2023-02: Pretraining Language Models with Human Preferences
2023-04: Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
2023-05: Model evaluation for extreme risks (DeepMind)
2023-05: Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
2023-06: Preference Ranking Optimization for Human Alignment
2023-08: Self-Alignment with Instruction Backtranslation
2023-11: Debate Helps Supervise Unreliable Experts
2023-12: Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision (OpenAI, blog)
2023-12: Practices for Governing Agentic AI Systems (OpenAI, blog)
2024-01: Sleeper Agents: Training Deceptive LLMs that Persist through Safety Training (Anthropic)
2024-04: The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions (OpenAI)
2024-07: On scalable oversight with weak LLMs judging strong LLMs
2024-07: Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? (Dan Hendrycks et al.)
2024-08: Tamper-Resistant Safeguards for Open-Weight LLMs (project, code)
2024-08: Better Alignment with Instruction Back-and-Forth Translation
2024-10: First-Person Fairness in Chatbots (OpenAI, blog)
2024-10: Sabotage evaluations for frontier models (Anthropic, blog)
2024-12: Alignment Faking in Large Language Models (Anthropic)
2024-12: Best-of-N Jailbreaking (code)
2024-12: Deliberative Alignment: Reasoning Enables Safer Language Models (OpenAI)
2025-01: Trading Inference-Time Compute for Adversarial Robustness (OpenAI, blog)
2025-01: Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming (Anthropic, blog,
2025-02: Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs (site, github)
2025-02: Auditing Prompt Caching in Language Model APIs

@@ Line 46: / Line 46: @@
 * 2024-01: [https://arxiv.org/abs/2401.05566 Sleeper Agents: Training Deceptive LLMs that Persist through Safety Training] (Anthropic)
 * 2024-04: [https://arxiv.org/abs/2404.13208 The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions] (OpenAI)
 * 2024-07: [https://arxiv.org/abs/2407.04622 On scalable oversight with weak LLMs judging strong LLMs]
 * 2024-07: [https://arxiv.org/abs/2407.21792 Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?] (Dan Hendrycks et al.)

Difference between revisions of "AI safety"

Revision as of 15:33, 14 February 2025

Contents

Description of Safety Concerns

Key Concepts

Medium-term Risks

Long-term (x-risk)

Learning Resources

Status

Policy

Research

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools