Revision as of 14:18, 14 February 2025

Description of Safety Concerns

Key Concepts

Instrumental Convergence
Orthogonality Thesis
Inner/outer alignment
Mesa-optimization
Overhang
Reward is not the optimization target (Alex Turner)

Medium-term Risks

2023-04: A.I. Dilemma – Tristan Harris and Aza Raskin” (video) (.website-files.com/5f0e1294f002b1bb26e1f304/64224a9051a6637c1b60162a_65-your-undivided-attention-The-AI-Dilemma-transcript.pdf podcast transcript): raises concern about human ability to handle these transformations
2023-04: Daniel Schmachtenberger and Liv Boeree (video): AI could accelerate perverse social dynamics
2023-10: Sociotechnical Safety Evaluation of Generative AI Systems (Google DeepMind)

Long-term (x-risk)

List AGI Ruin: A List of Lethalities (Eliezer Yudkowsky)

Learning Resources

DeepMind short course on AGI safety
AI Safety FAQ
Writing Doom (video) 27m short film on Superintelligence (2024)

Status

2025-01:International Safety Report: The International Scientific Report on the Safety of Advanced AI (January 2025)

Research

2022-12: Discovering Latent Knowledge in Language Models Without Supervision
2023-02: Pretraining Language Models with Human Preferences
2023-04: Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
2023-05: Model evaluation for extreme risks (DeepMind)
2023-05: Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
2023-06: Preference Ranking Optimization for Human Alignment
2023-08: Self-Alignment with Instruction Backtranslation
2023-11: Debate Helps Supervise Unreliable Experts
2023-12: Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision (OpenAI, blog)
2023-12: Practices for Governing Agentic AI Systems (OpenAI, blog)
2024-01: Sleeper Agents: Training Deceptive LLMs that Persist through Safety Training (Anthropic)
2024-04: The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions (OpenAI)

2024-10: First-Person Fairness in Chatbots (OpenAI, blog)
2024-10: Sabotage evaluations for frontier models (Anthropic, blog)
2024-12: Alignment Faking in Large Language Models (Anthropic)
2024-12: Best-of-N Jailbreaking (code)
2024-12: Deliberative Alignment: Reasoning Enables Safer Language Models (OpenAI)
2025-01: Trading Inference-Time Compute for Adversarial Robustness (OpenAI, blog)
2025-01: Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming (Anthropic, blog,
2025-02: Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs (site, github)
2025-02: Auditing Prompt Caching in Language Model APIs

@@ Line 39: / Line 39: @@
 * 2024-04: [https://arxiv.org/abs/2404.13208 The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions] (OpenAI)
+* 2024-10: [https://cdn.openai.com/papers/first-person-fairness-in-chatbots.pdf First-Person Fairness in Chatbots] (OpenAI, [https://openai.com/index/evaluating-fairness-in-chatgpt/ blog])
+* 2024-10: [https://assets.anthropic.com/m/377027d5b36ac1eb/original/Sabotage-Evaluations-for-Frontier-Models.pdf Sabotage evaluations for frontier models] (Anthropic, [https://www.anthropic.com/research/sabotage-evaluations blog])
+* 2024-12: [https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf Alignment Faking in Large Language Models] (Anthropic)
+* 2024-12: [https://arxiv.org/abs/2412.03556 Best-of-N Jailbreaking] ([https://github.com/jplhughes/bon-jailbreaking code])
+* 2024-12: [https://arxiv.org/abs/2412.16339 Deliberative Alignment: Reasoning Enables Safer Language Models] (OpenAI)
 * 2025-01: [https://cdn.openai.com/papers/trading-inference-time-compute-for-adversarial-robustness-20250121_1.pdf Trading Inference-Time Compute for Adversarial Robustness] (OpenAI, [https://openai.com/index/trading-inference-time-compute-for-adversarial-robustness/ blog])
 * 2025-01: [https://arxiv.org/abs/2501.18837 Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming] (Anthropic, [https://www.anthropic.com/research/constitutional-classifiers blog],
 * 2025-02: [https://drive.google.com/file/d/1QAzSj24Fp0O6GfkskmnULmI1Hmx7k_EJ/view Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs] ([https://www.emergent-values.ai/ site], [https://github.com/centerforaisafety/emergent-values github])
 * 2025-02: [https://arxiv.org/abs/2502.07776 Auditing Prompt Caching in Language Model APIs]

Difference between revisions of "AI safety"

Revision as of 14:18, 14 February 2025

Contents

Description of Safety Concerns

Key Concepts

Medium-term Risks

Long-term (x-risk)

Learning Resources

Status

Research

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools