Difference between revisions of "AI safety"
Revision as of 08:49, 16 July 2025
Learning Resources
Light
- Writing Doom (27m video): short film on Superintelligence (2024)
- a casual intro to AI doom and alignment (2022)
- Anthony Aguirre: Keep The Future Human
- Interactive Explainer
- Essay: Keep the Future Human
- We Can’t Stop AI – Here’s What To Do Instead (4m video, 2025)
- The 4 Rules That Could Stop AI Before It’s Too Late (15m video, 2025)
 
- Tristan Harris TED talk (15m): Why AI is our ultimate test and greatest invitation
- Text version: Center for Humane Technology: The Narrow Path: Why AI is Our Ultimate Test and Greatest Invitation
 
- Fable about Transformative AI
Deep
- The Compendium: Humanity risks extinction from its very creations — AIs. (2024)
- Introduction to AI Safety, Ethics, and Society (Dan Hendrycks, Center for AI Safety)
- AI Safety FAQ
- DeepMind short course on AGI safety
Description of Safety Concerns
Key Concepts
- Instrumental Convergence
- Orthogonality Thesis
- Inner/outer alignment
- Mesa-optimization
- Overhang
- Reward is not the optimization target (Alex Turner)
Medium-term Risks
- 2023-04: “A.I. Dilemma – Tristan Harris and Aza Raskin” (video) (.website-files.com/5f0e1294f002b1bb26e1f304/64224a9051a6637c1b60162a_65-your-undivided-attention-The-AI-Dilemma-transcript.pdf podcast transcript): raises concern about human ability to handle these transformations
- 2023-04: Daniel Schmachtenberger and Liv Boeree (video): AI could accelerate perverse social dynamics
- 2023-10: Sociotechnical Safety Evaluation of Generative AI Systems (Google DeepMind)
- 2024-02: Towards a Cautious Scientist AI with Convergent Safety Bounds (Yoshua Bengio)
- 2024-07: Reasoning through arguments against taking AI safety seriously (Yoshua Bengio)
- 2025-04: AI-Enabled Coups: How a Small Group Could Use AI to Seize Power
- 2025-06: The Singapore Consensus on Global AI Safety Research Priorities
Long-term (x-risk)
- 2015-02: Sam Altman: Machine intelligence, part 1
- 2019-03: Daniel Kokotajlo and Wei Dai: The Main Sources of AI Risk?
- 2022-06: Eliezer Yudkowsky: AGI Ruin: A List of Lethalities
- 2024-11: Marcus Arvan: ‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for
- 2025-04: ASI existential risk: reconsidering alignment as a goal
Status
- 2025-01: International Safety Report: The International Scientific Report on the Safety of Advanced AI (January 2025)
- AI Lab Watch (safety scorecard)
Assessment
- AI Assessment Scale (AIAS): A practical framework to guide the appropriate and ethical use of generative AI in assessment design, empowering educators to make purposeful, evidence-based decisions
Policy
- 2024-07: On the Limitations of Compute Thresholds as a Governance Strategy (Sara Hooker)
- 2024-07: Framework Convention on Global AI Challenges (CIGI)
- 2024-08: NIST guidelines: Managing Misuse Risk for Dual-Use Foundation Models
Proposals
- 2025-02: Responsible AI Agents
- 2025-03: Control AI: The Direct Institutional Plan
- 2025-04: Google DeepMind: Taking a responsible path to AGI
Research
- 2022-09: The alignment problem from a deep learning perspective
- 2022-12: Discovering Latent Knowledge in Language Models Without Supervision
- 2023-02: Pretraining Language Models with Human Preferences
- 2023-04: Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
- 2023-05: Model evaluation for extreme risks (DeepMind)
- 2023-05: Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
- 2023-06: Preference Ranking Optimization for Human Alignment
- 2023-08: Self-Alignment with Instruction Backtranslation
- 2023-11: Debate Helps Supervise Unreliable Experts
- 2023-12: Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision (OpenAI, blog)
- 2023-12: Practices for Governing Agentic AI Systems (OpenAI, blog)
- 2024-01: Sleeper Agents: Training Deceptive LLMs that Persist through Safety Training (Anthropic)
- 2024-04: The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions (OpenAI)
- 2024-07: On scalable oversight with weak LLMs judging strong LLMs
- 2024-07: Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? (Dan Hendrycks et al.)
- 2024-08: Tamper-Resistant Safeguards for Open-Weight LLMs (project, code)
- 2024-08: Better Alignment with Instruction Back-and-Forth Translation
- 2024-10: First-Person Fairness in Chatbots (OpenAI, blog)
- 2024-10: Sabotage evaluations for frontier models (Anthropic, blog)
- 2024-12: Alignment Faking in Large Language Models (Anthropic)
- 2024-12: Best-of-N Jailbreaking (code)
- 2024-12: Towards Safe and Honest AI Agents with Neural Self-Other Overlap
- 2024-12: Deliberative Alignment: Reasoning Enables Safer Language Models (OpenAI)
- 2025-01: Trading Inference-Time Compute for Adversarial Robustness (OpenAI, blog)
- 2025-01: Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming (Anthropic, blog)
- 2025-02: Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs (site, github)
- 2025-02: Auditing Prompt Caching in Language Model APIs
- 2025-03: The Alignment Problem from a Deep Learning Perspective
- 2025-03: Auditing language models for hidden objectives (Anthropic, blog)
- 2025-03: Superalignment with Dynamic Human Values
- 2025-04: Contemplative Wisdom for Superalignment
- 2025-04: Scaling Laws for Scalable Oversight (preprint, code)
- 2025-06: SHADE-Arena: Evaluating sabotage and monitoring in LLM agents (Anthropic, blog)
- 2025-06: Avoiding Obfuscation with Prover-Estimator Debate
- 2025-06: Persona Features Control Emergent Misalignment (OpenAI, blog)
- 2025-07: Why Do Some Language Models Fake Alignment While Others Don't? (Anthropic, code)
- 2025-07: Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Demonstrations of Negative Use Capabilities
- 2024-12: Evaluating Large Language Models' Capability to Launch Fully Automated Spear Phishing Campaigns: Validated on Human Subjects
- 2025-04: Nathan Labenz (The Cognitive Revolution): AI Bad Behavior

