AI safety

From GISAXS (revision as of 14:41, 10 April 2025)
 
==Long-term (x-risk)==

* [https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities AGI Ruin: A List of Lethalities] (Eliezer Yudkowsky)
* [https://link.springer.com/article/10.1007/s00146-024-02113-9 ‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for] (Marcus Arvan)
=Status=

* 2025-02: [https://arxiv.org/abs/2502.18359 Responsible AI Agents]
* 2025-03: [https://controlai.com/ Control AI]: [https://controlai.com/dip The Direct Institutional Plan]
* 2025-04: Google DeepMind: [https://deepmind.google/discover/blog/taking-a-responsible-path-to-agi/ Taking a responsible path to AGI]
** Paper: [https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/evaluating-potential-cybersecurity-threats-of-advanced-ai/An_Approach_to_Technical_AGI_Safety_Apr_2025.pdf An Approach to Technical AGI Safety and Security]

=Research=

Contents:
* Learning Resources
* Light
* Deep
* Description of Safety Concerns
* Key Concepts
* Medium-term Risks
* Long-term (x-risk)
* Status
* Policy
* Proposals
* Research
* Demonstrations of Negative Use Capabilities
* See Also