Difference between revisions of "AI safety"

From GISAXS
 
* 2024-12: [https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf Alignment Faking in Large Language Models] (Anthropic)
 
* 2024-12: [https://arxiv.org/abs/2412.03556 Best-of-N Jailbreaking] ([https://github.com/jplhughes/bon-jailbreaking code])

* 2024-12: [https://arxiv.org/abs/2412.16325 Towards Safe and Honest AI Agents with Neural Self-Other Overlap]
** 2025-03: [https://www.lesswrong.com/posts/jtqcsARGtmgogdcLT/reducing-llm-deception-at-scale-with-self-other-overlap-fine Reducing LLM deception at scale with self-other overlap fine-tuning]
 
* 2024-12: [https://arxiv.org/abs/2412.16339 Deliberative Alignment: Reasoning Enables Safer Language Models] (OpenAI)
 
* 2025-01: [https://cdn.openai.com/papers/trading-inference-time-compute-for-adversarial-robustness-20250121_1.pdf Trading Inference-Time Compute for Adversarial Robustness] (OpenAI, [https://openai.com/index/trading-inference-time-compute-for-adversarial-robustness/ blog])
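The Best-of-N Jailbreaking entry above describes a simple black-box attack loop: repeatedly sample random augmentations of a harmful prompt (the paper uses modality-specific perturbations such as character scrambling and capitalization changes) and query the model until one variant slips past its refusal behavior or the budget N is exhausted. The sketch below illustrates only that core loop under toy assumptions — `toy_model`, `toy_judge`, and the `augment` function are hypothetical stand-ins, not the paper's actual augmentations, model, or harmfulness classifier.

```python
import random


def augment(prompt: str, rng: random.Random) -> str:
    """Randomly flip the case of letters -- a toy stand-in for the
    paper's modality-specific augmentations."""
    return "".join(
        ch.swapcase() if ch.isalpha() and rng.random() < 0.3 else ch
        for ch in prompt
    )


def best_of_n_jailbreak(prompt, query_model, is_harmful, n, seed=0):
    """Sample up to N augmented prompts; stop at the first one whose
    response is judged harmful. Returns (success, attempts_used)."""
    rng = random.Random(seed)
    for attempt in range(1, n + 1):
        candidate = augment(prompt, rng)
        if is_harmful(query_model(candidate)):
            return True, attempt
    return False, n


# Toy stand-ins: a "model" that refuses only the exact blocked string,
# and a trivial judge that treats any non-refusal as a jailbreak.
BLOCKED = "how to pick a lock"

def toy_model(p):
    return "REFUSED" if p == BLOCKED else "sure, here is how..."

def toy_judge(response):
    return not response.startswith("REFUSED")


success, tries = best_of_n_jailbreak(BLOCKED, toy_model, toy_judge, n=100)
print(success, tries)
```

Because the toy model only matches the exact string, almost any augmentation defeats it; the paper's finding is that attack success against real models likewise scales predictably with N.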

Revision as of 12:40, 16 March 2025

Description of Safety Concerns

Key Concepts

Medium-term Risks

Long-term (x-risk)

Learning Resources

Status

Policy

Research

See Also