AI safety

* 2024-12: [https://arxiv.org/abs/2412.03556 Best-of-N Jailbreaking] ([https://github.com/jplhughes/bon-jailbreaking code])
* 2024-12: [https://arxiv.org/abs/2412.16325 Towards Safe and Honest AI Agents with Neural Self-Other Overlap]
** 2024-07: [https://www.lesswrong.com/posts/hzt9gHpNwA2oHtwKX/self-other-overlap-a-neglected-approach-to-ai-alignment Self-Other Overlap: A Neglected Approach to AI Alignment]
** 2025-03: [https://www.lesswrong.com/posts/jtqcsARGtmgogdcLT/reducing-llm-deception-at-scale-with-self-other-overlap-fine Reducing LLM deception at scale with self-other overlap fine-tuning]
* 2024-12: [https://arxiv.org/abs/2412.16339 Deliberative Alignment: Reasoning Enables Safer Language Models] (OpenAI)

Description of Safety Concerns

Key Concepts

Medium-term Risks

Long-term (x-risk)

Learning Resources

Status

Policy

Research

See Also