Difference between revisions of "AI safety"

From GISAXS
Jump to: navigation, search
(Policy)
(Research)
Line 46: Line 46:
 
* 2024-01: [https://arxiv.org/abs/2401.05566 Sleeper Agents: Training Deceptive LLMs that Persist through Safety Training] (Anthropic)
 
* 2024-01: [https://arxiv.org/abs/2401.05566 Sleeper Agents: Training Deceptive LLMs that Persist through Safety Training] (Anthropic)
 
* 2024-04: [https://arxiv.org/abs/2404.13208 The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions] (OpenAI)
 
* 2024-04: [https://arxiv.org/abs/2404.13208 The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions] (OpenAI)
 
 
* 2024-07: [https://arxiv.org/abs/2407.04622 On scalable oversight with weak LLMs judging strong LLMs]
 
* 2024-07: [https://arxiv.org/abs/2407.04622 On scalable oversight with weak LLMs judging strong LLMs]
 
* 2024-07: [https://arxiv.org/abs/2407.21792 Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?] (Dan Hendrycks et al.)
 
* 2024-07: [https://arxiv.org/abs/2407.21792 Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?] (Dan Hendrycks et al.)

Revision as of 14:33, 14 February 2025

Description of Safety Concerns

Key Concepts

Medium-term Risks

Long-term (x-risk)

Learning Resources

Status

Policy

Research