AI safety

== Research ==

* 2024-04: [https://arxiv.org/abs/2404.13208 The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions] (OpenAI)
* 2024-07: [https://arxiv.org/abs/2407.21792 Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?] (Dan Hendrycks et al.)
* 2024-08: [https://arxiv.org/abs/2408.00761 Tamper-Resistant Safeguards for Open-Weight LLMs] ([https://www.tamper-resistant-safeguards.com/ project], [https://github.com/rishub-tamirisa/tamper-resistance/ code])
* 2024-08: [https://arxiv.org/abs/2408.04614 Better Alignment with Instruction Back-and-Forth Translation]
* 2024-10: [https://cdn.openai.com/papers/first-person-fairness-in-chatbots.pdf First-Person Fairness in Chatbots] (OpenAI, [https://openai.com/index/evaluating-fairness-in-chatgpt/ blog])
* 2024-10: [https://assets.anthropic.com/m/377027d5b36ac1eb/original/Sabotage-Evaluations-for-Frontier-Models.pdf Sabotage Evaluations for Frontier Models] (Anthropic, [https://www.anthropic.com/research/sabotage-evaluations blog])