AI safety

== Research ==

* 2023-04: [https://arxiv.org/abs/2304.03279 Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark]
* 2023-05: [https://arxiv.org/abs/2305.15324 Model evaluation for extreme risks] (DeepMind)
* 2023-05: [https://arxiv.org/abs/2305.03047 Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision]
* 2023-06: [https://arxiv.org/abs/2306.17492 Preference Ranking Optimization for Human Alignment]
