AI safety

== Research ==

* 2023-04: [https://arxiv.org/abs/2304.03279 Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark]
* 2023-05: [https://arxiv.org/abs/2305.15324 Model evaluation for extreme risks] (DeepMind)
* 2023-05: [https://arxiv.org/abs/2305.03047 Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision]
* 2023-06: [https://arxiv.org/abs/2306.17492 Preference Ranking Optimization for Human Alignment]
