AI safety

=Description of Safety Concerns=

==Key Concepts==
* [https://en.wikipedia.org/wiki/Instrumental_convergence Instrumental Convergence]
* [https://www.lesswrong.com/w/orthogonality-thesis Orthogonality Thesis]
* [https://www.alignmentforum.org/posts/SzecSPYxqRa5GCaSF/clarifying-inner-alignment-terminology Inner/outer alignment]
* [https://www.alignmentforum.org/w/mesa-optimization Mesa-optimization]
* [https://www.lesswrong.com/posts/N6vZEnCn6A95Xn39p/are-we-in-an-ai-overhang Overhang]

==Medium-term Risks==
* 2023-04: [https://www.youtube.com/watch?v=xoVJKj8lcNQ A.I. Dilemma – Tristan Harris and Aza Raskin (video)] ([https://assets-global.website-files.com/5f0e1294f002b1bb26e1f304/64224a9051a6637c1b60162a_65-your-undivided-attention-The-AI-Dilemma-transcript.pdf podcast transcript]): raises concerns about humanity's ability to handle these transformations
* 2023-04: [https://www.youtube.com/watch?v=KCSsKV5F4xc Daniel Schmachtenberger and Liv Boeree (video)]: AI could accelerate perverse social dynamics
* 2023-10: [https://arxiv.org/pdf/2310.11986 Sociotechnical Safety Evaluation of Generative AI Systems] (Google DeepMind)

==Long-term (x-risk)==
* [https://en.wikipedia.org/wiki/Instrumental_convergence Instrumental Convergence]
* [https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities AGI Ruin: A List of Lethalities] (Eliezer Yudkowsky)

=Learning Resources=

=Research=
* 2022-12: [https://arxiv.org/abs/2212.03827 Discovering Latent Knowledge in Language Models Without Supervision]
* 2023-02: [https://arxiv.org/abs/2302.08582 Pretraining Language Models with Human Preferences]
* 2023-04: [https://arxiv.org/abs/2304.03279 Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark]
* 2023-05: [https://arxiv.org/abs/2305.15324 Model evaluation for extreme risks] (DeepMind)
* 2023-06: [https://arxiv.org/abs/2306.17492 Preference Ranking Optimization for Human Alignment] (a generic preference-loss sketch follows this list)
* 2023-08: [https://arxiv.org/abs/2308.06259 Self-Alignment with Instruction Backtranslation]
* 2023-11: [https://arxiv.org/abs/2311.08702 Debate Helps Supervise Unreliable Experts]
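
Several of the research entries above (e.g. Pretraining Language Models with Human Preferences and Preference Ranking Optimization for Human Alignment) build on learning from human preference comparisons. As a point of reference only, the sketch below shows the generic Bradley-Terry pairwise preference loss commonly used to fit a reward model from (preferred, rejected) response pairs; it is not the specific method of any paper listed here, and the toy embeddings, dimensions, and training loop are hypothetical placeholders.

<pre>
# Minimal sketch (PyTorch): Bradley-Terry pairwise preference loss for a toy reward model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward head: maps a fixed-size response embedding to a scalar score."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: the preferred response should receive the higher reward;
    # the loss shrinks as the margin (r_chosen - r_rejected) grows.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical placeholder data: random embeddings standing in for encoded
# (preferred, rejected) response pairs from a human-preference dataset.
dim = 16
model = RewardModel(dim)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
chosen, rejected = torch.randn(8, dim), torch.randn(8, dim)

for _ in range(100):
    optimizer.zero_grad()
    loss = preference_loss(model(chosen), model(rejected))
    loss.backward()
    optimizer.step()

print(f"final preference loss: {loss.item():.4f}")
</pre>

In an actual alignment pipeline the scores would come from a language-model-based reward head evaluated on full prompt/response pairs, and the fitted reward model (or an equivalent direct preference objective) would then guide fine-tuning of the policy model.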
