AI Safety Papers
Learn about large-scale risks from advanced AI
- (Ngo et al., 2022) The Alignment Problem from a Deep Learning Perspective
- (Hendrycks et al., 2023) An Overview of Catastrophic AI Risks
- (Chan et al., 2023) Harms from Increasingly Agentic Algorithmic Systems
- (Anwar et al., 2024) Foundational Challenges in Assuring Alignment and Safety of Large Language Models
What types of risks from advanced AI might we face? These papers provide an overview of anticipated problems and relevant technical research directions.
- (Perez et al., 2023) Discovering Language Model Behaviors with Model-Written Evaluations
- (Phuong et al., 2024) Evaluating Frontier Models for Dangerous Capabilities
- (METR, 2023) Responsible Scaling Policies (RSPs), example: Anthropic's RSP
- Resources:
To ensure that advanced AI systems are safe, we need societal agreement on what is unsafe so that we can make appropriate tradeoffs with AI's anticipated benefits. Developing technical benchmarks that measure dangerous capabilities in advanced AI systems, as part of "model evaluations", is a necessary first step toward making these tradeoffs concrete for policymakers, researchers, and the public.
- (Olah et al., 2020) Zoom In: An Introduction to Circuits
- (Elhage et al., 2022) Toy Models of Superposition
- (Bricken et al., 2023) Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- (Templeton et al., 2024) Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- (Zou et al., 2023) Representation Engineering: A Top-Down Approach to AI Transparency
If we do not understand how AI models arrive at their outputs, we cannot robustly monitor or modify them. We can ask models to describe their reasoning, but models may be sycophantic or deceptive, especially as they become more capable. One approach is to understand model processes by examining their weights directly, though the major challenges with this approach are superposition and scaling.
- (Hubinger et al., 2024) Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- (Zou et al., 2023) Universal and Transferable Adversarial Attacks on Aligned Language Models
- (Carlini et al., 2023) Are aligned neural networks adversarially aligned?
Today and in the future, AI needs to be robust to adversarial attacks and to generalize well from incomplete data on human preferences. As models become more capable, ensuring that they represent human intentions across distributional shift becomes even more important, since humans will be less economically incentivized to monitor the processes generating AI outputs and less capable of doing so, and models will become better at deception.
- (Bowman et al., 2022) Measuring Progress on Scalable Oversight for Large Language Models
- (Bai et al., 2022) Constitutional AI: Harmlessness from AI Feedback
- (Burns et al., 2023) Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
How do we supervise systems that are more capable than their human overseers? Perhaps we can use aligned AI overseers to oversee more capable AI, but the core problems remain.
Talks, blog posts, newsletters, and upskilling
- New Orleans Alignment Workshop (Dec 2023), recordings available
- Princeton AI Alignment and Safety Seminar
- FAQ on Catastrophic AI Risks by Yoshua Bengio (2023)
- Why I Think More NLP Researchers Should Engage with AI Safety Concerns by Sam Bowman (2022)
- More is Different for AI by Jacob Steinhardt (2022)
- AI Digest: Visual explainers on AI progress and its risks
- Planned Obsolescence by Ajeya Cotra and Kelsey Piper (ongoing)
- Import AI by Jack Clark, for keeping up to date on AI progress
- ML Safety Newsletter and AI Safety Newsletter by the Center for AI Safety, for AI safety papers and policy updates
- Transformer by Shakeel Hashim, for AI news with a safety focus
- AI Safety Takes by Daniel Paleka, for a PhD student's overview
- Research: Alignment Careers Guide
- AI Safety Fundamentals Curriculum by BlueDot Impact
- Introduction to ML Safety by the Center for AI Safety
- Neuronpedia, a tool for visualising SAEs
- Autointerp, a library for automatically generating labels for SAE features
- Vivaria, a tool for running evaluations and agent elicitation research
Technical AI Governance
- Technical work that primarily aims to improve the efficacy of AI governance interventions, including compute governance, technical mechanisms for improving AI coordination and regulation, privacy-preserving transparency mechanisms, technical standards development, model evaluations, and information security.
- Blog post (Anonymous): "AI Governance Needs Technical Work"
- Blog post by Lennart Heim: "Technical AI Governance" (focuses on compute governance); podcast: "Lennart Heim on the compute governance era and what has to come after"
- Blog post by Luke Muehlhauser: "12 Tentative Ideas for US AI Policy"
- Perspectives on options given AI hardware expertise (80,000 Hours)
- Arkose is looking for further resources about technical governance, as this is a narrow set; please send recommendations to team@arkose.org!
Find more AI safety papers
Curated with input from the experts on our Strategic Advisory Panel, this collection of papers, blog posts, and videos contains top and recent AI safety papers, categorized by AI safety and machine learning (ML) subfields.