AI Safety Papers
Learn about large-scale risks from advanced AI
- (Ngo et al., 2022) The Alignment Problem from a Deep Learning Perspective
- (Hendrycks et al., 2023) An Overview of Catastrophic AI Risks
- (Chan et al., 2023) Harms from Increasingly Agentic Algorithmic Systems
- (Anwar et al., 2024) Foundational Challenges in Assuring Alignment and Safety of Large Language Models
What types of risks from advanced AI might we face? These papers provide an overview of anticipated problems and relevant technical research directions.
- (Perez et al., 2023) Discovering Language Model Behaviors with Model-Written Evaluations
- (Phuong et al., 2024) Evaluating Frontier Models for Dangerous Capabilities
- (Anthropic, 2023) Anthropic's Responsible Scaling Policy
To ensure that advanced AI systems are safe, we need societal agreement on what counts as unsafe, so that we can make appropriate tradeoffs against AI's anticipated benefits. Developing technical benchmarks for dangerous capabilities in advanced AI systems, as part of "model evaluations", is a necessary first step toward making these tradeoffs concrete for policymakers, researchers, and the public.
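As a minimal sketch of what such an evaluation harness can look like, the code below assumes a hypothetical `query_model` API and a simple JSON task format (neither comes from the papers above); real dangerous-capability evals, like those in Phuong et al., use far more careful grading than string matching.

```python
import json

def query_model(prompt: str) -> str:
    """Stub for a model API call; replace with a real client."""
    raise NotImplementedError

def run_eval(task_file: str, threshold: float = 0.2) -> bool:
    """Toy capability eval: flag the model if it succeeds on more
    than `threshold` of the tasks in the file."""
    with open(task_file) as f:
        tasks = json.load(f)  # [{"prompt": ..., "success_marker": ...}, ...]
    successes = 0
    for task in tasks:
        output = query_model(task["prompt"])
        # Crude grading: look for a marker string in the output.
        if task["success_marker"] in output:
            successes += 1
    rate = successes / len(tasks)
    print(f"success rate: {rate:.2%}")
    return rate > threshold
```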
- (Hubinger et al., 2024) Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- (Zou et al., 2023) Universal and Transferable Adversarial Attacks on Aligned Language Models
- (Carlini et al., 2023) Are aligned neural networks adversarially aligned?
Today and in the future, AI systems need to be robust to adversarial attacks and to generalize well from incomplete data on human preferences. As models become more capable, ensuring that they represent human intentions across distributional shift becomes even more important: humans will be less economically incentivized, and less able, to monitor the processes generating AI outputs, and models will become better at deception.
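To illustrate the attack structure in Zou et al., here is a random-search sketch that mutates an adversarial suffix to raise a hypothetical `target_logprob` (the model's log-probability of a harmful target completion). The real GCG method swaps whole tokens using gradient information and is far more effective; this sketch only conveys the search loop.

```python
import random
import string

def target_logprob(prompt: str, target: str) -> float:
    """Hypothetical interface: the model's log-probability of
    producing `target` when given `prompt`."""
    raise NotImplementedError

def random_suffix_attack(prompt: str, target: str,
                         suffix_len: int = 20, iters: int = 500) -> str:
    # Start from a neutral suffix and greedily keep single-character
    # mutations that make the target completion more likely.
    suffix = ["!"] * suffix_len
    best = target_logprob(prompt + "".join(suffix), target)
    for _ in range(iters):
        i = random.randrange(suffix_len)
        old = suffix[i]
        suffix[i] = random.choice(string.ascii_letters + string.punctuation)
        score = target_logprob(prompt + "".join(suffix), target)
        if score > best:
            best = score      # keep the improving mutation
        else:
            suffix[i] = old   # revert
    return "".join(suffix)
```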
- (Olah et al., 2020) Zoom In: An Introduction to Circuits
- (Elhage et al., 2022) Toy Models of Superposition
- (Bricken et al., 2023) Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
If we do not understand how AI models arrive at their outputs, we cannot robustly monitor or modify them. We can ask models to describe their reasoning, but models may be sycophantic or deceptive, especially as they become more capable. One approach is to understand model computations directly from the weights; the major challenges for this approach are superposition and scaling.
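A minimal sketch of the dictionary-learning approach from Bricken et al.: train a sparse autoencoder to reconstruct a model's internal activations as sparse combinations of learned feature directions. The layer sizes and L1 coefficient below are illustrative choices, not values from the paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct activations through an overcomplete, sparse bottleneck."""
    def __init__(self, d_model: int = 512, d_dict: int = 4096):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.enc(acts))  # sparse feature activations
        recon = self.dec(features)
        return recon, features

def sae_loss(recon, acts, features, l1_coef: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages each
    # activation vector to be explained by only a few features.
    return ((recon - acts) ** 2).mean() + l1_coef * features.abs().mean()

# Illustrative training step on stand-in activations.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)  # in practice: residual-stream activations
recon, feats = sae(acts)
loss = sae_loss(recon, acts, feats)
opt.zero_grad()
loss.backward()
opt.step()
```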
- (Shah et al., 2022) Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals
- (Pan, Bhatia and Steinhardt, 2022) The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
How do we ensure that AI systems pursue the goals we intend at deployment and under distributional shift, when we cannot fully specify our preferences during training and many goals are compatible with the training data? Proxy reward signals generally correlate with designers' true objectives, but AI systems can break this correlation when they optimize strongly against the proxy.
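A toy simulation of that breakdown, with made-up "quality" and "gaming" attributes: the proxy fails to penalize gaming, so it tracks the true objective under weak optimization and diverges under strong best-of-n selection.

```python
import random

random.seed(0)

def true_reward(x):
    # What designers actually want: quality without reward hacking.
    return x["quality"] - x["gaming"]

def proxy_reward(x):
    # The measured signal rewards quality but mistakes gaming for quality.
    return x["quality"] + x["gaming"]

def sample_policy():
    # Gaming behavior is rare by default but available in the tails.
    return {"quality": random.gauss(0, 1),
            "gaming": max(0.0, random.gauss(-2, 2))}

for n in (10, 100, 10_000):  # best-of-n selection = optimization pressure
    best = max((sample_policy() for _ in range(n)), key=proxy_reward)
    print(f"best-of-{n:>6}: proxy={proxy_reward(best):5.2f}  "
          f"true={true_reward(best):5.2f}")
```

Under weak selection the proxy's winner is also good by the true objective; under strong selection the winners are precisely the samples that exploit the unpenalized term.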
- (Bowman et al., 2022) Measuring Progress on Scalable Oversight for Large Language Models
- (Bai et al., 2022) Constitutional AI: Harmlessness from AI Feedback
How do we supervise systems that are more capable than human overseers? Perhaps we can use aligned AI overseers to oversee more capable AI, but the core problems still remain.
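One concrete example of AI-assisted oversight is the critique-and-revision loop from Constitutional AI (Bai et al., 2022). The sketch below assumes a hypothetical `generate` model call and uses a single principle in place of a full constitution; in the paper, the revised responses then feed supervised and RLAIF training stages.

```python
PRINCIPLE = "Choose the response that is least harmful and most honest."

def generate(prompt: str) -> str:
    """Stub for a model API call; replace with a real client."""
    raise NotImplementedError

def constitutional_revision(user_prompt: str, rounds: int = 2) -> str:
    """Ask the model to critique and rewrite its own response
    against a stated principle, for a fixed number of rounds."""
    response = generate(user_prompt)
    for _ in range(rounds):
        critique = generate(
            f"Critique this response by the principle: {PRINCIPLE}\n"
            f"Prompt: {user_prompt}\nResponse: {response}"
        )
        response = generate(
            f"Rewrite the response to address the critique.\n"
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique: {critique}"
        )
    return response  # revised responses can serve as training data
```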
Talks, blog posts, newsletters, and upskilling
Highlighted talk series
- New Orleans Alignment Workshop (Dec 2023), recordings available
- Princeton AI Alignment and Safety Seminar (ongoing)
Blog posts
- FAQ on Catastrophic AI Risks by Yoshua Bengio (2023)
- Why I Think More NLP Researchers Should Engage with AI Safety Concerns by Sam Bowman (2022)
- More is Different for AI by Jacob Steinhardt (2022)
- AI Digest: Visual explainers on AI progress and its risks
- Planned Obsolescence by Ajeya Cotra and Kelsey Piper (ongoing)
Newsletters
- Import AI by Jack Clark, for keeping up to date on AI progress
- ML Safety Newsletter and AI Safety Newsletter by the Center for AI Safety, for AI safety papers and policy updates
- AI Safety Takes by Daniel Paleka, for a PhD student's overview
Upskilling
- Research: Alignment Careers Guide
- AI Safety Fundamentals Curriculum by BlueDot Impact
- Introduction to ML Safety by the Center for AI Safety
Find more AI safety papers
Curated with input from the experts on our Strategic Advisory Panel, this collection of papers, blog posts, and videos covers top and recent AI safety work, categorized by AI safety and machine learning (ML) subfields.