AI Safety Papers

Learn about large-scale risks from advanced AI

Overviews

What types of risks from advanced AI might we face? These papers provide an overview of anticipated problems and relevant technical research directions.

(Ngo et al., 2022) The Alignment Problem from a Deep Learning Perspective
(Hendrycks et al., 2023) An Overview of Catastrophic AI Risks
(Chan et al., 2023) Harms from Increasingly Agentic Algorithmic Systems
(Anwar et al., 2024) Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Dangerous Capability Evaluations

To ensure that AI systems are safe, we need societal agreement on what is unsafe. Developing technical benchmarks for what dangerous capabilities look like in advanced AI systems is a necessary first step to support policymakers and labs deploying powerful models.

(Perez et al., 2023) Discovering Language Model Behaviors with Model-Written Evaluations
(METR, 2023) Responsible Scaling Policies (RSPs), example: Anthropic's RSP
(Phuong et al., 2024) Evaluating Frontier Models for Dangerous Capabilities
Resources:
- METR's Autonomy Evaluation Resources
- UK AISI "Inspect"
- OpenAI o1 System Card, sections 4.4.3 and 4.4.4

Interpretability

One approach to improving safety is to improve our understanding of how AI models arrive at their outputs, to enable more robust monitoring and modification. In particular, attempting to understand models just by examining their weights (“mechanistic interpretability”) may be robust to deceptive or sycophantic behaviour from future systems.

(Olah et al., 2020) Zoom In: An Introduction to Circuits
(Templeton et al., 2024) Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
(Marks et al., 2024) Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
(Zou et al., 2023) Representation Engineering: A Top-Down Approach to AI Transparency

Robustness

Today and in the future, AI needs to be robust to adversarial attacks and unexpected distributional shifts. As models become more capable, ensuring they will represent human intentions across many settings is even more important: monitoring will become harder, there will be more incentive to jailbreak models, and models themselves may even become deceptive.

(Perez et al., 2022) Red Teaming Language Models with Language Models
(Zou et al., 2023) Universal and Transferable Adversarial Attacks on Aligned Language Models
(Carlini et al., 2023) Are aligned neural networks adversarially aligned?

Scalable Oversight

As models approach or surpass human performance, it becomes more challenging to decide which actions are safe or unsafe. One approach to solving this problem is to have a simpler model either assess safety of a more complex model directly or aid a human overseer in spotting unsafe outputs.

(Bowman et al., 2022) Measuring Progress on Scalable Oversight for Large Language Models
(Bai et al., 2022) Constitutional AI: Harmlessness from AI Feedback
(Burns et al., 2023) Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
(Khan et al., 2024) Debating with More Persuasive LLMs Leads to More Truthful Answers

Deception

There are reasons to think that advanced AI systems may demonstrate deceptive or sycophantic behavior in training, but pursue their own goals in deployment. The papers below investigate whether this could be detected, whether present-day systems are capable of this kind of deception, and what could be done about it in the future.

(Greenblatt et al., 2025) Alignment faking in large language models
(Meinke et al., 2024) Frontier Models are Capable of In-context Scheming
(Greenblatt et al., 2024) AI Control: Improving Safety Despite Intentional Subversion
See also:

OpenAI o1 System Card, section 4.4.3

Talks, blog posts, newsletters, and upskilling

Highlighted talk series

Bay Area Alignment Workshop (Oct 2024), recordings available
New Orleans Alignment Workshop (Dec 2023), recordings available
Princeton AI Alignment and Safety Seminar

Blog posts

FAQ on Catastrophic AI Risks by Yoshua Bengio (2023)
Why I Think More NLP Researchers Should Engage with AI Safety Concerns by Sam Bowman (2022)
More is Different for AI by Jacob Steinhardt (2022)
AI Digest: Visual explainers on AI progress and its risks
Planned Obsolescence by Ajeya Cotra and Kelsey Piper (ongoing)

Newsletters

Import AI by Jack Clark, for keeping up to date on AI progress
AI Safety Newsletter by the Center for AI Safety, for AI safety papers and policy updates
Transformer by Shakeel Hashim, for AI news with a safety focus
AI Safety Takes by Daniel Paleka, for a PhD student's overview

Upskilling

Guides

Research: Alignment Careers Guide
Engineering:

Courses

AI Safety Fundamentals Curriculum by BlueDot Impact
Introduction to ML Safety by the Center for AI Safety

Tools

Neuronpedia, a tool for visualising SAEs
Autointerp, a library for automatically generating labels for SAE features
Vivaria, a tool for running evaluations and agent elicitation research.

Technical AI Governance

Technical work that primarily aims to improve the efficacy of AI governance interventions, including compute governance, technical mechanisms for improving AI coordination and regulation, privacy-preserving transparency mechanisms, technical standards development, model evaluations, and information security.
Blog post (Anonymous): "AI Governance Needs Technical Work"
Blog post by Lennart Heim: "Technical AI Governance" (focuses on compute governance), Podcast: "Lennart Heim on the compute governance era and what has to come after"
Blog post by Luke Muehlhauser, "12 Tentative Ideas for US AI Policy"
Perspectives on options given AI hardware expertise (80,000 Hours)
Arkose is looking for further resources about technical governance, as this is a narrow set; please send recommendations to team@arkose.org!

Find more AI safety papers

Curated with input from the experts on our Strategic Advisory Panel, this collection of papers, blog posts, and videos contains top and recent AI safety papers. For more information on the categories listed, please reference this glossary.

0 results