AI Safety Papers
Learn about large-scale risks from advanced AI
- (Ngo et al., 2022) The Alignment Problem from a Deep Learning Perspective
- (Hendrycks et al., 2023) An Overview of Catastrophic AI Risks
- (Chan et al., 2023) Harms from Increasingly Agentic Algorithmic Systems
- (Anwar et al., 2024) Foundational Challenges in Assuring Alignment and Safety of Large Language Models
What types of risks from advanced AI might we face? These papers provide an overview of anticipated problems and relevant technical research directions.
- (Perez et al., 2023) Discovering Language Model Behaviors with Model-Written Evaluations
- (METR, 2023) Responsible Scaling Policies (RSPs), example: Anthropic's RSP
- (Phuong et al., 2024) Evaluating Frontier Models for Dangerous Capabilities
- Resources:
- METR's Autonomy Evaluation Resources
- UK AISI "Inspect"
- OpenAI o1 System Card, sections 3.4.3 and 3.4.4
To ensure that AI systems are safe, we need societal agreement on what is unsafe. Developing technical benchmarks for what dangerous capabilities look like in advanced AI systems is a necessary first step to support policymakers and labs deploying powerful models.
- (Olah et al., 2020) Zoom In: An Introduction to Circuits
- (Templeton et al., 2024) Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
- (Marks et al., 2024) Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
- (Zou et al., 2023) Representation Engineering: A Top-Down Approach to AI Transparency
One approach to improving safety is to improve our understanding of how AI models arrive at their outputs, to enable more robust monitoring and modification. In particular, attempting to understand models just by examining their weights (“mechanistic interpretability”) may be robust to deceptive or sycophantic behaviour from future systems.
- (Perez et al., 2022) Red Teaming Language Models with Language Models
- (Zou et al., 2023) Universal and Transferable Adversarial Attacks on Aligned Language Models
- (Carlini et al., 2023) Are aligned neural networks adversarially aligned?
Today and in the future, AI needs to be robust to adversarial attacks and unexpected distributional shifts. As models become more capable, ensuring they will represent human intentions across many settings is even more important: monitoring will become harder, there will be more incentive to jailbreak models, and models themselves may even become deceptive.
- (Bowman et al., 2022) Measuring Progress on Scalable Oversight for Large Language Models
- (Bai et al., 2022) Constitutional AI: Harmlessness from AI Feedback
- (Burns et al., 2023) Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
- (Khan et al., 2024) Debating with More Persuasive LLMs Leads to More Truthful Answers
As models approach or surpass human performance, it becomes more challenging to decide which actions are safe or unsafe. One approach to solving this problem is to have a simpler model either assess safety of a more complex model directly or aid a human overseer in spotting unsafe outputs.
- (Hubinger et al., 2024) Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- (Meinke et al., 2024) Frontier Models are Capable of In-context Scheming
- (Greenblatt et al., 2024) AI Control: Improving Safety Despite Intentional Subversion
- See also:
- OpenAI o1 System Card, sections 3.4.3 and 3.4.4
There are reasons to think that advanced AI systems may demonstrate deceptive or sycophantic behavior in training, but pursue their own goals in deployment. The papers below investigate whether this could be detected, whether present-day systems are capable of this kind of deception, and what could be done about it in the future.
Talks, blog posts, newsletters, and upskilling
- Vienna Alignment Workshop (July 2024), recordings available
- New Orleans Alignment Workshop (Dec 2023), recordings available
- Princeton AI Alignment and Safety Seminar
- FAQ on Catastrophic AI Risks by Yoshua Bengio (2023)
- Why I Think More NLP Researchers Should Engage with AI Safety Concerns by Sam Bowman (2022)
- More is Different for AI by Jacob Steinhardt (2022)
- AI Digest: Visual explainers on AI progress and its risks
- Planned Obsolescence by Ajeya Cotra and Kelsey Piper (ongoing)
- Import AI by Jack Clark, for keeping up to date on AI progress
- ML Safety Newsletter and AI Safety Newsletter by the Center for AI Safety, for AI safety papers and policy updates
- Transformer by Shakeel Hashim, for AI news with a safety focus
- AI Safety Takes by Daniel Paleka, for a PhD student's overview
- Research: Alignment Careers Guide
- AI Safety Fundamentals Curriculum by BlueDot Impact
- Introduction to ML Safety by the Center for AI Safety
- Neuronpedia, a tool for visualising SAEs
- Autointerp, a library for automatically generating labels for SAE features
- Vivaria, a tool for running evaluations and agent elicitation research.
Highlighted talk series
Blog posts
Newsletters
Upskilling
Technical AI Governance
- Technical work that primarily aims to improve the efficacy of AI governance interventions, including compute governance, technical mechanisms for improving AI coordination and regulation, privacy-preserving transparency mechanisms, technical standards development, model evaluations, and information security.
- Blog post (Anonymous): "AI Governance Needs Technical Work"
- Blog post by Lennart Heim: "Technical AI Governance" (focuses on compute governance), Podcast: "Lennart Heim on the compute governance era and what has to come after"
- Blog post by Luke Muehlhauser, "12 Tentative Ideas for US AI Policy"
- Perspectives on options given AI hardware expertise (80,000 Hours)
- Arkose is looking for further resources about technical governance, as this is a narrow set; please send recommendations to team@arkose.org!
Find more AI safety papers
Curated with input from the experts on our Strategic Advisory Panel, this collection of papers, blog posts, and videos contains top and recent AI safety papers. For more information on the categories listed, please reference this glossary.