Glossary

AI Governance: Designing laws and regulations that help ensure the safety of machine learning models.

Capabilities of LLMs: Many important abilities of large language models emerge unexpectedly as a result of scaling. Papers in this category investigate these abilities.

Deception: Deception is AI deliberately inducing false beliefs in humans. Examples include a language model knowingly stating a falsehood, a poker-playing model successfully bluffing, or an AI executing a feint in a military strategy game. We study ways to prevent or detect deceptive behavior in AI models.

Human Model Interaction: The resource deals with concerns specific to people’s use of models and the interaction between people and models. Topics like learning from human feedback, deceptive behavior by models, and risks that arise specifically when a model is used by people other than its designers or developers fall into this category.

LLMs: The resource deals in a significant way with problems in Large Language Models (LLMs).

Major problems (misalignment, misuse, threat models): Discussion of potential dangers of AI (threat models), including misuse (humans directing AI to cause harm) and misalignment (AI pursuing goals that humans never intended).

Model evaluation / monitoring / detection: Detecting and measuring the capabilities of AI models. Understanding AI capabilities is important for determining their capacity to cause harm.

Reinforcement Learning: The resource deals with problems in reinforcement learning. Note that merely using a method from reinforcement learning (e.g., RLHF) does not by itself earn a resource the RL tag.

Reward misspecification and goal misgeneralization: In reinforcement learning, reward misspecification is a situation in which the reward function incentivizes unintended behavior by the agent. For example, in a game where the player is rewarded for reaching intermediate goals in a boat race, the agent earns a higher reward by going in circles than by finishing the track as the designers intended. Goal misgeneralization is the related failure in which the agent learns a goal that matches the intended one during training but diverges from it in new situations, even though the reward was specified correctly.
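
As a toy illustration of the boat-race example above, here is a minimal Python sketch (written for this glossary; the event names and reward values are illustrative assumptions, not taken from any listed resource) showing how a reward that only counts checkpoints can rank a looping policy above one that actually finishes the race:

    # Toy sketch of reward misspecification: the designer rewards intermediate
    # checkpoints, intending the agent to finish the lap, but a policy that
    # circles back over the same checkpoints scores higher under the proxy reward.
    # All numbers and event names are illustrative assumptions.

    def proxy_reward(events):
        """Misspecified reward: +1 per checkpoint touched, nothing for finishing."""
        return sum(1 for e in events if e == "checkpoint")

    def intended_score(events):
        """What the designers actually wanted: complete the race."""
        return 100 if "finish" in events else 0

    # Policy A: race straight to the finish, passing each of 3 checkpoints once.
    finishing_run = ["checkpoint"] * 3 + ["finish"]
    # Policy B: go in circles, re-collecting the same checkpoints 50 times.
    looping_run = ["checkpoint"] * 50

    print(proxy_reward(finishing_run), intended_score(finishing_run))  # 3 100
    print(proxy_reward(looping_run), intended_score(looping_run))      # 50 0
    # The proxy reward prefers the looping policy even though it never finishes,
    # which is the unintended behavior described in this entry.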

Robustness and Adversariality: The resource deals with robust learning and learning in the face of adversarial behavior. This tag is intended to cover everything from robustness to outliers, (randomly or adversarially) corrupted data, and misspecified objective functions, all the way to deceptive behavior by models, models pursuing their own agentic goals, and generally any attempt by an adversary of any kind to subvert the modeler’s or user’s intent.

Scalable oversight: Scalable oversight is the problem of supervising and evaluating models in situations where direct evaluation by experts is expensive or impossible. This situation is especially likely to arise with superhuman models that outperform us on most skills relevant to the task at hand.

Security: The resource deals with security concerns in the sense of the computer security domain. Topics like system monitoring or provable verification of system properties for the purpose of auditing untrusted model developers belong in this category.

Theory: The resource deals with theoretical questions. Resources that focus on proving mathematical results or building mathematical frameworks fit into this category. Resources that are philosophical or focused on building non-mathematical frameworks may fall into this category, too, though that isn’t the primary intent of this tag.

Transparency / Interpretability / Model internals / Latent knowledge: Understanding the algorithm implemented by the model and extracting its learned knowledge from model weights.