Recalling the hypothetical scenario of a CEO AI running a company, we should note that we expect it to be challenging to codify in math the goals we would want that CEO AI to have. However, even if we did manage to do this, another problem arises, known as instrumental incentives.
- Suppose we do manage to encode our goals in some form or another. In the case of the CEO AI, for example, we would tell it to maximize profits while adhering to certain ethical standards and human values.
- In order to be effective, the AI would need to have some model of its place in the world and the way its actions influence the environment. This would include the fact that it is running on a computer somewhere, and that that computer can be switched off.
- Getting switched off would prevent the CEO AI from achieving its goals. Therefore, following its goals to the best of its abilities would seem to imply attempting to find ways to prevent its shutdown. One way it could do this would be to make a backup copy of itself.
- Such behavior does not need to be programmed in. It would emerge under certain conditions: If the system is capable enough, models the world reasonably well, and is programmed to be optimized for almost any goal, it will likely want to prevent attempts to be shut down, since achieving nearly any goal requires existing.
- This emergence of an instrumental incentive does not require the AI system to care about “itself” in the sense that humans care about our individual survival for our own sake. All it requires is for the AI to understand that its goals are unachievable if it is turned off.
- Furthermore: In order to solve problems faster and achieve its goals with higher likelihood, the AI system will also logically be incentivized to gather additional resources and gain more power.