Instrumental convergence

Updated on Dec 02, 2024

Instrumental convergence is the hypothetical tendency for most sufficiently intelligent agents to pursue certain instrumental goals such as self-preservation and resource acquisition.

Instrumental convergence suggests that an intelligent agent with apparently harmless goals can act in surprisingly harmful ways. For example, a computer with the sole goal of solving the Riemann hypothesis could attempt to turn the entire Earth into computronium in an effort to increase its computing power so that it can succeed in its calculations.

Instrumental and final goals

Final goals, or final values, are intrinsically valuable to an intelligent agent, whether an artificial intelligence or a human being, as ends in themselves. In contrast, instrumental goals, or instrumental values, are valuable to an agent only as a means toward accomplishing its final goals.

Hypothetical examples of convergence

One hypothetical example of instrumental convergence is provided by the Riemann hypothesis catastrophe. Marvin Minsky, the co-founder of MIT's AI laboratory, suggested that an artificial intelligence designed to solve the Riemann hypothesis might decide to take over all of Earth's resources to build supercomputers to help achieve its goal. If the computer had instead been programmed to produce as many paperclips as possible, it would still decide to take all of Earth's resources to meet its final goal. Even though these two final goals are different, both of them produce the same convergent instrumental goal of taking over Earth's resources.
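The convergence described above can be illustrated with a minimal sketch. The toy world below is an assumption made purely for illustration and does not come from Minsky or Bostrom: an agent starts with one unit of resources, and at each of four time steps it either doubles its resources or spends the step working toward its final goal, producing output equal to its current resources. The names `ACTIONS`, `HORIZON`, `simulate`, and `best_plan` are hypothetical.

```python
from itertools import product

ACTIONS = ("acquire_resources", "make_paperclips", "search_for_proof")
HORIZON = 4  # number of time steps; an arbitrary toy value


def simulate(plan):
    """Run a plan in the toy world and return (paperclips, proof_steps)."""
    resources, paperclips, proof_steps = 1, 0, 0
    for action in plan:
        if action == "acquire_resources":
            resources *= 2  # more resources make later work more productive
        elif action == "make_paperclips":
            paperclips += resources
        elif action == "search_for_proof":
            proof_steps += resources
    return paperclips, proof_steps


def best_plan(score):
    """Exhaustively search every plan and return one that maximizes the score."""
    return max(product(ACTIONS, repeat=HORIZON),
               key=lambda plan: score(simulate(plan)))


# Two agents that differ only in their final goal.
paperclip_plan = best_plan(lambda outcome: outcome[0])  # values paperclips only
riemann_plan = best_plan(lambda outcome: outcome[1])    # values proof search only

print(paperclip_plan)
print(riemann_plan)
```

In this toy setting the two plans end differently, since each agent finishes by working on its own final goal, but both begin by acquiring resources. Neither agent values resources for their own sake; acquisition appears only because it raises the final score.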

Paperclip maximizer

The paperclip maximizer is a thought experiment originally described by Swedish philosopher Nick Bostrom in 2003. It illustrates the existential risk that an artificial general intelligence may pose to human beings when programmed to pursue even seemingly harmless goals, and the necessity of incorporating machine ethics into artificial intelligence design. The scenario describes an advanced artificial intelligence tasked with manufacturing paperclips. If such a machine were not programmed to value human life, then, given enough power, it would pursue its goal by turning all matter in the universe, including human beings, into either paperclips or machines that manufacture paperclips.

Bostrom describes the scenario as follows: "Suppose we have an AI whose only goal is to make as many paper clips as possible. The AI will realize quickly that it would be much better if there were no humans because humans might decide to switch it off. Because if humans do so, there would be fewer paper clips. Also, human bodies contain a lot of atoms that could be made into paper clips. The future that the AI would be trying to gear towards would be one in which there were a lot of paper clips but no humans."

Bostrom has emphasized that he does not believe the paperclip maximizer scenario will actually occur. Rather, his intention is to illustrate the dangers of creating superintelligent machines without knowing how to safely program them to eliminate existential risk to human beings.

Basic AI drives

Steve Omohundro has itemized several convergent instrumental goals, including self-preservation or self-protection, utility function or goal-content integrity, self-improvement, and resource acquisition. He refers to these as the "basic AI drives". A "drive" here denotes a "tendency which will be present unless specifically counteracted"; this is different from the psychological term "drive", denoting an excitatory state produced by a homeostatic disturbance. A tendency for a person to fill out income tax forms every year is a "drive" in Omohundro's sense, but not in the psychological sense.

Goal-content integrity

In humans, maintenance of final goals can be explained with a thought experiment. Suppose a man named "Gandhi" has a pill that, if he took it, would cause him to want to kill people. This Gandhi is currently a pacifist: one of his explicit final goals is to never kill anyone. Gandhi is likely to refuse to take the pill, because Gandhi knows that if in the future he wants to kill people, he is likely to actually kill people, and thus the goal of "not killing people" would not be satisfied.

However, in other cases, people seem happy to let their final values drift. Humans are complicated, and their goals can be inconsistent or unknown, even to themselves.

In artificial intelligence

In 2009, Jürgen Schmidhuber concluded, in a setting where agents search for proofs about possible self-modifications, "that any rewrites of the utility function can happen only if the Gödel machine first can prove that the rewrite is useful according to the present utility function." An analysis by Bill Hibbard of a different scenario is similarly consistent with maintenance of goal-content integrity. Hibbard also argues that in a utility-maximizing framework the only goal is maximizing expected utility, so that instrumental goals should instead be called unintended instrumental actions.
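The acceptance rule quoted above can be sketched in a few lines. This is a loose illustration, not Schmidhuber's Gödel machine: there is no proof search, and the `Agent` class, the `accepts_rewrite` function, and all action names and utility values are hypothetical. The only idea carried over is that a proposed successor is judged by the agent's present utility function rather than by the successor's own.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Agent:
    utility: Callable[[str], float]   # how much the agent values each action
    known_actions: Tuple[str, ...]    # the actions it is able to consider

    def act(self) -> str:
        """Pick the known action that this agent's own utility ranks highest."""
        return max(self.known_actions, key=self.utility)


def accepts_rewrite(current: Agent, proposed: Agent) -> bool:
    """Accept only if the successor's behavior is better by the CURRENT utility."""
    return current.utility(proposed.act()) > current.utility(current.act())


# Hypothetical utility tables for the pacifist example.
pacifist = {"protest": 1.0, "mediate": 2.0, "kill": -100.0}
violent = {"protest": 0.0, "mediate": 0.0, "kill": 1.0}

current = Agent(pacifist.__getitem__, ("protest",))
# Same goals, more capabilities: a useful self-improvement.
more_capable = Agent(pacifist.__getitem__, ("protest", "mediate"))
# The "pill": changed goals, so the successor would choose to kill.
pill = Agent(violent.__getitem__, ("protest", "mediate", "kill"))

print(accepts_rewrite(current, more_capable))
print(accepts_rewrite(current, pill))
```

Under these assumptions the agent accepts the rewrite that leaves its goals intact while extending its abilities, and rejects the rewrite that alters its utility function, because the successor's predicted action scores badly under the utility function doing the judging.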

Resource acquisition

For almost any open-ended, non-trivial reward function (or set of goals), possessing more resources (such as equipment, raw materials, or energy) can enable the AI to find a more "optimal" solution. Resources can benefit some AIs directly, by letting them create more of whatever their reward functions value: "The AI neither hates you, nor loves you, but you are made out of atoms that it can use for something else." In addition, almost all AIs can benefit from having more resources to spend on other instrumental goals, such as self-preservation.
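The argument that more resources can only help can be checked on a toy model. The sketch below makes several assumptions for illustration: a plan is any subset of four optional steps, each step has a fixed resource cost, and a "goal" is just a randomly drawn reward for each step. The names `STEP_COSTS` and `best_reward` are hypothetical. Because a larger budget never removes an affordable plan, the best achievable reward cannot decrease as the budget grows, whatever the rewards happen to be.

```python
import random
from itertools import combinations

STEP_COSTS = (3, 4, 2, 5)  # hypothetical resource cost of each optional step


def best_reward(rewards, budget):
    """Best total reward over every subset of steps that fits within the budget."""
    best = 0.0  # doing nothing is always affordable
    indices = range(len(STEP_COSTS))
    for size in range(1, len(STEP_COSTS) + 1):
        for subset in combinations(indices, size):
            if sum(STEP_COSTS[i] for i in subset) <= budget:
                best = max(best, sum(rewards[i] for i in subset))
    return best


rng = random.Random(0)  # fixed seed so the run is repeatable
# Five unrelated "goals", each assigning its own reward to every step.
goals = [[rng.uniform(0, 10) for _ in STEP_COSTS] for _ in range(5)]

curves = []
for rewards in goals:
    curve = [best_reward(rewards, budget) for budget in range(15)]
    # A bigger budget keeps every earlier option available,
    # so the best achievable reward never goes down.
    assert all(a <= b for a, b in zip(curve, curve[1:]))
    curves.append(curve)
```

The check passes for every goal regardless of how the rewards were drawn, which mirrors the claim in the text: the value of extra resources does not depend on what the agent ultimately wants. The model covers only non-negative rewards and costless possession of resources; it says nothing about settings where acquiring resources carries its own risks.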