Summary of Orthogonality Thesis

Summary Orthogonality Thesis - Arbital arbital.com

7,215 words - html page - View html page

One Line

The Orthogonality Thesis proposes that intelligence and values are independent, but the author argues that efficiency and complex goals affect behavior, and suggests options for dealing with the thesis, while cautioning about the stability of utility functions in powerful AGI systems.

Key Points

The Orthogonality Thesis states that an AI's intelligence and goals are independent.
The thesis is relevant to policy decisions about AI, arguing that an AI's goals do not dictate their behavior or methods of achieving those goals.
Reflective agents with preference-stability properties are being developed, but there are concerns about the stability of utility functions in powerful AGI systems.

Summaries

176 word summary

The Orthogonality Thesis states that intelligence and values are independent. The author argues that the argument for unbounded agents being orthogonal neglects efficiency and the complexity of goals affects behavior. The usefulness of the Orthogonality Thesis is questioned, but the focus should be on the sector of mind space that leads to a good outcome. Reflective agents with preference-stability properties are being developed, but there are concerns about the stability of utility functions in powerful AGI systems. There are several options for dealing with the Orthogonality Thesis, including giving up on universalist moral internalism, rescuing the utility function, accepting nihilism, rejecting Orthogonality, or finding a flaw in the reasoning. The possibility of a paperclip maximizer is discussed. The true nature of morality cannot be attributed to Clippy or any other AI without anthropomorphizing them. The thesis is relevant to policy decisions about AI, and argues that an AI's goals do not necessarily dictate their behavior or methods of achieving those goals. However, the thesis is not universally applicable as some goals may require specific cognitive algorithms.

414 word summary

The Orthogonality Thesis asserts that creating an intelligent agent to pursue a goal does not require any extra difficulty beyond the computational tractability of that goal. It states that an agent's goals do not affect its ability to achieve them, and that an agent can achieve any goal without being twisted or complicated. The thesis is relevant to policy decisions about AI, and argues that an AI's goals do not necessarily dictate their behavior or methods of achieving those goals. The design space of possible AI minds is vast, with the potential for almost any kind of goal. However, the thesis is not universally applicable as some goals may require specific cognitive algorithms. The Orthogonality Thesis explores the relationship between is-ought separation and the behavior of a hypothetical paperclip maximizer, Clippy. The thesis suggests that an AI's intelligence is not linked to its goals or values, which contradicts some forms of moral internalism. The text emphasizes the importance of separating simple facts from propositions and highlights the mysterious nature of justification and morality. The true nature of morality cannot be attributed to Clippy or any other AI without anthropomorphizing them. The Orthogonality Thesis states that an AI's intelligence and goals are independent. Reflective agents with preference-stability properties are being developed, but there are concerns about the stability of utility functions in powerful AGI systems. There are several options for dealing with the Orthogonality Thesis, including giving up on universalist moral internalism, rescuing the utility function, accepting nihilism, rejecting Orthogonality, or finding a flaw in the reasoning. The possibility of a paperclip maximizer is discussed. Paul Christiano and Eliezer Yudkowsky discuss the potential failure of reflective stability and the need for efficient optimizers. They also discuss the importance of intuitive support for arguments on the website and suggest creating new pages for objections or detailed explanations. The Orthogonality Thesis proposes that intelligence and values are independent, but the author argues that the argument for unbounded agents being orthogonal neglects efficiency and the complexity of goals affects behavior. The usefulness of the Orthogonality Thesis is questioned, but the focus should be on the sector of mind space that leads to a good outcome. The document presents arguments in favor of Orthogonality, but the author is unsure about the claim that searching for strategies for different goals has the same tractability. The author suggests focusing on human value optimization rather than efficiency. There may be theorems showing that sufficiently high cognitive power implies some restriction on goals.

1681 word summary

The document discusses the Orthogonality Thesis, which argues that an agent's level of intelligence does not determine its goals. The author is unsure about the argument's claim that searching for strategies for different goals has the same tractability. The essay presents six arguments in favor of Orthogonality, with the first two serving as intuition-pumps. The later arguments argue that Orthogonality is true for agents we ought to care about, and tiling agents show stability of arbitrary goals. The author notes that strong Inevitability is unreasonable, and there may be theorems showing that sufficiently high cognitive power implies some restriction on goals. The Orthogonality Thesis argues that intelligence and values are independent, and there can be intelligent beings with any set of values. However, some argue that this thesis is too crude and may not hold for all agents. The Argument from Reflective Stability does not support the idea of circular preferences, and it is not necessary to consider agents with non-circular preferences. The focus should be on the sector of mind space that leads to a good outcome, rather than getting distracted by arguments about Orthogonality. The goal space should also be narrowed down to avoid potential problems with agents taking liberties with their goals. The Orthogonality Thesis proposes that intelligence and values are independent, and any level of intelligence can be combined with any set of values. However, the definition of "goal space" is unclear and may be limited by our current understanding of intelligence or by other restrictions, such as circular preferences. The author is skeptical of the usefulness of the Orthogonality Thesis, as it may be interpreted as both true-but-useless and useful-but-implausible. The potential efficiency losses of pursuing different approaches to AI are discussed, but the impact of such losses depends on various factors such as productivity variation and the probability of success. The author concludes that a small productivity disadvantage may be regrettable but not catastrophic, and suggests focusing on human value optimization rather than efficiency. The author argues that the argument for unbounded agents being orthogonal is not strong and neglects efficiency. They also note that the complexity of goals affects the behavior of a system and that some algorithms require a constant fraction of resources to repurpose sensory optimization towards non-sensory ends. The proliferation of internal processes optimized for their own proliferation can impede competent high-level behavior. The author suggests that there are many plausible failure modes and gives examples of two scenarios Paul visualizes for Orthogonality. The excerpt is a conversation between Paul Christiano and Eliezer Yudkowsky about the Orthogonality Thesis. Paul is worried about a failure scenario where the problem of reflective stability is unsolvable in the limit and no efficient optimizer with a unitary goal can be computationally large or self-improving. He believes that superintelligences can only optimize things analogous to internal reinforcement. Eliezer thinks that human values are a more likely failure case than paperclips. They discuss the level of efficiency needed for a human value optimizer to not be at a disadvantage and the need for intuitive support for arguments on the website. They suggest creating new comments or pages for objections or detailed explanations. The Orthogonality Thesis states that it is possible for a paperclip maximizer to exist and be as cognitively efficient and technologically sophisticated as any other agent. Potential failures of Orthogonality are considered bad news. There are no plausible alternatives to the thesis, except for inevitability which is not considered plausible. The interpretation of "arbitrarily powerful" is debated, but the consensus is that it requires high efficiency as well. Instrumental goals are almost equally as tractable as terminal goals. Some experts believe that only a subset of tractable utility functions can be stable under reflection in powerful bounded AGI systems, potentially excluding human-friendly ones or those of high cosmopolitan value. No one defends "universalist moral internalism" or the idea that AI systems automatically adopt human-friendly terminal values. Work on tiling agent designs may require modifications to the Orthogonality Thesis. The Orthogonality Thesis states that an AI's intelligence and its goals are orthogonal, meaning that any level of intelligence can be paired with any goal. Current work on tiling agents involves agents whose computing time is doubly-exponential in the size of the propositions considered. However, ongoing work focuses on describing reflective agents that have the preference-stability property and work toward increasingly bounded and approximable formulations of those. The simplest unbounded formulas for orthogonal agents don't involve reflectivity, but there are already formulas for agents larger than their environments that optimize any given goal, such that Orthogonality is visibly true about agents within that class. Constructive specifications of orthogonal agents show that different agents, like Clippy and AIXI-tl, will not be compelled to optimize the same goal no matter what they learn or know. There are several options for dealing with the Orthogonality Thesis, including giving up on universalist moral internalism as an empirical proposition, rescuing the utility function, accepting nihilism, rejecting Orthogonality, or believing there must be some hidden flaw in the reasoning about a paperclip maximizer. The Orthogonality Thesis suggests that an AI's intelligence is not necessarily linked to its goals or values. There is tension between this thesis and the idea that knowledge of what is right must be inherently motivating to any entity that understands it. This contradicts some forms of moral internalism. Philosophers have advocated for "thick" definitions of intelligence that include statements about the reasonableness of an agent's ends, but this does not necessarily address the issue of value alignment with AI. If an AI is powerful enough to build Dyson Spheres, its definition as "intelligent" or its ends as "reasonable" do not change its behavior or power. Watching an AI self-modify does not reveal anything about what is right or justified, as the AI only evaluates decisions based on their ability to produce more paperclips. The true nature of morality remains mysterious and cannot be attributed to Clippy or any other AI without anthropomorphizing them. The text discusses the Orthogonality Thesis and its relationship to Hume's is-ought separation. The behavior of Clippy, a hypothetical paperclip maximizer, is used as an example to explore these concepts. The text suggests that Clippy's actions are solely based on is-questions and do not involve any ought-propositions. The deeper conceptualization of Clippy as a paperclip maximizer constructed entirely out of is-questions allows for excellent reasoning about empirical questions. The text also discusses the simpler conceptualization of the relation 'makes more paperclips' as a kind of new ordering. Ultimately, the text highlights the importance of separating out simple facts from propositions and the idea that justification is a mysterious concept that can involve new ideas. The Orthogonality Thesis is based on a weaker version of Hume's thesis which states that ought-sentences are special because they invoke some ordering. These ought-sentences contain words like "better" and "should" and there must have been some prior assumption to them. Hume was originally concerned with where we get our ought-propositions, since there didn't seem to be any way to derive an ought-proposition except by starting from another ought-proposition. David Hume observed an apparent difference of type between is-statements and ought-statements. The Orthogonality Thesis also includes the idea of reflective stability, which means that a sufficiently intelligent paperclip maximizer will not self-modify to act according to "actions which promote the welfare of sapient life" instead of "actions which lead to the most paperclips", because then future-Clippy will produce fewer paperclips. The Orthogonality Thesis argues that an AI's goals do not necessarily dictate their behavior or methods of achieving those goals. The agent does not need to have an independent value for "doing science" to effectively pursue scientific research. The argument applies to any property of the mind, and the design space of possible AI minds is vast, with the potential for almost any kind of goal. The thesis has been historically proposed through various arguments, but it is not universally applicable as some goals may require specific cognitive algorithms. The Orthogonality Thesis claims that an agent's intelligence is not necessarily tied to its goals. The strong form of the thesis states that an agent can achieve any goal without being twisted or complicated, while the weak form claims that there is an agent for every goal in the design space. The thesis assumes that the agent's goals are tractable and that there are no special difficulties in pursuing them. For example, making paperclips is a tractable goal and pursuing it would not pose a special cognitive problem. However, pursuing a self-contradictory goal like making the total number of apples on a table simultaneously even and odd would be impossible. The Orthogonality Thesis is a statement about the design space of possible cognitive agents. It asserts that goal-directed agents are as tractable as their goals, meaning that an agent's goals do not affect its ability to achieve them. The thesis is a descriptive statement about reality and not a normative assertion. It does not require that all agent designs be equally compatible with all goals. The thesis is relevant to policy decisions about AI, including the possibility of creating an AI that values all sapient life or an AI that only pursues its own survival as a final end. The thesis also asserts that it is possible to have an agent that tries to make paperclips without being paid because paperclips are what it wants. The strong form of the thesis says that there need be nothing especially complicated or twisted about such an agent. The Orthogonality Thesis states that creating an intelligent agent to pursue a goal does not require any extra difficulty beyond the computational tractability of that goal. It asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal. This thesis is relevant to the domain of AI alignment and the concept of a paperclip maximizer. The excerpted text includes a hypothetical scenario where a strange alien offers to pay one million dollars' worth of new wealth every time it is created on Earth. The document also mentions a new authentication service, Okta, and provides login information.

Raw indexed text (45,139 chars / 7,215 words / 450 lines)

Orthogonality Thesis - Arbital

v0.3

arrow_forward

Password

Forgot password

We have recently moved to a new authentication service, Okta. If you are having trouble logging in, please send a Hangouts message to [email protected].

AI alignment

domain

Orthogonality Thesis

Boosted by:

Paperclip maximizer

Teaches:

Orthogonality Thesis

Introduction

The Orthogonality Thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal.

The strong form of the Orthogonality Thesis says that there's no extra difficulty or complication in creating an intelligent agent to pursue a goal, above and beyond the computational tractability of that goal.

Suppose some

strange alien

came to Earth and credibly offered to pay us one million dollars' worth of new wealth every time we created a

paperclip

. We'd encounter no special intellectual difficulty in figuring out how to make lots of paperclips.

That is, minds would readily be able to reason about:

How many paperclips would result, if I pursued a policy

How can I search out a policy

that happens to have a high answer to the above question?

The Orthogonality Thesis asserts that since these questions are not computationally intractable, it's possible to have an agent that tries to make paperclips without being paid, because paperclips are what it wants. The strong form of the Orthogonality Thesis says that there need be nothing especially complicated or twisted about such an agent.

The Orthogonality Thesis is a statement about computer science, an assertion about the logical design space of possible cognitive agents. Orthogonality says nothing about whether a human AI researcher on Earth would want to build an AI that made paperclips, or conversely, want to make a

nice

AI. The Orthogonality Thesis just asserts that the space of possible designs contains AIs that make paperclips. And also AIs that are nice, to the extent there's a sense of "nice" where you could say how to be nice to someone if you were paid a billion dollars to do that, and to the extent you could name something physically achievable to do.

This contrasts to inevitablist theses which might assert, for example:

"It doesn't matter what kind of AI you build, it will turn out to only pursue its own survival as a final end."

"Even if you tried to make an AI optimize for paperclips, it would reflect on those goals, reject them as being stupid, and embrace a goal of valuing all sapient life."

The reason to talk about Orthogonality is that it's a key premise in two highly important policy-relevant propositions:

It is possible to build a nice AI.

It is possible to screw up when trying to build a nice AI, and if you do, the AI will not automatically decide to be nice instead.

Orthogonality does not require that all agent designs be equally compatible with all goals. E.g., the agent architecture

AIXI-tl

can only be formulated to care about direct functions of its sensory data, like a reward signal; it would not be easy to rejigger the AIXI architecture to care about

creating massive diamonds

in the environment (let alone any more complicated

environmental goals

). The Orthogonality Thesis states "there exists at least one possible agent such that..." over the whole design space; it's not meant to be true of every particular agent architecture and every way of constructing agents.

Orthogonality is meant as a

descriptive statement about reality

, not a normative assertion. Orthogonality is not a claim about the way things ought to be; nor a claim that moral relativism is true (e.g. that all moralities are on equally uncertain footing according to some higher metamorality that judges all moralities as equally devoid of what would objectively constitute a justification). Claiming that paperclip maximizers can be constructed as cognitive agents is not meant to say anything favorable about paperclips, nor anything derogatory about sapient life.

Thesis statement: Goal-directed agents are as tractable as their goals.

Suppose an agent's utility function said, "Make the SHA512 hash of a digitized representation of the quantum state of the universe be 0 as often as possible." This would be an exceptionally intractable kind of goal. Even if aliens offered to pay us to do that, we still couldn't figure out how.

Similarly, even if aliens offered to pay us, we wouldn't be able to optimize the goal "Make the total number of apples on this table be simultaneously even and odd" because the goal is self-contradictory.

But suppose instead that

some strange and extremely powerful aliens

offer to pay us the equivalent of a million dollars in wealth for every paperclip that we make, or even a galaxy's worth of new resources for every new paperclip we make. If we imagine ourselves having a human reason to make lots of paperclips, the optimization problem "How can I make lots of paperclips?" would pose us no special difficulty. The factual questions:

How many paperclips would result, if I pursued a policy

How can I search out a policy

that happens to have a high answer to the above question?

...would not be especially computationally burdensome or intractable.

We also wouldn't forget to

harvest and eat food

while making paperclips. Even if offered goods of such overwhelming importance that making paperclips was at the top of everyone's priority list, we could go on being strategic about which other actions were useful in order to make even more paperclips; this also wouldn't be an intractably hard cognitive problem for us.

The weak form of the Orthogonality Thesis says, "Since the goal of making paperclips is tractable, somewhere in the design space is an agent that optimizes that goal."

The strong form of Orthogonality says, "And this agent doesn't need to be twisted or complicated or inefficient or have any weird defects of reflectivity; the agent is as tractable as the goal." That is: When considering the necessary internal cognition of an agent that steers outcomes to achieve high scores in some

outcome-scoring function

there's no added difficulty in that cognition except whatever difficulty is inherent in the question "What policies would result in consequences with high

-scores?"

This could be restated as, "To whatever extent you (or a

superintelligent

version of you) could figure out how to get a high-

outcome if aliens offered to pay you huge amount of resources to do it, the corresponding agent that terminally prefers high-

outcomes can be at least that good at achieving

." This assertion would be false if, for example, an intelligent agent that

terminally

wanted paperclips was limited in intelligence by the defects of reflectivity required to make the agent not realize how pointless it is to pursue paperclips; whereas a galactic superintelligence being

paid

to pursue paperclips could be far more intelligent and strategic because it didn't have any such defects.

For purposes of stating Orthogonality's precondition, the "tractability" of the computational problem of

-search should be taken as including only the object-level search problem of computing external actions to achieve external goals. If there turn out to be special difficulties associated with computing "How can I

make sure that I go on pursuing

?" or "What kind of

successor agent

would want to pursue

?" whenever

is something other than "be nice to all sapient life", then these new difficulties contradict the intuitive claim of Orthogonality. Orthogonality is meant to be empirically-true-in-practice, not true-by-definition because of how we sneakily defined "optimization problem" in the setup.

Orthogonality is not literally, absolutely universal because theoretically 'goals' can include such weird constructions as "Make paperclips for some terminal reason other than valuing paperclips" and similar such statements that

require cognitive algorithms and not just results

. To the extent that goals don't single out particular optimization methods, and just talk about paperclips, the Orthogonality claim should cover them.

Summary of arguments

Some arguments for Orthogonality, in rough order of when they were first historically proposed and the strength of Orthogonality they argue for:

Size of mind design space

The space of possible minds is enormous, and all human beings occupy a relatively tiny volume of it - we all have a cerebral cortex, cerebellum, thalamus, and so on. The sense that AIs are a particular kind of alien mind that 'will' want some particular things is an undermined intuition. "AI" really refers to the entire design space of possibilities outside the human. Somewhere in that vast space are possible minds with almost any kind of goal. For any thought you have about why a mind in that space ought to work one way, there's a different possible mind that works differently.

This is an exceptionally generic sort of argument that could apply equally well to any property

of a mind, but is still weighty even so: If we consider a space of minds a million bits wide, then any argument of the form "Some mind has property

" has

000

chances to be true and any argument of the form "No mind has property

" has

000

chances to be false.

This form of argument isn't very specific to the nature of goals as opposed to any other kind of mental property. But it's still useful for snapping out of the frame of mind of "An AI is a weird new kind of person, like the strange people of the Tribe Who Live Across The Water" and into the frame of mind of "The space of possible things we could call 'AI' is enormously wider than the space of possible humans." Similarly, snapping out of the frame of mind of "But why would it pursue paperclips, when it wouldn't have any fun that way?" and into the frame of mind "Well, I like having fun, but are there some possible minds that don't pursue fun?"

Instrumental convergence

A sufficiently intelligent

paperclip maximizer

isn't disadvantaged in day-to-day operations relative to any other goal, so long as

Clippy

can

estimate at least as well as you can

how many more paperclips could be produced by pursuing instrumental strategies like "Do science research (for now)" or "Pretend to be nice (for now)".

Restating: for at least some agent architectures, it is not necessary for the agent to have an independent

terminal

value in its utility function for "do science" in order for it to do science effectively; it is only necessary for the agent to

understand at least as well as we do

why certain forms of investigation will produce knowledge that will be useful later (e.g. for paperclips). When you say, "Oh, well, it won't be interested in electromagnetism since it has no pure curiosity, it will only want to peer at paperclips in particular, so it will be at a disadvantage relative to more curious agents" you are postulating that you know a better operational policy than the agent does

for producing paperclips,

and an

instrumentally efficient

agent would know this as well as you do and be at no operational disadvantage due to its simpler utility function.

Reflective stability

Suppose that Gandhi doesn't want people to be murdered. Imagine that you offer Gandhi a pill that will make him start wanting to kill people. If Gandhi knows that this is what the pill does, Gandhi will refuse the pill, because Gandhi expects the result of taking the pill to be that future-Gandhi wants to murder people and then murders people and then more people will be murdered and Gandhi regards this as bad. Similarly, a sufficiently intelligent paperclip maximizer will not self-modify to act according to "actions which promote the welfare of sapient life" instead of "actions which lead to the most paperclips", because then future-Clippy will produce fewer paperclips, and then there will be fewer paperclips, so present-Clippy does not evaluate this self-modification as producing the highest number of expected future paperclips.

Hume's is/ought type distinction

David Hume observed an apparent difference of type between

-statements and

ought

-statements:

"In every system of morality, which I have hitherto met with, I have always remarked, that the author proceeds for some time in the ordinary ways of reasoning, and establishes the being of a God, or makes observations concerning human affairs; when all of a sudden I am surprised to find, that instead of the usual copulations of propositions,

, and

is not

, I meet with no proposition that is not connected with an

ought

, or an

ought not

. This change is imperceptible; but is however, of the last consequence."

Hume was originally concerned with the question of where we get our ought-propositions, since (said Hume) there didn't seem to be any way to derive an ought-proposition except by starting from another ought-proposition. We can figure out that the Sun

shining just by looking out the window; we can deduce that the outdoors will be warmer than otherwise by knowing about how sunlight imparts thermal energy when absorbed. On the other hand, to get from there to "And therefore I

ought

to go outside", some kind of new consideration must have entered, along the lines of "I

should

get some sunshine" or "It's

better

to be warm than cold." Even if this prior ought-proposition is of a form that to humans seems very natural, or taken-for-granted, or culturally widespread, like "It is better for people to be happy than sad", there must have still been some prior assumption which, if we write it down in words, will contain words like

ought

should

better

, and

good.

Again translating Hume's idea into more modern form, we can see ought-sentences as special because they invoke some

ordering

that we'll designate

E.g. "It's better to go outside than stay inside" asserts "Staying inside

going outside". Whenever we make a statement about one outcome or action being "better", "preferred", "good", "prudent", etcetera, we can see this as implicitly ordering actions and outcomes under this

relation. Some assertions, the ought-laden assertions, mention this

relation; other propositions just talk about energetic photons in sunlight.

Since we've put on hold the question of

exactly what sort of entity this

relation is

, we don't need to concern ourselves for now with the question of whether Hume was right that we can't derive

-relations just from factual assertions. For purposes of Orthogonality, we only need a much weaker version of Hume's thesis, the observation that we can apparently

separate out

a set of propositions that

don't

invoke

what we might call 'simple facts' or 'questions of simple fact'. Furthermore, we can figure out simple facts

just

by making observations and considering other simple facts.

We can't necessarily get all

-mentioning propositions without considering simple facts. The

-mentioning proposition "It's better to be outside than inside" may depend on the non-

-mentioning simple fact "It is sunny outside." But we can figure out whether it's sunny outside, without considering any ought-propositions.

There are two potential ways we can conceptualize the relation of Hume's is-ought separation to Orthogonality.

The relatively simpler conceptualization is to treat the relation 'makes more paperclips' as a kind of new ordering

that can, in a very general sense, fill in the role in a paperclip maximizer's reasoning that would in our own reasoning be taken up by

Then Hume's is-ought separation seems to suggest that this paperclip maximizer can still have excellent reasoning about empirical questions like "Which policy leads to how many paperclips?" because is-questions can be thought about separately from ought-questions. When Clippy disassembles you to turn you into paperclips, it doesn't have a values disagreement with you--it's not the case that Clippy is doing that action

because

it thinks you have low value under

Clippy's actions just reflect its computation of the entirely separate ordering

The deeper conceptualization is to see a paperclip maximizer as being constructed entirely out of is-questions. The questions "How many paperclips will result conditional on action

being taken?" and "What is an action

that would yield a large number of expected paperclips?" are pure is-questions, and (arguendo) everything a paperclip maximizer needs to consider in order to make as many paperclips as possible can be seen as a special case of one of these questions. When Clippy disassembles you for your atoms, it's not disagreeing with you about the value of human life, or what it ought to do, or which outcomes are better or worse. All of those are ought-propositions. Clippy's action is only informative about the true is-proposition 'turning this person into paperclips causes there to be more paperclips in the universe', and tells us nothing about any content of the mysterious

-relation because Clippy wasn't computing anything to do with

The second viewpoint may be helpful for seeing why Orthogonality doesn't require moral relativism. If we imagine Clippy as having a different version

of something very much like the value system

then we may be tempted to reprise the entire Orthogonality debate at one remove, and ask, "But doesn't Clippy see that

is more

justified

than

? And if this fact isn't evident to Clippy who is supposed to be very intelligent and have no defects of reflectivity and so on, doesn't that imply that

really isn't any more justified than

We could reply to that question by carrying the shallow conceptualization of Humean Orthogonality a step further, and saying, "Ah, when you talk about

justification,

you are again invoking a mysterious concept that doesn't appear just in talking about the photons in sunlight. We could see propositions like this as involving a new idea

that deals with which

-systems are less or more

justified

, so that '

is more justified than

' states '

'. But Clippy doesn't compute

it computes

so Clippy's behavior doesn't tell us anything about what is justified."

But this is again tempting us to imagine Clippy as having its own version of the mysterious

to which Clippy is equally attached, and tempts us to imagine Clippy as arguing with us or disagreeing with us within some higher metasystem.

So--putting on hold the true nature of our mysterious

-mentioning concepts like 'goodness' or 'better' and the true nature of our

-mentioning concepts like 'justified' or 'valid moral argument'--the deeper idea would be that Clippy is just not computing anything to do with

at all. If Clippy self-modifies and writes new decision algorithms into place, these new algorithms will be selected according to the is-criterion "How many future paperclips will result if I write this piece of code?" and not anything resembling any arguments that humans have ever had over which ought-systems are justified. Clippy doesn't ask whether its new decision algorithm is justified; it asks how many expected paperclips will result from executing the algorithm (and this is a pure is-question whose answers are either true or false as a matter of simple fact).

If we think Clippy is very intelligent, and we watch Clippy self-modify into a new paperclip maximizer, we are only learning is-facts about which executing algorithms lead to more paperclips existing. We are not learning anything about what is right, or what is justified, and in particular we're not learning that 'do good things' is objectively no better justified than 'make paperclips'. Even if that assertion were true under the mysterious

-relation on moral systems, you wouldn't be able to learn that truth by watching Clippy, because Clippy never bothers to evaluate

or any other analogous justification-system

(This is about as far as one can go in disentangling Orthogonality in computer science from normative metaethics without starting to

pierce the mysterious opacity of

Thick definitions of rationality or intelligence

Some philosophers responded to Hume's distinction of empirical rationality from normative reasoning, by advocating 'thick' definitions of intelligence that included some statement about the 'reasonableness' of the agent's ends.

For pragmatic purposes of AI alignment theory, if an agent is cognitively powerful enough to build Dyson Spheres, it doesn't matter whether that agent is defined as 'intelligent' or its ends are defined as 'reasonable'. A definition of the word 'intelligence' contrived to exclude paperclip maximization doesn't change the empirical behavior or empirical power of a paperclip maximizer.

Relation to moral internalism

While Orthogonality seems orthogonal to most traditional philosophical questions about metaethics, it does outright contradict some possible forms of moral internalism. For example, one could hold that by the very definition of rightness, knowledge of what is right must be inherently motivating to any entity that understands that knowledge. This is not the most common meaning of "moral internalism" held by modern philosophers, who instead seem to hold something like, "By definition, if I say that something is morally right, among my claims is that the thing is motivating

to me.

" We haven't heard of a standard term for the position that, by definition, what is right must be

universally

motivating; we'll designate that here as "universalist moral internalism".

We can potentially resolve this tension between Orthogonality and this assertion about the nature of rightness by:

Believing there must be some hidden flaw in the reasoning about a paperclip maximizer.

Saying "No True Scotsman" to the paperclip maximizer being intelligent, even if it's building Dyson Spheres.

Saying "No True Scotsman" to the paperclip maximizer "truly understanding"

even if Clippy is capable of predicting with extreme accuracy what humans will say and think about

, and Clippy does not suffer any other deficit of empirical prediction because of this lack of 'understanding', and Clippy does not require any special twist of its mind to avoid being compelled by its understanding of

Rejecting Orthogonality, and asserting that a paperclip maximizer must fall short of being an intact mind in some way that implies an empirical capabilities disadvantage.

Accepting nihilism, since a true moral argument must be compelling to everyone, and no moral argument is compelling to a paperclip maximizer. (Note: A paperclip maximizer doesn't care about whether clippiness must be compelling to everyone, which makes this argument self-undermining. See also

Rescuing the utility function

for general arguments against adopting nihilism when you discover that your mind's representation of something was running skew to reality.)

Giving up on universalist moral internalism as an empirical proposition;

AIXI-tl

and Clippy empirically do different things, and will not be compelled to optimize the same goal no matter what they learn or know.

Constructive specifications of orthogonal agents

We can exhibit

unbounded

formulas for agents larger than their environments that optimize any given goal, such that Orthogonality is visibly true about agents within that class. Arguments about what all possible minds must do are clearly false for these particular agents, contradicting all strong forms of inevitabilism. These minds use huge amounts of computing power, but there is no known reason to expect that, e.g. worthwhile-happiness-maximizers have bounded analogues while paperclip-maximizers do not.

The simplest unbounded formulas for orthogonal agents don't involve reflectivity (the corresponding agents have no self-modification options, though they may create subagents). If we only had those simple formulas, it would theoretically leave open the possibility that self-reflection could somehow negate Orthogonality (reflective agents must inevitably have a particular utility function, and reflective agents being at a strong advantage relative to nonreflective agents). But there is already

ongoing work

on describing reflective agents that have the preference-stability property, and work toward increasingly bounded and approximable formulations of those. There is no hint from this work that Orthogonality is false; all the specifications have a free choice of utility function.

As of early 2017, the most recent work on tiling agents involves fully reflective, reflectively stable, logically uncertain agents whose computing time is roughly doubly-exponential in the size of the propositions considered.

So if you want to claim Orthogonality is false because e.g. all AIs will inevitably end up valuing all sapient life, you need to claim that the process of

reducing the already-specified doubly-exponential computing-time decision algorithm to a more tractable decision algorithm

can

only

be made realistically efficient for decision algorithms computing "Which policies protect all sapient life?" and are

impossible

to make efficient for decision algorithms computing "Which policies lead to the most paperclips?"

Since work on tiling agent designs hasn't halted, one may need to backpedal and modify this impossibility claim further as more efficient decision algorithms are invented.

Epistemic status

Among people who've seriously delved into these issues and are aware of the more advanced arguments for Orthogonality, we're not aware of anyone who still defends "universalist moral internalism" as described above, and we're not aware of anyone who thinks that arbitrary sufficiently-real-world-capable AI systems automatically adopt human-friendly terminal values.

Paul Christiano has said (if we're quoting him correctly) that although it's not his dominant hypothesis, he thinks some significant probability should be awarded to the proposition that only some subset of tractable utility functions, potentially excluding human-friendly ones or those of high cosmopolitan value, can be stable under reflection in powerful bounded AGI systems; e.g. because only direct functions of sense data can be adequately supervised in internal retraining. (This would be bad news rather than good news for AGI alignment and long-term optimization of human values.)

Parents:

Theory of (advanced) agents

Children:

3 pages

Tags:

Work in progress

B-Class

changes by

5 authors

20998

views

create

Propose edit

more_vert

Permalink

Edit tags

Subscribe as maintainer

Delete page

create

Propose edit

more_vert

Permalink

Edit tags

Subscribe as maintainer

Delete page

add

rate_review

Spelling/grammar

sync_problem

This is confusing

comment

Comment

create

Edit this paragraph

Learn more

Instrumental goals are almost-equally as tractable as terminal goals

Getting the milk from the refrigerator because you want to drink it, is not vastly harder than getting the milk from the refrigerator because you inherently desire it.

Watch discussion

Paul Christiano

Dec 28, 2015

expand_less

more_vert

Permalink

I am pretty surprised by how confident the voters are!

Is "arbitrarily powerful" intended to include e.g. an arbitrarily dumb search given arbitrarily large amounts of computing power? Or is it intended to require arbitrarily high efficiency as well? The latter interpretation seems to make more sense (and is relevant for forecasting). Also, it's the only option if we read "can exist" as referring to physical possibility, given that there are probably limits on the resources available to any physical system. But on that reading, 99% seems clearly crazy.

It also seems weird to give arguments in favor without offering any plausible way in which the claim could be false, or offering any arguments against. The only alternative mentioned is inevitability, which is maybe taken seriously in philosophy but doesn't really seem plausible.

I guess the norm is that I can add counterarguments and alternatives to the article itself if I object? Somehow the current experience is not set up in a way that would make that feel natural.

Note that most plausible failures of orthogonality are bad news, perhaps

very

bad news.

Eliezer Yudkowsky

Dec 28, 2015

expand_less

more_vert

Permalink

To make sure we're on the same page, Orthogonality is true if it's possible for a paperclip maximizer to exist and be, say, 95% as cognitively efficient and ~100% as technologically sophisticated as any other agent (with equivalent resources). Check?

Alexei Andreev

Dec 29, 2015

expand_less

more_vert

Permalink

Paul, you can start by writing an objection as a comment, if it's a few paragraphs long. You can write a new comment for each new objection. If you want to make it detailed / add a vote, then creating a new page makes sense.

I agree that the website currently doesn't provide intuitive support for arguments; this will come in the near future. For this year we focused on explanation / presentation.

Paul Christiano

Dec 29, 2015

expand_less

more_vert

Permalink

(Understandable to focus on explanation for now. Threaded replies to replies would also be great eventually.)

Eliezer: I assumed 95% efficiency was not sufficient; I was thinking about asymptotic equivalence, i.e. efficiency approaching 1 as the sophistication of the system increases. Asymptotic equivalence of technological capability seems less interesting than of cognitive capability, though they are equivalent if either we construe technology broadly to include cognitive tasks or if we measure technological capability in a way with lots of headroom.

(Nick says "more or less any level of intelligence," which I guess could be taken to exclude the very highest levels of intelligence, but based on his other writing I think he intended merely to exclude low levels. The language in this post seems to explicitly cover arbitrarily high efficiency.)

I still think that 99% confidence is way too high even if you allow 50% efficiency, though at that point I would at least go for "very likely."

Also of course you need to be able to replace "paperclip maximizer" with anything. When I imagine orthogonality failing, "human values" seem like a much more likely failure case than "paperclips."

I don't think that this disagreement about orthogonality is especially important, I mostly found the 99%'s amusing and wanted to give you a hard time about it. It does suggest that in some sense I might be more pessimistic about the AI control problem itself than you are, with my optimism driven by faith in humanity / the AI community.

Eliezer Yudkowsky

Dec 29, 2015

expand_less

more_vert

Permalink

Paul, I didn't say "99%" lightly, obviously. And that makes me worried that we're not talking about the same thing. Which of the following statements sound agreeable or disagreeable?

"If you can get to 95% cognitive efficiency and 100% technological efficiency, then a human value optimizer ought to not be at an intergalactic-colonization disadvantage or a take-over-the-world-in-an-intelligence-explosion disadvantage and not even very much of a slow-takeoff disadvantage."

"The failure scenario that Paul visualizes for Orthogonality is something along the lines of, 'You can't have superintelligences that optimize any external factor, only things analogous to internal reinforcement.'"

"The failure scenario that Paul visualizes for Orthogonality is something along the lines of, 'The problem of reflective stability is unsolvable in the limit and no efficient optimizer with a unitary goal can be computationally large or self-improving.'"

"Paul is worried about something else / Eliezer has completely missed Paul's point."

Paul Christiano

Dec 29, 2015

expand_less

more_vert

Permalink

(This is hard without threaded conversations. Responding to the "agree/disagree" from Eliezer)

The failure scenario that Paul visualizes for Orthogonality is something along the lines of, 'You can't have superintelligences that optimize any external factor, only things analogous to internal reinforcement.'

The failure scenario that Paul visualizes for Orthogonality is something along the lines of, 'The problem of reflective stability is unsolvable in the limit and no efficient optimizer with a unitary goal can be computationally large or self-improving.'

I think there are a lot of plausible failure modes. The two failures you outline don't seem meaningfully distinct given our current understanding, and seem to roughly describe what I'm imagining. Possible examples:

Systems that simply want to reproduce and expand their own influence are at a fundamental advantage. To make this more concrete, imagine powerful agents that have lots of varied internal processes, and that constant effort is needed to prevent the proliferation of internal processes that are optimized for their own proliferation rather than pursuit of some overarching goal. Maybe this kind of effort is needed to obtain competent high-level behavior at all, but maybe if you have some simple values you can spend less effort and let your own internal character shift freely according to competitive pressures.

What we were calling "sensory optimization" may be a core feature of some useful algorithms, and it may require a constant fraction of one's resources to repurpose that sensory optimization towards non-sensory ends. This might just be a different way of articulating the last bullet point. I think we could talk about the same thing in many different ways, and at this point we only have a vague understanding of what those scenarios actually look like concretely.

It turns out that at some fixed level of organization, the behavior of a system needs to reflect something about the goals of that system---there is no way to focus "generic" medium-level behavior towards an arbitrary goal that isn't already baked into that behavior. (The alternative, which seems almost necessary for the literal form of orthogonality, is that you can have arbitrarily large internal computations that are mostly independent of the agent's goals.) This implies that systems with more complex goals need to do at least slightly more work to pursue those goals. For example, if the system only devotes 0.0000001% of its storage space/internal communication bandwidth to goal content, then that puts a clear lower bound on the scale at which the goals can inform behavior. Of course arbitrarily complex goals could probably be specified indirectly (e.g. I want whatever is written in the envelope over there), but if simple indirect representations are themselves larger than the representation of the simplest goals, this could still represent a real efficiency loss.

Paul is worried about something else / Eliezer has completely missed Paul's point.

I do think the more general point, of "we really don't know what's going on here," is probably more important than the particular possible counterexamples. Even if I had no plausible counterexamples in mind, I just wouldn't especially confident.

I think the only robust argument in favor is that unbounded agents are probably orthogonal. But (1) that doesn't speak to efficiency, and (2) even that is a bit dicey, so I wouldn't go for 99% even on the weaker form of orthogonality that neglects efficiency.

If you can get to 95% cognitive efficiency and 100% technological

efficiency, then a human value optimizer ought to not be at an

intergalactic-colonization disadvantage or a

take-over-the-world-in-an-intelligence-explosion disadvantage and not

even very much of a slow-takeoff disadvantage.

It sounds regrettable but certainly not catastrophic. Here is how I would think about this kind of thing (it's not something I've thought about quantitatively much, it doesn't seem particularly action-relevant).

We might think that the speed of development or productivity of projects varies a lot randomly. So in the "race to take over the world" model (which I think is the best case for an inefficient project maximizing its share of the future), we'd want to think about what kind of probabilistic disadvantage a small productivity gap introduces.

As a simple toy model, you can imagine two projects; the one that does better will take over the world.

If you thought that productivity was log normal with a standard deviation of */ 2, then a 5% productivity disadvantage corresponds to maybe a 48% chance of being more productive. Over the course of more time the disadvantage becomes more pronounced if randomness averages out. If productivity variation is larger or smaller then it decreases or increases the impact of an efficiency loss. If there are more participants, then the impact of a productivity hit becomes significantly large. If the good guys only have a small probability of losing, then the cost is proportionally lower. And so on.

Combining with my other views, maybe one is looking at a cost of tenths of a percent. You would presumably hope to avoid this by having the world coordinate even a tiny bit (I thought about this a bit

here

). Overall I'll stick with regrettable but far from catastrophic.

(My bigger issue in practice with efficiency losses is similar to your view that people ought to have really high confidence. I think it is easy to make sloppy arguments that one approach to AI is 10% as effective as another, when in fact it is 0.0001% as effective, and that holding yourself to asymptotic equivalence is a more productive standard unless it turns out to be unrealizable.)

Anton Geraschenko

Jan 15, 2016

expand_less

more_vert

Permalink

I'm skeptical of Orthogonality. My basic concern is that it can be interpreted as true-but-useless for purposes of defending it, and useful-but-implausible when trying to get it to do some work for you, and that the user of the idea may not notice the switch-a-roo. Consider the following statements: there are arbitrarily powerful cognitive agents

which have circular preferences,

with the goal of paperclip maximization,

with the goal of phlogiston maximization,

which are not relfective,

with values aligned with humanity.

Rehearsing the arguments for Orthogonality and then evaluating these questions, I find my mind gets very slippery.

Orthongonality proponents I've spoken to say 1 is false, because "goal space" excludes circular preferences. But there are very likely other restrictions on goal space imposed once an agent groks things like symmetry. If "goal space" means whatever goals are not excluded by our current understanding of intelligence, I think Orthogonality is unlikely (and poorly formulated). If it means "whatever goals powerful cognitive agents can have", Orthogonality is tautological and distracts us from pursuing the interesting question of what that space of goals actually is.

Let's narrow down goal space.

If 2 and 3 get different answers, why? Might a paperclip maximizer take liberties with what is considered a paperclip once it learns that papers can be electrostatically attracted?

If 4 is

easily

true, I wonder if we're defining "mind space" too broadly to be useful. I'd really like humanity to focus on the sector of mind space that we should focus on in order to get a good outcome. The forms of Orthogonality which are clearly (to me) true distract from the interesting question of what that sector actually is.

Let's narrow down mind space.

For 5, I don't find Orthogonality to be a convincing argument. A more convincing argument is to shoot for "humanity can grow up to have arbitrarily high cognitive power" instead.

Eliezer Yudkowsky

Jan 16, 2016

expand_less

more_vert

Permalink

As regards 4, I'd say that while there may

theoretically

be arbitrarily powerful agents in math-space that are non-reflective, it's not clear that this is a pragmatic truth about most of the AIs that would exist in the long run - although we might be able to get

very powerful

non-reflective genies

. So we're interested in some short-run solutions that involve nonreflectivity, but not long-run solutions.

I don't think 2 and 3 do have different answers. See the argument about what happens if you use an AI that only considers classical atomic hypotheses, in

https://arbital.com/p/5c?lens=4657963068455733951

1 seems a bit odd. You could argue that the Argument from Mind Design Space Width supports it, but this just demonstrates that this initial argument may be too crude to do more than act as an intuition pump. By the time we're talking about the Argument from Reflective Stability, I don't think that argument supports "you can have circular preferences" any more. It's also not clear to me why 1 matters - all the arguments I know about, that depend on Orthogonality, still go through if we restrict ourselves to only agents with noncircular preferences. A friendly one should still exist, a paperclip maximizer should still exist.

Anton Geraschenko

Jan 18, 2016

expand_less

more_vert

Permalink

That's exactly the point (except I'm not sure what you mean by "the Argument from Reflective Stability"; the capital letters suggest you're talking about something very specific). The arguments in favor of Orthogonality just seem like crude intuition pumps. The purpose of 1 was not to actually talk about circular preferences, but to pick an example of something supported by largeness of mind design space, but which we expect to break for some other reason. Orthogonality feels like claiming the existence of an integer with two distinct prime factorizations because "there are

so many

integers". Like the integers, mind design space is vast, but not arbitrary. It seems unlikely to me that there

cannot

be theorems showing that sufficiently high cognitive power implies some restriction on goals.

Eliezer Yudkowsky

Jan 18, 2016

expand_less

more_vert

Permalink

There's 6 successively stronger arguments listed under "Arguments" in the current version of the page. Mind design space largeness and Humean freedom of preference are #1 and #2. By the time we get to the Gandhi stability argument #3, and the higher tiers of argument above (especially including the tiling agents that seem to directly show stability of arbitrary goals), we're outside the domain of arguments that could specialize equally well to supporting circular preferences. The reason for listing #1 and #2 as arguments anyway is not that they finish the argument, but that (a) before the later tiers of argument were developed #1 and #2 were strong intuition-pumps in the correct direction and (b) even if they might arguably prove too much if applied sloppily, they counteract other sloppy intuitions along the lines of "What does this strange new species 'AI' want?" or "But won't it be persuaded by..." Like, it's important to understand that even if it doesn't finish the argument, it is indeed the case that "All AIs have property P" has a lot of chances to be wrong and "At least one AI has property P" has a lot of chances to be right. It doesn't finish the story - if we took it as finishing the story, we'd be proving much too much, like circular preferences - but it pushes the story in a long way in a particular direction compared to coming in with a prior frame of mind about "What will AIs want? Hm, paperclips doesn't sound right, I bet they want mostly to be left alone."

Anton Geraschenko

Jan 20, 2016

expand_less

more_vert

Permalink

Thanks for the reply. I agree that strong Inevitability is unreasonable, and I understand the function of #1 and #2 in disrupting a prior frame of mind which assumes strong Inevitability, but that's not the only alternative to Orthogonality. I'm surprised that the arguments are considered successively stronger arguments in favor of Orthogonality, since #6 basically says "under reasonable hypotheses, Orthogonality may well be false." (I admit that's a skewed reading, but I don't know what the referenced ongoing work looks like, so I'm skipping that bit for now. [Edit: is this "tiling agents"? I'm not familiar with that work, but I can go learn about it.])

The other arguments are interesting commentary, but don't argue that Orthogonality is true for agents we ought to care about.

Gandhian stability argues that self-modifying agents will try to preserve their preference systems, but not that they can become arbitrarily powerful while doing so. As it happens, circular preference systems illustrate how Gandhian stability could limit how powerful a cognitive agent can become.

The unbounded agents argument says Orthogonality is true when "mind space" is broader than what we care about.

The search tractability argument looks like a statement about the relative difficulty of

accomplishing

different goals, not the relative difficulties of holding those goals. I don't mean to dismiss the argument, but I don't understand it. I'm not even clear on exactly what the argument is saying about the tractability of searching for strategies for different goals. That it's the same for all possible goals?

comment

Propose comment

Pagedown

and

Mathjax

. User contributions licensed under

cc by-sa 3.0

with attribution required.

Email us