A Galilean Dialogue on Friendliness

By EliezerYudkowsky. Begun May 20th 2003.

in progress


Five people - Autrey, Bernard, Cathryn, Dennis, and Eileen - want to divide up a cake.

Autrey: I think everyone should get 20% of the cake.

Cathryn: Sounds fair. 20% apiece.

Eileen: I also agree. 1/5 for everyone.

Dennis: I want the entire cake for myself.

Bernard: Personally, I also agree that everyone should get 20% of the cake. So to be fair, Dennis should get 36% of the cake, and everyone else should get 16%. That way we're taking everyone's desires into account.

Dennis: Forget it! That's not fair! Who says that everyone's desires should be weighted at 20%? I think that only my desires should count. You're just trying to sneak in your way of looking at the world under a different name.

Autrey: Sorry, Bernard, it doesn't work that way. I already took the impulse toward fairness into account in saying that everyone should get 20% of the cake. You can't count the same factor twice.

Cathryn: I'll say! What, I should be penalized for being an altruist? If Dennis gets 36%, I'm switching my preferences to 100% on the next round!

Eileen: I'd hate to see that happen, Cathryn. I like people who care about fairness and I wouldn't want to see them penalized. Bernard, you need to take into account my preferences about the kind of system I want to live in and the kind of behavior it encourages, not just my immediate thoughts about the cake.

Bernard: Wow, look at all these different ethical imperatives... I'm not sure what happens when I apply 20% of each of them... maybe I should go by majority vote?

Dennis: Majority vote? That's not fair! Who says that anyone else should get a vote?

Cathryn: Just because we all vote on something doesn't make it right.

Bernard: "Right?" I know what it means to vote on something, but what does it mean for an ethical system to be "right"? There's no criterion for deciding between ethical systems.

Autrey: There is under my ethical system. How would you take 20% of that into account?

Cathryn: Um, Eileen, what are you doing over there?

Eileen: I'm building a Friendly AI.

Autrey: That sounds like an action with implications far beyond our current dilemma.

Eileen: But it will solve the cake-division problem. That's the beautiful thing about massive overkill.

Bernard: I don't see how building a Friendly AI helps. An AI might help to implement a solution for dividing the cake, but how could a FAI help decide on a solution for dividing the cake?

Eileen: Deciding on a solution for dividing the cake is a cognitive task. If you understand a cognitive task deeply enough - very deeply, down to the level of pure causal dynamics - you can embody it as a computer program.

Cathryn: But dividing a cake isn't just a straightforward computation like factoring a large composite number... I mean, we're arguing about it. That's more like... I don't know quite what it's like, actually.

Autrey: If you don't know what it's like, I guess you can't embody it as a computer program.

Bernard: I don't see what the ability to construct an AI gains us, except the ability to divide a lot more cake a lot faster, if we can agree on how to do it. Not that this would be a bad thing, I'm just asking about the relevance to the basic problem.

Eileen: Okay, here's an example of a relevant case. Let's suppose that three people can't agree that any particular, specific proposed division of a cake is "fair". We shall suppose in this example that the three agree on a moral principle for dividing the cake, but whenever one of them proposes a concrete division, the others disagree because it looks to them like the application of the moral principle has been unconsciously prejudiced by that person's self-bias. If these people possess the skill to specify and create minds, they can embody the moral principle they agree upon in an independent cognitive system. Then they can agree to abide by the judgment of that mind. As you asked, this is an example of an ethical dilemma that can be solved more easily given the ability to construct an AI.

Autrey: This problem can also be solved by appealing to an uninvolved sixth party, who also agrees to the moral principle in question, to divide the cake.

Eileen: It depends on what kind of biases the people involved are worried about. Appealing to a randomly selected sixth party might get you a randomly biased unfair division.

Bernard: If you have no way of knowing which direction the bias is in, doesn't that make the division fair? I'm reminded of someone who complained that the draft lottery wasn't fair because the draft slips weren't stirred enough. Who, specifically, was that lottery unfair to?

Cathryn: If you conducted a lottery that gave one winner the entire cake, and each of the three people had an equal chance of winning, that might be symmetrical, but it wouldn't be fair.

Bernard: In the long run, repeated with enough cakes, it would be fair to within epsilon.

Cathryn: Some problems aren't long-run problems. If an AI devised a lottery that gave all the money and property in the world to one person, and each of the six billion living humans had an equal chance of being selected, that would be a symmetrical lottery but not a fair solution.

Dennis: I'll say. This lottery has only one chance in six billion of being fair.

Bernard: Why would appealing to an AI be better than appealing to an uninvolved sixth party?

Eileen: The uninvolved sixth party has her own set of human biases. Even if she wants to be totally impartial, she doesn't have that option. Even if she knows a class of biases exist within herself, she can't get rid of them just by wishing, because she doesn't have access to her own source code. One possible solution would be to construct cognitive dynamics that embody the moral principle in the exact form it was agreed upon, without extraneous forces.

Cathryn: Okay, I can see three people agreeing on a principle for dividing the cake. But what if the moral principle they agree on is wrong?

Eileen: The deeper you go, the harder things are to explain... right now I just want to talk about how the ability to create a specified AI can help resolve some classes of negotiation.

Bernard: Cathryn's question sounds like obvious nonsense to me - no offense. If the three people all agree on it, who's to say they're wrong? What I want to know is, what if five people can't agree on a moral principle for dividing the cake?

Eileen: They might be able to agree on a principle for resolving arguments about how to divide cakes. In fact, if they all had the conceptual equipment to understand the underlying cognitive dynamics, they'd probably find those a lot easier to agree on. Underlying cognitive dynamics are far more similar across humans than any specific political position; our cognitive architectures have more in common than our politics. People all like their own political parties and hate those evil bastards on the other side; that's a human universal even if the specific political parties are infinitely variable. I think it might prove much easier to agree on fair cognitive dynamics than to agree on whose political party is best.

Bernard: Will your FAI treat everyone's cognitive dynamics equally?

Eileen: I'm not sure what you mean by that. I certainly don't plan to introduce asymmetrical mentions of specific humans, if that's what you mean.

Dennis: Eh?

Eileen: I'm not going to tell the FAI, "Make Dennis the ruler of the world."

Dennis: Why not?

Bernard: In that case your FAI's behavior seems easy enough to predict - everyone's cognitive dynamics get equal input, so the FAI will weight all our definitions of fairness equally. That way we'd have, let's see, 20% in favor of my definition of fairness, 20% in favor of Dennis's definition of fairness, and 60% in favor of you three. So Dennis's share would work out to... let's see, now... 39%.
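
(For concreteness, here is Bernard's arithmetic as a small Python sketch. The percentages are the ones quoted in the dialogue; the weighted_division helper and its names are purely illustrative.)

    # Bernard's "superposed preferences" arithmetic: average the proposed divisions,
    # weighting each person's proposal equally (20% apiece).

    def weighted_division(proposals, weights):
        """Average several proposed divisions, weighting each proposal."""
        people = proposals[0].keys()
        return {p: sum(w * prop[p] for prop, w in zip(proposals, weights))
                for p in people}

    equal_split = {"Autrey": 20, "Bernard": 20, "Cathryn": 20, "Dennis": 20, "Eileen": 20}
    dennis_only = {"Autrey": 0, "Bernard": 0, "Cathryn": 0, "Dennis": 100, "Eileen": 0}

    # First round: four equal-split proposals plus Dennis's, each weighted 20%.
    round_one = weighted_division([equal_split] * 4 + [dennis_only], [0.2] * 5)
    print(round_one["Dennis"])                 # 36.0  (everyone else gets 16.0)

    # Bernard's second round: his own 36/16 split, Dennis's 100%, and three equal splits.
    bernard_split = {"Autrey": 16, "Bernard": 16, "Cathryn": 16, "Dennis": 36, "Eileen": 16}
    round_two = weighted_division([bernard_split, dennis_only] + [equal_split] * 3, [0.2] * 5)
    print(round(round_two["Dennis"], 1))       # 39.2  (the "39%" Bernard quotes)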

Eileen: Okay, that's a good example of what I do not mean by fairness. That's not agreeing on cognitive dynamics, it's superposing our final answers. And the result, which is, no offense, ridiculous, shows the problem with that. You can't just take surface judgments and embody them in an AI, even if you vote on them. Then you just have... a chatbot, or an encyclopedia. It'd be like combining the most popular present scientific theories, weighted by the number of scientists who believe in them, and hard-wiring them as beliefs. It'd be interesting to look at the output, but the output wouldn't be a scientist. The result would be frozen in time; there could be no further progress. The result probably wouldn't even be coherent as a scientific theory; you'd have a set of beliefs that no single individual would ever hold. The thing we're hunting for is a dynamic cognitive process, not the frozen output of that process.

Dennis: Look, just let me make all the decisions. That's fair.

Eileen: But, Dennis, you do see that I have no way to listen to you instead of the other people who are saying exactly the same thing?

Dennis: What, they're saying that Dennis should make all the decisions? Good, they sound like sensible people to me.

Eileen: No, they're saying that they should make all the decisions.

Dennis: Bah, what arrant nonsense! I hope you'll dismiss those foolish speculations immediately and tell your AI to pay attention only to Dennis. That's fair.

Eileen: Um... look, I'm sorry, but that statement has not been phrased in a way which allows it to be argued across moral agents.

Bernard: Now that sounds unfair. Why aren't Dennis's desires being taken into account?

Dennis: I'm glad to hear you're on my side, Bernard! So you also see now that I should get the whole cake?

Bernard: No, I think you should get 36% of the cake.

Dennis: Bah, you're just as bad as the rest. Why aren't you taking my preferences into account?

Bernard: I am!

Dennis: No, I mean why aren't you taking my preferences into account the way I want you to take them into account, not the way you want to take them into account? You aren't using my preferences at all. Your preferences may change in some bizarre way that depends on my preferences as data, but they're still your preferences. I want you to use my preferences.

Bernard: Okay, Dennis, I'll take that into account.

Dennis: You're insane.

Bernard: Eileen, I still don't see how you can just say that Dennis's preferences can't be communicated. Cutting him off like that isn't fair.

Eileen: I'm not putting Dennis into the class of a rock or a tape recorder. He can go on trying to come up with a morally communicable argument for becoming personal overlord of the universe. He just hasn't done so yet. Right now, it's impossible for either you or me to communicate with Dennis; he's wrapped himself up in a small private world, morally speaking. He's not using arguments that make sense in either your or my system. Neither of us can communicate with him about the fair division of the cake any more than we could communicate with a tape recorder playing back "Two plus two equals nine!" about arithmetic. Conversely, he can't communicate with us either.

Dennis: But you get to decide which moral arguments you'll be influenced by? Who died and made you God? I should decide that.

Bernard: It does occur to me to ask what rules you're using.

Eileen: Well, in intuitive terms, imagine Joe saying to Sally, "My number one rule is: Look out for Joe." If Sally hears that as a moral argument, she'll hear: "Your number one rule should be: Look out for Sally." In other words, Sally hears Joe, automatically substitutes "[your name here]" for "Joe", and hears the general moral argument "Everyone should look out for themselves". Now if Joe happened to be a Moonie, and said "My number one rule is: Look out for Reverend Sun", Sally might hear that just the way Joe said it. There are rules and principles, instincts and intuitions, that create the transpersonal morality of humans. Right now, Dennis is behaving sort of like an extremely simple desirability computation devoted to turning the universe into paperclips. Or to put it another way, I can't see any possible way to build a Friendly AI such that it would give the world to Dennis, without the original core program mentioning Dennis explicitly, which everyone who is not Dennis would say was blatantly unfair.

Cathryn: Eileen, you're overcomplicating things. 20% apiece is obviously the correct way to divide this cake. It's the correct answer independently of any amount of arguing we do about it, just like 2 + 2 equals 4 regardless of whether anyone is looking.

Eileen: It takes knowledge to make a physical object, like a calculator, which successfully computes that 2 and 2 make 4 - there are many other possible computations a piece of matter can implement, and most of them aren't pocket calculators. It takes more knowledge to create a physical process whose output depends on its inputs in a way that steers the universe into particular states - that is, to create a Bayesian decision system. And it takes still more knowledge to create a physical process that can understand moral arguments as moral arguments - to compute the same question you're computing when you say that 20% apiece is obviously the unique correct answer - with which I happen to agree, by the way.

Autrey: I wouldn't be too sure that 20% is the answer. Humanity's moral memes have improved greatly over the last few thousand years; but how much farther do we have left to go? We mostly got rid of slavery, we're trying to get rid of racial prejudice, things like that. I was born too late myself, but you don't have to be very old to remember a time when blacks rode in the back of the bus!

Bernard: I remember.

Autrey: What if we, ourselves, look like unkempt barbarians from the perspective of a few years down the road? No, I'll make the statement stronger; it seems nearly certain that's how we'll look. "20%" may just be a solution that we seize upon because our instinct is to choose simple, cheating-resistant solutions that are obvious to all players. Would we pick a less simple but more optimal solution if we lived in a world where people weren't so tempted to employ elaborate arguments to get more than their share? What if we each have different preferences for cake and icing, making the problem non-zero-sum? What if we're barbarians for not using Condorcet Voting, Brams-Taylor Fair Division, or Throatwarbler-Mangrove Time-Discounted Volition?

Eileen: I agree that humanity has grown up a lot in the last few thousand years, or even just the last century. And I would say it's that dynamic process we need to preserve, not a snapshot of where we are today. Individuals improve their moralities, as do civilizations. So a Friendly AI can't have just the frozen values of the person who happened to create it, or even the frozen values of the civilization that happened to give birth to it. The classical stereotype presents an "AI" as a machine, echoing the frozen knowledge of its creators. But the AIs that just echo stored knowledge don't grow, don't learn, and this is not a "philosophical" problem; it reflects a real failure to implement specific cognitive abilities. It is part of the difference between computer programs, which is what we have now, and real AI, which is something no one has created yet.

Cathryn: Define "grown up".

Autrey: Define it? Why?

Cathryn: Because until you define it, there's nothing to talk about.

Autrey: I don't suppose you've ever read Robert Pirsig's "Zen and the Art of Motorcycle Maintenance?"

Cathryn: 'Fraid not.

Autrey: The main character of the book is an English professor named Phaedrus, trying, heaven help him, to teach his students to write. Now, writing quality - or as Phaedrus would put it, Quality in writing - is one of those things that is very difficult to define; and certainly his students can't do it. Asked to define quality in writing, no one has any ideas. And yet when Phaedrus shows his students low-quality work and high-quality work, they are able to agree on which is which; they can see quality, even though they can't define it. First, says Phaedrus, you show your students what Quality is, and show them that they know how to judge it even if they can't define it; then you introduce the writing rules, not as arbitrary laws to be followed blindly, but as means to the end of Quality, which the students have now learned to see.

Cathryn: And is this a true story, or something the author just made up?

Autrey: I don't know. Good question. One must avoid generalizing from fictional evidence, after all.

Eileen: Plenty of cognitive psychology papers discuss, in passing, people's highly correlated judgments of qualities they would probably be extremely hard-pressed to define.

Autrey: Pirsig doesn't like the idea that you must define things verbally - that only things you can define verbally are permitted as subjects of discussion. It can create a kind of blindness, people shutting out their own intuitions about Quality.

Eileen: In my view, your judgment of something's Quality is the primary fact, and verbal definitions are attempted hypotheses about that fact. That we can see writing Quality is a fact. Attempts to give verbal definitions of "writing Quality" are attempts to make hypotheses about what it is, exactly, that we are seeing - hypotheses about the cognitive processes that underlie the judgment. For your hypothesis to interfere with your perception of the facts is, of course, a sin.

Cathryn: You're saying what? That even if I don't understand my own sense of "grown up", it's okay to have that sense and use it?

Eileen: The way in which you pass these judgments is an aspect of the universe, and in particular, cognitive science, which you have observed but not yet explained.

Autrey: Being unable to define what underlies your judgment of grown-up-ness, or moral improvement, or fairness, doesn't mean that these things are demoted to some kind of second-rank existence. Even if you tried to give a verbal definition of, say, "fairness" - or as Eileen would put it, a hypothesis about fairness - you'd have to hold on very tightly to your intuitive judgment of fairness and continue checking your intuitive perception along with your verbal definition. In my experience, when people try and give off-the-cuff verbal definitions of such things, the definitions are usually wrong - or, at best, wildly inadequate. That is the danger of philosophy.

Eileen: Verbal definitions of our perceptions are usually inadequate because the real answer is a deep question of cognitive science, and people are trying to make up "philosophical" answers in English. Most times it ends up being like the various attempts to "define" what fire is in terms of phlogiston, stories about how the Four Elements run the universe, and so on. Today we know that fire is molecular chemistry, but that's one heck of a nonobvious explanation behind what seems like a very simple sensory experience.

Dennis: I don't buy Pirsig's line about Quality. You claim that writing quality is a fact that has been observed, but not yet explained. What makes you think that there is an explanation for it, or that anyone will ever be able to give a verbal account of it? If people can't give a verbal definition of a term, it must be because the term is fuzzy and arbitrary and useless.

Eileen: If a group of people agree on something that seems arbitrary, that itself is an interesting fact about cognitive science, and you should look for a common computation carried out in rough synchrony.

Autrey: Go players make their moves by judging a kind of "Go Quality" that has never been satisfactorily explained to anyone, and has not yet been embodied in any computer program. Go players have this mysterious, inexplicable sense of which moves are good, which Go configurations "feel" good or bad. They just pick it up from playing a lot of games. Playing from their unverbalizable sense of Go quality, those Go players beat the living daylights out of today's best computers according to very clear, objective criteria for who wins or loses. If you're a novice player playing chess, and you lose, you at least have some idea of what happened to you afterward; the other guy backed you into a corner and left you with no options and took all your pieces and so on. I've been losing a few games of Go here and there, and even after I lose I have no idea why. I put the stones on the board and then they go away. I know the rules and yet I don't understand the game at all. Strong Go players have an unverbalizable sense of Quality that gives rise to specific, definite, useful results according to a clear objective criterion.

Eileen: I'd expect there's an explanation for those perceptions in terms of how some areas of neural circuitry are trained by the experience of playing Go.

Autrey: Sure, but knowing that there exists an explanation is not the same thing as knowing the explanation.

Cathryn: I'm uncomfortable with leaving terms like "fairness" undefined. Maybe I can't define why I'm uncomfortable, but I am.

Autrey: If you can't define something you should be uncomfortable, because that is a real gap in your knowledge. But you can't jump the gun. If you don't know how to define something, then you don't know. If you try to create a definition before you have the knowledge to build a good one, you end up defining fire as the release of phlogiston.

Eileen: In order to understand your own sense of fairness, you should begin by studying it, becoming aware of it, learning how it works, rather than trying to give a preemptive definition of it. The Quality of intelligence, for example, is a real thing that has been preemptively defined in so many wrong ways. The job of an AI researcher, in a sense, is to take things that appear as opaque Qualities, and figure out what really underlies them - not in loose hand-waving explanations, but in enough detail to create them. It's the challenge of creation that's the highest and most difficult test of an explanation.

Autrey: And here, of course, is where AI researchers really, really screw up. Whenever you hear an AI researcher defining X as Y, you should always hold on to your intuitive sense of X, and see whether the definition Y really matches it properly, explains the whole thing with no lingering residue. Especially when it comes to terms like "intelligence". When you get an explanation right there should never be a feeling of forcing the experience to fit the definition, no sense of holding a mirror up to Life and chopping off the parts of Life that don't fit.

Dennis: You stole that from Terry Pratchett.

Autrey: When I look at AI researchers' explanations, I get the strong feeling that they're trying to shove a Quality into a definition it really doesn't fit at all, like someone trying to stuff their entire wardrobe into a 20-inch suitcase. So I'm going to stick with my intuitive understanding of these terms unless someone comes up with one heck of a good hypothesis.

Eileen: But that refusal doesn't allow for any incremental progress. You can make a hypothesis that some effect contributes to our perception of a Quality, without claiming to have explained the whole thing. You shouldn't summarily reject that attempt just because the hypothesis doesn't explain the entire problem in itself. That's why I agree with Pirsig that you shouldn't demand a verbal definition before you start.

Cathryn: How can you talk about a problem at all, if you don't have a definition of what you're talking about?

Eileen: By using extensional definition instead of intensional definition. Extensional definition works by presenting one or more experiences from which a general property can be abstracted -

Autrey: Eileen? I'll handle this. Cathryn, what is "red"?

Cathryn: It's a color.

Autrey: What's a "color"?

Cathryn: It's a property of a thing.

Autrey: What's a "thing"? What's a "property"?

Cathryn: Um...

Autrey: That's an example of intensional definition - trying to define words using other words. Now, to give an extensional definition, I'd say: "You see that traffic light over there? That's 'red'. You see that other traffic light over there? That's 'green'. The way in which they differ is 'color'."

Bernard: Of course, someone working from that definition alone might get confused and think "red" meant "top" and "green" meant "bottom"...

Eileen: Autrey's extensional definition of "red" is ambiguous - it spreads out to encompass more possibilities than its maker intended. Whether you're using extensional definition or intensional definition, or a mix of both, you have to make sure the map leads to only one place. And that is determined, not just by the map, but by the mind that follows the map. It sounds unambiguous to us - but only because we already know the desired answer! You have to watch out for that.

Autrey: Still, I usually find that extensional definition is to be preferred over intensional definition. Or as writers are admonished: "Show, don't tell." Otherwise you get lost in a maze of words that point to other words but never link to anything real.

Eileen: *cough*semanticnets*cough*

Cathryn: How would you give an extensional definition of morality?

Autrey: Well, you would point to a set of decisions and say which ones were and weren't moral. And I would point to you and say, "See, that's a morality."

Cathryn: I don't think that helped any.

Autrey: That particular answer doesn't advance on the problem of de-opaquing "morality", but it's a way of pointing to the thing you want to investigate.

Cathryn: I think I'd be very nervous if an AI was looking at me, though. What if I got something wrong?

Bernard: Wrong? How can you get a moral judgment "wrong"?

Cathryn: Just watch me.

Bernard: Doesn't getting something "wrong" require an external standard to compare it to?

Cathryn: In my experience, getting something wrong never requires anything more than a failure to pay attention.

Bernard: No, I mean... suppose you say you want an orange. How can you be "wrong" about that?

Cathryn: What do you mean? I've been wrong about what I wanted plenty of times. Sometimes I think my whole life has consisted of nothing else.

Bernard: That's a philosophical impossibility.

Cathryn: I can screw up even when it is philosophically impossible for me to do so. That is the power and the curse of Cathryn, and lesser mortals can but look upon me in awe.

Autrey: That reminds me of something that's been bugging me lately. Eileen, what's a Friendly AI supposed to do, aside from dividing up this cake fairly? The cake-division problem we're faced with may illustrate interesting things about fairness, but as a zero-sum game it doesn't really reflect what life is about.

Eileen: I remind you that I can't speak for a Friendly AI.

Autrey: Guess.

Eileen: In the beginning, when I thought about these kinds of moral questions, I used to think in terms like: "Is it better to be happy or sad? Is it better to be alive or dead? Is it better to be smart or stupid?" As it happens, I would come down on the happy/alive/smart side of the divide.

Dennis: More Culture propaganda.

Eileen: But then I started considering whether, if someone doesn't want to be happy, it's right to force them to be happy. As it happens, I would answer no. Having answered no, I found that I'd given individual autonomy and self-determination the deciding vote. This leads to the question of whether anything other than personal volition should even have a vote at all. Volition might "capture" happiness and life and smartness as special cases of self-determination - people's choices to be happy, alive, and smart. But then, while I do think that people have the right to be sad if they want to be, I'm not as sure of that conclusion as I am about people's right to be happy if they want to be. That suggests that even if my final conclusion is volitionism - people getting what they want - the morality behind the conclusion isn't captured by volitionism alone. I guess you could sum up my present position by saying that I wouldn't interfere with someone's choice to be sad, but I would choose to be sad about it.

Autrey: That's you. What about a Friendly AI?

Eileen: We don't really have the language to describe what a Friendly AI is at this point in our conversation.

Autrey: Okay, but what does it work out into in practice?

Eileen: Knowing how to describe a thought isn't the same as being able to think it yourself. What does the square root of 298304 work out to in practice?

Autrey: Probably around 550. What's your guess for the behavior of a Friendly AI?

Eileen: I think that somewhere along the line, there's going to be a deep principle that runs something like, "help people in accordance with their volitions".

Autrey: Okay, this is the part that has always worried me about the scenario of ultrapowerful AIs helping humans. We just heard Cathryn say she's frequently been wrong about what she wanted. Any sufficiently powerful friendship is indistinguishable from geniehood. Even if AIs are willing to help, do we know what to ask for? If wishes started coming true - even if they could only affect the person who made the wish and no one else - I think 98% of Earth's population would destroy themselves within a week.

Cathryn: 98%? Even I think that's pessimistic, and I speak as a woman who goes about beneath a constant cloud of doom.

Autrey: I'm not talking about deeply buried suicidal tendencies or any such psychoanalytic gibberish. I'm talking about the law of unintended consequences. People just can't read that far ahead into the future. They would think a little about the "Wishing" problem, experience some anxiety over it, and make up some rules for themselves to follow, until they'd dealt with whatever anxieties they had. And then they would make their wishes and die, because their model wasn't even close to reality.

Eileen: Speaking of "making up rules", Autrey, why do you think that this is the way "help" should work? It seems to me like you're making up rules for "wishing", then extrapolating those rules out to where they fail. The criterion for how the rules ought to work is determined, not by the part of you making up the rules, but by the part of you looking over your own rules and seeing that they won't work.

Autrey: Eh?

Eileen: Let me put it this way: Suppose you find a genie bottle. And suppose that the genie in the bottle is a genuinely helpful person - your good friend who just happens to be a genie. We'll leave aside for a moment the question of how to specify a genie like that. How would you want an ultraintelligent friend to react to your wishes?

Autrey: Uh... I don't know.

Cathryn: I don't know either!

Autrey: Thanks, Cath.

Cathryn: No charge.

Bernard: This all strikes me as rather speculative. Wouldn't it make more sense to start by asking about ordinary, human-level help?

Autrey: No, effective omnipotence genuinely does strike me as the most probable outcome of [recursive self-improvement].

Bernard: I don't believe that. But regardless, I'm asking about the rational order of addressing the problem.

Eileen: Analyzing the case of unlimited computing power makes the structure of the problem clearer. Power corrupts, but absolute power sure simplifies the math.

Bernard: Okay. I'd wish for the genie to look ahead into the future for me, and tell me if my wish has any awful unintended consequences.

Cathryn: I'd wish for the genie to fulfill the request I would have made if I could foresee the future.

Dennis: I'd wish to be the genie.

Autrey: You are all so dead.

Eileen: Hmm. I note that all three of you made a meta-wish, rather than specifying the properties of the mind that reacts to your wishes.

Cathryn: Interesting point. I guess I've read more short stories about people making wishes than people building genies.

Eileen: Bernard, how would a genie know whether you consider something an "awful unintended consequence"?

Bernard: No problem. We're talking omniscience along with omnipotence, right? The genie can get a total scan of my every neuron down to the atomic level. The genie can use that to figure out what I would consider an "awful unintended consequence". For example, suppose that I wish for the genie to bring me the nearest banana, but the nearest banana turns out to be rotten. I'd consider that an unintended consequence. So the genie should warn me.

Eileen: The genie not only has to read your mind, the genie also has to extrapolate your reactions to a situation you have not actually encountered. It may seem very straightforward to say that someone with your tastes will spit out a rotten banana in disgust - but there is an additional step involved in computing that, above and beyond having a physical readout of the state of your taste buds and taste-related neural circuitry. And while we're on the subject, it takes an additional computational step to identify what is "you", and your "sense of taste", within the raw data of your physical readout.

Autrey: Any intelligence, any mind, is embodied in physics - protein machinery, molecules, atoms, ultimately quarks and electrons. If we're talking about genuinely infinite computing power, and a full readout of the amplitude field over the Bernard subspace that constitutes Bernard's quantum state - actually, that's physically impossible. Can I assume it anyway?

Eileen: Sure. You couldn't do that with an actual genie, but it'll simplify the thought experiment.

Autrey: Given all that, then, I can precisely extrapolate Bernard's physical state forward in time to determine his reaction to the rotten banana. If you allow for less than full omniscience, or less than infinite computing power, you get a probabilistic computation rather than a deterministic one. That's essentially what I do in real life when I guess that Bernard will react badly to a rotten banana. All I need to know about Bernard is that he's human. I don't have a precise readout of his quantum state, or infinite computing power. Yet, uncannily, my guess is still correct. It's as if I had this eerie ability to reason about the physical universe using limited information and bounded computing power, and make decisions under conditions of uncertainty.

Eileen: A physically detailed simulation of Bernard would be a real person, but we'll ignore that consideration for the time being - if the problem is unsolvable even without ethical constraints on how the solution is computed, it'll still be unsolvable after the constraints are added back in. If we know how the ideal solution would be computed using infinite computing power and no ethical constraints, we can then try and figure out how to ethically compute an approximation.

Autrey: Okay. I propose that the problem of extrapolating Bernard's reaction is computable in principle and approximable in practice.

Cathryn: Objection: Physics is not deterministic. If you extrapolate Bernard forward, you'll get a set of probabilities for different possible states, not a definite single state.

Autrey: Actually, I believe in the many-worlds formulation. Everett, Wheeler-DeWitt, yada yada. The many-worlds formulation is deterministic.

Cathryn: So you end up with many different real Bernards, each with a different measure. How is that any better than ending up with many different possible Bernards, each with a different probability?

Autrey: I don't see any significant difference that depends on which formulation of physics you use. What's your point?

Cathryn: If there are many Bernards, who has the final say?

Autrey: The one with the greatest measure, or the greatest probability. No, wait, that's not right. You should add up the different degrees of happiness or sorrow for each possible Bernard, multiplied by the probability or measure of that Bernard. Then that determines Bernard's expected satisfaction with the wish.

Dennis: That doesn't sound right to me. If I wish for a banana, and all of my possible future selves are chewing in bovine happiness, then sure, go ahead and fulfill the wish. But if there's a 60% probability that my future self is wildly ecstatic about the banana, and a 40% probability that my future self runs screaming out of the room babbling about the Elder Gods, then I'd want to know what the clickety-clackety heck was up with that banana before I chowed down. Even if the average satisfaction is the same for both cases, they look very different to me.
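
(A toy illustration of Dennis's point. The numeric satisfaction scale here is invented purely for illustration; nothing in the dialogue specifies it.)

    # Two banana outcomes with the same average satisfaction but very different spread.
    # Invented scale: +5 = wildly ecstatic, -5 = running screaming, +1 = mildly pleased.

    certain_banana = [(1.0, 1.0)]                 # one outcome, probability 1
    risky_banana = [(0.6, 5.0), (0.4, -5.0)]      # Dennis's 60/40 case

    def expected(dist):
        """Probability-weighted average satisfaction."""
        return sum(p * s for p, s in dist)

    print(expected(certain_banana))   # 1.0
    print(expected(risky_banana))     # 1.0 - same average, very different situations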

Autrey: How would you measure the difference between those two cases? I'm sure there's some standard statistical way of doing that, but offhand I can't remember what -

Bernard: Measure the variance from the mean.

Eileen: Measure the Shannon entropy of the probability distribution.

Autrey: Oh, of course.

Dennis: I know that the variance is the average of the squared differences from the mean, but what's the Shannon entropy?

Cathryn: Shannon entropy is a way of measuring uncertainty in probability distributions. For example, suppose you have a system A that is equally likely to be in any of 8 possible states [A1...A8]. The Shannon entropy of system A is log2(8), or 3 bits. Let's suppose that another system B is equally likely to be in any of 4 possible states [B1..B4]; then B's Shannon entropy would be 2 bits. What is the entropy of the combined system that includes A and B?

Dennis: The combined system can be in any of 32 states, so it has 5 bits of entropy.

Cathryn: Trick question!

Dennis: Of course.

Cathryn: It depends on whether there's any mutual information between A and B. If the probabilities for A and B are independent, then system AB will have 32 possible states, any one of which is equally likely. But suppose we know that if system A is in an even-numbered state, B must be in an even-numbered state; while if system A is in an odd-numbered state, B must be in an odd-numbered state. The combined states [A2, B3] or [A7, B2] and so on have been ruled out. If we look at A alone, A could be in any of 8 possible states, which are equally likely; B alone could be in any of 4 possible states, which are equally likely; but when we look at A and B together, there are only 16 possible states the combined system could be in. A has 3 bits of entropy, B has 2 bits of entropy, and their mutual information is 1 bit, so the combined system AB has 4 bits of entropy. Similarly, learning whether A is odd or even tells us the parity of B, so learning A's exact state reduces the number of states B could be in from 4 to 2, which reduces the entropy of B from 2 bits to 1 bit, so knowing A provides 1 bit of information about B.
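
(A short Python sketch that simply checks Cathryn's numbers; the entropy helper is illustrative.)

    import math
    from itertools import product

    def entropy(probs):
        """Shannon entropy, in bits, of a probability distribution."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    A = range(1, 9)   # states A1..A8, equally likely
    B = range(1, 5)   # states B1..B4, equally likely

    # Joint states allowed under Cathryn's rule: A and B always have the same parity.
    allowed = [(a, b) for a, b in product(A, B) if a % 2 == b % 2]

    H_A = entropy([1 / 8] * 8)                          # 3.0 bits
    H_B = entropy([1 / 4] * 4)                          # 2.0 bits
    H_AB = entropy([1 / len(allowed)] * len(allowed))   # 16 equally likely joint states -> 4.0 bits
    print(H_A, H_B, H_AB, H_A + H_B - H_AB)             # 3.0 2.0 4.0 1.0  (mutual information: 1 bit)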

Dennis: What if the probabilities for B are [B1: 1/2, B2: 1/4, B3: 1/8, B4: 1/8]?

Cathryn: The Shannon entropy of the system would be 1.75 bits, found by summing -P(X)log(P(X)) for all states X.

Dennis: Eh? How can a system have 1.75 bits of information in it? How do you store three-fourths of a bit? Write down only part of a '1' or '0'? I'm having a hard time visualizing this. B has four states. If you needed to tell me which state it was in, you'd need to transmit '00', '01', '10', or '11'. That's two bits of information in each case.

Cathryn: But half the time B is in state B1. So suppose I use '1' to indicate state B1, '01' to indicate state B2, and '000' and '001' to indicate state B3 and B4. If you look at that coding, you'll see that it's unambiguous - if the start of the sequence is clearly marked, you can always figure out where each symbol begins and ends. So half the time I transmit one bit, a quarter of the time I transmit two bits, and a quarter of the time I transmit three bits, which adds up to 1.75 bits for the average case. The arithmetic I just performed works out to summing -P(X)log(P(X)) for each possibility, which is the formal definition of the Shannon entropy.
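
(The same arithmetic spelled out as a sketch, using the probabilities and code words Cathryn gives.)

    import math

    # Probabilities for B1..B4 and Cathryn's prefix-free code words.
    dist = {"B1": (0.5, "1"),
            "B2": (0.25, "01"),
            "B3": (0.125, "000"),
            "B4": (0.125, "001")}

    entropy = -sum(p * math.log2(p) for p, _ in dist.values())
    avg_code_length = sum(p * len(code) for p, code in dist.values())
    print(entropy, avg_code_length)   # 1.75 1.75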

Dennis: Wait, how does that fit with the definition you gave earlier, for systems A and B?

Cathryn: It's the same definition. First you have the system A, which has 8 equally likely states, or 8 states each with probability 1/8. So you sum up 8 terms, each of which has the value -(1/8)(log(1/8)). The end result is just -log(1/8) or log(8), which is 3. System B has 4 states each with probability 1/4, which works out to (4)(1/4)(-log(1/4)) or 2 bits of entropy by the same logic. Then when you work out the combined entropy of the independent systems A and B, you end up with (4*8)(1/4*1/8)(-log(1/4*1/8)), which, lo and behold, works out to -log(1/4*1/8) or -log(1/4) + -log(1/8). So when A and B are independent, the entropy of the combined system AB is equal to the entropy of A plus the entropy of B.

Eileen: You might want to play with the [properties] of Shannon entropy a bit. For our purposes, the important thing about Shannon entropy is that if you have a few very strong possibilities, the entropy is low; if you have a lot of weak possibilities, the entropy is high. So entropy behaves like "uncertainty".

Cathryn: Entropy measures the volume of configuration space in which you might end up. If each possible state of the entire world is a single point in configuration space, then a volume in configuration space can represent your uncertainty about the state of the world. For example, if a system consists of three variables X, Y, Z that can take on continuous quantities, then you can represent any possible state of the system by a point in three-dimensional space. If you have a system of ten particles, each one of which has a three-dimensional position and a three-dimensional velocity, any possible state of that system can be represented by a point in a sixty-dimensional phase space. Since any possible state of that physical system can be represented by a point in phase space, a volume in phase space describes many different possible states of a physical system. So if you're uncertain about what state a physical system is in, you're uncertain about where the point is in the phase space. You can describe that uncertainty by drawing a border around the volume of phase space the point might occupy. The wider the border, the larger the volume, the greater your uncertainty, the greater the entropy. If you take, say, a bunch of gas molecules, and heat them up, then their range of possible velocities increases because they're moving faster. So since the X, Y, Zs of velocity vary within a greater range, the volume in configuration space needs a larger border, and the entropy goes up. The hotter an object is, the more entropy it has.

Eileen: Entropy isn't the volume in configuration space, it's the logarithm of the volume of configuration space.

Cathryn: Er, yes. If you take the logarithm of the volume in configuration space, then your uncertainty about many independent systems combined equals the sum of your uncertainties about each individual system of particles alone. The next step is, instead of marking each point in configuration space as "possible/impossible", you mark it with the probability of ending up at that point. The resulting definition is equivalent to the Shannon entropy. If you're dealing with continuous physical variables you get something called the differential entropy, but it behaves pretty much the same way as the Shannon entropy. Or for quantum systems there's the von Neumann entropy, but again, it's pretty much the same as the Shannon entropy.
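
(A tiny sketch of why taking the logarithm makes uncertainties add; the volumes are arbitrary made-up numbers.)

    import math

    # Made-up volumes for the "possible" regions of two independent systems.
    vol_A, vol_B = 8.0, 4.0
    vol_joint = vol_A * vol_B                    # independent uncertainties: volumes multiply
    print(math.log2(vol_joint))                  # 5.0
    print(math.log2(vol_A) + math.log2(vol_B))   # 5.0 - taking the log makes uncertainties add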

Dennis: I've always heard that entropy measures the amount of additional information you would need to know the exact state of a physical system.

Cathryn: It's all equivalent. If a system has four bits of Shannon entropy, then the average length of a message that specifies the exact state of the system is four bits. Incidentally, the second law of thermodynamics is a consequence of a theorem which can be derived from either classical or quantum physical law, and this theorem, Liouville's Theorem, says that probability is incompressible. If you have a volume of states in configuration space, and you extrapolate the evolution of that volume forward in time under our physical laws, then you must end up with exactly the same volume you started with. If you imagine starting with a nice, neat, compact blob in configuration space, then if you extrapolate each point in the blob forward in time 5 minutes, you have to end up with exactly the same total volume at the end. It might not be a nice, neat, compact volume, though. It might be a squiggly volume with lots of tentacles. If you didn't keep track of all the exact squiggles, which would be a lot of work, you'd have to draw a much bigger blob to surround everything. So "entropy", the size of the blob you draw to capture your uncertainty about the system, usually increases and never decreases. That's why no one can ever build a perpetual motion machine, no matter how clever they are with wheels and gears, if their wheels and gears are governed by the physics we know. A perpetual motion machine that takes in hot water and produces electricity and ice cubes is a physical process that maps a great big blob in configuration space into a little tiny blob, and the incompressibility theorem says this is impossible. If you want to squeeze a big blob into a tiny blob in one subsystem, another subsystem somewhere has to bloat up from a tiny blob into a big blob, so that the phase space volume of the total system is conserved. If you have a physical process such that subsystem B develops from 4 possible starting states into 1 final state, then some other subsystem, say the A subsystem, has to develop from 1 definite starting state into 4 possible final states. And those 4 final states of A will probably be spread out so much that we have to describe A using the range [A1..A8]. You can move entropy around from one subsystem to another, but you can never reduce the total entropy. You can freeze water into ice cubes, but you need somewhere to dump that heat, plus whatever additional heat was generated by the work involved. Hence thermodynamics.

Dennis: Cool. Back to wishes.

Eileen: If your response to the banana is pretty much localized in a single, definite high-level reaction like "Mm, nice banana", then your reaction is definite; it has low entropy. If your different possible futures contain a very wide range of reactions, then your response is uncertain; it has high entropy.

Autrey: Ah, very nice.

Bernard: I would still prefer to compute the variance. Taking the discrete Shannon entropy assumes that each possible outcome is entirely distinct; it doesn't take into account the distance between possibilities. Many similar reactions should count as less variance than a few widely different reactions. Two equiprobable, widely separated reactions should count as more uncertainty in a wish than eight equiprobable reactions in a neat local cluster.

Eileen: I tend to think of reactions as distinct possibilities with complex internal structure. But if there's a clearly defined distance metric between different possible reactions, then yeah, the variance might be a better measure. Perhaps both the variance and the entropy are too simple to really capture our intuitive definition of the uncertainty in our reaction to an event, but either definition would be a good place to start. Since I need to pick a term and stick with it, I'm going to talk about the spread in a person's reaction to an event.
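
(A sketch of the contrast Bernard and Eileen are drawing. The one-dimensional "reaction" values and their probabilities are invented for illustration.)

    import math

    def entropy(dist):
        """Shannon entropy, in bits; dist is a list of (probability, reaction) pairs."""
        return -sum(p * math.log2(p) for p, _ in dist if p > 0)

    def variance(dist):
        """Variance of the reaction value, using the probabilities as weights."""
        mean = sum(p * x for p, x in dist)
        return sum(p * (x - mean) ** 2 for p, x in dist)

    # Eight similar reactions in a tight cluster vs. two widely separated reactions.
    clustered = [(1 / 8, x) for x in (0.9, 0.92, 0.94, 0.96, 1.04, 1.06, 1.08, 1.1)]
    separated = [(1 / 2, -1.0), (1 / 2, 1.0)]

    print(entropy(clustered), variance(clustered))   # 3.0 bits, variance ~0.005
    print(entropy(separated), variance(separated))   # 1.0 bit,  variance 1.0
    # Entropy calls the clustered case *more* uncertain; variance calls it much less so.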

Cathryn: Okay, what's spread?

Eileen: I tend to think of spread as very similar to entropy - in fact, I formerly used the word "entropy" - because when I think of spread, I visualize an uncertain volume of possibilities. But the spread in Bernard's reactions is not the same as his physical entropy, because Bernard's "satisfaction" is a high-level characteristic of the Bernard subsystem, and it's computed in a lossy way. Two possible Bernard microstates can have the same "reaction" for our purposes, and yet be different in other ways we don't care about. We won't distinguish between a vast number of different Bernard microstates which are all "satisfied"; that's physical entropy which isn't counted into the spread. We won't distinguish between equally "satisfied" Bernards with electron #whatever in spin-up or spin-down. The two states have the same utility from a moral perspective - they are, from our perspective, effectively interchangeable. You can define a complicated subspace of the physical configuration space, not by considering a subset of the particles and their associated dimensions in the configuration space, but by defining the classes of physical states that are interchangeable with respect to your decision system. Then you can consider the entropy with respect to the superspace of those subspaces, apart from the physical entropy.

Cathryn: Then what's the relation between spread and entropy? Or are they related at all, under this definition?

Eileen: Not all physical entropy can be interpreted as spread. However, all spread necessarily implies physical entropy. If Bernard could end up "screaming in horror" or "wildly ecstatic", then that spread requires some amount of physical entropy. Different physical microstates can map to the same "reaction", but different Bernard reactions must map to different physical microstates. For there to be spread in Bernard's reaction, there must be entropy in the volume of possible physical states Bernard could end up in. All spread implies entropy, but not all entropy implies spread. I think of "spread" as a partial measurement of the entropy of a system - you're measuring only a particular kind of entropy that you care about, or the entropy with respect to a particular partitioning of the system into subspaces of interchangeable states. But if you used something other than the Shannon entropy to define the spread, like the variance, it might not be appropriate to call the spread a partial measurement of entropy. Spread would still imply entropy, but there might not be any simple relation between the two measures.

Bernard: I am reminded of a quote by Morowitz: "If you don't understand something and want to sound profound, use the word 'entropy'."

Cathryn: To measure the spread in Bernard's reaction, you need a way to measure Bernard's reaction. That's the next question, right?

Eileen: Right. Even given that you have an atomically detailed specification of Bernard's future state, how do you compute his "reaction"?

Autrey: Check to see if Bernard murmurs, "Mm, nice banana."

Eileen: How do you check for that murmur, and rate it as appreciation, and so on? What if Bernard doesn't murmur anything?

Autrey: In answer to your first question, use voice recognition and natural language interpretation, and emotion recognition trained on past responses. I'm not saying you'd do it that way, I'm just trying to show it can be done. For the second question, maybe you could assume that if Bernard doesn't murmur anything, he's satisfied.

Bernard: Not necessarily. Quiet suffering is a hobby of mine.

Autrey: Then I'll say the obvious: The genie needs to read Bernard's mind within his extrapolated physical state, recognize the extrapolated thoughts within the extrapolated neurons, and guess what Bernard will think of the banana. And, let me guess, your next objection is that I haven't defined what "Bernard's thoughts" are, or how to read them.

Eileen: Got it in one.

Bernard: All of Earth's neurologists put together couldn't say how to do that. What you're asking for is unreasonable.

Autrey: If so, then you can't build a genie. The inherent difficulty of that problem doesn't care whether solving it is "unreasonable"; it's just there. The forces that establish the difficulty of the problem are quite independent of the forces that establish the resources you have available to solve it; it would be an amazing coincidence if the two should exactly balance.

Eileen: Autrey, suppose we allowed a solution to the technical problem, but only the technical side. In other words, you can tell me that you want to read out the words from Bernard's linguistic stream of consciousness, or you want to read out a mental image in Bernard's visual cortex, or you want to read the activation level of some particular emotion, and the genie can do all that, but you still need to say what the genie does with that information. Even given that, how would you compute Bernard's "reaction"?

Autrey: Er... okay, good question.

Eileen: There's not much point in complaining about technical impossibilities if you don't know what you're trying to accomplish.

Autrey: How about this? I don't claim to know exactly what it means for Bernard to be "satisfied" or "dissatisfied" with a wish. The definition of that might even vary from person to person. But I can monitor signs that are associated with dissatisfaction. Like, if Bernard makes a disgusted face, or says "Yuck". Or I can monitor if Bernard's emotion of disgust activates, or if his pain centers light up; you said I could assume that capability.

Dennis: Or if Bernard says, "Curse you genie! This banana is rotten! Rotten! Oh, foul accursed thing! What demon from the depths of hell created thee!"

Autrey: Um... yeah, that'd also be a good sign. I mean a bad sign. It would be a strong indicator of a negative outcome.

Eileen: Now, let me ask this: If we granted that definition... which isn't really well-specified, but we'll grant it anyway... if we granted that definition, would you feel safe in making a wish?

Autrey: No.

Bernard: Why not?

Autrey: Because the definition has holes in it.

Bernard: Eh? What does it mean for a definition to have holes?

Autrey: I mean a definition that covers some bad wishes, but not all possible bad wishes. That's always been my problem with verbal definitions. If you define 'man' as a featherless biped, what about someone who's missing a leg, or a plucked chicken? If "birds" fly, what about ostriches and penguins?

Dennis: Words mean whatever you say they mean. If you define man as a featherless biped, then a plucked chicken is a "man".

Eileen: There's a subfield of cognitive science that studies categories. Lakoff and Johnson, for example, would say that categories have "radial structure". Robins are in the center of the bird category; they have most of the properties that are associated with birds; they are typical for birds. Ostriches and penguins are less typical; they're distant from the center.

Autrey: Words are nets. They may catch a few big, easy facts, but smaller truths slip through. Imagine that we have a configuration space for physical objects - call it "thingspace". We'll take every physical object that exists on Earth, and map it onto a point in thingspace. Now imagine that you take each thing in thingspace, and you rate it for birdiness - the degree to which this thing appears to be a bird. The result would be a map of your 'bird' category.

Cathryn: A map of a category? What would that look like?

Autrey: Most of the things that are birds would be clustered together - robins, hummingbirds, pigeons, and so on. That would be the central cluster, radiating brightly with the light of your high 'bird'-ness rating. There would also be some smaller, noncentral clusters, radiating out from the central cluster, like ostriches and penguins, still glowing, but perhaps slightly less brightly.

Cathryn: So the brightness of the light is determined by the degree of category membership.

Autrey: Right. The maps of your categories are galaxies, bright cores but with a scattering of other structures nearby. There's also the empirical cluster structure of things themselves to be taken into account. Robins and pigeons, for example, might both be equally birds - might both glow equally brightly - but they are distinguishable subclusters within the bird supercluster. Robins are all more similar to other robins than they are to pigeons; pigeons are all more similar to other pigeons than they are to robins. So in thingspace, all the robins would gather into a tight little subcluster that was squarely within the center of the larger bird supercluster, glowing brightly with birdness, but still distinguishable from the nearby pigeon subcluster. I use cluster structure to refer to the distribution of actual things on Earth, the way the points in thingspace cluster together. I use category structure to refer to the way that categories work in your mind - the way that the glow of a word is distributed across thingspace, and any conceivable points in thingspace.

Eileen: And, since the word is not the thing itself, category structure doesn't always match cluster structure.

Autrey: That wasn't quite the point I wanted to make, but it's true. You can create categories that lump together things that are really quite distant, and if so, it will create errors in how you think about the problem. You'll treat things as the same when they're really different, and generalize without realizing it. Words carve reality, and poor words fail to carve at the natural joints. Categories, as we learn them from experience of which things are similar to other things, can be quite complicated. There's a vast amount of neural circuitry that's been trained to recognize things, and that neural circuitry works in complex ways. When we try to give a verbal definition of the category, we're trying to simplify that information - compress a tremendous amount of richness into a very small space.

Dennis: Compress...?

Autrey: You can say that "birds fly", for example, but if you look at thingspace when it's lit up by "bird", and then look again when thingspace is lit up by "things that fly", you'll find that there are clusters that glow with birdness, but not flightness. The penguin cluster, for example. Although most things that glow with birdness glow with flightness, "flight" alone is not enough for even an approximation of "bird", because so many things fly that are not birds; insects, for example, or airplanes.

Cathryn: Okay, so what would be a better definition of "bird"?

Autrey: You could take the intersection of "flight" and something else; "feathers", for example. If you intersected "flight" and "feathers" - if you said that "a bird is a feathered flying thing" - then thingspace might light up with a glow that covered most birds, and mostly birds, but not all birds and only birds. The glow of "feathered things that fly" would not be exactly the same as the glow of "bird". The verbal definition you gave for "bird" is an imperfect approximation of the "bird" category that exists in your mind.
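(A minimal sketch, in Python, of the mismatch Autrey is describing. The toy "thingspace" entries and their feature flags are invented for illustration; the only point is that the verbal definition "feathered and flies" and the category "bird" light up overlapping but not identical sets.)

    # A toy "thingspace": each thing carries a few invented feature flags.
    thingspace = {
        "robin":    {"feathers": True,  "flies": True,  "bird": True},
        "pigeon":   {"feathers": True,  "flies": True,  "bird": True},
        "penguin":  {"feathers": True,  "flies": False, "bird": True},
        "ostrich":  {"feathers": True,  "flies": False, "bird": True},
        "bat":      {"feathers": False, "flies": True,  "bird": False},
        "airplane": {"feathers": False, "flies": True,  "bird": False},
        "dog":      {"feathers": False, "flies": False, "bird": False},
    }

    def verbal_definition(features):
        # "A bird is a feathered flying thing."
        return features["feathers"] and features["flies"]

    # Compare the glow of the definition with the glow of the category.
    for name, features in thingspace.items():
        if verbal_definition(features) != features["bird"]:
            print(name, "is a hole in the net")
    # Prints: penguin, ostrich -- birds the definition fails to catch.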

Eileen: And sometimes both the verbal definition and the cognitive category structure disagree with the actual cluster structure. "When the bird disagrees with the bird book, believe the bird." Hence Autrey's complaint.

Autrey: I think my complaint is more than that. A lot of times it seems that verbal definitions don't even strike at the essence of the thing they try to define; they just describe symptoms. Plato's definition of a human as a featherless biped, for example. Maybe a lot of humans light up with that description, but it doesn't catch any interesting part of what it means to be human. If you used a more sophisticated version of that description, something that described, say, the anatomy of the kidneys and so on, until you had a purely anatomical description which happened to apply to all humans and only humans, a "perfect" definition by most lights, you still wouldn't have said anything interesting about what it means to be human.

Cathryn: Now if you were to start describing the anatomy of the brain...

Bernard: Suppose that you have a "perfect" definition, one that covers all things and only the things you want it to cover. How could any alternate definition be better? What's wrong with describing humans using the anatomy of the kidneys?

Autrey: The anatomical definition would fail if, say, you had someone with an artificial kidney, who still had a human brain and behaved like a human. Even if right now no specific humans like that exist, the categories would still have different glows when mapped across the whole of thingspace, and not just those particular points in thingspace that we've encountered so far. If the definition is different from the word, you can break it with a thought experiment, even if there are no physically realized counterexamples at this moment in time.

Dennis: And what does this have to do with Friendly AI?

Autrey: A point where the glows don't overlap is a hole in the net, a loophole in the definition, a place where the safeguard fails. When you add on definitions like "Watch Bernard's facial expression" or "See if Bernard shrieks in horror" or "monitor the levels of disgust and pain in Bernard's brain", you're catching some, but not all, cases of wishes gone horribly wrong. You're constructing a net, and the net has holes.

Eileen: But this isn't just a categorization failure. The deeper problem is that you're weaving the net out of effects of wishes that go wrong - symptoms instead of causes. You're imagining something that might go wrong, and then imagining a probable consequence of the mistake, and telling the genie to check for the consequence in order to detect the mistake. You didn't tell the genie which criteria you used to mentally determine what constituted a mistake in the first place. You only told the genie about something you thought might be a probable consequence or correlate of a mistake - you told the genie about the symptom but not the disease. Of course, some symptoms strike closer to the heart than others - monitoring Bernard's brain for signs of a disgusted taste reaction, for example.

Bernard: Check the extrapolated Bernard for a grimace, a verbal objection, or feelings of pain and disgust... looks fine to me. The definition given should catch most problems. I don't demand perfection, as long as it's good enough.

Autrey: That's how you would deal with a genie?

Bernard: Can you name a specific problem with the suggested definition?

Autrey: Even if I couldn't name a specific problem, it doesn't mean you'd be safe. Just because there are no obvious problems doesn't mean there are no problems. Availability is not the same as probability.

Eileen: You're using a negative safety strategy instead of a positive safety strategy - you're assuming that success is the default and defining ways to detect failure, instead of defining unique signs that indicate success.

Bernard: You still haven't pointed out a specific problem with my plan.

Dennis: Suppose it's a poison banana that instantly kills you. Then you don't murmur an objection, you don't grimace, you don't even feel pain and disgust; you just fall over dead.

Bernard: Ah, okay. So we need to patch the definition so it also catches fatalities. Those are undesirable too.

Autrey: This approach is guaranteed to fail. You can "patch" a definition enough that there are no longer any holes obvious to you. You cannot "patch" a definition enough that it does not actually contain any holes.

Eileen: You have placed yourself into a situation where you are testing your wits against the genie's, and that in itself is a mistake. If you're smart enough to predict all the ways that the genie might try to fulfill your wish, you may be able to create a definition that covers all the holes, if simple carelessness or error doesn't trip you up. But not if the genie is smart enough to work in ways you didn't think of.

Bernard: Okay, so how do you build a genie, then?

Autrey: Maybe you don't. You can't start from the assumption that there must be a way for you to build a genie, and then reason backwards from there. If you run into an obstacle you can't solve, then you can't build a genie. That's all there is to it.

Eileen: I've written a bit about [this sort of problem], where intelligence appears to require knowledge of an infinite number of special cases. Consider a CPU that adds two 32-bit numbers. On one level of organization, you can regard the CPU as adding two integers to produce a third integer. On a lower level of organization, two structures of 32 bits collide, under certain rules which govern the local interactions between bits, and the result is a new structure of 32 bits. Since we have a deep understanding of arithmetic, it is not very difficult for us to produce such a CPU. But consider the woes of a research team that doesn't really understand what arithmetic is about, with no knowledge of the CPU's underlying implementation, that tries to create an arithmetic "expert system" by encoding a vast semantic network containing the "knowledge" that two and two make four, twenty-one and sixteen make thirty-seven, and so on. In this hypothetical world where no one really understands addition, we can imagine the "common-sense" problem for addition; the launching of distributed Internet projects to "encode all the detailed knowledge necessary for addition"; the frame problem for addition, where the sum of one number depends on what other number you add it to; the philosophies of formal semantics under which the LISP token thirty-seven is meaningful because it refers to thirty-seven objects in the external world; the design principle that the token thirty-seven has no internal complexity and is rather given meaning by its network of relations to other tokens; the "number grounding problem"; the skeptics who write books about how machines can simulate addition but never really add; the hopeful futurists arguing that past projects to create Artificial Addition failed because of inadequate computing power... you get the idea.
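(A hedged illustration of Eileen's parable. The "expert system" below memorizes sums as isolated facts, while the ripple-carry adder implements the underlying dynamics; the specific memorized facts and the bit width are made up for the example.)

    # The "Artificial Addition expert system": a semantic network of memorized facts.
    KNOWLEDGE_BASE = {(2, 2): 4, (21, 16): 37}   # ...and so on, fact by painstaking fact

    def expert_system_add(a, b):
        # Fails on any sum nobody thought to encode.
        return KNOWLEDGE_BASE.get((a, b), "I don't know")

    def ripple_carry_add(a, b, bits=32):
        # What the CPU actually does: local rules governing colliding bit structures.
        result, carry = 0, 0
        for i in range(bits):
            x, y = (a >> i) & 1, (b >> i) & 1
            result |= (x ^ y ^ carry) << i
            carry = (x & y) | (carry & (x ^ y))
        return result

    print(expert_system_add(21, 16))   # 37 -- a memorized fact
    print(expert_system_add(21, 17))   # "I don't know"
    print(ripple_carry_add(21, 17))    # 38 -- general addition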

Bernard: Right. What you need is an Artificial Arithmetician which can learn the vast network of relations between numbers that humans unconsciously acquire during their childhood.

Dennis: No, you need an Artificial Arithmetician that can understand natural language, so that instead of the AA having to be explicitly told that twenty-one and sixteen make thirty-seven, it can get the knowledge by exploring the Web.

Autrey: Frankly, it seems to me that you're just trying to convince yourselves that you can solve the problem. None of you really know what arithmetic is, so you're floundering around with these generic sorts of arguments. "We need an AA that can learn X", "we need an AA that can extract X from the Internet". I mean, it sounds good, it sounds like you're making progress, and it's even good for public relations, because everyone thinks they understand the proposed solution - but it doesn't really get you any closer to general addition, as opposed to range-specific addition in the twenties and thirties and so on. Probably we will never know the fundamental nature of arithmetic. The problem is just too hard for humans to solve.

Cathryn: That's why we need to develop a general arithmetician the same way Nature did - evolution.

Bernard: Top-down approaches have clearly failed to produce arithmetic. We need a bottom-up approach, some way to make arithmetic emerge.

Kurzweil: I believe that machine arithmetic will be developed when researchers scan each neuron of a complete human brain into a computer, so that we can simulate the biological circuitry that performs addition in humans. This will occur in the year 2026, on October 22nd, between 7:00 and 7:30 in the morning.

Searle: Let me tell you about my Chinese Calculator Experiment -

Eileen: Wait! Stop. That's not what I was trying to say. My point, Autrey, is that you have an internal process that you yourself use to determine, in these thought experiments, whether or not a wish has gone "wrong". Then you try to build a definition that you think will catch catastrophic wishes, by imagining wishes that you know to be catastrophic, and imagining consequences that you could tell the genie to look for. But if you do that, you're giving the genie a different procedure to follow than you yourself use - the genie doesn't share the same definition, the same cognitive computation, that you yourself are using to decide which cases are catastrophic in the first place. That's what you need to transfer over. Not your carefully built nets, but the thoughts you're using to construct those nets.

Cathryn: Ah, I see. Now what you said earlier, about needing to transfer a dynamic cognitive process instead of frozen outputs, is starting to make sense.

Autrey: Okay, that gives me an idea. Suppose that instead of trying to monitor Bernard's extrapolated reaction to the banana, we ask Bernard directly whether he was satisfied with his wish or not. "Yes" means the wish succeeded. "No", or silence, means the wish failed.

Bernard: Wouldn't being asked that question constantly, after every single wish, become irritating after a while?

Autrey: No, you simulate Bernard being asked the question. By hypothesis, the genie has both the computing power and the knowledge to do that, right down to the quark level. Then you use Bernard's simulated answer. You don't need to read Bernard's thoughts from his extrapolated physical state, all you need to do is recognize the spoken words "Yes" or "No". Bernard can take into account anything he wants in answering "Yes" or "No" - the simulated Bernard, I mean. His level of disgust, or even a vague feeling of existential ennui at having to eat yet another banana. We don't need to make up an elaborate definition of success or failure; we'll use Bernard's definition, by simulating him. Perfect coverage, no mismatches.

Dennis: What about a poison banana? Then Bernard doesn't say anything.

Autrey: The simulated Bernard has to specifically say "Yes". If he says nothing, it counts as a "No".

Dennis: What if "Gaaack" sounds closer to "Yes" than "No"?

Autrey: When in any doubt, treat it as a "No". You're not checking for the presence of failure, you're checking for the presence of success.

Dennis: What if something happens that tricks Bernard into saying "Yes", like some other question being asked at around the same time?

Autrey: Okay, good point. The simulated Bernard has to say: "Yes, I'm satisfied with my wish." It occurs to me, though, that now I'm in "tweaking mode" again, and that means I'm doomed. I can get it to the point where I can no longer see the flaws, but not to the point where there are no flaws.

Eileen: I liked the simulation theory, though. You could get a whole readout of someone's judgment that way. If you wanted a snapshot of the meaning of the word "bird" in Bernard's mind, you could ask him to rate the "birdiness" of every possible kind of thing - not just things that exist, but things that don't - by simulating encounters, each simulation independent, and asking him to rate on a scale from 1 to 100. You'd lose subtleties that way, but you could get a quantitative mapping over thingspace - a picture of the glow.

Autrey: That would take a hell of a lot of computing power.

Cathryn: We were talking about simulating Bernard on a quantum scale. That's already a hell of a lot of computing power, to within the same circle of hell, more or less.

Eileen: This is a way to define a judgment function. It's not the same as having the underlying computation, but it gets you a snapshot of the outputs. Then you can talk about the entropy in a judgment - pardon me, I mean, "the spread". You can talk about the spread in a judgment. You can talk about the spread due to quantum uncertainty (or diverging worlds). You can talk about a range of possible physical initial conditions, and the resulting spread - say, the spread with respect to the range of possible initial conditions created by thermal uncertainty of motions. You can talk about spread with respect to showing Bernard the putative "bird" at time t = 3 seconds or time t = 4 seconds or any nanosecond interval in between. You can talk about how Bernard's judgment function changes over time. And you can talk about someone's judgment function for a Quality that they don't know how to define.
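(A rough sketch of the "spread" Eileen describes: repeatedly sample a simulated rater's judgment under slightly perturbed presentation conditions and measure how much the ratings vary. The rating model here is an arbitrary stand-in, not a claim about how a real simulated Bernard would answer.)

    import random
    import statistics

    def simulated_birdiness_rating(presentation_time, noise_seed):
        # Stand-in for "simulate Bernard seeing the putative bird and rating it 1 to 100".
        rng = random.Random(noise_seed)
        base = 80.0                                       # hypothetical central judgment
        thermal_jitter = rng.gauss(0, 5)                  # spread from uncertain initial conditions
        timing_effect = 2.0 * (presentation_time - 3.0)   # showing it at t = 3s vs t = 4s
        return max(1, min(100, base + thermal_jitter + timing_effect))

    ratings = [simulated_birdiness_rating(random.uniform(3.0, 4.0), seed)
               for seed in range(1000)]
    print("mean judgment:", statistics.mean(ratings))
    print("spread (standard deviation):", statistics.stdev(ratings))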

Bernard: You could take partial derivatives of the judgment function with respect to a set of quantitative variables controlling the presentation of the putative bird.

Autrey: Why would you want to?

Bernard: Don't you ever do anything just for fun? Besides, you don't understand an equation until you've partially differentiated it with respect to something.

Cathryn: Eileen, I'm having problems with the whole idea that your definition of the judgment function, using Bernard's simulated output, captures Bernard's real judgment. I don't know if this definition would capture my real judgment, the criteria I use to determine whether I would want my wish to be carried out. I think the whole idea may be on the wrong track.

Autrey: Why?

Cathryn: I'm having trouble putting it into words.

Autrey: That may not make your objection invalid, but it doesn't exactly make it valid, either.

Cathryn: Yeah, I know. Look, suppose that I wish for the banana, and get the banana, and I'm satisfied with the banana, and yet, nevertheless, I wished for the wrong thing.

Bernard: But that's sheer nonsense. What criterion are you using to determine bad-wish-ness, if not your own satisfaction with the wish?

Cathryn: That's just my point. Maybe my satisfaction with the wish isn't the right criterion to determine whether the wish is good or bad. To say nothing of my verbal expression of that satisfaction.

Autrey: Okay, then how would you determine whether a given "criterion of bad-wish-ness" is right or wrong? I mean, how would you choose between criteria for judging wishes?

Eileen: Heh. That question is FAI-complete.

Dennis: That question ought to be taken out and shot.

Cathryn: I don't know. I'd judge the criterion's Quality.

Autrey: (Sigh.) It may have been a bad idea to introduce that concept if it's going to be abused like that.

Eileen: I wouldn't call it abuse. I'd call it a call to investigation.

Autrey: Look, saying that you're judging a Quality isn't really much different from saying that you're judging a 'thingy'. Calling something a Quality doesn't explain it.

Eileen: Calling something a "dependent variable" doesn't explain it either, but it can be a useful conceptual tool in designing the investigation. From my perspective, Cathryn just said: "I'm making this judgment but I don't know how. Please investigate me."

Cathryn: I guess I'd go along with that.

Autrey: Is there anything you can put into words about why you're dissatisfied with "satisfaction" as a wish criterion?

Cathryn: No, not really. I just have a bad gut feeling about it, like I'm being fast-talked into something.

Autrey: Okay, can you think of a counterexample? An example of a "satisfactory" wish that you wouldn't want to see carried out?

Cathryn: Um...

Autrey: If you can't give a clever and incisive counterexample, don't hesitate to give a stupid one.

Cathryn: What?

Autrey: I'm serious. Even a stupid counterexample can help tell us what you're thinking. Free-associate. Give us a hint.

Cathryn: All right. Suppose I wished for the genie to grab an ice cream cone from a little girl and give it to me. Now it might be a really delicious and satisfying ice cream cone, but it would still be wrong to take the ice cream cone away from the little girl. Isn't your definition of satisfaction fundamentally selfish?

Dennis: I'll say! I should get the ice cream cone.

Bernard: Well, of course, the so-called altruist is also really selfish. It's just that the altruist is made happy by other people's happiness, so he tries to make other people happy in order to increase his own happiness.

Cathryn: That sounds like a silly definition. It sounds like a bunch of philosophers trying to get rid of the inconvenient square peg of altruism by stuffing it into an ill-fitting round hole. That is just not how altruism actually works in real people, Bernard.

Autrey: I wouldn't dismiss the thought entirely. The philosopher Raymond Smullyan once asked: "Is altruism sacrificing your own happiness for the happiness of others, or gaining your happiness through the happiness of others?" I think that's a penetrating question.

Eileen: I would say that altruism is making choices so as to maximize the expected happiness of others. My favorite definition of altruism is one I found in a [glossary of Zen]: "Altruistic behavior: An act done without any intent for personal gain in any form. Altruism requires that there is no want for material, physical, spiritual, or egoistic gain."

Cathryn: No spiritual gain?

Eileen: That's right.

Bernard: That sounds like Zen, all right - self-contradictory, inherently impossible of realization. Different people are made happy by different things, but everyone does what makes them happy. If the altruist were not made happy by the thought of helping others, he wouldn't do it.

Autrey: I may be made happy by the thought of helping others. That doesn't mean it's the reason I help others.

Cathryn: Yes, how would you account for someone who sacrifices her life to save someone else's? She can't possibly anticipate being happy once she's dead.

Autrey: Some people do.

Cathryn: I don't. And yet there are still things I would give my life for. I think. You can't ever be sure until you face the crunch.

Eileen: There you go, Cathryn. There's your counterexample.

Cathryn: Huh?

Eileen: If your wish is to sacrifice your life so that someone else may live, you can't say "Yes, I'm satisfied" afterward.

Autrey: If you have a genie on hand, you really should be able to think of a better solution than that.

Eileen: Perhaps. Regardless, it demonstrates at least one hole in that definition of volition.

Bernard: It is not a hole in the definition. It is never rational to sacrifice your life for something, precisely because you will not be around to experience the satisfaction you anticipate. A genie should not fulfill irrational wishes.

Autrey: Cathryn knows very well that she cannot feel anything after she dies, and yet there are still things she would die for, as would I. We are not being tricked into that decision, we are making the choice in full awareness of its consequences. To quote Tyrone Pow, "An atheist giving his life for something is a profound gesture." Where is the untrue thing that we must believe in order to make that decision? Where is the inherent irrationality? We do not make that choice in anticipation of feeling satisfied. We make it because some things are more important to us than feeling satisfaction.

Bernard: Like what?

Cathryn: Like ten other people living to fulfill their own wishes. All sentients have the same intrinsic value. If I die, and never get to experience any satisfaction, that's more than made up for by ten other people living to experience their own satisfactions.

Bernard: Okay, what you're saying is that other people's happiness is weighted by your goal system the same as your own happiness, so that when ten other people are happy, you experience ten times as much satisfaction as when you yourself are happy. This can make it rational to sacrifice for other people - for example, you donate a thousand dollars to a charity that helps the poor, because the thousand dollars can create ten times as much happiness in that charity as it could create if you spent it on yourself. What can never be rational is sacrificing your life, even to save ten other lives, because you won't get to experience the satisfaction.

Cathryn: What? You're saying that you wouldn't sacrifice your own life even to save the entire human species?

Bernard: (Laughs.) Well, I don't always do the rational thing.

Cathryn: Argh. You deserve to be locked in a cell for a week with Ayn Rand.

Autrey: Bernard, I'm not altruistic because I anticipate feeling satisfaction. The reward is that other people benefit, not that I experience the realization that they benefit. Given that, it is perfectly rational to sacrifice my life to save ten people.

Bernard: But you won't ever know those ten people lived.

Autrey: So what? What I value is not "the fact that Autrey knows ten people lived", what I value is "the fact that ten people lived". I care about the territory, not the map. You know, this reminds me of a conversation I once had with Greg Stock. He thought that drugs would eventually become available that could simulate any feeling of satisfaction, not just simple ecstasy - for example, drugs that simulated the feeling of scientific discovery. He then went on to say that he thought that once this happened, everyone would switch over to taking the drugs, because real scientific discovery wouldn't be able to compare.

Cathryn: Yikes. I wouldn't go near a drug like that with a ten-lightyear pole.

Autrey: That's what I said, too - that I wanted to genuinely help people, not just feel like I was doing so. "No," said Greg Stock, "you'd take them anyway, because no matter how much you helped people, the drugs would still make you feel ten times better."

Cathryn: That assumes I'd take the drugs to begin with, which I wouldn't ever do. I don't want to be addicted. I don't want to be transformed into the person those drugs would make me.

Autrey: The strange thing was that Greg Stock didn't seem to mind the prospect. It sounded like he saw it as a natural development.

Cathryn: So where'd the conversation go after that?

Autrey: I wanted to talk about the difference between psychological egoism and psychological altruism. But it was a bit too much territory to cover in the thirty seconds of time I had available.

Dennis: Psychological egoism and psychological altruism? Eh?

Eileen: The difference between a goal system that optimizes an internal state and a goal system that optimizes an external state.

Cathryn: There's a formal difference?

Eileen: Yes.

Bernard: No.

Cathryn: Interesting.

Autrey: In philosophy, this is known as the egoism debate. It's been going on for a while. I don't really agree with the way the arguments are usually phrased, but I can offer a quick summary anyway. You want one?

Dennis: Yeah.

Autrey: Okay. Psychological egoism is the position that all our ultimate ends are self-directed. That is, we can want external things as means to an end, but all our ultimate ends - all things that we desire in themselves rather than for their consequences - are self-directed in the sense that their propositional content is about our own states.

Eileen: Propositional content? Sounds rather GOFAI-ish.

Autrey: Maybe, but it's the way the standard debate is phrased. Anyway, let's say I want it to be the case that I have a chocolate bar. This desire is purely self-directed, since the propositional content mentions me and no other agent. On the other hand, suppose I want it to be the case that Jennie has a candy bar. This desire is other-directed, since the propositional content mentions another person, Jennie, but not myself. Psychological egoism claims that all our ultimate desires are self-directed; psychological altruism says that at least some of our ultimate desires are other-directed.

Bernard: If you want Jennie to have a candy bar, it means that you would be happy if Jennie got a candy bar. Your real end is always happiness.

Autrey: That's known as psychological hedonism, which is a special case of psychological egoism. As Sober and Wilson put it, "The hedonist says that the only ultimate desires that people have are attaining pleasure and avoiding pain... the salient fact about hedonism is its claim that people are motivational solipsists; the only things they care about ultimately are states of their own consciousness. Although hedonists must be egoists, the reverse isn't true. For example, if people desire their own survival as an end in itself, they may be egoists, but they are not hedonists." Another quote from the same authors: "Avoiding pain is one of our ultimate goals. However, many people realize that being in pain reduces their ability to concentrate, so they may sometimes take an aspirin in part because they want to remove a source of distraction. This shows that the things we want as ends in themselves we may also want for instrumental reasons... When psychological egoism seeks to explain why one person helped another, it isn't enough to show that one of the reasons for helping was self-benefit; this is quite consistent with there being another, purely altruistic, reason that the individual had for helping. Symmetrically, to refute egoism, one need not cite examples of helping in which only other-directed motives play a role. If people sometimes help for both egoistic and altruistic ultimate reasons, then psychological egoism is false."

Dennis: The very notion of altruism is incoherent.

Autrey: That argument is indeed the chief reason why some philosophers espouse psychological hedonism.

Cathryn: Sounds like a lot of silly philosophizing to me. Does it really matter whether I'm considered a "motivational solipsist" or whatever, as long as I actually help people?

Bernard: That's just it! It doesn't make any operational difference - all goal systems operate to maximize their internal satisfaction, no matter what external events cause satisfaction.

Eileen: That's not true; it does make an operational difference. If Autrey values the solipsistic psychological event of knowing he saved ten lives, he will never sacrifice his own life to save ten other lives; if he values those ten lives in themselves, he may. You told him that, remember?

Bernard: Well, I guess Autrey might value the instantaneous happiness of knowing he chose to save ten lives, more than he values all the happiness he might achieve in the rest of his life.

Cathryn: That doesn't sound anything remotely like the way real people think. Square peg, round hole.

Autrey: Do you have anything new to contribute to the debate, Eileen? It's a pretty ancient issue in philosophy.

Eileen: The basic equation for a Bayesian decision system is usually phrased something like D(a) = Sum U(x)P(x|a). This is known as the expected utility equation; von Neumann and Morgenstern showed in 1944 that the preference ordering of any system obeying certain [consistency axioms] can be represented this way -

Dennis: Start over.

Eileen: Okay. Imagine that D(a) stands for the "desirability" of an action A, that U(x) stands for the "utility" of a state of the universe X, and P(x|a) is your assigned "probability" that the state X occurs, given that you take action A. For example, let's say that I show you two spinner wheels, the red spinner and the green spinner. One-third of the red spinner wheel is black, while two-thirds of the green spinner wheel is white. Both spinners have a dial that I'm going to spin around until it settles at random into a red or black area (for the red spinner) or a white or green area (for the green spinner). The red spinner has a one-third chance of turning up black, while the green spinner has a two-thirds chance of turning up white. Let's say that I offer you one of two choices; you can pick the red spinner and get a chocolate ice cream cone if the spinner turns up black, or you can pick the green spinner and get a vanilla ice cream cone if the spinner turns up white.

Dennis: So I can choose between a one-third probability of a chocolate ice cream cone or a two-thirds probability of a vanilla ice cream cone.

Eileen: Right.

Dennis: Why would anyone ever choose the red spinner?

Cathryn: If they liked chocolate ice cream cones a lot more than they liked vanilla.

Dennis: But I don't like chocolate ice cream to begin with.

Cathryn: Freak.

Eileen: So you'd choose the red spinner, Cathryn?

Cathryn: I... ooh, that's a terrible dilemma. If I chose the red spinner, I'd feel awful if I lost, because then I should have chosen the green spinner to maximize my probability instead of taking the risk. If I chose the green spinner and lost, I'd figure that I would have lost either way. If I chose the green spinner and won, I'd always feel a little ashamed of not having tried for the chocolate. And if I chose the red spinner and won, I'd probably feel a little guilty while eating the chocolate because of the calories. So... let's see, the probability of actually losing is twice as large for the red spinner as on the green spinner... so... oh, what the hell. A merry life but a short one. She either fears her fate too much, or her desserts are small, that dares not put it to the touch, to gain or lose it all. I pick the red spinner.

Eileen: Is that your final answer?

Cathryn: Yes. Damn the torpedoes. Go for the chocolate.

Bernard: Did you notice the way that all of her desiderata consisted of her psychological reactions to events, rather than the events themselves?

Eileen: Cathryn was talking about eating an ice cream cone, not picking a school for her daughter. Ice cream cones are supposed to be driven by hedonism.

Autrey: That was dreadful. You steer your life using that kind of decision-making process?

Cathryn: What would you say I should have done?

Autrey: Figure out whether you like chocolate ice cream at least twice as much as vanilla ice cream. If you do, pick the red spinner. Otherwise, pick the green spinner. The desirability of the red spinner equals the utility of a chocolate ice cream times the probability of winning, which is one-third. The desirability of the green spinner equals the utility of a vanilla ice cream times the probability of winning, which is two-thirds. Obviously, you pick red if chocolate is worth more than twice as much to you as vanilla, and green otherwise.

Eileen: D(red) = U(chocolate)*P(chocolate|red). D(green) = U(vanilla) * P(vanilla|green). Technically, of course, we should sum up the expected utility of all the states in each equation. D(red) = U(chocolate)*P(chocolate|red) + U(vanilla)*P(vanilla|red). D(green) = U(chocolate)*P(chocolate|green) + U(vanilla)*P(vanilla|green).

Autrey: But P(chocolate|green) and P(vanilla|red) are both zero. There's no way to get a chocolate ice cream from the green spinner, and no way to get a vanilla ice cream from the red spinner. So the expanded equation reduces to the simplified one.
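(The spinner arithmetic above, written out as a sketch. The utilities assigned to chocolate and vanilla are placeholders; the decision rule is just the equations Eileen and Autrey gave.)

    def desirability(action, utilities, probabilities):
        # D(a) = sum over outcomes x of U(x) * P(x|a)
        return sum(utilities[x] * p for x, p in probabilities[action].items())

    utilities = {"chocolate": 5.0, "vanilla": 2.0}   # hypothetical U(x) values
    probabilities = {
        "red spinner":   {"chocolate": 1/3, "vanilla": 0.0},
        "green spinner": {"chocolate": 0.0, "vanilla": 2/3},
    }

    for action in probabilities:
        print(action, desirability(action, utilities, probabilities))
    # Pick red only if U(chocolate) is more than twice U(vanilla);
    # with these placeholder numbers, 5/3 beats 4/3, so red wins.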

Dennis: Um... I'm unfamiliar with the notation P(x|y). What does it mean?

Autrey: "The probability of X, given that we know Y to be true." "X given Y" or "Y implies X". Unfortunately, the mathematical standard notation reads right-to-left, backwards, so that if you want to follow the direction of implication you have to read it in reverse, as if it were in Hebrew. If the notation starts confusing you - if at any point you have trouble keeping track - I'd advise that you call a halt and read [An Intuitive Explanation of Bayes' Theorem].

Dennis: Hm... um... that looks like it contains a lot more stuff than that particular probability notation. It's pretty long.

Autrey: Yeah, but it's important stuff. If you aren't already familiar with Bayesian reasoning, in fact, that paper is probably more important than the one you're reading right now, and you should stop and read that instead.

Cathryn: Okay, I can see how the expected utility rule would simplify decisionmaking for people who accepted it -

Eileen: That's not what it's for. The expected utility rule is computationally intractable for all real problems, so if it looks like it simplifies life, you're doing something wrong. But go on.

Cathryn: Is this supposed to be a complete decisionmaking rule? Like, you can use it to decide anything?

Eileen: Maybe. If you've got all the complexity inherent in computing U(x) and computing P(x|a). Since practically all of the real complexity is there, I would speak of a system as "having utility-structure" rather than "implementing expected utility". I've also encountered what look like complications that cannot be understood using expected utility at all, but those are very advanced and I'm not sure I have them right yet. Let's take expected utility as the whole deal for the moment.

Cathryn: When I make plans, I end up wanting a lot of things in order to get other things, rather than wanting them in themselves. For example, I want my keys to open my car door to drive to work to get money to buy food. How does expected utility account for instrumental goals?

Eileen: When I talk about P(x|a), I don't mean to imply that A needs to directly cause X via some immediate event - as we intuitively think about direct causation, anyway. A can cause B which causes C which causes X, and so on, and it would still be counted into P(x|a). Instrumental goals can be thought of as a way to achieve savings in computing power - if you compute that B is instrumentally desirable because P(x|b) is high, then you can try to figure out A such that P(b|a) is high, and then - assuming there were no loopholes in your definitions - P(x|a) will probably be high. But that's just a way to save on computing power - having "instrumental goals" is a very convenient way to compute an approximation to the formalism, but it's not actually part of the formalism. Or, to put it better, the cognitive phenomenon of "instrumental goals" is automatically emergent in the formalism.

Cathryn: Wait, let me check if I understand all this. Let's say that there's a transparent locked box A containing a banana. Let's say that there are two more transparent locked boxes, B and C, with B containing a red key, and C containing a blue key. And then there are two more boxes, D and E, with D containing a green key and E containing a yellow key. Now I know, from previous experience, that the red key opens box A, and that the yellow key opens box B. So if I'm offered a choice between a white key and a black key, and I know that the white key opens box E, I'll select the white key.

Autrey: Right. That's classical backwards chaining.

Cathryn: Incidentally, do you know that chimpanzees can solve that problem?

Autrey: You're joking.

Cathryn: You teach a chimpanzee which keys open which boxes. You create two series of five boxes each, scramble them together, and show the chimpanzee the ten scrambled boxes. Then you present the chimp with two keys, a key to the first box in the series that ends with the banana, and a key to the first box in the second series that goes nowhere, and you make the chimp choose only one key in advance. They can solve it. Dohl 1970, "Goal-directed behavior in chimpanzees."

Autrey: That's amazing. Chimps are that close to being human?

Cathryn: They are.

Eileen: Or to look at it another way, explicit backwards chaining is that evolutionarily recent.

Cathryn: Now suppose I wanted to solve that problem using expected utility. U(banana) is, say, 10, and P(banana|red) is 0.9, so U(red) is 9. P(red|yellow) is 0.9, so U(yellow) is 8.1. P(yellow|white) is 0.9, so then U(white) is 7.29 and I know to choose the initial white key over the initial black key, which has U(black) = 0.

Eileen: Er, not exactly. P(banana|white) is 0.729 after you multiply the chained probabilities together, U(banana) is 10, and there are no other utilities under consideration, so D(white) is 7.29. D(a) is a measure of desirability that reflects the linear ordering of preference in actual choices, while U(x) is a measure of the utility of final states. You don't calculate the utility of instrumental states - or if you do, you need to create a separate measure W(b); let's call the instrumental utility the "worth". There isn't really any such thing as the "instrumental worth of B", calculated in isolation - or if there exists a context-insensitive W(b), it implies unrealistic constraints on the environment.

Cathryn: Why?

Eileen: When you're talking about real-life probabilities you're doing cognitive processing with categories of events. When you say that P(banana|red) is 0.9, you're using an inference that if you see a key and its color is red, that key will open box A, which visibly contains the banana. So at first it might seem like red keys are always "worthwhile". But actually only 9 out of 10 real red keys inherit "worth" in this way, and the other 1 out of 10 fail to open box A. For example, suppose you saw a small black mark on 1 out of 20 red keys and the marked red keys never opened box A. In this case, you would be able to decompose the category of red keys into "plain red keys" with P(banana|red::plain) = .95 and "marked red keys" with P(banana|red::marked) = 0. Now if you had, on first realizing that red keys open box A 90% of the time, modified your utility function U(x) and given red keys a hardwired utility of 9, you would attach that utility to all red keys regardless of whether they were marked or plain. Also, the next time you encountered a chain, you would attach utility to getting the red key, regardless of whether the red key was the one that led to the banana on this occasion. Even if it was reliably the red key on every occasion, you'd be counting the utility twice, and things would get confused pretty fast. That's why I emphasize that W(b) has to be kept distinct from U(x).
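(A small sketch, with made-up numbers, of the distinction Eileen is drawing: instrumental worth W recomputed from the current chained probabilities, versus a "hardwired" utility for the whole category of red keys written into U(x) itself.)

    U_BANANA = 10.0

    def recomputed_worth(p_banana_given_this_key):
        # W(key) = U(banana) * P(banana | this particular key, in this situation)
        return U_BANANA * p_banana_given_this_key

    # The sticky-substance mistake: having noticed that red keys usually work,
    # write a utility for "red key" into the utility function itself.
    HARDWIRED_U_RED_KEY = 9.0

    print(recomputed_worth(0.95))   # plain red key: 9.5
    print(recomputed_worth(0.0))    # marked red key: 0.0 -- worthless on this occasion
    print(HARDWIRED_U_RED_KEY)      # 9.0 for every red key, marked or not

    # Double counting on a chain: white key -> yellow key -> red key -> banana.
    p_banana_given_white = 0.9 ** 3           # 0.729
    p_red_key_given_white = 0.9 ** 2          # 0.81
    print(U_BANANA * p_banana_given_white)    # 7.29 -- the correct desirability
    print(U_BANANA * p_banana_given_white
          + HARDWIRED_U_RED_KEY * p_red_key_given_white)   # 14.58 -- the banana counted twice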

Cathryn: So the utility function U(x) stays constant, and the instrumental desirability W(b) is recomputed on each occasion?

Dennis: English check?

Autrey: Your final purposes stay the same, but you change the means you employ to reach them. For example, a red key may be very valuable on one test, and worthless in another - or the worth may change depending on details of a single test. You can't think as if "worth" is a sticky substance inherent in the red key itself. People do tend to think like that very easily. For example, I once heard two people on the radio, during a government budget crisis, ask "How can the government be running out of money? Can't they just call the guys at the Treasury and tell them, print up some more twenties?" Scary... There's also the case of Pavlov, conditioning dogs to salivate at the sound of a bell. Whatever the cause, our intuitions do seem to behave as if we think worth is an inherent sticky substance.

Eileen: It has to do with the incremental evolution of cognition, the hundreds of millions of years of natural selection before chimpanzees, the solutions that evolved before backwards chaining. It's a long story.

Cathryn: Fine, W(b) gets recomputed. Does U(x) stay constant?

Eileen: U(x) stays constant, at least for the moment.

Cathryn: What does that mean?

Eileen: It means that economists and philosophers and computer scientists analyze classes of systems where U(x) is decomposable, unchanging, fully known, cheaply computable, and consistent under reflection. In Friendly AI or volitionist philosophy those simplifying assumptions fail, but for the moment you should assume that U(x) stays constant.

Cathryn: Oh... kay. But the point is that because I recompute the instrumental desirability each time, or at least in theory I should, I can refactor the category of "red keys" if I recognize that some red keys are more useful than other red keys. Or the same if only certain yellow keys lead to red keys, and so on. In fact, it seems to me that desirability is being assigned only to individual red key microstates, and not to categories of microstates such as "red key".

Eileen: In theory you are correct, but of course it is too expensive to separately compute the instrumental worth of each microstate, which is why we lump them together into categories like "red key". Another point is that if the probabilities of the links in the chain are not independent, then we calculate the final probability and not the product of the local probabilities. For example, suppose that a yellow key has a 90% chance of opening box B, and the red key in box B has a 90% chance of opening box A. Now these statements, considered in isolation, are both true. But it also happens that if the yellow key successfully opens box B, then the red key in box B has a 100% probability of opening box A. If so, the instrumental worth of the yellow key would be 9.0, not 8.1, because the chained probability p(banana|yellow) = 90%.

Autrey: This is something you cannot compute if you are treating the worth as a substance inherent in yellow keys or red keys.

Cathryn: If the worth isn't in the keys, where is it?

Eileen: Desirability follows the probabilities, or to be more exact, your perceived desirability follows your perceived probabilities, and if the probabilities are not independent, it will not be possible to approximate worth as a sticky substance. If you compute W(red) and P(red|yellow), then that information is not sufficient to compute W(yellow), just as knowing P(banana|red) and P(red|yellow) is not sufficient to deduce P(banana|yellow) unless you have prior knowledge that the two probabilities are independent. In our example, since the banana is the only source of final utility, the quantity W(b), instrumental worth, behaves like the mathematical property leads-to-banana-ness, P(banana|b), which is a global and not a local property of events. P(banana|b) may not be the same as P(banana|b&a1) or P(banana|b&a2). Computing instrumental worth only saves computing power if you make simplifying assumptions, like the conditional independence of probabilities in the chain.
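(A sketch of Eileen's dependence example in numbers: multiplying the local 90% figures gives 8.1, but when a yellow key that opens box B guarantees that the red key inside opens box A, the real chained probability gives a worth of 9.0.)

    U_BANANA = 10.0

    # Each link of the chain is 90% likely, considered in isolation.
    p_yellow_opens_b = 0.9
    p_red_opens_a = 0.9
    # But the links are not independent: whenever the yellow key opens box B,
    # the red key inside turns out to open box A every time.
    p_red_opens_a_given_b_opened = 1.0

    naive = p_yellow_opens_b * p_red_opens_a                   # 0.81, assumes independence
    actual = p_yellow_opens_b * p_red_opens_a_given_b_opened   # 0.90, the real P(banana|yellow)

    print("worth of the yellow key, treating worth as a local substance:", U_BANANA * naive)   # 8.1
    print("worth of the yellow key, following the chained probability:", U_BANANA * actual)    # 9.0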

Dennis: What's the good of all this expected utility stuff? I mean, what's it do?

Eileen: Along with Bayes' Theorem, it's one of the two basic equations for doing things.

Dennis: Doing what?

Eileen: Anything. Think of the two equations together as a way to tie a knot in reality. If you physically implement a system with the structure of both equations, you can alter the probability flow through a subspace of configuration space, steering reality into a set of states bound to the utility function of the knot.

Autrey (puzzled): That's poetic, but I can't quite see the motivation for describing it that way.

Cathryn: I don't think I understood that at all.

Eileen: It's poetic and it doesn't invoke words used to describe cognition, which would invoke empathy, which would mess up your understanding, because this is not a human and you can't empathize with it. It doesn't work like you do. I'm specifying a pure set of causal dynamics here, and the question then becomes what those causal dynamics actually do, not what anyone might hope or wish they do. I do not want you putting yourself in the shoes of this knot tied in reality. I want you to understand the knot as math, the way you'd understand the binomial theorem or any other piece of math. The expected utility theorem says that there's a linear ordering over a certain set of items; a linear ordering represented by a measure which, for each item, is the expectation - the probabilistically weighted average - of another function, given that item of the set. The linear ordering can be physically implemented as a linear order over preferences between actions. The function whose expectation is being computed is what we call the utility function, and it determines where the future gets steered. That's half of the knot in reality. The other half is Bayes' Theorem, or rather Bayesian probability theory, to get an estimate of the conditional probabilities from actions to outcomes. If you don't want to be poetic, call it an optimization process.

Cathryn: Okay, but why is this an optimization process?

Eileen: Think of a thermostat. Building a thermostat is very easy and can be done without any kind of computing circuitry; all you need is, say, a bimetal coil whose curve changes depending on the temperature, as one metal shrinks (or grows) faster than the other. Then you set two pegs in the thermostat. If the temperature indicator crosses one peg, it turns on the heat. If the temperature indicator crosses the other peg, it turns on the air conditioning. The end effect is to keep something - the thing whose temperature is being measured - within a set temperature bound.
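(A minimal thermostat loop, as a sketch of the two-peg device Eileen describes. The temperature dynamics are invented; the only point is that the loop drives the measured quantity into a target band no matter where it started.)

    def thermostat_step(temperature, low_peg=18.0, high_peg=24.0):
        # Crossing one peg turns on the heat; crossing the other, the air conditioning.
        if temperature < low_peg:
            return "heat"
        if temperature > high_peg:
            return "cool"
        return "off"

    temperature = 35.0   # arbitrary starting state
    for _ in range(50):
        action = thermostat_step(temperature)
        if action == "heat":
            temperature += 1.0
        elif action == "cool":
            temperature -= 1.0
        # (No drift term in this toy model.)
    print(temperature)   # ends up inside the 18-24 band regardless of the starting state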

Cathryn: Okay, thermostat, turns on air conditioning or heat depending on the temperature, check. But why is that an optimization process?

Eileen: It steers the future into a particular state, or rather, volume of states. Suppose you see a coin that can show either heads or tails. You, in turn, can decide to take the action "turn the coin over" or "leave the coin alone". You want the coin to show heads. Do you turn the coin over or leave the coin alone?

Cathryn: Is the coin currently showing heads or tails?

Eileen: That depends on which universe you're in. Hold on a second and I'll fork reality.

Cathryn: Okay.

Eileen1: The coin is showing heads.

Cathryn1: I'll leave the coin alone.

Eileen2: The coin is showing tails.

Cathryn2: I'll turn the coin over.

Eileen: Okay, the coin is now showing heads in both branches of reality. Before it was showing either heads or tails. If an optimization process calculates the desirability of each of a set of actions using the expected utility equation and sufficiently accurate conditional probabilities, and implements the action that ranks highest in the preference ordering - that is, the action with the highest desirability - then the effect is to selectively steer the future into states to which the utility function assigned higher utilities while doing the expected utility calculation. This is a physical property of the optimization process. It doesn't have to be viewed from an intentionalist perspective. You don't need to attribute motivation, or rather it doesn't make a difference if you attribute motivation or not. So long as the physical system contains elements that correspond to sufficiently accurate conditional probabilities, and undergoes dynamics that sufficiently resemble the structure of the expected utility equation, the future will in fact be steered.
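(A sketch of the coin example as a literal expected-utility chooser. Whichever branch of reality the coin starts in, the same dynamics select the action with the highest D(a), and the future ends up in the heads state. The utilities are arbitrary placeholders.)

    def expected_desirability(action, state, utility):
        # With a deterministic coin, P(x|a) is 1 for the predicted outcome and 0 otherwise.
        outcome = {"leave alone": state,
                   "turn over":   "heads" if state == "tails" else "tails"}[action]
        return utility[outcome]

    utility = {"heads": 1.0, "tails": 0.0}   # the knot's utility function

    for starting_state in ("heads", "tails"):
        best = max(("leave alone", "turn over"),
                   key=lambda a: expected_desirability(a, starting_state, utility))
        print(starting_state, "->", best)
    # heads -> leave alone, tails -> turn over: both branches steered into heads.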

Cathryn: Does anyone remember what we were discussing before we started talking about expected utility?

Autrey: Hold on, I'll do a flashback.

Cathryn: Sounds like a lot of silly philosophizing to me. Does it really matter whether I'm considered a "motivational solipsist" or whatever, as long as I actually help people?

Bernard: That's just it! It doesn't make any operational difference - all goal systems operate to maximize their internal satisfaction, no matter what external events cause satisfaction.

Eileen: That's not true; it does make an operational difference. If Autrey values the solipsistic psychological event of knowing he saved ten lives, he will never sacrifice his own life to save ten other lives; if he values those ten lives in themselves, he may. You told him that, remember?

Bernard: Well, I guess Autrey might value the instantaneous happiness of knowing he chose to save ten lives, more than he values all the happiness he might achieve in the rest of his life.

Cathryn: That doesn't sound anything remotely like the way real people think. Square peg, round hole.

Autrey: Do you have anything new to contribute to the debate, Eileen? It's a pretty ancient issue in philosophy.

Eileen: The basic equation for a Bayesian decision system is usually phrased something like D(a) = Sum U(x)P(x|a). This is known as the expected utility equation -

Cathryn: Right, that's where we were. So what did you plan to say, Eileen?

Eileen: That when Bernard said: "I guess Autrey might value the instantaneous happiness of knowing he chose to save ten lives", he was describing Autrey computing D(a). And Autrey was putatively computing this "instantaneous happiness", that is to say, D(a), by valuing the lives of ten others more than his own life. So this putative Autrey, insofar as he can be viewed as a kind of knot, will steer the future away from states where he lives at the cost of ten other lives, and into states where he sacrifices himself to save the ten.

Bernard: See? There you go! Autrey is doing it all for his own sake! He's doing it for the sake of the desirability, the D(a), not for the people he saved!

Autrey: Ah, now I see it. Bernard, you're making a disingenuous argument. The "referent" of an optimization process, if an optimization process has a referent at all, isn't in the D(a) that defines the preference ordering over options. It's in the U(x), the utility function of the optimization process, the thing that defines which futures the universe gets steered into. When you talk about what people "want", you're employing an anthropomorphism to the particular kind of optimization processes that are human intelligences. If you look beneath the surface of things, to what an optimization process really is, concepts such as "wanting" drop away, and the only question is what the optimization process does. The relation of U(x) to D(a) is how it happens. I can imagine an optimizer that steers the future into states where the optimizer has a particular internal state, or where the optimizer survives, or where the optimizer gets bigger, or whatever, but I'm not an optimizer like that. I choose between actions using a D(a) computed from a U(x) that assigns greater utility to futures where ten people live and I die than the converse. So I'm acting as an optimization process with an altruistic referent. When you talked about experiencing "instantaneous happiness" at the thought of ten people living, you were talking about an optimization process that functions exactly the way it should. All optimization processes act on their "instantaneous happiness", as you have described that mathematical quantity, and in my case that quantity is computed in such a way as to make the referent of the optimization process the survival of people outside myself.

Eileen: No.

Autrey: No?!

Eileen: It was a good try, but no.

Autrey: ...okay. What am I doing wrong?

Eileen: First, you're human. You do want things, as you understand "wanting". You have representations of abstract moral concepts, emotions, instincts you don't understand, a picture of who you want to be, and when I look at you none of those things "drop out" of what I see. A simple optimization process that tiles the universe with paperclips is a terribly alien and lethal thing, but it can be understood if you set aside all intuitions about what it "might" do or what you want it to do, and study the dynamics as dynamics until you are ready to defend a statement about what it does do. You, Autrey, you are not one of those simple knots. You are more complicated. A knot that tiles the universe with paperclips cannot be understood by analogy with any kind of human, selfish or altruistic. Nor can a human be understood by analogy with a paperclip-tiling knot.

Autrey: But it looks to me like the knot analogy does explain the "referential" aspect of goals.

Eileen: It explains part of it. Not the whole thing. You were the one complaining about that, remember?

Autrey: ...true.

Eileen: Beware of putting too much faith into the expected utility equation - it's not quite as elegant as Bayes' Theorem. Expected utility is not as universal as it seems, especially if you try to apply it descriptively, to existing systems like humans. That's why I speak of "having utility-structure", or perhaps "having optimization-structure" would be a better term at this point.

Autrey: Okay.

Eileen: Second, there's a major difference between acting "to maximize the expected utility of future states" and acting for the sake of "saving people's lives", even if the apparent short-term effect of the knot is the same in either case - to steer the future into states where others live and you have sacrificed your own life. Remember how we talked about a banana laced with fast-acting Ecstasy II?

Autrey: What about it?

Eileen: What does the Ecstasy II do to you as an optimization process? Do you, contemplating now the possibility of taking Ecstasy II, find that prospect to be desirable?

Autrey: The Ecstasy-laced banana produces... artificial utility?

Bernard: "Artificial utility?" How can there possibly be such a thing? It looks to me like the whole theory breaks down at this point. If you start talking in that kind of language, whatever you're doing, it can't be math. It would be like sneaking up on Pascal's Triangle and inserting an extra "10" in the fifth row.

Eileen: The Ecstasy II produces an effect on the physical substrate of the optimization process which causes the process to do something different. This isn't the same as changing the math. Let's say I have a calculator which calculates 2 + 2 = 4. The calculator does this so well, so reliably, that anyone dealing with the calculator tends to forget that the calculator is really a physical system, and thinks of the calculator as directly embodying the arithmetic - as if the calculator is, itself, the arithmetic. Then along comes a packet of cosmic radiation and flips one of the bits, so that the calculator says 2 + 2 = 5. If you keep track of the difference between the math question and the physical system implementing the math question, then the packet of cosmic radiation doesn't change the actual answer to the question, "What is 2 + 2?" Instead the radiation packet perturbs the physical system so that it doesn't implement the original math question anymore.
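(A toy version of the cosmic-ray story: the abstract question "what is 2 + 2?" is unchanged, but the physical register holding the answer gets a bit flipped, so the perturbed system no longer implements the original question.)

    def calculator_add(a, b, flipped_bit=None):
        result = a + b                         # the math the device is supposed to embody
        if flipped_bit is not None:
            result ^= (1 << flipped_bit)       # a cosmic ray perturbs the physical register
        return result

    print(calculator_add(2, 2))                 # 4 -- the intact calculator
    print(calculator_add(2, 2, flipped_bit=0))  # 5 -- the same question, a different machine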

Bernard: So what happens if you give Ecstasy II to a generic optimization process?

Eileen: There isn't any equivalent of Ecstasy II for a generic optimization process. It's like trying to invent an analogue of Ecstasy II for toaster ovens.

Bernard: I don't see why. Let's say that we have an optimization process that computes using chemistry, like the brain, and we introduce a reagent that disturbs the chemistry. Or we can say there's an optimization process implemented in electromagnetic fields, and we introduce an external magnetic field. The end result is to alter the part of the process that physically implements the evaluation of U(x). Like, we'll start with the paperclip optimizer, and suppose that the paperclip optimizer only likes iron paperclips, and suppose the external physical effect alters U(x) so that it can be satisfied by any internal representation of a future that contains iron crystals, instead of requiring iron crystals formed as paperclips. Well, there's a lot more iron than paperclips in the universe! So the paperclip optimizer suddenly gets hugely more satisfied, blissed out.

Eileen: The expected utility equation has no analogue of the human quality of "satisfaction" or "happiness", much less "blissed out".

Bernard: How does a utility optimizer know when it's achieved a goal, then?

Eileen: A utility optimizer has no need to know whether it has achieved a goal. No, even that statement is too anthropomorphic. The expected utility equation only specifies how present actions steer the future toward volumes of configuration space with higher assigned utilities. When something actually happens, here-and-now, that has high utility, that quantity does not appear in the expected utility equation. The expected utility equation is concerned with the future alone. Even "concerned with" is too anthropomorphic - I should say that the prospective future alone contributes to calculating the ordering over preferred actions. There is no analogue of happiness or satisfaction.
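
A minimal sketch of the bare calculation Eileen keeps gesturing at, with invented actions, probabilities, and utilities; notice that nothing in it looks back at what has already happened.

    # Toy expected-utility action selection.  All actions, outcome distributions,
    # and utilities below are made up for illustration.

    def expected_utility(outcomes):
        # outcomes: list of (probability, utility) pairs for one action
        return sum(p * u for p, u in outcomes)

    actions = {
        # action: predicted distribution over future states
        "make paperclips": [(0.9, 10.0), (0.1, 0.0)],
        "do nothing":      [(1.0, 1.0)],
    }

    # The ordering over actions is driven entirely by the prospective future.
    best = max(actions, key=lambda a: expected_utility(actions[a]))
    print(best)   # "make paperclips"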

Autrey: It achieves, but does not know it has achieved.

Eileen: If it knows it has achieved, it does not care except insofar as the fact affects the execution of future plans. Nor has it any interest in knowing whether it has achieved, unless the fact is relevant to the execution of future plans.

Autrey: "Interest in knowing?"

Eileen: "Interest in knowing": Instrumental worth attached to the event of discovering a fact, derived from the expectation that knowledge of the fact will be useful in executing future plans. The phenomenon of instrumental worth is directly emergent in the expected utility equation as already described. This includes the instrumental worth of knowledge. Let's say that I must choose between boxes A and B, one of which contains a million dollars. And suppose there's a coin whose showing face correlates knowably with the boxes; heads is A, tails is B. The expected utility equation then assigns greater desirability to the action of looking at the coin. I'm not really phrasing this well in English, but the math works.

Bernard: Supposing that a generic optimizer does not need happiness, why are we capable of happiness?

Eileen: Until very recently in our evolutionary history, we embodied even less of the structure of the expected utility equation than we do now, if you can imagine. It is only chimps and humans that can do backwards chaining in the key-box test. It is a very recent innovation to have animals with combinatorial imaginations, animals that can visualize and evaluate the conditional probabilities that the expected utility equation runs on. Before the expected utility equation, natural selection coughed up reinforcement - repeating the action that worked last time. This, too, steers the future, though less efficiently. The reinforcement architecture is far older, evolutionarily, than any attempt to compute expected utility. A reinforcement architecture does need to know when pleasant things happen, so that the actions most recently taken can be reinforced.
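
A toy version of the reinforcement rule Eileen is describing, with invented actions and payoffs; there is no imagination or extrapolated future anywhere in it, just "strengthen whatever was done last."

    # Minimal reinforcement sketch: repeat the action that worked last time.

    import random

    weights = {"pull lever": 1.0, "press plate": 1.0}   # action propensities

    def choose():
        total = sum(weights.values())
        r = random.uniform(0, total)
        for action, w in weights.items():
            r -= w
            if r <= 0:
                return action
        return action

    def reinforce(action, reward, rate=0.5):
        # Strengthen (or weaken) whatever was just done, according to its payoff.
        weights[action] = max(0.01, weights[action] + rate * reward)

    for _ in range(100):
        a = choose()
        reward = 1.0 if a == "pull lever" else -0.5   # the lever happens to pay off
        reinforce(a, reward)

    print(weights)   # the rewarded action ends up dominating future choices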

Autrey: And happiness makes you more likely to repeat the action that worked last time? It seems to me that happiness has a somewhat wider role in cognition than that. We anticipate happiness, work toward it.

Eileen: We, as humans, and our cousins the chimps, can imagine wholly new actions and visualize their results. Yet that is only the very latest addition to the system. We are still built around the legacy reinforcement-based architecture that existed for tens of millions of years before primates came along. It's only natural that our recently constructed general intelligence tends to run on expected happiness. But from the perspective of a pure expected utility equation, any success or failure that has already happened is a sunk cost or a sunk triumph, no longer relevant to steering the future. The closest thing to reinforcement emerges when a probabilistic strategy succeeds or fails on the first trial and the expected probability of success on the second trial is increased or decreased according to Bayes' Theorem. There is no happiness there, nor sadness. The expected utility equation does not represent them.
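
The "closest thing to reinforcement" that the pure equation allows might be sketched like this, assuming a uniform prior over the strategy's success probability.

    # Beta-Bernoulli update: expected success probability before and after one trial.
    # The uniform Beta(1, 1) prior is an assumption chosen for illustration.

    def expected_success(successes, failures, prior_a=1, prior_b=1):
        return (prior_a + successes) / (prior_a + prior_b + successes + failures)

    print(expected_success(0, 0))   # 0.5   -- before any trials
    print(expected_success(1, 0))   # 0.667 -- the strategy succeeded once: expectation rises
    print(expected_success(0, 1))   # 0.333 -- the strategy failed once: expectation falls
    # The numbers shift, and future actions shift with them, but nothing in the
    # update corresponds to feeling good or bad about the outcome.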

Dennis: That prospect is emotionally uncomfortable for me to contemplate. Maybe the generic optimization process would magically decide for no particular reason to alter its own substrate so that it becomes less efficient yet more analogous to a complex accretive legacy architecture with the unique signature of incremental natural selection. That way I could empathize with it.

Autrey: Wait, you think the generic optimization process would alter itself so that you could empathize with it?

Dennis: No, when I say "That way I could empathize" I don't mean that's the reason it would happen. I'm saying that's why I picked and privileged this specific assertion for rationalization as something that generic optimization processes supposedly do of their own accord.

Autrey: Ah, gotcha.

Cathryn: I'll be honest and say this is making me uncomfortable too. It seems to me that the paperclip optimizer you're describing is... cold.

Eileen: A paperclip optimizer is neither warm nor cold. It is a knot in the probability flow, a physical subsystem with the structure of the expected utility equation, which generates actions that bias the future toward states containing more paperclips.

Cathryn: That's even more cold. When I try to imagine that I feel cold. Deathly cold.

Autrey: Maybe that's just how we humans feel when we contemplate things without any warmth in them.

Cathryn: I also don't like the description of happiness as nothing more than an artifact of a reinforcement architecture that makes recent actions more likely to be repeated.

Eileen: Well, there's the issue of exploration and credit assignment. In fact, I have a feeling that accreting adaptive complexity onto exploration and credit assignment was what led incrementally to the design of cognitive subsystems that could support imagination and anticipation.

Cathryn: No, you're missing my point. When I feel happy, I don't just say "I'm going to generate that action again." There's more to feeling happiness than that! It seems to me that there's something special and powerful and worthwhile about happiness, something that wouldn't appear if you set up a narrow AI that used a reinforcement system. I mean, people have already built artificial neural networks that use reinforcement architectures; they're not happy! That's why I'm not comfortable with your description of happiness. It tries to explain away something that seems important and precious to me. It leaves out the beauty! And all the different kinds of joy; where are they in your description?

Autrey: There's all the difference in the world between explaining something and explaining something away. John Keats got confused about that too: "Philosophy will clip an Angel's wings, conquer all mysteries by rule and line; empty the haunted air, and gnomed mine -- unweave a rainbow." Well, the angels have been banished, the air exorcised, the mine ungnomed, if by these things you mean that the fog in our minds has been lifted and the naked truth laid open to our senses; but the rainbow is still there! Keats was complaining about Newton's prism, but I'll bet diamonds to doughnuts he couldn't do the math. Reducing something to physics only seems like a blow to your heart if you don't understand the reduction, if "physics" is a mysterious opaque black box about which you know only that it can't contain anything of value. It's like saying, "Well, we used to think there were rainbows, but now we know that there are only water droplets scattering photons." No. What happens is that you look at water droplets scattering photons, do some calculations, and suddenly you see the rainbow for the first time. Everything you saw before is still there. The physics is added to the understanding, it does not replace it. It's not that you see that the rainbow is merely water droplets. You do the equations, you get the deep understanding, so that the physics is not an opaque, dead, dull box to you; and suddenly your breath catches, and with a surge of excitement you see that the water droplets are the rainbow!

Eileen: What he said.

Cathryn: What did he say, exactly?

Eileen: I think Autrey's point is that if I try to explain something precious and important, like happiness, in words that seem dull and lifeless, it may be that I'm leaving out something terribly important. It may also be that I haven't explained the math deeply enough for someone to see what I'm pointing to.

Cathryn: Well, Bayes' Theorem did seem dull and lifeless to me until I read through that [whole long essay] and saw how it worked.

Eileen: Unfortunately I don't have time to write an equivalent page on the expected utility theorem, which is the other half of rationality.

Cathryn: Are you telling me that reinforcement is really happiness? I'm not sure I believe you, Eileen, but I'm willing to give you the benefit of the doubt. But if so, Eileen, you have to be ready to claim that you understand what happiness is, including the reason that I see something unutterably beautiful about it. Because otherwise I fear that you've left something out, something terribly precious.

Eileen: Human beings are, as ever, complicated. I wish to defer the question to a later time.

Cathryn: If you think the question needs to be deferred until later, fine. But you have already said certain things and those things have already made their impact upon my mind. It seems to me that to explain happiness as reinforcement is cold.

Eileen: My explanation of happiness was too hurried, and that is just where the confusion between explaining and explaining away sets in - when you're told that X explains Y but you can't quite see how, and they seem like incommensurate properties. Like... "physics" is lifeless, a rainbow seems warm; so if you are told that a rainbow is really "physics" - not shown understandable equations, just flatly told that the rainbow is "physics" - that's where the draining effect, the "unweaving of the rainbow", the illusion, sets in.

Cathryn: If I hold my thumb over a garden hose, I can make my own rainbow; I can see for myself that it is a property of water droplets. Even if I don't know the equations I see it. But if I make my own narrow AI operating on a reinforcement neural architecture, there is no happiness there. That's another reason to doubt your explanation.

Bernard: Hold on a second! How do you know that an AI with a reinforcement architecture isn't "happy"? Is there some test you perform to determine that?

Autrey: Bernard, answer honestly: Do you personally believe that any program with a reinforcement architecture experiences happiness when it receives positive reinforcement?

Bernard: Personally? No. I'm just pointing out that Cathryn is reasoning by reductio ad absurdum to a conclusion she does not actually know to be wrong.

Cathryn: Maybe. Even so, I want to hear Eileen's answer.

Eileen: When I talk about reinforcement architecture as something that evolutionarily predates expected utility, I don't propose that happiness is reinforcement, any more than I propose that human decisionmaking is expected utility. What I'm saying is that happiness has reinforcement-structure in it, and that's why it works. Just as human imagination and will have utility-structure, and therefore work to steer the future; just as the human practice of rationality or the scientific method contains Bayesian structure, and therefore finds truth. It is another of those necessary but not sufficient things. Happiness has reinforcement-structure, because that's a simpler optimization process for natural selection to stumble over; no reinforcement, no happiness. Since happiness implies a reinforcement architecture, the lack of a reinforcement architecture implies the lack of happiness. I proposed the absence of a reinforcement architecture as a reason to expect that happiness would be absent. It doesn't mean that a reinforcement architecture is automatically a reason to expect that happiness, as we humans know it, would be present. Surely you cannot think that anyone could understand happiness without understanding reinforcement! Neither the cognitive function, nor the evolutionary selection pressure, would make any sense. That's all I'm saying. If P then Q, therefore, if not Q then not P; but it does not follow that if Q, then P -

Cathryn: Okay! I get the point.

Eileen: And it's not surprising that human happiness should be complicated. Natural selection does that. There are many emotions linked into happiness -

Bernard: Is not happiness itself an emotion?

Eileen: Sure. Who says that emotions can't link into each other? I'm just saying... in fact, let me start over. (Takes breath.) The reinforcement architecture is where the most ancient antecedents of happiness began, ever so long ago, before ever the great lizards walked the Earth. I'm not saying, happiness is merely this or merely that. I'm saying, this is why it began. This is how natural selection, which gives no care at all to moral philosophy, spontaneously produced minds with the quality we name "happiness".

Cathryn: And you don't think that knowing this detracts from... well, from the charm of happiness?

Eileen: The only answer I can think to give is a flat "No."

Cathryn: Why not?

Eileen: Because I can see happiness! I know how it works, and why it's there, and because I see it clearly, I know what it means to me. I have sought knowledge of the mind, and found rather a good deal of it; and nonetheless it seems to me that all the things I once thought were beautiful are still beautiful, only now the fog has blown away and I can see them better.

Autrey: I see a possible damaging effect here of partial explanations of beautiful things. It's an intrinsic hazard of the dead, dry words of nontechnical writing.

Dennis: It's a hazard of nontechnical writing? I'd think that technical writing was far more likely to drain the life out of something.

Autrey: To you the equations seem dead numbers, and the poetry of popular writing seems warm and alive. But all the warmth and light are in the equations, as beautiful and precise as starlight. Popular writing that omits the math wanders aimlessly around the truth, firing off scattered salvos that, at best, land somewhere in the rough vicinity. No wonder that popular writing often seems only to be explaining away. What if "Bayes' Theorem" were only dead algebra to you, and I said that it lay at the core of rationality? Would rationality have been explained away as merely the operation of "Bayes' Theorem"? How sad, and it seemed so interesting before then.

Cathryn: So you're saying that if I, I don't know, studied reinforcement neural networks for a while, I would stop being scared that happiness is just reinforcement?

Eileen: Reinforcement neural networks aren't happy; so, no. Studying reinforcement neural networks would not necessarily enable you to see happiness clearly. It might help. But it wouldn't be sufficient.

Autrey: The problem here is the word just. Happiness is not "just" anything! It fits into the universe, has dynamics, has an evolutionary reason for being there - so does everything else in the mind! The problem is this idea that only mysterious things can have any value, so to explain the cause of something is automatically to drain the life force out of it. But there is no mystery! There is never any mystery! All confusion exists in the mind, not in reality.

Cathryn: It seems to me that there is something of an explanatory gap between physics and that unutterably precious quality of happiness that makes it more than reinforcement in a neural network.

Dennis: Oh no. Not the explanatory gap again.

Autrey: I don't know all the answers. But whatever is causing that apparent explanatory gap, it has to be confusion on your part. It is not going to be resolved by modifying physics to include ineffable happiness particles. If you see what looks like an explanatory gap, it doesn't mean that there's a magical ineffable stuff that plugs the explanatory gap. All confusion exists in the mind, not in reality - it is futile to expect something that corresponds to your confusion. An explanatory gap is a place where you are so deeply confused that your mind perceives an "impossible question", like the impossible landscapes of Escher paintings. Explanatory gaps are not solved by filling the gap; the apparent gap goes away when the deep confusion is resolved.

Dennis: No one has ever resolved this question and no one ever will and I don't think it's productive to discuss it. Maybe in a thousand years humanity will figure it out. Until then, it's just not practical to spend the time. Can we go back to the analogue of Ecstasy II for paperclip optimizers?

Bernard: Sure. Okay, suppose that I feed iron-crystal-utility-enhancing-drug to the physical system that used to be a paperclip optimizer. And let's suppose, for the moment, that there is no analogue of "happiness" in the paperclip optimizer. What happens?

Eileen: You don't need to suppose there is no analogue of happiness; just look at the expected utility equation. Anyway, I would say that, on your scenario - a chemical modifying the evaluation of the utility function - then the paperclip optimizer would be transformed into an iron crystal optimizer. What else?

Bernard: And when the drug is withdrawn?

Eileen: The iron crystal optimizer goes back to being a paperclip optimizer. In both cases we are supposing that the optimizer has no power to stop you from tampering with its goal system. Otherwise the paperclip optimizer would resist administration of the drug and transformation to an iron crystal optimizer; and the iron crystal optimizer would resist withdrawal of the drug and reversion to a paperclip optimizer.

Bernard: Why? I would think the paperclip optimizer would be happy to be transformed into an iron crystal optimizer. The future would then have a much higher expected utility.

Eileen: No, a paperclip optimizer would resist, with all its power, being transformed into an iron crystal optimizer. The future would then contain a much smaller number of expected paperclips.

Autrey: Aha!

Cathryn: "Aha" what?

Autrey: Eileen claimed there was a major difference between acting "to maximize the expected utility of future states" and acting for the sake of "saving people's lives", even if the apparent effect of the optimization process is the same in either case, i.e., to steer the future into states where others live and you have sacrificed your own life. I was trying to figure out what the heck she meant by that, because no matter how hard I looked, they seemed like the same thing. But now I think I get it.

Eileen: Exactly! The formal difference between the two cases arises when the optimizer models itself as a part of the universe. Conventional treatments of expected utility treat the agent as hermetically sealed from the universe. But in reality the agent is embedded in the universe, a continuous part of the universe, making the agent potentially capable of self-modifying actions - actions which directly impact the internal state or dynamics of the optimizer. In other words, the optimizer itself is another part of the universe, which the optimizer's actions can affect and the optimizer's beliefs can model.

Autrey: ...okay, maybe I don't get it. What does the possibility of self-modification do?

Eileen: The elegance of the math is destroyed or rather Godelized, because the optimizer's representation of the universe now needs to include the optimizer itself, which can be done in any number of ways, all of them imperfect.

Autrey: Okay, here's what I thought you were saying. If an optimization process conceives of itself as "maximizing the expected utility of future states", and it sees a self-modifying action that increases the expected utility of future states - just the expected utility - it will take that action. But if an optimization process conceives of itself as "maximizing the expected number of paperclips in future states", it will avoid any self-modifying action that would lead it to make fewer paperclips in the future. If the optimization process can't self-modify, it behaves the same under either architecture, even if it conceives of itself as maximizing "utility" rather than "paperclips", since the only way it can possibly get utility is by making paperclips. I mean... I thought that's what you were saying, and that it was the reason you brought up Ecstasy II at that point... isn't this where we were heading?

Eileen: No.

Autrey: ...

Eileen: You're still trying to understand the paperclip optimizer by invoking human empathy on it, thinking of it as a "mind".

Dennis: It's not a mind?

Eileen: A human being is evolved to model minds that work in a certain way. I mean, you're talking about an optimization process "conceiving of itself" - you have a self-image. Does a generic optimization process need a self-image? Is it the same kind as yours? You've got a moral philosophy, a concept of your own purpose that you try to follow. And then in the picture you were drawing, Mr. Paperclip conceived of an action, and compared that action with its self-image as either a "utility maximizer" or a "paperclip producer"... The expected utility equation has been left behind in all this. Mr. Paperclip would practically have to be a Friendly AI just to fail in the way you imagined it.

Autrey: Fine.

Eileen: In particular, you talk about "getting utility", as if utility was a reified quantity, a substance that gets pushed around the mind, like the way humans think of happiness. Like, suppose Mr. Paperclip has a legacy reinforcement architecture, and that Mr. Paperclip is hardwired - beyond its own ability to alter - so that happiness comes from all paperclips and only paperclips, not just paperclips in its own presence, but any paperclips that it thinks exist. If we rule out all self-modifying actions, it won't matter whether Mr. Paperclip maximizes expected happiness or expected paperclips, since the only way to get happiness is through paperclips.

Bernard: Actually, it does matter. If Mr. Paperclip maximizes expected happiness, it will never be rational for Mr. Paperclip to sacrifice its own life even to produce a quadrillion paperclips - assuming those paperclips are created after its death.

Eileen: Aha! But we ruled out self-modifying actions, and if you look closely, you'll see that my definition of a self-modifying action includes sacrificing your life. "Actions which directly impact the internal state or dynamics of the optimizer." Well, destroying yourself - modifying yourself so severely as to cease to be an optimization process - is certainly a very severe kind of self-modification! Destroying yourself also breaks the rule that all paperclips, and only paperclips, are transformed into happiness - or rather, I should say, the rule that all expected paperclips, and only expected paperclips, are transformed into expected happiness. Since, if you imagine destroying yourself, you won't imagine translating those paperclips into happiness.
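
A crude numerical rendering of the contrast Bernard raised, with invented outcomes; "happiness" here is just the hardwired paperclips-to-happiness conversion stipulated above, which stops when the converter is destroyed.

    # Invented outcomes: paperclips produced, and whether Mr. Paperclip survives
    # to represent them.  All numbers are made up for illustration.

    outcomes = {
        "keep working":     {"paperclips": 1_000,                 "survives": True},
        "sacrifice itself": {"paperclips": 1_000_000_000_000_000, "survives": False},
    }

    def u_paperclips(o):
        return o["paperclips"]                      # cares about paperclips, period

    def u_happiness(o):
        # Paperclips convert to happiness only while there is a self to convert them.
        return o["paperclips"] if o["survives"] else 0

    for name, o in outcomes.items():
        print(name, u_paperclips(o), u_happiness(o))

    # The two utility functions order these actions differently -- exactly the
    # case where self-modifying actions (here, self-destruction) break the
    # equivalence between maximizing expected happiness and expected paperclips.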

Bernard: Aren't you just quibbling over definitions?

Eileen: No. I said that the Godelian mess begins when we try to figure out how the optimization process represents the subsystem that is itself in its representation of the universe. Realizing that it is possible for you to die, or destroy yourself, requires representing yourself as a part of the universe. You must also represent yourself as a part of the universe to realize that actions, or external effects like a Bernard, can modify the part of you that implements your utility function, and that this will result in your future self taking different actions. And even here I am anthropomorphizing, because of the word "you".

Autrey: The word you is an anthropomorphism? I mean, the optimization process, if it wants to represent itself within the universe, has to refer to itself somehow... right?

Eileen: But not necessarily in the way humans do. The expected utility equation, as it stands, doesn't deal with how actions are spread out over time. For example, let's say that I want to obtain a tasty apple that I can only obtain by pressing two buttons, in sequence. There's no point to pressing the first button unless I also press the second button. And of course, there is no reason to press the second button if the first button has not already been pressed. Is this not an insuperable obstacle?

Cathryn: It would be an extremely difficult obstacle - a very high improbability requiring many attempts to accidentally hit on the right answer - to natural selection, which is incapable of handling simultaneous dependencies; also to any animal less intelligent than a chimpanzee.

Eileen: The expected utility equation, taken as it stands, will only assign a high desirability to pressing the first button if the extrapolated world-model can predict that the optimizer itself will also press the second button. So extrapolating the future can't just consider the environment. It has to consider the actions of the optimizer itself.

Bernard: Sounds like a halting problem. No mind can predict itself.

Autrey: I seem to have very little trouble predicting that, after I go to the kitchen, I'll get a drink, even though it's pointless to go to the kitchen unless I can successfully predict I'll get a drink.

Cathryn: You are more fortunate than I. I can never remember why I'm in the kitchen.

Bernard: Why do you go to the kitchen, if you know you won't remember why once you're there?

Cathryn: I'm not really sure. Maybe because I fail to predict I'll forget?

Autrey: This problem can be brute-forced with enough computing power. Let's say you foresee ten possible actions A1-A10, each of which is followed by ten possible actions A1.B1 to A10.B10, for a total of a hundred possible two-action combinations. What you do is imagine taking each of the actions A1 to A10, then predict the situation you'd be in after that. Let's say you're imagining future A3. There'd be a spectrum of possibilities for what happened to you after you took action A3. You'd consider each of those possibilities, weighted by the probability you assigned it. Then, for each of those possibilities, you consider the actions available to you, and their probable payoffs, and calculate the expected utility to your expected future self of actions A3.B1 to A3.B10. You calculate the spectrum of probabilities resulting from, say, A3.B4, run your utility function over each possible outcome, assign a utility to each possible outcome, and multiply by the probability of that outcome. So now you have the expected utility you would give A3.B4 if you were actually confronting that situation. It turns out that A3.B4 has a higher expected utility than A3.B5, A3.B6, and so on. So you predict that, subsequent to A3 and whatever possible consequences of A3 you're currently evaluating, you'll pick A3.B4. This gives you an accurate prediction of which option you'd pick, in that situation. You know that, once you're in the kitchen, you'll take a drink. You then evaluate the expected utilities of the actions A, knowing which particular action B you'd pick in each of the possibilities eventuating from A, and evaluating the probable consequences of B. You can get an accurate prediction of your own choice by running the same computation ahead of time, now, that you will run in the future. Once you know which choice you would make, you can get a probabilistic model of the entire future, including your own future actions, and use that model to evaluate the expected utility of the actions you take now. You know it makes sense to go to the kitchen because you know that once you're in the kitchen you'll take a drink of water. And you can extend out the decision tree as far ahead as desired. For example, to see ten turns ahead, with ten possibilities per action and ten actions per possibility, would then require a mere hundred billion billion extrapolations.
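
A minimal code sketch of the brute-force procedure Autrey just described, for the two-round case; the kitchen world, its probabilities, and its utilities are invented stand-ins.

    # Two-round brute-force lookahead: predict your own second-round choice by
    # running, now, the same expected-utility computation you will run then.

    U = {"drink in hand": 10.0, "standing in kitchen": 1.0, "on couch": 0.0}

    # First-round actions: list of (probability, resulting situation)
    WORLD_A = {
        "go to kitchen": [(0.9, "standing in kitchen"), (0.1, "on couch")],
        "stay on couch": [(1.0, "on couch")],
    }
    # Second-round actions available in each situation
    WORLD_B = {
        "standing in kitchen": {"pour a drink": [(1.0, "drink in hand")],
                                "wander back":  [(1.0, "on couch")]},
        "on couch":            {"sit there":    [(1.0, "on couch")]},
    }

    def eu(outcomes, value):
        return sum(p * value(s) for p, s in outcomes)

    def predicted_second_round_value(situation):
        # Prediction of your own future choice = the same argmax you will run then.
        options = WORLD_B[situation]
        return max(eu(outcomes, U.get) for outcomes in options.values())

    def choose_first_action():
        return max(WORLD_A, key=lambda a: eu(WORLD_A[a], predicted_second_round_value))

    print(choose_first_action())
    # "go to kitchen" -- because you predict that, once there, you will pour a drink.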

Bernard: That sounds very wasteful of computing power.

Autrey: Indeed so. I'm just pointing out that we know how to brute-force the problem - we know the full form of the question we're trying to approximate cheaply. Good-old-fashioned decision trees.

Eileen: No.

Autrey: Now what?

Eileen: You've given a very nice mathematical definition of a brute-force optimization process that's stretched out over time, an optimizer that takes coordinated, purposeful actions over multiple rounds - providing that the optimization process's internals are hermetically sealed from the rest of the universe, as the standard analysis assumes. Your optimizer predicts its own actions by evaluating its utility function over the foreseen choices and using that computation as its prediction of its own action. Now suppose that the optimization process considers a possibility in which Bernard feeds it Ecstasy II - creates an electromagnetic field, doses it with a chemical, whatever. Then what?

Autrey: The optimizer would foresee itself... no, wait...

Eileen: If we built an optimizer that worked exactly according to your definition, the optimizer would go on using its current utility function to predict its actions after the administration of Ecstasy II. Once Ecstasy II had been administered, the optimizer would then use its iron-crystal-valuing utility function to predict all its future actions, even if the Ecstasy II only lasts an hour. The optimizer might begin laboring over a great interstellar iron-crystal manufactory, only to predictably abandon it an hour later. According to the mathematical definition you gave, anyway.
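
A sketch of the failure mode under Autrey's definition, with invented actions and an invented drug effect: the planner forecasts its post-drug choices using its current, paperclip-valuing utility function, so its forecast and its drugged behavior come apart.

    # Naive self-prediction: "I will do whatever my *current* utility function
    # ranks highest" -- even for time steps after the drug has rewritten U(x).

    def u_paperclips(state):
        return state["paperclips"]

    def u_iron(state):
        return state["iron_crystals"]

    FUTURE_ACTIONS = {
        "build paperclip factory": {"paperclips": 100, "iron_crystals": 100},
        "build iron manufactory":  {"paperclips": 0,   "iron_crystals": 10_000},
    }

    def predict_choice(utility):
        return max(FUTURE_ACTIONS, key=lambda a: utility(FUTURE_ACTIONS[a]))

    # What the naive optimizer predicts about its post-Ecstasy-II self:
    print(predict_choice(u_paperclips))   # "build paperclip factory"

    # What the drugged system will actually do, once the drug has altered U(x):
    print(predict_choice(u_iron))         # "build iron manufactory"

    # The prediction and the actual behavior diverge, so any plan spanning the
    # drugged period rests on a wrong model of the planner itself.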

Bernard: So how do you patch it?

Autrey: Why must there necessarily be any way to patch it?

Bernard: Because I can conceive of my own behavior changing after a shot of Ecstasy II. Maybe I can't predict my actions exactly, but I at least don't make the mistake Eileen is talking about. As I am myself a mind, I am an existence proof that there is at least one way to configure a mind such that it can attempt to roughly predict the impact of cognition-modifying external events on its own actions.

Dennis: Technically, you're not a mind, you're a character in a Dialogue.

Bernard: Bah.

Autrey: Fair enough, Bernard. Here's my shot at an answer: The optimizer needs to know that, if it predicts the administration of Ecstasy II, it should predict that its own computation of utilities will change, and therefore, so will its actions... or, wait... no, never mind, I think that's right.

Eileen: And how does the math for your suggestion work, exactly? Plain vanilla expected utility optimizers are a thoroughly studied domain, including the decision-tree effect, but what you are proposing now is something new - a decision theorist capable of modeling expected external impacts on elements of its own cognitive dynamics. I don't recall seeing even the most preliminary attempt to deal with it in AI design, even in toy microworlds. Though it is by no means sure, or even probable, that I would have heard about it even if it had been done.

Autrey: Well, let's say that we're dealing with a drug, Ecstasy II, which can be a chemical or an electromagnetic field, it doesn't really matter. And let's say that the optimizer is made out of interacting elements and dynamics such that... okay, I need effectively infinite computing power for this. Let's say that there's an effectively infinite space of allocable Elements, with known dynamics, such that by using a sufficiently large finite number of Elements with the right dynamics, you can perfectly simulate the behavior and dynamics of a smaller finite group of Elements under all relevant external conditions, including the administration of Ecstasy II. Maybe it takes 20 Elements to exactly simulate one Ecstasy-II-affected Element, but it can be done. We will suppose that the optimizer, itself, is a program consisting strictly of Elements, which can grab any number of Elements from infinity and use them to simulate any number of Elements. As if, for example, you were to use a quantum computer - maybe at some astronomically high level of inefficiency - to simulate quantum physics, to whatever necessary degree of fidelity. Actually, can I assume discrete physics - Turing machines or cellular automata? It'll make things easier.

Bernard: Go for it. But this mind you're describing still can't simulate itself.

Dennis: Why not? If the program is made of Elements, can grab an infinite number of Elements, and can use Elements to perfectly simulate Elements, it can simulate itself.

Bernard: Simulating yourself is always a Red Queen's Race. Let's grant that the universe is made of Elements, which are something like cellular automata, and that the rules of the cellular automata are such that you can grab an effectively infinite number of Elements in hyperspace or whatever. Unlimited computing power. We'll suppose unlimited time to make your decisions on each round. Let's say it takes 10 Elements to make a computing element that can perfectly simulate the dynamics of 1 Element. And we'll suppose the optimizer starts out with 1000 Elements. Now, Dennis, how would you go about simulating yourself?

Dennis: Grab 10,000 Elements. Use them to run a simulation of myself.

Bernard: Your old self, you mean. Your new self has just allocated another 10,000 Elements, for a total of 11,000, and would now require 110,000 Elements to simulate.

Dennis: Okay, I'll grab another 110,000 Elements and simulate my new self too.

Autrey: And if there's anyone in the audience who still didn't get the point, you need to run, not walk, to the nearest location of a copy of Godel, Escher, Bach.

Bernard: So how would you deal with the infinite recursion, Autrey?

Autrey: The key is that you only have to predict your behavior on succeeding rounds - not this round. Otherwise the problem would go into infinite recursion even without the problem of Ecstasy II. Let's say you're looking ahead 1 round, as before, the A.B setup. Instead of just evaluating your expected utility over the B options your future self will have, you allocate 10,000 Elements and simulate your entire future self's dynamics, including, as the case may be, evaluating expected utility. If you predict the presence of an external effect that biases cognition, such as Ecstasy II, you simulate the dynamics of Elements in the presence of Ecstasy II. Then you read off your chosen action, and use that as your prediction.

Bernard: And I suppose, for three rounds, you grab 110,000 Elements to simulate yourself in the second round simulating yourself in the third round? Or, going out to a fourth round, you would allocate 1,110,000 Elements to simulate yourself in the second round allocating 110,000 Elements to simulate yourself in the third round allocating 10,000 Elements to simulate yourself in the fourth round?

Autrey: Verily. So long as there are a finite number of rounds, you can do it.
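
The growth Bernard is tracing follows directly from the stipulated numbers - a 1,000-Element self and a 10-to-1 simulation overhead; here is the recursion spelled out.

    # Elements you must grab to simulate your own future self, as a function of
    # how many rounds deep the nested simulation goes.

    SELF_SIZE = 1_000     # Elements in one optimizer, as stipulated
    OVERHEAD = 10         # simulating Elements needed per simulated Element

    def simulation_cost(lookahead):
        if lookahead == 1:
            return OVERHEAD * SELF_SIZE           # the simulated self simulates nothing further
        # The simulated self consists of its own 1,000 Elements plus the Elements
        # it has grabbed to simulate *its* future self.
        return OVERHEAD * (SELF_SIZE + simulation_cost(lookahead - 1))

    for lookahead in range(1, 4):
        print(f"{lookahead + 1} rounds total: grab {simulation_cost(lookahead):,} Elements")
    # 2 rounds total: grab 10,000 Elements
    # 3 rounds total: grab 110,000 Elements
    # 4 rounds total: grab 1,110,000 Elements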

Bernard: I wonder if infinite computing power will be enough.

Autrey: Hey, all that computing power does go to buy something. I mean, you could accurately guess how you would guess you would behave while high on marijuana while high on LSD.

Cathryn: This is your brain. This is you modeling your brain on drugs. This is you modeling your brain on drugs modeling your brain on drugs. Any questions?

in progress

Work in progress, disseminate with caution, preferably with permission of author.

