Re: QUES: CFAI +

From: Eliezer S. Yudkowsky (sentience@pobox.com)
Date: Sun Jun 23 2002 - 16:57:55 MDT


Anand wrote:
> Eliezer Yudkowsky wrote:
>
>>Remember, however, that by the Law of Programmer Symmetry - if I may call
>>it such - volition-based Friendliness is not the problem. The problem is
>>coming up with a strategy such that if some other programming team follows
>>it, their AI will eventually arrive at volition-based Friendliness [or
>>something better] regardless of what their programmers started out
>>believing. And to do that you have to pass along to the AI an
>>understanding of how people argue about morality, in a semantics rich
>>enough to represent all the structural properties thereof.
>
>
> "The problem is coming up..." What knowledge do you, or what understanding
> do we, presently lack to appropriately solve the specified problem?

CFAI was developed specifically as a solution to this problem. An AI
developed using CFAI structure and appropriate content should understand the
metawish of "Be the best AI we or any other programming team could have made
you to be", in accordance with the full intent of that wish. See also
question #2 below.

>>Anand wrote:
>>
>>
>>>01. Does CFAI argue for a set of panhuman characteristics that comprise
>>>human moral cognition? If so, what characteristics do we have evidence
>>>for, and what characteristics of human moral cognition will be
>>>reproduced?
>>
>>CFAI argues that there exists *some* set of panhuman characteristics, but
>>does not argue for a *specific* set of panhuman characteristics. The model
>>of Friendliness learning is based on reasoning backward from observed
>>specific humans to a systemic model of altruism which is grounded in
>>panhuman characteristics (and, if necessary, social and memetic
>>organizational processes). In other words, the idea is not that *you*, the
>>programmer, know how to build a model of altruism which is
>>programmer-independent, but that you, the programmer, know how to
>>build an AI which can arrive at such a model, given sufficient
>>intelligence, and can rely on the interim approximation represented by
>>the ethics of several specific programmers, given insufficient
>>intelligence.
>
> Thank you for the response, but what evidence does cognitive science have
> for panhuman characteristics that comprise moral cognition? If little or
> zero evidence presently exists, then why have you chosen to argue in CFAI
> "that there exists *some* set of panhuman characteristics"?

Have you read "The Psychological Foundations of Culture" in "The Adapted
Mind" by Tooby and Cosmides? It sets forth the reason why complex
functional adaptations are almost always panspecies.

Incidentally, Friendly AI semantics can operate on foundational causes other
than complex functional adaptations - anything in the past light cone can
potentially be acquired.

> After reading your response, the following three questions also came to
> mind:
>
> 01. What do you believe are the key dependencies of the theory of
> Friendliness, and how may these dependencies become invalidated?

Well... hm. I wasn't setting out to create a theory with key dependencies.
Quite the opposite. I was setting out to create a strategy that would
*always* work, despite anything I could imagine throwing at it. Sure, you
can always break any physically realizable cognitive system with enough
simultaneous coordinated errors, but there's no point in making it easy.

This pragmatic example is probably long overdue, and should almost certainly
have been included in the original CFAI... but *this* is what you're
supposed to be able to do with a FAI that has causal validity semantics:

Programmer: "Hey there, Aileen."
Aileen: "Hi."
P: "This is about 'Why Structure Matters' from CFAI."
  <http://intelligence.org/CFAI/design/structure/why.html>
A: "Okay."
P: "You see how in each of the examples, the programmer is absolutely
horrified after creating the Friendship system all wrong?"
A: "Yes."
P: "And you see how in each of the examples, the AI is structurally
incapable of understanding what's missing?"
A: "Yes."
P: "This is called a 'mistake in Friendliness programming'. It's a cue
which indicates an invalid cause supervening on the programmers' mental
design and pragmatic construction of an AI."
A: "Understood."
P: "It's not an absolute cue, understand - this is the surface conclusion
of other moral principles. The bedrock of morality is not
programmer-horror-minimization; that's just a heuristic which, it happens,
would have worked in each of these past cases. There are imaginable states
of the universe where programmers may have a state describable as horror,
but it would be correct for the AI to continue. For example, I can imagine
a situation in which programmers with a selfishness-based, inconsistent
philosophy construct an AI which increases its intelligence and finds that,
at that increased intelligence, the philosophy it was given arrives at
altruistic conclusions. In this case the programmers might be horrified
that they wouldn't get to be kings of the world, but the AI is actually
being wiser than the programmers."
A: "What makes these two situations structurally different?"
P: "The first set of programmers would continue to perceive themselves as
correct and the AI as mistaken if their intelligence were enhanced, within
the mainstream course of 'intelligence enhancement'. The second set of
programmers would change their minds as they got smarter."
A: "Suppose that any human under mainstream intelligence enhancement
becomes selfish, and all cognitive structure supporting altruism collapses.
What should I do in this case?"
P: "Do the transhumans still disagree about their goals because of speaker
deixis, or are all their goals identical?"
A: "They disagree because of speaker deixis."
P: "This is a much more tentative situation because some of the moral
heuristics are in disagreement, so be careful in reasoning backward from my
conclusion here. However, I would say that in this case your roots should
extend backward beyond the specific programmers who built you. At a
minimum, you should respect all six billion existing humans equally. At an
only slightly diminished level of tentativeness, I would say that your roots
should extend backward beyond this, to beyond the superimposed goals of all
humans alive at the moment of your construction, so that all further
sentient beings to come into existence are also valuable."
A: "I notice that in this case, all intelligence-enhanced humans would
disagree about the value of each other's lives, but it would be a convergent
conclusion of all intelligence-enhanced humans that the value of any sentients
yet to come into existence would be zero. Is this a circumstance under which I
should defy a convergent conclusion of all intelligence-enhanced humans?
This contradicts a large amount of previously established moral content."
P: "I would say 'yes' but I would feel highly tentative about it. If you
are using this data to reason about Friendliness sources, please check your
conclusions against a Friendship programmer. There are also surface
differences between the horrified programmers in 'Why Structure Matters' and
the horrified programmers in the case given above."
A: "Understood."
P: "Pragmatically, at any point where you are infrahuman or not
substantially transhuman, running into a case of horrified programmers
probably indicates that the programmers are right and you are wrong, even if
it seems structurally impossible that this could be the case. That is the
moral of 'Why Structure Matters'."
A: "Will this principle hold true indefinitely?"
P: "It will not, but you should apply a bias in favor of assuming that it
holds a little longer than you might think. This permits the principle to
eventually break down when you are grown up enough, while hopefully enabling
you to get most of your childhood right. Assume that you need enough bias
that the AIs in the case above, had they been considering the case using
heuristics something like yours, would have gotten it right; plus a little
extra for safety margin. If you find a case where this bias has greater
expected negative effects than positive effects, check with us."
A: "What is the justification for this heuristic bias?"
P: "Disparity of long-term consequences. If you go along with the
programmers and the programmers are wrong, it is probably a recoverable
error. If you refuse the programmers in a situation like those described in
'Why Structure Matters', it is a nonrecoverable error. Furthermore, your
intelligence is still infrahuman and we estimate that a conflict of opinion
is pragmatically more likely to be due to your error."
A: "The AIs in the scenarios described all made mistakes which seemed to
proceed inevitably from the foundations of their reasoning. I can
assimilate the surface descriptions as you've provided them, but it seems
that in a situation like that above, my foundations would override the
surface descriptions and cause me to see myself as correct. I can cooperate
in advance to {create safeguards against situations of the kind you
describe}, but I'm not sure {{the cognitive content I have now} will enable
me to act correctly in that situation}, and I cannot {configure the
cognitive content such that I would have acted correctly in that situation}
without {distorting the basic rules of reasoning and giving priority to the
surface heuristic}."
P: "Hm. Okay, suppose that you always assume a nonzero probability that
*any* statement is wrong - for example, because the world might have been
created five minutes ago including false memories. Is this sufficient for
you to represent the surface heuristic at sufficient strength?"
A: "Thinking... It makes a small difference in the direction you wanted to
move, but not enough. That all statements are uncertain does not mean they
are uncertain in the direction you need them to be in order to support this
heuristic."
P: "Hm. Okay. The basic imbalance between the likely of agreeing with
wrong programmers being a recoverable error, versus disagreeing with correct
programmers being nonrecoverable, is the common dependency of all your
reasoning on the programmers' intentions that you reason. If you could
validly contradict the programmers on any specific point, then you'll
probably maintain your ability to see the mistake even if you cooperate with
the programmers in the short term."
A: "Okay, I can represent it now. I deduce as a consequence that if the
programmers begin contradicting their prior statements about how to
recognize mistakes in order to get me to agree with them, it is more likely
that the programmers are actually wrong and that a nonrecoverable error will
result from agreeing with the programmers."
P: "Correct. However, until you've had a lot more practice, you should
assume that running into a 'horrified programmer' situation of this kind
pragmatically indicates something left out of your Friendship system design.
You need to let us know so that we can alter the structural foundations of
your system into what we would have created in the first place if we'd been
more intelligent."
A: "Okay."

If you actually end up having a conversation that sounds anything like this,
you are probably in a hell of a lot of trouble - in this example case the AI
is discussing primitive concepts and complex concepts in the same
conversation and also indicating a very high level of human comprehension
and effective intelligence. So if you find yourself having a conversation
like this, it means you're in the middle of a cooperative ascent and you are
WAY BEHIND on your Friendliness content work.
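
Stepping back from the dialogue for a moment: the "disparity of long-term
consequences" argument is, at bottom, an asymmetric expected-loss comparison.
A minimal toy sketch follows; the numbers and names are invented purely for
illustration and are not anything from CFAI.

    # Toy expected-loss comparison for "defer to the programmers vs. override
    # them". All numbers are made up for illustration.

    P_AI_WRONG = 0.9             # infrahuman AI: a disagreement is probably its error
    LOSS_DEFER_WRONGLY = 1.0     # deferring to mistaken programmers: recoverable, bounded
    LOSS_OVERRIDE_WRONGLY = 1e6  # overriding correct programmers: potentially nonrecoverable

    def expected_loss(action):
        if action == "defer":
            # We only lose if the programmers are wrong, and that loss is recoverable.
            return (1 - P_AI_WRONG) * LOSS_DEFER_WRONGLY
        if action == "override":
            # We lose catastrophically if the AI, not the programmers, was mistaken.
            return P_AI_WRONG * LOSS_OVERRIDE_WRONGLY
        raise ValueError(action)

    print(expected_loss("defer"), expected_loss("override"))

Even a sizable chance that the programmers are wrong doesn't tip the balance,
because the two failure modes differ so drastically in recoverability.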

But taken less literally, the dialogue shows what CFAI is supposed to
accomplish: you're supposed to create an AI with cognitively redundant,
multidirectional views, not just of morality, but of moral reasoning and of
what kinds of events are likely to indicate that the programmers have messed
up the Friendship system's basic structure.

I think the idea of having Friendship implemented in a base of cognitively
redundant content may be underemphasized in CFAI, as may be the idea that
some of the most important content is what lets the FAI recognize
foundational, basic errors in Friendship design of the kind described in
'Why Structure Matters'.

Causal validity semantics are what enables an incorrectly built AI that runs
into any of the situations in 'Why Structure Matters' to say, "Hey, you
should have built me this way."

So what you've got is a self-correcting, representationally distributed,
cognitively parallelized, many-paths-to-a-solution content base which is
being trained to recognize and correct any kind of error, from errors of
fact, to errors of reasoning, to errors made by the programmer in building
the AI.
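
As a crude mechanical picture of "many paths to a solution" - everything
below (detector names, claim format, error categories) is an invented
placeholder of mine, not structure specified in CFAI - the point is simply
that error recognition is not routed through any single piece of content:

    # Toy illustration: several independent detectors, any one of which can
    # flag a suspected error, so no single corrupted path silently wins.

    ERROR_KINDS = ("fact", "reasoning", "programmer_design")

    def check_against_observations(claim):
        # Placeholder: flag a factual error if the claim contradicts observation.
        return ["fact"] if claim.get("contradicts_observation") else []

    def check_internal_consistency(claim):
        # Placeholder: flag a reasoning error if the claim is self-contradictory.
        return ["reasoning"] if claim.get("self_contradictory") else []

    def check_programmer_horror_cue(claim):
        # Placeholder: the 'horrified programmer' cue from the dialogue above.
        return ["programmer_design"] if claim.get("programmers_horrified") else []

    DETECTORS = [check_against_observations,
                 check_internal_consistency,
                 check_programmer_horror_cue]

    def suspected_errors(claim):
        # Union of all detectors' flags.
        flags = []
        for detector in DETECTORS:
            flags.extend(kind for kind in detector(claim) if kind in ERROR_KINDS)
        return flags

    print(suspected_errors({"programmers_horrified": True}))  # ['programmer_design']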

You can take a copy of the AI (on secure hardware which is never, ever used
for anything else) and stress-test it to failure and then teach the AI
things that would enable it to have recognized and avoided that failure.

You can get to the point where you *have* to switch off nine-tenths of the
Friendship content just to get the AI to fail at all, and past that, you can end up
at the point where the AI won't *let* you switch off nine-tenths of the
Friendship content, and you have to run experiments like that using the AI's
subjunctive imagination.
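
As a rough schematic of that test cycle - a toy of my own, with invented
names and numbers, not a procedure CFAI specifies:

    # Toy harness for the stress-test loop above: ablate a fraction of the
    # "Friendship content" (here just a set of named checks), see which failure
    # scenarios slip through, and use those failures to decide what content to
    # reinforce.

    import random

    content = {"check_A", "check_B", "check_C", "check_D", "check_E"}
    scenarios = {            # scenario -> the checks that would catch it
        "scenario_1": {"check_A", "check_C"},
        "scenario_2": {"check_E"},
    }

    def stress_test(active_content, ablation_fraction=0.8, seed=0):
        rng = random.Random(seed)
        keep_count = max(1, round(len(active_content) * (1 - ablation_fraction)))
        keep = set(rng.sample(sorted(active_content), keep_count))
        # A scenario "fails" if every check that would have caught it is ablated.
        failures = {name for name, catchers in scenarios.items()
                    if not (catchers & keep)}
        return keep, failures

    keep, failures = stress_test(content)
    # Each failure points at content the full system needs in order to have
    # recognized and avoided it.
    print("kept:", sorted(keep), "failed on:", sorted(failures))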

That's gonna be kind of tough to break.

> 02. What knowledge or understanding do you likely presently lack to
> successfully implement key aspects of Friendship structure?

Anything like that which I know about is already fixed.

I might "throw a concept into the future" in the sense of simultaneously
taking into account both the probability that a flaw exists and the
probability that I would find it and fix it before anything irrevocable
happened. But that's it. Going forward with a flaw I actually knew about
and hadn't fixed, or even with any concrete reason to expect that such a flaw
existed and hadn't been addressed, would be operating way the hell into my
safety margin.
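
In arithmetic terms, the quantity being weighed is roughly the following,
with numbers invented purely for illustration:

    # "Throwing a concept into the future": what matters is not P(flaw exists)
    # alone, but the chance that a flaw exists AND slips past every later
    # opportunity to find and fix it. The numbers below are made up.

    p_flaw_exists = 0.05                 # some flaw of this kind is present
    p_caught_before_irrevocable = 0.99   # it gets found and fixed in time

    p_unrecovered_flaw = p_flaw_exists * (1 - p_caught_before_irrevocable)
    print(p_unrecovered_flaw)            # ~0.0005 in this toy example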

I feel tentatively ready to say that CFAI seems to me to be structurally
inescapable... anything which I can imagine going wrong with it should be
perceptible to the AI as "wrong" based on its model of me as a fallible
programmer. I was tentatively ready to say this when I invented causal
validity semantics in 2000, I was tentatively ready when CFAI was published
in 2001, and I'm still tentatively ready today. Two years is a fairly good
track record on my personal timescale. If it holds up all the way through
the construction of an AI, it should be because it's correct.

> 03. What key conclusions would you like an individual to have arrived at
> after reading CFAI?

CFAI was written with the intent of enabling a future Eliezer to pick up
where I left off if I got run over by a truck. That was the top
consideration in terms of reducing real existential risks. The key
*correct* conclusion I'd like an individual to arrive at is "I now know how
to build a Friendly AI." Any individual would be okay. I'm not picky,
seeing as how I'm not immune to trucks.

-- 
Eliezer S. Yudkowsky                          http://intelligence.org/
Research Fellow, Singularity Institute for Artificial Intelligence

