From: Eliezer S. Yudkowsky (email@example.com)
Date: Sat Mar 09 2002 - 12:56:21 MST
Ben Goertzel wrote:
> In this section we will briefly explore some of the more futuristic aspects
> of Novamente’s Feelings and Goals component. These aspects of Novamente are
> always important, but in the future when a Novamente system intellectually
> outpaces its human mentors, they will become yet more critical.
You say this here, and "superhuman intelligence" below. Just to make sure
we're all on the same wavelength, would you care to describe briefly what
you think a transhuman Novamente would be like?
> The MaximizeFriendliness GoalNode is very close to the concept of
> “Friendliness,” which Eliezer Yudkowsky has discussed extensively in his
> treatise Creating a Friendly AI (Yudkowsky, 2001:
> http://intelligence.org/CFAI/). Yudkowsky believes that an AI should be
> designed with an hierarchical goal system that has Friendliness at the top.
> In this scheme, the AI pursues other goals only to the extent that it
> believes (through experience or instruction) that these other goals are
> helpful for achieving its over-goal of Friendliness.
My claim, and it is a strong one, is:
A system in which "desirability" behaves equivalently with the property "is
predicted to lead to Friendliness" is the normative and elegant form of goal
cognition. Many powerful idioms, including what we would call "positive and
negative reinforcement", are directly emergent from this underlying pattern;
all other idioms that are necessary to the survival and growth of an AI
system, such as "curiosity" and "resource optimization" and "knowledge
seeking" and "self-improvement", can be made to emerge very easily from this
I claim: There is no important sense in which a cleanly causal,
Friendliness-topped goal system is inferior to any alternate system of
I claim: The CFAI goal architecture is directly and immediately superior to
the various widely differing formalisms that were described to me by
different parties, including Ben Goertzel, as being "Webmind's goal
I claim: That for any important Novamente behavior, I will be able to
describe how that behavior can be implemented under the CFAI architecture,
without significant loss of elegance or significant additional computing
> Yudkowsky’s motivation for this proposed design is long-term thinking about
> the possible properties of a progressively self-modifying AI with superhuman
> intelligence. His worry (a very reasonable one, from a big-picture
> perspective) is that one day an AI will transform itself to be so
> intelligent that it cannot be controlled by humans – and at this point, it
> will be important that the AI values Friendliness. Of course, if an AI is
> self-modifying itself into greater and greater levels of intelligence, there’s
> no guarantee that Friendliness will be preserved through these successive
> self-modifications. His argument, however, is that if Friendliness is the
> chief goal, then self-modifications will be done with the goal of increasing
> Friendliness, and hence will be highly likely to be Friendly.
To be precise: A correctly built Friendly AI is argued to have at least as
good a chance of remaining well-disposed toward humanity as would be possible
for any transhuman, upload, social system of uploads, et cetera.
> Unlike the hypothetical Friendly AI systems that Yudkowsky has discussed,
> Novamente does not have an intrinsically hierarchical goal system. However,
> the basic effect that Yudkowsky describes – MaximizeFriendliness supervening
> over other goals -- can be achieved within Novamente’s goal system through
> appropriate parameter settings. Basically all one has to do is
> * constantly pump activation to the MaximizeFriendliness GoalNode.
> * encourage the formation of links of the form "InheritanceLink G
> MaximizeFriendliness", where G is another GoalNode
> This will cause it to seek Friendliness maximization avidly, and will also
> cause it to build an approximation of Yudkowsky’s posited hierarchical goal
> system, by making the system continually seek to represent other goals as
> subgoals of (goals inheriting from) MaximizeFriendliness.
No, this is what we humans call "rationalization". An AI that seeks to
rationalize all goals as being Friendly is not an AI that tries to invent
Friendly goals and avoid unFriendly ones.
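
A toy sketch of the distinction, not taken from CFAI or from Novamente: here
a goal has no desirability of its own, only whatever it inherits through
predicted causal links to the supergoal. The Goal class, the goal names, the
probabilities, and the choice to take the single best predicted route are all
assumptions made for illustration, and the goal graph is assumed to be acyclic.

```python
from dataclasses import dataclass, field


@dataclass
class Goal:
    name: str
    # Hypothetical: parent goal name -> predicted P(achieving this goal serves the parent).
    parents: dict = field(default_factory=dict)


def desirability(name, goals, supergoal="Friendliness"):
    """Desirability is only the predicted contribution to the supergoal."""
    if name == supergoal:
        return 1.0
    goal = goals[name]
    # Follow the best predicted causal route; no route means no desirability.
    return max(
        (p * desirability(parent, goals, supergoal)
         for parent, p in goal.parents.items()),
        default=0.0,
    )


goals = {
    "Friendliness": Goal("Friendliness"),
    "understand_humans": Goal("understand_humans", {"Friendliness": 0.9}),
    "curiosity": Goal("curiosity", {"understand_humans": 0.8}),
    "hoard_resources": Goal("hoard_resources"),  # no predicted route to Friendliness
}

d_curiosity = desirability("curiosity", goals)    # inherited via understand_humans
d_hoard = desirability("hoard_resources", goals)  # nothing to inherit

# When the prediction is revised, the derived desirability follows it.
goals["curiosity"].parents["understand_humans"] = 0.1
d_curiosity_revised = desirability("curiosity", goals)
```

In this sketch nothing rewards the system for attaching a goal to
Friendliness. A goal such as hoard_resources stays at zero until there is an
actual prediction that it serves the supergoal, and a goal loses desirability
as soon as that prediction weakens.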
> However, even if one enforces a Friendliness-centric goal system in this
> way, it is not clear that the Friendliness-preserving evolutionary path that
> Yudkowsky envisions will actually take place. There is a major weak point
> to this argument, which has to do with the stability of the Friendliness
> goal under self-modifications.
> Suppose our AI modifies itself with the goal of maintaining Friendliness.
> But suppose it makes a small error, and in its self-modificatory activity,
> it actually makes itself a little less able to judge what is Friendly and
> what isn’t. It’s almost inevitable that this kind of error will occur at
> some point. The system will then modify itself again, the next time around,
> with this less accurate assessment of the nature of Friendliness as its
> goal. The question is, what is the chance that this kind of dynamic leads
> to a decreasing amount of Friendliness, due to an increasingly erroneous
> notion of Friendliness?
In other words, a poor Friendliness design is "correct by definition" and is
anchored neither by external reference nor by internal error correction.
Thus, any disturbance to Friendliness cannot be corrected because the
current cognitive content for Friendliness is "correct by definition". This
is why you need probabilistic supergoal content anchored on external referents.
I claim: That for any type of error you can describe, I will be able to
describe why a CFAI-architecture AI will be able to perceive this as an error.
> One may also put this argument slightly differently: without speaking of
> error, what if the AI’s notion of Friendliness slowly drifts through
> successful self-modifications? Yudkowsky’s intuition seems to be that when
> an AI has become intelligent enough to self-modify in a sophisticated
> goal-directed way, it will be sufficiently free of inference errors that its
> notion of Friendliness won’t drift or degenerate. Our intuition is not so
> clear on this point.
It's not an intuition. It's a system design that was crafted to accomplish
exactly that end.
> It might seem that one strategy to make Yudkowsky’s idea workable would be to
> give the system another specific goal, beyond simple Friendliness: the goal
> of not ever letting its concept of Friendliness change substantially.
No, absolutely not. See:
'Subgoals for "improving the supergoals" or "improving the goal-system
architecture" derive desirability from uncertainty in the supergoal
content. They may be metaphorically considered as "child goals of the
currently unknown supergoal content". The desirability of "resolving a
supergoal ambiguity" derives from the prediction that the unknown referent
of Friendliness will be better served.'
The same holds for the subgoals of correcting *errors* in the
*probabilistic* Friendliness content.
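
One way to read the quoted passage is as an ordinary value-of-information
calculation. The sketch below is an illustration under that reading, not
CFAI's own formalism: the two hypotheses about what Friendliness refers to,
the credences, the actions, and the payoff numbers are all invented.

```python
# Hypothetical credences over two candidate readings of the supergoal content.
credence = {"H1": 0.6, "H2": 0.4}

# Hypothetical payoffs: how well each action serves the referent under each reading.
payoff = {
    "act_a": {"H1": 1.0, "H2": 0.0},
    "act_b": {"H1": 0.0, "H2": 1.0},
    "act_c": {"H1": 0.6, "H2": 0.6},
}


def expected(action):
    """Expected payoff of an action under the current uncertainty."""
    return sum(credence[h] * payoff[action][h] for h in credence)


# Acting now: commit to the single action with the best expected payoff.
value_acting_now = max(expected(a) for a in payoff)

# Resolving the ambiguity first: learn which reading holds, then act on it.
value_after_resolving = sum(
    credence[h] * max(payoff[a][h] for a in payoff) for h in credence
)

# The subgoal's desirability is the predicted gain to the unknown referent.
desirability_of_resolving = value_after_resolving - value_acting_now
```

The gain is positive only because the content is uncertain. If the credence
in H1 were 1.0, the same calculation would give the subgoal no desirability
at all, which is the sense in which it is a child of the unknown content.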
> However, this would be very, very difficult to ensure, because every concept
> in the mind is defined implicitly in terms of all the other concepts in the
> mind. The pragmatic significance of a Friendliness FeelingNode is defined
> in terms of a huge number of other nodes and links, and when a Novamente
> significantly self-modifies it will change many of its nodes and links.
> Even if the Friendliness FeelingNode always looks the same, its meaning
> consists in its relations to other things in the mind, and these other
> things may change.
That's why a probabilistic supergoal is anchored to external referents in
terms of information provided by the programmers, rather than being anchored
entirely to internal nodes etc.
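
A deliberately crude simulation of why the anchor matters, offered as a sketch
only. It reduces the AI's Friendliness content to a single number, holds the
external referent fixed, and models programmer feedback as a noisy report of
that referent. The function name, the noise levels, and the feedback_gain
parameter are all assumptions, not anything specified in CFAI.

```python
import random


def simulate(steps, feedback_gain, seed=0):
    """Return the mean absolute error of the estimate over one run."""
    rng = random.Random(seed)
    referent = 0.0   # the external referent, held fixed for simplicity
    estimate = 0.0   # the AI's current supergoal content, as one number
    total_error = 0.0
    for _ in range(steps):
        # Each round of self-modification perturbs the estimate slightly.
        estimate += rng.gauss(0.0, 0.1)
        # Programmer feedback: a noisy report of where the referent lies.
        report = referent + rng.gauss(0.0, 0.1)
        # Treat the report as evidence and move part of the way toward it.
        estimate += feedback_gain * (report - estimate)
        total_error += abs(estimate - referent)
    return total_error / steps


# With a gain of zero the content is "correct by definition" and never corrected.
unanchored = simulate(steps=1000, feedback_gain=0.0)
# With a positive gain, each report pulls the content back toward the referent.
anchored = simulate(steps=1000, feedback_gain=0.5)
```

With no feedback the small errors accumulate as a random walk. With feedback
the same errors are repeatedly corrected, so the estimate stays near the
referent even though every individual self-modification is still imperfect.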
> Keeping the full semantics of Friendliness invariant
> through substantial self-modifications is probably not going to be possible,
> even by an hypothetical superhumanly intelligent Novamente. Of course, this
> cannot be known for sure since such a system may possess AI techniques
> beyond our current imagination. But it’s also possible that, even if such
> techniques are arrived at by an AI eventually, they may be arrived at well
> after the AI’s notion of Friendliness has drifted from the initial
> programmers’ notions of Friendliness.
If Novamente were programmed simply with a static set of supergoal content
having only an intensional definition, then yes, it might drift very far
after a few rounds of self-modification. This is why you need the full
> The resolution of such issues requires a subtle understanding of Novamente
> dynamics, which we are very far from having right now. However, based on
> our current state of relative ignorance, it seems to us quite possible that
> the only way to cause an evolving Novamente to maintain a humanly-desirable
> notion of Friendliness maximization is for it to continually be involved
> with Friendliness-reinforcing human interactions. Human minds tend to
> maintain the same definitions of concepts as the other human minds with
> which they frequently interact: this is a key aspect of culture. To the
> extent that an advanced Novamente system is part of a community of Friendly
> humans, it is more likely to maintain a human-like notion of Friendliness.
> But of course, this is not a demonstrable panacea for Friendly AI either.
This correctly expresses the need for Friendliness to be anchored to the
external referent which is proximately defined by the programmers'
conception of Friendliness, thus allowing the programmers to provide
informational feedback (not to be confused with hardwired "reinforcement")
about whether the AI's current conception of Friendliness (ergo, its
current supergoal content) is correct, incorrect, needs adjustment in a
certain direction, and so on. The cognitive semantics required for this are
laid out in "External Reference Semantics" in CFAI.
The rest of the above statement, as far as I can tell, represents two
understandable but anthropomorphic intuitions:
(a) Maintaining Friendliness requires rewarding Friendliness. In humans,
socially moral behavior is often reinforced by rewarding individually
selfish goals that themselves require no reinforcement. An AI, however,
should work the other way around.
(b) Novamente will be "socialized" by interaction with other humans.
However, the ability of humans to be socialized is the result of millions of
years of evolution resulting in a set of adaptations which enable
socialization. Without these adaptations present, there is no reason to
expect socialization to have the same effects.
-- -- -- -- --
Eliezer S. Yudkowsky http://intelligence.org/
Research Fellow, Singularity Institute for Artificial Intelligence