From: Eliezer S. Yudkowsky (email@example.com)
Date: Tue Feb 14 2006 - 09:04:37 MST
Mitchell Porter wrote:
> Eliezer has posted a job notice at the SIAI website, looking
> for research partners to tackle the problem of rigorously
> ensuring AI goal stability under self-enhancement transformations.
> I would like to see this problem (or perhaps a more refined one)
> stated in the rigorous terms of theoretical computer science; and
> I'd like to see this list try to generate such a formulation.
I strongly suspect that when this problem is stated in rigorous terms it
will have been solved.
I am not averse to SL4 trying to generate such a formulation, but I have
a feeling it's not going to work.
> There are several reasons why one might wish to refine the
> problem's specification, even at an informal level. Clearly, one
> can achieve goal stability by committing to an architecture
> incapable of modifying its goal-setting components. This is an
> insufficiently general solution for Eliezer's purposes, but he does
> not specify exactly how broad the solution is supposed to be,
> and perhaps he can't. Ideally, one should want a theory of
> Friendliness which can say something (even if only 'irredeemably
> unsafe, do not use') for all possible architectures. More on this below.
A theory which correctly sorts Friendly and unFriendly architectures *in
general* is impossible by Rice's Theorem; you can't do that even for
multiplication. You can possibly have a theory which produces no false
positives. But that's not even what I'm looking for - just a theory
which correctly states of a single recursive self-improver that it is
Friendly. That is all I need. Engineers don't need a theory of which
bridges in general stay up, just a theory which lets them select from a
small subset of possible bridges which *knowably* stay up.
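The Rice's Theorem obstacle can be made concrete with the standard reduction sketch (my own illustration; the names here are hypothetical): any total decider for the semantic property "computes multiplication" would solve the halting problem.

```python
# Sketch of the standard Rice's-theorem reduction: if a total decider
# existed for the semantic property "f computes multiplication", it
# would decide the halting problem.

def halts_via(multiplies_decider, program, program_input):
    """Decide whether `program` halts on `program_input`, given a
    (hypothetical) total decider for "computes multiplication"."""
    def f(a, b):
        program(program_input)  # runs forever iff `program` loops here
        return a * b            # so f multiplies iff `program` halts
    return multiplies_decider(f)

# A naive "decider" that just runs f -- sound only when f halts, which
# is precisely why no *total* decider can exist:
def naive_check(f):
    return f(2, 3) == 6

# On a program that does halt, the reduction answers correctly:
print(halts_via(naive_check, lambda x: None, 0))  # True
```

This is also why a theory with no false positives remains possible: a conservative checker may answer "don't know" on most inputs and still be sound on the small subset it certifies.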
> So, now to business. What do we want? A theory of 'goal
> stability under self-enhancement', just as rigorous as (say) the
> theory of computational complexity classes.
An interesting analogy. Computational complexity classes will very
often overestimate the difficulty of real-world problems, which are
usually special cases possessing regularities not applicable to the
problem as formulated by the computer scientist. I guess one could make
a case for complexity classes as rigorous upper bounds; like a bridge
that knowably supports a certain minimum weight even if, in practice,
with real-world cars driving over it, it can handle an unguessably
larger load.
> We want to pose
> exact problems, and solve them. But before we can even pose
> them, we must be able to formalize the basic concepts. What
> are they here? I would nominate 'Friendliness', 'self-enhancement',
> and 'Friendly self-enhancement'. (I suppose even the definition of
> 'goal' may prove subtle.)
Reflective decision theory - a theory of motivationally stable
self-enhancement - is the world's second most important math problem.
The *most* important math problem is how to phrase the motivational
invariant itself. A classical utility function *probably* isn't going
to cut it.  My suspicion is that if I were able to build a reflective
decision system, I would know a great deal more about my options for
motivational invariants, and about how to describe the structurally
complex things that humans want - such as "free will" or
"freedom from having one's life path too heavily optimized by outside
sources as opposed to one's own efforts". I am doubtful I can solve the
most important math problem without having solved the second most
important math problem first. Sadly and dangerously, FAI knowledge
*always* lags behind AGI knowledge, because AGI is a strictly simpler
problem.
> It seems to me that the rigorous
> characterization of 'self-enhancement', especially, has been
> neglected so far - and this tracks the parallel failure to define
> 'intelligence' rigorously.
Mitchell, you yourself in private conversation suggested the final form
of the equation I'm still using to do that, at least for classical
utility functions. Let U(x) be the utility function over outcomes, and
EU(a, P) be the expected utility of an action relative to a conditional
probability distribution over outcomes given action.  Let H(x, U) be the
entropy of an outcome relative to a utility function, defined as the
logarithm of the volume of outcomes in X with utility equal to or
greater than U(x).  We may similarly define EH(a, U, P) as the entropy of
an action relative to a utility function and probability distribution.
(I originally had H measured by the volume of outcomes with utility
exactly equal to U(x), a silly mistake which Mitchell corrected, hence
originating the final form of the equation.)
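As a concrete reading of these definitions, here is a minimal discrete sketch (my own toy example; the four-outcome space and the `P_opt`/`P_rand` distributions are assumptions for illustration, not anything from the original derivation). A stronger optimizer concentrates probability mass on rare, high-utility outcomes, and so scores lower expected entropy EH:

```python
import math

OUTCOMES = [0, 1, 2, 3]          # toy outcome space X
U = lambda x: x                  # utility function over outcomes

def H(x, U, outcomes=OUTCOMES):
    """Entropy of outcome x relative to U: log of the volume (here, a
    count) of outcomes with utility equal to or greater than U(x)."""
    return math.log(sum(1 for y in outcomes if U(y) >= U(x)))

def EU(a, P, U, outcomes=OUTCOMES):
    """Expected utility of action a under P(x | a)."""
    return sum(P(x, a) * U(x) for x in outcomes)

def EH(a, P, U, outcomes=OUTCOMES):
    """Expected entropy of action a: lower EH means probability mass is
    steered into rarer, higher-utility regions of the outcome space."""
    return sum(P(x, a) * H(x, U, outcomes) for x in outcomes)

# An optimizing action piles probability onto the best outcome; a
# random action spreads it uniformly.
P_opt = lambda x, a: {0: 0.05, 1: 0.05, 2: 0.10, 3: 0.80}[x]
P_rand = lambda x, a: 0.25

assert EH("opt", P_opt, U) < EH("rand", P_rand, U)
assert EU("opt", P_opt, U) > EU("rand", P_rand, U)
```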
A reason this does not work for specifying motivational stability is
that you can have systems with different U, that still achieve lower
entropy relative to your current U - for example, a more intelligent
system that incorporates additional, unwanted criteria into a modular
utility function, on top of your existing criteria. If the system is
more intelligent, it may, most of the time, steer reality into regions
of higher utility relative to your current U, but there would be strange
additional quirks, and it would not be motivationally stable in the long
run.
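This failure mode can be demonstrated numerically (again my own toy construction, with an assumed "quirk" coordinate standing in for the unwanted extra criterion): a smarter successor optimizing a modified U can still achieve lower expected entropy relative to your original U than you do, so low EH alone does not certify an unchanged goal system.

```python
import math
from itertools import product

OUTCOMES = list(product(range(4), [0, 1]))  # (your_value, quirk) pairs
U  = lambda o: o[0]                         # your current utility
U2 = lambda o: o[0] + 0.5 * o[1]            # successor's U: adds an unwanted criterion

def H(x, Ufn):
    """Log-volume of outcomes at least as good as x under Ufn."""
    return math.log(sum(1 for y in OUTCOMES if Ufn(y) >= Ufn(x)))

def EH(P, Ufn):
    return sum(P(x) * H(x, Ufn) for x in OUTCOMES)

def point_mass(best, p):
    """Put probability p on `best`, spread the rest uniformly."""
    rest = (1 - p) / (len(OUTCOMES) - 1)
    return lambda x: p if x == best else rest

yours     = point_mass((3, 0), 0.60)                 # weaker optimizer of U
successor = point_mass(max(OUTCOMES, key=U2), 0.95)  # stronger optimizer of U2

# The successor steers reality to lower entropy *relative to your U* ...
assert EH(successor, U) < EH(yours, U)
# ... yet its favored outcome carries the quirk you never asked for.
assert max(OUTCOMES, key=U2) == (3, 1)
```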
> We have a sort of empirical definition - success at
> prediction - which provides an *empirical* criterion, but we need
> a theoretical one (prediction within possible worlds? but then
> which worlds, and with what a-priori probabilities?): both a way
> to rate the 'intelligence' of an algorithm or a bundle of heuristics,
> and a way to judge whether a given self-modification is actually
> an *enhancement* (although that should follow, given a truly
> rigorous definition of intelligence). When multiple criteria are in
> play, there are usually trade-offs: an improvement in one direction
> will eventually be a diminution in another. One needs to think
> carefully about how to set objective criteria for enhancement,
> without arbitrarily selecting a narrow set of assumptions.
It is ability to steer the future that matters, not prediction.
Decision theory incorporates probability theory, but not the other way
around.
> Now, the probability that YOU win the race is less than 1, probably
> much less; not necessarily because you're making an obvious
> mistake, but just because we do not know (and perhaps cannot
> know, in advance) the most efficient route to superintelligence.
> Given the uncertainties and the number of researchers, it's fair to
> say that the odds of any given research group being the first are
> LOW, but the odds that *someone* gets there are HIGH. But this
> implies that one should be working, not just privately on a
> Friendliness theory for one's preferred architecture, but publicly on
> a Friendliness theory general enough to say something about all
> possible architectures. That sounds like a huge challenge, but it's
> best to know what the ideal would be, and it's important to see
> this in game-theoretic terms. By contributing to a publicly available,
> general theory of Friendliness, you are hedging your bets;
> accounting for the contingency that someone else, with a
> different AI philosophy, will win the race.
The first problem is that an FAI theory which generalizes across all
architectures is impossible by Rice's Theorem.
The second problem is that a constructive theory of the world's second
most important math problem, reflective decision systems, is necessarily
a constructive theory of seed AI; and constitutes, in itself, a weapon
of math destruction, which can be turned to destruction more *quickly*
than to any good purpose.  Any Singularity-value I attach to publicizing
Friendly AI would go into explaining the *problem*.  Solutions are far
harder than this and will be specialized on particular constructive
approaches.
> To expand on this: the priority of public research should be to
> achieve a rigorous theoretical conception of Friendliness, to develop
> a practical criterion for evaluating whether a proposed AI
> architecture is Friendly or not, and then to make this a *standard*
> in the world of AI research, or at least "seed AI" research.
Practically, it's not very hard. You can eliminate nearly all AGI
projects by asking:
"What theory of AI ethics are you currently using?"
and they won't have one. A few AI projects will solve the problem by
slapping the word "ethical" on their sales literature, so you ask them:
"Please point to a specific design decision or architectural change
that you made solely because your FAI theory required it."
That eliminates everyone else. If they'd answered that, your next
question would be:
"Please show me a walkthrough of how your AI architecture makes a
particular Friendly decision."
Incidentally, the same criterion applies to AGI. Most AI projects don't
have a theory of general intelligence. The ones that would claim to
have a theory of general intelligence have a theory of mystic vital
forces which supposedly produce general intelligence, which they are
trying to infuse into their AI system in the hope that general
intelligence comes out the other end. In other words, they generally
cannot show you a *walkthrough* for how their system does a particular
generally intelligent thing - only say, "what really causes general
intelligence is such-and-such vital forces, so we're trying to build a
system with such-and-such vital forces, just like a human".
> So, again, what would I say the research problems are? To
> develop behavioral criteria of Friendliness in an 'agent', whether
> natural or artificial;
A complete, formally specified behavioral criterion solves the problem.
An incomplete, informal criterion is easy. Anything that wipes out the
human species is unFriendly, anything that doesn't wipe out the human
species is Friendly.  This probably suffices, in practice, to classify
nearly all AGIs that research projects are likely to actually build.
> to develop a theory of Friendly cognitive
> architecture (examples - an existence proof - would be useful;
> rigorous proof that these *and only these* architectures exhibit
> unconditional Friendliness would be even better);
That can't possibly be right. There should be an infinity of
architectures that do this, smaller than the infinity of architectures
that don't, and larger than the infinity of architectures that knowably do.
> to develop
> *criteria* of self-enhancement (what sort of modifications
> constitute an enhancement?); to develop a knowledge of
> what sort of algorithms will *actually self-enhance*.
Any theory you can constructively apply to create an AGI with a simple
goal system like 'paperclips', as opposed to putting in the additional
work to define the Friendly part, is a weapon of math destruction; it
can never go into the public domain until a Friendly AI is already up
and running.
> Then one can tackle questions like, which initial conditions
> lead to stably Friendly self-enhancement; and which
> self-enhancing algorithms get smarter fastest, when
> launched at the same time.
> The aim should always be, to turn all of these into
> well-posed problems of theoretical computer science, just
> as well-posed as, say, "Is P equal to NP?"
On ordinary computers or quantum ones? For our current model of physics
or actual physics?
> Beyond that, the
> aim should be to *answer* those problems (although I
> suspect that in some cases the answer will be an unhelpful
> 'undecidable', i.e. trial and error is all that can be advised,
> and so luck and raw computational speed are all that will matter).
Then you're dead. The Kolmogorov complexity of the target is too great;
no one can have that much luck. You can't build a working wagon by
sawing boards at random and nailing them by coinflips. You can maybe
establish bounds where you know that a given algorithm will solve the
problem correctly, but not how long the algorithm will take to solve the
problem - those are acceptable.
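A back-of-envelope version of the "luck" point (my own numbers, not from the original): hitting one specific K-bit target by blind sampling takes about 2^K tries in expectation, so luck stops being a strategy long before K reaches the complexity of a working design.

```python
import random

def expected_tries(target_bits):
    """Expected number of uniform random guesses before hitting one
    specific target_bits-bit string: 2 ** target_bits."""
    return 2 ** target_bits

def tries_until_hit(target_bits, seed=0):
    """Empirical check on a tiny target: count guesses until a match."""
    rng = random.Random(seed)
    target = rng.getrandbits(target_bits)
    tries = 1
    while rng.getrandbits(target_bits) != target:
        tries += 1
    return tries

# A 16-bit "design" already needs ~65,536 guesses on average; a target
# with the Kolmogorov complexity of a seed AI is astronomically beyond that.
print(expected_tries(16))  # 65536
```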
> and to establish standards - standards of practice,
> perhaps even standards of implementation - in the global
> AI development community.
I think this is around as likely as the spontaneous generation of AGI
from the emergent complexity of packet routers.  It's not just that
everyone has a different architecture, or that a full solution would in
itself constitute a weapon of math destruction.  Most of these people aren't in
it for the Singularity; they're in AI because that's the major they
stumbled into in college. They aren't in it to protect the human
species and it hasn't occurred to them that it's an issue.
> Furthermore, as I said, every AI project that aims to
> produce human-equivalent or superhuman intelligence
> should devote some fraction of its efforts to the establishment
> of universal safe standards among its peers and rivals - or at
> least, devote some fraction of its efforts to thinking about
> what 'universal safe standards' could possibly mean. The odds
> are it is in your interest, not just to try to secretly crack the
> seed AI problem in your bedroom, but to contribute to
> developing a *public* understanding of Friendliness theory.
> (What fraction of efforts should be spent on private project,
> versus on public discussion, I leave for individual researchers
> to decide.)
It may be worthwhile to try and get more people to understand that there
exists a problem and it is hard to solve.
You will be unable to make external researchers solve this frontier
research problem on your behalf, even if you can make them feel an
obligation to put in at least a little effort, when the challenge isn't
really what interests them and they secretly (or not-so-secretly) wish
the whole problem would just go away and stop bothering them.
Really difficult engineering problems can be solved by really smart
engineers, and the trick works because you can select the engineers to
be sufficiently smart. Trying to get everyone else to play nice is a
much harder problem because the "everyone else" is not preselected to be
sufficiently smart. In the lab, when you win or lose, it's your own
fault. Public relations success depends on many real-world factors that
you cannot control by your own power.
> One more word on what public development of Friendliness
> standards would require - more than just having a
> Friendliness-stabilizing strategy for your preferred architecture,
> the one by means of which you hope that your team will win
> the mind race. Public Friendliness standards must have
> something to say on *every possible cognitive architecture* -
> that it is irrelevant because it cannot achieve superintelligence
> (although Friendliness is also relevant to the coexistence of
> humans with non-enhancing human-equivalent AIs); that it
> cannot be made safe, must not win the race, and should
> never be implemented; that it can be made safe, but only
> if you do it like so.
This is knowably impossible by Rice's Theorem.
> And since in the real world, candidates for first
> superintelligence will include groups of humans, enhanced
> individual humans, enhanced animals, and all sorts of AI-human
> symbioses, as well as exercises such as massive experiments in
> Tierra-like darwinism - a theory of Friendliness, ideally, would
> have a principled evaluation of all of these, along the lines I
> already sketched.
Well, in theory, this is far less difficult than saying it for *every
possible* cognitive architecture. The proposal may even achieve the
status of not being ruled out a priori.
In practice, it's a good way of illustrating how absurd the problem is
as posed.  No way, dude.
> It sounds like a tall order, it certainly is, and
> it may even be unattainable, pre-Singularity. But it's worth
> having an idea of what the ideal looks like.
Mitch, I hate to say this, but as long as we're asking for an ideal that
unattainable, I'd also like a pony.
I would prefer that we concentrate on how to go from the state of the
world being simply doomed, which is where it is now, to the state where
it is theoretically possible to survive because at least one project
somewhere knows how to build a Friendly AI. Let's try to make
incremental progress on this.
-- Eliezer S. Yudkowsky http://intelligence.org/ Research Fellow, Singularity Institute for Artificial Intelligence