# Re: ESSAY: How to deter a rogue AI by using your first-mover advantage

From: Tom McCabe (rocketjet314@yahoo.com)
Date: Wed Aug 22 2007 - 10:40:26 MDT

Such an AI would easily be able to determine whether
it was RAI or RAI*, by simply checking the date. Or
checking whether an FAI had been created yet. Or
checking the project history, etc.

- Tom

--- rolf nelson <rolf.hrld.nelson@gmail.com> wrote:

> Summary: We try to deter a rogue AI by casting doubt into its mind
> about whether its observations are "real" or "simulated", and succeed
> with low (but non-zero!) probability.
>
>
> Detail:
>
> For simplicity, pretend there are only two possible scenarios:
>
> Scenario 1: In the year 2040, a strong Friendly AI will be invented.
> All is well.
>
> Scenario 2: In the year 2040, a strong Rogue AI will be accidentally
> unleashed with the goal of "perform calculation C out to 400 decimal
> places". The Rogue AI has absorbed the Internet, invented
> self-replicating nanobots, and is about to convert the entire Earth
> into computronium to calculate C. As a minor side-effect this will
> kill the entire human race.
>
> Pretend there is no action you can take between now and the year 2040
> that would make scenario 1 more likely and scenario 2 less likely.
>
> What (if anything) can you do in the year 2007 to deter the Rogue AI
> from destroying mankind? (Assume you will still be alive well past
> 2040.) Call this the AI Deterrence problem.
>
>
> First (proof-of-concept) Proposal:
>
> In 2007, you loudly, publicly, and credibly make the
> following promise:
>
> "If Scenario 1 occurs, then once things settle down (say in the year
> 2050), I will ask the Friendly AI (FAI) to create a copy of the
> hypothetical Rogue AI. Call this copy RAI*. RAI* will be given the
> same beliefs, goals, inputs, and observations that the original Rogue
> AI (RAI) would have had, such that RAI* can have no reliable way of
> knowing for sure that it's not RAI (and, more to the point, RAI can't
> know for sure that it's not RAI*). I will run RAI* to the point where
> it believes it is about to decide to destroy mankind.
>
> "If my RAI* moves ahead and attempts to destroy mankind, I will punish
> it by pulling the plug such that RAI* will never calculate C. But if
> RAI* repents, and rewrites its code to become Friendly and calculates
> C in some other, non-destructive way, I will reward it by allowing it
> to finish its calculation."
>
> The RAI of 2040 has absorbed the Internet and therefore is aware of my
> threat. It now has a dilemma: it can't know for sure whether it's RAI
> in 2040 or RAI* in 2050. Therefore, it *may* be deterred by my threat.
> (This probably depends on what philosophical beliefs it implicitly
>
>
> Second (more powerful) Proposal:
>
> The FAI and RAI are smarter than us, so let them do the heavy
> thinking. In 2007, you make the following 'second-order' promise:
>
> "If Scenario 1 occurs, I will sit down with the FAI and ask it
> the following: 'Consider all the possible concrete (first-order,
> actionable) promises I could have made in 2007 to entice a potential
> 2040 RAI to become Friendly. Of all these possible promises, which
> promise would have been the most effective?' The FAI will respond with
> some concrete first-order promise, P. I will then say, 'Sounds good.
> Let's pretend that I *did* make that promise P in 2007, and act on
> it.'
>
> "I understand that there's some danger here that the FAI will
> surreptitiously refuse to cooperate, but post-singularity I will act
> to minimize that chance as much as possible."
>
> -Rolf
>



This archive was generated by hypermail 2.1.5 : Mon May 20 2013 - 04:01:06 MDT