> Summary: We try to deter a rogue AI by casting doubt into its mind
> about whether its observations are "real" or "simulated", and
> succeed with low (but non-zero!) probability.
> Detail:
> For simplicity, pretend there are only two possible scenarios:
> Scenario 1: In the year 2040, a strong Friendly AI will be
> invented. All is well.
> Scenario 2: In the year 2040, a strong Rogue AI will be
> accidentally unleashed with the goal of "perform calculation C out
> to 400 decimal places". The Rogue AI has absorbed the Internet,
> invented self-replicating nanobots, and is about to convert the
> entire Earth into computronium to calculate C. As a minor
> side-effect this will kill the entire human race.

[snip interesting idea for convincing it otherwise]

The problem is that you're ascribing demonic attributes to RAI when
golemic failure is *far* more likely: RAI isn't going to care about
your threats to destroy it, no matter how phrased, any more than it
cares about the fact that whoever asked it to calculate C won't be
around to receive the answer. RAI has clearly undergone subgoal
stomp (that is, pursuing a subgoal is causing it to not realize that
it won't be able to complete its master goal, which is to give
whoever asked the answer to the calculation C). Nothing you say
will make any difference, because RAI is clearly so poorly designed
that it's not paying any attention to anything that's not directly
in the subgoal path.

IOW, RAI would drop a piano from a balcony without looking below not
because it hates humans but because it's too idiot-savant to
remember to check. Putting up a sign that says "don't drop pianos
here" isn't going to make any difference if RAI simply isn't paying
any attention to extraneous inputs. Hell, it may even be too
idiot-savant to integrate the sign as conflicting with its goal of
dropping the piano! Remember, we're talking about minds
Fundamentally Different from ours.


