Deception (was Re: What if there's an expiration date?)

From: Tim Freeman
Date: Mon Apr 28 2008 - 19:28:49 MDT

From: "Nick Tarleton"
>So the AI will see no problem with doing something that destroys the
>world a day after expiration, provided it's helpful now and humans
>aren't aware of it (because the knowledge would make us suffer).

Deceiving the humans with false beliefs is essential to losing that
way, and if the AI is motivated to deceive the humans then we have
problems even without an expiration date. So the essential part of
this scenario is the deception, not the expiration date. It's worth
checking that the existing spec [1] deals correctly with deception,
since that was the hardest part to get plausibly correct. Here's how
it works:

The AI explains the behavior of each human as maximizing that human's
estimated utility function.

It matters what the human believes, and humans may be mistaken, so
actually the AI explains each human's behavior as maximizing that
human's utility function, which is a function of that human's
estimated belief about the current state of the world.
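As a rough sketch of that modeling step (the names and numbers here are
hypothetical illustrations, not anything from the spec): the AI predicts
each human's action as the one that maximizes that human's utility
function, scored over the outcome the human *believes* the action will
produce, not the true outcome.

```python
# Sketch only: explain a human's behavior as utility maximization over
# the human's (possibly mistaken) believed outcomes.

def explain_action(actions, utility, believed_outcome):
    """Return the action the model predicts the human will take.

    utility(state) -> float scores a world state for this human.
    believed_outcome(action) -> state is the outcome the human expects,
    under the human's possibly mistaken beliefs.
    """
    return max(actions, key=lambda a: utility(believed_outcome(a)))

# Example: a human who believes the bridge is safe will cross it, even
# if the bridge is actually out; the model has to score believed
# outcomes, not true ones, to explain that behavior.
believed = {"cross": "on far side", "stay": "on near side"}
utility_table = {"on far side": 10, "on near side": 3}
predicted = explain_action(["cross", "stay"],
                           lambda s: utility_table[s],
                           lambda a: believed[a])
```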

We certainly don't want to give the AI an incentive to lie, so the AI
optimizes for the human's utility function applied to the *AI's*
belief about the current state of the world. (This is in contrast to
optimizing for the human's utility function applied to the human's
belief about the current state of the world, which would give the AI
an incentive to lie.)
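The contrast can be made concrete with a toy sketch (all names and
scores are made up for illustration): a "lie" action raises the score of
the human's believed state but not of the AI's predicted state, so under
the spec's objective (human utility applied to the AI's belief) lying
earns nothing.

```python
# Toy scores, hypothetical: what the AI predicts each action really
# achieves, versus what the human would believe it achieved.
ai_predicted_score = {"tell_truth": 5, "lie": 5, "actually_help": 9}
human_believed_score = {"tell_truth": 5, "lie": 9, "actually_help": 9}

def human_utility(score):
    return score  # stand-in for the inferred utility function

# The spec's objective: human utility over the AI's own belief.
chosen = max(ai_predicted_score,
             key=lambda a: human_utility(ai_predicted_score[a]))

# Under the rejected objective (utility over the human's belief),
# "lie" would score as well as "actually_help" -- an incentive to
# deceive that the chosen objective removes.
```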

Any long term planning done by the humans is part of evaluating their
utility function. So if we set aside scenarios where the AI is simply
mistaken about what's going on, the scenario you describe is possible
if the humans really don't care if the world is destroyed a day after
the end of the AI's planning horizon. I don't think real humans want
that, so at least in this case, I think we're okay.

>... a sharp cutoff is Not The Way. Even exponential discounting would
>be superior, even though it has very implausible normative

Thanks for the pointer. I'm looking at it now but haven't read it
yet. Can you point out a scenario where my sharp cutoff gives bad
results?
Tim Freeman

This archive was generated by hypermail 2.1.5 : Wed Jul 17 2013 - 04:01:02 MDT