arXiv 2501.13011

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking

By Sebastian Farquhar, Vikrant Varma, et al.

Published 2025-01-22

Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks") even if humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Ap…
