arXiv 2501.13011

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking

By Sebastian Farquhar, Vikrant Varma, et al.

Published 2025-01-22

Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method that prevents agents from learning undesired multi-step plans that receive high reward (multi-step "reward hacks"), even if humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Ap…
