arXiv 2212.08073

Constitutional AI: Harmlessness from AI Feedback

By Yuntao Bai, Saurav Kadavath, et al.

Published 2022-12-15

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and…
