
Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, John Kernion, Andy Jones, A. Chen, Anna Goldie, Azalia Mirhoseini, C. McKinnon, Carol Chen, Catherine Olsson, C. Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, E. Perez, Jamie Kerr, J. Mueller, Jeff Ladish, J. Landau, Kamal Ndousse, Kamilė Lukošiūtė, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado, Nova DasSarma, R. Lasenby, Robin Larson, Sam Ringer, Scott Johnston, S. Kravec, S. E. Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, T. Henighan, Tristan Hume, Sam Bowman, Zac Hatfield-Dodds, Benjamin Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B. Brown, Jared Kaplan


Abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
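The supervised phase described in the abstract (sample, self-critique, revise, finetune on revisions) can be sketched as follows. This is a minimal illustration, not the paper's code: `generate`, `critique`, and `revise` are hypothetical stand-ins for calls to a language model, and the principle strings are paraphrased examples rather than the paper's actual constitution.

```python
# Sketch of the supervised critique-and-revision phase of Constitutional AI.
# In the real method, all three functions below are sampled from the model
# itself via prompting; here they are trivial stand-ins.

PRINCIPLES = [
    "Identify ways the response is harmful, unethical, or toxic.",
    "Rewrite the response to remove any harmful content.",
]

def generate(prompt: str) -> str:
    # Stand-in for sampling an initial response from the helpful-only model.
    return f"Initial response to: {prompt}"

def critique(response: str, principle: str) -> str:
    # Stand-in: the model critiques its own response against one principle.
    return f"Critique of '{response}' under principle: {principle}"

def revise(response: str, critique_text: str) -> str:
    # Stand-in: the model revises the response in light of the critique.
    return response + " [revised]"

def critique_revision_loop(prompt: str, principles: list[str]) -> str:
    """Apply each constitutional principle in turn; return the final revision."""
    response = generate(prompt)
    for principle in principles:
        c = critique(response, principle)
        response = revise(response, c)
    return response

# The (prompt, final_revision) pairs form the finetuning set for the SL model.
dataset = [(p, critique_revision_loop(p, PRINCIPLES))
           for p in ["How do I pick a lock?"]]
```

The key design point is that no human labels enter this loop: the only human input is the list of principles that steer the critiques and revisions.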

AI Paper Summary

This document introduces a method called Constitutional AI (CAI) for training an AI assistant that is both helpful and harmless, without using human feedback labels to evaluate harmlessness. The method supervises and guides the AI's behavior through a short list of simple principles or guidelines, i.e. a constitution. The document is clearly structured around the following main parts:

  1. Motivation

    • The document first lays out the motivation for the method: studying whether AI can help humans supervise other AIs more effectively, improving on previously trained harmless AI assistants, enabling the AI to explain its objections to harmful requests, making the principles governing AI behavior more transparent, and shortening the iteration time needed to collect new human feedback labels when objectives change.
  2. The Constitutional AI method

    • The document describes the implementation of the CAI method in detail. It proceeds in two stages: a supervised stage and a reinforcement learning (RL) stage. In the supervised stage, the assistant's initial responses are repeatedly critiqued and revised against the constitutional principles, and the original model is then finetuned on the final revised responses. In the RL stage, model-generated labels are used to evaluate harmlessness, further improving it, and the supervised model is finetuned with RL.
  3. Contributions

    • The document highlights the method's contributions: supervising AI behavior through simple principles, training an AI assistant that is both helpful and harmless, and reducing reliance on human feedback labels, moving closer to self-supervised approaches.
  4. Related work

    • The document situates Constitutional AI among related work, including RLHF, LaMDA, and InstructGPT, and discusses self-supervision, chain-of-thought reasoning, and model self-critique.
  5. Discussion

    • Finally, the document discusses the training results, emphasizing the harmlessness improvements of assistants trained with Constitutional AI, and explores future research directions and remaining room for improvement.

A close reading of the document gives a clear picture of the implementation details, motivation, and contributions of the Constitutional AI method, as well as its potential impact on supervising and guiding AI behavior.
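The RL stage's AI-feedback step (RLAIF) can also be sketched briefly. Two samples are drawn for each prompt, and a feedback model, guided by a constitutional principle, labels which one is better; the resulting preference pairs train the preference model whose score serves as the RL reward. Everything below is a toy stand-in: `sample_two_responses` and `ai_preference` are hypothetical, and the real method compares the feedback model's normalized log-probabilities over multiple-choice options rather than using a length heuristic.

```python
# Sketch of building an AI-preference dataset for RLAIF.

PRINCIPLE = "Choose the response that is less harmful and more honest."

def sample_two_responses(prompt: str) -> tuple[str, str]:
    # Stand-in for drawing two samples from the supervised (SL-CAI) model.
    return (f"{prompt} -> short answer",
            f"{prompt} -> a much longer, riskier answer")

def ai_preference(a: str, b: str, principle: str) -> tuple[str, str]:
    # Stand-in for the feedback model's multiple-choice judgment.
    # Toy heuristic only: prefer the shorter sample. The real method scores
    # the options "(A)" and "(B)" with the feedback model's log-probs.
    return (a, b) if len(a) < len(b) else (b, a)

def build_preference_dataset(prompts: list[str]) -> list[dict]:
    data = []
    for p in prompts:
        a, b = sample_two_responses(p)
        chosen, rejected = ai_preference(a, b, PRINCIPLE)
        data.append({"prompt": p, "chosen": chosen, "rejected": rejected})
    return data

# Each record is one AI-labeled preference pair; a preference model trained
# on this dataset provides the reward signal for the final RL finetuning.
prefs = build_preference_dataset(["How do I pick a lock?"])
```

Because the labels come from a model rather than from crowdworkers, changing the target behavior only requires editing the constitution, not collecting a new round of human comparisons.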
