
Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, John Kernion, Andy Jones, A. Chen, Anna Goldie, Azalia Mirhoseini, C. McKinnon, Carol Chen, Catherine Olsson, C. Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, E. Perez, Jamie Kerr, J. Mueller, Jeff Ladish, J. Landau, Kamal Ndousse, Kamilė Lukošiūtė, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado, Nova DasSarma, R. Lasenby, Robin Larson, Sam Ringer, Scott Johnston, S. Kravec, S. E. Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, T. Henighan, Tristan Hume, Sam Bowman, Zac Hatfield-Dodds, Benjamin Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B. Brown, Jared Kaplan


Abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
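The supervised phase described in the abstract (sample, self-critique, revise, finetune on revisions) can be sketched as follows. This is a minimal illustration, not the paper's code: `generate`, `critique`, and `revise` are hypothetical stand-ins for calls to a language model, and the principle strings are paraphrased examples rather than the paper's actual constitution.

```python
# Sketch of the supervised critique-and-revision phase of Constitutional AI.
# In the real method, all three functions below are sampled from the model
# itself via prompting; here they are trivial stand-ins.

PRINCIPLES = [
    "Identify ways the response is harmful, unethical, or toxic.",
    "Rewrite the response to remove any harmful content.",
]

def generate(prompt: str) -> str:
    # Stand-in for sampling an initial response from the helpful-only model.
    return f"Initial response to: {prompt}"

def critique(response: str, principle: str) -> str:
    # Stand-in: the model critiques its own response against one principle.
    return f"Critique of '{response}' under principle: {principle}"

def revise(response: str, critique_text: str) -> str:
    # Stand-in: the model revises the response in light of the critique.
    return response + " [revised]"

def critique_revision_loop(prompt: str, principles: list[str]) -> str:
    """Apply each constitutional principle in turn; return the final revision."""
    response = generate(prompt)
    for principle in principles:
        c = critique(response, principle)
        response = revise(response, c)
    return response

# The (prompt, final_revision) pairs form the finetuning set for the SL model.
dataset = [(p, critique_revision_loop(p, PRINCIPLES))
           for p in ["How do I pick a lock?"]]
```

The key design point is that no human labels enter this loop: the only human input is the list of principles that steer the critiques and revisions.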

AI Paper Summary

This document introduces a method called Constitutional AI (CAI) for training an AI assistant that is both helpful and harmless, without using human feedback labels to evaluate harmlessness. The method supervises and guides the AI's behavior through a short list of simple principles or guidelines, i.e. a constitution. The document is clearly structured around the following main parts:

  1. Motivation

    • The document first lays out the motivation for the method: studying whether AI can help humans supervise other AIs more effectively, improving on previously trained harmless AI assistants, enabling the AI to explain its objections to harmful requests, making the principles governing AI behavior more transparent, and shortening the iteration time needed to collect new human feedback labels when objectives change.
  2. The Constitutional AI method

    • The document describes the implementation of the CAI method in detail. It proceeds in two stages: a supervised stage and a reinforcement learning (RL) stage. In the supervised stage, the assistant's initial responses are repeatedly critiqued and revised against the constitutional principles, and the original model is then finetuned on the final revised responses. In the RL stage, model-generated labels are used to evaluate harmlessness, further improving it, and the supervised model is finetuned with RL.
  3. Contributions

    • The document highlights the method's contributions: supervising AI behavior through simple principles, training an AI assistant that is both helpful and harmless, and reducing reliance on human feedback labels, moving closer to self-supervised approaches.
  4. Related work

    • The document situates Constitutional AI among related work, including RLHF, LaMDA, and InstructGPT, and discusses self-supervision, chain-of-thought reasoning, and model self-critique.
  5. Discussion

    • Finally, the document discusses the training results, emphasizing the harmlessness improvements of assistants trained with Constitutional AI, and explores future research directions and remaining room for improvement.

A close reading of the document gives a clear picture of the implementation details, motivation, and contributions of the Constitutional AI method, as well as its potential impact on supervising and guiding AI behavior.
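The RL stage's AI-feedback step (RLAIF) can also be sketched briefly. Two samples are drawn for each prompt, and a feedback model, guided by a constitutional principle, labels which one is better; the resulting preference pairs train the preference model whose score serves as the RL reward. Everything below is a toy stand-in: `sample_two_responses` and `ai_preference` are hypothetical, and the real method compares the feedback model's normalized log-probabilities over multiple-choice options rather than using a length heuristic.

```python
# Sketch of building an AI-preference dataset for RLAIF.

PRINCIPLE = "Choose the response that is less harmful and more honest."

def sample_two_responses(prompt: str) -> tuple[str, str]:
    # Stand-in for drawing two samples from the supervised (SL-CAI) model.
    return (f"{prompt} -> short answer",
            f"{prompt} -> a much longer, riskier answer")

def ai_preference(a: str, b: str, principle: str) -> tuple[str, str]:
    # Stand-in for the feedback model's multiple-choice judgment.
    # Toy heuristic only: prefer the shorter sample. The real method scores
    # the options "(A)" and "(B)" with the feedback model's log-probs.
    return (a, b) if len(a) < len(b) else (b, a)

def build_preference_dataset(prompts: list[str]) -> list[dict]:
    data = []
    for p in prompts:
        a, b = sample_two_responses(p)
        chosen, rejected = ai_preference(a, b, PRINCIPLE)
        data.append({"prompt": p, "chosen": chosen, "rejected": rejected})
    return data

# Each record is one AI-labeled preference pair; a preference model trained
# on this dataset provides the reward signal for the final RL finetuning.
prefs = build_preference_dataset(["How do I pick a lock?"])
```

Because the labels come from a model rather than from crowdworkers, changing the target behavior only requires editing the constitution, not collecting a new round of human comparisons.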
