
Can AI-Generated Text be Reliably Detected?

[{"authorId":"150333898","name":"Vinu Sankar Sadasivan"},{"authorId":"31910622","name":"Aounon Kumar"},{"authorId":"144021807","name":"S. Balasubramanian"},{"authorId":"47825356","name":"Wenxiao Wang"},{"authorId":"34389431","name":"S. Feizi"}]



Abstract

The unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc. Therefore, reliable detection of AI-generated text can be critical to ensure the responsible use of LLMs. Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques that imprint specific patterns onto them. In this paper, we show that these detectors are not reliable in practical scenarios. In particular, we develop a recursive paraphrasing attack to apply on AI text, which can break a whole range of detectors, including the ones using the watermarking schemes as well as neural network-based detectors, zero-shot classifiers, and retrieval-based detectors. Our experiments include passages around 300 tokens in length, showing the sensitivity of the detectors even in the case of relatively long passages. We also observe that our recursive paraphrasing only degrades text quality slightly, measured via human studies, and metrics such as perplexity scores and accuracy on text benchmarks. Additionally, we show that even LLMs protected by watermarking schemes can be vulnerable against spoofing attacks aimed to mislead detectors to classify human-written text as AI-generated, potentially causing reputational damages to the developers. In particular, we show that an adversary can infer hidden AI text signatures of the LLM outputs without having white-box access to the detection method. Finally, we provide a theoretical connection between the AUROC of the best possible detector and the Total Variation distance between human and AI text distributions that can be used to study the fundamental hardness of the reliable detection problem for advanced language models. Our code is publicly available at https://github.com/vinusankars/Reliability-of-AI-text-detectors.

AI Summary of the Paper

This paper provides a comprehensive analysis of the reliability of methods for detecting AI-generated text. First, the authors evaluate detectors from four categories (watermarking-based, neural-network-based, zero-shot, and retrieval-based) and expose their reliability problems. In particular, the recursive paraphrasing attack the authors develop is the first method shown to break both watermarking-based and retrieval-based detectors while causing only a slight degradation in text quality. Second, the authors show that existing detectors are vulnerable to spoofing: an adversary can compose text that is misclassified as AI-generated without white-box access to the detection method. Finally, they establish a theoretical connection between the AUROC of the best possible detector and the total variation distance between the human and AI text distributions, which is used to study the fundamental hardness of the reliable text detection problem.
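The recursive paraphrasing attack works by feeding a passage through a paraphraser several times: each pass rewrites surface-level token patterns (the very patterns watermarking schemes imprint and neural detectors key on) while largely preserving meaning. Below is a minimal sketch of the loop, assuming a generic Hugging Face seq2seq paraphraser; the model name, prompt prefix, and decoding settings are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a recursive paraphrasing attack.
# NOTE: the model name, "paraphrase:" prefix, and decoding settings are
# illustrative assumptions; the paper uses dedicated paraphrasing models.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "humarin/chatgpt_paraphraser_on_T5_base"  # assumed paraphraser

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def paraphrase_once(text: str) -> str:
    """Rewrite a passage once; sampling adds the diversity that
    perturbs watermark and stylistic signatures."""
    inputs = tokenizer("paraphrase: " + text, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.95,
        max_new_tokens=512,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def recursive_paraphrase(text: str, rounds: int = 3) -> str:
    """Apply the paraphraser repeatedly; each round further erodes
    detector signatures at a small cost in text quality."""
    for _ in range(rounds):
        text = paraphrase_once(text)
    return text

# Usage: detection scores should drop across rounds while perplexity
# and human-rated quality degrade only slightly.
# attacked = recursive_paraphrase(ai_generated_passage, rounds=3)
```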

On the experimental side, the authors use several datasets, including XSum, PubMedQA, and Kafkai, and several target language models, including OPT-1.3B and GPT-2-Medium. They carry out recursive paraphrasing attack experiments and show the attack's effect on the original text content. They also demonstrate possible spoofing attacks and present theoretical results on the hardness of detecting AI-generated text.
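The spoofing direction can be made concrete for soft watermarking, where generation is biased toward a pseudo-random "green list" of tokens seeded by the preceding token. An adversary who queries the watermarked model many times can estimate these green lists purely from output frequencies, with no white-box access to the detector, and then compose human-written text that draws heavily on inferred green tokens so it gets flagged as AI-generated. The sketch below shows only the counting step; the function names and the top-k heuristic are illustrative assumptions, not the paper's exact estimator.

```python
# Hedged sketch of green-list inference for spoofing a soft watermark.
# Assumption: the green list is seeded by the previous token, so tokens
# that follow a given token unusually often across many watermarked
# samples are likely green. top_k is an illustrative heuristic.
from collections import Counter, defaultdict

def estimate_green_lists(watermarked_texts, tokenize, top_k=50):
    """Estimate per-token green lists from next-token frequencies."""
    next_counts = defaultdict(Counter)
    for text in watermarked_texts:
        tokens = tokenize(text)
        for prev, nxt in zip(tokens, tokens[1:]):
            next_counts[prev][nxt] += 1
    # Over-represented successors of `prev` are inferred to be green.
    return {
        prev: {tok for tok, _ in counts.most_common(top_k)}
        for prev, counts in next_counts.items()
    }

# Usage (black-box): collect many generations from the watermarked LLM,
# then compose text using mostly inferred green tokens to trigger the
# watermark detector on human-written content.
# green = estimate_green_lists(samples, tokenize=str.split)
```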

Overall, this paper offers an in-depth analysis of existing detection methods for AI-generated text, exposes their fragility and limitations, and provides theoretical results that explain the difficulty of the reliable text detection problem.
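The theoretical result can be stated concretely. Writing M for the AI text distribution, H for the human text distribution, and TV for total variation distance, the paper's bound on any detector D takes the following form (restated here from the arXiv version; worth verifying against the paper's exact statement):

```latex
% Hardness bound: detection AUROC is capped by the distance between
% the AI text distribution M and the human text distribution H.
\mathrm{AUC}(D) \;\le\; \frac{1}{2}
  + \mathrm{TV}(\mathcal{M}, \mathcal{H})
  - \frac{\mathrm{TV}(\mathcal{M}, \mathcal{H})^{2}}{2}
```

As language models improve and TV(M, H) shrinks toward 0, the right-hand side approaches 1/2, i.e., the best possible detector degrades toward a random classifier.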
