Adobe hit with proposed class-action lawsuit alleging it misused authors’ works in AI training

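Summary:
Software giant Adobe is facing a proposed class-action lawsuit alleging that it used pirated books to train one of its AI models, renewing attention on the copyright status of AI training data.
The suit was filed on behalf of Elizabeth Lyon, an author from Oregon, and a proposed class of other authors. The complaint says Adobe used a dataset containing large numbers of pirated books to train SlimLM, its series of small language models. Adobe describes the model as one that can be “optimized for document assistance tasks on mobile devices.”
According to the complaint, the training dataset Adobe used, SlimPajama-627B, is derived from another open-source dataset called RedPajama, which includes the controversial Books3 collection. Books3, a collection of 191,000 books, has become a source of legal disputes for several technology companies. Lyon says several of her guidebooks on non-fiction writing were included in the training data without authorization.
This is not an isolated case. In September, Apple was sued for allegedly using copyrighted material to train its Apple Intelligence model, and in the same month the AI company Anthropic agreed to a $1.5 billion settlement with the authors who had accused it of infringement. In October, Salesforce faced a similar lawsuit over its alleged use of the RedPajama dataset.
As AI technology advances, disputes over the copyright status of training data are becoming more prominent. Technology companies routinely train their models on massive datasets, some of which allegedly contain copyrighted material used without authorization. The steady run of lawsuits has become a legal challenge the AI industry cannot avoid.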
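Translation:
Like nearly every other technology company, Adobe has invested heavily in AI in recent years. Since 2023 the software company has launched several AI services, including Firefly, its AI-powered media-generation suite. Now, however, the company’s wholehearted embrace of the technology may have led to trouble: a new lawsuit claims it used pirated books to train one of its AI models.
The proposed class-action lawsuit, filed on behalf of Oregon author Elizabeth Lyon, claims that Adobe used pirated versions of numerous books, including Lyon’s own, to train its SlimLM program. Adobe describes SlimLM as a small language model series that can be “optimized for document assistance tasks on mobile devices,” and states that it was pre-trained on SlimPajama-627B, a “deduplicated, multi-corpora, open-source dataset” released by Cerebras in June 2023. Lyon, who has written several guidebooks on non-fiction writing, says some of her works were included in the pretraining dataset Adobe used.
According to the complaint, which was first reported by Reuters, Lyon’s works appear in a processed, derivative dataset that formed the basis of Adobe’s program: “The SlimPajama dataset was created by copying and manipulating the RedPajama dataset (including copying Books3). Thus, because it is a derivative copy of the RedPajama dataset, SlimPajama contains the Books3 dataset, including the copyrighted works of Plaintiff and the Class members.”
“Books3,” a collection of 191,000 books that has been widely used to train generative AI systems, has been an ongoing source of legal trouble for the tech community. RedPajama has also been cited in several lawsuits. In September, a suit against Apple claimed the company had used copyrighted material to train its Apple Intelligence model; the complaint named the dataset and accused the company of copying protected works “without consent and without credit or compensation.” In October, a similar suit against Salesforce also claimed the company had used RedPajama for training.
Such lawsuits have by now become commonplace for the tech industry. AI algorithms are trained on massive datasets, some of which allegedly include pirated material. In September, Anthropic agreed to pay $1.5 billion to a group of authors who had sued it for using pirated versions of their work to train its chatbot, Claude. The case was seen as a potential turning point among the many ongoing legal battles over copyrighted material in AI training data.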
English source:
Like pretty much every other tech company in existence, Adobe has leaned heavily into AI over the past several years. The software firm has launched a number of different AI services since 2023, including Firefly — its AI-powered media-generation suite. Now, however, the company’s full-throated embrace of the technology may have led to trouble, as a new lawsuit claims it used pirated books to train one of its AI models.
A proposed class-action lawsuit filed on behalf of Elizabeth Lyon, an author from Oregon, claims that Adobe used pirated versions of numerous books — including her own — to train the company’s SlimLM program.
Adobe describes SlimLM as a small language model series that can be “optimized for document assistance tasks on mobile devices.” It states that SlimLM was pre-trained on SlimPajama-627B, a “deduplicated, multi-corpora, open-source dataset” released by Cerebras in June of 2023. Lyon, who has written a number of guidebooks for non-fiction writing, says that some of her works were included in a pretraining dataset that Adobe had used.
Lyon’s lawsuit, which was originally reported on by Reuters, says that her writing was included in a processed subset of a manipulated dataset that was the basis of Adobe’s program: “The SlimPajama dataset was created by copying and manipulating the RedPajama dataset (including copying Books3),” the lawsuit says. “Thus, because it is a derivative copy of the RedPajama dataset, SlimPajama contains the Books3 dataset, including the copyrighted works of Plaintiff and the Class members.”
“Books3” — a huge collection of 191,000 books that have been used to train GenAI systems — has been an ongoing source of legal trouble for the tech community. RedPajama has also been cited in a number of litigation cases. In September, a lawsuit against Apple claimed the company had used copyrighted material to train its Apple Intelligence model. The litigation mentioned the dataset and accused the tech company of copying protected works “without consent and without credit or compensation.” In October, a similar lawsuit against Salesforce also claimed the company had used RedPajama for training purposes.
Unfortunately for the tech industry, such lawsuits have, by now, become somewhat commonplace. AI algorithms are trained on massive datasets and, in some cases, those datasets have allegedly included pirated materials. In September, Anthropic agreed to pay $1.5 billion to a number of authors who had sued it and accused it of using pirated versions of their work to train its chatbot, Claude. The case was considered a potential turning point in the ongoing legal battles over copyrighted material in AI training data, of which there are many.
Article link: https://blog.qimuai.cn/?post=2506