AI firms must protect researchers' intellectual property when training models | Nature

新英文外刊 · WeChat official account · Tech media · 2024-09-02 08:30

Summary of main points

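This article discusses the intellectual property and accountability questions that AI companies face when they train models on academic data. Much of the training data for large language models (LLMs) comes from academic papers, yet few developers disclose exactly which data they used. This has prompted debate over who should get credit for the use of these data and how the data may be used. Meanwhile, the position of the World Intellectual Property Organization (WIPO) on whether such use constitutes copyright infringement remains unclear. Some publishers are asking the courts for clarification, while some AI companies are buying copyright licences to avoid legal disputes. Openly licensed material encourages free use and reuse, but it may still carry some restrictions. Researchers argue that the development of AI models must also respect original work.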

Key points

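Key point 1: The sources of AI training data and the intellectual property questions they raise.

The training data for large language models (LLMs) include many academic papers, yet few developers disclose exactly which data they used. This has led to disputes over intellectual property.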

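Key point 2: The ambiguity of copyright law.

The position of the World Intellectual Property Organization (WIPO) on whether using academic data to train AI models constitutes copyright infringement remains unclear.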

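Key point 3: How publishers and AI companies are responding.

Some publishers are asking the courts for clarification, while some AI companies are buying copyright licences to avoid legal disputes. At the same time, some companies and researchers are calling for clearer regulation and greater transparency.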

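Key point 4: The need for further research.

Researchers are calling for more research into more radical solutions, such as new types of licence or changes to copyright law.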

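Key point 5: AI development must also respect original work.

As AI models develop, the value of original work and the rights of its creators need to be taken into account.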

Article preview

AI firms must play fair when they use academic data in training

Researchers are among those who feel uneasy about the unrestrained use of their intellectual property in training commercial large language models. Firms and regulators need to agree the rules of engagement.

Nature | Intellectual property | 27 August, 2024 | 946 words | ★★★★☆

No one knows for sure exactly what ChatGPT - the most famous product of artificial intelligence - and similar tools were trained on. But millions of academic papers scraped from the web are among the reams of data that have been fed into large language models (LLMs) that generate text, and similar algorithms that make images. Should the creators of such training data get credit - and if so, how? There is an urgent need for more clarity around the boundaries of acceptable use.

Few LLMs - even those described as ‘open’ - have developers who are upfront about exactly which data were ………………………………
