AI Training Copyright Debate

This cluster discusses the ethics, legality, and perceived hypocrisy of AI companies such as OpenAI, which train models on copyrighted or unlicensed data (e.g., scraped web content, pirated ebooks) under fair use claims while prohibiting others from using their model outputs for training.

Trend: Stable (0.7x). Category: AI & Machine Learning

Comments: 4,598

Topic ID: #5523

Activity Over Time

Comments per year:

2008-2020: (no counts shown)
2021: 106
2022: 290
2023: 1,467
2024: 1,197
2025: 1,201
2026: 100

Top Contributors

simonw (36) JohnFen (35) gumballindie (31) __loam (26) kmeisthax (21)

Keywords

AI AGI LLM HN ML stackdiary.com OAI i.ibb OS GPT train training models openai data ai training data fair use ai models trained

Sample Comments

jofzar • Jan 30, 2025

Sorry, it's now a problem to train off other people's data? Surely openai has never trained off other people's data without permission...

NewJazz • Jan 10, 2024

So OpenAI license the content they train on... They just admitted it has value.

WithinReason • Nov 27, 2025

Wouldn't it be still legal to train on the data due to fair use?

oh_sigh • Feb 4, 2023

Why not? Open AI used data that they didn't receive permission from the author to train their models.

elcomet • Jun 17, 2023

It was trained on data they don't own. They could face a lawsuit for this, like it has happened for image generation models.

ares623 • Oct 3, 2025

Can the company just claim it’s for AI training and it’s fair use?

skilled • Aug 3, 2023

It’s not so innocent, https://stackdiary.com/brave-selling-copyrighted-data-for-ai...

jacooper • Apr 19, 2023

That only works when ai training isnt considered fair use.

misnome • Jan 13, 2024

Don't they have an explicit T&C that says you are not allowed to use their output for training other models?

hamasho • Jul 30, 2024

Probably that data was used to train AI models too. I hope we establish a legal framework that prevents training models without proper permission, and the companies that have already trained their models will get fined and those models will be banned from commercial use. I enjoy the rapid progress of LLMs. ChatGPT and Claude are already a critical part of my daily work. But I don't like the current situation where VCs and start-ups use unpermitted data to train the models, don't resp