AI Training Copyright Debate

This cluster discusses the ethics and legality of AI companies such as OpenAI training models on copyrighted or unlicensed data (e.g., scraped web content, pirated ebooks) under fair use claims, and the perceived hypocrisy of those same companies prohibiting others from training on their models' outputs.

➡️ Stable 0.7x AI & Machine Learning
Comments: 4,598
Years Active: 19
Top Authors: 5
Topic ID: #5523

Activity Over Time

Year: Comments
2008: 1
2009: 1
2010: 1
2011: 1
2012: 1
2013: 1
2014: 6
2015: 4
2016: 26
2017: 28
2018: 45
2019: 76
2020: 46
2021: 106
2022: 290
2023: 1,467
2024: 1,197
2025: 1,201
2026: 100

Keywords

AI AGI LLM HN ML stackdiary.com OAI i.ibb OS GPT train training models openai data ai training data fair use ai models trained

Sample Comments

jofzar Jan 30, 2025

Sorry, it's now a problem to train off other people's data? Surely openai has never trained off other people's data without permission...

NewJazz Jan 10, 2024

So OpenAI license the content they train on... They just admitted it has value.

WithinReason Nov 27, 2025

Wouldn't it be still legal to train on the data due to fair use?

oh_sigh Feb 4, 2023

Why not? OpenAI trained their models on data without receiving permission from the authors.

elcomet Jun 17, 2023

It was trained on data they don't own. They could face a lawsuit for this, as has happened with image generation models.

ares623 Oct 3, 2025

Can the company just claim it’s for AI training and it’s fair use?

skilled Aug 3, 2023

It’s not so innocent, https://stackdiary.com/brave-selling-copyrighted-data-for-ai...

jacooper Apr 19, 2023

That only works when AI training isn't considered fair use.

misnome Jan 13, 2024

Don't they have an explicit T&C that says you are not allowed to use their output for training other models?

hamasho Jul 30, 2024

Probably that data was used to train AI models too. I hope we establish a legal framework that prevents training models without proper permission, and the companies that have already trained their models will get fined and those models will be banned from commercial use.

I enjoy the rapid progress of LLMs. ChatGPT and Claude are already a critical part of my daily work. But I don't like the current situation where VCs and start-ups use unpermitted data to train the models, don't resp