Synthetic Data in AI Training

This cluster covers the use, generation, and effectiveness of synthetic data for training AI models, especially LLMs, including its benefits, risks such as model collapse, and examples from companies such as OpenAI.

➡️ Stable 1.0x AI & Machine Learning
Comments: 1,641
Years Active: 16
Top Authors: 5
Topic ID: #4013

Activity Over Time

Comments per year:
2009: 2
2012: 1
2013: 2
2014: 4
2015: 4
2016: 24
2017: 29
2018: 39
2019: 85
2020: 145
2021: 51
2022: 105
2023: 308
2024: 425
2025: 392
2026: 25

Keywords

LLM AlphaZero capitalone.com SMOTE AI RAG CAD RL simonwillison.net L3 synthetic data training models training data generated dataset datasets generate trained

Sample Comments

sebzim4500 Feb 21, 2024

We know OpenAI trains on significant amounts of synthetic data, they probably have something like this.

aroo Nov 26, 2023

Sounds like something right up the domain of synthetic data.

alansaber Dec 1, 2025

Since synthetic data for training is pretty ubiquitous seems like a novelty

jxdxbx Mar 9, 2024

How does this relate to synthetic data?

handfuloflight Apr 14, 2025

What about the role of synthetic data?

heavyset_go Apr 30, 2024

I agree with your point, just want to point out that models have been trained on AI generated prompts as synthetic data.

Leynos Jun 12, 2025

Synthetic training data presumably.

kiran30B Feb 15, 2024

What does your synthetic data pipeline look like?

repeat_or Mar 28, 2022

Synthetic data is algorithmically generated data that mirrors the statistical properties of the dataset it’s based on. Learn how to make high-quality synthetic data.

ag408 Apr 13, 2022

Synthetic data is algorithmically generated data that mirrors the statistical properties of the dataset it’s based on. Learn how to make high-quality synthetic data.