Google's AI trained on controversial websites and web sewers.

TL;DR Summary
Google's C4 dataset, used to train large language models, contains problematic and harmful content from websites such as Stormfront, Kiwi Farms, and 4chan. Efforts are made to filter out unwanted content, but the review process is imperfect. The dataset also includes copyrighted material, and it is unclear whether companies that use it to build AI products are liable for infringement. The investigation highlights the risk that next-generation machine-learning systems will behave inappropriately and unreliably because they were trained on this material.
Topics: technology, artificial intelligence, C4 dataset, Google, large language models, toxic content, training data
- 4chan and other web sewers scraped up into Google's mega-library for training ML (The Register)
- Search the 15 million websites in Google's C4 dataset (Search Engine Land)
- Meta, Google AI Partly Trained on Breitbart, RT: Study (Gizmodo)
- Philosophy Sites in the Google Dataset Used to Train Some LLMs (Daily Nous)
- A Handful of Thoughts on Where AI Learned to Speak: From Business, Racists, and Thieves (Daily Kos)