Google's AI trained on controversial websites and web sewers.

April 20, 2023 at 07:30 AM

TL;DR Summary

Google's C4 dataset, used to train large language models, contains problematic and harmful content from websites such as Stormfront, Kiwi Farms, and 4chan. Efforts are made to filter out unwanted content, but the review process is imperfect. The dataset also includes copyrighted material, and it is unclear whether companies that use it in AI products are liable for infringement. The investigation highlights the risk that next-generation machine-learning systems will behave inappropriately and unreliably because they have ingested this material.

Topics: technology #artificial-intelligence #c4-dataset #google #large-language-models #toxic-content #training-data

4chan and other web sewers scraped up into Google's mega-library for training ML (The Register)
Search the 15 million websites in Google's C4 dataset (Search Engine Land)
Meta, Google AI Partly Trained on Breitbart, RT: Study (Gizmodo)
Philosophy Sites in the Google Dataset Used to Train Some LLMs (Daily Nous)
A Handful of Thoughts on Where AI Learned to Speak: From Business, Racists, and Thieves (Daily Kos)

