The Dark Web's Role in Training AI: Fresh Concerns and Secret Sources.

TL;DR Summary
The Washington Post has built a search tool that lets users check whether their website or content appears in Google's C4 dataset, which was used to train AI systems. The dataset includes websites and content creators that generative AI could negatively affect. C4 is only part of the training data behind Google Bard and other large language models, which also draw on Wikipedia, Reddit, and other sources. Reddit has updated its API terms and will now charge some companies, including Google and OpenAI, for access to its valuable corpus of data.
- Search the 15 million websites in Google's C4 dataset (Search Engine Land)
- 4chan and other web sewers scraped up into Google's mega-library for training ML (The Register)
- Fresh concerns raised over sources of training material for AI systems (The Guardian)
- Inside the secret list of websites that make AI like ChatGPT sound smart (The Washington Post)
- A Handful of Thoughts on Where AI Learned to Speak: From Business, Racists, and Thieves (Daily Kos)
Want the full story? Read the original article on Search Engine Land.