The Dark Web's Role in Training AI: Fresh Concerns and Secret Sources.

April 20, 2023 at 12:23 PM

1 min read

TL;DR Summary

The Washington Post has created a search tool that lets users check whether their website or content was used to train AI systems as part of Google's C4 dataset. The dataset includes websites and content creators that generative AI could negatively affect. C4 is only part of the data used by Google Bard and other large language models, which also draw on Wikipedia, Reddit, and other sources. Reddit has updated its API terms and will now charge some companies, including Google and OpenAI, for access to its valuable corpus of data.

Topics: #ai #dataset #google #reddit #search-tool #technology

Search the 15 million websites in Google's C4 dataset (Search Engine Land)
4chan and other web sewers scraped up into Google's mega-library for training ML (The Register)
Fresh concerns raised over sources of training material for AI systems (The Guardian)
Inside the secret list of websites that make AI like ChatGPT sound smart (The Washington Post)
A Handful of Thoughts on Where AI Learned to Speak: From Business, Racists, and Thieves (Daily Kos)

Want the full story? Read the original article

Read on Search Engine Land
