The Controversial Sources of AI Training Data.

April 19, 2023 at 04:07 PM

TL;DR Summary

Meta's LLaMA language model, introduced in February, was trained in part on the controversial C4 dataset, which includes text scraped from sites promoting conspiracy theories, pornography, and hate content, as well as the Russian propaganda site Russia Today (RT) and the ultra-right-wing site Breitbart. The dataset also contains half a million personal blogs and voter registration databases, raising privacy concerns. C4 has been used in other major AI projects, including Google's T5 text-to-text transformer model. Biased training data can lead to biased outputs, as recent research on ChatGPT has demonstrated.

Meta, Google AI Partly Trained on Breitbart, RT: Study (Gizmodo)
Inside the secret list of websites that make AI like ChatGPT sound smart (The Washington Post)
A Handful of Thoughts on Where AI Learned to Speak: From Business, Racists, and Thieves (Daily Kos)