The Controversial Sources of AI Training Data

Source: Gizmodo
TL;DR Summary

Meta's LLaMA AI model, introduced in February, was trained in part on the controversial C4 dataset, which includes text scraped from sites promoting conspiracy theories, pornography, and hate content, as well as from the Russian propaganda site Russia Today and the ultra-right-wing Breitbart. The dataset also includes half a million personal blogs and voter registration databases, raising privacy concerns. C4 has been used in other major AI projects, including Google's T5 text-to-text transformer model. Biased training data can lead to biased outputs, as recent research on ChatGPT has demonstrated.
