
The Controversial Sources of AI Training Data
Meta's LLaMA AI, introduced in February, was partially trained on the controversial C4 dataset, which includes text scraped from sites promoting conspiracies, porn, and hate content, as well as from Russian propaganda site Russia Today and ultra-right-wing Breitbart. The dataset also contains half a million personal blogs, along with voter registration databases, raising privacy concerns. C4 has been used in other major AI projects, including Google's T5 text-to-text transformer model. Biased training data can lead to biased outputs, as recent research on ChatGPT has demonstrated.