Tag

C4 Dataset

All articles tagged with #c4 dataset

artificial-intelligence, 2 years ago

Google's AI trained on controversial websites and web sewers.

Google's C4 dataset, used to train large language models, contains problematic and harmful content from websites such as Stormfront, Kiwi Farms, and 4chan. Efforts are made to filter out unwanted content, but the filtering is imperfect. The dataset also includes copyrighted material, and it is unclear whether companies that use it in AI products are liable for infringement. The investigation highlights the risk that next-generation machine-learning systems will behave inappropriately and unreliably because they ingested this material.

artificial-intelligence, 2 years ago

The Controversial Sources of AI Training Data.

Meta's LLaMA model, introduced in February 2023, was partially trained on the controversial C4 dataset, which includes text scraped from sites promoting conspiracy theories, pornography, and hate content, as well as from the Russian propaganda site Russia Today and the ultra-right-wing Breitbart. C4 also contains half a million personal blogs, along with voter registration databases, raising privacy concerns. It has been used in other major AI projects, including Google's T5 (Text-to-Text Transfer Transformer) model. Biased training data can lead to biased outputs, as demonstrated by recent research on ChatGPT.