
Google's AI trained on controversial websites and web sewers.
Google's C4 dataset (the Colossal Clean Crawled Corpus), used to train large language models, contains problematic and harmful content from websites such as Stormfront, Kiwi Farms, and 4chan. Efforts are made to filter out unwanted content, but the process is imperfect. The dataset also includes copyrighted material, and it is unclear whether companies that use it to build AI products are liable for infringement. An investigation into the dataset's contents highlights the risk that next-generation machine-learning systems will behave inappropriately and unreliably after ingesting such material.
