EleutherAI has released The Common Pile v0.1, an 8TB dataset of openly licensed and public-domain text for training AI models, aiming to increase transparency and reduce reliance on copyrighted material. The dataset was used to develop models that perform comparably to proprietary ones, challenging the notion that unlicensed data is necessary for high performance. The release is part of a broader effort to promote open data and transparency in AI research amid ongoing legal debates.
Major tech companies including Apple, Salesforce, and Anthropic have trained their AI models on captions from YouTube videos without creators' consent, potentially violating YouTube's terms. The captions come from a dataset known as "the Pile," which was compiled by EleutherAI and includes captions from over 173,000 YouTube videos. Content creators are frustrated by and critical of this unauthorized use, which raises concerns about intellectual property rights and the ethics of data scraping.
An investigation revealed that over 170,000 YouTube videos were used without permission to train AI systems for companies like Apple, Anthropic, Nvidia, and Salesforce. The dataset, part of EleutherAI's The Pile, includes subtitles from videos by popular creators and news outlets. This practice raises concerns about data transparency and potential violations of YouTube's terms of service.
Apple and other tech giants reportedly trained AI models using subtitle files from over 170,000 YouTube videos without creators' consent, violating YouTube's terms. The data was downloaded by EleutherAI, a non-profit, and included in a dataset called the Pile, which was used by companies like Apple, Nvidia, and Salesforce. This raises concerns about the legal implications of using web-scraped data for AI training.
Former Arkansas Governor Mike Huckabee and other authors have filed a lawsuit against Meta, Microsoft, EleutherAI, and Bloomberg, alleging that their books were pirated and used in datasets to train AI models without permission or compensation. This class action suit is the latest in a series of lawsuits in which authors accuse tech companies of copyright infringement in the development of generative AI models. The case revolves around a dataset called "Books3," which contains over 180,000 works and is part of a larger collection called the Pile. AI companies rely on vast amounts of public data for training, which has led to debates and legal actions over compensation for data providers.
Together Computer has released OpenChatKit, an open-source alternative to ChatGPT that gives developers more control over chatbot behavior and customization. The kit includes a large language model fine-tuned for chat, instructions on fine-tuning for specific tasks, an extensible retrieval system, and a moderation system. The model has limitations, but it is a good initiative with the potential to improve through community contributions.