Tag

EleutherAI

All articles tagged with #eleutherai

EleutherAI Releases Large Open-Source Dataset to Promote Fair and Legal AI Training

Originally Published 7 months ago — by TechCrunch

Source: TechCrunch

EleutherAI has released The Common Pile v0.1, an 8TB dataset of licensed and open-domain text for training AI models, aiming to increase transparency and reduce reliance on copyrighted material. Models trained on the dataset perform comparably to proprietary ones, challenging the notion that unlicensed data is necessary for high performance. The release is part of a broader effort to promote open data and transparency in AI research amid ongoing legal debates.

Tech Giants Used YouTube Videos Without Consent to Train AI

Originally Published 1 year ago — by Ars Technica

Source: Ars Technica

Major tech companies including Apple, Salesforce, and Anthropic have trained their AI models using YouTube videos without creators' consent, potentially violating YouTube's terms of service. The dataset, known as "the Pile," was compiled by EleutherAI and includes captions from over 173,000 YouTube videos. Content creators have criticized this unauthorized use, raising concerns about intellectual property rights and the ethics of data scraping.

Apple Used YouTube Videos Without Consent to Train AI

Originally Published 1 year ago — by The Verge

Source: The Verge

An investigation revealed that over 170,000 YouTube videos were used without permission to train AI systems for companies like Apple, Anthropic, Nvidia, and Salesforce. The dataset, part of EleutherAI's The Pile, includes subtitles from videos by popular creators and news outlets. This practice raises concerns about data transparency and potential violations of YouTube's terms of service.

Apple Used YouTube Content, Including MKBHD, for AI Training Without Consent

Originally Published 1 year ago — by 9to5Mac

Source: 9to5Mac

Apple and other tech giants reportedly trained AI models using subtitle files from over 170,000 YouTube videos without creators' consent, in apparent violation of YouTube's terms of service. The data was downloaded by EleutherAI, a non-profit, and included in a dataset called the Pile, which was used by companies like Apple, Nvidia, and Salesforce. This raises concerns about the legal implications of using web-scraped data for AI training.

Mike Huckabee and Religious Authors Sue Tech Giants Over AI Copyright Infringement

Originally Published 2 years ago — by The Verge

Source: The Verge

Former Arkansas Governor Mike Huckabee and other authors have filed a lawsuit against Meta, Microsoft, EleutherAI, and Bloomberg, alleging that their books were pirated and used in datasets to train AI models without permission or compensation. This class action suit is the latest in a series of authors accusing tech companies of copyright infringement in the development of generative AI models. The case revolves around a dataset called "Books3," which contains over 180,000 works and is part of a larger collection called the Pile. AI companies rely on vast amounts of public data for training, leading to debates and legal actions regarding compensation for data providers.

The Rise of Open-Source ChatGPT Alternatives

Originally Published 2 years ago — by KDnuggets

Source: KDnuggets

Together Computer has released OpenChatKit, an open-source alternative to ChatGPT that gives developers more control over chatbot behavior and customization. The kit includes a large language model fine-tuned for chat, instructions on fine-tuning for specific tasks, an extensible retrieval system, and a moderation system. The model has limitations, but it could improve with community contributions.