Anna's Archive, which describes itself as a non-profit focused on cultural preservation, has scraped and backed up a 300-terabyte archive of Spotify's music, including metadata for 256 million tracks and audio files for 86 million, aiming to preserve humanity's musical heritage despite Spotify's efforts to prevent unauthorized scraping.
A pirate activist group has scraped and copied nearly all of Spotify's music catalog, including metadata for 256 million tracks and audio files for 86 million, claiming to build the world's first music preservation archive. Spotify is investigating the incident, which involved illicit tactics to access some audio files, and the group plans to release the data publicly in order of popularity.
Recent leaks suggest that ChatGPT conversations, including sensitive user prompts, have been appearing in Google Search Console, raising concerns about OpenAI scraping Google search results and compromising user privacy. OpenAI has acknowledged the issue and claimed to have fixed a glitch, but questions remain about the extent of data scraping and the effectiveness of their response.
Reddit has sued AI company Perplexity and associated data-scraping firms for illegally scraping its data, saying it set a trap, a test post, to catch the circumvention, and alleging that Perplexity bypassed its protections by pulling Reddit content from Google search results without permission.
Reddit has sued AI companies, including Perplexity, after using a data trap to catch them scraping copyrighted content from Reddit without permission, highlighting ongoing disputes over AI training data and copyright infringement.
Reddit is suing Perplexity and three data-scraping companies for unlawfully scraping its content to train AI models, alleging that Perplexity is a customer of these scrapers and has continued to increase its Reddit citations despite cease-and-desist letters. Reddit claims these actions circumvent technological protections and violate copyright, and says it aims to prevent industrial-scale theft of its data for AI training.
Reddit has sued AI company Anthropic, accusing it of unlawfully scraping its data for years without permission to train the Claude chatbot, despite Reddit's efforts to enforce its data use policies and seek licensing agreements.
Reddit has sued AI startup Anthropic for unauthorized use of its data to train AI models, claiming violations of its user agreement and data scraping without permission, in a significant legal challenge to AI data practices.
Apple, Nvidia, and other tech giants have been accused of using YouTube videos to train AI models without the creators' consent. Tech YouTuber Marques Brownlee highlighted that Apple sourced data from companies that scraped YouTube content, including his own. This practice, which violates YouTube's terms of service, has raised significant concerns about unauthorized content scraping in the tech industry.
Major tech companies like Apple, Salesforce, and Anthropic have trained their AI models using YouTube videos without creators' consent, potentially violating YouTube's terms. The dataset, known as "the Pile," was compiled by EleutherAI and includes captions from over 173,000 YouTube videos. Content creators are frustrated and critical of this unauthorized use, raising concerns about intellectual property rights and the ethics of data scraping.
OpenAI's use of YouTube videos to train its AI models has raised questions about how it accesses such data, given Google's restrictions on scraping and downloading large volumes of YouTube content. The company has not confirmed whether it has downloaded YouTube videos at scale or bypassed Google's limitations. As the demand for high-quality training data for AI models grows, ethical and legal questions about data scraping and fair use of online content remain unresolved in the AI community.
Midjourney has indefinitely banned all employees of rival AI firm Stability AI from its service after detecting "botnet-like" activity, suspected to be a Stability employee attempting to scrape prompt-and-image pairs in bulk, that caused a 24-hour outage. The move comes after Midjourney itself faced criticism for using training data scraped from the Internet without permission. Stability AI CEO Emad Mostaque said the incident was unintentional and that his company doesn't need Midjourney's data, emphasizing its use of synthetic and other data.
Midjourney has banned all Stability AI employees from using its service indefinitely, alleging that "botnet-like activity from paid accounts" linked to Stability AI employees caused a recent server outage during an attempt to scrape Midjourney's data. Stability AI CEO Emad Mostaque denies ordering the actions and says that if a Stability employee caused the outage, it was unintentional. The situation is still developing, and neither company has responded to requests for comment. The incident has drawn criticism of both companies for training their AI models on online data scraped without consent.
Nightshade, a new software tool developed by researchers at the University of Chicago, is now available for anyone to try as a means of protecting artists' and creators' work from being used to train AI models without consent. By "poisoning" images, Nightshade can make them unsuitable for AI training, leading to unpredictable model outputs and potentially deterring AI companies from using unauthorized content. The tool makes subtle changes to images that are imperceptible to humans but significantly affect how AI models interpret and generate content. Nightshade can also work in conjunction with Glaze, the team's earlier tool that masks an artist's personal style from AI mimicry, offering both offensive and defensive approaches to content protection.
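Nightshade's actual poisoning method optimizes perturbations against specific image models. As a rough, hedged illustration of the general idea only (a bounded pixel-level change that stays visually imperceptible), a toy sketch in NumPy might look like this; the function name and random noise are illustrative stand-ins, not Nightshade's algorithm:

```python
import numpy as np

def add_bounded_perturbation(image: np.ndarray, epsilon: float = 2.0,
                             seed: int = 0) -> np.ndarray:
    """Toy stand-in for a poisoning step: shift each channel value by at
    most `epsilon`, keeping the image visually unchanged to a human eye.

    Real poisoning tools optimize the perturbation against a target
    model; here the perturbation is simply random noise.
    """
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-epsilon, epsilon, size=image.shape)
    # Clip to the valid pixel range before converting back to uint8.
    poisoned = np.clip(image.astype(np.float64) + noise, 0, 255)
    return poisoned.astype(np.uint8)

# Usage: perturb a uniform gray image and confirm the change is tiny.
img = np.full((8, 8, 3), 128, dtype=np.uint8)
out = add_bounded_perturbation(img, epsilon=2.0)
diff = out.astype(int) - img.astype(int)
```

The point of the bound is that a per-pixel change of a couple of intensity levels is invisible to a person, while a carefully *optimized* (rather than random) perturbation of the same magnitude can systematically mislead a model's feature extractor.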
The BBC has outlined its principles for evaluating the use of generative AI in journalism, stating that it believes the technology can provide more value to audiences and society. The broadcaster plans to work with tech companies, media organizations, and regulators to develop generative AI while maintaining trust in the news industry. At the same time, the BBC has blocked web crawlers from OpenAI and Common Crawl from accessing its websites, joining other news organizations in safeguarding their copyrighted material. The BBC says the move protects the interests of license fee payers, arguing that training AI models on BBC data without permission is not in the public interest.
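Blocking crawlers of this kind is typically done via a site's robots.txt file. A minimal sketch, using the publicly documented user-agent tokens for OpenAI's crawler (GPTBot) and Common Crawl's crawler (CCBot), might look like:

```
# robots.txt — disallow AI-training crawlers site-wide
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Note that robots.txt is advisory: it relies on crawlers voluntarily honoring the rules rather than enforcing any access control, which is part of why publishers also pursue legal and technical measures.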