Tag

Data Scraping

All articles tagged with #data scraping

Pirate Activists Claim to Have Scraped 300TB of Spotify's Music Catalog

Originally Published 20 days ago — by Gizmodo

Source: Gizmodo

Anna's Archive, a non-profit focused on cultural preservation, says it has scraped and backed up a 300-terabyte archive of Spotify's music, including metadata for 256 million tracks and audio files for 86 million. The group says its aim is to preserve humanity's musical heritage; the scrape went ahead despite Spotify's efforts to prevent unauthorized scraping.

Pirate Group Claims to Have Copied Entire Spotify Music Library

Originally Published 20 days ago — by PCMag

Source: PCMag

A pirate activist group says it has scraped and copied nearly all of Spotify's music catalog, including metadata for 256 million tracks and audio files for 86 million, and claims to be building the world's first music preservation archive. Spotify is investigating the incident, which involved illicit tactics to access some audio files. The group plans to release the data publicly in order of popularity.

Unusual ChatGPT Leaks Reveal Cringey Logs in Google Analytics

Originally Published 2 months ago — by Ars Technica

Source: Ars Technica

Recent leaks suggest that ChatGPT conversations, including sensitive user prompts, have been appearing in Google Search Console, raising concerns that OpenAI is scraping Google search results and compromising user privacy. OpenAI has acknowledged the issue and claimed to have fixed a glitch, but questions remain about the extent of the data scraping and the effectiveness of its response.

Reddit Accuses AI Startup Perplexity of Data Theft in Growing Industry Battle

Originally Published 2 months ago — by businessinsider.com

Source: businessinsider.com

Reddit has sued AI company Perplexity and associated data-scraping firms, alleging that they illegally scraped its data. Reddit says it set a trap with a test post to catch circumvention, and alleges that Perplexity bypassed protections by using Google search results to access Reddit content without permission.

Reddit Sues Perplexity and Others Over Data Scraping for AI Training

Originally Published 2 months ago — by The Verge

Source: The Verge

Reddit is suing Perplexity and three data-scraping companies for unlawfully scraping its content to train AI models, alleging that Perplexity is a customer of these scrapers and that its citations of Reddit have increased despite Reddit's cease-and-desist efforts. Reddit claims these actions bypass technological protections and violate copyright, and says the suit aims to prevent the industrial-scale theft of its data for AI training.

Tech Giants Accused of Using YouTube Videos for AI Training Without Consent

Originally Published 1 year ago — by Benzinga

Source: Benzinga

Apple, Nvidia, and other tech giants have been accused of using YouTube videos to train AI models without the creators' consent. Tech YouTuber Marques Brownlee highlighted that Apple sourced data from companies that scraped YouTube content, including his own. The practice, which is said to violate YouTube's rules, has raised significant concerns about unauthorized content scraping in the tech industry.

Tech Giants Used YouTube Videos Without Consent to Train AI

Originally Published 1 year ago — by Ars Technica

Source: Ars Technica

Major tech companies like Apple, Salesforce, and Anthropic have trained their AI models using YouTube videos without creators' consent, potentially violating YouTube's terms. The training data came from a dataset known as "the Pile," which was compiled by EleutherAI and includes captions from over 173,000 YouTube videos. Content creators are frustrated by and critical of this unauthorized use, which raises concerns about intellectual property rights and the ethics of data scraping.

"Unraveling the OpenAI Mystery: Sora's Impact on YouTube, Google, and AI Training Data"

Originally Published 1 year ago — by Business Insider

Source: Business Insider

OpenAI's use of YouTube videos to train its AI models has raised questions about how it accesses such data, given Google's restrictions on scraping and downloading large volumes of YouTube content. The company has not confirmed whether it has downloaded YouTube videos at scale or bypassed Google's limitations. As the demand for high-quality training data for AI models grows, ethical and legal questions about data scraping and fair use of online content remain unresolved in the AI community.

Midjourney Takes Action Against Stability AI for Image and Data Scraping

Originally Published 1 year ago — by Ars Technica

Source: Ars Technica

Midjourney banned all employees of rival AI firm Stability AI from its service indefinitely after detecting "botnet-like" activity that caused a 24-hour outage. Midjourney suspects a Stability employee was attempting to scrape prompt and image pairs in bulk. The move comes after Midjourney itself faced criticism for using training data scraped off the Internet without permission. Stability AI's CEO claimed the incident was unintentional and stated that his company doesn't need Midjourney's data, emphasizing its use of synthetic and other data.

Midjourney Takes Action Against Stability AI for Alleged Data and Image Theft

Originally Published 1 year ago — by The Verge

Source: The Verge

Midjourney has banned all Stability AI employees from using its service indefinitely, alleging that they caused a recent server outage by attempting to scrape Midjourney's data. Midjourney claims that "botnet-like activity from paid accounts" linked to Stability AI employees was behind the outage. Stability AI CEO Emad Mostaque denies ordering the actions and claims that if the outage was caused by a Stability employee, it was unintentional. The situation is still developing, and neither company has responded to requests for comment. The incident has sparked criticism of both companies for training their AI models on scraped online data without consent.

"Nightshade: Poisoning AI Data Scraping to Protect Artists' Portfolios"

Originally Published 2 years ago — by TechSpot

Source: TechSpot

Nightshade, a new software tool developed by researchers at the University of Chicago, is now available for anyone to try as a means of protecting artists' and creators' work from being used to train AI models without consent. By "poisoning" images, Nightshade can make them unsuitable for AI training: models trained on them produce unpredictable results, which could deter AI companies from using unauthorized content. The tool works by making subtle changes to images that are imperceptible to humans but significantly affect how AI models interpret and generate content. Nightshade can also work in conjunction with Glaze, another tool designed to disrupt content abuse, so that artists have both offensive and defensive approaches to content protection.

BBC Blocks OpenAI's Data Scraping for AI-Powered Journalism

Originally Published 2 years ago — by The Verge

Source: The Verge

The BBC has outlined its principles for evaluating the use of generative AI in journalism, stating that it believes the technology can provide more value to audiences and society. The broadcaster plans to work with tech companies, media organizations, and regulators to develop generative AI while maintaining trust in the news industry. However, the BBC has blocked web crawlers from OpenAI and Common Crawl from accessing its websites, joining other news organizations in safeguarding their copyrighted material. The BBC says the move protects the interests of license fee payers, and that training AI models on BBC data without permission is not in the public interest.