Data Scraping

All articles tagged with #data scraping

Fake Page Tricks AI Into Crowning a Made-Up Hot-Dog Champ as Tech Journalism's Top Star
artificial-intelligence · 5 days ago

A BBC reporter demonstrated how a fabricated webpage claiming a tech journalist dominates hot-dog eating contests fooled AI models such as ChatGPT and Google's Gemini into repeating the fake claim. Within 24 hours, Google's AI Overviews were echoing the misinformation, prompting Google to correct the record and acknowledge the error. The episode highlights how data scraping and unvetted sources can seed false information into AI systems, underscoring the need for guardrails and better data vetting to prevent real-world harm.

Pirate Group Claims to Have Copied Entire Spotify Music Library
technology · 2 months ago

A pirate activist group has scraped and copied nearly all of Spotify's music catalog, including metadata for 256 million tracks and audio files for 86 million, claiming to build the world's first music preservation archive. Spotify is investigating the incident, which involved illicit tactics to access some audio files, and the group plans to release the data publicly in order of popularity.

Unusual ChatGPT Leaks Reveal Cringey Logs in Google Analytics
technology · 3 months ago

Recent leaks suggest that ChatGPT conversations, including sensitive user prompts, have been appearing in Google Search Console, raising concerns about OpenAI scraping Google search results and compromising user privacy. OpenAI has acknowledged the issue and claimed to have fixed a glitch, but questions remain about the extent of data scraping and the effectiveness of their response.

Reddit Sues Perplexity and Others Over Data Scraping for AI Training
law · 4 months ago

Reddit is suing Perplexity and three data-scraping companies for unlawfully scraping its content to train AI models, alleging that Perplexity is a customer of these scrapers and has increased Reddit citations despite cease-and-desist efforts. Reddit claims these actions bypass technological protections and violate copyright, aiming to prevent the industrial-scale theft of its data for AI training.

Tech Giants Accused of Using YouTube Videos for AI Training Without Consent
technology · 1 year ago

Apple, Nvidia, and other tech giants have been accused of using YouTube videos to train AI models without the creators' consent. Tech YouTuber Marques Brownlee highlighted that Apple sourced data from companies that scraped YouTube content, including his own. This practice, which violates YouTube's regulations, has raised significant concerns about unauthorized content scraping in the tech industry.

Tech Giants Used YouTube Videos Without Consent to Train AI
technology · 1 year ago

Major tech companies like Apple, Salesforce, and Anthropic have trained their AI models using YouTube videos without creators' consent, potentially violating YouTube's terms. The dataset, known as "the Pile," was compiled by EleutherAI and includes captions from over 173,000 YouTube videos. Content creators are frustrated and critical of this unauthorized use, raising concerns about intellectual property rights and the ethics of data scraping.

Unraveling the OpenAI Mystery: Sora's Impact on YouTube, Google, and AI Training Data
technology · 1 year ago

OpenAI's use of YouTube videos to train its AI models has raised questions about how it accesses such data, given Google's restrictions on scraping and downloading large volumes of YouTube content. The company has not confirmed whether it has downloaded YouTube videos at scale or bypassed Google's limitations. As the demand for high-quality training data for AI models grows, ethical and legal questions about data scraping and fair use of online content remain unresolved in the AI community.

Midjourney Takes Action Against Stability AI for Image and Data Scraping
technology · 1 year ago

Midjourney indefinitely banned all employees of rival AI firm Stability AI from its service after detecting "botnet-like" activity, suspected to be a Stability employee attempting to scrape prompt-and-image pairs in bulk, which caused a 24-hour outage. The move comes after Midjourney itself faced criticism for training on data scraped from the Internet without permission. Stability AI's CEO claimed the incident was unintentional and said his company doesn't need Midjourney's data, emphasizing its use of synthetic and other data.

Midjourney Takes Action Against Stability AI for Alleged Data and Image Theft
technology · 1 year ago

Midjourney has banned Stability AI employees from using its service, alleging that "botnet-like activity from paid accounts" linked to Stability AI employees caused a recent server outage by attempting to scrape Midjourney's data. Stability AI CEO Emad Mostaque denies ordering the actions and says that if a Stability employee caused the outage, it was unintentional. The situation is still developing, and neither company has responded to requests for comment. The incident has also drawn criticism of both companies for training their AI models on scraped online data without consent.

Nightshade: Poisoning AI Data Scraping to Protect Artists' Portfolios
technology · 2 years ago

Nightshade, a new software tool developed by researchers at the University of Chicago, is now available for anyone to try as a means of protecting artists' and creators' work from being used to train AI models without consent. By "poisoning" images, Nightshade can make them unsuitable for AI training, producing unpredictable model outputs and potentially deterring AI companies from using unauthorized content. The tool makes subtle changes to images that are imperceptible to humans but significantly alter how AI models interpret them. Nightshade can also work alongside Glaze, the team's earlier tool for masking an artist's style from AI mimicry, pairing an offensive approach with a defensive one for content protection.