Web Scraping

All articles tagged with #web scraping

Browser extensions transform nearly 1 million browsers into website-scraping bots

Originally Published 6 months ago — by Ars Technica

Browser extensions for Chrome, Firefox, and Edge, installed on nearly 1 million browsers, have been covertly turning those browsers into web scraping bots for a paid service by means of a JavaScript library called MellowTel-js. The extensions, which serve various benign purposes, bypass security protections and scrape websites on behalf of paying clients, including advertisers, raising significant security concerns.

AI Companies Bypass Web Standards, Face Legal Threats Over Content Scraping

Originally Published 1 year ago — by Business Insider

OpenAI and Anthropic are reportedly ignoring or bypassing robots.txt, the voluntary web standard that websites use to tell automated crawlers which pages not to scrape, to collect data for training their AI models. Despite the companies' public claims of respecting these blocks, findings by TollBit suggest otherwise. This practice has raised concerns among media publishers and highlights the ongoing tension between AI companies' data needs and copyright protections.
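
For readers unfamiliar with how robots.txt works, the following is a minimal sketch of the check a compliant crawler is expected to run before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the bot names `ExampleAIBot` and `OtherBot`, and the example.com URLs are all hypothetical. Note that the check runs inside the crawler itself; nothing on the website enforces it, which is why a crawler can simply skip it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; real sites publish theirs at /robots.txt.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    urls = (
        "https://example.com/news/story",
        "https://example.com/private/draft",
    )
    for agent in ("ExampleAIBot", "OtherBot"):
        for url in urls:
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

In this sketch the hypothetical `ExampleAIBot` is blocked from the whole site, while any other crawler is blocked only from `/private/`.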

"Perplexity AI Faces Legal and Ethical Scrutiny Over Content Practices"

Originally Published 1 year ago — by Reuters

Multiple AI companies are bypassing the robots.txt web standard to scrape content from publisher sites without permission, according to content licensing startup TollBit. This issue has sparked disputes, such as the one between AI search startup Perplexity and Forbes, and highlights the ongoing debate over the value and use of content in generative AI systems. TollBit aims to mediate by helping publishers and AI companies strike licensing deals.

"Web Scraping Made Easy: Using AI for Google Sheets"

Originally Published 1 year ago — by Search Engine Journal

Google Sheets offers a simple way to scrape web data with its IMPORTXML function, making data extraction accessible to a wider audience. Adding a generative AI tool such as ChatGPT allows for more advanced scraping tasks without requiring advanced coding skills. ChatGPT alone struggled to accurately scrape specific data from a webpage, but combining it with Google Sheets and IMPORTXML formulas proved efficient and effective, showing how pairing different tools and skills can improve productivity.
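
In Google Sheets, IMPORTXML takes a URL and an XPath query, for example `=IMPORTXML("https://example.com", "//li[@class='item']/a")` (the URL and query here are hypothetical). The Python sketch below illustrates the same idea of selecting elements with an XPath query, under two stated assumptions: the page fragment is invented and well-formed, and the helper name `import_xml` is made up for illustration. Unlike IMPORTXML, it does not fetch anything, and the standard library's `ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a fetched page.
PAGE = """\
<html><body>
  <h1>Example listing</h1>
  <ul>
    <li class="item"><a href="/a">First</a></li>
    <li class="item"><a href="/b">Second</a></li>
    <li class="ad"><a href="/c">Sponsored</a></li>
  </ul>
</body></html>
"""


def import_xml(markup: str, xpath_query: str) -> list[str]:
    """Return the stripped text of every element matching xpath_query."""
    root = ET.fromstring(markup)
    return [(el.text or "").strip() for el in root.findall(xpath_query)]


if __name__ == "__main__":
    # Select links inside list items marked class="item", skipping the ad.
    print(import_xml(PAGE, ".//li[@class='item']/a"))
```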

"Artists Empowered: Nightshade AI Tool Counters AI Image Scrapers and Protects Art"

Originally Published 2 years ago — by The Verge

Artists now have a tool called Nightshade that they can apply to their creative work to corrupt the training data of AI models, such as DALL-E, Stable Diffusion, and Midjourney, that scrape it. Nightshade adds invisible changes to pixels in digital art, exploiting a security vulnerability in the model's training process. The tool aims to disrupt AI companies that use copyrighted data without permission. Nightshade can be integrated into Glaze, a tool that masks art styles, allowing artists to choose whether to corrupt the model's training or prevent it from mimicking their style. It is proposed as a last defense against web scrapers that ignore opt-out rules. Copyright issues surrounding AI-generated content and training data remain unresolved, with lawsuits still ongoing.

Supreme Court dismisses copyright claims against Google and Apple

Originally Published 2 years ago — by The Register

The US Supreme Court has declined to hear Genius' web scraping claim against Google and LyricFind for copying its data in search results. Genius had set a trap by "watermarking" the lyrics of a selection of newly released songs, using Unicode curly apostrophes in certain places and straight apostrophes in others. The decision leaves in place a lower-court ruling that Genius' terms of service claims "are nothing more than claims seeking to enforce the copyright owners’ exclusive rights to protection from unauthorized reproduction of the lyrics and are therefore preempted."
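
The apostrophe watermark described above can be illustrated with a short Python sketch. This is not Genius' actual scheme: the bit pattern, the sample lyric, and the function names `embed` and `extract` are all hypothetical. The idea is only that the choice between a curly and a straight apostrophe can carry a hidden pattern that survives copy and paste.

```python
STRAIGHT = "'"      # U+0027 apostrophe
CURLY = "\u2019"    # U+2019 right single quotation mark


def embed(text: str, bits: str) -> str:
    """Restyle apostrophes so they spell out bits (1 = curly, 0 = straight)."""
    out = []
    used = 0
    for ch in text:
        if ch in (STRAIGHT, CURLY) and used < len(bits):
            out.append(CURLY if bits[used] == "1" else STRAIGHT)
            used += 1
        else:
            out.append(ch)
    if used < len(bits):
        raise ValueError("not enough apostrophes to carry the pattern")
    return "".join(out)


def extract(text: str, length: int) -> str:
    """Read the styles of the first `length` apostrophes back as a bit string."""
    bits = ["1" if ch == CURLY else "0" for ch in text if ch in (STRAIGHT, CURLY)]
    return "".join(bits[:length])


if __name__ == "__main__":
    lyric = "I can't say I'm sure, but it's true and we're fine"  # hypothetical
    pattern = "1011"                                              # hypothetical
    marked = embed(lyric, pattern)
    # A copy of the text that reproduces the pattern points back to the marked source.
    print(extract(marked, len(pattern)) == pattern)
```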