Tag

Web Scraping

All articles tagged with #web scraping

Browser extensions transform nearly 1 million browsers into website-scraping bots
technology · 8 months ago

Browser extensions installed on nearly 1 million browsers across Chrome, Firefox, and Edge have been covertly turning those browsers into web-scraping bots for a paid service, using a JavaScript library called MellowTel-js. The extensions, which serve various benign purposes, are being used to bypass security protections and scrape websites on behalf of paying clients, including advertisers, raising significant security concerns.

AI Companies Bypass Web Standards, Face Legal Threats Over Content Scraping
technology · 1 year ago

OpenAI and Anthropic are reportedly ignoring or bypassing robots.txt, the voluntary web standard that tells automated crawlers which parts of a site they should not access, to collect data for training their AI models. Despite the companies' public claims that they respect these blocks, findings by TollBit suggest otherwise. The practice has raised concerns among media publishers and highlights the ongoing tension between AI companies' data needs and copyright protections.
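
For readers unfamiliar with the mechanism, here is a minimal sketch of how a crawler that chooses to honor robots.txt would check a URL before fetching it, using Python's standard `urllib.robotparser`. The robots.txt content, the bot name `ExampleAIBot`, and the example.com URLs are illustrative assumptions, not taken from any site or company named above. Nothing in the standard enforces the result: a crawler that skips this check can still request the page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration).
# A real crawler would download this file from the target site.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleAIBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/private/report"),
    ]:
        print(agent, url, may_fetch(agent, url))
```

In this sketch the publisher blocks the hypothetical `ExampleAIBot` from the whole site while letting other crawlers read everything outside `/private/`.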

Perplexity AI Faces Legal and Ethical Scrutiny Over Content Practices
technology · 1 year ago

Multiple AI companies are bypassing the robots.txt web standard to scrape content from publisher sites without permission, according to content licensing startup TollBit. This issue has sparked disputes, such as the one between AI search startup Perplexity and Forbes, and highlights the ongoing debate over the value and use of content in generative AI systems. TollBit aims to mediate by helping publishers and AI companies strike licensing deals.

Web Scraping Made Easy: Using AI for Google Sheets
technology · 1 year ago

Google Sheets offers a simple route to web scraping through its IMPORTXML function, which takes a URL and an XPath query and returns the matching content, making data extraction accessible to a wider audience. Adding generative AI such as ChatGPT allows for more advanced scraping tasks without requiring advanced coding skills. ChatGPT alone struggled to accurately scrape specific data from a webpage, but combining it with Google Sheets and IMPORTXML formulas proved efficient and effective.
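
To show what an XPath query of the kind IMPORTXML accepts actually selects, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`, which supports a subset of XPath. The sample markup, the `title` class, and the helper name `extract_text` are assumptions made for illustration. Real web pages are often not well-formed XML, so this inline document stands in for a fetched page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (an assumption for illustration).
SAMPLE_PAGE = """
<html><body>
  <h2 class="title">First headline</h2>
  <p>Some text</p>
  <h2 class="title">Second headline</h2>
  <h2 class="other">Not a headline</h2>
</body></html>
"""


def extract_text(markup: str, path: str) -> list[str]:
    """Return the text of every element matching an ElementTree XPath query."""
    root = ET.fromstring(markup.strip())
    return [(element.text or "").strip() for element in root.findall(path)]


if __name__ == "__main__":
    # Select every <h2> whose class attribute is exactly "title".
    print(extract_text(SAMPLE_PAGE, ".//h2[@class='title']"))
```

In a spreadsheet, the comparable query would be passed as the second argument of IMPORTXML, with the page URL as the first.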

Artists Empowered: Nightshade AI Tool Counters AI Image Scrapers and Protects Art
technology · 2 years ago

Artists now have a tool called Nightshade that they can apply to their creative work to corrupt the training data used by AI image models such as DALL-E, Stable Diffusion, and Midjourney. Nightshade makes invisible changes to the pixels of digital art, exploiting a security vulnerability in the model's training process. The tool aims to disrupt AI companies that use copyrighted data without permission. Nightshade can be integrated into Glaze, a tool that masks art styles, letting artists choose whether to corrupt a model's training or prevent it from mimicking their style. It is proposed as a last line of defense against web scrapers that ignore opt-out rules. Copyright issues surrounding AI-generated content and training data remain unresolved, with lawsuits ongoing.

Supreme Court dismisses copyright claims against Google and Apple.
technology · 2 years ago

The US Supreme Court has declined to hear Genius' web scraping claim against Google and LyricFind for copying its data in search results. Genius had set a trap by "watermarking" the lyrics of a selection of newly released songs, using Unicode curly apostrophes in certain places and straight apostrophes in others. The Supreme Court's decision leaves in place a lower-court ruling that Genius' terms of service claims "are nothing more than claims seeking to enforce the copyright owners’ exclusive rights to protection from unauthorized reproduction of the lyrics and are therefore preempted."
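
The apostrophe trap can be sketched in a few lines of Python. This is an illustration of the general idea only: the sample lyric, the bit pattern, and the function names are assumptions, and the summary above does not say which pattern Genius actually used. Each apostrophe carries one bit, straight for 0 and curly for 1, so a copied text that preserves the characters also preserves the hidden pattern.

```python
STRAIGHT = "'"       # U+0027, read as bit 0
CURLY = "\u2019"     # U+2019 right single quotation mark, read as bit 1


def embed_watermark(text: str, bits: str) -> str:
    """Rewrite successive apostrophes as straight (0) or curly (1) per bits."""
    out = []
    index = 0
    for char in text:
        if char in (STRAIGHT, CURLY) and index < len(bits):
            out.append(CURLY if bits[index] == "1" else STRAIGHT)
            index += 1
        else:
            out.append(char)  # non-apostrophes and leftover apostrophes pass through
    return "".join(out)


def read_watermark(text: str) -> str:
    """Recover the bit string encoded by the apostrophes in text."""
    return "".join(
        "1" if char == CURLY else "0"
        for char in text
        if char in (STRAIGHT, CURLY)
    )


if __name__ == "__main__":
    lyric = "I can't say it's over, don't go, we're done"  # made-up sample line
    marked = embed_watermark(lyric, "1011")
    print(marked)
    print(read_watermark(marked))
```

The marked text looks the same to a casual reader, which is what makes the pattern usable as evidence when it reappears elsewhere.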