Amazon's recent move to tighten its data access and block third-party web crawlers signals a strategic shift towards a more closed ecosystem, intensifying the AI competition among Big Tech companies and potentially creating a more uneven playing field in AI development.
Nearly 1 million browser extensions across Chrome, Firefox, and Edge have been exploited to covertly turn users' browsers into web scraping bots for a paid service, via a JavaScript library called MellowTel-js. The extensions, which serve otherwise benign purposes, are repurposed to bypass security protections and scrape websites on behalf of paying clients, including advertisers, raising significant security concerns.
Cloudflare will now block known AI web crawlers by default and is introducing 'Pay Per Crawl', which lets participating publishers charge crawlers a fee for access, aiming to ensure content is used with permission and compensation while still supporting AI innovation.
The BBC has threatened legal action against US-based AI firm Perplexity for allegedly reproducing BBC content verbatim without permission, citing copyright infringement and breach of terms of use, amid broader concerns over AI training data and content accuracy.
OpenAI and Anthropic are reportedly ignoring or bypassing robots.txt, the web standard that tells crawlers which parts of a site not to scrape, in order to collect training data for their AI models. Despite both companies' public claims that they respect these blocks, findings by TollBit suggest otherwise. The practice has alarmed media publishers and highlights the ongoing tension between AI companies' data needs and copyright protections.
Multiple AI companies are bypassing the robots.txt web standard to scrape content from publisher sites without permission, according to content licensing startup TollBit. This issue has sparked disputes, such as the one between AI search startup Perplexity and Forbes, and highlights the ongoing debate over the value and use of content in generative AI systems. TollBit aims to mediate by helping publishers and AI companies strike licensing deals.
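The robots.txt standard at the center of both reports is purely advisory: a site publishes per-crawler rules, and compliant crawlers consult them before fetching pages. A minimal sketch using Python's standard-library `urllib.robotparser`, with a hypothetical rule set that disallows one AI crawler's user agent (the agent names and URL are placeholders):

```python
from urllib import robotparser

# Hypothetical robots.txt a publisher might serve to opt out of AI scraping
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler checks before fetching; nothing enforces the answer.
print(rp.can_fetch("GPTBot", "https://example.com/article"))       # False
print(rp.can_fetch("Mozilla/5.0", "https://example.com/article"))  # True
```

Nothing in the protocol enforces these answers, which is why compliance is entirely a matter of crawler behavior — the crux of TollBit's findings.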
Google Sheets offers a simple route into web scraping via the IMPORTXML function, which pulls structured data from a page using an XPath query, making data extraction accessible to a wider audience. Pairing it with a generative AI assistant like ChatGPT enables more advanced scraping tasks without requiring advanced coding skills. While ChatGPT alone struggled to scrape specific data from a webpage accurately, combining it with Google Sheets and IMPORTXML formulas proved efficient and effective, showing how much can be gained by integrating different tools and skills.
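For illustration, an IMPORTXML call takes a page URL and an XPath query; the URL and XPath below are hypothetical placeholders, not taken from the article:

```
=IMPORTXML("https://example.com/books", "//h2[@class='title']")
```

In a workflow like the one described, ChatGPT drafts the XPath expression for the target elements, and that expression is then pasted into the formula.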
Artists now have a tool called Nightshade that can corrupt the training data of AI image models such as DALL-E, Stable Diffusion, and Midjourney, by being applied to their creative work before it is posted online. Nightshade adds invisible, pixel-level changes to digital art, exploiting a security vulnerability in how generative models learn from scraped images. The tool aims to deter AI companies that use copyrighted data without permission. Nightshade can be integrated into Glaze, a companion tool that masks art styles, letting artists choose between corrupting a model's training and preventing it from mimicking their style. It is positioned as a last line of defense against web scrapers that ignore opt-out rules. Copyright issues surrounding AI-generated content and training data remain unresolved, with lawsuits still ongoing.
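Nightshade's actual perturbations are adversarially optimized to mislead training, not random. As a purely illustrative stand-in for the idea of an "invisible" pixel-level change, the sketch below (all names and parameters are invented for this example) adds bounded noise to an 8-bit image, small enough to be imperceptible:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(image, epsilon=2):
    """Add a small, visually imperceptible change to an 8-bit image.

    Illustrative only: Nightshade optimizes its perturbations to corrupt
    a model's learned associations, whereas this sketch uses random noise
    bounded by `epsilon` per channel.
    """
    noise = rng.integers(-epsilon, epsilon + 1, size=image.shape)
    return np.clip(image.astype(int) + noise, 0, 255).astype(np.uint8)

art = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
shaded = perturb(art)

# No pixel moves by more than epsilon, so the image looks unchanged
max_change = int(np.max(np.abs(shaded.astype(int) - art.astype(int))))
print(max_change)  # a value <= 2
```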
The US Supreme Court has declined to hear Genius' web scraping claim against Google and LyricFind for copying its data in search results. Genius had set up a trap by "watermarking" the lyrics to a selection of newly released songs by using Unicode curly apostrophes in certain places and straight apostrophes in others. However, the court ruled that Genius' terms of service claims "are nothing more than claims seeking to enforce the copyright owners’ exclusive rights to protection from unauthorized reproduction of the lyrics and are therefore preempted."
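The apostrophe trick is simple to sketch: each apostrophe position carries one bit, depending on whether the curly or straight character appears. A hypothetical illustration in Python (the `extract_pattern` helper and the sample line are invented for this sketch; Genius' actual pattern is not reproduced here):

```python
# Watermarking via apostrophe shape: curly = 1, straight = 0 (hypothetical encoding)
CURLY, STRAIGHT = "\u2019", "'"

def extract_pattern(text: str) -> str:
    """Read back the hidden bit string from a watermarked lyric."""
    return "".join("1" if ch == CURLY else "0"
                   for ch in text if ch in (CURLY, STRAIGHT))

lyric = "Don\u2019t stop, it's not over, we\u2019re fine"
print(extract_pattern(lyric))  # → 101
```

Because a verbatim copy preserves the exact characters, the same bit pattern showing up in scraped lyrics is what let Genius argue its transcriptions had been copied.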