ChatGPT's Atlas browser, when running in agent mode, avoids directly accessing sources such as the New York Times and PCMag, whose parent companies have ongoing copyright disputes with OpenAI. Instead, it finds alternative sources to summarize, highlighting the ethical and legal considerations surrounding AI web crawling.
Cloudflare accused AI search engine Perplexity of stealthily scraping websites despite being blocked, sparking debate over whether AI agents should be treated like humans or like bots. Many defend Perplexity, arguing that accessing public content on behalf of a user is acceptable; Cloudflare counters that the behavior is inappropriate for an automated service. The controversy raises broader questions about AI web crawling, website blocking, and the future of internet traffic, including concerns about malicious bots and the impact on website revenue and access.
Perplexity AI has been accused of covertly scraping website content by disguising its bots and ignoring no-crawl directives, raising concerns about ethical data collection and the impact on web publishers. Even after being identified and blocked, Perplexity's bots reportedly continue to bypass restrictions, contributing to a surge in AI data scraping that threatens the sustainability of web content monetization. The episode underscores ongoing tensions between AI companies and website owners over data access and compensation.
Cloudflare reports that AI startup Perplexity is using stealth techniques to bypass website restrictions and access content without permission, raising concerns about unauthorized data scraping. Perplexity denies the allegations, calling the report a publicity stunt, while Cloudflare has taken steps to block the company's AI bots.
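To ground the dispute, the blocking mechanism both sides reference is User-Agent inspection: a declared crawler identifies itself in its User-Agent header, and a site can refuse it outright. Below is a minimal Python sketch of that check; the crawler token list is illustrative, not exhaustive, and real defenses like Cloudflare's also use IP ranges and behavioral signals not shown here.

```python
# Minimal sketch: refuse requests from declared AI crawlers by User-Agent.
# The token list is an illustrative assumption, not a complete registry.

DECLARED_AI_CRAWLERS = ("GPTBot", "PerplexityBot", "CCBot")

def is_declared_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header names a known AI crawler."""
    return any(token in user_agent for token in DECLARED_AI_CRAWLERS)

def handle_request(headers: dict) -> int:
    """Return an HTTP status: 403 for declared AI crawlers, 200 otherwise."""
    if is_declared_ai_crawler(headers.get("User-Agent", "")):
        return 403  # block the self-identified bot
    return 200      # a spoofed browser User-Agent passes this check,
                    # which is precisely the stealth behavior alleged above
```

The limitation is visible in the last branch: this check only works against crawlers that announce themselves, which is why disguised bots are at the center of the controversy.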
OpenAI quietly launched GPTBot, a web crawler that scrapes website content for training its language models, and website owners and creators quickly sought ways to block it. OpenAI published instructions for blocking GPTBot via robots.txt, though it remains uncertain whether this fully prevents content from being used in training. Web scraping for AI training has already prompted lawsuits and debates over data privacy. OpenAI also recently announced a partnership with NYU's Ethics and Journalism Initiative to address ethical challenges in applying AI to the news industry.
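For reference, the opt-out OpenAI documented uses the standard robots.txt mechanism: name the GPTBot user agent and disallow it. A minimal entry along those lines:

```
# robots.txt at the site root: opt this site out of GPTBot crawling
User-agent: GPTBot
Disallow: /
```

As the summary above notes, this signals a preference rather than guaranteeing exclusion, since it depends on the crawler honoring the file.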
Google is exploring alternatives to the robots.txt protocol, which has been the standard for controlling web crawling and indexing for the past 30 years. The company believes it is time to develop additional machine-readable means for web publishers to express choice and control, especially in light of emerging AI and research use cases, and is inviting members of the web and AI communities to public discussions on new protocols and methods. The move comes after OpenAI disabled the Browse with Bing feature in ChatGPT due to unauthorized access to paywalled content.
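One limitation driving the search for successors is that robots.txt is purely advisory: compliant crawlers consult it voluntarily, and nothing in the protocol enforces it. A short sketch using Python's standard-library robotparser shows how a well-behaved crawler checks the file before fetching; the site URL and paths here are placeholders.

```python
# Sketch: how a compliant crawler consults robots.txt before fetching a page.
# Compliance is voluntary; the protocol itself has no enforcement mechanism,
# which is part of why new machine-readable controls are being discussed.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()  # fetch and parse the file

# A well-behaved crawler asks before each fetch; a stealth one simply skips this.
if rp.can_fetch("GPTBot", "https://example.com/articles/some-story"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```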