Web Crawling

All articles tagged with #web crawling

Perplexity Faces Scrutiny Over Stealth Crawling and Cloudflare Disputes
Technology · 6 months ago

Cloudflare accused AI search engine Perplexity of stealthily scraping websites despite being blocked, sparking debate over whether AI agents should be treated like humans or bots. Many defend Perplexity, arguing that accessing public content on behalf of a user is acceptable, while Cloudflare criticizes the behavior as inappropriate. The controversy highlights broader issues about AI web crawling, website blocking, and the future of internet traffic, with concerns about malicious bots and the impact on website revenue and access.

Perplexity AI Faces Accusations of Stealth Data Scraping and Evasion
Technology · 6 months ago

Perplexity AI has been accused of covertly scraping website content by disguising its bots and ignoring no-crawl directives, raising concerns about ethical data collection and the impact on web publishers. According to the accusations, Perplexity's bots conceal their identity and continue to bypass restrictions, contributing to a surge in AI data scraping that threatens the sustainability of web content monetization. The issue highlights ongoing tensions between AI companies and website owners over data access and compensation.

OpenAI's GPTBot: The Battle to Block and Stop the Web Crawling Menace
Technology · 2 years ago

OpenAI quietly launched GPTBot, a web crawling bot used to scrape website content for training its language models. However, website owners and creators quickly sought ways to block the bot from accessing their data. OpenAI provided instructions on how to block GPTBot, but it remains uncertain if this will completely prevent content from being used in training. The controversy surrounding web scraping for AI training has led to lawsuits and debates over data privacy. OpenAI recently announced a partnership with NYU's Ethics and Journalism Initiative to address ethical challenges in AI implementation in the news industry.
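
For readers who want to see what such a block looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It assumes a site publishes a robots.txt that disallows the `GPTBot` user agent; the sample rules, the `is_allowed` helper, and the `example.com` URL are illustrative assumptions, not taken from the article. Note that robots.txt is advisory: it only affects crawlers that choose to honor it, which is consistent with the uncertainty described above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: block GPTBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeOtherBot"):
        allowed = is_allowed(ROBOTS_TXT, agent, "https://example.com/article")
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

A check like this only tells a site owner how a compliant crawler would interpret the rules; it does not detect or stop a crawler that ignores them.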

Google's Response to Emerging Technologies: Exploring Alternatives to Robots.txt
Technology · 2 years ago

Google is exploring alternatives to the robots.txt protocol, which has been the standard for controlling web crawling and indexing for the past 30 years. The company believes it's time to find additional machine-readable means for web publisher choice and control, especially in light of emerging AI and research use cases. Google is inviting members of the web and AI communities to engage in public discussions to explore new protocols and methods. The move comes after OpenAI disabled the Browse with Bing feature in ChatGPT because it was being used to access paywalled content without authorization.