The article discusses the challenges of protecting website content from AI scraping and the limitations of technical measures like robots.txt, advocating for legal solutions and emphasizing the importance of human-centric web interactions. It highlights concerns about AI's impact on content creators, the legal and ethical issues surrounding data use, and the need for laws that respect creators' rights while acknowledging technological realities.
Perplexity AI has been accused of covertly scraping website content by disguising its bots and ignoring no-crawl directives, raising concerns about ethical data collection and the impact on web publishers. Even after being identified and blocked, Perplexity's bots reportedly continue to bypass restrictions, contributing to a surge in AI data scraping that threatens the sustainability of web content monetization. The issue highlights the ongoing tension between AI companies and website owners over data access and compensation.
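Those "no-crawl directives" are simply entries in a site's robots.txt file, and compliance is entirely self-enforced: a crawler identifies itself by a User-Agent string and decides on its own whether to honor the rules. A minimal Python sketch of what a well-behaved crawler does before each fetch (the bot name and URLs are placeholders, not any company's actual crawler):

```python
from urllib import robotparser

# A compliant crawler reads the site's robots.txt and checks each URL
# against it before fetching. Nothing on the server enforces this, which
# is why a bot that skips the check, or reports a misleading User-Agent,
# can retrieve disallowed pages anyway.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()  # fetch and parse the rules

USER_AGENT = "ExampleBot"  # hypothetical crawler name
url = "https://example.com/articles/paywalled-report"

if rp.can_fetch(USER_AGENT, url):
    print("robots.txt allows fetching:", url)
else:
    print("robots.txt disallows:", url, "- a well-behaved crawler stops here")
```

Because the check happens inside the crawler, disguising the User-Agent defeats it entirely, which is the core of the accusation against Perplexity.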
OpenAI and Anthropic are reportedly ignoring or bypassing robots.txt, the standard that tells automated crawlers which parts of a website they should not access, in order to collect data for training their AI models. Despite the companies' public claims of respecting these blocks, findings by TollBit suggest otherwise. The practice has raised concerns among media publishers and highlights the ongoing tension between AI companies' data needs and copyright protections.
Several AI companies are reportedly ignoring the Robots Exclusion Protocol (robots.txt) to scrape content from websites without permission, leading to disputes with publishers. TollBit, a content licensing startup, has highlighted widespread non-compliance, with AI firms using data for training without authorization. This has resulted in legal actions and negotiations for licensing deals, as the debate over the legality and value of using content to train generative AI continues.
Multiple AI companies are bypassing the robots.txt web standard to scrape content from publisher sites without permission, according to content licensing startup TollBit. This issue has sparked disputes, such as the one between AI search startup Perplexity and Forbes, and highlights the ongoing debate over the value and use of content in generative AI systems. TollBit aims to mediate by helping publishers and AI companies strike licensing deals.
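For context, the opt-out these disputes revolve around is a plain-text robots.txt file served at a site's root. A hypothetical publisher wanting to refuse the documented AI crawlers while keeping ordinary crawling open might serve something like the following (GPTBot, ClaudeBot, and PerplexityBot are the user-agent names OpenAI, Anthropic, and Perplexity have published for their crawlers; the /premium/ path is purely illustrative):

```
# Refuse the documented AI crawlers site-wide.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

# All other crawlers may fetch everything except the paid section.
User-agent: *
Disallow: /premium/
```

Whether rules like these are actually honored is exactly what TollBit's findings call into question, since the protocol has no enforcement mechanism of its own.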
Google's Gemini chatbot inadvertently allowed private chats to become publicly accessible, and they were indexed by search engines despite the site's robots.txt file. The indexing was most likely caused by public links pointing to the chat pages. The pages later began dropping from search results, possibly because of their low quality and lack of relevance. The incident illustrates how search engines discover and index content, and the limits of robots.txt as a tool for preventing indexing.
Google launched shared conversations in Google Bard without blocking them from search engine indexing, and the public conversations were soon indexed by Google Search. Google acknowledged the issue, saying it had not intended for shared chats to be indexed, and the Bard team subsequently blocked them with robots.txt. Even so, the URLs can remain in the index without their content being crawled.
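That last point reflects a detail of the protocol that trips up many site owners: robots.txt controls crawling, not indexing, so a disallowed URL can still be indexed and listed by its URL alone if other pages link to it. A minimal rule of the kind the Bard team reportedly added might look like this (the /share/ path is an assumption for illustration):

```
# Stops compliant crawlers from fetching shared-chat pages,
# but their URLs can still appear in results if linked elsewhere.
User-agent: *
Disallow: /share/
```

Keeping a page out of the index entirely requires the opposite approach: leave it crawlable and serve a noindex signal, either a <meta name="robots" content="noindex"> tag in the HTML or an X-Robots-Tag: noindex response header, because a crawler that is blocked by robots.txt never sees either directive.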
Google has announced plans to develop a complement to the nearly 30-year-old robots.txt protocol in order to address the challenges posed by new generative AI technologies. The company intends to convene discussions with the web and AI communities to explore additional machine-readable means for web publisher choice and control in emerging AI and research use cases, and it is inviting a broad range of voices to participate.
Google is exploring complements to the robots.txt protocol, which has been the standard for controlling web crawling for the past 30 years. The company believes it is time to find additional machine-readable means for web publisher choice and control, particularly in light of emerging AI and research use cases, and is inviting members of the web and AI communities to join public discussions on new protocols and methods. The move came after OpenAI disabled ChatGPT's Browse with Bing feature because it could surface paywalled content without publishers' authorization.