Amazon's recent move to tighten its data access and block third-party web crawlers signals a strategic shift towards a more closed ecosystem, intensifying the AI competition among Big Tech companies and potentially creating a more uneven playing field in AI development.
Nearly 1 million browser extensions across Chrome, Firefox, and Edge have been exploited to covertly turn users' browsers into web scraping bots for a paid service, via a JavaScript library called MellowTel-js. The extensions, which serve otherwise benign purposes, are repurposed to bypass security protections and scrape websites on behalf of paying clients, including advertisers, raising significant security concerns.
Cloudflare will now block known AI web crawlers by default and is introducing 'Pay Per Crawl', which lets participating publishers charge crawlers a fee for access, aiming to ensure content is used with permission and compensation while still supporting AI innovation.
The BBC has threatened legal action against US-based AI firm Perplexity for allegedly reproducing BBC content verbatim without permission, citing copyright infringement and breach of terms of use, amid broader concerns over AI training data and content accuracy.
OpenAI and Anthropic are reportedly ignoring or bypassing robots.txt, the web standard that tells crawlers which parts of a site not to scrape, in order to collect training data for their AI models. Despite both companies' public claims that they respect these blocks, findings by TollBit suggest otherwise. The practice has alarmed media publishers and highlights the ongoing tension between AI companies' data needs and copyright protections.
Multiple AI companies are bypassing the robots.txt web standard to scrape content from publisher sites without permission, according to content licensing startup TollBit. This issue has sparked disputes, such as the one between AI search startup Perplexity and Forbes, and highlights the ongoing debate over the value and use of content in generative AI systems. TollBit aims to mediate by helping publishers and AI companies strike licensing deals.
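The robots.txt standard at the center of both reports is purely advisory: a site publishes per-crawler rules, and compliant crawlers consult them before fetching pages. A minimal sketch using Python's standard-library `urllib.robotparser`, with a hypothetical rule set that disallows one AI crawler's user agent (the agent names and URL are placeholders):

```python
from urllib import robotparser

# Hypothetical robots.txt a publisher might serve to opt out of AI scraping
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler checks before fetching; nothing enforces the answer.
print(rp.can_fetch("GPTBot", "https://example.com/article"))       # False
print(rp.can_fetch("Mozilla/5.0", "https://example.com/article"))  # True
```

Nothing in the protocol enforces these answers, which is why compliance is entirely a matter of crawler behavior — the crux of TollBit's findings.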
Google Sheets offers a simple route into web scraping via the IMPORTXML function, which pulls structured data from a page using an XPath query, making data extraction accessible to a wider audience. Pairing it with a generative AI assistant like ChatGPT enables more advanced scraping tasks without requiring advanced coding skills. While ChatGPT alone struggled to scrape specific data from a webpage accurately, combining it with Google Sheets and IMPORTXML formulas proved efficient and effective, showing how much can be gained by integrating different tools and skills.
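For illustration, an IMPORTXML call takes a page URL and an XPath query; the URL and XPath below are hypothetical placeholders, not taken from the article:

```
=IMPORTXML("https://example.com/books", "//h2[@class='title']")
```

In a workflow like the one described, ChatGPT drafts the XPath expression for the target elements, and that expression is then pasted into the formula.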
Artists now have a tool called Nightshade that can corrupt the training data of AI image models such as DALL-E, Stable Diffusion, and Midjourney, by being applied to their creative work before it is posted online. Nightshade adds invisible, pixel-level changes to digital art, exploiting a security vulnerability in how generative models learn from scraped images. The tool aims to deter AI companies that use copyrighted data without permission. Nightshade can be integrated into Glaze, a companion tool that masks art styles, letting artists choose between corrupting a model's training and preventing it from mimicking their style. It is positioned as a last line of defense against web scrapers that ignore opt-out rules. Copyright issues surrounding AI-generated content and training data remain unresolved, with lawsuits still ongoing.
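Nightshade's actual perturbations are adversarially optimized to mislead training, not random. As a purely illustrative stand-in for the idea of an "invisible" pixel-level change, the sketch below (all names and parameters are invented for this example) adds bounded noise to an 8-bit image, small enough to be imperceptible:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(image, epsilon=2):
    """Add a small, visually imperceptible change to an 8-bit image.

    Illustrative only: Nightshade optimizes its perturbations to corrupt
    a model's learned associations, whereas this sketch uses random noise
    bounded by `epsilon` per channel.
    """
    noise = rng.integers(-epsilon, epsilon + 1, size=image.shape)
    return np.clip(image.astype(int) + noise, 0, 255).astype(np.uint8)

art = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
shaded = perturb(art)

# No pixel moves by more than epsilon, so the image looks unchanged
max_change = int(np.max(np.abs(shaded.astype(int) - art.astype(int))))
print(max_change)  # a value <= 2
```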
The US Supreme Court has declined to hear Genius' web scraping claim against Google and LyricFind for copying its data in search results. Genius had set up a trap by "watermarking" the lyrics to a selection of newly released songs by using Unicode curly apostrophes in certain places and straight apostrophes in others. However, the court ruled that Genius' terms of service claims "are nothing more than claims seeking to enforce the copyright owners’ exclusive rights to protection from unauthorized reproduction of the lyrics and are therefore preempted."
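The apostrophe trick is simple to sketch: each apostrophe position carries one bit, depending on whether the curly or straight character appears. A hypothetical illustration in Python (the `extract_pattern` helper and the sample line are invented for this sketch; Genius' actual pattern is not reproduced here):

```python
# Watermarking via apostrophe shape: curly = 1, straight = 0 (hypothetical encoding)
CURLY, STRAIGHT = "\u2019", "'"

def extract_pattern(text: str) -> str:
    """Read back the hidden bit string from a watermarked lyric."""
    return "".join("1" if ch == CURLY else "0"
                   for ch in text if ch in (CURLY, STRAIGHT))

lyric = "Don\u2019t stop, it's not over, we\u2019re fine"
print(extract_pattern(lyric))  # → 101
```

Because a verbatim copy preserves the exact characters, the same bit pattern showing up in scraped lyrics is what let Genius argue its transcriptions had been copied.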