Tag

Training Data

All articles tagged with #training data

Balancing openness and safety in AI biology data
technology · 9 days ago

More than 100 researchers back a framework that would treat certain biological data like sensitive health records. They argue that most data should remain open, while a narrow subset that could enable misuse, such as data linking viral genetics to real-world traits, needs protection. They warn that training AI models on such data could lower the barrier to designing dangerous pathogens. Legitimate researchers should have access to the protected data, but it should not be uploaded anonymously or browsable on the open web. The aim is to balance scientific progress with biosecurity and prevent worst-case scenarios, with restrictions reassessed regularly as the science evolves.

Study Finds Major AI Models Copy Verbatim Copyrighted Text, Challenging the “Learning” Claim
technology · 1 month ago

Stanford and Yale researchers tested four major LLMs (OpenAI’s GPT-4.1, Google’s Gemini 2.5 Pro, xAI’s Grok 3, and Anthropic’s Claude 3.7 Sonnet) and found that they can reproduce lengthy copyrighted passages with high accuracy (Claude 3.7 Sonnet near-verbatim ~95.8%; Gemini 2.5 Pro ~76.8% on Harry Potter; Claude 3.7 Sonnet >94% on Orwell’s 1984). The results suggest these models may store or copy training data rather than simply learning patterns. Some reproductions required jailbreak-style prompting (Best-of-N). The findings underscore potential legal liabilities as copyright lawsuits proceed and the industry debates what counts as “learning.”

Anthropic to Pay $1.5 Billion to Resolve Copyright Lawsuit with Book Authors
technology · 5 months ago

Anthropic has agreed to pay at least $1.5 billion to authors whose pirated works it used to train its AI models, setting a precedent for AI companies paying compensation for illegal use of content. The settlement establishes a fund that values each pirated book at $3,000 and provides for destroying the pirated materials, highlighting the increasing legal risks for AI firms.

"AI Giants Struggle with Data Depletion: The Quest for More Training Data"
technology · 1 year ago

AI companies are facing a shortage of training data as they continue to build larger models, and are exploring alternative sources such as publicly available video transcripts and synthetic data. Some companies are considering controversial methods like training on transcriptions of public YouTube videos, while others are working on creating higher-quality synthetic data. Concerns have been raised about AI running out of data, but researchers believe that breakthroughs could address the issue. The solution may also involve reevaluating the pursuit of ever-larger models, given environmental and resource concerns.

"Unveiling OpenAI's Groundbreaking Sora AI Videos and Training Data Mystery"
artificial-intelligence · 1 year ago

OpenAI continues to tease its upcoming AI video generator, Sora, with text-to-video clips that are impressing viewers, and plans to release it later this year with sound and metadata. The company is giving early access to some individuals in the film industry for testing. However, OpenAI has been secretive about the training data used for Sora, and CTO Mira Murati was vague about its sources in a recent interview with The Wall Street Journal.

"Controversy Surrounds OpenAI's Sora: Unveiling the Data Mystery"
technology · 1 year ago

OpenAI's CTO, Mira Murati, was unable to provide specific details about the training data used for the company's new video-generating AI, Sora, during an interview with The Wall Street Journal. Murati's vague responses and uncertainty about the source of the data have raised concerns about OpenAI's data-scraping practices and transparency. While she later confirmed that Shutterstock videos were included in Sora's training set, the lack of clarity surrounding the origin of the training data has sparked controversy and criticism.