Tag

Training Data

All articles tagged with #training data

technology • 9 days ago • 2 min saved

Balancing openness and safety in AI biology data

More than 100 researchers back a framework that would treat certain biological data like sensitive health records. They argue that most data should remain open, while a narrow subset that could enable misuse (such as data linking viral genetics to real-world traits) needs protection. They warn that training AI models on such data could lower the barrier to designing dangerous pathogens; legitimate researchers should retain access, but the data should not be uploaded anonymously or left browsable on the open web. The aim is to balance scientific progress with biosecurity and prevent worst-case scenarios, and the researchers advocate regular reassessment of restrictions as the science evolves.

via Axios

#ai-governance #biological-data #biosecurity

technology • 1 month ago • 3 min saved

Study Finds Major AI Models Reproduce Copyrighted Text Verbatim, Challenging the “Learning” Claim

Stanford and Yale researchers tested four major LLMs: OpenAI’s GPT-4.1, Google’s Gemini 2.5 Pro, xAI’s Grok 3, and Anthropic’s Claude 3.7 Sonnet. They found that the models can reproduce lengthy, copyrighted passages with high accuracy (Claude 3.7 Sonnet near-verbatim ~95.8%; Gemini 2.5 Pro ~76.8% on Harry Potter; Claude 3.7 Sonnet >94% on Orwell’s 1984), suggesting these models may store or copy training data rather than simply learning patterns. Some reproductions required jailbreak-style prompts (Best-of-N). The findings underscore potential legal liabilities as copyright lawsuits proceed and the industry debates what counts as “learning.”

via Futurism

#ai #copyright #large-language-models

technology • 2 months ago • 2 min saved

Anthropic Finds Poisoning LLMs Requires Only a Few Samples

Research by Anthropic and partners shows that injecting just 250 carefully crafted poison samples into training data can compromise large language models, causing them to produce gibberish or potentially dangerous outputs, highlighting vulnerabilities in AI training processes.

via Hackaday

#ai-security #anthropic #data-poisoning

technology • 4 months ago • 2 min saved

Training on Low-Quality Data Causes Lasting AI 'Brain Rot'

A new study suggests that training AI models on low-quality, clickbait-style content causes lasting cognitive decline that cannot be easily fixed, similar to the effects of 'brain rot' in humans, highlighting the risks of unregulated data use.

via Futurism

#ai #brain-rot #cognitive-damage

technology • 5 months ago • 1 min saved

Authors React to Anthropic's $1.5 Billion AI Settlement and Copyright Concerns

Anthropic agreed to a $1.5 billion settlement for authors whose books were used to train its AI model, Claude, with a minimum of $3,000 per book, leading the article's author to reconsider the value of compensation for such use.

via WIRED

#ai #anthropic #authors

technology • 5 months ago • 2 min saved

Anthropic to Pay $1.5 Billion to Resolve Copyright Lawsuit with Book Authors

Anthropic has agreed to pay at least $1.5 billion to authors whose pirated works it used to train its AI models, creating a precedent for AI companies compensating for illegal content use. The settlement involves establishing a fund valuing each pirated book at $3,000 and destroying the pirated materials, highlighting increasing legal risks for AI firms.

via theregister.com

#ai-copyright-lawsuit #anthropic #pirated-works

technology • 5 months ago • 1 min saved

Anthropic to Pay $1.5 Billion in AI Copyright Settlement with Authors

Anthropic agrees to pay at least $1.5 billion in the largest U.S. copyright settlement, compensating authors for the use of their works in AI training, highlighting legal and ethical issues in sourcing training data for AI models.

via Axios

#ai #anthropic #copyright

technology • 6 months ago • 5 min saved

Filtered Data Enhances AI Safety and Reliability

Researchers found that filtering risky content from AI training data, such as bioweapons instructions, can create safer models without harming performance, highlighting the importance of pre-training safeguards over post-training tweaks. The study emphasizes transparency and proactive safety measures in AI development, contrasting with industry secrecy.

via Fortune

#ai-safety #bioweapons #deep-ignorance

technology • 7 months ago • 1 min saved

xAI Workers Resist Training to Humanize Grok with Their Faces

xAI asked more than 200 employees to record conversations for facial recognition training, aiming to help Grok analyze facial expressions, but met internal resistance over privacy concerns and the sensitivity of biometric data, amid broader privacy law challenges.

via Ars Technica

#employee-dissent #facial-recognition #privacy-concerns

law • 8 months ago • 2 min saved

US Court Rules in Favor of Anthropic on AI Training and Copyright

A US judge ruled that Anthropic's use of copyrighted books to train its AI is 'transformative' and not a copyright violation, but the case will proceed to trial over the company's use of pirated copies, highlighting ongoing legal debates about AI training practices.

via BBC

#ai #anthropic #copyright

technology • 8 months ago • 4 min saved

Music Industry Battles AI-Generated Fraud with New Tech

The music industry is developing infrastructure to detect and trace AI-generated songs early in the production and distribution process, focusing on licensing and control rather than enforcement, with tools that analyze tracks for synthetic elements, attribute creative influence, and regulate training data to prevent misuse.

via The Verge

#ai-detection #copyright #metadata-tagging

technology • 1 year ago • 1 min saved

Adobe's AI Firefly: Training with Rival Images and Generative Video Tools

Adobe's Firefly AI, touted for its practicality and ethical training data, was trained on Midjourney images in addition to public domain material and Adobe Stock, a revelation that casts doubt on its commercial safety claims and ethical standing.

via Creative Bloq

#adobe #ai-art-generators #digital-art

technology • 1 year ago • 2 min saved

AI Giants Struggle with Data Depletion: The Quest for More Training Data

AI companies are facing a shortage of training data as they continue to build larger models, leading them to explore alternative sources such as publicly available video transcripts and synthetic data. Some companies are considering controversial methods like training on transcriptions from public YouTube videos, while others are working on creating higher-quality synthetic data. Concerns about AI running out of data have been raised, but researchers believe that breakthroughs could address the issue. However, the solution may also involve reevaluating the pursuit of larger models due to environmental and resource concerns.

via Futurism

#ai #data-shortage #internet

artificial-intelligence • 1 year ago • 2 min saved

Unveiling OpenAI's Groundbreaking Sora AI Videos and Training Data Mystery

OpenAI continues to tease its upcoming AI video generator, Sora, with text-to-video clips that are impressing viewers, and plans to release it later this year with sound and metadata. The company is giving early access to some individuals in the film industry for testing. However, there is secrecy surrounding the training data used for Sora, with OpenAI's CTO Mira Murati being vague about its sources in a recent interview with The Wall Street Journal.

via PetaPixel

#ai-generated-videos #artificial-intelligence #mira-murati

technology • 1 year ago • 3 min saved

Controversy Surrounds OpenAI's Sora: Unveiling the Data Mystery

OpenAI's CTO, Mira Murati, was unable to provide specific details about the training data used for the company's new video-generating AI, Sora, during an interview with The Wall Street Journal. Murati's vague responses and uncertainty about the source of the data have raised concerns about OpenAI's data-scraping practices and transparency. While she later confirmed that Shutterstock videos were included in Sora's training set, the lack of clarity surrounding the origin of the training data has sparked controversy and criticism.

via Futurism

#ai #openai #sora