Tag

Dataset

All articles tagged with #dataset

history • 3 months ago • 94 min saved

Explore the Roman Empire's Extensive Road Network with a New Digital Atlas

Itiner-e is a comprehensive, high-resolution digital dataset that maps nearly 300,000 km of Roman roads across the empire. Built from archaeological, historical, and remote sensing sources, it also reveals significant gaps in certainty and coverage that can inform future research on ancient mobility and infrastructure development.

via Nature

#archaeology #dataset #history

technology • 1 year ago • 1 min saved

Harvard and Google Launch AI Training Dataset with 1 Million Books

Harvard University, in collaboration with Google, plans to release a dataset of approximately 1 million public-domain books for AI training, sourced from Google's book-scanning project. This initiative, part of Harvard's Institutional Data Initiative (IDI), aims to democratize access to AI training data, with financial support from Microsoft and OpenAI. The dataset will include works from authors like Dickens and Shakespeare, and is intended to be accessible to research labs and AI startups.

via TechCrunch

#ai #dataset #google

technology • 2 years ago • 3 min saved

AI Training Dataset Reveals Disturbing Presence of Child Sexual Abuse Material

The LAION-5B dataset, used to train popular AI image generators like Stable Diffusion, has been found to contain thousands of instances of child sexual abuse material (CSAM), according to a study by the Stanford Internet Observatory (SIO). The dataset includes metadata and URLs pointing to the images, some of which were found hosted on sites such as Reddit and Twitter and on adult websites. The SIO reported the findings to the National Center for Missing and Exploited Children (NCMEC) and the Canadian Centre for Child Protection (C3P), and the removal of the identified source material is underway. LAION has announced plans for regular maintenance procedures to remove suspicious and potentially unlawful content from its datasets.

via The Register

#ai #child-protection #csam

technology • 2 years ago • 1 min saved

AI Training Data Contaminated with Child Sexual Abuse Imagery, Study Reveals

Stanford's Internet Observatory discovered that the popular AI image training dataset, LAION-5B, used by Stability AI and Google's Imagen image generators, contained links to child sexual abuse imagery. The dataset included at least 1,679 illegal images scraped from social media and adult websites. While the dataset does not store the images itself, it provides links and alt text. LAION, the nonprofit managing the dataset, temporarily removed it, emphasizing a "zero-tolerance" policy for harmful content. Stanford researchers recommended deprecating and ceasing distribution of models trained on LAION-5B, and US attorneys general have called for an investigation into the impact of AI on child exploitation and the prohibition of AI-generated child sexual abuse material.

via The Verge

#ai #child-sexual-abuse-imagery #dataset

robotics • 2 years ago • 2 min saved

"Revolutionizing Robotics: Google DeepMind Unites Researchers to Create ImageNet of Robot Actions and Open-Sources Largest-Ever Dataset"

Google's DeepMind robotics team has collaborated with 33 research institutes to create Open X-Embodiment, a shared database aimed at advancing robotics through the use of a large, diverse dataset. Similar to ImageNet for computer vision, Open X-Embodiment features over 500 skills and 150,000 tasks from 22 different robot types. The database is being made available to the research community to reduce barriers and accelerate research in robot learning, with the goal of enabling robots to learn from each other and researchers to learn from one another.

via TechCrunch

#dataset #google-deepmind #open-x-embodiment

technology • 2 years ago • 1 min saved

The Dark Web's Role in Training AI: Fresh Concerns and Secret Sources

The Washington Post has created a search tool that lets users check whether their website or content was used to train AI systems as part of Google's C4 dataset, which includes material from websites and content creators that generative AI could negatively affect. C4 is only part of the data used by Google Bard and other large language models, which also draw on Wikipedia, Reddit, and other sources. Reddit has updated its API terms and will now charge some companies, including Google and OpenAI, for access to its corpus of data.

via Search Engine Land

#ai #dataset #google

technology • 2 years ago • 1 min saved

Meta's AI Model Revolutionizes Image Segmentation and Object Recognition

Meta has released an artificial intelligence model called Segment Anything Model (SAM) that can identify individual objects within images and videos, even if it has not encountered them before. Users can select objects by clicking on them or by writing text prompts. Meta has also released a dataset of image annotations, which it claims is the largest of its kind. The SAM model and dataset will be available for download under a non-commercial license.

via Reuters

#artificial-intelligence #dataset #image-recognition

ai • 2 years ago • 1 min saved

Meta's Latest AI Breakthroughs in Object Detection and Segmentation

Meta has released an AI model called "Segment Anything" that can detect objects in pictures and videos even if they weren't part of the training set. The model can work in tandem with other models and can reduce the need for additional AI training. The AI model and dataset will be downloadable under a non-commercial license. While the model is flawed, it may help in situations where it's impractical to rely exclusively on training data.

via Engadget

#ai #ai-model #computer-vision

ai • 2 years ago • 2 min saved

Meta's AI Models Revolutionize Image Segmentation and Identification

Meta has introduced an AI model called the Segment Anything Model (SAM) that can identify individual objects in images and videos, even those not encountered during training. SAM is an image segmentation model that can respond to text prompts or user clicks to isolate specific objects within an image. Meta hopes to "democratize" the process of creating accurate segmentation models by reducing the need for specialized training and expertise. Meta has also assembled a dataset it calls "SA-1B" that includes 11 million images licensed from "a large photo company" and 1.1 billion segmentation masks produced by its segmentation model.

via Ars Technica

#ai #computer-vision #dataset