Tag

Dataset

All articles tagged with #dataset

Explore the Roman Empire's Extensive Road Network with a New Digital Atlas

Originally Published 2 months ago — by Nature

Featured image for Explore the Roman Empire's Extensive Road Network with a New Digital Atlas
Source: Nature

Itiner-e is a comprehensive, high-resolution digital dataset mapping nearly 300,000 km of Roman roads across the empire, created from archaeological, historical, and remote sensing sources, revealing significant gaps in certainty and coverage that can inform future research on ancient mobility and infrastructure development.

Harvard and Google Launch AI Training Dataset with 1 Million Books

Originally Published 1 year ago — by TechCrunch

Featured image for Harvard and Google Launch AI Training Dataset with 1 Million Books
Source: TechCrunch

Harvard University, in collaboration with Google, plans to release a dataset of approximately 1 million public-domain books for AI training, sourced from Google's book-scanning project. This initiative, part of Harvard's Institutional Data Initiative (IDI), aims to democratize access to AI training data, with financial support from Microsoft and OpenAI. The dataset will include works from authors like Dickens and Shakespeare, and is intended to be accessible to research labs and AI startups.

AI Training Dataset Reveals Disturbing Presence of Child Sexual Abuse Material

Originally Published 2 years ago — by The Register

Featured image for AI Training Dataset Reveals Disturbing Presence of Child Sexual Abuse Material
Source: The Register

The LAION-5B dataset, used to train popular AI image generators like Stable Diffusion, has been found to contain thousands of instances of child sexual abuse material (CSAM), according to a study by the Stanford Internet Observatory (SIO). The dataset includes metadata and URLs pointing to the images, some of which were found hosted on websites like Reddit, Twitter, and adult websites. The SIO reported the findings to the National Center for Missing and Exploited Children (NCMEC) and the Canadian Centre for Child Protection (C3P), and the removal of the identified source material is underway. LAION has announced plans for regular maintenance procedures to remove suspicious and potentially unlawful content from its datasets.

AI Training Data Contaminated with Child Sexual Abuse Imagery, Study Reveals

Originally Published 2 years ago — by The Verge

Featured image for AI Training Data Contaminated with Child Sexual Abuse Imagery, Study Reveals
Source: The Verge

Stanford's Internet Observatory discovered that the popular AI image training dataset, LAION-5B, used by Stability AI and Google's Imagen image generators, contained links to child sexual abuse imagery. The dataset included at least 1,679 illegal images scraped from social media and adult websites. While the dataset does not store the images itself, it provides links and alt text. LAION, the nonprofit managing the dataset, temporarily removed it, emphasizing a "zero-tolerance" policy for harmful content. Stanford researchers recommended deprecating and ceasing distribution of models trained on LAION-5B, and US attorneys general have called for an investigation into the impact of AI on child exploitation and the prohibition of AI-generated child sexual abuse material.

"Revolutionizing Robotics: Google DeepMind Unites Researchers to Create ImageNet of Robot Actions and Open-Sources Largest-Ever Dataset"

Originally Published 2 years ago — by TechCrunch

Featured image for "Revolutionizing Robotics: Google DeepMind Unites Researchers to Create ImageNet of Robot Actions and Open-Sources Largest-Ever Dataset"
Source: TechCrunch

Google's DeepMind robotics team has collaborated with 33 research institutes to create Open X-Embodiment, a shared database aimed at advancing robotics through the use of a large, diverse dataset. Similar to ImageNet for computer vision, Open X-Embodiment features over 500 skills and 150,000 tasks from 22 different robot types. The database is being made available to the research community to reduce barriers and accelerate research in robot learning, with the goal of enabling robots to learn from each other and researchers to learn from one another.

The Dark Web's Role in Training AI: Fresh Concerns and Secret Sources.

Originally Published 2 years ago — by Search Engine Land

Featured image for The Dark Web's Role in Training AI: Fresh Concerns and Secret Sources.
Source: Search Engine Land

The Washington Post has created a search tool that allows users to find out if their website or content was used to train AI systems as part of Google's C4 dataset, which includes websites and content creators that generative AI could potentially negatively impact. The C4 dataset is only part of the data used by Google Bard and other large language models, which also use Wikipedia, Reddit, and other sources. Reddit has updated its API terms and will now charge some companies, including Google and OpenAI, for access to its valuable corpus of data.

Meta's AI Model Revolutionizes Image Segmentation and Object Recognition

Originally Published 2 years ago — by Reuters

Featured image for Meta's AI Model Revolutionizes Image Segmentation and Object Recognition
Source: Reuters

Meta has released an artificial intelligence model called Segment Anything Model (SAM) that can identify individual objects within images and videos, even if it has not encountered them before. SAM can select objects by clicking on them or writing text prompts. Meta has also released a dataset of image annotations, which it claims is the largest of its kind. The SAM model and dataset will be available for download under a non-commercial license.

Meta's Latest AI Breakthroughs in Object Detection and Segmentation

Originally Published 2 years ago — by Engadget

Featured image for Meta's Latest AI Breakthroughs in Object Detection and Segmentation
Source: Engadget

Meta has released an AI model called "Segment Anything" that can detect objects in pictures and videos even if they weren't part of the training set. The model can work in tandem with other models and can limit the need for additional AI training. The AI model and dataset will be downloadable with a non-commercial license. While the model is flawed, it may help in situations where it's impractical to rely exclusively on training data.

Meta's AI models revolutionize image segmentation and identification.

Originally Published 2 years ago — by Ars Technica

Featured image for Meta's AI models revolutionize image segmentation and identification.
Source: Ars Technica

Meta has introduced an AI model called the Segment Anything Model (SAM) that can identify individual objects in images and videos, even those not encountered during training. SAM is an image segmentation model that can respond to text prompts or user clicks to isolate specific objects within an image. Meta hopes to "democratize" the process of creating accurate segmentation models by reducing the need for specialized training and expertise. Meta has also assembled a dataset it calls "SA-1B" that includes 11 million images licensed from "a large photo company" and 1.1 billion segmentation masks produced by its segmentation model.