AI2 Releases Massive Open Dataset for Training Language Models and Scholarly Data

Source: TechCrunch
TL;DR Summary

The Allen Institute for AI (AI2) has released Dolma, its largest open dataset yet, consisting of 3 trillion tokens for training language models. Dolma is intended as the basis for AI2's planned open language model, OLMo. Unlike companies that keep their training data and processes secret, AI2 aims to make Dolma transparent and accessible to the AI research community. The dataset is publicly documented. Under its license, users must provide contact information, disclose derivative creations, distribute derivatives under the same license, and agree not to apply Dolma to prohibited areas. Dolma is available via Hugging Face.

