AI2 Releases Massive Open Dataset for Training Language Models and Scholarly Data

August 18, 2023 at 08:30 PM

TL;DR Summary

The Allen Institute for AI (AI2) has released Dolma, its largest open dataset yet, consisting of 3 trillion tokens for training language models. Dolma is intended as the basis for AI2's planned open language model, OLMo. Unlike companies that keep their language model training data and processes secret, AI2 aims to make Dolma transparent and accessible to the AI research community. The dataset is publicly documented. To use it, users must provide contact information, disclose derivative creations, distribute derivatives under the same license, and agree not to apply Dolma to prohibited areas. Dolma is available via Hugging Face.

Topics: technology #ai2 #artificial-intelligence #dolma #language-models #olmo #open-dataset


Want the full story? Read the original article

Read on TechCrunch
