
AI2 Releases Massive Open Dataset for Training Language Models and Scholarly Data
The Allen Institute for AI (AI2) has released Dolma, its largest open dataset yet, consisting of 3 trillion tokens for training language models. Dolma is intended to serve as the basis for AI2's planned open language model, OLMo. Unlike companies that keep their language model training data and processes secret, AI2 aims to make Dolma transparent and accessible to the AI research community. The dataset is publicly documented, and its license requires users to provide contact information, disclose derivative creations, distribute derivatives under the same license, and agree not to apply Dolma to prohibited areas. Dolma is available through Hugging Face.