Reuben's Data Lab
Raw datasets: in four Hugging Face collections
Total rows: every row, every dataset
Languages: many rarely seen online
Modalities: audio, text, images, code
Days to build: April 8 to April 24, 2026

Raw corpus

Every dataset I've created in the ReubenDataLab collections

Adaption-remastered

Datasets improved by running them through adaptionlabs.ai

Modality split

Share of the corpus by data type

Languages across the corpus

Every language that appears in any raw dataset, sized on a log scale by total row count.
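As a rough illustration of what log-scale sizing means here, the sketch below maps row counts to display sizes so that a language with a million rows does not dwarf one with a hundred. It is not the code behind this page: the function name `log_scaled_sizes`, the size bounds, and the example counts are all assumptions made up for illustration.

```python
import math


def log_scaled_sizes(row_counts, min_size=10.0, max_size=60.0):
    """Map each language's row count to a display size on a log scale.

    Hypothetical helper: min_size and max_size are arbitrary display units.
    """
    # Only positive counts have a logarithm; languages with zero rows are skipped.
    logs = {lang: math.log10(n) for lang, n in row_counts.items() if n > 0}
    if not logs:
        return {}

    lo, hi = min(logs.values()), max(logs.values())
    span = hi - lo

    sizes = {}
    for lang, value in logs.items():
        # If every count is equal, place all languages at the midpoint size.
        t = (value - lo) / span if span > 0 else 0.5
        sizes[lang] = min_size + t * (max_size - min_size)
    return sizes


# Placeholder counts, not taken from the corpus.
example = {"lang_a": 100, "lang_b": 10_000, "lang_c": 1_000_000}
sizes = log_scaled_sizes(example)
```

With this mapping, each tenfold increase in rows adds the same amount of display size, which keeps low-resource languages visible next to the largest ones.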