Raw datasets
across four Hugging Face collections
Total rows
every row, every dataset
Languages
many rarely seen online
Modalities
audio, text, images, code
Days to build
April 8 to April 24, 2026
Raw corpus
Every dataset I've created in the ReubenDataLab collections
Adaption-remastered
Improved datasets after running them through adaptionlabs.ai
Modality split
Share of the corpus by data type
Languages across the corpus
Every language that appears in any raw dataset, sized on a log scale by total row count.
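As a rough illustration of what log-scale sizing means here, the sketch below maps total row counts to display sizes so that a language with 100 times more rows does not appear 100 times larger. It is only a sketch under assumptions: the function name `log_scaled_sizes`, the size range, and the placeholder counts are all hypothetical and are not taken from the page or its actual charting code.

```python
import math

def log_scaled_sizes(row_counts, min_size=10.0, max_size=100.0):
    """Map each language's total row count to a display size on a log scale.

    Hypothetical helper: the size range is an assumed default,
    not a value taken from the original chart.
    """
    # Ignore non-positive counts, since log10 is undefined for them.
    logs = {lang: math.log10(n) for lang, n in row_counts.items() if n > 0}
    if not logs:
        return {}
    lo, hi = min(logs.values()), max(logs.values())
    span = hi - lo
    sizes = {}
    for lang, value in logs.items():
        # If every language has the same count, place them all mid-range.
        frac = 0.5 if span == 0 else (value - lo) / span
        sizes[lang] = min_size + frac * (max_size - min_size)
    return sizes

# Placeholder counts for illustration only, not figures from the corpus.
example_counts = {"lang_a": 100, "lang_b": 10_000, "lang_c": 1_000_000}
sizes = log_scaled_sizes(example_counts)
```

With this mapping, each tenfold increase in rows adds the same amount of size, which keeps low-resource languages visible next to the largest ones.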