HuggingFaceFW/fineweb
• 52.5B • 163k • 2.68k
A collection of datasets for LLM pretraining
Note 🍷 Web datasets
Note 📚 Highly curated web datasets filtered using classifiers
Note 📐 Highly curated math pages from CommonCrawl
Note 💻 GitHub code dataset
Note Synthetic textbooks
Note Contains Cosmopedia v2 (synthetic textbooks) and Python-Edu (educational Python code)