Fineweb-2

This is the second iteration of the popular FineWeb dataset, bringing high-quality pretraining data to over 1000 languages.

The FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments.

The data was sourced from 96 CommonCrawl snapshots, spanning the summer of 2013 to April 2024, and processed using 🏭 datatrove, our large-scale data processing library. This carefully deduplicated and filtered dataset comprises roughly 8 terabytes of compressed text data, with almost 3 trillion words (see "How many tokens?" for more details). For PII and opt-out, see "Personal and Sensitive Information and opt-out".

Data and resources

Keywords

Additional info

URI https://data.gov.dk/dataset/lang/b2750f44-b4e6-45a1-b6fd-9d8ac73d0c3a
Landing page https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
Harvested by Datavejviser No
Publication date 08-01-2025
Last modified date 08-01-2025
Update frequency updated continuously
Coverage period  / 
Topic(s) Government and public sector
Access rights public
Conforms to
Provenance statement

This dataset originates from Common Crawl. The use of this dataset is also subject to CommonCrawl's Terms of Use.

Documentation