FineWeb-C

FineWeb-C: Educational content in many languages, labelled by the community.

This entry links to the Danish part of the dataset.

This is a collaborative, community-driven project that expands upon the FineWeb2 dataset. Our goal is to create high-quality educational content annotations across hundreds of languages.

These annotations will help train AI systems to automatically identify high-quality educational content in more languages. That in turn supports building better Large Language Models (LLMs) for all languages, making AI technology more accessible and effective globally.

What the community is doing:

- For a given language, look at a page of web content from the FineWeb2 dataset in Argilla.
- Rate how educational the content is.
- Flag problematic content, i.e. content that is malformed or in the wrong language.

Once a language reaches 1,000 annotations, it will be included in this dataset. Alongside rating the educational quality of the content, different language communities are discussing other ways to improve the quality of data for their language in our Discord discussion channel.
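The inclusion rule described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual tooling: the `Annotation` record, its field names, the language codes, and the assumption that every submitted annotation (including flagged ones) counts toward the threshold are all hypothetical. Only the 1,000-annotation threshold comes from the text.

```python
from collections import Counter
from dataclasses import dataclass

# From the text: a language is included once it reaches 1,000 annotations.
INCLUSION_THRESHOLD = 1_000


@dataclass
class Annotation:
    """Hypothetical record for one community annotation."""
    language: str           # hypothetical language code, e.g. "dan_Latn"
    educational_value: str  # hypothetical rating label chosen by the annotator
    problematic: bool       # True if flagged as malformed or in the wrong language


def languages_ready(annotations):
    """Return the sorted languages that have reached the inclusion threshold.

    Assumption: every annotation counts, whether or not it was flagged.
    """
    counts = Counter(a.language for a in annotations)
    return sorted(lang for lang, n in counts.items() if n >= INCLUSION_THRESHOLD)
```

Under these assumptions, calling `languages_ready` on a list holding 1,000 annotations for one language and 999 for another would return only the first language.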

The use of this dataset is also subject to CommonCrawl's Terms of Use.

Data and resources

Keywords

Additional info

URI: https://data.gov.dk/dataset/lang/7b8ee028-9afa-4d98-96ad-7613bd774815
Landing page: https://huggingface.co/datasets/data-is-better-together/fineweb-c#fineweb-c-educational-content-in-many-languages-labelled-by-the-community
Harvested by Datavejviser: No
Publication date: 06-01-2025
Last modified date: 30-01-2025
Update frequency: continuously updated
Coverage period:  / 
Topic(s): Government and public sector
Access rights: public
Conforms to:
Provenance statement:

This dataset originated from Common Crawl. The use of this dataset is also subject to CommonCrawl's Terms of Use.

Documentation