
Synthetic from Unit Triple Tasks Danish

The purpose of this dataset is to pre- or post-train embedding models for Danish on text similarity tasks.

The dataset consists of 100,000 samples generated with gemma-2-27b-it.

The "prompt" column contains the prompt given to the LLM, and the "response" column contains the LLM output.
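As a rough illustration of how the two columns might be consumed, here is a minimal Python sketch. It assumes the Hugging Face `datasets` library is installed and that the data lives in a split named "train" (the split name is an assumption, not stated on this page). The repository ID is taken from the dataset's Hugging Face URL. The helper names `load_rows` and `prompt_response_pairs` are hypothetical, and skipping empty responses is only a precaution, not a claim about the data.

```python
from typing import Iterable, Iterator

# Repository ID taken from the dataset's Hugging Face URL.
DATASET_ID = "ThatsGroes/synthetic-from-unit-triple-tasks-danish"


def load_rows(split: str = "train"):
    """Download the dataset from the Hugging Face Hub.

    Assumption: the split is named "train"; check the dataset page if this fails.
    Requires `pip install datasets` and network access.
    """
    # Imported lazily so the helper below can be used without the library.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)


def prompt_response_pairs(rows: Iterable[dict]) -> Iterator[tuple]:
    """Yield (prompt, response) for each row, skipping rows with an empty response."""
    for row in rows:
        if row["response"]:  # precaution only; empty outputs may not occur
            yield row["prompt"], row["response"]
```

Calling `prompt_response_pairs(load_rows())` would then iterate over the samples one pair at a time; the helper also works on any iterable of dictionaries that have the same two keys.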

Data generation followed the process described in this paper: https://arxiv.org/pdf/2401.00368

Compute was sponsored by Arrow Denmark and Nvidia through the Danish Data Science Community.

Data and resources

Keywords

Additional info

URI: https://data.gov.dk/dataset/lang/ea995a41-f9b8-4dbd-b7b0-c443b8001572
Landing page: https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-danish
Harvested by Datavejviser: No
Release date: 22-01-2025
Last modified date: 24-01-2025
Update frequency: never
Coverage period:  / 
Theme(s): Education, culture and sport
Access rights: public
Conforms to:
Provenance statement:

Data generation followed the process described in this paper: https://arxiv.org/pdf/2401.00368

Documentation