Synthetic from Unit Triple Tasks Danish

The purpose of this dataset is to pre- or post-train embedding models for Danish on text similarity tasks.

The dataset consists of 100,000 samples generated with gemma-2-27b-it.

The "prompt" column contains the prompt given to the LLM, and the "response" column contains the LLM output.
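As a rough illustration of working with these two columns, the sketch below reads the dataset with the Hugging Face `datasets` library and tries to pull a JSON object out of each "response". Several things here are assumptions rather than facts from this page: that the split is named "train", that "response" is stored as a plain string, and that each response contains a single JSON object (possibly wrapped in extra text or a code fence). The helper names `parse_response` and `iter_parsed` are hypothetical.

```python
import json
from typing import Optional

# Dataset id taken from the landing page URL of this entry.
DATASET_ID = "ThatsGroes/synthetic-from-unit-triple-tasks-danish"


def parse_response(response: str) -> Optional[dict]:
    """Return the JSON object embedded in an LLM response, or None.

    Assumption: a response holds at most one JSON object, possibly
    surrounded by extra text or a Markdown code fence.
    """
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def iter_parsed(split: str = "train"):
    """Yield (prompt, parsed_response) pairs for rows that parse cleanly.

    Needs `pip install datasets` and network access. The split name
    "train" is an assumption.
    """
    # Imported lazily so parse_response can be used without the library.
    from datasets import load_dataset

    for row in load_dataset(DATASET_ID, split=split):
        parsed = parse_response(row["response"])
        if parsed is not None:
            yield row["prompt"], parsed
```

Rows whose output does not parse are skipped rather than repaired, which keeps the sketch simple; inspect a few raw rows first to confirm the actual column types and output format.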

Data generation followed the process described in this paper: https://arxiv.org/pdf/2401.00368

Compute was sponsored by Arrow Denmark and Nvidia through the Danish Data Science Community.

Data and resources

URI	https://data.gov.dk/dataset/lang/ea995a41-f9b8-4dbd-b7b0-c443b8001572
Landing page	https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-danish
Harvested by Datavejviser	No
Publication date	22-01-2025
Last modified date	24-01-2025
Update frequency	never
Coverage period	/
Topic(s)	Education, culture and sport
Access rights	public
Conforms to
Provenance statement	The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368
Documentation	https://huggingface.co/datasets/ThatsGroes/synthetic-from-unit-triple-tasks-danish/blob/main/README.md