Synthetic from Text Matching Long Tasks Danish

The purpose of this dataset is to pre- or post-train embedding models for Danish text matching tasks.

The dataset consists of 100,000 samples generated with gemma-2-27b-it.

The column "prompt" contains the prompt given to the LLM, and the column "response" contains the LLM output.
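As a minimal sketch of how the two columns might be read, the snippet below uses the Hugging Face `datasets` library with the dataset ID taken from the landing page URL. The split name `"train"` and the helper names `load_samples` and `iter_pairs` are assumptions made for illustration; they are not confirmed by the dataset card.

```python
from typing import Any, Iterable, Iterator, Mapping, Tuple

# Dataset ID taken from the landing page URL of this catalog entry.
DATASET_ID = "ThatsGroes/synthetic-from-text-matching-long-tasks-danish"


def load_samples(split: str = "train"):
    """Download the dataset from the Hugging Face Hub (requires network access).

    The split name "train" is an assumption; check the dataset page if it fails.
    """
    # Imported lazily so the helper below works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)


def iter_pairs(rows: Iterable[Mapping[str, Any]]) -> Iterator[Tuple[Any, Any]]:
    """Yield (prompt, response) pairs from rows that have those two columns."""
    for row in rows:
        yield row["prompt"], row["response"]
```

With network access, `for prompt, response in iter_pairs(load_samples()): ...` would walk through the samples one at a time. The helper makes no assumption about whether each column holds a plain string or a structured value.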

Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/retrieval-tasks-processed
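The seed-task sampling step can be sketched as follows. This is an illustration only, not the author's generation script: the function name `sample_seed_tasks`, the fixed random seed, and sampling with replacement are all assumptions, and the pool of seed tasks is represented here as a plain list of strings.

```python
import random
from typing import List, Sequence


def sample_seed_tasks(pool: Sequence[str], n_samples: int, seed: int = 0) -> List[str]:
    """Draw one seed task per sample to generate, uniformly at random.

    Sampling with replacement is an assumption; the source only states that
    seed tasks were randomly sampled.
    """
    if not pool:
        raise ValueError("seed task pool is empty")
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    return rng.choices(pool, k=n_samples)
```

Calling `sample_seed_tasks(pool, 100_000)` would give one seed task for each of the 100,000 samples; each drawn task would then be placed into the prompt sent to the LLM.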

The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368

Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.

Data and resources

URI	https://data.gov.dk/dataset/lang/5fb46262-2742-451a-b145-1ee2d0e17ad2
Landing page	https://huggingface.co/datasets/ThatsGroes/synthetic-from-text-matching-long-tasks-danish
Harvested by Datavejviser	No
Publication date	22-01-2025
Last modified date	24-01-2025
Update frequency	never
Coverage period	/
Topic(s)	Education, culture and sport
Access rights	public
Conforms to
Provenance statement	The data generation process described in this paper was followed: https://arxiv.org/pdf/2401.00368
Documentation	https://huggingface.co/datasets/ThatsGroes/synthetic-from-text-matching-long-tasks-danish/blob/main/README.md