
Synthetic from Text Matching Long Tasks Danish

The purpose of this dataset is to pre- or post-train embedding models for Danish text matching tasks.

The dataset consists of 100,000 samples generated with gemma-2-27b-it.

The column "prompt" shows the prompt given to the LLM and "response" shows the LLM output.
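As a rough sketch of how a row might be consumed, the snippet below pulls a JSON object out of the "response" text. It rests on assumptions not stated in this description: that each response contains a JSON object (as the prompts in the referenced paper request), and that the keys are named "input" and "positive_document". The sample row and the helper `parse_response` are hypothetical and for illustration only; inspect real rows before relying on this.

```python
import json


def parse_response(response):
    """Return the first-to-last-brace JSON object in an LLM response, or None.

    Assumption: the response embeds one JSON object, possibly surrounded by
    extra text. The actual output format should be checked against real rows.
    """
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Hypothetical row; the key names inside the response are assumed, not documented.
row = {
    "prompt": "(hypothetical prompt text)",
    "response": 'Here is the JSON:\n{"input": "Tekst A", "positive_document": "Tekst B"}',
}

pair = parse_response(row["response"])
if pair is not None:
    # Each parsed pair could serve as one positive example for embedding training.
    print(pair.get("input"), "->", pair.get("positive_document"))
```

Rows whose response cannot be parsed return `None`, so they can be counted or dropped rather than silently producing malformed training pairs.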

Each sample in the dataset was generated from a seed task randomly sampled from https://huggingface.co/datasets/ThatsGroes/retrieval-tasks-processed

The data generation process described in this paper was followed:

https://arxiv.org/pdf/2401.00368

Compute sponsored by Arrow Denmark and Nvidia through Danish Data Science Community.

Data and resources

Keywords

Additional information

URI https://data.gov.dk/dataset/lang/5fb46262-2742-451a-b145-1ee2d0e17ad2
Landing page https://huggingface.co/datasets/ThatsGroes/synthetic-from-text-matching-long-tasks-danish
Harvested by Datavejviser No
Publication date 22-01-2025
Last modified date 24-01-2025
Update frequency never
Coverage period  / 
Topic(s) Education, culture and sport
Access rights public
Conforms to
Provenance statement

The data generation process described in this paper was followed:

https://arxiv.org/pdf/2401.00368

Documentation