Dataset

Syntetisk dialog opsummering raw

Thanks to NVIDIA and Arrow Denmark for sponsoring the compute needed to generate this dataset.

This dataset consists of 1,000,000 synthetic dialogs in Danish and a summary of each dialog, generated with google/gemma-2-27b-it.
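A minimal sketch of how the dataset might be inspected from Python, assuming the Hugging Face `datasets` library is installed and that the data is exposed under a split named `train` (the split name is an assumption, not stated here). The helper names `take` and `stream_dataset` are hypothetical. Streaming avoids downloading all 1,000,000 rows just to look at a few, and no column names are assumed.

```python
from itertools import islice

# Repository id taken from the dataset's Hugging Face landing page.
DATASET_ID = "ThatsGroes/syntetisk-dialog-opsummering-raw"


def take(records, n=3):
    """Return the first n rows from any iterable of rows."""
    return list(islice(records, n))


def stream_dataset(split="train"):
    """Open the dataset in streaming mode (split name is an assumption)."""
    # Imported lazily so that `take` works without the library installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split, streaming=True)


# Example use (requires network access):
#   for row in take(stream_dataset(), n=3):
#       print(sorted(row.keys()))
```

Printing the keys of the first few rows is a cheap way to discover the actual column names before writing any fine-tuning code.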

The purpose of the dataset is to fine-tune small language models to summarize dialogs, but with minor adjustments it may also be used 1) to train an LLM to restore or improve speaker diarization, 2) to train a classifier that assigns dialogs to topics, 3) as part of the training data for a Danish embedding model.

The dialogs cover almost 21,000 different topics from ThatsGroes/dialog-topics and ThatsGroes/wiki_views. For the latter, the "article" column is considered a conversation topic.

In addition, a number of hand-crafted customer service topics were added.

The code that generated this dataset can be found in my GitHub repository dialog-opsummering.

During generation, a temperature of 0.95 and a top_p of 0.9 were used, and no random seed was set.
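To illustrate what these two settings do, here is a small pure-Python sketch of temperature scaling followed by top_p (nucleus) filtering over a toy list of logits. It is not the code that produced the dataset, which lives in the linked repository and presumably relies on an inference library; the function names `top_p_filter` and `sample` and the example logits are hypothetical.

```python
import math
import random


def top_p_filter(logits, temperature=0.95, top_p=0.9):
    """Return {token_index: probability} after temperature and top_p filtering."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize the surviving tokens so they sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, temperature=0.95, top_p=0.9, rng=random):
    """Draw one token index from the filtered distribution."""
    dist = top_p_filter(logits, temperature, top_p)
    indices = list(dist)
    weights = [dist[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]
```

Because no seed was set during generation, rerunning the pipeline would not reproduce the same dialogs. In this sketch, passing `rng=random.Random(0)` would make the draws repeatable.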

The code was run on a Lenovo server with an NVIDIA A100 GPU kindly sponsored by NVIDIA and Arrow Denmark through Danish Data Science Community.

Generating the dataset took just under 5 days. According to codecarbon, the process consumed a total of 61 kWh and emitted 9.2 kgCO2e, which corresponds to 0.000061 kWh and 0.0092 gCO2e per sample. The energy consumption was distributed between CPU, GPU and RAM as follows:

CPU: 4.96 kWh

GPU: 34 kWh

RAM: 22 kWh
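The per-sample figures and the split between components follow directly from the totals above. The short calculation below reproduces them; the variable names are illustrative, and the inputs are the rounded values reported here, not raw codecarbon output.

```python
# Totals as reported above (rounded values).
TOTAL_KWH = 61.0
TOTAL_KG_CO2E = 9.2
N_SAMPLES = 1_000_000
COMPONENT_KWH = {"CPU": 4.96, "GPU": 34.0, "RAM": 22.0}

# Per-sample footprint.
kwh_per_sample = TOTAL_KWH / N_SAMPLES
g_co2e_per_sample = TOTAL_KG_CO2E * 1000 / N_SAMPLES  # kg -> g

# Share of energy per hardware component.
component_total = sum(COMPONENT_KWH.values())
shares = {name: kwh / component_total for name, kwh in COMPONENT_KWH.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The three components sum to 60.96 kWh, consistent with the rounded total of 61 kWh. The GPU accounts for roughly 56% of the energy, RAM for about 36%, and the CPU for about 8%.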

Data and resources

Syntetisk Dialog Opsummering Raw - parquet (format: http://publications.europa.eu/resource/authority/file-type/PARQUET)

Keywords

Additional info

URI	https://data.gov.dk/dataset/lang/48e946bf-93b1-42ab-bb56-f477045ea497
Landing page	https://huggingface.co/datasets/ThatsGroes/syntetisk-dialog-opsummering-raw
Harvested by Datavejviser	No
Publication date	06-01-2025 (DD-MM-YYYY)
Last modified date	06-01-2025 (DD-MM-YYYY)
Update frequency	never
Temporal coverage	/
Theme(s)	Government and public sector; Education, culture and sport
Access rights	public
Conforms to
Provenance statement
Documentation	https://huggingface.co/datasets/ThatsGroes/syntetisk-dialog-opsummering-raw/blob/main/README.md