Syntetisk dialog opsummering raw

Thanks to NVIDIA and Arrow Denmark for sponsoring the compute needed to generate this dataset.

This dataset consists of 1,000,000 synthetic dialogs in Danish and a summary of each dialog, generated with google/gemma-2-27b-it.

The purpose of the dataset is to fine-tune small language models to summarize dialogs. With minor adjustments it may also be used 1) to train an LLM to restore or improve speaker diarization, 2) to train a classifier that assigns dialogs to topics, 3) as part of the training data for a Danish embedding model.

The dialogs cover almost 21,000 different topics from ThatsGroes/dialog-topics and ThatsGroes/wiki_views. For the latter, the "article" column is considered a conversation topic.

In addition, a number of hand-crafted customer service topics were added.

The code that generated this dataset can be found in my GitHub repository dialog-opsummering.

During generation, the temperature was set to 0.95 and top_p to 0.9. No random seed was set.
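To illustrate what these two sampling parameters do, here is a minimal pure-Python sketch of temperature scaling followed by top-p (nucleus) filtering. It is not the code used to generate the dataset; the function names `nucleus_distribution` and `sample_token` and the example logits are hypothetical, chosen only for illustration.

```python
import math
import random


def nucleus_distribution(logits, temperature=0.95, top_p=0.9):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    # Temperature scaling: divide the logits by the temperature before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / cumulative for i in kept}


def sample_token(logits, temperature=0.95, top_p=0.9, rng=random):
    """Draw one token index from the filtered distribution (unseeded by default)."""
    dist = nucleus_distribution(logits, temperature, top_p)
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical logits for a four-token vocabulary.
example_logits = [2.0, 1.0, 0.5, -1.0]
print(nucleus_distribution(example_logits))
print(sample_token(example_logits))
```

Because `sample_token` draws from the global `random` module without a seed, repeated runs can return different tokens, which mirrors the unseeded generation described above.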

The code was run on a Lenovo server with an NVIDIA A100 GPU kindly sponsored by NVIDIA and Arrow Denmark through the Danish Data Science Community.

Generating the dataset took just under 5 days. According to codecarbon, the process consumed a total of 61 kWh and emitted 9.2 kgCO2e, which corresponds to 0.000061 kWh and 0.0092 gCO2e per sample. Here's the energy consumption distribution between GPU, CPU and RAM:

CPU: 4.96 kWh

GPU: 34 kWh

RAM: 22 kWh
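The per-sample figures and the split between components can be checked with a few lines of arithmetic. This is a small sketch that uses only the numbers reported above; the variable names are my own and nothing here is read from the codecarbon output.

```python
# Totals reported above.
TOTAL_KWH = 61.0
TOTAL_KG_CO2E = 9.2
N_SAMPLES = 1_000_000

# Per-component energy use in kWh, as listed above.
COMPONENTS_KWH = {"CPU": 4.96, "GPU": 34.0, "RAM": 22.0}

# Per-sample energy (kWh) and emissions (grams of CO2e).
kwh_per_sample = TOTAL_KWH / N_SAMPLES
g_co2e_per_sample = TOTAL_KG_CO2E * 1000 / N_SAMPLES

# The component figures should add up to roughly the reported total.
component_sum = sum(COMPONENTS_KWH.values())

# Share of the summed energy attributed to each component.
shares = {name: kwh / component_sum for name, kwh in COMPONENTS_KWH.items()}

print(f"kWh per sample: {kwh_per_sample:.6f}")
print(f"gCO2e per sample: {g_co2e_per_sample:.4f}")
print(f"Sum of components: {component_sum:.2f} kWh")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The components sum to 60.96 kWh, which matches the reported total of 61 kWh after rounding, and the GPU accounts for a little over half of the energy used.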

Additional info

URI: https://data.gov.dk/dataset/lang/48e946bf-93b1-42ab-bb56-f477045ea497
Landing page: https://huggingface.co/datasets/ThatsGroes/syntetisk-dialog-opsummering-raw
Harvested by Datavejviser: No
Publication date: 06-01-2025
Last modified date: 06-01-2025
Update frequency: never
Coverage period: /
Topic(s):
  • Government and the public sector
  • Education, culture and sport
Access rights: public
Conforms to:
Provenance statement:
Documentation: