-
The Norwegian Colossal Corpus (NCC) is a collection of multiple smaller Norwegian corpora suitable for training large language models. We have done extensive cleaning on the...
- JSON
-
The Stortinget Speech Corpus (SSC) is a 5000+ hour speech dataset for weakly supervised ASR, created from audio and aligned proceedings text from Stortinget, the Norwegian...
- JSONL
-
Contents of https://laegemiddelstyrelsen.dk were crawled, aligned at document and sentence level, and converted into a parallel corpus. Contains 22699 translation units between...
- TMX
-
Bilingual (EN-DA) corpus acquired from the website (https://ec.europa.eu/*coronavirus-response) of the EU portal (20th May 2020). Contains 2803 translation units (DA-EN).
- TMX
-
Bilingual (EN-DA) corpus acquired from the website (https://eur-lex.europa.eu/legal-content) of the EU portal (9th July 2020). Contains 21238 translation units (DA-EN).
- TMX
-
Bilingual (EN-DA) corpus acquired from the website (https://www.europarl.europa.eu/) of the European Parliament (9th May 2020). Contains 633 translation units (DA-EN).
- TMX
-
Bilingual (EN-DA) corpus acquired from the website (https://ec.europa.eu/commission/presscorner/) of the EU portal (8th July 2020). Contains 6261 translation units (DA-EN).
- TMX
-
EN-DA bilingual corpus made from PDF documents from the European Medicines Agency (EMEA), https://www.ema.europa.eu (February 2020). Attribution details: This dataset has...
- TMX
-
Contents of https://eng.mst.dk/ and https://mst.dk/ were crawled, aligned at document and sentence level, and converted into a parallel corpus. This dataset has been created...
- TMX
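
Most of the parallel corpora above are distributed as TMX, an XML format in which each translation unit (`tu`) holds one language variant (`tuv`) per language, with the text in a `seg` element. As a minimal sketch of how such translation units might be read and counted, the snippet below parses a small in-memory sample using only the Python standard library. The sample sentences, the `SAMPLE_TMX` constant, and the `read_translation_units` helper are illustrative assumptions and are not taken from any of the listed datasets; real files may use different language codes (for example `en-GB`) or extra metadata.

```python
import xml.etree.ElementTree as ET

# Attribute name ElementTree uses for xml:lang (the predefined XML namespace).
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-unit sample; not taken from any dataset listed above.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Wash your hands.</seg></tuv>
      <tuv xml:lang="da"><seg>Vask dine hænder.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Keep your distance.</seg></tuv>
      <tuv xml:lang="da"><seg>Hold afstand.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_translation_units(tmx_text, src="en", tgt="da"):
    """Return (source, target) sentence pairs from a TMX document string."""
    # Parse from bytes so the XML encoding declaration is honoured.
    root = ET.fromstring(tmx_text.encode("utf-8"))
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the two-letter primary language code, e.g. "en-GB" -> "en".
                segs[lang[:2]] = "".join(seg.itertext()).strip()
        # Skip units that lack either side of the requested language pair.
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


pairs = read_translation_units(SAMPLE_TMX)
print(len(pairs), "translation units")
for en, da in pairs:
    print(en, "|", da)
```

For a file on disk, the same function could be called with the file's decoded contents. For corpora with tens of thousands of units, a streaming parser such as `ET.iterparse` may be preferable to loading the whole document at once.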