WikiMatrix

135 million parallel sentences (1,620 language pairs, 85 languages) from Wikipedia.

License: The mined data is distributed under the Creative Commons Attribution-ShareAlike license.

Please cite reference [1] if you use this data.

References:

[1] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia, arXiv, July 11 2019.

[2] Mikel Artetxe and Holger Schwenk, Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings, arXiv, Nov 3 2018.

[3] Mikel Artetxe and Holger Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, arXiv, Dec 26 2018.

[4] Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan and Graham Neubig, When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation? NAACL, pages 529-535, 2018.

Data and Distribution(s)

Additional info

Field                  Value
Landing page           https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix
Metadata last updated  December 12, 2022, 09:14 (UTC)
Metadata created       May 13, 2020, 15:23 (UTC)
Topic                  Language and orthography; Education, culture and sport
GUID                   https://data.gov.dk/dataset/lang/752bf522-baca-4ff3-80af-6b7e3ea632bd
Contact name           Facebook Research
URI                    https://data.gov.dk/dataset/lang/752bf522-baca-4ff3-80af-6b7e3ea632bd
Publisher name         Facebook Research
Type                   Corpora