135 million parallel sentences (1620 language pairs, 85 languages) from Wikipedia.
License: The mined data is distributed under the Creative Commons Attribution-ShareAlike license.
Please cite reference [1] if you use this data.
References:
[1] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia, arXiv, July 11 2019.
[2] Mikel Artetxe and Holger Schwenk, Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings, arXiv, Nov 3 2018.
[3] Mikel Artetxe and Holger Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, arXiv, Dec 26 2018.
[4] Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan and Graham Neubig, When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation? NAACL, pages 529-535, 2018.