DK-CLARIN Rapid Aligned Corpus 1993-2011 (da-en, da-de)

The aligned corpus consists of press releases from the European Commission Press Release Database (Rapid), harvested in 2009 and 2011 (http://europa.eu/rapid/search.htm). The corpus comprises 5330 + 2200 press releases (files) for each of the languages Danish, English and German, with approximately 5,000,000 words per language and 260,000 - 270,000 aligned sentences for the language pairs Danish - English and Danish - German. All documents were processed with Uplug (https://bitbucket.org/tiedemann/uplug/wiki/Home) and aligned with HunAlign. Files with more than 10 % negative alignments have been removed, as have all 0-alignments. The documents are available in txt format for each language and in tmx format for the aligned language pairs (da-en and da-de).
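As a rough illustration of how the aligned tmx files might be read, the sketch below extracts sentence pairs with Python's standard library. It assumes the files follow the usual TMX layout (tu elements containing tuv elements with an xml:lang attribute and a seg child). The function name iter_aligned_pairs, the two-letter language matching and the sample sentences are invented for this example and are not taken from the corpus or its documentation.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_aligned_pairs(source, src_lang="da", tgt_lang="en"):
    """Yield (source_text, target_text) for every <tu> that holds both languages.

    `source` may be a file path or a binary file object.
    Hypothetical helper: it assumes standard TMX structure.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Assumption: compare on the two-letter prefix ("da-DK" -> "da").
                segments[lang[:2]] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # free memory for large files


# Invented sample data, used only to show the expected file shape.
SAMPLE = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="da"><seg>Kommissionen godkender støtten.</seg></tuv>
      <tuv xml:lang="en"><seg>The Commission approves the aid.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="da"><seg>Kun dansk.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = list(iter_aligned_pairs(io.BytesIO(SAMPLE.encode("utf-8"))))
for da, en in pairs:
    print(da, "|||", en)
```

For a real file, a path such as the name of a downloaded da-en tmx file can be passed in place of the in-memory sample. Translation units that lack one of the two requested languages are skipped.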

Additional information

Field Value
Landing page http://hdl.handle.net/20.500.12115/30
Author Hansen, Dorte; Offersgaard, Lene
Metadata last updated February 26, 2021, 14:02 (UTC)
Metadata created May 14, 2020, 08:34 (UTC)
Coverage period end 2011
Coverage period start 1993
Subject Language and orthography; Education, culture and sport
GUID http://hdl.handle.net/20.500.12115/30
Identifier http://hdl.handle.net/20.500.12115/30
Contact email info@clarin.dk
Contact name CLARIN-DK, Centre for Language Technology, NorS, University of Copenhagen
URI http://hdl.handle.net/20.500.12115/30
Publication date 2011
Publisher name Centre for Language Technology, NorS, University of Copenhagen
Type Corpora
Documentation