The DK-CLARIN JRC-Acquis Parallel Corpus (da, en)

The DK-CLARIN JRC-Acquis Parallel Corpus (da, en) is part of the JRC-Acquis multilingual parallel corpus. It contains documents from the Acquis Communautaire (AC), the total body of European Union (EU) law applicable in the EU Member States (see: https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis). Each document carries one or more Eurovoc class codes, added to the metadata by the European Commission. Each language corpus (English and Danish) contains approximately 20 million words. All texts are in XML TEI P5 format (TEIP5DKCLARIN-format), with tokenisation, POS tagging, sentence and paragraph segmentation, lemmatisation and, for Danish, termhood annotation. The annotations are stored in separate span groups external to the text. The corpus was collected and processed in work package 2.6 of the Danish CLARIN project (see http://dkclarin.ku.dk/english) by the Centre for Language Technology, University of Copenhagen. The aim of the Danish CLARIN consortium was to construct a Danish research infrastructure for the humanities, integrating written, spoken, and visual records into a coherent and systematic digital repository. The project ran from January 2008 until the end of 2010.
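Because the annotations are kept in span groups outside the text, a reader has to join each annotation layer back onto the tokens it points to. The following is a minimal sketch of that join using Python's standard library. The sample document, the `type` values on `spanGrp`, the use of a `from` attribute pointing at a token's `xml:id`, and the function name `read_standoff` are all assumptions made for illustration; the actual TEIP5DKCLARIN layout may differ, so check a real corpus file before relying on it.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature document: NOT copied from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p><s>
    <w xml:id="w1">Member</w>
    <w xml:id="w2">States</w>
  </s></p></body></text>
  <spanGrp type="pos">
    <span from="#w1">NN</span>
    <span from="#w2">NNS</span>
  </spanGrp>
  <spanGrp type="lemma">
    <span from="#w1">member</span>
    <span from="#w2">state</span>
  </spanGrp>
</TEI>"""


def read_standoff(xml_text):
    """Join stand-off span annotations onto the tokens they reference."""
    root = ET.fromstring(xml_text)

    # Index tokens by xml:id, keeping document order.
    tokens = {}
    for w in root.iter(f"{{{TEI_NS}}}w"):
        tokens[w.get(XML_ID)] = {"form": (w.text or "").strip()}

    # Each span group is treated as one annotation layer (assumed: named by @type).
    for grp in root.iter(f"{{{TEI_NS}}}spanGrp"):
        layer = grp.get("type")
        for span in grp.iter(f"{{{TEI_NS}}}span"):
            target = (span.get("from") or "").lstrip("#")
            if target in tokens:
                tokens[target][layer] = (span.text or "").strip()

    return list(tokens.values())


if __name__ == "__main__":
    for row in read_standoff(SAMPLE):
        print(row)
```

Keeping the layers separate in the file and merging them only at read time means a layer such as the Danish termhood annotation can be present or absent without changing how the base text is parsed.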

Creators: Hansen, Dorte Haltrup; Offersgaard, Lene

License: http://creativecommons.org/licenses/by/4.0/

Additional information

Field Value
Landing page https://repository.clarin.dk/repository/xmlui/handle/20.500.12115/29?show=full
Metadata last updated September 9, 2020, 08:33 (UTC)
Metadata created June 17, 2020, 14:06 (UTC)
Subject Education, culture and sport; Language and orthography
GUID http://hdl.handle.net/20.500.12115/29
Identifier http://hdl.handle.net/20.500.12115/29
Contact email info@clarin.dk
Contact name CLARIN-DK, Centre for Language Technology, NorS, University of Copenhagen
Language English
URI http://hdl.handle.net/20.500.12115/29
Publication date 2012/2014
Publisher name Centre for Language Technology, NorS, University of Copenhagen
Type Corpora
Documentation