Corpus80jours

Proper names in translation: aligned translations of Le Tour du monde en quatre-vingts jours (Jules Verne, 1872)

Corpus created as part of a contrastive analysis of proper names in translation (doctoral thesis, Lecuit, 2012)

Description
The corpus was created as part of a contrastive analysis of proper names in translation (Lecuit, 2012). It consists of a source text, Le Tour du monde en quatre-vingts jours (Jules Verne, 1872), in which the proper names, as well as the relational adjectives and nouns, have been annotated using the CasSys tool and the CasEN transducers developed by the computer science laboratory of the University of Tours (LI) (Friburger & Maurel, 2004).
The tags used to annotate the text are taken from the tag set published by the Text Encoding Initiative Consortium (TEI P5). The following elements are annotated in the source text (in French):
• Proper names (3342 items):
o <name type="person">[human]</name> (1856 items)
o <name type="animal">[animal]</name> (8 items)
o <name type="org">[organisation]</name> (115 items)
o <name type="geographical">[natural geographical location]</name> (201 items)
o <name type="oronym">[traffic artery]</name> (63 items)
o <name type="building">[human construction]</name> (68 items)
o <name type="place">[administrative area, town]</name> (836 items)
o <name type="object">[product]</name> (5 items)
o <name type="vessel">[vessel]</name> (159 items)
o <name type="title" level="j">[newspaper]</name> (23 items)
o <name type="date">[historical period]</name> (3 items)
o <name type="event">[historical event]</name> (5 items)
• Relational nouns: <w type="relational noun">[relational noun]</w> (197 items)
• Relational adjectives: <w type="relational adjective">[relational adjective]</w> (161 items)
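The item counts above can be checked against the annotated file by tallying the type attribute of the name and w elements. The following Python sketch illustrates the idea on a small invented excerpt. The wrapper elements (text, p) and the sentences are assumptions made for illustration, not the actual layout of Corpus80JoursFrench.xml, and the code assumes the elements carry no XML namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented excerpt for illustration only: the <text>/<p> wrappers and the
# sentences are assumptions, not the real structure of the corpus file.
SAMPLE = """<text>
  <p><name type="person">Phileas Fogg</name> quitta <name type="place">Londres</name> à bord du <name type="vessel">Mongolia</name>.</p>
  <p><name type="person">Passepartout</name>, un <w type="relational noun">Parisien</w>, suivit son maître <w type="relational adjective">anglais</w>.</p>
</text>"""


def count_annotations(xml_text):
    """Return two Counters: <name> elements and <w> elements, keyed by type."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so nesting depth does not matter.
    names = Counter(el.get("type") for el in root.iter("name"))
    words = Counter(el.get("type") for el in root.iter("w"))
    return names, words


names, words = count_annotations(SAMPLE)
for kind, total in sorted(names.items()):
    print(f"name/{kind}: {total}")
for kind, total in sorted(words.items()):
    print(f"w/{kind}: {total}")
```

To run the same count on the real file, its content would be read (as UTF-8) and passed to count_annotations. If the file declares a namespace, the element names given to iter() would need to be qualified accordingly.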

The corpus also comprises three target texts: translations of the novel into English, German and Serbian (Latin alphabet).
It also comes with alignment files, which were created with the multilingual automatic aligner XAlign (developed by the Loria laboratory and integrated into the Unitex platform) and then corrected manually.
These files, which can be used with Unitex, allow bi-texts to be displayed in a window divided into two parts, each part showing one of the two versions of the text, aligned horizontally at the level of translation units (translation equivalents).

References
• About proper names and translation:
o Lecuit É., Maurel D., Vitas D. (2011). La traduction des noms propres : une étude en corpus. Translationes. 3:121-134.
o Lecuit É., Maurel D., Vitas D. (2011). Les noms propres se traduisent-ils ? Étude d’un corpus multilingue. Corpus. 10:201-218.
o Lecuit É. (2012). Les tribulations d'un nom propre en traduction. Étude contrastive du nom propre et de sa traduction à partir d’un corpus aligné de dix langues européennes. Thèse de doctorat de linguistique, Université François-Rabelais de Tours.
o Lecuit É., Maurel D., Vitas D. (2015). A Multilingual Corpus for the Study of Toponyms in Translation. In Schnabel-Le Corre B., Löfström J. Challenges in Synchronic Toponymy: Structure, Context and Use. Francke A. Verlag. 235-246.
• About transducer cascades:
o Friburger N., Maurel D. (2004). Finite-state transducer cascade to extract named entities in texts. Theoretical Computer Science. 313:94-104.
o Maurel D., Friburger N., Antoine J.-Y., Eshkol-Taravella I., Nouvel D. (2011). Cascades de transducteurs autour de la reconnaissance des entités nommées. Traitement automatique des langues. 52(1):69-96.

Origin of the resources
• Unitex (IGM, Université Paris-Est Marne-la-Vallée, Paumier, 2011)
• CasSys and CasEN (LI, Friburger and Maurel)
• XAlign (Loria, UMR 7503)

Nature of the data
Annotated (French part only) and aligned corpus, comprising the original novel and royalty-free translations.

Origin of the data

• Source text before annotation:
o (French) Le Tour du monde en 80 jours, Jules Verne (1872): http://abu.cnam.fr/
• Target texts:
o (German) Reise um die Erde in 80 Tagen (unknown date)
o (English) Around the World in Eighty Days (1873)
o (Serbian) Put oko sveta za 80 dana (1949)

Conditions of use
The corpus is covered by the Creative Commons CC-BY-NC-SA and LGPL-LR licences.

Use
The corpus is made up of five different kinds of files.
1. A PDF file containing the text aligned in the four languages
o Corpus80Jours.pdf
And, in UTF-8 encoding (without BOM):
2. An XML file containing the text of the novel, annotated as described above, but in which the angle brackets of the name and w tags have been replaced by their XML entity equivalents so that the document can be loaded into XAlign
o Corpus80JoursFrench_Xalign.xml
3. An XML file containing the text of the novel, annotated as described above, for use without XAlign
o Corpus80JoursFrench.xml
4. Three XML files, each containing one translation of the novel, in English, German and Serbian respectively:
o Corpus80JoursEnglish.xml
o Corpus80JoursGerman.xml
o Corpus80JoursSerbian.xml
5. Three XML files containing the bi-text alignments:
o Corpus80JoursFrenchEnglish.xml
o Corpus80JoursFrenchGerman.xml
o Corpus80JoursFrenchSerbian.xml
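The difference between files 2 and 3 above comes down to whether the name and w tags appear as real markup or as escaped text. The following Python sketch shows one way to convert between the two forms. It assumes that only the angle brackets of these two tags are escaped (as &lt; and &gt;), tolerates &quot; inside the tag as a precaution, and uses an invented sentence; the actual files may differ in detail.

```python
import re

# Matches an escaped opening or closing <name>/<w> tag, e.g.
# &lt;name type="person"&gt; or &lt;/w&gt;. Other escaped tags are ignored.
ESCAPED_TAG = re.compile(r"&lt;(/?(?:name|w)\b.*?)&gt;")
# Matches a real opening or closing <name>/<w> tag.
REAL_TAG = re.compile(r"<(/?(?:name|w)\b[^<>]*)>")


def restore_tags(escaped_text):
    """Turn escaped name/w tags back into real markup."""
    def repl(match):
        # Assumption: quotes may also have been escaped; undo that if so.
        return "<" + match.group(1).replace("&quot;", '"') + ">"
    return ESCAPED_TAG.sub(repl, escaped_text)


def escape_tags(annotated_text):
    """Replace the angle brackets of name/w tags by XML entities."""
    return REAL_TAG.sub(r"&lt;\1&gt;", annotated_text)


# Invented example sentence, for illustration only.
escaped = ('Mr. &lt;name type="person"&gt;Fogg&lt;/name&gt; est '
           '&lt;w type="relational adjective"&gt;anglais&lt;/w&gt;.')
restored = restore_tags(escaped)
print(restored)
```

In the escaped form an XML parser sees the annotations as ordinary character data, which is presumably why that version can be loaded into XAlign without the annotation tags interfering with the alignment.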

The alignments can be used with Unitex. To do so, the files should be saved beforehand in the Unitex directory, as follows:
• For the four text files (the XAlign version of the French text and the three translations), respectively: Unitex/English/Corpus/Corpus80JoursEnglish.xml, Unitex/German/Corpus/Corpus80JoursGerman.xml, Unitex/French/Corpus/Corpus80JoursFrenchXalign.xml and Unitex/Serbian/Corpus/Corpus80JoursSerbian.xml.
• For the three alignment files: the Unitex/Xalign directory.
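The placement described above can also be scripted. The following Python sketch copies the downloaded files into the expected sub-directories. The function names, the download_dir and unitex_dir arguments and the default French file name are assumptions made for illustration: the file list spells the French file Corpus80JoursFrench_Xalign.xml, so the name is left as a parameter to be adjusted to the actual download.

```python
import shutil
from pathlib import Path

# Translation files, keyed by the Unitex language directory they belong to.
TRANSLATIONS = {
    "English": "Corpus80JoursEnglish.xml",
    "German": "Corpus80JoursGerman.xml",
    "Serbian": "Corpus80JoursSerbian.xml",
}
# Bi-text alignment files, all stored in the Xalign directory.
ALIGNMENTS = [
    "Corpus80JoursFrenchEnglish.xml",
    "Corpus80JoursFrenchGerman.xml",
    "Corpus80JoursFrenchSerbian.xml",
]


def plan_destinations(unitex_dir, french_name="Corpus80JoursFrench_Xalign.xml"):
    """Map each file name to its target path inside the Unitex directory."""
    unitex_dir = Path(unitex_dir)
    # french_name is an assumption; change it if your copy is named differently.
    plan = {french_name: unitex_dir / "French" / "Corpus" / french_name}
    for language, name in TRANSLATIONS.items():
        plan[name] = unitex_dir / language / "Corpus" / name
    for name in ALIGNMENTS:
        plan[name] = unitex_dir / "Xalign" / name
    return plan


def install(download_dir, unitex_dir, french_name="Corpus80JoursFrench_Xalign.xml"):
    """Copy the corpus files from download_dir into the Unitex directory."""
    download_dir = Path(download_dir)
    plan = plan_destinations(unitex_dir, french_name)
    for name, target in plan.items():
        # Create the language or Xalign directory if it does not exist yet.
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(download_dir / name, target)
    return plan
```

A call such as install("downloads", "Unitex") (both paths hypothetical) would then place the seven files; a missing source file raises FileNotFoundError rather than being skipped silently.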

Download
To download the PDF file you have to accept the Creative Commons CC-BY-NC-SA license.

Click here: Download the Corpus80jours PDF file (1/11/2016).
To download the alignment files you have to accept the LGPL-LR license.
Click here: Download the Corpus80jours alignment files (1/11/2016).