Version 4.0
SMULTRON is a parallel treebank developed by the Computational Linguistics Groups
The parallel treebank contains around 1500 sentences each in English, German and Swedish, organized in eight parallel treebanks; 500 of these sentences are also available in an additional unaligned Spanish treebank. It also includes a German and French parallel treebank of around 1000 sentences each.
The sentences have been PoS-tagged and annotated with phrase structure trees, following the NEGRA/TIGER guidelines for German and Swedish, the Penn Treebank guidelines for English, and adapted versions of the guidelines used in the Le Monde Treebank for French and in AnCora-Es for Spanish. The trees have been aligned on sentence, phrase and word level. Additionally, two of the German and Swedish monolingual treebanks (sophie and economy) contain lemma information. Additional morphological annotation for German has been kindly provided by Hinrich Schütze and Thomas Müller of the Center for Information and Language Processing (CIS) at the Ludwig-Maximilians-Universität in Munich.
Additional treebanks from the SQUOIA project complete this collection: 11 parallel texts in German and Spanish organized in 8 treebanks (about 4000 sentences), of which almost 2000 sentences are also available in Cuzco Quechua. The German-Spanish treebanks are aligned.
SMULTRON consists of five subcorpora from different text genres: the first two subcorpora, each roughly 500 sentences long, are in English, German and Swedish; the third subcorpus, also about 500 sentences long, is in these same languages plus Spanish; the fourth subcorpus has around 1000 sentences in German and French; the fifth (and newest) subcorpus has approx. 4000 sentences in German and Spanish, of which almost 2000 are also in Cuzco Quechua.
Jostein Gaarder. 1991. Sofies verden: Roman om filosofiens historie. Aschehoug.
All texts translated from Norwegian.
NAD L54 DVD Receiver Owner’s Manual, in 8 languages. NAD Electronics International, a Division of Lenbrook Industries Limited. 2007.
Die Alpen : Zeitschrift des Schweizer Alpen-Clubs / Les Alpes : revue du Club Alpin Suisse / Le Alpi : rivista del Club Alpino Svizzero / Las Alps : revaisa dal Club Alpin Svizzer. Schweizer Alpen-Club SAC. Bern.
- 1990
- Ein Tag in Üschenen. Hanspeter Sigrist (s1-s92) / Une journée à Üschenen (s1-s94)
- Erlebnis Selbsanft-Nordgrat. Albert Schmidt (s93-s202) / Arête nord du Selbsanft (s95-s207)
- Erinnerungen — Piz Buin und Piz Platta. Romedi Reinalter (s203-s235) / Souvenirs du Piz Buin et du Piz Platta (s208-s243)
- Zweimal Rheinwaldhorn. Peter Donatsch (s236-s355) / Deux fois le Rheinwaldhorn (s244-s367)
- Wyss Wändli — Weg der Erinnerungen. Willy Auf Der Maur (s356-s523) / Wyss Wändli, chemin des souvenirs (s368-s543)
- 1991
- In den Pfeilern des Brouillard. Michel Piola (s524-s570) / Aux Piliers du Brouillard (s544-s589)
- Eine alpine Odyssee. Paul Mackrill (s571-s816) / Une odyssée alpine: la première traversée intégrale des 4000 suisses (s590-s837)
- Rückblick auf Abenteuer im Montblanc-Gebiet. Willi Schmid (s817-s1060) / Un regard sur quelques aventures dans la région du Mont Blanc (s838-s1075)
Some reports translated by César Morante Luna, Virginia Mamani Mamani, and Irma Álvarez Ccoscco from Spanish into Cuzco Quechua.
The treebank files are named according to the following convention:
treebanks/<lang>/smultron_<lang>_<corpus>.<extension>
Possible values are:
There are 32 treebank files in total.
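The naming convention above can be sketched as a small helper. This is only an illustration: the function name treebank_path, the language code "de" and the extension "xml" are assumptions made for the example and are not taken from the list of possible values; "sophie" is one of the corpus names mentioned earlier.

```python
from pathlib import PurePosixPath


def treebank_path(lang: str, corpus: str, extension: str = "xml") -> PurePosixPath:
    """Build a path of the form treebanks/<lang>/smultron_<lang>_<corpus>.<extension>.

    The default extension "xml" is an assumption for this sketch.
    """
    return PurePosixPath("treebanks") / lang / f"smultron_{lang}_{corpus}.{extension}"


# Hypothetical example values: "de" as language code, "sophie" as corpus name.
example = treebank_path("de", "sophie")
print(example)
```

Note that the language code appears twice in the path, once as the directory name and once in the file name.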
Alignment files are named according to the following convention:
alignments_<corpus>_<lang1>_<lang2>.xml
<corpus> is one of the corpus names from the previous section, <lang1> and <lang2> are the language codes for the aligned treebanks, in lexicographical order.
There are 17 alignment files in total.
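The following sketch shows one way to derive an alignment file name from a corpus name and two language codes, sorting the codes so that they appear in lexicographical order as the convention requires. The function name alignment_filename and the codes "de" and "sv" are hypothetical and chosen only for illustration.

```python
def alignment_filename(corpus: str, lang_a: str, lang_b: str) -> str:
    """Build alignments_<corpus>_<lang1>_<lang2>.xml with the codes sorted."""
    if lang_a == lang_b:
        raise ValueError("an alignment needs two different languages")
    # Lexicographical order of the language codes, regardless of argument order.
    lang1, lang2 = sorted((lang_a, lang_b))
    return f"alignments_{corpus}_{lang1}_{lang2}.xml"


# Hypothetical example values; the argument order does not matter.
print(alignment_filename("sophie", "sv", "de"))
```

Sorting inside the helper means callers do not have to remember which language comes first in the file name.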
The alignment files can be loaded into a special tool for alignment, browsing and searching parallel treebanks: the TreeAligner, which is freely available from the University of Zurich.
The Quechua treebank is a dependency treebank in Prague Markup Language (PML) format. It can be edited with TrEd.
All other treebanks are in TIGER-XML. This means that they can also be loaded into TIGERSearch for (monolingual) browsing and search. TIGERSearch is freely available from the University of Stuttgart.
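Because TIGER-XML is plain XML, the treebanks can also be read with a general-purpose XML parser. The sketch below uses Python's standard library on a small hand-written sample. The sample sentence, its identifiers and its edge labels are invented for illustration, and the attribute names word and pos follow common TIGER-XML usage; the actual SMULTRON files may declare further attributes.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the general shape of TIGER-XML (not taken from SMULTRON).
SAMPLE = """<corpus id="demo">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="Sophie" pos="NNP"/>
          <t id="s1_2" word="sleeps" pos="VBZ"/>
        </terminals>
        <nonterminals>
          <nt id="s1_500" cat="S">
            <edge label="SBJ" idref="s1_1"/>
            <edge label="HD" idref="s1_2"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>"""


def read_sentences(root):
    """Map each sentence id to its list of (word, pos) pairs in document order."""
    sentences = {}
    for s in root.iter("s"):
        sentences[s.get("id")] = [(t.get("word"), t.get("pos")) for t in s.iter("t")]
    return sentences


root = ET.fromstring(SAMPLE)
sentences = read_sentences(root)
print(sentences)
```

For a file on disk, ET.parse(path).getroot() can replace ET.fromstring(SAMPLE); the path would follow the treebank naming convention described above.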
SMULTRON is distributed free of charge under the Creative Commons Attribution-Noncommercial 2.5 Switzerland license.
For questions and comments please contact Martin Volk <volk@cl.uzh.ch>.