SMULTRON – Stockholm Multilingual Treebank

Version 4.0

Introduction

SMULTRON is a parallel treebank developed by the Computational Linguistics Groups

The parallel treebank contains around 1500 sentences each in English, German and Swedish, organized in eight parallel treebanks; 500 of these sentences are also available in an additional, unaligned Spanish treebank. A further German-French parallel treebank contains around 1000 sentences per language. The sentences have been PoS-tagged and annotated with phrase structure trees, following the NEGRA/TIGER guidelines for German and Swedish, the Penn Treebank guidelines for English, and adapted versions of the guidelines used in the Le Monde Treebank for French and AnCora-Es for Spanish. The trees have been aligned at sentence, phrase and word level. Additionally, two of the German and Swedish monolingual treebanks (sophie and economy) contain lemma information. Additional morphological annotation for German has been kindly provided by Hinrich Schütze and Thomas Müller of the Center for Information and Language Processing (CIS) at the Ludwig-Maximilians-Universität in Munich. Additional treebanks from the SQUOIA project complete the collection: 11 parallel texts in German and Spanish, organized in 8 treebanks (about 4000 sentences), of which almost 2000 sentences are also available in Cuzco Quechua. The German and Spanish treebanks are aligned.

Corpora

SMULTRON consists of five subcorpora from different text genres. The first two subcorpora, each roughly 500 sentences long, are in English, German and Swedish. The third, also about 500 sentences long, is in the same languages plus Spanish. The fourth has around 1000 sentences in German and French. The fifth (and newest) has approx. 4000 sentences in German and Spanish, almost 2000 of which are also in Cuzco Quechua.

Sophie's World (sophie)

Jostein Gaarder. 1991. Sofies verden: Roman om filosofiens historie. Aschehoug.

All texts are translations from Norwegian.

Economy Texts (economy)

  • Press Release ABB, Quarterly Report Q2 2005
  • The Rainforest Alliance's Banana Certification Program
  • SEB annual report 2004 (Report of the directors: Corporate Governance)

DVD Manual (dvdman)

NAD L54 DVD Receiver Owner’s Manual, in 8 languages. NAD Electronics International, a Division of Lenbrook Industries Limited. 2007.

Swiss Alpine Club (alpine)

Die Alpen : Zeitschrift des Schweizer Alpen-Clubs / Les Alpes : revue du Club Alpin Suisse / Le Alpi : rivista del Club Alpino Svizzero / Las Alps : revista dal Club Alpin Svizzer. Schweizer Alpen-Club SAC. Bern.

  • 1990
    • Ein Tag in Üschenen. Hanspeter Sigrist (s1-s92) / Une journée à Üschenen (s1-s94)
    • Erlebnis Selbsanft-Nordgrat. Albert Schmidt (s93-s202) / Arête nord du Selbsanft (s95-s207)
    • Erinnerungen — Piz Buin und Piz Platta. Romedi Reinalter (s203-s235) / Souvenirs du Piz Buin et du Piz Platta (s208-s243)
    • Zweimal Rheinwaldhorn. Peter Donatsch (s236-s355) / Deux fois le Rheinwaldhorn (s244-s367)
    • Wyss Wändli — Weg der Erinnerungen. Willy Auf Der Maur (s356-s523) / Wyss Wändli, chemin des souvenirs (s368-s543)
  • 1991
    • In den Pfeilern des Brouillard. Michel Piola (s524-s570) / Aux Piliers du Brouillard (s544-s589)
    • Eine alpine Odyssee. Paul Mackrill (s571-s816) / Une odyssée alpine: la première traversée intégrale des 4000 suisses (s590-s837)
    • Rückblick auf Abenteuer im Montblanc-Gebiet. Willi Schmid (s817-s1060) / Un regard sur quelques aventures dans la région du Mont Blanc (s838-s1075)

SQUOIA MT Project (squoia)

Annual reports
  • focus: InfoResources
    • Focus No 1/08: La papa y el cambio climático / Kartoffel und Klimawandel / Papawan llaphi t'ikraywan
    • Focus No 1/07: La Revolución Ganadera: ¿Una oportunidad para los productores pobres? / Die Livestock Revolution - eine Chance für arme Bauern?
  • cosude: COSUDE / DEZA
    • Peru 2009-2011: Estrategia de Cooperación Perú 2009-2011 / Schweizerische Kooperationsstrategie für Peru 2009-2011
  • fundacion: Fundación Educación / Stiftung Education
    • Informe anual 2008 / Jahresbericht 2008
  • ahk: Cámara de Comercio e Industria Peruano-Alemana / Deutsch-Peruanische Industrie- und Handelskammer / Cámara de Comercio e Industria Peruano-Alemana nisqamanta
    • Informe anual 2008 - Memoria 1968-2008 / Jahresbericht 2008 - Memoria 1968-2008 - Jubiläumsausgabe 40 Jahre AHK Peru
    • Informe anual 2009 / Jahresbericht 2009
  • dw: Deutsche Welle DW-Akademie
    • Desarrollo y Medios - Informe Anual 2009 / Entwicklung und Medien - Jahresbericht 2009 / Academia Deutsche Wellep 2009 wata hatun willaykariynin
  • lai: Instituto Austriaco para América Latina / Lateinamerika-Institut (LAI)
    • Diálogo No. 24 (2005): 40 años del Instituto Austriaco para América Latina / 40 Jahre Lateinamerika-Institut
  • imf: Fondo Monetario Internacional (FMI) / Internationaler Währungsfonds (IWF)
    • Informe Anual 2010: Apoyar una recuperación mundial equilibrada / Jahresbericht 2010: Eine ausgewogene globale Erholung fördern

Some reports were translated from Spanish into Cuzco Quechua by César Morante Luna, Virginia Mamani Mamani, and Irma Álvarez Ccoscco.

Book chapters
  • gregorio: Casa Editorial del Centro de Estudios Rurales Andinos "Bartolome de las Casas" (CBC)
    • Gregorio Condori Mamani. Autobiografía en quechua y castellano (Ricardo Valderrama y Carmen Escalante), cap. I y IV / »Sie wollen nur, daß man ihnen dient...« Autobiographie (Karin Schmidt), Kapitel I und IV

Contents

Treebank Files

The treebank files are named according to the following convention:

treebanks/<lang>/smultron_<lang>_<corpus>.<extension>

Possible values are:

  • <lang>: de, en, es, fr, quz, sv
  • <corpus>: sophie, economy, dvdman, alpine, squoia_<subcorpus>
  • <subcorpus>: ahk, cosude, dw, focus, fundacion, gregorio, imf, lai
  • <extension>: pml, xml

There are 32 treebank files in total.
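
The naming convention above can be sketched as a small helper. This is only an illustration and not part of the distribution: the function name treebank_path is hypothetical, and the choice of extension (pml for Quechua, xml for all other languages) is an assumption based on the formats described in the Tools section. The helper builds a path string only; it does not check that a treebank exists for a given language and corpus.

```python
from pathlib import PurePosixPath

# Values listed in the naming convention above.
LANGS = {"de", "en", "es", "fr", "quz", "sv"}
CORPORA = {"sophie", "economy", "dvdman", "alpine"}
SQUOIA_SUBCORPORA = {
    "ahk", "cosude", "dw", "focus", "fundacion", "gregorio", "imf", "lai",
}


def treebank_path(lang, corpus, subcorpus=None):
    """Build treebanks/<lang>/smultron_<lang>_<corpus>.<extension>.

    Hypothetical helper; it does not verify that the file exists.
    """
    if lang not in LANGS:
        raise ValueError(f"unknown language code: {lang!r}")
    if corpus == "squoia":
        if subcorpus not in SQUOIA_SUBCORPORA:
            raise ValueError(f"unknown squoia subcorpus: {subcorpus!r}")
        corpus_name = f"squoia_{subcorpus}"
    elif corpus in CORPORA and subcorpus is None:
        corpus_name = corpus
    else:
        raise ValueError(f"unknown corpus: {corpus!r}")
    # Assumption: Quechua treebanks are PML, all others are TIGER-XML.
    extension = "pml" if lang == "quz" else "xml"
    filename = f"smultron_{lang}_{corpus_name}.{extension}"
    return str(PurePosixPath("treebanks") / lang / filename)


german_sophie = treebank_path("de", "sophie")
quechua_ahk = treebank_path("quz", "squoia", "ahk")
```

Under these assumptions, german_sophie is "treebanks/de/smultron_de_sophie.xml" and quechua_ahk is "treebanks/quz/smultron_quz_squoia_ahk.pml".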

Alignment Files

Alignment files are named according to the following convention:

alignments_<corpus>_<lang1>_<lang2>.xml

<corpus> is one of the corpus names from the previous section; <lang1> and <lang2> are the language codes of the aligned treebanks, in lexicographical order.

There are 17 alignment files in total.
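
The lexicographical ordering of the language codes can be sketched as follows. The function name alignment_filename is hypothetical. The corpus string is passed through unchanged, so the sketch makes no claim about how squoia subcorpora appear in alignment file names, nor about which language pairs are actually aligned.

```python
def alignment_filename(corpus, lang_a, lang_b):
    """Build alignments_<corpus>_<lang1>_<lang2>.xml.

    Hypothetical helper: the two language codes are sorted so that
    <lang1> precedes <lang2> lexicographically, whatever order the
    caller passes them in.
    """
    if lang_a == lang_b:
        raise ValueError("an alignment needs two different languages")
    lang1, lang2 = sorted((lang_a, lang_b))
    return f"alignments_{corpus}_{lang1}_{lang2}.xml"


alpine_name = alignment_filename("alpine", "fr", "de")
dvdman_name = alignment_filename("dvdman", "sv", "en")
```

Here alpine_name is "alignments_alpine_de_fr.xml" and dvdman_name is "alignments_dvdman_en_sv.xml".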

Tools

The alignment files can be loaded into a special tool for alignment, browsing and searching parallel treebanks: the TreeAligner, which is freely available from the University of Zurich.

The Quechua treebanks are dependency treebanks in Prague Markup Language (PML) format. They can be edited with TrEd.

All other treebanks are in TIGER-XML. This means that they can also be loaded into TIGERSearch for (monolingual) browsing and search. TIGERSearch is freely available from the University of Stuttgart.
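
Because TIGER-XML is plain XML, the treebanks can also be read with a generic XML parser. The sketch below uses Python's standard xml.etree.ElementTree module. The sample sentence is invented for illustration and is not taken from SMULTRON. The element and attribute names (s, t, word, pos) follow the general TIGER-XML layout; the actual files may carry further attributes, such as lemma or morphology, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Invented one-sentence sample in the general TIGER-XML layout.
SAMPLE = """<corpus id="demo">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="Sophie" pos="NE"/>
          <t id="s1_2" word="lacht" pos="VVFIN"/>
        </terminals>
        <nonterminals>
          <nt id="s1_500" cat="S">
            <edge label="SB" idref="s1_1"/>
            <edge label="HD" idref="s1_2"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>"""


def read_sentences(root):
    """Yield (sentence id, [(word, pos), ...]) for each <s> element."""
    for s in root.iter("s"):
        tokens = [(t.get("word"), t.get("pos")) for t in s.iter("t")]
        yield s.get("id"), tokens


root = ET.fromstring(SAMPLE)
sentences = list(read_sentences(root))
```

To read a treebank file instead of the inline sample, ET.parse(path).getroot() can replace ET.fromstring(SAMPLE).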

Licensing

SMULTRON is distributed free of charge under the Creative Commons Attribution-Noncommercial 2.5 Switzerland license.

Version History

4.0 – 2015-07-02
  • New SQUOIA corpus (squoia)
    • German and Spanish constituency trees (~4000 sentences each)
    • DE-ES alignments
    • Quechua dependency trees (~2000 sentences)
  • Updated DE corpora
    • added morphological information to all German treebanks of v3.0
  • Updated DE/EN/ES/FR treebanks
    • errors in syntactic structure fixed
    • fixed various other alignment errors
3.0 – 2010-12-24
  • New Swiss Alpine Club corpus (alpine)
    • German and French trees (~1000 sentences each)
    • DE-FR alignments
  • DVD manual corpus: new language
    • Spanish trees (~500 sentences)
2.0 – 2009-11-30
  • Updated Economy-DE/EN corpora
    • errors in syntactic structure fixed
    • added alignments for sentences that were previously impossible to annotate due to technical restrictions
    • fixed various other alignment errors
  • New DVD manual corpus
    • English, German and Swedish trees (~500 sentences each)
    • DE-EN and EN-SV alignments
  • updates for the latest TreeAligner version
1.1 – 2009-05-31
  • new German-Swedish alignments
  • various annotation errors fixed
  • compatibility updates for the new TreeAligner
1.0 – 2008-01-15
  • Initial release

People

Project Management

  • Martin Volk
  • Yvonne Samuelsson
  • Sofia Gustafson-Capková
  • Torsten Marek
  • Anne Göhring
  • Annette Rios

Annotators

  • Noëmi Aepli
  • Etienne Ailloud
  • Richard Castro Mamani
  • Alena Ciulla e Silva
  • Anne Göhring
  • Roger Gonzalo Segura
  • Sofia Gustafson-Capková
  • Christian Hardmeier
  • Nora Hollenstein
  • Natalya Ivantsova
  • Elisabet Joensson Steiner
  • Moira Kindlimann
  • Vladimir Kornev
  • Johan Lim
  • Claudia Lorencez Arreguín
  • Torsten Marek
  • Thomas Müller
  • Sara Orstadius
  • Jens Östlund
  • Thomas Petterson
  • Annette Rios
  • Yvonne Samuelsson
  • Livia Sabine Schulze
  • Hinrich Schütze
  • Martin Volk

Tools & Support

  • Joakim Lundborg
  • Torsten Marek
  • Maël Benjamin Mettler
  • Stephanie Odok
  • Sandra Roth
  • Rico Sennrich

Contact

For questions and comments please contact Martin Volk <volk@cl.uzh.ch>.